Keywords: prompt learning; named entity recognition; natural language processing; low resource
CLC number: TP183  Document code: A  Article ID: 1671-6841(2025)05-0031-08
DOI: 10.13705/j.issn.1671-6841.2024040
Abstract: Prompt-based fine-tuning is a new direction for improving the performance of domain-specific named entity recognition (NER). However, existing methods face challenges such as the need for manual template construction, lengthy prompt information, and fixed prompt templates. To address these issues, a method combining prompt learning with expert knowledge was proposed for domain-specific NER. Firstly, a bootstrapping algorithm was introduced to identify potential entities automatically, and the string matching algorithm used to obtain unannotated entity types from the same context was improved to yield more prompt information templates. Next, expert knowledge from the domain ontology was introduced to address concerns about the reliability of the prompt information. At the same time, first-order predicate logic was used to improve the representation of the prompt information. Finally, experiments on a finance dataset and an information security dataset verified that the method effectively improves the performance of domain-specific NER.
Key words: prompt-based learning; named entity recognition; natural language processing; low resource
0 Introduction
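Named entity recognition (NER) aims to extract entities of various types from text, and its results feed more complex tasks such as relation extraction [1] and the construction of domain knowledge graphs [2-3]. Compared with general-domain NER, domain-specific NER often faces two problems: annotated domain data are scarce, and domain entities take more complex forms that are not limited to the nouns or noun phrases of the traditional NER definition.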
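Prompt-based fine-tuning has become a new paradigm in natural language processing [4]. By reformulating the downstream task and adding expert knowledge, it makes the input and output of the task fit the original language model, and thus achieves good results in few-shot scenarios. However, prompt-based fine-tuning has so far been applied mostly to text classification and text generation [5-7], and rarely to NER.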
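An analysis of the literature [8-12] shows that published NER methods based on prompt learning have the following drawbacks: 1) prompt templates must be constructed manually, which is labor-intensive and error-prone; 2) prompt information must be constructed for every word in the sequence [9-10], which, for long inputs, lengthens the sequence further and raises the computational complexity of the model; 3) prompt templates are relatively fixed [11-12] and perform poorly on entities of complex types.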
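The second drawback can be made concrete with a minimal sketch. The template string, the function names, and the example sentence below are hypothetical and chosen only for illustration; the cited methods each define their own prompts. The cost estimate assumes that self-attention cost grows with the square of the sequence length.

```python
# Hypothetical template: one filled copy is appended for every word.
TEMPLATE = "{word} is a [MASK] entity ."


def build_per_word_prompts(words):
    """Return the sentence followed by one filled template per word."""
    augmented = list(words)
    for word in words:
        augmented.extend(TEMPLATE.format(word=word).split())
    return augmented


def relative_attention_cost(original_len, augmented_len):
    """Ratio of pairwise attention interactions (assumed quadratic in length)."""
    return (augmented_len ** 2) / (original_len ** 2)


sentence = "The bank raised its benchmark lending rate".split()
augmented = build_per_word_prompts(sentence)
ratio = relative_attention_cost(len(sentence), len(augmented))
print(len(sentence), len(augmented), ratio)
```

With this six-token template, the seven-word sentence grows to 49 tokens. The input length scales with the template length times the number of words, so under the quadratic assumption the attention cost rises much faster than the sentence itself.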
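In fact, the construction of prompt templates is a key factor in the performance of prompt learning methods [13]. This paper therefore focuses on the automatic construction of prompt information and on its reliability. A number of studies have verified that adding expert knowledge greatly helps improve the reliability of a model on a dataset [14-15]. ……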