Zhang Wenchao (張文超), Hu Yulan (胡玉蘭)
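Abstract: The growing volume of online information places higher demands on people's ability to extract useful information from it. To respond better to user needs, improve information-processing efficiency, and reduce labor costs, a full-text search engine platform is developed based on PyQt. The web information collection function is designed in a modular fashion; the collected information is processed and used to build an index database; the PageRank algorithm ranks the query results to implement the retriever; and a neural network applies a second-pass correction to the ranking based on users' click decisions. Finally, entering a query string in the interface quickly returns a ranked list of links, which better reflects users' interest in the search results and provides personalized service.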
Keywords: full-text search engine; web information collection; PageRank; PyQt
DOI: 10.11907/rjdk.181009
CLC number: TP319
Document code: A    Article ID: 1672-7800(2018)009-0132-04
Development of Full-Text Search Engine Platform Based on PyQt
ZHANG Wenchao, HU Yulan
(Institute of Information Science and Technology, Shenyang Ligong University, Shenyang 110159, China)
Abstract: As the volume of network information grows, people face higher demands on their ability to obtain useful information. To better respond to users' needs, improve the efficiency of information processing, and reduce labor costs, this work focuses on full-text search engine technology. The network information collection function is designed in a modular way, and the collected data are processed to build an index database. The PageRank algorithm is then used to rank query responses and implement the retriever, and the ranking is corrected a second time by a neural network based on users' click decisions. Finally, once the full-text search engine platform has been developed with PyQt, entering a query string in the interface quickly returns a sorted list of links, which better reflects users' interest in the search results and provides personalized service.
Key Words: full-text search engine; network information collection; PageRank; PyQt
0 Introduction
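With the rapid development of computer and network technology, the amount of information produced each day is growing explosively, and search engines emerged in response. By collecting, extracting, and organizing information resources on the Internet, search engines provide retrieval services to users. They have become an indispensable tool for accessing network resources and a key direction for researchers.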
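References [1] and [2] use the Object Exchange Model to extract the data associated with structured tags in a page and form a corresponding Web information model. However, because Web page structure is only a simple presentation of information, information extraction [3] based on these tags yields low precision and reliability. Therefore, web information collection is designed here in a modular fashion, and the text content of each page is preprocessed before an index database is built, implementing content-based information extraction; ……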
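To make the content-based idea above concrete (indexing the text of a page rather than relying on its tag structure), the following is a minimal sketch in Python. It is not the authors' implementation. The names `TextExtractor`, `page_text`, `tokenize`, `build_index`, and `search` are hypothetical, the tokenizer only handles ASCII letters and digits (Chinese text would need a word segmenter), and the index is a plain in-memory inverted index with no ranking.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of a page, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []       # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def page_text(html):
    """Preprocessing step: reduce an HTML page to its text content."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Assumed tokenizer: lowercase ASCII alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: term -> set of URLs containing it.

    `pages` maps a URL to its raw HTML (hypothetical input format).
    """
    index = defaultdict(set)
    for url, html in pages.items():
        for term in tokenize(page_text(html)):
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

In this sketch the word "engine" inside a `<script>` body would not be indexed, because only text content reaches the tokenizer. Ordering the matching URLs, for example by PageRank, would be a separate step applied to the set returned by `search`.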