
Natural Language Semantic Construction Based on Cloud Database

2018-12-26
Computers, Materials & Continua, 2018, Issue 12

Suzhen Wang, Lu Zhang, Yanpiao Zhang, Jieli Sun, Chaoyi Pang, Gang Tian and Ning Cao

1 Hebei University of Economics and Business, Shijiazhuang, Hebei, 050061, China.

2 The Australian e-Health Research Centre, ICT Centre, CSIRO, Australia.

3 Shandong University of Science and Technology, Qingdao, Shandong, 266590, China.

4 University College Dublin, Belfield, Dublin 4, Ireland.

Abstract: Natural language semantic construction improves the machine's natural language comprehension ability and analytical skills. It is the basis for realizing information exchange in the intelligent cloud-computing environment. This paper proposes a natural language semantic construction method based on a cloud database, mainly comprising two parts: natural language cloud database construction and natural language semantic construction. The natural language cloud database is established on the CloudStack cloud-computing environment and is composed of a corpus, a thesaurus, a word vector library and an ontology knowledge base. In this part, we concentrate on the pretreatment of the corpus and the representation of the background knowledge ontology, and then put forward a TF-IDF and word vector distance based algorithm for duplicated webpages (TWDW), which raises the recognition efficiency of repeated web pages. The part on natural language semantic construction mainly introduces the dynamic process of semantic construction and proposes a mapping algorithm based on semantic similarity (MBSS), which is a bridge between the Predicate-Argument (PA) structure and the background knowledge ontology. Experiments show that, compared with the relevant algorithms, the precision and recall of both proposed algorithms are significantly improved. The work in this paper improves the understanding of natural language semantics, and provides effective data support for the natural language interaction function of cloud services.

Keywords: Natural language, cloud database, semantic construction.

1 Introduction

The Internet of Things and big data have promoted the rapid development of the artificial intelligence industry. Natural language processing is a major branch of artificial intelligence. It simulates and processes natural language using the research results of computer technology, such as analyzing, understanding or generating natural language, so that humans and machines can communicate naturally. With the development of mobile terminals and natural language processing technology, numerous applications based on natural language processing have emerged. The American company Nuance launched a virtual voice assistant, Nina, which gave users a highly perceptive interactive experience [Masterson (2012)]. In 2009, Google launched its voice search service, the industry's first combination of cloud technology and voice recognition technology [Breitbach (2010)]. In 2014, Microsoft Asia Internet Engineering Institute released Xiaoice, which introduced an emotional computing framework and realized simple human-computer natural interaction [Shum, He and Li (2016)]. The development of natural language technology is inseparable from the support of big data. Nowadays, intelligent interactive products receive more and more attention. However, these products face the same problem: their speech recognition is correct, but their understanding of natural language is not accurate, so they cannot respond well to natural language.

In recent years, research on natural language semantic construction has been increasing. For example, Archetypes [Beale (2002)] put forward a two-layer-model semantic construction method, which constructs an information semantic level and a knowledge semantic layer, and creates a general semantic model and a special model. In the literature [Misra, Sung, Lee et al. (2014)], a semantic analysis model is established, which defines energy functions based on natural language instructions and a coding independence hypothesis. The literature [Zhao, Li and Jiang (2016)] discloses a method for natural language intention understanding, using intention tagging, text vectorization and support vector machines to understand natural language instructions. He et al. [He, Deng, Gao et al. (2016); He, Deng, Wang et al. (2017)] proposed an improved model approach to grammatical evolution, which provides a new idea for semantic construction. At present, there remain problems in semantic construction research, such as poor portability and high data requirements.

Understanding the semantics of natural language is the key step toward intelligent language interaction. Therefore, from the perspective of semantic analysis, this paper proposes a method of Chinese natural language semantic construction based on a cloud database, and focuses on the corpus, ontology construction and semantic construction based on the cloud database. The contributions of this paper include:

(1) This paper proposes a natural language semantic construction method based on a cloud database;

(2) This paper proposes a TF-IDF and word vector distance based webpage de-duplication algorithm and a mapping algorithm based on semantic similarity;

(3) Based on the above algorithms, the corpus and ontology knowledge base are stored in HBase, while the thesaurus and word vector library are stored in MySQL;

(4) This paper uses a crawled data set for experimental tests, and the experimental results show that the proposed algorithms are effective.

2 Natural language cloud database construction

2.1 Natural language cloud database architecture

In this paper, a cloud database architecture for natural language is designed and proposed. As shown in Fig. 1, the architecture of the natural language cloud database is composed of two parts: the platform support layer and the database layer; the database layer is built on the platform support layer.

CloudStack is the platform support layer of the cloud database. It is a highly available and extensible open-source cloud computing platform. Based on CloudStack, cloud services can be created quickly and easily on existing infrastructure. The cloud computing platform provides basic support for data analysis and database construction [Lin, Wu, Lin et al. (2017); Jhaveri, Patel, Zhong et al. (2018)].

The database layer is mainly composed of four parts: the corpus, the thesaurus, the word vector library and the ontology knowledge base. The corpus and ontology knowledge base are built on HBase, while the thesaurus and word vector library are built on MySQL. The establishment of the corpus, thesaurus and word vector library provides a basis for the construction of the ontology knowledge base.

The construction technology of thesauri and word vector libraries is mature, so this section focuses on the construction of the corpus and the ontology knowledge base.

Figure 1: Architecture of the natural language cloud database

2.2 Corpus dynamic construction

A corpus can sort, count and classify language data. In 1964, Francis and Kučera established the first worldwide computer-readable corpus, called Brown [Miller, Vandome and Mcbrewster (2010)]. Nowadays, researchers pay more and more attention to semantic corpus construction. There are several English corpora, such as LOB, Bank of English and ACL/DCI [Gacitua, Sawyer and Rayson (2008); Greenbaum and Nelson (2010); Davies and Mark (2014)]. Commonly used Chinese corpora include the People's Daily annotated corpus, the ZW large general Chinese corpus system, and the modern Chinese grammar research corpus, etc. This section focuses on the corpus construction process and the pre-processing of the corpus.

2.2.1 Corpus construction process

As shown in Fig. 2, corpus construction mainly includes 7 steps [Yuan, Chen, Li et al. (2014); Jiang and Wang (2016); Haveliwala (2003)]: (1) data acquisition; (2) web link removal; (3) webpage cleaning; (4) webpage parsing; (5) duplicated webpage content removal; (6) corpus storage; (7) text vectorization.
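The seven steps above can be sketched as a minimal pipeline. The functions below are illustrative placeholders, not the authors' code; a real system would use a proper HTML parser and a crawler such as WebMagic for step (1).

```python
import re

def strip_links(html: str) -> str:
    # (2) web link removal: drop anchor tags and their text
    return re.sub(r"<a\b[^>]*>.*?</a>", "", html, flags=re.S)

def clean_page(html: str) -> str:
    # (3)-(4) cleaning and parsing: strip remaining tags, collapse whitespace
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def build_corpus(pages):
    # (1) `pages` are raw HTML strings already acquired by a crawler
    corpus = []
    for html in pages:
        text = clean_page(strip_links(html))
        if text:
            corpus.append(text)  # (6) storage (here: an in-memory list)
    return corpus

pages = ['<html><body><p>Natural language processing.</p>'
         '<a href="x">ad</a></body></html>']
print(build_corpus(pages))  # → ['Natural language processing.']
```

Steps (5) and (7), de-duplication and vectorization, are covered by the TWDW algorithm and Word2vec training discussed in the following sections.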

Figure 2: Corpus dynamic construction

2.2.2 TF-IDF and word vector distance based algorithm for duplicated webpages

Web content may be approximately or completely repeated even though the webpages have different URL addresses. According to the overall layout and contents of the webpage, duplication may occur in the following four cases: (1) Completely repeated pages: the overall layout and contents of the pages are exactly the same. (2) Content-repeated pages: the overall layouts of the pages are different, but the contents are the same. (3) Locally repeated pages: the overall layouts of the pages are the same, but the contents are different. (4) Partially repeated pages: the overall layouts of the pages are different, and some webpage contents are the same.

However, a web crawler cannot identify repeated content automatically, so an algorithm is needed to eliminate repeated web content. Several web de-duplication algorithms exist, such as the VSM algorithm [Huang (2016)] and the K-Shingle, Simhash and Minhash algorithms [Oprisa (2015)]. Based on TF-IDF and the word vector distance, we propose a new web content de-duplication algorithm (TWDW) [Wang (2017)], which is shown in Fig. 3.

It is described as follows:

(1) Using TF-IDF to get the keywords

Assume that there is a word w in document d, that w is not a stop word, and that its frequency of occurrence in the document is high; then w is assumed to be one of the keywords of document d. If the occurrence frequency of w in documents other than d is very low, then w is easy to distinguish, that is, the importance of w in document d can be evaluated [Chen (2010)]. The main ideas of the TF-IDF algorithm are as follows:

a) First, segment the document and remove the meaningless stop words.

b) Calculating the term frequency (TF) as follows:

TF(w, d) = count(w, d) / size(d)    (1)

where TF(w, d) is the term frequency, count(w, d) is the number of occurrences of word w in document d, and size(d) is the total number of words in document d.

c) Calculating the Inverse Document Frequency (IDF) as follows:

IDF(w, D) = log(n / docs(w, D))    (2)

where n represents the total number of documents in the entire document collection, and docs(w, D) represents the number of documents in the collection in which word w appears.

d) Calculating the product of TF and IDF as word w's TF-IDF value:

TF-IDF(w, d) = TF(w, d) × IDF(w, D)    (3)
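A minimal sketch of steps a)-d), assuming the plain TF and IDF definitions above (no smoothing term) and whitespace tokenization as a stand-in for real word segmentation:

```python
import math
from collections import Counter

def tf(w, doc):
    # TF(w, d) = count(w, d) / size(d)
    words = doc.split()
    return Counter(words)[w] / len(words)

def idf(w, docs):
    # IDF(w, D) = log(n / docs(w, D))
    n = len(docs)
    containing = sum(1 for d in docs if w in d.split())
    return math.log(n / containing) if containing else 0.0

def tf_idf(w, doc, docs):
    # step d): the product of TF and IDF
    return tf(w, doc) * idf(w, docs)

docs = ["natural language processing",
        "language model training",
        "image processing"]
# "natural" occurs in 1 of 3 documents: TF = 1/3, IDF = log(3)
print(round(tf_idf("natural", docs[0], docs), 4))  # → 0.3662
```

Words shared by many documents (e.g. "language" here) get a lower IDF and hence a lower TF-IDF weight, which is exactly why the measure picks out distinguishing keywords.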

Figure 3: TWDW flow chart

(2) Using Word2vec to extract the words that are similar to the keywords of the document

a) Use Word2vec to train the segmented words into a bin file.

b) Use the LoadModel method to load the bin file from the relevant path, and select the top-5 similar words for each keyword.

c) Judge repeated pages through word matching. If more than 3 keywords are similar, the web page is regarded as a repeat, and should not be included.
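Once the top-5 similar words for each keyword are available from the trained model, step c) reduces to simple set matching. In this sketch the `similar` dictionary is made-up stand-in data for a Word2vec model's output, not actual trained vectors:

```python
def is_duplicate(new_keywords, stored_keywords, similar, threshold=3):
    """Judge a page as repeated when more than `threshold` of its
    keywords match (exactly, or via their Word2vec nearest neighbours)
    the keywords of an already-stored page."""
    matches = 0
    for kw in new_keywords:
        # a keyword matches if it, or any of its similar words,
        # appears among the stored page's keywords
        candidates = {kw} | set(similar.get(kw, []))
        if candidates & set(stored_keywords):
            matches += 1
    return matches > threshold

# hypothetical similar-word lists, as a trained model might return
similar = {"car": ["automobile", "vehicle"], "speed": ["velocity"]}
stored = ["automobile", "velocity", "engine", "road", "fuel"]
new = ["car", "speed", "engine", "road", "fuel"]
print(is_duplicate(new, stored, similar))  # 5 matches > 3 → True
```

Matching through similar words rather than exact strings is what lets TWDW catch approximately repeated content, where a rewritten page swaps in synonyms.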

2.3 Ontology knowledge base construction

Ontology [Greenbaum and Nelson (2010)] originated as a philosophical concept; it is a formal and explicit statement of shared concepts. After its introduction into artificial intelligence in the early 1990s, ontology has been seen as a conceptual modeling tool that can describe an information system at the semantic and knowledge level [Wang, Wu and Ren (2014)]. At present, there are many research results based on semantic ontology in the field of natural language processing; the most prominent are WordNet and HowNet [Ren and Guo (2012)]. This paper discusses the construction principles of the ontology knowledge base and the representation method of the ontology.

2.3.1 Construction principle

In the 1990s, Gildea et al. [Gildea (2000)] proposed shallow semantic analysis, which is based on the "Predicate-Argument (PA) structure". Deep semantic analysis, built on shallow semantic analysis, implements a complete mapping from the PA structure to the semantic structure. The background knowledge ontology is the bridge between the PA structure and the semantic structure. The knowledge base develops continuously, and semantic construction that treats it as background knowledge is more feasible [Yuan (2012)]. Therefore, the ontology knowledge base is used as the background knowledge ontology of the semantic construction, and it should be designed according to the characteristics of the PA structure. It mainly satisfies three principles: (1) the mapping from PA structure to knowledge structure is convenient; (2) it covers almost all PA structures; (3) it is easy to optimize the expression of knowledge in the future.

2.3.2 Ontology description method

As shown in Fig. 4, a PA structure is composed of a predicate, arguments and argument semantic roles. The predicate is the core vocabulary of a sentence, expressing its main idea. The arguments modify and supplement the predicate, and the semantic roles mainly determine the semantic relationship between the predicate and the arguments; a semantic role may contain more than one argument.

The structure of the background knowledge ontology is similar to the PA structure; it is represented with the Event-Property (EP) structure [Liu (2013)]. In Fig. 5, E represents the event, P represents a property of the event, and Ele represents an element of the event. The background knowledge ontology includes an event ontology and an argument ontology. The event ontology corresponds to the predicate in the PA structure, describing verb concepts. There is a generic event class in the event ontology, and the other events are its subclasses. The argument ontology corresponds to the arguments in the PA structure, describing nominal concepts. According to the semantic characteristics of the PA structure, the argument ontology can be divided into subclasses such as time, place, personal pronoun and so on.
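The PA and EP structures described above can be captured with two small data classes. The field names and example values here are our own illustration, not a specification from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class PAStructure:
    # Predicate-Argument structure: a predicate plus arguments grouped
    # by semantic role (one role may hold several arguments)
    predicate: str
    roles: dict  # semantic role label -> list of arguments

@dataclass
class Event:
    # one node of the Event-Property (EP) ontology: an event E with
    # properties P and elements Ele
    name: str
    properties: dict = field(default_factory=dict)
    elements: list = field(default_factory=list)

pa = PAStructure("travel", {"agent": ["he"], "destination": ["Beijing"]})
ev = Event("travel", properties={"type": "motion"},
           elements=["agent", "destination", "time"])
print(pa.predicate == ev.name)  # → True
```

The structural parallel is visible directly: the predicate lines up with the event name, and each semantic role's arguments line up with event elements, which is what makes the mapping of Section 3.2 natural.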

Figure 4: PA structure

Figure 5: EP structure

3 Dynamic construction of natural language semantics

3.1 Semantic construction process

Dynamic natural language semantic construction is based on the ontology knowledge base. Fig. 6 presents the detailed process of semantic dynamic construction. First, syntactic analysis is used to obtain the syntactic structure of sentences. Secondly, according to the semantic role labeling method, the syntactic structure is transformed into a PA structure. Then, the PA structure is transformed into a semantic structure based on the background knowledge ontology, and finally user evaluation is carried out. This process uses unsupervised semantic role training to label a large-scale corpus, and implements supervised semantic role labeling for newly added text, based on the trained PA structures and the semantic annotation results of the text set. The process realizes dynamic incremental construction of natural language semantics, and effectively improves the efficiency of semantic extraction. This paper focuses on the establishment of the mapping between the PA structure and the background knowledge ontology.

Figure 6: Semantic dynamic construction

3.2 Establishment of mapping process

3.2.1 Mapping process

The establishment of the mapping relationship between the PA structure and the background knowledge ontology is important for the construction of natural language semantics [Liu (2013)]. The center of the PA structure is the predicate P, and the argument elements mainly modify and supplement the predicate. The "event" is the basic unit of the background knowledge ontology, and each event contains the corresponding relations of elements and properties. The background knowledge ontology structure is thus very similar to the PA structure. Therefore, a predicate in the PA structure can be mapped to an event in the background knowledge ontology, and then the arguments are mapped to the elements contained in the event. The mapping process between the PA structure and the EP structure is shown in Fig. 7.

Figure 7: Mapping relation

3.2.2 Mapping algorithm based on semantic similarity

Mapping between the PA structure and the background knowledge ontology is mainly divided into two steps [Liu (2013); Chen, Chen, Nagiza et al. (2014); Zhang, Yang, Xing et al. (2017)]: first, the semantic similarity between the predicate P and all of the events is calculated, which yields a matching set; second, all elements of the matching set are matched against the arguments corresponding to predicate P, to find the most matching event.

This paper optimizes and improves the mapping algorithm provided by [Liu (2013)], and proposes a mapping algorithm based on semantic similarity (MBSS) to construct the mapping relationship between the PA structure and the background knowledge ontology, which is shown in Fig. 8.

The main ideas are as follows:

(1) Set representation

The PA structure is represented as a two-tuple PA = <P, A>, where P is the predicate and A is the set of arguments grouped by semantic role.

(2) Calculating the matching event set of predicate P

An important part of establishing the mapping relation between the PA structure and the background knowledge ontology is to find the matching event set Ep of predicate P. This paper uses the vector space model to calculate the semantic similarity between predicate P and event E; if the similarity is greater than 0.5, event E is added to the event set Ep. The details are as follows:

a) Using the CBOW model to train word vectors on the large-scale corpus, obtaining the word vectors of predicate P and event E. The predicate word vector is P = {P1, P2, …, Pn}, and the event word vector is E = {E1, E2, …, Em}. The CBOW training objective maximizes the average log-probability of each word given its context window, as shown in Eq. (4):

(1/T) Σt log p(wt | wt−c, …, wt+c)    (4)

b) Computing the word vector distance d(P, E), the Euclidean distance between the two vectors, as shown in Eq. (5):

d(P, E) = sqrt(Σi (Pi − Ei)²)    (5)

c) Computing the semantic similarity sim(P, E) between predicate P and event E, as shown in Eq. (6). If sim(P, E) > 0.5, the event E is placed in the matching event set Ep, and the semantic similarity value between the event and the predicate is placed in the similarity set Ssim.

sim(P, E) = α / (α + d(P, E))    (6)

where α is an adjustable parameter.
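Steps b) and c) can be sketched as follows. Euclidean distance and the mapping sim = α / (α + d) are assumptions here (common choices consistent with a distance-based similarity with an adjustable α); the toy 2-dimensional vectors are illustrative only:

```python
import math

def distance(p, e):
    # word vector distance d(P, E): Euclidean distance (assumed form)
    return math.sqrt(sum((pi - ei) ** 2 for pi, ei in zip(p, e)))

def similarity(p, e, alpha=1.0):
    # sim(P, E) = alpha / (alpha + d(P, E)): maps distance into (0, 1],
    # equal vectors give similarity 1
    return alpha / (alpha + distance(p, e))

def matching_events(p_vec, events, threshold=0.5):
    # keep every event whose similarity to predicate P exceeds 0.5,
    # recording the similarity values in the set Ssim
    matched, s_sim = [], []
    for name, e_vec in events.items():
        s = similarity(p_vec, e_vec)
        if s > threshold:
            matched.append(name)
            s_sim.append(s)
    return matched, s_sim

events = {"travel": [0.9, 0.1], "eat": [0.0, 1.0]}
matched, s_sim = matching_events([1.0, 0.0], events)
print(matched)  # → ['travel']
```

A larger α flattens the curve, admitting more distant events into the matching set; the 0.5 threshold from the text then controls how permissive the first mapping step is.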

(3) Getting the most matching event emax of the predicate P

Find the most matching event emax by calculating the semantic similarity of the two word clusters. The matching rules [Liu (2013)] are shown in Tab. 1.

Table 1: Matching rules of arguments and elements

The calculation process is as follows:

a) Calculate the matching-argument ratio, that is, the proportion of all arguments under each semantic role that match an event element.

Match(ai) ∈ d denotes that ai is a matching argument of event element d.

b) Calculate the matching degree between the whole PA structure and the event-property structure. The calculation formula is as follows.
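Steps a) and b) can be sketched as below. The `match` predicate here is a toy exact-match rule standing in for the full matching rules of Tab. 1, and the candidate events are made-up examples:

```python
def matching_ratio(role_args, event_elements, match):
    # a) proportion of all arguments (across the semantic roles) that
    # match some element of the candidate event
    args = [a for args_of_role in role_args.values() for a in args_of_role]
    matched = sum(1 for a in args if match(a, event_elements))
    return matched / len(args) if args else 0.0

def best_event(role_args, candidates, match):
    # b) e_max = the candidate event with the highest matching degree
    return max(candidates,
               key=lambda ev: matching_ratio(role_args, candidates[ev], match))

# toy matching rule; the real rules come from Tab. 1
match = lambda arg, elements: arg in elements

role_args = {"agent": ["he"], "destination": ["Beijing"]}
candidates = {"travel": ["he", "Beijing", "time"],
              "eat": ["he", "food"]}
print(best_event(role_args, candidates, match))  # → travel
```

Here "travel" matches both arguments (ratio 1.0) while "eat" matches only one (ratio 0.5), so the argument-level check resolves ties that the predicate-level similarity of step (2) leaves open.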

Figure 8: MBSS flow chart

4 Experimental verification and analysis

This section provides a scientific assessment of the proposed TWDW and MBSS algorithms. The main criteria are precision and recall [Zhou (2012)]. A good algorithm has both high precision and high recall; however, these two standards constrain each other, and researchers can only take an appropriate approach to keep both at a relatively high level.
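Both criteria reduce to set arithmetic over the items an algorithm returns versus the items it should have returned; the duplicate-detection outcome below is a hypothetical example, not data from the experiments:

```python
def precision_recall(retrieved, relevant):
    # precision = |retrieved ∩ relevant| / |retrieved|
    # recall    = |retrieved ∩ relevant| / |relevant|
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# e.g. 4 pages flagged as duplicates, 5 actual duplicates, 3 flagged correctly
p, r = precision_recall({"d1", "d2", "d3", "x1"},
                        {"d1", "d2", "d3", "d4", "d5"})
print(p, r)  # → 0.75 0.6
```

The mutual constraint is visible here: flagging more pages can only raise recall, but each wrong flag lowers precision, which is why the experiments below report both numbers together.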

4.1 Experiment of TWDW algorithm

The experiment mainly includes three aspects: first, verifying the selection of the keyword extraction algorithm; second, verifying the selected number of repeated keywords; and then verifying the effectiveness of the webpage removal algorithm.

The experimental data sets were crawled from webpages by the WebMagic crawler, including: (a) 500 crawled texts, containing 50 texts with approximately duplicated content; (b) 500 crawled texts, including 25 texts with completely duplicated content and 25 texts with similar content.

4.1.1 Verifying the validity of the keyword selection algorithm

The commonly used keyword extraction algorithms mainly include the following three: the TextRank algorithm [Mihalcea (2004)], Latent Dirichlet Allocation [Blei, Ng and Jordan (2003)] and the TF-IDF statistical method. In this paper, the three algorithms are used to extract keywords from a section of text. The text is as follows:

Natural language processing is an important direction in the field of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between people and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics.

Table 2: Experimental results of keyword extraction

In this paper, the experimental results are arranged in descending order. The top 10 keywords obtained by each algorithm are shown in Tab. 2. The decimal numbers in the table indicate the corresponding weights of the keywords in each method.

Tab. 2 shows that the TF-IDF and TextRank algorithms are better than the LDA model in weight discrimination. Compared with the TextRank algorithm, TF-IDF is more in line with human judgment. Therefore, using the TF-IDF algorithm to extract keywords is more appropriate.

4.1.2 Verifying the selection number of repeated keywords

Using data set (a), when the matching number is 2, 3, and 4, the precision and recall of the algorithm are as shown in Tab. 3.

Table 3: Experimental results of matching numbers

The experiment in Tab. 3 shows that when the matching number is 3, the precision and recall are relatively high and stable, so it is appropriate to set the matching number to 3.

4.1.3 Verifying the effectiveness of webpages removal algorithm

Using data set (b) to compare the Simhash algorithm and the TWDW algorithm, their precision and recall are shown in Tab. 4.

Table 4: Experimental results of algorithm comparison

Tab. 4 shows that the TWDW algorithm performs obviously better in precision and recall than the Simhash algorithm. At the same time, a comparison between Tab. 3 and Tab. 4 shows that the precision and recall of the TWDW algorithm are higher when the texts are similar.

4.2 Experiment of MBSS algorithm

The experiment mainly includes two aspects: first, verifying the performance of the improved semantic similarity calculation method for predicates and events, and then verifying the effectiveness of the optimized algorithm.

Experimental data sets: (a) PA structures from unsupervised semantic role training on the corpus constructed in this paper; (b) the EP structure constructed in this paper; (c) SemLink, obtained from the network.

4.2.1 Verifying the performance of semantic similarity algorithm

The semantic similarity algorithm provided by MBSS (SS-MBSS) and the semantic similarity algorithm provided by [Liu (2013)] (SS-[Liu (2013)]) are used to calculate the matching event set of predicate P. As shown in Tab. 5, one of the experimental results is selected in this paper.

Table 5: Experimental results of semantic similarity calculation

Tab. 5 shows that the result of the SS-[Liu (2013)] algorithm is relatively rough, as it only determines whether events match predicates. The result of the SS-MBSS algorithm is more accurate, so the SS-MBSS algorithm is more reasonable than the SS-[Liu (2013)] algorithm.

4.2.2 Verifying the performance of the MBSS algorithm against the algorithm provided by the literature

The MBSS algorithm and the algorithm provided by [Liu (2013)] (A-[Liu (2013)]) are used to construct the mapping relationship between the PA structure and the background knowledge ontology, and the precision and recall are calculated with reference to SemLink. In this experiment, data set (b) and 500 PA structures from data set (a) are taken as an example, and the results are shown in Tab. 6.

Table 6: Experimental results of mapping algorithm comparison

As shown in Tab. 6, the precision and recall of the MBSS algorithm are significantly improved compared with the algorithm provided by [Liu (2013)]. Therefore, the MBSS algorithm is more accurate in constructing the mapping relationship.

5 Conclusions

Understanding semantics correctly is the basis for realizing natural language interaction. This paper proposes a method of natural language semantic construction based on a cloud database, and analyzes and optimizes the construction technology of the natural language cloud database and the semantic structure. Meanwhile, this paper proposes a TF-IDF and word vector distance based webpage de-duplication algorithm and a mapping algorithm based on semantic similarity, both of which are proved effective. The work of this paper improves the understanding of natural language semantics, and provides good data support for the natural language interaction function of cloud services. However, our work also has some shortcomings; for example, the semantic construction method proposed in this paper is more applicable to verb structures, which needs to be optimized in the future.

Acknowledgement: This paper is partially supported by the Natural Science Foundation of Hebei Province (No. F2015207009), the Hebei Higher Education Research Project (No. BJ2016019, QN2016179), the Research Project of Hebei University of Economics and Business (No. 2016KYZ05) and the Education Technology Research Foundation of the Ministry of Education (No. 2017A01020). At the same time, the paper is also supported by the National Natural Science Foundation of China under grant No. 61702305, the China Postdoctoral Science Foundation under grant No. 2017M622234, the Qingdao City Postdoctoral Researchers Applied Research Projects, the University Science and Technology Program of Shandong Province under grant No. J16LN08, and the Shandong Province Key Laboratory of Wisdom Mine Information Technology foundation under grant No. WMIT201601.
