LUO Jinshang, SHI Xin, WU Jie, and HOU Mengshu*
(1. Information Center, University of Electronic Science and Technology of China, Chengdu 611731; 2. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731)
Abstract Event detection (ED) is a fundamental task of event extraction, which aims to detect triggers in text and determine their event types. Most existing methods regard event detection as a sentence-level classification problem, ignoring the correlations between events in different sentences. A novel event detection framework, named document embedding networks combined with semantic space (DENSS), is proposed in this paper. Document-level information is utilized to alleviate semantic ambiguity and enhance contextual understanding. Specifically, the representations of event types and triggers are obtained through an off-the-shelf pre-trained model and a designed multi-level attention mechanism. Then the feature vectors of event types and triggers are mapped into a shared semantic space, where the distance represents the correlation between different events. Experimental results on the benchmark dataset demonstrate that our method outperforms most existing methods and justify the effectiveness of document-level information combined with a shared semantic space.
Key words BERT; document-level information; event detection; semantic space
Event detection (ED) is a crucial task of event extraction (EE), which aims to identify event triggers in text and classify them into corresponding event types. The event trigger is the word or phrase that most clearly indicates the existence of an event in a sentence. According to the automatic content extraction (ACE) 2005 dataset, which is widely applied to the ED task, there are 8 event types and 33 subtypes, such as "Attack", "Transport", "Meet", etc. Take the following sentences as examples:
S1: He has died of his wounds after being shot.
S2: An American tank fired on the Palestine hotel.
S3: Another veteran war correspondent is being fired for his controversial conduct in Iraq.
An ideal ED model is expected to recognize two events in S1: a "Die" event triggered by the trigger word "died" and an "Attack" event triggered by "shot".
The difficulty of the ED task lies in the diversity and ambiguity of natural language expression. On the one hand, a variety of expressions can indicate the same event type. In S1, "shot" triggers an "Attack" event, and "fired" triggers the same event type in S2. On the other hand, the same trigger can denote different events. In S3, "fired" can trigger an "Attack" event or an "End-Position" event. Because of this ambiguity, a traditional approach relying only on sentence-level information may mislabel "fired" as "Attack" based on the word "war". However, in the same document, other sentences such as "NBC is terminating freelancer reporter Peter Arnett for statements he made to the Iraqi media." could provide the clue that "fired" triggers an "End-Position" event. Up to 57% of the event triggers in the ACE 2005 dataset are ambiguous[1]. Thus, how to resolve the ambiguity of event triggers has become an important problem in the ED task.
ED is a booming and challenging task in NLP. The dominant approaches for ED adopt deep neural networks to learn effective features from the input sentences. Most existing methods either focus only on sentence-level context or ignore the correlations between events, such as semantic correlation information. Many methods[2-3] mainly exploit sentence-level features and lack a summary of the document. Sometimes sentence-level information is insufficient to resolve the ambiguity of an event trigger, such as the trigger "fired" in S3. Some document-level models have been proposed to leverage global context[4-6]. However, these methods extract features of the entire document, which are too coarse-grained for event classification. By processing context more effectively, the model's performance can be improved.
The semantic correlations between different events exist objectively and pervasively, and they are manifested in several aspects. First, different event types have some semantic relevance. For instance, compared with the "Transport" event, the "Attack" event and the "Injure" event are semantically closer. Subtypes belonging to the same parent event type also have certain semantic correlations. "Be-Born" and "Marry" belong to the same parent event type "Life", which reveals common features, and they are more likely to co-occur in the same document. Furthermore, different event triggers have semantic correlations within the same document, such as the triggers "shot" and "died" in S1. The events mentioned in the same document tend to be semantically coherent. As pointed out by Ref. [5], many events usually co-occur in the same document. According to the ACE 2005 dataset, the top 5 event types that co-occur with the "Attack" event in the same sentence are: Attack, Die, Transport, Injure, and Meet. Finally, an event trigger and its corresponding event type share similar semantics. The event type word carries the fundamental semantic information and reveals common features, while the event trigger word carries extended semantic information in a more specific context. If we replace the trigger word with its corresponding event type word, the semantics of the whole sentence will not change much. Thus, how to model the semantic correlation information between event types and event triggers becomes a challenge to be overcome.
Existing methods generally use the one-hot label, which classifies the event type with a 0/1 label. Despite its simplicity, it regards multiple events in the same document as independent, and therefore it can hardly represent the correlations between different event types accurately.
In this paper, we propose document embedding networks with a shared semantic space (DENSS) to address the aforementioned problems. To learn the event correlations, we use bidirectional encoder representations from transformers (BERT) to obtain event type representations and map them into a semantic space, where more relevant event types stay closer together. We apply BERT again to acquire the representation of each word with document-level and sentence-level information via gated attention, project the representation of each event trigger into the same semantic space, and choose the label of the closest event type.
In summary, the contributions of this paper are as follows: 1) We study the event correlation problem and propose a novel ED framework that utilizes BERT to capture document-level and sentence-level information. 2) We employ a shared semantic space to represent event types and event triggers, which minimizes the distance between each event trigger and its corresponding type. Experimental results on the ACE 2005 dataset verify the effectiveness of our approach.
The goal of ED consists of identifying event triggers (trigger identification) and classifying them into corresponding event types (trigger classification). According to the ACE 2005 dataset, an event is defined as a specific occurrence involving one or more participants. The event trigger is the main word or phrase that most clearly expresses the occurrence of an event. As shown in Table 1, the ACE 2005 dataset defines 8 event types and 33 subtypes. Each event subtype has its specific semantic information, and different event subtypes have certain semantic correlations.

Table 1 Some Event Types and Subtypes of the ACE 2005 Dataset

We formalize ED as a multi-label sequence tagging problem and assign a tag to each word to indicate whether it triggers a specific event. We adopt the "BIO" tagging schema, in which tags "B" and "I" represent the position of a word within a trigger; this handles triggers that contain multiple words, such as "take away" and "go to".
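As a concrete illustration (a hypothetical example, not drawn from the ACE 2005 data), the BIO schema labels the first word of a trigger with "B-" plus its subtype and any following trigger words with "I-":

```python
# Hypothetical BIO labeling of a sentence whose multi-word trigger
# "take away" is assumed to evoke a "Transport" event.
tokens = ["They", "will", "take", "away", "the", "prisoners"]
tags = ["O", "O", "B-Transport", "I-Transport", "O", "O"]

# With 33 subtypes, the B-/I- variants plus the non-trigger tag "O"
# yield 2 * 33 + 1 = 67 labels, matching the 67 event type
# representations described later.
assert len(tokens) == len(tags)
```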
Figure 1 describes the architecture of DENSS, which primarily involves the following four components: 1) event embedding, which learns correlations between event types through BERT; 2) word embedding, which exploits BERT and gated attention to obtain the semantic information of words; 3) trigger identification, which identifies the event triggers; 4) trigger classification, which classifies the event triggers into their corresponding types.
To enrich the contextual information of event type words, we replace each trigger word in a sentence with the corresponding event type word. For instance, sentence S1 is transformed into "He has die of his wounds after being attack", and sentence S3 is converted into "Another veteran war correspondent is being end-position for his controversial conduct in Iraq". Contextualized embeddings produced by pre-trained language models[7] have been proved capable of modeling context beyond the sentence boundary and improving performance on a variety of tasks. Pre-trained bidirectional transformer models such as BERT can better capture long-distance dependencies compared with recurrent neural network (RNN) architectures. These newly replaced sentences are fed into BERT, and the last layer's hidden vectors of BERT are taken as the word embeddings. Let $E_i$ be the event embedding corresponding to the $i$-th event type word. For an event type word that appears many times in the training sentences, we simply calculate the average of all its representations to obtain the final representation. The ACE 2005 dataset defines 33 subtypes. According to the "BIO" tagging schema, we finally obtain 67 representations of the event type words, $E=\{E_1, E_2, \cdots, E_y, \cdots, E_{67}\}$, and map the feature vectors $E_y$ into a shared semantic space.
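The following sketch illustrates this averaging step, assuming the HuggingFace transformers library as the off-the-shelf BERT implementation; the function and variable names are illustrative, not the authors' code:

```python
from collections import defaultdict

import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def event_type_embeddings(examples):
    """examples: (words, position, label) triples, where the trigger at
    `position` has already been replaced by its event type word."""
    sums, counts = defaultdict(float), defaultdict(int)
    for words, position, label in examples:
        enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**enc).last_hidden_state[0]  # last-layer vectors
        # mean-pool the wordpieces of the replaced event type word
        pieces = [i for i, w in enumerate(enc.word_ids()) if w == position]
        vec = hidden[pieces].mean(dim=0)
        sums[label] = sums[label] + vec
        counts[label] += 1
    # E_y: the average over all occurrences in the training sentences
    return {label: sums[label] / counts[label] for label in sums}
```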

Fig. 1 The Architecture of the DENSS Model
To give an intuitive illustration, the different event correlations are shown in Figure 2.

Fig. 2 Event Correlation
In this figure, a solid circle denotes an event type and its vector, and an empty circle denotes an event trigger and its vector.
1.4.1 Word-level Embedding
Given a document $d=\{s_1, s_2, \cdots, s_j, \cdots, s_m\}$, the $j$-th sentence can be represented as a token sequence $s_j=\{w_{j1}, w_{j2}, \cdots, w_{jk}, \cdots, w_{jn}\}$. Special tokens [CLS] and [SEP] are placed at the start and end of the sentence, as $\{[\mathrm{CLS}], w_{j1}, w_{j2}, \cdots, w_{jk}, \cdots, w_{jn}, [\mathrm{SEP}]\}$. BERT automatically creates token embeddings, segment embeddings, and position embeddings, and sums them as the input of the next layer. For each word $w_{jk}$, we select the feature vector from the last layer of BERT as the word embedding $v_{jk}$. The sentence $s_j$ is then represented as $\{v_{j1}, v_{j2}, \cdots, v_{jk}, \cdots, v_{jn}\}$. By taking the embedding of the token [CLS] as the sentence embedding, we simultaneously obtain the sentence embedding $v_{j0}$.
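A minimal sketch of this step, under the same transformers assumption as above; note that BERT tokenizes into wordpieces, so a convention is needed to pick one vector per original word (the paper does not specify one, so taking the first wordpiece is an assumption here):

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def encode_sentence(words):
    # The tokenizer adds [CLS] and [SEP] automatically.
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]  # (seq_len, 768)
    v_j0 = hidden[0]  # [CLS] vector, used as the sentence embedding
    # v_jk: first-wordpiece vector for each original word (an assumed
    # convention; the paper does not state how wordpieces are pooled)
    first = {}
    for pos, wid in enumerate(enc.word_ids()):
        if wid is not None and wid not in first:
            first[wid] = hidden[pos]
    v_jk = [first[i] for i in range(len(words))]
    return v_j0, v_jk
```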
1.4.2 Sentence-level Embedding


1.4.3 Document-level Embedding

1.4.4 Gated Fusion
Inspired by gated multi-level attention mechanisms[5], we apply a fusion gate to dynamically incorporate the sentence-level information $s_{jk}$ and the document-level information $d_j$ for the $k$-th word $w_{jk}$ in the $j$-th sentence $s_j$ of the document $d$. The fusion gate $g_k$ is designed to control how the information should be integrated, and is calculated by:
$$g_k = \sigma \left( W_g \left[ s_{jk}; d_j \right] + b_g \right)$$
where $W_g$ is the weight matrix, $b_g$ is the bias term, and $\sigma$ is the sigmoid function. Hence, the contextual representation of the word $w_{jk}$, carrying both sentence-level and document-level information, is calculated by:
$$c_{jk} = g_k \odot s_{jk} + \left(1 - g_k\right) \odot d_j$$
where $\odot$ denotes element-wise multiplication. We concatenate the contextual representation $c_{jk}$ and the word embedding $v_{jk}$ to acquire the final word representation $e_{jk}=[c_{jk}, v_{jk}]$.
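A minimal PyTorch sketch of the fusion gate and the final concatenation, assuming all vectors share BERT's 768-dimensional hidden size:

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Gated fusion of sentence-level s_jk and document-level d_j."""
    def __init__(self, hidden=768):
        super().__init__()
        self.linear = nn.Linear(2 * hidden, hidden)  # W_g and b_g

    def forward(self, s_jk, d_j, v_jk):
        g = torch.sigmoid(self.linear(torch.cat([s_jk, d_j], dim=-1)))
        c_jk = g * s_jk + (1 - g) * d_j          # gated contextual vector
        return torch.cat([c_jk, v_jk], dim=-1)   # e_jk = [c_jk, v_jk]
```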
We model the trigger identification task as a binary classification problem, annotating triggers with label 1 and other words with label 0. The final word representation $e_{jk}$ is fed into a binary classifier to decide whether the word is a trigger.
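The paper does not detail the classifier head, so the sketch below assumes the simplest choice, a single linear layer with a sigmoid over $e_{jk}$ (dimension $2 \times 768$ after concatenation):

```python
import torch
import torch.nn as nn

class TriggerIdentifier(nn.Module):
    """Binary trigger / non-trigger head over e_jk (an assumed design)."""
    def __init__(self, dim=2 * 768):
        super().__init__()
        self.fc = nn.Linear(dim, 1)

    def forward(self, e_jk):
        # probability that the word is an event trigger
        return torch.sigmoid(self.fc(e_jk)).squeeze(-1)
```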

We adopt the cross-entropy loss as the loss function for trigger identification and the hinge loss for trigger classification. The hinge loss, which is widely used for maximum-margin classification, aims to separate correct and incorrect predictions by a margin larger than a pre-defined constant. For each trigger $x$, we treat the corresponding event type $y$ as positive and the other types as negative. We construct the hinge ranking loss:
$$L = \sum_{i \in Y,\, i \neq y} \max \left( 0,\; b - \cos(e_x, E_y) + \cos(e_x, E_i) \right)$$
where $y$ is the corresponding event type of $x$, $Y$ is the event type set, $i$ is any other event type for $x$ from $Y$, and $b$ is the margin. The function $\cos$ calculates the cosine similarity between the feature vector $e_x$ of the trigger $x$ and the feature vector $E_y$ of the event type $y$.
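The loss and the nearest-type prediction rule can be sketched as follows, with `E` stacking the event type vectors row-wise (a direct reading of the formula above, not the authors' code):

```python
import torch
import torch.nn.functional as F

def hinge_ranking_loss(e_x, E, y, margin=0.1):
    """e_x: (dim,) trigger vector; E: (num_types, dim); y: gold type index."""
    sims = F.cosine_similarity(e_x.unsqueeze(0), E, dim=-1)  # cos(e_x, E_i)
    # max(0, b - cos(e_x, E_y) + cos(e_x, E_i)) over the negative types
    losses = torch.clamp(margin - sims[y] + sims, min=0.0).clone()
    losses[y] = 0.0  # the positive type contributes no loss
    return losses.sum()

def predict_type(e_x, E):
    # classification: choose the label of the closest event type
    sims = F.cosine_similarity(e_x.unsqueeze(0), E, dim=-1)
    return int(sims.argmax())
```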
We conduct experiments on the ACE 2005 dataset. For comparison, we use the same data split as previous works[2-3]: a test set of 40 documents, a development set of 30 documents, and a training set of the remaining 529 documents. We adopt the formal ACE evaluation criteria, precision (P), recall (R), and F-measure (F1), to evaluate the model.
Hyper-parameters are tuned on the development set. We employ the BERT-base model, which generates 768-dimensional word embeddings. We set the dimension of the hidden vector to 768, the dimension of the semantic space to 768, and the margin $b$ to 0.1. We adopt the Adam optimizer for training with a learning rate of $2 \times 10^{-5}$.
In order to evaluate our model, we compare it with a comprehensive set of baselines and representative models, including:
1) DMCNN builds the dynamic multi-pooling convolutional neural network to learn sentence-level features[2].
2) JRNN exploits the bidirectional RNN to capture sentence-level features for event extraction[3].
3) GCN-ED applies a graph convolutional network (GCN) to model the dependency tree for extracting event information[11].
4) DEEB-RNN utilizes document embedding and hierarchical supervised attention mechanism[4].
5) HBTNGMA uses hierarchical and bias tagging networks to detect multiple events[5].
6) PLMEE employs BERT to create labeled data for promoting event extraction[12].
7) DMBERT+Boot utilizes BERT to generate more training data for ED[13].
8) EE-GCN exploits syntactic structure and typed dependency label information to perform ED[14].
Experimental results are shown in Table 2. From the table, we can observe that our proposed DENSS model achieves the best F1 score for trigger classification among all the compared methods.

Table 2 Trigger Classification Performance (%) on the ACE 2005 Dataset
Compared with DMCNN and JRNN, our method achieves significantly better performance. The reason is that DMCNN and JRNN only extract sentence-level information, while our method exploits multi-level information, which indicates that document-level information is indeed beneficial to the ED task. In contrast to DEEB-RNN and HBTNGMA, our method gains a large improvement. This is because DEEB-RNN and HBTNGMA learn document-level information but do not capture rich semantic information, whereas our method applies the pre-trained language model BERT to acquire the semantic information of words and employs the semantic space to represent the semantic correlations between different event types. Compared with PLMEE and DMBERT+Boot, our method achieves more desirable performance. PLMEE and DMBERT+Boot use BERT to create training data and promote event extraction, whereas our method fuses multi-level information to represent word features with rich semantic information. Compared with GCN-ED and EE-GCN, our method is also superior. GCN-ED and EE-GCN adopt GCNs with syntactic information to capture event information, but the syntactic information is still limited to the sentence level. Our method learns the embedding of the document through hierarchical attention mechanisms, which indicates that multi-level semantic information is conducive to the ED task.
In this section, we examine the effectiveness of the crucial components of our DENSS model with an ablation study. We consider the following variants: 1) EE: to study whether the event embedding contributes to the performance, we substitute the one-hot label for the event embedding. As a result, the F1 score drops by 6.4% absolutely, which demonstrates that the event embedding is beneficial for representing the semantic correlations. 2) SATT: to measure the contribution of the sentence-level attention, we remove it. As can be seen from Table 3, the F1 score drops by 2.7%, which verifies that the sentence-level information provides important clues. 3) DATT: removing the document-level attention hurts the performance by 2.1%, which shows that the document-level information helps to enhance the performance. 4) GATE: when we average the sentence-level and document-level information instead of using the fusion gate, the F1 score decreases by 1.5%, which indicates that the fusion gate dynamically incorporates multi-level semantic information. 5) Bi-LSTM: when the Bi-LSTM is removed from the model, the F1 score declines by 1.8%, which again verifies the effectiveness of the document-level information.

Table 3 The Ablation Study of DENSS
From these ablations, we make the following observations: 1) All crucial components are beneficial to the DENSS model, as removing any component degrades the performance significantly. 2) Compared with the other variants, DENSS-EE, which substitutes the one-hot label for the event embedding, hurts the performance the most. We infer that the semantic correlations among event types can propagate more knowledge. 3) Compared with DENSS-DATT, DENSS-SATT shows greater performance degradation, which illustrates that the sentence-level information commonly provides more signals than the document-level information. 4) The sentence-level and document-level information are complementary in the feature representation, and the semantic correlation information is conducive to enhancing ED.
In this section, we visualize the role of the attention mechanism to validate whether the attention works as designed. Figure 3 shows an example of the scalar attention weight $\alpha$ learned by our model. In this case, "delivered" triggers a "Phone-Write" event. Our model captures the clue "couriers delivered the letters" and assigns it a large attention weight. The contextual information plays an important role in disambiguating "delivered": the words "couriers" and "letters" provide the evidence to predict that "delivered" triggers a "Phone-Write" event.

Fig. 3 Visualization for the Role of the Sentence-Level Attention Mechanism. The heat map expresses the contextual attention weight, which represents the relatedness of the corresponding word pair.
Figure 4 shows that the document-level information contributes to improving the performance. We observe that the sentences containing the triggers in Table 4 obtain greater attention weights than the others. The triggers "convicted", "killed", and "murdering" in the same document tend to be semantically coherent. This indicates that document-level attention can capture significant clues at the document level to alleviate semantic ambiguity.

Fig. 4 Visualization for the Role of the Document-Level Attention Mechanism. The heat map expresses the contextual attention weight, which represents the relatedness of the corresponding sentence pair.

Table 4 Example of the Document
ED is one of the important tasks in NLP, and many methods have been proposed for it. Earlier ED studies focused on feature-based methods[15-16], which depended on the quality of artificially designed features. Most recent works have concentrated on representation-based neural network methods, which automatically capture feature representations with neural networks. These methods can be roughly divided into two classes. One class improves ED through different learning techniques, such as CNN[2], RNN[3], GCN[11,14,17], and pre-trained models[7,12]. The other class enhances ED by introducing extra resources, such as document information[4-5], argument information[18], semantic information[9], and syntactic information[19-20].
Document information plays an important role in ED. Ref. [4] employed document embedding and a hierarchical supervised attention mechanism to enhance event detection. Ref. [5] utilized hierarchical and bias tagging networks to model document information. The attention mechanism widely used in NLP has also been applied to ED. Ref. [18] proposed to encode argument information via supervised attention mechanisms. MOGANED[21] improved GCN with aggregative attention to model multi-order syntactic representations.
In this work, we propose a novel approach that integrates document-level and sentence-level information to enhance the ED task. A hierarchical attention network is devised to automatically capture contextual information. Each event type has specific semantic information, and different event types have certain semantic correlations. We deploy a shared semantic space to represent the event types and event triggers, which minimizes the distance between each event trigger and its corresponding type so that trigger classification is more informative and precise. Experiments on the ACE 2005 dataset verify the effectiveness of the proposed method.