Joint Entity Linking with Deep Reinforcement Learning

02/01/2019 ∙ by Zheng Fang, et al. ∙ University of Technology Sydney 0

Entity linking is the task of aligning mentions to corresponding entities in a given knowledge base. Previous studies have highlighted the necessity for entity linking systems to capture the global coherence. However, there are two common weaknesses in previous global models. First, most of them calculate the pairwise scores between all candidate entities and select the most relevant group of entities as the final result. In this process, the consistency among wrong entities as well as that among right ones are involved, which may introduce noise data and increase the model complexity. Second, the cues of previously disambiguated entities, which could contribute to the disambiguation of the subsequent mentions, are usually ignored by previous models. To address these problems, we convert the global linking into a sequence decision problem and propose a reinforcement learning model which makes decisions from a global perspective. Our model makes full use of the previous referred entities and explores the long-term influence of current selection on subsequent decisions. We conduct experiments on different types of datasets, the results show that our model outperforms state-of-the-art systems and has better generalization performance.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 8

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

Entity Linking (EL), which is also called Entity Disambiguation (ED), is the task of mapping mentions in text to corresponding entities in a given knowledge Base (KB). This task is an important and challenging stage in text understanding because mentions are usually ambiguous, i.e., different named entities may share the same surface form and the same entity may have multiple aliases. EL is key for information retrieval (IE) and has many applications, such as knowledge base population (KBP), question answering (QA), etc.

Existing EL methods can be divided into two categories: local model and global model. Local models concern mainly on contextual words surrounding the mentions, where mentions are disambiguated independently. These methods are not work well when the context information is not rich enough. Global models take into account the topical coherence among the referred entities within the same document, where mentions are disambiguated jointly. Most of previous global models (Ganea and Hofmann, 2017; Titov and Le, 2018; Nguyen et al., 2016) calculate the pairwise scores between all candidate entities and select the most relevant group of entities. However, the consistency among wrong entities as well as that among right ones are involved, which not only increases the model complexity but also introduces some noises. For example, in Figure 1, there are three mentions ”France”, ”Croatia” and ”2018 World Cup”, and each mention has three candidate entities. Here, ”France” may refer to French Republic, France national basketball team or France national football team in KB. It is difficult to disambiguate using local models, due to the scarce common information in the contextual words of ”France” and the descriptions of its candidate entities. Besides, the topical coherence among the wrong entities related to basketball team (linked by an orange dashed line) may make the global models mistakenly refer ”France” to France national basketball team. So, how to solve these problems?

Figure 1. Illustration of mentions in the free text and their candidate entities in the knowledge base. Solid black lines point to the correct target entities corresponding to the mentions and to the descriptions of these correct target entities. Solid red lines indicate the consistency between correct target entities and the orange dashed lines denote the consistency between wrong candidate entities.

We note that, mentions in text usually have different disambiguation difficulty according to the quality of contextual information and the topical coherence. Intuitively, if we start with mentions that are easier to disambiguate and gain correct results, it will be effective to utilize information provided by previously referred entities to disambiguate subsequent mentions. In the above example, it is much easier to map ”2018 World Cup” to 2018 FIFA World Cup based on their common contextual words ”France”, ”Croatia”, ”4-2”. Then, it is obvious that ”France” and ”Croatia” should be referred to the national football team because football-related terms are mentioned many times in the description of 2018 FIFA World Cup.

Inspired by this intuition, we design the solution with three principles: (i) utilizing local features to rank the mentions in text and deal with them in a sequence manner; (ii) utilizing the information of previously referred entities for the subsequent entity disambiguation; (iii) making decisions from a global perspective to avoid the error propagation if the previous decision is wrong.

In order to achieve these aims, we consider global EL as a sequence decision problem and proposed a deep reinforcement learning (RL) based model, RLEL for short, which consists of three modules: Local Encoder, Global Encoder and Entity Selector. For each mention and its candidate entities, Local Encoder encodes the local features to obtain their latent vector representations. Then, the mentions are ranked according to their disambiguation difficulty, which is measured by the learned vector representations. In order to enforce global coherence between mentions, Global Encoder encodes the local representations of mention-entity pairs in a sequential manner via a LSTM network, which maintains a long-term memory on features of entities which has been selected in previous states. Entity Selector uses a policy network to choose the target entities from the candidate set. For a single disambiguation decision, the policy network not only considers the pairs of current mention-entity representations, but also concerns the features of referred entities in the previous states which is pursued by the Global Encoder. In this way, Entity Selector is able to take actions based on the current state and previous ones. When eliminating the ambiguity of all mentions in the sequence, delayed rewards are used to adjust its policy in order to gain an optimized global decision.

Deep RL model, which learns to directly optimize the overall evaluation metrics, works much better than models which learn with loss functions that just evaluate a particular single decision. By this property, RL has been successfully used in many NLP tasks, such as information retrieval

(Nogueira and Cho, 2017), dialogue system (Dhingra et al., 2017) and relation classification (Feng et al., 2018), etc. To the best of our knowledge, we are the first to design a RL model for global entity linking. And in this paper, our RL model is able to produce more accurate results by exploring the long-term influence of independent decisions and encoding the entities disambiguated in previous states.

In summary, the main contributions of our paper mainly include following aspects:

  • We are the first to consider EL as a sequence decision problem and innovatively utilize a deep reinforcement learning model in this task.

  • The proposed model takes into account both local context and global coherence. In the process of global disambiguation, we make full use of the previous selected entity information and make decisions from a global perspective.

  • We evaluate our model on several benchmark datasets and the experimental results showed that our model achieves significant improvements over the state-of-the-art methods.

Figure 2. The overall structure of our RLEL model. It contains three parts: Local Encoder, Global Encoder and Entity Selector. In this framework, denotes the concatenation of the mention context vector and one candidate entity vector . The policy network selects one entity from the candidate set, and denotes the concatenation of the mention context vector and the selected entity vector . represents the hidden status of , and it will be input into .

2. Methodology

The overall structure of our RLEL model is shown in Figure 2. The proposed framework mainly includes three parts: Local Encoder which encodes local features of mentions and their candidate entities, Global Encoder which encodes the global coherence of mentions in a sequence manner and Entity Selector which selects an entity from the candidate set. As the Entity Selector and the Global Encoder are correlated mutually, we train them jointly. Moreover, the Local Encoder as the basis of the entire framework will be independently trained before the joint training process starts. In the following, we will introduce the technical details of these modules.

2.1. Preliminaries

Before introducing our model, we firstly define the entity linking task. Formally, given a document with a set of mentions , each mention has a set of candidate entities . The task of entity linking is to map each mention to its corresponding correct target entity or return ”NIL” if there is not correct target entity in the knowledge base. Before selecting the target entity, we need to generate a certain number of candidate entities for model selection.

Inspired by the previous works (Pappu et al., 2017; Zwicklbauer et al., 2016; Phan et al., 2017)

, we use the mention’s redirect and disambiguation pages in Wikipedia to generate candidate sets. For those mentions without corresponding disambiguation pages, we use its n-grams to retrieve the candidates

(Phan et al., 2017). In most cases, the disambiguation page contains many entities, sometimes even hundreds. To optimize the model’s memory and avoid unnecessary calculations, the candidate sets need to be filtered (Cao et al., 2018; Ganea and Hofmann, 2017; Titov and Le, 2018)

. Here we utilize the XGBoost model

(Chen and Guestrin, 2016) as an entity ranker to reduce the size of candidate set. The features used in XGBoost can be divided into two aspects, the one is string similarity like the Jaro-Winkler distance between the entity title and the mention, the other is semantic similarity like the cosine distance between the mention context representation and the entity embedding. Furthermore, we also use the statistical features based on the pageview and hyperlinks in Wikipedia. Empirically, we get the pageview of the entity from the Wikipedia Tool Labs111The url of the website is: https://tools.wmflabs.org/pageviews/ which counts the number of visits on each entity page in Wikipedia. After ranking the candidate sets based on the above features, we take the top k scored entities as final candidate set for each mention.

2.2. Local Encoder

Given a mention and the corresponding candidate set , we aim to get their local representation based on the mention context and the candidate entity description. For each mention, we firstly select its surrounding words, and represent them as word embedding using a pre-trained lookup table (Mikolov et al., 2013)

. Then, we use Long Short-Term Memory (LSTM) networks to encode the contextual word sequence

as a fixed-size vector . The description of entity is encoded as in the same way. Apart from the description of entity, there are many other valuable information in the knowledge base. To make full use of these information, many researchers trained entity embeddings by combining the description, category, and relationship of entities. As shown in (Ganea and Hofmann, 2017), entity embeddings compress the semantic meaning of entities and drastically reduce the need for manually designed features or co-occurrence statistics. Therefore, we use the pre-trained entity embedding and concatenate it with the description vector to enrich the entity representation. The concatenation result is denoted by .

After getting , we concatenate it with

and then pass the concatenation result to a multilayer perceptron (MLP). The MLP outputs a scalar to represent the local similarity between the mention

and the candidate entity . The local similarity is calculated by the following equations:

(1)

Where indicates vector concatenation. With the purpose of distinguishing the correct target entity and wrong candidate entities when training the local encoder model, we utilize a hinge loss that ranks ground truth higher than others. The rank loss function is defined as follows:

(2)

When optimizing the objective function, we minimize the rank loss similar to (Ganea and Hofmann, 2017; Titov and Le, 2018). In this ranking model, a training instance is constructed by pairing a positive target entity with a negative entity . Where is a margin parameter and our purpose is to make the score of the positive target entity is at least a margin higher than that of negative candidate entity .

With the local encoder, we obtain the representation of mention context and candidate entities, which will be used as the input into the global encoder and entity selector. In addition, the similarity scores calculated by MLP will be utilized for ranking mentions in the global encoder.

2.3. Global Encoder

In the global encoder module, we aim to enforce the topical coherence among the mentions and their target entities. So, we use an LSTM network which is capable of maintaining the long-term memory to encode the ranked mention sequence. What we need to emphasize is that our global encoder just encode the mentions that have been disambiguated by the entity selector which is denoted as .

As mentioned above, the mentions should be sorted according to their contextual information and topical coherence. So, we firstly divide the adjacent mentions into a segment by the order they appear in the document based on the observation that the topical consistency attenuates along with the distance between the mentions. Then, we sort mentions in a segment based on the local similarity and place the mention that has a higher similarity value in the front of the sequence. In Equation 1, we define the local similarity of and its corresponding candidate entity . On this basis, we define as the the maximum local similarity between the and its candidate set . We use as criterion when sorting mentions. For instance, if then we place before . Under this circumstances, the mentions in the front positions may not be able to make better use of global consistency, but their target entities have a high degree of similarity to the context words, which allows them to be disambiguated without relying on additional information. In the end, previous selected target entity information is encoded by global encoder and the encoding result will be served as input to the entity selector.

Before using entity selector to choose target entities, we pre-trained the global LSTM network. During the training process, we input not only positive samples but also negative ones to the LSTM. By doing this, we can enhance the robustness of the network. In the global encoder module, we adopt the following cross entropy loss function to train the model.

(3)

Where represents the label of the candidate entity. If the candidate entity is correct , otherwise . indicates the output of our model. After pre-training the global encoder, we start using the entity selector to choose the target entity for each mention and encode these selections.

Figure 3.

The architecture of policy network. It is a feedforward neural network and the input consists of four parts: mention context representation, candidate entity representation, feature representation, and encoding of the previous decisions.

2.4. Entity Selector

In the entity selector module, we choose the target entity from candidate set based on the results of local and global encoder. In the process of sequence disambiguation, each selection result will have an impact on subsequent decisions. Therefore, we transform the choice of the target entity into a reinforcement learning problem and view the entity selector as an agent. In particular, the agent is designed as a policy network which can learn a stochastic policy and prevents the agent from getting stuck at an intermediate state (Xiong et al., 2017). Under the guidance of policy, the agent can decide which action (choosing the target entity from the candidate set)should be taken at each state, and receive a delay reward when all the selections are made. In the following part, we first describe the state, action and reward. Then, we detail how to select target entity via a policy network.

State

The result of entity selection is based on the current state information. For time , the state vector is generated as follows:

(4)

Where indicates vector concatenation. The and respectively denote the vector of and at time . For each mention, there are multiple candidate entities correspond to it. With the purpose of comparing the semantic relevance between the mention and each candidate entity at the same time, we copy multiple copies of the mention vector. Formally, we extend to and then combine it with . Since and are mainly to represent semantic information, we add feature vector to enrich lexical and statistical features. These features mainly include the popularity of the entity, the edit distance between the entity description and the mention context, the number of identical words in the entity description and the mention context etc. After getting these feature values, we combine them into a vector and add it to the current state. In addition, the global vector is also added to . As mentioned in global encoder module, is the output of global LSTM network at time , which encodes the mention context and target entity information from to . Thus, the state contains current information and previous decisions, while also covering the semantic representations and a variety of statistical features. Next, the concatenated vector will be fed into the policy network to generate action.

Action

According to the status at each time step, we take corresponding action. Specifically, we define the action at time step is to select the target entity for . The size of action space is the number of candidate entities for each mention, where indicates the position of the selected entity in the candidate entity list. Clearly, each action is a direct indicator of target entity selection in our model. After completing all the actions in the sequence we will get a delayed reward.

Reward

The agent takes the reward value as the feedback of its action and learns the policy based on it. Since current selection result has a long-term impact on subsequent decisions, we don’t give an immediate reward when taking an action. Instead, a delay reward is given by follows, which can reflect whether the action improves the overall performance or not.

(5)

where indicates whether the current action is correct or not. When the action is correct otherwise . Hence and respectively represent the number of correct and wrong actions from time t to the end of episode. Based on the above definition, our delayed reward can be used to guide the learning of the policy for entity linking.

Policy Network

After defining the state, action, and reward, our main challenge becomes to choose an action from the action space. To solve this problem, we sample the value of each action by a policy network . The structure of the policy network is shown in Figure 3. The input of the network is the current state, including the mention context representation, candidate entity representation, feature representation, and encoding of the previous decisions. We concatenate these representations and fed them into a multilayer perceptron, for each hidden layer, we generate the output by:

(6)

Where and are the parameters of the th hidden layer, through the activation function we get the

. After getting the output of the last hidden layer, we feed it into a softmax layer which generates the probability distribution of actions. The probability distribution is generated as follows:

(7)

Where the and are the parameters of the softmax layer. For each mention in the sequence, we will take action to select the target entity from its candidate set. After completing all decisions in the episode, each action will get an expected reward and our goal is to maximize the expected total rewards. Formally, the objective function is defined as:

(8)

Where is the state transfer function, indicates the probability of taking action under the state , is the expected reward of action at time step . According to REINFORCE policy gradient algorithm(Williams, 1992), we update the policy gradient by the way of equation 9.

(9)

As the global encoder and the entity selector are correlated mutually, we train them jointly after pre-training the two networks. The details of the joint learning are presented in Algorithm 1.

0:  Training data include multiple documents
0:  The target entity for mentions
1:  Initialize the policy network parameter , global LSTM network parameter ;
2:  for  in  do
3:     Generate the candidate set for each mention
4:     Divide the mentions in into multiple sequences ;
5:     for  in  do
6:        Rank the mentions in based on the local similarity;
7:        for  in  do
8:           Sample the target entity for with ;
9:           Input the and to global LSTM network;
10:        end for
11:         End of sampling, update parameters
12:        Compute delayed reward for each action;
13:        Update the parameter of policy network:  
14:     end for
15:     Update the parameter in the global LSTM network
16:  end for
Algorithm 1 The Policy Learning for Entity Selector

3. Experiment

In order to evaluate the effectiveness of our method, we train the RLEL model and validate it on a series of popular datasets that are also used by (Ganea and Hofmann, 2017; Titov and Le, 2018)

. To avoid overfitting with one dataset, we use both AIDA-Train and Wikipedia data in the training set. Furthermore, we compare the RLEL with some baseline methods, where our model achieves the state-of-the-art results. We implement our models in Tensorflow and run experiments on 4 Tesla V100 GPU.

3.1. Experiment Setup

Datasets

We conduct experiments on several different types of public datasets including news and encyclopedia corpus. The training set is AIDA-Train and Wikipedia datasets, where AIDA-Train contains 18448 mentions and Wikipedia contains 25995 mentions. In order to compare with the previous methods, we evaluate our model on AIDA-B and other datasets. These datasets are well-known and have been used for the evaluation of most entity linking systems. The statistics of the datasets are shown in Table 1.

  • AIDA-CoNLL (Hoffart et al., 2011) is annotated on Reuters news articles. It contains training (AIDA-Train), validation (AIDA-A) and test (AIDA-B) sets.

  • ACE2004 (Ratinov et al., 2011) is a subset of the ACE2004 Coreference documents.

  • MSNBC (Cucerzan, 2007) contains top two stories in the ten news categories(Politics, Business, Sports etc.)

  • AQUAINT (Milne and Witten, 2008) is a news corpus from the Xinhua News Service, the New York Times, and the Associated Press.

  • WNED-CWEB (Guo and Barbosa, 2018) is randomly picked from the FACC1 annotated ClueWeb 2012 dataset.

  • WNED-WIKI (Guo and Barbosa, 2018) is crawled from Wikipedia pages with its original hyperlink annotation.

  • OURSELF-WIKI is crawled by ourselves from Wikipedia pages.

Dataset Doc Num Mention Num Mentions Per Doc
AIDA-Train 946 18448 19.5
AIDA-A 216 4791 22.1
AIDA-B 231 4485 19.4
ACE2004 36 251 7.1
MSNBC 20 656 32.8
AQUAINT 50 727 14.5
WNED-CWEB 320 11154 34.8
WNED-WIKI 320 6821 21.3
OURSELF-WIKI 460 25995 56.5
Table 1. Statistics of document and mention numbers on experimental datasets.

Training Details

During the training of our RLEL model, we select top K candidate entities for each mention to optimize the memory and run time. In the top K candidate list, we define the recall of correct target entity is . According to our statistics, when K is set to 1, is 0.853, when K is 5, is 0.977, when K increases to 10, is 0.993. Empirically, we choose top 5 candidate entities as the input of our RLEL model. For the entity description, there are lots of redundant information in the wikipedia page, to reduce the impact of noise data, we use TextRank algorithm (Mihalcea and Tarau, 2004)

to select 15 keywords as description of the entity. Simultaneously, we choose 15 words around mention as its context. In the global LSTM network, when the number of mentions does not reach the set length, we adopt the mention padding strategy. In short, we copy the last mention in the sequence until the number of mentions reaches the set length.

Hyper-parameter setting

We set the dimensions of word embedding and entity embedding to 300, where the word embedding and entity embedding are released by (Pennington et al., 2014) and (Ganea and Hofmann, 2017) respectively. For parameters of the local LSTM network, the number of LSTM cell units is set to 512, the batch size is 64, and the rank margin is 0.1. Similarly, in global LSTM network, the number of LSTM cell units is 700 and the batch size is 16. In the above two LSTM networks, the learning rate is set to 1e-3, the probability of dropout is set to 0.8, and the Adam is utilized as optimizer. In addition, we set the number of MLP layers to 4 and extend the priori feature dimension to 50 in the policy network.

3.2. Comparing with Previous Work

Baselines

We compare RLEL with a series of EL systems which report state-of-the-art results on the test datasets. There are various methods including classification model (Milne and Witten, 2008), rank model (Chisholm and Hachey, 2015; Ratinov et al., 2011) and probability graph model (Guo and Barbosa, 2018; Hoffart et al., 2011; Huang et al., 2015; Ganea and Hofmann, 2017; Titov and Le, 2018). Except that, Cheng (Cheng and Roth, 2013)

formulate their global decision problem as an Integer Linear Program (ILP) which incorporates the entity-relation inference. Globerson

(Globerson et al., 2016)

introduce a multi-focal attention model which allows each candidate to focus on limited mentions, Yamada

(Yamada et al., 2016) propose a word and entity embedding model specifically designed for EL.

Evaluation Metric

We use the standard Accuracy, Precision, Recall and F1 at mention level (Micro) as the evaluation metrics:

(10)
(11)
(12)
(13)

where is the golden standard set of the linked name mentions, is the set of linked name mentions outputted by an EL method.

Methods AIDA-B
Huang and Heck (2015)(Huang et al., 2015) 86.6%
Chisholm and Hachey (2015)(Chisholm and Hachey, 2015) 88.7%
Guo and Barbosa (2016)(Guo and Barbosa, 2018) 89.0%
Globerson et al. (2016)(Globerson et al., 2016) 91.0%
Yamada et al. (2016)(Yamada et al., 2016) 91.5%
Ganea and Hofmann (2017)(Ganea and Hofmann, 2017) 92.2%
Phong and Titov (2018)(Titov and Le, 2018) 93.1%
our 94.3%
Table 2. In-KB accuracy result on AIDA-B dataset.
Methods MSNBC AQUAINT ACE2004 CWEB WIKI Avg
Milne and Witten (2008)(Milne and Witten, 2008) 78% 85% 81% 64.1% 81.7% 77.96%
Hoffart and Johannes(2011)(Hoffart et al., 2011) 79% 56% 80% 58.6% 63% 67.32%
Ratinov and Lev(Ratinov et al., 2011) 75% 83% 82% 56.2% 67.2% 72.68%
Cheng and Roth (2013)(Cheng and Roth, 2013) 90% 90% 86% 67.5% 73.4% 81.38%
Guo and Barbosa (2016)(Guo and Barbosa, 2018) 92% 87% 88% 77% 84.5% 85.7%
Ganea and Hofmann (2017)(Ganea and Hofmann, 2017) 93.7% 88.5% 88.5% 77.9% 77.5% 85.22%
Phong and Titov (2018)(Titov and Le, 2018) 93.9% 88.3% 89.9% 77.5% 78.0% 85.51%
our 92.8% 87.5% 91.2% 78.5% 82.8% 86.56%
Table 3. Compare our model with other baseline methods on different types of datasets. The evaluation metric is micro F1.

Results

Same as previous work, we use in-KB accuracy and micro F1 to evaluate our method. We first test the model on the AIDA-B dataset. From Table 2, we can observe that our model achieves the best result. Previous best results on this dataset are generated by (Ganea and Hofmann, 2017; Titov and Le, 2018) which both built CRF models. They calculate the pairwise scores between all candidate entities. Differently, our model only considers the consistency of the target entities and ignores the relationship between incorrect candidates. The experimental results show that our model can reduce the impact of noise data and improve the accuracy of disambiguation. Apart from experimenting on AIDA-B, we also conduct experiments on several different datasets to verify the generalization performance of our model.

From Table 3, we can see that RLEL has achieved relatively good performances on ACE2004, CWEB and WIKI. At the same time, previous models (Ganea and Hofmann, 2017; Titov and Le, 2018; Cheng and Roth, 2013) achieve better performances on the news datasets such as MSNBC and AQUINT, but their results on encyclopedia datasets such as WIKI are relatively poor. To avoid overfitting with some datasets and improve the robustness of our model, we not only use AIDA-Train but also add Wikipedia data to the training set. In the end, our model achieve the best overall performance.

For most existing EL systems, entities with lower frequency are difficult to disambiguate. To gain further insight, we analyze the accuracy of the AIDA-B dataset for situations where gold entities have low popularity. We divide the gold entities according to their pageviews in wikipedia, the statistical disambiguation results are shown in Table 4. Since some pageviews can not be obtained, we only count part of gold entities. The result indicates that our model is still able to work well for low-frequency entities. But for medium-frequency gold entities, our model doesn’t work well enough. The most important reason is that other candidate entities corresponding to these medium-frequency gold entities have higher pageviews and local similarities, which makes the model difficult to distinguish.

Pageview/million Mention Num Micro F1
0.01 307 91.93%
0.01-0.1 612 86.06%
0.1-1 968 88.97%
1-5 1006 96.03%
5-10 493 96.43%
10 825 99.39%
Table 4. The micro F1 of gold entities with different pageviews on part of AIDA-B dataset.

3.3. Discussion on different RLEL variants

To demonstrate the effects of RLEL, we evaluate our model under different conditions. First, we evaluate the effect of sequence length on global decision making. Second, we assess whether sorting the mentions have a positive effect on the results. Third, we analysis the results of not adding globally encoding during entity selection. Last, we compare our RL selection strategy with the greedy choice.

Sequence in different length

A document may contain multiple topics, so we do not add all mentions to a single sequence. In practice, we add some adjacent mentions to the sequence and use reinforcement learning to select entities from beginning to end. To analysis the impact of the number of mentions on joint disambiguation, we experiment with sequences on different lengths. The results on AIDA-B are shown in Figure 4. We can see that when the sequence is too short or too long, the disambiguation results are both very poor. When the sequence length is less than 3, delay reward can’t work in reinforcement learning, and when the sequence length reaches 5 or more, noise data may be added. Finally, we choose the 4 adjacent mentions to form a sequence.

Figure 4. The performance of models with different sequence lengths on AIDA-B dataset.
Figure 5. The comparative experiments of RLEL model.
Document Content Mentions after ranking Selected Target Entity(is correct)
Australia beat West Indies by five wickets
in a World Series limited overs match at the
Melbourne Cricket Groundon Friday…
1.Melbourne Cricket Ground
2.World Series
3.West Indies
4.Australia
1.Melbourne Cricket Ground(correct)
2.World Series Cricket(correct)
3.West Indies cricket team(correct)
4.Australia national cricket team(correct)
Instead of Los Angeles International,
…, consider flying into Burbank or John
Wayne Airport in Orange County, Calif…
1.John Wayne Airport
2.Orange County
3.Los Angeles International
4.Burbank
1.John Wayne Airport(correct)
2.Orange County, California(correct)
3.Los Angeles International Airport(correct)
4.Burbank, California
(wrong, the correct is ”Hollywood Burbank Airport”)
Table 5. Entity selection examples by our RLEL model.

Influence of ranking mentions

In this section, we test whether ranking mentions is helpful for entity selections. At first, we directly input them into the global encoder by the order they appear in the text. We record the disambiguation results and compare them with the method which adopts ranking mentions. As shown in Figure 5a, the model with ranking mentions has achieved better performances on most of datasets, indicating that it is effective to place the mention that with a higher local similarity in front of the sequence. It is worth noting that the effect of ranking mentions is not obvious on the MSNBC dataset, the reason is that most of mentions in MSNBC have similar local similarities, the order of disambiguation has little effect on the final result.

Effect of global encoding

Most of previous methods mainly use the similarities between entities to correlate each other, but our model associates them by encoding the selected entity information. To assess whether the global encoding contributes to disambiguation rather than add noise, we compare the performance with and without adding the global information. When the global encoding is not added, the current state only contains the mention context representation, candidate entity representation and feature representation, notably, the selected target entity information is not taken into account. From the results in Figure 5b, we can see that the model with global encoding achieves an improvement of 4% accuracy over the method that without global encoding.

Different entity selection strategies

To illustrate the necessity for adopting the reinforcement learning for entity selection, we compare two entity selection strategies like (Feng et al., 2018). Specifically, we perform entity selection respectively with reinforcement learning and greedy choice. The greedy choice is to select the entity with largest local similarity from candidate set. But the reinforcement learning selection is guided by delay reward, which has a global perspective. In the comparative experiment, we keep the other conditions consistent, just replace the RL selection with a greedy choice. Based on the results in Figure 5c, we can draw a conclusion that our entity selector perform much better than greedy strategies.

3.4. Case Study

Table 5 shows two entity selection examples by our RLEL model. For multiple mentions appearing in the document, we first sort them according to their local similarities, and select the target entities in order by the reinforcement learning model. From the results of sorting and disambiguation, we can see that our model is able to utilize the topical consistency between mentions and make full use of the selected target entity information.

4. Related Work

The related work can be roughly divided into two groups: entity linking and reinforcement learning.

4.1. Entity Linking

Entity linking falls broadly into two major approaches: local and global disambiguation. Early studies use local models to resolve mentions independently, they usually disambiguate mentions based on lexical matching between the mention’s surrounding words and the entity profile in the reference KB. Various methods have been proposed to model mention’s local context ranging from binary classification (Milne and Witten, 2008) to rank models (Dredze et al., 2010; Chen and Ji, 2011). In these methods, a large number of hand-designed features are applied. For some marginal mentions that are difficult to extract features, researchers also exploit the data retrieved by search engines (Cornolti et al., 2014, 2016) or Wikipedia sentences (Tan et al., 2017)

. However, the feature engineering and search engine methods are both time-consuming and laborious. Recently, with the popularity of deep learning models, representation learning is utilized to automatically find semantic features

(Gupta et al., 2017; Cao et al., 2017). The learned entity representations which by jointly modeling textual contexts and knowledge base are effective in combining multiple sources of information. To make full use of the information contained in representations, we also utilize the pre-trained entity embeddings in our model.

In recent years, with the assumption that the target entities of all mentions in a document shall be related, many novel global models for joint linking are proposed. Assuming the topical coherence among mentions, authors in (Ganea et al., 2016; Ran et al., 2018) construct factor graph models, which represent the mention and candidate entities as variable nodes, and exploit factor nodes to denote a series of features. Two recent studies (Ganea and Hofmann, 2017; Titov and Le, 2018)

use fully-connected pairwise Conditional Random Field(CRF) model and exploit loopy belief propagation to estimate the max-marginal probability. Moreover, PageRank or Random Walk

(Han et al., 2011; Guo and Barbosa, 2018; Zwicklbauer et al., 2016) are utilized to select the target entity for each mention. The above probabilistic models usually need to predefine a lot of features and are difficult to calculate the max-marginal probability as the number of nodes increases. In order to automatically learn features from the data, Cao et al. (Cao et al., 2018) applies Graph Convolutional Network to flexibly encode entity graphs. However, the graph-based methods are computationally expensive because there are lots of candidate entity nodes in the graph.

To reduce the calculation between candidate entity pairs, Globerson et al. (Globerson et al., 2016) introduce a coherence model with an attention mechanism, where each mention only focus on a fixed number of mentions. Unfortunately, choosing the number of attention mentions is not easy in practice. Two recent studies (Phan et al., 2017, 2018) finish linking all mentions by scanning the pairs of mentions at most once, they assume each mention only needs to be consistent with one another mention in the document. The limitation of their method is that the consistency information is too sparse, resulting in low confidence. Similar to us, Guo et al. (Guo and Barbosa, 2018) also sort mentions according to the difficulty of disambiguation, but they did not make full use of the information of previously referred entities for the subsequent entity disambiguation. Nguyen et al. (Nguyen et al., 2016) use the sequence model, but they simply encode the results of the greedy choice, and measure the similarities between the global encoding and the candidate entity representations. Their model does not consider the long-term impact of current decisions on subsequent choices, nor does they add the selected target entity information to the current state to help disambiguation.

4.2. Reinforcement Learning

In the last few years, reinforcement learning has emerged as a powerful tool for solving complex sequential decision-making problems. It is well known for its great success in the game field, such as Go (Silver et al., 2016) and Atari games (Mnih et al., 2015)

. Recently, reinforcement learning has also been successfully applied to many natural language processing tasks and achieved good performance

(Xiong et al., 2017; Liu et al., 2018; Feng et al., 2018). Feng et al.(Feng et al., 2018)

used reinforcement learning for relation classification task by filtering out the noisy data from the sentence bag and they achieved huge improvements compared with traditional classifiers. Zhang

et al. (Zhang et al., 2018) applied the reinforcement learning on sentence representation by automatically discovering task-relevant structures. To automatic taxonomy induction from a set of terms, Han et al. (Han et al., 2018) designed an end-to-end reinforcement learning model to determine which term to select and where to place it on the taxonomy, which effectively reduced the error propagation between two phases. Inspired by the above works, we also add reinforcement learning to our framework.

5. Conclusions

In this paper we consider entity linking as a sequence decision problem and present a reinforcement learning based model. Our model learns the policy on selecting target entities in a sequential manner and makes decisions based on current state and previous ones. By utilizing the information of previously referred entities, we can take advantage of global consistency to disambiguate mentions. For each selection result in the current state, it also has a long-term impact on subsequent decisions, which allows learned policy strategy has a global view. In experiments, we evaluate our method on AIDA-B and other well-known datasets, the results show that our system outperforms state-of-the-art solutions. In the future, we would like to use reinforcement learning to detect mentions and determine which mention should be firstly disambiguated in the document.

Acknowledgements.
This research is supported by the Sponsor National Key Research and Development Program of China Rl(No. Grant #3), Sponsor the Beijing Municipal Science and Technology Project Rl under grant (No. Grant #3), and Sponsor the National Natural Science Foundation of China Rl grants(No. Grant #3).

References

  • (1)
  • Cao et al. (2018) Yixin Cao, Lei Hou, Juanzi Li, and Zhiyuan Liu. 2018. Neural Collective Entity Linking. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018. 675–686.
  • Cao et al. (2017) Yixin Cao, Lifu Huang, Heng Ji, Xu Chen, and Juanzi Li. 2017. Bridge Text and Knowledge by Learning Multi-Prototype Entity Mention Embedding. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers. 1623–1633.
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016. 785–794.
  • Chen and Ji (2011) Zheng Chen and Heng Ji. 2011. Collaborative Ranking: A Case Study on Entity Linking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL. 771–781.
  • Cheng and Roth (2013) Xiao Cheng and Dan Roth. 2013. Relational Inference for Wikification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL. 1787–1796.
  • Chisholm and Hachey (2015) Andrew Chisholm and Ben Hachey. 2015. Entity Disambiguation with Web Links. TACL 3 (2015), 145–156.
  • Cornolti et al. (2016) Marco Cornolti, Paolo Ferragina, Massimiliano Ciaramita, Stefan Rüd, and Hinrich Schütze. 2016. A Piggyback System for Joint Entity Mention Detection and Linking in Web Queries. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11 - 15, 2016. 567–578.
  • Cornolti et al. (2014) Marco Cornolti, Paolo Ferragina, Massimiliano Ciaramita, Hinrich Schütze, and Stefan Rüd. 2014. The SMAPH system for query entity recognition and disambiguation. In ERD’14, Proceedings of the First ACM International Workshop on Entity Recognition & Disambiguation, July 11, 2014, Gold Coast, Queensland, Australia. 25–30.
  • Cucerzan (2007) Silviu Cucerzan. 2007. Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic. 708–716.
  • Dhingra et al. (2017) Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2017. Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers. 484–495.
  • Dredze et al. (2010) Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, and Tim Finin. 2010. Entity Disambiguation for Knowledge Base Population. In COLING 2010, 23rd International Conference on Computational Linguistics, Proceedings of the Conference, 23-27 August 2010, Beijing, China. 277–285.
  • Feng et al. (2018) Jun Feng, Minlie Huang, Li Zhao, Yang Yang, and Xiaoyan Zhu. 2018. Reinforcement Learning for Relation Classification From Noisy Data. In

    Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018

    .
  • Ganea et al. (2016) Octavian-Eugen Ganea, Marina Ganea, Aurélien Lucchi, Carsten Eickhoff, and Thomas Hofmann. 2016. Probabilistic Bag-Of-Hyperlinks Model for Entity Linking. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11 - 15, 2016. 927–938.
  • Ganea and Hofmann (2017) Octavian-Eugen Ganea and Thomas Hofmann. 2017. Deep Joint Entity Disambiguation with Local Neural Attention. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017. 2619–2629.
  • Globerson et al. (2016) Amir Globerson, Nevena Lazic, Soumen Chakrabarti, Amarnag Subramanya, Michael Ringgaard, and Fernando Pereira. 2016. Collective Entity Resolution with Multi-Focal Attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
  • Guo and Barbosa (2018) Zhaochen Guo and Denilson Barbosa. 2018. Robust named entity disambiguation with random walks. Semantic Web 9, 4 (2018), 459–479.
  • Gupta et al. (2017) Nitish Gupta, Sameer Singh, and Dan Roth. 2017. Entity Linking via Joint Encoding of Types, Descriptions, and Context. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017. 2681–2690.
  • Han et al. (2018) Jiawei Han, Xiang Ren, Jiaming Shen, Yuning Mao, and Xiaotao Gu. 2018. End-to-End Reinforcement Learning for Automatic Taxonomy Induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers. 2462–2472.
  • Han et al. (2011) Xianpei Han, Le Sun, and Jun Zhao. 2011. Collective entity linking in web text: a graph-based method. In Proceeding of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, Beijing, China, July 25-29, 2011. 765–774.
  • Hoffart et al. (2011) Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust Disambiguation of Named Entities in Text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL. 782–792.
  • Huang et al. (2015) Hongzhao Huang, Larry P. Heck, and Heng Ji. 2015. Leveraging Deep Neural Networks and Knowledge Graphs for Entity Disambiguation. CoRR abs/1504.07678 (2015). arXiv:1504.07678
  • Liu et al. (2018) Ting Liu, William Yang Wang, Yu Zhang, Qingyu Yin, and Weinan Zhang. 2018. Deep Reinforcement Learning for Chinese Zero Pronoun Resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers. 569–578.
  • Mihalcea and Tarau (2004) Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing Order into Text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing , EMNLP 2004, A meeting of SIGDAT, a Special Interest Group of the ACL, held in conjunction with ACL 2004, 25-26 July 2004, Barcelona, Spain. 404–411.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013). arXiv:1301.3781
  • Milne and Witten (2008) David N. Milne and Ian H. Witten. 2008. Learning to link with wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, Napa Valley, California, USA, October 26-30, 2008. 509–518.
  • Mnih et al. (2015) Volodymyr Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
  • Nguyen et al. (2016) Thien Huu Nguyen, Nicolas R. Fauceglia, Mariano Rodriguez-Muro, Oktie Hassanzadeh, Alfio Massimiliano Gliozzo, and Mohammad Sadoghi. 2016. Joint Learning of Local and Global Features for Entity Linking via Neural Networks. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan. 2310–2320.
  • Nogueira and Cho (2017) Rodrigo Nogueira and Kyunghyun Cho. 2017. Task-Oriented Query Reformulation with Reinforcement Learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017. 574–583.
  • Pappu et al. (2017) Aasish Pappu, Roi Blanco, Yashar Mehdad, Amanda Stent, and Kapil Thadani. 2017. Lightweight Multilingual Entity Extraction and Linking. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM 2017, Cambridge, United Kingdom, February 6-10, 2017. 365–374.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL. 1532–1543.
  • Phan et al. (2017) Minh C. Phan, Aixin Sun, Yi Tay, Jialong Han, and Chenliang Li. 2017. NeuPL: Attention-based Semantic Matching and Pair-Linking for Entity Disambiguation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, Singapore, November 06 - 10, 2017. 1667–1676.
  • Phan et al. (2018) Minh C. Phan, Aixin Sun, Yi Tay, Jialong Han, and Chenliang Li. 2018. Pair-Linking for Collective Entity Disambiguation: Two Could Be Better Than All. CoRR abs/1802.01074 (2018). arXiv:1802.01074
  • Ran et al. (2018) Chenwei Ran, Wei Shen, and Jianyong Wang. 2018. An Attention Factor Graph Model for Tweet Entity Linking. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018. 1135–1144.
  • Ratinov et al. (2011) Lev-Arie Ratinov, Dan Roth, Doug Downey, and Mike Anderson. 2011. Local and Global Algorithms for Disambiguation to Wikipedia. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA. 1375–1384.
  • Silver et al. (2016) David Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. v. d. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.
  • Tan et al. (2017) Chuanqi Tan, Furu Wei, Pengjie Ren, Weifeng Lv, and Ming Zhou. 2017. Entity Linking for Queries by Searching Wikipedia Sentences. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017. 68–77.
  • Titov and Le (2018) Ivan Titov and Phong Le. 2018. Improving Entity Linking by Modeling Latent Relations between Mentions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers. 1595–1604.
  • Williams (1992) Ronald J. Williams. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning 8 (1992), 229–256.
  • Xiong et al. (2017) Wenhan Xiong, Thien Hoang, and William Yang Wang. 2017.

    DeepPath: A Reinforcement Learning Method for Knowledge Graph Reasoning. In

    Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017. 564–573.
  • Yamada et al. (2016) Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016. 250–259.
  • Zhang et al. (2018) Tianyang Zhang, Minlie Huang, and Li Zhao. 2018. Learning Structured Representation for Text Classification via Reinforcement Learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018. 6053–6060.
  • Zwicklbauer et al. (2016) Stefan Zwicklbauer, Christin Seifert, and Michael Granitzer. 2016. Robust and Collective Entity Disambiguation through Semantic Embeddings. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy, July 17-21, 2016. 425–434.