Entity linking (EL) refers to the joint task of recognizing entity mentions in text through mention detection (MD) and assigning those mentions to corresponding entities in a knowledge base (KB) through entity disambiguation (ED), also known as named entity disambiguation (NED). MD and named entity recognition (NER) are sometimes used interchangeably in the literature since both can be used to discover entity mentions in text; NER, however, distinguishes entity types such as Person or Organization, whereas in MD there exist only two classes.
End-to-end EL is a difficult task because mentions must be recognized and disambiguated in a single pass. For example, in the sentence “The Times began publication in London under its current name in 1788,” the words The Times must be recognized as belonging to an entity mention and then disambiguated to the correct corresponding entity, the British daily newspaper The Times. Because entity names are ambiguous, The Times could easily be assigned to any number of other newspapers which colloquially go by a similar name, such as The New York Times. As a result, EL models hinge on good (1) mention detection, (2) local mention features, (3) global mention features, which model the relationship between mentions, and (4) entity embeddings, whose quality makes disambiguation easier.
In this paper, we describe a new EL model jointly trained on MD and ED and based on BERT devlin-etal-2019-bert. It takes in a sequence of words, binary mention indicators, and entity ids which index a pre-trained entity embedding matrix taken from yamada2018wikipedia2vec. We represent the input words by their contextualized embeddings and train them on a joint MD binary classification and ED similarity maximization objective. Our model does not explicitly model local and global mention features but instead relies on BERT’s ability to model word- and context-level features.
This paper makes two main contributions: (i) An end-to-end differentiable EL model that jointly performs MD and ED and achieves state-of-the-art results. We use no separate local or global features and instead rely on BERT’s expressivity. (ii) A study of the impact of not using candidate sets at train and test time. Candidate sets limit a model to predicting across a predefined set of candidate entities.
2 Related Work
Neural-network-based approaches to EL and ED have recently achieved strong results across standard datasets. Research into ED has focused on learning better entity representations and on extracting better mention-side features through novel model architectures.
Entity representation. Good KB entity representations are a key component of most ED and EL models. Representation learning has been tackled by (yamada-etal-2016-joint; ganea-hofmann-2017-deep; cao-etal-2017-bridge; yamada2017learning) and, in the cross-lingual setting, by (sil2018neural; cao-etal-2018-joint). These approaches trace their lineage back to conventional word embedding models such as (mikolov). More recently, (yamada2019pre) propose learning entity representations using a model based on bidirectional transformer encoders, which allows them to achieve state-of-the-art results in ED. To the best of our knowledge, their paper and ours are the first to have applied transformer encoders to the ED and EL tasks.
Mention features. Recent work on the mention side has focused on extracting global features (ratinov2011local; globerson2016collective; ganea-hofmann-2017-deep; le-titov-2018-improving), extending the scope of ED to larger non-standard datasets (eshel-etal-2017-named), and posing the problem in new ways such as building separate classifiers for KB entities (barrena-etal-2018-learning).
Entity linking. Early work on end-to-end entity linking (sil2013re; luo2015joint; nguyen2016j) introduced models that jointly learn NER and ED using hand-engineered features. (Confusingly, EL is sometimes used to denote what we call ED, in which case it is a form of disambiguation-only EL as opposed to the end-to-end EL we focus on.) More recently, (kolitsas-etal-2018-end) proposed a model which considers all possible spans as potential mentions and learns contextual similarity scores over their entity candidates. As a result, MD is handled implicitly by only considering mention spans which have non-empty candidate entity sets. (martins-etal-2019-joint) propose jointly training a multi-task NER and ED objective using stack-LSTMs (dyer-etal-2015-transition). Our work differs from previous EL work in three ways: it does not need candidate sets during training, it focuses on MD instead of NER, and it replaces task-specific architectures with a more general BERT layer.
3 Model Description
Our EL model jointly solves MD and ED. It takes in a sequence of word tokens $w = (w_1, \dots, w_n)$, mention indicators $m = (m_1, \dots, m_n)$, and entity ids $e = (e_1, \dots, e_n)$ which index a pre-trained entity embedding matrix $E$. The goal of the model is to tag words with their correct mention indicators and entity ids. The motivation behind
3.1 Input Representations
The text input layers of our model are based on the BERT architecture devlin-etal-2019-bert, which is formed of a stack of bidirectional Transformer layers vaswani2017attention. We use the pretrained weights for BERT-Base released with the original BERT code (https://github.com/google-research/bert). BERT uses WordPiece johnson-etal-2017-googles for unsupervised tokenization of the input text. The vocabulary is built such that it contains the most frequently used words or sub-word units. We use the representation of the first sub-word as the input to the word-level classifier over the MD label set. The output of the input layers is a set of contextualized WordPiece embeddings which are grouped to form the embedding matrix $H \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length and $d$ is the embedding size, which in the case of BERT-Base is equal to 768.
On the entity side we use pre-trained entity embeddings from yamada2018wikipedia2vec. The entity embeddings are trained on the 2018 version of Wikipedia and are a function of the contexts in which the entities appear and of where they sit in the Wikipedia link graph. We denote entity embeddings by $E \in \mathbb{R}^{k \times d'}$, where $k$ is the number of entity embeddings we consider and $d'$ is their embedding size, which is determined by the pre-trained embeddings of yamada2018wikipedia2vec.
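The first-sub-word selection described above can be illustrated with a minimal sketch. It assumes each word has already been split into sub-word pieces; the split shown and the helper names `first_subword_indices` and `word_level_embeddings` are hypothetical and are not taken from the BERT code base.

```python
from typing import List, Sequence


def first_subword_indices(word_pieces: Sequence[Sequence[str]]) -> List[int]:
    """Return, for each word, the position of its first piece in the flattened piece sequence."""
    indices, offset = [], 0
    for pieces in word_pieces:
        if not pieces:
            raise ValueError("every word must have at least one sub-word piece")
        indices.append(offset)
        offset += len(pieces)
    return indices


def word_level_embeddings(piece_embeddings, word_pieces):
    """Keep only the contextualized embedding of the first piece of each word."""
    return [piece_embeddings[i] for i in first_subword_indices(word_pieces)]


# Hypothetical sub-word split, for illustration only (not real WordPiece output).
pieces = [["The"], ["Times"], ["began"], ["pub", "##lica", "##tion"], ["in"], ["London"]]
# Toy one-dimensional "embeddings": one vector per piece, eight pieces in total.
piece_embeddings = [[float(i)] for i in range(8)]

word_vectors = word_level_embeddings(piece_embeddings, pieces)
```

With this selection the word-level classifier sees exactly one vector per input word, regardless of how many pieces a word is split into.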
3.2 Word-Level EL model
We train on a multi-task learning caruana1997multitask objective which combines the MD and ED tasks. Both our MD and ED predictions are based on the contextualized WordPiece embeddings.
MD. We model the MD task as a sequence labelling problem across the familiar inside-outside-beginning (IOB) label set ramshaw-marcus-1995-text. The contextualized embedding $h_i \in \mathbb{R}^{d}$ of each word is passed through a single feedforward layer followed by a softmax:
$$p_i = \mathrm{softmax}(W_{md} h_i + b_{md})$$
where $b_{md} \in \mathbb{R}^{3}$ is the bias term, $W_{md} \in \mathbb{R}^{3 \times d}$ is a weight matrix, and $p_i \in \mathbb{R}^{3}$ is the predicted distribution across the $\{I, O, B\}$ label set. The predicted label is then simply:
$$\hat{y}_i = \arg\max_{j} p_{ij}$$
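As a rough illustration of the MD head, the sketch below applies one affine layer and a softmax to a single contextualized embedding and takes the most probable IOB label. The weights, the two-dimensional embedding, and the function names are toy assumptions chosen for readability; they are not trained parameters or the released implementation.

```python
import math
from typing import List, Sequence, Tuple

IOB_LABELS = ("I", "O", "B")


def affine(W: Sequence[Sequence[float]], b: Sequence[float], h: Sequence[float]) -> List[float]:
    """Compute W h + b for a single embedding h."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def softmax(z: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict_md(W, b, h) -> Tuple[str, List[float]]:
    """Return the most probable IOB label and the full label distribution."""
    p = softmax(affine(W, b, h))
    best = max(range(len(p)), key=p.__getitem__)
    return IOB_LABELS[best], p


# Toy parameters: 3 labels, embedding size 2 (BERT-Base would use 768).
W_md = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_md = [0.0, 0.0, 0.0]
h_i = [0.2, 2.0]

label, dist = predict_md(W_md, b_md, h_i)
```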
ED. We model the ED task as a similarity maximization problem between transformed contextualized word embeddings and entity embeddings. We first apply a feedforward neural network to the contextualized word embedding $h_i \in \mathbb{R}^{d}$:
$$v_i = W_{ed} h_i + b_{ed}$$
where $b_{ed} \in \mathbb{R}^{d'}$ is the bias term, $W_{ed} \in \mathbb{R}^{d' \times d}$ is a weight matrix, and $v_i \in \mathbb{R}^{d'}$ is the same size as the entity embeddings. By $\mathrm{sim}$ we denote any similarity measure which relates $v_i$ to every entity embedding in $E$:
$$s_i = \mathrm{sim}(v_i, E)$$
In our case, we use cosine similarity. Our predicted entity label is the index of $s_i$ with the highest score:
$$\hat{e}_i = \arg\max_{j} s_{ij}$$
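The ED step can be sketched in the same spirit: project the word embedding into the entity space, score it against every entity embedding with cosine similarity, and return the best-scoring index. The projection is written here as a plain affine map, which is an assumption; the entity matrix and all values are toy data.

```python
import math
from typing import List, Sequence, Tuple


def project(W: Sequence[Sequence[float]], b: Sequence[float], h: Sequence[float]) -> List[float]:
    """Map a word embedding h into the entity embedding space (assumed affine)."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 if either vector has zero norm."""
    dot = sum(a * c for a, c in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(c * c for c in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_entity(v: Sequence[float], entity_embeddings) -> Tuple[int, List[float]]:
    """Score v against every entity embedding and return (best index, all scores)."""
    scores = [cosine(v, e) for e in entity_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy setup: identity projection and three 2-dimensional entity embeddings.
W_ed = [[1.0, 0.0], [0.0, 1.0]]
b_ed = [0.0, 0.0]
E = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

v_i = project(W_ed, b_ed, [0.1, 0.9])
entity_id, scores = predict_entity(v_i, E)
```

Because cosine similarity ignores vector length, only the direction of the projected embedding matters when ranking entities.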
4.1 Dataset and metrics
We train and evaluate our model on the standard AIDA/CoNLL dataset (Hoffart et al., 2011). It is a collection of Reuters newswire articles split into a training set of 18,448 linked mentions in 946 documents, a validation set of 4,791 mentions in 216 documents, and a test set of 4,485 mentions in 231 documents.
4.2 Candidate selection
For each token we select the entity candidates that might be referred to by the mention containing that token. We use the candidate sets generated by hoffart2011robust using YAGO dictionaries. Importantly, we do not use any of the prior probabilities associated with candidate sets in any part of our model. We also set no limits on the size of the candidate sets.
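One way to picture the effect of a candidate set is as a restriction on which entity ids the final argmax may return. The sketch below assumes similarity scores indexed by entity id; the function name and the scores are hypothetical, and no prior probabilities are involved, in line with the description above.

```python
from typing import Iterable, Optional, Sequence


def predict_with_candidates(
    scores: Sequence[float],
    candidate_ids: Optional[Iterable[int]] = None,
) -> Optional[int]:
    """Return the highest-scoring entity id, optionally restricted to a candidate set.

    With candidate_ids=None the model ranks over every entity; an empty
    candidate set yields None (no entity is linked).
    """
    ids = list(range(len(scores))) if candidate_ids is None else list(candidate_ids)
    if not ids:
        return None
    return max(ids, key=scores.__getitem__)


# Toy similarity scores for four entities.
toy_scores = [0.91, 0.40, 0.87, 0.10]

unrestricted = predict_with_candidates(toy_scores)       # ranks over all entities
restricted = predict_with_candidates(toy_scores, [2, 3])  # ranks over candidates only
```

In this toy case the unrestricted prediction picks a high-scoring entity outside the candidate set, which is the kind of error candidate sets rule out.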
4.3 Training details and settings
We minimize a multi-task objective that combines two terms: $\mathcal{L}_{MD}$, a cross-entropy loss function over the MD labels, and $\mathcal{L}_{ED}$, a cosine similarity loss function over the entity embeddings. We train with a batch size of 4 for 50,000 steps. We use the Adam optimizer kingma2014adam with a learning rate of 1e-5, L2 weight decay, learning rate warmup over the first 5,000 steps, and a linear decay of the learning rate. We use a dropout probability of 0.1 on all layers. Each model was trained on a single Tesla V100 GPU with 16GB of memory and took around 6 hours to train.
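The learning rate schedule can be sketched as follows, using the reported peak learning rate of 1e-5, 5,000 warmup steps, and 50,000 total steps. The exact shape is an assumption: the sketch warms up linearly from zero and then decays linearly to zero, which is one common variant and may differ in detail from the schedule actually used.

```python
def learning_rate(
    step: int,
    peak_lr: float = 1e-5,
    warmup_steps: int = 5_000,
    total_steps: int = 50_000,
) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Learning rate at a few points of the 50,000-step run.
schedule = {step: learning_rate(step) for step in (0, 2_500, 5_000, 27_500, 50_000)}
```

The rate rises to its peak at step 5,000, is back at half the peak by step 27,500, and reaches zero at the final step.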
Comparison with other EL models. We compare the results of our model with four of the most recent EL models in Table 1. Our model, which uses the candidate sets described in Section 4.2, achieves state-of-the-art results. The small difference between micro and macro F1 suggests that our model does not overfit.
|System||Validation F1||Test F1|
|Our model (with candidate sets)||92.6||93.6||87.5||87.7|
|Our model (without candidate sets)||82.6||83.5||70.7||69.4|
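For readers less familiar with the two averages, the sketch below shows how micro and macro F1 can differ. It assumes the convention common in EL evaluation, where micro F1 pools counts over all mentions and macro F1 averages per-document F1 scores; the per-document counts are invented toy numbers, not results from this paper.

```python
from typing import Sequence, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; 0.0 when precision and recall are both undefined or zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(doc_counts: Sequence[Counts]) -> Tuple[float, float]:
    """Return (micro F1 over pooled counts, macro F1 averaged over documents)."""
    if not doc_counts:
        raise ValueError("need at least one document")
    tp = sum(c[0] for c in doc_counts)
    fp = sum(c[1] for c in doc_counts)
    fn = sum(c[2] for c in doc_counts)
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in doc_counts) / len(doc_counts)
    return micro, macro


# Toy counts for two documents: one large and easy, one small and hard.
toy_docs = [(8, 2, 2), (1, 1, 3)]
micro, macro = micro_macro_f1(toy_docs)
```

Micro F1 is dominated by documents with many mentions, whereas macro F1 weights every document equally, so a large gap between the two indicates uneven performance across documents.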
Without candidate sets, our model must rank across 1 million entity embeddings and does not achieve results comparable to those obtained with candidate sets, indicating that ranking over one million candidate entities is difficult.
Ablation tests. We perform ablation tests to identify the components which contribute most to our final results. We find that when BERT is frozen, our results fall slightly when using candidate sets but fall substantially when candidate sets are not used. This suggests that fine-tuning BERT is important for making EL without candidate sets possible.
|Ablation||Validation F1||Test F1|
|Without BERT fine-tuning||87.1||90.3||83.5||84.8|
|Without BERT fine-tuning (without candidate sets)||63.3||64.1||57.2||54.1|
|With random entity embeddings||88.4||90.0||79.6||80.9|
|With fasttext entity embeddings||90.4||91.4||82.8||82.9|
We also look at the effect entity embeddings have on performance. We test random embeddings, which are 100-dimensional embeddings sampled from a multivariate uniform distribution. We also form fasttext entity embeddings, which we define as the averaged 300-dimensional fasttext embeddings of entity titles. The results show that with random embeddings there is, as expected, a substantial drop between validation and test sets. Nevertheless, test set results are still high, which points to the helpfulness of candidate sets. Similarly, fasttext entity embeddings perform only slightly better than random embeddings, pointing to the need for Wikipedia-specific entity embeddings.
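The two ablation embeddings can be sketched as below. The sampling range `[-1, 1]`, the whitespace tokenization of titles, and the tiny word-vector table are all assumptions made for illustration; a real run would use 300-dimensional fasttext vectors rather than the toy dictionary shown here.

```python
import random
from typing import Dict, List, Sequence


def random_entity_embeddings(
    num_entities: int, dim: int = 100, low: float = -1.0, high: float = 1.0, seed: int = 0
) -> List[List[float]]:
    """Sample each coordinate independently from a uniform distribution (range is assumed)."""
    rng = random.Random(seed)
    return [[rng.uniform(low, high) for _ in range(dim)] for _ in range(num_entities)]


def title_embedding(
    title: str, word_vectors: Dict[str, Sequence[float]], dim: int
) -> List[float]:
    """Average the vectors of the title's words; zero vector if no word is known."""
    vectors = [word_vectors[t] for t in title.lower().split() if t in word_vectors]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy 2-dimensional word vectors standing in for fasttext vectors.
toy_vectors = {"the": [1.0, 0.0], "times": [0.0, 1.0]}

random_E = random_entity_embeddings(num_entities=3)
times_vec = title_embedding("The Times", toy_vectors, dim=2)
```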
5 Conclusion and Future Work
We proposed jointly learning MD and ED in order to improve end-to-end EL results. Our results show that our model achieves state-of-the-art results. Furthermore, we show that training without candidate sets is possible and present test results for the setting in which candidate sets are not used. The model introduced in this paper focuses on the mention side, and it would be interesting to study how it performs given BERT-based entity embeddings such as the ones recently introduced in yamada2019pre.