A contemporary approach to entity linking represents each entity with a textual description, encodes these descriptions and contextualized mentions of entities into a shared vector space using a dual encoder (one encoder for mentions and one for descriptions), and scores each mention-entity pair as the inner product between their encodings (Botha et al., 2020; Wu et al., 2019). By restricting the interaction between mention and entity to an inner product, this approach permits the pre-computation of all entity encodings and fast retrieval of the top-scoring entities using maximum inner-product search (MIPS).
Here we begin with the observation that many entities appear in diverse contexts, which may not be easily captured in a single high-level description. For example, actor Tommy Lee Jones played football in college, but this fact is not captured in the entity description derived from his Wikipedia page (see Figure 1). Furthermore, when new entities need to be added to the index in a zero-shot setting, it may be difficult to obtain a high-quality description. We propose that both problems can be solved by allowing the entity mentions themselves to serve as exemplars. In addition, retrieving from the set of mentions can result in more interpretable predictions, since we are directly comparing two mentions, and allows us to leverage massively multilingual training data more easily, without forcing choices about which language(s) to use for the entity descriptions.
We present a new approach, moleman (Mention Only Linking of Entities with a Mention Annotation Network), that maintains the dual-encoder architecture but uses the same mention encoder on both sides. Entity linking is modeled entirely as a mapping between mentions, where inference involves a nearest neighbor search against all known mentions of all entities in the training set. We build moleman using exactly the same mention-encoder architecture and training data as Model F (Botha et al., 2020). We show that moleman significantly outperforms Model F on both the Mewsli-9 and Tsai and Roth (2016) datasets, particularly for low-coverage languages and rarer entities.
We also observe that moleman achieves high accuracy with just a few mentions for each entity, suggesting that new entities can be added or existing entities can be modified simply by labeling a small number of new mentions. We expect this update mechanism to be significantly more flexible than writing or editing entity descriptions. Finally, we compare the massively multilingual moleman model to a much more expensive English-only dual-encoder architecture (Wu et al., 2019) on the well-studied TACKBP-2010 dataset (Ji et al., 2010) and show that moleman is competitive even in this setting.
We train a model that performs entity linking by ranking a set of entity-linked indexed mentions-in-context. Formally, let a mention-in-context $x = [x_1, \dots, x_n]$ be a sequence of $n$ tokens from vocabulary $\mathcal{V}$, which includes designated entity span tokens. An entity-linked mention-in-context $m^i = (x^i, e^i)$ pairs a mention with an entity from a predetermined set of entities $\mathcal{E}$. Let $\mathcal{M}_I = [m^1, \dots, m^k]$ be a set of entity-linked mentions-in-context, let $\mathrm{entity}(\cdot): \mathcal{M}_I \to \mathcal{E}$ be a function that returns the entity $e^i \in \mathcal{E}$ associated with $m^i$, and let $x(\cdot)$ return the token sequence $x^i$.
Our goal is to learn a function $\phi$ that maps an arbitrary mention-in-context token sequence $x$ to a fixed vector $\phi(x) \in \mathbb{R}^d$ with the property that
$$y^* = \mathrm{entity}\Big(\operatorname*{arg\,max}_{m' \in \mathcal{M}_I} \phi(x(m'))^\top \phi(x_q)\Big)$$
gives a good prediction of the true entity label of a query mention-in-context $x_q$.
Recent state-of-the-art entity linking systems employ a dual encoder architecture, embedding mentions-in-context and entity representations in the same space. We also employ a dual encoder architecture, but we score mentions-in-context (hereafter, mentions) against other mentions, with no consolidated entity representations. The dual encoder maps a pair of mentions $(m, m')$ to a score $s(m, m')$ computed from their encodings $\phi(m)$ and $\phi(m')$, where
$\phi$ is a learned neural network that encodes the input mention as a $d$-dimensional vector.
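The nearest-mention prediction rule described above can be illustrated with a minimal NumPy sketch. It assumes the mention encodings have already been computed and that the score is a plain inner product; the function name, the toy vectors and the entity identifiers are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def link_by_nearest_mention(query_vec, index_vecs, index_entities):
    """Return the entity of the indexed mention with the highest inner product."""
    scores = index_vecs @ query_vec      # one score per indexed mention
    best = int(np.argmax(scores))        # position of the nearest indexed mention
    return index_entities[best], float(scores[best])

# Toy index: three mention encodings covering two hypothetical entities.
index_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
index_entities = ["Q1", "Q2", "Q2"]

entity, score = link_by_nearest_mention(np.array([0.0, 2.0]), index_vecs, index_entities)
```

Because the prediction is the label of a single retrieved mention, the retrieved mention itself can be shown as evidence for the decision.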
3.2 Training Process
3.2.1 Mention Pairs Dataset
We build a dataset of mention pairs using the 104-language collection of Wikipedia mentions constructed by Botha et al. (2020). This dataset maps Wikipedia hyperlinks to WikiData (Vrandečić and Krötzsch, 2014), a language-agnostic knowledge base. We create mention pairs from the set of all mentions that link to a given entity.
We use the same division of Wikipedia pages into train and test splits used by Botha et al. (2020) for compatibility with the TR2016 test set (Tsai and Roth, 2016). We take up to the first 100k mention pairs from a randomly ordered list of all pairs regardless of language, yielding 557M and 31M training and evaluation pairs, respectively. Of these, 69.7% of pairs involve two mentions from different languages. Our index set contains 651M mentions, covering 11.6M entities.
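As an illustration of the pair construction, the following sketch enumerates mention pairs within each entity, shuffles them, and keeps a capped prefix. It assumes the cap is applied per entity, which the text does not state explicitly; the function name and data layout are hypothetical, and enumerating all pairs is only practical for toy inputs.

```python
import itertools
import random

def build_mention_pairs(mentions_by_entity, max_pairs_per_entity, seed=0):
    """Pair up mentions that link to the same entity, keeping a capped random subset."""
    rng = random.Random(seed)
    pairs = []
    for entity, mentions in mentions_by_entity.items():
        # All unordered pairs of mentions of this entity, regardless of language.
        entity_pairs = list(itertools.combinations(mentions, 2))
        rng.shuffle(entity_pairs)
        # Assumption: the cap is applied separately to each entity.
        pairs.extend((entity, a, b) for a, b in entity_pairs[:max_pairs_per_entity])
    return pairs

toy_mentions = {"Q1": ["en:m1", "de:m2", "fr:m3"], "Q2": ["en:m4"]}
toy_pairs = build_mention_pairs(toy_mentions, max_pairs_per_entity=2)
```

An entity with a single mention, such as the hypothetical "Q2" above, contributes no pairs.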
3.2.2 Hard Negative Mining and Positive Resampling
Previous work using a dual encoder trained with in-batch sampled softmax has improved performance with subsequent training rounds using an auxiliary cross-entropy loss against hard negatives sampled from the current model (Gillick et al., 2019; Wu et al., 2019; Botha et al., 2020). We investigate the effect of such negative mining for moleman, controlling the ratio of positives to negatives on a per-entity basis. This is achieved by limiting each entity to appear as a negative example at most 10 times as often as it does in positive examples, as done by Botha et al. (2020).
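The per-entity cap on negatives can be sketched as a simple filter over mined candidates. This is an illustrative reading of the 10:1 limit, not the authors' code: the function name and the (query id, entity) tuple layout are hypothetical, and candidates are assumed to arrive in the order in which they should be preferred.

```python
from collections import Counter

def cap_hard_negatives(positive_entities, candidate_negatives, max_ratio=10):
    """Keep a mined negative only while its entity stays within max_ratio x its positive count."""
    positive_counts = Counter(positive_entities)
    used = Counter()
    kept = []
    for query_id, neg_entity in candidate_negatives:
        limit = max_ratio * positive_counts[neg_entity]
        if used[neg_entity] < limit:
            used[neg_entity] += 1
            kept.append((query_id, neg_entity))
    return kept

positives = ["Q1"]                                      # Q1 appears once as a positive
candidates = [(i, "Q1") for i in range(15)] + [(99, "Q2")]
kept = cap_hard_negatives(positives, candidates)
```

Under this reading, an entity that never appears as a positive is never used as a negative, since its limit is zero.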
In addition, since moleman is intended to retrieve the most similar indexed mention of the correct entity, we experiment with using this retrieval step to resample the positive pairs used to construct our mention-pair dataset for the in-batch sampled softmax, pairing each mention with the highest-scoring other mention of the same entity in the index set. This is similar to the index refreshing that is employed in other retrieval-based methods trained with in-batch softmax (Guu et al., 2020; Lewis et al., 2020a).
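The positive resampling step can be sketched as follows, assuming all encodings fit in memory and that scoring uses an inner product. The function name is hypothetical, and the dense all-pairs score matrix is only feasible for small examples; a real index of this size would need a retrieval system.

```python
import numpy as np

def resample_positives(vecs, entities):
    """For each mention, index of the highest-scoring other mention of the same entity (-1 if none)."""
    vecs = np.asarray(vecs, dtype=float)
    entities = np.asarray(entities)
    scores = vecs @ vecs.T
    same_entity = entities[:, None] == entities[None, :]
    np.fill_diagonal(same_entity, False)           # a mention cannot be paired with itself
    masked = np.where(same_entity, scores, -np.inf)
    best = masked.argmax(axis=1)
    has_partner = same_entity.any(axis=1)
    return np.where(has_partner, best, -1)

toy_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
toy_entities = ["Q1", "Q1", "Q1", "Q2"]
partners = resample_positives(toy_vecs, toy_entities)
```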
3.2.3 Input Representations
Following prior work (Wu et al., 2019; Botha et al., 2020), our mention representation consists of the page title and a window around the mention, with special mention boundary tokens marking the mention span. We use a total context size of 64 tokens.
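A minimal sketch of this input construction is given below. The marker strings "[E]", "[/E]" and "[SEP]" are hypothetical placeholders, and the sketch assumes that the 64-token budget covers the title and markers as well as the context window; the text does not specify either detail.

```python
def build_mention_input(title_tokens, doc_tokens, span_start, span_end, max_len=64):
    """Page title, then a context window with the mention span wrapped in boundary markers."""
    mention = ["[E]"] + doc_tokens[span_start:span_end] + ["[/E]"]
    prefix = list(title_tokens) + ["[SEP]"]
    # Tokens left over for context on either side of the mention.
    budget = max(max_len - len(prefix) - len(mention), 0)
    left = doc_tokens[max(span_start - budget // 2, 0):span_start]
    # Any budget the left side could not use is given to the right side.
    right = doc_tokens[span_end:span_end + budget - len(left)]
    return (prefix + left + mention + right)[:max_len]

doc = [f"w{i}" for i in range(100)]
tokens = build_mention_input(["Tommy", "Lee", "Jones"], doc, span_start=50, span_end=52)
```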
Though our focus is on entity mentions, the entity descriptions can still be a useful additional source of data, and allow for zero-shot entity linking (when no mentions of an entity exist in our training set). We therefore experiment with adding the available entity descriptions as additional “pseudo-mentions”. These are constructed in a similar way to the mention representations, except without mention boundaries. Organic and pseudo-mentions are fed into BERT using distinct sets of token type identifiers. We supplement our training set with additional mention pairs formed from each entity’s description and a random mention, adding 38M training pairs, and add these descriptions to the index, expanding the entity set to 20M.
For inference, we perform a distributed brute-force maximum inner product search over the index of training mentions. During this search, we can either return only the top-scoring mention for each entity, which improves entity-based recall, or else all mentions, which allows us to experiment with k-Nearest Neighbors inference (see Section 4.1).
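The first option, returning only the top-scoring mention for each entity, amounts to de-duplicating the retrieved list by entity. A small sketch under that reading follows; the function name and inputs are hypothetical.

```python
def top_entities(scores, index_entities, k):
    """Rank entities by their best-scoring indexed mention and return the top k."""
    best = {}
    for score, entity in zip(scores, index_entities):
        if entity not in best or score > best[entity]:
            best[entity] = float(score)
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

ranked = top_entities([0.9, 0.8, 0.7, 0.1], ["Q1", "Q1", "Q2", "Q3"], k=2)
```

Without this step, several mentions of one frequent entity could occupy the whole top-k list and push other candidate entities out of it.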
Table 1 shows our results on the Mewsli-9 dataset compared to the models described by Botha et al. (2020). Model F is a dual encoder which scores entity mentions against entity descriptions, while Model F+ adds two additional rounds of training with hard negative mining and an auxiliary cross-lingual objective. Despite using an identically sized transformer trained on the same data, moleman outperforms Model F when training only on mention pairs, and sees minimal improvement from a further round of training with hard negatives and resampled positives (as described in Section 3.2.2). This suggests that training moleman is a simpler learning problem than training previous models, which must capture all of an entity’s diverse contexts with a single description embedding. Additionally, we examine a further benefit of indexing multiple mentions per entity, namely the ability to do top-k inference, and find that top-1 accuracy improves by half a point with k=5.
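One way to realize top-k inference over retrieved mentions is a majority vote among the k nearest neighbors. The sketch below assumes that aggregation rule, with ties broken by the best individual score; the text does not specify how the k neighbors are combined, so both choices are assumptions, and the function name is hypothetical.

```python
from collections import defaultdict

def knn_entity(scores, index_entities, k=5):
    """Predict the entity with the most votes among the k highest-scoring mentions."""
    ranked = sorted(zip(scores, index_entities), key=lambda pair: pair[0], reverse=True)[:k]
    votes = defaultdict(int)
    best_score = {}
    for score, entity in ranked:
        votes[entity] += 1
        best_score.setdefault(entity, score)   # ranked is descending, so first seen is best
    # Assumption: ties in vote count are broken by the entity's best single score.
    return max(votes, key=lambda e: (votes[e], best_score[e]))

toy_scores = [0.95, 0.90, 0.85, 0.80, 0.10]
toy_labels = ["Q1", "Q2", "Q2", "Q2", "Q1"]
prediction_k5 = knn_entity(toy_scores, toy_labels, k=5)
prediction_k1 = knn_entity(toy_scores, toy_labels, k=1)
```

In this toy case the single nearest mention belongs to "Q1", but three of the five nearest belong to "Q2", so the two settings disagree.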
We also compare to the recent mGENRE system of De Cao et al. (2021), which performs entity linking using constrained generation of entity names. It should be noted that this work uses an expanded training set that results in fewer zero- and few-shot entities (see De Cao et al., 2021, Table 3).
4.1.1 Per-Language Results
Table 2 shows per-language results for Mewsli-9. A key motivation of Botha et al. (2020) was to learn a massively multilingual entity linking system, with a shared context encoder and entity representations between 104 languages in the Wikipedia corpus. moleman goes a step further: the indexed mentions from all languages are included in the retrieval index, and can contribute to the prediction in any language. In fact, we find that for 21.4% of mentions in the Mewsli-9 corpus, moleman’s top prediction came from a different language.
4.1.2 Frequency Breakdown
Table 3 shows a breakdown in performance by entity frequency bucket, defined as the number of times an entity was mentioned in the Wikipedia training set. When indexing only mentions, moleman can never predict the entities in the 0 bucket, but it shows significant improvement in the other frequency bands, particularly in the “few shot” bucket of [1, 10). This suggests that when introducing new entities to the index, labelling a small number of mentions may be more beneficial than producing a single description. To further confirm this intuition, we retrained moleman with a modified training set which had all entities in the [1, 10) band of Mewsli-9 removed, and only added to the index at inference time. This model achieved +0.2 R@1 and +5.6 R@10 relative to Model F (which was trained with these entities in the train set). When entity descriptions are added to the index, moleman outperforms Model F across frequency bands.
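The bucketing used for this kind of breakdown can be sketched as below. Only the 0 bucket and the [1, 10) bucket are named in the text; the sketch assumes the remaining buckets continue in powers of ten, and the function name is hypothetical.

```python
def frequency_bucket(count):
    """Map a training-set mention count to a bucket label such as "0" or "[1, 10)"."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return "0"
    low = 1
    # Assumption: buckets beyond [1, 10) grow by a factor of ten.
    while count >= low * 10:
        low *= 10
    return f"[{low}, {low * 10})"

buckets = [frequency_bucket(c) for c in (0, 1, 9, 10, 12345)]
```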
4.1.3 Inference Efficiency
Due to the large size of the mention index, nearest neighbor inference is performed using distributed maximum inner-product search. We also experiment with approximate search using ScaNN (Guo et al., 2020). Table 4 shows throughput and recall statistics for brute force search as well as two approximate search approaches that run on a single multi-threaded CPU, showing that inference over such a large index can be made extremely efficient with minimal loss in recall.
4.2 Tsai Roth 2016 Hard
In order to compare against previous multilingual entity linking models, we report results on the “hard” subset of Tsai and Roth’s (2016) cross-lingual dataset, which links 12 languages to English Wikipedia. Table 5 shows our results on the same 4 languages reported by Botha et al. (2020). moleman outperforms all previous systems.
4.3 TACKBP 2010
Recent work on entity linking has employed dual-encoders primarily as a retrieval step before reranking with a more expensive cross-encoder (Wu et al., 2019; Agarwal and Bikel, 2020). Table 6 shows results on the extensively studied TACKBP 2010 dataset (Ji et al., 2010). Wu et al. (2019) used a 24-layer BERT-based dual-encoder which scores the 5.9 million entity descriptions from English Wikipedia, followed by a 24-layer cross-encoder reranker. moleman does not achieve the same level of top-1 accuracy as their full model, as it lacks the expensive cross-encoder reranking step, but despite using a single, much smaller Transformer and indexing the larger set of entities from multilingual Wikipedia, it outperforms this prior work in retrieval recall at 100.
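For reference, retrieval recall at k, as used in comparisons of this kind, is the fraction of queries whose gold entity appears among the top k retrieved entities. A minimal sketch follows, with hypothetical names and toy data.

```python
def recall_at_k(ranked_entities, gold_entities, k):
    """Fraction of queries whose gold entity is among the top-k retrieved entities."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_entities, gold_entities))
    return hits / len(gold_entities)

toy_rankings = [["Q1", "Q2"], ["Q3", "Q4"], ["Q5", "Q6"]]
toy_gold = ["Q1", "Q4", "Q9"]
r_at_1 = recall_at_k(toy_rankings, toy_gold, k=1)
r_at_2 = recall_at_k(toy_rankings, toy_gold, k=2)
```

Recall at 1 coincides with top-1 accuracy, while recall at larger k measures how well the retriever would feed a downstream reranker.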
We also report the accuracy of a moleman model trained only with English training data, and using an English-only index for inference. This experiment shows that although the multilingual index contributes to moleman’s overall performance, the pairwise training data is sufficient for high performance in a monolingual setting.
5 Discussion and Future Work
We have recast the entity linking problem as an application of a more generic mention encoding task. This approach is related to methods which perform clustering on test mentions in order to improve inference (Le and Titov, 2018; Angell et al., 2020), and can also be viewed as a form of cross-document coreference resolution (Rao et al., 2010; Shrimpton et al., 2015; Barhom et al., 2019). We also take inspiration from recent instance-based language modelling approaches (Khandelwal et al., 2020; Lewis et al., 2020b).
Our experiments demonstrate that taking an instance-based approach to entity linking leads to better retrieval performance, particularly on rare entities, for which adding a small number of mentions leads to better performance than a single description. For future work, we would like to explore the application of this instance-based approach to entity knowledge related tasks (Seo et al., 2018; Petroni et al., 2020), and to entity discovery (Ji et al., 2017).
The authors would like to thank Ming-Wei Chang, Livio Baldini-Soares and the anonymous reviewers for their helpful feedback. We also thank Dave Dopson for his extensive help with profiling the brute-force and approximate search inference.
- Abadi et al. (2016). TensorFlow: a system for large-scale machine learning. In OSDI 2016.
- Agarwal and Bikel (2020). Entity linking via dual and cross-attention encoders. arXiv preprint arXiv:2004.03555.
- Angell et al. (2020). Clustering-based inference for zero-shot biomedical entity linking. arXiv preprint arXiv:2010.11253.
- Barhom et al. (2019). Revisiting joint modeling of cross-document entity and event coreference resolution. In ACL 2019.
- Botha et al. (2020). Entity linking in 100 languages. In EMNLP 2020.
- De Cao et al. (2021). Multilingual autoregressive entity linking. arXiv preprint arXiv:2103.12528.
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.
- Févry et al. (2020). Empirical evaluation of pretraining strategies for supervised entity linking. In AKBC 2020.
- Gillick et al. (2019). Learning dense representations for entity retrieval. In CoNLL 2019.
- Gillick et al. (2018). End-to-end retrieval in continuous space. arXiv preprint arXiv:1811.08008.
- Guo et al. (2020). Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning.
- Guu et al. (2020). REALM: retrieval-augmented language model pre-training. In ICML 2020.
- Ji et al. (2010). Overview of the TAC 2010 knowledge base population track. In TAC 2010.
- Ji et al. (2017). Overview of TAC-KBP2017 13 languages entity discovery and linking. In TAC 2017.
- Khandelwal et al. (2020). Generalization through memorization: nearest neighbor language models. In ICLR 2020.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Le and Titov (2018). Improving entity linking by modeling latent relations between mentions. In ACL 2018.
- Lewis et al. (2020a). Pre-training via paraphrasing. In NeurIPS 2020.
- Lewis et al. (2020b). Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS 2020.
- Loshchilov and Hutter (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- Petroni et al. (2020).
- Rao et al. (2010). Streaming cross document entity coreference resolution. In COLING 2010: Posters.
- Seo et al. (2018). Phrase-indexed question answering: a new challenge for scalable document comprehension. In EMNLP 2018.
- Shrimpton et al. (2015). Sampling techniques for streaming cross document coreference resolution. In NAACL 2015.
- Tsai and Roth (2016). Cross-lingual wikification using multilingual embeddings. In ACL 2016.
- Vaswani et al. (2017). Attention is all you need. In NeurIPS 2017.
- Vrandečić and Krötzsch (2014). Wikidata: a free collaborative knowledgebase. Communications of the ACM 57 (10), pp. 78–85.
- Wu et al. (2019). Scalable zero-shot entity linking with dense entity retrieval. In EMNLP 2020.
Appendix A Appendices
A.1 Training setup and hyperparameters
To isolate the impact of representing entities with multiple mention embeddings, we follow the training methodology and hyperparameter choices presented in botha2020entity (Appendix A).
We train moleman using in-batch sampled softmax (Gillick et al., 2018) with a batch size of 8192 for 500k steps, which takes about a day. Our model is implemented in TensorFlow (Abadi et al., 2016), using the Adam optimizer (Kingma and Ba, 2014; Loshchilov and Hutter, 2017), with the mention encoder preinitialized from a multilingual BERT checkpoint (github.com/google-research/bert/multi_cased_L-12_H-768_A-12). All model training was carried out on a Google TPU v3 architecture (cloud.google.com/tpu/docs/tpus).
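The in-batch sampled softmax objective can be illustrated with a short NumPy sketch: each left-hand mention in a batch treats its paired right-hand mention as the positive and every other right-hand mention in the batch as a negative. The sketch assumes unscaled inner-product logits and a single softmax direction; any scaling, normalization or symmetric term used in the actual training setup is not specified here, and the function name is hypothetical.

```python
import numpy as np

def in_batch_softmax_loss(left, right):
    """Mean cross-entropy where row i of `left` should match row i of `right`."""
    logits = left @ right.T                                  # (batch, batch) pairwise scores
    logits = logits - logits.max(axis=1, keepdims=True)      # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))               # positives lie on the diagonal

well_separated = in_batch_softmax_loss(10 * np.eye(2), 10 * np.eye(2))
uninformative = in_batch_softmax_loss(np.zeros((2, 3)), np.zeros((2, 3)))
```

When all encodings are identical the loss equals the log of the batch size, which is the value expected from a uniform guess over the batch.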
A.2 Datasets Links
A.3 Profiling Details
The brute-force numbers we have reported are the theoretical maximum throughput for computing 300D dot-products on an AVX-512 processor running at 2.2 GHz, and are thus an overly optimistic baseline. Practical implementations, such as the one in ScaNN, must also compute the top-k and rarely exceed 70% to 80% of this theoretical limit. The brute-force latency figure is the minimum time to stream the database from RAM using 144 GiB/s of memory bandwidth. In practice, we ran distributed brute-force inference on a large cluster of CPUs, which took about 5 hours.
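The memory-bandwidth bound on brute-force latency is simple arithmetic: index size in bytes divided by bandwidth. The sketch below makes that explicit. The float32 storage (4 bytes per value) is an assumption, as the storage format of the brute-force index is not stated, and the datapoint count is the figure quoted for the ScaNN tree in this appendix; the function name is hypothetical.

```python
def stream_seconds(num_vectors, dim, bytes_per_value, bandwidth_gib_per_s):
    """Minimum time to read every vector once from RAM at the given bandwidth."""
    index_bytes = num_vectors * dim * bytes_per_value
    return index_bytes / (bandwidth_gib_per_s * 2**30)

# Assumption: float32 values; 687.3 million datapoints of 300 dimensions at 144 GiB/s.
latency = stream_seconds(num_vectors=687_300_000, dim=300,
                         bytes_per_value=4, bandwidth_gib_per_s=144)
```

A smaller storage format, such as the int8 quantization mentioned for re-scoring, would reduce this bound proportionally.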
The numbers for ScaNN are empirical single-machine benchmarks of an internal solution that uses the open-source ScaNN library (https://github.com/google-research/google-research/tree/master/scann) on a single 24-core CPU. We use ScaNN to search a multi-level tree that has the following shape: (687.3 million datapoints). We used a combination of several different anisotropic vector quantizations that combine 3, 6, 12, or 24 dimensions per 4-bit code, as well as re-scoring with an int8-quantization.
A.4 Expanded experimental results
[Table: expanded results with column groups “(mentions only)” and “(+ descriptions)”.]