Entity Linking Meets Deep Learning: Techniques and Solutions

Entity linking (EL) is the process of linking entity mentions appearing in web text with their corresponding entities in a knowledge base. EL plays an important role in the fields of knowledge engineering and data mining, underlying a variety of downstream applications such as knowledge base population, content analysis, relation extraction, and question answering. In recent years, deep learning (DL), which has achieved tremendous success in various domains, has also been leveraged in EL methods to surpass traditional machine learning based methods and yield the state-of-the-art performance. In this survey, we present a comprehensive review and analysis of existing DL based EL methods. First of all, we propose a new taxonomy, which organizes existing DL based EL methods using three axes: embedding, feature, and algorithm. Then we systematically survey the representative EL methods along the three axes of the taxonomy. Later, we introduce ten commonly used EL data sets and give a quantitative performance analysis of DL based EL methods over these data sets. Finally, we discuss the remaining limitations of existing methods and highlight some promising future directions.


1 Introduction

The rapidly growing web contains an overwhelming amount of information and knowledge, and natural language is one of its most important forms. However, natural language is ambiguous, especially with respect to the named entities that appear in it frequently. In addition, with the development of information extraction (IE) techniques, a growing number of high-quality, large-scale, and machine-readable knowledge bases (KBs) have been constructed recently, such as YAGO [suchanek2007yago], DBpedia [auer2007dbpedia], Freebase [bollacker2008freebase], and Probase [wu2012probase]. These KBs contain tens of millions of named entities and billions of relational facts between named entities. Bridging web text data and KBs is very helpful for understanding the ambiguous natural language on the web and enriching the existing KBs. To achieve this goal, entity linking (EL) is a fundamental task which needs to be solved.

What is entity linking? Entity linking is the task of linking entity mentions appearing in web text with their corresponding entities in a KB [shen2014entity]. Figure 1 illustrates the entity linking task. It acts as an important pre-processing step for many downstream applications, such as question answering [zhang2016joint], relation extraction [lin2016neural], knowledge base population (KBP) [ji2011knowledge], and content analysis [michelson2010discovering]. For example, KBP is the task of enriching existing KBs with newly extracted facts from the text. Before the enrichment, an EL system is needed to map entity mentions associated with facts to their corresponding named entities in a KB.

Entity linking is challenging due to the name variation and entity ambiguity. On one hand, a named entity may have many different surface forms (e.g., full name, partial name, nickname, alias, and abbreviation). As the example shown in Figure 1, entity mentions “NYC” and “Big Apple” are the abbreviation and nickname of the named entity “New York City”, respectively. On the other hand, an entity mention may refer to many different named entities. For the example in Figure 1, the entity mention “MJ” may refer to an American professional basketball player named “Michael Jordan”, an American recording artist named “Michael Jackson”, an American singer named “Mj Rodriguez”, or many other named entities which could be referred to as “MJ”.

Traditional machine learning based entity linking. Many traditional machine learning (ML) based EL methods have been comprehensively reviewed in a survey [shen2014entity]. These ML based EL methods mainly leverage manually designed features of entity popularity, local context compatibility, and document-level global coherence of referring entities. Their entity ranking techniques can be broadly divided into two categories: unsupervised and supervised ranking methods. Unsupervised ranking methods include vector space model based methods [cucerzan2007large] and information retrieval based methods [varma2009iiit]. Supervised ranking methods mainly include binary classification methods [chen2011collaborative], learning to rank methods [ratinov2011local], probabilistic methods [shen2017shine+], and graph-based methods [hoffart2011robust].

Fig. 1: An illustration for the task of entity linking. Entity mentions detected from web text are underlined; Candidate mapping entities in a knowledge base for each entity mention are shown via a dashed arrow line and circle; Their correct mapping entities (a.k.a. gold mapping entities) are in boldface.

Most traditional ML based EL methods follow the popular two-step procedure, which firstly extracts some hand-crafted features and then feeds those features to an entity ranking method to make the final linking predictions. However, these methods have two limitations: (1) features that lead to good performance require a lot of careful and tedious feature engineering; (2) generalizing the trained entity linking model to other KBs or domains is difficult due to the strong dependence on the specific KB and domain knowledge in the process of designing features.

Deep learning based entity linking. With the success of deep learning in various domains, deep learning (DL) based EL methods have been proposed and attracted significant attention recently [sevgili2020neural]. Compared with traditional machine learning, deep learning can automatically learn important features and is more transferable from one domain to another. For example, DL based EL methods are able to obtain vector representations of the words via a pre-trained language model such as Word2Vec [mikolov2013efficient, mikolov2013distributed] and GloVe [pennington2014glove]. This neural encoding can be directly used as semantic and syntactic features, which are then fed into neural network architectures, such as recurrent neural networks (RNNs) [hochreiter1997long, cho2014learning], convolutional neural networks (CNNs) [lecun1998gradient], attention [bahdanau2015neural], and Transformers [vaswani2017attention] to capture more sophisticated long-distance feature representations.

Motivations. In the past few years, DL based EL methods with minimal feature engineering have been flourishing as shown in Figure 2. Deep learning has become the mainstream framework for the entity linking methods that achieve the state-of-the-art performance. In the previous comprehensive survey [shen2014entity] of entity linking published in 2015, we reviewed and analyzed a variety of traditional ML based EL systems, which are mainly based on manually designed features without the use of deep learning. This prompts us to write this survey to systematically review and summarize the current status of deep learning techniques for entity linking. Our intended audience includes researchers, developers, and practitioners in related fields who are interested in entity linking. We hope that this survey paper will help them get a clear overall picture of current DL based EL methods and guide them to the right way to start with entity linking research and practice.

Contributions. Overall, the contributions of this survey can be summarized as follows.

  • A comprehensive review. We conduct a comprehensive survey to present a thorough overview and analysis of DL based EL methods.

  • A new taxonomy. We propose a new taxonomy, which organizes the techniques for ranking candidate entities in existing DL based EL methods using three axes: (1) embedding; (2) feature; (3) algorithm.

  • A quantitative analysis. We present a quantitative analysis on the performance of the DL based EL methods on a variety of data sets.

  • Some future directions. We discuss the remaining limitations of existing entity linking methods and point out possible future directions.

The remainder of this survey is organized as follows. We first give a formal formulation of the entity linking task in Section 2. Then we explain our new taxonomy of DL based EL methods in Section 3. We present a comprehensive survey along the three axes of the new taxonomy (i.e., embedding, feature, and algorithm) in Sections 4, 5, and 6 respectively. Section 7 introduces ten commonly used entity linking data sets as well as evaluation metrics, and presents a quantitative performance analysis of the DL based EL methods. Section 8 discusses the main limitations and future directions, and in Section 9 we finally give a conclusion.

Fig. 2: Statistics of DL based EL publications in peer-reviewed venues.

2 Problem Overview

An entity mention is a token sequence in text which potentially refers to some named entity. A named entity is a word or a phrase that is explicitly defined by KBs with a unique identifier, and it could be a real-world object (e.g., organization and individual) or an abstract concept (e.g., song and event). Most of the existing EL works assume that entity mention boundaries are provided by users or mention detection systems, such as Named Entity Recognition (NER) tools, which can identify the boundaries of named entities in text automatically. Then we formally state the task of EL as follows.

Definition (Entity Linking).

Given a document $D$ containing a set of recognized entity mentions $M = \{m_1, m_2, \ldots, m_{|M|}\}$ and a target KB containing a set of named entities $E = \{e_1, e_2, \ldots, e_{|E|}\}$, the goal is to map each entity mention $m_i \in M$ to its gold mapping entity $e \in E$.

It is possible that some entity mention’s gold mapping entity does not exist in the target KB, which is called “unlinkable entity mention”. A special label NIL is usually used to describe this kind of unlinkable entity mention.

As surveyed in [shen2014entity], a typical EL system often consists of three stages: (1) the candidate entity generation stage generates a candidate mapping entity set for each entity mention using a name dictionary and web information; (2) the candidate entity ranking stage leverages different kinds of evidence to rank the candidate entities; (3) the unlinkable mention prediction stage validates whether the entity mention should be labeled as NIL. As the techniques used in the first and third stages have not changed much in recent years, in this survey we mainly concentrate on the candidate entity ranking stage and discuss how different neural architectures and features help to rank entities. For the technical details of approaches used in the first and third stages, readers can refer to our previous survey [shen2014entity].
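The three stages described above can be sketched in a few lines of Python. This is only an illustrative toy, not any surveyed system: the name dictionary `NAME_DICT`, its prior scores, the `context_scores` argument, and the `nil_threshold` value are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical name dictionary: surface form -> {entity id: prior score}.
# Entries and scores are made up for illustration.
NAME_DICT: Dict[str, Dict[str, float]] = {
    "mj": {"Michael_Jordan": 0.6, "Michael_Jackson": 0.3, "Mj_Rodriguez": 0.1},
    "nyc": {"New_York_City": 1.0},
}


def generate_candidates(mention: str) -> Dict[str, float]:
    """Stage 1: look up the candidate entities of a mention."""
    return NAME_DICT.get(mention.lower(), {})


def rank_candidates(
    candidates: Dict[str, float], context_scores: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Stage 2: add a (hypothetical) context score to the prior and sort."""
    scored = [
        (entity, prior + context_scores.get(entity, 0.0))
        for entity, prior in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def link(
    mention: str, context_scores: Dict[str, float], nil_threshold: float = 0.5
) -> Optional[str]:
    """Stage 3: return the top entity, or None (NIL) when nothing is
    confident enough. The threshold rule is an assumption."""
    ranked = rank_candidates(generate_candidates(mention), context_scores)
    if not ranked or ranked[0][1] < nil_threshold:
        return None
    return ranked[0][0]
```

For instance, `link("MJ", {"Michael_Jackson": 0.5})` prefers the singer once the context evidence is added, while a mention missing from the dictionary yields `None`, standing in for the NIL label.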

3 A Taxonomy of DL Based EL Methods

In this section, we first introduce basic concepts of DL based EL methods and explain why EL methods can benefit from DL. Next, we give a brief introduction of the newly proposed taxonomy.

3.1 Why Choose Deep Learning?

DL is a class of artificial intelligence techniques that imitate the working mechanism of the human brain in processing information and creating patterns. It utilizes a cascade of multiple layers of non-linear processing units (i.e., neurons) for feature extraction and transformation by adjusting the connection weights between units, which can be regarded as resembling the learning behavior of a human brain. DL models have been applied widely and successfully to numerous tasks in the fields of computer vision (CV) and natural language processing (NLP) since the DL based model AlexNet [krizhevsky2012imagenet] won the ImageNet competition by a large margin in 2012.

The most basic model in DL corresponds to a fully connected layer [mudgal2018deep]. It takes vectors $x_1, x_2, \ldots, x_n$ from the input layer as input, and outputs a value $y = f(\sum_{i=1}^{n} w_i x_i + b)$ by a non-linear activation function $f$ at the output layer, where $w_i$ are weights, $b$ is the bias, and the common choice of $f$ is the tanh function, sigmoid function, or ReLU function [glorot2011deep]. A multi-layer neural network is simply a generalization of this basic model where neural network layers are stacked in sequence. A multi-layer neural network is also called a multi-layer perceptron (MLP).
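As a minimal sketch of the fully connected layer and its stacked form, the following pure-Python code computes a weighted sum plus bias for each output unit and applies an activation function. The function names `dense` and `mlp` and the example weights are illustrative assumptions, not taken from any cited work.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector, Callable[[float], float]]


def relu(z: float) -> float:
    """Rectified linear unit."""
    return max(0.0, z)


def dense(x: Vector, weights: List[Vector], bias: Vector,
          f: Callable[[float], float]) -> Vector:
    """One fully connected layer: each output is f(sum_i w_i * x_i + b)."""
    return [
        f(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def mlp(x: Vector, layers: Sequence[Layer]) -> Vector:
    """A multi-layer perceptron: layers applied one after another."""
    for weights, bias, f in layers:
        x = dense(x, weights, bias, f)
    return x


# Two inputs -> two hidden ReLU units -> one tanh output (toy weights).
example_layers: List[Layer] = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),
    ([[2.0, 1.0]], [0.5], math.tanh),
]
output = mlp([1.0, 2.0], example_layers)
```

In this toy setting the hidden layer produces `[0.0, 1.5]`, so the single output unit evaluates `tanh(2.0)`.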

Compared with traditional ML, DL has three main strengths in solving the EL task. First and foremost, DL can utilize the given documents and KBs to automatically discover multiple levels of distributed representations to help disambiguation without human intervention [bengio2013deep], while traditional ML based methods require considerable feature engineering and analysis. Second, DL is more transferable, which means that deep neural networks can learn more transferable representations that disentangle the explanatory factors of variation underlying the data samples and group features hierarchically in accordance with their relatedness to invariant factors [wang2018deep]. Finally, DL is able to learn feature representations and perform classification or regression in an end-to-end style, which makes more complex and advanced EL methods possible. Along with these advantages of DL, a lot of DL based EL methods have been proposed in recent years and achieved the state-of-the-art performance.

3.2 Basic Structure of Our Taxonomy

The previous survey [shen2014entity] roughly divides the candidate entity ranking methods into two categories: supervised ranking methods and unsupervised ranking methods. Different from the previous survey, we propose a new taxonomy of the techniques in the candidate entity ranking stage of DL based EL methods, which contains three steps as follows:

  • Embedding. Embedding can use low-dimensional and dense vectors to implicitly represent the semantic and syntactic properties of natural language, which is also called distributed representation. Generally, DL models need embeddings as input [collobert2011natural]. Thus, a key point in applying DL to EL is the embeddings of different inputs (e.g., mention surface form, entity context, and entity description). Lots of embedding techniques have been utilized by existing DL based EL methods, and we divide them into four categories: (1) word embedding; (2) mention embedding; (3) entity embedding; (4) alignment embedding.

  • Feature. Features intend to measure the similarity between the entity mention and the candidate entity in different aspects in the form of a vector or a score. According to the previous survey [shen2014entity], almost all traditional EL methods are based on quantitative features of mention-entity popularity, mention-entity similarity, and entity-entity topical coherence. These features are still widely utilized by researchers until now, and they can benefit from DL according to the following two aspects: (1) features can be directly learned by neural models without manual feature engineering. For example, Wu et al. [wu2020dynamic] learned an entity-entity topical coherence feature vector for each candidate entity automatically via a dynamic Graph Convolutional Network architecture; (2) embeddings introduced above can be leveraged to generate diverse features. For example, unlike traditional methods preferring to represent the context as a bag-of-words, many DL methods could leverage the embedding of text to calculate the similarity between mention context and entity description as a context similarity feature. In this survey, we concentrate on the following five principal features for entity linking: (1) prior popularity; (2) surface form similarity; (3) type similarity; (4) context similarity; (5) topical coherence.

  • Algorithm. Algorithm is the final step in the candidate entity ranking stage. It takes features as input and outputs the final entity linking result. The commonly used algorithms in DL based EL methods include MLP, PageRank, graph regularization, and reinforcement learning. The output of the algorithm is a score list of all candidate entities for each entity mention. We summarize such algorithms into three groups, namely, MLP, graph-based algorithms, and reinforcement learning (RL).

Although the implementation details of DL based EL methods may vary considerably, their general steps for ranking entities are very similar. Given an entity mention in a document and a set of candidate entities, various embeddings are generated at first. Then features are calculated using the embeddings. Finally, the features are fed into algorithms to rank the candidate entities and get the final linking result. Note that these three steps are sequential, which means the output of each step is used as the input of the next. In the following, we review the main techniques used in these three steps in detail. Table I summarizes the step choices of DL based EL methods according to the proposed taxonomy.
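The embedding, feature, and algorithm steps can be illustrated with a small sketch. It assumes toy two-dimensional vectors, uses only two of the five features (prior popularity and context similarity), and replaces the ranking algorithm with a plain linear combination; all names and weights are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def average(vectors: Sequence[Vector], dim: int) -> Vector:
    """Mean of a list of vectors; a zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_entities(
    context_words: Sequence[str],
    candidates: Sequence[str],
    word_vecs: Dict[str, Vector],
    entity_vecs: Dict[str, Vector],
    priors: Dict[str, float],
    weights: Tuple[float, float] = (0.5, 0.5),
) -> List[Tuple[str, float]]:
    dim = len(next(iter(entity_vecs.values())))
    # Step 1 (embedding): embed the mention context by averaging word vectors.
    known = [word_vecs[w] for w in context_words if w in word_vecs]
    context_vec = average(known, dim)
    scored = []
    for entity in candidates:
        # Step 2 (feature): prior popularity and context similarity.
        features = [
            priors.get(entity, 0.0),
            cosine(context_vec, entity_vecs[entity]),
        ]
        # Step 3 (algorithm): a linear combination stands in for the ranker.
        score = sum(w * f for w, f in zip(weights, features))
        scored.append((entity, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Swapping the linear combination for an MLP, a graph-based method, or an RL policy changes only the third step, which is why the three steps can be surveyed separately.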

Model | Word embedding | Entity embedding | Algorithms
(Columns Mention, Alignment, PP, SFS, TS, CS, and TC: entries not preserved.)
He et al. (ACL 2013) [he2013learning] |  | description | -
Sun et al. (IJCAI 2015) [sun2015modeling] | learned | surface form, type | -
DSRM (arXiv 2015) [huang2015leveraging] |  | description, context, type | graph-based
Globerson et al. (ACL 2016) [globerson2016collective] |  |  | -
Zwicklbauer et al. (SIGIR 2016) [zwicklbauer2016robust] | pre-trained | description, context | graph-based
EDKate (CoNLL 2016) [fang2016entity] | learned | context | -
Yamada et al. (CoNLL 2016) [yamada2016joint] | learned | context | -
Francis-Landau et al. (NAACL 2016) [francis2016capturing] | learned | surface form, description | -
Nguyen et al. (COLING 2016) [nguyen2016joint] | learned | surface form, description | -
Cao et al. (ACL 2017) [cao2017bridge] | learned | context | -
Gupta et al. (EMNLP 2017) [gupta2017entity] | pre-trained | description, type | -
Deep-ED (EMNLP 2017) [ganea2017deep] | pre-trained | description, context | MLP
NeuPL (CIKM 2017) [phan2017neupl] | learned | description, context | -
Eshel et al. (CoNLL 2017) [eshel2017named] | learned | context | MLP
MR-Deep-ED (ACL 2018) [le2018improving] | pre-trained | description, context | MLP
Moon et al. (ACL 2018) [moon2018multimodal] | pre-trained | context | -
Sil et al. (AAAI 2018) [sil2017neural] | learned | description | MLP
DeepType (AAAI 2018) [raiman2018deeptype] |  |  | -
Mueller and Durrett (EMNLP 2018) [mueller2018effective] | learned | context | MLP
Kolitsas et al. (CoNLL 2018) [kolitsas2018end] | pre-trained | description, context | MLP
SGTB-BiBSG (NAACL 2018) [yang2018collective] | pre-trained | description, context | -
NCEL (COLING 2018) [cao2018neural] | learned | context | MLP
Le and Titov (ACL 2019) [le2019distant] | pre-trained | type | MLP
Le and Titov (ACL 2019) [le2019boosting] | pre-trained | description, context | MLP
Logeswaran et al. (ACL 2019) [logeswaran2019zero] |  |  | -
Sevgili et al. (ACL 2019) [sevgili2019improving] |  | description, context | MLP
RRWEL (IJCAI 2019) [xue2019neural] | learned | surface form, description | graph-based
RLEL (WWW 2019) [fang2019joint] | pre-trained | description, context | RL
DCA (EMNLP 2019) [yang2019learning] | pre-trained | surface form, description, context | MLP, RL
Gillick et al. (CoNLL 2019) [gillick2019learning] | pre-trained | description | MLP
E-ELMo (arXiv 2019) [shahbazi2019entity] | learned | context | MLP
FGS2EE (ACL 2020) [hou2020improving] | pre-trained | description, context, type | MLP
ET4EL (AAAI 2020) [onoe2020fine] | learned |  | -
Chen et al. (AAAI 2020) [chen2020improving] | pre-trained | description, context, type | MLP
REL (SIGIR 2020) [van2020rel] | learned | context | MLP
SeqGAT (WWW 2020) [fang2020high] |  | description | MLP
DGCN (WWW 2020) [wu2020dynamic] |  | description, context, type | MLP
BLINK (EMNLP 2020) [wu2020scalable] |  | description | -
ELQ (EMNLP 2020) [li2020efficient] |  | description | -
GNED (KBS 2020) [hu2020graph] | pre-trained | description, context | MLP
JMEL (ECIR 2020) [adjali2020ecir] | learned |  | MLP
Yamada et al. (arXiv 2020) [yamada2020global] |  | context | -
M3 (AAAI 2021) [gu2021read] |  |  | -
Bi-MPR (AAAI 2021) [tang2021bidirectional] |  | description | MLP
Chen et al. (AAAI 2021) [chen2021lightweight] | learned | surface form | MLP
CHOLAN (EACL 2021) [ravi2021cholan] |  |  | -
Zhang et al. (DASFAA 2021) [zhang2021attention] |  | description | -
TABLE I: Summary of the step choices of DL based EL methods according to our proposed taxonomy. Due to the limited space, “PP” refers to the prior popularity feature, “SFS” refers to the surface form similarity feature, “TS” refers to the type similarity feature, “CS” refers to the context similarity feature, and “TC” refers to the topical coherence feature. “-” in the column of Algorithms means the corresponding model uses a relatively simple algorithm to select the mapping entity, such as a linear combination of features.

4 Embedding

The core idea of embedding is to represent the meaning of a piece of natural language by using a low-dimensional real-valued dense vector. Based on the given documents and KBs, diverse embedding techniques could be leveraged to capture the linguistic patterns and common sense knowledge in text, such as semantic roles, syntactic structures, and lexical meanings. In this section, we introduce four kinds of embeddings used in DL based EL methods: (1) word embedding; (2) mention embedding; (3) entity embedding; (4) alignment embedding. Considering the strong correlation between context embedding and context similarity feature, we will introduce the context embedding in Section 5.4.1.

4.1 Word Embedding

A word embedding can map words from a vocabulary to vectors of real numbers to represent their meaning. In this subsection, we classify word embeddings used by most EL works into two categories: (1) pre-trained; (2) learned.

4.1.1 Pre-trained

Pre-trained language models can learn widely applicable embeddings of words based on their co-occurrences and neighborhoods in a large quantity of text corpora. They learn word embeddings in advance and store them in lookup tables. Lots of DL based EL methods directly use a pre-trained (fixed) lookup table to get word embeddings. Specifically, EL methods [zwicklbauer2016robust, ganea2017deep, kolitsas2018end, yang2018collective, yang2019learning, le2019boosting, hou2020improving, chen2020improving, hu2020graph] utilized a Word2Vec [mikolov2013efficient] lookup table to get word embeddings. Additionally, several DL based EL methods [gupta2017entity, le2018improving, le2019distant, le2019boosting, gillick2019learning, fang2019joint, moon2018multimodal] exploited a GloVe [pennington2014glove] lookup table as the word embedding source.

4.1.2 Learned

Pre-trained embeddings may not be suitable for domain-specific data sets which contain string tokens with highly specialized semantics. In this case, many DL based EL methods learn domain-specific word embeddings themselves. DL based EL methods [cao2017bridge, cao2018neural, yamada2016joint, sun2015modeling, van2020rel, sil2017neural, francis2016capturing, phan2018pair, phan2017neupl, eshel2017named, xue2019neural, mueller2018effective, nguyen2016joint, fang2016entity] learned word embeddings via Word2Vec on large corpora such as Wikipedia. Word2Vec comprises the continuous bag-of-words (CBOW) model and the skip-gram (SG) model [mikolov2013efficient]. In practice, EL methods [yamada2016joint, cao2017bridge, cao2018neural, sun2015modeling, van2020rel, phan2018pair, phan2017neupl, eshel2017named, mueller2018effective] prefer to use the SG model for training. Additionally, Shahbazi et al. [shahbazi2019entity] utilized ELMo [peters2018deep] to learn word embeddings. ELMo is a large-scale context-sensitive language model that uses bi-directional long short-term memory (Bi-LSTM) [lample2016neural] in forward and backward directions to encode each word depending on its context. Adjali et al. [adjali2020ecir] used Sent2Vec [pagliardini2018unsupervised], which extends the CBOW model, to learn representations of words.
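To make the difference between the two Word2Vec variants concrete, the sketch below builds the training examples each one consumes: SG predicts every context word from the target word, whereas CBOW predicts the target word from its whole context. Only example construction is shown, not the training itself, and the symmetric `window` parameter is an illustrative assumption.

```python
from typing import List, Sequence, Tuple


def skipgram_pairs(tokens: Sequence[str], window: int = 2) -> List[Tuple[str, str]]:
    """SG examples: one (target, context_word) pair per neighbour."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens: Sequence[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW examples: one (context_words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, target))
    return examples
```

With `tokens = ["a", "b", "c"]` and `window=1`, SG yields four pairs while CBOW yields three examples, one per position.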

Some word embeddings aim to obtain numeric representations of words by leveraging their character-level information. The main idea is that words are made of morphemes, i.e., meaningful sequences of characters with different lengths. Injecting character-level information into the final word embedding can effectively handle the issue of out-of-vocabulary (OOV) words [kim2016character]. For example, OOV words often occur due to misspellings, and substrings of the word can be leveraged to approximate its embedding. Mueller and Durrett [mueller2018effective] and Onoe and Durrett [onoe2020fine] utilized CNNs [lecun1998gradient] to generate character-level representations of words, which are concatenated with the learned word embedding; this helps their models better recognize character-level correspondences between the entity mention and the candidate entity. Kolitsas et al. [kolitsas2018end] and Chen et al. [chen2021lightweight] utilized Bi-LSTM to capture significant character-level lexical information. Both of them generated the character-dependent embedding of a word by concatenating the forward hidden state of its last character, the backward hidden state of its first character, and its pre-trained word embedding.

Generally, word embedding cannot represent the order information of words. Intuitively, the position-dependent signal can reinforce semantics by incorporating the order information. Some EL systems add the positional embedding to the word embedding to explicitly encode the relative/absolute positions of words as vectors. Sun et al. [sun2015modeling] and Le and Titov [le2019distant] leveraged a positional embedding that was modeled by the distance between the word and the entity mention in a given piece of text. Xue et al. [xue2019neural] followed Vaswani et al. [vaswani2017attention] to define the positional embedding of each word in the entity mention. Specifically, they took the position index of each word as input and utilized a pre-defined sinusoidal function of the position index as its positional encoding.
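The sinusoidal positional encoding of Vaswani et al. can be written compactly: even dimensions use a sine and odd dimensions a cosine of the position index scaled by a geometric progression of wavelengths. The sketch below follows that published formula; the function name and the default `base` argument are illustrative.

```python
import math
from typing import List


def positional_encoding(pos: int, dim: int, base: float = 10000.0) -> List[float]:
    """Sinusoidal encoding of one position index.

    PE(pos, 2i)   = sin(pos / base^(2i / dim))
    PE(pos, 2i+1) = cos(pos / base^(2i / dim))
    """
    encoding = []
    for k in range(dim):
        # Dimensions 2i and 2i+1 share the same frequency.
        angle = pos / (base ** ((2 * (k // 2)) / dim))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding


# Encodings for a three-word entity mention with 4-dimensional embeddings;
# each would be added to the word embedding at the same position.
mention_encodings = [positional_encoding(p, 4) for p in range(3)]
```

Because the function is fixed rather than learned, it introduces no additional parameters.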

4.2 Mention Embedding

Mention embedding is a learned representation for an entity mention. It is usually used to model the similarity between the entity mention and the candidate entity. For some EL methods [sun2015modeling, wu2020dynamic], it is simply obtained by averaging the embeddings of words which compose the entity mention.

Additionally, several DL based EL models utilize CNNs to generate vectors of entity mentions. Specifically, each word in the entity mention is encoded into a word embedding using a pre-trained Word2Vec lookup table, and thus a sequence of vectors $w_{1:n}$ is obtained, where $n$ is the word length of the entity mention. Francis-Landau et al. [francis2016capturing] mapped the vector sequence into a fixed-size vector using a CNN parameterized with a filter bank $M \in \mathbb{R}^{k \times dl}$, where $k$ is the number of filter maps, $d$ is the dimension of the word embeddings, and $l$ is the width of the convolution. Then they put the result through a ReLU function and combined the results with sum-pooling, based on the following formulation:

$$\mathbf{v}_m = \sum_{j=1}^{n-l} \max\{0, M w_{j:j+l}\} \qquad (1)$$

where $w_{j:j+l}$ is a concatenation of word embeddings and the $\max$ is element-wise. Some other EL methods [nguyen2016joint, yang2019learning, xue2019neural, chen2021lightweight] also utilized a similar process to obtain the mention embedding.
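The convolution, ReLU, and sum-pooling steps can be sketched without any deep learning library. In the code below, `filters` plays the role of the filter bank (one row per filter map, each of length word dimension times convolution width) and the loop slides over every full window of `width` consecutive words; the exact index range and all names are assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def conv_sum_pool(word_vecs: List[Vector], filters: List[Vector], width: int) -> Vector:
    """Map a variable-length sequence of word vectors to a fixed-size vector.

    For each window of `width` consecutive word vectors, the vectors are
    concatenated, multiplied by every filter, passed through an element-wise
    ReLU, and the results are summed over all windows (sum-pooling).
    """
    pooled = [0.0] * len(filters)
    for j in range(len(word_vecs) - width + 1):
        # Concatenate the word vectors inside the current window.
        window = [x for vec in word_vecs[j:j + width] for x in vec]
        for r, row in enumerate(filters):
            activation = sum(w * x for w, x in zip(row, window))
            pooled[r] += max(0.0, activation)  # ReLU, then accumulate
    return pooled


# Three 1-dimensional word vectors, two filter maps, convolution width 2.
mention_vec = conv_sum_pool(
    [[1.0], [-2.0], [3.0]],
    [[1.0, 1.0], [1.0, -1.0]],
    width=2,
)
```

The output size depends only on the number of filter maps, so mentions of different lengths end up in the same vector space.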

Moreover, due to the high ambiguity of entity mentions, Cao et al. [cao2017bridge, cao2018neural] proposed an embedding model, which can learn multiple sense embeddings for each entity mention to denote its different meanings. Specifically, each entity mention is first mapped to a set of shared mention senses according to a pre-defined dictionary. To train the mention sense embedding, they combined each mention sense and its context to predict the corresponding named entity by extending the CBOW model.

Kolitsas et al. [kolitsas2018end] defined a “soft head” embedding to capture key components of the entity mention, which is built using an attention mechanism on top of the entity mention’s word embeddings. They concatenated the embeddings of the first, last, and the “soft head” words of the entity mention, and then leveraged a shallow feedforward neural network (FFNN) to produce the representation of the entity mention as follows:

$$\mathbf{x}^{m} = \mathrm{FFNN}([\mathbf{x}_{first}; \mathbf{x}_{last}; \mathbf{x}_{head}]) \qquad (2)$$

where $\mathbf{x}_{first}$, $\mathbf{x}_{last}$, and $\mathbf{x}_{head}$ are the representations of the first, last, and “soft head” words respectively.
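A rough sketch of this construction follows. It assumes that the attention scores are dot products between each word embedding and a learned vector `att_vector`, and it lets a single linear layer stand in for the shallow FFNN; both choices are simplifying assumptions rather than details confirmed by the text above.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_head(word_vecs: List[Vector], att_vector: Vector) -> Vector:
    """Attention-weighted average of the mention's word embeddings."""
    scores = [sum(a * x for a, x in zip(att_vector, vec)) for vec in word_vecs]
    weights = softmax(scores)
    dim = len(word_vecs[0])
    return [sum(w * vec[i] for w, vec in zip(weights, word_vecs)) for i in range(dim)]


def mention_embedding(word_vecs: List[Vector], att_vector: Vector,
                      weights: List[Vector], bias: Vector) -> Vector:
    """Concatenate first, last, and soft-head vectors, then project."""
    concat = word_vecs[0] + word_vecs[-1] + soft_head(word_vecs, att_vector)
    return [
        sum(w * x for w, x in zip(row, concat)) + b
        for row, b in zip(weights, bias)
    ]
```

With a zero attention vector all words receive equal weight, so the soft head reduces to a plain average; a trained attention vector would instead emphasize the head word of the mention.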

Wu et al. [wu2020dynamic] took the entity mention together with the text where it appears as input and leveraged ELMo [peters2018deep] to learn a contextualized representation for the entity mention. Moreover, Onoe and Durrett [onoe2020fine] fed the representation of the entity mention learned by ELMo into a Bi-LSTM encoder followed by a span attention layer to get the final mention embedding.

Li et al. [li2020efficient] and Yamada et al. [yamada2020global] leveraged the pre-trained language model BERT [devlin2019bert] to encode the entity mention. When training the BERT model, the masked language model and the next sentence prediction model are trained together, with the goal of minimizing the combined loss function of the two strategies. The masked language model randomly masks some tokens in text, and then independently recovers these masked tokens by conditioning on the encoding vectors obtained via a bi-directional Transformer. The next sentence prediction model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document. Li et al. [li2020efficient] took the entity mention with its context as a sequence of words as input to BERT and the outputs of tokens of the last layer were selected as the contextualized embeddings of words. They averaged the output embeddings of the entity mention words to generate the mention embedding. Yamada et al. [yamada2020global] leveraged BERT to encode an input sequence consisting of words in the document and masked entity tokens corresponding to the entity mention. For each entity mention, taking the BERT output vector of its corresponding masked entity as input, the mention embedding is learned via a single-layer perceptron.

4.3 Entity Embedding

Entity embedding, which represents each entity in a KB in a continuous vector space, is important for DL based EL. It is helpful for capturing the entity-entity topical coherence feature (described in Section 5.5) by mapping similar entities close to each other in the same space. Entity embedding is also used as an important raw material to calculate the context similarity feature, which will be introduced in Section 5.4. KBs like Wikipedia provide lots of structured and textual data for learning entity embeddings, such as surface form, entity description, entity context, and type information. Some EL methods leverage one type of them, while others combine several types to learn the entity embedding.

4.3.1 Surface Form

The most direct way to get entity embeddings is using entity surface forms to learn lexical representations. A surface form consists of a word or a sequence of words like “Paris” or “New York City”. It is usually used as the matching sequence to locate the corresponding candidate entity due to its uniqueness in the reference KB [charton2014improving]. Sun et al. [sun2015modeling] obtained the entity surface form embedding by averaging the embeddings of words which compose the entity surface form based on learned word embeddings. Several EL methods [francis2016capturing, nguyen2016joint, yang2019learning, xue2019neural] used CNNs to convert each entity surface form into a vector, in the same way as CNNs are used to generate mention embeddings in Section 4.2. Moreover, Chen et al. [chen2021lightweight] ran a Bi-LSTM on the entity surface form to get a representation for each entity.

4.3.2 Entity Description

Entity description is a piece of text that introduces useful background information about the entity. It usually provides a concise summary of salient properties for the entity. For instance, in the description page of the entity “Los Angeles Lakers” in Wikipedia, much useful information such as players, championships, and history can be found. Many EL works consider it a good resource for constructing entity embeddings, and take the representation of the entity description as important auxiliary information to reinforce the semantic information of entities. The zero-shot setting has also attracted a surge of interest in the field of EL [logeswaran2019zero, wu2020scalable, li2020efficient, tang2021bidirectional], where each named entity is defined only by its entity description, while other information such as type information and relations between entities is absent. Thus, it is essential to design a suitable method to learn the representation of the entity description to obtain the entity embedding.

He et al. [he2013learning] built stacked denoising auto-encoders (DA) [vincent2008extracting] to encode the entity description. A stacked DA is able to capture general concepts of the input and ignore noise. It takes a one-hot vector of the entity description as input and outputs a learned representation.

Huang et al. [huang2015leveraging] first represented each entity description as a bag-of-words, which is then transformed by a word hashing layer into letter tri-gram vectors. They applied a deep neural network (DNN) on these tri-gram vectors to learn useful semantic entity embedding based on the entity description.

Sil et al. [sil2017neural] computed a weighted average of embeddings of all words in the entity description by using the inverse document frequency of each word as the weight. They further applied a fully connected tanh activation layer to this obtained embedding to get the final entity description embedding.
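
An IDF-weighted average followed by a tanh layer can be sketched as follows. This is only an illustration of the general idea: the smoothed IDF variant, the function names, the toy corpus, and the identity weight matrix are all assumptions and do not reproduce the configuration of Sil et al.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def inverse_document_frequency(term: str, corpus: Sequence[Sequence[str]]) -> float:
    """Smoothed IDF, log((1 + N) / (1 + df)) + 1 (one common variant, assumed here)."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + doc_freq)) + 1.0


def idf_weighted_average(tokens: Sequence[str],
                         word_vectors: Dict[str, Vector],
                         corpus: Sequence[Sequence[str]]) -> Vector:
    """Average the word vectors, weighting each word by its IDF."""
    known = [t for t in tokens if t in word_vectors]
    weights = [inverse_document_frequency(t, corpus) for t in known]
    total = sum(weights)
    dim = len(word_vectors[known[0]])
    return [sum(w * word_vectors[t][i] for w, t in zip(weights, known)) / total
            for i in range(dim)]


def tanh_layer(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Fully connected layer with tanh activation: tanh(W x + b)."""
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weight, bias)]


# Toy data (hypothetical): "lakers" is rarer than "team", so it gets a larger weight.
CORPUS = [["lakers", "team"], ["team", "city"], ["team"]]
WORD_VECTORS = {"lakers": [1.0, 0.0], "team": [0.0, 1.0]}

pooled = idf_weighted_average(["lakers", "team"], WORD_VECTORS, CORPUS)
description_vec = tanh_layer(pooled, weight=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
```

The rare word dominates the pooled vector, which is the intended effect of IDF weighting: frequent, uninformative words contribute less to the description embedding.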

Lots of studies [francis2016capturing, nguyen2016joint, yang2019learning, gupta2017entity, xue2019neural] utilized CNNs to distill the entity description into a meaningful topic embedding. They combined this embedding with the entity surface form embedding mentioned in Section 4.3.1 to form the final entity embedding.

Doc2Vec [le2014distributed] is a modification of Word2Vec and can learn fixed-size embeddings from variable-length pieces of text like entity descriptions. Both Zwicklbauer et al. [zwicklbauer2016robust] and Sevgili et al. [sevgili2019improving] leveraged Doc2Vec to generate an entity embedding based on the entity description.

Phan et al. [phan2017neupl] and Fang et al. [fang2019joint] both used a single-directional LSTM network to encode the entity description. Instead of taking the last hidden state as the representation for entity description, Phan et al. [phan2017neupl] applied max-pooling over all hidden state vectors to produce a fixed-size entity embedding.
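
The difference between taking the last hidden state and max-pooling over all hidden states can be seen in a small sketch. The hidden state values below are made up; in a real system they would be produced by the LSTM.

```python
from typing import List


def max_pool(hidden_states: List[List[float]]) -> List[float]:
    """Element-wise maximum over all time steps."""
    if not hidden_states:
        raise ValueError("at least one hidden state is required")
    return [max(column) for column in zip(*hidden_states)]


# Hypothetical hidden states for a three-token description (one vector per token).
hidden_states = [
    [0.1, -0.3, 0.5],
    [0.7, 0.2, -0.1],
    [-0.2, 0.4, 0.0],
]

last_state = hidden_states[-1]     # representation based on the final step only
pooled = max_pool(hidden_states)   # fixed-size vector drawing on every step
```

The pooled vector has the same size regardless of the description length, and each dimension keeps its strongest activation from anywhere in the sequence.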

Ganea and Hofmann [ganea2017deep] collected co-occurrence words from the entity description of each entity and regarded them as positive words of that entity for learning its embedding. They leveraged the word-entity co-occurrence counts to define a practical approximation of a word-entity conditional distribution. In addition, they sampled negative words unrelated to that entity from a generic word distribution. Finally, they used a max-margin loss to infer the optimal embedding of the entity with a goal that vectors of positive words are closer to the embedding of that entity compared with vectors of negative words. Many EL methods [kolitsas2018end, le2018improving, yang2018collective, fang2019joint, yang2019learning, le2019boosting, wu2020dynamic] follow this work and use the same method to generate the entity embedding.

Hu et al. [hu2020graph] extracted words with the highest TF-IDF from the entity description as related words of the entity. To learn better representations of entities by aggregating the semantic information from their neighboring entities and words, they constructed a heterogeneous entity-word graph for each document, which consists of entity nodes corresponding to the candidate entities of mentions in the document, word nodes corresponding to the related words of candidate entities, and relationships added by computing the similarity score between nodes based on pre-trained word embeddings. They applied Graph Convolutional Network (GCN) [kipf2016semi] on this graph to encode the global semantic information among candidate entities and finally generated entity embeddings.

Sometimes, entity descriptions are a little too long to accurately express semantic information of entities. To alleviate the long-term dependency problem and extract worthy information, some DL based EL systems leverage BERT [devlin2019bert] to encode the entity description in order to learn the entity embedding. Specifically, several EL works [wu2020scalable, li2020efficient, fang2020high, zhang2021attention] took the entity description as a sequence of words as input to BERT. Most of these works [wu2020scalable, li2020efficient, zhang2021attention] inserted a special start token at the beginning of the input sequence and the output of the last layer at this start token produced by the Transformer encoder is regarded as the vector representation of the input sequence. Fang et al. [fang2020high] obtained the entity embedding via average-pooling over the hidden states of all description tokens in the last BERT layer.

However, BERT can only take a limited length (i.e., 512 tokens) of the entity description as input. To tackle this problem, Tang et al. [tang2021bidirectional] proposed a multi-paragraph reading model for discovering more textual information in the entity description which is composed of multiple paragraphs. Specifically, they first utilized BERT to encode the concatenation of the mention context and the entity description paragraph to obtain a representation for each paragraph. These representations are then used as the input of a multi-head attention module proposed in Transformer [vaswani2017attention] to gather the semantic dependence among the paragraphs. Finally, a weighted-pooling layer is applied on hidden states output by the multi-head attention module to produce the entity embedding.

4.3.3 Entity Context

In addition to entity descriptions, KBs like Wikipedia can provide additional valuable context information for named entities. Some DL based EL methods utilize various kinds of entity context to learn entity embeddings, such as anchor texts, fact information, and neighbor entities.

Anchor text in Wikipedia is a hyperlink from an entity mention in an entity page linking to its corresponding entity page, and can help the learning of entity embeddings in two aspects. First, the anchor text is a link that jumps from one entity page to another, providing rich entity-entity co-occurrence information. Second, anchor texts residing in documents supply abundant context words for their referring entities, and thus they can offer rich word-entity co-occurrence information.

Specifically, Yamada et al. [yamada2016joint] used anchor texts in Wikipedia to learn the entity embedding based on the entity-entity co-occurrence information they provide. Inspired by Wikipedia Link-based Measure (WLM) [milne2008learning], which is a standard entity relatedness measure in traditional EL, they assumed that entities with similar hyperlinks are related. For example, the teams “Boston Celtics” and “Toronto Raptors” are highly related because they have many common hyperlinks to the entity pages about “NBA”. Their model simply learned to place entities with similar hyperlinks near one another in the vector space via maximizing the conditional probability $P(e_l \mid e)$, where $e_l$ is one of the hyperlinks of the entity $e$, which is similar to the idea of the SG model in Word2Vec [mikolov2013efficient]. Fang et al. [fang2016entity] exploited the entity-entity co-occurrence information offered by anchor texts to learn the entity embedding by placing the representations of two entities connected by anchor texts close to each other in the vector space. Sevgili et al. [sevgili2019improving] used anchor texts in Wikipedia to create a graph whose nodes are entities and edges are the hyperlinks between entities. Then the vector representation of each entity is generated by running DeepWalk [perozzi2014deepwalk] over the constructed graph.

Starting from Ganea and Hofmann [ganea2017deep], many EL methods [kolitsas2018end, le2018improving, yang2018collective, fang2019joint, yang2019learning, le2019boosting, wu2020dynamic, hu2020graph] took word-entity co-occurrence counts provided by anchor texts as another source to learn entity embeddings. We have introduced their specific learning method in Section 4.3.2. Additionally, lots of EL approaches [yamada2016joint, fang2016entity, cao2017bridge, phan2017neupl, eshel2017named, mueller2018effective, wu2020dynamic, van2020rel, phan2018pair] utilized anchor texts to construct entity embeddings by mapping embeddings of words and entities into the same vector space. In these approaches, similar words and entities are placed close to each other in a vector space. More details will be introduced in Section 4.4.

Recently, Yamada et al. [yamada2020global] utilized anchor texts residing in Wikipedia to learn the contextualized entity embedding. Specifically, they proposed a new masked entity prediction model to learn the entity embedding, inspired by the masked language model adopted in BERT [devlin2019bert]. Masked entity prediction aims to predict randomly masked entities based on words and non-masked entities. They trained this model based on the contexts of the referring entities provided by the anchor texts in Wikipedia.

A KB is usually composed of facts about entities. Facts are also important information for modeling entity embeddings. We denote a fact as $(e_h, r, e_t)$, where $e_h$ is a head entity, $r$ is a relation, and $e_t$ is a tail entity. Fang et al. [fang2016entity] followed TransE [bordes2013translating] to define a score function for a fact based on embeddings of $e_h$, $r$, and $e_t$ as follows:

$$z(e_h, r, e_t) = b - \frac{1}{2}\lVert \mathbf{e}_h + \mathbf{r} - \mathbf{e}_t \rVert^2 \qquad (3)$$

where $\mathbf{e}_h$, $\mathbf{r}$, $\mathbf{e}_t$ are the embeddings of $e_h$, $r$, $e_t$ respectively, and $b$ is a constant for numerical stability. After maximizing the above score function, the representations of $e_h$ and $e_t$ can be used as entity embeddings. Moon et al. [moon2018multimodal] also leveraged facts to generate entity embeddings as follows:

(4)

where the scoring function is a deep neural network that produces the likelihood that a fact $(e_h, r, e_t)$ is valid. For a certain entity $e$, Huang et al. [huang2015leveraging] defined a connected entity set $\mathcal{N}(e)$ containing its corresponding head entities and tail entities existing in the same facts, which is another form of entity-entity co-occurrence information. For each entity in $\mathcal{N}(e)$, they obtained its surface form and represented it as a bag-of-words. The vectors of all connected entities are concatenated to generate the final entity embedding. Cao et al. [cao2017bridge, cao2018neural] extended the SG model in Word2Vec by maximizing, over all entities $e$, the log probability of observing the connected entity set $\mathcal{N}(e)$ given the entity $e$ as follows:

$$\mathcal{L} = \sum_{e} \log P(\mathcal{N}(e) \mid e) \qquad (5)$$

Thus, entities sharing many common connected entities tend to have similar representations.
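
A translation-based fact score in the spirit of TransE can be sketched as below. The exact form $b - \frac{1}{2}\lVert \mathbf{e}_h + \mathbf{r} - \mathbf{e}_t \rVert^2$, the default value of `b`, and the toy embeddings are assumptions made for illustration; they should not be read as the precise configuration of any cited method.

```python
from typing import Sequence


def translation_score(head: Sequence[float],
                      relation: Sequence[float],
                      tail: Sequence[float],
                      b: float = 7.0) -> float:
    """Score a fact as b - 0.5 * ||head + relation - tail||^2 (higher is more plausible).

    The default b is a hypothetical constant chosen only for this example.
    """
    squared_distance = sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    return b - 0.5 * squared_distance


# Toy embeddings (hypothetical values): lakers + member_of lands exactly on nba.
lakers = [0.2, 0.1]
member_of = [0.3, 0.4]
nba = [0.5, 0.5]
paris = [-1.0, 2.0]

valid = translation_score(lakers, member_of, nba)
corrupted = translation_score(lakers, member_of, paris)
```

Maximizing this score for observed facts pulls the head embedding plus the relation embedding toward the tail embedding, so entities that take part in similar facts end up close to each other.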

Zwicklbauer et al. [zwicklbauer2016robust] defined the neighbor entities of entity $e$ as entities that appear surrounding $e$ in text. They generated a corpus that exclusively comprises entities sequentially by replacing all available linked surface forms in documents with their corresponding target entities and removing all non-entity identifiers like words and punctuation. Similar to Equation 5, they applied the SG model in Word2Vec over this created corpus to learn entity embeddings from the neighbor entities in a window.

4.3.4 Type Information

A KB consists of a type hierarchy and entities that are instances of types. The entity type is a kind of important information to represent the semantics of entities. A large number of EL methods have leveraged entity type information to learn entity embeddings.

Sun et al. [sun2015modeling] averaged embeddings of entity type words based on the learned word embeddings to get the entity type embedding. They utilized a low-rank neural tensor network [socher2013recursive] to jointly encode the entity type embedding and the entity surface form embedding introduced in Section 4.3.1 to learn the final entity representation. Gupta et al. [gupta2017entity] calculated a relevant probability of observing type $t$ given entity $e$ as $P(t \mid e) = \sigma(\mathbf{v}_t \cdot \mathbf{v}_e)$, where $\mathbf{v}_t$ and $\mathbf{v}_e$ are type embedding and entity embedding respectively, and $\sigma$ is the sigmoid function. They maximized this probability to learn entity and type embeddings jointly and injected type embeddings into entity embeddings ultimately.

Huang et al. [huang2015leveraging] extracted a set of attached entity types for each entity based on Freebase [bollacker2008freebase], represented each entity type as a one-hot vector, and used it as one of the inputs for DNN to learn entity embeddings. Wu et al. [wu2020dynamic] extracted the notable type from Freebase for each entity and utilized the average embedding of words in the type string as a type embedding. They then obtained the final entity embedding by concatenating the type embedding with the learned entity embedding [ganea2017deep]. Le and Titov [le2019distant] only used type information to produce the entity embedding. They first leveraged Freebase to get a type set for each entity. Each type of the entity is assigned a vector based on the pre-trained word embeddings, and then the representation of the entity can be calculated as follows:

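$$\mathbf{e} = \mathrm{ReLU}\Big(\mathbf{W}\,\frac{1}{|T_e|}\sum_{t \in T_e}\mathbf{t} + \mathbf{b}\Big) \qquad (6)$$

where $T_e$ is the type set of the entity $e$, $\mathbf{t}$ is the vector assigned to type $t$, $\mathbf{W}$ is a weight matrix and $\mathbf{b}$ is a bias vector.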

Chen et al. [chen2020improving] proposed to inject latent type information into the entity embedding via modeling the immediate context where the entity appears. They considered that context consistency is a strong proxy for type compatibility, so they leveraged the pre-trained BERT to encode entity contexts that are randomly sampled from Wikipedia for that entity. The representation of the entity is computed by aggregating all the context representations via average-pooling.

Hou et al. [hou2020improving] exploited the embeddings of semantic types to generate entity embeddings. They first created a dictionary of fine-grained semantic types, and then extracted semantic types from its Wikipedia article for each entity. To construct semantic reinforced entity embedding, they averaged the semantic type embeddings obtained from Word2Vec and combined them with learned entity embeddings [ganea2017deep] via linear aggregation.

4.4 Alignment Embedding

As introduced in Sections 4.1 and 4.3, the word embedding and the entity embedding could be learned in different ways based on various data. However, in many cases, these two categories of embeddings are learned separately and are not in the same vector space. This makes it impossible to calculate the similarity between words and entities effectively, which is an essential operation for computing the context similarity feature in EL. Therefore, aligning word and entity embeddings into the same vector space (called alignment embedding) is necessary so that similar words and entities are placed close to each other in a common space. After alignment, we can measure the similarity between words and entities by simply computing the cosine similarity of their embeddings. We introduce some classic alignment embedding techniques as follows.

As mentioned in Section 4.3.3, anchor text is a key resource for aligning embeddings of words and entities. The window words surrounding the anchor text could be regarded as context words of its referring entity, and could provide ample word-entity co-occurrence examples, which could be leveraged to align embeddings. Yamada et al. [yamada2016joint] proposed an alignment model which leverages anchor texts and their context words by extending the SG model in Word2Vec. The objective function is defined to predict the surrounding context words of the referring entity of the anchor text:

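$$\mathcal{L}_a = \sum_{(e, Q) \in \mathcal{A}} \sum_{w \in Q} \log P(w \mid e) \qquad (7)$$

where $\mathcal{A}$ denotes the set of anchor texts, $e$ denotes the referring entity of the anchor text, $Q$ is the set of its surrounding context words, and $w$ is a word in $Q$.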
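
The anchor-based objective can be made concrete with a small sketch that sums the log probability of each context word given the referring entity. Here the conditional probability is computed with a full softmax over a toy vocabulary, which is an assumption made for readability; SG-style models in practice typically rely on approximations such as negative sampling. All names and vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def log_prob_word_given_entity(word: str, entity_vec: Vector,
                               word_vecs: Dict[str, Vector]) -> float:
    """log P(word | entity) under a softmax over the whole vocabulary."""
    scores = {w: dot(v, entity_vec) for w, v in word_vecs.items()}
    peak = max(scores.values())  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores.values()))
    return scores[word] - log_norm


def anchor_objective(anchors: Sequence[Tuple[str, List[str]]],
                     word_vecs: Dict[str, Vector],
                     entity_vecs: Dict[str, Vector]) -> float:
    """Sum of log P(w | e) over every anchor (e, context words) and every context word w."""
    return sum(log_prob_word_given_entity(w, entity_vecs[e], word_vecs)
               for e, context in anchors for w in context)


# Toy shared space (hypothetical values): words and the entity live in the same 2-d space.
WORD_VECS = {"basketball": [1.0, 0.0], "team": [0.8, 0.2], "river": [0.0, 1.0]}
ENTITY_VECS = {"Los_Angeles_Lakers": [1.0, 0.0]}
ANCHORS = [("Los_Angeles_Lakers", ["basketball", "team"])]

objective = anchor_objective(ANCHORS, WORD_VECS, ENTITY_VECS)
```

Training would adjust both word and entity vectors to increase this objective, which is what places an entity near the words that tend to surround its anchor texts.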

Moreover, based on the word-entity co-occurrence counts collected from anchor texts and entity descriptions, Fang et al. [fang2016entity] maximized the score function to shorten the distance between each co-occurrence pair:

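$$z(w, e) = b - \frac{1}{2}\lVert \mathbf{w} - \mathbf{e} \rVert^2 \qquad (8)$$

where $\mathbf{w}$ and $\mathbf{e}$ represent the corresponding embeddings for word $w$ and entity $e$ respectively, and $b$ is a constant.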

Additionally, Eshel et al. [eshel2017named] used the Word2Vecf embedding algorithm [levy2014dependency] to train word and entity embeddings jointly by leveraging the word-entity co-occurrence counts collected from the set of frequent words appearing in the Wikipedia article of each entity.

5 Feature

Features are designed to calculate the similarity between the entity mention and the candidate entity in various aspects. Benefiting from DL, more and more EL methods have utilized embeddings as features directly or used different embeddings as a source for feature design. In addition, some traditional features which are manually designed, such as the prior popularity feature, are still applied in recent DL based EL methods. In this section, we review five categories of features found to be effective and broadly used to rank candidate entities in DL based EL methods, i.e., prior popularity, surface form similarity, type similarity, context similarity, and topical coherence.

5.1 Prior Popularity

The prior popularity is the probability of the appearance of a candidate entity given an entity mention without considering the context where the mention appears. It is a simple but strong feature, and almost all DL based EL methods leverage this feature. For example, the former U.S. president “Barack Obama” is more popular than his wife “Michelle Obama”, both of whom could be referred to by “Obama”. In most cases when people mention “Obama”, they mean the former president rather than his wife. Anchor text in Wikipedia is the most broadly used source to estimate the prior popularity feature. Given an entity mention $m$, the prior popularity feature of a candidate entity $e$ can be estimated as follows:

$$P(e \mid m) = \frac{\mathrm{count}(m \rightarrow e)}{\mathrm{count}(m)} \qquad (9)$$

where $\mathrm{count}(m)$ denotes the number of anchor texts having the entity mention $m$ as the surface form in Wikipedia; $\mathrm{count}(m \rightarrow e)$ represents the number of anchor texts with the surface form $m$ pointing to the candidate entity $e$. In many common real-world entity linking data sets, this prior popularity feature alone already yields strong accuracy [ganea2017deep], which demonstrates its effectiveness in EL.
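
Estimating the prior popularity from anchor statistics amounts to counting and normalizing. The sketch below assumes the anchors are available as (surface form, target entity) pairs; the function name `estimate_prior` and the toy counts are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def estimate_prior(anchors: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(entity | mention) as count(mention -> entity) / count(mention)."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for surface_form, entity in anchors:
        counts[surface_form][entity] += 1
    prior: Dict[str, Dict[str, float]] = {}
    for surface_form, entity_counts in counts.items():
        total = sum(entity_counts.values())
        prior[surface_form] = {e: c / total for e, c in entity_counts.items()}
    return prior


# Toy anchor statistics (made up): nine anchors "Obama" point to the former president,
# one points to Michelle Obama.
ANCHORS = [("Obama", "Barack_Obama")] * 9 + [("Obama", "Michelle_Obama")]
prior = estimate_prior(ANCHORS)
```

With these toy counts the prior assigns 0.9 to the former president and 0.1 to Michelle Obama, regardless of the context in which “Obama” appears.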

5.2 Surface Form Similarity

Intuitively, an entity mention and a candidate entity with the same or similar surface forms are likely to indicate a gold mapping. For example, the entity mention “Univ Manchester” is more likely to refer to the university “The University of Manchester” rather than the football team “Manchester City F.C.” due to their greater surface form similarity. Many DL based EL systems leverage traditional surface form similarity measures, such as edit distance, Dice coefficient score, character Dice, skip bigram Dice, and left and right Hamming distance scores. We have introduced them in our previous survey [shen2014entity] and do not introduce them in detail here.
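
Two of these traditional measures are easy to sketch. The code below computes a Dice coefficient over character bigrams and a Levenshtein edit distance; the lowercasing and the choice of character bigrams are assumptions for this illustration, and individual systems may define the measures differently.

```python
from typing import Set


def char_bigrams(text: str) -> Set[str]:
    """Set of lowercased character bigrams of a string."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_coefficient(a: str, b: str) -> float:
    """Dice score 2|X ∩ Y| / (|X| + |Y|) over character bigram sets."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


mention = "Univ Manchester"
university_score = dice_coefficient(mention, "The University of Manchester")
football_score = dice_coefficient(mention, "Manchester City F.C.")
```

On this example the university receives the higher Dice score, in line with the intuition stated above.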

In addition, several DL based EL methods [francis2016capturing, nguyen2016joint, yang2019learning, xue2019neural] considered the cosine similarity between the mention embedding introduced in Section 4.2 and the entity surface form embedding introduced in Section 4.3.1 as the surface form similarity feature. Chen et al. [chen2021lightweight] leveraged CNNs over the mention embedding and the entity surface form embedding to extract n-gram features of the entity mention and the candidate entity respectively. A two-layer fully connected neural network is then applied to the concatenation of the output embeddings to obtain the surface form similarity feature. Moreover, Moon et al. [moon2018multimodal] trained a separate deep neural network to encode surface forms of the entity mention and the candidate entity, which produces a purely lexical embedding without semantic allusion. The surface form similarity feature between the entity mention and the candidate entity is computed based on their distance in the embedding space.

5.3 Type Similarity

In addition to being used as auxiliary information to construct entity embeddings introduced in Section 4.3.4, type information is utilized by several DL based EL methods to calculate the type similarity feature. For example, knowing that the correct type of the entity mention “Boston” in some context is sports_team could constrain its corresponding entity with relevant types, such as sport or team. Generally, KBs contain rich type information for a candidate entity, while the type information of an entity mention in some context is absent. Some typing systems are proposed to predict types of an entity mention with the help of its surrounding context.

Raiman and Raiman [raiman2018deeptype] proposed DeepType, which is a method to solve EL only using type constraints. They restricted the types in their typing system via selecting a set of parent-child relations over the ontology in Wikipedia. Then they leveraged a Bi-LSTM classifier to obtain the type conditional probability $P(t \mid m, c)$ for type $t$ given the entity mention $m$ and its surrounding context $c$. For a candidate entity $e$ belonging to types $\{t_1, \dots, t_k\}$, the type similarity feature is calculated as follows:

$$\Phi_{\mathrm{type}}(m, e) = 1 - \beta + \beta \prod_{i=1}^{k}\big(1 - \alpha_i + \alpha_i\, P(t_i \mid m, c)\big) \qquad (10)$$

where $\beta$ is a smoothing parameter over all types and $\alpha_i$ is a smoothing parameter per type.
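
The role of the two smoothing parameters can be illustrated with a short sketch. The functional form used here, $1 - \beta + \beta \prod_i (1 - \alpha_i + \alpha_i P(t_i \mid m, c))$, is an assumption about how a global and a per-type smoothing parameter can be combined; the parameter values and type probabilities are invented for the example.

```python
from typing import Sequence


def smoothed_type_score(type_probs: Sequence[float],
                        alphas: Sequence[float],
                        beta: float) -> float:
    """Combine predicted type probabilities with per-type (alpha) and global (beta) smoothing.

    type_probs[i] is the predicted probability of the i-th type of the candidate entity.
    """
    product = 1.0
    for p, alpha in zip(type_probs, alphas):
        product *= 1.0 - alpha + alpha * p
    return 1.0 - beta + beta * product


# Mention "Boston" in a sports context (hypothetical classifier outputs):
# the team candidate has a likely type, the city candidate an unlikely one.
team_score = smoothed_type_score([0.9], alphas=[0.8], beta=0.9)
city_score = smoothed_type_score([0.1], alphas=[0.8], beta=0.9)
unsmoothed_off = smoothed_type_score([0.1], alphas=[0.8], beta=0.0)
```

Setting `beta` to 0 switches the type evidence off entirely, while values close to 1 let a confident type prediction dominate; the per-type `alphas` limit how much a single unreliable type classifier can penalize a candidate.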

Onoe and Durrett [onoe2020fine] proposed a fine-grained typing system, which contains tens of thousands of types derived from Wikipedia categories. They selected a single linear layer to decode the concatenation of the mention embedding introduced in Section 4.2 and the context embedding which will be introduced in Section 5.4.1. They then utilized a sigmoid function to normalize the output vector of the decoder and obtained the type conditional probability $P(t \mid m, c)$ for type $t$ given the entity mention $m$ and its surrounding context $c$. The type similarity feature between the entity mention $m$ and the candidate entity $e$ of types $T_e$ is calculated as follows:

$$\Phi_{\mathrm{type}}(m, e) = \sum_{t \in T_e} P(t \mid m, c) \qquad (11)$$

Yang et al. [yang2019learning] trained a typing system proposed by Xu and Barbosa [xu2018neural]. They concatenated the mention embedding and the context embedding to form a representation, which is then used as the input of a softmax classifier to predict the probability distribution over four types (i.e., PER, GPE, ORG, and UNK) of the entity mention. The type similarity feature is measured as the similarity between the predicted type distribution of the entity mention and the types of the candidate entity.

5.4 Context Similarity

The most straightforward feature for entity linking is to measure the similarity between the representation derived from the context around the entity mention and the representation associated with the candidate entity. To model this context similarity feature, the corresponding context of the entity mention would be encoded as a context embedding using various neural architectures, and then diverse computing methods would be utilized to calculate the similarity score between the generated context embedding and the entity embedding mentioned in Section 4.3. In the following, we first describe how to generate the context embedding, and subsequently show the computing methods for calculating this context similarity feature.

5.4.1 Context Embedding

With the advent of latent embeddings, traditional bag-of-words [guo2013link, zhang2010nus, vstajner2009entity] and keyphrase [hoffart2011robust, hoffart2012kore] models may seem to be superseded by neural models, which can implicitly capture semantic and syntactic information of the context. Formally, given an entity mention $m$ in a sentence and a pre-defined window size $K$, the context of $m$ contains the left-side context words of $m$, $\{w_{-K}, \dots, w_{-1}\}$, and the right-side context words of $m$, $\{w_{1}, \dots, w_{K}\}$. Context embedding aims to represent the surrounding context of $m$ as a dense and low-dimensional vector. As the context is composed of words, a large number of DL based EL methods leverage word embeddings introduced in Section 4.1 as input to learn the context embedding based on various neural models. For the simplest case, some EL methods [yamada2016joint, cao2017bridge, yang2018collective, shahbazi2019entity, adjali2020ecir] averaged the embeddings of context words to derive the context embedding. For the other cases, we group models of generating context embeddings into several categories based on their different neural architectures, such as RNNs, CNNs, attention, Transformers, and others. In the remainder, we will describe these categories of context embedding models in detail.

RNNs-based. RNN is designed to process data that is sequential in nature, especially when the input sequence has variable length. Accordingly, the RNNs-based neural models are appropriate for context learning. RNN intends to capture word dependencies and context structures using the recurrent unit, which is an NN that is shared between all time steps [mudgal2018deep]. At time step $t$, the recurrent unit takes the $t$-th input $x_t$ and the hidden state vector $h_{t-1}$ of the previous time step to produce the hidden state vector $h_t$ of the current time step. Hence, $h_t$ contains the previous and current input information $x_1, \dots, x_t$.
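
The recurrence can be sketched with a plain (Elman-style) recurrent unit. The tanh activation, the weight matrices, and the inputs below are assumptions chosen for illustration; the LSTM and GRU variants discussed next replace this simple unit with gated ones.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def rnn_step(x_t: Vector, h_prev: Vector, w_in: Matrix, w_rec: Matrix, bias: Vector) -> Vector:
    """One recurrent step: h_t = tanh(W_in x_t + W_rec h_{t-1} + b)."""
    return [math.tanh(a + b + c)
            for a, b, c in zip(matvec(w_in, x_t), matvec(w_rec, h_prev), bias)]


def encode(inputs: List[Vector], w_in: Matrix, w_rec: Matrix, bias: Vector) -> List[Vector]:
    """Run the same recurrent unit over every time step and return all hidden states."""
    h = [0.0] * len(bias)
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_in, w_rec, bias)
        states.append(h)
    return states


# Hypothetical parameters and word vectors for a three-word context.
W_IN = [[0.5, -0.2], [0.1, 0.3]]
W_REC = [[0.4, 0.0], [0.0, 0.4]]
BIAS = [0.0, 0.0]

states = encode([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], W_IN, W_REC, BIAS)
changed_first = encode([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]], W_IN, W_REC, BIAS)
```

Changing only the first input changes the final hidden state, which illustrates that the last state carries information from the whole prefix of the sequence.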

LSTM [hochreiter1997long] is the most popular RNN architecture, which is designed to better maintain the long term memory. It introduces a memory cell to remember values over arbitrary time intervals, and uses three kinds of gates (input gate, output gate, and forget gate) to regulate the flow of information into and out of the cell. Fang et al. [fang2019joint] utilized a single-directional LSTM to encode the context as an embedding, while some works [le2019distant, onoe2020fine, wu2020dynamic, chen2021lightweight] used a Bi-LSTM [lample2016neural] to encode context. Additionally, a few DL based EL methods [phan2017neupl, gupta2017entity, sil2017neural] used a forward LSTM and a backward LSTM to encode the left-side and right-side context of the entity mention respectively. Specifically, Phan et al. [phan2017neupl] leveraged the hidden state vectors to generate the context embedding, while the others [gupta2017entity, sil2017neural] used the output vectors. It is noted that these two LSTMs are different from Bi-LSTM since each LSTM of them only embeds half of the context while each LSTM of Bi-LSTM encodes the full context.

A slight variation of LSTM is the Gated Recurrent Unit (GRU) [cho2014learning]. Compared with LSTM, it is simpler because it combines the input and forget gates into a single update gate, and it has been adopted by some DL based EL methods such as [mueller2018effective, eshel2017named]. Both of them utilized a forward GRU and a backward GRU to encode the left-side and right-side context respectively. The outputs of the forward GRU and the backward GRU are leveraged to generate the context embedding.

CNNs-based. RNN is trained to recognize patterns across time, while CNN learns to recognize patterns across space [lecun1998gradient]. RNN works well when the long-term semantics is required, while CNN works well when detecting local and position-invariant patterns is important, which might be a keyphrase that expresses a particular sentiment [minaee2020deep]. Thus, contexts in CNNs-based models are not supposed to be too long. CNN consists of multiple convolutional layers, each of which extracts local features around each context word, and the size of the output of the convolutional layers depends on the number of words in context. The context embedding is constructed via combining local feature vectors extracted by convolutional layers.

Several DL based EL methods [sun2015modeling, nguyen2016joint, francis2016capturing, xue2019neural] applied CNNs over the context words to generate the hidden vector sequences, which were then transformed by a non-linear function and pooled by sum-pooling [nguyen2016joint, francis2016capturing, xue2019neural] or average-pooling [sun2015modeling] to generate the context embedding.
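
The convolve-then-pool pattern can be sketched as follows. The window width, the tanh non-linearity, the filter weights, and the context vectors are assumptions for illustration and are not taken from the cited systems.

```python
import math
from typing import List

Vector = List[float]


def conv1d(word_vectors: List[Vector], filters: List[Vector], width: int) -> List[Vector]:
    """Slide a window of `width` words over the context and apply every filter.

    Each filter is a flat weight vector of length width * embedding_dim; the
    response is passed through tanh as the non-linear transformation.
    """
    features = []
    for start in range(len(word_vectors) - width + 1):
        window = [x for vec in word_vectors[start:start + width] for x in vec]
        features.append([math.tanh(sum(w * x for w, x in zip(f, window))) for f in filters])
    return features


def sum_pool(features: List[Vector]) -> Vector:
    return [sum(column) for column in zip(*features)]


def average_pool(features: List[Vector]) -> Vector:
    return [sum(column) / len(features) for column in zip(*features)]


# Hypothetical 2-d embeddings of four context words and two filters of width 2.
CONTEXT = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
FILTERS = [[0.5, 0.5, 0.5, 0.5], [1.0, -1.0, 1.0, -1.0]]

hidden = conv1d(CONTEXT, FILTERS, width=2)
context_sum = sum_pool(hidden)
context_avg = average_pool(hidden)
```

The number of hidden vectors grows with the context length, while pooling always returns one value per filter, so the context embedding has a fixed size.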

Fig. 3: Architectures of different neural networks for learning the context embedding.

Attention-based. Starting from Bahdanau et al. [bahdanau2015neural], attention mechanism has become an increasingly popular concept and useful tool in many NLP tasks. The idea of attention is to highlight the informative parts of the input based on an attention vector. In order to give different attention weights to the context words in learning the context embedding, attention-based EL models estimate how strongly they are correlated with the attention vector. Then these models take the weighted sum of the values of the context words as the context embedding. To illustrate more clearly, we propose a general formula to introduce various attention-based models for learning the context embedding, which is generally defined as a sum of the context word values weighted by the correlation between the value and the attention vector:

$$\mathbf{c} = \sigma\big(f(V, A)\big)\, V \qquad (12)$$

where $V$ and $A$ correspond to the matrices of values and attention vectors respectively, $f$ denotes a function used to calculate the attention weight of each value, $\sigma$ usually denotes a softmax function, and $\mathbf{c}$ is the context embedding. We summarize some prominent attention-based EL models by introducing different choices of values, attention vectors, and $f$ as follows.

In learning the context embedding, values are usually components of the context. Lots of attention-based EL models [ganea2017deep, cao2018neural, kolitsas2018end, hou2020improving, yang2019learning, le2018improving, le2019boosting, hu2020graph] used embeddings of context words as values. Phan et al. [phan2017neupl] took hidden state vectors of the forward LSTM and the backward LSTM as values, while Wu et al. [wu2020dynamic] and Chen et al. [chen2021lightweight] took outputs of the Bi-LSTM as values. Eshel et al. [eshel2017named] and Mueller and Durrett [mueller2018effective] regarded outputs of the forward GRU and the backward GRU as values.

Attention vectors are used to provide background information and discover the important and relevant parts of values. Some EL works [phan2017neupl, mueller2018effective, eshel2017named, cao2018neural, wu2020dynamic] chose the same attention vector for all values, which is the embedding of the candidate entity. Several DL based EL models [ganea2017deep, kolitsas2018end, hou2020improving, yang2019learning, le2018improving, le2019boosting, hu2020graph] assumed that a context word is important if it is strongly related to at least one candidate entity, so they selected the embedding of the most related candidate entity for each context word as the attention vector. Additionally, Zhang et al. [zhang2021attention] regarded the representations of connected entities of the candidate entity introduced in Section 4.3.3 as the attention vectors.

The attention function determines the attention weight of each value by measuring the similarity or correlation between the value and the attention vector. Cao et al. [cao2018neural] used the cosine similarity, Mueller and Durrett [mueller2018effective] and Zhang et al. [zhang2021attention] leveraged the dot product, and a great many attention-based models [ganea2017deep, kolitsas2018end, hou2020improving, yang2019learning, le2018improving, le2019boosting, wu2020dynamic, hu2020graph] utilized the bilinear similarity. We will introduce these three metrics in detail in Section 5.4.2 as they are also utilized to calculate the context similarity feature. Additionally, Phan et al. [phan2017neupl] and Eshel et al. [eshel2017named] obtained the similarity by leveraging a single-layer perceptron to encode the value and the attention vector.
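
The general attention scheme, a softmax-normalized correlation between each value and an attention vector followed by a weighted sum of the values, can be sketched as below. The sketch assumes a single attention vector shared by all values (the candidate entity embedding) and a dot-product scoring function; both are only one of the choices listed above, and all vectors are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(values: List[Vector],
                      attention_vector: Vector,
                      score_fn: Callable[[Sequence[float], Sequence[float]], float] = dot
                      ) -> Tuple[Vector, List[float]]:
    """Return the context embedding (weighted sum of values) and the attention weights."""
    weights = softmax([score_fn(v, attention_vector) for v in values])
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


# Hypothetical context word embeddings (values) and candidate entity embedding.
VALUES = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
CANDIDATE = [2.0, 0.0]

context, weights = attention_context(VALUES, CANDIDATE)
```

Swapping `score_fn` for a cosine or bilinear similarity changes how the weights are computed without altering the rest of the computation, which is why the models above can be described by the same general formula.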

Transformers-based. RNNs-based models process the context sequentially, which is one of their computational bottlenecks. Although CNNs-based models are less sequential, the computational cost of capturing relations between words in the context still grows with the length of the context. Transformers [vaswani2017attention] overcome this limitation by applying a self-attention mechanism that calculates an attention weight for each word in the context, in order to model the influence each word has on another [minaee2020deep]. Recent Transformers-based pre-trained language models use much deeper neural architectures than models based on RNNs and CNNs, and are pre-trained on much larger text corpora to learn contextualized text representations by predicting words conditioned on their surrounding contexts. These pre-trained language models can be fine-tuned using task-specific labels of downstream tasks like EL.

BERT [devlin2019bert] is a widely used Transformers-based pre-trained language model in context embedding learning for EL task. Several DL based EL methods [wu2020scalable, fang2020high, zhang2021attention] took the word-pieces of the entity mention and its context as the input of BERT. Wu et al. [wu2020scalable] and Zhang et al. [zhang2021attention] inserted a start token and regarded the output of the last layer at this start token as the context embedding, while Fang et al. [fang2020high] obtained the context embedding via average-pooling over the hidden states of all context tokens in the last layer. Tang et al. [tang2021bidirectional] leveraged the multi-paragraph reading model based on BERT introduced in Section 4.3.2 to encode the context as a representation. Compared with the model introduced in Section 4.3.2, they added a new multi-head attention module which leverages the learned entity embedding as the query to emphasize the importance of the context which is encoded in the entity embedding.

Architectures of RNNs-based, CNNs-based, attention-based, and Transformers-based models for learning the context embedding are shown in Figure 3.

Other Methods. There are also some other methods to learn context embeddings, which we briefly introduce as follows. He et al. [he2013learning] used stacked DA [vincent2008extracting] to encode the context as a representation, while Zwicklbauer et al. [zwicklbauer2016robust] and Sevgili et al. [sevgili2019improving] utilized Doc2Vec [le2014distributed] to generate an embedding for the context, in the same way that they learned the entity embedding from the entity description (Section 4.3.2).

5.4.2 Context Similarity Feature Computation

After the generation of the context embedding described above, we could compute the context similarity feature based on some similarity measures between the context embedding of the entity mention and the entity embedding of the candidate entity. Here, we overview the context similarity measures used by DL based EL models in the following.

Cosine similarity is a commonly used similarity measure for real-valued vectors. Based on it, the context similarity feature between the entity mention and the candidate entity can be defined as follows:

$$\Phi_{ctx}(m, e) = \frac{\mathbf{v}_m \cdot \mathbf{v}_e}{\|\mathbf{v}_m\| \, \|\mathbf{v}_e\|} \tag{13}$$

where $\mathbf{v}_m$ is the context embedding for the entity mention $m$, $\mathbf{v}_e$ is the entity embedding of the candidate entity $e$, and $\|\mathbf{v}_m\|$ and $\|\mathbf{v}_e\|$ are the lengths of the two embeddings respectively. Cosine similarity depends only on the angle between the two vectors: it takes its highest value when they point in the same direction and its lowest value when they point in opposite directions. Many DL based EL methods [sun2015modeling, yamada2016joint, francis2016capturing, zwicklbauer2016robust, nguyen2016joint, cao2017bridge, yang2018collective, cao2018neural, xue2019neural, chen2020improving, wu2020dynamic, van2020rel, phan2018pair, sil2017neural, gillick2019learning, chen2021lightweight] applied this metric to calculate the context similarity feature.

Additionally, the numerator of Equation 13, which is the dot product of the context embedding $\mathbf{v}_m$ and the entity embedding $\mathbf{v}_e$, has been directly used to measure the context similarity feature by several EL methods [he2013learning, kolitsas2018end, wu2020scalable, li2020efficient, yamada2020global, moon2018multimodal]. Compared with cosine similarity, which considers only the angle between two vectors, the dot product is sensitive to both their angle and their magnitudes.

Starting from Ganea and Hofmann [ganea2017deep], bilinear similarity has been utilized by several DL based EL systems [le2018improving, yang2019learning, le2019boosting, wu2020dynamic, hou2020improving, van2020rel, hu2020graph] to compute the context similarity feature, which can be denoted as follows:

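$$\Phi_{ctx}(m, e) = \mathbf{v}_m^{\top} \mathbf{B} \, \mathbf{v}_e \tag{14}$$

where $\mathbf{v}_m$ is the context embedding for the entity mention $m$, $\mathbf{v}_e$ is the entity embedding of the candidate entity $e$, and $\mathbf{B}$ is a trainable diagonal matrix, which indicates the relationships between two embeddings.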
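The three similarity measures discussed so far (cosine, dot product, and bilinear with a diagonal matrix) can be compared side by side in a short sketch. The function names and the representation of the diagonal matrix as a plain list of its diagonal entries are assumptions made for illustration; in a real model the diagonal would be a trained parameter.

```python
import math


def dot_product(u, v):
    """Dot product: sensitive to both angle and magnitude."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine similarity: dot product divided by the two vector lengths."""
    norm = math.sqrt(dot_product(u, u)) * math.sqrt(dot_product(v, v))
    return dot_product(u, v) / norm if norm else 0.0


def bilinear_similarity(u, diagonal, v):
    """Bilinear similarity u^T B v where B is diagonal.

    `diagonal` holds the diagonal entries of B (a stand-in for a
    trainable parameter in this sketch).
    """
    return sum(a * d * b for a, d, b in zip(u, diagonal, v))


context_embedding = [1.0, 2.0]
entity_embedding = [3.0, 4.0]

cos = cosine_similarity(context_embedding, entity_embedding)
dp = dot_product(context_embedding, entity_embedding)
# With an all-ones diagonal the bilinear form reduces to the dot product.
bl_identity = bilinear_similarity(context_embedding, [1.0, 1.0], entity_embedding)
bl_weighted = bilinear_similarity(context_embedding, [0.5, 2.0], entity_embedding)
```

Scaling either embedding changes the dot product and the bilinear score but leaves the cosine similarity unchanged, which is the practical difference between the measures.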

Moreover, several EL works utilized neural architectures to calculate the context similarity feature. Specifically, some DL based EL methods [mueller2018effective, le2019distant, fang2020high] passed the concatenation of the context embedding and the candidate entity embedding to a single-layer perceptron to learn a context similarity score for each candidate entity, while Sevgili et al. [sevgili2019improving] and Fang et al. [fang2019joint] used an MLP instead. Some DL based EL methods [logeswaran2019zero, gu2021read, ravi2021cholan] concatenated the mention context and the entity description as a sequence pair together with a special start token and separator tokens as the input of BERT, while Wu et al. [wu2020scalable] concatenated the context embedding described in Section 5.4.1 and the entity embedding described in Section 4.3.2 as the input of BERT. All these works regarded the output of the last hidden layer at the start token as the representation of the input pair. Some of these works [logeswaran2019zero, wu2020scalable, ravi2021cholan] applied a linear layer to this representation to produce the context similarity feature, while Gu et al. [gu2021read] utilized an MLP with softmax function over this representation.

5.5 Topical Coherence

The aforementioned features mainly concern the local similarity between the entity mention and one of its candidate entities, and each entity mention in the document could be linked independently based on these features, which are often called local features. Starting from Cucerzan [cucerzan2007large], more and more EL methods take into consideration the global topical coherence among the referring entities within the same document, where entity mentions are linked collectively. It is based on an assumption that co-occurring entity mentions in the same document often refer to topically coherent entities. Given a document $d$, the topical coherence feature for a candidate entity $e$ of an entity mention $m$ could be calculated via averaging the topical coherence between the candidate entity $e$ and the assigned mapping entities of an entity mention set $M$ in the same document:

$$\Phi_{coh}(e) = \frac{1}{|M|} \sum_{m_j \in M} \mathrm{sim}(e, \hat{e}_j) \tag{15}$$

where $M$ is a mention set in the document $d$, $\hat{e}_j$ is the assigned mapping entity of the entity mention $m_j$ in the mention set $M$, and $\mathrm{sim}(\cdot,\cdot)$ is a function to calculate a pairwise similarity between entities which may indicate the degree of their topical coherence. In order to introduce the various methods of calculating the topical coherence feature clearly, we summarize the process of this feature generation in the following three steps: (1) mention set selection; (2) mapping entity assignment; and (3) topical coherence feature computation. Firstly, a mention set $M$, whose mapping entities are the targets that the candidate entity $e$ needs to be coherent with, is selected from all the entity mentions in the document $d$. Next, for each entity mention $m_j$ in the mention set $M$, we temporarily assign a mapping entity $\hat{e}_j$ to it from its candidate entities when computing this feature, as its real mapping entity is unknown to us and needs to be figured out in this entity linking task. Finally, the topical coherence feature is obtained by computing the average similarity between the candidate entity $e$ and the assigned mapping entities $\hat{e}_j$. An illustration of this process is shown in Figure 4.

Fig. 4: The process of calculating the topical coherence feature for a candidate entity $e$ of an entity mention $m$ in a document $d$.

The key component in the third step of feature computation is the similarity function $\mathrm{sim}(\cdot,\cdot)$ for a pair of entities. Entity embeddings introduced in Section 4.3 are usually utilized in this step to calculate the entity similarity. Specifically, a large number of DL based EL methods [zwicklbauer2016robust, phan2017neupl, kolitsas2018end, xue2019neural, cao2017bridge, huang2015leveraging, cao2018neural, yang2018collective, nguyen2016joint, phan2018pair] leveraged cosine similarity as the similarity function and several EL methods [yamada2016joint, ganea2017deep, le2018improving, le2019boosting, yang2019learning, chen2020improving, van2020rel, hou2020improving, wu2020dynamic, hu2020graph] exploited bilinear similarity to calculate the entity similarity based on entity embeddings. Another well-known method of entity similarity is WLM [milne2008learning], which is a powerful but simple method based on anchor texts in Wikipedia. More details of WLM can be found in our previous survey [shen2014entity]. Some DL based EL methods [yang2018collective, xue2019neural, phan2018pair] used WLM as the similarity function $\mathrm{sim}(\cdot,\cdot)$. Moreover, starting from Le and Titov [le2018improving], some EL works [le2019boosting, hou2020improving, van2020rel] incorporated relations between entities as latent variables to compute pairwise entity similarity. Fang et al. [fang2016entity] utilized a distance based metric which is similar to Equation 8. Some hand-crafted similarity measurements such as the number of KB relations existing between entities [globerson2016collective] and entity-entity co-occurrence counts obtained from Wikipedia [yang2018collective] were also leveraged to calculate the entity similarity.

However, even with a pairwise similarity function between entities, it is still hard to compute the topical coherence feature for the candidate entity. According to some works [yamada2016joint, ganea2017deep, yang2019learning, le2019boosting], the optimization of this feature is an NP-hard problem, since the second step (i.e., mapping entity assignment), which needs to assign a mapping entity for each entity mention in $M$, cannot be carried out before the entity linking task is finished, as the real mapping entity is unknown to us. To solve this problem, several DL based EL methods obtain the assigned mapping entity for an entity mention temporarily by selecting an entity from its candidate entities; these are called simplified methods. In addition, some DL based EL methods do not obtain the assigned mapping entities explicitly but leverage the whole set of candidate entities to compute the topical coherence feature; these are called optimized methods. In the remainder of this subsection, we introduce simplified methods and optimized methods respectively.

5.5.1 Simplified Methods

Simplified methods try to transform the NP-hard problem into a computable problem by determining a temporary mapping entity for each entity mention in the mention set in the calculation of the topical coherence feature. Here, we mainly focus on how to determine a temporary mapping entity for an entity mention. According to the range of entity mentions that are selected to form the mention set in the first step (i.e., mention set selection), we classify the simplified methods into the following two groups: (1) whole coherence; (2) partial coherence.

Whole coherence. Whole coherence means that all the other entity mentions in the document $d$ are selected to form the mention set $M$, except the entity mention of the candidate entity for which we calculate the topical coherence feature via Equation 15.

One EL method [shen2012linden] chose the candidate entity with the highest prior popularity feature as the assigned mapping entity $\hat{e}_j$ for the entity mention $m_j$ in $M$, which is defined as follows:

$$\hat{e}_j = \operatorname*{arg\,max}_{e' \in E_j} \mathrm{Pop}(e') \tag{16}$$

where $E_j$ is a set of candidate entities for entity mention $m_j$ and $\mathrm{Pop}(e')$ is the prior popularity feature introduced in Section 5.1. Additionally, several DL based EL methods [globerson2016collective, le2019boosting, fang2016entity, wu2020dynamic] regarded the most similar candidate entity to the candidate entity $e$ in Equation 15 as the assigned mapping entity $\hat{e}_j$ for the entity mention $m_j$ in $M$, which is defined as follows:

$$\hat{e}_j = \operatorname*{arg\,max}_{e' \in E_j} \mathrm{sim}(e, e') \tag{17}$$

where $\mathrm{sim}(\cdot,\cdot)$ is an entity similarity function as we have introduced earlier.
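The two assignment strategies of Equations 16 and 17, followed by the averaging of Equation 15, can be sketched as below. Cosine similarity is assumed as the entity similarity function, and the entity names, embeddings, and prior popularity values are hypothetical toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity, used here as the entity similarity function."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def assign_by_prior(candidates, prior):
    """Equation 16: pick the candidate with the highest prior popularity."""
    return max(candidates, key=lambda c: prior[c])


def assign_by_similarity(target, candidates, emb):
    """Equation 17: pick the candidate most similar to the target entity."""
    return max(candidates, key=lambda c: cosine(emb[target], emb[c]))


def topical_coherence(target, assigned, emb):
    """Equation 15: average similarity between target and assigned entities."""
    return sum(cosine(emb[target], emb[a]) for a in assigned) / len(assigned)


# Hypothetical 2-d entity embeddings and prior popularity values.
emb = {
    "Michael_Jordan": [1.0, 0.0],
    "Jordan_(country)": [0.0, 1.0],
    "Chicago_Bulls": [0.9, 0.1],
    "Bull_(animal)": [0.2, 0.8],
}
prior = {"Chicago_Bulls": 0.7, "Bull_(animal)": 0.3}

# Candidate sets of the other mentions in the document (the mention set).
other_mentions = [["Chicago_Bulls", "Bull_(animal)"]]

by_prior = [assign_by_prior(c, prior) for c in other_mentions]
coh_player = topical_coherence("Michael_Jordan", by_prior, emb)
coh_country = topical_coherence("Jordan_(country)", by_prior, emb)
by_sim_country = [
    assign_by_similarity("Jordan_(country)", c, emb) for c in other_mentions
]
```

Note the design difference: the prior-based assignment is the same for every candidate entity being scored, whereas the similarity-based assignment depends on the candidate entity, so each candidate is compared against its own most favorable neighbors.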

Partial coherence. Different from the whole coherence, the partial coherence means that the candidate entity in Equation 15 only needs to be coherent with the assigned mapping entities of parts of entity mentions in the same document, which significantly reduces computational complexity and time consumption.

Yamada et al. [yamada2016joint] regarded a set of unambiguous entity mentions in the document as the mention set $M$ and considered an entity mention unambiguous when the prior popularity feature of one of its candidate entities is greater than a fixed threshold. They used Equation 16 to obtain the assigned mapping entity for the entity mention in $M$. The topical coherence feature for the candidate entity is computed as the similarity between the candidate entity embedding and the averaged embedding of assigned mapping entities of unambiguous entity mentions.

Kolitsas et al. [kolitsas2018end] leveraged the candidate entities having high local scores derived from local features as assigned mapping entities to participate in the calculation of the topical coherence feature. The entity mentions of these selected candidate entities form the mention set $M$. To obtain the topical coherence feature for each candidate entity, they summed up the representations of assigned mapping entities and calculated the similarity between the candidate entity embedding and this generated representation.

It is noted that in the above two simplified methods, candidate entities of different entity mentions have a fixed mention set throughout the EL process, while in the following simplified methods, the mention set changes with candidate entities of different entity mentions.

Fang et al. [fang2016entity] and Chen et al. [chen2021lightweight] selected entity mentions around the entity mention of the candidate entity in Equation 15 in a pre-defined window as the mention set $M$. They obtained the topical coherence feature for the candidate entity by averaging the similarities between the candidate entity embedding and embeddings of assigned mapping entities defined via Equation 17.

A great number of DL based EL methods link entity mentions in a document in a sequential manner and utilize already disambiguated entities to help generate the topical coherence feature for the subsequent candidate entities. When we calculate this feature for some candidate entity via Equation 15, the mention set $M$ is composed of entity mentions which have already been linked before the entity mention of the candidate entity, and those already disambiguated entities are considered as assigned mapping entities. Specifically, Zwicklbauer et al. [zwicklbauer2016robust] created a topic vector, which is set by summing up the representations of already disambiguated entities. Once the linking of some mention is completed, the topic vector changes accordingly. The topical coherence feature for the candidate entity is defined as the similarity between the candidate entity embedding and the topic vector. Nguyen et al. [nguyen2016joint] used GRUs to encode already disambiguated entities with an assumption that the hidden state vector of GRUs could summarize the information about the already disambiguated entities based on the characteristics of RNN. The topical coherence feature for the candidate entity is calculated as the similarity between the candidate entity embedding and the current hidden state vector of GRUs. Similarly, Fang et al. [fang2019joint] utilized an LSTM as the encoder instead. Moreover, to generate a topical coherence feature vector for each candidate entity, Yang et al. [yang2018collective] first calculated the similarities between the candidate entity embedding and the embeddings of already disambiguated entities, and then concatenated the maximal and average similarity score to form the topical coherence feature vector.

Some EL methods [cao2017bridge, fang2019joint, yamada2020global, gu2021read] used a simple-to-complex strategy, i.e., they first linked entity mentions that are easier to disambiguate and then utilized information provided by already disambiguated entities to generate topical coherence features for subsequent entity mentions that are more difficult to disambiguate. Yang et al. [yang2019learning] proposed the dynamic context augmentation (DCA), whose basic idea is to accumulate knowledge from already disambiguated entities as a dynamic context to enhance later decisions. They applied an attention mechanism on DCA to adjust the weights of already disambiguated entities. The topical coherence feature for the candidate entity is obtained by summing up the similarity scores between the candidate entity embedding and the embeddings of already disambiguated entities. Chen et al. [chen2020improving] also applied DCA in their global model. Gu et al. [gu2021read] utilized already disambiguated entities to enhance the later decisions through a dynamic multi-turn way. Specifically, once an entity mention $m_j$ is linked, the context for identifying a later entity mention $m$ is updated by replacing $m_j$ with its assigned mapping entity $\hat{e}_j$. Then the updated context is leveraged to generate the topical coherence feature for $m$ using the BERT [devlin2019bert] encoder in each turn. To alleviate error propagation caused by falsely linked entities, a gate mechanism is introduced on the current and historical representations to control which part of history cues should be inherited.
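The sequential idea, in its simplest topic-vector form, can be sketched as follows. This is an illustrative simplification and not the implementation of any cited system: adding the local score and the coherence score with equal weight, using cosine similarity, and the toy entity names, embeddings, and local scores are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def link_sequentially(mentions, emb):
    """Link mentions in document order while maintaining a topic vector.

    `mentions` is a list of dicts mapping each candidate entity to its
    local score. The topic vector is the sum of the embeddings of the
    entities linked so far, and the coherence of a candidate is its
    similarity to that vector.
    """
    dim = len(next(iter(emb.values())))
    topic = [0.0] * dim
    linked = []
    for local_scores in mentions:
        best = max(
            local_scores,
            key=lambda e: local_scores[e] + cosine(emb[e], topic),
        )
        linked.append(best)
        # Update the topic vector once the mention has been linked.
        topic = [t + x for t, x in zip(topic, emb[best])]
    return linked


# Hypothetical embeddings and local scores for two mentions.
emb = {
    "Michael_Jordan": [1.0, 0.0],
    "Jordan_(country)": [0.0, 1.0],
    "Chicago_Bulls": [0.9, 0.1],
    "Bull_(animal)": [0.2, 0.8],
}
mentions = [
    {"Chicago_Bulls": 0.9, "Bull_(animal)": 0.1},
    {"Michael_Jordan": 0.4, "Jordan_(country)": 0.5},
]
result = link_sequentially(mentions, emb)
```

In this toy document the second mention is locally ambiguous (the country has the slightly higher local score), and the topic vector built from the first decision tips the choice toward the basketball player. The same mechanism also shows how an early mistake would propagate to later decisions.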

In addition, Phan et al. [phan2017neupl, phan2018pair] proposed pair-linking, which is based on an assumption that each candidate entity only needs to be coherent with another candidate entity. They selected the most compatible candidate entity for the candidate entity in Equation 15 from candidate entities of all entity mentions in the document as the assigned mapping entity. The topical coherence feature of the candidate entity could be computed based on local features of the pair of the candidate entity and the assigned mapping entity as well as the similarity between them.

5.5.2 Optimized Methods

Different from simplified methods, optimized methods do not select a temporary mapping entity but keep all candidate entities for the entity mention in the mention set when computing the topical coherence feature. Thus, optimizing the topical coherence objective is intractable due to the large number of candidate entities. Optimized methods usually leverage graph-based approaches to perform optimization as they can dynamically capture the global interdependence between different candidate entities in the document and learn the topical coherence feature for each candidate entity. We will introduce some optimized methods in detail as follows.

We first introduce three EL methods based on Graph Neural Network (GNN) to automatically decide the relevant candidate entities of entity mentions in the mention set $M$ and then generate a topical coherence feature for each candidate entity. Cao et al. [cao2018neural] applied GCN [kipf2016semi], whose basic idea is to enhance the feature of a node according to its neighbor nodes, to integrate topical coherence information. The entity mentions around the entity mention in a pre-defined window are regarded as the mention set $M$. To obtain the topical coherence feature for each candidate entity, they constructed a graph for the candidate entity by taking this candidate entity and all candidate entities of entity mentions in the mention set as nodes and relations between them extracted from a KB as edges. Then they applied a GCN on this graph to aggregate information from linked nodes to the candidate entity node. Given the hidden state $\mathbf{H}^{(l-1)}$ of the $(l-1)$-th layer, graph convolution is operated to generate the hidden state $\mathbf{H}^{(l)}$ as follows:

$$\mathbf{H}^{(l)} = \sigma\left(\tilde{\mathbf{A}} \, \mathbf{H}^{(l-1)} \mathbf{W}^{(l)} + \mathbf{b}^{(l)}\right) \tag{18}$$

where $\tilde{\mathbf{A}}$ is a normalized adjacency matrix of the input graph with self-connection, $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ are the weights and bias in the $l$-th layer respectively, and $\sigma$ represents a non-linear activation function. After $L$ times of the graph convolution, the hidden state $\mathbf{H}^{(L)}$ integrates information from both the candidate entity and its neighbor nodes, and it is treated as a topical coherence feature of the candidate entity.
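A single graph convolution of the form in Equation 18 can be written in a few lines of plain Python. The sketch assumes the symmetric normalization of Kipf and Welling [kipf2016semi] for the adjacency matrix and ReLU as the activation; both choices, along with the function names and the toy two-node graph, are assumptions for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalize_adjacency(adj):
    """Add self-connections, then apply D^{-1/2} (A + I) D^{-1/2}."""
    n = len(adj)
    a_self = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_self]
    return [
        [a_self[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


def gcn_layer(a_norm, hidden, weights, bias):
    """One graph convolution: ReLU(A_norm @ H @ W + b)."""
    z = matmul(matmul(a_norm, hidden), weights)
    return [[max(0.0, z[i][j] + bias[j]) for j in range(len(bias))] for i in range(len(z))]


# Toy graph: two candidate entity nodes joined by one KB relation.
adjacency = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0]]   # initial node features
identity = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for trained weights
hidden = gcn_layer(normalize_adjacency(adjacency), features, identity, [0.0, 0.0])
```

After one layer each node's representation is a mixture of its own features and those of its neighbor; stacking layers widens the neighborhood from which a candidate entity node collects evidence.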

Similarly, Wu et al. [wu2020dynamic] leveraged a dynamic GCN learning paradigm to calculate the topical coherence feature. The graph structure is dynamically computed and modified by a graph weight determinator during training in each layer, which can capture topical coherence better than a GCN with a fixed graph structure. Fang et al. [fang2020high] utilized a Graph Attention Network (GAT) model to encode the global topical coherence information of the candidate entity. It is noted that GAT is a typical dynamic GCN, which can dynamically change nodes and edges according to the current state. Moreover, with the help of the masked self-attention layers, GAT can implicitly assign different importance to different neighbor nodes and capture the relevance between candidate entities well.

In addition, Ganea and Hofmann [ganea2017deep] formulated this problem based on a binary Conditional Random Field (CRF) model by taking candidate entities of all entity mentions in the document as nodes. Considering that exact maximum a posteriori inference on this CRF is NP-hard, they adopted loopy belief propagation (LBP) as an approximate inference method, running a fixed number of message-passing iterations to produce a marginal probability for each candidate entity as its topical coherence feature. This optimized method is followed by several DL based EL methods [le2018improving, le2019boosting, chen2020improving, van2020rel, hou2020improving, hu2020graph].

6 Algorithm

The algorithm takes the features introduced in the above section as input and outputs the final entity linking result. Specifically, the algorithm aims to select the target mapping entity for each entity mention based on the local and global features of its candidate entities. Traditional EL methods used algorithms such as linear models [ratinov2011local, shen2012linden, shen2012liege], support vector machines [chen2011collaborative, pilz2011names], logistic classifiers [fang2016entity, monahan2011cross], and Naïve Bayes classifiers [varma2009iiit] to select the mapping entity. Recently, DL based EL methods that we focus on in this survey used algorithms such as MLP, PageRank, graph regularization, and reinforcement learning. In this section, we introduce such algorithms by summarizing them into three groups, namely, MLP, graph-based algorithms, and RL.

6.1 MLP

For each candidate entity $e$ of the entity mention $m$, an MLP transforms its local and global features into a ranking score $s(m, e)$ via a learned non-linear transformation. We have introduced its basic structure in Section 3.1. Some DL based EL methods [wu2020dynamic, le2019distant, cao2018neural, le2019boosting, eshel2017named, sil2017neural] leveraged a single-layer perceptron to encode features to produce a ranking score for the candidate entity. For each entity mention, the candidate entity with the highest ranking score is chosen as the output mapping entity. Additionally, several EL works [van2020rel, chen2020improving, hou2020improving, sevgili2019improving, fang2020high, yang2019learning, le2018improving, kolitsas2018end, gillick2019learning, mueller2018effective, ganea2017deep, shahbazi2019entity, tang2021bidirectional, chen2021lightweight, hu2020graph, adjali2020ecir] leveraged an MLP instead.

An MLP can be trained with the stochastic gradient descent algorithm and requires a loss function to guide the parameter learning. Loss functions evaluate how well the algorithm models the given data: if the predicted mapping results deviate too much from the gold mapping results, the loss takes a large value. Thus, the loss needs to be minimized during the training of the MLP. There are two widely used loss functions, which we introduce briefly.

A large quantity of MLP methods [fang2020high, van2020rel, chen2020improving, wu2020dynamic, hou2020improving, le2019distant, yang2019learning, le2018improving, kolitsas2018end, ganea2017deep, le2019boosting, chen2021lightweight, hu2020graph] leveraged the max-margin loss, which tries to make the ranking score of the gold mapping entity higher than the ranking scores of the other candidate entities by a safety margin. The max-margin loss of a training instance is defined as follows:

$$L(m, e^{+}, e^{-}) = \max\left(0, \; \gamma - s(m, e^{+}) + s(m, e^{-})\right) \tag{19}$$

where $s(m, e)$ denotes the ranking score of the candidate entity $e$ given the entity mention $m$, and $\gamma$ is the safety margin. Each training instance pairs a positive gold mapping entity $e^{+}$ with a negative entity $e^{-}$, with the purpose of making the ranking score of $e^{+}$ at least a margin $\gamma$ larger than that of $e^{-}$.
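The max-margin loss of Equation 19 is small enough to state directly in code. The function name, argument names, and the example scores and margin are illustrative assumptions.

```python
def max_margin_loss(score_gold, score_negative, margin):
    """Hinge-style ranking loss for one (gold, negative) pair.

    The loss is zero once the gold entity outscores the negative
    entity by at least `margin`, and grows linearly otherwise.
    """
    return max(0.0, margin - score_gold + score_negative)


# The gold entity wins by more than the margin: no loss.
satisfied = max_margin_loss(score_gold=0.9, score_negative=0.2, margin=0.5)
# The gold entity wins, but by less than the margin: positive loss.
violated = max_margin_loss(score_gold=0.4, score_negative=0.3, margin=0.5)
```

Because pairs that already satisfy the margin contribute zero loss, training effort concentrates on the negative entities that are still ranked too close to the gold entity.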

Cross-entropy loss is another loss function utilized by some MLP methods [cao2018neural, eshel2017named, gillick2019learning, tang2021bidirectional, adjali2020ecir]. The binary cross-entropy loss which increases when the predicted label diverges from the actual label is defined as follows:

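$$L(m, e) = -\left[ y \log s(m, e) + (1 - y) \log\left(1 - s(m, e)\right) \right] \tag{20}$$

where $y$ denotes the actual label of the candidate entity $e$. If the candidate entity is the gold mapping entity for the entity mention $m$, the value of $y$ is $1$; otherwise $0$. $s(m, e)$ indicates the ranking score of the candidate entity $e$ given the entity mention $m$, which must lie in $(0, 1)$ for the logarithms to be defined.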
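The binary cross-entropy loss of Equation 20 can be sketched as follows. The clipping constant `eps`, which keeps the logarithm finite at scores of exactly 0 or 1, is an implementation assumption and not part of the definition above.

```python
import math


def binary_cross_entropy(label, score, eps=1e-12):
    """Binary cross-entropy for one candidate entity.

    `label` is 1 for the gold mapping entity and 0 otherwise;
    `score` is the predicted probability of being the gold entity.
    """
    score = min(max(score, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(score) + (1 - label) * math.log(1.0 - score))


confident_right = binary_cross_entropy(1, 0.9)
confident_wrong = binary_cross_entropy(1, 0.1)
uncertain = binary_cross_entropy(1, 0.5)
```

The loss grows without bound as a confident prediction moves toward the wrong label, which is the sense in which it increases when the predicted label diverges from the actual label.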

6.2 Graph-Based Algorithms

In general, graph-based algorithms are usually leveraged by collective EL methods, which make decisions on mapping entities jointly for all entity mentions in the same document. Specifically, for each document containing a set of entity mentions, the graph-based algorithm first needs to construct a graph by taking the candidate entities of all entity mentions in the document as nodes and similarity scores between candidate entities as edge weights. Next, a graph-based ranking algorithm is performed on this graph to assign a ranking score to each candidate entity, which represents its degree of importance in the overall graph structure. Finally, for each entity mention, the candidate entity with the highest ranking score is chosen as the output mapping entity. In the following, we introduce graph-based algorithms leveraged by DL based EL methods in detail.

Zwicklbauer et al. [zwicklbauer2016robust] applied PageRank on a graph which is composed of the candidate entities of all entity mentions in the same document and a topic vector node introduced in Section 5.5.1. The transition matrix of the graph describes the likelihood of walking from a node to the adjacent node and is calculated as the harmonic mean between two nodes’ local scores. They employed the prior popularity feature introduced in Section 5.1 as a jump probability for each candidate entity node. Ultimately, they applied the PageRank algorithm over the constructed graph to compute a ranking score for each candidate entity.
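A PageRank computation of this general shape can be sketched with power iteration. This is a simplified illustration and not the cited system: the topic vector node is omitted, edges are only placed between candidates of different mentions, the jump probabilities are normalized into a teleport distribution, and the damping factor, iteration count, and all toy scores are assumptions.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores."""
    return 2.0 * a * b / (a + b) if a + b > 0 else 0.0


def candidate_pagerank(local_scores, mention_of, jump, damping=0.85, iterations=50):
    """Personalized PageRank over a graph of candidate entities.

    local_scores[i]: local score of candidate node i
    mention_of[i]:   index of the mention that node i belongs to
    jump[i]:         unnormalized jump (teleport) weight of node i
    """
    n = len(local_scores)
    # Edge weight: harmonic mean of the two nodes' local scores.
    weights = [
        [
            harmonic_mean(local_scores[i], local_scores[j])
            if mention_of[i] != mention_of[j]
            else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]
    # Row-normalize into a transition matrix (uniform row if isolated).
    transition = []
    for row in weights:
        total = sum(row)
        transition.append([w / total if total > 0 else 1.0 / n for w in row])
    jump_total = sum(jump)
    teleport = [j / jump_total for j in jump]
    rank = list(teleport)
    for _ in range(iterations):
        rank = [
            (1.0 - damping) * teleport[j]
            + damping * sum(rank[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
    return rank


# Two mentions with two candidates each (hypothetical scores).
ranks = candidate_pagerank(
    local_scores=[0.9, 0.1, 0.8, 0.2],
    mention_of=[0, 0, 1, 1],
    jump=[0.7, 0.3, 0.6, 0.4],
)
```

For each mention, the candidate with the highest rank would then be chosen as the output mapping entity.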

Xue et al. [xue2019neural] introduced recurrent random-walk layers for collective EL, which reinforce the evidence for related EL decisions into high probability decisions with the help of an external KB. The graph constructed for each document contains the candidate entities of all entity mentions in the document. To define the transition matrix between candidate entities, they calculated a relevance score for each pair of candidate entities by summing up their semantic similarity score and the WLM [milne2008learning] score based on Wikipedia. By introducing random-walk layers, they can propagate evidence over several steps and produce a ranking score for each candidate entity.

| Data set (Abbreviation) | Genre | Year | KB | # M. | # D. | # M./D. | URL |
|---|---|---|---|---|---|---|---|
| MSNBC [cucerzan2007large] | news | 2007 | Wikipedia | 656 | 20 | 32.80 | https://cogcomp.seas.upenn.edu/page/resource_view/4 |
| AQUAINT (AQ) [milne2008learning] | news | 2008 | Wikipedia | 449 | 50 | 8.98 | http://community.nzdl.org/wikification/docs.html |
| TAC-KBP2010 (KBP) [ji2011knowledge] | news, blogs | 2010 | Wikipedia | 3750 | 3684 | 1.02 | https://tac.nist.gov/2010/KBP/data.html |
| AIDA-CoNLL (AIDA) [hoffart2011robust] | news | 2011 | YAGO/Freebase/Wikipedia | 34587 | 1393 | 24.83 | http://resources.mpi-inf.mpg.de/yago-naga/aida/download/ |
| ACE2004 (ACE) [ratinov2011local] | news | 2011 | Wikipedia | 257 | 35 | 7.34 | https://cogcomp.seas.upenn.edu/page/resource_view/4 |
| KORE50 (KORE) [hoffart2012kore] | tweets | 2012 | YAGO/DBpedia | 148 | 50 | 2.96 | http://resources.mpi-inf.mpg.de/yago-naga/aida/download/ |
| N3-RSS500 (RSS) [roder2014n3] | RSS-feeds | 2014 | DBpedia | 1000 | 500 | 2.00 | http://aksw.org/Projects/N3NERNEDNIF.html |
| N3-Reuters128 (Reuters) [roder2014n3] | news | 2014 | DBpedia | 881 | 128 | 6.88 | http://aksw.org/Projects/N3NERNEDNIF.html |
| WNED-CWEB (CW) [guo2018robust] | news | 2016 | Wikipedia | 11154 | 320 | 34.86 | https://doi.org/10.7939/DVN/10968 |
| WNED-WIKI (WI) [guo2018robust] | news | 2016 | Wikipedia | 6821 | 320 | 21.32 | https://doi.org/10.7939/DVN/10968 |

TABLE II: List of ten widely used data sets for entity linking. “# M.” refers to the number of entity mentions, “# D.” refers to the number of documents, and “# M./D.” refers to the number of entity mentions per document in the data set.

In addition, Huang et al. [huang2015leveraging] leveraged graph regularization to perform collective inference. They constructed a graph for each document, in which each node represents a pair of an entity mention and one of its candidate entities. A weighted edge is added between two nodes if they satisfy certain constraints, and the weight is computed as the semantic similarity between the two candidate entities. They first initialized a ranking score for each candidate entity based on the linear combination of local features. Some nodes with a high prior popularity feature (introduced in Section 5.1) are regarded as labeled seed nodes, which remain unchanged during the graph regularization. Then graph regularization is applied on this graph to refine the ranking scores of unlabeled nodes by means of the labeled seed nodes, with an assumption that two strongly connected nodes should have similar ranking scores.

6.3 Reinforcement Learning

Reinforcement learning (RL) is an area of ML concerned with how software agents ought to perform discrete actions in an environment according to a policy, which is trained to maximize some cumulative rewards [kaelbling1996reinforcement]. Using RL as a ranking algorithm for entity linking, a set of candidate entities are selected by agents as the entity mapping results which can maximize the sum of expected rewards.

Fang et al. [fang2019joint] and Yang et al. [yang2019learning] regarded the EL task as a sequential decision problem and leveraged an RL algorithm to obtain the entity mapping results. In their models, the agent is designed as a policy network which can learn a stochastic policy and prevent the agent from getting stuck at an intermediate state. Under the guidance of the policy network, the agent decides which action (i.e., choosing the target mapping entity from candidate entities) should be taken at each state (i.e., current local and global encoding). After performing all decisions in the episode, each action will get an expected reward and the goal is to maximize the total expected rewards $J(\Theta)$, which is defined as follows:

$$J(\Theta) = \sum_{t} \sum_{a_t} \pi_{\Theta}(a_t \mid s_t) \, R(a_t) \tag{21}$$

where $\pi_{\Theta}(a \mid s)$ is the policy network indicating the probability of taking the action $a$ under the state $s$, and $R(a_t)$ is the expected reward of the action at the $t$-th time step. To compute the expected reward $R(a_t)$, an indicator records whether the current action is correct or not, and two counts give the number of correct actions and that of wrong ones from time $t$ to the end of the episode respectively. Accordingly, RL could explore the long-term influence of current selection on subsequent decisions.

7 Data Sets and Evaluation

In this section, we first introduce several widely used real-world EL data sets, tools, and evaluation metrics. Then we give a quantitative performance analysis of representative DL based EL methods.

7.1 Data Sets and Tools

Lots of data sets with different properties (e.g., genre, year, KB, and the number of entity mentions per document) have been used to evaluate EL systems. Table II shows an overview of ten well-known public EL data sets used by DL based EL methods. We introduce these data sets in detail as follows:

  • MSNBC [cucerzan2007large] is annotated from MSNBC news articles and contains 20 documents from 10 different domains (i.e., two documents per domain).

  • AQUAINT [milne2008learning] contains 50 documents collected from the Xinhua News Service, New York Times, and the Associated Press. Each document contains about 250 to 300 words, where the first entity mention of an entity is manually annotated to Wikipedia.

  • TAC-KBP2010 [ji2011knowledge] contains news and blogs from various agencies. It was constructed for the TAC conference and contains only approximately one entity mention per document, which makes it unsuitable for modeling the topical coherence feature.

  • AIDA-CoNLL [hoffart2011robust] is an annotated corpus of Reuters news documents. It contains many more documents than other existing EL data sets and was manually linked to KBs by its authors. To train and test EL methods, the data set is often divided into three parts: AIDA-Train for training, AIDA-A for validation, and AIDA-B for testing.

  • ACE2004 [ratinov2011local] is a subset of ACE2004 [doddington2004automatic] coreference documents annotated using Amazon Mechanical Turk (https://www.mturk.com/).

  • KORE50 [hoffart2012kore] is extracted from a microblogging platform (i.e., Twitter). It contains 50 short documents (i.e., tweets) on various domains (e.g., music, business, sports, and celebrities). Each tweet consists of a few sentences with some ambiguous entity mentions, and most entity mentions are first names referring to persons with a high level of ambiguity.

  • N3-RSS500 [roder2014n3] is created using a data set of RSS-feeds (i.e., short formal documents), which are from major international newspapers. The data set covers a wide range of domains, such as world, business, and science.

  • N3-Reuters128 [roder2014n3] contains 128 economic news documents extracted from the Reuters-21578 corpus (http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html). Both N3-RSS500 and N3-Reuters128 are manually annotated by Röder et al. [roder2014n3].

  • WNED-CWEB [guo2018robust] is randomly picked from the FACC1 [Gabrilovich2013FACC1] data set, which provides annotations of mention-entity pairs for ClueWeb 2012 data (http://lemurproject.org/clueweb12).

  • WNED-WIKI [guo2018robust] is crawled from Wikipedia pages with its original anchor text annotations. Both WNED-CWEB and WNED-WIKI are automatically extracted by Guo and Barbosa [guo2018robust] and are relatively large with a large number of documents.

| EL tool | URL |
| --- | --- |
| StanfordCoreNLP | https://stanfordnlp.github.io/CoreNLP/entitylink.html |
| spaCy | https://spacy.io/api/entitylinker |
| TAGME | https://services.d4science.org/web/tagme/tagme-help |
| Wikipedia Miner | http://community.nzdl.org/wikification/ |
| DBpedia Spotlight | https://www.dbpedia-spotlight.org/ |
| Babelfy | http://babelfy.org/ |
| AGDISTIS | https://github.com/dice-group/AGDISTIS |
| WAT | https://services.d4science.org/web/tagme/wat-api |
| FEL | https://github.com/yahoo/FEL |
| REL | https://github.com/informagi/REL |

TABLE III: Off-the-shelf EL tools.

| Model | MSNBC | AQ | KBP | AIDA | ACE | KORE | RSS | Reuters | CW | WI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| He et al. (ACL 2013) [he2013learning] | - | - | 81.0 | 85.6 | - | - | - | - | - | - |
| Sun et al. (IJCAI 2015) [sun2015modeling] | - | - | 83.9 | - | - | - | - | - | - | - |
| DSRM (arXiv 2015) [huang2015leveraging] | - | - | - | 86.6 | - | - | - | - | - | - |
| EDKate (CoNLL 2016) [fang2016entity] | 75.5 | 85.2 | 88.9 | - | 80.8 | - | - | - | - | - |
| Zwicklbauer et al. (SIGIR 2016) [zwicklbauer2016robust] | 91.1 | 84.2 | - | 78.4 | 90.7 | - | - | - | - | - |
| Francis-Landau et al. (NAACL 2016) [francis2016capturing] | - | 89.9 | - | 85.5 | - | - | - | - | - | - |
| Nguyen et al. (COLING 2016) [nguyen2016joint] | - | - | - | 87.2 | 89.7 | - | - | - | - | - |
| Globerson et al. (ACL 2016) [globerson2016collective] | - | - | 87.2 | 92.7 | - | - | - | - | - | - |
| Yamada et al. (CoNLL 2016) [yamada2016joint] | - | - | 85.5 | 93.1 | - | - | - | - | - | - |
| Gupta et al. (EMNLP 2017) [gupta2017entity] | - | - | - | 82.9 | 90.7 | - | - | - | - | - |
| Eshel et al. (CoNLL 2017) [eshel2017named] | - | - | - | 87.3 | - | - | - | - | - | - |
| Deep-ED (EMNLP 2017) [ganea2017deep] | 93.7 | 88.5 | - | 92.2 | 88.5 | - | - | - | 77.9 | 77.5 |
| NeuPL (CIKM 2017) [phan2017neupl] | 91.8 | - | - | - | 92.9 | 79.4 | 80.0 | 91.6 | - | - |
| NCEL (COLING 2018) [cao2018neural] | - | 87.0 | 91.0 | 80.0 | 88.0 | - | - | - | - | 86.0 |
| Kolitsas et al. (CoNLL 2018) [kolitsas2018end] | 86.4 | - | - | 83.1 | - | 60.8 | 68.6 | 67.3 | - | - |
| SGTB-BiBSG (NAACL 2018) [yang2018collective] | 92.6 | 89.9 | - | 93.0 | 88.5 | - | - | - | 81.8 | 79.2 |
| MR-Deep-ED (ACL 2018) [le2018improving] | 93.9 | 88.3 | - | 93.1 | 89.9 | - | - | - | 77.5 | 78.0 |
| Sil et al. (AAAI 2018) [sil2017neural] | - | - | 87.4 | 94.0 | - | - | - | - | - | - |
| DeepType (AAAI 2018) [raiman2018deeptype] | - | - | 90.9 | 94.9 | - | - | - | - | - | - |
| Le and Titov (ACL 2019) [le2019distant] | - | - | - | 81.5 | - | - | - | - | - | - |
| Gillick et al. (CoNLL 2019) [gillick2019learning] | - | - | 87.0 | - | - | - | - | - | - | - |
| Le and Titov (ACL 2019) [le2019boosting] | 92.2 | 90.7 | - | 89.7 | 88.1 | - | - | - | 78.2 | 81.7 |
| RRWEL (IJCAI 2019) [xue2019neural] | 94.4 | 91.9 | - | 92.4 | 90.6 | - | - | - | 79.7 | 85.5 |
| E-ELMo (arXiv 2019) [shahbazi2019entity] | 92.3 | 90.1 | 88.3 | 93.5 | 88.7 | - | - | - | 78.4 | 79.8 |
| RLEL (WWW 2019) [fang2019joint] | 92.8 | 87.5 | - | 94.3 | 91.2 | - | - | - | 78.5 | 82.8 |
| DCA-RL (EMNLP 2019) [yang2019learning] | 93.8 | 88.3 | - | 93.7 | 90.1 | - | - | - | 75.6 | 78.8 |
| DCA-SL (EMNLP 2019) [yang2019learning] | 94.6 | 87.4 | - | 94.6 | 89.4 | - | - | - | 73.5 | 78.2 |
| SeqGAT (WWW 2020) [fang2020high] | 80.0 | 88.0 | - | 83.0 | 89.0 | 68.0 | 68.0 | 71.0 | - | - |
| REL (SIGIR 2020) [van2020rel] | 85.8 | - | - | 84.0 | - | 54.0 | 64.1 | 64.9 | - | - |
| ET4EL (AAAI 2020) [onoe2020fine] | - | - | - | 85.9 | - | - | - | - | - | - |
| FGS2EE (ACL 2020) [hou2020improving] | 94.2 | 88.5 | - | 92.6 | 90.7 | - | - | - | 77.4 | 77.8 |
| DGCN (WWW 2020) [wu2020dynamic] | 92.5 | 89.4 | - | 93.1 | 90.6 | - | - | - | 81.2 | 77.6 |
| Chen et al. (AAAI 2020) [chen2020improving] | - | 89.8 | - | 93.4 | - | - | - | - | 77.9 | 80.1 |
| GNED (KBS 2020) [hu2020graph] | 95.5 | 91.6 | - | 92.4 | 90.1 | - | - | - | 77.5 | 78.5 |
| BLINK (EMNLP 2020) [wu2020scalable] | - | - | 94.5 | - | - | - | - | - | - | - |
| Yamada et al. (arXiv 2020) [yamada2020global] | 96.3 | 93.5 | - | 95.0 | 91.9 | - | - | - | 78.9 | 89.1 |
| M3 (AAAI 2021) [gu2021read] | - | - | - | - | - | 74.3 | - | - | - | - |
| CHOLAN (EACL 2021) [ravi2021cholan] | 83.4 | 76.8 | - | 85.7 | 86.8 | - | - | - | - | - |

TABLE IV: The performance of representative DL based EL methods on various data sets taken from both their original papers and the GERBIL platform. The best results are in boldface and the second-best results are underlined. Due to the limited space, in the table header we show the abbreviations of the data sets which are defined in the first column of Table II.

Table II shows that each data set has its own characteristics. For example, KORE50 [hoffart2012kore] and N3-RSS500 [roder2014n3] emphasize entity linking over short documents, which are composed of tweets and RSS-feeds respectively, while the other data sets are constructed from relatively long news documents. Moreover, besides being used for testing, some data sets, particularly larger ones such as AIDA-CoNLL [hoffart2011robust] and TAC-KBP2010 [ji2011knowledge], can also be used for training, whereas data sets such as MSNBC [cucerzan2007large] and ACE2004 [ratinov2011local], which contain only a few documents, are generally not used for training. Researchers can therefore select appropriate data sets for training or testing based on the characteristics of their own EL systems.

Many off-the-shelf entity linking tools are publicly available online. Table III summarizes popular EL tools and their URLs.

7.2 Evaluation Metrics

Evaluation metrics are used to evaluate the performance of EL systems on data sets. GERBIL [usbeck2015gerbil] is a benchmark entity annotation platform that provides a unified comparison among different EL systems across various data sets and metrics. Currently, GERBIL offers six metrics and subdivides them into two groups, namely the macro- and the micro- groups of precision, recall, and F1-measure. The macro- metric is the average of the corresponding metric over each document $d$ in the data set $D$, while the micro- metric takes into account all annotations together, thus giving more importance to documents with more entity mentions [cornolti2013framework]. The majority of DL based EL methods select micro- metrics for evaluation, whereas some EL works [onoe2020fine, chen2020improving, he2013learning, cao2017bridge, huang2015leveraging, cao2018neural, yamada2016joint] utilize macro- metrics as well. Let $G$ and $G_d$ be the gold mapping entity annotations associated with all the entity mentions in the data set $D$ and with the entity mentions in the document $d$, respectively. Let $S$ and $S_d$ be the output entity annotations of some EL system associated with all the entity mentions in the data set $D$ and with the entity mentions in the document $d$, respectively. We give the definitions of the evaluation metrics, i.e., precision, recall, and F1-measure of both macro- and micro- groups as follows:

$$
\begin{aligned}
P_{mic} &= \frac{|G \cap S|}{|S|}, & P_{mac} &= \frac{1}{|D|} \sum_{d \in D} \frac{|G_d \cap S_d|}{|S_d|}, \\
R_{mic} &= \frac{|G \cap S|}{|G|}, & R_{mac} &= \frac{1}{|D|} \sum_{d \in D} \frac{|G_d \cap S_d|}{|G_d|}, \\
F1_{mic} &= \frac{2 \cdot P_{mic} \cdot R_{mic}}{P_{mic} + R_{mic}}, & F1_{mac} &= \frac{2 \cdot P_{mac} \cdot R_{mac}}{P_{mac} + R_{mac}}
\end{aligned}
\tag{22}
$$

The precision is the fraction of the entity mentions linked by the EL system that are correctly linked, and the recall is the fraction of the entity mentions that should be linked that are correctly linked. To take both into consideration, the F1-measure combines them and is defined as the harmonic mean of the precision and recall. In addition, the accuracy refers to the ratio of the number of entity mentions that are correctly linked to the total number of entity mentions in the data set, and it is defined as follows:

$$
Acc = \frac{|G \cap S|}{|G|}
\tag{23}
$$
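The metrics can be computed with a few lines of code. The sketch below is an illustration rather than GERBIL's implementation: it assumes annotations are stored per document as sets of (mention, entity) pairs, and it computes macro-F1 from the macro precision and macro recall (averaging per-document F1 scores is another convention in use). The function name `evaluate` and the toy data are hypothetical.

```python
from typing import Dict, Set, Tuple

Annotation = Tuple[str, str]  # (mention id, linked entity id)


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator > 0 else 0.0


def _f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return _ratio(2 * precision * recall, precision + recall)


def evaluate(
    gold: Dict[str, Set[Annotation]], system: Dict[str, Set[Annotation]]
) -> Dict[str, float]:
    """Compute micro/macro precision, recall, F1, and accuracy over a data set."""
    docs = list(gold)
    correct = {d: len(gold[d] & system.get(d, set())) for d in docs}
    n_correct = sum(correct.values())
    n_system = sum(len(system.get(d, set())) for d in docs)
    n_gold = sum(len(gold[d]) for d in docs)

    # Micro: pool all annotations, so mention-rich documents weigh more.
    p_mic = _ratio(n_correct, n_system)
    r_mic = _ratio(n_correct, n_gold)

    # Macro: score each document separately, then average over documents.
    p_mac = sum(_ratio(correct[d], len(system.get(d, set()))) for d in docs) / len(docs)
    r_mac = sum(_ratio(correct[d], len(gold[d])) for d in docs) / len(docs)

    return {
        "p_mic": p_mic, "r_mic": r_mic, "f1_mic": _f1(p_mic, r_mic),
        "p_mac": p_mac, "r_mac": r_mac, "f1_mac": _f1(p_mac, r_mac),
        "accuracy": _ratio(n_correct, n_gold),
    }


gold = {
    "d1": {("m1", "E1"), ("m2", "E2"), ("m3", "E3"), ("m4", "E4")},
    "d2": {("m1", "E5")},
}
system = {
    "d1": {("m1", "E1"), ("m2", "E2"), ("m3", "E3"), ("m4", "X")},
    "d2": {("m1", "Y")},
}
scores = evaluate(gold, system)
```

In this toy data set, the document with four mentions dominates the micro scores, while the single-mention document that is linked incorrectly pulls the macro scores down, which illustrates the difference between the two groups of metrics.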

7.3 Performance Analysis

Table IV presents the performance of representative DL based EL methods on data sets introduced in Section 7.1. Most recent EL works used the GERBIL [usbeck2015gerbil] platform to report their performance. To ensure correctness, we collect the experimental results from both their original papers and the GERBIL platform. The micro-F1 metric is commonly used by DL based EL methods to report the performance. Therefore, we show micro-F1 scores for all data sets except TAC-KBP2010 [ji2011knowledge] since the micro-accuracy is regarded as the official evaluation metric in the TAC-KBP track.

Table IV shows that DL based EL methods have achieved the state-of-the-art performance on all data sets, which demonstrates the effectiveness of DL. Specifically, Yamada et al. [yamada2020global] achieve the best results on four data sets (i.e., MSNBC [cucerzan2007large], AQUAINT [milne2008learning], AIDA-CoNLL [hoffart2011robust], and WNED-WIKI [guo2018robust]), Wu et al. [wu2020scalable] perform best on TAC-KBP2010 [ji2011knowledge], NeuPL [phan2017neupl] leads on four data sets (i.e., ACE2004 [ratinov2011local], KORE50 [hoffart2012kore], N3-RSS500 [roder2014n3], and N3-Reuters128 [roder2014n3]), and SGTB-BiBSG [yang2018collective] obtains the best result on WNED-CWEB [guo2018robust].

Table IV also shows that Transformer-based EL systems (e.g., Yamada et al. [yamada2020global], Wu et al. [wu2020scalable], and Chen et al. [chen2020improving]) achieve strong performance on many data sets, as they are pre-trained on huge corpora and can generate more sophisticated long-distance feature representations for entity linking. The good performance of models leveraging type information (e.g., DeepType [raiman2018deeptype], FGS2EE [hou2020improving], and Chen et al. [chen2020improving]) demonstrates the effectiveness of type information and points out a promising direction for entity linking. Notably, DeepType [raiman2018deeptype], which leverages only the prior popularity feature and the type similarity feature for linking, achieves the second-best result on AIDA-CoNLL, a striking result given how few features it uses.

It is also worth noting that no EL system achieves the best results on all data sets, owing to the different characteristics of the data sets, such as the document genre, the document length, and the number of entity mentions per document. That is, the best EL method on one data set may perform poorly on other data sets. For example, the Transformer-based EL method of Yamada et al. [yamada2020global] obtains four best results and one second-best result over five data sets, but does not perform well on WNED-CWEB since documents in this data set are significantly longer than documents in other data sets. The average document in WNED-CWEB is much longer than the maximum input length that BERT can deal with (i.e., 512 tokens). In contrast, SGTB-BiBSG [yang2018collective] performs excellently on WNED-CWEB because this model designs various hand-crafted features capturing document-level contextual information. Moreover, NeuPL [phan2017neupl] performs best on data sets containing short documents such as KORE50 and N3-RSS500, since the pair-linking algorithm proposed in the NeuPL model iteratively identifies and resolves pairs of entity mentions without requiring much global information. Some EL methods that are mainly based on the topical coherence feature (e.g., DCA-SL [yang2019learning], Deep-ED [ganea2017deep], and DGCN [wu2020dynamic]) perform well on data sets such as MSNBC and AIDA-CoNLL, which have tens of entity mentions per document, because they capture the global interdependence between candidate entities of entity mentions in a document well.

Overall, the entity linking task is highly data and domain dependent and it is unlikely that a technique dominates all others across all types of data sets. For a given data set for entity linking, we should leverage suitable embeddings, features, and algorithms to obtain advanced results based on the characteristics of the data.

8 Future Directions

Based on the above review and analysis, we believe that there is still much room for further improvement in this field. In this section, we discuss the remaining limitations of existing EL methods and list some directions for further exploration in EL research.

Multi-source heterogeneous text data. In the era of big data, text data have multi-source heterogeneous characteristics. Text may come from diverse data sources in various structures. For example, news documents from news websites are relatively long and formal. Web tables shown in web pages are structured. Queries from search engine logs are often short and noncontiguous. Reviews from e-commerce websites are usually colloquial and noisy. Entities may appear in these multi-source heterogeneous text data, which contain abundant knowledge about them. Bridging the multi-source heterogeneous text data with KBs is beneficial for the understanding of the text data and the enrichment of KBs. Present EL studies mainly focus on linking entity mentions in common text data (e.g., news documents [ganea2017deep, le2018improving, yamada2016joint, yamada2020global], tweets [guo2013link, shen2021toward, hua2015microblog], and web tables [zhang2020novel, ritze2016profiling, bhagavatula2015tabel]). As different types of text data have different characteristics, existing EL methods may not be applicable, or may struggle to achieve satisfactory linking performance, when dealing with other types of text data (e.g., search queries, reviews, community question answering (CQA) text, and open information extraction (OIE) triples). Therefore, it is important to develop EL techniques that link entities in these diverse types of text. Although some works have preliminarily addressed the entity linking task for search queries [tan2017entity, blanco2015fast, cornolti2016piggyback], CQA text [wang2017named], and OIE triples [lin2020kbpearl, liu2021joint] respectively, we believe there are still many opportunities for substantial improvement.

Joint NER and EL. NER is the task of identifying text spans that mention named entities and classifying them into pre-defined categories, such as person, location, and organization [li2020survey]. It serves as a preceding task for entity linking, providing the boundaries of named entities in text. However, detecting the correct entity mentions is challenging, especially for informal, short, noisy text that often contains phrases with ambiguous meanings. Therefore, NER is often the performance bottleneck of EL, since the performance of NER places an upper bound on the performance of EL. Some EL methods are designed to jointly perform NER and EL, which allows each subtask to benefit from the other and alleviates the error propagation that is unavoidable in pipeline settings. For example, Guo et al. [guo2013link] utilized a structured support vector machine algorithm for tweet entity linking that jointly optimizes NER and EL as an end-to-end task. Kolitsas et al. [kolitsas2018end] and Broscheit [broscheit2019investigating] proposed neural end-to-end EL systems that jointly discover and link entities in news documents. Li et al. [li2020efficient] designed an end-to-end EL system used for downstream question answering systems. In summary, we consider it worthwhile to explore effective approaches for jointly performing NER and EL for real applications in the future.

More advanced language models. Neural language models have revolutionized the field of NLP due to their superior expressive power. As introduced earlier, many effective language models have been applied in the field of EL and achieved great success. Recently, many more advanced language models have been developed and made available. For instance, BERT [devlin2019bert], widely leveraged by existing EL works, has been surpassed by several variants and other Transformer-based models, which made major changes to loss functions, model architecture, and pre-training objectives. Specifically, RoBERTa [liu2019roberta] is more robust than BERT; it is trained on much more data and uses dynamic masking rather than static masking. SpanBERT [joshi2020spanbert] extends BERT to better represent and predict text spans. XLNet [yang2019xlnet] is a generalized order-aware autoregressive language model which makes use of a permutation operation to perform better than BERT on many NLP tasks. Moreover, GPT-3 [brown2020language] is the largest pre-trained language model so far, with hundreds of billions of parameters, and has shown great abilities in zero-shot, one-shot, and few-shot settings. In summary, we consider that there is still much room for further improvement of EL by leveraging these more advanced language models in modeling text semantics.

EL model robustness. The robustness of deep learning has received great attention recently. A DL model is considered to be robust if its output label is consistently accurate even if one or more of the input features or assumptions are drastically changed due to unforeseen circumstances. For entity linking, robustness refers to the ability to achieve consistent performance over a wide range of data sets with different properties, such as different domains, text structures, and knowledge bases [zwicklbauer2016robust]. However, most existing DL based EL systems were designed and optimized for a specific domain (e.g., general domain or biomedical domain), for a specific text structure (e.g., news document or tweet), and for a specific knowledge base (e.g., Wikipedia or Freebase). This leads to very specialized models that lack robustness and are applicable only to very specific tasks. Therefore, we strongly believe that robust DL based EL models that generalize well deserve much deeper exploration by the community.

9 Conclusion

Applying deep learning to entity linking has become a popular research topic. In this survey, we give a comprehensive and detailed review of the existing DL based EL methods. We first propose a new taxonomy which categorizes DL techniques for ranking candidate entities along three axes (i.e., embedding, feature, and algorithm). Second, we systematically review the representative DL based EL methods according to the taxonomy. Third, we present ten widely used real-world entity linking data sets and a quantitative performance analysis of DL based EL methods in tabular form. Finally, we discuss some limitations and highlight several future research directions. Through this paper, we hope to demonstrate the progress and remaining problems in existing entity linking research and encourage more improvements in this area.

Acknowledgments

This work was supported in part by National Natural Science Foundation of China under Grant No. U1936206 and 61772289, Natural Science Foundation of Tianjin under Grant No. 19JCQNJC00100, YESS by CAST under Grant No. 2019QNRC001, and CAAI-Huawei MindSpore Open Fund. Jianyong Wang was supported in part by National Key Research and Development Program of China under Grant No. 2020YFA0804503, National Natural Science Foundation of China under Grant No. 61532010 and 61521002, and Beijing Academy of Artificial Intelligence (BAAI).

References