Efficient Entity Candidate Generation for Low-Resource Languages

by   Alberto García-Durán, et al.

Candidate generation is a crucial module in entity linking. It also plays a key role in multiple NLP tasks that have been proven to beneficially leverage knowledge bases. Nevertheless, it has often been overlooked in the monolingual English entity linking literature, as naive approaches obtain very good performance. Unfortunately, the existing approaches for English cannot be successfully transferred to poorly resourced languages. This paper constitutes an in-depth analysis of the candidate generation problem in the context of cross-lingual entity linking with a focus on low-resource languages. Among other contributions, we point out limitations in the evaluation conducted in previous works. We introduce a characterization of queries into types based on their difficulty, which improves the interpretability of the performance of different methods. We also propose a light-weight and simple solution based on the construction of indexes whose design is motivated by more complex transfer learning based neural approaches. A thorough empirical analysis on 9 real-world datasets under 2 evaluation settings shows that our simple solution outperforms the state-of-the-art approach in terms of both quality and efficiency for almost all datasets and query types.


1 Introduction

Candidate generation [15, 28] is the task of retrieving a short list of plausible candidate entities from a knowledge base (KB) for a given mention. This module is usually a fundamental part of a more elaborate pipeline, such as entity linking [7], coreference resolution [26] or question answering [1]. With few exceptions [5], it is present in almost all existing entity linking techniques, where the subsequent entity disambiguation step only has to evaluate a small set of entities from the KB instead of all of them. Therefore, in addition to obtaining a high recall (the fraction of mentions for which the true entity is present in the set of candidates), a candidate generator should add little computational overhead to the full pipeline. Cross-lingual entity linking [20, 17] associates mentions in non-English documents with entities from a KB, typically the English Wikipedia. This poses several challenges that can often be attributed to the amount of resources available for the target language. Candidate generation is deemed trivial in the English entity linking literature, as there exist resources (i.e. indexes/lookup tables [18]) capable of providing high candidate recall. However, this is not the case in the cross-lingual setting, and especially not for low-resource languages (e.g. Karelian or Turkmen), for which, moreover, very few works tackle this important step. Some recent approaches [14, 28] are based on the idea of performing transfer learning from a pivot language that is lexically similar to the target language.

Contributions. In this paper we propose a light-weight solution, called pivot token indexes (pti), to the candidate generation problem with a special focus on low-resource languages. pti combines prior and posterior probabilities learned by aggregating information from the pivot and target language. These probabilities are constructed with tokens the mentions can be decomposed into. This simple solution keeps the efficiency of standard indexes (thereby possessing the desired property of a candidate generator) while being equipped with the sophistication of more complex solutions [28]. Most importantly, the main contribution of this work is an in-depth analysis of the candidate generation problem in the context of entity linking, to which we make several contributions: i) we establish that the (inherited) practice of ignoring mentions that are not linkable to the English Wikipedia leads to an unrealistic evaluation; ii) we perform a realistic evaluation and propose an efficient trick to increase the ceiling recall of any candidate generator; iii) we introduce three query types according to their difficulty based on the available training data; iv) we compare the performance of the different methods and establish differences across them based on the amount of resources of the pivot language and the query type. Moreover, our experimental section includes experiments on 9 datasets and two different learning settings (without and with some supervision from the target language). We also empirically analyze the complexity of pti and conclude that it takes up much less memory, and requires significantly less training and inference time than the current state-of-the-art [28].

Other applications of candidate generators. In general, tasks that can beneficially leverage the knowledge contained in KBs may benefit from better candidate generators. Examples of such tasks are coreference resolution [26] or question answering [1], to name but a few. These works retrieve facts from a KB that are related to a given mention. As in the case of entity linking, candidate generation may alleviate the computational cost of the subsequent modules by just keeping a reduced set of facts.

2 Background

Let $m$ be a mention observed in a document, and let $e$ be the ground truth entity associated with that mention. The task of a candidate generator is to retrieve a list $C_m$ of $k$ possible candidate entities such that $e \in C_m$. The candidate generator is trained with mention-entity pairs $(m, e)$ observed in a training corpus $\mathcal{D}^{lang}$, where the mentions are written in a certain language lang. At test time, candidate generators provide a score to all the entities that form the candidate space, and find the $k$ entities with the largest scores; this is a linear operation (done with a max-heap structure) with respect to the candidate space.

A candidate generator must have good recall, and also be much more computationally efficient than the more complex downstream model. See [10] for a detailed discussion. We use en, pl and tl to indicate English, pivot and target language, respectively.
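The test-time retrieval step described above can be sketched in a few lines. This is only an illustration: the function name `top_k_candidates` and the toy scores are assumptions, and the scorer that produces the scores is left abstract.

```python
import heapq


def top_k_candidates(scores, k):
    """Return the k highest-scoring entities, best first.

    `scores` maps an entity id to its score (the output of some
    hypothetical scorer). heapq.nlargest scans every entry once while
    keeping a heap of at most k items.
    """
    return heapq.nlargest(k, scores, key=scores.get)


# Toy scores, invented for illustration only.
scores = {"E1": 0.10, "E2": 0.75, "E3": 0.40, "E4": 0.05}
print(top_k_candidates(scores, k=2))
```

Because the heap never holds more than k items, the cost stays close to a single pass over the candidate space when k is small (e.g. 30).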

3 Related Work

While candidate generation in monolingual (English) entity linking is deemed a trivial task given the existence of high-quality mention-entity lookup tables, most other languages, and especially low-resource ones, do not benefit from lookup tables of a similar quality. Therefore, contrary to (English) entity linking, candidate generation constitutes a challenge in the cross-lingual entity linking literature.

3.1 Wikipedia-based Indexes

This is the de facto standard for candidate generation in the entity linking literature [16, 25, 17, 21]. An index of prior probabilities $P(e \mid m)$ is constructed from a corpus. While this solution has been shown to provide a good recall for languages where the training corpus is large, for low-resource languages the training corpus may offer a very poor coverage in terms of both mentions [3] and entities.
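A minimal sketch of such a prior index, assuming the training corpus is available as a list of (mention, entity) pairs. The function name and the toy pairs are hypothetical and only serve to show the counting.

```python
from collections import Counter, defaultdict


def build_prior_index(pairs):
    """Estimate P(e | m) by counting (mention, entity) co-occurrences."""
    counts = defaultdict(Counter)
    for mention, entity in pairs:
        counts[mention][entity] += 1
    index = {}
    for mention, entity_counts in counts.items():
        total = sum(entity_counts.values())
        index[mention] = {e: c / total for e, c in entity_counts.items()}
    return index


# Toy training pairs, invented for illustration only.
pairs = [
    ("Paris", "Paris (city)"),
    ("Paris", "Paris (city)"),
    ("Paris", "Paris (mythology)"),
    ("Roma", "Rome"),
]
index = build_prior_index(pairs)
print(index["Paris"])
# A mention never seen in training has no candidates at all.
print(index.get("Parigi", {}))
```

The last line shows the coverage problem discussed above: an exact-match index returns nothing for a mention it has never observed.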

3.2 Pivoting-based Embedding Approaches

Methods within this category leverage information from a pivot language that is lexically related to the target language: the candidate generator is trained with data from the pivot language, but is then evaluated in the target language. These methods rely on a certain lexical overlap between pairs of related languages (e.g. Italian and Corsican) that allows learning to be transferred between the pivot, which is supposed to have more resources than the target language, and the target language. This is opposed to other techniques [6] that rely on the existence of accurate machine translation models to map all the resources to a common language. Unfortunately, the performance of translation models is also largely affected by the scarce data that is available for low-resource languages. Similar ideas have been applied to other problems such as POS tagging [2] or machine translation [29].

The learning objective consists of maximizing a similarity function (the cosine similarity) between mention and entity embedding pairs over the collection $\mathcal{D}^{pl}$ of training mention-entity pairs in the pivot language. The mention embedding is obtained by applying an encoding function Enc to the output of a tokenizer tknr. A similar architecture is used to obtain the entity embedding from its English name. The superscript $+$ or $-$ on an entity embedding indicates whether the entity is the target or a randomly sampled one.

Previous works have explored different tokenizers (e.g. character-level [14] or character $n$-grams [28]) and encoding functions (e.g. LSTM- [14] or CNN-based architectures [27]). The best setup [28] consists of a character $n$-gram tokenizer, followed by an encoder that maps each token to an embedding (via an embedding lookup table), sums them all up, and applies a nonlinear activation function to the aggregated representation. This approach, called Charagram [24], has also been shown to outperform other types of approaches based on indexes and translation [13].
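To make the encoder description concrete, the following is a rough, untrained sketch of a character n-gram encoder followed by a cosine similarity. It is not the implementation of [28]: the n-gram sizes, the `#` boundary padding, the embedding dimension, the random initialization and the use of tanh are all assumptions chosen for illustration, and no training loop is shown.

```python
import math
import random


def char_ngrams(text, n_values=(2, 3)):
    """Split a string into character n-grams (sizes are an assumption)."""
    padded = f"#{text.lower()}#"  # '#' marks word boundaries (assumption)
    return [padded[i:i + n] for n in n_values
            for i in range(len(padded) - n + 1)]


class NgramEncoder:
    """Embedding lookup, sum over tokens, then a nonlinearity."""

    def __init__(self, dim=8, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.table = {}  # token -> embedding, created lazily

    def _embedding(self, token):
        if token not in self.table:
            self.table[token] = [self.rng.uniform(-0.1, 0.1)
                                 for _ in range(self.dim)]
        return self.table[token]

    def encode(self, text):
        total = [0.0] * self.dim
        for token in char_ngrams(text):
            for i, value in enumerate(self._embedding(token)):
                total[i] += value
        return [math.tanh(v) for v in total]  # tanh is an assumption


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


encoder = NgramEncoder()
mention_vec = encoder.encode("Roma")
entity_vec = encoder.encode("Rome")
print(cosine(mention_vec, entity_vec))
```

Scoring a mention this way requires comparing its embedding against the embedding of every entity in the candidate space, which is one source of the inference cost discussed later.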

Additional Related Work. There are related tasks such as transliteration [22] and bilingual lexical induction [8]. The former is a harder task, as it consists of generating, instead of retrieving, English entity names. The latter relies to a great extent on the existence of large monolingual corpora in both languages [23].

4 Pivot Token Indexes

Intuitively, Charagram learns to associate tokens that co-occur between mentions and English entity names. The learned associations between tokens determine the similarity (score) between entity and mention embeddings. This approach has been shown to provide good recall, but both its memory requirements and its training/inference time are considerable (an empirical analysis is in Section 6.1). Thus, it is arguable whether Charagram fulfills a desired characteristic of a candidate generator, namely that its computational overhead is low. Similarly to Charagram, our approach also learns associations between mentions and entities. Unlike in Charagram, the associations learned by pti are meant to indicate the attachment strength between tokens and entities, and not equivalences between tokens. pti consists of two indexes, which makes it highly efficient, that are linearly combined to score entities. The indexes correspond to the prior probability that associates tokens to entities, and the posterior probability that associates entities to tokens. The goal of the posterior probability is to counterbalance the (low) score that unpopular entities obtain for popular tokens from the prior index. pti has the following steps:

  1. We apply a tokenizer tknr to the collection of mentions observed in the training corpus of the pivot language $\mathcal{D}^{pl}$. Each mention $m$, linked to a certain entity, is decomposed into a set of tokens $\text{tknr}(m)$.

  2. We construct prior and posterior token-based probabilities. Optionally, we can remove low probabilities by thresholding the indexes (Section 6.1) without an impact on performance. The thresholding significantly reduces the memory requirements of pti.

  3. At test time, given a mention $m$, each entity $e$ obtains a score as follows:

    $$\text{score}(e \mid m) = \sum_{t \in \text{tknr}(m)} P(e \mid t) + \lambda\, P(t \mid e) \qquad (2)$$

    where $\lambda$ is a weighting term. As argued, the posterior probability term may counterbalance the bias introduced by the prior probability towards the popular entities: a large value of $\lambda$ will favor unpopular entities. While these scores do not have a probabilistic interpretation, one can easily normalize them so that they sum up to 1. As opposed to Charagram, pti performs a search that only includes the entities that have a non-zero score, which contributes to better efficiency.

We will see in Section 6 that this simple approach has the following advantages as compared to Charagram: fewer parameters, less training time, less inference time, and better performance in almost all settings.
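The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the released implementation: the tokenizer uses character 2- and 3-grams, the posterior is normalized over all tokens observed with an entity, the names `tknr`, `build_pti`, `score_entities` and `lam` are hypothetical, and the toy pivot pairs are invented.

```python
from collections import Counter, defaultdict
import heapq


def tknr(mention, n_values=(2, 3)):
    """Decompose a mention into a set of character n-grams (assumed sizes)."""
    text = mention.lower()
    return {text[i:i + n] for n in n_values
            for i in range(len(text) - n + 1)}


def build_pti(pairs, threshold=0.0):
    """Steps 1-2: build the prior P(e | t) and posterior P(t | e) indexes."""
    pair_counts = defaultdict(Counter)  # token -> entity -> count
    token_totals = Counter()
    entity_totals = Counter()
    for mention, entity in pairs:
        for token in tknr(mention):
            pair_counts[token][entity] += 1
            token_totals[token] += 1
            entity_totals[entity] += 1
    prior = defaultdict(dict)      # prior[token][entity] = P(e | t)
    posterior = defaultdict(dict)  # posterior[token][entity] = P(t | e)
    for token, entity_counts in pair_counts.items():
        for entity, count in entity_counts.items():
            p_e_given_t = count / token_totals[token]
            p_t_given_e = count / entity_totals[entity]
            # Optional thresholding: entries at or below it are dropped.
            if p_e_given_t > threshold:
                prior[token][entity] = p_e_given_t
            if p_t_given_e > threshold:
                posterior[token][entity] = p_t_given_e
    return prior, posterior


def score_entities(mention, prior, posterior, lam=1.0, k=30):
    """Step 3: sum prior + lam * posterior over the mention's tokens."""
    scores = Counter()
    for token in tknr(mention):
        for entity, p in prior.get(token, {}).items():
            scores[entity] += p
        for entity, p in posterior.get(token, {}).items():
            scores[entity] += lam * p
    # Only entities sharing at least one token with the mention are ranked.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Toy pivot-language pairs, invented for illustration only.
pivot_pairs = [
    ("Roma", "Rome"),
    ("Università di Roma", "Sapienza University of Rome"),
    ("Milano", "Milan"),
]
prior, posterior = build_pti(pivot_pairs)
ranked = score_entities("Roma", prior, posterior, lam=1.0, k=2)
print(ranked)
```

In this toy example both Rome-related entities share every token of the query, so the prior alone cannot separate them; the posterior term breaks the tie in favor of the entity whose tokens are concentrated on the query, which is the counterbalancing role described above.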

4.1 Joint learning with the target language

In practice, there is often some training data that is available for the target language. To incorporate this data into the construction of the prior and posterior probabilities we modify step 2 by weighting the counts over the pivot language corpus with a factor $\alpha$. Thus, the index of priors is computed as

$$P(e \mid t) = \frac{\alpha\, \#(e, t \mid \mathcal{D}^{pl}) + \#(e, t \mid \mathcal{D}^{tl})}{\sum_{e' \in \mathcal{E}} \left[ \alpha\, \#(e', t \mid \mathcal{D}^{pl}) + \#(e', t \mid \mathcal{D}^{tl}) \right]} \qquad (3)$$

where $\#(\cdot \mid \mathcal{D})$ is a shortcut that refers to the number of times its argument is observed in the corpus $\mathcal{D}$, and $\mathcal{E}$ corresponds to the set of entities observed in the corpora. The same weighting is applied to compute the posterior probabilities. At test time, entities are scored following Eq. (2).
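A small sketch of the weighted counting, assuming pivot and target corpora are lists of (mention, entity) pairs. The name `weighted_prior`, the symbol `alpha` for the pivot weight, and the whitespace tokenizer are assumptions made to keep the example short.

```python
from collections import Counter, defaultdict


def whitespace_tokens(mention):
    """Stand-in tokenizer (assumption): lowercase whitespace tokens."""
    return set(mention.lower().split())


def weighted_prior(pivot_pairs, target_pairs, tokenizer, alpha=0.5):
    """Prior P(e | t) with pivot counts down- or up-weighted by alpha."""
    counts = defaultdict(Counter)  # token -> entity -> weighted count
    for weight, pairs in ((alpha, pivot_pairs), (1.0, target_pairs)):
        for mention, entity in pairs:
            for token in tokenizer(mention):
                counts[token][entity] += weight
    return {
        token: {e: c / sum(entity_counts.values())
                for e, c in entity_counts.items()}
        for token, entity_counts in counts.items()
    }


# Toy data, invented for illustration only.
pivot_pairs = [("Roma", "Rome")]
target_pairs = [("Roma", "AS Roma")]
joint_prior = weighted_prior(pivot_pairs, target_pairs,
                             whitespace_tokens, alpha=0.5)
print(joint_prior["roma"])
```

With `alpha=0.5` a target-language observation counts twice as much as a pivot-language one, so the target-language entity receives two thirds of the probability mass for the shared token.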

More complex versions of our approach pti are in Appendix A.5. While these versions led to minor improvements, their complexity also increased.

5 Experimental Setup

All the resources required to reproduce the experiments are available at

In order to comprehensively understand the performance of pti we run experiments on a variety of publicly available datasets. The benchmarked methods are compared in terms of accuracy and complexity (number of parameters and training/inference time). The accuracy is measured by the recall, which is defined as the proportion of retrieved candidate lists containing the correct entity. More formally, the recall@$k$ is defined as

$$\text{recall@}k = \frac{1}{|\mathcal{D}^{eval}|} \sum_{(m, e) \in \mathcal{D}^{eval}} \mathbb{1}\left[ e \in C_m \right]$$

where $\mathcal{D}^{eval}$ is the set of mention-entity pairs in an evaluation set, $C_m$ is the list of $k$ candidates retrieved for $m$, and $\mathbb{1}[\cdot]$ returns 1 if its argument is true, and 0 otherwise. As in previous work [28] we set $k = 30$ for our main set of experiments. However, for a subset of the experiments we include a comparative performance for other values of $k$ in Appendix A.4.
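The metric can be computed with a few lines of code. In this sketch `generate` stands for any candidate generator that returns a ranked list of entities for a mention; the function name and the toy lookup table are assumptions.

```python
def recall_at_k(eval_pairs, generate, k=30):
    """Fraction of (mention, entity) pairs whose entity is in the top k."""
    hits = sum(1 for mention, entity in eval_pairs
               if entity in generate(mention)[:k])
    return hits / len(eval_pairs)


# Toy ranked candidate lists, invented for illustration only.
lookup = {"Roma": ["Rome", "AS Roma"]}


def generate(mention):
    return lookup.get(mention, [])


eval_pairs = [("Roma", "Rome"), ("Roma", "AS Roma"), ("Bastia", "Bastia")]
print(recall_at_k(eval_pairs, generate, k=1))
print(recall_at_k(eval_pairs, generate, k=2))
```

Every pair in the evaluation set counts toward the denominator, including mentions for which the generator returns nothing.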

We compare the performance of the different methods in the two following learning settings:

  • Zero-shot. We use zero resources—no training data—from the target language. Therefore, the candidate generator must fully rely on the available information for the pivot language. While it is possible to obtain a (small) training set of mention entity pairs for some target languages, in some other cases the target language is so poorly resourced that this is the only possible learning setting.

  • Joint learning with the target language. Even for low-resource languages, there is often some training data for the target language that can be used in conjunction with the data that is available for the pivot language. We sometimes refer to this setting as supervised.

We note that the aim of this section is to establish a detailed comparison of different candidate generation methods. As any progress in the candidate generator is independent from any improvement in the entity disambiguation module, the integration of the candidate generation in the entity linking pipeline is out of the scope of this work.

Figure 1: Query types according to their difficulty. The target and pivot language correspond to Corsican (co) and Italian (it), respectively. The backgrounds of the queries (Mention : Ground Truth Entity) indicate their type (Easy=green, Medium=orange, and Hard=red).

5.1 Query Types

As discussed in Section 5, despite the few resources that are available for the target languages, sometimes it is possible to obtain some training data. This training data will determine the difficulty of the queries contained in the test set. Given a mention query $m$ that links to an entity $e$, we classify it according to our proposed categorization (see Figure 1 for a visual explanation):

  • Easy. The mention $m$ is observed with the entity $e$ in the training data of the target language.

  • Medium. The entity $e$ is observed in the training data of the target language but with mentions other than $m$.

  • Hard. The entity $e$ is never observed in the training data of the target language. To successfully link these mentions we fully rely on the pivot language. In the zero-shot setting, all mention queries belong to this type.
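The categorization above amounts to two set lookups against the target-language training data. The following sketch uses a hypothetical function name and invented toy pairs.

```python
def query_type(mention, entity, target_train_pairs):
    """Classify a test query as Easy, Medium or Hard."""
    seen_pairs = set(target_train_pairs)
    seen_entities = {e for _, e in seen_pairs}
    if (mention, entity) in seen_pairs:
        return "Easy"    # this exact mention-entity pair was observed
    if entity in seen_entities:
        return "Medium"  # entity observed, but with other mentions
    return "Hard"        # entity never observed in the target language


# Toy target-language training pairs, invented for illustration only.
target_train = [("Roma", "Rome"), ("Cità eterna", "Rome")]
print(query_type("Roma", "Rome", target_train))
print(query_type("Rumma", "Rome", target_train))
print(query_type("Corti", "Corte", target_train))
```

With an empty training set every query falls through to the last branch, which matches the statement that all queries are Hard in the zero-shot setting.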

5.2 Baselines

  • WikiPriors. Prior probabilities (i.e. $P(e \mid m)$) are estimated using a training corpus obtained from both a target and a pivot language, and indexes are constructed that map mentions to their co-occurring entities. This simple procedure resembles the approach followed by [18], where a lookup table was built for entities that have an English Wikipedia page. The strength of a mention-entity association is given by the estimated prior probabilities. In the supervised setting, it first generates candidates using the index constructed on the target language. More details are provided in Appendix A.2.

  • Charagram [28]. This is the state of the art in candidate generation for (low-resource) cross-lingual settings. It has been shown to significantly outperform other architectures such as [14], and translation-based methods [13], among others. For the supervised setting, we extend Charagram to also incorporate training data from the target language in the loss it optimizes. More details are provided in Section 3.2.

5.3 Datasets

As in previous works [14, 28], we create our datasets from Wikipedia. We choose Wikipedia as, contrary to other datasets (e.g. DARPA LORELEI [19]), it can be accessed without any restriction, and, importantly, its crowdsourcing spirit guarantees a realistic setting.

We select 9 pairs of languages, where the target language is always (extremely) low-resourced, and the pivot language can be categorized as either high, medium, or low-resourced. The distinction regarding the amount of resources of the pivot language has never been made in previous works, but it helps to analyze the performance of the different methods. The 9 language pairs (target-pivot) are Corsican-Italian (co-it), Limburgish-Dutch (li-nl), Bavarian-German (bar-de), Karelian-Finnish (olo-fi), Macedonian-Bulgarian (mk-bg), Javanese-Indonesian (jv-id), Turkmen-Azerbaijani (tk-az), Marathi-Hindi (mr-hi) and Faroese-Icelandic (fo-is). Appendix A.1 contains information about the dataset statistics (Table 4), additional preprocessing details and detailed information about the selection of the pivot languages.

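Pivot language resources: High (co-it, li-nl, bar-de), Medium (olo-fi, mk-bg, jv-id), Low (tk-az, mr-hi, fo-is).

| Setting | Method | co-it | li-nl | bar-de | olo-fi | mk-bg | jv-id | tk-az | mr-hi | fo-is |
|---|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | WikiPriors | 47 | 51 | 63 | 45 | 31 | 74 | 48 | 30 | 29 |
| Zero-shot | Charagram | 51 | 51 | 50 | 48 | 47 | 60 | 67 | **48** | **49** |
| Zero-shot | pti | **63** | **70** | **76** | **72** | **49** | **79** | **69** | 40 | 47 |
| Joint learning with the target language | WikiPriors | 47 | 68 | 77 | 45 | 59 | 82 | 48 | 58 | 58 |
| Joint learning with the target language | Charagram | 51 | 72 | 73 | 48 | 69 | 72 | 67 | 66 | 74 |
| Joint learning with the target language | pti | **63** | **84** | **88** | **72** | **71** | **87** | **69** | **70** | **79** |

Table 1: Micro recall@30 of different approaches. The upper and lower block show the performance of the methods in the zero-shot and supervised setting, respectively. For the pairs co - it, olo - fi and tk - az, where there is no training data in the target language, the supervised setting amounts to the zero-shot one. The best performance for each target language and setting is in bold.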
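| Pivot resources | Pair | Query type | WikiPriors | Charagram | pti |
|---|---|---|---|---|---|
| High | co-it | H | 47 | 51 | **63** |
| High | li-nl | E | **100** | 83 | 98 |
| High | li-nl | M | 49 | 72 | **80** |
| High | li-nl | H | 57 | 61 | **73** |
| High | bar-de | E | **100** | 84 | 99 |
| High | bar-de | M | 56 | 75 | **84** |
| High | bar-de | H | 74 | 62 | **82** |
| Medium | olo-fi | H | 45 | 48 | **72** |
| Medium | mk-bg | E | **99** | 86 | 92 |
| Medium | mk-bg | M | 53 | 74 | **78** |
| Medium | mk-bg | H | 26 | **46** | 44 |
| Medium | jv-id | E | **100** | 82 | 99 |
| Medium | jv-id | M | 79 | 70 | **86** |
| Medium | jv-id | H | 68 | 61 | **78** |
| Low | tk-az | H | 48 | 67 | **69** |
| Low | mr-hi | E | **99** | 82 | 98 |
| Low | mr-hi | M | 61 | 75 | **82** |
| Low | mr-hi | H | 15 | **42** | 30 |
| Low | fo-is | E | **100** | 90 | **100** |
| Low | fo-is | M | 38 | 80 | **93** |
| Low | fo-is | H | 35 | **54** | 45 |

Table 2: Performance breakdown in terms of recall@30 for the case when the models are trained with some supervision from the target language (except for co - it, olo - fi and tk - az, which correspond to the zero-shot setting). E, M and H are shortcuts that refer to Easy, Medium and Hard queries, respectively. The best performance for each target language and query type is shown in bold.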

Except for cases when zero-shot is the only possible learning setting (co-it, olo-fi and tk-az) we create the validation and test sets by sampling a maximum of 1,000 unique queries of each query type—in total it amounts to a maximum of 3,000 queries; and the training data is adapted accordingly. The same validation and test sets are used in the zero-shot setting. For target languages that can only be evaluated in the zero-shot setting, we equally distribute the available mention entity pairs into validation and test. These sets never contain duplicates.

5.4 Towards a Realistic Evaluation

A common practice in the (English) entity linking literature [25, 4] consists of ignoring mentions whose correct entity does not exist in the considered KB—typically the English Wikipedia which contains over 5 million entities. This is not a major problem in these models, as this has been shown [11] to remove a small percentage (typically less than 5%) of the data. Previous works [14, 28] on candidate generation for cross-lingual entity linking have adopted the same practice. Indeed, cross-lingual entity linking is defined as the task of grounding mentions in non-English documents to entries in the English Wikipedia [20, 21]. However, we will show in Section 6.1 that this practice is problematic in the cross-lingual setting, as we would have to ignore a very large percentage of data.

Figure 2: Ceiling recall for all query types and target languages. Panels: (a) co - it, (b) li - nl, (c) bar - de, (d) olo - fi, (e) mk - bg, (f) jv - id, (g) tk - az, (h) mr - hi, (i) fo - is. In the experiments reported in Section 5, the candidate space of pti is PL and PL+TL in the zero-shot and supervised setting, respectively, whereas the candidate space of Charagram is EN+PL and EN+PL+TL in the zero-shot and supervised setting, respectively.

For some methods such as Charagram it is easy to expand the candidate space by considering entities in Wikipedias other than English. While one might evaluate all entities (more than 20 million), we discovered a much less costly alternative. We experimentally observe that expanding the candidate space with the entities observed in the training data of the pivot language significantly improves the coverage of the entities observed in the test data of the target language (more in Section 6.1). With the goal of performing a realistic evaluation we simply evaluate all mentions.
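The effect of expanding the candidate space can be measured as a ceiling recall, i.e. the fraction of test queries whose ground truth entity is in the candidate space at all. The sketch below uses a hypothetical function name and invented toy sets.

```python
def ceiling_recall(test_pairs, candidate_space):
    """Upper bound on recall: share of gold entities that are retrievable."""
    reachable = sum(1 for _, entity in test_pairs
                    if entity in candidate_space)
    return reachable / len(test_pairs)


# Toy entity sets and test pairs, invented for illustration only.
en_entities = {"Rome", "Milan"}
pivot_train_entities = {"Rome", "Corte"}
test_pairs = [("Roma", "Rome"), ("Corti", "Corte"), ("Bastia", "Bastia")]

print(ceiling_recall(test_pairs, en_entities))
print(ceiling_recall(test_pairs, en_entities | pivot_train_entities))
```

No candidate generator can exceed this value for a given candidate space, which is why mentions whose entity is missing are kept in the evaluation rather than dropped.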

5.5 Training Details

The simplicity of our approach pti is also reflected in its number of hyperparameters. The hyperparameter $\alpha$, which controls the contribution of the pivot language in the construction of the prior and posterior probabilities, and the hyperparameter $\lambda$, which controls the importance of the posterior probability, are each selected on the validation set. As in Charagram, the tokenizer tknr applied to mentions returns character $n$-grams. The tokenizer counts each symbol as a character regardless of the script of the language, although more sophisticated mechanisms are also possible (e.g. by converting strings to International Phonetic Alphabet (IPA) symbols). For Charagram we replicate the exact same setup as reported in [28]. The hyperparameter of its supervised setting is likewise selected on the validation set. More details are provided in Appendix A.2. The pivot-weighting hyperparameter $\alpha$ of pti and the supervised-setting hyperparameter of Charagram are required only in the supervised setting. For all cases, the validation metric is micro recall@30.

6 Results

Table 1 depicts a comparison of the methods in terms of micro recall@30. Similar conclusions are drawn for both learning settings. Overall, our approach pti clearly outperforms Charagram in those target languages where the pivot language is high-resourced. For pivot languages that are medium-resourced, pti is still the best performing technique, although Charagram performs very similarly in one target language (mk). This tendency is not as clear when the pivot language is low-resourced, where Charagram shows a clearly superior performance in the target language mr in the zero-shot learning setting. We see that all models largely benefit from the existence of training data in the target language.

By the definition of the query types introduced in Section 5.1, it follows that all queries are of type Hard in the zero-shot setting. However, this is not the case when the models are also trained with data from the target language. We show the performance breakdown in Table 2. The conclusions are clear: pti always outperforms Charagram for queries of type Easy and Medium, and it is only beaten for queries of type Hard when the pivot language has low resources, with the exception of az. We note that while in this work we have created the evaluation splits so that all query types are equally important, the distribution of the query types depends on the target evaluation data. For instance, splitting Wikipedia articles into training, validation and test in a proportion of 70%, 15% and 15%, respectively, leads to a distribution of query types largely dominated by Easy ones.

The performance of WikiPriors is also dependent on the amount of resources of the pivot language, showing a performance that is comparable to or even better than that of Charagram when the pivot language is high- and medium-resourced (see Table 1). The more detailed analysis depicted in Table 2 shows that WikiPriors has a very heterogeneous performance with respect to the different query types in the supervised setting. As expected, it obtains a near-perfect performance (99 to 100) for Easy queries, but its performance tends to be poor, with a few exceptions, for Medium and Hard queries.

6.1 Analysis

Entity Coverage. As argued in Section 5.4, it is unrealistic to simply ignore those mentions that are not in the considered KB, especially if this translates into a performance that would not resemble that of the realistic evaluation that we propose. We proceed to analyze the ceiling recall (which depends on the entity coverage) of the methods in the realistic setting, where no mention is excluded.

Figure 3: Recall@30 of pti (panel (a), left) and Charagram (panel (b), right) for different amounts of training data. For simplicity, the candidate space of Charagram is always formed as if all training data was observed.

Figure 2 shows the ceiling recall of pti and Charagram for the different target languages and query types. The ceiling recall of Charagram is lower than 30% in almost all cases when the candidate space is given by the English Wikipedia (EN) considered in previous work [14, 28]. This shows that previous works were limiting, to a great extent, the mention queries that were evaluated. This is not only unrealistic, but also a notable limiting factor in performance. The ceiling recall notably increases when the candidate space also includes the entities that are observed in the training data of the pivot language. We hypothesize that this is because the pivot language is not only lexically similar to the target language, but their respective Wikipedias also exhibit similar topical interests. Moreover, the ceiling for queries of type Easy and Medium is also increased when the candidate space additionally includes the entities observed in the training data of the target language. The latter corresponds to the supervised setting; surprisingly, as opposed to pti, Charagram does not reach a ceiling recall of 1 for queries of type Easy and Medium. The reason is that there is a fair number of entities in the respective Wikipedias that do not have an English name (e.g. Q61068957 or Q15842755). This is the strength and weakness of Charagram: its candidate space can include any entity as long as its English title is available. On the other hand, pti is agnostic to the language in which the entity is represented, but its entity coverage is limited to the entities observed in the pivot language (PL), or in both the pivot and target language (PL+TL). It is for this reason that pti is beaten in queries of type Hard when the pivot language is low-resourced: it fully relies on the entity coverage of the pivot language, which is low, to successfully answer these queries. This is illustrated in Figure 2(h) and 2(i), where Charagram shows a higher ceiling recall for Hard mentions due to the additional entity coverage provided by the English (EN) Wikipedia. Future work should address this limitation of pti (e.g. by incorporating more pivot languages). However, in all other situations where the ceiling recall is higher than or comparable to that of Charagram, pti shows the best performance.

Impact of the amount of training data. We train both pti and Charagram with increasing amounts of training data, and compute micro recall@30 for several target languages. This is shown in Figure 3. The performance of pti increases with more data up to a point (around 10 million). One reason is that the entity coverage of pti is expected to increase with the amount of training data. Another reason is that the constructed indexes become more accurate. On the other hand, Charagram barely improves beyond 80,000 data points (the authors of Charagram confirmed in an email that they used, at most, this amount of data for training). Charagram needs relatively little data to learn equivalences between $n$-grams co-occurring in mentions and English entity names, but gets stuck very quickly and does not leverage all the statistical information contained in the training corpora.

Runtime and Memory. The goal of a candidate generator is to reduce the computational cost of a more complex subsequent model. Therefore, besides recall, complexity must also be a factor that drives the design of a candidate generator. We perform a detailed analysis of the complexity in Appendix A.3. The conclusion of this analysis is that, on average, pti requires 20 times less memory than Charagram, and it is 300 and 7 times faster than Charagram in terms of training and inference time, respectively.

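| Method | it (E) | it (M) | nl (E) | nl (M) | de (E) | de (M) |
|---|---|---|---|---|---|---|
| WikiPriors | 95 | 32 | 98 | 30 | 98 | 34 |
| Charagram | 50 | 45 | 57 | 47 | 39 | 38 |
| pti | 79 | 49 | 87 | 53 | 77 | 54 |

Table 3: Recall@30 in the high-resource setting. E and M refer to Easy and Medium queries, respectively.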

6.2 Can pti also become the de-facto choice for high-resource languages?

We analyze whether pti may compete with WikiPriors in the most standard candidate generation setting: candidates have to be generated for mentions of a high-resource language by only exploiting its own training corpus. This experiment is motivated by the recent findings of Fu et al. [3], who found that even high-resource languages, such as Spanish, are limited by their mention coverage. This limitation relates to the queries of type Medium. Fu et al. partially circumvent the limitation with information coming from search engine query logs.

Following the same protocol as in Section 5.3 we create 1,000 queries of type Easy and Medium for the three high-resource languages used in Section 5 as pivot languages: Italian (it), Dutch (nl) and German (de). We do not include queries of type Hard as they cannot be successfully completed. The results, depicted in Table 3, indicate that pti always obtains the best performance for queries of type Medium, followed by Charagram. On the other hand, for queries of type Easy standard lookup tables (i.e. WikiPriors) exhibit the best performance. We conclude that for the standard setting, WikiPriors should be used as the primary technique, falling back to pti only if WikiPriors fails to retrieve any candidate.
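The recommended fallback can be sketched as a thin wrapper around the two generators. Here `prior_index` is assumed to map a mention to a dictionary of entity probabilities, and `pti_generate` is a hypothetical callable returning a ranked entity list; both are stand-ins rather than the released code.

```python
def generate_with_fallback(mention, prior_index, pti_generate, k=30):
    """Use the mention lookup table first; fall back to pti if it is empty."""
    candidates = prior_index.get(mention)
    if candidates:
        ranked = sorted(candidates, key=candidates.get, reverse=True)
        return ranked[:k]
    return pti_generate(mention)[:k]


# Toy index and a stand-in for a token-based generator (illustration only).
prior_index = {"Roma": {"Rome": 0.9, "AS Roma": 0.1}}


def pti_generate(mention):
    return ["Rome"] if "rom" in mention.lower() else []


print(generate_with_fallback("Roma", prior_index, pti_generate))
print(generate_with_fallback("Romae", prior_index, pti_generate))
```

The exact-match table handles Easy queries, while unseen surface forms of known entities (Medium queries) are routed to the token-based generator.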

7 Conclusions

We perform an in-depth analysis of the candidate generation problem. We show that the inherited practice of ignoring mentions where the target entity is not in the English Wikipedia is unrealistic in a cross-lingual setting. We alleviate this problem with an efficient solution that consists of using the pivot and target languages to expand the candidate space. We also contribute a categorization of the mention queries according to their difficulty. Finally, we propose a light-weight approach, called pti, that outperforms the current state-of-the-art in almost all target languages and query types.


This project was partly funded by the Swiss National Science Foundation (grant 200021_185043), the European Union (TAILOR, grant 952215), and the Microsoft Swiss Joint Research Center. We also gratefully acknowledge generous gifts from Facebook and Google supporting West’s lab.

8 Bibliographical References


  • [1] J. Bao, N. Duan, Z. Yan, M. Zhou, and T. Zhao (2016) Constraint-based question answering with knowledge graph. In COLING, Cited by: §1, §1.
  • [2] M. Fang and T. Cohn (2017) Model transfer for tagging low-resource languages using a bilingual dictionary. In ACL, Cited by: §3.2.
  • [3] X. Fu, W. Shi, Z. Zhao, X. Yu, and D. Roth (2020) Design challenges for low-resource cross-lingual entity linking. In EMNLP, Cited by: §3.1, §6.2.
  • [4] O. Ganea and T. Hofmann (2017) Deep joint entity disambiguation with local neural attention. In EMNLP, Cited by: §A.4, §5.4.
  • [5] D. Gillick, S. Kulkarni, L. Lansing, A. Presta, J. Baldridge, E. Ie, and D. García-Olano (2019) Learning dense representations for entity retrieval. In CoNLL, Cited by: §1.
  • [6] T. Gollins and M. Sanderson (2001) Improving cross language retrieval with triangulated translation. In SIGIR, pp. 90–95. Cited by: §3.2.
  • [7] Y. Guo, B. Qin, Y. Li, T. Liu, and S. Li (2013) Improving candidate generation for entity linking. In International Conference on Application of Natural Language to Information Systems, pp. 225–236. Cited by: §1.
  • [8] A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein (2008) Learning bilingual lexicons from monolingual corpora. In ACL, Cited by: §3.2.
  • [9] L. S. Levin, P. Littell, D. R. Mortensen, K. Lin, K. Kairis, and C. Turner (2017) URIEL and lang2vec: representing languages as typological, geographical, and phylogenetic vectors. In EACL, Cited by: §A.1.
  • [10] X. Ling, S. Singh, and D. S. Weld (2015) Design challenges for entity linking. TACL 3, pp. 315–328. Cited by: §2.
  • [11] C. Liu, F. Li, X. Sun, and H. Han (2019) Attention-based joint entity linking with entity embedding. Information 10, pp. 46. Cited by: §5.4.
  • [12] L. Ngo and M. P. Wand (2004) Smoothing with mixed model software. Journal of Statistical Software 9 (1). Cited by: §A.5.
  • [13] X. Pan, B. Zhang, J. May, J. Nothman, K. Knight, and H. Ji (2017) Cross-lingual name tagging and linking for 282 languages. In ACL, Cited by: §3.2, 2nd item.
  • [14] S. Rijhwani, J. Xie, G. Neubig, and J. G. Carbonell (2019) Zero-shot neural transfer for cross-lingual entity linking. In AAAI, Cited by: §1, §3.2, 2nd item, §5.3, §5.4, §6.1.
  • [15] W. Shen, J. Wang, and J. Han (2014) Entity linking with a knowledge base: issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering 27 (2), pp. 443–460. Cited by: §1.
  • [16] A. Sil and R. Florian (2016) One for all: towards language independent named entity linking. In ACL, Cited by: §3.1.
  • [17] A. Sil, G. Kundu, R. Florian, and W. Hamza (2018) Neural cross-lingual entity linking. In AAAI, Cited by: §1, §3.1.
  • [18] V. I. Spitkovsky and A. X. Chang (2012) A cross-lingual dictionary for english wikipedia concepts. In LREC, Cited by: §1, 1st item.
  • [19] S. Strassel and J. Tracey (2016) LORELEI language packs: data, tools, and resources for technology development in low resource languages. In LREC, Cited by: §5.3.
  • [20] C. Tsai and D. Roth (2016) Cross-lingual wikification using multilingual embeddings. In HLT-NAACL, Cited by: §A.2, §1, §5.4.
  • [21] S. Upadhyay, N. Gupta, and D. Roth (2018) Joint multilingual supervision for cross-lingual entity linking. In EMNLP, Cited by: §A.2, §A.4, §3.1, §5.4.
  • [22] S. Upadhyay, J. Kodner, and D. Roth (2018) Bootstrapping transliteration with constrained discovery for low-resource languages. In EMNLP, Cited by: §3.2.
  • [23] I. Vulic, S. Ruder, and A. Søgaard (2020) Are all good word vector spaces isomorphic?. In EMNLP, Vol. abs/2004.04070. Cited by: §3.2.
  • [24] J. Wieting, M. Bansal, K. Gimpel, and K. Livescu (2016) Charagram: embedding words and sentences via character n-grams. In EMNLP, Cited by: §3.2.
  • [25] I. Yamada, H. Shindo, H. Takeda, and Y. Takefuji (2017) Learning distributed representations of texts and entities from knowledge base. Transactions of the Association for Computational Linguistics 5, pp. 397–411. Cited by: §3.1, §5.4.
  • [26] H. Zhang, Y. Song, Y. Song, and D. Yu (2019) Knowledge-aware pronoun coreference resolution. In ACL, Cited by: §1, §1.
  • [27] X. Zhang, J. J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In NIPS, Cited by: §3.2.
  • [28] S. Zhou, S. Rijhwani, J. Wieting, J. G. Carbonell, and G. Neubig (2020) Improving candidate generation for low-resource cross-lingual entity linking. TACL 8, pp. 109–124. Cited by: §A.1, §A.2, §A.2, Table 6, §1, §1, §3.2, 2nd item, §5.3, §5.4, §5.5, §5, §6.1.
  • [29] B. Zoph, D. Yuret, J. May, and K. Knight (2016) Transfer learning for low-resource neural machine translation. In EMNLP, Cited by: §3.2.

Appendix A Supplementary Material

A.1 Dataset Information

We preprocess the Wikipedia dumps for each of the target and pivot languages. We apply possible redirects to each Wikipedia entity, and map the (redirected) Wikipedia entity to the corresponding Wikidata entity (e.g. Q21 is the Wikidata identifier for the Wikipedia entity England). This last step is not strictly necessary, but we prefer a language-agnostic representation of the entities. We do not use aliases [28] to enrich the data. Nevertheless, including them would improve the performance of any technique.
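The two normalization steps can be sketched as follows. This is an illustration under simplifying assumptions: the redirect table and the title-to-Wikidata mapping are assumed to be already loaded as dictionaries, and the function names and toy entries (other than Q21 for England) are hypothetical.

```python
def resolve_redirect(title, redirects):
    """Follow redirects until a non-redirect title is reached (cycle-safe)."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title


def to_wikidata(pairs, redirects, title_to_qid):
    """Map (mention, Wikipedia title) pairs to (mention, Wikidata id) pairs.

    Pairs whose resolved title has no Wikidata identifier are dropped.
    """
    mapped = []
    for mention, title in pairs:
        qid = title_to_qid.get(resolve_redirect(title, redirects))
        if qid is not None:
            mapped.append((mention, qid))
    return mapped


if __name__ == "__main__":
    redirects = {"Eng": "England"}          # hypothetical redirect entry
    title_to_qid = {"England": "Q21"}       # Q21 is the example given in the text
    pairs = [("Inghilterra", "Eng"), ("Foo", "Missing page")]
    print(to_wikidata(pairs, redirects, title_to_qid))
```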

Lang. Wikipedia size #Mention-entity pairs
it (Italian) 1.5M 24M
nl (Dutch) 2M 16M
de (German) 2.5M 44M
fi (Finnish) 500k 5.7M
bg (Bulgarian) 250k 3.4M
id (Indonesian) 500k 4.4M
az (Azerbaijani) 150k 1.1M
hi (Hindi) 140k 700k
is (Icelandic) 50k 400k
Table 4: Pivot languages used in this work, categorized according to their amount of resources: high-, medium- and low-resource languages are in the upper, middle and lower block, respectively.

The transfer learning paradigm of applying a model, learned in a pivot language, to a target language has been explored in different natural language problems. Nevertheless, such works have never reached a consensus on how to select a pivot language that is closely related to a target language. For those target languages that have been used in previous work (e.g. jv or mr) we select the same pivot languages as in these works. For all others we rely on lang2vec [9] to choose appropriate pivot languages.

A.2 More Details about WikiPriors and Charagram

WikiPriors is based on the candidate generator used in previous work [20, 21]. It falls back to the pivot language when the index of the target language does not retrieve candidates. In the zero-shot setting, by contrast, it relies only on the index constructed using the pivot language. Furthermore, for both the zero-shot and supervised settings, it constructs indexes that associate the constituent words of mention strings with entities using prior probabilities. These indexes are used to retrieve candidates if the indexes on mention strings provide fewer than the required number of candidates. Previous work [28] has shown that it is sometimes competitive with Charagram.

We replicate the exact same setup for Charagram as reported in [28]: the tokenizer tknr, which is applied to mentions and (English) entity names, returns character n-grams with the same values of n as in [28]. The embedding size is 300. We train the model with stochastic gradient descent (SGD) with batch size 64, and a learning rate of 0.1. We stop training if the micro recall@30 on the validation set does not increase for 50 epochs, and the maximum number of training epochs is set to 200.
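The stopping rule described above can be sketched as a generic training loop. This is a minimal sketch, not the original training code: `train_epoch` and `validate` are hypothetical callables standing in for one SGD pass over the training data and for the computation of micro recall@30 on the validation set.

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=200, patience=50):
    """Train until validation recall has not improved for `patience` epochs.

    Returns (best_epoch, best_recall, last_epoch_run).
    """
    best_recall, best_epoch = float("-inf"), 0
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()                 # one pass over the training data
        recall = validate()           # e.g. micro recall@30 on the validation set
        if recall > best_recall:
            best_recall, best_epoch = recall, epoch
        elif epoch - best_epoch >= patience:
            break                     # no improvement for `patience` epochs
    return best_epoch, best_recall, epoch


if __name__ == "__main__":
    # Simulated validation curve: improves for three epochs, then plateaus.
    scores = iter([0.1, 0.2, 0.3] + [0.3] * 300)
    print(train_with_early_stopping(lambda: None, lambda: next(scores), patience=5))
```

In practice one would also keep a copy of the parameters from the best epoch; this is omitted here for brevity.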

Thresholding Value
Target - Pivot 0 1e-2 0.1
co - it 63 (0%) 62 (72%) 48 (98%)
li - nl 84 (0%) 83 (65%) 73 (98%)
bar - de 88 (0%) 87 (75%) 70 (99%)
olo - fi 72 (0%) 71 (68%) 55 (97%)
mk - bg 71 (0%) 69 (66%) 55 (95%)
jv - id 87 (0%) 87 (64%) 79 (96%)
tk - az 69 (0%) 69 (56%) 64 (92%)
mr - hi 70 (0%) 69 (62%) 58 (94%)
fo - is 79 (0%) 79 (53%) 73 (90%)
Table 5: Micro recall@30 after applying thresholding. In parentheses, the percentage of entries that are removed from the indexes built by pti.
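The thresholding evaluated in Table 5 can be illustrated with a small sketch. It assumes an index stored as a dictionary from tokens to entity probabilities and assumes that entries with a probability strictly below the threshold are dropped; both the data layout and the exact comparison are our assumptions, and the toy values are hypothetical.

```python
def threshold_index(index, tau):
    """Drop entries with probability below `tau`.

    `index` maps token -> {entity: probability}. Returns the pruned index
    and the fraction of entries that were removed.
    """
    total = sum(len(row) for row in index.values())
    pruned = {}
    for token, row in index.items():
        kept = {entity: p for entity, p in row.items() if p >= tau}
        if kept:                       # tokens left with no entries are discarded
            pruned[token] = kept
    remaining = sum(len(row) for row in pruned.values())
    removed = 1.0 - remaining / total if total else 0.0
    return pruned, removed


if __name__ == "__main__":
    toy_index = {
        "ro": {"E1": 0.9, "E2": 0.005, "E3": 0.095},
        "ma": {"E2": 0.004},
    }
    pruned, removed = threshold_index(toy_index, 0.01)
    print(pruned, f"{removed:.0%} removed")
```

With a threshold of 0 nothing is removed, which matches the first column of Table 5.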

A.3 Complexity of pti and Charagram

All experiments were run using code written in Python on an Intel(R) Xeon(R) E5-2680 24-core machine with a 2.50GHz CPU and 256 GB RAM, running Linux Ubuntu 16.04. Experiments for Charagram are run on a TitanX GPU. We do not make use of any parallelism in the experiments.

As previously discussed, the candidate generator must be computationally efficient. For this reason we compare the memory and time complexity of pti to that of Charagram. We experimentally observe that thresholding significantly reduces the number of non-zero prior and posterior probabilities in pti (see Table 5), with very little impact on performance. These experiments correspond to the best performing learning setting. We perform further analysis of pti with a thresholding value of 0.01, and compare it to Charagram (trained with 80,000 data points) in terms of number of parameters, training time and inference time. The inference time corresponds to the time needed to evaluate all mention queries (3,000 queries for almost all target languages). All target languages are evaluated in the best performing learning setting. A comparison is shown in Table 6. The findings are summarized below:

Metric #Parameters (memory) Training time (minutes) Inference time (seconds)
Target - Pivot pti Charagram pti Charagram pti Charagram
co - it 45M 200M 10 180 45 85
li - nl 38M 220M 7 495 60 120
bar - de 80M 225M 20 645 150 150
olo - fi 18M 180M 3 420 12 70
mk - bg 10M 225M 1 770 30 120
jv - id 11M 200M 2 480 20 120
tk - az 4M 160M 1 250 3 60
mr - hi 3.5M 150M 1 680 10 110
fo - is 2.2M 120M 1 480 6 115
Table 6: Comparison between pti (with a thresholding value of 0.01) and Charagram (trained with 80,000 data points, as in [28]) in terms of number of parameters, training time and inference time.
Figure 4: Recall@k (R@k) of pti and Charagram (Char) for several values of k.

  • Number of parameters. While the number of parameters remains more or less constant for Charagram, the number of parameters of pti is proportional to the amount of resources of the pivot language.

  • Training time. The training time of pti corresponds to the time it takes to construct the prior and posterior probabilities, which is proportional to the amount of resources of the pivot language. The optimization of the learning objective performed by Charagram is significantly slower.

  • Inference time. pti uses sparse matrices to represent the prior and posterior probabilities. At test time, scores are simply computed by aggregating rows of such matrices. Moreover, the search performed by pti to find the largest values only includes those candidates with non-zero scores. These characteristics often make pti more efficient than Charagram during inference.
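The inference step described in the last item can be sketched as follows. This is a simplified illustration, not the pti implementation: a single token-to-entity weight table stands in for the sparse probability matrices (how the prior and posterior probabilities are combined is not reproduced here), the character bigram tokenizer is hypothetical, and plain dictionaries replace sparse matrices to keep the example dependency-free.

```python
import heapq
from collections import defaultdict


def bigrams(text):
    """Hypothetical tokenizer: overlapping character bigrams."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def top_k_candidates(mention, token_rows, k=30):
    """Aggregate the sparse rows of the mention's tokens and return the top k.

    `token_rows` maps token -> {entity: weight}. Only entities that appear in
    at least one row (i.e. have a non-zero score) take part in the search.
    """
    scores = defaultdict(float)
    for token in bigrams(mention):
        for entity, weight in token_rows.get(token, {}).items():
            scores[entity] += weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    rows = {"ro": {"E1": 0.6, "E2": 0.2}, "om": {"E1": 0.5}, "ma": {"E3": 0.3}}
    print(top_k_candidates("roma", rows, k=2))
```

The cost of a query depends on the number of non-zero entries in the rows it touches, not on the total number of entities.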

A.4 Recall@k for different values of k

Previous works on candidate generation have set k to 30. However, when the candidate generator is followed by an entity disambiguation technique, other values are typically used, such as (approximately) 10 [4] and 20 [21]. Figure 4 shows the recall of pti and Charagram for these other values. Except for co - it, olo - fi and tk - az, where zero-shot is the only possible learning setting, all results are obtained in the supervised setting. The relative performance of the methods is similar for all values of k.
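For reference, the metric can be computed for several cut-offs from a single ranked list per query. The sketch below assumes one gold entity per query and pools all queries together; the function name and the toy data are ours.

```python
def micro_recall_at_k(ranked_candidates, gold_entities, k):
    """Fraction of queries whose gold entity is among the top-k candidates."""
    if not gold_entities:
        return 0.0
    hits = sum(1 for candidates, gold in zip(ranked_candidates, gold_entities)
               if gold in candidates[:k])
    return hits / len(gold_entities)


if __name__ == "__main__":
    ranked = [["E1", "E2", "E3"], ["E4", "E5"], ["E6"]]
    gold = ["E3", "E4", "E9"]
    for k in (1, 3):
        print(k, micro_recall_at_k(ranked, gold, k))
```

Because the ranking is computed once, evaluating additional values of the cut-off only requires slicing the same lists.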

A.5 Other versions of pti

Wildcard Tokens The tokenizer extends the tokens returned for a given mention with wildcard tokens. Wildcard tokens include placeholders represented by an asterisk, which can be interpreted as any character. We observed some minor improvements in recall for some target languages, but this notably increased the number of tokens and, consequently, the memory requirements of the approach.
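One possible reading of this extension is sketched below. The exact construction of wildcard tokens is not specified above, so the sketch assumes that each character n-gram additionally yields one variant per position with that character replaced by an asterisk; the function names and the choice of n = 3 are hypothetical.

```python
def char_ngrams(text, n):
    """Overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def wildcard_variants(token):
    """All copies of `token` with exactly one character replaced by '*'."""
    return [token[:i] + "*" + token[i + 1:] for i in range(len(token))]


def tokenize_with_wildcards(mention, n=3):
    """Return the n-grams of `mention` together with their wildcard variants."""
    tokens = set(char_ngrams(mention, n))
    for token in list(tokens):
        tokens.update(wildcard_variants(token))
    return tokens


if __name__ == "__main__":
    plain = set(char_ngrams("roma", 3))
    extended = tokenize_with_wildcards("roma", 3)
    print(len(plain), len(extended), sorted(extended))
```

Even on this short mention the token set grows from 2 to 8 entries, which illustrates the memory overhead noted above.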

Probability-level Fusion For the supervised learning setting, we explored another approach to integrate the information from both the target and pivot language. Independent prior and posterior probabilities are computed for both languages, and linearly combined with a weighting factor. Finally, the probabilities are normalized. We did not observe any improvement in performance with this fusion technique.
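The fusion step can be sketched for a single distribution over entities. The sketch assumes a convex combination with a weight `alpha` on the target language; the name `alpha`, the dictionary layout and the toy values are our assumptions.

```python
def fuse(target_probs, pivot_probs, alpha):
    """Linearly combine two entity distributions and renormalize.

    `alpha` weights the target language and (1 - alpha) the pivot language.
    """
    entities = set(target_probs) | set(pivot_probs)
    mixed = {entity: alpha * target_probs.get(entity, 0.0)
                     + (1.0 - alpha) * pivot_probs.get(entity, 0.0)
             for entity in entities}
    total = sum(mixed.values())
    if total == 0:
        return {}
    return {entity: value / total for entity, value in mixed.items()}


if __name__ == "__main__":
    target = {"E1": 1.0}
    pivot = {"E1": 0.5, "E2": 0.5}
    print(fuse(target, pivot, alpha=0.5))
```

When both inputs already sum to one, the combination does too; the final normalization matters when the inputs are unnormalized, for instance after thresholding.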

Probability Smoothing We perform one additional step in the previous version. We use additive smoothing [12] (also known as Laplace smoothing) in the computation of the prior and posterior probabilities for the pivot language. These smoothed probabilities account for possible discrepancies between the target and pivot language. We sometimes observed a minor improvement for some query types, but this comes at the cost of validating one more hyperparameter (the smoothing factor).
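Additive smoothing itself is standard: a constant is added to every count before normalizing, so that unseen events receive a small non-zero probability. The sketch below applies it to hypothetical mention-entity counts; the variable names and the toy counts are ours, and `alpha` plays the role of the smoothing factor mentioned above.

```python
def smoothed_probabilities(counts, vocabulary, alpha=1.0):
    """Additive (Laplace) smoothing of `counts` over a fixed `vocabulary`.

    P(v) = (count(v) + alpha) / (sum of counts + alpha * |vocabulary|)
    """
    total = sum(counts.get(v, 0) for v in vocabulary) + alpha * len(vocabulary)
    return {v: (counts.get(v, 0) + alpha) / total for v in vocabulary}


if __name__ == "__main__":
    counts = {"E1": 3, "E2": 1}           # hypothetical co-occurrence counts
    vocabulary = ["E1", "E2", "E3"]       # E3 was never observed
    print(smoothed_probabilities(counts, vocabulary, alpha=1.0))
```

Note that smoothing turns zero entries into non-zero ones, so applying it over the full entity vocabulary would work against the sparsity that makes pti memory-efficient.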