Design Challenges for Low-resource Cross-lingual Entity Linking

05/02/2020 ∙ by XingYu Fu, et al. ∙ University of Pennsylvania, Columbia University

Cross-lingual Entity Linking (XEL) grounds mentions of entities that appear in a foreign (source) language text into an English (target) knowledge base (KB) such as Wikipedia. XEL consists of two steps: candidate generation, which retrieves a list of candidate entities for each mention, followed by candidate ranking. XEL methods have been successful on high-resource languages, but generally perform poorly on low-resource languages due to lack of supervision. In this paper, we present a thorough analysis of existing low-resource XEL methods, focusing on their candidate generation methods and limitations. We report two main findings: (1) they are heavily limited by the coverage of Wikipedia bilingual resources; (2) they perform better on Wikipedia text than on real-world text such as news or Twitter. We claim that, under the low-resource language setting, outside-Wikipedia cross-lingual resources are essential. To support this argument, we propose a simple but effective zero-shot framework, CogCompXEL, that complements current methods by utilizing query log mapping files from online search engines. CogCompXEL outperforms current state-of-the-art models on almost all 25 languages of the LORELEI dataset, achieving an absolute average increase of 25% in gold candidate recall.




1 Introduction

Cross-lingual Entity Linking (XEL) aims at grounding mentions written in a foreign (source) language into entries in (target) language Knowledge Bases (KB), typically the English Wikipedia. The task involves two main steps: (1) candidate generation, retrieving a list of candidate KB entries for each entity mention, and (2) candidate ranking, selecting the most likely entry from the candidates.

The importance of XEL stems from its ability to support some level of knowledge acquisition directly from documents in any language without resorting to machine translation. The challenge in XEL lies in obtaining cross-lingual supervision for candidate generation and in resolving contextual ambiguity. For low-resource languages, generating the correct candidate becomes harder, since the natural source of supervision, Wikipedia, is limited. In this paper, we focus on the low-resource XEL candidate generation problem.

Existing low-resource XEL systems Tsai et al. (2016); Upadhyay et al. (2018); Pan et al. (2017); Zhang et al. (2018); Rijhwani et al. (2019); Zhou et al. (2020) all depend heavily, and exclusively, on Wikipedia cross-lingual resources. Common resources for low-resource XEL include:

  • Wikipedia Bilingual Title Mappings: A map between source language entities and English entities. The mapping comes from Wikipedia inter-language links between the source language and English. It can directly link a low-resource entity to the English KB without ambiguity.

  • Wikipedia Anchor Text Linking: In Wikipedia articles, an underlined text mention is annotated with anchor text linking to an entity. These annotations are usually used together with bilingual title mappings to build a mention-entity probability map, which provides English KB candidates for a low-resource mention and supports disambiguation.

Consequently, the availability and variety of these Wikipedia resources significantly limit the performance of XEL systems on low-resource languages. Unlike high-resource languages such as English or German, which have 6,051,918 and 2,416,422 articles respectively, low-resource languages such as Odia (15,673 articles) and Sinhalese (15,642 articles) do not get enough supervision signals from Wikipedia resources. Also, the limited variety of mentions in low-resource language Wikipedias significantly weakens XEL candidate generation when it is evaluated on real-world text such as news or social media.

In this paper, we first give a thorough evaluation and analysis of candidate generation in several SoTA supervised low-resource XEL methods, and show their limitations due to lack of relevant data. We then claim that, under the low-resource language setting, outside-Wikipedia cross-lingual resources are essential. Specifically, we suggest utilizing the abundant query log files of online search engines in various ways to compensate for the lack of supervision. Query logs, just like Wikipedia, are a free resource, generated in a distributed manner by a large number of users. Unlike Wikipedia, they are noisy and require more careful handling to be used fruitfully. Therefore, we propose CogCompXEL, a simple but efficient zero-shot framework that utilizes query log mapping files from online search engines and achieves SoTA gold candidate recall on all languages. Compared with state-of-the-art models, CogCompXEL achieves an average 25% increase in candidate recall on the LORELEI dataset and an average 3% increase in candidate recall on the Wikipedia dataset. To evaluate end-to-end linking accuracy, we also propose a simple candidate ranking module. With it, CogCompXEL achieves an absolute average increase of 13% in linking accuracy on the LORELEI dataset. It also achieves state-of-the-art performance on the Wikipedia dataset. Finally, through an exhaustive ablation study, we further examine the effectiveness of our method on 7 randomly selected low-resource languages.

2 Related Work

The XEL problem was introduced in the TAC KBP Entity Linking Tracks, which developed a dataset in English, Chinese and Spanish Ji and Nothman (2016). The first approach to XEL, Tsai et al. (2016), uses Wikipedia inter-language links and hyperlinks as supervision and trains multilingual embeddings using words and Wikipedia titles for candidate ranking. Upadhyay et al. (2018) follow up using the same candidate generation method and develop an approach that combines supervision from multiple languages jointly for candidate ranking. Pan et al. (2017) and Zhang et al. (2018) focus on scaling up XEL to many languages and develop a system for 282 Wikipedia languages. These methods focus on extracting name translation mappings from Wikipedia and generalize better because they train single-word mappings to English.

A different line of work is presented in Rijhwani et al. (2019); Zhou et al. (2019) where, in an attempt to generate better candidates, mentions in the source language are translated into a “pivot” language that is closely related to the source language but has more cross-lingual resources. After these XEL systems transfer the mention into the selected “pivot” language, they further link it to an English entity to get the final result. These methods suffer from the need to find high-quality language pairs, and not all low-resource languages have a higher-resource language to pivot to.

Transliteration that jointly leverages similarities between related languages, such as writing systems or phonetic properties, is another method proposed for the XEL problem GadElrab2015NamedED, Tsai2018LearningBN, upadhyay-etal-2018-bootstrapping. Many approaches use transliteration as one of the resources for candidate generation, and GadElrab2015NamedED uses it for candidate disambiguation. Recently, transliteration and name translation methods that require less training data have been proposed, and they generally help the XEL task on low-resource languages.

In the standard English entity linking task, 6823700 pointed out that Google query logs can be an efficient way to identify candidates. Dredze:2010:EDK:1873781.1873813, Monahan2011CrossLingualCC use search results as one of their methods for candidate generation on high-resource entity linking tasks. However, no previous work studies how to combine query log methods with XEL for low-resource languages.

3 Limitations of Existing Models – Candidate Generation

3.1 Problem Formulation

We formalize the XEL problem as follows. Given a document annotated with mentions in the source language, an XEL system aims to retrieve, for each mention m, the English Wikipedia entity e that the mention refers to. The task breaks down into two processes: candidate generation and candidate ranking. Given a mention m, we first generate a set of English entity candidates. Given this candidate set, we compute a weighting score for each candidate to rank the candidates and select the most likely one.
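The two-step formulation can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the names `link_mention` and `gold_candidate_recall`, the type aliases, and the reading of gold candidate recall as "fraction of mentions whose gold entity appears among the generated candidates" are assumptions introduced here.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical type aliases, introduced for illustration only.
CandidateGenerator = Callable[[str], List[str]]  # mention -> English KB titles
CandidateScorer = Callable[[str, str], float]    # (mention, candidate) -> score


def link_mention(mention: str,
                 generate: CandidateGenerator,
                 score: CandidateScorer) -> Optional[str]:
    """Two-step XEL: generate candidates, then return the best-scoring one."""
    candidates = generate(mention)
    if not candidates:
        # If generation returns nothing, ranking has nothing to recover from.
        return None
    return max(candidates, key=lambda c: score(mention, c))


def gold_candidate_recall(pairs: Iterable[Tuple[str, str]],
                          generate: CandidateGenerator) -> float:
    """Fraction of (mention, gold entity) pairs whose gold entity is generated.

    Assumed definition; the text uses the term without spelling it out.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for mention, gold in pairs if gold in generate(mention))
    return hits / len(pairs)
```

Because ranking can only choose among generated candidates, gold candidate recall upper-bounds end-to-end linking accuracy, which is why the analysis concentrates on candidate generation.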

3.2 Datasets

LORELEI dataset Strassel and Tracey (2016) includes entity mentions extracted from news or social media, where most languages are low-resource. The dataset links into a knowledge base built using GeoNames for GPE (Geopolitical entity) and LOC (Location) entities, and other resources for the remaining entities. Dataset details are reported in Table 9. We mapped the original gold entities to the corresponding Wikipedia entries. Specifically, for GPE and LOC entities, we link a mention to a Wikipedia entry if a Wikipedia link exists inside the GeoNames information. Otherwise, we do not include the mention in the gold dataset. For the remaining PER (Person) and ORG (Organization) entities, we search for the English surface names of the gold entries, and then have several experts manually check the correctness of the resulting Wikipedia entries, considering that such mentions are relatively few. Dataset details are included in Appendix Section A; the test size across languages ranges between 300 and 2500, with an average of 1176.
Wikipedia-based dataset collected from Tsai et al. (2016) is built upon Wikipedia hyperlinks using entity mentions that have bilingual Wikipedia page mappings between the source and target languages; we use their test data only. Dataset details are reported in Table 10. In contrast to the LORELEI dataset, almost all of the source languages in this dataset are high-resource languages. Dataset details are also included in Appendix Section A; the test size across languages ranges between 1000 and 12000, with an average of 7046.

p(e|m) name_trans pivoting translit
xelms (x)
PBEL_PLUS (x) (x)
Table 1: Baselines (introduced in Section 5) in the left column, and the candidate generation methods they use in the top row.

3.3 Existing methods and Their Supervision

As shown in Table 1, p(e|m) in xelms Upadhyay et al. (2018), name_trans in ELISA Zhang et al. (2018), pivoting to a high-resource language in PBEL_PLUS Zhou et al. (2020), and translit Upadhyay et al. (2018) are the four most popular candidate generation methods for low-resource XEL systems. The systems are described in Section 5. The candidate generation methods, along with the level of supervision they need, are listed below.

  • p(e|m) is the most popular and earliest method for candidate generation Tsai et al. (2016); Upadhyay et al. (2018). The method follows a linking flow of mention – KB entity – English KB entity. It uses cross-article anchor text hyperlinks along with bilingual Wikipedia title mappings to build a mention – English KB probabilistic map. For instance, if the Oromo mention “Itoophiyaatti” is linked to the entity “Itoophiyaa” in some Oromo Wikipedia articles, the corresponding English Wikipedia entity “Ethiopia” will be added as a candidate for the mention “Itoophiyaatti”. (The Oromo Wikipedia page and the English one are linked by Wikipedia.) The score that p(e|m) provides is calculated from frequency counts and gives the probability of linking to entity e given mention m.

  • name_trans (Name Translation), as introduced in Pan et al. (2017); Zhang et al. (2018), is the most efficient method on the LORELEI dataset. It learns from bilingual Wikipedia title pairs to obtain a fixed mapping between each single word and possible English Wikipedia entries; that is, it tries to map each word in the given mention to English. It then links the translation to the target KB through an unsupervised collective inference approach. For instance, to link the Suomi name “Pekingin tekninen instituutti” (Beijing Institute of Technology), it translates each word correspondingly (“Pekingin” – Beijing, “tekninen” – Technology, “instituutti” – Institute), then links the English translation to the KB.

  • pivoting to a related high-resource language (HRL) is another efficient method Zhou et al. (2020) for low-resource XEL. The method trains on the two aforementioned kinds of Wikipedia data, and tries to generalize to unseen low-resource mentions through grapheme or phoneme similarity. Specifically, it first finds a related HRL for the low-resource language through n-gram matching. Then it learns a neural model from bilingual (HRL and English) data pairs gathered from bilingual Wikipedia title pairs and from Wikipedia anchor text – English Wikipedia title pairs. The method either trains on the original text to learn from grapheme similarity when the HRL shares the same script with the low-resource language, or converts the original text to International Phonetic Alphabet (IPA) symbols to learn from phoneme similarity.

  • translit is the process of generating an English transliteration of a name written in another writing system. It is useful for generating an English candidate that has a similar pronunciation to the foreign mention in XEL. The method applies bootstrapping to mine name pairs in low-resource languages and trains a sequence-to-sequence model for string transduction.

  • Language-specific morphological normalization is a basic process for all candidate generation methods. An entity may have different surface forms in the document, which makes candidate generation difficult. To cope with this issue, several operations, including removing, adding, or replacing suffixes and prefixes, are applied as a preprocessing step.
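To make the p(e|m) flow (mention – source-language entity – English entity) concrete, the following is a minimal sketch under stated assumptions. The function name `build_pem`, the input shapes, and the choice to drop unlinked anchors before normalizing the counts are illustrative and not taken from the cited systems; the toy titles other than “Itoophiyaa” / “Ethiopia” are placeholders.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def build_pem(anchors: Iterable[Tuple[str, str]],
              title_map: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Build mention -> {English title: probability} from anchor-text counts.

    anchors:   (mention surface form, source-language title) pairs, one per
               anchor-text hyperlink found in the source-language Wikipedia.
    title_map: source-language title -> English title (inter-language links).
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for mention, src_title in anchors:
        en_title = title_map.get(src_title)
        if en_title is None:
            # No inter-language link: the linking chain breaks at this step.
            continue
        counts[mention][en_title] += 1

    pem: Dict[str, Dict[str, float]] = {}
    for mention, counter in counts.items():
        total = sum(counter.values())
        pem[mention] = {title: n / total for title, n in counter.items()}
    return pem


# Toy data: three anchors point to a linked title, one to another linked
# title, and one to a title without an English counterpart.
toy_anchors = ([("Itoophiyaatti", "Itoophiyaa")] * 3
               + [("Itoophiyaatti", "PlaceholderTitle"),
                  ("Itoophiyaatti", "UnlinkedTitle")])
toy_title_map = {"Itoophiyaa": "Ethiopia", "PlaceholderTitle": "Placeholder_Entity"}
toy_pem = build_pem(toy_anchors, toy_title_map)
```

The `continue` branch is where the dependence on bilingual title mappings shows up: a mention whose anchors all point to unlinked titles never enters the map and therefore receives no candidates.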

3.4 Limitations of Existing Methods

  1. The small size of bilingual Wikipedia title mappings is a general and severe problem for all three methods. All methods depend heavily on Wikipedia bilingual title pairs, while most low-resource languages have little overlap with the English Wikipedia.

    Lang.        Akan    Oromo   Zulu    Wolof   Somali
    Wiki size    726     621     1,328   1,231   4,025
    Entity (%)   70.93   34.68   49.77   37.54   12.93
    Table 2: Proportion of gold entities that have only English Wikipedia pages but no source language Wikipedia pages on the LORELEI dataset, for 5 randomly picked real low-resource languages, along with Wikipedia size. Wiki size is the number of articles, with statistics taken from the 2019-10-20 wikidump.

    Model        Akan    Oromo   Zulu    Wolof   Somali
    Gold Candidate Recall
    xelms        23.9    33.2    19.8    16.9    55.1
    ELISA        60.7    26.2    20.4    55.5    54.3
    PBEL_PLUS    31.7    19.1    40.8    48.5    57.0
    CogCompXEL   79.0    60.6    47.1    74.4    77.4
    Linking Accuracy
    xelms        23.9    31.4    19.8    16.6    54.5
    ELISA        38.0    25.6    12.4    42.2    45.0
    PBEL_PLUS    26.5    5.9     27.7    35.5    48.6
    CogCompXEL   53.8    45.4    23.8    51.8    72.1
    Table 3: Gold candidate recall (top) and end-to-end linking accuracy (bottom) for entities that have only English Wikipedia pages but no source language Wikipedia pages, over 5 randomly picked real low-resource languages on the LORELEI dataset. (Compared approaches and dataset are introduced in Section 5.)

    Figure 1: Mention token coverage refers to the percentage of mentions that have at least one token appearing in some Wikipedia entry or anchor text that can be bilingually mapped to an English Wikipedia entry. The figure shows mention token coverage and mention title inter versus gold candidate recall of each model on low-resource languages (top) and high-resource languages (bottom), both in increasing order of Wikipedia size. There is a clear trend that the lines are correlated with Wikipedia size. The figure also shows that previous methods are correlated with, and limited by, Wikipedia bilingual resources, while ours can reach and even exceed the Wikipedia coverage on low-resource languages.
    Figure 2: Ratio of gold candidate recall to mention token coverage (introduced in Figure 1). Average gold candidate recall is taken over the existing methods: p(e|m), name_trans, pivoting. For low-resource XEL, existing methods are almost always limited by the mention token coverage. However, our proposed candidate generation method Query_log + p(e|m) can reach 1.5 to 2 times the mention token coverage on low-resource languages, and always stays at 1 times the mention token coverage on high-resource languages. For both low-resource and high-resource languages, the result for Query_log + p(e|m) is statistically significant, with p-value 0.01%.
    Figure 3: Wikipedia sizes for each language.

    For p(e|m), since the linking mechanism flows as mention – Wikipedia entity – English Wikipedia entity, the method can easily fail when any part of the chain is missing. For instance, the method fails when a person has an English Wikipedia page without an inter-language link to an Oromo Wikipedia page, or vice versa. Another situation is that the two corresponding Wikipedia pages exist but are simply not linked. For example, “Australopithecus africanus” is a title in both the Oromo and the English Wikipedia, but the two pages are not connected by an inter-language link. Tables 2 and 3 show statistics and performance on 5 randomly picked low-resource languages from the LORELEI dataset, to highlight the heavy dependence of existing methods on bilingual Wikipedia title mappings.

    For name_trans, bilingual title pairs are essential for obtaining the token – English Wikipedia dictionary, which is ELISA’s word translation model. Clearly, it is limited by the tokens contained in the titles. For a given mention, if none of its tokens ever appeared in the Wikipedia titles, then none of the tokens provides an English translation, and the mention ends up without any English Wikipedia candidates.

    Figure 1 shows the mention token coverage versus gold candidate recall on the LORELEI dataset. Mention token coverage is the percentage of mentions that have at least one token mapping to an English Wikipedia title through either p(e|m) or name_trans. It clearly shows that existing methods are largely limited by the mention token coverage, while CogCompXEL can break the limit on low-resource languages and stays highest on high-resource languages.

  2. Having only a few Wikipedia articles of small size can cause insufficient supervision when extracting mention – English Wikipedia title pairs. Given the large divergence between real-world mentions and Wikipedia titles, without sufficient coverage of mentions it is extremely hard for models to generalize to mentions that have never been seen before.

    For instance, the Oromo Wikipedia article for “Laayibeeriyaa” has much less information (anchor text hyperlinks) than the English Wikipedia article for “Liberia”, even though they are the same entity. As a consequence, p(e|m) and pivoting based methods suffer considerably from small Wikipedia article sizes.

    Figure 3 shows the Wikipedia size (number of articles) for the LORELEI dataset languages. Comparing it with Figure 1, we can see that gold candidate recall increases when the Wikipedia size is large, and decreases when the size is small.

  3. Simply using Wikipedia supervision is not enough to cover every mention – the linking performance can be largely limited due to mention mismatch between given mentions and Wikipedia covered words.

    We propose a global metric, mention token coverage (shown in Figure 1), that can be computed without gold data and that captures the mismatch between given mentions and Wikipedia-covered words in low-resource languages. Mention token coverage refers to the percentage of mentions that have at least one token appearing in some Wikipedia entry or anchor text that can be bilingually mapped to an English Wikipedia entry. We argue that the performance of current systems is highly correlated with this metric. As we can see in Figure 2, the gold candidate recall of existing methods ranges between 0.5 and 1 times the mention token coverage. We also argue that higher coverage provides better supervision. In Figure 2, our proposed Query_log + p(e|m) always stays at 1 times the mention token coverage on high-resource languages. Combined with Figure 3, we can see clearly that for low-resource languages the mention–entity token intersection is much smaller than for high-resource languages, and thus EDL methods suffer more from mention–entity mismatch.

  4. Linguistic properties are hardly satisfied for pivoting based methods. While pivoting does not suffer from scarce bilingual resources, its difficulty lies in finding a good HRL that is similar enough. pivoting learns the linking through grapheme or phoneme similarity, and is thus limited by language similarity: when pivoting is given a mention that has no token or IPA symbols covered by the HRL resources seen in training, it can fail completely and generate no candidate.

    As for learning from grapheme similarity, not every low-resource language has a related HRL that uses the same script. As a result, pivoting uses phoneme similarity in most cases. While finding an HRL is already hard enough, there are not enough resources to even convert low-resource languages into IPA symbols. pivoting in Zhou et al. (2020) uses Epitran Mortensen et al. (2018) to convert strings to International Phonetic Alphabet (IPA) symbols, but Epitran does not support all 309 Wikipedia languages. It covers only 55 languages at the time of this writing. Therefore, linguistic properties are hardly satisfied for pivoting methods.

  5. Entity and mention do not have a word-by-word mapping – A transliteration model is useful when phonological similarity is preserved in an entity. However, it cannot be applied when entity and mention do not have a word-by-word mapping Zhou et al. (2020). Such mismatches can come from additional words and aliases. Table 4 shows results of transliteration models trained on the high-quality NEWS2015 dataset and on name pairs from Wikipedia. Candidate recall with transliteration trained on NEWS2015 is higher than with the model trained on Wikipedia by 5.9% for Hindi and 1.8% for Bengali. The performance discrepancy between Hindi and Bengali is likely due to the gap in the proportion of name-entity mismatch. We find that the proportion of entries where mention and entity share the same number of words is 6.3% higher in Hindi than in Bengali.

    Another limitation of transliteration is that it requires a large number of name pairs. Many approaches use Wikipedia inter-language links to mine name pairs for supervision. However, supervision signals in low-resource languages can be limited and noisy. The number of inter-language links in low-resource languages is small, and the identified name pairs may not be perfect transliterations. The results in Table 4 demonstrate that the transliteration model does not give significant improvement in the low-resource setting.

    Model           Hindi   Bengali   Oria    Sinhala
    translit-Wiki   24.6    23.4      13.4    8.6
    translit-News   30.5    25.2      -       -
    Table 4: Gold candidate recall for four languages. translit-News and translit-Wiki refer to transliteration models trained on NEWS2015 Duan et al. (2015) and on name pairs obtained from inter-language Wikipedia links, respectively. NEWS2015 does not include Oria and Sinhala.
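The mention token coverage metric from item 3 and the recall-to-coverage ratio plotted in Figure 2 can be sketched as follows. This is a hedged illustration: whitespace tokenization, exact string matching, and the function names are assumptions made here, and `covered_tokens` stands for the set of tokens that appear in Wikipedia entries or anchor text with a bilingual mapping to English.

```python
from typing import Iterable, Set


def mention_token_coverage(mentions: Iterable[str],
                           covered_tokens: Set[str]) -> float:
    """Share of mentions with at least one token in `covered_tokens`.

    No gold annotations are needed: only the mentions and the token set.
    """
    mentions = list(mentions)
    if not mentions:
        return 0.0
    hits = sum(
        1 for mention in mentions
        if any(token in covered_tokens for token in mention.split())
    )
    return hits / len(mentions)


def recall_to_coverage_ratio(recall: float, coverage: float) -> float:
    """Gold candidate recall divided by mention token coverage."""
    if coverage <= 0:
        return float("nan")  # ratio is undefined without any coverage
    return recall / coverage


# Toy example: two of four mentions contain a covered token.
toy_covered = {"Pekingin", "instituutti"}
toy_mentions = ["Pekingin tekninen instituutti", "Laayibeeriyaa",
                "Itoophiyaatti", "Pekingin"]
toy_coverage = mention_token_coverage(toy_mentions, toy_covered)
```

A ratio at or below 1 means a method recovers gold candidates only for mentions that Wikipedia already covers; a ratio above 1 requires supervision from outside the Wikipedia-derived token set.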

4 Our Method: CogCompXEL

To deal with the limitations mentioned in Section 3, we argue that additional robust cross-lingual resources outside of Wikipedia must be collected to compensate for the lack of supervision. We propose to use query log mapping files of online search engines for low-resource XEL along with p(e|m). Some results of our candidate generation method, Query_log + p(e|m), are shown in Figures 1 and 2. We provide a simple but robust and efficient zero-shot method to combine multiple cross-lingual resources. In our model we choose Google search as the additional supervision source, although any search engine could be used.

4.1 Candidate Generation

We obtain a high-quality candidate list using a combination of techniques including morphological normalization, Google query logs, two kinds of pivoting techniques, p(e|m), and two kinds of transliteration models, utilizing Wikipedia cross-lingual resources along with Google Map bilingual supervision.

4.1.1 Query logs

We leverage web information such as Google query log mapping files not only to make better use of Wikipedia cross-lingual information, but also to retrieve additional resources for candidate generation. Specifically, we query the Google search engine using the normalized mention form and retrieve, among the search results, the top Wikipedia pages that are in the source language or in English. The English entity pages are directly marked as candidates, while the source-language pages are first converted to the corresponding English pages through inter-language links and then marked as candidates.

We also use the Google Map KB to augment the cross-lingual supervision beyond Wikipedia resources, in order to generate better candidates. When the given mention is labeled as “GPE” (Geopolitical entity) or “LOC” (Location), we further integrate the cross-lingual supervision from the Google Map KB by querying its API. We retrieve the top English result if there is one, and feed the returned result to the Google search engine again. Depending on the language, “wiki” or “[SL’s Country]” can be added to the search query.
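The filtering step described above (keep the top Wikipedia pages in English or the source language, and map the latter to English through inter-language links) can be sketched as below. This is a minimal sketch, not the system's code: it assumes the search result URLs have already been retrieved by some means, and the names `parse_wikipedia_url`, `query_log_candidates`, and `top_k`, as well as the toy URLs, are illustrative.

```python
from typing import Dict, List, Optional, Tuple
from urllib.parse import unquote, urlparse


def parse_wikipedia_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (language code, page title) for a Wikipedia article URL, else None."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if not host.endswith(".wikipedia.org") or not parsed.path.startswith("/wiki/"):
        return None
    lang = host.split(".")[0]
    title = unquote(parsed.path[len("/wiki/"):])
    return (lang, title) if title else None


def query_log_candidates(result_urls: List[str],
                         source_lang: str,
                         title_map: Dict[str, str],
                         top_k: int = 5) -> List[str]:
    """Turn ranked search-result URLs into at most `top_k` English candidates.

    title_map: source-language title -> English title (inter-language links).
    """
    candidates: List[str] = []
    for url in result_urls:
        parsed = parse_wikipedia_url(url)
        if parsed is None:
            continue  # not a Wikipedia article page
        lang, title = parsed
        if lang == "en":
            en_title = title                 # English page: direct candidate
        elif lang == source_lang:
            en_title = title_map.get(title)  # convert via inter-language link
        else:
            en_title = None                  # other languages are left to pivoting
        if en_title and en_title not in candidates:
            candidates.append(en_title)
        if len(candidates) == top_k:
            break
    return candidates


# Toy, hand-written result list (not real search output).
toy_urls = [
    "https://om.wikipedia.org/wiki/Itoophiyaa",
    "https://example.com/news",
    "https://en.wikipedia.org/wiki/Ethiopia",
    "https://en.wikipedia.org/wiki/Addis_Ababa",
]
toy_candidates = query_log_candidates(toy_urls, "om", {"Itoophiyaa": "Ethiopia"})
```

Setting `top_k` to 1 or 5 would correspond to the "top1" and "top5" settings in the ablation table, under the assumption that those settings refer to the number of retained results.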

4.1.2 Pivoting

Pivoting methods listed here are based on query logs and different from pivoting as introduced in Section 3.

Language-specific Pivoting is used on our selected language pairs. We use a simple UTF-8 converter to translate a mention into a related but higher-resource language, such as Oria to Hindi, and then conduct XEL on the new mention. We employ pivoting after morphological normalization, and normalize again the returned pivoted mention.

Language-indifferent Pivoting. Some low-resource languages have multiple highly related, higher-resource languages with which they may share a writing system. In this case, we include the Wikipedia entity pages returned by the query that are in any language other than the source language and English as language-indifferent pivoting candidates. For example, Tigrinya can be pivoted to multiple languages as long as they all share the Ge’ez script. One query result shares the same surface form with the Tigrinya mention to be linked, but is in Amharic. Another query result for the same mention, returned when we add “ wiki” to the query input, is in Scots.

We further treat the returned pivoting entities, with their corresponding languages, as new inputs, and run them through our method to generate candidates.
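The re-querying loop for language-indifferent pivoting might look like the sketch below. It is an illustration under assumptions: `search` is a hypothetical callable standing in for the query step, `max_depth` is an added safeguard not mentioned in the text, and the language codes and titles in the toy data are placeholders.

```python
from typing import Callable, List, Set, Tuple

Page = Tuple[str, str]                       # (Wikipedia language code, page title)
SearchFn = Callable[[str, str], List[Page]]  # hypothetical: (query, language) -> pages


def pivot_candidates(mention: str, source_lang: str, search: SearchFn,
                     max_depth: int = 1) -> List[Page]:
    """Collect pages in English or the current query language.

    Pages in any other language are treated as pivots: their titles are
    re-submitted as new queries in the pivot language, up to `max_depth` hops.
    """
    found: List[Page] = []
    seen: Set[Page] = set()
    frontier = [(mention, source_lang, 0)]
    while frontier:
        query, lang, depth = frontier.pop(0)
        for page in search(query, lang):
            if page in seen:
                continue
            seen.add(page)
            page_lang, title = page
            if page_lang in ("en", lang):
                found.append(page)  # usable directly or via inter-language links
            elif depth < max_depth:
                frontier.append((title, page_lang, depth + 1))
    return found


# Toy search table with placeholder titles and language codes.
toy_results = {
    ("MENTION", "ti"): [("am", "PIVOT_TITLE")],
    ("PIVOT_TITLE", "am"): [("en", "English_Title"), ("am", "PIVOT_TITLE")],
}
toy_search = lambda query, lang: toy_results.get((query, lang), [])
toy_pages = pivot_candidates("MENTION", "ti", toy_search)
```

Pages returned in a pivot language still need to be mapped to English through inter-language links, exactly as source-language pages are in the query-log step.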

4.1.3 p(e|m)

We use the same p(e|m) introduced in Section 3 as part of the candidate generation to retrieve a full list of candidates.

Figure 4: End-to-end XEL linking accuracy on the LORELEI dataset sorted by Wikipedia size in ascending order. Specific scores are reported in Tables 12 and 11 in Appendix A.
Figure 5: XEL gold candidate recall on the LORELEI dataset sorted by Wikipedia size in ascending order. Specific scores are reported in Tables 12 and 11 in Appendix A.
Figure 6: End-to-end XEL linking accuracy on the Wikipedia-based dataset sorted by Wikipedia size in ascending order. Specific scores are reported in Table 13 in Appendix A.


Figure 7: XEL gold candidate recall on the Wikipedia-based dataset sorted by Wikipedia size in ascending order. Specific scores are reported in Table 13 in Appendix A.

Model        Avg Candidate Recall   Avg Linking Accuracy
LORELEI Dataset
xlwikifier   52.54                  46.58
xelms        52.54                  48.65
ELISA        50.52                  43.91
PBEL_PLUS    38.36                  30.38
CogCompXEL   78.21                  61.40
Wikipedia Dataset
xlwikifier   79.40                  63.51
xelms        79.40                  68.73
ELISA        47.82                  41.22
CogCompXEL   83.54                  66.16
Table 5: Average gold candidate recall and average linking accuracy of each method over all languages. CogCompXEL achieves the highest average candidate recall on both datasets and the highest average linking accuracy on the LORELEI dataset, which is closer to the real data people use every day.

5 Experiments: System Comparison

With the important results on candidate generation and gold candidate recall already shown in Section 3, we now compare the overall performance of different low-resource XEL systems. We first introduce the existing systems, then add a ranking component to CogCompXEL for a complete comparison. This section addresses the following experimental question: how do different approaches perform on different datasets? We distinguish between performance on data taken from Wikipedia itself (which is of little interest in applications) and performance on text outside Wikipedia, which is the more realistic and challenging case. We also perform an entity-type-specific analysis showing that, for some types, current methods cannot link successfully, while the problem can be partly resolved by exploiting query logs.

                             Akan (%)                Thai (%)                Tigrinya (%)
                             Accuracy  Cand. Recall  Accuracy  Cand. Recall  Accuracy  Cand. Recall
Google top1 w/o Google Map   53.4      57.9          73.5      74.6          31.7      31.9
Google top1                  54.7      61            74        77.1          44.9      46.4
Google top5                  54        79            73.8      79.5          37.2      48.7
Google top1 + p(e|m)         55.1      61            73.8      78.8          45.3      46.4
Google top5 + p(e|m)         53.8      79            73.5      80.9          36.3      48.7

                             Oromo (%)               Somali (%)              Oria (%)
                             Accuracy  Cand. Recall  Accuracy  Cand. Recall  Accuracy  Cand. Recall
Google top1 w/o Google Map   40        43.8          67.8      71.4          47.6      55.3
Google top1                  43.9      50.1          71.8      76.3          59.2      70
Google top5                  41.8      55.5          70.7      80.6          56.5      76.6
Google top1 + p(e|m)         45.4      57.2          72.1      77.4          66        78.6
Google top5 + p(e|m)         42.7      60.6          71.2      81.5          64.6      83.5
Table 6: Ablation study on 6 low-resource languages that examines each candidate generation component for end-to-end linking accuracy (left) and gold candidate recall (right). The number of candidates is below 5 for most languages and varies between 2 and 9. Our method includes the Google Map module by default.

5.1 Compared Methods

We compare against the following supervised state-of-the-art (SoTA) approaches.
xlwikifier Tsai et al. (2016) trains a separate XEL model for each language using mention contexts extracted from the source language Wikipedia only. We use their trained system on the Wikipedia-based data. Due to lack of training data, we do not add rankers on the LORELEI dataset. For comparison purposes, since xelms is superior to xlwikifier as shown in Upadhyay et al. (2018), we mainly compare with xelms on the LORELEI dataset.
xelms Upadhyay et al. (2018) develops the first XEL approach that combines supervision from multiple languages jointly. For most languages, we use their model under the annotation setting, but due to lack of annotation data, we use their zero-shot setting for some languages (Akan and Kinyarwanda).
ELISA Pan et al. (2017) develops a large XEL system, described in Zhang et al. (2018), that transforms entity mentions into English using automatically mined word translation pairs, and then links the translated English mentions to an external English KB using entity coherence statistics from English Wikipedia and the document context of a mention. We access the system using the ELISA API and call the GET/entity_linking/{identifier} function.
PBEL_PLUS Zhou et al. (2020) focuses on pivot-based entity linking, which leverages information from a high resource “pivot” language to train character-level neural entity linking models for the source low-resource language in a zero-shot manner. We test this approach only on low-resource languages, because it does not make sense to pivot a high-resource language to other languages.

5.2 CogCompXEL: Candidate Ranking

Given a mention m and its candidate list, CogCompXEL leverages the contextual information along with the candidate sources to compute a weighting score that measures relatedness between the mention and each candidate. CogCompXEL then picks the candidate with the highest score as output.

5.2.1 Cross Checking Candidate Source

CogCompXEL considers the candidate sources as part of ranking scores. Specifically, CogCompXEL ranks candidates generated by more components higher. For instance, “Bhadrak” is generated by morphological normalization + query logs, google map KB query, pivoting, and name transliteration modules, while “Bhadrak_district” is only generated from morphological normalization + query logs and name transliteration. Therefore, “Bhadrak” ranks higher than “Bhadrak_district”. We denote the candidate source cross checking score as and define it as:

where for each separate candidate generation component. In the previous example, , larger than .
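The defining equation is not legible at this point in the text, so the following is only a minimal sketch of the idea, assuming the cross checking score is a plain count of the candidate generation components that proposed a candidate. The function name `source_scores` and the component labels are hypothetical, chosen to mirror the "Bhadrak" example.

```python
from collections import Counter


def source_scores(candidates_by_component):
    """Count, for each candidate, how many components generated it.

    Assumption: the cross checking score is an unweighted count of
    agreeing components; the paper's exact definition is not reproduced.
    """
    counts = Counter()
    for candidates in candidates_by_component.values():
        # A set ensures each component contributes at most 1 per candidate.
        for candidate in set(candidates):
            counts[candidate] += 1
    return dict(counts)


# Hypothetical component outputs mirroring the "Bhadrak" example.
example = {
    "morph_norm_query_logs": ["Bhadrak", "Bhadrak_district"],
    "google_map_kb": ["Bhadrak"],
    "pivoting": ["Bhadrak"],
    "transliteration": ["Bhadrak", "Bhadrak_district"],
}
scores = source_scores(example)
```

Under this assumption, "Bhadrak" is supported by four components and "Bhadrak_district" by two, which matches the ordering described in the example.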

5.2.2 Contextual Disambiguation

Multilingual BERT (M-BERT) Devlin et al. (2018) is a bi-directional transformer language model that maps multilingual representations into the same embedding space, thus providing similar embeddings for similar sentences in the cross-lingual setting. CogCompXEL incorporates M-BERT to compare the mention context with the candidate context in English Wikipedia. For a mention in a document, we take the sentence in which the mention appears and compute its contextualized embedding. For each candidate entity, we retrieve a list of sentences that contain the entity from its corresponding Wikipedia page. The contextualized embedding for the entity is denoted by :

The context similarity score is defined as the cosine similarity between the mention embedding and the entity embedding. We select the most likely entity by computing:
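Since the selection formula is not legible here, the following is a hedged sketch rather than the authors' exact method. It assumes that embeddings are already available as plain vectors (standing in for M-BERT outputs), that an entity embedding is the mean of its sentence embeddings, and that the final score adds a candidate source score to the cosine similarity. The names `cosine`, `mean_vector` and `pick_entity` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors (assumed pooling)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pick_entity(mention_vec, entity_sentence_vecs, source_score):
    """Return the candidate with the highest combined score.

    Assumption: combined score = source score + context similarity.
    """
    best, best_score = None, float("-inf")
    for entity, sentence_vecs in entity_sentence_vecs.items():
        context_sim = cosine(mention_vec, mean_vector(sentence_vecs))
        total = source_score.get(entity, 0) + context_sim
        if total > best_score:
            best, best_score = entity, total
    return best


# Toy two-dimensional vectors; real embeddings would come from M-BERT.
mention = [1.0, 0.0]
candidates = {
    "A": [[1.0, 0.0], [0.8, 0.2]],
    "B": [[0.0, 1.0]],
}
chosen = pick_entity(mention, candidates, source_score={})
```

With no source evidence, candidate "A" is chosen because its context is closer to the mention; a sufficiently large source score for "B" would reverse the choice under the additive assumption.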

5.3 Linking Results

A comprehensive evaluation on 25 languages in the LORELEI dataset is shown in Figures 5 and 4. Specific scores are reported in Tables 12 and 11 in Appendix A, for end-to-end linking accuracy and gold candidate recall respectively. Our method proves effective on almost all languages, outperforming the supervised methods on the LORELEI dataset in both linking accuracy and candidate recall. Since the LORELEI dataset is built from news and social media such as tweets, it is closer to real-world everyday language, which strongly supports CogCompXEL’s effectiveness. On the Wikipedia-based dataset, which is built using the bilingually mapped Wikipedia pages between the source and target languages, CogCompXEL also shows a large increase in gold candidate recall in all languages while reaching state-of-the-art performance, which is notable given that ELISA, xlwikifier and xelms are all supervised approaches.

Interestingly, the performance figures show that CogCompXEL improves much more on the LORELEI dataset than on the Wikipedia-based data. Besides the aforementioned reasons (the LORELEI dataset is closer to real-world data, whereas the Wikipedia-based dataset is built using bilingual Wikipedia page mappings, and the other systems are supervised), one further factor contributes to this difference: the LORELEI dataset mostly covers low-resource languages, while the Wikipedia-based dataset covers high-resource languages.

Note that we picked two representative languages, Oria and Ilocano, for which we have additional LORELEI-provided monolingual text, and trained the M-BERT model Devlin et al. (2018) using their Wikipedia data along with the LORELEI text. We did not use pre-trained M-BERT on all languages because many low-resource languages are not supported, and for the supported ones the performance increase is much smaller than that of models trained with LORELEI text plus Wikipedia data. This experiment shows the gain one could get from additional supervision and, at the same time, highlights the results we obtain when M-BERT is not available, which is the more realistic setting.

5.4 Ablation Study on Candidate generation

We now quantify the effects of each component in our candidate generation method and show the results in Table 6.

Google Map. By default, our model uses the Google Map cross-lingual resource. We test the effect of adding supervision from this KB.

Google top1. Wherever Google query log results are used, we take only the first Wikipedia page result that is in the source or target language as a candidate.

Google top5. Similar to Google top1, but we take the top 5 Wikipedia page results as candidates. We can see that the effects of Google top1 and top5 are language dependent.
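The top1 and top5 settings can be illustrated with a small sketch. It assumes that the query log results arrive as a ranked list of URLs and that the language of a Wikipedia page can be read from its subdomain; the function name `wikipedia_hits` and the example URLs are hypothetical and do not describe the authors' actual pipeline.

```python
from urllib.parse import unquote, urlparse


def wikipedia_hits(result_urls, allowed_langs, k):
    """Keep the first k Wikipedia pages whose language is allowed.

    Returns (language code, page title) pairs in the original rank order.
    """
    hits = []
    if k <= 0:
        return hits
    for url in result_urls:
        parsed = urlparse(url)
        host = parsed.netloc.lower()
        if not host.endswith(".wikipedia.org"):
            continue  # not a Wikipedia page
        lang = host.split(".")[0]
        if lang not in allowed_langs or not parsed.path.startswith("/wiki/"):
            continue  # wrong language, or not an article path
        hits.append((lang, unquote(parsed.path[len("/wiki/"):])))
        if len(hits) == k:
            break
    return hits


# Hypothetical ranked results for an Oria mention; "or" and "en" are the
# source and target language codes in this toy setting.
results = [
    "https://example.com/page",
    "https://en.wikipedia.org/wiki/Bhadrak",
    "https://hi.wikipedia.org/wiki/Some_Page",
    "https://en.wikipedia.org/wiki/Bhadrak_district",
]
top1 = wikipedia_hits(results, {"or", "en"}, k=1)
top5 = wikipedia_hits(results, {"or", "en"}, k=5)
```

In this toy input, top1 keeps only the "Bhadrak" page, while top5 additionally admits "Bhadrak_district", which is how a larger k can raise recall at the cost of more ambiguity for the ranker.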

p(e|m). We test whether adding the p(e|m) module improves linking performance. To show the results more clearly, p(e|m) is added under the setting that uses query logs and the Google Map KB, without other modules.

Pivoting. Pivoting here refers to our query-log-based pivoting, which differs from the pivoting in Section 3. We picked two low-resource languages, Oria and Tigrinya, to explore the effect of pivoting, and show the results in Table 7. On Oria, a language-specific pivoting technique is used. Since we know a priori that Oria and Hindi are similar while the latter has far more resources, a simple utf8-converter is used to transform Oria into Hindi, and the Hindi mention is then run through our whole system. On Tigrinya, a language-independent pivoting technique is used. After getting query log results, besides using Google top1 or top5, we further pick Wikipedia page results that are in any other language, such as Amharic or Scots, which have similar scripts but richer cross-lingual supervision than Tigrinya.
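The text does not describe how the utf8-converter works. One plausible reading, given here only as an assumption, is a code point shift between the Unicode Oriya block (U+0B00 to U+0B7F) and the Devanagari block (U+0900 to U+097F), whose layouts are largely parallel. The function name `odia_to_devanagari` is hypothetical, and a practical converter would likely need an exception table for code points without a counterpart.

```python
ORIYA_START, ORIYA_END = 0x0B00, 0x0B7F  # Unicode Oriya block
DEVANAGARI_START = 0x0900                # Unicode Devanagari block


def odia_to_devanagari(text):
    """Shift Oriya-block characters to the same offset in Devanagari.

    Assumption: a plain block offset is an adequate approximation;
    characters outside the Oriya block are passed through unchanged.
    """
    converted = []
    for ch in text:
        code_point = ord(ch)
        if ORIYA_START <= code_point <= ORIYA_END:
            converted.append(chr(code_point - ORIYA_START + DEVANAGARI_START))
        else:
            converted.append(ch)
    return "".join(converted)


# "Bhadrak" written in Oria script, given as escapes for portability.
oria_mention = "\u0b2d\u0b26\u0b4d\u0b30\u0b15"
hindi_mention = odia_to_devanagari(oria_mention)
```

The converted string can then be passed through the same candidate generation steps as any Hindi mention, which is the pivoting idea described for Oria.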

We further examined the effect of transliteration using trained models Upadhyay et al. (2018), specifically on Sinhala and Oria, with bilingually mapped Wikipedia titles as supervision. We also used the Google transliteration resource for Oria mentions. However, no increase in linking accuracy is observed, and the absolute increase in gold candidate recall is less than 0.5%. Since we only studied Sinhala and Oria, the transliteration resource may still be useful for other languages.

Tables 7 and 6 show that we add substantial value beyond the use of Google search: simply using Google search without the other parts of our candidate generation method does not yield good linking results. Indeed, incorporating online knowledge bases effectively into XEL is highly non-trivial. In this context, it is important to note that all existing methods make heavy use of Wikipedia, and therefore using Google search as a cross-lingual resource is equally fair. Moreover, as our results show, the use of Wikipedia allows existing systems to perform well only on Wikipedia data, which is uninteresting for practical purposes. As shown in Figures 5 and 4 and Tables 12 and 11, our method works well outside Wikipedia.

Method                  Oria (%)            Tigrinya (%)
                        Accuracy  Recall    Accuracy  Recall
CogCompXEL w/o pivot.   66        78.6      45.3      46.4
CogCompXEL              66.7      79.3      45.7      46.4
Table 7: Ablation study on two low-resource languages examining the effect of pivoting techniques. To better show the difference, the best setting, Google top1 + p(e|m), is used for each language.

6 Analysis

This section analyzes entity distribution and entity type to explain why CogCompXEL substantially outperforms previous methods on low-resource languages.

6.1 Entity distribution

Considering entity distribution in low-resource languages as shown in Table 3, CogCompXEL proves its superiority on entities outside of the Wikipedia bilingual mapping pages. As is reported in Table 5, we see that CogCompXEL obtains an absolute improvement in gold candidate recall ranging between 5.4% and 22.6% over the supervised state-of-the-art approaches, with average increase being 17.7%. This indicates that Google query logs, Google Map KB, and pivoting techniques provide the ability to link entities without Wikipedia pages, contributing to the significant improvement of CogCompXEL in low-resource languages.

Lang.    Model        Accuracy (%)
                      GPE    LOC    PER    ORG
Oromo    xelms        33.0   3.6    6.2    6.3
         ELISA        37.0   11.6   12.5   11.9
         PBEL_PLUS    53.0   40.5   6.4    0.0
         CogCompXEL   57.5   30.4   17.2   4.2
Zulu     xelms        25.6   0.0    37.5   5.7
         ELISA        11.9   2.5    31.2   5.7
         PBEL_PLUS    24.6   6.2    38.1   8.7
         CogCompXEL   27.8   10.8   50.0   12.6
Somali   xelms        55.0   28.6   16.7   55.0
         ELISA        44.7   14.3   33.3   75.0
         PBEL_PLUS    50.1   0.0    16.7   15.0
         CogCompXEL   71.7   57.1   33.3   65.0
Table 8: Linking accuracy on geopolitical entities (GPE), location (LOC), person (PER) and organization (ORG) for three low-resource languages.

6.2 Entity type

We observe that CogCompXEL also achieves a significant performance improvement on geopolitical (GPE) and location (LOC) entities, while its performance on person (PER) and organization (ORG) entities is comparable to that of other approaches. Table 8 shows the linking accuracy results on GPE, LOC, PER and ORG entities. We believe this improvement is brought by the use of cross-lingual information from the Google Map KB.

Figure 8: Number of native speakers (in millions) for each language; data retrieved from Wikipedia.

6.3 Multiple Supervision Source from Native Speakers

Compared with the limitations listed in Section 3, our zero-shot model makes the most of online knowledge. As shown in Figure 8, millions of native speakers provide sufficient supervision for CogCompXEL to generate gold candidates.

7 Conclusion

This paper proposes a zero-shot XEL approach for low-resource languages that combines multiple cross-lingual supervision KBs using morphological normalization, pivoting to richer languages, name transliteration, and query logs. Comprehensive experimental results show that our proposed methodologies are effective under all limited-resource scenarios, giving an average absolute improvement of 13% in end-to-end linking accuracy, and 25% in gold candidate recall over the supervised baseline systems on the real-world LORELEI dataset.


References

  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • X. Duan, R. E. Banchs, M. Zhang, H. Li, and A. Kumaran (Eds.) (2015) Proceedings of the fifth named entity workshop. Association for Computational Linguistics, Beijing, China.
  • H. Ji and J. Nothman (2016) Overview of TAC-KBP 2016 tri-lingual EDL and its impact on end-to-end KBP. In TAC.
  • D. R. Mortensen, S. Dalmia, and P. Littell (2018) Epitran: precision G2P for many languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
  • X. Pan, B. Zhang, J. May, J. Nothman, K. Knight, and H. Ji (2017) Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1946–1958.
  • S. Rijhwani, J. Xie, G. Neubig, and J. Carbonell (2019) Zero-shot neural transfer for cross-lingual entity linking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6924–6931.
  • S. Strassel and J. Tracey (2016) LORELEI language packs: data, tools, and resources for technology development in low resource languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation, pp. 3273–3280.
  • C. Tsai, S. Mayhew, and D. Roth (2016) Cross-lingual named entity recognition via wikification. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 219–228.
  • S. Upadhyay, N. Gupta, and D. Roth (2018) Joint multilingual supervision for cross-lingual entity linking. arXiv preprint arXiv:1809.07657.
  • S. Upadhyay, J. Kodner, and D. Roth (2018) Bootstrapping transliteration with constrained discovery for low-resource languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 501–511.
  • B. Zhang, Y. Lin, X. Pan, D. Lu, J. May, K. Knight, and H. Ji (2018) ELISA-EDL: a cross-lingual entity extraction, linking and localization system. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pp. 41–45.
  • S. Zhou, S. Rijhwani, J. Wieting, J. Carbonell, and G. Neubig (2020) Improving candidate generation for low-resource cross-lingual entity linking. arXiv preprint arXiv:2003.01343.
  • S. Zhou, S. Rijhwani, and G. Neubig (2019) Towards zero-resource cross-lingual entity linking. arXiv preprint arXiv:1909.13180.

Appendix A Appendices

In Dataset Statistics, we include the dataset statistics calculated from the 2019-10-20 Wikidump, showing the Wikipedia size (number of articles) along with the number of test mentions for every language in both the LORELEI dataset and the Wikipedia-based dataset. The evaluation on 25 languages in the LORELEI dataset and 9 languages in the Wikipedia-based dataset is shown in Comprehensive Evaluation.

A.1 Dataset Statistics

Language Wiki Size # Test Mentions
Tigrinya 189 3174
Oromo 790 2576
Akan 726 462
Wolof 1,231 302
Zulu 1,328 1071
Kinyarwanda 1,670 521
Somali 4,025 884
Amharic 8,176 1157
Sinhala 11,314 673
Oria 12,307 2079
Ilocano 12,377 1274
Swahili 34,354 1251
Bengali 64,183 1266
Tagalog 64,847 1050
Hindi 74,906 814
Tamil 76,800 1157
Thai 98,088 1122
Indonesian 286,723 1376
Hungarian 331,829 1059
Vietnamese 550,111 990
Chinese 612,335 1157
Persian 603,740 877
Arabic 633,168 1188
Russian 847,036 1205
Spanish 1,005,407 711
Table 9: Dataset description for the LORELEI dataset and the corresponding Wikipedia sizes.
Language Wiki Size # Test Mentions
Tagalog 64,847 1,075
Tamil 76,800 1,075
Thai 98,088 11,380
Urdu 128,227 1,390
Hebrew 193,391 16,137
Turkish 244,882 13,795
Chinese 612,335 11,252
Arabic 633,168 10,647
French 1,398,118 2,637
Table 10: Dataset description for the Wikipedia-based dataset and the corresponding Wikipedia sizes.

A.2 Comprehensive Evaluation

Language Method Hit@1 Hit@5 Hit@n
Tamil xlwikifier 49.8 57.4 57.4
xelms 53.8 57.4 57.4
ELISA 19.6 24.1 24.4
CogCompXEL 58.2 73.6 76.6
Zulu xlwikifier 19.6 19.8 19.8
xelms 19.8 19.8 19.8
ELISA 12.4 17.9 20.4
PBEL_PLUS 27.7 33.7 40.8
CogCompXEL 23.8 41.2 47.1
Akan xlwikifier 23.9 23.9 23.9
xelms 23.9 23.9 23.9
ELISA 38 60.5 60.7
PBEL_PLUS 26.5 28.0 31.7
CogCompXEL 53.8 78.1 79
Amharic xlwikifier 23.3 28.2 28.2
xelms 24.6 28.2 28.2
ELISA 16.4 16.7 16.8
PBEL_PLUS 11.7 11.9 16.0
CogCompXEL 30.7 43.8 44.7
Hindi xlwikifier 53.5 63.9 63.9
xelms 57.4 63.9 63.9
ELISA 40.3 43.4 45.8
CogCompXEL 63.6 74.4 79
Indonesian xlwikifier 59.2 65.3 65.3
xelms 62.2 65.3 65.3
ELISA 56 64 67.7
CogCompXEL 60 73.2 74.6
Spanish xlwikifier 63.9 78.1 78.1
xelms 68.4 78.1 78.1
ELISA 57.8 68.3 69.8
CogCompXEL 56 81.5 87.9
Arabic xlwikifier 73.3 80.4 80.4
xelms 75.1 80.4 80.4
ELISA 35.5 37.3 37.9
CogCompXEL 75.6 84 90.2
Swahili xlwikifier 61.3 69.6 69.9
xelms 63.4 69.6 69.6
ELISA 62 71.4 72.2
PBEL_PLUS 36.2 37.3 39.3
CogCompXEL 66.3 76.2 76.2
Wolof xlwikifier 16.6 16.9 16.9
xelms 16.6 16.9 16.9
ELISA 42.2 52.2 55.5
PBEL_PLUS 35.5 42.2 48.5
CogCompXEL 51.8 66.1 66.1
Vietnamese xlwikifier 82.4 86.9 86.9
xelms 84.1 86.9 86.9
ELISA 72.1 76.7 76.9
CogCompXEL 81.3 91.3 95
Thai xlwikifier 40 50.1 50.1
xelms 48.3 50.1 50.1
ELISA 6.2 9.1 9.1
CogCompXEL 73.8 79.4 79.5
Bengali xlwikifier 36.5 46.4 46.4
xelms 40.7 46.4 46.4
ELISA 7.3 9.4 9.9
CogCompXEL 47.4 61.6 65
Tagalog xlwikifier 61.4 65.3 65.3
xelms 63.2 65.3 65.3
ELISA 75.3 82.3 83.6
CogCompXEL 74.1 88.5 90.4
Hungarian xlwikifier 52.5 66.4 66.4
xelms 55.8 66.4 66.4
ELISA 26.3 31.6 32.2
CogCompXEL 47.7 78.1 87.2
Table 11: Quantitative evaluation results on 25 languages of the LORELEI dataset. Hit@1 is linking accuracy; Hit@n is gold candidate recall, with n ranging from 2 to 9; Hit@5 is gold candidate recall when only the top 5 candidates by ranking score are kept.
Language Method Hit@1 Hit@5 Hit@n
Chinese xlwikifier 61.4 83.2 83.2
xelms 66.4 83.2 83.2
ELISA 77.3 83.6 84.5
CogCompXEL 73.8 89.8 92.4
Persian xlwikifier 66.1 76.1 76.1
xelms 67 76.1 76.1
ELISA 46.1 53 53.4
CogCompXEL 74.7 84.6 89.5
Russian xlwikifier 53.9 57.4 57.4
xelms 54.1 57.4 57.4
ELISA 19.1 20.8 22.2
CogCompXEL 78.6 87.7 91.2
Oromo xlwikifier 29.7 33.2 33.2
xelms 31.4 33.2 33.2
ELISA 25.6 26.1 26.2
PBEL_PLUS 5.9 20.6 24.2
CogCompXEL 45.4 57.2 57.2
Tigrinya xlwikifier 0 0 0
xelms 0 0 0
ELISA 30 30.4 37
PBEL_PLUS 53.4 56.7 61.6
CogCompXEL 45.7 46.4 46.4
Sinhala xlwikifier 51.9 54.1 54.1
xelms 52.9 54.1 54.1
ELISA 72 77.7 78.2
PBEL_PLUS 19.2 26.7 47.3
CogCompXEL 64.1 72.8 76.8
Kinyarwanda xlwikifier 35.1 35.1 35.1
xelms 35.1 35.1 35.1
ELISA 75.9 79.2 79.2
PBEL_PLUS 48.5 51.4 62.0
CogCompXEL 73.6 83.4 83.4
Ilocano xlwikifier 52.0 53.2 53.2
xelms 53.2 53.2 53.2
ELISA 74.2 77.4 79.5
PBEL_PLUS 12.3 13.3 16.1
CogCompXEL 74.9 84.9 91.1
Oria xlwikifier 42.6 47.6 47.6
xelms 44.29 47.6 47.6
ELISA 65.1 71.8 72.3
PBEL_PLUS 39.1 42.0 45.5
CogCompXEL 66.7 79.2 79.7
Somali xlwikifier 54.5 55.1 55.1
xelms 54.5 55.1 55.1
ELISA 45 53.1 54.3
PBEL_PLUS 48.6 54.5 57.0
CogCompXEL 71.2 80.7 81.5
Table 12: Quantitative evaluation results on 25 languages of the LORELEI dataset (continued). Hit@1 is linking accuracy; Hit@n is gold candidate recall, with n ranging from 2 to 9; Hit@5 is gold candidate recall when only the top 5 candidates by ranking score are kept.
Language Method Hit@1 Hit@5 Hit@n
Arabic_wiki xlwikifier 65.2 83.4 83.4
xelms 69.2 83.4 83.4
ELISA 34.1 34.5 34.9
CogCompXEL 66.1 85.2 85.5
French_wiki xlwikifier 62.7 81.1 81.9
xelms 71.8 81.6 81.9
ELISA 50.3 59 61.2
CogCompXEL 63.2 82.5 83.5
Hebrew_wiki xlwikifier 63.5 84.4 84.9
xelms 68.4 84.9 84.39
ELISA 37.39 42.4 43.2
CogCompXEL 64.6 86.3 86.8
Tamil_wiki xlwikifier 71.5 85.6 85.8
xelms 74.1 85.8 85.8
ELISA 16.9 20 20.5
CogCompXEL 72.8 87.4 87.5
Thai_wiki xlwikifier 73.36 75.1 75.3
xelms 74.5 75.3 75.3
ELISA 38.9 42.6 43.5
CogCompXEL 68.1 82.4 82.4
Tagalog_wiki xlwikifier 68 80.3 80.4
xelms 70.5 80.4 80.4
ELISA 50.5 57.5 58.8
CogCompXEL 72.3 88.6 88.7
Turkish_wiki xlwikifier 54.5 72.1 72.5
xelms 56.8 72.4 72.5
ELISA 44.5 52.2 54.7
CogCompXEL 57.1 76.1 76.5
Urdu_wiki xlwikifier 59.5 73.2 73.5
xelms 62.4 73.5 73.5
ELISA 43.6 50.1 51
CogCompXEL 63.8 80.4 80.7
Chinese_wiki xlwikifier 64.9 76.8 76.9
xelms 71.2 76.9 76.9
ELISA 54.3 60.4 62.6
CogCompXEL 67.4 80.3 80.3
Table 13: Quantitative evaluation results on 9 languages of the Wikipedia-based dataset. Hit@1 is linking accuracy; Hit@n is gold candidate recall, with n ranging from 2 to 9; Hit@5 is gold candidate recall when only the top 5 candidates by ranking score are kept.
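The Hit@k columns follow directly from the caption definitions and can be sketched as below. The sketch assumes each mention has a ranked candidate list and a single gold title; the function name `hit_at_k` and the toy data are hypothetical and are not drawn from the reported experiments.

```python
def hit_at_k(ranked_candidates, gold, k=None):
    """Percentage of mentions whose gold entity is in the top k candidates.

    k=1 gives linking accuracy; k=None uses the full candidate list,
    which corresponds to gold candidate recall (Hit@n).
    """
    if not gold:
        return 0.0
    hits = 0
    for candidates, gold_title in zip(ranked_candidates, gold):
        kept = candidates if k is None else candidates[:k]
        if gold_title in kept:
            hits += 1
    return 100.0 * hits / len(gold)


# Toy data: four mentions with ranked candidates and gold titles.
ranked = [
    ["A", "B"],
    ["C", "D", "E"],
    ["F"],
    ["G", "H", "I", "J", "K", "L"],
]
gold_titles = ["A", "D", "X", "L"]

accuracy = hit_at_k(ranked, gold_titles, k=1)
recall_at_5 = hit_at_k(ranked, gold_titles, k=5)
recall_at_n = hit_at_k(ranked, gold_titles)
```

By construction Hit@1 cannot exceed Hit@5, which cannot exceed Hit@n, so the gap between the first and last columns indicates how much of the remaining error is attributable to ranking rather than candidate generation.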