Global Entity Ranking Across Multiple Languages

by Prantik Bhattacharyya, et al.

We present work on building a global long-tailed ranking of entities across multiple languages using the Wikipedia and Freebase knowledge bases. We identify multiple features and build a model to rank entities using a ground-truth dataset of more than 10 thousand labels. The final system ranks 27 million entities with 75% precision and a 48% F1 score. We provide empirical evidence of the quality of the ranking across languages, and open the final ranked lists for future research.




1 Introduction

In the past decade, a number of openly available Knowledge Bases (KBs) have emerged. The most popular ones include Freebase, Wikipedia, and Yago, containing around 48M, 25M, and 10M entities respectively. Many of the entities overlap across the KBs. In NLP, entity linking (also known as named entity linking (NEL), named entity disambiguation (NED), or named entity recognition and disambiguation (NERD)) is the task of linking entities mentioned within text to their identity within the KB. A foundational part of setting up a real-time entity linking system is choosing which entities to consider, as memory constraints prohibit considering the entire knowledge base [1]. Additionally, some entities may not be relevant. To maximize the quality of the NLP entity linking system, we need to include as many important entities as possible.

In this paper we identify a collection of features to perform scoring and ranking of the entities. We also introduce the ground truth data set that we use to train and apply the ranking function.

2 Related Work

A large body of previous work has addressed ranking entities in terms of temporal popularity, as well as in the context of a query; however, little work has been done on building a global rank of entities within a KB. Temporal entity importance on Twitter was studied by Saleiro and Soares [4]. In [2], the authors propose a hybrid model of entity ranking and selection in the context of displaying the most important entities for a given constraint while eliminating redundant entities. Ranking of Wikipedia entities in the context of a query has been done using link structure and categories [5], as well as graph methods and web search [6].

3 Our Approach

Given a KB, we want to build a global long-tailed ranking of entities in order of socially recognizable importance. When building the NLP entity linking system, the top ranked entities from the KB should yield the maximum quality as perceived by casual observers.

3.1 Data Set

We collected a labeled ground-truth data set (more than 10 thousand labels in total) by selecting entities through random sampling. We also added some important entities to balance the skewed ratio of important to non-important entries that KBs have. Seven evaluators scored each entity on a scale of 1 to 5, with 5 being most important, using the following guidelines regarding importance:


Public Persons

important if currently major pro athletes, serving politicians, etc. If no longer active, important if influential (e.g. Muhammad Ali, Tony Blair).


look at population (e.g. Albany, California vs. Toronto, Canada), historical significance (Waterloo).


unimportant unless shorthand for a holiday or event (4th of July, 9/11).


important, especially high-circulation ones (WSJ).

Sports Teams

important if in pro league.


important if recognised globally.

Films & Song

major franchises and influential classics are important – more obscure are often not.


important if they enacted social change (Loving v. Virginia, Roe v. Wade), unimportant otherwise.


entities that disambiguate are important because we want them in the dictionary (Apple, Inc. and Apple Fruit).

3.2 Features and Scoring

Features were derived from Freebase and Wikipedia sources. They capture an entity's popularity within Wikipedia links and its importance within Freebase. The signals used include page rank, inlink and outlink counts and their ratio, and the number of categories a page belongs to in Wikipedia. We also use the number of objects a given entity is connected to (the object and object type count features), as well as the number of times a given entity was an object with respect to another entity (the subject and subject type count features). We also extract social media identities mentioned in an entity's KB entry and use their Klout score [3] as a feature. The full set of derived features, together with their performance, is listed in Table 1.
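The link-based signals named above can be illustrated with a minimal sketch. It assumes the link graph is available as (source page, target page) pairs, which is an illustrative input format rather than the pipeline used in this work; the function name `link_features` and the handling of pages with no outgoing links are likewise assumptions.

```python
from collections import defaultdict


def link_features(edges):
    """Compute inlink count, outlink count, and in/out ratio per page.

    `edges` is an iterable of (source_page, target_page) pairs. This input
    format is an assumption made for illustration only.
    """
    inlinks = defaultdict(int)
    outlinks = defaultdict(int)
    for src, dst in edges:
        outlinks[src] += 1
        inlinks[dst] += 1

    features = {}
    for page in set(inlinks) | set(outlinks):
        n_in, n_out = inlinks[page], outlinks[page]
        # Assumption: with no outgoing links, fall back to the raw inlink
        # count instead of dividing by zero.
        ratio = n_in / n_out if n_out else float(n_in)
        features[page] = {
            "inlink_count": n_in,
            "outlink_count": n_out,
            "in_out_ratio": ratio,
        }
    return features


# Toy link graph with four pages.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
feats = link_features(toy_edges)
```

In this toy graph, page C receives links from A, B, and D and links out once, so it gets the highest in/out ratio.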

We model the evaluators' scores using simple linear regression. The feature vector $F(e)$ for an entity $e$ is represented as $F(e) = [f_1(e), f_2(e), \ldots, f_m(e)]$, where $f_k(e)$ is the feature value associated with a specific feature $f_k$. Features are normalized before scoring; normalized feature values are denoted by $\hat{f}_k(e)$ and together form the normalized feature vector $\hat{F}(e)$. The importance score for an entity $e$ is denoted by $S(e)$ and is computed as the dot product of a weight vector $W$ and the normalized feature vector:

$S(e) = W \cdot \hat{F}(e)$ (1)

The weight vector $W$ is computed with supervised learning techniques, using labeled ground truth data (train/test split of 80/20).
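A minimal sketch of this scoring scheme follows: normalize the features, fit a weight vector on 80% of the labeled data, and score the remaining 20% with a dot product. Several parts are assumptions rather than details from this work: max-scaling stands in for the normalization, ordinary least squares stands in for the unspecified supervised learning technique, and the data is synthetic.

```python
import numpy as np


def normalize(features):
    """Scale each feature column to [0, 1] by dividing by its maximum.

    Assumption: max-scaling is a stand-in, not the paper's normalization.
    """
    col_max = features.max(axis=0)
    col_max[col_max == 0] = 1.0  # avoid dividing by zero for all-zero columns
    return features / col_max


def fit_weights(norm_features, labels):
    """Least-squares fit of W so that norm_features @ W approximates labels."""
    weights, *_ = np.linalg.lstsq(norm_features, labels, rcond=None)
    return weights


def importance_score(weights, norm_features):
    """Importance score: dot product of W and the normalized feature vector."""
    return norm_features @ weights


# Synthetic, noise-free toy data: 200 entities with 4 features each.
rng = np.random.default_rng(0)
raw = rng.uniform(0, 100, size=(200, 4))
true_w = np.array([2.0, 1.0, 0.5, 1.5])  # hypothetical "true" weights
norm = normalize(raw)
labels = norm @ true_w

split = int(0.8 * len(raw))  # 80/20 train/test split
w = fit_weights(norm[:split], labels[:split])
test_scores = importance_score(w, norm[split:])
rmse = float(np.sqrt(np.mean((test_scores - labels[split:]) ** 2)))
```

Because the toy labels are generated without noise, the fitted weights recover `true_w` and the held-out error is close to zero; real evaluator labels would not behave this way.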

4 Experiments

Feature            Precision  Recall  F1    Coverage  RMSE

Wikipedia features
Page Rank          0.59       0.05    0.09  0.164     1.54
Outlink Count      0.55       0.13    0.21  0.164     2.09
Inlink Count       0.62       0.12    0.20  0.164     1.82
In Out Ratio       0.75       0.19    0.31  0.164     1.54
Category Count     0.65       0.21    0.36  0.164     1.89

Freebase features
Subject #          0.28       0.06    0.10  1.000     2.39
Subject Types #    0.42       0.10    0.16  1.000     2.25
Object #           0.62       0.12    0.20  0.973     2.00
Object Types #     0.46       0.11    0.17  0.973     2.25

Klout Score        0.57       0.11    0.17  0.004     2.32

All Features       0.75       0.37    0.48  1.00      1.15
Table 1: Feature Performance For English Rankings
Figure 1: Entity Count by Type

Table 1 shows precision, recall, F1, and population coverage for each individual feature and for the final system. The importance score was calculated using Eq. 1, and the final score was rounded to an integer value so that it can be compared against the labels from the ground-truth data.
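The comparison of rounded scores against labels can be sketched as follows. The text does not say how precision and recall were defined on the 1 to 5 scale, so this sketch assumes a hypothetical threshold `important_at` that marks the positive class. Treating coverage as the fraction of entities with a non-zero feature value, and computing RMSE on the rounded scores, are also assumptions.

```python
import numpy as np


def evaluate(scores, labels, important_at=4):
    """Compare rounded scores with integer labels on the 1-5 scale.

    `important_at` is a hypothetical threshold: entities with a value
    >= important_at are treated as the positive ("important") class.
    """
    # Round to the nearest integer and keep the result on the label scale.
    predicted = np.clip(np.rint(scores), 1, 5).astype(int)
    pred_pos = predicted >= important_at
    true_pos = labels >= important_at
    tp = int(np.sum(pred_pos & true_pos))
    precision = tp / pred_pos.sum() if pred_pos.any() else 0.0
    recall = tp / true_pos.sum() if true_pos.any() else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    rmse = float(np.sqrt(np.mean((predicted - labels) ** 2)))
    return {
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
        "rmse": rmse,
    }


def coverage(feature_values):
    """Fraction of entities for which the feature is present (non-zero)."""
    return float(np.mean(np.asarray(feature_values) != 0))


# Toy scores and labels for six entities.
scores = np.array([4.6, 3.2, 1.4, 4.4, 2.4, 5.3])
labels = np.array([5, 3, 1, 3, 4, 5])
metrics = evaluate(scores, labels)
```

A different choice of positive class, for example exact agreement on each of the five score levels, would give different precision and recall values for the same scores.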

We observe that Wikipedia features have the highest precision among all the features. The Freebase features have the highest coverage values. The Klout score feature also has one of the highest individual precision values. While this feature has the lowest coverage, it helps boost the final score and floats up a few relevant entities for final system application in social media platforms. We also look at root mean squared error (RMSE) of the entity scores against assigned labels. The final model shows the lowest RMSE value.

We also plot the distribution of entity types in the top million ranked entities and in the unranked list for the English language (Figure 1). The share of entities of type 'person' differs between the global list and the top ranked list, and the percentage of 'MISC' entity types drops between the two. These differences in coverage indicate that entities are ranked relevantly in the corpus.

Entity     EN       AR      ES       FR       IT
Vogue      2        6,173   200      2,341    62
Bank       322      103     3,747    2,758    5,704
Morocco    1,277    2       527      544      232
Duck       10,001   9,494   7,444    10,380   4,575
Balkans    36,753   109     17,456   9,383    2,854
Bed        109,686  23,809  68,180   66,859   52,713
Bunk bed   992,576  64,399  330,669  906,988  416,292
Table 2: Entity Ranking Examples For Different Languages

In Table 2, we provide examples of entities with their ranks in a particular language. We see that entity ranks are regionally sensitive in the context of their language, e.g. 'Morocco' is ranked 2nd in the ranking for the Arabic language. We also observe that the rankings are sensitive to the specificity of the entity; for example, 'bunk bed' is ranked far lower than the more generic entity 'bed'.

5 Summary

We make the ranked list of top entities available as an open source data set. To conclude, in this work we built a global ranking of entities across multiple languages by combining features from multiple knowledge bases. We also found that a combination of multiple features yields the best results. Future work in this direction is to include new signals such as Wikipedia page view statistics and edit history.


  • [1] P. Bhargava, N. Spasojevic, and G. Hu. High-throughput and language-agnostic entity disambiguation and linking on user generated data. In Proceedings of the 26th International Conference Companion on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
  • [2] A. Gionis, T. Lappas, and E. Terzi. Estimating entity importance via counting set covers. In 18th Intl. Conf. on Knowledge Discovery and Data Mining, 2012.
  • [3] A. Rao, N. Spasojevic, Z. Li, and T. Dsouza. Klout score: Measuring influence across multiple social networks. In IEEE Intl. Conf. on Big Data, 2015.
  • [4] P. Saleiro and C. Soares. Learning from the news: Predicting entity popularity on Twitter. In International Symposium on Intelligent Data Analysis, 2016.
  • [5] A.-M. Vercoustre, J. A. Thom, and J. Pehcevski. Entity ranking in Wikipedia. In ACM Symposium on Applied Computing, 2008.
  • [6] H. Zaragoza, H. Rode, P. Mika, J. Atserias, M. Ciaramita, and G. Attardi. Ranking very many typed entities on Wikipedia. In 16th ACM Conference on Information and Knowledge Management, 2007.