In the past decade, a number of openly available Knowledge Bases (KBs) have emerged.
The most popular ones include Freebase, Wikipedia, and Yago, containing around 48M, 25M, and 10M entities respectively.
Many of the entities overlap across the KBs.
In NLP, entity linking, also known as named entity linking (NEL), named entity disambiguation (NED), or named entity recognition and disambiguation (NERD), is the task of linking entity mentions within text to their identities within the KB. A foundational part of setting up a real-time entity linking system is choosing which entities to consider, since memory constraints prohibit considering the entire knowledge base. Additionally, some entities may not be relevant. To maximize the quality of the entity linking system, we need to include as many important entities as possible.
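For illustration, here is a minimal sketch of this candidate-selection step under a memory budget. It is not part of the paper's system; the dictionary layout, per-entry footprint, and function names are assumptions.

```python
# Hypothetical sketch: build a fixed-size entity dictionary from a globally
# ranked entity list, so the linker only considers entities that fit in memory.
def build_entity_dictionary(ranked_entities, memory_budget_bytes, bytes_per_entry=256):
    """ranked_entities: iterable of (entity_id, score) pairs, best first.
    bytes_per_entry is a rough, assumed per-entity memory footprint."""
    max_entries = memory_budget_bytes // bytes_per_entry
    dictionary = {}
    for entity_id, score in ranked_entities:
        if len(dictionary) >= max_entries:
            break  # memory budget reached; drop the long tail
        dictionary[entity_id] = score
    return dictionary

# e.g. keep roughly one million entities within a 256 MB budget:
# top_entities = build_entity_dictionary(ranked, 256 * 1024**2)
```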
In this paper we identify a collection of features to perform scoring and ranking of the entities. We also introduce the ground truth data set that we use to train and apply the ranking function.
2 Related Work
A large body of previous work has addressed ranking entities in terms of temporal popularity, as well as in the context of a query; however, little study has been done on building a global rank of entities within a KB. Temporal entity importance on Twitter was studied by Saleiro and Soares. Gionis et al. propose a hybrid model of entity ranking and selection in the context of displaying the most important entities for a given constraint while eliminating redundant entities. Entity ranking of Wikipedia entities in the context of a query has been done using link structure and categories (Vercoustre et al.), as well as graph methods and web search (Zaragoza et al.).
3 Our Approach
Given a KB, we want to build a global, long-tailed ranking of entities in order of socially recognizable importance. When building an NLP entity linking system, the top-ranked entities from the KB should yield the maximum quality as perceived by casual observers.
3.1 Data Set
We collected a labeled data set of selected entities: we sampled entities at random and also added some known-important entities, to balance the skewed ratio of important to non-important entries that KBs exhibit. Seven evaluators scored each entity on a scale of 1 to 5, with 5 being most important, using the following guidelines (an illustrative sketch of this collection step appears after the list):
- Public Persons: important if currently major pro athletes, serving politicians, etc. If no longer active, important if influential (e.g. Muhammad Ali, Tony Blair).
- Locations: look at population (e.g. Albany, California vs. Toronto, Canada) and historical significance (Waterloo).
- Dates: unimportant unless shorthand for a holiday or event (4th of July, 9/11).
- Newspapers: important, especially high-circulation ones (WSJ).
- Sports Teams: important if in a pro league.
- Brands: important if recognised globally.
- Films & Songs: major franchises and influential classics are important; more obscure ones often are not.
- Laws: important if they enacted social change (Loving v. Virginia, Roe v. Wade), unimportant otherwise.
- Ambiguous Names: entities that disambiguate are important because we want them in the dictionary (Apple, Inc. and Apple the fruit).
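For concreteness, the following is a minimal sketch, not taken from the paper, of how such a labeled set could be assembled and how the seven evaluators' 1-to-5 scores could be aggregated. The mean-based aggregation and all function names are assumptions.

```python
import random
from statistics import mean

def build_labeled_set(all_entity_ids, important_ids, sample_size, seed=0):
    """Mix randomly sampled entities with a hand-picked important set,
    de-duplicating while preserving order."""
    rng = random.Random(seed)
    sampled = rng.sample(all_entity_ids, sample_size)
    return list(dict.fromkeys(sampled + important_ids))

def aggregate_labels(scores_by_entity):
    """scores_by_entity: dict entity_id -> list of seven 1-5 scores.
    Averaging is an assumption; the paper does not specify the aggregation."""
    return {e: mean(scores) for e, scores in scores_by_entity.items()}
```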
3.2 Features and Scoring
Features were derived from Freebase and Wikipedia sources. They capture an entity's popularity within Wikipedia's link graph and its importance within Freebase. Among the signals used are PageRank, in/out link counts and their ratio, and the number of Wikipedia categories a page belongs to. We also use the number of objects a given entity is connected to (object and object-type count features), as well as the number of times a given entity was an object with respect to another entity (subject and subject-type count features). Finally, we extract social media identities mentioned in an entity's KB entry and use their Klout score (Rao et al.) as a feature. The full set of derived features and their performance is listed in Table 1.
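As an illustration of the Wikipedia graph signals just described, here is a small sketch computing in-link counts, out-link counts, and their ratio from directed link pairs. The smoothing in the ratio and all names are our assumptions; the paper does not give exact definitions.

```python
from collections import defaultdict

def link_features(links):
    """links: iterable of directed (source_page, target_page) pairs."""
    out_count = defaultdict(int)
    in_count = defaultdict(int)
    for src, dst in links:
        out_count[src] += 1
        in_count[dst] += 1
    pages = set(out_count) | set(in_count)
    return {
        p: {
            "links_in": in_count[p],
            "links_out": out_count[p],
            # Add-one smoothing to avoid division by zero; the paper does
            # not specify the exact definition of the in/out ratio.
            "in_out_ratio": in_count[p] / (out_count[p] + 1),
        }
        for p in pages
    }
```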
The importance score of an entity $e$ is computed as a weighted combination of its feature values:

$$S(e) = \mathbf{w} \cdot \mathbf{f}(e) = \sum_{i} w_i \, f_i(e) \qquad (1)$$

where $\mathbf{f}(e)$ is the feature vector of $e$. The weight vector $\mathbf{w}$ is computed with supervised learning techniques, using labeled ground truth data (train/test split of 80/20).
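The paper specifies only "supervised learning techniques" with an 80/20 split; the sketch below uses ordinary least squares purely as a stand-in to show how the weight vector of Eq. (1) might be fit and applied. All names and the choice of regression method are assumptions.

```python
import numpy as np

def fit_weights(X, y, train_frac=0.8, seed=0):
    """X: (n_entities, n_features) feature matrix; y: 1-5 ground-truth labels.
    Returns the learned weight vector and the held-out RMSE."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(train_frac * len(y))          # 80/20 train/test split
    train, test = idx[:cut], idx[cut:]
    w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    rmse = np.sqrt(np.mean((X[test] @ w - y[test]) ** 2))
    return w, rmse

def score(X, w):
    # S(e) = w . f(e), rounded to an integer for comparison with the labels.
    return np.rint(X @ w).astype(int)
```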
Table 1: Precision (P), recall (R), F1, population coverage (Cov.) and RMSE for individual features and the final system.

| Feature         | P    | R    | F1   | Cov.  | RMSE |
| In Out Ratio    | 0.75 | 0.19 | 0.31 | 0.164 | 1.54 |
| Subject Types # | 0.42 | 0.10 | 0.16 | 1.000 | 2.25 |
| Object Types #  | 0.46 | 0.11 | 0.17 | 0.973 | 2.25 |
| All Features    | 0.75 | 0.37 | 0.48 | 1.000 | 1.15 |
Table 1 shows precision, recall, F1, and population coverage for the full list of features and for the final system. The importance score was calculated using Eq. (1), with the final score rounded to an integer value so it can be compared against the labels from the ground-truth data.
We observe that the Wikipedia features have the highest precision among all features, while the Freebase features have the highest coverage. The Klout score feature also has one of the highest individual precision values; although it has the lowest coverage, it helps boost the final score and surfaces relevant entities when the final system is applied on social media platforms. We also report the root mean squared error (RMSE) of the entity scores against the assigned labels; the final model shows the lowest RMSE.
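A hedged sketch of how the Table 1 metrics could be computed: the paper does not define the precision/recall criterion, so we assume here that an entity counts as "important" when its rounded score or label is at least 4 on the 1-to-5 scale.

```python
import numpy as np

def evaluate(pred_scores, labels, important_threshold=4):
    """pred_scores: real-valued model scores; labels: 1-5 ground-truth labels.
    The >= 4 importance cutoff is an assumption, not from the paper."""
    pred = np.rint(pred_scores)
    pred_imp = pred >= important_threshold
    label_imp = labels >= important_threshold
    tp = np.sum(pred_imp & label_imp)
    precision = tp / max(pred_imp.sum(), 1)
    recall = tp / max(label_imp.sum(), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    rmse = np.sqrt(np.mean((pred - labels) ** 2))
    return {"P": precision, "R": recall, "F1": f1, "RMSE": rmse}
```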
We also plot the distribution of entity types in the top million ranked entities versus the full unranked list for the English language. Entities of type 'person' make up a considerably larger share of the top-ranked list than of the global list, while the percentage of 'MISC' entity types drops. These differences in coverage indicate that entities are ranked relevantly in the corpus.
In Table 2, we provide examples of entities with their ranks in particular languages. We see that entity ranks are regionally sensitive in the context of their language, e.g., 'Morocco' ranks markedly higher in the Arabic-language ranking. We also observe that the rankings are sensitive to the specificity of the entity: for example, 'bunk bed' is ranked orders of magnitude lower than the more generic entity 'bed'.
We make the ranked list of top entities available as an open source data set at https://github.com/klout/opendata. To conclude, in this work we built a global ranking of entities across multiple languages, combining features from multiple knowledge bases. We found that a combination of multiple features yields the best results. Future work includes incorporating new signals such as Wikipedia page view statistics and edit history.
-  P. Bhargava, N. Spasojevic, and G. Hu. High-throughput and language-agnostic entity disambiguation and linking on user generated data. In Proceedings of the 26th International Conference Companion on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
-  A. Gionis, T. Lappas, and E. Terzi. Estimating entity importance via counting set covers. In 18th Intl. Conf. on Knowledge Discovery and Data Mining, 2012.
-  A. Rao, N. Spasojevic, Z. Li, and T. Dsouza. Klout score: Measuring influence across multiple social networks. In IEEE Intl. Conf. on Big Data, 2015.
-  P. Saleiro and C. Soares. Learning from the news: Predicting entity popularity on Twitter. In International Symposium on Intelligent Data Analysis, 2016.
-  A.-M. Vercoustre, J. A. Thom, and J. Pehcevski. Entity ranking in Wikipedia. In ACM Symposium on Applied Computing, 2008.
-  H. Zaragoza, H. Rode, P. Mika, J. Atserias, M. Ciaramita, and G. Attardi. Ranking very many typed entities on Wikipedia. In 16th ACM Conference on Information and Knowledge Management, 2007.