DeepType: Multilingual Entity Linking by Neural Type System Evolution

02/03/2018
by   Jonathan Raiman, et al.
OpenAI

The wealth of structured (e.g. Wikidata) and unstructured data about the world available today presents an incredible opportunity for tomorrow's Artificial Intelligence. So far, integration of these two different modalities is a difficult process, involving many decisions concerning how best to represent the information so that it will be captured or useful, and hand-labeling large amounts of data. DeepType overcomes this challenge by explicitly integrating symbolic information into the reasoning process of a neural network with a type system. First we construct a type system, and second, we use it to constrain the outputs of a neural network to respect the symbolic structure. We achieve this by reformulating the design problem into a mixed integer problem: create a type system and subsequently train a neural network with it. In this reformulation discrete variables select which parent-child relations from an ontology are types within the type system, while continuous variables control a classifier fit to the type system. The original problem cannot be solved exactly, so we propose a 2-step algorithm: 1) heuristic search or stochastic optimization over discrete variables that define a type system informed by an Oracle and a Learnability heuristic, 2) gradient descent to fit classifier parameters. We apply DeepType to the problem of Entity Linking on three standard datasets (i.e. WikiDisamb30, CoNLL (YAGO), TAC KBP 2010) and find that it outperforms all existing solutions by a wide margin, including approaches that rely on a human-designed type system or recent deep learning-based entity embeddings, while explicitly using symbolic information lets it integrate new entities without retraining.


1 Introduction

Online encyclopedias, knowledge bases, ontologies (e.g. Wikipedia, Wikidata, Wordnet), alongside image and video datasets with their associated label and category hierarchies (e.g. Imagenet [Deng et al.2009], Youtube-8M [Abu-El-Haija et al.2016], Kinetics [Kay et al.2017]) offer an unprecedented opportunity for incorporating symbolic representations within distributed and neural representations in Artificial Intelligence systems. Several approaches exist for integrating rich symbolic structures within the behavior of neural networks: a label hierarchy aware loss function that relies on the ultrametric tree distance between labels (e.g. it is worse to confuse sheepdogs and skyscrapers than it is to confuse sheepdogs and poodles) [Wu, Tygert, and LeCun2017], a loss function that trades off specificity for accuracy by incorporating hypo/hypernymy relations [Deng et al.2012], using NER types to constrain the behavior of an Entity Linking system [Ling, Singh, and Weld2015], or more recently integrating explicit type constraints within a decoder's grammar for neural semantic parsing [Krishnamurthy, Dasigi, and Gardner2017]. However, current approaches face several difficulties:

  • Selection of the right symbolic information based on the utility or information gain for a target task.

  • Design of the representation for symbolic information (hierarchy, grammar, constraints).

  • Hand-labelling large amounts of data.

DeepType overcomes these difficulties by explicitly integrating symbolic information into the reasoning process of a neural network with a type system that is automatically designed without human effort for a target task. We achieve this by reformulating the design problem into a mixed integer problem: create a type system by selecting roots and edges from an ontology that serve as types in a type system, and subsequently train a neural network with it. The original problem cannot be solved exactly, so we propose a 2-step algorithm:

  1. heuristic search or stochastic optimization over the discrete variable assignments controlling type system design, using an Oracle and a Learnability heuristic to ensure that design decisions will be easy to learn by a neural network, and will provide improvements on the target task,

  2. gradient descent to fit classifier parameters to predict the behavior of the type system.

In order to validate the benefits of our approach, we focus on applying DeepType to Entity Linking (EL), the task of resolving ambiguous mentions of entities to their referent entities in a knowledge base (KB) (e.g. Wikipedia). Specifically we compare our results to state of the art systems on three standard datasets (WikiDisamb30, CoNLL (YAGO), TAC KBP 2010). We verify whether our approach can work in multiple languages, and whether optimization of the type system for a particular language generalizes to other languages (e.g. do we overfit to a particular set of symbolic structures useful only in English, or can we discover a knowledge representation that works across languages?), by training our full system in a monolingual (English) and a bilingual (English and French) setup, and by evaluating our Oracle (performance upper bound) on German and Spanish test datasets. We compare stochastic optimization and heuristic search for solving our mixed integer problem by comparing the final performance of systems whose type systems came from different search methodologies. We also investigate whether symbolic information is captured by using DeepType as pretraining for Named Entity Recognition (NER) on two standard datasets (i.e. CoNLL 2003 [Sang and Meulder2003], OntoNotes 5.0 (CoNLL 2012) [Pradhan et al.2012]).

Our key contributions in this work are as follows:

  • A system for integrating symbolic knowledge into the reasoning process of a neural network through a type system, to constrain the behavior to respect the desired symbolic structure, and automatically design the type system without human effort.

  • An approach to EL that uses type constraints, reduces disambiguation resolution complexity from O(N²) to O(N), incorporates new entities into the system without retraining, and outperforms all existing solutions by a wide margin.

We release code for designing, evolving, and training neural type systems (http://github.com/openai/deeptype). Moreover, we observe that disambiguation accuracy reaches 99.0% on CoNLL (YAGO) and 98.6% on TAC KBP 2010 when entity types are predicted by an Oracle, suggesting that EL would be almost solved if we can improve type prediction accuracy.

The rest of this paper is structured as follows. In Section 2 we introduce EL and EL with Types, in Section 3 we describe DeepType for EL, and in Section 4 we provide experimental results for DeepType applied to EL along with evidence of cross-lingual and cross-domain transfer of the representation learned by a DeepType system. In Section 5 we relate our work to existing approaches. Conclusions and directions for future work are given in Section 6.

2 Task

Before we define how DeepType can be used to constrain the outputs of a neural network using a type system, we will first define the goal task of Entity Linking.

Entity Linking

The goal is to recover the ground truth entities in a KB referred to in a document by locating mentions (text spans), and for each mention properly disambiguating the referent entity. Commonly, a lookup table maps each mention m to a proposal set of entities E_m = {e_1, …, e_n} (e.g. "Washington" could mean Washington, D.C. or George Washington). Disambiguation is finding, for each mention m, the ground truth entity e_GT in E_m. Typically, disambiguation operates according to two criteria: in a large corpus, how often does a mention point to an entity, LinkCount(m, e), and how often does one entity co-occur with another, an O(N²) process often named coherence [Milne and Witten2008, Ferragina and Scaiella2010, Yamada et al.2016].
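To make the first criterion concrete, the sketch below implements link-count disambiguation in Python; `link_counts` is a hypothetical in-memory layout of mention-to-entity counts (not a structure from the released code), and the example counts are made up.

```python
def link_count_disambiguate(mention, link_counts):
    """Baseline disambiguation: pick the entity this mention most often links to in a
    large corpus, i.e. argmax over e of LinkCount(m, e)."""
    candidates = link_counts.get(mention, {})
    return max(candidates, key=candidates.get) if candidates else None

# Example: if "Washington" links far more often to the city than to the person,
# the baseline returns the city regardless of context.
link_counts = {"Washington": {"Washington, D.C.": 9000, "George Washington": 3000}}
assert link_count_disambiguate("Washington", link_counts) == "Washington, D.C."
```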

Entity Linking with Types

In this work we extend the EL task to associate with each entity a series of types (e.g. Person, Place, etc.) that, if known, would rule out invalid answers and therefore ease linking (e.g. knowing whether the context calls for a Person or a Place disambiguates "Washington"). Knowledge of the types associated with a mention can also help prune entities from the proposal set, producing a constrained set E'_m ⊆ E_m that contains only entities whose types match. In a probabilistic setting it is also possible to rank an entity e in document x according to its likelihood under both the type system prediction and the entity model:

p(e | m, x) ∝ p(e | m) · p(types(e) | m, x)    (1)
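The sketch below illustrates this pruning and ranking in the spirit of Eq. 1; `type_belief` (the type system's belief that the context carries a given type), `entity_types` (an entity's types), and the 0.5 pruning threshold are assumptions for illustration, and the exact probabilistic combination used in the paper may differ.

```python
def prune_and_rank(mention, proposal_set, entity_prior, type_belief, entity_types):
    """Keep candidates whose types are plausible under the type prediction, then rank
    the survivors by the prior p(e | m) times the belief in each of their types."""
    def score(e):
        s = entity_prior[mention].get(e, 0.0)
        for t in entity_types(e):
            s *= type_belief(t)
        return s
    constrained = [e for e in proposal_set
                   if all(type_belief(t) > 0.5 for t in entity_types(e))]
    candidates = constrained or list(proposal_set)   # fall back if pruning removes everything
    return sorted(candidates, key=score, reverse=True)
```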

In prior work, the 112 FIGER Types [Ling and Weld2012] were associated with entities to combine an NER tagger with an EL system [Ling, Singh, and Weld2015]. In their work, they found that regular NER types were unhelpful, while finer grain FIGER types improved system performance.

3 DeepType for Entity Linking

Figure 1: Example model output: "jaguar" refers to different entities depending on context. Predicting the type associated with each word (e.g. animal, region, etc.) helps eliminate options that do not match, and recover the true entity. Bar charts give the system's belief over the type-axis "IsA", and the table shows how types affect entity probabilities given by Wikipedia links.

DeepType is a strategy for integrating symbolic knowledge into the reasoning process of a neural network through a type system. When we apply this technique to EL, we constrain the behavior of an entity prediction model to respect the symbolic structure defined by types. As an example, when we attempt to disambiguate “Jaguar” the benefits of this approach are apparent: our decision can be based on whether the predicted type is Animal or Road Vehicle as shown visually in Figure 1.

In this section, we will first define key terminology, then explain the model and its sub-components separately.

Terminology

Figure 2: Defining group membership with a knowledge graph relation: children of root (city) via edge (instance of).

Relation

Given some knowledge graph or feature set, a relation is a set of inheritance rules that define membership or exclusion from a particular group. For instance the relation instance of(city) selects all children of the root city connected by instance of as members of the group, depicted by outlined boxes in Figure 2.

Type

In this work a type is a label defined by a relation (e.g. IsHuman is the type applied to all children of Human connected by instance of).

Type Axis

A set of mutually exclusive types (e.g. within the IsA axis, Person and Place are mutually exclusive).

Type System

A grouping of type axes A = {A_1, …, A_k}, along with a type labelling function that assigns to each entity its type on every axis. For instance a type system with two axes {IsA, Topic} assigns to George Washington: {Person, Politics}, and to Washington, D.C.: {Place, Geography}.
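A minimal data-structure sketch of these definitions, restricted to the binary membership case used by the automatically discovered axes; the `member` helper that walks the Wikidata graph is an assumed callable, not part of the released code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    """Inheritance rule: every child of `root` reachable via `edge` belongs to the group."""
    root: str        # e.g. "city"
    edge: str        # e.g. "instance of"

@dataclass(frozen=True)
class TypeAxis:
    """Mutually exclusive types; a discovered axis is the binary split member / non-member."""
    relation: Relation

def type_labels(entity, axes, member):
    """Type labelling function of a type system: one label per axis for a given entity.
    `member(entity, relation)` is an assumed helper that checks graph membership."""
    return tuple(member(entity, axis.relation) for axis in axes)

# e.g. axes = (TypeAxis(Relation("human", "instance of")),
#              TypeAxis(Relation("city", "instance of")))
```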

Model

To construct an EL system that uses type constraints we require: a type system, the associated type classifier, and a model for predicting and ranking entities given a mention. Rather than assuming we receive a type system, classifier, and entity prediction model, we create the type system and its classifier starting from a given entity prediction model and an ontology with text snippets containing entity mentions (e.g. Wikidata and Wikipedia). For simplicity we use LinkCount(m, e), the number of times mention m links to entity e, as our entity prediction model.

We restrict the types in our type systems to a set of parent-child relations over the ontology in Wikipedia and Wikidata, where each type axis has a root node and an edge type that determine membership or exclusion from the axis (e.g. the root human with the edge instance of splits entities into human vs. non-human; the type "instance of: human" mimics the NER PER label).

We then reformulate the problem into a mixed integer problem, where discrete variables control which roots and edge types, among all candidate roots and edge types, define type axes, while the continuous variables parametrize a classifier fit to the type system. Our goal in type system design is to select parent-child relations that a classifier easily predicts, and where the types improve disambiguation accuracy.

Objective

To formally define our mixed integer problem, let us first denote A as the assignment of the discrete variables that define our type system (i.e. boolean variables indicating whether a parent-child relation is included in the type system), θ as the parameters of our entity prediction model and type classifier, and S_model(A, θ) as the disambiguation accuracy over a test corpus containing mentions M. We assume our model produces a score for each proposed entity e given a mention m in a document x, defined score(e, m, x, A, θ). The predicted entity for a given mention is thus e* = argmax_{e ∈ E_m} score(e, m, x, A, θ). If e* = e_GT, the mention is disambiguated. Our problem is thus defined as:

maximize over A, θ:  S_model(A, θ) = (1/|M|) · Σ_{m ∈ M} 1[e*_m = e_GT,m]    (2)

This original formulation cannot be solved exactly (there are exponentially many choices when each Wikipedia article can be a type within our type system). To make this problem tractable we propose a 2-step algorithm:

  1. Discrete Optimization of Type System: Heuristic search or stochastic optimization over the discrete variables of the type system, , informed by a Learnability heuristic and an Oracle.

  2. Type classifier: Gradient descent over continuous variables to fit type classifier and entity prediction model.

We will now explain in more detail discrete optimization of a type system, our heuristics (Oracle and Learnability heuristic), the type classifier, and inference in this model.

Discrete Optimization of a Type System

The original objective cannot be solved exactly, thus we rely on heuristic search or stochastic optimization to find suitable assignments for A. To avoid training an entire type classifier and entity prediction model for each evaluation of the objective function, we instead use a proxy objective J for the discrete optimization (training the type classifier takes 3 days on a Titan X Pascal, while our Oracle can run over the test set in 100ms). To ensure that maximizing J also maximizes S_model, we introduce a Learnability heuristic and an Oracle that quantify the disambiguation power of a proposed type system and estimate how learnable its type axes will be. We measure an upper bound for the disambiguation power as the disambiguation accuracy S_oracle(A) of a type classifier Oracle over a test corpus.

To ensure that the additional disambiguation power of a solution translates in practice, we weigh the improvement between S_oracle(A) and S_link, the accuracy of a system that predicts only according to the entity prediction model (for an entity prediction model based only on link counts, this means always picking the most linked entity), by an estimate of the solution's learnability L(A).

Selecting a large number of type axes will provide strong disambiguation power, but may lead to degenerate solutions that are harder to train, slow down inference, and lack the higher-level concepts that provide similar accuracy with fewer axes. We prevent this by adding a per type axis penalty of λ.

Combining these three terms gives us the equation for J:

J(A) = S_link + (S_oracle(A) - S_link) · L(A) - λ · |A|    (3)
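Read as code, the proxy objective is a one-liner; the sketch below takes the Oracle accuracy and learnability as precomputed callables and the link-count baseline as a scalar, and its default penalty mirrors the λ = 0.00007 setting reported later in the paper.

```python
def proxy_objective(type_axes, oracle_accuracy, learnability,
                    linkcount_accuracy, penalty=0.00007):
    """Sketch of Eq. 3: the Oracle's gain over the link-count baseline, discounted by
    how learnable the selected axes are, minus a per-axis size penalty."""
    gain = oracle_accuracy(type_axes) - linkcount_accuracy
    return linkcount_accuracy + gain * learnability(type_axes) - penalty * len(type_axes)
```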

Oracle

Our Oracle is a methodology for abstracting away machine learning performance from the underlying representational power of a type system. It operates on a test corpus with a set of mentions, ground truth entities, and proposal sets: {(m_1, e_GT,1, E_m1), …, (m_n, e_GT,n, E_mn)}. The Oracle prunes each proposal set to only contain entities whose types match those of e_GT, yielding E'_m. Types fully disambiguate when |E'_m| = 1; otherwise we use the entity prediction model to select the right entity in the remainder set E'_m:

e* = argmax_{e ∈ E'_m} LinkCount(m, e)    (4)

If e* = e_GT, the mention is disambiguated. Oracle accuracy for a type system A over a test corpus containing mentions M is denoted S_oracle(A):

S_oracle(A) = (1/|M|) · Σ_{m ∈ M} 1[e*_m = e_GT,m]    (5)
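A sketch of the Oracle computation, assuming a `types_of(entity)` lookup and a link-count prior as the entity prediction model (both hypothetical helpers):

```python
def oracle_accuracy(test_corpus, types_of, link_counts):
    """Prune each proposal set to candidates whose types match the ground truth entity's
    types (perfect type prediction), break remaining ties with link counts, and report
    the fraction of mentions that end up disambiguated."""
    correct = 0
    for mention, proposal_set, gold in test_corpus:              # (m, E_m, e_GT)
        pruned = [e for e in proposal_set if types_of(e) == types_of(gold)]
        counts = link_counts.get(mention, {})
        pick = max(pruned, key=lambda e: counts.get(e, 0)) if pruned else None
        correct += (pick == gold)
    return correct / len(test_corpus)
```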
Learnability
Figure 3: Text window classifier in (a) serves as type Learnability estimator, while the network in (b) takes longer to train, but discovers long-term dependencies to predict types and jointly produces a distribution for multiple type axes.

To ensure that disambiguation gains obtained during the discrete optimization carry over when we train our type classifier, the types selected must be easy to predict. The Learnability heuristic empirically measures the average performance of classifiers at predicting the presence of a type within some Learnability-specific training set.

To efficiently estimate Learnability for a full type system we make an independence assumption and model it as the mean of the Learnability of each individual axis, ignoring positive or negative transfer effects between different type axes. This assumption lets us parallelize training of simpler classifiers for each type axis. We measure the area under the receiver operating characteristic curve (AUC) of each classifier and compute the type system's learnability as the mean over its axes: L(A) = (1/|A|) · Σ_i AUC_i. We use a text window classifier trained over windows of 10 words before and after a mention. Words are represented with randomly initialized word embeddings; the classifier is illustrated in Figure 3(a). AUC is averaged over 4 training runs for each type axis.
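A rough sketch of the per-axis Learnability estimate; it substitutes a bag-of-words logistic regression for the paper's small embedding-based window classifier, and `windows`/`labels` are assumed to be the ±10-word contexts and binary axis membership extracted from the held-out articles.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def axis_auc(windows, labels, runs=4, seed=0):
    """AUC of a cheap classifier at predicting membership in one type axis from the
    text window around a mention, averaged over `runs` random splits."""
    scores = []
    for r in range(runs):
        Xtr, Xte, ytr, yte = train_test_split(windows, labels, test_size=0.2,
                                              random_state=seed + r, stratify=labels)
        vec = CountVectorizer()
        clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(Xtr), ytr)
        scores.append(roc_auc_score(yte, clf.predict_proba(vec.transform(Xte))[:, 1]))
    return float(np.mean(scores))

def learnability(per_axis_data):
    """Independence assumption: the type system's Learnability is the mean per-axis AUC."""
    return float(np.mean([axis_auc(w, y) for (w, y) in per_axis_data]))
```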

Type Classifier

After the discrete optimization has completed we have a type system A. We can use this type system to label data in multiple languages from text snippets associated with the ontology (Wikidata's ontology has cross-links with Wikipedia, IMDB, Discogs, MusicBrainz, and other encyclopaedias with text snippets), and supervise a type classifier.

The goal for this classifier is to discover long-term dependencies in the input data that let it reliably predict types across many contexts and languages. For this reason we select a bidirectional-LSTM [Lample et al.2016] with word, prefix, and suffix embeddings as done in [Andor et al.2016]. Our network is shown pictorially in Figure 3(b). Our classifier is trained to minimize the negative log likelihood of the per-token types for each type axis in a document with T tokens: - Σ_{t=1}^{T} Σ_i log p(y_{t,i} | x). When using Wikipedia as our source of text snippets our label supervision is partial (we obtain type labels only on the intra-wiki link anchor text), so we make a conditional independence assumption about our predictions and use a Softmax output activation independently for each type axis.
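A minimal PyTorch sketch of such a classifier; the paper's implementation is in TensorFlow and also uses prefix/suffix and character features, so the sizes and names below are illustrative only.

```python
import torch.nn as nn

class TypeClassifier(nn.Module):
    """Bidirectional LSTM over word embeddings with one softmax head per type axis."""
    def __init__(self, vocab_size, axis_sizes, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        # one independent head per type axis (conditional independence assumption)
        self.heads = nn.ModuleList(nn.Linear(2 * hidden, n) for n in axis_sizes)

    def forward(self, token_ids):
        states, _ = self.lstm(self.embed(token_ids))     # (batch, time, 2 * hidden)
        return [head(states) for head in self.heads]     # per-axis, per-token logits

def type_loss(logits_per_axis, labels_per_axis, ignore_index=-100):
    """Sum of per-token negative log likelihoods over all type axes; tokens without a
    label (partial supervision from intra-wiki links) carry `ignore_index`."""
    ce = nn.CrossEntropyLoss(ignore_index=ignore_index)
    return sum(ce(logits.flatten(0, 1), labels.flatten())
               for logits, labels in zip(logits_per_axis, labels_per_axis))
```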

Inference

At inference time we incorporate classifier belief into our decision process by first running it over the full context and obtaining a belief over each type axis for each input word. For each mention we obtain the type conditional probability for all type axes from the words the mention covers. In multi-word mentions we must combine beliefs over multiple tokens: the product of the beliefs over the mention's tokens is correct but numerically unstable and slightly less performant than max-over-time (the choice of max-over-time is empirically motivated: we compared product, mean, min, and max, and found that max was comparable to mean and slightly better than the alternatives), which takes, for each type axis, the maximum belief over the mention's tokens.

The score of an entity given these conditional probability distributions and the entity's types on each axis can then be combined to rank entities according to how predicted they were by both the entity prediction model and the type system. The chosen entity for a mention is the option that maximizes this score among the possible entities; the scoring equation is given below, where β is a per type axis smoothing parameter and ε is a smoothing parameter over all types:

(6)
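The sketch below shows the max-over-time combination and a smoothed entity score of the kind Eq. 6 describes; the exact functional form with β and ε here is illustrative rather than the paper's equation, and `axis_types` is an assumed lookup of the entity's type index on each axis.

```python
import numpy as np

def mention_belief(token_beliefs):
    """Max-over-time: combine per-token type distributions over a multi-word mention."""
    return np.max(np.stack(token_beliefs, axis=0), axis=0)

def entity_score(prior, axis_beliefs, axis_types, beta=0.9, eps=1e-3):
    """Rescale the link prior p(e | m) by each axis's smoothed belief in the entity's type.
    `axis_beliefs[i]` is the mention-level distribution over axis i's types and
    `axis_types[i]` is the entity's type index on that axis."""
    score = prior
    for belief, t in zip(axis_beliefs, axis_types):
        p_type = belief[t]
        score *= (1.0 - beta) + beta * (eps + (1.0 - eps) * p_type)
    return score
```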

4 Results

Type System Discovery

In the following experiments we evaluate the behavior of different search methodologies for type system discovery: which method best scales to large numbers of types and achieves high accuracy on the target EL task, and whether the choice of search impacts learnability by a classifier or generalisability to held-out EL datasets.

For the following experiments we optimize DeepType’s type system over a held-out set of 1000 randomly sampled articles taken from the Feb. 2017 English Wikipedia dump, with the Learnability heuristic text window classifiers trained only on those articles. The type classifier is trained jointly on English and French articles, totalling 800 million tokens for training, 1 million tokens for validation, sampled equally from either language.

We restrict roots and edges to the most common entities that are entity parents through wikipedia category or instance of edges, and eliminate type axes that provide no improvement, leaving 53,626 type axes.

Human Type System Baseline

To isolate discrete optimization from system performance and gain perspective on the difficulty and nature of the type system design, we incorporate a human-designed type system. Human designers have access to the full set of entities and relations in Wikipedia and Wikidata, and compose different inheritance rules through Boolean algebra to obtain higher level concepts (for instance, rules built on Taxon, the general parent of living items in Wikidata). The final human system uses 5 type axes (IsA, Topic, Location, Continent, and Time) and 1218 inheritance rules.

Search methodologies

Beam Search and Greedy selection

We iteratively construct a type system by choosing among all remaining type axes and evaluating whether the inclusion of a new type axis improves our objective J. We use a beam size of 8 (greedy selection is the special case with a beam size of 1) and stop the search when all solutions stop growing.
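A greedy sketch of this procedure (beam search keeps the best b partial solutions instead of a single one); `objective` is assumed to be the proxy objective J evaluated on a collection of candidate axes.

```python
def greedy_select(candidate_axes, objective):
    """Repeatedly add whichever remaining type axis most improves J; stop when nothing helps."""
    solution, best = [], objective([])
    remaining = list(candidate_axes)
    while remaining:
        scored = [(objective(solution + [axis]), axis) for axis in remaining]
        top_score, top_axis = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break
        best, solution = top_score, solution + [top_axis]
        remaining.remove(top_axis)
    return solution
```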

Cross-Entropy Method

The Cross-Entropy Method (CEM) [Rubinstein1999] is a stochastic optimization procedure applicable to the selection of types. We begin with a probability vector p_0, and at each iteration t we sample binary inclusion vectors from the Bernoulli distribution given by p_t and measure each sample's fitness with Eq. 3. The highest-fitness samples form the winning population at iteration t, and p_{t+1} is fit to that population. The optimization is complete when the probability vector is binary.
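A compact CEM sketch; the sample and winner counts follow my reading of Table 2 (1000 samples, 200 winners), while the initial inclusion probability and convergence tolerance are illustrative, and `objective` is again the proxy objective over a set of selected axes.

```python
import numpy as np

def cem_select(num_axes, objective, n_samples=1000, n_winners=200,
               p_init=0.02, max_iters=100):
    """Sample binary inclusion vectors from per-axis Bernoulli probabilities, keep the
    highest-scoring samples, refit the probabilities to that winning population, and
    repeat until the probability vector is (nearly) binary."""
    rng = np.random.default_rng(0)
    p = np.full(num_axes, p_init)
    for _ in range(max_iters):
        if np.all((p < 1e-3) | (p > 1.0 - 1e-3)):
            break
        samples = rng.random((n_samples, num_axes)) < p          # Bernoulli(p) draws
        scores = np.array([objective(np.flatnonzero(s)) for s in samples])
        winners = samples[np.argsort(scores)[-n_winners:]]
        p = winners.mean(axis=0)
    return np.flatnonzero(p > 0.5)
```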

Genetic Algorithm

The best subset of type axes can be found by representing type axes as genes carried by individuals in a population undergoing mutations and crossovers [Harvey2009] over generations. We select individuals using Eq. 3 as our fitness function.

Search Methodology Performance Impact

Approach      Evals   Accuracy   Items
BeamSearch            97.84      130
Greedy                97.83      130
GA                    96.959     128
CEM                   96.26      89
Random        N/A                128
No types      0                  0
(a) Type system discovery method comparison
Model                                            CoNLL 2003        OntoNotes 5.0
                                                 Dev     Test      Dev     Test
Bi-LSTM [Chiu and Nichols2015]                   -       76.29     -       77.77
Bi-LSTM-CNN + emb + lex [Chiu and Nichols2015]   94.31   91.62     84.57   86.28
Bi-LSTM (Ours)                                   89.49   83.40     82.75   81.03
Bi-LSTM-CNN (Ours)                               90.54   84.74     83.17   82.35
Bi-LSTM-CNN (Ours) + types                       93.54   88.67     85.11   83.12
(b) NER F1 score comparison for DeepType pretraining vs. baselines.
Model                                enwiki   frwiki   dewiki   eswiki   WKD30    CoNLL    TAC 2010
M&W [Milne and Witten2008]           -        -        -        -        84.6     -        -
TagMe [Ferragina and Scaiella2010]   83.224   -        80.711   -        90.9     -        -
[Globerson et al.2016]               -        -        -        -        -        91.7     87.2
[Yamada et al.2016]                  -        -        -        -        -        91.5     85.2
NTEE [Yamada et al.2017]             -        -        -        -        -        -        87.7
LinkCount only                       89.064   92.013   92.013   89.980   82.710   68.614   81.485
Ours:
manual                               94.331   92.967   -        -        91.888   93.108   90.743
manual (oracle)                      97.734   98.026   98.632   98.178   95.872   98.217   98.601
greedy                               93.725   92.984   -        -        92.375   94.151   90.850
greedy (oracle)                      98.002   97.222   97.915   98.246   97.293   98.982   98.278
CEM                                  93.707   92.415   -        -        92.247   93.962   90.302
CEM (oracle)                         97.500   96.648   97.480   97.599   96.481   99.005   96.767
GA                                   93.684   92.027   -        -        92.062   94.879   90.312
GA (oracle)                          97.297   96.783   97.408   97.609   96.268   98.461   96.663
GA (English only)                    93.029   -        -        -        91.743   93.701   -
(c) Entity Linking model comparison.
Table 1: Method comparisons. Highest value in bold, excluding oracles.

To validate that λ controls type system size, and to find the best tradeoff between size and accuracy, we experiment with a range of values and find that accuracy grows more slowly below λ = 0.00007, while system size still increases.

From this point on we keep λ = 0.00007, and we compare the number of iterations needed by different search methods to converge, against two baselines: the empty set and the mean performance of 100 randomly sampled sets of 128 types (Table 1(a)). We observe that the performance of the stochastic optimizers GA and CEM is similar to heuristic search, but requires orders of magnitude fewer function evaluations.

Next, we compare the behavior of the different search methods to a human designed system and state of the art approaches on three standard datasets (i.e. Wiki-Disamb30 (WKD30) [Ferragina and Scaiella2010], CoNLL(YAGO) [Hoffart et al.2011], and TAC KBP 2010 [Ji et al.2010]), where for Wiki-Disamb30 we apply the same preprocessing and link pruning as [Ferragina and Scaiella2010] to ensure the comparison is fair, along with test sets built by randomly sampling 1000 articles from Wikipedia's February 2017 dump in English, French, German, and Spanish which were excluded from training the classifiers. Table 1(c) gives Oracle performance for the different search methods on the test sets, where we report disambiguation accuracy per annotation. A baseline is included that selects the mention's most frequently linked entity (LinkCount); note that LinkCount accuracy is stronger than that found in [Ferragina and Scaiella2010] or [Milne and Witten2008] because newer Wikipedia dumps improve link coverage and reduce link distribution noisiness. All search techniques' Oracle accuracies significantly improve over LinkCount and achieve near perfect accuracy on all datasets (97-99%); furthermore we notice that performance on the held-out Wikipedia sets and the standard datasets is similar, supporting the claim that the discovered type systems generalize well. We note that machine discovered type systems outperform human designed systems: CEM beats the human type system on English Wikipedia, and all search methods' type systems outperform human systems on Wiki-Disamb30, CoNLL(YAGO), and TAC KBP 2010.

Search Methodology Learnability Impact

To understand whether the type systems produced by different search methods can be trained similarly well, we compare the type systems built by GA, CEM, greedy, and the one constructed manually. EL disambiguation accuracy is shown in Table 1(c), where we compare with recent deep-learning based approaches [Globerson et al.2016], recent work by Yamada et al. for embedding words and entities [Yamada et al.2016] or documents and entities [Yamada et al.2017], along with the count and coherence based techniques TagMe [Ferragina and Scaiella2010] and Milne & Witten [Milne and Witten2008]. To obtain TagMe's Feb. 2017 Wikipedia accuracy we query the public web API (https://tagme.d4science.org/tagme/), available in German and English, while other methods can be compared on CoNLL(YAGO) and TAC KBP 2010. Models trained on a human type system outperform all previous approaches to entity linking, while type systems discovered by machines lead to even higher performance on all datasets except English Wikipedia.

Cross-Lingual Transfer

Type systems are defined over Wikidata/Wikipedia, a multi-lingual knowledge base/encyclopaedia, thus type axes are language independent and can produce cross-lingual supervision. To verify whether this cross-lingual ability is useful we train a type system on an English dataset and verify whether it can successfully supervise French data. We also measure, using the Oracle (performance upper bound), whether the type system is useful in Spanish or German. Oracle performance does not appear to degrade when transferring to other languages (Table 1(c)). We also notice that training in French with an English type system still yields improvements over LinkCount for CEM, greedy, and human systems.

Because multi-lingual training might oversubscribe the model, we verified whether monolingual training would outperform bilingual training: we compare GA trained on English + French with GA trained only on English (last row of Table 1(c)). Bilingual training does not appear to hurt, and might in fact be helpful.

We follow up by inspecting whether the bilingual word vector space led to shared representations: common nouns have their English-French translation close-by, while proper nouns do not (French and US politicians cluster separately).

Named Entity Recognition Transfer

The goal of our NER experiment is to verify whether DeepType produces a type sensitive language representation useful for transfer to other downstream tasks. To measure this we pre-train a type classifier with a character-CNN and word embeddings as inputs, following [Kim et al.2015], and replace the output layer with a linear-chain CRF [Lample et al.2016] to fine-tune to NER data. Our model’s F1 scores when transferring to the CoNLL 2003 NER task and OntoNotes 5.0 (CoNLL 2012) split are given in Table 1(b). We compare with two baselines that share the architecture but are not pre-trained, along with the current state of the art [Chiu and Nichols2015].

We see positive transfer on OntoNotes and CoNLL: our baseline Bi-LSTM strongly outperforms [Chiu and Nichols2015]'s baseline, while pre-training gives an additional 3-4 F1 points, with our best model outperforming the state of the art on the OntoNotes development split. While our baseline LSTM-CRF performs better than in the literature, our strongest baseline (CNN+LSTM+CRF) does not match the state of the art with a lexicon. We find that DeepType always improves over baselines and partially recovers lexicon performance gains, but does not fully replace lexicons.

5 Related Work

Neural Network Reasoning with Symbolic structures

Several approaches exist for incorporating symbolic structures into the reasoning process of a neural network by designing a loss function that is defined with a label hierarchy. In particular the work of [Deng et al.2012] trades off specificity for accuracy, by leveraging the hyper/hyponymy relation to make a model aware of different granularity levels. Our work differs from this approach in that we design our type system within an ontology to meet specific accuracy goals, while they make the accuracy/specificity tradeoff at training time, with a fixed structure. More recently [Wu, Tygert, and LeCun2017] use a hierarchical loss to increase the penalty for distant branches of a label hierarchy using the ultrametric tree distance. We also aim to capture the most important aspects of the symbolic structure and shape our loss function accordingly, however our loss shaping is a result of discrete optimization and incorporates a learnability heuristic to choose aspects that can easily be acquired.

A different direction for integrating structure stems from constraining model outputs, or enforcing a grammar. In the work of [Ling, Singh, and Weld2015], the authors use NER and FIGER types to ensure that an EL model follows the constraints given by types. We also use a type system and constrain our model’s output, however our type system is task-specific and designed by a machine with a disambiguation accuracy objective, and unlike the authors we find that types improve accuracy. The work of [Krishnamurthy, Dasigi, and Gardner2017] uses a type-aware grammar to constrain the decoding of a neural semantic parser. Our work makes use of type constraints during decoding, however the grammar and types in their system require human engineering to fit each individual semantic parsing task, while our type systems are based on online encyclopaedias and ontologies, with applications beyond EL.

Neural Entity Linking

Current approaches to entity linking make extensive use of deep neural networks and distributed representations. In [Globerson et al.2016] a neural network uses attention to focus on contextual entities to disambiguate. While our work does not make use of attention, RNNs allow context information to affect disambiguation decisions. In the work of [Yamada et al.2016] and [Yamada et al.2017], the authors adopt a distributed representation of context which either models words and entities, or documents and entities, such that distances between vectors inform disambiguation. We also rely on word and document vectors produced by RNNs, however entities are not explicitly represented in our neural network, and we use context to predict entity types, thereby allowing us to incorporate new entities without retraining.

6 Conclusion

In this work we introduce DeepType, a method for integrating symbolic knowledge into the reasoning process of a neural network. We've proposed a mixed integer reformulation for jointly designing type systems and training a classifier for a target task, and empirically validated that when this technique is applied to EL it is effective at integrating symbolic information in the neural network reasoning process. When pre-training with DeepType for NER, we observe improved performance over baselines and a new state of the art on the OntoNotes dev set, suggesting there is cross-domain transfer: symbolic information is incorporated in the neural network's distributed representation. Furthermore we find that type systems designed by machines outperform those designed by humans on three benchmark datasets, which is attributable to incorporating learnability and target task performance goals within the design process. Our approach naturally enables multilingual training, and our experiments show that bilingual training improves over monolingual, and type systems optimized for English operate at similar accuracies in French, German, and Spanish, supporting the claim that the type system optimization leads to the discovery of high level cross-lingual concepts useful for knowledge representation. We compare different search techniques, and observe that stochastic optimization has comparable performance to heuristic search, but with orders of magnitude fewer objective function evaluations.

The main contributions of this work are a joint formulation for designing and integrating symbolic information into neural networks, which enables us to constrain the outputs to obey symbolic structure, and an approach to EL that uses type constraints. Our approach reduces EL resolution complexity from O(N²) to O(N), while allowing new entities to be incorporated without retraining, and we find on three standard datasets (WikiDisamb30, CoNLL (YAGO), TAC KBP 2010) that our approach outperforms all existing solutions by a wide margin, including approaches that rely on a human-designed type system [Ling, Singh, and Weld2015] and the more recent work by Yamada et al. for embedding words and entities [Yamada et al.2016], or documents and entities [Yamada et al.2017]. As a result of our experiments, we observe that disambiguation accuracy using Oracles reaches 99.0% on CoNLL (YAGO) and 98.6% on TAC KBP 2010, suggesting that EL would be almost solved if we can close the gap between type classifiers and the Oracle.

The results presented in this work suggest many directions for future research: how DeepType can be applied to other problems where incorporating symbolic structure is beneficial, whether making type system design more expressive by allowing hierarchies can help close the gap between model and Oracle accuracy, and whether additional gains can be obtained by relaxing the classifier's conditional independence assumption.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable feedback. In addition, we thank John Miller, Andrew Gibiansky, and Szymon Sidor for thoughtful comments and fruitful discussion.

Appendix A Training details and hyperparameters

Optimization

Our models are implemented in TensorFlow and optimized with Adam, with the learning rate annealed by 0.99 every 10,000 iterations.

To reduce over-fitting and make our system more robust to spelling changes, we apply Dropout to input embeddings and augment our data with noise: swap input words with a special <UNK> word, remove capitalization, or strip a trailing "s". In our NER experiments we also add Gaussian noise to the LSTM weights during training.
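A sketch of the input-noise augmentation described above; the corruption probabilities are illustrative, not the paper's settings.

```python
import random

def corrupt(tokens, p_unk=0.05, p_lower=0.1, p_strip_s=0.05, unk="<UNK>"):
    """Randomly swap words for <UNK>, drop capitalization, or strip a trailing 's'."""
    noisy = []
    for tok in tokens:
        r = random.random()
        if r < p_unk:
            tok = unk
        elif r < p_unk + p_lower:
            tok = tok.lower()
        elif r < p_unk + p_lower + p_strip_s and tok.endswith("s"):
            tok = tok[:-1]
        noisy.append(tok)
    return noisy
```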

We use early stopping in our NER experiments when the validation F1 score stops increasing. Type classification model selection is different as the models did not overfit, thus we instead stop training when no more improvements in F1 are observed on held-out type-training data (3 days on one Titan X Pascal).

Method        Parameter                 Value
Greedy        beam size                 1
Beam Search   beam size                 8
CEM           samples per iteration     1000
CEM           initial probability p0    (see note)
CEM           winning population size   200
GA            population size           200
GA            generations               1000
GA            mutation probability      0.5
GA            crossover probability     0.2
Table 2: Hyperparameters for type system discovery search.
Note: the choice of the initial probability p0 affects the system size at the first step of the CEM search: setting it too low leads to poor search space exploration, while setting it too high increases the cost of the objective function evaluation. Empirically, for a given p0 the solution has a predictable expected size, and a small p0 provides sufficient exploration to reach the performance of larger values.

Architecture

Character representation

Our character convolutions use a set of character filters of varying widths and channel counts, a maximum word length of 40, and 15-dimensional character embeddings followed by 2 highway layers. We learn 6-dimensional embeddings for 2 and 3 character prefixes and suffixes.

Text Window Classifier

The text window classifiers have 5-dimensional word embeddings and use Dropout of 0.5. Empirically we find that two passes through the dataset with a batch size of 128 are sufficient for the window classifiers to converge. Additionally we train multiple type axes in a single batch, reaching a training speed of 2.5 type axes/second.

Appendix B Wikipedia Link Simplification

Link statistics collected on large corpora of entity mentions are extensively used in entity linking. These statistics provide a noisy estimate of the conditional probability p(e | m) of an entity e given a mention m. Intra-wiki links in Wikipedia provide a multilingual and broad coverage source of links, however annotators often create link anaphoras: the mention "king" linking to Charles I of England. This behavior increases polysemy (the mention "king" has 974 associated entities) and distorts link frequencies ("queen" links to the band Queen 4920 times, Elizabeth II 1430 times, and monarch only 32 times).

Problems with link sparsity or anaphora were previously identified, however present solutions rely on pruning rare links and thus lose track of the original statistics [Ferragina and Scaiella2010, Hasibi, Balog, and Bratsberg2016, Ling, Singh, and Weld2015]. We propose instead to detect anaphoras and recover the generic meaning through the Wikidata property graph: if a mention points to entities A and B, with A being more linked than B, and A is B's parent in the Wikidata property graph, then replace B with A. We define A to be the parent of B if they connect through a sequence of Wikidata properties {instance of, subclass of, is a list of}, or through a single edge in {occupation, position held, series} (e.g. Return of the Jedi belongs to the series Star Wars). The simplification process is repeated until no more updates occur. This transformation reduces the number of associated entities for each mention ("king" senses drop from 974 to 143) and ensures that the semantics of multiple specific links are aggregated (the number of "queen" links to monarch increases from 32 to 3553).
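A sketch of this simplification as a fixpoint loop; `link_counts` maps (mention, entity) pairs to counts, and `parents(entity)` is an assumed helper that returns the entities reachable through the allowed Wikidata property paths.

```python
def simplify_links(link_counts, parents):
    """If a mention links to both A and B, A is linked more often, and A is B's parent
    in the Wikidata property graph, fold B's links into A; repeat until nothing changes."""
    while True:
        by_mention = {}
        for (mention, entity), count in link_counts.items():
            by_mention.setdefault(mention, {})[entity] = count
        replacements = {}
        for mention, entities in by_mention.items():
            for child in entities:
                for parent in parents(child):
                    if parent in entities and entities[parent] > entities[child]:
                        replacements[(mention, child)] = (mention, parent)
                        break
        if not replacements:
            return link_counts
        merged = {}
        for key, count in link_counts.items():
            key = replacements.get(key, key)
            merged[key] = merged.get(key, 0) + count
        link_counts = merged
```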

Figure 4: Mention Polysemy change after simplification.
Step Replacements Links changed
1 1,109,408 9,212,321
2 13,922 1,027,009
3 1,229 364,500
4 153 40,488
5 74 25,094
6 4 1,498
Table 3: Link change statistics per iteration during English Wikipedia Anaphora Simplification.

After simplification we find that the mean number of senses attached to polysemous mentions drops from 4.73 to 3.93, while over 10,670,910 links undergo changes in this process (Figure 4). Table 3 indicates that most changes result from mentions containing entities and their immediate parents. This simplification method strongly reduces the number of entities tied to each Wikipedia mention in an automatic fashion across multiple languages.

Appendix C Multilingual Training Representation

Argentinian lui Sarkozy Clinton hypothesis
1 argentin (0.259) he (0.333) Bayron (0.395) Reagan (0.413) paradox (0.388)
2 Argentina (0.313) il (0.360) Peillon (0.409) Trump (0.441) Hypothesis (0.459)
3 Argentine (0.315) him (0.398) Montebourg (0.419) Cheney (0.495) hypothèse (0.497)
Table 4: Top-3 nearest neighbors (cosine distance) in the shared English-French word vector space.

Multilingual data creation is a side-effect of the ontology-based automatic labeling scheme. In Table 4 we present nearest-neighbor words for words in multiple languages. We note that common words (he, Argentinian, hypothesis) remain close to their foreign language counterpart, while proper nouns group with country/language-specific terms. We hypothesize that common words, by not fulfilling a role as a label, can therefore operate in a language independent way to inform the context of types, while proper nouns will have different type requirements based on their labels, and thus will not converge to the same representation.

feu computer
1 killing (0.585) Computer (0.384)
2 terrible (0.601) computers (0.446)
3 beings (0.618) informatique (0.457)
Table 5: Additional top-3 nearest neighbors (cosine distance) in the shared English-French word vector space.

Appendix D Effect of System Size Penalty

Figure 6: Effect of varying λ on CEM type system discovery: solution size (a) and iterations to convergence (b) grow exponentially as the penalty decreases, while accuracy plateaus (c) around λ = 0.00007. The objective function increases as the penalty decreases, since solution size is less penalized (d). Standard deviation is shown as the red region around the mean.

We measure the effect of varying λ on type system discovery when using CEM for our search. The effect, averaged over 10 trials for a variety of penalties, is shown in Figure 6. In particular we notice that there is a crossover point in the performance characteristics when selecting λ, where a looser penalty yields diminishing returns in accuracy around λ = 0.00007.

Appendix E Learnability Heuristic behavior

To better understand the behavior of the population of classifiers used to obtain AUC scores for the Learnability heuristic, we investigate whether certain type axes are systematically easier or harder to predict, and summarize our results in Figure 7. We find that type axes with an instance of edge have on average higher AUC scores than type axes relying on wikipedia category. Furthermore, we also wanted to ensure that our methodology for estimating learnability was not flawed, i.e. whether variance in our measurement was correlated with the AUC for a type axis. We find that there is no obvious relation between the standard deviation of the AUC scores for a type axis and the AUC score itself.

Figure 7: Most instance of type-axes have higher AUC scores than wikipedia categories (a). The standard deviation for AUC scoring with text window classifiers is below 0.1 (b), and AUC is not correlated with its standard deviation (c).

Appendix F Multilingual Part of Speech Tagging

Finally, the usage of multilingual data allows for some subjective experiments. For instance, in Figure 8 we show samples from the model trained jointly on English and French correctly detecting the meaning of the word "car" across three possible meanings.

This PRON is VERB a DET car NOUN , PCT ceci PRON n’ PART est VERB pas ADV une DET voiture NOUN car CONJ c’ PRON est VERB un DET car NOUN . PCT

Figure 8: Model trained jointly on monolingual POS corpora detecting the multiple meanings of “car” (shown in bold) in a mixed English-French sentence.

Appendix G Human Type System

To assist humans with the design of the system, the rules are built interactively in a REPL, and execute over the 24 million entities in under 10 seconds, allowing for real time feedback in the form of statistics or error analysis over an evaluation corpus. On the evaluation corpus, disambiguation mistakes can be grouped according to the ground truth type, allowing a per type error analysis to easily detect areas where more granularity would help. Shown below are the 5 different type axes designed by humans.

Activity
Aircraft
Airport
Algorithm
Alphabet
Anatomical structure
Astronomical object
Audio visual work
Award
Award ceremony
Battle
Book magazine article
Brand
Bridge
Character
Chemical compound
Clothing
Color
Concept
Country
Crime
Currency
Data format
Date
Developmental biology period
Disease
Electromagnetic wave
Event
Facility
Family
Fictional character
Food
Gas
Gene
Genre
Geographical object
Geometric shape
Hazard
Human
Human female
Human male
International relations
Table 6: Human Type Axis: IsA
Kinship
Lake
Language
Law
Legal action
Legal case
Legislative term
Mathematical object
Mind
Molecule
Monument
Mountain
Musical work
Name
Natural phenomenon
Number
Organization
Other art work
People
Person role
Physical object
Physical quantity
Plant
Populated place
Position
Postal code
Radio program
Railroad
Record chart
Region
Religion
Research
River
Road vehicle
Sea
Sexual orientation
Software
Song
Speech
Sport
Sport event
Sports terminology
Strategy
Taxon
Taxonomic rank
Title
Train station
Union
Unit of mass
Table 7: Human Type Axis: IsA (continued)
Value
Vehicle
Vehicle brand
Volcano
War
Watercraft
Weapon
Website
Other
Table 8: Human Type Axis: IsA (continued)
Archaeology
Automotive industry
Aviation
Biology
Botany
Business other
Construction
Culture
Culture-comics
Culture-dance
Culture-movie
Culture-music
Culture-painting
Culture-photography
Culture-sculpture
Culture-theatre
Culture arts other
Culture ceramic art
Culture circus
Culture literature
Economics
Education
Electronics
Energy
Engineering
Environment
Family
Fashion
Finance
Food
Health-alternative-medicine
Health-science-audiology
Health-science-biotechnology
Healthcare
Health cell
Health childbrith
Health drug
Health gene
Health hospital
Health human gene
Health insurance
Health life insurance
Health medical
Health med activism
Health med doctors
Health med society
Health organisations
Health people in health
Health pharma
Health protein
Table 9: Human Type Axis: Topic
Health protein wkp
Health science medicine
Heavy industry
Home
Hortculture and gardening
Labour
Law
Media
Military war crime
Nature
Nature-ecology
Philosophy
Politics
Populated places
Religion
Retail other
Science other
Science-anthropology
Science-astronomy
Science-biophysics
Science-chemistry
Science-computer science
Science-geography
Science-geology
Science-history
Science-mathematics
Science-physics
Science-psychology
Science-social science other
Science chronology
Science histology
Science meteorology
Sex industry
Smoking
Sport-air-sport
Sport-american football
Sport-athletics
Sport-australian football
Sport-baseball
Sport-basketball
Sport-climbing
Sport-combat sport
Sport-cricket
Sport-cue sport
Sport-cycling
Sport-darts
Sport-dog-sport
Sport-equestrian sport
Sport-field hockey
Sport-golf
Table 10: Human Type Axis: Topic (continued)
Sport-handball
Sport-ice hockey
Sport-mind sport
Sport-motor sport
Sport-multisports
Sport-other
Sport-racquet sport
Sport-rugby
Sport-shooting
Sport-soccer
Sport-strength-sport
Sport-swimming
Sport-volleyball
Sport-winter sport
Sport water sport
Toiletry
Tourism
Transportation
Other
Table 11: Human Type Axis: Topic (continued)
Post-1950
Pre-1950
Other
Table 12: Human Type Axis: Time
Africa
Antarctica
Asia
Europe
Middle East
North America
Oceania
Outer Space
Populated place unlocalized
South America
Other
Table 13: Human Type Axis: Location
