Learning in Text Streams: Discovery and Disambiguation of Entity and Relation Instances

09/06/2019 ∙ by Marco Maggini, et al.

We consider a scenario where an artificial agent is reading a stream of text composed of a set of narrations, and it is informed about the identity of some of the individuals that are mentioned in the text portion that is currently being read. The agent is expected to learn to follow the narrations, thus disambiguating mentions and discovering new individuals. We focus on the case in which individuals are entities and relations, and we propose an end-to-end trainable memory network that learns to discover and disambiguate them in an online manner, performing one-shot learning and dealing with a small number of sparse supervisions. Our system builds a not-given-in-advance knowledge base, and it improves its skills while reading unsupervised text. The model deals with abrupt changes in the narration, taking into account their effects when resolving co-references. We showcase the strong disambiguation and discovery skills of our model on a corpus of Wikipedia documents and on a newly introduced dataset that we make publicly available.




1 Introduction

Most of today's machine-learning-based approaches to language-related problems are defined in the classical setting in which a batch of data is exploited to build powerful, offline predictors, especially when the data are paired with supervisions. Little has been done in the setting in which we process a continuous stream of text and build systems that learn and respond in an online manner. However, conversational systems [48], information extractors from streams of news [7] or social network data [43], and those systems that require interactions with the environment [6], all belong to the latter setting. Supervisions are usually sparse and very limited, and the system is expected to perform "one-shot" learning and to quickly react to them.

Our work faces the problem of learning to extract information while reading a text stream, with the aim of identifying entities and relations in the text portion that is currently being read. This problem is commonly tackled by assuming the existence of a Knowledge Base (KB) of entities and relations, where several entity/relation instances are paired with additional information, such as the common ways of referring to them or sentences/facts in which they are involved. Then, once an input sentence is provided for reading, sub-portions of the text must be linked to entity or relation instances of the KB. The linking process introduces the challenging issue of dealing with multiple distinct entities (relations) that are mentioned with the same text, so the system has to disambiguate which is the "right" entity or relation instance of the KB for the considered text fragment. In particular, the context around the fragment or, if needed, information that was provided in the previous sentences of the text stream can be used to perform the disambiguation. As a very simple example, consider the sentence Clyde went to the office, where Clyde and the office are two text fragments that indicate entities, while went to is text that is about a relation. Clyde could be the mention used to indicate different people in the KB, and several offices could be mentioned by the expression the office (mentions to relations follow the same logic).

At first glance, this problem shares basic principles and intuitions with several existing methods, such as Entity Linking [44], Word Sense Disambiguation [39], Named Entity Recognition [5], Knowledge Population [40], and others (we postpone the description of the related approaches to Section 5). However, the problem we consider in this paper is considerably more challenging, since the aforementioned KB is not given in advance, and the system has to progressively build it, while also using such a KB to disambiguate the input data. This means that the system is required to decide whether a certain mention is about an entity/relation instance already inserted into the KB or about a never-seen-before entity/relation, and, in the latter case, to update the KB. Moreover, since we deal with streams of text, learning is performed in an online manner, and the system has to make a decision before processing the following sentence.

Motivated by the aforementioned challenges, in this paper we propose an end-to-end memory-augmented trainable system that learns to discover and disambiguate entity/relation instances in a text stream, exploiting a small number of sparse supervisions (supervisions are indications of the precise entity/relation instance that a fragment of text refers to) and operating in an online manner. In particular, this work makes the following contributions. (1) We propose a new online-learning method for populating a not-given-in-advance KB of entities and relations. (2) We introduce a new scheme to learn latent representations of entities and relations directly from data, either autonomously (self-learning) or using limited supervisions. Character-based encoders are exploited to handle small morphological variations (plurals, suffixes, …), typos, synonymy, and semantic similarity. (3) We present a new problem setting where the system is evaluated while reading a stream of sentences organized into short stories, requiring online learning capabilities. (4) We showcase the validity of our approach both on an existing dataset made of Wikipedia articles and on a new dataset that we introduce and make publicly available.

This paper is organized as follows. Section 2 formally describes the problem we face. The inferential process of the proposed architecture is described in Section 3, while the online learning dynamics are described in Section 4. Related work is reviewed in Section 5. Experiments and future plans are reported in Sections 6 and 7, respectively.

2 Problem Setting

We consider a continuous stream of text that, at each time step $t$, yields a sentence $s_t$. Groups of contiguous sentences are organized into small stories about a (not-known-in-advance) set of actors/objects, so that the narration is discontinuous whenever a new story begins.

We focus on the problem of developing a system that, given $s_t$, produces its interpretation by linking text fragments to a KB that the system itself is responsible for creating and updating (Figure 1). We think of the KB as a set of instances, and the considered text fragments of $s_t$ are mentions to them. Some mentions are about entities, others are about relations. For each instance, the KB stores (possibly) multiple mentions that are commonly used to refer to the instance itself, as for some of the entities in Figure 1. Conversely, the same mention can be shared by more than one instance, as in the example of Figure 1, where the same textual form Clyde refers to two different entities. KB instances also include information about the contexts in which instances have been mentioned in the stream so far, where the notion of context compactly indicates the whole sentence in which the instance was mentioned. Here and throughout the paper, we simplify the descriptions by frequently using the generic term instance to refer to both relations and entities, without making a precise distinction (if not needed).

We consider the case in which text fragments of $s_t$ are matched with the mentions in the KB to detect compatible instances. Then, instances are disambiguated by observing the current context (that is, the portion of $s_t$ around the mention), and exploiting the knowledge about the story to which $s_t$ belongs (up to the current sentence). At the beginning of each story, a small number of sentences are supervised with the identity of the mentioned entities and relations. The system is expected to follow the narration by disambiguating mentions, learning from such sparse supervisions (and quickly reacting to them), and discovering new instances. The natural ambiguity of language, the discontinuous narration, and the dynamic nature of the KB require the system to develop advanced disambiguation procedures to interpret $s_t$.
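The many-to-many association between surface forms and KB instances described above can be illustrated with a minimal sketch (class and method names are ours, purely illustrative): each instance stores the mentions used to refer to it and the contexts in which it was mentioned, and a reverse index maps each surface form to its compatible instances.

```python
from collections import defaultdict

class KB:
    """Toy sketch of the KB described in the text (names are illustrative)."""
    def __init__(self):
        self.mentions = defaultdict(set)    # instance id -> known surface forms
        self.contexts = defaultdict(list)   # instance id -> sentences seen so far
        self.by_mention = defaultdict(set)  # surface form -> candidate instance ids

    def add_occurrence(self, instance_id, mention, sentence):
        self.mentions[instance_id].add(mention.lower())
        self.contexts[instance_id].append(sentence)
        self.by_mention[mention.lower()].add(instance_id)

kb = KB()
kb.add_occurrence(0, "Clyde", "Clyde went to the office")  # one person named Clyde
kb.add_occurrence(1, "Clyde", "Clyde chased a mouse")      # a different Clyde
# The same mention is now shared by two instances, hence disambiguation is needed:
assert kb.by_mention["clyde"] == {0, 1}
```

The reverse index gives the fast candidate lookup; picking the right instance among the candidates is exactly the disambiguation problem this paper addresses.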

It is worth noticing that, in the proposed setting, the system decisions on a given sentence are evaluated immediately when processing the following sentences, regardless of whether the sentence was supervised or not. This schema gives rise to novel online dynamics of the learning process, profoundly different from most of the existing approaches, being more challenging and realistic than common batch-mode offline approaches (consider, for example, the case in which it is framed within conversational applications). This work encourages the development of new methods, models, and datasets for this extremely relevant, but largely unexplored, setting.

Figure 1: Left: text stream composed of two stories. Right: an example of KB. (a) Input: sentences from the stream. (b) System output: mentions to entities and relations are detected (pale yellow and pale blue background, respectively), and linked (dashed lines) to KB instances (circles); only some links are shown, for clarity. Empty circles are entity instances, while filled-grey circles are relation instances (circles are intended to also include the context-related information that characterizes the instances). Boxes represent known mentions, and they are connected with the compatible instances. We printed in the same color those mentions that should be linked to the same instance.

The proposed model has the potential of being enhanced by introducing a more structured and advanced knowledge component (types, facts, abstract notions for higher-level inference, such as logic formulas). For example, once entities and relations are detected in a sentence, they could be combined together to create symbolic representations of facts. Thus, the KB could be represented as a knowledge graph, where entities are the vertices and relations are the edges of the graph. As the knowledge base becomes more structured, the added information may be used to enhance the disambiguation procedure, or to introduce a further level of abstraction on which symbolic reasoning can be performed. In this work we do not face the problem of designing an enhanced knowledge base module; we focus on the tasks of instance discovery and disambiguation in text streams, which provide the basic functionalities required to link the input text to the instances in the KB.

Section 3 will describe our model from an operational point of view, while Section 4 will describe the learning dynamics associated with the online learning framework.

3 Model

At each time step, the system processes an input sentence $s$ (we drop the time index, for simplicity), and it detects those text portions (i.e., one or more adjacent tokens) that are expected to mention KB instances. This task is referred to as mention detection. For each candidate mention $m$ in $s$, the system predicts how much each KB instance is compatible with $m$ in the context where it appears in $s$. The prediction scores are collected in a vector $o$, where each entry corresponds to a given instance in the KB. Then, the mention is linked to the most-likely instance $k$ as

$$k = \arg\max_j \,\big[f(m, C_m)\big]_j \qquad (1)$$

where $f$ is the function that computes the affinity scores of $m$ with respect to all the KB instances, given the context of the current sentence $s$. In this work, we replace the second argument of $f$ by the sequence $C_m$ of mentions detected in $s$, excluding $m$ itself.

Figure 2: Computational flow of the model for mention-instance linking. The system can be seen as a sequential composition of sub-systems that process the input sentence $s$ and finally output the identifiers of the instances that are linked to the mentions detected in $s$ ($m$ is a generic mention detected in $s$).

The whole system is the composition of multiple computational modules, which we sketch in Figure 2. The Mention Detector segments $s$ by identifying mentions to entities and relations. For example, Parry is chasing a mouse is segmented into two mentions to entities (i.e., "Parry" and "a mouse") and a mention to a relation (i.e., "is chasing"). Each mention and its context are encoded into vectorial representations (i.e., embeddings) by the Encoder module. The Candidate Generator produces a probability distribution over the KB instances, by combining different sources of information (e.g., surface form of the mention, embedding, temporal coherence in the current story). Finally, a Disambiguator takes the final decision on which is the most likely KB instance to link, using the aforementioned probability distribution and the embedding of the mention context. Sections 3.1, 3.2, 3.3, and 3.4 describe each computational block in detail.

3.1 Mention Detection

The goal of the Mention Detector (first block of the pipeline in Figure 2) is to segment each sentence into non-overlapping text fragments that are mentions to (possibly yet-unknown) entity or relation instances.

Motivated by the need to develop models that are robust to morphological changes and that do not depend on a pre-defined vocabulary of words (as needed by interactive/conversational applications), we process the input data using a character-level encoding, following the approach we proposed in [27]. In particular, Bidirectional Recurrent Neural Networks (BiRNNs) are exploited to build vectorial representations of words at multiple levels. Given an input sentence $s$, composed of words $w_1, \ldots, w_n$, the word embedding $e_i$ of the word $w_i$ is

$$e_i = B\big(c^i_1, \ldots, c^i_{|w_i|}\big) \qquad (2)$$

where $c^i_j$ is the $j$-th character of the $i$-th word and $B$ is a BiRNN outputting the concatenation of its hidden states in both directions. Since the morphological representation of a word is usually meaningless if taken in isolation from its context, we compute the contextualized word embedding $\tilde{e}_i$ of the word $w_i$ as

$$\tilde{e}_i = \big[\overrightarrow{R}(e_1, \ldots, e_i),\ \overleftarrow{R}(e_n, \ldots, e_i)\big] \qquad (3)$$

where $\overrightarrow{R}$ and $\overleftarrow{R}$ are two RNNs processing the character-level embeddings of the words in the left and right contexts of $w_i$ within the sentence $s$, including $w_i$ itself. $\overrightarrow{R}$ and $\overleftarrow{R}$ output their last hidden state, and we denote with $[\cdot,\cdot]$ the concatenation operation. All the RNNs used in this work are LSTMs.
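The character-level word-embedding computation above can be sketched in a few lines of numpy. This is a toy illustration only: the paper uses trained LSTMs, whereas here we use an untrained plain (Elman) RNN cell, random weights, and a crude character hashing, all of which are our simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
H, C = 8, 32  # hidden size and input size (characters hashed into C buckets)

# random, untrained parameters (stand-ins for a learned char-level RNN)
Wx = rng.normal(scale=0.1, size=(H, C))
Wh = rng.normal(scale=0.1, size=(H, H))

def rnn(chars):
    """Run a plain RNN over a character sequence and return the last hidden state."""
    h = np.zeros(H)
    for ch in chars:
        x = np.zeros(C)
        x[ord(ch) % C] = 1.0  # one-hot(-ish) character input
        h = np.tanh(Wx @ x + Wh @ h)
    return h

def word_embedding(word):
    # concatenate the final states of the two directions, as described above
    return np.concatenate([rnn(word), rnn(reversed(word))])

e = word_embedding("office")
assert e.shape == (2 * H,)
```

Because the encoder reads characters, morphological variants such as "office" and "offices" share most of their input sequence, which is the property the paper relies on for robustness to plurals, suffixes, and typos.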

The output of this computational block is based on the predictions of an MLP classifier that processes (one by one) the contextualized embeddings of the words in the input sentence, computed as in Eq. (3). The MLP is trained using supervised learning, with a tagging scheme similar to [21]. In particular, each word needs to be classified as being the begin, inside, or end word of either an entity or a relation mention, for a total of 6 classes. As a result, each mention is composed of the sequence of words where the first word is tagged with the begin tag, the last word with the end tag, and the other words are predicted as inner (tags must all be of the same type, either entity or relation).
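Decoding the 6-class per-word predictions into mention spans can be sketched as follows (the tag names `B-E`/`I-E`/`E-E` for entities and `B-R`/`I-R`/`E-R` for relations are our assumed naming, not the paper's):

```python
def decode_mentions(tags):
    """Turn begin/inside/end tags into inclusive (start, end, type) spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            start = i
        elif tag.startswith("E-") and start is not None:
            spans.append((start, i, tag[2:]))  # type is "E" (entity) or "R" (relation)
            start = None
    return spans

# Toy tagging: entity, relation, entity (each span opened by B- and closed by E-)
tags = ["B-E", "E-E", "B-R", "I-R", "E-R", "B-E", "E-E"]
assert decode_mentions(tags) == [(0, 1, "E"), (2, 4, "R"), (5, 6, "E")]
```

How single-word mentions are encoded (e.g., begin immediately followed by end) is not specified in the text, so this sketch simply requires a begin tag and a matching end tag per span.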

As remarked in Sections 1 and 2, we exploit sparse supervisions, so that training the MLP-tagger (and, in turn, the embedding networks) might be difficult. However, we follow the intuition that syntax has a crucial role in text segmentation: noun phrases are mentions to entities, while fragments that start with a verb and end with a preposition (if any) are mentions to relations [9] (see the supplemental material for the details). Hence, we can use these rules to automatically generate artificial supervisions on large collections of text to pre-train the Mention Detector.
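A minimal sketch of these syntax-driven labeling rules is given below; the tiny hand-written POS lexicon stands in for a real part-of-speech tagger, and the coarse `ENT`/`REL`/`O` labels are our simplification of the 6-class scheme:

```python
# Stand-in POS lexicon (a real implementation would use a trained POS tagger).
POS = {"clyde": "NOUN", "the": "DET", "office": "NOUN",
       "went": "VERB", "to": "ADP"}

def auto_label(words):
    """Noun phrases -> entity mentions; verb (+ trailing prepositions) -> relation."""
    tags, i = [], 0
    while i < len(words):
        pos = POS.get(words[i].lower(), "X")
        if pos in ("DET", "NOUN"):            # noun phrase -> entity mention
            j = i
            while j < len(words) and POS.get(words[j].lower(), "X") in ("DET", "NOUN"):
                j += 1
            tags.extend(["ENT"] * (j - i)); i = j
        elif pos == "VERB":                   # verb followed by prepositions -> relation
            j = i + 1
            while j < len(words) and POS.get(words[j].lower(), "X") == "ADP":
                j += 1
            tags.extend(["REL"] * (j - i)); i = j
        else:
            tags.append("O"); i += 1
    return tags

assert auto_label("Clyde went to the office".split()) == \
    ["ENT", "REL", "REL", "ENT", "ENT"]
```

Labels produced this way are noisy, but they are free and plentiful, which is exactly what is needed to pre-train the tagger before the sparse human supervisions arrive.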

3.2 Mention and Context Encoding

Once mentions have been detected, the Encoder module (second block of Figure 2) encodes them and their contexts into vectorial representations, following an encoding scheme similar to the one in Section 3.1. However, after the processing of the previous computational block, the input sentence becomes a sequence of mentions, each composed of one or more words. In detail, at this stage, $s$ is a sequence of mentions $m_1, \ldots, m_{|s|}$. Two different vectorial representations are computed for each mention. First, the mention embedding $e_{m_i}$ of $m_i$ is obtained as

$$e_{m_i} = B_m\big(c_1, \ldots, c_{|m_i|}\big) \qquad (4)$$

being $c_1, \ldots, c_{|m_i|}$ the sequence of characters of the mention and $B_m$ a character-level BiRNN as in Eq. (2). Second, the context embedding $c_i$ is computed from the other mentions $m_j$, $j \neq i$, in $s$, as

$$c_i = \big[\overrightarrow{R}_m(e_{m_1}, \ldots, e_{m_{i-1}}),\ \overleftarrow{R}_m(e_{m_{|s|}}, \ldots, e_{m_{i+1}})\big] \qquad (5)$$

where $\overrightarrow{R}_m$ and $\overleftarrow{R}_m$ are defined as in Eq. (3). Notice that, differently from Eq. (3), here we are encoding only the context around the considered mention, excluding the mention itself.

In order to learn the parameters of the encoders ($B_m$, $\overrightarrow{R}_m$, and $\overleftarrow{R}_m$), we follow the principles at the basis of the Word2Vec-CBOW architecture [29], which is rooted in the idea of encoding the context around a word and decoding the word itself, to generate representations where synonyms or semantically similar words are close to each other in the embedding space. However, differently from the original CBOW, we use a character-level input encoding (Eq. 4), so that we also expect morphologically similar inputs to be close in the embedding space. We also introduce another feature that makes our approach different from other related systems (e.g., [16, 27]): we also keep the decoding stage at the character level, thus avoiding the need for a large mention vocabulary.

In particular, once the context around mention $m_i$ is encoded into the context embedding $c_i$ (Eq. 5), we exploit an LSTM-based decoder whose initial state is set to $c_i$, and that generates the sequence of characters that compose $m_i$. The encoder and the decoder can be trained by exploiting the cross-entropy loss function over the character predictions.

Once the Mention Detector has been pre-trained as suggested in Section 3.1, the Encoder module can be pre-trained as well without any human intervention, by processing large collections of text and learning to decode each mention.

3.3 Candidate Generation

Given an input mention $m$ from the current sentence and its embedding $e_m$, the Candidate Generator (third block of Figure 2) implements four memory components that are used to generate a list of candidate KB instances compatible with $m$, and, afterwards, to store the information on the disambiguated instance. Before providing further details on the candidate generation process, we describe the four memory components, as shown in Figure 3.

Figure 3: A graphical representation of the four memory components $\mathcal{W}$, $E$, $M$, and $T$. $\mathcal{W}$ collects the (lowercase) mentions that are stored in the KB. For each of them, the embedding vector (blue rectangle) is stored in a row of the matrix $E$. The affinity scores of the considered mention, with respect to each of the instances in the KB, are stored in a row of the matrix $M$. Finally, $T$ contains the KB instances that have been recently linked by the system (here represented as a histogram of the number of times each KB instance was recently linked).

The memory component $\mathcal{W}$ is an ordered set that collects all the mentions that were processed up to the current sentence; this allows a fast lookup of previously predicted instances for specific mentions. For example, if the mention John Doe was previously assigned to instance $j$, when processing the same mention again the system can easily hypothesize that it still belongs to instance $j$.
The matrix $E$ stores (row-wise) the embeddings of the mentions in $\mathcal{W}$, computed with the encoder of Section 3.2, Eq. (4). Thanks to a similarity measure in the embedding space, this component allows the system to associate KB instances with never-seen-before mentions that are small variations of previously seen ones, or that refer to semantically similar elements. For example, given the never-seen-before mention John D., the system could easily predict that it still belongs to instance $j$, since its character-level embedding is close to the one of John Doe, even though the exact lookup in $\mathcal{W}$ failed.
The set $T$ keeps track of the last disambiguated instances (with repetitions). This memory naively allows the system to handle co-references. As we will see shortly, the system can learn that some specific inputs (e.g., pronouns, category identifiers, etc.) are often assigned to recently mentioned instances, making valuable temporal hypotheses when it has to disambiguate such inputs.
The matrix $M$ stores (row-wise) the instance-activation scores of each mention in $\mathcal{W}$. Each row is associated with a mention in $\mathcal{W}$, and each column corresponds to a KB instance. The row of $M$ associated with a certain mention $m$ models how strongly each KB instance is associated with $m$. The matrix $M$ is learned while reading the text stream, as we will describe in Section 4.
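A compact sketch of the four memory components (with our own naming, and illustrative initialization choices) can clarify how they grow together as new mentions are observed:

```python
import numpy as np

class CandidateMemory:
    """Toy sketch of the Candidate Generator's memory (names are ours)."""
    def __init__(self, emb_dim, n_instances):
        self.mentions = []                   # ordered set of known (lowercase) mentions
        self.E = np.zeros((0, emb_dim))      # one mention embedding per row
        self.M = np.zeros((0, n_instances))  # mention-vs-instance activation scores
        self.T = []                          # recently disambiguated instances (with reps)

    def add_mention(self, mention, embedding):
        if mention not in self.mentions:
            self.mentions.append(mention)
            self.E = np.vstack([self.E, embedding])
            # a new row of M starts with low pre-sigmoid scores (value is our choice),
            # so that no instance is initially activated for the new mention
            self.M = np.vstack([self.M, -5.0 * np.ones(self.M.shape[1])])

mem = CandidateMemory(emb_dim=4, n_instances=3)
mem.add_mention("john doe", np.ones(4))
assert mem.E.shape == (1, 4) and mem.M.shape == (1, 3)
```

Rows of `E` and `M` stay aligned with the ordered mention list, which is what makes the index-based lookups of the candidate generation routine possible.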

In the following, we show how the Candidate Generator exploits these memory components to generate candidates of KB instances, given an input mention. For the purpose of this description, we suppose that the current memory consists of $n$ instances and $v$ mentions. The candidate generation routine first outputs three hypotheses, represented by three vectors $h_1$, $h_2$, $h_3$ of $n$ scores each. Each score is in $[0,1]$, and it represents how strongly an instance is a candidate for being linked to the mention $m$. For example, $[h_1]_j$ (the $j$-th component of $h_1$) models the probability of the instance $j$ of being a link candidate according to the first hypothesis. Then, the three hypotheses are combined into a final vector $h$. A visual representation of this combination is shown in Figure 4.

Figure 4: A visual representation of the combination of the three hypotheses. $h_1$ suggests that the current mention (i.e., "john doe") has often been linked to entity instance 1. $h_2$ indicates that the embedding of the current mention is very similar to the embeddings of mentions usually linked to entity 1. Finally, $h_3$ indicates that entity 1 has been recently mentioned several times.

The first vector, $h_1$, named string-match hypothesis, contains the activation scores of the instances given the string $m$. It models the idea that the surface form of $m$ is a strong indicator for spotting candidate links. Formally,

$$h_1 = \sigma\big(M_{r(m)}\big) \qquad (6)$$

where the function $r(m)$ returns the index of $m$ in $\mathcal{W}$, $\sigma$ is a sigmoidal function that operates element-wise on its input (yielding output values in $(0,1)$), and the subscript after the matrix $M$ indicates the matrix row.

The second hypothesis, $h_2$, named embedding-match hypothesis, collects the activations of the instances given the embedding $e_m$ of $m$. The rationale behind it is that if $e_m$ is similar to an embedding of a known mention (in the sense of the cosine similarity), then it is likely to activate the same instances. Due to the way our embeddings are computed (Section 3.2), we expect that two embeddings are "close" if they have similar roles in the processed sentences (semantic similarity, synonymy) and similar morphological properties (due to the character-level input). Formally:

$$h_2 = \cos(e_m, E)\,\sigma(M) \qquad (7)$$

where the notation $\cos(e_m, E)$ indicates the vector $[\cos(e_m, E_1), \ldots, \cos(e_m, E_v)]$. Notice that $\sigma(M)$ is a matrix of the same size of $M$, and Eq. (7) involves a vector-times-matrix operation that basically computes a weighted sum of the rows of $\sigma(M)$ according to the similarity between $e_m$ and the stored embeddings (in our implementation, we kept only the top-$k$ cosine similarities, forcing the other ones to 0).

The third hypothesis, $h_3$, named temporal hypothesis, implements the idea that recently disambiguated instances are good candidates for co-reference resolution (temporal locality). In other words, if a story is talking about a certain entity, it is likely that the narration will make references to it using some new surface forms. For example, given an entity labeled Donald Trump, there could be ambiguous mentions like Donald, Mr. Trump (other people are called this way), or the president (which cannot be captured by the other two hypotheses). In these cases, temporal locality has a crucial role, which is even more evident when using pronouns, which are shared by several instances. Formally, we have

$$[h_3]_j = \frac{\mathrm{count}(j, T)}{|T|} \qquad (8)$$

where $\mathrm{count}(j, T)$ returns the number of occurrences of instance $j$ in $T$.

The Candidate Generator merges the three hypotheses and produces the final hypothesis $h$. The idea behind this operation is to give more priority to $h_1$ than to $h_2$ when the former is strongly activated (since $h_1$ is about "exact" matches in terms of surface forms). The importance of the temporal component in the merge operation depends on the current mention $m$. For example, if $m$ is a pronoun, the system must trust $h_3$ more than the others, while, in case of some unambiguous mention, it must learn that $h_3$ is not important. We let the system learn the importance of $h_3$ depending on the mention embedding $e_m$. Formally, the system computes $\beta = g(e_m)$, where $g$ is a learnable function (whose form will be defined shortly), and the vector of merged hypotheses is

$$h = h_1 + (1 - h_1) \odot h_2 + \beta\, h_3$$

where $\odot$ is the element-wise product. The vector $h$ contains a set of scores that model how strongly each KB instance is related to the current input mention.
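The three hypotheses can be sketched with numpy as follows. The string-match, embedding-match, and temporal scores follow the descriptions above; the merge formula, in particular, is only described qualitatively in the text, so the one implemented here (h1 dominating h2 when strongly active, a learned weight on h3) is an illustrative choice of ours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cos_rows(v, E):
    """Cosine similarity between vector v and each row of matrix E."""
    denom = np.linalg.norm(E, axis=1) * np.linalg.norm(v) + 1e-12
    return E @ v / denom

def candidate_hypotheses(m_idx, e_m, M, E, T, n):
    h1 = sigmoid(M[m_idx])                    # string-match: row of M for this mention
    h2 = cos_rows(e_m, E) @ sigmoid(M)        # embedding-match: similarity-weighted rows
    counts = np.bincount(np.asarray(T, dtype=int), minlength=n)
    h3 = counts / max(len(T), 1)              # temporal: recent-link frequencies
    return h1, h2, h3

def merge(h1, h2, h3, beta):
    # Illustrative merge: h1 suppresses h2 where it is strong; beta weighs recency.
    return h1 + (1.0 - h1) * h2 + beta * h3
```

In the full model, `beta` would be produced by the learnable function of the mention embedding, and the embedding-match term would keep only the top-k cosine similarities.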

We kept the model as simple as possible, considering only the three hypotheses that were necessary to cover all the ambiguities present in the task at hand. However, any number of hypotheses can be attached to the candidate generation module, making our model adaptable to different NLP problems.

3.4 Disambiguation

While the candidate generation routine only focuses on the mention $m$, the Disambiguator (Figure 2, fourth block) is responsible for determining the most likely KB instances given the context of $m$. The representation $c$ of the context is computed by the Encoder (Section 3.2) by Eq. (5). The Disambiguator is based on the functions $f_1, \ldots, f_n$, also referred to as disambiguation units. In detail, each $f_j$ is associated with a KB instance, and it is learned while reading the text stream in a supervised or unsupervised way, as we will describe in Section 4. In particular, in the considered problem, we cannot make use of any discriminative information: when we receive the supervision that the input mention is about a certain KB instance (or when the model decides this in an unsupervised manner), we cannot infer that the context of the considered mention is not compatible with any other KB instance. As a matter of fact, the same context may be shared by several instances, so each $f_j$ must have the capability of learning from positive examples only. For this reason, we implement $f_j$ with a locally supported similarity measure,

$$f_j(c) = \max_{q=1,\ldots,Q} \exp\big(-\lambda\,(1 - \cos(c, w_{j,q}))\big) \qquad (9)$$

that models the distribution of the contexts for instance $j$ by means of the centroids $w_{j,1}, \ldots, w_{j,Q}$. As we will describe in Section 4, these centroids are developed in an online manner, and, in the unsupervised case, we end up with an instance of online spherical K-Means. Also the previously mentioned function $g$ needs to be locally supported (for the same reasons), so we implemented it following Eq. (9) as well.

We combine the activations of the candidates (i.e., the vector $h$), and the disambiguation-unit outputs, $f_j(c)$, to get the output of the system,

$$o = \mathbb{1}[h > \rho] \odot \big(\alpha\, h + (1-\alpha)\, f(c)\big) \qquad (10)$$

where $f(c) = [f_1(c), \ldots, f_n(c)]$ and $\mathbb{1}[\cdot]$ returns a binary vector with 1's in those positions for which the (element-wise-evaluated) condition in brackets is true. The scalar $\rho$ is a reject threshold, and $\alpha$ is a tunable parameter that controls the role of the hypothesis $h$ in the decision process (for simplicity, we keep $\alpha$ fixed). The reject threshold allows us to avoid computing the $f_j$'s associated with very-low-probability candidate instances.
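The locally supported disambiguation units and the thresholded combination can be sketched as follows. The exact functional forms are not fully specified in the text, so the cosine-based units, the sharpness `lam`, and the mixing weight `alpha` are our assumptions, consistent with the qualitative description (positive-only learning, spherical centroids, reject threshold):

```python
import numpy as np

def f_unit(c, centroids, lam=5.0):
    """Locally supported unit: high only when c is close (in cosine) to a centroid."""
    sims = centroids @ c / (np.linalg.norm(centroids, axis=1) *
                            np.linalg.norm(c) + 1e-12)
    return float(np.exp(-lam * (1.0 - sims.max())))

def system_output(h, c, all_centroids, rho=0.1, alpha=0.5):
    """Combine candidate scores h with the disambiguation units, as sketched above."""
    o = np.zeros_like(h)
    for j, hj in enumerate(h):
        if hj > rho:  # reject threshold: skip very-low-probability candidates
            o[j] = alpha * hj + (1.0 - alpha) * f_unit(c, all_centroids[j])
    return o
```

Note that units below the reject threshold are never evaluated at all, which is the computational saving the reject threshold is meant to provide.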

4 Online Learning Dynamics

The system reads data from a text stream and it optimizes the model parameters in an online manner. Before going into further details, we recall that the learnable parameters of the proposed model are the matrix $M$ in the memory component, the centroids of the disambiguation units $f_j$ and of the temporal relevance function $g$ (i.e., the parameters of the Candidate Generator and of the Disambiguator), and the weights of the LSTMs in the Mention Detector and in the Encoder.

The system starts with an empty KB (so $M$ is not allocated yet), and with randomly initialized model parameters. As already introduced in Sections 3.1 and 3.2, we can pre-train the modules that constitute the preliminary stages of the system pipeline of Figure 2, i.e., the Mention Detector first, and then the Encoder (in both cases, without human intervention). In other words, the system will start reading data from the text stream and it will progressively acquire the skill of detecting mentions to entities and relations, and the skill of encoding such mentions and their contexts. When a stationary condition is reached in the detector and encoders, the system can start to develop the KB-based disambiguator as well, and eventually refine and improve the pre-trained modules.

foreach mention in the current sentence do
    if no supervision is provided then
        if some instances are recognized then
            reinforce the associated disambiguation units
        else if the disambiguation is uncertain then
            take no action
        else (no instance is recognized)
            create a new instance                                       (*)
            reinforce the new disambiguation unit until its output
            exceeds the accept threshold
    else (a supervision is provided)
        if the supervision is already known then
            reinforce the associated disambiguation unit until its
            output exceeds the accept threshold                         (**)
            penalize the other disambiguation units until their
            outputs fall below the reject threshold
        else
            let k be the disambiguated instance
            if k was already associated with another supervision then
                create a new instance and proceed as in (*)
            else
                associate the supervision with the k-th disambiguation
                unit and proceed as in (**)
end foreach
Algorithm 1 Learning Dynamics

While processing the text stream, according to Eq. (1), each detected mention $m$ is associated with a disambiguated KB instance $k$. Before starting the disambiguation, the system verifies whether $m$ is already in $\mathcal{W}$. If it is not, then $m$ is included in $\mathcal{W}$, its embedding is appended to $E$, and a new row is added to $M$ (with values whose sigmoidal activations are close to 0). The learning stage consists in an online process that optimizes the model parameters according to either self-learning or a supervision about the target instance. A sketch of the whole learning stage is shown in Algorithm 1.

Self-Learning. When no supervision is provided, the learning dynamics change as a function of the confidence that the system has in formulating hypotheses ($h$) and in disambiguating the mention ($o$). We distinguish among three cases:

  • recognized some instances: $\max_j [o]_j \geq \bar{\rho}$;

  • uncertainty: neither of the other two conditions holds;

  • unknown instance: $\max_j [h]_j \leq \rho$,

where $\rho$ is the aforementioned reject threshold, and $\bar{\rho} > \rho$ is an accept threshold (both are fixed hyperparameters). In case i. the response of the system has been rather strong in indicating at least one instance, therefore the prediction of the model is considered reliable enough, so the decision must be reinforced for all those disambiguation units that generated outputs in $o$ above the accept threshold $\bar{\rho}$. This is done in a self-learning fashion [34], by means of a single online gradient-based update, with the aim of minimizing the quadratic loss that measures the distance between the selected disambiguation unit outputs (indexed by the set $J$) and 1, that is $\sum_{j \in J}(1 - [o]_j)^2$ (i.e., we reinforce the selected outputs). In case ii. the system activates some candidates but it is uncertain in the disambiguation, so no further actions are taken. Case iii. is triggered when $h$ is composed of only low-confidence candidate activations. This situation happens when the candidate generation module does not find a known instance that is compatible with the current mention, which is likely to indicate the occurrence of a new entity/relation. Therefore, the system creates a new instance in the KB and reinforces its disambiguation unit until its response is above $\bar{\rho}$, to develop the new instance model (i.e., multiple gradient-based updates).
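The three-way case selection above can be sketched as a simple decision function (the threshold values, and the precise form of the conditions, are our reading of the text):

```python
def self_learning_action(h, o, rho=0.1, rho_bar=0.8):
    """Pick the self-learning case from the hypothesis and output scores."""
    if max(h) <= rho:
        return "create-new-instance"       # case iii: unknown instance
    if max(o) >= rho_bar:
        return "reinforce-selected-units"  # case i: recognized some instances
    return "no-action"                     # case ii: uncertain disambiguation

assert self_learning_action([0.05, 0.02], [0.0, 0.0]) == "create-new-instance"
assert self_learning_action([0.9, 0.1], [0.85, 0.1]) == "reinforce-selected-units"
assert self_learning_action([0.5, 0.1], [0.4, 0.1]) == "no-action"
```

The key design point is the gap between the two thresholds: the region between `rho` and `rho_bar` is a deliberate "abstain" zone, so that self-learning only acts on confident predictions or clearly novel mentions.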

Supervision. When a supervision $\hat{y}$ is provided for the mention $m$, we want the system to immediately learn from it. The system keeps track of the mapping between the set of user-provided supervisions and the set of instances in the memory components (the user is not aware of how the system handles instances). When $\hat{y}$ is a known supervision, the index of the corresponding instance is found, and the output of the disambiguation unit associated with this instance is reinforced until it is greater than the accept threshold $\bar{\rho}$. We also push the outputs of the other disambiguation units towards 0, by minimizing the quadratic loss $\sum_{j \notin J} [o]_j^2$. This implements a penalization process, which involves multiple gradient-based updates until all the involved outputs fall below $\rho$ (notice that, whenever the system needs to reinforce the $j$-th output $[o]_j$, and it also holds that $[h]_j \leq \rho$, then, due to $\mathbb{1}[\cdot]$ in Eq. (10), we get no gradient, so we first increase $[h]_j$ until it is above $\rho$). When $\hat{y}$ is a never-seen-before supervision for the system, it is associated with the disambiguated instance $k$. On the other hand, if $k$ was previously associated with another supervision symbol different from $\hat{y}$, then we have a collision in the mapping, and we solve it by creating a new instance and by associating it to $\hat{y}$. Then, we follow the same steps as in case iii above. Supervisions may not only be related to the instance-label of the detected mentions, but they can also be associated with the detection of the mention itself. For example, the user could label a mention that does not correspond with the detected ones. This supervision signal is propagated to the mention detector, which can be refined and improved. In turn, the mention and context encoders could be refined as well. The investigation of these refinement procedures goes beyond the scope of this paper.
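The supervision-to-instance bookkeeping, including the collision case, can be sketched as follows (class and method names are ours; the real system also performs the gradient-based reinforcement steps described above):

```python
class SupervisionMap:
    """Toy sketch of the mapping between supervision labels and KB instances."""
    def __init__(self):
        self.label_to_instance = {}
        self.instance_to_label = {}
        self.next_instance = 0

    def new_instance(self):
        k = self.next_instance
        self.next_instance += 1
        return k

    def resolve(self, label, disambiguated_k):
        if label in self.label_to_instance:            # known supervision
            return self.label_to_instance[label]
        if disambiguated_k in self.instance_to_label:  # collision: k already taken
            disambiguated_k = self.new_instance()
        self.label_to_instance[label] = disambiguated_k
        self.instance_to_label[disambiguated_k] = label
        self.next_instance = max(self.next_instance, disambiguated_k + 1)
        return disambiguated_k

sm = SupervisionMap()
assert sm.resolve("Clyde#1", 0) == 0  # first supervision binds instance 0
assert sm.resolve("Clyde#2", 0) == 1  # collision: a new instance is created
assert sm.resolve("Clyde#1", 5) == 0  # a known label overrides the prediction
```

Keeping both directions of the mapping is what makes collision detection a constant-time check when a new label arrives with an already-bound predicted instance.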

5 Related Work

Mining over text streams has been studied in a number of works [3, 19, 1], with several different purposes that are, however, different from what we consider in this paper. Our approach to the learning problem is based on simple sentences that have the same structure as the ones used in many tasks of the bAbI project by Facebook [46, 20]. However, none of those tasks is conceived for online learning or for entity/relation extraction and disambiguation. Interesting ideas on entity-oriented sub-symbolic memory components have recently been proposed by [13, 15], and extended to the case of relations by [4]: their formulation is developed to comply with the aforementioned bAbI tasks. The idea of considering small text passages could resemble the task of Machine Comprehension, where, however, such passages are read with the purpose of answering a question [42, 41, 18]. Concerning the input representation, we took inspiration from works that exploit character-level embeddings to build models that also take into account morphological information [17, 16].

Our approach disambiguates mentions using their contexts, so it shares several aspects with Word Sense Disambiguation (WSD) and Entity Linking (EL), which, differently from our case, assume a given KB. In WSD [49, 39] the set of target words is known and the senses of each word are taken from a given dictionary. EL [44] is similar to WSD [33, 32], but it is about linking "potentially partial" entity mentions to a target KB of an encyclopaedic nature [11, 32]. The EL problem is presented in several variants and focusses on different types of data [23, 10, 37, 45, 38, 22], and it has been the subject of task-oriented evaluation procedures and benchmarks [25, 47]. A few EL systems work in an unsupervised way [12, 37], but the KB is still given. Named Entity Recognition (NER) focusses on discovering mentions of entities, and it is also a basic module of several EL systems [24]. However, NER is concerned with proper nouns, as EL frequently is [23], while here we also consider common nouns. Moreover, NER systems output the entity type (person, location, etc.) without producing any instance-level information [21, 5]. Relation Extraction (RE) has recently been approached with end-to-end and advanced embedding-based models [31, 36]. The entities involved in the target relation are usually known, and a pre-defined ontology is given (distant supervisions are also used, as in [30]). There are ongoing discussions on how to better state the RE problem and build accurate gold labels [28].

Finally, learning the KB component is the subject of the tasks of automatic KB Construction [35] and KB Population [8, 14], which, differently from the case of this paper, either make some application-specific assumptions to implement the KB, or exploit a given ontology schema, also combining unsupervised and supervised learning with ensembles and stacking techniques [40].

6 Experiments

6.1 Datasets

Simple Story Dataset. A detailed experimentation was carried out on a new dataset that we created and made publicly available at http://sailab.diism.unisi.it/stream-of-stories/. We remark that the problem we face shares some aspects with existing benchmarks, but none of them really focusses on what we introduced in Section 2. The data we created are composed of a stream of 10,000 sentences, organized into 564 stories (similarly to what is reported in Figure 1). Each story is composed of a list of non-repeated facts, involving 130 entity and 27 relation instances that belong to a pre-designed ontology (not provided to the system) shown in Figure 5. Facts in a story mostly talk about a certain entity, which we refer to as the "main entity", and which can also appear in other stories. Entities and relations are mentioned with different surface forms (synonyms, sub-portions of names, etc.).

In particular, each sentence is constituted by a triple of mentions, involving two entities and a relation. We automatically generated the data after having defined the aforementioned ontology with and entity and relation types, respectively (Figure 5). Different kinds of noise are introduced, from character-level perturbations (simulating typos) to non-main-entity-related story facts (to make descriptions slightly depart from the main subject). The resulting dataset consists of unique single-word tokens, and of dictionaries of and mentions to entities and relations, respectively, which include different variations (typos, determiners, etc.) of base entity mentions and of base relation mentions. There are co-references (including pronouns). Finally, mention occurrences are ambiguous (i.e., they refer to multiple instances).

Figure 5: Ontology Graph of entity (nodes) and relation (edges) types in the Simple Story Dataset. Dashed lines between node pairs indicate a hierarchy between the two types.
Figure 6: Accuracy () for different amounts of supervision in the Simple Story Dataset, in the case of entities (two leftmost graphs), and relations. We include two competitors (Deep-RNN, RB), our system, and our system reading the stream again.
All:
  RB
  Our Model    54.57   69.64   75.45
  Re-Reading   44.75   54.66
Last:
  RB
  Our Model    67.45   75.37
  Re-Reading   43.41   53.38
Table 1: Accuracy () for different amounts of supervision in the WikiFacts Dataset.
Figure 7: Accuracy () at different time instants for the Simple Story Dataset.

WikiFacts. WikiFacts is a dataset proposed in [2] where Wikipedia pages are loosely aligned with Freebase triples. It is composed of a collection of summaries, each being the textual description of a certain entity belonging to the domain of movies. The textual span where an entity is mentioned in a Wikipedia summary is annotated with its identifier on Freebase, and with the Freebase facts in which it participates. Relations are not explicitly segmented in the text. Overall, the dataset contains about 560k entities extracted from 10k pages. WikiFacts can be adapted to the problem setting we consider in this paper (Section 2). In particular, we focussed on a sub-portion of the data containing 10k Freebase facts (1112 pages), and we used each summary as a story, keeping only stories longer than three sentences. We considered all the sentences that include at least two entities, and we artificially marked as relation the text span between two consecutive entities. Since relations are not annotated with any Freebase identifiers, we only measure the accuracy of linking mentions to entity instances. A remarkable difference between the previous dataset and WikiFacts is that, in the latter, entities are repeated less often among the stories. There are 4431 entities in 10k facts, compared to the 130 entities of the previous data collection.
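The adaptation that marks the span between two consecutive entities as a surrogate relation mention can be sketched as follows. This is a simplified illustration of the described pre-processing, assuming tokenized sentences and entity spans given as (start, end) token indices; it is not the authors' exact pipeline.

```python
def mark_relations(tokens, entity_spans):
    """Given a tokenized sentence and the sorted (start, end) token spans of its
    entity mentions, return the text between each pair of consecutive entities,
    treated here as a surrogate relation mention."""
    relations = []
    for (_, end_prev), (start_next, _) in zip(entity_spans, entity_spans[1:]):
        relations.append(" ".join(tokens[end_prev:start_next]))
    return relations
```

For example, in "Alice directed the movie Vertigo" with entities "Alice" and "Vertigo", the span "directed the movie" would be marked as the relation mention.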

6.2 Learning Settings

We split each story into two parts: a supervised and an unsupervised one. The supervised portion covers the first sentences of the story, and, in different experimental settings, it consists of , , and of the story sentences. The system reads the data, receives supervisions, and makes predictions on the unsupervised sentences, following the stream order. The accuracy of each prediction is measured at the time when the prediction is made. The results are reported considering all the unsupervised sentences of a story (ALL; this set of sentences differs as a function of the supervised-portion size), and only the last sentence of the story (LAST; this set is the same for all supervised-portion sizes). They are averaged over each story first, and then over the whole set of stories. We use the same criterion as the cluster purity measure [26] to map unsupervised outputs to the ground truths, where, in case of conflicting assignments, we only keep the mappings determined by the largest statistics. The map used to convert predictions is computed with the statistics accumulated up to the time when the prediction is made.
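The purity-based mapping from predicted instances to ground-truth labels, computed online as described above, can be sketched as follows. This is a minimal illustration (class and method names are ours, not the paper's); the handling of conflicting assignments is simplified to "each predicted instance maps to the label it has co-occurred with most often so far".

```python
from collections import defaultdict

class PurityMapper:
    """Online variant of the cluster-purity mapping: predictions are converted
    to ground-truth labels using only the statistics accumulated up to the
    time when each prediction is made."""

    def __init__(self):
        # counts[predicted_instance][ground_truth_label] -> co-occurrence count
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, predicted, truth):
        """Accumulate one (prediction, ground truth) co-occurrence."""
        self.counts[predicted][truth] += 1

    def map(self, predicted):
        """Return the label with the largest statistics for this instance,
        or None if the instance has never been observed."""
        stats = self.counts.get(predicted)
        if not stats:
            return None
        return max(stats, key=stats.get)
```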

Our model888The details of the architecture we selected are reported in the supplementary material. was bootstrapped according to the scheme of Section 4. In particular, before streaming the stories, we generated a stream of text composed of simple language sentences, taken from Simple English Wikipedia and the Children’s Book Test (CBT) data999https://simple.wikipedia.org, https://research.fb.com/downloads/babi/, obtaining overall a total of million sentences. We automatically generated supervisions for the mention detector, according to the procedure described in Section 3.1, and processed the stream. Afterwards, we stopped updating the mention detector and streamed the same data again, to allow the system to develop the mention and context encoders (Section 3.2). In both cases (detector and encoders), we randomly injected character-level noise (typos) to make the models more robust to this kind of perturbation. Finally, we stopped updating the encoders and started to stream the stories.
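The character-level noise injection used during bootstrapping can be sketched as below. This is a hypothetical perturbation routine, assuming three common typo types (dropped, duplicated, or transposed characters); the paper does not specify the exact perturbation procedure or rate.

```python
import random

def inject_typo(word, p=0.15, rng=None):
    """With probability p, apply one random character-level edit (drop,
    duplicate, or swap adjacent characters) to simulate a typo.
    p and the edit set are illustrative assumptions."""
    rng = rng or random.Random()
    if len(word) < 2 or rng.random() >= p:
        return word
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["drop", "dup", "swap"])
    if op == "drop":
        return word[:i] + word[i + 1:]
    if op == "dup":
        return word[:i + 1] + word[i] + word[i + 1:]
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]
```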

6.3 Competitors

We compared our model with two competitors. The first one, Deep-RNN, is a deep neural architecture whose bottom layers completely coincide with our Encoder module: two stacked bidirectional RNNs operating at the character and mention levels, respectively (pre-trained as in our model). In order to classify mentions, the Deep-RNN architecture uses an MLP (1 hidden layer with units and activation; activation in the output layer) on top of the concatenated representation (Eq. 4, Eq. 5). This network knows in advance the number of instances in the datasets, i.e., the size of the output layer of the MLP, and it does not incur errors related to the self-discovery of new instances. We followed a classic online supervised learning scheme, where a single gradient-based update is performed for each processed sentence, since we observed that otherwise the network simply overfits the most recent supervisions and forgets the rest. We also considered a very informed rule-based model, RB, which buffers statistics on the supervisions received up to time . Given an input mention, RB predicts the most common supervision for it. When never-supervised-before mentions are encountered, RB predicts the most frequent supervision of the story, which is likely the main entity of the story itself. This information is very valuable (our model has no access to it), since several co-references refer to the main entity. We tested a large number of different rule-based classifiers, and RB was the one leading to the best results. Finally, we report another variant of our model, i.e., the case in which the system reads the whole stream another time ("Re-Reading", without providing supervisions again), which is the setting in which the self-learning skills are emphasized.
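The RB competitor described above can be sketched as follows. This is our illustrative reconstruction from the textual description (class and method names are ours): per-mention counts drive the main prediction, and the per-story supervision counts, reset at each story boundary, provide the fallback for unseen mentions.

```python
from collections import Counter, defaultdict

class RB:
    """Rule-based baseline: predict the most common supervision seen for a
    mention; for never-supervised-before mentions, fall back to the most
    frequent supervision of the current story (likely its main entity)."""

    def __init__(self):
        self.per_mention = defaultdict(Counter)  # global mention -> label counts
        self.per_story = Counter()               # label counts in the current story

    def new_story(self):
        """Reset the per-story statistics at a story boundary."""
        self.per_story.clear()

    def supervise(self, mention, label):
        self.per_mention[mention][label] += 1
        self.per_story[label] += 1

    def predict(self, mention):
        if self.per_mention[mention]:
            return self.per_mention[mention].most_common(1)[0][0]
        if self.per_story:
            return self.per_story.most_common(1)[0][0]
        return None
```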

6.4 Results

Figure 6 shows the accuracy of discovering and disambiguating mentions of entity and relation instances in the Simple Story Dataset, while Table 1 shows the accuracy on the WikiFacts dataset (entities). Our system outperforms the competitors in all our experiments. Differently from Deep-RNN, our model exploits its capability of building local models of the instances; Deep-RNN is not able to capture this locality, either under-learning or overfitting the supervisions. Moreover, Deep-RNN has difficulties in retaining information about the whole story, as shown in the LAST case.

In the case of entities with few supervisions, RB shows an accuracy similar to that of our system on the Simple Story Dataset. Differently, the proposed model significantly outperforms RB on WikiFacts, showing interesting generalization capabilities on real-world data. In the case of relations, RB performs comparably to our system, mostly because relations are not very ambiguous in the Simple Story Dataset, as also confirmed by the improved results of Deep-RNN. In most cases, the classification capabilities of our model improve when it is allowed to read the stream a second time (re-reading), comparing favourably with all the other approaches. This property is more evident in the few-supervision settings. These results show that, despite the dynamic nature of the online problem we face, the proposed model has strong memorization skills, and the capability of improving its confidence by self-learning.

In the Simple Story Dataset, we also investigated the behaviour of the model at different time instants, using of the supervisions (entities). Basically, we paused the system at a certain point of the stream, measured the accuracy, and activated the system again, repeating this process until the end of the stream was reached. Results are reported in Figure 7. Our model reaches better results than RB after having read roughly the of the input stream. We analyzed this result and found that it is mostly due to the fact that the system takes some time to learn how to handle the temporal information () needed to resolve co-references. As a matter of fact, co-reference resolution clearly depends on the number of supervisions, as reported in Table 2.

Table 2: Accuracy () in the case of pronouns.

We also evaluated the values of , which weighs the importance of temporal locality in the disambiguation process. In the case of pronouns the average value of is 0.33, while for the other mentions (which might include co-references other than pronouns) it is 0.21, showing that the system learns to give more importance to the temporal component when dealing with pronouns than with other mentions. This is confirmed by the small instance-activation scores of mentions that are pronouns (the average of is 0.005, remarking their inherent ambiguity), with respect to those of the other mentions (average of is 0.5).

6.5 Ablation Study

We compare different variants of the model in order to emphasize the role of each component, and report the results in Table 3 (Simple Story Dataset). A first comparison concerns the benefits of using a character-level encoding for mentions with respect to classical word-level embedding approaches. To this end, we built a vocabulary of words, including all the correctly spelled words of our dataset (and an out-of-vocabulary token), and we limited the character-based encoder to such words, thus simulating a word-level encoder. Our model is able to encode in a meaningful way words that contain typos, and to exploit them in context encoding, while the word-level encoder faces several out-of-vocabulary words, which also create ambiguity when comparing contexts.

Another variant of our model discards the temporal hypothesis when disambiguating entities, considering the hypotheses and only. Table 3 shows that temporal locality plays an important role, and disabling it degrades the performance. This is due not only to its positive effects on co-reference resolution, but also to its help in disambiguating mentions of the main entity of the story. Finally, thanks to , the system learns to develop a tendency to associate new mentions with already existing instances, instead of creating new ones, which is an inherent feature of each story (in the worst case it creates only entity instances).

             Entities          Relations
             All      Last     All      Last
WL Enc.
No Recent
Full Model   63.79    63.12    85.41    85.81
Table 3: Accuracy () of our Full Model, of a system based on Word-Level (WL) Encoding, and of a system not provided with information on the recently disambiguated instances.

7 Conclusions and Future Work

We presented an end-to-end model to process text streams, where mentions of entities and relations are detected, disambiguated, and eventually added to an internal KB that, differently from many existing works, is not given in advance. Our model is capable of one-shot learning and self-learning, and it learns to resolve co-references. It has shown strong disambiguation and discovery skills when tested on a stream of sentences organized into small stories (we also created a new dataset that we made publicly available for further studies), even when only a few, sparse supervisions are provided. We also showed how it can improve its skills by continuously reading text. Our future work will focus on exploiting entities and relations structured into facts, higher-level reasoning, types, and dynamic re-organization of the KB.


  • [1] C. C. Aggarwal and C. Zhai (2012) Mining text data. Springer Science & Business Media. Cited by: §5.
  • [2] S. Ahn, H. Choi, T. Pärnamaa, and Y. Bengio (2016) A neural knowledge language model. arXiv preprint arXiv:1608.00318. Cited by: §6.1.
  • [3] A. Banerjee and S. Basu (2007) Topic models over text streams: a study of batch and online unsupervised learning. In Proceedings of the 2007 SIAM International Conference on Data Mining, pp. 431–436. Cited by: §5.
  • [4] T. Bansal, A. Neelakantan, and A. McCallum (2017) RelNet: end-to-end modeling of entities & relations. arXiv:1706.07179. Cited by: §5.
  • [5] J. P. Chiu and E. Nichols (2016) Named entity recognition with bidirectional lstm-cnns. Transactions of the Association for Computational Linguistics 4, pp. 357–370. Cited by: §1, §5.
  • [6] K. Christakopoulou, F. Radlinski, and K. Hofmann (2016) Towards conversational recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 815–824. Cited by: §1.
  • [7] G. M. Del Corso, A. Gulli, and F. Romani (2005) Ranking a stream of news. In Proceedings of the 14th international conference on World Wide Web, pp. 97–106. Cited by: §1.
  • [8] M. Dredze, P. McNamee, D. Rao, A. Gerber, and T. Finin (2010) Entity disambiguation for knowledge base population. In Proceedings of the 23rd International Conference on Computational Linguistics, pp. 277–285. Cited by: §5.
  • [9] A. Fader, S. Soderland, and O. Etzioni (2011) Identifying relations for open information extraction. In Proceedings of the conference on empirical methods in natural language processing, pp. 1535–1545. Cited by: §3.1.
  • [10] S. Guo, M. Chang, and E. Kiciman (2013) To link or not to link? a study on end-to-end tweet entity linking. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1020–1030. Cited by: §5.
  • [11] B. Hachey, W. Radford, J. Nothman, M. Honnibal, and J. R. Curran (2013) Evaluating entity linking with wikipedia. Artificial intelligence 194, pp. 130–150. Cited by: §5.
  • [12] X. Han and L. Sun (2011) A generative entity-mention model for linking entities with knowledge base. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 945–954. Cited by: §5.
  • [13] M. Henaff, J. Weston, A. Szlam, A. Bordes, and Y. LeCun (2017) Tracking the world state with recurrent entity networks. ICLR, pp. 1–14. Cited by: §5.
  • [14] H. Ji and R. Grishman (2011) Knowledge base population: successful approaches and challenges. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 1148–1158. Cited by: §5.
  • [15] Y. Ji, C. Tan, S. Martschat, Y. Choi, and N. A. Smith (2017) Dynamic entity representations in neural language models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1830–1839. Cited by: §5.
  • [16] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu (2016) Exploring the limits of language modeling. arXiv:1602.02410. Cited by: §3.2, §5.
  • [17] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush (2016) Character-aware neural language models.. In AAAI, pp. 2741–2749. Cited by: §5.
  • [18] S. Kobayashi, R. Tian, N. Okazaki, and K. Inui (2016) Dynamic entity representation with max-pooling improves machine reading. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 850–855. Cited by: §5.
  • [19] B. Krawczyk, L. L. Minku, J. Gama, J. Stefanowski, and M. Woźniak (2017) Ensemble learning for data stream analysis: a survey. Information Fusion 37, pp. 132–156. Cited by: §5.
  • [20] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher (2016-20–22 Jun) Ask me anything: dynamic memory networks for natural language processing. In Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger (Eds.), Proceedings of Machine Learning Research, Vol. 48, New York, New York, USA, pp. 1378–1387. Cited by: §5.
  • [21] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pp. 260–270. Cited by: §3.1, §5.
  • [22] Y. Lin, C. Lin, and H. Ji (2017) List-only entity linking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2, pp. 536–541. Cited by: §5.
  • [23] X. Ling, S. Singh, and D. S. Weld (2015) Design challenges for entity linking. Transactions of the Association for Computational Linguistics 3, pp. 315–328. Cited by: §5.
  • [24] G. Luo, X. Huang, C. Lin, and Z. Nie (2015) Joint entity recognition and disambiguation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 879–888. Cited by: §5.
  • [25] X. Ma, N. Fauceglia, Y. Lin, and E. Hovy (2017) Cmu system for entity discovery and linking at tac-kbp 2017. Proceedings of TAC2017. Cited by: §5.
  • [26] C. D. Manning, P. Raghavan, H. Schütze, et al. (2008) Introduction to information retrieval. Vol. 1, Cambridge university press Cambridge. Cited by: §6.2.
  • [27] G. Marra, A. Zugarini, S. Melacci, and M. Maggini (2018) An unsupervised character-aware neural approach to word and context representation learning. In Artificial Neural Networks and Machine Learning – ICANN 2018, pp. 126–136. Cited by: §3.1, §3.2.
  • [28] T. Martin, F. Botschen, A. Nagesh, and A. McCallum (2016) Call for discussion: building a new standard dataset for relation extraction tasks. In Proceedings of the 5th Workshop on Automated Knowledge Base Construction, pp. 92–96. Cited by: §5.
  • [29] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781. Cited by: §3.2.
  • [30] M. Mintz, S. Bills, R. Snow, and D. Jurafsky (2009) Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pp. 1003–1011. Cited by: §5.
  • [31] M. Miwa and M. Bansal (2016) End-to-end relation extraction using lstms on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1105–1116. Cited by: §5.
  • [32] A. Moro and R. Navigli (2015-06) SemEval-2015 task 13: multilingual all-words sense disambiguation and entity linking. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, pp. 288–297. Cited by: §5.
  • [33] A. Moro, A. Raganato, and R. Navigli (2014) Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics 2, pp. 231–244. Cited by: §5.
  • [34] K. Nigam and R. Ghani (2000) Analyzing the effectiveness and applicability of co-training. In Proceedings of the ninth international conference on Information and knowledge management, pp. 86–93. Cited by: §4.
  • [35] F. Niu, C. Zhang, C. Ré, and J. W. Shavlik (2012) DeepDive: web-scale knowledge-base construction using statistical learning and inference.. VLDS 12, pp. 25–28. Cited by: §5.
  • [36] A. Obamuyide and A. Vlachos (2017) Contextual pattern embeddings for one-shot relation extraction. In 6th Workshop on Automated Knowledge Base Construction (AKBC), pp. 1–8. Cited by: §5.
  • [37] X. Pan, T. Cassidy, U. Hermjakob, H. Ji, and K. Knight (2015) Unsupervised entity linking with abstract meaning representation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1130–1139. Cited by: §5.
  • [38] A. Pappu, R. Blanco, Y. Mehdad, A. Stent, and K. Thadani (2017) Lightweight multilingual entity extraction and linking. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 365–374. Cited by: §5.
  • [39] A. Raganato, J. Camacho-Collados, and R. Navigli (2017) Word sense disambiguation: a unified evaluation framework and empirical comparison. In Proc. of EACL, pp. 99–110. Cited by: §1, §5.
  • [40] N. F. Rajani and R. Mooney (2016) Combining supervised and unsupervised ensembles for knowledge base population. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1943–1948. Cited by: §1, §5.
  • [41] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392. Cited by: §5.
  • [42] M. Richardson, C. J. Burges, and E. Renshaw (2013) Mctest: a challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 193–203. Cited by: §5.
  • [43] A. Ritter, O. Etzioni, S. Clark, et al. (2012) Open domain event extraction from twitter. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1104–1112. Cited by: §1.
  • [44] W. Shen, J. Wang, and J. Han (2015) Entity linking with a knowledge base: issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering 27 (2), pp. 443–460. Cited by: §1, §5.
  • [45] A. Sil and R. Florian (2016) One for all: towards language independent named entity linking. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Vol. 1, pp. 2255–2264. Cited by: §5.
  • [46] S. Sukhbaatar, J. Weston, R. Fergus, et al. (2015) End-to-end memory networks. In Advances in neural information processing systems, pp. 2440–2448. Cited by: §5.
  • [47] M. Van Erp, P. N. Mendes, H. Paulheim, F. Ilievski, J. Plu, G. Rizzo, and J. Waitelonis (2016) Evaluating entity linking: an analysis of current benchmark datasets and a roadmap for doing a better job.. In LREC, Vol. 5, pp. 2016. Cited by: §5.
  • [48] Z. Yu, Z. Xu, A. W. Black, and A. Rudnicky (2016) Strategy and policy learning for non-task-oriented conversational systems. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 404–412. Cited by: §1.
  • [49] Z. Zhong and H. T. Ng (2010) It makes sense: a wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 System Demonstrations, pp. 78–83. Cited by: §5.