AMORE-UPF at SemEval-2018 Task 4: BiLSTM with Entity Library

05/14/2018 ∙ by Laura Aina, et al.

This paper describes our winning contribution to SemEval 2018 Task 4: Character Identification on Multiparty Dialogues. It is a simple, standard model with one key innovation, an entity library. Our results show that this innovation greatly facilitates the identification of infrequent characters. Because of the generic nature of our model, this finding is potentially relevant to any task that requires effective learning from sparse or unbalanced data.




1 Introduction

SemEval 2018 Task 4 is an entity linking task on multiparty dialogue. It consists in predicting the referents of nominals that refer to a person, such as she, mom, Judy – henceforth mentions. The set of possible referents is given beforehand, as well as the set of mentions to resolve. The dataset used in this task is based on Chen and Choi (2016) and Chen et al. (2017), and consists of dialogue from the TV show Friends in textual form.

Our main interest is whether deep learning models for tasks like entity linking can benefit from having an explicit entity library, i.e., a component of the neural network that stores entity representations learned during training. To that end, we add such a component to an otherwise relatively basic model – a bidirectional LSTM (long short-term memory; Hochreiter and Schmidhuber 1997), the standard neural network model for sequential data like language. Training and evaluating this model on the task shows that the entity library is beneficial in the case of infrequent entities. Source code for our model and for the training procedure has been published.

2 Related Work

Previous entity linking tasks concentrate on linking mentions to Wikipedia pages (Bunescu and Paşca 2006; Mihalcea and Csomai 2007 and much subsequent work; for a recent approach see Francis-Landau et al. 2016). By contrast, in the present task (based on Chen and Choi 2016; Chen et al. 2017) only a list of entities is given, without any associated encyclopedic entries. This makes the task more similar to the way in which a human audience might watch the TV show, in that they are initially unfamiliar with the characters. What also sets the present task apart from most previous tasks is its focus on multiparty dialogue (as opposed to, typically, newswire articles).

A task that is closely related to entity linking is coreference resolution, i.e., the task of clustering mentions that refer to the same entity (e.g., the CoNLL shared task of Pradhan et al. 2011). Since mention clusters essentially correspond to entities (an insight central to the approaches to coreference in Haghighi and Klein 2010; Clark and Manning 2016), the present task can be regarded as a type of coreference resolution, but one where the set of referents to choose from is given beforehand.

Since our main aim is to test the benefits of having an entity library, in other respects our model is kept more basic than existing work on both entity linking and coreference resolution (e.g., the aforementioned approaches, as well as Wiseman et al. 2016; Lee et al. 2017; Francis-Landau et al. 2016). For instance, we avoid feature engineering, focusing instead on the model’s ability to learn meaningful entity representations from the dialogue itself. Moreover, we deviate from the common strategy in entity linking of incorporating a specialized coreference resolution module (e.g., Chen et al. 2017).

3 Model Description

We approach the task of character identification as one of multi-class classification. Our model is depicted in Figure 1, with inputs in the top left and outputs at the bottom. In a nutshell, our model is a bidirectional LSTM (long short-term memory; Hochreiter and Schmidhuber 1997) that processes the dialogue text and resolves mentions by comparing the LSTM’s hidden state (for each mention) to vectors in a learned entity library.

The model is given chunks of dialogue, which it processes token by token. The $i$-th token $t_i$ and its speakers $S_i$ (typically a singleton set) are represented as one-hot vectors, embedded via two distinct embedding matrices ($W_t$ and $W_s$, respectively) and finally concatenated to form a vector $x_i$ (Eq. 1; see also Figure 1). In case $S_i$ contains multiple speakers, their embeddings are summed.

$$x_i = W_t\, t_i \;\|\; \sum_{s \in S_i} W_s\, s \tag{1}$$

We apply an activation function $f$ to the input embeddings $x_i$. The hidden state $\overrightarrow{h}_i$ of a unidirectional LSTM for the $i$-th input is recursively defined as a combination of that input with the LSTM’s previous hidden state $\overrightarrow{h}_{i-1}$. For a bidirectional LSTM, the hidden state $h_i$ is a concatenation of the hidden states $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ of two unidirectional LSTMs which process the data in opposite directions (Eq. 2; see also Figure 1). In principle, this enables a bidirectional LSTM to represent the entire dialogue with a focus on the current input, including for instance its relevant dependencies on the context.

$$h_i = \mathrm{BiLSTM}\big(f(x_i),\ \overrightarrow{h}_{i-1},\ \overleftarrow{h}_{i+1}\big) = \overrightarrow{h}_i \;\|\; \overleftarrow{h}_i \tag{2}$$
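The input construction of Eq. 1 can be illustrated with a minimal pure-Python sketch. The toy embedding values and the helper name `build_input` are assumptions made for illustration only; because multiplying an embedding matrix by a one-hot vector simply selects one of its vectors, the sketch uses dictionary lookups instead of matrix products.

```python
# Toy embedding tables (hypothetical values, not learned parameters).
# A lookup is equivalent to multiplying an embedding matrix by a one-hot vector.
TOKEN_EMB = {"i": [0.1, 0.2, 0.3], "you": [0.4, 0.5, 0.6]}
SPEAKER_EMB = {183: [1.0, 0.0], 335: [0.0, 1.0]}


def build_input(token, speakers):
    """Concatenate the token embedding with the summed speaker embeddings."""
    dim = len(next(iter(SPEAKER_EMB.values())))
    speaker_sum = [0.0] * dim
    for speaker in speakers:
        for j, value in enumerate(SPEAKER_EMB[speaker]):
            speaker_sum[j] += value
    # List concatenation plays the role of vector concatenation.
    return TOKEN_EMB[token] + speaker_sum


x_single = build_input("i", {183})        # one speaker (the typical case)
x_multi = build_input("you", {183, 335})  # several speakers: embeddings summed
```

Summing the speaker embeddings keeps the input dimensionality fixed regardless of how many characters utter a line together.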


In the model, learned representations of each entity are stored in the entity library $E$ (see Figure 1): $E$ is a matrix which represents each of $N$ entities through a $k$-dimensional vector, and whose values are updated (only) during training. For every token $t_i$ that is tagged as a mention (for multi-word mentions, this is done only for the last token in the mention), we map the corresponding hidden state $h_i$ to a vector $e_i$. This extracted representation is used to retrieve the (candidate) referent of the mention from the entity library: the similarity of $e_i$ to each entity representation stored in $E$ is computed using cosine, and softmax is then applied to the resulting similarity profile to obtain a probability distribution $o_i$ over entities (‘class scores’ in Figure 1):

$$o_i = \mathrm{softmax}\big(\mathrm{cosine}(E,\ e_i)\big), \qquad e_i = W_o h_i + b \tag{3}$$

Figure 1: The AMORE-UPF model (bias not depicted).

At testing time, the model’s prediction $\hat{c}_i$ for the $i$-th token is the entity with highest probability:

$$\hat{c}_i = \arg\max(o_i) \tag{4}$$
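The retrieval step described above (cosine similarity against every stored entity vector, softmax over the similarities, then picking the most probable entity) can be sketched in a few lines of standard-library Python. The three-entity library, the query vector, and the function names are hypothetical; the sketch also assumes that no vector has zero norm.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(entity_library, query):
    """Return the distribution over entities and the index of the best one."""
    similarities = [cosine(row, query) for row in entity_library]
    probs = softmax(similarities)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return probs, predicted


# Hypothetical library with one 2-dimensional vector per entity.
ENTITY_LIBRARY = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
probs, predicted = retrieve(ENTITY_LIBRARY, [0.9, 0.1])
```

One property worth noting in this sketch: since cosine similarities lie in [-1, 1], a softmax applied directly to them is fairly flat, and no entity can receive more than e^2 (about 7.4) times the probability of any other.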


We train the model with backpropagation, using negative log-likelihood as loss function. Besides the BiLSTM parameters, we optimize the embedding matrices $W_t$ and $W_s$, the mapping parameters $W_o$ and $b$, and the entity library $E$. We refer to this model as AMORE-UPF, our team name in the SemEval competition. Note that, in order for this architecture to be successful, the extracted vector $e_i$ needs to be as similar as possible to the entity vector of the entity to which mention $i$ refers. Indeed, the mapping $W_o$ should effectively specialize in “extracting” entity representations from the hidden state because of the way its output is used in the model: to do entity retrieval. Our entity retrieval mechanism is inspired by the attention mechanism of Bahdanau et al. (2016), which has been used in previous work to interact with an external memory (Sukhbaatar et al. 2015; Boleda et al. 2017). To assess the contribution of the entity library, we compare our model to a similar architecture which does not include it (NoEntLib). This model directly applies softmax to a linear mapping of the hidden state (Eq. 5, replacing Eq. 3 above).

$$o_i = \mathrm{softmax}(W_o h_i + b) \tag{5}$$
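To make the contrast with the entity library concrete, the following sketch shows a NoEntLib-style output layer (softmax over a linear mapping of the hidden state) together with the negative log-likelihood of a gold entity. The weight values, the toy hidden state, and the function names are illustrative assumptions, not trained parameters or the authors' code.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, hidden):
    """One score per entity: each row of `weights` is dotted with `hidden`."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]


def no_ent_lib(weights, bias, hidden):
    """Class scores obtained directly from the hidden state."""
    return softmax(linear(weights, bias, hidden))


def negative_log_likelihood(probs, gold_index):
    """Loss for a single mention whose gold entity is `gold_index`."""
    return -math.log(probs[gold_index])


# Hypothetical parameters: 3 entities, 2-dimensional hidden state.
W = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
b = [0.0, 0.0, 0.0]
probs = no_ent_lib(W, b, [1.0, 0.0])
loss_if_gold_is_0 = negative_log_likelihood(probs, 0)
loss_if_gold_is_2 = negative_log_likelihood(probs, 2)
```

In this variant the rows of the weight matrix act as unnormalized class weights, whereas the entity library compares normalized directions via cosine; the loss is the same in both cases.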


4 Experimental Setup


We use the training and test data provided for SemEval 2018 Task 4, which span the first two seasons of the TV show Friends, divided into scenes (train:  scenes from  episodes; test: scenes from episodes). (The organizers also provided data divided by episodes rather than scenes, which we did not use.) In total, the training and test data contain 13,280 and 2,429 nominal mentions (e.g., Ross, I; Figure 2), respectively, which are annotated with the ID of the entity to which they refer (e.g., 335, 183). The utterances are further annotated with the name of the speaker (e.g., Joey Tribbiani). Overall there are 372 entities in the training data (test data: 106).

Joey Tribbiani (183):
“…see Ross, because I think you love her.”
335 183 335 306

Figure 2: Example of the data provided for the SemEval 2018 Task 4. It shows the speaker (first line) of the utterance (second line) and the ids of the entities to which the target mentions (underlined) refer (last line).

Our models do not use any of the provided automatic linguistic annotations, such as PoS or named entity tags.

We additionally used the publicly available 300-dimensional word vectors that were pre-trained on a Google News corpus with the word2vec Skip-gram model (Mikolov et al. 2013).

Parameter settings

Using 5-fold cross-validation on the training data, we performed a random search over the hyperparameters and chose those which yielded the best mean F1-score. Specifically, our submitted model is trained in batch mode using the Adam optimizer (Kingma and Ba 2014) with a learning rate of . Each batch covers scenes, which are given to the model in chunks of  tokens. The token embeddings ($W_t$) are initialized with the word2vec vectors. Dropout rates of and are applied on the input $x_i$ and hidden layer $h_i$ of the LSTM, respectively. The size of $h_i$ is set to  units, the embeddings of the entity library $E$ and speakers $W_s$ are set to  dimensions.
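The selection procedure (random search scored by the mean over five folds) can be sketched as follows. Everything specific in this block is an assumption for illustration: the search space in `sample_config` is hypothetical and does not reproduce the ranges or values used for the submitted model, and `evaluate` stands in for training a model on the training folds and computing its F1-score on the held-out fold.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, held_out) pairs over a fixed random partition into k folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for held_out in range(k):
        train = [x for j, fold in enumerate(folds) if j != held_out for x in fold]
        yield train, folds[held_out]


def sample_config(rng):
    """Draw one configuration from a HYPOTHETICAL search space."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -2),
        "hidden_size": rng.choice([100, 200, 400]),
        "dropout_input": rng.uniform(0.0, 0.5),
    }


def random_search(scenes, evaluate, n_trials=10, seed=0):
    """Keep the configuration with the best mean score across the folds."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        scores = [evaluate(config, train, dev)
                  for train, dev in k_fold_splits(scenes)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_config, best_score = config, mean_score
    return best_config, best_score


# Toy stand-in for "train on `train`, return F1 on `dev`".
def toy_evaluate(config, train, dev):
    return -abs(config["dropout_input"] - 0.2)


scenes = list(range(20))
best_config, best_score = random_search(scenes, toy_evaluate, n_trials=5)
```

Reusing the same fold partition for every sampled configuration, as done here through the fixed `seed`, keeps the mean scores comparable across trials.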

Other configurations, including randomly initialized token embeddings, weight sharing between $E$ and $W_s$, self-attention (Bahdanau et al. 2016) on the input layer, a uni-directional LSTM, and rectifier or linear activation function $f$ on the input embeddings did not improve performance.

For the final submission of the answers for the test data, we created an ensemble model by averaging the output (Eq. 3) of the five models trained on the different folds.
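Averaging the output distributions of the fold models can be sketched as below. The five toy distributions over three entities are made up for illustration, and `average_distributions` is a hypothetical helper name.

```python
def average_distributions(distributions):
    """Element-wise mean of several probability distributions over the same classes."""
    n_models = len(distributions)
    n_classes = len(distributions[0])
    return [sum(d[c] for d in distributions) / n_models for c in range(n_classes)]


# Hypothetical outputs of five fold models for a single mention.
fold_outputs = [
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.4, 0.4, 0.2],
    [0.1, 0.7, 0.2],
    [0.3, 0.5, 0.2],
]
ensemble = average_distributions(fold_outputs)
prediction = max(range(len(ensemble)), key=ensemble.__getitem__)
```

Averaging probabilities rather than taking a majority vote lets a confident model outweigh several uncertain ones, and the result is still a valid probability distribution.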

5 Results

all entities main entities
Models F Acc F Acc
Table 1: Results obtained for the submitted AMORE-UPF model and a variant of it that does not use an entity library (NoEntLib). Best results are in boldface. Differences with respect to the 2nd row marked by ‘**’ are significant at the  probability level (see text).

Two evaluation conditions were defined by the organizers – all entities and main entities – with macro-average F-score and label accuracy as the official metrics; the leaderboard was ranked by macro-average F-score in the all entities condition. The all entities evaluation has 67 classes: 66 for entities that are mentioned at least 3 times in the test set and one grouping all others. The main entities evaluation has 7 classes, 6 for the main characters and one for all the others. Among all four participating systems in this SemEval task, our model achieved the highest score on the all entities evaluation, and the second-highest on the main entities evaluation.
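The two metrics and the grouping of infrequent entities can be illustrated with a small sketch. It is not the official scorer: the function names are hypothetical, the macro-average is taken over the classes that occur in the gold labels (an assumption), and the entity IDs in the example are toy data.

```python
from collections import Counter

OTHER = "other"


def group_rare(gold, predicted, min_count=3):
    """Map entities with fewer than `min_count` gold mentions to one shared class."""
    counts = Counter(gold)
    keep = {entity for entity, n in counts.items() if n >= min_count}
    gold_grouped = [g if g in keep else OTHER for g in gold]
    pred_grouped = [p if p in keep else OTHER for p in predicted]
    return gold_grouped, pred_grouped


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F-scores over the gold classes."""
    f_scores = []
    for c in set(gold):
        tp = sum(g == c and p == c for g, p in zip(gold, predicted))
        fp = sum(g != c and p == c for g, p in zip(gold, predicted))
        fn = sum(g == c and p != c for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f_scores.append(2 * precision * recall / total if total else 0.0)
    return sum(f_scores) / len(f_scores)


def accuracy(gold, predicted):
    """Fraction of mentions labelled correctly."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


gold = [183, 183, 183, 335, 335, 335, 306, 59]
pred = [183, 183, 335, 335, 335, 335, 306, 306]
gold_g, pred_g = group_rare(gold, pred)
score_f1 = macro_f1(gold_g, pred_g)
score_acc = accuracy(gold_g, pred_g)
```

Because every class contributes equally to the macro-average, a system that handles rare entities well gains much more in F-score than in accuracy, which is dominated by the frequent classes.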

Figure 3: Distribution of all 2,429 target mentions in the test data in terms of their part-of-speech.

Table 1 gives our results in the two evaluations, comparing the models described in Section 3. While both models perform on a par on main entities, AMORE-UPF outperforms NoEntLib by a substantial margin when all characters are to be predicted (+15 points in F-score, +3 points in accuracy; Table 1). (The mean difference between the single models, each trained on a single fold, and the ensemble AMORE-UPF is between -1.3 points (accuracy main entities, std ) and -4.2 points (F-score all entities, std ).) The differences between the models with and without an entity library are statistically significant based on approximate randomization tests (Noreen 1989), with the significance level . This shows that the use of an entity library can be beneficial for the linking of rarely mentioned characters.
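For readers unfamiliar with approximate randomization, the following sketch shows the general idea on paired per-mention outcomes. It is an illustration under assumptions, not the procedure actually run for Table 1: the test statistic here is the difference in the number of correct predictions, and the function name, the number of shuffles, and the toy inputs are all hypothetical.

```python
import random


def approximate_randomization(correct_a, correct_b, n_shuffles=1000, seed=0):
    """Two-sided paired test; inputs are 0/1 correctness flags per mention."""
    rng = random.Random(seed)
    observed = abs(sum(correct_a) - sum(correct_b))
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        total_a = total_b = 0
        for a, b in zip(correct_a, correct_b):
            # Under the null hypothesis the two systems are exchangeable,
            # so each pair of outcomes is swapped with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            total_a += a
            total_b += b
        if abs(total_a - total_b) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)


p_same = approximate_randomization([1, 0, 1, 1], [1, 0, 1, 1])
p_different = approximate_randomization([1] * 50, [0] * 50)
```

The same scheme applies to other statistics such as macro-average F-score: the statistic is recomputed on each shuffled assignment instead of the simple count used here.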

Figure 3 shows that most of the target mentions in the test data fall into one of five grammatical categories. The dataset contains mostly pronouns (83%), with a very high percentage of first person pronouns (44%). Figures 4 and 5 present the F-score and accuracy, respectively, which the two models described above obtain on all entities for different categories of mentions. The entity library is beneficial when the mention is a first person pronoun or a proper noun (with an increase of 30 points in F-score for both categories; Figure 4), and closer inspection revealed that this effect was larger for rare entities.

6 Discussion

The AMORE-UPF model consists of a bidirectional LSTM linked to an entity library. Compared to an LSTM without entity library, NoEntLib, the AMORE-UPF model performs particularly well on rare entities, which explains its top score in the all entities condition of SemEval 2018 Task 4. This finding is encouraging, since rare entities are especially challenging for the usual approaches in NLP, due to the scarcity of information about them.

Figure 4: F-score of the models on all entities depending on the part-of-speech of the target mentions.
Figure 5: Accuracy of the models on all entities depending on the part-of-speech of the target mentions.

We offer the following explanation for this beneficial effect of the entity library, as a hypothesis for future work. Having an entity library requires the LSTM of our model to output some representation of the mentioned entity, as opposed to outputting class scores more or less directly as in the variant NoEntLib. Outputting a meaningful entity representation is particularly easy in the case of first person pronouns and nominal mentions (where the effect of the entity library appears to reside; Figure 4): the LSTM can learn to simply forward the speaker embedding unchanged in the case of pronoun I, and the token embedding in the case of nominal mentions. This strategy does not discriminate between frequent and rare entities; it works for both alike. We leave further analyses required to test this potential explanation for future work.

Future work may also reveal to what extent the induced entity representations may be useful in other tasks, to what extent they encode entities’ attributes and relations (cf. Gupta et al. 2015), and to what extent a module like our entity library can be employed elsewhere, in natural language processing and beyond.


This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 715154), and from the Spanish Ramón y Cajal programme (grant RYC-2015-18907). We are grateful to the NVIDIA Corporation for the donation of GPUs used for this research. We are also very grateful to the PyTorch developers. This paper reflects the authors’ view only, and the EU is not responsible for any use that may be made of the information it contains.