A Cluster Ranking Model for Full Anaphora Resolution

Anaphora resolution (coreference) systems designed for the CONLL 2012 dataset typically cannot handle key aspects of the full anaphora resolution task such as the identification of singletons and of certain types of non-referring expressions (e.g., expletives), as these aspects are not annotated in that corpus. However, the recently released dataset for the CRAC 2018 Shared Task can now be used for that purpose. In this paper, we introduce an architecture to simultaneously identify non-referring expressions (including expletives, predicative NPs, and other types) and build coreference chains, including singletons. Our cluster-ranking system uses an attention mechanism to determine the relative importance of the mentions in the same cluster. Additional classifiers are used to identify singletons and non-referring markables. Our contributions are as follows. First, we report the first result on the CRAC data using system mentions; our result is 5.8 percentage points higher than that of the baseline system, which used gold mentions. Second, we demonstrate that the availability of singleton clusters and non-referring expressions can lead to substantially improved performance on non-singleton clusters as well. Third, we show that despite our model not being designed specifically for the CONLL data, it achieves a score equivalent to that of the state-of-the-art system by Kantor and Globerson (2019) on that dataset.





1 Introduction

Anaphora resolution is the task of identifying and resolving anaphoric reference to discourse entities [29]. (A simplified version of this task is also known in NLP as coreference resolution.) It is an important aspect of natural language processing and has a substantial impact on downstream applications such as summarization [34, 33]. Since the conll 2012 shared task [30], the ontonotes corpus has been the dominant resource in research on anaphora resolution / coreference [12, 5, 23, 7, 8, 9, 19, 20, 16]. But ontonotes has a number of limitations. An often mentioned limitation is that singletons are not annotated [10, 6]. A less discussed, but still crucial, limitation is that although some types of non-referring expressions are marked in ontonotes, in particular predicative ones (a policeman in John is a policeman), other types are not, such as expletives, meaning that in It rained, It is not considered a markable. As a consequence, systems optimized for ontonotes are only evaluated on non-singleton coreference chains; their performance at identifying singletons, and distinguishing them from expletives, is not evaluated.

But the decision to interpret it as referring or non-referring [38, 39, 3, 4, 14] is a key aspect of pronoun interpretation (for instance, for the purposes of machine translation [13]), so systems trained on ontonotes have had to adopt a variety of workarounds. These limitations of ontonotes have, however, been corrected in a number of corpora, including ancora for Spanish [35], tuba-d/z for German [36], and, for English, arrau [37], which was used as the dataset for the crac 2018 shared task [28].

The first contribution of this paper is the development of a system able to perform both coreference resolution and identification of non-referring markables and singletons, using the crac 2018 shared task dataset. Our model achieves a conll score of 77.9% on coreference chains, and an F1 score of 76.3% on the identification of non-referring expressions. This is, to the best of our knowledge, the first modern result on the crac data using system mentions. Our conll score is 5.8 percentage points higher than the baseline result on this dataset [28] (72.1%), even though the baseline was obtained using gold mentions.

Our second contribution is a novel and competitive cluster ranking architecture for anaphora resolution. (The code is publicly available.) Current coreference models can be classified either as mention pair models [32], in which connections are established between mentions, or entity mention models, in which mentions are directly linked to entities / coreference chains [22, 31]. The mention pair models are simpler in concept and easier to implement, so many state-of-the-art systems are exclusively based on mention ranking [41, 8, 19]. But it has long been known that entity-level information is important for coreference [22, 29], so many systems have attempted to explore features beyond those of mention pairs [5, 7, 9, 20, 16, 15]. However, those systems are usually much more complex than their mention ranking counterparts, since entity features are introduced in addition to their mention ranking part. Consider the system of Lee et al. (2018), for instance: the full system has 9.6 million trainable parameters in total, which is double the number in the mention ranking part of the system (4.8M parameters). In this work, we demonstrate that it is possible to achieve state-of-the-art results by cluster ranking alone, i.e. by linking mentions directly to the entities. As a result, our model is less complex than the existing entity-level models [20, 16] that use similar mention representations. Our model uses only 4.8M trainable parameters, so it is no more complex than a mention ranking model. Furthermore, our model is fast to train; we show that a cluster ranking model can be significantly sped up by training on oracle clusters. (The oracle clusters are created from system mentions using gold cluster information.)

The key intuitions behind the proposed approach are (i) that cluster representations are crucial to the success of a cluster ranking system, and (ii) that a key property of these representations is that they should capture the fact that mentions in a cluster are not equally important. In particular, it is well-known that the mentions introducing an entity are much more informative (e.g., the president of ACME, John Smith) whereas subsequent mentions tend to employ reduced forms (e.g., Mr. Smith, he) [1]. This motivates the use of cluster representations capable of preserving the greater salience of earlier mentions. Our approach captures this mention salience by using attention scores for the mentions in a cluster and combines the mention representations according to their attention scores. We then investigate the effect of the cluster histories by including all the history of the clusters as candidate assignments to the mentions. The resulting system, besides achieving the new state-of-the-art on the crac dataset (whether including or excluding non-referring expressions and singletons), achieves conll scores equivalent to the current state-of-the-art system [16] on conll data as well (in which non-referring expressions and singletons are not annotated).

Our third and final contribution is the finding that training our system on annotations of singleton mentions and non-referring expressions enhances its performance on non-singleton coreference chains. By evaluating our system on the crac data we show that gains of up to 1.4 percentage points on non-singleton coreference chains can be achieved by training the model with additional singleton mentions and non-referring expressions.

2 System architecture

Anaphora resolution is the task of identifying the referring mentions in a text and assigning those mentions to disjoint clusters such that mentions in the same cluster refer to the same entity. The first subtask of anaphora resolution is mention detection, i.e., extracting candidate mentions from the document. Until recently, most coreference systems selected mentions prior to coreference resolution via heuristic methods often based on parse trees [5, 7, 8, 9, 41, 40]. Lee et al. (2017) first introduced a neural network approach for joint mention detection and coreference resolution, obtaining the best performing system at the time. The system was further extended by Lee et al. (2018) and Kantor and Globerson (2019), the latter being the current state-of-the-art on the conll data set.

Our model is also a joint system that predicts mentions and assigns them to clusters jointly. For a given document $D$ with $T$ tokens, we define all possible spans in $D$ as $N_i$, where $s_i$ and $e_i$ are the start and the end indices of $N_i$ and $1 \le i \le I$. The task for a joint system is to partition all the spans ($N$) into a sequence of clusters such that every mention in a specific cluster refers to the same entity. Let $C_{i-1}$ be the partially completed clusters up to span $N_i$. The set of possible assignments for $N_i$ is defined as all the clusters up to the previous span ($C_{i-1}$) and a special label $\epsilon$. The label $\epsilon$ is used for three situations: a span is not a mention, or is a non-referring expression, or is the first mention of a cluster.

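 1  initialise the set of clusters C as empty ;
 2  for each candidate mention M_i, in text order, do
 3      initialise Tmp with the score of the ε label for M_i ;
 4      for each cluster c in C do
 5          add to Tmp the score of assigning M_i to c ;
 6      end for
 7      select the highest-scoring assignment in Tmp ;
 8      if the selected assignment is ε then
 9          create a new cluster containing only M_i ;
10          compute the score and representation of the new cluster ;
11          add the new cluster to C ;
13      else
14          add M_i to the selected cluster ;
15          update the score and representation of that cluster ;
17      end if
19  end for
Algorithm 1 Cluster ranking algorithm.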
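
The greedy loop of a cluster ranking model can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `rank_clusters`, `epsilon_score` and `cluster_score`, and the toy scorers, are hypothetical stand-ins for the learned scoring functions.

```python
from typing import Callable, List

EPSILON = -1  # hypothetical id for the special "no previous cluster" label


def rank_clusters(
    mentions: List[str],
    epsilon_score: Callable[[str], float],
    cluster_score: Callable[[str, List[str]], float],
) -> List[List[str]]:
    """Visit mentions in text order and give each its best-scoring assignment."""
    clusters: List[List[str]] = []
    for mention in mentions:
        # Start from the epsilon label, then compare with every existing cluster.
        best_id, best_score = EPSILON, epsilon_score(mention)
        for cluster_id, cluster in enumerate(clusters):
            score = cluster_score(mention, cluster)
            if score > best_score:
                best_id, best_score = cluster_id, score
        if best_id == EPSILON:
            clusters.append([mention])          # start a new cluster
        else:
            clusters[best_id].append(mention)   # expand the chosen cluster
    return clusters


# Toy scorers (assumptions for illustration only): a constant epsilon score,
# and a cluster score that fires when the last tokens match.
def toy_epsilon(mention: str) -> float:
    return 0.5


def toy_cluster(mention: str, cluster: List[str]) -> float:
    last = mention.lower().split()[-1]
    return 1.0 if any(m.lower().split()[-1] == last for m in cluster) else 0.0


result = rank_clusters(["John Smith", "ACME", "Mr. Smith"], toy_epsilon, toy_cluster)
print(result)
```

In this sketch every epsilon decision opens a new cluster; distinguishing non-mentions and non-referring expressions from discourse-new mentions would need additional epsilon classes.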

2.1 Mention Representation

In this work, we use a mention representation based on those in [20, 16]. Our system represents a candidate span with the outputs of a bi-directional LSTM. The sentences of a document are encoded from both directions to obtain a representation for each token in the sentence. The bi-directional LSTM takes as input the concatenated word-level and character-level embeddings ($x_t$). For word embeddings, GloVe [25] and BERT [11] embeddings are used. Character embeddings are learned by a convolutional neural network (CNN) during training. The tokens are represented by the concatenated outputs of the forward and the backward LSTMs. The token representations ($x^*_t$) are used together with head representations ($h^*_i$) to represent candidate spans ($N^*_i$). The head representation $h^*_i$ of a span is obtained by applying attention over its token representations ($x^*_{s_i}, \dots, x^*_{e_i}$), where $s_i$ and $e_i$ are the indices of the start and the end of the span respectively. The feature embeddings that enter these computations are the cluster position and span width embeddings.
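
The head attention described above can be sketched as follows. This is an assumption-laden sketch that follows the head-attention formulation of the systems this representation is based on [20, 16], not a verbatim copy of the authors' equations. Here $x^*_t$ is the BiLSTM representation of token $t$, $s_i$ and $e_i$ are the span boundaries, and $\mathrm{FFNN}_\alpha$, $\alpha_t$, $a_{i,t}$ and the width feature embedding $\phi(i)$ are illustrative names.

```latex
\begin{align}
\alpha_t &= \mathrm{FFNN}_\alpha(x^*_t) \\
a_{i,t}  &= \frac{\exp(\alpha_t)}{\sum_{k=s_i}^{e_i} \exp(\alpha_k)} \\
h^*_i    &= \sum_{t=s_i}^{e_i} a_{i,t}\, x^*_t \\
N^*_i    &= [\, x^*_{s_i},\; x^*_{e_i},\; h^*_i,\; \phi(i) \,]
\end{align}
```

The softmax in the second line normalises the token scores within the span, so the head representation is a convex combination of the span's token representations.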

To make the task computationally tractable, our model only considers spans up to a maximum length of $l$, i.e. $e_i - s_i < l$. Further pruning is applied before feeding the candidate mentions to the coreference resolver. The top-ranked $\lambda T$ spans are selected from the candidate spans by a mention scoring function $s_m$.

The selected spans are required not to partially overlap, i.e. there are no two selected spans $N_i$ and $N_j$ such that $s_i < s_j \le e_i < e_j$ or $s_j < s_i \le e_j < e_i$. Nested spans are not affected by this constraint, since they do not partially overlap.
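
The pruning step can be illustrated with a short Python sketch. It is a simplified stand-in under stated assumptions: spans are `(start, end)` pairs with inclusive indices, the scores are given, and the function names `crosses` and `prune_spans` are hypothetical. The defaults 0.4 and 30 are the mention/token ratio and maximum span width listed in Table 1.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, both inclusive


def crosses(a: Span, b: Span) -> bool:
    """True if the two spans partially overlap (neither disjoint nor nested)."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 <= e1 < e2 or s2 < s1 <= e2 < e1


def prune_spans(spans: List[Span], scores: List[float], num_tokens: int,
                ratio: float = 0.4, max_width: int = 30) -> List[Span]:
    """Keep at most ratio * num_tokens top-scoring spans that do not cross."""
    budget = int(ratio * num_tokens)
    ranked = sorted(zip(scores, spans), key=lambda pair: -pair[0])
    kept: List[Span] = []
    for _, span in ranked:
        if len(kept) >= budget:
            break
        if span[1] - span[0] + 1 > max_width:
            continue  # wider than the maximum span width
        if any(crosses(span, other) for other in kept):
            continue  # would partially overlap a better-scoring span
        kept.append(span)
    return sorted(kept)


spans = [(0, 2), (1, 3), (1, 2), (4, 4)]
scores = [0.9, 0.8, 0.7, 0.1]
kept = prune_spans(spans, scores, num_tokens=10)
print(kept)
```

In the example, `(1, 3)` is dropped because it crosses the higher-scoring `(0, 2)`, while the nested span `(1, 2)` is kept.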

2.2 The Cluster Ranking Model

Let $M_i$ denote the top-ranked candidate mentions selected by the mention detector after pruning. The model builds the clusters by visiting the mentions in text order, assigning $M_i$ to an existing cluster if the chosen assignment is one of the previous clusters, or creating a new cluster if the chosen assignment is $\epsilon$. Let $C_{i-1}$ be the partial clusters consisting of up to $i-1$ mentions, and $c_i$ the cluster assigned to $M_i$. The task of our cluster ranking model is to output the assignments that maximise the score of the final clusters, where the score is given by a scoring function $s(i, j)$ between a mention $M_i$ and its set of possible assignments $j \in C_{i-1} \cup \{\epsilon\}$. (We follow Lee et al. (2018) and use $i$ to indicate the anaphor and $j$ for the antecedent.)

The scoring function involves four components. $s_\epsilon(i)$ is the probability that $M_i$ does not belong to any of the previous clusters $C_{i-1}$. Using a scoring function for $\epsilon$ instead of a constant 0 (as used by Lee et al. (2018)) gives us the flexibility to extend the function to handle more detailed types of $\epsilon$, such as non-referring. $s_m(i)$ is the mention score that has been used to rank the candidate mentions. $s_c(j)$ is the cluster score, computed from the scores of the mentions that belong to the cluster. $s_{mc}(i, j)$ is a pairwise score between mention $M_i$ and partial cluster $j$ of $C_{i-1}$. To implement the cluster ranking model we use an attention mechanism [2] to assign a salience to each of the mentions. We compute the cluster score $s_c(j)$ and the cluster representation $C^*_j$ (used for computing $s_{mc}(i, j)$) from the mention scores and mention representations, weighted by the mention salience.

Both $s_c(j)$ and $C^*_j$ are updated each time a cluster is expanded. The attention uses position embeddings that indicate the position of a mention in the cluster. The pairwise score uses a small set of features between $M_i$ and the newest mention of the cluster. We used the same features as Lee et al. (2018): these include genre, speaker (boolean, same or not) and distance (between the two mentions) features. A further feature is cluster size, a common entity-level feature [5]. The size is assigned to buckets according to its value. We use the buckets of Björkelund and Kuhn (2014), assigning the values to 8 buckets ([1, 2, 3, 4, 5-7, 8-11, 12-19, 20+]). The pseudo-code of our model is shown in Algorithm 1. (We also evaluated an alternative approach, in which the clusters are encoded by an LSTM. However, the LSTM approach resulted in lower accuracy than the attention approach in this evaluation.) (We do not use coarse-to-fine pruning or higher-order inference, unlike Lee et al. (2018) and Kantor and Globerson (2019). We found that coarse-to-fine pruning does not improve our model when compared with simpler distance pruning. As for higher-order inference, our system already has access to entity-level information by default, hence it is not necessary.)
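
The salience-weighted cluster score and representation, and the size buckets, can be sketched in plain Python. The sketch assumes that the salience logits have already been produced by some learned scorer; the function names `softmax`, `cluster_summary` and `size_bucket` are illustrative and not taken from the authors' code.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cluster_summary(
    salience_logits: Sequence[float],
    mention_scores: Sequence[float],
    mention_vectors: Sequence[Sequence[float]],
) -> Tuple[List[float], float, List[float]]:
    """Attention weights, weighted cluster score and weighted cluster vector."""
    weights = softmax(salience_logits)
    score = sum(w * s for w, s in zip(weights, mention_scores))
    dim = len(mention_vectors[0])
    vector = [sum(w * v[d] for w, v in zip(weights, mention_vectors))
              for d in range(dim)]
    return weights, score, vector


def size_bucket(size: int) -> int:
    """Map a cluster size to one of 8 buckets: 1, 2, 3, 4, 5-7, 8-11, 12-19, 20+."""
    if size < 1:
        raise ValueError("cluster size must be at least 1")
    if size <= 4:
        return size - 1
    if size <= 7:
        return 4
    if size <= 11:
        return 5
    if size <= 19:
        return 6
    return 7


weights, score, vector = cluster_summary(
    [0.0, 0.0], [2.0, 4.0], [[1.0, 0.0], [0.0, 1.0]]
)
```

With equal logits the weights are uniform, so the cluster score is the plain mean of the mention scores; unequal logits shift weight towards the more salient mentions.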

2.3 Cluster History

One of the advantages of the mention ranking model is that the correct cluster can be built by attaching the active mention to any of the antecedents in the correct cluster. This reduces the complexity of the task as there are multiple correct links. By contrast, in a standard cluster ranking model, only one correct cluster can be chosen. In order to make multiple links possible in our cluster ranking system, we extended our model by including all cluster histories (ch); this maximises the chance of choosing the correct clusters. (We make sure a mention is always attached to the latest version of the cluster by including an additional pointer linking every cluster history to the latest version of the cluster.) This makes the model slightly more similar to a mention ranking model; however, there is still a fundamental difference, as we use cluster representations instead of mention representations. To obtain the model that includes cluster histories, we replace lines 13 and 14 of Algorithm 1 so that the mention is attached to the latest version of the selected cluster, which is found by a function that maps any cluster history to its latest version.
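
The pointer from every cluster history to the latest version of its cluster can be sketched as a small data structure. This is an illustrative sketch only; the class `ClusterHistory` and its methods are hypothetical names, and clusters are stored as immutable tuples of mention strings.

```python
from typing import Dict, List, Tuple

Cluster = Tuple[str, ...]  # an immutable snapshot of a cluster


class ClusterHistory:
    """Keeps every past version of each cluster plus a pointer to the newest."""

    def __init__(self) -> None:
        self.versions: List[Cluster] = []   # all snapshots, usable as candidates
        self.latest: Dict[int, int] = {}    # snapshot index -> newest snapshot index

    def new_cluster(self, mention: str) -> int:
        self.versions.append((mention,))
        idx = len(self.versions) - 1
        self.latest[idx] = idx
        return idx

    def attach(self, mention: str, chosen: int) -> int:
        """Attach a mention via any snapshot; it always extends the latest one."""
        current = self.latest[chosen]
        self.versions.append(self.versions[current] + (mention,))
        new_idx = len(self.versions) - 1
        # Repoint every snapshot of this entity to the new version.
        for idx, target in list(self.latest.items()):
            if target == current:
                self.latest[idx] = new_idx
        self.latest[new_idx] = new_idx
        return new_idx


history = ClusterHistory()
first = history.new_cluster("John Smith")
second = history.attach("Mr. Smith", first)
third = history.attach("he", first)  # links to an old snapshot on purpose
```

Linking "he" to the oldest snapshot still yields the three-mention cluster, which is the behaviour the pointer is meant to guarantee.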

2.4 Identifying Non-Referring Expressions

To add the identification of non-referring expressions, we extend $\epsilon$ into multiple classes: no for non-mention, nr for non-referring and dn for discourse new, including singletons.

Several non-referring types are annotated in the arrau corpus: in addition to expletives, there are also predicative nps (e.g., a policeman in John is a policeman), non-referring quantifiers (e.g., nobody in I see nobody here) [17], idioms (e.g., her hand in He asked her for her hand), etc. As we will see, the basic nr classifier can be extended to do a fine-grained classification of non-referring expressions.

By distinguishing ‘non-mentionhood’ from non-anaphoricity the system naturally resolves singletons (i.e. the clusters with a size of one). Non-referring expressions are usually filtered before building the coreference chains, e.g. in mars [24]; we will call this the prefiltering approach. In the prefiltering approach, the system removes the markables identified as non-referring expressions from further processing once they have been identified. To be more specific, we replace line 8 of Algorithm 1 with a condition that first removes markables classified as non-referring, and only then (else if) handles the remaining cases.

The prefiltering approach is aggressive, which might have a negative effect on results if referring expressions have been filtered incorrectly. We therefore also tried a second approach: prefiltering is applied only when the non-referring expressions classifier has high confidence (i.e., when the classifier has a softmax score above a heuristic threshold). The softmax score is calculated over the previous clusters and the classes in $\epsilon$ (i.e. Tmp in Algorithm 1). If the score is below this threshold, non-referring expressions are identified after forming the clusters (postfiltering); we call this the hybrid approach. During postfiltering, candidates that are classified as non-referring markables with lower confidence and are not part of any cluster are included as additional non-referring markables.
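
The routing decision of the hybrid approach can be sketched as below. This is a simplified reading of the description, with several assumptions: the candidate labels `"nr"`, `"dn"` and cluster ids are hypothetical keys, the function `route_markable` is an illustrative name, and what happens to a postfiltered candidate during clustering is left out.

```python
import math
from typing import Dict


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Softmax over the scores of all candidate assignments of one markable."""
    peak = max(scores.values())
    exps = {label: math.exp(s - peak) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def route_markable(scores: Dict[str, float], threshold: float = 0.5) -> str:
    """Return 'prefilter', 'postfilter' or 'keep' for one markable."""
    probs = softmax(scores)
    best = max(probs, key=probs.get)
    if best != "nr":
        return "keep"            # not predicted non-referring
    if probs["nr"] > threshold:
        return "prefilter"       # confident: remove before clustering
    return "postfilter"          # unsure: decide after the clusters are formed


confident = route_markable({"nr": 3.0, "dn": 0.0, "c0": 0.0})
unsure = route_markable({"nr": 1.0, "dn": 0.9, "c0": 0.8})
referring = route_markable({"nr": 0.0, "c0": 2.0})
```

Setting the threshold to 0 makes every markable predicted as non-referring go through prefiltering, which corresponds to the pure prefiltering approach.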

Parameter                   Value
BiLSTM layers               3
BiLSTM size                 200
BiLSTM dropout              0.4
FFNN layers                 2
FFNN size                   150
FFNN dropout                0.2
CNN filter widths           3, 4, 5
CNN filter size             50
Char embedding size         8
GloVe embedding size        300
BERT embedding size         1024
BERT embedding layer        Last 4
Feature embedding size      20
Embedding dropout           0.5
Max span width              30
Max num of clusters         250
Mention/token ratio         0.4
Optimiser                   Adam
Learning rate               1e-3
Decay rate                  0.999
Decay frequency             100
Training step               200K
Table 1: Hyperparameters for our models.

                           MUC                 B³                CEAFφ4        Avg.
Model                  P     R     F1      P     R     F1      P     R     F1     F1
Singletons included
prefiltering         75.5  79.0  77.2    75.9  80.7  78.2    75.2  77.3  76.2   77.2
hybrid               77.9  78.5  78.2    77.4  80.3  78.8    75.4  78.1  76.8   77.9
fine nr              76.7  77.3  77.0    76.8  79.7  78.2    74.9  78.0  76.4   77.2
Lee et al. (2013)*   72.1  58.9  64.8    77.5  77.1  77.3    64.2  88.1  74.3   72.1
Singletons excluded
prefiltering         75.5  79.0  77.2    67.0  73.0  69.9    67.1  65.1  66.1   71.1
hybrid               77.9  78.5  78.2    69.2  71.8  70.4    69.5  63.8  66.5   71.7
fine nr              76.7  77.3  77.0    68.0  70.7  69.3    66.6  64.2  65.4   70.6
no nr                76.7  77.0  76.8    68.7  69.7  69.2    66.1  63.8  64.9   70.3
Lee et al. (2013)*   72.3  58.9  64.9    67.9  48.5  56.5    54.2  53.0  53.6   58.3
Table 2: The comparison between our models and the state-of-the-art system on the crac test set. * indicates systems evaluated on the gold mentions.

2.5 Learning

Training a cluster ranking model on system clusters is challenging, as we need to find a way to learn from partially correct clusters. It is also slow, as the system processes one mention at a time and hence cannot benefit much from parallel computing. The solution we adopted was to train the model on oracle clusters. This is simpler and faster, since the clusters for a training document can be created before the more expensive computations, e.g. the cluster scores and pairwise scores. More precisely, we create the oracle clusters during training using gold cluster ids; system mentions belonging to the same gold cluster are grouped. This is much faster than training the model on the system mentions directly, since training on the system mentions requires computing scores for each mention separately. In a preliminary experiment, we discovered that by training on oracle clusters we obtain not only a better conll score, but also a fivefold speedup compared with the model trained on the system mentions directly. (We trained both approaches on the conll data for 200K steps on a GTX 1080Ti GPU. It takes 16 and 80 hours to train a model on oracle and system mentions respectively.)

As a loss function, we optimize the marginal log-likelihood of all the clusters that contain mentions from the same gold cluster, gold($i$), as $M_i$. If $C_{i-1}$ does not contain any mention from gold($i$), or $M_i$ does not belong to a gold cluster, we set gold($i$) = {$\epsilon$}. For our models with more than one class in $\epsilon$, gold($i$) is set to the relevant class (no, nr or dn).
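
For a single mention, a marginal log-likelihood objective of this kind can be sketched as follows. The sketch assumes a softmax over the scores of all candidate assignments, as in the mention-ranking systems this work builds on; the function name `marginal_nll` and the string labels are hypothetical.

```python
import math
from typing import Dict, Set


def log_sum_exp(values) -> float:
    """Numerically stable log(sum(exp(v))) over an iterable of floats."""
    values = list(values)
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def marginal_nll(scores: Dict[str, float], gold: Set[str]) -> float:
    """Negative log of the total probability mass on the gold assignments.

    scores maps every candidate assignment (previous clusters and the
    epsilon classes) to its score; gold is the subset that is correct.
    """
    if not gold or not gold <= scores.keys():
        raise ValueError("gold must be a non-empty subset of the candidates")
    log_z = log_sum_exp(scores.values())
    log_gold = log_sum_exp(scores[label] for label in gold)
    return log_z - log_gold


uniform = {"eps": 0.0, "c0": 0.0, "c1": 0.0, "c2": 0.0}
loss = marginal_nll(uniform, {"c0", "c1"})  # two of four equally likely options
```

Because the probabilities of all gold assignments are summed before taking the logarithm, the model is not penalised for preferring one correct candidate over another.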

3 Data and Hyperparameters

Our primary evaluation dataset was the crac 2018 corpus [28] for full anaphora resolution. We also evaluated our model on the conll 2012 English corpora [30] to compare its performance with the state of the art on the conll task (identifying coreference chains excluding singletons, no non-referring expression identification).

The crac Task 1 dataset is based on the rst portion of the arrau corpus [37]. arrau includes texts from four very distinct domains: news (the rst subcorpus), dialogue (the trains subcorpus), fiction (the pear stories), and non-fiction (the gnome corpus). The annotation scheme specifies the annotation of referring (including singletons) and non-referring expressions; split antecedent plurals, generic references, and discourse deixis are annotated, as well as bridging references. The rst portion of arrau consists of news texts (1/3 of the penn Treebank).

The conll datasets are the standard reference corpora for coreference resolution. The English conll corpus consists of 2802, 342, and 348 documents for the train, development and test sets respectively.

We use the official conll 2012 scoring script to score our predictions when evaluating without singletons and non-referring markables, and the official crac 2018 scoring script [28] to evaluate other cases. The crac 2018 Extended Scorer is an extension of the conll 2012 script that can handle singletons and non-referring markables. The Extended Scorer is the same as the conll scorer when evaluating without singletons and non-referring markables, but also reports P, R and F1 values for non-referring markables when those are considered. Following standard practice, we report recall, precision, and F1 scores for MUC, B³ and CEAFφ4 and the average F1 score of those three metrics. In addition, we report the F1 score for non-referring expressions when needed.
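
As a worked example of the averaged score, the snippet below recomputes the conll score from the three F1 values of the hybrid and prefiltering rows in Table 2 (singletons included). The function name `conll_score` is illustrative; the official scorers compute the per-metric values themselves.

```python
def conll_score(muc_f1: float, b3_f1: float, ceaf_f1: float) -> float:
    """Unweighted average of the MUC, B-cubed and CEAF F1 scores."""
    return (muc_f1 + b3_f1 + ceaf_f1) / 3


hybrid = conll_score(78.2, 78.8, 76.8)
prefiltering = conll_score(77.2, 78.2, 76.2)
print(round(hybrid, 1), round(prefiltering, 1))
```

Both averages agree with the last column of Table 2 after rounding to one decimal place.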

For our experiments, we use the same maximum span width, number of spans per token and most of the network parameters as Lee et al. (2018) and Kantor and Globerson (2019). The detailed configuration can be found in Table 1.

4 Results and Discussions

4.1 Evaluation on the crac data set

We first compared the two proposed approaches for using non-referring expressions, prefiltering and hybrid. For our hybrid model, we set the threshold to 0.5 after tuning on the development set. Table 2 shows the results of our models on the crac test set. As expected, the hybrid model, using a less greedy pruning, achieved better F1 scores on all three coreference metrics. In terms of the non-referring scores (see Table 3), the prefiltering approach has better recall and F1 score, while the hybrid approach has better precision. We hypothesize this is mainly because the prefiltering approach generates more non-referring expressions due to its greedy pruning (i.e., the prefiltering approach keeps all the candidate non-referring markables once they are identified), while the hybrid approach favours the coreference clusters for non-referring markables that fall below the threshold. The hybrid approach has a better overall performance according to our weighted F1 scores (0.85 * coref F1 + 0.15 * nr F1). The weights are determined by the proportion of referring and non-referring markables in the corpus.
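
The weighted F1 comparison can be reproduced with a few lines of Python. The sketch assumes that "coref F1" is the conll score with singletons included (Table 2) and that "nr F1" is the non-referring F1 from Table 3; the function name `weighted_f1` is illustrative.

```python
def weighted_f1(coref_f1: float, nr_f1: float,
                coref_weight: float = 0.85, nr_weight: float = 0.15) -> float:
    """Weighted combination of the coreference and non-referring F1 scores."""
    return coref_weight * coref_f1 + nr_weight * nr_f1


hybrid = weighted_f1(77.9, 75.1)
prefiltering = weighted_f1(77.2, 75.5)
print(round(hybrid, 2), round(prefiltering, 2))
```

Under these assumptions the hybrid model comes out ahead (about 77.5 against about 76.9), because its coreference gain outweighs its small loss on non-referring expressions.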

Models P R F1
prefiltering 76.6 74.5 75.5
hybrid 78.0 72.4 75.1
fine nr 77.0 75.5 76.3
Table 3: The scores for non-referring expressions of our models on the crac test set.
nr types P R F1
Expletive 93.8 100.0 96.8
Predicate 77.6 75.2 76.4
Quantifier 65.0 64.7 64.9
Coordination 77.5 82.0 79.7
Idiom 77.0 55.9 64.8
Table 4: The scores of our models on the fine-grained non-referring types.

Fine-grained Non-referring

We further extended the basic nr classifier to recognise the more fine-grained classification of non-referring expressions annotated in the crac dataset by configuring our hybrid model to learn from the fine-grained types (fine nr). Table 4 shows the results. Our model does very well on resolving expletives (96.8% F1) and achieves F1 scores of 76-80% on predicates and coordinations, but has a lower F1 score of around 65% on recognising non-referring quantifiers and idioms. We also compared this model with the other models for dealing with non-referring expressions by collapsing the classifications it produces (Table 3). As we can see from that table, although the task is harder, using the fine-grained types for training results in slightly better performance on identifying non-referring markables in general than models trained on a single nr class. In terms of performance on coreference chains, the fine nr approach achieved the same score as the prefiltering approach and a slightly lower score than the hybrid approach (see Table 2).

Training without Singletons and Non-referring

Finally, we trained our model without singletons and non-referring expressions (no nr) to assess their effects on non-singleton clusters (i.e. the standard conll setting). Since here we evaluate in a singleton excluded setting, we report for our models trained with singletons and non-referring expressions the standard conll scores with singletons and non-referring markables excluded. As shown in Table 2, all three models trained with additional singleton and non-referring markables achieved better conll scores when compared with the newly trained model. The system achieves substantial gains of up to 1.4 percentage points (hybrid) by training with the additional singletons and non-referring expressions. This suggests that the availability of singletons and non-referring markables can help the decisions made for non-singleton clusters.

State-of-the-art Comparison

Since the crac corpus was released recently, the only published results are those by the baseline system [18] on the shared task [28]. Our best system (hybrid) outperforms this baseline by large margins (5.8 and 13.4 percentage points when evaluated with or without singletons respectively; see Table 2), even though that system was evaluated on gold mentions.

                                             MUC                 B³                CEAFφ4        Avg.
Model                                    P     R     F1      P     R     F1      P     R     F1     F1
Models using context-independent embeddings
Clark and Manning (2016, entity-level) 79.9  69.3  74.2    71.0  56.5  63.0    63.8  54.3  58.7   65.3
Clark and Manning (2016, deep RL)      79.2  70.4  74.6    69.9  58.0  63.4    63.5  55.5  59.2   65.7
Lee et al. (2017)                      78.4  73.4  75.8    68.6  61.8  65.0    62.7  59.0  60.8   67.2
Zhang et al. (2018)                    79.4  73.8  76.5    69.0  62.3  65.5    64.9  58.3  61.4   67.8
Models using pre-trained context-dependent embeddings
Lee et al. (2018)                      81.4  79.5  80.4    72.2  69.5  70.8    68.2  67.1  67.6   73.0
Kantor and Globerson (2019)            82.6  84.1  83.4    73.3  76.2  74.7    72.4  71.1  71.8   76.6
Our model                              82.7  83.3  83.0    73.8  75.6  74.7    72.2  71.0  71.6   76.4
Models fine-tuned on BERT
Joshi et al. (2019)                    84.7  82.4  83.5    76.5  74.0  75.3    74.1  69.8  71.9   76.9
Table 5: Comparison between our models and the top performing systems on the conll test set.

4.2 Evaluation on the conll data set

Next, we tested our models on the conll data to assess the performance of our system on the standard data set. Table 5 compares our results with those of the top-performing systems on conll at the present time. We report precision, recall and F1 scores for all three major metrics (MUC, B³ and CEAFφ4) and mainly focus on the average conll F1 scores presented in the last column. As shown in Table 5, our model achieved a conll score of 76.4%, which is only 0.2 percentage points lower than the best result reported at present among comparable systems, achieved by [16], which uses mention representations similar to those of our system. Although the current state-of-the-art system [15] has a slightly better result than the system of Kantor and Globerson (2019), it is not directly comparable with our system, as it fine-tunes BERT. Also, the systems of Joshi et al. (2019) need to be trained on GPUs with 32GB of memory, which are not available to our group. By contrast, our system was trained on a GTX 1080Ti GPU with 11GB of memory.

Cluster size     Mention position
                 1      2      3      4      5-7
2               0.55   0.45
3               0.38   0.32   0.29
4               0.29   0.24   0.23   0.22
5               0.24   0.20   0.19   0.19   0.19
6               0.19   0.17   0.16   0.17   0.15
7               0.18   0.14   0.14   0.14   0.13
Table 6: The average mention salience attention scores in the conll development set, grouped by mention position and cluster size in the final clusters.

Model                  Avg. F1    Δ
Our model              76.9
  - Position emb       76.2       0.7
  - Width emb          76.5       0.4
  - Cluster history    75.9       1.0
  - Oracle cluster     76.3       0.6
Table 7: The comparison between our best model and different ablated models on the conll development set.

4.3 Discussion

We further analyze our model on the conll data to give a more detailed study on different aspects of our model. (We use the standard conll data instead of the crac data because the conll corpus is larger than the crac corpus and is widely used. As a result, the analysis on conll data might also be beneficial for other researchers focusing on conll only.)

Mention Salience

We first assess our hypothesis that our attention scores can capture mention salience, i.e., the finding from the linguistic and psychological literature on anaphora that the initial mentions of an entity are those that tend to include more information (whereas the following mentions are generally reduced). Table 6 shows an analysis of the attention scores that supports this hypothesis. We computed the average attention scores for mentions in a cluster in order of mention. Clusters of different sizes are analysed separately, as scores from different-sized clusters are not directly comparable. As the table shows, the attention scores assigned to the first mention in a cluster are always higher than those assigned to the mentions at other positions, which is in line with linguistic findings that mentions introducing an entity are more informative. This suggests that our attention model does capture something like mention salience.
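
The claim about first mentions can be checked directly against the values in Table 6. The snippet below simply copies the table into a dictionary (the name `TABLE_6` and the helper are illustrative) and tests whether position 1 strictly dominates every later position for each cluster size.

```python
from typing import Dict, List

# Average attention scores from Table 6, keyed by final cluster size.
# Each list is ordered by mention position (1, 2, 3, 4, 5-7).
TABLE_6: Dict[int, List[float]] = {
    2: [0.55, 0.45],
    3: [0.38, 0.32, 0.29],
    4: [0.29, 0.24, 0.23, 0.22],
    5: [0.24, 0.20, 0.19, 0.19, 0.19],
    6: [0.19, 0.17, 0.16, 0.17, 0.15],
    7: [0.18, 0.14, 0.14, 0.14, 0.13],
}


def first_mention_dominates(row: List[float]) -> bool:
    """True if the first mention has a strictly higher score than all others."""
    return all(row[0] > later for later in row[1:])


first_always_highest = all(first_mention_dominates(r) for r in TABLE_6.values())
print(first_always_highest)
```

The check only concerns the first position; later positions are not strictly decreasing in every row (for example, cluster size 6).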

Why Cluster Ranking?

We use a cluster ranking approach instead of mention ranking not only because it is linguistically more appealing, but also because of several practical restrictions of mention ranking models. First, the current state-of-the-art mention-ranking systems tend to be hybrids, using entity-level features alongside mention-pair features. Such models are therefore usually more complex than pure mention ranking models, and have substantially more trainable parameters. Take the system of Lee et al. (2018) as an example. The mention ranking part of the system contains 4.8M parameters, but the full system has double the number of parameters (9.6M) in order to access entity-level features. Our system, on the other hand, links the mentions directly to the entity and uses only 4.8M parameters, which makes it much simpler than such hybrid models. Second, we hope that using a cluster ranking model will allow us to explore rich cluster-level features and advanced search algorithms (such as beam search) in future work.

The Effect of Oracle Clusters on Training Time

Training cluster ranking systems using system clusters is time-consuming: our model trained on system clusters takes 80 hours to train for 200K steps, which is much more than the 48 hours of training time of the system of Lee et al. (2018) (400K steps). The main reason the cluster ranking system is slower than its mention ranking counterpart is that the cluster ranking model processes one mention at a time, and hence does not benefit from parallelization. To solve this problem, we trained the system on oracle clusters instead. The oracle clusters are created by using the system mentions with the gold cluster ids. By doing so, all the clusters can be created before resolving the mentions into the entities. As a result, the training (200K steps) can be finished in as little as 16 hours, which is 5x faster than training the model on system clusters (80 hours), and 3x faster than training the mention ranking model (48 hours).

4.4 Ablation study

We removed different part of our model to show the importance of the individual part of our system (see Table 7).

Position Embeddings We first removed the position embeddings, used in the self-attention to determine the relative importance of the mentions in the cluster. By removing the position embeddings, the relative importance of a mention becomes independent of its position in the cluster. As a result, the performance of the model drops by 0.7%.

Width Embeddings We then removed the cluster width embeddings from our features. The cluster width embedding is a feature used in computing the pairwise scores, which allows mentions to know the size of the individual candidate clusters. (Cluster size can be used as an indicator of cluster salience: the larger a cluster is, the more frequently the entity is mentioned, and hence the higher its salience.) The cluster width feature contributes 0.4% towards our model.
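A common way to turn a size into an embedding index is to bucket it, so that small sizes are distinguished exactly and large sizes share coarser bins. The sketch below shows one such scheme; the bucket boundaries, the number of buckets, and the function name `width_bucket` are assumptions made for illustration and are not taken from our implementation.

```python
def width_bucket(cluster_size, num_buckets=10):
    """Map a cluster size to an embedding index.

    Sizes 1-4 each get their own bucket; larger sizes share
    log-scaled buckets (5-7, 8-15, 16-31, ...), and the last
    bucket absorbs everything beyond that. Hypothetical scheme.
    """
    if cluster_size < 1:
        raise ValueError("a cluster contains at least one mention")
    if cluster_size <= 4:
        return cluster_size - 1
    # bit_length() - 1 is floor(log2(n)) for positive integers.
    return min(cluster_size.bit_length() - 1 + 2, num_buckets - 1)

# The bucket id would then index a table of learned width embeddings.
buckets = [width_bucket(n) for n in (1, 4, 5, 7, 8, 16, 1000)]
```

Bucketing keeps the embedding table small while still letting the model separate singletons and small clusters from frequently mentioned, and therefore more salient, entities.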

Cluster History We trained a model that keeps exactly one cluster per entity, with the history clusters excluded from the candidate lists. Removing the history clusters reduces the chance of linking the mentions to the correct entity; as a consequence, the performance drops by 1 percentage point.
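The difference between the two settings can be sketched as a small change in how the candidate list is updated after a linking decision. The code is an illustrative assumption rather than our implementation: candidates are modelled as (entity id, mentions) pairs, and `link` is a hypothetical helper.

```python
def link(candidates, mention, target=None, keep_history=True):
    """Return the candidate list after resolving `mention`.

    candidates: list of (entity_id, tuple_of_mentions) cluster states.
    target: index of the chosen candidate, or None to start a new entity.
    keep_history: if True, the earlier state of the chosen cluster stays
    in the list; if False, it is replaced, leaving one cluster per entity.
    """
    if target is None:
        entity_id = 1 + max((e for e, _ in candidates), default=-1)
        return candidates + [(entity_id, (mention,))]
    entity_id, members = candidates[target]
    new_state = (entity_id, members + (mention,))
    if keep_history:
        return candidates + [new_state]
    return candidates[:target] + [new_state] + candidates[target + 1:]

start = link([], "John")
with_history = link(start, "he", target=0)
without_history = link(start, "he", target=0, keep_history=False)
```

With history, a later mention can still be linked to the earlier state of the entity, so the entity remains reachable through more than one candidate; the ablated variant offers exactly one candidate per entity.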

Oracle Clusters Finally, we trained a model using the system clusters directly instead of the oracle clusters. As mentioned in the previous section, training on the system clusters is more time-consuming than training on the oracle clusters. The comparison shows that training on the oracle clusters is not only faster, but also results in better performance (by 0.6%).

5 Related Work

Pure Mention Ranking Models Most recent coreference systems are highly reliant on mention ranking, which is effective and generally faster to train compared with the cluster ranking system. Systems based only on the mention ranking model include [41, 9, 19]. wiseman2015learning introduced a neural network based approach to solve the task in a non-linear way. In their system, the heuristic features commonly used in linear models are transformed by a $\tanh$ function to be used as the mention representations. clark2016improving integrated reinforcement learning to let the model optimize directly on the $B^3$ scores. lee2017end first presented a neural joint approach for mention detection and coreference resolution. Their model does not rely on parse trees; instead, the system learns to detect mentions by exploring the outputs of a bi-directional LSTM.

Models using Entity Level Features Researchers have been aware of the importance of entity level information at least since luo-et-al:ACL04, and many systems trying to exploit cluster based features have been proposed since. Among neural network models, bjorkelund2014learning built a latent tree system that explores non-local features through beam search. The global feature-aided model showed clear gains when compared with the model based only on pairwise features. clark2015entity introduced an entity-centric coreference system by manipulating the scores of a mention pair model. The system first runs a mention pair model on the document and then uses an agglomerative clustering algorithm to build the clusters in an easy-first fashion. This system was later extended by clark2016improving to make it run on neural networks. wiseman2016learning added to the wiseman2015learning system an LSTM to encode the partial clusters. The outputs of the LSTM are used as additional features for the mention ranking model. lee2018higher is an extended version of lee2017end, mainly enhanced by using ELMo embeddings [26]; in addition, the use of second-order inference enabled the system to explore partial entity level features and further improved the system by 0.4 percentage points. Later the model was further improved by kantor2019bertee, who use BERT embeddings [11] instead of ELMo embeddings. At this stage, both BERT and ELMo embeddings are used in a pre-trained fashion. Recently, joshi2019bert fine-tuned the BERT model for the coreference task, resulting again in a small improvement.

Cluster Ranking Models To the best of our knowledge, our system is the only recent system that does not rely on a mention ranking model. However, there are a number of early studies that laid a solid foundation for the cluster ranking models (see [27] for a survey). The best known ‘modern’ examples are the systems proposed by luo-et-al:ACL04 and by rahman&ng:JAIR11, but this approach was the dominant model for anaphora resolution at least until the paper by soon-et-al:CL01, as it directly implements the linguistically and psychologically motivated view that anaphora resolution involves the creation of a discourse model articulated around discourse entities [17]. The entity mention model of luo-et-al:ACL04 introduced the notion that a training instance consists of a mention and an active cluster, and therefore allowed for cluster-level features encoding information about multiple mentions in the cluster. luo-et-al:ACL04 also proposed a clustering algorithm in which the clustering options are encoded in a Bell tree that also specifies the coreference decisions resulting in a cluster, an idea related to our idea of cluster history. rahman&ng:JAIR11 introduced the term ‘cluster ranking’ and greatly developed the approach, e.g., by introducing a rich set of cluster-level features. Their model was the first cluster-ranking model to significantly outperform mention pair models.

Singletons and Non-referring Expressions Again, to the best of our knowledge, ours is the only modern neural network-based, full coreference system that attempts to output singletons and non-referring markables. The Stanford Deterministic Coreference Resolver [18] uses a number of filters to exclude expletives as well as quasi-referring mentions such as percentages (e.g., 9%) and measure NPs (e.g., a liter of milk), and its extension proposed by DeMarneffe:2015:MLD:2831407.2831417 includes more filters to exclude singletons, but these aspects of the system are not evaluated. The best-known systems also attempting to annotate non-referring markables date back to the pre-OntoNotes era. The pronoun resolution algorithm proposed by lappin&leass:94 includes a series of hand-crafted heuristics to detect expletives. The statistical classifier proposed by evans2001applying classifies pronouns into several categories which, apart from nominal anaphoric, include cataphoric, pleonastic, and clause-anaphoric. versley-EtAl:2008:PAPERS used the BBN pronoun corpus to confirm the hypothesis that tree kernels would be well-suited to identify expletive pronouns. boyd-et-al:05 developed a set of hand-crafted heuristics to identify non-referring nominals in the sense of karttunen:76. The systems developed by Bergsma and colleagues identify non-referential uses of pronominal it with a classifier using a combination of lexical features and web counts [3, 4]. A lot of work on identifying expletives was carried out in the context of the DiscoMT evaluation campaigns, but this work was typically focused only on disambiguating the pronoun it [21]. For more discussion of these and other systems, see [38].

6 Conclusions

In this work, we presented the first neural network based system for full coreference resolution that also covers singletons and non-referring markables. Our system uses an attention mechanism to form the cluster representations using mention salience scores from the mentions belonging to the cluster. By training the system on oracle clusters, we show that a cluster ranking system can be trained 5x faster, and faster than a mention-ranking system with a similar architecture. Evaluation on the CRAC corpus shows that our system is 5.8% better than the only existing comparable system, the Shared Task baseline system, which used gold mentions. Further evaluation on the CONLL corpus, on the subtask that excludes singleton and non-referring expression detection, shows that our system achieves a performance equivalent to that of the state-of-the-art kantor2019bertee system. We also demonstrated that a large improvement on non-singleton coreference chains can be made by training the system with additional singletons and non-referring expressions.

7 Bibliographical References


  • [1] M. Ariel (1990) Accessing noun-phrase antecedents. Croom Helm Linguistics Series, Routledge. Cited by: §1.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.2.
  • [3] S. Bergsma, D. Lin, and R. Goebel (2008-06) Distributional identification of non-referential pronouns. In Proceedings of ACL-08: HLT, Columbus, Ohio, pp. 10–18. External Links: Link Cited by: §1, §5.
  • [4] S. Bergsma and D. Yarowsky (2011) NADA: a robust system for non-referential pronoun detection. In Anaphora Processing and Applications, I. Hendrickx, S. Lalitha Devi, A. Branco, and R. Mitkov (Eds.), Berlin, Heidelberg, pp. 12–23. Cited by: §1, §5.
  • [5] A. Björkelund and J. Kuhn (2014) Learning structured perceptrons for coreference resolution with latent antecedents and non-local features. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 47–57. Cited by: §1, §1, §2.2, §2.
  • [6] H. Chen, Z. Fan, H. Lu, A. Yuille, and S. Rong (2018-October-November) PreCo: a large-scale dataset in preschool vocabulary for coreference resolution. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 172–181. External Links: Link Cited by: §1.
  • [7] K. Clark and C. D. Manning (2015) Entity-centric coreference resolution with model stacking. In Association for Computational Linguistics (ACL), Cited by: §1, §1, §2.
  • [8] K. Clark and C. D. Manning (2016) Deep reinforcement learning for mention-ranking coreference models. In Empirical Methods on Natural Language Processing (EMNLP), Cited by: §1, §1, §2.
  • [9] K. Clark and C. D. Manning (2016) Improving coreference resolution by learning entity-level distributed representations. In Association for Computational Linguistics (ACL), Cited by: §1, §1, §2, §5.
  • [10] M. De Marneffe, M. Recasens, and C. Potts (2015-01) Modeling the lifespan of discourse entities with application to coreference resolution. J. Artif. Int. Res. 52 (1), pp. 445–475. External Links: ISSN 1076-9757, Link Cited by: §1.
  • [11] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, Cited by: §2.1, §5.
  • [12] E. R. Fernandes, C. N. dos Santos, and R. L. Milidiú (2014-12) Latent trees for coreference resolution. Computational Linguistics 40 (4), pp. 801–835. External Links: ISSN 0891-2017, Link, Document Cited by: §1.
  • [13] L. Guillou and C. Hardmeier (2016) PROTEST: a test suite for evaluating pronouns in machine translation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, and S. Piperidis (Eds.), Paris, France. External Links: ISBN 978-2-9517408-9-1 Cited by: §1.
  • [14] C. Hardmeier, P. Nakov, S. Stymne, J. Tiedemann, Y. Versley, and M. Cettolo (2015-09) Pronoun-focused mt and cross-lingual pronoun prediction: findings of the 2015 discomt shared task on pronoun translation. In Proceedings of the Second Workshop on Discourse in Machine Translation, Lisbon, Portugal, pp. 1–16. External Links: Link Cited by: §1.
  • [15] M. Joshi, O. Levy, D. S. Weld, and L. Zettlemoyer (2019) BERT for coreference resolution: baselines and analysis. arXiv preprint arXiv:1908.09091. Cited by: §1, §4.2.
  • [16] B. Kantor and A. Globerson (2019-07) Coreference resolution with entity equalization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 673–677. External Links: Document Cited by: §1, §1, §1, §2.1, §4.2.
  • [17] L. Karttunen (1976) Discourse referents. In Syntax and Semantics 7 - Notes from the Linguistic Underground, Cited by: §2.4, §5.
  • [18] H. Lee, A. Chang, Y. Peirsman, N. Chambers, M. Surdeanu, and D. Jurafsky (2013) Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics 39 (4), pp. 885–916. External Links: Document Cited by: §4.1, §5.
  • [19] K. Lee, L. He, M. Lewis, and L. Zettlemoyer (2017) End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Cited by: §1, §1, §5.
  • [20] K. Lee, L. He, and L. S. Zettlemoyer (2018) Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of the 2018 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Cited by: §1, §1, §2.1.
  • [21] S. Loáiciga, L. Guillou, and C. Hardmeier (2017) What is it? disambiguating the different readings of the pronoun ‘it’. In Proc. of EMNLP, Cited by: §5.
  • [22] X. Luo, A. Ittycheriah, H. Jing, N. Kambhatla, and S. Roukos (2004) A mention-synchronous coreference resolution algorithm based on the bell tree. In Proc. of the ACL, Cited by: §1.
  • [23] S. Martschat and M. Strube (2015) Latent structures for coreference resolution. Transactions of the Association for Computational Linguistics 3, pp. 405–418. Cited by: §1.
  • [24] R. Mitkov, R. Evans, and C. Orasan (2002) A new, fully automatic version of mitkov’s knowledge-poor pronoun resolution method. In International Conference on Intelligent Text Processing and Computational Linguistics, pp. 168–186. Cited by: §2.4.
  • [25] J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §2.1.
  • [26] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. S. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Cited by: §5.
  • [27] M. Poesio, R. Stuckardt, Y. Versley, and R. Vieira (2016) Early approaches to anaphora resolution: theoretically inspired and heuristic-based. In Anaphora Resolution: Algorithms, Resources and Applications, M. Poesio, R. Stuckardt, and Y. Versley (Eds.), Cited by: §5.
  • [28] M. Poesio, Y. Grishina, V. Kolhatkar, N. Moosavi, I. Roesiger, A. Roussel, F. Simonjetz, A. Uma, O. Uryupina, J. Yu, and H. Zinsmeister (2018-06) Anaphora resolution with the arrau corpus. In Proc. of the NAACL Worskhop on Computational Models of Reference, Anaphora and Coreference (CRAC), New Orleans, pp. 11–22. Cited by: §1, §1, §3, §3, §4.1.
  • [29] M. Poesio, R. Stuckardt, and Y. Versley (2016) Anaphora resolution: algorithms, resources and applications. Springer, Berlin. Cited by: §1, §1.
  • [30] S. Pradhan, A. Moschitti, N. Xue, O. Uryupina, and Y. Zhang (2012) CoNLL-2012 shared task: modeling multilingual unrestricted coreference in OntoNotes. In Proceedings of the Sixteenth Conference on Computational Natural Language Learning (CoNLL 2012), Jeju, Korea. Cited by: §1, §3.
  • [31] A. Rahman and V. Ng (2011) Narrowing the modeling gap: a cluster-ranking approach to coreference resolution. Journal of Artificial Intelligence Research 40, pp. 469–521. Cited by: §1.
  • [32] W. M. Soon, D. C. Y. Lim, and H. T. Ng (2001-12) A machine learning approach to coreference resolution of noun phrases. Computational Linguistics 27 (4). External Links: Link Cited by: §1.
  • [33] J. Steinberger, M. Kabadjov, and M. Poesio (2016) Coreference applications to summarization. In Anaphora Resolution: Algorithms, Resources and Applications, Cited by: §1.
  • [34] J. Steinberger, M. Poesio, M. Kabadjov, and K. Jezek (2007) Two uses of anaphora resolution in summarization. Information Processing and Management 43 (6), pp. 1663–1680. Note: Special issue on Summarization Cited by: §1.
  • [35] M. Taulé, M. A. Martí, and M. Recasens (2008) AnCora: multilevel annotated corpora for catalan and spanish. In LREC 2008, External Links: Link Cited by: §1.
  • [36] H. Telljohann, E. W. Hinrichs, S. Kübler, H. Zinsmeister, and K. Beck Stylebook for the tübingen treebank of written german (tüba-d/z). Cited by: §1.
  • [37] O. Uryupina, R. Artstein, A. Bristot, F. Cavicchio, F. Delogu, K. J. Rodriguez, and M. Poesio (2019) Annotating a broad range of anaphoric phenomena, in a variety of genres: the ARRAU corpus. Journal of Natural Language Engineering. Cited by: §1, §3.
  • [38] O. Uryupina, M. Kabadjov, and M. Poesio (2016) Detecting non-reference and non-anaphoricity. In Anaphora Resolution: Algorithms, Resources, and Applications, pp. 369–392. Cited by: §1, §5.
  • [39] Y. Versley, A. Moschitti, M. Poesio, and X. Yang (2008-08) Coreference systems based on kernels methods. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, UK, pp. 961–968. External Links: Link Cited by: §1.
  • [40] S. Wiseman, A. M. Rush, and S. M. Shieber (2016) Learning global features for coreference resolution. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 994–1004. Cited by: §2.
  • [41] S. Wiseman, A. M. Rush, S. Shieber, and J. Weston (2015) Learning anaphoricity and antecedent ranking features for coreference resolution. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 1416–1426. Cited by: §1, §2, §5.