Revisiting Memory-Efficient Incremental Coreference Resolution

04/30/2020 ∙ by Patrick Xia, et al. ∙ Johns Hopkins University

We explore the task of coreference resolution under fixed memory by extending an incremental clustering algorithm to utilize contextualized encoders and neural components. Our algorithm creates explicit representations for each entity: given a new sentence, spans are proposed and subsequently scored against each entity representation, leading to emergent clusters. Our approach is end-to-end trainable and can be used to transform existing models, leading to an asymptotic reduction in memory usage while remaining competitive on task performance. This allows for more efficient use of computational resources for short documents and makes coreference more feasible across very long document contexts.




1 Introduction

Coreference resolution, the task of recognizing when spans of text refer to the same entity, is a core step in natural language processing for both demonstrating language understanding and improving information extraction systems. At the sentence level, ambiguities in pronoun coreference can be used to probe a model’s ability to capture common sense Levesque et al. (2012); Sakaguchi et al. (2020) or gender bias Rudinger et al. (2018); Zhao et al. (2018). At the document level, coreference resolution is commonly a component of the information extraction pipeline to associate properties of different spans in text to the same entity, which is needed for relation extraction, argument extraction, and entity typing. Coreference resolution is also used at the document level for question answering Dasigi et al. (2019) or modeling characters in books Bamman et al. (2014).

Models for this task typically encode the text before scoring and subsequently clustering candidate mention spans, either found by a parser Clark and Manning (2016) or learned jointly Lee et al. (2017). The first challenge in prior work, then, has been to find the best features for a pairwise span scoring function Raghunathan et al. (2010); Clark and Manning (2016); Wu et al. (2019). The second area of innovation has been in decoding: simply building a cluster by attaching each span to its highest-scoring antecedent only ensures a high score with the nearest, not all, members of the cluster. Instead, rescoring passes Lee et al. (2018); Kantor and Globerson (2019) and global cluster information Wiseman et al. (2016) have been used to promote global cluster consistency. Finally, recent models rely on pretrained encoders to create high-dimensional input text (and span) representations, as improvements in contextualized encoders appear to translate directly to coreference resolution Lee et al. (2018); Joshi et al. (2019, 2019).

These models rely on simultaneous access to all spans for pairwise scoring and all pairwise scores for clustering. As the output dimensionality of contextualized encoders (and therefore span embedding size) increases, this becomes computationally intractable for long documents or with limited memory. Further, it breaks with any notion of how humans incrementally read a text. Webster and Curran (2014) argued in favor of a limited memory constraint as a more psycholinguistically plausible approach to reading, modeling coreference resolution via shift-reduce parsing.

We revisit that approach primarily motivated by scalability and armed with advances in neural architectures and models. Our model begins with a SpanBERT Joshi et al. (2019) encoding of the text and proposes candidate mention spans. These are used to create or update explicit entity representations. Clustering is performed online: each span either attaches to an existing cluster or creates a new one. Our goal is to decrease memory usage during inference, accomplished by storing only the embeddings of entities in the document and a (small) active set of candidate mention spans. Our model is trainable end-to-end, independent of document length. It reduces the apparently linear memory needed by a state-of-the-art model to apparently constant, while sacrificing little (e.g., 1.4 F1 in performance on OntoNotes 5.0, before any hyperparameter optimization).

2 Model

Our algorithm revisits the approach taken by Webster and Curran (2014), which incrementally makes coreference resolution decisions (online clustering). Abstractly, our algorithm is very similar. The major differences, however, lie in the use of neural components, explicit entity representations, and learning. We first give an overview of the algorithm during inference before discussing the challenges of learning.

2.1 Inference


Our algorithm creates a permanent list of entities, each with its own representation. For a given sentence or segment, the model proposes a candidate set of spans. For each span, a scorer evaluates the span representation with all the cluster representations. This is used to determine to which (if any) of the pre-existing clusters the current span should be added. Upon inclusion of the span in the cluster, the cluster’s representation is subsequently updated. Under this algorithm, each clustering decision is also permanent. This process is formalized in Algorithm 1.

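function FindClusters(Document)
     Create an empty Entity List, E
     for segment ∈ Document do
          M ← Spans(segment)
          for m ∈ M do
               scores ← PairScore(m, E)
               e ← best-scoring entity in E, if any
               if scores favor attaching m to e then
                    Update(e, m)
               else
                    Add m to E as a new entity
               end if
          end for
     end for
     return E
end function
Algorithm 1 Inference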
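The control flow of the online clustering loop can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: `embed` and `pair_score` are toy stand-ins for the neural span encoder and PairScore, every token of a segment is treated as a proposed span, and the attach threshold of 0 is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

Vector = List[float]


@dataclass
class Entity:
    vec: Vector  # current entity representation
    mentions: List[str] = field(default_factory=list)


def find_clusters(
    document: Sequence[Sequence[str]],
    embed: Callable[[str], Vector],
    pair_score: Callable[[Vector, Vector], float],
    threshold: float = 0.0,  # assumed attach threshold (hypothetical)
) -> List[Entity]:
    entities: List[Entity] = []
    for segment in document:
        for mention in segment:  # stand-in for Spans(segment)
            vec = embed(mention)
            scores = [pair_score(vec, e.vec) for e in entities]
            best: Optional[int] = max(
                range(len(scores)), key=scores.__getitem__, default=None
            )
            if best is not None and scores[best] > threshold:
                # Attach to the best entity and fold the span into its mean.
                ent = entities[best]
                n = len(ent.mentions)
                ent.vec = [(n * a + b) / (n + 1) for a, b in zip(ent.vec, vec)]
                ent.mentions.append(mention)
            else:
                # No existing entity is good enough: start a new cluster.
                entities.append(Entity(list(vec), [mention]))
    return entities


# Toy components: 2-d embeddings and a shifted dot-product scorer.
TOY = {"Ann": [1.0, 0.0], "she": [0.9, 0.1], "Bob": [0.0, 1.0], "he": [0.1, 0.9]}
toy_score = lambda u, v: sum(a * b for a, b in zip(u, v)) - 0.5
clusters = find_clusters([["Ann", "Bob"], ["she", "he"]], TOY.__getitem__, toy_score)
print([e.mentions for e in clusters])  # [['Ann', 'she'], ['Bob', 'he']]
```

Only the entity list and the spans of the current segment are alive at any point, which is the property the algorithm relies on for its memory savings.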


Concretely, our model uses a contextualized encoder, SpanBERT Joshi et al. (2019), to encode an entire segment. Given a segment, Spans returns candidate spans, following an enumerate-and-prune approach Lee et al. (2017). Span representations are formed by concatenating the contextualized embedding for both the first and last token of the mention, following prior work Joshi et al. (2019). Then, a feedforward network is trained to score each enumerated span (up to a certain width). The top fraction of the spans are returned as output.
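The enumerate-and-prune step can be illustrated with a small sketch. The names `max_width` and `top_fraction`, the choice to keep `top_fraction` times the number of tokens, and the toy scoring function (standing in for the trained feedforward scorer) are all assumptions for illustration.

```python
import math
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def enumerate_spans(num_tokens: int, max_width: int) -> List[Span]:
    """All spans of at most `max_width` tokens."""
    return [
        (start, end)
        for start in range(num_tokens)
        for end in range(start, min(start + max_width, num_tokens))
    ]


def prune_spans(
    spans: List[Span],
    score: Callable[[Span], float],
    num_tokens: int,
    top_fraction: float,
) -> List[Span]:
    """Keep the highest-scoring spans, then restore document order."""
    keep = max(1, math.floor(top_fraction * num_tokens))
    best = sorted(spans, key=score, reverse=True)[:keep]
    return sorted(best)


tokens = ["The", "mayor", "said", "she", "agreed"]
candidates = enumerate_spans(len(tokens), max_width=2)
# Hypothetical scorer: prefer short spans that end in "mayor" or "she".
toy_score = lambda s: (tokens[s[1]] in {"mayor", "she"}) - 0.1 * (s[1] - s[0])
kept = prune_spans(candidates, toy_score, len(tokens), top_fraction=0.4)
print([" ".join(tokens[a : b + 1]) for a, b in kept])  # ['mayor', 'she']
```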

PairScore is another feedforward network which takes as input the concatenation of two span representations along with additional embeddings for features like distance or genre. Update updates the entity with the newly linked span. We default to an entity’s representation being the average of its mention spans through a moving average. (Footnote 1: We leave to future work the possibility that a trainable entity representation will further improve the approach.)
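One way to read the moving-average update is as a running mean that needs only the current entity vector and a mention count, so individual span embeddings can be discarded once they are linked. The sketch below assumes that reading; `update_entity` is a hypothetical helper name.

```python
from typing import List


def update_entity(mean: List[float], count: int, span: List[float]) -> List[float]:
    """Fold one span embedding into the mean of `count` earlier spans."""
    return [(count * m + s) / (count + 1) for m, s in zip(mean, span)]


spans = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
entity = spans[0]
for seen, span in enumerate(spans[1:], start=1):
    entity = update_entity(entity, seen, span)
print(entity)  # [1.0, 1.0], the plain average of the three spans
```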

These components are interchangeable, as only scores and representations are passed between them. Specifically, our algorithm is compatible with Wu et al. (2019). They use a query-based pairwise scorer, which we could adopt in place of the feedforward pairwise scorer. The use of abstract components also allows for comparison of different encoders or update rules.

2.2 Training

As described, the algorithm is intuitively trainable end-to-end because each of its components is itself end-to-end trainable. For each mention $m_i$, we can treat scores as unnormalized probabilities of $m_i \in e_j$ for $e_j \in E$, where $E$ is the entity list, and include an $\varepsilon$ target label, representing the action of creating a new cluster. Then, the objective is to maximize $P(m_i \in e_{gold})$, where $e_{gold}$ is the gold cluster for $m_i$.

However, for the many spans that are singletons, $e_{gold} = \varepsilon$. As a result, trained models quickly reach a poor local optimum as Spans learns to output syntactically nonsensical spans whose gold label is $\varepsilon$. The algorithm also introduces sample inefficiency (which slows down training), as most updates have the same label and barely accrue loss. This would typically not be an issue, but the algorithm is entirely sequential. While training can be batched by document, the size of each document makes this strategy inefficient. However, if we don’t batch by document, training is slow (hundreds of updates per document) compared to other end-to-end approaches (one update per document). We discuss mitigation strategies for each of these.
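The per-mention objective can be written as a softmax cross-entropy over the entity scores plus one extra slot for the new-cluster action. The sketch below is a minimal illustration under assumptions: the function name `mention_loss` is hypothetical, and fixing the new-cluster logit at 0 is an assumption rather than something stated here.

```python
import math
from typing import List, Optional


def mention_loss(
    scores: List[float], gold: Optional[int], new_score: float = 0.0
) -> float:
    """Negative log-likelihood of the gold action for one mention.

    scores[j] is the score of the mention against entity j. `gold` is the
    index of the gold entity, or None when the gold action is to start a
    new cluster. `new_score` is the (assumed) logit of that action.
    """
    logits = [new_score] + scores
    target = 0 if gold is None else gold + 1
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


print(mention_loss([2.0, -1.0], gold=0))     # gold entity already scores highest
print(mention_loss([2.0, -1.0], gold=None))  # gold action is a new cluster
```

Maximizing the probability of the gold action is the same as minimizing this loss, and a mention whose gold action already dominates contributes very little, which is the sample-inefficiency issue noted above.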


Since the goal is to create a limited-memory model for inference, we lean heavily on pretrained components. In particular, we reuse not only encoder weights that have already been finetuned on this dataset, but also the mention and pairwise scorers from Joshi et al. (2019) as initialization for our encoder, Spans, and PairScore. This reduces the number of mentions with a gold label of $\varepsilon$ (the new-cluster action). Furthermore, we can apply teacher forcing. Even if the gold cluster $e_{gold}$ is not the top-scoring entry of scores, the mention $m_i$ can instead update the entity cluster $e_{gold}$.
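The effect of teacher forcing on the clustering state can be sketched as a choice of which entity receives the update. This is an illustrative sketch: `entity_to_update` is a hypothetical helper, and the threshold of 0 for attaching without teacher forcing is an assumption.

```python
from typing import List, Optional


def entity_to_update(
    scores: List[float],
    gold: Optional[int],
    teacher_forcing: bool,
    threshold: float = 0.0,  # assumed attach threshold (hypothetical)
) -> Optional[int]:
    """Index of the entity to update, or None to create a new cluster."""
    if teacher_forcing:
        return gold  # follow the gold clustering, whatever the scores say
    if not scores:
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] > threshold else None


scores = [0.3, 1.2]
print(entity_to_update(scores, gold=0, teacher_forcing=False))  # 1
print(entity_to_update(scores, gold=0, teacher_forcing=True))   # 0
```

In both cases the loss would still be computed from the model's own scores; only the entity list that later mentions are scored against differs.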


Based on the hyperparameters chosen, even in the best case, over 70% of the predicted mentions are either singletons or the first mention of an entity cluster. To increase the signal of the clusterable mentions, in training, we only compute loss on a randomly sampled subset of the mentions with a gold label of $\varepsilon$ (the new-cluster action).


While the previous two approaches ensure that more meaningful updates are made, the training process is slow due to the sequential nature of the algorithm and memory cost of batching. We accumulate loss across all examples and only update the parameters once every segment. However, accumulating loss over a growing computation graph is memory intensive. To resolve this, we periodically perform the backwards pass for the gradients for each parameter without updating their weights (or the optimizer). This releases the computation graph stored in memory and compromises between memory usage and speed. We also freeze the encoder, which speeds up training by not only reducing the model size but also freeing up memory, reducing the number of backwards passes.
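The idea of running periodic backward passes that accumulate gradients, followed by a single parameter update, can be shown with a framework-free toy. This is only a sketch of the principle: the one-parameter model, the squared-error loss, and the analytic `grad` function (standing in for an autograd backward pass) are assumptions for illustration.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (input x, target y) for the toy model y = w * x


def grad(w: float, batch: List[Example]) -> float:
    """d/dw of the summed squared error over `batch`."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)


def accumulate_then_step(w: float, data: List[Example], chunk: int, lr: float) -> float:
    total = 0.0
    for i in range(0, len(data), chunk):
        # "Backward" on one chunk: its gradient is added to the accumulator,
        # after which the chunk's intermediate state can be discarded.
        total += grad(w, data[i : i + chunk])
    return w - lr * total  # single parameter update after all chunks


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_chunked = accumulate_then_step(0.0, data, chunk=2, lr=0.01)
w_full = accumulate_then_step(0.0, data, chunk=len(data), lr=0.01)
print(w_chunked, w_full)
```

Because gradients add, the chunked version reaches the same update as one large backward pass while holding only one chunk's worth of intermediate state at a time.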

3 Experiments

We perform preliminary investigations for several questions: 1) How robust is the algorithm to document length and genre; 2) How does segment length affect performance; 3) What is the tradeoff between memory and task performance? As this work is focused on exploration and analysis, the experiments are conducted over the development set, with Table 1 being the sole exception.

3.1 Data

We use OntoNotes 5.0 Weischedel et al. (2013); Pradhan et al. (2013), which consists of 2,802, 343, and 348 documents annotated for coreference resolution in the training, development and test splits respectively. These documents span several genres: broadcast news, newswire, magazines, telephone conversations, weblogs, and the Bible. Further, each sentence is annotated with its speaker, an important component for conversational genres.

3.2 Hyperparameters

We use all hyperparameters used in the publicly released spanbert_large coreference resolution model by Joshi et al. (2019, 2019). As described above, we also use their parameters for the encoder, span scorer, and span pair scorer as initialization. The only changes to the hyperparameters relate to training: we use a learning rate of 5e-4 and patience of 5 epochs for early stopping. In addition, our model does not make use of speaker features, especially since it’s not meaningful to assign a speaker to the cluster representation. We also do not make use of segment distance embeddings, as one goal is to freely interchange training and testing segment lengths based on test time memory constraints. Besides learning rate, we did no hyperparameter search; our goal is to show that any model can easily be adapted by our algorithm. (Footnote 2: We also tried two other learning rates with segment length 512.)

When computing loss, we sample negative examples (spans that do not belong in a cluster) at a rate of 0.2, which roughly equalizes the number of attachable spans and new clusters.
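The subsampling of negative examples can be sketched as a filter over the mentions of a document. This is an illustrative sketch: `sample_for_loss` is a hypothetical helper, a gold label of `None` stands for a span that does not join an existing cluster, and the 80/20 split of the toy data is made up.

```python
import random
from typing import List, Optional, Sequence


def sample_for_loss(
    gold_labels: Sequence[Optional[int]],
    negative_rate: float,
    rng: random.Random,
) -> List[int]:
    """Indices of the mentions that contribute to the loss.

    Mentions with a gold entity are always kept; the rest are kept with
    probability `negative_rate`.
    """
    return [
        i
        for i, gold in enumerate(gold_labels)
        if gold is not None or rng.random() < negative_rate
    ]


gold = [None] * 80 + [0] * 20  # toy document: mostly non-attaching spans
kept = sample_for_loss(gold, negative_rate=0.2, rng=random.Random(0))
negatives = sum(1 for i in kept if gold[i] is None)
print(len(kept) - negatives, "attachable,", negatives, "sampled negatives")
```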

4 Results

4.1 Performance and Document Length

Subset         #Docs   JS-L   Ours      Δ   SOTA
All              343   80.1   78.7   -1.4   83.4
0-128             57   84.6   83.8   -0.8
129-256           73   83.7   83.3   -0.4
257-512           78   82.9   82.0   -0.9
513-768           71   80.1   78.6   -1.5
769-1152          52   79.1   77.9   -1.2
1153+             12   71.3   69.3   -2.0
1 Speaker        268   81.1   80.3   -0.8
2+ Speakers       75   76.7   74.6   -2.1
Test             348   79.6   78.2   -1.4   83.1
Table 1: Average F1 score broken down by document length and conversational genre. JS-L refers to the spanbert_large model from Joshi et al. (2019), which we treat as our baseline. Δ is the difference between our model and JS-L. SOTA refers to Wu et al. (2019).

To address the first question, we compare the performance of our best model against the baseline model Joshi et al. (2019). We break down the performance of our model based on the length (in subtokens) of the document in Table 1. (Footnote 3: This split of the development set differs from that in Joshi et al. (2019), which counts the number of 128-subtoken sized segments. We directly count subtokens.) Our average F1 falls behind the baseline, but there is no significant trend based on document length. We perform a similar analysis based on genres that do and do not have multiple speakers and find that removing speaker features hurts our model, matching previous findings Lee et al. (2017). One way to address this and retain explicit entity representations is by treating speakers as part of the input text Wu et al. (2019).

The state-of-the-art model scores considerably higher by replacing a feedforward span pair scorer with a query-based scorer Wu et al. (2019). In their model, scoring a pair of spans consists of posing one span as a query and computing the likelihood that the other is an answer. Notably, this scorer fits the interface of PairScore, and future work can investigate adapting their model in a more memory-efficient manner. (Footnote 4: At the time of writing, the implementation of Joshi et al. (2019) was, of the current best-performing coreference models, the most amenable to extension and experimentation, and it therefore served as our illustrative example.)

4.2 Effect of Segment Length

We experiment with several possible segment lengths both for training and inference by considering segments of up to [128, 256, 384, 512] subtokens (split at sentence boundaries). We also explore segmentation by number of sentences, of up to [1, 5, 10] sentences (capped by 512 subtokens).
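Splitting a document into segments at sentence boundaries, with caps on subtokens and optionally on sentences, can be sketched as greedy packing. This is a minimal sketch under assumptions: `make_segments` is a hypothetical helper, and a single sentence longer than the cap is simply placed in its own segment rather than split.

```python
from typing import List, Optional

Sentence = List[str]  # one sentence as a list of subtokens


def make_segments(
    sentences: List[Sentence],
    max_tokens: int,
    max_sentences: Optional[int] = None,
) -> List[List[Sentence]]:
    """Greedily pack whole sentences into segments under the given caps."""
    segments: List[List[Sentence]] = []
    current: List[Sentence] = []
    size = 0
    for sent in sentences:
        full = bool(current) and (
            size + len(sent) > max_tokens
            or (max_sentences is not None and len(current) >= max_sentences)
        )
        if full:
            segments.append(current)
            current, size = [], 0
        current.append(sent)
        size += len(sent)
    if current:
        segments.append(current)
    return segments


sents = [["a"] * 3, ["b"] * 4, ["c"] * 2, ["d"] * 5]
by_tokens = make_segments(sents, max_tokens=8)
by_sentence = make_segments(sents, max_tokens=8, max_sentences=1)
print([[len(s) for s in seg] for seg in by_tokens])    # [[3, 4], [2, 5]]
print([[len(s) for s in seg] for seg in by_sentence])  # [[3], [4], [2], [5]]
```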

These results, in Table 2, follow the claim from Joshi et al. (2019) that larger context windows, and specifically those compatible with the window size of the contextualized encoder, lead to better performance. On the other hand, we find that if the inference segment size is fixed in advance, then training a model using that segment size is better than using the maximum length. Finally, there is a substantial drop when either training or testing at the single sentence level.

                  Inference segment
                  Sentences        Tokens
Training          1       10       128     512
128 toks.         69.1    76.8     75.7    77.2
256 toks.         69.2    77.1     75.7    78.4
384 toks.         67.6    76.9     75.5    78.5
512 toks.         66.7    77.1     74.7    78.7
1 sent.           69.1    73.6     73.0    72.9
5 sent.           69.6    76.9     75.7    77.9
10 sent.          68.6    77.1     75.3    78.1
Table 2: Average dev. F1 score for models that are trained and evaluated across a range of segment lengths.

Gold   Pretrained   TF   # Epochs   Avg. F1
N      Y            N     3         78.7
N      Y            Y    11         77.8
N      N            N    15         78.0
N      N            Y     2         78.7
Y      Y            N    32         91.4
Y      Y            Y    14         91.2
Y      N            N     6         91.7
Y      N            Y     5         91.5
Table 3: Average dev. F1 score for models that (do not) use gold spans, pretrained scorers for coreference resolution, and teacher forcing (TF).

4.3 Training Conditions

Table 3 shows performance when we use gold spans, pretrained coreference scorers as initialization, and teacher forcing (as described in Section 2.2). The primary result is that the incremental model is able to effectively use randomly initialized Spans and PairScore weights without teacher forcing.

4.4 Inference Memory

Segment      GPU Memory (GB)   Dev. F1
1 sent.       1.7              66.7
10 sent.      2.0              77.1
128 toks.     1.6              74.7
512 toks.     2.0              78.7
JS-B          6.4              77.7
JS-L         11.9              80.1
Table 4: Varying segment length to observe GPU Memory used during inference and its relation to task performance. JS-B and JS-L refer to the base and large variants of Joshi et al. (2019) with SpanBERT.

Our main goal was to show that it is possible to perform coreference resolution with recent neural models efficiently with respect to memory. In Table 4, we report the peak memory used in practice as observed by nvidia-smi when evaluating our best model over the development set on an NVIDIA GTX Titan X (12 GB). (Footnote 5: We evaluate over the full set because while the longest document has 2,742 subtokens, that document may not consume the most memory.) Compared to the baseline (which runs out of memory even with segment lengths of 128) and its smaller spanbert-base counterpart, our model uses substantially less memory.

Figure 1: Plot of peak memory allocated for a run on each document in the development set. The models compared are the small and large models of the baseline Joshi et al. (2019) and our model with inference sequence lengths of 128 and 512.

However, usage in practice is subject to the memory allocation algorithms used. To be faithful to the original implementation, Joshi et al. (2019) is profiled with TensorFlow 1.15, while our model uses PyTorch 1.5. To compare the theoretical minimum peak usage, we sum the usage of the allocated tensors at every step and plot an exact lower bound for each document during inference over the development set. (Footnote 6: We use the pytorch_memlab package version 0.0.4 and torch.cuda for profiling PyTorch and tf.benchmark.run_op_benchmark for profiling TensorFlow.) The figure shows both versions of the baseline and our model with test segment lengths 128 and 512. Figure 1 compares the peak theoretical memory usage against the document length (in tokens). In particular, it shows that the memory usage in the baseline is dominated by a term that grows linearly for longer documents, while that does not appear to be the case for our model, which has virtually constant memory usage on this dataset.
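Summing the sizes of live tensors at every step to obtain a peak that does not depend on the allocator can be sketched as follows. This is a toy sketch only: the event format, the tensor names, and the byte counts are made up, and it does not use the profiling tools named above.

```python
from typing import Dict, Iterable, Tuple

# (op, tensor name, size in bytes); op is "alloc" or "free".
# The size field is ignored for "free" events.
Event = Tuple[str, str, int]


def peak_live_bytes(events: Iterable[Event]) -> int:
    """Largest total size of simultaneously live tensors in the trace."""
    live: Dict[str, int] = {}
    current = 0
    peak = 0
    for op, name, size in events:
        if op == "alloc":
            live[name] = size
            current += size
        else:
            current -= live.pop(name)
        peak = max(peak, current)
    return peak


trace = [
    ("alloc", "encoder_out", 400),
    ("alloc", "span_scores", 100),
    ("free", "encoder_out", 0),
    ("alloc", "entities", 50),
]
print(peak_live_bytes(trace))  # 500
```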

These plots do not represent asymptotic memory usage: the entity list, $E$, can grow linearly with respect to document length, while the other models have a quadratic component for scoring span pairs. These terms have (relatively) small scalar coefficients and their effect is not apparent on this dataset. In fact, the encoder, SpanBERT, is the dominating term for both shorter documents in the baseline and for most documents in our model. Future work can explore replacing the underlying Transformer in SpanBERT with sparse transformers, which would reduce memory usage Child et al. (2019); Kitaev et al. (2020).
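The difference between the two growth rates can be made concrete with a back-of-the-envelope count. The sketch below is illustrative only: the helper names, the embedding size, and the worst-case assumption that every span starts its own entity are hypothetical, and it ignores the encoder, which the text identifies as the dominating term in practice.

```python
def entity_list_floats(num_entities: int, dim: int) -> int:
    """Floats held by an entity list: one `dim`-sized vector per entity."""
    return num_entities * dim


def pairwise_score_floats(num_spans: int) -> int:
    """Floats held by a dense table with one score per ordered span pair."""
    return num_spans * num_spans


DIM = 8  # hypothetical embedding size, chosen only for readability
for n in (100, 200, 400):
    # Worst case for the entity list: every span starts its own entity.
    print(n, entity_list_floats(n, DIM), pairwise_score_floats(n))
```

Doubling the number of spans doubles the first count and quadruples the second, which is why the two terms separate only on sufficiently long documents.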

4.5 Span Representations

Entity Embeddings and Clustering

In this work, we assumed that averaging span representations, produced by SpanBERT, from the same cluster into a single embedding would be beneficial. Here, we challenge this assumption. As span representations are contextual and entities can be referred to with different surface forms, representations of the same entity are not necessarily similar. If they are closely clustered, then there is evidence that SpanBERT (finetuned on this dataset) captures some notion of coreference and that an online clustering approach, as we use here, has promise. If they are spread apart, then the success in modeling coreference resolution, to date, has been due to improvements in the span pair scorer.

Figure 2: t-SNE plot of span representations and predictions of our model on document cnn_0000_0. Each color is a cluster, while black indicates that it was a singleton. Some singletons are labeled with the text to show types of spans that are considered for clustering.

Figure 2 shows a t-SNE visualization of all of the proposed span representations from one of the documents in the development set, while the colors represent our model’s predictions. The underlying encoder is a SpanBERT model already finetuned on this dataset Joshi et al. (2019), while the pairwise scorer is from our best model (train and test segment lengths of 512). In most cases, the clusters are close. However, there is a clear cluster of pronouns whose span representations are “far” from their clusters’ center or full form (e.g. “them” and its antecedent, “the billboards for it”).

This figure shows that while an online clustering approach is suitable for this task, it alone is not sufficient. Due to the distance of the pronouns from their entities, a strong pairwise scorer is still needed to correctly match pronouns. This is not too surprising: matching pronouns is known to be challenging for neural models Webster et al. (2018); Joshi et al. (2019), and model visualizations like Figure 2 may provide more insight.


One limitation (and bias) in our work is its heavy reliance on spanbert-large that has already been fine-tuned on the OntoNotes training split. Empirically, we find that after not even one epoch of training, the model begins to overfit (in both loss and F1) on the training set, and after three epochs, the model attains >99 F1 performance on the training set. On the other hand, using a spanbert-large model that is not finetuned on the data results in significantly worse performance. (Footnote 7: With the same hyperparameters and a frozen SpanBERT that was not already finetuned, F1 did not exceed 50.) This suggests that our algorithm, in its current state, is more suitable for adapting existing finetuned encoders. Furthermore, it suggests that the SpanBERT released with the baseline model may be overfit to the training set.

5 Conclusions

We reintroduced an online algorithm for memory-efficient coreference resolution that incorporates contributions from recent neural end-to-end models. We show that it is possible to transform a model that performs document-level inference into one that operates in a segment-wise, online manner. By doing so, we greatly reduced the memory usage of the model during inference at a relatively small cost in performance. As a result, we provide an alternative option for researchers and practitioners who are constrained by memory.


This work was supported in part by DARPA AIDA and KAIROS. The views and conclusions contained in this work are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, or endorsements of DARPA or the U.S. Government.


  • D. Bamman, T. Underwood, and N. A. Smith (2014) A Bayesian mixed effects model of literary character. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, pp. 370–379.
  • R. Child, S. Gray, A. Radford, and I. Sutskever (2019) Generating long sequences with sparse transformers.
  • K. Clark and C. D. Manning (2016) Improving coreference resolution by learning entity-level distributed representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 643–653.
  • K. Clark and C. D. Manning (2016) Deep reinforcement learning for mention-ranking coreference models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2256–2262.
  • P. Dasigi, N. F. Liu, A. Marasović, N. A. Smith, and M. Gardner (2019) Quoref: a reading comprehension dataset with questions requiring coreferential reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5925–5932.
  • M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy (2019) SpanBERT: improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529.
  • M. Joshi, O. Levy, L. Zettlemoyer, and D. Weld (2019) BERT for coreference resolution: baselines and analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5803–5808.
  • B. Kantor and A. Globerson (2019) Coreference resolution with entity equalization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 673–677.
  • N. Kitaev, L. Kaiser, and A. Levskaya (2020) Reformer: the efficient transformer. In International Conference on Learning Representations.
  • K. Lee, L. He, M. Lewis, and L. Zettlemoyer (2017) End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 188–197.
  • K. Lee, L. He, and L. Zettlemoyer (2018) Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 687–692.
  • H. J. Levesque, E. Davis, and L. Morgenstern (2012) The Winograd schema challenge. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, KR’12, pp. 552–561. ISBN 9781577355601.
  • S. Pradhan, A. Moschitti, N. Xue, H. T. Ng, A. Björkelund, O. Uryupina, Y. Zhang, and Z. Zhong (2013) Towards robust linguistic analysis using OntoNotes. In Proceedings of CoNLL.
  • K. Raghunathan, H. Lee, S. Rangarajan, N. Chambers, M. Surdeanu, D. Jurafsky, and C. Manning (2010) A multi-pass sieve for coreference resolution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, pp. 492–501.
  • R. Rudinger, J. Naradowsky, B. Leonard, and B. Van Durme (2018) Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 8–14.
  • K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi (2020) WinoGrande: an adversarial Winograd schema challenge at scale. AAAI.
  • K. Webster and J. R. Curran (2014) Limited memory incremental coreference resolution. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin, Ireland, pp. 2129–2139.
  • K. Webster, M. Recasens, V. Axelrod, and J. Baldridge (2018) Mind the GAP: a balanced corpus of gendered ambiguous pronouns. Transactions of the Association for Computational Linguistics 6, pp. 605–617.
  • R. Weischedel, M. Palmer, M. Marcus, E. Hovy, S. Pradhan, L. Ramshaw, N. Xue, A. Taylor, J. Kaufman, M. Franchini, et al. (2013) OntoNotes release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA.
  • S. Wiseman, A. M. Rush, and S. M. Shieber (2016) Learning global features for coreference resolution. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 994–1004.
  • W. Wu, F. Wang, A. Yuan, F. Wu, and J. Li (2019) Coreference resolution as query-based span prediction. arXiv preprint arXiv:1911.01746.
  • J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K. Chang (2018) Gender bias in coreference resolution: evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 15–20.