Coreference resolution is the task of clustering mentions in text that refer to the same real-world entities (Figure 1). As a fundamental natural language processing task, coreference resolution is an essential component of many downstream applications [26, 17]. Many traditional coreference resolution systems are pipelines consisting of two separate components: (1) a mention detector that identifies entity mentions in text, and (2) a coreference resolver that clusters the extracted mentions [21, 5, 2, 27, 3]. These models typically rely heavily on syntactic parsers and use highly engineered mention proposal algorithms.
In 2017, the first end-to-end coreference resolution model, named e2e-coref, was proposed. It outperforms previous pipelined systems without using any syntactic parser or complicated hand-engineered features. Since then, many extensions to the e2e-coref model have been introduced, ranging from higher-order inference to directly optimizing evaluation metrics with reinforcement learning [30, 16, 7, 6, 10, 9, 12, 8] (Figure 2). Despite improving coreference resolution performance by a large margin, these extensions add considerable complexity to the original model. Motivated by this observation and by recent advances in pre-trained Transformer language models, we propose a simple yet effective baseline for coreference resolution. We introduce simplifications to the original e2e-coref model, creating a conceptually simpler model for coreference resolution. Despite its simplicity, our model achieves promising performance, outperforming all of the aforementioned methods on the public English OntoNotes benchmark. Our work provides evidence for the necessity of carefully justifying the complexity of existing or newly proposed models, as introducing conceptual or practical simplifications to an existing model can still yield competitive results. Our findings agree with the results of several recent works [29, 11]. For example, the authors of one of these works also introduced a minimalist approach that performs on par with more complicated models.
At a high level, our coreference resolution model is similar to the e2e-coref model (Figure 3). Given a sequence of tokens from an input document, the model first forms a contextualized representation for each token using a Transformer-based encoder. After that, all the spans (up to a certain length) in the document are enumerated. The model then assigns a score to each candidate span indicating whether the span is an entity mention. A portion of top-scoring spans is extracted and fed to the next stage where the model predicts distributions over possible antecedents for each extracted span. The final coreference clusters can be naturally constructed from the antecedent predictions. In the following subsections, we go into more specific details.
2.1 Notations and Preliminaries
Given an input document $D$ consisting of $T$ tokens, the total number of possible text spans is $T(T+1)/2$. For each span $i$, we denote the start and end indices of the span by $\mathrm{START}(i)$ and $\mathrm{END}(i)$, respectively. We also assume an ordering of the spans based on $\mathrm{START}(i)$; spans with the same start index are ordered by $\mathrm{END}(i)$. Furthermore, we only consider spans that are entirely within a sentence and limit spans to a maximum length of $L$ tokens.
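As a rough illustration, the span enumeration and ordering described above can be sketched as follows (the function name, variable names, and toy sentence lengths are ours, not from the paper):

```python
# Sketch: enumerate all candidate spans that lie entirely within a
# sentence and are at most MAX_SPAN_LEN tokens long. Spans come out
# ordered by start index, then by end index.

MAX_SPAN_LEN = 3  # small toy value; the paper limits spans to a max length L

def enumerate_spans(sentence_lengths, max_len=MAX_SPAN_LEN):
    """Return (start, end) token-index pairs (inclusive) per the ordering above."""
    spans = []
    offset = 0
    for sent_len in sentence_lengths:
        for start in range(offset, offset + sent_len):
            # the end index must not cross the sentence boundary
            last = min(start + max_len - 1, offset + sent_len - 1)
            for end in range(start, last + 1):
                spans.append((start, end))
        offset += sent_len
    # the nested loops already yield spans sorted by (start, end)
    return spans

spans = enumerate_spans([2, 3])  # two toy sentences: 2 tokens and 3 tokens
```

Note that a span such as (1, 2), which would straddle the two sentences, is never produced.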
Since speaker information is known to be useful for coreference resolution, it has been used extensively in previous works [5, 15, 16, 9, 8]. For example, the original e2e-coref model converts speaker information into binary features indicating whether two candidate mentions are from the same speaker. In this work, we employ a more intuitive strategy that directly concatenates the speaker's name with the corresponding utterance. This straightforward strategy is simple to implement and has been shown to be more effective than the feature-based method. Figure 4 illustrates the concatenation strategy.
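A minimal sketch of the concatenation strategy is below. The exact separator format is our assumption for illustration; the paper's precise formatting may differ:

```python
# Sketch: prepend the speaker's name to the utterance text before
# encoding, instead of using binary same-speaker features.
# The "name: utterance" format is an illustrative assumption.

def concat_speaker(speaker, utterance):
    return f"{speaker}: {utterance}"

line = concat_speaker("Alice", "I saw her yesterday.")
```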
2.2 Encoder Layer
Given the input document $D$, the model simply forms a contextualized representation for each input token using a Transformer-based encoder such as BERT or SpanBERT. These pretrained language models typically can only run on sequences of at most 512 tokens. Therefore, to encode a long document (i.e., one longer than 512 tokens), we split the document into overlapping segments, creating a new fixed-size segment after every fixed number of tokens. These segments are then passed to the Transformer-based encoder independently. The final token representations are derived by taking, for each token, the representation from the segment in which the token has the maximum surrounding context. Let $X = (x_1, \dots, x_T)$ be the output of the Transformer encoder.
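The overlapping-segment scheme can be sketched as follows. Here a "representation" is stood in for by a segment id, and the segment size and stride are toy values of ours:

```python
# Sketch: a seg_size-sized segment starts every `stride` tokens, and
# each token's final representation is taken from the segment in which
# the token has the most surrounding context.

def make_segments(num_tokens, seg_size, stride):
    """Overlapping [start, end) windows covering the document."""
    segments = []
    start = 0
    while True:
        segments.append((start, min(start + seg_size, num_tokens)))
        if start + seg_size >= num_tokens:
            break
        start += stride
    return segments

def max_context_segment(token_idx, segments):
    """Pick the segment where the token sits farthest from either edge."""
    best, best_ctx = None, -1
    for seg_id, (s, e) in enumerate(segments):
        if s <= token_idx < e:
            ctx = min(token_idx - s, e - 1 - token_idx)
            if ctx > best_ctx:
                best, best_ctx = seg_id, ctx
    return best

segs = make_segments(num_tokens=10, seg_size=6, stride=4)
```

For instance, with the toy values above, token 4 keeps its representation from the first segment (it is near the start of the second), while token 5 takes its representation from the second.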
In contrast, the original e2e-coref model forms token representations by combining pretrained word embeddings such as GloVe and character embeddings produced by 1-dimensional convolutional neural networks. From an implementation point of view, it is easier to use a Transformer encoder than to combine these traditional embeddings. For example, the Transformers library (https://github.com/huggingface/transformers) allows users to experiment with various state-of-the-art Transformer-based models by writing just a few lines of code.
Now, for each span $i$, its span representation $g_i$ is defined as:

$$g_i = [x_{\mathrm{START}(i)}; x_{\mathrm{END}(i)}; \hat{x}_i]$$

where $x_{\mathrm{START}(i)}$ and $x_{\mathrm{END}(i)}$ are the boundary representations, consisting of the first and the last token representations of the span $i$, and $\hat{x}_i$ is computed using an attention mechanism as follows:

$$\alpha_t = \mathrm{FFNN}_\alpha(x_t)$$

$$a_{i,t} = \frac{\exp(\alpha_t)}{\sum_{k=\mathrm{START}(i)}^{\mathrm{END}(i)} \exp(\alpha_k)}$$

$$\hat{x}_i = \sum_{t=\mathrm{START}(i)}^{\mathrm{END}(i)} a_{i,t} \cdot x_t$$

$\mathrm{FFNN}_\alpha$ is a multi-layer feedforward neural network that maps each token-level representation $x_t$ into an unnormalized attention score $\alpha_t$. $\hat{x}_i$ is a weighted sum of the token vectors in the span. Our span representation generation process closely follows that of e2e-coref. However, a simplification we make is that we do not include any additional features, such as the size of span $i$, in its representation $g_i$.
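The span representation above can be sketched in plain Python. A trivial stand-in scorer (sum of vector components) replaces the learned feedforward network, so the numbers are illustrative only:

```python
# Sketch: concatenate the boundary token vectors with an
# attention-weighted sum of the token vectors inside the span.
# The real model uses a learned FFNN as the attention scorer; here a
# sum-of-components stand-in keeps the example self-contained.
import math

def span_representation(token_vecs, start, end):
    scores = [sum(v) for v in token_vecs[start:end + 1]]    # alpha_t (stand-in scorer)
    z = sum(math.exp(s) for s in scores)
    weights = [math.exp(s) / z for s in scores]             # a_{i,t} (softmax over span)
    dim = len(token_vecs[0])
    head = [sum(w * v[d] for w, v in zip(weights, token_vecs[start:end + 1]))
            for d in range(dim)]                            # attention-weighted sum
    # [x_START; x_END; weighted sum]
    return token_vecs[start] + token_vecs[end] + head

g = span_representation([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]], 0, 1)
```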
2.3 Mention Extractor Layer
In this layer, we first enumerate all the spans (up to a maximum length $L$) in the document. For each span $i$, we simply use a feedforward neural network to compute its mention score:

$$s_m(i) = \mathrm{FFNN}_m(g_i)$$
After this step, we only keep up to $\lambda T$ spans with the highest mention scores, where $T$ is the number of tokens in the document. In previous works, to maintain a high recall of gold mentions, the pruning parameter $\lambda$ is typically set to 0.4 [15, 16]. These works do not directly train the mention extractor; instead, the mention extractor and the mention linker are jointly trained to maximize only the marginal likelihood of gold antecedent spans.
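The pruning step can be sketched as follows, with toy spans and scores of our own making:

```python
# Sketch: score every candidate span with a mention scorer and keep
# the top lambda * T spans, where T is the document length in tokens.
# lambda = 0.25 follows the paper's setting; the toy scores are ours.

def prune_mentions(spans, scores, num_tokens, lam=0.25):
    k = int(lam * num_tokens)
    ranked = sorted(zip(spans, scores), key=lambda p: p[1], reverse=True)
    return [span for span, _ in ranked[:k]]

kept = prune_mentions(
    spans=[(0, 0), (0, 1), (1, 1), (2, 2)],
    scores=[0.9, -0.2, 1.5, 0.1],
    num_tokens=8,  # keeps int(0.25 * 8) = 2 spans
)
```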
In coreference resolution datasets such as the OntoNotes benchmark, singleton mentions are not explicitly labeled, because the annotations contain only mentions that belong to a coreference chain. However, these annotations of non-singleton mentions can still provide a useful signal for training an effective mention extractor. Thus, we also propose to pre-train our mention extractor using these annotations. In Section 3, we empirically demonstrate that this pre-training step greatly improves the performance of our mention extractor layer. As a result, we only need to set the pruning parameter $\lambda$ to 0.25 in order to maintain a high recall of gold mentions. To this end, the pretraining loss is calculated as a binary cross-entropy:

$$L_{\mathrm{pretrain}} = -\sum_{i \in \Lambda} \big( y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \big)$$

where $\hat{y}_i = \sigma(s_m(i))$, and $y_i = 1$ if and only if the span $i$ is a mention in one of the coreference chains. $\Lambda$ is the set of the top $\lambda T$ scoring spans (and so $|\Lambda| = \lambda T$).
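A binary cross-entropy of this form can be sketched as follows; the sigmoid-over-score formulation and the toy scores/labels are our illustrative assumptions:

```python
# Sketch: binary cross-entropy over the top-scoring spans, where the
# label y_i = 1 iff span i appears in some gold coreference chain.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mention_pretrain_loss(mention_scores, gold_labels):
    loss = 0.0
    for s, y in zip(mention_scores, gold_labels):
        p = sigmoid(s)  # predicted probability that the span is a mention
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss

# One gold mention scored 2.0, one non-mention scored -1.0.
loss = mention_pretrain_loss([2.0, -1.0], [1, 0])
```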
2.4 Mention Linker Layer
For each span $i$ extracted by the mention extractor, the mention linker needs to assign an antecedent from all preceding extracted spans or a dummy antecedent $\epsilon$: $y_i \in \{\epsilon, 1, \dots, i-1\}$ (the ordering of spans was discussed in Subsection 2.1). The dummy antecedent $\epsilon$ represents two possible cases. One case is that the span $i$ is not an entity mention. The other case is that the span is an entity mention but is not coreferent with any previous span extracted by the mention extractor.
The coreference score of two spans $i$ and $j$ is computed as follows:

$$s(i, j) = s_m(i) + s_m(j) + s_c(i, j), \qquad s_c(i, j) = \mathrm{FFNN}_c([g_i; g_j; g_i \circ g_j])$$

where $\mathrm{FFNN}_c$ is a feedforward network, and $s_m(i)$ and $s_m(j)$ are calculated using Equation 3. The score $s(i, j)$ is affected by three factors: (1) $s_m(i)$, whether span $i$ is a mention, (2) $s_m(j)$, whether span $j$ is a mention, and (3) $s_c(i, j)$, whether $j$ is an antecedent of $i$. In the special case of the dummy antecedent, the score $s(i, \epsilon)$ is fixed to 0. In the e2e-coref model, when computing $s_c(i, j)$, a vector encoding additional features, such as genre information and the distance between the two spans, is also used. We do not use such a feature vector when computing $s_c(i, j)$, in order to simplify the implementation.
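The scoring scheme can be sketched as follows; a dot product stands in for the learned pairwise feedforward network, and all numbers are toy values:

```python
# Sketch: the pairwise coreference score combines the two mention
# scores with a pairwise term, and the dummy antecedent's score is
# fixed to 0. The dot product is a stand-in for the learned FFNN.

DUMMY = None  # dummy antecedent epsilon

def coref_score(i, j, mention_scores, span_reprs):
    if j is DUMMY:
        return 0.0  # s(i, epsilon) is fixed to 0
    pairwise = sum(a * b for a, b in zip(span_reprs[i], span_reprs[j]))
    return mention_scores[i] + mention_scores[j] + pairwise

reprs = [[1.0, 0.0], [1.0, 1.0]]
s = coref_score(1, 0, mention_scores=[0.5, 0.2], span_reprs=reprs)
```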
We want to maximize the marginal log-likelihood of all antecedents in the correct coreference chain for each mention:

$$\log \prod_{i \in \Lambda} \sum_{\hat{y} \in \mathcal{Y}(i) \cap \mathrm{GOLD}(i)} P(\hat{y}), \qquad P(\hat{y}) = \frac{\exp\big(s(i, \hat{y})\big)}{\sum_{y' \in \mathcal{Y}(i)} \exp\big(s(i, y')\big)}$$

where $\Lambda$ is the set of the top $\lambda T$ scoring spans extracted by the mention extractor (i.e., the set of unpruned spans), $\mathcal{Y}(i) = \{\epsilon, 1, \dots, i-1\}$ is the set of candidate antecedents of span $i$, and $\mathrm{GOLD}(i)$ is the set of spans in the gold cluster containing span $i$. If span $i$ does not belong to any coreference chain, or all of its gold antecedents have been pruned, then $\mathrm{GOLD}(i) = \{\epsilon\}$.
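For a single span, the marginal likelihood computation can be sketched as follows (toy scores and gold indices are ours):

```python
# Sketch: for one extracted span, softmax its scores over all
# candidate antecedents (dummy included, with score 0) and sum the
# probabilities of the gold antecedents before taking the log.
import math

def marginal_log_likelihood(antecedent_scores, gold_antecedents):
    """antecedent_scores[0] is the dummy's score (always 0)."""
    z = sum(math.exp(s) for s in antecedent_scores)
    probs = [math.exp(s) / z for s in antecedent_scores]
    return math.log(sum(probs[a] for a in gold_antecedents))

# A span with the dummy plus two preceding candidates; candidate 1 is gold.
ll = marginal_log_likelihood([0.0, 2.0, -1.0], gold_antecedents=[1])
```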
| Model | MUC P | MUC R | MUC F1 | B³ P | B³ R | B³ F1 | CEAFφ4 P | CEAFφ4 R | CEAFφ4 F1 | Avg. F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| e2e-coref + Structural info | 80.5 | 73.9 | 77.1 | 71.2 | 61.5 | 66.0 | 64.3 | 61.1 | 62.7 | 68.6 |
| c2f-coref + ELMo | 81.4 | 79.5 | 80.4 | 72.2 | 69.5 | 70.8 | 68.2 | 67.1 | 67.6 | 73.0 |
| EE + BERT-large | 82.6 | 84.1 | 83.4 | 73.3 | 76.2 | 74.7 | 72.4 | 71.1 | 71.8 | 76.6 |
| c2f-coref + BERT-large | 84.7 | 82.4 | 83.5 | 76.5 | 74.0 | 75.3 | 74.1 | 69.8 | 71.9 | 76.9 |
| c2f-coref + SpanBERT-large | 85.8 | 84.8 | 85.3 | 78.3 | 77.9 | 78.1 | 76.4 | 74.2 | 75.3 | 79.6 |
| Simplified e2e-coref (Ours) | 85.4 | 85.4 | 85.4 | 78.4 | 78.9 | 78.7 | 76.1 | 73.9 | 75.0 | 79.7 |
| Model | Avg. Nb Spans Proposed | Gold Mention Recall |
|---|---|---|
| e2e-coref | 200.43 spans / doc | 92.7% |
| Simplified e2e-coref (Ours) | 141.79 spans / doc | 95.7% |
3 Experiments and Results
Dataset and Experiments Setup To evaluate the effectiveness of the proposed approach, we use the CoNLL-2012 Shared Task English data, which is based on the OntoNotes corpus. This dataset has 2802/343/348 documents in the train/dev/test split. Similar to previous works, we report the precision, recall, and F1 of the MUC, B³, and CEAFφ4 metrics, and also average the F1 scores of the three metrics. We used SpanBERT (spanbert-large-cased) as the encoder. Two different learning rates are used: one for the lower pretrained SpanBERT encoder (5e-05) and one for the upper layers (1e-4). We also use learning rate decay. The number of training epochs is set to 100, and the batch size is set to 32. We tuned hyper-parameters using the provided dev set. To train our model, we use two 16GB V100 GPUs and employ techniques such as gradient checkpointing and gradient accumulation to avoid running out of GPU memory.
Comparison with Previous Methods Table 1 compares our model with several state-of-the-art coreference resolution systems. Overall, our model outperforms the original e2e-coref model as well as all recent extended works. For example, compared to the variant [c2f-coref + SpanBERT-large], our model achieves higher F1 scores on the MUC and B³ metrics. Even though our model achieves a slightly lower F1 score on the CEAFφ4 metric, its overall average F1 score is still better. It is worth mentioning that the variant [c2f-coref + SpanBERT-large] is more complex than our method, because it has additional components such as coarse-to-fine antecedent pruning and higher-order inference [16, 8].
Recently, a model named CorefQA has been proposed, achieving an impressive average F1 score on the English OntoNotes benchmark. That work takes a complete departure from the paradigm used by the e2e-coref model and instead formulates coreference resolution as a span prediction task, as in question answering. Despite its strong performance, the CorefQA model is very computationally expensive: to predict the coreference clusters of a single document, CorefQA needs to run a Transformer-based model on the same document many times (each time with a different query appended to the document).
Analysis of the Performance of the Mention Extractor As mentioned in Subsection 2.3, in our work the pruning parameter is set to 0.25, whereas it is set to 0.4 in the e2e-coref model. Table 2 shows the comparison in more detail. Our mention extractor extracts 95.7% of all the gold mentions in the dev set, while the mention extractor of the e2e-coref model extracts only 92.7% of them. Furthermore, by proposing fewer candidate spans, our model also reduces the workload of the mention linker.
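Gold mention recall, as reported in Table 2, is simply the fraction of gold mentions that survive pruning; a minimal sketch with toy spans of our own:

```python
# Sketch: gold mention recall = |proposed gold mentions| / |gold mentions|.
# The toy spans below are illustrative only.

def gold_mention_recall(proposed, gold):
    return len(set(proposed) & set(gold)) / len(gold)

r = gold_mention_recall(proposed=[(0, 1), (3, 4), (6, 6)],
                        gold=[(0, 1), (3, 4), (8, 9), (10, 10)])
```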
In this work, we propose a simple yet effective baseline for the task of coreference resolution. Despite its simplicity, our model can still achieve impressive performance, outperforming all recent extended works on the popular English OntoNotes benchmark. In future work, we are interested in reducing the computational complexity of our baseline model using compression techniques [22, 23, 14]. We also plan to extend our work to address the task of event coreference resolution [13, 25].
-  (2015) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. Cited by: §2.2.
-  (2015) Entity-centric coreference resolution with model stacking. In ACL, Cited by: §1.
-  (2016) Deep reinforcement learning for mention-ranking coreference models. In EMNLP, Cited by: §1.
-  (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §2.2.
-  (2013) Easy victories and uphill battles in coreference resolution. In EMNLP, Cited by: §1, §2.1.
-  (2019) End-to-end deep reinforcement learning based coreference resolution. In ACL, Cited by: §1.
-  (2018) A study on improving end-to-end neural coreference resolution. In CCL, Cited by: §1.
-  (2020) Spanbert: improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics 8, pp. 64–77. Cited by: §1, §2.1, §2.2, Table 1, §3, §3.
-  (2019) BERT for coreference resolution: baselines and analysis. In EMNLP/IJCNLP, Cited by: §1, §2.1, Table 1.
-  (2019) Coreference resolution with entity equalization. In ACL, Cited by: §1, Table 1.
-  (2021) Coreference resolution without span representations. ArXiv abs/2101.00434. Cited by: §1.
-  (2019) Incorporating structural information for better coreference resolution. In IJCAI, Cited by: §1, Table 1.
-  (2021) A context-dependent gated module for incorporating symbolic semantics into event coreference resolution. arXiv preprint arXiv:2104.01697. Cited by: §4.
-  (2020) A simple but effective bert model for dialog state tracking on resource-limited systems. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8034–8038. Cited by: §4.
-  (2017) End-to-end neural coreference resolution. In EMNLP, Cited by: §1, §2.1, §2.3, Table 1, Table 2.
-  (2018) Higher-order coreference resolution with coarse-to-fine inference. In NAACL-HLT, Cited by: §1, §2.1, §2.3, Table 1, §3.
-  (2019) A general framework for information extraction using dynamic span graphs. In NAACL-HLT, Cited by: §1.
-  (2010) Supervised noun phrase coreference research: the first fifteen years. In ACL, Cited by: §1.
-  (2014) Glove: global vectors for word representation. In EMNLP, Cited by: §2.2.
-  (2012) CoNLL-2012 shared task: modeling multilingual unrestricted coreference in ontonotes. In EMNLP-CoNLL Shared Task, Cited by: §2.3, §3.
-  (2010) A multi-pass sieve for coreference resolution. In EMNLP, Cited by: §1.
-  (2019) DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. ArXiv abs/1910.01108. Cited by: §4.
-  (2020) MobileBERT: a compact task-agnostic bert for resource-limited devices. In ACL, Cited by: §4.
-  (2010) Word representations: a simple and general method for semi-supervised learning. In ACL, Cited by: §2.2.
-  (2021) RESIN: a dockerized schema-guided cross-document cross-lingual cross-media information extraction and event tracking system. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, pp. 133–143. Cited by: §4.
-  (2016) Towards ai-complete question answering: a set of prerequisite toy tasks. CoRR abs/1502.05698. Cited by: §1.
-  (2016) Learning global features for coreference resolution. ArXiv abs/1604.03035. Cited by: §1.
-  (2020) CorefQA: coreference resolution as query-based span prediction. In ACL, Cited by: §2.1, §3.
-  (2020-11) Revealing the myth of higher-order inference in coreference resolution. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 8527–8533. External Links: Cited by: §1.
-  (2018) Neural coreference resolution with deep biaffine attention by joint mention detection and mention clustering. In ACL, Cited by: §1, §2.3.