Coreference resolution of anaphoric identities (a.k.a. anaphora resolution) is a long-studied Natural Language Processing (NLP) task, and is still considered one of the unsolved problems, as it demands deep semantic understanding as well as world knowledge. Although neural decoders (Lee et al., 2017, 2018) and deep contextualized encoders such as BERT and SpanBERT (Joshi et al., 2019, 2020) have recently brought a significant performance boost, the majority of the experiments are based on OntoNotes (Pradhan et al., 2012) from the CoNLL 2012 shared task, which may overestimate the model performance from two perspectives: the lack of support for harder cases such as singletons and split-antecedents, and the lack of focus on real-world dialogues. In this work, we target the task of anaphora resolution in the CRAC 2021 shared task, and present an effective coreference resolution system, evaluated on the provided datasets, that addresses both perspectives.
All datasets in the CRAC 2021 shared task are in the Universal Anaphora format. For simplicity, we refer to it as the UA format, and refer to the annotation scheme of the CoNLL 2012 shared task as the CoNLL format. The UA format is an extension of the CoNLL format, and further supports bridging references and discourse deixis. For anaphora resolution, the UA format differs from the CoNLL format in three aspects: the support of singletons, split-antecedents, and non-referring expressions (excluded from the current evaluation). Our approach specifically addresses the singleton problem (Section 3.1), which is shown to be a critical component under the UA setting, bringing a 17-22 F1 improvement on all datasets (Section 5.2). Little recent work has studied the split-antecedent problem (Zhou and Choi, 2018), and we leave split-antecedents as future work.
In addition to singletons, our approach also emphasizes speaker encoding (Section 3.3) and knowledge transfer (Section 3.4) to address the dialogue-domain perspective. In particular, we use a simple strategy of speaker-augmented encoding that captures the speaker interaction and dialogue-turn information, utilizing the strong Transformers encoder. A previous study has shown that conversational metadata such as speakers can be significant for coreference resolution on dialogue documents (Luo et al., 2009), and we do see a 2-3 F1 improvement on three datasets by simply applying the speaker encoding strategy (Section 5.3).
Knowledge transfer from other existing resources is also shown to be important in our approach. We experiment with two different strategies, and the domain-adaptation strategy brings a large improvement, adding 8 F1 on LIGHT and 6 F1 on PSUA (Table 3).
Our final system ranks first on the leaderboard of the anaphora resolution track in the CRAC 2021 shared task, and achieves the best evaluation results on all four datasets, with 63.96 F1 for AMI, 80.33 F1 for LIGHT, 78.41 F1 for PSUA, and 74.49 F1 for SWBD (Section 5.1). A brief summary of our final submission is shown in Table 4.
2 Related Work
Pretrained Transformers encoders have been successfully adopted by recent coreference resolution models and have shown significant improvement (Joshi et al., 2019, 2020). We also adopt the Transformers encoder in our approach because of its superior performance. For the neural decoder, recent work has followed two popular directions. One is mention-ranking-based, where the model predicts only one antecedent for each mention without focusing on the cluster structure (Wiseman et al., 2015; Lee et al., 2017; Wu et al., 2020). The other is cluster-based, where the model maintains the predicted clusters and performs cluster merging (Clark and Manning, 2015, 2016; Xia et al., 2020). We adopt the mention-ranking framework in our approach because of its simplicity as well as its state-of-the-art decoding performance.
3.1 Mention-Ranking (MR)
Our baseline model MR adopts the mention-ranking strategy, and follows the architecture of the end-to-end neural coreference resolution model (Lee et al., 2017, 2018) with a Transformer encoder (Joshi et al., 2019, 2020). Given a document with $n$ tokens, the model first enumerates all valid spans, and scores every span $x$ for being a likely mention, denoted by the mention score $s_m(x)$. The model then greedily selects the top-scoring spans by $s_m$ as mention candidates that may appear in the final coreference clusters. Let $X = (x_1, \ldots, x_k)$ be the list of all mention candidates in the document, ordered by their appearance. For each mention candidate $x_i$, the model selects a single coreferent antecedent from all its preceding mention candidates, denoted by $\mathcal{Y}(x_i) = \{\epsilon, x_1, \ldots, x_{i-1}\}$, with $\epsilon$ being a “dummy” antecedent that may be selected when $x_i$ is not anaphoric (no antecedents).
The antecedent selection is performed by a pairwise scoring process between the current mention candidate $x_i$ and each of its preceding candidates $x_j$. The final pairwise score $s(x_i, x_j)$ consists of three scores: how likely each candidate is to be a mention, measured by the mention scores $s_m(x_i)$ and $s_m(x_j)$; and how likely they refer to the same entity, measured by the antecedent score $s_a(x_i, x_j)$. The final score can be denoted as follows:
$$s(x_i, x_j) = s_m(x_i) + s_m(x_j) + s_a(x_i, x_j) \tag{1}$$
$s_m$ and $s_a$ are computed by feed-forward neural networks (FFNN) over the span representations, and $\phi(x_i, x_j)$, an input to $s_a$, represents additional meta features. Unlike previous work, we do not include the specific genre as a feature; instead, we simply use a binary feature on whether the document is dialogue-based or article-based, since dialogues can exhibit quite different traits from written articles (Aktaş and Stede, 2020). We also adopt a speaker feature that indicates whether two candidates are from the same speaker, or whether the speaker information is not available, which is important for written articles or two-party dialogues. In Section 3.3, we further strengthen the speaker encoding to benefit multi-speaker dialogues and the resolution of personal pronouns.
For inference, the selected antecedent is the preceding candidate with the highest pairwise score, denoted by $\hat{y}_i = \arg\max_{y \in \mathcal{Y}(x_i)} s(x_i, y)$. For training, the marginal log-likelihood of all gold antecedents $\hat{\mathcal{Y}}(x_i) \subseteq \mathcal{Y}(x_i)$ for each $x_i$ is optimized, denoted by the loss $\mathcal{L}_c$:
$$\mathcal{L}_c = -\log \prod_{i=1}^{k} \sum_{\hat{y} \in \hat{\mathcal{Y}}(x_i)} P(\hat{y} \mid x_i) \tag{2}$$
where $P(\hat{y} \mid x_i)$ is the softmax of $s(x_i, \hat{y})$ over all candidate antecedents in $\mathcal{Y}(x_i)$.
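The scoring, inference and training objective described above can be illustrated with a minimal sketch in plain Python. It assumes, following Lee et al. (2017), that the dummy antecedent has a fixed score of 0; the function names and the toy scores are hypothetical and not taken from our implementation, where the scores are produced by neural networks.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def pairwise_scores(mention_scores, antecedent_scores, i):
    """Scores of candidate i against the dummy (position 0) and candidates 0..i-1."""
    scores = [0.0]  # assumption: the dummy antecedent has a fixed score of 0
    for j in range(i):
        # final score = mention score of i + mention score of j + antecedent score
        scores.append(mention_scores[i] + mention_scores[j] + antecedent_scores[i][j])
    return scores


def select_antecedent(scores):
    """Index of the best preceding candidate, or None when the dummy wins."""
    best = max(range(len(scores)), key=lambda p: scores[p])
    return None if best == 0 else best - 1


def marginal_nll(scores, gold_antecedents):
    """Negative marginal log-likelihood of the gold antecedents of one candidate."""
    # With no gold antecedent, the dummy (position 0) is the only correct choice.
    gold_positions = [j + 1 for j in gold_antecedents] or [0]
    return logsumexp(scores) - logsumexp([scores[p] for p in gold_positions])


mention_scores = [1.0, 0.5, -0.2]             # hypothetical mention scores
antecedent_scores = [[], [2.0], [-1.0, 0.3]]  # antecedent_scores[i][j] for j < i
scores = pairwise_scores(mention_scores, antecedent_scores, 2)
print(select_antecedent(scores), marginal_nll(scores, {1}))
```

Summing the per-candidate values of `marginal_nll` over all candidates of a document gives the document-level coreference loss.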
3.2 Singleton Recognition (SR)
As the UA format supports singletons, MR would fail to predict singleton clusters, since the antecedent selection can only generate clusters with at least one pair of mentions. Our SR model is built upon MR and further recognizes singletons. In particular, we make use of the mention score $s_m$ in the antecedent selection process, and create a singleton cluster for any candidate $x$ with $s_m(x) > 0$ that has not yet found any antecedent. This poses an additional requirement on the mention score: only valid mentions should have $s_m(x) > 0$. Let $\mathcal{G}$ be the set of gold mention candidates, and $\mathcal{N}$ be the set of other mention candidates. We optimize the mention score with the binary cross-entropy loss $\mathcal{L}_m$ and jointly train it with the coreference loss $\mathcal{L}_c$:
$$\mathcal{L}_m = -\sum_{x \in \mathcal{G}} \log \sigma(s_m(x)) - \sum_{x \in \mathcal{N}} \log\big(1 - \sigma(s_m(x))\big) \tag{3}$$
$$\mathcal{L} = \mathcal{L}_c + \alpha \mathcal{L}_m \tag{4}$$
$\sigma$ is the sigmoid function, and $\alpha$ is a hyperparameter. $\mathcal{L}$ is the final loss composed of two tasks. In practice, we also perform negative sampling on $\mathcal{N}$ dynamically, so that $\mathcal{G}$ and $\mathcal{N}$ are of similar sizes ($|\mathcal{G}| \approx |\mathcal{N}|$), to alleviate the negative effects from the skewed class distribution.
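As an illustration of the mention-score objective and the dynamic negative sampling, the following is a minimal sketch in plain Python. The sum-reduced form of the cross-entropy, the sampling of exactly as many negatives as gold candidates, the function names, and the weight value in the usage line are all assumptions made for the example.

```python
import math
import random


def softplus(x):
    """Stable log(1 + exp(x)); note that -log(sigmoid(s)) equals softplus(-s)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mention_loss(gold_scores, other_scores, rng):
    """Binary cross-entropy over gold candidates and a sample of the other candidates."""
    # Dynamic negative sampling: draw about as many negatives as there are golds.
    k = min(len(other_scores), max(len(gold_scores), 1))
    negatives = rng.sample(other_scores, k)
    loss = sum(softplus(-s) for s in gold_scores)  # -log sigmoid(s) for golds
    loss += sum(softplus(s) for s in negatives)    # -log (1 - sigmoid(s)) for others
    return loss, negatives


def joint_loss(coref_loss, m_loss, alpha):
    """Weighted sum of the coreference loss and the mention loss."""
    return coref_loss + alpha * m_loss


rng = random.Random(0)
m_loss, sampled = mention_loss([2.0, 1.5], [-1.0, -3.0, 0.5, -0.7, -2.2], rng)
print(len(sampled), joint_loss(0.8, m_loss, alpha=0.5))  # alpha here is illustrative only
```

Resampling the negatives at every training step, rather than once, is what makes the sampling dynamic.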
In the new selection process, we still regard a selected non-dummy antecedent as valid according to the pairwise score, even though the mention score of either candidate can be negative ($s_m(x_i) < 0$ or $s_m(x_j) < 0$). This allows some slack in the mention score prediction, which could help mention recall. Figure 1 shows three different cases of the predicted clusters by the SR model.
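The selection process with singletons can be sketched as follows in plain Python. The sketch assumes that the antecedent of every candidate has already been selected (with `None` standing for the dummy antecedent); the function name and the toy inputs are hypothetical.

```python
def build_clusters(mention_scores, antecedents):
    """Build clusters from selected antecedents, then add singleton clusters.

    antecedents[i] is the index of the selected preceding candidate of
    candidate i, or None when the dummy antecedent was selected.
    """
    cluster_of = {}  # candidate index -> cluster id
    clusters = []
    for i, ante in enumerate(antecedents):
        if ante is None:
            continue
        # A selected non-dummy antecedent is kept even if a mention score is negative.
        if ante not in cluster_of:
            cluster_of[ante] = len(clusters)
            clusters.append([ante])
        cluster_of[i] = cluster_of[ante]
        clusters[cluster_of[i]].append(i)
    # Unlinked candidates become singletons only with a positive mention score.
    for i, score in enumerate(mention_scores):
        if i not in cluster_of and score > 0:
            cluster_of[i] = len(clusters)
            clusters.append([i])
    return clusters


scores = [1.2, -0.4, 0.8, -1.0, 0.3]
links = [None, 0, None, None, None]
print(build_clusters(scores, links))
```

In this toy input, candidate 1 joins the cluster of candidate 0 despite its negative mention score, candidates 2 and 4 become singletons, and candidate 3 is discarded.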
3.3 Speaker Encoding (SE)
Our SE model builds further upon the SR model and aims to strengthen the speaker encoding for each candidate representation. As we are targeting coreference resolution in dialogues, encoding speaker interactions becomes more critical, especially for the correct understanding of speaker-grounded personal pronouns, which are more frequent in dialogues than in non-dialogue genres (Aktaş and Stede, 2020).
The speaker feature introduced in Section 3.1 provides a shallow distinction on whether two mentions are from the same speaker. However, the speaker interactions across dialogue turns are not represented in the document encoding; therefore, the representation of each candidate has no awareness of the speaker interactions at all. To provide deeper knowledge of the interactions, we adopt a simple but effective strategy that is similar to other work on speaker encoding (Le et al., 2019; Wu et al., 2020): a special speaker token is prepended to each sentence, and we feed the new speaker-augmented document to the encoder directly.
Table 1 shows an example of this speaker augmentation. Each speaker is indexed by the order of first appearance in the dialogue. All the special speaker tokens are added to the tokenizer vocabulary, and are picked up in the tokenization and encoding process. Therefore, every encoded candidate representation in the SE model is conditioned on the entire speaker interactions, and the model automatically learns to fuse the information of speakers and turns in the training process.
|Original dialogue|
|John: Do you know Mike?|
|Mary: He is my best friend!|
|Paul: I like him too!|
|Mary: We should meet together!|
|Speaker-augmented document|
|[SPK1] Do you know Mike ? [SPK2] He is my best friend !|
|[SPK3] I like him too ! [SPK2] We should meet together !|
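The speaker augmentation shown in Table 1 can be reproduced with a short sketch in plain Python. The input layout (one speaker name and one pre-tokenized sentence per entry) and the function name are assumptions made for the example.

```python
def augment_with_speakers(sentences):
    """Prepend a special speaker token to every sentence.

    sentences: list of (speaker_name, list_of_tokens) pairs in document order.
    Speakers are indexed by their order of first appearance, starting from 1.
    """
    speaker_ids = {}
    tokens = []
    for speaker, words in sentences:
        if speaker not in speaker_ids:
            speaker_ids[speaker] = len(speaker_ids) + 1
        tokens.append(f"[SPK{speaker_ids[speaker]}]")
        tokens.extend(words)
    return tokens, speaker_ids


dialogue = [
    ("John", "Do you know Mike ?".split()),
    ("Mary", "He is my best friend !".split()),
    ("Paul", "I like him too !".split()),
    ("Mary", "We should meet together !".split()),
]
augmented, ids = augment_with_speakers(dialogue)
print(" ".join(augmented))
```

With a Hugging Face tokenizer, one possible way to register the new tokens is `tokenizer.add_special_tokens({"additional_special_tokens": [...]})` followed by `model.resize_token_embeddings(len(tokenizer))`; this is only a suggestion, as the exact mechanism is not specified here.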
3.4 Knowledge Transfer
We also emphasize knowledge transfer in this task, as the training resources of dialogue corpora annotated in the UA format are limited and expensive to obtain, while there already exist larger-scale training corpora for other domains in different annotation schemes, e.g. the OntoNotes dataset in the CoNLL format that mainly consists of non-dialogue genres. For clarity, we denote the provided data annotated in the UA format as UAD, and other existing data in non-UA format as OD. We investigate two common ways to make use of OD in the training for SE, denoted as follows:
SE+M: Mix OD and UAD together as a larger dataset, regarding OD as data augmentation that provides more knowledge.
SE+P: Pretrain the model on OD first, then further train the model on UAD only, regarding training on UAD as domain adaptation.
Both choices are plausible in our approach, because we only use data in the CoNLL format for OD, which is still largely similar to the UA format, despite the differences in singletons, non-referring expressions, and split-antecedents.
Similarly, we denote the model as SR+M/P if the SR model is used instead of SE.
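The difference between the two strategies lies only in how the training data is scheduled, which the following minimal Python sketch makes explicit. The function name, the representation of a corpus as a list of documents, and the shuffling seed are assumptions made for the example.

```python
import random


def training_phases(strategy, uad_docs, od_docs, seed=0):
    """Return the ordered training phases (each a list of documents) for a strategy."""
    if strategy == "SE":    # no knowledge transfer: train on UAD only
        return [list(uad_docs)]
    if strategy == "SE+M":  # mixing: a single phase over the shuffled union
        mixed = list(uad_docs) + list(od_docs)
        random.Random(seed).shuffle(mixed)
        return [mixed]
    if strategy == "SE+P":  # pretrain on OD, then continue training on UAD only
        return [list(od_docs), list(uad_docs)]
    raise ValueError(f"unknown strategy: {strategy}")


uad = ["uad_doc_1", "uad_doc_2"]
od = ["od_doc_1", "od_doc_2", "od_doc_3"]
for name in ("SE", "SE+M", "SE+P"):
    print(name, [len(phase) for phase in training_phases(name, uad, od)])
```

Under SE+P, the model weights obtained at the end of the first phase would initialize the second phase.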
For data in the UA format (UAD), we use the ARRAU corpus (Poesio et al., 2018) from the CRAC 2018/2021 shared task. Four sub-corpora are used as the training set for UAD, namely TRAINS-93, PEAR, RST, and GNOME. One sub-corpus named TRAINS-91 is used as one of the development (dev) sets. In addition, four other corpora from the CRAC 2021 shared task are used as development sets as well as the final test sets, namely AMI, LIGHT, Persuasion for Good (PSUA), and Switchboard (SWBD). All of the above datasets are of the dialogue domain except for RST and GNOME. Table 2 shows the detailed statistics of all UAD datasets. Note that certain datasets do not provide speaker information; their average numbers of speakers per document are therefore shown as 0.
For non-UA format data (OD), we use two datasets in the CoNLL format: OntoNotes (ON) (Pradhan et al., 2012) and BOLT (Li et al., 2016). OntoNotes consists of documents in six genres, of which only two, “Telephone Conversation” and “Broadcast Conversation”, are of the dialogue domain; we use the provided train/dev/test split for OntoNotes. BOLT has the same annotation scheme as OntoNotes and consists of documents from discussion forums, instant messages and telephone conversations. We perform a random 80/10/10 split for the train/dev/test sets of BOLT. Detailed statistics of both datasets are shown at the bottom of Table 2.
We only perform one trivial preprocessing step specific to the training set of UAD datasets: remove all non-referring expressions and regard them as non-mentions, as they will not be counted in the final evaluation (Section 5). In addition, our current approach does not consider split-antecedents, which we will leave as future work.
Avg F1 is the main evaluation metric. Section 3 describes the details of all listed approaches. SE+P+DEV is the setting of our final submission to the CRAC 2021 shared task, where all available development sets are also added in the training process for SE+P (Section 5.1).
|Track||Resolution of anaphoric identities|
|Baseline||MR (§3.1). The end-to-end coreference resolution model with the SpanBERT encoder (Joshi et al., 2020; Xu and Choi, 2020) is used as the baseline.|
|Approach||SE+P+DEV (§3.4). The final model is built upon the baseline with three key adaptations: 1) An updated antecedent selection process is used to support singletons, with an additional optimization on the mention scores. 2) A speaker-augmentation strategy is used to encode the speakers and dialogue turns. 3) Knowledge transfer is employed that pretrains the model on CoNLL datasets, then further trains on the UA datasets as a domain adaptation step. The final submission includes the dev data in training.|
|Train Data||TRAINS-93, PEAR, RST, GNOME, ON, BOLT (§4.1)|
|Dev Data||TRAINS-91, AMI, LIGHT, PSUA, SWBD, ON, BOLT (§4.1)|
Our system is based on the PyTorch implementation of the end-to-end coreference resolution model from Xu and Choi (2020), and we follow similar hyperparameter settings. Specifically, SpanBERT-Large (Joshi et al., 2020) is used as the Transformers encoder with a maximum sequence length of 512. Long documents are split into multiple sequences, and each sequence is encoded by SpanBERT-Large independently, as suggested by Joshi et al. (2019). During training, we limit the maximum number of sequences per document to 3 due to GPU memory constraints, and a long document is split into multiple documents if it exceeds this limit.
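The segmentation of long documents can be sketched as follows in plain Python. The sketch assumes that two positions per sequence are reserved for the encoder's boundary tokens and, for brevity, cuts at fixed offsets rather than at sentence boundaries; the function name is hypothetical.

```python
def split_document(subtokens, max_seq_len=512, max_seqs=3):
    """Split a long document into encoder sequences, then into training documents.

    Returns a list of training documents, each a list of at most max_seqs
    sequences, each holding at most max_seq_len - 2 subtokens.
    """
    body = max_seq_len - 2  # assumption: reserve room for [CLS] and [SEP]
    sequences = [subtokens[i:i + body] for i in range(0, len(subtokens), body)]
    # Group consecutive sequences so that no training document exceeds max_seqs.
    return [sequences[i:i + max_seqs] for i in range(0, len(sequences), max_seqs)]


docs = split_document(list(range(2000)))
print([[len(seq) for seq in doc] for doc in docs])
```

Each sequence is then encoded independently, so a candidate only sees the context of its own sequence.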
For all datasets, nested mentions are always enabled. We set the maximum span width to 30 in the span enumeration stage, and limit the maximum number of antecedents to 50 in the pair scoring process. The Adam optimizer is used with weight decay and gradient-norm clipping, and separate learning rates are used for the Transformers parameters and the task parameters. A fixed value is used for the hyperparameter in Eq (4). In particular, we do not apply any higher-order inference, as its benefits have been shown to be marginal (Xu and Choi, 2020).
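The span enumeration with a maximum span width can be illustrated by the short Python sketch below. It assumes that valid spans do not cross sentence boundaries, as in Lee et al. (2017); the function name and the inclusive end index are choices made for the example. Nested spans are naturally included, which matches the setting where nested mentions are enabled.

```python
def enumerate_spans(sentence_lengths, max_span_width=30):
    """Enumerate (start, end) token spans, end inclusive, within each sentence."""
    spans = []
    offset = 0
    for length in sentence_lengths:
        sentence_end = offset + length
        for start in range(offset, sentence_end):
            # A span covers at most max_span_width tokens and stays in its sentence.
            for end in range(start, min(start + max_span_width, sentence_end)):
                spans.append((start, end))
        offset = sentence_end
    return spans


print(enumerate_spans([3], max_span_width=2))
print(len(enumerate_spans([40, 12])))
```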
When training on UAD or OD alone, we concatenate and mix all its corresponding corpora together as the training data. For SE+M, we concatenate and mix all available training corpora together regardless of UAD or OD. All experiments are conducted on an Nvidia A100 GPU. 20 training epochs are used for all the settings, and the training takes around 1-2 hours for UAD and 3-4 hours for OD.
In particular, development sets are not added to the training data, except for our final submission to the shared task: once the best-performing model has been identified, we train the final model with the same setting but with all development sets added to the training data (Section 5.1).
5 Results and Analysis
The Universal Anaphora Scorer (https://github.com/juntaoy/universal-anaphora-scorer) is used in the official evaluation process. For the task of anaphora resolution, the main evaluation metric is the averaged F1 score of MUC, B³ and CEAF, the same as in the CoNLL 2012 shared task. Singletons and split-antecedents are included in the evaluation, while non-referring expressions are excluded.
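To give an intuition of how singletons affect the evaluation, the following Python sketch implements one common formulation of B³ (per-mention overlap, with a mention absent from the other side contributing zero) together with the averaging of the three F1 scores. It is an illustration only and is not the official scorer, whose handling of details may differ; the function names and toy clusters are hypothetical.

```python
def b_cubed(gold_clusters, pred_clusters):
    """Return (precision, recall, F1) of B-cubed over mention-level overlaps."""
    gold_of = {m: set(c) for c in gold_clusters for m in c}
    pred_of = {m: set(c) for c in pred_clusters for m in c}

    def averaged_overlap(source, target):
        if not source:
            return 0.0
        total = 0.0
        for mention, cluster in source.items():
            if mention in target:  # missing mentions contribute zero
                total += len(cluster & target[mention]) / len(cluster)
        return total / len(source)

    recall = averaged_overlap(gold_of, pred_of)
    precision = averaged_overlap(pred_of, gold_of)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def average_f1(muc_f1, b3_f1, ceaf_f1):
    """Main metric: the unweighted mean of the three F1 scores."""
    return (muc_f1 + b3_f1 + ceaf_f1) / 3.0


gold = [["a", "b"], ["c"], ["d"]]  # two gold singletons
print(b_cubed(gold, [["a", "b"]]))                # system without singletons
print(b_cubed(gold, [["a", "b"], ["c"], ["d"]]))  # system with singletons
```

In the first call, the two gold singletons are never predicted, so recall is halved even though the only predicted cluster is correct.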
Table 3 shows the evaluation results on the test sets of the four datasets using different approaches. Among all approaches that do not add the dev sets into training, SE+P achieves the best results on all four datasets. Another SE+P model is then trained with the dev sets added as our final submission, denoted by SE+P+DEV, which further yields the best results and ranks first in the “anaphoric identity” track of the CRAC 2021 shared task.
Comparing the two knowledge transfer strategies, the pretraining paradigm SE+P performs significantly better than the mixing paradigm SE+M. In fact, while the pretraining paradigm brings improvement over SE, the mixing paradigm even performs worse than using no knowledge transfer, likely because of the domain mismatch and the annotation format mismatch, suggesting that the pretraining strategy should be preferred in this setting.
Table 4 lists the summary of our final submission to the shared task.
5.2 Analysis: Singleton Recognition
One of the main differences between the UA and CoNLL formats is that UA supports singletons, as UA annotates all noun phrases. The left side of Table 6 shows the total number and percentage of the singleton clusters on the test sets of the four datasets. Singletons are indeed prevalent: all four datasets have at least 73% of their gold clusters as singletons. Therefore, recognizing singletons is critical for coreference resolution on UA-formatted data.
|AMI||1883||1383 (73.5%)||4139||1566 (37.8%)|
|LIGHT||1359||1024 (75.4%)||3501||1676 (47.9%)|
|PSUA||1857||1525 (82.1%)||3446||1464 (42.5%)|
|SWBD||3897||2968 (76.2%)||7847||3746 (47.7%)|
Comparing MR and SR in Table 3, it is clear that singleton recognition plays a pivotal role in the final performance, with SR outperforming MR by a huge margin of 17-22 Avg F1 on all four datasets. To further examine the performance of SR, we collect the precision/recall of the predicted mentions by different models, as well as the precision/recall of predicted singletons over gold singletons, as shown in Table 5. Compared with MR, all models that support singletons receive huge gains on the mention recall with 26-36% improvement, with relatively small 5-10% degradation on the mention precision.
More interestingly, most SR/SE-related models are able to recover the majority of gold singletons on all four datasets, with up to 73% recall on LIGHT, demonstrating the effectiveness of the mention score optimization in Eq (3) and the new antecedent selection process. Nevertheless, the best F1 for singletons is still below 67 across all four datasets, suggesting that recognizing singletons alone is already a challenging aspect.
5.3 Analysis: Speaker Encoding
Despite the simplicity of the speaker-augmented encoding described in Section 3.3, SE+P shows a decent improvement over its counterpart SR+P, with a 2-3 Avg F1 gain on all datasets except LIGHT, where the improvement is only marginal, confirming that stronger speaker encoding is indeed important for the dialogue domain.
Meanwhile, SE does not show advantages over SR because the current training corpora of all ARRAU datasets do not provide the speakers (Table 2); consequently, neither model could learn to use the speaker information, resulting in similar performance. This, on the other hand, also demonstrates the significance of knowledge transfer that utilizes other existing resources.
In this work, we present an adapted end-to-end coreference resolution system for anaphoric identities in dialogues, specifically addressing three aspects: the support for singletons, stronger speaker and turn encoding through the dialogue interactions, as well as the knowledge transfer utilizing other existing resources. Our final system achieves the best results on all four datasets on the leaderboard of the CRAC 2021 shared task, and further analysis is performed to show the effectiveness of our proposed adaptation strategies.
- Aktaş and Stede (2020). Variation in coreference strategies across genres and production media. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 5774–5785.
- Clark and Manning (2015). Entity-centric coreference resolution with model stacking. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 1405–1415.
- Clark and Manning (2016). Improving coreference resolution by learning entity-level distributed representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 643–653.
- Joshi et al. (2020). SpanBERT: improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics 8, pp. 64–77.
- Joshi et al. (2019). BERT for coreference resolution: baselines and analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5803–5808.
- Le et al. (2019). Who is speaking to whom? learning to identify utterance addressee in multi-party conversations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1909–1919.
- Lee et al. (2017). End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 188–197.
- Lee et al. (2018). Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 687–692.
- Li et al. (2016). Large multi-lingual, multi-level and multi-genre annotation corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, pp. 906–913.
- Luo et al. (2009). Improving coreference resolution by using conversational metadata. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, Boulder, Colorado, pp. 201–204.
- Poesio et al. (2018). Anaphora resolution with the ARRAU corpus. In Proceedings of the First Workshop on Computational Models of Reference, Anaphora and Coreference, New Orleans, Louisiana, pp. 11–22.
- Pradhan et al. (2012). CoNLL-2012 shared task: modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL - Shared Task, Jeju Island, Korea, pp. 1–40.
- Wiseman et al. (2015). Learning anaphoricity and antecedent ranking features for coreference resolution. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 1416–1426.
- Wu et al. (2020). CorefQA: coreference resolution as query-based span prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 6953–6963.
- Xia et al. (2020). Incremental neural coreference resolution in constant memory. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 8617–8624.
- Xu and Choi (2020). Revealing the myth of higher-order inference in coreference resolution. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 8527–8533.
- Zhou and Choi (2018). They exist! introducing plural mentions to coreference resolution and entity linking. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 24–34.