Entity coreference resolution has become a critical component for many Natural Language Processing (NLP) tasks. Systems requiring deep language understanding, such as information extraction [Wellner et al.2004], semantic event learning [Chambers and Jurafsky2008, Chambers and Jurafsky2009], and named entity linking [Durrett and Klein2014, Ji et al.2014], all benefit from entity coreference information.
Entity coreference resolution is the task of identifying mentions (i.e., noun phrases) in a text or dialogue that refer to the same real-world entities. In recent years, several supervised entity coreference resolution systems have been proposed, which, according to Ng (2010), can be categorized into three classes: mention-pair models [McCarthy and Lehnert1995], entity-mention models [Yang et al.2008a, Haghighi and Klein2010, Lee et al.2011], and ranking models [Yang et al.2008b, Durrett and Klein2013, Fernandes et al.2014]. Among these, ranking models recently obtained state-of-the-art performance. However, the manually annotated corpora that these systems rely on are highly expensive to create, in particular when we want to build data for resource-poor languages [Ma and Xia2014]. That makes unsupervised approaches, which only require unannotated text for training, a desirable solution to this problem.
Several unsupervised learning algorithms have been applied to coreference resolution. Haghighi and Klein (2007) presented a mention-pair nonparametric fully-generative Bayesian model for unsupervised coreference resolution. Based on this model, Ng (2008) probabilistically induced coreference partitions via EM clustering. Poon and Domingos (2008) proposed an entity-mention model that is able to perform joint inference across mentions by using Markov Logic. Unfortunately, the accuracy of these unsupervised systems falls significantly behind that of supervised systems, and is even worse than that of deterministic rule-based systems. Furthermore, no previous work has explored the possibility of developing an unsupervised ranking model, even though ranking models achieve state-of-the-art performance under supervised settings for entity coreference resolution.
In this paper, we propose an unsupervised generative ranking model for entity coreference resolution. Our experimental results on the English data from the CoNLL-2012 shared task [Pradhan et al.2012] show that our unsupervised system outperforms the Stanford deterministic system [Lee et al.2013] by 3.01% absolute on the CoNLL official metric. The contributions of this work are (i) proposing the first unsupervised ranking model for entity coreference resolution; (ii) giving empirical evaluations of this model on benchmark data sets; and (iii) considerably narrowing the gap to supervised coreference resolution accuracy.
2 Unsupervised Ranking Model
2.1 Notations and Definitions
In the following, $D = \{m_0, m_1, \ldots, m_n\}$ represents a generic input document which is a sequence of coreference mentions, including the artificial root mention (denoted by $m_0$). The method to detect and extract these mentions is discussed later in Section 2.6. Let $C = \{c_1, c_2, \ldots, c_n\}$ denote the coreference assignment of a given document, where each mention $m_i$ has an associated random variable $c_i$ taking values in the set $\{0, 1, \ldots, i-1\}$; this variable specifies $m_i$’s selected antecedent ($c_i \in \{1, 2, \ldots, i-1\}$), or indicates that it begins a new coreference chain ($c_i = 0$).
2.2 Generative Ranking Model
The following is a straightforward way to build a generative model for coreference:
$$P(D, C) = P(D \mid C)\,P(C) = \prod_{j=1}^{n} P(m_j \mid m_{c_j}) \prod_{j=1}^{n} P(c_j \mid j) \tag{1}$$
where we factorize the probabilities $P(D \mid C)$ and $P(C)$ into each position $j$ by adopting appropriate independence assumptions: given the coreference assignment $c_j$ and the corresponding coreferent mention $m_{c_j}$, the mention $m_j$ is independent of the other mentions in front of it. This independence assumption is similar to that in IBM Model 1 for machine translation [Brown et al.1993], which assumes that, given the corresponding English word, the aligned foreign word is independent of other English and foreign words. We do not make any independence assumptions among different features (see Section 2.4 for details).
Inference in this model is efficient, because we can compute $c_j$ separately for each mention:
$$c_j^{*} = \operatorname*{argmax}_{c_j} P(m_j \mid m_{c_j})\,P(c_j \mid j)$$
The model is a so-called ranking model because it is able to identify the most probable candidate antecedent given a mention to be resolved.
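As a hedged illustration of the per-mention inference described above, the following Python sketch picks, for every mention, the candidate antecedent with the highest product of a mention-generation score and a position score. The names `rank_antecedents`, `toy_t`, and `toy_q`, as well as the toy scores themselves, are assumptions made for this example and are not the authors' implementation.

```python
from typing import Callable, List, Sequence


def rank_antecedents(
    mentions: Sequence[str],
    t_prob: Callable[[str, str], float],
    q_prob: Callable[[int, int], float],
) -> List[int]:
    """Return, for each non-root mention j, the index of its best antecedent.

    mentions[0] is the artificial root; choosing index 0 means that
    mention j starts a new coreference chain.
    """
    assignment = []
    for j in range(1, len(mentions)):
        # Candidates are the root (0) and every mention before position j.
        best = max(
            range(j),
            key=lambda i: t_prob(mentions[j], mentions[i]) * q_prob(i, j),
        )
        assignment.append(best)
    return assignment


# Toy scores (hypothetical): identical strings are likely coreferent,
# the root gets a small constant score, and positions are uniform.
def toy_t(mention: str, antecedent: str) -> float:
    if antecedent == "<root>":
        return 0.3
    return 0.9 if mention.lower() == antecedent.lower() else 0.1


def toy_q(i: int, j: int) -> float:
    return 1.0 / j


doc = ["<root>", "Obama", "the senator", "Obama"]
print(rank_antecedents(doc, toy_t, toy_q))
```

Because each mention is scored independently, the loop is linear in the number of mentions and quadratic only in the number of candidate pairs, which is what makes inference cheap in a model of this shape.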
Table 1: Features used to represent mentions under each resolution mode.

|Mode|Feature|Description|
|---|---|---|
|prec|Mention Type|the type of a mention. We use three mention types:|
|str|Mention Type|the same as the mention type feature under prec mode.|
|str|Exact Match|boolean feature corresponding to String Match sieve in Stanford system.|
|str|Relaxed Match|boolean feature corresponding to Relaxed String Match sieve in Stanford system.|
|str|Head Match|boolean feature corresponding to Strict Head Match A sieve in Stanford system.|
|attr|Mention Type|the same as the mention type feature under prec mode.|
|attr|Number|the number of a mention similarly derived from Lee et al. (2013).|
|attr|Gender|the gender of a mention from Bergsma and Lin (2006) and Ji and Lin (2009).|
|attr|Person|the person attribute from Lee et al. (2013). We assign person attributes to all mentions, not only pronouns.|
|attr|Animacy|the animacy attribute same as Lee et al. (2013).|
|attr|Semantic Class|semantic classes derived from WordNet [Soon et al.2001].|
|attr|Distance|sentence distance between the two mentions. This feature is for parameter $q$.|
2.3 Resolution Mode Variables
According to previous work [Haghighi and Klein2009, Ratinov and Roth2012, Lee et al.2013], antecedents are resolved by different categories of information for different mentions. For example, the Stanford system [Lee et al.2013] uses string-matching sieves to link two mentions with similar text and the precise-construct sieve to link two mentions which satisfy special syntactic or semantic relations such as apposition or acronym. Motivated by this, we introduce resolution mode variables $\Pi = \{\pi_1, \ldots, \pi_n\}$, where for each mention $m_j$ the variable $\pi_j \in \{\mathit{str}, \mathit{prec}, \mathit{attr}\}$ indicates in which mode the mention should be resolved. In our model, we define three resolution modes, string-matching (str), precise-construct (prec), and attribute-matching (attr), and $\Pi$ is deterministic when $D$ is given (i.e. $P(\Pi \mid D)$ is a point distribution). We determine $\pi_j$ for each mention $m_j$ in the following way:
- $\pi_j = \mathit{str}$, if there exists a mention $m_i$, $i < j$, such that the two mentions satisfy the String Match sieve, the Relaxed String Match sieve, or the Strict Head Match A sieve in the Stanford multi-sieve system [Lee et al.2013].
- $\pi_j = \mathit{prec}$, if there exists a mention $m_i$, $i < j$, such that the two mentions satisfy the Speaker Identification sieve, or the Precise Constructs sieve.
- $\pi_j = \mathit{attr}$, if there is no mention $m_i$, $i < j$, that satisfies the above two conditions.
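The three rules above can be sketched as a small deterministic function. This is a minimal sketch under stated assumptions: the sieve predicates `exact_match` and `acronym` are simplified, hypothetical stand-ins for the Stanford sieves, and checking the string-matching condition before the precise-construct condition is an assumption, since the text does not say which rule wins when both apply.

```python
from typing import Callable, List, Sequence

# A sieve takes (antecedent, mention) and says whether the pair matches.
Sieve = Callable[[str, str], bool]


def assign_modes(
    mentions: Sequence[str],
    string_sieves: Sequence[Sieve],
    precise_sieves: Sequence[Sieve],
) -> List[str]:
    """Assign 'str', 'prec', or 'attr' to every non-root mention.

    mentions[0] is the artificial root and receives no mode.
    """
    modes = []
    for j in range(1, len(mentions)):
        earlier = mentions[1:j]  # real mentions preceding mention j
        if any(s(m_i, mentions[j]) for m_i in earlier for s in string_sieves):
            modes.append("str")
        elif any(s(m_i, mentions[j]) for m_i in earlier for s in precise_sieves):
            modes.append("prec")
        else:
            modes.append("attr")
    return modes


# Hypothetical, heavily simplified sieves for illustration only.
def exact_match(antecedent: str, mention: str) -> bool:
    return antecedent.lower() == mention.lower()


def acronym(antecedent: str, mention: str) -> bool:
    words = antecedent.split()
    initials = "".join(w[0] for w in words).upper()
    return len(words) > 1 and mention == initials


doc = ["<root>", "World Health Organization", "WHO", "it", "WHO"]
print(assign_modes(doc, [exact_match], [acronym]))
```

Since the mode depends only on the observed mentions, it can be computed once per document before learning or inference.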
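Now, we can extend the generative model in Eq. 1 to:
$$P(D, C) = P(D, C, \Pi) = \prod_{j=1}^{n} P(m_j \mid m_{c_j}, \pi_j)\,P(c_j \mid \pi_j, j)\,P(\pi_j \mid j)$$
where we define $P(\pi_j \mid j)$ to be the uniform distribution. We model $P(m_j \mid m_{c_j}, \pi_j)$ and $P(c_j \mid \pi_j, j)$ in the following way:
$$P(m_j \mid m_{c_j}, \pi_j) = t(m_j \mid m_{c_j}, \pi_j)$$
$$P(c_j \mid \pi_j, j) = \begin{cases} q(c_j \mid \pi_j, j) & \text{if } \pi_j = \mathit{attr} \\ \frac{1}{j} & \text{otherwise} \end{cases}$$
where $\theta = \{t, q\}$ are parameters of our model. Note that in the attribute-matching mode ($\pi_j = \mathit{attr}$) we model $P(c_j \mid \pi_j, j)$ with parameter $q$, while in the other two modes, we use the uniform distribution. This makes sense because the position information is important for coreference resolved by matching attributes of two mentions, such as resolving pronoun coreference, but not that important for coreference resolved by matching text or special relations, such as two mentions referring to the same person and matching by name.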
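To make the mode-dependent treatment of position concrete, here is a hedged Python sketch of the antecedent-position distribution: a learned distribution in attribute-matching mode and a uniform one otherwise. The function `toy_q` is a hypothetical stand-in for the learned parameter (it simply favors nearby candidates and treats the root as the farthest one); it is not taken from the paper.

```python
from typing import Callable


def position_prob(c_j: int, mode: str, j: int, q: Callable[[int, int], float]) -> float:
    """Probability of picking candidate c_j (0 <= c_j < j) for mention j."""
    if not 0 <= c_j < j:
        raise ValueError("c_j must be one of the j candidates 0..j-1")
    if mode == "attr":
        return q(c_j, j)  # learned position distribution
    return 1.0 / j  # uniform for 'str' and 'prec'


def toy_q(c_j: int, j: int) -> float:
    # Hypothetical stand-in for the learned q: closer candidates get
    # exponentially more weight, normalized over the j candidates.
    weights = [2.0 ** -(j - i) for i in range(j)]
    return weights[c_j] / sum(weights)


for mode in ("str", "attr"):
    print(mode, [round(position_prob(i, mode, 4, toy_q), 3) for i in range(4)])
```

Both branches define a proper distribution over the `j` candidates, so the two kinds of mode can be mixed freely within one document.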
2.4 Features
In this section, we describe the features we use to represent mentions. Specifically, as shown in Table 1, we use different features under different resolution modes. It should be noted that only the Distance feature is designed for parameter $q$; all other features are designed for parameter $t$.
Table 2: Corpora statistics. Columns: Corpora, # Doc, # Sent, # Word, # Entity, # Mention.
Table 3: F1 scores of different evaluation metrics for our model, together with two deterministic systems and one unsupervised system as baselines (above the dashed line) and seven supervised systems (below the dashed line) for comparison on the CoNLL 2012 development and test datasets. Column groups: CoNLL’12 English development data, CoNLL’12 English test data.
2.5 Model Learning
For model learning, we run the EM algorithm [Dempster et al.1977] on our model, treating $D$ as observed data and $C$ as latent variables. We run EM for 10 iterations and select the parameters achieving the best performance on the development data. Each iteration takes around 12 hours running in parallel on 10 CPUs. According to our experiments, the best parameters appear at around the 5th iteration. The detailed derivation of the learning algorithm is shown in Appendix A. The pseudo-code is shown in Algorithm 1. We use uniform initialization for all the parameters in our model.
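The following Python sketch shows the general shape of EM with latent antecedents, in the spirit of IBM Model 1 training. It is a deliberately simplified illustration and not the authors' algorithm: each mention is reduced to a single discrete symbol, the position distribution is kept uniform, resolution modes are ignored, and no development-set selection is performed. The name `em_train` and the toy documents are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple

ROOT = "<root>"


def em_train(docs: Sequence[Sequence[str]], iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate t(mention | antecedent) with EM; positions stay uniform."""
    vocab = {m for doc in docs for m in doc}
    t = defaultdict(lambda: 1.0 / len(vocab))  # uniform initialization
    for _ in range(iterations):
        counts = defaultdict(float)
        totals = defaultdict(float)
        for doc in docs:
            seq = [ROOT] + list(doc)
            for j in range(1, len(seq)):
                # E-step: posterior over the antecedents 0..j-1 of mention j.
                scores = [t[(seq[j], seq[i])] for i in range(j)]
                z = sum(scores)
                for i, s in enumerate(scores):
                    counts[(seq[j], seq[i])] += s / z
                    totals[seq[i]] += s / z
        # M-step: renormalize the expected counts for each antecedent.
        t = defaultdict(float, {
            (m, a): c / totals[a] for (m, a), c in counts.items()
        })
    return dict(t)


toy_docs = [["Obama", "he"], ["Obama", "he", "he"], ["Clinton", "she"]]
params = em_train(toy_docs, iterations=5)
for (mention, antecedent), p in sorted(params.items()):
    print(f"t({mention} | {antecedent}) = {p:.3f}")
```

Because the posterior for each mention factorizes over positions, the E-step needs no dynamic programming, and documents can be processed independently, which is compatible with running an iteration in parallel over many CPUs.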
Several previous studies have attempted to use EM for entity coreference resolution. Cherry and Bergsma (2005) and Charniak and Elsner (2009) applied EM to pronoun anaphora resolution. Ng (2008) probabilistically induced coreference partitions via EM clustering. Recently, Moosavi and Strube (2014) proposed an unsupervised model utilizing the most informative relations and achieved competitive performance with the Stanford system.
2.6 Mention Detection
The basic rules we used to detect mentions are similar to those of Lee et al. (2013), except that their system uses a set of filtering rules designed to discard instances of pleonastic it, partitives, certain quantified noun phrases and other spurious mentions. Our system keeps partitives, quantified noun phrases and bare NP mentions, but discards pleonastic it and other spurious mentions.
3 Experiments
3.1 Experimental Setup
Due to the availability of readily parsed data, we select the APW and NYT sections of the Gigaword Corpus (years 1994-2010) [Parker et al.2011] to train the model. Following previous work [Chambers and Jurafsky2008], we remove duplicated documents and the documents which include fewer than 3 sentences. The development and test data are the English data from the CoNLL-2012 shared task [Pradhan et al.2012], which is derived from the OntoNotes corpus [Hovy et al.2006].
The corpora statistics are shown in Table 2. Our system is evaluated with automatically extracted mentions on the version of the data with automatic preprocessing information (e.g., predicted parse trees).
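The training-data filtering described above (dropping duplicated documents and documents with fewer than 3 sentences) can be sketched as follows. This is a minimal sketch: the text does not say how duplicates are detected, so exact equality of the sentence sequence is an assumption, and the name `filter_documents` is hypothetical.

```python
from typing import Iterable, Iterator, List


def filter_documents(docs: Iterable[List[str]], min_sentences: int = 3) -> Iterator[List[str]]:
    """Yield documents that are long enough and not exact duplicates."""
    seen = set()
    for sentences in docs:
        if len(sentences) < min_sentences:
            continue  # too short to be useful for training
        key = tuple(sentences)
        if key in seen:
            continue  # exact duplicate of an earlier document
        seen.add(key)
        yield sentences


corpus = [
    ["A.", "B.", "C."],
    ["A.", "B."],            # dropped: fewer than 3 sentences
    ["A.", "B.", "C."],      # dropped: duplicate of the first document
    ["D.", "E.", "F.", "G."],
]
print(list(filter_documents(corpus)))
```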
Evaluation Metrics. We evaluate our model on three measures widely used in the literature: MUC [Vilain et al.1995], B$^3$ [Bagga and Baldwin1998], and Entity-based CEAF (CEAF$_e$) [Luo2005]. In addition, we also report results on another two popular metrics: Mention-based CEAF (CEAF$_m$) and BLANC [Recasens and Hovy2011]. All the results are given by the latest version of the CoNLL-2012 scorer (http://conll.cemantix.org/2012/software.html).
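For readers unfamiliar with these measures, the sketch below computes the B-cubed F1 of Bagga and Baldwin under the simplifying assumption that the key and the response contain exactly the same mentions, and averages three F1 values in the way the CoNLL shared task defines its official score (the unweighted mean of MUC, B-cubed, and entity-based CEAF). The official scorer handles predicted mentions and other details that this sketch ignores; the function names and the toy partitions are assumptions made for this example.

```python
from typing import Hashable, Sequence, Set


def b_cubed_f1(key: Sequence[Set[Hashable]], response: Sequence[Set[Hashable]]) -> float:
    """B-cubed F1, assuming key and response cover the same mentions."""
    key_of = {m: chain for chain in key for m in chain}
    resp_of = {m: chain for chain in response for m in chain}
    mentions = list(key_of)
    # Per-mention precision and recall, averaged over all mentions.
    precision = sum(len(key_of[m] & resp_of[m]) / len(resp_of[m]) for m in mentions) / len(mentions)
    recall = sum(len(key_of[m] & resp_of[m]) / len(key_of[m]) for m in mentions) / len(mentions)
    return 2 * precision * recall / (precision + recall)


def conll_f1(muc: float, b_cubed: float, ceaf_e: float) -> float:
    """CoNLL official score: unweighted mean of the three F1 values."""
    return (muc + b_cubed + ceaf_e) / 3.0


# Toy partitions over mentions 1..5 (illustrative only).
gold = [{1, 2, 3}, {4, 5}]
predicted = [{1, 2}, {3, 4, 5}]
print(round(b_cubed_f1(gold, predicted), 4))
```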
3.2 Results and Comparison
Table 3 shows the results of our model together with three baselines: two deterministic systems, namely Stanford: the Stanford system [Lee et al.2011] and Multigraph: the unsupervised multigraph system [Martschat2013], and one unsupervised system, namely MIR: the unsupervised system using most informative relations [Moosavi and Strube2014]. Our model outperforms the three baseline systems on all the evaluation metrics. Specifically, our model achieves improvements of 2.93% and 3.01% on CoNLL F1 score over the Stanford system, the winner of the CoNLL 2011 shared task, on the CoNLL 2012 development and test sets, respectively. The improvements on CoNLL F1 score over the Multigraph model are 1.41% and 1.77% on the development and test sets, respectively. Compared with the MIR model, we obtain significant improvements of 2.62% and 3.02% on CoNLL F1 score.
To make a thorough empirical comparison with previous studies, Table 3 (below the dashed line) also shows the results of some state-of-the-art supervised coreference resolution systems, namely IMS: the second best system in the CoNLL 2012 shared task [Björkelund and Farkas2012]; Latent-Tree: the latent tree model [Fernandes et al.2012] obtaining the best results in the shared task; Berkeley: the Berkeley system with the final feature set [Durrett and Klein2013]; LaSO: the structured perceptron system with non-local features [Björkelund and Kuhn2014]; Latent-Strc: the latent structure system [Martschat and Strube2015]; Model-Stack: the entity-centric system with model stacking [Clark and Manning2015]; and Non-Linear: the non-linear mention-ranking model with feature representations [Wiseman et al.2015]. Our unsupervised ranking model outperforms the supervised IMS system by 1.02% on the CoNLL F1 score, and achieves competitive performance with the latent tree model. Moreover, our approach considerably narrows the gap to other supervised systems listed in Table 3.
4 Conclusion
We proposed a new generative, unsupervised ranking model for entity coreference resolution, into which we introduced resolution mode variables to distinguish mentions resolved by different categories of information. Experimental results on the data from the CoNLL-2012 shared task show that our system significantly improves the accuracy on different evaluation metrics over the baseline systems.
One possible direction for future work is to differentiate more resolution modes. Another one is to add more precise or even event-based features to improve the model’s performance.
Acknowledgements
This research was supported in part by DARPA grant FA8750-12-2-0342 funded under the DEFT program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.
References
- [Bagga and Baldwin1998] Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In The first international conference on language resources and evaluation workshop on linguistics coreference, volume 1, pages 563–566. Citeseer.
- [Bergsma and Lin2006] Shane Bergsma and Dekang Lin. 2006. Bootstrapping path-based pronoun resolution. In Proceedings of ACL-2006, pages 33–40, Sydney, Australia, July. Association for Computational Linguistics.
- [Björkelund and Farkas2012] Anders Björkelund and Richárd Farkas. 2012. Data-driven multilingual coreference resolution using resolver stacking. In Proceedings of EMNLP-CoNLL-2012 - Shared Task, pages 49–55, Jeju Island, Korea, July. Association for Computational Linguistics.
- [Björkelund and Kuhn2014] Anders Björkelund and Jonas Kuhn. 2014. Learning structured perceptrons for coreference resolution with latent antecedents and non-local features. In Proceedings of ACL-2014, pages 47–57, Baltimore, Maryland, June. Association for Computational Linguistics.
- [Brown et al.1993] Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational linguistics, 19(2):263–311.
- [Chambers and Jurafsky2008] Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL-2008: HLT, pages 789–797, Columbus, Ohio, June. Association for Computational Linguistics.
- [Chambers and Jurafsky2009] Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In Proceedings of ACL-2009, pages 602–610, Suntec, Singapore, August. Association for Computational Linguistics.
- [Charniak and Elsner2009] Eugene Charniak and Micha Elsner. 2009. EM works for pronoun anaphora resolution. In Proceedings of EACL 2009, pages 148–156, Athens, Greece, March.
- [Cherry and Bergsma2005] Colin Cherry and Shane Bergsma. 2005. An Expectation Maximization approach to pronoun resolution. In Proceedings of CoNLL-2005, pages 88–95, Ann Arbor, Michigan, June.
- [Clark and Manning2015] Kevin Clark and Christopher D. Manning. 2015. Entity-centric coreference resolution with model stacking. In Proceedings of ACL-IJCNLP-2015, pages 1405–1415, Beijing, China, July. Association for Computational Linguistics.
- [Dempster et al.1977] Arthur P Dempster, Nan M Laird, and Donald B Rubin. 1977. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38.
- [Durrett and Klein2013] Greg Durrett and Dan Klein. 2013. Easy victories and uphill battles in coreference resolution. In Proceedings of EMNLP-2013, pages 1971–1982, Seattle, Washington, USA, October. Association for Computational Linguistics.
- [Durrett and Klein2014] Greg Durrett and Dan Klein. 2014. A joint model for entity analysis: Coreference, typing, and linking. Transactions of the Association for Computational Linguistics.
- [Fernandes et al.2012] Eraldo Fernandes, Cícero dos Santos, and Ruy Milidiú. 2012. Latent structure perceptron with feature induction for unrestricted coreference resolution. In Proceedings of EMNLP-CoNLL-2012 - Shared Task, pages 41–48, Jeju Island, Korea, July. Association for Computational Linguistics.
- [Fernandes et al.2014] Eraldo Rezende Fernandes, Cícero Nogueira dos Santos, and Ruy Luiz Milidiú. 2014. Latent trees for coreference resolution. Computational Linguistics.
- [Haghighi and Klein2007] Aria Haghighi and Dan Klein. 2007. Unsupervised coreference resolution in a nonparametric bayesian model. In Proceedings of ACL-2007, pages 848–855, Prague, Czech Republic, June. Association for Computational Linguistics.
- [Haghighi and Klein2009] Aria Haghighi and Dan Klein. 2009. Simple coreference resolution with rich syntactic and semantic features. In Proceedings of EMNLP-2009, pages 1152–1161, Singapore, August. Association for Computational Linguistics.
- [Haghighi and Klein2010] Aria Haghighi and Dan Klein. 2010. Coreference resolution in a modular, entity-centered model. In Proceedings of NAACL-2010, pages 385–393, Los Angeles, California, June. Association for Computational Linguistics.
- [Hovy et al.2006] Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. Ontonotes: The 90% solution. In Proceedings of NAACL-2006, pages 57–60, New York City, USA, June. Association for Computational Linguistics.
- [Ji and Lin2009] Heng Ji and Dekang Lin. 2009. Gender and animacy knowledge discovery from web-scale n-grams for unsupervised person mention detection. In Proceedings of PACLIC-2009, pages 220–229.
- [Ji et al.2014] Heng Ji, HT Dang, J Nothman, and B Hachey. 2014. Overview of tac-kbp2014 entity discovery and linking tasks. In Proc. Text Analysis Conference (TAC2014).
- [Lee et al.2011] Heeyoung Lee, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2011. Stanford’s multi-pass sieve coreference resolution system at the conll-2011 shared task. In Proceedings of CoNLL-2011: Shared Task, pages 28–34, Portland, Oregon, USA, June. Association for Computational Linguistics.
- [Lee et al.2013] Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2013. Deterministic coreference resolution based on entity-centric, precision-ranked rules. Comput. Linguist., 39(4):885–916, December.
- [Luo2005] Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of EMNLP-2005, pages 25–32, Vancouver, British Columbia, Canada, October. Association for Computational Linguistics.
- [Ma and Xia2014] Xuezhe Ma and Fei Xia. 2014. Unsupervised dependency parsing with transferring distribution via parallel guidance and entropy regularization. In Proceedings of ACL-2014, pages 1337–1348, Baltimore, Maryland, June.
- [Martschat and Strube2015] Sebastian Martschat and Michael Strube. 2015. Latent structures for coreference resolution. Transactions of the Association for Computational Linguistics, 3:405–418.
- [Martschat2013] Sebastian Martschat. 2013. Multigraph clustering for unsupervised coreference resolution. In ACL-2013: Student Research Workshop, pages 81–88, Sofia, Bulgaria, August. Association for Computational Linguistics.
- [McCarthy and Lehnert1995] Joseph F McCarthy and Wendy G Lehnert. 1995. Using decision trees for coreference resolution. In Proceedings of IJCAI-1995, pages 1050–1055. Morgan Kaufmann Publishers Inc.
- [Moosavi and Strube2014] Nafise Sadat Moosavi and Michael Strube. 2014. Unsupervised coreference resolution by utilizing the most informative relations. In Proceedings of COLING-2014, pages 644–655.
- [Ng2008] Vincent Ng. 2008. Unsupervised models for coreference resolution. In Proceedings of EMNLP-2008, pages 640–649, Honolulu, Hawaii, October. Association for Computational Linguistics.
- [Ng2010] Vincent Ng. 2010. Supervised noun phrase coreference research: The first fifteen years. In Proceedings of ACL-2010, pages 1396–1411, Uppsala, Sweden, July. Association for Computational Linguistics.
- [Parker et al.2011] Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English gigaword fifth edition. Linguistic Data Consortium, LDC2011T07.
- [Poon and Domingos2008] Hoifung Poon and Pedro Domingos. 2008. Joint unsupervised coreference resolution with Markov Logic. In Proceedings of EMNLP-2008, pages 650–659, Honolulu, Hawaii, October. Association for Computational Linguistics.
- [Pradhan et al.2012] Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. Conll-2012 shared task: Modeling multilingual unrestricted coreference in ontonotes. In Proceedings of EMNLP-CoNLL-2012 - Shared Task, pages 1–40, Jeju Island, Korea, July. Association for Computational Linguistics.
- [Ratinov and Roth2012] Lev Ratinov and Dan Roth. 2012. Learning-based multi-sieve co-reference resolution with knowledge. In Proceedings of EMNLP-CoNLL-2012, pages 1234–1244, Jeju Island, Korea, July. Association for Computational Linguistics.
- [Recasens and Hovy2011] Marta Recasens and Eduard Hovy. 2011. Blanc: Implementing the rand index for coreference evaluation. Natural Language Engineering, 17(04):485–510.
- [Soon et al.2001] Wee Meng Soon, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational linguistics, 27(4):521–544.
- [Vilain et al.1995] Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Proceedings of the 6th conference on Message understanding, pages 45–52. Association for Computational Linguistics.
- [Wellner et al.2004] Ben Wellner, Andrew McCallum, Fuchun Peng, and Michael Hay. 2004. An integrated, conditional model of information extraction and coreference with application to citation matching. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 593–601. AUAI Press.
- [Wiseman et al.2015] Sam Wiseman, Alexander M. Rush, Stuart Shieber, and Jason Weston. 2015. Learning anaphoricity and antecedent ranking features for coreference resolution. In Proceedings of ACL-IJCNLP-2015, pages 1416–1426, Beijing, China, July. Association for Computational Linguistics.
- [Yang et al.2008a] Xiaofeng Yang, Jian Su, Jun Lang, Chew Lim Tan, Ting Liu, and Sheng Li. 2008a. An entity-mention model for coreference resolution with inductive logic programming. In Proceedings of ACL-2008, pages 843–851.
- [Yang et al.2008b] Xiaofeng Yang, Jian Su, and Chew Lim Tan. 2008b. A twin-candidate model for learning-based anaphora resolution. Computational Linguistics, 34(3):327–356.