1 Introduction
Coreference resolution is the task of identifying all mentions which refer to the same entity in a document. It has been shown to be beneficial in many natural language processing (NLP) applications, including question answering (Hermann et al., 2015) and information extraction (Kehler, 1997), and is often regarded as a prerequisite to any text understanding task.
Coreference resolution can be regarded as a clustering problem: each cluster corresponds to a single entity and consists of all its mentions in a given text. Consequently, it is natural to evaluate predicted clusters by comparing them with the ones annotated by human experts, and this is exactly what the standard metrics (e.g., MUC, B³, CEAF) do. In contrast, most state-of-the-art systems are optimized to make individual coreference decisions, and such losses are only indirectly related to the metrics.
One way to deal with this challenge is to optimize directly the non-differentiable metrics using reinforcement learning (RL), for example, relying on the REINFORCE policy gradient algorithm (Williams, 1992). However, this approach has not been very successful, which, as suggested by Clark and Manning (2016a), is possibly due to the discrepancy between sampling decisions at training time and choosing the highest-ranking ones at test time. A more successful alternative is using a ‘rollout’ stage to associate costs with possible decisions, as in Clark and Manning (2016a), but it is computationally expensive. Imitation learning (Ma et al., 2014b; Clark and Manning, 2015), though also exploiting metrics, requires access to an expert policy, with exact policies not directly computable for the metrics of interest.
In this work, we aim at combining the best of both worlds by proposing a simple method that can turn popular coreference evaluation metrics into differentiable functions of model parameters. As we show, these functions can be computed recursively using scores of individual local decisions, resulting in a simple and efficient estimation procedure. The key idea is to replace non-differentiable indicator functions (e.g., the membership indicator [m ∈ e]) with the corresponding posterior probabilities (p(m ∈ e)) computed by the model. Consequently, non-differentiable functions used within the metrics (e.g., the set size function |e|) become differentiable (Σ_m p(m ∈ e)). Though we assume that the scores of the underlying statistical model can be used to define a probability model, we show that this is not a serious limitation. Specifically, as a baseline we use a probabilistic version of the neural mention-ranking model of Wiseman et al. (2015b), which on its own outperforms the original one and achieves performance similar to its global version (Wiseman et al., 2016). Importantly, when we use the introduced differentiable relaxations in training, we observe a substantial gain in performance over our probabilistic baseline. Interestingly, the absolute improvement (+0.52) is higher than the one reported in Clark and Manning (2016a) using RL (+0.05) and the one using reward rescaling^1 (+0.37). This suggests that our method provides a viable alternative to using RL and reward rescaling.

^1 Reward rescaling is a technique that computes error values for a heuristic loss function based on the reward difference between the best decision according to the current model and the decision leading to the highest metric score.

The outline of our paper is as follows: we introduce our neural resolver baseline and the B³ and LEA metrics in Section 2. Our method to turn a mention-ranking resolver into an entity-centric resolver is presented in Section 3, and the proposed differentiable relaxations in Section 4. Section 5 shows our experimental results.
2 Background
2.1 Neural mention ranking
In this section we introduce neural mention ranking, the framework which underpins current state-of-the-art models (Clark and Manning, 2016a). Specifically, we consider a probabilistic version of the method proposed by Wiseman et al. (2015b). In experiments we will use it as our baseline.
Let (m_1, ..., m_n) be the list of mentions in a document. For each mention m_i, let a_i ∈ {1, ..., i} be the index of the mention that m_i is coreferent with (if a_i = i, m_i is the first mention of some entity appearing in the document). As standard in the coreference resolution literature, we will refer to m_{a_i} as an antecedent of m_i.^2 Then, in mention ranking the goal is to score correct antecedents of a mention higher than any other mentions, i.e., if s(i, j) is the scoring function, we require s(i, j) > s(i, k) for all j, k such that m_i and m_j are coreferent but m_i and m_k are not.

^2 This slightly deviates from the definition of antecedents in linguistics (Crystal, 1997).
Let f_a(i) and f_p(i, j) be respectively features of m_i and features of the pair (m_i, m_j). The scoring function s(i, j) is computed by a one-hidden-layer neural network over these features; its parameters are real vectors and matrices of appropriate dimensions together with real scalar biases. Unlike Wiseman et al. (2015b), where a max-margin loss is used, we define a probabilistic model. The probability^3 that m_i and m_j are coreferent is given by

p(a_i = j) = exp(s(i, j)) / Σ_{j′=1}^{i} exp(s(i, j′))   (1)

^3 For the sake of readability, we do not explicitly mark in our notation that all the probabilities are conditioned on the document (e.g., the mentions) and dependent on model parameters.
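Equation (1) is a softmax over the candidate antecedents of m_i, including the self-link. A minimal sketch, with illustrative scores standing in for the neural scorer:

```python
import math

def antecedent_probs(scores):
    """Eq. (1): p(a_i = j) = exp(s(i, j)) / sum_j' exp(s(i, j')).

    scores[j] is s(i, j) for each candidate antecedent j <= i;
    the last entry is the self-link score s(i, i).
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Illustrative scores (not from a trained model):
probs = antecedent_probs([1.0, 2.0, 0.5])
```

The probabilities sum to one and preserve the ranking of the scores, so the highest-scoring candidate receives the largest probability mass.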
Following Durrett and Klein (2013) we use a softmax-margin (Gimpel and Smith, 2010) loss function

L(θ) = − Σ_{i=1}^{n} log Σ_{j ∈ T(i)} p_Δ(a_i = j),

where θ are the model parameters, T(i) is the set of the indices of the correct antecedents of m_i, and p_Δ is the cost-augmented distribution obtained by adding a cost Δ(i, j) to each score s(i, j) before normalization. Δ is used to manipulate the contribution of different error types ("false anaphor", "false new", "wrong link", and "no mistake") to the loss function. In our experiments, we borrow its values from Durrett and Klein (2013). In the subsequent discussion, we refer to this loss as the mention-ranking heuristic cross entropy.
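The softmax-margin loss augments each score with a non-negative cost before normalization, so costly mistakes must be beaten by a larger margin. A sketch for a single mention; the scores and cost values here are illustrative assumptions, not the configuration borrowed from Durrett and Klein (2013):

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def softmax_margin_loss(scores, gold, costs):
    """Softmax-margin loss for one mention m_i (a sketch of
    Gimpel & Smith, 2010; exact costs are left unspecified here).

    scores[j]: s(i, j) for each candidate antecedent (self-link last)
    gold:      indices T(i) of the correct antecedents
    costs[j]:  Delta(i, j) >= 0, the cost of deciding a_i = j
    """
    # cost-augmented log-partition: log sum_j exp(s_j + Delta_j)
    log_z = log_sum_exp([s + c for s, c in zip(scores, costs)])
    # log of the (unnormalized) mass on the correct antecedents
    log_gold = log_sum_exp([scores[j] for j in gold])
    return log_z - log_gold  # non-negative when all costs are >= 0

loss = softmax_margin_loss([1.0, 2.0, 0.5], {1}, [0.5, 0.0, 1.0])
```

With zero costs and every candidate marked correct, the loss is exactly zero, which is a quick sanity check of the formula.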
2.2 Evaluation Metrics
We use the five most popular metrics^4 for evaluation:
- MUC (Vilain et al., 1995),
- B³ (Bagga and Baldwin, 1998),
- CEAF_m and CEAF_e (Luo, 2005),
- BLANC (Luo et al., 2014),
- LEA (Moosavi and Strube, 2016).
However, because MUC is the least discriminative metric (Moosavi and Strube, 2016), whereas CEAF is slow to compute, out of these five metrics we incorporate into our loss only B³. In addition, we integrate LEA, as it has been shown to provide a good balance between discriminativity and interpretability.

^4 All are implemented in the scorer of Pradhan et al. (2014), https://github.com/conll/reference-coreference-scorers.
Let K and R be the gold-standard entity set and the entity set given by a resolver, respectively. Recall that an entity is a set of mentions. The recall of the B³ metric is computed by:

Recall = [ Σ_{k ∈ K} Σ_{r ∈ R} |k ∩ r|² / |k| ] / Σ_{k ∈ K} |k|.

The recall of the LEA metric is computed as:

Recall = [ Σ_{k ∈ K} |k| × Σ_{r ∈ R} link(k ∩ r) / link(k) ] / Σ_{k ∈ K} |k|,

where link(e) = |e| (|e| − 1) / 2 is the number of coreference links in entity e. Precision, for both metrics, is defined by swapping the roles of K and R in the formulas above. F1, the harmonic mean of recall and precision, is used in the standard evaluation.
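Both metrics can be computed on hard clusters in a few lines. The sketch below follows the formulas above; singleton entities are skipped in the LEA part, which suffices for CoNLL data, where singletons are not annotated:

```python
def b_cubed(key, response):
    """B^3 recall and precision (Bagga & Baldwin, 1998).
    key/response: gold and predicted entities, each a set of mention ids."""
    def score(K, R):
        num = sum(len(k & r) ** 2 / len(k) for k in K for r in R)
        return num / sum(len(k) for k in K)
    return score(key, response), score(response, key)  # recall, precision

def lea(key, response):
    """LEA recall and precision (Moosavi & Strube, 2016);
    link(e) = |e|(|e|-1)/2.  Singletons are skipped in this sketch."""
    def link(n):
        return n * (n - 1) / 2
    def score(K, R):
        K = [k for k in K if len(k) > 1]
        num = sum(len(k) * sum(link(len(k & r)) for r in R) / link(len(k))
                  for k in K)
        return num / sum(len(k) for k in K)
    return score(key, response), score(response, key)

gold = [{1, 2, 3}, {4, 5}]
pred = [{1, 2}, {3, 4, 5}]
```

When the prediction matches the gold clustering exactly, both metrics give recall = precision = 1; any over- or under-clustering, as in `pred` above, pushes them below 1.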
3 From mention ranking to entity centricity
Mention-ranking resolvers do not explicitly provide information about entities/clusters, which is required by B³ and LEA. We therefore propose a simple solution that can turn a mention-ranking resolver into an entity-centric one.
First note that in a document containing n mentions, there are n potential entities e_1, ..., e_n, where e_j has m_j as its first mention. Let p(m_i ∈ e_j) be the probability that mention m_i corresponds to entity e_j. We now show that it can be computed recursively based on p(a_i = j) as follows:

p(m_i ∈ e_j) = Σ_{k=j}^{i−1} p(a_i = k) p(m_k ∈ e_j)   if j < i,
p(m_i ∈ e_i) = p(a_i = i),
p(m_i ∈ e_j) = 0   if j > i.

In other words, if j < i, we consider all possible mentions m_k with which m_i can be coreferent and which can correspond to entity e_j. If j = i, the link to be considered is m_i's self-link. And, if j > i, the probability is zero, as it is impossible for m_i to be assigned to an entity introduced only later. See Figure 1 for extra information.
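The recursion translates directly into code. A sketch, with a hand-picked antecedent distribution p_ante whose rows sum to one:

```python
def entity_probs(p_ante):
    """Recursive computation of p(m_i in e_j) from antecedent
    probabilities.

    p_ante[i][k] = p(a_i = k) for k <= i (k == i is the self-link),
    zero above the diagonal.  Returns p_ent[i][j] = p(m_i in e_j).
    """
    n = len(p_ante)
    p_ent = [[0.0] * n for _ in range(n)]
    for i in range(n):
        p_ent[i][i] = p_ante[i][i]             # j = i: self-link
        for j in range(i):                      # j < i: sum over m_k
            p_ent[i][j] = sum(p_ante[i][k] * p_ent[k][j]
                              for k in range(j, i))
    return p_ent                                # j > i stays 0

p_ante = [[1.0, 0.0, 0.0],
          [0.6, 0.4, 0.0],
          [0.3, 0.5, 0.2]]
p_ent = entity_probs(p_ante)
```

On this example each row of p_ent again sums to one, and no p(m_i ∈ e_j) exceeds p(a_j = j) — numerically confirming Propositions 1 and 2 below.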
We now turn to two crucial questions about this formula:
- Is p(m_i ∈ e_j) a valid probability distribution?
- Is it possible for a mention m_j to be mostly anaphoric (i.e., p(a_j = j) is low) but for the corresponding cluster to be highly probable (i.e., p(m_i ∈ e_j) is high for some i)?
The first question is answered in Proposition 1. The second question is important because, intuitively, when a mention is anaphoric, the potential entity it would introduce does not exist. We will show that the answer is "No" by proving in Proposition 2 that the probability p(a_j = j) that m_j is non-anaphoric is always at least as high as any probability p(m_i ∈ e_j), i > j, that m_i refers to e_j.
Proposition 1.
p(m_i ∈ e_j) is a valid probability distribution, i.e., Σ_{j=1}^{n} p(m_i ∈ e_j) = 1, for all i.
Proof.
We prove this proposition by induction.
Basis: it is obvious that Σ_j p(m_1 ∈ e_j) = p(m_1 ∈ e_1) = p(a_1 = 1) = 1.
Induction step: assume that Σ_j p(m_k ∈ e_j) = 1 for all k < i. Then,

Σ_{j=1}^{n} p(m_i ∈ e_j) = p(a_i = i) + Σ_{j=1}^{i−1} Σ_{k=j}^{i−1} p(a_i = k) p(m_k ∈ e_j)
                         = p(a_i = i) + Σ_{k=1}^{i−1} p(a_i = k) Σ_{j=1}^{k} p(m_k ∈ e_j).

Because Σ_{j=1}^{k} p(m_k ∈ e_j) = 1 for all k < i, this expression is equal to Σ_{k=1}^{i} p(a_i = k) = 1. ∎
Proposition 2.
p(a_j = j) ≥ p(m_i ∈ e_j) for all i ≥ j.
Proof.
We prove this proposition by induction.
Basis: for i = j, p(m_j ∈ e_j) = p(a_j = j); for i = j + 1,
p(m_{j+1} ∈ e_j) = p(a_{j+1} = j) p(m_j ∈ e_j) ≤ p(a_j = j).
Induction step: assume that p(m_k ∈ e_j) ≤ p(a_j = j) for all k with j ≤ k < i. Then

p(m_i ∈ e_j) = Σ_{k=j}^{i−1} p(a_i = k) p(m_k ∈ e_j) ≤ p(a_j = j) Σ_{k=j}^{i−1} p(a_i = k) ≤ p(a_j = j). ∎
3.1 Entity-centric heuristic cross entropy loss
Having computed p(m_i ∈ e_j), we can consider coreference resolution as a multiclass prediction problem. An entity-centric heuristic cross entropy loss is thus given by

L(θ) = − Σ_{i=1}^{n} Δ′(i) log p(m_i ∈ e_{g(i)}),

where e_{g(i)} is the correct entity that m_i belongs to. Similar to Δ in the mention-ranking heuristic loss in Section 2.1, Δ′ is a cost function used to manipulate the contribution of the four different error types ("false anaphor", "false new", "wrong link", and "no mistake").
4 From non-differentiable metrics to differentiable losses
There are two functions used in computing B³ and LEA: the set size function |e| and the link function link(e). Because both of them are non-differentiable, the two metrics are non-differentiable. We thus need to make these two functions differentiable.
There are two remarks. Firstly, both functions can be computed using the indicator function [m ∈ e]:

|e| = Σ_m [m ∈ e],    link(e) = (1/2) Σ_{m ≠ m′} [m ∈ e][m′ ∈ e].

Secondly, given scores (s_1, ..., s_k), the indicator of the highest-scoring option is the converging point of the following softmax as T → 0 (see Figure 2):

softmax_T(s)_j = exp(s_j / T) / Σ_{j′} exp(s_{j′} / T),

where T is called the temperature (Kirkpatrick et al., 1983).
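The temperature limit is easy to check numerically: at low T the softmax is nearly one-hot, at high T nearly uniform. A small sketch:

```python
import math

def softmax_t(scores, T):
    """Softmax with temperature T; as T -> 0 it approaches the
    one-hot indicator of the argmax (cf. Figure 2)."""
    m = max(scores)
    exps = [math.exp((s - m) / T) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

hot = softmax_t([1.0, 2.0, 0.5], 0.01)   # close to [0, 1, 0]
warm = softmax_t([1.0, 2.0, 0.5], 20.0)  # close to uniform
```

Dividing the scores by T sharpens (small T) or flattens (large T) the distribution without changing the ranking of the options.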
Therefore, we propose to represent each e_j as a soft cluster whose membership function is given by p(m_i ∈ e_j), where, as defined in Section 3, e_j is the potential entity that has m_j as its first mention. Replacing the indicator function [m_i ∈ e_j] by the probability p(m_i ∈ e_j), we then have differentiable versions of the set size function and the link function:

|e_j| = Σ_{i=1}^{n} p(m_i ∈ e_j),    link(e_j) = Σ_{i < i′} p(m_i ∈ e_j) p(m_{i′} ∈ e_j).

|k ∩ e_j| and link(k ∩ e_j) are computed similarly, with the constraint that only mentions in the gold entity k are taken into account. Plugging these functions into the precision and recall of B³ and LEA in Section 2.2, we obtain differentiable relaxations of the two metrics, which are then used in two loss functions, each consisting of the negated relaxed F_β score plus regularization terms controlled by a hyperparameter.
It is worth noting that, as T → 0, the relaxed set size and link functions converge to their exact counterparts, and thus the relaxed metrics converge to the true ones.^5 Therefore, when training a model with the proposed losses, we can start at a high temperature and anneal to a small but nonzero temperature. However, in our experiments we fix T; annealing is left for future work.

^5 We can easily prove this using the algebraic limit theorem.
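Combining the soft clusters of Section 3 with the relaxed set-size and intersection functions gives, for example, a relaxed B³. This is a simplified sketch; the actual loss in the paper additionally involves the F_β score and regularization, omitted here:

```python
def soft_b_cubed(key, p_ent):
    """B^3 recall/precision computed from soft clusters: the
    indicator [m_i in e_j] is replaced by p(m_i in e_j), so set
    sizes and intersections become sums of probabilities.

    key:   gold entities, each a set of mention ids 0..n-1
    p_ent: p_ent[i][j] = p(m_i in e_j)
    """
    n = len(p_ent)
    size = [sum(p_ent[i][j] for i in range(n)) for j in range(n)]

    def inter(k, j):
        return sum(p_ent[i][j] for i in k)

    total = sum(len(k) for k in key)  # equals sum(size) by Prop. 1
    recall = sum(inter(k, j) ** 2 / len(k)
                 for k in key for j in range(n)) / total
    precision = sum(inter(k, j) ** 2 / size[j]
                    for k in key for j in range(n) if size[j] > 0) / total
    return recall, precision

# With hard (one-hot) cluster probabilities that match the gold
# clustering exactly, both recall and precision equal 1:
gold = [{0, 1}, {2}]
one_hot = [[1.0, 0.0, 0.0],
           [1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0]]
r, p = soft_b_cubed(gold, one_hot)
```

Because every term is a polynomial in the probabilities p(m_i ∈ e_j), the whole expression is differentiable in the model parameters, which is the point of the relaxation.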
5 Experiments
We now demonstrate how to use the proposed differentiable B³ and LEA to train a coreference resolver. The source code and trained models are available at https://github.com/lephong/diffmetric_coref.
Setup
We run experiments on the English portion of the CoNLL 2012 data (Pradhan et al., 2012), which consists of 3,492 documents in various domains and formats. The split provided in the CoNLL 2012 shared task is used. In all our resolvers, we use not the original features of Wiseman et al. (2015b) but their slight modification described in Wiseman et al. (2016) (Section 6.1).^6

^6 https://github.com/swiseman/nn_coref/
Resolvers
We build the following baseline and three resolvers:
- baseline: the resolver presented in Section 2.1. We use the identical configuration as in Wiseman et al. (2016), with the hidden-layer sizes set according to the numbers of mention features and pairwise features. We also employ their pretraining methodology.
- the entity-centric resolver, using the entity-centric heuristic cross entropy loss introduced in Section 3.1.
- the relaxed-metric resolvers, using the B³ and LEA losses proposed in Section 4. The hyperparameter β is tuned on the development set.
To train these resolvers we use AdaGrad (Duchi et al., 2011) to minimize their loss functions, with the learning rate tuned on the development set and with one-document minibatches. Note that we use the baseline as the initialization point when training the other three resolvers.
| | MUC | B³ | CEAF_m | CEAF_e | BLANC | LEA | CoNLL |
|---|---|---|---|---|---|---|---|
| Wiseman et al. (2015b) | 72.60 | 60.52 | – | 57.05 | – | – | 63.39 |
| Wiseman et al. (2016) | 73.42 | 61.50 | – | 57.70 | – | – | 64.21 |
| Our proposals | | | | | | | |
| baseline (heuristic loss) | 73.22 | 61.44 | 65.12 | 57.74 | 62.16 | 57.52 | 64.13 |
| | 73.2 | 61.75 | 65.77 | 57.8 | 63.3 | 57.89 | 64.25 |
| | 73.37 | 61.94 | 65.79 | 58.22 | 63.19 | 58.06 | 64.51 |
| | 73.48 | 61.99 | 65.9 | 58.36 | 63.1 | 58.13 | 64.61 |
| | 73.3 | 61.88 | 65.69 | 57.99 | 63.27 | 58.03 | 64.39 |
| | 73.53 | 62.04 | 65.95 | 58.41 | 63.09 | 58.18 | 64.66 |
| Clark and Manning (2016a) | | | | | | | |
| baseline (heuristic loss) | 74.65 | 63.03 | – | 58.40 | – | – | 65.36 |
| REINFORCE | 74.48 | 63.09 | – | 58.67 | – | – | 65.41 |
| Reward Rescaling | 74.56 | 63.40 | – | 59.23 | – | – | 65.73 |
5.1 Results
We first compare our resolvers against Wiseman et al. (2015b) and Wiseman et al. (2016). Results are shown in the first half of Table 1. Our baseline surpasses Wiseman et al. (2015b), likely due to using the features from Wiseman et al. (2016). Using the entity-centric heuristic cross entropy loss and the relaxations is clearly beneficial: the entity-centric resolver is slightly better than our baseline and on par with the global model of Wiseman et al. (2016). The relaxed-metric resolvers outperform the baseline, the global model of Wiseman et al. (2016), and the entity-centric resolver. However, the best values of β differ across the resolvers and are slightly larger than 1 (see the discussion in Section 5.3). Among these resolvers, the best one achieves the highest F1 scores across all the metrics except BLANC.
When comparing to Clark and Manning (2016a) (the second half of Table 1), we can see that our absolute improvement over the baseline (i.e., ‘heuristic loss’ for them and the heuristic cross entropy loss for us) is higher than that of reward rescaling, and it comes with much shorter training time: 7 days^7 versus 15 hours on the CoNLL metric for Clark and Manning (2016a) and ours, respectively. It is worth noting that our absolute scores are weaker than those of Clark and Manning (2016a), as they build on top of a similar but stronger mention-ranking baseline, which employs deeper neural networks and requires a much larger number of epochs to train (300 epochs, including pretraining). For the purpose of illustrating the proposed losses, we started with the simpler model of Wiseman et al. (2015b), which requires a much smaller number of epochs (20, including pretraining) and is thus faster to train.

^7 As reported in https://github.com/clarkkev/deep-coref
| | Non-Anaphoric (FA) | | | Anaphoric (FN + WL) | | |
|---|---|---|---|---|---|---|
| | Proper | Nominal | Pronom. | Proper | Nominal | Pronom. |
| baseline | 630 | 714 | 1051 | 374 + 190 | 821 + 238 | 347 + 779 |
| | 529 | 609 | 904 | 438 + 182 | 924 + 220 | 476 + 740 |
| | 545 | 559 | 883 | 433 + 172 | 951 + 192 | 457 + 761 |
| | 557 | 564 | 926 | 426 + 178 | 941 + 194 | 431 + 766 |
| | 513 | 547 | 843 | 456 + 170 | 960 + 191 | 513 + 740 |
| | 577 | 591 | 1001 | 416 + 176 | 919 + 198 | 358 + 790 |
5.2 Analysis
Table 2 shows the breakdown of errors made by the baseline and our resolvers on the development set. The proposed resolvers make fewer "false anaphor" and "wrong link" errors but more "false new" errors compared to the baseline. This suggests that the loss optimization prevents over-clustering, driving the precision up: when antecedents are difficult to detect, the self-link (i.e., a_i = i) is chosen. When β increases, the resolvers make more "false anaphor" and "wrong link" errors but fewer "false new" errors.
In Figure 3(a) the baseline, but not our proposed resolvers, mistakenly links [it] with [the virus]. Under-clustering, on the other hand, is a problem for our resolvers with small β: in example (b), they missed [We]. This behaviour results in a reduced recall, but the recall is not damaged severely, as we still obtain better overall scores. We conjecture that this behaviour is a consequence of using the F1 score in the objective, and, if undesirable, F_β with β > 1 can be used instead. For instance, also in Figure 3, our resolvers correctly detect [it] as non-anaphoric and link [We] with [our].
Figure 4 shows recall, precision, and F1 (average of MUC, B³, CEAF) on the development set when training with the two relaxed losses. As expected, higher values of β yield lower precision but higher recall. In contrast, F1 increases until reaching its highest point, and then decreases gradually.
5.3 Discussion
Because the resolvers are evaluated on F1 score metrics, one would expect the relaxed-metric resolvers to perform best with β = 1. Figure 4 and Table 1, however, do not confirm that: β should be set to values a little larger than 1. There are two hypotheses. First, the statistical difference between the training set and the development set can make the optimal β on one set suboptimal on the other. Second, in our experiments we fix the temperature T, meaning that the relaxations might not be sufficiently close to the true evaluation metrics. Our future work, to confirm or reject this, is to use annealing, i.e., gradually decreasing T down to (but keeping it larger than) 0.
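The annealing idea can be as simple as a geometric schedule; this is an illustrative sketch, as no particular schedule is committed to here:

```python
def anneal(T0, factor, steps):
    """Geometric annealing schedule for the temperature T:
    start high, decay toward (but never reaching) zero."""
    T = T0
    schedule = []
    for _ in range(steps):
        schedule.append(T)
        T *= factor  # 0 < factor < 1 keeps T strictly positive
    return schedule

schedule = anneal(1.0, 0.5, 5)  # [1.0, 0.5, 0.25, 0.125, 0.0625]
```

Each training epoch would then use the next temperature in the schedule, so the relaxed metrics tighten toward the true ones as training progresses.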
Table 1 shows that the difference between the two relaxed-metric resolvers in terms of accuracy is not substantial (although the latter is slightly better than the former). However, one might expect the resolver trained on the relaxed B³ to outperform on the B³ metric, and the one trained on the relaxed LEA to outperform on LEA. It turns out that B³ and LEA behave quite similarly in non-extreme cases, as can be seen in Figures 2, 4, 5, 6, and 7 of Moosavi and Strube (2016).
6 Related work
Mention ranking and entity centricity are two main streams in the coreference resolution literature. Mention ranking (Denis and Baldridge, 2007; Durrett and Klein, 2013; Martschat and Strube, 2015; Wiseman et al., 2015a) considers local and independent decisions when choosing a correct antecedent for a mention. This approach is computationally efficient and currently dominant, with state-of-the-art performance (Wiseman et al., 2016; Clark and Manning, 2016a). Wiseman et al. (2015b) propose to use simple neural networks to compute mention-ranking scores and a heuristic loss to train the model. Wiseman et al. (2016) extend this by employing LSTMs to compute mention-chain representations, which are then used to compute ranking scores; they call these representations global features. Clark and Manning (2016a) build a resolver similar to that of Wiseman et al. (2015b) but much stronger thanks to deeper neural networks and "better mention detection, more effective hyperparameters, and more epochs of training". Furthermore, using reward rescaling they achieve the best performance in the literature on the English and Chinese portions of the CoNLL 2012 dataset. Our work builds upon mention ranking by turning a mention-ranking model into an entity-centric one. It is worth noting that although we use the model proposed by Wiseman et al. (2015b), any mention-ranking model can be employed.
Entity centricity (Wellner and McCallum, 2003; Poon and Domingos, 2008; Haghighi and Klein, 2010; Ma et al., 2014a; Clark and Manning, 2016b), on the other hand, incorporates entity-level information to solve the problem. The approach can be top-down, as in the generative model of Haghighi and Klein (2010). It can also be bottom-up, merging smaller clusters into bigger ones as in Clark and Manning (2016b). The method proposed by Ma et al. (2014a) greedily and incrementally adds mentions to previously built clusters using a prune-and-score technique. Importantly, by employing imitation learning these two methods can optimize the resolvers directly for evaluation metrics. Our work is similar to Ma et al. (2014a) in the sense that our resolvers incrementally add mentions to previously built clusters. However, different from both Ma et al. (2014a) and Clark and Manning (2016b), our resolvers do not make any discrete decisions (e.g., merge operations). Instead, they seamlessly compute the probability that a mention refers to an entity from mention-ranking probabilities, and are optimized on differentiable relaxations of evaluation metrics.
Using differentiable relaxations of evaluation metrics, as in our work, is related to a line of research in reinforcement learning where a non-differentiable action-value function is replaced by a differentiable critic (Sutton et al., 1999; Silver et al., 2014). The critic is trained to be as close to the true action-value function as possible. This technique has been applied to machine translation (Gu et al., 2017), where evaluation metrics (e.g., BLEU) are non-differentiable. A disadvantage of using critics is that there is no guarantee that the critic converges to the true evaluation metric given finite training data. In contrast, our differentiable relaxations do not need to be trained, and convergence is guaranteed as T → 0.
7 Conclusions
We have proposed:
- a method for turning any mention-ranking resolver into an entity-centric one by using a recursive formula to combine scores of individual local decisions, and
- differentiable relaxations for two coreference evaluation metrics, B³ and LEA.
Experimental results show that our approach outperforms the resolver of Wiseman et al. (2016), and gains a higher improvement over the baseline than Clark and Manning (2016a) but with much shorter training time.
Acknowledgments
We would like to thank Raquel Fernández, Wilker Aziz, Nafise Sadat Moosavi, and anonymous reviewers for their suggestions and comments. The project was supported by the European Research Council (ERC StG BroadSem 678254), the Dutch National Science Foundation (NWO VIDI 639.022.518) and an Amazon Web Services (AWS) grant.
References
 Bagga and Baldwin (1998) Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In The first international conference on language resources and evaluation workshop on linguistics coreference. volume 1, pages 563–566.
 Clark and Manning (2016a) Kevin Clark and Christopher D. Manning. 2016a. Deep reinforcement learning for mention-ranking coreference models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 2256–2262. https://aclweb.org/anthology/D16-1245.
 Clark and Manning (2016b) Kevin Clark and Christopher D. Manning. 2016b. Improving coreference resolution by learning entity-level distributed representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 643–653. http://www.aclweb.org/anthology/P16-1061.
 Clark and Manning (2015) Kevin Clark and Christopher D. Manning. 2015. Entity-centric coreference resolution with model stacking. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, pages 1405–1415. https://doi.org/10.3115/v1/P15-1136.
 Crystal (1997) David Crystal. 1997. Dictionary of Linguistics and Phonetics. Blackwell Publishers, Cambridge, MA.
 Denis and Baldridge (2007) Pascal Denis and Jason Baldridge. 2007. A ranking approach to pronoun resolution. In IJCAI, pages 1588–1593.

 Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(Jul):2121–2159.
 Durrett and Klein (2013) Greg Durrett and Dan Klein. 2013. Easy victories and uphill battles in coreference resolution. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 1971–1982. http://aclweb.org/anthology/D13-1203.
 Gimpel and Smith (2010) Kevin Gimpel and Noah A. Smith. 2010. Softmax-margin CRFs: Training log-linear models with cost functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Los Angeles, California, pages 733–736. http://www.aclweb.org/anthology/N10-1112.
 Gu et al. (2017) Jiatao Gu, Kyunghyun Cho, and Victor OK Li. 2017. Trainable greedy decoding for neural machine translation. arXiv preprint arXiv:1702.02429 .
 Haghighi and Klein (2010) Aria Haghighi and Dan Klein. 2010. Coreference resolution in a modular, entity-centered model. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pages 385–393.
 Hermann et al. (2015) Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS). http://arxiv.org/abs/1506.03340.
 Kehler (1997) Andrew Kehler. 1997. Second Conference on Empirical Methods in Natural Language Processing, chapter Probabilistic Coreference in Information Extraction. http://aclweb.org/anthology/W97-0319.
 Kirkpatrick et al. (1983) Scott Kirkpatrick, C Daniel Gelatt, Mario P Vecchi, et al. 1983. Optimization by simulated annealing. Science 220(4598):671–680.
 Luo (2005) Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing. http://aclweb.org/anthology/H05-1004.
 Luo et al. (2014) Xiaoqiang Luo, Sameer Pradhan, Marta Recasens, and Eduard Hovy. 2014. An extension of BLANC to system mentions. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, pages 24–29. https://doi.org/10.3115/v1/P14-2005.
 Ma et al. (2014a) Chao Ma, Janardhan Rao Doppa, J. Walker Orr, Prashanth Mannem, Xiaoli Fern, Tom Dietterich, and Prasad Tadepalli. 2014a. Prune-and-score: Learning for greedy coreference resolution. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 2115–2126. http://www.aclweb.org/anthology/D14-1225.
 Ma et al. (2014b) Chao Ma, Janardhan Rao Doppa, Walker J. Orr, Prashanth Mannem, Xiaoli Fern, Tom Dietterich, and Prasad Tadepalli. 2014b. Prune-and-score: Learning for greedy coreference resolution. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pages 2115–2126. https://doi.org/10.3115/v1/D14-1225.
 Martschat and Strube (2015) Sebastian Martschat and Michael Strube. 2015. Latent structures for coreference resolution. Transactions of the Association for Computational Linguistics 3:405–418.
 Moosavi and Strube (2016) Nafise Sadat Moosavi and Michael Strube. 2016. Which coreference evaluation metric do you trust? A proposal for a link-based entity aware metric. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 632–642. http://www.aclweb.org/anthology/P16-1060.
 Poon and Domingos (2008) Hoifung Poon and Pedro Domingos. 2008. Joint unsupervised coreference resolution with markov logic. In Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, pages 650–659.
 Pradhan et al. (2014) Sameer Pradhan, Xiaoqiang Luo, Marta Recasens, Eduard Hovy, Vincent Ng, and Michael Strube. 2014. Scoring coreference partitions of predicted mentions: A reference implementation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Baltimore, Maryland, pages 30–35. http://www.aclweb.org/anthology/P14-2006.
 Pradhan et al. (2012) Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. Joint Conference on EMNLP and CoNLL - Shared Task, Association for Computational Linguistics, chapter CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes, pages 1–40. http://aclweb.org/anthology/W12-4501.
 Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin A. Riedmiller. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014, pages 387–395. http://jmlr.org/proceedings/papers/v32/silver14.html.
 Sutton et al. (1999) Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour, et al. 1999. Policy gradient methods for reinforcement learning with function approximation. In NIPS. volume 99, pages 1057–1063.
 Vilain et al. (1995) Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Sixth Message Understanding Conference (MUC-6): Proceedings of a Conference Held in Columbia, Maryland, November 6–8, 1995. http://aclweb.org/anthology/M95-1005.
 Wellner and McCallum (2003) B Wellner and A McCallum. 2003. Towards conditional models of identity uncertainty with application to proper noun coreference. In IJCAI Workshop on Information Integration and the Web.
 Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3–4):229–256.
 Wiseman et al. (2015a) Sam Wiseman, Alexander M Rush, Stuart M Shieber, and Jason Weston. 2015a. Learning anaphoricity and antecedent ranking features for coreference resolution. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, volume 1, pages 92–100.
 Wiseman et al. (2016) Sam Wiseman, Alexander M. Rush, and Stuart M. Shieber. 2016. Learning global features for coreference resolution. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, pages 994–1004. https://doi.org/10.18653/v1/N16-1114.
 Wiseman et al. (2015b) Sam Wiseman, Alexander M. Rush, Stuart Shieber, and Jason Weston. 2015b. Learning anaphoricity and antecedent ranking features for coreference resolution. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, pages 1416–1426. https://doi.org/10.3115/v1/P15-1137.