Through the use of neural networks, performance on the task of coreference resolution has increased significantly over the last few years. Still, neural systems trained on the standard coreference dataset have issues with generalization, as shown by Moosavi and Strube (2018).
One way to improve the understanding of how a system overfits a dataset is to study the change in the system’s performance when the dataset is modified slightly in a focused and relevant manner. We take this approach by modifying the test set so that each PER and GPE (person and geopolitical entity) named entity is different from those seen in training. In other words, we ensure that there is no leakage of PER and GPE named entities from the training set into the test set. We demonstrate that the performance of the Lee et al. (2018) system, which is the current state-of-the-art, decreases when the named entities are replaced. An example of a replacement that causes the system to make an error is given in Table 1.
|Original: But Dirk Van Dongen , president of the National Association of Wholesaler - Distributors , said that last month ’s rise “ is n’t as bad an omen ” as the 0.9 % figure suggests . “ If you examine the data carefully , the increase is concentrated in energy and motor vehicle prices , rather than being a broad - based advance in the prices of consumer and industrial goods , ” he explained .|
|Replacement: Replace Dirk Van Dongen with Vendemiaire Van Korewdit.|
Motivated by these issues of generalization, this paper aims to improve the training process of neural coreference systems. Various regularization techniques have been proposed for improving the generalization capability of neural networks, including dropout (Srivastava et al., 2014) and adversarial training (Goodfellow et al., 2015; Miyato et al., 2017). The model of Lee et al. (2018), like most neural approaches, uses dropout. In this work, we apply the adversarial fast-gradient-sign-method (FGSM) described by Miyato et al. (2017) to the model of Lee et al. (2018), and show that this technique improves the model’s generalization even when applied on top of dropout.
The CoNLL-2012 Shared Task dataset (Pradhan et al., 2012) has been the standard dataset used for both training and evaluating English coreference systems since the dataset was introduced. The dataset includes seven genres that span multiple writing styles and multiple nationalities. We demonstrate that the system of Lee et al. (2018) retrained with adversarial training achieves state-of-the-art performance on the original CoNLL-2012 dataset (Pradhan et al., 2012) as well as the CoNLL-2012 dataset with changed named entities. Furthermore, the system trained with the adversarial method exhibits state-of-the-art performance on the GAP dataset (Webster et al., 2018), a recently released dataset focusing on resolving pronouns to people’s names in excerpts from Wikipedia. The code and other relevant files for this project can be found via https://cogcomp.org/page/publication_view/871.
2 Related Work
Moosavi and Strube (2017, 2018) also study generalization of neural coreference resolvers. However, they focus on transfer and indicate that the ranking of coreference resolvers (trained on the CoNLL training set) induced by their performance on the CoNLL test set is not preserved when the systems are evaluated on a different dataset. They use the Wikicoref dataset (Ghaddar and Langlais, 2016), which is limited in that it consists of only documents. They then show that the addition of features representing linguistic information improves the performance of a coreference resolver on the out-of-domain dataset.
The adversarial fast-gradient-sign-method (FGSM) was first introduced by Goodfellow et al. (2015) and was applied to sentence classification tasks through word embeddings by Miyato et al. (2017). Gradient-based adversarial attacks have since been used to train models for various NLP tasks, such as relation extraction (Wu et al., 2017) and joint entity and relation extraction (Bekoulis et al., 2018).
Our replacements of named entities can also be viewed as a way of generating adversarial examples for coreference systems; it is related to the earlier method proposed in Khashabi et al. (2016) in the context of question answering and to Alzantot et al. (2018), which provides a way of generating adversarial examples for simple classification tasks.
3 Adversarial Training for Coreference
In coreference resolution, the goal is to find and cluster phrases that refer to entities. We use the word “span” to mean a series of consecutive words. A span that refers to an entity is called a mention. If two mentions $m_i$ and $m_j$ refer to the same entity and mention $m_j$ occurs before mention $m_i$ in the text, we say that mention $m_j$ is an antecedent of mention $m_i$. For a given mention $m_i$, the candidate antecedents of $m_i$ are the mentions that occur before $m_i$ in the text.
In Figure 1, each line segment represents a mention and the arrows are directed from one mention to its possible antecedents.
We now review the model architecture of Lee et al. (2018) and describe how we apply the fast-gradient-sign-method (FGSM) of Miyato et al. (2017) to the model. Using GloVe (Pennington et al., 2014) and ELMo (Peters et al., 2018) embeddings of each word and using learned character embeddings, the model computes contextualized representations of each word in the input document using a bidirectional LSTM (Hochreiter and Schmidhuber, 1997). We write $x_t$ for the contextualized representation of the word at index $t$. For candidate span $i$, which consists of the words at indices $\mathrm{start}(i), \dots, \mathrm{end}(i)$, the model constructs a span representation $s_i$ by concatenating $x_{\mathrm{start}(i)}$, $x_{\mathrm{end}(i)}$, $\hat{x}_i = \sum_{t=\mathrm{start}(i)}^{\mathrm{end}(i)} a_{i,t}\, x_t$, and $\phi(i)$, where the $a_{i,t}$’s are learned scalar values and $\phi(i)$ is a learned embedding representing the width of the span (Lee et al., 2017). The span representations are then used as inputs to feedforward networks that compute mention scores for each span and that compute antecedent scores for pairs of spans. In Figure 1, the number associated with each arrow is the antecedent score for the associated pair of mentions. The coreference score for the pair of spans $(i, j)$ is the sum of the mention score for span $i$, the mention score for span $j$, and the antecedent score for $(i, j)$. For each span $i$, the antecedent span predicted by the model is the span $j$ that maximizes the antecedent score for $(i, j)$.

Let $S = \{s_i\}$ denote the set of the representations of all candidate spans. Let

$$L(S)$$

denote the original model’s loss function. (Note that the model’s predictions and the loss depend on the input text only through the span representations.) For each $s_i$, let $g_i = \nabla_{s_i} L(S)$ denote the gradient of the loss with respect to the span embeddings. Then the adversarial loss with the FGSM is

$$L_{\mathrm{adv}}(S) = L\left(\{s_i + \epsilon\, \mathrm{sign}(g_i)\}\right).$$

The total loss used in training is

$$L_{\mathrm{total}}(S) = \alpha\, L(S) + (1 - \alpha)\, L_{\mathrm{adv}}(S).$$
In our experiments, we find that and work well. A key difference between our method and that employed by Miyato et al. (2017) is that the latter applies the adversarial perturbation to the input embeddings, whereas we apply it to the span representations, which are an intermediate layer of the model. We found in our experiments that applying the FGSM to the character embeddings in the initial layer was not as effective as applying the method to the span representations as described above. Another difference between our method and that of Miyato et al. (2017) is that we do not normalize the span embeddings before applying the adversarial perturbations.
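The perturbation step described in this section can be illustrated with a minimal NumPy sketch. It is not the released implementation: the logistic loss over a linear scorer is a toy stand-in for the coreference loss, the convex combination of the clean and adversarial losses is an assumed form of the total loss, and the names `loss_and_grad` and `total_loss` as well as the values of `alpha` and `epsilon` are illustrative placeholders rather than the settings used in our experiments.

```python
import numpy as np


def loss_and_grad(S, w, y):
    """Toy stand-in for the model loss L(S).

    S: (n_spans, dim) span representations, w: (dim,) scorer weights,
    y: (n_spans,) 0/1 labels. Returns the mean logistic loss and its
    gradient with respect to S.
    """
    z = S @ w
    # log(1 + exp(z)) - y * z, computed stably with logaddexp.
    loss = np.mean(np.logaddexp(0.0, z) - y * z)
    p = 1.0 / (1.0 + np.exp(-z))
    grad = ((p - y) / len(y))[:, None] * w[None, :]
    return loss, grad


def total_loss(S, w, y, alpha, epsilon):
    """Mix the clean loss with the loss on FGSM-perturbed span representations."""
    clean, grad = loss_and_grad(S, w, y)
    # FGSM step on the span representations; the gradient is treated as a
    # constant and the representations are not normalized beforehand.
    S_adv = S + epsilon * np.sign(grad)
    adv, _ = loss_and_grad(S_adv, w, y)
    return alpha * clean + (1.0 - alpha) * adv, clean, adv


rng = np.random.default_rng(0)
S = rng.normal(size=(6, 4))
w = rng.normal(size=4)
y = np.array([1, 0, 1, 0, 1, 0])
total, clean, adv = total_loss(S, w, y, alpha=0.5, epsilon=0.1)
print(f"clean={clean:.4f} adversarial={adv:.4f} total={total:.4f}")
```

Because the toy loss is convex in the span representations, the perturbed loss is never smaller than the clean loss here. In an actual model the same step would be applied to the span representation tensor inside the training graph.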
4 No Leakage of Named Entities
Named entities are an important subset of the entities a coreference system is tasked with discovering. Agarwal et al. (2018) provide the percentages of clusters in the CoNLL dataset represented by the PER, ORG, GPE, and DATE named entity types – , , , and , respectively. It is important for generalization that systems perform well with names that are different from those seen in training. We found that in the CoNLL dataset, roughly of the PER and GPE named entities that are the head of a mention of some gold cluster in the test set are also the head of a mention of a gold cluster in the train set. Therefore, there is considerable overlap, or leakage, between the names in the train and test sets. In this section, we describe a method for evaluating on the CoNLL test set without leaked named entities.
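The leakage statistic can be made concrete with a short sketch. It assumes that the head names have already been extracted from the gold clusters of each split, that leakage is counted over distinct names, and that names are compared by exact string match; the function name `leakage_rate` and the example lists are hypothetical.

```python
def leakage_rate(train_names, test_names):
    """Fraction of distinct test-set names that also occur in the training set."""
    train = set(train_names)
    test = set(test_names)
    if not test:
        return 0.0
    return len(test & train) / len(test)


# Hypothetical head names of PER and GPE mentions in each split.
train_heads = ["Clinton", "Chicago", "Judy Muller", "Clinton"]
test_heads = ["Clinton", "Daniel Basse", "Chicago", "Machete"]

rate = leakage_rate(train_heads, test_heads)
print(f"{rate:.0%} of test names also occur in training")
```

Counting over mention occurrences instead of distinct names would weight frequent names more heavily; either choice fits the description above.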
We focus on PER and GPE named entities because they are two of the three most common entity types and because in general when replacing a PER or GPE name with another name, it is easy to not change the true coreference structure of the document. In particular, changing the name of an organization while ensuring that it is compatible with nominals in the cluster is nontrivial without a finer semantic typing.
| Original | After replacement |
|---|---|
| We asked Judy Muller if she would like to do the story of a fascinating man . She took a deep breath and said , okay . | We asked Sallie Kousonsavath if she would like to do the story of a fascinating man . She took a deep breath and said , okay . |
| The last thing President Clinton did today before heading to the Mideast is go to church – appropriate , perhaps , given the enormity of the task he and his national security team face in the days ahead . | The last thing President Golia did today before heading to the Mideast is go to church – appropriate , perhaps , given the enormity of the task he and his national security team face in the days ahead . |
| In theory at least , tight supplies next spring could leave the wheat futures market susceptible to a supply - demand squeeze , said Daniel Basse , a futures analyst with AgResource Co. in Chicago . | In theory at least , tight supplies next spring could leave the wheat futures market susceptible to a supply - demand squeeze , said Daniel Basse , a futures analyst with AgResource Co. in Machete . |
By contrast, we describe below how we control for gender and location type when replacing PER and GPE names, respectively. We also ensure that the capitalization of the first letter in the replacement name is the same as in the original text. Finally, we note that the diversity of PER and GPE entities exceeds that of other named entity types; this increases the importance of generalization to new names and, at the same time, enables us to find matching names to use as replacements. Table 2 provides examples of text in the original CoNLL-2012 dataset and the corresponding text after our modifications.
4.1 Replacing PER entities
For replacing PER entities, we utilize the publicly available list of last names from the 1990 U.S. Census and a gazetteer of first names that gives, for each name, the proportion of people with that name who are male. The gazetteer was collected in an unsupervised fashion from Wikipedia. We denote the list of last names by $L$, the list of male first names (i.e. first names with male proportion greater than or equal to a fixed threshold in the gazetteer) by $M$, and the list of female first names (i.e. first names with male proportion less than or equal to a fixed threshold in the gazetteer) by $F$. We remove all names occurring in training from $L$, $M$, and $F$.

We use the spaCy dependency parser (Honnibal and Johnson, 2015) to find the head of each mention. We say that a mention is a person-mention if the head of the mention is a PER named entity, and we say that the name of the person-mention is the PER named entity that is its head. We use the dependency parser and the gold NER to identify all of the person-mentions.

For each gold cluster containing a person-mention, we find the longest name among the names of all of the person-mentions in the cluster. If the longest name of a cluster has only one token, we assume that the name is a last name, and we replace the name with a name chosen uniformly at random from the remaining last names in $L$. Otherwise, if the longest name has multiple tokens, we say that the cluster is male if the cluster contains no female pronouns (“she”, “her”, “hers”) and one of the following is true: the first token does not appear in $M$ or $F$, the first token appears in $M$, or the cluster contains a male pronoun (“he”, “him”, “his”). We say that the cluster is female if it is not male. Then we (1) replace the last token with a name chosen uniformly at random from the remaining last names in $L$, and (2) replace the first token with a name chosen uniformly at random from the remaining first names in $M$ if the cluster is male or from the remaining first names in $F$ if the cluster is female.

Note that our sampling from each of $L$, $M$, and $F$ is without replacement, so no last name is used as a replacement more than once, no male first name is used more than once, and no female first name is used more than once.
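The replacement rule for a single cluster can be sketched as follows. This is an illustration under stated assumptions, not the released code: the function names, the `pools` dictionary, and the tiny name lists are hypothetical, the gazetteer lookups are passed as plain sets, and parsing, NER, and the rewriting of the document text are omitted.

```python
import random

FEMALE_PRONOUNS = {"she", "her", "hers"}
MALE_PRONOUNS = {"he", "him", "his"}


def is_male_cluster(first_token, cluster_tokens, male_names, female_names):
    """Gender rule from the text for a cluster whose longest name has several tokens."""
    tokens = {t.lower() for t in cluster_tokens}
    if tokens & FEMALE_PRONOUNS:
        return False
    unknown = first_token not in male_names and first_token not in female_names
    return unknown or first_token in male_names or bool(tokens & MALE_PRONOUNS)


def draw(pool, rng):
    """Remove and return one name chosen uniformly at random (no replacement)."""
    return pool.pop(rng.randrange(len(pool)))


def replace_person_name(longest_name, cluster_tokens, pools,
                        male_names, female_names, rng):
    """Return a replacement for the longest name of one person cluster."""
    parts = longest_name.split()
    if len(parts) == 1:
        # A single token is assumed to be a last name.
        return draw(pools["last"], rng)
    male = is_male_cluster(parts[0], cluster_tokens, male_names, female_names)
    parts[0] = draw(pools["male"] if male else pools["female"], rng)
    parts[-1] = draw(pools["last"], rng)
    return " ".join(parts)


rng = random.Random(0)
# Hypothetical pools, with names seen in training already removed.
pools = {"last": ["Kousonsavath", "Golia"],
         "male": ["Vendemiaire"],
         "female": ["Sallie"]}
male_names = {"Vendemiaire", "Dirk"}
female_names = {"Sallie", "Judy"}

new_name = replace_person_name("Judy Muller", ["Judy", "Muller", "she", "She"],
                               pools, male_names, female_names, rng)
print(new_name)
```

Popping from the pool is one simple way to guarantee that each replacement name is used at most once across the test set.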
4.2 Replacing GPE entities
Our approach to replacing GPE entity names is very similar to that used for PER names. We use the GeoNames database of geopolitical names (http://www.geonames.org/). In addition to providing a list of GPE names, this database also categorizes the names by the type of entity to which they refer (e.g. city, state, county, etc.). The data includes the names and categories of more than locations in the world. We restrict our attention to GPE entities that satisfy the following requirements: (1) they occur in the GeoNames database and (2) they are not countries. We say that a mention is a GPE-mention if its head (as given by the dependency parser) is a GPE named entity satisfying these requirements. (Again, we use the gold NER to identify GPE names in the CoNLL text.) We remove all GPE names occurring in the training set from the list of replacement GPE names for each location category. Then for each cluster containing a GPE-mention, we find the GeoNames category for the mention’s GPE name and replace the name with a randomly chosen name from the same category. As with PER names, we sample names from each category without replacement, so each GPE name is used for replacement at most once.
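A minimal sketch of the category-matched sampling is given below. It assumes a simplified lookup in which each name maps to exactly one category; the category labels, the function names `build_pools` and `replace_gpe`, and the example entries are hypothetical and do not reflect the actual GeoNames schema.

```python
import random


def build_pools(category_of, training_names):
    """Group candidate replacement names by category, excluding countries
    and any name that occurs in the training set."""
    pools = {}
    for name, category in category_of.items():
        if category != "country" and name not in training_names:
            pools.setdefault(category, []).append(name)
    return pools


def replace_gpe(name, category_of, pools, rng):
    """Return a same-category replacement drawn without replacement, or None
    if the name is excluded (unknown, a country, or no candidates remain)."""
    category = category_of.get(name)
    if category is None or category == "country":
        return None
    pool = pools.get(category, [])
    if not pool:
        return None
    return pool.pop(rng.randrange(len(pool)))


# Hypothetical name-to-category lookup standing in for the database.
category_of = {"Chicago": "city", "Machete": "city", "Springfield": "city",
               "France": "country", "Texas": "state", "Ohio": "state"}
pools = build_pools(category_of, training_names={"Chicago", "Texas"})

rng = random.Random(0)
new_city = replace_gpe("Chicago", category_of, pools, rng)
print(new_city)
```

Returning `None` for excluded names leaves the original text untouched in those cases, which matches the restriction to non-country names found in the database.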
5 Experiments
We trained the Lee et al. (2018) model architecture with the adversarial approach on the CoNLL training set for the same number of iterations as the original model, with the same training hyperparameters used by the original model. For comparing with the Lee et al. (2017) and Lee et al. (2018) systems, we use the pretrained models released by the authors (available at https://lil.cs.washington.edu/coref/final.tgz and https://lil.cs.washington.edu/coref/final.tgz).
The datasets used for evaluation are the CoNLL and GAP datasets.
5.1 CoNLL Dataset
| System | Original (CoNLL F1) | No Leakage (CoNLL F1) |
|---|---|---|
| Lee et al. (2018) | 72.96 | 71.84 |
Table 3 shows the performance on the CoNLL test set, as measured by CoNLL F1, of the Lee et al. (2018) system with and without our adversarial training approach. (Please note that the small differences between the No Leakage results here and those in the version of this paper in the ACL Anthology are due to a small mistake in our preprocessing pipeline, which we have fixed since publication.) The replacement of PER and GPE entities decreased the performance of the original system by more than 1 F1 (from 72.96 to 71.84).
| System | Male F1 | Female F1 | Overall F1 |
|---|---|---|---|
| Lee et al. (2017) | 68.7 | 60.0 | 64.5 |
| Lee et al. (2018) | 75.8 | 70.6 | 73.3 |
5.2 GAP Dataset
The GAP dataset (Webster et al., 2018) focuses on resolving pronouns to named people in excerpts from Wikipedia. The dataset, which is gender-balanced, consists of examples in which the system must determine whether a given pronoun refers to one, both, or neither of two given names. Thus, the task can be viewed as a binary classification task in which the input is a (pronoun, name) pair and the output is True if the pair is coreferent and False otherwise. Performance is evaluated using the F1 score in this binary classification setup. Table 4 shows the performance on the GAP test set of the Lee et al. (2017) and Lee et al. (2018) systems as well as the system trained with our adversarial method. (The results that we report for the Lee et al. (2017) system differ slightly from those reported in Table 10 of Webster et al. (2018) due to a difference in the parser and potentially small differences in the algorithm for converting the system’s output to the binary predictions necessary for the GAP scorer.) The adversarially trained system performs significantly better over the entire dataset in comparison to the previous systems, and the difference is consistent between genders. In particular, we observe that the bias (i.e. ratio of female to male F1 score) is roughly the same (about 0.93) for the Lee et al. (2018) system with and without adversarial training and that this bias is better (i.e. the ratio is closer to 1) than that exhibited by the Lee et al. (2017) system (about 0.87).
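The evaluation quantities used here can be sketched in a few lines. The sketch assumes that each (pronoun, name) pair has already been reduced to a boolean gold label and a boolean prediction, and that the first two score columns of Table 4 are the male and female F1 scores; the function names and the toy labels are hypothetical, and this is not the official GAP scorer.

```python
def f1_score(gold, pred):
    """Binary F1 over (pronoun, name) pairs; True means 'coreferent'."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bias(female_f1, male_f1):
    """Ratio of female to male F1; values closer to 1 indicate less bias."""
    return female_f1 / male_f1


# Toy labels for four (pronoun, name) pairs.
gold = [True, False, True, True]
pred = [True, True, False, True]
print(f"F1 = {f1_score(gold, pred):.3f}")

# Ratios computed from the Table 4 scores, under the column assumption above.
print(f"Lee et al. (2017) bias = {bias(60.0, 68.7):.2f}")
print(f"Lee et al. (2018) bias = {bias(70.6, 75.8):.2f}")
```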
6 Conclusion
We show that the performance of the Lee et al. (2018) system decreases when the names of PER and GPE entities are changed in the CoNLL test set so that no names from the training set leak to the test set. We then retrain the same system using an application of the fast-gradient-sign-method (FGSM) of adversarial training, showing that the retrained system consistently performs better on the original CoNLL test set, the CoNLL test set with No Leakage, and the GAP test set. Our model sets a new state of the art on all of these datasets.
We thank Sihao Chen for providing a gazetteer of first names collected from Wikipedia with scores for their gender likelihood, and the anonymous reviewers for their comments. This work was supported in part by contract HR0011-18-2-0052 with the US Defense Advanced Research Projects Agency (DARPA). The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.
References
- Agarwal et al. (2018). Named person coreference in English news. arXiv preprint arXiv:1810.11476.
- Alzantot et al. (2018). Generating natural language adversarial examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2890–2896.
- Bekoulis et al. (2018). Adversarial training for multi-context joint entity and relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2830–2836.
- Chang et al. (2013). A constrained latent variable model for coreference resolution. In EMNLP.
- Ghaddar and Langlais (2016). WikiCoref: an English coreference-annotated corpus of Wikipedia articles. In LREC.
- Goodfellow et al. (2015). Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- Honnibal and Johnson (2015). An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1373–1378.
- Khashabi et al. (2016). Question answering via integer programming over semi-structured knowledge. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI).
- Lee et al. (2017). End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 188–197.
- Lee et al. (2018). Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 687–692.
- McNemar (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12 (2), pp. 153–157.
- Miyato et al. (2017). Adversarial training methods for semi-supervised text classification. In ICLR.
- Moosavi and Strube (2017). Lexical features in coreference resolution: to be used with caution. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 14–19.
- Moosavi and Strube (2018). Using linguistic features to improve the generalization capability of neural coreference resolvers. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 193–203.
- Noreen (1989). Computer-intensive methods for testing hypotheses. Wiley, New York.
- Pennington et al. (2014). GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- Peters et al. (2018). Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237.
- Pradhan et al. (2012). CoNLL-2012 shared task: modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL-Shared Task, pp. 1–40.
- Srivastava et al. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- Webster et al. (2018). Mind the GAP: a balanced corpus of gendered ambiguous pronouns. In Transactions of the ACL, to appear.
- Wu et al. (2017). Adversarial training for relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1778–1783.