As an important phenomenon in human language, coreference brings conciseness to human communication while posing a serious challenge for machines, especially for pronouns, which are hard to interpret owing to their weak semantic meanings Ehrlich (1981). As one challenging yet vital subtask of general coreference resolution, pronoun coreference resolution Hobbs (1978) aims to find the correct reference for a given pronominal anaphor in context, and has shown its importance in many natural language processing (NLP) tasks, such as machine translation Mitkov et al. (1995), dialog systems Strube and Müller (2003), information extraction Edens et al. (2003), and summarization Steinberger et al. (2007).
| | Example A | Example B |
|---|---|---|
| Sentence | The apple on the table looks great and I want to eat it. | Yesterday, the patient took the CT scan in the hospital and it showed that she had recovered. |
| Answer | The apple | the CT scan |
| Knowledge | We can eat apples but we cannot eat a table. | A ‘test’ shows results to patients; ‘the CT scan’ is a medical test. |
In general, resolving pronoun coreferences requires intensive knowledge support. As shown in Table 1, answering the first question requires knowledge of which object can be eaten (apple vs. table), while the second question requires the knowledge that the CT scan is a test (not the hospital) and that only tests can show something. Previously, rule-based Hobbs (1978); Nasukawa (1994); Mitkov (1998); Zhang et al. (2019a) and feature-based Ng (2005); Charniak and Elsner (2009); Li et al. (2011) supervised models were proposed to integrate knowledge into this task. However, while these traditional methods can easily incorporate external knowledge, no effective representation learning model was available to handle such complex knowledge. Later, end-to-end solutions with neural models Lee et al. (2017, 2018) achieved good performance on the general coreference resolution task. Although such algorithms can effectively incorporate contextual information from large-scale external unlabeled data, they are insufficient for encoding existing complex knowledge into the representation, and thus cannot cover all the knowledge one needs to build a successful pronoun coreference system. In addition, overfitting is frequently observed in deep models, so their performance is limited in cross-domain scenarios, which restricts their usage in real applications Liu et al. (2018, 2019). Recently, a joint model Zhang et al. (2019b) was proposed to connect contextual information and human-designed features for the pronoun coreference resolution task (with gold mention support) and achieved state-of-the-art performance. However, their model still requires complex features designed by experts, which are expensive and difficult to acquire, as well as the support of gold mentions.
To address the limitations of the aforementioned models, in this paper, we propose a novel end-to-end model that learns to resolve pronoun coreferences with general knowledge graphs (KGs). Different from conventional approaches, our model does not require featurized knowledge. Instead, we directly encode knowledge triplets, the most common format of modern knowledge graphs, into our model. In doing so, the learned model can be easily applied across different knowledge types as well as domains with the adopted KG. Moreover, to address the knowledge matching issue, we propose a knowledge attention module, which learns to select the most related and helpful knowledge triplets according to different contexts. Experiments conducted on general (news) and in-domain (medical) cases show that the proposed model outperforms all baseline models by a great margin. Additional experiments in the cross-domain setting further illustrate the validity and effectiveness of our model in leveraging knowledge smartly rather than fitting limited training data (all code and data are available at: https://github.com/HKUST-KnowComp/Pronoun-Coref-KG). To summarize, this paper makes the following contributions:
We explore how to resolve pronoun coreferences with KGs; the resulting model outperforms all existing models by a large margin on datasets from two different domains.
We propose a knowledge attention module, which helps to select the most related and helpful knowledge from different KGs.
We evaluate the performance of different pronoun coreference models in a cross-domain setting and show that our model has better generalization ability than state-of-the-art baselines.
2 The Task
Given a text $\mathcal{T}$ containing a pronoun $p$, the goal is to identify all the mentions that $p$ refers to. We denote a correct mention that $p$ refers to as $m \in \mathcal{M}$, where $\mathcal{M}$ is the correct mention set. Similarly, each candidate span is denoted as $c \in \mathcal{C}$, where $\mathcal{C}$ is the set of all candidate spans. Note that in the case where no gold mentions are annotated, all possible spans in $\mathcal{T}$ are used to form $\mathcal{C}$. To exploit knowledge, we denote the knowledge set as $\mathcal{K}$, instantiated by multiple knowledge triplets (each triplet contains a head, a tail, and a relation from the head to the tail). The task is thus to identify $\mathcal{M}$ out of $\mathcal{C}$ with the support of $\mathcal{K}$. Formally, it optimizes

$$\mathcal{J} = \frac{\sum_{c \in \mathcal{C}} \mathbb{1}(c \in \mathcal{M}) \, e^{F(c, p, \mathcal{T}, \mathcal{K})}}{\sum_{c' \in \mathcal{C}} e^{F(c', p, \mathcal{T}, \mathcal{K})}},$$
where $F(\cdot)$ is the overall scoring function of $c$ referring to $p$ in $\mathcal{T}$ with $\mathcal{K}$ (we omit $\mathcal{T}$ and $\mathcal{K}$ in the rest of this paper for simplicity). The details of $F$ are illustrated in the following section.
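The objective above can be illustrated numerically. The following is a minimal NumPy sketch, not the paper's implementation; the function name and toy values are ours:

```python
import numpy as np

def coref_objective(scores, is_correct):
    """Probability mass that the softmax over F(c, p) assigns to the
    correct mentions; training maximizes this quantity.

    scores     : array of F(c, p) values, one per candidate span c in C
    is_correct : boolean mask, True where c belongs to the gold set M
    """
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp[is_correct].sum() / exp.sum()

# Toy run: four candidates, the 2nd and 4th are the correct references.
scores = np.array([1.0, 3.0, 0.5, 2.5])
gold = np.array([False, True, False, True])
obj = coref_objective(scores, gold)
assert 0.0 < obj < 1.0
```

Raising the score of a correct candidate increases the objective, which is what drives the scoring function $F$ during training.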
The overall framework of our model is shown in Figure 1. It consists of several layers. At the bottom, we encode all mention spans ($c$) and pronouns ($p$) into embeddings so as to incorporate contextual information. In the middle layer, for each pair ($c$, $p$), we use their embeddings to select the most helpful knowledge triplets from $\mathcal{K}$ and generate the knowledge representations of $c$ and $p$. At the top layer, we concatenate the textual and knowledge representations as the final representation of each $c$ and $p$, and then use this representation to predict whether there exists a coreference relation between them.
3.1 Span Representation
Contextual information is crucial for distinguishing the semantics of a word or phrase, especially for text representation learning Song et al. (2018); Song and Shi (2018). In this work, a standard bidirectional LSTM (BiLSTM) Hochreiter and Schmidhuber (1997) model is used to encode each span with attentions Bahdanau et al. (2014), similar to the one used in Lee et al. (2017). The structure is shown in Figure 2. Let the initial word embeddings in a span $i$ be denoted as $x_1, \dots, x_T$ and their encoded representations be $x^*_1, \dots, x^*_T$. The weighted embedding $\hat{x}_i$ of each span is obtained by

$$\hat{x}_i = \sum_{t=1}^{T} a_t \cdot x_t,$$

where $a_t$ is the inner-span attention computed by

$$a_t = \frac{e^{\alpha_t}}{\sum_{k=1}^{T} e^{\alpha_k}}, \qquad \alpha_t = \text{NN}_\alpha(x^*_t),$$

where $\text{NN}_\alpha$ is a standard feed-forward neural network (we use $\text{NN}$ to denote feed-forward neural networks throughout). Finally, the starting ($x^*_{start}$) and ending ($x^*_{end}$) embeddings of each span are concatenated with the weighted embedding ($\hat{x}_i$) and the length feature ($\phi(i)$) to form its final representation $e$:

$$e_i = [x^*_{start}, x^*_{end}, \hat{x}_i, \phi(i)].$$
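The span encoder described above can be sketched as follows. This is an illustrative NumPy version: the BiLSTM states are passed in as a pre-computed array, a single linear scorer stands in for $\text{NN}_\alpha$, and all variable names are our own:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def span_representation(word_embs, bilstm_out, attn_w, length_feat):
    """Build [start state; end state; attention-weighted embeddings; length].

    word_embs  : (T, d) initial word embeddings x_1..x_T
    bilstm_out : (T, h) encoded representations x*_1..x*_T
    attn_w     : (h,)   stand-in for the NN_alpha scorer
    """
    alpha = bilstm_out @ attn_w                      # one score per token
    a = softmax(alpha)                               # inner-span attention a_t
    weighted = (a[:, None] * word_embs).sum(axis=0)  # weighted embedding
    return np.concatenate([bilstm_out[0], bilstm_out[-1], weighted, length_feat])

rep = span_representation(np.ones((3, 4)), np.ones((3, 5)),
                          np.zeros(5), np.array([3.0]))
assert rep.shape == (15,)  # h + h + d + 1
```

With a zero scorer, the attention is uniform, so the weighted part is simply the mean of the word embeddings.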
3.2 Knowledge Representation
For each candidate span $c$ and the target pronoun $p$, different knowledge from a KG can be extracted with various methods. For simplicity and generalization, we use string matching in our model for knowledge extraction. Specifically, for each triplet, whose head and tail are both lists of words, if its head is the same as the string of $c$, we consider it a related triplet. We then encode the information of the triplet by averaging the embeddings of all words in its tail. For example, if $c$ is ‘the apple’ and the knowledge triplet (‘the apple’, IsA, ‘healthy food’) is found by searching the KG, we represent this relation by the averaged embeddings of ‘healthy’ and ‘food’. Consequently, for $c$ and $p$, we denote their retrieved knowledge sets as $\mathcal{K}_c$ and $\mathcal{K}_p$, respectively, where $\mathcal{K}_c$ contains $n_c$ related knowledge embeddings $k_{c,1}, k_{c,2}, \dots, k_{c,n_c}$ and $\mathcal{K}_p$ contains $n_p$ of them, $k_{p,1}, k_{p,2}, \dots, k_{p,n_p}$.
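String-match retrieval plus tail averaging can be sketched as below; the toy vocabulary and vectors are invented purely for illustration:

```python
import numpy as np

def retrieve_knowledge(span_text, triplets, word_vec):
    """Return one embedding per triplet whose head string-matches the span,
    computed as the average embedding of the tail words."""
    embeddings = []
    for head, _relation, tail in triplets:
        if head == span_text:
            vecs = [word_vec[w] for w in tail.split() if w in word_vec]
            if vecs:
                embeddings.append(np.mean(vecs, axis=0))
    return embeddings

word_vec = {"healthy": np.array([1.0, 1.0]), "food": np.array([3.0, 1.0])}
triplets = [("the apple", "IsA", "healthy food"), ("table", "IsA", "furniture")]
K_c = retrieve_knowledge("the apple", triplets, word_vec)
```

Here the single matching triplet yields the average of the ‘healthy’ and ‘food’ vectors, mirroring the apple example in the text.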
To incorporate the aforementioned knowledge embeddings into our model, we face the challenge that there is a huge number of such embeddings while most of them are useless in a given context. To solve this, a knowledge attention module is proposed to select the appropriate knowledge.
For each pair ($c$, $p$), as shown in Figure 3, we first concatenate $e_c$ and $e_p$ to get the overall (span, pronoun) representation $e_{cp}$, which is used to select knowledge for both $c$ and $p$. Taking the computation for $c$ as an example, we compute the weight of each $k_{c,i}$ by

$$w_{c,i} = \frac{e^{\beta_{c,i}}}{\sum_{j=1}^{n_c} e^{\beta_{c,j}}},$$

where $\beta_{c,i} = \text{NN}_k([e_{cp}, k_{c,i}])$. As a result, the knowledge of $c$ is summed by

$$k^*_c = \sum_{i=1}^{n_c} w_{c,i} \cdot k_{c,i}$$

to represent the overall knowledge for $c$. A similar process is conducted for $p$, yielding its knowledge representation $k^*_p$.
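The attention step above can be sketched as follows; a single linear scorer stands in for $\text{NN}_k$, and all names here are our own:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def knowledge_attention(e_c, e_p, K_c, score_w):
    """Weight each candidate triplet embedding k_{c,i} by how well it fits
    the (span, pronoun) pair, then sum to get the overall knowledge k*_c."""
    e_cp = np.concatenate([e_c, e_p])                      # pair representation
    beta = np.array([np.concatenate([e_cp, k]) @ score_w   # stand-in for NN_k
                     for k in K_c])
    w = softmax(beta)                                      # attention weights
    return (w[:, None] * np.stack(K_c)).sum(axis=0)        # k*_c

k_star = knowledge_attention(np.zeros(2), np.zeros(2),
                             [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
                             np.zeros(6))
```

With a zero scorer, all triplets receive equal weight, so the result is the mean of the knowledge embeddings; a trained scorer would concentrate the weight on context-appropriate triplets.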
The final score of each pair ($c$, $p$) is computed by

$$F(c, p) = F_m(c) + F_c(c, p),$$

where $F_m$ is the scoring function for $c$ being a valid mention and $F_c$ is the scoring function identifying whether there exists a coreference relation from $c$ to $p$:

$$F_c(c, p) = \text{NN}_c([e^f_c, e^f_p, e^f_c \odot e^f_p]),$$

with $e^f_c = [e_c, k^*_c]$ and $e^f_p = [e_p, k^*_p]$ the final representations, and $\odot$ denoting element-wise multiplication.
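As a sketch of this scoring step, with a plain linear layer standing in for the feed-forward scorer and all names our own:

```python
import numpy as np

def final_score(f_m, e_c_final, e_p_final, w):
    """F(c, p) = F_m(c) + F_c(c, p): the pair score adds a mention score to a
    coreference score over [e_c; e_p; e_c * e_p] (element-wise product)."""
    features = np.concatenate([e_c_final, e_p_final, e_c_final * e_p_final])
    return f_m + features @ w

score = final_score(1.0, np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.ones(6))
```

The element-wise product term lets the scorer pick up on dimensions where the two knowledge-augmented representations agree.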
After getting the coreference scores for all mention spans, we adopt a softmax selection over the most confident candidates for the final prediction, formulated as

$$P(c) = \frac{e^{F(c, p)}}{\sum_{c' \in \mathcal{C}} e^{F(c', p)}},$$

where candidates with $P(c)$ higher than a threshold $t$ are selected.
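A minimal sketch of this selection step (function name and toy scores are ours):

```python
import numpy as np

def select_references(scores, threshold):
    """Softmax over F(c, p) scores; keep candidates whose probability
    exceeds the threshold t."""
    z = np.asarray(scores, dtype=float)
    exp = np.exp(z - z.max())
    probs = exp / exp.sum()
    return [i for i, p in enumerate(probs) if p > threshold]

# One clearly dominant candidate survives a threshold of 0.5.
assert select_references([0.0, 0.0, 10.0], 0.5) == [2]
```

Because the selection is threshold-based rather than argmax-based, a pronoun can be linked to more than one correct reference, consistent with the 1.3–1.4 average references per pronoun reported later.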
This section describes our experiments.
Two datasets from two different domains are used in our experiments:

CoNLL: The CoNLL-2012 shared task dataset Pradhan et al. (2012), a widely used general-domain coreference corpus built on OntoNotes.

i2b2: The i2b2 shared task dataset Uzuner et al. (2012), consisting of electronic medical records from two organizations, namely Partners HealthCare (Part) and Beth Israel Deaconess Medical Center (Beth). All records have been fully de-identified and manually annotated with coreferences.
We split the datasets into different proportions based on their original settings. Three types of pronouns are considered in this paper, following Ng (2005): third-person personal pronouns (e.g., she, her, he, him, them, they, it), possessive pronouns (e.g., his, hers, its, their, theirs), and demonstrative pronouns (e.g., this, that, these, those). Table 2 reports the number of the three types of pronouns and the overall statistics of the experimental datasets with their splits. Following conventional approaches Ng (2005); Li et al. (2011), for each pronoun we consider candidate mentions from the previous two sentences and the current sentence it belongs to. Within this selection range, each pronoun in the CoNLL and i2b2 data has on average 1.3 and 1.4 correct references, respectively.
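The candidate selection window described above amounts to the following (a sketch; the function name is ours):

```python
def candidate_window(sentences, pronoun_sent_idx):
    """Return the sentences from which candidate mentions are drawn:
    the pronoun's own sentence plus the previous two."""
    start = max(0, pronoun_sent_idx - 2)
    return sentences[start:pronoun_sent_idx + 1]

# A pronoun in the 4th sentence draws candidates from sentences 2-4.
assert candidate_window(["s0", "s1", "s2", "s3"], 3) == ["s1", "s2", "s3"]
```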
4.2 Knowledge Resources
As mentioned in previous sections, our model is designed to leverage general KGs, taking triplets as the input knowledge representation. We format all knowledge resources as triplets and merge them together to obtain the final knowledge set. The individual resources are introduced as follows.
Commonsense knowledge graph (OMCS). We use the largest commonsense knowledge base, Open Mind Common Sense (OMCS) Singh (2002), in this paper. OMCS contains 600K crowd-sourced commonsense triplets such as (food, UsedFor, eat) and (wind, CapableOf, blow to east). All relations in OMCS are human-defined, and we select the highly confident triplets (those whose confidence score exceeds a preset threshold) to form the OMCS KG, with 62,730 triplets.
Medical concepts (Medical-KG). As part of the i2b2 contest, related knowledge about medical concepts, such as (the CT scan, is, test) and (intravenous fluids, is, treatment), is provided. The annotated triplets are used as the medical concept KG, which contains 22,234 triplets.
Linguistic features (Ling). In addition to manually annotated KGs, we also consider linguistic features, i.e., plurality and animacy & gender (AG), as an important knowledge resource. The Stanford parser (https://stanfordnlp.github.io/CoreNLP/) is employed to generate plurality, animacy, and gender markups for all noun phrases, so as to automatically generate linguistic knowledge (in the form of triplets) for our data. Specifically, the plurality feature denotes whether each $c$ and $p$ is singular or plural. The animacy & gender (AG) feature denotes whether the $c$ or $p$ is a living object, and whether it is male, female, or neutral if so. For example, the mention ‘the girls’ is labeled as plural and female; we use the triplets (‘the girls’, plurality, Plural) and (‘the girls’, AG, female) to represent this. As a result, we have 40,149 and 40,462 triplets for plurality and AG, respectively.
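Converting the parser markups into Ling triplets can be sketched as follows; the schema mirrors the paper's examples, but the function and its handling of inanimate mentions are our own illustration:

```python
def linguistic_triplets(mention, plurality, is_animate, gender):
    """Build the two Ling triplets for a mention from parser markups:
    one for plurality, one for animacy & gender (AG)."""
    ag = gender if is_animate else "neutral"  # inanimate mentions get 'neutral'
    return [(mention, "plurality", plurality), (mention, "AG", ag)]

# 'the girls' is plural, animate, and female, matching the example in the text.
assert linguistic_triplets("the girls", "Plural", True, "female") == [
    ("the girls", "plurality", "Plural"), ("the girls", "AG", "female")]
```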
Selectional preference (SP). Selectional preference Hobbs (1978) knowledge, the semantic constraint on word usage, is employed as the last knowledge resource. SP generally refers to the fact that, given a predicate (e.g., a verb), people have preferences for the arguments (e.g., its object or subject) connected to it. To collect SP knowledge, we first parse the English Wikipedia (https://dumps.wikimedia.org/enwiki/) with the Stanford parser and extract all dependency edges in the format (predicate, argument, relation, number), where the predicate is the governor and the argument the dependent in each dependency edge (in the Stanford parser, an ‘nsubj’ edge is created between the predicative and the subject when a verb is a linking verb (e.g., am, is); the predicative is thus treated as the predicate for the subject (argument) in this paper). Following Resnik (1997), each potential SP pair is measured by a posterior probability

$$P(w_a \mid w_p, r) = \frac{\#(w_p, w_a, r)}{\#(w_p, r)},$$

where $\#(w_p, r)$ and $\#(w_p, w_a, r)$ refer to how many times $w_p$ and the predicate-argument pair ($w_p$, $w_a$) appear in the relation $r$, respectively. In our experiment, if both $\#(w_p, w_a, r)$ and $P(w_a \mid w_p, r)$ exceed preset thresholds, we consider the triplet ($w_a$, $r$, $w_p$) (e.g., (‘dog’, nsubj, ‘barks’)) a valid SP relation. Finally, we select two SP relations, nsubj and dobj, to form the SP knowledge graph, including 17,074 and 4,536 frequent predicate-argument pairs for nsubj and dobj, respectively.
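The posterior estimate above reduces to simple counting over the extracted edges. A sketch with invented toy data:

```python
from collections import Counter

def sp_probability(edges, predicate, argument, relation):
    """P(w_a | w_p, r) = #(w_p, w_a, r) / #(w_p, r), estimated from a list
    of (predicate, argument, relation) dependency edges."""
    pair_counts = Counter(edges)
    pred_counts = Counter((p, r) for p, _a, r in edges)
    return pair_counts[(predicate, argument, relation)] / pred_counts[(predicate, relation)]

# 'eat' takes 'apple' as direct object in 2 of 3 observed dobj edges.
edges = [("eat", "apple", "dobj")] * 2 + [("eat", "table", "dobj")]
p = sp_probability(edges, "eat", "apple", "dobj")
```

This count-based estimate is what makes the apple-vs-table example of Table 1 resolvable: ‘apple’ has a much higher posterior than ‘table’ as the object of ‘eat’.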
Several baselines are compared in this paper, including three widely used pre-trained models:
Deterministic model Raghunathan et al. (2010), which is an unsupervised model and leverages manual rules to detect coreferences.
Statistical model Clark and Manning (2015), which is a supervised model and trained on manually crafted entity-level features between clusters and mentions.
Neural model Clark and Manning (2016), which is a supervised model that ranks mentions with deep reinforcement learning.

The above models are all included in the Stanford CoreNLP toolkit (https://stanfordnlp.github.io/CoreNLP/coref.html). We also include a state-of-the-art end-to-end neural model as one of our baselines:
End2end, the end-to-end neural coreference model Lee et al. (2017), for which we use the released code (https://github.com/kentonl/e2e-coref). In addition, to show the importance of incorporating knowledge, we also experiment with two variations of our model:
Without KG removes the KG component and keeps all other components in the same setting as that in our complete model.
Without Attention removes the knowledge attention module and concatenates all the knowledge embeddings. All other components are identical to our complete model.
GloVe Pennington et al. (2014) embeddings are used as the initial word representations for computing span representations. For knowledge triplets, we use the GloVe embeddings to encode the tail words in them. Out-of-vocabulary words are initialized with zero vectors. The hidden state size of the LSTM module is set to 200, and all feed-forward networks have two 150-dimension hidden layers. The selection threshold $t$ is set to different values for the CoNLL and i2b2 datasets, respectively (see Section 5.2).
For model training, we use cross-entropy as the loss function and Adam Kingma and Ba (2014) as the optimizer. All model parameters are initialized randomly, and we apply a dropout rate of 0.2 to all hidden layers. For the CoNLL dataset, training is performed for up to 100 epochs, and the best model is selected based on its performance on the development set. For the i2b2 dataset, because no development set is provided, we train the model for up to 100 epochs and use the final converged one.
Table 3 reports the performance of all models, with the results for CoNLL and i2b2 in (a) and (b), respectively. Overall, our model outperforms all baselines on both datasets with respect to all pronoun types. There are several interesting observations. In general, the i2b2 dataset seems simpler than the CoNLL dataset, which might be because i2b2 only involves clinical narratives and its training data is highly similar to its test data. As a result, all neural models perform remarkably well on it, especially on third-person personal and possessive pronouns. In addition, we notice that it is more challenging for all models to resolve demonstrative pronouns (e.g., this, that) on both datasets, because such pronouns may refer to complex things and occur with low frequency.
| Model | CoNLL (F1) | i2b2 (F1) |
|---|---|---|
| The Complete Model | 75.7 | 95.2 |
Moreover, there are significant gaps in the performance of different models, with the following observations. First, models with manually defined rules or features, which cannot cover rich contextual information, perform poorly. In contrast, deep learning models (e.g., End2end and our proposed model), which leverage text representations of the context, outperform other approaches by a great margin, especially on recall. Second, adding knowledge to neural models in an appropriate manner is helpful, as supported by the fact that our model outperforms both the End2end model and the Without KG variant on both datasets, especially CoNLL, where external knowledge plays a more important role. Third, the knowledge attention module enables our model to predict more precisely, which also results in an overall improvement in F1. To summarize, the results suggest that external knowledge is important for effectively resolving pronoun coreference, and that rich contextual information helps determine the appropriate knowledge through a well-designed module.
Further analysis is conducted in this section regarding the effect of different knowledge resources, model components, and settings, as detailed below.
5.1 Ablation Study
We ablate the different knowledge resources to assess their contributions to our model, with results reported in Table 4. All knowledge resources contribute to the final success of our model, and different knowledge types play unique roles on different datasets. For example, the Ling knowledge contributes the most on the CoNLL dataset, while the medical knowledge is the most important one for the medical data.
5.2 Effect of the Selection Threshold
We experiment with different thresholds $t$ for the softmax selection. The effect of $t$ on overall performance is shown in Figure 4. In general, as $t$ increases, fewer candidates are selected; therefore, the overall precision increases and the recall drops. Considering that both precision and recall are important for resolving pronoun coreference, we select different thresholds for different datasets to balance the two. In detail, for the CoNLL dataset, we set a higher $t$ to select only the most confident predictions, and for the i2b2 dataset, we set a lower $t$ so as to keep more predictions.
5.3 Effect of Gold Mentions
| Model | Setting | CoNLL (F1) | i2b2 (F1) |
|---|---|---|---|
| End2end | + Gold mention | 77.8 | 94.4 |
| Our model | + Gold mention | 80.7 | 96.0 |
| Zhang et al. (2019b) | + Gold mention | 79.9 | - |
The effect of adding gold mentions is shown in Table 5. Providing gold mentions to the End2end model significantly boosts its performance, by 6.2 F1 and 2.1 F1 on the CoNLL and i2b2 datasets, respectively. Yet the performance gain from gold mentions is smaller for our model. These results illustrate that our model's mention detection already benefits from the incorporated KGs. Moreover, with the help of gold mentions, our model achieves comparable (slightly better) performance to the context-and-knowledge model of Zhang et al. (2019b). As their features were originally designed for CoNLL, we only report performance on CoNLL in Table 5. Since we also include one new challenging pronoun type, the demonstrative pronoun, the overall performance of their model is lower than that reported in the original paper. The reason our model performs better is that more knowledge resources (e.g., OMCS) can be incorporated into it owing to its generalizable design. Moreover, it is more difficult for the method of Zhang et al. (2019b) to incorporate mention detection, because one would need to enumerate all mention spans and generate the corresponding features for all of them, which is expensive and difficult to acquire.
5.4 Cross-domain Evaluation
| Model | Training data | Test data |
Considering that neural models are intensively data-driven and normally restricted by the nature of their training data, they are not easily applied in cross-domain settings. However, a model intended for real applications has to show promising performance on cases outside its training data. Herein we investigate the performance of different models trained and tested on different domains, with results reported in Table 6. Overall, all models perform significantly worse when used across domains. Specifically, if we train the End2end model on the CoNLL dataset and test it on the i2b2 dataset, it only achieves 75.2 F1; in comparison, our model achieves 80.9 F1 in the same setting. This observation confirms that our model is able to leverage knowledge that remains useful across domains. A similar observation holds for the reversed case. However, even though our model outperforms the End2end model by 22.7 F1 when transferring from i2b2 to CoNLL, its overall performance is still poor, which might be explained by the fact that i2b2 is an in-domain dataset and the knowledge contained in its training data is rarely useful for the general (news) domain. Nevertheless, this experiment clearly shows that generalization ability is still crucial for building a successful coreference model, and that learning to use knowledge is a promising solution.
6 Case Study
To better illustrate the effectiveness of incorporating different knowledge into this task, two examples are provided for the case study in Table 7. In example A, our model correctly predicts that ‘it’ refers to the ‘magazine’ rather than the ‘room’, because we successfully retrieve the knowledge that, compared with the ‘room’, the ‘magazine’ is more likely to be the object of drop. In example B, even though the distance between ‘erythema’ and ‘This’ is relatively large (we omit the intermediate part of the long sentence in the table for clear presentation), our model is able to determine the coreference relation between them because it successfully finds that ‘erythema’ is a kind of disease, while many diseases appear in the context of ‘be treated’ in the training data.
7 Related Work
Detecting mention spans in linguistic expressions and identifying coreference relations among them is a core task for natural language understanding, namely, coreference resolution. Mention detection and coreference prediction are the two major focuses of the task, as listed in Lee et al. (2017). Compared to the general coreference problem, pronoun coreference resolution has its unique challenge: pronouns themselves have weak semantic meanings, which makes it the most challenging subtask of general coreference resolution. To address the unique difficulty brought by pronouns, we focus on resolving pronoun coreferences in this paper.
| | Example A | Example B |
|---|---|---|
| Sentence | He walks into the room with one magazine and drops it on the couch. | … A small area of erythema around his arm … This will be treated empirically. |
| Knowledge | (‘magazine’, dobj, ‘drop’) | (‘erythema’, IsA, ‘disease’) |
Resolving pronoun coreference relations often requires the support of manually crafted knowledge Rahman and Ng (2011); Emami et al. (2018), especially in particular domains such as medicine Uzuner et al. (2012) and biology Cohen et al. (2017). Previous studies on pronoun coreference resolution incorporated external knowledge, including human-defined rules Hobbs (1978); Ng (2005) (e.g., number/gender requirements of different pronouns), domain-specific knowledge such as medical Jindal and Roth (2013) or biological Trieu et al. (2018) knowledge, and world knowledge Rahman and Ng (2011) such as selectional preference Wilks (1975). Their results proved that such knowledge is helpful when appropriately used for coreference resolution. Later, end-to-end solutions Lee et al. (2017, 2018) were proposed to learn contextual information and resolve coreferences jointly with neural networks, e.g., LSTMs; however, external knowledge is often omitted in these models. Considering that context and external knowledge have their own advantages (contextual information covers diverse text expressions that are difficult to predefine, while external knowledge is usually more precisely constructed and provides extra information beyond the training data), one could benefit from both sides for this task. Different from previous studies, we provide a generic solution to resolving pronoun coreference with the support of knowledge graphs on top of contextual modeling, where deep learning models are adopted to incorporate knowledge into pronoun coreference resolution and achieve remarkably good results.
In this paper, we explore how to build a knowledge-aware pronoun coreference resolution model that is able to leverage different external knowledge for this task. The proposed model is an attempt at a general solution for incorporating knowledge (in the form of KGs) into deep-learning-based pronoun coreference models, rather than using knowledge as features or rules in a dedicated manner. As a result, any knowledge resource presented in the format of triplets, the most widely used entry format for KGs, can be consumed by our model via the proposed attention module. Experimental results on two corpora from two different domains demonstrate the superiority of the proposed model over all baselines. Moreover, as our model learns to use knowledge rather than just fitting the training data, it achieves much better and more robust performance than state-of-the-art models in the cross-domain scenario.
This paper was partially supported by the Early Career Scheme (ECS, No. 26206717) from the Research Grants Council in Hong Kong and the Tencent AI Lab Rhino-Bird Focused Research Program. In addition, Hongming Zhang has been supported by the Hong Kong Ph.D. Fellowship. We also thank the anonymous reviewers for their valuable comments and suggestions, which helped improve the quality of this paper.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Charniak and Elsner (2009) Eugene Charniak and Micha Elsner. 2009. EM works for pronoun anaphora resolution. In EACL, 2009, pages 148–156.
- Clark and Manning (2015) Kevin Clark and Christopher D Manning. 2015. Entity-centric coreference resolution with model stacking. In ACL-IJCNLP, 2015, volume 1, pages 1405–1415.
- Clark and Manning (2016) Kevin Clark and Christopher D. Manning. 2016. Deep reinforcement learning for mention-ranking coreference models. In EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2256–2262.
- Cohen et al. (2017) K Bretonnel Cohen, Arrick Lanfranchi, Miji Joo-young Choi, Michael Bada, William A Baumgartner, Natalya Panteleyeva, Karin Verspoor, Martha Palmer, and Lawrence E Hunter. 2017. Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles. BMC Bioinformatics, 18(1):372.
- Edens et al. (2003) Richard J Edens, Helen L Gaylard, Gareth JF Jones, and Adenike M Lam-Adesina. 2003. An investigation of broad coverage automatic pronoun resolution for information retrieval. In SIGIR, pages 381–382. ACM.
- Ehrlich (1981) Kate Ehrlich. 1981. Search and inference strategies in pronoun resolution: An experimental study. In ACL, 1981, pages 89–93.
- Emami et al. (2018) Ali Emami, Paul Trichelair, Adam Trischler, Kaheer Suleman, Hannes Schulz, and Jackie Chi Kit Cheung. 2018. The hard-core coreference corpus: Removing gender and number cues for difficult pronominal anaphora resolution. arXiv preprint arXiv:1811.01747.
- Hobbs (1978) Jerry R Hobbs. 1978. Resolving pronoun references. Lingua, 44(4):311–338.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- Jindal and Roth (2013) Prateek Jindal and Dan Roth. 2013. End-to-end coreference resolution for clinical narratives. In Twenty-Third International Joint Conference on Artificial Intelligence.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lee et al. (2017) Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In EMNLP, 9-11, 2017, pages 188–197.
- Lee et al. (2018) Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In NAACL-HLT, pages 687–692.
- Li et al. (2011) Dingcheng Li, Tim Miller, and William Schuler. 2011. A pronoun anaphora resolution system based on factorial hidden Markov models. In Proceedings of ACL 2011, pages 1169–1178.
- Liu et al. (2018) Miaofeng Liu, Jialong Han, Haisong Zhang, and Yan Song. 2018. Domain Adaptation for Disease Phrase Matching with Adversarial Networks. In Proceedings of the BioNLP 2018 workshop 2018, pages 137–141.
- Liu et al. (2019) Miaofeng Liu, Yan Song, Hongbin Zou, and Tong Zhang. 2019. Reinforced Training Data Selection for Domain Adaptation. In Proceedings of ACL 2019.
- Mitkov (1998) Ruslan Mitkov. 1998. Robust pronoun resolution with limited knowledge. In ACL, 1998, pages 869–875.
- Mitkov et al. (1995) Ruslan Mitkov et al. 1995. Anaphora resolution in machine translation. In Proceedings of the Sixth International conference on Theoretical and Methodological issues in Machine Translation. Citeseer.
- Nasukawa (1994) Tetsuya Nasukawa. 1994. Robust method of pronoun resolution using full-text information. In CCL, 1994, pages 1157–1163.
- Ng (2005) Vincent Ng. 2005. Supervised ranking for pronoun resolution: Some recent improvements. In EMNLP, 2005, volume 20, page 1081.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, 2014, pages 1532–1543.
- Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
- Pradhan et al. (2012) Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In EMNLP, 2012, pages 1–40.
- Raghunathan et al. (2010) Karthik Raghunathan, Heeyoung Lee, Sudarshan Rangarajan, Nathanael Chambers, Mihai Surdeanu, Dan Jurafsky, and Christopher Manning. 2010. A multi-pass sieve for coreference resolution. In EMNLP, 2010, pages 492–501.
- Rahman and Ng (2011) Altaf Rahman and Vincent Ng. 2011. Coreference resolution with world knowledge. In ACL, 2011, pages 814–824.
- Resnik (1997) Philip Resnik. 1997. Selectional preference and sense disambiguation. Tagging Text with Lexical Semantics: Why, What, and How?
- Singh (2002) Push Singh. 2002. The open mind common sense project. KurzweilAI.net.
- Song and Shi (2018) Yan Song and Shuming Shi. 2018. Complementary Learning of Word Embeddings. In Proceedings of IJCAI 2018, pages 4368–4374.
- Song et al. (2018) Yan Song, Shuming Shi, Jing Li, and Haisong Zhang. 2018. Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings. In Proceedings of NAACL-HLT 2018, pages 175–180.
- Steinberger et al. (2007) Josef Steinberger, Massimo Poesio, Mijail A Kabadjov, and Karel Jevzek. 2007. Two uses of anaphora resolution in summarization. Information Processing & Management, 43(6):1663–1680.
- Strube and Müller (2003) Michael Strube and Christoph Müller. 2003. A machine learning approach to pronoun resolution in spoken dialogue. In ACL, 2003, pages 168–175.
- Trieu et al. (2018) Long Trieu, Nhung Nguyen, Makoto Miwa, and Sophia Ananiadou. 2018. Investigating domain-specific information for neural coreference resolution on biomedical texts. In Proceedings of the BioNLP 2018 workshop, pages 183–188.
- Uzuner et al. (2012) Ozlem Uzuner, Andreea Bodnari, Shuying Shen, Tyler Forbush, John Pestian, and Brett R South. 2012. Evaluating the state of the art in coreference resolution for electronic medical records. Journal of the American Medical Informatics Association, 19(5):786–791.
- Wilks (1975) Yorick Wilks. 1975. A preferential, pattern-seeking, semantics for natural language inference. Artificial intelligence, 6(1):53–74.
- Zhang et al. (2019a) Hongming Zhang, Xin Liu, Haojie Pan, Yangqiu Song, and Cane Wing-Ki Leung. 2019a. ASER: A large-scale eventuality knowledge graph. arXiv preprint arXiv:1905.00270.
- Zhang et al. (2019b) Hongming Zhang, Yan Song, and Yangqiu Song. 2019b. Incorporating context and external knowledge for pronoun coreference resolution. In Proceedings of NAACL-HLT 2019.