Named Person Coreference in English News

by   Oshin Agarwal, et al.

People are often entities of interest in tasks such as search and information extraction. In these tasks, the goal is to find as much information as possible about people specified by their name. In text, however, some references to people are by pronouns (she, his) or generic descriptions (the professor, the German chancellor). It is therefore important that coreference resolution systems are able to link these different types of mentions to the correct person name. Here, we evaluate two state-of-the-art coreference resolution systems on the subtask of Named Person Coreference, in which we are interested in identifying a person mentioned by name, along with all other mentions of the person, by pronoun or generic noun phrase. Our analysis reveals that standard coreference metrics do not adequately reflect the requirements of this task: they do not penalize systems for failing to identify any mention by name, and they reward systems that correctly group mentions of the same entity but fail to link them to a proper name (for example, linking "she" to "the student" with no name in the cluster). We introduce new metrics for evaluating named person coreference that address these discrepancies. We present a simple rule-based system driven by named entity recognition, which outperforms the current state-of-the-art systems on these task-specific metrics and performs on par with them on traditional coreference evaluations. Finally, we present a similar evaluation for coreference resolution of other named entities and show that the rule-based approach is effective only for named person coreference, not for other named entity types.




Coreference resolution is the task of identifying all expressions in text that refer to the same entity. In this paper we set out to provide an in-depth analysis and improve performance on a special coreference subtask: finding all references—either by name, pronoun or nominal—to a person named in the text. Named Person Coreference (NPC) would be especially useful for downstream information extraction tasks, in which facts about the person are extracted from large textual corpora and used as knowledge source for a range of artificial intelligence systems. NPC extends the power of Named Entity Recognition (NER) systems by not only finding all text snippets that are a person’s name, but also identifying all places in the text where the same person was mentioned.

Our work is oriented towards practical uses of the results in downstream applications, so we re-examine standard coreference metrics and find them lacking in their ability to quantify the performance of systems. Since applications require information about people and people are identified by their names, the evaluation metrics for this NPC task should focus on the resolution of mentions to the correct name. If all the pronouns referring to a person are resolved correctly to each other but are not linked to any named mention or are linked to a wrong named mention, their correct resolution to each other would not be useful for downstream applications. Standard coreference metrics do not incorporate these aspects of performance and hence assign high scores to results that are unsuitable for further use. We also show that the existing metrics are insensitive to a system failing to find any mention of a person: they give higher values to systems that miss a large number of entities but do good coreference resolution on the subset of entities they find.

We introduce new metrics to overcome these shortcomings. We separate the identification of entities and resolution of different mention types, thus transparently tracking areas of system performance and improvement. We create a subset of the standard coreference resolution annotated data sets by identifying all named person entities and adapting existing coreference systems to filter their original output to only person entities identified by name.

Inspired by the error analysis of the performance of current coreference systems, we introduce a new solution to the NPC task that builds upon the capabilities of state of the art named entity recognition systems. We describe a highly effective rule-based approach to named person coreference that combines NER-driven rules and basic heuristics based on writing styles for resolving pronominal coreference. We show that this system has comparable performance to the state of the art systems on the standard metrics and superior performance on the task-informed metrics we introduce.

Finally, we test our approach for named entity coreference on entities other than people. We find that the state-of-the-art coreference systems do not suffer from the same issues for other entities as they do for persons and thus the results are similar across both standard metrics and NPC metrics.

Why Named Person Coreference?

People, and the information about people expressed in text, are of interest in numerous applications. Moreover, references to people behave differently from references to other named entities, indicating that the intersection of coreference and mentions to people represents a fruitful area for application-oriented research. We expand on these observations motivating our work.

NPC in Downstream Applications

Many information extraction and language technology tasks involve people. About 15% of all web searches contain a person’s name [Weerkamp et al.2011] and especially in news search, NPC can help find articles in which the person of interest is the focus of discussion, mentioned by pronouns in addition to their full name.

Footnote 1: An open problem in search is to disambiguate between different people with the same name, mentioned in different documents [Artiles et al.2010]. We do not deal with this problem and resolve mentions only within the same document.

People are also often targets for information extraction systems [Ji and Grishman2011] and for knowledge base completion tasks [West et al.2014]. Yet about half of the references to a person in text are not by name (cf. the first column of Table 1). Systems need to extract information about the person where they are mentioned explicitly by name as well as referenced by a pronoun.

Biography summarization [Zhou, Ticrea, and Hovy2005] needs to extract sentences from the news that contain information about a given person. More relevant sentences can be extracted if we know which pronouns and nominals refer to this person. Similarly, creation of proper noun ontologies [Mann2002] can use patterns other than (proper noun - common noun) if other references to the entity are known.

Statistic            PER      ORG      GPE      DATE
Non-singleton        67.98%   50.76%   52.02%   21.3%
Entities             15.44%   11.35%   11.36%   4.24%
Entity Mentions      24.47%   14.33%   13.62%   3.41%
Named mentions       45.12%   55.79%   68.43%   82.69%
Non-named mentions   54.88%   44.21%   31.57%   17.31%
Avg cluster size     5.74     4.57     4.34     2.89
Table 1: Statistics, in OntoNotes (nw,bn,mz), on coreference for PERson, ORGanization, geopolitical entity (GPE) and DATE named entities. Non-singleton entities are mentioned at least twice in a text and so require coreference. Entities is the percentage of coreference clusters to entities of the given type. Entity mentions is the percentage of all individual references to any entity of a given type. Average cluster size is the average number of coreferent mentions.

Coreference and Named Entities

References to people are distributed quite differently from other types of named entities, as we show in Table 1. To compile these numbers, we make use of the OntoNotes coreference resolution corpus [Pradhan et al.2007] and gold-standard annotations for named entity recognition on the same data. In this way, we can quantify the patterns in coreference of different named entity types.

People are usually identified by their name in text: 86% of the coreference clusters on people in the OntoNotes training data have at least one named mention; 88.2% of animate third person singular pronouns (pronouns that can be used to refer to people) are a part of coreference chains with a named person, as seen in Table 2. Therefore, it is uncommon to refer to a person without identifying them by their name at least once.

Footnote 2: Female pronouns occur considerably less often than male pronouns, and a larger portion of female pronouns are to unnamed women. These numbers raise questions about usage but overall the trend is clear: third person animate pronouns, in their vast majority, resolve to a person named in the article.

Named entities are on average much less likely than a typical entity to be singletons, that is, mentioned only once in the text and so not requiring coreference resolution [De Marneffe, Recasens, and Potts2015]. Among named entities, people are the most likely to be mentioned repeatedly. People are rarely mentioned only once in news (see the first line in Table 1): 68% of people named in text have at least one other coreferent mention, in contrast to 51% and 52% for organizations and geo-political entities respectively.

Named people entities make up 15% of all coreference clusters in OntoNotes (see line 2 of Table 1), yet 25% of all mentions that require coreference resolution are mentions of people (Line 3 in Table 1). There are more mentions of the same person on average (line 6 in Table 1) and PERson is the entity type with the largest portion of references that are not by name (line 5 in Table 1). Only 45% of mentions to people are by name, compared to 56% for organizations and 68% for geo-political entities and over 80% for dates.

In sum, references to named people make up a quarter of all references involved in coreference resolution and only about half of these references are by name. The output of traditional NER systems, which identify only the names, is therefore not sufficient to track all mentions of a person; rules to disambiguate pronouns and noun phrase references are needed as well.

Pronoun   Total   Part of PER cluster
he        3062    2705 (88.3%)
him       447     379 (84.7%)
his       1931    1669 (86.4%)
she       718     588 (81.8%)
her       667     521 (78.1%)
hers      2       2 (100%)
Table 2: Third person singular pronouns in the OntoNotes train set.

NPC and Coreference Evaluation

Named Person Coreference is motivated by the needs in downstream applications. This setting allows us to critically review current practices for coreference evaluation and identify aspects where they fall short in quantifying system performance.

Shared tasks on coreference (at CoNLL-2011 and 2012 [Pradhan et al.2014]) use the average of three scores as their official evaluation: MUC [Vilain et al.1995], B-cubed [Bagga and Baldwin1998] and CEAFe [Luo2005]. Prior work [Stoyanov et al.2009] discussed the shortcomings of these metrics and introduced the link-based entity-aware (LEA) score. Below we describe these and point out how they are deficient when examined in the context of downstream tasks, particularly ones that involve NPC.

Footnote 3: This view of evaluation, focused on the ability of a system to discover and track entities, which we later expand on, is similar to the popular entity linking task, in which named mentions in text are linked to an abstract entity, such as one defined in Wikipedia [Mihalcea and Csomai2007, Han and Sun2011, Durrett and Klein2014, Pan et al.2015, Radhakrishnan, Talukdar, and Varma2018].

[Table 3: NPC Examples. Rows: gold clusters, Solution 1, Solution 2, Solution 3.]


MUC computes performance at the entity level, where a goldstandard cluster of mentions (noun phrases in text) represents an entity. The recall for an entity is based on the minimum number of links that would have to be added in the predicted clusters containing any mention of this entity, to make them connected and part of the same cluster. Precision is computed by reversing the role of gold and predicted clusters.
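The link counting described above can be illustrated with a minimal sketch. It assumes clusters are given as sets of mention identifiers and that a gold mention missing from the system output forms a part of its own; the function names and the toy clusters are hypothetical and are not the contents of Table 3.

```python
def muc_recall(gold, predicted):
    """MUC recall for clusters given as lists of sets of mention ids."""
    # Map every predicted mention to the index of its predicted cluster.
    cluster_of = {m: i for i, cluster in enumerate(predicted) for m in cluster}
    missing_links = 0
    total_links = 0
    for cluster in gold:
        # Partition the gold cluster by predicted cluster; a mention the
        # system did not output forms a part of its own (assumption).
        parts = {cluster_of.get(m, ("unresolved", m)) for m in cluster}
        # A cluster of n mentions needs n - 1 links; reconnecting the
        # parts needs len(parts) - 1 additional links.
        missing_links += len(parts) - 1
        total_links += len(cluster) - 1
    if total_links == 0:
        return 0.0
    return (total_links - missing_links) / total_links


def muc_precision(gold, predicted):
    """Precision reverses the roles of gold and predicted clusters."""
    return muc_recall(predicted, gold)


# Hypothetical toy document with two gold entities.
gold = [{"John Doe", "he#1", "his#2"}, {"Richard Roe", "he#3"}]
predicted = [{"John Doe", "he#1"}, {"Richard Roe", "he#3"}]
print(muc_recall(gold, predicted), muc_precision(gold, predicted))
```

In this toy case one of the three gold links is missing, so recall is 2/3 while precision is 1.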


B-cubed measures performance on the mention level. It iterates over all goldstandard mentions of an entity, averaging the recall of its gold cluster in its predicted cluster. Different mentions may have been predicted as referring to different entities, placed in different solution clusters. It computes precision by reversing the role of gold and predicted clusters.
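The mention-level averaging described above (the B-cubed score) can be sketched as follows. This is an illustrative sketch only: it assumes clusters are sets of mention identifiers and treats a gold mention that the system did not output as a singleton predicted cluster, which is one of several conventions in use. The names and toy data are hypothetical.

```python
def b_cubed_recall(gold, predicted):
    """B-cubed recall for clusters given as lists of sets of mention ids."""
    # Map every predicted mention to the predicted cluster that holds it.
    predicted_cluster_of = {m: cluster for cluster in predicted for m in cluster}
    total = 0.0
    count = 0
    for cluster in gold:
        for mention in cluster:
            # Assumption: an unresolved mention is its own singleton cluster.
            system_cluster = predicted_cluster_of.get(mention, {mention})
            # Fraction of the gold cluster recovered around this mention.
            total += len(cluster & system_cluster) / len(cluster)
            count += 1
    return total / count if count else 0.0


def b_cubed_precision(gold, predicted):
    """Precision reverses the roles of gold and predicted clusters."""
    return b_cubed_recall(predicted, gold)


# Hypothetical toy document with two gold entities.
gold = [{"John Doe", "he#1", "his#2"}, {"Richard Roe", "he#3"}]
predicted = [{"John Doe", "he#1"}, {"Richard Roe", "he#3"}]
print(b_cubed_recall(gold, predicted), b_cubed_precision(gold, predicted))
```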


CEAF first finds a one-to-one mapping between gold and predicted clusters, in effect placing the discovery of entities first. It then computes recall as the number of same or similar mentions shared by the gold and predicted clusters divided by the number of mentions in the gold cluster. Precision is similarly the number of same or similar mentions shared by the gold and predicted clusters, divided by the number of mentions in the predicted cluster. The final version averages these numbers either per mention, as in B-cubed (CEAFm), or per entity (CEAFe), which would be more reasonable.


LEA is a link-based metric most similar to MUC. It computes recall as the number of correctly resolved links between mentions, weighting the result for each entity by its number of mentions. In this way, correctly resolving an entity with many mentions contributes more to the overall score than correctly resolving all mentions of an entity mentioned only twice, for example. Precision is computed by reversing the role of gold and predicted clusters.
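A minimal sketch of the size-weighted link counting described above is given here. It assumes clusters are sets of mention identifiers and, as a simplification, skips singleton clusters, which the full metric handles with self-links. Function names and the toy data are hypothetical.

```python
def links(n):
    """Number of coreference links among n mentions."""
    return n * (n - 1) // 2


def lea_recall(gold, predicted):
    """LEA recall for clusters given as lists of sets of mention ids."""
    weighted = 0.0
    total_size = 0
    for entity in gold:
        if len(entity) < 2:
            continue  # simplification: singleton entities are skipped
        # Gold links that survive inside some predicted cluster.
        resolved = sum(links(len(entity & cluster)) for cluster in predicted)
        # Weight the resolved fraction by the size of the entity.
        weighted += len(entity) * resolved / links(len(entity))
        total_size += len(entity)
    return weighted / total_size if total_size else 0.0


def lea_precision(gold, predicted):
    """Precision reverses the roles of gold and predicted clusters."""
    return lea_recall(predicted, gold)


# Hypothetical toy document with two gold entities.
gold = [{"John Doe", "he#1", "his#2"}, {"Richard Roe", "he#3"}]
predicted = [{"John Doe", "he#1"}, {"Richard Roe", "he#3"}]
print(lea_recall(gold, predicted), lea_precision(gold, predicted))
```

Here the three-mention entity keeps only one of its three links, so it contributes 3 × 1/3, and the two-mention entity contributes 2 × 1, giving a recall of 3/5.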

Metric   Solution 1          Solution 2          Solution 3
         R     P     F1      R     P     F1      R     P     F1
MUC      0.55  1     0.71    0.66  1     0.8     0.44  1     0.61
B-cub    0.5   1     0.66    0.56  1     0.72    0.34  1     0.51
CEAFm    0.5   1     0.66    0.75  1     0.85    0.58  1     0.73
CEAFe    0.33  1     0.5     0.83  0.83  0.83    0.75  0.75  0.75
LEA      0.5   1     0.66    0.5   1     0.66    0.26  1     0.42
Table 4: Evaluation of the hypothetical solutions on NPC examples in Table 3

Recall that the goal of NPC is to find all mentions referring to a person, identified by their name. We provide a made up example of a goldstandard and three possible solutions in Table  3. The gold standard contains three entities: John Doe, Richard Roe and Joe Smith. As the clusters indicate, each is also mentioned by a pronoun a number of times. The first hypothetical system response (solution 1) identifies only one entity. It finds all mentions to John Doe correctly but completely misses all mentions to the other two entities. The second solution correctly resolves the pronouns to each other but does not link these to any names. The third solution is able to identify a few mentions for all of the three entities. Intuitively, solution 2 has little practical value, solution 3 is best and solution 1 is acceptable.

The results for standard coreference scores are shown in Table 4. Every metric gives its highest F1 to Solution 2 (LEA ties it with Solution 1), even though Solution 2 does not identify a single name. The same would be true even if any of these correctly grouped sets of pronouns were linked to the wrong name. This is because none of the metrics takes into account the types of mentions and the need for resolution to the correct names.

NPC evaluation metrics

We showed in the previous section that the existing coreference metrics are not suitable for the NPC task. Moreover, using a single number to track all aspects of a system makes the evaluation less interpretable. Prior work has argued that when coreference is used in downstream applications, evaluation criteria should be interpretable [Tuggener2014]. We concur, and introduce a set of task-specific criteria for evaluation of NPC. These are inspired by the error analysis we performed. For clarity, we provide example errors on OntoNotes.

Entity F1: Matching Output to Goldstandard Entities

In the goldstandard, all noun phrases referring to the same person mentioned by name at least once are grouped into entities. In the system output, we also wish to find the chain corresponding to each person, similar to the motivation of the CEAFe evaluation. To map entity chains between the goldstandard and the system output, we select for each goldstandard entity, the predicted entity that has the highest F1 score with respect to the mentions it contains. The entity F1 score is the harmonic mean of the precision and recall for the entity mapping.

Footnote 4 (precision): Number of mentions to the goldstandard entity also in the system entity divided by the number of all system mentions.

Footnote 5 (recall): Number of mentions to the goldstandard entity also in the system entity divided by the number of mentions in the goldstandard entity.

To compute the intersection between a goldstandard and a system entity, we first augment each goldstandard entity with a list of all variations of the person’s name. We rely on the goldstandard named entity annotation in OntoNotes and intersect this with the membership in a coreference chain. This provides lists of the full name, last name, and occasionally nicknames and variants of the name, e.g. {Frank Curzio, Francis X. Curzio, Curzio}, {Dwayne Dog Chapman, Dog Chapman, Chapman}. We consider a predicted chain to be a candidate match for a gold chain only if it contains at least one of the name variants.

We consider a predicted mention to match if it matches the gold mention exactly in the same place of the text. Mention detection has a huge impact on evaluation but we use an exact span matching to be consistent with the existing coreference scorers.

Footnote 6: Exact mention match is used for calculating F1. We do not use exact mention match to find candidate chains as the presence of the name can indicate which person the cluster is about.

If a goldstandard entity does not get paired with any system entity, the F1 for that entity is taken to be zero. In our example above, Solution 1 will have poor recall, because it finds only one entity in a document containing three entities. We find the overall F1 of the system as the average of the F1 for each gold entity.
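The entity F1 computation described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the evaluation code used for the reported numbers: mentions are represented by a hypothetical `Mention` tuple of token offsets and text, candidate chains are found by a simple substring test against the name variants, and precision is taken over the mentions of the matched system chain.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    start: int  # token offset where the mention begins
    end: int    # token offset where the mention ends (exclusive)
    text: str


def span_f1(gold_spans, predicted_spans):
    """F1 between two sets of (start, end) spans under exact matching."""
    overlap = len(gold_spans & predicted_spans)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted_spans)
    recall = overlap / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


def entity_f1(gold_entities, predicted_chains):
    """Average over gold entities of the best F1 among candidate chains.

    gold_entities: list of (mentions, name_variants) pairs.
    predicted_chains: list of lists of Mention.
    """
    scores = []
    for mentions, variants in gold_entities:
        gold_spans = {(m.start, m.end) for m in mentions}
        best = 0.0  # stays 0 if no chain is paired with this entity
        for chain in predicted_chains:
            # A chain is a candidate only if it contains a name variant.
            if not any(v in m.text for m in chain for v in variants):
                continue
            spans = {(m.start, m.end) for m in chain}
            best = max(best, span_f1(gold_spans, spans))
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical example: the second gold entity is never linked to its name.
gold_entities = [
    ([Mention(0, 2, "John Doe"), Mention(5, 6, "he"), Mention(9, 10, "his")],
     {"John Doe", "Doe"}),
    ([Mention(12, 14, "Richard Roe"), Mention(16, 17, "he")],
     {"Richard Roe", "Roe"}),
]
predicted_chains = [
    [Mention(0, 2, "John Doe"), Mention(5, 6, "he")],
    [Mention(9, 10, "his"), Mention(16, 17, "he")],
]
print(entity_f1(gold_entities, predicted_chains))
```

In this example the first entity scores 0.8 and the second, whose pronoun is resolved but never linked to a name, scores 0, so the average is 0.4.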

Entity not found

The entity F1 evaluation gives a sense of overall system performance but mixes true purity of the system-discovered entities and the ability to discover entities at all. "Entity not found" is the error when no NPC system output overlaps with a gold standard entity. Entities not found contribute a score of 0 for the average F1.

Footnote 7: We consider only chains containing a named mention. Chains that do not contain any named mention are filtered out. More details on filtering to follow in the section on system performance.

Pronoun Resolution Accuracy

Finally, we track the entity F1 when only mentions of a given syntactic type (name, pronoun or nominal) are preserved in the chain. Of special interest is system performance when resolving pronouns. Many pronoun errors arise from the need for commonsense knowledge and reasoning for correct resolution, as in these examples:

talked about the Vice President’s chances during an interview with the Boston Globe. says it’s unlikely Gore will be selected, because doesn’t have enough experience in the academic world.

The pronoun is incorrectly resolved to the school official instead of Al Gore. To correctly resolve this, we need to know that not having enough experience would be the reason for Gore not getting selected and not the reason for the school official making a statement.

Maybe Lily became so obsessed with where people slept and how because her own arrangements kept shifting. When died, uncles moved in and let make the sleeping and other household arrangements.

Rosie is Lily’s mother (explained earlier in the article). Both pronouns are incorrectly resolved to Rosie, despite the commonsense fact that a person cannot make household arrangements after their death.

System                  Chains not found   F1 (NPC)   Avg F1 (coref)   LEA F1   F1 (names)   F1 (pronouns)   F1 (nominals)   F1 (nw)   F1 (bn)   F1 (mz)
CoreNlp deterministic   24.76%             0.501      0.499            0.409    0.489        0.384           0.027           0.499     0.494     0.525
CoreNlp statistical     39.30%             0.45       0.577            0.499    0.488        0.307           0.019           0.411     0.49      0.45
CoreNlp neural          27.35%             0.572      0.677            0.613    0.612        0.397           0.062           0.59      0.58      0.502
AllenNlp                31.87%             0.563      0.71             0.66     0.61         0.349           0.078           0.412     0.699     0.608
Table 5: Performance of existing systems. The left panel shows named people coreference metrics (percentage chains not found and entity F1), and average F1 coreference evaluation combining MUC, B-cubed and CEAFe, on all test data. The middle panel shows entity F1 by type of mention, names, pronouns or nominals. The right panel shows entity F1 broken down by subgenres in the test data: newswire (nw), broadcast news (bn) and magazines (mz). Existing F1 metric is not sensitive to chains not found, and this explains the difference from the new intuitive NPC metric. For each metric, the best system has been bold-faced

Over-splitting and over-combination of entities

Sometimes, systems produce more than one cluster containing the same name. In this case we say that the gold-standard entity was over-split. The following example illustrates this error:

. Many people now claim to have predicted the 1987 crash. actually did it: stated in writing in September 1987 that the Dow Jones Industrial Average was likely to decline about 500 points the following month. says what happens now will depend a good deal on the Federal Reserve Board. If it promptly cuts the discount rate it charges on loans to banks, says, ” That could quiet things down. ”

The above text mentions Frank Curzio 5 times; however, the system does not treat all of these mentions as coreferent with each other. Mentions 2 and 3 form one cluster while the other three form another.

At times systems also produce coreference chains that combine mentions of two different people. This error occurs when different people mentioned in the text have the same last name (notoriously for U.S. news, Bill Clinton and Hillary Clinton) but also occasionally in cases where the names are completely different but the roles of the people are similar, as in the example below:

UN Secretary General Kofi said Wednesday, it is important to help spread democracy around the world. VOA’s Breck Ardery reports from the United Nations . In a new report, says democratization has now taken root as a universal norm and that the United Nations should strengthen its commitment to assisting nations that are moving toward democracy. Commenting on the report, UN Assistant Secretary General for Political Affairs Danilo told reporters the principle of national sovereignty does not preclude support for democracy.

Although it is clear from the above text snippet that Kofi Annan and Danilo Turk are two different people, they have similar titles and are recognized as coreferent.

The following is another document, where Scott Peterson and his wife Laci Peterson are said to be coreferent, in spite of pronouns of distinct gender referring to each.

Moving on to the case Bill, an unusual request from ’s attorney. A pair of Laci ’s missing shoes could be very important evidence in murder trial. is asking anyone who finds the pair to give them back. No sign of Geragos or in court yesterday when a judge was considering whether to unseal warrants obtained before Scott Peterson’s arrest. Peterson awaiting trial in the murder of wife Laci and their unborn son.

Evaluation of existing systems

We evaluate the Stanford coreference system, with its deterministic [Raghunathan et al.2010], statistical [Clark and Manning2015] and neural [Clark and Manning2016] versions, and the neural end-to-end AllenNLP system [Lee et al.2017] on both the existing metrics and the NPC metrics.

These general coreference systems find coreferring expressions of any type and produce coreference clusters for all mentioned entities (groups of noun phrases that refer to the same entity). In named person coreference, the goal is to find all mentions to a person who has been referred to by name at least once in the document. This means that the output of off-the-shelf coreference systems has to be filtered to keep only chains that contain at least one mention noun phrase with a syntactic head that is a person’s name. For our evaluation, we use dependency parsing to detect whether a name is the head of a mention, by checking that no other word in the mention is an ancestor of the name in the dependency parse tree. We use automatic person NER tags from Stanford coreNLP [Finkel, Grenager, and Manning2005] to determine if the head is a name. We call the coreference chains remaining after filtering NPC entities.
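The filtering step described above can be illustrated with a small sketch. It assumes the dependency parse and NER tags are already available as parallel per-token lists (`heads`, with `None` for a sentence root, and `ner`); these names and the hand-annotated toy parse are hypothetical. The sketch finds the syntactic head of each mention as the token whose parent lies outside the mention span, which is one way of checking that no other word in the mention dominates the name.

```python
def mention_head(heads, start, end):
    """Index of the token in [start, end) whose parent is outside the span."""
    for i in range(start, end):
        parent = heads[i]
        if parent is None or not (start <= parent < end):
            return i
    raise ValueError("empty or malformed mention span")


def has_name_head(span, heads, ner):
    """True if the syntactic head of the mention is tagged as a person name."""
    start, end = span
    return ner[mention_head(heads, start, end)] == "PERSON"


def filter_npc_chains(chains, heads, ner):
    """Keep only chains with at least one mention headed by a person name."""
    return [chain for chain in chains
            if any(has_name_head(span, heads, ner) for span in chain)]


# Hand-annotated toy parse (hypothetical): "John Doe 's attorney spoke . He left ."
tokens = ["John", "Doe", "'s", "attorney", "spoke", ".", "He", "left", "."]
heads = [1, 3, 1, 4, None, 4, 7, None, 7]
ner = ["PERSON", "PERSON", "O", "O", "O", "O", "O", "O", "O"]

chain_a = [(0, 2), (6, 7)]  # "John Doe", "He"
chain_b = [(0, 4), (6, 7)]  # "John Doe 's attorney", "He"
print(filter_npc_chains([chain_a, chain_b], heads, ner))
```

Only the first chain is kept: in the second, the name is embedded in a mention whose head is "attorney".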

A less strict filter, such as the presence of a third-person singular personal or possessive pronoun, would also indicate that the corresponding entity is a person. For NPC, however, we insist on having at least one named mention, as elaborated upon in the earlier sections. We found that the AllenNLP system does not have a named mention in about 30% of the coreference chains that do contain a personal or possessive third person pronoun. This number is about 20% for the CoreNLP neural system.

For evaluating the systems, we use only the relevant subsample of the standard coreference evaluation data in OntoNotes. We work with the test newswire (89 documents, 136 NPC chains), broadcast news (93 documents, 135 NPC chains) and magazine (45 documents, 49 NPC chains) documents only. We evaluate on the 320 NPC chains in these 227 documents. The NPC chains contain three types of mentions—names, pronouns, and nominals. Nominals account for less than 5% of the mentions in all genres, while the remaining mentions are split almost equally between names and pronouns.

The third column in the first panel of Table 5 shows the standard F1 for all the systems. As expected, the AllenNLP system outperforms all the other systems, with CoreNLP neural a close second. Both perform much better than the CoreNLP deterministic and statistical systems.

However, this difference in performance is not as large on the NPC F1 (see second column of Table 5). The Stanford CoreNLP neural system has the highest NPC F1, only slightly better than the AllenNLP system. The NPC F1 includes 0s for goldstandard entities not found by the system. We track this separately as well (first column of Table 5). The deterministic and neural Stanford CoreNLP systems have the lowest percentage of entities not found errors, about a quarter of all entities. The Stanford statistical system is worst at finding entities, missing almost 40% of entities.

Footnote 8: This evaluation uses an exact span matching to be consistent with the existing coreference scorers. Our experiments showed a relaxed mention span matching allowing a difference of a few words results in gains up to 12 points F1. The gain with relaxed mention matching is highest for the deterministic system, which ends up performing as well as the CoreNLP neural system.

We also separate the performance of the systems by mention type. The second panel of Table 5 reveals that the systems make more mistakes on pronouns compared to names.

We also evaluated separately on the three genres, as shown in the third panel of Table 5. While the CoreNLP neural system outperforms the AllenNLP system on newswire by a huge margin, the AllenNLP system performs best on broadcast news and magazines. As with overall F1, these scores are greatly affected by mention detection, and the difference in performance shrinks with relaxed mention matching.

We tracked the over-splitting and the over-combination of entities as well. However, the overall values were quite small and similar for all the systems and have thus not been included in the results here.

System                       Chains not found   F1 (NPC)   Avg F1 (coref)   LEA F1   F1 (names)   F1 (pronouns)   F1 (nominals)   F1 (nw)   F1 (bn)   F1 (mz)
NER-DE                       12.50%             0.685      0.67             0.583    0.727        0.477           0.012           0.601     0.739     0.771
NER-DE (CoreNlp mentions)    13.12%             0.594      0.584            0.481    0.58         0.466           0.028           0.516     0.663     0.619
NER-DE (CogComp mentions)    13.12%             0.616      0.594            0.484    0.598        0.479           0.006           0.516     0.702     0.656
NER-DE (AllenNlp mentions)   35.31%             0.541      0.676            0.606    0.588        0.335           0.032           0.396     0.675     0.576
NER-DE (gold mentions)       11.56%             0.789      0.817            0.756    0.867        0.488           0.056           0.749     0.815     0.831
Table 6: Performance of NER-DE system with different methods of mention detection. The left panel shows named people coreference metrics (percentage chains not found and entity F1), and average F1 coreference evaluation combining MUC, B-cubed and CEAFe, on all test data. The middle panel shows entity F1 by type of mention, names, pronouns or nominals. The right panel shows entity F1 broken down by subgenres in the test data: newswire (nw), broadcast news (bn) and magazines (mz). Existing F1 metric is not sensitive to chains not found, and this explains the difference from the new intuitive NPC metric. For each metric, the best system has been bold-faced

NER-driven Coreference

The most striking error of current systems is their inability to find a substantial fraction (20%–40%) of the NPC entities. In these cases the system may have produced an entity that overlaps with some mentions but does not include the name. Detection of named mentions can be done with high accuracy by named entity recognition systems [Stoyanov et al.2009] and the matching of names can also be done accurately via string matching [Wacholder, Ravin, and Choi1997, Wick et al.2009]. Pronoun resolution is more difficult and sometimes even requires background knowledge. In general, however, writing conventions dictate how pronouns are used in text. Nominal references, on the other hand, are unlikely to make any noticeable difference since there are only a handful. Inspired by these observations, we present an NER-driven rule-based coreference system, NER-DE. We perform clustering on named entities and link pronouns using basic heuristics. This is simpler than prior work on coreference resolution using clustering: [Cardie and Wagstaf1999] represent all noun phrases using a set of features and perform clustering on these representations, which can also lead to clusters not having an associated name.

People-mention coreference

In a first pass, NER-DE finds all spans of text that are person named entities, using the CogComp NER [Ratinov and Roth2009]. We then use a dependency parser [Honnibal and Johnson2015] to find the noun phrase of which that named entity serves as the syntactic head. This is the smallest span that contains all descendants of the last word in the named entity. We also use two additional rules: (1) if there is an ’and’ immediately after the name, the end of the name is the end of the mention, and (2) if the token immediately before the name is a noun or a proper noun, we include that token and its descendants (in the parse tree) in the mention span. In addition, we experimented with taking mention detection from existing coreference systems rather than using this dependency parsing-based method.

Coreference decisions are made on the named entity spans. We start by focusing on names, initializing a separate chain for each named entity. We then do agglomerative clustering of the names, using a named entity similarity metric [Khashabi et al.2018]. We merge two chains if the longest mention of the first chain has a similarity of more than 0.5 with any mention of the second chain. Most similarity scores are close to either 0 or 1, so the system is not sensitive to the chosen similarity cut-off.
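The merging procedure above can be sketched as follows. The similarity function here is a hypothetical stand-in (token-subset matching that returns 0 or 1) and is not the named entity similarity metric of [Khashabi et al.2018]; the function names and example names are likewise only illustrative.

```python
def name_similarity(a, b):
    """Stand-in similarity: 1.0 if one name's tokens are a subset of the other's."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return 1.0 if tokens_a <= tokens_b or tokens_b <= tokens_a else 0.0


def cluster_names(names, similarity=name_similarity, threshold=0.5):
    """Agglomerative clustering of named mentions into chains."""
    chains = [[name] for name in names]  # one chain per named mention
    changed = True
    while changed:
        changed = False
        for i in range(len(chains)):
            longest = max(chains[i], key=len)  # longest mention of chain i
            for j in range(len(chains)):
                if i != j and any(similarity(longest, m) > threshold
                                  for m in chains[j]):
                    # Merge chain j into chain i and rescan from the start.
                    chains[i].extend(chains.pop(j))
                    changed = True
                    break
            if changed:
                break
    return chains


names = ["Frank Curzio", "Curzio", "Kofi Annan", "Annan", "Frank Curzio"]
print(cluster_names(names))
```

A stand-in this simple would also merge two different people who share a last name whenever the bare last name is mentioned, which is one reason a dedicated name similarity metric is preferable.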

Resolving pronouns

We add first-person and third-person singular pronouns to the coreference chains containing named mentions using three rules, restricted by gender compatibility. We consider the gender of the chain to be the gender of the first word in the longest named mention of the chain. We determine the gender using a gazetteer of names, created in an unsupervised manner from Wikipedia. For each person’s page, the name is considered female if the number of female pronouns is greater than the number of male pronouns in the first paragraph of the page; otherwise it is considered male. A name is considered unisex if it does not appear in the gazetteer. This gazetteer has been evaluated on names that originate from different languages and gives accuracies in the high 90s. We use the following rules for pronoun resolution:
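The gazetteer construction can be illustrated with a short sketch. The pronoun lists, the keying by lower-cased first name, and the function names are assumptions made for this example; the text does not say how several pages sharing a first name are combined, so here a later page simply overwrites an earlier one.

```python
import re
from typing import Dict

# Assumed pronoun inventories; the paper does not list them.
FEMALE = {"she", "her", "hers", "herself"}
MALE = {"he", "him", "his", "himself"}


def gender_from_paragraph(paragraph: str) -> str:
    """Female if female pronouns outnumber male ones, otherwise male."""
    words = re.findall(r"[a-z]+", paragraph.lower())
    female = sum(w in FEMALE for w in words)
    male = sum(w in MALE for w in words)
    return "female" if female > male else "male"


def build_gazetteer(pages: Dict[str, str]) -> Dict[str, str]:
    """pages maps a person's full name to the first paragraph of their page."""
    return {name.split()[0].lower(): gender_from_paragraph(text)
            for name, text in pages.items()}


def lookup(gazetteer: Dict[str, str], first_name: str) -> str:
    """Names absent from the gazetteer are treated as unisex."""
    return gazetteer.get(first_name.lower(), "unisex")


# Made-up page text for illustration only.
pages = {
    "Marie Curie": "Marie Curie was a physicist. She conducted her research in Paris.",
    "Alan Turing": "Alan Turing was a mathematician. He is known for his work.",
}
gazetteer = build_gazetteer(pages)
print(lookup(gazetteer, "Marie"), lookup(gazetteer, "Alan"), lookup(gazetteer, "Taylor"))
```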

(i) If the pronoun is the subject of a verb and the preceding subject is a name, we assign the pronoun to that name. For instance, "he" is assigned to "Henry" in the text "Henry went to see Barry in the hospital. Afterward, he ate a pizza."

(ii) Else, we assign the pronoun to the nearest preceding name such that (a) the pronoun is not a subject of a verb whose object is the name (or any other name in the same chain), (b) the pronoun is not an object of a verb whose subject is the name (or any other name in the same chain), and (c) the name appears at least once in the 100 words preceding the pronoun.

(iii) If no name satisfies either of the above two conditions, the pronoun is not assigned to a name.

We remove chains that have only one mention, to follow the convention of having only non-singleton chains.
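The three rules and the singleton filter can be sketched as below. The `Mention` record, the role labels, and the function names are hypothetical simplifications: grammatical roles and governing verbs are assumed to be supplied by a parser, the 100-word window is measured in token positions, and falling through to rule (ii) when rule (i) fails the gender check is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Mention:
    index: int               # token position in the document
    kind: str                # "name", "pronoun" or "other"
    role: str                # "subj", "obj" or "none"
    verb: int                # token index of the governing verb, -1 if none
    chain: int = -1          # chain id (names only)
    gender: str = "unisex"   # "male", "female" or "unisex"


def compatible(a: str, b: str) -> bool:
    return "unisex" in (a, b) or a == b


def clashes(pronoun: Mention, chain: int, mentions: List[Mention]) -> bool:
    """Constraints (a) and (b): the pronoun and a name of the chain must not
    be subject and object (in either order) of the same verb."""
    if pronoun.verb == -1:
        return False
    for m in mentions:
        if m.kind == "name" and m.chain == chain and m.verb == pronoun.verb:
            if (pronoun.role, m.role) in {("subj", "obj"), ("obj", "subj")}:
                return True
    return False


def resolve_pronoun(pronoun: Mention, mentions: List[Mention],
                    window: int = 100) -> Optional[int]:
    """Return the chain id the pronoun is assigned to, or None (rule iii)."""
    before = sorted((m for m in mentions if m.index < pronoun.index),
                    key=lambda m: m.index, reverse=True)  # nearest first
    # Rule (i): a subject pronoun takes the preceding subject if it is a name.
    if pronoun.role == "subj":
        prev = next((m for m in before if m.role == "subj"), None)
        if (prev is not None and prev.kind == "name"
                and compatible(pronoun.gender, prev.gender)):
            return prev.chain
    # Rule (ii): nearest preceding compatible name within the window.
    for cand in before:
        if pronoun.index - cand.index > window:
            break  # constraint (c): everything further back is too far away
        if cand.kind != "name" or not compatible(pronoun.gender, cand.gender):
            continue
        if not clashes(pronoun, cand.chain, mentions):
            return cand.chain
    return None


def drop_singletons(chains: Dict[int, List[Mention]]) -> Dict[int, List[Mention]]:
    """Keep only chains with more than one mention."""
    return {cid: ms for cid, ms in chains.items() if len(ms) > 1}


# "Henry went to see Barry in the hospital. Afterward, he ate a pizza."
henry = Mention(0, "name", "subj", 1, chain=0, gender="male")
barry = Mention(4, "name", "obj", 3, chain=1, gender="male")
he = Mention(11, "pronoun", "subj", 12, gender="male")
print(resolve_pronoun(he, [henry, barry]))  # chain id of "Henry"
```

In the Henry and Barry example, rule (i) fires because "he" is a subject and the preceding subject is the name "Henry", even though "Barry" is the nearer name.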

Evaluation of NER-DE

We evaluate NER-DE on both standard and NPC metrics. The results can be seen in Table 6.

NER-DE performs better than the CoreNLP deterministic and statistical systems on the standard metrics, although AllenNLP still outperforms all others on these metrics. However, as discussed in the earlier sections, the standard metrics are not sensitive to entities that are not found at all. NER-DE misses 12.5% of the chains, far fewer than the other systems miss, even though it remains comparatively weaker on the standard metrics. 4.67% of the chains are missed because NER is not able to find a single named mention in that chain. The rest of the chains are missed due to the coreference algorithm, with no other reference linked to the name.

The existing metrics also do not take into account the resolution of mentions to the correct name, which is incorporated in the NPC F1. NER-DE outperforms all systems on NPC F1, both overall and in the subgenres of nw, bn and mz. It also does better on finding named mentions as well as pronouns. We should note that this improved performance is obtained by basic heuristics of matching pronouns to the nearest preceding named mentions with just a few additional constraints.

We also see that the performance varies significantly depending on the source of mentions used by NER-DE. Surprisingly, NER-DE performs better with mentions generated using parsing than with mentions generated by the CogComp and CoreNLP systems, on both the NPC and standard metrics. On the existing metrics, NER-DE using AllenNLP mentions is comparable to NER-DE using parsed mentions, but the AllenNLP system combines mention detection with coreference and thus has better mentions. NER-DE using parsed mentions still outperforms even the variant using AllenNLP mentions on the NPC metrics. Using gold mentions gives very high performance on both the NPC and standard metrics, showing that NER-DE, which uses just NER and basic heuristics for pronominal resolution, is surprisingly good at named person coreference. Improving mention detection will lead to even better performance.

Type  System    Chains not found  F1 (NPC)  Avg F1 (coref)
PER   AllenNLP  31.87%            0.563     0.71
PER   NER-DE    12.5%             0.685     0.67
ORG   AllenNLP  28.9%             0.55      0.6
ORG   NER-DE    28%               0.41      0.42
GPE   AllenNLP  13.9%             0.75      0.75
GPE   NER-DE    13.2%             0.63      0.51
Table 7: Performance in coreference for PER, ORG, and GPE entities

Coreference for Other NE types

Natural language processing applications usually deal with PER, ORG and GPE entities together. Therefore, as for PER, we evaluated the ORG and GPE coreference chains using standard and NPC metrics. We also built a similar NER-driven rule-based system with agglomerative clustering for resolving names and basic heuristics for resolving pronouns. Unlike for PER, the state-of-the-art coreference system finds nearly the same percentage of ORG and GPE chains as the NER-driven system, so the problems observed for person entities are much less frequent for these types. It also resolves all types of mentions better than the NER-driven system. These results can be seen in Table 7.


We presented the task of Named Person Coreference (NPC) and showed that the standard coreference metrics are not suitable for the evaluation of this task. We introduced interpretable evaluation metrics that tackle the shortcomings of the standard metrics and also track the different errors made by systems. We showed that the top off-the-shelf systems do not perform well on these metrics: they output many clusters with no link to any name or with a link to an incorrect name. We showed that similar issues are not prevalent for other entity types. We presented a simple NER-driven algorithm (NER-DE) for the named person coreference task that performs better than top off-the-shelf systems on the new metrics and on par with them on the existing metrics.


  • [Artiles et al.2010] Artiles, J.; Borthwick, A.; Gonzalo, J.; Sekine, S.; and Amigó, E. 2010. Weps-3 evaluation campaign: Overview of the web people search clustering and attribute extraction tasks. In CLEF.
  • [Bagga and Baldwin1998] Bagga, A., and Baldwin, B. 1998. Entity-based cross-document coreferencing using the vector space model. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, COLING-ACL ’98, August 10-14, 1998, Université de Montréal, Montréal, Quebec, Canada. Proceedings of the Conference, 79–85.
  • [Cardie and Wagstaf1999] Cardie, C., and Wagstaf, K. 1999. Noun phrase coreference as clustering. In 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.
  • [Clark and Manning2015] Clark, K., and Manning, C. D. 2015. Entity-centric coreference resolution with model stacking. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, 1405–1415.
  • [Clark and Manning2016] Clark, K., and Manning, C. D. 2016. Deep reinforcement learning for mention-ranking coreference models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, 2256–2262.
  • [De Marneffe, Recasens, and Potts2015] De Marneffe, M.-C.; Recasens, M.; and Potts, C. 2015. Modeling the lifespan of discourse entities with application to coreference resolution. J. Artif. Int. Res. 52(1):445–475.
  • [Durrett and Klein2014] Durrett, G., and Klein, D. 2014. A joint model for entity analysis: Coreference, typing, and linking. TACL 2:477–490.
  • [Finkel, Grenager, and Manning2005] Finkel, J. R.; Grenager, T.; and Manning, C. D. 2005. Incorporating non-local information into information extraction systems by gibbs sampling. In ACL 2005, 43rd Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 25-30 June 2005, University of Michigan, USA, 363–370.
  • [Han and Sun2011] Han, X., and Sun, L. 2011. A generative entity-mention model for linking entities with knowledge base. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, 945–954.
  • [Honnibal and Johnson2015] Honnibal, M., and Johnson, M. 2015. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1373–1378. Lisbon, Portugal: Association for Computational Linguistics.
  • [Ji and Grishman2011] Ji, H., and Grishman, R. 2011. Knowledge base population: Successful approaches and challenges. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, 1148–1158.
  • [Khashabi et al.2018] Khashabi, D.; Sammons, M.; Zhou, B.; Redman, T.; Christodoulopoulos, C.; Srikumar, V.; Rizzolo, N.; Ratinov, L.; Luo, G.; Do, Q.; Tsai, C.-T.; Roy, S.; Mayhew, S.; Feng, Z.; Wieting, J.; Yu, X.; Song, Y.; Gupta, S.; Upadhyay, S.; Arivazhagan, N.; Ning, Q.; Ling, S.; and Roth, D. 2018. Cogcompnlp: Your swiss army knife for nlp. In 11th Language Resources and Evaluation Conference.
  • [Lee et al.2017] Lee, K.; He, L.; Lewis, M.; and Zettlemoyer, L. 2017. End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, 188–197.
  • [Luo2005] Luo, X. 2005. On coreference resolution performance metrics. In HLT/EMNLP 2005, Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 6-8 October 2005, Vancouver, British Columbia, Canada, 25–32.
  • [Mann2002] Mann, G. S. 2002. Fine-grained proper noun ontologies for question answering. In Proceedings of the 2002 workshop on Building and using semantic networks-Volume 11, 1–7. Association for Computational Linguistics.
  • [Mihalcea and Csomai2007] Mihalcea, R., and Csomai, A. 2007. Wikify!: Linking documents to encyclopedic knowledge. In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM ’07, 233–242.
  • [Pan et al.2015] Pan, X.; Cassidy, T.; Hermjakob, U.; Ji, H.; and Knight, K. 2015. Unsupervised entity linking with abstract meaning representation. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, 1130–1139.
  • [Pradhan et al.2007] Pradhan, S. S.; Hovy, E.; Marcus, M.; Palmer, M.; Ramshaw, L.; and Weischedel, R. 2007. Ontonotes: A unified relational semantic representation. International Journal of Semantic Computing 1(04):405–419.
  • [Pradhan et al.2014] Pradhan, S.; Luo, X.; Recasens, M.; Hovy, E.; Ng, V.; and Strube, M. 2014. Scoring coreference partitions of predicted mentions: A reference implementation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 30–35. Baltimore, Maryland: Association for Computational Linguistics.
  • [Radhakrishnan, Talukdar, and Varma2018] Radhakrishnan, P.; Talukdar, P.; and Varma, V. 2018. Elden: Improved entity linking using densified knowledge graphs.
  • [Raghunathan et al.2010] Raghunathan, K.; Lee, H.; Rangarajan, S.; Chambers, N.; Surdeanu, M.; Jurafsky, D.; and Manning, C. D. 2010. A multi-pass sieve for coreference resolution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP 2010, 9-11 October 2010, MIT Stata Center, Massachusetts, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, 492–501.
  • [Ratinov and Roth2009] Ratinov, L., and Roth, D. 2009. Design challenges and misconceptions in named entity recognition. In CoNLL.
  • [Stoyanov et al.2009] Stoyanov, V.; Gilbert, N.; Cardie, C.; and Riloff, E. 2009. Conundrums in noun phrase coreference resolution: Making sense of the state-of-the-art. In ACL 2009, Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2-7 August 2009, Singapore, 656–664.
  • [Tuggener2014] Tuggener, D. 2014. Coreference resolution evaluation for higher level applications. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers, 231–235.
  • [Vilain et al.1995] Vilain, M.; Burger, J.; Aberdeen, J.; Connolly, D.; and Hirschman, L. 1995. A model-theoretic coreference scoring scheme. In Proceedings of the 6th Conference on Message Understanding, MUC6 ’95, 45–52.
  • [Wacholder, Ravin, and Choi1997] Wacholder, N.; Ravin, Y.; and Choi, M. 1997. Disambiguation of proper names in text. In Proceedings of the Fifth Conference on Applied Natural Language Processing, ANLC ’97, 202–208.
  • [Weerkamp et al.2011] Weerkamp, W.; Berendsen, R.; Kovachev, B.; Meij, E.; Balog, K.; and de Rijke, M. 2011. People searching for people: analysis of a people search engine log. In Proceeding of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, Beijing, China, July 25-29, 2011, 45–54.
  • [West et al.2014] West, R.; Gabrilovich, E.; Murphy, K.; Sun, S.; Gupta, R.; and Lin, D. 2014. Knowledge base completion via search-based question answering. In Proceedings of the 23rd international conference on World wide web, 515–526. ACM.
  • [Wick et al.2009] Wick, M.; Culotta, A.; Rohanimanesh, K.; and McCallum, A. 2009. An entity based model for coreference resolution. In Proceedings of the 2009 SIAM International Conference on Data Mining, 365–376. SIAM.
  • [Zhou, Ticrea, and Hovy2005] Zhou, L.; Ticrea, M.; and Hovy, E. 2005. Multi-document biography summarization. arXiv preprint cs/0501078.