Qualitative and Quantitative Analysis of Diversity in Cross-document Coreference Resolution Datasets

by   Anastasia Zhukova, et al.

Cross-document coreference resolution (CDCR) datasets, such as ECB+, contain manually annotated event-centric mentions of events and entities that form coreference chains with identity relations. ECB+ is a state-of-the-art CDCR dataset that focuses on the resolution of events and their descriptive attributes, i.e., actors, location, and date-time. NewsWCL50 is a dataset that annotates coreference chains of both events and entities with a strong variance of word choice and more loosely related coreference anaphora, e.g., bridging or near-identity relations. In this paper, we qualitatively and quantitatively compare the annotation schemes of ECB+ and NewsWCL50 using multiple criteria. We propose a phrasing diversity metric (PD) that compares lexical diversity within coreference chains on a more detailed level than previously proposed metrics, e.g., the number of unique lemmas. We discuss the different tasks that the two CDCR datasets create, i.e., lexical disambiguation and lexical diversity challenges, and propose a direction for further CDCR evaluation.




1 Introduction

Cross-document coreference resolution (CDCR) is a set of techniques that aims to resolve mentions of events and entities across a set of related documents. CDCR is employed as an essential analysis component in a broad spectrum of use cases, e.g., to identify potential targets in sentiment analysis or as a part of discourse interpretation.

Although CDCR research has been gaining attention, annotation schemes and the corresponding datasets have rarely explored the mix of identity and loose coreference relations or the lexical diversity of the annotated chains of mentions. For example, the resolution of identity relations, i.e., coreference resolution, and the resolution of more loose relations, i.e., bridging, are typically split into two separate tasks Kobayashi and Ng (2020). Resolving and evaluating entity and event mentions that are linked by both types of relations remains a research gap in CDCR.

In this paper, we explore the qualitative and quantitative characteristics of two CDCR annotation schemes that annotate both events and entities: (1) ECB+, a state-of-the-art event-centric CDCR dataset Cybulska and Vossen (2014) that annotates mentions with identity coreference relations, and (2) NewsWCL50, an experimental concept-centric dataset for identification of semantic concepts that contain mentions with a mix of identity, loose coreferential, and bridging relations, e.g., context-dependent synonyms, metonyms, and meronyms Hamborg et al. (2019). We propose a phrasing diversity metric (PD) that describes the variation of wording in annotated coreference chains and measures the lexical complexity of CDCR datasets. Unlike the number of unique lemmas, i.e., a previously used metric for lexical diversity Eirew et al. (2021), the proposed PD captures higher lexical variation in coreference chains. We discuss the CDCR tasks that each of the datasets creates for CDCR models, i.e., lexical disambiguation and lexical diversity challenges. Finally, we continue the discussion of the future of CDCR evaluation Bugert et al. (2021) and propose a task-driven CDCR evaluation that aims at improved robustness of CDCR models.

2 Related work

Coreference resolution (CR) and cross-document coreference resolution (CDCR) are tasks that aim to resolve coreferential mentions in one or multiple documents, respectively Singh et al. (2019). (CD)CR approaches tend to depend on the annotation schemes of the CDCR datasets that specify a definition of mentions and coreferential relations Bugert et al. (2021). The established CDCR datasets are typically event-centric O’Gorman et al. (2016); Mitamura et al. (2017, 2015); Cybulska and Vossen (2014); Hong et al. (2016); Bugert et al. (2020); Vossen et al. (2018), i.e., the presence of an event triggers the annotation of a mention, or concept-centric Weischedel et al. (2011); Recasens et al. (2010); Hasler et al. (2006); Minard et al. (2016), i.e., mentions are annotated if an antecedent has a minimum number of coreferential mentions of entities or events occurring in the documents.

Most (CD)CR datasets contain only strict identity relations, e.g., TAC KBP Mitamura et al. (2017, 2015), ACE Linguistic Data Consortium and others (2008, 2005), MEANTIME Minard et al. (2016), OntoNotes Weischedel et al. (2011), ECB+ Bejan and Harabagiu (2010); Cybulska and Vossen (2014), GVC Vossen et al. (2018), and FCC Bugert et al. (2020). Less commonly used (CD)CR datasets explore relations beyond strict identity. For example, NiDENT Recasens et al. (2012) is a CDCR dataset of entity-only mentions that was created by reannotating NP4E. NiDENT explores coreferential mentions with more loose coreference relations, coined near-identity, which include, among others, metonymy, e.g., “White House” to refer to the US government, and meronymy, e.g., “US president” being a part of the US government and representing it. Richer Event Description (RED), a dataset for CR, also contains more loose coreference relations among events O’Gorman et al. (2016).

Mentions that are coreferential with more loose relations are harder to annotate and to resolve automatically than mentions with identity relations Recasens et al. (2010). Bridging relations occur when a connection between mentions is implied but is not strict, e.g., a “part-of” relation. Bridging relations, unlike identity relations, form a link between nouns that do not match in grammatical constraints, i.e., gender and number agreement, and allow linking noun and verb mentions, thus constructing abstract entities Kobayashi and Ng (2020). The existing datasets for identification of bridging relations, e.g., ISNotes Hou et al. (2018), BASHI Rösiger (2018), and ARRAU Poesio and Artstein (2008), annotate the relations only of noun phrases (NPs) on a single-document level and treat the problem as antecedent identification rather than identification of a set of coreferential anaphora Hou et al. (2018). The GUM corpus Zeldes (2017) annotates both coreference and bridging relations but as two separate tasks. Definition identification in the DEFT dataset Spala et al. (2019) focuses on annotating mentions that are linked with “definition-like” verb phrases (e.g., means, is, defines, etc.) but does not address linking the antecedents and definitions into coreferential chains.

In the following sections, we discuss and contrast two perspectives on annotating coreferential relations in CDCR datasets: coreferential relations of identity strength in event-centric chains of mentions and more loose coreference relations, i.e., a combination of identity and bridging anaphoric relations such as synonymy, metonymy, and meronymy Poesio and Vieira (1998); Rösiger (2018), in concept-centric chains. We aim to explore whether loose coreference relations pose a challenge for the CDCR task and whether this challenge can be solved with feature-based approaches.

3 Coreference anaphora in CDCR

We compare ECB+ Cybulska and Vossen (2014), a state-of-the-art CDCR dataset, and NewsWCL50 Hamborg et al. (2019), a dataset that annotates frequently reported concepts in news articles that tend to contain phrases biased by word choice and labeling. The two datasets annotate both event and entity mentions but differ significantly in how they define coreference chains. To the best of our knowledge, despite focusing on the problem of media bias, NewsWCL50 is the only CDCR-related dataset that annotates coreferential chains with identity and more loose relations together in NPs and verb phrases (VPs).

3.1 Qualitative comparison

We qualitatively compare the datasets regarding three main factors that represent viable characteristics in CDCR. First, properties of the topic composition of the CDCR datasets. Second, how each dataset defines which phrases should become coreference mentions, i.e., candidates for a single coreference chain, and how these phrases should be annotated. Third, how coreferential chains are constructed, i.e., which relations link the coreferential anaphora.

3.1.1 Dataset structure

CDCR is performed on a collection of narrowly-related articles based on topic or event similarity.

NewsWCL50 contains one level of topically related news articles that report on the same event. Although they report on the same event, some parts of these news articles discuss different aspects of it. For example, in topic 5 of NewsWCL50, while generally reporting on Trump visiting the UK, some news articles focused on Trump’s plans during this visit, some compared his visit to the visits of other US presidents, and others focused on the demonstrators opposing this visit.

ECB+ contains news articles that are related on two levels: first, on the general topic level of an event, e.g., earthquake, and second, on the subtopic level of a specific event that is described by event, actors, location, and time. Therefore, CDCR on ECB+ is possible on the topic or the subtopic level.

For ECB+, Cybulska and Vossen (2014) proposed a split into train, validation, and test subsets. In contrast, Hamborg et al. (2019) used the entire corpus of NewsWCL50 as a test set for their unsupervised approach due to the small number of annotated topics (see Table 3).

Since the resolution of ECB+ consists of 43 topics with two subtopics per each topic. Cybulska and Vossen (2014) Upadhyay et al. (2016) and Cattan et al. (2021) measure complexity a

Chain’s name | Coding book | Annotated mentions
36: Warren Jeffs (entity) | ECB+ | t36_warren_jeffs: attorney; FLDS leader’s; he; head; him; his; Jeffs; leader; leader Warren Jeffs; pedophile; Polygamist; polygamist leader Warren Jeffs; Polygamist prophet Warren Jeffs; polygamist sect leader Warren Jeffs; Polygamist Warren Jeffs; Warren Jeffs; Warren Jeffs, Polygamist Leader; who
36: Warren Jeffs (entity) | NewsWCL50 | a handful from day one; a problem; a victim of religious persecution; an accomplice for his role; an accomplice to rape by performing a marriage involving an underage girl; an accomplice to sexual conduct with a minor; an accomplice to sexual misconduct with minors; an accomplice to the rape of a 14-year-old girl; FLDS prophet Warren Jeffs; God’s spokesman on earth; her father; his client; Jeffs; Jeffs, 54; Jeffs, who acted as his own attorney; Jeffs, who was indicted more than two years ago; Mr. Jeffs; one individual, Warren Steed Jeffs; one of the most wicked men on the face of the earth since the days of Father Adam; penitent; Polygamist prophet Warren Jeffs; polygamist sect leader Warren Jeffs; polygamist Warren Jeffs; president; prophet of the Fundamentalist Church of the Jesus Christ of the Latter Day Saints; prophet Warren Jeffs; stone-faced; The 54-year-old Jeffs; the defendant; the ecclesiastical head of the Fundamentalist Church of Jesus Christ of Latter Day Saints; the father of a 15-year-old FLSD member’s child; the highest-profile defendant; the prophet; the self-styled prophet; their client; their spiritual leader; This individual; Warren Jeffs; Warren Jeffs, leader of the Fundamentalist Church of Jesus Christ of Latter Day Saints; Warren Jeffs, polygamist leader
39: Become new Dr.Who (event) | ECB+ | t39_capaldi_play_doc: play, take on, take up
39: Become new Dr.Who (event) | ECB+ | t39_replace_smith: replace, replacing, stepped into, take over, takes over
39: Become new Dr.Who (event) | ECB+ | t39_play_doc: one, play, role
39: Become new Dr.Who (event) | NewsWCL50 | about to play the best part on television; an incredible incarnation of number 12; become one of the all-time classic Doctors; become the next Doctor Who; becoming the 12th Doctor; becoming the next Doctor; Being asked to play The Doctor; being the doctor; had been chosen; get started; getting it; has been announced as the new star of BBC sci-fi series Doctor Who; has been named as the 12th actor to play the Doctor; his unique take on the Doctor; is about to play the best part on television; is to be the new star of Doctor Who; might be the right person to take on this iconic part; play it; play the Doctor; revealed as 12th Doctor; stepped into Matt Smith’s soon to be vacant Doctor Who shoes; take over from Matt Smith as Doctor Who; take the role for days; takes over Doctor Who Tardis; the Doctor’s appointment; replace Matt Smith on Doctor Who; will be the 12th actor to play the Doctor; will play the 12th Doctor; will replace Matt Smith; will take over from Smith
Table 1: Comparison of mentions as manually annotated chains in ECB+ and when re-annotated as concepts following the NewsWCL50 coding book (B).

3.1.2 Mentions

In the ECB+ dataset, the annotation process is centered around events as the main concept-holders. Each event is described using four components, i.e., action, time, location, and participant Cybulska and Vossen (2014). Due to this focus on events, NPs and VPs are annotated as mentions only if they define an event. This results in (1) many in-text mentions of entities not being annotated and thus not becoming part of annotated coreference chains or of new non-event-centric coreference chains, and (2) many singleton mentions of events or entities, e.g., a location that occurred only once when describing an event.

In contrast, in NewsWCL50, all mentions are annotated if at least five candidate phrases occur in at least one document in NewsWCL50 Hamborg et al. (2019). The goal is to identify the most frequently reported concepts within the related news articles. NewsWCL50 annotates mentions of NPs and VPs. The annotation scheme distinguishes the mention chains referring to actors, actions, objects, events, geo-political entities (GPE), and abstract entities Poesio and Artstein (2008). NewsWCL50 does not annotate date-time and annotates locations only if they are frequently reported concepts.

The length of a mention defines which tokens of a phrase are annotated as the mention. Following a “minimum span” strategy, ECB+ demands the annotation of only the phrases with the smallest number of words that preserve the core meaning of a phrase. This annotation style often leads to annotating only the heads of phrases. In contrast, the NewsWCL50 annotation scheme follows a “maximum span” style that yields longer phrases. A maximum span style captures the modifiers of the heads of phrases and the overall meaning that can change with the modifiers, e.g., compare “kids” to “undocumented kids who did nothing wrong” and “American kids of undocumented immigrant parents.” ECB+ focuses on identifying the most reliable spans if they are to be extracted, whereas NewsWCL50 focuses on the annotation of the longest coreferential phrase because the modifying words may carry the bias by word choice and labeling.

3.1.3 Relations

The annotation schemes differ in the strength of the coreferential relations that link the mentions, i.e., identity or more loose relations that include bridging and near-identity.

ECB+ links mentions into chains if (1) mentions are related with strict identity, e.g., “the United States” – “the U.S.”, (2) VP-mentions belong to the same event in terms of participant, location, and time, or (3) NP-mentions refer to the same named entity (NE), e.g., country, organization, location, person, or to the same entity in terms of action, time, and location, and match grammatical constraints, e.g., number and gender Cybulska and Vossen (2014). Therefore, if an article that reports on immigration to the US covers two points in time a couple of weeks apart, or if the immigrants are located at multiple spots along the immigration route, annotators need to create multiple separate entities for the entity “immigrants.”

NewsWCL50 links the mentions if they are coreferential with a mix of identity and bridging relations Poesio and Vieira (1998): (1) mentions refer to the same entity/event including subject-predicate coreference (e.g., “Donald Trump” – “the newly-minted president”), (2) synonym relations (e.g., “a caravan of immigrants” – “a few hundred asylum seekers”), (3) metonymy (e.g., “the Russian government” – “the Kremlin”), (4) meronymy/holonymy (e.g., “the U.S.-Mexico border” – “the San Ysidro port of entry”), (5) mentions are linked with definition-like verbs or phrases that establish association (e.g., “Kim Jong Un” – “Little Rocket Man” or “crossing into the U.S.” – “a crisis at the border”), (6) mentions meet the GPE definitions of the ACE annotation Linguistic Data Consortium and others (2008) (e.g., “the U.S.” – “American people”), (7) mentions are elements of one set (e.g., “guaranteed to disrupt trade” – “drove down steel prices” – “increased frictions” as members of a set “Consequences of tariff imposition”). The annotation scheme requires annotating the generic use of mentions with “part-of” relations referring to GPEs. For example, mentions of the police are always annotated as referring to the U.S. because the biased language used to report about the police can affect the perception of the U.S. in general Hamborg (2019).

3.1.4 Reannotation experiment

To provide a qualitative example of the difference between the annotation schemes, we followed the coding book of Hamborg (2019) and reannotated an entity and events of ECB+.

Table 1 shows a qualitative example of how coreference chains of the two annotation schemes differ. The table shows how exemplary mentions are annotated in ECB+ and in NewsWCL50. The NewsWCL50 coding book yields coreference chains that (1) annotate any coreferential mentions occurring in the text, not only those related to an event, and (2) contain a mix of strict identity and loose bridging coreferential anaphora.

The reannotated ECB+ entities contain more mentions compared to the original ECB+ annotation. ECB+ annotated entities in an event-centric way whereas NewsWCL50 annotated any occurrence of mentions referring to the same entity/event. Moreover, NewsWCL50’s coding book annotates coreferential mentions that were used to describe an entity via association with this entity, e.g., “Warren Jeffs” – “God’s spokesman on earth” or “become one of the all-time classic Doctors” – “an incredible incarnation of number 12.”

3.1.5 Summary

ECB+ annotations cover narrowly defined entities that describe an event as precisely as possible regarding the event’s participants, action, time, and location. In contrast, the annotation scheme employed for the creation of NewsWCL50 aims at determining broader coreference chains that include mentions of entities that would not have been annotated or would be split into multiple chains if the ECB+ annotation scheme were used.

3.2 Quantitative comparison

Besides the previously discussed qualitative differences, we also quantitatively compare the annotation schemes. We compare the lexical diversity of coreference chains, their sizes, numerical parameters of the corresponding datasets, and annotation complexity measured by inter-annotator reliability.

3.2.1 Lexical diversity

Lexical diversity indicates how the wording differs within annotated concepts and how easy or hard the task of resolving such mentions is. We evaluate lexical diversity with three metrics: (1) the CoNLL F1 score of a primitive same-head-lemma baseline Cybulska and Vossen (2014); Pradhan et al. (2012), (2) the number of different head lemmas in coreference chains Eirew et al. (2021), and (3) a new metric for lexical diversity that accounts for more within-chain phrase variation than the number of unique lemmas. To ensure a fairer comparison of the datasets, we calculate all metrics for lexical diversity on the annotated chains that exclude singletons. Table 3 summarizes all metrics and presents general statistics of the datasets, e.g., the number of annotated mentions and coreference chains.

Accuracy of a lemma CDCR baseline

A baseline method of CDCR is to resolve mentions with the same lemmas within related sets of documents. The accuracy of the lemma baseline indicates lexical variation: the better this method performs, the lower the lexical variation Cybulska and Vossen (2014). Table 3 reports that the same-lemma baseline performs better on the chains in the ECB+ dataset, which indicates smaller lexical diversity of the mentions in ECB+ compared to NewsWCL50. Coreferential anaphora with more loose relations increases the number of possible combinations of the predicted chains and thus increases the complexity of the CDCR task Kobayashi and Ng (2020).
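The same-lemma baseline can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the mention tuples, topic labels, and head lemmas are hypothetical toy data supplied by hand (a real setup would obtain head lemmas from a parser and lemmatizer), and scoring the predicted chains with CoNLL F1 is not shown.

```python
from collections import defaultdict

# Hypothetical toy input: (mention_id, topic, head_lemma).
# The head lemmas are given by hand here; they are not taken from ECB+ or NewsWCL50.
mentions = [
    ("m1", "t1", "trump"),
    ("m2", "t1", "president"),
    ("m3", "t1", "trump"),
    ("m4", "t2", "president"),
    ("m5", "t1", "visit"),
]


def same_lemma_baseline(mentions):
    """Put mentions into one chain if they share a head lemma within the same topic."""
    clusters = defaultdict(list)
    for mention_id, topic, head_lemma in mentions:
        clusters[(topic, head_lemma)].append(mention_id)
    return sorted(clusters.values())


predicted_chains = same_lemma_baseline(mentions)
print(predicted_chains)
```

In this toy input, "trump" and "president" in topic t1 end up in different chains even if they refer to the same person, which is the kind of miss that lowers the baseline's score on lexically diverse chains.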

Number of unique lemmas of phrases’ heads

Eirew et al. (2021) proposed to measure lexical diversity with the average number of unique lemmas of phrases’ heads in coreference chains. Similar to the lemma baseline, the number of unique lemmas shows that NewsWCL50 contains almost four times more diverse coreference chains than ECB+ (see Table 3).

Unique lemmas as a metric for lexical diversity, as used in Eirew et al. (2021), is best suited to a “minimum span” mention annotation style, i.e., annotating mainly the heads of phrases. Unfortunately, this metric does not account for the larger phrasing diversity in a “maximum span” style, i.e., the annotation of the largest spans that include all head modifiers instead of only the minimum descriptive ones. Figure 1 depicts the trends in the two annotation styles, where the numbers of unique lemmas in the coreference chains are plotted against the sizes of the chains. Although the average value of NewsWCL50 is higher than that of ECB+, the trends suggest that larger subtopics of many documents annotated with the ECB+ coding book would yield similarly diverse coreference chains as NewsWCL50.

Figure 1: Very subtle difference between the trends in lexical diversity when it is measured with unique lemmas per coreference chain, as proposed by Eirew et al. (2021).
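As a rough illustration of this metric, the following sketch averages the number of unique head lemmas over non-singleton chains. The chain names and head lemmas are invented toy values, and head extraction and lemmatization are assumed to have been done beforehand.

```python
# Hypothetical toy chains: each chain is the list of head lemmas of its mentions.
chains = {
    "chain_a": ["trump", "president", "trump", "trump"],
    "chain_b": ["immigrant", "caravan", "migrant"],
    "chain_c": ["border"],  # singleton, excluded from the average
}


def average_unique_head_lemmas(chains):
    """Average number of distinct head lemmas per chain, ignoring singleton chains."""
    non_singletons = [heads for heads in chains.values() if len(heads) > 1]
    if not non_singletons:
        return 0.0
    return sum(len(set(heads)) for heads in non_singletons) / len(non_singletons)


average = average_unique_head_lemmas(chains)
print(average)
```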
Figure 2: (a) Comparison of coreference chains from ECB+ and NewsWCL50 with a phrasing diversity metric (PD): the trend of lexical diversity of NewsWCL50 is higher than ECB+. (b) The phrasing diversity metric (PD) proposed in Section 3.2 distinguishes between the levels of diversity measured by unique lemmas.

Phrasing diversity metric

We introduce a new phrasing diversity metric (PD) that represents the variation of phrases used to refer to the same entity/event. PD measures the lexical variation of a coreference chain using the mention frequency and phrasing diversity. For an annotated chain, it combines four quantities: the set of unique head words of the chain, the number of unique phrases with a given head, the number of all phrases with that head, and the number of all mentions of the chain.

Lastly, to aggregate PD per dataset, we compute a weighted average over all corresponding chains.

To ensure a fair comparison of the lexical diversity of ECB+ to NewsWCL50, we need to normalize the annotation styles of the mentions between two schemes, i.e., minimum versus maximum spans of annotated mentions. We automatically expand the annotated mentions from ECB+ by extracting NPs and VPs of maximum length from the structure parsing trees that contain the annotated heads of phrases. Such phrase expansion seeks to minimize the cases in ECB+ when is very small.

Similar to Figure 1, Figure 2a plots PD against the sizes of coreference chains for each dataset and shows the higher lexical diversity of NewsWCL50 compared to ECB+. The trends suggest that coreference chains of ECB+ are narrowly defined: the large coreference chains have lower values of PD than NewsWCL50, i.e., the chains contain a lot of repetitive mentions. The figure shows that a large majority of the ECB+ concepts do not exceed the size of 20 and PD of 5, whereas NewsWCL50’s concepts are distributed over the entire plot. Unlike ECB+, coreference chains of NewsWCL50 are more broadly-defined, i.e., have large values of both PD and an average size of coreference chains (see Table 3). This finding supports the qualitative example in Section 3.1.4 where reannotating leads to the increased size of original ECB+ chains due to (1) annotated mentions with more loose coreference relations, (2) merging together previously annotated chains of ECB+ leading to more broadly-defined chains.

Chain | Mention | # repetitions | # unique lemmas | Phrasing diversity (PD)
chain 1 | Donald Trump | 3 | 2 | 1.1
chain 1 | the president | 1 | |
chain 1 | President Donald Trump | 1 | |
chain 1 | Mr. Trump | 1 | |
chain 2 | undocumented immigrants | 1 | 2 | 2.0
chain 2 | immigrants seeking hope | 1 | |
chain 2 | unauthorized immigrants | 1 | |
chain 2 | migrant caravan | 1 | |
chain 2 | a caravan of Central American migrants | 1 | |
chain 2 | a caravan of hundreds of migrants | 1 | |
Table 2: A comparison of two metrics of lexical diversity: unique lemmas and phrasing diversity (PD). While the number of unique lemmas is identical for the two chains in the example, PD accounts more for the variation in the phrases.

Figure 2b shows the difference between unique lemmas and the proposed PD metric, i.e., chains with the same number of unique lemmas have higher or lower PD depending on the fraction of repetitive phrases to be resolved. Consider the example in Table 2.

The example shows that for the same number of unique lemmas a PD metric may differ and show higher or lower variation depending on both a number of unique head lemmas and variation of phrases with these heads. Figure 2b depicts that for the same number of unique lemmas NewsWCL50 annotates chains with larger PD than ECB+. Therefore, PD is capable of indicating larger lexical variation than unique lemmas.
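The quantities that PD builds on can be counted directly for the two chains of Table 2. The sketch below only tallies those inputs (unique phrases and all phrases per head, and all mentions per chain) and does not compute PD itself. The assignment of each phrase to a head lemma is an assumption made for illustration; it is chosen so that both chains have two unique head lemmas, as Table 2 reports.

```python
from collections import defaultdict

# (phrase, assumed head lemma, repetitions); phrases and counts follow Table 2,
# the head lemmas are an illustrative assumption.
chain_1 = [
    ("Donald Trump", "trump", 3),
    ("the president", "president", 1),
    ("President Donald Trump", "trump", 1),
    ("Mr. Trump", "trump", 1),
]
chain_2 = [
    ("undocumented immigrants", "immigrant", 1),
    ("immigrants seeking hope", "immigrant", 1),
    ("unauthorized immigrants", "immigrant", 1),
    ("migrant caravan", "caravan", 1),
    ("a caravan of Central American migrants", "caravan", 1),
    ("a caravan of hundreds of migrants", "caravan", 1),
]


def head_statistics(chain):
    """Return {head: (unique phrases, all phrases)} and the total number of mentions."""
    unique_phrases = defaultdict(set)
    all_phrases = defaultdict(int)
    for phrase, head, repetitions in chain:
        unique_phrases[head].add(phrase.lower())
        all_phrases[head] += repetitions
    per_head = {h: (len(unique_phrases[h]), all_phrases[h]) for h in unique_phrases}
    return per_head, sum(all_phrases.values())


stats_1, size_1 = head_statistics(chain_1)
stats_2, size_2 = head_statistics(chain_2)
print(stats_1, size_1)
print(stats_2, size_2)
```

Both chains have two heads and six mentions, but in chain 1 only three of the five phrases with the head "trump" are distinct, whereas every phrase in chain 2 is distinct. This difference is what a count of unique lemmas alone cannot show.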

3.2.2 Inter-annotator reliability and general statistics

We explore inter-annotator reliability (IAR, also called inter-coder reliability, ICR). For the strictly formalized ECB+, Cybulska and Vossen (2014) reported Cohen’s Kappa for the annotation of the event/entity mentions and for connecting these mentions into coreferential chains. For the more flexible NewsWCL50 scheme, Hamborg et al. (2019) reported an average observed agreement (AOA) Byrt et al. (1993). While these measures cannot be compared directly, they indicate that schemes including mentions with more loosely related anaphora may result in lower IAR, since AOA is considered less restrictive than Kappa Recasens et al. (2012); Byrt et al. (1993).
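To make the difference between the two agreement measures concrete, the sketch below computes observed agreement and Cohen's Kappa for two annotators on the same items. The label sequences are invented toy data, not values from either dataset, and the functions are a minimal textbook formulation rather than the procedure used by the cited authors.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Fraction of items on which both annotators chose the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    p_o = observed_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_e == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotations of eight mentions by two annotators.
annotator_a = ["actor", "actor", "event", "actor", "event", "actor", "actor", "actor"]
annotator_b = ["actor", "actor", "event", "actor", "actor", "actor", "event", "actor"]

agreement = observed_agreement(annotator_a, annotator_b)
kappa = cohens_kappa(annotator_a, annotator_b)
print(agreement, kappa)
```

On this toy input the observed agreement is 0.75 while Kappa is about 0.33, because the frequent "actor" label makes chance agreement likely. This illustrates why the same annotations can look more reliable under observed agreement than under Kappa.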

3.2.3 Summary

Table 3 reports numeric properties, i.e., the number of annotated documents and mentions, and quantitative description parameters of the datasets, i.e., lexical diversity. On the one hand, ECB+ includes a larger number of topics and articles compared to NewsWCL50 (43/982 against 10/50) and annotates almost three times more mentions (15122 and 5696). On the other hand, the density of annotations per article is higher in NewsWCL50, i.e., 112 annotations per article in NewsWCL50 against 15 in ECB+, which reflects longer documents in NewsWCL50 and a more thorough annotation of mentions.

We evaluated the complexity of the coreference chains with the average number of contained mentions, i.e., the size of the coreference chains, and multiple metrics of lexical diversity. To ensure a fair comparison, we removed singleton chains from the analysis of the datasets’ parameters. Table 3 shows that coreference chains of ECB+ are on average more than 4.5 times smaller than those of NewsWCL50 and that their lexical diversity measured by PD is 5.7 times lower. Moreover, NewsWCL50 has a lower F1 score on the head-lemma baseline on a subtopic level, thus indicating a more complex task for a CDCR model.

Criteria | ECB+ | NewsWCL50
# topics | 43 | 10
# subtopics | 86 | 10
# articles | 982 | 50
# mentions | 15 122 | 5 696
# chains | 4 965 | 171
# singletons | 3 006 | 11
average chain size* | 7.2 | 33.4
average # unique lemmas* | 3.2 | 12.0
Phrasing diversity (PD)* | 1.3 | 7.4
F1 of the same-head-lemma baseline* | 64.4 | 49.9
Table 3: Quantitative comparison of ECB+ and NewsWCL50. The values marked with a star (*) are calculated for the versions of the datasets without singletons.
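The relative differences between the datasets can be recomputed from the starred rows of Table 3. The short calculation below is only a convenience sketch; the dictionary keys are ad hoc labels, and the values are copied from the table.

```python
# (ECB+, NewsWCL50) values of the starred rows of Table 3.
table_3 = {
    "average chain size": (7.2, 33.4),
    "average # unique lemmas": (3.2, 12.0),
    "phrasing diversity (PD)": (1.3, 7.4),
}

# How many times larger each value is in NewsWCL50 than in ECB+.
ratios = {name: round(news / ecb, 2) for name, (ecb, news) in table_3.items()}

for name, ratio in ratios.items():
    print(f"{name}: NewsWCL50 / ECB+ = {ratio}")
```

The ratios come out at roughly 4.6 for chain size, 3.75 for unique lemmas, and 5.7 for PD, so PD separates the two datasets more strongly than the count of unique lemmas does.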

4 Discussion

In our quantitative and qualitative analysis, ECB+ showed a lower average size of coreference chains and lower lexical diversity according to all three metrics, i.e., F1 of a primitive CDCR method, the number of unique lemmas of phrases’ heads, and the phrasing diversity metric (PD). NewsWCL50 shows that annotating identity, synonym, metonymy/meronymy, bridging, and subject-predicate coreference relations together yields an increase of the lexical diversity in the annotated chains (see Table 3). Consequently, the increased lexical diversity results in a higher level of semantic complexity and abstractness in the annotated coreferential chains, i.e., the resolution of some mentions requires an understanding of the context in which they appear. The increase in the lexical diversity and in the level of abstractness of coreference chains poses a new challenge to the established CDCR models.

Although ECB+ annotates narrowly-defined coreference chains with smaller lexical diversity on a subtopic level, ECB+ creates a lexical ambiguity challenge, i.e., CDCR models need to resolve mentions on the mixed documents of two subtopics, which contain verbs that refer to different events Cybulska and Vossen (2014); Upadhyay et al. (2016); Cattan et al. (2021). Therefore, ECB+ and NewsWCL50 establish two diverse CDCR tasks: a lexical ambiguity challenge and a high lexical diversity challenge.

This diversity of the annotation schemes supports the conclusion of Bugert et al. (2021) to evaluate CDCR models on multiple CDCR datasets. Bugert et al. (2021) evaluated the state-of-the-art CDCR models on multiple event-centric datasets and reported their unstable performance across these datasets. The authors suggested evaluating every CDCR model on three event-centric datasets and reporting the performance on the single-document, within-subtopic, and within-topic levels.

For further CDCR evaluation, we see a need to evaluate CDCR models on the two challenges: lexical ambiguity and lexical diversity. Evaluation should consist not only of multiple event-centric datasets but also include NewsWCL50, which focuses on the loose context-specific coreference relations, and an additional concept-centric dataset with identity coreference relations. Such a setup with diverse CDCR datasets will facilitate the fair and comparable evaluation of coreferential chains of various natures. Annotation of a large silver-quality CDCR dataset with multiple diverse annotation layers is a solution towards robust CDCR models Eirew et al. (2021).

5 Conclusion

CDCR research has a long history of focusing on the resolution of mentions annotated in an event-centric way, i.e., the occurrence of events trigger annotating mentions of events and entities as their attributes. We compared a state-of-the-art event-centric CDCR dataset, i.e., ECB+, to a concept-centric NewsWCL50 dataset that explored the identification of high variance in news articles as a CDCR task. The reviewed CDCR datasets reveal a large scope of relations that define coreferential anaphora, i.e, from strict identity in ECB+ to mixed identity and bridging of coreference relations in NewsWCL50.

We proposed a phrasing diversity metric (PD) that enabled a more fine-grained comparison of lexical diversity in coreference chains compared to the average number of unique lemmas. PD of NewsWCL50 is 5.7 times higher than ECB+. Such a high prevalence in lexical diversity in NewsWCL50 rises a new challenge in the CDCR research.

To ensure that CDCR models robustly handle data variations in the lexical disambiguation challenge, Bugert et al. (2021) proposed evaluating CDCR models on three event-centric datasets at multiple levels of coreference resolution, i.e., the document, subtopic, and topic levels. We propose an additional CDCR challenge, the lexical diversity challenge, which includes CDCR datasets that annotate coreference chains with high lexical variance, e.g., NewsWCL50.


References

  • Linguistic Data Consortium et al. (2005) ACE (Automatic Content Extraction) English annotation guidelines for events. Version 5.4.3. ACE.
  • C. A. Bejan and S. Harabagiu (2010) Unsupervised event coreference resolution with rich linguistic features. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1412–1422.
  • M. Bugert, N. Reimers, S. Barhom, I. Dagan, and I. Gurevych (2020) Breaking the subtopic barrier in cross-document event coreference resolution. In Proceedings of Text2Story - Third Workshop on Narrative Extraction From Texts co-located with 42nd European Conference on Information Retrieval, Text2Story@ECIR 2020, Lisbon, Portugal, April 14th, 2020 [online only], R. Campos, A. M. Jorge, A. Jatowt, and S. Bhatia (Eds.), CEUR Workshop Proceedings, Vol. 2593, pp. 23–29.
  • M. Bugert, N. Reimers, and I. Gurevych (2021) Generalizing cross-document event coreference resolution across multiple corpora. Computational Linguistics, pp. 1–43.
  • T. Byrt, J. Bishop, and J. B. Carlin (1993) Bias, prevalence and kappa. Journal of Clinical Epidemiology 46 (5), pp. 423–429. ISSN 0895-4356.
  • A. Cattan, A. Eirew, G. Stanovsky, M. Joshi, and I. Dagan (2021) Realistic evaluation principles for cross-document coreference resolution. In Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics, Online, pp. 143–151.
  • A. Cybulska and P. Vossen (2014) Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, pp. 4545–4552.
  • A. Cybulska and P. Vossen (2014) Guidelines for ECB+ annotation of events and their coreference. Technical Report NWR-2014-1, VU University Amsterdam.
  • A. Eirew, A. Cattan, and I. Dagan (2021) WEC: deriving a large-scale cross-document event coreference dataset from Wikipedia. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 2498–2510.
  • F. Hamborg, A. Zhukova, and B. Gipp (2019) Automated identification of media bias by word choice and labeling in news articles. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL).
  • F. Hamborg (2019) Codebook: bias by word choice and labeling. Technical report, University of Konstanz.
  • L. Hasler, C. Orasan, and K. Naumann (2006) NPs for events: experiments in coreference annotation. In LREC, pp. 1167–1172.
  • Y. Hong, T. Zhang, T. O’Gorman, S. Horowit-Hendler, H. Ji, and M. Palmer (2016) Building a cross-document event-event relation corpus. In Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016), Berlin, Germany, pp. 1–6.
  • Y. Hou, K. Markert, and M. Strube (2018) Unrestricted Bridging Resolution. Computational Linguistics 44 (2), pp. 237–284. ISSN 0891-2017. https://direct.mit.edu/coli/article-pdf/44/2/237/1808960/coli_a_00315.pdf
  • H. Kobayashi and V. Ng (2020) Bridging resolution: a survey of the state of the art. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 3708–3721.
  • Linguistic Data Consortium et al. (2008) ACE (Automatic Content Extraction) English annotation guidelines for entities. Technical report, Linguistic Data Consortium.
  • A. Minard, M. Speranza, R. Urizar, B. Altuna, M. van Erp, A. Schoen, and C. van Son (2016) MEANTIME, the NewsReader multilingual event and time corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, pp. 4417–4422.
  • T. Mitamura, Z. Liu, and E. H. Hovy (2017) Events detection, coreference and sequencing: what’s next? Overview of the TAC KBP 2017 event track. In TAC.
  • T. Mitamura, Z. Liu, and E. Hovy (2015) Overview of TAC KBP 2015 event nugget track. In TAC.
  • T. O’Gorman, K. Wright-Bettner, and M. Palmer (2016) Richer event description: integrating event coreference with temporal, causal and bridging annotation. In Proceedings of the 2nd Workshop on Computing News Storylines (CNS 2016), Austin, Texas, pp. 47–56.
  • M. Poesio and R. Artstein (2008) Anaphoric annotation in the ARRAU corpus. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco.
  • M. Poesio and R. Vieira (1998) A corpus-based investigation of definite description use. Comput. Linguist. 24 (2), pp. 183–216. ISSN 0891-2017.
  • S. Pradhan, A. Moschitti, N. Xue, O. Uryupina, and Y. Zhang (2012) CoNLL-2012 shared task: modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL - Shared Task, Jeju Island, Korea, pp. 1–40.
  • M. Recasens, E. Hovy, and M. A. Martí (2010) A typology of near-identity relations for coreference (NIDENT). In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta.
  • M. Recasens, M. A. Martí, and C. Orasan (2012) Annotating near-identity from coreference disagreements. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, pp. 165–172.
  • I. Rösiger (2018) BASHI: a corpus of Wall Street Journal articles annotated with bridging links. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.
  • S. Singh, K. Bhattacharjee, H. Darbari, and S. Verma (2019) Analyzing coreference tools for NLP application. International Journal of Computer Sciences and Engineering 7, pp. 608–615. ISSN 2347-2693.
  • S. Spala, N. A. Miller, Y. Yang, F. Dernoncourt, and C. Dockhorn (2019) DEFT: a corpus for definition extraction in free- and semi-structured text. In Proceedings of the 13th Linguistic Annotation Workshop, Florence, Italy, pp. 124–131.
  • S. Upadhyay, N. Gupta, C. Christodoulopoulos, and D. Roth (2016) Revisiting the evaluation for cross document event coreference. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 1949–1958.
  • P. Vossen, F. Ilievski, M. Postma, and R. Segers (2018) Don’t annotate, but validate: a data-to-text method for capturing event data. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.
  • R. Weischedel, S. Pradhan, L. Ramshaw, M. Palmer, N. Xue, M. Marcus, A. Taylor, C. Greenberg, E. Hovy, R. Belvin, et al. (2011) OntoNotes release 4.0. LDC2011T03, Philadelphia, Penn.: Linguistic Data Consortium.
  • A. Zeldes (2017) The GUM corpus: creating multilayer resources in the classroom. Lang. Resour. Eval. 51 (3), pp. 581–612. ISSN 1574-020X.