Coreferential Reasoning Learning for Language Representation

04/15/2020 ∙ by Deming Ye, et al. ∙ Tsinghua University

Language representation models such as BERT can effectively capture contextual semantic information from plain text, and have been shown to achieve promising results on many downstream NLP tasks with appropriate fine-tuning. However, existing language representation models seldom consider coreference explicitly, that is, the relationship between noun phrases referring to the same entity, which is essential to a coherent understanding of the whole discourse. To address this issue, we present CorefBERT, a novel language representation model designed to capture the relations between noun phrases that co-refer to each other. Experimental results show that, compared with existing baseline models, CorefBERT makes significant progress on several downstream NLP tasks that require coreferential reasoning, while maintaining comparable performance to previous models on other common NLP tasks.




1 Introduction

Recently, language representation models (Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019a; Joshi et al., 2019a) have made significant strides in many natural language understanding tasks, such as natural language inference, sentiment classification, question answering, relation extraction, fact extraction and verification, and coreference resolution (Zhang et al., 2019; Sun et al., 2019; Talmor and Berant, 2019; Peters et al., 2019; Zhou et al., 2019; Joshi et al., 2019b). These models usually conduct self-supervised pre-training tasks over large-scale corpora to obtain informative language representations, which capture the contextual semantics of the input text.

Although existing language representation models have succeeded on many downstream tasks, they are still not sufficient to understand coreference in long texts. Pre-training tasks such as masked language modeling often lead the model to collect only local semantic and syntactic information to recover the masked tokens. Meanwhile, these models may ignore long-distance connections beyond the sentence level because they do not model coreference explicitly. Coreference can be considered as the linguistic connection in natural language, which commonly appears in long sequences and is one of the most important elements for a coherent understanding of the whole discourse. Long text usually accommodates complex relationships between noun phrases, which has become a challenge for text understanding. For example, in the sentence “The physician hired the secretary because she was overwhelmed with clients.”, comprehending the whole context requires realizing that she refers to the physician.

To improve the coreferential reasoning capacity of language representation models, a straightforward solution is to fine-tune these models on supervised coreference resolution data. Nevertheless, it is impractical to obtain a large-scale supervised coreference dataset. In this paper, we present CorefBERT, a language representation model designed to better capture and represent the coreference information in the utterance without supervised data. CorefBERT introduces a novel pre-training task called Mention Reference Prediction (MRP), in addition to Masked Language Modeling (MLM). MRP leverages repeated mentions (e.g., nouns or noun phrases) that appear multiple times in the passage to acquire abundant co-referring relations. In particular, MRP involves a mention reference masking strategy, which masks one or several of the repeated mentions in the passage and requires the model to predict the masked mention’s corresponding referents. Here is an example:

Sequence: Jane presents strong evidence against Claire, but [MASK] may present a strong defense.

Candidates: Jane, evidence, Claire, …

For the MRP task, we substitute the repeated mention, Claire, with [MASK] and require the model to find the proper candidate for filling the [MASK].

To explicitly model the coreference information, we further introduce a copy-based training objective that encourages the model to select the consistent noun phrase from the context instead of from the vocabulary. The copy mechanism establishes more interactions among mentions of an entity, which suits the coreference resolution scenario.

We conduct experiments on a suite of downstream NLP tasks that require coreferential reasoning in language understanding, including extractive question answering, relation extraction, fact extraction and verification, and coreference resolution. Experimental results show that CorefBERT outperforms the vanilla BERT on almost all benchmarks, owing to its improved coreference resolution. To verify the robustness of our model, we also evaluate CorefBERT on other common NLP tasks, where it still achieves results comparable to BERT. This demonstrates that introducing the new pre-training task does not impair BERT’s ability in common language understanding.

2 Background

BERT (Devlin et al., 2019), a language representation model, learns universal language representations with a deep bidirectional Transformer (Vaswani et al., 2017) from a large-scale unlabeled corpus. Typically, it utilizes two training tasks to learn from unlabeled text: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). However, it turns out that NSP is not as helpful as expected for language representation learning (Joshi et al., 2019a; Liu et al., 2019a). Therefore, we train our model, CorefBERT, on contiguous sequences without the NSP objective.


Given a sequence of tokens $X = (x_1, x_2, \ldots, x_n)$ (in this paper, tokens are at the subword level), BERT first represents each token by aggregating the corresponding token, segment, and position embeddings, and then feeds the input representation into a deep bidirectional Transformer to obtain the final contextual representation.

Masked language modeling (MLM)

MLM is a kind of cloze task that aims to predict the missing tokens from their final contextual representations. In CorefBERT, we retain the MLM objective for learning general representations, and further add Mention Reference Prediction to infuse stronger coreferential reasoning ability into the language representation.

3 Methodology

Figure 1: An illustration of CorefBERT’s training process. In this example, the second Claire is masked. We use copy-based objective to predict the masked token from context for mention reference prediction task. The overall loss consists of the loss of both mention reference prediction and masked language modeling.

In this section, we present CorefBERT, a language representation model that aims to better capture the coreference information of the text. Our approach introduces a novel auxiliary training task, Mention Reference Prediction (MRP), which is added to enhance the reasoning ability of BERT (Devlin et al., 2019). MRP uses a mention reference masking strategy to mask one of the repeated mentions in the sequence and then employs a copy-based training objective to predict the masked tokens by copying other tokens in the sequence.

3.1 Mention Reference Masking

To better capture the coreference information of the text, we propose a novel masking strategy, mention reference masking, which masks tokens of the repeated mentions in the sequence instead of masking random tokens. The idea is inspired by unsupervised coreference resolution. We follow a distant supervision assumption: the repeated mentions in a sequence refer to each other. Therefore, if we mask one of them, the masked tokens can be inferred from the context and the unmasked references. Based on this strategy and assumption, the CorefBERT model is expected to capture the coreference information in the text in order to fill in the masked tokens.

In practice, we regard nouns in the text as mentions. We first use spaCy for part-of-speech tagging to extract all nouns in the given sequence. Then, we cluster the nouns into several groups where each group contains all mentions of the same noun. After that, we select the masked nouns from different groups uniformly.
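The grouping and selection steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `select_mentions_to_mask`, the case-insensitive grouping, the `NOUN`/`PROPN` tag names, and the choice of one mention per group are all assumptions, and the part-of-speech tags are passed in directly so the sketch does not depend on any particular tagger.

```python
import random
from collections import defaultdict


def select_mentions_to_mask(tokens, pos_tags, num_to_mask, rng=None):
    """Pick positions of repeated nouns to mask, spread across noun groups.

    tokens and pos_tags are parallel lists. The tags are assumed to come from
    any part-of-speech tagger that labels nouns as "NOUN" or "PROPN".
    """
    rng = rng or random.Random(0)
    # Cluster noun positions by surface form (case-insensitive: an assumption).
    groups = defaultdict(list)
    for i, (tok, tag) in enumerate(zip(tokens, pos_tags)):
        if tag in ("NOUN", "PROPN"):
            groups[tok.lower()].append(i)
    # Only nouns occurring at least twice leave an unmasked referent to copy from.
    repeated = [positions for positions in groups.values() if len(positions) >= 2]
    rng.shuffle(repeated)
    # Take one position from each of the first `num_to_mask` groups.
    return sorted(rng.choice(positions) for positions in repeated[:num_to_mask])


# The running example from the introduction, with hand-written tags.
tokens = ("Jane presents strong evidence against Claire , "
          "but Claire may present a strong defense .").split()
pos_tags = ["PROPN", "VERB", "ADJ", "NOUN", "ADP", "PROPN", "PUNCT", "CCONJ",
            "PROPN", "AUX", "VERB", "DET", "ADJ", "NOUN", "PUNCT"]
picked = select_mentions_to_mask(tokens, pos_tags, num_to_mask=1)
masked = ["[MASK]" if i in picked else tok for i, tok in enumerate(tokens)]
```

Here only "Claire" is repeated, so one of its two occurrences is masked and the other remains available as the referent. The sketch works on whole words and ignores subword tokenization.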

In order to maintain the universal language representation ability in CorefBERT, we utilize both masked language modeling (random token masking) and mention reference prediction (mention reference masking) in the training process. Empirically, the masked words for masked language modeling and mention reference prediction are sampled at a ratio of 4:1. Similar to BERT, 15% of the tokens are masked in total, of which 80% are replaced with [MASK], 10% are kept as the original tokens, and 10% are replaced with random tokens. We also adopt whole word masking, which masks all the subwords belonging to the masked words or mentions.
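As a rough sketch of how such a masking budget could be split and applied, the snippet below uses the replacement scheme published for BERT (15% of tokens selected; 80% to [MASK], 10% unchanged, 10% random) together with the 4:1 ratio stated above. The helper names `split_mask_budget` and `corrupt` are hypothetical, and applying the 4:1 ratio to token counts rather than word counts is a simplifying assumption.

```python
import random


def split_mask_budget(num_tokens, total_rate=0.15, mlm_to_mrp=(4, 1)):
    """Return (mlm_count, mrp_count) for a sequence of num_tokens tokens."""
    total = round(num_tokens * total_rate)
    mrp = round(total * mlm_to_mrp[1] / sum(mlm_to_mrp))
    return total - mrp, mrp


def corrupt(tokens, masked_positions, vocab, rng=None):
    """Apply BERT-style replacement to the chosen positions.

    Each chosen token becomes [MASK] with probability 0.8, stays unchanged
    with probability 0.1, and is swapped for a random vocabulary token
    otherwise.
    """
    rng = rng or random.Random(0)
    out = list(tokens)
    for i in masked_positions:
        r = rng.random()
        if r < 0.8:
            out[i] = "[MASK]"
        elif r < 0.9:
            pass  # keep the original token; the model must still predict it
        else:
            out[i] = rng.choice(vocab)
    return out


mlm_count, mrp_count = split_mask_budget(100)
```

For a 100-token sequence this budget gives 12 positions to random token masking and 3 to mention reference masking.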

3.2 Copy-based Training Objective

In order to capture the coreference information of the text, CorefBERT models the correlation among words in the sequence. The copy mechanism is a method widely adopted in sequence-to-sequence tasks: it alleviates out-of-vocabulary problems in text summarization (Gu et al., 2016), translates specific words in translation (Cao et al., 2017), and retells queries in dialogue generation (He et al., 2017). We adapt the copy mechanism and introduce a copy-based training objective that requires the model to predict the missing tokens of the masked noun by copying the unmasked tokens in the context. Through the copy mechanism, the CorefBERT model can explicitly capture the relations between the masked mention and its referring mentions, and thereby obtain the coreference information in the context.

The representations of the start token and the end token of a word typically contain the whole word’s information (Lee et al., 2017, 2018; He et al., 2018), based on which we apply the copy-based training objective on both ends of the masked word.

Formally, we first encode the given input sequence $X = (x_1, \ldots, x_n)$, with some tokens masked, into hidden states $H = (h_1, \ldots, h_n)$ via a multi-layer Transformer (Vaswani et al., 2017). The probability of recovering the masked token $x_i$ by copying from $x_j$ is defined as:

$$\Pr(x_j \mid x_i) = \frac{\exp\left((V \odot h_j)^{\top} h_i\right)}{\sum_{x_k \in X} \exp\left((V \odot h_k)^{\top} h_i\right)}$$

where $\odot$ denotes the element-wise product and $V$ is a trainable parameter that measures the importance of each dimension for token similarity.

For a masked noun $w_i$ consisting of a sequence of tokens $(x_s^{(i)}, \ldots, x_t^{(i)})$, we recover $w_i$ by copying its referring context word, and define the probability of choosing word $w_j$ as:

$$\Pr(w_j \mid w_i) = \Pr(x_s^{(j)} \mid x_s^{(i)}) \times \Pr(x_t^{(j)} \mid x_t^{(i)})$$

A masked noun possibly has multiple corresponding words in the sequence, for which we collectively maximize the similarity of all corresponding words. This approach is widely used in question answering (Kadlec et al., 2016; Swayamdipta et al., 2018; Clark and Gardner, 2018) to handle multiple answers. Finally, we define the loss of mention reference prediction (MRP) as:

$$\mathcal{L}_{\text{MRP}} = -\sum_{w_i \in \mathcal{M}} \log \sum_{w_j \in \mathcal{C}_{w_i}} \Pr(w_j \mid w_i)$$

where $\mathcal{M}$ is the set of all masked mentions for mention reference masking, and $\mathcal{C}_{w_i}$ is the set of all corresponding words of word $w_i$.
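To make the copy-based objective concrete, here is a small pure-Python sketch. It assumes the copy probability is a softmax over all positions of a dot product between hidden states, weighted per dimension by a trainable vector (`v` below), that a word's copy probability is the product of the probabilities at its start and end tokens, and that the loss sums the probabilities of all referents before taking the negative log. The function names, the `(start, end)` span encoding, and the toy vectors are illustrative assumptions, not the authors' code.

```python
import math


def copy_probs(hidden, v, i):
    """Softmax over all positions j of the score (v * h_j) . h_i."""
    scores = [sum(vd * hj * hi for vd, hj, hi in zip(v, h_j, hidden[i]))
              for h_j in hidden]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def word_copy_prob(hidden, v, masked_span, candidate_span):
    """Probability of copying a candidate word: product over start and end tokens."""
    (ms, me), (cs, ce) = masked_span, candidate_span
    return copy_probs(hidden, v, ms)[cs] * copy_probs(hidden, v, me)[ce]


def mrp_loss(hidden, v, masked_to_candidates):
    """Negative log of the summed copy probability over all referents, per masked word."""
    loss = 0.0
    for masked_span, candidates in masked_to_candidates.items():
        total = sum(word_copy_prob(hidden, v, masked_span, c) for c in candidates)
        loss -= math.log(total)
    return loss


# Toy example: four tokens with 2-dimensional hidden states.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2], [0.5, 0.5]]
v = [1.0, 1.0]
probs = copy_probs(hidden, v, 2)
# The masked word is the single token at position 2; its referent is at position 0.
loss = mrp_loss(hidden, v, {(2, 2): [(0, 0)]})
```

Summing over candidates inside the logarithm means the model is rewarded for placing probability on any of the referents, rather than being forced to pick a single one.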

3.3 Training

CorefBERT aims to capture the coreference information of the text while maintaining the language representation capability of BERT. Thus, the overall loss of CorefBERT consists of two losses, the mention reference prediction loss $\mathcal{L}_{\text{MRP}}$ and the masked language modeling loss $\mathcal{L}_{\text{MLM}}$, which can be formulated as:

$$\mathcal{L}(\Theta) = \mathcal{L}_{\text{MRP}}(\Theta) + \mathcal{L}_{\text{MLM}}(\Theta)$$

4 Experiment

In this section, we first introduce the training details of CorefBERT. After that, we present the fine-tuning results on a comprehensive suite of tasks, including extractive question answering, document-level relation extraction, fact extraction and verification, coreference resolution, and eight tasks in the GLUE benchmark.

4.1 Training Details

Due to the large cost of training CorefBERT from scratch, we initialize the parameters of CorefBERT with the BERT checkpoint released by Google, which is also used as our baseline on downstream tasks. Similar to previous language representation models (Devlin et al., 2019; Yang et al., 2019; Joshi et al., 2019a; Liu et al., 2019a), we adopt English Wikipedia as our training corpus, which contains about 3,000M tokens. Note that, since the Wikipedia corpus was used to train the original BERT, CorefBERT does not use any additional corpus. We train CorefBERT with contiguous sequences of up to tokens, and shorten the input sequences with a 10% probability. To verify the effectiveness of our method for language representation models trained on a tremendous corpus, we further train CorefRoBERTa starting from the released RoBERTa.

Additionally, we follow the pre-training hyper-parameters used in BERT, and adopt Adam optimizer (Kingma and Ba, 2015) with batch size of . Learning rate of 5e-5 is used for the base model and 1e-5 is used for the large model. The optimization runs k steps, where the first % steps utilize linear warm-up learning rate. The pre-training took 1.5 days for base model and 11 days for large model with 8 2080ti GPUs .
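The warm-up schedule mentioned above can be sketched as a simple function of the step index. The text only states that the first portion of training uses linear warm-up; the linear decay to zero afterwards follows common BERT practice and is an assumption here, as are the function name `learning_rate` and the step counts in the example, which are placeholders rather than the paper's values.

```python
def learning_rate(step, total_steps, peak_lr, warmup_fraction):
    """Linear warm-up to peak_lr, then linear decay to zero (decay is an assumption)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)


# Placeholder schedule: 1000 steps, 10% warm-up, base-model peak rate of 5e-5.
schedule = [learning_rate(s, 1000, 5e-5, 0.1) for s in range(0, 1001, 50)]
```

The rate rises linearly from zero, reaches the peak at the end of warm-up, and then falls back to zero at the final step.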

4.2 Extractive Question Answering

Model              SQuAD  NewsQA  TriviaQA  SearchQA  HotpotQA  NaturalQA  Average
BERT (base)        88.4   66.9    68.8      78.5      74.2      75.6       75.4
CorefBERT (base)   89.0   69.5    70.7      79.6      76.3      77.7       77.1
BERT (large)       91.0   69.7    73.1      81.2      77.7      79.1       78.6
CorefBERT (large)  91.8   71.5    73.9      82.0      79.1      79.6       79.6
Table 1: Performance (F1) on six MRQA extractive question answering benchmarks.

Model                 Dev EM  Dev F1  Test EM  Test F1
QANet*                34.41   38.26   34.17    38.90
QANet+BERT*           43.09   47.38   42.41    47.20
BERT (base)*          58.44   64.95   59.28    66.39
BERT (base)           61.29   67.25   61.37    68.56
CorefBERT (base)      66.87   72.27   66.22    72.96
BERT (large)          67.91   73.82   67.24    74.00
CorefBERT (large)     70.89   76.56   70.67    76.89
RoBERTa-MT†           74.11   81.51   72.61    80.68
RoBERTa (large)       74.15   81.05   75.56    82.11
CorefRoBERTa (large)  74.94   81.71   75.80    82.81
Table 2: Results on QUOREF measured by exact match (EM) and F1. Results marked * and † are from Dasigi et al. (2019) and the official leaderboard respectively.

Given a question and passage, the extractive question answering task aims to select spans in passage to answer the question. We evaluate our model on the Questions Requiring Coreferential Reasoning dataset (QUOREF) (Dasigi et al., 2019), which contains more than k question-answer pairs. Compared to previous reading comprehension benchmarks, QUOREF is more challenging: % of the questions in QUOREF cannot be answered without coreference resolution while tracking entities’ coreference is essential to comprehending documents. Therefore, QUOREF could examine the coreference resolution capability of question answering models to some extent. We also evaluate the models on the MRQA shared task (Fisch et al., 2019). MRQA integrates several existing datasets to a unified format, which provides a single context within tokens for each question, ensuring at least one answer could be accurately found in the context. We use six benchmarks of MRQA, including SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2017), SearchQA (Dunn et al., 2017), TriviaQA (Joshi et al., 2017), HotpotQA (Yang et al., 2018), and Natural Questions (NaturalQA) (Kwiatkowski et al., 2019). The MRQA shared task involves paragraphs from different sources and questions with manifold styles, helping us effectively evaluate our model in different domains. Since MRQA does not provide a public test set, we randomly split the development set into two halves to make new validation and test sets.


For QUOREF, we compare our CorefBERT model with four baseline models: (1) QANet (Yu et al., 2018) combines a self-attention mechanism with a convolutional neural network and achieves the best performance to date without pre-training; (2) QANet+BERT adopts BERT representations as an additional input feature to QANet; (3) BERT (Devlin et al., 2019) simply fine-tunes BERT for extractive question answering. We further design two components accounting for coreferential reasoning and multiple answers, by which we obtain a stronger BERT baseline on QUOREF; (4) RoBERTa-MT, the current state of the art, is trained on the CoLA, SST2, and SQuAD datasets in turn before finally being fine-tuned on QUOREF.

Implementation Details

Following BERT’s setting (Devlin et al., 2019), given the question $Q$ and the passage $P$, we represent them as a single sequence $X$ = [CLS] $Q$ [SEP] $P$ [SEP], feed the sequence $X$ into the pre-trained encoder, and train two classifiers on top of it to predict the answer’s start and end positions simultaneously. For MRQA, CorefBERT maintains the same framework as BERT. For QUOREF, we further employ two extra components to process multiple mentions of the answers: (1) Inspired by Hu et al. (2019) in handling the multiple-answer-span problem, we use the representation of [CLS] to predict the number of answers. Then, we adopt the non-maximum suppression (NMS) algorithm (Rosenfeld and Thurston, 1971) to extract a specific number of non-overlapping spans. NMS first selects the answer span with the highest score, then chooses the span with the next-highest score that does not overlap any previously selected span, and so on, until the predicted number of spans has been selected. (2) When answering a question from QUOREF, the coreferent mention could be a pronoun in the sentence most relevant to the correct answer, so we add an additional reasoning layer (a Transformer layer) before the span boundary classifier.
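The span selection step described above can be sketched as a greedy loop. This is an illustrative version under stated assumptions: candidate spans arrive as `(score, start, end)` tuples with inclusive token indices, the number of answers has already been predicted, and the function name `select_spans` and the example scores are hypothetical.

```python
def select_spans(scored_spans, num_answers):
    """Greedy non-maximum suppression over candidate answer spans.

    scored_spans: list of (score, start, end) with inclusive token indices.
    Returns up to num_answers non-overlapping (start, end) spans, best first.
    """
    selected = []
    for score, start, end in sorted(scored_spans, reverse=True):
        if len(selected) == num_answers:
            break
        # Two inclusive spans overlap unless one ends before the other starts.
        if all(end < s or start > e for s, e in selected):
            selected.append((start, end))
    return selected


candidates = [(0.9, 3, 5), (0.8, 4, 6), (0.7, 10, 11), (0.4, 0, 1)]
answers = select_spans(candidates, num_answers=2)
# answers == [(3, 5), (10, 11)]: the span (4, 6) is skipped because it overlaps (3, 5).
```

Sorting once and scanning in score order keeps the procedure simple; the overlap check is the only place where previously chosen spans constrain later ones.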


Table 2 shows the performance on QUOREF. Our adapted BERT (base) outperforms the original BERT (base) by about 2 points in EM and F1, which indicates the effectiveness of the added reasoning layer and multi-span prediction module. CorefBERT (base) and CorefBERT (large) exceed our adapted BERT (base) and BERT (large) by 4.4 and 2.9 F1 points respectively. CorefRoBERTa also gains 0.7 F1 points and achieves a new state of the art. We show four case studies in the Supplemental Materials, which indicate that, by reasoning over mentions, CorefBERT can aggregate information to answer questions requiring coreferential reasoning.

Table 1 further shows that the effectiveness of CorefBERT is consistent across the six datasets of the MRQA shared task. Although the MRQA shared task is not designed for coreferential reasoning, CorefBERT still improves average F1 by 1.7 points for the base model and 1.0 point for the large model, with gains on all six datasets and especially on NewsQA and HotpotQA. In NewsQA, 20.7% of the answers can only be inferred by synthesizing information distributed across multiple sentences. In HotpotQA, 63% of the answers need to be inferred through bridge entities or by checking multiple properties in different positions. This demonstrates that coreferential reasoning is an essential ability in question answering.

4.3 Relation Extraction

Relation extraction (RE) aims to extract the relationship between two entities in a given text. We evaluate our model on DocRED (Yao et al., 2019), a challenging document-level RE dataset that requires extracting relations between entities by synthesizing information from all of their mentions after reading the whole document. DocRED requires a variety of reasoning types, where % of the relation facts need to be uncovered through coreferential reasoning.


We compare our model with the following baselines: (1) CNN/LSTM/BiLSTM. CNN (Zeng et al., 2014), LSTM (Hochreiter and Schmidhuber, 1997), bidirectional LSTM (BiLSTM) (Cai et al., 2016) are widely adopted as text encoders in relation extraction tasks. The above text encoders are employed to convert each word in the document into its output representations. Then, the representations of the two entities are used to predict the relationship between them. We replace the encoder with BERT/RoBERTa to provide a stronger baseline. (2) ContextAware (Sorokin and Gurevych, 2017) takes relations’ interaction into account, which demonstrates that other relations in the sentential context are beneficial for target relation prediction. (3) BERT-TS (Wang et al., 2019) applies a two-step prediction to deal with a large amount of non-relations. (4) HinBERT (Tang et al., 2020) proposes a hierarchical inference network to obtain and aggregate the inference information with different granularity.

Model                 Dev IgnF1  Dev F1  Test IgnF1  Test F1
CNN*                  41.58      43.45   40.33       42.26
LSTM*                 48.44      50.68   47.71       50.07
BiLSTM*               48.87      50.94   50.26       51.06
ContextAware*         48.94      51.09   48.40       50.70
BERT-TS†              -          54.42   -           53.92
HINBERT‡              54.29      56.31   53.70       55.60
BERT (base)           54.63      56.77   53.93       56.27
CorefBERT (base)      55.32      57.51   54.54       56.96
BERT (large)          56.67      58.83   56.47       58.69
CorefBERT (large)     56.73      58.88   56.48       58.70
RoBERTa (large)       57.14      59.22   57.51       59.62
CorefRoBERTa (large)  57.84      59.93   57.68       59.91
Table 3: Results on DocRED measured by micro ignore F1 (IgnF1) and micro F1. The IgnF1 metric ignores the relational facts shared by the training and dev/test sets. Results marked *, †, and ‡ are from Yao et al. (2019), Wang et al. (2019), and Tang et al. (2020) respectively.


Table 3 shows the performance on DocRED. On the test set, CorefBERT (base) outperforms BERT (base) by 0.7 F1 (56.96 vs. 56.27), and CorefRoBERTa beats RoBERTa by 0.3 F1 (59.91 vs. 59.62), outperforming all previously published work. This supports the effectiveness of considering the coreference information of text for document-level relation classification.

4.4 Fact Extraction and Verification

Fact extraction and verification aims to verify deliberately fabricated claims against trustworthy corpora. We evaluate our model on a large-scale public fact verification dataset, FEVER (Thorne et al., 2018). FEVER consists of annotated claims with all Wikipedia documents.


We compare our model with four BERT-based fact verification models: (1) BERT Concat (Zhou et al., 2019) concatenates all evidence pieces and the claim to predict the claim label; (2) SR-MRS (Nie et al., 2019) employs hierarchical BERT retrieval to improve model performance; (3) GEAR (Zhou et al., 2019) constructs an evidence graph and conducts a graph attention network for joint reasoning over several evidence pieces; (4) KGAT (Liu et al., 2019b) further conducts a fine-grained graph attention network with kernels.


Table 4 shows the performance on FEVER. KGAT with CorefBERT outperforms KGAT with BERT by 0.4 FEVER score for the base model (69.82 vs. 69.40) and by 0.6 for the large model (70.86 vs. 70.24). KGAT with CorefRoBERTa gains 1.4 FEVER score over the model with RoBERTa, which makes our model the best performing among all previously published research. This again demonstrates the effectiveness of our model. CorefBERT, which incorporates coreference information in distantly supervised pre-training, helps to verify whether the claim and the evidence discuss the same mentions, such as a person or an object.

Model                        LA     FEVER
BERT Concat*                 71.01  65.64
GEAR*                        71.60  67.10
SR-MRS†                      72.56  67.26
KGAT (BERT, base)‡           72.81  69.40
KGAT (CorefBERT, base)       72.88  69.82
KGAT (BERT, large)‡          73.61  70.24
KGAT (CorefBERT, large)      74.37  70.86
KGAT (RoBERTa, large)‡       74.07  70.38
KGAT (CorefRoBERTa, large)   75.41  71.80
Table 4: Results on the FEVER test set measured by label accuracy (LA) and FEVER score. The FEVER score evaluates the model performance and considers whether the golden evidence is provided. Results marked *, †, and ‡ are from Zhou et al. (2019), Nie et al. (2019), and Liu et al. (2019b) respectively.

4.5 Coreference Resolution

Coreference resolution aims to link referring expressions that evoke the same discourse entity. We inspect the models’ intrinsic coreference resolution ability under the setting that all mentions have been detected. Given two sentences where the former has two or more mentions and the latter contains an ambiguous pronoun, models should predict what mention the pronoun refers to. We evaluate our model on several widely-used datasets, including GAP (Webster et al., 2018), DPR (Rahman and Ng, 2012), WSC (Levesque, 2011), Winogender (Rudinger et al., 2018) and PDP (Davis et al., 2017).


We compare our model with coreference resolution models that are based on pre-trained language models and fine-tuned on the GAP and DPR training sets. Trinh and Le (2018) substitute the pronoun with [MASK] and use a language model to compute the probability of recovering each candidate from [MASK]. Kocijan et al. (2019a) generate GAP-like sentences automatically. They then pre-train BERT with the objective of minimizing the perplexity of the correct mentions in these sentences, and finally fine-tune the model on supervised datasets. Benefiting from the augmented data, Kocijan et al. (2019a) achieve the state of the art in sentence-level coreference resolution.

Model      GAP   DPR   WSC   WG    PDP
BERT       76.0  80.1  70.0  78.8  81.7
WikiCREM   78.0  84.8  70.0  76.7  86.7
CorefBERT  76.8  85.1  71.4  80.8  90.0
Table 5: Results on coreference resolution test sets. Performance on GAP is measured by F1, while scores on the others are given in accuracy. WG: Winogender.

Model      MNLI       QQP   QNLI  SST-2  CoLA  STS-B  MRPC  RTE
           392k       363k  104k  67k    8.5k  5.7k   3.5k  2.5k
BERT       84.6/83.4  71.2  90.5  93.5   52.1  85.8   88.9  66.4
CorefBERT  84.2/83.5  71.3  90.5  93.1   51.5  84.8   88.1  67.2
Table 6: Test set performance on the GLUE tasks. The number below each task denotes the number of training examples. Matched/mismatched accuracies are reported for MNLI; F1 scores are reported for QQP and MRPC; Spearman correlation is reported for STS-B; accuracy is reported for the other tasks.

Model       QUOREF  SQuAD  NewsQA  TriviaQA  SearchQA  HotpotQA  NaturalQA  DocRED
BERT        67.3    88.4   66.9    68.8      78.5      74.2      75.6       56.8
 -NSP       70.6    88.7   67.5    68.9      79.4      75.2      75.4       56.7
 -NSP +WWM  70.1    88.3   69.2    70.5      79.7      75.5      75.2       57.1
 -NSP +MRM  70.0    88.5   69.2    70.2      78.6      75.8      74.8       57.1
CorefBERT   72.3    89.0   69.5    70.7      79.6      76.3      77.7       57.5
Table 7: Ablation study on various benchmark datasets (F1).


Table 5 shows the performance on the test sets of the above coreference datasets. Our CorefBERT model significantly outperforms BERT, which demonstrates that the intrinsic coreference resolution ability of CorefBERT has been enhanced by the mention reference prediction training task. Moreover, it achieves performance comparable to the state-of-the-art baseline WikiCREM. Note that WikiCREM is specially designed for sentence-level coreference resolution and is not suitable for other NLP tasks, whereas the coreferential reasoning capability of CorefBERT can be transferred to other NLP tasks.

4.6 GLUE

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is designed to evaluate and analyze the performance of models across a diverse range of existing natural language understanding tasks. We evaluate CorefBERT on the main benchmark used in Devlin et al. (2019), including MNLI (Williams et al., 2018), QQP, QNLI (Rajpurkar et al., 2016), SST-2 (Socher et al., 2013), CoLA (Warstadt et al., 2019), STS-B (Cer et al., 2017), MRPC (Dolan and Brockett, 2005) and RTE (Giampiccolo et al., 2007).

Implementation Details

Following BERT’s setting, we add the [CLS] token in front of the input sentences, and extract its top-layer representation as the representation of the whole sentence or sentence pair for classification or regression. We use a batch size of 32 and fine-tune for 3 epochs for all GLUE tasks, and select the learning rate of Adam among 2e-5, 3e-5, 4e-5, and 5e-5 for the best performance on the development set.


Table 6 shows the performance on GLUE. We notice that CorefBERT achieves results comparable to BERT. Although GLUE by its nature does not require much coreference resolution ability, the results indicate that our masking strategy and auxiliary training objective do not weaken performance on general natural language understanding tasks.

4.7 Ablation Study

In this subsection, we explore the effects of Whole Word Masking (WWM), Mention Reference Masking (MRM), Next Sentence Prediction (NSP), and the copy-based training objective using several benchmark datasets. We continue to train Google’s released BERT on the same Wikipedia corpus with different strategies. As shown in Table 7, we have the following observations: (1) Deleting the next sentence prediction training task results in better performance on almost all tasks. This conclusion is consistent with Joshi et al. (2019a) and Liu et al. (2019a). (2) The MRM scheme usually achieves parity with the WWM scheme except on SearchQA, and both of them outperform the original subword masking scheme on NewsQA (+1.7 F1 on average) and TriviaQA (+1.5 F1 on average). (3) On top of the mention reference masking scheme, our copy-based training objective explicitly requires the model to look for a noun’s referents in the context, which lets it effectively take the coreference information of the sequence into account. CorefBERT takes advantage of this objective and further improves performance, with a substantial gain (+2.3 F1) on QUOREF.
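The averaged gains quoted in observations (2) and (3) can be re-derived from the Table 7 entries with a few lines of arithmetic. The snippet below is only a worked check; the dictionary layout and the helper name `masking_gain` are illustrative, and it treats the "-NSP" row as the subword masking baseline.

```python
# F1 scores copied from Table 7 (QUOREF, NewsQA, and TriviaQA columns only).
f1 = {
    "-NSP":      {"QUOREF": 70.6, "NewsQA": 67.5, "TriviaQA": 68.9},
    "-NSP +WWM": {"QUOREF": 70.1, "NewsQA": 69.2, "TriviaQA": 70.5},
    "-NSP +MRM": {"QUOREF": 70.0, "NewsQA": 69.2, "TriviaQA": 70.2},
    "CorefBERT": {"QUOREF": 72.3, "NewsQA": 69.5, "TriviaQA": 70.7},
}


def masking_gain(dataset):
    """Mean gain of the two word-level masking schemes over subword masking."""
    word_level = (f1["-NSP +WWM"][dataset] + f1["-NSP +MRM"][dataset]) / 2
    return word_level - f1["-NSP"][dataset]


newsqa_gain = masking_gain("NewsQA")      # about 1.7
triviaqa_gain = masking_gain("TriviaQA")  # about 1.45, reported as 1.5
# Gain attributable to the copy-based objective on top of MRM.
copy_gain = f1["CorefBERT"]["QUOREF"] - f1["-NSP +MRM"]["QUOREF"]  # about 2.3
```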

5 Related Work

Word representation (Mikolov et al., 2013; Pennington et al., 2014) aims to capture semantic information of words from an unlabeled corpus and to transform discrete words into continuous vector representations. Since pre-trained word representations cannot handle polysemy well, ELMo (Peters et al., 2018) further extracts context-aware word embeddings from a sequence-level language model. Deep learning models benefit from adopting these word representations as input features and have achieved encouraging progress in the last few years (Kim, 2014; Lample et al., 2016; Lin et al., 2016; Chen et al., 2017; Seo et al., 2017; Lee et al., 2018).

More recently, language representation models that generate contextual word representations have been learned from large-scale unlabeled corpora and then fine-tuned for downstream tasks. SA-LSTM (Dai and Le, 2015) pre-trains an auto-encoder on unlabeled text and achieves strong performance in text classification with a few fine-tuning steps. ULMFiT (Howard and Ruder, 2018) further builds a universal language model. OpenAI GPT (Radford et al., 2018) learns pre-trained language representations with the Transformer (Vaswani et al., 2017) architecture. BERT (Devlin et al., 2019) trains a deep bidirectional Transformer with a masked language modeling objective, which achieves state-of-the-art results on various NLP tasks. SpanBERT (Joshi et al., 2019a) extends BERT by masking contiguous random spans and training the model to predict the entire content within the span boundary. XLNet (Yang et al., 2019) combines Transformer-XL (Dai et al., 2019) with an auto-regressive loss, which takes the dependency between the predicted positions into account. MASS (Song et al., 2019) explores masking strategies for sequence-to-sequence pre-training. Though both pre-trained word representations and language models have achieved great success, they still cannot capture coreference information well. In this paper, we design the mention reference prediction task to enhance language representation models in terms of coreferential reasoning.

Our work, which acquires coreference resolution ability from an unlabeled corpus, can also be viewed as a special form of unsupervised coreference resolution. Formerly, researchers made efforts to explore feature-based unsupervised coreference resolution methods (Haghighi and Klein, 2007; Bejan et al., 2009; Ma et al., 2016). After that, Trinh and Le (2018) showed that it is natural to resolve pronouns in a sentence according to language model probabilities. Moreover, Kocijan et al. (2019a, b) propose sentence-level unsupervised coreference resolution datasets to train a language-model-based coreference discriminator, which achieves outstanding performance in coreference resolution. However, we found that the above methods cannot be directly transferred to the training of language representation models, since their learning objectives may weaken model performance on downstream tasks. Therefore, in this paper, we introduce the mention reference prediction objective alongside masked language modeling to make the learned abilities available for more downstream tasks.

6 Conclusion and Future Work

In this paper, we present a language representation model named CorefBERT, which is trained on a novel task, mention reference prediction, to strengthen the coreferential reasoning ability of BERT. Experimental results on several downstream NLP tasks show that CorefBERT significantly outperforms BERT by considering the coreference information within the text. In the future, there are several prospective research directions: (1) We introduce a distant supervision (DS) assumption in our mention reference prediction training task. It is a feasible approach to introducing the coreferential signal into language representation models, but the automatic labeling mechanism inevitably introduces wrongly labeled examples. Mitigating noise in DS data is still an open question. (2) The DS assumption does not consider the pronouns in the text, although pronouns play an important role in coreferential reasoning. Thus, it is worth developing a novel strategy, such as self-supervised learning, to further consider pronouns in CorefBERT.

References

  • C. A. Bejan, M. Titsworth, A. Hickl, and S. M. Harabagiu (2009) Nonparametric bayesian models for unsupervised event coreference resolution. In Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada, pp. 73–81. External Links: Link Cited by: §5.
  • R. Cai, X. Zhang, and H. Wang (2016) Bidirectional recurrent convolutional neural network for relation classification. See DBLP:conf/acl/2016-1, External Links: Link Cited by: §4.3.
  • Z. Cao, C. Luo, W. Li, and S. Li (2017) Joint copying and restricted generation for paraphrase. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pp. 3152–3158. External Links: Link Cited by: §3.2.
  • D. M. Cer, M. T. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia (2017) SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. See DBLP:conf/semeval/2017, pp. 1–14. External Links: Link, Document Cited by: §4.6.
  • Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen (2017) Enhanced LSTM for natural language inference. See DBLP:conf/acl/2017-1, pp. 1657–1668. External Links: Link, Document Cited by: §5.
  • C. Clark and M. Gardner (2018) Simple and effective multi-paragraph reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pp. 845–855. External Links: Link, Document Cited by: §3.2.
  • A. M. Dai and Q. V. Le (2015) Semi-supervised sequence learning. See DBLP:conf/nips/2015, pp. 3079–3087. External Links: Link Cited by: §5.
  • Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. V. Le, and R. Salakhutdinov (2019) Transformer-xl: attentive language models beyond a fixed-length context. See DBLP:conf/acl/2019-1, pp. 2978–2988. External Links: Link Cited by: §5.
  • P. Dasigi, N. F. Liu, A. Marasovic, N. A. Smith, and M. Gardner (2019) Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. CoRR abs/1908.05803. External Links: Link, 1908.05803 Cited by: Table 8, §4.2, Table 2.
  • E. Davis, L. Morgenstern, and C. L. O. Jr. (2017) The first winograd schema challenge at IJCAI-16. AI Magazine 38 (3), pp. 97–98. External Links: Link Cited by: §4.5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. External Links: Link Cited by: §1, §2, §3, §4.1, §4.2, §4.2, §4.6, §5.
  • W. B. Dolan and C. Brockett (2005) Automatically constructing a corpus of sentential paraphrases. See DBLP:conf/acl-iwp/2005, External Links: Link Cited by: §4.6.
  • M. Dunn, L. Sagun, M. Higgins, V. U. Güney, V. Cirik, and K. Cho (2017) SearchQA: A new q&a dataset augmented with context from a search engine. CoRR abs/1704.05179. External Links: Link, 1704.05179 Cited by: §4.2.
  • A. Fisch, A. Talmor, R. Jia, M. Seo, E. Choi, and D. Chen (2019) MRQA 2019 shared task: evaluating generalization in reading comprehension. CoRR abs/1910.09753. External Links: Link, 1910.09753 Cited by: §4.2.
  • D. Giampiccolo, B. Magnini, I. Dagan, and B. Dolan (2007) The third PASCAL recognizing textual entailment challenge. See DBLP:conf/acl/2007pascal, pp. 1–9. External Links: Link Cited by: §4.6.
  • J. Gu, Z. Lu, H. Li, and V. O. K. Li (2016) Incorporating copying mechanism in sequence-to-sequence learning. See DBLP:conf/acl/2016-1, External Links: Link Cited by: §3.2.
  • A. Haghighi and D. Klein (2007) Unsupervised coreference resolution in a nonparametric bayesian model. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic, External Links: Link Cited by: §5.
  • L. He, K. Lee, O. Levy, and L. Zettlemoyer (2018) Jointly predicting predicates and arguments in neural semantic role labeling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pp. 364–369. External Links: Link, Document Cited by: §3.2.
  • S. He, C. Liu, K. Liu, and J. Zhao (2017) Generating natural answers by incorporating copying and retrieving mechanisms in sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 199–208. External Links: Link, Document Cited by: §3.2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. External Links: Link, Document Cited by: §4.3.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. See DBLP:conf/acl/2018-1, pp. 328–339. External Links: Link, Document Cited by: §5.
  • M. Hu, Y. Peng, Z. Huang, and D. Li (2019) A multi-type multi-span network for reading comprehension that requires discrete reasoning. CoRR abs/1908.05514. External Links: Link, 1908.05514 Cited by: §4.2.
  • M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy (2019a) SpanBERT: improving pre-training by representing and predicting spans. CoRR abs/1907.10529. External Links: Link, 1907.10529 Cited by: §1, §2, §4.1, §4.7, §5.
  • M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 1601–1611. External Links: Link, Document Cited by: §4.2.
  • M. Joshi, O. Levy, D. S. Weld, and L. Zettlemoyer (2019b) BERT for coreference resolution: baselines and analysis. CoRR abs/1908.09091. External Links: Link, 1908.09091 Cited by: §1.
  • R. Kadlec, M. Schmid, O. Bajgar, and J. Kleindienst (2016) Text understanding with the attention sum reader network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, External Links: Link Cited by: §3.2.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. See DBLP:conf/emnlp/2014, pp. 1746–1751. External Links: Link Cited by: §5.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. See DBLP:conf/iclr/2015, External Links: Link Cited by: §4.1.
  • V. Kocijan, O. Camburu, A. Cretu, Y. Yordanov, P. Blunsom, and T. Lukasiewicz (2019a) WikiCREM: A large unsupervised corpus for coreference resolution. CoRR abs/1908.08025. External Links: Link, 1908.08025 Cited by: §4.5, §5.
  • V. Kocijan, A. Cretu, O. Camburu, Y. Yordanov, and T. Lukasiewicz (2019b) A surprisingly robust trick for the winograd schema challenge. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 4837–4842. External Links: Link Cited by: §5.
  • T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov (2019) Natural questions: a benchmark for question answering research. TACL 7, pp. 452–466. External Links: Link Cited by: §4.2.
  • G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. See DBLP:conf/naacl/2016, pp. 260–270. External Links: Link Cited by: §5.
  • K. Lee, L. He, M. Lewis, and L. Zettlemoyer (2017) End-to-end neural coreference resolution. See DBLP:conf/emnlp/2017, pp. 188–197. External Links: Link Cited by: §3.2.
  • K. Lee, L. He, and L. Zettlemoyer (2018) Higher-order coreference resolution with coarse-to-fine inference. See DBLP:conf/naacl/2018-2, pp. 687–692. External Links: Link Cited by: §3.2, §5.
  • H. J. Levesque (2011) The winograd schema challenge. In Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, March 21-23, 2011, External Links: Link Cited by: §4.5.
  • Y. Lin, S. Shen, Z. Liu, H. Luan, and M. Sun (2016) Neural relation extraction with selective attention over instances. See DBLP:conf/acl/2016-1, External Links: Link Cited by: §5.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019a) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. External Links: Link, 1907.11692 Cited by: §1, §2, §4.1, §4.7.
  • Z. Liu, C. Xiong, and M. Sun (2019b) Kernel graph attention network for fact verification. CoRR abs/1910.09796. External Links: Link, 1910.09796 Cited by: §4.4, Table 4.
  • X. Ma, Z. Liu, and E. H. Hovy (2016) Unsupervised ranking model for entity coreference resolution. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pp. 1012–1018. External Links: Link Cited by: §5.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pp. 3111–3119. External Links: Link Cited by: §5.
  • Y. Nie, S. Wang, and M. Bansal (2019) Revealing the importance of semantic retrieval for machine reading at scale. CoRR abs/1909.08041. External Links: Link, 1909.08041 Cited by: §4.4, Table 4.
  • J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. See DBLP:conf/emnlp/2014, pp. 1532–1543. External Links: Link Cited by: §5.
  • M. E. Peters, M. Neumann, R. L. L. IV, R. Schwartz, V. Joshi, S. Singh, and N. A. Smith (2019) Knowledge enhanced contextual word representations. CoRR abs/1909.04164. External Links: Link, 1909.04164 Cited by: §1.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. See DBLP:conf/naacl/2018-1, pp. 2227–2237. External Links: Link Cited by: §5.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding with unsupervised learning. Technical report, OpenAI. Cited by: §5.
  • A. Rahman and V. Ng (2012) Resolving complex cases of definite pronouns: the winograd schema challenge. See DBLP:conf/emnlp/2012, pp. 777–789. External Links: Link Cited by: §4.5.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 2383–2392. External Links: Link Cited by: §4.2, §4.6.
  • A. Rosenfeld and M. Thurston (1971) Edge and curve detection for visual scene analysis. IEEE Trans. Computers 20 (5), pp. 562–569. External Links: Link, Document Cited by: §4.2.
  • R. Rudinger, J. Naradowsky, B. Leonard, and B. V. Durme (2018) Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pp. 8–14. External Links: Link Cited by: §4.5.
  • M. J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi (2017) Bidirectional attention flow for machine comprehension. See DBLP:conf/iclr/2017, External Links: Link Cited by: §5.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. See DBLP:conf/emnlp/2013, pp. 1631–1642. External Links: Link Cited by: §4.6.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) MASS: masked sequence to sequence pre-training for language generation. See DBLP:conf/icml/2019, pp. 5926–5936. External Links: Link Cited by: §5.
  • D. Sorokin and I. Gurevych (2017) Context-aware representations for knowledge base relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp. 1784–1789. External Links: Link Cited by: §4.3.
  • C. Sun, L. Huang, and X. Qiu (2019) Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 380–385. External Links: Link Cited by: §1.
  • S. Swayamdipta, A. P. Parikh, and T. Kwiatkowski (2018) Multi-mention learning for reading comprehension with neural cascades. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §3.2.
  • A. Talmor and J. Berant (2019) MultiQA: an empirical investigation of generalization and transfer in reading comprehension. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 4911–4921. External Links: Link Cited by: §1.
  • H. Tang, Y. Cao, Z. Zhang, J. Cao, F. Fang, S. Wang, and P. Yin (2020) HIN: hierarchical inference network for document-level relation extraction. CoRR abs/2003.12754. External Links: Link, 2003.12754 Cited by: §4.3, Table 3.
  • J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal (2018) FEVER: a large-scale dataset for fact extraction and verification. See DBLP:conf/naacl/2018-1, pp. 809–819. External Links: Link Cited by: §4.4.
  • T. H. Trinh and Q. V. Le (2018) A simple method for commonsense reasoning. CoRR abs/1806.02847. External Links: Link, 1806.02847 Cited by: §4.5, §5.
  • A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman (2017) NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, Rep4NLP@ACL 2017, Vancouver, Canada, August 3, 2017, pp. 191–200. External Links: Link Cited by: §4.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. See DBLP:conf/nips/2017, pp. 5998–6008. External Links: Link Cited by: §2, §3.2, §5.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: A multi-task benchmark and analysis platform for natural language understanding. See DBLP:conf/emnlp/2018blackbox, pp. 353–355. External Links: Link Cited by: §4.6.
  • H. Wang, C. Focke, R. Sylvester, N. Mishra, and W. Wang (2019) Fine-tune bert for docred with two-step process. CoRR abs/1909.11898. External Links: Link, 1909.11898 Cited by: §4.3, Table 3.
  • A. Warstadt, A. Singh, and S. R. Bowman (2019) Neural network acceptability judgments. TACL 7, pp. 625–641. External Links: Link Cited by: §4.6.
  • K. Webster, M. Recasens, V. Axelrod, and J. Baldridge (2018) Mind the GAP: A balanced corpus of gendered ambiguous pronouns. TACL 6, pp. 605–617. External Links: Link Cited by: §4.5.
  • A. Williams, N. Nangia, and S. R. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. See DBLP:conf/naacl/2018-1, pp. 1112–1122. External Links: Link Cited by: §4.6.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237. External Links: Link, 1906.08237 Cited by: §1, §4.1, §5.
  • Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 2369–2380. External Links: Link Cited by: §4.2.
  • Y. Yao, D. Ye, P. Li, X. Han, Y. Lin, Z. Liu, Z. Liu, L. Huang, J. Zhou, and M. Sun (2019) DocRED: A large-scale document-level relation extraction dataset. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 764–777. External Links: Link Cited by: §4.3, Table 3.
  • A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V. Le (2018) QANet: combining local convolution with global self-attention for reading comprehension. See DBLP:conf/iclr/2018, External Links: Link Cited by: §4.2.
  • D. Zeng, K. Liu, S. Lai, G. Zhou, and J. Zhao (2014) Relation classification via convolutional deep neural network. See DBLP:conf/coling/2014, pp. 2335–2344. External Links: Link Cited by: §4.3.
  • Z. Zhang, Y. Wu, H. Zhao, Z. Li, S. Zhang, X. Zhou, and X. Zhou (2019) Semantics-aware BERT for language understanding. CoRR abs/1909.02209. External Links: Link, 1909.02209 Cited by: §1.
  • J. Zhou, X. Han, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun (2019) GEAR: graph-based evidence aggregating and reasoning for fact verification. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 892–901. External Links: Link Cited by: §1, §4.4, Table 4.

Appendix A Supplemental Material

Case Study on QUOREF

Table 8 shows examples from QUOREF.

In example (1), it is essential to recognize that the asthmatic boy in the question refers to Barry. After that, we should synthesize information from the two mentions of Mr. Lee: Mr. Lee trains Barry, and Mr. Lee is the uncle of Noreen. Reasoning over this information, we can conclude that Noreen’s uncle trains the asthmatic boy. In example (2), the model needs to infer from [2] that Tippett is a composer in order to obtain the final answer from [1]. After training on the mention reference prediction task, CorefBERT becomes capable of reasoning over these mentions, aggregating information from mentions in different positions, and finally figuring out the correct answer.

In examples (3) and (4), it is necessary to know that she refers to Elena and he refers to Ector, respectively, through coreference resolution. Benefiting from a large amount of distantly supervised coreference resolution training data, CorefBERT successfully identifies the reference relationships and provides accurate answers.

(1) Q: Whose uncle trains the asthmatic boy?
Paragraph: [1] Barry Gabrewski is an asthmatic boy … [2] Barry wants to learn the martial arts, but is rejected by the arrogant dojo owner Kelly Stone for being too weak. [3] Instead, he is taken on as a student by an old Chinese man called Mr. Lee, Noreen’s sly uncle. [4] Mr. Lee finds creative ways to teach Barry to defend himself from his bullies.
(2) Q: Which composer produced String Quartet No. 2?
Paragraph: [1] Tippett’s Fantasia on a Theme of Handel for piano and orchestra was performed at the Wigmore Hall in March 1942, with Sellick again the soloist, and the same venue saw the premiere of the composer’s String Quartet No. 2 a year later. … [2] In 1942, Schott Music began to publish Tippett’s works, establishing an association that continued until the end of the composer’s life.

(3) Q: What is the first name of the person who lost her beloved husband only six months earlier?
Paragraph: [1] Robert and Cathy Wilson are a timid married couple in 1940 London. … [2] Robert toughens up on sea duty and in time becomes a petty officer. His hands are badly burned when his ship is sunk, but he stoically rows in the lifeboat for five days without complaint. [3] He recuperates in a hospital, tended by Elena, a beautiful nurse. [4] He is attracted to her, but she informs him that she lost her beloved husband only six months earlier, kisses him, and leaves.

(4) Q: Who would have been able to win the tournament with one more round?
Paragraph: [1] At a jousting tournament in 14th-century Europe, young squires William Thatcher, Roland, and Wat discover that their master, Sir Ector, has died. [2] If he had completed one final pass he would have won the tournament. [3] Destitute, William wears Ector’s armour to impersonate him, winning the tournament and taking the prize.

Table 8: Examples from QUOREF (Dasigi et al., 2019). Answers from BERT, answers from CorefBERT, and clues are colored accordingly.