
Multi-task Learning Based Neural Bridging Reference Resolution

We propose a multi-task learning-based neural model for bridging reference resolution, tackling two key challenges faced by bridging reference resolution. The first challenge is the lack of large corpora annotated with bridging references. To address this, we use multi-task learning to support bridging reference resolution with coreference resolution. We show that substantial improvements of up to 8 percentage points can be achieved on full bridging resolution with this architecture. The second challenge is that different corpora use different definitions of bridging, meaning that hand-coded systems, or systems using special features designed for one corpus, do not work well with other corpora. Our neural model uses only a small number of corpus-independent features, and thus can be applied easily to different corpora. Evaluations with very different bridging corpora (ARRAU, ISNOTES, BASHI and SCICORP) suggest that our architecture works equally well on all corpora, achieving SoTA results on full bridging resolution for all of them and outperforming the best reported results by up to 34.9 percentage points.





1 Introduction

Anaphora resolution [16, 40, 14, 6, 29] is the task of linking nominal expressions to entities in the context of interpretation (or discourse model). As illustrated by (1) (adapted from [10]), nominal expressions can be linked to the context in several ways: coreference (linking [The Bakersfield Supermarket], [The business], and [its]), bridging or associative reference (linking [the customers] to the supermarket) [2, 28, 10], and discourse deixis (linking [the murder] to the event of murdering in the previous sentence) [41, 18].

  (1) [The Bakersfield Supermarket] went bankrupt last May. [The business] closed when [[its] old owner] was murdered by [robbers]. [The murder] saddened [the customers].

Bridging reference resolution is the sub-task of anaphora resolution concerned with identifying and resolving bridging references, i.e., anaphoric references to non-identical associated antecedents. Bridging resolution is much less studied than the closely related sub-task of coreference resolution, which has received a lot of attention ([32, 42, 20, 21], to mention just a few recent proposals). One reason for this is the lack of training data. Several corpora have been annotated with bridging references, including e.g. gnome [30], isnotes [23], scicorp [34] and bashi [35], but they are all rather small, with at most around 1k examples of bridging reference. arrau [26, 38] is much larger, but still contains only 5.5k bridging pairs. It is challenging to train a learning-based system on that amount of data, particularly the newer neural models. As a result, the current SoTA systems for full bridging resolution are still rule-based, employing a number of heuristic rules, many of which are corpus-dependent

[9, 33]. This is problematic in the light of the second challenge for work in this area: namely, that the definitions of bridging differ across these corpora [33]. Existing corpora differ in whether they attempt to annotate only what Roesiger et al. call referential bridging, as in the case of isnotes, or also lexical bridging, as in arrau. (Footnote 1: Roesiger et al. use ‘referential bridging’ for cases in which the bridging reference needs an antecedent in order to be interpretable, such as the door in John walked towards the house. The door was open. ‘Lexical bridging’ is when the bridging reference could also be interpreted autonomously, such as Madrid in I went to Spain last year. I particularly liked Madrid. See [30, 1, 23, 10, 38] for a detailed discussion of the annotation schemes and their motivations, and [33, 36] for a discussion of the implications.) The isnotes, bashi and scicorp corpora consist mostly of referential bridging examples, while the arrau corpus contains both types of bridging references. As a consequence, a system designed for one corpus (e.g. isnotes) works poorly when applied to other corpora (e.g. arrau), and significant modifications are needed to make the system work equally well on different corpora [36].

To tackle these challenges, we introduce a multi-task learning-based neural model that learns bridging resolution together with coreference resolution. (Footnote 2: The code is available at .) Multi-task architectures have proven effective at exploiting the synergies between distinct but related tasks in cases where only limited amounts of data are available for one or more of the tasks [3]. Such an architecture should therefore be especially suited to our setting, given that, linguistically, bridging reference resolution and coreference resolution are two distinct but closely related aspects of anaphora resolution, and indeed were often tackled together in early ML-based systems [39]. Using a neural network-based approach that minimises feature engineering makes the system more flexible with respect to the choice of corpora. We mainly evaluate our system on the rst portion of the arrau corpus, since it is the largest available resource, but we additionally evaluate it on the trains and pear portions of arrau, as well as on the isnotes, bashi and scicorp corpora, to demonstrate its tolerance of this diversity.

We start with a strong baseline for bridging adapted from the SoTA coreference architecture [21, 15] enhanced with bert embeddings [5]. We extend the system to multi-task learning by adding a coreference classifier that shares part of the network with the bridging classifier. In this way, we improve full bridging resolution and its subtasks (anaphor recognition and antecedent selection) by 6.5-7.3 percentage points. But because the number of coreference examples is much larger than the number of bridging pairs, the dataset is highly imbalanced. We achieve further improvements of 1.7% and 6.6% on full bridging resolution and anaphor recognition respectively by using undersampling during training. This final system achieves SoTA results on both full bridging resolution and its subtasks, i.e. 4.5%, 6% and 9.5% higher than the best reported results [36] on full bridging resolution, anaphor recognition and antecedent selection respectively. Evaluation on trains, pear, isnotes, bashi and scicorp shows the same trend. Although the datasets are much smaller and the annotation schemes for isnotes, bashi and scicorp differ from arrau's, our system works equally well, achieving new SoTA results on full bridging resolution for all five corpora.

2 Related Work

2.1 Bridging Reference Resolution

Bridging reference resolution involves two subtasks: anaphor recognition and antecedent selection [10]. Early work on bridging resolution mostly focused on definite bridging anaphors [37, 39, 19], but later systems covered unrestricted antecedent selection [28, 8]. Hou et al. (2013) introduced a model based on Markov logic networks, using an extensive set of features and constraints. They evaluated the system with both local and global features on isnotes, and showed that global features can greatly improve performance. The system was later extended in [13, 12, 10], exploring additional features ranging from embeddings tailored for bridging resolution to advanced antecedent candidate selection using discourse structure extracted from the Penn Discourse Treebank (d-scope-salience). The Hou et al. (2018) system using d-scope-salience is the current SoTA on antecedent selection for isnotes. The anaphor recognition subtask is usually solved as a part of the information status task [23, 11, 10].

Recently, systems for full bridging resolution have been introduced [9, 10, 33, 36]. Hou et al. (2014) proposed a rule-based system for full bridging resolution on the isnotes corpus, consisting of a rich set of rules motivated by linguistic knowledge. They also evaluated a learning-based system that uses the rules as features, but the learning-based system only outperforms the rule-based system’s F1 score by 0.1 percentage points. The rule-based system was later adapted by Roesiger et al. (2018) and Roesiger (2018) for full bridging resolution on arrau. But since arrau follows a different definition of bridging, most of the rules had to be changed. Hou et al. (2018) is the current state of the art on full bridging resolution, but it was only evaluated on isnotes.

2.2 Multi-task Learning for Under-Resourced Tasks

Multi-task learning has been successfully used in several NLP applications [4, 22, 17, 3]. Normally, the goal of multi-task learning is to improve performance on all tasks; but in an under-resourced setting, the aim is often only to improve performance on the low-resource task/language/domain (the target task). This is sometimes known as shared-representation-based transfer learning. Yang et al. (2017) applied transfer learning to sequence labelling tasks; the deep hierarchical recurrent neural network used in their work is fully or partially shared between the source and the target tasks. They demonstrated that SoTA performance can be achieved by models trained on multiple tasks. Cotterell and Duh (2017) trained a neural NER system on a combination of high- and low-resource languages to improve NER for the low-resource languages. In their work, character-based embeddings are shared across the languages. Recently, Zhou et al. (2019) introduced a multi-task network together with adversarial learning for under-resourced NER. Evaluation in both cross-language and cross-domain settings shows that partially sharing the BiLSTM works better for cross-language transfer, while in the cross-domain setting the system performs better when the LSTM layers are fully shared.

2.3 Neural Coreference Resolution

By contrast with bridging reference, coreference resolution has been extensively studied. Wiseman et al. (2015, 2016) first introduced a neural network-based approach to solving coreference in a non-linear way. Clark and Manning (2016) integrated reinforcement learning to let the model be optimized directly on the B³ scores. Lee et al. (2017) proposed a neural joint approach to mention detection and coreference resolution. Their model does not rely on parse trees; instead, the system learns to detect mentions by exploring the outputs of a BiLSTM. After the introduction of context-dependent word embeddings such as ELMo [25] and bert [5], the Lee et al. (2017) system was greatly improved by those embeddings [21, 15], achieving SoTA results. We use a simplified version of the models of [21, 15] as our baseline.

3 Methods

3.1 The Single-Task Baseline System

We use as our single-task baseline for bridging reference a simplified version of the SoTA coreference systems by Lee et al. (2018) and Kantor and Globerson (2019), since bridging resolution is closely related to coreference: like coreference, it requires establishing a link to an entity in the discourse model, but through a non-identity relation. The Kantor and Globerson (2019) model is an extended version of [21]; the main difference is that Kantor and Globerson use BERT embeddings [5] instead of the ELMo embeddings [25] used by Lee et al. These systems have a similar architecture, and both perform mention detection and coreference jointly. We only use the coreference part of the system, since bridging resolution is usually evaluated on gold mentions.

Our baseline system first creates representations for mentions using the output of a BiLSTM. The sentences of a document are encoded from both directions to obtain a representation for each token in the sentence. The BiLSTM takes as input the concatenated word- and character-level embeddings $(x_t)$. For word embeddings, GloVe [24] and BERT [5] embeddings are used. Character embeddings are learned by a convolutional neural network (CNN) during training. The tokens are represented by the concatenated outputs of the forward and backward LSTMs $(x^*_t)$. The token representations $(x^*)$ are used together with head representations $(\hat{x})$ to represent mentions $(g)$. The $\hat{x}_i$ of a mention $i$ is obtained by applying attention over its token representations $(x^*_{\mathrm{START}(i)}, \dots, x^*_{\mathrm{END}(i)})$, where $\mathrm{START}(i)$ and $\mathrm{END}(i)$ are the indices of the start and the end of the mention respectively. Formally, we compute $\hat{x}_i$, $g_i$ as follows:

$$\alpha_t = \mathrm{FFNN}_\alpha(x^*_t), \quad a_{i,t} = \frac{\exp(\alpha_t)}{\sum_{k=\mathrm{START}(i)}^{\mathrm{END}(i)} \exp(\alpha_k)}, \quad \hat{x}_i = \sum_{t=\mathrm{START}(i)}^{\mathrm{END}(i)} a_{i,t} \cdot x_t$$

$$g_i = [x^*_{\mathrm{START}(i)}, x^*_{\mathrm{END}(i)}, \hat{x}_i, \phi(i)]$$

where $\phi(i)$ is the mention width feature embedding. Next, we pair the mentions with candidate antecedents to create a pair representation $(P(i,j))$:

$$P(i,j) = [g_i, g_j, g_i \circ g_j, \phi(i,j)]$$

where $g_i$ and $g_j$ are the representations of the anaphor and the candidate antecedent respectively, $\circ$ denotes the element-wise product, and $\phi(i,j)$ is the distance feature between the mention pair. To make the model computationally tractable, we consider for each mention a maximum of 150 candidate antecedents. (Footnote 3: The maximum number of antecedents was tuned on the dev set.)
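To make the span and pair construction concrete, here is a minimal NumPy sketch; the array sizes, the linear attention scorer standing in for FFNN_alpha, and all values are hypothetical stand-ins for the learned components described above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mention_rep(tokens, start, end, width_emb, w_alpha):
    """g_i = [x*_START(i), x*_END(i), head, phi(i)] with an attention head."""
    span = tokens[start:end + 1]          # x*_START(i) .. x*_END(i)
    a = softmax(span @ w_alpha)           # attention weights a_{i,t}
    head = a @ span                       # attention-weighted head x_hat_i
    return np.concatenate([tokens[start], tokens[end], head, width_emb])

def pair_rep(g_ana, g_ante, dist_emb):
    """P(i,j) = [g_i, g_j, g_i * g_j, phi(i,j)] (element-wise product)."""
    return np.concatenate([g_ana, g_ante, g_ana * g_ante, dist_emb])

rng = np.random.default_rng(0)
toks = rng.normal(size=(6, 4))            # BiLSTM outputs for a 6-token sentence
g1 = mention_rep(toks, 0, 1, np.zeros(2), rng.normal(size=4))
g2 = mention_rep(toks, 3, 5, np.zeros(2), rng.normal(size=4))
print(pair_rep(g1, g2, np.zeros(3)).shape)   # 14 + 14 + 14 + 3 dimensions
```

In the full model, candidate antecedents would be capped at 150 per mention before pairs are built.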

The next step is to compute the pairwise score $(s(i,j))$. Following Lee et al. (2018), we add an artificial antecedent $\epsilon$ to deal with cases of non-bridging-anaphor mentions or cases in which the antecedent does not appear in the candidate list. We compute $s(i,j)$ as follows:

$$s(i,j) = \mathrm{FFNN}_s(P(i,j)), \quad s(i,\epsilon) = 0$$

For each mention, the predicted antecedent is the one with the highest $s(i,j)$; a bridging link is only created if the predicted antecedent is not $\epsilon$.
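The scoring and prediction step can be sketched as follows; a linear scorer stands in for FFNN_s here, and the weights are hypothetical:

```python
import numpy as np

def predict_antecedent(pair_reps, w):
    """Pick the argmax of s(i, j) over candidates plus the dummy antecedent.

    pair_reps: (K, d) pair representations for K candidate antecedents
    w:         (d,) weights of a linear scorer standing in for FFNN_s
    Returns the winning candidate index, or None when the dummy antecedent
    wins (non-bridging mention, or no valid antecedent among the candidates).
    """
    scores = np.concatenate([[0.0], pair_reps @ w])   # s(i, eps) = 0 at index 0
    best = int(np.argmax(scores))
    return None if best == 0 else best - 1

pairs = np.array([[0.2, -1.0], [0.5, 0.3]])
w = np.array([1.0, 1.0])
print(predict_antecedent(pairs, w))           # candidate 1 scores 0.8 > 0
print(predict_antecedent(-np.abs(pairs), w))  # all scores below 0 -> None
```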

3.2 Our Multi-task Learning Architecture

Choosing a source task that is closely related to bridging resolution is crucial to the success of our multi-task learning model. In this work, we use coreference as the source task. The key intuitions behind this choice are: (i) from a language interpretation point of view, resolving anaphoric coreference and anaphoric bridging reference are closely related tasks in that they both involve trying to identify relations between anaphors and antecedents [31]; indeed, the two tasks were typically tackled jointly by non-ML-based anaphora resolution systems [37, 7, 39]; (ii) from the point of view of the model, both tasks rely on good mention representations and can be solved by neural mention-pair models.

We turn our model into a multi-task model by adding to the architecture an additional classifier for coreference, and jointly predicting coreference and bridging (Figure 1). We use the same candidate antecedents for both the bridging and the coreference task. As shown in Figure 1, our model uses shared mention representations (i.e. the word embeddings and the BiLSTM), with additional options to share some or all hidden layers of the ffnn. By sharing most of the network structure, the mention representations learned via the coreference task become accessible to bridging resolution.

Figure 1: The proposed multi-task architecture.
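The parameter sharing in Figure 1 can be sketched as follows; a single shared matrix stands in for the shared embeddings, BiLSTM and first FFNN hidden layer, and all names and sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Shared parameters: stand-in for the word embeddings, BiLSTM and the
# first FFNN hidden layer shared by the two classifiers.
W_shared = rng.normal(size=(6, 6))
# Task-specific output layers.
w_coref = rng.normal(size=6)
w_bridge = rng.normal(size=6)

def shared_hidden(pair_rep):
    """Shared hidden features computed once for both tasks."""
    return np.tanh(pair_rep @ W_shared)

def coref_score(pair_rep):
    return float(shared_hidden(pair_rep) @ w_coref)

def bridging_score(pair_rep):
    return float(shared_hidden(pair_rep) @ w_bridge)

# Both scorers read the same hidden features; during training, gradients
# from either task's loss would update W_shared, which is how coreference
# supervision benefits the bridging classifier.
p = rng.normal(size=6)
print(coref_score(p), bridging_score(p))
```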

3.3 Learning with Imbalanced Data

Following Lee et al. (2018), we optimise our system on the marginal log-likelihood of all correct antecedents. For bridging, we consider a bridging antecedent correct if it is from the same gold coreference cluster $\mathrm{GOLD}(i)$ as the gold bridging antecedent. For coreference, the correct antecedents are implied by the gold coreference cluster $\mathrm{GOLD}(i)$ the mention belongs to. We compute both the bridging and the coreference loss as follows:

$$\mathcal{L} = -\log \prod_{i=1}^{N} \sum_{\hat{y} \in Y(i) \cap \mathrm{GOLD}(i)} P(\hat{y})$$

In case mention $i$ is not a bridging/coreference anaphor, or $Y(i)$ (the candidate antecedents) does not contain mentions from $\mathrm{GOLD}(i)$, we set $\mathrm{GOLD}(i) = \{\epsilon\}$.
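A minimal sketch of this loss for a single mention, assuming softmax-normalised antecedent scores with the dummy antecedent at index 0 (the function name and inputs are illustrative):

```python
import numpy as np

def marginal_nll(scores, gold_mask):
    """Negative marginal log-likelihood over all correct antecedents.

    scores:    (K+1,) pairwise scores; index 0 is the dummy antecedent
               with its fixed score of 0
    gold_mask: (K+1,) booleans marking the candidates in GOLD(i); when no
               true antecedent is reachable, only index 0 (the dummy) is True
    """
    log_p = scores - np.log(np.sum(np.exp(scores)))    # log-softmax P(y)
    return -np.logaddexp.reduce(log_p[gold_mask])      # -log sum of gold probs

s = np.array([0.0, 2.0, -1.0])
print(marginal_nll(s, np.array([False, True, False])))  # one gold antecedent
print(marginal_nll(s, np.array([True, False, False])))  # GOLD(i) = {eps}
```

Summing probabilities over all gold antecedents (rather than picking one) lets the model credit any member of the gold cluster.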

When training with coreference, one of the problems we need to face is class imbalance. Consider the rst portion of arrau as an example (mostly WSJ text). The corpus contains 72k mentions in total: of these, 45k (63%) are discourse-new (DN), 24k (33%) are discourse-old (DO), and the remaining 3k (4%) are bridging anaphors. Training the model on such an imbalanced corpus may significantly harm recall on bridging anaphors.

To reduce the negative effect of this imbalance, we use undersampling during training, randomly removing DN and DO examples to make the corpus more balanced. More precisely, we use a heuristic negative example ratio $r$ to control the total number of negative examples during training, so that, e.g., $r = 2$ means we keep 6k DN and 6k DO examples during training. We chose a value for $r$ by trying a few small values in preliminary experiments; they all gave very similar results, and we use the same setting for $r$ in the experiments below.
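The undersampling heuristic can be sketched as follows; the labels and helper name are hypothetical, and the example uses a toy corpus rather than the real arrau counts:

```python
import random

def undersample(examples, r, seed=0):
    """Keep all bridging anaphors; keep at most r * n_bridging mentions
    from each negative class (DN and DO).

    examples: list of (mention_id, label), label in {"BRIDGING", "DN", "DO"}
    r:        negative example ratio, e.g. r = 2 with 3k bridging anaphors
              would keep 6k DN and 6k DO examples
    """
    rng = random.Random(seed)
    bridging = [e for e in examples if e[1] == "BRIDGING"]
    cap = r * len(bridging)
    kept = list(bridging)
    for label in ("DN", "DO"):
        neg = [e for e in examples if e[1] == label]
        rng.shuffle(neg)          # random removal of surplus negatives
        kept.extend(neg[:cap])
    return kept

data = ([(f"b{i}", "BRIDGING") for i in range(3)]
        + [(f"n{i}", "DN") for i in range(10)]
        + [(f"o{i}", "DO") for i in range(10)])
print(len(undersample(data, r=2)))   # 3 bridging + 6 DN + 6 DO = 15
```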

4 Experiments

Datasets We evaluated our systems on arrau [26, 38], isnotes [23], bashi [35] and scicorp [34], with arrau rst as our primary dataset, as it is substantially larger.

Bridging references are annotated in arrau according to the scheme in [38], which covers both what Roesiger (2018) calls ‘lexical’ and ‘referential’ bridging. The corpus was used for Task 2 of the crac 2018 shared task [27], focused on bridging resolution. As in the crac shared task, we evaluate our system on all three subcorpora: rst, trains and the pear stories. The rst portion of the corpus consists of 413 news documents (1/3 of the WSJ section of the Penn Treebank); we used the default train/dev/test subdivisions. The trains and pear portions of the corpus contain 114 dialogues and 20 narratives respectively. Since trains and pear are much smaller, we use 10-fold cross-validation and report the results on the test set to compare with previous work.

The isnotes corpus consists of 50 documents from the WSJ portion of OntoNotes, with 663 bridging pairs annotated, as well as fine-grained information status according to the scheme in [23]; bridging is annotated as one of the information status subclasses. Like isnotes, the bashi corpus contains 50 documents from the WSJ portion of OntoNotes. The dataset has 459 bridging pairs annotated according to a novel annotation scheme focused on referential bridging [35]. The scicorp corpus uses text from a very different domain: scientific articles. The corpus has a total of 1,366 bridging pairs, again annotated according to its own annotation scheme [34]. Since these corpora are rather small, we used 10-fold cross-validation to evaluate on them.

Evaluation metrics We evaluate our system on both full bridging resolution and its subtasks (anaphor recognition and antecedent selection). For full bridging resolution and anaphor recognition we report F1 scores. (Footnote 4: Following [36], we consider a predicted bridging antecedent correct when it belongs to the same gold coreference cluster as the gold bridging antecedent.) For antecedent selection we report accuracy, as this subtask uses gold bridging anaphors.
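The scoring criterion behind these F1 scores can be sketched as follows; this is a minimal, hypothetical implementation, with illustrative mention strings and cluster ids:

```python
def bridging_f1(predicted, gold, cluster_of):
    """F1 for full bridging resolution.

    predicted, gold: dicts mapping anaphor -> antecedent mention
    cluster_of:      mention -> gold coreference cluster id (singletons get
                     their own id); a predicted antecedent counts as correct
                     when it shares a gold cluster with the gold antecedent
    """
    correct = sum(1 for ana, ante in predicted.items()
                  if ana in gold and cluster_of[ante] == cluster_of[gold[ana]])
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

clusters = {"the supermarket": 0, "the business": 0, "the robbers": 1,
            "the murder": 2}
gold = {"the customers": "the supermarket"}
pred = {"the customers": "the business",   # same gold cluster: correct
        "the door": "the robbers"}         # spurious anaphor: hurts precision
print(bridging_f1(pred, gold, clusters))   # P = 1/2, R = 1/1, F1 = 2/3
```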

Hyperparameters Apart from the two parameters introduced by our model (the maximum number of antecedents and the negative example ratio $r$), we mostly use the default settings from Lee et al. (2018), but replace their ELMo settings with the BERT settings from Kantor and Globerson (2019). We train the models evaluated on the arrau rst corpus for 200 epochs, and the models trained on the other corpora for 50 epochs.

5 Results and Discussions

(a) antecedent selection (accuracy)

System          Shared Network      rst    isnotes
bridging only   -                   47.4   33.8
multi-task      embeddings, LSTM    50.9   38.7
multi-task      + 1 FFNN layer      54.7   43.7
multi-task      + 2 FFNN layers     51.7   40.1

(b) anaphor recognition and full bridging resolution

                                  rst                  |                isnotes
                  anaphor rec.    full bridging res.   |  anaphor rec.    full bridging res.
System            P    R    F1    P    R    F1         |  P    R    F1    P    R    F1
bridging only     50.0 12.5 20.0  34.5  8.6 13.8       |  63.9 16.2 25.8  33.3  8.5 13.5
multi-task        47.3 19.0 27.1  35.5 14.2 20.3       |  59.3 22.5 32.7  31.5 12.0 17.4
 + undersampling  31.5 36.2 33.7  20.6 23.7 22.0       |  45.6 47.2 46.4  19.1 19.7 19.4

Table 1: Parameter tuning on the dev sets of arrau rst and isnotes.

5.1 Evaluation on the arrau rst corpus

We first evaluated our multi-task learning-based system on the antecedent selection subtask, to assess the suitability of our model for bridging. The antecedent selection subtask uses gold bridging anaphors, hence it is simpler than full bridging resolution, which additionally involves identifying the bridging references. Focusing on a simpler task allows us a clearer view of the effects of multi-task learning. In this experiment, we configured the system to share only the mention representations (the word embeddings and BiLSTM). As illustrated in Table 1(a), the baseline system already achieves good accuracy on this task. Although starting from a strong baseline, our multi-task learning-based system achieved an improvement of 3.5 percentage points, confirming our hypothesis that coreference is a good source task for bridging.

Sharing the Feed-forward Layers We further extended our model to share the FFNN layers in addition to the mention representations. The FFNN layers have access to the pairwise representations to learn the relations between anaphors and antecedents, and hence contain useful information about how likely two mentions are to be related. As expected, this additional sharing of the FFNN layers resulted in further improvements. The largest improvement, of 3.8 percentage points, was achieved by sharing one additional FFNN layer. Accuracy drops when both hidden layers are shared between coreference and bridging, but performance is still higher than with the model that only shares mention representations. Overall, the multi-task model achieved a substantial gain of 7.3 percentage points compared with the system carrying out only bridging reference resolution (see Table 1(a)).

Full Bridging Resolution Having ascertained the benefits of our multi-task model for antecedent selection, we applied the best settings (sharing mention representations and the 1st hidden layer of the FFNN) to full bridging resolution as well. We also report the F1 scores for bridging anaphor recognition, a byproduct of full bridging resolution. Table 1(b) shows a comparison between the single-task baseline and the multi-task models. The baseline model trained without multi-task learning achieved F1 scores of 13.8% and 20% on full bridging resolution and anaphor recognition respectively. The low F1 scores are mainly due to poor recall in both tasks, a well-known problem in bridging reference resolution. When applying multi-task learning, the F1 scores improve substantially (6.5% and 7.1% for full bridging resolution and anaphor recognition respectively). These F1 improvements are mainly a result of better recall; the precision of the two models is similar. This suggests that learning with coreference does help the model capture more correct bridging pairs. However, recall is still much lower than precision. As this might be a result of data imbalance, we applied undersampling during training, to train the model on a more balanced dataset. As shown in Table 1(b), with undersampling the model has more balanced precision and recall, and also achieves better F1 scores on both full bridging resolution and anaphor recognition. The new model achieved improvements of 6.6% and 1.7% on anaphor recognition and full bridging resolution respectively, compared with the model without undersampling. Overall, our multi-task models showed their merit on both tasks and achieved considerable gains of 8.2% and 13.7% compared with the single-task system.

System                           rst    trains  pear   isnotes  bashi  scicorp
Hou et al. (2013) MLN model I    -      -       -      35.6     -      -
Hou et al. (2013) MLN model II   -      -       -      41.3     -      -
Hou (2018)                       32.4   -       -      46.5     27.4   -
Hou et al. (2018)*               -      -       -      50.7     -      -
Roesiger (2018)                  39.8   48.9    28.2   -        -      -
Our model                        49.3   50.9    61.2   40.7     34.0   33.4

Table 2: Comparing our model with the SoTA for antecedent selection (accuracy). *The Hou et al. (2018) system used additional gold information for feature extraction; see Section 5.3 for more details.
Corpus   Gold Coref.  Model                     anaphor rec.        full bridging res.
         Anaphors                               P     R     F1      P     R     F1
rst      Keep         Our model                 31.8  29.8  30.8    20.2  18.9  19.5
         Remove       Roesiger (2018)           29.2  32.5  30.7    18.5  20.6  19.5
         Remove       Our model                 37.6  35.9  36.7    24.6  23.5  24.0
trains   Keep         Our model                 49.4  36.0  41.6    33.7  24.6  28.4
         Remove       Roesiger (2018)           39.3  21.8  24.2    27.1  21.8  24.2
         Remove       Our model                 62.2  40.4  48.9    39.2  25.4  30.9
pear     Keep         Our model                 67.0  51.8  58.4    58.4  45.1  50.9
         Remove       Roesiger (2018)           75.0  16.0  26.4    57.1  12.2  20.1
         Remove       Our model                 75.1  53.9  62.7    65.9  47.2  55.0
isnotes  Keep         Hou et al. (2014)†        65.9  14.1  23.2    57.7  10.1  17.2
         Keep         Roesiger et al. (2018)    45.9  18.3  26.2    32.0  12.8  18.3
         Keep         Our model                 61.5  30.6  40.9    33.0  16.4  22.0
         Remove       Roesiger et al. (2018)    71.6  18.3  29.2    50.0  12.8  20.4
         Remove       Hou et al. (2018)‡        -     -     -       20.6  22.6  21.6
         Remove       Our model                 68.4  32.0  43.6    36.5  17.0  23.2
bashi    Keep         Our model                 42.7  19.6  26.9    22.6  10.4  14.2
         Remove       Roesiger et al. (2018)    49.4  20.2  28.7    24.3  10.0  14.1
         Remove       Our model                 44.6  19.6  27.2    23.6  10.4  14.4
scicorp  Keep         Our model                 45.0  35.7  39.8    21.5  17.1  19.0
         Remove       Roesiger et al. (2018)    17.7  0.9   8.1     3.2   0.9   1.5
         Remove       Our model                 52.9  41.2  46.3    25.0  19.4  21.9

Table 3: Comparing our model with the SoTA for full bridging resolution. †The results for Hou et al. (2014) are taken from Roesiger et al. (2018), as the original results were obtained on an unknown subset of the corpus. ‡Hou et al. (2018) is evaluated in a different setting: instead of filtering out gold coreference anaphors, they use an IS classifier to assign mentions IS classes and exclude mentions belonging to IS classes other than bridging (including predicted coreference anaphors).

Comparison with the State of the Art We then evaluated our model on the test set of arrau rst to compare it with the previously reported state of the art on the same dataset. Table 2 shows the comparison on antecedent selection. The best reported system on this task, [36], is a modified version of the original rule-based system designed for isnotes by Hou et al. (2014). Our system outperforms the current state of the art by nearly 10 percentage points. Table 3 presents the comparison on full bridging resolution and anaphor recognition. Since the only reported full bridging resolution results on arrau [36] were evaluated with coreferent anaphors removed, we follow the same method and remove gold coreferent anaphors from the evaluation, but we also report the results with coreferent anaphors included for future reference. Filtering out the gold coreferent anaphors makes the task easier, resulting in better F1 scores. After filtering out gold coreferent anaphors, our system achieved F1 scores of 24% and 36.7% on full bridging resolution and anaphor recognition respectively, which is 4.5% and 6% higher than the scores reported by Roesiger (2018). Overall, our model achieves new SoTA results on both full bridging resolution and its subtasks.

5.2 Evaluation on the arrau trains and pear corpus

We then evaluate our system on the trains and pear portions of the arrau corpus. For both corpora, the only reported results are by Roesiger (2018). For antecedent selection our system achieved scores 2 and 33 percentage points better than theirs on trains and pear respectively (see Table 2). For the other two tasks, they only report scores after filtering out the gold coreference anaphors; when evaluated in the same setting, our system achieved substantial improvements of up to 36.3% and 34.9% on anaphor recognition and full bridging resolution respectively. Overall, our system is substantially better than the Roesiger (2018) system on both the trains and pear corpora.

5.3 Evaluation on the isnotes corpus

Most recent work on bridging reference resolution was evaluated on isnotes; a number of systems were developed for both full bridging resolution [9, 33, 10] and antecedent selection [8, 13, 12, 10]. Since isnotes follows a very different annotation scheme from that of arrau, to confirm the suitability of our best arrau setting for corpora containing only referential bridging examples (isnotes, bashi and scicorp), we ran additional parameter tuning on the isnotes corpus. For parameter tuning we used the same 10 documents used by Roesiger et al. (2018) as a development set, and the remaining 40 documents for training. As shown in Table 1(a) and Table 1(b), the results on isnotes follow the same trend as for arrau rst; the best settings for the two corpora remain the same.

To compare with the SoTA systems, we use 10-fold cross-validation to obtain predictions for the whole corpus. On the full bridging resolution task, our system outperforms all previous results both when coreferent anaphors are included (by 3.7%) and when they are excluded (by 1.6%). The improvements on anaphor recognition are larger: our system is more than 14% better in both settings.

For antecedent selection, however, our system achieved a result broadly comparable with that of the model called MLN model II in [8], but lower than those obtained with subsequent developments of this model [13, 12, 10]. We see four main reasons for the lower performance. (i) The Hou et al. systems rely on hand-annotated gold information from OntoNotes to compute their features, including named entity and syntactic annotation. In addition, the top-performing system among them, [10], also uses discourse structure annotations from the Penn Discourse Treebank to define the set of antecedents in the ‘discourse scope’. By contrast, our system does not rely on any hand-coded annotations at test time. We would argue that this setup is more realistic than, in particular, the setup in [10]. (Footnote 8: The Hou et al. (2018) system is not publicly available.) (ii) The models in [13, 12, 10] include, in addition to pairwise features, a number of ‘global’ features designed for bridging references to globally salient entities and for bridging references that share the same antecedent (‘siblings’). However, the versions of such features we tested in our model did not improve its performance. (iii) The Hou et al. systems are evaluated in a mention-entity setting that assumes gold coreference chains are available at test time, while our system is based on a mention-pair architecture. The mention-entity setting results in a much smaller pool of candidate antecedents (Footnote 9: In their system each bridging anaphor has fewer than 20 candidates, in contrast with our 150); hence the task becomes easier, but less realistic. (iv) The Hou et al. systems are heavily tuned on the isnotes corpus; results on other corpora are either not reported or much lower.

Overall, on the isnotes corpus our system achieved a competitive result on antecedent selection and the SoTA on full bridging resolution and anaphor recognition.

5.4 Evaluation on the bashi corpus

(Footnote 10: We also tried combining isnotes and bashi for training, as suggested by Roesiger, since they both mainly focus on referential bridging; however, concatenating the corpora did not improve the performance on either corpus.)

We next compare our system with previous models on the bashi corpus. Since gold mentions are not annotated in bashi, we use NPs as our predicted mentions without filtering. (Footnote 12: NPs that do not belong to coreference clusters or bridging relations are treated as non-mentions during training.) For antecedent selection the only reported result on the bashi corpus is from Hou (2018) (see Table 2). Our system achieved an accuracy of 34%, which is 6.6 percentage points better than that of Hou (2018). Roesiger et al. (2018) reported the only results for full bridging resolution and anaphor recognition. Our system achieves an F1 that is 1.5% lower than their result on anaphor recognition, but a better F1 on full bridging resolution (see Table 3). Overall, our model achieves new SoTA results on full bridging resolution and antecedent selection.

5.5 Evaluation on the scicorp corpus

Finally, we evaluated our system on the scicorp corpus, in which, as in bashi, gold mentions are not annotated, so again we used NPs as our predicted mentions. scicorp consists of scientific documents, which are very different from those in bashi (news). As a result, the only reported result on scicorp, [33], is rather poor: Rösiger et al.'s rule-based system achieved only 1.5% and 8.1% (F1) for full bridging resolution and anaphor recognition respectively. The poor result is mainly due to the system recognizing less than 1% of the bridging anaphors, which is another example of the sensitivity of rule-based systems to domain shift. By contrast, our system achieved on this corpus F1 scores of 21.8% and 46.3% for full bridging resolution and anaphor recognition, respectively (Table 3). These scores are broadly in the same range as those achieved by our system on the other three corpora, which indicates that our system's performance does not deteriorate as badly under domain shift. On the antecedent selection task, our system achieved an accuracy of 33.4%; to the best of our knowledge, this is the first reported result for antecedent selection on scicorp.
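For reference, the F1 scores reported throughout this section follow the standard set-based definition: a predicted item counts as correct only if it matches gold exactly (for full bridging resolution, both anaphor and antecedent must match; for anaphor recognition, the anaphor span must match). A minimal sketch of the computation (the function name is ours):

```python
def precision_recall_f1(pred, gold):
    """Standard P/R/F1 over sets of predicted vs. gold items, where an
    item is an anaphor span (anaphor recognition) or an
    (anaphor, antecedent) pair (full bridging resolution)."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Antecedent selection, by contrast, is scored as plain accuracy over gold anaphors.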

6 Conclusions

In this paper we proposed a multi-task neural architecture tackling two major challenges for bridging reference resolution. The first challenge is the lack of very large training datasets: the largest corpus for bridging reference, arrau [38], contains only 5.5k examples, and other corpora are much smaller (the most used corpus for bridging, isnotes [23], contains only 663 bridging pairs). The second challenge is that different corpora follow different annotation schemes for bridging (referential and lexical bridging), so designing a system that can be applied across corpora is difficult. Our results on the arrau rst corpus demonstrate that performance on full bridging resolution and its subtasks can be significantly improved by learning with additional coreference annotations: our multi-task model achieved substantial improvements of 7.3%-13.7% on full bridging resolution and its subtasks when compared with a single-task baseline trained solely on bridging annotations. As a result, our final system achieved SoTA results on all three tasks. Further evaluation on trains, pear, isnotes, bashi and scicorp demonstrates the robustness of our system under changes of annotation scheme and domain. The very same architecture used for arrau again achieved SoTA results on full bridging resolution for all three corpora.
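The multi-task idea can be summarized in a few lines (a framework-free sketch under our own assumptions, not the paper's implementation): the shared encoder parameters receive gradients from both the scarce bridging objective and the much larger coreference objective; the mixing weight `alpha` and the proportional task-sampling scheme are hypothetical illustrations.

```python
import random

def combined_loss(bridging_loss, coref_loss, alpha=1.0):
    # Shared parameters are updated with gradients from both task
    # losses; `alpha` (hypothetical) balances the two objectives.
    return bridging_loss + alpha * coref_loss

def sample_task(n_bridging, n_coref, rng=random):
    # One illustrative scheduling choice: pick the task for each
    # training step in proportion to corpus size, so the larger
    # coreference corpus contributes most of the updates.
    total = n_bridging + n_coref
    return "bridging" if rng.random() < n_bridging / total else "coref"
```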

Overall, our results suggest that coreference is a useful source task for bridging reference resolution, and that our neural bridging architecture is applicable to bridging corpora based on different domains or definitions of bridging.


  • [1] S. Baumann and A. Riester (2012) Referential and lexical givenness: semantic, prosodic and cognitive aspects. In Prosody and Meaning.
  • [2] H. H. Clark (1975) Bridging. In Theoretical Issues in Natural Language Processing.
  • [3] K. Clark, M. Luong, U. Khandelwal, C. D. Manning, and Q. V. Le (2019) BAM! Born-again multi-task networks for natural language understanding. In ACL.
  • [4] R. Collobert and J. Weston (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML.
  • [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • [6] A. Garnham (2001) Mental Models and the Interpretation of Anaphora. Psychology Press.
  • [7] J. R. Hobbs, M. Stickel, D. Appelt, and P. Martin (1993) Interpretation as abduction. Artificial Intelligence 63, pp. 69–142.
  • [8] Y. Hou, K. Markert, and M. Strube (2013) Global inference for bridging anaphora resolution. In NAACL.
  • [9] Y. Hou, K. Markert, and M. Strube (2014) A rule-based system for unrestricted bridging resolution: recognizing bridging anaphora and finding links to antecedents. In EMNLP.
  • [10] Y. Hou, K. Markert, and M. Strube (2018) Unrestricted bridging resolution. Computational Linguistics 44 (2), pp. 237–284.
  • [11] Y. Hou (2016) Incremental fine-grained information status classification using attention-based LSTMs. In COLING.
  • [12] Y. Hou (2018) A deterministic algorithm for bridging anaphora resolution. In EMNLP.
  • [13] Y. Hou (2018) Enhanced word representations for bridging anaphora resolution. In NAACL.
  • [14] H. Kamp and U. Reyle (1993) From Discourse to Logic. D. Reidel, Dordrecht.
  • [15] B. Kantor and A. Globerson (2019) Coreference resolution with entity equalization. In ACL.
  • [16] L. Karttunen (1976) Discourse referents. In Syntax and Semantics 7: Notes from the Linguistic Underground, pp. 363–385.
  • [17] E. Kiperwasser and M. Ballesteros (2018) Scheduled multi-task learning: from syntax to translation. TACL 6.
  • [18] V. Kolhatkar, A. Roussel, S. Dipper, and H. Zinsmeister (2018) Anaphora with non-nominal antecedents in computational linguistics: a survey. Computational Linguistics 44 (3), pp. 547–612.
  • [19] E. Lassalle and P. Denis (2011) Leveraging different meronym discovery methods for bridging resolution in French. In Proc. of 8th DAARC, Faro, Portugal, pp. 35–46.
  • [20] K. Lee, L. He, M. Lewis, and L. Zettlemoyer (2017) End-to-end neural coreference resolution. In EMNLP.
  • [21] K. Lee, L. He, and L. S. Zettlemoyer (2018) Higher-order coreference resolution with coarse-to-fine inference. In NAACL.
  • [22] M. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser (2016) Multi-task sequence to sequence learning. In ICLR.
  • [23] K. Markert, Y. Hou, and M. Strube (2012) Collective classification for fine-grained information status. In ACL.
  • [24] J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In EMNLP.
  • [25] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. S. Zettlemoyer (2018) Deep contextualized word representations. In NAACL.
  • [26] M. Poesio and R. Artstein (2008) Anaphoric annotation in the ARRAU corpus. In LREC, Marrakech, Morocco.
  • [27] M. Poesio, Y. Grishina, V. Kolhatkar, N. Moosavi, I. Roesiger, A. Roussel, F. Simonjetz, A. Uma, O. Uryupina, J. Yu, and H. Zinsmeister (2018) Anaphora resolution with the ARRAU corpus. In CRAC.
  • [28] M. Poesio, R. Mehta, A. Maroudas, and J. Hitzeman (2004) Learning to resolve bridging references. In ACL.
  • [29] M. Poesio, R. Stuckardt, and Y. Versley (2016) Anaphora Resolution: Algorithms, Resources and Applications. Springer, Berlin.
  • [30] M. Poesio (2004) Discourse annotation and semantic annotation in the GNOME corpus. In Proc. of the ACL Workshop on Discourse Annotation.
  • [31] M. Poesio (2016) Linguistic and cognitive evidence about anaphora. In Anaphora Resolution: Algorithms, Resources and Applications, M. Poesio, R. Stuckardt, and Y. Versley (Eds.).
  • [32] S. Pradhan, A. Moschitti, N. Xue, O. Uryupina, and Y. Zhang (2012) CoNLL-2012 shared task: modeling multilingual unrestricted coreference in OntoNotes. In Proceedings of the Sixteenth Conference on Computational Natural Language Learning (CoNLL 2012), Jeju, Korea.
  • [33] I. Rösiger, A. Riester, and J. Kuhn (2018) Bridging resolution: task definition, corpus resources and rule-based experiments. In COLING.
  • [34] I. Rösiger (2016) SciCorp: a corpus of English scientific articles annotated for information status analysis. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), Portorož, Slovenia, pp. 1743–1749.
  • [35] I. Rösiger (2018) BASHI: a corpus of Wall Street Journal articles annotated with bridging links. In LREC.
  • [36] I. Rösiger (2018) Rule- and learning-based methods for bridging resolution in the ARRAU corpus. In CRAC.
  • [37] C. L. Sidner (1979) Towards a computational theory of definite anaphora comprehension in English discourse. Ph.D. Thesis, MIT.
  • [38] O. Uryupina, R. Artstein, A. Bristot, F. Cavicchio, F. Delogu, K. J. Rodriguez, and M. Poesio (2019) Annotating a broad range of anaphoric phenomena, in a variety of genres: the ARRAU corpus. Journal of Natural Language Engineering.
  • [39] R. Vieira and M. Poesio (2000) An empirically-based system for processing definite descriptions. Computational Linguistics 26 (4), pp. 539–593.
  • [40] B. L. Webber (1979) A Formal Approach to Discourse Anaphora. Garland, New York.
  • [41] B. L. Webber (1991) Structure and ostension in the interpretation of discourse deixis. Language and Cognitive Processes 6 (2), pp. 107–135.
  • [42] S. Wiseman, A. M. Rush, S. Shieber, and J. Weston (2015) Learning anaphoricity and antecedent ranking features for coreference resolution. In ACL-IJCNLP.