The question of how human beings resolve pronouns has long been an attractive research topic in both the linguistics and natural language processing (NLP) communities, because a pronoun by itself carries little semantic meaning Ehrlich (1981) and resolving it correctly requires complex reasoning over various kinds of information. As a core task of natural language understanding, pronoun coreference resolution (PCR) Hobbs (1978) is the task of identifying the noun (phrase) that a pronoun refers to. Unlike in general coreference resolution, string-matching methods are not effective for pronouns Stoyanov et al. (2009), which makes PCR more challenging than the general task.
Recently, great efforts have been devoted to the coreference resolution task Raghunathan et al. (2010); Clark and Manning (2015, 2016); Lee et al. (2018), and good performance has been achieved on formal written text such as newspapers Pradhan et al. (2012); Zhang et al. (2019b) and diagnosis records Zhang et al. (2019a). However, when it comes to dialogues, where more information is needed, the performance of existing models is less satisfying. The reason is that, unlike formal written language, spoken language often cannot be understood correctly without the support of other information sources. For example, when people chat with each other and intend to refer to an object that all speakers can see, they may directly use a pronoun such as “it” instead of describing or mentioning the object first. Sometimes the object (its name or a text description) that a pronoun refers to does not appear in the conversation at all, and one needs to ground the pronoun in something outside the text, which is extremely challenging for conventional approaches based purely on human-designed rules Raghunathan et al. (2010) or contextual features Lee et al. (2018).
A visual-related dialogue is shown in Figure 1. Speakers A and B are talking about a picture in which several people are celebrating something. In the dialogue, the first “it” refers to “the big cake,” which is relatively easy for conventional models because the correct mention appears just before the target pronoun. However, the second “it” refers to the statue in the image, which does not appear in the dialogue at all. Without the support of the visual information, it is almost impossible to identify the coreference relation between “it” and “the statue.”
In this work, we focus on investigating how visual information can help better resolve pronouns in dialogues. To achieve this goal, we first create VisPro, a large-scale visual-supported PCR dataset. Different from existing datasets such as ACE NIST (2003) and the CoNLL-2012 shared task Pradhan et al. (2012), VisPro is annotated based on dialogues discussing given images. In total, VisPro contains annotations for 29,722 pronouns extracted from 5,000 dialogues. With the dataset in place, we formally define a new task, visual pronoun coreference resolution (Visual PCR), and design a novel visual-aware PCR model, VisCoref, which extracts information from images and leverages it to help better resolve pronouns in dialogues. Specifically, we align mentions in the dialogue with objects in the image and then jointly use the contextual and visual information for the final prediction.
The contribution of this paper is threefold: (1) we formally define the task of visual PCR; (2) we present VisPro, the first dataset that focuses on PCR in visual-supported dialogues; (3) we propose VisCoref, a visual-aware PCR model. Comprehensive experiments and case studies are conducted to demonstrate the quality of VisPro and the effectiveness of VisCoref. The dataset, code, and models are available at: https://github.com/HKUST-KnowComp/Visual_PCR.
2 The VisPro Dataset
To generate a high-quality and large-scale visual-aware PCR dataset, we select VisDial Das et al. (2017) as the base dataset and invite annotators to annotate. In VisDial, each image is accompanied by a dialogue record discussing that image. One example is shown in Figure 1. In addition, VisDial also provides a caption for each image, which brings more information for us to create VisPro (the information contained in the caption only helps provide noun phrase candidates for workers to annotate and is not treated as part of the dialogue). In this section, we introduce the details of the dataset creation in terms of pre-processing, survey design, annotation, and post-processing.
2.1 Pre-processing
To make the annotation task clear to annotators and help them provide accurate annotation, we first extract all the noun phrases and pronouns with Stanford Parser Klein and Manning (2003) and then provide the extracted noun phrases as candidate mentions to annotate. To avoid overlap between candidate noun phrases, we choose noun phrases with a height of two in parse trees. One example is shown in Figure 2. In the syntactic tree for the sentence “A man with a dog is walking on the grass,” we choose “A man,” “a dog” and “the grass” as candidates. If the height of noun phrases is not limited, then the noun phrase “A man with a dog” will cover “A man” and “a dog,” leading to confusion in the options.
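The height-two rule can be illustrated with a minimal sketch. It assumes a hand-written parse tree stored as nested tuples (not actual Stanford Parser output) and a height convention in which a part-of-speech node directly above a word counts as height one; both the representation and the function names are hypothetical and chosen only for illustration.

```python
# A parse tree is a nested tuple: (label, child, child, ...); a leaf word is a plain str.
def height(node):
    """Height of a node, counting a POS-tag node such as ("DT", "A") as 1 (assumed convention)."""
    if isinstance(node, str):
        return 0
    return 1 + max(height(child) for child in node[1:])

def leaves(node):
    """Return the words covered by a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]

def candidate_noun_phrases(node, target_height=2):
    """Collect the text of every NP whose height equals target_height."""
    if isinstance(node, str):
        return []
    found = []
    if node[0] == "NP" and height(node) == target_height:
        found.append(" ".join(leaves(node)))
    for child in node[1:]:
        found.extend(candidate_noun_phrases(child, target_height))
    return found

# Hand-written tree for "A man with a dog is walking on the grass".
tree = ("S",
        ("NP",
         ("NP", ("DT", "A"), ("NN", "man")),
         ("PP", ("IN", "with"), ("NP", ("DT", "a"), ("NN", "dog")))),
        ("VP", ("VBZ", "is"),
         ("VP", ("VBG", "walking"),
          ("PP", ("IN", "on"), ("NP", ("DT", "the"), ("NN", "grass"))))))

candidates = candidate_noun_phrases(tree)
```

Under these assumptions, `candidates` holds “A man,” “a dog,” and “the grass,” while the enclosing phrase “A man with a dog” is skipped because its height is four.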
Following Strube and Müller (2003); Ng (2005), we only select third-person personal pronouns (it, he, she, they, him, her, them) and possessive pronouns (its, his, her, their) as the target pronouns. In total, the VisPro dataset contains 29,722 pronouns in 5,000 dialogues selected from the 133,351 dialogues in VisDial v1.0 Das et al. (2017). We choose dialogues in which the number of pronouns ranges from four to ten, for two reasons. First, dialogues with few pronouns are of little use to the task. Second, dialogues with too many pronouns often contain repeated pronouns referring to the same object, which makes the task too easy. The selected dialogues contain 5.94 pronouns on average. Figure 3 presents the distribution of different pronouns. From the figure we can see that “it” and “they” are used most frequently in the dialogues.
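As a rough illustration of this filtering step, the sketch below counts target pronouns in a dialogue and keeps the dialogue only if the count falls between four and ten. The simple regular-expression tokenizer, the inclusive bounds, and the function names are assumptions made for the example, not a description of the actual pipeline.

```python
import re

# Third-person personal and possessive pronouns listed in the text.
TARGET_PRONOUNS = {"it", "he", "she", "they", "him", "her", "them", "its", "his", "their"}

def count_target_pronouns(turns):
    """Count target pronouns over all turns, using a simple lowercase word tokenizer."""
    tokens = re.findall(r"[a-z]+", " ".join(turns).lower())
    return sum(token in TARGET_PRONOUNS for token in tokens)

def keep_dialogue(turns, low=4, high=10):
    """Keep dialogues whose pronoun count lies in the inclusive range [low, high] (assumed)."""
    return low <= count_target_pronouns(turns) <= high

# Toy dialogue, invented for illustration.
dialogue = ["is the cake big ?", "yes , it is", "are they eating it ?", "no , they are looking at her"]
pronoun_count = count_target_pronouns(dialogue)
```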
2.2 Survey Design
We divide the 29,722 pronouns from 5,000 dialogues into 3,304 surveys. In each survey, besides the normal questions, we also include one checkpoint question to control the annotation quality (we design the checkpoint dialogue to be straightforward and unambiguous so that any responsive worker can easily provide the correct annotation). In total, each survey contains ten questions (nine normal questions and one checkpoint question). Each question is about one pronoun.
The survey consists of three parts. We begin by explaining the task to the annotators, including how to deal with particular cases such as multi-word expressions. Then, we present examples to help the annotators better understand the task. Finally, we invite them to provide annotations for the pronouns in the dialogues.
One example of the annotation interface is shown in Figure 4. The text and the image of the dialogue are displayed on the left and right panels, respectively. For each target pronoun, the workers are asked to select all the mentions that it refers to. If any of the noun phrases is selected, the reference type of the pronoun on the right panel is set to “noun phrases in text” automatically, and vice versa. If the concept that the pronoun refers to is not available in the options, or the pronoun does not refer to anything in particular, workers are asked to choose “some concepts not present in text” or “the pronoun is non-referential” on the right panel accordingly. The nine normal questions in each survey are consecutive so that pronouns in the same dialogue are displayed sequentially.
2.3 Annotation
We employ the Amazon Mechanical Turk platform (MTurk) for our annotations. We require that our annotators have more than 1,000 approved tasks and a task approval rate above 90%. They are also asked to pass a simple pronoun resolution test to show that they understand the task instructions. Based on these criteria, we identify 186 valid annotators. For each dialogue, we invite at least four different workers to annotate. In total, we collect 122,443 annotations at a total cost of USD 3,270.80. Annotators may participate multiple times: each subsequent survey is generated from dialogues that the annotator has not yet labelled.
2.4 Post-processing
Before processing the annotation results, we first exclude the annotations of workers who fail to answer the checkpoint questions correctly. As a result, 116,300 annotations are kept, which is 95% of all collected annotations. We then decide the gold mentions that pronouns refer to using the following procedure:
Step one: We look into the annotations of each worker to find the coreference clusters that the worker annotates for each dialogue. To achieve this goal, we merge the intersecting sets of noun phrase antecedents for pronouns in the same dialogue. We observe that annotators often make the right decision for noun phrases near the anaphoric pronoun but neglect antecedents far away. The same happens in the annotation of other coreference datasets Chen et al. (2018). Therefore, we generate clusters from different pronouns in the same dialogue rather than merging antecedents for each pronoun separately. If an entity is mentioned and referred to by pronouns multiple times in the dialogue, combining the antecedents of all these pronouns creates a more accurate coreference cluster for the entity.
Step two: We adjudicate the coreference clusters for the dialogue by majority voting among all workers.
Step three: We then decide the anaphoric type of all pronouns by voting. If a pronoun is considered to refer to some noun phrases in the text, we find the coreference cluster it belongs to and choose the noun phrases in the cluster that precede it as its antecedents.
Step four: We randomly split the collected data into train/val/test sets of 4,000/500/500 dialogues, respectively.
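Steps one and two above can be sketched as follows. The merging of intersecting antecedent sets follows the description directly; the voting rule is an assumption, since the text does not specify how clusters are compared across workers. Here a pair of mentions is kept when more than half of the workers place it in the same cluster, and the kept pairs are merged again. All names and the toy annotations are hypothetical.

```python
from collections import Counter
from itertools import combinations

def merge_intersecting(sets):
    """Merge sets that share at least one element until all clusters are disjoint."""
    clusters = []
    for current in map(set, sets):
        overlapping = [c for c in clusters if c & current]
        for c in overlapping:
            clusters.remove(c)
            current |= c
        clusters.append(current)
    return clusters

def majority_vote(worker_clusters):
    """Keep a mention pair if more than half of the workers link it, then re-merge (assumed rule)."""
    votes = Counter()
    for clusters in worker_clusters:
        for cluster in clusters:
            for pair in combinations(sorted(cluster), 2):
                votes[pair] += 1
    kept = [set(pair) for pair, n in votes.items() if n > len(worker_clusters) / 2]
    return merge_intersecting(kept)

# Antecedent sets chosen by three workers, one set per pronoun (toy data).
per_worker_sets = [
    [{"a man", "the man"}, {"the man", "the guy"}, {"a dog"}],
    [{"a man", "the man"}, {"the guy", "the man"}],
    [{"a man", "a dog"}],
]
worker_clusters = [merge_intersecting(sets) for sets in per_worker_sets]
adjudicated = majority_vote(worker_clusters)
```

In this toy case the first two workers agree on a single cluster containing “a man,” “the man,” and “the guy,” which survives the vote, while the link proposed only by the third worker is dropped.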
After collecting the data, we find that 73.43% of pronouns act as anaphors to some noun phrases, 5.67% of pronouns do not have a suitable antecedent, and the remaining 20.90% are non-referential. Among all the pronouns that have noun phrases as antecedents, 13.45% do not have an antecedent in the dialogue context (in these cases, the antecedent labeled by the worker is provided by the caption). Each anaphoric pronoun has 2.06 antecedents on average. Finally, we calculate the inter-annotator agreement (IAA) to demonstrate the quality of the resulting dataset. Following conventional approaches Pradhan et al. (2012), we use the average MUC score between individual workers and the adjudication result as the evaluation metric. The final IAA score is 72.4, which indicates that the workers can clearly understand our task and provide reliable annotation.
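For readers unfamiliar with the metric, the following is a minimal sketch of the link-based MUC score and of averaging it over workers. It assumes the standard MUC definition, with the adjudicated clusters as the key and each worker's clusters as the response, and it reports the F1 of MUC precision and recall; whether the reported IAA uses exactly this variant is an assumption. Function names are hypothetical.

```python
def muc_recall(key, response):
    """MUC recall: for each key cluster, count the links that survive partitioning by the response."""
    numerator = denominator = 0
    for cluster in key:
        parts = set()
        unaligned = 0  # mentions missing from the response form singleton parts
        for mention in cluster:
            index = next((i for i, r in enumerate(response) if mention in r), None)
            if index is None:
                unaligned += 1
            else:
                parts.add(index)
        numerator += len(cluster) - (len(parts) + unaligned)
        denominator += len(cluster) - 1
    return numerator / denominator if denominator else 0.0

def muc_f1(key, response):
    """Harmonic mean of MUC recall and MUC precision (recall with the roles swapped)."""
    recall = muc_recall(key, response)
    precision = muc_recall(response, key)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def average_agreement(worker_clusters, adjudicated):
    """Average MUC F1 between each worker's clusters and the adjudicated clusters."""
    return sum(muc_f1(adjudicated, c) for c in worker_clusters) / len(worker_clusters)

# Toy clusters over mention ids.
key = [{"a", "b", "c"}, {"d", "e"}]
response = [{"a", "b"}, {"c", "d", "e"}]
score = muc_f1(key, response)
```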
3 The Task
In this work, we focus on jointly leveraging the contextual and visual information to resolve pronouns. Thus, we formally define the visual-aware pronoun coreference resolution as follows:
Given an image $I$, a dialogue record $D$ which discusses the content in $I$, and an external mention set $\mathcal{M}$, for any pronoun $p$ that appears in $D$, the goal is to optimize the following objective:
$$\mathcal{J} = \frac{\sum_{c \in \mathcal{C}} e^{F(c, p, I, D)}}{\sum_{s \in \mathcal{S}} e^{F(s, p, I, D)}},$$
where $F(\cdot)$ is the overall scoring function of $p$ referring to a mention given $I$ and $D$. $c$ and $s$ denote the correct mention and the candidate mention, and $\mathcal{C}$ and $\mathcal{S}$ denote the correct mention set and the candidate mention set, respectively. Note that in the case where no golden mentions are annotated, the union of all possible spans in $D$ and $\mathcal{M}$ is used to form $\mathcal{S}$.
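The objective can be made concrete with a small numerical sketch. It assumes the objective is the ratio between the summed exponentiated scores of the correct mentions and those of all candidates, as in common mention-ranking formulations, and it works in log space for numerical stability. The scores and names below are invented for illustration.

```python
import math

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def marginal_log_likelihood(scores, correct):
    """Log of (sum of exp-scores of correct mentions) / (sum of exp-scores of all candidates).

    scores: dict mapping each candidate mention to its overall score for one pronoun.
    correct: set of gold mentions, assumed to be a non-empty subset of the candidates.
    """
    gold = [scores[mention] for mention in correct]
    return log_sum_exp(gold) - log_sum_exp(list(scores.values()))

# Toy scores for the candidates of a single pronoun.
scores = {"the big cake": 2.0, "the table": 0.0, "a candle": 0.0}
objective = math.exp(marginal_log_likelihood(scores, {"the big cake"}))
```

Maximizing this quantity is equivalent to minimizing its negative logarithm, which has the form of a cross-entropy loss over the candidate set.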
4 The Model
The overall framework of the proposed model VisCoref is presented in Figure 5. In VisCoref, we want to leverage both the contextual and visual information to resolve pronouns. Thus we split the scoring function into two parts as follows:
$$F(n, p) = (1 - \lambda) \cdot F_c(n, p) + \lambda \cdot F_v(n, p),$$
where $F_c$ and $F_v$ are the scoring functions based on contextual and visual information, respectively, for a candidate mention $n$ and a pronoun $p$. $\lambda$ is the hyper-parameter that controls the importance of visual information in the model. The details of the two scoring functions are described in the following subsections.
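A minimal sketch of how the two scores might be combined is given below. It assumes a convex combination in which a weight of 0 ignores the visual score and a weight of 1 ignores the contextual score; the candidate scores are invented numbers, not model outputs.

```python
def overall_score(contextual, visual, visual_weight=0.4):
    """Combine the two scores; visual_weight plays the role of the visual hyper-parameter (assumed form)."""
    return (1 - visual_weight) * contextual + visual_weight * visual

def best_candidate(candidates, visual_weight):
    """candidates: dict mapping a mention to a (contextual score, visual score) pair."""
    return max(candidates, key=lambda m: overall_score(*candidates[m], visual_weight))

# Invented scores for two candidate antecedents of one pronoun.
candidates = {
    "any writing": (1.2, -0.5),
    "a blue, white and red train": (1.0, 0.8),
}
text_only_choice = best_candidate(candidates, visual_weight=0.0)
balanced_choice = best_candidate(candidates, visual_weight=0.4)
```

With these toy numbers, the text-only setting prefers “any writing,” while a visual weight of 0.4 switches the choice to the train.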
4.1 Contextual Scoring
Before computing $F_c$, we first need to encode the contextual information into all the candidates and target pronouns through a mention representation module, which is shown as the dotted box in Figure 5.
Following Lee et al. (2018), a standard bidirectional LSTM (BiLSTM) Hochreiter and Schmidhuber (1997) is used to encode each span with attention Bahdanau et al. (2015). Assume the initial embeddings of the words in a span $s_i$ are denoted as $x_1, \ldots, x_T$ and their encoded representations after the BiLSTM as $x^*_1, \ldots, x^*_T$. The weighted embedding $\hat{x}_i$ of each span is obtained by
$$\hat{x}_i = \sum_{t=1}^{T} a_t \cdot x_t,$$
where $a_t$ is the inner-span attention computed by
$$a_t = \frac{e^{\alpha_t}}{\sum_{k=1}^{T} e^{\alpha_k}}.$$
$\alpha_t$ is obtained by a standard feed-forward neural network (we use NN to represent feed-forward neural networks): $\alpha_t = \mathrm{NN}_\alpha(x^*_t)$.
After that, we concatenate the embeddings of the starting word ($x^*_{start}$) and the ending word ($x^*_{end}$) of each span, as well as its weighted embedding ($\hat{x}_i$) and the length feature ($\phi(i)$), to form its final representation $e_i$:
$$e_i = [x^*_{start}, x^*_{end}, \hat{x}_i, \phi(i)].$$
On top of the extracted mention representation, we then compute the contextual score as follows:
$$F_c(n, p) = \mathrm{NN}_c([e_p, e_n, e_p \odot e_n]),$$
where $[\,]$ represents the concatenation, $e_p$ and $e_n$ are the mention representations of the target pronoun and the current candidate mention, and $\odot$ indicates the element-wise multiplication.
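The span representation and the pairwise input to the contextual scorer can be sketched in plain Python as follows. The sketch assumes that the attention scores and the encoded boundary states have already been produced by the networks (they are passed in as numbers), and it omits the final feed-forward scorer; the toy vectors and function names are hypothetical.

```python
import math

def softmax(values):
    """Normalize raw scores into weights that sum to one."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def span_representation(word_embeddings, encoded, attention_scores, length_feature):
    """Concatenate boundary states, the attention-weighted word embedding, and a length feature."""
    weights = softmax(attention_scores)
    dim = len(word_embeddings[0])
    weighted = [sum(w * x[d] for w, x in zip(weights, word_embeddings)) for d in range(dim)]
    return encoded[0] + encoded[-1] + weighted + list(length_feature)

def pair_features(pronoun_repr, mention_repr):
    """Input to the contextual scorer: both representations and their element-wise product."""
    product = [a * b for a, b in zip(pronoun_repr, mention_repr)]
    return pronoun_repr + mention_repr + product

# Toy two-word span with two-dimensional vectors (stand-ins for real embeddings).
representation = span_representation(
    word_embeddings=[[1.0, 0.0], [0.0, 1.0]],
    encoded=[[0.5, 0.5], [0.2, 0.8]],
    attention_scores=[0.0, 0.0],
    length_feature=[2.0],
)
features = pair_features(representation, representation)
```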
4.2 Visual Scoring
In order to align mentions in the text with objects in the image, the first step of leveraging the visual information is to recognize the objects in the picture. We use an object detection module to identify object labels from each image $I$, such as “person,” “car,” or “dog.” After that, we convert the identified labels into vector representations following the same encoding process as in Section 4.1. For each image, we add a label “null,” indicating that the pronoun refers to none of the detected objects in $I$. We denote the resulting embeddings as $e_{c_1}, e_{c_2}, \ldots, e_{c_K}$, in which $c_i$ denotes a detected label and $K$ is the total number of unique labels in the corresponding image.
After extracting objects from the image, we would like to know whether the mentions refer to them. To achieve this goal, we calculate the possibility of a mention $n$ corresponding to each detected object $c_k$:
$$\beta_{n, c_k} = \mathrm{NN}_\beta(\mathrm{NN}_o(e_n) \odot \mathrm{NN}_o(e_{c_k})).$$
Then we take the softmax of $\beta_{n, c_k}$ as the final possibility of $n$ being aligned with the object label $c_k$:
$$m_{n, c_k} = \frac{e^{\beta_{n, c_k}}}{\sum_{i=1}^{K} e^{\beta_{n, c_i}}}.$$
If $n$ corresponds to a certain object in $I$, the score of that label should be large. Otherwise, the possibility of “null” should be the largest. Therefore, we use the maximum of the possibility scores among all classes except “null”
$$m_n = \max_{k:\, c_k \neq \text{null}} m_{n, c_k}$$
as the probability of $n$ being related to some object in $I$.
Similarly, given two mentions $n$ and $p$, if they refer to the same detected object $c_k$, then both $m_{n, c_k}$ and $m_{p, c_k}$ should be large. Thus, we can use the maximum of their product among all $K$ classes except “null”
$$m_{n, p} = \max_{k:\, c_k \neq \text{null}} m_{n, c_k} \cdot m_{p, c_k}$$
as the probability of $n$ and $p$ being related to the same object in $I$.
In the end, we define the visual scoring function as follows:
$$F_v(n, p) = \mathrm{NN}_v([m_p, m_n, m_p \odot m_n, m_{n, p}]).$$
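The alignment steps described in this subsection (softmax over object labels, maximum over non-“null” labels, and maximum of the product for a pair of mentions) can be sketched as below. The raw alignment scores are invented stand-ins for network outputs, and the final feed-forward scorer is omitted; names are hypothetical.

```python
import math

def alignment(label_scores):
    """Softmax over raw alignment scores; label_scores must include the "null" label."""
    top = max(label_scores.values())
    exps = {label: math.exp(score - top) for label, score in label_scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

def object_probability(probs):
    """Probability that a mention refers to some detected object (best non-null label)."""
    return max(p for label, p in probs.items() if label != "null")

def shared_object_probability(probs_a, probs_b):
    """Probability that two mentions refer to the same detected object."""
    return max(probs_a[label] * probs_b[label] for label in probs_a if label != "null")

# Invented raw scores for an image in which zebras and a tree were detected.
they = alignment({"zebra": 2.0, "tree": 0.0, "null": 0.0})
two_zebras = alignment({"zebra": 3.0, "tree": 0.0, "null": 0.0})
the_people = alignment({"zebra": 0.0, "tree": 0.0, "null": 2.0})

zebra_link = shared_object_probability(they, two_zebras)
people_link = shared_object_probability(they, the_people)
```

With these toy scores, the link between “they” and “2 zebras” receives a much higher shared-object probability than the link between “they” and “the people,” whose mass sits mostly on “null.”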
5 The Experiment
In this section, we introduce the implementation details, experiment setting, and baseline models.
5.1 Experiment Setting
As introduced in Section 2.4, we randomly split the data into training, validation, and test sets of 4,000, 500, and 500 dialogues, respectively. For each dialogue, a mention pool of size 30 is provided for models to detect plausible mentions outside the dialogue. The pool contains both mentions extracted from the corresponding caption and randomly selected negative mention samples from other captions. All models are evaluated based on the precision (P), recall (R), and F1 score. Last but not least, we split the test dataset by whether the correct antecedents of the pronoun appear in the dialogue or not. We denote these two groups as “Discussed” and “Not Discussed.”
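The evaluation protocol can be sketched as follows. The sketch assumes that precision and recall are micro-averaged over pronoun-antecedent links and that a pronoun counts as “Discussed” when at least one gold antecedent appears among the dialogue mentions; both are assumptions about details the text leaves open, and the records are toy data.

```python
def precision_recall_f1(examples):
    """examples: list of (predicted antecedents, gold antecedents) set pairs, micro-averaged."""
    correct = sum(len(pred & gold) for pred, gold in examples)
    predicted = sum(len(pred) for pred, _ in examples)
    expected = sum(len(gold) for _, gold in examples)
    p = correct / predicted if predicted else 0.0
    r = correct / expected if expected else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def split_by_discussion(records):
    """records: list of (predicted, gold, dialogue mentions); group by whether gold is in the dialogue."""
    groups = {"Discussed": [], "Not Discussed": []}
    for pred, gold, dialogue_mentions in records:
        key = "Discussed" if gold & dialogue_mentions else "Not Discussed"
        groups[key].append((pred, gold))
    return groups

dialogue_mentions = {"the big cake", "a party"}
records = [
    ({"the big cake"}, {"the big cake"}, dialogue_mentions),  # antecedent appears in the dialogue
    ({"a party"}, {"the statue"}, dialogue_mentions),         # antecedent only visible in the image
]
groups = split_by_discussion(records)
results = {name: precision_recall_f1(examples) for name, examples in groups.items()}
```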
5.2 Implementation Details
Following Lee et al. (2018), we use the concatenation of the GloVe Pennington et al. (2014) embedding and the ELMo Peters et al. (2018) embedding as the initial word representations. Out-of-vocabulary words are initialized with zero vectors. We adopt the “ssd_resnet_50_fpn_coco” model from the TensorFlow detection model zoo (https://github.com/tensorflow/models/tree/master/research/object_detection) as the object detection module. The size of hidden states in the LSTM module is set to 200, and the size of the projected embedding for computing similarity between text spans and object labels is 512. The feed-forward networks for contextual scoring and visual scoring have two 150-dimension hidden layers and one 100-dimension hidden layer, respectively.
For model training, we use cross-entropy as the loss function and Adam Kingma and Ba (2015) as the optimizer. All the parameters are initialized randomly. Each mention selects the text span with the highest overall score among all previous text spans in the dialogue and the mention pool as its antecedent, so that all mentions in one dialogue are clustered into several coreference chains. The noun phrases in the same coreference cluster as a pronoun are deemed the predicted antecedents of that pronoun. The models are trained for up to 50,000 steps, and the best one is selected based on its performance on the validation set.
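The antecedent selection and chaining described above can be sketched as follows. Two details are assumptions made for the example: a mention is left unlinked when no candidate scores above zero (a common convention in mention-ranking models), and chains are formed by following the selected links with a small union-find. The score table and mention names are toy values.

```python
def link_antecedents(mentions, pool, score):
    """For each mention, pick the best-scoring earlier span or pool entry; skip if none scores above 0."""
    links = {}
    for i, mention in enumerate(mentions):
        candidates = mentions[:i] + pool
        best = max(candidates, key=lambda c: score(c, mention), default=None)
        if best is not None and score(best, mention) > 0:
            links[mention] = best
    return links

def chains(links):
    """Group mentions into coreference chains by following the selected links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x

    for mention, antecedent in links.items():
        parent[find(mention)] = find(antecedent)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

# Toy scores; unlisted pairs default to a negative score.
table = {("the big cake", "it#1"): 2.0, ("a statue", "it#2"): 1.5, ("the big cake", "it#2"): 0.3}
score = lambda candidate, mention: table.get((candidate, mention), -1.0)

links = link_antecedents(["the big cake", "it#1", "it#2"], pool=["a statue"], score=score)
clusters = chains(links)
```

Here the first pronoun is linked to a mention in the dialogue, while the second is linked to an entry of the mention pool, mirroring the “Discussed” and “Not Discussed” cases.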
5.3 Baseline Methods
Since we are the first to propose a visual-aware model for pronoun coreference resolution, we compare our results with existing models for general coreference resolution.
Statistical model Clark and Manning (2015) learns upon human-designed entity-level features between clusters of mentions to produce accurate coreference chains.
End-to-end model Lee et al. (2018) is the state-of-the-art method of coreference resolution. It predicts coreference clusters via an end-to-end neural network that leverages pretrained word embeddings and contextual information.
Last but not least, to demonstrate the effectiveness of the proposed model, we also present a variation of the End-to-end model, which can also use the visual information, as an extra baseline:
End-to-end+Visual first extracts features from images with ResNet-152 He et al. (2016). Then it concatenates the image feature with the contextual feature in the original End-to-end model together to make the final prediction.
6 The Result
The experimental results are shown in Table 1. Our proposed model VisCoref outperforms all the baseline models significantly, which indicates that the visual information is crucial for resolving pronouns in dialogues. Besides that, we also have the following interesting findings:
For all the conventional models, the “Not Discussed” pronouns, whose antecedents are absent from the dialogue, are more challenging than the “Discussed” pronouns, whose antecedents appear in the dialogue context. The reason is that if the correct mention appears in the near context of a pronoun, the information about it can be aggregated to the target pronoun through either human-designed rules or deep neural networks (BiLSTM). However, when the correct mention is not available in the near context, it is quite challenging for conventional models to understand the dialogue and correctly ground the pronoun to the object both speakers can see, as they do not have the support of visual information.
As shown by the results of the “End-to-end+Visual” model, simply concatenating the visual feature with the contextual feature can help resolve “Discussed” pronouns but may hurt the performance of the model on “Not Discussed” pronouns. In contrast, the proposed VisCoref improves the resolution of both the “Discussed” and “Not Discussed” pronouns. There are two main reasons: (1) The visual information in our model is first converted into textual labels and then transformed into vector representations in the same way as the dialogue context, so the contextual and visual information share the same vector space. (2) We introduce a hyper-parameter to balance the influence of the different knowledge resources.
Even though our model outperforms all the baseline methods, we still observe a large gap between our model and human performance. This indicates that current models still cannot fully understand the dialogue even with the support of visual information, and it further supports the value and necessity of proposing VisPro.
6.1 Hyper-parameter Analysis
We traverse different values of the visual weight $\lambda$ (the hyper-parameter that controls the importance of visual information) from 0 to 1, and the result is shown in Figure 6. As $\lambda$ increases, our model puts more weight on the visual information and, as a result, performs better. However, when the model focuses too much on the visual information (when $\lambda$ equals 0.9 or 1), it overfits to the visual information and thus performs poorly on the task. To balance the visual and contextual information, we set $\lambda$ to 0.4.
6.2 Case Study
To further investigate how visual information can help solve PCR, we randomly select two examples and show the predictions of VisCoref and the End-to-end model in Figure 7.
In the first example in Figure 7(a), given the pronoun “it,” the End-to-end model picks “any writing” from the dialogue, while the VisCoref model chooses “a blue, white and red train” from the candidate mention set. Without looking at the picture, we cannot distinguish between these two candidates. However, when the picture is taken into consideration, we observe that there is a train in the image and thus “a blue, white and red train” is the more suitable choice, which illustrates the importance of visual information. A similar situation occurs in Figure 7(b), where the End-to-end model connects “they” to “the people” although there is no human being in the image at all. In contrast, VisCoref effectively leverages the visual information and decides that “they” should refer to “2 zebras.”
7 Related Work
In this section, we introduce the related work about pronoun coreference resolution and visual-aware natural language processing problems.
7.1 Pronoun Coreference Resolution
As one core task of natural language understanding, pronoun coreference resolution, the task of identifying the mentions in text that a target pronoun refers to, plays a vital role in many downstream applications in natural language processing, such as machine translation Guillou (2012), summarization Steinberger et al. (2007), and information extraction Edens et al. (2003). Traditional studies focus on resolving pronouns in expert-annotated formal textual datasets such as ACE NIST (2003) or OntoNotes Pradhan et al. (2012). However, models that perform well on these datasets might not perform as well in other scenarios such as dialogues, due to the informal language and the lack of essential information (e.g., the shared view of the two speakers). In this work, we thus focus on PCR in dialogues and show that the information contained in the shared view can be crucial for understanding the dialogues and correctly resolving the pronouns.
7.2 Visual-aware NLP
Another related line of work is the comprehension of referring expressions Mao et al. (2016), which aims to infer the object in an image that an expression describes. However, that task is formulated on isolated noun phrases specially designed as discriminative descriptions, without putting them into a meaningful context. In contrast, our task focuses on resolving pronouns in dialogues based on images as the shared view, which enhances the understanding of dialogues based on the comprehension of expressions and images.
8 Conclusion
In this work, we formally define the task of visual pronoun coreference resolution (PCR) and present VisPro, the first large-scale visual-supported pronoun coreference resolution dataset. Different from conventional pronoun datasets, VisPro focuses on resolving pronouns in dialogues that discuss a view both speakers can see.
Moreover, we also propose VisCoref, the first visual-aware PCR model that aligns contextual information with visual information and jointly uses them to find the correct objects that the targeting pronouns refer to. Extensive experiments demonstrate the effectiveness of the proposed model. Further case studies also demonstrate that jointly using visual information and contextual information is an essential path for fully understanding human language, especially dialogues.
Acknowledgements
This paper was supported by Beijing Academy of Artificial Intelligence (BAAI), the Early Career Scheme (ECS, No. 26206717) from Research Grants Council in Hong Kong, and the Tencent AI Lab Rhino-Bird Focused Research Program.
References
- Antol et al. (2015). VQA: visual question answering. In Proceedings of ICCV, pp. 2425–2433.
- Bahdanau et al. (2015). Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.
- Chen et al. (2018). PreCo: A large-scale dataset in preschool vocabulary for coreference resolution. In Proceedings of EMNLP, pp. 172–181.
- Clark and Manning (2015). Entity-centric coreference resolution with model stacking. In Proceedings of ACL, pp. 1405–1415.
- Clark and Manning (2016). Deep reinforcement learning for mention-ranking coreference models. In Proceedings of EMNLP, pp. 2256–2262.
- Das et al. (2017). Visual dialog. In Proceedings of CVPR, pp. 1080–1089.
- Edens et al. (2003). An investigation of broad coverage automatic pronoun resolution for information retrieval. In Proceedings of SIGIR, pp. 381–382.
- Ehrlich (1981). Search and inference strategies in pronoun resolution: an experimental study. In Proceedings of ACL, pp. 89–93.
- Guillou (2012). Improving pronoun translation for statistical machine translation. In Proceedings of EACL, pp. 1–10.
- He et al. (2016). Deep residual learning for image recognition. In Proceedings of CVPR, pp. 770–778.
- Hobbs (1978). Resolving pronoun references. Lingua 44 (4), pp. 311–338.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- Kingma and Ba (2015). Adam: A method for stochastic optimization. In Proceedings of ICLR.
- Klein and Manning (2003). Accurate unlexicalized parsing. In Proceedings of ACL, pp. 423–430.
- Kottur et al. (2018). Visual coreference resolution in visual dialog using neural module networks. In Proceedings of ECCV, pp. 160–178.
- Lee et al. (2018). Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of NAACL-HLT, pp. 687–692.
- Mao et al. (2016). Generation and comprehension of unambiguous object descriptions. In Proceedings of CVPR, pp. 11–20.
- Ng (2005). Supervised ranking for pronoun resolution: some recent improvements. In Proceedings of AAAI, pp. 1081–1086.
- NIST (2003). The ACE 2003 evaluation plan. US National Institute for Standards and Technology (NIST), pp. 2003–08.
- Pennington et al. (2014). GloVe: global vectors for word representation. In Proceedings of EMNLP, pp. 1532–1543.
- Peters et al. (2018). Deep contextualized word representations. In Proceedings of NAACL-HLT, pp. 2227–2237.
- Pradhan et al. (2012). CoNLL-2012 shared task: modeling multilingual unrestricted coreference in OntoNotes. In Proceedings of EMNLP-CoNLL, pp. 1–40.
- Raghunathan et al. (2010). A multi-pass sieve for coreference resolution. In Proceedings of EMNLP, pp. 492–501.
- Steinberger et al. (2007). Two uses of anaphora resolution in summarization. Inf. Process. Manage. 43 (6), pp. 1663–1680.
- Stoyanov et al. (2009). Conundrums in noun phrase coreference resolution: making sense of the state-of-the-art. In Proceedings of ACL, pp. 656–664.
- Strube and Müller (2003). A machine learning approach to pronoun resolution in spoken dialogue. In Proceedings of ACL, pp. 168–175.
- Xu et al. (2015). Show, attend and tell: neural image caption generation with visual attention. In Proceedings of ICML, pp. 2048–2057.
- Zhang et al. (2019a). Knowledge-aware pronoun coreference resolution. In Proceedings of ACL, pp. 867–876.
- Zhang et al. (2019b). Incorporating context and external knowledge for pronoun coreference resolution. In Proceedings of NAACL-HLT, pp. 872–881.