An Annotated Corpus of Reference Resolution for Interpreting Common Grounding

11/18/2019 ∙ by Takuma Udagawa, et al. ∙ 8

Common grounding is the process of creating, repairing and updating mutual understandings, which is a fundamental aspect of natural language conversation. However, interpreting the process of common grounding is a challenging task, especially under continuous and partially-observable context where complex ambiguity, uncertainty, partial understandings and misunderstandings are introduced. Interpretation becomes even more challenging when we deal with dialogue systems which still have limited capability of natural language understanding and generation. To address this problem, we consider reference resolution as the central subtask of common grounding and propose a new resource to study its intermediate process. Based on a simple and general annotation schema, we collected a total of 40,172 referring expressions in 5,191 dialogues curated from an existing corpus, along with multiple judgements of referent interpretations. We show that our annotation is highly reliable, captures the complexity of common grounding through a natural degree of reasonable disagreements, and allows for more detailed and quantitative analyses of common grounding strategies. Finally, we demonstrate the advantages of our annotation for interpreting, analyzing and improving common grounding in baseline dialogue systems.




1 Introduction

Figure 1: A visualized example of the raw dialogue (left) and our annotated dialogue (right). In our annotation, referring expressions are detected and their intended referents are annotated based on the speaker’s view (only one judgement is shown in this example). The background task is described in detail in Section 3 and our annotation procedure in Section 4.

Common grounding is the process of creating, repairing and updating mutual understandings, which is a critical aspect of sophisticated human communication [6] as well as a longstanding goal in dialogue modeling [31]. Recently, there have been several new proposals of dialogue tasks which require advanced skills of common grounding under continuous and partially-observable context [32, 12]. Their main contributions include the proposal of clear evaluation metrics based on task success rate, the collection of large-scale datasets (thousands of dialogues) and the introduction of complex ambiguity, uncertainty, partial understandings and misunderstandings which are minimally observed under traditional settings based on either categorical or fully-observable context.

However, interpretation of the process of common grounding remains largely an open problem. Although a formal theory such as that of Poesio and Rieser (2010) can account for some of the important details in common grounding, constructing such a precise semantic representation is a difficult and costly process, especially under continuous and partially-observable context with high ambiguity and uncertainty. Interpretation becomes even more challenging when we deal with dialogue systems represented by end-to-end neural models [33, 3], which can converse fluently but still lack true competency in natural language understanding and generation.

In this work, we approach this problem by decomposing the common grounding task based on its intermediate subtasks. Specifically, we consider reference resolution as the central subtask of common grounding (in the sense that mutual understanding can only be created through successful references to the entities in the task domain), define this subtask formally based on a simple and general annotation schema, and create a large-scale resource to study this subtask along with the original task of common grounding.

Our annotated corpus consists of a total of 40,172 referring expressions in 5,191 dialogues curated from the existing corpus [32], along with multiple (a minimum of 3) judgements for referent interpretations. A visualization of our annotation is shown in Figure 1.

Through our corpus analysis, we show that our annotation has high agreement in general but also includes a natural degree of reasonable disagreements, which verifies that our annotation can be conducted reliably while capturing the ambiguity and uncertainty under continuous and partially-observable context. In addition, we give a more quantitative analysis of pragmatic expressions as an illustrative example of the analyses that can be conducted based on our annotation.

Finally, through our experiments we show that our annotation is critical for interpreting and analyzing common grounding in baseline dialogue systems, as well as improving their performance on difficult end tasks.

Overall, we propose a fundamental method and resource for interpreting the process of common grounding through its subtask of reference resolution. All materials related to this work will be publicly available at

2 Related Work

One of the most influential models of common grounding to date is the contribution model [6], which divides the handling of information in a dialogue into two phases: the presentation phase, where a piece of information is introduced by a speaker, and the acceptance phase, where it gets accepted by a listener. However, applying such a theory for analysis in realistic settings can be difficult or even problematic [16], especially when contributions are implicit, indirect, unstructured, uncertain or partial. In contrast, we propose a more practical approach of decomposing common grounding based on well-defined subtasks: in our case we focus on reference resolution. Although our approach does not give a formal account of common grounding, we show that our annotation is general with a simple and clear definition, reliable in terms of annotator agreement under complex settings, and useful for interpreting and analyzing the intermediate process of common grounding.

Our work is also relevant to the recent literature on interpretable and explainable machine learning [9, 18]. In particular, the analysis of neural models is gaining attention in NLP [2], including end-to-end dialogue models [25]. The main novelty of our approach is that we decompose the original task (common grounding) based on its central subtask (or possibly subtasks), define the subtask (reference resolution) formally with an annotation framework, and create a large-scale resource to study the subtask along with the original task. Our approach has several advantages compared to previous analysis methods. First, it is applicable to both humans and machines, which is especially important in dialogue domains where they interact. Second, it can be used to study the relationships between the original task and its subtasks, which is critical for a more skill-oriented evaluation of artificial intelligence [13, 29]. Third, it can be used for investigating the dataset on which the models are trained: this is important in many respects, such as understanding undesirable bias in the dataset [11, 28] or correct model predictions based on the wrong reasons [19]. Finally, the collected resource can be used both for probing whether the models solve the subtasks implicitly [17] and for developing new models which can be explicitly supervised, evaluated and interpreted based on the subtasks.

Coreference and anaphora resolution have also been studied extensively in NLP [23, 21], including disagreements in their interpretations [24, 20]. The main difference between our annotation schema and theirs is that we focus on exophoric references and directly annotate the referent entities of each referring expression in situated dialogues. We show that our annotation can be conducted reliably, even by using non-expert annotators for referent identification. Our annotation does not capture explicit relations between anaphora, but it captures basic coreference relations as well as complex associative anaphora (such as part-of relations), at least in an indirect way. Most importantly, it is compatible with such existing schemas, and annotating explicit anaphoric relations could be a viable approach for future work.

Finally, visually grounded dialogues have been studied in a wide variety of settings. In comparison, the main strengths and novelty of our corpus can be summarized as follows:

  1. Our corpus is based on the advanced setting of continuous and partially-observable context where complex common grounding strategies are required.

  2. Our corpus has more simplicity and controllability compared to realistic visual dialogues, which makes controlled experiments and analyses easier.

  3. Our corpus includes large-scale manual annotation of reference resolution and detailed analyses of agreements/disagreements based on multiple judgements.

Prior work in common grounding [22, 8] and visual reference resolution [30, 34, 26] mostly focuses on categorical or fully-observable settings and does not satisfy criterion 1. While visual dialogues [7, 12, 4, 15] have the strengths of being more complex and realistic, they do not satisfy criteria 2 and 3. Although Götze and Boye (2016) conducted a smaller-scale (and more loosely defined) annotation of reference resolution, they did not assess the reliability of the annotation (hence their work does not satisfy criteria 2 and 3). To the best of our knowledge, our work is the first to satisfy all of the above criteria.

3 Background Task

Misunderstanding (A’s view and B’s view):
A: I see three smaller circles almost in a line slanting down from right to left
B: I think I see it. Is the left one the largest? …

Partial Understanding (A’s view and B’s view):
A: I have 5 larger dots close together, the bottom left one is largest and darkest?
B: i have three that could be part of that …

Figure 2: Example of misunderstanding and partial understanding captured by our annotation.

Our annotation is conducted on a recently proposed common grounding dataset, which is a minimal formalization of a collaborative referring task under continuous and partially-observable context [32]. In this task, two players are given slightly different but overlapping perspectives of a 2-dimensional grid. Both players have 7 entities in each view, but only 4, 5 or 6 of them are in common: this makes their setting partially-observable with different degrees of partial-observability. In addition, each entity only has continuous attributes (color, size and location). The goal of the task is to find one of the common entities through natural language communication, and the task is successful if and only if both players select the same entity.

Some distinguishing characteristics of their dataset include its large size (a total of 6,760 dialogues, out of which 5,191 were successful on the task), rich linguistic variety with limited vocabulary (a total of only 2,035 unique tokens after preprocessing in our curated corpus), and most importantly the complexity of common grounding introduced by continuous and partially-observable context. As shown in Figure 2, there could be complex misunderstandings and partial understandings that need to be resolved through advanced skills of common grounding. We can also find various nuanced expressions (“almost in a line”, “I think I see …”, “could be”) and pragmatic expressions (“a line”, “largest”, “bottom left”) which can be ambiguous or need to be interpreted based on their context.

4 Annotation Procedure

The goal of our annotation is to provide a general, reliable and useful annotation of reference resolution to interpret the intermediate process of common grounding. In this work, we use the 5,191 successful dialogues from the existing corpus which are expected to be of higher quality (however, our annotation is applicable to unsuccessful dialogues as well). Our annotation procedure consists of two main steps: markable detection to semi-automatically annotate referring expressions currently under consideration and referent identification to identify the referents of each referring expression.

As an optional step, we also conducted preprocessing of the dialogues to correct obvious misspellings and grammatical errors. Due to the limited size of the vocabulary, we manually looked for rare unigrams and bigrams in the dialogue and carefully created rules to correct them. Our preprocessing step is reversible, so the collected annotation can also be applied to the original dialogues without preprocessing.

4.1 Markable Detection

In this work, we define a markable as an independent referring expression of the entities currently under consideration (in our case, the dots in the circular view). We annotate a markable as a minimal noun phrase including all prenominal modifiers (such as determiners, quantifiers, and adjectives) but excluding all postnominal modifiers (such as prepositional phrases and relative clauses). This reduces the complexity of the annotation because markables do not overlap or nest with each other. See the figures for many examples of the detected markables (underlined).

To reduce the annotation effort in the later process, we optionally annotate three attributes for each markable if they are obvious from the context: a generic attribute when the markable is not specific enough to identify the referents, all-referents when the markable is referring to all of the entities in the speaker’s view, and no-referent when the referents are empty. Generic markables are ignored in our annotation, and the referents of all-referents or no-referent are annotated automatically in the later process. To reduce the redundancy of annotation, we consider a predicative noun phrase as a markable only if there is no previous markable in the same utterance that refers to the same entities: for example, “a triangle” in “three dots are forming a triangle” is not considered a markable since “three dots” is already annotated, but it is considered a markable in “one light dot and two dark dots are forming a triangle”. We also annotate obvious anaphoric and cataphoric relations in the same utterance: this way, the referents of anaphoric and cataphoric markables can be annotated automatically based on their antecedents or postcedents. Note that we do not annotate such relations across utterances, as they can actually refer to different entities (see Figure 2 for such an example).

Detection of the markables, their attributes and their relations was conducted using the brat annotation tool [27]. Annotators were trained extensively and had access to all available information (including original dialogues, players’ observations and selections) during annotation.

4.2 Referent Identification

Figure 3: Visual interface for referent identification.

Next, we used crowdsourcing on Amazon Mechanical Turk to collect large-scale judgements of the referents of each markable. Our visual interface for referent identification is shown in Figure 3. Annotators were instructed to read the instructions carefully (including the description of the background task), to check the ambiguous box and select all possible candidates when the referents are ambiguous, and to check the unidentifiable box if the referents are completely unidentifiable based on the available information.

To collect reliable annotations, we restricted the workers to those with at least 100 previously completed HITs and an acceptance rate above 99%. We paid the workers well, with $0.25 for dialogues with fewer than 7 markables, $0.35 for those with fewer than 14 markables, and $0.45 otherwise. In addition, we automatically detected outliers based on several statistics (such as agreement with other workers) and manually reviewed them to encourage better work or reject clearly unacceptable work. The overall rejection rate was 1.18%.
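The worker filter and payment tiers described above can be written down as a small sketch. The function names `is_eligible` and `payment_for_dialogue` are hypothetical and are not part of the released materials; only the thresholds and amounts come from the text.

```python
def is_eligible(completed_hits: int, acceptance_rate: float) -> bool:
    """Worker filter from the text: at least 100 completed HITs
    and an acceptance rate above 99% (given here as a fraction)."""
    return completed_hits >= 100 and acceptance_rate > 0.99


def payment_for_dialogue(num_markables: int) -> float:
    """Reward in USD for annotating one dialogue, following the three tiers."""
    if num_markables < 0:
        raise ValueError("num_markables must be non-negative")
    if num_markables < 7:
        return 0.25
    if num_markables < 14:
        return 0.35
    return 0.45
```

Note that the tier boundaries are exclusive: a dialogue with exactly 7 markables falls in the second tier and one with exactly 14 in the third.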

As a result of this careful crowdsourcing, we were able to collect a large-scale annotation of 103,894 judgements with at least 3 judgements for each of the 34,341 markables.

5 Annotated Corpus

5.1 Basic Statistics

# Markables   # All-Referents   # No-Referent   # Anaphora   # Cataphora   % Start Agreement   % End Agreement
40,172        128               1,149           4,548        6             99.11 (96.32)       99.06 (96.11)

Table 1: Basic statistics of markable detection. Referents for all-referents, no-referent, anaphora and cataphora are annotated automatically. 130 dialogues with 3 independent annotations are used to compute agreement (Fleiss’s multi-π in parentheses).

# Markables   # Judgements   % Ambiguous   % Unidentifiable   % Agreement     % Exact Match
34,341        103,894        4.65          0.77               96.26 (88.66)   86.90

Table 2: Basic statistics of referent identification, along with the rate of ambiguous and unidentifiable checked in the judgements. Agreement is calculated at the entity level (Fleiss’s multi-π in parentheses) and exact match rate at the markable level.

First, we report the basic statistics of the annotation for markable detection in Table 1 and referent identification in Table 2. All agreements are computed based on pairwise judgements. For markable detection, agreement is calculated for the markable text span (at the token level, based on whether each token is the start or end of a markable). Agreements for markable attributes and relations are also publicly available (but omitted in this paper since they were optional and annotated only in obvious cases). For referent identification, agreement is calculated based on binary judgements of whether each entity is included in the referents or not, and an exact match is counted only if the referents of the markable match exactly. In addition, we compute Fleiss’s multi-π [10] to remove the effect of chance-level agreement.
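As a minimal sketch of these statistics, the code below computes pairwise entity-level agreement and exact match for a single markable, plus a chance-corrected coefficient for binary labels. The function names and the input layout are assumptions made for illustration. The chance correction uses the standard form (A_o − A_e) / (1 − A_e), with expected agreement taken from the pooled label distribution; how the paper aggregates over markables is not stated here, so this should not be read as the authors' exact computation.

```python
from itertools import combinations

NUM_ENTITIES = 7  # each player's view contains 7 entities


def pairwise_agreement(judgements, num_entities=NUM_ENTITIES):
    """judgements: list of sets of entity indices, one set per annotator.
    Returns (entity-level agreement, exact match rate) averaged over all pairs."""
    pairs = list(combinations(judgements, 2))
    if not pairs:
        raise ValueError("need at least two judgements")
    entity_sum = 0.0
    exact_sum = 0.0
    for a, b in pairs:
        # An entity counts as agreed if both annotators include it or both exclude it.
        same = sum((e in a) == (e in b) for e in range(num_entities))
        entity_sum += same / num_entities
        exact_sum += 1.0 if a == b else 0.0
    return entity_sum / len(pairs), exact_sum / len(pairs)


def multi_pi(items):
    """items: list of label lists (0 or 1), one list per item, one label per annotator.
    Returns the chance-corrected agreement (A_o - A_e) / (1 - A_e)."""
    observed = 0.0
    label_counts = {0: 0, 1: 0}
    for labels in items:
        n = len(labels)
        if n < 2:
            raise ValueError("each item needs at least two labels")
        ones = sum(labels)
        zeros = n - ones
        # Fraction of annotator pairs that agree on this item.
        observed += (ones * (ones - 1) + zeros * (zeros - 1)) / (n * (n - 1))
        label_counts[1] += ones
        label_counts[0] += zeros
    observed /= len(items)
    total = label_counts[0] + label_counts[1]
    expected = sum((c / total) ** 2 for c in label_counts.values())
    if expected == 1.0:
        raise ValueError("chance agreement is 1; coefficient is undefined")
    return (observed - expected) / (1.0 - expected)
```

For referent identification, each (markable, entity) pair would be one item for `multi_pi`, with one binary label per annotator.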

Overall, we found high agreement for all annotations, which verified the reliability of our annotation framework.

5.2 Disagreement Analysis

However, it is natural that there is a certain degree of disagreement in referent interpretations. In fact, it is important to capture such disagreements as there can be genuine ambiguity and uncertainty under continuous and partially-observable context (see Figure 4 for an example). Therefore, in addition to explicitly annotating the ambiguity and unidentifiability as described in Subsection 4.2, we aim to capture them implicitly by collecting multiple judgements from different annotators, similar in approach to Poesio et al. (2019).

Annotator 1 and Annotator 2 (two panels), both judging the utterance: “medium sized light gray dot with a darker one directly under it and to the right?”

Figure 4: Example of seemingly reasonable disagreements captured by our annotation.

To study the disagreements in detail, we compute the observed agreement statistics given the number of referents in each judgement. To be specific, for a certain number of referents (from 0 to 7), we consider all judgements with that number of referents, make all possible pairs with the other judgements on the same markable, and compute the average of entity-level agreement and exact match rate. The results are summarized in Table 3.
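The grouping procedure above can be sketched as follows. The function name `agreement_by_num_referents` and the input layout (a list of markables, each a list of referent sets) are assumptions for illustration, and averaging over pairs pooled across markables is one plausible reading of the description.

```python
from collections import defaultdict


def agreement_by_num_referents(markables, num_entities=7):
    """markables: list of markables, each a list of referent sets (one per annotator).
    Returns {k: (entity agreement, exact match rate, share of judgements)} where k
    is the number of referents in the judgement that anchors each pair."""
    agree_sum = defaultdict(float)
    exact_sum = defaultdict(float)
    pair_count = defaultdict(int)
    judgement_count = defaultdict(int)
    total = 0
    for judgements in markables:
        for i, a in enumerate(judgements):
            k = len(a)
            judgement_count[k] += 1
            total += 1
            # Pair this judgement with every other judgement on the same markable.
            for j, b in enumerate(judgements):
                if i == j:
                    continue
                same = sum((e in a) == (e in b) for e in range(num_entities))
                agree_sum[k] += same / num_entities
                exact_sum[k] += 1.0 if a == b else 0.0
                pair_count[k] += 1
    result = {}
    for k, count in judgement_count.items():
        pairs = pair_count[k]
        agreement = agree_sum[k] / pairs if pairs else None
        exact = exact_sum[k] / pairs if pairs else None
        result[k] = (agreement, exact, count / total)
    return result
```

A judgement on a markable with no other judgements contributes to the share of judgements but to no pair, so the agreement entries are `None` when a group has no pairs at all.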

# Referents   % Agreement   % Exact   % Judgements
0             78.04         17.78      1.31
1             97.45         90.28     71.81
2             94.87         82.17     14.85
3             93.93         83.03      7.51
4             92.18         76.66      2.20
5             90.31         71.03      0.88
6             90.75         78.14      1.22
7             81.47         62.50      0.21

Table 3: Agreement statistics given the number of referents in the judgement and the percentages of such judgements.

We can see that there is a substantial amount of disagreement when the number of referents was judged to be either 0 or 7. This could be due to several reasons: obvious cases were already annotated as no-referent or all-referents during markable detection (so only difficult cases were left), annotators simply made mistakes (e.g. forgot to annotate), or the referents were annotated as such when it was too difficult to identify them. Since the number of such judgements is relatively small, their effect can be mitigated by appropriate aggregation of multiple judgements. In addition, they could be a useful resource for studying whether the disagreements are caused by annotation error or genuine difficulty in the annotation, as suggested in Poesio et al. (2019).

We also found that the exact match rate is highest when the number of referents is 1, and much lower as the number of referents increases. This is reasonable because referring expressions for multiple entities tend to be more pragmatic and ambiguous (e.g. “a cluster”, “most of”, “a line”), and it would be more difficult to match the referents exactly. Note that entity level agreements are still at a high level, and the interpreted referents seem to mostly overlap with each other.

Next, as a preliminary analysis to study which expressions tend to have higher (or lower) disagreements, we compute the correlations between the occurrence of common tokens (represented by binary values) and the exact match rate of the pairwise judgements for each markable. Illustrative examples are shown in Table 4 and the whole list will be publicly available.
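One way to carry out this kind of analysis is sketched below. The text does not name the correlation coefficient, so the use of the Pearson coefficient (which reduces to the point-biserial correlation when one variable is binary) is an assumption, as are the function names and the input layout.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return float("nan")  # undefined when either variable is constant
    return cov / math.sqrt(var_x * var_y)


def token_correlation(token, markables):
    """markables: list of (tokens, exact_match_rate) pairs, one per markable.
    Correlates the binary occurrence of `token` with the exact match rate."""
    occurrence = [1.0 if token in tokens else 0.0 for tokens, _ in markables]
    rates = [rate for _, rate in markables]
    return pearson(occurrence, rates)
```

With toy input where “it” only occurs in markables whose judgements never match exactly, the function returns −1 for “it”, which mirrors the direction (though not the magnitude) of the trend reported in Table 4.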

Low correlation
Token      Correlation   Count
it         -0.149        12.7K
any        -0.103         0.5K
that       -0.100        12.5K
your       -0.083         1.5K
few        -0.081         0.1K
what       -0.081         0.4K
others     -0.064         0.8K
line       -0.062         1.7K
bunch      -0.060         0.2K
all        -0.048         1.1K
triangle   -0.046         2.5K
some       -0.042         0.2K
medium     -0.041        12.5K
another    -0.039         1.4K
and        -0.029         1.7K

High correlation
Token      Correlation   Count
lower      0.028          1.3K
two        0.030         14.7K
three      0.031          4.2K
darkest    0.036          2.1K
larger     0.039          7.7K
middle     0.041          2.1K
smallest   0.043          2.0K
very       0.056          6.1K
top        0.061          5.2K
light      0.072         18.7K
tiny       0.076          7.8K
large      0.084         21.7K
the        0.125         55.0K
one        0.136         57.1K
black      0.145         26.9K

Table 4: Tokens with low or high correlation with the exact match rate. Each row gives the token, its correlation score and its count.

In general, the correlations are very small and the amount of disagreement seems relatively constant across all token types. However, the general trend is still intuitive: ambiguous or complex expressions such as pronouns, interrogatives, quantifiers, and coordinating conjunctions tend to have negative correlations, while simple and less ambiguous expressions tend to have positive correlations.

To summarize the analyses, our annotation has high overall agreement but also includes interesting, reasonable disagreements which capture the ambiguity and uncertainty under continuous and partially-observable context.

5.3 Pragmatic Expressions

Finally, as an illustrative example of additional analyses that can be conducted based on our annotation, we give a more quantitative analysis of pragmatic expressions, whose existence has been pointed out in previous work but without sufficient evidence [32].

In this work, we focus on pragmatic expressions of color and estimate the distribution of the actual color of the referents described by the common adjectives. We simply assume that the adjectives in the minimal noun phrase describe the color of the referents, since the exceptions (such as negation in the prenominal modifier) seemed rare and negligible. Distributions are calculated based on kernel density estimation. As we can see in Figure 5, all adjectives (including the specific color black) have smooth and wide distributions which overlap with each other. This is strong evidence that the same color can be described in various ways and that color descriptions become more pragmatic under continuous context.
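For readers unfamiliar with kernel density estimation, the following is a minimal sketch. The Gaussian kernel and the fixed bandwidth are assumptions, since the text does not specify either, and the function name `gaussian_kde` is chosen here for illustration only.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from `samples` with a Gaussian kernel.
    samples: observed color values (e.g. 0 to 255, lower is darker)."""
    if not samples:
        raise ValueError("samples must not be empty")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, then normalize so the density integrates to 1.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density
```

Calling `gaussian_kde` once per adjective, on the color values of the referents of all markables containing that adjective, would give one curve per adjective, which can then be compared for overlap.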

Figure 5: Distribution of the actual color of the referents expressed by common adjectives (the range of color is 256 as in RGB scale, lower is darker).

6 Experiments

In this section, we evaluate and analyze baseline models based on three tasks. The first is the target selection task proposed by Udagawa and Aizawa (2019), where the goal is to predict the entity selected by each player at the end of the collaborative referring task: this requires correct recognition of the created common ground based on the dialogue and context (i.e. the player’s view). The second is the reference resolution task, where we focus on binary predictions of whether each entity is included in the referents or not. The last is the selfplay dialogue task, where the model plays the whole collaborative referring task (Section 3) against an identical copy of itself.

For reference resolution, we use simple majority voting (at the entity level) and automatic annotation of the referents to create gold annotation. Markables are removed if the majority considered them as unidentifiable.
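The aggregation step can be sketched as below. The input layout (one dict per judgement with hypothetical keys `referents` and `unidentifiable`) and the use of a strict majority, so that ties exclude the entity, are assumptions; the text only says that simple majority voting is applied at the entity level.

```python
def aggregate_referents(judgements, num_entities=7):
    """judgements: list of dicts with keys 'referents' (set of entity indices)
    and 'unidentifiable' (bool). Returns the gold referent set, or None when
    the majority of annotators marked the markable as unidentifiable."""
    n = len(judgements)
    if n == 0:
        raise ValueError("need at least one judgement")
    unidentifiable_votes = sum(1 for j in judgements if j["unidentifiable"])
    if unidentifiable_votes * 2 > n:
        return None  # markable is removed from the gold annotation
    gold = set()
    for entity in range(num_entities):
        votes = sum(1 for j in judgements if entity in j["referents"])
        if votes * 2 > n:  # strict majority
            gold.add(entity)
    return gold
```

With three judgements selecting {0, 1}, {0} and {0, 2}, only entity 0 reaches a majority, so the gold referent set is {0}.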

6.1 Model Architecture

Figure 6: Our baseline model architecture (best seen in color). TSEL decoder is shown in green, REF decoder and the input markable three black dots are in red, and DIAL decoder is in blue. All decoders share the entity-level attention module.

The overall architecture of our baseline models is shown in Figure 6. Our baseline models have two encoders: one for encoding dialogue tokens and one for context information.

Dialogue tokens are encoded with a standard GRU [5]. To encode context information, we embed each entity using a shared entity encoder. This consists of an attribute encoder which embeds the attributes of each entity (size, color and location) with a matrix followed by a tanh layer, and a relational encoder which embeds the relative attributes of each entity pair (e.g. distance) with another matrix followed by a tanh layer. The final embedding of each entity is the concatenation of its attribute embedding and the sum of its relational embeddings with the other 6 entities.
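The entity encoder described above can be illustrated with a small dependency-free sketch. The particular relative features (coordinate offsets, distance, size and color differences), the embedding sizes, and all function names are assumptions for illustration; a real implementation would use a deep learning framework with trained weights.

```python
import math
import random


def linear_tanh(weights, vector):
    """Multiply `vector` by a weight matrix (list of rows), then apply tanh."""
    return [math.tanh(sum(w * v for w, v in zip(row, vector))) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def relative_features(a, b):
    """Hypothetical relative attributes of entity b with respect to entity a."""
    dx, dy = b["x"] - a["x"], b["y"] - a["y"]
    return [dx, dy, math.hypot(dx, dy), b["size"] - a["size"], b["color"] - a["color"]]


def encode_entities(entities, attr_weights, rel_weights):
    """Return one embedding per entity: the attribute embedding concatenated
    with the sum of relational embeddings over all other entities."""
    embeddings = []
    for i, ent in enumerate(entities):
        attr = linear_tanh(attr_weights, [ent["x"], ent["y"], ent["size"], ent["color"]])
        rel_sum = [0.0] * len(rel_weights)
        for j, other in enumerate(entities):
            if i == j:
                continue
            rel = linear_tanh(rel_weights, relative_features(ent, other))
            rel_sum = [s + r for s, r in zip(rel_sum, rel)]
        embeddings.append(attr + rel_sum)  # list concatenation
    return embeddings
```

Because the same two matrices are applied to every entity and every pair, the encoder is shared across entities, and summing the relational embeddings makes the result independent of the order of the other entities.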


Our models can have up to three decoders: TSEL for target selection, REF for reference resolution, and DIAL for predicting next tokens. Each decoder shares (some or all layers of) an MLP-based attention module, which computes a scalar score for each entity based on its embedding and certain positions of the GRU: TSEL takes the final hidden state, REF takes (the mean of) the start position of the markable, the end position of the markable, and the end position of the utterance including the markable, and DIAL takes the current hidden state. Based on these attention scores, TSEL simply computes the softmax and REF computes a logistic regression for each entity. DIAL reweights the entity embeddings based on these attention scores, concatenates the result with the current hidden state and decodes with an MLP [1].
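To make the difference between the three decoder heads concrete, the sketch below takes precomputed attention scores (one scalar per entity) and shows what each head does with them. The 0.5 decision threshold for REF, the softmax normalization of the DIAL weights, and the function names are assumptions for illustration; the attention MLP itself is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tsel_predict(scores):
    """TSEL: exactly one entity is selected, so take the softmax argmax."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


def ref_predict(scores, threshold=0.5):
    """REF: an independent binary decision per entity, so any number
    of entities (from none to all) can be predicted as referents."""
    return {i for i, s in enumerate(scores) if sigmoid(s) > threshold}


def dial_context(scores, embeddings):
    """DIAL: attention-weighted sum of entity embeddings, to be concatenated
    with the current hidden state before decoding."""
    weights = softmax(scores)
    dim = len(embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, embeddings)) for d in range(dim)]
```

The design point is that target selection is a single-choice problem over 7 entities, whereas a markable may refer to zero, one or several entities, which is why REF uses per-entity logistic outputs rather than a softmax.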

In this experiment, we built five models based on different combinations of the three decoders. All models are trained with the default hyperparameters with minimal tuning.

6.2 Results

Model           Target Selection   Reference Resolution (Exact Match)   Selfplay Dialogue
                                                                        #Shared=4    #Shared=5    #Shared=6
TSEL            67.79±1.53         -                                    -            -            -
REF             -                  85.75±0.22 (33.91±0.86)              -            -            -
TSEL-REF        69.01±1.58         85.47±0.36 (32.88±1.28)              -            -            -
TSEL-DIAL       67.01±1.29         -                                    42.07±1.27   57.37±1.29   77.00±1.13
TSEL-REF-DIAL   69.09±1.12         85.86±0.18 (33.66±0.93)              45.78±2.15   61.95±1.72   80.01±1.61
Human           90.79              96.26 (86.90)                        65.83        76.96        87.00

Table 5: Results of our baseline models (mean ± standard deviation). Human scores are taken from Udagawa and Aizawa (2019) and Table 2 for reference.

We run the experiments 10 times with different random seeds and dataset splits (8:1:1 for train, validation and test). For selfplay dialogues, we generated 1,000 scenarios for each number of shared entities (4, 5 or 6) and set the output temperature to 0.25 during next token prediction. We report the mean and standard deviation of the results in Table 5.
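The effect of the output temperature can be illustrated with a short sketch. Dividing the logits by the temperature before the softmax is the usual formulation and is assumed here; the function names are hypothetical.

```python
import math
import random


def temperature_softmax(logits, temperature):
    """Softmax over logits divided by `temperature`. Values below 1
    sharpen the distribution towards the highest-scoring token."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Sample one token index from the temperature-scaled distribution."""
    probs = temperature_softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

For two tokens with logits 1.0 and 0.0, a temperature of 1.0 gives the first token a probability of about 0.73, while a temperature of 0.25 raises it to about 0.98, so low temperatures make selfplay generation close to greedy decoding while keeping some randomness.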


In terms of the target selection and selfplay dialogue tasks, we found consistent improvements by training the models jointly with reference resolution. This verifies that we can indeed leverage the central subtask of reference resolution to improve performance on difficult end tasks. The results for reference resolution are reasonably high in terms of entity-level accuracy but much lower in terms of exact match rate. Considering the high agreement (Subsection 5.1) and the improved reliability of the gold annotation after aggregation, we expect there to be substantial room for further improvement.

Overall, common grounding under continuous and partially-observable context is still a challenging task, and we expect our resource to be a fundamental testbed for solving this task through advanced skills of reference resolution.

6.3 Analysis

To demonstrate the advantages of our approach for interpreting and analyzing dialogue systems, we give a more detailed analysis of the TSEL-REF-DIAL model, which performed well on all three tasks. In Table 6, we show the results for reference resolution (entity-level accuracy and exact match rate) grouped by the number of referents in the gold annotation. In terms of the exact match rate, we found that the model performs very well on 0 and 7 referents: this is because most of them can be recognized at the superficial level, such as “none of them”, “all of mine” or “I don’t have that”. However, the model struggles on all other cases: the results are especially poor for markables with more than 1 referent. This shows that the model still lacks the ability to precisely track multiple referents, which can be expressed in complex, pragmatic ways (such as groupings).

# Referents   % Accuracy   % Exact Match   Count
0             95.91±1.38   83.53±4.65       148.5
1             89.34±0.17   36.86±1.32      2782.5
2             78.14±1.07   20.59±1.90       587.9
3             70.64±1.02   13.63±2.06       283.3
4             69.12±2.69   10.16±3.47        81.0
5             73.57±2.94   17.56±5.88        33.0
6             78.69±4.45   13.18±7.31        43.0
7             74.60±7.49   50.38±11.40       22.3

Table 6: Results of the reference resolution task grouped by the number of referents in the gold annotation (along with the average count of such markables in the test set).

In addition, we found that the correlation between the reference resolution score (average accuracy of reference resolution in each dialogue) and the target selection score (binary result of target selection in each dialogue) was relatively weak when averaged over the 10 runs of the experiments. Indeed, we verified that the model is often correct on the target selection task for the wrong reason, without tracking the referents correctly. Our annotation is also useful for error analysis in recognizing the process of common grounding, by inspecting where the model made a mistake and lost track of the correct referents.

Selfplay dialogue (shown with Model A’s view and Model B’s view):
A: I have a large black dot with a smaller dark dot to the right of it
B: I see that . Let’s pick the large black dot

Figure 7: Example dialogue from the selfplay task by TSEL-REF-DIAL model. Predicted referents are highlighted (no referents were predicted for the large black dot).

Finally, we show an example dialogue from the selfplay task along with the interpreted process of common grounding in Figure 7. Referring expressions are automatically detected by a BiLSTM-CRF tagger [14] trained on our corpus (with 98.9% accuracy at the token level). Based on the raw dialogue only, it is difficult to identify which dots the models are referring to. However, by visualizing the intended referents, we can see that model A is describing two dots in a somewhat unnatural and inappropriate way (albeit using the anaphoric expression “it” appropriately). In turn, model B acknowledges this in a perfectly coherent way but without predicting any referents for “the large black dot”: we often observed such phenomena, where an utterance by a model cannot be interpreted correctly even by the model itself. In this way, our annotation allows for fine-grained analysis of both the capabilities and incapabilities of existing dialogue systems. The generated dialogue is short in this example, but our approach would be even more critical for interpretation as the dialogues get longer and more complicated.

7 Conclusion

We propose a novel method of decomposing common grounding based on its subtasks to study the intermediate process of common grounding. We demonstrated the advantages of our approach through extensive analysis of the annotated corpus and the baseline models. Overall, we expect our work to be a fundamental step towards interpreting and improving common grounding through reference resolution.


Acknowledgments

This work was supported by JSPS KAKENHI Grant Number 18H03297 and NEDO SIP-2 “Big-data and AI-enabled Cyberspace Technologies.” We also thank the anonymous reviewers for their valuable suggestions and comments.


References

  • [1] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §6.1.
  • [2] Y. Belinkov and J. Glass (2019-03) Analysis methods in neural language processing: a survey. Transactions of the Association for Computational Linguistics 7, pp. 49–72. Cited by: §2.
  • [3] A. Bordes and J. Weston (2016) Learning end-to-end goal-oriented dialog. CoRR abs/1605.07683. Cited by: §1.
  • [4] H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi (2019) Touchdown: natural language navigation and spatial reasoning in visual street environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12538–12547. Cited by: §2.
  • [5] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103–111. Cited by: §6.1.
  • [6] H. H. Clark (1996) Using language. Cambridge University Press. Cited by: §1, §2.
  • [7] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra (2017) Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2. Cited by: §2.
  • [8] H. De Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle, and A. Courville (2017) GuessWhat?! visual object discovery through multi-modal dialogue. In Proc. of CVPR. Cited by: §2.
  • [9] F. Doshi-Velez and B. Kim (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608. Cited by: §2.
  • [10] J. L. Fleiss (1971) Measuring nominal scale agreement among many raters. Psychological Bulletin 76 (5), pp. 378. Cited by: §5.1.
  • [11] S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith (2018-06) Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 107–112. Cited by: §2.
  • [12] J. Haber, T. Baumgärtner, E. Takmaz, L. Gelderloos, E. Bruni, and R. Fernández (2019-07) The PhotoBook dataset: building common ground through visually-grounded dialogue. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1895–1910. Cited by: §1, §2.
  • [13] J. Hernández-Orallo (2017) The measure of all minds: evaluating natural and artificial intelligence. 1st edition, Cambridge University Press, New York, NY, USA. ISBN 1107153018, 9781107153011. Cited by: §2.
  • [14] Z. Huang, W. Xu, and K. Yu (2015) Bidirectional LSTM-CRF models for sequence tagging. CoRR abs/1508.01991. Cited by: §6.3.
  • [15] N. Ilinykh, S. Zarrieß, and D. Schlangen (2019) MeetUp! a corpus of joint activity dialogues in a visual environment. arXiv preprint arXiv:1907.05084. Cited by: §2.
  • [16] T. Koschmann and C. D. LeBaron (2003) Reconsidering common ground. In ECSCW 2003, pp. 81–98. Cited by: §2.
  • [17] T. Linzen, E. Dupoux, and Y. Goldberg (2016) Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4, pp. 521–535. Cited by: §2.
  • [18] Z. C. Lipton (2016) The mythos of model interpretability. arXiv preprint arXiv:1606.03490. Cited by: §2.
  • [19] T. McCoy, E. Pavlick, and T. Linzen (2019-07) Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3428–3448. Cited by: §2.
  • [20] M. Poesio, J. Chamberlain, S. Paun, J. Yu, A. Uma, and U. Kruschwitz (2019-06) A crowdsourced corpus of multiple judgments and disagreement on anaphoric interpretation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1778–1789. Cited by: §2.
  • [21] M. Poesio, R. Stuckardt, and Y. Versley (2016) Anaphora resolution. Springer. Cited by: §2.
  • [22] C. Potts (2012) Goal-driven answers in the cards dialogue corpus. In Proceedings of the 30th West Coast Conference on Formal Linguistics, pp. 1–20. Cited by: §2.
  • [23] S. Pradhan, L. Ramshaw, M. Marcus, M. Palmer, R. Weischedel, and N. Xue (2011) CoNLL-2011 shared task: modeling unrestricted coreference in OntoNotes. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pp. 1–27. Cited by: §2.
  • [24] M. Recasens, M. A. Martí, and C. Orasan (2012-05) Annotating near-identity from coreference disagreements. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, pp. 165–172. Cited by: §2.
  • [25] C. Sankar, S. Subramanian, C. Pal, S. Chandar, and Y. Bengio (2019-07) Do neural dialog systems use the conversation history effectively? an empirical study. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 32–37. Cited by: §2.
  • [26] T. Shore, T. Androulakaki, and G. Skantze (2018-05) KTH tangrams: a dataset for research on alignment and conceptual pacts in task-oriented dialogue. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. Cited by: §2.
  • [27] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, and J. Tsujii (2012) BRAT: a web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 102–107. Cited by: §4.1.
  • [28] S. Sugawara, K. Inui, S. Sekine, and A. Aizawa (2018-October-November) What makes reading comprehension questions easier? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4208–4219. Cited by: §2.
  • [29] S. Sugawara, H. Yokono, and A. Aizawa (2017) Prerequisite skills for reading comprehension: multi-perspective analysis of MCTest datasets and systems. In AAAI, pp. 3089–3096. Cited by: §2.
  • [30] T. Tokunaga, R. Iida, A. Terai, and N. Kuriyama (2012-05) The REX corpora: a collection of multimodal corpora of referring expressions in collaborative problem solving dialogues. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, pp. 422–429. Cited by: §2.
  • [31] D. R. Traum (1994) A computational theory of grounding in natural language conversation. Technical report, University of Rochester, Department of Computer Science. Cited by: §1.
  • [32] T. Udagawa and A. Aizawa (2019) A natural language corpus of common grounding under continuous and partially-observable context. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7120–7127. Cited by: §1, §1, §3, §5.3.
  • [33] O. Vinyals and Q. V. Le (2015) A neural conversational model. CoRR abs/1506.05869. Cited by: §1.
  • [34] S. Zarrieß, J. Hough, C. Kennington, R. Manuvinakurike, D. DeVault, R. Fernández, and D. Schlangen (2016-05) PentoRef: a corpus of spoken references in task-oriented dialogues. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, pp. 125–131. Cited by: §2.