Common grounding is the process of creating, repairing and updating mutual understandings, which is a fundamental aspect of natural language conversation. However, interpreting the process of common grounding is a challenging task, especially under continuous and partially-observable context where complex ambiguity, uncertainty, partial understandings and misunderstandings are introduced. Interpretation becomes even more challenging when we deal with dialogue systems which still have limited capability of natural language understanding and generation. To address this problem, we consider reference resolution as the central subtask of common grounding and propose a new resource to study its intermediate process. Based on a simple and general annotation schema, we collected a total of 40,172 referring expressions in 5,191 dialogues curated from an existing corpus, along with multiple judgements of referent interpretations. We show that our annotation is highly reliable, captures the complexity of common grounding through a natural degree of reasonable disagreements, and allows for more detailed and quantitative analyses of common grounding strategies. Finally, we demonstrate the advantages of our annotation for interpreting, analyzing and improving common grounding in baseline dialogue systems.
Common grounding is the process of creating, repairing and updating mutual understandings, which is a critical aspect of sophisticated human communication as well as a longstanding goal in dialogue modeling. Recently, there have been several new proposals of dialogue tasks which require advanced skills of common grounding under continuous and partially-observable context [32, 12]. Their main contributions include the proposal of clear evaluation metrics based on task success rate, the collection of large-scale datasets (thousands of dialogues) and the introduction of complex ambiguity, uncertainty, partial understandings and misunderstandings, which are minimally observed under traditional settings based on either categorical or fully-observable context.
However, interpretation of the process of common grounding remains largely an open problem. Although a formal theory such as that of Poesio and Rieser (2010) can account for some of the important details in common grounding, constructing such a precise semantic representation is a difficult and costly process, especially under continuous and partially-observable context with high ambiguity and uncertainty. Interpretation becomes even more challenging when we deal with dialogue systems represented by end-to-end neural models [33, 3], which can converse fluently but still lack true competency in natural language understanding and generation.
In this work, we approach this problem by decomposing the common grounding task based on its intermediate subtasks. Specifically, we consider reference resolution as the central subtask of common grounding (in the sense that mutual understanding can only be created through successful references to the entities in the task domain), define this subtask formally based on a simple and general annotation schema, and create a large-scale resource to study this subtask along with the original task of common grounding.
Our annotated corpus consists of a total of 40,172 referring expressions in 5,191 dialogues curated from the existing corpus, along with multiple (a minimum of 3) judgements for referent interpretations. A visualization of our annotation is shown in Figure 1.
Through our corpus analysis, we show that our annotation has high agreement in general but also includes a natural degree of reasonable disagreements, which verifies that our annotation can be conducted reliably while capturing the ambiguity and uncertainty under continuous and partially-observable context. In addition, we give a more quantitative analysis of pragmatic expressions as an illustrative example of analyses that can be conducted based on our annotation.
Finally, through our experiments we show that our annotation is critical for interpreting and analyzing common grounding in baseline dialogue systems, as well as improving their performance on difficult end tasks.
Overall, we propose a fundamental method and resource for interpreting the process of common grounding through its subtask of reference resolution. All materials related to this work will be publicly available at https://github.com/Alab-NII/onecommon.
One of the most influential models of common grounding to date is the contribution model, which divides the exchange of information in a dialogue into two phases: the presentation phase, where a piece of information is introduced by a speaker, and the acceptance phase, where it gets accepted by a listener. However, applying such a theory for analysis in realistic settings can be difficult or even problematic, especially when contributions are implicit, indirect, unstructured, uncertain or partial. In contrast, we propose a more practical approach of decomposing common grounding based on well-defined subtasks: in our case we focus on reference resolution. Although our approach does not give a formal account of common grounding, we show that our annotation is general with a simple and clear definition, reliable in terms of annotator agreement under complex settings, and useful for interpreting and analyzing the intermediate process of common grounding.
Our work is also relevant to the recent literature on interpretable and explainable machine learning [9, 18]. In particular, the analysis of neural models is gaining attention in NLP, including end-to-end dialogue models. The main novelty of our approach is that we decompose the original task (common grounding) based on its central subtask (or possibly subtasks), define the subtask (reference resolution) formally with an annotation framework, and create a large-scale resource to study the subtask along with the original task. Our approach has several advantages compared to previous analysis methods. First, it is applicable to both humans and machines, which is especially important in dialogue domains where they interact. Second, it can be used to study the relationships between the original task and its subtasks, which is critical for a more skill-oriented evaluation of artificial intelligence [13, 29]. Third, it can be used for investigating the dataset on which the models are trained: this is important in many respects, such as understanding undesirable bias in the dataset [11, 28] or detecting correct model predictions made for the wrong reasons. Finally, the collected resource can be used both for probing whether the models solve the subtasks implicitly and for developing new models which can be explicitly supervised, evaluated and interpreted based on the subtasks.
Coreference and anaphora resolution have also been studied extensively in NLP [23, 21], including disagreements in their interpretations [24, 20]. The main difference between our annotation schema and theirs is that we focus on exophoric references and directly annotate the referent entities of each referring expression in situated dialogues. We show that our annotation can be conducted reliably, even by using non-expert annotators for referent identification. Our annotation does not capture explicit relations between anaphora, but it captures basic coreference relations as well as complex associative anaphora (such as part-of relations), at least in an indirect way. Most importantly, it is compatible with such existing schemas, and annotating explicit anaphoric relations could be a viable approach for future work.
Finally, visually grounded dialogues have been studied in a wide variety of settings. In comparison, the main strengths and novelty of our corpus can be summarized as follows:
A. Our corpus is based on the advanced setting of continuous and partially-observable context where complex common grounding strategies are required.
B. Our corpus has more simplicity and controllability compared to realistic visual dialogues, which makes controlled experiments and analyses easier.
C. Our corpus includes large-scale manual annotation of reference resolution and detailed analyses of agreements/disagreements based on multiple judgements.
Prior work in common grounding [22, 8] and visual reference resolution [30, 34, 26] mostly focuses on categorical or fully-observable settings and does not satisfy A. While visual dialogues [7, 12, 4, 15] have the strengths of being more complex and realistic, they do not satisfy B and C. Although Götze and Boye (2016) conducted a smaller-scale (and more loosely defined) annotation of reference resolution, they did not assess the reliability of the annotation (hence their work does not satisfy B and C). To the best of our knowledge, our work is the first to satisfy all of the above criteria.
Our annotation is conducted on a recently proposed common grounding dataset, which is a minimal formalization of a collaborative referring task under continuous and partially-observable context. In this task, two players are given slightly different but overlapping perspectives of a 2-dimensional grid. Both players have 7 entities in each view, but only 4, 5 or 6 of them are in common: this makes their setting partially-observable with different degrees of partial-observability. In addition, each entity only has continuous attributes (color, size and location). The goal of the task is to find one of the common entities through natural language communication, and the task is successful if and only if they could find and select the same entity.
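To make the task setup concrete, the following is a minimal sketch of how two overlapping views and the success condition could be represented. The function names, the `id` field and the attribute ranges are illustrative assumptions, not the format of the original dataset, and the geometry of the actual views is not modeled.

```python
import random


def sample_scenario(num_shared, num_entities=7, seed=0):
    """Sample two views of 7 entities that share 4, 5 or 6 of them.

    Attribute ranges below are assumptions for illustration only.
    """
    if not 4 <= num_shared <= 6:
        raise ValueError("num_shared must be 4, 5 or 6")
    rng = random.Random(seed)
    next_id = 0

    def new_entity():
        nonlocal next_id
        entity = {
            "id": next_id,
            "x": rng.uniform(-1.0, 1.0),    # continuous location
            "y": rng.uniform(-1.0, 1.0),
            "size": rng.uniform(0.0, 1.0),  # continuous size
            "color": rng.uniform(0.0, 1.0), # continuous grayscale color
        }
        next_id += 1
        return entity

    shared = [new_entity() for _ in range(num_shared)]
    num_private = num_entities - num_shared
    view_a = shared + [new_entity() for _ in range(num_private)]
    view_b = shared + [new_entity() for _ in range(num_private)]
    return view_a, view_b


def task_success(selected_a, selected_b, view_a, view_b):
    """The task succeeds iff both players select the same common entity."""
    common = {e["id"] for e in view_a} & {e["id"] for e in view_b}
    return selected_a == selected_b and selected_a in common
```

In this sketch the players only see their own view, so an entity that is private to one player can never lead to success.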
Some distinguishing characteristics of their dataset include its large size (a total of 6,760 dialogues, out of which 5,191 were successful on the task), rich linguistic variety with limited vocabulary (a total of only 2,035 unique tokens after preprocessing in our curated corpus), and most importantly the complexity of common grounding introduced by continuous and partially-observable context. As shown in Figure 2, there could be complex misunderstandings and partial understandings that need to be resolved through advanced skills of common grounding. We can also find various nuanced expressions (“almost in a line”, “I think I see …”, “could be”) and pragmatic expressions (“a line”, “largest”, “bottom left”) which can be ambiguous or need to be interpreted based on their context.
The goal of our annotation is to provide a general, reliable and useful annotation of reference resolution to interpret the intermediate process of common grounding. In this work, we use the 5,191 successful dialogues from the existing corpus which are expected to be of higher quality (however, our annotation is applicable to unsuccessful dialogues as well). Our annotation procedure consists of two main steps: markable detection to semi-automatically annotate referring expressions currently under consideration and referent identification to identify the referents of each referring expression.
As an optional step, we also conducted preprocessing of the dialogues to correct obvious misspellings and grammatical errors. Due to the limited size of the vocabulary, we manually looked for rare unigrams and bigrams in the dialogue and carefully created rules to correct them. Our preprocessing step is reversible, so the collected annotation can also be applied to the original dialogues without preprocessing.
In this work, we define a markable as an independent referring expression of the entities currently under consideration (in our case, the dots in the circular view). Basically, we annotate a markable as a minimal noun phrase including all prenominal modifiers (such as determiners, quantifiers, and adjectives) but excluding all postnominal modifiers (such as prepositional phrases and relational clauses). This eliminates the complexity of the annotation because markables will not overlap or nest with each other. See the figures for many examples of the detected markables (underlined).
To reduce the annotation effort in the later process, we optionally annotate three attributes for each markable if they are obvious from the context: a generic attribute when the markable is not specific enough to identify the referents, all-referents when the markable is referring to all of the entities in the speaker’s view, and no-referent when the referents are empty. Generic markables are ignored in our annotation, and the referents of all-referents or no-referent are annotated automatically in the later process. To reduce the redundancy of annotation, we consider a predicative noun phrase as a markable only if there is no previous markable in the same utterance that refers to the same entities: for example, “a triangle” in “three dots are forming a triangle” is not considered as a markable since “three dots” is already annotated, but it is considered a markable in “one light dot and two dark dots are forming a triangle”. We also annotate obvious anaphoric and cataphoric relations in the same utterance: this way, the referents of anaphoric and cataphoric markables can be annotated automatically based on their antecedents or postcedents. Note that we do not annotate such relations across utterances, as they can actually refer to different entities (see Figure 2 for such an example).
Detection of the markables, their attributes and relations is conducted using the brat annotation tool. Annotators were trained extensively and had access to all available information (including original dialogues, players’ observations and selections) during annotation.
Next, we used crowdsourcing on Amazon Mechanical Turk to collect large-scale judgements of the referents of each markable. Our visual interface for referent identification is shown in Figure 3. Annotators were instructed to read the instructions carefully (including the description of the background task), check the ambiguous box and select all possible candidates when the referents are ambiguous, and check the unidentifiable box if the referents are completely unidentifiable based on the available information.
To collect reliable annotations, we restricted the workers to those with at least 100 previously completed HITs and an acceptance rate above 99%. We paid the workers well, with $0.25 for dialogues with fewer than 7 markables, $0.35 for those with fewer than 14 markables, and $0.45 otherwise. In addition, we automatically detected outliers based on several statistics (such as agreement with other workers) and manually reviewed them to encourage better work or reject clearly unacceptable work. The overall rejection rate was 1.18%.
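The payment tiers and worker restrictions above can be written down as a small sketch. The function names are hypothetical and this is not the authors' crowdsourcing code; only the thresholds come from the text.

```python
def reward_usd(num_markables):
    """Payment per dialogue, following the three tiers described above."""
    if num_markables < 7:
        return 0.25
    if num_markables < 14:
        return 0.35
    return 0.45


def is_qualified(completed_hits, acceptance_rate):
    """Worker restriction: at least 100 completed HITs, acceptance above 99%."""
    return completed_hits >= 100 and acceptance_rate > 0.99
```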
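As a result of this careful crowdsourcing, we were able to collect a large-scale annotation of 103,894 judgements with at least 3 judgements for each of the 34,341 markables (the remaining markables, i.e. all-referents, no-referent, anaphora and cataphora, have their referents annotated automatically).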
Table 1: Markable detection.
| # Markables | # All-Referents | # No-Referent | # Anaphora | # Cataphora | % Start Agreement | % End Agreement |
|---|---|---|---|---|---|---|
| 40,172 | 128 | 1,149 | 4,548 | 6 | 99.11 (96.32) | 99.06 (96.11) |

Table 2: Referent identification.
| # Markables | # Judgements | % Ambiguous | % Unidentifiable | % Agreement | % Exact Match |
First, we report the basic statistics of the annotation for markable detection in Table 1 and referent identification in Table 2. All agreements are computed based on pairwise judgements. For markable detection, agreement is calculated for the markable text span (at the token level of whether each token is the start or end of the markables). Agreements for markable attributes and relations are also publicly available (but omitted in this paper since they were optional and annotated only in obvious cases). For referent identification, agreement is calculated based on binary judgements of whether each entity is included in the referents or not, and an exact match is counted only if the referents of the markable match exactly. In addition, we compute Fleiss’s Multi-π to remove the effect of chance level agreements.
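The pairwise statistics described above can be sketched as follows for a single markable. Representing each judgement as a set of entity indices is an assumption made here for illustration, and `chance_corrected` is a simplified two-category version of a multi-π style correction, not the exact computation used for the reported numbers.

```python
from itertools import combinations

NUM_ENTITIES = 7


def pairwise_agreement(judgements, num_entities=NUM_ENTITIES):
    """Return (entity level agreement, exact match rate) over all judgement pairs.

    judgements: list of sets of entity indices, one set per annotator.
    """
    agreement_sum, exact_count, num_pairs = 0.0, 0, 0
    for a, b in combinations(judgements, 2):
        # Binary decision per entity: is it included in the referents or not?
        same = sum((e in a) == (e in b) for e in range(num_entities))
        agreement_sum += same / num_entities
        exact_count += int(a == b)
        num_pairs += 1
    return agreement_sum / num_pairs, exact_count / num_pairs


def chance_corrected(observed, p_included):
    """Chance-corrected agreement for a binary decision.

    p_included: pooled proportion of 'included' decisions over all annotators.
    """
    expected = p_included ** 2 + (1 - p_included) ** 2
    return (observed - expected) / (1 - expected)
```

For example, three judgements `{0, 1}`, `{0, 1}` and `{0}` give three pairs, of which one matches exactly and two differ on a single entity.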
Overall, we found high agreement for all annotations, which verified the reliability of our annotation framework.
However, it is natural that there is a certain degree of disagreement in referent interpretations. In fact, it is important to capture such disagreements as there can be genuine ambiguity and uncertainty under continuous and partially-observable context (see Figure 4 for an example). Therefore, in addition to explicitly annotating the ambiguity and unidentifiability as described in Subsection 4.2, we aim to capture them implicitly by collecting multiple judgements from different annotators, similar in approach to Poesio et al. (2019).
To study the disagreements in detail, we compute the observed agreement statistics given the number of referents in each judgement. To be specific, for a given number of referents (from 0 to 7), we consider all judgements with that number of referents, make all possible pairs with other judgements on the same markable, and compute the average of entity level agreement and exact match rate. The results are summarized in Table 3.
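The grouping procedure above can be sketched as below. The data layout (one list of referent sets per markable) is an assumption for illustration.

```python
from collections import defaultdict


def agreement_by_num_referents(markables, num_entities=7):
    """Group pairwise statistics by the number of referents in each judgement.

    markables: list of markables, each a list of sets of entity indices.
    Returns {num_referents: (mean entity level agreement, exact match rate)}.
    """
    # Per bucket: [agreement sum, exact match count, pair count]
    totals = defaultdict(lambda: [0.0, 0, 0])
    for judgements in markables:
        for i, a in enumerate(judgements):
            for j, b in enumerate(judgements):
                if i == j:
                    continue  # pair each judgement only with the *other* judgements
                same = sum((e in a) == (e in b) for e in range(num_entities))
                bucket = totals[len(a)]
                bucket[0] += same / num_entities
                bucket[1] += int(a == b)
                bucket[2] += 1
    return {n: (s / c, x / c) for n, (s, x, c) in sorted(totals.items())}
```

Note that a pair is counted once from the perspective of each of its two judgements, so a disagreement between a 1-referent and a 2-referent judgement lowers the statistics of both buckets.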
Table 3:
| # Referents | % Agreement | % Exact | % Judgements |
We can see that there is a significant amount of disagreement when the number of referents was judged to be either 0 or 7. This could be due to several reasons: obvious cases were already annotated as no-referent or all-referents during markable detection (so only difficult cases were left), annotators simply made mistakes (e.g. forgot to annotate), or the referents were annotated as such when it was too difficult to identify them. Since the number of such judgements is relatively small, their effect can be mitigated by appropriate aggregation of multiple judgements. In addition, they could be a useful resource for studying whether the disagreements are caused by annotation error or genuine difficulty in the annotation, as suggested by Poesio et al. (2019).
We also found that the exact match rate is highest when the number of referents is 1, and much lower as the number of referents increases. This is reasonable because referring expressions for multiple entities tend to be more pragmatic and ambiguous (e.g. “a cluster”, “most of”, “a line”), and it would be more difficult to match the referents exactly. Note that entity level agreements are still at a high level, and the interpreted referents seem to mostly overlap with each other.
Next, as a preliminary analysis to study which expressions tend to have higher (or lower) disagreements, we compute the correlations between the occurrence of common tokens (represented by binary values) and the exact match rate of the pairwise judgements for each markable. Illustrative examples are shown in Table 4 and the whole list will be publicly available.
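A minimal sketch of this correlation analysis is shown below. The text does not state which correlation coefficient is used, so the Pearson coefficient (equivalent to the point-biserial correlation for a binary variable) is an assumption, as are the whitespace tokenization and the toy data in the usage note.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def token_correlation(token, markables):
    """Correlate binary token occurrence with the exact match rate.

    markables: list of (markable text, exact match rate of its pairwise judgements).
    """
    occurs = [1.0 if token in text.lower().split() else 0.0 for text, _ in markables]
    rates = [rate for _, rate in markables]
    return pearson(occurs, rates)
```

With made-up data such as `[("the black dot", 1.0), ("it", 0.0), ("a large dot", 1.0), ("it", 1 / 3)]`, the token "it" receives a negative correlation and "dot" a positive one.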
In general, the correlations are very small and the amount of disagreement seems relatively constant across all token types. However, the general trend is still intuitive: ambiguous or complex expressions such as pronouns, interrogatives, quantifiers, and coordinating conjunctions tend to have negative correlations, while simple and less ambiguous expressions tend to have positive correlations.
To summarize the analyses, our annotation has high overall agreement but also includes interesting, reasonable disagreements which capture the ambiguity and uncertainty under continuous and partially-observable context.
Finally, as an illustrative example of additional analyses that can be conducted based on our annotation, we give a more quantitative analysis of pragmatic expressions, which have been pointed out to exist in previous work but without a sufficient amount of evidence.
In this work, we focus on pragmatic expressions of color and estimate the distribution of the actual color of the referents described by the common adjectives. We simply assume that the adjective in the minimal noun phrase describes the color of the referents, since the exceptions (such as negation in the prenominal modifier) seemed rare and ignorable. Distributions are calculated based on kernel density estimation. As we can see in Figure 5, all adjectives (including the specific color black) have smooth and wide distributions which overlap with each other. This is strong evidence that the same color can be described in various ways and that color expressions become more pragmatic under continuous context.
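As a rough illustration of the density estimation step, the sketch below implements a one-dimensional Gaussian kernel density estimator in plain Python. The kernel, the bandwidth and the color values in the usage note are assumptions; the text does not specify them.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from 1-D samples with a Gaussian kernel."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps centered on each observed color value.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density
```

For instance, given hypothetical color values `[0.10, 0.20, 0.25]` of referents described as "dark", `gaussian_kde(values, 0.05)` returns a smooth density that can be compared against the densities of other adjectives.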
In this section, we evaluate and analyze baseline models based on three tasks. The first is the target selection task proposed by Udagawa and Aizawa (2019), which requires predicting the entity selected by each player at the end of the collaborative referring task: this requires correct recognition of the created common ground based on the dialogue and context (i.e. the player’s view). The second is the reference resolution task, where we focus on binary predictions of whether each entity is included in the referents or not. The last is the selfplay dialogue task, where the model plays the whole collaborative referring task (Section 3) against an identical copy of itself.
For reference resolution, we use simple majority voting (at the entity level) and automatic annotation of the referents to create gold annotation. Markables are removed if the majority considered them as unidentifiable.
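The aggregation step can be sketched as follows. Encoding an unidentifiable judgement as `None`, and voting at the entity level only among the remaining judgements, are assumptions made for this sketch.

```python
def aggregate(judgements, num_entities=7):
    """Create a gold referent set by entity level majority voting.

    judgements: list of sets of entity indices, or None for 'unidentifiable'.
    Returns the gold set, or None if the markable should be removed.
    """
    num_unidentifiable = sum(j is None for j in judgements)
    if num_unidentifiable * 2 > len(judgements):
        return None  # the majority considered the markable unidentifiable
    valid = [j for j in judgements if j is not None]
    # An entity is a gold referent if a strict majority of annotators included it.
    return {
        e for e in range(num_entities)
        if sum(e in j for j in valid) * 2 > len(valid)
    }
```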
The overall architecture of our baseline models is shown in Figure 6.
Our baseline models have two encoders: one for encoding dialogue tokens and one for context information.
Dialogue tokens are encoded with a standard GRU. To encode context information, we embed each entity using a shared entity encoder. This consists of an attribute encoder, which embeds the attributes of each entity (size, color and location) with a matrix followed by a tanh layer, and a relational encoder, which embeds the relative attributes of each entity pair (e.g. distance) with another matrix followed by a tanh layer. The final embedding of each entity is the concatenation of its attribute embedding and the sum of its relational embeddings with the other 6 entities.
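The entity encoder described above can be sketched in plain Python as follows. The embedding sizes, the random initialization and the exact choice of relative attributes (attribute differences plus Euclidean distance) are assumptions; the actual models would use trainable parameters in a deep learning framework.

```python
import math
import random


def linear_tanh(weights, vector):
    """Multiply by a weight matrix (list of rows) and apply tanh."""
    return [math.tanh(sum(w * v for w, v in zip(row, vector))) for row in weights]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def encode_entities(entities, w_attr, w_rel):
    """entities: list of (x, y, size, color) tuples. Returns one embedding each."""
    embeddings = []
    for i, entity in enumerate(entities):
        attr = linear_tanh(w_attr, entity)  # attribute encoder
        rel_sum = [0.0] * len(w_rel)
        for j, other in enumerate(entities):
            if i == j:
                continue
            diff = [a - b for a, b in zip(entity, other)]
            distance = math.hypot(diff[0], diff[1])
            rel = linear_tanh(w_rel, diff + [distance])  # relational encoder
            rel_sum = [s + r for s, r in zip(rel_sum, rel)]
        # Concatenate the attribute embedding with the summed relational embeddings.
        embeddings.append(attr + rel_sum)
    return embeddings
```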
Our models can have up to three decoders: TSEL for target selection, REF for reference resolution, and DIAL for predicting next tokens. Each decoder shares (some or all layers of) an MLP-based attention module, which computes a scalar score for each entity based on its embedding and certain positions of the GRU: TSEL takes the final hidden state, REF takes (the mean of) the start position of the markable, the end position of the markable, and the end position of the utterance including the markable, and DIAL takes the current hidden state. Based on these attention scores, TSEL simply computes the softmax and REF computes logistic regressions for each entity. DIAL reweights the entity embeddings based on these attention scores, concatenates the result with the current hidden state and decodes with an MLP.
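Given one attention score per entity, the three decoder heads can be sketched as below. The attention MLP itself is omitted, and normalizing the DIAL weights with a softmax as well as the 0.5 decision threshold for REF are assumptions of this sketch.

```python
import math


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))


def tsel_predict(scores):
    """TSEL: softmax over entities, then pick the most probable target."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


def ref_predict(scores, threshold=0.5):
    """REF: independent logistic regression per entity."""
    return {i for i, s in enumerate(scores) if sigmoid(s) > threshold}


def dial_context(scores, embeddings):
    """DIAL: attention-weighted sum of the entity embeddings."""
    weights = softmax(scores)
    dim = len(embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, embeddings)) for d in range(dim)]
```

The difference between the heads reflects the tasks: TSEL must choose exactly one entity, while REF may select any subset of the 7 entities, including none or all of them.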
In this experiment, we built five models based on different combinations of the three decoders. All models are trained with the default hyperparameters with minimal tuning.
Table 5:
| Model | Target Selection | Reference Resolution (Exact Match) | Selfplay Dialogue |
We ran the experiments 10 times with different random seeds and dataset splits (8:1:1 for train, validation and test). For selfplay dialogues, we generated 1,000 scenarios for each number of shared entities (4, 5 or 6) and set the output temperature to 0.25 during next token prediction. We report the mean and standard deviation of the results in Table 5.
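The evaluation protocol above can be sketched as follows. The function names are hypothetical, and how fractional split sizes are rounded is an assumption (here any remainder goes to the test set).

```python
import random
import statistics


def split_dialogues(dialogue_ids, seed):
    """Shuffle with the given seed and split 8:1:1 into train, validation and test."""
    ids = list(dialogue_ids)
    random.Random(seed).shuffle(ids)
    num_train = int(len(ids) * 0.8)
    num_valid = int(len(ids) * 0.1)
    train = ids[:num_train]
    valid = ids[num_train:num_train + num_valid]
    test = ids[num_train + num_valid:]
    return train, valid, test


def summarize(results):
    """Mean and sample standard deviation of per-run results."""
    return statistics.mean(results), statistics.stdev(results)
```

A run with seed `s` would then train on `split_dialogues(range(5191), s)[0]`, and `summarize` would be applied to the 10 resulting test scores.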
In terms of the target selection and selfplay dialogue tasks, we found consistent improvements by training the models jointly with reference resolution. This verifies that we can indeed leverage the central subtask of reference resolution to improve performance on difficult end tasks. The results for reference resolution are reasonably high in terms of entity level accuracy but much lower in terms of exact match rate. Considering the high agreements (Subsection 5.1) and the improved reliability of the gold annotation after aggregation, we expect there to be substantial room for further improvement.
Overall, common grounding under continuous and partially-observable context is still a challenging task, and we expect our resource to be a fundamental testbed for solving this task through advanced skills of reference resolution.
To demonstrate the advantages of our approach for interpreting and analyzing dialogue systems, we give a more detailed analysis of the TSEL-REF-DIAL model, which performed well on all three tasks. In Table 6, we show the results for reference resolution (entity level accuracy and exact match rate) grouped by the number of referents in the gold annotation. In terms of the exact match rate, we found that the model performs very well on 0 and 7 referents: this is because most of them can be recognized at the superficial level, such as “none of them”, “all of mine” or “I don’t have that”. However, the model struggles on all other cases: the results are especially poor for markables with more than 1 referent. This shows that the model still lacks the ability to precisely track multiple referents, which can be expressed in complex, pragmatic ways (such as groupings).
Table 6:
| # Referents | % Accuracy | % Exact Match | Count |
In addition, we found that the correlation between reference resolution score (average accuracy of reference resolution in each dialogue) and target selection score (binary result of target selection in each dialogue) was relatively weak, with an average of only in 10 runs of the experiments. Indeed, we verified that the model is often correct for the target selection task based on the wrong reason, without tracking the referents correctly. Our annotation is also useful for error analysis in recognizing the process of common grounding, by inspecting where the model made a mistake and lost track of the correct referents.
Figure 7: Example dialogue from the selfplay task.
A: I have a large black dot with a smaller dark dot to the right of it
B: I see that . Let’s pick the large black dot
Finally, we show an example dialogue from the selfplay task along with the interpreted process of common grounding in Figure 7. Referring expressions are automatically detected by a BiLSTM-CRF tagger trained on our corpus (with 98.9% accuracy at the token level). Based on the raw dialogue only, it is difficult to identify which dots the models are referring to. However, by visualizing the intended referents, we can see that model A is describing two dots in a somewhat unnatural and inappropriate way (albeit using the anaphoric expression “it” appropriately). In turn, model B acknowledges this in a perfectly coherent way but without predicting any referents for “the large black dot”: we often observed such phenomena, where the utterance by a model cannot be interpreted correctly even by itself. This way, our annotation allows for fine-grained analysis of both the capabilities and incapabilities of existing dialogue systems. The generated dialogue is short in this example, but our approach would be even more critical for interpretation as the dialogues get longer and more complicated.
We propose a novel method of decomposing common grounding based on its subtasks to study the intermediate process of common grounding. We demonstrated the advantages of our approach through extensive analysis of the annotated corpus and the baseline models. Overall, we expect our work to be a fundamental step towards interpreting and improving common grounding through reference resolution.
This work was supported by JSPS KAKENHI Grant Number 18H03297 and NEDO SIP-2 “Big-data and AI-enabled Cyberspace Technologies.” We also thank the anonymous reviewers for their valuable suggestions and comments.
Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3428–3448.
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4208–4219.