CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog

Visual Dialog is a multimodal task of answering a sequence of questions grounded in an image, using the conversation history as context. It entails challenges in vision, language, reasoning, and grounding. However, studying these subtasks in isolation on large, real datasets is infeasible, as it requires prohibitively expensive complete annotation of the 'state' of all images and dialogs. We develop CLEVR-Dialog, a large diagnostic dataset for studying multi-round reasoning in visual dialog. Specifically, we construct a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset. This combination results in a dataset where all aspects of the visual dialog are fully annotated. In total, CLEVR-Dialog contains 5 instances of 10-round dialogs for about 85k CLEVR images, totaling 4.25M question-answer pairs. We use CLEVR-Dialog to benchmark the performance of standard visual dialog models, in particular on visual coreference resolution (as a function of the coreference distance). This is the first analysis of its kind for visual dialog models, and it was not possible without this dataset. We hope the findings from CLEVR-Dialog will help inform the development of future models for visual dialog. Our dataset and code will be made public.






1 Introduction

The focus of this work is on intelligent systems that can see (perceive their surroundings through vision), talk (hold a visually grounded dialog), and reason (store entities in memory as a dialog progresses, refer back to them as appropriate, count, compare, etc.). Recent works have begun studying such systems under the umbrella of Visual Dialog Das et al. (2017a); de Vries et al. (2017), where an agent must answer a sequence of questions grounded in an image. As seen in Fig. 1, this entails challenges in vision (e.g., identifying objects and their attributes in the image), language/reasoning (e.g., keeping track of and referencing previous conversation via memory), and grounding (e.g., grounding textual entities in the image).

Figure 1: CLEVR-Dialog: we view dialog as communication between two agents: an Answerer (A-er), who can ‘see’ the image and has the complete scene graph (far right), and a Questioner (Q-er), who does not ‘see’ the image. The A-er begins the dialog with a grounded caption (‘A cylinder is next to a yellow object’). The Q-er converts this caption into a partial scene graph (far left, top) and follows up with a question grounded in it (‘What shape is the object?’), which the A-er answers, and the dialog progresses. Questions at each round are generated based solely on the Q-er's partial scene graph, i.e., without looking at the image I or the complete scene graph, which mimics real-life scenarios of visual dialog.

To train and evaluate agents for Visual Dialog, Das et al. (2017a) collected a large dataset of human-human dialogs on real images, gathered between pairs of workers on Amazon Mechanical Turk (AMT). While such large-scale realistic datasets enable new lines of research, it is difficult to study the different challenges (vision, language, reasoning, grounding) in isolation, or to break down the performance of systems over these challenges to identify bottlenecks, because that would require prohibitively expensive complete annotation of the ‘state’ of all images and dialogs (all entities, coreferences, etc.).

In this work, we draw inspiration from Johnson et al. (2017) and develop a large diagnostic dataset, CLEVR-Dialog, for studying and benchmarking multi-round reasoning in visually grounded dialog. Each CLEVR image is synthetically rendered from a particular scene graph Johnson et al. (2017) and thus is, by construction, exhaustively annotated. We construct a dialog grammar that is grounded in these scene graphs. Specifically, similar to Das et al. (2017b), we view dialog generation as communication between two agents: an Answerer (A-er), who can ‘see’ the image and has the complete scene graph, and a Questioner (Q-er), who does not ‘see’ the image and is trying to reconstruct the scene graph over rounds of dialog, maintaining a partial scene graph at each round. As illustrated in Fig. 1, the dialog begins with the A-er providing a grounded caption for the image, which conveys some but not all information about the complete scene graph. The Q-er builds a partial scene graph based on the caption and follows up by asking questions grounded in it, which the A-er answers, and the dialog progresses. Our dialog grammar defines rules and templates for constructing this grounded dialog. Note that an A-er with access to the complete scene graph (perfect vision) exists only during dialog generation, to obtain ground truth answers. When studying visual dialog on CLEVR-Dialog, models are forced to answer questions with just the image and dialog history (caption and previous question-answer pairs) as additional inputs.

In total, CLEVR-Dialog contains 5 instances of 10-round dialogs for each CLEVR image in the train and val splits (about 85k images overall), totaling 4.25M question-answer pairs. We benchmark several visual dialog models on CLEVR-Dialog, which serve as strong baselines for future work.

The combination of CLEVR images (with full scene graph annotations) and our dialog grammar results in a dataset where all aspects of the visual dialog are fully annotated. We use this to study one particularly difficult challenge in multi-round visual reasoning: visual coreference resolution. A coreference arises when two or more phrases (coreferring phrases) in the conversation refer to the same entity (referent) in the image. For instance, in the question ‘What about that cylinder?’ (Q3) from Fig. 1, the referent for the phrase ‘that cylinder’ can be inferred only after resolving the phrase correctly based on the dialog history, as there are multiple cylinders in the image. We use CLEVR-Dialog to diagnose the performance of different methods as a function of the history dependency (i.e., coreference distance, the number of rounds between successive mentions of the same object) and find that the performance of a state-of-the-art model (CorefNMN) is at least 30 points lower for questions involving coreference resolution than for those that do not (Fig. 7), highlighting the challenging nature of our dataset. This is the first analysis of its kind for visual dialog, and it was simply not possible without this dataset. We hope the findings from CLEVR-Dialog will help inform the development of future models for visual dialog.

2 Related Work

Figure 2: Example dialogs from MNIST Dialog, CLEVR-Dialog, and VisDial, with coreference chains manually marked for VisDial and automatically extracted for MNIST Dialog and CLEVR-Dialog.

Coreference Resolution is a well studied problem in the NLP community Ng (2010); Lee et al. (2017); Wiseman et al. (2016); Clark and Manning (2016a, b). Our work focuses on visual coreference resolution – the referent is now a visual entity to be grounded in visual data. Several works have tackled visual coreference resolution in videos Ramanathan et al. (2014); Rohrbach et al. (2017) and 3D data Kong et al. (2014), and have introduced real image datasets for the same Hodosh et al. (2014).

Visual Dialog and Synthetic Datasets.

We contrast CLEVR-Dialog against four existing datasets: (1) CLEVR Johnson et al. (2017) is a diagnostic dataset for visual question answering (VQA) Antol et al. (2015) on rendered images that contain objects like cylinders, cubes, etc., against a plain background (Fig. 1). While CLEVR-Dialog uses the same set of images, the key difference is one of focus and emphasis: the objective of CLEVR-VQA questions is to stress-test spatial reasoning in independent single-shot question answering, whereas the objective of CLEVR-Dialog is to stress-test temporal or multi-round reasoning over the dialog history. (2) CLEVR-Ref+ Liu et al. (2019) is a diagnostic dataset based on CLEVR images for visual reasoning in referring expressions. CLEVR-Dialog goes beyond CLEVR-Ref+, which focuses on grounding objects given a natural language expression, and deals with additional visual and linguistic challenges that require multi-round reasoning in visual dialog. (3) MNIST-Dialog Seo et al. (2017) is a synthetic dialog dataset on a grid of stylized MNIST digits (Fig. 2). While MNIST-Dialog is similar in spirit to CLEVR-Dialog, the key difference is complexity: the distance between a coreferring phrase and its antecedent is always 1 in MNIST-Dialog, whereas in CLEVR-Dialog it follows a distribution over a range of distances (mean 3.2; see Tab. 2). (4) VisDial Das et al. (2017a) is a large-scale visual dialog dataset collected by pairing two human annotators (a Q-er and an A-er) on AMT, built on COCO Lin et al. (2014) images. VisDial, being a large open-ended real dataset, encompasses all the challenges of visual dialog, making it difficult to study and benchmark progress on individual challenges in isolation. Fig. 2 qualitatively compares MNIST-Dialog, CLEVR-Dialog, and VisDial, and shows coreference chains (manually annotated for this VisDial example by us, and automatically computed for MNIST-Dialog and CLEVR-Dialog). We can see that the coreference links in MNIST-Dialog are the simplest (distance always 1). While coreferences in VisDial can be on a similar level of difficulty as those in CLEVR-Dialog, the difficult cases are rarer in VisDial.

3 CLEVR-Dialog Dataset

In this section, we describe the existing annotation for CLEVR images, then detail the generation process for CLEVR-Dialog, and present the dataset statistics in comparison to existing datasets.

3.1 CLEVR Images

Every CLEVR image has a full scene graph annotation. This contains information about all the objects in the scene, including four major attributes (color, shape, material, size), 2D image and 3D world positions, and relationships (front, back, right, left) between these objects. The values for the attributes are: (a) Shape: cylinder, cube, sphere; (b) Color: blue, brown, cyan, gray, green, purple, red, yellow; (c) Size: large and small; and finally (d) Material: metal and rubber. We only use objects, attributes, and relationships.
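As an illustration of the annotation described above, the following is a minimal sketch of a scene graph container. The class names (`SceneObject`, `SceneGraph`) and the layout of `relations` are assumptions made for this example and are not taken from the CLEVR or CLEVR-Dialog code; only the attribute vocabularies come from the text.

```python
from dataclasses import dataclass, field

# Attribute vocabularies as listed in the text.
SHAPES = {"cylinder", "cube", "sphere"}
COLORS = {"blue", "brown", "cyan", "gray", "green", "purple", "red", "yellow"}
SIZES = {"large", "small"}
MATERIALS = {"metal", "rubber"}
RELATIONS = ("front", "back", "right", "left")

# Number of distinct attribute combinations an object can take.
NUM_COMBINATIONS = len(SHAPES) * len(COLORS) * len(SIZES) * len(MATERIALS)


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical record for one object and its four attributes."""
    shape: str
    color: str
    size: str
    material: str

    def __post_init__(self):
        # Reject values outside the attribute vocabularies.
        if (self.shape not in SHAPES or self.color not in COLORS
                or self.size not in SIZES or self.material not in MATERIALS):
            raise ValueError(f"invalid attribute value in {self}")


@dataclass
class SceneGraph:
    """Hypothetical scene graph: objects plus pairwise relationships."""
    objects: list
    # Assumed layout: relations[name][i] lists the indices of the objects
    # that stand in relation `name` to object i.
    relations: dict = field(default_factory=dict)


scene = SceneGraph(
    objects=[
        SceneObject("cylinder", "gray", "large", "metal"),
        SceneObject("cylinder", "green", "small", "rubber"),
    ],
    # Object 1 is in front of object 0; object 0 is behind object 1.
    relations={"front": [[1], []], "back": [[], [0]]},
)
```

Positions (2D image and 3D world) are omitted here because the text states that only objects, attributes, and relationships are used.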

Figure 3: Usage of dialog grammar in caption generation.

3.2 Dataset Generation

An important characteristic of visual dialog that makes it suitable for practical applications is that the questioner does not ‘see’ the image (because if it did, it would not need to ask questions). To mimic this setup, we condition our question generation at each round only on the partial scene graph that accumulates information received so far from the dialog history (and not on the complete scene graph). Specifically, we use a set of caption and question templates (enumerated in Tab. 1), which serve as the basis for our dialog generation. Each of these templates in turn consists of primitives, composed together according to a generation grammar. The nature and difficulty of the dataset are highly dependent on these templates, thus making their selection crucial. In what follows, we first describe these primitives, discuss how they are used to generate a caption or a question at each round, and tie everything together to explain dialog generation in CLEVR-Dialog.

Grammar Primitives.

The templates used to generate captions and questions are composed of intuitive and atomic operations called primitives. Each of these primitives can have different instantiations depending on a parameter, and also takes input arguments. For example, Filter primitives select objects from an input set of objects according to certain constraints. In particular, Filter[color](blue) selects the blue objects from a given set of objects, while Filter[shape](sphere) selects all spheres. In our work, we use the following primitives:

  • Sample: sample an object/attribute,

  • Unique: identify unique objects/attributes,

  • Count: count the number of input objects,

  • Group: group objects based on attribute(s),

  • Filter: filter inputs according to a constraint,

  • Exist: check for existence of objects,

  • Relate: apply a relation (e.g., right of).

Note that each of these primitives inherently denotes a set of constraints which, when violated, lead to a reset of the generation process for the current caption/question in the dialog. For example, if the output of Filter[color](blue) is empty due to the absence of blue objects in the input, we abort generation for the current template and move on to the next template.
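The behavior of primitives and their constraints can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: objects are plain dictionaries, and the names `ConstraintFailed`, `filter_by`, `unique`, and `try_templates` are hypothetical helpers introduced for this example.

```python
from collections import Counter


class ConstraintFailed(Exception):
    """Raised when a primitive's constraint is violated (hypothetical helper)."""


def filter_by(objects, attribute, value):
    """Filter[attribute](value): keep the objects whose attribute equals value."""
    kept = [obj for obj in objects if obj[attribute] == value]
    if not kept:
        # An empty output violates the constraint of the Filter primitive.
        raise ConstraintFailed(f"no object with {attribute}={value}")
    return kept


def unique(objects, attributes):
    """Unique: keep objects whose values for `attributes` occur exactly once."""
    def key(obj):
        return tuple(obj[a] for a in attributes)

    tally = Counter(key(obj) for obj in objects)
    kept = [obj for obj in objects if tally[key(obj)] == 1]
    if not kept:
        raise ConstraintFailed(f"no object unique in {attributes}")
    return kept


def count(objects):
    """Count: number of input objects."""
    return len(objects)


def try_templates(templates, objects):
    """Return (name, output) of the first template whose constraints hold."""
    for name, template in templates:
        try:
            return name, template(objects)
        except ConstraintFailed:
            continue  # abort this template and move on to the next one
    return None


objects = [
    {"shape": "cube", "color": "red"},
    {"shape": "sphere", "color": "gray"},
    {"shape": "sphere", "color": "red"},
]
templates = [
    ("count-blue", lambda objs: count(filter_by(objs, "color", "blue"))),
    ("count-spheres", lambda objs: count(filter_by(objs, "shape", "sphere"))),
]
result = try_templates(templates, objects)
unique_shapes = unique(objects, ["shape"])
```

Here the first template is skipped because the scene has no blue objects, so `result` comes from the second template; modeling a violated constraint as an exception is one convenient design choice among several.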

Caption Generation.

The role of the caption is to seed the dialog and initialize the Q-er's partial scene graph. In other words, the caption gives the Q-er partial information about the image so that asking follow-up questions is possible. Because the A-er generates the caption, it uses the full scene graph. Fig. 3 shows the caption grammar in action, producing three different captions for a given image. Consider the grammar for Fig. 3(c). First, Sample[attributes] produces {shape, color}, which Unique uses to select objects from the full scene graph with unique shape and color attributes. An object (gray cylinder) is then sampled from these using Sample[object]. Next, a relation (in front of) is enforced via a Relate primitive, leading to the green cylinder in front of the gray cylinder. Finally, Sample[attribute] samples one of the attributes to give us the caption, ‘A green object stands in front of a gray cylinder.’

We carefully design four different categories of caption templates: (a) Obj-unique mentions an object with a unique set of attributes in the image, (b) Obj-count specifies the presence of a group of objects with common attributes, (c) Obj-extreme describes an object at one of the positional extremes of the image (right, left, fore, rear, center), and (d) Obj-relation talks about the relationship between two objects along with their attributes in a way that allows them to be uniquely identified in the complete scene graph. In our work, the relationships are used in an immediate or closest sense, i.e., the relation ‘to the right of’ actually means ‘to the immediate right of’. Tab. 1 shows example captions.

Figure 4: Usage of dialog grammar in question generation.
Figure 5: Dialog generation in CLEVR-Dialog. At each round, all valid question templates are used to generate candidates for the next question. However, only a few interesting candidates (beams) are retained for further generation, thus avoiding an exploding number of possibilities as rounds of dialog progress.

Question Generation.

Unlike the caption, the questions are generated by the Q-er, which has access only to a partial scene graph at each round. This graph is an assimilation of information from the previous rounds of the dialog. The primitives in the question template therefore take the partial scene graph as input, and generation proceeds in a manner similar to that of the caption explained above. As the dialog is driven by the Q-er based on partial scene information, only a few questions are non-redundant (or even plausible) at a given round of the dialog. Consequently, the inherent constraints associated with the primitives now play a bigger role in template selection.

In this work, we experiment with three different categories of question templates: (a) Count questions ask for a count of objects in the image satisfying specific conditions, e.g., ‘How many objects share the same color as this one?’, (b) Existence questions are yes/no binary questions that verify conditions in the image, e.g., ‘Are there any other cubes?’, and (c) Seek questions query attributes of objects, e.g., ‘What color is that cylinder?’.

Consider Fig. 4, which shows how the current question is generated using the primitives and grammar, given the caption and dialog history (question-answer pairs for the first three rounds). For the current round, the question ‘What material is the green object at the back?’ is clearly implausible (the Q-er is unaware of the existence of a green object), while the question ‘What shape is the red object?’ is redundant. For the templates visualized, Unique[object] returns a list of unique known object-attribute pairs (using the partial scene graph). A candidate is sampled by Sample[object] and a relation is applied through Relate(in front of). There are multiple choices at this junction: (a) the use of Count leads to a counting question (count-obj-rel-early), (b) invoking Sample[attribute] results in a seek question (seek-attr-rel-early), and finally, (c) the Exist primitive generates an exist question of type exist-obj-rel-early.
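The distinction between implausible, redundant, and valid questions can be made concrete with a small sketch. It assumes, for illustration only, that the Q-er's partial scene graph is a dictionary from object identifiers to the attributes known so far; `classify_seek_question` and `record_answer` are hypothetical helpers, not part of the released generation code.

```python
def classify_seek_question(partial_graph, object_id, attribute):
    """Classify a seek question about `attribute` of `object_id`."""
    if object_id not in partial_graph:
        # The Q-er has never heard of this object, so it cannot ask about it.
        return "implausible"
    if attribute in partial_graph[object_id]:
        # The answer is already known from the dialog history.
        return "redundant"
    return "valid"


def record_answer(partial_graph, object_id, attribute, value):
    """Accumulate a newly learned attribute into the partial scene graph."""
    partial_graph.setdefault(object_id, {})[attribute] = value


# After the caption and a first answer, the Q-er knows a red object and its shape.
partial_graph = {}
record_answer(partial_graph, "obj0", "color", "red")
record_answer(partial_graph, "obj0", "shape", "cube")

labels = {
    "material of unseen green object": classify_seek_question(partial_graph, "obj1", "material"),
    "shape of red object": classify_seek_question(partial_graph, "obj0", "shape"),
    "material of red object": classify_seek_question(partial_graph, "obj0", "material"),
}
```

Only the last question would be retained as a candidate in this toy example.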

Caption Type | Template | Example
obj-relation | ‘A [Z] [C] [M] [S] stands [R] a [Z1] [C1] [M1] [S1].’ | ‘A gray sphere stands to the right of a red object.’
obj-unique | ‘A [Z] [C] [M] [S] is present in the image.’ | ‘A red object is present in the image’
obj-extreme | ‘The rightmost thing in the view is a [Z] [C] [M] [S].’ | ‘The rightmost thing in the view is a cylinder.’
obj-count | ‘The image has [X] [Z] [C] [M] [S].’ | ‘The image has four cylinders.’

Count/Exist Question Type | Template | Example
count-all | ‘How many objects in the image?’ |
count/exist-excl | ‘[How many / Are there] other [Z] [C] [M] [S] in the picture?’ | ‘[How many / Are there] other cubes in the picture?’
count/exist-attr | ‘[If present, how many / Are there] [Z] [C] [M] [S] objects?’ | ‘[If present, how many / Are there] metallic objects?’
count/exist-attr-group | ‘[How many / Are there] [Z] [C] [M] [S] among them?’ | ‘[How many / Are there] blue cylinders among them?’
count/exist-obj-rel-imm | ‘[How many / Are there] things to its [R]?’ | ‘[How many / Are there] things to its right?’
count/exist-obj-rel-imm2 | ‘How about to its [R]?’ | ‘How about to its left?’
count/exist-obj-rel-early | ‘[How many / Are there] things [R] that [Z] [C] [M] [S]?’ | ‘[How many / Are there] things in front of that shiny object?’
count/exist-obj-excl-imm | ‘[How many / Are there] things that share its [A]?’ | ‘[How many / Are there] things that share its color?’
count/exist-obj-excl-early | ‘[How many / Are there] things that are the same [A] as that [Z] [C] [M] [S]?’ | ‘[How many / Are there] things that are the same size as that round object?’

Seek Question Type | Template | Example
seek-attr-imm | ‘What is its [A]?’ | ‘What is its shape?’
seek-attr-imm2 | ‘How about [A]?’ | ‘How about color?’
seek-attr-early | ‘What is the [A] of that [Z] [C] [M] [S]?’ | ‘What is the shape of that shiny thing?’
seek-attr-sim-early | ‘What about the earlier [Z] [C] [M] [S]?’ | ‘What about the earlier box?’
seek-attr-rel-imm | ‘If there is a thing to its [R], what [A] is it?’ | ‘If there is a thing to its right, what color is it?’
seek-attr-rel-early | ‘If there is a thing [R] that [Z] [C] [M] [S], what [A] is it made of?’ | ‘If there is a thing in front of that shiny object, what material is it made of?’
Table 1: Example templates for all the caption and question types used to generate CLEVR-Dialog dataset. For each type, we show both: (a) a sample template with placeholders (Z=size, C=color, M=material, S=shape, A=attribute, X=count, R=relation), and (b) a realization with placeholders filled with random values.
(a) Distribution of caption (left) and question (right) categories.
(b) Distribution of coreference distances.
(c) Distribution of questions according to the template labels.
(d) Distribution of answers.
Figure 6: Visualization of various distributions for captions, questions, answers, and history dependency in our CLEVR-Dialog dataset. See Sec. 3.3 for more details.

Dialog Generation.

At a high level, dialog generation now ‘simply’ involves selecting a sequence of templates such that the accompanying constraints are satisfied by the partial scene graph at every round. As a tractable approximation to this exponentially large constraint satisfaction problem, we use beam search, which finds a valid solution and enforces additional conditions to make the dialog interesting. We found this to be effective both in terms of speed and dialog diversity. More concretely, at every round of the dialog (after 3 rounds), we ensure that the number of questions of each template type (count, existence, and seek) falls within a prescribed range (one range for count and existence each, and another for seek). In addition, we identify independent questions that do not need history to answer them, e.g., ‘How many objects are present in the image?’, and cap their number. Finally, to encourage questions that require reasoning over the history, e.g., seek-attr-sim-early and count-obj-excl-imm, we tailor our beam search objective so that dialogs containing such questions have a higher value. We use beam search with a fixed number of beams for each dialog. Fig. 5 illustrates the diverse set of candidate questions generated at each round for a given image.
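The overall shape of this search can be sketched as below. This is a deliberately simplified illustration: it ignores the scene-graph constraints and the per-category ranges, uses a hypothetical subset of template names from Tab. 1, and the scoring rule, the `max_independent` cap, and the beam size are all assumed values chosen for the example rather than the settings used to build the dataset.

```python
import heapq

# Illustrative subset of templates: name -> (category, needs_history).
TEMPLATES = {
    "count-all": ("count", False),
    "count-obj-excl-imm": ("count", True),
    "exist-attr": ("exist", False),
    "seek-attr-imm": ("seek", True),
    "seek-attr-sim-early": ("seek", True),
}


def score(dialog):
    """Assumed objective: reward questions that depend on the history."""
    return sum(1 for name in dialog if TEMPLATES[name][1])


def is_valid(dialog, max_independent):
    """Assumed constraint: cap the number of history-independent questions."""
    independent = sum(1 for name in dialog if not TEMPLATES[name][1])
    return independent <= max_independent


def beam_search(num_rounds, beam_size, max_independent):
    """Grow dialogs one round at a time, keeping only the best partial dialogs."""
    beams = [()]
    for _ in range(num_rounds):
        candidates = [
            dialog + (name,)
            for dialog in beams
            for name in TEMPLATES
            if is_valid(dialog + (name,), max_independent)
        ]
        # Retain only the highest-scoring candidates for the next round.
        beams = heapq.nlargest(beam_size, candidates, key=score)
    return beams


beams = beam_search(num_rounds=4, beam_size=3, max_independent=1)
```

Pruning to `beam_size` candidates per round keeps the number of partial dialogs constant instead of growing exponentially with the number of rounds.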

To summarize, the usage of primitives and a dialog grammar makes our generation procedure: (a) modular: each primitive has an intuitive meaning, (b) expressive: complex templates can be broken down into these primitives, and (c) computationally efficient: outputs can be reused for templates sharing similar primitive structures (as seen in Fig. 4). Together, these properties allow easy extension to new primitives and templates. We believe that CLEVR-Dialog represents not a static dataset but a recipe for constructing increasingly challenging grounded dialog by expanding this grammar.

3.3 Dataset Statistics

We compare CLEVR-Dialog to MNIST-Dialog and VisDial in Tab. 2, but the key measure of coreference distance cannot be reported for VisDial as it is not annotated. Overall, CLEVR-Dialog has more questions and a strikingly larger number of unique questions than MNIST-Dialog, indicating higher linguistic diversity. CLEVR-Dialog questions are longer, with a mean length of 10.6 compared to 8.9 for MNIST-Dialog. Crucially, supporting our motivation, the mean distance (in terms of rounds) between the coreferring expressions in CLEVR-Dialog is 3.2 compared to 1.0 in MNIST-Dialog. Moreover, the distances in CLEVR-Dialog vary, while the distance is constant (at 1) in MNIST-Dialog, making it easy for models to pick up on this bias.

Further, we visualize the distribution of caption templates, question templates, answers, and the history dependency of questions in CLEVR-Dialog (Fig. 6), and discuss in detail below.

Statistic | CLEVR-Dialog (ours) | MNIST-Dialog | VisDial
Unique Q | | |
Unique A | | |
Vocab. Size | 125 | 54 |
Mean Q Len. | 10.6 | 8.9 | 5.1
Mean Coref Dist. | 3.2 | 1.0 | -
Table 2: Dataset statistics comparing CLEVR-Dialog to MNIST Dialog Seo et al. (2017). Our dataset has more questions (larger), more unique questions (more diverse), a higher mean coreference distance (more complex), and longer questions. Similar statistics for VisDial are shown for completeness. Coreference distance cannot be computed for VisDial due to lack of annotations.

Question Categories and Types.

CLEVR-Dialog contains three broad question categories (count, exist, and seek), each further containing several variants that together make up the different question types. In comparison, MNIST-Dialog has fewer types of questions and is less diverse. The distributions for the question categories and question types are shown in Fig. 6(a) and Fig. 6(c), respectively. Our questions comprise seek questions (favored as they open up more interesting follow-up questions), count questions, and exist questions.

History Dependency.

Recall that our motivation for CLEVR-Dialog is to create a diagnostic dataset for multi-round reasoning in visual dialog. As a result, a majority of questions in our dataset depend on the dialog history. We identify three major kinds of history dependency for the questions: (a) Coreference occurs when a phrase within the current question refers to an earlier mentioned object (referent). We characterize coreferences by measuring the distance between the current and the earlier mention, in terms of dialog rounds. This can range from the immediately preceding round (e.g., ‘What is its color?’) to the full length of the dialog (a question in the final round referring to an entity in the caption). (b) All: when the question depends on the entire dialog history, e.g., ‘How many other objects are present in the image?’. (c) None: when the question is stand-alone and does not depend on the history, e.g., ‘How many spheres does the scene have?’. The distribution of questions characterized according to the history dependency is shown in Fig. 6(b). Unlike MNIST Dialog, CLEVR-Dialog contains a good distribution of reference distances beyond just 1, leading to a mean distance of 3.2. Thus, models will need to reason through different rounds of dialog history in order to succeed.
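A coreference distance of this kind can be computed from per-round mention annotations, as in the sketch below. The input format (one set of object identifiers per round, with index 0 standing for the caption) and the function name `coreference_distances` are assumptions for illustration; the dataset's actual annotation format may differ.

```python
def coreference_distances(mentions):
    """Distance, in rounds, between successive mentions of the same object.

    mentions[t] is the set of object ids mentioned at round t
    (index 0 is taken to be the caption in this sketch).
    """
    last_seen = {}
    distances = []
    for round_id, objects in enumerate(mentions):
        for obj in sorted(objects):
            if obj in last_seen:
                distances.append((round_id, obj, round_id - last_seen[obj]))
            last_seen[obj] = round_id
    return distances


# Caption mentions object "a"; round 1 refers back to it, round 2 introduces
# "b", and round 3 returns to "a" after a gap.
mentions = [{"a"}, {"a"}, {"b"}, {"a"}]
distances = coreference_distances(mentions)
mean_distance = sum(d for _, _, d in distances) / len(distances)
```

In this toy dialog, the reference in round 1 has distance 1 and the one in round 3 has distance 2.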

4 Experiments

In this section, we describe and benchmark several models on CLEVR-Dialog. We then break down and analyze their performance according to question type and history dependency. Finally, we focus on the best performing model and study its behavior on CLEVR-Dialog both qualitatively and quantitatively. Specifically, we visualize qualitative examples and develop metrics to quantitatively evaluate the textual and visual grounding. Note that such a diagnostic analysis of visual dialog models is the first of its kind and would not be possible without CLEVR-Dialog.

4.1 Baselines

To benchmark performance, we evaluate several models on CLEVR-Dialog. Random picks an answer at random. Random-Q picks an answer at random among valid answers for a given question type (e.g., the name of a color for color questions). Further, we adapt the discriminative visual dialog models from Das et al. (2017a): (a) Late Fusion (LF), which separately encodes the question (Q), history (H), and image (I), and then fuses them by concatenation; (b) Hierarchical Recurrent Encoder (HRE), which models dialog via both dialog-level and sentence-level recurrent neural networks; and (c) Memory Network (MN), which stores history as memory units and retrieves them based on the current question. We also consider neural modular architectures: (a) CorefNMN Kottur et al. (2018), which explicitly models coreferences in visual dialog by identifying the reference in the question (textual grounding) and then localizing the referent in the image (visual grounding), and (b) NMN Hu et al. (2017), which is a history-agnostic ablation of CorefNMN.

Model Acc.
Random 3.4
Random-Q 33.4
LF-Q 40.3
LF-QI 50.4
LF-QH 44.1
LF-QIH 55.9
HRE-QH 45.9
HRE-QIH 63.3
MN-QH 44.2
MN-QIH 59.6
NMN 56.6
CorefNMN 68.0
Table 3: Accuracy (%) on CLEVR-Dialog (higher is better). See text for details.
Figure 7: Breakdown of performance by questions that depend on entire history (All), require coreference resolution (Coref), and are history-independent (None).

4.2 Overall Results

We use multi-class classification accuracy for evaluation since CLEVR-Dialog has one-word answers. Tab. 3 shows the performance of different models. The key observations are: (a) Neural models outperform random baselines by a large margin. The best performing model, CorefNMN, outperforms Random-Q by about 35 percentage points. (b) As expected, blind models (LF-Q, LF-QH, HRE-QH, MN-QH) are inferior to their counterparts that use I, by at least 10 percentage points. (c) History-agnostic models (LF-Q, LF-QI, NMN) also suffer in performance, highlighting the importance of history.
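The margins quoted in these observations follow directly from Tab. 3, and the short script below recomputes them. The accuracies are copied from the table; the variable names and the pairing of each blind model with its image-aware counterpart are choices made for this example.

```python
# Accuracies (%) copied from Tab. 3.
ACCURACY = {
    "Random": 3.4, "Random-Q": 33.4,
    "LF-Q": 40.3, "LF-QI": 50.4, "LF-QH": 44.1, "LF-QIH": 55.9,
    "HRE-QH": 45.9, "HRE-QIH": 63.3,
    "MN-QH": 44.2, "MN-QIH": 59.6,
    "NMN": 56.6, "CorefNMN": 68.0,
}

# Each blind model paired with the counterpart that also uses the image I.
BLIND_TO_SIGHTED = {
    "LF-Q": "LF-QI", "LF-QH": "LF-QIH",
    "HRE-QH": "HRE-QIH", "MN-QH": "MN-QIH",
}

best_model = max(ACCURACY, key=ACCURACY.get)

# (a) Margin of the best model over Random-Q, in percentage points.
margin_over_random_q = round(ACCURACY[best_model] - ACCURACY["Random-Q"], 1)

# (b) Gain from adding the image, in percentage points.
image_gains = {
    blind: round(ACCURACY[sighted] - ACCURACY[blind], 1)
    for blind, sighted in BLIND_TO_SIGHTED.items()
}
smallest_image_gain = min(image_gains.values())

# (c) Gain of CorefNMN over its history-agnostic ablation NMN.
history_gain = round(ACCURACY["CorefNMN"] - ACCURACY["NMN"], 1)
```

All three quantities are differences between accuracies, so they are percentage points rather than relative improvements.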

4.3 Accuracy vs History Dependency

The breakdown of model performances based on the history dependency is presented in Fig. 8. The following are the important observations:

  • The best performing model, CorefNMN, has superior performance (on average) on all questions with coreference compared to all other models. As CorefNMN is designed specifically to handle coreferences in visual dialog, this is not surprising.

  • Interestingly, the second best model, HRE-QIH, has the best accuracy on ‘All’ questions, even beating CorefNMN. In other words, HRE-QIH (and even MN-QIH) is able to answer ‘All’ questions significantly better than CorefNMN, perhaps due to the ability of its dialog-level RNN to summarize information as the dialog progresses.

  • Both NMN and CorefNMN perform similarly on the ‘None’ questions. This observation is intuitive, as NMN is a history-agnostic version of CorefNMN by construction. However, the difference becomes evident overall, where CorefNMN outperforms NMN by about 11 points (Tab. 3).

Figure 8: Accuracy breakdown of models according to the history dependency type. While CorefNMN outperforms all methods (on average) on questions containing references, its performance is not as good on questions that depend on the entire history (‘All’).
Figure 9: Accuracy breakdown of models according to the question type. See text in Sec. 4.4 for more details.
Figure 10: Qualitative visualization of CorefNMN on CLEVR-Dialog.

4.4 Accuracy vs Question Type

Fig. 9 breaks down the performance of all the models according to the question types. An obvious observation is that performance on counting and seek questions is worse than that on exist questions. While this is in part because of the binary nature of exist questions, they are also easier to answer than counting or extracting attributes that need complicated visual understanding.

4.5 Qualitative Analysis for CorefNMN

We now qualitatively visualize (Fig. 10) the best performing model, CorefNMN. In the example shown, CorefNMN first parses the caption ‘There is a cyan metal object to the front of all the objects.’ and localizes the correct cyan object. While answering Q-1, CorefNMN correctly instantiates the Refer module and applies the desired transformation (see module outputs on the right). For Q-2, it accurately identifies the object as the previous one and extracts the attributes. Finally, the question ‘What about that cyan object?’ cannot be answered in isolation because: (a) there are multiple cyan objects, and (b) the meaning of the question is incomplete without Q-2. It is interesting to note that even though CorefNMN overcomes (a) by correctly resolving the reference ‘that cyan object’ (in the image), it is unable to circumvent (b) due to its specialization in visual coreferences.

We also provide additional analysis to evaluate the textual and visual grounding by CorefNMN in the supplement.

5 Conclusion

We proposed a large, synthetic dataset called CLEVR-Dialog, to study multi-round reasoning in visual dialog, and in particular the challenge of visual coreference resolution. We benchmarked several qualitatively different models from prior work on this dataset, which act as baselines for future work. Our dataset opens the door to evaluate how well models do on visual coreference resolution, without the need to collect expensive annotations on real datasets.


The supplement is organized as follows:

  • Grounding analysis for the best performing model, CorefNMN, in Sec. A,

  • Sec. B provides implementation details.

Appendix A Grounding Analysis for CorefNMN

As mentioned earlier, CorefNMN identifies a reference phrase in the current question and proceeds to visually ground the corresponding referent in the image. Such explicit textual and visual grounding at each round allows for an interesting quantitative analysis for CorefNMN, with the help of annotations in our CLEVR-Dialog. In what follows, we first describe the grounding annotations, detail the evaluation procedure, and then present our observations.

(a) NDCG value for text grounding for various question types.
(b) NDCG value for visual grounding for various question types.
Figure 11: Evaluating the textual (above) and visual (below) grounding of CorefNMN on CLEVR-Dialog, using Normalized Discounted Cumulative Gain (NDCG) for various question types. Higher is better.


While the original CLEVR dataset (Johnson et al., 2017) does not contain bounding box annotations for the objects in the scene, Krishna et al. (2018) later added these in their work on referring expressions. We leverage these annotations to obtain the ground truth visual groundings for the referents in our questions. In addition, each of the caption and question templates carries referring phrase annotations, which give the ground truth textual groundings. We use these two groundings for evaluation.


For every coreference resolution, CorefNMN produces a visual attention map over a grid of image cells and a textual attention over the question words. We rank all the cells in the visual attention map according to their attention values. Next, we scale the ground truth bounding box down to the resolution of the attention map and consider the cells it spans as relevant. To evaluate grounding, we measure the retrieval performance of the relevant cells in the sorted list using the widely used Normalized Discounted Cumulative Gain (NDCG) metric. It measures how highly the relevant cells are ranked in the sorted list, with a logarithmic discount that gives more weight to higher ranks; thus, higher is better. For the textual grounding, we perform a similar computation between the predicted textual attention and the ground truth textual grounding, and report NDCG.
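The evaluation procedure described above can be illustrated with a minimal sketch. It assumes binary relevance (a cell is relevant if the scaled bounding box touches it), a base-2 logarithmic discount, and hypothetical helper names (`bbox_to_cells`, `ndcg`) and data layouts; none of these details are confirmed by the paper, and this is not the authors' evaluation code.

```python
import math


def bbox_to_cells(bbox, image_size, grid_size):
    """Map a pixel bounding box to the set of (row, col) grid cells it touches.

    Assumed layouts (hypothetical): bbox = (x_min, y_min, x_max, y_max),
    image_size = (width, height), grid_size = (rows, cols).
    """
    x_min, y_min, x_max, y_max = bbox
    width, height = image_size
    rows, cols = grid_size
    # Scale the box down to grid resolution and clamp to the grid bounds.
    col_lo = max(int(math.floor(x_min / width * cols)), 0)
    col_hi = min(int(math.ceil(x_max / width * cols)), cols)
    row_lo = max(int(math.floor(y_min / height * rows)), 0)
    row_hi = min(int(math.ceil(y_max / height * rows)), rows)
    return {(r, c) for r in range(row_lo, row_hi) for c in range(col_lo, col_hi)}


def ndcg(attention, relevant):
    """NDCG with binary relevance.

    attention: dict mapping an item (grid cell or word index) to its score.
    relevant: set of items treated as ground truth.
    """
    if not relevant or not attention:
        return 0.0
    # Rank items from highest to lowest attention.
    ranked = sorted(attention, key=attention.get, reverse=True)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked, start=1)
        if item in relevant
    )
    # Ideal ordering places every relevant item at the top of the ranking.
    n_ideal = min(len(relevant), len(ranked))
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, n_ideal + 1))
    return dcg / ideal


# Toy 2x2 attention map over a 100x100 image.
attention_map = {(0, 0): 0.70, (0, 1): 0.10, (1, 0): 0.15, (1, 1): 0.05}
relevant_cells = bbox_to_cells((0, 0, 50, 50), (100, 100), (2, 2))
score = ndcg(attention_map, relevant_cells)
```

The same `ndcg` function applies to textual grounding if the keys are word positions in the question and the relevant set holds the positions of the annotated referring phrase.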


The NDCG values used to evaluate both the textual and visual groundings of CorefNMN are shown in Fig. 11. An important takeaway is that the model accurately grounds the references in the question (Fig. 11(a)) consistently across several question types, as reflected in a high average NDCG. Similarly, the visual grounding in Fig. 11(b) is significantly superior to a random baseline.

Appendix B Implementation Details

The dataset generation was done entirely in Python, without any significant package dependencies. To evaluate the models from Das et al. (2017a), we use their open-source implementation based on Lua Torch. For the neural module architectures (Hu et al., 2017; Kottur et al., 2018), we use the authors’ Python-based, publicly available implementations of NMN and CorefNMN. Questions are encoded by first learning an embedding for the words, which is then fed into a single-layer LSTM. We use a pretrained convolutional neural network, ResNet-101 (He et al., 2016), to extract features for the images. During training, Adam (Kingma and Ba, 2014) steps are employed to maximize the log-likelihood of the ground truth answer. A subset of the training set is set aside to pick the best performing model via early stopping.
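As a rough illustration of model selection via early stopping on the held-out subset, the sketch below picks the epoch with the best validation score and stops once no improvement has been seen for a fixed number of epochs. The function name `select_best_checkpoint`, the `patience` parameter, and the example scores are assumptions made for illustration; the paper does not specify the stopping criterion.

```python
def select_best_checkpoint(val_scores, patience=3):
    """Return (best_epoch, best_score) under a simple early-stopping rule.

    val_scores: per-epoch scores on the held-out subset (higher is better),
                e.g. validation accuracy or log-likelihood.
    patience:   hypothetical number of epochs to wait without improvement.
    """
    best_epoch, best_score = None, float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            # New best model: remember which checkpoint to keep.
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            break
    return best_epoch, best_score


# Illustrative (made-up) validation scores per epoch.
best = select_best_checkpoint([0.50, 0.60, 0.55, 0.58, 0.59, 0.90], patience=3)
```

With a small `patience` the late improvement at the final epoch is never reached, which is the usual trade-off between training cost and the risk of stopping too early.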


  • Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

  • Clark and Manning (2016a) Kevin Clark and Christopher D. Manning. 2016a. Deep reinforcement learning for mention-ranking coreference models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2256–2262. Association for Computational Linguistics.
  • Clark and Manning (2016b) Kevin Clark and Christopher D. Manning. 2016b. Improving coreference resolution by learning entity-level distributed representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 643–653. Association for Computational Linguistics.
  • Das et al. (2017a) Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra. 2017a. Visual Dialog. In CVPR.
  • Das et al. (2017b) Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, and Dhruv Batra. 2017b. Learning cooperative visual dialog agents with deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Hodosh et al. (2014) Peter Hodosh, Alice Young, Micah Lai, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics (TACL).
  • Hu et al. (2017) Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Johnson et al. (2017) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. 2017. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1988–1997. IEEE.
  • Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.
  • Kong et al. (2014) Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, and Sanja Fidler. 2014. What are you talking about? text-to-image coreference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Kottur et al. (2018) Satwik Kottur, Jose M. F. Moura, Devi Parikh, Dhruv Batra, and Marcus Rohrbach. 2018. Visual coreference resolution in visual dialog using neural module networks. In The European Conference on Computer Vision (ECCV).
  • Krishna et al. (2018) Ranjay Krishna, Ines Chami, Michael Bernstein, and Li Fei-Fei. 2018. Referring relationships. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Lee et al. (2017) Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 188–197. Association for Computational Linguistics.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV).
  • Liu et al. (2019) Runtao Liu, Chenxi Liu, Yutong Bai, and Alan L. Yuille. 2019. Clevr-ref+: Diagnosing visual reasoning with referring expressions. CoRR, abs/1901.00850.
  • Ng (2010) Vincent Ng. 2010. Supervised noun phrase coreference research: The first fifteen years. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pages 1396–1411, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Ramanathan et al. (2014) V. Ramanathan, A. Joulin, P. Liang, and L. Fei-Fei. 2014. Linking people with ”their” names using coreference resolution. In Proceedings of the European Conference on Computer Vision (ECCV).
  • Rohrbach et al. (2017) Anna Rohrbach, Marcus Rohrbach, Siyu Tang, Seong Joon Oh, and Bernt Schiele. 2017. Generating descriptions with grounded and co-referenced people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Seo et al. (2017) Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, and Leonid Sigal. 2017. Visual reference resolution using attention memory for visual dialog. In Advances in Neural Information Processing Systems (NIPS).
  • de Vries et al. (2017) Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville. 2017. Guesswhat?! visual object discovery through multi-modal dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Wiseman et al. (2016) Sam Wiseman, Alexander M. Rush, and Stuart M. Shieber. 2016. Learning global features for coreference resolution. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 994–1004. Association for Computational Linguistics.