
Two Causal Principles for Improving Visual Dialog

This paper is a winner report from team MReaL-BDAI for the Visual Dialog Challenge 2019. We present two causal principles for improving Visual Dialog (VisDial). By "improving", we mean that they can promote almost every existing VisDial model to state-of-the-art performance on the Visual Dialog Challenge 2019 leaderboard. This major improvement comes solely from our careful inspection of the causality behind the model and data, which reveals that the community has overlooked two causalities in VisDial. Intuitively, Principle 1 suggests that we should remove the direct input of the dialog history to the answer model, otherwise a harmful shortcut bias will be introduced; Principle 2 says that there is an unobserved confounder for history, question, and answer, leading to spurious correlations from training data. In particular, to remove the confounder suggested in Principle 2, we propose several causal intervention algorithms, which make the training fundamentally different from traditional likelihood estimation. Note that the two principles are model-agnostic, so they are applicable to any VisDial model.



1 Introduction

Given an image $I$, a dialog history of past Q&A pairs $H = \{(Q_1, A_1), \ldots, (Q_{t-1}, A_{t-1})\}$, and the current $t$-th round question $Q$, a Visual Dialog (VisDial) agent [7] is expected to give a good answer $A$. Our community has always considered VQA [2] and VisDial as sister tasks due to their similar settings: Q&A grounded by $I$ (VQA) and Q&A grounded by $(I, H)$ (VisDial). Indeed, from a technical point of view, just like the VQA models, a typical VisDial model first uses an encoder to represent $I$, $H$, and $Q$ as vectors, and then feeds them into a decoder for the answer $A$. Thanks to the recent advances in encoder-decoder frameworks for VQA [22, 35], as well as for natural language processing [36], the performance (NDCG [38]) of VisDial in the literature has improved significantly, from the baseline 51.63% [37] to the state-of-the-art 64.47% [10].

Figure 1: Causal graphs of VisDial models (baseline and ours). $H$: dialog history. $I$: image. $Q$: question. $V$: visual knowledge. $A$: answer. $U$: user preference. Shaded $U$ denotes the unobserved confounder. See Section 3.2 for detailed definitions.

However, in this paper, we want to highlight an important fact: VisDial is essentially NOT VQA! And this fact is so profound that all the common heuristics in the vision-language community, such as the fusion tricks [35, 44] and attention variants [22, 24], cannot appreciate the difference. Instead, we introduce the use of causal inference [28, 26]: a graphical framework that stands in the cause-effect interpretation of the data, not merely the statistical association of them. Before we delve into the details, we would like to present the main contributions: two causal principles, rooted in the analysis of the difference between VisDial and VQA, which lead to a performance leap (a farewell to the 60%-s and an embrace of the 70%-s) for all the VisDial models in the literature [7, 21, 39, 25] that have code and reproducible results (we could not cover the others due to resource limits), promoting them to the state-of-the-art in the Visual Dialog 2019 Challenge [37].

Principle 1 (P1): Delete the link $H \to A$.

Principle 2 (P2): Add one new (unobserved) node $U$ and three new links: $U \leftarrow H$, $U \to Q$, and $U \to A$.

Figure 1 compares the causal graphs of existing VisDial models and the one obtained by applying the two proposed principles. Although a formal introduction is given in Section 3.2, for now you can simply understand the nodes as data types and the directed links as data flows. For example, $V \to A$ and $Q \to A$ indicate that the visual knowledge $V$, e.g., the encoded feature from a multi-modal encoder, works with the question $Q$ to "dictate" the answer $A$.

P1 suggests that we should remove the direct input of dialog history to the answer model. This principle contradicts most of the prevailing VisDial models [7, 14, 39, 25, 41, 16, 10, 31], which are based on the widely accepted intuition: the more features you input, the more effective the model is. This intuition is mostly correct, but only when it respects the data generation process. In fact, the annotators of the VisDial dataset [7] were not allowed to copy from the previous Q&A, i.e., $H \nrightarrow A$, and were encouraged to ask consecutive questions that include co-referenced pronouns like "it" and "those", i.e., $H \to Q$, and the answer $A$ should be based only on the question $Q$ and the reasoned visual knowledge $V$. Therefore, a good VisDial model is expected to reason over the context $(I, H)$ with $Q$, not to memorize the bias. However, the direct path $H \to A$ will contaminate the expected causality. Figure 2(a) shows a striking bias observed in all baselines without P1: the top answers are those whose lengths are close to the average length of the history answers! We will offer more justifications for P1 in Section 4.1.

(a) A Typical Bias
(b) User Preference
Figure 2: The illustrative motivations of the two causal principles: (a) P1 and (b) P2.

P2 implies that model training based only on the association between the sample $(I, H, Q)$ and $A$ is spurious. By "spurious", we mean that the effect on $A$ caused by $(I, H, Q)$, which is the goal of VisDial, is confounded by an unobserved variable $U$, because it appears in every undesired causal path (a.k.a. backdoor [26]), which is an indirect causal path from the input $(I, H, Q)$ to the output $A$: $Q \leftarrow U \to A$ and $Q \leftarrow H \to U \to A$. We believe that such an unobserved $U$ should be the users, as the VisDial dataset essentially brings humans into the loop. Figure 2(b) illustrates how the user's hidden preference confounds them. Therefore, during training, if we focus only on the conventional likelihood $P(A \mid I, H, Q)$, the model will inevitably be biased towards the spurious causality, e.g., it may score the answer "Yes, he is" higher than "Yes", merely because the users prefer to see a "he" appear in the answer, given the history context of "he". It is worth noting that the confounder $U$ is more impactful in VisDial than in VQA, because the former encourages the user to rank similar answers subjectively while the latter is more objective. A plausible explanation might be: VisDial is interactive in nature and a not quite correct answer is tolerable in one iteration (i.e., dense prediction), while VQA has only one chance, which demands accuracy (i.e., one-hot prediction).

By applying P1 and P2 to the baseline causal graph, we obtain the proposed one (the right one in Figure 1), which serves as a model-agnostic roadmap for the causal inference of VisDial. To remove the spurious effect caused by $U$, we use the do-calculus [26] $P(A \mid do(I, H, Q))$, which is fundamentally different from the conventional likelihood $P(A \mid I, H, Q)$: the former is an active intervention, which cuts off $U \to Q$ and $H \to Q$, and samples (where the name "calculus" is from) every possible $U \mid H$, seeking the true effect on $A$ caused only by $(I, H, Q)$; the latter likelihood is a passive observation that is affected by the existence of $U$. The formal introduction and details will be given in Section 4.3. In particular, given the fact that $U$ is no longer observed once the dataset is ready, we propose a series of effective approximations in Section 5.

We validate the effectiveness of P1 and P2 on the most recent Visual Dialog Challenge 2019 dataset. We show significant performance boosts (absolute NDCG) by applying them to 4 representative baseline models: LF [7] (16.42%), HCIAE [21] (15.01%), CoAtt [39] (15.41%), and RvA [25] (16.14%). Impressively, on the official test-std server, we use an ensemble of the simplest baseline LF [7] to beat the 2019 champion by 0.2%, a more complex ensemble to beat it by 0.9%, and lead all the single-model baselines to state-of-the-art performance.

2 Related Work

Visual Dialog. Visual Dialog [7, 9] is more interactive and challenging than most vision-language tasks, e.g., image captioning [43, 42, 1] and VQA [2, 35, 34]. Specifically, Das et al. [7] collected a large-scale free-form visual dialog dataset VisDial [4]. They applied a novel protocol: during the live chat, the questioner cannot see the picture and asks open-ended questions, while the answerer gives free-form answers. Another dataset, GuessWhat?! [9], is a goal-driven visual dialog: the questioner should locate an unknown object in a rich image scene by asking a sequence of closed-ended "yes/no" questions. We apply the first setting in this paper. Thus, the key difference is that the users played an important role in the data collection process.

All of the existing approaches in the VisDial task are based on the typical encoder-decoder framework [14, 11, 32, 10, 31, 45]. They can be categorized by their usage of history. 1) Holistic: models such as HACAN [41], DAN [16], and CorefNMN [18] treat the history as a whole and feed it into the model. 2) Hierarchical: models such as HRE [7] use a hierarchical structure to deal with history. 3) Recursive: RvA [25] uses a recursive method to process history. However, they all overlook the fact that the history information should not be directly fed to the answer model (i.e., our proposed Principle 1). The baselines we used in this paper are LF [7], the earliest model; HCIAE [21], the first model to use hierarchical history attention; CoAtt [39], the first to use a co-attention mechanism; and RvA [25], the first to use a tree-structured attention mechanism.

Causal Inference. Recently, some works [15, 23, 8, 3] introduced causal inference into machine learning, trying to endow models with the ability of causal reasoning through the learning process. In contrast to them, we use structural graph causality [26], which is a model-agnostic framework that reflects the nature of the data.

3 Visual Dialog in Causal Graph

In this section, we formally introduce the visual dialog task and describe how the popular encoder-decoder framework follows the baseline causal graph shown in Figure 1. More details of causal graph can be found in [26, 27].

3.1 Visual Dialog Settings

Settings. According to the definition of the VisDial task proposed by Das et al. [7], at each time $t$, given an input image $I$, the current question $Q_t$, the dialog history $H = \{C, (Q_1, A_1), \ldots, (Q_{t-1}, A_{t-1})\}$, where $C$ is the image caption and $(Q_i, A_i)$ is the $i$-th round Q&A pair, and a list of 100 candidate answers $A_t = \{A_t^{(1)}, \ldots, A_t^{(100)}\}$, the task of the dialog agent is to generate a free-form answer or to give an answer by ranking the candidate answers $A_t$.

Evaluation. Recently, the ranking metric Normalized Discounted Cumulative Gain (NDCG) has been adopted by the VisDial community. It differs from the classification metric (e.g., top-1 accuracy) used in VQA and is more compatible with the human-rated relevance scores of the answer candidates in VisDial. NDCG requires ranking relevant candidates in higher places, rather than just selecting the ground-truth answer. More details of NDCG can be found in [38].
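To make the metric concrete, the following is a minimal sketch of NDCG for a single question. It assumes the commonly used convention that the cutoff `k` equals the number of candidates with non-zero relevance; the function name `ndcg` and its arguments are illustrative and not taken from the official evaluation code in [38].

```python
import math


def ndcg(scores, relevance):
    """NDCG for one question.

    scores:    predicted score per candidate (higher means ranked earlier)
    relevance: human relevance score per candidate, assumed to lie in [0, 1]
    """
    # Assumed convention: cut the ranking at k = number of relevant candidates.
    k = sum(1 for r in relevance if r > 0)
    if k == 0:
        return 0.0

    # Candidate indices ordered by predicted score, best first.
    predicted = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Candidate indices ordered by true relevance, best first (ideal ranking).
    ideal = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)

    def dcg(order):
        # Position 0 gets discount log2(2) = 1, position 1 gets log2(3), etc.
        return sum(relevance[i] / math.log2(pos + 2)
                   for pos, i in enumerate(order[:k]))

    return dcg(predicted) / dcg(ideal)
```

Under this definition, a model that puts the single ground-truth answer first but scatters the other relevant candidates still loses NDCG, which is why the metric rewards ranking all semantically suitable answers highly.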

3.2 Encoder-Decoder as Causal Graph

We first give the definition of causal graph, then revisit the encoder-decoder framework in existing methods using the elements from the baseline graph in Figure 1.

Causal Graph. A causal graph [26], as shown in Figure 1, describes how variables interact with each other, expressed by a directed acyclic graph $\mathcal{G} = \{\mathcal{N}, \mathcal{E}\}$ consisting of nodes $\mathcal{N}$ and directed edges $\mathcal{E}$ (i.e., arrows). $\mathcal{N}$ denotes variables, and $\mathcal{E}$ (arrows) denotes the causal relationships between two nodes, i.e., $X \to Y$ denotes that $X$ is the cause and $Y$ is the effect, meaning the outcome of $Y$ is caused by $X$. A causal graph is a highly general roadmap specifying the causal dependencies among variables.

As we will discuss in the following part, all of the existing methods can be revisited in the view of the baseline graph shown in Figure 1.
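As an illustration, the baseline graph and the effect of the two principles can be written down as plain edge sets. This is a minimal sketch: the edge list is our reading of Figure 1 and of the sub-graphs named in this section (an assumption, since the figure itself is not reproduced here), and the helper names are hypothetical.

```python
# Each edge is (cause, effect). Node names follow the Figure 1 caption:
# H history, I image, Q question, V visual knowledge, A answer, U user preference.
BASELINE = {
    ("I", "V"), ("H", "V"), ("Q", "V"),   # attention sub-graph producing V
    ("H", "Q"),                           # history helps resolve co-references in Q
    ("V", "A"), ("Q", "A"), ("H", "A"),   # decoder inputs in the baseline
}


def apply_p1(edges):
    """P1: delete the direct link from history to answer."""
    return edges - {("H", "A")}


def apply_p2(edges):
    """P2: add the unobserved node U and its three links."""
    return edges | {("H", "U"), ("U", "Q"), ("U", "A")}


def parents(edges, node):
    """Direct causes of `node`, sorted for readability."""
    return sorted(cause for cause, effect in edges if effect == node)


def is_acyclic(edges):
    """Kahn's algorithm: a causal graph must be a DAG."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    for _, effect in edges:
        indegree[effect] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for cause, effect in edges:
            if cause == node:
                indegree[effect] -= 1
                if indegree[effect] == 0:
                    ready.append(effect)
    return visited == len(nodes)


OURS = apply_p2(apply_p1(BASELINE))
```

With these definitions, `parents(BASELINE, "A")` lists history among the direct causes of the answer, while `parents(OURS, "A")` replaces it with the user preference node.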

Feature Representation and Attention in Encoder. The visual feature is denoted as node $I$ in the baseline graph; it is usually a fixed feature extracted by Faster-RCNN [30] based on a ResNet backbone [12] pre-trained on Visual Genome [19]. For the language feature, the encoder first embeds each sentence into word vectors and then passes them through an RNN [13, 6] to generate the features of the question and the history, which are denoted as $\{Q, H\}$.

Most existing methods apply an attention mechanism [40] in the encoder-decoder to explore the latent weights for a set of features. A basic attention operation can be represented as $\hat{x} = \mathrm{Att}(\mathcal{X}, \mathcal{K})$, where $\mathcal{X}$ is the set of features to attend to, $\mathcal{K}$ is the key (i.e., guidance), and $\hat{x}$ is the attended feature of $\mathcal{X}$. Details can be found in most visual dialog methods [21, 39, 41]. In the baseline graph, the sub-graph $\{I \to V, Q \to V, H \to V\}$ denotes a series of attention operations for the visual knowledge $V$. Note that these arrows are not necessarily independent, as in co-attention [39], and the process can be written as Input: $\{I, Q, H\}$ $\Rightarrow$ Output: $\{V\}$, where intermediate variables can be yielded in the graph depending on the attention strategy, such as co-attention [39] and recursive attention [25]. However, without loss of generality, these variables do not affect the causalities in the graph.
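The basic attention operation described above can be sketched as follows. This is only one possible instantiation: it assumes scaled dot-product scoring between each feature and the key, which is our choice for illustration and not necessarily the scoring function used by the cited models.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, key):
    """Attend over `features` (list of d-dim vectors) guided by `key` (d-dim).

    Returns the attended feature (a weighted sum of the input features)
    together with the attention weights.
    """
    d = len(key)
    # Assumed scoring function: scaled dot product between feature and key.
    scores = [sum(f * k for f, k in zip(feat, key)) / math.sqrt(d)
              for feat in features]
    weights = softmax(scores)
    attended = [sum(w * feat[j] for w, feat in zip(weights, features))
                for j in range(d)]
    return attended, weights
```

In the notation of the graph, calling `attend` with image region features and a question vector as the key corresponds to one arrow into the visual knowledge node; stacking several such calls with different keys gives the co-attention and recursive variants.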

Response Generation in Decoder. After obtaining the features from the encoder, existing methods fuse them and feed the fused features into a decoder to generate an answer. In the baseline graph, node $A$ denotes the answer: the decoder takes the features via $\{V \to A, Q \to A, H \to A\}$ and transforms them into a vector for decoding the answer. In particular, the decoder can be generative, i.e., it generates an answer sentence with an RNN, or discriminative, i.e., it selects an answer by discriminating among the answer candidates.

Next, we will advance to the middle part of Figure 1, to reveal what is wrong with the baseline graph.

4 Two Causal Principles

4.1 Principle 1

When should we draw an arrow from one node pointing to another? According to the definition in Section 3.2, the criterion is that the first node is the cause and the other one is the effect. To build intuition, let's understand P1 by discussing the rationale behind the "double-blind" review policy. Consider three variables: "Well-known Researcher" ($R$), "High-quality Paper" ($P$), and "Accept" ($A$). From our community's common sense, we know that $R \to P$ because top researchers usually lead high-quality research, and $P \to A$ is even more obvious. Therefore, for the good of the community, double-blind review prohibits the direct link $R \to A$ through author anonymity; otherwise, bias such as personal emotions and politics from $R$ may affect the outcome of $A$.

The story is similar in VisDial. Without loss of generality, we only analyze the path $H \to Q \to A$. If we inspect the role of $H$, we find that it helps $Q$ resolve co-references like "it" and "their". As a result, $Q$ listens to $H$. Then, we use $Q$ to obtain $A$. Here, $Q$ becomes a mediator that cuts off the direct association between $H$ and $A$, so that $P(A \mid Q, H) = P(A \mid Q)$, like the "High-quality Paper" in the previous story. However, if we set an arrow from $H$ to $A$, i.e., $H \to A$, the undesirable bias of $H$ will be learned for the prediction of $A$, which hampers the natural process of VisDial; an example is the bias illustrated in Figure 2(a). Another example is discussed in Figure 4: if we add the direct link $H \to A$, $A$ prefers to match the words in $H$ even though they are literally nonsense with respect to $Q$. After we apply P1, these phenomena are relieved, as shown by the blue line in Figure 2(a), which is closer to the average answer length of the NDCG ground truth (i.e., candidates with non-zero relevance score), represented by the green dashed line, and by the other qualitative studies in Section 6.4.

4.2 Principle 2

Before discussing P2, we first introduce an important concept in causal inference [26]. In a causal graph, the fork-like pattern in Figure 3(a) contains a confounder $U$, which is the common cause of $Q$ and $A$ (i.e., $Q \leftarrow U \to A$). The confounder $U$ opens a non-causal path starting from $Q$, which is also called the backdoor, making $Q$ and $A$ spuriously correlated even if there is no direct causality between them.

In the data generation process of VisDial, we know that not only can both the questioner and the answerer see the dialog history, which offers them a latent topic, but the answer annotators can also look at the history when annotating the answer. Their preference $U$ can be understood as part of human nature or as subtleties conditional on a dialog context, and thus it has a causal effect on both $Q$ and $A$. Moreover, because the preference is nuanced and uncontrollable, we consider it an unobserved confounder for $Q$ and $A$.

It is worth noting that the confounder prevents us from finding the true causal effect. Let's take the graph in Figure 3(a) as an example: if there is no $U$, the probability $P(A \mid Q)$ is the causal effect that we want to pursue. However, due to the existence of $U$, $P(A \mid Q)$ is no longer the true causality from $Q$ to $A$. When we calculate $P(A \mid Q)$, we take $U$ into account, which can be shown by Bayes rule:

$$P(A \mid Q) = \sum_{u} P(A \mid Q, u)\, P(u \mid Q). \tag{1}$$

The distribution of $u$ is conditional on $Q$ (i.e., $P(u \mid Q)$). That means that when using the conditional weight (i.e., $P(u \mid Q)$) to sum every effect (i.e., $P(A \mid Q, u)$), the likelihood sum (i.e., $P(A \mid Q)$) will be biased towards the effect with larger weights. For better understanding, if we treat Eq (1) as a process of data stratification, at each layer $u$ we can obtain the causality $P(A \mid Q, u)$ conditional on $u$, because given $u$ the backdoor of $Q$ is blocked. Then, we have to sum these causalities by the natural distribution $P(u)$ rather than the conditional distribution $P(u \mid Q)$, which would remix the data bias. In a nutshell, we cannot calculate the causality from $Q$ to $A$ by $P(A \mid Q)$ under the confounder $U$. To resolve this problem (i.e., de-confounding to find the causal effect), we need more powerful tools.

(a) Confounder
(b) do-operator
(c) Question Type
(d) Score Sampling
(e) Hidden Dictionary
Figure 3: Example of confounder, do-operator and sketch causal graphs of our three attempts to de-confounder

4.3 Overall Causal Graph

Here, we first introduce two additional tools, the do-operator and the do-calculus [26, 27], which can help us to de-confound.

do-operator. The do-operator is a type of intervention used to de-confound. As illustrated in Figure 3(b), the do-operator (e.g., $do(Q = q)$) means that we set a value $q$ to the variable $Q$, i.e., $Q$ is caused by itself rather than by its parent nodes. Therefore, $do(Q = q)$ cuts off all the original arrows that come into $Q$ (i.e., $U \to Q$) because its parents do not cause it anymore. This operation prevents any information about $Q$ from flowing in the non-causal direction (i.e., the backdoor $Q \leftarrow U \to A$). As a result, the confounding by $U$ is relieved and the causal effect of $Q \to A$ can be estimated. In the following parts, we use $do(Q)$ to represent $do(Q = q)$ for concision.

do-calculus. However, it is hard to take a real intervention on a fixed dataset. We need to use some rules to translate $P(A \mid do(Q))$ into an expression that has no do-operator and can be calculated from conditional probabilities. The rules of do-calculus are given in [26, 27]; here we just introduce the most important one: if a set of variables $Z$ blocks all backdoor paths from $X$ to $Y$, then conditional on $Z$, $do(X)$ is equivalent to observing $X$: $P(Y \mid do(X = x), Z = z) = P(Y \mid X = x, Z = z)$, where capital letters denote variables and lowercase letters denote values. Other rules will be given in supplementary materials.

After obtaining the tools, we can revisit the example in Section 4.2. If we calculate $P(A \mid do(Q))$ rather than $P(A \mid Q)$, the result will be $\sum_{u} P(A \mid Q, u)\, P(u)$. In this formula, the distribution of $u$ is the natural prior $P(u)$ instead of the conditional distribution $P(u \mid Q)$. Therefore, the summation of the causal effects by weight (i.e., $P(u)$) will not remix the data bias. In other words, $P(A \mid do(Q))$ is the ideal causality from $Q$ to $A$.
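The difference between the observational likelihood and the interventional quantity can be checked on a toy version of the fork in Figure 3(a). The numbers below are invented purely for illustration (they do not come from VisDial): $U$, $Q$, and $A$ are binary, $U$ influences both $Q$ and $A$, and $Q$ has no causal effect on $A$ at all.

```python
# Hypothetical toy distribution, all variables binary.
P_U = {0: 0.5, 1: 0.5}                      # natural prior P(u)
P_Q1_GIVEN_U = {0: 0.2, 1: 0.8}             # P(Q=1 | u): U influences Q
# P(A=1 | Q=q, u): depends on u only, so Q has NO causal effect on A.
P_A1_GIVEN_QU = {(0, 0): 0.3, (0, 1): 0.7,
                 (1, 0): 0.3, (1, 1): 0.7}


def p_q_given_u(q, u):
    return P_Q1_GIVEN_U[u] if q == 1 else 1.0 - P_Q1_GIVEN_U[u]


def p_u_given_q(u, q):
    """Bayes rule: P(u | Q=q) is proportional to P(Q=q | u) P(u)."""
    joint = {v: p_q_given_u(q, v) * P_U[v] for v in P_U}
    return joint[u] / sum(joint.values())


def observational(q):
    """P(A=1 | Q=q): effects summed with the conditional weights P(u | Q=q)."""
    return sum(P_A1_GIVEN_QU[(q, u)] * p_u_given_q(u, q) for u in P_U)


def interventional(q):
    """P(A=1 | do(Q=q)): effects summed with the natural prior P(u)."""
    return sum(P_A1_GIVEN_QU[(q, u)] * P_U[u] for u in P_U)
```

Here `observational(1)` is 0.62 and `observational(0)` is 0.38, which wrongly suggests that the question changes the answer, while `interventional(1)` and `interventional(0)` are both 0.5, correctly reporting that the apparent effect is entirely due to the confounder.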

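In our graph of VisDial shown in Figure 1, we can also de-confound by the intervention $do(Q, H, I)$ to find the causal effect from $(Q, H, I)$ to $A$, and then apply the do-calculus rules to transform the pretended intervention into a probability formula:

$$P(A \mid do(Q, H, I)) = \sum_{u} P(A \mid do(Q, H, I), u)\, P(u \mid do(Q, H, I)) = \sum_{u} P(A \mid do(Q), H, I, u)\, P(u \mid H) = \sum_{u} P(A \mid Q, H, I, u)\, P(u \mid H). \tag{2}$$

The last transformation uses the rule we introduced in do-calculus, because the backdoors of $Q$ are blocked by controlling $U$ and $H$. The remaining derivations and the details of the other rules can be found in supplementary materials. As we mentioned, $P(A \mid do(Q, H, I))$ is the real causal effect that we want.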

So far, we have introduced the baseline causal graph, the two principles, and our causal graph. In the next section, we will try to calculate the real causal effect and present some attempts to realize our causal graph, in the hope of shedding light on the future of visual dialog.

5 Improved Visual Dialog Models

It is easy to implement P1, and we will give some examples as training details in Section 6.3. As for P2, we can obtain the causal effect estimation by Eq (2), which can be written as:

$$P(A \mid do(Q, H, I)) = \sum_{u} P_u(A)\, P(u \mid H), \tag{3}$$

where $P_u(A) := P(A \mid Q, H, I, u)$ represents the probability of $A$ under the conditions $(Q, H, I)$ and $u$. Since the variable $U$ is unobserved, we give some example attempts to replace or approximate it, together with the corresponding sketch graphs to aid understanding.

5.1 Question Type

Inspired by the data stratification form in Eq (3), we try to use the question type to stratify the data. Specifically, we manually define some question types, count the answers that appear, and set a preference for every answer in each type of question. According to Eq (3), we can use the preference generated by the question type to train our model with a loss function defined over the candidates in the answer list, which uses the predicted probability of each candidate and the preference we counted for it. The sketch graph is shown in Figure 3(c). The implementation details will be given in Section 6.3.
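The counting step can be sketched as follows. The paper defines its question types manually; since that list is not given here, the sketch substitutes a hypothetical rule (the first two lowercase words of the question) purely for illustration. The threshold of more than 5 occurrences follows the implementation details in Section 6.3.

```python
from collections import Counter, defaultdict


def question_type(question, n_words=2):
    """Hypothetical typing rule: the first `n_words` lowercase tokens.

    This stands in for the manually defined question types of the paper.
    """
    return " ".join(question.lower().split()[:n_words])


def preferred_answers(qa_pairs, min_count=5):
    """Map each question type to the set of answers seen more than `min_count` times."""
    counts = defaultdict(Counter)
    for question, answer in qa_pairs:
        counts[question_type(question)][answer.lower()] += 1
    return {
        qtype: {answer for answer, c in counter.items() if c > min_count}
        for qtype, counter in counts.items()
    }
```

The resulting sets can then serve as per-type preference targets: for a question of a given type, every candidate in its preferred set is treated as a preferred answer rather than only the single ground truth.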

5.2 Answer Score Sampling

The official organizers provide a set of dense annotations in the training set, which can be treated as a representation of preference because the annotators score every candidate in the context according to their preference. As a result, if we regard each candidate in the decoder as a value of $u$, as illustrated in Figure 3(d), we can follow Eq (3) to calculate the loss by the following function:


where $i$ is the index of the answer candidate. Eq (5) can be implemented in different forms. Here we give three examples (detailed formulas are in supplementary materials):

Weighted Softmax Loss. We extend the log-softmax loss to a weighted form, in which the log-softmax term computed from the logit of each candidate is weighted by the corresponding relevance score.

Binary Sigmoid Loss. This loss is close to the binary cross-entropy loss, and it is also weighted by the corresponding relevance score.

Generalized Ranking Loss. Note that the answer generation process can be viewed as a ranking problem. Therefore, we derive a ranking loss based on a ranking probability, computed for each candidate against the group of candidates that have a lower relevance score than it, where candidates with no relevance score and candidates with a positive relevance score are treated separately. This loss function is reorganized from ListNet [5] to become more suitable for this task.
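The exact formulas of the three losses are in the supplementary materials and are not reproduced here. As a rough illustration only, the sketch below shows two related, well-known forms: a relevance-weighted log-softmax loss, which is our plausible reading of the first item, and the original ListNet top-one loss [5], which is the starting point of the third item and not the paper's reorganized version. All function names are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def weighted_softmax_loss(logits, relevance):
    """Assumed form: negative log-softmax of each candidate, weighted by its relevance."""
    return -sum(r * lp for r, lp in zip(relevance, log_softmax(logits)))


def listnet_top1_loss(logits, relevance):
    """Original ListNet top-one loss: cross-entropy between the softmax of the
    relevance scores and the softmax of the predicted logits."""
    target = [math.exp(lp) for lp in log_softmax(relevance)]
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))
```

Both forms reduce to the usual softmax cross-entropy behaviour when exactly one candidate is relevant, and both reward putting every relevant candidate near the top rather than only the ground-truth answer.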

Note that our loss functions are derived from the Eq (3), not just the regression of dense annotation. The comparison experiments will be given in Section 6.4.

5.3 Hidden Dictionary Learning

We find that Eq (3) can be written as an expectation:

$$P(A \mid do(Q, H, I)) = \mathbb{E}_{[u \mid H]}\left[P_u(A)\right], \tag{6}$$

where $P_u(A)$ denotes $P(A \mid Q, H, I, u)$. Although we cannot determine the exact meaning of $u$, we try to use a vector representation to approximate it. Following [40, 33], we can approximate the expectation in Eq (6) by the normalized weighted geometric mean, and this term can be further calculated by creating a dictionary of $u$. In this computation, a fully connected layer is used, and the variable that stands in for $u$ takes its value from the dictionary. The details and proofs of the series of approximations can be found in supplementary materials. After deriving the last term, we can use the dictionary to approximate Eq (3), as shown in Figure 3(e). Note that although we still need answer score sampling when training the dictionary, hidden dictionary learning is a more proper way to approximate the unobserved confounder, because it explores the whole space of $u$ rather than only some samples of $u$ as in the second attempt.
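The normalized weighted geometric mean (NWGM) approximation mentioned above rests on a simple identity for softmax outputs: the NWGM of several softmax distributions equals the softmax of the weighted average of their logits. The sketch below checks this numerically on made-up logits; it illustrates the approximation itself, not the paper's dictionary architecture, and all names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def nwgm(distributions, weights):
    """Normalized weighted geometric mean of several distributions.

    `weights` are assumed to be non-negative and to sum to 1.
    """
    n = len(distributions[0])
    # Weighted geometric mean in log space, then renormalized by the softmax.
    log_geo = [sum(w * math.log(dist[i]) for w, dist in zip(weights, distributions))
               for i in range(n)]
    return softmax(log_geo)


def softmax_of_expected_logits(logit_rows, weights):
    """Softmax applied once to the weighted average of the logits."""
    n = len(logit_rows[0])
    expected = [sum(w * row[i] for w, row in zip(weights, logit_rows))
                for i in range(n)]
    return softmax(expected)


def arithmetic_mean(distributions, weights):
    """The exact expectation that the NWGM approximates."""
    n = len(distributions[0])
    return [sum(w * dist[i] for w, dist in zip(weights, distributions))
            for i in range(n)]
```

In other words, instead of running the decoder once per value of $u$ and averaging the output distributions, one forward pass on the expected representation gives the NWGM, which is close to but not identical to the exact expectation computed by `arithmetic_mean`.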

6 Experiments

6.1 Experimental Setup

Dataset. Our principles are evaluated on the recently released real-world dataset VisDial v1.0 (as suggested by the official organizers [38], results should be reported on v1.0 instead of v0.9). Specifically, the training set of VisDial v1.0 contains 123K images from the COCO dataset [20] with 10 rounds of dialog for each image, resulting in about 1.2M dialog pairs in total. The validation and test sets were collected from Flickr, with 2K and 8K COCO-like images respectively. The test set is further split into test-std and test-challenge splits, each with 4K images, which are hosted on the blind online evaluation server. Each image in the training and validation sets has a 10-round dialog, while in the test set the number of dialog rounds is flexible. Every dialog round in the VisDial dataset comes with 100 answer candidates. We evaluated our results on the validation and test-std sets.

Metric. We used Normalized Discounted Cumulative Gain (NDCG) to evaluate our models. As introduced in Section 3.1, NDCG has been adopted as the new metric for visual dialog; it is designated by the official organizers and accepted by the community. Note that the 2018 and 2019 Visual Dialog Challenge winners were both picked by NDCG.

6.2 Model Zoo

We report the performance of the following baseline VisDial models, including LF [7], HCIAE [21], CoAtt [39] and RvA [25]:

LF [7]. This naive base model has no attention modules. We expand the model by adding some very basic attention operations to the naive baseline model, including question-based history attention and question-history-based visual attention refinement.

HCIAE [21]. The model consists of question-based history attention and question-history-based visual attention.

CoAtt [39]. The model consists of question-based visual attention, image-question-based history attention, image-history-based question attention, and the final question-history-based visual attention.

RvA [25]. The model consists of question-based visual attention and history-based visual attention refinement.

6.3 Implementation Details

Pre-processing. For language pre-processing, we followed the process introduced by [7]. First, we lowercased all the letters in the sentences, converted digits to words, and removed contractions. After that, we used the Python NLTK toolkit to tokenize sentences into word lists, followed by padding or truncating captions, questions, and answers to lengths of 40, 20, and 20, respectively. We then built a vocabulary of 11,322 tokens, including 11,319 words that occur at least 5 times in train v1.0 and 3 instruction tokens. We loaded the pre-trained word embeddings from GloVe [29] to initialize all word embeddings, which were shared between the encoder and the decoder, and we applied 2-layer LSTMs to encode the word embeddings, with the hidden state dimension set to 512. For the visual feature, we used the bottom-up-attention features [1] provided by the official organizers [38].

Implementation of Principles. For P1, we eliminated the history feature from the final fused vector representation for all models, while keeping the other parts unchanged. For HCIAE [21] and CoAtt [39], we also blocked the history guidance to the image. For P2, we trained our models using the preference score, which was either counted from the question type or given by the official organizers (i.e., the dense annotations in train v1.0). Specifically, for "question type", we first defined 55 types and marked answers that occurred more than 5 times as preferred answers, then used the preference to train our model. "Answer score sampling" was directly used to train our pre-trained model with the proposed loss function. For "dictionary", we set up a memory of dimension 100×512 to realize the dictionary, then trained it on the dense annotations. More details can be found in supplementary materials. Note that other implementations following P1 and P2 are also acceptable.

Training. We used the softmax cross-entropy loss to train the model with P1, and used Adam [17] with a learning rate that decayed at epochs 5, 7, and 9 with a decay rate of 0.4. We trained the model for 15 epochs in total. Dropout [33] was also applied, with a ratio of 0.4 for the RNN and 0.25 for the fully connected layers. Other settings were left at their defaults.
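The step decay described above can be sketched as a small function. The base learning rate is left as a parameter because its value is not stated in this text, and the convention that the decay takes effect from the milestone epoch onward is an assumption.

```python
def learning_rate(epoch, base_lr, milestones=(5, 7, 9), decay=0.4):
    """Step-decayed learning rate.

    `base_lr` is a required argument because the initial value is not given here.
    Assumption: the rate is multiplied by `decay` once for every milestone
    that the current epoch has reached.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** passed
```

With this convention the rate stays at `base_lr` through epoch 4 and is multiplied by 0.4, 0.16, and 0.064 from epochs 5, 7, and 9 onward, respectively, for the remainder of the 15 training epochs.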

Model baseline QT S(regressive) S(weighted softmax) S(binary sigmoid) S(generalized ranking) D
LF [7] 57.21 58.97 67.82 71.27 72.04 72.36 72.65
LF +P1 61.88 62.87 69.47 72.16 72.85 73.42 73.63
Table 1: Performance (NDCG%) comparison for the experiments of applying our principles on the validation set of VisDial v1.0. LF is the enhanced version described in Section 6.2. QT, S, and D denote question type, answer score sampling, and hidden dictionary learning, respectively. The four S columns use the regressive loss, weighted softmax loss, binary sigmoid loss, and generalized ranking loss, respectively.

6.4 Quantitative Results

Table 1 shows the results of the different implementations of P2, i.e., question type, answer score sampling, and hidden dictionary learning. Overall, all of the implementations improve the performance of the base models. Specifically, on top of P1, the attempts of P2 further boost performance by up to 11.75% (from 61.88% to 73.63%), achieved by hidden dictionary learning. In more detail, our loss functions designed from Eq. (3) outperform the regressive loss, which is a Euclidean distance loss, and we also find that our proposed generalized ranking loss is the best because it satisfies the ranking property of VisDial.

To justify that our principles are model-agnostic, Table 2 shows the results of applying our principles to four different models (i.e., LF [7], HCIAE [21], CoAtt [39], and RvA [25]). In general, both of our principles improve all the models under every ablative condition. We also find that the effects of P1 and P2 are additive, that is to say, their combination performs the best. Note that the enhanced LF model is very simple and has no complex attention strategies. However, this simple architecture does not prevent it from achieving the best performance.

6.5 Qualitative Analysis

The qualitative results illustrated in Figure 4 and Figure 5 show the following advantages of our principles.

Model LF [7] HCIAE [21] CoAtt [39] RvA [25]
baseline 57.21 56.98 56.46 56.74
+P1 61.88 60.12 60.27 61.02
+P2 72.65 71.50 71.41 71.44
+P1+P2 73.63 71.99 71.87 72.88
Table 2: Performance(NDCG%) of ablative studies on different models on VisDial v1.0 validation set. P2 indicates the most effective one (i.e., hidden dictionary learning) shown in Table 1. Note that only applying P2 is implemented by the attempts in Section 5 with the history shortcut.
Figure 4: Qualitative results of the baseline and the baseline with P1 on the validation set of VisDial v1.0. The numbers in brackets in the ranked lists denote relevance scores. Red boxes denote that the baseline model copies words from the dialog history, even though they are literally nonsense for answering the current question. The bottom example shows that although the baseline can correctly select the ground truth answer, it is influenced by the unreasonable history shortcut, and thus it ranks "yes" at a high place, which degrades its performance (NDCG). The baseline with P1 does not make such unreasonable choices.
Figure 5: Qualitative examples of the ranked candidates of baseline and baseline with P2. We also give some key rank changes for boosting NDCG performance by implementing P2. These examples are taken from the validation set of VisDial v1.0.

History Bias Elimination. After applying P1, many harmful patterns learned from history are relieved, especially the answer-length bias shown in Figure 2(a) and the word-match bias shown in Figure 4. After applying P1, the average length of the top-1 answers (i.e., the blue line in Figure 2(a)) is no longer related to the average length of the history answers, and becomes closer to the average length of the NDCG ground truth answers (i.e., the green dashed line). As for the word-match bias in Figure 4, we can observe that the word "eyes" from the history is literally unrelated to the current question. But at the top of the ranked answer list of the baseline model, the word "eyes" can be found in some undesirable candidates (i.e., those with low relevance scores). In general, due to the wrong direct path from history to answer, the baseline model prefers to match words in the history and ranks the matched candidates in high places. If we count how often meaningful words (e.g., the word "eyes") are matched in the top-10 candidates of the ranked lists on the validation set, for the baseline with P1 and for the baseline, we find that P1 decreases word matching from history by about 10%.

The bottom example in Figure 4 illustrates another type of word matching. In the ranked list of the baseline model, “yes” is ranked very high, and “yes” appears many times in the history. By analyzing the results on the validation set, we found that if “yes” or “no” appears in the dialog history, the baseline model ranks these two answers higher than average because of word matching. After applying P1, this phenomenon no longer occurs. More details of these biases can be found in the supplementary materials.

Model NDCG (%)
Ours
P1+P2 (More Ensemble) 74.91
LF+P1+P2 (Ensemble) 74.19
LF+P1+P2 (single) 71.60
RvA+P1+P2 (single) 71.28
CoAtt+P1+P2 (single) 69.81
HCIAE+P1+P2 (single) 69.66
Leaderboard
MReaL-BDAI 74.02
ReDAN+ (Ensemble) [10] 64.47
square 60.16
VIC-SNU [16] 57.59
UET-VNU 57.40
idansc [31] 57.13
Table 3: Our results and comparisons to the recent 2019 2nd Visual Dialog Challenge Leaderboard results on the test-std set of VisDial v1.0. Results are reported by the test server, () is taken from [37].

More Reasonable Ranking. Figure 5 shows that the baseline model focuses only on the ground-truth answer, such as “no” or “yes”, and does not care about the rank of other answers with similar semantics, such as “nope” or “yes, he is”. This does not conform to human intuition, because candidates with similar semantics are still correct answers. It also causes the baseline model to perform badly under the NDCG metric. In contrast, in the bottom example the model with P2 ranks almost all the suitable answers, such as “yes, he is”, “yes he is” and “I think so”, at the top together with the ground-truth answer “yes”, which greatly improves the NDCG performance.
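To make the link between ranking and NDCG concrete, here is a minimal sketch of an NDCG computation. It assumes the convention commonly used for VisDial, where the cutoff k is the number of candidates with non-zero relevance and the gain is the raw relevance score; the function names and the relevance values below are hypothetical, not taken from the dataset.

```python
import math
from typing import Dict, List


def dcg(relevances: List[float]) -> float:
    """Discounted cumulative gain with a log2(position + 1) discount."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(ranked_candidates: List[str], relevance: Dict[str, float]) -> float:
    """NDCG@k, with k = number of candidates whose relevance is non-zero."""
    k = sum(1 for r in relevance.values() if r > 0)
    if k == 0:
        return 0.0
    gains = [relevance.get(c, 0.0) for c in ranked_candidates[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal)


# Hypothetical dense relevance annotations for one question.
relevance = {"yes": 1.0, "yes, he is": 1.0, "i think so": 0.5,
             "no": 0.0, "two": 0.0}

# Only the ground truth is ranked well; semantically similar answers are buried.
baseline_like = ["yes", "no", "two", "yes, he is", "i think so"]
# All suitable answers are ranked together at the top.
p2_like = ["yes", "yes, he is", "i think so", "no", "two"]

baseline_score = ndcg(baseline_like, relevance)
p2_score = ndcg(p2_like, relevance)
```

In this toy case both rankings place the ground truth “yes” first, yet only the second one reaches an NDCG of 1, because NDCG also rewards the positions of the other relevant candidates.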

6.6 Visual Dialog Challenge 2019

Finally, we used the blind online test server to verify the effectiveness of our principles on the test-std split of VisDial v1.0. As shown in Table 3, the top part contains the results of the baseline models with our principles applied, where P2 denotes the most effective implementation (i.e., hidden dictionary learning). The bottom part is the 2019 Visual Dialog Challenge leader-board [37]. With an ensemble of the enhanced LF [7], we surpass our own best entry in the 2019 Visual Dialog Challenge, which also used other implementations of P1 and P2. Promisingly, by applying our principles, we can promote all the baseline single models to the top ranks on the leader-board.

7 Conclusions

In this paper, we proposed two causal principles for improving the VisDial task. They are model-agnostic, and thus can be applied to almost all existing methods and bring major improvements. The principles are drawn from our in-depth causal analysis of the nature of VisDial, which has unfortunately been overlooked by our community. As technical contributions, we offered some implementation examples of how to apply the principles to baseline models. We conducted extensive experiments on the official VisDial dataset and the online evaluation servers, and the promising results demonstrate the effectiveness of the two principles. Moving forward, we will stick to our causal thinking to discover other potential causalities hidden in embodied Q&A and conversational visual dialog tasks.


  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086. Cited by: §2, §6.3.
  • [2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) Vqa: visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433. Cited by: §1, §2.
  • [3] Y. Bengio, T. Deleu, N. Rahaman, R. Ke, S. Lachapelle, O. Bilaniuk, A. Goyal, and C. Pal (2019) A meta-transfer objective for learning to disentangle causal mechanisms. arXiv preprint arXiv:1901.10912. Cited by: §2.
  • [4] M. Buhrmester, T. Kwang, and S. D. Gosling (2011) Amazon’s mechanical turk: a new source of inexpensive, yet high-quality, data?. Perspectives on Psychological Science 6 (1), pp. 3–5. Cited by: §2.
  • [5] Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li (2007) Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, pp. 129–136. Cited by: §5.2.
  • [6] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259. Cited by: §3.2.
  • [7] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra (2017) Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 326–335. Cited by: §1, §1, §1, §1, §2, §2, §3.1, §6.2, §6.2, §6.3, §6.4, §6.6, Table 1, Table 2.
  • [8] I. Dasgupta, J. Wang, S. Chiappa, J. Mitrovic, P. Ortega, D. Raposo, E. Hughes, P. Battaglia, M. Botvinick, and Z. Kurth-Nelson (2019) Causal reasoning from meta-reinforcement learning. arXiv preprint arXiv:1901.08162. Cited by: §2.
  • [9] H. De Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle, and A. Courville (2017) Guesswhat?! visual object discovery through multi-modal dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5503–5512. Cited by: §2.
  • [10] Z. Gan, Y. Cheng, A. E. Kholy, L. Li, J. Liu, and J. Gao (2019) Multi-step reasoning via recurrent dual attention for visual dialog. arXiv preprint arXiv:1902.00579. Cited by: §1, §1, §2, Table 3.
  • [11] D. Guo, C. Xu, and D. Tao (2019) Image-question-answer synergistic network for visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10434–10443. Cited by: §2.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.2.
  • [13] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.2.
  • [14] U. Jain, S. Lazebnik, and A. G. Schwing (2018) Two can play this game: visual dialog with discriminative question generation and answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5754–5763. Cited by: §1, §2.
  • [15] D. Kalainathan, O. Goudet, I. Guyon, D. Lopez-Paz, and M. Sebag (2018) Sam: structural agnostic model, causal discovery and penalized adversarial learning. arXiv preprint arXiv:1803.04929. Cited by: §2.
  • [16] G. Kang, J. Lim, and B. Zhang (2019) Dual attention networks for visual reference resolution in visual dialog. arXiv preprint arXiv:1902.09368. Cited by: §1, §2, Table 3.
  • [17] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.3.
  • [18] S. Kottur, J. M. Moura, D. Parikh, D. Batra, and M. Rohrbach (2018) Visual coreference resolution in visual dialog using neural module networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 153–169. Cited by: §2.
  • [19] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: §3.2.
  • [20] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §6.1.
  • [21] J. Lu, A. Kannan, J. Yang, D. Parikh, and D. Batra (2017) Best of both worlds: transferring knowledge from discriminative learning to a generative visual dialog model. In Advances in Neural Information Processing Systems, pp. 314–324. Cited by: §1, §1, §2, §3.2, §6.2, §6.2, §6.3, §6.4, Table 2.
  • [22] J. Lu, J. Yang, D. Batra, and D. Parikh (2016) Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pp. 289–297. Cited by: §1, §1.
  • [23] S. Nair, Y. Zhu, S. Savarese, and L. Fei-Fei (2019) Causal induction from visual observations for goal directed tasks. arXiv preprint arXiv:1910.01751. Cited by: §2.
  • [24] H. Nam, J. Ha, and J. Kim (2017) Dual attention networks for multimodal reasoning and matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 299–307. Cited by: §1.
  • [25] Y. Niu, H. Zhang, M. Zhang, J. Zhang, Z. Lu, and J. Wen (2019) Recursive visual attention in visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6679–6688. Cited by: §1, §1, §1, §2, §3.2, §6.2, §6.2, §6.4, Table 2.
  • [26] J. Pearl, M. Glymour, and N. P. Jewell (2016) Causal inference in statistics: a primer. John Wiley & Sons. Cited by: §1, §1, §1, §2, §3.2, §3, §4.2, §4.3, §4.3.
  • [27] J. Pearl and D. Mackenzie (2018) The book of why: the new science of cause and effect. Basic Books. Cited by: §3, §4.3, §4.3.
  • [28] J. Pearl et al. (2009) Causal inference in statistics: an overview. Statistics surveys 3, pp. 96–146. Cited by: §1.
  • [29] J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §6.3.
  • [30] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §3.2.
  • [31] I. Schwartz, S. Yu, T. Hazan, and A. G. Schwing (2019) Factor graph attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2039–2048. Cited by: §1, §2, Table 3.
  • [32] P. H. Seo, A. Lehrmann, B. Han, and L. Sigal (2017) Visual reference resolution using attention memory for visual dialog. In Advances in neural information processing systems, pp. 3719–3729. Cited by: §2.
  • [33] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §5.3, §6.3.
  • [34] K. Tang, H. Zhang, B. Wu, W. Luo, and W. Liu (2019) Learning to compose dynamic tree structures for visual contexts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6619–6628. Cited by: §2.
  • [35] D. Teney, P. Anderson, X. He, and A. van den Hengel (2018) Tips and tricks for visual question answering: learnings from the 2017 challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4223–4232. Cited by: §1, §1, §2.
  • [36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1.
  • [37] Visual Dialog Challenge 2019 Leaderboard. Note: Cited by: §1, §1, §6.6, Table 3.
  • [38] Visual Dialog. Note: Cited by: §1, §3.1, §6.3, footnote 2.
  • [39] Q. Wu, P. Wang, C. Shen, I. Reid, and A. van den Hengel (2018) Are you talking to me? reasoned visual dialog generation through adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6106–6115. Cited by: §1, §1, §1, §2, §3.2, §6.2, §6.2, §6.3, §6.4, Table 2.
  • [40] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §3.2, §5.3.
  • [41] T. Yang, Z. Zha, and H. Zhang (2019) Making history matter: gold-critic sequence training for visual dialog. arXiv preprint arXiv:1902.09326. Cited by: §1, §2, §3.2.
  • [42] X. Yang, K. Tang, H. Zhang, and J. Cai (2019) Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10685–10694. Cited by: §2.
  • [43] T. Yao, Y. Pan, Y. Li, and T. Mei (2018) Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 684–699. Cited by: §2.
  • [44] Z. Yu, J. Yu, J. Fan, and D. Tao (2017) Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 1821–1830. Cited by: §1.
  • [45] Z. Zheng, W. Wang, S. Qi, and S. Zhu (2019) Reasoning visual dialogs with structural and partial observations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6669–6678. Cited by: §2.