This paper is the winner report from team MReaL-BDAI for the Visual Dialog Challenge 2019. We present two causal principles for improving Visual Dialog (VisDial). By "improving", we mean that they can promote almost every existing VisDial model to state-of-the-art performance on the Visual Dialog Challenge 2019 leader-board. Such a major improvement is due solely to our careful inspection of the causality behind the model and data, which reveals that the community has overlooked two causalities in VisDial. Intuitively, Principle 1 suggests that we should remove the direct input of the dialog history to the answer model; otherwise, a harmful shortcut bias will be introduced. Principle 2 says that there is an unobserved confounder for history, question, and answer, leading to spurious correlations in the training data. In particular, to remove the confounder suggested in Principle 2, we propose several causal intervention algorithms, which make the training fundamentally different from traditional likelihood estimation. Note that the two principles are model-agnostic, so they are applicable to any VisDial model.
Given an image $I$, a dialog history of past Q&A pairs $H = \{(Q_1, A_1), \ldots, (Q_{t-1}, A_{t-1})\}$, and the current $t$-th round question $Q$, a Visual Dialog (VisDial) agent is expected to give a good answer $A$. Our community has always considered VQA and VisDial as sister tasks due to their similar settings: Q&A grounded by $I$ (VQA) and Q&A grounded by $(I, H)$ (VisDial). Indeed, from a technical point of view, just like the VQA models, a typical VisDial model first uses an encoder to represent $I$, $H$, and $Q$ as vectors, and then feeds them into a decoder for the answer $A$. Thanks to the recent advances in encoder-decoder frameworks for VQA [22, 35], as well as for natural language processing, the performance (NDCG) of VisDial in the literature has improved significantly from the baseline 51.63% to the state-of-the-art 64.47%.
However, in this paper, we want to highlight an important fact: VisDial is essentially NOT VQA!
And this fact is so profound that all the common heuristics in the vision-language community, such as the fusion tricks [35, 44] and attention variants [22, 24], cannot appreciate the difference. Instead, we introduce the use of causal inference [28, 26]: a graphical framework that stands on the cause-effect interpretation of the data, not merely on its statistical associations. Before we delve into the details, we would like to present the main contributions: two causal principles, rooted in the analysis of the difference between VisDial and VQA, which lead to a performance leap (a farewell to the 60%-s and an embrace of the 70%-s) for all the VisDial models in the literature [7, 21, 39, 25] (only those with code and reproducible results, due to resource limits), promoting them to the state-of-the-art in the Visual Dialog Challenge 2019.
(P1): Delete the link $H \rightarrow A$.
(P2): Add one new (unobserved) node $U$ and three new links: $U \leftarrow H$, $U \rightarrow Q$, and $U \rightarrow A$.
Figure 1 compares the causal graphs of existing VisDial models and the one applied with the proposed two principles. Although a formal introduction of them is given in Section 3.2, for now you can simply understand the nodes as data types and the directed links as data flows. For example, $V \rightarrow A$ and $Q \rightarrow A$ indicate that the visual knowledge $V$, e.g., the encoded feature from a multi-modal encoder, works with the question $Q$ to “dictate” the answer $A$.
P1 suggests that we should remove the direct input of dialog history to the answer model. This principle contradicts most of the prevailing VisDial models [7, 14, 39, 25, 41, 16, 10, 31], which are based on the widely accepted intuition: the more features you input, the more effective the model is. This intuition is mostly correct, but only when applied with discretion about the data generation process. In fact, the annotators of the VisDial dataset were not allowed to copy from the previous Q&A, i.e., $H \nrightarrow A$, and were encouraged to ask consecutive questions that include co-referenced pronouns like “it” and “those”, i.e., $H \rightarrow Q$, and the answer $A$ should be based only on the question $Q$ and the reasoned visual knowledge $V$. Therefore, a good VisDial model is expected to reason over the context $H$ with $Q$ but not to memorize the bias. However, the direct path $H \rightarrow A$ will contaminate the expected causality. Figure 2(a) shows a rather ridiculous bias observed in all baselines without P1: the top answers are those whose lengths are close to the average length of the history answers! We will offer more justifications for P1 in Section 4.1.
P2 implies that model training based only on the association between $(I, H, Q)$ and $A$ is spurious. By “spurious”, we mean that the effect on $A$ caused by $(I, H, Q)$, which is the goal of VisDial, is confounded by an unobserved variable $U$, because it appears in every undesired causal path (a.k.a. backdoor), which is an indirect causal path from the input $(I, H, Q)$ to the output $A$: $Q \leftarrow U \rightarrow A$ and $Q \leftarrow H \rightarrow U \rightarrow A$. We believe that such an unobserved $U$ should be the users, as the VisDial dataset essentially brings humans into the loop. Figure 2(b) illustrates how the user’s hidden preference confounds them. Therefore, during training, if we focus only on the conventional likelihood $P(A \mid I, H, Q)$, the model will inevitably be biased towards the spurious causality, e.g., it may score the answer “Yes, he is” higher than “Yes”, merely because the users prefer to see a “he” appear in the answer, given the history context of “he”. It is worth noting that the confounder $U$ is more impactful in VisDial than in VQA, because the former encourages the user to rank similar answers subjectively while the latter is more objective. A plausible explanation might be: VisDial is interactive in nature and a not quite correct answer is tolerable in one iteration (i.e., dense prediction); while VQA has only one chance, which demands accuracy (i.e., one-hot prediction).
By applying P1 and P2 to the baseline causal graph, we obtain the proposed one (the right one in Figure 1), which serves as a model-agnostic roadmap for the causal inference of VisDial. To remove the spurious effect caused by $U$, we use the do-calculus $P(A \mid do(I, H, Q))$, which is fundamentally different from the conventional likelihood $P(A \mid I, H, Q)$: the former is an active intervention, which cuts off $U \rightarrow Q$ and $H \rightarrow Q$, and samples (where the name “calculus” comes from) every possible $U \mid H$, seeking the true effect on $A$ caused only by $(I, H, Q)$; while the latter likelihood is a passive observation that is affected by the existence of $U$. The formal introduction and details will be given in Section 4.3. In particular, given the fact that once the dataset is ready, $U$ is no longer observed, we propose a series of effective approximations in Section 5.
We validate the effectiveness of P1 and P2 on the most recent Visual Dialog Challenge 2019 dataset. We show significant performance boosts (absolute NDCG) by applying them to 4 representative baseline models: LF (16.42%), HCIAE (15.01%), CoAtt (15.41%), and RvA (16.14%). Impressively, on the official test-std server, we use an ensemble of the simplest baseline, LF, to beat the 2019 champion by 0.2%, and a more complex ensemble to beat it by 0.9%; we also lift all the single-model baselines to state-of-the-art performance.
All of the existing approaches to the VisDial task are based on the typical encoder-decoder framework [14, 11, 32, 10, 31, 45]. They can be categorized by their usage of history. 1) Holistic: models such as HACAN, DAN and CorefNMN treat the history as a whole and feed it into the model. 2) Hierarchical: models such as HRE use a hierarchical structure to deal with history. 3) Recursive: RvA uses a recursive method to process history. However, they all overlook the fact that the history information should not be directly fed to the answer model (i.e., our proposed Principle 1). The baselines we used in this paper are LF, the earliest model; HCIAE, the first model to use hierarchical history attention; CoAtt, the first one to use a co-attention mechanism; and RvA, the first one to use a tree-structured attention mechanism.
Some recent works introduced causal inference into machine learning, trying to endow models with the ability of causal reasoning through the learning process. In contrast to them, we use the structural causal graph, which is a model-agnostic framework that reflects the nature of the data.
In this section, we formally introduce the visual dialog task and describe how the popular encoder-decoder framework follows the baseline causal graph shown in Figure 1. More details of causal graph can be found in [26, 27].
Settings. According to the definition of the VisDial task proposed by Das et al., at each time $t$, given an input image $I$, the current question $Q_t$, the dialog history $H = \{C, (Q_1, A_1), \ldots, (Q_{t-1}, A_{t-1})\}$, where $C$ is the image caption and $(Q_i, A_i)$ is the $i$-th round Q&A pair, and a list of 100 candidate answers $A_t = \{A_t^{(1)}, \ldots, A_t^{(100)}\}$, the task of the dialog agent is to generate a free-form answer or give an answer by ranking the candidate answers $A_t$.
Evaluation. Recently, the ranking metric Normalized Discounted Cumulative Gain (NDCG) is adopted by the VisDial community. It is different from the classification metric (e.g., top-1 accuracy) used in VQA. It is more compatible with the relevance scores of the answer candidates in VisDial rated by humans. NDCG requires to rank relevant candidates in higher places, rather than just to select the ground-truth answer. More details of NDCG can be found in .
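To make the difference from top-1 accuracy concrete, the following is a minimal sketch of NDCG for a single dialog round. It assumes (this is our assumption, not stated in the text above) that the cut-off k equals the number of candidates with non-zero relevance and that the discount at rank r is 1/log2(r + 1). The function name `visdial_ndcg` is hypothetical.

```python
import math


def visdial_ndcg(scores, relevance):
    """NDCG for one dialog round.

    scores:    model score per candidate (higher means ranked earlier)
    relevance: human relevance score per candidate, same order as scores
    """
    # Assumed convention: k is the number of candidates with non-zero relevance.
    k = sum(1 for r in relevance if r > 0)
    if k == 0:
        return 0.0
    # Indices of the top-k candidates according to the model.
    predicted = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    # Best achievable ordering: relevance sorted from high to low.
    ideal = sorted(relevance, reverse=True)[:k]
    # Discount for ranks 1..k is 1 / log2(rank + 1).
    discounts = [1.0 / math.log2(rank + 2) for rank in range(k)]
    dcg = sum(relevance[i] * d for i, d in zip(predicted, discounts))
    idcg = sum(r * d for r, d in zip(ideal, discounts))
    return dcg / idcg
```

Under this definition, a model that places a semantically similar candidate (with partial relevance) near the top is rewarded, even when it is not the single ground-truth answer.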
We first give the definition of causal graph, then revisit the encoder-decoder framework in existing methods using the elements from the baseline graph in Figure 1.
Causal Graph. A causal graph $\mathcal{G} = \{\mathcal{N}, \mathcal{E}\}$, as shown in Figure 1, describes how variables interact with each other, expressed by a directed acyclic graph consisting of nodes $\mathcal{N}$ and directed edges $\mathcal{E}$ (i.e., arrows). $\mathcal{N}$ denotes the variables, and $\mathcal{E}$ (arrows) denotes the causal relationships between two nodes, i.e., $X \rightarrow Y$ denotes that $X$ is the cause and $Y$ is the effect, meaning the outcome of $Y$ is caused by $X$. A causal graph is a highly general roadmap specifying the causal dependencies among variables.
As we will discuss in the following part, all of the existing methods can be revisited in the view of the baseline graph shown in Figure 1.
Feature Representation and Attention in Encoder. The visual feature is denoted as node $I$ in the baseline graph, which is usually a fixed feature extracted by Faster-RCNN based on a ResNet backbone pre-trained on Visual Genome. For the language features, the encoder first embeds each sentence into word vectors and then passes them through an RNN [13, 6] to generate the features of the question and history, which are denoted as $\{Q, H\}$.
Most existing methods apply an attention mechanism in the encoder-decoder to explore the latent weights for a set of features. A basic attention operation can be represented as $\hat{x} = \mathrm{Att}(\mathcal{X}, \mathcal{K})$, where $\mathcal{X}$ is the set of features to attend to, $\mathcal{K}$ is the key (i.e., guidance) and $\hat{x}$ is the attended feature of $\mathcal{X}$. Details can be found in most visual dialog methods [21, 39, 41]. In the baseline graph, the sub-graph of arrows pointing into $V$ denotes a series of attention operations for the visual knowledge $V$. Note that these arrows are not necessarily independent, as in co-attention, and the process can be written as a single mapping from the encoder inputs to $V$, where intermediate variables can be yielded in the graph with respect to different attention strategies such as co-attention and recursive attention. However, without loss of generality, these variables do not affect the causalities in the graph.
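As an illustration of the basic attention operation described above, here is a minimal sketch in plain Python. Scoring each feature by its dot product with the key is our assumption; the cited models use their own scoring functions. The function name `attend` is hypothetical.

```python
import math


def attend(features, key):
    """Return (attended_feature, weights) for a set of feature vectors.

    features: list of equally sized vectors (the set to attend to)
    key:      guidance vector of the same size
    """
    # Assumed scoring function: dot product between each feature and the key.
    scores = [sum(f_d * k_d for f_d, k_d in zip(f, key)) for f in features]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attended feature: weighted sum of the input features.
    dim = len(features[0])
    attended = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return attended, weights
```

For example, using the question feature as the key over image region features would correspond to question-based visual attention.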
Response Generation in Decoder. After obtaining the features from the encoder, existing methods fuse them and feed the fused features into a decoder to generate an answer. In the baseline graph, node $A$ denotes the answer sentence: the decoder takes the features via $V \rightarrow A$, $Q \rightarrow A$, and $H \rightarrow A$, and then transforms them into a vector for decoding the answer. In particular, the decoder can be generative, i.e., it generates an answer sentence with an RNN; or discriminative, i.e., it selects an answer by discriminating among the answer candidates.
Next, we will advance to the middle part of Figure 1, to reveal what is wrong with the baseline graph.
When should we draw an arrow from one node pointing to another? According to the definition in Section 3.2, the criterion is that the first node is the cause and the other one is the effect. Let’s understand P1 by discussing the rationale behind the “double-blind” review policy. Consider three variables: “Well-known Researcher” ($R$), “High-quality Paper” ($P$), and “Accept” ($A$). From our community common sense, we know that $R \rightarrow P$ because top researchers usually lead high-quality research, and $P \rightarrow A$ is even more obvious. Therefore, for the good of the community, the double-blind policy prohibits the direct link $R \rightarrow A$ through author anonymity; otherwise, biases such as personal emotions and politics from $R$ may affect the outcome of $A$.
The story is similar in VisDial. Without loss of generality, we only analyze the path $H \rightarrow Q \rightarrow A$. If we inspect the role of $H$, we find that it helps $Q$ resolve co-references like “it” and “their”. As a result, $Q$ listens to $H$. Then, we use $Q$ to obtain $A$. Here, $Q$ becomes a mediator which cuts off the direct association between $H$ and $A$, which makes $P(A \mid Q, H) = P(A \mid Q)$, like the “High-quality Paper” in the previous story. However, if we set an arrow from $H$ to $A$, i.e., $H \rightarrow A$, the undesirable bias of $H$ will be learned for the prediction of $A$, which hampers the natural process of VisDial, such as the interesting bias illustrated in Figure 2(a). Another example is discussed in Figure 4: if we add the direct link $H \rightarrow A$, $A$ prefers to match the words in $H$ even though they are literally irrelevant to $Q$. After we apply P1, these phenomena are relieved, as shown by the blue line in Figure 2(a), which is closer to the average answer length of the NDCG ground truth (i.e., candidates with non-zero relevance score), represented as the green dashed line, and by the other qualitative studies in Section 6.4.
Before discussing P2, we first introduce an important concept in causal inference. In a causal graph, the fork-like pattern in Figure 3(a) contains a confounder $Z$, which is the common cause for $X$ and $Y$ (i.e., $X \leftarrow Z \rightarrow Y$). The confounder $Z$ opens a non-causal path starting from $X$, which is also called the backdoor, making $X$ and $Y$ spuriously correlated even if there is no direct causality between them.
In the data generation process of VisDial, we know that not only can both the questioner and the answerer see the dialog history, which offers them a latent topic, but the answer annotators can also look at the history when annotating the answer. Their preference can be understood as part of human nature or as subtleties conditional on a dialog context, and thus it has a causal effect on both $Q$ and $A$. Moreover, because the preference is nuanced and uncontrollable, we consider it an unobserved confounder for $Q$ and $A$.
It is worth noting that the confounder hinders us from finding the true causal effect. Let’s take the graph in Figure 3(a) as an example. If there is no $Z$, the probability $P(Y \mid X)$ is the causal effect that we want to pursue. However, due to the existence of $Z$, $P(Y \mid X)$ is no longer the true causality from $X$ to $Y$. When we calculate $P(Y \mid X)$, we take $Z$ into account, which can be shown by Bayes rule:

$$P(Y \mid X) = \sum_{z} P(Y \mid X, z)\, P(z \mid X). \tag{1}$$

The distribution of $z$ is conditional on $X$ (i.e., $P(z \mid X)$). That means when using the conditional weight (i.e., $P(z \mid X)$) to sum every effect (i.e., $P(Y \mid X, z)$), the likelihood sum (i.e., $P(Y \mid X)$) will be biased towards the effect with larger weights. For better understanding, if we treat Eq (1) as a process of data stratification, at each layer $z$, we can obtain the causality conditional on $z$, because conditioning on $z$ blocks the backdoor of $X$. Then, we have to sum these causalities by the natural distribution $P(z)$ rather than the conditional distribution $P(z \mid X)$, which would remix the data bias. In a nutshell, we cannot calculate the causality from $X$ to $Y$ by $P(Y \mid X)$ under the confounder $Z$. To resolve this problem (i.e., de-confounding to find the causal effect), we need more powerful tools.
do-operator. The do-operator is a type of intervention for de-confounding. As illustrated in Figure 3(c), the do-operator (e.g., do($X = x$)) sets a value $x$ to variable $X$, i.e., $X$ is caused by itself rather than by its parent nodes. Therefore, do($X = x$) cuts off all the original arrows that come into $X$ (i.e., $Z \rightarrow X$) because its parents do not cause it anymore. This operation prevents any information about $X$ from flowing in the non-causal direction (i.e., the backdoor $X \leftarrow Z \rightarrow Y$). As a result, the confounding by $Z$ can be relieved and the causal effect of $X \rightarrow Y$ can be estimated. In the following parts, we use do($X$) to represent do($X = x$) for concision.
do-calculus. However, it is hard to perform a real intervention on a fixed dataset. We need to use some rules to translate $P(Y \mid do(X))$ into a formula which has no do-operator and can be calculated by conditional probability. The rules of do-calculus are given in [26, 27] and here we just introduce the most important one: if a set of variables $Z$ blocks all backdoor paths from $X$ to $Y$, then conditional on $Z$, do($X$) is equivalent to observing $X$:

$$P(Y \mid do(X = x), Z = z) = P(Y \mid X = x, Z = z), \tag{2}$$

where a capital letter denotes a variable and a lowercase letter denotes its value. Other rules will be given in supplementary materials.
After obtaining the tools, we can revisit the example in Section 4.2. If we calculate $P(Y \mid do(X))$ rather than $P(Y \mid X)$, the result will be $\sum_{z} P(Y \mid X, z)\, P(z)$. In this formula, the distribution of $z$ is the natural prior $P(z)$ instead of the conditional distribution $P(z \mid X)$. Therefore, the summation of the causal effect $P(Y \mid X, z)$ by weight (i.e., $P(z)$) will not remix the data bias. In other words, $P(Y \mid do(X))$ is the ideal causality from $X$ to $Y$.
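A small numeric example may help to see the gap between passive observation and intervention. The sketch below uses a binary confounder with made-up probabilities (all numbers are illustrative assumptions, not data from the paper) and compares the likelihood, which weights each stratum by the conditional distribution of the confounder, with the intervention, which weights each stratum by its natural prior.

```python
# Toy binary confounder z; every number here is an illustrative assumption.
p_z = {0: 0.5, 1: 0.5}               # natural prior P(z)
p_x1_given_z = {0: 0.2, 1: 0.8}      # P(X=1 | z): z strongly influences X
p_y1_given_x1_z = {0: 0.3, 1: 0.9}   # P(Y=1 | X=1, z): z also influences Y


def observational():
    """P(Y=1 | X=1) = sum_z P(Y=1 | X=1, z) * P(z | X=1)."""
    p_x1 = sum(p_x1_given_z[z] * p_z[z] for z in p_z)
    total = 0.0
    for z in p_z:
        p_z_given_x1 = p_x1_given_z[z] * p_z[z] / p_x1  # Bayes rule
        total += p_y1_given_x1_z[z] * p_z_given_x1
    return total


def interventional():
    """P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, z) * P(z)."""
    return sum(p_y1_given_x1_z[z] * p_z[z] for z in p_z)


print(observational())   # approximately 0.78
print(interventional())  # approximately 0.60
```

In this toy setting, the observed likelihood overstates the effect of X on Y, because samples with X=1 are dominated by the stratum z=1, where Y=1 is more likely regardless of X.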
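In our graph of VisDial shown in Figure 1, we can also de-confound by the intervention do($Q, H, I$) to find the causal effect from $(Q, H, I)$ to $A$, and then apply the do-calculus rules to transform the pretended intervention into a probability formula:

$$P(A \mid do(Q, H, I)) = \sum_{u} P(A \mid do(Q, H, I), u)\, P(u \mid do(Q, H, I)) = \sum_{u} P(A \mid do(Q), H, I, u)\, P(u \mid H) = \sum_{u} P(A \mid Q, H, I, u)\, P(u \mid H). \tag{3}$$

The last transformation uses the rule we introduced in do-calculus, because $Q$’s backdoors are blocked by controlling $U$ and $H$. The remaining derivation proofs and the details of the other rules can be found in supplementary materials. As we mentioned, the result of $P(A \mid do(Q, H, I))$ is the real causal effect that we want.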
So far, we have presented the baseline causal graph, the two principles, and our causal graph. In the next section, we will try to calculate the real causal effect and give some attempts at realizing our causal graph, to shed light on the future of visual dialog.
where represents the probability of under the conditions and . Since the variable is unobserved, we just give some examples of attempts to replace or approximate it and corresponding sketch graphs will be given to help understand.
Inspired by the data stratification form in Eq (3), we try to use the question type to stratify the data. Specifically, we manually define some question types, count the answers that appear, and set a preference for every answer in each type of question. According to Eq (3), we can use the preference generated by question type to train our model with the loss function:
where is the -th candidate in answer list, is the probability of candidate , is the preference we counted and the sketch graph is shown in Figure 3(c). The implementation details will be given in Section 6.3.
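The counting step behind the question-type preference can be sketched as follows. The typing rule used here (the first two words of the question) and the names `question_type` and `preferred_answers` are hypothetical; the text only says that the question types are defined manually. The threshold follows the "occurred over 5 times" rule stated in the implementation details.

```python
from collections import Counter, defaultdict


def question_type(question):
    # Hypothetical typing rule: the first two lowercased words.
    return " ".join(question.lower().split()[:2])


def preferred_answers(qa_pairs, min_count=5):
    """Map each question type to the answers seen more than min_count times."""
    counts = defaultdict(Counter)
    for question, answer in qa_pairs:
        counts[question_type(question)][answer.lower()] += 1
    return {
        q_type: {a for a, c in answer_counts.items() if c > min_count}
        for q_type, answer_counts in counts.items()
    }
```

The resulting sets could then be turned into per-candidate preference weights for the loss, for instance a weight of 1 for preferred answers and 0 otherwise; that particular weighting is also an assumption.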
The official gives a set of dense annotations in training set which can be treated as a representation of preference because the annotators score every candidate in the context with their preference. As a result, if we regard each candidate in the decoder as a , illustrated in Figure 3(d), we can follow Eq (3) to calculate loss by the following function:
where is the index of answer candidate. Eq (5) can be implemented as different forms. Here we give three examples (detailed formulas are in supplementary materials):
Weighted Softmax Loss (). We extend the log-softmax loss as a weighted form, where is denoted by ,
denotes the logit of candidate, and is corresponding relevance score.
Binary Sigmoid Loss (). This loss is close to the binary cross entropy loss, where represents or , and is also corresponding relevance score.
Generalized Ranking Loss (). Note that answer generation process can be viewed as a ranking problem. Therefore, we derive a ranking loss that is a ranking probability where is a group of candidates which has a lower relevance score than and represents (with no relevance score) or (with positive relevance score). This loss function is reorganized from ListNet  to become more suitable for this task.
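As a rough illustration of the first two loss families, the sketch below gives one plausible form of a relevance-weighted log-softmax loss and a sigmoid cross-entropy loss with soft targets. These are our assumptions about the general shape only; the exact formulas are in the supplementary materials and may differ. The function names are hypothetical, and relevance scores are assumed to lie in [0, 1].

```python
import math


def log_softmax(logits):
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_norm for x in logits]


def softplus(t):
    # Stable log(1 + exp(t)).
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def weighted_softmax_loss(logits, relevance):
    """Negative log-softmax of every candidate, weighted by its relevance."""
    return -sum(s * lp for s, lp in zip(relevance, log_softmax(logits)))


def binary_sigmoid_loss(logits, relevance):
    """Binary cross entropy per candidate with the relevance as soft target."""
    loss = 0.0
    for x, s in zip(logits, relevance):
        # -log(sigmoid(x)) = softplus(-x); -log(1 - sigmoid(x)) = softplus(x)
        loss += s * softplus(-x) + (1.0 - s) * softplus(x)
    return loss
```

With a one-hot relevance vector, the first function reduces to the usual softmax cross-entropy, which is why dense relevance scores are what distinguish this training signal from the conventional one.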
We find that the Eq (3) can be written as:
., normalized weighted geometric mean), and this term can be further calculated by creating a dictionaryof :
where is a fully connected layer, represents a variable and its value is selected from directory . The details and proofs of the series of approximations can be found in supplementary materials. After deriving the last term, we can use to calculate shown in Figure 3(e) to approximate Eq (3). Noting that although when we train the dictionary, we still need to use answer score sampling, the hidden dictionary learning is a more proper way to approximate the unobserved confounder because it explores the whole space of rather than the second attempt which only uses some samples of .
Dataset. Our principles are evaluated on the recently released real-world dataset VisDial v1.0 (as suggested by the organizers, results should be reported on v1.0 instead of v0.9). Specifically, the training set of VisDial v1.0 contains 123K images from the COCO dataset with 10 rounds of dialog for each image, about 1.2M dialog pairs in total. The validation and test sets were collected from Flickr, with 2K and 8K COCO-like images respectively. The test set is further split into test-std and test-challenge splits, each with 4K images, which are hosted on the blind online evaluation server. Each image in the training and validation sets has a 10-round dialog, while in the test set the number of dialog rounds is flexible. Every dialog in the VisDial dataset comes with 100 answer candidates. We evaluated our results on the validation and test-std sets.
Metric. We used Normalized Discounted Cumulative Gain (NDCG) to evaluate our models. As introduced in Section 3.1, NDCG is adopted as the new metric for visual dialog which is appointed by the official and accepted by the community. Note that 2018 and 2019 Visual Dialog challenge winners were both picked by NDCG.
LF . This naive base model has no attention modules. We expand the model by adding some very basic attention operations to the naive baseline model, including question-based history attention and question-history-based visual attention refinement.
HCIAE . The model consists of question-based history attention and question-history-based visual attention.
CoAtt . The model consists of question-based visual attention, image-question-based history attention, image-history-based question attention, and the final question-history-based visual attention.
RvA . The model consists of question-based visual attention and history-based visual attention refinement.
Pre-processing. As for language pre-processing, we followed the process of prior work. First, we lowercased all the letters in sentences, converted digits to words and removed contractions. After that, we used the Python NLTK toolkit to tokenize sentences into word lists, followed by padding or truncating captions, questions, and answers to lengths of 40, 20 and 20, respectively. We then built a vocabulary of 11,322 tokens, including 11,319 words that occur at least 5 times in train v1.0 and 3 instruction tokens. We loaded pre-trained GloVe word embeddings to initialize all word embeddings, which were shared between the encoder and decoder, and we applied 2-layer LSTMs to encode the word embeddings, with a hidden state dimension of 512. As for the visual feature, we used the bottom-up-attention features provided by the organizers.
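The vocabulary and padding steps can be sketched as below. To keep the example self-contained, a simple regular-expression tokenizer stands in for NLTK, and only two special tokens (`<PAD>`, `<UNK>`) are used; both choices are our assumptions, since the text does not name the three instruction tokens. Digit-to-word conversion and contraction handling are omitted.

```python
import re
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"  # hypothetical special tokens


def tokenize(sentence):
    # Stand-in tokenizer: lowercase, then split into words and punctuation.
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", sentence.lower())


def build_vocab(token_lists, min_count=5):
    """Keep words that occur at least min_count times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {tok: idx for idx, tok in enumerate([PAD, UNK] + words)}


def encode(tokens, vocab, max_len):
    """Map tokens to ids, then truncate or pad to exactly max_len."""
    ids = [vocab.get(t, vocab[UNK]) for t in tokens[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))
```

With the lengths stated above, a question would be encoded as `encode(tokenize(question), vocab, 20)` and a caption with `max_len=40`.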
Implementation of Principles. For P1, we eliminated the history feature from the final fused vector representation for all models, while keeping the other parts unchanged. For HCIAE and CoAtt, we also blocked the history guidance to the image. For P2, we trained our models using the preference score, which was either counted from the question type or given by the organizers (i.e., the dense annotations in train v1.0). Specifically, for “question type”, we first defined 55 types and marked answers that occurred over 5 times as preferred answers, then used the preference to train our model with the corresponding loss. “Answer score sampling” was directly used to train our pre-trained model with the proposed loss function. For “dictionary”, we set up a memory of dimension $100 \times 512$ to realize the dictionary, then trained it on the dense annotations. More details can be found in supplementary materials. Note that other implementations following P1 and P2 are also acceptable.
Training. We used the softmax cross-entropy loss to train the model with P1, and used Adam with a learning rate that decayed at epochs 5, 7, and 9 with a decay rate of 0.4. We trained the model for 15 epochs in total. Dropout was also applied, with a ratio of 0.4 for the RNN and 0.25 for the fully connected layers. Other settings were left at their defaults.
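The step schedule described above can be written as a small helper. The base learning rate is left as a parameter because its value is not available in this text, and we assume that a decay takes effect from the listed epoch onward; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr, milestones=(5, 7, 9), decay=0.4):
    """Step decay: multiply base_lr by `decay` once per milestone reached."""
    # Assumption: the decay at milestone m applies to epoch m and all later ones.
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** passed
```

Over 15 epochs this gives four plateaus: the base rate for epochs 0 to 4, then 0.4, 0.16, and 0.064 times the base rate after epochs 5, 7, and 9 respectively.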
Table 1 shows the results with different implementations of P2, i.e., question type, answer score sampling, and hidden dictionary learning. Overall, all of the implementations improve the performance of the base models. Specifically, the attempts at P2 can further boost performance by up to 11.75%, achieved by hidden dictionary learning. To be more specific, our designed loss functions based on Eq. (3) outperform the regressive score, which is a Euclidean distance loss, and we also find that our proposed generalized ranking loss is the best because it satisfies the ranking property of VisDial.
To justify that our principles are model-agnostic, Table 2 shows the results of applying our principles to four different models (i.e., LF, HCIAE, CoAtt and RvA). In general, both of our principles improve all the models under every ablative condition. We also find that the effects of P1 and P2 are additive; that is to say, their combination performs the best. Note that the enhanced LF model is very simple, without complex attention strategies. However, this simple architecture does not hinder it from achieving the best performance.
[Table 2: results of applying P1 and P2 to the baseline models; columns: Model, LF, HCIAE, CoAtt, RvA]
History Bias Elimination. After applying P1, many harmful patterns learned from history are relieved, especially the answer-length bias shown in Figure 2(a) and the word-match bias shown in Figure 4. After applying P1, the average length of the top-1 answers (i.e., the blue line in Figure 2(a)) is no longer related to the average length of the history answers, and becomes closer to the average length of the NDCG ground truth answers (i.e., the green dashed line). As for the word-match bias in Figure 4, we can observe that the word “eyes” from the history is literally unrelated to the current question. But at the top of the ranked answer list of the baseline model, the word “eyes” can be found in some undesirable candidates (i.e., with low relevance scores). In general, due to the wrong direct path from history to answer, the baseline model prefers to match words in the history and ranks the matched candidates in high places. If we count how often meaningful words (e.g., the word “eyes”) are matched on the validation set in the top-10 candidates of the ranked lists, obtained by the baseline with P1 and by the baseline, we find that P1 decreases word matching from history by about 10%.
The bottom example shown in Figure 4 also illustrates a type of word matching. In the ranked list of the baseline model, the rank of “yes” is very high, and “yes” exists in history for many times. By analyzing the results on validation, we found that if “yes” or “no” exists in dialog history, the baseline model will give the two answers a higher rank than average because of the word matching. After applying P1, this phenomenon will no longer happen. More details of these biases can be found in supplementary materials.
| Group | Model | NDCG (%) |
| Ours | P1+P2 (More Ensemble) | 74.91 |
| | ReDAN+ (Ensemble) | 64.47 |
More Reasonable Ranking. Figure 5 shows that the baseline model only focuses on the ground truth answer like “no” or “yes” and does not care about the rank of other answers with similar semantics like “nope” or “yes, he is”. This does not conform to human intuition, because we think candidates with similar semantics are still correct answers. This also leads the baseline model to perform badly under the NDCG metric. In contrast, the model with P2, in the bottom example, ranks almost all the suitable answers like “yes, he is”, “yes he is” and “I think so” at the top together with the ground truth answer “yes”, which greatly improves the NDCG performance.
We finally used the blind online test server to justify the effectiveness of our principles on the test-std split of VisDial v1.0. As shown in Table 3, the top part contains the results of the baseline models with our principles applied, where P2 denotes the most effective implementation (i.e., hidden dictionary learning). The bottom part is the 2019 Visual Dialog Challenge leader-board. We used the ensemble of the enhanced LF to beat our best performance in the 2019 Visual Dialog Challenge, which also used other implementations of P1 and P2. Promisingly, by applying our principles, we can promote all the baseline single models to the top ranks on the leader-board.
In this paper, we proposed two causal principles for improving the VisDial task. They are model-agnostic, and thus can be applied to almost all the existing methods and bring major improvements. The principles are drawn from our in-depth causal analysis of the nature of VisDial, which has unfortunately been overlooked by our community. As technical contributions, we offered some implementation examples of how to apply the principles to baseline models. We conducted extensive experiments on the official VisDial dataset and the online evaluation servers. Promising results demonstrate the effectiveness of the two principles. Moving forward, we will stick to our causal thinking to discover other potential causalities hidden in embodied Q&A and conversational visual dialog tasks.
On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
Causal reasoning from meta-reinforcement learning. arXiv preprint arXiv:1901.08162.
Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958.