Visual dialog is one of the most comprehensive tasks for benchmarking an AI's comprehension of natural language grounded in a visual scene. A good dialog agent should accomplish a complex series of reasoning sub-tasks: contextual visual perception [31, 21, 6, 28, 18, 32, 29], language modeling [15, 5], and co-reference resolution in dialog history [28, 18] (e.g., identifying what "that" refers to). Thanks to end-to-end deep neural networks for these respective sub-tasks, state-of-the-art visual dialog systems can be built by assembling them into a codec framework [31, 7, 21]. The encoder encodes the triplet (the history question-answer sentences, the image, and the current question sentence) into vector representations; the decoder then fuses those vectors (usually by concatenation) and decodes the fused representation into answer sentences (e.g., by generation [7, 21, 31] or candidate ranking [16, 18]).
So far, one may identify that the key difference between Visual Dialog (VisDial) and the well-known Visual Question Answering (VQA) task [3, 2, 6, 12, 1, 4, 6, 22] is the exploitation of the history. As shown in Figure 1 (top), from a high-level view, VisDial can be cast as multi-round VQA given the additional language context of the history question-answer pairs. Essentially, at each round, the response generated by the agent is "thrown away"; for the next round, the history is artificially "tampered" with the ground-truth answers. Note that this ground-truth history setting is reasonable because it keeps the conversation evaluable; otherwise, any other response may derail the conversation into never-ending open-domain chitchat [24, 14]. However, we argue that exploiting only the ground-truth history is ineffective for the codec model's supervised training. For example, the ground-truth answer only tells the model that "it is covered by a hat" is good, but neglects to show how bad other answers are. As a result, the model easily over-fits to certain vision-language biases in the insufficient ground-truth data instead of learning visual reasoning.
In this paper, we propose a novel training strategy that utilizes the history responses in a more efficient way, making the codec model more sensitive to the history dialog, e.g., to co-reference resolution and context. As illustrated in Figure 1 (bottom), in a nutshell, we intentionally impose wrong answers in the "tampered" history and see how the model behaves compared to the "gold" history. Specifically, suppose we are going to train a model at the t-th round. To gain more insight into the wrong answers, we sample a most probable mistake, e.g., "white", to be the fake answer in place of the ground-truth at round t. We then infer two lines of future dialog up to the last round T. These two lines are almost identical except for the t-th round answer of the fake one, which is replaced by the wrong answer. The model at each future round will then output two losses. The first one is called Adverse Reward (AR): the larger this loss, the more successfully the fake answer at round t impacts the future. The second one is called Gold Critic (GC): the conventional loss using the ground-truth history. Interestingly, their difference tells the model how to penalize the mistake, i.e., if AR > GC, the mistake does impact the future negatively, as it should; otherwise, the mistake is insignificant and the model should focus on correcting others. Finally, this procedure produces a sequence of training signals from the future rounds t+1 to T.
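The AR/GC comparison described above can be sketched in a few lines. This is an illustrative stand-in only; the function name and the hinge form are our assumptions, while the paper's actual penalty is the policy-gradient loss of Section 4:

```python
def gold_critic_signal(loss_adverse, loss_gold):
    """Per-round training signal from the AR/GC comparison (a sketch).

    `loss_adverse` is a future-round loss under the tampered (fake)
    history; `loss_gold` is the same loss under the ground-truth history.
    If the fake answer genuinely hurts the future (AR > GC), the model
    already behaves as desired and no penalty is needed; otherwise the
    mistake went unnoticed and we emit a penalty proportional to the gap.
    """
    return max(0.0, loss_gold - loss_adverse)

# A fake answer that disrupts the future dialog yields no penalty...
assert gold_critic_signal(loss_adverse=2.0, loss_gold=1.0) == 0.0
# ...while one the model shrugs off is penalized.
assert gold_critic_signal(loss_adverse=0.5, loss_gold=1.0) == 0.5
```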
We develop a novel codec model equipped with the proposed Residual Co-Attention encoder to address the essential co-reference and visual context in history encoding. Equipped with the proposed Gold-Critic Sequence Training (GCST), we achieve a new state-of-the-art single model on the real-world VisDial benchmarks: 0.6617 MRR on VisDial v0.9, 0.6282 MRR on VisDial v1.0, and 66.8% accuracy on GuessWhat?!. We also achieve a top-performing 0.567 NDCG score on the official VisDial online challenge server. More ablative studies, qualitative examples, and detailed results are discussed in Section 5.
2 Related Work
Visual Dialog. Visual dialog was recently proposed in  and  as a successor to earlier vision-and-language problems. A number of vision-and-language tasks have seen great improvements, such as image captioning [10, 11, 17, 2] and visual question answering [2, 1, 4, 6, 22]. However, most vision-and-language problems are based on a single round of language interaction. In contrast, the visual dialog task involves a multi-round dialog, which is more complex. Specifically, the visual dialog task is based on an image and a visually-grounded dialog. Das et al.  proposed a large-scale free-form visual dialog dataset, which consists of sequential open questions and answers about arbitrary objects in the image. The dialogs were collected by pairing two people on Amazon Mechanical Turk to chat with each other about the images. Another visual dialog task, GuessWhat?!, proposed by , focuses on a different aspect: object discovery through a set of yes/no questions. We follow the first setting in this paper.
Several approaches have been proposed to solve the visual dialog problem, following an encoder-decoder structure. Lu et al.  proposed a generative-discriminative model for visual dialog: the discriminator considers the generated response from the generator together with the provided list of candidate responses, and the powerful discriminator then transfers knowledge to update the generator. Wu et al.  used adversarial learning to encourage the generator to produce more human-like dialogs. Seo et al.  proposed a new attention mechanism with an attention memory to resolve visual co-reference. Kottur et al.  further used neural module networks to deal with visual co-reference resolution in visual dialog.
Attention Mechanism. Attention mechanisms are widely used in vision-and-language problems and have achieved inspiring results. For visual question answering, an attention-based model may attend to both the relevant regions in the image and the phrases in the question. A number of approaches have been proposed that apply question-guided attention modules on image features. Yang et al.  proposed the attention-based model SAN, which applies stacked attention networks to produce multiple attention maps for multi-step reasoning. Chen et al.  developed an attention model that encodes cross-region relationships and supports the model in answering more complex questions with visual reference resolution. Attention models that attend to the phrases or words in the question were developed in later studies. Lu et al.  developed a co-attention network that calculates the attention weights of the image regions and question words guided by each other. Nam et al.  introduced Dual Attention Networks (DANs), which use multiple reasoning steps to refine the attention weights of image regions and questions. Our model also uses multiple steps to generate the attention weights. However, we apply an extra element-wise co-attention module to encode the multi-modal features. The feature-wise co-attention module and the element-wise co-attention module produce multiple attention maps in a sequential manner such that each attention module is guided by the outputs of the other.
Reinforcement Learning With Baseline. Reinforcement learning with a baseline is widely applied to image captioning , as sequence training performs better with a baseline. As far as we know, no prior work has used RL with a baseline in the visual dialog setting. Different from standard reinforcement learning, we utilize the gold-critic baseline to extract more training signal from the imposed fake histories.
3 Our Codec Model
In this section, we describe the details of the proposed model for the visual dialog task. Formally, we follow the definition introduced by Das et al. . The input consists of: 1) an image; 2) a dialog history composed of a caption of the image and the previous rounds of the dialog, where each round is a "ground-truth" question-answer pair; 3) a follow-up question; and 4) a list of 100 candidate answer options which contains one correct answer. The visual dialog model needs to sort the answer options and choose the right one given this input. To perform response generation for the above task, as illustrated in Figure 2, our codec model includes three modules: 1) feature representation (Section 3.1), 2) the proposed Residual Co-Attention Network (RCAN) for the encoder (Section 3.2), and 3) a discriminative decoder for response generation by ranking (Section 3.3).
3.1 Feature Representation
We extract features for the image, the dialog history, the question, and the answer.
Image Feature. We follow the approach of  to extract region-based image features. We train a Faster-RCNN  based on the ResNet-101 backbone  and then feed the image into the network. We choose the top-N regions with the highest confidences from the outputs of Faster-RCNN and encode the regions as the visual features of the image. The image feature is thus an N x d matrix, where N is the number of regions and d is the feature dimension.
Question and Answer Feature. Our feature extraction for questions and answers is the same, so without loss of generality we only introduce the question feature. We use a Bi-LSTM to encode the questions, using the right arrow to denote the forward direction of the Bi-LSTM and the left one to denote the backward direction. The question feature is then an L x d matrix, where L is the length of the question and d is the feature dimension; each row is the concatenation of the hidden states in the two directions.
History Feature. We concatenate each round of the question and the answer into one long "sentence", named the Q-A pair. Each round of the Q-A pair is represented by the concatenation of the last hidden states in the two directions of a Bi-LSTM run over the M tokens of the Q-A pair, where M is the length of the Q-A pair. The history (Q-A pairs) feature is then a T x d matrix, where T is the number of dialog rounds.
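A shape-level sketch of the three feature matrices may help fix the notation. The toy dimensions and random values below merely stand in for actual Faster-RCNN region features and learned Bi-LSTM hidden states:

```python
import numpy as np

# Toy dimensions (illustrative assumptions, not the paper's settings,
# except d = 512 which matches the implementation details).
rng = np.random.default_rng(0)
d = 512   # feature dimension
N = 36    # top-N image regions from Faster-RCNN
L = 15    # question length in tokens
T = 10    # number of dialog rounds

V = rng.standard_normal((N, d))              # image feature matrix, N x d

# Question feature: concatenation of forward and backward Bi-LSTM states.
h_fwd = rng.standard_normal((L, d // 2))     # forward hidden states
h_bwd = rng.standard_normal((L, d // 2))     # backward hidden states
Q = np.concatenate([h_fwd, h_bwd], axis=1)   # question feature, L x d

# Each history round is the concatenated *last* hidden states of a
# Bi-LSTM run over the joined Q-A "sentence"; stacking rounds gives H.
H = rng.standard_normal((T, d))              # history feature, T x d

assert V.shape == (N, d) and Q.shape == (L, d) and H.shape == (T, d)
```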
3.2 Residual Co-Attention Network
We propose a novel attention model called the Residual Co-Attention Network (RCAN) to encode the input features described above with a co-attention mechanism [31, 21, 22, 9]. Given the image features, the history features, and the question features, the encoder encodes the three types of features with a sequence of attention blocks, each of which contains two types of components. The first component is a feature-wise attention module: the encoder calculates the attention weights for each feature type guided by the other two types of features and focuses on the features which are interrelated. We denote these attention modules as A_Q, A_H, and A_V. For the second component, we apply an element-wise co-attention module to each feature type, denoted E_Q, E_H, and E_V. We apply the two kinds of modules to each feature type recurrently and stack the outputs in a residual way. An overview of the attention block with its two components is illustrated in Figure 3. We now describe the two components in detail.
where the two terms are the outputs of the last unit in the current attention round of the feature-wise module and of the element-wise attention module, respectively. We further define Eqs. 1-3 collectively as A_Q; A_H and A_V for the history and the image follow similar definitions.
Element-wise Co-Attention Module. We again take the question as an example. The input of E_Q includes the outputs of the last attention modules in the current attention step. E_Q can be defined as follows:
We now show how the two components, i.e., the Feature-wise Co-Attention Module and the Element-wise Co-Attention Module, work recurrently in a residual way in the s-th attention step (s = 1, 2, ...):
In particular, the initial inputs are the outputs of the self-attention of the visual feature and of the history feature, as there are no previous outputs when s = 1. After S attention rounds in RCAN, we obtain the encoded features. We concatenate the three features together and use a fully connected layer to obtain the final encoding feature. Note that the key difference between our RCAN and the co-attention proposed by Wu et al.  is that we design the element-wise co-attention module to encode the features, which better exploits the co-reference stored in the history.
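A minimal numpy sketch of the two attention components follows. The function names and the sigmoid-gating form of the element-wise module are our illustrative assumptions, since the paper's exact equations (Eqs. 1-7) are not reproduced in this excerpt:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feature_wise_attention(feats, guide):
    """Pool the rows of `feats` (n x d) with weights guided by `guide` (d,).
    A minimal stand-in for the feature-wise modules (A_Q, A_H, A_V)."""
    weights = softmax(feats @ guide)   # attention over rows
    return weights @ feats             # pooled (d,) representation

def element_wise_coattention(a, b):
    """Gate one pooled feature element-wise by its partner; a guessed
    sigmoid-gating form of the element-wise modules (E_Q, E_H, E_V)."""
    gate = 1.0 / (1.0 + np.exp(-b))
    return a * gate

# Tiny demo: a guide aligned with the first basis vector pools mostly row 0.
feats = np.eye(3)
guide = np.array([10.0, 0.0, 0.0])
pooled = feature_wise_attention(feats, guide)
assert pooled.shape == (3,) and pooled[0] > 0.99
```

In a full block, the two modules would run alternately for each of the three feature types, with residual connections stacking the outputs across the S steps.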
3.3 Response Generation
Now we introduce how to generate the answers for the visual dialog task. We apply a self-attention mechanism to the candidate answer features. We then take the dot product of the answer features and the final output of RCAN (Section 3.2) to calculate the similarities between the final encoding feature and the candidate answer features. We sort the answer candidates by similarity and choose the top one as the prediction.
In the GuessWhat?! task, the information about each answer candidate is the localization and category of the object. We concatenate the localizations and categories and embed them with a fully connected layer to obtain the answer option features. We again calculate the similarities between the answer features and the final encoding feature by dot product.
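The ranking decoder reduces to a dot product followed by a sort; a small sketch with hypothetical 2-d features:

```python
import numpy as np

def rank_candidates(encoding, answer_feats):
    """Dot-product ranking of candidate answers.

    `encoding` is the fused RCAN output, shape (d,);
    `answer_feats` holds the (self-attended) candidate features, (n, d).
    Returns the candidate indices in descending-similarity order and the
    raw similarity scores."""
    sims = answer_feats @ encoding
    order = np.argsort(-sims)   # descending similarity
    return order, sims

# Toy 2-d example: candidate 1 has the largest projection onto the encoding.
enc = np.array([1.0, 0.0])
cands = np.array([[0.1, 0.9],
                  [0.8, 0.2],
                  [0.5, 0.5]])
order, sims = rank_candidates(enc, cands)
assert order[0] == 1
```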
4 Gold-Critic Sequence Training
The RCAN described in Section 3.2 encodes the "ground-truth" triplet and generates the responses. However, with supervised learning alone, RCAN does not take into account the impact of an imperfect history. To this end, we intentionally impose wrong answers, chosen by the model itself, in the history, and see how the historic error impacts RCAN's future behavior by subtracting the proposed gold-critic baseline. We punish the codec if it behaves better when fed with the fake history than with the ground-truth history, which violates the conversational nature of visual dialog. Next, we briefly describe the supervised loss in Section 4.1 and then introduce the Gold-Critic Sequence Training (GCST) in Section 4.2.
4.1 Discriminative Loss
where the first term is the encoder of answer options described in Section 3.3, the second is the similarity between the final encoding feature and the i-th negative answer option, and the last is the similarity between the final encoding feature and the ground-truth answer.
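This description is consistent with an N-pair-style discriminative loss, i.e., softmax cross-entropy over the candidate similarities with the ground truth as the positive. The sketch below is our interpretation; the paper's exact normalization may differ:

```python
import numpy as np

def discriminative_loss(sims, gt_index):
    """N-pair-style discriminative loss over candidate similarities.

    `sims` is the vector of similarities between the final encoding
    feature and every candidate answer; `gt_index` marks the ground
    truth. Implemented as numerically stable softmax cross-entropy."""
    z = sims - sims.max()                       # for numerical stability
    log_probs = z - np.log(np.exp(z).sum())     # log-softmax
    return -log_probs[gt_index]

# With uniform similarities over 3 candidates the loss is exactly ln(3).
loss = discriminative_loss(np.zeros(3), gt_index=0)
assert abs(loss - np.log(3.0)) < 1e-9
```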
4.2 Gold-Critic Reward
We encourage the discriminative model to learn visual reasoning rather than dataset bias. Specifically, when the discriminative model for visual dialog described in Section 3 is fed with a wrong history, it should generate unsatisfactory answers. We punish the model if it behaves better when fed with the fake history than with the ground-truth history, which is unreasonable. Inspired by actor-critic policy gradient methods in reinforcement learning, we treat the discriminative model as an agent whose action is choosing an answer from the answer options. For ease of understanding, we take the gold-critic sequence training on the t-th round as an example.
We feed the "ground-truth" triplet of the t-th round into the discriminative model and obtain the scores of the right answer and the wrong answers. We rank the scores and choose the top-K wrong answers as the fake answers. We use the softmax function to calculate the sampling probabilities of the fake answers.
We replace the right answer of the t-th round with a fake answer chosen by the model to fake the history. After that, we use the fake history as the history input and feed the new triplet into the discriminative model to sort the answer options from the next round to the last round, all of which may be impacted by the fake history. We punish the agent when the rank of the right answer with the fake input is smaller (i.e., better) than the rank with the "ground-truth" inputs.
We define the gold-critic reward as:
where t is the current training round and a_t^k is the k-th wrong answer the model chooses at round t.
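As a toy illustration of this rank-comparison idea (not the paper's exact Eq. 9/10, which use the N-pair loss and an MRR variant), the reward can be written as:

```python
def gold_critic_reward(rank_fake, rank_gold):
    """Reward that punishes the agent when the right answer ranks
    *better* (smaller 1-indexed rank) under the fake history than under
    the gold history. A hypothetical rank-difference form of the idea."""
    return min(0.0, rank_fake - rank_gold)  # negative iff the fake history helps

# The fake history making the right answer easier to rank is punished...
assert gold_critic_reward(rank_fake=1, rank_gold=5) == -4
# ...while a fake history that properly degrades the ranking is not.
assert gold_critic_reward(rank_fake=5, rank_gold=1) == 0.0
```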
The gold-critic training loss is defined as:
The total loss in the gold-critic sequence training is a combination of the gold-critic loss and the discriminative loss defined in Section 4.1:
The whole training process is summarized in Alg. 1.
5 Experiments

We first present the details of the datasets and the evaluation metrics, and then describe the implementation details. Finally, we analyze the results on the datasets.

5.1 Datasets
VisDial v0.9 contains about 123k image-caption-dialog tuples. The images are all from MS COCO  with multiple objects. The dialog of each image has 10 question-answer pairs, which were collected by pairing two people on Amazon Mechanical Turk to chat with each other about the image. Specifically, the "questioner" is required to "imagine the scene better" by sequentially asking questions about the hidden image. The "answerer" then observes the picture and answers the questions. Note that the picture is visible only to the "answerer" during the game.
VisDial v1.0 is an extension of VisDial v0.9. Images for the training set are all from COCO train2014 and val2014. The dialogs in the validation and test sets were collected on about 10k COCO-like images from Flickr. The test set is split into two parts, 4k images for test-std and 4k images for test-challenge. Answers are provided for the train and val sets, but not for the test set. For the test-std and test-challenge phases, the results must be submitted to the evaluation server.
We also evaluated our proposed model on the GuessWhat?! dataset. The dataset contains 67k images collected from MS COCO  and 155k dialogs comprising about 820k question-answer pairs. The guesser game in GuessWhat?! is to predict the correct object among the object options through a multi-round dialog.
5.2 Evaluation Metrics
For VisDial v0.9, we follow the evaluation protocol established in  and use the retrieval setting to evaluate the responses at each round of the dialog. Specifically, for each question we sort the answer options and use Recall@k, mean reciprocal rank (MRR), and mean rank of the ground-truth answer to evaluate the model. Recall@k is the percentage of questions for which the correct answer option is ranked in the top-k predictions of a model. Mean rank is the average rank of the ground-truth answer option. Mean reciprocal rank is the average of 1/rank of the ground-truth answer option. For VisDial v1.0, we also evaluate our model using normalized discounted cumulative gain (NDCG). NDCG is invariant to the order of options with identical relevance and to the order of options outside of the top-k, where k is the number of answers marked as correct by at least one annotator. We use classification accuracy to evaluate the model on the GuessWhat?! dataset.
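The retrieval metrics above are simple functions of the 1-indexed rank of the ground-truth answer for each question; a reference sketch:

```python
def retrieval_metrics(ranks, k=5):
    """Recall@k, mean rank, and MRR from the 1-indexed ranks of the
    ground-truth answer across all evaluated questions."""
    n = len(ranks)
    recall_k = sum(r <= k for r in ranks) / n     # fraction ranked in top-k
    mean_rank = sum(ranks) / n                    # average rank (lower = better)
    mrr = sum(1.0 / r for r in ranks) / n         # average reciprocal rank
    return recall_k, mean_rank, mrr

# Toy example with three questions whose GT answers rank 1, 2, and 10.
r5, mean_rank, mrr = retrieval_metrics([1, 2, 10], k=5)
assert r5 == 2 / 3
assert mean_rank == 13 / 3
assert abs(mrr - (1.0 + 0.5 + 0.1) / 3) < 1e-9
```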
5.3 Implementation Details
We pre-processed the data as follows. We first constructed a vocabulary of words that appear at least 5 times in the training split. The captions, questions, and answers were then truncated to 24, 15, and 9 words, respectively. The word embedding feature is 512-d and the learnable word embeddings are shared across all words. All the Bi-LSTMs we use are 1-layered and the hidden size for each direction is 512. We used the sigmoid function instead of the softmax function in the history attention unit in Eq. 6. We used the Adam optimizer; the model was pretrained using supervised training for 20 epochs before starting the gold-critic training at a lower base learning rate. The hyper-parameter in Eq. 13 was set to 10. The gradient is also clipped in the gold-critic sequence learning when the probabilities of the wrong answers chosen by the model are smaller than 0.1.
5.4 Ablative Results
We present a few variants of our model to verify the contribution of each component:
Baseline is a model with only one attention module, where the image features are guided by the question and the history in the input triplet. We used Bi-LSTM encoders to encode the question and the history without attention modules.
RCAN-1-w/o E is a model with one feature-wise attention module (described in Eq. 6) but no element-wise attention module (described in Eq. 7).
RCAN-1-w/ E is a model with one attention block which contains a feature-wise attention module and an element-wise attention module.
RCAN-3-w/ E is a model with three residual-connected attention blocks. The model was trained using only the discriminative loss.
RCAN-3-w/ E-GC(NP) is a model with three residual-connected attention blocks. The model was trained in the gold-critic sequence learning. The gold-critic reward is defined in Eq. 9.
RCAN-3-w/ E-GC(MRR) is a model with three residual-connected attention blocks. The model was trained in the gold-critic sequence learning. The gold-critic reward is the mean reciprocal rank which is defined in Eq. 10.
Since the labels for the test split of VisDial v1.0 are not provided, we conducted the ablative experiments on the validation split. As shown in Table 1, RCAN-1-w/ E outperforms the baseline by over 3% on Recall@k, which demonstrates that our feature-wise module and element-wise module obtain a more informative multi-source encoding. RCAN-3-w/ E further improves the MRR of RCAN-1-w/ E by approximately 0.5%. Our gold-critic sequence training with the MRR reward, RCAN-3-w/ E-GC(MRR), outperforms the baseline by an average of 5.5% on Recall@k and achieves the best result over all the variants.
We also visualized the outputs of the attention modules to check whether the gold-critic sequence learning works. As introduced in Section 4.2, we replaced the right answers with wrong answers and checked whether the model is confused by the imperfect history. In Fig. 4, the supervised model gets the right answer with wrong image attention and is not affected by the historical changes. On the contrary, the model trained with gold-critic sequence learning focuses on the right area when given the "ground-truth" history but is properly confused by the historical changes. Figure 5 shows another example generated by our gold-critic model on GuessWhat?!.
5.5 Comparison with State-of-The-Art
We compared our model with the state-of-the-art methods on VisDial v0.9 and v1.0. We briefly introduce the state-of-the-art methods as follows. HCIAE-NP-ATT is the original HCIAE model with the N-pair loss in Eq. 9. AMEM uses a memory network to model the relationship between the current question and the history. CoAtt trains a generative model and a discriminative model with adversarial learning. CorefNMN is a model with neural module networks. Tables 2 and 3 demonstrate that our model outperforms the baseline and state-of-the-art models across most evaluation metrics on the VisDial v0.9 and v1.0 datasets. We also achieve a new state-of-the-art single model on the official VisDial online challenge server. Furthermore, we conducted supplementary experiments on the guesser task of GuessWhat?!. Table 4 shows that our method is comparable to the state-of-the-art methods.
6 Conclusion

In this paper we developed a codec model equipped with a Residual Co-Attention Network (RCAN) for the visual dialog task. RCAN contains a feature-wise co-attention module and an element-wise co-attention module to address the co-reference and visual context in question and history encoding. We proposed a novel training strategy dubbed Gold-Critic Sequence Training (GCST) that utilizes the history responses to make the codec model more sensitive to the history dialog. Extensive experiments on the real-world datasets VisDial and GuessWhat?! show that our method achieves new state-of-the-art single-model results on these benchmarks.
-  H. Agrawal, A. Chandrasekaran, D. Batra, D. Parikh, and M. Bansal. Sort story: Sorting jumbled images and captions into stories. arXiv preprint arXiv:1606.07493, 2016.
-  P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and vqa. arXiv preprint arXiv:1707.07998, 2017.
-  J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Deep compositional question answering with neural module networks. arXiv preprint arXiv:1511.02799, 2015.
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. Vqa: Visual question answering. In ICCV, 2015.
-  J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
-  A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? EMNLP, 2016.
-  A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra. Visual dialog. In CVPR, 2017.
-  H. De Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle, and A. C. Courville. Guesswhat?! visual object discovery through multi-modal dialogue. In CVPR, 2017.
-  C. Deng, Q. Wu, Q. Wu, F. Hu, F. Lyu, and M. Tan. Visual grounding via accumulated attention. In CVPR, 2018.
-  J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
-  H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In CVPR, 2015.
-  A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  R. Higashinaka, K. Imamura, T. Meguro, C. Miyazaki, N. Kobayashi, H. Sugiyama, T. Hirano, T. Makino, and Y. Matsuo. Towards an open-domain conversational system fully based on natural language processing. In COLING, 2014.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 1997.
-  U. Jain, S. Lazebnik, and A. G. Schwing. Two can play this game: visual dialog with discriminative question generation and answering. In CVPR, 2018.
-  A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
-  S. Kottur, J. M. Moura, D. Parikh, D. Batra, and M. Rohrbach. Visual coreference resolution in visual dialog using neural module networks. In ECCV, 2018.
-  B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like people. BEHAV BRAIN SCI, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
-  J. Lu, A. Kannan, J. Yang, D. Parikh, and D. Batra. Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model. In NIPS, 2017.
-  J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016.
-  H. Nam, J.-W. Ha, and J. Kim. Dual attention networks for multimodal reasoning and matching. arXiv preprint arXiv:1611.00471, 2016.
-  S. Quarteroni and S. Manandhar. Designing an interactive open-domain question answering system. NAT LANG ENG, 2009.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
-  S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. In CVPR, 2017.
-  M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE T SIGNAL PROCE, 1997.
-  P. H. Seo, A. Lehrmann, B. Han, and L. Sigal. Visual reference resolution using attention memory for visual dialog. In NIPS, 2017.
-  K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In CVPR, 2016.
-  K. Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NIPS, 2016.
-  Q. Wu, P. Wang, C. Shen, I. Reid, and A. van den Hengel. Are you talking to me? reasoned visual dialog generation through adversarial learning. arXiv preprint arXiv:1711.07613, 2017.
-  Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In CVPR, 2016.
-  C. Zhu, Y. Zhao, S. Huang, K. Tu, and Y. Ma. Structured attentions for visual question answering. In ICCV, 2017.