Making History Matter: Gold-Critic Sequence Training for Visual Dialog

02/25/2019 ∙ by Tianhao Yang, et al. ∙ Nanyang Technological University

We study multi-round response generation in visual dialog systems, where a response is generated according to a visually grounded conversational history. Given a triplet of an image, a Q&A history, and the current question, all the prevailing methods follow a codec (i.e., encoder-decoder) fashion in the supervised learning paradigm: a multimodal encoder encodes the triplet into a feature vector, which is then fed into the decoder to generate the current answer, supervised by the ground-truth answer. However, this conventional supervised learning does not take into account the impact of imperfect history in the codec training, violating the conversational nature of visual dialog and thus making the codec more inclined to learn dataset bias than visual reasoning. To this end, inspired by the actor-critic policy gradient in reinforcement learning, we propose a novel training paradigm called Gold-Critic Sequence Training (GCST). Specifically, we intentionally impose wrong answers in the history, obtaining an adverse reward, and see how the historic error impacts the codec's future behavior by subtracting the gold-critic baseline (the reward obtained by using the ground-truth history) from the adverse reward. Moreover, to make the codec more sensitive to the history, we propose a novel attention network called Residual Co-Attention Network (RCAN), which can be effectively trained by using GCST. Experimental results on three benchmarks, VisDial v0.9, VisDial v1.0 and GuessWhat?!, show that the proposed GCST strategy consistently outperforms state-of-the-art supervised counterparts under all metrics.




1 Introduction

Visual dialog is one of the most comprehensive tasks for benchmarking an AI’s comprehension of natural language grounded in a visual scene. A good dialog agent should accomplish a complex series of reasoning sub-tasks: contextual visual perception [31, 21, 6, 28, 18, 32, 29], language modeling [15, 5], and co-reference resolution in dialog history [28, 18] (e.g., identifying what “that” refers to). Thanks to the end-to-end deep neural networks developed for the respective sub-tasks, state-of-the-art visual dialog systems can be built by assembling them into a codec framework [31, 7, 21]. The encoder encodes the triplet (the history question-answer sentences, the image, and the current question sentence) into vector representations; then, the decoder fuses those vectors (usually by concatenation) and decodes the fused one into answer sentences (e.g., by generation [7, 21, 31] or candidate ranking [16, 18]).

So far, one may identify that the key difference between Visual Dialog (VisDial) and another well-known task, Visual Question Answering (VQA) [3, 2, 6, 12, 1, 4, 6, 22], is the exploitation of the history. As shown in Figure 1 (top), in a high-level view, VisDial can be cast as multi-round VQA given the additional language context of history question-answer pairs. Essentially, at each round, the response generated by the agent is “thrown away”; for the next round, the history is artificially “tampered” with the ground-truth answers [7]. Note that this ground-truth history setting is reasonable because it steers the conversation to be evaluable; otherwise, any other response may make the conversation digress into a never-ending open-domain chitchat [24, 14]. However, we argue that exploiting only the ground-truth history is ineffective for the codec model’s supervised training. For example, the ground-truth answer only tells the model that “it is covered by hat” is good, but neglects to show how bad other answers are. Therefore, the resultant model is easily over-grounded in the insufficient ground-truth data, over-fitting to certain vision-language biases rather than learning visual reasoning [19].

In this paper, we propose a novel training strategy that utilizes the history responses more effectively, that is, it makes the codec model more sensitive to history-dependent phenomena such as co-reference resolution and context. As illustrated in Figure 1 (bottom), in a nutshell, we intentionally impose wrong answers in the “tampered” history and see how the model behaves as compared to the “gold” history. Specifically, suppose we are going to train a model at a given round. To gain more insight about the wrong answers, we sample a most probable mistake, e.g., “white”, to be the fake answer instead of the ground truth at that round. Then, we infer two lines of future dialog up to a later round. These two lines are almost identical except for the answer at the sampled round, which in the fake line is replaced by the wrong answer. Then, the model at the later round will output two losses. The first one is called Adverse Reward (AR): the larger the loss, the more successfully the fake answer impacts the future. The second one is called Gold Critic (GC), the conventional loss using the ground-truth history. Interestingly, their difference tells the model how to penalize the mistake: if AR exceeds GC, the mistake does impact the future negatively; otherwise, the mistake is insignificant and the model should focus on correcting others. Finally, this procedure can produce a sequence of training signals, from the future rounds back to the round of the mistake.

We develop a novel codec model equipped with the proposed Residual Co-Attention encoder to address the essential co-reference and visual context in history encoding. Equipped with the proposed Gold-Critic Sequence Training (GCST), we achieve new state-of-the-art single-model results on the real-world VisDial benchmarks: 0.6617 MRR on VisDial v0.9, 0.6282 MRR on VisDial v1.0, and 66.8% accuracy on GuessWhat?! [8]. We also achieve a top-performing 0.567 NDCG score on the official VisDial online challenge server. More ablative studies, qualitative examples, and detailed results are discussed in Section 5.

Figure 2: The framework of our proposed codec model. The model contains three components: Feature Representation (Section 3.1), Residual Co-Attention Network (Section 3.2) and Response Generation (Section 3.3). First, CNNs and Bi-LSTMs are used to extract features from the inputs. Then we propose Residual Co-Attention Network to encode the features of input triplets with attention weights. Finally the model generates the response and ranks answer candidates. Refer to Section 3 for the detailed description.

2 Related Work

Visual Dialog. Visual dialog was recently proposed in [7] and [8], as a successor to a line of vision-and-language problems. A number of vision-and-language tasks have seen great improvement, such as image captioning [10, 11, 17, 2] and visual question answering [2, 1, 4, 6, 22]. However, most vision-and-language problems are based on a single round of language interaction. In contrast, the visual dialog task involves a multi-round dialog, which is more complex. Specifically, the visual dialog task is based on an image and a visually grounded dialog. Das et al. [7] proposed a large-scale free-form visual dialog dataset, which consists of sequential open questions and answers about arbitrary objects in the image. The dialogs were collected by pairing two people on Amazon Mechanical Turk to chat with each other about the images. Another visual dialog task, GuessWhat?!, proposed by [8], focuses on a different aspect: it aims at object discovery with a set of yes/no questions. We follow the first setting in this paper.

Several approaches have been proposed to solve visual dialog problems, following an encoder-decoder structure. Lu et al. [21] proposed a generative-discriminative model for visual dialog. The discriminator considers the generated response from the generator and the provided list of candidate responses. Then the powerful discriminator transfers knowledge to update the generator. Wu et al. [31] used adversarial learning to encourage the generator to generate more human-like dialogs. Seo et al. [28] proposed a new attention mechanism with an attention memory to resolve visual co-reference. Kottur et al. [18] further used neural module networks to deal with visual co-reference resolution in visual dialog.

Attention Mechanism. Attention mechanisms are widely used in vision-and-language problems and have achieved inspiring results. For visual question answering, an attention-based model may attend to both the relevant regions in the image and the phrases in the question. A number of approaches have been proposed that apply a question-based attention module on image features. Yang et al. [32] proposed an attention-based model, SAN, which applies stacked attention networks to produce multiple attention maps for multi-step reasoning. Chen et al. [33] developed an attention model which can encode cross-region relationships and supports the model in answering more complex questions with visual reference resolution. Attention models which attend to the phrases or words in the questions were developed in later studies. Lu et al. [22] developed the co-attention network, which calculates the attention weights of the image regions and question words guided by each other. Nam et al. [23] introduced Dual Attention Networks (DANs), which use multiple reasoning steps to refine the attention weights of image regions and questions. Our model also uses multiple steps to generate the attention weights. However, we apply an extra element-wise co-attention module to encode the multi-modal features. The feature-wise co-attention module and the element-wise co-attention module produce multiple attention maps in a sequential manner, such that each attention module is guided by the outputs of the other attention module.

Reinforcement Learning With Baseline. Reinforcement learning with a baseline is widely applied to image captioning [26], as sequence training performs better with the baseline. As far as we know, no prior work has used RL with a baseline in the visual dialog setting. Different from standard reinforcement learning, we utilize the gold-critic baseline to request more information from assumptions.

3 Our Codec Model

In this section, we describe the details of the proposed model for the visual dialog task. Formally, we follow the definition introduced by Das et al. [7]. The input consists of: 1) an image, 2) a dialog history made up of a caption of the image and the preceding rounds of the dialog, where each round is a “ground-truth” question and answer pair, 3) a follow-up question, and 4) a list of 100 candidate answer options which contains one correct answer. Given this input, the visual dialog model needs to sort the answer options and choose the right one. To perform response generation for this task, as illustrated in Figure 2, our codec model includes three modules: 1) feature representation (Section 3.1), 2) the proposed Residual Co-Attention Network (RCAN) for the encoder (Section 3.2), and 3) a discriminative decoder for response generation by ranking (Section 3.3).

3.1 Feature Representation

We extract features for the image, the dialog history, the question, and the answer.

Image Feature. We extract region-based image features following [2]. We train Faster-RCNN [25] with a ResNet-101 backbone [13] and then feed the image into the network. We choose the top regions with the highest confidences from the outputs of Faster-RCNN and encode them as the visual feature of the image. The image feature is a matrix with one feature vector per selected region, where d is the feature dimension.

Question and Answer Feature. Our feature extraction for questions and answers is the same, so without loss of generality we only describe the question feature. We use a Bi-LSTM [27] to encode the questions. We use a right arrow to denote the forward direction of the Bi-LSTM and a left arrow to denote the backward direction. The question feature is then a matrix with one feature vector per word of the question, where each vector is the concatenation of the hidden states in the two directions.

History Feature. We concatenate the question and the answer of each round into a long “sentence”, named the Q-A pair. Each Q-A pair is represented by the concatenation of the last hidden states in the two directions of a Bi-LSTM run over the pair, where M is the length of the Q-A pair. The history (Q-A pairs) feature is a matrix with one such vector per round of the dialog.

3.2 Residual Co-Attention Network

Figure 3: The framework of the attention block with feature-wise attention modules and element-wise attention modules. Two of the modules are presented in detail. After the feature-wise attention modules, the input triplets are encoded and fed into the element-wise attention modules. We stack the attention blocks in a residual way.

We propose a novel attention model called Residual Co-Attention Network (RCAN) to encode the input features described above with the co-attention mechanism [31, 21, 22, 9]. Given the image features, the history features, and the question features, the encoder encodes the three types of features with a sequence of attention blocks, each of which contains two types of components. The first component is a feature-wise attention module: the encoder calculates the attention weights for each feature type guided by the other two types of features and focuses on the features which are interrelated. There is one such module per feature type. The second component is an element-wise co-attention module, again one per feature type. We apply the two kinds of modules for each feature type recurrently and stack the outputs in a residual way. An overview of the attention block with its two components is illustrated in Figure 3. We now describe the two components in detail.

Feature-wise Co-Attention Module. Without loss of generality, we take as an example to show how the feature-wise attention module works. Inspired by [31, 9], is defined as follows:


where is the output of the last unit in the current attention round and is the output of the last unit in the current attention round in the element-wise attention module. We further define Eq. 1,2,3 as . and for history and image follow the similar definition as above.

Element-wise Co-Attention Module. We also take as an example. The input of includes , which are the outputs of last attention module in the current attention step. can be defined as follows:


where , and . is a vector with all elements set to 1. represents the outer product of vectors and represents the element-wise product. We define Eq. 4,5 as .

We now show how the two components, i.e., Feature-wise Co-Attention Module and Element-wise Co-Attention Module, work recurrently in a residual way in the -th attention step (=1,2,…):


In particular, is the output of the self-attention of the visual feature and is the output of the self-attention of the history feature, as there are no inputs when . After attention rounds in RCAN, the encoded feature . We concatenate the three features together and use a fully connected layer to obtain the final encoding feature: . Note that the key difference between our RCAN and the co-attention proposed by Wu et al. [31] is that we design the element-wise co-attention module to encode the features, which can better exploit the co-reference stored in the history.

3.3 Response Generation

We now describe how the answers are generated for the visual dialog task. We apply a self-attention mechanism to the candidate answer features. We then take the dot product of each candidate answer feature with the final output of RCAN (Section 3.2) to measure the similarity between the final encoding feature and that candidate. We sort the answer candidates by similarity and choose the one with the highest similarity as the prediction.
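The ranking step above can be illustrated with a minimal sketch. It assumes the final encoding feature and the candidate answer features are already available as plain lists of floats; the function name `rank_candidates` and the toy vectors are hypothetical and only illustrate dot-product ranking, not the authors' implementation.

```python
def rank_candidates(encoding, candidate_features):
    """Score each candidate by its dot product with the encoding and
    return candidate indices sorted from most to least similar."""
    scores = [
        sum(e * c for e, c in zip(encoding, cand))
        for cand in candidate_features
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy example: a 3-d encoding and three candidate answer features.
encoding = [0.5, -1.0, 2.0]
candidates = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
]
order, scores = rank_candidates(encoding, candidates)
prediction = order[0]  # index of the candidate with the highest similarity
```

In the actual task there are 100 candidates per question, and the position of the ground-truth option in `order` is what the retrieval metrics in Section 5.2 are computed from.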

In the GuessWhat?! task, the answer candidates are described by the locations and categories of the objects. We concatenate the locations and categories and embed them with a fully connected layer to obtain the answer option features. As before, we calculate the similarities between the answer features and the final encoding feature by dot product.

4 Gold-Critic Sequence Training

The RCAN described in Section 3.2 encodes the “ground truth” triplet and generates the responses. However, by using only supervised learning, RCAN does not take into account the impact of imperfect history. To this end, we intentionally impose wrong answers, chosen by the model itself, in the history, and see how the historic error impacts RCAN’s future behavior by subtracting the proposed gold-critic baseline. We punish the codec if it behaves better when fed with the fake history than with the ground-truth history, since this violates the conversational nature of visual dialog. Next, we briefly describe the loss used in supervised learning in Section 4.1, and then introduce Gold-Critic Sequence Training (GCST) in Section 4.2.

4.1 Discriminative Loss

In the supervised training, we use a metric-learning multi-class N-pair loss [30, 21], which is defined as:


where is the encoder of answer options described in Section 3.3. is the similarity between the final encoding feature and the -th negative answer option and is the similarity between the final encoding feature and the ground truth answer.
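As a rough illustration of this kind of loss, the sketch below uses the standard multi-class N-pair form from the metric-learning literature, log(1 + Σ_i exp(s_i − s_gt)), where s_i are the similarities of the negative options and s_gt is the similarity of the ground-truth answer. This is an assumption about the general shape of the loss, not a reproduction of the paper's equation; the function name `n_pair_loss` is hypothetical.

```python
import math


def n_pair_loss(gt_score, negative_scores):
    """Standard multi-class N-pair form: log(1 + sum_i exp(s_i - s_gt)).

    gt_score: similarity between the encoding and the ground-truth answer.
    negative_scores: similarities between the encoding and the negatives.
    Note: this naive form can overflow for very large score gaps.
    """
    return math.log1p(sum(math.exp(s - gt_score) for s in negative_scores))


# The loss shrinks as the ground-truth answer is scored further above
# the negative options.
loss_close = n_pair_loss(1.0, [1.0, 1.0, 1.0])
loss_far = n_pair_loss(5.0, [1.0, 1.0, 1.0])
```

With three negatives scored exactly as high as the ground truth, the loss equals log(4); pushing the ground-truth score up drives it toward zero.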

4.2 Gold-Critic Reward

We encourage the discriminative model to learn visual reasoning rather than dataset bias. Specifically, when the discriminative model for visual dialog described in Section 3 is fed with a wrong history, it should generate unsatisfactory answers. We punish the model if it behaves better when fed with the fake history than with the ground-truth history, which would be unreasonable. Inspired by the actor-critic policy gradient in reinforcement learning, we treat the discriminative model as an agent whose action is choosing an answer from the answer options. For ease of understanding, we take the gold-critic sequence training on the -th round as an example.

We feed the “ground truth” triplet of the -th round into the discriminative model and obtain the scores of the right answer and the wrong answers. We rank the scores and choose the top K wrong answers as the fake answers. We use the softmax function to calculate the probability of the fake answers, which is denoted as .
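The selection of fake answers can be sketched as follows. The sketch assumes the softmax is taken over the scores of all answer options and then read off at the K selected wrong answers; the text does not say whether the normalization is over all options or only over the K fakes, so this choice, like the function name `select_fake_answers`, is an assumption.

```python
import math


def select_fake_answers(scores, gt_index, k):
    """Pick the k highest-scoring wrong answers and their softmax
    probabilities (softmax over all options; an assumption)."""
    wrong = [i for i in range(len(scores)) if i != gt_index]
    top_wrong = sorted(wrong, key=lambda i: scores[i], reverse=True)[:k]

    # Numerically stable softmax over all option scores.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    probs = [exps[i] / total for i in top_wrong]
    return top_wrong, probs


# Toy example with four options, where option 2 is the ground truth.
scores = [2.0, 1.0, 3.0, 0.0]
fake_indices, fake_probs = select_fake_answers(scores, gt_index=2, k=2)
```

Here the two most confident mistakes are options 0 and 1, and option 0 receives the larger probability because the model scored it higher.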

We replace the right answer of the -th round with the fake answers chosen by the model, thereby faking the history. After that, we use the fake history as the history input and feed the new triplet into the discriminative model to sort the answer options for every round from the next round to the last round, all of which may be impacted by the fake history. We punish the agent when the rank of the right answer with the fake input is smaller (i.e., better) than the rank with the “ground truth” inputs.

We define the gold-critic reward as:


where is the current round in the training and is the -th wrong answer the model chooses at the t round.

The gold-critic training loss is defined as:


The total loss in the gold-critic sequence training is a combination of and defined in 4.1:


The whole training process is summarized in Algorithm 1.

Input: pre-trained model
for each round do
   Sample the fake answers
   for each fake answer do
      Fake the history with the fake answer
      for each subsequent round do
         Compute the fake score and the gold-critic baseline
         Compute the reward with Eq. 10
      end for
      Calculate the total reward with Eq. 11
   end for
   Update the model with the loss in Eq. 13
end for
Algorithm 1: Discriminative model with gold-critic learning
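The inner loop of the procedure, comparing the model's behavior under the fake history with its behavior under the ground-truth history, can be sketched roughly as below. The sketch assumes the per-round reward is the reciprocal rank of the right answer (the MRR variant mentioned in Section 5.4) and simply sums the fake-minus-gold difference over the future rounds. How this quantity is weighted by the fake-answer probabilities and combined with the discriminative loss is not shown; the function names are hypothetical.

```python
def reciprocal_rank(scores, gt_index):
    """1 / rank of the ground-truth option (rank 1 = highest score)."""
    rank = 1 + sum(1 for s in scores if s > scores[gt_index])
    return 1.0 / rank


def fake_minus_gold(fake_rounds, gold_rounds, gt_indices):
    """Sum over future rounds of (reward with fake history) minus
    (reward with ground-truth history). A positive value means the
    model did better with the fake history, which should be punished."""
    total = 0.0
    for fake_scores, gold_scores, gt in zip(fake_rounds, gold_rounds, gt_indices):
        total += reciprocal_rank(fake_scores, gt) - reciprocal_rank(gold_scores, gt)
    return total


# Two future rounds with three answer options each.
gold_rounds = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]]  # scores with gold history
fake_rounds = [[0.4, 0.6, 0.0], [0.2, 0.7, 0.1]]  # scores with fake history
gt_indices = [0, 1]
gap = fake_minus_gold(fake_rounds, gold_rounds, gt_indices)
```

In this toy case the fake history pushes the right answer from rank 1 to rank 2 in the first future round and changes nothing in the second, so the gap is negative: the model is sensitive to the history, which is the desired behavior.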

5 Experiments

In the following, we evaluate our proposed approach on three visual dialog datasets: VisDial v0.9 [7], VisDial v1.0 [7] and GuessWhat?! [8]. We first present the details of the datasets and the evaluation metrics, and then describe the implementation details. Finally, we analyze the results on the datasets.

5.1 Datasets

VisDial v0.9 contains about 123k image-caption-dialog tuples. The images are all from MS COCO [20] and contain multiple objects. The dialog of each image has 10 question-answer pairs, which were collected by pairing two people on Amazon Mechanical Turk to chat with each other about the image. Specifically, the “questioner” is required to “imagine the scene better” by sequentially asking questions about the hidden image. The “answerer” then observes the picture and answers the questions. Note that the picture is only visible to the “answerer” during the game.

VisDial v1.0 is an extension of VisDial v0.9. Images for the training set are all from COCO train2014 and val2014. The dialogs in the validation and test sets were collected on about 10k COCO-like images from Flickr. The test set is split into two parts: 4k images for test-std and 4k images for test-challenge. Answers are provided for the train and val sets, but not for the test set. For the test-std and test-challenge phases, the results must be submitted to the evaluation server.

We also evaluated our proposed model on the GuessWhat?! dataset. The dataset contains 67k images collected from MS COCO [20] and 155k dialogs including about 820k question-answer pairs. The guesser game in GuessWhat?! is to predict the correct object among the object options through a multi-round dialog.

5.2 Evaluation Metrics

For VisDial v0.9, we follow the evaluation protocol established in [7] and use the retrieval setting to evaluate the responses at each round in the dialog. Specifically, for each question we sort the answer options and use Recall@k, mean reciprocal rank (MRR) and mean rank of the ground-truth answer to evaluate the model. Recall@k is the percentage of questions for which the correct answer option is ranked in the top k predictions of a model. Mean rank is the average rank of the ground-truth answer option. Mean reciprocal rank is the average of 1/rank of the ground-truth answer option. For VisDial v1.0, we also evaluate our model using normalized discounted cumulative gain (NDCG). NDCG is invariant to the order of options with identical relevance and to the order of options outside of the top K, where K is the number of answers marked as correct by at least one annotator. We use classification accuracy to evaluate the model on the GuessWhat?! dataset.
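The three retrieval metrics follow directly from the rank of the ground-truth option for each question. The sketch below computes them from a list of 1-based ranks; the function name `retrieval_metrics` and the toy ranks are illustrative only.

```python
def retrieval_metrics(gt_ranks, ks=(1, 5, 10)):
    """Compute Recall@k (in percent), MRR and mean rank from the
    1-based ranks of the ground-truth answer options."""
    n = len(gt_ranks)
    metrics = {
        f"R@{k}": 100.0 * sum(1 for r in gt_ranks if r <= k) / n for k in ks
    }
    metrics["MRR"] = sum(1.0 / r for r in gt_ranks) / n
    metrics["Mean"] = sum(gt_ranks) / n
    return metrics


# Four questions whose ground-truth answers were ranked 1st, 2nd,
# 10th and 20th out of the candidate options.
metrics = retrieval_metrics([1, 2, 10, 20])
```

Higher is better for Recall@k and MRR, while lower is better for mean rank, which is why the result tables report the latter separately.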

5.3 Implementation Details

We pre-processed the data as follows. We first constructed a vocabulary of words that appear at least 5 times in the training split. The captions, questions and answers were then truncated to 24, 15 and 9 words, respectively. The word embedding feature is 512-d and the learnable word embeddings are shared across all words. All the Bi-LSTMs we use are single-layer, and the hidden size for each direction is 512. We used the sigmoid function instead of the softmax function in the history attention unit in Eq. 6. We used the Adam optimizer. The model was pretrained with supervised training for 20 epochs before starting the gold-critic training, each stage with its own base learning rate. The hyper-parameter in Eq. 13 was set to 10. In the gold-critic sequence learning, the gradient is also clipped when the probabilities of the wrong answers chosen by the model are smaller than 0.1.
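The vocabulary construction and truncation described above can be sketched as follows. The minimum count of 5 and the maximum lengths (24, 15 and 9) come from the text; the special tokens `<pad>` and `<unk>`, the padding of short sequences, and the function names are assumptions made for illustration.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_count=5):
    """Keep words that appear at least min_count times.
    The <pad> and <unk> entries are assumed special tokens."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, count in sorted(counts.items()):
        if count >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab, max_len):
    """Truncate to max_len words, map to ids, and pad (an assumption)."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# Toy training split: three frequent words and one rare word.
train = [["is", "it", "sunny"]] * 5 + [["rare"]]
vocab = build_vocab(train, min_count=5)
answer_ids = encode(["is", "rare"], vocab, max_len=9)       # answers: 9 words
question_ids = encode(["is", "it", "sunny"] * 10, vocab, max_len=15)
```

Words below the frequency threshold, such as "rare" here, map to the unknown token, and long questions are cut to the first 15 words.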

Figure 4: Visualizations of the supervised model and the gold-critic model with “ground truth” and “fake” history on VisDial v1.0. The red regions in the images are visual attention and the numbers are history attention weights. The supervised model gets the right answer with wrong image attention and is not affected by historical changes. The gold-critic model is sensitive to historical changes.

Figure 5: Samples generated by our model on GuessWhat?!. The green bounding boxes highlight the right predictions with “ground-truth” histories. The red bounding boxes highlight the wrong predictions with “fake” histories. The model is confused by historical changes.

Model                  MRR     R@1    R@5    R@10   Mean rank
Baseline               0.5837  44.52  74.77  84.84  5.56
RCAN-1-w/o E           0.6181  48.29  78.23  87.76  4.77
RCAN-1-w/ E            0.6267  49.14  79.19  88.42  4.61
RCAN-3-w/ E            0.6314  49.70  79.67  88.41  4.51
RCAN-3-w/ E-GC(NP)     0.6378  50.46  80.27  89.21  4.33
RCAN-3-w/ E-GC(MRR)    0.6395  50.83  80.08  89.17  4.34
Table 1: Results of discriminative models on the validation split of VisDial v1.0 dataset.

5.4 Ablative Results

We present a few variants of our model to verify the contribution of each component:
Baseline is a model with only one attention module, where the image features are guided by the question and the history in the input triplet. We used Bi-LSTM encoders to encode the question and the history without attention modules.
RCAN-1-w/o E is a model with one feature-wise attention module (described in Eq. 6). The model has no element-wise attention module (described in Eq. 7).
RCAN-1-w/ E is a model with one attention block which contains a feature-wise attention module and an element-wise attention module.
RCAN-3-w/ E is a model with three residual-connected attention blocks. The model was trained using the discriminative loss.
RCAN-3-w/ E-GC(NP) is a model with three residual-connected attention blocks. The model was trained with gold-critic sequence learning. The gold-critic reward is defined in Eq. 9.
RCAN-3-w/ E-GC(MRR) is a model with three residual-connected attention blocks. The model was trained with gold-critic sequence learning. The gold-critic reward is the mean reciprocal rank, which is defined in Eq. 10.

Since the labels for the test split of VisDial v1.0 are not provided, we conducted the ablative experiments on the validation split. As shown in Table 1, RCAN-1-w/ E outperforms the baseline by more than 3 points on every Recall@k, which demonstrates that our feature-wise module and element-wise module can obtain a more informative multi-source encoding. RCAN-3-w/ E further improves the MRR of RCAN-1-w/ E by approximately 0.5 points. Our gold-critic sequence training with the MRR reward, RCAN-3-w/ E-GC(MRR), outperforms the baseline by about 5 points on average on Recall@k, and achieves the best MRR and R@1 among all the variants.

We also visualized the outputs of the attention modules to check whether the gold-critic sequence learning works. As introduced in Section 4.2, we replaced the right answers with wrong answers and checked whether the model is confused by the imperfect history. In Figure 4, the supervised model gets the right answer with wrong image attention and is not affected by the historical changes. In contrast, the model trained with gold-critic sequence learning focuses on the right area when given the “ground-truth” history but is confused by the historical changes. Figure 5 shows another example generated by our gold-critic model on GuessWhat?!.

Model                MRR     R@1    R@5    R@10   Mean rank
LF [7]               0.5807  43.82  74.68  84.07  5.78
HRE [7]              0.5846  44.67  74.50  84.22  5.72
HREA [7]             0.5868  44.82  74.81  84.36  5.66
MN [7]               0.5965  45.55  76.22  85.37  5.46
SAN-QI [32]          0.5764  43.44  74.26  83.72  5.88
HieCoAtt-QI [22]     0.5788  43.51  74.49  83.96  5.84
AMEM [28]            0.6160  47.74  78.04  86.84  4.99
HCIAE-NP-ATT [21]    0.6222  48.48  78.75  87.59  4.81
CoAtt [31]           0.6398  50.29  80.71  88.71  4.47
CorefNMN [18]        0.641   50.92  80.18  88.81  4.45
Ours                 0.6617  53.34  81.93  90.08  4.15
Table 2: Results of discriminative models on VisDial v0.9.

Model            NDCG   MRR    R@1    R@5    R@10   Mean rank
LF [7]           0.453  0.554  40.95  72.45  82.83  5.95
HRE [7]          0.455  0.542  39.93  70.45  81.50  6.41
MN [7]           0.475  0.555  40.98  72.30  83.30  5.92
CorefNMN [18]    0.547  0.615  47.55  78.10  88.80  4.40
Ours             0.567  0.621  48.31  78.95  88.33  4.51
Table 3: Results of discriminative models on the test-standard split of VisDial v1.0 dataset.

Model           Train err  Val err  Test err
LSTM [8]        27.9%      37.9%    38.7%
HRED [8]        32.6%      38.2%    39.0%
LSTM+VGG [8]    26.1%      38.5%    39.2%
HRED+VGG [8]    27.4%      38.4%    39.6%
ATT [9]         26.7%      33.7%    34.2%
Ours            26.1%      32.3%    33.2%
Table 4: Results on the guesser game of GuessWhat?!

5.5 Comparison with State-of-The-Art

We compared our model with the state-of-the-art methods on VisDial v0.9 and v1.0. We briefly introduce the state-of-the-art methods as follows. HCIAE-NP-ATT is the original HCIAE model with the N-pair loss in Eq. 9. AMEM uses a memory network to model the relationship between the current question and the history. CoAtt trains a generative model and a discriminative model with adversarial learning. CorefNMN is a model with neural module networks. Tables 2 and 3 demonstrate that our model outperforms the baseline and state-of-the-art models across most evaluation metrics on the VisDial v0.9 and v1.0 datasets. We also achieve a new state-of-the-art single-model result on the official VisDial online challenge server. Furthermore, we conducted supplementary experiments on the guesser task of GuessWhat?!. Table 4 shows that our method is comparable to the state-of-the-art methods.

6 Conclusion

In this paper, we develop a codec model equipped with the Residual Co-Attention Network (RCAN) for the visual dialog task. RCAN contains a feature-wise co-attention module and an element-wise co-attention module to address the co-reference and visual context in question and history encoding. We propose a novel training strategy dubbed Gold-Critic Sequence Training (GCST) that utilizes the history responses to make the codec model more sensitive to the history dialog. Extensive experiments on the real-world datasets VisDial and GuessWhat?! show that our approach achieves new state-of-the-art single-model results on the benchmarks.