Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning

11/21/2017 ∙ by Qi Wu, et al. ∙ 0

The Visual Dialogue task requires an agent to engage in a conversation about an image with a human. It represents an extension of the Visual Question Answering task in that the agent needs to answer a question about an image, but it needs to do so in light of the previous dialogue that has taken place. The key challenge in Visual Dialogue is thus maintaining a consistent and natural dialogue while continuing to answer questions correctly. We present a novel approach that combines Reinforcement Learning and Generative Adversarial Networks (GANs) to generate more human-like responses to questions. The GAN helps overcome the relative paucity of training data, and the tendency of the typical MLE-based approach to generate overly terse answers. Critically, the GAN is tightly integrated into the attention mechanism that generates human-interpretable reasons for each answer. This means that the discriminative model of the GAN has the task of assessing whether a candidate answer is generated by a human or not, given the provided reason. This is significant because it drives the generative model to produce high quality answers that are well supported by the associated reasoning. The method also achieves state-of-the-art results on the primary benchmark.




1 Introduction

The combined interpretation of vision and language has enabled the development of a range of applications that have made interesting steps towards Artificial Intelligence, including Image Captioning [11, 34, 37], Visual Question Answering (VQA) [1, 22, 38], and Referring Expressions [10, 12, 41]. VQA, for example, requires an agent to answer a previously unseen question about a previously unseen image, and is recognised as being an AI-Complete problem [1]. Visual Dialogue [5] represents an extension to the VQA problem whereby an agent is required to engage in a dialogue about an image. This is significant because it demands that the agent is able to answer a series of questions, each of which may be predicated on the previous questions and answers in the dialogue. Visual Dialogue thus reflects one of the key challenges in AI and Robotics, which is to enable an agent capable of acting upon the world, that we might collaborate with through dialogue.

Figure 1: Human-like vs. Machine-like responses in a visual dialog. The human-like responses clearly answer the questions more comprehensively, and help to maintain a meaningful dialogue.

Due to the similarity between the VQA and Visual Dialog tasks, VQA methods [19, 40] have been directly applied to solve the Visual Dialog problem. The fact that the Visual Dialog challenge requires an ongoing conversation, however, demands more than just taking into consideration the state of the conversation thus far. Ideally, the agent should be an engaged participant in the conversation, cooperating towards a larger goal, rather than generating single word answers, even if they are easier to optimise. Figure 1 provides an example of the distinction between the type of responses a VQA agent might generate and the more involved responses that a human is likely to generate if they are engaged in the conversation. These more human-like responses are not only longer, they provide reasoning information that might be of use even though it is not specifically asked for.

Previous Visual Dialog systems [5] follow a neural translation mechanism that is often used in VQA, predicting the response given the image and the dialog history using the maximum likelihood estimation (MLE) objective function. However, because this over-simplified training objective focuses only on word-level correctness, the produced responses tend to be generic and repetitive. For example, a simple response of ‘yes’, ‘no’, or ‘I don’t know’ can safely answer a large number of questions and lead to a high MLE objective value. Generating more comprehensive answers, and a deeper engagement of the agent in the dialogue, requires a more engaged training process.

A good dialogue generation model should generate responses indistinguishable from those a human might produce. In this paper, we introduce an adversarial learning strategy, motivated by the previous success of adversarial learning in many computer vision [3, 21] and sequence generation [4, 42] problems. In particular, we frame the task as a reinforcement learning problem in which we jointly train two sub-modules: a sequence generative model that produces response sentences on the basis of the image content and the dialog history, and a discriminator that leverages the generator’s memories to distinguish between human-generated and machine-generated dialogues. The generator aims to generate responses that can fool the discriminator into believing that they are human generated, while the output of the discriminative model is used as a reward to the generative model, encouraging it to generate more human-like dialogue.

Although our proposed framework is inspired by generative adversarial networks (GANs) [9], there are several technical contributions that lead to the final success on the visual dialog generation task. First, we propose a sequential co-attention generative model that aims to ensure that attention can be passed effectively across the image, question and dialog history. The co-attended multi-modal features are combined together to generate a response. Secondly, and significantly, within the structure we propose the discriminator has access to the attention weights the generator used in generating its response. Note that the attention weights can be seen as a form of ‘reason’ for the generated response. For example, it indicates which region should be focused on and what dialog pairs are informative when generating the response. This structure is important as it allows the discriminator to assess the quality of the response, given the reason. It also allows the discriminator to assess the response in the context of the dialogue thus far. Finally, as with most sequence generation problems, the quality of the response can only be assessed over the whole sequence. We follow [42] to apply Monte Carlo (MC) search to calculate the intermediate rewards.

We evaluate our method on the VisDial dataset [5] and show that it outperforms the baseline methods by a large margin. We also outperform several state-of-the-art methods. Specifically, our adversarially learned generative model outperforms our strong baseline MLE model by 1.87% on recall@5, and improves over the previous best reported results by 2.14% on recall@5 and 2.50% on recall@10. Qualitative evaluation shows that our generative model generates more informative responses, and a human study shows that 49% of our responses pass the Turing Test. We additionally implement a model under the discriminative setting (where a candidate response list is given) and achieve state-of-the-art performance.

2 Related work

Figure 2: The adversarial learning framework of our proposed model. Our model is composed of two components, the first being a sequential co-attention generator that accepts as input image, question and dialog history tuples, and uses the co-attention encoder to jointly reason over them. The second component is a discriminator tasked with labelling whether each answer has been generated by a human or the generative model by considering the attention weights. The output from the discriminator is used as a reward to push the generator to generate responses that are indistinguishable from those a human might generate.

Visual dialog

Visual dialog is the latest in a succession of vision-and-language problems that began with image captioning [11, 34, 37], and includes visual question answering [1, 22, 38]. However, in contrast to these classical vision-and-language tasks, which involve at most a single natural language interaction, visual dialog requires the machine to hold a meaningful dialogue in natural language about visual content. Mostafazadeh et al. [20] propose an Image Grounded Conversation (IGC) dataset and task that requires a model to generate natural-sounding conversations (both questions and responses) about a shared image. De Vries et al. [7] propose a ‘GuessWhat’ game style dataset, where one person asks questions about an image to guess which object has been ‘selected’, and the second person answers the questions with ‘yes’/‘no’/NA. Das et al. [5] propose the largest visual dialog dataset, VisDial, by pairing two subjects on Amazon Mechanical Turk to chat about an image. They further formulate the task as a ‘multi-round’ VQA task and evaluate individual responses at each round in a retrieval or multiple-choice setup. Recently, Das et al. [6] propose to use RL to learn the policies of a ‘Questioner-Bot’ and an ‘Answerer-Bot’, based on the goal of selecting the right image that the two agents are talking about, from the VisDial dataset.

Concurrent with our work, Lu et al. [18] propose a similar generative-discriminative model for Visual Dialog. However, there are two differences. First, their discriminative model needs to receive a list of candidate responses and learns to sort this list from the training dataset, which means the model can only be trained when such information is available. Second, their discriminator only considers the generated response and the provided list of candidate responses. Instead, we measure whether the generated response is valid given the attention weights, which reflect both the reasoning of the model and the history of the dialogue thus far. As we show in our experiments in Sec. 4, this procedure results in our generator producing more suitable responses.

Dialog generation in NLP

Text-only dialog generation [15, 16, 23, 30, 39] has been studied for many years in the Natural Language Processing (NLP) literature, and has led to many applications. Recently, the popular ‘Xiaoice’ produced by Microsoft and the ‘Its Alive’ chatbot created by Facebook have attracted significant public attention. In NLP, dialog generation is typically viewed as a sequence-to-sequence (Seq2Seq) problem, or formulated as a statistical machine translation problem [23, 30]. Inspired by the success of the Seq2Seq model [32] in machine translation, [26, 33] build end-to-end dialog generation models using an encoder-decoder model. Reinforcement learning (RL) has also been applied to train dialog systems. Li et al. [15] simulate two virtual agents and hand-craft three rewards (informativity, coherence and ease of answering) to train the response generation model. Recently, some works have made an effort to integrate the Seq2Seq model and RL. For example, [2, 31] introduce real users by combining RL with neural generation.

Li et al. [16] were the first to introduce GANs for dialogue generation as an alternative to human evaluation. They jointly train a generative (Seq2Seq) model to produce response sequences and a discriminator to distinguish between human- and machine-generated responses. Although we also introduce an adversarial learning framework to visual dialog generation in this work, one significant difference is that we need to consider the visual content in both the generative and discriminative components of the system, whereas the previous work [16] only requires textual information. We thus design a sequential co-attention mechanism for the generator and an attention memory access mechanism for the discriminator so that we can jointly reason over the visual and textual information. Critically, the GAN we propose here is tightly integrated into the attention mechanism that generates human-interpretable reasons for each answer. This means that the discriminative model of the GAN has the task of assessing whether a candidate answer is generated by a human or not, given the provided reason. This is significant because it drives the generative model to produce high quality answers that are well supported by the associated reasoning. More details about our generator and discriminator can be found in Sections 3.1 and 3.2 respectively.

Adversarial learning

Generative adversarial networks [9] have enjoyed great successes in a wide range of applications in Computer Vision [3, 21, 24], especially in image generation tasks [8, 43]. The learning process is formulated as an adversarial game in which the generative model is trained to generate outputs to fool the discriminator, while the discriminator is trained not to be fooled. These two models can be jointly trained end-to-end. Some recent works have applied adversarial learning to sequence generation. For example, Yu et al. [42] backpropagate the error from the discriminator to the sequence generator by using policy gradient reinforcement learning. This model shows outstanding performance on several sequence generation problems, such as speech generation and poem generation. The work is further extended to more tasks such as image captioning [4, 28] and dialog generation [16]. Our work is also inspired by the success of adversarial learning, but we carefully extend it according to our application, i.e. Visual Dialog. Specifically, we redesign the generator and discriminator in order to accept multi-modal information (visual content and dialog history). We also apply an intermediate reward for each generation step in the generator; more details can be found in Sec. 3.3.

3 Adversarial Learning for Visual Dialog Generation

In this section, we describe our adversarial learning approach to generating natural dialog responses based on an image. There are several ways of defining the visual based dialog generation task [7, 20]. We follow the one in [5], in which an image, a ‘ground truth’ dialog history that includes an image description, and the question are given; we define each Question-Answer (QA) pair in the history as an utterance. The visual dialog generation model is required to return a response sentence to the question, whose length is its number of words. As in VQA, two types of models may be used to produce the response: generative and discriminative. In a generative decoder, a word sequence generator (for example, an RNN) is trained to fit the ground truth answer word sequences. For a discriminative decoder, an additional candidate response vocabulary is provided and the problem is re-formulated as a multi-class classification problem. The biggest limitation of the discriminative style decoder is that it can produce a response only if that response exists in the fixed vocabulary. Our approach is based on a generative model, not only because a fixed vocabulary undermines the general applicability of the model, but also because a generative model offers a better prospect of being extensible to the problem of generating more meaningful dialogue in future.

In terms of reinforcement learning, our response sentence generation process can be viewed as a sequence of prediction actions that are taken according to a policy defined by a sequential co-attention generative model. This model is critical as it allows attention (and thus reasoning) to pass across image, question, and dialogue history equally. A discriminator is trained to label whether a response is human generated or machine generated, conditioned on the image, question and dialog attention memories. Because we take the dialog and the image as a whole into account, we are in effect measuring whether the generated response fits into the visual dialog. The output from this discriminative model is used as a reward to the generator, pushing it to generate responses that fit better with the dialog history. In order to consider the reward at the local (i.e. word and phrase) level, we use a Monte Carlo (MC) search strategy, and the REINFORCE algorithm [36] is used to update the policy. An overview of our model can be found in Fig. 2. In the following sections, we introduce each component of our model separately.

Figure 3: The sequential co-attention encoder. Each input feature is co-attended by the other two features in a sequential fashion, using Eq. 1-3. The number on each function indicates the sequential order, and the final attended features form the output of the encoder.

3.1 A sequential co-attention generative model

We employ the encoder-decoder style generative model, which has been widely used in sequence generation problems. In contrast to the text-only dialog generation problem, which only needs to consider the dialog history, visual dialog generation additionally requires the model to understand visual information. And distinct from VQA, which only has one round of questioning, visual dialog has multiple rounds of dialog history that need to be accessed and understood. This suggests that an encoder that can combine multiple information sources is required. A naive way of doing this is to represent the inputs (image, history and question) separately and then concatenate them to learn a joint representation. We contend, however, that it is more powerful to let the model selectively focus on regions of the image and segments of the dialog history according to the question.

Based on this, we propose a sequential co-attention mechanism [35]. Specifically, we first use a pre-trained CNN [29] to extract the spatial image features from the convolutional layer, with one feature per image region. The question features are the hidden states of an LSTM, one for each step, where each hidden state is computed from the input word of the question at that step; the number of question features therefore equals the length of the question. Because the history is composed of a sequence of utterances, we extract each utterance feature separately to make up the dialog history features, whose number follows the number of rounds of utterances (QA pairs). Each utterance feature is the last hidden state of an LSTM that accepts the utterance word sequence as input.

Given the encoded image, dialog history and question features, we use a co-attention mechanism to generate attention weights for each feature type, using the other two as guidance in a sequential style. Each co-attention operation (Eq. 1-3) takes an input feature sequence, which is the image, dialog history or question feature sequence, together with two guidances that are outputs of previous attention modules. The parameters of the attention module are learnable; their shapes are set by the feature dimension and by the size of the hidden layers of the attention module. The input sequence length is the number of image regions, the number of dialog rounds or the question length, depending on which feature is being attended.

As shown in Fig. 3, in our proposed process, the initial question feature is first used to attend to the image. The weighted image features and the initial question representation are then combined to attend to utterances in the dialog history, producing the attended dialog history. The attended dialog history and weighted image region features are then jointly used to guide the question attention. Finally, we run the image attention again, guided by the attended question and dialog history, to complete the circle. All three co-attended features are concatenated together and embedded to form the final feature. This vector representation is then fed to an LSTM that computes the probability of generating each token in the target using a softmax function, which forms the response.
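The attention order described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it assumes an additive (tanh) attention form, a shared feature dimension for all three inputs, randomly initialised parameters, and the last question state as the "initial question feature". The names `co_attend`, `init_params` and `sequential_co_attention` are hypothetical.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def co_attend(X, g1, g2, params):
    """Attend over the rows of X under two guidance vectors (None = no guidance)."""
    Wx, Wg1, Wg2, w = params
    guide = [0.0] * len(w)
    for Wg, g in ((Wg1, g1), (Wg2, g2)):
        if g is not None:
            guide = [a + b for a, b in zip(guide, matvec(Wg, g))]
    scores = []
    for x in X:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(Wx, x), guide)]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    alpha = softmax(scores)  # one attention weight per row of X
    attended = [sum(a * x[j] for a, x in zip(alpha, X)) for j in range(len(X[0]))]
    return attended, alpha


def init_params(d, k, rng):
    """Random (hypothetical) parameters: three k x d projections and a k-vector."""
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(k)]
    return mat(), mat(), mat(), [rng.uniform(-0.1, 0.1) for _ in range(k)]


def sequential_co_attention(V, U, Q, k=8, seed=0):
    """Run the four attention steps in the order described in the text."""
    rng = random.Random(seed)
    d = len(V[0])
    p1, p2, p3, p4 = (init_params(d, k, rng) for _ in range(4))
    q0 = Q[-1]  # assumption: last question state is the initial question feature
    v1, _ = co_attend(V, q0, None, p1)   # 1) question attends to the image
    u, a_u = co_attend(U, v1, q0, p2)    # 2) image + question attend to the history
    q, a_q = co_attend(Q, v1, u, p3)     # 3) image + history attend to the question
    v, a_v = co_attend(V, u, q, p4)      # 4) history + question re-attend to the image
    # Concatenate the three attended features; the learned embedding is omitted.
    return v + u + q, {"image": a_v, "history": a_u, "question": a_q}
```

The returned attention weights are the quantities the text later treats as the "reason" for a response; in a trained model they would come from learned rather than random parameters.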

3.2 A discriminative model with attention memories

Our discriminative model is a binary classifier that is trained to distinguish whether the input dialog is generated by humans or machines. In order to consider the visual information and the dialog history, we allow the discriminator to access the attention memories in the generator. Specifically, our discriminator takes as input the attended image and dialog history features produced in the generative model given the question, together with the question and the response generated by the generator. (We also tested using the attended question memory, but the discriminator result was not as good as when using the original question input.) The question-response pair is further sent to an LSTM to obtain a vector representation. All three features are embedded together and sent to a 2-way softmax function, which returns the probability distribution over whether the whole visual dialog is human-natural or not. The probability of the visual dialog being recognised as a human-generated dialog is the output that the generator later receives as its reward.
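As a rough illustration of the classifier just described, the sketch below concatenates three feature vectors, embeds them, and applies a 2-way softmax. It is a sketch under assumptions: the tanh embedding, the weight shapes and the class order (index 1 = human) are choices made here for illustration, and `prob_human` is a hypothetical name.

```python
import math


def prob_human(v_att, u_att, qa_vec, W_embed, W_out):
    """Return P(human-generated) from a 2-way softmax over the embedded features."""
    x = v_att + u_att + qa_vec  # concatenate image, history and QA features
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_embed]
    # W_out has two rows: logits for [machine, human] (assumed class order).
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W_out]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return exps[1] / sum(exps)
```

With all-zero output weights the function returns 0.5, i.e. the discriminator is undecided; training would move this probability towards 1 for human dialogs and towards 0 for generated ones.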

3.3 Adversarial REINFORCE with an intermediate reward

In adversarial learning, we encourage the generator to generate responses that are close to human generated dialogs; in our case, we want the generated response to fit into the visual dialog as well as possible. Policy gradient methods are used to achieve this goal. The probability of the visual dialog being recognised as a human-generated dialog by the discriminator is used as a reward for the generator, which is trained to maximize the expected reward of the generated response (Eq. 7) using the REINFORCE algorithm [36].

Given the input visual information, question and dialog history utterances, the generator generates a response by sampling from the policy. The attended visual and dialog memories are concatenated with the question and the generated answer and fed to the discriminator. We further use the likelihood ratio trick [36] to approximate the gradient of Eq. 7 (giving Eq. 8): the gradient of the log-probability of each word in the generated response is weighted by the reward minus a baseline value. Following [16], we train a critic neural network to estimate the baseline value given the current state under the current generation policy. The critic network takes the visual content, dialog history and question as input, encodes them to a vector representation with our co-attention model and maps the representation to a scalar. The critic neural network is optimised based on the mean squared loss between the estimated reward and the real reward obtained from the discriminator. The entire model can be trained end-to-end, with the discriminator updating synchronously. We use the human generated dialog history and answers as the positive examples and the machine generated responses as negative examples.
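The update described above can be summarised by two scalar losses. The sketch below is a minimal illustration of standard REINFORCE with a baseline, assuming the per-token log-probabilities, the discriminator reward and the critic's baseline estimate are already available as plain floats; the function names are hypothetical.

```python
import math


def reinforce_loss(token_log_probs, reward, baseline):
    """Surrogate loss whose gradient is the baseline-corrected policy gradient."""
    advantage = reward - baseline  # how much better than the critic expected
    return -advantage * sum(token_log_probs)


def critic_loss(reward, baseline):
    """Squared error between the critic's estimate and the discriminator reward."""
    return (reward - baseline) ** 2


# Example: a two-token response with probabilities 0.5 and 0.25.
log_probs = [math.log(0.5), math.log(0.25)]
print(reinforce_loss(log_probs, reward=0.8, baseline=0.3))
print(critic_loss(reward=0.8, baseline=0.3))
```

When the reward exceeds the baseline, minimising the surrogate loss raises the log-probability of the sampled tokens; when it falls below, the same tokens are made less likely. Note that every token receives the same weight here, which is the limitation the intermediate reward addresses.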

Intermediate reward

An issue with the above vanilla REINFORCE is that it only considers a reward value for a finished sequence, and the reward associated with this sequence is used for all actions, i.e., the generation of each token. However, as this is a sequence generation problem, rewards for intermediate steps are necessary. For example, given a question ‘Are they adults or babies?’, the human-generated answer is ‘I would say they are adults’, while the machine-generated answer is ‘I can’t tell’. The above REINFORCE model will give the same low reward to all the tokens of the machine-generated answer, but a proper reward assignment would give the rewards separately, i.e., a high reward to the token ‘I’ and low rewards to the tokens ‘can’t’ and ‘tell’.

Considering that the discriminator is only trained to assign rewards to fully generated sentences, but not intermediate ones, we propose to use Monte Carlo (MC) search with a roll-out (generator) policy to sample tokens. In an N-time MC search, the remaining tokens of each roll-out are sampled based on the roll-out policy and the current state. We run the roll-out policy from the current state until the end of the sequence N times, so the N generated answers share a common prefix. These sequences are fed to the discriminator, and their average score (Eq. 10) is used as the reward for the action of generating the current token. With this intermediate reward, the gradient (Eq. 11) takes the reward of each individual generation action into account.
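The roll-out procedure can be sketched as follows. This is an illustrative sketch, not the authors' code: `rollout_fn` and `score_fn` stand in for the generator's roll-out policy and the discriminator, the toy versions below are invented for the example, and scoring the finished sequence directly at the last token is an assumption borrowed from the SeqGAN convention [42].

```python
import random


def mc_intermediate_rewards(tokens, rollout_fn, score_fn, n_rollouts=5, rng=None):
    """Reward for each token = mean discriminator score over N completed roll-outs."""
    rng = rng or random.Random(0)
    rewards = []
    for k in range(1, len(tokens) + 1):
        prefix = tokens[:k]
        if k == len(tokens):
            # Sequence already complete: score it directly (assumed convention).
            rewards.append(score_fn(prefix))
        else:
            # Complete the shared prefix N times and average the scores.
            scores = [score_fn(rollout_fn(prefix, rng)) for _ in range(n_rollouts)]
            rewards.append(sum(scores) / n_rollouts)
    return rewards


def toy_rollout(prefix, rng):
    """Hypothetical roll-out policy: finish the answer with a random ending."""
    endings = (["would", "say", "adults"], ["ca", "n't", "tell"])
    return prefix + list(rng.choice(endings))


def toy_score(sequence):
    """Hypothetical discriminator: prefers answers that mention 'adults'."""
    return 0.9 if "adults" in sequence else 0.1


print(mc_intermediate_rewards(["i", "ca", "n't", "tell"], toy_rollout, toy_score))
```

Each returned value would replace the single sequence-level reward for its token, so early tokens that can still lead to good answers are not penalised as heavily as the tokens that commit to a poor one.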

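Algorithm 1: Training the visual dialog generator with REINFORCE
Require: pretrained generator and discriminator
for each iteration do
    Train the generator:
    for each generator step do
        Sample an image, dialog history and question from the real data
        Sample a response from the generator policy
        Compute the reward for the sampled response using the discriminator
        Evaluate the policy gradient with Eq. 8 or Eq. 11, depending on whether the intermediate reward (Eq. 10) is used
        Update the generator parameters using the policy gradient
        Update the baseline parameters of the critic
        Teacher forcing: update the generator on the real data using MLE
    Train the discriminator:
    Sample dialogs from the real data
    Sample responses from the generator
    Update the discriminator using the real dialogs as positive examples and the generated ones as negative examples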
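The alternation in Algorithm 1 can be sketched as a small scheduling loop. The three callables are hypothetical placeholders for the actual update steps, and the default of 20 generator steps per discriminator update is taken from the implementation details in Sec. 4.2; everything else is an assumption made for illustration.

```python
def train(num_iterations, generator_rl_step, generator_mle_step,
          discriminator_step, g_steps=20):
    """Alternate generator updates (RL, then teacher forcing) with discriminator updates."""
    for _ in range(num_iterations):
        for _ in range(g_steps):
            generator_rl_step()    # policy-gradient update from the discriminator reward
            generator_mle_step()   # teacher forcing on human responses
        discriminator_step()       # real dialogs positive, generated ones negative


# Usage with stand-in steps that only record the order of calls.
log = []
train(1, lambda: log.append("rl"), lambda: log.append("mle"),
      lambda: log.append("disc"), g_steps=2)
print(log)
```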

Teacher forcing

Although the reward returned from the discriminator has been used to adjust the generation process, we find it is still important to feed human generated responses to the generator when updating the model. Hence, we apply a teacher forcing [14, 16] strategy to update the parameters in the generator. Specifically, at each training iteration, we first update the generator using the reward obtained from the data sampled with the generator policy. Then we sample some data from the real dialog history and use them to update the generator with a standard maximum likelihood estimation (MLE) objective. The whole training process is summarised in Alg. 1.

4 Experiments

We evaluate our model on a recently published visual dialog generation dataset, VisDial [5]. Images in VisDial all come from MS COCO [17] and contain multiple objects in everyday scenes. The dialogs in VisDial are collected by pairing two AMT workers (a ‘questioner’ and an ‘answerer’) to chat with each other about an image. To make the dialog measurable, the image remains hidden from the questioner, whose task is to ask questions about this hidden image to ‘imagine the scene better’. The answerer sees the image and has the task of answering the questions asked by the questioner. Hence, the conversation is more like multiple rounds of visual question answering, and it can only be ended after 10 rounds. There are 83k dialogs in the COCO training split and 40k in the validation split, for a total of 1,232,870 QA pairs, in VisDial v0.9, which is the latest available version thus far. Following [17], we use 80k dialogs for training, 3k for validation and 40k for testing.

4.1 Evaluation Metrics

Different from previous language generation tasks, which normally use the BLEU, METEOR or ROUGE score for evaluation, we follow [17] and use a retrieval setting to evaluate the individual responses at each round of a dialog. Specifically, at test time, besides the image, ground truth dialog history and the question, a list of 100 candidate answers is also given. The model is evaluated on retrieval metrics: (1) rank of the human response, (2) existence of the human response in the top-k ranked responses, i.e., recall@k, and (3) mean reciprocal rank (MRR) of the human response. Since we focus on evaluating the generalization ability of our generator, we simply rank the candidates by the generative model’s log-likelihood scores.
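The retrieval metrics listed above are straightforward to compute once each question has a rank for its human response. The sketch below is illustrative: the function names are hypothetical, ties are resolved optimistically (an assumption), and recall is reported as a percentage to match the tables.

```python
def rank_of_truth(scores, truth_index):
    """1-based rank of the human response when candidates are sorted by score."""
    truth = scores[truth_index]
    return 1 + sum(1 for s in scores if s > truth)  # ties do not lower the rank


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """MRR, mean rank and recall@k (in percent) over a list of 1-based ranks."""
    n = len(ranks)
    metrics = {
        "mrr": sum(1.0 / r for r in ranks) / n,
        "mean_rank": sum(ranks) / n,
    }
    for k in ks:
        metrics[f"recall@{k}"] = 100.0 * sum(r <= k for r in ranks) / n
    return metrics


# Example: log-likelihood scores of three candidates, human response at index 2.
print(rank_of_truth([-4.2, -1.3, -2.0], truth_index=2))
print(retrieval_metrics([1, 2, 10, 20]))
```

In the setting described here each `scores` list would hold the generator's log-likelihood for the 100 candidate answers of one question.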


Columns: Image+Caption, Question, Human Answer, CoAtt-G-MLE, Ours

Image caption: A bathroom with a white bath tub, sink and large window.
Questions:
What color is the bathroom?
Are there any people in there?
Are there towels hanging?
Is there any soap on the sink?
What color are the towels?
What kind of bathtub is it?
Can you see anything out the bathroom window?
Are there curtains on the window?
Is the bathroom light on?
Is there anything else on the sink?
Answers, listed column by column (Human Answer, CoAtt-G-MLE, Ours):
The walls are gray
No folded up
I do n’t think so
A fancy rectangular
No soap
It ’s a tub
Most white
No, on the floor
I do n’t think so
It ’s a shower tub with a shower
No, just the wall
No curtains

Image caption: A motorcycle, moped and a bus parked by the street.
Questions:
What color is the motorcycle?
Is this on a busy street with shops and people?
Is it daylight or night time?
Is the photo in color?
What color are the other cars?
Are there any people walking?
Can you tell what shops businesses they are?
Do you see any traffic lights?
Do you think the motorcycle should be parked on the sidewalk?
Do you see any signs?
Answers, listed column by column (Human Answer, CoAtt-G-MLE, Ours):
It is black and white
It looks like it is not
Yes it is
I see a white van and a blue
Not that i can see
Not really
No, i do not
One, but only a picture
White and blue
It ’s daytime
white and black
i ’m not sure
No i do n’t
It’s black and white
No it is not
It is daytime
One is blue and the other is white
no, there are no people
I ’m not sure , they are in the background
No i do n’t
No, it looks like it ’s parked
I see a sign on the side of road

Image caption: A man in a wet suit carrying a surfboard by some rocks.
Questions:
Is the photo in color?
How old does the man appear to be?
What color wetsuit?
What color surfboard?
Do the rocks appear to be smooth or sharp?
Is he close to the water?
Does it appear to be a beach or private section?
What color is the water dark or light blue?
Does he have any shoes on?
Does he appear to be wet or dry?
Answers, listed column by column (Human Answer, CoAtt-G-MLE, Ours):
I would estimate late 30s
Dark blue
White and red
I would guess they are smooth
Moderately close
Private area
It is blurry so it appears black
I ca n’t see his feet
20 ’s
White with red
I ca n’t tell
light blue
I ca n’t see his feet
I would say 20 ’s
It ’s white with red
They look smooth
I ca n’t tell
It ’s light blue
I ca n’t see his feet
He looks dry


Figure 4: Qualitative results of our model (CoAtt-GAN-w/ -TF) compared with the human ground-truth answers and our baseline model.

4.2 Implementation Details

To pre-process the data, we first lowercase all the texts, convert digits to words, and remove contractions, before tokenizing. The captions, questions and answers are further truncated to ensure that they are no longer than 40, 20 and 20, respectively. We then construct the vocabulary from words that appear at least 5 times in the training split, giving us a vocabulary of 8845 words. The words are represented as one-hot vectors, and 512-d embeddings for the words are learned. These word embeddings are shared across the question, history and decoder LSTMs. All the LSTMs in our model are 1-layered with 512 hidden states. The Adam [13] optimizer is used with a base learning rate that is further decreased during training. We use 5-time Monte Carlo (MC) search for each token. The co-attention generative model is pre-trained using the ground-truth dialog history for 30 epochs. We also pre-train our discriminator (for 30 epochs), where the positive examples are sampled from the ground-truth dialog and the negative examples are sampled from the dialog generated by our generator. The discriminator is updated after every 20 generator-updating steps.
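The truncation and vocabulary construction described above can be sketched as follows. This is a simplified illustration: whitespace tokenisation stands in for the actual tokeniser, digit-to-word conversion and contraction handling are omitted, and the special tokens are an assumption that the text does not mention.

```python
from collections import Counter

# Maximum lengths for captions, questions and answers as stated in the text.
MAX_LEN = {"caption": 40, "question": 20, "answer": 20}


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for the real tokeniser)."""
    return text.lower().split()


def truncate(tokens, kind):
    """Cut a token list to the maximum length for its kind."""
    return tokens[:MAX_LEN[kind]]


def build_vocab(token_lists, min_count=5,
                specials=("<pad>", "<unk>", "<start>", "<end>")):
    """Map every word seen at least min_count times to an integer id."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(list(specials) + words)}


def encode(tokens, vocab):
    """Replace out-of-vocabulary words with the <unk> id."""
    unk = vocab["<unk>"]
    return [vocab.get(t, unk) for t in tokens]
```

Applied to the training split with `min_count=5`, a routine like this would yield the word list whose size the text reports; the exact count depends on the tokenisation details left out here.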

4.3 Experiment results

Baselines and comparative models

We compare our model with a number of baselines and state-of-the-art models. Answer Prior [5] is a naive baseline that encodes answer options with an LSTM and scores them with a linear classifier, which captures ranking by frequency of answers in the training set. NN [5] finds the nearest neighbor images and questions for a test question and its related image. The options are then ranked by their mean-similarity to answers to these questions. Late Fusion (LF) [5] encodes the image, dialog history and question separately; the encodings are then concatenated and linearly transformed into a joint representation. HRE [5] applies a hierarchical recurrent encoder [27] to encode the dialog history, and HREA [5] additionally adds an attention mechanism on the dialogs. Memory Network (MN) [5] maintains each previous question and answer as a ‘fact’ in its memory bank and learns to refer to the stored facts and image to answer the question. A concurrent work [18] proposes HCIAE (History-Conditioned Image Attentive Encoder) to attend on image and dialog features.


Model               MRR     R@1    R@5    R@10   Mean rank
Answer Prior [5]    0.3735  23.55  48.52  53.23  26.50
NN [5]              0.4274  33.13  50.83  58.69  19.62
LF [5]              0.5199  41.83  61.78  67.59  17.07
HRE [5]             0.5237  42.29  62.18  67.92  17.07
HREA [5]            0.5242  42.28  62.33  68.17  16.79
MN [5]              0.5259  42.29  62.85  68.88  17.06
HCIAE [18]          0.5386  44.06  63.55  69.24  16.01
CoAtt-G-MLE         0.5411  44.32  63.82  69.75  16.47
CoAtt-GAN-w/o       0.5415  44.52  64.17  70.31  16.28
CoAtt-GAN-w/        0.5506  45.56  65.16  71.07  15.30
CoAtt-GAN-w/ -TF    0.5578  46.10  65.69  71.74  14.43


Table 1: Performance of generative methods on VisDial v0.9. Higher is better for MRR and recall@k, while lower is better for mean rank.
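The three kinds of metric reported in the tables can be computed from a list of ranks. The sketch below assumes that, for each test question, we already have the 1-based rank of the ground-truth answer among the candidate answers; the function name and the dictionary keys are illustrative only.

```python
def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Compute MRR, recall@k (in percent) and mean rank from 1-based ranks."""
    n = len(ranks)
    metrics = {
        "MRR": sum(1.0 / r for r in ranks) / n,  # mean reciprocal rank
        "Mean": sum(ranks) / n,                  # mean rank, lower is better
    }
    for k in ks:
        # Share of questions whose ground-truth answer is ranked in the top k.
        metrics[f"R@{k}"] = 100.0 * sum(r <= k for r in ranks) / n
    return metrics
```

A single badly ranked answer moves the mean rank far more than it moves MRR or recall@k, which is one reason the mean rank column does not always order models the same way as the other columns.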

From Table 1, we can see that our final generative model, CoAtt-GAN-w/ -TF, performs best on all the evaluation metrics. Compared to the previous state-of-the-art model MN [5], our model improves R@1 by 3.81 percentage points (46.10 vs. 42.29). We also produce better results than the HCIAE [18] model, which gave the previous best results without using any discriminative knowledge. Figure 4 shows some qualitative results of our model. More results can be found in the supplementary material.


Model MRR R@1 R@5 R@10 Mean
LF [5] 0.5807 43.82 74.68 84.07 5.78
HRE [5] 0.5846 44.67 74.50 84.22 5.72
HREA [5] 0.5868 44.82 74.81 84.36 5.66
MN [5] 0.5965 45.55 76.22 85.37 5.46
SAN-QI [40] 0.5764 43.44 74.26 83.72 5.88
HieCoAtt-QI [19] 0.5788 43.51 74.49 83.96 5.84
AMEM [25] 0.6160 47.74 78.04 86.84 4.99
HCIAE-NP-ATT [18] 0.6222 48.48 78.75 87.59 4.81
Ours 0.6398 50.29 80.71 88.81 4.47


Table 2: Performance of discriminative methods on VisDial v0.9. Higher is better for MRR and recall@k, while lower is better for mean rank.

Ablation study

Our model contains several components. In order to verify the contribution of each component, we evaluate several variants of our model.

  • CoAtt-G-MLE is the generative model that uses our co-attention mechanism shown in Sec. 3.1. This model is trained only with the MLE objective, without any adversarial learning strategies. Hence, it can be used as a baseline model for other variants.

  • CoAtt-GAN-w/o is an extension of the above CoAtt-G model with an adversarial learning strategy. The reward from the discriminator is used to guide the generator training, but we only use the global reward to calculate the gradient, as shown in Equ. 8.

  • CoAtt-GAN-w/ uses the intermediate reward, as shown in Equ. 10 and 11.

  • CoAtt-GAN-w/ -TF is our final model, which adds ‘teacher forcing’ after the adversarial learning.

That our baseline CoAtt-G-MLE model outperforms the previous attention-based models (HREA, MN, HCIAE) shows that our co-attention mechanism can effectively encode the complex multi-source information. CoAtt-GAN-w/o produces slightly better results than our baseline model by using the adversarial learning network, but the improvement is limited. The intermediate reward mechanism contributes the most to the improvement: our proposed CoAtt-GAN-w/ model improves over our baseline by about 1% on average. The additional teacher-forcing stage (our final model) brings a further improvement of about 0.5% on average, achieving the best results.
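The average improvements quoted in the ablation study can be checked against Table 1. The text does not say which metrics are averaged; the sketch below assumes one plausible reading, namely the mean gain in percentage points over R@1, R@5 and R@10. The helper name is hypothetical.

```python
# Recall figures (R@1, R@5, R@10) copied from Table 1.
TABLE1_RECALL = {
    "CoAtt-G-MLE": (44.32, 63.82, 69.75),
    "CoAtt-GAN-w/o": (44.52, 64.17, 70.31),
    "CoAtt-GAN-w/": (45.56, 65.16, 71.07),
    "CoAtt-GAN-w/ -TF": (46.10, 65.69, 71.74),
}


def mean_recall_gain(model, reference):
    """Average gain of `model` over `reference` across the recall metrics."""
    pairs = zip(TABLE1_RECALL[model], TABLE1_RECALL[reference])
    gains = [m - r for m, r in pairs]
    return sum(gains) / len(gains)
```

Under this reading, `mean_recall_gain("CoAtt-GAN-w/", "CoAtt-G-MLE")` comes out at roughly 1.30 points and `mean_recall_gain("CoAtt-GAN-w/ -TF", "CoAtt-GAN-w/")` at roughly 0.58 points, in line with the rounded figures of 1% and 0.5%.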

Discriminative setting

We additionally implement a model for the discriminative task on the VisDial dataset [5]. In this discriminative setting, there is no need to generate a string; instead, a pre-defined answer set is given and the problem is formulated as a classification problem. We modify our model by replacing the response generation LSTM (which can be treated as a multi-step classification process) with a single-step classifier. HCIAE-NP-ATT [18] is the original HCIAE model with an n-pair discriminative loss and a self-attention mechanism. AMEM [25] applies a more advanced memory network to model the dependency of the current question on previous attention. Two additional VQA models [19, 40] are used for comparison. Table 2 shows that our model outperforms the previous baseline and state-of-the-art models on all the evaluation metrics.
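A single-step classifier over a pre-defined answer set can be sketched as follows. The paper does not specify the scoring function here, so the dot product between a joint context vector and each candidate answer vector is an assumption, as are the function names; the vectors would come from the encoders in the real model.

```python
import math


def score_candidates(context, candidates):
    """Softmax over dot products between a context vector and candidates."""
    logits = [sum(c * a for c, a in zip(context, cand)) for cand in candidates]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_of(probs, target_index):
    """1-based rank of the target candidate under descending probability."""
    return 1 + sum(p > probs[target_index] for p in probs)
```

The rank returned by `rank_of` for the ground-truth answer is the quantity from which MRR, recall@k and mean rank are derived.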

4.4 Human study

The above experiments verify the effectiveness of our proposed model on the VisDial [5] task. In this section, to check whether our model can generate more human-like dialogs, we conduct a human study.

We randomly sample 1000 results of different lengths from the test dataset, generated by our final model, our baseline model CoAtt-G-MLE, and the Memory Network (MN) [5] model (for which we use the code and pre-trained model provided by the authors). We then ask 3 human subjects to guess whether the last response in the dialog is human-generated or machine-generated; if at least 2 of them agree that it is generated by a human, we say it has passed the Turing Test. Table 3 summarizes the percentage of responses in the dialog that pass the Turing Test (M1). We can see that our model outperforms both the baseline model and the MN model. We also apply our discriminator model from Sec. 3.2 to these 1000 samples, and it classifies nearly 70% of them as human-generated responses (random guess is 50%), which suggests that our final generator successfully fools the discriminator in this adversarial learning. We additionally record the percentage of responses that are evaluated as better than or equal to human responses (M2), according to the human subjects’ manual evaluation. As shown in Table 3, 45% of the responses fall into this case.


Metric MN [5] CoAtt-G-MLE Ours
M1: Percentage of responses that pass the Turing Test 0.39 0.46 0.49
M2: Percentage of responses that are evaluated as better or equal to human responses. 0.36 0.42 0.45


Table 3: Human evaluation on 1000 sampled responses on VisDial v0.9
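The pass criterion behind M1 (at least 2 of the 3 subjects judge the last response to be human-generated) can be sketched as below. The function names and the boolean encoding of votes are assumptions made for illustration.

```python
def passes_turing_test(votes, min_agree=2):
    """True if at least `min_agree` judges labelled the response as human."""
    return sum(votes) >= min_agree


def pass_rate(all_votes):
    """Fraction of responses that pass, given one tuple of votes per response."""
    return sum(passes_turing_test(v) for v in all_votes) / len(all_votes)
```

Each element of `all_votes` holds the three judgments for one sampled response, with `True` meaning "human-generated"; `pass_rate` then yields a fraction on the same scale as the entries of Table 3.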

5 Conclusion

Visual Dialog generation is an interesting topic that requires a machine to understand visual content and natural language dialog, and to be capable of multi-modal reasoning. More importantly, for a human-computer interaction interface for future robotics and AI, the human-likeness of the generated responses is a significant measure, apart from their correctness. In this paper, we have proposed an adversarial learning based approach to encourage the generator to generate more human-like dialogs. Technically, by combining a sequential co-attention generative model that can jointly reason over the image, dialog history and question, and a discriminator that can dynamically access the attention memories, with an intermediate reward, our final proposed model achieves the state of the art on the VisDial dataset. A Turing Test style study also shows that our model can produce more human-like visual dialog responses.