Improving Generative Visual Dialog by Answering Diverse Questions

09/23/2019 ∙ by Vishvak Murahari, et al. ∙ 18

Prior work on training generative Visual Dialog models with reinforcement learning (Das et al.) has explored a Qbot-Abot image-guessing game and shown that this 'self-talk' approach can lead to improved performance at the downstream dialog-conditioned image-guessing task. However, this improvement saturates and starts degrading after a few rounds of interaction, and does not lead to a better Visual Dialog model. We find that this is due in part to repeated interactions between Qbot and Abot during self-talk, which are not informative with respect to the image. To improve this, we devise a simple auxiliary objective that incentivizes Qbot to ask diverse questions, thus reducing repetitions and in turn enabling Abot to explore a larger state space during RL, i.e., be exposed to more visual concepts to talk about and more varied questions to answer. We evaluate our approach via a host of automatic metrics and human studies, and demonstrate that it leads to better dialog, i.e., dialog that is more diverse (less repetitive), consistent (fewer conflicting exchanges), fluent (more human-like), and detailed, while remaining comparably image-relevant to prior work and ablations.


Code Repositories

visdial-diversity

Pytorch implementation of https://arxiv.org/pdf/1909.10470.pdf



1 Introduction

Figure 1:
(a) Left: Prior work on training generative Visual Dialog models with RL on an image-guessing task between Q-bot and A-bot visdial_rl leads to repetitive dialog. Right: We devise an auxiliary objective that incentivizes Q-bot to ask diverse questions, thus reducing repetitions and enabling A-bot to be exposed to more varied questions during RL, overall leading to better dialog, as measured by automatic metrics and human studies.
(b) Cosine similarity of successive dialog state embeddings within Q-bot. Prior work visdial_rl has high similarity. Our approach explicitly minimizes this similarity, leading to more diverse dialog.

Our goal is to build agents that can 'see' and 'talk', i.e., agents that can perceive the visual world and communicate this understanding in natural language conversations in English. To this end, visdial; vries_cvpr17 proposed the task of Visual Dialog (given an image, a dialog history consisting of a sequence of question-answer pairs, and a follow-up question about the image, predict a free-form natural language answer to the question), along with a dataset and evaluation metrics (visualdialog.org, guesswhat.ai).

Posing Visual Dialog as a supervised learning problem is unnatural. This is because at every round of dialog, the agent’s answer prediction is thrown away and it gets access to ground-truth dialog history, which prevents it from steering conversations during training. This leads to compounding errors over long-range sequences at test time, a problem also common in training recurrent neural networks for language modeling bengio_nips15; ross_aistats11; ranzato_iclr16.

To overcome this, visdial_rl devised a goal-driven approach for training Visual Dialog agents with deep reinforcement learning. This is formulated as a game between two agents – Q-bot (that asks questions) and A-bot (that answers questions). Q-bot is shown a one-line description of an unseen image that A-bot has access to and Q-bot is allowed to ask questions (in natural language) to A-bot for a fixed number of rounds and simultaneously make predictions of the unseen image. Both agents are rewarded for Q-bot’s image-guessing performance and trained with REINFORCE williams1992simple to optimize this reward. Thus, there is incentive for Q-bot to ask questions informative of the hidden image, and for A-bot to provide meaningful answers.

While this reinforcement learning approach leads to better performance on the image-guessing task than agents trained with supervised learning, it has a few shortcomings: 1) image-guessing performance degrades after a few rounds of dialog (Fig. 2(e)), and 2) these improvements over supervised learning do not translate to an improved A-bot. Responses from this Visual Dialog agent are not necessarily better (on automatic metrics or human judgements); the Q-bot-A-bot pair is merely sufficiently in sync to do well at image-guessing.

We begin by understanding why this is the case, and find that Q-bot-A-bot dialog during ‘self-talk’ often tends to be repetitive: the same question-answer pairs get repeated across rounds (Fig. 1(a), left). Since repeated interactions convey no additional information, image-guessing performance saturates, and even starts to degrade as the agent forgets useful context from the past.

These repetitions are due to high similarity in Q-bot’s context vectors of successive rounds driving question generation (Fig. 1(b)). To address this, we devise a smooth-L1 penalty that penalizes similarity in successive state vectors (Section 3). This incentivizes Q-bot to ask diverse questions, thus reducing repetitions and in turn enabling A-bot to explore a larger part of the state space during RL, i.e., to be exposed to more visual concepts to talk about and more varied questions to answer (Section 6).

Note that a trivial failure mode with this penalty is for Q-bot to start generating diverse but totally image-irrelevant questions, which are not useful for the image-guessing task. A good balance between diversity and image-relevance in Q-bot’s questions is necessary to improve at this task.

We extensively evaluate each component of our approach against prior work and baselines:

  • Q-bot on diversity and image-relevance of generated questions during Q-bot-A-bot self-talk. We find that diverse-Q-bot asks more novel questions while still being image-relevant.

  • Q-bot-A-bot self-talk on consistency, fluency, level of detail, and human-interpretability, through automatic metrics and human studies. We find that diverse-Q-bot-A-bot dialog after RL is more consistent, fluent, and detailed.

  • A-bot on precision and recall of generated answers on the VisDial dataset visdial (quality of answers to human questions). Training Diverse-Q-bot + A-bot with RL does not lead to a drop in accuracy on VisDial.

2 Preliminaries

We operate in the same setting as visdial_rl – an image-guessing task between a questioner (Q-bot) and an answerer (A-bot) – where Q-bot has to guess the image A-bot has access to by asking questions in multi-round dialog.

We adopt the same training paradigm (github.com/batra-mlp-lab/visdial-rl), consisting of 1) a supervised pre-training stage where Q-bot and A-bot are trained with Maximum Likelihood Estimation objectives on the VisDial dataset visdial, and 2) a self-talk RL finetuning stage where Q-bot and A-bot interact with each other and the agents are rewarded for each successive exchange based on incremental improvements in guessing the unseen image. We learn parameterized policies for Q-bot and A-bot, which decide what tokens to utter (actions: the question and answer at every dialog round) conditioned on the context available to the agent (its state representation). Q-bot additionally makes an image feature prediction at every dialog round (treated as a deterministic action with a continuous action space), and the reward is the change in distance to the true representation before and after a dialog round. We use REINFORCE williams1992simple to update the parameters of both agents.
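The per-round reward described above can be sketched in a few lines. This is only an illustration: the use of Euclidean distance, the function names, and the convention that the first prediction is made before any question is asked are assumptions, not details confirmed by the text.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def round_rewards(true_feature, predictions):
    """Reward for each dialog round: how much closer the prediction got.

    predictions[0] is Q-bot's guess before the first exchange and
    predictions[t] is its guess after round t (hypothetical convention).
    A positive reward means the round moved the guess toward the image.
    """
    rewards = []
    for prev, curr in zip(predictions, predictions[1:]):
        before = l2_distance(true_feature, prev)
        after = l2_distance(true_feature, curr)
        rewards.append(before - after)
    return rewards


# Toy example: the first round helps, the second round hurts.
example = round_rewards([0.0, 0.0], [[3.0, 4.0], [0.0, 2.0], [0.0, 3.0]])
```

Because each reward is a difference of distances, the rewards over a dialog telescope to the total improvement between the first and last guess, so a round that repeats an earlier exchange and leaves the prediction unchanged earns nothing.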

Our transition from supervised to RL is gradual – we supervise for rounds and have policy-gradient updates for the remaining , starting from till , one round at a time. After reaching , repeating this procedure from led to further (marginal) improvements.
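A minimal sketch of such a gradual schedule is given below. The function name, its parameters, and the default of annealing down to zero supervised rounds are hypothetical; the actual start and end points used in the paper are not recoverable from this text.

```python
def curriculum(num_rounds, start_supervised, end_supervised=0):
    """Yield (supervised_rounds, rl_rounds) for each curriculum stage.

    The first `supervised_rounds` rounds of every dialog are trained with
    supervision and the remaining rounds with policy-gradient updates.
    The number of supervised rounds shrinks by one per stage.
    All argument values are illustrative assumptions.
    """
    for k in range(start_supervised, end_supervised - 1, -1):
        yield k, num_rounds - k


# Toy example with 4-round dialogs, starting from 3 supervised rounds.
stages = list(curriculum(num_rounds=4, start_supervised=3))
```

Each stage hands one more round over to RL, so the agents are never asked to recover from a fully self-generated dialog before they have adapted to a partially self-generated one.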

Both Q-bot and A-bot are modeled as Hierarchical Recurrent Encoder-Decoder architectures serban_acl16. Q-bot’s fc7 simonyan_iclr15 feature prediction of the unseen image is conditioned on the dialog history so far via a regression head (please refer to the appendix for architecture details).

Table 1: Q-bot diversity and relevance on v1.0 val. ↑ indicates higher is better. ↓ indicates lower is better.
Columns (Diversity): # Novel questions, # Unique questions, Mutual overlap, Ent-n (two values of n), Dist-n (two values of n)
Columns (Relevance): Negative log likelihood
Rows: Baseline: visdial_rl; SL: Q-bot + A-bot; SL: Diverse-Q-bot + A-bot; RL: Diverse-Q-bot + A-bot
Table 2: A-bot performance on VisDial v1.0 visdial. ↑ indicates higher is better. ↓ indicates lower is better.
Splits: val, test-std
Columns (per split): NDCG, MRR, R@k (three cutoffs), Mean Rank
Rows: Baseline: visdial_rl; SL: A-bot; RL: A-bot (finetuned with Diverse-Q-bot)
Figure 2: (a) Consistency. (b) Fluency. (c) Detail. (d) Top- Acc. (e) Q-bot-A-bot image-guessing task performance. (a-d): Human evaluation of Q-bot-A-bot dialog over images and human subjects for each model variant. (e): Percentile rank (higher is better) of the true image (shown to A-bot) as retrieved using fc7 feature predictions from Q-bot.

3 Smooth-L1 Penalty on Question Repetition

Our goal is to encourage Q-bot to ask a diverse set of questions so that when A-bot is exposed to them during RL finetuning, it is better able to explore its state space (A-bot’s state space is characterized by a representation of the question, image, and the dialog history so far). Furthermore, asking diverse questions allows Q-bot-A-bot exchanges across rounds to be more informative of the image, and thus more useful for the image-guessing task.

We observe that agents trained using the paradigm proposed by visdial_rl suffer from repetition of context across multiple rounds of dialog: similar dialog state embeddings across multiple rounds lead to repeated utterances and similar predicted image representations, which in turn further increases similarity in state embeddings. Fig. 1(b) shows increasing similarity of successive state embeddings across dialog rounds for visdial_rl.

To encourage Q-bot to ask diverse questions, we propose a simple auxiliary loss that penalizes similar dialog state embeddings. Specifically, given Q-bot states – – in addition to maximizing likelihood of question (during supervised pre-training), or image-guessing reward (during self-talk RL finetuning), we maximize a smooth-L1 penalty on ,

(1)

resulting in as an additional term in the overall objective ( no. of dialog rounds).

Note that in order to maximize this penalty, Q-bot has to push successive states further apart, which can only happen if the current state is updated using a question-answer pair that is different from the previous exchange, thus overall forcing Q-bot to ask different questions in successive dialog rounds. Similar diversity objectives have also been explored in li_emnlp16 as reward heuristics.
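To make the idea concrete, here is a minimal sketch of a smooth-L1 term over successive Q-bot states. It is an illustration under stated assumptions, not the paper's Eqn. 1: the exact quantity the penalty is applied to and its threshold are not recoverable from this text, so the sketch assumes the Euclidean distance between successive state vectors and the standard smooth-L1 form with a hypothetical `threshold` of 1.0.

```python
import math


def smooth_l1(delta, threshold=1.0):
    """Standard smooth-L1: quadratic below the threshold, linear above it.

    The threshold value is an assumption made for illustration.
    """
    if delta < threshold:
        return 0.5 * delta * delta / threshold
    return delta - 0.5 * threshold


def state_distance(s_prev, s_curr):
    """Euclidean distance between successive states (assumed form)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s_prev, s_curr)))


def diversity_term(states, threshold=1.0):
    """Term added to a loss that is being minimized.

    Maximizing the smooth-L1 of successive-state distances is the same
    as minimizing its negative, hence the leading minus sign.
    """
    total = sum(
        smooth_l1(state_distance(prev, curr), threshold)
        for prev, curr in zip(states, states[1:])
    )
    return -total


# A state that does not move contributes nothing; a large move is rewarded.
toy_states = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]
toy_term = diversity_term(toy_states)
```

The linear regime is what makes the smooth-L1 form attractive here: once successive states are already far apart, the gradient stops growing, so the term encourages change between rounds without rewarding arbitrarily large jumps.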

Before arriving at Eqn. 1, and following Fig. 1(b), we also experimented with directly minimizing cosine similarity between successive state vectors. This led to the network learning large biases to flip the direction of successive vectors (without affecting norms), leading to question repetitions in alternating rounds.
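The failure mode of the cosine objective can be reproduced on toy vectors. The sketch below (helper names are hypothetical) computes the successive-round cosine similarity plotted in Fig. 1(b) and shows that a sequence of states that simply flips sign every round achieves the lowest possible successive similarity while returning to the same state every other round.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def successive_similarities(states):
    """Cosine similarity between each state and the one before it."""
    return [cosine_similarity(p, c) for p, c in zip(states, states[1:])]


# States that flip direction each round: minimal successive similarity,
# yet rounds 1 and 3 (and 2 and 4) are identical.
flipping = [[1.0, 2.0], [-1.0, -2.0], [1.0, 2.0], [-1.0, -2.0]]
sims = successive_similarities(flipping)
two_apart = cosine_similarity(flipping[0], flipping[2])
```

Here every entry of `sims` is approximately -1 while `two_apart` is approximately 1, which mirrors the alternating-round repetitions described above.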

4 Experiments

Baselines and ablations. To understand the effect of the proposed penalty, we compare our full approach – ‘RL: Diverse-Q-bot + A-bot’ – with the baseline setup in visdial_rl, as well as several ablations – 1) ‘SL: Q-bot + A-bot’: supervised agents (trained on VisDial data under MLE, no RL, no smooth-L1 penalty). Comparing to this quantifies how much our penalty + RL helps. 2) ‘SL: Diverse-Q-bot + A-bot’: supervised agents where Q-bot is trained with the smooth-L1 penalty. This quantifies gains from RL.

Automatic Metrics. To evaluate Q-bot’s diversity (Table 1), we generate Q-bot-A-bot dialogs (with beam size = 5) for 10 rounds on VisDial v1.0 val and compute 1) Novel Questions: the number of new questions (via string matching) in the generated dialog not seen during training, 2) Unique Questions: the number of unique questions per dialog instance (so at most 10), 3) Dist-n and Ent-n zhang_nips18; li_naacl16: the number and entropy of unique n-grams in the generated questions, normalized by the total number of tokens, and 4) Mutual Overlap deshpande2019fast: BLEU-4 overlap of every question in the generated 10-round dialog with the other questions, followed by averaging these numbers. To measure Q-bot’s relevance, we report the negative log-likelihood under the model of human questions from VisDial. We evaluate A-bot’s answers to human questions from the VisDial dataset on the retrieval metrics introduced by visdial (Table 2). Finally, we also evaluate performance of the Q-bot-A-bot pair at image-guessing (Fig. 2(e)), which is the downstream task they are trained with RL for.
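A minimal sketch of three of these diversity metrics follows. It assumes whitespace tokenization and exact string matching, and it computes Ent-n as the plain Shannon entropy of the empirical n-gram distribution; the normalization used in the paper may differ. Mutual Overlap is omitted because it requires a BLEU implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def dist_n(questions, n):
    """Number of unique n-grams divided by the total number of tokens."""
    total_tokens = sum(len(q.split()) for q in questions)
    unique = set()
    for q in questions:
        unique.update(ngrams(q.split(), n))
    return len(unique) / total_tokens if total_tokens else 0.0


def ent_n(questions, n):
    """Shannon entropy (in nats) of the n-gram frequency distribution."""
    counts = Counter(g for q in questions for g in ngrams(q.split(), n))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def num_unique_questions(questions):
    """Unique questions in one dialog, via exact string matching."""
    return len(set(questions))


repetitive = ["is this a close up ?"] * 3
diverse = [
    "is this a close up ?",
    "how many cows are there ?",
    "what color is the sky ?",
]
```

On these toy dialogs the repetitive one has a single unique question and a lower Dist-1 than the diverse one, which is the direction of the effect the metrics are meant to capture.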

Human Studies. To evaluate how human-understandable Q-bot-A-bot dialogs are, we conducted a study where we showed humans these dialogs (from our agents as well as baselines), along with a pool of images from the VisDial v1.0 test-std split, consisting of the unseen image, nearest neighbors (in fc7 space), and random images, and asked humans to pick their top-5 guesses for the unseen image. Our hypothesis was that if questions are more diverse, the dialog will be more image-informative, and so humans should be able to better guess which image was being talked about. We report top- accuracy of true image in human guesses. We also asked humans to rate Q-bot-A-bot dialog on consistency, fluency, and level of detail on a -point Likert scale.

5 Implementation Details

We used beam search with a beam size of 5 during self-talk between all Q-bot-A-bot variants. NDCG scores on the v1.0 val split and the total SL loss (on the same split) were used to select the best SL A-bot and Q-bot checkpoints respectively. We used a dropout rate of for all SL-pretraining experiments and no dropout for RL-finetuning. We used Adam kingma_iclr15 with a learning rate of decayed by every epoch, up to a minimum of .

The objective for training Diverse-Q-bot was a sum of the smooth-L1 penalty (introduced in Section 3), cross entropy loss, and L2 loss between the regression head output and the fc7 simonyan_iclr15 embedding of the image. We observed that coefficients in the range of to worked best for the smooth-L1 penalty. We also observed that training for a large number of epochs (80) with the above mentioned range of coefficient values led to the best results.

6 Results

  • Q-bot’s diversity (Table 1): The question-repetition penalty consistently increases diversity (in both SL and RL) over the baseline. RL: Diverse-Q-bot asks ~ more unique questions on average than visdial_rl () for every 10-round dialog, ~6.3x more novel questions (), and a higher fraction and entropy of unique generated n-grams, while still staying comparably relevant (NLL).

  • A-bot on VisDial (Table 2): RL: A-bot outperforms SL: A-bot, but does not statistically improve over the baseline on answering human questions from VisDial (on v1.0 val & test-std). This is consistent with trends in visdial_rl.

  • Image-guessing task (Fig. 2(e)): Diverse-Q-bot + A-bot (SL and RL) significantly outperform the baseline on percentile rank of the ground-truth image as retrieved using Q-bot’s fc7 prediction. Thus, the question-repetition penalty leads to a more informative communication protocol.

  • Human studies (Fig. 2(a-d)): Humans judged RL: Diverse-Q-bot + A-bot dialog to be significantly more consistent (fewer conflicting exchanges), fluent (fewer grammatical errors), and detailed (more image-informative) than the baseline and supervised learning. This is an important result. Performance on GuessWhich, together with these dialog quality judgements from humans, shows that agents trained with our approach develop a more effective communication protocol for the downstream image-guessing task while not deviating from English, which is a common pitfall when training dialog agents with RL kottur_emnlp17; lewis_emnlp17.
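The percentile rank used for the image-guessing task can be sketched as follows. The convention here is an assumption: images in a candidate pool are ranked by squared Euclidean distance to Q-bot's predicted feature, and the percentile is 100 when the true image is the closest one. The function and argument names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def percentile_rank(predicted, pool, true_index):
    """Percentile rank (higher is better) of the true image in the pool.

    rank = 1 means no other image in the pool is closer to the
    prediction than the true image.
    """
    d_true = squared_distance(predicted, pool[true_index])
    closer = sum(
        1
        for i, feat in enumerate(pool)
        if i != true_index and squared_distance(predicted, feat) < d_true
    )
    rank = closer + 1
    return 100.0 * (len(pool) - rank + 1) / len(pool)


# Toy pool of four 2-d "image features" and one predicted feature.
pool = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [10.0, 10.0]]
predicted = [0.9, 0.0]
```

With this toy pool, the prediction retrieves the image at index 1 first, so that image gets a percentile rank of 100, while the farthest image gets 25.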

Note that since our penalty (Eqn. 1) is structured to avoid repetition across successive rounds, one possible failure mode is that Q-bot learns to ask the same question every alternate dialog round (at and ). Empirically, we find that this happens of times ( times out of k question pairs) on v1.0 val for RL: Diverse Q-bot + A-bot compared to for SL: Q-bot + A-bot. This observation, combined with the fact that Diverse-Q-bot asks 1.6 more unique questions relative to SL: Q-bot across 10 rounds suggests that simply incentivizing diversity in successive rounds works well empirically. We hypothesize that this is because repeating questions every other round or other such strategies to game our repetition penalty is fairly specific behavior that is likely hard for models to learn given the large space of questions Q-bot could potentially ask.
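A check for this alternate-round failure mode can be sketched in a few lines. Exact string matching between questions two rounds apart is an assumption about how repeats are detected, and the function name is hypothetical.

```python
def alternate_round_repeats(questions):
    """Count questions identical to the one asked two rounds earlier.

    Returns (number of repeated pairs, total number of pairs compared).
    """
    pairs = list(zip(questions, questions[2:]))
    repeats = sum(1 for earlier, later in pairs if earlier == later)
    return repeats, len(pairs)


# Toy dialog in which two of three alternate-round pairs are repeats.
toy_dialog = [
    "can you see the sky ?",
    "are there any trees ?",
    "can you see the sky ?",
    "are there any trees ?",
    "how many cows are there ?",
]
toy_repeats, toy_pairs = alternate_round_repeats(toy_dialog)
```

Summing both counts over all generated dialogs gives the fraction of alternate-round question pairs that are repeats.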

7 Related Work

Our work is related to prior work in visual dialog visdial; vries_cvpr17 and modeling diversity in text-only dialog zhang_nips18; li_naacl16; li_emnlp16.

Closest to our setting is work on using conditional variational autoencoders for self-talk in visual dialog massiceti_cvpr18, where diversity is not explicitly modeled but is measured via metrics specific to the proposed architecture.

Adding constraints to generate a diverse set of natural language dialog responses has previously been explored in zhang_nips18 via adversarial information maximization, in gao_arxiv19 by jointly modeling diversity and relevance in a shared latent space, and in li_naacl16 using a maximum mutual information criterion. In contrast, we are interested in diversity at the level of the entire dialog (instead of a single round), i.e., reducing repetitions in QA pairs across multiple rounds. Our repetition penalty is partly inspired by the ‘Information Flow’ constraint in li_emnlp16. As detailed in Section 3, we experimented with similar forms of the penalty and eventually settled on smooth-L1. To the best of our knowledge, we are the first to explicitly model diversity as a constraint in visual dialog.

8 Conclusions & Future Work

We devised an auxiliary objective for training generative Visual Dialog agents with RL that incentivizes Q-bot to ask diverse questions. This reduces repetitions in Q-bot-A-bot dialog during RL self-talk, and in turn enables A-bot to be exposed to a larger state space. Through extensive evaluations, we demonstrate that our Q-bot-A-bot pair has significantly more diverse dialog while still being image-relevant, better downstream task performance, and higher consistency, fluency, and level of detail than baselines. Our code will be publicly released, and we hope this will serve as a robust base for furthering progress on training visual dialog agents with RL for other multi-agent grounded language games and adapting to learn to talk about novel visual domains.

9 Acknowledgements

We thank Nirbhay Modhe and Viraj Prabhu for the PyTorch implementation modhe_github18 of visdial_rl that we built on, and Jiasen Lu for helpful discussions. The Georgia Tech effort is supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE. AD is supported in part by fellowships from Facebook, Adobe, and Snap Inc. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the US Government, or any sponsor.

References

10 Appendix

10.1 Qualitative Output

The dialog generated by self-talk between different SL model variants is provided in Table 5, and between different RL model variants in Table 6. We observe that the variants with Diverse-Q-bot tend to generate more diverse, image-relevant, and fluent dialog.

10.2 Experiments with Diverse-A-bot

We also ran experiments where we used the repetition penalty (Eqn. (1) in the main paper) during SL pre-training of A-bot. In Tables 3 and 4 we report results on the A-bot retrieval metrics visdial and the diversity metrics respectively. In Fig. 3, we provide the performance of various Q-bot variants paired with this Diverse-A-bot on the image-guessing task.

We note that retrieval metrics for Diverse-A-bot are better than vanilla SL: A-bot. Therefore, this repetition penalty does help significantly during supervised pre-training. However, finetuning the Diverse-A-bot via RL does not lead to significant improvements.

We note that for the diversity metrics on Q-bot-A-bot dialog, self-talk with Diverse-A-bot does not change the diversity metrics significantly. We observe the same trend in image-guessing performance as well.

10.3 Model Architecture details

We use a Hierarchical Recurrent Encoder (HRE) for representing dialog context. In this encoder the image representation is concatenated with every question word when fed to the LSTM. We then encode each QA-pair in the dialog history with another LSTM with shared weights. The image-question representation, computed for every round from through , is concatenated with history representation from the previous round. This gives us a set of question-history vectors for t rounds. These vectors are fed as input to a dialog-level LSTM, whose output state at is used to decode the response to . Fig. 4 shows the model architecture of the HRE.

All LSTMs are 2-layered with -dim hidden states. We learn -dim embeddings for words and images. These word embeddings are shared across question, history, and decoder LSTMs.

Figure 3: Performance on the image-guessing task. Percentile rank (higher is better) of the true image (shown to A-BOT) as retrieved using fc7 image feature predictions from Q-BOT.
Figure 4: Model architecture diagram borrowed from visdial_rl with permission.
Table 3: A-bot performance on the VisDial v1.0 dataset visdial. Arrow indicates direction of better performance.
Splits: val, test-std
Columns (per split): NDCG, MRR, R@k (three cutoffs), Mean Rank
Rows: Baseline: visdial_rl; Diverse SL A-bot; Diverse RL A-bot (Finetuned with Diverse-A-bot)
Table 4: Q-bot diversity and relevance on v1.0 val. ↑ indicates higher is better. ↓ indicates lower is better.
Columns (Diversity): # Novel questions, # Unique questions, Mutual overlap, Ent-n (two values of n), Dist-n (two values of n)
Columns (Relevance): Negative log likelihood
Rows: Baseline: visdial_rl; SL: Q-bot + Diverse-A-bot; SL: Diverse-Q-bot + Diverse-A-bot; RL: Diverse-Q-bot + Diverse-A-bot

Image + Caption GT dialog visdial Q-bot-A-bot dialog Diverse-Q-bot-A-bot dialog

a cutting board with carrots next to a cell phone.
what color is the cutting board? blue how big is it about the size of a regular laptop screen how many carrots are there ? 4 are they the orange kind ? 2 are orange , 2 are white what kind of cell phone is it ? a samsung can you see any text on the screen ? a recipe for UNK UNK soup do you see the room ? just the counter are there people ? not visible what color is the counter ? white is the photo in color ? no how many people do you see ? no people is this a close up ? no is this a close up ? zebra is is this a close up ? no is this a close up ? no is this a close up ? no is this a close up ? no is this a close up ? no is this a close up ? no is the photo in color ? 1 picture is in color is the photo close up ? black and white is the photo in color ? no it ’s black and white is it a close up ? no it ’s a close up of the cake can you tell what brand it is ? no i ca n’t see the brand can you tell if it ’s a male or female ? no i ca n’t tell can you tell if it ’s a male or female ? no i ca n’t tell is this in a kitchen ? nope it ’s in a home are there any utensils ? no are there any utensils ? nope

many brown and black cows standing in the grass
is the field large ? yes is the grass green or brown ? green can you see a barn ? no can you see a fence ? no can you see a water supply ? yes can you see any people ? no can you see trees ? yes can you see mountains in the background ? no is the grass tall or short ? short what color are the cows ? they are brown and white how many cows are there ? 4 what color are the cows ? they are brown and white can you see the sky ? it is not visible do you see any trees ? i do not see any trees can you see the sky ? green sky are there any clouds ? the sky is not visible how many cows are there ? 2 what color are the cows ? the cows are brown and white what color are the cows ? they are brown do you see any people ? 1 do you see a fence ? a small 1 how many cows are there ? 5 what color are the cows ? black and white can you see the sky ? some can you see the sky ? nope are there any other animals ? not that i can see can you see the sky ? not at all are there any trees ? in the background can you see the sky ? not at all

many elephants standing in an enclosure near a building
what type of building ? may be part of a zoo how many elephants ? about 4 is it sunny ? clear day , yes any trees ? yes what color is the building ? dark grey are there other animals ? no are there any people ? yes what are they wearing ? shorts and shirts how many ? 3 how many elephants are there ? ca n’t tell what color are the elephants ? ca n’t tell how many elephants are there ? ca n’t tell what color is the fence ? i ca n’t tell can you see the sky ? i ca n’t see the sky are there any trees ? just a few does this look like a zoo ? 1 of them is are there any people ? nothing in the picture how many elephants are there ? just 1 can you see the sky ? just a little how many elephants are there ? 1 any people ? ca n’t see any male or female ? not sure what is the elephant doing ? just standing there male or female ? there are 2 males and 1 female can you see the sky ? 1 of it what color is the building ? just a black and white picture do you see any trees ? there are trees in the background are there any other animals ? red and white do you see any trees ? i see some in the background

a reflection of the tub is in the bathroom mirror along with the sink , you can see all the room
is the photo in color ? yes is there humans in the photo ? no is there any animals in the photo ? no what color are the walls ? tan what color is the sink ? white is there any towels ? no is there any pictures on the wall ? no what color is the tub ? white can you see the floor ? no is the photo in color ? 5 of them are are there any people ? it ’s a close up so i ca n’t see any people what color is the sink ? about the same color as the toilet do you see a mirror ? no what color is the sink ? yes it is white what color is the sink ? yes it is white what shape is the mirror ? yes , it ’s a rectangle are there any towels ? he ’s on a towel rack are there any towels ? i do n’t see any towels are there any towels ? i do n’t see any towels is the photo in color ? no is there any people ? 5 what color walls ? i ca n’t see the walls is there a mirror ? yes is there a mirror ? yes is there a mirror ? yes are the walls seen ? no what color are the walls ? no walls what is the floor made of ? yes what color is the floor ? i ca n’t see the floor

a bunch of cows that are in a grassy field
about how many cows ? over a dozen are there any calves ? no are they eating ? yes is it sunny ? no is there a fence around them ? no do you see any buildings ? no mountains in the background ? no is there snow on the ground ? no any trees ? yes is the photo in color ? yes is there any people ? no is there a fence ? no is there a fence ? no is there a fence ? no is there a fence ? no is there a fence ? no is there a fence ? no is there a fence ? no is there a fence ? no how many cows are there ? about 10 what color are the cows ? black and white how many cows are there ? about 10 what color are the cows ? black and white what is the weather like ? sunny are there any trees ? no are there any people ? no are there any trees ? no are there any buildings ? no are there any trees ? yes

the kitchen is UNK with dark wood cabinets and a nearby staircase
what color are the kitchen walls ? white is there a window present ? yes are there pictures paintings hanging up ? no is the stove visible ? yes what color is the stove ? black and silver are there any people visible ? no what color is the nearby staircase ? dark brown is the staircase carpeted ? no are the handrails of the staircase visible ? yes is the photo in color ? yes is there any people ? no is there a stove ? no are there any windows ? no can you see the floor ? no can you see the floor ? no can you see the floor ? yes can you see the floor ? yes can you see the floor ? yes are the walls visible ? yes is the kitchen clean ? brown is the kitchen clean ? it ’s cluttered are there any windows ? yes is it daytime ? yes are there any people ? yes are the appliances stainless steel ? i ca n’t tell are the people in the kitchen ? yes is the kitchen clean ? yes are the people male or female ? yes is the kitchen clean ? yes

black and white photo of men working in a kitchen
how many men ? 3 how big is the kitchen ? small industrial what is the age range ? teens to 40s is the kitchen belong in a house or a restaurant ? restaurant what are they doing in the kitchen ? cooking and mixing food what kind of floor in the kitchen ? i ca n’t tell is there any decorations in the kitchen ? no is the kitchen look clean ? no does the men seem to be UNK ? they seem UNK can you see the fridge ? the side of it what color is it ? the photo is black and white can you see the sink ? 1 of it can you see the sink ? just a tiny bit of it can you see the sink ? 0 can you see the sink ? does n’t look like it can you see the sink ? 0 can you see the sink ? 4 of them can you see the fridge ? UNK can you see the fridge ? not real how many men are there ? looks like 2 men and 1 woman how old do they look ? it ’s hard to tell , but they look like they are in their 20s how old do they look ? looks to be in their 30 ’s are they wearing aprons ? just 1 of them is wearing hat are they wearing hats ? it is hard to tell , but i think so can you see the stove ? not at all are there any pots on the stove ? not that i can see does the kitchen look clean ? black and white photo but i think it ’s clean can you see any windows ? no do you see any windows ? ca n’t see any windows

a half-eaten chicken breast is next to an UNK piece of bread and a small serving of green beans and carrots
who is eating ? is the drink in the cup soda what is the food on ? bread and vegetable is it in a plate ? what kind of sauces are there in the back i am asking the questions here ? yes you are what is the food on ? bread , vegetables , chicken and drink are they on a table ? yes they are any people ? there are n’t any people in sight what color is the table ? it is dark gray any cutlery ? yes there is a butter knife in the plate is the photo in color ? it is is the photo close up ? no is the photo in color ? it is do you see any other people ? it ’s a close up of the man ’s hand , so i ca n’t tell how old is the man ? maybe in his 30s is he wearing glasses ? i ca n’t see his face what color is his shirt ? not sure , the photo is in black and white does he have facial hair ? no does he wear glasses ? yes are there any people in the photo ? yes is the photo in color ? it is is it a professional photo ? i do n’t think so how many people ? 0 is there a tablecloth on the table ? yes what color is the tablecloth ? white are there any utensils visible ? no what color is the plate ? it ’s white are there any utensils visible ? no are there any utensils visible ? no

Table 5: Selected examples of dialog on v VisDial test split for different SL variants. We observe that the variant involving Diverse-Q-bot generates more diverse, image-relevant, and fluent dialog.

Image + Caption GT dialog  visdial visdial_rl dialog RL: Diverse-Q-bot-A-bot dialog

a cutting board with carrots next to a cell phone.
what color is the cutting board? blue how big is it about the size of a regular laptop screen how many carrots are there ? 4 are they the orange kind ? 2 are orange , 2 are white what kind of cell phone is it ? a samsung can you see any text on the screen ? a recipe for UNK UNK soup do you see the room ? just the counter are there people ? not visible what color is the counter ? white is the photo in color ? no is the photo close up ? no how many people ? cows is this a close up ? 2 of them are what are they wearing ? no shirt and jeans what are they wearing ? no shirt and jeans do you see other people ? sunny 1 in the background what color are the scissors ? no handle is this in a kitchen ? yes do you see the counter ? no is the photo in color ? black and white is there any people ? 1 hand is this indoors or outdoors ? no idea is the person male or female ? up close , i ca n’t tell can you tell what kind of sandwich it is ? no is there a knife ? no is there a cutting board ? yes are the carrots cut up ? daytime is the scene well lit ? no are the carrots raw or cooked ? short

many brown and black cows standing in the grass
is the field large ? yes is the grass green or brown ? green can you see a barn ? no can you see a fence ? no can you see a water supply ? yes can you see any people ? no can you see trees ? yes can you see mountains in the background ? no is the grass tall or short ? short can you see any people ? nope can you see the sky ? a little bit are there any clouds ? i ca n’t see the sky are there any other animals ? not that i can see do you see any trees ? lots of trees how many cows are there ? ca n’t tell , they are far away do you see any buildings ? i do not can you see the sky ? ca n’t see the sky do you see any clouds ? ca n’t see the sky can you see the sky ? ca n’t see the sky are the cows all the same color ? they are different shades of brown do any of the cows have horns ? it does n’t look like it do you see a fence ? there is no fence how many cows are there ? 4 can you see the sky ? 1 of it what color is the grass ? the grass is green can you see the sky ? it is not visible does it appear to be sunny ? not really can you see the sky ? not at all can you see the ground ? a little bit

many elephants standing in an enclosure near a building
what type of building ? may be part of a zoo how many elephants ? about 4 is it sunny ? clear day , yes any trees ? yes what color is the building ? dark grey are there other animals ? no are there any people ? yes what are they wearing ? shorts and shirts how many ? 3 do you see people ? 2 how many elephants ? not sure does the elephant have tusks ? not that i can see does the elephant have tusks ? in the distance , yes does the elephant have tusks ? not that i can see any other animals ? not that i can see does the elephant have tusks ? not that i can see does the elephant have tusks ? not that i can see does the elephant have tusks ? not that i can see can you see the sky ? 1 of it can you see any people ? yes what are the elephants doing ? just standing around how many people are there ? a lot , too many to count does this look like a zoo ? not really how many elephants are there ? in the distance there are many does this look like a parade ? a little bit does this look like a circus ? nope can you see any signs ? there is a sign on the street does it look like it is in the usa ? i do n’t think so can you see the sky ? i can

a reflection of the tub is in the bathroom mirror along with the sink , you can see all the room
is the photo in color ? yes is there humans in the photo ? no is there any animals in the photo ? no what color are the walls ? tan what color is the sink ? white is there any towels ? no is there any pictures on the wall ? no what color is the tub ? white can you see the floor ? no how big is the bathroom ? ca n’t see the whole room how big is the mirror ? i can only see a small part of it what color is the sink ? there is no sink what color is the sink ? black what color is the sink ? yes what color are the walls ? he is white is there a mirror ? yes are there any people ? he is alone are there any animals ? white cat what color is the sink ? yes how big is the bathroom ? it ’s pretty big is it clean ? maybe how big is the mirror ? i can only see a small part of it what color is the wall ? maybe white is there a shower curtain ? there is no shower curtain is there a mirror ? yes what shape is the mirror ? yes , it ’s a rectangle what color is the wall ? i ca n’t see the wall are there any people ? he ’s the only 1 is there a window ? i do n’t see 1

a bunch of cows that are in a grassy field
about how many cows ? over a dozen are there any calves ? no are they eating ? yes is it sunny ? no is there a fence around them ? no do you see any buildings ? no mountains in the background ? no is there snow on the ground ? no any trees ? yes is the photo in color ? yes is there any people ? no how many cows ? black and white cows is there a fence ? no is there a fence ? no is there a fence ? no is there a fence ? no is there a fence ? no is there a fence ? no is there a fence ? no how many cows are there ? about 10 what color are the cows ? black and white are any of them babies ? no how many cows are there ? about 10 are there any people in the photo ? no are there any other animals in the photo ? no are there any buildings in the photo ? no are there any trees ? yes is there a fence ? no is it sunny out ? yes

the kitchen is UNK with dark wood cabinets and a nearby staircase
what color are the kitchen walls ? white is there a window present ? yes are there pictures paintings hanging up ? no is the stove visible ? yes what color is the stove ? black and silver are there any people visible ? no what color is the nearby staircase ? dark brown is the staircase carpeted ? no are the handrails of the staircase visible ? yes is this in color ? no do you see any people ? no do you see a stove ? standing are there any windows ? white windows can you see the floor ? metal floor can you see the floor ? brown tile can you see the floor ? yes is it tiled ? no do you see a window ? short window are there curtains on the window ? yes what color are the appliances ? black and white is there a window ? yes is there a sink ? yes can you see a refrigerator ? yes are there any people ? yes is it daytime ? nope is it sunny out ? just a little is there a window ? brown can you see the floor ? no do you see a stove ? no

black and white photo of men working in a kitchen
how many men ? 3 how big is the kitchen ? small industrial what is the age range ? teens to 40s is the kitchen belong in a house or a restaurant ? restaurant what are they doing in the kitchen ? cooking and mixing food what kind of floor in the kitchen ? i ca n’t tell is there any decorations in the kitchen ? no is the kitchen look clean ? no does the men seem to be UNK ? they seem UNK can you see the fridge ? yep can you see the fridge ? kind of does it have a freezer on it ? nope how many men are there ? ca n’t tell how old do they look ? it ’s hard to tell , but i ’d say in their 30 ’s what color are the walls ? it ’s a black and white photo so i ca n’t tell how old are the men ? i ca n’t see their faces , i do n’t know does the kitchen look clean ? 0 how old do the men look ? i ca n’t see their faces , i do n’t know are they wearing aprons ? a few are can you tell what kind of food they are ? i can not tell what kind of food it is does the kitchen look clean ? fairly clean do you see any windows ? partially in the background do you see any curtains ? maybe 1 in the background is it day or night ? ca n’t tell are the men old or young ? ca n’t see their faces are they wearing hats ? just 1 are they wearing glasses ? it looks like they are are they smiling ? i ca n’t see their faces are they smiling ? i ca n’t see their faces

a half-eaten chicken breast is next to an UNK piece of bread and a small serving of green beans and carrots
who is eating ? is the drink in the cup soda what is the food on ? bread and vegetable is it in a plate ? what kind of sauces are there in the back i am asking the questions here ? yes you are what is the food on ? bread , vegetables , chicken and drink are they on a table ? yes they are any people ? there are n’t any people in sight what color is the table ? it is dark gray any cutlery ? yes there is a butter knife in the plate are there any people ? i see part of a person ’s hand what color is the plate ? black and white are there any utensils ? no are there any utensils ? no are there any utensils ? silver spoon are there any utensils ? beige spoon are there any utensils ? i see a fork are there any utensils ? yes a fork and a knife are there any napkins ? 0 that i can see are there any napkins ? it ’s a close up so i ca n’t tell are there any people in the picture ? yes are they inside or outside ? outside what gender are the people ? i ca n’t tell what are the people doing ? 1 is looking at the camera , the other is looking at the camera can you tell if they are male or female ? i ca n’t tell can you tell if the person is male or female ? i ca n’t see their face , i do n’t know can you tell if they are male or female ? i ca n’t tell can you tell what kind of sandwich it is ? it has some meat on it , but i ca n’t tell what it is is it on a plate ? no what color is the handle of the knife ? silver


Table 6: Selected examples of dialog on v VisDial test split for different RL variants. We observe that the variant involving Diverse-Q-bot generates more diverse, image-relevant, and fluent dialog.
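The repetition visible in these examples can be checked with a quick count. The sketch below is not one of the paper's metrics; it is a minimal illustration that measures the fraction of distinct questions in a dialog. The function name `question_diversity` is hypothetical, and the two question lists are taken from the "a bunch of cows that are in a grassy field" row of Table 6, assuming that row splits into 9, 10, and 10 exchanges across its three dialog columns.

```python
from collections import Counter


def question_diversity(questions):
    """Return (fraction of distinct questions, {question: count} for repeats)."""
    if not questions:
        raise ValueError("questions must be non-empty")
    # Normalize case and whitespace so trivially different strings match.
    normalized = [" ".join(q.lower().split()) for q in questions]
    counts = Counter(normalized)
    unique_fraction = len(counts) / len(normalized)
    repeated = {q: n for q, n in counts.items() if n > 1}
    return unique_fraction, repeated


# Questions from the cows row of Table 6 (column split is an assumption).
baseline_rl = [
    "is the photo in color ?",
    "is there any people ?",
    "how many cows ?",
] + ["is there a fence ?"] * 7

diverse_qbot = [
    "how many cows are there ?",
    "what color are the cows ?",
    "are any of them babies ?",
    "how many cows are there ?",
    "are there any people in the photo ?",
    "are there any other animals in the photo ?",
    "are there any buildings in the photo ?",
    "are there any trees ?",
    "is there a fence ?",
    "is it sunny out ?",
]

baseline_score, baseline_repeats = question_diversity(baseline_rl)
diverse_score, diverse_repeats = question_diversity(diverse_qbot)
print(baseline_score, baseline_repeats)
print(diverse_score, diverse_repeats)
```

Exact string matching is deliberately strict: paraphrased repeats such as "is there a fence ?" versus "do you see a fence ?" count as distinct, so this figure is an upper bound on how varied the questions really are.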