
Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

Can we develop visually grounded dialog agents that can efficiently adapt to new tasks without forgetting how to talk to people? Such agents could leverage a larger variety of existing data to generalize to new tasks, minimizing expensive data collection and annotation. In this work, we study a setting we call "Dialog without Dialog", which requires agents to develop visually grounded dialog models that can adapt to new tasks without language level supervision. By factorizing intention and language, our model minimizes linguistic drift after fine-tuning for new tasks. We present qualitative results, automated metrics, and human studies that all show our model can adapt to new tasks and maintain language quality. Baselines either fail to perform well at new tasks or experience language drift, becoming unintelligible to humans. Code has been made available at



1 Introduction

One goal of AI is to enable humans and computers to communicate naturally with each other in grounded language to achieve a collaborative objective. Recently the community has studied goal oriented dialog, where agents communicate for tasks like booking a flight or searching for images [miller2017parlai].

A popular approach to these tasks has been to observe humans engaging in dialogs like the ones we would like to automate and then train agents to mimic these human dialogs [visdial_rl, deal_or_no_deal]. Mimicking human dialogs allows agents to generate intelligible language (i.e., meaningful English, not gibberish). However, these models are typically fragile and generalize poorly to new tasks. As such, each new task requires collecting new human dialogs, which is a laborious and costly process often requiring many iterations before high quality dialogs are elicited [visdial, guesswhat].

A promising alternative is to use goal completion as a supervisory signal to adapt agents to new tasks. Specifically, this is realized by pre-training dialog agents via human dialog supervision on one task and then fine-tuning them on a new task by rewarding the agents for solving the task regardless of the dialog’s content. This approach can indeed improve task performance, but language quality suffers even for similar tasks. The language tends to drift from human language, becoming ungrammatical and losing human intelligible semantics – sometimes even turning into unintelligible code. Such code may allow communication with other bots, but is largely incomprehensible to humans. This trade-off between task performance and language drift has been observed in prior dialog work [visdial_rl, deal_or_no_deal].

Figure 1: An example of the Dialog without Dialog (DwD) task: (Left) We pre-train a questioner agent (Q-bot, in green) that can discriminate between pairs of images by mimicking questions from VQAv2 [vqav2]. (Right) Q-bot needs to generate a sequence of discriminative questions (a dialog) to identify the secret image that A-bot picked. Note that language supervision is not available, so we can only fine-tune Q-bot with task performance. In DwD, we can test generalization ability by varying dialog length, pool size and image domain.

The goal of this paper is to develop visually grounded dialog models that can adapt to new tasks while exhibiting less linguistic drift, thereby reducing the need to collect new data for the new tasks. To test this we consider an image guessing game demonstrated in Figure 1 (right). In each episode, one agent (A-bot, in red) secretly selects a target image (starred) from a pool of images. The other agent (Q-bot, in green) must identify this image by asking questions for A-bot to answer in a dialog. To succeed, Q-bot needs to understand the image pool, generate discriminative questions, and interpret the answers A-bot provides to identify the secret image. The image guessing game provides the agent with a goal, and we can test generalization ability by varying dialog length, pool size and image domain.

Contribution 1.

We propose the Dialog without Dialog (DwD) task, which requires a Q-bot to perform our image guessing game without dialog level language supervision. As shown in Figure 1 (left), in this setting Q-bot first learns to ask questions to identify the secret image by mimicking single-round human-annotated visual questions. For the dialog task (right), no human dialogs are available so Q-bot can only be supervised by its image guessing performance. To measure task performance and language drift in increasingly out-of-distribution settings we consider varied pool sizes and take pool images from diverse image sources (e.g., close-up bird images).

Contribution 2.

We propose a Q-bot architecture for the DwD task that decomposes question intent from the words used to express that intent. We model the question intent by introducing a discrete latent representation that is the only input to the language decoder. We further pair this with a pre-train then fine-tune learning approach that teaches Q-bot how to ask visual questions from VQA during pre-training and ‘what to ask’ during fine-tuning for visual dialogs.

Contribution 3.

We measure Q-bot’s ability to adapt to new tasks and maintain language quality. Task performance is measured with both automatic and human answerers while language quality is measured using three automated metrics and two human judgement based metrics. Our results show the proposed Q-bot both adapts to new tasks better than a baseline chosen for language quality and maintains language quality better than a baseline optimized for just task performance.

2 Dialog Based Image Guessing Game

2.1 Game Definition

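Our image guessing game proceeds one round at a time, starting at round $r = 1$ and running for a fixed number of rounds of dialog $R$. At round $r$, Q-bot observes the pool of images $\mathcal{I} = \{I_1, \ldots, I_P\}$ of size $P$, the history of question answer pairs $q_1, a_1, \ldots, q_{r-1}, a_{r-1}$, and placeholder representations $q_0, a_0$ that provide input for the first round. It generates a question

$$q_r = \text{QBot.Ask}(\mathcal{I}, q_0, a_0, \ldots, q_{r-1}, a_{r-1}).$$

Given this question, but not the entire dialog history, A-bot answers based on the randomly selected target image $I_{\hat{y}}$ (not known to Q-bot):

$$a_r = \text{ABot.Answer}(I_{\hat{y}}, q_r).$$

Once Q-bot receives the answer from A-bot, it makes a prediction $y_r$ guessing the target image:

$$y_r = \text{QBot.Predict}(\mathcal{I}, q_0, a_0, \ldots, q_r, a_r).$$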
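The round structure of the game can be illustrated with a small rollout loop. This is only a sketch under stated assumptions: `play_game` and the three toy agents are hypothetical stand-ins (images are sets of attribute words, questions are single words), not the neural questioner (Q-bot) and answerer (A-bot) used in the paper. What it does mirror is the information flow: the questioner sees the pool and the full history, while the answerer sees only the target image and the current question.

```python
import random
from typing import Callable, FrozenSet, List, Sequence, Tuple

Image = FrozenSet[str]          # toy image: a set of attribute words (assumption)
Fact = Tuple[str, str]          # one (question, answer) pair


def play_game(pool: Sequence[Image],
              ask: Callable[[Sequence[Image], List[Fact]], str],
              answer: Callable[[Image, str], str],
              predict: Callable[[Sequence[Image], List[Fact]], int],
              num_rounds: int,
              rng: random.Random) -> Tuple[int, List[int], List[Fact]]:
    """Roll out one episode and return (target index, per-round guesses, history)."""
    target = rng.randrange(len(pool))        # secret image, hidden from the questioner
    history: List[Fact] = []
    guesses: List[int] = []
    for _ in range(num_rounds):
        q = ask(pool, history)               # questioner: pool + full history
        a = answer(pool[target], q)          # answerer: target image + current question only
        history.append((q, a))
        guesses.append(predict(pool, history))
    return target, guesses, history


# Hypothetical toy agents, used only to make the loop runnable.
def toy_ask(pool, history):
    asked = {q for q, _ in history}
    words = sorted(set().union(*pool) - asked)
    return words[0] if words else "nothing"


def toy_answer(image, question):
    return "yes" if question in image else "no"


def toy_predict(pool, history):
    def consistent(img):
        return all((q in img) == (a == "yes") for q, a in history)
    return [i for i, img in enumerate(pool) if consistent(img)][0]
```

With distinct toy images and enough rounds to ask about every attribute, the final guess matches the target; in the real task the guess is of course not guaranteed to be correct.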


Comparison to GuessWhich.

Our image guessing game is inspired by the GuessWhich game of [visdial_rl], with two subtle but important differences. In GuessWhich, Q-bot initially observes a caption describing A-bot’s selected image and must predict the selected image’s features to retrieve it from a large, fixed pool of images it does not observe. First, the inclusion of the caption leaves little room for the dialog to add information [mironenco2017examining], so we omit it. Second, in our game a small pool of images is sampled for each dialog and Q-bot directly predicts the target image given those choices.

2.2 Modelling A-bot

In this work, we focus primarily on the Q-bot agent rather than A-bot. We set A-bot to be a standard visual question answering agent, specifically the Bottom-up Top-down [bottomup_tricks] model; however, we do make one modification. Q-bot may generate questions that are not well grounded in A-bot’s selected image (though they may be grounded in other pool images) – e.g. asking about a surfer when none exists. To enable A-bot to respond appropriately, we augment A-bot’s answer space with a Not Relevant token. To generate training data for this token we augment every image with an additional randomly sampled question and set Not Relevant as its target answer. A-bot is trained independently from Q-bot on the VQAv2 dataset and then frozen.
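The Not Relevant augmentation can be sketched as a simple data transformation. The function name and the `(image_id, question, answer)` triple format are hypothetical, and excluding an image's own questions from the random draw is an assumption made here so that the added label is plausible; the text above only says the extra question is randomly sampled.

```python
import random

NOT_RELEVANT = "Not Relevant"


def add_not_relevant_examples(vqa, rng):
    """Extend a list of (image_id, question, answer) triples with one extra
    (image_id, random_question, NOT_RELEVANT) triple per image."""
    questions_by_image = {}
    for image_id, question, _ in vqa:
        questions_by_image.setdefault(image_id, set()).add(question)
    all_questions = sorted({q for _, q, _ in vqa})

    augmented = list(vqa)
    for image_id, own_questions in questions_by_image.items():
        # Assumption: sample only from questions written for other images.
        candidates = [q for q in all_questions if q not in own_questions]
        if candidates:
            augmented.append((image_id, rng.choice(candidates), NOT_RELEVANT))
    return augmented
```

A randomly drawn question can still happen to apply to the image, so labels produced this way are noisy.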

2.3 Modelling Q-bot

We conceptualize Q-bot as having three major modules. The planner encodes the state of the game to decide what to ask about. The speaker takes this intent and formulates the language to express it. The predictor makes target image predictions taking the dialog history into account. We make fairly standard design choices here, then adapt this model for the DwD task in Section 3.

Pool & Image Encoding.

We represent the $p$-th image $I_p$ of the pool as a set of $B$ bounding boxes such that $I^p_b$ is the embedding of the $b$-th box using the same Faster R-CNN [fasterrcnn] embeddings as in [bottomup]. Note that we do not assume prior knowledge about the size or composition of the pool.

Figure 2: A single round of our Q-bot, which decomposes into the modules described in Section 2.3. This factorization allows us to limit language drift while fine-tuning for task performance.


Planner.

The planner’s role is to encode the dialog context (image pool and dialog history) into a representation $z^r$, deciding what to ask about in each round. It also produces an encoding $h^r$ of the dialog history. To limit clutter, we denote the question-answer pair at round $r$ as a ‘fact’ $F^r = [q_r, a_r]$.

Planner – Context Encoder.

Given the prior dialog state $h^{r-1}$, the previous fact $F^{r-1}$, and the image pool $\mathcal{I}$, the context encoder performs hierarchical attention over images in $\mathcal{I}$ and object boxes in each image to identify image regions that are most relevant for generating the next question. As we detail in Section 2 of the supplement, $F^{r-1}$ and $h^{r-1}$ query the image to attend to relevant regions across the pool. First a $P$ dimensional distribution $\alpha$ over images in the pool is produced and then a $B$ dimensional distribution $\beta^p$ over boxes is produced for each image $p \in \{1, \ldots, P\}$. The image pool encoding $\hat{v}^r$ at round $r$ is

$$\hat{v}^r = \sum_{p=1}^{P} \alpha_p \sum_{b=1}^{B} \beta^p_b I^p_b.$$

This combines the two levels of attention and is agnostic to pool size.
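A minimal sketch of the two-level pooling follows. It assumes the attention scores are already given: how the planner computes them from the dialog state is detailed in the supplement and is not reproduced here. The names `encode_pool`, `image_scores`, and `box_scores` are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode_pool(pool, image_scores, box_scores):
    """pool[p][b] is the feature vector of box b in image p.

    image_scores has one unnormalized score per image and box_scores[p] one
    score per box of image p (both assumed to come from the planner).
    """
    alpha = softmax(image_scores)            # distribution over images
    dim = len(pool[0][0])
    encoding = [0.0] * dim
    for p, boxes in enumerate(pool):
        beta = softmax(box_scores[p])        # distribution over boxes of image p
        for b, feature in enumerate(boxes):
            weight = alpha[p] * beta[b]
            for d in range(dim):
                encoding[d] += weight * feature[d]
    return encoding
```

Because both levels are normalized weighted sums, the output has the feature dimension regardless of how many images or boxes are present, which is what makes the encoding agnostic to pool size.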

Planner – History Encoder.

To track the state of the game, the planner applies an LSTM-based history encoder that takes $\hat{v}^r$ and $F^{r-1}$ as input and produces an intermediate hidden state $h^r$. Here $h^r$ includes a compact representation of question intent and dialog history, helping provide a differentiable connection between the intent and final predictions through the dialog state.

Planner – Question Policy.

The question policy transforms $h^r$ to this module’s output $z^r$, which the speaker decodes into a question. By default $z^r$ is equal to the hidden state $h^r$, but in Section 3.2 we show how a discrete representation can be used to reduce language drift.


Speaker.

Given an intent vector $z^r$, the speaker generates a natural language question. Our speaker is a standard LSTM-based decoder with an initial hidden state equal to $z^r$.


Predictor.

The predictor uses the dialog context generated so far to guess the target image. It takes a concatenation $F = [F^1, \ldots, F^r]$ of fact embeddings and the dialog state $h^r$ and computes an attention pooled fact $\hat{F}$ using $h^r$ as attention context. Along with $h^r$, this is used to attend to salient image features then compute a distribution over images in the pool using a softmax (see Algorithm 2 in the supplement for full details), allowing for the use of cross-entropy as the task loss. Note that the whole model is agnostic to pool size.
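The final step of the predictor, a softmax over the pool followed by a cross-entropy task loss, can be written down directly. The per-image logits are taken as given here; producing them from the dialog state is the part described in the supplement. Function names are illustrative.

```python
import math


def pool_distribution(logits):
    """Softmax over one logit per pool image."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pool_cross_entropy(logits, target):
    """Negative log-probability of the target image under the softmax."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[target]
```

Nothing in either function depends on a fixed pool size: a pool of 2, 4, or 9 images only changes the length of `logits`.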

3 Dialog without Dialog

Aside from some abstracted details, the game setting and model presented in the previous section could be trained without any further information – a pool of images could be generated, A-bot could be assigned an image, the game could be rolled out for arbitrarily many rounds, and Q-bot could be trained to predict the correct image given A-bot’s answers. While this is an interesting research direction in its own right [ren20, chaabouni20, liang20], there is an obvious shortcoming – it would be highly improbable for Q-bot to discover a fully functional language that humans can already understand. Nobody discovers French. They have to learn it.

At the other extreme – representing standard practice in dialog problems – humans could be recruited to perform this image guessing game and provide dense supervision for what questions Q-bot should ask to perform well at this specific task. However, this requires collecting language data for every new task. It is also intellectually dissatisfying for agents’ knowledge of natural language to be so inseparably intertwined with individual tasks. After all, one of the greatest powers of language is the ability to use it to communicate about many different problems.

In this section, we consider a middle ground that has two stages. Stage 1 trains our agent on one task where training data already exists (VQA; i.e., single round dialog) and then stage 2 adapts it to carry out goal driven dialog (image guessing game) without further supervision.

3.1 Stage 1: Language Pre-training

We leverage the VQAv2 [vqav2] dataset as our language source to learn how to ask questions that humans can understand. By construction, for each question in VQAv2 there exists at least one pair of images which are visually similar but have different ground truth answers to the same question. Fortuitously, this resembles our dialog game – the image pair is the pool, the question is guaranteed to be discriminative, and we can provide an answer depending on A-bot’s selected image. We view this as a special case of our game that is fully supervised but contains only a single round of dialog. During stage 1 Q-bot is trained to mimic the human question (via cross-entropy teacher forcing) and to predict the correct image given the ground truth answer.

For example, in the top left of Figure 1 outlined in dashed green we show a pair of two bird images with the question “What is in the bird’s beak?” from VQAv2. Our agents engage in a single round dialog where Q-bot asks that question and A-bot provides the answer (also supervised by VQAv2).

3.2 Stage 2: Transferring to Dialog

A first approach for adapting agents would be to take the pre-trained weights from stage 1 and simply fine-tune for our full image guessing task. However, this agent would face a number of challenges. It has never had to model multiple steps of a dialog. Further, while trying to predict the target image there is little to encourage Q-bot to continue producing intelligible language. Indeed, we find our baselines do exhibit language drift. We consider four modifications to address these problems.

Discrete Latent Intention Representation $z^r$.

Rather than a continuous vector passing from the question policy to the speaker, we pass discrete vectors. Specifically, we consider a representation composed of $N$ different $K$-way Concrete variables [concrete_distribution]. Let $z^r_n \in \{1, \ldots, K\}$ and let the logits $l^r_{n,1}, \ldots, l^r_{n,K}$ parameterize the Concrete distribution $p(z^r_n)$. We learn a linear transformation $W^z_n$ from the intermediate dialog state $h^r$ to produce these logits for each variable $n$:

$$l^r_{n,k} = \text{LogSoftmax}\left(W^z_n h^r\right)_k \quad \forall k \in \{1, \ldots, K\}, \; \forall n \in \{1, \ldots, N\}.$$

To provide input to the speaker, $z^r$ is embedded using a learned dictionary of embeddings. In our case each variable in $z^r$ has a dictionary of $K$ learned embeddings. The value of $z^r_n$ ($\in \{1, \ldots, K\}$) picks one of the embeddings for each variable and the final representation simply sums over all variables:

$$e_z = \sum_{n=1}^{N} E_n(z^r_n).$$
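The path from dialog state to speaker input can be sketched in a few lines. Two simplifications are assumptions of this sketch, not the paper's method: values are picked by argmax rather than sampled from a Concrete relaxation, and indices run from 0 to K-1 rather than 1 to K. All function names are illustrative.

```python
import math


def log_softmax(xs):
    m = max(xs)
    log_norm = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_norm for x in xs]


def intent_logits(weights, h):
    """weights[n] is a K x len(h) matrix (list of rows) for latent variable n."""
    return [log_softmax([sum(w * x for w, x in zip(row, h)) for row in matrix])
            for matrix in weights]


def pick_values(logits):
    """Deterministic stand-in for sampling: most likely value per variable."""
    return [max(range(len(l)), key=l.__getitem__) for l in logits]


def embed_intent(z, dictionaries):
    """dictionaries[n][k] is the embedding of value k for variable n.
    The speaker input is the sum of the selected embeddings."""
    dim = len(dictionaries[0][0])
    return [sum(dictionaries[n][k][d] for n, k in enumerate(z)) for d in range(dim)]
```

Since the speaker only ever sees a sum of dictionary entries, the planner can change which entries it selects during fine-tuning without changing what each entry means to the decoder.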


VAE Pre-training.

When using this representation for the intent, we train stage 1 by replacing the likelihood with an ELBO (Evidence Lower BOund) loss as seen in Variational Auto-Encoders (VAEs) [vae] to help disentangle intent from expression by restricting information flow through $z^r$. We use the existing speaker module to decode $z^r$ into questions and train a new encoder module to encode the ground truth VQAv2 question into a conditional distribution over $z^r$ at round $r = 1$.

For the encoder we use a version of the previously described context encoder from Section 2.3 that uses the question as attention query instead of $F^{r-1}$ and $h^{r-1}$ (which are not available in this context). The resulting ELBO loss is


This is like the Full ELBO described, but not implemented, in [rethink_zhao]. The first term encourages the encoder to represent and the speaker to mimic the VQA question. The second term uses the KL Divergence to push the distribution of $z^r$ close to the $K$-way uniform prior, encouraging $z^r$ to ignore irrelevant information. Together, the first two terms form an ELBO on the question likelihood given the image pool [gumbel_softmax, kaiser18].
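For a single categorical variable, the KL divergence to a $K$-way uniform prior has a closed form, $\log K + \sum_k q_k \log q_k$. The sketch below computes only that term; how it is weighted against the reconstruction term, and how it is aggregated over the latent variables, is not specified here and is left out. The function name is illustrative.

```python
import math


def kl_to_uniform(probs):
    """KL(q || Uniform(K)) for one categorical distribution q given as a list
    of probabilities. Zero-probability entries contribute nothing."""
    k = len(probs)
    return math.log(k) + sum(p * math.log(p) for p in probs if p > 0.0)
```

The term is zero when the encoder's distribution is uniform and reaches its maximum of $\log K$ when the distribution is one-hot, so minimizing it discourages the latent variable from carrying more information than the reconstruction term needs.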

Fixed Speaker.

Since the speaker contains only lower level information about how to generate language, we freeze it during task transfer. We want only the high level ideas represented by $z^r$ and the predictor, which receives direct feedback, to adapt to the new task. If we updated the speaker then its language could drift given only the sparse feedback available in each new setting.

Adaptation Curriculum.

As the pre-trained (stage 1) model has never had to keep track of dialog contexts beyond the first round, we fine-tune in two stages, 2.A and 2.B. In stage 2.A we fix the Context Encoder and Question Policy parts of the planner so the model can learn to track dialog effectively without trying to generate better dialog at the same time. This stage takes 20 epochs to train. Once Q-bot learns how to track dialog we update the entire planner in stage 2.B for 5 epochs. (We find 5 epochs stops training early enough to avoid overfitting on our val set.)
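The curriculum amounts to a schedule of which modules receive gradient updates. The sketch below encodes that schedule as plain data. Module names are illustrative, and treating the predictor and history encoder as trainable in both stages is an assumption read off the description above (the speaker stays frozen throughout, and stage 2.A additionally fixes the context encoder and question policy).

```python
MODULES = ("context_encoder", "history_encoder", "question_policy",
           "speaker", "predictor")

# Stage 2 schedule as described in the text; epochs are the reported values.
STAGES = {
    "2.A": {"epochs": 20,
            "frozen": {"context_encoder", "question_policy", "speaker"}},
    "2.B": {"epochs": 5,
            "frozen": {"speaker"}},
}


def trainable_modules(stage):
    """Modules that receive gradient updates in the given stage."""
    frozen = STAGES[stage]["frozen"]
    return [name for name in MODULES if name not in frozen]
```

In a training script, a table like this would typically drive which parameter groups are handed to the optimizer at the start of each stage.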

4 Experiments

We want to show that our proposed agent can adapt to new tasks while exhibiting less linguistic drift. In Task Settings and Section 4.1 we start by describing the new tasks we construct and the baselines we compare to, then the following sections demonstrate how our model adapts while preventing drift using qualitative examples (Section 4.2), automated metrics (Section 4.3), and human judgements (Section 4.4). We also summarize the model ablations (Section 4.5) detailed in the supplement.

Task Settings.

We construct new tasks by varying four parameters of our image guessing game:

  • Number of Dialog Rounds. The number of dialog rounds is fixed at 1, 5, or 9.

  • Pool Size. The number of images in a pool is set to 2, 4, or 9.

  • Image Domain. By default we use VQA images (i.e., from COCO [LinECCV14coco]), but we also construct pools using CUB (bird) images [cub] and AWA (animal) images [awa2].

  • Pool Sampling Strategy. We test two ways of sampling pools of images. The Contrast sampling method, required for pre-training (Section 3.1), chooses a pair of images with contrasting answers to the same question from VQAv2. This method only works for pools of size 2. The Random sampling method chooses images at random from the images available in the split.

For example, consider the ‘VQA - 2 Contrast - 5 Round’ setting. These pools are constructed from 2 VQA images with the Contrast sampling strategy and dialogs are rolled out to 5 rounds.
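The four parameters can be captured in a small configuration object. This is a hypothetical helper, not part of the released code: the class, its field names, and the name format (modeled on the 'VQA - 2 Contrast - 5 Round' example) are illustrative. It also encodes the constraint that Contrast sampling only applies to pools of two images.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSetting:
    domain: str       # "VQA", "CUB", or "AWA"
    pool_size: int    # 2, 4, or 9
    sampling: str     # "Contrast" or "Random"
    rounds: int       # 1, 5, or 9

    def __post_init__(self):
        if self.sampling not in ("Contrast", "Random"):
            raise ValueError(f"unknown sampling strategy: {self.sampling}")
        if self.sampling == "Contrast" and self.pool_size != 2:
            raise ValueError("Contrast sampling yields image pairs, so pool_size must be 2")

    @property
    def name(self):
        return f"{self.domain} - {self.pool_size} {self.sampling} - {self.rounds} Round"


def sample_random_pool(image_ids, pool_size, rng):
    """Random strategy: draw pool_size distinct images from the split."""
    return rng.sample(list(image_ids), pool_size)
```

Contrast pools are not sampled here because they require the VQAv2 pairing of images with contrasting answers to the same question.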

4.1 Baselines

We compare our proposed approach to two baselines – Zero-shot Transfer and Typical Transfer – ablating aspects of our model that promote adaptation to new tasks or prevent language drift. The Zero-shot Transfer baseline is our model after the single round fully supervised pre-training. Improvements over this model represent gains made from task based fine-tuning. The Typical Transfer baseline is our model under standard encoder-decoder dialog model design choices – i.e., a continuous latent variable, maximum likelihood pre-training, and fine-tuning the speaker module. Improvements over this model represent gains made from the modifications aimed at preventing language drift described in Section 3.2 – specifically, the discrete latent variable, ELBO pre-training, and frozen speaker module.

Figure 3: Qualitative comparison of dialogs generated by our model with those generated by Typical Transfer and Zero-shot Transfer baselines. Top / middle / bottom rows are image pool from COCO / AWA / CUB images respectively. Our model is pre-trained on VQA (COCO images) and generates more intelligible questions on out-of-domain images.

4.2 Qualitative Results

Figure 3 shows example outputs of the Typical Transfer and Zero-shot Transfer baselines alongside our Q-bot on VQA, AWA and CUB images using size 4 Randomly sampled pools and 5 rounds of dialog. Both our model and the Typical Transfer baseline tend to guess the target image correctly, but it is much easier to tell what the questions our model asks mean and how they might help with guessing the target image. On the other hand, questions from the Zero-shot Transfer baseline are clearly grounded in the images, but they do not seem to help guess the target image and the Zero-shot Transfer baseline indeed fails to guess correctly. This is a pattern we will reinforce with quantitative results in Sections 4.3 and 4.4.

These examples and others we have observed suggest interesting patterns that highlight the role of A-bot. Our automated A-bot based on [bottomup] does not always provide accurate answers, limiting the questions Q-bot can usefully ask. When there is signal in the answers, it is not necessarily intelligible, providing an opportunity for Q-bot’s language to drift.

4.3 Automated Evaluation

We consider metrics addressing both Task performance and Language quality. While task performance is straightforward (did Q-bot guess the correct target image?), language quality is harder to measure. We describe three automated metrics here and further investigate language quality using human evaluations in Section 4.4.

Task – Guessing Game Accuracy.

To measure task performance we report the accuracy of Q-bot’s target image guess at the final round of dialog.

Language – Question Relevance via A-bot.

To be human understandable, the generated questions should be relevant to at least one image in the pool. We measure question relevance as the maximum question image relevance across the pool as measured by A-bot, i.e., $\max_{p} \left(1 - p(\text{Not Relevant} \mid I_p, q_r)\right)$. We note that this is only a proxy for actual question relevance as A-bot may report Not Relevant erroneously if it fails to understand Q-bot’s question; however, in practice we find A-bot does a fair job in determining relevance. We also provide human relevance judgements in Section 4.4.
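As a sketch, the metric reduces to a maximum over the pool. It assumes, as one plausible reading of the definition above, that the relevance of a question to an image is one minus the probability the answerer assigns to Not Relevant for that image; the function name is illustrative.

```python
def question_relevance(not_relevant_probs):
    """not_relevant_probs[p] is the answerer's probability of 'Not Relevant'
    for the question paired with pool image p. The question counts as
    relevant if it is relevant to at least one image, hence the max."""
    return max(1.0 - p for p in not_relevant_probs)
```

Taking the maximum rather than the mean matters: a question about a surfer is still a good question when only one pool image contains a surfer.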

Language – Fluency via Perplexity.

To evaluate Q-bot’s fluency, we train an LSTM-based language model on the corpus of questions in VQA. This allows us to evaluate the perplexity of the questions generated by Q-bot for dialogs on its new tasks. Lower perplexity indicates the generated questions are similar to VQA questions in terms of syntax and content. However, we note that questions generated for the new tasks could have higher perplexity because they have drifted from English or because different things must be asked for the new task, so lower perplexity is not always better [gen_eval].
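Perplexity is the exponential of the average negative log-likelihood the language model assigns to each token. The sketch below uses the standard per-token definition and takes the token log-probabilities as given; whether the reported numbers are averaged per token or per question is not stated here, so treat that as an assumption.

```python
import math


def perplexity(token_log_probs):
    """token_log_probs: natural-log probabilities a language model assigns to
    each token of the generated questions."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

A model that spreads probability evenly over four choices at every step has perplexity 4, which gives a feel for the scale of the values reported in Table 1.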

Language – Diversity via Distinct -grams.

This considers the set of all questions generated by Q-bot across all rounds of dialog on the val set. It counts the number of $n$-grams in this set, $G_n$, and the number of distinct $n$-grams in this set, $D_n$, then reports $D_n / G_n$ for each value of $n$. Note that instead of normalizing by the number of words as in previous work [dbs, li15ado], we normalize by the number of $n$-grams so that the metric represents a percentage for values of $n$ other than $n = 1$. Generative language models frequently produce safe standard outputs [dbs], so diversity is a sign this problem is decreasing, but diversity by itself does not make language meaningful or useful.
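A minimal version of the metric, normalizing the distinct count by the total number of $n$-grams as described above. Whitespace tokenization and the function name are assumptions of this sketch.

```python
def distinct_n(questions, n):
    """Fraction of n-grams that are distinct across all generated questions."""
    ngrams = []
    for question in questions:
        tokens = question.split()          # assumption: whitespace tokenization
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)
```

For the two questions "is it a bird" and "is it a dog" there are six bigrams of which four are distinct, so the bigram score is 4/6. A model that emits the same question every round drives the score toward zero, while gibberish can drive it up, which is why the text cautions against reading diversity on its own.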


Table 1 presents results on our val set for our model and baselines across the various settings described in Task Settings. Agents are tasked with generalizing further and further from their source language data. Setting A is the same as for stage 1 pre-training. In that same column, B and C require generalization to multiple rounds of dialog and Randomly sampled image pairs instead of pools sampled with the Contrast strategy. In the right side of Table 1 we continue to test generalization farther from the language source using more images and rounds of dialog (D) and then using different types of images (E and F). Our model performs well on both task performance and language quality across the different settings in terms of these automatic evaluation metrics. Other notable findings are:

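| Setting | ID | Model | Accuracy | Perplexity | Relevance | Diversity |
|---|---|---|---|---|---|---|
| VQA - 2 Contrast - 1 Round | A1 | Zero-shot Transfer | 0.73 | 2.62 | 0.87 | 0.50 |
| VQA - 2 Contrast - 1 Round | A2 | Typical Transfer | 0.71 | 10.62 | 0.66 | 5.55 |
| VQA - 2 Contrast - 1 Round | A3 | Ours | 0.82 | 2.6 | 0.88 | 0.54 |
| VQA - 2 Contrast - 5 Rounds | B1 | Zero-shot Transfer | 0.67 | 2.62 | 0.87 | 0.50 |
| VQA - 2 Contrast - 5 Rounds | B2 | Typical Transfer | 0.74 | 10.62 | 0.66 | 5.55 |
| VQA - 2 Contrast - 5 Rounds | B3 | Ours | 0.87 | 2.60 | 0.88 | 0.54 |
| VQA - 2 Random - 5 Rounds | C1 | Zero-shot Transfer | 0.64 | 2.64 | 0.75 | 1.73 |
| VQA - 2 Random - 5 Rounds | C2 | Typical Transfer | 0.86 | 16.95 | 0.62 | 8.13 |
| VQA - 2 Random - 5 Rounds | C3 | Ours | 0.95 | 2.69 | 0.77 | 2.34 |
| VQA - 9 Random - 9 Rounds | D1 | Zero-shot Transfer | 0.18 | 2.72 | 0.77 | 1.11 |
| VQA - 9 Random - 9 Rounds | D2 | Typical Transfer | 0.78 | 40.66 | 0.77 | 2.57 |
| VQA - 9 Random - 9 Rounds | D3 | Ours | 0.53 | 2.55 | 0.75 | 0.95 |
| AWA - 9 Random - 9 Rounds | E1 | Zero-shot Transfer | 0.47 | 2.49 | 0.96 | 0.24 |
| AWA - 9 Random - 9 Rounds | E2 | Typical Transfer | 0.48 | 12.56 | 0.64 | 2.21 |
| AWA - 9 Random - 9 Rounds | E3 | Ours | 0.74 | 2.41 | 0.96 | 0.28 |
| CUB - 9 Random - 9 Rounds | F1 | Zero-shot Transfer | 0.36 | 2.56 | 1.00 | 0.04 |
| CUB - 9 Random - 9 Rounds | F2 | Typical Transfer | 0.38 | 20.92 | 0.47 | 2.16 |
| CUB - 9 Random - 9 Rounds | F3 | Ours | 0.74 | 2.47 | 1.00 | 0.04 |

Table 1: Performance of our models and baselines in different experimental settings. From setting A to setting F, agents are tasked with generalizing further from the source data. Our method strikes a balance between guessing game performance and interpretability.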

Ours vs. Zero-shot Transfer.

To understand the relative importance of the proposed stage 2 training, which transfers to dialog for the DwD task, we compared the task accuracy of our model with that of Zero-shot Transfer. In setting A, which matches the training regime, our model outperforms Zero-shot Transfer by 9% (A3 vs A1) on task performance. As the tasks differ in settings B-F, we see further gains with our model consistently outperforming Zero-shot Transfer by 20-38%. Despite these gains, our model maintains similar language perplexity, A-bot relevance, and diversity.

Ours vs. Typical Transfer.

Our discrete latent variable, variational pre-training objective, and fixed speaker play an important role in avoiding language drift. Compared to the Typical Transfer model without these techniques, our model achieves over 4x (A2 / A3) lower perplexity and 10-53% better A-bot relevance. Our model also improves in averaged accuracy, which means more interpretable language also improves the task performance. Note that Typical Transfer has 2-100x higher diversity compared to our model, which is consistent with the gibberish we observe from that model (e.g., in Figure 3) and further suggests its language is drifting away from English.

Results from Game Variations.

We consider the following variations on the game:

  • Dialog Rounds. Longer dialogs (more rounds) achieve better accuracy (A3 vs B3).

  • Pool Sampling Strategy. As expected, Random pools are easier compared to Contrast pools (B3 vs C3 accuracy); however, language fluency and relevance drop on the Random pools (B3 vs C3 perplexity and A-bot relevance).

  • Image Source. CUB and AWA pools are harder compared to COCO image domain (D3 vs E3 vs F3). Surprisingly, our model maintains similar perplexity and high A-bot relevance even on these out-of-domain image pools. The Zero-shot Transfer and Typical Transfer baselines generalize poorly to these different image domains – reporting task accuracies nearly half our model performance.

4.4 Human Studies

We also evaluate our models by asking if humans can understand Q-bot’s language. Specifically, we use workers (turkers) on Amazon Mechanical Turk to evaluate the relevance, fluency, and task performance of our models. Section 3 of the supplement details these studies, but we briefly summarize the results here.

Turkers considered the questions from our model more relevant to the image pool than those from the Typical Transfer model and about as relevant as the Zero-shot Transfer model’s questions. Similarly, they considered our model’s questions more fluent than the Typical Transfer model’s questions and as fluent as the Zero-shot Transfer model’s questions. However, when we used turkers to answer Q-bot’s questions – replacing the automated A-bot – our Q-bot was able to guess the correct image 69% of the time while the Typical Transfer model only achieved 45% accuracy and the Zero-shot Transfer model achieved 23% accuracy. This again confirms that our model can adapt to new tasks with minimal sacrifice to language quality.

4.5 Model Ablations

We investigate the impact of our modelling choices from Section 3.2 by ablating these choices in Section 5 of the supplement, summarizing the results here. The choice of discrete instead of continuous $z^r$ helps maintain language quality, as does the use of variational (ELBO) pre-training instead of maximum likelihood. Surprisingly, the ELBO loss probably has more impact than the discreteness of $z^r$. Fixing the speaker module during stage 2 also had a minor role in discouraging language drift. Finally, we find that improvements in task performance are due more to learning to track the dialog in stage 2.A than they are due to asking more discriminative questions.

5 Related Work

Our interest comes from language drift problems encountered when using models comparable to the Zero-shot Transfer baseline. In [deal_or_no_deal] a dataset is collected with question supervision, then fine-tuning is used in an attempt to increase task performance, but the resulting utterances are unintelligible. Similarly, [visdial_rl] takes a very careful approach to fine-tuning for task performance but finds that language also diverges, becoming difficult for humans to understand. Neither approach uses discrete latent variables or a multi stage training curriculum, as in our proposed model. Furthermore, these models need to be adapted to work in our new setting, and doing so would yield models very similar to our Typical Transfer baseline.

More recently, [drift_emnlp19] observe language drift in a translation game from French to German. They reduce drift by supervising communications between agents with auxiliary translations to English and grounding in images. This setting is somewhat different than ours since grounding is directly necessary to solve our task. The approach also requires direct supervision on the communication channel, which is not practical for a multiple round dialog game like ours.

We used a visual reference game to study question generation, improving the quality of generated language using concepts related to latent action spaces. Some works like [lba] and [visual_curiosity] also aim to ask visual questions with limited question supervision. Other works represent dialog using latent action spaces [rethink_zhao, zhao_unsup, zero_shot_zhao, yarats2017hierarchical, wen2017latent, serban2016piecewise, hu2019hierarchical, kang2019recommendation, serban2017hierarchical, williams2017hybrid]. Finally, reference games are generally popular for studying language [lewis, guesswhat, visdial_eval]. Section 6 of the supplement describes the relationship between our approach and these works in more detail.

6 Conclusion

In this paper we proposed the Dialog without Dialog (DwD) task along with a model designed to solve this task and an evaluation scheme that takes its goals into account. The task is to perform the image guessing game, maintaining language quality without dialog level supervision. This balance is hard to strike, but our proposed model manages to strike it. Our model approaches this task by representing dialogs with a discrete latent variable and carefully transferring language information via multi stage training. While baseline models either adapt well to new tasks or maintain language quality and intelligibility, our model is the only one to do both according to both automated metrics and human judgements. We hope these contributions help inspire useful dialog agents that can also interact with humans.

7 Broader Impact

We think the main ethical aspects of this work and their consequences for society have to do with fairness. There is an open research problem around existing deep learning models often reflecting and amplifying undesirable biases that exist in society.

While visual question answering and visual dialog models do not currently work well enough to be relied on in the real world (largely because of the aforementioned proneness to bias), they could be deployed in applications where these biases could have negative impacts on fairness in the future. For example, visually impaired users might use these models to understand visual aspects of their world [vizwiz]. If these models are not familiar with people in certain contexts (e.g., men shopping) or are only used to interacting with certain users (e.g., native English speakers) then they might fail for some sub-groups (e.g., non-native English speakers who go shopping with men) but not others.

Our research model may be prone to biases, though it was trained on the VQAv2 dataset [vqav2], which aimed to be more balanced than its predecessor. However, by increasing the intelligibility of generated language our work may help increase the overall interpretability of models. This may help by making bias easier to measure and providing additional avenues for correcting it.


8 Acknowledgements

The Georgia Tech effort was supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE, Amazon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.


9 Supplement Overview

This document contains supplementary material for “Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data”. The main paper excludes some details which we provide here. Section 10 describes the model proposed in the main paper in more detail, including algorithms that show how it executes one round of dialog. Section 11 describes the human studies we use to evaluate our model and reports those results in detail. Section 12 reports the ablations we use to evaluate the effects of different aspects of the proposed model. Section 13 reports how the various models we consider perform at different rounds of dialog. Finally, Section 14 explores in more depth how our work relates to other relevant work in the literature.

10 Architecture Details

This section describes our architecture in more detail. Algorithm 1 summarizes our complete QBot() implementation, and the subsequent algorithms define the subroutines used inside QBot() along with the encoder we use for variational pre-training. The predictor is described in Algorithm 2, the planner in Algorithm 3, and the speaker in Algorithm 4. Algorithm 5 describes the encoder used for the ELBO loss.

Note that the number of bounding boxes per image is , the number of images in a pool is , and the max question length is .

There is a minor notation difference between this section and the main paper. In this section there is an additional hidden state that parallels and is used only inside the planner. While is the hidden state of an LSTM, is computed in the same way except it uses a different output gate (see line 11 of Algorithm 3). This is essentially a second LSTM output that allows the context coder query to forget dialog history information irrelevant to the current round, while allowing to focus on representing the entire dialog state.
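To make the idea of a second LSTM output concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the extra output gate reads the same inputs as the standard output gate (the current input and the previous hidden state) and shares the cell state; the names `W_extra`, `U_extra`, `b_extra`, and `lstm_step_with_second_output` are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step_with_second_output(x, h_prev, c_prev, p):
    """One LSTM step returning the usual hidden state and a second output.

    Both outputs are read from the same cell state c; they differ only in
    which output gate is applied.
    """
    # Standard LSTM gates, stacked as [input, forget, candidate, output].
    i, f, g, o = np.split(p["W"] @ x + p["U"] @ h_prev + p["b"], 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)  # standard LSTM hidden state

    # Assumption: the extra gate sees the same inputs as the standard
    # output gate, but has its own (hypothetical) weights.
    o_extra = sigmoid(p["W_extra"] @ x + p["U_extra"] @ h_prev + p["b_extra"])
    h_extra = o_extra * np.tanh(c)  # second output, used only by the planner
    return h, h_extra, c

def init_params(input_dim, hidden_dim, rng, scale=0.1):
    """Random parameters for the sketch (illustrative initialization only)."""
    return {
        "W": scale * rng.normal(size=(4 * hidden_dim, input_dim)),
        "U": scale * rng.normal(size=(4 * hidden_dim, hidden_dim)),
        "b": np.zeros(4 * hidden_dim),
        "W_extra": scale * rng.normal(size=(hidden_dim, input_dim)),
        "U_extra": scale * rng.normal(size=(hidden_dim, hidden_dim)),
        "b_extra": np.zeros(hidden_dim),
    }

rng = np.random.default_rng(0)
params = init_params(input_dim=8, hidden_dim=16, rng=rng)
h, h_extra, c = lstm_step_with_second_output(
    rng.normal(size=8), np.zeros(16), np.zeros(16), params
)
```

Because the two outputs share one cell state, the second gate can suppress history that is irrelevant to the current query without changing what the recurrent state itself stores.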

1 Function QBot()
2       Predictor()
3       Planner()
4       Speaker()
5       return
Algorithm 1 Question Bot
1 Function Predictor()
       Input: (), ,
       /* fact */
       /* Attention over rounds */
       /* Attention over bounding boxes */
11       return
Algorithm 2 Predictor
1 Function Planner()
       Input: (), , , ,
       Output: , ,
       /* Context Coder */
       /* Dialog RNN */
       /* Question Policy */
16       return
Algorithm 3 Planner

In the planner alg:planner at lines 5 and 6 are all two layer MLPs with ReLU output and weight norm. Both and

are linear transformations with weight norm applied (no activation function).

is a linear transformation without weight norm purely for dimensionality reduction. To compute we also add new linear weights and as for a standard LSTM output gate.

Note that for the planner there is an additional residual connection at line 16 which augments the hidden state. This allows gradients to flow through the question policy parameters at line 13 when we fine-tune for task performance without fully supervised dialogs.

In alg:predictor are both 2-layer ReLU nets with weight norm. Also is a 2-layer net with ReLU and Dropout on the hidden activation and weight norm on both layers.

1 Function Speaker()
4       return
Algorithm 4 Speaker

In alg:speaker is an LSTM decoder.

1 Function Encoder()
       Input: ,
       Output: (sample or distribution parameters)
       /* Context Coder */
9       return
Algorithm 5 Encoder

11 Human Evaluation

As summarized in Section 4.5 of the main paper, we also evaluate our models by asking whether humans can understand Q-bot’s language. Specifically, we use workers (turkers) on Amazon Mechanical Turk to evaluate the relevance, fluency, and task performance of our models. We discuss each study below and report results for all studies in Figure 4.

Figure 4: Human evaluation of language quality – question fluency (left), relevance (middle) and task performance (right). Question fluency and relevance compare a pair of agent-generated questions, asking users which (or possibly neither) is more fluent/relevant. Task performance is to have humans interact dynamically with Q-bot in real time.

Human Study for Question Relevance.

To get a more accurate measure of question relevance, we asked humans to evaluate questions generated by our model and the baselines (Zero-shot Transfer & Typical Transfer). We curated 300 random, size 4 pools where all three models predicted the target correctly at round 5. Size 9 pools require longer dialogs and take more effort for humans to analyze, so we use size 4 pools here, which lets humans analyze more pools in the same time. For a random round, we show turkers the questions from a pair of models and ask ‘Which question is most relevant to the images?’ Answering the question is a forced choice between three options: one of the two models or an ‘Equally relevant’ option. See Figure 5 for an example of the interface we presented to turkers. All model pairs were evaluated for each pool of images and the questions were presented in a random order, though the ‘Equally relevant’ option was always last.

The results in Figure 4 (middle) show the frequency with which each option was chosen for each model pair. Our model was considered more relevant than the Typical Transfer model (47.8% vs. 30.2%) and about the same as the Zero-shot Transfer model (36.6% vs. 38.4%).

Figure 5: Human study instructions for question relevancy.

Human Study for Question Fluency.

We also evaluate fluency by asking humans to compare questions. In particular, we presented the same pairs of questions to turkers as in the relevance study, but this time we did not present the pool of images and asked them ‘Which question is more understandable?’ As before, there was a forced choice between two models and an ‘Equally understandable’ option. This captures fluency because humans are more likely to report that they understand grammatical and fluent language. An example interface is in Figure 6. As in the relevance study, questions were presented in a random order.

Figure 4 (left) shows the frequency with which each option was chosen for each model pair. Our model is considered more fluent than the Typical Transfer model (49.4% vs. 17.9%) and about the same as the Zero-shot Transfer model (24.8% vs. 26.2%).

Figure 6: Human study instructions for question fluency.

Human Study for Task Performance.

What we ultimately want in the long term is for humans to be able to collaborate with bots to solve tasks. Therefore, the most direct evaluation of our task is to have humans interact dynamically with Q-bot. We implemented an interface that allowed turkers to interact with Q-bot in real time, depicted in Figure 7. Q-bot asks a question. A human answers it after looking at the target image. Q-bot asks a new question in response to the human answer and the human responds to that question. After the 4th answer, Q-bot makes a guess about which target image the human was basing their answers on.
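The interaction protocol above can be sketched as a short loop. This is only an illustration under stated assumptions: the callables `ask`, `answer`, and `guess` are hypothetical stand-ins for the question bot, the human answerer, and the final image prediction, and are not part of any released interface.

```python
from typing import Callable, List, Tuple

# A dialog is a list of (question, answer) pairs.
Dialog = List[Tuple[str, str]]

def play_game(
    ask: Callable[[Dialog], str],      # hypothetical: bot asks given the history
    answer: Callable[[str], str],      # hypothetical: human answers about the target
    guess: Callable[[Dialog], int],    # hypothetical: bot picks an image index
    num_rounds: int = 4,
) -> Tuple[Dialog, int]:
    """Run one guessing game: num_rounds question/answer turns, then a guess."""
    dialog: Dialog = []
    for _ in range(num_rounds):
        question = ask(dialog)
        dialog.append((question, answer(question)))
    return dialog, guess(dialog)

def accuracy(guesses: List[int], targets: List[int]) -> float:
    """Fraction of games in which the guessed index equals the target index."""
    correct = sum(int(g == t) for g, t in zip(guesses, targets))
    return correct / len(targets)

# Toy usage with stub players (placeholders, not real models or people).
demo_dialog, demo_guess = play_game(
    ask=lambda d: f"question {len(d) + 1}?",
    answer=lambda q: "yes",
    guess=lambda d: len(d) % 4,
)
```

Keeping the three roles as separate callables mirrors the study design: the same loop can be run with a human or an automated answerer without changing the bot.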

We perform this study for the same pools for each model and find our approach achieves an accuracy of 69.39%, significantly higher than Typical Transfer at 44.90% and Zero-shot Transfer at 22.92%, as shown in Figure 4 (right). This study shows that our model learns language for this task that is amenable to human-AI collaboration. This is in contrast to prior work [visdial_eval] that showed that improvements captured by task-trained models for similar image-retrieval tasks did not transfer when paired with human partners.

Figure 7: Interactive task performance MTurk interface.

12 Model Ablations

We investigate the impact of our modelling choices from Section 3.2 of the main paper. In Table 2 we report the mean of all four automated metrics averaged over pool sizes, pool sampling strategies, and datasets. (This includes 10 settings: {random 2, 4, 9 pools} × {VQA, AWA, CUB} and 2 contrast pools on VQA.) Next we explain how we vary each of these model dimensions.

  • Our 128 4-way Concrete variables require 512 logits (Discrete). Thus we compare to the standard Gaussian random variable common throughout VAEs with 512 dimensions (Continuous).

  • In both discrete and continuous cases we train with an ELBO loss (ELBO), so we compare to a maximum likelihood only model (MLE) that uses an identity function as in the default option for the Question Policy (see Section 2.3.1 of the main paper). The MLE model essentially removes the KL term (2nd term of Eq. 7 of the main paper) and ignores the encoder during pre-training.

  • We consider checkpoints after each step of our training curriculum: Stage 1, Stage 2.A, and Stage 2.B. For some approaches we skip Stage 2.A and go straight to fine-tuning everything except the speaker as in Stage 2.B. This is denoted by Stage 2.

  • We consider 3 variations on how the speaker is fine-tuned. The first is our proposed approach of fixing the speaker (Fixed). The next fine-tunes the speaker (Fine-tuned). To evaluate the impact of fine-tuning we also consider a version of the speaker which cannot learn to ask better questions by using a parallel version of the same model (Parallel). This last version is described in more detail below.
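As an illustration of the discrete structure in the first bullet, the sketch below draws 128 relaxed 4-way categorical (Concrete, or Gumbel-softmax) samples from 512 logits using NumPy. It is a generic sketch of the standard relaxation, not the authors' code; the temperature value and the function name `sample_concrete` are assumptions made for the example.

```python
import numpy as np

def sample_concrete(logits, temperature, rng):
    """Draw relaxed one-hot samples from categorical logits.

    logits: array of shape (num_variables, num_categories).
    Returns an array of the same shape whose rows lie on the simplex.
    """
    # Perturb the logits with Gumbel(0, 1) noise, then apply a softmax
    # with temperature (lower temperature gives samples closer to one-hot).
    noisy = (logits + rng.gumbel(size=logits.shape)) / temperature
    noisy = noisy - noisy.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(noisy)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
flat_logits = rng.normal(size=512)      # 512 logits from a hypothetical policy
logits = flat_logits.reshape(128, 4)    # 128 variables, 4 categories each
z = sample_concrete(logits, temperature=1.0, rng=rng)
```

The 512-dimensional Gaussian used for the Continuous comparison matches this parameter count, so the two structures differ in the kind of latent variable rather than in its size.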

Row  Structure   Loss  Curriculum  Speaker     Accuracy  Perplexity  Relevance  Diversity
1    Discrete    ELBO  Stage 2.B   Fixed       0.81      2.57        0.89       0.86
2    Discrete    ELBO  Stage 2     Fine-tuned  0.82      2.54        0.85       0.59
3    Discrete    ELBO  Stage 2     Parallel    0.78      2.60        0.88       0.73
4    Discrete    ELBO  Stage 1     Fixed       0.72      2.60        0.91       0.48
5    Discrete    ELBO  Stage 2.A   Fixed       0.80      2.59        0.89       0.81
6    Discrete    ELBO  Stage 2     Fixed       0.80      2.53        0.85       0.62
7    Continuous  ELBO  Stage 2.B   Fixed       0.75      2.45        0.66       0.23
8    Continuous  MLE   Stage 2.B   Fixed       0.78      4.27        0.83       4.33
Table 2: Various ablations of our training curriculum.

Discrete Outperforms Continuous.

By comparing our model in row 1 of Table 2 to row 7 we see that our discrete model outperforms the corresponding continuous model in terms of task performance (higher Accuracy) and about matches it in interpretability (similar Perplexity and higher Relevance). This may be a result of discreteness constraining the optimization problem to prevent overfitting and is consistent with previous work that used a discrete latent variable to model dialog [rethink_zhao].

Stage 2.B Less Important than Stage 2.A

Comparing rows 4, 5, and 1 of Table 2, we can see that each additional step, Stage 2.A (row 4 to row 5) and Stage 2.B (row 5 to row 1), increases task performance while interpretability stays about the same. However, most of the gain in task performance comes from Stage 2.A (accuracy rises from 0.72 to 0.80), while Stage 2.B adds only one more point (0.80 to 0.81). This indicates that improvements in task performance are mainly from learning to incorporate information over multiple rounds of dialog.

Better Predictions, Slightly Better Questions

To further investigate whether Q-bot is asking better questions or just understanding dialog context better for prediction, we considered a Parallel speaker model. This model loaded two copies of Q-bot, A and B, both starting at the model resulting from Stage 1. Copy A was fine-tuned for task performance, but every it generated was ignored and replaced with the generated by copy B, which was not updated at all. The result was that copy A of the model could not incorporate dialog context into its questions any better than the Stage 1 model, so all it could do was track the dialog better for prediction purposes. By comparing the performance of copy A (row 3 of Table 2) to our model (row 1) we can see a 3 point difference in accuracy (0.78 vs. 0.81), so the question content of our model has improved after fine-tuning, but not by a lot. Again, this indicates most improvements are from dialog tracking for prediction (row 3 accuracy, 0.78, is much higher than row 4 accuracy, 0.72).

Fine-tuned Speaker

During both Stage 2.A and Stage 2.B we fix the Speaker module because it is intended to capture low level language details and we do not want it to change its understanding of English. Row 2 of Table 2 does not fix the Speaker during Stage 2 fine-tuning. Instead, it uses each softmax at each step of the LSTM decoder to parameterize one Concrete variable [gumbel_softmax] per word. This allows gradients to flow through the decoder during fine-tuning, allowing the model to tune low-level signals. This is similar to previous approaches which either used this technique [best_of_both_worlds] or REINFORCE [visdial_rl]. This model is competitive with our model in terms of task performance. However, when we inspect its output we see somewhat less understandable language.

Variational Prior Helps Interpretability

We found the most important factor for maintaining interpretability to be the ELBO loss we applied during pre-training. Comparing the continuous Gaussian variable (row 7) to a similar hidden state (row 8) trained without the KL prior term we see drastically different perplexity and diversity. In the main paper these metrics dropped when a model had drifted from English (e.g., for Typical Transfer). This suggests the model without the ELBO in row 8 has experienced similar language drift.

13 Performance by Round

Experiments in the main paper considered dialog performance after the first round (top of Table 1) and at the final round of dialog (either 5 or 9). This does not give much sense of how dialog performance increases over rounds of dialog, so we report Q-bot’s guessing game performance at each round of dialog in Figure 8. For all fine-tuned models performance goes up over multiple rounds of dialog, though some models benefit more than others. Stage 1 models decrease in performance after round 1 because later rounds are too far from the training data such models have been exposed to.

Figure 8: Task performance (guessing game accuracy) over rounds of dialog. Performance increases over rounds for all models except the Stage 1 models.
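For readers who want to reproduce this kind of per-round curve from their own logs, the following is a small sketch of the computation. It assumes a hypothetical log layout in which the bot's guess is recorded after every round of every game; the function name `accuracy_by_round` and the toy arrays are illustrative only and are not results from the paper.

```python
import numpy as np

def accuracy_by_round(guesses, targets):
    """Guessing game accuracy at each round.

    guesses: shape (num_games, num_rounds), guessed image index per round.
    targets: shape (num_games,), index of the true target image per game.
    Returns an array of shape (num_rounds,).
    """
    guesses = np.asarray(guesses)
    targets = np.asarray(targets)
    # Compare every round's guess to that game's target, then average over games.
    return (guesses == targets[:, None]).mean(axis=0)

# Toy data: 3 games, 3 rounds each (made up for illustration).
toy_guesses = [[0, 1, 1],
               [2, 2, 2],
               [0, 0, 3]]
toy_targets = [1, 2, 3]
per_round = accuracy_by_round(toy_guesses, toy_targets)
```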

14 Related Work

We used a visual reference game to study question generation, improving the quality of generated language using concepts related to latent action spaces. This interest is mainly inspired by problems encountered when using models comparable to the Zero-shot Transfer baseline. In [deal_or_no_deal] a dataset is collected with question supervision then fine-tuning is used in an attempt to increase task performance, but the resulting utterances are not intelligible. Similarly, [visdial_rl] takes a very careful approach to fine-tuning for task performance but finds that language also diverges, becoming difficult for humans to understand.

Visual Question Generation.

Other approaches like [lba] and [visual_curiosity] also aim to ask questions with limited question supervision. They give Q-bot access to an oracle to which it can ask any question and get a good answer back. This feedback allows these models to ask questions that are more useful for teaching [lba] or generating scene graphs [visual_curiosity], but they require a domain specific oracle and do not take any measures to encourage interpretability. We are also interested in generalizing with limited supervision, using a standard model trained on VQAv2 [vqav2] as a flawed oracle, but we focus on maintaining interpretability of generated questions and not just their usefulness.

Latent Action Spaces.

Of particular interest to us is a line of work that represents dialogs using latent action spaces [zhao_unsup, zero_shot_zhao, yarats2017hierarchical, wen2017latent, serban2016piecewise, hu2019hierarchical, kang2019recommendation, serban2017hierarchical, williams2017hybrid]. Recent work uses these representations to discover intelligible language [zhao_unsup] and to perform zero-shot dialog generation [zero_shot_zhao], though neither work considers visually grounded language as in our approach. Most relevant is [rethink_zhao], which focuses on the difference between word level feedback and latent action level feedback. Like us, they use a variationally constrained latent action space to generate dialogs and find that by providing feedback to the latent actions instead of the generated words (as opposed to the approaches in [visdial_rl] and [deal_or_no_deal]) they achieve better dialog performance. Our variational prior is similar to the Full ELBO considered in [rethink_zhao], but we consider generalization from non-dialog data and generalization to new modalities.

Reference Games.

The task we use to study question generation follows a body of work that uses reference games to study language and its interaction with other modalities [lewis]. Our particular task is most similar to those in [guesswhat] and [visdial_eval]. In particular, [guesswhat] collects a dataset for goal oriented visual dialog using a similar image reference game, and [visdial_eval] uses a guessing game similar to ours to evaluate how well humans can interact with A-bot.