A fundamental characteristic of a good HCI system is its ability to effectively acquire and
disseminate knowledge about the tasks and environments in which it is involved. A particular subclass of such systems, natural-language-driven conversational agents such as Alexa and Siri, has seen great success in a number of well-defined language-driven tasks. Even such widely adopted systems suffer, however, when exposed to less circumscribed, more free-form situations. Ultimately, an implicit requirement for the wide-scale success of such systems is an effective understanding of the environments and goals of the user – an exceedingly difficult problem in the general case, as it involves a variety of sub-problems (semantics, grounding, long-range dependencies), each extremely difficult in itself. One avenue to ameliorate such issues is the incorporation of visual context to explicitly ground the language used – providing a domain in which knowledge can be anchored and from which it can be extracted. Conversely, this also provides a way in which language can be used to characterise visual information in richer terms, for example with sentences describing salient features in the image (referred to as “captioning”) [13, 15].
In recent years, there has been considerable interest in visually-guided language generation in the form of visual question answering (VQA) and subsequently visual dialogue, both involving the task of answering questions in the context of an image. In the particular case of visual dialogue, along with the image, previously seen questions and answers (i.e. the dialogue history) are also accepted, and a relevant answer to the current question is produced. We refer to this one-sided or answer-only form of visual dialogue as 1VD. Inspired by these models and aiming to extend their capabilities, we establish the task of 2VD, whereby an agent must be capable of acting as both the questioner and the answerer.
Our motivation for this is simple – AI agents need to be able both to ask questions and to answer them, often interchangeably, rather than do either one exclusively. For example, a vision-based home assistant (e.g. Amazon’s Alexa) may need to ask questions based on its visual input (“There is no toilet paper left. Would you like me to order more?”) but may also need to answer questions asked by humans (“Did you order the two-ply toilet paper?”). The same question-answer capability applies to other applications. For example, with aids for the visually impaired, a user may need the answer to “Where is the tea and kettle?”, but the system may equally need to ask “Are you looking for an Earl Grey or Rooibos teabag?” to resolve potential ambiguities.
We take one step toward this broad research goal with FlipDial, a generative model capable of both 1VD and 2VD. The generative aspect of our model is served by the CVAE, a framework for learning deep conditional generative models while simultaneously amortising the cost of inference over the dataset [17, 24]. Furthermore, inspired by the recent success of convolutional neural networks (CNNs) in language generation and prediction tasks [11, 14, 21], we explore the use of CNNs on sequences of sequences (i.e. a dialogue) to implicitly capture all sequential dependencies through the model. Demonstrating the surprising effectiveness of this approach, we show sets of sensible and diverse answer generations for the 1VD task in Fig. 1.
We here provide a brief treatment of work related to visual dialogue. We reserve a thorough comparison to Das et al. for Section 4.3, noting here that our fully-generative convolutional extension of their model outperforms their state-of-the-art results on the answering of sequential visually-grounded questions (1VD). In another work, Das et al. present a reinforcement-learning-based model for 1VD, in which they instantiate two separate agents, one each for questioning and answering. Crucially, the two agents are given different information – one (QBot) is given the caption, and the other (ABot) the image. While this sets up the interesting task of performing image retrieval from natural-language descriptions, it is also fundamentally different from having a single agent perform both roles. Jain et al. explore a complementary task to VQA where the goal is instead to generate a (diverse) set of relevant questions given an image. In their case, however, there is no dependence on a history of questions and answers. Finally, we note that Zhao et al. employ a similar model structure to ours, using a CVAE to model dialogue, but condition their model on discourse-based constraints for a purely linguistic (rather than visuo-linguistic) dataset. The tasks we target, our architectural differences (CNNs), and the dataset and metrics we employ are distinct.
Our primary contributions in this work are therefore:
A fully-generative, convolutional framework for visual dialogue that outperforms state-of-the-art models on sequential question answering (1VD) using the generated answers, and establishes a baseline in the challenging two-way visual dialogue task (2VD).
Evaluation using the predicted (not ground-truth) dialogue – essential for real-world conversational agents.
Novel evaluation metrics for generative models of two-way visual dialogue to quantify answer-generation quality, question relevance, and the model's generative capacity.
Here we present a brief treatment of the preliminaries for deep generative models – a conglomerate of deep neural networks and generative models. In particular, we discuss the VAE, which, given a dataset $X = \{x_i\}_{i=1}^{N}$, simultaneously learns i) a variational approximation $q_\phi(z \mid x)$ (following the literature, the terms recognition model or inference network may also be used for this posterior variational approximation) to the unknown posterior distribution $p_\theta(z \mid x)$ for latent variable $z$, and ii) a generative model $p_\theta(x, z)$ over data and latent variables. These are both highly attractive prospects, as the ability to approximate the posterior distribution helps amortise inference for any given data point over the entire dataset $X$, and learning a generative model helps effectively capture the underlying abstractions in the data. Learning in this model is achieved through a unified objective involving the marginal likelihood (or evidence) of the data, namely:

$$\log p_\theta(x) = \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) + \mathcal{L}(\theta, \phi; x). \quad (1)$$
The unknown true posterior $p_\theta(z \mid x)$ in the first KL divergence is intractable to compute, making the objective difficult to optimise directly. Rather, a lower bound of the marginal log-likelihood $\log p_\theta(x)$, referred to as the evidence lower bound (ELBO), is maximised instead. In the conditional setting of the CVAE, with condition variable $c$, this takes the form

$$\mathcal{L}(\theta, \phi; x, c) = \mathbb{E}_{q_\phi(z \mid x, c)}\big[\log p_\theta(x \mid z, c)\big] - \mathrm{KL}\big(q_\phi(z \mid x, c) \,\|\, p_\theta(z \mid c)\big), \quad (2)$$
where the first term is referred to as the reconstruction or negative cross-entropy (CE) term, and the second as the regularisation or KL-divergence term. Here too, similar to the VAE, $q_\phi(z \mid x, c)$ and $p_\theta(z \mid c)$ are typically taken to be isotropic multivariate Gaussian distributions, whose parameters $\mu$ and $\sigma$ are provided by deep neural networks with parameters $\phi$ and $\theta$, respectively. The generative model likelihood $p_\theta(x \mid z, c)$, whose form varies depending on the data type – Gaussian or Laplace for images and categorical for language models – is also parametrised similarly. In this work, we employ the CVAE model for the task of eliciting dialogue given contextual information from vision (images) and language (captions).
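To make the conditional ELBO above concrete, here is a minimal sketch of a CVAE training loss in PyTorch. All layer sizes, module names, and the Gaussian likelihood choice are illustrative stand-ins, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCVAE(nn.Module):
    """Minimal CVAE: recognition net q(z|x,c), prior net p(z|c), decoder p(x|z,c)."""
    def __init__(self, x_dim=32, c_dim=16, z_dim=8):
        super().__init__()
        self.enc = nn.Linear(x_dim + c_dim, 2 * z_dim)   # -> (mu, log-variance)
        self.prior = nn.Linear(c_dim, 2 * z_dim)
        self.dec = nn.Linear(z_dim + c_dim, x_dim)

    def forward(self, x, c):
        mu_q, lv_q = self.enc(torch.cat([x, c], -1)).chunk(2, -1)
        mu_p, lv_p = self.prior(c).chunk(2, -1)
        z = mu_q + (0.5 * lv_q).exp() * torch.randn_like(mu_q)  # reparameterise
        recon = self.dec(torch.cat([z, c], -1))
        # closed-form KL between diagonal Gaussians: q(z|x,c) || p(z|c)
        kl = 0.5 * (lv_p - lv_q
                    + (lv_q.exp() + (mu_q - mu_p) ** 2) / lv_p.exp() - 1).sum(-1)
        # Gaussian likelihood up to a constant -> squared-error reconstruction
        rec = F.mse_loss(recon, x, reduction="none").sum(-1)
        return (rec + kl).mean()  # negative ELBO, averaged over the batch

x, c = torch.randn(4, 32), torch.randn(4, 16)
loss = ToyCVAE()(x, c)
```

Minimising this quantity jointly trains the recognition, prior and decoder networks, exactly as the unified objective above prescribes.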
3 Generative Models for Visual Dialogue
In applying deep generative models to visual dialogue, we begin by characterising a preliminary step toward it: VQA. In VQA, the goal is to answer a single question in the context of a visual cue, typically an image. The primary goal for such a model is to ensure that the elicited answer conforms to a stronger notion of relevance than simply answering the given question – it must also relate to the visual cue provided. This notion extends to 1VD, which we define as the task of answering a sequence of questions contextualised by an image (and a short caption describing its contents). Being able only to answer questions, however, does not fully encompass what true conversational agents do. We therefore extend 1VD to the more general and realistic task of 2VD. Here the model must elicit not just answers given questions, but questions given answers as well – generating both components of a dialogue, contextualised by the given image and caption. Generative 1VD and 2VD models introduce stochasticity in the latent representations.
As such, we begin by characterising our generative approach to 2VD using a CVAE. For a given image $i$ and associated caption $c$, we define a dialogue as a sequence of question-answer pairs $D = \{(q_t, a_t)\}_{t=1}^{T}$, simply denoted $D$ when sequence indexing is unnecessary. Additionally, we denote a dialogue context $h$; when indexed by step as $h_t$, it captures the dialogue subsequence up to and including step $t$.
With this formalisation, we characterise a generative model for 2VD under latent variable $z$ as $p_\theta(D \mid z, i, c)$ with conditional prior $p_\theta(z \mid i, c)$, and the corresponding recognition model defined as $q_\phi(z \mid D, i, c)$. Note that, in relation to Eq. 2, the data $x$ is the dialogue $D$ and the condition variable is the image-caption pair $(i, c)$, giving:

$$\mathcal{L}(\theta, \phi; D, i, c) = \mathbb{E}_{q_\phi(z \mid D, i, c)}\big[\log p_\theta(D \mid z, i, c)\big] - \mathrm{KL}\big(q_\phi(z \mid D, i, c) \,\|\, p_\theta(z \mid i, c)\big), \quad (3)$$
with the graphical model structures shown in Fig. 2.
The formulation in Eq. 3 is general enough to be applied from single question-answering (VQA) all the way to full two-way dialogue generation (2VD). Taking a step back from generative 2VD, we can re-frame the formulation for generative 1VD (i.e. sequential answer generation) by considering the generated component to be the answer to a particular question at step $t$, given context from the image, caption, and the sequence of previous question-answer pairs. Simply put, this corresponds to the data being the answer $a_t$, conditioned on the image, its caption, the dialogue history up to $t{-}1$, and the current question. For simplicity, we denote a compound context $c_t = (i, c, h_{t-1}, q_t)$ and reformulate Eq. 3 for 1VD as:

$$\mathcal{L}(\theta, \phi; a_t, c_t) = \mathbb{E}_{q_\phi(z \mid a_t, c_t)}\big[\log p_\theta(a_t \mid z, c_t)\big] - \mathrm{KL}\big(q_\phi(z \mid a_t, c_t) \,\|\, p_\theta(z \mid c_t)\big), \quad (4)$$
with the graphical model structures shown in Fig. 3.
Our baseline for the 1VD model can also be represented in our formulation by taking the variational posterior and generative prior to be conditional Dirac-delta distributions, i.e. $q_\phi(z \mid a_t, c_t) = p_\theta(z \mid c_t) = \delta_{z_c}(z)$ for a deterministic encoding $z_c$ of the context $c_t$. This transforms the objective from Eq. 4 by a) replacing the expectation of the log-likelihood over the recognition model with an evaluation of the log-likelihood at the single encoding $z_c$ (the one that satisfies the Dirac delta), and b) removing the regulariser, which is trivially 0. This computes the marginal likelihood directly as just the model likelihood $p_\theta(a_t \mid z_c, c_t)$.
Note that while such models can “generate” answers to questions by sampling from the likelihood function, we typically do not call them generative, since they make the encoding of the data and conditions fully deterministic. We explore and demonstrate the benefit of a fully generative treatment of 1VD in Section 4.3. It also follows trivially that the basic VQA model (for single question-answering) can be obtained from this 1VD model by simply assuming there is no dialogue history (i.e. a single step, $t = 1$).
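For intuition, the Dirac-delta special case can be sketched as follows: with a point-mass posterior there is no sampling and no KL term, and the objective reduces to the likelihood evaluated at a single deterministic encoding. All module names and sizes here are hypothetical stand-ins:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# With q = p = a Dirac delta at z_c = f(c_t), the KL term is identically zero
# and the (negative) ELBO collapses to the likelihood at one fixed encoding.
f = nn.Linear(16, 8)          # deterministic context encoder (stand-in)
dec = nn.Linear(8 + 16, 32)   # likelihood model (stand-in)

c = torch.randn(4, 16)        # compound context c_t (toy features)
x = torch.randn(4, 32)        # ground-truth answer encoding (toy)
z_c = f(c)                    # no sampling: the 'posterior' is a point mass
nll = F.mse_loss(dec(torch.cat([z_c, c], -1)), x)  # only the likelihood term remains
```

Contrast this with the stochastic CVAE objective, which samples $z$ and pays a KL penalty against the conditional prior.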
3.1 “Colouring” Visual Dialogue with Convolutions
FlipDial’s convolutional formulation allows us to implicitly capture the sequential nature of sentences and sequences of sentences. Here we introduce how we encode questions, answers, and whole dialogues with CNN.
We begin by noting the prevalence of recurrent approaches (e.g. LSTM , GRU ) in modelling both visual dialogue and general dialogue to date [6, 7, 8, 12, 27]. Typically recurrence is employed at two levels – at the lower level to sequentially generate the words of a sentence (a question or answer in the case of dialogue), and at a higher level to sequence these sentences together into a dialogue.
Recently however, there has been considerable interest in convolutional models of language [3, 11, 14, 21], which have shown to perform at least as well as recurrent models, if not better, on a number of different tasks. They are also computationally more efficient, and typically suffer less from issues relating to exploding or vanishing gradients for which recurrent networks are known .
In modelling sentences with convolutions, the tokens (words) of the sentence are transformed into a stack of fixed-dimensional embeddings (e.g. using word2vec or GloVe, or embeddings learned for a specific task). For a given sentence, say question $q$, this results in an embedding $E_q \in \mathbb{R}^{e \times \ell}$ for embedding size $e$ and sentence length $\ell$, where $\ell$ can be bounded by the maximum sentence length in the corpus, with padding tokens employed where required. This two-dimensional stack is essentially a single-channel ‘image’ on which convolutions can be applied in the standard manner in order to encode the entire sentence. Note this similarly applies to the answer $a$ and caption $c$, producing embedded $E_a$ and $E_c$, respectively.
We then extend this idea of viewing sentences as ‘images’ to whole dialogues, producing a multi-channel language embedding. Here, the sequence of sentences itself can be seen as a stack of (a stack of) word embeddings, where the number of channels now accounts for the number of questions and answers in the dialogue. We refer to this process as “colouring” dialogue, by analogy to the most common meaning given to image channels – colour.
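The “colouring” construction can be sketched as follows, with illustrative shapes (embedding size, sentence length and the toy token ids are made up): each padded sentence becomes an $e \times \ell$ plane, and a $T$-round dialogue stacks into $2T$ channels that ordinary 2D convolutions consume at once:

```python
import torch
import torch.nn as nn

emb_dim, max_len, vocab = 64, 16, 1000
embed = nn.Embedding(vocab, emb_dim, padding_idx=0)   # index 0 = PAD

def sentence_image(token_ids):
    """Pad a token list to max_len and embed it: returns (emb_dim, max_len)."""
    ids = token_ids[:max_len] + [0] * (max_len - len(token_ids))
    return embed(torch.tensor(ids)).t()               # one single-channel 'image'

# a toy dialogue of T=3 QA pairs -> 2T = 6 "colour" channels
dialogue = [[5, 7, 2], [9, 4], [3, 3, 8], [6], [2, 5], [7]]
block = torch.stack([sentence_image(s) for s in dialogue])   # (2T, emb_dim, max_len)

# standard 2D convolutions now apply over the whole dialogue at once
conv = nn.Conv2d(in_channels=6, out_channels=32, kernel_size=3, padding=1)
feats = conv(block.unsqueeze(0))                      # (1, 32, emb_dim, max_len)
```

The channel dimension thus plays the role that colour plays in natural images, which is exactly the analogy behind the term.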
Our primary motivation for adopting a convolutional approach here is to explore its efficacy in extending from simpler language tasks [11, 14] to full visual dialogue. We hence instantiate the following models for 1VD and 2VD:
- Answer [1VD]: Following Eq. 4, we generate a single answer at each step, conditioned on the image, caption, dialogue history and current question.
- Block [1VD, 2VD]:
Using the CVAE formulation from Figs. 2 and 3, we generate entire blocks of dialogue directly, with the dialogue context implicit rather than explicit: we allow the convolutional model to supply the context. We consider this 2VD, although this block architecture can also generate iteratively, and can be evaluated on 1VD (see Section 4.2).
- Block Auto-Regressive [1VD, 2VD]:
We introduce an auto-regressive component to our generative model in the same sense as recent auto-regressive generative models for images [9, 25]. We augment the Block model by feeding its output through an auto-regressive (AR) module which explicitly enforces sequentiality in the generation of the dialogue blocks. This effectively factorises the likelihood in Eq. 3 as $p_\theta(D \mid z, i, c) = p_\theta(D \mid D^{(L)}) \prod_{l=1}^{L} p_\theta(D^{(l)} \mid D^{(l-1)})$, where $L$ is the number of AR layers and $D^{(0)}$ is the (intermediate) output from the standard Block model. Note that $D^{(l)}$ refers to an entire dialogue at the $l$-th AR layer (rather than the $t$-th dialogue exchange, as is denoted by $D_t$).
We present an extensive quantitative and qualitative analysis of our models’ performance in both 1VD, which requires answering a sequence of image-contextualised questions, and full 2VD, where both questions and answers must be generated given a specific visual context. Our proposed generative models are denoted as follows:
- A – answer architecture for 1VD
- B – block dialogue architecture for 1VD & 2VD
- the auto-regressive extension of B for 1VD & 2VD
A is a generative convolutional extension of our baseline and is used to validate our methods against a standard benchmark in the 1VD task. The block models, like A, are generative, but are extensions capable of full dialogue generation, a much more difficult task. Importantly, the block models are flexible: despite being trained to generate a block of questions and answers, they can be evaluated iteratively for both 1VD and 2VD (see Section 4.2). We summarise the data and condition variables for all models in Tab. 1. To evaluate performance on both tasks, we propose novel evaluation metrics which augment those of our baseline. To the best of our knowledge, we are the first to report models that can generate both questions and answers given an image and caption, a necessary step toward a truly conversational agent. Our key results are:
Our block models are able to generate both questions and answers, a more difficult but more realistic task (2VD).
Since our models are generative, we are able to show highly diverse and plausible question and answer generations based on the provided visual context.
We use the VisDial dataset (v0.9), which contains Microsoft COCO images, each paired with a caption and a dialogue of 10 question-answer pairs, split into train and test sets.
Das et al.'s best model, MN-QIH-G, is a recurrent encoder-decoder architecture which encodes the image $i$, the current question $q_t$ and the attention-weighted ground-truth dialogue history. The output conditional likelihood distribution is then used to (token-wise) predict an answer. Our A model is a generative and convolutional extension, evaluated using existing ranking-based metrics on the generated and candidate answers. We also (iteratively) evaluate our block models for 1VD as detailed in Section 4.2 (see Tab. 3).
4.1 Network architectures and training
Following the CVAE formulation (Section 3) and its convolutional interpretation (Section 3.1), all our models have three core components: an encoder network, a prior network and a decoder network. Fig. 4 (top) shows the encoder and prior networks, and Fig. 4 (middle, bottom) shows the standard and auto-regressive decoder networks.
The prior network, parametrised by $\theta$, takes as input the image $i$, the caption $c$ and the dialogue context. Referring to Table 1, for model A the context comprises the dialogue history up to $t{-}1$ and the current question $q_t$; for the block models, the context is the image and caption alone. To obtain the image representation, we pass $i$ through VGG-16 and extract the penultimate feature vector. We pass the caption $c$ through a pre-trained word2vec module (we do not learn these word embeddings). For questions and answers, we pass the one-hot encoding of each word through a learnable word-embedding module and stack these embeddings as described in Section 3.1. We encode these condition variables convolutionally to obtain the encoded condition, and pass this through a convolutional block to obtain $\mu$ and $\sigma$, the parameters of the conditional prior.
The encoder network, parametrised by $\phi$, takes the data and the encoded condition (obtained from the prior network) as input. For model A the data is the answer $a_t$, while for the block models it is the whole dialogue $D$. In all models, the data is transformed through a word-embedding module into a single-channel answer ‘image’ for A, or a multi-channel image of alternating questions and answers for the block models. The embedded output is then combined with the encoded condition to obtain $\mu$ and $\sigma$, the parameters of the conditional latent posterior.
The decoder network takes as input a latent sample $z$ and the encoded condition. The sample is transpose-convolved, combined with the encoded condition and further transformed to obtain an intermediate output volume of dimension $e \times \ell \times N$, where $e$ is the word-embedding dimension, $\ell$ is the maximum sentence length and $N$ is the number of dialogue entries (a single answer for A; the full block of questions and answers for the B variants). Following this, A and B employ a standard linear layer projecting the $e$ dimension to the vocabulary size (Fig. 4 (middle)), whereas the auto-regressive variant employs an autoregressive module followed by this standard linear layer (Fig. 4 (bottom)). At train time, the vocabulary-dimensional output is softmaxed and the CE term of the ELBO computed. At test time, the argmax of the output provides the predicted word index. The weights of the learnable word-embedding module (in the encoder and prior) and of the decoder’s final linear layer are shared.
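The embedding/output weight sharing mentioned above can be sketched in PyTorch as follows (sizes illustrative): tying makes the input embedding matrix and the final vocabulary projection one and the same parameter.

```python
import torch
import torch.nn as nn

vocab, emb_dim = 1000, 64
embed = nn.Embedding(vocab, emb_dim)
proj = nn.Linear(emb_dim, vocab, bias=False)
proj.weight = embed.weight            # tie: one parameter matrix for both roles

h = torch.randn(2, 10, emb_dim)       # decoder output: (batch, positions, emb_dim)
logits = proj(h)                      # (batch, positions, vocab)
predicted = logits.argmax(-1)         # test-time word indices
```

Beyond halving the parameter count of these two modules, tying keeps the input and output representations of each word consistent.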
Inspired by PixelCNN, which sequentially predicts image pixels, we apply size-preserving autoregressive layers to the intermediate output of model B and then project to the vocabulary size. Each layer employs masked convolutions that consider only ‘past’ embeddings, sequentially predicting embeddings of size $e$ and enforcing sequentiality at both the sentence and dialogue level.
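A minimal sketch of one such masked convolution, in the style of PixelCNN, follows. For brevity it masks only along the word dimension (each position sees strictly earlier words), whereas the paper's module also enforces ordering across sentences; all shapes are illustrative:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv whose kernel is zeroed at and after the centre along the width,
    so output position j depends only on input positions < j ('past' words)."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        _, _, kh, kw = self.weight.shape
        mask = torch.ones(kh, kw)
        mask[:, kw // 2:] = 0            # block the centre and future columns
        self.register_buffer("mask", mask)

    def forward(self, x):
        return nn.functional.conv2d(x, self.weight * self.mask, self.bias,
                                    self.stride, self.padding)

layer = MaskedConv2d(6, 6, kernel_size=3, padding=1)  # size-preserving
x = torch.randn(1, 6, 64, 16)        # (batch, channels, emb_dim, max_len)
out = layer(x)                       # same spatial size: (1, 6, 64, 16)
```

Because the layer is size-preserving, several of them can be stacked before the final projection to the vocabulary, as in the decoder described above.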
Network and training hyper-parameters
In embedding sentences, we pad to a fixed maximum sequence length and use a fixed word-embedding dimension (matching that of word2vec for the caption). After pre-processing and filtering the vocabulary (see supplement for further details), we use the Adam optimiser with default parameters and a fixed latent dimensionality, and employ batch normalisation with momentum and learnable parameters; model A and the block models use different batch sizes. We implement our pipeline using PyTorch.
[Table 2: contents of the input dialogue block at step t under each iterative evaluation variant – ground-truth or previously predicted entries slotted in, with PAD tokens for all future entries.]
4.2 Evaluation methods for block models
Although the block models generate whole blocks of dialogue directly, they can also be evaluated iteratively, lending them to both 1VD and 2VD (see supplement for descriptions of the generation and reconstruction pipelines).
Block evaluation [2VD]. The generation pipeline generates whole blocks of dialogue directly, conditioned only on the image and caption (see the block-model entries in Tab. 1). This is 2VD, since the model must generate a coherent block of both questions and answers.
Iterative evaluation. The reconstruction pipeline can generate dialogue items iteratively. At step $t$, the input dialogue block is filled with zeros (the PAD token) and the ground-truth or predicted dialogue history up to $t{-}1$ is slotted in (see below and Tab. 2). This future-padded block is then encoded with the condition inputs and reconstructed. The $t$-th dialogue item is extracted (an answer for 1VD, or a question/answer for 2VD), and this is repeated for every answer (1VD) or for every dialogue item (2VD). Variations are:
– [1VD]. At step $t$, the input dialogue block is filled with the history of ground-truth questions and answers up to $t{-}1$, along with the current ground-truth question. All future entries are padded – equivalent to using the ground-truth dialogue history.
– [1VD]. Similar to the above, except that the input block is filled with the history of ground-truth questions and previously predicted answers, along with the current ground-truth question. This is a more realistic 1VD setting.
– [2VD]. The most challenging and realistic condition, in which the input block is filled with the history of previously predicted questions and answers.
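The iterative evaluation loop above can be sketched as follows; `reconstruct` is a stand-in for the model's encode-sample-decode pass, and the integer tokens are dummies:

```python
# Iterative 2VD evaluation sketch: at each step t, slot the predicted history
# into a future-padded block, reconstruct, and keep only the t-th item.
PAD = 0
T = 3                                  # number of QA pairs -> 2T dialogue items

def reconstruct(block, image, caption):
    # hypothetical model call; here it just fills PAD slots with dummy tokens
    return [[i + 1, i + 1] if item == [PAD, PAD] else item
            for i, item in enumerate(block)]

def iterative_2vd(image, caption):
    history = []
    for t in range(2 * T):             # questions and answers alternately
        block = history + [[PAD, PAD]] * (2 * T - t)   # pad all future entries
        predicted = reconstruct(block, image, caption)
        history.append(predicted[t])   # keep only the t-th generated item
    return history

dialogue = iterative_2vd(image=None, caption=None)
```

Swapping which slots come from the ground truth versus from `history` yields the three variants listed above.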
4.3 Evaluation and Analysis
We evaluate our models on the 1VD and 2VD tasks. Under 1VD, we predict an answer at each time step, given an image, caption and the current dialogue history (Section 4.3.1 and Tab. 3), while under 2VD, we predict both questions and answers (Section 4.3.2 and Tab. 4). All three models are able to perform the first task, while only the block models are capable of the second.
4.3.1 One-Way Visual Dialogue (1VD) task
We evaluate the performance of A and the block models on 1VD using the candidate-ranking metric of our baseline, as well as an extension of it which assesses generated-answer quality (Tab. 3). Figs. 1 and 5 show our qualitative results for 1VD.
Candidate ranking by model log-likelihood 
The VisDial dataset provides a set of 100 candidate answers for each question-answer pair at step $t$ per image. The set includes the ground-truth answer as well as similar, popular, and random answers. Das et al. rank these candidates by the log-likelihood of each under their model (conditioned on the image, caption and dialogue history, including the current question), and then observe the position of the ground-truth answer (closer to 1 is better). This position is averaged over the dataset to obtain the Mean Rank (MR). In addition, the Mean Reciprocal Rank (MRR; the average of the reciprocal ranks) and recall rates at several cutoffs are computed.
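These ranking metrics can be computed as follows (a sketch with toy scores; the real protocol uses 100 candidates per round):

```python
import numpy as np

def ranking_metrics(scores, gt_index, ks=(1, 5, 10)):
    """scores: (num_rounds, num_candidates) model scores for candidate answers;
    gt_index: (num_rounds,) position of the ground-truth candidate."""
    order = np.argsort(-scores, axis=1)                         # best candidate first
    ranks = np.argmax(order == gt_index[:, None], axis=1) + 1   # 1-indexed rank of GT
    return {"MR": ranks.mean(),                                 # mean rank
            "MRR": (1.0 / ranks).mean(),                        # mean reciprocal rank
            **{f"R@{k}": (ranks <= k).mean() for k in ks}}      # recall at cutoffs

scores = np.array([[0.1, 0.9, 0.3],    # GT candidate 1 is ranked 1st here
                   [0.8, 0.2, 0.5]])   # GT candidate 2 is ranked 2nd here
m = ranking_metrics(scores, np.array([1, 2]), ks=(1, 2))
```

Note that MRR averages the per-round reciprocal ranks, which is not the same quantity as the reciprocal of MR.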
To compare against their baseline, we rank the 100 candidate answers by estimates of their marginal likelihood under A. This can be done i) with the conditional ELBO (Eq. 4), and ii) by likelihood weighting (lw) in the conditional generative model. Ranking by both these approaches is shown in Tab. 3, indicating that we are comparable to the state of the art in discriminative models of sequential VQA [6, 7].
Candidate ranking by word2vec cosine distance 
The evaluation protocol of the baseline scores and ranks a given set of candidate answers without being a function of the answer actually predicted by the model. As a result, the rank of the ground-truth candidate reflects its score under the model relative to the other candidates’ scores, rather than capturing the quality of the answer output by the model, which is left unobserved. To remedy this, we instead score each candidate by the cosine distance between the word2vec embedding of the predicted answer and that candidate’s word2vec embedding. We take the embedding of a sentence to be the average embedding over its word tokens, following Arora et al. In addition to accounting for the predicted answer, this method also captures semantic similarity: if the predicted answer is similar (in meaning and/or words generated) to the ground-truth candidate, the cosine distance will be small and the ground-truth candidate’s rank closer to 1.
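This scoring can be sketched with toy two-dimensional "word2vec" vectors (the real embeddings are pretrained and high-dimensional):

```python
import numpy as np

def avg_embedding(tokens, emb):
    """Sentence embedding as the mean of its word vectors (Arora-style baseline)."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy embedding table standing in for pretrained word2vec
emb = {"yes": np.array([1.0, 0.0]), "no": np.array([0.0, 1.0]),
       "maybe": np.array([0.7, 0.7])}
predicted = ["yes"]
candidates = [["no"], ["maybe"], ["yes"]]
sims = [cosine(avg_embedding(predicted, emb), avg_embedding(c, emb))
        for c in candidates]
ranking = np.argsort(-np.array(sims))        # most similar candidate first
```

A candidate identical to the predicted answer receives similarity 1 and therefore rank 1, as the protocol intends.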
We report these numbers for A, for the iteratively-evaluated block models, and for our baseline MN-QIH-G, which we re-evaluate using the word2vec cosine-distance ranking (Tab. 3). In the case of A (gen), we evaluate answer generations from A: we condition on the context via the prior network, sample $z$, and generate an answer via the decoder network. Here we show an improvement of 5.66 points in MR over the baseline. On the other hand, A (recon) evaluates answer reconstructions in which $z$ is sampled from the posterior (where the ground-truth answer is provided). We include A (recon) merely as an “oracle” autoencoder, observing its good ranking performance, but do not explicitly compare against it.
We also note that the ranking scores of the block models are worse (by 3-4 MR points) than those of A. This is expected, since A is explicitly trained for 1VD while the block models are not. Despite this, the performance gap between A (gen) and the block models given the ground-truth history is not large, bolstering our iterative evaluation method for the block architectures. Note finally that the block models perform better given the ground-truth history than given their previously predicted answers (by 2-3 MR points). This too is expected, as answering is easier with access to the ground-truth dialogue history than when only the previously predicted answers (and ground-truth questions) are provided.
4.3.2 Two-way Visual Dialogue (2VD) task
Our flexible CVAE formulation for visual dialogue allows us to move from 1VD to the generation of both questions and answers (2VD). Despite this being inherently more challenging, the block models are able to generate diverse sets of questions and answers contextualised by the given image and caption. Fig. 6 shows snippets of our two-way dialogue generations.
In evaluating our models for 2VD, the candidate-ranking protocol above, which relies on a given question to rank the answer candidates, is no longer usable when the questions themselves are being generated. This is the case for block evaluation, which has no access to the ground-truth dialogue history, and for the iterative evaluation in which the full predicted history of questions and answers is provided (Tab. 2). We therefore look directly at the CE and KL terms of the ELBO, and propose two new metrics, question relevance and latent dialogue dispersion, to compare our methods in the 2VD task:
Question relevance. We expect a generated question to query an aspect of the image, and we use the presence of semantically similar words in both the question and the image caption as a proxy for this. We compute the cosine similarity between the (average) word2vec embedding of each predicted question and that of the caption $c$, and average over all questions in the dialogue (closer to 1 indicates higher semantic similarity).
Latent dialogue dispersion. For a generated dialogue block $\hat{D}$, this metric computes the KL divergence $\mathrm{KL}\big(q_\phi(z \mid \hat{D}, i, c) \,\|\, q_\phi(z \mid D, i, c)\big)$, measuring how close the generated dialogue is to the true dialogue $D$ in the latent space, given the same image $i$ and caption $c$.
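Since both encoded posteriors are diagonal Gaussians, the dispersion KL has a simple closed form; a sketch with toy parameters:

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dims."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# identical posteriors -> zero dispersion; diverging means -> larger KL
mu, var = np.zeros(8), np.ones(8)
zero = gaussian_kl(mu, var, mu, var)
d = gaussian_kl(mu + 1.0, var, mu, var)   # unit mean shift in every dimension
```

In practice the two posteriors are those produced by the recognition network for the generated and ground-truth dialogues respectively.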
From Tab. 4, we observe a decrease in the loss terms as the auto-regressive capacity of the model increases (from no AR layers to 8 and then 10), suggesting that explicitly enforcing sequentiality in the dialogue generations is useful. Within a particular model, the dispersion values are typically larger for the harder task (without dialogue context). We also observe that dispersion increases with the number of AR layers, suggesting that AR improves the diversity of the model outputs and avoids simply recovering data observed at train time.
While the proposed metrics provide a novel means to evaluate dialogue in a generative framework, like all language-based metrics, they are not complete. The question-relevance metric, , can stagnate, and neither metric precludes redundant or nonsensical questions. We intend for these metrics to augment the bank of metrics available to evaluate dialogue and language models. Further evaluation, including i) using auxiliary tasks, as in the image-retrieval task of , to drive and evaluate the dialogues, and ii) turning to human evaluators to rate the generated dialogues, can be instructive in painting a more complete picture of our models.
In this work we propose FlipDial, a generative convolutional model for visual dialogue which is able to generate answers (1VD) as well as generate both questions and answers (2VD) based on a visual context. In the 1VD task, we set new state-of-the-art results with the answers generated by our model, and in the 2VD task, we are the first to establish a baseline, proposing two novel metrics to assess the quality of the generated dialogues. In addition, we propose and evaluate our models under a much more realistic setting for both visual dialogue tasks in which the predicted rather than ground-truth dialogue history is provided at test time. This challenging setting is more akin to real-world situations in which dialogue agents must be able to evolve with their predicted exchanges. We emphasize that research focus must be directed here in the future. Finally, under all cases, the sets of questions and answers generated by our models are qualitatively good: diverse and plausible given the visual context. Looking forward, we are interested in exploring additional methods for enforcing diversity in the generated questions and answers, as well as extending this work to explore recursive models of reasoning for visual dialogue.
This work was supported by the EPSRC, ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1, EPSRC/MURI grant EP/N019474/1 and the Skye Foundation.
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In ICCV, 2015.
-  S. Arora, Y. Liang, and T. Ma. A simple but tough-to-beat baseline for sentence embeddings. In ICLR, 2017.
-  S. Bai, J. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. CoRR, abs/1803.01271, 2018.
-  S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
-  J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
-  A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra. Visual Dialog. In CVPR, 2017.
-  A. Das, S. Kottur, J. M. Moura, S. Lee, and D. Batra. Learning cooperative visual dialog agents with deep reinforcement learning. In ICCV, 2017.
-  I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville. Pixelvae: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, Nov. 1997.
-  B. Hu, Z. Lu, H. Li, and Q. Chen. Convolutional neural network architectures for matching natural language sentences. In NIPS, pages 2042–2050, 2014.
-  U. Jain, Z. Zhang, and A. Schwing. Creativity: Generating diverse questions using variational autoencoders. arXiv preprint arXiv:1704.03493, 2017.
-  J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In CVPR, June 2016.
-  N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.
-  A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
-  D. P. Kingma and M. Welling. Auto-encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
-  R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
-  J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
-  N.-Q. Pham, G. Kruszewski, and G. Boleda. Convolutional neural network language models. In EMNLP, pages 1153–1162, 2016.
-  PyTorch, 2017.
-  K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR, abs/1409.1556, 2014.
-  K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In NIPS, pages 3483–3491, 2015.
-  A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
-  A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In NIPS, 2016.
-  T. Zhao, R. Zhao, and M. Eskenazi. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960, 2017.
Appendix A Glossary
- block dialogue/architecture
Models B/ are built and trained for the task of 2VD with the data and condition variable. Since the data variable refers to the whole dialogue sequence/block, we refer to B/ as block architectures.
This represents the scenario when only the condition variable is available at test time. In this case, the decoder network receives a sample drawn from a multivariate Gaussian parametrised by the mean and exponentiated log-variance learned using the prior network. We call the decoded output a generation.
Differing from a generation, both the data and the condition variable are available. The decoder network receives a sample drawn from a multivariate Gaussian parametrised by the mean and exponentiated log-variance learned using the encoder network. We call the decoded output a reconstruction. The reconstruction pipeline is used during training, when both the input and the condition variable are available. Note that this pipeline is also used when B/ are evaluated iteratively (see Section 4.2).
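The two pipelines differ only in which network parametrises the latent Gaussian. A minimal pure-Python sketch of this distinction (function names and toy shapes are ours, not the paper's; the real models are convolutional networks):

```python
import math
import random

def sample_gaussian(mu, logvar, rng):
    # Reparametrised draw: z_i = mu_i + exp(0.5 * logvar_i) * eps_i, eps ~ N(0, 1)
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

def generate(prior_net, decoder, c, rng):
    # Generation: only the condition c is available, so the latent is
    # sampled from the Gaussian parametrised by the prior network.
    mu, logvar = prior_net(c)
    return decoder(sample_gaussian(mu, logvar, rng), c)

def reconstruct(encoder, decoder, x, c, rng):
    # Reconstruction: both the input x and the condition c are available,
    # so the latent is sampled from the encoder's (posterior) Gaussian.
    mu, logvar = encoder(x, c)
    return decoder(sample_gaussian(mu, logvar, rng), c)
```

Swapping `prior_net` for `encoder` is the entire difference between the test-time generation path and the train-time reconstruction path; the decoder is shared.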
Appendix B Extended Quantitative Results on 1VD task
Tab. 3 in the main paper evaluates A and B/ in the task of 1VD. Here we shed light on these numbers and the metrics used to obtain them. We also present a more extensive quantitative analysis of B/ in the 1VD task (see Tab. 5).
Evaluating B/ on 1VD
We extend Tab. 3 with Tab. 5, which further compares B/ under the iterative evaluation settings of – and –, using the CE and KL terms of the ELBO and our two new metrics. We observe that B/ (–) shows superior performance of around 7–10 points in MR over B/ (–), and also improves in MRR and recall rates. This is expected since the ground-truth rather than predicted answers are included in the dialogue history (along with the ground-truth questions). The two new metrics, on the other hand, show very little performance difference across the two evaluation settings. We also note that ranking performance is worse when both the image and the caption are excluded from the condition variable. This does not, however, correlate with the CE and KL terms of the loss, which are lower in the condition-less setting. We attribute this to the model being transformed from a CVAE to a VAE, hence lifting the burden of capturing the conditional posterior distribution (i.e. the KL is now between an unconditional posterior and prior). Interestingly, however, excluding either the image or the caption achieves performance similar to when both are included, indicating that the caption acts as a good textual proxy of the image (a reassurance of our metric).
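For reference, the ranking metrics discussed above (MR, MRR and recall rates) can be computed as follows; this is a generic sketch of the standard definitions, not code from the paper:

```python
def mean_rank_and_mrr(gt_ranks):
    # gt_ranks: for each question, the 1-indexed rank the model assigns
    # to the ground-truth answer among the candidate answers.
    n = len(gt_ranks)
    mean_rank = sum(gt_ranks) / n             # MR: lower is better
    mrr = sum(1.0 / r for r in gt_ranks) / n  # MRR: higher is better
    return mean_rank, mrr

def recall_at_k(gt_ranks, k):
    # R@k: fraction of questions whose ground-truth answer ranks in the top k.
    return sum(1 for r in gt_ranks if r <= k) / len(gt_ranks)
```

A 7–10 point improvement in MR thus means the ground-truth answer moves, on average, 7–10 positions up the candidate ranking.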
Appendix C Extended Quantitative Results on 2VD task
Extending Tab. 4 in the main paper, Tab. 6 here shows results for B/ trained with permutations of the image and caption (denoted by + if included in the condition, and - otherwise). We note the decrease in CE and KL as conditions are excluded from the model. This is expected since the task of dialogue generation is made simpler without the constraints of an explicit visual/textual condition.
Appendix D Network architectures and training
The following section provides detailed descriptions of the architectures of our models A, B and . The descriptions are dense but thorough. We also include further details of our training procedure. Where not explicitly noted, each convolutional layer is followed by a batch normalisation layer (with momentum and learnable parameters) and a ReLU activation.
The prior neural network, parametrised by , takes as input the image, the caption and the dialogue context. For model A, this context contains the dialogue history up to and including the current question. For models B/, the dialogue context is the null set. To obtain the image representation, we scale and centre-crop each image and feed it through VGG-16. The output of the penultimate layer is extracted and L2-normalised to obtain the image feature vector. For the caption, we pass the words through a pre-trained word2vec model (we do not learn these word embeddings); the resulting embeddings are padded to the maximum sentence length. For the dialogue context (relevant only in the case of A), we pass the one-hot encoding of each word through a learnable word embedding module. We stack these embeddings as described in Section 3.1 of the main paper, giving a tensor whose dimensions are the word embedding dimension, the maximum sentence length and the number of dialogue entries at the current time. We encode these inputs convolutionally to obtain the encoded condition as follows: the caption embedding is passed through a convolutional block and concatenated with the (reshaped) image feature vector. The concatenated output is passed through a convolutional block to obtain the jointly encoded image-caption. For model A, the context is additionally passed through a convolutional block, concatenated with the encoded image-caption and passed through yet another convolutional block to get the encoded image-caption-context. We call this the encoded condition. The encoded condition is then passed through a further convolutional block followed by two final convolutional layers (in parallel) to obtain the mean and log-variance, respectively, the parameters of the conditional prior. Both are of the latent dimensionality. At test time, a latent sample is drawn from this prior and passed to the decoder in order to generate an answer (for A) or a dialogue block (for B/).
The encoder network, parametrised by , takes as input along with the encoded condition, , obtained from the prior network. For model A, and . For models B/, and . In all models, is passed through a learnable word embedding module, and the word embeddings stacked (see Section 3.1 in the main paper) to obtain , where , and is the number of entries in (for A, and for B/ ). In this way, we transform into a single-channel answer ‘image’ in the case of A, and a multi-channel image of alternating questions and answers in the case of B/. is then passed through a convolutional block (output size ), the output of which is concatenated with and forwarded through another convolutional block (output size ). This output is forwarded through two final convolutional layers (in parallel) to obtain and , the parameters of the conditional latent posterior . Here and are both of size .
At train time, the KL divergence term of the ELBO is computed using (from the encoder network) and (from the prior network).
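For two diagonal Gaussians this KL term has a closed form; writing the encoder's output as $q_\phi = \mathcal{N}(\mu_\phi, \sigma_\phi^2)$ and the prior network's as $p_\theta = \mathcal{N}(\mu_\theta, \sigma_\theta^2)$ (our notation, matching the standard CVAE formulation), it is:

```latex
\mathrm{KL}\big(q_\phi \,\|\, p_\theta\big)
= \frac{1}{2} \sum_{i=1}^{d} \left[
\log \frac{\sigma_{\theta,i}^{2}}{\sigma_{\phi,i}^{2}}
+ \frac{\sigma_{\phi,i}^{2} + (\mu_{\phi,i} - \mu_{\theta,i})^{2}}{\sigma_{\theta,i}^{2}}
- 1 \right]
```

where $d$ is the latent dimensionality, so the term can be computed analytically from the two networks' outputs without sampling.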
The decoder network (for simplicity, the parameters of the prior and decoder networks are subsumed into one set) takes as input a latent sample and the encoded condition. During training, the sample is drawn from a Gaussian parametrised by the mean and exponentiated log-variance outputs of the encoder network; at test time, it is drawn from a Gaussian parametrised by the corresponding outputs of the prior network. In both cases, we employ the commonly-used ‘re-parametrisation trick’ to compute the latent sample as z = μ + σ ⊙ ε, where ε ∼ N(0, I), and μ and σ correspond to those derived from the encoder or prior network as described above.
The sample is then transformed through a transpose-convolutional block (output size ), concatenated with and forwarded through a convolutional block (output size ). This output is forwarded through a second transpose-convolutional block, producing an intermediate output volume of dimension which we permute to match the size of . As before, , and (for A) or (for B/).
Following this, our models diverge in architecture: A and B employ a standard linear layer which projects the dimension of the intermediate output to the vocabulary size. The  model instead employs an autoregressive module (detailed below) followed by this standard linear layer. At train time, the network output is softmaxed and used in the computation of the CE term of the ELBO. At test time, the argmax of the softmaxed output is taken to be the index of the predicted word token. We share the weight matrices of the decoder’s final linear layer and the encoder and prior’s learnable word embedding module (which have the same size by virtue of our network architecture), with the motivation that language encoders and decoders should share common word representations.
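The weight sharing works because an embedding lookup and an output projection can use the same |V| × d matrix, once transposed. A toy illustration (pure Python, our own function names):

```python
def embed(token_id, W):
    # Embedding lookup: row token_id of the shared |V| x d matrix W.
    return W[token_id]

def project_to_vocab(h, W):
    # Output projection reusing the same matrix: score(w) = <h, W[w]>,
    # i.e. multiplication of the hidden state h by W transposed.
    return [sum(hj * row[j] for j, hj in enumerate(h)) for row in W]
```

Tying the two matrices both reduces parameters and encourages a single word representation to serve encoding and decoding.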
The autoregressive (AR) block (Fig. 4, bottom) in the decoder is inspired by PixelCNN, which sequentially predicts the pixels in an image along the two spatial dimensions. In the same fashion, we use an autoregressive approach to sequentially predict the next sentence (question or answer) in a dialogue. Since our framework is convolutional, with sentences viewable as ‘images’, our approach can similarly be adapted from that of [26, 9]. We first reshape the intermediate output of the decoder (essentially ‘unravelling’ the dialogue sequentially into a stack of its word embeddings). We then apply a size-preserving masked convolution to the reshaped output (followed by a learnable batch normalisation and a ReLU activation). We call this triplet an AR layer. The masked convolution of the AR layer ensures that future rows (i.e. future word embeddings) are hidden in the prediction of the current row/word embedding. We apply AR layers in this way, with each layer taking in the output of the previous AR layer. Following the block, a linear layer projects the final output’s dimension to the vocabulary size. We report numbers for . We base our implementation of the AR block on a publicly-available implementation of PixelCNN.
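A sketch of how such a row-wise convolution mask could be built, in the spirit of PixelCNN's mask types ‘A’ and ‘B’ (this is an illustrative construction under our assumptions, not the paper's exact mask):

```python
def row_mask(k, mask_type="A"):
    # k x k kernel mask that hides future word-embedding rows:
    # rows below the centre are always zeroed; the centre row itself is
    # zeroed for type 'A' (first layer) and kept for type 'B' (later layers).
    centre = k // 2
    mask = [[0.0] * k for _ in range(k)]
    for r in range(k):
        if r < centre or (r == centre and mask_type == "B"):
            for c in range(k):
                mask[r][c] = 1.0
    return mask
```

Multiplying the kernel elementwise by this mask before each convolution guarantees that the prediction for a given row depends only on earlier rows, which is what makes sequential generation valid.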
Appendix E Dialogue preprocessing
The word vocabulary is constructed from the VisDial v0.9 training dialogues (not including the candidate answers). The dialogues are preprocessed as follows: apostrophes are removed, numbers are converted to their worded equivalents, and all exchanges are made lower-case and either padded or truncated to a maximum sequence length (). The vocabulary is also filtered such that words with a frequency below 5 are removed and replaced with the UNK token. After pre-processing and filtering, the vocabulary size is .
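A minimal sketch of this frequency filtering and padding step (pure Python; the threshold, token names and padding scheme follow the description above, but the helper names are ours):

```python
from collections import Counter

def build_vocab(sentences, min_freq=5, unk="UNK"):
    # Count word frequencies over the training dialogues and keep only
    # words seen at least min_freq times; rare words map to UNK.
    counts = Counter(w for s in sentences for w in s.lower().split())
    return {unk} | {w for w, c in counts.items() if c >= min_freq}

def tokenise(sentence, vocab, unk="UNK", pad="PAD", max_len=16):
    # Lower-case, replace out-of-vocabulary words, then pad/truncate.
    toks = [w if w in vocab else unk for w in sentence.lower().split()]
    return toks[:max_len] + [pad] * max(0, max_len - len(toks))
```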
Appendix F Extended Qualitative Results
We present additional qualitative results for the A model in Figs. 7 and 8 (1VD task) and for the B/ model (under the block evaluation setting) in Figs. 9 and 10 (2VD task). Note that for both, different colours indicate generations (answers for A, and question-answer blocks for B/) from different samples of the latent. In Figs. 9 and 10, whole generated dialogue blocks are shown, with coloured sections indicating subsets exhibiting coherent question-answering and white sections indicating subsets that are not entirely coherent.