, has widely adopted an image-to-sequence architecture that encodes the image through a convolutional neural network (CNN) and then decodes the language with a recurrent neural network. The whole framework can be efficiently trained by maximum likelihood estimation (MLE) and has demonstrated state-of-the-art performance in various tasks [60, 8, 10, 11, 38]. However, this training procedure is not suitable for generating questions or enabling discovery of new concepts. In fact, most MLE-based training schemes have been shown to produce generic questions that result in uninformative answers (e.g. “yes”), questions (e.g. “What is the person doing?”), captions (e.g. “A clear day with a blue sky”) or dialogue (e.g. “I don’t know”) [35, 51]. Simply generating a generic question is not sufficient or useful for discovering new concepts.
Instead of generating generic questions, question generation models should be goal-driven: we show how they can be trained to ask questions aimed at extracting specific answer categories. Visual question generation is not a bijection, i.e. multiple correct questions can be generated from the same image. Previous research moved away from a supervised approach to question generation to a variational approach that can generate multiple questions by sampling a latent space (see Figure 2). However, previous approaches are not goal-driven: they do not guarantee that the question will result in a specific type of answer. To remedy this problem, we could encode the answer along with the image before generating the question. While such an approach allows the model to condition its question on the answer, it is neither technically feasible nor practical. Technical infeasibility arises because variational models often suffer from the posterior collapse problem, where the model can learn to ignore the answer when generating questions. Impracticality arises because the main purpose of asking questions is to attain an answer, implying that knowing the answer defeats the purpose of generating the question.
To tackle the first challenge, we design a visual question generation architecture that maximizes the mutual information between the generated question and the image, as well as between the generated question and the expected answer (see Figure 2). We call our model Information Maximizing Visual Question Generator as it maximizes relevance with the image and expectation over the answer. Safe, generic questions that lead to uninformative answers are discouraged as they have low mutual information with either. However, optimizing for mutual information is often intractable and, given the discrete tokens (words) we wish to generate, no unbiased, low variance gradient estimator exists [24, 40, 16, 52, 45, 58, 27]. We formulate our model as a variational auto-encoder that attempts to learn a joint continuous latent space between the image, question and the expected answer. Instead of directly optimizing discrete utterances, the question, image and expected answer are all trained to maximize the mutual information with this latent space. By reconstructing the image and expected answer representations, we can maximize the evidence lower bound (ELBO) and control what information the generated questions request.
The second challenge arises from the lack of an expected answer in real world deployments. Since we require an answer to map the image into a latent space, it is not possible to generate questions in the absence of an answer. Enumerating all possible answers is infeasible. Instead, we propose creating a second latent space that is learned from the image and the answer category instead of the answer itself. Answer categories can be objects, attributes, colors, materials, time, etc. During training, we minimize the KL-divergence between these two latent spaces. Not only does this allow us to generate visual questions that maximize mutual information with the expected answer, it also acts as a regularizer on the original latent space. It prevents the learned latent spaces from overfitting to specific answers in the training set and forces them to generalize to categories of questions.
We annotate the VQA dataset with categories for the top answers and use it to train our model, which queries for specific answer categories. We evaluate our model on relevance to the image and on its ability to expect the answer type. Finally, we run our model on real world images and discover new objects, new attributes, new colors, and new materials.
2 Related work
Visual understanding has been studied vigorously through question answering with the availability of large scale visual question answering (VQA) datasets [2, 64, 31, 25]. Current VQA approaches follow a traditional supervised MLE paradigm that typically relies on a CNN + RNN encoder-decoder formulation. Successive models have improved performance by stacking attention [62, 37], modularizing components [1, 26, 21], adding relation networks, augmenting memory, and adding proxy tasks [13, 57]. While the performance of VQA models has been encouraging, they require a large labelled dataset with a predefined vocabulary. In contrast, we focus on the surrogate task of generating questions in the hopes of augmenting real world agents with the ability to expand their visual knowledge by discovering new visual concepts.
In contrast to answering questions, generating questions has received little interest so far. In NLP, a few methods have attempted to automatically generate questions from knowledge bases using rule based or deep learning based systems. In computer vision, a few recent projects have explored the task of visual question generation to build curious visual agents [61, 23]. These projects have also either followed an algorithmic rule-based [54, 50] or learning-based [44, 47] approach. Newer papers have treated the generation process as a variational process or placed it in an active learning or reinforcement learning setting. Our work draws inspiration from these previous methods and extends them by treating question generation as a process that maximizes mutual information with not just the image but also the expected answer’s category. We believe that a good question generator should be goal driven: it should generate questions to receive a particular answer category.
There is a large body of work exploring generative models [55, 18, 19]. Recent successes of these applications have primarily been a result of variational auto-encoders (VAEs) and generative adversarial networks (GANs). With the reparameterization trick, VAEs can be trained to learn a semi-supervised latent space to generate images. They have also been extended to continuous state spaces [32, 3] and sequential models [15, 9]. GANs, on the other hand, can learn image representations that support basic linear algebra and even enable one-shot learning by using probabilistic inference over Bayesian programs. Both VAEs and GANs have disentangled their representations based on class labels or other visual variations [28, 41]. While we do not explicitly disentangle the representation, we will demonstrate later how the second latent space regularizes the original space and disentangles the representations of different answer categories.
Generative models often require a series of tricks for successful training [48, 46, 5, 4]. Even with these tricks, training them with discrete tokens is only possible by using gradient estimators. As we mentioned earlier, these estimators often suffer from one of two problems: high bias [27] or high variance. Low variance methods like Gumbel-Softmax, CONCRETE distribution, semantic hashing or vector quantization result in biased estimators. Similarly, low bias methods like REINFORCE with Monte Carlo rollouts result in high variance [16, 52, 45]. We overcome this issue by introducing a continuous latent space that maximizes mutual information with encodings of the image, question and answer. This latent space can be trained using existing VAE training procedures that attempt to reconstruct the image and answer representations. We further extend this model with a second latent space conditioned on the answer category that removes the need for an actual answer when generating questions.
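To make the bias and variance trade-off concrete, the following is a minimal sketch of the Gumbel-Softmax relaxation mentioned above, written in plain Python. The function name, the list-based interface and the explicit `rng` argument are assumptions for illustration and are not taken from the model described in this paper.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed (continuous) sample from a categorical distribution."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise via inverse transform sampling: g = -log(-log(u)).
        # u is redrawn if it is exactly 0 so that both logarithms stay finite.
        u = rng.random()
        while u == 0.0:
            u = rng.random()
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
sample = gumbel_softmax_sample([2.0, 0.0, 0.0], temperature=0.5, rng=rng)
print(sample)  # three non-negative weights that sum to 1
```

The argmax of the relaxed sample follows the underlying categorical distribution, but the sample itself is a soft vector rather than a one-hot token. Lower temperatures bring it closer to one-hot at the cost of noisier gradients, which is the trade-off the paragraph above describes.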
3 Information Maximizing Visual Question Generator
Our aim is to generate questions that have a tightly focused purpose — questions with the aim to learn something specific about the image. Agents with the capability to request specific categories of information can extract new concepts more effectively from the real world. In this section, we detail how we design an Information Maximizing Visual Question Generator. Recall that the goal of our model is to generate questions given an image and an answer category. For example, if we want to understand materials or binary answers, our model should generate questions “What material is that desk made out of?” or “Is the desk on the right of the chair?”, respectively. Our two challenges are (1) technical infeasibility caused by non-differentiable discrete tokens and variational posterior collapse and (2) impracticality of requiring answers to generate questions. We start off with a formal definition of the problem, explain why current methods fail and then detail our training and inference process.
3.1 Problem formulation
Let $q$ denote the question we want to generate for an image $i$. This question should result in an answer $a$ of category $c$. For example, the question “What is the person in red doing with the ball?” should result in the answer “kicking”, which belongs to the category “activity”. Our final goal is to define a model $p(q \mid i, c)$. But first, let’s attempt to define a simpler model $p(q \mid i, a)$ that maximizes the mutual information $I(i, q)$ between the image and the question and $I(a, q)$ between the expected answer and the question. This objective can be written as:
$\lambda$ is a hyperparameter that adjusts for their relative importance in the optimization.
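Written out, the objective described above takes roughly the following form. The notation is an assumption made for illustration: $q$ is the question, $i$ the image, $a$ the expected answer, $\theta$ the parameters of the generator and $\lambda$ the weighting hyperparameter.

```latex
\max_{\theta} \; I(i, q) + \lambda \, I(a, q)
\quad \text{s.t.} \quad q \sim p_{\theta}(q \mid i, a)
```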
3.2 Continuous latent space
As already mentioned, directly optimizing this objective is infeasible because the exact computation of mutual information is intractable. Additionally, optimizing by estimating gradients between discrete steps is difficult as the estimator needs to have both low bias and low variance. To overcome this challenge, we introduce a continuous, dense, latent $z$-space. We learn a mapping $p_\theta(z \mid i, a)$, parameterized by $\theta$, from the image and the expected answer to this latent space.
With this $z$-space, our new optimization becomes:
where $\lambda_1$ and $\lambda_2$ are hyperparameters that relatively weight the mutual information terms in the optimization.
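With a latent code $z$, the three-term objective can be sketched as below. The symbols ($q$, $i$, $a$ for question, image and answer, $\theta$ for the parameters) are assumed for illustration. Which terms carry the two weights is also an assumption; the text only states that two hyperparameters balance the mutual information terms and that the third term is a conditional mutual information.

```latex
\max_{\theta} \; \lambda_1 \, I(i, z) + \lambda_2 \, I(a, z) + I(q, z \mid i, a)
\quad \text{s.t.} \quad z \sim p_{\theta}(z \mid i, a), \;\; q \sim p_{\theta}(q \mid z)
```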
3.3 Variational mutual information maximization
So far, we have avoided discrete tokens. However, this mutual information maximization is still intractable as it requires knowing the true posteriors $p(i \mid z)$ and $p(a \mid z)$. Fortunately, we can opt to maximize its ELBO:
where $H$ is the entropy function and $\mathbb{E}$ is expectation. $p_\theta(\hat{i} \mid z)$ is a function parameterized by $\theta$. This optimization is often referred to as variational information maximization. Similarly,
The third and final conditional mutual information term can also be bounded by:
Note that we ignore the entropy terms associated with the training data as they do not involve the parameters we are trying to optimize. Therefore, optimizing Eq. 6 can be accomplished by maximizing the reconstruction of the image and answer representations while maximizing the MLE objective of generating the question.
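For readers who want the intermediate step, the following is the standard variational lower bound on mutual information, written for the image term. It is a sketch in assumed notation ($\hat{i}$ for the reconstructed image representation, $p_\theta$ for the learned decoder) rather than a transcription of the original equations; the answer and question terms follow the same pattern.

```latex
\begin{aligned}
I(i, z) &= H(i) - H(i \mid z) \\
&= H(i) + \mathbb{E}_{z \sim p(z)} \, \mathbb{E}_{\hat{i} \sim p(i \mid z)} \left[ \log p(\hat{i} \mid z) \right] \\
&= H(i) + \mathbb{E}_{z \sim p(z)} \left[ D_{\mathrm{KL}}\!\left( p(\cdot \mid z) \,\|\, p_{\theta}(\cdot \mid z) \right)
   + \mathbb{E}_{\hat{i} \sim p(i \mid z)} \left[ \log p_{\theta}(\hat{i} \mid z) \right] \right] \\
&\geq H(i) + \mathbb{E}_{z \sim p(z)} \, \mathbb{E}_{\hat{i} \sim p(i \mid z)} \left[ \log p_{\theta}(\hat{i} \mid z) \right]
\end{aligned}
```

The inequality holds because the KL divergence is non-negative. Since $H(i)$ depends only on the data, maximizing the bound reduces to maximizing the expected log-likelihood of the reconstruction.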
3.4 Question generation by reconstructing image and answer representations
To operationalize the optimization presented above, we begin by first encoding the image $i$ using a CNN as a dense vector $h_i$ (see Figure 3). Similarly, we encode the answer $a$ using a long short-term memory network (LSTM), which is a variant of RNNs, into another dense vector $h_a$. Next, we feed $h_i$ and $h_a$ into a VAE that embeds both into a latent $z$-space. In practice, we assume that $z$ follows a multivariate Gaussian distribution with diagonal covariance. We use the reparameterization trick to generate means $\mu_z$ and combine them with sampled unit Gaussian noise $\epsilon$ to generate $z$.
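The sampling step can be illustrated with a small, framework-free sketch of the reparameterization trick for a diagonal Gaussian. The function name and the log-variance parameterization are assumptions for illustration; an actual implementation would operate on tensors produced by the encoder.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps elementwise, with eps ~ N(0, 1)."""
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)  # log-variance -> standard deviation
        eps = rng.gauss(0.0, 1.0)   # noise is independent of the parameters
        z.append(m + sigma * eps)
    return z


rng = random.Random(0)
z = reparameterize(mu=[1.0, -2.0], log_var=[0.0, -1.0], rng=rng)
print(z)  # one sample from the diagonal Gaussian
```

Because the randomness enters only through `eps`, the sample is a deterministic, differentiable function of the mean and variance, which is what lets gradients flow back into the encoder.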
From $z$, we reconstruct $\hat{h}_i$ and $\hat{h}_a$ and optimize the first two terms in Eq. 6 by minimizing the following losses:
Next, we use a decoder LSTM to generate the question $\hat{q}$ from $z$-space. We minimize the MLE objective between $\hat{q}$ and the true question $q$ in our training set, which results in the third and final term in Eq. 6.
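One plausible instantiation of these three terms is sketched below. The squared L2 form of the reconstruction losses and the symbols ($h_i$, $h_a$ for the encoded image and answer, hats for their reconstructions, $q_k$ for the $k$-th token of a question of length $K$) are assumptions for illustration; the text only states that the two representations are reconstructed and that the question is trained with MLE.

```latex
\mathcal{L}_{i} = \lVert h_i - \hat{h}_i \rVert_2^2, \qquad
\mathcal{L}_{a} = \lVert h_a - \hat{h}_a \rVert_2^2, \qquad
\mathcal{L}_{\mathrm{mle}} = - \sum_{k=1}^{K} \log p_{\theta}\left( q_k \mid q_{<k}, z \right)
```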
3.5 Regularizing with a second latent space
So far, we have proposed building a model that maximizes the lower bound of mutual information between a latent space, the image and the expected answer. This allows us to generate questions if we know what the expected answer should be. This is not conducive to our original goal of deploying our model in real world situations where it does not know the answer a priori. If we already know the answer to a question, there is no point in generating a question.
To remedy this, we propose a second latent $t$-space. Instead of using both $h_i$ and $h_a$ to encode $i$ and $a$ into $z$-space, we discard the answer and instead only use its category $c$. We classify answers as being one of a few predefined categories, such as objects (e.g. “cat”), attributes (e.g. “cold”), color (e.g. “brown”), relationship (e.g. “ride”), counting (e.g. “1”), etc. These categories are cast as a one-hot vector, encoded as $h_c$ and used, along with $h_i$, to embed into the variational $t$-space. We train $t$-space by minimizing the KL-divergence with $z$-space:
where $\phi$ are the parameters used to embed into $t$-space. This allows us to now utilize $p_\phi(t \mid i, c)$ to embed into a space that closely resembles $z$-space. Since we assume that both $z$-space and $t$-space follow a multivariate Gaussian with diagonal covariance, the KL term has the analytical form shown above. We no longer need to know the answer to embed and generate questions. Intuitively, the $t$-space can also be thought of as a regularizer on $z$-space, preventing the model from overfitting to the answers in the training data and relying instead on utilizing the answer categories.
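The closed form of the KL-divergence between two diagonal Gaussians can be sketched as follows. The function name and argument layout are hypothetical, and the direction of the divergence (first distribution relative to the second) is an assumption; the text only states that the KL term between the two latent spaces is minimized.

```python
import math


def kl_diag_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """KL(N(mu_p, diag(sigma_p^2)) || N(mu_q, diag(sigma_q^2))), summed over dimensions."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        # Per-dimension closed form for two univariate Gaussians.
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return total


# Identical distributions have zero divergence.
print(kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0]))  # 0.0
# Shifting the mean by one standard deviation costs 0.5 nats.
print(kl_diag_gaussians([0.0], [1.0], [1.0], [1.0]))  # 0.5
```

Because the covariance is diagonal, the divergence decomposes into a sum over latent dimensions, so it can be computed directly from the predicted means and standard deviations without sampling.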
Putting them together, the final loss for our model is:
where $\lambda_1$ and $\lambda_2$ have already been introduced and $\lambda_3$ is a hyperparameter that controls the amount of regularization used in our model. Note that we are omitting the KL-loss with respect to a unit normal centered at zero that maintains the two latent spaces’ priors.
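Assembled as a weighted sum, the total loss can be sketched as below. The labels $\mathcal{L}_{\mathrm{mle}}$, $\mathcal{L}_{i}$, $\mathcal{L}_{a}$ and $\mathcal{L}_{t}$ (question MLE, image reconstruction, answer reconstruction and the KL term between the two latent spaces) are assumed names, and the assignment of each weight to a particular term is likewise an assumption. The KL terms against the unit normal prior are left out, as in the text.

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{mle}} + \lambda_1 \mathcal{L}_{i} + \lambda_2 \mathcal{L}_{a} + \lambda_3 \mathcal{L}_{t}
```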
3.6 Inference
During inference, we are given an image and answer category and are expected to generate questions. We encode the inputs into the second latent $t$-space and sample from it to generate questions, as shown in Figure 4. This allows us to generate goal-driven questions for any image, focused towards extracting its objects, its attributes, etc.
3.7 Implementation details
We implement our model using PyTorch and plan on releasing all our code. We use ResNet18 as our image encoder and do not fine-tune its weights. , and are all dimensional vectors. -space and -space are dimensions. The encoders for the image and answer are trained only from and not , or to prevent the encoders from simply optimizing for the reconstruction loss at the cost of not being able to generate questions. We optimized the hyperparameters such that , , with a learning rate of that decays every epochs for a total of epochs.
| Model | Language modeling |  |  |  |  |  | Mutual information |  | Relevance |  |
|---|---|---|---|---|---|---|---|---|---|---|
| Ours w/o A | 38.88 | 20.74 | 12.75 | 6.29 | 12.78 | 40.13 | 10.02 | 40.44 | 98.10 | 42.70 |
| Ours w/o AC | 38.99 | 21.48 | 12.73 | 6.57 | 13.01 | 42.13 | 10.10 | 60.00 | 96.80 | 42.80 |
| Ours w/o C | 50.09 | 32.32 | 24.61 | 16.27 | 20.58 | 94.33 | 33.44 | 61.04 | 98.00 | 82.40 |
| Ours w/o A | 31.20 | 16.20 | 11.18 | 6.24 | 12.11 | 35.89 | 9.35 | 68.23 | 98.00 | 52.50 |
4 Experiments
To test our visual question generation model, we perform a series of experiments and evaluate the model along multiple dimensions. We start by discussing the dataset and evaluation metrics used. We then showcase examples of our model’s generated questions when conditioned on the answer. Next, we demonstrate its ability when conditioned only on the answer category. We compare both these cases against a series of baselines and ablations. We analyze the diversity of questions produced within each answer category. Finally, we report a small proof of concept deployment of our model on real world images found online and show that it can learn new concepts.
4.1 Experimental setup
Dataset. To enable the kind of interaction where we can specify input answer categories, we need a VQA dataset that categorizes its answers. The VQA dataset  has a few basic categorizations of questions but not their answers. We annotate the VQA  dataset answers with a set of categories and label their top answers. These categories include objects (e.g. “cat”, “person”), attributes (e.g. “cold”, “old”), color (e.g. “brown”, “red”), relationship (e.g. “ride”, “jump”), counting (e.g. “1”, “10”), etc. The top answers make up the of the VQA dataset, resulting in training+validation examples. We treat their validation set as our test set as the answers in their test set are not publicly available. We break the training set up into a - train-validation split.
Evaluation metrics. All past question generation papers have used a variety of evaluation metrics to calculate the quality of a question. While some have focused on maximizing diversity [54, 23, 63], others have treated it as a proxy task to improve question answering [36, 47, 57]. Diversity measures have included using variants of beam search, measuring novel questions or unique tri-grams, or creating rule-based datasets. Proxy tasks have typically used accuracy of multiple-choice answers to measure the performance of question generation.
We too report a variety of different evaluation metrics to highlight different components of our model. First, we use language modeling evaluation metrics like BLEU, METEOR and CIDEr to calculate how well our generated questions match the ground truth questions in our test set. Next, we measure the mutual information retained in the latent space by training a classifier to classify answer categories encoded in the latent space. This metric sheds light on how well our method retains information about the input answers or answer categories. Next, we measure relevance of the question, ensuring that the questions are valid for the given image and result in the expected answer category. Relevance results are calculated from a majority vote among hired crowd workers, who vote on whether a question can be answered given its corresponding image. Finally, we report diversity scores for each category, which measure the number of unique questions generated.
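The relevance score can be illustrated with a short sketch of majority voting over crowd-worker judgments. The function names, the boolean vote encoding and the use of three workers per question in the example are assumptions for illustration; the text does not specify the number of voters or how ties are handled.

```python
def majority_vote(votes):
    """Return True if strictly more than half of the boolean votes are True."""
    return 2 * sum(votes) > len(votes)


def relevance_score(votes_per_question):
    """Percentage of questions judged relevant by a majority of workers."""
    relevant = sum(1 for votes in votes_per_question if majority_vote(votes))
    return 100.0 * relevant / len(votes_per_question)


# Hypothetical votes from three workers on four generated questions.
votes = [
    [True, True, False],
    [False, False, True],
    [True, True, True],
    [True, False, False],
]
print(relevance_score(votes))  # 50.0
```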
Baselines. We adapt a series of past CNN-RNN models to accept answers or answer types when generating questions. The first model, IA2Q, is a supervised, non-variational model that takes an image and answer as input and generates a question. This model is reminiscent of the VQA models often used to answer questions, except the answers are now inputs and the questions outputs [2, 64]. Next, V-IA2Q is a variational version of IA2Q, which embeds the answer and question to a latent space before generating the question. We also train versions of these models that accept the answer categories instead of the answer: IC2Q and V-IC2Q. When generating from a variational model, we set or to keep its outputs consistent for all measures except diversity.
We refer to our full model as Ours; it can generate questions from either the answer latent space $z$ or the category latent space $t$. We perform ablations on this model by removing specific components. Ours w/o A doesn’t maximize mutual information with respect to the expected answer but can still generate questions from both the $z$ and $t$ spaces. Ours w/o C doesn’t include the $t$-space and can only generate questions from answers. Finally, Ours w/o AC neither trains with the reconstruction loss nor has a second latent space $t$. Our evaluations empirically demonstrate how these ablations justify our model designs.
4.2 Mutual information maximization
We check whether our model improves the mutual information retained in the latent space with the input answer. We freeze the weights of a trained model and embed input images, answers and categories into the latent or -space, depending on the model. We train a simple -layer MLP that attempts to classify the latent code as either one of the answer categories or as one of the answers. We evaluate our model on the test set with a random chance of and , respectively. Table 1 shows that the baseline models do a poor job of actually remembering the answer or category, justifying the need for a mutual information maximization approach. Since these models are unable to retain information about the input answers, it also explains why they often generate safe, generic, uninformative questions. Since our model can embed into both the as well as the space, we report how well these two spaces retain information. We find that Ours retains near perfect information about the input answer category with an accuracy of from -space and from the -space. We find that when trained without the -space, Ours w/o C retains more information as it no longer has to constrain the -space to regularize answers of the same category. We also visualize a TSNE  representation of the two spaces in Figure 5. Models that don’t reconstruct the answer (e.g. in Ours w/o A, Ours w/o AC or any of the baselines) show visually inseparable categories.
4.3 Generating questions given the answers
Since our model can produce questions from both answers as well as answer categories, we evaluate both scenarios individually. The language modeling section in Table 1 showcases how the various models perform when generating questions from the -space, i.e. generating questions from answers. We find that Ours w/o C performs the best over all the baselines and across all ablations of our model. This is likely because the latent space has more capacity when it is not also being regularized by the -space. We find that Ours w/o A performs METEOR points worse than Ours and Ours w/o C implying that forcing the model to reconstruct the answer does improve the quality of questions generated to better match the ground truth.
4.4 Generating questions with answer types
The lower half of Table 1 evaluates how well our model and the baselines perform when generating questions in the absence of the actual answer and only in the presence of the answer categories. We find that overall, all the language metrics are slightly lower than when the questions were generated from the $z$-space. This is expected as now the questions need to be generated with only the answer category encoded in the $t$-space without knowing exactly what the answer is. Therefore, the models are penalized for asking an “object” question about the “horse” when the answer expects the question to focus on the “saddle” instead. We also qualitatively sample and report a random set of questions generated by our model in Figure 6. We see that our model often uses concepts in the image to ground the questions. It asks specific questions like “what is the bat made of?” or “is the man going to the right of the girl?”. However, there are categories like “time” that have a low diversity of training questions and result in the inevitable “what time of day is this?” question. The qualitative errors we have observed often occur when the model is forced to ask a question about a category that is not present in the image; it is hard to ask about “food” when no food is present.
4.5 Measuring diversity of questions
For all the images in our test set, we generated one question per answer category, resulting in a total of questions. We report diversity in Table 2 using two existing metrics: (1) Strength of generation: the percentage of unique generated questions normalized by the number of unique ground truth questions and (2) Inventiveness of generation: the percentage of unique questions unseen during training normalized by all unique questions generated. We compare our model with the baseline V-IC2Q, which does not reconstruct the answer or the image. We find that our method results in a more diverse set of questions across most categories. Questions asking for “shape” and “materials” tend to generate the most unseen questions as the model learns to generate questions like “what [shape/material] is the ____ [made out of]?” and injects objects in the given image into the missing blank. Answers agnostic to the image contents, such as “time”, result in the fewest novel questions.
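The two diversity metrics defined above can be computed as in the following sketch. The function names and the exact string matching of questions are assumptions for illustration; in practice questions would likely be normalized (for example lowercased and tokenized) before comparison.

```python
def strength(generated, ground_truth):
    """Unique generated questions as a percentage of unique ground truth questions."""
    return 100.0 * len(set(generated)) / len(set(ground_truth))


def inventiveness(generated, training):
    """Unique generated questions unseen in training, as a percentage of all unique generated questions."""
    unique = set(generated)
    unseen = unique - set(training)
    return 100.0 * len(unseen) / len(unique)


# Hypothetical toy data: three unique generated questions, one of them seen in training.
generated = [
    "what color is the cat?",
    "what color is the cat?",
    "what is the bat made of?",
    "is the man left of the girl?",
]
ground_truth = ["q1", "q2", "q3", "q4"]
training = ["what color is the cat?", "what time of day is this?"]

print(strength(generated, ground_truth))  # 75.0
print(inventiveness(generated, training))
```

Both functions divide by a set size, so they assume non-empty inputs.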
4.6 Real world deployment of our model
To examine our model in a real world deployment, we generated question each for images with hashtags #food, #nature, #sports, #fashion scraped from online public social media posts. Since our model needs an input answer category to ask a question, we trained a simple ResNet18 CNN on the VQA images to output one of categories (see Table 3). We generated answer categories using the CNN and fed them into our model to generate the questions. The questions were sent to two crowd workers: one answered the question and the other reported the relevance of the question with the image and the answer with the answer category. We found all the questions asked by both Ours and V-IC2Q to be relevant to the image while and were relevant to the answer category. Our method’s questions led to more unseen concepts.
5 Conclusion
We believe that visual question generation should be a task that is aimed at extracting specific categories of concepts from an image. We define a good question to be one that is not only relevant to the image but is also designed to expect a specific answer category. We build Information Maximizing Visual Question Generator that maximizes the mutual information between the generated question, the input image and the expected answer. We extend this model to overcome technical challenges associated with maximizing mutual information with discrete tokens and collapsing posterior while also allowing it to generate questions when the expected answer is absent. We analyze the questions using language modeling, diversity, relevance and mutual information metrics. We further show that through a real world deployment of this system, it can discover new concepts.
We thank Justin Johnson, Andrey Kurenkov, Apoorva Dornadula and Vincent Chen for their helpful comments and edits. This work was partially funded by the Brown Institute of Media Innovation and by Toyota Research Institute (“TRI”) but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
-  J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705, 2016.
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
-  E. Archer, I. M. Park, L. Buesing, J. Cunningham, and L. Paninski. Black box variational inference for state space models. arXiv preprint arXiv:1511.07367, 2015.
-  S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
-  Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
-  X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
-  X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
-  K. Cho, A. Courville, and Y. Bengio. Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17(11):1875–1886, 2015.
-  J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988, 2015.
-  A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? Computer Vision and Image Understanding, 163:90–100, 2017.
J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan,
K. Saenko, and T. Darrell.
Long-term recurrent convolutional networks for visual recognition and
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015.
-  X. Du, J. Shao, and C. Cardie. Learning to ask: Neural question generation for reading comprehension. arXiv preprint arXiv:1705.00106, 2017.
-  A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
-  D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
-  G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko. Learning to reason: End-to-end module networks for visual question answering. CoRR, abs/1704.05526, 3, 2017.
-  A. Jabri, A. Joulin, and L. van der Maaten. Revisiting visual question answering baselines. In European conference on computer vision, pages 727–739. Springer, 2016.
-  U. Jain, Z. Zhang, and A. G. Schwing. Creativity: Generating diverse questions using variational autoencoders. In CVPR, pages 5415–5424, 2017.
-  E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
-  J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1988–1997. IEEE, 2017.
-  J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, and R. B. Girshick. Inferring and executing programs for visual reasoning. In ICCV, pages 3008–3017, 2017.
-  Ł. Kaiser, A. Roy, A. Vaswani, N. Pamar, S. Bengio, J. Uszkoreit, and N. Shazeer. Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382, 2018.
-  D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
-  D. P. Kingma and M. Welling. Auto-encoding variational bayes. ICLR, 2014.
-  J. Krause, J. Johnson, R. Krishna, and L. Fei-Fei. A hierarchical approach for generating descriptive image paragraphs. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 3337–3345. IEEE, 2017.
-  R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
-  R. G. Krishnan, U. Shalit, and D. Sontag. Deep kalman filters. arXiv preprint arXiv:1511.05121, 2015.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
-  J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055, 2015.
-  Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, X. Wang, and M. Zhou. Visual question generation as dual task of visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6116–6124, 2018.
-  J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297, 2016.
-  J. Lu, J. Yang, D. Batra, and D. Parikh. Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7219–7228, 2018.
-  L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
-  C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
-  A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
-  T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
-  I. Misra, R. Girshick, R. Fergus, M. Hebert, A. Gupta, and L. van der Maaten. Learning by asking questions. arXiv preprint arXiv:1712.01238, 2017.
-  N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, and L. Vanderwende. Generating natural questions about an image. arXiv preprint arXiv:1603.06059, 2016.
-  R. Paulus, C. Xiong, and R. Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017.
-  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
-  M. Ren, R. Kiros, and R. Zemel. Exploring models and data for image question answering. In Advances in neural information processing systems, pages 2953–2961, 2015.
-  T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
-  A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Advances in neural information processing systems, pages 4967–4976, 2017.
-  I. V. Serban, A. García-Durán, C. Gulcehre, S. Ahn, S. Chandar, A. Courville, and Y. Bengio. Generating factoid questions with recurrent neural networks: The 30m factoid question-answer corpus. arXiv preprint arXiv:1603.06807, 2016.
-  I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, volume 16, pages 3776–3784, 2016.
-  R. Shetty, M. Rohrbach, L. A. Hendricks, M. Fritz, and B. Schiele. Speaking the same language: Matching machine to human captions by adversarial training. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
-  A. van den Oord, O. Vinyals, et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pages 6306–6315, 2017.
-  A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. Sun, S. Lee, D. Crandall, and D. Batra. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424, 2016.
-  P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.
-  O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
-  T. Wang, X. Yuan, and A. Trischler. A joint model for question answering and question generation. arXiv preprint arXiv:1706.01450, 2017.
-  R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
-  C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In International conference on machine learning, pages 2397–2406, 2016.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
-  J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh. Visual curiosity: Learning to ask questions to learn visual recognition. arXiv preprint arXiv:1810.00912, 2018.
-  Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 21–29, 2016.
-  S. Zhang, L. Qu, S. You, Z. Yang, and J. Zhang. Automatic generation of grounded visual questions. arXiv preprint arXiv:1612.06530, 2016.
-  Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7w: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4995–5004, 2016.