Information Maximizing Visual Question Generation

by Ranjay Krishna, et al.
Stanford University

Though image-to-sequence generation models have become overwhelmingly popular in human-computer communications, they suffer from strongly favoring safe, generic questions ("What is in this picture?"). Generating uninformative but relevant questions is not sufficient or useful. We argue that a good question is one that has a tightly focused purpose: one that is aimed at eliciting a specific type of response. We build a model that maximizes mutual information between the image, the expected answer and the generated question. To overcome the non-differentiability of discrete natural language tokens, we introduce a variational continuous latent space onto which the expected answers project. We regularize this latent space with a second latent space that ensures clustering of similar answers. Even when we don't know the expected answer, this second latent space can generate goal-driven questions specifically aimed at extracting objects ("what is the person throwing?"), attributes ("What kind of shirt is the person wearing?"), color ("what color is the frisbee?"), material ("What material is the frisbee?"), etc. We quantitatively show that our model is able to retain information about an expected answer category, resulting in more diverse, goal-driven questions. We deploy our model on a set of real-world images and extract previously unseen visual concepts.




1 Introduction

The task of transforming visual scenes to language (questions [47, 44], answers [64, 2] or captions [56, 30]) has widely adopted an image-to-sequence architecture that encodes the image through a convolutional neural network (CNN) and then decodes the language with a recurrent neural network (RNN). The whole framework can be efficiently trained by maximum likelihood estimation (MLE) and has demonstrated state-of-the-art performance in various tasks [60, 8, 10, 11, 38]. However, this training procedure is not suitable for generating questions or enabling discovery of new concepts. In fact, most MLE-based training schemes have been shown to produce generic questions that result in uninformative answers (e.g. "yes") [22], questions (e.g. "What is the person doing?") [23], captions (e.g. "A clear day with a blue sky" [60]) or dialogue (e.g. "I don't know") [35, 51]. Simply generating a generic question is not sufficient or useful for discovering new concepts.

Figure 1: Our new architecture generates goal-driven visual questions that maximize the likelihood of receiving an expected answer. When attempting to learn about objects or their attributes, it can generate questions aimed at attaining such answer categories.
Figure 2: Since multiple questions are possible for any image, previous approaches moved away from supervised question generation to variational approaches [23]. However, this resulted in generic, uninformative questions. We argue that a good question maximizes mutual information with an expected answer. But such a model is not practical as knowing the answer defeats the purpose of generating a question. Also, such models often lead to the posterior collapsing problem [4]. Instead, we propose an architecture that maximizes mutual information between the image, answer and question while also maintaining a regularization based on answer categories. Our final model uses two latent spaces and can generate questions both in the presence and the absence of answers.

Instead of generating generic questions, question generation models should be goal-driven: we show how they can be trained to ask questions aimed at extracting specific answer categories. Visual question generation is not a bijection, i.e. multiple correct questions can be generated from the same image. Previous research moved away from a supervised approach to question generation to a variational approach that can generate multiple questions by sampling a latent space [23] (see Figure 2). However, previous approaches are not goal-driven: they do not guarantee that the question will result in a specific type of answer. To remedy this problem, we could encode the answer along with the image before generating the question. While such an approach allows the model to condition its question on the answer, it is neither technically feasible nor practical. Technical infeasibility arises because variational models often lead to the posterior collapsing problem [4], where the model can learn to ignore the answer when generating questions. Impracticality arises because the main purpose of asking questions is to attain an answer, implying that knowing the answer defeats the purpose of generating the question.

To tackle the first challenge, we design a visual question generation architecture that maximizes the mutual information between the generated question and the image, as well as with the expected answer (see Figure 2). We call our model the Information Maximizing Visual Question Generator, as it maximizes relevance with the image and expectation over the answer. Safe, generic questions that lead to uninformative answers are discouraged, as they have low mutual information with either. However, optimizing for mutual information is often intractable, and given the discrete tokens (words) we wish to generate, no unbiased, low-variance gradient estimator exists [24, 40, 16, 52, 45, 58, 27]. We formulate our model as a variational auto-encoder that attempts to learn a joint continuous latent space between the image, question and expected answer. Instead of directly optimizing discrete utterances, the question, image and expected answer are all trained to maximize mutual information with this latent space. By reconstructing the image and expected answer representations, we can maximize the evidence lower bound (ELBO) and control what information the generated questions request.

The second challenge arises from the lack of an expected answer in real world deployments. Since we require an answer to map the image into a latent space, it is not possible to generate questions in the absence of an answer. Enumerating all possible answers is infeasible. Instead, we propose creating a second latent space that is learned from the image and the answer category instead of the answer itself. Answer categories can be objects, attributes, colors, materials, time, etc. During training, we minimize the KL-divergence between these two latent spaces. Not only does this allow us to generate visual questions that maximize mutual information with the expected answer, it also acts as a regularizer on the original latent space. It prevents the learned latent spaces from overfitting to specific answers in the training set and forces them to generalize to categories of questions.

We annotate the VQA dataset [2] with categories for the top answers and use it to train our model, which queries for specific answer categories. We evaluate our model on relevance to the image and on its ability to expect the answer type. Finally, we run our model on real world images and discover new objects, new attributes, new colors, and new materials.

2 Related work

Visual understanding has been studied vigorously through question answering with the availability of large scale visual question answering (VQA) datasets [2, 64, 31, 25]. Current VQA approaches follow a traditional supervised MLE paradigm that typically relies on a CNN + RNN encoder-decoder formulation [56]. Successive models have improved performance by stacking attention [62, 37], modularizing components [1, 26, 21], adding relation networks [49], augmenting memory [59], and adding proxy tasks [13, 57]. While the performance of VQA models has been encouraging, they require a large labelled dataset with a predefined vocabulary. In contrast, we focus on the surrogate task of generating questions, in the hopes of augmenting real world agents with the ability to expand their visual knowledge by discovering new visual concepts.

Figure 3: Training our model: we embed the image and answer into a latent z-space and attempt to reconstruct them, thereby maximizing mutual information with the image and the answer. We also use z to generate questions and train the generator with an MLE objective. Finally, we introduce a second latent t-space that is trained by minimizing KL-divergence with the z-space. The t-space allows us to remove the dependence on the answer when generating questions and instead grants us the ability to generate questions conditioned on the answer category.

In contrast to answering questions, generating questions has received little interest so far. In NLP, a few methods have attempted to automatically generate questions from knowledge bases using rule-based [50] or deep-learning-based systems. In computer vision, a few recent projects have explored the task of visual question generation to build curious visual agents [61, 23]. These projects have either followed an algorithmic rule-based [54, 50] or learning-based [44, 47] approach. Newer papers have treated the generation process as a variational process [23] or placed it in an active learning or reinforcement learning setting [61]. Our work draws inspiration from these previous methods and extends them by treating question generation as a process that maximizes mutual information not just with the image but also with the expected answer's category. We believe that a good question generator should be goal-driven: it should generate questions designed to receive a particular answer category.

There is a large body of work exploring generative models and learning latent representation spaces. Early work focused primarily on stacked autoencoders and then on restricted Boltzmann machines [55, 18, 19]. Recent successes of these applications have primarily been a result of variational auto-encoders (VAEs) [29] and generative adversarial networks (GANs) [14]. With the reparameterization trick, VAEs can be trained to learn a semi-supervised latent space to generate images [29]. They have also been extended to continuous state spaces [32, 3] and sequential models [15, 9]. GANs, on the other hand, can learn image representations that support basic linear algebra [46] and even enable one-shot learning by using probabilistic inference over Bayesian programs [34]. Both VAEs and GANs have disentangled their representations based on class labels or other visual variations [28, 41]. While we do not explicitly disentangle the representation, we will demonstrate later how the second latent space regularizes the original space and disentangles the representations of different answer categories.

Generative models often require a series of tricks for successful training [48, 46, 5, 4]. And even with these tricks, training them with discrete tokens is only possible by using gradient estimators. As we mentioned earlier, these estimators often suffer from one of two problems: high bias [27] or high variance [58]. Low-variance methods like Gumbel-Softmax [24], the CONCRETE distribution [40], semantic hashing [27] or vector quantization [53] result in biased estimators. Similarly, low-bias methods like REINFORCE [58] with Monte Carlo rollouts result in high variance [16, 52, 45]. We overcome this issue by introducing a continuous latent space that maximizes mutual information with encodings of the image, question and answer. This latent space can be trained using existing VAE training procedures that attempt to reconstruct the image and answer representations. We further extend this model with a second latent space conditioned on the answer category that removes the need for an actual answer when generating questions.
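To make the trade-off concrete, the Gumbel-Softmax relaxation mentioned above replaces a non-differentiable categorical sample with a tempered softmax over noisy logits; a minimal NumPy sketch for illustration only (our model sidesteps discrete sampling entirely by working in a continuous latent space):

```python
import numpy as np

def gumbel_softmax(logits, temperature, rng):
    """Differentiable relaxation of categorical sampling: add Gumbel
    noise to the logits, then take a tempered softmax. As the
    temperature approaches 0 the output approaches a one-hot sample,
    but the gradient estimate becomes increasingly biased."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / temperature
    y = y - y.max()              # shift for numerical stability
    expy = np.exp(y)
    return expy / expy.sum()

rng = np.random.default_rng(0)
logits = np.array([1.0, 2.0, 3.0])
soft = gumbel_softmax(logits, temperature=1.0, rng=rng)   # smooth sample
hard = gumbel_softmax(logits, temperature=0.01, rng=rng)  # near one-hot
```

Both outputs are valid probability distributions over the vocabulary; only the temperature differs.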

3 Information Maximizing Visual Question Generator

Figure 4: Inference on our model: Given an input image and an answer category (e.g. attribute), we encode both into a latent representation t, parameterized by mean μ and standard deviation σ. We sample from t with noise to generate questions that are relevant to the image and whose answers fall into the given answer category.

Our aim is to generate questions that have a tightly focused purpose: questions that aim to learn something specific about the image. Agents with the capability to request specific categories of information can extract new concepts more effectively from the real world. In this section, we detail how we design an Information Maximizing Visual Question Generator. Recall that the goal of our model is to generate questions given an image and an answer category. For example, if we want to understand materials or binary answers, our model should generate the questions "What material is that desk made out of?" or "Is the desk on the right of the chair?", respectively. Our two challenges are (1) technical infeasibility caused by non-differentiable discrete tokens and variational posterior collapse and (2) impracticality of requiring answers to generate questions. We start off with a formal definition of the problem, explain why current methods fail, and then detail our training and inference process.

3.1 Problem formulation

Let q denote the question we want to generate for an image i. This question should result in an answer a of category c. For example, the question "What is the person in red doing with the ball?" should result in the answer "kicking", which belongs to the category "activity". Our final goal is to define a model p(q | i, c). But first, let's attempt to define a simpler model that maximizes the mutual information between the image and the question, I(i; q), and between the expected answer and the question, I(a; q). This objective can be written as:

    max_θ  I(i; q) + λ I(a; q)                                          (1)

where λ is a hyperparameter that adjusts for their relative importance in the optimization.

3.2 Continuous latent space

As already mentioned, directly optimizing this objective is infeasible because the exact computation of mutual information is intractable. Additionally, optimizing by estimating gradients between discrete steps is difficult as the estimator needs to have both low bias and low variance. To overcome this challenge, we introduce a continuous, dense, latent z-space. We learn a mapping, parameterized by θ, from the image and the expected answer to this latent space.

With this z-space, our new optimization becomes:

    max_θ  λ1 I(i; z) + λ2 I(a; z) + I(q; z | i, a)                     (2)

where λ1 and λ2 are hyperparameters that relatively weight the mutual information terms in the optimization.

3.3 Variational mutual information maximization

So far, we have avoided discrete tokens. However, this mutual information maximization is still intractable as it requires knowing the true posteriors p(i | z) and p(a | z). Fortunately, we can opt to maximize its ELBO:

    I(i; z) ≥ H(i) + E_z[ log d_i(i | z) ]                              (3)

where H is the entropy function and E is the expectation. d_i is a variational reconstruction function parameterized by φ_i. This optimization is often referred to as variational information maximization [6]. Similarly,

    I(a; z) ≥ H(a) + E_z[ log d_a(a | z) ]                              (4)

The third and final conditional mutual information term can also be bounded by:

    I(q; z | i, a) ≥ H(q | i, a) + E_z[ log p(q | z) ]                  (5)

Putting Eq. 3, 4 and 5 together into Eq. 2:

    max_θ  λ1 E_z[ log d_i(i | z) ] + λ2 E_z[ log d_a(a | z) ] + E_z[ log p(q | z) ]    (6)
    s. t.  z ~ q_θ(z | i, a)

Note that we ignore the entropy terms associated with the training data as they do not involve the parameters we are trying to optimize. Therefore, optimizing Eq. 6 can be accomplished by maximizing the reconstruction of the image and answer representations while maximizing the MLE objective of generating the question.

3.4 Question generation by reconstructing image and answer representations

To functionalize the optimization presented above, we begin by encoding the image i using a CNN into a dense vector e_i (see Figure 3). Similarly, we encode the answer a using a long short-term memory network (LSTM) [20], a variant of RNNs, into another dense vector e_a. Next, we feed e_i and e_a into a VAE that embeds both into the latent z-space. In practice, we assume that z follows a multivariate Gaussian distribution with diagonal covariance. We use the reparameterization trick [29] to generate means μ and standard deviations σ, and combine them with sampled unit Gaussian noise ε to generate z = μ + σ ⊙ ε.
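The reparameterization step can be sketched as follows; a minimal NumPy illustration, where mu and log_var are hypothetical stand-ins for the learned outputs of the image-answer encoder:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    Moving the randomness into an external noise variable keeps the
    sample differentiable with respect to mu and log_var, which is
    what makes VAE training by backpropagation possible."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

# Hypothetical encoder outputs for a single (image, answer) pair.
mu = np.zeros(4)
log_var = np.full(4, -2.0)   # small per-dimension variance
rng = np.random.default_rng(0)
z = reparameterize(mu, log_var, rng)
```

In the real model this sample feeds the decoder LSTM; here it simply demonstrates the sampling mechanics.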

From z, we reconstruct e_i and e_a as ê_i and ê_a, and optimize the first two terms in Eq. 6 by minimizing the following reconstruction losses:

    L_i = ||e_i − ê_i||²,   L_a = ||e_a − ê_a||²                        (7)

Next, we use a decoder LSTM to generate the question q̂ from the z-space. We minimize the MLE objective between q̂ and the true question q in our training set, which results in the third and final term in Eq. 6.
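These loss terms can be sketched numerically; a toy NumPy example under the assumption that the decoder emits a per-step probability distribution over a small vocabulary (the squared-L2 reconstruction form and the helper names are illustrative, not the released implementation):

```python
import numpy as np

def reconstruction_loss(e, e_hat):
    """Squared L2 distance between an encoding and its reconstruction
    from the latent code z (used for both the image and answer terms)."""
    return float(np.sum((e - e_hat) ** 2))

def question_nll(token_probs, target_ids):
    """MLE objective: negative log-likelihood of the ground-truth
    question tokens under the decoder's per-step distributions."""
    rows = np.arange(len(target_ids))
    return float(-np.sum(np.log(token_probs[rows, target_ids])))

# Toy values: a 3-token question over a 5-word vocabulary.
probs = np.full((3, 5), 0.1)
probs[np.arange(3), [2, 0, 4]] = 0.6   # decoder favors the targets
loss_q = question_nll(probs, [2, 0, 4])
loss_i = reconstruction_loss(np.ones(4), np.ones(4) * 0.5)
```

The full training loss weights and sums terms of exactly these two shapes.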

3.5 Regularizing with a second latent space

So far, we have proposed building a model that maximizes the lower bound of mutual information between a latent space, the image and the expected answer. This allows us to generate questions if we know what the expected answer should be. This is not conducive to our original goal of deploying our model in real world situations where it does not know the answer a priori. If we already know the answer to a question, there is no point in generating a question.

To remedy this, we propose a second latent t-space. Instead of using both e_i and e_a to encode the image and answer into the z-space, we discard the answer and instead only use its category c. We classify answers as belonging to one of a few predefined categories, such as objects (e.g. "cat"), attributes (e.g. "cold"), color (e.g. "brown"), relationship (e.g. "ride"), counting (e.g. "1"), etc. These categories are cast as a one-hot vector, encoded as e_c and used, along with e_i, to embed into the variational t-space. We train the t-space by minimizing the KL-divergence with the z-space:

    L_KL = D_KL( q_θ(z | i, a) || q_ψ(t | i, c) )
         = Σ_d [ log(σ_t / σ_z) + (σ_z² + (μ_z − μ_t)²) / (2 σ_t²) − 1/2 ]    (8)

where ψ are the parameters used to embed into the t-space and the sum runs over the latent dimensions. This allows us to utilize e_c to embed into a space that closely resembles the z-space. Since we assume that both the z-space and the t-space follow a multivariate Gaussian with diagonal covariance, the KL term has the analytical form shown above. We no longer need to know the answer to embed and generate questions. Intuitively, the t-space can also be thought of as a regularizer on the z-space, preventing the model from overfitting to the answers in the training data and relying instead on the answer categories.
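Because both spaces are diagonal Gaussians, the KL term has this simple closed form; a NumPy sketch of the per-dimension formula, summed over latent dimensions (variances are passed directly rather than log-variances for clarity):

```python
import numpy as np

def kl_diag_gaussians(mu_z, var_z, mu_t, var_t):
    """KL( N(mu_z, diag(var_z)) || N(mu_t, diag(var_t)) ),
    summed over the latent dimensions."""
    return float(np.sum(
        0.5 * (np.log(var_t / var_z)
               + (var_z + (mu_z - mu_t) ** 2) / var_t
               - 1.0)
    ))

# Identical Gaussians have zero divergence.
zero = kl_diag_gaussians(np.zeros(3), np.ones(3), np.zeros(3), np.ones(3))
# Shifting the t-space mean away from the z-space mean increases it.
shifted = kl_diag_gaussians(np.zeros(3), np.ones(3), np.ones(3), np.ones(3))
```

Minimizing this quantity pulls the category-conditioned t-space toward the answer-conditioned z-space during training.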

Putting them together, the final loss for our model is:

    L = L_MLE + λ1 L_i + λ2 L_a + λ3 L_KL                               (9)

where L_i and L_a have already been introduced and λ3 is a hyperparameter that controls the amount of regularization used in our model. Note that we are omitting the KL-loss with respect to a unit normal centered at zero that maintains the two latent spaces' priors.

3.6 Inference

During inference, we are given an image and an answer category and are expected to generate questions. We encode the inputs into the second latent t-space and sample from it to generate questions, as shown in Figure 4. This allows us to generate goal-driven questions for any image, focused towards extracting its objects, its attributes, etc.
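Inference can be sketched as repeated sampling from the t-space; in this toy NumPy example, toy_decode is a hypothetical stand-in for the trained question decoder, reduced to picking between canned templates so the sampling loop is runnable:

```python
import numpy as np

def sample_questions(mu_t, log_var_t, decode, n_samples, rng):
    """Draw several latent codes from the t-space and decode each;
    different noise draws yield different questions for the same
    image and answer category."""
    sigma = np.exp(0.5 * log_var_t)
    return [decode(mu_t + sigma * rng.standard_normal(mu_t.shape))
            for _ in range(n_samples)]

# Hypothetical decoder: maps a latent code to a question template.
TEMPLATES = ["what color is the frisbee?", "what color is the shirt?"]
def toy_decode(t):
    return TEMPLATES[int(t[0] > 0)]

rng = np.random.default_rng(1)
questions = sample_questions(np.zeros(2), np.zeros(2), toy_decode, 5, rng)
```

In the full model, the decode step is the LSTM decoder described in Section 3.4, conditioned on the sampled latent code.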

3.7 Implementation details

We implement our model using PyTorch and plan on releasing all our code. We use ResNet18 [17] as our image encoder and do not fine-tune its weights. e_i, e_a and e_c are all dense vectors of the same dimensionality, as are the z-space and t-space. The encoders for the image and answer are trained only from the question generation loss L_MLE, and not from L_i, L_a or L_KL, to prevent the encoders from simply optimizing for the reconstruction loss at the cost of not being able to generate questions. We tuned the hyperparameters λ1, λ2 and λ3, along with the learning rate and its decay schedule, on the validation set.

4 Experiments

| Space   | Models       | Bleu-1 | Bleu-2 | Bleu-3 | Bleu-4 | METEOR | CIDEr | MI: Answer | MI: Category | Rel.: Image | Rel.: Category |
|---------|--------------|--------|--------|--------|--------|--------|-------|------------|--------------|-------------|----------------|
| z-space | IA2Q [57]    | 32.43  | 15.49  | 9.24   | 6.23   | 11.21  | 36.22 | 11.48      | 35.33        | 91.10       | 36.80          |
| z-space | V-IA2Q [23]  | 36.91  | 17.79  | 10.21  | 6.25   | 12.39  | 36.39 | 11.13      | 36.91        | 90.10       | 39.00          |
| z-space | Ours w/o A   | 38.88  | 20.74  | 12.75  | 6.29   | 12.78  | 40.13 | 10.02      | 40.44        | 98.10       | 42.70          |
| z-space | Ours w/o AC  | 38.99  | 21.48  | 12.73  | 6.57   | 13.01  | 42.13 | 10.10      | 60.00        | 96.80       | 42.80          |
| z-space | Ours w/o C   | 50.09  | 32.32  | 24.61  | 16.27  | 20.58  | 94.33 | 33.44      | 61.04        | 98.00       | 82.40          |
| z-space | Ours         | 48.09  | 29.76  | 20.71  | 15.17  | 18.78  | 92.13 | 30.23      | 91.02        | 97.10       | 91.20          |
| t-space | IC2Q         | 30.42  | 13.55  | 6.23   | 4.44   | 9.42   | 27.42 | 9.88       | 40.23        | 90.00       | 38.80          |
| t-space | V-IC2Q       | 35.40  | 25.55  | 14.94  | 10.78  | 13.35  | 42.54 | 10.11      | 60.23        | 92.20       | 45.00          |
| t-space | Ours w/o A   | 31.20  | 16.20  | 11.18  | 6.24   | 12.11  | 35.89 | 9.35       | 68.23        | 98.00       | 52.50          |
| t-space | Ours         | 47.40  | 28.95  | 19.93  | 14.49  | 18.35  | 85.99 | 28.23      | 99.02        | 97.20       | 98.00          |

Table 1: We report our model's efficacy with multiple metrics. We use language modeling metrics to measure its capability to generate questions similar to the ground truth. Next, we measure the model's ability to maximize mutual information (MI) by predicting the answer or its category from the latent space embedding. Finally, we measure the relevance (Rel.) of the question with the image and the answer category. Note that language modeling scores are scaled up to show more significant digits; mutual information and relevance scores are reported as percentages.
Figure 5: TSNE [39] visualization of the latent encodings. When we don't reconstruct the answer, the embeddings show no separation between answers or their categories, confirming the posterior collapse. Meanwhile, by reconstructing the answer, both the z-space and the t-space encodings are visually separable. Different colors represent categories of answers; we show only a subset of the categories for clarity.

To test our visual question generation model, we perform a series of experiments and evaluate the model along multiple dimensions. We start by discussing the dataset and evaluation metrics used. We then showcase examples of our model’s generated questions when conditioned on the answer. Next, we demonstrate its ability when conditioned only on the answer category. We compare both these cases against a series of baselines and ablations. We analyze the diversity of questions produced within each answer category. Finally, we report a small proof of concept deployment of our model on real world images found online and show that it can learn new concepts.

4.1 Experimental setup

Dataset. To enable the kind of interaction where we can specify input answer categories, we need a VQA dataset that categorizes its answers. The VQA dataset [2] has a few basic categorizations of questions but not their answers. We annotate the VQA [2] dataset answers with a set of categories, labeling the most frequent answers. These categories include objects (e.g. "cat", "person"), attributes (e.g. "cold", "old"), color (e.g. "brown", "red"), relationship (e.g. "ride", "jump"), counting (e.g. "1", "10"), etc. These top answers cover the large majority of the dataset's training and validation examples. We treat their validation set as our test set, as the answers in their test set are not publicly available, and hold out a portion of the original training set for validation.

Evaluation metrics. All past question generation papers have used a variety of evaluation metrics to calculate the quality of a question. While some have focused on maximizing diversity [54, 23, 63], others have treated it as a proxy task to improve question answering [36, 47, 57]. Diversity measures have included using variants of beam search [54], measuring novel questions or unique tri-grams [23] or creating rule-based datasets [63]. Proxy tasks have typically used accuracy of multiple-choice answers to measure the performance of question generation.

We too report a variety of different evaluation metrics to highlight different components of our model. First, we use language modeling evaluation metrics like BLEU, METEOR and CIDEr [7] to calculate how well our generated questions match the ground truth questions in our test set. Next, we measure the mutual information retained in the latent space by training a classifier to predict answer categories from the latent space encodings. This metric sheds light on how well our method retains information about the input answers or answer categories. Next, we measure relevance of the question, ensuring that the questions are valid for the given image and result in the expected answer category. Relevance results are calculated by majority vote among hired crowd workers, who judge whether a question can be answered given its corresponding image. Finally, we report diversity scores for each category, which measure the number of unique questions generated.

Figure 6: Example questions generated for a set of images and answer categories. Incorrect questions are shown in grey and occur when no relevant question can be generated for a given image and answer category.

Baselines. We adapt a series of past CNN-RNN models to accept answers or answer types when generating questions. The first model, IA2Q, is a supervised, non-variational model that takes an image and answer as input and generates a question [57]. This model is reminiscent of the VQA models often used to answer questions [57], except the answers are now inputs and the questions outputs [2, 64]. Next, V-IA2Q is a variational version of IA2Q, which embeds the answer and question into a latent space before generating the question [23]. We also train versions of these models that accept the answer categories instead of the answer: IC2Q and V-IC2Q. When generating from a variational model, we fix the sampled latent noise to keep its outputs consistent for all measures except diversity.

We refer to our full model as Ours; it can generate questions from either the answer latent space z or the category latent space t. We perform ablations on this model by removing specific components. Ours w/o A doesn't maximize mutual information with respect to the expected answer but can still generate questions from both the z- and t-spaces. Ours w/o C doesn't include the t-space and can only generate questions from answers. Finally, Ours w/o AC neither trains with the reconstruction loss nor has a second latent space. Our evaluations empirically demonstrate how these ablations justify our design choices.

4.2 Mutual information maximization

We check whether our model improves the mutual information retained in the latent space with the input answer. We freeze the weights of a trained model and embed input images, answers and categories into the latent z- or t-space, depending on the model. We train a simple MLP that attempts to classify the latent code as either one of the answer categories or one of the answers, and evaluate on the test set against the corresponding random-chance baselines. Table 1 shows that the baseline models do a poor job of actually remembering the answer or category, justifying the need for a mutual information maximization approach. Since these models are unable to retain information about the input answers, it also explains why they often generate safe, generic, uninformative questions. Since our model can embed into both the z- as well as the t-space, we report how well these two spaces retain information. We find that Ours retains near-perfect information about the input answer category, with an accuracy of 91.02% from the z-space and 99.02% from the t-space. We find that when trained without the t-space, Ours w/o C retains more information about the answer itself, as it no longer has to constrain the z-space to regularize answers of the same category. We also visualize a TSNE [39] representation of the two spaces in Figure 5. Models that don't reconstruct the answer (e.g. Ours w/o A, Ours w/o AC or any of the baselines) show visually inseparable categories.

4.3 Generating questions given the answers

Since our model can produce questions from both answers as well as answer categories, we evaluate both scenarios individually. The language modeling section in Table 1 showcases how the various models perform when generating questions from the z-space, i.e. generating questions from answers. We find that Ours w/o C performs the best over all the baselines and across all ablations of our model. This is likely because the latent space has more capacity when it is not also being regularized by the t-space. We find that Ours w/o A performs worse on METEOR than Ours and Ours w/o C, implying that forcing the model to reconstruct the answer does improve the quality of questions generated to better match the ground truth.

| Category  | V-IC2Q Strength | V-IC2Q Inventive | Ours Strength | Ours Inventive |
|-----------|-----------------|------------------|---------------|----------------|
| counting  | 15.77           | 30.91            | 26.06         | 41.30          |
| binary    | 18.15           | 41.95            | 28.85         | 54.50          |
| object    | 11.27           | 34.84            | 24.19         | 43.20          |
| color     | 4.03            | 13.03            | 17.12         | 23.65          |
| attribute | 37.76           | 41.09            | 46.10         | 52.03          |
| materials | 36.13           | 31.13            | 45.75         | 40.72          |
| spatial   | 61.12           | 62.54            | 70.17         | 68.18          |
| food      | 21.81           | 20.38            | 33.37         | 31.19          |
| shape     | 35.51           | 44.03            | 45.81         | 55.65          |
| location  | 34.68           | 18.11            | 45.25         | 27.22          |
| people    | 22.58           | 17.38            | 36.20         | 31.29          |
| time      | 25.58           | 15.51            | 34.43         | 25.30          |
| activity  | 7.45            | 13.23            | 21.32         | 26.53          |
| Overall   | 12.97           | 38.32            | 26.06         | 52.11          |

Table 2: Diversity measures across different answer categories. We report generation strength, the percentage of unique questions generated normalized by the number of unique ground truth questions, and generation inventiveness, the percentage of generated unique questions unseen during training. All questions were generated from the t-space of our model for a fair comparison with V-IC2Q.

4.4 Generating questions with answer types

The lower half of Table 1 evaluates how well our model and the baselines perform when generating questions in the absence of the actual answer and only in the presence of the answer categories. We find that, overall, all the language metrics are slightly lower than when the questions were generated from the z-space. This is expected, as the questions now need to be generated with only the answer category encoded in the t-space, without knowing exactly what the answer is. Therefore, the models are penalized for asking an "object" question about the "horse" when the answer expects the question to focus on the "saddle" instead. We also qualitatively sample and report a random set of questions generated by our model in Figure 6. We see that our model often uses concepts in the image to ground the questions. It asks specific questions like "what is the bat made of?" or "is the man going to the right of the girl?". However, there are categories like "time" that have a low diversity of training questions and result in the inevitable "what time of day is this?" question. The qualitative errors we have observed often occur when the model is forced to ask a question about a category that is not present in the image; it is hard to ask about "food" when no food is present.

4.5 Measuring diversity of questions

For all the images in our test set, we generated one question per answer category. We report diversity in Table 2 using two existing metrics: (1) strength of generation: the percentage of unique generated questions normalized by the number of unique ground truth questions, and (2) inventiveness of generation: the percentage of unique questions unseen during training, normalized by all unique questions generated. We compare our model with the baseline V-IC2Q, which does not reconstruct the answer or the image. We find that our method results in a more diverse set of questions across most categories. Questions asking for "shape" and "materials" tend to generate the most unseen questions, as the model learns to generate questions like "what [shape/material] is the ____ [made out of]?" and injects objects in the given image into the missing blank. Answers agnostic to the image contents, such as "time", result in the fewest number of novel questions.
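Both diversity measures reduce to simple set arithmetic; a sketch in pure Python, under the assumption that questions are compared as exact strings:

```python
def generation_strength(generated, ground_truth):
    """Percent of unique generated questions, normalized by the
    number of unique ground-truth questions."""
    return 100.0 * len(set(generated)) / len(set(ground_truth))

def generation_inventiveness(generated, training_questions):
    """Percent of unique generated questions never seen in training,
    normalized by all unique generated questions."""
    unique = set(generated)
    return 100.0 * len(unique - set(training_questions)) / len(unique)

# Toy example with duplicated and previously seen questions.
gen = ["what color is the frisbee?", "what color is the wall?",
       "what color is the frisbee?"]
gt = ["what color is the frisbee?", "what color is the sky?",
      "what color is the cat?", "what color is the wall?"]
train = ["what color is the frisbee?"]
strength = generation_strength(gen, gt)           # 2 unique of 4 unique GT
inventive = generation_inventiveness(gen, train)  # 1 of 2 unique unseen
```

Here strength and inventiveness both come out to 50%, mirroring how the columns of Table 2 are computed.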

4.6 Real world deployment of our model

To examine our model in a real-world deployment, we generated questions for images scraped from online public social media posts with the hashtags #food, #nature, #sports and #fashion. Since our model needs an input answer category to ask a question, we trained a simple ResNet18 CNN [17] on the VQA images to output one of the answer categories (see Table 3). We predicted answer categories using this CNN and fed them into our model to generate the questions. The questions were sent to two crowd workers: one answered the question, and the other reported the relevance of the question to the image and of the answer to the answer category. We found all the questions asked by both Ours and V-IC2Q to be relevant to the image, while a smaller fraction were relevant to the answer category. Our method's questions led to more unseen concepts.

| Category   | Questions | V-IC2Q | Ours | Examples             |
|------------|-----------|--------|------|----------------------|
| object     | 411       | 10     | 80   | blackthorns, robins  |
| attributes | 205       | 8      | 40   | desecrated, crowned  |
| colors     | 164       | 12     | 17   | burgandy, Alabaster  |
| materials  | 220       | 4      | 8    | polyester, spandex   |

Table 3: We categorized images into one of the answer categories, generated questions and asked crowd workers to answer them. We report the number of questions asked per category and the number of new concepts discovered by our model versus the baseline. We also show examples of newly discovered concepts.

5 Conclusion

We believe that visual question generation should be a task aimed at extracting specific categories of concepts from an image. We define a good question to be one that is not only relevant to the image but is also designed to expect a specific answer category. We build the Information Maximizing Visual Question Generator, which maximizes the mutual information between the generated question, the input image and the expected answer. We extend this model to overcome the technical challenges associated with maximizing mutual information over discrete tokens and a collapsing posterior, while also allowing it to generate questions when the expected answer is absent. We analyze the questions using language modeling, diversity, relevance and mutual information metrics. We further show, through a real-world deployment of this system, that it can discover new concepts.


We thank Justin Johnson, Andrey Kurenkov, Apoorva Dornadula and Vincent Chen for their helpful comments and edits. This work was partially funded by the Brown Institute of Media Innovation and by Toyota Research Institute (“TRI”) but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.