Guiding Visual Question Generation

by Nihir Vedd, et al.

In traditional Visual Question Generation (VQG), most images have multiple concepts (e.g. objects and categories) for which a question could be generated, but models are trained to mimic an arbitrary choice of concept as given in their training data. This makes training difficult and also poses issues for evaluation – multiple valid questions exist for most images but only one or a few are captured by the human references. We present Guiding Visual Question Generation - a variant of VQG which conditions the question generator on categorical information based on expectations on the type of question and the objects it should explore. We propose two variants: (i) an explicitly guided model that enables an actor (human or automated) to select which objects and categories to generate a question for; and (ii) an implicitly guided model that learns which objects and categories to condition on, based on discrete latent variables. The proposed models are evaluated on an answer-category augmented VQA dataset and our quantitative results show a substantial improvement over the current state of the art (over 9 BLEU-4 increase). Human evaluation validates that guidance helps the generation of questions that are grammatically coherent and relevant to the given image and objects.




1 Introduction

In the last few years, the AI research community has witnessed a surge in multimodal tasks such as Visual Question Answering (VQA) (Antol et al., 2015; Anderson et al., 2018), Multimodal Machine Translation (Specia et al., 2016; Elliott et al., 2017; Barrault et al., 2018; Caglayan et al., 2019), and Image Captioning (IC) (Vinyals et al., 2015; Karpathy and Fei-Fei, 2015; Xu et al., 2015). Visual Question Generation (VQG) (Zhang et al., 2016; Krishna et al., 2019; Li et al., 2018), a multimodal task which aims to generate a question given an image, remains relatively under-researched despite the popularity of its textual counterpart. Throughout the sparse literature in this domain, different approaches have augmented and/or incorporated extra information as input. For example, Pan et al. (2019) emphasised that providing the ground truth answer to a target question is beneficial in generating a non-generic question. Krishna et al. (2019) pointed out that requiring an answer to generate questions violates a realistic scenario. Instead, they proposed a latent variable model using answer categories to help generate the corresponding questions. More recently, Scialom et al. (2020) incorporated a pre-trained language model together with extracted object features and image captions for question generation.

In this work, we explore VQG from the perspective of ‘guiding’ a question generator. We define the general notion of ‘guiding’ as conditioning a generator on inputs that match specific chosen properties from the target. We use the answer category and objects/concepts based on an image and target question as inputs to our decoder. We propose two model variants to achieve this goal - explicit guiding and implicit guiding. The explicit variant (Section 3.1) is modelled around the notion that an actor can select a subset of detected objects in an image for conditioning the generative process. Depending on the application, this selection could be done by a human, by an algorithm, or at random. A human, for example, may wish to select certain concepts they want the generated question to reflect, while an algorithm could select a subset of objects based on external requirements (e.g. selecting a subset of furniture-related objects from the whole set of detected objects in order to generate questions for teaching the furniture-related vocabulary in a language learning setting). Alongside the objects, the actor may also provide an answer category to the question generator.

The implicit variant (Section 3.2), on the other hand, is motivated by removing the dependency on the aforementioned actor. This is done by learning generative distributions over the relevant answer categories and objects given certain images, which makes use of the intuitive information from the images in an implicit way. To model this implicit selection of concepts, we employ a model with two discrete latent variables that learn an internally-predicted category and a set of objects relevant for the generated question, optimised with cross-entropy and variational inference (Mnih and Gregor, 2014; Rezende et al., 2014; Hoffman et al., 2013; Kingma and Welling, 2014; Burda et al., 2016; Miao et al., 2016; Kingma et al., 2014), respectively.

Human evaluation shows that our models can generate realistic and relevant questions, with our explicit model almost fooling humans when asked to determine which, out of two questions, is the generated question. Our experiments and results are presented in Section 5.

To summarise, our main contributions are:

  • A thorough exploration of the concept of ‘guiding’ in VQG.

  • A novel generative Transformer-based set-to-sequence approach for Visual Question Generation.

  • The first work to explore discrete latent variable models in multimodal Transformer-based architectures.

  • A substantial increase in quantitative metrics - our explicit model improves the current state of the art setups by over 9 BLEU-4 and 110 CIDEr.

(a) Architecture of our explicit model. Given an image, first an object detection model is used to extract object labels and object features; a captioning model is used to generate relevant captions. Questions and answers are concatenated to filter the conceptual information from generated objects and captions. Next, the filtered concepts are combined with the category as the input to the text encoder; the extracted object features are fed into an image encoder. Finally, the outputs from the text encoder and the image encoder are fused into the decoder for question generation.
(b) Architecture of our implicit model. Similar to the explicit model, first an object detection model is used to extract object labels and object features, which are combined to predict categories through an MLP. The combination of object labels and object features is also used to learn a generative distribution. Questions and answers are encoded using a PLM and fused with object labels and features to learn a variational distribution over the object labels. With object features, predicted category, and sampled object labels, a decoder is applied for question generation. The dashed lines indicate training only.

2 Related Work

2.1 Visual Question Generation

Zhang et al. (2016) introduced the first paper in the field of VQG, employing an RNN based encoder-decoder framework alongside model-generated captions to generate questions. Since then, only a handful of papers have investigated VQG. Fan et al. (2018) demonstrated the successful use of a GAN in VQG systems, allowing for non-deterministic and diverse outputs. Jain et al. (2017) proposed a model using a VAE instead of a GAN; however, their improved results require the use of a target answer during inference. To overcome this requirement, Krishna et al. (2019) augmented the VQA (Antol et al., 2015) dataset with answer categories, and proposed a model with two latent spaces. During training, one latent space receives the image features and an answer; the other receives the same image features along with a broader answer category. Part of the overarching model objective incorporates a divergence loss between the two encoded feature spaces, and at inference time, the latent space without the full answer is utilized for generation. Because their architecture uses information from the target as input, their work falls under our definition of guided generation. More recently, Scialom et al. (2020) investigate the cross-modal performance of pre-trained language models by fine-tuning a BERT (Devlin et al., 2018) model on model-based object features and ground-truth image captions. Other work, such as Patro et al. (2018), Patro et al. (2020) and Uppal et al. (2020), either does not include BLEU scores higher than BLEU-1, which is not very informative, or addresses variants of the VQG task. In the latter case the models fail to beat previous SoTA on BLEU-4 for standard VQG.

2.2 Latent Variable Models

VAEs (Kingma and Welling, 2014) incorporate a form of non-determinism within a model, making them a suitable candidate for models which require diverse outputs. VAEs have been successfully applied to text processing and generation (Miao et al., 2016; Miao and Blunsom, 2016; Bowman et al., 2015), and are widely used for diverse text generation (Jain et al., 2017; Aneja et al., 2019; Lin et al., 2020).

Ongoing research demonstrates that there may also be advantages to quantization or discretization of variational models (Oord et al., 2018; Hu et al., 2017; Deng et al., 2018; van den Oord et al., 2017). A discrete latent variable is ideal for tasks which require controllable generation (Hu et al., 2017). For example, in the case of VQG, following the learned generative and variational distribution paradigm outlined above, it enables a model to implicitly learn which objects are relevant to a question and answer pair. Then, at inference time, given a set of object tokens, the model can learn which subset will produce the most informative question. In our work, we model the latent distribution over objects and the predicted category distribution as discrete latent variables which are optimised internally for guiding question generation.

3 Methodology

We introduce the shared concepts of our explicit and implicit model variants, before diving into the variant-specific methodologies (Sections 3.1 and 3.2 respectively).

For both variants, we keep the VQG problem grounded to a realistic scenario. That is, during inference, we can only provide the model with an image, and data that can either be generated by a model (e.g. object features or image captions) and/or trivially provided by a human (i.e. answer category and a selected subset of the detected objects). However, during training, we are able to use any available information, such as images, captions, objects, answer categories, answers and target questions, employing latent variable models to minimise divergences between feature representations of data accessible at train time but not inference time. This framework is inspired by Krishna et al. (2019). In Appendix A, we show an example of how the inputs differ between training, testing and inference.

Formally, the VQG problem is as follows: given an image $i \in I$, where $I$ denotes a set of images, decode a question $q$. In the guided variant, for each $i$, we also have access to textual utterances, such as ground truth answer categories and answers. The utterances could also be extracted by an automated model, such as image captions (Li et al., 2020), or object labels and features (Anderson et al., 2018). In our work, answer categories take on 1 out of 16 categorical variables to indicate the type of question asked. For example, “how many people are in this picture?” would have a category of “count” (see Krishna et al. (2019) for more details).

Text Encoder. For encoding the text, we use BERT (Devlin et al., 2018) as a pre-trained language model (PLM). Thus, for a tokenised textual input $S$ of length $T$, we can extract a $d$-dimensional representation for each token $s_t \in S$:

$$X_{\text{text}} = \mathrm{PLM}(S) \in \mathbb{R}^{T \times d} \tag{1}$$

Image Encoder. Given an image $i$, we can extract $k_o$ object features, $f$, and their respective normalized bounding boxes, $b \in \mathbb{R}^{k_o \times 4}$, with the 4 dimensions referring to horizontal and vertical positions of the feature bounding box. Following the seminal methodology of Anderson et al. (2018), $k_o$ is usually 36. Subsequent to obtaining these features, we encode the image using a Transformer (Vaswani et al., 2017), replacing the default position embeddings with the spatial embeddings extracted from the bounding box features (Krasser and Stumpf, 2020; Cornia et al., 2019). Specifically, given $f$ and $b$ from image $i$,

$$X_{\text{img}} = \mathrm{Transformer}(f, b) \tag{2}$$

Text Decoder. We employ a pretrained Transformer based decoder for our task (Wolf et al., 2020). Following standard sequence-to-sequence causal decoding practices, our decoder receives some encoder outputs, and auto-regressively samples the next token, for use in the next decoding timestep. Since our encoder outputs are the concatenation (the $;$ operator) of our textual and vision modality representations, $X = [X_{\text{text}}; X_{\text{img}}]$, our decoder thus takes on the form:

$$\hat{q} = \mathrm{Decoder}(X) \tag{3}$$

where $\hat{q}$ is the predicted question.

In this work, we primarily focus on a set-to-sequence problem as opposed to a sequence-to-sequence problem. That is, our textual input is not a natural language sequence, but rather an unordered set comprising tokens from the answer category, the object labels, and the caption. How this set is obtained is discussed in the following section. Notably, because the input is a set, we disable positional encoding on the PLM encoder.

3.1 Explicit Guiding

As mentioned in Section 1, the explicit variant requires some actor in the loop. Thus, in a real world setting, this model will run in two steps. Firstly, we run object detection (OD) and image captioning (IC) over an image and return relevant guiding information to the actor. The actor may then choose a subset of objects which are sent to the decoder to start its generation process.

To enable this setup, we create paired data based on the guided notion. At a high level, our approach creates this data in three steps: 1) obtain object labels; 2) obtain concepts via IC; 3) filter the conceptual information based on a question-answer (QA) pair. More specifically, given an image, we first extract object labels following the work of Anderson et al. (2018). Then, we obtain a caption sequence using a captioning model (Li et al., 2020). Given this sequence, we remove stop words and concatenate the caption tokens with the aforementioned object labels. This information is then converted to a set. Subsequently, to emulate the involved actor, we filter the obtained candidate information based on the ground truth QA pair. Formally,

$$\text{candidate\_concepts} = \mathrm{set}([\mathrm{OD}(i); \mathrm{removeStopWords}(\text{caption})]) \tag{4}$$

Here, OD stands for an object detector model applied to image $i$, caption is the output of the captioning model, removeStopWords is a function which removes the stop words from a list, and set is a function which creates a set from the concatenation (the ; operator) of the detected objects and obtained captions. Using this obtained candidate_concepts set, we run our filtration process.
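The construction of candidate_concepts described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the stop-word list, the whitespace tokenisation, and the example detector and captioner outputs are all assumptions made for the sketch.

```python
# Hypothetical stop-word list; the paper does not state which list is used.
STOP_WORDS = {"a", "an", "the", "on", "in", "of", "is", "with", "and"}


def remove_stop_words(tokens):
    """Drop stop words from a token list, preserving order."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def candidate_concepts(object_labels, caption):
    """Union of detected object labels and non-stop-word caption tokens."""
    caption_tokens = remove_stop_words(caption.lower().split())
    # Concatenate the two lists, then deduplicate by converting to a set.
    return set(list(object_labels) + caption_tokens)


# Stand-ins for the object detector and captioning model outputs (assumed).
labels = ["dog", "frisbee", "grass"]
caption = "A dog is catching a frisbee in the park"

concepts = candidate_concepts(labels, caption)
print(sorted(concepts))
```

Because the result is a set, duplicates such as "dog" (present in both the labels and the caption) appear only once, and no ordering is implied.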

Once the set of candidate concepts has been constructed, we filter them to only retain concepts relevant to the target QA pair. After removing stop words and applying the set function to the words in the QA pair, we obtain embeddings for the candidate QA pair and the previous candidate_concepts (Eq. 4). We subsequently compute the cosine similarity between the two embedding matrices, and then select the top $k$ most similar concepts. The chosen concepts are always a strict subset of the candidate concepts that are retrieved using automated image captioning or object detection. This process emulates the selection of objects an actor would make in an inference setting when given a choice of possible concepts, and creates paired data for the guided VQG task.
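As a rough illustration of this filtering step, the sketch below ranks candidate concepts by cosine similarity to the QA words and keeps the top k. The three-dimensional vectors are toy stand-ins for real sentence embeddings, and scoring each concept by its best match over the QA words is an assumption of this sketch; the text only states that a similarity matrix and a top-k selection are used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_concepts(concept_vecs, qa_vecs, k):
    """Keep the k concepts with the highest similarity to any QA word."""
    scores = {
        concept: max(cosine(vec, q) for q in qa_vecs.values())
        for concept, vec in concept_vecs.items()
    }
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return ranked[:k]


# Toy embeddings (assumed values, not produced by any real model).
concept_vecs = {
    "dog": [1.0, 0.0, 0.0],
    "frisbee": [0.0, 1.0, 0.0],
    "grass": [0.0, 0.0, 1.0],
    "park": [0.0, 0.2, 0.9],
}
qa_vecs = {
    "animal": [0.9, 0.1, 0.0],
    "catching": [0.1, 0.9, 0.1],
}

selected = filter_concepts(concept_vecs, qa_vecs, k=2)
print(selected)
```

With these toy vectors, "dog" and "frisbee" align closely with the QA words while "grass" and "park" do not, so only the first two survive the filter.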

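We use Sentence-BERT (Reimers and Gurevych, 2019) to embed our inputs, as we require one embedding vector per object label to compute the cosine similarity. Using the similarity matrix between the embeddings of the candidate concepts and of the QA pair, we use a top_k function to pick the most relevant tokens from candidate_concepts via the similarity scores from the matrix, giving us the filtered concepts $S_{\text{filtered}}$. We now concatenate an answer category to $S_{\text{filtered}}$:

$$S = [\text{category}; S_{\text{filtered}}]$$

With this text input $S$ encoded by the PLM, we run the model, optimizing the negative log likelihood between the predicted question $\hat{q}$ and the ground truth $q$. Note that the concatenation in the decoder below is along the sequence axis (resulting in a single tensor whose sequence length is the sum of the text and image sequence lengths):

$$\hat{q} = \mathrm{Decoder}([\mathrm{PLM}(S); X_{\text{img}}]), \qquad \mathcal{L} = -\log p(q \mid S, X_{\text{img}})$$

where $X_{\text{img}}$ is the output of the image encoder.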

3.2 Implicit Guiding

We now introduce the implicit variant for VQG. This variant differs from its explicit counterpart as it aims to generate questions using only images as the input, while internally learning to predict the relevant category and objects.

Given an image, we apply the same object detection model as in the explicit variant to extract object labels, which are then encoded using an embedding layer. The representations of the object labels are then combined with the object features for subsequent learning. Formally,


Since we would like the implicit model to learn relevant objects to the questions internally, we incorporate QA pairs to learn a latent variational distribution over the objects. However, since QA pairs can only be used during training to learn a variational distribution, we introduce another generative distribution that is only conditioned on the images and extracted objects. We borrow the idea from latent variable models of minimising the Kullback-Leibler (KL) divergence between the variational distribution and the generative distribution, where the variational distribution is used during training and the generative distribution is used for testing. We then apply Gumbel-Softmax (Jang et al., 2017) to sample from the distribution to get relevant objects.

Here, $q_\phi$ denotes the variational distribution; $p_\theta$ denotes the generative distribution; MLP denotes a multilayer perceptron for learning the alignment between objects and QA pairs; and $\odot$ denotes element-wise multiplication.
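To make the two ingredients of this objective concrete, the sketch below computes the KL divergence between a variational and a generative categorical distribution over detected objects, and draws a relaxed one-hot sample with the Gumbel-Softmax trick. It is a simplified, pure-Python illustration: the logits are invented values, the temperature of 0.5 is arbitrary, and modelling the selection as a single categorical over four object slots is an assumption of this sketch rather than a detail confirmed by the text.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel(0, 1) noise) / temperature)."""
    noisy = []
    for x in logits:
        # Clamp the uniform draw so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((x - math.log(-math.log(u))) / temperature)
    return softmax(noisy)


def kl_divergence(q, p, eps=1e-12):
    """KL(q || p) between two categorical distributions."""
    return sum(qi * math.log((qi + eps) / (pi + eps)) for qi, pi in zip(q, p))


rng = random.Random(0)
variational_logits = [2.0, 0.5, -1.0, 0.1]  # assumed: conditioned on image + QA pair
generative_logits = [1.0, 1.0, -0.5, 0.0]   # assumed: conditioned on image only

q_dist = softmax(variational_logits)
p_dist = softmax(generative_logits)
kl = kl_divergence(q_dist, p_dist)
sample = gumbel_softmax(variational_logits, temperature=0.5, rng=rng)

print(f"KL(q || p) = {kl:.4f}")
print("relaxed sample:", [round(s, 3) for s in sample])
```

Lower temperatures push the relaxed sample closer to a hard one-hot vector, while minimising the KL term pulls the image-only distribution towards the one that also saw the QA pair.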

Categories can also be a strong guiding factor and instead of making it an explicit input, we build a classifier to predict possible categories given an image, using ground-truth categories as the supervised learning signal. Specifically, the category classifier is framed as an MLP, taking the representation of object features and labels as input. The output of this component is a softmax distribution over all the possible categories and we use its one-hot representation for question generation.
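The classifier output described above can be illustrated as follows: a softmax over category logits is trained with cross-entropy against the ground-truth category, and its arg-max is converted to a one-hot vector for the question generator. The category names shown are an illustrative subset, and the logits stand in for the output of the MLP; both are assumptions of this sketch.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the ground-truth category."""
    return -math.log(probs[target_index])


def one_hot(probs):
    """One-hot vector marking the most probable category."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return [1 if i == best else 0 for i in range(len(probs))]


# Illustrative subset of the answer categories (the full set has 16 entries).
CATEGORIES = ["count", "color", "object", "attribute"]
logits = [0.1, 1.2, 3.0, -0.5]  # assumed MLP outputs for one image

probs = softmax(logits)
loss = cross_entropy(probs, CATEGORIES.index("object"))
category_vector = one_hot(probs)
predicted = CATEGORIES[category_vector.index(1)]

print(predicted, category_vector, round(loss, 4))
```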


Given an image, our implicit model learns the relevant objects and predicted category internally. The model is trained and optimised as follows,


4 Experiments

4.1 Datasets

We use the VQA v2.0 dataset (Antol et al., 2015), a large dataset consisting of all relevant information for the VQG task, following Krishna et al. (2019). The answers are balanced in order to minimise the effectiveness of dataset priors. This dataset contains over 1.1M questions with 6.5M answers for over 200K images in the MSCOCO 2015 dataset (Lin et al., 2014). We follow the official VQA partition, i.e. 443.8K questions from 82.8K images for training, and 214.4K questions from 40.5K images for validation. We follow the setup of Krishna et al. (2019) and report the performance on the validation set, as the annotated categories and answers for the VQA test set are not available.

We use answer categories from the annotations of Krishna et al. (2019). The top 500 answers in the VQA v2.0 dataset are annotated with a label from the set of 15 possible categories, which covers 82% of the VQA v2.0 dataset; the other answers are treated as an additional category. These annotated answer categories include objects (e.g. “mountain”, “flower”), attributes (e.g. “cold”, “old”), color (e.g. “white”, “blue”), counting (e.g. “5”, “2”), etc.

We report BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), CIDEr (Vedantam et al., 2015), METEOR (Lavie and Agarwal, 2007), MSJ (Montahaei et al., 2019), and BertScore (Zhang et al., 2020) as evaluation metrics. MSJ accounts for both diversity of generated outputs and n-gram overlap with ground truth utterances, and BertScore attempts to quantify similarity of a generated and reference output using contextual embeddings rather than string matching.
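Since several of these metrics rest on n-gram overlap, a small sketch of the clipped n-gram precision at the core of BLEU may help show why the number of references matters. This is only the precision component (no brevity penalty and no geometric mean over n-gram orders), and the example questions are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against one or more references."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    # For each n-gram, the clip limit is its maximum count in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


candidate = "what color is the dog".split()
ref_a = "what color is the frisbee".split()
ref_b = "what is the dog doing".split()

single = modified_precision(candidate, [ref_a], n=2)
multi = modified_precision(candidate, [ref_a, ref_b], n=2)
print(single, multi)
```

Here the bigram "the dog" is absent from the first reference but present in the second, so adding a reference raises the bigram precision. The same effect is why a valid question about an image can be penalised when only one human reference is available.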

4.2 Comparative Approaches

We compare our models with four recently proposed VQG models: Information Maximising VQG (IMVQG) (Krishna et al., 2019), What BERT Sees (WBS) (Scialom et al., 2020), Deep Bayesian Network (DBN) (Patro et al., 2020), and Category Consistent Cyclic VQG (C3VQG) (Uppal et al., 2020). Out of these four papers, IMVQG’s training and evaluation setup is the most similar to ours. They also report the widest range of metrics and hold the current SOTA in VQG for realistic inference regimes. They provide two variants of their model: one variant, z-path, receives an unrealistic input (the full target answer) at inference time. The other variant, t-path, achieves a weaker score, but overcomes the restriction by employing the use of an answer category instead. Our baseline is an image-only model, without other guiding information or latent variables.

During inference, an actor can select up to $k$ concepts from an object detection model run on the image. These concepts are then simply fed to the question generator to generate a question. We provide run-through examples of how the training, testing and inference processes look in Appendix A.

4.3 Implementation Details

In Section 3 we described the shared aspects of our model variants. The reported scores in Section 5 use the same hyperparameters: a batch size of 128, and a learning rate of 3e-5 with an Adam optimizer. BERT Base (Devlin et al., 2018) serves as our PLM encoder with the standard 12 layers and 768 dimensions. The decoder is also a pre-trained BERT model (following Wolf et al. (2020); Scialom et al. (2020)). Though typically not used for decoding, by concatenating the encoder inputs with a [MASK] token and feeding this to the decoder model, we are able to obtain an output token. This decoded output is concatenated with the original input sequence, and once again fed to the decoder to sample the next token. Thus, we use the BERT model as a decoder in an auto-regressive fashion. To encode the images based on the Faster-RCNN object features (Ren et al., 2015; Anderson et al., 2018), we use a Transformer with 6 layers and 8 heads. The dimensionality of this model is also 768. Empirically, for both variants, we find to be the best number of sampled objects. All experiments are run with early stopping (patience 10; training iterations capped at 35000) on the BLEU-4 metric. Scores reported (in Section 5) are from the highest performing checkpoint. We use the PyTorch library and train our model on a V100 GPU (1.5 hours per epoch).
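The [MASK]-based decoding loop described above can be sketched as follows. The predictor here is a toy stub that replays a fixed script, standing in for a masked language model; the function names, the greedy token choice, and the use of [SEP] as a stop token are assumptions of this sketch, not details taken from the paper.

```python
def decode_with_mask(context, predict_masked, max_len=20, stop_token="[SEP]"):
    """Auto-regressive decoding with a masked LM.

    At every step a [MASK] token is appended to the context plus the tokens
    generated so far, and the model's prediction for that position is kept.
    """
    generated = []
    for _ in range(max_len):
        model_input = context + generated + ["[MASK]"]
        token = predict_masked(model_input)
        if token == stop_token:
            break
        generated.append(token)
    return generated


def make_toy_predictor(context_len, script):
    """Hypothetical stand-in for BERT: replays a fixed script, then stops."""
    def predict(tokens):
        # Number of tokens generated so far (input = context + generated + [MASK]).
        step = len(tokens) - context_len - 1
        return script[step] if step < len(script) else "[SEP]"
    return predict


context = ["color", "dog", "frisbee"]  # e.g. category plus selected concepts
script = ["what", "color", "is", "the", "dog", "?"]

question = decode_with_mask(context, make_toy_predictor(len(context), script))
print(" ".join(question))
```

In a real system `predict_masked` would run the encoder outputs and the partial question through the model and pick (or sample) a token from the distribution at the masked position.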

5 Results

We present quantitative results in Tables 1 and 2, along with qualitative results in Figure 1. We evaluate the explicit model in a single-reference setup, as the chosen input concepts are meant to guide the model output towards one particular target reference. In contrast, the implicit model is evaluated in a multi-reference setup, as it receives no direct guidance and can generate any valid question about the image.

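| Group | Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | METEOR | ROUGE | MSJ-3 | MSJ-4 | MSJ-5 | BertScore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Comparative | IMVQG (z-path) | 50.1 | 32.3 | 24.6 | 16.3 | 94.3 | 20.6 | 39.6 | 47.2 | 38.0 | 31.5 | 91.0 |
| Comparative | IMVQG (t-path) | 47.4 | 29.0 | 19.9 | 14.5 | 86.0 | 18.4 | 38.4 | 53.8 | 44.1 | 37.2 | 90.6 |
| Comparative | WBS | 42.1 | 22.4 | 14.1 | 9.2 | 60.2 | 14.9 | 29.1 | 63.2 | 55.7 | 49.7 | 91.9 |
| Comparative | DBN | 40.7 | - | - | - | - | 22.6 | - | - | - | - | - |
| Comparative | C3VQG | 41.9 | 22.1 | 15.0 | 10.0 | 46.9 | 13.6 | 42.3 | - | - | - | - |
| Comparative | image-only | 25.9 | 15.9 | 9.8 | 5.9 | 41.4 | 13.5 | 27.8 | 52.2 | 42.8 | 36.0 | - |
| Explicit | image-category | 40.8 | 29.9 | 22.5 | 17.3 | 131 | 20.8 | 43.0 | 64.2 | 55.5 | 48.8 | - |
| Explicit | image-objects | 34.7 | 25.0 | 19.1 | 15.0 | 130 | 19.4 | 36.9 | 67.4 | 59.2 | 52.7 | - |
| Explicit | image-guided | 46.0 | 35.9 | 28.9 | 23.6 | 206 | 25.0 | 48.5 | 71.6 | 63.7 | 57.3 | 93.0 |
| Explicit | image-guided-random | 23.6 | 12.1 | 5.75 | 2.39 | 17.6 | 10.8 | 24.2 | 62.3 | 52.6 | 45.0 | - |
| Implicit | image-category | 28.2 | 17.1 | 12.1 | 9.1 | 50.5 | 14.0 | 28.4 | 51.8 | 42.9 | 36.4 | - |
| Implicit | image-objects | 33.0 | 21.2 | 16.1 | 11.2 | 60.2 | 15.6 | 31.4 | 53.7 | 45.3 | 39.1 | - |
| Implicit | image-guided | 33.9 | 23.5 | 16.8 | 12.6 | 63.6 | 16.7 | 33.7 | 55.8 | 48.0 | 41.9 | 92.3 |

Table 1: Single reference evaluation results. “*-guided” refers to the combination of category and objects. In the explicit variant only, objects refers to the subset of detected objects and caption keywords, filtered on the target QA pair. z-path is an unrealistic inference regime, using answers as input for question generation. WBS scores are from single reference evaluation based on the VQA1.0 pre-trained “Im. + Cap.” model provided by the authors.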

5.1 Quantitative Results

Starting with the explicit variant, as seen in Table 1, we note that our image-only baseline model achieves a BLEU-4 score of 5.95. We test our model with different combinations of text features to identify which textual input is most influential to the reported metrics. We notice that the category is the most important text input with respect to improving the score of the model, raising the BLEU-4 score by more than 11 points (image-category) over the aforementioned baseline. However, whilst the BLEU-4 for the image-objects variant is 2.3 points lower, it outperforms the image-category variant by 3.9 points on the diversity-oriented metric MSJ-5 - indicating that the image-category variant creates more generic questions. As expected, the inclusion of both the category and objects (image-guided) outperforms either of the previously mentioned models, achieving a new state-of-the-art result of 23.6 BLEU-4. This combination also creates the most diverse questions, with an MSJ-5 of 57.3.

We also test our hypothesis that guiding produces questions that are relevant to the fed-in concepts. This is tested with the ‘image-guided-random’ variant. This variant is the same trained model as ‘image-guided’, but uses random concepts from a respective image instead of using the ground truth question to generate concepts. Our results show that guiding is an extremely effective strategy to produce questions related to conceptual information, with a BLEU-4 score difference of over 20 points. We refer the reader to Section 5.3 for human evaluation which again validates this hypothesis, and Section 3.1 for an explanation of why guiding is valid for evaluating VQG models.
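The differences quoted in the two paragraphs above can be rechecked directly from the explicit-variant rows of Table 1. The snippet below is only a sanity check of that arithmetic; the values are copied from the table (where the image-only BLEU-4 is rounded to 5.9), and the variable names are introduced here for illustration.

```python
# BLEU-4 and MSJ-5 values copied from Table 1 (explicit variant and baseline).
bleu4 = {
    "image-only": 5.9,
    "image-category": 17.3,
    "image-objects": 15.0,
    "image-guided": 23.6,
    "image-guided-random": 2.39,
}
msj5 = {"image-category": 48.8, "image-objects": 52.7}

# Gain from adding the category to the image-only baseline.
category_gain = round(bleu4["image-category"] - bleu4["image-only"], 2)
# BLEU-4 drop when using objects instead of the category.
objects_drop = round(bleu4["image-category"] - bleu4["image-objects"], 2)
# MSJ-5 (diversity) advantage of objects over the category.
diversity_gain = round(msj5["image-objects"] - msj5["image-category"], 2)
# Effect of replacing QA-filtered concepts with random ones.
guiding_gap = round(bleu4["image-guided"] - bleu4["image-guided-random"], 2)

print(category_gain, objects_drop, diversity_gain, guiding_gap)
```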

Intuitively, the implicit model is expected to perform worse than the explicit model in terms of the language generation metrics, due to the inherently large entropy of the relevant answer category and the objects given an image. However, the learned generative distributions over the categories and objects can capture the relevant concepts of different images, which in turn benefits the question generation when compared with image-only. Compared to the results from the explicit model, the questions from the implicit model are somewhat less diverse. However, the implicit model has higher diversity than the image-only baseline, showing the benefit of learning the relevant concepts internally.

According to Table 1, with the introduction of an answer category, the proposed implicit model beats the image-only baseline, and its score is on par with the best performing WBS model (Scialom et al., 2020), which also generates questions without explicit guiding information. Interestingly, the most influential textual input differs between our implicit and explicit models. In the explicit model, the category was the most important piece of information, whereas for the implicit model it is the objects. The addition of the objects (image-objects) increases the BLEU-4 score by just over 2 points when compared to the image-category variant, and now outperforms the WBS model. Despite the aforementioned disagreement, we notice two phenomena similar to the explicit model: 1) the inclusion of object information leads to more diverse questions than just the category information; and 2) the combination of the category and objects (image-guided) leads to the best performing model, in this case achieving a BLEU-4 score of 12.6.

The last column in Table 1 reports the F1 BertScore on some comparative approaches and our best performing variants for explicit and implicit. Both our implicit and explicit variants outperform the comparative approaches, with the explicit variant achieving the highest BertScore. This means that our explicit variant produces questions most similar to those in the references.

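| Group | Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | METEOR | ROUGE |
|---|---|---|---|---|---|---|---|---|
| Comparative | WBS | 75.6 | 56.9 | 44 | 33.6 | 45 | 26.8 | 66.7 |
| Comparative | image-only | 69.5 | 56.3 | 44.9 | 34.1 | 34 | 27.7 | 60.1 |
| Implicit | image-category | 72.5 | 57.6 | 46.1 | 35.5 | 46.1 | 28.8 | 66.7 |
| Implicit | image-objects | 73 | 60.6 | 47.4 | 36.1 | 46.4 | 28.4 | 66.1 |
| Implicit | image-guided | 72.6 | 61.3 | 47.5 | 36.6 | 47.8 | 28.9 | 66.9 |

Table 2: Multiple reference evaluation results.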

In Table 2, we evaluate the implicit model in the more suitable multi-reference setting. Our weakest implicit model (image-category) outperforms the strongest model from WBS with respect to BLEU 2, 3, 4 and METEOR. Apart from the ROUGE discrepancy between image-category and image-objects, the different implicit permutations also increase the metrics in a manner consistent with the single-reference setup described above. Our best performing implicit model (image-guided) outperforms both the WBS model and all other implicit variants on every metric except BLEU-1.

Figure 1: Qualitative Examples. The ground truth is the target question for the baseline, implicit and explicit. The examples of implicit and explicit are from the best variant image-guided.

5.2 Qualitative Results

Qualitative results are shown in Figure 1 and Appendix B. Figure 1 depicts how outputs from different model variants compare to ground truth questions. Without any guiding information, the image-only variant is able to decode semantic information from the image, however this leads to generic questions. The implicit variant, for which we also report the predicted category and objects, mostly generates on-topic and relevant questions with respect to the aforementioned. Focusing on the explicit variant, we witness high-quality, interesting, and on-topic questions.

Appendix B depicts how well our explicit image-guided variant handles a random selection of detected objects given the image. This experiment intends to gauge the robustness of the model to detected objects which may fall on the low tail of the human generating question/data distribution. To clarify, humans are likely to ask commonsense questions which generally focus on obvious objects in the image. By selecting objects at random for the question to be generated on, the model has to deal with both object permutations possibly not seen during training, as well as category information which may not be valid for an image.

Analysing the outputs, when viable categories and objects that are expected to fall in a commonsense distribution are sampled, the model can generate high quality questions. Interestingly, we observe that when the sampled objects are not commonsense (e.g. “ears arms” for the baby and bear picture), the model falls back to using the object features instead of the guiding information. This phenomenon is also witnessed when the sampled category does not make sense for the image (e.g. category ‘animal’ in image 531086). Despite the category mismatch, the model still successfully uses the object information to conditionally decode a question.

5.3 Human Evaluation

| | Baseline | Implicit | Explicit |
|---|---|---|---|
| Experiment 1 | 34.3% ± 0.1 | 36.7% ± 0.08 | 44.9% ± 0.08 |
| Experiment 2 | 95.9% ± 0.03 | 89% ± 0.09 | 93.5% ± 0.06 |
| Experiment 3 | - | - | 77.6% ± 0.09 |
| Experiment 4 | - | - | 74.1% / 40.0% ± 0.07 / 0.18 |

Table 3: Human evaluation results (and standard dev.)

We ask seven humans across four experiments to evaluate the generative capabilities of our models. Experiment 1 is a visual Turing test: given an image, a model-generated question and a ground truth question, we ask a human to determine which question they believe is model-generated. Experiment 2 attempts to discern the linguistic and grammatical capabilities of our model by asking a human to make a binary choice about whether the generated question seems natural. Experiment 3 shows a human an image alongside a model-generated question (explicit variant). Then, we ask the human to make a choice about whether the generated question is relevant to the image (i.e. could an annotator have feasibly asked this question during a data collection process). Finally, experiment 4 judges whether objects are relevant to a generated question. The experiment is set up with true-pairs and adversarial-pairs. True-pairs are samples where the shown objects are the ones used to generate the question. Adversarial-pairs show a different set of objects than those which generated the question. If more true-pairs are marked correct (i.e. if at least one of the objects is relevant to the generated question) than adversarial-pairs, then our model successfully generates questions based on the guiding information.

In experiment 1, a model generating human-level questions should be expected to score 50%, as a human would not be able to reliably distinguish them from the manually created questions. Our results show the explicit model outperforming the implicit and baseline variants, fooling the human around 45% of the time. Whilst not yet at the ideal 50%, the explicit approach provides a promising step towards passing the visual Turing test. Experiment 2 evaluates the grammatical plausibility of the generated questions. In general, all models perform extremely well in this experiment, with the baseline variant generating grammatically correct sentences approximately 96% of the time. This is expected, as the baseline typically falls back to decoding easy/generic questions. Experiment 3 is evaluated on our best performing model (explicit image-guided). Here, approximately 78% of the generated questions are marked as relevant/on-topic given an image. Finally, experiment 4’s results show true-pairs marked as correct vs adversarial-pairs (incorrectly) marked as correct. Since the former is larger than the latter (72% vs 42%), the model can successfully use guiding/object information to create on-topic questions.
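The comparison in experiment 4 can be illustrated with a short sketch. The paper does not spell out how the percentages and standard deviations in Table 3 are aggregated, so the code below assumes one plausible scheme: compute the fraction of positive judgements per annotator, then report the mean and the population standard deviation across annotators. The judgement lists are made-up toy values, not the study's data.

```python
from statistics import mean, pstdev


def positive_rate(judgements):
    """Fraction of binary judgements (1 = marked correct) that are positive."""
    return sum(judgements) / len(judgements)


def summarise(per_annotator_judgements):
    """Mean and population std of the per-annotator positive rates (assumed scheme)."""
    rates = [positive_rate(j) for j in per_annotator_judgements]
    return mean(rates), pstdev(rates)


# Toy data: one list per annotator, one binary judgement per shown sample.
true_pairs = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
adversarial_pairs = [[1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0]]

true_mean, true_std = summarise(true_pairs)
adv_mean, adv_std = summarise(adversarial_pairs)

# The model is judged to use its guiding information when true-pairs are
# marked correct more often than adversarial-pairs.
uses_guidance = true_mean > adv_mean
```

The same `summarise` helper would apply to experiments 1 to 3, where each judgement is a single binary choice per sample.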

6 Conclusions

In this paper, we presented a guided approach to visual question generation (VQG), which allows for the generation of questions that focus on specific chosen aspects of the input image. We introduced two approaches for this task, the explicit and the implicit variant. The former generates questions based on an explicit answer category and a set of concepts from the image. In contrast, the latter predicts these concepts internally using discrete latent variables, receiving only the image as input. Both variants achieve SoTA results when evaluated against comparable models. Qualitative evaluation and human-based experiments demonstrate that both variants produce realistic and grammatically valid questions.

References


  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §3.1, §3, §3, §4.3.
  • J. Aneja, H. Agrawal, D. Batra, and A. Schwing (2019) Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning. Proceedings of the IEEE International Conference on Computer Vision 2019-October, pp. 4260–4269. External Links: Link Cited by: §2.2.
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), Cited by: §1, §2.1, §4.1.
  • L. Barrault, F. Bougares, L. Specia, C. Lala, D. Elliott, and S. Frank (2018) Findings of the third shared task on multimodal machine translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Belgium, Brussels, pp. 304–323. External Links: Link, Document Cited by: §1.
  • S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio (2015) Generating Sentences from a Continuous Space. CoNLL 2016 - 20th SIGNLL Conference on Computational Natural Language Learning, Proceedings, pp. 10–21. External Links: Link Cited by: §2.2.
  • Y. Burda, R. B. Grosse, and R. Salakhutdinov (2016) Importance weighted autoencoders. In ICLR (Poster), External Links: Link Cited by: §1.
  • O. Caglayan, P. Madhyastha, L. Specia, and L. Barrault (2019) Probing the need for visual context in multimodal machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4159–4170. External Links: Link, Document Cited by: §1.
  • M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara (2019) Meshed-Memory Transformer for Image Captioning. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 10575–10584. External Links: Link Cited by: §3.
  • Y. Deng, Y. Kim, J. Chiu, D. Guo, and A. Rush (2018) Latent alignment and variational attention. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31, pp. . External Links: Link Cited by: §2.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. External Links: Link Cited by: §2.1, §3, §4.3.
  • D. Elliott, S. Frank, L. Barrault, F. Bougares, and L. Specia (2017) Findings of the second shared task on multimodal machine translation and multilingual image description. In Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark, pp. 215–233. External Links: Link, Document Cited by: §1.
  • Z. Fan, Z. Wei, S. Wang, Y. Liu, and X. Huang (2018) A Reinforcement Learning Framework for Natural Question Generation using Bi-discriminators. Technical report Cited by: §2.1.
  • M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley (2013) Stochastic variational inference. Journal of Machine Learning Research 14 (4), pp. 1303–1347. External Links: Link Cited by: §1.
  • Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing (2017) Toward Controlled Generation of Text. Technical report Cited by: §2.2.
  • U. Jain, Z. Zhang, and A. Schwing (2017) Creativity: Generating Diverse Questions using Variational Autoencoders. Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 2017-January, pp. 5415–5424. External Links: Link Cited by: §2.1, §2.2.
  • E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with gumbel-softmax. External Links: Link Cited by: §3.2.
  • A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings, Cited by: §1, §2.2.
  • D. P. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling (2014) Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27, pp. . External Links: Link Cited by: §1.
  • Krasser and Stumpf (2020) Fairseq-image-captioning. GitHub. Cited by: §3.
  • R. Krishna, M. Bernstein, and L. Fei-Fei (2019) Information Maximizing Visual Question Generation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2019-June, pp. 2008–2018. External Links: Link Cited by: Appendix A, §1, §2.1, §3, §3, §4.1, §4.1, §4.2.
  • A. Lavie and A. Agarwal (2007) METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic, pp. 228–231. External Links: Link Cited by: §4.1.
  • X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, Y. Choi, and J. Gao (2020) Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 12375 LNCS, pp. 121–137. External Links: Link Cited by: §3.1, §3.
  • Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, X. Wang, and M. Zhou (2018) Visual question generation as dual task of visual question answering. CVPR. External Links: Link Cited by: §1.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81. External Links: Link Cited by: §4.1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.1.
  • Z. Lin, G. I. Winata, P. Xu, Z. Liu, and P. Fung (2020) Variational Transformers for Diverse Response Generation. arXiv. External Links: Link Cited by: §2.2.
  • Y. Miao and P. Blunsom (2016) Language as a latent variable: discrete generative models for sentence compression. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 319–328. External Links: Link, Document Cited by: §2.2.
  • Y. Miao, L. Yu, and P. Blunsom (2016) Neural variational inference for text processing. In Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger (Eds.), Proceedings of Machine Learning Research, Vol. 48, New York, New York, USA, pp. 1727–1736. External Links: Link Cited by: §1, §2.2.
  • A. Mnih and K. Gregor (2014) Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning, E. P. Xing and T. Jebara (Eds.), Proceedings of Machine Learning Research, Vol. 32, Beijing, China, pp. 1791–1799. External Links: Link Cited by: §1.
  • E. Montahaei, D. Alihosseini, and M. S. Baghshah (2019) Jointly measuring diversity and quality in text generation models. External Links: 1904.03971 Cited by: §4.1.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation Learning with Contrastive Predictive Coding. External Links: Link Cited by: §2.2.
  • L. Pan, W. Lei, T. Chua, and M. Kan (2019) Recent Advances in Neural Question Generation. External Links: Link Cited by: §1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318. External Links: Link, Document Cited by: §4.1.
  • B. N. Patro, S. Kumar, V. K. Kurmi, and V. P. Namboodiri (2018) Multimodal differential network for visual question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, pp. 4002–4012. External Links: ISBN 9781948087841, Document Cited by: §2.1.
  • B. N. Patro, V. K. Kurmi, S. Kumar, and V. P. Namboodiri (2020) Deep Bayesian Network for Visual Question Generation. Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020, pp. 1555–1565. External Links: Link Cited by: §2.1, §4.2.
  • N. Reimers and I. Gurevych (2019) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, pp. 3982–3992. External Links: Link Cited by: §3.1.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Technical report Vol. 28. External Links: Link Cited by: §4.3.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, E. P. Xing and T. Jebara (Eds.), Proceedings of Machine Learning Research, Vol. 32, Beijing, China, pp. 1278–1286. External Links: Link Cited by: §1.
  • T. Scialom, P. Bordes, P. Dray, J. Staiano, and P. Gallinari (2020) What BERT Sees: Cross-Modal Transfer for Visual Question Generation. External Links: Link Cited by: §1, §2.1, §4.2, §4.3, §5.1.
  • L. Specia, S. Frank, K. Sima’an, and D. Elliott (2016) A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, Berlin, Germany, pp. 543–553. External Links: Link, Document Cited by: §1.
  • S. Uppal, A. Madan, S. Bhagat, Y. Yu, and R. R. Shah (2020) C3VQG: Category Consistent Cyclic Visual Question Generation. arXiv. External Links: Link Cited by: §2.1, §4.2.
  • A. van den Oord, O. Vinyals, and K. Kavukcuoglu (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. . External Links: Link Cited by: §2.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention Is All You Need. External Links: Link Cited by: §3.
  • R. Vedantam, C. L. Zitnick, and D. Parikh (2015) CIDEr: consensus-based image description evaluation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4566–4575. Cited by: §4.1.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164. Cited by: §1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. External Links: Link Cited by: §3, §4.3.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 2048–2057. External Links: Link Cited by: §1.
  • S. Zhang, L. Qu, S. You, Z. Yang, and J. Zhang (2016) Automatic Generation of Grounded Visual Questions. IJCAI International Joint Conference on Artificial Intelligence, pp. 4235–4243. External Links: Link Cited by: §1, §2.1.
  • T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2020) BERTScore: evaluating text generation with bert. In International Conference on Learning Representations, External Links: Link Cited by: §4.1.

Appendix A Training, testing and inference

Here, using an example, we clarify the inputs to our explicit model (Section 3.1) in the training, testing and inference setups.


  • Ground truth question: What is the labrador about to catch?

  • Answer: Frisbee

  • Category: Object

  • Image:

  • {Caption}: A man throwing a frisbee to a dog

  • {Objects}: person, dog, frisbee, grass

N.B. {Caption} and {Objects} are both model generated, requiring only an image as input. These inputs are thus available at inference time.

Firstly, we create a set of candidate_concepts (see eq. 4) from the caption and objects: [person, dog, frisbee, grass, man, throwing]. These words are individually embedded. Secondly, we concatenate and embed the set of question and answer tokens.

Then, we construct a matrix which gives us cosine similarity scores for each candidate_concepts token to a QA token. We choose the tokens from candidate_concepts which are most similar to the words from the QA pair. Here, “dog” and “frisbee” are likely chosen. Our input to the model is then <, “object”, “dog”, “frisbee”>.
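The selection procedure above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the hand-written 4-dimensional vectors in `TOY_EMBEDDINGS` stand in for a learned word or sentence embedding, the `STOPWORDS` list is hypothetical, and scoring each candidate by its maximum similarity to any QA token (with `k=2` concepts kept) is one plausible reading of "most similar".

```python
import math

# Hypothetical stopword list and toy embeddings, used only for illustration.
STOPWORDS = {"a", "an", "the", "to", "of", "is", "what", "about"}
TOY_EMBEDDINGS = {
    "person": (1.0, 0.0, 0.0, 0.0),
    "man": (0.9, 0.0, 0.1, 0.0),
    "dog": (0.0, 1.0, 0.0, 0.0),
    "labrador": (0.1, 0.9, 0.1, 0.0),
    "frisbee": (0.0, 0.0, 1.0, 0.0),
    "catch": (0.0, 0.2, 0.9, 0.0),
    "throwing": (0.4, 0.0, 0.6, 0.0),
    "grass": (0.0, 0.0, 0.0, 1.0),
}


def tokenize(text):
    """Lowercase, split on whitespace, and strip trailing punctuation."""
    return [w.strip("?.,!").lower() for w in text.split()]


def build_candidate_concepts(objects, caption):
    """Union of detected objects and caption content words, order-preserving."""
    seen, concepts = set(), []
    for word in list(objects) + tokenize(caption):
        if word and word not in STOPWORDS and word not in seen:
            seen.add(word)
            concepts.append(word)
    return concepts


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_concepts(candidates, qa_tokens, k=2):
    """Score each candidate by its best cosine match against any QA token."""
    # Rows are candidate concepts, columns are QA tokens.
    matrix = [
        [cosine(TOY_EMBEDDINGS[c], TOY_EMBEDDINGS[t]) for t in qa_tokens]
        for c in candidates
    ]
    scores = {c: max(row) for c, row in zip(candidates, matrix)}
    return sorted(candidates, key=lambda c: scores[c], reverse=True)[:k]


candidates = build_candidate_concepts(
    ["person", "dog", "frisbee", "grass"], "A man throwing a frisbee to a dog"
)
qa_tokens = [
    t
    for t in tokenize("What is the labrador about to catch? Frisbee")
    if t not in STOPWORDS and t in TOY_EMBEDDINGS
]
selected = select_concepts(candidates, qa_tokens)
```

With these toy vectors, “frisbee” matches the answer token exactly and “dog” lies close to “labrador”, so both are selected ahead of “throwing”, “man”, “person” and “grass”. A real system would replace `TOY_EMBEDDINGS` with a trained embedding model.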

Notice that it is possible for these words to be in the QA pair (e.g. “frisbee”). Importantly, these words have not been fed in from the QA pair: they have been fed in from model-obtained concepts ({Objects} and {Caption}). In a philosophically similar approach, Krishna et al. (2019) constructed inputs based on target information for use in training and benchmarking.

Testing. Imagine a data labeler creating questions based on an image. They would look at the image, and decide on the concepts to create the question for. Our testing methodology follows this intuition using the strategy outlined above: the objects selected from candidate_concepts are a programmatic attempt at selecting concepts which could generate the target question. Note that there can be many questions generated for a subset of concepts (e.g. ‘is the dog about to catch the frisbee?’, ‘what is the flying object near the dog?’ etc.). As outlined above, we are not taking concepts from the target. Rather, we use information from the target to emulate the concepts an actor would think of to generate the target question. Because questions about a single image can be based on different concepts (see ground-truth questions in Appendix B), our strategy allows us to generate questions which might be similar to a singular target question. This leads to an evaluation which fairly uses the information a human has access to when generating a question.

Inference. However, in the real world, there is no ‘ground-truth’ question. In this case, we simply feed image features and actor-selected concepts to our question generator model. The selected concepts here are a subset of candidate_concepts, which are fully generated from models.
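The subset constraint at inference time can be made concrete with a small sketch. The function name `build_guidance` and the dictionary layout are hypothetical; image features are assumed to be passed to the generator separately and are omitted here.

```python
def build_guidance(category, selected_concepts, candidate_concepts):
    """Validate actor-selected concepts and package the guiding information."""
    # The actor may only pick concepts that the models produced for this image.
    unknown = [c for c in selected_concepts if c not in candidate_concepts]
    if unknown:
        raise ValueError(f"concepts not in candidate_concepts: {unknown}")
    return {"category": category, "concepts": list(selected_concepts)}


# Candidate concepts as in the running example (model-generated, no QA pair).
candidates = ["person", "dog", "frisbee", "grass", "man", "throwing"]
guidance = build_guidance("object", ["dog", "frisbee"], candidates)
```

Rejecting out-of-set concepts keeps inference consistent with training and testing, where guiding concepts always originate from the caption and object models.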

Appendix B More Qualitative Examples

Figure 2: Qualitative outputs from explicit variant being fed random guiding information. Failure cases are also shown.