C3VQG: Category Consistent Cyclic Visual Question Generation

05/15/2020 ∙ by Shagun Uppal, et al. ∙ IIIT Delhi

Visual Question Generation (VQG) is the task of generating natural questions based on an image. Popular methods in the past have explored image-to-sequence architectures trained with maximum likelihood, which often lead to generic questions. While generative models try to exploit more concepts in an image, they still require ground-truth questions and answers (and, in some cases, categories). In this paper, we exploit the different visual cues and concepts in an image to generate questions using a variational autoencoder (VAE) without the need for ground-truth answers. We thereby address two shortcomings of current VQG approaches: we minimize the level of supervision and replace generic questions with category-relevant generations. Specifically, we eliminate the need for expensive answer annotations, weakening the supervision required for this task, and use question categories instead. Using different categories enables us to exploit different concepts, as inference requires only the image and the category. We maximize the mutual information between the image, question, and question category in the latent space of our VAE. We also propose a novel category-consistent cyclic loss that encourages the model to generate predictions consistent with the question category, reducing redundancies and irregularities. Additionally, we impose supplementary constraints on the latent space of our generative model to provide structure based on categories and to enhance generalization by encapsulating decorrelated features within each dimension. Finally, we compare our qualitative as well as quantitative results to the state-of-the-art in VQG.




1. Introduction

Visual understanding by intelligent systems is a long-standing problem in the Computer Vision community, and progress on it has been accelerated by the advent of Deep Learning. Humans tend to develop different concepts about visual data depending on context, and researchers have tried to replicate this behavior in intelligent systems such as conversational agents. Translating this visual understanding into language helps us evaluate the "intelligence" of the system, and tasks like Visual Question Answering (VQA) (Agrawal et al., 2015; Malinowski and Fritz, 2014; Zhu et al., 2015), Visual Question Generation (VQG) (Mostafazadeh et al., 2016), and Video Captioning (Chen et al., 2019) help us benchmark it. Such tasks require us to learn multimodal representations from visual and language data. VQG is a much more open-ended task than VQA: an image contains many concepts, and asking semantically coherent and visually relevant questions requires a system to recognize those concepts, whereas in VQA we are given a reference question to answer. As Figure 1 shows, the various semantics of an image are captured via broad categories, under which we list natural questions that could arise from looking at the image. We aim to generate natural questions like these using the mentioned categories, so as to draw out useful information from the image. Hence, the ability to form natural questions based on visual data is of great significance in intelligent systems. Figure 2 illustrates the key differences between the two tasks of VQA and VQG.

Figure 1. An example image showing the various natural questions possible which belong to the broad categories mentioned. The categories are not too specific so as to overly-constrain the network but are broad enough to encourage discovery of novel concepts.
Figure 2. Comparison of the Visual Question Answering and Visual Question Generation tasks. (Top) A toy architecture showing the workings of VQA: it takes an image and a question at inference time and generates an answer to that question. At training time, it usually requires ⟨image, question, expected answer⟩ triples. (Bottom-left) A strongly supervised VQG system requires an image and an answer as input and generates a question, whereas (Bottom-right) in a weakly supervised setting we can provide a category label instead of an answer, thus reducing the supervision. We propose a solution for the weakly supervised setting and also contrast it with the results of strongly supervised approaches.

There are many challenges in constructing a system for VQG: (1) images contain various abstract and hidden concepts, (2) the generated questions need to be relevant to the image, (3) the question-to-image relation is many-to-one, since multiple questions are possible for a single image, and (4) the system should avoid questions that invoke generic answers such as "yes" or "I don't know". Asking meaningful questions can help improve the reasoning capability of the system. VQG has also been referred to as a realization of the Visual Curiosity (Yang et al., 2018) of a system.

Previous studies (Mostafazadeh et al., 2016; Zhang et al., 2016; Jain et al., 2017) have explored VQG using only images, without conditioning the generated questions on an answer. Because the problem is then unconstrained, such methods tend to generate open-ended questions that might not be relevant to the image. Other work (Li et al., 2017; Liu et al., 2018; Xu et al., 2018; Krishna et al., 2019) uses both answers and the image to generate relevant questions. While this approach asks questions relevant to the image (because the answer is provided), it tends to overfit to the provided answer and leaves little room for generating questions about diverse concepts in the image, which restricts the many-to-one relation between questions and an image. It also requires the dataset to be annotated with answers as well as questions, which is an expensive and tedious operation. Works such as (Li et al., 2017; Wang et al., 2017) view VQG as a dual of the VQA task or propose a joint model for training the QA and QG tasks. Because VQG needs to be more open-ended than QA, training both tasks in a similar fashion does not lead to the discovery of new visual concepts in images.

Krishna et al. (Krishna et al., 2019) propose a middle ground between the previously mentioned approaches: a generative modelling approach based on a variational autoencoder framework that maximizes the mutual information between images, questions and answers. During inference, they require only images and answer categories, removing the need for answers. However, their approach still uses answers during training.

While current works rely heavily on the availability of question-answer pairs for their method, we propose using only categories which act as a weaker form of supervision, are easy to obtain and can help in exploring various concepts in an image. This also helps generate relevant questions to the image as compared to methods which simply generate questions based on an image leading to non-diverse and often not meaningful questions. Our main contributions of the paper can be summarised as follows:

  • We adopt a variational autoencoder (Kingma and Welling, 2013) framework to generate questions. It consists of a single combined latent space for image and category embeddings and also maximizes the mutual information between them.

  • We weaken the amount of supervision on the model by removing the need for ground-truth answers during the training phase. This makes our approach easier to generalize and waives the requirement that answers be available in the dataset.

  • We introduce additional constraints to enforce answer category consistency by utilizing a cyclic training procedure with sequential training in two disjoint steps.

  • We enforce center loss on the generative latent space in order to ensure clustering with respect to the answer category labels.

  • We also introduce a hyper-prior on the variance of the variational latent prior to capture intrinsically independent visual features within the combined latent space.

Together, the above contributions ensure that we get a diverse and relevant set of questions given an image and a category. We evaluate our results alongside approaches that do not use answers for generating questions as well as those that require them.

The rest of the paper is organized as follows: In Section 2, we discuss the previous works on visual question generation and structured latent space constraints. We present our approach and details of our model in Section 3. In Section 4, we provide details about the experimental setup, evaluation metrics and discuss our qualitative and quantitative results. We present our conclusion in Section 5.

2. Related Works

In this section, we discuss relevant literature that motivates key components of the C3VQG approach. We also illustrate the works that have focused on developing structured latent representations for a diverse set of down-stream tasks.

2.1. Visual Question Answering and Visual Question Generation

Visual Question Generation (VQG) is the task of developing visual understanding from images, using cues from ground-truth answers and/or answer categories, in order to generate relevant questions. Various works in this area take into consideration the multimodal context of natural language along with a visual understanding of the input.

Mostafazadeh et al. (Mostafazadeh et al., 2017) proposed generating relevant questions as well as responses, given an image along with the relevant conversational dialogue. With the help of the dialogue, they drew broad context about the conversation from the input image. Mostafazadeh et al. (Mostafazadeh et al., 2016) focus on a different paradigm of VQG, in which the goal is to generate more engaging questions that require high-level common-sense reasoning about the image or the event highlighted in it. This approach shifted the focus from the objects constituting the image to the visual understanding of these systems.

Yang et al. (Yang et al., 2015) simultaneously learned VQG and VQA models to understand the semantics and entities present in the input image. The former is trained using RNNs while CNNs are used for the latter. Such an approach examines and trains the learning model on both natural language and vision, thereby challenging its interpretability over multimodal signals. Li et al. (Li et al., 2017) took a similar approach of training VQA and VQG networks in parallel, introducing an Invertible Question-Answering network. Such a model takes advantage of the question-answer dependencies during training; at evaluation time it takes a question or an answer as input and outputs its counterpart.

Zhang et al. (Zhang et al., 2016) addressed automating VQG not only with high correctness but also with high diversity in the types of questions generated. To this end, they take an image and its caption, generated using a dense captioning module, as input, and use an LSTM-based classifier to select the question type. The question type, the input image and caption, and an image-caption correlation output are then processed to produce relevant questions. Along similar lines, Jain et al. (Jain et al., 2017) worked on generating a wide variety of questions from a single image, but with generative modelling: they combined variational autoencoders with LSTM networks to generate a diverse set of questions from a single input image.

2.2. Structured Latent Space Constraints

2.2.1. Center Loss for Learning Discriminative Latent Features

Center loss (Wen et al., 2016), which enforces well-clustered latent space representations, has been studied extensively in the past, particularly for biometric applications (Wen et al., 2016, 2018; Kazemi et al., 2018). This metric-learning training strategy works on the principle of differentiating inter-class features and penalizing the distance of embeddings from their respective class centers.

Wen et al. (Wen et al., 2018) utilized center loss for the biometric task of facial recognition. Sharing weights between the softmax and the center loss reduces the computational complexity, while using an entire embedding space as the center, rather than the conventional single-point representation, also takes intra-class variations into account. Kazemi et al. (Kazemi et al., 2018) also proposed a novel attribute-centered loss to train a Deep Coupled Convolutional Neural Network (DCCNN) for the task of sketch-to-photo matching using facial features.

He et al. (He et al., 2018) proposed a triplet-center loss that aims to further improve the discriminative power of features by not only minimising the distance of encodings from their own class centers but also maximising their distance from the centers of other classes. The resulting discriminative latent features are utilized for the task of 3D object retrieval. Ghosh and Davis (Ghosh and Davis, 2018) highlighted the impact of adding center loss to the cross-entropy loss in CNNs for image retrieval problems that involve very few samples per class.

In addition, center loss coupled with softmax loss has been employed for emotion recognition in speech data (Tripathi et al., 2019).

Although this clustering-based loss has been extensively employed for biometric applications, to the best of our knowledge our paper is the first to explore its application in a multimodal setting.

2.2.2. Hyper-prior on Latent Spaces

Various approaches have intended to capture completely decorrelated factors of variations in the data by employing diverse training strategies like utilizing generative models to learn low-dimensional subspaces (Klys et al., 2018) or imposing a soft orthogonality constraint on the latent chunks (Shukla et al., 2019). One such effective approach is to vary the prior on the generative latent space in such a way that it intrinsically enforces independence of the captured features.

Kim et al. (Kim et al., 2019) introduced a class of hierarchical Bayesian models with hyper-priors on the variances of the Gaussian prior distributions in a VAE. Because each captured latent feature then has a different prior distribution, the features are encouraged to be intrinsically independent, and the model encapsulates admissible as well as nuisance factors simultaneously.

Ansari and Soh (Ansari and Soh, 2018) also focused on capturing disentangled factors of variations in an unsupervised manner by utilizing the inverse-Wishart (IW) as the prior on the latent space of the generative model. By tweaking the IW parameter, various features in a set of diverse datasets could be captured simultaneously.

Bhagat et al. (Bhagat et al., 2020) utilized Gaussian processes (GP) with varying correlation structure in VAEs for the task of video sequence disentangling. The obtained latent representation was exploited for down-stream tasks like video frame prediction as well.

3. Proposed Approach

In this section, we discuss how our architecture differs from closely related work and give a brief overview of our model, C3VQG. We then describe the individual components of C3VQG alongside their motivation. Lastly, we present the complete training procedure and the optimization strategy.

One of the key features of the previous work in VQG (Krishna et al., 2019) is the ability to generate questions that produce informative answers. Its architecture maximizes the mutual information between the generated question and the image as well as the expected answer. At training time, both the question and the expected answer are used (along with the answer category), while at test time only the answer category is required.

We propose C3VQG: a cyclic training approach that enforces consistency in answer categories via a two-step framework. We introduce a variational autoencoder (VAE) which maximizes the mutual information between the question generated, image and category. We divide the basic architecture into 2 steps. While the first step ensures encapsulation of image and category information within the latent encoding, the second step establishes compatibility in the categories predicted from the generated question with that of the ground-truth categories. We formulate the latent space to contain sufficient information about the answer category besides capturing all independent features of the image in a structured manner. We do this by enforcing an additional hyper-prior on the latent space and including a center loss based constraint.

One of the challenges of the prior approaches that we intend to address is the heavy dependence on well-annotated and expensive-to-create datasets. We appropriately try to use only answer categories and questions while maintaining consistency of the generated question with the answer category by introducing an additional loss (see Section 3.3 for details) which enables us to keep the relevance of our generated question high.

The flow diagram of the entire training procedure with each component of the model is illustrated with an example in Figure 3.

3.1. Problem Formulation

Let $\mathcal{I}$ be the dataset of all images and $\mathcal{C}$ be the set of all answer categories. We define $N$ as the number of image-question pairs in our data and $K$ as the total number of answer categories, i.e., 15 in the VQA dataset. The training data is available in the form of images $i \in \mathcal{I}$ that have a corresponding ground-truth question $q$ for every answer category $c \in \mathcal{C}$.

We aim to design a generative model that encapsulates information from multimodal sources of data, in the form of images and answer categories, to generate an encoding that aids the prediction of meaningful questions.

3.2. Information Maximisation VQG

We denote the question to be generated by $\hat{q}$ for a given image $i$ and category $c$. For example, if the predicted question for an image is "Is that a bird on the terrace?", it would correspond to the category 'binary'. We define our initial model (which we refer to as Step I) by defining $p(\hat{q} \mid i, c)$, which we obtain by maximizing a linear combination of the mutual information terms $I(i, \hat{q})$ and $I(c, \hat{q})$. Since the exact computation of mutual information is intractable, we learn a mapping from the image and category to a continuous latent space that we refer to as $z$. The mapping is parameterized by $\theta$, which is learned by optimizing the following objective:

$$\max_{\theta} \; \lambda_1 I(i, \hat{q}) + \lambda_2 I(c, \hat{q}), \quad \hat{q} \sim p_{\theta}(\hat{q} \mid i, c) \tag{1}$$

where $\lambda_1$ and $\lambda_2$ are the weights for the mutual information terms. The mutual information in Equation 1 is intractable because we do not know the true values of the posteriors $p(i \mid \hat{q})$ and $p(c \mid \hat{q})$. We therefore maximize a variational lower bound on it instead. More details on the derivation of the final objective can be found in the supplementary section. In practice, we optimize the variational lower bound by maximizing the image and category reconstruction whilst also maximizing the likelihood (MLE) of question generation.
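As a sketch of why reconstruction terms appear in the final objective, one standard way to lower-bound a mutual information term (in the style of Barber and Agakov) is shown below. The notation is assumed here for illustration: $i$ is the image, $\hat{q}$ the generated question, and $p_{\theta}(i \mid \hat{q})$ a learned approximation of the unknown posterior. The authors' own derivation is in their supplementary material and may differ in detail.

```latex
\begin{aligned}
I(i, \hat{q}) &= H(i) - H(i \mid \hat{q}) \\
&= H(i) + \mathbb{E}_{p(i, \hat{q})}\big[\log p(i \mid \hat{q})\big] \\
&= H(i) + \mathbb{E}_{p(\hat{q})}\big[\mathrm{KL}\big(p(i \mid \hat{q}) \,\|\, p_{\theta}(i \mid \hat{q})\big)\big]
      + \mathbb{E}_{p(i, \hat{q})}\big[\log p_{\theta}(i \mid \hat{q})\big] \\
&\ge H(i) + \mathbb{E}_{p(i, \hat{q})}\big[\log p_{\theta}(i \mid \hat{q})\big]
\end{aligned}
```

The inequality holds because the KL term is non-negative. Since $H(i)$ does not depend on $\theta$, maximizing the bound amounts to maximizing the expected log-likelihood of recovering the image from the question; the same argument applies to the category term.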

3.3. Category Consistent Cyclic VQG (C3VQG)

Figure 3. Model architecture and training procedure for C3VQG approach.

We build a cyclic approach for VQG to analyze the robustness of the model in terms of its predictions and the diversity of the generated questions. For this, we divide our approach into two parts. The first step combines the latent representations obtained from the answer categories with those obtained from the images to form a combined latent space with a variational prior. The second step penalises the difference between the ground-truth answer categories and the ones predicted from the generated question, enforcing congruence between them.

Step 1: Visual Question Generation.

Using two separate encoders $E_i$ and $E_c$, we generate latent encodings $h_i$ and $h_c$ for the image $i$ and the category label $c$, respectively:

$$h_i = E_i(i) \tag{4}$$

$$h_c = E_c(c) \tag{5}$$

These latent encodings are concatenated and passed to an MLP to generate another latent representation that has a Gaussian prior associated with it. This latent representation, denoted by $z$, forms the backbone for question generation in our approach and is given by:

$$z = \mathrm{MLP}_{W}(h_i \oplus h_c) \tag{6}$$

where $W$ denotes the weights of the MLP and $\oplus$ denotes the concatenation operator for two input vectors. The concatenation of the two encodings aggregates the image information with the type of question that the model is supposed to generate. This latent encoding should intrinsically contain all the information relevant for generating the question. It is therefore passed through a temporal model $G$ that captures the time-varying characteristics and outputs a question $\hat{q}$ about the image along the lines of the answer category:

$$\hat{q} = G(z) \tag{7}$$

We then use the ground-truth question $q = (q_1, \dots, q_T)$ for the image to impose an MLE loss on the generated question $\hat{q}$:

$$\mathcal{L}_{MLE} = -\sum_{t=1}^{T} \log p(q_t \mid q_{<t}, z) \tag{8}$$
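To make the MLE loss concrete, the following minimal sketch computes the negative log-likelihood of a ground-truth question under per-step token distributions. The function name `question_nll`, the toy two-word vocabulary, and the probability values are all hypothetical; in the actual model the per-step distributions would come from the temporal decoder.

```python
import math

def question_nll(step_probs, target_ids):
    """Negative log-likelihood of the ground-truth token sequence.

    step_probs[t] is the (hypothetical) predicted distribution over the
    vocabulary at time step t; target_ids[t] is the ground-truth token id.
    """
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))

# Toy example: a vocabulary of 2 tokens and a question of length 2.
step_probs = [[0.5, 0.5], [0.25, 0.75]]
target_ids = [0, 1]
nll = question_nll(step_probs, target_ids)
```

The loss is zero only when the decoder assigns probability 1 to every ground-truth token, and it grows as the probability assigned to the reference question shrinks.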


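To ensure that the visual features as well as the category information are encapsulated in the $z$-space, we pass the latent encoding $z$ through two separate prediction networks, $P_i$ and $P_c$. These prediction networks are trained to predict the original image and category encodings, $h_i$ and $h_c$, by minimizing reconstruction losses $\mathcal{L}_i$ and $\mathcal{L}_c$ between the predicted and the original encodings (Equations 9 and 10). For the image encoding, this is a squared L2 loss, $\mathcal{L}_i = \lVert P_i(z) - h_i \rVert_2^2$.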

Step 2: Generation Consistency Assurance.

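To verify that the answer category of the generated question is consistent with the given category, we pass the generated question $\hat{q}$ through a temporal classifier $T$ that predicts the answer category of the generated question:

$$\hat{c} = T(\hat{q}) \tag{11}$$

We then impose a cross-entropy loss between the predicted and actual answer categories in order to penalise any irregularities in the previous step:

$$\mathcal{L}_{cyc} = -\sum_{k=1}^{K} c_k \log \hat{c}_k \tag{12}$$

where $c_k$ is the one-hot indicator of the ground-truth category, $\hat{c}_k$ is the predicted probability of category $k$, and $K$ is the number of answer categories.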
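The category consistency check can be illustrated with a small, self-contained sketch of softmax cross-entropy. The logits below stand in for the output of the temporal classifier and are made up for illustration; the function names are hypothetical as well.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def category_cross_entropy(logits, target_index):
    """Cross-entropy between predicted category scores and the true category."""
    return -math.log(softmax(logits)[target_index])

# Hypothetical classifier scores over three answer categories.
uniform_loss = category_cross_entropy([0.0, 0.0, 0.0], target_index=1)
consistent_loss = category_cross_entropy([0.1, 4.0, 0.2], target_index=1)
inconsistent_loss = category_cross_entropy([4.0, 0.1, 0.2], target_index=1)
```

A generated question whose predicted category matches the requested one incurs a small loss, while a question that drifts to another category is penalised heavily, which is the signal the second step feeds back into the generator.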


3.4. Latent Space Clustering

To ensure that our model is able to accurately predict answer categories from the latent encodings, we promote well-clustered latent spaces. For this, we add structure to the latent space by imposing a constraint in the form of center loss (Wen et al., 2016; implementation at https://github.com/KaiyangZhou/pytorch-center-loss), which aggregates the latent space into a fixed number of clusters, equal to the number of answer categories in the dataset.

The center loss helps distinguish inter-category latent features by enforcing clustering in the following way:

$$\mathcal{L}_{CL} = \frac{1}{2} \sum_{n=1}^{m} \lVert z_n - \mu_{c_n} \rVert_2^2 \tag{13}$$

where $\mu_{c}$ denotes the class center of all datapoints ($z_n$) with label $c$, and $m$ is the number of samples. This helps in discriminating the joint image-category representations by adding supervision, thereby leading to higher fidelity and robustness in the question generation process conditioned on the category labels. The structured latent representation obtained as a result of applying this constraint pulls samples of the same class together, which improves the separation in the latent space between samples belonging to different classes and in turn leads to enhanced down-stream task performance.
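A minimal sketch of the center loss computation is given below, assuming the standard form from Wen et al. (half the summed squared distance of each latent vector to the center of its class). The class centers are passed in as fixed values here purely for illustration; in practice they are learned alongside the model. All names and numbers are hypothetical.

```python
def center_loss(latents, labels, centers):
    """Half the summed squared distance of each latent to its class center.

    latents: list of latent vectors (lists of floats)
    labels:  category label for each latent vector
    centers: dict mapping a category label to its center vector
    """
    total = 0.0
    for z, label in zip(latents, labels):
        mu = centers[label]
        total += sum((zj - mj) ** 2 for zj, mj in zip(z, mu))
    return 0.5 * total

# Toy 2-D latents for two hypothetical categories.
latents = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
labels = ["count", "count", "color"]
centers = {"count": [0.0, 0.0], "color": [3.0, 4.0]}
loss = center_loss(latents, labels, centers)
```

Here the squared distances are 1, 4 and 1, so the loss is 3.0. Latents that sit exactly on their class center contribute nothing, which is what drives the per-category clustering.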

3.5. Modified Hyper-prior on the Latent Space

We also take motivation from one of the models proposed by Kim et al. (Kim et al., 2019), which introduces a modified prior on the latent space that explicitly encourages each dimension to capture independent features. We do this by replacing the sub-optimal standard normal prior on the $z$-space with a long-tailed distribution. We introduce a learnable hyper-prior on the variance of the Gaussian latent prior while keeping the distribution zero-mean. We also employ a supplementary regularization term that ensures sufficient nuisance dimensions.

For this, we learn the inverse variance $\alpha_j$ for each dimension $j$ of the $d$-dimensional latent space. The latent space prior can then be represented as Equation 14.

$$p(z \mid \alpha) = \prod_{j=1}^{d} \mathcal{N}\big(z_j;\, 0,\, \alpha_j^{-1}\big) \tag{14}$$

The modified KL-divergence and the additional regularization term are of the form:

$$\mathcal{L}_{Bayes} = \mathrm{KL}\big(q(z \mid h) \,\|\, p(z \mid \alpha)\big) + \eta\, \mathcal{R}(\alpha) \tag{15}$$

where $h$ is the concatenated latent encoding formed by the image and category encodings, i.e., $h = h_i \oplus h_c$, $z$ is the latent encoding with the variational prior, $q(z \mid h)$ is the distribution induced by the mapping function from $h$ to $z$ (i.e., the MLP), and $\mathcal{R}(\alpha)$ is the regularization term on the learned variances.

In Equation 15, $\eta$ is the weight for the regularization loss that promotes sparsity and increases the generalization capacity of the model.
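For intuition about the modified KL term, the sketch below evaluates the closed-form KL divergence between a diagonal Gaussian posterior and a zero-mean Gaussian prior whose per-dimension inverse variance is `alpha`. This is the textbook Gaussian KL formula, not code from the paper; the function name and the example values are assumptions, and the additional regularization term on the variances is deliberately left out.

```python
import math

def kl_to_scaled_prior(mu, var, alpha):
    """KL( N(mu, diag(var)) || N(0, diag(1/alpha)) ), summed over dimensions.

    Per dimension: 0.5 * (alpha*var + alpha*mu^2 - 1 - log(alpha*var)).
    """
    return sum(
        0.5 * (a * v + a * m * m - 1.0 - math.log(a * v))
        for m, v, a in zip(mu, var, alpha)
    )

# Posterior that exactly matches the prior in both dimensions -> KL is 0.
kl_match = kl_to_scaled_prior(mu=[0.0, 0.0], var=[1.0, 0.5], alpha=[1.0, 2.0])

# Unit-variance posterior shifted by 1 against a standard normal prior -> 0.5.
kl_shifted = kl_to_scaled_prior(mu=[1.0], var=[1.0], alpha=[1.0])
```

With `alpha` fixed to 1 in every dimension this reduces to the usual VAE KL term; letting each dimension learn its own `alpha` is what gives every latent feature a different prior.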

3.6. Training Strategy and Optimization Objective

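We train our model by defining a combined loss that is the weighted sum of the individual loss terms. Combining Equations 8, 9, 10, 12, 13 and 15, we obtain the optimization objective as follows:

$$\min_{\Theta} \; \mathcal{L} = \beta_1 \mathcal{L}_{MLE} + \beta_2 \mathcal{L}_i + \beta_3 \mathcal{L}_c + \beta_4 \mathcal{L}_{cyc} + \beta_5 \mathcal{L}_{CL} + \beta_6 \mathcal{L}_{Bayes} \tag{16}$$

where $\mathcal{L}_{MLE}$ is the question generation loss (Equation 8), $\mathcal{L}_i$ and $\mathcal{L}_c$ are the image and category reconstruction losses (Equations 9 and 10), $\mathcal{L}_{cyc}$ is the category consistency loss (Equation 12), $\mathcal{L}_{CL}$ is the center loss (Equation 13), and $\mathcal{L}_{Bayes}$ is the hyper-prior loss (Equation 15). $\Theta$ represents the combination of all learnable parameters in the complete model, and $\beta_1, \dots, \beta_6$ are the hyperparameters denoting the weight of each loss in the combined objective.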
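The combined objective is simply a weighted sum, which the sketch below spells out. The loss names, loss values and weights are all made up for illustration; the paper does not list the weights it used in this section.

```python
def combined_loss(losses, weights):
    """Weighted sum of individual loss terms, keyed by loss name."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())

# Hypothetical per-batch loss values and weights.
losses = {
    "mle": 2.0,          # question generation loss
    "image": 0.5,        # image encoding reconstruction
    "category": 0.25,    # category encoding reconstruction
    "consistency": 1.0,  # category consistency (cross-entropy)
    "center": 0.4,       # center loss on the latent space
    "bayes": 0.1,        # hyper-prior KL and regularization
}
weights = {
    "mle": 1.0, "image": 0.5, "category": 0.5,
    "consistency": 1.0, "center": 0.1, "bayes": 0.01,
}
total = combined_loss(losses, weights)
```

Keeping the terms in a dictionary makes ablations such as the ones reported later straightforward: dropping a component corresponds to setting its weight to zero.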

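Input: image dataset $\mathcal{I}$, set of all answer categories $\mathcal{C}$, ground-truth questions $q$ for every image $i \in \mathcal{I}$ and every category $c \in \mathcal{C}$, weights $\beta_1, \dots, \beta_6$ for all individual losses, gradient descent learning rate $\gamma$
Output: weights $\Theta$ for all the individual components of the model

Initialize $\Theta$ with Kaiming initialization (He et al., 2015)
for each epoch do
    for each mini-batch do
        Sample image batch $i$ from $\mathcal{I}$.
        Sample answer category batch $c$ from $\mathcal{C}$.
        Sample questions $q$ for $i$ with category $c$.
        Get $h_i$ and $h_c$ using Equations 4 and 5.
        Concatenate $h_i$ and $h_c$ to get $z$ using Equation 6.
        Use $z$ to predict $\hat{h}_i$ and $\hat{h}_c$ and compute $\mathcal{L}_i$ and $\mathcal{L}_c$ using Equations 9 and 10.
        Generate question $\hat{q}$ using Equation 7 and compute $\mathcal{L}_{MLE}$ using Equation 8.
        Predict category $\hat{c}$ from the generated question using Equation 11 and compute $\mathcal{L}_{cyc}$ using Equation 12.
        Compute $\mathcal{L}_{CL}$ and $\mathcal{L}_{Bayes}$ using Equations 13 and 15.
        Find the gradient of the combined loss w.r.t. $\Theta$, i.e., $\nabla_{\Theta} \mathcal{L}$.
        Take a gradient descent step, $\Theta \leftarrow \Theta - \gamma \nabla_{\Theta} \mathcal{L}$.
    end for
end for
Algorithm 1. Training algorithm for C3VQG with all components.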

To train our model using Algorithm 1, we use the Adam optimizer. We train the model for 15 epochs on a machine with a GeForce GTX 1080 GPU using the PyTorch framework.

4. Experiments

We evaluate the performance of our approach, C3VQG (the code for the approach will be released on acceptance), against the state-of-the-art in VQG using a variety of quantitative metrics, and we also highlight the qualitative strengths of our approach.

4.1. Dataset Features

The VQA dataset (Antol et al., 2015; available at https://cs.stanford.edu/people/ranjaykrishna/iq/index.html) consists of images along with corresponding questions and answers for each image. Krishna et al. (Krishna et al., 2019) annotate the answers with a set of 15 categories and label the top 500 answers. These top 500 answers make up 82% of the entire VQA dataset, which consists of 367K training and validation examples. Because ground-truth answers are not available for the test set, we treat the validation set as our test set for evaluation and comparison with baselines. We use an 80-20 training-validation split for our experiments.

4.2. Evaluation Metrics

We intend to evaluate our approach alongside comparing it to the prior work in VQG using a variety of language modeling metrics including BLEU, METEOR and CIDEr (Vedantam et al., 2014). These metrics quantify the ability of the model to generate questions similar to the ground-truth questions for the validation set.

Additionally, we compute another quantitative metric: ROUGE-L, a variant of ROUGE (Lin, 2004). This metric quantifies the similarity between the generated and ground-truth questions using their longest common subsequence. The advantage of using this metric alongside the others is that it takes sentence-level structure into account by rewarding the longest sequence of words that occurs in the same order in both questions, without requiring the words to be contiguous.
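The longest-common-subsequence idea behind ROUGE-L can be sketched in a few lines. This is an illustrative implementation under simplifying assumptions (lowercased whitespace tokenization, a single reference, and an F-measure with `beta = 1`), not the evaluation script used for the reported numbers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-measure between a generated and a reference question."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

generated = "what color is the bird"
reference = "what is the color of the bird"
lcs = lcs_length(generated.split(), reference.split())
score = rouge_l(generated, reference)
```

In this example the longest common subsequence has four words (for instance "what is the bird"), giving a precision of 4/5, a recall of 4/7 and an F-measure of 2/3, even though the shared words are not contiguous in the reference.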

We also evaluate the performance of our model against the baselines using crowd-sourced metrics for testing the relevance of the generated question with respect to the ground-truth images and answer categories. For this, we conduct a user study among 5 crowd workers in which each one is supposed to answer if the generated questions are consistent with respect to the given image and answer category.

To quantify the heterogeneity of the generated questions, we additionally employ diversity metrics in our evaluation, namely strength and inventiveness. Strength is the ratio of unique generated questions to unique ground-truth questions, while inventiveness is the fraction of unique generated questions that were unseen during training.
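The two diversity metrics can be computed with simple set operations, as in the sketch below. It assumes exact string matching between questions and takes the number of unique generated questions as the denominator of inventiveness; both choices, like the toy data, are assumptions made for illustration.

```python
def diversity_metrics(generated, ground_truth, training_questions):
    """Return (strength, inventiveness) as fractions in [0, 1] or above for strength."""
    unique_generated = set(generated)
    unique_ground_truth = set(ground_truth)
    if not unique_generated or not unique_ground_truth:
        return 0.0, 0.0
    # Strength: unique generated questions relative to unique ground-truth questions.
    strength = len(unique_generated) / len(unique_ground_truth)
    # Inventiveness: share of unique generated questions never seen in training.
    unseen = unique_generated - set(training_questions)
    inventiveness = len(unseen) / len(unique_generated)
    return strength, inventiveness

generated = ["is it a dog?", "what color is the car?", "is it a dog?", "how many birds?"]
ground_truth = ["is it a dog?", "what color is the car?", "where is the cat?", "how many birds?"]
training = ["is it a dog?", "where is the cat?"]
strength, inventiveness = diversity_metrics(generated, ground_truth, training)
```

For this toy data there are three unique generated questions against four unique ground-truth questions (strength 0.75), and two of the three were not seen during training (inventiveness 2/3). The tables in the paper report these quantities as percentages.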

4.3. Quantitative Results

| Supervision | Model | Bleu-1 | Bleu-2 | Bleu-3 | Bleu-4 | METEOR | CIDEr | ROUGE-L |
|---|---|---|---|---|---|---|---|---|
| Supervised (w A) | IA2Q (Wang et al., 2017) | 32.43 | 15.49 | 9.24 | 6.23 | 11.21 | 36.22 | - |
| Supervised (w A) | V-IA2Q (Jain et al., 2017) | 36.91 | 17.79 | 10.21 | 6.25 | 12.39 | 36.39 | - |
| Supervised (w A) | Krishna et al. (Krishna et al., 2019) | 47.40 | 28.95 | 19.93 | 14.49 | 18.35 | 85.99 | 49.10 |
| Weakly Supervised (w/o A) | IC2Q (Wang et al., 2017) | 30.42 | 13.55 | 6.23 | 4.44 | 9.42 | 27.42 | - |
| Weakly Supervised (w/o A) | V-IC2Q (Jain et al., 2017) | 35.40 | 25.55 | 14.94 | 10.78 | 13.35 | 42.54 | - |
| Weakly Supervised (w/o A) | Krishna et al. (Krishna et al., 2019) w/o A | 31.20 | 16.20 | 11.18 | 6.24 | 12.11 | 35.89 | 40.27 |
| Weakly Supervised (w/o A) | I | 38.44 | 19.83 | 12.02 | 7.69 | 13.27 | 45.19 | 40.90 |
| Weakly Supervised (w/o A) | I + II | 38.80 | 20.12 | 12.32 | 7.96 | 13.40 | 46.42 | 41.27 |
| Weakly Supervised (w/o A) | I + CL | 38.81 | 20.14 | 12.30 | 7.91 | 13.41 | 46.96 | 41.21 |
| Weakly Supervised (w/o A) | I + II + CL | 38.94 | 20.30 | 12.47 | 8.10 | 13.47 | 47.32 | 41.27 |
| Weakly Supervised (w/o A) | I + II + Bayes | 38.71 | 19.89 | 12.14 | 7.87 | 13.23 | 42.47 | 41.32 |
| Weakly Supervised (w/o A) | I + CL + Bayes | 38.64 | 20.06 | 12.28 | 7.95 | 13.32 | 45.83 | 41.16 |
| Weakly Supervised (w/o A) | I + II + CL + Bayes | 41.87 | 22.11 | 14.96 | 10.04 | 13.60 | 46.87 | 42.34 |

Table 1. Ablation study for different components of C3VQG using different language modeling quantitative metrics against other baselines in VQG.

| Category | V-IC2Q (Jain et al., 2017) Strength | V-IC2Q (Jain et al., 2017) Inventiveness | Krishna et al. (Krishna et al., 2019) Strength | Krishna et al. (Krishna et al., 2019) Inventiveness | C3VQG w/o Bayes Strength | C3VQG w/o Bayes Inventiveness | C3VQG Strength | C3VQG Inventiveness |
|---|---|---|---|---|---|---|---|---|
| count | 15.77 | 30.91 | 26.06 | 41.30 | 58.33 | 55.20 | 65.21 | 61.84 |
| binary | 18.15 | 41.95 | 28.85 | 54.50 | 58.39 | 36.32 | 65.12 | 38.55 |
| object | 11.27 | 34.84 | 24.19 | 43.20 | 57.77 | 51.51 | 65.58 | 58.85 |
| color | 4.03 | 13.03 | 17.12 | 23.65 | 58.38 | 48.97 | 65.21 | 54.34 |
| attribute | 37.76 | 41.09 | 46.10 | 52.03 | 60.05 | 58.38 | 64.59 | 63.02 |
| materials | 36.13 | 31.13 | 45.75 | 40.72 | 57.93 | 56.79 | 64.87 | 63.48 |
| spatial | 61.12 | 62.54 | 70.17 | 68.18 | 57.90 | 57.80 | 65.18 | 64.96 |
| food | 21.81 | 20.38 | 33.37 | 31.19 | 58.49 | 55.42 | 65.20 | 62.21 |
| shape | 35.51 | 44.03 | 45.81 | 55.65 | 58.85 | 58.75 | 66.01 | 65.98 |
| location | 34.68 | 18.11 | 45.25 | 27.22 | 58.39 | 58.10 | 65.09 | 64.72 |
| predicate | 22.58 | 17.38 | 36.20 | 31.29 | 57.05 | 57.05 | 65.67 | 65.67 |
| time | 25.58 | 15.51 | 34.43 | 25.30 | 58.13 | 58.10 | 65.00 | 64.96 |
| activity | 7.45 | 13.23 | 21.32 | 26.53 | 58.00 | 56.78 | 64.98 | 63.67 |
| Overall | 12.97 | 38.32 | 26.06 | 52.11 | 58.28 | 54.55 | 65.20 | 60.94 |

Table 2. Quantitative evaluation of C3VQG against other baselines using diversity-based metrics.

| Model | Image relevance | Category relevance |
|---|---|---|
| V-IC2Q (Jain et al., 2017) | 90.10 | 39.00 |
| Krishna et al. (Krishna et al., 2019) w/o A | 98.10 | 42.70 |
| C3VQG w/o Bayes, CL | 98.00 | 58.40 |
| C3VQG | 97.80 | 60.50 |

Table 3. Quantitative evaluation of C3VQG against other weakly supervised baselines using crowd-sourced metrics.

In Tables 1, 2, and 3, we evaluate the performance of each model on the validation set rather than the test set similar to (Krishna et al., 2019).

In Table 1, I and II denote Step I and Step II of our approach respectively, CL denotes the center loss imposed on the combined latent space, and Bayes denotes the additional hyper-prior on the inverse variance of each latent dimension. Table 1 shows that, when answers are not used as supervision during training, our full model outperforms the state-of-the-art in VQG (Krishna et al., 2019) trained without answers on every reported metric. This shows the significance of cyclic consistency in the answer category for generating semantically meaningful questions.

The values reported in Table 3 show that our model clearly outperforms the baselines on category relevance while remaining comparable on image relevance, as a result of the consistency of the generated questions and the structure present in the latent space. The supplementary constraint on the congruence of the answer category encourages the generated question to be relevant to the category, while the squared L2 loss between the image encoding and the encoding generated from the combined latent space supports relevance with respect to the image.

The higher diversity of the questions generated by our model, shown in Table 2, highlights that imposing a different prior on each dimension of the latent space encourages the generation of a diversified set of questions from different answer categories. Overall, the diversity achieved by our approach with all components beats the state-of-the-art in VQG even without additional answer supervision. The difference in the strength and inventiveness values with and without the latent hyper-prior suggests that capturing decorrelated features in each latent dimension enables our model to generate non-generic questions from a diverse pool of categories.

4.4. Qualitative Results

Figure 4. Questions generated for each image from multiple answer categories using the C3VQG approach.
Figure 5. Qualitative results for C3VQG and Krishna et al. (Krishna et al., 2019) without answers.

We present a set of four generated questions (from different answer categories) for a collection of images in Figure 4. This highlights the ability of our approach to generate diverse, non-generic questions that are specific to both the image and the category.

Additionally, Figure 5 shows cases in which the questions generated by our model belong to the specified answer categories, while the baseline approach of (Krishna et al., 2019) without answer supervision fails to do so. Our approach addresses this lack of category consistency in the baseline by adding a supplementary consistency loss: including cycle consistency in the model is intended to remove mismatches between the generated questions and the provided answer categories. As the qualitative evaluation shows, the questions generated by (Krishna et al., 2019) are sensible with respect to each image and are not generic, but they often do not match the specified answer category. This is one of the shortcomings of (Krishna et al., 2019) that we counter with a cyclic-consistency-based training procedure, in addition to the quantitative improvements.

5. Conclusion

We present a novel answer-category-consistent cyclic training approach for visual question generation using a structured latent space. Our approach generates category-specific, comprehensive questions from the visual features present in the image without requiring ground-truth answers. With this reduced level of supervision, our approach outperforms the current state of the art on a variety of language modeling, crowd-sourced, and diversity-based metrics. Qualitatively, our approach avoids generic question formation and generates questions that belong to the specified answer category even when former approaches fail to do so.


  • A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Parikh, and D. Batra (2015) VQA: visual question answering. International Journal of Computer Vision 123, pp. 4–31. Cited by: §1.
  • A. F. Ansari and H. Soh (2018) Hyperprior induced unsupervised disentanglement of latent representations. In AAAI, Cited by: §2.2.2.
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: visual question answering. In International Conference on Computer Vision (ICCV), Cited by: §4.1.
  • S. Bhagat, S. Uppal, V. T. Yin, and N. Lim (2020) Disentangling representations using gaussian processes in variational autoencoders for video prediction. ArXiv abs/2001.02408. Cited by: §2.2.2.
  • S. Chen, T. Yao, and Y. Jiang (2019) Deep learning for video captioning: a review. In IJCAI, Cited by: §1.
  • P. Ghosh and L. S. Davis (2018) Understanding center loss based network for image retrieval with few training data. In ECCV Workshops, Cited by: §2.2.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034. Cited by: 1.
  • X. He, Y. Zhou, Z. Zhou, S. Bai, and X. Bai (2018) Triplet-center loss for multi-view 3d object retrieval. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1945–1954. Cited by: §2.2.1.
  • U. Jain, Z. Zhang, and A. G. Schwing (2017) Creativity: generating diverse questions using variational autoencoders. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5415–5424. Cited by: §1, §2.1, Table 1, Table 2, Table 3.
  • H. Kazemi, S. Soleymani, A. Dabouei, S. M. Iranmanesh, and N. M. Nasrabadi (2018) Attribute-centered loss for soft-biometrics guided face sketch-photo recognition. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 612–6128. Cited by: §2.2.1, §2.2.1.
  • M. Kim, Y. Wang, P. Sahu, and V. Pavlovic (2019) Bayes-factor-vae: hierarchical bayesian deep auto-encoder models for factor disentanglement. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2979–2987. Cited by: §2.2.2, §3.5.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: 1st item.
  • J. Klys, J. Snell, and R. S. Zemel (2018) Learning latent subspaces in variational autoencoders. ArXiv abs/1812.06190. Cited by: §2.2.2.
  • R. Krishna, M. Bernstein, and L. Fei-Fei (2019) Information maximizing visual question generation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2008–2018. Cited by: §1, §1, §3, Figure 5, §4.1, §4.3, §4.3, §4.4, Table 1, Table 2, Table 3.
  • Y. Li, N. Duan, B. Zhou, X. R. Chu, W. Ouyang, and X. Wang (2017) Visual question generation as dual task of visual question answering. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6116–6124. Cited by: §1, §2.1.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81. Cited by: §4.2.
  • F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun (2018) IVQA: inverse visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8611–8619. Cited by: §1.
  • M. Malinowski and M. Fritz (2014) A multi-world approach to question answering about real-world scenes based on uncertain input. arXiv:1410.0210. Cited by: §1.
  • N. Mostafazadeh, C. Brockett, W. B. Dolan, M. Galley, J. Gao, G. P. Spithourakis, and L. Vanderwende (2017) Image-grounded conversations: multimodal context for natural question and response generation. In IJCNLP, Cited by: §2.1.
  • N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, and L. Vanderwende (2016) Generating natural questions about an image. ArXiv abs/1603.06059. Cited by: §1, §1, §2.1.
  • A. Shukla, S. Bhagat, S. Uppal, S. Anand, and P. K. Turaga (2019) Product of orthogonal spheres parameterization for disentangled representation learning. In BMVC, Cited by: §2.2.2.
  • S. Tripathi, A. Ramesh, A. Kumar, C. Singh, and P. Yenigalla (2019) Learning discriminative features using center loss and reconstruction as regularizer for speech emotion recognition. ArXiv abs/1906.08873. Cited by: §2.2.1.
  • R. Vedantam, C. L. Zitnick, and D. Parikh (2014) CIDEr: consensus-based image description evaluation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4566–4575. Cited by: §4.2.
  • T. Wang, X. Yuan, and A. Trischler (2017) A joint model for question answering and question generation. ArXiv abs/1706.01450. Cited by: §1, Table 1.
  • Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In ECCV, Cited by: §2.2.1, §3.4.
  • Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2018) A comprehensive study on center loss for deep face recognition. International Journal of Computer Vision 127, pp. 668–683. Cited by: §2.2.1, §2.2.1.
  • X. Xu, J. Song, H. Lu, L. He, Y. Yang, and F. Shen (2018) Dual learning for visual question generation. In 2018 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. Cited by: §1.
  • J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh (2018) Visual curiosity: learning to ask questions to learn visual recognition. In CoRL, Cited by: §1.
  • Y. Yang, Y. Li, C. Fermüller, and Y. Aloimonos (2015) Neural self talk: image understanding via continuous questioning and answering. ArXiv abs/1512.03460. Cited by: §2.1.
  • S. Zhang, L. Qu, S. You, Z. Yang, and J. Zhang (2016) Automatic generation of grounded visual questions. ArXiv abs/1612.06530. Cited by: §1, §2.1.
  • Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei (2015) Visual7W: grounded question answering in images. arXiv:1511.03416. Cited by: §1.