Visual Question Answering (VQA) applications allow a human user to ask a machine questions about images – be it a user interacting with a visual chat-bot or a visually impaired user relying on an assistive device. As this technology steps out of the realm of curated datasets towards real-world settings, it is desirable that VQA models be robust to and consistent across reasonable variations in the input modalities. While there has been significant progress in VQA over the years [1, 18, 2, 10, 21, 46, 4, 5], today’s VQA models are, however, far from being robust.
VQA is a task that lies at the intersection of language and vision. Existing works have studied the robustness and sensitivity of VQA models to meaningful semantic variations in images, changing answer distributions and adversarial attacks on images. However, to the best of our knowledge, no work has studied the robustness of VQA models to linguistic variations in the input question. This is important both from the perspective of VQA being a benchmark to test multi-modal AI capabilities (do our VQA models really “understand” the question when answering it?) and for applications (human users are likely to phrase the same query in a variety of different linguistic forms). Yet today’s state-of-the-art VQA models are brittle to such linguistic variations, as can be seen in Fig. 1.
One approach to make VQA models more robust is to collect a dataset with diverse rephrasings of questions to train VQA models. This requires additional human annotation and thus is not always scalable in real-world settings. Alternatively, an automatic approach that does not require additional human intervention but results in a VQA model that is more robust to linguistic variations observed in the natural language open-ended questions is desirable.
We propose a novel model-agnostic framework that relies on cycle consistency to learn robust VQA models without requiring additional annotation. Specifically, we train the model to not just answer a question, but also to generate diverse, semantically similar variations of questions conditioned on the answer. We enforce that the answer predicted for a generated question matches the ground truth answer to the original question. In other words, the model is being trained to predict the same (correct) answer for a question and its (generated) rephrasing.
The advantages of our proposed approach are twofold. First, enforcing consistent correctness across diverse rephrasings allows models to generalize to unseen semantically equivalent variations of questions at test time. The model achieves this by generating linguistically diverse rephrasings of questions on-the-fly and training with these variations. Second, a model trained to generate a valid question given a candidate answer and image has a stronger multi-modal understanding of vision and language. Questions tend to have fewer learnable biases. As a result, models that can jointly perform the tasks of question generation and question answering are less prone to taking “shortcuts” and exploiting linguistic priors in questions. Indeed, we find that models trained with our approach outperform existing state-of-the-art models on both VQA and Visual Question Generation (VQG) tasks on VQA v2.0.
We also observed that one reason for the limited development of VQA models robust to linguistic variations in input questions is the lack of a benchmark to measure robustness. The lack of such a benchmark makes it hard to quantify the inflated capabilities and limited multi-modal understanding of modern VQA models, and consequently inhibits progress in pushing the state-of-the-art in the multi-modal understanding aspects of computer vision. To enable quantitative evaluation of robustness and consistency of VQA models across linguistic variations in input questions, we collect a large-scale dataset, VQA-Rephrasings (Section 4), based on the VQA v2.0 dataset. VQA-Rephrasings contains 3 human-provided rephrasings for 40k questions on 40k images from the validation split of the VQA v2.0 dataset. We also propose metrics to measure the robustness of VQA models across different question rephrasings. Further, we benchmark several state-of-the-art VQA models [4, 6, 21, 46] on our proposed VQA-Rephrasings dataset to highlight the fragility of VQA models to question rephrasings. We observe a significant drop when VQA models are required to be consistent in addition to being correct (Section 5), which reinforces our belief that existing VQA models do not understand language “enough”. We show that VQA models trained with our approach are significantly more robust across question rephrasings than their existing counterparts on the proposed VQA-Rephrasings dataset.
In this paper, our contributions are the following:
We propose a model-agnostic cycle-consistent training scheme that enables VQA models to be more robust to linguistic variations observed in natural language open-ended questions.
To evaluate the robustness of VQA models to linguistic variations, we introduce a large-scale VQA-Rephrasings dataset and an associated consensus score. VQA-Rephrasings consists of 3 rephrasings for 40k questions on 40k images from the VQA v2.0 validation dataset, resulting in a total of 120k question rephrasings written by humans.
We show that models trained with our approach outperform state-of-the-art on the standard VQA and Visual Question Generation tasks on the VQA v2.0 dataset and are significantly more robust to linguistic variations on VQA-Rephrasings.
2 Related Work
Visual Question Answering. There has been tremendous progress in building models for VQA using LSTMs and convolutional networks. VQA models spanning different paradigms like attention networks [45, 21], module networks [15, 5, 18], relational networks and multi-modal fusion have been proposed. Our method is model-agnostic and is applicable to any existing VQA architecture.
Robustness. Robustness of VQA models has been studied in several contexts [2, 44, 10]. For example, prior work studies the robustness of VQA models to changes in the answer distributions across training and test settings, analyzes the extent of visual grounding in VQA models by studying their robustness to meaningful semantic changes in images, and shows that, despite the use of an advanced attention mechanism, it is easy to fool a VQA model with very minor changes in the image. Our work, however, aims to complete the study of robustness by benchmarking and improving the robustness of VQA models to linguistic and compositional variations in questions in the form of rephrasings. Robustness has also been studied in natural language processing (NLP) systems [8, 13] in the contexts of bias [39, 38], domain shift and syntactic variations. To counter these issues in NLP systems, solutions like linguistically motivated data augmentation and adversarial training have been proposed. We study this in the context of visual question answering, a multi-modal task that grounds language in the visual world.
(Visual) Question Generation. Question Generation (QG) as a task has been studied extensively in NLP [3, 20, 37, 43]. Generating questions conditioned on an image has been introduced in prior work, and a large-scale VQG dataset has been collected to evaluate the visually grounded question generation capabilities of models. More recently, there has been work on generating questions that are diverse [17, 45]. Training models to ask informative questions about an image in an active learning fixed-budget setting has also been explored. While these techniques generate questions about an image in an answer-agnostic manner, techniques like iVQA propose a variational LSTM based model trained with reinforcement learning to generate answer-specific questions for an image. More recently, iQAN generates answer-specific questions for specific question types by modelling question generation as a dual task of question answering. Unlike iQAN, our method is not restricted to generating questions only for specific question types. Different from previous works, the goal of our VQG component is to automatically generate question rephrasings that make VQA models more robust to linguistic variations. To the best of our knowledge, we are the first to demonstrate that a VQG module can be used to improve VQA accuracy in a cycle-consistent setting.
Cycle-Consistent Learning. Cycle consistency has been used in several settings, including unpaired image-to-image translation and text-based question answering. Consistency enables learning of robust models by regularizing transformations that map one interconnected modality or domain to the other. While cycle consistency has been used widely in domains involving a single modality (text-only or image-only), it has not been explored in the context of multi-modal tasks like VQA. Cycle consistency in VQA can also be thought of as an online data-augmentation technique where the model is trained on several generated rephrasings of the same question.
We now introduce our cycle-consistent scheme to train robust VQA models. Given a triplet of image $I$, question $Q$, and ground truth answer $A$, a generic VQA model can be formulated as a transformation $F : (Q, I) \mapsto A'$, where $A'$ is the answer predicted by the model, as in Fig. 2(a). Similarly, a generic VQG model can be formulated as a transformation $G : (A, I) \mapsto Q'$, as in Fig. 2(b). For a given $(I, Q, A)$ triplet, we first obtain an answer prediction $A'$ using the VQA model $F$ for the original question $Q$. We then use the predicted answer $A'$ and the image $I$ to generate a question $Q'$ which is semantically similar to $Q$ using the VQG model $G$. Lastly, we obtain an answer prediction $A''$ for the generated question $Q'$.
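The three-step pass described here (answer the original question, generate a rephrasing from the predicted answer and the image, answer the rephrasing) can be sketched as follows. This is only an illustrative sketch: `cycle_forward`, `CycleOutputs`, and the toy stand-in models are hypothetical names, and real VQA and VQG models would operate on image features and answer distributions rather than strings.

```python
from typing import Any, Callable, NamedTuple


class CycleOutputs(NamedTuple):
    answer: str        # answer predicted for the original question
    rephrasing: str    # question generated from (predicted answer, image)
    cycle_answer: str  # answer predicted for the generated question


def cycle_forward(
    vqa: Callable[[str, Any], str],
    vqg: Callable[[str, Any], str],
    image: Any,
    question: str,
) -> CycleOutputs:
    """Run one question -> answer -> question -> answer cycle."""
    answer = vqa(question, image)          # step 1: answer the original question
    rephrasing = vqg(answer, image)        # step 2: generate a rephrasing
    cycle_answer = vqa(rephrasing, image)  # step 3: answer the rephrasing (same VQA model)
    return CycleOutputs(answer, rephrasing, cycle_answer)


# Toy stand-ins (hypothetical): the "image" is a dict of attributes.
def toy_vqa(question: str, image: dict) -> str:
    return image["color"] if "color" in question.lower() else "unknown"


def toy_vqg(answer: str, image: dict) -> str:
    return "Which color is the object?"


out = cycle_forward(toy_vqa, toy_vqg, {"color": "red"}, "What color is it?")
print(out)
```

Note that the same VQA callable is used in steps 1 and 3, which mirrors the idea that the model answering the rephrasing is the model being trained.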
Our design of consistency components is inspired by two beliefs. First, a model which can generate a semantically and syntactically correct question given an answer and an image has a better understanding of the cross-modal connections among the image, the question and the answer that make them a valid triplet. Second, assuming the generated question is a valid rephrasing of the original question, a robust VQA model should answer this rephrasing with the same answer as the original question. In practice, however, there are several challenges that inhibit enforcement of cycle-consistency in VQA. We discuss these challenges and describe the key components of our framework geared to tackle them in the following sections.
3.1 Question Generation Module
Since VQA is a setting where there is high disparity in the information content of the involved modalities (a question and answer pair is a very lossy compressed representation of the image), learning transformations that map one modality to another is non-trivial. In cycle-consistent models dealing with a single modality, transformations need to be learned across different domains of the same modality (image or text) with roughly similar information content. However, in a multi-modal transformation like VQG, learning a transformation from a low-information modality (such as the answer) to a high-information modality (the question) needs additional supervision. We provide this additional supervision to the VQG model in the form of attention. To generate a rephrasing, the VQG model is guided to attend to the regions of the image which were used by the VQA model to answer the original question. Unlike prior work, this enables our models to generate questions more similar to the original question from answers like “yes”, which could possibly have a large space of plausible questions.
We model the question generation module in a fashion similar to a conditional image captioning model. The question generation module consists of two linear encoders that transform the attended image features obtained from the VQA model and the distribution over the answer space into lower-dimensional feature vectors. We sum these feature vectors with additive noise and pass them through an LSTM, which is trained to reconstruct the original question and optimized by minimizing the negative log likelihood with teacher forcing. Note that unlike [28, 26], we do not pass the one-hot vector representing the predicted answer, or an embedding of the predicted answer, to the question generation module, but rather the predicted distribution over answers. This enables the question generation module to learn to map the model’s confidence over answers to the generated question.
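As a rough illustration of the input side of this module, the sketch below projects attended image features and the predicted answer distribution with two linear maps, then sums the projections with additive noise. It is a minimal pure-Python sketch under stated assumptions: the function names, the toy weight matrices, and the choice of Gaussian noise are hypothetical, and the LSTM decoder that consumes the resulting vector is omitted.

```python
import random
from typing import List, Optional, Sequence

Matrix = Sequence[Sequence[float]]


def linear(weights: Matrix, x: Sequence[float]) -> List[float]:
    """Bias-free linear map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def vqg_conditioning(
    attended_image_feat: Sequence[float],
    answer_probs: Sequence[float],
    w_image: Matrix,
    w_answer: Matrix,
    noise_std: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Encode both inputs to a shared size and sum them with additive noise.

    answer_probs is the full predicted distribution over answers,
    not a one-hot vector of the top answer.
    """
    rng = rng or random.Random(0)
    image_code = linear(w_image, attended_image_feat)
    answer_code = linear(w_answer, answer_probs)
    # Gaussian noise is an assumption; the text only says "additive noise".
    return [i + a + rng.gauss(0.0, noise_std) for i, a in zip(image_code, answer_code)]


# Toy example: 3-d image feature, 2 candidate answers, 2-d conditioning vector.
w_image = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_answer = [[1.0, 1.0], [0.0, 2.0]]
conditioning = vqg_conditioning([0.5, -1.0, 2.0], [0.7, 0.3], w_image, w_answer)
print(conditioning)
```

Feeding the whole distribution means that two predictions with the same top answer but different confidence produce different conditioning vectors, which is the property the paragraph above highlights.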
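Throughout the paper, Q-consistency implies the addition of a VQG module on top of the base VQA model to generate rephrasings from the image and the predicted answer, with an associated Q-consistency loss $L_G(Q, Q')$ between the original question $Q$ and the generated question $Q'$. Similarly, A-consistency implies passing all questions generated by the VQG model to the VQA model, with an associated A-consistency loss $L_C(A, A'')$ between the ground truth answer $A$ and the answer $A''$ predicted for the generated question. The overall loss can be written as:

$$L_{total}(Q, I, A) = L_F(A, A') + \lambda_G \, L_G(Q, Q') + \lambda_C \, L_C(A, A'') \qquad (1)$$

where $L_F$ (the VQA loss on the answer $A'$ predicted for the original question) and $L_C$ (i.e. the A-consistency loss) are cross-entropy losses, $L_G$ (i.e. the Q-consistency loss) is a sequence generation loss, and $\lambda_G$, $\lambda_C$ are tunable hyperparameters.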
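To make the structure of the overall loss concrete, the sketch below combines a VQA cross-entropy term, a Q-consistency sequence loss, and an A-consistency cross-entropy term using two weights. All names and numbers are hypothetical; in particular, the weights used in the example are placeholders rather than the values used in the experiments, and summing the per-token negative log likelihoods is one possible choice for the sequence loss.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Negative log probability assigned to the target class."""
    return -math.log(probs[target])


def sequence_nll(step_probs: Sequence[Sequence[float]], target_tokens: Sequence[int]) -> float:
    """Teacher-forced sequence loss: summed per-token negative log likelihood."""
    return sum(cross_entropy(p, t) for p, t in zip(step_probs, target_tokens))


def total_loss(
    vqa_loss: float,
    q_consistency_loss: float,
    a_consistency_loss: float,
    lambda_q: float,
    lambda_a: float,
) -> float:
    """Weighted sum of the three terms; lambda_q and lambda_a are tunable."""
    return vqa_loss + lambda_q * q_consistency_loss + lambda_a * a_consistency_loss


# Toy example over 2 candidate answers and a 2-token question (vocabulary of 3).
vqa_term = cross_entropy([0.5, 0.5], target=0)              # original question
q_term = sequence_nll([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]], [0, 1])
a_term = cross_entropy([0.9, 0.1], target=0)                # generated question
loss = total_loss(vqa_term, q_term, a_term, lambda_q=1.0, lambda_a=0.5)
print(round(loss, 4))
```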
3.2 Gating Mechanism
One of the assumptions of our proposed cycle-consistent training scheme is that the generated question is always semantically and syntactically correct. In practice, however, this is not always true. Previous attempts at naively generating questions conditioned on the answer and using them without filtering to augment the training data have been unsuccessful. Like the visual question answering module, the visual question generation module is not perfect. Therefore, not all questions generated by the question generator are coherent and consistent with the image, the answer and the original question. To overcome this issue, we propose a gating mechanism which automatically filters undesirable questions generated by the VQG model before passing them to the VQA model for A-consistency. The gating mechanism is only relevant when used in conjunction with A-consistency. We retain only those questions which either the VQA model can answer correctly or which have a cosine similarity with the original question encoding greater than a threshold.
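The retention rule can be sketched as a simple filter: keep a generated question if the VQA model answers it correctly, or if its encoding is close enough to the encoding of the original question. The data layout, function names, and the threshold value below are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def gate(
    generated: List[Dict],
    original_encoding: Sequence[float],
    ground_truth_answer: str,
    threshold: float,
) -> List[Dict]:
    """Keep generated questions that pass either of the two checks."""
    kept = []
    for item in generated:
        answered_correctly = item["predicted_answer"] == ground_truth_answer
        similar = cosine_similarity(item["encoding"], original_encoding) > threshold
        if answered_correctly or similar:
            kept.append(item)
    return kept


candidates = [
    {"id": "a", "encoding": [1.0, 0.0], "predicted_answer": "no"},   # similar, wrong answer
    {"id": "b", "encoding": [0.0, 1.0], "predicted_answer": "yes"},  # dissimilar, right answer
    {"id": "c", "encoding": [0.0, 1.0], "predicted_answer": "no"},   # fails both checks
]
kept_ids = [c["id"] for c in gate(candidates, [1.0, 0.0], "yes", threshold=0.8)]
print(kept_ids)
```

Only the questions that survive the filter would contribute to the A-consistency term.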
3.3 Late Activation
One key component of designing cycle-consistent models is preventing mode collapse. Learning cycle-consistent models in complex settings like VQA needs a carefully chosen training scheme. Since cycle-consistent models have several interconnected sub-networks learning different transformations, it is important to ensure that each of these sub-networks is working in harmony with the others. For example, if the VQA model and the VQG model are jointly trained and consistency is enforced in early stages of training, it is possible that both models can just “cheat” by both producing undesirable outputs. We overcome this by activating cycle-consistency at later stages of training, to make sure both the VQA and VQG models have been sufficiently trained to produce reasonable outputs. Specifically, we enable the loss associated with cycle-consistency after a fixed number of iterations in the training process.
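Late activation amounts to a step schedule on the weight of the consistency term: zero early in training, then a fixed value. A minimal sketch, in which the function name, the activation iteration, and the weight are hypothetical placeholders, and in which treating the activation iteration itself as active is an arbitrary choice:

```python
def consistency_weight(iteration: int, activation_iteration: int, weight: float) -> float:
    """Return 0 before the activation iteration and the full weight from then on."""
    return weight if iteration >= activation_iteration else 0.0


# Toy schedule: the consistency term is switched on at iteration 3.
schedule = [consistency_weight(step, activation_iteration=3, weight=0.5) for step in range(6)]
print(schedule)  # [0.0, 0.0, 0.0, 0.5, 0.5, 0.5]
```

During the initial phase the VQA and VQG models are trained only on their own supervised losses, so the consistency term is first applied to models that already produce reasonable outputs.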
We find these design choices for the question generation module, gating mechanism and late activation to be crucial for effectively training our model. We demonstrate this empirically via ablation studies in Table 2. As we want to increase the robustness of the VQA model to all generated variations, the weights of the VQA models which answer the original question and the generated rephrasing are shared. Our formulation of cycle-consistency in VQA can also be thought of as an online data-augmentation technique where the model is trained on several generated rephrasings of the same question and hence is more robust to such variations during inference. We show that with a careful training strategy, coupled with attention and carefully chosen model architectures for question generation, incorporating cycle consistency for VQA is possible and leads to models that are not only better performing, but also more robust and consistent. In addition, we show that this robustness also gives VQA models the ability to better predict their own failures.
4 VQA-Rephrasings Dataset
In this section, we introduce the VQA-Rephrasings dataset, which is the first dataset that enables evaluation of VQA models for robustness and consistency to different rephrasings of questions with the same meaning.
We use the validation split of VQA v2.0 as our base dataset, which contains a total of 214,354 questions spanning 40,504 images. We randomly sample 40,504 questions (one question per image) from the base dataset to form a sampled subset. We collect 3 rephrasings of each question in the sampled subset using human annotators in two stages. In the first stage, humans were primed with the original question and the corresponding true answer and asked to rephrase the question such that the answer to the rephrased question remains the same as the original answer. To ensure that rephrasings from the first stage are syntactically correct and semantically in line with the original question, we filter the collected responses in the next stage.
In the second stage, humans were primed with the original question and its rephrasing and were asked to label the rephrasing invalid if: (a) the plausible answers to the original question and its rephrasing are different (i.e. if the question and its rephrasing have different intents) or (b) the rephrasing is grammatically incorrect. We collected 121,512 rephrasings of the original 40,504 questions in the first stage. Of these, 1,320 rephrasings were flagged as invalid in the second stage and were rephrased again in the first stage. Humans were shown examples of incorrect rephrasings in the first stage to minimize the number of invalid rephrasings.
The final dataset consists of 162,016 questions (including the original 40,504 questions) spanning 40,504 images with an average of 3 rephrasings per original question. A few qualitative examples from the collected dataset can be seen in Fig. 3(a). Additional details about the data collection, interfaces used and exhaustive dataset statistics can be found in Appendix A.
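The dataset sizes quoted above are mutually consistent, which can be checked with a few lines of arithmetic. The variable names are illustrative only; the derived percentage of flagged rephrasings is computed here from the quoted counts and is not a figure stated in the text.

```python
ORIGINAL_QUESTIONS = 40_504      # one sampled question per validation image
REPHRASINGS_PER_QUESTION = 3
FLAGGED_INVALID = 1_320          # rephrasings sent back for a second attempt

rephrasings = ORIGINAL_QUESTIONS * REPHRASINGS_PER_QUESTION
total_questions = ORIGINAL_QUESTIONS + rephrasings
flagged_percent = 100 * FLAGGED_INVALID / rephrasings

print(rephrasings)                # 121512
print(total_questions)            # 162016
print(round(flagged_percent, 2))  # 1.09
```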
Consensus Score. Intuitively, for a VQA model to be consistent across various rephrasings of the same question, the answer to all rephrasings should be the same. We measure this by a Consensus Score $CS(k)$. For every group $\mathcal{Q}$ consisting of $n$ rephrasings, we sample all subsets of size $k$. The consensus score is defined as the ratio of the number of subsets in which all the answers are correct to the total number of subsets of size $k$. The answer to a question is considered correct if it has a non-zero VQA Accuracy. $CS(k)$ is formally defined as:

$$CS(k) = \sum_{\mathcal{Q}' \subseteq \mathcal{Q},\ |\mathcal{Q}'| = k} \frac{S(\mathcal{Q}')}{\binom{n}{k}} \qquad (2)$$

$$S(\mathcal{Q}') = \begin{cases} 1 & \text{if } \forall q \in \mathcal{Q}',\ \text{VQA Accuracy}(q) > 0 \\ 0 & \text{otherwise} \end{cases}$$

where $\binom{n}{k}$ is the number of subsets of size $k$ sampled from a set of size $n$. As the consensus score is an all-or-nothing score, to achieve a non-zero consensus score at $k$ for a group of questions $\mathcal{Q}$, the model has to answer at least $k$ questions correctly in the group $\mathcal{Q}$. When $k = |\mathcal{Q}|$ (e.g. when $k = 4$ in VQA-Rephrasings), the model needs to answer all rephrasings of a question and the original question correctly in order to get a non-zero consensus score. It is evident that a model with a higher average consensus score at high values of $k$ is quantitatively more robust to linguistic variations in questions than a model with a lower score.
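The consensus score for one group can be computed by enumerating all size-k subsets and counting those in which every question is answered correctly, where "correct" means a non-zero VQA accuracy. The sketch below assumes the per-question VQA accuracies are already available as a list; the function names and the toy accuracies are hypothetical.

```python
from itertools import combinations
from math import comb
from typing import Sequence


def consensus_score(accuracies: Sequence[float], k: int) -> float:
    """Fraction of size-k subsets of a group in which every answer is correct."""
    n = len(accuracies)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= group size")
    correct = [acc > 0 for acc in accuracies]
    consistent_subsets = sum(all(subset) for subset in combinations(correct, k))
    return consistent_subsets / comb(n, k)


def mean_consensus_score(groups: Sequence[Sequence[float]], k: int) -> float:
    """Average the per-group score over all groups in a dataset."""
    return sum(consensus_score(group, k) for group in groups) / len(groups)


# One group: the original question plus 3 rephrasings, one answered incorrectly.
group = [1.0, 0.3, 0.0, 0.6]
scores = {k: consensus_score(group, k) for k in range(1, 5)}
print(scores)  # {1: 0.75, 2: 0.5, 3: 0.25, 4: 0.0}
```

A single wrong answer in a group of four already drives the score at k = 4 to zero, which is why the score at high k is a strict measure of consistency.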
|BUTD + CC||61.66||50.79||44.68||42.55||62.44||52.58|
|Pythia + CC||64.36||55.45||50.92||44.30||64.52||55.65|
|BAN + CC||65.77||56.94||51.76||48.18||65.87||56.59|
5.1 Consistency Performance
We start by benchmarking a variety of existing VQA models on our proposed VQA-Rephrasings dataset.
MUTAN (https://github.com/Cadene/vqa.pytorch) parametrizes bilinear interactions between visual and textual representations using a multi-modal low-rank decomposition. MUTAN uses skip-thought sentence embeddings to encode the question and Resnet-152 to encode images. MUTAN achieves 63.20% accuracy on VQA v2.0 test-dev. Among all models we analyze, MUTAN is the only model which uses sentence embeddings to encode questions and Resnet to encode images.
Bottom-Up Top-Down Attention (BUTD) (https://github.com/hengyuan-hu/bottom-up-attention-vqa) incorporates bottom-up attention in VQA by extracting features associated with image regions proposed by Faster-RCNN pretrained on Visual Genome. The BUTD model won the VQA Challenge in 2017 and achieves 66.25% accuracy on VQA v2.0 test-dev.
between question and image regions. Pythia uses features extracted from Detectron pretrained on Visual Genome. An ensemble of Pythia models won the VQA Challenge in 2018 using additional training data from Visual Genome and additional Resnet features. In this study, we use Pythia models which do not use Resnet features. Without Resnet features, Pythia achieves an accuracy of 68.43% on VQA v2.0 test-dev.
Bilinear Attention Networks (BAN) (https://github.com/jnhwkim/ban-vqa) combines the ideas of bilinear models and co-attention between image regions and words in questions in a residual setting. Similar to BUTD, it uses Faster-RCNN pretrained on Visual Genome to extract image features. In all our experiments, for a fair comparison, we use BAN models which do not use additional training data from Visual Genome. BAN achieves the current state-of-the-art single-model accuracy of 69.64% on VQA v2.0 test-dev without using additional training data from Visual Genome.
|Pythia + CC*||0.708||0.561||0.438||0.339||0.627||0.284||2.301|
|Pythia + CC||0.486||0.368||0.287||0.226||0.556||0.225||1.843|
Implementation Details For all models trained with our cycle-consistent framework, we use the values , , and . When reporting results on the validation split and VQA-Rephrasings we train on the training split and when reporting results on the test split we train on both training and validation splits of VQA v2.0. Note that we never explicitly train on the collected VQA-Rephrasings dataset and use it purely for evaluation purposes. We use publicly available implementations of each backbone VQA model. The hidden size of the LSTM used in VQG module is 1024 and the linear encoders used to encode the answer and image in VQG have dimensions of 300 each. Additional details about model-specific hyperparameters can be found in Appendix E.
We measure the robustness of each of these models on our proposed VQA-Rephrasings dataset using the consensus score (Eq. 2). Table 1 shows the consensus scores at different values of $k$ for several VQA models. We see that all models suffer significantly when measured for consistency across rephrasings. For example, the performance of Pythia (winner of the 2018 VQA Challenge) is reduced to a consensus score of 39.49%. Similar trends are observed for MUTAN, BAN and BUTD. The drop increases with increasing $k$, the number of rephrasings used to measure consistency. Models like BUTD, BAN and Pythia, which use word-level encodings of the question, suffer significant drops. It is interesting to note that even MUTAN, which uses skip-thought based sentence encoding, suffers a drop when checked for consistency across rephrasings. We observe that the BAN + CC model trained with our proposed cycle-consistent training framework outperforms its counterpart BAN and all other models at all values of $k$.
Fig. 4 qualitatively compares the textual and visual attention (over image regions) over 4 rephrasings of a question. The top row shows attention and predictions from a Pythia model, while the bottom row shows attention and predictions from the same Pythia model trained using our framework. Our model attends to relevant image regions for all rephrasings and answers all of them correctly. The Pythia counterpart, however, fails to attend to relevant image regions for some rephrasings and answers those rephrasings incorrectly. This qualitatively demonstrates the robustness of models trained with our framework.
5.2 Visual Question Answering Performance
We now evaluate our approach and various ablations on the standard task of question answering on the VQA v2.0 dataset. We compare the performance of several VQA models on the validation and test-dev splits of VQA v2.0. The dataset consists of 443,757 training, 214,354 validation and 447,793 testing questions spanning 82,783, 40,504 and 81,434 images respectively. Table 2 shows the VQA scores of different models on the validation and test-dev splits. We show that BUTD, Pythia and BAN models trained with our cycle-consistent framework outperform their corresponding baselines.
We show the impact of each component of our cycle-consistent framework by performing ablation studies on our models. We study the marginal effect of components like question consistency (Q-consistency), answer consistency (A-consistency) and the gating mechanism by adding them step-by-step to the base VQA model. Q-consistency implies the addition of a VQG module to generate rephrasings from the image and the predicted answer, with an associated VQG loss. As shown in Table 2, the addition of question consistency slightly improves the performance of each VQA model. In line with prior observations, this shows that models which can generate questions from the answer indeed have better multi-modal understanding and in turn are better at visual question answering. A-consistency implies passing all the generated questions to the VQA model, with an associated A-consistency loss. As seen in Table 2, naively passing all the generated questions to the VQA model leads to a significant reduction in performance relative to the base model. This is in line with our earlier discussion that not all generated questions are valid rephrasings of the original question, and hence enforcing consistency between the answers of invalid pairs of questions naturally leads to degradation in performance. Finally, we show the effect of using our gating mechanism to filter undesirable generated questions and passing only the remaining ones to the VQA model. We see that all VQA models perform consistently better when using gating than when using only Q-consistency.
We also experimented with Pythia model configurations where the VQG model uses unattended image features (unlike the default setting, which uses image features with attention from the VQA model). We found that with this configuration, our approach still shows improved performance over the baseline. However, the question generation quality is relatively poor, and the overall gain is smaller (3.58% in consistency and 0.2% in VQA accuracy) compared to when using attention (8.08% and 0.5% respectively), likely because attention helps in generating more focused rephrasings.
5.3 Visual Question Generation Performance
Recall that our model also includes a VQG component which generates questions conditioned on an answer and an image. Since the overall performance of our framework relies heavily on the performance of the question generation module, we also evaluate our VQG component on commonly used image captioning metrics. We compare our VQG component to several answer-conditional VQG models on the VQA v2.0 dataset. We use the standard image captioning metrics CIDEr, BLEU, METEOR and ROUGE-L, as used in prior work. We compare our approach to two recently proposed visual question generation approaches. iVQA uses a variational LSTM model trained with reinforcement learning to generate answer-specific questions for an image. Syntactic correctness, diversity and intent of the generated question are used to allocate rewards. iQAN generates answer-specific questions by modelling question generation as a dual task of question answering and sharing parameters between the question answering and question generation modules. Since iQAN can only generate specific types of questions, for a fair comparison, we compare to iQAN only on a subset of the dataset containing questions of these specific types. As shown in Table 3, our question generation module trained with cycle-consistency outperforms iVQA and iQAN on all metrics. A few qualitative examples of answer-conditioned questions generated by our VQG model can be seen in Fig. 3(b). Additional examples can be found in Appendix D.
|BUTD + CC||0.73||0.79||0.76|
|Pythia + CC||0.77||0.81||0.77|
5.4 Failure Prediction Performance
In previous results, we show that training models to generate and answer questions while being consistent across both tasks leads to improvements in performance and robustness. Another way of testing the robustness of these models is to see if models can predict their own failures. A robust model is less confident about an incorrect answer and more confident about a correct one. Motivated by this, we seek to verify if models trained with our cycle-consistent framework can identify their own failures, i.e. correctly identify when they are wrong about a prediction. To this end, we use two failure prediction schemes. First, we naively threshold the confidence of the predicted answer. All answers above a particular threshold are marked as correctly answered and vice versa. Second, we design a failure prediction binary classification module (FP), which predicts, for a given image, question and answer (predicted by the base VQA model), whether the predicted answer is correct for the given image-question pair. The FP module uses image and answer encoders similar to those used in the question generation module (Section 3.1) and makes use of the question representation from the base VQA model as the question encoding. These encodings are concatenated and passed to a linear layer for binary classification. The FP module is trained keeping the parameters of the base VQA model frozen. In Table 4, we show the failure prediction performance of the baseline VQA models and models trained with our proposed framework. It shows that the cycle consistency framework, even without an explicit failure predictor module, makes the models more calibrated and more capable of detecting their own failures. In both settings, (a) when using naive confidence thresholding (not marked as “+ FP” in the table) and (b) when using a specifically designed submodule to detect failures (marked as “+ FP”), models trained with our cycle-consistent training framework are better than their corresponding baselines.
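The first scheme, naive confidence thresholding, can be sketched as follows. The scoring below (precision, recall and F1 with "failure" as the positive class) is one reasonable way to evaluate such a detector and is an assumption of this sketch, as are the function names, the threshold, and the toy data.

```python
from typing import List, Sequence, Tuple


def flag_failures(confidences: Sequence[float], threshold: float) -> List[bool]:
    """Mark an answer as a predicted failure when its confidence is not above the threshold."""
    return [c <= threshold for c in confidences]


def precision_recall_f1(predicted: Sequence[bool], actual: Sequence[bool]) -> Tuple[float, float, float]:
    """Score binary predictions, treating True as the positive class."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: confidence of the predicted answer and whether it was actually correct.
confidences = [0.9, 0.4, 0.8, 0.2]
was_correct = [True, False, False, False]

predicted_failures = flag_failures(confidences, threshold=0.5)
actual_failures = [not ok for ok in was_correct]
precision, recall, f1 = precision_recall_f1(predicted_failures, actual_failures)
print(precision, round(recall, 3), round(f1, 3))
```

In this toy case the confidently wrong third answer is the one missed failure, which is the kind of error a better calibrated model would make less often.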
We see similar improvements in detecting failures for both BUTD and Pythia models, which shows that our cycle-consistency framework is model-agnostic. This also shows that cycle-consistent training not only makes models robust to linguistic variations, but also allows them to be aware of their failures.
In this paper, we propose a novel model-agnostic training strategy to incorporate cycle consistency in VQA models to make them robust to linguistic variations and aware of their own failures. We also collect a large-scale dataset, VQA-Rephrasings, and propose a consensus metric to measure the robustness of VQA models to linguistic variations of a question. We show that models trained with our training strategy are robust to linguistic variations, and achieve state-of-the-art performance in VQA and VQG on the VQA v2.0 dataset.
-  A. Agrawal, D. Batra, and D. Parikh. Analyzing the behavior of visual question answering models. arXiv preprint arXiv:1606.07356, 2016.
-  A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  H. Ali, Y. Chali, and S. A. Hasan. Automation of question generation from sentences. In Proceedings of QG2010: The Third Workshop on Question Generation, pages 58–67, 2010.
-  P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705, 2016.
-  H. Ben-Younes, R. Cadene, M. Cord, and N. Thome. Mutan: Multimodal tucker fusion for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, 2014.
-  A. Ettinger, S. Rao, H. Daumé III, and E. M. Bender. Towards linguistically generalizable nlp systems: A workshop and shared task. arXiv preprint arXiv:1711.01505, 2017.
-  R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.
-  Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6325–6334. IEEE, 2017.
-  D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  J. R. Hobbs, D. E. Appelt, J. Bear, and M. Tyson. Robust processing of real-world natural-language texts. In Proceedings of the third conference on Applied natural language processing, pages 186–192. Association for Computational Linguistics, 1992.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko. Learning to reason: End-to-end module networks for visual question answering. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 804–813. IEEE, 2017.
-  M. Iyyer, J. Wieting, K. Gimpel, and L. Zettlemoyer. Adversarial example generation with syntactically controlled paraphrase networks. arXiv preprint arXiv:1804.06059, 2018.
-  U. Jain, Z. Zhang, and A. Schwing. Creativity: Generating diverse questions using variational autoencoders. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5415–5424. IEEE, 2017.
-  J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick. Inferring and executing programs for visual reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2989–2998, 2017.
-  K. Kafle, M. Yousefhussien, and C. Kanan. Data augmentation for visual question answering. In Proceedings of the 10th International Conference on Natural Language Generation, pages 198–202, 2017.
-  S. Kalady, A. Elikkottil, and R. Das. Natural language question generation using syntax and keywords. In Proceedings of QG2010: The Third Workshop on Question Generation, 2010.
-  J.-H. Kim, J. Jun, and B.-T. Zhang. Bilinear Attention Networks. arXiv preprint arXiv:1805.07932, 2018.
-  R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302, 2015.
-  R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  Y. Li, T. Cohn, and T. Baldwin. Robust training under linguistic adversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, volume 2, pages 21–27, 2017.
-  Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, X. Wang, and M. Zhou. Visual question generation as dual task of visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
-  F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun. ivqa: Inverse visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8611–8619, 2018.
-  J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297, 2016.
-  B. McCann, N. S. Keskar, C. Xiong, and R. Socher. The natural language decathlon: Multitask learning as question answering, 2018.
-  I. Misra, R. Girshick, R. Fergus, M. Hebert, A. Gupta, and L. van der Maaten. Learning by asking questions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2018.
-  I. M. Mora and S. P. de la Puente. Towards automatic generation of question answer pairs from images.
-  N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, and L. Vanderwende. Generating natural questions about an image. arXiv preprint arXiv:1603.06059, 2016.
-  K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
-  E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  I. V. Serban, A. García-Durán, C. Gulcehre, S. Ahn, S. Chandar, A. Courville, and Y. Bengio. Generating factoid questions with recurrent neural networks: The 30m factoid question-answer corpus. arXiv preprint arXiv:1603.06807, 2016.
-  M. Spranger, J. Suchan, and M. Bhatt. Robust natural language processing-combining reasoning, cognitive semantics and construction grammar for spatial language. arXiv preprint arXiv:1607.05968, 2016.
-  M. Stede. The search for robustness in natural language understanding. Artificial Intelligence Review, 6(4):383–414, 1992.
-  N. Sundaram, T. Brox, and K. Keutzer. Dense point trajectories by gpu-accelerated large displacement optical flow. In European conference on computer vision, pages 438–451. Springer, 2010.
-  D. Tang, N. Duan, Z. Yan, Z. Zhang, Y. Sun, S. Liu, Y. Lv, and M. Zhou. Learning to collaborate for question answering and asking. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 1564–1574, 2018.
-  R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
-  Z. Wang, A. S. Lan, W. Nie, A. E. Waters, P. J. Grimaldi, and R. G. Baraniuk. Qg-net: A data-driven question generation model for educational content. In Proceedings of the Fifth Annual ACM Conference on Learning at Scale, pages 7:1–7:10, 2018.
-  X. Xu, X. Chen, C. Liu, A. Rohrbach, T. Darrell, and D. Song. Fooling vision and language models despite localization and attention mechanism. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4951–4961, 2018.
-  Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 21–29, 2016.
-  Y. Jiang, V. Natarajan, X. Chen, M. Rohrbach, D. Batra, and D. Parikh. Pythia v0.1: the winning entry to the vqa challenge 2018. arXiv preprint arXiv:1807.09956, 2018.
-  P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh. Yin and yang: Balancing and answering binary visual questions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5014–5022, 2016.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
Appendix A Dataset Details
Statistics. Fig 5(a) shows the percentage of words belonging to different Parts-of-Speech tags. The distributions follow similar trends in VQA-Rephrasings and VQA v2.0. This shows that the rephrasings are not obtained by merely adding more adjectives or adverbs to the original question. Fig 5(b) shows the percentage of questions of each length. The average length of questions in VQA-Rephrasings is 7.15, which is slightly higher than the average length of 6.32 in VQA v2.0.
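The length statistics above can be reproduced along the following lines. This is a minimal sketch, assuming whitespace tokenization (the tokenization behind Fig 5(b) is not specified here); the function name `length_stats` and the toy questions are hypothetical and are not drawn from either dataset.

```python
from collections import Counter


def length_stats(questions):
    """Return (average length, {length: percentage of questions}).

    Length is measured in whitespace-separated tokens, which is an
    assumption made for this illustration.
    """
    lengths = [len(q.split()) for q in questions]
    counts = Counter(lengths)
    average = sum(lengths) / len(lengths)
    percentages = {
        n: 100.0 * c / len(lengths) for n, c in sorted(counts.items())
    }
    return average, percentages


# Toy questions, made up for illustration only.
toy_questions = [
    "What color is the cat?",
    "Is it raining?",
    "How many people are in the photo?",
    "What sport is being played?",
]
avg_len, pct_by_len = length_stats(toy_questions)
```

Running the same function over the original questions and over their rephrasings gives two averages and two histograms that can be compared directly.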
Interface. We used a simple web interface to collect rephrasings from human annotators. The interface provided three examples of invalid rephrasings and their corresponding explanations to help human annotators understand the task better. We A/B tested with 50 questions using all 4 combinations of:
-  Showing both valid and invalid rephrasing examples and explanations.
-  Showing only valid and no invalid rephrasing examples and explanations.
-  Showing none of valid and invalid rephrasing examples and explanations.
-  Showing no valid and only invalid rephrasing examples and explanations.
We found (via manual inspection) that the last setup provided the highest-quality data, and used it as our final interface choice.
Examples. Fig 6 shows several qualitative examples from the VQA-Rephrasings dataset. We see that the rephrasings maintain the intent of the original question while varying linguistically.
Appendix B Attention Analysis
Fig 7 qualitatively compares the textual and visual attention (over image regions) for rephrasings of a question. Each row compares predicted answers and attention from a baseline Pythia model and the same Pythia model trained with our framework (Pythia + CC), using two question rephrasings. The first and third rows show the outputs of a Pythia model (baseline), and the second and fourth rows show the outputs of a Pythia model (baseline + CC) trained with our framework. We see that in most examples, the attention over image regions doesn’t vary across rephrasings for models trained with our framework (and the model answers the questions correctly). However, for the baseline model, one can see that minor linguistic changes in the question can result in completely different answers (Row 2, Columns 1 and 3). This qualitatively demonstrates the robustness of models trained with our framework. Since the baseline Pythia model doesn’t include a counting module, it doesn’t perform well on questions requiring counting. As a result, we see that both the baseline and its cycle-consistent counterpart perform poorly on counting questions (Row 5, Columns 1 through 4).
Appendix C Attention Consistency
Intuitively, it seems that training the VQA model to attend over the same image regions for different rephrasings of a question should improve the robustness of the model. We tried to enforce this in our cycle-consistent framework using an additional attention consistency loss.
Recall that for a given image, question and answer, our model consists of a VQA model which takes the question and the image as input, uses the question to attend over image regions, and predicts an answer. We also have a VQG model which uses the predicted answer and the image to generate a question. Intuitively, the VQA model should attend over the same image regions when answering the generated question. In other words, the attention over image regions used by the VQA model to answer the generated question should be close to the attention it used to answer the original question. We added an additional attention consistency loss to the total loss which reduces the norm of the difference between these two attentions.
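As a rough illustration of such an attention consistency term, the sketch below computes the mean squared difference between two attention distributions over the same image regions and adds it to a base loss. The function names, the weight `attn_weight`, and the toy attention vectors are assumptions made for this example; they are not the actual implementation or its settings.

```python
def attention_consistency_loss(attn_original, attn_generated):
    """Mean squared difference between two attention vectors.

    Both inputs hold one weight per image region, in the same region order.
    """
    if len(attn_original) != len(attn_generated):
        raise ValueError("attention vectors must cover the same regions")
    squared = [
        (a - b) ** 2 for a, b in zip(attn_original, attn_generated)
    ]
    return sum(squared) / len(squared)


def add_attention_term(base_loss, attn_loss, attn_weight):
    """Add the weighted attention term to an existing scalar loss.

    attn_weight is a hypothetical weighting hyperparameter.
    """
    return base_loss + attn_weight * attn_loss


# Toy attention over three regions for the original and generated question.
attn_q = [0.7, 0.2, 0.1]
attn_q_generated = [0.5, 0.3, 0.2]
attn_loss = attention_consistency_loss(attn_q, attn_q_generated)
combined = add_attention_term(base_loss=1.0, attn_loss=attn_loss, attn_weight=0.5)
```

The term is zero when the two attention maps coincide and grows as they diverge, so minimizing it pushes the model to look at the same regions for both phrasings.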
However, we found that this leads to a reduction in model performance. Specifically, it reduces the performance of a cycle-consistent Pythia model by 1.34% VQA accuracy when evaluated on the VQA v2.0 validation split (training on the train split only).
We suspect that enforcing attention consistency across rephrasings reduces performance because jointly minimizing a large number of diverse losses (the cross-entropy losses for VQA, the sequence generation loss for VQG, and the mean squared loss for attention consistency) is a hard optimization problem. Concretely identifying why enforcing attention consistency across question rephrasings hurts performance is currently under investigation and is part of future work. We find that naively matching attentions across question rephrasings is not effective in current settings and therefore do not include this in the final model.
Appendix D Question Generation
Fig 8 shows qualitative examples of answer-conditioned questions generated by our VQG model. Our VQG model is able to correctly generate answer-conditioned questions for a wide range of answers, from numbers to colors and even yes/no.
Appendix E Hyperparameters
We use the default hyperparameters as described in publicly available implementations of MUTAN, BUTD, Pythia and BAN. When using these models as base VQA models to train cycle-consistent variants of them, we use the same parameters for the VQA model. For the VQG model we use , , and . While some models use adaptive learning rates for their base VQA models, the VQG model is always trained with a fixed learning rate of . In case of BAN and Pythia, we also clip the gradients whose norm is greater than .
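Gradient clipping by norm can be sketched as follows. This is a minimal illustration, assuming the norm is the global L2 norm over all gradient values (whether clipping is global or per parameter is not stated above); the function name `clip_by_global_norm` and the threshold used in the example are hypothetical and do not correspond to the actual setting.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so that their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    Gradients whose norm is already within the limit are returned unchanged.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


# Toy gradient with norm 5.0, clipped to a hypothetical threshold of 1.0.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
# A gradient already within the threshold is left as is.
unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```

Rescaling preserves the direction of the update and only limits its magnitude, which is the usual motivation for clipping by norm rather than by value.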