Visual Question Answering (VQA) is one of the most challenging computer vision tasks: an algorithm is given a natural language question about an image and is tasked with producing a natural language answer for that question-image pair. Recently, various VQA models [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] have been proposed to tackle this problem, and their main performance measure is accuracy. However, the community has started to realize that accuracy is not the only metric for evaluating model performance. More specifically, these models should also be robust, i.e., their output should not be affected much by small amounts of noise or perturbation to the input. Analyzing model robustness, as well as training robust models, is already a rapidly growing research topic for deep learning models applied to images [12, 13, 14]. However, to the best of our knowledge, no accepted and standardized method exists for measuring the robustness of VQA models. As such, this paper is the first work to analyze VQA models from this point of view by proposing a robustness measure and a standardized large-scale dataset.
The ultimate goal is for VQA models to perform as humans do on the same task. If a human is presented with a question, either alone or accompanied by some highly similar questions, he/she tends to give the same or a very similar answer in both cases. Evidence of this has been reported in the psychology literature. Therefore, when we add or replace some words or phrases of the query question, called the main question, with similar words or phrases, the VQA model should output the same or a very similar answer. In some sense, we consider similar words or phrases as small perturbations or noise to the input, so we say that the model is robust if it still produces the same answer. Note that we define a basic question as a question that is semantically similar to the given main question. Based on evidence from deductive reasoning in human thinking, we treat basic questions as noise. In Figure 3, cases (a) and (b) illustrate the general idea. In case (a), the person may have the answer “Mercedes Benz” in mind. However, in case (b), he/she would start to think about the relations among the two given questions and the candidate answers to form the final answer, which may differ from the final answer in case (a). If the person is given more basic questions, he/she would start to think about all the possible relations among all the provided questions and possible answer candidates. These relationships clearly become more complicated, especially when the additional basic questions have low similarity scores to the main question; in such cases, they will mislead the person. That is to say, those extra basic questions act as large disturbances. Because robustness analysis requires studying the accuracy of VQA models under different noise levels, we need to know how to quantify the level of noise for a given question.
We hypothesize that a basic question with a larger similarity score to the main question injects a smaller amount of noise when it is added to the main question, and vice versa. Our proposed basic question ranking method is one way to quantify and control the strength of this injected noise.
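This hypothesis can be written down in one line; `similarity` here stands in for any ranking score in [0, 1], and the inverse mapping below is an illustrative assumption rather than the paper's exact formulation:

```python
def noise_level(similarity: float) -> float:
    """Map a basic question's similarity score in [0, 1] to an
    injected-noise level: higher similarity -> smaller noise."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    return 1.0 - similarity

# A BQ nearly identical to the MQ injects almost no noise,
# while an unrelated BQ acts as a large disturbance.
assert noise_level(0.95) < noise_level(0.10)
```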
2 Robustness Framework
Inspired by the above reasoning, we propose a novel framework for measuring the robustness of VQA models. Figure 1 depicts the structure of our framework. It contains two modules, a VQA model and a Noise Generator. The Noise Generator, illustrated in Figure 2, takes a plain text main question (MQ) and a plain text basic question dataset (BQD) as input. It starts by ranking the basic questions in BQD by their similarity to MQ using some text similarity ranking method. Then, depending on the required level of noise, it takes the top (e.g., ) ranked BQs and directly concatenates them. The concatenation of these BQs with MQ is the generated noisy question. Instead of feeding the MQ to the VQA model, we replace it with the generated noisy question and measure the accuracy of the output. To measure the robustness of this VQA model, the accuracy with and without the generated noise is compared. To this end, we propose a robustness measure to quantify this comparison.
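The two modules of Figure 1 can be sketched as follows; the helper names (`similarity`, `generate_noisy_question`) and the relative-accuracy-drop form of `robustness` are illustrative placeholders under our assumptions, not the paper's actual implementation:

```python
from typing import Callable, List

def generate_noisy_question(mq: str, bqd: List[str], k: int,
                            similarity: Callable[[str, str], float]) -> str:
    """Noise Generator: rank the basic questions in BQD by their
    similarity to MQ, then concatenate the top-k of them onto MQ."""
    ranked = sorted(bqd, key=lambda bq: similarity(mq, bq), reverse=True)
    return " ".join([mq] + ranked[:k])

def robustness(acc_clean: float, acc_noisy: float) -> float:
    """One simple way to compare accuracy with and without noise:
    the relative accuracy drop (0 = fully robust)."""
    return (acc_clean - acc_noisy) / acc_clean if acc_clean > 0 else 0.0
```

The VQA model is then evaluated on `generate_noisy_question(mq, bqd, k, similarity)` instead of `mq`, and the level of injected noise is controlled through `k` and the ranking.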
For the question ranking method [16, 17], given any two questions, we can use different measures to quantify their similarity and produce a similarity score. Using such similarity measures, we can obtain different rankings of the questions in BQD by their similarity to MQ, where BQs with higher similarity scores to MQ rank higher than those with lower scores. Along these lines, we propose a new question ranking method formulated as an optimization problem and compare it against the rankings produced by seven different yet popular textual similarity measures. We perform this comparison to rank our proposed BQDs, the General Basic Question Dataset (GBQD) and the Yes/No Basic Question Dataset (YNBQD). Furthermore, we evaluate the robustness of six pretrained state-of-the-art VQA models [1, 6, 7, 9]. Finally, extensive experiments show that our proposed method is the best BQD ranking method among them.
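As an illustration of one such textual similarity measure, a BLEU-1-style clipped unigram precision produces a score in [0, 1] that can be used to rank BQs against an MQ. This is a simplified sketch that omits BLEU's brevity penalty and is not the paper's proposed ranking method:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision (the core of BLEU-1): the fraction of
    candidate words that also occur in the reference, with each word's
    count clipped to its count in the reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(c, ref[w]) for w, c in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0
```

Ranking BQD by `unigram_precision(bq, mq)` gives one of the seven baseline orderings; BLEU-2 through BLEU-4, ROUGE, CIDEr, and METEOR each induce their own, generally different, ordering.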
(i) We propose a novel framework to measure the robustness of VQA models and test it on six different models. (ii) We propose a new text-based similarity ranking method and compare it against seven popular similarity metrics: BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE, CIDEr, and METEOR. We then show that our ranking method is the best among them. (iii) We introduce two large-scale basic question datasets: the General Basic Question Dataset (GBQD) and the Yes/No Basic Question Dataset (YNBQD).
This work is supported by competitive research funding from King Abdullah University of Science and Technology (KAUST) and the High Performance Computing Center in KAUST.
-  Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
-  Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on CVPR, pages 39–48, 2016.
-  Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the IEEE International Conference on Computer Vision, pages 1–9, 2015.
-  Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. Image question answering using convolutional neural network with dynamic parameter prediction. In Proceedings of the IEEE Conference on CVPR, pages 30–38, 2016.
-  Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In CVPR, pages 4622–4630, 2016.
-  Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, pages 289–297, 2016.
-  Hedi Ben-younes, Remi Cadene, Matthieu Cord, and Nicolas Thome. Mutan: Multimodal tucker fusion for visual question answering. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
-  Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
-  Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard product for low-rank bilinear pooling. In 5th International Conference on Learning Representations, 2017.
-  Mateusz Malinowski and Mario Fritz. Towards a visual turing challenge. arXiv preprint arXiv:1410.8027, 2014.
-  Donald Geman, Stuart Geman, Neil Hallonquist, and Laurent Younes. Visual turing test for computer vision systems. Proceedings of the National Academy of Sciences, 112(12):3618–3623, 2015.
-  Alhussein Fawzi, Seyed Mohsen Moosavi Dezfooli, and Pascal Frossard. A geometric perspective on the robustness of deep networks. Technical report, Institute of Electrical and Electronics Engineers, 2017.
-  Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39–57. IEEE, 2017.
-  Huan Xu, Constantine Caramanis, and Shie Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(Jul):1485–1510, 2009.
-  Lance J Rips. The psychology of proof: Deductive reasoning in human thinking. Mit Press, 1994.
-  Jia-Hong Huang, Modar Alfadly, and Bernard Ghanem. Robustness analysis of visual qa models by basic questions. arXiv preprint arXiv:1709.04625, 2017.
-  Jia-Hong Huang, Modar Alfadly, and Bernard Ghanem. Vqabq: Visual question answering by basic questions. In Computer Vision and Pattern Recognition Workshop (CVPRW), 2017.
-  Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
-  Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL-04 workshop, volume 8. Barcelona, Spain, 2004.
-  Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
-  Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, volume 29, pages 65–72, 2005.