Roses Are Red, Violets Are Blue... but Should VQA Expect Them To?

06/09/2020 ∙ by Corentin Kervadec, et al. ∙ INSA Lyon Orange 0

Reliability on rare events is an important requirement for systems based on machine learning. In this work we focus on Visual Question Answering (VQA), where, in spite of recent efforts, datasets remain imbalanced, causing shortcomings of current models: a tendency to overly exploit dataset biases and a struggle to generalise to unseen associations of concepts. We focus on a systemic evaluation of model error distributions and address fundamental questions: How is the prediction error distributed? What is the prediction accuracy on infrequent vs. frequent concepts? We design a new benchmark based on a fine-grained reorganization of the GQA dataset [1], which allows these questions to be answered precisely. It introduces distribution shifts in both validation and test splits, which are defined on question groups and are thus tailored to each question. We perform a large-scale study and experimentally demonstrate that several state-of-the-art VQA models, even those specifically designed for bias reduction, fail to address questions involving infrequent concepts. Furthermore, we show that the high accuracy obtained on the frequent concepts alone mechanically increases overall accuracy, covering up the true behavior of current VQA models.




1 Introduction

Visual Question Answering (VQA), i.e. the task of answering a question posed over an image, is often seen as a testbed for the capability of learning-based systems to perform high-level reasoning. This multimodal (language and image) problem is notorious for its large diversity meaning that VQA models are required to learn various high-level general representations of concepts of the physical world as well as their interactions.

Efforts to learn the necessary high-level reasoning from large-scale datasets depend on the absence of harmful biases in the data, which could provide unwanted shortcuts to learning in the form of “Clever Hans” effects. Unfortunately, and in spite of recent efforts [goyal2017making, hudson2019gqa], most VQA datasets remain very imbalanced. Common concepts, e.g. the presence of a “red rose”, are significantly more frequent than out-of-context concepts like the presence of a “zebra in a city”. This causes models to rely overly on biases, hindering generalisation [cadene2019rubi, clark2019don]. Despite a general consensus on this diagnosis, systemic evaluations of error distributions are rare. In particular, overall accuracy is still the main, and often the only, metric used to evaluate models and methods, although it is clearly insufficient. Several questions remain open. How is error distributed? Are true positives due to reasoning or to exploitation of bias? What is the prediction accuracy on infrequent vs. frequent concepts? How can we validate models in Out-Of-Distribution (OOD) settings?

(a) (b)
Figure 1: We re-organize the GQA dataset [hudson2019gqa] in a fine-grained way: (a) the benchmark contains a distribution shift in validation and test, allowing to validate and evaluate in OOD settings; (b) our distribution shifts are tailored to different question groups with highly imbalanced distributions. Illustrated here: the group raising questions on rose colors.

In this work we propose a new benchmark and a study of State-Of-The-Art (SOTA) VQA models, which allow these questions to be answered precisely. The proposed evaluation protocol is complementary to existing ones, but allows a better diagnosis of current VQA performance. The benchmark comprises (i) a new fine-grained reorganization of the GQA dataset [hudson2019gqa] introducing distribution shifts in both validation and test sets (see Figure 1-a); (ii) a set of evaluation metrics; (iii) new evaluation plots illustrating the reasoning behavior of VQA models on different operating points. The choice of the GQA dataset is motivated by its useful structuring into question groups, which makes it possible to capture biases precisely, to select groups with strong biases and to create distribution shifts tailored to the exact nature of each question (see Figure 1-b). It also makes it possible to analyze how errors are distributed over different associations of concepts according to their frequency in the dataset.

We validate the benchmark with a large study and we experimentally demonstrate that several SOTA VQA models fail to address questions involving infrequent concepts. We also test several SOTA methods for bias reduction and come to similar conclusions. Furthermore, we show that the high accuracy obtained on the frequent concepts alone is mechanically increasing overall accuracy, covering up the true behavior of current VQA models.

The contributions of this paper are summarized as follows:

  1. We propose and make public a new fine-grained re-organization of the GQA dataset and a set of the respective evaluation metrics, allowing us to precisely evaluate the reasoning behavior of VQA models and to characterise and visualise their generalisation behavior on different operating points w.r.t. distribution shifts.

  2. Compared to competing benchmarks, our dataset features distribution shifts for both validation and test, making it possible to validate models under OOD conditions.

  3. In a large study, we evaluate several recent VQA models and show that they struggle to generalise in OOD conditions; we also test several SOTA bias reduction methods and show that there is still room for improvement in addressing bias in VQA.

2 Related Work

VQA corpuses — One of the first large-scale VQA datasets was VQA1 [antol2015vqa] gathering about 76K questions posed over 25K realistic images. It started a new task, but was soon found to suffer from biases. The authors of [goyal2017making] pointed to strong imbalances among the presented answers and proposed the second (improved) version of the dataset: VQA2. In parallel, [johnson2017clevr] introduced the fully synthetic CLEVR dataset, conceived to evaluate reasoning capabilities in a simple environment. Although it is synthetic, its strong points are its detailed and structured annotations. The authors of [hudson2019gqa] adapted CLEVR to real-world images and constructed the GQA dataset, gathering 1.7M balanced questions. As this corpus is automatically generated, it offers better control over the dataset statistics.

Attempts to reduce the bias-dependency — Despite efforts to design complex architectures, VQA models suffer from significant generalisation shortcomings. As shown in [agrawal2016analyzing], VQA models tend to answer the questions without using the image, and even when they do, they do not always exploit the relevant visual regions [vqahat]. Moreover, VQA models tend to overly rely on dataset biases [hendricks2018women], and are not able to generalise to unseen distributions [vqa-cp]. In this context, several methods have been designed to alleviate this dependency on dataset biases. For example, one can set up an adversarial game against a question-only adversary to regularise training [ramakrishnan2018overcoming]. In a similar way, RUBi [cadene2019rubi] uses a question-only branch in addition to a base model during training to prevent it from learning textual biases. At test time, the question-only branch is omitted. In concurrent work, [clark2019don] adopted a similar strategy by regularizing the model predictions using question type statistics from the training set. Other works [wu2019self, selvaraju2019taking] attempted to force VQA models to attend to the most significant visual regions (from a human perspective). While these methods show promising results when evaluated on unseen distributions [vqa-cp], they generally slightly degrade performance on the standard benchmarks, which tend to favor models relying on dataset biases.

Reinventing VQA evaluation — In parallel to the design of more and more complex but biased VQA architectures, we observe the need for new ways to evaluate VQA models. Early work [malinowski2014multi] proposed to evaluate the answer prediction using a soft evaluation score based on a lexical database. However, this metric has been replaced by a hard classification score, more prone to favor biased predictions, but easier to use in practice. The authors of GQA [hudson2019gqa] proposed several additional metrics along with their dataset, namely: consistency, plausibility, validity, distribution, etc. These metrics give better insight into VQA performance, but do not evaluate a model's ability to predict correct answers in OOD settings, especially when applied to the original balanced GQA dataset. To evaluate the generalisation capability of Neural State Machines, [hudson2019learning] proposed interesting splits of GQA. In particular, they reorganized the train and validation splits to separately evaluate how well the models generalize on the content of the visual scene, and on the linguistic structure of the question. The authors of [bahdanauclosure] analysed the generalisation to unseen associations of known linguistic structures using their CLOSURE benchmark. They demonstrated that SOTA models (including those which were explicitly designed to overcome this issue) fail to generalise in these conditions. This confirms the need for carefully designed benchmarks to evaluate the true capabilities of VQA models. In this context, [vqa-cp] introduced a reorganisation of the VQA2 [goyal2017making] dataset splits, namely VQA-CP2, where the training distribution is made explicitly different from the one in the test. This setup makes it possible to evaluate generalisation on an unseen distribution.
However, as shown in Section 3, our proposed benchmark is complementary and conceptually different from VQA-CP2, especially in its construction process, since it introduces structured distribution shifts defined on local question groups.

3 GQA-OOD: a benchmark for OOD settings

We introduce a new VQA benchmark named GQA-OOD designed to evaluate VQA models and algorithms in Out-Of-Distribution (OOD) configurations. We here define OOD samples as rare events, measured w.r.t. a base distribution, e.g. a training distribution. It is worth noting that these rare events might involve concepts which are also present in the training set. As an illustration, consider the question: ‘What color is this rose?’. If the image represents a rose, then red would be a common color, but in an OOD setting, frequent (correct) test answers would be, for instance, blue, requiring models to reason to provide the correct answer. We insist that in our benchmark this shift in distributions is not global and depends on the context. If the context changes, and the flower type is a violet, then a (correct) OOD answer would now be red instead of blue.

The GQA-OOD benchmark consists of a dataset and new evaluation metrics. The dataset itself is based on the existing GQA [hudson2019gqa] dataset, which provides more fine-grained annotations compared to the competing VQA2 [goyal2017making]. GQA questions have been automatically generated from scene graphs, which allows better control of the question context. Figure 1-a shows how the proposed protocol compares to the existing GQA protocol: the two share the same (existing) training set, but we introduce fine-grained shifts into both the validation and the test sets, applying the same process as further developed below (we use version 1.2 of the GQA dataset [hudson2019gqa]).

The shifted subsets have been constructed in steps: (i) dividing questions into groups according to their contexts; (ii) extracting the most imbalanced question groups considering their answer distributions; (iii) selecting OOD samples among the remaining questions.

Question groups — To structure the process introducing the distribution shift, we use the notion of local groups provided in the GQA annotation. They allow to precisely define the type of a question, e.g‘What color …?’, ‘Where is …?’, etc. They also depend on the specific concept related to the question, e.g‘zebra’, ‘violet’, etc. There is a total of roughly local groups related to about questions in the GQA validation split. We use the balanced version of GQA, whose question distribution has been smoothed in order to obtain a more uniform answer distribution. However, this does not impact the imbalanced nature of the dataset, which is often due to real world tendencies, e.g. that ‘roses are red’.

Measuring group imbalance — We extract a subset of the most imbalanced question groups, as we are interested in evaluating the prediction error specifically in contexts where shifts in distribution are meaningful and strong. We measure the balance through Shannon entropy, given as $e(x) = -\sum_{i=1}^{d} p(x_i) \log p(x_i)$, where $p(x_i)$ is the estimated probability of class $i$ and $d$ is the number of answer classes in the group. As entropy depends on the number of answer classes, which is highly variable between different question groups, we normalize entropy w.r.t. the number of possible answers in the group: $\bar{e}(x) = e(x) / \log(d)$, where $\log(d)$ is equal to the entropy of a uniform distribution of size $d$. Normalized entropy thus measures how close the distribution is to a uniform distribution of the same dimension. On a side note, some question groups have a dimension $d = 1$, i.e. only one possible answer has been found. As they do not represent a large proportion of the total number of questions, we automatically discard them. Finally, we keep groups with a normalised entropy smaller than a threshold, which we set empirically. This selects all the questions of the GQA-OOD benchmark, but further work has been done in order to select specific answer classes for each group.
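The group selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the toy groups and the threshold value `TOY_THRESHOLD` are assumptions made for the example, and the entropy follows the standard Shannon definition normalised by the logarithm of the number of answer classes.

```python
import math
from collections import Counter


def normalized_entropy(answers):
    """Shannon entropy of the answer distribution divided by log(d).

    Returns None for groups with a single answer class (d = 1), where the
    normalisation is undefined; such groups are discarded.
    """
    counts = Counter(answers)
    d = len(counts)
    if d < 2:
        return None
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(d)


def select_imbalanced_groups(groups, threshold):
    """Keep the groups whose normalised entropy is below `threshold`."""
    selected = {}
    for name, answers in groups.items():
        h = normalized_entropy(answers)
        if h is not None and h < threshold:
            selected[name] = answers
    return selected


# Toy question groups (hypothetical data, for illustration only).
groups = {
    "rose-color": ["red"] * 8 + ["white", "blue"],  # strongly imbalanced
    "coin-side": ["heads", "tails"] * 5,            # perfectly balanced
    "sky-color": ["blue"] * 4,                      # single answer class
}
TOY_THRESHOLD = 0.9  # hypothetical value, not taken from the text
print(sorted(select_imbalanced_groups(groups, TOY_THRESHOLD)))
```

In this toy example only the imbalanced group survives: the balanced group has a normalised entropy of 1, and the single-answer group is discarded before thresholding.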

OOD setting and metrics — We introduce a shift in distribution by selecting a subset of answer classes for each question group according to their frequencies, and introduce three different metrics according to which classes are used for evaluation. All these metrics are defined over the aforementioned imbalanced local groups.

  • Acc-tail: the accuracy on OOD samples, which are the samples of the tail of the answer class distribution, i.e. the rarest answers given the context. We define the tail classes as classes $i$ with $|a_i| \leq \alpha \mu(a)$, where $|a_i|$ is the number of samples belonging to class $i$ and $\mu(a)$ is the average sample count for the group. We set the parameter $\alpha$ empirically, and in Section 4.1 we analyse and illustrate the impact of the choice of $\alpha$ on Acc-tail.

  • Acc-head: the accuracy on the head of distribution for each local group (by the head of the group we understand the set difference between the whole group and its tail).

  • Acc-all: the overall (classical) accuracy over all GQA-OOD samples.
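The three metrics can be illustrated with a short Python sketch. It assumes that each evaluated sample is a `(group, ground_truth, prediction)` triple and that class counts are computed on the evaluated samples themselves; both are assumptions of this example, as are the function names and the tail-size parameter name `alpha`.

```python
from collections import Counter


def tail_classes(answers, alpha):
    """Answer classes whose count is at most alpha times the mean class count."""
    counts = Counter(answers)
    mean_count = sum(counts.values()) / len(counts)
    return {a for a, c in counts.items() if c <= alpha * mean_count}


def ood_accuracies(samples, alpha):
    """Compute acc-tail, acc-head and acc-all over (group, truth, pred) triples."""
    by_group = {}
    for group, truth, _ in samples:
        by_group.setdefault(group, []).append(truth)
    tails = {g: tail_classes(a, alpha) for g, a in by_group.items()}

    hits = {"tail": 0, "head": 0}
    totals = {"tail": 0, "head": 0}
    for group, truth, pred in samples:
        # The head is the set difference between the group and its tail.
        part = "tail" if truth in tails[group] else "head"
        totals[part] += 1
        hits[part] += int(pred == truth)

    def ratio(h, t):
        return h / t if t else float("nan")

    return {
        "acc-tail": ratio(hits["tail"], totals["tail"]),
        "acc-head": ratio(hits["head"], totals["head"]),
        "acc-all": ratio(sum(hits.values()), sum(totals.values())),
    }


# Toy group: a model that always answers "red" (hypothetical data).
truths = ["red"] * 6 + ["white", "blue"]
samples = [("rose-color", t, "red") for t in truths]
print(ood_accuracies(samples, alpha=1.0))
```

On this toy group the always-"red" model is perfect on the head, fails on every tail sample, and still reaches 75% overall accuracy, which is the masking effect the metrics are meant to expose.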

Table 1 provides statistics of the proposed benchmark. We also analysed the nature, distribution and diversity of the questions w.r.t. GQA [hudson2019gqa], cf. supplementary material (Section A).

(a)
Dataset | Split | #Quest. | #Groups | #Imgs
GQA-OOD | val | | |
GQA-OOD | testdev | | 471 | 388
GQA | val | | |
GQA | testdev | | | 398
(b)
Split | Subset | #Quest. | #Groups | #Imgs
val | head | | |
val | tail | | |
testdev | head | | 471 | 365
testdev | tail | | 471 | 330
Table 1: Statistics of the proposed dataset: (a) GQA-OOD vs. GQA; (b) head vs. tail split.

Difference with VQA-CP2 — The excellent VQA-CP2 [vqa-cp] dataset was the first of its kind and paved the way for follow-up work on bias reduction methods in VQA. However, its construction is conceptually different from our work, partially due to the restrictions of the base dataset VQA2 w.r.t. GQA, but also due to key design choices.

Lacking annotations on group structure in the base dataset, in VQA-CP questions are grouped together according to their first words and the same ground-truth answer. The shift is created by avoiding repeated types between splits. In contrast, our proposed GQA-OOD dataset allows fine-grained analysis of the generalisation behavior of a VQA model by (i) question group, and via (ii) different metrics corresponding to different amounts of shifts (acc-tail vs. acc-head), and (iii) even through the possibility of continuous evaluation along different operating points (see Figure 2).

VQA-CP2 comprises two splits only (train+test), lacking the possibility of validating model hyper-parameters. Most techniques therefore seem to optimize their hyper-parameters on the test split [cadene2019rubi, clark2019don, ramakrishnan2018overcoming, wu2019self, selvaraju2019taking], which should be frowned upon, as the evaluation overfits on the test set, or, alternatively, validate on a subset of train which does not include a shift [teney2020Unshuffle]. Our GQA-OOD dataset contains a validation set with a shift w.r.t. the train set, which makes it possible to validate hyper-parameters in OOD settings.

Finally, unlike VQA-CP, the proposed GQA-OOD dataset requires models to be trained on the existing GQA train split. This requires models to reduce bias in their test results while being exposed to natural tendencies and biases captured in the training corpus, favoring work on bias reduction through methodology instead of through cleaning of training data.

4 Experiments

We evaluated several SOTA models on the proposed GQA-OOD benchmark and compared with evaluations on the benchmarks VQA2 [goyal2017making], GQA [hudson2019gqa] and VQA-CP2 [vqa-cp]. The selected line-up includes strong models with object-level attention and one Transformer [vaswani2017attention]-based model, as well as two blind baseline models (more training details are given in the supplementary material section B):

Question Prior — this blind baseline returns the most probable answer estimated from training set statistics. Following [vqa-cp, goyal2017making], we use the question types priors when evaluating on VQA-CP and VQA2. For GQA, we use the training global group priors.
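As an illustration of such a blind prior baseline, the following sketch returns, for each question group, the most frequent answer seen in training. The data layout (pairs of group and answer) and the function name are assumptions of this example, not the authors' implementation.

```python
from collections import Counter, defaultdict


def fit_question_prior(train_samples):
    """Map each question group to its most frequent training answer.

    train_samples: iterable of (group, answer) pairs.
    """
    counts = defaultdict(Counter)
    for group, answer in train_samples:
        counts[group][answer] += 1
    return {g: c.most_common(1)[0][0] for g, c in counts.items()}


# Toy training set (hypothetical data).
train = [
    ("rose-color", "red"),
    ("rose-color", "red"),
    ("rose-color", "blue"),
    ("zebra-place", "field"),
]
prior = fit_question_prior(train)
print(prior["rose-color"])     # most frequent answer for this group
print(prior.get("cat-color"))  # unseen group: no prior available
```

Such a baseline never looks at the image, so by construction it can only be right on the most frequent answer of each group.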

LSTM [antol2015vqa] — this blind baseline takes GloVe embeddings  [pennington2014glove] and encodes them using an LSTM [hochreiter1997long] followed by a feed-forward network.

BUTD [anderson2018bottom] — a strong VQA baseline based on object-level attention, in particular, bounding boxes with dense visual feature vectors extracted from the image using an object detector.

MCAN [yu2019deep] — this SOTA approach is based on a Transformer [vaswani2017attention] architecture and designed to model both intra-modality interactions and the inter-modality ones. It allows complex multi-hop reasoning through several stacked self-attention blocks. In our experiments, we use the -layers version of MCAN.

In addition to these four VQA baselines, we also evaluate three bias-reduction methods widely studied on the VQA-CP [vqa-cp] dataset. These methods are designed to be model-agnostic [cadene2019rubi, clark2019don]. In this work, we use them in conjunction with the BUTD architecture.

RUBi [cadene2019rubi] — adds a question-only branch to the base model during training to prevent it from learning question biases. This branch is omitted during evaluation. To better analyze bias dependencies, we also study a modified version of RUBi, which we refer to as RUBi+QB below. In this variant, the question-only branch is kept during evaluation.

BP [clark2019don] — is similar to RUBi but differs by directly taking training set statistics to infer question type biases during training (when training on VQA2 and VQA-CP2, biases are computed over question types; on GQA, biases are computed over local groups). The question type biases are fused with the base model predictions using a product of experts [clark2019don], and removed during testing.
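A product of experts multiplies the two distributions and renormalises, which is usually done by adding log-probabilities. The sketch below is a generic formulation under that assumption; the function names, the `eps` smoothing term and the toy numbers are illustrative and are not taken from the BP implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def product_of_experts(model_logits, bias_probs, eps=1e-12):
    """Fuse model predictions with a fixed bias distribution.

    Returns renormalised log-probabilities of p_model * p_bias.
    `eps` (an assumption of this sketch) avoids log(0) for unseen answers.
    """
    log_model = log_softmax(model_logits)
    combined = [lm + math.log(b + eps) for lm, b in zip(log_model, bias_probs)]
    return log_softmax(combined)


# Training time: the fused distribution is used in the loss.
bias = [0.9, 0.1]          # hypothetical bias over two answers
model_logits = [0.0, 0.0]  # a model with no preference yet
fused = [math.exp(v) for v in product_of_experts(model_logits, bias)]
print([round(p, 3) for p in fused])

# Test time: the bias is removed and only the base model is used.
test_probs = [math.exp(v) for v in log_softmax(model_logits)]
print(test_probs)
```

Because the bias already explains the frequent answers in the fused distribution, the base model receives less training signal for reproducing them.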

LMH [clark2019don] — is an improved version of BP [clark2019don]. In this version, the question bias is dynamically weighted by the base model in order to control its influence. In the original setup, an entropy penalty is added to the loss to prevent the model from ignoring the bias. Nevertheless, when training on GQA, we obtain better results without this penalty (see supplementary material, Section B, for details).

Training details for the aforementioned models and methods can be found in the supplementary material (Section B). Models evaluated on GQA-OOD are trained on the training set of GQA (balanced) and validated on the validation split of GQA-OOD. When available, we provide the standard deviation computed over at least four different seeds.

4.1 Analysing the prediction error distribution

The GQA-OOD benchmark allows us to analyse the prediction error distributions of various VQA models. Table 2 shows the results obtained when evaluating several VQA models on GQA-OOD. We provide the three metrics introduced in Section 3: acc-tail, acc-head and acc-all. We also report the relative difference between acc-head and acc-tail, (acc-head − acc-tail) / acc-tail, in the last column of Table 2, to illustrate how imbalanced the prediction error is between frequent and rare answers.
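The last column of Table 2 is consistent with the relative gap (acc-head − acc-tail) / acc-tail, expressed in percent. The following check recomputes it from the reported acc-tail and acc-head values; the formula is inferred from the table values, and the function name is an assumption of this example.

```python
def relative_gap(acc_tail, acc_head):
    """Relative difference between head and tail accuracy, in percent."""
    return 100.0 * (acc_head - acc_tail) / acc_tail


# (acc-tail, acc-head, reported last column), copied from Table 2.
table2 = {
    "Quest. Prior": (17.8, 24.1, 35.4),
    "LSTM": (24.0, 34.8, 45.0),
    "BUTD": (42.1, 49.1, 16.6),
    "MCAN": (46.5, 53.4, 14.8),
    "BUTD+RUBi+QB": (42.1, 49.4, 17.3),
    "BUTD+RUBi": (35.7, 40.8, 14.3),
    "BUTD+LMH": (32.2, 35.9, 11.5),
    "BUTD+BP": (30.8, 34.5, 12.0),
}

for name, (tail, head, reported) in table2.items():
    print(f"{name}: {relative_gap(tail, head):.1f} (reported {reported})")
```

A larger value means that a larger share of the model's accuracy is concentrated on frequent answers.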

Models fail on rare question-answer pairs — As one can notice, the two blind models (Question Prior and LSTM in Table 2-a) obtain the largest gap between acc-tail and acc-head. Indeed, as they do not have access to the image, they rely on question biases. The relative difference between acc-head and acc-tail indicates that both BUTD and MCAN also struggle (to a lesser extent) to generalise on the less frequent question-answer pairs. However, MCAN outperforms BUTD on both metrics, confirming the superiority of the Transformer-based architecture.

Overall accuracy is dominated by frequent question-answer pairs — It is interesting to observe that acc-all, which is the standard metric in VQA, does not reflect the model's true performance, since it is mechanically increased by the high score obtained on the most frequent question-answer pairs. This confirms the need for a specific evaluation in OOD settings, as provided by acc-tail.

Bias-reduction methods are not effective on rare samples — Surprisingly, none of the three bias-reduction methods (namely RUBi [cadene2019rubi], BP [clark2019don] and LMH [clark2019don]) succeeds in improving acc-tail (cf. Table 2-b). They even seem to deteriorate acc-head. This is unexpected, as these methods have been conceived to overcome the dependency on question type biases. In order to better understand these results, we evaluate RUBi while keeping the question-only branch during testing (RUBi+QB). As expected, it outperforms RUBi on acc-head, showing it has better captured frequent patterns. However, it also outperforms RUBi in the OOD setting, demonstrating that preventing a model from learning frequent patterns does not necessarily increase performance on rare samples.

(a)
Model | acc-all | acc-tail | acc-head | rel. diff.
Quest. Prior | 21.6 | 17.8 | 24.1 | 35.4
LSTM [antol2015vqa] | 30.7 | 24.0 | 34.8 | 45.0
BUTD [anderson2018bottom] | 46.4 | 42.1 | 49.1 | 16.6
MCAN [yu2019deep] | 50.8 | 46.5 | 53.4 | 14.8
(b)
Technique | acc-all | acc-tail | acc-head | rel. diff.
BUTD [anderson2018bottom] | 46.4 | 42.1 | 49.1 | 16.6
+RUBi+QB | 46.7 | 42.1 | 49.4 | 17.3
+RUBi [cadene2019rubi] | 38.8 | 35.7 | 40.8 | 14.3
+LMH [clark2019don] | 34.5 | 32.2 | 35.9 | 11.5
+BP [clark2019don] | 33.1 | 30.8 | 34.5 | 12.0
Table 2: Comparison of several methods on the GQA-OOD testdev split. Acc-tail: OOD settings, Acc-head: accuracy on the most probable answers given the context. The last column gives the relative difference (acc-head − acc-tail) / acc-tail. (a) Different models; (b) different bias reduction techniques on the BUTD [anderson2018bottom] model. All scores in %.

Visualising the generalisation behavior — In Figure 2, we analyse how the prediction error is distributed on the validation split. In particular, we plot acc-tail as a function of the tail's size, which is controlled using the tail-size parameter described in Section 3. As shown in Figure 2-a, LSTM, BUTD [anderson2018bottom] and MCAN [yu2019deep] follow the same dynamic: starting from a tail size which represents roughly half of the question-answer pairs, the accuracy starts to decrease linearly until reaching a dramatically low score, well below the overall accuracy. This demonstrates the real need to take the prediction error distribution into consideration when benchmarking VQA models. We perform the same analysis on bias-reduction methods in Figure 2-b. For BP [clark2019don], LMH [clark2019don] and, to a lesser extent, RUBi [cadene2019rubi], we observe that the right side of the curve has been flattened. This demonstrates that the overall accuracy, dominated by frequent question-answer pairs, has been reduced by the bias-reduction method. On the other hand, the left side of the curve, corresponding to the rare samples, remains almost unchanged. This reveals that these methods have somewhat succeeded in preventing the base model from learning dataset biases, as the error distribution between frequent and rare samples is more uniform. As a comparison, the LSTM model in Figure 2-a performs worse than BUTD [anderson2018bottom] but preserves the same frequent/rare imbalance. We observe that RUBi+QB responds the same way as BUTD [anderson2018bottom], confirming the effect of the bias-reduction method. Thus, although it is clear that these methods succeed in preventing the base model from learning salient properties of the training set, we observe that they do not help it to learn subtle distributions.

Figure 2: Performance for different definitions of the tail distribution (values of the tail-size parameter): (a) different VQA models; (b) different bias reduction techniques. The x-axis is in log-scale.
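A curve of this kind can be produced by recomputing the tail for several values of the tail-size parameter. The sketch below is an assumption-laden illustration: it takes `(group, ground_truth, prediction)` triples, defines the tail as the classes whose count is at most `alpha` times the mean class count of their group, and skips values of `alpha` for which the tail is empty.

```python
from collections import Counter


def tail_curve(samples, alphas):
    """Return (alpha, tail_fraction, acc_tail) for each alpha with a non-empty tail.

    samples: list of (group, ground_truth, prediction) triples.
    """
    counts = {}
    for group, truth, _ in samples:
        counts.setdefault(group, Counter())[truth] += 1
    means = {g: sum(c.values()) / len(c) for g, c in counts.items()}

    curve = []
    for alpha in alphas:
        # Samples whose ground-truth class is rare enough for this alpha.
        tail = [
            (truth, pred)
            for group, truth, pred in samples
            if counts[group][truth] <= alpha * means[group]
        ]
        if tail:
            acc = sum(truth == pred for truth, pred in tail) / len(tail)
            curve.append((alpha, len(tail) / len(samples), acc))
    return curve


# Toy group and a model that always answers "red" (hypothetical data).
truths = ["red"] * 6 + ["white", "blue"]
samples = [("rose-color", t, "red") for t in truths]
for alpha, fraction, acc in tail_curve(samples, [0.3, 0.5, 3.0]):
    print(f"alpha={alpha}: tail covers {fraction:.0%} of samples, acc-tail={acc:.2f}")
```

With a large `alpha` the tail covers the whole group and acc-tail equals the overall accuracy; shrinking `alpha` restricts the evaluation to ever rarer answers.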

4.2 Comparison with other benchmarks

Below, we compare GQA-OOD with the following three standard VQA benchmarks:

GQA [hudson2019gqa] (balanced version) is a VQA dataset gathering about 1.7M question-answer pairs automatically generated from real images. Among all VQA datasets, GQA has the richest annotations, making it possible to evaluate models with complementary metrics: consistency, validity, plausibility, etc. [hudson2019gqa]. In this paper, we only exploit the overall accuracy and the distribution score on the GQA [hudson2019gqa] testdev split, as the other metrics are unrelated to the subject of our paper. The distribution score is obtained by measuring the match between the true answer distribution and the model's predicted distribution.

VQA2 [goyal2017making] is composed of questions posed by humans about real-world images. The dataset contains 265K images with at least 3 questions for each. Each question is annotated with 10 ground-truth answers. We compare our benchmark with the overall accuracy on the VQA2 [goyal2017making] validation split.

VQA-CP2 [vqa-cp] has been constructed by reorganising the training and validation splits of the VQA2 [goyal2017making] dataset in order to explicitly make the distributions of answers different between the training and test splits. The VQA-CP2 dataset has been designed to measure the sensitivity of a VQA model to the language bias, in other words, as a test for measuring the ability of a model to generalise on unseen data.

Comparison with GQA and VQA2 — In Table 3, we compare our acc-tail score with the other benchmarks. One can observe that overall accuracy on GQA and VQA2 is not sufficient to fully characterise the VQA performance of the evaluated models. Indeed, our evaluation in OOD settings is the only one to reveal that even SOTA models struggle on infrequent question-answer pairs. Thus, the best-performing model MCAN (reaching 56.3% on GQA) loses about 10 points in the OOD setting. Finally, the error distribution measure defined in GQA is hardly interpretable in terms of VQA performance compared to acc-tail, which makes it difficult to exploit.

Model | VQA2 overall | GQA overall | GQA distribution | VQA-CP2 overall | GQA-OOD acc-tail
Question Prior | 32.1% | 27.0% | 55.62 | 8.8% | 17.8%
LSTM [antol2015vqa] | 43.0% | 39.1% | 3.6 | 22.1% | 24.0%
BUTD [anderson2018bottom] | 63.5% | 51.6% | 1.8 | 40.1% | 42.1%
MCAN [yu2019deep] | 66.1% | 56.3% | 1.59 | 42.5% | 46.5%
BUTD [anderson2018bottom] | 63.5% | 51.6% | 1.8 | 40.1% | 42.1%
+RUBi+QB | - | 51.9% | 1.68 | 47.6% | 42.1%
+RUBi [cadene2019rubi] | 61.2% | 43.6% | 1.88 | 44.2% | 35.7%
+LMH [clark2019don] | 56.4% | 39.7% | 2.1 | 52.0% | 32.2%
+BP [clark2019don] | 63.2% | 39.6% | 2.2 | 39.9% | 30.8%
Table 3: We compare the proposed acc-tail metric with other benchmarks. Results are computed on the testdev split of GQA-OOD and GQA [hudson2019gqa], the test split of VQA-CP2 [vqa-cp] and the VQA2 [goyal2017making] validation split. Values in italic have been trained and tested by ourselves.

Comparison with VQA-CP2 [vqa-cp] — When comparing acc-tail and VQA-CP2 overall accuracy, we observe that the four VQA models obtain similar scores. However, VQA-CP2 [vqa-cp] penalises the MCAN [yu2019deep] architecture more. We note a completely different behavior when comparing scores obtained with bias-reduction methods. Indeed, while these methods do not improve the scores in the OOD setting (cf. Section 4.1), they achieve strong performance on VQA-CP2 [vqa-cp]. In this context, the score of LMH [clark2019don] is the most notable. Whereas it achieves the highest overall accuracy on VQA-CP2 (52.0% in Table 3), it obtains one of the lowest acc-tail scores on GQA-OOD (32.2%). To a lesser extent, we observe the same behaviour for RUBi [cadene2019rubi] and BP [clark2019don].

In summary, VQA-CP2 [vqa-cp] measures to what extent a VQA model relies on question type biases. It demonstrates that VQA systems have difficulty generalising to unseen distributions, which illustrates their sensitivity to biases. However, the VQA-CP2 evaluation does not reflect the model's behaviour on rare question-answer pairs. Moreover, evaluating on VQA-CP2 [vqa-cp] requires training on its specific training set. Not only is this restrictive, but, more importantly, it results in the emergence of models specifically conceived to address this dataset. Finally, unlike our GQA-OOD, VQA-CP2 does not have a dedicated validation split, forcing researchers to validate their models directly on the test split [cadene2019rubi, clark2019don, ramakrishnan2018overcoming, wu2019self, selvaraju2019taking] (which obviously has a negative impact on the objectivity of such an evaluation).


Going beyond previous attempts to reduce the influence of dataset biases in VQA evaluation, the proposed GQA-OOD benchmark makes it possible to evaluate both (1) whether VQA models have absorbed the overall tendencies of the training data, and (2) how well they generalise to rare (or unseen) question-answer pairs. This was made possible (i) by a careful choice of the most imbalanced question groups, (ii) by designing a new set of evaluation metrics and finally, (iii) by leaving researchers the possibility of controlling the amount of distribution shift via the tail-size hyper-parameter. Our experiments have shown that neither the conventional SOTA VQA models nor the dedicated bias reduction methods succeed in all aspects of the proposed evaluation benchmark. This leaves considerable room for improvement in future work, and we hope that our GQA-OOD benchmark will contribute to the emergence of new VQA models which are less prone to learning spurious biases and thus more reliable in real-world scenarios.


Appendix A Dataset statistics

We provide some analysis and statistics to assess the reliability of the proposed benchmark. In particular, we analyse the nature and the distribution of the questions involved in GQA-OOD and demonstrate that it preserves the original question diversity of GQA [hudson2019gqa].

Question diversity — Figure 6 and Figure 7 show the distribution of question structure types as defined in GQA [hudson2019gqa] on the validation split. As one can observe, the process implemented to construct GQA-OOD does not alter the question diversity of the original split. However, the proportion of open questions (‘query’ in Figure 6 and Figure 7) has increased in GQA-OOD. Indeed, open questions, such as color questions, generally accept a wider diversity of answers, and are therefore prone to be more imbalanced. In contrast, other types such as ‘choose’, ‘verify’ or ‘compare’ usually accept only two possible answers and are easier to balance. Figure 4 and Figure 5 detail the distribution of the semantic types in the validation split of GQA-OOD compared to GQA.

Appendix B Training details

Training hyper-parameters — All models evaluated on GQA and GQA-OOD have been trained on the balanced training set of GQA. For MCAN and BUTD we use publicly available implementations. LSTM, BUTD [anderson2018bottom], RUBi [cadene2019rubi], BP [clark2019don] and LMH [clark2019don] are trained for 20 epochs with the Adam [kingma2014adam] optimizer. At the beginning of training we linearly increase the learning rate during 3 epochs, followed by a decay at epochs 10 and 12. MCAN [yu2019deep] is trained for 11 epochs with the Adamax [kingma2014adam] optimizer and a schedule of the same form: a linear increase of the learning rate during 3 epochs, followed by a decay at epochs 10 and 12.
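The shape of such a schedule (linear warm-up, then step decays) can be sketched as follows. Only the warm-up length (3 epochs) and the decay epochs (10 and 12) come from the text; the start and target learning rates and the decay factor are hypothetical placeholders that the caller must supply, and updating the rate once per epoch is an assumption of this example.

```python
def learning_rate(epoch, base_lr, warmup_start_lr, decay_factor,
                  warmup_epochs=3, decay_epochs=(10, 12)):
    """Learning rate at the start of `epoch` (0-indexed).

    Linear warm-up from `warmup_start_lr` to `base_lr` over `warmup_epochs`,
    then multiplication by `decay_factor` at each epoch in `decay_epochs`.
    """
    if epoch < warmup_epochs:
        fraction = epoch / warmup_epochs
        return warmup_start_lr + fraction * (base_lr - warmup_start_lr)
    n_decays = sum(1 for e in decay_epochs if epoch >= e)
    return base_lr * decay_factor ** n_decays


# Hypothetical values chosen only to make the shape of the schedule visible.
for epoch in range(14):
    lr = learning_rate(epoch, base_lr=1.0, warmup_start_lr=0.25, decay_factor=0.5)
    print(epoch, lr)
```

The printed values rise during the first three epochs, stay constant until epoch 9, and are scaled down at epochs 10 and 12.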

LMH hyper-parameters — Figure 3 details the hyper-parameter search for the entropy penalty weight in LMH [clark2019don]. We found that the entropy penalty degrades the GQA-OOD accuracy when training on GQA. In particular, the flattening of the right side of the curve (most frequent samples) is even more pronounced for higher penalty weights.

Figure 3: Influence of the LMH entropy penalty weight on the prediction error distribution.
Figure 4: Distribution of the semantic types in GQA.
Figure 5: Distribution of the semantic types in GQA-OOD (tail).
Figure 6: Distribution of the structural types in GQA.
Figure 7: Distribution of the structural types in GQA-OOD (tail).