The Wisdom of MaSSeS: Majority, Subjectivity, and Semantic Similarity in the Evaluation of VQA

by   Shailza Jolly, et al.

We introduce MASSES, a simple evaluation metric for the task of Visual Question Answering (VQA). In its standard form, the VQA task is operationalized as follows: Given an image and an open-ended question in natural language, systems are required to provide a suitable answer. Currently, model performance is evaluated by means of a somehow simplistic metric: If the predicted answer is chosen by at least 3 human annotators out of 10, then it is 100 Though intuitively valuable, this metric has some important limitations. First, it ignores whether the predicted answer is the one selected by the Majority (MA) of annotators. Second, it does not account for the quantitative Subjectivity (S) of the answers in the sample (and dataset). Third, information about the Semantic Similarity (SES) of the responses is completely neglected. Based on such limitations, we propose a multi-component metric that accounts for all these issues. We show that our metric is effective in providing a more fine-grained evaluation both on the quantitative and qualitative level.


page 1

page 2

page 4

page 6

page 8


Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions

Visual Question Answering is a multi-modal task that aims to measure hig...

VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions

Most existing works in visual question answering (VQA) are dedicated to ...

Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly

Machine learning has advanced dramatically, narrowing the accuracy gap t...

Answering Questions about Data Visualizations using Efficient Bimodal Fusion

Chart question answering (CQA) is a newly proposed visual question answe...

Visual Question: Predicting If a Crowd Will Agree on the Answer

Visual question answering (VQA) systems are emerging from a desire to em...

Quantifying and Alleviating the Language Prior Problem in Visual Question Answering

Benefiting from the advancement of computer vision, natural language pro...

YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos

The goal of the YouMakeup VQA Challenge 2020 is to provide a common benc...

1 Introduction

Since its introduction, the task of Visual Question Answering (VQA) [4] has received considerable attention in the Vision and Language community. The task is straightforward: Given an image and a question in natural language, models are asked to output the correct answer. This is usually treated as a classification problem, where answers are categories that are inferred using features from image-question pairs. Traditionally, two main versions of the tasks have been proposed: One, multiple-choice, requires models to pick up the correct answer among a limited set of options; the other, open-ended, challenges systems to guess the correct answer from the whole vocabulary.

Figure 1: Representation of MaSSeS and its components. In the circles, standard VQA accuracy (gray) and our Ma (blue), MaS (orange), and MaSSeS (green) on VQA 1.0 [4] are reported.
Figure 2: Examples of VQA questions and answers in the open-ended setting. Given the image on the left and the third question ‘How is the veggies being cut?’, currently a model gets accuracy 100% in case it outputs ‘diced’ (4 occurrences), 60% if it outputs either ‘cubed’ or ‘squares’ (2), 30% for ‘with knife’ (1), and 0% for any other response. The overall accuracy is obtained by averaging through samples.

Several metrics have been proposed recently for evaluating VQA systems (see section 2), but accuracy is still the most commonly used evaluation criterion  [4, 11, 23, 42, 44, 1, 5, 14, 45, 2]. In the multiple-choice setting, where only one answer is correct, accuracy is given by the proportion of correctly-predicted cases. In the open-ended setting, accuracy is instead based on human annotations for the question:

Using the official VQA Evaluation Tool, that averages accuracy over all 10 choose 9 sets of human annotators, an answer is considered as 100% accurate if at least 4 workers out of 10 voted for it, 90% if the annotators were 3, 60% if they were 2, 30% if the answer was chosen by just one worker, 0% in case no one opted for it.111From now on, we will report accuracy values as obtained with VQA Evaluation Tool: Being based on the responses provided by 10 different workers, the evaluation of VQA in this setting is therefore driven by a wisdom of the crowd [12] criterion: The answer is ‘perfectly’ correct if more than one third annotators agree on that, ‘almost’ correct if the agreement involves one fifth of the workers, ‘a bit’ correct if provided by only one worker. That is, the degree of correctness is a function of annotators agreement.

Though intuitively valuable, this metric has some important limitations. First, it ignores whether the predicted answer is the one selected by the majority of annotators or by just a smaller fraction of them. For example, in the second question in Figure 2 a model gets a 100% accuracy by answering ‘yes’, though this is not the most-voted option, which is ‘no’. Second, it does not account for the quantitative subjectivity of the responses for a given question. Based on the number of unique responses assigned by annotators, for example, the first question in Figure 2 (2 unique responses) looks intuitively less subjective compared to the third (5), but this aspect does not play any role in the evaluation. Third, information about semantic similarity of responses is completely neglected. That is, samples where the responses are very semantically similar (e.g., first question in Figure 2) are not considered differently from cases where they are less similar (e.g., third question) or completely dissimilar (e.g., second question).

Based on such limitations, we focus on open-ended VQA and propose MaSSeS,222Details and the code for computing MaSSeS will be available at the project page: a simple multi-component metric that jointly accounts for all these issues (see Figure 1). In particular, MaSSeS combines a Majority component (Ma) with a Subjectivity component (S) both endowed with Semantic Similarity (SeS). Similarly to the current evaluations, the output of the metric is a single score that measures the accuracy in the task. By means of thorough analyses, we show that jointly considering this information is quantitatively and qualitatively better than using current evaluations. Moreover, our findings reveal that better exploiting the ‘wisdom of the crowd’ available in human annotation is beneficial to gain a fine-grained understanding of VQA.

2 Related Work

In recent years, a number of VQA datasets have been proposed: VQA 1.0 [4], VQA-abstract [1], VQA 2.0 [47, 14], FM-IQA [13], DAQUAR [24], COCO-QA [30], Visual Madlibs [46], Visual Genome [20], VizWiz [16], Visual7W [48], TDIUC [18], CLEVR [17], SHAPES [3], Visual Reasoning [34], Embodied QA [7]. What all these resources have in common is the task for which they were designed: Given an image (either real or abstract) and a question in natural language, models are asked to correctly answer the question. Depending on the characteristics of the dataset and the models proposed, various ways to evaluate performance have been explored.

Accuracy is the most common metric. Traditionally, VQA is treated as a classification task, either in a multiple-choice (limited set of answers) or open-ended (whole vocabulary) setting. In the multiple-choice setting, there is just one correct (or ground-truth) answer among a number of alternatives called decoys [4, 46, 48, 20]. As such, accuracy is simply computed by counting the predictions of the model that match the ground-truth answer. What can affect the difficulty of the task in this setting is the type of decoys selected. Indeed, recent work has proposed methods to harvest more challenging alternatives on the basis of their consistency and semantic similarity with the correct response [6]. Similar approaches have been exploited in the domains of visual dialogue [8] and multiple-choice image captioning [10]. In the open-ended setting, accuracy can be computed in terms of Exact Matching between predicted and ground-truth answer [20, 3, 17, 34]. Though suitable for synthetic datasets where there is just one, automatically-generated answer, this approach cannot be applied to datasets where various answers have been provided by multiple human annotators. To account for the variability among 10 crowdsourced answers, [4] proposed a metric which considers as 100% correct an answer that was provided by more than 3 annotators out of 10. If 3, 2 or 1 voted for it, the model accuracy is 90%, 60%, and 30%, respectively. Being simple to compute and interpret, this metric (hence, VQA3+) is the standard evaluation criterion for open-ended VQA [4, 1, 16, 47, 14]. However, it has some important limitations. (a) It ignores whether an answer that was chosen more than 3 annotators is the most frequent or not. As such, it considers it as 100% correct even if e.g. 6 annotators converged on a different answer (see second question in Figure 2). (b) It is heavily dependent on the number of answers for a given question. While the 3+ criterion is valid with 10 annotations, this might not be the case when, e.g., 5 or 20 answers are available. (c) It does not account for the quantitative variability among the answers. (d) There is no focus on the semantic similarity between the answers. (e) Model performance and dataset features (frequency of answers) are intertwined. That is, a perfect model cannot achieve a 100% accuracy on the task.

Arithmetic and Harmonic Means are two accuracy-based metrics proposed by [18]

. The core idea is to compute an overall accuracy which takes into account the skewed question-type distribution observed in the TDIUC dataset. The harmonic mean-per-type accuracy (Harmonic MPT), in particular, is designed to capture the ability of a system to obtain high scores across all question-types, being skewed towards lowest performing categories. A

normalized version is also provided to better account for rare answers. Though fine-grained, these metrics are only suitable for datasets with only one ground-truth answer.

WUPS is a metric proposed by [24] to take into account semantic similarity in the evaluation of model predictions. The core idea is that, when evaluating performance in the exact-matching setting (i.e., only one ground-truth answer), a model should not be heavily penalized if its prediction is semantically close to the ground truth (e.g., ‘carton’ and ‘box’). This intuition is implemented using Wu-Palmer similarity [41], which computes the similarity between two words in terms of their longest common subsequence in the taxonomy tree. In practice, the predicted answer is considered as correct when its similarity with the ground truth exceeds a threshold, which in [24] is set to either 0.9 (strict) or 0.0 (tolerant). This metric has been extended by [25] to account for settings where more than one ground-truth answer is available. Two versions were proposed: In one, WUPS-acm, the overall score comes from the average of all pairwise similarities and thus considers inter-annotator agreement; in the other, WUPS-mcm, the pair with the highest similarity is taken as representative of the pattern. As observed by [19], the measure of similarity embedded in WUPS has some shortcomings. In particular, it is shown to produce high scores even for answers which are semantically very different, leading to significantly higher accuracies in both [24] and [30]. Moreover, it only works with rigid semantic concepts, making it not suitable for phrasal or sentence answers that can be found in [4, 1, 16, 47, 14].

Visual Turing Test has been proposed as a human-based evaluation metric for VQA by [13]. Based on the characteristics of the FM-IQA dataset, whose answers are often long and complex sentences, the authors tackled the task as an answer-generation rather than a classification problem (see also [49, 39, 40, 36, 37]). Given this setting, one option is to use standard metrics for the evaluation of automatically-generated language, such as BLEU [28], METEOR [21], ROUGE [22] or CIDEr [35], as [16] did. However, these metrics turned out not to be suitable for VQA evaluation due to their inability to properly handle semantically relevant words [13]. Therefore, [13] asked humans to judge whether the generated answers were provided by a human or a model. If annotators believed the answer was ‘human’, and thus implicitly good, the answer was considered as correct. Else, it failed the Visual Turing Test and considered as wrong. Intuitively, this evaluation procedure is very costly and heavily dependent on subjective opinions of annotators.

Figure 3: From left to right: Distribution of samples in the validation splits of VQA 1.0, VQA 2.0, VQA-abstract, and VizWiz with respect to number of unique answers. E.g., in 35% of samples in VQA 1.0, all annotators converge on the same answer (3% in VizWiz).

Mean Rank Finally, in the recent work by [7] the performance of the embodied agent is evaluated via mean rank of the ground-truth answer in the predictions of the model. This implies that only one ground-truth answer is given.

3 Our Metric

Based on the limitations of the current metrics, we propose MaSSeS, a novel, multi-component metric for the evaluation of open-ended VQA. Each component is aimed at evaluating various aspects of either the performance of a given model or the characteristics of the dataset. In particular, one component (Ma) evaluates the correctness of the answer predicted by the model and is thus model-specific. Two modules (S, SeS) evaluate the pattern of human responses for a given question and are thus data-specific. By jointly combining these 3 modules, one single score is provided. Below, we describe and motivate each component.

Majority (Ma): It is the core component of our metric, aimed at evaluating the performance of a given model in the task. It is based on two simple assumptions: First, the most frequent answer (hence, max) is considered as 100% correct regardless of its absolute frequency. Second, all other answers receive a score which is dependent on the frequency of max. Given a predicted answer, the score is given by dividing its frequency by the frequency of max. Consider the third example in Figure 2. If the predicted answer is ‘diced’ (max), the score is 1. If it is ‘cubed’ or ‘squares’ (2 occurrences), the score is 0.5. If it is one among the others (1), then the score is 0.25. The method used for calculating Ma is reported in (1):


where the numerator is an integer ranging from 0 to number of annotators (#ann), and the denominator an integer from 1 to #ann. Ma is a continuous value ranging from 0 to 1.

Ma overcomes some important shortcomings of the other metrics. Similarly to Exact Matching and in contrast with VQA3+, Ma assumes that there is always at least one answer that is 100% correct for the question. As a consequence, a model is allowed to achieve 100% accuracy. Similarly to VQA3+, it modulates the score on the basis of the frequency of the answer. However, in contrast to VQA3+, our score is dependent on the frequency of max and not on a fixed threshold (e.g. 4). Moreover, Ma is continuous (i.e., it ranges from 0 to 1) rather than discrete (VQA3+ assigns just 5 possible scores: 0%, 30%, 60%, 90%, 100%), thus allowing a more flexible and fine-grained evaluation of the predictions.

Subjectivity (S): This component evaluates the subjectivity of a given pattern of responses on the basis of the quantitative agreement between annotators, irrespectively of the prediction of the model. Our intuition is that highly skewed distributions would indicate more subjective and thus less reliable samples. Therefore, we should put more ‘trust’ to distributions that reflect a high agreement compared to those where a high variability is observed. Here, we operationalize S in terms of Wasserstein Distance (hence, WD) [29], a method applied to transportation problems using efficient algorithms like network simplex algorithm [27]

. Given its ability to operate on variable-length representations, WD is more robust in comparison to other histogram-matching techniques and has been used, for example, in the domain of content-based image retrieval 

[32, 31]

. Applied to discrete probability distributions, WD (also known as Earth Mover’s Distance 

[32]) is used to compute the minimum amount of work

that is needed for transforming one distribution into another. In our case, the work we measure is that required to transform a given distribution of frequencies into a uniform distribution where all elements have

max frequency. In particular, we use WD as a measure of ‘reliability’ of the sample, based on the observation that highly skewed distributions require a smaller amount of work (low WD) compared to ‘peaky’ ones (high WD). This is intuitive since, in the former case, all elements are closer to the max than in the latter. As a consequence, patterns where all annotators converge on one single answer will get a S score equal to 1 (highest reliability), whereas uniformly-distributed patterns (i.e., all answers have frequency 1) will get 0 (no reliability at all). Consider the examples in Figure 2. In the first and second, S is 0.55. In the third, more subjective, S is 0.33. The method used for computing S is shown in (2):


where the formula represents the standard way for computing WD, u,v are two different probability distributions, and is the set of (probability) distributions. The value of S is further normalized to range from 0 to 1.

dataset metric
acm0.9 mcm0.9 Ma S SeS0.7 SeS0.9 MaS MaSSeS0.7 MaSSeS0.9
VQA 1.0 0.542 0.479 0.642 0.523 0.731 0.922 0.786 0.425 0.567 0.458
VQA 2.0 0.516 0.441 0.634 0.495 0.705 0.907 0.760 0.384 0.545 0.418
VQA-abstract 0.602 0.532 0.685 0.582 0.780 0.944 0.818 0.482 0.618 0.507
VizWiz 0.448 0.163 0.441 0.444 0.460 0.705 0.541 0.207 0.292 0.227
Table 1: Results of VQA3+, WUPS-acm, WUPS–mcm, MaSSeS and its components on four VQA datasets.

Introducing such component allows us to take into account the subjectivity of a sample (and a dataset). This is crucial since, as shown in Figure 3, in current datasets the proportion of samples with a perfect inter-annotator agreement (i.e., 1 unique answer) is relatively low: 35% in VQA 1.0 [4], 33% in VQA 2.0 [14], 43% in VQA-abstract [1], and only 3% in VizWiz [16]. Moreover, we compute this score independently from the predictions of the models, thus providing a self-standing measure for the analysis of any VQA dataset. As clearly depicted in Figure 3, subjectivity is indeed a property of the datasets: In VizWiz, only 30% of samples display 3 or less unique answers, whereas this percentage exceeds 70% in the other datasets. The motivation behind proposing this component is loosely similar to [15], who tackle the task of predicting the degree of agreement between annotators, and very close to [43], who model subjectivity of samples in terms of the entropy of the response pattern (ranging from 0 to 3.32). Compared to [43], we believe ours to be an essentially equivalent measure, though simpler and more intuitive. Finally, subjectivity is indirectly taken into account in WUPS-acm, where the score is given by the average of the pairwise distances between the elements. However, this measure mixes quantitative (frequency) and qualitative (semantic similarity) information, while S specifically focuses on the former.

Semantic Similarity (SeS): This component is aimed at evaluating the semantic similarity between the answers in the sample. The rationale is that samples where the answers are overall semantically similar should be considered as more reliable (less subjective) compared to those including semantically diverse answers. Intuitively, a pattern containing e.g. ‘plane’, ‘airplane’, and ‘aircraft’ would be more consistent than one including e.g. ‘plane’, ‘train’, ‘motorbike’. We operationalize this intuition by using pre-trained word embeddings [26] to re-organize the frequency distribution of the answers in the pattern. As a consequence, SeS can be seen as a semantics-aware version of S. Technically, SeS is obtained as follows: (a) we compute an average representation of each answer (similarly to [6]); (b) we use these unique representations to build a centroid

of the pattern aimed at encoding its overall semantics, irrespective of the relative frequency of the items (we want to account for the long tail of distributions); (c) we compute the cosine similarity between centroid and each unique answer; (d) we group together the answers whose cosine similarity value exceeds a given threshold, and sum their frequencies accordingly. This way, we obtain an updated frequency distribution, on the top of which S can be computed. Notably, this is the only component of

MaSSeS that can be ‘adjusted’. In particular, using ‘strict’ thresholds (e.g. 0.9) will generate lower scores compared to using more ‘tolerant’ ones (e.g. 0.7). To illustrate, if we apply a SeS0.9 to the examples in Figure 2, only the reliability of the first example increases (from S 0.55 to SeS 1). However, by applying SeS0.7, reliability increases to 1 in all examples. Though the third question is quantitatively more subjective than the others, it becomes as reliable as them when considering its semantics. Semantic similarity is computed as in (3):


where for each ground truth answer, centroid pair we obtain a similarity score sim ranging from 0 to 1 (we set negative values to 0). Answers for which sim is equal to or higher than a threshold t(0-1) are grouped together by summing their frequencies. To obtain SeS, namely a semantics-aware measure of subjectivity, we compute (2) on the resulting distributions usim,vsim. To obtain the overall MaSSeS score, we simply compute an updated Ma (1) which is based on these distributions, and we further multiply it by SeS.

Similarly to WUPS, our metric acknowledges the importance of taking semantic similarity into account in the evaluation of VQA. However, SeS differs from WUPS in two main regards: (a) We use word embeddings instead of taxonomies trees, which makes our metric more flexible, intuitive, and convenient to compute. Moreover, it can account for phrasal and sentence answers. (b) As reported by [19], WUPS tends to be very ‘forgiving’ by assigning high scores to distant concepts (e.g., ‘raven’ and ‘writing desk’ have a WUPS score of 0.4). In contrast, word embeddings provide a more fine-grained semantic information. It is worth mentioning that, in the domain of VQA, word embeddings have been used in various ways, e.g. for selecting challenging decoys [6], or to implement nearest-neighbors baseline models [9]. As for the procedure of aggregating various responses into one based on their semantic similarity, we were inspired by previous work on crowd consensus doing the same on the basis of various criteria [33, 38].

Figure 4: Comparison between VQA3+ and MaSSeS0.9 accuracies in VQA 1.0 (left) and VizWiz (right).
Figure 5: Distribution of Subjectivity (S) and Semantic Similarity (SeS0.9) scores in VQA 1.0 (left) and VizWiz (right).

4 Experiments

We tested the validity of our metric by experimenting with four VQA datasets: VQA 1.0 [4], VQA 2.0 [14], VQA-abstract [1], and VizWiz [16]. To enable a fair comparison across the datasets, for each dataset we followed the same pipeline: The standard VQA model used in [1] was trained on the training split and tested on the validation split. Model predictions were evaluated by means of three metrics: VQA3+ [4] (using the evaluation tools), WUPS [25], and our MaSSeS. WUPS was tested in both its consensus versions, i.e. acm and mcm with a threshold of 0.9. As for MaSSeS, we computed its overall score as well as the scores provided by each of its components. The impact of ‘tuning’ semantic similarity is evaluated by exploring two thresholds: a strict 0.9 and a more tolerant 0.7.

4.1 Quantitative Results

dataset n. answers prediction VQA3+ acm Ma S SeS MaSSeS
VQA 1.0 1 [yellow: 5, orange: 4, light orange: 1] yellow 1.0 0.53 1.0 0.44 1.0 1.0
2 [refrigerator: 6, fridge: 4] refrigerator 1.0 0.98 1.0 0.55 1.0 1.0
[tennis rackets: 4, tennis racket: 2, tennis racquet: 1],
racket: 2, racquets: 1
tennis rackets 1.0 0.98 1.0 0.33 0.67 0.67
[hot dogs: 5, hot dog: 2, hot dogs and fries: 1,
hot dog fries: 1, hot dog and onion rings: 1]
hot dog 0.60 0.70 0.4 0.44 1.0 1.0
VizWiz 1
[christmas tree: 6, tree: 1, chritmas tree shaped
santaclauses: 1, christmas tree santas: 1], santas: 1
christmas tree 1.0 0.70 1.0 0.55 0.89 0.89
white: 6, [green: 2, light green: 1,
very light green: 1]
white 1.0 0.62 1.0 0.55 0.55 0.55
[ginger peach: 5, ginger peach tea: 2,
ginger peach herbal tea: 1], unanswerable: 2
unanswerable 0.60 0.20 0.4 0.44 0.77 0.19
[beef: 5, beef flavored broth: 2, beef flavored: 1,
beef flavor: 1, this beef flavor: 1]
unanswerable 0.0 0.0 0.0 0.44 1.0 0.0
Table 2: Examples from the validation splits of VQA 1.0 (top) and VizWiz (bottom). For each example, we report the pattern of answers provided by annotators (unique answer: frequency), the prediction of the model, and the scores (note that acm, SeS, MaSSeS are computed using threshold 0.9). Answers that are grouped together by SeS are included in square brackets.

Results are reported in Table 1. Note that columns VQA3+, WUPS-acm, WUPS-mcm, Ma, MaS, and MaSSeS are accuracies, while S and SeS are reliability scores. As can be noted, accuracies obtained with both versions of MaSSeS are generally lower compared to those of VQA3+, with the drop being particularly accentuated for VizWiz. This can be observed in Figure 4, which compares the distributions of accuracies scored by VQA3+ and MaSSeS0.9 in VQA 1.0 (left) and VizWiz (right). As can be seen, the scores produced by our metric (blue) are ‘distributed’ across the x-axis (from 0 to 1), while those produced by VQA3+ (red) are grouped into 5 ‘classes’. Moreover, we observe that our metric is much more reluctant to output score 1. Part of this differences can be explained by looking at the values of Ma (Table 1), which are slightly lower than those of VQA3+ due to their finer-grained nature (recall that if an element is not max it is not considered as 100% correct by Ma). This drop is further accentuated by multiplying Ma by either S (to obtain MaS) or SeS (to obtain MaSSeS). Since the values of these components cannot exceed 1, the resulting score will be lowered according to the degree of subjectivity of the dataset.

Bearing this in mind, it is worth focusing on the scores of S and SeS in each dataset. As reported in Table 1, S

is relatively high for the first three datasets (ranging from 0.70 to 0.78), extremely low for VizWiz (0.46). These numbers, in line with the descriptive statistics depicted in Figure 

3, clearly indicate that answers in VizWiz are extremely skewed, with annotators rarely agreeing on the same answer(s). This information can also be observed in Figure 5, which depicts the distribution of S (red bars) and SeS0.9 (blue bars) in VQA 1.0333We plot VQA 1.0 as representative of the three VQA-based datasets, which display very similar patterns. (left) and VizWiz (right). As can be noticed, S

in VQA is relatively high, with most of the answers being grouped in the rightmost bars (0.8 or more). In contrast, we observe an almost normal distribution of

S in VizWiz, with very few answers being scored with high values. When injecting semantic information into subjectivity (SeS0.9), however, the distribution changes. Indeed, we observe much less cases scored with extremely low values and much many cases with high values. In numbers, this is reflected in an overall increase of 8 points from S (0.46) to SeS (0.54). A similar pattern is also observed in VQA 1.0 (+5 points). It is worth mentioning that using a lowest similarity threshold (0.7) makes the increase between S and SeS even bigger. This, in turn, makes the MaSSeS score significantly higher and comparable to VQA3+ in the three VQA-based datasets (not for VizWiz).

As for WUPS, we observe that acm scores are significantly lower than VQA3+ ones, while mcm ones are generally higher. This is intuitive since mcm only considers the most similar answers, while acm, similarly to ours, considers the whole set. Compared to our metric, we notice that acm0.9 scores are somehow in between those of MaSSeS0.7 and MaSSeS0.9 in the VQA-based datasets. In contrast, they are very different in VizWiz, where our metric versions ‘outperform’ acm0.9 by around 13 and 7 points, respectively. We believe this gap is due to the main differences between WUPS and MaSSeS: (a) In WUPS the predictions of the model are intertwined with the properties of the data, while in ours the two components are disentangled. (b) The type of semantic similarity used by MaSSeS and its role in the metric allows capturing finer-grained relations between the answers compared to taxonomy trees.

4.2 Qualitative Results

To better understand the functioning of our metric, we analyze several cases extracted from the validation splits of VQA 1.0 and VizWiz (see Table 2). Starting from VQA 1.0, we notice that examples 1 and 2 are considered as 100% correct by both VQA3+ and MaSSeS. The former metric assigns this score because ‘yellow’ and ‘refrigerator’ have frequency equal to or greater than 4. As for MaSSeS, this score is produced because (a) the two answers have max frequency, and (b) the SeS score assigned to the response pattern is the highest (i.e. 1.0) due to their semantic consistency. That is, all the answers are grouped together since their cosine similarity with the centroid is equal or greater than 0.9. Notably, acm produces a similar score in example 2, but very different (i.e., much lower) in example 1, though the words involved are semantically very similar (very similar colors). Moving to example 3, we observe that MaSSeS assigns a lower score (0.67) compared to VQA3+ (1.0) since SeS makes a fine-grained distinction between generic ‘rackets’ and specific ones (i.e., for ‘tennis’). This proves the validity and precision our semantic similarity component, especially in comparison with acm, whose high score does not account for such distinction (0.98). As for example 4, the score output by MaSSeS (1.0) turns out to be higher than both VQA3+ (0.6) and acm (0.7) due to the extremely high semantic consistency of the answers.

Figure 6: Left: Two examples where VQA3+ (V) outputs higher scores than MaSSeS (M). Right: Two examples with the opposite pattern.
Figure 7: Distribution of accuracy produced by VQA3+ against SeS values in VQA 1.0 (left) and VizWiz (right).

As for VizWiz, we observe that examples 1 and 2, which receive highest accuracy from VQA3+, are assigned a lower score by MaSSeS. In the former case, the drop is minor due to the high reliability of the pattern; in the latter, the drop is bigger since the predicted answer, ‘white’, appears in a pattern where the other responses are semantically very similar to each other and thus grouped together by SeS. That is, the items in the long tail of the distribution, though not quantitatively dominant, are semantically prevalent in the pattern. As such, the reliability of the pattern is only partial, and lowers the overall score. As for example 3, VQA3+ assigns a relatively high score to the prediction (0.60), while MaSSeS (as acm) penalizes this choice mainly due to the non-max nature of the predicted answer, though the pattern has a high reliability due the semantic consistency of the alternatives (all grouped together by SeS). Finally, in example 4 the prediction of the model (‘unanswerable’) is not present in the pattern and thus scored 0 by all metrics. However, it is worth mentioning that, according to SeS, this pattern is highly reliable due to the high semantic consistency of its elements. As a consequence, a model predicting e.g. ‘beef’ would get 1.0 by MaSSeS, but only 0.5 by acm.

To further understand the qualitative difference between VQA3+ and MaSSeS, we analyze several cases from VQA 1.0 (see Figure 6) where the former metric outputs a higher score than the latter (left), and vice versa (right). In the two leftmost examples, the higher values produced by VQA3+ seem intuitively more correct than those output by MaSSeS, whose scores are affected by a valuable but somehow strict semantic criterion which penalizes the presence of other answers in the pattern. In contrast, the higher accuracies produced by MaSSeS in the rightmost cases look intuitively better than those by VQA3+. In these cases, the subjectivity of the pattern is compensated by the high semantic consistency among the answers, which makes MaSSeS to output the highest score. Overall, it is straightforward that taking semantics into account allows our metric to produce finer-grained evaluations.

5 Evaluating Dataset ‘Feasibility’ with SeS

SeS is a component evaluating the subjectivity of a sample while also taking into account the semantic relation between the answers. As such, the score it provides is a measure of reliability of a sample (and of a dataset). Since a high reliable sample is one where annotators either converge on the same answer or pick up semantically related answers, we might take SeS as an indirect measure of dataset feasibility: The higher the score assigned to a sample, the higher the probability to guess the correct answer. We test this intuition by analyzing VQA3+ accuracy against SeS. If SeS captures the degree of feasibility of a sample, we should observe a higher accuracy in correspondence to high values of our component. Our intuition is fully confirmed for VQA 1.0 (Figure 7, left), where accuracies increase on par with SeS. In contrast, a different pattern is observed for VizWiz (right), where the highest accuracy is obtained in samples with moderate SeS and monotonically decreases with increasingly-reliable scores. This pattern, we conjecture, might be due to the low number of cases having high SeS in VizWiz.

6 Discussion

We proposed MaSSeS, a novel multi-component metric for the evaluation of VQA. We showed the potential of such evaluation tool for gaining a higher-level, fine-grained understanding of models and data. Crucially, our metric can be used one component at a time: Ma for evaluating model predictions only, S and SeS for analyzing the quantitative and semantic reliability of a dataset, respectively. Overall, MaSSeS provides a single accuracy score that makes it comparable to other metrics such as VQA3+ or WUPS. Further investigation is needed to explore the functioning of our metric with other VQA models, as well as the impact of using various word embeddings techniques and similarity thresholds on the overall score.


A preliminary version of this work was presented at the ECCV2018 workshop on Shortcomings in Vision and Language (SiVL). In that venue, we had insightful discussions with Aishwarya Agrawal, Dhruv Batra, Danna Gurari, Stefan Lee, Vicente Ordonez, and many others. We thank them for helping us improving this manuscript.


  • [1] A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Parikh, and D. Batra. Vqa: Visual question answering.

    International Journal of Computer Vision

    , 123(1):4–31, 2017.
  • [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, volume 3, page 6, 2018.
  • [3] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    , pages 39–48, 2016.
  • [4] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
  • [5] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome. Mutan: Multimodal tucker fusion for visual question answering. In Proc. IEEE Int. Conf. Comp. Vis, volume 3, 2017.
  • [6] W.-L. Chao, H. Hu, and F. Sha. Being Negative but Constructively: Lessons Learnt from Creating Better Visual Question Answering Datasets. arXiv preprint arXiv:1704.07121, 2017.
  • [7] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Embodied Question Answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [8] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, 2017.
  • [9] J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick. Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467, 2015.
  • [10] N. Ding, S. Goodman, F. Sha, and R. Soricut. Understanding Image and Text Simultaneously: a Dual Vision-Language Machine Comprehension Task. arXiv preprint arXiv:1612.07833, 2016.
  • [11] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
  • [12] F. Galton. Vox populi (The wisdom of crowds). Nature, 75(7):450–451, 1907.
  • [13] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? dataset and methods for multilingual image question. In Advances in neural information processing systems, pages 2296–2304, 2015.
  • [14] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR, volume 1, page 3, 2017.
  • [15] D. Gurari and K. Grauman. CrowdVerge: Predicting If People Will Agree on the Answer to a Visual Question. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 3511–3522. ACM, 2017.
  • [16] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham. VizWiz Grand Challenge: Answering Visual Questions from Blind People. arXiv preprint arXiv:1802.08218, 2018.
  • [17] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1988–1997. IEEE, 2017.
  • [18] K. Kafle and C. Kanan. An analysis of visual question answering algorithms. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 1983–1991. IEEE, 2017.
  • [19] K. Kafle and C. Kanan. Visual question answering: Datasets, algorithms, and future challenges. Computer Vision and Image Understanding, 163:3–20, 2017.
  • [20] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  • [21] A. Lavie and A. Agarwal. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 228–231. Association for Computational Linguistics, 2007.
  • [22] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
  • [23] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297, 2016.
  • [24] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in neural information processing systems, pages 1682–1690, 2014.
  • [25] M. Malinowski, M. Rohrbach, and M. Fritz.

    Ask your neurons: A neural-based approach to answering questions about images.

    In Proceedings of the IEEE international conference on computer vision, pages 1–9, 2015.
  • [26] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin. Advances in Pre-Training Distributed Word Representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
  • [27] J. B. Orlin. A polynomial time primal network simplex algorithm for minimum cost flows. Mathematical Programming, 78(2):109–129, 1997.
  • [28] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
  • [29] A. Ramdas, N. G. Trillos, and M. Cuturi. On wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2):47, 2017.
  • [30] M. Ren, R. Kiros, and R. Zemel. Exploring models and data for image question answering. In Advances in neural information processing systems, pages 2953–2961, 2015.
  • [31] Y. Rubner, C. Tomasi, and L. J. Guibas. A metric for distributions with applications to image databases. In Computer Vision, 1998. Sixth International Conference on, pages 59–66. IEEE, 1998.
  • [32] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover’s distance as a metric for image retrieval. International journal of computer vision, 40(2):99–121, 2000.
  • [33] A. Sheshadri and M. Lease. SQUARE: A benchmark for research on computing crowd consensus. In First AAAI Conference on Human Computation and Crowdsourcing, 2013.
  • [34] A. Suhr, M. Lewis, J. Yeh, and Y. Artzi. A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 217–223, 2017.
  • [35] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
  • [36] P. Wang, Q. Wu, C. Shen, A. Dick, and A. Van Den Henge. Explicit knowledge-based reasoning for visual question answering. In

    Proceedings of the 26th International Joint Conference on Artificial Intelligence

    , pages 1290–1296. AAAI Press, 2017.
  • [37] P. Wang, Q. Wu, C. Shen, A. Dick, and A. van den Hengel. Fvqa: Fact-based visual question answering. IEEE transactions on pattern analysis and machine intelligence, 2017.
  • [38] P. Welinder, S. Branson, P. Perona, and S. J. Belongie. The multidimensional wisdom of crowds. In Advances in neural information processing systems, pages 2424–2432, 2010.
  • [39] Q. Wu, C. Shen, L. Liu, A. Dick, and A. van den Hengel. What value do explicit high level concepts have in vision to language problems? In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 203–212, 2016.
  • [40] Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4622–4630, 2016.
  • [41] Z. Wu and M. Palmer. Verbs semantics and lexical selection. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, pages 133–138. Association for Computational Linguistics, 1994.
  • [42] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.
  • [43] C.-J. Yang, K. Grauman, and D. Gurari. Visual Question Answer Diversity. In HCOMP, pages 184–192, 2018.
  • [44] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
  • [45] D. Yu, J. Fu, T. Mei, and Y. Rui. Multi-level attention networks for visual question answering. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 4187–4195. IEEE, 2017.
  • [46] L. Yu, E. Park, A. C. Berg, and T. L. Berg. Visual madlibs: Fill in the blank description generation and question answering. In Proceedings of the ieee international conference on computer vision, pages 2461–2469, 2015.
  • [47] P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh. Yin and yang: Balancing and answering binary visual questions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5014–5022, 2016.
  • [48] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7w: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4995–5004, 2016.
  • [49] Y. Zhu, C. Zhang, C. Ré, and L. Fei-Fei. Building a large-scale multimodal knowledge base system for answering visual queries. arXiv preprint arXiv:1507.05670, 2015.