Visual Question Answering with Prior Class Semantics

05/04/2020 ∙ by Violetta Shevchenko, et al. ∙ The University of Adelaide

We present a novel mechanism to embed prior knowledge in a model for visual question answering. The open-set nature of the task is at odds with the ubiquitous approach of training a fixed classifier. We show how to exploit additional information pertaining to the semantics of candidate answers. We extend the answer prediction process with a regression objective in a semantic space, in which we project candidate answers using prior knowledge derived from word embeddings. We perform an extensive study of learned representations with the GQA dataset, revealing that important semantic information is captured in the relations between embeddings in the answer space. Our method brings improvements in consistency and accuracy over a range of question types. Experiments with novel answers, unseen during training, indicate the method's potential for open-set prediction.




1 Introduction

The task of visual question answering (VQA) has become a benchmark to evaluate joint progress in computer vision and natural language processing. This complex task, in its most general formulation, requires deep analysis of both visual and textual information in order to correctly answer a question, given an associated image. Behind its simple formulation, VQA is an extremely complex task that offers a testbed for a multitude of capabilities required to develop strong AI systems.

Most recent developments in the field of VQA have focused on deep learning architectures that can be trained with end-to-end supervision (i.e. questions, images, and answers). However, even current large-scale datasets [3, 6] can only cover a limited fraction of all knowledge potentially useful for the task. The underlying reasons for this limitation are that 1) collecting data with end-to-end annotations (i.e. questions and answers) is expensive, as it usually requires human annotators, and 2) the desirable knowledge about the world is constantly expanding, and no single dataset can ever capture it all. Existing models trained once and for all on any of these datasets lack the generalization and adaptation capabilities desirable in real-world applications. These shortcomings motivate our search for alternative sources of information, and a method to exploit them in a VQA model.

Figure 1: Existing models treat VQA as a classification task over predefined answers (upper branch). We supplement our model with a regression objective in a semantic answer space (lower branch). This allows incorporating additional prior knowledge about answer semantics. This improves its accuracy and consistency. In the above example, red and orange are similarly likely with the traditional objective. Our regression lands closer to the representation of red in the answer space. This resolves the ambiguity and red is chosen as the final answer.

A common approach to include existing knowledge in VQA models is to use pretrained models to obtain image and question features. On the image side, pretrained convolutional neural networks (CNNs) or object detectors are ubiquitous [2] to extract representative image features. On the language side, pretrained word embeddings like Word2Vec [14] and GloVe [16] usually serve to encode the words of the question. The advantage of these techniques is to leverage knowledge learned from larger, non-VQA specific data (e.g. ImageNet and large text corpora). The benefit of these approaches has been widely demonstrated, which further motivates our quest for additional sources of usable knowledge and techniques to incorporate it.

Existing models for VQA follow the common blueprint of a two-stream embedding, followed by fusion and classification stages [3, 21, 26]. The typical setting in VQA consists of an image and a related question. The model takes this image-question pair and predicts the correct answer by solving a classification problem over the set of candidate answers that occur in the training data. This classification approach, in contrast to text generation [24, 5], considerably simplifies the evaluation process, as the model can be assessed by its classification accuracy. However, treating VQA as a classification task has major drawbacks. The answers are treated as distinct class labels and answer words are abstracted from their meanings. This disregards semantic relations between related answers. Moreover, some questions contain possible answers in their wording (e.g. Is this car red or white?) and it seems natural to include mechanisms to explicitly represent the semantics of possible answers, as is done for question words. Guided by these observations, we develop an architecture that leverages prior knowledge about answers to improve the performance of a VQA model.

Our main technical contribution is to treat VQA as a multitask problem, where we both predict the answer label based on classification scores, and additionally learn a mapping into an answer representation space that captures the semantics of these answers (see Fig. 1). We incorporate prior knowledge into the model by initializing the representations of answers with pretrained word embeddings. We perform an extensive and rigorous analysis of the trained model. It demonstrates the benefits of the approach and provides us with insights into the ways language semantics are useful for the task of VQA. Moreover, we show that learned answer representations can be used for out-of-vocabulary answer prediction, which is an important yet understudied problem in the VQA field [15].

The contributions of this paper are as follows.

  • We formulate VQA as a multitask problem, where we train the model, not only to assign scores to answer candidates, but also to perform a regression in a vector space that represents answer semantics.

  • We use this multitask formulation to incorporate additional information into the model with a particular loss and initialization of the semantic answer space. We also show that it allows the model to predict novel answers that were not seen during training.

  • We perform an extensive analysis of the model and various ablations. We demonstrate clear advantages on the GQA dataset [9], and obtain insights on the ways in which answer semantics are useful for the task of VQA.

2 Related Work

The overarching motivation for research on VQA is that of tackling a complex, open-world and multimodal task. These aspects are among the foundations required in general AI systems. While the task has attracted considerable attention over the past few years [23, 12], its open-set and open-domain aspects have largely been overlooked. The common practice of training a model with end-to-end supervision using a fixed dataset is inherently limited. Our discussion focuses on the incorporation of additional knowledge and training signals into VQA models.

Answer embeddings for VQA.

Most techniques to incorporate additional information into VQA models are based on representations of language, both of questions and of candidate answers. In [22], pretrained word embeddings are used as bag-of-words representations of candidate answers, which are passed to the network as additional inputs, along with question and image features. In [21], the authors proposed to initialize the weights of the output classifier with pretrained answer embeddings. They used both a textual branch, initialized with GloVe vectors, and a visual one, initialized with visual features from images representing the candidate answers. In [7], the authors propose to learn two sets of embeddings, image-question vectors and answer embeddings. They optimize a projection of these two embeddings into a joint space where the distances between compatible pairs are minimized. Interestingly, their experiments showed that the learned projections were transferable, to some extent, across datasets with different sets of possible answers.

Different from the methods cited above, our model forgoes the notion of a fixed answer set, and the output of the network is a location in a space representing answer semantics. The final prediction is still obtained by searching for the closest representation among answer candidates in this same space, but the formulation offers improved flexibility. This allows us to explore different distance measures in this semantic space. It also allows control over the contribution made by prior and task-specific data. Finally, it easily accommodates multiple representations of the same answer, thereby accounting for polysemy and the context-dependent meaning of certain words and expressions.

Class embeddings for image classification.

A related line of work uses non-visual data to improve image classifiers. Techniques have been proposed to use unannotated text [4], knowledge graphs [25] or hierarchical word databases [1] to obtain meaningful class embeddings, which proved beneficial for fine-grained image classification. Our work applies similar ideas to the task of VQA, where the key challenge is to find embeddings semantically connecting both visual and textual modalities.

3 Proposed Approach

Figure 2: Our contributions apply to the classifier stage (dashed box) of a VQA model. We feed the fused image/question representation into two separate branches. (1) In the upper branch, a traditional scoring model over predefined candidate answers. (2) In the lower branch, a novel, learned projection to a semantic answer space. The resulting vector serves to measure pairwise distances ($d$) with pretrained representations of candidate answers ($A$). Nodes marked N denote non-linear layers, L linear layers, and X an element-wise product.

Our main idea is to extend VQA with a regression objective, where the model outputs a high-dimensional vector that represents the semantics of the answer. This is a shift from the traditional classification objective over predefined candidate answers. Our formulation opens the door to compositional and unbounded sets of answers, and to the possibility of truly open-set prediction. Technically, our method concerns only the last stage of a VQA model and is thus applicable to most existing “joint embedding” models, such as [3, 27, 5, 17]. In these models, the network produces a vector $h$ from the fusion of the image and question representations (see Fig. 2). The traditional approach then feeds $h$ to a classifier and obtains $s$, with $s$ being a vector of scores of length $N$, the cardinality of a predefined set of candidate answers.

3.1 VQA as a Regression Task

Our contribution is to learn a supplementary branch from the fused representation $h$, which produces a projection $p = f_{\theta}(h)$, where $\theta$ are the parameters of the projection. The vector $p$ is interpreted as a representation of the semantics of the predicted answer. The key to this simple approach lies both in the objective used to train this branch, and in its use to select an actual textual answer, both of which we describe below.

Note that the traditional classifier over the fused representation $h$ can be interpreted as a special case of our formulation. The classifier typically includes a non-linear layer followed by a linear one. They can be interpreted as a non-linear projection followed by the computation of distances (dot products) with representations of answers. These representations then correspond to the rows of the weight matrix of the linear layer. In this view, our model is a generalization of the classical approach, with the benefit of increased flexibility in the choice of the distance measure, of the optimization loss, and of the representations of candidate answers, including their initial and/or frozen values.
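The following toy sketch illustrates this interpretation under stated assumptions. The function names, the candidate answers, and all numeric values are hypothetical and are not taken from the paper; bias terms are omitted for brevity. It contrasts a dot-product score against the rows of a weight matrix with a Euclidean distance to the rows of an answer matrix, using values chosen so that the two measures select different answers.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classifier_scores(z, W):
    """Classical view: one dot product per row of the final linear layer W."""
    return [dot(z, row) for row in W]


def answer_distances(p, A, metric=euclidean):
    """Generalized view: one distance per row of the answer matrix A."""
    return [metric(p, row) for row in A]


# Toy values (hypothetical): three candidate answers in a 2-D space.
answers = ["red", "orange", "dog"]
A = [[1.0, 0.0], [2.0, 0.4], [-1.0, 0.5]]
p = [0.9, 0.0]  # stands in for the output of the non-linear projection

scores = classifier_scores(p, A)
dists = answer_distances(p, A)
best_by_score = answers[max(range(len(answers)), key=scores.__getitem__)]
best_by_distance = answers[min(range(len(answers)), key=dists.__getitem__)]
print(best_by_score, best_by_distance)
```

In this toy setting the dot product favors the row with the larger norm, while the Euclidean distance favors the row closest to the projection, which is one way to see why the choice of measure is a meaningful design decision.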

3.2 Training

To evaluate the possibility of mutual benefits of the classification and regression objectives, our full model includes both branches on top of the fused representation $h$. Each of their respective outputs, the score vector $s$ and the projection $p$, is fed into a specific loss. The whole network is trained by backpropagation of the gradient of the two losses through all the layers leading to $h$.


Classification loss.

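The output $s$ of the classification branch goes through a standard logistic function $\sigma$ and binary cross entropy loss $\mathcal{L}_{\mathrm{cls}}$. Denoting with $y$ the one-hot (multi-hot) vector of the ground truth answer(s) of a specific training instance, we have

$$\mathcal{L}_{\mathrm{cls}} = -\sum_i \big[\, y_i \log \sigma(s_i) + (1 - y_i) \log\big(1 - \sigma(s_i)\big) \big] \qquad (1)$$

where $i$ indexes vector elements. The sum allows for multiple ground truth answers to a single training question.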

Regression loss.

The output of the additional regression branch produces the vector $p$. It is interpreted as a location in a high-dimensional space that captures the semantics of the predicted answer. We store in a matrix $A$ representations of candidate answers in this space ($D$-dimensional row vectors). These representations can be learned or initialized using prior knowledge, as described below. The objective of the regression branch is to produce a vector close to the representation of the ground truth answer, and distinct from those of incorrect ones. Using a metric $\operatorname{dist}(\cdot, \cdot)$, we compute all distances between $p$ and the rows of $A$, noted as $d$. We have

$$d_i = \operatorname{dist}(p, A_i) \qquad (2)$$
We then define a hinge loss on these distances:



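Here $m$ is a scalar margin hyperparameter. Our overall optimization objective is the convex combination of the classification loss $\mathcal{L}_{\mathrm{cls}}$ and the regression (hinge) loss $\mathcal{L}_{\mathrm{reg}}$:

$$\mathcal{L} = \lambda\, \mathcal{L}_{\mathrm{cls}} + (1 - \lambda)\, \mathcal{L}_{\mathrm{reg}} \qquad (4)$$

where the scalar hyperparameter $\lambda \in [0, 1]$ balances the two objectives. By setting $\lambda = 1$, the loss reduces to the traditional classification objective alone, which serves as our baseline.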
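As a minimal sketch of how the two losses could be combined, the snippet below implements a binary cross entropy over multi-hot targets and a convex combination with a distance-based hinge term. The exact form of the hinge loss is not recoverable from the text here, so `hinge_distance_loss` is an assumption: it pulls the distance to correct answers toward zero and penalizes incorrect answers that lie within the margin. All function and parameter names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function (adequate for the moderate values used here)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(scores, targets):
    """Binary cross entropy summed over candidates; targets may be multi-hot."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for s, y in zip(scores, targets):
        q = sigmoid(s)
        total -= y * math.log(q + eps) + (1.0 - y) * math.log(1.0 - q + eps)
    return total


def hinge_distance_loss(dists, targets, margin):
    """ASSUMED hinge form, not confirmed by the source text:
    correct answers are pulled close (penalty = distance), incorrect
    answers are pushed at least `margin` away (penalty = shortfall)."""
    total = 0.0
    for d, y in zip(dists, targets):
        total += y * d + (1.0 - y) * max(0.0, margin - d)
    return total


def combined_loss(scores, dists, targets, lam=0.5, margin=1.0):
    """Convex combination: lam weights classification, (1 - lam) regression."""
    cls = bce_loss(scores, targets)
    reg = hinge_distance_loss(dists, targets, margin)
    return lam * cls + (1.0 - lam) * reg
```

With `lam=1.0` the function returns the classification loss alone, and with `lam=0.0` the regression term alone, which matches the role of the balancing weight described above.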

3.3 Predictions

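Due to the nature of existing datasets, answer prediction at test time does not differ from training, since the train and test splits typically share a common answer set. Our current experiments thus simply use the answers predicted by the network with the same combination of the classification and regression branches as in the training objective. That is, the final predicted answer is the candidate with the best combination of high score and low distance. Formally:

$$\hat{a} = \arg\max_i \big[\, \lambda\, \sigma(s_i) - (1 - \lambda)\, d_i \big] \qquad (5)$$

where $s_i$ and $d_i$ are the classification score and the regression distance of candidate $i$, $\sigma$ is the logistic function, and $\lambda$ is the weight that balances the two branches.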
3.4 Incorporating Prior Knowledge about Answers

The matrix $A$ of the regression branch contains, in each of its rows, the representation of a candidate answer. $A$ can be treated and optimized as any other parameter of the network, but it can also be initialized with values that contain prior knowledge about answers. In particular, we experiment with GloVe embeddings [16] for single-word answers, and with the average of the word embeddings (i.e. a bag-of-words) for multi-word ones. The values of $A$ are further fine-tuned during training. Freezing them always proved inferior in our preliminary experiments (not reported).

As ablations of our model, we consider two other initialization schemes for $A$. They serve to probe for the source of the gains of our model.

  • Random. We initialize $A$ with normally distributed random values, as would be any other weight matrix of the network.

  • Shuffled GloVe. We initialize $A$ with GloVe embeddings as described above, but subsequently shuffle its rows randomly, as in [21]. The rows of $A$ are thus mismatched from their corresponding answers. This allows us to disentangle the anticipated benefits of using the semantic information carried in GloVe vectors from the mere numerical effects of using them as initial values.
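The three initialization schemes can be sketched as follows. This is an illustration under assumptions rather than the authors' code: `word_vectors` is a hypothetical dictionary standing in for pretrained GloVe vectors, the function names are invented, and falling back to a zero vector for answers with no known word is a choice made here that the text does not specify.

```python
import random


def average_embedding(answer, word_vectors, dim):
    """Bag-of-words representation: mean of the vectors of the answer's words."""
    vecs = [word_vectors[w] for w in answer.split() if w in word_vectors]
    if not vecs:
        return [0.0] * dim  # ASSUMPTION: zero vector when no word is known
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def init_answer_matrix(answers, word_vectors, dim, scheme="glove", seed=0):
    """Build the initial answer matrix, one row per candidate answer.

    scheme: "glove" (pretrained), "shuffled" (pretrained rows permuted),
            or "random" (normally distributed values).
    """
    rng = random.Random(seed)
    if scheme == "random":
        return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in answers]
    if scheme not in ("glove", "shuffled"):
        raise ValueError("unknown scheme: " + scheme)
    rows = [average_embedding(a, word_vectors, dim) for a in answers]
    if scheme == "shuffled":
        rng.shuffle(rows)  # rows no longer match their answers
    return rows


# Toy stand-in for pretrained vectors (hypothetical values).
word_vectors = {"dark": [1.0, 0.0], "blue": [0.0, 1.0], "dog": [2.0, 2.0]}
answers = ["blue", "dark blue", "dog"]
glove_init = init_answer_matrix(answers, word_vectors, dim=2, scheme="glove")
shuffled_init = init_answer_matrix(answers, word_vectors, dim=2, scheme="shuffled", seed=1)
random_init = init_answer_matrix(answers, word_vectors, dim=2, scheme="random")
```

The shuffled matrix contains exactly the same rows as the pretrained one, so any difference in downstream accuracy between the two can be attributed to the answer-to-vector assignment rather than to the numerical range of the values.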

4 Experiments

We performed an extensive evaluation to thoroughly validate the benefits of the proposed method, and understand the exact source of improvement. The overall conclusion is that the improvements indeed stem from the information brought in by the use of external data, rather than numerical artifacts or structural modifications to the network architecture.

Our contributions are implemented on top of the open-source Pythia framework [11], the winning entry of the 2018 VQA Challenge. The technique is however applicable to a wide range of current and future models. Pythia thus serves as the main baseline. We also evaluate the Pythia model where the weights of the output classifier are initialized with pretrained answer embeddings (noted ‘Pythia+GloVe’) in the manner proposed by [21]. We also compare our method to existing methods designed to inject prior knowledge in the model in the form of answer embeddings. Precisely, we consider the two variants of the “factorized Probabilistic Model of Compatibility” (fPMC) proposed by [7], using the code provided by the authors. All tested models use the same image features (those provided with the GQA dataset) and representations of question words (300-dimensional pretrained GloVe embeddings). Details are provided in Appendix A.

The general hyperparameters of Pythia (batch size, learning rate, etc.) were chosen by grid search for best performance of the baseline model (i.e. without our contributions) on the GQA validation set. They were not modified once our contributions were added. This ensures a fair and challenging baseline. The distance function $\operatorname{dist}$ of the regression branch is implemented as the Euclidean distance. This choice proved empirically superior, on the GQA validation set, to a dot product or a cosine similarity. The loss weight $\lambda$ (Eq. 4) is set to $0.5$, unless otherwise noted. Every experiment was repeated with five different random seeds, and we report the average over the five runs. The ensembles use the average of the predicted scores/distances of several models trained with different random seeds, before taking the $\arg\max$ of Eq. 5.
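The ensembling step can be sketched as below. Several details are assumptions made for illustration: the function names are hypothetical, the classification logits are averaged before the logistic function (the text does not say whether logits or probabilities are averaged), and the selection rule weights the score by `lam` and the distance by `1 - lam`, mirroring the training objective.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def column_means(rows):
    """Element-wise mean over a list of equal-length lists."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def ensemble_predict(all_scores, all_dists, lam=0.5):
    """Return the index of the selected answer for one question.

    all_scores[k][i]: classification logit of model k for candidate i.
    all_dists[k][i]:  distance from model k's projection to candidate i.
    ASSUMPTION: logits are averaged before the sigmoid, and the rule is
    argmax_i  lam * sigmoid(s_i) - (1 - lam) * d_i.
    """
    s = column_means(all_scores)
    d = column_means(all_dists)
    combined = [lam * sigmoid(si) - (1.0 - lam) * di for si, di in zip(s, d)]
    return max(range(len(combined)), key=combined.__getitem__)
```

For example, with two models whose logits favor the second candidate but whose projections lie much closer to the first, `lam=1.0` selects the second candidate while `lam=0.5` and `lam=0.0` select the first.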

                          GQA validation         GQA test-dev           GQA test
Model                     Binary Open   All      Binary Open   All      Binary Open   All
Blind LSTM                -      -      -        -      -      -        61.90  22.69  41.07
BUTD                      -      -      -        -      -      -        66.64  34.83  49.74
MAC                       -      -      -        -      -      -        71.23  38.91  54.06
LXMERT                    -      -      -        -      -      -        77.80  45.00  60.30
NSM                       -      -      -        -      -      -        78.94  49.25  63.17
Pythia                    75.45  45.76  60.13    71.51  38.15  53.46    -      -      -
Pythia + GloVe            74.91  45.77  59.87    71.36  37.94  53.28    -      -      -
fPMC(BUTD)                69.85  42.28  55.62    64.80  35.40  48.90    -      -      -
fPMC(SAN)                 71.94  41.78  56.37    67.02  35.83  50.14    -      -      -
Ours + random             75.15  46.33  60.27    70.67  38.14  53.08    -      -      -
Ours + shuffled GloVe     76.17  46.53  60.87    71.80  38.48  53.78    -      -      -
Ours + GloVe              76.93  46.99  61.48    72.19  39.31  54.40    71.35  40.07  54.73
Ensemble: 5 Pythia        77.24  48.41  62.36    73.43  39.85  55.26    -      -      -
Ensemble: 5 Ours + GloVe  79.32  49.48  63.92    74.35  41.40  56.52    -      -      -
Table 1: Accuracy (%) on GQA. Our method shows clear improvements on both binary and open-ended questions.
GQA test-dev
Model                     Choose    Compare   Logical   Query     Verify    Attribute Category  Global    Object    Relation
Pythia                    67.93     62.14     72.11     38.15     75.28     60.04     45.59     51.85     84.53     44.24
Pythia + GloVe            68.45     62.72     71.69     37.94     74.81     59.40     44.72     54.14     84.78     44.51
fPMC(BUTD)                60.39     57.83     63.36     35.40     69.99     51.17     42.90     51.97     81.95     43.04
fPMC(SAN)                 64.80     61.53     65.62     35.83     70.69     53.80     43.69     54.01     80.95     43.32
Ours + random             67.42     62.07     70.87     38.14     74.40     58.53     45.01     55.42     84.68     44.79
Ours + shuffled GloVe     70.88     62.14     71.38     38.48     75.14     59.65     45.36     54.14     84.60     45.33
Ours + GloVe              71.12     62.21     71.14     39.31     76.17     60.28     46.04     53.88     85.01     46.01
Ensemble: 5 Pythia        70.95     63.33     73.54     39.85     77.22     61.84     47.87     52.23     86.50     45.95
Ensemble: 5 Ours + GloVe  75.11     64.18     73.04     41.40     77.66     63.00     48.30     53.50     86.25     47.70
Table 2: Accuracy (%) over question types on GQA test-dev. Our contributions bring clear improvements on most question types, with the highest gain on the choose category.

For evaluation of our approach we use the GQA dataset [9] as it provides the most comprehensive suite of metrics and cleanest data of current VQA datasets. However, we do not aim to build a data-specific solution, so our model does not utilize scene graphs and functional programs included in the dataset.

4.1 Quantitative Results

Our main results on the GQA dataset are provided in Table 1. Looking at the overall accuracy, our model clearly outperforms all baselines and ablations. The same observations can be drawn on both the binary and open-ended questions. The trend is also confirmed when evaluating an ensemble of our model, versus a similar ensemble of the Pythia baseline. The fPMC model [7] obtains the lowest results, including our modified version fPMC(BUTD) (details in Appendix A), which indicates its lack of adaptivity to complex feature representation methods. The fPMC model was initially tested only on the very noisy VQA v2 dataset, and a possible reason for its weak performance on GQA is the narrower answer set. A surprising outcome is that Pythia with pretrained classifier (‘Pythia+GloVe’) shows worse accuracy results than the baseline. This occurs mostly due to overfitting of the pre-initialized classifier to the most common answers in the training set, as observed by the reduced accuracy on both the validation (see Table 5 in Appendix D) and test-dev sets. Unlike the other described architectures, our model exploits the additional information contained in the representations of answers in an effective way, increasing performance without overfitting.

4.2 Comparison with Existing Models

We compare our model with existing methods reported in [9] and several recent state-of-the-art models (see Table 1). We report the performance of the blind LSTM, the bottom-up top-down attention model [2], MAC [8], LXMERT [20] and the neural state machine (NSM) [10]. Our model shows better results than all the baselines, and in spite of a much simpler architecture, it notably surpasses the MAC model. However, the newest methods, LXMERT and NSM, show higher performance, which is not surprising. The LXMERT model uses a more sophisticated technique for image and language representation and is pretrained on a significantly larger amount of data. NSM implements a compositional approach and performs explicit multi-step reasoning. In contrast, our approach focuses on the output stage of a VQA model, so the contributions of this article are expected to be applicable to these models.

GQA validation
Model                     Validity      Plausibility  Distribution  Consistency
Pythia                    95.07         91.39         3.93          83.12
Pythia + GloVe            95.13         91.40         4.07          82.68
fPMC(BUTD)                94.99         90.91         6.20          76.53
fPMC(SAN)                 95.11         91.62         5.66          78.67
Ours + random             95.07         91.53         4.26          83.00
Ours + shuffled GloVe     95.14         91.48         4.01          83.37
Ours + GloVe              95.16         91.55         4.01          84.57
Ensemble: 5 Pythia        95.17         91.90         4.78          85.27
Ensemble: 5 Ours + GloVe  95.25         92.06         4.56          87.33
Table 3: Results on additional metrics: Validity, Plausibility, Distribution (lower is better), and Consistency. Our model noticeably improves in consistency over the baseline. It ranks slightly worse on the distribution metric (see discussion in text).

4.3 In-Depth Analysis

We report the detailed metrics of the GQA dataset in Table 3. The first observation is that a similar ranking of methods and ablations can be drawn from most of the metrics. This stability further confirms the benefits of the proposed method. The improvements on these advanced metrics also indicate benefits beyond the sole increase in accuracy. The validity and plausibility scores, in particular, are marginally higher than those of the baseline (95.16 vs. 95.07 and 91.55 vs. 91.39 for a single model), which suggests a generally more robust model. The higher consistency score (84.57 vs. 83.12) implies that the answers produced over related questions are compatible with one another (see Fig. 5). The only metric on which our model falls below the baseline is the answer distribution. It indicates that the model occasionally favors one answer over most others. We explain this by the fact that some answers are not assigned appropriate initial representations (see the rest of the discussion below). We also look at the accuracy metric broken down by question categories (Table 2). We observe no significant drop in accuracy for any type, and the highest improvements occur on the choose, query, attribute, and relation questions.

Figure 3: The performance of our model varies smoothly with the relative weight $\lambda$ of the classification and regression losses (Eq. 4). The value $\lambda = 1$ corresponds to a traditional classification-only baseline, while the optimal value $\lambda = 0.5$ corresponds to an even contribution of the two losses.

The ablations of our method (Ours+random and Ours+shuffled GloVe) are important to determine whether the source of improvements is in the architecture of our model (the additional output branch and loss), in numerical effects from the initialization of the answer matrix $A$ with values from GloVe vectors, or in the actual information conveyed in the GloVe vectors. The ablation with random initial values performs similarly to the Pythia baseline, which shows no significant effect from the architecture alone. Surprisingly, the shuffled GloVe ablation brings some improvement, which we explain by two factors. First, since the values of $A$ are further fine-tuned with the rest of the model, they can still incorporate useful information from the task-specific supervision even if the initial values do not contain relevant semantic information. Second, some answers may actually benefit from the “wrong” initialization: we have determined that the absolute values of the representations of answers do not play the most significant role, but that their mutual relations are what encodes the critical information. This shows up in particular with pairs of antonym answers such as yes/no or left/right. The GloVe embeddings of these pairs are usually similar, whereas the VQA task-specific supervision tends to push their representations apart. The shuffled initialization can thus prove better than the “correct” one in some cases. This can also be observed in the high accuracy of the shuffled ablation on the choose category of questions, which specifically contain this type of antonym answers (see Table 4). Despite these effects, the full model still performs clearly better than the ablations, indicating an overall benefit from the information conveyed in the GloVe representations of answers.

Since our architecture is trained to minimize a weighted sum of two losses (classification and regression), we sought to evaluate their possible mutual benefit by varying their relative weight ($\lambda$ in Eq. 4). A value of $\lambda = 0$ corresponds to the regression loss alone, and $\lambda = 1$ to the baseline using the traditional classification loss alone. Interestingly, a balanced value of $\lambda = 0.5$ leads to the highest accuracy (Fig. 3), demonstrating that the two losses are indeed complementary.

Figure 4: Absolute gain in answer recall of our model over the Pythia baseline (positive is an improvement). We report an even subset of answers (every one in descending recall gain). See the text for a discussion.

To obtain deeper insights into the additional knowledge that is actually most beneficial, we examined the improvements of our model over the Pythia baseline on individual answers. We report, in Fig. 4, the change in answer recall for a selection of answers. We define the answer recall, for an answer candidate $a$, as the ratio of questions with $a$ as ground truth that are correctly answered by the model. The recall of most answers improves, but it stays similar or even degrades on some others. We investigated the possible reasons. A degradation is presumably related to less relevant initial representations of the corresponding answer. To assess this, we examined the closest other answers in the space of pretrained GloVe vectors. Most answers with a negative gain in answer recall have neighbors with no semantic or syntactic connections. For instance, the three closest neighbors to modern are under, rooftop, visitor. Answers with a high recall improvement, on the contrary, tend to have semantically related neighbors. For example, basket has the closest neighbors baskets, cane, sack. These observations further support the claim that mutual relations between representations of answers are the major way in which the network stores and uses semantic information.
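The per-answer analysis described above can be sketched as follows. This is a minimal illustration: the function names and the toy data are hypothetical, and the use of Euclidean distance for the neighbor search is an assumption, since the text does not state which measure was used to find the closest answers.

```python
import math
from collections import defaultdict


def answer_recall(ground_truths, predictions):
    """For each answer a: fraction of questions with ground truth a
    that the model answers correctly."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for gt, pred in zip(ground_truths, predictions):
        total[gt] += 1
        if pred == gt:
            correct[gt] += 1
    return {a: correct[a] / total[a] for a in total}


def recall_gain(ground_truths, model_preds, baseline_preds):
    """Absolute gain in answer recall of the model over the baseline."""
    ours = answer_recall(ground_truths, model_preds)
    base = answer_recall(ground_truths, baseline_preds)
    return {a: ours[a] - base[a] for a in ours}


def nearest_neighbors(word, vectors, k=3):
    """The k other entries closest to `word` (ASSUMED Euclidean distance)."""
    def distance(other):
        return math.sqrt(sum((x - y) ** 2
                             for x, y in zip(vectors[word], vectors[other])))
    others = [w for w in vectors if w != word]
    return sorted(others, key=distance)[:k]


# Toy data (hypothetical): four questions and two answers.
gts = ["basket", "basket", "modern", "modern"]
ours = ["basket", "basket", "old", "modern"]
baseline = ["basket", "sack", "modern", "modern"]
gain = recall_gain(gts, ours, baseline)
```

In this toy example the recall of basket rises from 0.5 to 1.0 while that of modern falls from 1.0 to 0.5, which is the kind of per-answer gain and loss that the figure reports.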

Figure 5: Qualitative examples from the GQA dataset, with predictions of our model and of the Pythia baseline. We show pairs of questions about the same image where the first entails the second (this information is never provided to the model during training or testing). Our model improves in consistency over the baseline, producing pairs of answers compatible with one another.

4.4 Prediction of Novel Answers

Our model trained with the regression objective can predict answers at test time that are outside the predefined set of candidates used for training (i.e. open-set prediction, a.k.a. zero-shot VQA [22]). This is achieved by replacing the rows of the answer matrix $A$ with representations of the new answers and setting the weight $\lambda$ to 0 at test time, so that only the regression branch is used. To evaluate this setting, we use ConceptNet embeddings [19], which are designed to capture common sense knowledge. We use the VQA v2 dataset [6] since it features a more diverse set of answers than GQA. We use splits with disjoint sets of answers at training and test time (see details in Appendix C). In this setting, our model achieves an accuracy of 27% on the test set, while the fPMC model, which also has tools for out-of-vocabulary prediction, obtains about 15% accuracy. Given that test questions feature exclusively answers never seen during training, this clearly demonstrates a capability for predictions beyond the scope of the training set. However, the performance on novel answers is highly dependent on the answer representations used. Embeddings like GloVe and ConceptNet carry only limited, mostly linguistic information, which is insufficient for the full scope of their use in VQA. We discovered that embeddings learned end-to-end in a VQA model capture visual co-occurrences (Fig. 6 in Appendix D.3). This implies that additional knowledge extracted from visual data (e.g. as in [15]) should be a useful complement to boost out-of-vocabulary performance. The combination of multiple types of pretrained embeddings is a promising avenue for future work.
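A minimal sketch of this test-time procedure is given below, under stated assumptions: the function name, the novel answers, and all vector values are hypothetical, and Euclidean distance is used because it is the distance reported for the main experiments. With the classification weight at 0, prediction reduces to a nearest-neighbor search among the rows of the replaced answer matrix.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_open_set(p, candidate_answers, candidate_matrix):
    """Nearest-candidate prediction using the regression branch only.

    p:                projection produced by the trained regression branch.
    candidate_matrix: one embedding per novel answer, replacing the
                      answer matrix used during training.
    """
    dists = [euclidean(p, row) for row in candidate_matrix]
    best = min(range(len(dists)), key=dists.__getitem__)
    return candidate_answers[best], dists[best]


# Hypothetical embeddings for answers never seen during training.
novel_answers = ["violin", "kayak", "giraffe"]
novel_matrix = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [-0.5, 0.0, 1.0]]
p = [0.1, 0.8, 0.3]  # stands in for the network's predicted projection

answer, distance = predict_open_set(p, novel_answers, novel_matrix)
```

Because the classification branch has a fixed output size tied to the training answers, it cannot score new candidates, whereas the distance computation works for any set of rows that live in the same embedding space.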

5 Conclusions

In this paper, we reformulated VQA as a multitask problem, which allowed us to exploit prior semantic knowledge about answers. We demonstrated that GloVe word embeddings carry information about typical answers that is relevant to the task. In contrast to existing methods for incorporating additional data into VQA models, our technique is both simple and effective, and allows tuning the reliance of the model on general prior knowledge versus learned task-specific information. We evaluated our technique on the GQA dataset and obtained consistent improvements in accuracy in the majority of question categories. The extensive set of metrics also allowed identifying benefits in robustness and consistency of the model across related questions.

The fundamental idea in this paper of including a regression task as part of VQA has implications that go beyond what could be demonstrated with existing datasets. This formulation opens the door to the generation of compositional multi-word answers, and to open-set prediction, that is, predicting answers beyond the set of candidate answers predefined at training time.


  • [1] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele (2015) Evaluation of output embeddings for fine-grained image classification. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    pp. 2927–2936. Cited by: §2.
  • [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018)

    Bottom-up and top-down attention for image captioning and visual question answering

    In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 3, pp. 6. Cited by: 2nd item, §1, §4.2.
  • [3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) VQA: visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433. Cited by: §1, §1, §3.
  • [4] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. (2013) Devise: a deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pp. 2121–2129. Cited by: §2.
  • [5] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu (2015) Are you talking to a machine? dataset and methods for multilingual image question. In Proceedings of the Advances in Neural Information Processing Systems, pp. 2296–2304. Cited by: §1, §3.
  • [6] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the V in VQA matter: elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 3. Cited by: Appendix A, §1, §4.4.
  • [7] H. Hu, W. Chao, and F. Sha (2018) Learning answer embeddings for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5428–5436. Cited by: Appendix A, §2, §4.1, §4.
  • [8] D. A. Hudson and C. D. Manning (2018) Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067. Cited by: §4.2.
  • [9] D. A. Hudson and C. D. Manning (2019) GQA: a new dataset for real-world visual reasoning and compositional question answering. Conference on Computer Vision and Pattern Recognition. Cited by: 3rd item, §4.2, §4.
  • [10] D. Hudson and C. D. Manning (2019) Learning by abstraction: The neural state machine. In Advances in Neural Information Processing Systems, pp. 5901–5914. Cited by: §4.2.
  • [11] Y. Jiang, V. Natarajan, X. Chen, M. Rohrbach, D. Batra, and D. Parikh (2018) Pythia v0.1: the winning entry to the VQA challenge 2018. arXiv preprint arXiv:1807.09956. Cited by: Appendix A, Appendix B, §4.
  • [12] K. Kafle and C. Kanan (2017) Visual question answering: Datasets, algorithms, and future challenges. Computer Vision and Image Understanding 163, pp. 3–20. Cited by: Appendix A, §2.
  • [13] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix B.
  • [14] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §1.
  • [15] H. Noh, T. Kim, J. Mun, and B. Han (2019) Transfer learning via unsupervised task discovery for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8385–8394. Cited by: §1, §4.4.
  • [16] J. Pennington, R. Socher, and C. Manning (2014) Glove: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543. Cited by: §1, §3.4.
  • [17] K. Saito, A. Shin, Y. Ushiku, and T. Harada (2017) Dualnet: Domain-invariant network for visual question answering. In Proceedings of the IEEE International Conference on Multimedia and Expo, pp. 829–834. Cited by: §3.
  • [18] T. Salimans and D. P. Kingma (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909. Cited by: Appendix A.
  • [19] R. Speer, J. Chin, and C. Havasi (2017) Conceptnet 5.5: an open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §4.4.
  • [20] H. Tan and M. Bansal (2019) Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. Cited by: §4.2.
  • [21] D. Teney, P. Anderson, X. He, and A. van den Hengel (2018) Tips and tricks for visual question answering: learnings from the 2017 challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4223–4232. Cited by: Appendix A, §1, §2, 2nd item, §4.
  • [22] D. Teney and A. van den Hengel (2016) Zero-shot visual question answering. arXiv preprint arXiv:1611.05546. Cited by: §D.4, §2, §4.4.
  • [23] D. Teney, Q. Wu, and A. van den Hengel (2017) Visual question answering: A tutorial. IEEE Signal Processing Magazine 34 (6), pp. 63–75. Cited by: Appendix A, §2.
  • [24] Q. Wu, C. Shen, P. Wang, A. Dick, and A. van den Hengel (2017) Image captioning and visual question answering based on attributes and external knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (6), pp. 1367–1381. Cited by: §1.
  • [25] H. Xu, G. Qi, J. Li, M. Wang, K. Xu, and H. Gao (2018) Fine-grained image classification by visual-semantic embedding. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 1043–1049. Cited by: §2.
  • [26] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola (2016) Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–29. Cited by: 1st item, §1.
  • [27] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus (2015) Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167. Cited by: §3.



Appendix A Baseline Methods


Our baseline model is the Pythia implementation [11] of the classical joint embedding model [23, 12]. Pythia uses object image features extracted with the pretrained Faster R-CNN model provided with the GQA dataset. On the language side, words are represented with word embeddings initialized with pretrained GloVe vectors, followed by an LSTM to produce a vector representation of the whole question. A question-guided top-down attention is applied on image features to identify relevant image regions. The image and question features are passed through non-linear layers and finally combined with an element-wise multiplication. The final classifier comprises a non-linear layer and a linear one, which produces a score for each candidate answer. All non-linear layers throughout the network use weight normalization [18] and ReLU activations.

Pythia serves as the reference for evaluation and as the bare model on which we build our contributions. We chose it for several reasons. It is a high-performing open-source implementation that still outperforms many others on the VQA v2 dataset [6], which gives us a strong, and thus challenging, starting point to demonstrate the proposed method. Moreover, the implementation of Pythia is modular and makes it easy to separate, replace, and compare the various blocks of the model. In our case, this enables us to focus specifically on the classification part of the model, leaving the rest unchanged.

Pythia with pretrained classifier

We compare our method to the Pythia model in which the output classifier is initialized with pretrained answer embeddings. As discussed in the related work section, this is a reasonable approach to embed semantic information about candidate answers within the model. Following a procedure similar to [21], we collect 300-dimensional GloVe embeddings for all words in the answer vocabulary (substituting unknown words with zero vectors). We represent each answer directly by its matching word embedding or, in the case of multi-word answers, by the average embedding of the constituent words. Next, we design the classifier block of the model as follows: one non-linear layer with an output dimension equal to the dimensionality of the GloVe embeddings, followed by a linear layer whose weight matrix is initialized with the answer embeddings. Each row of this weight matrix thus contains the vector corresponding to one specific answer. Besides the non-random initialization of this matrix, the only distinction from the original Pythia model is that the output dimension of the non-linear layer is reduced from 5000 to 300 to match the dimensionality of the GloVe vectors.
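The construction of the answer embeddings described above can be sketched as follows. This is a minimal illustration rather than the code used in our experiments: the function names, the whitespace tokenization, and the lowercasing are assumptions made for the example, and the embedding table is passed in as a plain dictionary instead of being loaded from the GloVe files.

```python
from typing import Dict, List

EMBEDDING_DIM = 300  # dimensionality of the GloVe vectors mentioned in the text


def answer_embedding(answer: str,
                     glove: Dict[str, List[float]],
                     dim: int = EMBEDDING_DIM) -> List[float]:
    """Average the word vectors of an answer (hypothetical helper).

    Unknown words are replaced by zero vectors, so they still count
    in the average. Splitting on whitespace is an assumption.
    """
    words = answer.lower().split()
    if not words:
        return [0.0] * dim
    total = [0.0] * dim
    for word in words:
        vector = glove.get(word, [0.0] * dim)  # zero vector for unknown words
        for i, value in enumerate(vector):
            total[i] += value
    return [value / len(words) for value in total]


def build_classifier_weights(answers: List[str],
                             glove: Dict[str, List[float]],
                             dim: int = EMBEDDING_DIM) -> List[List[float]]:
    """Stack one embedding per candidate answer: row k corresponds to answer k."""
    return [answer_embedding(answer, glove, dim) for answer in answers]
```

The resulting list of rows has one row per candidate answer and 300 columns, which is the shape needed to initialize the final linear layer of the classifier.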

Factorized probabilistic model of compatibility

We also compare the proposed approach to the factorized Probabilistic Model of Compatibility (fPMC) [7]. In this architecture, a joint image-question embedding is learned alongside the answer embedding, and the model is trained to increase the likelihood of the correct answer. We performed all experiments with the following two variants proposed by the authors.

  • fPMC(SAN) is the model described in the original paper. It uses a stacked attention network [26] together with a bidirectional LSTM and spatial image features extracted with ResNet-152. To obtain answer embeddings, the model applies a two-layer bidirectional LSTM over GloVe vectors. We used the code provided by the authors and made only the adjustments required to make it compatible with the GQA dataset.

  • fPMC(BUTD) model is our modification of fPMC(SAN) where we used the “bottom-up and top-down attention” [2] model with object image features for parameterizing the joint embedding in the same way as all the other models used in our experiments. We were thus able to explicitly evaluate the approach of learning aligned answer embeddings independently from the impact of different feature initializations.

Appendix B Implementation of the Proposed Method

The proposed method builds directly on the open-source Pythia implementation, which uses PyTorch. Our model is trained for 20,000 iterations with a batch size of 512 and the AdaMax optimizer [13]. We adopted the warm-up learning rate schedule from [11] and tuned it to the current setup. Specifically, the learning rate starts at 0.002, grows linearly to 0.1 during the first 1,000 iterations, and is then multiplied by 0.1 at 11,000, 13,000, and 15,000 iterations. Importantly, these hyperparameters were selected for best performance of the baseline model on the validation set of the GQA dataset, thus avoiding any unfair advantage for our contributions. The values of the regression loss margin and of the loss weight were determined by grid search for best overall accuracy on the GQA validation set.
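The learning rate schedule described above can be written as a small function of the iteration number. This is a sketch under stated assumptions, not the Pythia scheduler itself: the function name and parameter names are hypothetical, the warm-up is assumed to interpolate linearly per iteration, and each milestone is assumed to multiply the rate by 0.1 from that iteration onward. The default values are the ones quoted in the text.

```python
def learning_rate(iteration: int,
                  start_lr: float = 0.002,
                  peak_lr: float = 0.1,
                  warmup_iters: int = 1000,
                  milestones=(11000, 13000, 15000),
                  decay: float = 0.1) -> float:
    """Warm-up followed by step decay (illustrative, hypothetical helper)."""
    if iteration < warmup_iters:
        # Linear interpolation between the starting and the peak learning rate.
        fraction = iteration / warmup_iters
        return start_lr + fraction * (peak_lr - start_lr)
    # Count how many decay milestones have already been reached.
    drops = sum(1 for milestone in milestones if iteration >= milestone)
    return peak_lr * decay ** drops
```

In a PyTorch training loop, a function of this kind would typically be evaluated once per iteration and written into the optimizer's parameter groups.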

For the VQA v2 dataset, the only difference in hyperparameters is the learning schedule. The model is trained for 12,000 iterations with the learning rate decreasing at 5,000, 7,000, 9,000, and 11,000 iterations, following the original Pythia implementation. We also found it beneficial to apply L2 normalization after the projection layer. In the out-of-vocabulary experimental setting, we used a subset of the VQA v2 data, so the parameters were adjusted to fit the smaller dataset. Specifically, we reduced the batch size to 128 and increased the number of iterations to 30,000, with decreasing steps at 12,000, 17,000, 22,000, and 25,000 iterations.

Appendix C Datasets


The GQA dataset was designed as a benchmark for compositional question answering over real-world images. The authors proposed new metrics for a detailed assessment of a model’s performance:

  • Validity measures whether the predicted answer fits the scope of the question (e.g. a number for a counting question).

  • Plausibility checks that the answer is semantically reasonable, defined as occurring at least once with the given question in the whole dataset.

  • Distribution is the distance between the distributions of predicted and ground truth answers over groups of questions. A lower value means a better ability to predict less frequent answers.

  • Consistency measures the agreement between answers to pairs of questions about the same image where one entails the other.

  • Grounding is used to evaluate attention-based models and is not tested in our study, since attention is not the focus of this research.

The dataset also assigns test questions to categories (Table 4), across which the accuracy can be measured separately (as done in Table 2).

| Type | Example |
|---|---|
| Choose | Is it an indoors or outdoors scene ? |
| Compare | Are all these animals of the same type ? |
| Logical | Are there nuts or vegetables ? |
| Query | What is this bird called ? |
| Verify | Is there a cat that is not white ? |
| Attribute | What is the color of the fence made of metal ? |
| Category | What piece of furniture is not small ? |
| Global | Which place is it ? |
| Object | Is there a train in the picture ? |
| Relation | What is the vegetable on top of the pizza ? |

Table 4: Examples of each question type of the GQA dataset.

VQA v2 with Novel Answers

To test out-of-vocabulary answer prediction, we created a subset of the VQA v2 dataset. We used the original training and validation splits as our new training and test splits, respectively. In each of them, we filtered the questions according to the following rules: 1) every ground truth answer has a corresponding ConceptNet embedding (exact match), 2) every ground truth answer consists of one word only (e.g. discarding black and white or don’t know), 3) every ground truth answer must occur in the original dataset between 5 and 500 times (thus discarding very rare answers and extremely frequent ones such as yes and no), 4) the sets of ground truth answers in the training and test splits do not intersect. With this procedure, we obtain 91,255 training questions with 6,928 possible answers, and 13,367 test questions with another 1,187 answers.
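The four filtering rules can be illustrated with the following sketch. It is not the script used to build our splits: the function names and the question format (a dictionary with an "answers" list) are hypothetical, the frequency bounds are assumed to be inclusive, and rule 4 is enforced here by discarding test questions whose answers already appear in the filtered training split, which is only one possible way to obtain disjoint answer sets.

```python
from collections import Counter


def is_eligible(answer, answer_counts, conceptnet_vocab,
                min_count=5, max_count=500):
    """Check rules 1 to 3 for a single ground truth answer."""
    has_embedding = answer in conceptnet_vocab            # rule 1: exact match
    is_single_word = len(answer.split()) == 1             # rule 2: one word only
    count = answer_counts.get(answer, 0)
    in_frequency_range = min_count <= count <= max_count  # rule 3 (assumed inclusive)
    return has_embedding and is_single_word and in_frequency_range


def filter_split(questions, answer_counts, conceptnet_vocab,
                 excluded_answers=frozenset()):
    """Keep questions whose ground truth answers are all eligible.

    `excluded_answers` implements rule 4 (assumption): pass the answers of
    the filtered training split when filtering the test split.
    """
    kept = []
    for question in questions:
        answers = question["answers"]
        if answers and all(
            is_eligible(a, answer_counts, conceptnet_vocab)
            and a not in excluded_answers
            for a in answers
        ):
            kept.append(question)
    return kept
```

The training split would be filtered first; its remaining answers are then collected into a set and passed as `excluded_answers` when filtering the test split.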

Appendix D Additional Results

D.1 GQA

We provide in Table 5 the performance on the GQA validation set, matching the experiments provided in Table 2 of the main paper.

| Method | Choose | Compare | Logical | Query | Verify | Attribute | Category | Global | Object | Relation |
|---|---|---|---|---|---|---|---|---|---|---|
| Pythia | 70.64 | 67.57 | 81.66 | 45.76 | 75.86 | 64.43 | 56.58 | 64.81 | 83.69 | 51.42 |
| Pythia + GloVe | 69.94 | 67.11 | 81.28 | 45.77 | 75.30 | 63.84 | 57.19 | 65.20 | 83.44 | 51.22 |
| fPMC(BUTD) | 65.05 | 60.54 | 75.93 | 42.28 | 70.54 | 55.32 | 53.91 | 62.61 | 80.00 | 49.45 |
| fPMC(SAN) | 69.72 | 64.04 | 77.38 | 41.78 | 71.25 | 56.28 | 55.20 | 63.99 | 80.38 | 50.04 |
| Ours + random | 71.08 | 67.02 | 81.13 | 46.33 | 75.28 | 63.63 | 56.52 | 65.66 | 83.36 | 52.32 |
| Ours + shuffled GloVe | 74.27 | 66.92 | 81.31 | 46.53 | 75.68 | 64.88 | 56.89 | 65.74 | 83.54 | 52.64 |
| Ours + GloVe | 75.36 | 67.33 | 81.60 | 46.99 | 76.58 | 66.21 | 57.23 | 65.58 | 83.71 | 52.94 |
| Ensemble: 5 Pythia | 73.65 | 69.40 | 82.92 | 48.41 | 77.23 | 67.48 | 58.77 | 65.90 | 84.71 | 53.48 |
| Ensemble: 5 Ours + GloVe | 79.28 | 69.33 | 83.08 | 49.48 | 78.65 | 69.38 | 58.83 | 66.54 | 85.09 | 55.37 |

Table 5: Performance for different question types (accuracy in %) on the GQA validation split.

D.2 VQA v2

| Method | Validation: Yes/No | Validation: Number | Validation: Other | Validation: All | Test-dev: Yes/No | Test-dev: Number | Test-dev: Other | Test-dev: All |
|---|---|---|---|---|---|---|---|---|
| Pythia | 83.11 | 44.50 | 56.86 | 65.11 | 83.42 | 45.53 | 57.13 | 66.64 |
| Ours + GloVe | 82.90 | 44.68 | 56.93 | 65.08 | 83.33 | 45.46 | 57.18 | 66.62 |
| Ours + GloVe (w/o fine-tuning) | 82.87 | 44.55 | 57.19 | 65.18 | 83.20 | 45.54 | 57.53 | 66.74 |

Table 6: Accuracy (%) on VQA v2 validation and test-dev sets. Our model with fixed (not fine-tuned) GloVe embeddings shows the highest results on the category other and on all questions overall for both splits.

We present experiments on the VQA v2 dataset in Table 6. Contrary to our results on GQA, we observe no significant difference compared to the baseline. We attribute this to the nature of the dataset. In VQA v2, a large fraction of the questions (over 37%) are to be answered with yes or no, and another 13% with a number. Our approach, which focuses on the representation of answer semantics, is already expected to have no influence on this large part of the dataset. Moreover, numbers in VQA v2 are used not only for counting questions, but also to refer to abstract concepts, as in questions like How old is animal ?, What time is the clock showing ?, or What is the size of the TV ?. It would certainly be difficult to infer a single representation of numbers that would encompass such a variety of concepts.

An additional challenge with VQA v2 is that most questions have multiple ground truth answers that are actually synonyms. Other times, annotation noise means that multiple answers with contradictory meanings are marked as correct. For example, a question Is the dog male or female ? has both male and female answers in the annotation. In our model, all ground truth answers contribute equally to the projection loss, meaning that noisy or incorrect answer labels can push the learned projection in wrong directions. This issue could be mitigated by introducing instance-specific weights in the projection loss. This is an interesting avenue for future work.

Overall, our approach still has a positive impact on VQA v2 for out-of-vocabulary prediction (see Section 4.4). Importantly, the above issues did not cause a decrease in performance compared to the baseline model.

D.3 Learned Representations

We use t-SNE projections to visualize and compare the off-the-shelf GloVe embeddings of candidate answers, which we use as prior knowledge to initialize the representations, with the same representations after fine-tuning within our VQA model. As expected, the GloVe embeddings carry the kind of semantic similarity that emerges from the co-occurrence of words in natural language. In the fine-tuned representations, we observe instead that the proximity of representations captures common co-occurrences of concepts in the same image, such that they are plausible answers to possible questions about this image. For example, the word steak is projected close to the words {potato, carrot, broccoli, tomato, pickles} (Fig. 6 (b)). We indeed observe co-occurrence of these objects in images from the GQA dataset (Fig. 6 (c)).

Figure 6: Examples of t-SNE projections in 2D of (a) the initial GloVe embeddings and (b) the learned (fine-tuned) answer representations. The proximity of the learned representations better captures typical co-occurrences of the corresponding concepts in images from the GQA dataset (c).

D.4 Prediction of Novel Answers

Figure 7: Examples of out-of-vocabulary predictions. The model works well when the ground truth answer has a clear and distinct pretrained embedding (top row), but fails to distinguish between synonymous answers (bottom row).

We analyzed the predicted answers in the out-of-vocabulary test setting to discover the cause of the reduced performance and possible ways to improve it (Fig. 7). Many failure cases are due to synonymous and/or related answers. When the representations of multiple candidate answers are close in the semantic space, it is difficult for the model to distinguish them, especially when they are all plausible for a given question.
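The difficulty with synonymous answers can be illustrated with a toy nearest-neighbor lookup. This sketch assumes that an answer is selected by ranking candidates by cosine similarity to the projection predicted by the model; the similarity measure, the function names, and the two-dimensional vectors are assumptions chosen for illustration and do not reproduce our implementation or real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(projection, candidates):
    """Sort candidate answers from most to least similar to the projection."""
    return sorted(candidates,
                  key=lambda name: cosine(projection, candidates[name]),
                  reverse=True)


def top_margin(projection, candidates):
    """Similarity gap between the best and the second-best candidate."""
    scores = sorted((cosine(projection, vector) for vector in candidates.values()),
                    reverse=True)
    return scores[0] - scores[1]


# Toy vectors (made up): two near-synonyms and one unrelated answer.
toy_candidates = {
    "couch": [1.0, 0.10],
    "sofa": [1.0, 0.12],
    "zebra": [0.0, 1.0],
}
toy_projection = [1.0, 0.10]
```

With these toy vectors, the two near-synonyms are separated by a very small similarity gap, so a slight change in the predicted projection is enough to swap their order, whereas the unrelated answer remains far behind.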

Another important factor in the success of our method is how well the semantic space is covered by answers seen during training. For example, if the training questions all have similar answers, e.g. different animal species, the model could generalize well to novel animals, but not as well to anything outside these. In other words, the model is capable of interpolation, but extrapolation remains a challenge.

The VQA v2 dataset was not originally designed to test out-of-vocabulary prediction, and existing attempts to repurpose it (e.g. [22]) all have notable issues. For our experiments, we created our own splits with novel answers in the test split, but we made no particular provision for an “even” coverage of semantic concepts by the training answers. These considerations suggest the need for a specific benchmark to allow a more rigorous evaluation of models designed for out-of-vocabulary and “zero-shot” VQA.