Actively Seeking and Learning from Live Data

04/05/2019 ∙ by Damien Teney, et al. ∙ The University of Adelaide

One of the key limitations of traditional machine learning methods is their requirement for training data that exemplifies all the information to be learned. This is a particular problem for visual question answering methods, which may be asked questions about virtually anything. The approach we propose is a step toward overcoming this limitation by searching for the information required at test time. The resulting method dynamically utilizes data from an external source, such as a large set of questions/answers or images/captions. Concretely, we learn a set of base weights for a simple VQA model, which are then adapted to a given question using the information retrieved specifically for that question. The adaptation process leverages recent advances in gradient-based meta learning and contributions for efficient retrieval and cross-domain adaptation. We surpass the state-of-the-art on the VQA-CP v2 benchmark and demonstrate our approach to be intrinsically more robust to out-of-distribution test data. We demonstrate the use of external non-VQA data using the MS COCO captioning dataset to support the answering process. This approach opens a new avenue for open-domain VQA systems that interface with diverse sources of data.




1 Introduction

Figure 1: We propose a visual question answering (VQA) system able to retrieve and utilize information from an external source, at test time. The method learns to exploit external information of various forms, which we demonstrate with question/answer tuples as well as images and their corresponding captions. The method identifies the external information needed to answer a question and adapts its behaviour accordingly. This overcomes limitations of traditional approaches, including overfitting to the training data.

One of the ongoing criticisms of modern machine learning methods is that they presume the availability of large volumes of training data [20, 44]. This training data should be representative of the distribution from which the test data will be sampled, which may be unknowable at training time. These methods usually need constant retraining to accommodate recent data, or to alleviate poor generalization under a domain shift between the training and test distributions. While there exists a host of approaches to address these limitations (from continuum learning [37, 36] to domain adaptation [9, 30, 42], for example), the information extracted from the training data is typically fixed into the parameters of a model during training, and applied without modification thereafter. The approach we propose here addresses this limitation by exploiting new information as it comes to light, by seeking out relevant data from a large external data source. It actively adapts its behaviour according to the information gained from this data, which represents a fundamental change from pure supervised learning.

This paper demonstrates this novel capability on the task of Visual Question Answering (VQA). The task requires answering a previously unseen question about a previously unseen image. Questions are general and open-ended, and thus require a virtually unlimited array of information and skills to answer. The current approach to VQA is to train a neural network with end-to-end supervision of questions/answers (QAs). The supervised paradigm has been transformative for most classical tasks of computer vision, but it shows its limits on complex tasks that require more than pixel-processing and pattern recognition alone. VQA models trained in this fashion have been shown to rely mostly on biases and superficial correlations in the training data. For example, questions starting with “How many…” are usually answered with 2 or 3, and those starting with “What sport …” with the answer tennis, which suffices to obtain high performance on benchmark datasets, where the training and test data are drawn from identical distributions.

The approach proposed in this paper is a step toward robust VQA models, i.e. models capable of reasoning over visual and textual inputs, rather than regurgitating biases learned from a fixed training set. A robust evaluation of these capabilities has recently been made possible. Agrawal et al. proposed the VQA-CP (“changing priors”) dataset [1]. In this resampled version of the VQA v2 dataset [15], the training and test sets are drawn from different distributions such that the question type (i.e. the first few words such as “What sport …” or “How many …”) cannot be relied upon to blindly guess the answer. The performance of existing methods degrades significantly in these conditions.

Our approach borrows ideas from recent research on meta learning [12, 17, 35]. So far, the ubiquitous approach to VQA has attempted to “fit the world” in a neural network, i.e. capturing all of the information the method could ever require to answer any question within its weights. In contrast, we train a model to identify and utilize the relevant information from an external source of support data. In the simplest instantiation of this principle, the support data is the training set of questions/answers itself, with the major novelty that it does not need to be fixed once the model is trained. The support data can expand at test time and could include data retrieved dynamically from live databases or web searches. The method then adapts itself dynamically using this data. To demonstrate the ability of the model to utilize non-VQA data (i.e. other than QA tuples), we use the MS COCO captioning dataset [19, 10] as a source of support data. While VQA data is expensive to acquire, captioned images are omnipresent on the web, and the ability to leverage such data is itself a major contribution.

The evaluation of our approach on VQA-CP demonstrates advantages over classical methods. It generalizes better and obtains state-of-the-art performance on the out-of-distribution test data of VQA-CP. Moreover, the model, once trained on a given distribution of QAs, can successfully adapt to a different distribution of an alternate support set. This is demonstrated with a novel leave-one-out evaluation with VQA-CP. Our experiments clearly demonstrate that the model makes use of the support data at test time, rather than merely capturing biases and priors of a training set. Consequently, a model trained with our approach could, for example, be reused in another domain-specific application by providing it with a domain-specific support set. This possibility opens the door to systems that reason over vision and language beyond the limited domain covered by any given training set.

The contributions of this paper are summarized as follows.

  1. We propose a new approach to VQA in which the model is trained to retrieve and utilize information from an external source, to support its reasoning and answering process. We consider three instantiations of this approach, where the support data is the VQA training set itself (as an evaluation comparable to traditional models), VQA data from a different distribution, and non-VQA image captioning data.

  2. We propose an implementation of this approach based on a simple neural network and a gradient-based adaptation, which modifies its weights using selected support data. The method is based on the MAML algorithm [12] with novel contributions for efficient retrieval and cross-domain adaptation.

  3. We evaluate the components of our model on the VQA-CP v2 dataset. We demonstrate state-of-the-art performance, benefits in generalization, and the ability to leverage varied sources of support data. The novelty of the approach over existing practices opens the door to multiple opportunities for future research on VQA and vision/language reasoning.

2 Related work

Visual question answering

VQA has gathered significant interest in the past few years [5, 39] alongside other tasks combining vision and language such as image captioning [10] or visual dialog [11], for example. The appeal of VQA to the computer vision community is to constitute a practical evaluation of deep visual understanding. Open-domain VQA requires the visual parsing of an image, the comprehension of a text question, and reasoning over multiple pieces of information from these two modalities. See [39] for a survey of modern methods and available datasets.

The ubiquitous approach to VQA is based on supervised learning. It is framed as a classification task over a large set of possible answers, and a machine learning model is optimized over a training set of human-provided questions and answers [5, 15, 18, 48]. Beyond apparent success on VQA benchmarks [14, 33], the approach was revealed to have severe limitations. The models following this approach prove to be overly reliant on superficial statistical regularities in the training sets, and their performance drops dramatically when evaluated on questions drawn from a different distribution [1], or on questions containing words and concepts that appear infrequently in the training data [28, 34]. Popular benchmarks for VQA [5, 15] have involuntarily encouraged the development of methods that learn and leverage statistical patterns such as biases (i.e. the long-tailed distributions of answers) and question-conditioned biases (which make answers easy to guess given a question without the image). These models can essentially bypass the steps of reasoning and image understanding that initially motivated research on VQA.

Robust evaluation of VQA

Improved evaluation settings have recently been proposed. In [2, 15, 46], the authors introduced balanced pairs of questions, i.e. associating each question with a pair of images that lead to different answers. This procedure, however, had limited benefits. The usual metric of accuracy over individual questions still encouraged models to learn and rely on the non-uniform distribution of answers, and the crowd-sourcing procedure used to gather balanced pairs introduced many irrelevant and nonsensical questions to the dataset.

Other recent proposals follow the idea of drawing the training and evaluation questions from different distributions. This discourages overfitting to statistical regularities specific to the training set. In [28, 34], the authors evaluate questions containing words and concepts that appear rarely in the training data. In [1], Agrawal et al. propose the VQA-CP dataset (for “changing priors”), in which they enforce different training/test distributions of answers conditioned on the first few words of the question (e.g. “What is the color …” or “How many …”). Our experiments are conducted on VQA-CP as it represents the most challenging setting currently available.
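To make the effect of changing priors concrete, the following sketch builds a "blind" guesser that answers from the first two words of the question only. The helper names (`fit_prior`, `accuracy`) and the toy data are hypothetical and serve purely as an illustration; they are not taken from VQA-CP or from any of the cited methods.

```python
from collections import Counter, defaultdict

def fit_prior(pairs, n_words=2):
    """Map each question prefix to its most frequent training answer."""
    counts = defaultdict(Counter)
    for question, answer in pairs:
        prefix = " ".join(question.lower().split()[:n_words])
        counts[prefix][answer] += 1
    return {p: c.most_common(1)[0][0] for p, c in counts.items()}

def accuracy(prior, pairs, n_words=2):
    """Fraction of questions answered correctly from the prefix alone."""
    hits = 0
    for question, answer in pairs:
        prefix = " ".join(question.lower().split()[:n_words])
        hits += prior.get(prefix) == answer
    return hits / len(pairs)

# Toy data, made up for illustration only.
train = [("What sport is this", "tennis")] * 8 + [("What sport is this", "soccer")] * 2
test_same = [("What sport is shown", "tennis")] * 8 + [("What sport is shown", "soccer")] * 2
test_shifted = [("What sport is shown", "tennis")] * 2 + [("What sport is shown", "soccer")] * 8

prior = fit_prior(train)
acc_same = accuracy(prior, test_same)        # same answer distribution as training
acc_shifted = accuracy(prior, test_shifted)  # answer distribution changed at test time
```

On this toy data the guesser never looks at an image, yet it scores well when the test answers follow the training distribution and poorly once the prior is changed, which is the failure mode that a changing-priors split is designed to expose.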

Robust models for VQA

The above robust evaluations have essentially pointed at the inadequacy of current approaches [1, 15, 28, 34, 47]. To address some of these shortcomings, Agrawal et al. [1] proposed a modular architecture that prevents the model from relying on undesirable biases and priors in the training data. Ramakrishnan et al. [27] introduced an information-theoretic regularizer to encourage the model to utilize the image by outperforming a “blind” guesser. In [35], Teney et al. proposed a meta learning approach to VQA that improved the recall on rare answers. Their work is the most relevant to this paper, although the methods differ significantly. We use a gradient-based adaptation procedure that updates the weights of a whole VQA model, whereas [35] applied existing meta learning algorithms on the final classifier of a simple VQA model. We also formulate the use of support data as a retrieval task, whereas [35] processes the entire support set at every iteration, which is computationally challenging, and its evaluation only includes small-scale experiments. [35] is also limited to QAs as support data, whereas our method is much more general.

Figure 2: Data flow in the proposed method, using questions/answers as support data (Section 3.2). The input question serves to retrieve pertinent instances from the support data using a relevance function. These instances are passed through the underlying VQA model (Fig. 3) to compute the adaptation loss, using their ground truth answers. The gradient of the adaptation loss is backpropagated to the weights of the VQA model, which are updated, effectively adapting (i.e. fine-tuning) the VQA model to the selected support examples. The input question is finally passed through the adapted model to predict scores for the final answer. During training, the gradient of the loss on the final predictions is backpropagated to optimize the pre-adaptation weights and the gradient projection (in yellow).

Meta learning

Our central idea is to adapt a VQA model to each given question to incorporate additional information from an external source. The adaptation is implemented with the MAML meta learning algorithm [12]. Meta learning or “learning to learn” [21, 31, 37] is a general paradigm to learn to build and/or update machine learning models, e.g. to fine-tune the weights of a neural network [7, 6, 32]. Recent works in the area have focused on the adaptation of neural networks for few-shot image recognition [4, 12, 16, 29]. MAML serves to identify a set of weights that can best serve as initial values, before adaptation through one or a few steps of gradient descent. In [13, 43], the authors extended MAML to handle support data from a distinct domain, for robotic imitation learning from demonstration videos. We follow a similar idea to transform the gradients of a loss on captioning data into gradients suitable to update a VQA model. In [17], Huang et al. turn the supervised task of language-to-query generation into a meta learning task. They introduce the concept of relevance functions to sample the training set. The approach is similar in spirit to our reformulation of VQA as a meta learning task. However, their aim is to improve accuracy by using specialized adapted models, while our objective is broader, as we also aim to leverage additional (non-VQA) sources of data.

Additional sources of data for VQA

The limitations of the mainstream approach to VQA stem from the limited capacity of the training set and of the trained models. Instead of attempting to capture all the training information within the weights of a network, we use an external source of data that is not fixed after training. The capacity and capabilities of the model are thus essentially unbounded. Previous works [40, 38] have interfaced VQA models with knowledge bases, using ad hoc techniques to incorporate external knowledge. In comparison, this paper presents a more general approach, applicable to various types of support data. In [34, 33], the authors used web image search to retrieve visual representations of question and answer words. These representations are however optimized along with the other weights of the network and fixed once trained. Recent works on text-based question answering used reinforcement learning to optimize the retrieval of external information [8, 22, 25], which is potentially complementary to our approach.

3 Proposed approach

Our central idea is to learn a VQA model that can subsequently adapt to each particular given question, using additional support data relevant to the question. Intuitively, the adaptation makes the VQA model specialized to the narrow domain of each question. The support data relevant to each question is retrieved dynamically from an external source (Fig. 1), which is assumed to be non-differentiable and/or too large to be processed all at once. Concretely, the support data can be the VQA training set itself (making evaluation comparable with traditional methods), but we also demonstrate the use of training QAs from a different distribution (Tables 3 and 4), and the use of an image captioning dataset (Section 4.1).

3.1 Underlying VQA model

Our approach builds around a standard VQA model that underlies the adaptation procedure. Formally, we denote with $x = (q, v)$ the input to the VQA model, made of the question $q$ (a string of tokens, each corresponding to a word) and of visual features $v$ pre-extracted from the given image (a feature map produced by a pre-trained convolutional neural network). The VQA model is represented as the function $f_\theta$ of parameters $\theta$. It maps $x$ to a vector of scores $\hat{s} = f_\theta(x)$ with $\hat{s} \in [0, 1]^A$. The vector contains the scores predicted over $A$ candidate answers, typically the few thousand most frequent in the training set. The final answer is the one with the largest score, $\arg\max_i \hat{s}_i$. We denote with $s$ the vector of ground truth scores (which may contain multiple non-zero values when multiple answers are annotated as correct).

The function $f_\theta$ is implemented as a neural network and $\theta$ denotes the set of all of its weights. Our contributions are not tied to any particular implementation of $f_\theta$. In practice, it corresponds to a classical joint embedding model [33] illustrated in Fig. 3. The network encodes the question as a bag-of-words, taking the average of learned word embeddings. It uses a single-headed, question-guided attention over image locations, a Hadamard product to combine the two modalities, and a non-linear projection followed by a sigmoid to obtain the scores $\hat{s}$. See Appendix A for details.

Figure 3: The simple VQA model underlying our method. It implements a classical joint embedding approach [33]. Yellow elements contain learnable weights. Circled and squared ‘w’s represent affine and non-linear projections, respectively. The above network is instantiated twice in the overall diagram of Fig. 2.
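The forward pass described above can be sketched in a few lines of NumPy. This is only an illustration of the data flow (bag-of-words encoding, question-guided attention, Hadamard product, sigmoid scores). The parameter names, the layer sizes, the use of `tanh` as non-linearity and the random initialization are all assumptions made for the sketch and are not the configuration of Appendix A.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def init_params(vocab_size, dim_v, dim_h, n_answers, seed=0):
    """Randomly initialized weights (hypothetical shapes and scale)."""
    rng = np.random.default_rng(seed)
    scale = 0.1
    return {
        "emb": scale * rng.standard_normal((vocab_size, dim_h)),   # word embeddings
        "W_att": scale * rng.standard_normal((dim_v, dim_h)),      # projects regions for attention
        "W_v": scale * rng.standard_normal((dim_v, dim_h)),        # projects the attended feature
        "W_out": scale * rng.standard_normal((dim_h, n_answers)),  # final projection to answers
    }

def vqa_forward(params, token_ids, v):
    """token_ids: list of word indices; v: (n_regions, dim_v) image features."""
    q = params["emb"][token_ids].mean(axis=0)      # bag-of-words question encoding
    att = softmax((v @ params["W_att"]) @ q)       # one attention weight per image location
    v_att = att @ v                                # attended image feature, shape (dim_v,)
    h = q * np.tanh(v_att @ params["W_v"])         # Hadamard product of the two modalities
    return sigmoid(np.tanh(h) @ params["W_out"])   # one score in (0, 1) per candidate answer

params = init_params(vocab_size=50, dim_v=16, dim_h=8, n_answers=5)
scores = vqa_forward(params, [1, 4, 7], np.ones((6, 16)))
answer = int(np.argmax(scores))                    # index of the predicted answer
```

The sigmoid output keeps one independent score per candidate answer, which matches a ground truth vector that may contain several non-zero entries.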

3.2 Gradient-based adaptation

The role of the adaptation procedure is to modify the weights of the VQA model to best tailor its capabilities to a given input question. The motivation is that a specialized model can potentially be more effective than a general one for the same capacity of the underlying model. Our adaptation procedure is based on MAML [12]. The original MAML algorithm is designed for adaptation using support data of the same form as the task of interest, i.e. questions with their ground truth answers. In Section 3.3, we describe an extension to use support data of another task/domain.

The adaptation procedure takes in a set of support elements $S' = \{(x'_i, s'_i)\}_i$ and base parameters $\theta_0$, which it adapts to $\theta_K$ over a small number $K$ of updates. The update rule is a gradient descent of step size $\alpha$:

$$\theta_{k+1} = \theta_k - \alpha \, \nabla_{\theta_k} \mathcal{L}'(\theta_k; S') \tag{1}$$

where $\mathcal{L}'$ is the adaptation loss, which evaluates the predictions of the VQA model $f_{\theta_k}$ on the support data. In this case, $\mathcal{L}'$ is the binary cross-entropy loss typically used to train VQA models [33]. The above adaptation is performed when evaluating a given question at both training and test time. The key to benefit from this approach is to learn base parameters $\theta_0$ that are the most general and the most easily adaptable. They are optimized for the following objective:

$$\min_{\theta_0} \sum_{(x, s) \in T} \mathcal{L}\big(f_{\theta_K}(x), s\big) \tag{2}$$

where the elements $(x, s)$ are drawn from a training set $T$, and $\mathcal{L}$ is the main loss on the VQA model (also called “meta loss” [12]) that corresponds again to a binary cross-entropy. The objective can be optimized with standard backpropagation and stochastic gradient descent [12]. To avoid the expensive differentiation through the steps of adaptation (Eq. 1), we use a first-order approximation of the gradient as in [23]. The update rule is then

$$\theta_0 \leftarrow \theta_0 - \beta \, \nabla_{\theta_K} \mathcal{L}\big(f_{\theta_K}(x), s\big) \tag{3}$$

where $\beta$ is the learning rate. The whole procedure to evaluate any training or test instance is summarized as Algorithm 1. It is worth emphasizing that during training, a support set must be simulated to best mimic the conditions in which the model will be evaluated. If the support set were held constant during training, it would be treated as a static input, and the model would be unlikely to generalize to different support data at test time. Therefore, it is crucial to present randomly sampled instances from the support set across the iterations in Algorithm 1.

Algorithm 1: Evaluation of a training or test instance.
Input:  Test or training instance x = (q, v)
        Support set S = {(x'_i, s'_i)}_i with x'_i = (q'_i, v'_i)
Output: Vector of scores ŝ over candidate answers
// Retrieve support relevant to x:
S' ← elements of S with max. relevance r(x, x'), precomputed
for k = 0 to K−1 do                         // For each adaptation step
    B ← random elements of S'
    ŝ' ← f_θk(B)                            // Forward prop.
    g ← ∇θk L'(ŝ', s')                      // Backprop. adaptation loss
    g ← P(g)                                // Gradient projection
    θk+1 ← θk − α g                         // Update weights of VQA model
end for
ŝ ← f_θK(x)                                 // Forward prop. with updated weights
if training then
    g ← ∇θK L(ŝ, s)                         // Backprop. main loss
    θ0 ← θ0 − β g                           // Update base weights
end if
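The following is a minimal sketch of Algorithm 1 with QAs as support data, under strong simplifying assumptions. The VQA model is replaced by a single sigmoid layer over a fixed feature vector (a hypothetical stand-in for the full network), the support elements are assumed to be already retrieved, and no gradient projection is applied. The function names and default hyperparameters are illustrative choices, not values from our experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_grad(W, x, s):
    """Gradient w.r.t. W of the summed binary cross-entropy of sigmoid(W @ x) against s."""
    return np.outer(sigmoid(W @ x) - s, x)

def evaluate_instance(W0, x, support, n_steps=2, batch=4, alpha=0.5,
                      s=None, beta=0.1, rng=None):
    """support: list of (x', s') pairs already retrieved for x.
    Returns (scores, W0). W0 is updated only when the ground truth s is given (training)."""
    rng = np.random.default_rng(0) if rng is None else rng
    W = W0.copy()
    for _ in range(n_steps):                                   # adaptation steps
        idx = rng.choice(len(support), size=min(batch, len(support)), replace=False)
        g = sum(bce_grad(W, *support[i]) for i in idx) / len(idx)
        W = W - alpha * g                                      # update the adapted weights
    scores = sigmoid(W @ x)                                    # forward prop. with adapted weights
    if s is not None:                                          # training only
        # First-order approximation: the gradient is taken at the adapted
        # weights W but applied to the base weights W0.
        W0 = W0 - beta * bce_grad(W, x, s)
    return scores, W0

x = np.array([1.0, 0.0])
target = np.array([1.0, 0.0])
support = [(np.array([1.0, 0.0]), target)] * 4
W0 = np.zeros((2, 2))

test_scores, W0_test = evaluate_instance(W0, x, support)             # test time
train_scores, W0_train = evaluate_instance(W0, x, support, s=target) # training
```

At test time the base weights are left untouched and only the temporary copy is adapted, so each question starts again from the same base weights.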

3.3 Using non-VQA data as support

We now extend the method to use support data other than VQA instances (questions/answers). We apply it to the particular case of images/captions, although the approach is more generally applicable. The challenge is now to produce beneficial updates to the weights without access to a loss on the target VQA model. In practice, the format of captioning data (images with text) facilitates the implementation, as we can use a similar neural network as the VQA model to process them. We define a model $f'_\theta$ similar to $f_\theta$ up to the Hadamard product (Fig. 3). The final projection to answer scores is now meaningless for captions.

The adaptation procedure now proceeds as follows. The captions are passed through $f'_\theta$ and its output (the Hadamard product) is passed to the alternative adaptation loss $\mathcal{L}''$. This squared L2 norm can be interpreted as measuring the compatibility of the embeddings of the caption and of the image. It encourages embedding spaces to align across support images and their captions. Importantly, this loss does not involve ground truth labels or answers, but it allows differentiation with respect to the weights $\theta$ (the weights in $\theta$ corresponding to the final layers of $f_\theta$ and not present in $f'_\theta$ receive zero gradients when differentiating through $\mathcal{L}''$). The resulting gradients, however, cannot be assumed to be directly suitable to update the VQA model. We therefore pass them through a learned projection $P$ as $P(\nabla_\theta \mathcal{L}'')$. This produces gradients that can be plugged into Eq. 1, which now becomes

$$\theta_{k+1} = \theta_k - \alpha \, P\big(\nabla_{\theta_k} \mathcal{L}''(\theta_k; S')\big) \tag{4}$$

where $S'$ is the retrieved support data. The projection $P$ is implemented as a non-linear layer that is learned similarly to $\theta_0$, i.e. by backpropagating the gradient of the main loss $\mathcal{L}$ as in Eq. 3 (see details in the supplementary material).
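One adaptation step on caption data can be sketched as follows. The sketch assumes that the loss is the plain squared L2 norm of the Hadamard product, that a single weight matrix `W` embeds the image, and that the projection is a `tanh` layer over the flattened gradient. These are illustrative assumptions: the exact form, sign and normalization of the loss and the architecture of the projection are given in the supplementary material and are not reproduced here.

```python
import numpy as np

def caption_loss_grad(W, q_cap, v):
    """Loss = ||q_cap * (W @ v)||^2. Returns (loss, dLoss/dW)."""
    h = q_cap * (W @ v)                    # Hadamard product of caption and image embeddings
    loss = float(h @ h)
    grad = np.outer(2.0 * h * q_cap, v)    # chain rule through h_j = q_j * (W v)_j
    return loss, grad

def project_gradient(grad, P, b):
    """Hypothetical non-linear projection applied to the flattened gradient."""
    return np.tanh(grad.ravel() @ P + b).reshape(grad.shape)

def adapt_step(W, q_cap, v, P, b, alpha=0.1):
    """One update in the style of Eq. 1, with the projected caption gradient."""
    _, g = caption_loss_grad(W, q_cap, v)
    return W - alpha * project_gradient(g, P, b)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))            # image embedding weights (toy size)
q_cap = rng.standard_normal(3)             # caption embedding
v = rng.standard_normal(2)                 # image feature
P = 0.1 * rng.standard_normal((6, 6))      # projection weights, learned in practice
b = np.zeros(6)

loss, grad = caption_loss_grad(W, q_cap, v)
W_new = adapt_step(W, q_cap, v, P, b, alpha=0.1)
```

No answer label appears anywhere in this step: the only supervision comes from the pairing of an image with its caption, and the projection is what would be trained through the main loss to turn this signal into a useful update.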

3.4 Retrieval of relevant support data

The above descriptions assumed the availability of a set $S'$ of support examples relevant to an input question $x$. In our experiments, the support data $S$ is the training split of a large VQA or captioning dataset. The selection of a relevant subset $S'$ from $S$ is a crucial step to make the model adaptation both efficient (by processing a much smaller subset $S' \subset S$) and effective (by focusing the adapted model on a narrow domain around $x$). The method described below provides the adaptation algorithm with a subset of the support data of bounded size, which ensures a constant time complexity.

We formalize the retrieval process from $S$ with a relevance function $r(x, x')$. It produces a scalar that reflects the pertinence of a support instance $x'$ to the input $x$. The top-ranked elements of $S$, i.e. those with the largest relevance values, are identified, and then randomly subsampled to form the set $S'$.

The relevance function can in principle be learned using the gradient of the main loss $\mathcal{L}$, although we did not explore this option. In our current implementation, we use a static relevance function that allows us to precompute its value between all training elements and all elements of the simulated support set. This greatly reduces the computational requirements during the training process. Our experiments evaluate conjunctions (products) of the following options:


Note that the retrieval process could alternatively be formulated as a reinforcement learning task. This would allow optimizing the retrieval from “black box” data sources, such as web searches and dynamically-expanding databases [8, 22, 25], which we leave for future work.
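The retrieval step can be sketched as a top-N selection followed by random subsampling. The relevance function used here, the product of the number of words in common and a clipped cosine similarity between image features, is one plausible conjunction chosen for illustration. The function names, the dictionary layout of the instances and the values of `top_n` and `n_keep` are assumptions of the sketch.

```python
import numpy as np

def words_in_common(text_a, text_b):
    return len(set(text_a.lower().split()) & set(text_b.lower().split()))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def relevance(query, item):
    """Conjunction (product) of a text-overlap term and an image-similarity term."""
    text_term = words_in_common(query["text"], item["text"])
    image_term = max(cosine(query["img"], item["img"]), 0.0)
    return text_term * image_term

def retrieve(query, support, top_n=2, n_keep=2, rng=None):
    """Keep the top_n most relevant items, then subsample n_keep of them at random."""
    rng = np.random.default_rng(0) if rng is None else rng
    scores = np.array([relevance(query, item) for item in support])
    top = np.argsort(-scores)[:top_n]
    keep = rng.choice(top, size=min(n_keep, len(top)), replace=False)
    return [support[i] for i in keep]

support = [
    {"text": "what sport is this", "img": np.array([1.0, 0.0])},
    {"text": "what sport is shown", "img": np.array([1.0, 0.1])},
    {"text": "how many dogs", "img": np.array([1.0, 0.0])},
    {"text": "what color is the car", "img": np.array([0.0, 1.0])},
]
query = {"text": "what sport is this", "img": np.array([1.0, 0.0])}
retrieved = retrieve(query, support)
```

Because the relevance values depend only on the query and the support item, they can be computed once for all pairs before training, and the size of the returned set is bounded by `n_keep` regardless of the size of the support data.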

4 Experiments

We conducted extensive experiments to evaluate the contribution of the components of our method, and to compare its performance to existing approaches. We use the VQA-CP v2 dataset [1], which is the most challenging benchmark available. Its training and test splits have different distributions of answers conditioned on the first few words of the question, and it was built by resampling the VQA v2 dataset [15]. We hold out 8,000 questions from the VQA-CP training data to use as a validation set. All models are trained to convergence (with early stopping) on this validation set. Our underlying VQA model is a reimplementation of [33] (see supplementary material for details). Experiments using captions as support data use the COCO captioning dataset [19]. Since VQA-CP is itself made of images from COCO, we ensure that the captioned images also present in the VQA-CP test set are never used as support (neither during training nor evaluation). Please consult the supplementary material for additional implementation details and results. All results are reported using the standard VQA accuracy metric and broken down into the categories ‘yes/no’, ‘number’, and ‘other’ as in [15].
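For reference, the VQA accuracy metric is commonly stated as min(number of annotators who gave the predicted answer / 3, 1). The sketch below implements this simplified form. The function names are illustrative, and the official evaluation tools additionally normalize answer strings, which is omitted here.

```python
from collections import Counter

def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy: min(#annotators who gave `predicted` / 3, 1)."""
    return min(Counter(human_answers)[predicted] / 3.0, 1.0)

def mean_accuracy(predictions, annotations):
    """Average accuracy over a list of questions."""
    total = sum(vqa_accuracy(p, a) for p, a in zip(predictions, annotations))
    return total / len(predictions)

full = vqa_accuracy("tennis", ["tennis"] * 10)               # all annotators agree
partial = vqa_accuracy("2", ["2", "2"] + ["3"] * 8)          # two annotators agree
overall = mean_accuracy(["tennis", "2"],
                        [["tennis"] * 10, ["2", "2"] + ["3"] * 8])
```

An answer given by at least three annotators counts as fully correct, which is why several answers can have non-zero ground truth scores for the same question.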

4.1 Results

Contribution of the proposed components

We first evaluate the impact of the proposed components with an ablative study (Table 1). For readability and computational reasons we focus on ‘other’-type questions (see footnote 2) with a slightly simplified VQA model. Implementation details are provided in the supplementary material. We examine in Table 1 a series of progressively more elaborate models. Each row corresponds to two different trained models, one trained for QAs as support (evaluated in the first 3 columns), another for captions (evaluated in the last column). All models using adaptation significantly outperform the baseline (first row). Interestingly, the optimal relevance function varies across the models for QAs and captions. The relevance function that includes the image similarity is only moderately useful, while the number of words in common between the question and the support text (QA or caption) proves very effective. Interestingly, in the case of captions, a uniform sampling already gives a clear improvement over the baseline model, but not with QAs, which we explain by the smaller size of the support set of captions.

Footnote 2: We focus on ‘other’-type questions because random guessing on the yes/no/number questions (or a buggy implementation!) does better than the best model in [1]. We measured that random guessing achieves 72.9% on yes/no questions ([1] gets 65.5%) and random guessing of one/two achieves 34.1% on ‘number’ questions ([1] gets 15.5%). This makes them unreliable for a meaningful analysis.

We report results on both our validation set (of similar distribution as training data) and on the official test set (of different distribution). The overall lower performance on the latter shows the challenge of dealing with out-of-distribution data. The improvement in performance is much clearer on the test set than on the validation set. This demonstrates our contribution to improving generalization – arguably the most challenging aspect of VQA – which is a significant side-effect of our adaptation-based approach.

Using image captions as support data

We trained separate models for adaptation to questions/answers and to captions (Table 1 last column). While performance improves over the baseline in both cases, the adaptation using QAs provides a bigger boost, given their direct relevance to the VQA task. The improvement by adaptation to captions demonstrates the ability of the method for picking up relevant information from non-VQA data, which opens a significant avenue for future work. This evaluation currently considers either QAs or captions separately. The combination of the two implies a number of non-trivial design decisions that we will explore in future work.

Amount of retrieved support data

In Fig. 4, we examine the performance of the model as a function of the amount of data it is trained with. To make the analysis comparable to the baseline VQA model, the support QAs are the same set of QAs as used for the training (of the baseline and of our model). In the case of captions, we use the same QAs for training, and a similarly subsampled set of captions as support data. We observe that our model is clearly superior to the baseline in all regimes, using either QAs or captions. The gain in performance is maintained even when the model is trained with very little data, in particular when using adaptation with QAs (using as little as 1% of the whole training set).

Unfortunately, the gains from using captions as support data level off as the amount of support data increases (Fig. 4) and the performance does not surpass that obtained with QAs. One would rather hope for continued improvement as the model is provided with increasing amounts of support data. We believe that our current results do not rule out this prospect, and that the saturation stems from the particular distribution of captions in COCO. These captions are purely visual and descriptive, and they only cover a limited variety of concepts. In contrast, visual questions often require common sense and knowledge beyond visual descriptions (e.g. Why is the guy wearing such a weird outfit? Is this a healthy breakfast?). Other sources of data, including free-form captions and paired image-text data from the web, may be more suitable for this purpose.


Accuracy on VQA-CP v2 “Other”
                                     Val.      Test
Ours without adaptation              45.46     31.09

Ours with adaptation and,            QAs       QAs       Capt.
as support data:                     Tr.       Tr.       COCO
Uniform sampling =                   46.15     31.33     34.00
Relevance function =                 44.41     31.79     29.18
Relevance function =                 46.49     31.76     33.73
Relevance function =                 46.32     31.68     33.51
Relevance function =                 46.17     31.09     34.26
Relevance function =                 46.79     34.25     33.44

Table 1: Ablative evaluation of the proposed method (see discussion in Section 4.1). Each row corresponds to two different models, trained respectively for QAs (columns 1–3) and for captions (column 4) as support data. Gray cells use additional data during evaluation (QAs from VQA-CP test set in a leave-one-out setting) or during training+evaluation (COCO captions).

Comparison to existing methods

Table 2 presents a comparison of our results with existing approaches. We obtain state-of-the-art performance by a large margin over existing models and over our baseline model without adaptation. However, using captions as support data and trained on all question types (number, yes/no, and other), the model performs poorly. We hypothesized that evidence for the number and yes/no questions was difficult to extract from captions. We therefore trained a model with adaptation using only ‘other’ questions. This model performs significantly better and clearly improves over the baseline. We indeed observed that captions seldom include counts or numbers, which can explain why they do not help on the corresponding questions. In the case of binary questions, it is possible that a different relevance function could address the issue.


VQA-CP v2 Test
                                            Overall   Yes/no   Numbers   Other
SAN [41]                                    24.96     38.35    11.14     21.74
GVQA [1]                                    31.30     57.99    13.68     22.14
UpDown [33]                                 39.06     62.41    15.12     34.47
UpDown + regularizer [27]                   42.04     65.49    15.87     36.60
Ours without adaptation                     40.71     52.22    11.85     42.88
Ours with adaptation and, as support data:
  QAs (VQA-CP tr.), =                       46.00     58.24    29.49     44.33
  Captions (COCO), =                        39.84     48.78    12.40     42.93
  Captions, trained only on ‘Other’ q.                                   43.95

Table 2: Comparison with existing methods (accuracy on VQA-CP v2). Our method significantly improves over the comparable baseline (the same VQA model without adaptation) and obtains performance superior to all existing models. Gray cells are not directly comparable to others as they use additional data (as in Table 1).
Figure 4: Accuracy as a function of the amount of data used.

Qualitative results

Fig. 5 presents results of our best models (using QAs or captions) with visualizations of support data sampled according to the relevance function. We observe that the retrieved support data is both semantically and visually relevant to each question.

Additional experiments and qualitative results are provided in the supplementary material.

5 Conclusions

We presented a new approach to VQA in which the model is trained to interface with an external source of data, and to use it to support its answering process. This is a significant departure from the classical training of a static model on a fixed dataset, which is limited by the finite capacity of the model and of the dataset. In contrast, our method retrieves information from the external source specifically for each given question. It then adapts the weights of its underlying VQA model, incorporating information from the external data, and specializing its capabilities to a narrow domain around the input question.

Our experiments demonstrate the benefits of the approach over existing models. It proves intrinsically more robust to out-of-distribution data, and it generalizes to different distributions when provided with novel support data. The model also introduces novel capabilities, in particular for leveraging non-VQA data (image captions) to support the answering process. This presents a number of opportunities for future research, such as accessing “black box” data sources like web searches and dynamic databases. It opens the door to systems capable of reasoning over vision and language beyond the limited domain covered by any given training set.


Input question / Random selection of retrieved support data / Predicted scores

Input question: Which sport is this? Correct answer: tennis.
Retrieved support data: What sport is taking place? tennis. What sport is this lady playing? tennis. What sport are they playing? tennis. What sport is this? tennis. What sport is this? tennis. What sport is this? tennis.
Without adaptation: soccer, tennis, football, frisbee, polo
After adaptation: tennis, soccer, frisbee, polo, football

Input question: What are two men cutting? Correct answer: cake.
Retrieved support data: What is man cutting pizza with? knife. What is this man cutting? cake. What object are all four men holding? knife. What are men doing? cutting cake. What are men doing? cutting cake. What is woman cutting? cake.
Without adaptation: knife, cake, candles, frosting, cutting cake
After adaptation: cake, cutting cake, yes, nothing, knife

Input question: What season is this? Correct answer: winter.
Retrieved support data: What season is it? fall. What season is this? summer. What season is this? summer. What season are these items meant to be used in? summer. When are these flowers in season? summer. What season does this look like? summer.
Without adaptation: winter, fall, spring, summer, snow
After adaptation: winter, fall, summer, spring, unknown

Input question: Is this breakfast or dinner? Correct answer: dinner.
Retrieved support data: Dinner table with glasses of wine and plates of cheese and crackers. Food on dinner table in a plate. Omelet, toast and fruit for breakfast sitting on a table. Breakfast plate with egg on toast and greens. Restaurant table lined for breakfast with plates of food. Table set for breakfast with ham, hashbrowns, croissants and eggs.
Without adaptation: dinner, breakfast, dessert, lunch, no
After adaptation: dessert, cake, desert, yes, lunch
Figure 5: Qualitative results comparing the top-5 answers and their scores predicted by the baseline, and by our model after adaptation. The retrieved support data (random samples are shown) is both visually and semantically relevant to each question.


Supplementary material

Appendix A Implementation of underlying VQA model

The VQA model within our method follows the general description of Teney et al. [33], as illustrated in Fig. 3 in the main paper. One exception is in the question encoding, where we replace their gated recurrent unit (GRU) with a bag of words, i.e. a simple average of word embeddings. The first reason is computational, to avoid the relatively slow evaluation of the unrolled GRU. The second reason is that we encountered instabilities in the training of the adaptation method with the GRU. We suspect this to be due to our first-order approximation of the MAML algorithm.

Most implementation details follow [33]. In particular, the non-linear operations in the network use gated hyperbolic tangent units. We use the “bottom-up attention” features [3] of size 36×2048, pre-extracted and provided by Anderson et al. The word embeddings are initialized as GloVe vectors [26] of dimension 300, then optimized with the same learning rate as the other weights of the network. All activations except the word embeddings and their average are of dimension 256. The answer candidates are those appearing at least 20 times in the VQA v2 training set, i.e. a set of about 2000 answers. The output of the network is passed through a logistic function to produce scores in [0, 1]. The final classifier is trained from a random initialization, rather than initialized from the visual and text embeddings of [33]. In our ablative and in-depth experiments (Table 1, Fig. 4, and Fig. 6), we use a slightly simplified model, where the “top-down” attention map over the image is uniform. The image features of size 36×2048 are thus averaged uniformly to a vector of size 1×2048. This significantly reduces the cost of training and evaluating the model, since these averages can be precomputed and fit in memory for the whole dataset. The relevance function (Section 3.4) also uses these global image features.
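The two simplifications described above (a bag-of-words question encoding and a uniform attention map) both reduce to plain averages. The following is a minimal sketch using only standard-library Python; the function names, the toy 2-dimensional embeddings, and the choice to skip out-of-vocabulary tokens are assumptions made for illustration, not details taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def bag_of_words(tokens: Sequence[str],
                 embeddings: Dict[str, List[float]]) -> List[float]:
    """Encode a question as the plain average of its word embeddings.

    Assumption: tokens missing from the vocabulary are skipped.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no known tokens in the question")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def uniform_attention(region_features: Sequence[Sequence[float]]) -> List[float]:
    """Average region features with equal weights (e.g. 36 x 2048 -> 2048)."""
    if not region_features:
        raise ValueError("no region features given")
    count = len(region_features)
    dim = len(region_features[0])
    return [sum(r[i] for r in region_features) / count for i in range(dim)]


def logistic(x: float) -> float:
    """Map a raw answer score to the unit interval."""
    return 1.0 / (1.0 + math.exp(-x))


if __name__ == "__main__":
    toy_embeddings = {"what": [1.0, 0.0], "sport": [0.0, 1.0]}
    print(bag_of_words(["what", "sport"], toy_embeddings))
    print(uniform_attention([[1.0, 2.0], [3.0, 4.0]]))
```

Because the averaged image vector does not depend on the question, it can be computed once per image and cached, which is the saving the text refers to.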

Appendix B Implementation of adaptation algorithm

We use the AdaDelta algorithm [45] to train the model’s weights (the base weights of the VQA model and those of the gradient projection) with backpropagation from the loss. Following this practice, we also found it beneficial to replace the gradient descent step of the adaptation (Eq. 1 and 4) with the AdaDelta weight update (see details in [45]). This effectively determines the size of the gradient step automatically, based on a rolling average of the weight updates’ and gradients’ magnitudes. This makes the weight updates much more stable, and it eliminates the step-size hyperparameter.
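For reference, the standard AdaDelta update can be sketched as follows. This is a generic, standard-library illustration of the rule, not the authors' code; the function name and the default values of `rho` and `eps` are assumptions. The step size is derived from running averages of squared gradients and squared updates, so no learning rate is passed in.

```python
import math
from typing import List, Tuple


def adadelta_step(weights: List[float],
                  grads: List[float],
                  avg_sq_grad: List[float],
                  avg_sq_delta: List[float],
                  rho: float = 0.95,
                  eps: float = 1e-6) -> Tuple[List[float], List[float], List[float]]:
    """Apply one AdaDelta update and return (weights, avg_sq_grad, avg_sq_delta)."""
    new_w, new_g2, new_d2 = [], [], []
    for w, g, g2, d2 in zip(weights, grads, avg_sq_grad, avg_sq_delta):
        # Running average of squared gradients.
        g2 = rho * g2 + (1.0 - rho) * g * g
        # Update scaled by the ratio of the two running RMS values.
        delta = -math.sqrt(d2 + eps) / math.sqrt(g2 + eps) * g
        # Running average of squared updates.
        d2 = rho * d2 + (1.0 - rho) * delta * delta
        new_w.append(w + delta)
        new_g2.append(g2)
        new_d2.append(d2)
    return new_w, new_g2, new_d2


if __name__ == "__main__":
    # Toy problem: minimize (w - 3)^2 starting from w = 0.
    w, g2, d2 = [0.0], [0.0], [0.0]
    for _ in range(20):
        grad = [2.0 * (w[0] - 3.0)]
        w, g2, d2 = adadelta_step(w, grad, g2, d2)
    print(w)
```

The running averages must be carried from one step to the next, which is why the function returns them alongside the updated weights.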


The gradient projection is implemented as a simple linear scaling, with no biases and no cross-talk across dimensions. For example, to adapt a linear layer, the gradient with respect to the layer’s weights is multiplied element-wise (Hadamard product) by the parameters of the projection.

The adaptation algorithm uses 3 updates during both training and evaluation. This number was selected from the range 1–5 by cross-validation.
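The element-wise projection and the fixed number of updates can be sketched together as below. This is an illustration under stated assumptions: the names `project_gradient`, `adapt`, `scale`, and `step_size` are hypothetical, weights are flattened into a plain list, and a fixed-step gradient descent update stands in for the AdaDelta update that the paper reports using.

```python
from typing import Callable, List


def project_gradient(grad: List[float], scale: List[float]) -> List[float]:
    """Element-wise (Hadamard) scaling: no bias, no mixing across dimensions."""
    if len(grad) != len(scale):
        raise ValueError("gradient and projection parameters must match in size")
    return [s * g for s, g in zip(scale, grad)]


def adapt(weights: List[float],
          grad_fn: Callable[[List[float]], List[float]],
          scale: List[float],
          num_updates: int = 3,
          step_size: float = 0.1) -> List[float]:
    """Return adapted weights after a fixed number of projected gradient steps.

    grad_fn stands for the gradient of the loss on the retrieved support data.
    step_size is an assumption of this sketch (plain gradient descent).
    """
    adapted = list(weights)
    for _ in range(num_updates):
        projected = project_gradient(grad_fn(adapted), scale)
        adapted = [w - step_size * g for w, g in zip(adapted, projected)]
    return adapted


if __name__ == "__main__":
    target = [1.0, 1.0]

    def toy_grad(w: List[float]) -> List[float]:
        # Gradient of the toy loss sum((w_i - target_i)^2).
        return [2.0 * (wi - ti) for wi, ti in zip(w, target)]

    # A zero in the projection freezes the corresponding weight.
    print(adapt([0.0, 0.0], toy_grad, scale=[1.0, 0.0]))
```

Since each weight has its own scaling factor, the learned projection can amplify, damp, or effectively disable the adaptation of individual weights.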

The whole method is trained with mini-batches of size 128. The evaluation also uses mini-batches (of the same size) in a transductive manner, i.e. sharing information across multiple test instances, as done in existing implementations of MAML [12, 24]. This means that the adaptation algorithm effectively uses support data retrieved for 128 questions at a time. The primary reason for mini-batches during evaluation is computational, but we did not observe improvements in accuracy with smaller batch sizes (down to processing one single instance at a time), whether for training and/or evaluation.

Appendix C Additional experiments

C.1 Varying the amount of support data

Figure 6: Varying the amount of support data used during evaluation.

We performed additional experiments in which we varied the amount of support data available during the evaluation of the model (Fig. 6). This serves to verify that the model makes actual use of information from the support data. We indeed observe that the performance increases as more data is made available. We repeated the experiment with a model initially trained with only 40% of the data (dashed lines in Fig. 6). The trend of the accuracy versus the amount of support data remains similar. The overall performance is however lower. This indicates room for improvement for the adaptation algorithm. Ideally, a model trained with less data should approach the performance of a model trained with more data, when provided with this data (as support) at test time.

C.2 Generalization to support from a different distribution

We evaluated the proposed model by providing it with support data from a different distribution than the data it was originally trained with (Tables 3 and 4). For these experiments, we use VQA-CP in a “leave-one-out” setting: we use the test set itself as the support data, and mask the intersection of the support data with the test instance currently being evaluated. More precisely, all QAs relating to the same image as the current test question are left out of the utilized support. The results of this experiment show that the model can very effectively adapt to this novel support data, as the accuracy gets a significant jump, approaching the performance on the validation set (which is of the same distribution as the initial training data). We suspected the increase in performance might be simply due to the larger amount of data (the original training data plus the additional test set provided as support). We disproved this hypothesis by repeating the experiment with a model trained with less initial training data and less support data, so as to match the total amount of data provided to the baseline (details in the supplementary material). This experiment gave a similarly high accuracy, which demonstrates that the model is indeed capable of adapting on-the-fly to the provided support data, even when it significantly differs from the data it was originally trained with.
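The leave-one-out masking described above amounts to filtering the support set by image. A minimal sketch follows, assuming each support item is a dictionary with an `image_id` field; that field name and the function name are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def leave_one_out_support(support: List[Dict],
                          test_image_id: int) -> List[Dict]:
    """Drop every support QA that refers to the image of the current test question."""
    return [item for item in support if item["image_id"] != test_image_id]


if __name__ == "__main__":
    support = [
        {"image_id": 1, "question": "What sport is this?", "answer": "tennis"},
        {"image_id": 2, "question": "What season is it?", "answer": "fall"},
        {"image_id": 1, "question": "What is man holding?", "answer": "racket"},
    ]
    # Evaluating a question about image 1: both QAs for image 1 are masked.
    print(leave_one_out_support(support, test_image_id=1))
```

Filtering by image rather than by individual question ensures that no other question about the same picture can leak its answer into the support data.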

Appendix D Qualitative results

We provide additional qualitative results in the following pages. A first set of results uses support data made of QAs. A second set uses support data made of captioned images (as indicated in column headings).


VQA-CP v2 Test split, “Other” questions
Ours with adaptation and, as support data:   QAs (Tr.)   QAs (Test)
Uniform sampling =                           31.33       32.83
Relevance function =                         31.79       37.19
Relevance function =                         31.76       36.28
Relevance function =                         31.68       33.52
Relevance function =                         31.09       37.78
Relevance function =                         34.25       43.52


Table 3: Complement to Table 1. We evaluate the different versions of our model with, as support data, QAs from the training set (first column, identical to Table 1) and QAs from the test set (second column, in a leave-one-out protocol). These results are not comparable to competing models since they use more data, but the clear improvement in the second column demonstrates that the model clearly adapts to support data from a distribution different from the one it was trained with (since the support QAs now reflect the distribution of the test questions). We envision this capability to allow a pretrained VQA model to be applied to various domains by simply providing it, at test time, with domain-specific support data.


VQA-CP v2 Test split
Ours with adaptation and, as support data:   Overall  Yes/no  Numbers  Other
QAs (VQA-CP tr.), =                          46.00    58.24   29.49    44.33
QAs (VQA-CP test), =                         52.09    62.02   47.66    48.21


Table 4: Complement to Table 2 (first row is identical to Table 2). This demonstrates the same effect as explained for Table 3.
Input question / Samples of retrieved support data (QAs) / Predicted scores

Input question: What season might this be? Correct answer: winter.
Retrieved support data: What season does it appear to be? fall. What season is it? fall. What season might this be? summer. What season is it? fall. What season is it? spring. What season is this? fall.
Without adaptation: winter, fall, spring, summer, snow
After adaptation: winter, fall, summer, spring, snow

Input question: What is teddy bear made of? Correct answer: fur.
Retrieved support data: What is bear standing on? concrete. What is on bear is face? fur. What is bear made out of? concrete. What is behind bear? concrete. What material is bear made of? cloth. What material is polar bear walking on? concrete.
Without adaptation: fur, fabric, cloth, paper, concrete
After adaptation: cotton, teddy bear, fabric, fur, none

Input question: What sport is being played? Correct answer: baseball.
Retrieved support data: What sport are these kids getting ready to play? baseball. What sport are these guys playing? baseball. What sport are people playing? baseball. What sport are they playing? baseball. What sport are they playing? baseball. What sport are they playing? baseball.
Without adaptation: baseball, softball, yes, playing baseball, baseball bat
After adaptation: baseball, baseball field, softball, soccer, tennis

Input question: What is this man doing? Correct answer: painting.
Retrieved support data: What sport is man doing? fishing. What is man doing? standing. What is hanging behind man? painting. What is man doing? standing. What is man doing? painting. What is this man doing? walking.
Without adaptation: fishing, standing, boating, walking, painting
After adaptation: standing, fishing, surfing, walking, boating

Input question: What side dish appears in bowl? Correct answer: salad.
Retrieved support data: What is in bowl? soup. What is in bowl? soup. What is on dish? soup. What is in bowl? soup. What is in black bowl? soup. What is in bowl? soup.
Without adaptation: pizza, none, soup, salad, vegetables
After adaptation: soup, salad, tomatoes, beans, none

Input question: What kind of flower is white one? Correct answer: lily.
Retrieved support data: What are species of flower represented in this photo? rose. What kind of plant is this? lily. What kind of flower is shown? rose. What is name of flower in vase? rose. What is white plant called? lily. What type of flower is in vase? lily.
Without adaptation: tulip, lily, tulips, lilies, rose
After adaptation: lily, tulip, tulips, lilies, rose

Input question: Are bags hard or soft? Correct answer: hard.
Retrieved support data: Does sand on beach look soft or coarse? soft. Is it better to use soft or natural lighting in bathroom? soft. Is this ground hard or soft? soft. Is it better to use soft or natural lighting in bathroom? soft. How hard did woman hit ball? soft. Is chaise lounge in foreground more likely soft or firm? soft.
Without adaptation: free, full, soft, laptops, open
After adaptation: soft, clean, sunny, cold, warm

Input question: What is this man is name? Correct answer: unknown.
Retrieved support data: Why is man on left sleepy? unknown. Is this person man or woman? man. What sign is near man? unknown. Is it man or woman with car? man. What street is man on? unknown. What color is man is boxers? unknown.
Without adaptation: unknown, man, obama, not possible, don’t know
After adaptation: unknown, none, don’t know, bob, nothing

Input question: What is strapped to his waist? Correct answer: backpack.
Retrieved support data: What is this man is feet strapped to? snowboard. What is person wearing around his waist? belt. What is tied around their waist? coat. What does child have around its waist? belt. What does child have around its waist? belt. What is tied around woman is waist? coat.
Without adaptation: leash, snowboard, boots, coat, belt
After adaptation: jacket, coat, sweater, backpack, dog

Input question: What kind of kite is man flying? Correct answer: white.
Retrieved support data: What is flying in air? kite. What pattern are kites flying in? none. What is flying? kite. What is moving man? kite. What is flying? kite. How is man staying in air? wind.
Without adaptation: sail, none, kite, white, wind
After adaptation: kite, white, none, seagull, no

Input question: Will that fence contain this animal? Correct answer: yes.
Retrieved support data: Is there more than one animal shown? no. Is fence as high as animal when it is standing up? no. Does this animal appear to live in zoo? yes. Is person scared of animal? yes. Is animal alive? yes. Is this wild animal? no.
Without adaptation: yes, no, unknown, 2, none
After adaptation: elephant, yes, no, elephants, trunk

Input question: What is he holding in his hands? Correct answer: pen.
Retrieved support data: What is person holding? laptop. What is man holding? laptop. What is man holding on his lap? laptop. What is he holding in his hands? mouse. What is she holding in her left hand? laptop. What is woman holding on her lap? computer.
Without adaptation: computer, laptop, mouse, books, nothing
After adaptation: laptop, computer, keyboard, nothing, mouse
Input question / Samples of retrieved support data (captions) / Predicted scores

Input question: What color is comforter? Correct answer: white.
Retrieved support data: Large bed covered with a comforter in a bedroom. Bed with comforter turned down. Bedroom shot size bed, white comforter and a lamp. Bed with a white pillow, a white comforter and accessories. Bed with a white comforter. Bed with comforter turned down and a night table lamp.
Without adaptation: gray, brown, black, white, green
After adaptation: white, black, gray, brown, blue

Input question: What is horse in background doing? Correct answer: eating.
Retrieved support data: Adult black horse and young brown horse interacting. Horse eating a hay stack. Horses in a grassy field with trees in the background. Brown horse standing on dirt in a grass field. Horse running in a grassy field in an enclosed area. A giraffe in the forefront and a zebra in the background.
Without adaptation: standing, grazing, walking, looking, running
After adaptation: grazing, running, standing, eating, walking

Input question: What is he wearing? Correct answer: suit.
Retrieved support data: Teenager wearing glasses and a tie. Man wearing a vest, a tie, and glasses. Man wearing a shirt and a tie making a creepy face. Bald man with mustache wearing a suit. Man wearing a black hat and holding an umbrella. Man standing in a bathroom wearing a shirt.
Without adaptation: hat, fedora, jacket, cowboy, coat
After adaptation: tie, suit, hat, ties, clothes

Input question: What color are umbrellas? Correct answer: green.
Retrieved support data: A group of people at a metal table with umbrellas. Crowd of adults holding red umbrellas in a march. People enjoying a meal with wine under white umbrellas. Adult and child holding umbrellas in a park. Elderly women stand in a large room with colorful umbrellas. Group of people walking with red umbrellas.
Without adaptation: white, purple, green, yellow, blue
After adaptation: green, blue, black, orange, white

Input question: What color is nose of plane? Correct answer: red.
Retrieved support data: Older air plane parked under a bridge. Plane sitting on a runway at an airport. Red, yellow, blue, and white plane parked on concrete. Air force plane sitting on tarmac with propellers. Big blue air plane parked with people. White plane sitting on a runway.
Without adaptation: red, white, black, gray, pink
After adaptation: white, red, black, gray, silver

Input question: What color is tablecloth? Correct answer: green and red.
Retrieved support data: A bowl of broccoli and pasta sit on a checkered tablecloth. Colorful plate of appetizers on a white linen tablecloth. Half-eaten food and beer on a patterned tablecloth. Sandwich for halloween on a tablecloth covered table. Plates of food on a red tablecloth. Restaurant sandwich platter on a plaid tablecloth.
Without adaptation: plaid, red and white, green, checkered, green and white
After adaptation: red and white, checkered, plaid, green, black and white

Input question: Are all of these people friends? Correct answer: yes.
Retrieved support data: People riding skateboards down the street. Bunch of people with skis ride on snow. Large group of smiling people raising their hands. People gathered in front of a government building flying kites. Group people riding skis on snow. Group of people skiing on snow.
Without adaptation: yes, no, unknown, family, can’t tell
After adaptation: lot, many, 100, all, 50

Input question: What utensil is in girl is hand? Correct answer: fork.
Retrieved support data: Girl pulling up a spoonful of cheesy casserole stands. Girl with a giant platter of food. Girl standing at the kitchen counter holding a spoon. Girl wearing a bow in her hair with her brother brushing. Woman and girl with plates of cakes and rolls. Boy and girl sitting at a dinner table and both pointing.
Without adaptation: fork, spoon, knife, right, fork and knife
After adaptation: pizza, knife, fork, fork and knife, plate

Input question: What is child running on top of? Correct answer: leaves.
Retrieved support data: Man and child flying a kite in a field. Dog running in a park with a frisbee in his mouth. Small dog running up truck. Small child on a street with a stop sign. Person in black uniform running with a soccer ball. Children running and playing with kites in a park area.
Without adaptation: umbrella, nothing, grass, ground, rain
After adaptation: leaves, grass, frisbee, ground, umbrella

Input question: Why do majority of people have on same color? Correct answer: blue.
Retrieved support data: Couple of people standing with ski on snow. Group of people standing on skis. Man stands with child wearing skis and people sitting. Group of people on snow with skis. Group of people skiing down a snow covered slope. Many people with ski on a mountain dressed for ski.
Without adaptation: 0, no, racing, 1, yes
After adaptation: skiing, blue, white, yellow, safety

Input question: Does this cake have healthy element? Correct answer: no.
Retrieved support data: Group of children standing at a table eating cake. Red cake with white frosting displayed with vase and sunflowers. Bride and groom cutting a wedding cake. Large white multi layered cake sitting on a table. Table decorated with flowers, utensils, and a marriage cake. Birthday cake sitting on a kitchen counter.
Without adaptation: yes, no, n, flowers, don’t know
After adaptation: wedding, yes, no, fruit, none