1 Introduction
The task of Visual Question Answering (VQA) demands that an agent correctly answer a previously unseen question about a previously unseen image. The fact that neither the question nor the image is specified until test time means that the agent must embody most of the achievements of Computer Vision and Natural Language Processing, and many of those of Artificial Intelligence.
VQA is typically framed in a purely supervised learning setting. A large training set of example questions, images, and their correct answers is used to train a method to map a question and image to scores over a predetermined, fixed vocabulary of possible answers, using maximum likelihood [39]. This approach has inherent scalability issues, as it attempts to represent all world knowledge within the finite set of parameters of a model such as a deep neural network. Consequently, a trained VQA system can only be expected to produce correct answers to questions drawn from a distribution very similar to that of the training set. Extending the model's knowledge or expanding its domain coverage is only possible by retraining it from scratch, which is computationally costly at best. This approach is thus fundamentally incapable of fulfilling the ultimate promise of VQA: answering general questions about general images.
As a solution to these issues, we propose a meta learning approach to the problem. The meta learning approach implies that the model learns to learn, i.e. it learns to use a set of examples provided at test time to answer the given question (Fig. 1). Those examples are questions and images, each with their correct answer, such as might form part of the training set in a traditional setting. They are referred to here as the support set. Importantly, the support set is not fixed: it is provided to the model at test time, and can be expanded with additional examples to increase the capabilities of the model. Note also that the support set may be large, and that the majority of its elements may have no relevance to the current question. The model we propose ‘learns to learn’ in that it is able to identify and exploit the relevant examples within a potentially large support set dynamically, at test time. Providing the model with more information thus does not require retraining, and the ability to exploit such a support set greatly improves the practicality and scalability of the system. Indeed, it is ultimately desirable for a practical VQA system to be adaptable to new domains and to continuously improve as more data becomes available. That vision is a long-term objective, and this work takes only a small step in that direction.
Our primary technical contribution is to adapt a state-of-the-art VQA model [34] to the meta learning scenario. The resulting model is a deep neural network that uses sets of dynamic parameters – also known as fast weights – determined at test time depending on the provided support set. The dynamic parameters allow the network to adaptively modify its computations and adapt its behaviour depending on the support set. We perform a detailed study to evaluate the effectiveness of these techniques under various regimes of training and support set sizes. These experiments are based on the VQA v2 benchmark, for which we propose data splits appropriate for studying a meta learning setting.
A novel capability demonstrated by the resulting system is to learn to produce completely novel answers (i.e. answers not seen during training), which are demonstrated only by instances of the support set provided at test time. In addition to this new capability, the system exhibits behaviour qualitatively distinct from existing VQA systems in its improved handling of rare answers. Since datasets for VQA exhibit a heavy class imbalance, with a small number of answers being much more frequent than most others, models optimized for current benchmarks are prone to fall back on frequent “safe” answers. In contrast, the proposed model is inherently less likely to fall victim to dataset biases, and exhibits a higher recall over rare answers. The proposed model does not surpass existing methods on the common aggregate accuracy metric, as is to be expected given that it does not overfit to the dataset bias, but it nonetheless exhibits desirable traits overall.
The contributions of this paper are summarized as follows.


We reframe VQA as a meta learning task, in which the model is provided at test time with a support set of supervised examples (questions and images with their correct answers).

We provide an experimental evaluation of the proposed model in different regimes of training and support set sizes and across variations in design choices.

Our results demonstrate the model's unique capability to produce novel answers, i.e. answers never seen during training, by learning from support instances, as well as an improved recall of rare answers and better sample efficiency than existing models.
2 Related Work
Visual question answering
Visual question answering has gathered significant interest from the computer vision community [6], as it constitutes a practical setting to evaluate deep visual understanding. In addition to visual parsing, VQA requires the comprehension of a text question, and combined reasoning over vision and language, sometimes on the basis of external or commonsense knowledge. See [39] for a recent survey of methods and datasets.
VQA is typically approached in a supervised setting, using large datasets [6, 15, 22, 44] of human-proposed questions with their correct answers to train a machine learning model. The VQA-real and VQA v2 datasets [6, 15] have served as popular benchmarks by which to evaluate and compare methods. Despite the large scale of those datasets, e.g. more than 650,000 questions in VQA v2, several limitations have been recognized. These relate to the dataset bias (i.e. the non-uniform, long-tailed distribution of answers) and the question-conditioned bias (which makes answers easy to guess given a question, without the image). For example, the answer Yes is particularly prominent in [6] compared to No, and questions starting with How many can be answered correctly with the answer two more than 30% of the time [15]. These issues hamper development in the field by encouraging methods that fare well on common questions and concepts, rather than on rare answers or more complicated questions. The aggregate accuracy metric used to compare methods is thus a poor indication of a method's capability for visual understanding. Improvements to datasets have been introduced [1, 15, 43], including VQA v2, but they only partially solve the evaluation problems. Increased interest has appeared in the handling of rare words and answers [29, 35]. The model proposed in this paper is inherently less prone to incorporating dataset biases than existing methods, and shows superior performance in handling rare answers. It accomplishes this by keeping a memory made up of explicit representations of training and support instances.

VQA with additional data
In the classical supervised setting, a fixed set of questions and answers is used to train a model once and for all. With few exceptions, the performance of such a model is fixed as it cannot use additional information at test time. Among those exceptions, [40, 38] use an external knowledge base to gather nonvisual information related to the input question. In [35], the authors use visual information from web searches in the form of exemplar images of question words, and better handle rare and novel words appearing in questions as a result. In [34], the same authors use similar images from web searches to obtain visual representations of candidate answers.
Those methods use ad-hoc, engineered techniques to incorporate external knowledge into the VQA model. In comparison, this paper presents a much more general approach: we expand the model's knowledge with data provided in the form of additional supervised examples (questions and images with their correct answers). A demonstration of the broader generality of our framework over the works above is its ability to produce novel answers, i.e. answers never observed during initial training and learned only from test-time examples.
Recent works on text-based question answering have investigated the retrieval of external information with reinforcement learning [26, 25, 8]. Those works are tangentially related and complementary to the approach explored in this paper.

Meta learning and few-shot learning
The term meta learning broadly refers to methods that learn to learn, i.e. that train models to make better use of training data. It applies to approaches including the learning of gradient descent-like algorithms such as [5, 13, 17, 30] for faster training or fine-tuning of neural networks, and the learning of models that can be directly fed training examples at test time [7, 33, 36]. The method we propose falls into the latter category. Most works on meta learning are motivated by the challenge of one-shot and few-shot visual recognition, where the task is to classify an image into categories defined by a few examples each. Our meta learning setting for VQA bears many similarities. VQA is treated as a classification task, and we are provided, at test time, with examples that illustrate the possible answers, possibly a small number per answer. Most existing methods are, however, not directly applicable to our setting, due to the large number of classes (i.e. possible answers), the heavy class imbalance, and the need to integrate into an architecture suitable for VQA. For example, recent works such as [36] propose efficient training procedures that are only suitable for a small number of classes.

Our model uses a set of memories within a neural network to store the activations computed over the support set. Similarly, Kaiser et al. [19] store past activations to remember “rare events”, which was notably evaluated on machine translation. Our model also uses network layers parametrized by dynamic weights, also known as fast weights, which are determined at test time depending on the actual input to the network. Dynamic parameters have a long history in neural networks [32] and have previously been used for few-shot recognition [7] and for VQA [27]. One of the memories within our network stores the gradient of the loss with respect to the static weights of the network, which is similar to the Meta Networks model proposed by Munkhdalai et al. [24]. Finally, our output stage produces scores over possible answers by similarity to prototypes representing the output classes (answers). This follows a similar idea to the Prototypical Networks of [33].
Continuum learning
An important outcome of framing VQA in a meta learning setting is to develop models capable of improving as more data becomes available. This touches the fields of incremental [12, 31] and continuum learning [2, 23, 42]. Those works focus on the fine-tuning of a network with new training data, output classes, and/or tasks. In comparison, our model does not modify itself over time and cannot experience negative domain shift or catastrophic forgetting, which are central concerns of continuum learning [21]. Our approach is rather to use such additional data on-the-fly, at test time, i.e. without iterative retraining. An important motivation for our framework is its potential to apply to support data of a different nature than question/answer examples, which would allow the model to leverage general, non-VQA-specific data, e.g. from knowledge bases or web searches. We consider this an important direction for future work.
3 VQA in a Meta Learning Setting
The traditional approach to VQA is a supervised setting described as follows. A model is trained to map an input question and image to scores over candidate answers [39]. The model is trained to maximize the likelihood of correct answers over a training set $\mathcal{T}$ of triplets $(Q_i, I_i, s_i)$, where $s_i \in [0,1]^{|\mathcal{A}|}$ represents the vector of ground truth scores over the predefined set $\mathcal{A}$ of possible answers. At test time, the model is evaluated on another triplet $(Q', I', s')$ from an evaluation or test set $\mathcal{E}$. The model predicts scores over the set of candidate answers, which can be compared to the ground truth $s'$ for evaluation purposes.
We extend the formulation above to a meta learning setting by introducing an additional support set $\mathcal{S}$ of similar triplets, provided to the model at test time. At a minimum, we define the support set to include the training examples themselves, i.e. $\mathcal{S} = \mathcal{T}$, but more interestingly, the support set can include novel examples provided at test time. They constitute additional data to learn from, such that $\mathcal{S} \supset \mathcal{T}$. The triplets in the support set can also include novel answers, never seen in the training set. In that case, the ground truth score vectors of the other elements in the support set are simply padded with zeros to match the larger size of the extended set of answers.

The following sections describe a deep neural network that can take advantage of the support set at test time. To leverage the information contained in the support set, the model must learn to utilize these examples on-the-fly at test time, without retraining of the whole model.
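As a concrete illustration of the bookkeeping described above, the following sketch (with illustrative names, not the authors' code) extends the answer vocabulary with novel answers found in the support set, and zero-pads an existing ground truth score vector to match:

```python
# Illustrative sketch: novel answers from the support set extend the answer
# vocabulary, and existing ground-truth score vectors are zero-padded to the
# new size. All names are hypothetical.

def extend_answers(train_answers, support_answers):
    """Return the combined answer vocabulary, novel answers appended last."""
    novel = [a for a in support_answers if a not in train_answers]
    return list(train_answers) + novel

def pad_scores(scores, new_size):
    """Zero-pad a ground-truth score vector to the extended vocabulary size."""
    return list(scores) + [0.0] * (new_size - len(scores))

answers = extend_answers(["yes", "no", "two"], ["two", "red"])
padded = pad_scores([1.0, 0.0, 0.0], len(answers))
```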
4 Proposed Model
The proposed model (Fig. 2) is a deep neural network that extends the state-of-the-art VQA system of Teney et al. [34]. Their system implements the joint embedding approach common to most modern VQA models [39, 41, 18, 20], followed by a multi-label classifier over candidate answers. Conceptually, we separate the architecture into (1) the embedding part, which encodes the input question and image, and (2) the classifier part, which handles the reasoning and actual question answering. (The separation of the network into an embedding part and a classifier part is conceptual; the division is arbitrarily placed after the fusion of the question and image embeddings. Computational requirements aside, the concept of dynamic parameters is in principle applicable to earlier layers as in [7].) The contributions of this paper address only the second part. Our contributions are orthogonal to developments on the embedding part, which could also benefit e.g. from advanced attention mechanisms or other computer vision techniques [3, 37, 39]. We follow the implementation of [34] for the embedding part. For concreteness, let us mention that the question embedding uses GloVe word vectors [28] and a Gated Recurrent Unit (GRU) [10]. The image embedding uses features from a CNN (Convolutional Neural Network) with bottom-up attention [3] and question-guided attention over those features. See [34] for details.

For the remainder of this paper, we abstract the embedding to modules that produce respectively the question and image vectors $q$ and $v$. They are combined with a Hadamard (element-wise) product into $h = q \circ v$, which forms the input to the classifier on which we now focus. The role of the classifier is to map $h$ to a vector of scores $\hat{s}$ over the candidate answers. We propose a definition of the classifier that generalizes the implementation of traditional models such as [34]. The input to the classifier is first passed through a non-linear transformation $f$, then through a mapping $g$ to scores over the set of candidate answers $\mathcal{A}$. This produces a vector of predicted scores $\hat{s} = g(f(h))$. In traditional models, the two functions correspond to a stack of non-linear layers for $f$, and a linear layer followed by a softmax or sigmoid for $g$. We now show how to extend $f$ and $g$ to take advantage of the meta learning setting.
4.1 Non-linear Transformation
The role of the non-linear transformation $f$ is to map the embedding $h$ of the question/image to a representation suitable for the following (typically linear) classifier. This transformation can be implemented in a neural network with any type of non-linear layers; our contributions are agnostic to this implementation choice. We follow [34] and use a gated hyperbolic tangent layer [11], defined as

    $f_\theta(h) \;=\; \sigma(W_1 h + b_1) \,\circ\, \tanh(W_2 h + b_2)$    (1)

where $\sigma$ is the logistic activation function, $W_1$ and $W_2$ are learned weights, $b_1$ and $b_2$ are learned biases, and $\circ$ is the Hadamard (element-wise) product. For notation purposes, we define the parameters $\theta$ as the concatenation of the vectorized weights and biases; this vector thus contains all of the weights and biases used by the non-linear transformation. A traditional model would learn the weights $\theta$ by backpropagation and gradient descent on the training set, and they would be held static at test time. We propose instead to adaptively adjust the weights at test time, depending on the input $h$ and the available support set. Concretely, we use a combination of static parameters $\theta^{\mathrm{static}}$ learned in the traditional manner, and dynamic ones $\theta^{\mathrm{dyn}}$ determined at test time. They are combined as $\theta = \theta^{\mathrm{static}} + w \circ \theta^{\mathrm{dyn}}$, with $w$ a vector of learned weights. The dynamic weights can therefore be seen as an adjustment made to the static ones depending on the input $h$.

A set of candidate dynamic weights is maintained in an associative memory $\mathcal{M}$. This memory is a large set (as large as the support set, see Section 4.2) of key/value pairs $(k_i, m_i)$. The interpretation of $m_i$ is of dynamic weights suited to an input similar to $k_i$. Therefore, at test time, we retrieve appropriate dynamic weights by soft key matching:
    $\theta^{\mathrm{dyn}} \;=\; \sum_i \operatorname{softmax}_i\!\big(\, d_{\cos}(h, k_i) \,\big)\; m_i$    (2)

where $d_{\cos}(\cdot,\cdot)$ is the cosine similarity function. We therefore retrieve a weighted sum, in which the similarity of $h$ with the memory keys $k_i$ serves to weight the memory values $m_i$. In practice, and for computational reasons, the softmax is computed over only the top-$N$ largest similarities, with $N$ in the order of a thousand elements (see Section 5). We detail in Section 4.2 how the memory is filled by processing the support set. Note that the above formulation can be made equivalent to the original model in [34] by using only static weights ($w = 0$). This serves as a baseline in our experiments (see Section 5).

4.1.1 Mapping to Candidate Answers
The function $g$ maps the output $z = f(h)$ of the non-linear transformation to a vector of scores over the set of candidate answers $\mathcal{A}$. It is traditionally implemented as a simple affine or linear transformation (i.e. a matrix multiplication). We generalize the definition of $g$ by interpreting it as a similarity measure between its input and prototypes representing the possible answers. In traditional models, each prototype corresponds to one row of the weight matrix. Our general formulation allows one or several prototypes per possible answer $a$, collected in a set $P_a$. Intuitively, the prototypes represent the typical expected feature vector when $a$ is a correct answer. The score for $a$ is therefore obtained as the similarity between the provided $z$ and the corresponding prototypes of $a$. When multiple prototypes are available, the similarities are averaged. Concretely, we define

    $g_a(z) \;=\; \sigma\Big( \tfrac{1}{|P_a|} \sum_{p \in P_a} d(z, p) \,+\, b_a \Big)$    (3)

where $d$ is a similarity measure, $\sigma$ is a sigmoid (logistic) activation function to map the similarities to $[0,1]$, and $b_a$ is a learned bias term. Traditional models that use a matrix multiplication [18, 34, 35] correspond to a $g$ that uses a single prototype per answer and a dot product as the similarity function. In comparison, our definition generalizes to multiple prototypes per answer and to different similarity measures. Our experiments evaluate the dot product and the weighted Lp norm of vector differences:

    $d_{\mathrm{dot}}(z, p) \;=\; z^{\top} p$    (4)
    $d_{L1}(z, p) \;=\; -\,\lVert\, w' \circ (z - p) \,\rVert_1$    (5)
    $d_{L2}(z, p) \;=\; -\,\lVert\, w' \circ (z - p) \,\rVert_2$    (6)

where $w'$ is a vector of learned weights applied coordinate-wise.
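The prototype-based output stage and the similarity measures above can be sketched as follows; this is an illustrative, hypothetical implementation with assumed names, not the authors' code:

```python
import numpy as np

# Sketch of the prototype-based output stage: each answer owns one or more
# prototypes; its score is the sigmoid of the averaged similarities plus a
# bias. Names and shapes are illustrative.

def sim_dot(z, p):
    # Dot-product similarity (Eq. 4)
    return z @ p

def sim_weighted_lp(z, p, w, order=2):
    # Weighted Lp norm of the difference, negated so larger = more similar
    return -np.linalg.norm(w * (z - p), ord=order)

def answer_score(z, prototypes, bias, sim):
    """prototypes: list of prototype vectors for one candidate answer."""
    s = np.mean([sim(z, p) for p in prototypes])
    return 1 / (1 + np.exp(-(s + bias)))

z = np.array([1.0, 0.0])
protos = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
score = answer_score(z, protos, bias=0.0, sim=sim_dot)
```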
Our model uses two sets of prototypes: static ones $P^{\mathrm{static}}_a$ and dynamic ones $P^{\mathrm{dyn}}_a$. The static prototypes are learned during training as traditional weights, by backpropagation and gradient descent, and held fixed at test time. The dynamic ones are determined at test time by processing the provided support set (see Section 4.2). Thereafter, all prototypes are used indistinctly. Note that our formulation of $g$ can be made equivalent to the original model of [34] by using only static prototypes ($P^{\mathrm{dyn}}_a = \emptyset$) and the dot-product similarity measure. This will serve as a baseline in our experiments (Section 5).
Finally, the output of the network is attached to a cross-entropy loss between the predicted scores $\hat{s}$ and the ground truth $s$ for training the model end-to-end [34].
4.2 Processing of Support Set
Both functions $f$ and $g$ defined above use dynamic parameters that depend on the support set. Our model processes the entire support set in a forward and backward pass through the network, as described below. This step is carried out once at test time, prior to making predictions on any instance of the test set. At training time, it is repeated before every epoch to account for the evolving static parameters of the network as training progresses (see Algorithm 1).

We pass all elements of the support set through the network in minibatches, for both a forward and a backward pass. The evaluations of $f$ and $g$ use only static weights and prototypes, i.e. $w = 0$ and $P_a = P^{\mathrm{static}}_a$. To fill the memory $\mathcal{M}$, we collect, for every element of the support set, its feature vector $h$ and the gradient of the final loss with respect to the static weights, $\nabla_{\theta^{\mathrm{static}}} \mathcal{L}$. This effectively captures the adjustments that would be made by a gradient descent algorithm to those weights for that particular example. The pair $(h, \nabla_{\theta^{\mathrm{static}}} \mathcal{L})$ is added to the memory $\mathcal{M}$, which thus holds $|\mathcal{S}|$ elements at the end of the process.
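The soft key-matching retrieval over such a memory can be sketched as follows. This is an illustrative sketch, not the authors' code: keys stand in for support-set feature vectors, values stand in for the stored gradients, and a top-k truncated softmax weights the retrieved values:

```python
import numpy as np

# Sketch of the associative memory retrieval of dynamic weights: cosine
# similarity between the input and the memory keys, softmax over the top-k
# matches, weighted sum of the corresponding values. Names are illustrative.

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def retrieve_dynamic_weights(h, memory, top_k=2):
    """memory: list of (key, value) pairs; softmax over top-k cosine sims."""
    sims = np.array([cosine(h, k) for k, _ in memory])
    top = np.argsort(sims)[-top_k:]                 # indices of k best matches
    w = np.exp(sims[top] - sims[top].max())
    w /= w.sum()                                    # softmax over the top-k
    return sum(wi * memory[i][1] for wi, i in zip(w, top))

mem = [(np.array([1.0, 0.0]), np.array([1.0, 1.0])),
       (np.array([0.0, 1.0]), np.array([-1.0, 0.0])),
       (np.array([1.0, 0.1]), np.array([2.0, 0.0]))]
theta_dyn = retrieve_dynamic_weights(np.array([1.0, 0.0]), mem, top_k=2)
```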
To determine the set of dynamic prototypes $P^{\mathrm{dyn}}_a$, we collect the feature vectors $z = f(h)$ over all instances of the support set, then compute their average over instances having the same correct answer. Concretely, the dynamic prototype for answer $a$ is obtained as $p^{\mathrm{dyn}}_a = \tfrac{1}{|\mathcal{S}_a|} \sum_{i \in \mathcal{S}_a} z_i$, where $\mathcal{S}_a$ denotes the support instances whose correct answer is $a$.
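The averaging that produces the dynamic prototypes can be sketched as follows (illustrative names, not the authors' code):

```python
import numpy as np
from collections import defaultdict

# Sketch: the dynamic prototype of each answer is the mean of the feature
# vectors of the support instances whose correct answer is that answer.

def dynamic_prototypes(features, answers):
    groups = defaultdict(list)
    for f, a in zip(features, answers):
        groups[a].append(f)
    return {a: np.mean(fs, axis=0) for a, fs in groups.items()}

protos = dynamic_prototypes(
    [np.array([1.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 2.0])],
    ["cat", "cat", "dog"])
```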
During training, we must balance the need for data to train the static parameters of the network against the need for an “example” support set, such that the network can learn to use novel data. If the network were provided with a fixed, constant support set, it would overfit to that input and be unable to make use of novel examples at test time. Our training procedure therefore uses all available data as the training set $\mathcal{T}$, and forms a different support set at each training epoch as a random subset of $\mathcal{T}$. The procedure is summarized in Algorithm 1. Note that in practice, it is parallelized to process instances in minibatches rather than individually.
5 Experiments
We perform a series of experiments to evaluate (1) how effectively the proposed model and its different components can use the support set, (2) how useful novel support instances are for VQA, and (3) whether the model learns aspects of a dataset different from those captured by classical VQA methods trained in the classical setting.
Datasets
The VQA v2 dataset [15] serves as the principal current benchmark for VQA. Its heavy class imbalance among answers makes it very difficult to draw meaningful conclusions or perform a qualitative evaluation, however. We therefore additionally propose a series of experiments on a subset referred to as VQA-Numbers. It includes all questions marked in VQA v2 as “number” questions, which are further cleaned up to remove answers appearing less than 1,000 times in the training set, and to remove questions that do not have an unambiguous answer (we keep only those whose ground truth scores contain a single element equal to 1.0). Questions from the original validation set of VQA v2 are used for evaluation, and the original training set (45,965 questions after clean-up) is used for training, support, and validation. The precise data splits will be made publicly available. Most importantly, the resulting set of candidate answers corresponds to the seven numbers from 0 to 6.
Metrics
The aggregate metric used for evaluation on VQA v2 is the accuracy, defined as the average, over the test set, of the ground truth score $s_{\hat a}$ of the answer $\hat a$ of highest predicted score. We also define the recall of an answer $a$ as the fraction of test instances with correct answer $a$ that the model predicts as $a$. We look at the recall averaged (uniformly) over all possible answers to better reflect performance across a variety of answers, rather than only on the most common ones.
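A hypothetical implementation of these metrics, simplified to a single correct answer per instance (all names are illustrative):

```python
# Sketch of the evaluation metrics: accuracy is the ground-truth score of the
# top-scoring predicted answer; the recall of an answer a is the fraction of
# instances with correct answer a that the model actually predicts as a.

def accuracy(gt_scores, predicted):
    """gt_scores: dict answer -> ground-truth score; predicted: argmax answer."""
    return gt_scores.get(predicted, 0.0)

def answer_recall(examples, answer):
    """examples: list of (correct_answer, predicted_answer) pairs."""
    relevant = [(c, p) for c, p in examples if c == answer]
    if not relevant:
        return 0.0
    return sum(1 for c, p in relevant if p == c) / len(relevant)

def mean_recall(examples):
    # Recall averaged uniformly over all answers appearing as ground truth
    answers = {c for c, _ in examples}
    return sum(answer_recall(examples, a) for a in answers) / len(answers)

ex = [("yes", "yes"), ("yes", "no"), ("two", "two"), ("red", "blue")]
```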
Implementation
Our implementation is based on the code provided by the authors of [34]; details non-specific to our contributions can be found there. We initialize all parameters, in particular the static weights and static prototypes, as if they were those of a linear layer in a traditional architecture, following Glorot and Bengio [14]. During training, the support set is subsampled (Section 4.2) to yield a set of 1,000 elements. We use, per answer, one or two static prototypes, and zero or one dynamic prototype (as noted in the experiments). All experiments use an embedding dimension of 128 and minibatches of a fixed number of instances. Experiments with VQA v2 use a set of candidate answers capped by a minimum number of training occurrences, giving 1,960 possible answers [34]. Past works have shown that small differences in implementation can have a noticeable impact on performance. Therefore, to ensure fair comparisons, we repeated all evaluations of the baseline [34] with our code and preprocessing; results are therefore not directly comparable with those reported in [34]. In particular, we do not use the Visual Genome dataset [22] for training.
5.1 VQANumbers
Ablative evaluation
We first evaluate the components of the proposed model in comparison to the state-of-the-art model of [34], which serves as a baseline, being equivalent to our model with 1 static prototype per answer, the dot-product similarity, and no dynamic parameters. We train and evaluate on all 7 answers. To provide the baseline with a fair chance (the VQA-Numbers data is still heavily imbalanced, with “1” and “2” together making up the large majority of correct answers, in roughly equal parts), we train all models with standard supersampling [9, 16], i.e. selecting training examples with equal probability with respect to their correct answer. In these experiments, the support set is equal to the training set.


Table 1: Average answer recall (%) on VQA-Numbers.

(1a) Chance                                              14.28
(1b) State-of-the-art model [34]                         29.72
     (equivalent to 1 static prototype per answer,
      dot-product similarity, no dynamic parameters)
(2b) 1 static prot./ans., L1 similarity                  29.97
(2c) 1 static prot./ans., L2 similarity                  27.80
(2d) 2 static prot./ans., dot-product similarity         30.28
(2e) 2 static prot./ans., L1 similarity                  28.34
(2f) 2 static prot./ans., L2 similarity                  31.48
(3a) Dynamic weights (+2f)                               31.81
(3b) Proposed: dynamic weights and prototypes (+2f)      32.32

As reported in Table 1, the proposed dynamic weights improve over the baseline, and the dynamic prototypes bring an additional improvement. We also compare different choices for the similarity function. Interestingly, swapping the dot product in the baseline for an L2 distance has a negative impact. When using two static prototypes, however, the L2 distance proves superior to the L1 or the dot product. This is consistent with [33], where a prototypical network also performed best with an L2 distance.
Additional Support Set and Novel answers
We now evaluate the ability of the model to exploit support data never seen until test time (see Fig. 3). We train the same models designed for 7 candidate answers, but only provide them with training data for a subset of them. The proposed model is additionally provided with a complete support set covering all 7 answers. Each reported result is averaged over 10 runs. The set of answers excluded from training is randomized across runs, but identical for all models for a given number of excluded answers.
The proposed model proves superior to the baseline and all other ablations (Fig. 3, top). The dynamic prototypes are particularly beneficial. With very little training data, the use of dynamic weights is less effective and sometimes even detrimental. We hypothesize that the model may then suffer from overfitting due to the additional learned parameters. When evaluated on novel answers (not seen during training and only present in the test-time support set), the dynamic prototypes provide a remarkable ability to learn those from the support set alone (Fig. 3, bottom). Their efficacy is particularly strong when only a single novel answer has to be learned. Remarkably, a model trained on only two answers maintains some capacity to learn about all others, with an average recall above the chance baseline. Note that we cannot claim that the model learns to count to those novel numbers, but at the very least it is able to associate those answers with particular images/questions (possibly exploiting question-conditioned biases).
5.2 VQA v2
We performed experiments on the complete VQA v2 dataset. We report results of different ablations, trained with 50% or 100% of the official training set, evaluated on the validation set as in [34]. The proposed model uses the remainder of the official training set as additional support data at test time. The complexity and varying quality of this dataset do not lead to clear-cut conclusions from the standard accuracy metric (see Table 2). The answer recall leads to more consistent observations that align with those made on VQA-Numbers. Both dynamic weights and dynamic prototypes provide a consistent advantage (Fig. 4). Each technique is beneficial in isolation, but their combination generally performs best. Individually, the dynamic prototypes appear more impactful than the dynamic weights. Note that our experiments on VQA v2 aim at quantifying the effect of the contributions in the meta learning setting; we did not seek to maximize absolute performance in the traditional benchmark setting.


Table 2: Question accuracy / answer recall on VQA v2.

                                                   Trained on 50%   Trained on 100%
Baseline [34]                                        57.6 / 14.0      59.8 / 15.8
Proposed model:
  with dynamic weights, no dynamic prototypes        57.6 / 14.1      60.0 / 16.3
  no dynamic weights, with dynamic prototypes        57.6 / 15.2      59.7 / 18.0
  same, no static prototypes, only dynamic ones      57.2 /  3.6      58.6 /  4.29
  with dynamic weights and dynamic prototypes        57.5 / 15.5      59.9 / 18.0

To obtain better insight into the predictions of the model, we examine the individual recall of possible answers and compare the values with those obtained by the baseline. The difference (Fig. 5) indicates which of the two models provides the better predictions for each answer. We observe a qualitatively different behaviour between the models. While the baseline is most effective on frequent answers, the proposed model fares better (mostly positive values) in the long tail of rare answers. This corroborates previous discussions on dataset biases [15, 18, 43], which classical models are prone to overfit to. The proposed model is inherently more robust to such behaviour.
6 Conclusions and Future Work
We have devised a new approach to VQA by framing it as a meta learning task. This approach enables us to provide the model with supervised data at test time, thereby allowing the model to adapt or improve as more data is made available. We believe this view could lead to the development of scalable VQA systems better suited to practical applications. We proposed a deep learning model that takes advantage of the meta learning scenario and demonstrated a range of benefits: improved recall of rare answers, better sample efficiency, and a unique capability to learn to produce novel answers, i.e. answers never seen during training, learned only from support instances.

The learning-to-learn approach we propose enables a far greater separation of the question-answering method from the information used in the process than has previously been possible. Our contention is that this separation is essential if vision-and-language methods are to move beyond benchmarks to tackle real problems, because embedding all of the information a method needs to answer real questions in the model weights is impractical.
Even though the proposed model is able to use novel support data, the experiments showed room for improvement, since a model initially trained on the same amount of data still shows superior performance. Practical considerations should also be addressed to apply this model at a larger scale, in particular the handling of the memory of dynamic weights, which currently grows linearly with the support set. Clustering schemes could be envisioned to reduce its size [33], and hashing methods [4, 19] could improve the efficiency of the content-based retrieval.
Generally, the handling of additional data at test time opens the door to VQA systems that interact with other sources of information. While the proposed model was demonstrated with a support set of questions/answers, the principles extend to any type of data obtained at test time e.g. from knowledge bases or web searches. This would drastically enhance the scalability of VQA systems.
References
 [1] A. Agrawal, A. Kembhavi, D. Batra, and D. Parikh. C-VQA: A compositional split of the visual question answering (VQA) v1.0 dataset. arXiv preprint arXiv:1704.08243, 2017.
 [2] R. Aljundi, P. Chakravarty, and T. Tuytelaars. Expert gate: Lifelong learning with a network of experts. arXiv preprint arXiv:1611.06194, 2016.
 [3] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and VQA. arXiv preprint arXiv:1707.07998, 2017.
 [4] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Foundations of Computer Science (FOCS), pages 459–468. IEEE, 2006.
 [5] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
 [6] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In Proc. IEEE Int. Conf. Comp. Vis., 2015.
 [7] L. Bertinetto, J. F. Henriques, J. Valmadre, P. H. S. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In NIPS, pages 523–531, 2016.
 [8] C. Buck, J. Bulian, M. Ciaramita, A. Gesmundo, N. Houlsby, W. Gajewski, and W. Wang. Ask the right questions: Active question reformulation with reinforcement learning. arXiv preprint arXiv:1705.07830, 2017.
 [9] M. Buda, A. Maki, and M. A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. arXiv preprint arXiv:1710.05381, 2017.
 [10] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. Conf. Empirical Methods in Natural Language Processing, 2014.
 [11] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083, 2016.
 [12] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
 [13] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
 [14] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proc. Int. Conf. Artificial Intell. & Stat., pages 249–256, 2010.
 [15] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. arXiv preprint arXiv:1612.00837, 2016.
 [16] H. Guo, Y. Li, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl., 73:220–239, 2017.
 [17] S. Hochreiter, A. S. Younger, and P. R. Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer, 2001.
 [18] A. Jabri, A. Joulin, and L. van der Maaten. Revisiting visual question answering baselines. In Proc. Eur. Conf. Comp. Vis., 2016.
 [19] L. Kaiser, O. Nachum, A. Roy, and S. Bengio. Learning to remember rare events. CoRR, 2017.
 [20] V. Kazemi and A. Elqursh. Show, ask, attend, and answer: A strong baseline for visual question answering. arXiv preprint arXiv:1704.03162, 2017.
 [21] J. Kirkpatrick, R. Pascanu, N. C. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell. Overcoming catastrophic forgetting in neural networks. arXiv preprint arXiv:1612.00796, 2016.
 [22] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.
 [23] D. LopezPaz and M. Ranzato. Gradient episodic memory for continuum learning. arXiv preprint arXiv:1706.08840, 2017.
 [24] T. Munkhdalai and H. Yu. Meta networks. In International Conference on Machine Learning (ICML), pages 2554–2563, 2017.
 [25] K. Narasimhan, A. Yala, and R. Barzilay. Improving information extraction by acquiring external evidence with reinforcement learning. arXiv preprint arXiv:1603.07954, 2016.
 [26] R. Nogueira and K. Cho. Task-oriented query reformulation with reinforcement learning. arXiv preprint arXiv:1704.04572, 2017.
 [27] H. Noh, P. H. Seo, and B. Han. Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
 [28] J. Pennington, R. Socher, and C. Manning. Glove: Global Vectors for Word Representation. In Conference on Empirical Methods in Natural Language Processing, 2014.
 [29] S. K. Ramakrishnan, A. Pal, G. Sharma, and A. Mittal. An empirical evaluation of visual question answering for novel objects. arXiv preprint arXiv:1704.02516, 2017.
 [30] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.
 [31] S. Rebuffi, A. Kolesnikov, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. arXiv preprint arXiv:1611.07725, 2016.
 [32] J. Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
 [33] J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175, 2017.
 [34] D. Teney, P. Anderson, X. He, and A. van den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. arXiv preprint arXiv:1708.02711, 2017.
 [35] D. Teney and A. van den Hengel. Zero-shot visual question answering. 2016.
 [36] E. Triantafillou, R. Zemel, and R. Urtasun. Few-shot learning through an information retrieval lens. arXiv preprint arXiv:1707.02610, 2017.
 [37] P. Wang, Q. Wu, C. Shen, and A. van den Hengel. The VQA-Machine: Learning how to use existing vision algorithms to answer new questions. arXiv preprint arXiv:1612.05386, 2016.
 [38] P. Wang, Q. Wu, C. Shen, A. van den Hengel, and A. Dick. Explicit knowledge-based reasoning for visual question answering. arXiv preprint arXiv:1511.02570, 2015.
 [39] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding, 2017.
 [40] Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
 [41] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked Attention Networks for Image Question Answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
 [42] J. Yoon, E. Yang, J. Lee, and S. J. Hwang. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.
 [43] P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh. Yin and yang: Balancing and answering binary visual questions. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
 [44] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded Question Answering in Images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
Supplementary material
Appendix A Factorized Non-Linear Transformation
We defined in Eq. 1 our non-linear transformations using weights (static or dynamic) containing all the parameters of a gated tanh layer. Although this is sound in principle, it is computationally costly to handle a memory containing dynamic weights of such a large dimensionality (quadratic in the size $n$ of the activations). To alleviate this, we follow [7] and factorize the parameters of the gated tanh layer, rewriting Eq. 1 as follows:
$f_\theta(x) \;=\; \sigma\big(U\,\mathrm{diag}(v)\,V x + b\big) \,\circ\, \tanh\big(U'\,\mathrm{diag}(v')\,V' x + b'\big)$ (7)
with vectors $v, v', b, b' \in \mathbb{R}^n$ and matrices $U, V, U', V' \in \mathbb{R}^{n \times n}$. The matrices $U$, $V$, $U'$, and $V'$ are learned like traditional weights, and only $v$, $b$, $v'$, and $b'$ are incorporated into the vector of weights (static or dynamic). This reduces the dimensionality of the weight vector from $2n(n+1)$ to $4n$ (accounting for $v$, $b$, $v'$, and $b'$).
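A minimal sketch of such a factorized gated tanh layer (symbol names are illustrative): the shared matrices would be learned once, while only the per-entry vectors enter the memory of static or dynamic weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_tanh_factorized(x, U, V, v, b, Up, Vp, vp, bp):
    """Gated tanh layer with factorized weight matrices.

    The full matrices U diag(v) V and Up diag(vp) Vp are never formed
    explicitly. U, V, Up, Vp play the role of shared matrices learned
    once; only the vectors v, b, vp, bp (4n parameters) would be stored
    per entry in the memory of weights.
    """
    g = sigmoid(U @ (v * (V @ x)) + b)       # gate in (0, 1)
    h = np.tanh(Up @ (vp * (Vp @ x)) + bp)   # candidate activation
    return g * h

# Usage: each memory entry contributes only the four n-dim vectors.
rng = np.random.default_rng(0)
n = 4
U, V, Up, Vp = [rng.standard_normal((n, n)) for _ in range(4)]
v, b, vp, bp = [rng.standard_normal(n) for _ in range(4)]
y = gated_tanh_factorized(np.ones(n), U, V, v, b, Up, Vp, vp, bp)
```

Note the parenthesization: multiplying by `V`, scaling elementwise by `v`, then multiplying by `U` costs two matrix-vector products, so the quadratic-size product matrix never needs to be materialized.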
Appendix B Bias in the Output Mapping
In Eq. 3, the output mapping to answer scores uses a scalar
bias term. This is different to the vector bias in a classical model that use a linear (affine) layer to implement
. A vector contains a value for each output class (i.e. each candidate answer), whereas our formulation uses a single value shared among all of them. Our formulation helps avoid the model incorporating biases towards frequent training answers, as discussed in Section 1 and 2. This feature is essential to enable the capability of our model to produce novel answers (unseen during training) only demonstrated by instances in the support set. A vector of answerspecific biases would prevent this capability, as the bias for the novel answers can not be learned from training data.Completely removing the bias term is another option. At test time, it is without effect compared to a scalar bias, since it is only followed by a (monotonic) sigmoid. Removing the bias term however renders the training by gradient descent numerically unstable, because the chosen similarity function can map to saturating regions of the domain of the sigmoid.
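To make the design choice concrete, here is a minimal sketch (hypothetical names; a plain dot product stands in for the similarity function) of a mapping to answer scores with a single bias shared by all candidate answers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def answer_scores(h, prototypes, scalar_bias):
    """Score every candidate answer with one shared scalar bias.

    A per-answer bias vector would have no entry for answers that only
    appear in the support set; a single shared scalar avoids this.
    """
    sims = prototypes @ h          # one similarity per candidate answer
    return sigmoid(sims + scalar_bias)

rng = np.random.default_rng(0)
h = rng.standard_normal(16)                   # joint question/image embedding
train_protos = rng.standard_normal((3, 16))   # answers seen in training
novel_proto = rng.standard_normal((1, 16))    # built only from the support set
all_protos = np.vstack([train_protos, novel_proto])

# The novel answer competes on equal footing with trained answers:
# no answer-specific bias had to be learned for it during training.
scores = answer_scores(h, all_protos, scalar_bias=-1.0)
```

The negative scalar bias simply shifts the sigmoid's operating point; it plays no role in ranking answers against each other.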
Appendix C VQA-Numbers Dataset
We provide below statistics of the VQA-Numbers dataset.


Correct answer                       0       1       2       3       4       5       6     Sum
Training/validation/support set
  Count                          2,529   8,193   7,030   2,485   1,520     579     602  22,938
  Fraction                       11.0%   35.7%   30.7%   10.8%    6.6%    2.6%    2.6%    100%
Test set
  Count                            858   2,804   2,434     843     495     173     205   7,812
  Fraction                       11.0%   35.9%   31.2%   10.8%    6.3%    2.2%    2.6%    100%

Appendix D VQA-Numbers Experiments
Our experiments on VQA-Numbers (Section 5.1) use super-sampling during training to ensure that none of the compared models can be influenced by dataset biases (i.e. class imbalance). The super-sampling is performed at the epoch level, not at the minibatch level. Concretely, the training instances of all answers (classes) except the most frequent one are repeated at random such that each class has as many instances as the most frequent one. The elements within a minibatch are selected at random. Each training epoch thus goes through every training instance at least once. We did not try to constrain the sampling within minibatches.
Note finally that this super-sampling strategy is practical on VQA-Numbers thanks to the small number of classes and the only mild imbalance. It would not be suitable for VQA v2, for the opposite reasons.
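The epoch-level super-sampling described above can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def supersample_epoch_indices(labels, rng):
    """Build one epoch's instance indices with epoch-level super-sampling.

    Instances of every class except the most frequent one are repeated
    at random until each class matches the most frequent class's count.
    The whole index list is then shuffled, so minibatches drawn from it
    are balanced only in expectation, not individually constrained.
    """
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    indices = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(labels == c)
        indices.append(idx)                              # every instance once
        if n < target:
            indices.append(rng.choice(idx, target - n))  # random repeats
    epoch = np.concatenate(indices)
    rng.shuffle(epoch)
    return epoch
```

Because every original index is included before any repeats are drawn, each epoch is guaranteed to visit every training instance at least once, matching the procedure above.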
We provide in Fig. 6 additional results for the experiments of Section 5.1. We report the performance of the same models, now evaluated only on answers present in the training set, whose number is varied from 1 to 7. The chance performance (gray dashes) diminishes as the number of possible answers grows. As expected, the baseline model achieves 100% recall in the trivial case of a single possible answer. The proposed model, however, receives a support set containing examples of all answers (i.e. all 7 of them), which explains its imperfect result in this trivial case. As the number of possible answers increases, the proposed models (with dynamic weights and dynamic prototypes) show a growing advantage, and finally surpass the baseline by a clear margin in the most interesting cases of 5, 6, and 7 answers.