Some of the most active research areas in computer vision tackle increasingly complex tasks that require high-level reasoning. Examples of this trend include visual question answering (VQA), image captioning, referring expressions, visual dialog, and vision-and-language navigation. While deep learning has helped make significant progress, these tasks expose the limitations of the pattern recognition methods that have proved successful on classical vision tasks such as object recognition. A key indicator of the shortcomings of deep learning methods is their tendency to respond to specific features or biases in the dataset, rather than generalising to an approach that is applicable more broadly [1, 12]. In response, we propose a benchmark that directly measures a method's ability for high-level reasoning over real visual information, and in which we can control the level of generalisation required.
The performance on these tasks is typically measured against a task-specific end goal, e.g. predicting the correct answer in VQA, or producing a sentence matching the ground truth in image captioning. These tasks include a strong visual component, and they are naturally assumed to lie on the path to semantic scene understanding, the overarching goal of computer vision. Unfortunately, non-visual aspects of these tasks, language in particular, act as major confounding factors. For example, in image captioning, the automated evaluation of generated language is itself an unsolved problem. In VQA, many questions are phrased such that their answers can be guessed without looking at the image.
We propose to take a step back with a task that directly evaluates abstract reasoning over realistic visual stimuli. Our setting is inspired by Raven's Progressive Matrices (RPMs), which are used in educational settings to measure human non-verbal visual reasoning abilities. Each instance of the task is a matrix of images, where the last image is missing and must be chosen from eight candidates. All rows of the completed matrix must represent the same relationship (logical relationships, counts and comparisons, etc.) over a visual property of their three images (Fig. 1). We use real photographs, such that the task requires strong visual capabilities, and we focus on visual, mostly non-semantic properties. This evaluation is thus designed to reflect the capabilities required by the complex tasks mentioned above, but in an abstract, non-task-specific manner that might help guide general progress in the field.
Other recent efforts have proposed benchmarks for visual reasoning [6, 32] and our key difference is to focus on real images, which are of greater interest to the computer vision community than 2D shapes and line drawings. This is a critical difference, because abstract reasoning is otherwise much easier to achieve when applied to a closed set of easily identified symbols such as simple geometrical shapes. A major contribution of this paper is the construction of a suitable dataset with real images at large scale (over 300,000 instances).
Generalisation is a key issue that limits the robustness, and thus the practicality, of deep learning (see [19, 17, 13, 39] among many others). Current benchmarks that require visual reasoning, with few exceptions [1, 4, 40], use training and test splits that follow an identical distribution, which encourages methods to exploit dataset-specific biases (e.g. class imbalance) and superficial correlations [23, 33]. This practice rewards methods that overfit to their training sets, to the detriment of generalization capabilities. With these concerns in mind, our benchmark includes several evaluation settings that demand controlled levels of generalization (Section 3.2).
We have adapted and evaluated a range of deep learning models on our benchmark. Simple feed-forward networks achieve better than random results given enough depth, but recurrent neural networks and relational networks perform noticeably better. In the evaluation settings requiring strong generalization, i.e. applying relationships to visual properties in combinations not seen during training, all tested models clearly struggle. In most cases, small improvements are observed by using additional supervision, both on the visual features (using a bottom-up attention network rather than a ResNet CNN), and on the type of relationship represented in the training examples. These results indicate the difficulty of the task while hinting at promising research directions.
Finally, the proposed benchmark is not to be addressed as an end goal, but should serve as a diagnostic test of methods aiming at more complex tasks. In the spirit of the CLEVR dataset for VQA and the bAbI dataset for reading comprehension, our benchmark focuses on the fundamental operations common to multiple high-level tasks. Crafting a solution specific to this benchmark is, however, not necessarily a path to actual solutions to these tasks. This guided the selection of general-purpose architectures evaluated in this paper.
The contributions of this paper are summarized as follows.
We define a new task to evaluate a model’s ability for abstract reasoning over complex visual stimuli. The task is designed to require reasoning similar to complex tasks in computer vision, while allowing evaluation free of task-specific confounding factors such as natural language and dataset biases.
We describe a procedure to collect instances for this task at little cost, by mining images and annotations from the Visual Genome. We build a large-scale dataset of over 300,000 instances, over which we define multiple training and evaluation splits that require controlled amounts of generalization.
We evaluate a range of popular deep learning architectures on the benchmark. We identify elements that prove beneficial (e.g. relational reasoning and mid-level supervision), and we also show that all tested models struggle significantly when strong generalization is required.
The dataset is publicly available on demand to encourage the development of models with improved capabilities for abstract reasoning over visual data.
2 Related work
Evaluation of abstract visual reasoning
Evaluating reasoning has a long history in the field of AI, but is typically based on pre-defined or easily identifiable symbols. Recent works include the task set of Fleuret et al., which focuses on the spatial arrangement of abstract elements in synthetic images. Their setting is reminiscent of the Bongard problems, originally presented by Bongard and further popularized by Hofstadter. Stabinger et al. tested whether state-of-the-art CNN architectures can compare visual properties of multiple abstract objects, e.g. to determine whether two shapes are of the same size. Although this involves high-level reasoning, it operates over coarse characteristics of line drawings.
V-PROM is inspired by Raven's Progressive Matrices (RPMs), a classic psychological test of a human's ability to interpret synthetic images. RPMs have been used previously to evaluate the reasoning abilities of neural networks [6, 22, 42]. In one such work, the authors propose a CNN model to solve problems involving geometric operations such as rotations and reflections. Barrett et al. evaluated existing deep learning models on a large-scale dataset of RPMs, with a procedure similar to one previously proposed by Wang et al. The benchmark of Barrett et al. is the most similar to our work. It uses synthetic images of simple 2D shapes, whereas ours uses much more complex images, at the cost of less precise control over the visual stimuli. Recognizing the complementarity of the two settings, we purposefully model our evaluation setup after theirs, such that future methods can be evaluated and compared across the two settings. Since their synthetic images do not reflect the complexity of real-world data, progress on such benchmarks may not readily translate to high-level vision tasks. Our work bridges the gap between these two extremes (Fig. 3).
Evaluation of high-level tasks in computer vision
The interest in high-level tasks is growing, as exemplified by the advent of VQA , referring expressions , and visual navigation , to name a few. Unbiased evaluations are notoriously difficult, and there is a growing trend toward evaluation on out-of-distribution data, i.e. where the test set is drawn from a different distribution than the training set [1, 4, 36, 40]. In this spirit, our benchmark includes multiple training/test splits drawn from different distributions to evaluate generalization under controlled conditions. Moreover, our task focuses on abstract relationships applied to visual (i.e. mostly non-semantic) properties, with the aim of minimizing the possibility of solving the task by exploiting non-visual factors.
Models for abstract reasoning with neural networks
Dedicated architectures have been proposed to improve the reasoning capabilities of neural networks, including neural Turing machines. Recent works on meta-learning [14, 41] address the same fundamental problem by focusing on generalization from few examples (i.e. few-shot learning), and they have shown better generalization, including in VQA. Barrett et al. applied relation networks (RNs) with success to their dataset of RPMs. We evaluate RNs on our benchmark with equally encouraging results, although there remains large room for improvement, in particular when strong generalization is required.
3 A new task to evaluate visual reasoning
Each instance of our task is a 3×3 matrix of images, where the missing final image must be identified from among 8 candidates. The goal is to select an image such that all 3 rows represent the same relationship over some visual property (attribute, object category, or object count) of their 3 respective images. The definition of our task was guided by the following principles. First, it must require, but not be limited to, strong visual recognition ability. Second, it should measure a common set of capabilities required in high-level computer vision tasks. Third, it must be practical to construct a large-scale benchmark for this task, enabling an automatic and unambiguous evaluation. Finally, the task must not be solvable through task-specific heuristics or by relying on superficial statistics of the training examples. This points at a task that is compositional in nature and inherently requires strong generalization.
Our task can be seen as an extension to real images of recent benchmarks for reasoning on synthetic data [6, 22]. These works sacrifice visual realism for precise control over the contents of images, which are limited to simple geometrical shapes. It is unclear whether reasoning under these conditions can transfer to realistic vision tasks. Our design is also intended to limit the extent to which semantic cues might be used as “shortcuts” to avoid solving the task using the appropriate relationships. For example, a test to recognize the relation above could rely on the higher likelihood of car above ground than ground above car, rather than on its actual spatial meaning. Therefore, our task focuses on fundamental visual properties, and on relationships such as logical and counting operations over multiple images (co-occurrence within the same photograph being likely biased).
The task requires identifying a plausible explanation for the provided triplets of images, i.e. a relation that could have generated them. The incomplete triplet serves as a “visual question”, and the explanation must be applied generatively to identify the missing image. It is unavoidable that more than one of the answer candidates constitutes a plausible completion. Indeed, a sufficiently contrived explanation can justify any possible choice. The model has to identify the explanation with the strongest justification, which in practice tends to be the simplest one in the sense of Occam's razor. This is expected to be learned by the model from training examples.
3.1 Construction of the V-PROM dataset
We describe how to construct a large-scale dataset for our task semi-automatically. We call it V-PROM, for Visual PROgressive Matrices.
|Nb. visual elements|84|38|346|10|
|Nb. task instances|45,000|45,000|45,000|100,000*|

*We generate more task instances with object counts than with attributes and categories because counts are the only visual elements involved in the progression relationship, in addition to the three others (and, or, union).
Generating descriptions of task instances
Each instance is a matrix of images that we call a visual reasoning matrix (VRM). Each image in the VRM depicts a visual element $e \in \mathcal{A} \cup \mathcal{O} \cup \mathcal{C}$, where $\mathcal{A}$, $\mathcal{O}$, $\mathcal{C}$ respectively denote the sets of possible attributes, objects, and object counts. We denote with $t(e)$ the type of visual element that $e$ corresponds to. We also denote with $e_{i,j}$ the visual element of the $j$-th image of the $i$-th row in a VRM. Each VRM represents one specific type of visual element, and one specific type of relationship $r$. We define them as follows.

And: $e_{i,3} = e_{i,1} = e_{i,2}$. The last image of each row has the same visual element as the other two.

Or: $e_{i,3} = e_{i,1}$ or $e_{i,3} = e_{i,2}$. The last image in each row has the same visual element as the first or the second.

Union: $\{e_{i,1}, e_{i,2}, e_{i,3}\}$ is identical for all rows $i$. All rows contain the same three visual elements, possibly in different orders.

Progression: $e_{i,j} \in \mathcal{C}$ and $e_{i,2} - e_{i,1} = e_{i,3} - e_{i,2}$. The numbers of objects in a row follow an arithmetic progression.
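The four relationships above can be sketched as simple predicates. This is an illustrative sketch, not the paper's code: the helper names are hypothetical, a row is given as its three visual elements, and counts are plain integers.

```python
def holds_and(row):
    # "And": all three images of the row depict the same visual element.
    return row[0] == row[1] == row[2]

def holds_or(row):
    # "Or": the last image matches the first or the second.
    return row[2] == row[0] or row[2] == row[1]

def holds_union(rows):
    # "Union": every row contains the same set of three elements,
    # possibly in different orders.
    reference = sorted(rows[0])
    return all(sorted(r) == reference for r in rows)

def holds_progression(row):
    # "Progression": the three object counts form an arithmetic progression.
    return row[1] - row[0] == row[2] - row[1]
```

A valid VRM is one where a single predicate holds for all three rows; the correct answer is the candidate that preserves this property.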
We randomly sample a visual element $e$ and a relationship $r$ to generate the definition of a VRM. Seven additional incorrect answer candidates are obtained by sampling seven different visual elements of the same type as $e$. The following section describes how to obtain images that fulfill the definition of a VRM by mining annotations from the Visual Genome (VG).
Mining images from the Visual Genome
To select suitable images, we impose five desired principles: richness, purity, image quality, visual relatedness, and independence. Richness requires diversity of the visual elements, and of the images representing each visual element. Purity constrains the complexity of the image, as we want images that depict the visual element of interest fairly clearly. Visual relatedness guides us toward properties that have a clear visual depiction. As a counterexample, the attribute open appears very differently when a door is open and when a bottle is open; such semantic attributes are not desirable for our task. Finally, independence excludes objects that frequently co-occur with other objects (e.g. “sky”, “road”, “water”, etc.) and could lead to ambiguous VRMs.
We obtain images that fulfill the above principles using VG's region-level annotations of categories, attributes, and natural language descriptions (Table 1). We first preselect categories and attributes with large numbers of instances, to guarantee sufficient representation of each in our dataset. We manually exclude unsuitable labels such as semantic attributes, and objects likely to cause ambiguity. We crop the annotated regions to obtain pure images, and discard those smaller than 100 px in either dimension. The annotations of object counts are extracted from the numbers 1–10 appearing in natural language descriptions (e.g. “five bowls of oatmeal”), manually excluding those unrelated to counts (e.g. “five o'clock” or “a 10 years old boy”).
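The count-mining step above can be sketched as follows. This is a hedged illustration, not the actual pipeline: the function name is hypothetical, and it deliberately over-matches to show why the manual exclusion step is needed.

```python
# Map number words 1-10 to integers; bare digits 1-10 are also accepted.
WORD_TO_COUNT = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                 "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def extract_count(description):
    """Return the first count 1-10 mentioned in a region description, else None."""
    for token in description.lower().replace(",", " ").split():
        if token in WORD_TO_COUNT:
            return WORD_TO_COUNT[token]
        if token.isdigit() and 1 <= int(token) <= 10:
            return int(token)
    return None  # no count found
```

Note that a naive matcher like this also fires on phrases such as “a 10 years old boy”, which is exactly the kind of false positive the paper's manual filtering removes.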
3.2 Data splits to measure generalization
In order to evaluate a method’s capabilities for generalization, we define several training/evaluation splits that require different levels of generalization. Training and evaluating a method in each of these settings will provide an overall picture of its capabilities beyond the basic fitting of training examples. To define these different settings, we follow the nomenclature proposed by Barrett et al. .
Neutral – The training and test sets are both sampled from the whole set of relationships and visual elements. Training to testing ratio is .
Interpolation / extrapolation – These two splits evaluate generalization for counting. In the interpolation split, odd counts (1, 3, 5, 7, 9) are used for training and even counts (2, 4, 6, 8, 10) for testing. In the extrapolation split, the first five counts (1–5) are used for training and the remaining five (6–10) for testing.
Held-out attributes – The object attributes are divided into 7 super-attributes (the attributes within each super-attribute are mutually exclusive): color, material, scene, plant condition, action, shape, texture. The human attributes are divided into 6 super-attributes: age, hair style, clothing style, gender, action, clothing color. The super-attributes shape, texture, and action are held out for testing only.
Held-out objects – A subset of object categories is held out for testing only.
Held-out pairs of relationships/attributes – A subset of relationship/super-attribute combinations are held-out for testing only. Three combinations are held-out for both object attributes and human attributes. The held-out super-attributes vary with each type of relationship.
Held-out pairs of relationships/objects – For each type of relationship, a subset of the objects is held out. The held-out objects are different for each relationship.
We report a model's performance as its accuracy, i.e. the fraction of test instances for which the predicted answer (among the eight candidates) is correct. Random guessing gives an accuracy of 12.5%.
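The interpolation and extrapolation splits over counts can be sketched directly:

```python
# The two counting splits: each assigns the counts 1-10 to disjoint
# train/test pools, forcing generalization to unseen count values.
counts = list(range(1, 11))

interpolation = {"train": [c for c in counts if c % 2 == 1],   # odd counts
                 "test":  [c for c in counts if c % 2 == 0]}   # even counts

extrapolation = {"train": counts[:5],   # counts 1-5
                 "test":  counts[5:]}   # counts 6-10
```

Extrapolation is the harder setting: test counts lie entirely outside the training range, rather than interleaved with it.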
3.3 Task complexity and human evaluation
Solving an instance of our task requires recognizing the visual elements depicted in all images, and identifying the relation that applies to the triplets of images. This amounts to inferring the abstract description (Section 3.1) of the instance. Our dataset contains 4 types of relations applied over 478 types of visual elements (Table 1), giving in the order of 2,000 different combinations.
We performed a human study to assess the difficulty of our benchmark. We presented human subjects with a random selection of task instances, sampled evenly across the four types of relations. Subjects could skip an instance if they found it too difficult or ambiguous. The accuracy was  with a skip rate of . This accuracy is not an upper bound for the task, however. The two main reasons for imperfect human performance are (1) counting errors, with many objects, cluttered backgrounds, or scale variations, and (2) a tendency to use prior knowledge and favor higher-level (semantic) concepts/attributes than those used to generate the dataset.
4 Models and experimental setup
We evaluated a range of models on our benchmark. These models are based on popular deep learning architectures that have proven successful on various task-specific benchmarks. The models are summarized in Fig. 4.
4.1 Input data
For each instance of our task, the input data consists of 8 context panels and 8 candidate answers. These 16 RGB images are passed through a pretrained CNN to extract visual features. Our experiments compare features from a ResNet101 and from a bottom-up attention network, which is popular for image captioning and VQA. (The bottom-up attention network of Anderson et al. was pretrained with annotations from the Visual Genome. Our dataset only uses cropped images from VG, and we use layer activations rather than explicit class predictions, but the possible overlap between the label space used to pretrain the network and the one used to generate our benchmark must be kept in mind.) The feature maps from either of these CNNs are average-pooled, and the resulting vector is L2-normalized. The vector of each of the 16 images is concatenated with a one-hot representation of an index: the 8 context panels are assigned indices 1–8 and the candidate answers 9–16. The resulting vectors are referred to as $x_1, \dots, x_{16}$.
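The input encoding described above can be sketched in a few lines. This is a minimal illustration with random data; the shapes are illustrative, not the paper's exact dimensions.

```python
import numpy as np

def encode_panels(feature_maps):
    # feature_maps: (16, R, D) -- R spatial regions of D-dim CNN features
    # per image (8 context panels followed by 8 candidate answers).
    n, _, d = feature_maps.shape
    pooled = feature_maps.mean(axis=1)                        # average pooling
    pooled /= np.linalg.norm(pooled, axis=1, keepdims=True)   # L2 normalization
    one_hot = np.eye(n)                                       # panel index 1..16
    return np.concatenate([pooled, one_hot], axis=1)          # (16, D + 16)

x = encode_panels(np.random.rand(16, 49, 64))
```

The one-hot index lets downstream models distinguish the role of each panel, which, as reported in Section 5, matters greatly for the relation network.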
These vectors serve as input to the models described below, which are trained with supervision to predict a score $s_i$ for each of the 8 candidate answers, i.e. $i \in \{9, \dots, 16\}$. Each model is trained with a softmax cross-entropy loss over the 8 scores, with standard backpropagation and SGD, using AdaDelta as the optimizer. Suitable hyperparameters for each model were coarsely selected by grid search (details in the supplementary material). We held out 8,000 instances from the training set to serve as a validation set, to select the hyperparameters and to monitor for convergence and early stopping. Unless noted, the non-linear transformations within the networks below refer to a linear layer followed by a ReLU.
4.2 Multilayer perceptron

Our simplest model is a multilayer perceptron (MLP, see Fig. 4). The features of every image are first passed through a non-linear transformation $f$. The model is applied so as to share the parameters used to score each candidate answer: the features of each candidate answer ($x_i$ for $i = 9, \dots, 16$) are concatenated with those of the context panels ($x_1, \dots, x_8$), passed through another non-linear transformation $g$, and through a final linear transformation $h$ that produces a scalar score for each candidate answer. That is, $\forall i \in \{9, \dots, 16\}$:

$s_i = h\big(g([f(x_1); \dots; f(x_8); f(x_i)])\big)$

where the semicolon represents the concatenation of vectors. A variant of this model replaces the concatenation with a sum-pooling over the nine panels. This reduces the number of parameters by sharing the weights within $g$ across the panels, giving

$s_i = h\big(g(f(x_1) + \dots + f(x_8) + f(x_i))\big)$

We refer to these two models as MLP-cat-$k$ and MLP-sum-$k$, in which $f$ and $g$ are implemented with linear layers, each followed by a ReLU, and $k$ denotes the depth.
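The two MLP scoring schemes can be sketched as follows. Random weights stand in for the learned transformations $f$, $g$, $h$, and the sizes are illustrative; this is a forward-pass sketch, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 32, 64  # illustrative feature and hidden sizes
relu = lambda a: np.maximum(a, 0.0)
Wf = rng.standard_normal((D, H))          # f, shared across all 16 panels
Wg_cat = rng.standard_normal((9 * H, H))  # g for the concatenation variant
Wg_sum = rng.standard_normal((H, H))      # g for the sum-pooling variant
wh = rng.standard_normal(H)               # final linear scoring layer h

def mlp_scores(x, variant="cat"):
    # x: (16, D); panels 0-7 are context, 8-15 are candidate answers.
    f = relu(x @ Wf)
    scores = []
    for i in range(8, 16):                        # parameters shared per candidate
        panels = np.vstack([f[:8], f[i:i + 1]])   # 8 context + 1 candidate
        if variant == "cat":
            s = relu(panels.reshape(-1) @ Wg_cat) @ wh  # concatenate 9 panels
        else:
            s = relu(panels.sum(axis=0) @ Wg_sum) @ wh  # sum-pool 9 panels
        scores.append(s)
    return np.array(scores)  # one scalar score per candidate

scores_cat = mlp_scores(rng.standard_normal((16, D)), "cat")
```

The sum-pooling variant's $g$ has a 9× smaller input dimension, which is the parameter saving mentioned above.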
4.3 Recurrent neural networks

We consider two variants of a recurrent neural network, implemented with a gated recurrent unit (GRU). The first, naive version takes each of the feature vectors $x_1$ to $x_{16}$ over 16 time steps. The final hidden state of the GRU is then passed through a linear transformation that maps it to a vector of 8 scores $[s_9, \dots, s_{16}]$.
The second version shares the parameters of the model over the 8 candidate answers. The GRU takes, in parallel, 8 sequences, each consisting of the 8 context panels followed by one of the 8 candidate answers. The final state of each GRU is then mapped to a single score for the corresponding candidate answer. That is, $\forall i \in \{9, \dots, 16\}$:

$s_i = h\big(\mathrm{GRU}(x_1, \dots, x_8, x_i)\big)$
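The parameter-shared variant can be sketched with a minimal numpy GRU cell. The weights are random stand-ins and the sizes illustrative; the point is the weight sharing: the same cell reads the 8 context panels followed by each candidate in turn.

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 32, 48  # illustrative input and hidden sizes
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
Wz, Wr, Wn = (0.1 * rng.standard_normal((D + H, H)) for _ in range(3))
wh = rng.standard_normal(H)  # maps the final state to a scalar score

def gru_step(h, x):
    hx = np.concatenate([h, x])
    z = sigmoid(hx @ Wz)                          # update gate
    r = sigmoid(hx @ Wr)                          # reset gate
    n = np.tanh(np.concatenate([r * h, x]) @ Wn)  # candidate state
    return (1 - z) * h + z * n

def gru_shared_scores(x):
    # x: (16, D); one 9-step sequence per candidate, same GRU parameters.
    scores = []
    for i in range(8, 16):
        h = np.zeros(H)
        for panel in list(x[:8]) + [x[i]]:        # context, then one candidate
            h = gru_step(h, panel)
        scores.append(h @ wh)                     # final state -> scalar score
    return np.array(scores)

scores = gru_shared_scores(rng.standard_normal((16, D)))
```

Because each sequence contains only one candidate, the model never compares candidates against each other; scores are made comparable only through the shared parameters and the softmax loss.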
4.4 VQA-like architecture
We consider an architecture that mimics a state-of-the-art model in VQA based on a “joint embedding” approach [45, 38]. In our case, the context panels serve as both the input “image” and the “question”. They are passed through non-linear transformations, then combined with an elementwise product into a joint embedding $z$. The score for each answer is obtained as the dot product between $z$ and the embedding of each candidate answer (see Fig. 4). Formally, we have, $\forall i \in \{9, \dots, 16\}$:

$z = f_1([x_1; \dots; x_8]) \circ f_2([x_1; \dots; x_8]), \qquad s_i = f_3(x_i)^\top z$

where $f_1$, $f_2$ and $f_3$ are non-linear transformations, and $\circ$ represents the Hadamard product.
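The joint-embedding scoring can be sketched as below. This is an assumption-laden illustration: random weights stand in for $f_1$, $f_2$, $f_3$, the sizes are invented, and we assume both branches of the joint embedding read the concatenated context panels.

```python
import numpy as np

rng = np.random.default_rng(2)
D, H = 32, 64  # illustrative sizes
relu = lambda a: np.maximum(a, 0.0)
W1 = rng.standard_normal((8 * D, H))  # f1, over the concatenated context
W2 = rng.standard_normal((8 * D, H))  # f2, over the concatenated context
W3 = rng.standard_normal((D, H))      # f3, embeds each candidate answer

def vqa_like_scores(x):
    # x: (16, D); panels 0-7 are context, 8-15 are candidate answers.
    ctx = x[:8].reshape(-1)
    z = relu(ctx @ W1) * relu(ctx @ W2)   # Hadamard product -> joint embedding
    return np.array([relu(x[i] @ W3) @ z  # dot product with each candidate
                     for i in range(8, 16)])

scores = vqa_like_scores(rng.standard_normal((16, D)))
```

The dot-product scoring makes the candidate branch cheap: each candidate only needs one embedding, reused against the fixed joint embedding $z$.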
4.5 Relation networks
We finally evaluate a relation network (RN). RNs were specifically proposed to model relationships between visual elements, such as in VQA when questions refer to multiple parts of the image. Our model is applied, again, such that its parameters are shared across answer candidates. The basic idea of an RN is to consider all pairwise combinations of input elements (the 8 context panels and one candidate answer, i.e. 9 panels, in our case), pass them through a non-linear transformation, sum-pool over these representations, then pass the pooled representation through another non-linear transformation. Formally, we have, $\forall i \in \{9, \dots, 16\}$:

$s_i = h\Big(g\Big(\sum_{j,k \in \{1,\dots,8,i\}} f([x_j; x_k])\Big)\Big)$

where $f$ and $g$ are non-linear transformations, and $h$ a linear transformation.
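The RN scoring can be sketched as follows, with random weights standing in for the learned $f$, $g$, $h$ and illustrative sizes:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
D, H = 16, 32  # illustrative sizes
relu = lambda a: np.maximum(a, 0.0)
Wf = rng.standard_normal((2 * D, H))  # f, applied to each pair of panels
Wg = rng.standard_normal((H, H))      # g, applied to the pooled representation
wh = rng.standard_normal(H)           # final linear layer h

def rn_scores(x):
    # x: (16, D); for each candidate, transform all ordered pairs among the
    # 9 panels (8 context + that candidate), sum-pool, then score.
    scores = []
    for i in range(8, 16):
        panels = np.vstack([x[:8], x[i:i + 1]])
        pooled = sum(relu(np.concatenate([panels[j], panels[k]]) @ Wf)
                     for j, k in product(range(9), repeat=2))
        scores.append(relu(pooled @ Wg) @ wh)
    return np.array(scores)

scores = rn_scores(rng.standard_normal((16, D)))
```

Each candidate incurs 9² pair evaluations, which is the quadratic cost noted in Section 5; parameters stay constant since $f$ is shared across pairs.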
4.6 Auxiliary objective
We experimented with an auxiliary objective that encourages the network to predict the type of the relationship involved in the given matrix. This objective is trained with a softmax cross-entropy loss against the ground-truth type of relationship of the training example. This value is an index among the seven possibilities, i.e. and, or, progression, attribute, object, union, and counting (see Section 3). The prediction is made from a linear projection of the final activations of the network in Eq. 8, that is, $a = h'(r)$, where $r$ denotes those final activations and $h'$ is an additional learned linear transformation. At test time, this prediction is not used, and the auxiliary objective serves only to provide an inductive bias during training, such that the network's internal representation captures the type of relationship (which should then help the model generalize). Note that we also experimented with an auxiliary objective for predicting labels such as object classes and visual attributes, but this did not prove beneficial.
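The auxiliary objective can be sketched as below. The weight names and sizes are illustrative stand-ins for the extra projection $h'$ on top of the network's final activations.

```python
import numpy as np

rng = np.random.default_rng(4)
H, K = 64, 7  # hidden size, number of relationship/element types
W_aux = rng.standard_normal((H, K))  # the additional linear projection h'

def aux_loss(final_activations, true_type):
    # Softmax cross-entropy between the projected logits and the
    # ground-truth type index (0..6).
    logits = final_activations @ W_aux
    shifted = logits - logits.max()                  # numerically stable
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[true_type]

loss = aux_loss(rng.standard_normal(H), true_type=2)
```

During training this term would be added to the main answer-scoring loss; per Section 5, an equal weighting of the two losses was used.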
5 Experiments

We conducted numerous experiments to establish reference baselines and to shed light on the capabilities of popular architectures. As a sanity check, we trained our best model with randomly-shuffled context panels. This verified that the task cannot be solved by exploiting superficial regularities of the data: all models trained in this way perform around the “chance” level of 12.5%.
|RN with shuffled inputs|12.5|12.5|12.5|12.5|
|Relational network (RN)|51.2|55.8|55.4|61.3|
5.1 Neutral training/test splits
We first examine all models on the neutral training/test splits (Fig. 5 and Table 2). In this setting, training and test data are drawn from the same distribution, and supervised models are expected to perform well, given sufficient capacity and training examples. We observe that a simple MLP can indeed fit the data relatively well if it has enough layers, but a network with only 2 non-linear layers performs quite badly. The two models based on a GRU have very different performance. The GRU-shared model performs best. It shares its parameters over the candidate answers (processed in parallel rather than across the recurrent steps). This result was not obviously predictable, since this model does not get to consider all candidate answers in relation with each other. The alternate model (GRU) receives every candidate answer in succession. It could therefore perform additional reasoning steps over the candidates, but this does not seem to happen in practice. The VQA-like model obtains a performance comparable to a deep MLP, but it proved more difficult to train than an MLP. In some of our experiments, the optimization of this model was slow or simply failed to converge. We found it best to use, as non-linear transformations, “gated tanh” layers as in prior work. Overall, we obtained the best performance with a relation network (RN) model. While this is basically an MLP on top of pairwise combinations of features, these combinations prove much more informative than the individual features. We experimented with an RN without the one-hot representations of panel IDs concatenated with the input (“RN without panel IDs”), and this version performed very poorly. It is worth noting that RNs come at the cost of processing $n^2$ feature vectors rather than $n$ (with $n = 9$ in our case). The number of parameters is the same, since they are shared across the combinations, but the computation time increases.
We break down performance along two axes in Fig. 5. The following two groups of question types are mutually exclusive: and/or/progression/union, and attribute/object/counting. The former reflects the type of relationship across the nine images of a test instance, while the latter corresponds to the type of visual property to which the relationship applies. We observe that some types are much more easily solved than others. Instances involving object identity are easier than those involving attributes and counts, presumably because the image features are obtained with a CNN pretrained for object classification. The bottom-up image features perform remarkably well, most likely because the set of labels used for pretraining was richer than the ImageNet labels used to train the ResNet. The instances that require counting are particularly difficult; this corroborates the struggle of vision systems with counting, already reported in multiple existing works.
5.2 Splits requiring generalization
We now look at the performance with respect to splits that specifically require generalization (Fig. 6). As expected, accuracy drops significantly as the need for generalization increases. This confirms our hypothesis that naive end-to-end training cannot guarantee generalization beyond training examples, and that this is easily masked when the test and training data come from the same distribution (as in the neutral split). This drop is particularly visible with the simple MLP and GRU models. The RN model suffers a smaller drop in performance in some of the generalization settings. This indicates that learning over combinations of features provides a useful inductive bias for our task.
Image features from bottom-up attention
We tested all models with features from a ResNet, as well as features from the “bottom-up attention” model of Anderson et al. These improve the performance of all tested models over ResNet features, in the neutral and all generalization splits. The bottom-up attention model is pretrained with a richer set of annotations than the ImageNet labels used to pretrain the ResNet. This likely provides features that better capture fine visual properties of the input images. Note that the visual features used by our models do not contain explicit predictions of such labels and visual properties, as they are vectors of continuous values. We experimented with alternative schemes (not reported in the plots), including an auxiliary loss within our models for predicting visual attributes, but these did not prove helpful.
Auxiliary prediction of relationship type
We successfully experimented with an auxiliary loss on the prediction of the type of relationship in the given instance, provided during training as a label among seven possibilities. All models trained with this additional loss gained accuracy in the neutral and most generalization settings. The relative weighting of the main and auxiliary losses did not appear critical, and all reported experiments use an equal weight on both.
Overall, the performance of our best models remains well below human performance, leaving substantial room for improvement. This dataset should be a valuable tool to evaluate future approaches to visual reasoning.
6 Conclusions

We have introduced a new benchmark to measure a method's ability to carry out abstract reasoning over complex visual data. The task addresses a central issue in deep learning, namely the degree to which methods learn to reason over their inputs. This issue is critical because reasoning can generalise to new classes of data, whereas memorising incidental relationships between signal and label does not. It lies at the core of many current challenges in deep learning, including zero-shot learning, domain adaptation, and generalisation more broadly.
Our benchmark serves to evaluate capabilities similar to some of those required in high-level tasks in computer vision, without task-specific confounding factors such as natural language or dataset biases. Moreover, the benchmark includes multiple evaluation settings that demand controllable levels of generalization. Our experiments with popular deep learning models demonstrate that they struggle when strong generalization is required, in particular for applying known relationships to combinations of visual properties not seen during training. We identified a number of promising directions for future research, and we hope that this setting will encourage the development of models with improved capabilities for abstract reasoning over visual data.
-  A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
-  P. Anderson, B. Fernando, M. Johnson, and S. Gould. Guided open vocabulary image captioning with constrained beam search. arXiv preprint arXiv:1612.00576, 2016.
-  P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and VQA. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
-  P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In Proc. IEEE Int. Conf. Comp. Vis., 2015.
-  D. Barrett, F. Hill, A. Santoro, A. Morcos, and T. Lillicrap. Measuring abstract reasoning in neural networks. In Proc. Int. Conf. Mach. Learn., 2018.
-  M. M. Bongard. Pattern recognition. Spartan Books, 1970.
-  T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. In NIPS Workshop, 2015.
-  X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
-  K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. Conf. Empirical Methods in Natural Language Processing, 2014.
-  A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra. Visual Dialog. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
-  J. Devlin, S. Gupta, R. B. Girshick, M. Mitchell, and C. L. Zitnick. Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467, 2015.
-  Y. Duan, M. Andrychowicz, B. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-shot imitation learning. In Proc. Advances in Neural Inf. Process. Syst., 2017.
-  C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
-  C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine. One-shot visual imitation learning via meta-learning. In Conference on Robot Learning (CoRL), pages 357–368, 2017.
-  F. Fleuret, T. Li, C. Dubout, E. K. Wampler, S. Yantis, and D. Geman. Comparing machines and humans on a visual categorization test. Proceedings of the National Academy of Sciences, 2011.
-  J. Fu, K. Luo, and S. Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
-  A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
-  E. Groshev, A. Tamar, M. Goldstein, S. Srivastava, and P. Abbeel. Learning generalized reactive policies using deep neural networks. In 2018 AAAI Spring Symposium Series, 2018.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
-  D. R. Hofstadter. Gödel, Escher, Bach: An Eternal Golden Braid. Basic Books, Inc., 1979.
-  D. Hoshen and M. Werman. IQ of neural networks. arXiv preprint arXiv:1710.01692, 2017.
-  J. Jo and Y. Bengio. Measuring the tendency of cnns to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.
-  J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. B. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. arXiv preprint arXiv:1612.06890, 2016.
-  K. Kafle and C. Kanan. An analysis of visual question answering algorithms. In Proc. IEEE Int. Conf. Comp. Vis., 2017.
-  R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.
-  J. C. Raven. Raven’s progressive matrices (1938): sets A, B, C, D, E. Australian Council for Educational Research, Melbourne, 1938.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vision, 2015.
-  A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Proc. Advances in Neural Inf. Process. Syst., 2017.
-  S. Stabinger, A. Rodríguez-Sánchez, and J. Piater. 25 years of CNNs: Can we compare to human abstraction capabilities? In A. E. Villa, P. Masulli, and A. J. Pons Rivero, editors, Artificial Neural Networks and Machine Learning, 2016.
-  A. Suhr, M. Lewis, J. Yeh, and Y. Artzi. A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 217–223, 2017.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
-  D. Teney, P. Anderson, X. He, and A. van den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
-  D. Teney, L. Liu, and A. van den Hengel. Graph-structured representations for visual question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
-  D. Teney and A. van den Hengel. Zero-shot visual question answering. arXiv preprint arXiv:1611.05546, 2016.
-  D. Teney and A. van den Hengel. Visual question answering as a meta learning task. In Proc. Eur. Conf. Comp. Vis., 2018.
-  D. Teney, Q. Wu, and A. van den Hengel. Visual question answering: A tutorial. IEEE Signal Processing Magazine, 34:63–75, 2017.
-  J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30. IEEE, 2017.
-  K. Tran, X. He, L. Zhang, J. Sun, C. Carapcea, C. Thrasher, C. Buehler, and C. Sienkiewicz. Rich image captioning in the wild. arXiv preprint arXiv:1603.09016, 2016.
-  O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In Proc. Advances in Neural Inf. Process. Syst., 2016.
-  K. Wang and Z. Su. Automatic generation of Raven’s progressive matrices. In Proc. Int. Joint Conf. Artificial Intell., 2015.
-  J. Weston, A. Bordes, S. Chopra, and T. Mikolov. Towards ai-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.
-  J. Weston, S. Chopra, and A. Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
-  Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding, 2017.
-  L. Yu, H. Tan, M. Bansal, and T. L. Berg. A joint speaker-listener-reinforcer model for referring expressions. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
-  M. D. Zeiler. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Appendix A Dataset details
Table 3 shows the number of training/test instances in the different data splits.
Appendix B Implementation details
The image features were obtained with the ResNet-101 CNN implemented in MXNet and pretrained on ImageNet, and with the Bottom-Up Attention network of Anderson et al., which uses an R-CNN framework itself based on a ResNet-101. We resize each image of our dataset so that its shorter side measures 256 pixels and preprocess it with color normalization. We then crop the central patch from the resulting image and feed it to the network. Feature maps from the last convolutional layer are pre-extracted in this way for every image, then pooled (averaged) over image locations (for the ResNet) or over region proposals (for the Bottom-Up Attention network). The resulting 2048-dimensional vector is normalized to unit length. This normalization of the image features is crucial for reasonable performance, as previously reported for other tasks such as VQA.
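The pooling and normalization step above can be sketched as follows; this is a minimal numpy illustration (not the authors' code), assuming a feature map of shape (2048, H, W) from the last convolutional layer:

```python
import numpy as np

def pool_and_normalize(feature_map):
    """Average-pool a CNN feature map over spatial locations,
    then L2-normalize the pooled vector to unit length."""
    # (C, H, W) -> (C, H*W) -> mean over locations -> (C,)
    pooled = feature_map.reshape(feature_map.shape[0], -1).mean(axis=1)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled

# Example: a random array stands in for real ResNet-101 features.
features = pool_and_normalize(np.random.rand(2048, 8, 8))
```

The same function applies to region-proposal features by treating proposals as the "locations" axis.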
All models are trained with a batch size of 128 and hidden layers of size 128. These values were selected by grid search and performed consistently well across models. All models use the one-hot labels of the input panels, except the model referred to as “RN without panel IDs”. All models are optimized with AdaDelta.
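The one-hot panel labels mentioned above amount to tagging each panel's feature vector with its position in the matrix. A minimal sketch, assuming 2048-dimensional features and one panel per matrix position:

```python
import numpy as np

def add_panel_ids(panel_features):
    """Append a one-hot panel-ID vector to each panel's features,
    so the model knows which matrix position each panel occupies.

    panel_features: array of shape (num_panels, feature_dim).
    """
    num_panels = panel_features.shape[0]
    one_hot = np.eye(num_panels)  # (num_panels, num_panels) identity
    return np.concatenate([panel_features, one_hot], axis=1)

# Example: 8 context panels with 2048-d features -> 2056-d tagged vectors.
tagged = add_panel_ids(np.random.rand(8, 2048))
```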
Models using an auxiliary loss weight the two losses equally; we experimented with different relative weights, which did not significantly affect the results in either direction. All non-linear layers are implemented as affine weights followed by a ReLU, except in the VQA-like model, which proved easier to optimize with “gated tanh” layers, as in the VQA model of Teney et al.
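A gated tanh layer computes a tanh activation modulated elementwise by a sigmoid gate. The sketch below is a simplified numpy version (bias terms omitted), not the authors' implementation:

```python
import numpy as np

def gated_tanh(x, W, W_gate):
    """Gated tanh layer: y = tanh(x W) * sigmoid(x W_gate).
    The sigmoid gate, in (0, 1), scales each tanh output."""
    y = np.tanh(x @ W)
    g = 1.0 / (1.0 + np.exp(-(x @ W_gate)))  # elementwise sigmoid gate
    return y * g

rng = np.random.default_rng(0)
out = gated_tanh(rng.standard_normal((4, 128)),   # batch of 4 inputs
                 rng.standard_normal((128, 128)),  # main weights
                 rng.standard_normal((128, 128)))  # gate weights
```

Since |tanh| < 1 and the gate lies in (0, 1), outputs are bounded in magnitude below 1, which is one reason such layers can be easier to optimize than ReLUs here.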
We also experimented with a VQA-like network that includes a top-down attention mechanism. It performed slightly worse than the simple model and is therefore not included in our results.
Appendix C Additional results
We performed additional experiments to compare the sample efficiency of the different models, training our best models on fractions of the training set. We included four of our best models, using simple ResNet features and no auxiliary loss. The results are presented in Fig. 7. Sample efficiency is fairly consistent across models. Performance grows almost linearly with the amount of training data, which indicates the explicit reliance of these models on the training examples (i.e. their weak ability to generalize). These results can also serve as supplementary baselines for investigating low-data regimes.
| Model | Overall | Accuracy per question type | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Lower bound: RN (last row) with shuffled inputs | 12.5 | 12.5 | 12.5 | 12.5 | 12.5 | 12.5 | 12.5 | 12.5 |
| MLP-sum-2, ResNet + aux. loss | 39.1 | 39.1 | 55.5 | 32.1 | 18.0 | 53.7 | 17.9 | 31.9 |
| MLP-sum-2, Bot.-Up + aux. loss | 40.4 | 65.5 | 40.4 | 56.5 | 32.8 | 24.4 | 33.6 | 23.4 |
| MLP-cat-2, ResNet + aux. loss | 39.2 | 39.2 | 51.8 | 36.7 | 14.9 | 51.0 | 14.4 | 31.4 |
| MLP-cat-2, Bot.-Up + aux. loss | 41.7 | 76.7 | 41.7 | 54.6 | 37.1 | 26.1 | 34.8 | 28.4 |
| MLP-sum-4, ResNet + aux. loss | 44.6 | 44.6 | 60.1 | 36.0 | 22.0 | 62.3 | 20.4 | 40.0 |
| MLP-sum-4, Bot.-Up + aux. loss | 46.2 | 70.9 | 46.2 | 61.5 | 36.0 | 32.9 | 42.0 | 36.5 |
| MLP-cat-4, ResNet + aux. loss | 43.0 | 43.0 | 52.1 | 43.4 | 13.1 | 59.0 | 14.1 | 36.4 |
| MLP-cat-4, Bot.-Up + aux. loss | 52.5 | 82.0 | 52.5 | 62.5 | 50.9 | 38.9 | 45.2 | 41.2 |
| MLP-sum-6, ResNet + aux. loss | 41.4 | 41.4 | 55.9 | 32.9 | 26.6 | 54.3 | 22.3 | 36.7 |
| MLP-sum-6, Bot.-Up + aux. loss | 47.7 | 73.6 | 47.7 | 62.1 | 35.1 | 37.4 | 46.1 | 45.1 |
| MLP-cat-6, ResNet + aux. loss | 44.5 | 44.5 | 53.5 | 43.9 | 13.1 | 61.9 | 13.2 | 39.2 |
| MLP-cat-6, Bot.-Up + aux. loss | 55.7 | 83.8 | 55.7 | 65.6 | 53.6 | 43.0 | 48.9 | 44.7 |
| GRU, ResNet + aux. loss | 31.0 | 31.0 | 39.8 | 28.5 | 13.0 | 29.2 | 13.3 | 26.5 |
| GRU, Bot.-Up + aux. loss | 50.6 | 81.6 | 50.6 | 63.1 | 49.2 | 26.0 | 41.9 | 27.9 |
| GRU-shared, ResNet + aux. loss | 48.2 | 48.2 | 60.5 | 45.5 | 22.8 | 65.8 | 19.9 | 41.3 |
| GRU-shared, Bot.-Up + aux. loss | 52.7 | 82.5 | 52.7 | 67.3 | 49.9 | 34.2 | 42.0 | 40.3 |
| VQA-like, ResNet + aux. loss | 39.7 | 39.7 | 57.1 | 36.2 | 21.9 | 55.0 | 20.6 | 27.7 |
| VQA-like, Bot.-Up + aux. loss | 41.0 | 62.4 | 41.0 | 57.8 | 35.8 | 24.6 | 31.0 | 25.0 |
| RN without panel IDs, ResNet | 32.6 | 32.6 | 35.8 | 27.3 | 14.0 | 56.2 | 13.7 | 36.6 |
| RN without panel IDs, ResNet + aux. loss | 35.0 | 35.0 | 39.5 | 27.1 | 16.7 | 57.9 | 15.3 | 40.2 |
| RN without panel IDs, Bot.-Up | 35.8 | 46.4 | 35.8 | 40.6 | 28.0 | 15.1 | 40.9 | 14.0 |
| RN without panel IDs, Bot.-Up + aux. loss | 38.0 | 47.5 | 38.0 | 43.6 | 28.7 | 18.3 | 44.0 | 15.4 |
| RN, ResNet + aux. loss | 55.8 | 55.8 | 69.8 | 44.3 | 34.2 | 78.2 | 28.3 | 55.4 |
| RN, Bot.-Up + aux. loss | 61.3 | 88.2 | 61.3 | 76.8 | 48.4 | 41.9 | 60.3 | 45.8 |
| Model | Overall | | | | | | |
|---|---|---|---|---|---|---|---|
| Lower bound: RN (last row) with shuffled inputs | 12.5 | 12.5 | 12.5 | 12.5 | 12.5 | 12.5 | 12.5 |
| MLP-sum-2, ResNet + aux. loss | 39.1 | 33.7 | 29.7 | 29.2 | 35.0 | 34.7 | 36.5 |
| MLP-sum-2, Bot.-Up + aux. loss | 40.4 | 31.8 | 29.0 | 29.5 | 33.4 | 34.7 | 38.1 |
| MLP-cat-2, ResNet + aux. loss | 39.2 | 30.8 | 29.8 | 27.1 | 34.7 | 35.0 | 35.4 |
| MLP-cat-2, Bot.-Up + aux. loss | 41.7 | 31.7 | 30.8 | 29.6 | 33.4 | 38.7 | 39.1 |
| MLP-sum-4, ResNet + aux. loss | 44.6 | 35.2 | 35.4 | 29.8 | 37.0 | 41.3 | 37.8 |
| MLP-sum-4, Bot.-Up + aux. loss | 46.2 | 36.1 | 33.0 | 31.5 | 36.0 | 38.7 | 41.3 |
| MLP-cat-4, ResNet + aux. loss | 43.0 | 38.1 | 37.1 | 31.9 | 39.7 | 40.5 | 39.4 |
| MLP-cat-4, Bot.-Up + aux. loss | 52.5 | 40.9 | 40.4 | 34.3 | 40.5 | 47.0 | 48.1 |
| MLP-sum-6, ResNet + aux. loss | 41.4 | 36.6 | 37.4 | 32.0 | 35.9 | 41.7 | 39.7 |
| MLP-sum-6, Bot.-Up + aux. loss | 47.7 | 37.8 | 35.3 | 33.6 | 36.8 | 41.1 | 42.6 |
| MLP-cat-6, ResNet + aux. loss | 44.5 | 38.4 | 40.2 | 33.6 | 38.9 | 43.6 | 40.5 |
| MLP-cat-6, Bot.-Up + aux. loss | 55.7 | 43.3 | 43.9 | 35.8 | 42.7 | 54.0 | 51.3 |
| GRU, ResNet + aux. loss | 31.0 | 36.5 | 33.7 | 25.9 | 12.6 | 24.4 | 23.5 |
| GRU, Bot.-Up + aux. loss | 50.6 | 38.4 | 37.5 | 32.2 | 33.4 | 40.1 | 48.0 |
| GRU-shared, ResNet + aux. loss | 48.2 | 38.1 | 37.3 | 35.2 | 40.0 | 42.2 | 44.8 |
| GRU-shared, Bot.-Up + aux. loss | 52.7 | 39.7 | 38.8 | 36.4 | 41.5 | 47.6 | 48.8 |
| VQA-like, ResNet + aux. loss | 39.7 | 30.2 | 29.7 | 31.5 | 35.6 | 36.4 | 37.8 |
| VQA-like, Bot.-Up + aux. loss | 41.0 | 30.6 | 30.9 | 31.0 | 35.1 | 36.2 | 38.1 |
| RN without panel IDs, ResNet | 32.6 | 27.7 | 28.1 | 30.4 | 32.0 | 30.9 | 30.6 |
| RN without panel IDs, ResNet + aux. loss | 35.0 | 29.7 | 27.3 | 31.2 | 32.6 | 32.9 | 31.8 |
| RN without panel IDs, Bot.-Up | 35.8 | 31.1 | 29.1 | 32.6 | 32.7 | 32.2 | 32.5 |
| RN without panel IDs, Bot.-Up + aux. loss | 38.0 | 31.7 | 31.6 | 31.7 | 34.2 | 34.9 | 33.2 |
| RN, ResNet + aux. loss | 55.8 | 42.4 | 42.3 | 40.9 | 47.3 | 52.8 | 51.0 |
| RN, Bot.-Up + aux. loss | 61.3 | 47.4 | 45.2 | 44.1 | 44.9 | 58.5 | 55.2 |
Appendix D Dataset examples
We provide in Fig. 11 a random selection of instances from our dataset.