V-PROM: A Benchmark for Visual Reasoning Using Visual Progressive Matrices

07/29/2019 ∙ by Damien Teney, et al.

One of the primary challenges faced by deep learning is the degree to which current methods exploit superficial statistics and dataset bias, rather than learning to generalise over the specific representations they have experienced. This is a critical concern because generalisation enables robust reasoning over unseen data, whereas leveraging superficial statistics is fragile to even small changes in data distribution. To illuminate the issue and drive progress towards a solution, we propose a test that explicitly evaluates abstract reasoning over visual data. We introduce a large-scale benchmark of visual questions that involve operations fundamental to many high-level vision tasks, such as comparisons of counts and logical operations on complex visual properties. The benchmark directly measures a method's ability to infer high-level relationships and to generalise them over image-based concepts. It includes multiple training/test splits that require controlled levels of generalization. We evaluate a range of deep learning architectures, and find that existing models, including those popular for vision-and-language tasks, are unable to solve seemingly-simple instances. Models using relational networks fare better but leave substantial room for improvement.




1 Introduction

Figure 1: We propose a new task to evaluate a model’s ability to perform abstract reasoning over complex visual stimuli. Each test instance is a matrix of images, within which each row contains 3 images that exemplify the same relationship (in this case they have the same shape). The task is to identify the correct candidate for the missing image from a set of candidates. The correct answer above is the third candidate that represents a heart-shaped object.

Some of the most active research areas in computer vision are tackling increasingly complex tasks that require high-level reasoning. Some examples of this trend include visual question answering (VQA) [5], image captioning [2], referring expressions [46], visual dialog [11], and vision-and-language navigation [4]. While deep learning helped make significant progress, these tasks expose the limitations of the pattern recognition methods that have proved successful on classical vision tasks such as object recognition. A key indicator of the shortcomings of deep learning methods is their tendency to respond to specific features or biases in the dataset, rather than generalising to an approach that is applicable more broadly [1, 12]. In response, we propose a benchmark to directly measure a method’s ability for high-level reasoning over real visual information, and in which we can control the level of generalisation required.

Progress on the complex tasks mentioned above is typically evaluated on standardized benchmarks [4, 5, 9, 35]. Methods are evaluated with metrics on task-specific objectives, e.g. predicting the correct answer in VQA, or producing a sentence matching the ground truth in image captioning. These tasks include a strong visual component, and they are naturally assumed to lie on the path to semantic scene understanding, the overarching goal of computer vision. Unfortunately, non-visual aspects of these tasks – language in particular – act as major confounding factors. For example, in image captioning, the automated evaluation of generated language is itself an unsolved problem. In VQA, many questions are phrased such that their answers can be guessed without looking at the image.

We propose to take a step back with a task that directly evaluates abstract reasoning over realistic visual stimuli. Our setting is inspired by Raven’s Progressive Matrices (RPMs) [27], which are used in educational settings to measure human non-verbal visual reasoning abilities. Each instance of the task is a matrix of images, where the last image is missing and is to be chosen from eight candidates. All rows of the completed matrix must represent the same relationship (logical relationships, counts and comparisons, etc.) over a visual property of their three images (Fig. 1). We use real photographs, such that the task requires strong visual capabilities, and we focus on visual, mostly non-semantic properties. This evaluation is thus designed to reflect the capabilities required by the complex tasks mentioned above, but in an abstract non-task-specific manner that might help guide general progress in the field.

Other recent efforts have proposed benchmarks for visual reasoning [6, 32]. Our key difference is the focus on real images, which are of greater interest to the computer vision community than 2D shapes and line drawings. This is a critical difference, because abstract reasoning is otherwise much easier to achieve when applied to a closed set of easily identified symbols such as simple geometrical shapes. A major contribution of this paper is the construction of a suitable dataset with real images at large scale (over 300,000 instances).

Generalisation is a key issue that limits the robustness, and thus the practicality, of deep learning (see [19, 17, 13, 39] among many others). Current benchmarks that require visual reasoning, with few exceptions [1, 4, 40], use training and test splits that follow an identical distribution, which encourages methods to exploit dataset-specific biases (e.g. class imbalance) and superficial correlations [23, 33]. This practice rewards methods that overfit to their training sets [1] to the detriment of generalization capabilities. With these concerns in mind, our benchmark includes several evaluation settings that demand controlled levels of generalization (Section 3.2).

We have adapted and evaluated a range of deep learning models on our benchmark. Simple feed-forward networks achieve better than random results given enough depth, but recurrent neural networks and relational networks perform noticeably better. In the evaluation settings requiring strong generalization, i.e. applying relationships to visual properties in combinations not seen during training, all tested models clearly struggle. In most cases, small improvements are observed by using additional supervision, both on the visual features (using a bottom-up attention network [3] rather than a ResNet CNN [20]), and on the type of relationship represented in the training examples. These results indicate the difficulty of the task while hinting at promising research directions.

Finally, the proposed benchmark is not to be addressed as an end-goal, but should serve as a diagnostic test of methods aiming at more complex tasks. In the spirit of the CLEVR dataset for VQA [24] and the bAbI dataset for reading comprehension [43], our benchmark focuses on the fundamental operations common to multiple high-level tasks. Crafting a solution specific to this benchmark is, however, not necessarily a path to actual solutions to these tasks. This guided the selection of general-purpose architectures evaluated in this paper.

The contributions of this paper are summarized as follows.

  1. We define a new task to evaluate a model’s ability for abstract reasoning over complex visual stimuli. The task is designed to require reasoning similar to complex tasks in computer vision, while allowing evaluation free of task-specific confounding factors such as natural language and dataset biases.

  2. We describe a procedure to collect instances for this task at little cost, by mining images and annotations from the Visual Genome. We build a large-scale dataset of over 300,000 instances, over which we define multiple training and evaluation splits that require controlled amounts of generalization.

  3. We evaluate a range of popular deep learning architectures on the benchmark. We identify elements that prove beneficial (e.g. relational reasoning and mid-level supervision), and we also show that all tested models struggle significantly when strong generalization is required.

The dataset is publicly available on demand to encourage the development of models with improved capabilities for abstract reasoning over visual data.

Figure 2: Some challenging instances from our dataset. Answer key: denoting the candidate answers as 1–8, left-to-right, first then second row, the correct ones are 7, 2, 6.

2 Related work

Evaluation of abstract visual reasoning

Evaluating reasoning has a long history in the field of AI, but is typically based on pre-defined or easily identifiable symbols. Recent works include the task set of Fleuret et al. [16], in which they focus on the spatial arrangement of abstract elements in synthetic images. Their setting is reminiscent of the Bongard problems presented in [7] and further popularized by Hofstadter [21]. Stabinger et al. [31] tested whether state-of-the-art CNN architectures can compare visual properties of multiple abstract objects, e.g. to determine whether two shapes are of the same size. Although this involves high-level reasoning, it is over coarse characteristics of line-drawings.

V-PROM is inspired by Raven’s Progressive Matrices (RPMs) [27], a classic psychological test of a human’s ability to interpret synthetic images. RPMs have been used previously to evaluate the reasoning abilities of neural networks [6, 22, 42]. In [22], the authors propose a CNN model to solve problems involving geometric operations such as rotations and reflections. Barrett et al. [6] evaluated existing deep learning models on a large-scale dataset of RPMs, with a procedure similar to one previously proposed by Wang et al. [42]. The benchmark of Barrett et al. [6] is the most similar to our work. It uses synthetic images of simple 2D shapes, whereas ours uses much more complex images, at the cost of a less precise control of the visual stimuli. Recognizing the complementarity of the two settings, we purposefully model our evaluation setup after [6] such that future methods can be evaluated and compared across the two settings. Since the synthetic images in [6] do not reflect the complexity of real-world data, progress on that benchmark may not readily translate to high-level vision tasks. Our work bridges the gap between these two extremes (Fig. 3).

Evaluation of high-level tasks in computer vision

The interest in high-level tasks is growing, as exemplified by the advent of VQA [5], referring expressions [46], and visual navigation [4], to name a few. Unbiased evaluations are notoriously difficult, and there is a growing trend toward evaluation on out-of-distribution data, i.e. where the test set is drawn from a different distribution than the training set [1, 4, 36, 40]. In this spirit, our benchmark includes multiple training/test splits drawn from different distributions to evaluate generalization under controlled conditions. Moreover, our task focuses on abstract relationships applied to visual (i.e. mostly non-semantic) properties, with the aim of minimizing the possibility of solving the task by exploiting non-visual factors.

Models for abstract reasoning with neural networks

Various architectures have been proposed with the goal of moving beyond memorizing training examples, for example relation networks [29], memory-augmented networks [44], and neural Turing machines [18]. Recent works on meta learning [14, 41] address the same fundamental problem by focusing on generalization from few examples (i.e. few-shot learning), and they have shown better generalization [15], including in VQA [37]. Barrett et al. [6] applied relation networks (RNs) with success to their dataset of RPMs. We evaluate RNs on our benchmark with equally encouraging results, although there remains large room for improvement, in particular when strong generalization is required.

Figure 3: Alternative tasks and datasets requiring visual reasoning. V-PROM fills an important gap between controlled, synthetic datasets (on which current methods are increasingly successful), and complex real-world tasks (which remain largely unsolved).

3 A new task to evaluate visual reasoning

Our task is inspired by the classical Raven’s Progressive Matrices [27] used in human IQ tests (see Fig. 1). Each instance is a 3×3 matrix of images, where the missing final image must be identified from among 8 candidates. The goal is to select an image such that all 3 rows represent the same relationship over some visual property (attribute, object category, or object count) of their 3 respective images. The definition of our task was guided by the following principles. First, it must require, but not be limited to, strong visual recognition ability. Second, it should measure a common set of capabilities required in high-level computer vision tasks. Third, it must be practical to construct a large-scale benchmark for this task, enabling an automatic and unambiguous evaluation. Finally, the task must not be solvable through task-specific heuristics or by relying on superficial statistics of the training examples. This points at a task that is compositional in nature and inherently requires strong generalization.

Our task can be seen as an extension to real images of recent benchmarks for reasoning on synthetic data [6, 22]. These works sacrifice visual realism for precise control over the contents of images, which are limited to simple geometrical shapes. It is unclear whether reasoning under these conditions can transfer to realistic vision tasks. Our design is also intended to limit the extent to which semantic cues might be used as “shortcuts” to avoid solving the task using the appropriate relationships. For example, a test to recognize the relation above could rely on the higher likelihood of car above ground than ground above car, rather than its actual spatial meaning. Therefore, our task focuses on fundamental visual properties and relationships such as logical and counting operations over multiple images (co-occurrence in the same photograph being likely biased).

The task requires identifying a plausible explanation for the provided triplets of images, i.e. a relation that could have generated them. The incomplete triplet serves as a “visual question”, and the explanation must be applied generatively to identify the missing image. It is unavoidable that more than one of the answer candidates constitute plausible completions. Indeed, a sufficiently-contrived explanation can justify any possible choice. The model has to identify the explanation with the strongest justification, which in practice tends to be the simplest one in the sense of Occam’s razor. This is expected to be learned by the model from training examples.

3.1 Construction of the V-PROM dataset

We describe how to construct a large-scale dataset for our task semi-automatically. We name it V-PROM for Visual PROgressive Matrices.

                      Object       Human        Object       Object
                      attributes   attributes   categories   counts
Nb. visual elements   84           38           346          10
Nb. images            36,750       12,249       82,905       11,730
Nb. task instances    45,000       45,000       45,000       100,000 (*)
Table 1: Statistics of the V-PROM dataset.
(*) We generate more task instances with object counts than with attributes and categories because counts are the only ones involved in the relationship progression, in addition to the three others (and, or, union).

Generating descriptions of task instances

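Each instance is a 3×3 matrix of images that we call a visual reasoning matrix (VRM). Each image $I$ in the VRM depicts a visual element $a = \phi(I)$, where $a$ denotes an element depicted in the image with $a \in \mathcal{A} \cup \mathcal{O} \cup \mathcal{C}$, where $\mathcal{A}$, $\mathcal{O}$, $\mathcal{C}$ respectively denote sets of possible attributes, objects, and object counts. We denote with $t \in \{\mathcal{A}, \mathcal{O}, \mathcal{C}\}$ the type of visual element $a$ corresponds to. We also denote with $I_{i,j}$ the $j$-th image of the $i$-th row in a VRM. Each VRM represents one specific type $t$ of visual elements, and one specific type of relationship $r$. We define them as follows.

  • And: $\phi(I_{i,3}) = \phi(I_{i,1}) = \phi(I_{i,2})$. The last image of each row has the same visual element as the other two.

  • Or: $\phi(I_{i,3}) = \phi(I_{i,1})$ or $\phi(I_{i,3}) = \phi(I_{i,2})$. The last image in each row has the same visual element as the first or the second.

  • Union: $\{\phi(I_{i,1}), \phi(I_{i,2}), \phi(I_{i,3})\} = \{\phi(I_{1,1}), \phi(I_{1,2}), \phi(I_{1,3})\}$ for every row $i$. All rows contain the same three visual elements, possibly in different orders.

  • Progression: $t = \mathcal{C}$; and $\phi(I_{i,2}) - \phi(I_{i,1}) = \phi(I_{i,3}) - \phi(I_{i,2})$. The numbers of objects in a row follow an arithmetic progression.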

We randomly sample a visual element and a relationship to generate the definition of a VRM. Seven additional incorrect answer candidates are obtained by sampling seven different visual elements of the same type as the correct one. The following section describes how to obtain images that fulfill the definition of a VRM by mining annotations from the Visual Genome (VG) [26].
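The four relationships and the sampling of incorrect candidates can be sketched as follows. This is a minimal illustration written from the prose definitions above, not the authors' generation code; the function names and the representation of a matrix as three rows of labels (or integer counts) are assumptions made for the example.

```python
import random

def row_satisfies(relation, row):
    """Check one row of three visual elements against a per-row relationship."""
    a, b, c = row
    if relation == "and":
        return a == b == c            # last image matches the other two
    if relation == "or":
        return c == a or c == b       # last image matches the first or the second
    if relation == "progression":
        return b - a == c - b         # counts follow an arithmetic progression
    raise ValueError(f"unknown relation: {relation}")

def matrix_satisfies(relation, rows):
    """Check a completed matrix (list of three rows) against a relationship."""
    if relation == "union":
        # Every row holds the same three elements, possibly in a different order.
        return all(sorted(r) == sorted(rows[0]) for r in rows)
    return all(row_satisfies(relation, r) for r in rows)

def sample_distractors(correct, vocabulary, rng, n=7):
    """Sample n distinct elements of the same type, excluding the correct one."""
    pool = [v for v in vocabulary if v != correct]
    return rng.sample(pool, n)
```

For instance, `matrix_satisfies("progression", [[1, 2, 3], [2, 4, 6], [5, 4, 3]])` holds because each row has a constant step, even though the step differs between rows.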

Mining images from the Visual Genome

To select suitable images, we impose five desired principles: richness, purity, image quality, visual relatedness, and independence. Richness requires diversity of the visual elements, and of the images representing each visual element. Purity constrains the complexity of the image, as we want images that depict the visual element of interest fairly clearly. Visual relatedness guides us toward properties that have a clear visual depiction. As a counterexample, the attribute open appears very differently when a door is open and when a bottle is open. Such semantic attributes are not desirable for our task. Finally, independence excludes the objects that frequently co-occur with other objects (e.g. “sky”, “road”, “water”, etc.) and could lead to ambiguous VRMs.

We obtain images that fulfill the above principles using VG’s region-level annotations of categories, attributes, and natural language descriptions (Table 1). We first preselect categories and attributes with large numbers of instances to guarantee sufficient representation of each in our dataset. We manually exclude unsuitable labels such as semantic attributes, and objects likely to cause ambiguity. We crop the annotated regions to obtain pure images. We discard those smaller than 100 px in either dimension. The annotations of object counts are extracted from numbers 1–10 appearing in natural language descriptions (e.g. “five bowls of oatmeal”), manually excluding those unrelated to counts (e.g. “five o’clock” or “a 10 years old boy”).
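The size filter and the extraction of counts from region descriptions might look like the sketch below. The paper excludes non-count phrases manually; the `NON_COUNT` pattern here is a hypothetical stand-in for that manual step, and all names are illustrative assumptions rather than the authors' tooling.

```python
import re

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

# Hypothetical exclusion rules standing in for the manual filtering step.
NON_COUNT = re.compile(r"o['’]clock|years? old", re.IGNORECASE)

# A number word or a digit string between 1 and 10, as a whole word.
TOKEN = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r"|10|[1-9])\b", re.IGNORECASE)

def extract_count(description):
    """Return the object count (1-10) mentioned in a description, or None."""
    if NON_COUNT.search(description):
        return None
    match = TOKEN.search(description)
    if match is None:
        return None
    token = match.group(1).lower()
    return NUMBER_WORDS[token] if token in NUMBER_WORDS else int(token)

def keep_region(width, height, min_size=100):
    """Discard cropped regions smaller than min_size pixels in either dimension."""
    return width >= min_size and height >= min_size
```

A rule-based filter of this kind would still need manual review, since many non-count uses of numbers are not covered by two patterns.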

3.2 Data splits to measure generalization

In order to evaluate a method’s capabilities for generalization, we define several training/evaluation splits that require different levels of generalization. Training and evaluating a method in each of these settings will provide an overall picture of its capabilities beyond the basic fitting of training examples. To define these different settings, we follow the nomenclature proposed by Barrett et al. [6].

  1. Neutral – The training and test sets are both sampled from the whole set of relationships and visual elements. Training to testing ratio is .

  2. Interpolation / extrapolation – These two splits evaluate generalization for counting. In the interpolation split, odd counts (1,3,5,7,9) are used for training and even counts (2,4,6,8,10) are used for testing. In the extrapolation split, the first five counts (1–5) are used for training and the remaining (6–10) are used for testing.

  3. Held-out attributes – The object attributes are divided into 7 super-attributes (the attributes within each super-attribute are mutually exclusive): color, material, scene, plant condition, action, shape, texture. The human attributes are divided into 6 super-attributes: age, hair style, clothing style, gender, action, clothing color. The super-attributes shape, texture, action are held-out for testing only.

  4. Held-out objects – A subset of object categories () are held-out for testing only.

  5. Held-out pairs of relationships/attributes – A subset of relationship/super-attribute combinations are held-out for testing only. Three combinations are held-out for both object attributes and human attributes. The held-out super-attributes vary with each type of relationship.

  6. Held-out pairs of relationships/objects – For each type of relationship, of objects are held-out. The held-out objects are different for each relationship.
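As an illustration of the two counting splits, the partition of the counts 1–10 can be written as below. This is a sketch of the split definitions only; how instances are then assigned to each side is not covered, and the function name is an assumption.

```python
def count_split(setting):
    """Return (training counts, test counts) for a counting split."""
    counts = range(1, 11)
    if setting == "interpolation":
        # Odd counts for training, even counts for testing.
        train = [c for c in counts if c % 2 == 1]
        test = [c for c in counts if c % 2 == 0]
    elif setting == "extrapolation":
        # Counts 1-5 for training, 6-10 for testing.
        train = [c for c in counts if c <= 5]
        test = [c for c in counts if c > 5]
    else:
        raise ValueError(f"unknown setting: {setting}")
    return train, test
```

In the interpolation split every test count lies between two training counts, whereas in the extrapolation split all test counts lie outside the training range.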

We report a model’s performance as accuracy, i.e. the fraction of test instances for which the predicted answer (among the eight candidates) is correct. Random guessing gives an accuracy of 12.5%.
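A minimal sketch of this metric, assuming each model outputs a list of eight scores per instance and that answers are stored as 0-based candidate indices (both are conventions chosen for the example):

```python
def accuracy(all_scores, answers):
    """Fraction of instances whose highest-scoring candidate is the correct one."""
    correct = 0
    for scores, answer in zip(all_scores, answers):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == answer)
    return correct / len(answers)

# With eight candidates, uniform random guessing is correct one time in eight.
CHANCE_LEVEL = 1 / 8
```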

3.3 Task complexity and human evaluation

Solving an instance of our task requires recognizing the visual elements depicted in all images, and identifying the relation that applies to triplets of images. This basically amounts to inferring the abstract description (Section 3.1) of the instance. Our dataset contains 4 types of relations, applied over 478 types of visual elements (Table 1), giving on the order of 2,000 different combinations.
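The figures quoted here can be checked against Table 1. The second quantity below is our own reading, based on the table's note that progression involves only counts, and is not a number stated in the paper.

```python
# Numbers of visual elements per type, copied from Table 1.
elements = {
    "object attributes": 84,
    "human attributes": 38,
    "object categories": 346,
    "object counts": 10,
}

total_elements = sum(elements.values())   # 478 types of visual elements
all_pairs = 4 * total_elements            # 1912, i.e. on the order of 2,000

# Assumption: and/or/union apply to every element type, progression only to counts.
restricted_pairs = 3 * total_elements + elements["object counts"]
```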

We performed a human study to assess the difficulty of our benchmark. We presented human subjects with a random selection of task instances, sampled evenly across the four types of relations. Subjects could skip an instance if they found it too difficult or ambiguous. The accuracy was 77.8% (Table 2) with a skip rate of . This accuracy is not an upper bound for the task, however. The two main reasons for non-perfect human performance are (1) counting errors with 5 objects, cluttered background, or scale variations and (2) a tendency to use prior knowledge and favor higher-level (semantic) concepts/attributes than those used to generate the dataset.

4 Models and experimental setup

We evaluated a range of models on our benchmark. These models are based on popular deep learning architectures that have proven successful on various task-specific benchmarks. The models are summarized in Fig. 4.

Figure 4: Overview of the models evaluated in our experiments. These are based on popular deep learning architectures.

4.1 Input data

For each instance of our task, the input data consists of 8 context panels and 8 candidate answers. These 16 RGB images are passed through a pretrained CNN to extract visual features. Our experiments compare features from a ResNet101 [20] and from a Bottom-Up Attention Network [3], which is popular for image captioning and VQA [34]. (The network of Anderson et al. [3] was pretrained with annotations from the Visual Genome. Our dataset only uses cropped images from VG, and we use layer activations rather than explicit class predictions, but the possible overlap in the label space used to pretrain [3] and to generate our benchmark must be kept in mind.) The feature maps from either of these CNNs are average-pooled, and the resulting vector is L2-normalized. The vector of each of the 16 images is concatenated with a one-hot representation of an index: the 8 context panels are assigned indices 1–8 and the candidate answers 9–16. The resulting vectors are referred to as $x_1, \dots, x_{16}$.

The vectors $x_i$ serve as input to the models described below, which are trained with supervision to predict a score for each of the 8 candidate answers, i.e. $\hat{s} \in \mathbb{R}^8$. Each model is trained with a softmax cross-entropy loss over $\hat{s}$, standard backpropagation and SGD, using AdaDelta as the optimizer. Suitable hyperparameters for each model were coarsely selected by grid search (details in supplementary material). We held out 8,000 instances from the training set to serve as a validation set, to select the hyperparameters and to monitor for convergence and early-stopping. Unless noted, the non-linear transformations within the networks below refer to a linear layer followed by a ReLU.
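The preparation of one input vector (average pooling, L2 normalization, then concatenation with a one-hot panel index) can be sketched in plain Python as follows. The representation of a feature map as a list of per-location channel vectors and the function names are assumptions for the example; a real pipeline would operate on CNN tensors.

```python
import math

def average_pool(feature_map):
    """Average a feature map given as a list of spatial locations, each a list of channels."""
    n = len(feature_map)
    channels = len(feature_map[0])
    return [sum(loc[c] for loc in feature_map) / n for c in range(channels)]

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]

def one_hot(index, size=16):
    """One-hot code for a 1-based panel index: 1-8 context panels, 9-16 candidates."""
    return [1.0 if i == index - 1 else 0.0 for i in range(size)]

def panel_vector(feature_map, index):
    """Pooled, normalized features concatenated with the one-hot panel index."""
    return l2_normalize(average_pool(feature_map)) + one_hot(index)
```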

4.2 MLP

Our simplest model is a multilayer perceptron (see Fig. 4). The features $x_1, \dots, x_{16}$ of every image are passed through a non-linear transformation $f_1$. The model is then applied so as to share the parameters used to score each candidate answer. The features of each candidate answer ($x_i$ for $i = 9, \dots, 16$) are concatenated with the context panels ($x_1, \dots, x_8$). The features are then passed through another non-linear transformation $f_2$, and a final linear transformation $f_3$ to produce a scalar score for each candidate answer. That is, $\forall i \in \{1, \dots, 8\}$:

$$\hat{s}_i = f_3\big(f_2\big([\,f_1(x_1);\, f_1(x_2);\, \dots;\, f_1(x_8);\, f_1(x_{8+i})\,]\big)\big)$$

where the semicolon represents the concatenation of vectors. A variant of this model replaces the concatenation with a sum-pooling over the nine panels. This reduces the number of parameters by sharing the weights within $f_2$ across the panels. This gives

$$\hat{s}_i = f_3\big(f_2\big(f_1(x_1) + f_1(x_2) + \dots + f_1(x_8) + f_1(x_{8+i})\big)\big)$$

We will refer to these two models as MLP-cat-k and MLP-sum-k, in which $f_1$ and $f_2$ are both implemented with $k/2$ linear layers, all followed by a ReLU.
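The difference between the two variants is only in how the nine transformed panels are combined before the second transformation. The sketch below uses single dense layers with random weights purely to show the data flow; the helper names, dimensions, and initialization are assumptions, and no training is performed.

```python
import random

def linear(in_dim, out_dim, rng):
    """Dense layer as (weights, biases) with small random weights (illustrative only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def dense(layer, x, relu=True):
    """Apply a dense layer, optionally followed by a ReLU."""
    weights, biases = layer
    y = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in y] if relu else y

def mlp_scores(panels, f1, f2, f3, pooling):
    """Score 8 candidates; panels holds 8 context vectors followed by 8 candidates."""
    hidden = [dense(f1, x) for x in panels]
    scores = []
    for i in range(8):
        group = hidden[:8] + [hidden[8 + i]]          # the nine panels for candidate i
        if pooling == "cat":
            z = [v for vec in group for v in vec]     # concatenation: 9 * hidden size
        else:
            z = [sum(col) for col in zip(*group)]     # sum-pooling: hidden size
        scores.append(dense(f3, dense(f2, z), relu=False)[0])
    return scores
```

With a hidden size of 5, the first layer after pooling takes 45 inputs in the concatenated variant and 5 in the sum-pooled variant, which is where the parameter saving comes from.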

4.3 GRU

We consider two variants of a recurrent neural network, implemented with a gated recurrent unit (GRU [10]). The first, naive version takes each of the feature vectors $x_1$ to $x_{16}$ over 16 time steps. The final hidden state of the GRU is then passed through a linear transformation $f$ to map it to a vector of 8 scores $\hat{s} \in \mathbb{R}^8$.

$$\hat{s} = f\big(\mathrm{GRU}(x_1, x_2, \dots, x_{16})\big)$$

The second version shares the parameters of the model over the 8 candidate answers. The GRU takes, in parallel, 8 sequences, each consisting of the context panels with one of the 8 candidate answers. The final state of each GRU is then mapped to a single score for the corresponding candidate answer. That is, $\forall i \in \{1, \dots, 8\}$:

$$\hat{s}_i = f\big(\mathrm{GRU}(x_1, \dots, x_8, x_{8+i})\big)$$

where $f$ here maps the final state to a scalar.
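The two variants differ in the sequences fed to the recurrent network. The sketch below only builds those sequences; `score_sequence` is a hypothetical callable standing in for a GRU followed by a linear layer, which is not implemented here.

```python
def naive_sequence(panels):
    """Single sequence of all 16 panels: 8 context panels, then 8 candidates."""
    return list(panels)

def shared_sequences(panels):
    """Eight sequences of 9 panels: the context followed by one candidate each."""
    context, candidates = panels[:8], panels[8:]
    return [context + [candidate] for candidate in candidates]

def gru_shared_scores(panels, score_sequence):
    """Apply the same scoring function (shared parameters) to each sequence."""
    return [score_sequence(seq) for seq in shared_sequences(panels)]
```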


4.4 VQA-like architecture

We consider an architecture that mimics a state-of-the-art model in VQA [34] based on a “joint embedding” approach [45, 38]. In our case, the context panels $x_1, \dots, x_6$ serve as the input “image”, and the panels $x_7, x_8$ serve as the “question”. They are passed through non-linear transformations, then combined with an elementwise product into a joint embedding $h$. The score for each answer is obtained as the dot product between $h$ and the embedding of each candidate answer (see Fig. 4). Formally, we have

$$h = f_1([x_1; \dots; x_6]) \circ f_2([x_7; x_8]), \qquad \hat{s}_i = h \cdot f_3(x_{8+i}) \quad \forall i \in \{1, \dots, 8\}$$

where $f_1$, $f_2$ and $f_3$ are non-linear transformations, and $\circ$ represents the Hadamard product.
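The scoring step of this joint-embedding model reduces to an elementwise product followed by dot products. The sketch assumes the three embeddings have already been produced by the non-linear transformations, which are omitted; the function names are illustrative.

```python
def hadamard(u, v):
    """Elementwise (Hadamard) product of two equal-length vectors."""
    return [a * b for a, b in zip(u, v)]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def vqa_like_scores(image_embedding, question_embedding, candidate_embeddings):
    """Score each candidate by its dot product with the joint embedding."""
    joint = hadamard(image_embedding, question_embedding)
    return [dot(joint, candidate) for candidate in candidate_embeddings]
```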

4.5 Relation networks

We finally evaluate a relation network (RN). RNs were specifically proposed to model relationships between visual elements, such as in VQA when questions refer to multiple parts of the image [30]. Our model is applied, again, such that its parameters are shared across answer candidates. The basic idea of an RN is to consider all pairwise combinations of input elements ($9 \times 9$ in our case), pass them through a non-linear transformation, sum-pool over these representations, then pass the pooled representation through another non-linear transformation. Formally, we have, $\forall i \in \{1, \dots, 8\}$:

$$\hat{s}_i = f_3\Big(f_2\Big(\sum_{(j,k) \in P_i \times P_i} f_1([x_j; x_k])\Big)\Big), \qquad P_i = \{1, \dots, 8, 8+i\}$$

where $f_1$ and $f_2$ are non-linear transformations, and $f_3$ a linear transformation.
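The pairwise pooling at the core of the relation network can be sketched as below. The callables `pair_fn` and `readout_fn` are placeholders for the learned transformations, and the choice to enumerate all ordered pairs including self-pairs follows the 9 × 9 count implied by the text; both are assumptions of this example.

```python
from itertools import product

def relation_network_scores(panels, pair_fn, readout_fn):
    """Score 8 candidates; panels holds 8 context vectors followed by 8 candidates.

    pair_fn maps a concatenated pair of vectors to a vector;
    readout_fn maps the sum-pooled vector to a scalar score.
    """
    scores = []
    for i in range(8):
        group = panels[:8] + [panels[8 + i]]            # nine panels for candidate i
        pooled = None
        for a, b in product(group, repeat=2):           # 9 x 9 = 81 ordered pairs
            r = pair_fn(a + b)                          # a + b concatenates two lists
            pooled = r if pooled is None else [p + q for p, q in zip(pooled, r)]
        scores.append(readout_fn(pooled))
    return scores
```

Because the same `pair_fn` is applied to every pair, the parameter count does not grow with the number of pairs, only the amount of computation does.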

4.6 Auxiliary objective

We experimented with an auxiliary objective that encourages the network to predict the type of the relationship involved in the given matrix. This objective is trained with a softmax cross-entropy and the ground truth type of relationship in the training example. This value is an index among the seven possible relations, i.e. and, or, progression, attribute, object, union, and counting (see Section 3). This prediction is made from a linear projection of the final activations $h$ of the network in Eq. 8, that is:

$$\hat{y} = f_4(h)$$

where $f_4$ is an additional learned linear transformation. At test time, this prediction is not used, and the auxiliary objective serves only to provide an inductive bias during the training of the network such that its internal representation captures the type of relationship (which should then help the model to generalize). Note that we also experimented with an auxiliary objective for predicting labels such as object class and visual attributes, but this did not prove beneficial.
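The combined training objective can be sketched as the sum of two softmax cross-entropy terms, one over the 8 answer scores and one over the 7 relationship labels. The function names and the `aux_weight` parameter are assumptions for the example; the default of 1.0 reflects the equal weighting reported in the experiments.

```python
import math

def softmax_cross_entropy(logits, target):
    """Negative log-probability of the target index under a softmax over logits."""
    m = max(logits)                                   # subtract the max for stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def total_loss(answer_scores, answer, relation_logits, relation, aux_weight=1.0):
    """Main answer loss plus weighted auxiliary relationship-type loss."""
    main = softmax_cross_entropy(answer_scores, answer)
    auxiliary = softmax_cross_entropy(relation_logits, relation)
    return main + aux_weight * auxiliary
```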

Figure 5: Accuracy of all models in the neutral setting, broken down by question type. The types and/or/progression/union reflect the type of relationship across the nine images, while attribute/object/counting correspond to the type of visual properties to which the relationship applies. Each group of bars corresponds to the methods MLP-cat-2, MLP-cat-4, MLP-cat-6, MLP-sum-2, MLP-sum-4, MLP-sum-6, GRU, GRU-shared, VQA-like, RN without panel IDs, and RN. See supplementary material for numbers.
Figure 6: Accuracy of all models trained/evaluated on splits requiring varying levels of generalization. The relative performance of the models is generally consistent, but all models perform significantly worse than in the neutral setting, indicating poor generalization of most models and overfitting to their training examples. Each group of bars corresponds to the same methods as in Fig. 5.

5 Experiments

We conducted numerous experiments to establish reference baselines and to shed light on the capabilities of popular architectures. As a sanity check, we trained our best model with randomly-shuffled context panels. This verified that the task could not be solved by exploiting superficial regularities of the data. All models trained in this way perform around the “chance” level of 12.5%.


                           ResNet   ResNet       B.-up   B.-up
                                    +aux.loss            +aux.loss
Human evaluation           77.8
RN with shuffled inputs    12.5     12.5         12.5    12.5
MLP-sum-6 layers           40.7     44.5         50.4    55.7
GRU-shared                 43.4     48.2         46.7    52.7
VQA-like                   36.7     39.7         37.9    41.0
Relational network (RN)    51.2     55.8         55.4    61.3
Table 2: Summary of the best models in the neutral setting, on all question types (Fig. 5, first row). Additional results in supp. mat.

5.1 Neutral training/test splits

We first examine all models on the neutral training/test splits (Fig. 5 and Table 2). In this setting, training and test data are drawn from the same distribution, and supervised models are expected to perform well, given sufficient capacity and training examples. We observe that a simple MLP can indeed fit the data relatively well if it has enough layers, but a network with only 2 non-linear layers performs quite badly. The two models based on a GRU have very different performance. The GRU-shared model performs better of the two. It shares its parameters over the candidate answers (processed in parallel rather than across the recurrent steps). This result was not obviously predictable, since this model does not get to consider all candidate answers in relation with each other. The alternate model (GRU) receives every candidate answer in succession. It could therefore perform additional reasoning steps over the candidates, but this does not seem to be the case in practice. The VQA-like model obtains a performance comparable to a deep MLP, but it proved more difficult to train than an MLP. In some of our experiments, the optimization of this model was slow or simply failed to converge. We found it best to use, as non-linear transformations, “gated tanh” layers as in [34]. Overall, we obtained the best performance with a relation network (RN) model. While this is basically an MLP on top of pairwise combinations of features, these combinations prove much more informative than the individual features. We experimented with an RN without the one-hot representations of panel IDs concatenated with the input (“RN without panel IDs”), and this version performed very poorly. It is worth noting that RNs come at the cost of processing $N^2$ feature vectors rather than $N$ (with $N$=9 in our case). The number of parameters is the same, since they are shared across the combinations, but the computation time increases.

We break down performance along two axes in Fig. 5. The following two groups of question types are mutually exclusive: and/or/progression/union, and attribute/object/counting. The former reflects the type of relationship across the nine images of a test instance, while the latter corresponds to the type of visual properties to which the relationship applies. We observe that some types are much more easily solved than others. Instances involving object identity are easier than those involving attributes and counts, presumably because the image features are obtained with a CNN pretrained for object classification. The bottom-up image features perform remarkably well, most likely because the set of labels used for pretraining was richer than the ImageNet labels used to train the ResNet. The instances that require counting are particularly difficult; this corroborates the struggle of vision systems with counting, already reported in multiple existing works, e.g. in [25].

5.2 Splits requiring generalization

We now look at the performance with respect to splits that specifically require generalization (Fig. 6). As expected, accuracy drops significantly as the need for generalization increases. This confirms our hypothesis that naive end-to-end training cannot guarantee generalization beyond training examples, and that this is easily masked when the test and training data come from the same distribution (as in the neutral split). This drop is particularly visible with the simple MLP and GRU models. The RN model suffers a smaller drop in performance in some of the generalization settings. This indicates that learning over combinations of features provides a useful inductive bias for our task.

Image features from bottom-up attention

We tested all models with features from a ResNet, as well as features from the “bottom-up attention” model of Anderson et al. [3]. These improve the performance of all tested models over ResNet features, in the neutral and all generalization splits. The bottom-up attention model is pretrained with a richer set of annotations than the ImageNet labels used to pretrain the ResNet. This likely provides features that better capture fine visual properties of the input images. Note that the visual features used by our models do not contain explicit predictions of such labels and visual properties, as they are vectors of continuous values. We experimented with alternative schemes (not reported in the plots), including an auxiliary loss within our models for predicting visual attributes, but these did not prove helpful.

Auxiliary prediction of relationship type

We successfully experimented with an auxiliary loss on the prediction of the type of relationship in the given instance. This type is provided during training as one label among seven. All models trained with this additional loss gained in accuracy in the neutral and most generalization settings. The relative importance of the main and auxiliary losses did not seem critical, and all reported experiments use an equal weight on both.
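The combination of the two losses can be sketched as follows. This is only an illustration: it assumes that both the main loss (over the 8 candidate answers) and the auxiliary loss (over the 7 relationship labels) are softmax cross-entropies, which the text does not state explicitly, and the names `softmax_cross_entropy`, `total_loss` and `aux_weight` are hypothetical.

```python
import math

def softmax_cross_entropy(logits, target):
    """Cross-entropy of a softmax over `logits` for the class `target`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def total_loss(answer_logits, answer, relation_logits, relation, aux_weight=1.0):
    """Main answer loss plus weighted auxiliary relationship-type loss.

    aux_weight=1.0 mirrors the equal weighting reported in the text.
    """
    main = softmax_cross_entropy(answer_logits, answer)
    aux = softmax_cross_entropy(relation_logits, relation)
    return main + aux_weight * aux

# Untrained model with uniform scores: 8 candidates, 7 relationship labels.
loss = total_loss([0.0] * 8, 2, [0.0] * 7, 5)
```

For uniform scores the loss reduces to log(8) + log(7), a convenient sanity check for the starting value of training.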

Overall, the performance of our best models remains well below human performance, leaving substantial room for improvement. This dataset should be a valuable tool to evaluate future approaches to visual reasoning.

6 Conclusions

We have introduced a new benchmark to measure a method’s ability to carry out abstract reasoning over complex visual data. The task addresses a central issue in deep learning, namely the degree to which methods learn to reason over their inputs. This issue is critical because reasoning can generalise to new classes of data, whereas memorising incidental relationships between signal and label does not. It lies at the core of many of the current challenges in deep learning, including zero-shot learning, domain adaptation, and generalisation more broadly.

Our benchmark serves to evaluate capabilities similar to some of those required in high-level tasks in computer vision, without task-specific confounding factors such as natural language or dataset biases. Moreover, the benchmark includes multiple evaluation settings that demand controllable levels of generalization. Our experiments with popular deep learning models demonstrate that they struggle when strong generalization is required, in particular for applying known relationships to combinations of visual properties not seen during training. We identified a number of promising directions for future research, and we hope that this setting will encourage the development of models with improved capabilities for abstract reasoning over visual data.


Supplementary material

Appendix A Dataset details

Table 3 shows the number of training/test instances in the different data splits.

Split           Training    Test
Neutral         103,323     51,677
Interpolation   109,991     65,009
Extrapolation   109,991     65,009
Att.held        103,329     51,617
Att.rel.held    73,329      51,671
Obj.held        103,326     51,674
Obj.rel.held    88,326      51,674
Table 3: Number of training/test instances in each data split.

Appendix B Implementation details

The image features were obtained with the ResNet-101 CNN [20] implemented in MXNet [8] and pretrained on ImageNet [28], and with the Bottom-Up Attention network of Anderson et al. [3]. The latter uses an R-CNN framework itself based on a ResNet-101. We resize each image of our dataset such that its shorter side is 256 pixels, and preprocess it with color normalization. We then crop out the central patch from the resulting image and feed it to the network. Feature maps from the last convolutional layer are pre-extracted in this way for every image. These feature maps are pooled (averaged) over image locations (for the ResNet) or over region proposals (for the Bottom-Up Attention network). The resulting vector is of dimension 2048, and is normalized to unit length. The normalization of the image features is crucial to obtain reasonable performance. This has previously been reported for other tasks like VQA.
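The pooling and normalization steps can be sketched in a few lines. The sketch assumes that “unit length” refers to the L2 norm, and it uses a toy 2-dimensional feature map in place of the 2048-dimensional features; the helper names are hypothetical.

```python
import math

def average_pool(feature_map):
    """Average a list of feature vectors (one per image location or region)."""
    n = len(feature_map)
    dim = len(feature_map[0])
    return [sum(vec[d] for vec in feature_map) / n for d in range(dim)]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (assumed meaning of 'unit length')."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

# Toy feature map with two locations and two channels.
feature_map = [[3.0, 0.0], [3.0, 8.0]]
pooled = average_pool(feature_map)     # [3.0, 4.0]
image_feature = l2_normalize(pooled)   # approximately [0.6, 0.8]
```

The `eps` guard only protects against an all-zero feature map and does not affect the result otherwise.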

All models are trained with a batch size of 128 and hidden layers of size 128. These values were selected by grid search and performed consistently well across models. All models use the one-hot labels of the input panels, except the model referred to as “RN without panel IDs”. All models are optimized using AdaDelta [47].

Models using an auxiliary loss use the same weight for the two losses. We experimented with different relative weights, and this did not affect the results significantly in either direction. All non-linear layers are implemented with affine weights followed by a ReLU, except in the VQA-like model, which proved easier to optimize with “gated tanh” layers, as in the VQA model of Teney et al. [34].

Let us also mention that we experimented with a VQA-like network that includes a top-down attention mechanism as in [34]. This performed slightly worse than the simple model, and it is not included in our results.

Appendix C Additional results

We provide in Tables 4 and 5 all numbers corresponding to the bar plots presented in the paper.

We performed additional experiments to compare the sample efficiency of the different models. We trained our best models with only a fraction of the training set. We included 4 of our best models, using simple ResNet features and without the auxiliary loss. The results are presented in Fig. 7. The sample efficiency is fairly consistent across models. The performance grows almost linearly with the amount of training data, which indicates the explicit reliance of these models on the training examples (i.e. their weak ability to generalize). These results can also serve as supplementary baselines to investigate low-data regimes.

Figure 7: Accuracy of various models trained on reduced amounts of training data.


Model | Overall accuracy | Accuracy per question type: and, or, progression, union, attribute, object, counting


Human evaluation 77.8
Lower bound: RN (last row) with shuffled inputs 12.5 12.5 12.5 12.5 12.5 12.5 12.5 12.5


MLP-sum-2, ResNet 35.1 35.1 50.2 31.4 15.1 49.7 15.1 25.7
MLP-sum-2, ResNet + aux. loss 39.1 39.1 55.5 32.1 18.0 53.7 17.9 31.9
MLP-sum-2, Bot.-Up 36.4 60.8 36.4 49.4 30.1 18.7 31.6 16.8
MLP-sum-2, Bot.-Up + aux. loss 40.4 65.5 40.4 56.5 32.8 24.4 33.6 23.4


MLP-cat-2, ResNet 34.7 34.7 45.0 33.0 12.8 51.2 12.9 28.3
MLP-cat-2, ResNet + aux. loss 39.2 39.2 51.8 36.7 14.9 51.0 14.4 31.4
MLP-cat-2, Bot.-Up 37.6 72.3 37.6 50.0 33.8 20.8 30.7 21.0
MLP-cat-2, Bot.-Up + aux. loss 41.7 76.7 41.7 54.6 37.1 26.1 34.8 28.4


MLP-sum-4, ResNet 37.6 37.6 50.2 30.6 15.6 56.2 14.7 34.2
MLP-sum-4, ResNet + aux. loss 44.6 44.6 60.1 36.0 22.0 62.3 20.4 40.0
MLP-sum-4, Bot.-Up 41.4 67.2 41.4 54.5 32.3 26.1 39.3 24.3
MLP-sum-4, Bot.-Up + aux. loss 46.2 70.9 46.2 61.5 36.0 32.9 42.0 36.5


MLP-cat-4, ResNet 39.1 39.1 46.8 38.8 12.3 61.2 11.8 34.3
MLP-cat-4, ResNet + aux. loss 43.0 43.0 52.1 43.4 13.1 59.0 14.1 36.4
MLP-cat-4, Bot.-Up 46.2 77.3 46.2 55.0 45.0 28.2 40.3 28.6
MLP-cat-4, Bot.-Up + aux. loss 52.5 82.0 52.5 62.5 50.9 38.9 45.2 41.2


MLP-sum-6, ResNet 41.2 41.2 55.2 33.4 16.9 61.6 16.6 37.5
MLP-sum-6, ResNet + aux. loss 41.4 41.4 55.9 32.9 26.6 54.3 22.3 36.7
MLP-sum-6, Bot.-Up 43.8 68.5 43.8 57.8 33.9 23.4 41.8 24.6
MLP-sum-6, Bot.-Up + aux. loss 47.7 73.6 47.7 62.1 35.1 37.4 46.1 45.1


MLP-cat-6, ResNet 40.7 40.7 48.2 39.5 13.7 63.6 13.2 37.0
MLP-cat-6, ResNet + aux. loss 44.5 44.5 53.5 43.9 13.1 61.9 13.2 39.2
MLP-cat-6, Bot.-Up 50.4 80.6 50.4 59.7 48.6 37.2 44.0 40.4
MLP-cat-6, Bot.-Up + aux. loss 55.7 83.8 55.7 65.6 53.6 43.0 48.9 44.7


GRU, ResNet 20.8 20.8 24.7 20.2 12.6 18.1 13.1 18.4
GRU, ResNet + aux. loss 31.0 31.0 39.8 28.5 13.0 29.2 13.3 26.5
GRU, Bot.-Up 43.8 77.4 43.8 56.8 40.0 20.8 37.0 20.7
GRU, Bot.-Up + aux. loss 50.6 81.6 50.6 63.1 49.2 26.0 41.9 27.9


GRU-shared, ResNet 43.4 43.4 55.4 40.8 14.8 65.8 14.5 36.8
GRU-shared, ResNet + aux. loss 48.2 48.2 60.5 45.5 22.8 65.8 19.9 41.3
GRU-shared, Bot.-Up 46.7 77.1 46.7 59.6 44.1 25.9 38.3 26.9
GRU-shared, Bot.-Up + aux. loss 52.7 82.5 52.7 67.3 49.9 34.2 42.0 40.3


VQA-like, ResNet 36.7 36.7 52.2 33.1 20.7 54.0 17.3 26.3
VQA-like, ResNet + aux. loss 39.7 39.7 57.1 36.2 21.9 55.0 20.6 27.7
VQA-like, Bot.-Up 37.9 59.9 37.9 54.0 33.4 20.2 28.1 20.4
VQA-like, Bot.-Up + aux. loss 41.0 62.4 41.0 57.8 35.8 24.6 31.0 25.0


RN without panel IDs, ResNet 32.6 32.6 35.8 27.3 14.0 56.2 13.7 36.6
RN without panel IDs, ResNet + aux. loss 35.0 35.0 39.5 27.1 16.7 57.9 15.3 40.2
RN without panel IDs, Bot.-Up 35.8 46.4 35.8 40.6 28.0 15.1 40.9 14.0
RN without panel IDs, Bot.-Up + aux. loss 38.0 47.5 38.0 43.6 28.7 18.3 44.0 15.4


RN, ResNet 51.2 51.2 64.3 43.0 20.7 78.8 22.1 49.3
RN, ResNet + aux. loss 55.8 55.8 69.8 44.3 34.2 78.2 28.3 55.4
RN, Bot.-Up 55.4 83.0 55.4 68.3 46.2 31.3 54.0 33.6
RN, Bot.-Up + aux. loss 61.3 88.2 61.3 76.8 48.4 41.9 60.3 45.8


Table 4: Evaluation of all models in the neutral setting.


Model | Neutral setting | Generalization settings: interpolation, extrapolation, att.held, att.rel.held, obj.rel.held, obj.held


Human evaluation 77.8
Lower bound: RN (last row) with shuffled inputs 12.5 12.5 12.5 12.5 12.5 12.5 12.5


MLP-sum-2, ResNet 35.1 28.8 30.3 29.3 31.6 31.9 32.1
MLP-sum-2, ResNet + aux. loss 39.1 33.7 29.7 29.2 35.0 34.7 36.5
MLP-sum-2, Bot.-Up 36.4 30.3 28.0 29.5 31.0 31.3 35.4
MLP-sum-2, Bot.-Up + aux. loss 40.4 31.8 29.0 29.5 33.4 34.7 38.1


MLP-cat-2, ResNet 34.7 27.6 27.6 26.3 29.1 30.5 31.2
MLP-cat-2, ResNet + aux. loss 39.2 30.8 29.8 27.1 34.7 35.0 35.4
MLP-cat-2, Bot.-Up 37.6 30.3 29.2 29.1 30.3 36.1 35.8
MLP-cat-2, Bot.-Up + aux. loss 41.7 31.7 30.8 29.6 33.4 38.7 39.1


MLP-sum-4, ResNet 37.6 32.3 31.2 13.5 32.1 33.8 34.0
MLP-sum-4, ResNet + aux. loss 44.6 35.2 35.4 29.8 37.0 41.3 37.8
MLP-sum-4, Bot.-Up 41.4 32.7 29.9 33.1 34.5 34.2 37.5
MLP-sum-4, Bot.-Up + aux. loss 46.2 36.1 33.0 31.5 36.0 38.7 41.3


MLP-cat-4, ResNet 39.1 32.0 33.5 31.3 34.9 34.3 34.4
MLP-cat-4, ResNet + aux. loss 43.0 38.1 37.1 31.9 39.7 40.5 39.4
MLP-cat-4, Bot.-Up 46.2 38.1 36.4 33.1 37.0 43.4 42.4
MLP-cat-4, Bot.-Up + aux. loss 52.5 40.9 40.4 34.3 40.5 47.0 48.1


MLP-sum-6, ResNet 41.2 33.4 32.9 18.7 32.0 35.0 37.1
MLP-sum-6, ResNet + aux. loss 41.4 36.6 37.4 32.0 35.9 41.7 39.7
MLP-sum-6, Bot.-Up 43.8 32.5 30.0 32.5 33.9 36.3 41.0
MLP-sum-6, Bot.-Up + aux. loss 47.7 37.8 35.3 33.6 36.8 41.1 42.6


MLP-cat-6, ResNet 40.7 35.2 33.8 30.5 36.2 34.4 35.8
MLP-cat-6, ResNet + aux. loss 44.5 38.4 40.2 33.6 38.9 43.6 40.5
MLP-cat-6, Bot.-Up 50.4 40.4 39.0 36.4 39.3 46.9 45.8
MLP-cat-6, Bot.-Up + aux. loss 55.7 43.3 43.9 35.8 42.7 54.0 51.3


GRU, ResNet 20.8 30.0 23.9 25.4 12.6 20.8 12.7
GRU, ResNet + aux. loss 31.0 36.5 33.7 25.9 12.6 24.4 23.5
GRU, Bot.-Up 43.8 36.3 32.8 31.2 23.6 34.7 43.3
GRU, Bot.-Up + aux. loss 50.6 38.4 37.5 32.2 33.4 40.1 48.0


GRU-shared, ResNet 43.4 34.2 33.0 34.4 38.4 37.0 41.2
GRU-shared, ResNet + aux. loss 48.2 38.1 37.3 35.2 40.0 42.2 44.8
GRU-shared, Bot.-Up 46.7 36.8 36.4 35.4 37.4 41.1 44.5
GRU-shared, Bot.-Up + aux. loss 52.7 39.7 38.8 36.4 41.5 47.6 48.8


VQA-like, ResNet 36.7 28.1 27.7 29.4 34.0 33.2 35.0
VQA-like, ResNet + aux. loss 39.7 30.2 29.7 31.5 35.6 36.4 37.8
VQA-like, Bot.-Up 37.9 28.2 28.1 31.8 32.7 33.8 35.5
VQA-like, Bot.-Up + aux. loss 41.0 30.6 30.9 31.0 35.1 36.2 38.1


RN without panel IDs, ResNet 32.6 27.7 28.1 30.4 32.0 30.9 30.6
RN without panel IDs, ResNet + aux. loss 35.0 29.7 27.3 31.2 32.6 32.9 31.8
RN without panel IDs, Bot.-Up 35.8 31.1 29.1 32.6 32.7 32.2 32.5
RN without panel IDs, Bot.-Up + aux. loss 38.0 31.7 31.6 31.7 34.2 34.9 33.2


RN, ResNet 51.2 39.8 39.0 42.1 43.4 48.9 47.5
RN, ResNet + aux. loss 55.8 42.4 42.3 40.9 47.3 52.8 51.0
RN, Bot.-Up 55.4 43.4 39.7 43.6 42.2 51.9 50.6
RN, Bot.-Up + aux. loss 61.3 47.4 45.2 44.1 44.9 58.5 55.2


Table 5: Overall accuracy of all models trained/evaluated on splits requiring varying levels of generalization.
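As a worked reading of Table 5, the short sketch below computes the drop in accuracy between the neutral split and each generalization split for the best model (“RN, Bot.-Up + aux. loss”). The accuracies are copied from that row; the chance level assumes 8 candidate answers per instance, and the variable and function names are hypothetical.

```python
# Chance level with 8 candidate answers, matching the 12.5 lower bound.
CHANCE = 100.0 / 8

# Row "RN, Bot.-Up + aux. loss" of Table 5 (accuracy in %).
rn_accuracy = {
    "neutral": 61.3,
    "interpolation": 47.4,
    "extrapolation": 45.2,
    "att.held": 44.1,
    "att.rel.held": 44.9,
    "obj.rel.held": 58.5,
    "obj.held": 55.2,
}

def generalization_gaps(accuracy):
    """Accuracy lost on each generalization split relative to the neutral split."""
    neutral = accuracy["neutral"]
    return {split: round(neutral - value, 1)
            for split, value in accuracy.items() if split != "neutral"}

gaps = generalization_gaps(rn_accuracy)
hardest_split = max(gaps, key=gaps.get)  # split with the largest drop
```

For this row, the drop ranges from 2.8 points (obj.rel.held) to 17.2 points (att.held), while every setting stays well above the 12.5% chance level.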

Appendix D Dataset examples

We provide in Fig. 11 a random selection of instances from our dataset.

Figure 11: Additional examples from our dataset. The four rows respectively depict the relations And, Or, Union, and Progression. The correct answers are 4,3,6,1,2,1,4,7,2,5,5,8, referring to the candidate answers as 1–8 from left to right, first row then second row.