Achieving artificial visual reasoning, the ability to answer image-related questions that require a multi-step, high-level process, is an important step towards artificial general intelligence. This multi-modal task requires learning a question-dependent, structured reasoning process over images from language. Standard deep learning approaches tend to exploit biases in the data rather than learn this underlying structure, while leading methods learn to visually reason successfully but are hand-crafted for reasoning. We show that a general-purpose, Conditional Batch Normalization approach achieves state-of-the-art results on the CLEVR Visual Reasoning benchmark with a 2.4% error rate. We outperform the next best end-to-end method (4.5%) and even methods that use extra supervision (3.1%). We probe our model to shed light on how it reasons, showing it has learned a question-dependent, multi-step process. Previous work has operated under the assumption that visual reasoning calls for a specialized architecture, but we show that a general architecture with proper conditioning can learn to visually reason effectively.
The ability to use language to reason about everyday visual input is a fundamental building block of human intelligence. Achieving this capacity to visually reason is thus a meaningful step towards artificial agents that truly understand the world. Advances in both image-based learning and language-based learning using deep neural networks have led to huge strides in difficult tasks such as object recognition [1, 2] and machine translation [3, 4]. These advances have in turn fueled research on the intersection of visual and linguistic learning [5, 6, 7, 8, 9].
To this end, Johnson et al. recently proposed the CLEVR dataset to test multi-step reasoning from language about images, as traditional visual question-answering datasets such as [5, 7] ask simpler questions on images that can often be answered in a single glance. Examples from CLEVR are shown in Figure 1. Structured, multi-step reasoning is quite difficult for standard deep learning approaches [10, 11], including those successful on traditional visual question answering datasets. Previous work highlights that standard deep learning approaches tend to exploit biases in the data rather than reason [9, 12]. To overcome this, recent efforts have built new learning architectures that explicitly model reasoning or relational associations [10, 11, 13], some of which even outperform humans [10, 11].
We combine Conditional Batch Normalization with a Recurrent Neural Network (RNN) and a Convolutional Neural Network (CNN) to show that deep learning architectures built without strong priors can learn the underlying structure behind visual reasoning directly from language and images. We demonstrate this by achieving state-of-the-art visual reasoning on CLEVR and by finding structured patterns when exploring the internals of our model.
Our model processes the multi-modal question-image input using an RNN and a CNN combined via Conditional Batch Normalization (CBN). CBN has proven highly effective for image stylization [14, 16], speech recognition, and traditional visual question answering tasks. We start by explaining CBN in Section 2.1 and then describe our model in Section 2.2.
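Batch normalization (BN) is a widely used technique to improve neural network training by normalizing activations throughout the network with respect to each mini-batch. BN has been shown to accelerate training and improve generalization by reducing covariate shift throughout the network. To explain BN, we define $\mathcal{B} = \{F_{i,\cdot,\cdot,\cdot}\}_{i=1}^{N}$ as a mini-batch of $N$ samples, where $F$ corresponds to input feature maps whose subscripts $c, h, w$ refer to the $c^{th}$ feature map at the spatial location $(h, w)$. We also define $\gamma_c$ and $\beta_c$ as per-channel, trainable scalars and $\epsilon$ as a constant damping factor for numerical stability. BN is defined at training time as follows:

$$BN(F_{i,c,h,w} \mid \gamma_c, \beta_c) = \gamma_c \frac{F_{i,c,h,w} - \mathrm{E}_{\mathcal{B}}[F_{\cdot,c,\cdot,\cdot}]}{\sqrt{\mathrm{Var}_{\mathcal{B}}[F_{\cdot,c,\cdot,\cdot}] + \epsilon}} + \beta_c$$

Conditional Batch Normalization instead predicts the BN parameters $\gamma_{i,c}$ and $\beta_{i,c}$ as a function of some conditioning input $x_i$:

$$\gamma_{i,c} = f_c(x_i), \qquad \beta_{i,c} = h_c(x_i)$$

where $f$ and $h$ are arbitrary functions such as neural networks. Thus, $f$ and $h$ can learn to control the distribution of CNN activations based on $x_i$.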
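The difference between BN and CBN can be made concrete with a small sketch. The code below is an illustration only, not the authors' implementation: it uses plain nested lists indexed as `F[i][c][h][w]`, and the function names `channel_stats` and `conditional_batch_norm` are hypothetical. Statistics are shared across the mini-batch, while the scale and shift are supplied per sample and per channel.

```python
import math

def channel_stats(F):
    """Per-channel (mean, variance) over the batch and spatial dimensions."""
    stats = []
    for c in range(len(F[0])):
        values = [v for sample in F for row in sample[c] for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats.append((mean, var))
    return stats

def conditional_batch_norm(F, gamma, beta, eps=1e-5):
    """Normalize each channel, then scale and shift with gamma[i][c], beta[i][c].

    Plain BN is the special case where gamma and beta are identical for
    every sample i in the mini-batch.
    """
    stats = channel_stats(F)
    out = []
    for i, sample in enumerate(F):
        out_sample = []
        for c, fmap in enumerate(sample):
            mean, var = stats[c]
            scale = gamma[i][c] / math.sqrt(var + eps)
            out_sample.append(
                [[scale * (v - mean) + beta[i][c] for v in row] for row in fmap]
            )
        out.append(out_sample)
    return out

# Two samples, one channel, a 1x2 spatial grid (toy values).
F = [[[[1.0, 3.0]]], [[[5.0, 7.0]]]]
plain_bn = conditional_batch_norm(F, gamma=[[1.0], [1.0]], beta=[[0.0], [0.0]])
per_sample = conditional_batch_norm(F, gamma=[[1.0], [2.0]], beta=[[0.0], [1.0]])
```

In the second call, only the second sample's feature map is rescaled and shifted, which is the per-input control that conditioning provides.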
Combined with ReLU non-linearities, CBN empowers a conditioning model to manipulate feature maps of a target CNN by scaling them up or down, negating them, shutting them off, selectively thresholding them, and more. Each feature map is modulated independently, giving the conditioning model an exponential (in the number of feature maps) number of ways to affect the feature representation.
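As a rough illustration of these modulations, the sketch below applies a feature-wise scale and shift followed by ReLU to a handful of already-normalized activations. The helper `modulate` and the specific values of the scale and shift are assumptions chosen for illustration, not values from the paper.

```python
def relu(x):
    return max(0.0, x)

def modulate(feature_map, gamma, beta):
    """Feature-wise affine transform followed by ReLU on a flat list of activations."""
    return [relu(gamma * v + beta) for v in feature_map]

normalized = [-2.0, -0.5, 0.5, 2.0]

scaled_up = modulate(normalized, gamma=3.0, beta=0.0)     # amplifies positive activations
negated = modulate(normalized, gamma=-1.0, beta=0.0)      # passes formerly negative activations
shut_off = modulate(normalized, gamma=0.0, beta=0.0)      # silences the feature map
thresholded = modulate(normalized, gamma=1.0, beta=-1.0)  # keeps only activations above 1
```

Because each feature map receives its own scale and shift, these behaviors can be mixed freely across feature maps for a single input.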
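Rather than output $\gamma_{i,c}$ directly, we output $\Delta\gamma_{i,c}$, where:

$$\gamma_{i,c} = 1 + \Delta\gamma_{i,c}$$

since initially zero-centered $\gamma_{i,c}$ can zero out CNN feature map activations and thus gradients. In our implementation, we opt to output $\Delta\gamma_{i,c}$ rather than $\gamma_{i,c}$, but for simplicity, in the rest of this paper, we will explain our method using $\gamma_{i,c}$.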
Our model consists of a linguistic pipeline and a visual pipeline as depicted in Figure 2. The linguistic pipeline processes a question $x_i$ using a Gated Recurrent Unit (GRU) with 4096 hidden units that takes in learned, 200-dimensional word embeddings. The final GRU hidden state is a question embedding $e_q$. From this embedding, the model predicts the CBN parameters $(\gamma^{m,n}_{i,\cdot}, \beta^{m,n}_{i,\cdot})$ for the $n^{th}$ CBN layer of the $m^{th}$ residual block via linear projection with a trainable weight matrix $W$ and bias vector $b$:

$$(\gamma^{m,n}_{i,\cdot}, \beta^{m,n}_{i,\cdot}) = W^{m,n} e_q + b^{m,n}$$
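The linear projection from the question embedding to the CBN parameters of one layer can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `predict_cbn_params` is hypothetical, and stacking the scale rows above the shift rows in a single weight matrix is a layout chosen here for convenience, not something the paper specifies. The scale is produced as 1 plus a predicted offset, so a near-zero projection leaves the feature maps unchanged rather than zeroing them out.

```python
def predict_cbn_params(question_embedding, W, b, num_feature_maps):
    """Project a question embedding to (gamma, beta) for a single CBN layer.

    W has 2 * num_feature_maps rows (assumed layout): the first half produces
    the offsets delta_gamma, the second half produces beta.
    """
    projected = [
        sum(w * e for w, e in zip(row, question_embedding)) + bias
        for row, bias in zip(W, b)
    ]
    delta_gamma = projected[:num_feature_maps]
    beta = projected[num_feature_maps:]
    gamma = [1.0 + d for d in delta_gamma]  # gamma = 1 + delta_gamma
    return gamma, beta

# Toy sizes: a 2-dimensional embedding and 2 feature maps.
embedding = [1.0, 2.0]
W = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0], [0.0, -1.0]]
b = [0.0, 0.0, 0.0, 1.0]
gamma, beta = predict_cbn_params(embedding, W, b, num_feature_maps=2)
```

In the model described above, one such projection would exist per CBN layer, each with its own weight matrix and bias.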
The visual pipeline extracts image features using the conv4 layer of a ResNet-101 pre-trained on ImageNet, as done in prior work on CLEVR. Image features are processed by a convolution followed by several (3 for our model) CBN residual blocks with 128 feature maps, and a final classifier. The classifier consists of a convolution to 512 feature maps, global max-pooling, and a two-layer MLP with 1024 hidden units that outputs a distribution over final answers.
Each CBN residual block starts with a convolution followed by two convolutions with CBN, as depicted in Figure 2. Drawing from [11, 21], we concatenate coordinate feature maps indicating relative spatial position (scaled to a fixed range) to the image features, each residual block’s input, and the classifier’s input. We train our model end-to-end from scratch with Adam, early stopping on the validation set, weight decay, batch size 64, and BN and ReLU throughout the visual pipeline, using only image-question-answer triplets from the training set.
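The coordinate feature maps mentioned above can be sketched as two extra channels holding each location's relative horizontal and vertical position. The code is an illustration under assumptions: the range endpoints `low=-1.0` and `high=1.0` are defaults chosen here, and the helper names are hypothetical.

```python
def coordinate_maps(height, width, low=-1.0, high=1.0):
    """Return (x_map, y_map), each height x width, with positions scaled to [low, high].

    The default range is an assumption made for this sketch.
    """
    def scaled(index, size):
        if size == 1:
            return (low + high) / 2.0
        return low + (high - low) * index / (size - 1)

    x_map = [[scaled(w, width) for w in range(width)] for _ in range(height)]
    y_map = [[scaled(h, height) for _ in range(width)] for h in range(height)]
    return x_map, y_map

def append_coordinates(feature_maps):
    """Concatenate the two coordinate maps to a channels-first list of feature maps."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    x_map, y_map = coordinate_maps(height, width)
    return feature_maps + [x_map, y_map]

features = [[[0.2, 0.4, 0.6], [0.1, 0.3, 0.5]]]  # one 2x3 feature map (toy values)
with_coords = append_coordinates(features)        # now three channels
```

Giving convolutions access to these channels lets later layers condition on where an activation occurs, which matters for spatial questions such as those involving left or right.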
CLEVR is a generated dataset of 700K (image, question, answer, program) tuples. Images contain 3D-rendered objects of various shapes, materials, colors, and sizes. Questions are multi-step and compositional in nature, as shown in Figure 1. They range from counting questions ("How many green objects have the same size as the green metallic block?") to comparison questions ("Are there fewer tiny yellow cylinders than yellow metal cubes?") and can be 40+ words long. Answers are each one word from a set of possible answers. Programs are an additional supervisory signal consisting of step-by-step instructions, such as filter_shape[cube], relate[right], and count, on how to answer the question. Program labels are difficult to generate or come by for real-world datasets. Our model avoids using this extra supervision, learning to reason effectively directly from linguistic and visual input.
|Model||Overall||Count||Exist||Compare Numbers||Query Attribute||Compare Attribute|
|Q-type baseline||41.8||34.6||50.2||51.0||36.0||51.3|
|PG+EE (9K prog.)*||88.6||79.7||89.7||79.1||92.6||96.0|
|PG+EE (700K prog.)*||96.9||92.7||97.1||98.7||98.1||98.9|
Our results on CLEVR are shown in Table 1. Our model achieves a new overall state-of-the-art, outperforming humans and previous, leading models, which often use additional program supervision. Notably, CBN outperforms Stacked Attention networks (CNN+LSTM+SA in Table 1) by 21.0%. Stacked Attention networks are highly effective for visual question answering with simpler questions and are the previously leading model for visual reasoning that does not build in reasoning, making them a relevant baseline for CBN. We also note that our model’s pattern of performance more closely resembles that of humans than that of other models. Strong performance in the exist and query_attribute categories is perhaps explained by our model’s close resemblance to standard CNNs, which traditionally excel at these classification-type tasks. Our model also demonstrates strong performance on more complex categories such as count and compare_attribute.
Comparing numbers of objects gives our model more difficulty, understandably so; this question type requires more high-level reasoning steps (querying attributes, counting, and comparing) than other question types. The best PG+EE model beats our model here but is trained with extra supervision via 700K program labels. As shown in Table 1, the equivalent, more comparable PG+EE model, which uses 9K program labels, significantly underperforms our method in this category.
To understand what our model learns, we use t-SNE to visualize the CBN parameter vectors $(\gamma, \beta)$ of 2,000 random validation points, modulating the first and last CBN layers in our model, as shown in Figure 4. The parameters of the first and last CBN layers are grouped by the low-level and high-level reasoning functions necessary to answer CLEVR questions, respectively. For example, the CBN parameters for equal_color and query_color are close for the first layer but apart for the last layer, and the same is true for equal_shape and query_shape, equal_size and query_size, and equal_material and query_material. Conversely, equal_shape, equal_size, and equal_material CBN parameters are grouped in the last layer but split in the first layer. Similar patterns emerge when visualizing residual block activations. Thus, we see that CBN learns a sort of function-based modularity, directly from language and image inputs and without an architectural prior on modularity. Simply with end-to-end training, our model learns to handle not only different types of questions differently, but also different types of question sub-parts differently, working from low-level to high-level processes, as is the proper approach to answering CLEVR questions.
Additionally, we observe that many points that break the previously mentioned clustering patterns do so in meaningful ways. For example, Figure 4 shows that some count questions have last layer CBN parameters far from those of other count questions but close to those of exist questions. Closer examination reveals that these count questions have answers of either 0 or 1, making them similar to exist questions.
An analysis of our model’s errors reveals that 94% of its counting mistakes are off-by-one errors, indicating our model has learned underlying concepts behind counting, such as close relationships between close numbers.
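An analysis of this kind can be sketched in a few lines. The code below is an illustration only: the function name `off_by_one_fraction` is hypothetical and the predictions are made-up toy data, not results from the paper.

```python
def off_by_one_fraction(predicted, actual):
    """Fraction of counting mistakes where the prediction differs from the truth by exactly 1."""
    errors = [(p, a) for p, a in zip(predicted, actual) if p != a]
    if not errors:
        return 0.0
    off_by_one = sum(1 for p, a in errors if abs(p - a) == 1)
    return off_by_one / len(errors)

# Hypothetical predictions and ground-truth counts for six counting questions.
predicted = [2, 3, 5, 1, 0, 4]
actual = [2, 4, 5, 3, 1, 4]
fraction = off_by_one_fraction(predicted, actual)  # 2 of the 3 mistakes are off by one
```

A model guessing wrong answers uniformly at random would show a much lower fraction, which is why a high value suggests that nearby counts are represented as similar.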
As shown in Figure 3, our CBN model struggles more on questions that require more steps, as indicated by the length of the corresponding CLEVR programs; error rates for questions requiring 17 or more steps are more than three times higher than error rates for questions requiring 10 or fewer steps.
Furthermore, the model sometimes makes curious reasoning mistakes a human would not. In Figure 5, we show an example where our model correctly counts two cyan objects and two yellow objects but simultaneously does not answer that there are the same number of cyan and yellow objects. In fact, it does not answer that the number of cyan blocks is more, less, or equal to the number of yellow blocks. These errors could be prevented by directly minimizing logical inconsistency, which is an interesting avenue for future work orthogonal to our approach.
These types of mistakes in a state-of-the-art visual reasoning model suggest that more work is needed to truly achieve human-like reasoning and logical consistency. We view CLEVR as a curriculum of tasks and believe that the key to the most meaningful and advanced reasoning lies in tackling these last few percentage points of error.
|Question||Model answer|
|How many yellow things are there?||2|
|How many cyan things are there?||2|
|Are there as many yellow things as cyan things?||No|
|Are there more yellow things than cyan things?||No|
|Are there fewer yellow things than cyan things?||No|
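The inconsistency in this example can be detected mechanically. The sketch below is a hypothetical checker, not part of the model or its training: it compares the three comparison answers against the model's own counts and also verifies that exactly one of the three relations is affirmed.

```python
def is_consistent(count_a, count_b, equal, more, fewer):
    """True if the comparison answers agree with the model's own counts."""
    return (
        equal == (count_a == count_b)
        and more == (count_a > count_b)
        and fewer == (count_a < count_b)
    )

def exactly_one_relation(equal, more, fewer):
    """For any two counts, exactly one of equal / more / fewer must hold."""
    return [equal, more, fewer].count(True) == 1

# Answers from the example above: 2 yellow, 2 cyan, and "No" to all three comparisons.
consistent = is_consistent(2, 2, equal=False, more=False, fewer=False)
one_relation = exactly_one_relation(equal=False, more=False, fewer=False)
```

Both checks fail for the example, and the second fails regardless of the counts, so the answers are contradictory even without knowing how many objects there are.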
One leading approach for visual reasoning is the Program Generator + Execution Engine (PG+EE) model. This approach consists of a sequence-to-sequence “Program Generator” (PG), which takes in a question and outputs a sequence corresponding to a tree of composable Neural Modules, each of which is a two-layer residual block similar to ours. This tree of Neural Modules is assembled to form the Execution Engine (EE) that then predicts an answer from the image. The PG+EE model uses a strong prior by training with program labels and explicitly modeling the compositional nature of reasoning. Our approach learns to reason directly from textual input without using additional cues or a specialized architecture.
This modular approach is part of a recent line of work in Neural Module Networks [13, 25, 26]. Of these, End-to-End Module Networks (N2NMN) also tackle visual reasoning but do not perform as well as other approaches. These methods also use strong priors by modeling the compositionality of reasoning, using program-level supervision, and building per-module, hand-crafted neural architectures for specific functions.
Relation Networks (RNs) are another leading approach for visual reasoning. RNs use an MLP to carry out pairwise comparisons over each location of extracted convolutional features over an image, including LSTM-extracted question features as input to this MLP. RNs then element-wise sum over the resulting comparison vectors to form another vector from which a final classifier predicts the answer. This approach is end-to-end differentiable and trainable from scratch to high performance, as we show in Table 1. Our approach lifts the explicitly relational aspect of this model, freeing our approach from the use of a comparison-based prior, as well as the scaling difficulties of pairwise comparisons over spatial locations.
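The scaling concern with pairwise comparisons can be illustrated with a quick count. The sketch below assumes that every ordered pair of spatial locations, including self-pairs, is compared; the function name and the example grid sizes are hypothetical.

```python
def num_pairwise_comparisons(height, width):
    """Number of ordered location pairs (self-pairs included) on a height x width feature grid."""
    locations = height * width
    return locations * locations

# The pair count grows with the fourth power of the grid side length.
for side in (8, 14, 28):
    print(side, num_pairwise_comparisons(side, side))
```

Doubling the side length of the feature grid multiplies the number of comparisons by sixteen, whereas feature-wise conditioning applies the same per-channel scale and shift at every location.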
CBN itself has its own line of work. The results of [14, 16] show that the closely related Conditional Instance Normalization is able to successfully modulate a convolutional style-transfer network to quickly and scalably render an image in a huge variety of different styles, simply by learning to output a different set of BN parameters based on the target style. For visual question answering, which involves answering general questions, often about natural images, de Vries et al. show that CBN performs well on the real-world VQA and GuessWhat?! datasets, demonstrating CBN’s effectiveness beyond the simpler CLEVR images. Their architecture conditions 50 BN layers of a pre-trained ResNet. We show that a few layers of CBN after a ResNet can also be highly effective, even for complex problems. We also show how CBN models can learn to carry out multi-step processes and reason in a structured way, from low-level to high-level.
Additionally, CBN is essentially a post-BN, feature-wise affine conditioning, with BN’s trainable scalars turned off. Thus, there are many interesting connections with other conditioning methods. A common approach, used for example in Conditional DCGANs, is to concatenate constant feature maps of conditioning information to the input of convolutional layers, which amounts to adding a post-convolutional, feature-wise conditional bias. Other approaches, such as LSTMs and Hierarchical Mixtures of Experts, gate an input’s features as a function of that same input (rather than a separate, conditioning input), which amounts to a feature-wise, conditional scaling, restricted to between 0 and 1. CBN consists of both scaling and shifting, each unrestricted, giving it more capacity than many of these related approaches. We leave exploring these connections more in-depth for future work.
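The claim that concatenating constant conditioning maps amounts to a conditional bias can be checked with a short derivation. The notation below is introduced here for illustration and is not taken from the paper: $F_k$ are the $K$ input feature maps, $z_d$ are the $D$ conditioning values broadcast as constant maps, $a_{c,k}$ and $v_{c,d}$ are the convolution kernels applied to the feature and conditioning channels, and $\hat{F}$ denotes normalized activations. The derivation ignores zero-padded borders, where the constant maps are no longer constant within the kernel window.

```latex
% Convolution over the concatenated input [F ; z]:
\[
y_c(h, w) = \sum_{k=1}^{K} \sum_{p,q} a_{c,k}(p, q)\, F_k(h + p,\, w + q)
          + \sum_{d=1}^{D} z_d \sum_{p,q} v_{c,d}(p, q)
\]
% The second term is independent of (h, w), so it acts as a feature-wise bias
% that is linear in the conditioning vector z:
\[
b_c(z) = \sum_{d=1}^{D} z_d \sum_{p,q} v_{c,d}(p, q)
\]
% Gating: a scale restricted to (0, 1), computed from the same input:
\[
y = \sigma\big(g(F)\big) \odot F
\]
% CBN: an unrestricted scale and shift, computed from a separate input x_i:
\[
y_{i,c,h,w} = \gamma_{i,c}\, \hat{F}_{i,c,h,w} + \beta_{i,c}
\]
```

Seen this way, concatenation supplies only the shift and gating supplies only a bounded scale, while CBN supplies both terms without restriction.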
With a simple and general model based on CBN, we show it is possible to achieve state-of-the-art visual reasoning on CLEVR without explicitly incorporating reasoning priors. We show that our model learns an underlying structure required to answer CLEVR questions by finding clusters in the CBN parameters of our model; earlier parameters are grouped by low-level reasoning functions while later parameters are grouped by high-level reasoning functions. Simply by manipulating feature maps with CBN, an RNN can effectively use language to influence a CNN to carry out diverse and multi-step reasoning tasks over an image. It is unclear whether CBN is the most effective general way to use conditioning information for visual reasoning or other tasks, or what precisely makes CBN so effective. Other approaches [27, 28, 29, 30, 31, 32, 33] employ a similar, repetitive conditioning, so perhaps there is an underlying principle that explains the success of these approaches. Regardless, we believe that CBN is a general and powerful technique for multi-modal and conditional tasks, especially where more complex structure is involved.
We would like to thank the developers of PyTorch (http://pytorch.org/) for their elegant deep learning framework. Our implementation was also based on existing open-source code. We thank Mohammad Pezeshki, Dzmitry Bahdanau, Yoshua Bengio, Nando de Freitas, Joelle Pineau, Olivier Pietquin, Jérémie Mary, Chin-Wei Huang, Layla Asri, and Max Smith for helpful feedback and discussions, as well as Justin Johnson for CLEVR test set evaluations. We thank NVIDIA for donating a DGX-1 computer used in this work. We also acknowledge FRQNT through the CHIST-ERA IGLU project and CPER Nord-Pas de Calais, Collège Doctoral Lille Nord de France and FEDER DATA Advanced data science and technologies 2015-2020 for funding our research.
D. Geman, S. Geman, N. Hallonquist, and L. Younes, “Visual Turing test for computer vision systems,” Proceedings of the National Academy of Sciences, vol. 112, no. 12, pp. 3618–3623, 2015.
S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997. [Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735