In recent years, neural network based models have become the workhorse of natural language understanding and generation. They empower industrial systems in machine translation (Wu et al., 2016) and text generation (Kannan et al., 2016), and show state-of-the-art performance on numerous benchmarks including Recognizing Textual Entailment (RTE) (Gong et al., 2017), Visual Question Answering (VQA) (Jiang et al., 2018), and Reading Comprehension (Wang et al., 2018). Despite these successes, a growing body of literature suggests that these approaches do not generalize outside of the specific distributions on which they are trained, something that is necessary for a language understanding system to be widely deployed in the real world. Investigations on the three aforementioned tasks have shown that neural models easily latch onto statistical regularities which are omnipresent in existing datasets (Agrawal et al., 2016; Gururangan et al., 2018; Jia & Liang, 2017) and extremely hard to avoid in large scale data collection. Having learned such dataset-specific solutions, neural networks fail to make correct predictions for examples that are even slightly out of domain, yet are trivial for humans. These findings have been corroborated by a recent investigation on a synthetic instruction-following task (Lake & Baroni, 2017), in which seq2seq models (Sutskever et al., 2014; Bahdanau et al., 2015) have shown little systematicity (Fodor & Pylyshyn, 1988) in how they generalize; that is, they do not learn general rules on how to compose words and fail spectacularly when, for example, asked to interpret “jump twice” after training on “jump”, “run twice” and “walk twice”.
An appealing direction to improve the generalization capabilities of neural models is to add modularity and structure to their design to make them structurally resemble the kind of rules they are supposed to learn (Andreas et al., 2016; Gaunt et al., 2016). For example, in the Neural Module Network paradigm (NMN, Andreas et al. (2016)), a neural network is assembled from several neural modules, where each module is meant to perform a particular subtask of the input processing, much like a computer program composed of functions. The NMN approach is intuitively appealing but its widespread adoption has been hindered by the large amount of domain knowledge that is required to decide (Andreas et al., 2016) or predict (Johnson et al., 2017; Hu et al., 2017) how the modules should be created (parametrization) and how they should be connected (layout) based on a natural language utterance. Besides, their performance has often been matched by more traditional neural models, such as FiLM (Perez et al., 2017), Relation Networks (Santoro et al., 2017), and MAC networks (Hudson & Manning, 2018). Lastly, generalization properties of NMNs, to the best of our knowledge, have not been rigorously studied prior to this work.
Here, we investigate the impact of explicit modularity and structure on systematic generalization of NMNs and contrast their generalization abilities to those of generic models. For this case study, we focus on the task of visual question answering (VQA), in particular its simplest binary form, when the answer is either “yes” or “no”. Such a binary VQA task can be seen as a fundamental task of language understanding, as it requires one to evaluate the truth value of the utterance with respect to the state of the world. Among many systematic generalization requirements that are desirable for a VQA model, we choose the following basic one: a good model should be able to reason about all possible object combinations despite being trained on a very small subset of them. We believe that this is a key prerequisite to using VQA models in the real world, because they should be robust at handling unlikely combinations of objects. We implement our generalization demands in the form of a new synthetic dataset, called Spatial Queries On Object Pairs (SQOOP), in which a model has to perform basic spatial relational reasoning about pairs of randomly scattered letters and digits in the image (e.g. answering the question “Is there a letter A left of a letter B?”). The main challenge in SQOOP is that models are evaluated on all possible object pairs, but trained on only a subset of them.
Our first finding is that NMNs do generalize better than other neural models when layout and parametrization are chosen appropriately. We then investigate which factors contribute to improved generalization performance and find that using a layout that matches the task (i.e. a tree layout, as opposed to a chain layout), is crucial for solving the hardest version of our dataset. Lastly, and perhaps most importantly, we experiment with existing methods for making NMNs more end-to-end by inducing the module layout (Johnson et al., 2017) or learning module parametrization through soft-attention over the question (Hu et al., 2017). Our experiments show that such end-to-end approaches often fail by not converging to tree layouts or by learning a blurred parameterization for modules, which results in poor generalization on the hardest version of our dataset. We believe that our findings challenge the intuition of researchers in the field and provide a foundation for improving systematic generalization of neural approaches to language understanding.
2 The Dataset For Testing Systematic Generalization
We perform all experiments of this study on the SQOOP dataset. SQOOP is a minimalistic VQA task that is designed to test the model’s ability to interpret unseen combinations of known relation and object words. Clearly, given known objects X, Y and a known relation R, a human can easily verify whether or not the objects X and Y are in relation R. Some instances of such queries are common in daily life (is there a cup on the table), some are extremely rare (is there a violin under the car), and some are unlikely but have similar, more likely counterparts (is there grass on the frisbee vs is there a frisbee on the grass). Still, a person can easily answer these questions by understanding them as just the composition of the three separate concepts. Such compositional reasoning skills are clearly necessary for language understanding models, and SQOOP is explicitly designed to test for them.
Concretely speaking, SQOOP requires observing a 64 × 64 RGB image and answering a yes-no question about whether objects X and Y are in a spatial relation R. The questions are represented in a redundancy-free form; we did not aim to make the questions look like natural language. Each image contains 5 randomly chosen and randomly positioned objects. There are 36 objects: the Latin letters A-Z and digits 0-9, and there are 4 relations: left_of, right_of, above, and below. This results in 36 · 35 · 4 = 5040 possible unique questions (we do not allow questions about identical objects). To make negative examples challenging, we ensure that both X and Y of a question are always present in the associated image and that there are always distractor objects Y' ≠ Y and X' ≠ X such that X R Y' and X' R Y are both true for the image. These extra precautions guarantee that answering a question requires the model to locate all possible X and Y and then check if any pair of them is in the relation R. Two examples are shown in Figure 2.
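To make the question semantics concrete, the following is a minimal sketch of how the ground-truth answer to one question could be computed from a symbolic scene description. The scene format (a list of symbol and coordinate triples), the function name `answer`, and the convention that rows grow downward are assumptions made for illustration; they are not taken from the dataset's generation code.

```python
# Hypothetical scene format: a list of (symbol, column, row) triples,
# with rows increasing downward as in image coordinates (an assumption).
RELATION_CHECKS = {
    "left_of": lambda a, b: a[0] < b[0],
    "right_of": lambda a, b: a[0] > b[0],
    "above": lambda a, b: a[1] < b[1],
    "below": lambda a, b: a[1] > b[1],
}


def answer(scene, x, relation, y):
    """Return True if some instance of x stands in `relation` to some instance of y."""
    xs = [(col, row) for sym, col, row in scene if sym == x]
    ys = [(col, row) for sym, col, row in scene if sym == y]
    check = RELATION_CHECKS[relation]
    # The answer is "yes" as soon as any (x, y) pair satisfies the relation.
    return any(check(a, b) for a in xs for b in ys)


scene = [("A", 10, 10), ("B", 40, 10), ("C", 5, 50)]
print(answer(scene, "A", "left_of", "B"))
```

Because the check ranges over all instances of both symbols, a model cannot shortcut the task by locating only one of the two objects.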
Our goal is to discover which models can correctly answer questions about all 36 · 35 possible object pairs in SQOOP after having been trained on only a subset. For this purpose we build training sets containing 36 · 4 · k unique questions by sampling k different right-hand-side (RHS) objects Y1, Y2, …, Yk for each left-hand-side (LHS) object X. We use this procedure instead of just uniformly sampling object pairs in order to ensure that each object appears in at least one training question, thereby keeping all versions of the dataset solvable. We will refer to k as the #rhs/lhs parameter of the dataset. Our test set is composed of the remaining 36 · 4 · (35 − k) questions. We generate training and test sets for #rhs/lhs values of 1, 2, 4, 8 and 18, as well as a control version of the dataset, #rhs/lhs=35, in which both the training and the test set contain all the questions (with different images). Note that lower #rhs/lhs versions are harder for generalization due to the presence of spurious correlations between LHS and RHS objects to which the models may adapt. In the extreme case of #rhs/lhs=1, a model may learn to predict the RHS object from the LHS. In order to exclude a possible confounding factor of overfitting on the training images, all our training sets contain 1 million examples, so for a dataset with #rhs/lhs = k we generate approximately 10^6 / (36 · 4 · k) different images per question. Pseudocode for generating SQOOP can be found in Appendix C.
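The question-level split described above can be sketched in a few lines of Python. This is an illustration only, not the pseudocode of Appendix C: the function name `make_split`, the seeding, and the choice to share the sampled RHS objects across all four relations are assumptions.

```python
import itertools
import random

OBJECTS = [chr(c) for c in range(ord("A"), ord("Z") + 1)] + [str(d) for d in range(10)]
RELATIONS = ["left_of", "right_of", "above", "below"]


def make_split(k, seed=0):
    """Return (train, test) sets of (X, R, Y) questions for a given #rhs/lhs value k."""
    rng = random.Random(seed)
    all_questions = {
        (x, r, y)
        for x, y in itertools.permutations(OBJECTS, 2)  # X and Y must differ
        for r in RELATIONS
    }
    train = set()
    for x in OBJECTS:
        candidates = [y for y in OBJECTS if y != x]
        # Assumption: the k sampled RHS objects are reused for every relation.
        for y in rng.sample(candidates, k):
            for r in RELATIONS:
                train.add((x, r, y))
    test = all_questions - train
    if not test:
        # Control split (k = 35): train and test share all questions;
        # only the images would differ.
        test = set(all_questions)
    return train, test


train, test = make_split(k=1)
print(len(train), len(test))
```

Sampling per LHS object, rather than uniformly over pairs, is what guarantees that every object occurs in at least one training question.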
A great variety of VQA models have been recently proposed in the literature, among which we can distinguish two trends. Some of the recently proposed models, such as FiLM (Perez et al., 2017) and Relation Networks (RelNet, Santoro et al. (2017)) are highly generic and do not require any task-specific knowledge to be applied on a new dataset. On the opposite end of the spectrum are modular and structured models, typically flavours of Neural Module Networks (Andreas et al., 2016), that require some knowledge about the task at hand to be instantiated. Here, we evaluate systematic generalization of several state-of-the-art models in both families. In all models, the image $x$ is first fed through a CNN-based network, which we refer to as the stem, to produce a feature-level 3D tensor $h_x$. This is passed through a model-specific computation conditioned on the question $q$ to produce a joint representation $h_{qx}$. Lastly, this representation is fed into a fully-connected classifier network to produce logits for prediction. Therefore, the main difference between the models we consider is how the computation $h_{qx} = \mathrm{model}(h_x, q)$ is performed.
3.1 Generic Models
We consider four generic models in this paper: CNN+LSTM, FiLM, Relation Network (RelNet), and Memory-Attention-Composition (MAC) network. For CNN+LSTM, FiLM, and RelNet models, the question $q$ is first encoded into a fixed-size representation $h_q$ using a unidirectional LSTM network. CNN+LSTM flattens the 3D tensor $h_x$ to a vector and concatenates it with $h_q$ to produce $h_{qx}$.
RelNet (Santoro et al., 2017) uses a network $g$ which is applied to all pairs of feature columns of $h_x$ concatenated with the question representation $h_q$, all of which is then pooled to obtain $h_{qx}$:
$$h_{qx} = \sum_{i,j} g(h_x(i), h_x(j), h_q),$$
where $h_x(i)$ is the $i$-th feature column of $h_x$. FiLM networks (Perez et al., 2017) use $N$ convolutional FiLM blocks applied to $h_x$. A FiLM block is a residual block (He et al., 2016) in which a feature-wise affine transformation (FiLM layer) is inserted after the 2nd convolutional layer. The FiLM layer is conditioned on the question at hand via prediction of the scaling and shifting parameters $\gamma_n$ and $\beta_n$:
$$[\gamma_n; \beta_n] = W^n_q h_q + b^n_q,$$
$$\tilde{h}^n_{qx} = BN(W^n_2 * \mathrm{ReLU}(W^n_1 * h^{n-1}_{qx} + b_n)),$$
$$h^n_{qx} = h^{n-1}_{qx} + \mathrm{ReLU}(\gamma_n \odot \tilde{h}^n_{qx} + \beta_n),$$
where $BN$ stands for batch normalization, $*$ stands for convolution and $\odot$ stands for element-wise multiplication. $h^n_{qx}$ is the output of the $n$-th FiLM block and $h^0_{qx} = h_x$. The output of the last FiLM block $h^N_{qx}$ undergoes an extra 1 × 1 convolution and max-pooling to produce $h_{qx}$. The MAC network of Hudson & Manning (2018) produces $h_{qx}$ by repeatedly applying a Memory-Attention-Composition (MAC) cell that is conditioned on the question through an attention mechanism. The MAC model is quite complex and we refer the reader to the original paper for details.
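As a rough illustration of the two conditioning mechanisms described above, the sketch below implements a feature-wise affine (FiLM-style) layer and a RelNet-style pairwise pooling in plain Python on nested lists. The function names, the toy pairwise network `g`, and the use of summation as the pooling operation are assumptions chosen for brevity; real implementations operate on convolutional tensors and learned networks.

```python
def film_layer(features, gamma, beta):
    """Scale and shift each channel: features[c][i] -> gamma[c] * features[c][i] + beta[c]."""
    return [
        [g * value + b for value in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


def relnet_pool(columns, question, g):
    """Apply g to every ordered pair of feature columns joined with the question vector,
    then sum the results (sum pooling is an assumption of this sketch)."""
    total = None
    for col_i in columns:
        for col_j in columns:
            out = g(col_i + col_j + question)  # list concatenation
            total = out if total is None else [t + o for t, o in zip(total, out)]
    return total


# Toy usage: two channels with two positions each, then two 1-d feature columns.
print(film_layer([[1.0, 2.0], [3.0, 4.0]], gamma=[2.0, 0.5], beta=[1.0, -1.0]))
print(relnet_pool([[1.0], [2.0]], question=[10.0], g=lambda v: [sum(v)]))
```

The contrast to note is that the FiLM-style layer lets the question modulate features channel by channel, whereas the RelNet-style pooling feeds the question into every pairwise comparison.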
3.2 Neural Module Networks
Neural Module Networks (NMN) (Andreas et al., 2016) are an elegant solution that constructs a question-specific network by composing together trainable neural modules, drawing inspiration from symbolic approaches to question answering (Malinowski & Fritz, 2014) . To answer a question with an NMN, one first constructs the computation graph by making the following decisions: (a) how many modules and of which types will be used, (b) how will the modules be connected to each other, and (c) how are these modules parametrized based on the question. We refer to the aspects (a) and (b) of the computation graph as the layout and the aspect (c) as the parametrization. In the original NMN and in many follow-up works, different module types are used to perform very different computations, e.g. the Find module from Hu et al. (2017) performs trainable convolutions on the input attention map, whereas the And module from the same paper computes an element-wise maximum for two input attention maps. In this work, we follow the trend of using more homogeneous modules started by Johnson et al. (2017), who use only two types of modules: unary and binary, both performing similar computations. We go one step further and retain a single binary module type, using a zero tensor for the second input when only one input is available. Additionally, we choose to use exactly three modules, which simplifies the layout decision to just determining how the modules are connected. Our preliminary experiments have shown that, even after these simplifications, NMNs are far ahead of other models in terms of generalization.
In the original NMN, the layout and parametrization were set in an ad-hoc manner for each question by analyzing a dependency parse. In the follow-up works (Johnson et al., 2017; Hu et al., 2017), these aspects of the computation are predicted by learnable mechanisms with the goal of reducing the amount of background knowledge required to apply the NMN approach to a new task. We experiment with the End-to-end NMN (N2NMN) (Hu et al., 2017) paradigm from this family, which predicts the layout with a seq2seq model (Sutskever et al., 2014) and computes the parametrization of the modules using a soft attention mechanism. Since all the questions in SQOOP have the same structure, we do not employ a seq2seq model but instead have a trainable layout variable and trainable attention variables for each module.
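Formally, our NMN is constructed by repeatedly applying a generic neural module $f(\theta, \gamma, s^0, s^1)$, which takes as inputs the shared parameters $\theta$, the question-specific parametrization $\gamma$ and the left-hand side and right-hand side inputs $s^0$ and $s^1$. $M$ such modules are connected and conditioned on a question $q = (q_1, q_2, q_3)$ as follows:
$$\gamma_k = \sum_{i=1}^{3} \alpha^{k,i} e(q_i),$$
$$s^m_k = \sum_{j=-1}^{k-1} \tau^{k,j}_m s_j,$$
$$s_k = f(\theta, \gamma_k, s^0_k, s^1_k),$$
$$h_{qx} = s_M.$$
In the equations above, $s_{-1} = 0$ is the zero tensor input, $s_0 = h_x$ are the image features outputted by the stem, $e$ is the embedding table for the question words, and we refer to $A = (\alpha^{k,i})$ and $T = (\tau^{k,j}_m)$ as the parametrization attention matrix and the layout tensor respectively.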
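The composition scheme can be illustrated with a small sketch in which each module's parametrization is an attention-weighted sum of word embeddings and each module input is a layout-weighted sum of the zero input, the image features, and earlier module outputs. Everything here is a toy under stated assumptions: scalars stand in for tensors, `toy_module` stands in for a learned module, and the `TREE` and `CHAIN` layouts are one plausible encoding of tree-like and chain-like connectivity rather than the exact tensors used in the experiments.

```python
def run_nmn(module, embeddings, alpha, tau, image_features, zero=0.0):
    """Apply len(alpha) modules in sequence and return the last module's output.

    alpha[k][i]  : attention of module k on question word i
    tau[k][m][j] : weight of available state j for input m (0 = LHS, 1 = RHS) of module k
    Available states are ordered as [zero input, image features, output 1, output 2, ...].
    """
    states = [zero, image_features]
    for k in range(len(alpha)):
        gamma = sum(a * e for a, e in zip(alpha[k], embeddings))
        lhs = sum(t * s for t, s in zip(tau[k][0], states))
        rhs = sum(t * s for t, s in zip(tau[k][1], states))
        states.append(module(gamma, lhs, rhs))
    return states[-1]


# Hard-coded attention: modules 1, 2, 3 read words X, Y, R respectively.
ALPHA = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Tree: modules 1 and 2 read the image; module 3 combines their outputs.
TREE = [
    ([0, 1], [1, 0]),
    ([0, 1, 0], [1, 0, 0]),
    ([0, 0, 1, 0], [0, 0, 0, 1]),
]
# Chain: each module reads the previous output; the RHS input is the zero input.
CHAIN = [
    ([0, 1], [1, 0]),
    ([0, 0, 1], [1, 0, 0]),
    ([0, 0, 0, 1], [1, 0, 0, 0]),
]


def toy_module(gamma, lhs, rhs):
    return gamma * lhs + rhs


embeddings = [2.0, 3.0, 5.0]  # toy scalar "embeddings" for X, Y, R
print(run_nmn(toy_module, embeddings, ALPHA, TREE, image_features=1.0))
print(run_nmn(toy_module, embeddings, ALPHA, CHAIN, image_features=1.0))
```

With identical modules and attention, changing only the layout weights changes the computed function, which is the degree of freedom that the layout experiments vary.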
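We experiment with two choices for the NMN’s generic neural module: the Find module from Hu et al. (2017) and the Residual module from Johnson et al. (2017) with very minor modifications: we use 64-dimensional CNNs in our Residual blocks since our dataset consists of 64 × 64 images. The equations for the Residual module are as follows:
$$\gamma_k = [W^k_1; b^k_1; W^k_2; b^k_2; W^k_3; b^k_3],$$
$$\tilde{s}_k = \mathrm{ReLU}(W^k_3 * [s^0_k; s^1_k] + b^k_3),$$
$$f_{Residual}(\gamma_k, s^0_k, s^1_k) = \mathrm{ReLU}(\tilde{s}_k + W^k_1 * \mathrm{ReLU}(W^k_2 * \tilde{s}_k + b^k_2) + b^k_1),$$
and for the Find module as follows:
$$\theta = [W_1; b_1; W_2; b_2],$$
$$f_{Find}(\theta, \gamma_k, s^0_k, s^1_k) = \mathrm{ReLU}(W_1 * (\gamma_k \odot \mathrm{ReLU}(W_2 * [s^0_k; s^1_k] + b_2)) + b_1).$$
In the formulas above all $W$ are convolution weights, and $b_1$, $b_2$, $b_3$ are biases. The main difference between Residual and Find is that in Residual all parameters depend on the question words, whereas in Find the convolutional weights are the same for all questions, and only the element-wise multipliers $\gamma_k$ vary based on the question. We note that the specific Find module we use in this work is slightly different from the one used in Hu et al. (2017) in that it outputs a feature tensor, not just an attention map. This change was required in order to connect multiple Find modules in the same way as we connect multiple Residual ones.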
Based on the generic NMN model described above, we experiment with several specific architectures as shown in Figure 1. Each of the models uses $M = 3$ modules, which are connected and parametrized differently. In NMN-Chain modules form a sequential chain. Modules 1, 2 and 3 are parametrized based on the first object word, second object word and the relation word respectively, which is achieved by setting the attention vectors $\alpha_1$, $\alpha_2$, $\alpha_3$ to the corresponding one-hot vectors. We also experiment with giving the image features as the right-hand side input to all 3 modules and call the resulting model NMN-Chain-Shortcut. NMN-Tree is similar to NMN-Chain in that the attention vectors are similarly hard-coded, but we change the connectivity between the modules to be tree-like. Stochastic N2NMN follows the N2NMN approach by Hu et al. (2017) for inducing layout. We treat the layout $T$ as a stochastic latent variable. $T$ is allowed to take two values: $T_{tree}$ as in NMN-Tree, and $T_{chain}$ as in NMN-Chain. We calculate the output probabilities by marginalizing out the layout, i.e. the probability of the answer being “yes” is computed as $p(\mathrm{yes}|x, q) = \sum_{T \in \{T_{tree}, T_{chain}\}} p(\mathrm{yes}|T, x, q)\, p(T)$. Lastly, Attention N2NMN uses the N2NMN method for learning parametrization (Hu et al., 2017). It is structured just like NMN-Tree but has $\alpha_k$ computed as $\mathrm{softmax}(\tilde{\alpha}_k)$, where $\tilde{\alpha}_k$ is a trainable vector. We use Attention N2NMN only with the Find module because using it with the Residual module would involve a highly non-standard interpolation between convolutional weights.
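The two induction mechanisms just described reduce to simple operations: a mixture over two layouts and a softmax over trainable attention logits. The sketch below shows both in plain Python; the function names and the example probabilities are hypothetical and serve only to show the arithmetic.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of pre-softmax attention weights."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def p_yes_marginal(p_tree, p_yes_given_tree, p_yes_given_chain):
    """Marginalize the answer probability over the two candidate layouts."""
    return p_tree * p_yes_given_tree + (1.0 - p_tree) * p_yes_given_chain


# A layout distribution stuck halfway between tree and chain dilutes a
# confident tree prediction with a near-random chain prediction.
print(p_yes_marginal(p_tree=0.5, p_yes_given_tree=0.98, p_yes_given_chain=0.5))
print(softmax([0.0, 0.0, 0.0]))
```

A mixed layout distribution can therefore still produce the right answer most of the time while assigning it much lower confidence than a sharp tree layout would.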
In our experiments we aimed to: (a) understand which models are capable of exhibiting systematic generalization as required by SQOOP, and (b) understand whether it is possible to induce, in an end-to-end way, the successful architectural decisions that lead to systematic generalization.
All models share the same stem architecture which consists of 6 layers of convolution (8 for Relation Networks), batch normalization and max pooling. The input to the stem is a 64 × 64 × 3 image, and the feature dimension used throughout the stem is 64. Further details can be found in Appendix A. The code for all experiments will be released in the near future.
4.1 Which Models Generalize Better?
We report the performance for all models on datasets of varying difficulty in Figure 3. Our first observation is that the modular and tree-structured NMN-Tree model exhibits strong systematic generalization. Both versions of this model, with Find and Residual modules, robustly solve all versions of our dataset, including the most challenging #rhs/lhs=1 split.
The results of NMN-Tree should be contrasted with those of generic models. Two of the four models (CNN+LSTM and RelNet) are not able to learn to answer all questions, no matter how easy the split is (for high #rhs/lhs CNN+LSTM overfitted and RelNet did not train). The results of the other two models, MAC and FiLM, are similar. Both models are clearly able to solve the task, as suggested by their near-zero error rate on the control #rhs/lhs=35 split, yet they struggle to generalize on splits with lower #rhs/lhs. In particular, we observe errors for MAC and errors for FiLM on the hardest #rhs/lhs=1 split. For the splits of intermediate difficulty we saw the error rates of both models decreasing as we increased the #rhs/lhs ratio from 2 to 18. Interestingly, even with 18 #rhs/lhs some MAC and FiLM runs result in a test error rate of . Given the simplicity and minimalism of SQOOP questions, we believe that these results should be considered a failure to pass the SQOOP test for both MAC and FiLM. That said, we note a difference in how exactly FiLM and MAC fail on #rhs/lhs=1: in several runs (3 out of 15) MAC exhibits a strong generalization performance ( error rate), whereas in all runs of FiLM the error rate is about . We examine the successful MAC models and find they have converged to a successful setting of the control attention weights, that is, the weights with which MAC units attend to question words. In particular, MAC models that generalize strongly would, for each question, have a unit focusing strongly on X and a unit focusing strongly on Y (see Appendix B for more details). As MAC was the strongest competitor of NMN-Tree among generic models, we have performed an ablation study for this model, in which we varied the number of modules and hidden units, as well as experimented with weight decay. These modifications have not resulted in any significant reduction of the gap between MAC and NMN-Tree. Interestingly, we found that using the default high number of MAC units, namely 12, was helpful, possibly because it made it more likely that some units are initialized to focus on the X and Y words (see Appendix B for details).
Comparing NMNs with different layouts and modules. We can clearly observe the superior generalization of trees, poor generalization of chains and mediocre generalization of chains with shortcuts. Means and standard deviations after at least 5 runs are reported.
4.2 What is Essential to Strong Generalization of NMN?
The superior generalization performance of NMN-Tree raises the following question: what is the key architectural difference between NMN-Tree and generic models that explains the performance gap between them? We consider two candidate explanations. First, the NMN-Tree model differs from the generic models in that it does not use a language encoder and is instead built from modules that are parametrized by question words directly. Second, NMN-Tree is structured in a particular way, with the idea that modules 1 and 2 may learn to locate objects and module 3 can learn to reason about object locations independently of their identities. To understand which of the two differences is responsible for the superior generalization, we compare the performance of the NMN-Tree, NMN-Chain and NMN-Chain-Shortcut models (see Figure 1). These 3 versions of NMN are similar in that none of them uses a language encoder, but they differ in how the modules are connected. The results in Figure 3 show that for both Find and Residual module architectures, using a tree layout is absolutely crucial (and sufficient) for generalization, meaning that the generalization gap between NMN-Tree and generic models cannot be explained merely by the language encoding step in the latter. In particular, NMN-Chain models perform barely above random chance, doing even worse than generic models on the #rhs/lhs=1 version of the dataset and dramatically failing even on the easiest #rhs/lhs=18 split. This is in stark contrast with NMN-Tree models, which exhibit nearly perfect performance on the hardest #rhs/lhs=1 split. As a sanity check we trained NMN-Chain models on the vanilla #rhs/lhs=35 split. We found that the NMN-Chain model has little difficulty learning to answer questions when it sees all of them at training time, even though it shows very poor generalization in our other experiments. Interestingly, NMN-Chain-Shortcut performed much better than NMN-Chain and quite similarly to generic models. We find it remarkable that such a slight change in the model layout as adding the shortcut connections from image features to the chain modules results in a drastic change in generalization performance. In an attempt to understand why NMN-Chain generalizes so poorly we compared the test set responses of the 5 NMN-Chain models trained on the #rhs/lhs=1 split. Notably, there was very little agreement between the predictions of these 5 runs (as measured by Fleiss' kappa), suggesting that NMN-Chain performs rather randomly outside of the training set.
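For readers unfamiliar with the agreement statistic mentioned above, the following is a small self-contained implementation of the standard Fleiss' kappa formula, treating each trained run as a rater and each test example as an item. The helper `counts_from_predictions` and the toy predictions are hypothetical and are not data from the experiments.

```python
def counts_from_predictions(predictions, n_categories=2):
    """predictions[run][item] is a category index; returns per-item category counts."""
    n_items = len(predictions[0])
    counts = [[0] * n_categories for _ in range(n_items)]
    for run in predictions:
        for item, category in enumerate(run):
            counts[item][category] += 1
    return counts


def fleiss_kappa(counts):
    """counts[i][c] = number of raters that assigned item i to category c.
    Undefined (division by zero) when all ratings fall in a single category."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Mean observed agreement across items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items
    # Agreement expected by chance from the marginal category frequencies.
    marginals = [
        sum(row[c] for row in counts) / (n_items * n_raters)
        for c in range(n_categories)
    ]
    p_expected = sum(p * p for p in marginals)
    return (p_observed - p_expected) / (1.0 - p_expected)


# Five hypothetical runs answering four test questions (1 = "yes", 0 = "no").
runs = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 1]]
print(fleiss_kappa(counts_from_predictions(runs)))
```

Values near 1 indicate that the runs agree far more often than chance, while values near 0 or below indicate agreement no better than chance.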
4.3 Can the Right Kind of NMN Be Induced?
The strong generalization of the NMN-Tree model is impressive, but a significant amount of prior knowledge about the task was required to come up with the successful layout and parametrization used in this model. We therefore investigate whether the amount of such prior knowledge can be reduced by fixing one of these structural aspects and inducing another.
4.3.1 Layout Induction
In our layout induction experiments, we use the Stochastic N2NMN model which treats the layout as a stochastic latent variable with two values ( and , see Section 3.2 for details). We experiment with N2NMNs using both and modules and report results with different initial conditions, . We believe that the initial probability should not be considered small, as in more challenging datasets the space of layouts would be exponentially large, and sampling the right layout in 10% of all cases could be considered a very lucky initialization. We repeat all experiments on #rhs/lhs=1 and on #rhs/lhs=18 splits, the former to study generalization, and the latter to control whether the failures on #rhs/lhs=1 are caused specifically by the difficulty of this split. The results (see Table 1) show that the success of layout induction (i.e. converging to a close to ) depends in a complex way on all the factors that we considered in our experiments. The initialization has the most influence: models initialized with typically do not converge to a tree (exception being experiments with module on #rhs/lhs=18, in which 3 out of 5 runs converged to a solution with a high ). Likewise, models initialized with always stay in a regime with a high . In the intermediate setting of we observe differences in behaviors for and modules. In particular, N2NMN based on modules stays spurious with when #rhs/lhs=1, whereas N2NMN based on modules always converges to a tree.
|module||#rhs/lhs||Test error rate (%)||Test loss|
One counterintuitive result in Table 1 is that Stochastic N2NMNs with modules that were trained with and #rhs/lhs=1 make just errors on the generalization set despite being spurious mixtures between a tree and a chain. Our explanation for this phenomenon is as follows: when connected in a tree, modules of such spurious models generalize well, and when connected as a chain they generalize poorly. The output distribution of the whole model is thus a mixture of the mostly correct and mostly random . We verified our reasoning by explicitly evaluating test accuracies for and , and we found them to be around and respectively, confirming our hypothesis. As a result the predictions of the spurious models with have lower confidence than those of sharp tree models, as indicated by the high log loss of . We visualize the progress of structure induction for the module with in Figure 4 which shows how saturates to 1.0 for #rhs/lhs=18 and remains around 0.5 when #rhs/lhs=1.
4.3.2 Parametrization Induction
Next, we experiment with the Attention N2NMN model (see Section 3.2) in which the parametrization is learned for each module as an attention-weighted average of word embeddings. In these experiments, we fix the layout to be tree-like and sample the pre-softmax attention weights $\tilde{\alpha}$ from a uniform distribution. As in the layout induction investigations, we experiment with several SQOOP splits, namely we try #rhs/lhs ∈ {1, 2, 18}. The results (reported in Table 2) show that Attention N2NMN fails dramatically on #rhs/lhs=1 but quickly catches up as soon as #rhs/lhs is increased to 2. Notably, 9 out of 10 runs on #rhs/lhs=2 resulted in almost perfect performance, and 1 run completely failed to generalize (26% error rate), resulting in a high variance of the mean error rate. All 10 runs on the split with 18 rhs/lhs generalized flawlessly. We furthermore inspected the learned attention weights and found that for typical successful runs, module 3 focuses on the relation word, whereas modules 1 and 2 focus on different object words (see Figure 6) while still focusing on the relation word. To better understand the relationship between successful layout induction and generalization, we define an attention quality metric $\kappa = \min_{w \in \{X, Y\}} \max_{k \in \{1, 2\}} \alpha_{k,w} / (1 - \alpha_{k,R})$. Intuitively, $\kappa$ is large when for each word $w \in \{X, Y\}$ there is a module that focuses mostly on this word. The renormalization by $1 / (1 - \alpha_{k,R})$ is necessary to factor out the amount of attention that modules 1 and 2 assign to the relation word. For the ground-truth parametrization that we use for NMN-Tree $\kappa$ takes a value of 1, and if both modules 1 and 2 focus on X, completely ignoring Y, $\kappa$ equals 0. The scatterplot of the test error rate versus $\kappa$ (Figure 5) shows that for #rhs/lhs=1 high generalization is strongly associated with higher $\kappa$, meaning that it is indeed necessary to have different modules strongly focusing on different object words in order to generalize in this most challenging setting. Interestingly, for #rhs/lhs=2 we see a lot of cases where N2NMN generalizes well despite attention being rather spurious ().
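One reading of the attention quality metric described above, taken here as an assumption, is the minimum over the two object words of the maximum, over modules 1 and 2, of the attention on that word after dividing by one minus the attention on the relation word. The sketch below computes this quantity; the dictionary-based attention format and the function name are hypothetical.

```python
def attention_quality(module_attentions):
    """module_attentions: one dict per object-locating module, with attention
    weights for the words "X", "R" and "Y" that sum to 1.
    Undefined if a module puts all of its attention on "R"."""

    def renormalized(attention, word):
        # Factor out the attention spent on the relation word.
        return attention[word] / (1.0 - attention["R"])

    return min(
        max(renormalized(attention, word) for attention in module_attentions)
        for word in ("X", "Y")
    )


ground_truth = [{"X": 1.0, "R": 0.0, "Y": 0.0}, {"X": 0.0, "R": 0.0, "Y": 1.0}]
both_on_x = [{"X": 1.0, "R": 0.0, "Y": 0.0}, {"X": 1.0, "R": 0.0, "Y": 0.0}]
print(attention_quality(ground_truth), attention_quality(both_on_x))
```

The two boundary cases agree with the text: a hard-coded parametrization scores 1, and two modules that both attend only to X score 0.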
In order to put Attention N2NMN results in context we compare them to those of MAC (see Table 2). Such a comparison can be of interest because both models perform attention over the question. For 1 rhs/lhs MAC seems to be better on average, but as we increase #rhs/lhs to 2 we note that Attention N2NMN succeeds in 9 out of 10 cases on the #rhs/lhs=2 split, much more often than the 1 success out of 10 observed for MAC. (Footnote 1: judging a run successful when its error rate falls below a fixed threshold, these success rates differ with a p-value of 0.001 according to the Fisher exact test, and the same holds for other choices of the threshold.) This result suggests that Attention N2NMNs retain some of the strong generalization potential of NMNs with hard-coded parametrization.
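The reported significance level can be checked with a short, dependency-free implementation of the two-sided Fisher exact test applied to the 2 × 2 table of successes and failures (9 and 1 for Attention N2NMN, 1 and 9 for MAC). The function name is an assumption of this sketch, and the two-sided rule used (summing all tables no more probable than the observed one) is one common convention.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def table_probability(x):
        # Hypergeometric probability of x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over all tables that are at most as probable as the observed one.
    return sum(
        table_probability(x)
        for x in range(lowest, highest + 1)
        if table_probability(x) <= observed * (1 + 1e-9)
    )


# Rows: Attention N2NMN, MAC; columns: successful runs, failed runs.
print(round(fisher_exact_two_sided(9, 1, 1, 9), 3))
```

Any threshold that classifies the same 9 and 1 runs as successful yields the same table and therefore the same p-value.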
|Model||#rhs/lhs||Test error rate (%)||Test loss (%)|
Parameterization induction results for 1,2,18 rhs/lhs datasets for Attention N2NMN. The model does not generalize well in the difficult 1 rhs/lhs setting. Results for MAC are presented for comparison. Means and standard deviations were estimated based on at least 10 runs.
5 Related Work
The notion of systematicity was originally introduced by Fodor & Pylyshyn (1988) as the property of human cognition whereby “the ability to entertain a given thought implies the ability to entertain thoughts with semantically related contents”. They illustrate this with the example that no English speaker can understand the phrase “John loves the girl” without also being able to understand the phrase “the girl loves John”. The question of whether or not connectionist models of cognition can account for the systematicity phenomenon has been a subject of a long debate in cognitive science (Fodor & Pylyshyn, 1988; Smolensky, 1987; Marcus, 1998, 2003; Calvo & Colunga, 2003). Most recently Lake & Baroni (2017) and Loula et al. (2018) have shown that lack of systematicity in generalization is still a concern for modern seq2seq models. Our findings about the weak systematic generalization of generic VQA models corroborate the aforementioned seq2seq results. We also go beyond merely stating negative generalization results and showcase the high systematicity potential of adding explicit modularity and structure to modern deep learning models.
Besides the theoretical appeal of systematicity, our study is inspired by highly related prior evidence that when trained on downstream language understanding tasks, neural networks often generalize poorly and latch on to dataset-specific regularities. Agrawal et al. (2016) report how neural models exploit biases in a VQA dataset, e.g. responding “snow” to the question “what covers the ground” regardless of the image because “snow” is the most common answer to this question. Gururangan et al. (2018) report that many successes in natural language entailment are actually due to exploiting statistical biases as opposed to solving entailment, and that state-of-the-art systems perform much worse when tested on unbiased data. Jia & Liang (2017) demonstrate that seemingly state-of-the-art reading comprehension systems can be misled by simply appending an unrelated sentence that resembles the question to the document.
Using synthetic VQA datasets to study grounded language understanding is a recent trend started by the CLEVR dataset (Johnson et al., 2016). CLEVR images are 3D-rendered and CLEVR questions are longer and more complex than ours, yet the color-shape generalization split that CLEVR includes arguably lacks a clear motivation. More closely related to our work is the ShapeWorld family of datasets by Kuhnle & Copestake (2017), which involves a number of VQA generalization tests. ShapeWorld only contains 10 different objects, making it insufficient for our study. Most closely related to our work is the recent study of generalization to long-tail questions about rare objects done by Bingham et al. (2017). They do not, however, consider as many models as we do and do not study the question of whether the best-performing models can be made end-to-end.
The key paradigm that we test in our experiments is Neural Module Networks (NMN). Andreas et al. (2016) introduced NMNs as a modular, structured VQA model in which a fixed number of hand-crafted neural modules (such as Find or Compare) are chosen and composed together in a layout determined by the dependency parse of the question. Hu et al. (2017) and Johnson et al. (2017) followed up by making NMNs end-to-end, removing the non-differentiable parser. The former kept the hand-crafted modules and used reinforcement learning to learn the layout and modules end-to-end. The latter used a ground-truth module layout learned separately and replaced the hand-crafted modules with a generic ResNet block structure (He et al., 2016) for every module. Both Hu et al. (2017) and Johnson et al. (2017) reported that several thousand ground-truth layouts are required to pretrain the layout predictor in order for their approaches to work. In recent work, Hu et al. (2018) attempt to soften the layout decisions, but their models, when trained end-to-end from scratch, performed substantially worse than the best models on the CLEVR task.
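To make the notion of a module layout concrete, the following minimal sketch executes the same three modules under a chain layout and a tree layout. The names `make_module`, `run_layout`, `CHAIN` and `TREE`, the two-input module signature, and the use of strings in place of feature tensors are all illustrative assumptions, not the implementation of any of the cited systems.

```python
from typing import Callable, List, Tuple

# Hypothetical module type: takes two inputs and returns one output.
# Strings stand in for feature tensors so that the wiring is visible.
Module = Callable[[str, str], str]
# For each module, the indices of its two inputs: index 0 is the stem
# (image) output, index k is the output of the k-th module.
Layout = List[Tuple[int, int]]


def make_module(name: str) -> Module:
    """Build a toy module that records which inputs it was applied to."""
    def module(left: str, right: str) -> str:
        return f"{name}({left}, {right})"
    return module


def run_layout(modules: List[Module], layout: Layout, stem: str) -> str:
    """Apply the modules in order, wiring their inputs as the layout says."""
    outputs = [stem]
    for module, (left, right) in zip(modules, layout):
        outputs.append(module(outputs[left], outputs[right]))
    return outputs[-1]


# Chain: every module reads the previous output and the stem.
CHAIN: Layout = [(0, 0), (1, 0), (2, 0)]
# Tree: two leaf modules read the stem, the root combines their outputs.
TREE: Layout = [(0, 0), (0, 0), (1, 2)]

modules = [make_module("m1"), make_module("m2"), make_module("m3")]
chain_trace = run_layout(modules, CHAIN, "img")
tree_trace = run_layout(modules, TREE, "img")
```

In this sketch the only difference between the two models is the list of input indices, which is the quantity a layout predictor would have to produce.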
6 Conclusion and Discussion
We have conducted a rigorous investigation of an important form of systematic generalization required for grounded language understanding: the ability to reason about all possible pairs of objects despite being trained on a small subset. Our results allow one to draw two important conclusions. For one, the intuitive appeal of modularity and structure in designing neural architectures for language understanding is now supported by our results, which show that a modular model consisting of general-purpose residual blocks generalizes much better than a number of baselines, including architectures such as MAC, FiLM and RelNet that were designed specifically for visual reasoning. While this may seem unsurprising, to the best of our knowledge, the literature has lacked such clear empirical evidence in favor of modular and structured networks before this work. Importantly, we have also shown how sensitive the high performance of the modular models is to the layout of modules, and that a tree-like structure generalizes much more strongly than a typical chain of layers.
Our second key conclusion is that coming up with an end-to-end and/or soft version of modular models may not be sufficient for strong generalization. In the very setting where strong generalization is required, end-to-end methods often converge to a different, less compositional solution (e.g. a chain layout or blurred attention). This can be observed especially clearly in our NMN layout and parametrization induction experiments on the #rhs/lhs=1 version of our dataset, but notably, strong initialization sensitivity of layout induction remains an issue even on the #rhs/lhs=18 split. This conclusion is relevant in view of recent work on making NMNs more end-to-end (Suarez et al., 2018; Hu et al., 2018; Hudson & Manning, 2018; Gupta & Lewis, 2018). Our findings suggest that merely replacing hard-coded components with learnable counterparts can be insufficient, and that research on regularizers or priors that steer learning towards more systematic solutions may be required. That said, our parametrization induction results on the #rhs/lhs=2 split are encouraging, as they show that compared to generic models, a weaker nudge (in the form of a richer training signal or a prior) towards systematicity may suffice for end-to-end NMNs.
While our investigation has been performed on a synthetic dataset, we believe that it is in real-world language understanding that our findings may be most relevant. It is possible to construct a synthetic dataset that is bias-free and that can only be solved if the model has understood the entirety of the dataset’s language. It is, by contrast, much harder to collect real-world datasets that do not permit highly dataset-specific solutions, as numerous dataset analysis papers of recent years have shown (see Section 5 for a review). We believe that approaches that can generalize strongly from imperfect and biased data will likely be required, and our experiments can be seen as a simulation of such a scenario. We hope, therefore, that our findings will inform researchers working on language understanding and provide them with a useful intuition about what facilitates strong generalization and what is likely to inhibit it.
We thank Maxime Chevalier-Boisvert and Yoshua Bengio for useful discussions. This research was enabled in part by support provided by Compute Canada (www.computecanada.ca), NSERC and Canada Research Chairs. We also thank Nvidia for donating the NVIDIA DGX-1 used for this research.
- Agrawal et al. (2016) Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing the Behavior of Visual Question Answering Models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, January 2016.
- Andreas et al. (2016) Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural Module Networks. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. URL http://arxiv.org/abs/1511.02799.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the 2015 International Conference on Learning Representations, 2015.
- Bingham et al. (2017) Eli Bingham, Piero Molino, Paul Szerlip, Fritz Obermeyer, and Noah Goodman. Characterizing how Visual Question Answering scales with the world. In NIPS 2017 Visually-Grounded Interaction and Language Workshop, 2017.
- Calvo & Colunga (2003) Francisco Calvo and Eliana Colunga. The statistical brain: Reply to Marcus’ The algebraic mind. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 25, 2003.
- Fodor & Pylyshyn (1988) Jerry A. Fodor and Zenon W. Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1):3–71, 1988.
- Gaunt et al. (2016) Alexander L. Gaunt, Marc Brockschmidt, Nate Kushman, and Daniel Tarlow. Differentiable Programs with Neural Libraries. In Proceedings of the 34th International Conference on Machine Learning, November 2016. URL http://arxiv.org/abs/1611.02109. arXiv: 1611.02109.
- Gong et al. (2017) Yichen Gong, Heng Luo, and Jian Zhang. Natural Language Inference over Interaction Space. In Proceedings of the 2018 International Conference on Learning Representations, 2017. URL http://arxiv.org/abs/1709.04348. arXiv: 1709.04348.
- Gupta & Lewis (2018) Nitish Gupta and Mike Lewis. Neural Compositional Denotational Semantics for Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1239.
- Gururangan et al. (2018) Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. Annotation Artifacts in Natural Language Inference Data. In Proceedings of NAACL-HLT 2018, March 2018. URL http://arxiv.org/abs/1803.02324. arXiv: 1803.02324.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
- Hu et al. (2017) Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to Reason: End-to-End Module Networks for Visual Question Answering. In Proceedings of 2017 IEEE International Conference on Computer Vision, April 2017. URL http://arxiv.org/abs/1704.05526. arXiv: 1704.05526.
- Hu et al. (2018) Ronghang Hu, Jacob Andreas, Trevor Darrell, and Kate Saenko. Explainable Neural Computation via Stack Neural Module Networks. In Proceedings of 2018 European Conference on Computer Vision, July 2018. URL http://arxiv.org/abs/1807.08556. arXiv: 1807.08556.
- Hudson & Manning (2018) Drew A. Hudson and Christopher D. Manning. Compositional Attention Networks for Machine Reasoning. In Proceedings of the 2018 International Conference on Learning Representations, February 2018. URL https://openreview.net/forum?id=S1Euwz-Rb.
- Jia & Liang (2017) Robin Jia and Percy Liang. Adversarial Examples for Evaluating Reading Comprehension Systems. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2021–2031, 2017. doi: 10.18653/v1/D17-1215. URL https://aclanthology.coli.uni-saarland.de/papers/D17-1215/d17-1215.
- Jiang et al. (2018) Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia v0.1: The winning entry to the vqa challenge 2018. https://github.com/facebookresearch/pythia, 2018.
- Johnson et al. (2016) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), December 2016. URL http://arxiv.org/abs/1612.06890. arXiv: 1612.06890.
- Johnson et al. (2017) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Inferring and Executing Programs for Visual Reasoning. In Proceedings of 2017 IEEE International Conference on Computer Vision, 2017. URL http://arxiv.org/abs/1705.03633.
- Kannan et al. (2016) Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, Laszlo Lukacs, Marina Ganea, Peter Young, and Vivek Ramavajjala. Smart Reply: Automated Response Suggestion for Email. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 955–964, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939801. URL http://doi.acm.org/10.1145/2939672.2939801.
- Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Proceedings of the 2015 International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6980. arXiv: 1412.6980.
- Kuhnle & Copestake (2017) Alexander Kuhnle and Ann Copestake. ShapeWorld - A new test methodology for multimodal language understanding. arXiv:1704.04517 [cs], April 2017. URL http://arxiv.org/abs/1704.04517. arXiv: 1704.04517.
- Lake & Baroni (2017) Brenden M. Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. arXiv:1711.00350 [cs], October 2017. URL http://arxiv.org/abs/1711.00350. arXiv: 1711.00350.
- Loula et al. (2018) Joao Loula, Marco Baroni, and Brenden M. Lake. Rearranging the Familiar: Testing Compositional Generalization in Recurrent Networks. In Proceedings of the 2018 BlackboxNLP EMNLP Workshop, July 2018. URL https://arxiv.org/abs/1807.07545.
- Malinowski & Fritz (2014) Mateusz Malinowski and Mario Fritz. A Multi-world Approach to Question Answering About Real-world Scenes Based on Uncertain Input. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS’14, pp. 1682–1690, Cambridge, MA, USA, 2014. MIT Press. URL http://dl.acm.org/citation.cfm?id=2968826.2969014.
- Marcus (1998) Gary F. Marcus. Rethinking Eliminative Connectionism. Cognitive Psychology, 37(3):243–282, December 1998. ISSN 0010-0285. doi: 10.1006/cogp.1998.0694. URL http://www.sciencedirect.com/science/article/pii/S0010028598906946.
- Marcus (2003) Gary F. Marcus. The algebraic mind: Integrating connectionism and cognitive science. MIT press, 2003.
- Perez et al. (2017) Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual Reasoning with a General Conditioning Layer. In Proceedings of the 2017 AAAI Conference on Artificial Intelligence, 2017. URL http://arxiv.org/abs/1709.07871.
- Santoro et al. (2017) Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems 31, June 2017. URL http://arxiv.org/abs/1706.01427. arXiv: 1706.01427.
- Smolensky (1987) Paul Smolensky. The constituent structure of connectionist mental states: A reply to Fodor and Pylyshyn. Southern Journal of Philosophy, 26(Supplement):137–161, 1987.
- Suarez et al. (2018) Joseph Suarez, Justin Johnson, and Fei-Fei Li. DDRprog: A CLEVR Differentiable Dynamic Reasoning Programmer. arXiv:1803.11361 [cs], March 2018. URL http://arxiv.org/abs/1803.11361. arXiv: 1803.11361.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27, pp. 3104–3112, 2014.
- Wang et al. (2018) Wei Wang, Ming Yan, and Chen Wu. Multi-Granularity Hierarchical Attention Fusion Networks for Reading Comprehension and Question Answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1705–1714, Melbourne, Australia, 2018. Association for Computational Linguistics. URL http://aclweb.org/anthology/P18-1158.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, and others. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144, 2016.
Appendix A Experiment Details
We trained all models by minimizing the cross entropy loss $-\log p(y \mid x, q)$ on the training set, where $y$ is the correct answer, $x$ is the image, and $q$ is the question. In all our experiments we used the Adam optimizer (Kingma & Ba, 2015) with fixed values of its hyperparameters $\alpha$, $\beta_1$, $\beta_2$, and $\epsilon$. We continuously monitored the validation set performance of all models during training, selected the best one according to validation performance, and reported its performance on the test set. The number of training iterations for each model was selected in preliminary investigations based on our observations of how long it takes for different models to converge. This information, as well as other training details, can be found in Table 3.
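The training objective and the model selection rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: `cross_entropy`, `mean_loss` and `select_best` are hypothetical helper names, the model output is assumed to be a list of answer probabilities, and the checkpoint numbers are made up for the example rather than taken from our experiments.

```python
import math
from typing import List, Sequence, Tuple


def cross_entropy(answer_probs: Sequence[float], correct_index: int) -> float:
    """Negative log-probability assigned to the correct answer."""
    return -math.log(answer_probs[correct_index])


def mean_loss(batch: List[Tuple[Sequence[float], int]]) -> float:
    """Average cross entropy over (predicted probabilities, correct index) pairs."""
    return sum(cross_entropy(p, y) for p, y in batch) / len(batch)


def select_best(checkpoints: List[Tuple[int, float, float]]) -> Tuple[int, float, float]:
    """Pick the checkpoint with the highest validation accuracy.

    Each checkpoint is (iteration, validation accuracy, test accuracy);
    test accuracy is only read off after the choice has been made.
    """
    return max(checkpoints, key=lambda c: c[1])


# Illustrative values only.
batch = [([0.5, 0.5], 0), ([0.25, 0.75], 1)]
loss = mean_loss(batch)
best = select_best([(1000, 0.70, 0.65), (2000, 0.90, 0.80), (3000, 0.85, 0.88)])
```

Note that the selection uses validation accuracy only, so the reported test number need not be the highest test number seen during training.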
| model | stem layers | subsampling factor | iterations | batch size |
|---|---|---|---|---|
| Stochastic NMN (Residual) | 6 | 4 | 200000 | 64 |
| Stochastic NMN (Find) | 6 | 4 | 200000 | 64 |
| Attention NMN (Find) | 6 | 4 | 50000 | 64 |
Appendix B Additional Results for MAC Model
We performed an ablation study in which we varied the number of MAC units, the model dimensionality and the level of weight decay for the MAC model. The results can be found in Table 4.
| model | #rhs/lhs | train error rate (%) | test error rate (%) |
|---|---|---|---|
| weight decay 0.00001 | 1 | | |
| weight decay 0.0001 | 1 | | |
| weight decay 0.001 | 1 | | |
We also performed qualitative investigations to understand the high variance in MAC’s performance. In particular, we focused on the control attention weights of each run, aiming to understand whether runs that generalize differ clearly from runs that fail. Interestingly, we observe that in successful runs each word has a unit that is strongly focused on it. To present our observations in quantitative terms, we plot attention quality, a statistic of the control scores, against accuracy for each run in Figure 7 (see Section 4.3.2 for an explanation of this measure). We can clearly see a strong positive correlation between attention quality and accuracy.
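One way to turn the observation that each word has a unit strongly focused on it into a single number is sketched below. The exact definition of attention quality is the one given in Section 4.3.2; the min-over-words, max-over-units form used here, the function name `attention_quality`, and the example scores are assumptions made purely for illustration.

```python
from typing import Sequence


def attention_quality(control_scores: Sequence[Sequence[float]],
                      word_positions: Sequence[int]) -> float:
    """Assumed statistic: for each word of interest, take the strongest
    attention any unit pays to it, then take the weakest of these maxima.

    control_scores[k][t] is the attention of unit k on question position t.
    """
    return min(
        max(unit_scores[t] for unit_scores in control_scores)
        for t in word_positions
    )


# Illustrative scores for two units over a three-word question.
focused = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
blurred = [[0.4, 0.2, 0.4], [0.4, 0.2, 0.4]]
q_focused = attention_quality(focused, word_positions=[0, 2])
q_blurred = attention_quality(blurred, word_positions=[0, 2])
```

Under this assumed form, a run scores high only if every word of interest is covered by at least one sharply focused unit, while blurred attention yields a low score.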
Next, we experimented with a hard-coded variation of MAC. In this model, we use hard-coded control scores such that, given a question, the first half of all modules focuses on the first object word while the second half focuses on the second object word. The relationship between MAC and hard-coded MAC is similar to that between NMN-Tree and end-to-end NMN with parametrization induction. However, this model did not perform as well as the successful runs of MAC. We hypothesize that this could be due to interactions between the control scores and the visual attention part of the model.
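The hard-coded control scores can be pictured as a matrix of one-hot rows, as in the rough sketch below. The function name `hard_coded_control_scores`, the use of exact one-hot scores, and the assumption that the two attended words sit at known positions in the question are illustrative choices, not a description of the exact implementation.

```python
from typing import List


def hard_coded_control_scores(num_units: int, question_len: int,
                              first_pos: int, second_pos: int) -> List[List[float]]:
    """Return one row of control scores per unit.

    Units in the first half put all attention on position first_pos,
    units in the second half put all attention on position second_pos.
    """
    scores = []
    for k in range(num_units):
        target = first_pos if k < num_units // 2 else second_pos
        row = [0.0] * question_len
        row[target] = 1.0
        scores.append(row)
    return scores


# Four units over a three-word question, attending to the first and last word.
scores = hard_coded_control_scores(num_units=4, question_len=3,
                                   first_pos=0, second_pos=2)
```

Fixing the scores this way removes the learned question attention entirely, so any remaining failure to generalize has to come from the other parts of the model.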