Modularity Matters: Learning Invariant Relational Reasoning Tasks

06/18/2018 ∙ by Jason Jo, et al. ∙ aalto 0

We focus on two supervised visual reasoning tasks whose labels encode a semantic relational rule between two or more objects in an image: the MNIST Parity task and the colorized Pentomino task. The objects in the images undergo random translation, scaling, rotation and coloring transformations. Thus these tasks involve invariant relational reasoning. We report uneven performance of various deep CNN models on these two tasks. For the MNIST Parity task, we report that the VGG19 model soundly outperforms a family of ResNet models. Moreover, the family of ResNet models exhibits a general sensitivity to random initialization for the MNIST Parity task. For the colorized Pentomino task, both the VGG19 and ResNet models exhibit sluggish optimization and very poor test generalization, hovering around 30% test error. The CNNs we tested all learn hierarchies of fully distributed features and thus encode the distributed representation prior. We are motivated by a hypothesis from cognitive neuroscience which posits that the human visual cortex is modularized, and that this allows the visual cortex to learn higher order invariances. To this end, we consider a modularized variant of the ResNet model, referred to as a Residual Mixture Network (ResMixNet), which employs a mixture-of-experts architecture to interleave distributed representations with more specialized, modular representations. We show that very shallow ResMixNets are capable of learning each of the two tasks well, attaining less than 2% and 1% test error on the MNIST Parity and the colorized Pentomino tasks respectively. Most importantly, the ResMixNet models are extremely parameter efficient: they generalize better than various non-modular CNNs that have over 10x the number of parameters. These experimental results support the hypothesis that modularity is a robust prior for learning invariant relational reasoning.

1 Introduction

The human visual system is able to learn discriminative representations for high level abstractions in the data that are also invariant to an incredibly large and varied collection of transformations GemanInvariance ; PoggioInvariance . A central question in computer vision is how to learn such representations. The current de-facto standard visual learning models are deep convolutional neural networks (CNNs) CNN ; Neocognitron . Deep CNNs have achieved an incredible amount of success in learning various visual tasks: object recognition AlexNet ; VGG19 ; Inception1 ; ResNet , segmentation RCNN ; FullyConvSegmentation and even visual question answering RelationalNetwork ; NeuralModuleAndreas ; HierarchicalAttention . While various CNN models are able to exhibit record breaking, sometimes superhuman test generalization performance, it should be noted that this test generalization is in the independent and identically distributed (i.i.d.) setting. So-called adversarial noise has been shown to break various models on the aforementioned tasks AdvExamples ; SegmentationAdvExamples ; AdvSegmentation ; FoolingVQA , exposing the sensitivity of these models in what we refer to as the out-of-distribution (o.o.d.) setting. Therefore the search for simultaneously discriminative and highly invariant representations continues.

While there are many qualitatively different deep CNNs in the literature, the majority of them can be interpreted as learning deep hierarchies of fully distributed features: all features at a given level of the hierarchy are applied to the same input. Thus many deep CNN models encode the fully distributed representation prior. In this article, we explore the efficacy of the fully distributed representation prior for learning invariant relational rules. To measure this, we focus on two relational reasoning tasks: a newly crafted MNIST Parity task and a colorized variant of the Pentomino task KnowledgeMatters . These two tasks are supervised visual reasoning tasks whose labels encode a semantic (high-level) relational rule between two or more objects in an image. Most importantly, the objects in the image undergo a wide range of transformations: random translation, scaling, rotation and coloring. Therefore performance on these two tasks measures a model’s ability to learn an invariant relational reasoning rule between explicit objects.

For these two tasks we found that conventional deep convolutional architectures did not perform well, which stimulated our quest for architectures incorporating a different kind of prior. We tested the VGG19 model VGG19 with batch-normalization BatchNorm and ResNet models ResNet with depth varying from 26 all the way to 152. For the MNIST Parity task, we report that the VGG19 model soundly outperforms a family of ResNet models, training faster and generalizing better. In particular, the family of ResNet models exhibited a general sensitivity to choice of random seed for weight initialization. However, for the colorized Pentomino task, both the VGG19 and family of ResNet models perform poorly, exhibiting sluggish optimization and very poor generalization, with average test error hovering around 30% across the tested models.

To address the shortcomings of the tested deep CNNs on the colorized Pentomino task, we appeal to a hypothesis from cognitive neuroscience which posits that the human visual cortex is modularized CorticalAreaBody ; Facevsobject (as opposed to fully distributed), and that this modularity allows the visual cortex to learn higher order invariances PoggioModularityInvariance ; PoggioFaceModularity . This paper is a first exploration towards a different style of architecture which would better reflect this prior. To this end, we consider a modularized variant of the ResNet model, referred to as a Residual Mixture Network (ResMixNet), which employs a mixture-of-experts architecture MixtureofExperts ; GoogleMixturePaper to interleave distributed representations with more specialized, modular representations.

Our main empirical result is that we can deploy extremely parameter efficient ResMixNets that outperform both the VGG19-BN and the family of ResNets for the MNIST Parity and colorized Pentomino tasks. The best non-modularized model we tested for the MNIST Parity task was the VGG19-BN architecture, which achieves an average test error of 2.27% while having over 20 million parameters. Using a ResMixNet with only 274K parameters (over a 70x reduction in parameter count), we are able to achieve an average test error of 1.98%. For the colorized Pentomino task, we were able to deploy a ResMixNet with 193K parameters that achieved 0.88% average test error (almost a 30x reduction in test error). In light of the extreme parameter efficiency and stellar generalization performance of the ResMixNet on both the MNIST Parity and colorized Pentomino tasks, we conclude that these experimental results support the hypothesis that modularity is a robust prior for learning invariant relational reasoning.

2 Invariant Relational Learning Tasks

Here we introduce the MNIST Parity task (Section 2.1) and a colorized variant of the Pentomino task originally introduced in KnowledgeMatters (Section 2.2). Both of these tasks are binary supervised tasks whose labels encode a semantic relational rule between two or more objects in the image. We view both of these tasks as requiring a machine learning model to learn higher order invariances because the objects in the images can undergo random translation, scaling, rotation and coloring transformations without changing the image label. In addition to the large number of invariances present in each of the datasets, we further challenge any proposed machine learning model by restricting the training set size. Thus we are operating in the high invariance, low sample regime.

2.1 MNIST Parity Dataset

The MNIST Parity dataset consists of 30K training, 5K validation and 5K test images. Each image is of size 64×64 and is divided into a 2×2 grid of 32×32 blocks. Each image has two 28×28 MNIST digits placed in 2 randomly chosen blocks out of these four blocks. Each digit is randomly colored (using one out of 10 randomly chosen colors), randomly scaled, randomly rotated and placed at a random location within a block. The task is to predict whether both the digits in an image are of the same parity, i.e. both even or both odd (label 1), or not (label 0). Example images are shown in Fig. 1. We note that the MNIST Parity training, validation and test images are generated and deformed from the original MNIST training, validation and test digits respectively.

Figure 1: (Left) Label 0 example: MNIST digit pair with different parity (one odd digit, one even digit) and (Right) Label 1 example: MNIST digit pair of the same parity (both odd digits). Digits are subject to random translations, scalings, rotations and coloring. Best viewed in color.
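The labeling rule and block placement described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' data generation code: the function names are hypothetical, and it assumes that the two chosen blocks are distinct.

```python
import random


def parity_label(digit_a: int, digit_b: int) -> int:
    """Return 1 if both digits have the same parity (both even or both odd), else 0."""
    return 1 if digit_a % 2 == digit_b % 2 else 0


def choose_blocks(rng: random.Random, grid: int = 2, num_digits: int = 2):
    """Pick distinct (row, col) blocks of the grid x grid layout (assumed distinct)."""
    cells = [(row, col) for row in range(grid) for col in range(grid)]
    return rng.sample(cells, num_digits)


rng = random.Random(0)
blocks = choose_blocks(rng)      # two of the four 32x32 blocks
label = parity_label(3, 7)       # both odd, so the label is 1
```

Note that the label depends only on the digit identities, so coloring, scaling, rotating or moving a digit leaves it unchanged.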

2.2 Colorized Pentomino Dataset

The Colorized Pentomino dataset consists of 20K train, 5K validation and 5K test images. Each image is of size 64×64, which is divided into a grid of 8×8 blocks. Each image has 3 Pentomino sprites placed in 3 randomly chosen unique blocks. The Pentomino sprite type, scaling amount and rotation angles are the same as in KnowledgeMatters . Additionally, we color the Pentomino sprites randomly using one out of 10 colors. Due to the extra coloring transformation, the colorized Pentomino has 10x the number of invariances as the original Pentomino dataset. As in KnowledgeMatters , the task is to learn whether all the Pentomino sprites in an image belong to the same class (label 0) or not (label 1). Example images are shown in Fig. 2.

Figure 2: (Left) Label 0 example (all the shapes are of the same sprite type) and (Right) Label 1 example (there exists a sprite of a different type than the other sprites). Sprites are subject to random translations, scalings, rotations and coloring. Best viewed in color.
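To make the Pentomino rule concrete, the following Python sketch computes the label from a list of sprites. It is an illustration under stated assumptions: sprites are represented as hypothetical (type id, color id) pairs, and the integer ids are placeholders rather than the sprite classes used in the dataset.

```python
def pentomino_label(sprites):
    """Return 0 if every sprite has the same type, else 1.

    Each sprite is a (type_id, color_id) pair; only the type matters.
    """
    types = {type_id for type_id, _color_id in sprites}
    return 0 if len(types) == 1 else 1


same_type = [(2, 0), (2, 5), (2, 9)]    # one sprite type, three different colors
mixed_type = [(2, 0), (2, 0), (7, 0)]   # one sprite differs in type

label_same = pentomino_label(same_type)
label_mixed = pentomino_label(mixed_type)
```

Changing only the color ids never changes the label, which is the coloring invariance the task asks a model to learn.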

Relational object reasoning tasks have two key defining characteristics: the object distribution and the relational rule. In this respect, we believe that the MNIST Parity task and the colorized Pentomino task are qualitatively different. The MNIST Parity task consists of curvilinear digit strokes while the colorized Pentomino task consists of rigid polygonal shapes. With respect to the relational rule, the MNIST Parity task is an AND operation on the parity of the digits while the colorized Pentomino task is an XOR-like operation on the sprite types. The two datasets furthermore differ in the following aspects: Colorized Pentomino has more sparsity in the images, and the objects in the image have more freedom for translation as compared to MNIST Parity. Furthermore, all the objects in the Colorized Pentomino dataset are made of only straight edges, whereas the MNIST Parity dataset consists of different types of curves. Arguably, these curves assist more than the straight edges of the Colorized Pentomino dataset in learning discriminative features for the desired task.

2.3 Psychological Tests of Relational Reasoning

While both the datasets are artificial, we note that in the field of psychology and human intelligence, similar visual tasks have been used to measure relational reasoning in human beings RelationalReasoningPsych ; CultureFreeIntelligenceTest ; QuoteTheRaven . Our tasks are also completely “figural” in that no outside information is needed to be able to solve these problems. Indeed these tasks are designed to measure what QuoteTheRaven refers to as eductive ability:

“…the ability to make meaning out of confusion, the ability to generate high-level, usually nonverbal, schemata which make it easy to handle complexity.”

3 Modularity and Relational Reasoning

3.1 Why modularize?

In this article we are interested in invariant relational learning. In this setting, a machine learning model must be able to recognize that simply translating, rotating, scaling or changing the color of any of the objects in the image does not change the label of the image. Therefore, any proposed machine learning model will be tasked with learning simultaneously discriminative and invariant representations GemanInvariance ; PoggioInvariance . We are motivated by the following question: which architectural priors will facilitate learning of such representations?

The de-facto visual learning models used today are deep CNNs. Many of these deep CNNs may be classified as learning a deep hierarchy of fully distributed features: all features at a given level of the hierarchy are applied to the same input. Overall, distributed representations HintonDistReps have been an extremely powerful architectural prior for AI. However, when the number of invariances in the dataset is very large (and/or the dataset size is sufficiently small), one may encounter the interference problem BeckerHinton ; InterferenceBook ; MixtureofExperts ; ToModularize for architectures that learn fully distributed representations. In the case of supervised learning from image labels, there is one global teaching signal, and BeckerHinton conjectured that this would entangle all the neural network’s parameters, which would cause the features to interfere with one another and result in a slowdown in learning. Similarly in InterferenceBook , it is hypothesized that a neural network with a “homogeneous connectivity” topology (e.g. encoding the fully distributed representation prior) will struggle to simultaneously learn many different patterns, as the patterns will interfere with each other. When a dataset has a large number of invariances, a machine learning model must learn to associate a large number of seemingly unrelated patterns with one another, which may exacerbate the interference problem. Take for example the MNIST Parity task: a machine learning model must learn to associate different digit pairings that share the same label of 0, even though these pairings have different geometric properties.

One natural way to combat the interference problem is to allow for specialized sub-modules in our architecture. Once we modularize, we reduce the amount of interference that can occur between features in our model. These specialized modules can now learn highly discriminative yet invariant representations while not interfering with each other PoggioModularityInvariance ; PoggioFaceModularity . There is much supporting evidence for this from cognitive neuroscience research: the Fusiform Face Area (FFA) FusiFormFaceAreaOriginalPaper , face selective cells FaceSelectiveCells , and cells that are selective only to portions of the human body as opposed to the human face CorticalAreaBody . In the case of invariant relational learning, we hypothesize that modularity allows for the development of specialized neural circuitry that can learn to associate many seemingly unrelated patterns.

3.2 Residual Mixture Network

A well established neural architecture to combat the interference problem is the so-called mixture of experts (MoE) MixtureofExperts . There are two key components to the MoE architecture:

  1. Individual expert networks (which here map their input to their output).

  2. A Gater network that weights the output from each of the individual experts, in a way that is context-dependent.

In Fig. 2(a) we present our mixture of experts module. Each of our experts is a stack of residual modules, specifically the Basic Block from ResNet . Our Gater network is a stack of four convolutional layers, a global average pooling layer and a dense layer with a softmax activation. Thus the output of the gater network is a probability vector with one entry per expert, and we use it to output a weighted linear combination of the experts' outputs.
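The gating step can be illustrated with a minimal pure-Python sketch. It assumes that the gater produces one raw score (logit) per expert and that every expert returns a feature vector of the same length; the function names and the toy vectors are hypothetical, and real expert outputs would be convolutional feature maps rather than short lists.

```python
import math


def softmax(logits):
    """Numerically stable softmax: non-negative weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_output(expert_outputs, gate_logits):
    """Weighted linear combination of expert outputs.

    expert_outputs: one equal-length feature vector per expert.
    gate_logits: one raw gater score per expert.
    """
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[j] for w, out in zip(weights, expert_outputs))
        for j in range(dim)
    ]


experts = [[1.0, 0.0], [0.0, 1.0]]
mixed = mixture_output(experts, [0.0, 0.0])   # equal logits give equal weights
```

Because the weights come from a softmax over the gater's scores, the mixture is a convex combination of the expert outputs, and the gater can shift the weight toward different experts for different inputs.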

We present our Residual Mixture Network (ResMixNet) architecture in Fig. 2(b). We observe that the ResMixNet interleaves distributed and modular representations. For example, the first layer in our network is a fully distributed convolutional layer with 16 3x3 filters with stride 2. We feel this is the appropriate prior for the MNIST Parity and colorized Pentomino tasks because the images within each of those datasets share low level features, e.g. curvilinear digit strokes for MNIST Parity and straight edges for the colorized Pentomino. It would be wasteful to have each individual expert learn its own low level edge filters. Also note that each expert in our network receives the same input as any other expert, but then each expert learns its own specialized representation through its stack of residual modules. These modularized representations are then weighted together using the gater network. We stack two expert modules together and then have the third block of our network be a stack of BasicBlock residual modules. Following the ResNet recipe, we have our second and third blocks reduce the spatial height and width dimensions by 2 by using a stride 2 convolution, hence the “/2” notation in Figs. 2(a) and 2(b). Again following the ResNet recipe, whenever the spatial dimension is reduced, we double the number of filters. Our final output layer is a 2 unit dense layer. We use a two class negative log likelihood loss function.
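As a rough aid to reading the architecture description, the sketch below traces the feature map size and filter count through the stages for a 64×64 input. It assumes padding such that each stride 2 convolution exactly halves the height and width, and that the first expert module keeps both the resolution and the 16 filters of the initial layer; the stage names are hypothetical labels, not names from the paper.

```python
def resmixnet_stage_shapes(input_size=64, base_filters=16):
    """Trace (stage, height, width, filters) through the network stages."""
    size = input_size // 2            # initial 3x3 convolution with stride 2
    filters = base_filters
    shapes = [("conv1", size, size, filters)]

    # First expert module: assumed to keep resolution and filter count.
    shapes.append(("block1_experts", size, size, filters))

    # Second (expert) and third (residual) blocks: halve size, double filters.
    for name in ("block2_experts", "block3_residual"):
        size //= 2
        filters *= 2
        shapes.append((name, size, size, filters))
    return shapes


for stage in resmixnet_stage_shapes():
    print(stage)
```

Under these assumptions the final residual block operates on 8×8 feature maps with 64 filters before the 2 unit dense output layer.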

(a) A mixture of experts module, where each expert is a stack of residual modules and a gater network weights all the experts and forms an additive mixture.
(b) The ResMixNet, consisting of stacks of residual modules and experts. Best viewed in color.

4 Experimental Results

In all of our experiments we used a training batch size of 64 and an L2 weight decay parameter of 1e-4. We train all of our models using SGD+Momentum; we try 5 different learning rates and perform 5 trials with different random seeds. We train for 200 epochs and decay the learning rate by a factor of 10 at epochs 100 and 140. We report the learning rate with the best average test performance. Due to the sparsity in the datasets, all of our networks have an initial convolutional layer with stride 2. We used a two-class softmax and negative log likelihood as our loss function. We refer to the supplementary materials for further details.
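The step schedule described above can be written as a small function. This is a sketch under two assumptions that the text does not state: epochs are counted from 0, and the drop takes effect at the start of the milestone epoch. The base learning rate of 0.1 is just one of the values listed in Table 4.

```python
def lr_at_epoch(base_lr, epoch, milestones=(100, 140), factor=10.0):
    """Learning rate after dividing by `factor` at every milestone reached."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr / (factor ** drops)


# Full 200-epoch schedule for an example base learning rate of 0.1.
schedule = [lr_at_epoch(0.1, epoch) for epoch in range(200)]
```

With these settings the learning rate stays at its base value for the first 100 epochs, is 10x smaller for the next 40, and 100x smaller for the final 60.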

4.1 MNIST Parity

Model                  Parameter Count   Test Error
ResNet26               370K
ResNet50               758K
ResNet152-Bottleneck   3.66M
VGG19-BN               20M               2.27% (average)
ResMixNet(2,2)         274K              1.98% (average)
Table 1: MNIST Parity Generalization Results

Figure 3: Average Validation Loss of the best performing models: ResNet152, VGG19-BN and ResMixNet(2,2).

In Table 1 we present the generalization performance of the various models for the MNIST Parity task. We first note the somewhat surprising result that the VGG19-BN network soundly outperforms the ResNet models. To the best of our knowledge, this is the first time such a performance gap has been exhibited between a residual network and a non-residual network. We witnessed a sensitivity to randomized initialization for all the ResNet models, evidenced by the large standard deviation of the average test error. While the VGG19-BN does exhibit stellar performance, we note that our ResMixNet(2,2) model actually attains slightly better test performance while having over 70x fewer parameters. We see in Fig. 3 that the ResMixNet(2,2) model is able to obtain lower validation loss than the VGG19-BN model.
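The parameter efficiency claim can be checked with simple arithmetic on the rounded parameter counts from Table 1. The snippet below is only a sanity check on those rounded figures (for instance, VGG19-BN is taken as exactly 20M although the text says "over 20 million"), so the resulting ratios are approximate.

```python
# Rounded parameter counts as listed in Table 1.
param_counts = {
    "ResNet26": 370_000,
    "ResNet50": 758_000,
    "ResNet152-Bottleneck": 3_660_000,
    "VGG19-BN": 20_000_000,
    "ResMixNet(2,2)": 274_000,
}

reference = param_counts["ResMixNet(2,2)"]

# How many times larger each model is than the ResMixNet(2,2).
ratios = {name: count / reference for name, count in param_counts.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the parameters of ResMixNet(2,2)")
```

On these figures the VGG19-BN is roughly 73 times larger than the ResMixNet(2,2), consistent with the "over 70x" statement, and even the smallest ResNet tested has more parameters than the ResMixNet.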

4.2 Colorized Pentomino

Model                  Parameter Count   Test Error
ResNet26               370K
ResNet50               758K
ResNet152-Bottleneck   3.66M
VGG19-BN               20M
ResMixNet(4,1)         193K              0.88% (average)
Table 2: Pentomino 10 Color Generalization Results

Figure 4: (Left) Train Loss and (Right) Validation Loss performance on the Pentomino 10 color dataset. Best viewed in color.

In Table 2 we present the generalization performance of the various models for the Pentomino 10 color task. In contrast to the MNIST Parity results, we observe that now the VGG19-BN and the various ResNet models generalize poorly. From Fig. 4 we conclude that on this task the non-modularized networks have a tendency to slowly converge to poorly generalizing local minima.

On the other hand, from Table 2 and Fig. 4 we observe stellar optimization and generalization performance of the ResMixNet(4,1) model. This model is again able to outperform networks that have many more parameters. We highlight a nearly 30x reduction in test error from the non-modularized CNNs to the ResMixNet(4,1) model.

4.3 Classical Object Recognition

Finally we present the performance of the ResMixNet for three object recognition tasks: CIFAR-10, CIFAR-100 and SVHN. We train as before except that now we decay the learning rate by a factor of 10 at epochs 100 and 150. For SVHN we use a similar setup but with learning rate 0.01, and we train for only 40 epochs, reducing the learning rate by a factor of 10 after 20 and 30 epochs. For the CIFAR-10/100 datasets we augment with on-the-fly random horizontal flips and random cropping with pad size 4. We use random seed 0 for the initialization. The results of our experiments are shown in Table 3.
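The CIFAR augmentation mentioned above (random cropping with pad size 4 followed by a random horizontal flip) can be sketched in plain Python. Several details here are assumptions not given in the text: zero padding, a flip probability of 0.5, and a single-channel image stored as a list of rows. The function names are hypothetical.

```python
import random


def random_crop_with_padding(image, pad, rng):
    """Zero-pad by `pad` on every side, then crop back to the original size."""
    height, width = len(image), len(image[0])
    zero_row = [0] * (width + 2 * pad)
    padded = [list(zero_row) for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in image]
    padded += [list(zero_row) for _ in range(pad)]
    top = rng.randint(0, 2 * pad)     # inclusive bounds
    left = rng.randint(0, 2 * pad)
    return [row[left:left + width] for row in padded[top:top + height]]


def random_horizontal_flip(image, rng, p=0.5):
    """Mirror the image left to right with probability p (0.5 is assumed)."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return [list(row) for row in image]


def augment(image, rng, pad=4):
    """Apply the random crop followed by the random flip."""
    return random_horizontal_flip(random_crop_with_padding(image, pad, rng), rng)


rng = random.Random(0)
image = [[row * 32 + col for col in range(32)] for row in range(32)]
augmented = augment(image, rng)
```

The augmented image always has the same size as the input, so it can be fed to the network without any further resizing.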

We set our baseline of performance as the ResNet50 and we test ResMixNet(5,3), which has roughly the same number of parameters. We notice from the table that the performance on CIFAR-10 is quite close, with a gap of less than 1% in test error, and that for SVHN the performance of the two models is even closer. However, we see that the gap is over 5% for CIFAR-100. Note that for CIFAR-100, the data by design has multiple class labels that are semantically similar, and thus many of the images may share features. In this case the ResMixNet may not be an optimal prior by itself. This suggests future work exploring combinations of this prior with the classical CNN prior based on a sequence of transformations.

Model            Dataset     Num. Params   Test Accuracy
ResNet50         CIFAR-10    758K
ResMixNet(5,3)   CIFAR-10    748K
ResNet50         CIFAR-100   764K
ResMixNet(5,3)   CIFAR-100   754K
ResNet50         SVHN        758K
ResMixNet(5,3)   SVHN        748K
Table 3: CIFAR-10, CIFAR-100 and SVHN Results

5 Related Work

With respect to relational reasoning, there has been much recent work on Visual Question Answering (VQA) VQA ; CLEVR ; VisualTuringTest ; MultiWorld ; VizWiz . The MNIST Parity task and the Pentomino tasks have no question answering component and are of a purely visual nature. The VQA tasks require learning multiple types of relations, while each of the MNIST Parity and Pentomino tasks requires learning a single invariant relational rule that applies to a set of objects and their many transformations.

With respect to the ResMixNet architecture, it is clearly built on top of the well established mixture of experts architecture MixtureofExperts ; BengioModularity . The most relevant work is GoogleMixturePaper . One major difference with our work is that GoogleMixturePaper applied modular MLPs to a task like Jittered MNIST, which does not have as many invariances as the MNIST Parity and the colorized Pentomino task, nor does it involve any relational reasoning. To this end, we believe that the higher order invariance setting is where the modularity prior can result in a significant performance gain which we showed in Section 4.

In general there has been some recent work on modular neural networks. In SparseMOE a huge modular neural network is used to achieve state of the art performance on language modeling and machine translation tasks. The ResNeXt model ResNext uses multiple branches (e.g. experts) and pools the experts together via summation, but it does not employ a gater-type network to weight the sum. The Inception architectures Inception1 ; Inception2 ; Inception4 also use multi-branch modules and concatenate them all together, thus they similarly lack a gater network.

6 Conclusion

In this article we tested the performance of four well known types of CNN models (VGG19-BN, ResNet26, ResNet50 and ResNet152-Bottleneck) on two invariant relational reasoning tasks: the MNIST Parity task and a colorized variant of the Pentomino task. For these two tasks we observed that conventional deep CNN models did not perform well overall. We hypothesized that the root of the problem is the so-called interference problem. For invariant relational reasoning tasks, a machine learning model must learn to associate a large number of seemingly unrelated patterns in the data. The interference problem posits that for models that employ fully distributed feature hierarchies, these patterns can interfere with one another and result in poorly conditioned training and/or suboptimal generalization.

One natural remedy to the interference problem is to deploy machine learning models that learn modularized representations. To this end, we proposed a modularized CNN: the Residual Mixture Network (ResMixNet) which combines the existing mixture of experts architecture with the ResNet architecture. We showed that the ResMixNet is able to learn both the MNIST Parity and colorized Pentomino tasks, exhibiting less than 2% and 1% test error, respectively. Most importantly the ResMixNets are able to outperform networks that have well over 10x the number of parameters. We believe our empirical results support the hypothesis that modularity can be a robust prior for learning invariant relational reasoning rules.

Finally we tested the ResMixNets on three object recognition tasks: CIFAR-10, CIFAR-100 and SVHN. We used a ResNet50 model as our baseline of performance. We constructed a ResMixNet that has roughly the same parameter budget as the ResNet50. For the CIFAR-10 and SVHN classification tasks, the ResMixNet exhibited a less than 1% gap in test error from the ResNet50 baseline. However, for the CIFAR-100, the ResMixNet lagged behind the ResNet50 model by over 5%. We hypothesized that this is due to the fact that the CIFAR-100 dataset has many similar class labels and thus many of the images may share features. In this case, modularity would be a sub-optimal prior.

For future work we aim to understand the optimal balance between modular and distributed representations, all towards the end goal of a robust general visual architecture that can learn from a wide array of data distributions (artificial, natural, relational reasoning oriented, object recognition oriented, etc.).

Acknowledgments

Vikas Verma was supported by Academy of Finland project 13312683 / Raiko Tapani AT kulut. We would like to acknowledge the following organizations for their generous research funding and/or computational support (in alphabetical order): Calcul Québec, Canada Research Chairs, the CIFAR, Compute Canada, the IVADO and the NSERC.

Appendix A Optimization Experiments

In this section we expand upon our experimental setup and share some more details. In the main article we listed our SGD+Momentum hyperparameter setup. In preliminary experiments we explored other optimizers over a range of settings. Specifically, we tested the Adadelta optimizer Adadelta and the Adam optimizer Adam over a range of learning rates, and we did not notice any non-trivial difference in performance.

In Table 4 we list the LR for SGD+Momentum with the best average test accuracy over 5 random trials for the Colorized Pentomino and MNIST Parity tasks.

Model            Dataset               LR
ResNet26         Colorized Pentomino   0.1
ResNet50         Colorized Pentomino   0.1
ResNet152        Colorized Pentomino   0.1
VGG19-BN         Colorized Pentomino   0.1
ResMixNet(4,1)   Colorized Pentomino   0.01
ResNet26         MNIST Parity          0.05
ResNet50         MNIST Parity          0.1
ResNet152        MNIST Parity          0.05
VGG19-BN         MNIST Parity          0.01
ResMixNet(2,2)   MNIST Parity          0.1
Table 4: Best LRs for SGD+Momentum

We also ran experiments for the ResNet26, ResNet50, ResNet152-Bottleneck and VGG19-BN with stride 1 in the convolutional layer as opposed to stride 2 and there was no non-trivial difference in average test error. None of the models were able to display qualitatively different performance with respect to closing the gap of the ResMixNet’s performance.

Appendix B Choosing the Residual Mixture Network’s hyperparameters

The ResMixNet has two hyperparameters: the number of experts and, since each expert is parameterized to be a deep stack of BasicBlock residual modules, the depth of each expert. Our only guiding principle was to never have the parameter count of our ResMixNet exceed the parameter count of the shallowest non-modular network we considered. Thus we enforced the constraint that we never exceed the parameter count of the ResNet26, which has 370K parameters. This really limits the possible configurations of the ResMixNet.

References