Learning Non-Parametric Invariances from Data with Permanent Random Connectomes

11/13/2019 ∙ by Dipan K. Pal, et al. ∙ Carnegie Mellon University

One of the fundamental problems in supervised classification, and in machine learning in general, is modelling the non-parametric invariances that exist in data. Most prior art has focused on enforcing priors in the form of invariances to parametric nuisance transformations that are expected to be present in data. Learning non-parametric invariances directly from data remains an important open problem. In this paper, we introduce a new architectural layer for convolutional networks which is capable of learning general invariances from data itself. This layer can learn invariance to non-parametric transformations and, interestingly, motivates and incorporates permanent random connectomes; we therefore call these networks Permanent Random Connectome Non-Parametric Transformation Networks (PRC-NPTN). PRC-NPTN networks are initialized with random connections (not just weights) which are a small subset of the connections in a fully connected convolution layer. Importantly, these connections in PRC-NPTNs, once initialized, remain permanent throughout training and testing. Permanent random connectomes make these architectures loosely more biologically plausible than many other mainstream network architectures which require highly ordered structures. We motivate randomly initialized connections as a simple method to learn invariance from data itself while invoking invariance towards multiple nuisance transformations simultaneously. We find that these randomly initialized permanent connections have positive effects on generalization, outperforming much larger ConvNet baselines and the recently proposed Non-Parametric Transformation Network (NPTN) on benchmarks that enforce learning invariances from the data itself.




1 Introduction

The Invariance Problem.

The study of machine learning over the years has resulted in the identification of a few core problems that many other problems are compositions of. Learning invariances to nuisance transformations in data is one such task. Indeed, the ideal classifier for supervised classification would be invariant to all within-class transformations while being selective or equivariant to between-class transformations. Further, it has been demonstrated that learning better invariances reduces sample complexity of a classifier

[Anselmi et al.2013]. Given the importance of the invariance problem, it is critical to explore a diverse range of solutions. Incorporating large amounts of data is a common way to expose the model to nuisance transformations so that it can learn invariance. Other methods include minimizing auxiliary objectives to enforce invariance [Hadsell, Chopra, and LeCun2006] and incorporating priors or structure into the algorithm [Gens and Domingos2014, Jaderberg et al.2015, Cohen and Welling2016a]. All these methods are useful and have provided important advancements. However, moving towards real-world data of different modalities, it is a daunting task to theoretically model all nuisance transformations. Towards this goal, methods which learn non-parametric invariances from the data itself without any change in architecture will be critical.

Figure 1: Left: Architecture of the vanilla convolution layer. Left bottom: Transformation Networks were introduced as a general framework for modelling feed-forward convolutional networks. NPTNs and PRC-NPTNs can model non-parametric invariances within the TN framework. Center: Architecture of the PRC-NPTN layer. Each input channel is convolved with a number of filters (parameterized by G). Each of the resultant activation maps is connected to one of the channel max pooling units randomly (initialized once, fixed during training and testing). Each channel pooling unit pools over a fixed random support of a size parameterized by CMP. Right: Explicit invariances enforced within deep networks in prior art are mostly parametric in nature. The important problem of learning non-parametric invariances from data has not received a lot of attention.

Encoding Invariances through Deep Architectures. Before delving into methods which learn such invariances, it is important to study methods which incorporate known

invariances in data. It is often the case that the few most predominant nuisance transformations in data are well understood. Visual data is one such domain, with translation perhaps being the most common nuisance transformation. An early method to incorporate this prior into the algorithm was the Convolutional Neural Network (ConvNet)

[LeCun et al.1998], with the pooling operation following the convolution. The success of ConvNets indicates that addressing the predominant invariances in data warrants being a major objective. Over the years, there have been efforts investigating what other transformations would result in similar breakthroughs. Rotation was investigated at length, with studies rotating the inputs [Dieleman, Willett, and Dambre2015] and the convolution filters [Teney and Hebert2016, Li et al.2017]. Similarly, combinations of rotation, scale and translation invariances were explored [Mallat and Waldspurger2013], along with more general parametric invariances [Cohen and Welling2016a, Henriques and Vedaldi2017, Cohen and Welling2016b]. These efforts provided valuable insights into the nature of visual data, leading to more powerful networks, albeit for specific or specialized tasks. For more general tasks, methods which focused on better optimization, minimizing better objectives and developing more effective architectures proved to be more successful. Nonetheless, it is important to note that though these methods were motivated differently, they ultimately provided hand-crafted invariances assumed to be useful for the task at hand.

Learning Invariances from Data using Deep Architectures. A different class of recently proposed architectures explicitly attempts to learn

the transformation invariances directly from the data, with the only prior being the structure that allows them to do so. One of the earliest attempts using backpropagation was the SymNet

[Gens and Domingos2014]

, which utilized kernel-based interpolation to learn general invariances. Although the study was interesting, the method was limited in scalability. Spatial Transformer Networks

[Jaderberg et al.2015] were also designed to learn activation normalization from data itself; however, the transformation invariance learned was parametric in nature. A more recent effort was the introduction of the Transformation Network paradigm [Pal and Savvides2019]. Non-Parametric Transformation Networks (NPTN) were introduced as a generalization of the convolution layer to model general symmetries from data [Pal and Savvides2019]. They were also introduced as a direction of network development orthogonal to skip connections. The convolution operation followed by pooling was re-framed as pooling across outputs from the translated versions of a filter; translation, forming a unitary group, generates invariance through group symmetry [Anselmi et al.2013]. The NPTN framework has the important advantage of learning general invariances without any change in architecture while being scalable. Given this is an important open problem, we introduce an extension of the Transformation Network (TN) paradigm with an enhanced ability to learn non-parametric invariances through permanent random connectivity.

(a) Homogeneous Structured Pooling
(b) Heterogeneous Permanent Random Support Pooling
(c) Random Support Pooling
Figure 2: (a) Homogeneous Structured Pooling pools across the entire range of transformations of the same kind, leading to feature vectors invariant only to that particular transformation. Here, the two distinct feature vectors are each invariant to one transformation, independently of the other. (b) Heterogeneous Random Support Pooling pools across randomly selected ranges of multiple transformations simultaneously. The pooling supports are defined during initialization and remain fixed during training and testing. This results in a single feature vector that is invariant to multiple transformations simultaneously. Here, each colored box defines the support of the pooling and pools across features only inside the boxed region, leading to one single feature. (c) Vectorized Random Support Pooling extends this idea to convolutional networks, where one realizes that random support pooling on the feature grid (on the left) is equivalent to random support pooling of the vectorized grid. Each element of the vector (on the right) now represents a single channel in a convolutional network, and hence random support pooling in PRC-NPTNs occurs across channels. Note that the random supports do not change during training and testing, and are initialized once during the creation of the network. Also, this is independent of additional spatial pooling. Further, this is different from MaxOut, which pools across channels originating from all inputs (whereas here the pooling occurs over activation maps from a small subset of input channels).

Prior Art using Temporary Random Connections in Neural Architectures. There have been many seminal works that have indeed explored the role of temporary random connections in deep networks. Notable examples include DropOut [Srivastava et al.2014], DropConnect [Wan et al.2013] and Stochastic Pooling [Zeiler and Fergus2013]. However, one critical observation is that unlike the proposed approach, the connections in these networks randomly change at every forward pass, hence are temporary. Thus, the networks do not permanently maintain the same randomly initialized structure throughout the course of training and testing whereas our proposed PRC-NPTNs do. There has also been some interesting work which explored the use of random weight matrices for back propagation [Lillicrap et al.2016]. Here, the forward weight matrices were updated so as to fruitfully use the random weight matrices during back propagation. The motivation of the [Lillicrap et al.2016] study was to address the biological implausibility of the transport of precise gradients through the cortex due to the lack of exact connections and pathways [Grossberg1987, Stork1989, Mazzoni, Andersen, and Jordan1991, Xie and Seung2003]. More recently, random permanent connections were explored for large scale architectures [Xie et al.2019]. It is important however to note that the basic unit of computation, the convolutional layer, remained unchanged. Our study explores permanent random connectomes within the convolutional layer itself, and explores how it can learn non-parametric invariances to multiple transformations simultaneously in a simple manner.

Prior Art using Alternate Architectures. Several works have explored alternate deep layer architectures. A few of the main developments were the application of the skip connection [He et al.2016], depthwise separable convolutions [Chollet2017] and group convolutions [Xie et al.2017]. Randomly initialized channel shuffling is an operation that is central to the application of permanent random connectomes. However, deterministic non-randomized channel shuffling was explored while optimizing networks for computation efficiency [Zhang et al.2018]. Nonetheless, none of these methods explored permanent and random connectomes from the perspective of explicitly learning invariances from data itself.

Relaxed Biological Motivation for Randomly Initialized Connectomes. Although not central to our motivation, the observation that the cortex lacks precise local pathways for back-propagation provided the initial inspiration for this study. It gained further support from the observation that random unstructured local connections are indeed common in many parts of the cortex [Corey and Scholl2012, Schottdorf et al.2015]. Moreover, it has been shown that orientation selectivity can arise in the visual cortex even through local random connections [Hansel and van Vreeswijk2012]. Though we do not explore these biological connections in more detail, it is still an interesting observation. The common presence of random connections in the cortex at a local level leads us to ask: Is it possible that such locally random connectomes improve generalization in deep networks? We provide evidence for answering this question in the positive.

2 Representation Learning through Permanent Random Connectomes

Representation Learning through Pooling. Over the years, the idea of pooling across transformed features to generate invariance towards that particular transformation has been one of the central tools in algorithm design for invariance properties [Dieleman, Willett, and Dambre2015, Li et al.2017]. Similar ideas have also been explored in a more general setting. For instance, a pose-tolerant feature can be generated by pooling over dot-products of the input face with multiple template faces undergoing pose (and other) variation [Liao, Leibo, and Poggio2013, Pal, Juefei-Xu, and Savvides2016].

Invoking Invariance through Pooling. In previous years, a number of theories have emerged on the mechanics of generating invariance through pooling. [Anselmi et al.2013, Anselmi et al.2017] develop a framework in which the transformations are modelled as a group G comprised of unitary operators g. These operators transform a given filter w through the operation gw (footnote 1: the action of the group element g on w is denoted by gw to promote clarity), following which the dot-product between these transformed filters and a novel input x is measured through ⟨x, gw⟩. It is shown by [Anselmi et al.2013]

that any moment, such as the mean or the max (infinite moment), of the distribution of these dot-products in the set {⟨x, gw⟩ | g ∈ G} is an invariant. In practice, these invariants exhibit robustness to the transformations encoded by the transformed filters, as confirmed by [Liao, Leibo, and Poggio2013, Pal, Juefei-Xu, and Savvides2016]. Though this framework did not make any assumptions on the distribution of the dot-products, it imposed the restricting assumption of group symmetry on the transformations. We now show that invariance can be invoked without assuming that the transformations form a group. Instead, we assume that the distribution of the dot-product is uniform, and thus we have the following result (footnote 2: we provide a proof in the supplementary. The assumption of a uniform distribution is meant to provide insight into the general behavior of the max pooling operation, rather than a statement that deep learning features are uniformly distributed).


Lemma 2.1 (Invariance Property). Assume a novel test input x and a filter w, both fixed vectors. Further, let g denote a random variable representing unitary operators with some distribution. Finally, let z = ⟨x, gw⟩, with z ~ U(a, b), i.e. a Uniform distribution between a and b. Then, we have

Var[max(z₁, …, zₙ)] ≤ Var[z],

where z₁, …, zₙ are the dot-product samples being pooled over.
This result is interesting because it shows that the max of the dot-products has less variance due to g than the pre-pooled features. Though this is a largely known empirical result, we provide a concrete proof for invoking robustness. Importantly, it bypasses the need for a group structure on the nuisance transformations. Practical studies such as [Liao, Leibo, and Poggio2013, Pal, Juefei-Xu, and Savvides2016] had ignored the effects of non-group structure in theory while demonstrating effective empirical results. More importantly, the variance of the max is less than the variance of the quantity z = ⟨x, gw⟩, which implies that the pooled feature is more robust to g even at test time, though the network has never observed the transformed input. This useful property is due to the unitarity of g.
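The qualitative claim of Lemma 2.1 is easy to check numerically. The sketch below is illustrative only (the function name, pool sizes and trial counts are our own choices, not the paper's): it estimates the variance of uniform dot-product samples before and after max pooling.

```python
import random
import statistics

random.seed(0)

def variance_of_max(pool_size, trials=20000, a=0.0, b=1.0):
    """Estimate Var[max(z_1..z_k)] for z_i ~ Uniform(a, b)."""
    maxima = [max(random.uniform(a, b) for _ in range(pool_size))
              for _ in range(trials)]
    return statistics.pvariance(maxima)

var_z = variance_of_max(1)     # no pooling: Var[z] = (b-a)^2 / 12 ~ 0.083
var_max4 = variance_of_max(4)  # max over a pool of 4 dot-products

# The max-pooled quantity has markedly lower variance, i.e. more invariance.
assert var_max4 < var_z
```

For k uniform samples on (0, 1) the max is Beta(k, 1)-distributed, so its variance k/((k+1)²(k+2)) shrinks quickly with pool size, which matches the simulated gap.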

Homogeneous Structured Support Pooling across Transformations of the Same Kind. Pooling has been a key operation in learning representations robust to nuisance transformations. However, in almost all cases it is carried out over transformed features/responses undergoing the same kind of transformation. For instance, ConvNets [LeCun et al.1998] pool only across translations, and [Pal, Juefei-Xu, and Savvides2016] pools only across pose at a time, and subjects at another. Similarly, rotation-invariant networks [Dieleman, Willett, and Dambre2015] pool only across rotations. There are cases where pooling occurs over rotations and translations in the same network, but this is more an artifact of the default translation pooling in ConvNets than a carefully designed feature of the network. Pooling across transformations of the same kind is what we term homogeneous pooling. Fig. 2(a) illustrates this idea more concretely. Consider a grid of features that have been obtained through a dot-product (for instance, from a convolution activation map). Along the two axes of the grid, two different kinds of transformation are depicted: one acting on the filter along the horizontal axis and another along the vertical, each parameterized independently. Now, pooling homogeneously across one axis invokes invariance only to the corresponding transformation (for a more in-depth analysis see [Anselmi et al.2013]). Similarly, pooling along only the other axis will result in a feature vector (Feature 2) invariant only to the other transformation. These representations (Feature 1 and 2) have complementary invariances and can be used for complementary tasks, e.g. face recognition (invariant to pose) versus pose estimation (invariant to subject).

Heterogeneous Random Support Pooling across Transformations of Multiple Kinds. The previous, ubiquitous approach has one major limitation. For a small number of transformations in the data, maintaining features invariant to each kind of transformation might be feasible, but not when the number of transformations is large, as is common in real data. One therefore needs features that are invariant to multiple transformations simultaneously. One simple yet effective approach is to drop the homogeneity constraint and pool across transformations of all kinds. This kind of pooling could be called heterogeneous pooling. Under some strong assumptions, Lemma 2.1 shows that such units can be more invariant even when there is no group symmetry within the filters/templates in Fig. 2(a). However, pooling across the entire range of all transformations simultaneously results in a trivial feature losing all selectivity (and therefore not useful for any downstream task).

Limiting the Range of Pooling through a Random Support. A solution to the trivial-feature problem described above is to limit the range or support of pooling, as illustrated in Fig. 2(b). One simple way of selecting such a support for pooling is at random. This selection happens only once during initialization of the network (or any other model) and remains fixed through training and testing. In order to increase the selectivity of such features, multiple pooling units with such randomly initialized supports are needed [Anselmi et al.2013, Pal et al.2017]. Together, these multiple pooling units form a feature that is invariant to multiple transformations simultaneously, which improves generalization, as we find in our experiments. This is called heterogeneous pooling, and Fig. 2(b) illustrates it more concretely.
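As a concrete illustration of this scheme (a minimal sketch with hypothetical function names, not the paper's implementation), the supports can be drawn once with a fixed seed at initialization and then reused unchanged for every input, in training and testing alike:

```python
import random

def make_random_supports(num_features, num_units, support_size, seed=0):
    """Draw pooling supports ONCE at initialization; they stay fixed afterwards."""
    rng = random.Random(seed)
    return [rng.sample(range(num_features), support_size)
            for _ in range(num_units)]

def heterogeneous_pool(features, supports):
    """Each unit max-pools over its own fixed random subset of the features."""
    return [max(features[i] for i in support) for support in supports]

# Supports are part of the (permanent) architecture, not of the forward pass.
supports = make_random_supports(num_features=16, num_units=8, support_size=4)

feat = [float(i) for i in range(16)]
pooled = heterogeneous_pool(feat, supports)
assert len(pooled) == 8  # one invariant feature per pooling unit
```

Because each unit sees only a limited random range of transformed features, selectivity is retained while invariance to several transformations is invoked at once.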

3 Permanent Random Connectome NPTNs

Permanent Random Connectomes from Randomly Initialized Pooling Supports across Channels. We incorporate the idea of randomly initialized but permanent pooling supports into deep networks, more specifically convolutional networks. Fig. 2(b) implements randomly initialized pooling supports over the features. Note that these features are of the form ⟨x, gw⟩ for some transformed filter gw and an input x. For a convolutional network, the filters connected to a single input channel are assumed to be transformations of each other. Since these transformations can be non-parametric and complex, this assumption is feasible. One can simply initialize a permanent random support for pooling over the entire set of activation maps. Fig. 2(c) illustrates this operation. Recall that this random support is defined only once during initialization of the network and does not change during training or testing, unlike DropOut [Srivastava et al.2014], DropConnect [Wan et al.2013] or Stochastic Pooling [Zeiler and Fergus2013]. Importantly, this randomly initialized support results in a sparse, randomly initialized but permanent connectome. Since multiple pooling units max pool over limited ranges of input channels, this invokes invariance to multiple transformations simultaneously. Note that this is also different from MaxOut [Goodfellow et al.2013], where the pooling is over all

input channels. Further, MaxOut was motivated as a more general activation function with no explicit connection to invariance modelling.

The PRC-NPTN layer. Fig. 1(b) shows the architecture of a single PRC-NPTN layer (footnote 3: we provide pseudo-code in the supplementary). The PRC-NPTN layer consists of a set of filters of size N × G, where N is the number of input channels and G is the number of filters connected to each input channel. More specifically, each of the N input channels connects to G filters. Then, a number of channel max pooling units each randomly select a fixed number of activation maps to pool over; this number is parameterized by Channel Max Pool (CMP). Note that this random support selection for pooling is the reason a PRC-NPTN layer contains a permanent random connectome. These pooling supports, once initialized, do not change through training or testing. Once max pooling over CMP activation maps completes, the resultant tensor is average pooled across channels with an average pool size such that the desired number of outputs is obtained. After the CMP units, the output is finally fed through a two-layered network with the same number of channels, which we call a pooling network. This small pooling network helps in selecting non-linear combinations of the invariant nodes generated through the CMP operation, thereby enriching feature combinations downstream.
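A simplified forward pass for such a layer can be sketched as follows. This is our own minimal NumPy approximation of the description above, not the paper's PyTorch implementation; in particular, the number of CMP units (here N·G) and the grouping used for the final channel average pool are assumptions, and the pooling network is omitted.

```python
import numpy as np

def conv2d(x, k):
    """Minimal 'valid' 2D cross-correlation (stand-in for an optimized conv)."""
    kh, kw = k.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

class PRCNPTNLayer:
    """Sketch of a PRC-NPTN layer: N input channels, G filters per channel,
    channel max pooling (CMP) over a permanent random support, then channel
    averaging down to the desired number of outputs."""

    def __init__(self, in_channels, G, out_channels, cmp_size, ksize=3, seed=0):
        rng = np.random.default_rng(seed)
        self.filters = rng.standard_normal((in_channels, G, ksize, ksize))
        total = in_channels * G
        assert total % out_channels == 0
        # Permanent random connectome: each CMP unit gets a fixed random
        # subset of activation maps, drawn once at construction time.
        self.supports = [rng.choice(total, size=cmp_size, replace=False)
                         for _ in range(total)]
        self.group = total // out_channels  # average-pool group size

    def forward(self, x):  # x: (in_channels, H, W)
        maps = [conv2d(x[n], self.filters[n, g])
                for n in range(self.filters.shape[0])
                for g in range(self.filters.shape[1])]
        maps = np.stack(maps)                                   # (N*G, H', W')
        cmp_out = np.stack([maps[s].max(axis=0) for s in self.supports])
        # Average across channels down to the desired number of outputs.
        return cmp_out.reshape(-1, self.group, *cmp_out.shape[1:]).mean(axis=1)

layer = PRCNPTNLayer(in_channels=2, G=3, out_channels=3, cmp_size=2)
y = layer.forward(np.zeros((2, 8, 8)))
assert y.shape == (3, 6, 6)
```

Because `self.supports` is built in `__init__` and never redrawn, the random wiring is permanent, which is exactly what distinguishes this from DropOut-style temporary randomness.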

Invariances in a PRC-NPTN layer. Recent work introducing NPTNs [Pal and Savvides2019] highlighted the Transformation Network (TN) framework, in which invariance is generated during the forward pass by pooling over dot-products with transformed filter outputs. A vanilla convolution layer with a single input and output channel (therefore a single convolution filter) followed by a spatial pooling layer can be seen as a single TN node enforcing translation invariance. It has been shown that spatial pooling over the convolution output of a single filter is an approximation to channel pooling across the outputs of translated filters [Pal and Savvides2019]. The output of such an operation with an input patch x can be expressed as

y(x) = max_{w ∈ W} ⟨x, w⟩,

where W is the set of filters whose outputs are being pooled over. Thus, W defines the set of transformations, and hence the invariance, that the TN node enforces. In a vanilla convolution layer, this is the translation group (enforced by the convolution operation followed by spatial pooling). An NPTN removes any constraints on W, allowing it to approximately model arbitrarily complex transformations. A vanilla convolution layer has one filter whose convolution output is pooled over spatially (for translation invariance). In contrast, an NPTN node has G independent filters whose convolution outputs are pooled across channel-wise, leading to general invariance.
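The TN pooling operation above can be illustrated with a 1-D toy example of our own construction: pooling dot-products over circularly shifted copies of a single filter yields an output that does not change when the input itself is circularly shifted.

```python
# Toy TN node: W is the set of circularly shifted copies of one template,
# and the node output is max over dot-products with each member of W.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def shift(v, s):
    """Circular shift of a list by s positions."""
    return v[-s:] + v[:-s] if s else v[:]

template = [3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
filters = [shift(template, s) for s in range(6)]  # the transformed filter set W

x = [0.0, 0.0, 5.0, 2.0, 0.0, 0.0]
tn_out = max(dot(x, w) for w in filters)
tn_out_shifted = max(dot(shift(x, 2), w) for w in filters)

assert tn_out == tn_out_shifted  # invariant to the (circular) shift of x
```

With W free to contain arbitrary (not just translated) filters, the same max-pooling mechanism yields invariance to whatever transformations the filter set encodes, which is the NPTN generalization the text describes.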

A PRC-NPTN layer inherits from NPTNs the ability to learn arbitrary transformations, and thereby arbitrary invariances, through its G filters per input channel. As Fig. 1(b) shows, individual channel max pooling (CMP) nodes act as NPTN nodes sharing a common filter bank, as opposed to the independent and disjoint filter banks of vanilla NPTNs. This allows for greater activation sharing, where transformations learned from data through one subset of filters can be used for invoking similar invariances in a parallel computation path. This sharing and reuse of activation maps allows for higher parameter and sample efficiency. As we find in our experiments, randomization plays a critical role here, allowing for a simple and quick approximation to obtaining high-performing invariances. A high activation map can activate multiple CMP nodes, winning over multiple subsets of low activations. Gradients flow back to these winning activations, updating the filters to further model the features observed during that particular batch. Note that CMP nodes in the same layer can pool over disjoint subsets to invoke a variety of invariances, leading to a more versatile network and also better modelling of a particular kind of invariance, as we find in our experiments. Further, the primary source of invariance in NPTNs was understood to be the symmetry of the unitary group action space [Pal and Savvides2019]; general invariances were assumed to only approximately form a group. Lemma 2.1 shows that group symmetry is not necessary to reduce the variance of the quantity ⟨x, gw⟩ due to the action of the set elements on a test input patch x. Though the result makes a strong assumption regarding the distribution of the dot-products, it is, to the best of our knowledge, the first result of its kind to show increased invariance without a group-symmetric action.

Rotation *** *** *** ***
ConvNet (36) - - - -
ConvNet (36) FC - - - -
ConvNet (512) - - - -
NPTN (12,3) - - - -
PRC-NPTN (36,1)
PRC-NPTN (18,2)
PRC-NPTN (12,3)
PRC-NPTN (9,4)
Translations 0 pixels *** 4 pixels *** 8 pixels *** 12 pixels ***
ConvNet (36) - - - -
ConvNet (36) FC - - -
ConvNet (512) - - - -
NPTN (12,3) - - - -
PRC-NPTN (36,1)
PRC-NPTN (18,2)
PRC-NPTN (12,3)
PRC-NPTN (9,4)
Table 1: Individual Transformation Results: Test error statistics (mean and standard deviation) on MNIST with progressively extreme transformations: a) random rotations and b) random pixel shifts. Starred entries indicate ablation runs without any randomization, i.e. without random connectomes (applicable only to PRC-NPTNs). For PRC-NPTN and NPTN, the brackets indicate the number of channels in layer 1 and G. ConvNet FC denotes the addition of a 2-layered pooling network after every layer. Note that for this experiment, CMP=G. Permanent random connectomes help achieve better generalization despite increased nuisance transformations.

Current Limitations in Implementation.

Given the motivation of PRC-NPTNs, it would be interesting to observe their behavior on large-scale datasets. Unfortunately, however, current limitations in computational resources and constraints within deep learning frameworks such as PyTorch prevent us from experimenting with very large networks (the deepest network we test contains 12 layers). Our current implementation suffers from heavy GPU memory use and slower run time despite optimizing code at the PyTorch abstraction level. Further optimizations are needed at lower abstraction levels, which are currently outside the scope of this study. Nonetheless, we believe that more efficient CUDA kernels will be possible in the future through a more engineering-focused effort. This study, however, serves as the bedrock upon which such improvements can be benchmarked. The networks we do benchmark against, such as very wide ConvNets, DenseNets and NPTNs [Pal and Savvides2019], all provide strong baselines.

4 Empirical Evaluation and Discussion

General Experimental Settings.

For all experiments, we run all models for 300 epochs trained using SGD. The initial learning rate was kept at 0.1 and decreased by a factor of 10 at 50% and 75% epoch completion. Momentum was kept at 0.9 with weight decay. Batch size was kept at 64 for both MNIST and CIFAR10. For the MNIST experiments, gradients were clipped to norm 1. For CIFAR10, a random crop of the original size was used after zero-padding the images with 4 pixels, along with AutoAugment. Each block for all ConvNet and PRC-NPTN baselines had either a convolution layer or a PRC-NPTN layer followed by batch normalization, PReLU and spatial max pooling. The same convolutional kernel size was used across all MNIST models and across all CIFAR10 models. Spatial max pooling was performed after every layer, BN and PReLU for MNIST models.
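The stated schedule (initial rate 0.1, divided by 10 at 50% and at 75% of the 300 epochs) corresponds to a simple step function, sketched below for clarity (the function name is ours):

```python
def learning_rate(epoch, total_epochs=300, base_lr=0.1):
    """Step schedule from the text: divide by 10 at 50% and 75% completion."""
    lr = base_lr
    if epoch >= 0.5 * total_epochs:
        lr /= 10
    if epoch >= 0.75 * total_epochs:
        lr /= 10
    return lr

assert learning_rate(0) == 0.1
assert abs(learning_rate(150) - 0.01) < 1e-12   # after 50% of training
assert abs(learning_rate(225) - 0.001) < 1e-12  # after 75% of training
```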

Rot/Trans 0 2 4 6 8 10 12
ConvNet (36)
ConvNet (36) FC
ConvNet (512)
NPTN (12,3)
PRC-NPTN (36,1)
PRC-NPTN (18,2)
PRC-NPTN (12,3)
PRC-NPTN (9,4)
Table 2: Simultaneous Transformation Results: Test error statistics (mean and standard deviation) on MNIST with progressively extreme transformations: random rotations and random pixel shifts applied simultaneously. For PRC-NPTN and NPTN, the brackets indicate the number of channels in layer 1 and G. Note that for this experiment, CMP=G.
Method CIFAR10 (w/o Random) CIFAR 10++ (w/o Random)
DenseNet-Conv - -
Table 3: Efficacy on CIFAR10: Test error statistics (mean and standard deviation) on CIFAR10. ++ indicates testing with AutoAugment applied. Each DenseNet and its corresponding PRC-NPTN variant has the same number of parameters. The growth rate was kept at 12 for DenseNet-Conv. (w/o Random) indicates no randomization in the constructed connectomes (as an ablation study).

Efficacy in Learning Arbitrary and Unknown Transformation Invariances from Data. We evaluate on one of the most important tasks of any perception system, i.e. being invariant to nuisance transformations learned from the data itself. Most other architectures based on vanilla ConvNets learn these invariances through the implicit neural network functional map rather than explicitly through the architecture, as PRC-NPTNs do. Moreover, most previous approaches needed hand-crafted architectures to handle different transformations. We benchmark our networks on tasks where nuisance transformations such as large amounts of in-plane rotation and translation are steadily increased, with no change in architecture whatsoever. For this purpose, we utilize MNIST, where it is straightforward to add such transformations without any artifacts.

We benchmark on such a task as described in [Pal and Savvides2019] and, for fair comparison, follow the exact same protocol. We train and test on MNIST augmented with progressively increasing transformations, i.e. 1) extreme random translations (up to 12 pixels in a 28 by 28 image), 2) extreme random rotations, and finally 3) both transformations simultaneously. Both train and test data were augmented, increasing the overall complexity of the problem. No architecture was altered in any way between the two transformations, i.e. none was designed to specifically handle either. The same architecture for all networks is expected to learn invariances directly from data, unlike prior art where such invariances are hand-crafted [Teney and Hebert2016, Li et al.2017, Sifre and Mallat2013, Xu et al.2014, Cohen and Welling2016a, Henriques and Vedaldi2017].

For this experiment, we utilize a two-layered network with intermediate layer 1 having up to 36 channels and layer 2 having exactly 16 channels for all networks (similar to the architectures in [Pal and Savvides2019]), except a wider ConvNet baseline with 512 channels. All ConvNet, NPTN and PRC-NPTN models have a similar number of parameters (except the ConvNet with 512 channels). For PRC-NPTN, the number of channels in layer 1 was decreased from 36 down to 9 while G was increased in order to maintain a similar number of parameters. All PRC-NPTN networks have a two-layered pooling network with the same number of channels as that layer. For a fair benchmark, ConvNet FC has 2 two-layered pooling networks with 36 channels each. Average test errors are reported over 5 runs for all networks.
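The (channels, G) configurations in Tables 1 and 2 trade channels against G while keeping their product, and hence the layer-1 filter count, fixed; a quick check makes the parameter matching explicit:

```python
# Each (channels, G) pair from Tables 1 and 2 keeps channels * G constant,
# so the number of layer-1 filter maps (and parameters) is matched.
configs = [(36, 1), (18, 2), (12, 3), (9, 4)]
filter_counts = {c * g for c, g in configs}
assert filter_counts == {36}  # all four configurations are parameter-matched
```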

Discussion. We present all test errors for this experiment in Table 1 and Table 2 (footnote 4: we display only the (12, 3) configuration for NPTN as it performed the best; more NPTN benchmarks are provided in the supplementary). From both tables, it is clear that as more nuisance transformations act on the data, PRC-NPTN networks outperform the other baselines with the same number of parameters. In fact, even with significantly more parameters, ConvNet-512 performs worse than PRC-NPTN on this task for all settings. Since the testing data has nuisance transformations similar to the training data, the only way for a model to perform well is to learn invariance to these transformations. It is also interesting to observe that permanent random connectomes do indeed help with generalization; without randomization, the performance of PRC-NPTNs drops substantially. The performance improvement of PRC-NPTN also increases with nuisance transformations, showcasing the benefits arising from modelling such invariances. This is particularly apparent from Table 2, where the two simultaneous nuisance transformations pose a significant challenge. Yet, as the transformations increase, the performance improvements increase as well.

Efficacy on CIFAR10 Image Classification. MNIST was a good candidate for the previous experiment, where the addition of nuisance transformations such as translation and rotation did not introduce any artifacts. However, in order to validate permanent random connectomes on more realistic data, we utilize the CIFAR10 dataset with AutoAugment [Cubuk et al.2018] as the nuisance transformation. Note that, from the perspective of previous works on network invariance, it is unclear how to hand-craft architectures to handle the variety of transformations that AutoAugment invokes. This is where the general invariance-learning capability of PRC-NPTNs helps, without requiring expertise in such hand-crafting.

We replace vanilla convolution layers of kernel size 3 in DenseNets with PRC-NPTN layers without the 2-layered pooling networks. There was one other modification for this experiment. For each input channel of a layer, a total of G filters were learnt; however, only a few of them were pooled over (channel max pooling, or CMP). We pool over CMP = 1, 2, 3 or 4 channels chosen randomly, keeping G fixed throughout. Note that this is in contrast with the MNIST experiment, where pooling was always done over all G channels (CMP = G). This provides a different setting under which PRC-NPTN can be utilized. All models in this experiment were trained with AutoAugment and were tested on both a) the original test images and b) the test set transformed by AutoAugment. As in the previous experiment, a model has to learn invariance towards these AutoAugment transformations in order to perform well. All DenseNet models have 12 layers, with the PRC-NPTN variant having the same number of parameters. We train 5 models for each setting and report the mean and standard deviation of the errors; running 5 trials per hyperparameter combination to account for the randomization considerably lengthened the experiments. Importantly, the goal of this experiment is not to push the state of the art, but rather to investigate the behavior of DensePRC-NPTNs within the limits of the computational resources available for this study.
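The channel max pooling (CMP) step described above can be illustrated with a small stand-alone sketch: for one input channel with G = 4 learnt filter responses and CMP = 2, the responses are first permuted by a connectome sampled once at initialization (and then kept permanent), after which the max is taken over consecutive groups of CMP channels. All names and values below are illustrative, not from the paper:

```python
import random

random.seed(0)

G, CMP = 4, 2                      # filters per input channel, pool size
responses = [0.3, 0.9, 0.1, 0.5]   # toy responses of the G filters

# permanent random connectome: a channel permutation drawn once at init,
# reused unchanged for every subsequent forward pass
perm = list(range(G))
random.shuffle(perm)

# channel max pooling: max over consecutive groups of CMP permuted channels
shuffled = [responses[i] for i in perm]
pooled = [max(shuffled[i:i + CMP]) for i in range(0, G, CMP)]
print(pooled)  # G / CMP = 2 pooled channels
```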

Discussion. Table 3 presents the results of this experiment. We find that PRC-NPTN provides clear benefits, at the same number of parameters, even within architectures that make heavy use of skip connections such as DenseNets. Performance increases as channel max pooling increases. Further, randomization appears important to the overall architecture even given the complex nature of real image transformations. PRC-NPTN helps DenseNets account better for nuisance transformations, even ones as extreme as AutoAugment with its 16 transformation types (ShearX/Y, TranslateX/Y, Rotate, AutoContrast, Invert, Equalize, Solarize, Posterize, Contrast, Color, Brightness, Sharpness, Cutout, Sample Pairing) applied to various degrees.

Concluding Remarks. Random connectomes have a compelling motivation from the perspective of learning heterogeneous invariances from data without any change in architecture. They offer a promising alternative dimension in future network design, in contrast to the ubiquitous use of highly structured and ordered connectomes.

5 Appendix

Proof of Lemma 5.1

Lemma 5.1.

(Invariance Property) Assume a novel test input and a filter, both fixed vectors. Further, let denote a random variable representing unitary operators with some distribution. Finally, let , with , i.e. a Uniform distribution between and . Then, we have


Let be the random variable representing the randomness in for fixed and random . We assume that .

Considering a sample set , then . Now,


Let the density of be denoted by , then



Since the variance of is i.e. , and is a decreasing function in , along with the fact that for is , we have

For general , it follows shortly after considering and that . Finally, due to unitary , . ∎
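Although parts of the lemma's notation were lost in extraction, its core claim, that the variance of a max taken over a larger sample set of Uniform random quantities decreases, can be illustrated numerically. For the max of n i.i.d. U(0,1) draws, the variance has the closed form n/((n+1)²(n+2)), which is decreasing in n. The sketch below checks this both analytically and by simulation (the concrete choice of U(0,1) is an assumption for illustration only):

```python
import random

def var_max_uniform(n):
    # closed form: Var(max of n i.i.d. U(0,1)) = n / ((n+1)^2 (n+2))
    return n / ((n + 1) ** 2 * (n + 2))

# analytic check: the variance shrinks as the sample set grows
variances = [var_max_uniform(n) for n in (1, 2, 4, 8, 16)]
assert all(a > b for a, b in zip(variances, variances[1:]))

# Monte Carlo check for n = 8
random.seed(0)
samples = [max(random.random() for _ in range(8)) for _ in range(200_000)]
mean = sum(samples) / len(samples)
mc_var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(abs(mc_var - var_max_uniform(8)))  # small estimation error
```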

import torch
import torch.nn as nn

class PRCNPTN(nn.Module):
    def __init__(self, inch, outch, G, CMP, kernel_size, padding, stride):
        super().__init__()
        self.G = G
        self.maxpool_size = CMP
        self.avgpool_size = int((inch * self.G) / (self.maxpool_size * outch))
        self.expansion = self.G * inch
        self.conv1 = nn.Conv2d(inch, self.G * inch, kernel_size=kernel_size,
                               groups=inch, padding=padding, bias=False)
        self.transpool1 = nn.MaxPool3d((self.maxpool_size, 1, 1))
        self.transpool2 = nn.AvgPool3d((self.avgpool_size, 1, 1))
        # permanent random connectome: a channel permutation drawn once at
        # initialization and kept fixed through training and testing
        self.index = torch.randperm(self.expansion)

    def forward(self, x):
        out = self.conv1(x)             # inch -> G*inch channels
        out = out[:, self.index, :, :]  # permanent randomization
        out = self.transpool1(out)      # G*inch -> G*inch/CMP channels
        out = self.transpool2(out)      # G*inch/(CMP*avgpool) = outch channels
        return out
Figure 3: PRC-NPTN pseudo-code.
Rotation *** *** *** ***
ConvNet (36) - - - -
ConvNet (36) FC - - - -
ConvNet (512) - - - -
NPTN (36,1) - - - -
NPTN (18,2) - - - -
NPTN (12,3) - - - -
NPTN (9,4) - - - -
PRC-NPTN (36,1)
PRC-NPTN (18,2)
PRC-NPTN (12,3)
PRC-NPTN (9,4)
Translations 0 pixels *** 4 pixels *** 8 pixels *** 12 pixels ***
ConvNet (36) - - - -
ConvNet (36) FC - - -
ConvNet (512) - - - -
NPTN (36,1) - - - -
NPTN (18,2) - - - -
NPTN (12,3) - - - -
NPTN (9,4) - - - -
PRC-NPTN (36,1)
PRC-NPTN (18,2)
PRC-NPTN (12,3)
PRC-NPTN (9,4)
Table 4: Individual Transformation Results: Test errors on MNIST with progressively extreme transformations: a) random rotations and b) random pixel shifts. indicates ablation runs without any randomization, i.e. without any random connectomes (applicable only to PRC-NPTNs). For PRC-NPTN and NPTN, the brackets indicate the number of channels in layer 1 and G. ConvNet FC denotes the addition of a 2-layered pooling network after every layer. Note that for this experiment, CMP = G. Permanent random connectomes help achieve better generalization despite increased nuisance transformations.
Rot/Trans 0 2 4 6 8 10 12
ConvNet (36)
ConvNet (36) FC
ConvNet (512)
NPTN (36,1)
NPTN (18,2)
NPTN (12,3)
NPTN (9,4)
PRC-NPTN (36,1)
PRC-NPTN (18,2)
PRC-NPTN (12,3)
PRC-NPTN (9,4)
Table 5: Simultaneous Transformation Results: Test errors on MNIST with progressively extreme transformations, with random rotations and random pixel shifts applied simultaneously. For PRC-NPTN and NPTN, the brackets indicate the number of channels in layer 1 and G. Note that for this experiment, CMP = G.


  • [Anselmi et al.2013] Anselmi, F.; Leibo, J. Z.; Rosasco, L.; Mutch, J.; Tacchetti, A.; and Poggio, T. 2013. Unsupervised learning of invariant representations in hierarchical architectures. arXiv preprint arXiv:1311.4158.
  • [Anselmi et al.2017] Anselmi, F.; Evangelopoulos, G.; Rosasco, L.; and Poggio, T. 2017. Symmetry regularization. Technical report, Center for Brains, Minds and Machines (CBMM).
  • [Chollet2017] Chollet, F. 2017. Xception: Deep learning with depthwise separable convolutions. In

    Proceedings of the IEEE conference on computer vision and pattern recognition

    , 1251–1258.
  • [Cohen and Welling2016a] Cohen, T., and Welling, M. 2016a. Group equivariant convolutional networks. In International Conference on Machine Learning, 2990–2999.
  • [Cohen and Welling2016b] Cohen, T. S., and Welling, M. 2016b. Steerable CNNs. arXiv preprint arXiv:1612.08498.
  • [Corey and Scholl2012] Corey, J., and Scholl, B. 2012. Cortical selectivity through random connectivity. Journal of Neuroscience 32(30):10103–10104.
  • [Cubuk et al.2018] Cubuk, E. D.; Zoph, B.; Mane, D.; Vasudevan, V.; and Le, Q. V. 2018. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501.
  • [Dieleman, Willett, and Dambre2015] Dieleman, S.; Willett, K. W.; and Dambre, J. 2015. Rotation-invariant convolutional neural networks for galaxy morphology prediction. Monthly notices of the royal astronomical society 450(2):1441–1459.
  • [Gens and Domingos2014] Gens, R., and Domingos, P. M. 2014. Deep symmetry networks. In Advances in neural information processing systems, 2537–2545.
  • [Goodfellow et al.2013] Goodfellow, I.; Warde-Farley, D.; Mirza, M.; Courville, A.; and Bengio, Y. 2013. Maxout networks. In International Conference on Machine Learning, 1319–1327.
  • [Grossberg1987] Grossberg, S. 1987. Competitive learning: From interactive activation to adaptive resonance. Cognitive science 11(1):23–63.
  • [Hadsell, Chopra, and LeCun2006] Hadsell, R.; Chopra, S.; and LeCun, Y. 2006. Dimensionality reduction by learning an invariant mapping. In Computer vision and pattern recognition, 2006 IEEE computer society conference on, volume 2, 1735–1742. IEEE.
  • [Hansel and van Vreeswijk2012] Hansel, D., and van Vreeswijk, C. 2012. The mechanism of orientation selectivity in primary visual cortex without a functional map. Journal of Neuroscience 32(12):4049–4064.
  • [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • [Henriques and Vedaldi2017] Henriques, J. F., and Vedaldi, A. 2017. Warped convolutions: Efficient invariance to spatial transformations. In International Conference on Machine Learning.
  • [Jaderberg et al.2015] Jaderberg, M.; Simonyan, K.; Zisserman, A.; et al. 2015. Spatial transformer networks. In Advances in Neural Information Processing Systems, 2017–2025.
  • [LeCun et al.1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
  • [Li et al.2017] Li, J.; Yang, Z.; Liu, H.; and Cai, D. 2017. Deep rotation equivariant network. arXiv preprint arXiv:1705.08623.
  • [Liao, Leibo, and Poggio2013] Liao, Q.; Leibo, J. Z.; and Poggio, T. 2013. Learning invariant representations and applications to face verification. In Advances in Neural Information Processing Systems, 3057–3065.
  • [Lillicrap et al.2016] Lillicrap, T. P.; Cownden, D.; Tweed, D. B.; and Akerman, C. J. 2016. Random synaptic feedback weights support error backpropagation for deep learning. Nature communications 7:13276.
  • [Mallat and Waldspurger2013] Mallat, S., and Waldspurger, I. 2013. Deep learning by scattering. arXiv preprint arXiv:1306.5532.
  • [Mazzoni, Andersen, and Jordan1991] Mazzoni, P.; Andersen, R. A.; and Jordan, M. I. 1991. A more biologically plausible learning rule for neural networks. Proceedings of the National Academy of Sciences 88(10):4433–4437.
  • [Pal and Savvides2019] Pal, D. K., and Savvides, M. 2019. Non-parametric transformation networks for learning general invariances from data. AAAI.
  • [Pal et al.2017] Pal, D.; Kannan, A.; Arakalgud, G.; and Savvides, M. 2017. Max-margin invariant features from transformed unlabelled data. In Advances in Neural Information Processing Systems 30, 1438–1446.
  • [Pal, Juefei-Xu, and Savvides2016] Pal, D. K.; Juefei-Xu, F.; and Savvides, M. 2016.

    Discriminative invariant kernel features: a bells-and-whistles-free approach to unsupervised face recognition and pose estimation.

    In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5590–5599.
  • [Schottdorf et al.2015] Schottdorf, M.; Keil, W.; Coppola, D.; White, L. E.; and Wolf, F. 2015. Random wiring, ganglion cell mosaics, and the functional architecture of the visual cortex. PLoS computational biology 11(11):e1004602.
  • [Sifre and Mallat2013] Sifre, L., and Mallat, S. 2013. Rotation, scaling and deformation invariant scattering for texture discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1233–1240.
  • [Srivastava et al.2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.
  • [Stork1989] Stork, D. G. 1989. Is backpropagation biologically plausible. In International Joint Conference on Neural Networks, volume 2, 241–246.
  • [Teney and Hebert2016] Teney, D., and Hebert, M. 2016. Learning to extract motion from videos in convolutional neural networks. In Asian Conference on Computer Vision, 412–428. Springer.
  • [Wan et al.2013] Wan, L.; Zeiler, M.; Zhang, S.; Le Cun, Y.; and Fergus, R. 2013. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, 1058–1066.
  • [Xie and Seung2003] Xie, X., and Seung, H. S. 2003. Equivalence of backpropagation and contrastive hebbian learning in a layered network. Neural computation 15(2):441–454.
  • [Xie et al.2017] Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; and He, K. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1492–1500.
  • [Xie et al.2019] Xie, S.; Kirillov, A.; Girshick, R.; and He, K. 2019. Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569.
  • [Xu et al.2014] Xu, Y.; Xiao, T.; Zhang, J.; Yang, K.; and Zhang, Z. 2014. Scale-invariant convolutional neural networks. arXiv preprint arXiv:1411.6369.
  • [Zeiler and Fergus2013] Zeiler, M. D., and Fergus, R. 2013. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557.
  • [Zhang et al.2018] Zhang, X.; Zhou, X.; Lin, M.; and Sun, J. 2018. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6848–6856.