
# How ConvNets model Non-linear Transformations

In this paper, we theoretically address three fundamental problems involving deep convolutional networks regarding invariance, depth and hierarchy. We introduce the paradigm of Transformation Networks (TN), which are a direct generalization of Convolutional Networks (ConvNets). Theoretically, we show that TNs (and thereby ConvNets) can be invariant to non-linear transformations of the input despite pooling over mere local translations. Our analysis provides clear insights into the increase in invariance with depth in these networks. Deeper networks are able to model much richer classes of transformations. We also find that a hierarchical architecture allows the network to generate invariance much more efficiently than a non-hierarchical network. Our results provide useful insight into these three fundamental problems in deep learning using ConvNets.


## 1 Introduction

It is a well known fact that deep Convolutional Networks (or ConvNets) LeCun et al. (1998) generate invariance to local translations due to convolutions followed by a form of pooling. In practice, however, studies such as Krizhevsky et al. (2012) have applied these models very successfully to domains such as vision, which typically involve data undergoing highly non-linear transformations. It is therefore clear that these models can model invariance towards these global non-linear transformations despite solely employing pooling over local translations. Further, Simonyan & Zisserman (2014) observed that a deeper ConvNet usually performs better (and thus is more invariant) on large scale tasks. This raises some fundamental questions.

Problem 1: How does a ConvNet generate invariance to global non-linear transformations through pooling over mere local translations?

Problem 2: How does invariance increase with depth in ConvNets?

Problem 3: How does a hierarchical architecture help?

These have been long standing problems in vision since the inception of these networks. Although intuitions and empirical observations abound, the problems have still not been completely addressed from a theoretical standpoint.

Main results: In this paper, we take a significant step towards answering these questions.

Addressing Problem 1: We show that these non-linear invariances arise from the architecture of the network itself rather than the exact features learnt. More specifically, the entire pipeline of convolution, followed by pooling and then a non-linearity, itself contributes towards learning such powerful invariances. Although optimizing the features is important to capture the most “information” and provide descriptive features, invariance, strictly speaking, is not generated by the features themselves. Instead, it is a by-product of the architecture. Our main result shows that a layered ConvNet (and also a generalization of such architectures introduced as Transformation Networks or TNs) generates invariance to transformations of the input that compose unitary transformations $g$ with a point-wise applied non-linearity $\eta$ satisfying certain conditions of unitarity and stability. A very good approximation of such a non-linearity is the hard-ReLU, which is prevalent in practice Nair & Hinton (2010), thereby providing a theoretical justification of the same. This form of transformation is highly non-linear. Even though unitary transforms include commonly known and “elementary” transforms such as translation and in-plane rotation, their composition with non-linearities makes the overall transformation very rich and powerful.

Addressing Problem 2: Further, it immediately shows why depth is an important parameter in ConvNet architecture design. Increasing the number of layers in our model allows us to be invariant to a more expressive transformation form. Loosely speaking, each layer of the ConvNet can be said to generate invariance to one pair of a unitary transformation $g$ and a non-linearity $\eta$. The precise form of $g$ depends on the exact hierarchy employed by the architecture and is discussed in more detail in a later section. The architecture of a ConvNet itself is a form of incorporating a prior on the kind of nuisance transformations expected to be observed in the data. This is complementary to the regularization implications of weight sharing.

Addressing Problem 3: We also show that the hierarchical nature of a ConvNet helps in significantly improving efficiency in generating invariance. For transformation groups of equal cardinality $|G|$, an $n$-layered ConvNet reduces the number of required observations of transformed inputs for training from $|G|^n$ to $n|G|$, a reduction by a factor of $|G|^{n-1}/n$.

Intuitive Proof Sketch: We first prove that each node at the first layer of a ConvNet (and of Transformation Networks) generates invariance towards, or factors out, local translations (and more general unitary transforms for TNs). Then we put two conditions (unitarity and stability) on the point-wise non-linearity used in these networks such that transformations that were not factored out in the first layer are propagated to the second layer. We find that the implicit mapping of a fractional degree polynomial kernel exactly satisfies unitarity and very closely approximates stability for a well chosen range of degrees. This function is also a very close approximation of the hard-ReLU non-linearity. The non-linearity helps preserve the group structure of the transformed inputs in the feature space. We finally show that every second layer node is then able to generate invariance to the left over transformations (not captured in the first layer) even if they had acted on the input after a non-linearity. This way the second layer node is overall invariant to a non-linear transformation of the input. Each additional layer adds the ability to be invariant to one more level of non-linearity and complexity.

Prior Art: Deep learning, despite its great success in learning useful representations, has yet to have a very concrete theoretical foundation. Nonetheless, there have been many attempts at a deeper understanding of its mechanics. For instance, Kawaguchi (2016) proved important results for deep neural networks, whereas Cohen et al. (2015) and Haeffele & Vidal (2015) approached deep learning from the perspective of general tensor decompositions. All of these studies, however, have focused on the supervised version of deep learning. Under supervision, theoretical results can be broadly described as being concerned with the optimality of a solution or properties of the optimization landscape. Given the success of supervised models, such an approach is definitely beneficial in advancing overall understanding. It does, however, consider architectures more general in nature, since supervised results for specialized architectures are more difficult to obtain.

Unsupervised deep learning, however, promises to play an important role in the future, not to mention kindling interest from a neuroscientific perspective. The analysis of our models is therefore aimed at the unsupervised setting and focuses more on the invariance properties of such networks. This reveals new insights into properties of the architecture itself and provides an explanation as to why increasing depth is useful on many fronts. Even though there have been theoretical efforts Delalleau & Bengio (2011); Martens & Medabalimi (2014) to provide results related to the “depth” of a network, the models studied do not immediately resemble the most successful architecture class in practice, ConvNets and their variants. We present results on a generalization of ConvNets called Transformation Networks (TN) which are directly applicable to ConvNets. In fact, TNs are very closely related to ConvNets and become identical under a very simple constraint.

There have been a few important efforts towards providing results from an unsupervised standpoint Anselmi et al. (2013); Mallat (2012). Mallat (2012) shows that local translation invariance leads to contractions in space. However, it is not clear whether those contractions are due to non-linear invariances. Anselmi et al. (2013) approach the problem in a fashion more similar to ours, with the use of unitary groups to “transfer” invariance. They show that for a hierarchical feed-forward network with unitary group structure, the features at top layers would be exactly invariant to groups of transformations acting over a larger receptive field. Our main result, on the other hand, is more precise. We show that the top layer features are in fact invariant to non-linear transformations despite only pooling over linear transforms. Further, these non-linear transformations need not form a group overall. They are only required to form a group locally at every layer. The architecture we consider is very closely related to practical architectures used for ConvNets, whereas Anselmi et al. (2013) model the architecture utilizing simple and complex cell constructions from a more biologically motivated approach. Further, they hypothesize that the non-linearity serves as a way of measuring bins of the CDF of an invariant distribution. We, on the other hand, consider the non-linearity to be an integral part of the process that preserves unitary group structure in the feature space. This also makes it part of the class or range of transformations to be invariant towards. In turn, this observation leads to the critical result that the overall architecture is invariant to non-linear transformations despite pooling over linear transforms.

Finally, Bruna et al. (2013); Paul & Venkatasubramanian (2014) also applied group theory to a certain extent to the problem of representation learning. These works provide useful insights into stabilization with groups. Here, stabilization is meant along the lines of resulting in a contraction or non-expansion of the space. Nonetheless, they do not explore exact invariance to explicitly non-linear transforms as our study does.

## 2 Transformation Networks

We introduce the paradigm of Transformation Networks (TN), a more general way of looking at feed forward architectures such as ConvNets. We present results on TNs and then apply them directly to ConvNets. We first briefly review the notion of unitary groups and group invariant functions.

Premise and Notations: We denote images and general vectors by $x$. Given such an $x$, we define a support set $\Lambda$ which defines a subset of pixels or dimensions over $x$, i.e. $x_\Lambda$ is the subset of pixels contained in the set of indices $\Lambda$, arranged in a column. Given an image $x$, we consider it divided into small non-overlapping regions covering the entire image. Each support set is denoted by $\Lambda_l^i$, i.e. the $i$-th support set at layer $l$, as shown in Fig. 1. $\Lambda_l^i$ is a union of certain $\Lambda_{l-1}^j$ as defined by a hierarchy (say in a ConvNet). For instance in Fig. 1, $\Lambda_2$ (shaded light blue) is the union of the supports $\Lambda_1^1, \dots, \Lambda_1^4$ in the image plane. This union of support is similar to the hierarchical structure observed in ConvNets and is defined by the specific architecture.

Unitary-Group: A group is a set of elements along with the properties of closure, associativity, invertibility and identity (we will mostly deal with continuous groups; however, our results also hold for discrete groups). A unitary group $G$ is any group whose elements are unitary in nature, i.e. the dot-product is preserved under the unitary transformation. More precisely, $\langle g(x), g(y)\rangle = \langle x, y\rangle$ for all $g \in G$. Here $g(x)$ denotes the action of the group element (or transformation) $g$ on $x$. The action of a group can also be constrained by a support set $\Lambda$. For instance, $g_\Lambda$ is a unitary transform acting only on the support set $\Lambda$. We express the action of a transform on a restricted support by $g_\Lambda(x_\Lambda)$.

The Unitary Non-linear Image Transformation Model: Unitary groups are very useful in modelling linear transformations in domains such as images. Indeed, translation and in-plane rotation can be modelled as unitary and expressed as $g(x)$. However, coupled with a non-linearity and restricted support on the image $x$, unitary transforms can model a far richer class of images. Consider the set of all transformations of $x$ generated by a unitary group $G$. Now, for a given non-linearity $\eta$, consider a non-linear transformation of $x$ built from unitary transforms $g_{\Lambda_1^1}, \dots, g_{\Lambda_1^4}$, each applied over its specified support, the non-linearity $\eta$, and a unitary transform $g_{\Lambda_2}$. Notice that the $g_{\Lambda_1^i}$ are jointly unitary. This is because each $g_{\Lambda_1^i}$ is a unitary transformation over the support $\Lambda_1^i$, and $\Lambda_1^i \cap \Lambda_1^j = \emptyset$ for $i \neq j$, i.e. the supports are non-overlapping. Lastly, $\Lambda_2$ is the union of the $\Lambda_1^i$ and $g_{\Lambda_2}$ is a unitary transformation over this larger support. This expression of a non-linear transformation of $x$ is more powerful than the simply linear $g(x)$, primarily due to the non-linearity $\eta$, thereby allowing the modelling of much richer variation in data.

Transformation Networks (TN): Transformation Networks (TN) are essentially feed forward networks that operate primarily on the principle of generating invariance towards a group or set of transformations through pooling modelled as group integration. The architecture of these networks is hierarchical in nature; they explicitly invoke invariances only locally and can potentially have multiple layers. In doing so, they can implicitly model global invariances. Consider a TN with several layers. Each layer has a number of TN nodes, each with a fixed receptive field size, i.e. each cell or node in the layer can only look at patches of that size of the output from the previous layer. Every node at a given layer can take in a number of input channels from the previous layer, and output a number of channels to the next layer. Further, each node has a set of filters or templates $\{g(t) : g \in G\}$, each of the size of the receptive field. Here $G$ is any unitary group specific to the output of the layer. We call this set a template set (henceforth assumed to be under some specified $G$). The template set is simply a set of templates transformed under the action of $G$. Thus, a layer contains as many such transformation blocks as it has templates. Every node contains a pooling operation which performs group integration over the template set (essentially mean pooling). Further, there is a point-wise non-linearity applied to the pooled feature.

TN Node: A TN node $\Upsilon_l^i$ (the $i$-th node at layer $l$) provides a single dimensional feature given a patch whose size equals its receptive field. The node output, for a given non-linearity $\eta$ and input $x$, is given by

$$\Upsilon_l^i(x) = \eta\left(\int_{G_l^i} \langle x, g(t_i)\rangle \, dg\right) \tag{1}$$

$$\approx \eta\left(\frac{1}{|G_l^i|}\sum_{g \in G_l^i} \langle x, g(t_i)\rangle\right) \tag{2}$$

Here, recall that $G_l^i$ is a unitary group and $t_i$ is the template for that particular node. Note that Equation 2 models an average pooled ConvNet exactly for $\eta$ being the hard-ReLU function and $G_l^i$ being the translation group. However, the results for the TN node also hold for max pooling. Equation 2 is the version in which the group is a discrete finite group. All results also hold for the discrete case. Fig. 2 illustrates a single channeled TN node observing two support sets.
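As an illustration of Equation 2, the following is a minimal sketch of a single TN node, assuming the group is the finite group of cyclic translations of a 1-D signal and the non-linearity is the hard-ReLU. The names `template_set` and `tn_node`, the signal length and the random inputs are illustrative choices and not part of the paper.

```python
import numpy as np

def relu(v):
    """Hard-ReLU non-linearity."""
    return np.maximum(v, 0.0)

def template_set(t):
    """All cyclic translations of the template t (assumed finite group)."""
    return np.stack([np.roll(t, s) for s in range(len(t))])

def tn_node(x, t):
    """Discrete TN node: eta(mean over g of <x, g(t)>), as in Equation 2."""
    return relu(np.mean(template_set(t) @ x))

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=8)   # hypothetical input patch
t = rng.uniform(0.1, 1.0, size=8)   # hypothetical template

# Node output for every cyclic translation of the input.
outputs = np.array([tn_node(np.roll(x, s), t) for s in range(8)])
```

Under these assumptions every entry of `outputs` is the same value, which is the invariance to the pooled group that the next section establishes in general.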

Learnable Components in a Transformation Network: The only learnable parameters in a Transformation Network (after the architecture is finalized) are the sets of filters in each node. However, each set has two components to be learnt. 1) The first is the template $t_i$, the template for the $i$-th node at layer $l$ (analogous to a feature). 2) The second is the group $G_l^i$ with which the template transforms. Note that once a single template $t_i$ is specified along with the corresponding group $G_l^i$, all transformed templates in the template set are specified. Thus, contrary to convolutional architectures, which only learn the filters, Transformation Networks are required to learn both the transformations and the filters. Though the main focus of this paper is the invariance properties of these networks, we briefly investigate how one could learn a Transformation Network.

Unsupervised Learning of a Transformation Network: In the unsupervised setting, TNs can be trained in a greedy layer-by-layer fashion. The training data is passed through layer 1 of the TN to learn the templates and the corresponding transformation groups at the same time. One simple way is to sample the transforming input sequence. Doing so specifies both the templates and the corresponding groups simultaneously. Unsupervised feature learning techniques such as ICA can also be applied. Once layer 1 is trained, layer 1 features can be extracted from the training data before being passed to layer 2 for training the second layer. This process can be repeated until all layers are trained.

Supervised Learning of a Transformation Network: Under the supervised setting, one can assume that gradients are available. It is harder to train under this setting since the gradients need to update each template set or transformation block while keeping its group structure intact. One way of addressing this issue is to assume a particular group structure throughout the TN. This is the exact assumption that ConvNets make. ConvNets model all transformation groups in the network as the translation group, which is parametric. The parametric nature allows one to compute the transformed template set on the fly. Thereby the only learnable parameters are the initial templates or filters $t_i$. This brings us to the realization that a TN modelling general groups might model invariances better than a ConvNet, an observation we explore more in the following section. Nonetheless, our main result shows that ConvNets (and TNs in general) can in fact model non-linear invariances.

## 3 Invariances in a Transformation Network

### 3.1 Linear Unitary Group Invariance in single layer Transformation Networks

We will show that a single layered TN, more specifically a single TN node, can be invariant to any unitary group in the following sense.

###### Definition 3.1 (G-Invariant Function).

For any group $G$, we define a function $f$ to be $G$-invariant if $f(g(x)) = f(x)$ for all $g \in G$ and all $x$.

An invariant to any group can be generated through the following (previously) known property utilizing group integration. This is a basic property of groups and arises due to the invariance of the Haar measure (the proof is in the supplementary material).

###### Lemma 3.1.

(Invariance Property) Given a vector $\omega$, and any group $G$, for any fixed $g' \in G$ and a normalized Haar measure $dg$, the following is true:

$$g' \int_G g(\omega)\, dg = \int_G g(\omega)\, dg$$

One layer TN is invariant to unitary transformation groups in the input space: Consider a TN with just a single layer of TN nodes. Each of these nodes looks at a patch of the same size. Each output feature of the network is given by Eq. 2, although to study the properties of such a construction, we will utilize Eq. 1. Utilizing Lemma 3.1 along with the definition of a TN node, we have the following.

###### Lemma 3.2.

(TN node linear Invariance) Under a unitary group $G$, under the action of which the filters or templates of a TN node are transformed, the node output is invariant to the action of $G$ on the input $x$, i.e. $\Upsilon(g(x)) = \Upsilon(x)$ for all $g \in G$.

The proof is provided in the supplementary. This result shows that the TN node is invariant to local linear transformations (locality depending on the size of the receptive field). There are two main properties of the unitary group which allow for such invariance of the input. First, the group structure itself allows an invariant to be computed through group integration. Secondly, the unitary property of each element allows the transformation to be “transferred” from the template to the input, i.e. $\langle g(x), t\rangle = \langle x, g^{-1}(t)\rangle$. Thus, integrating over $G$ is equivalent whether we compute this over the input or the template. Transformation Networks compute this integration over the pre-transformed templates, thereby computing an invariant feature of $x$ even though the network has never observed any other transformation of $x$. The unitarity of the transformations allows us to be invariant to the transformed versions of the input even though we might have never observed them in training.
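The two properties above can be combined into a short derivation. This is only a sketch of the argument under the stated assumptions (a unitary group $G$ with a normalized Haar measure, template $t$, and a fixed $g' \in G$); the full proof is the one in the supplementary material.

```latex
\begin{aligned}
\Upsilon(g'(x)) &= \eta\Big(\int_G \langle g'(x),\, g(t)\rangle \, dg\Big) \\
 &= \eta\Big(\int_G \langle x,\, g'^{-1} g(t)\rangle \, dg\Big)
   && \text{(unitarity of } g'\text{)} \\
 &= \eta\Big(\int_G \langle x,\, \tilde g(t)\rangle \, d\tilde g\Big)
   && (\tilde g = g'^{-1} g,\ \text{invariance of the Haar measure}) \\
 &= \Upsilon(x)
\end{aligned}
```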

In the following section, we show that under certain conditions that are very closely approximated in practice, exact invariance can be achieved to non-linear transformations of the input as well. This is a fundamental problem in generalized (supervised and unsupervised) deep learning. Specifically, how does a deep feed-forward network generate invariance to the highly non-linear transformations in data? Much of the attention for the answer to this question has gone to learnable features. We find that the inherent structure of the network itself (such as in ConvNets) is ideal to invoke invariance. In our group theoretic framework, these “features” or filter weights would be the point from which the transformed filters are generated, i.e. $t$ in $g(t)$.

### 3.2 Non-linear Activation in Transformation Networks

In the case of a TN with 2 or more layers, the non-linear activation function (under certain conditions) can help in generating invariance to non-linear transformations in the input space. In order to show this, we first show that under the unitary condition, such a non-linear activation can preserve the unitary group structure in the range space of the function, i.e. the unitary transformation in the input domain of the non-linear activation function is also a corresponding, albeit different, unitary transformation in the range of the function. This unitary group structure is observed by TN nodes downstream (higher up the layers), which are then able, through group integration, to generate invariance to it utilizing Lemma 3.2.

Conditions on the non-linear activation function: We now state the conditions on the non-linear activation function $\eta$.

1. Condition 1: (Unitarity) We define a function $\eta$ to be a unitary function if, for a unitary group $G$, it satisfies $\langle \eta(g(x)), \eta(g(y))\rangle = \langle \eta(x), \eta(y)\rangle$ for all $g \in G$.

2. Condition 2: (Stability) We define a function $\eta$ to be stable if $\eta(\eta(x)) = \eta(x)$.

Many functions prevalent in machine learning are unitary in the sense of Condition 1. One example is the class of polynomial kernels $k(x, y) = \langle x, y\rangle^d$. Since the kernel employs an actual dot-product, it is clear that the function is unitary. The activation function of interest in this case would be the non-linear implicit map $\eta$ that the kernel defines from the input space to the Reproducing Kernel Hilbert Space (RKHS), i.e. $k(x, y) = \langle \eta(x), \eta(y)\rangle$. For an example of a function that is stable in the sense of Condition 2, we consider Rectified Linear Units or the hard ReLU activation function ($\eta(x) = \max(0, x)$), which is prevalent in deep learning Nair & Hinton (2010). Note that both conditions of unitarity (Condition 1) and stability (Condition 2) need to be satisfied by the activation function $\eta$. We find such a class of non-linear functions in the implicit kernel map of the polynomial kernel with $d$ strictly less than but close to 1. Although not prevalent, such kernels are valid Rossius et al. (1998). These functions are exactly unitary and are approximately stable ($d$ being arbitrarily close to 1 but not equal) for the range of values typical in activation functions. For the 1-D case with degree $d$, $\eta(x) = x^d$. Restricting the function to produce only real values, it rejects all negative values in its domain. This behavior is a very close approximation of the hard rectified linear unit, i.e. $\eta(x) \approx \max(0, x)$, as illustrated in Fig. 3.
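The two conditions can be checked numerically. The sketch below assumes a degree of 0.99 (an illustrative value, not one taken from the paper), uses the 1-D map that sends negative inputs to zero, and uses a random orthogonal matrix as a stand-in for a unitary group element. The helper names are hypothetical.

```python
import numpy as np

D = 0.99  # assumed degree: strictly less than, but close to, 1

def eta(v, d=D):
    """Real-valued 1-D map v -> v**d, with negative inputs rejected (mapped to 0)."""
    return np.maximum(np.asarray(v, dtype=float), 0.0) ** d

def poly_kernel(a, b, d=D):
    """Homogeneous polynomial kernel <a, b>**d (inputs chosen so that <a, b> > 0)."""
    return np.dot(a, b) ** d

# Closeness to the hard-ReLU and approximate stability on [-1, 1].
v = np.linspace(-1.0, 1.0, 201)
relu_gap = np.max(np.abs(eta(v) - np.maximum(v, 0.0)))
stability_gap = np.max(np.abs(eta(eta(v)) - eta(v)))

# Unitarity of the kernel: an orthogonal Q preserves dot products.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
a = rng.uniform(0.1, 1.0, size=5)
b = rng.uniform(0.1, 1.0, size=5)
unitarity_gap = abs(poly_kernel(Q @ a, Q @ b) - poly_kernel(a, b))
```

With this choice of degree, both `relu_gap` and `stability_gap` stay well below 0.01 on the sampled range, while `unitarity_gap` is zero up to floating point error.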

Group structure is preserved in the range of $\eta$: One of our central results is that group invariance can be invoked through group integration in the non-linear feature space as well. This is the crux of the invariance generation capabilities of ConvNets and Transformation Networks in general.

We define an operator $g^{\eta}$ by $g^{\eta}(\eta(x)) = \eta(g(x))$ for any $g \in G$, where $G$ is unitary. $g^{\eta}$ is thus a mapping within the range of $\eta$. Under unitary $\eta$, we then have the following result.

###### Theorem 3.3.

(Covariance in the range of $\eta$) If $\eta$ is a unitary function in the sense of Condition 1, then $g^{\eta}$ is unitary, and the set $G^{\eta} = \{g^{\eta} : g \in G\}$ is a unitary group in the range of $\eta$. This implies $\eta(g(x)) = g^{\eta}(\eta(x))$ with $g^{\eta}$ being unitary.

Theorem 3.3 shows that the unitary group transformation in the input space of a TN node can be expressed as a unitary group transformation in the range or feature space of the node (the proof is in the supplementary material). Since the group structure is preserved in the non-linear space, unitary group integration allows for transformation invariance of the input.

### 3.3 Non-linear Group Invariance in multi-layer Transformation Networks

Analysis of a two-layered TN: Consider a simple 2 layered TN with four TN nodes in layer 1, each looking at non-overlapping patches of the raw input image. Let one TN node at layer 2 receive the spatially concatenated input from layer 1 as shown in Fig. 1. Let all nodes be single channel nodes, each with its own template and corresponding unitary group (weight sharing would imply the same templates and groups; our results are not affected by that constraint). The transformed templates were learnt according to the unsupervised learning protocol or according to a supervised learning protocol which preserves the group structure of each template set. Since layer 1 TN nodes are already invariant to their unitary groups, the only transformations that layer 2 nodes might observe are the ones that are not captured by these groups. $G_{\Lambda_2}$ is a transformation group that is unitary over the support $\Lambda_2$, whose receptive field is the union of the $\Lambda_1^i$. $G_{\Lambda_2}$ is not necessarily unitary over the individual supports $\Lambda_1^i$ (the receptive fields of the layer 1 nodes).

The outputs of the layer 1 nodes for a single channel are of the form $\eta(I(x_{\Lambda_1^i}))$, where we replaced the group integral over the dot-product with $I(\cdot)$ to emphasize an invariant feature. The output of the layer 2 node for a single channel with template $t_2$ is as shown below.

$$\Upsilon_2(x) = \eta\left(\int_{G_2} \langle x, \eta(g(t_2))\rangle \, dg\right)$$

Note that since the templates are learnt in an unsupervised fashion by passing input through the previous layers, the transformed template is of the form $\eta(g(t_2))$. This non-linearity appears due to the non-linear activation function of the previous layer. Even in the case of learning by back-propagation, weights pass through multiple non-linearities, resulting in similar forms for the templates. For a two layered TN, the second layer TN node expression is of the following form.

$$\Upsilon_2(o_1) = \eta\left(\int_{G_2} \left\langle \begin{bmatrix} \eta(I(x_{\Lambda_1^1})) \\ \eta(I(x_{\Lambda_1^2})) \\ \eta(I(x_{\Lambda_1^3})) \\ \eta(I(x_{\Lambda_1^4})) \end{bmatrix}, \eta(g(t_2)) \right\rangle dg\right)$$

Here $o_1$ denotes the concatenated output of layer 1. Writing $x'_{\Lambda_2}$ for the concatenated invariant feature vector $[I(x_{\Lambda_1^1}), \dots, I(x_{\Lambda_1^4})]^T$ and applying the point-wise non-linearity over the entire vector, the layer 2 feature output becomes

$$\Upsilon_2(\eta(x'_{\Lambda_2})) = \eta\left(\int_{G_2} \langle \eta(x'_{\Lambda_2}), \eta(g(t_2))\rangle \, dg\right) \tag{3}$$

Here the receptive field of $x'_{\Lambda_2}$ is the union of the receptive fields of the four $x_{\Lambda_1^i}$. Since layer 1 features are invariant to the respective groups, variation in $x'_{\Lambda_2}$ occurs only due to transformations that do not fall into the groups modelled by layer 1 nodes. However, in order for group integration to be applied at layer 2, the transformation needs to be propagated or be covariant in the layer 1 feature space. We express this formally through a property which was previously shown to be true for hierarchical architectures employing group integrals Anselmi et al. (2013).

###### Property 3.4.

(Covariance in the TN node feature space) Given a unitary $g_{\Lambda_2}$ over $\Lambda_2$ that is not modelled by the unitary groups in layer 1, there exists a $g'_{\Lambda_2}$ s.t.

$$\begin{bmatrix} I(g_{\Lambda_2|\Lambda_1^1}(x_{\Lambda_1^1})) \\ I(g_{\Lambda_2|\Lambda_1^2}(x_{\Lambda_1^2})) \\ I(g_{\Lambda_2|\Lambda_1^3}(x_{\Lambda_1^3})) \\ I(g_{\Lambda_2|\Lambda_1^4}(x_{\Lambda_1^4})) \end{bmatrix} = g'_{\Lambda_2}(x'_{\Lambda_2})$$

where $g_{\Lambda_2|\Lambda_1^i}$ is the transformation $g_{\Lambda_2}$ restricted to the support $\Lambda_1^i$ and $x'_{\Lambda_2}$ is the concatenated layer 1 feature vector.

This property allows a unitary transformation acting on the support $\Lambda_2$ in the input space to have a corresponding action or effect in the feature space. For instance, if one considers an in-plane rotation over an image, then a pooling of pixels (which is essentially feature extraction with the identity template) still preserves the rotation to a large degree. Feature extraction with general templates will also preserve the transformation due to the linearity of the dot-product.

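Applying Property 3.4 to the layer 2 features, we have for any transformed $x'_{\Lambda_2}$ having support over $\Lambda_2$,

$$\begin{aligned}
& \Upsilon_2(\eta \circ g'_{\Lambda_2}(x'_{\Lambda_2})) && \text{(4)} \\
&= \eta\left(\int_{G_2} \langle \eta \circ g'_{\Lambda_2}(x'_{\Lambda_2}),\ \eta \circ g(t_2)\rangle \, dg\right) && \text{(5)} \\
&= \eta\left(\int_{G_2^{\eta}} \langle g'^{\eta}_{\Lambda_2} \circ \eta(x'_{\Lambda_2}),\ g^{\eta} \circ \eta(t_2)\rangle \, dg^{\eta}\right) && \text{(6)} \\
&= \eta\left(\int_{G_2^{\eta}} \langle \eta(x'_{\Lambda_2}),\ (g'^{\eta}_{\Lambda_2})^{-1} \circ g^{\eta} \circ \eta(t_2)\rangle \, dg^{\eta}\right) && \text{(7)} \\
&= \eta\left(\int_{G_2^{\eta}} \langle \eta(x'_{\Lambda_2}),\ g^{\eta} \circ \eta(t_2)\rangle \, dg^{\eta}\right) && \text{(8)} \\
&= \Upsilon_2(\eta(x'_{\Lambda_2})) && \text{(9)}
\end{aligned}$$

Equation 6 utilizes Theorem 3.3, and Equation 8 utilizes the fact that the templates are considered to be modelled using the same transformation model as the raw input images due to training (thereby layer 2 always observes the same group structure $G_2^{\eta}$). This implies that $(g'^{\eta}_{\Lambda_2})^{-1}$ maps each $g^{\eta}$ bijectively to some other element of $G_2^{\eta}$, and thus the transformation results in a rotation of the group elements or a reordering of the group (here the identity element maps to $(g'^{\eta}_{\Lambda_2})^{-1}$ and $g'^{\eta}_{\Lambda_2}$ maps to the identity). The group essentially is invariant and hence the group integral does not change. We therefore arrive at the layer 2 linear invariance expression, which is

$$\Upsilon_2(\eta(g_{\Lambda_2} x'_{\Lambda_2})) = \Upsilon_2(\eta(x'_{\Lambda_2})) \tag{10}$$

Here $\eta(x'_{\Lambda_2})$ is simply the activation response or output of the layer 1 TN nodes. Intuitively, the layer 2 TN node is invariant to linear transformations over the larger support in the input space. The invariance expression of $\Upsilon_2$, however, is coupled with the non-linearity from the previous layer (i.e. layer 1). This coupling is what allows the node to model more general non-linear invariances, as we will see shortly. In order to highlight the invariance specifically, we rewrite the invariance expression for a general $x$ and unitary group element $g$, replacing $\Upsilon_2(\eta(\cdot))$ with $\Gamma_2(\cdot)$, where the subscript 2 denotes the layer 2 node. Therefore, we have

$$\Gamma_2(g(x)) = \Gamma_2(x)$$

Further, it is interesting to note that if $x$ itself is a non-linear transformation of some other input, we then arrive at our main result.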

###### Theorem 3.5.

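(Two-layer TN node Non-linear Invariance) Under a unitary group $G_{\Lambda_2}$ acting on the support $\Lambda_2$, the output of the second layer node $\Gamma_2$ covering the support $\Lambda_2$ is invariant to the action or transformations of $G_{\Lambda_2}$ on any input $x'_{\Lambda_2}$, i.e.

$$\Gamma_2(x'_{\Lambda_2}) = \Gamma_2(\eta \circ g'_{\Lambda_2}(x'_{\Lambda_2})) \quad \forall g'_{\Lambda_2} \in G_{\Lambda_2},\ \forall x'_{\Lambda_2}$$

for $\eta$ satisfying the conditions of unitarity and stability.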

###### Proof.

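We have,

$$\begin{aligned}
\Gamma_2(\eta \circ g'_{\Lambda_2}(x'_{\Lambda_2})) &= \Upsilon_2(\eta \circ \eta \circ g'_{\Lambda_2}(x'_{\Lambda_2})) && \text{(11)} \\
&= \Upsilon_2(\eta \circ g'_{\Lambda_2}(x'_{\Lambda_2})) && \text{(12)} \\
&= \Upsilon_2(\eta(x'_{\Lambda_2})) = \Gamma_2(x'_{\Lambda_2}) && \text{(13)}
\end{aligned}$$

The second equality utilizes the stability property of $\eta$, whereas the third equality arises from the invariance of $\Upsilon_2$ as demonstrated in Equation 10. ∎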

This shows that a TN node at layer 2 is invariant to a non-linear transformation of any $x'_{\Lambda_2}$ over the support $\Lambda_2$. Combining the invariance generated due to the first layer as well, the node overall is invariant to a non-linear transformation of the input composed of the four layer 1 unitary transforms, the non-linearity $\eta$ and the layer 2 unitary transform. More specifically, the four layer 1 TN nodes are invariant to the first four group elements in this sequence, and the layer 2 node is invariant to the last element combined with the non-linearity. This result can be extended to multiple layers directly.
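The layer 2 invariance can be checked numerically in a special case. The sketch below assumes the layer 2 group is the cyclic permutation of four layer 1 features and that the non-linearity is the hard-ReLU, which is exactly stable. The feature and template values are made up for illustration, and `upsilon2` and `gamma2` are hypothetical helper names mirroring the symbols in the text.

```python
import numpy as np

def relu(v):
    """Hard-ReLU: exactly stable, since relu(relu(v)) == relu(v)."""
    return np.maximum(v, 0.0)

def upsilon2(v, t2):
    """Layer 2 node: eta(mean over g of <v, eta(g(t2))>), g = cyclic shifts."""
    shifted = np.stack([relu(np.roll(t2, s)) for s in range(len(t2))])
    return relu(np.mean(shifted @ v))

def gamma2(x, t2):
    """Gamma_2(x) = Upsilon_2(eta(x))."""
    return upsilon2(relu(x), t2)

x_feat = np.array([0.7, -1.2, 0.3, -0.5])  # hypothetical layer 1 features
t2 = np.array([0.5, -0.4, 1.0, 0.2])       # hypothetical layer 2 template

base = gamma2(x_feat, t2)
# Apply the non-linear transformation eta(g'(x)) for every group element g'.
transformed = np.array(
    [gamma2(relu(np.roll(x_feat, s)), t2) for s in range(4)]
)
```

Every entry of `transformed` equals `base`, even though `relu(np.roll(x_feat, s))` is a genuinely non-linear function of `x_feat` because of its negative entries.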

Rich non-linear invariance in the case of a multi-layered TN: Our result can be naturally extended to multi-layered TNs. Consider a TN with $n$ layers, with non-overlapping receptive fields at each layer $l$. Each node at layer $l$ observes a receptive field that is a union of receptive fields from layer $l-1$. Assuming the last layer's receptive field covers the entire image, the layer $n$ node is invariant to a class of transformations in which the unitary transforms of each layer alternate with the non-linearity $\eta$. The form can be written more compactly by collapsing all unitary transforms in a layer into one variable. This class of transformations contains non-linearities and is extremely rich. The structure of a TN itself, along with unitary group modelling and a special class of non-linearities, allows for generating invariance to such a large class of transformations of the input.

Hierarchy helps in efficient invariance generation: Consider the class of transformations that an $n$-layered ConvNet is invariant to. Using a naive single layered approach to be invariant to this class, one would need to generate all transforms modelled by it and integrate over them. If the finite group at layer $l$ has cardinality $|G_l|$, then the number of all possible composite transforms is of the form $\prod_{l=1}^{n} |G_l|$. If all individual groups have the same cardinality $|G|$, then the number of transformations is of the order $|G|^n$. However, with a hierarchical architecture that generates invariance to the individual groups at every layer, the machine only needs to integrate over $|G_l|$ transforms at layer $l$. The total number of transforms needed to be integrated over becomes $\sum_{l=1}^{n} |G_l|$. Under the assumption that all groups are the same size, the total number becomes $n|G|$. This is a significant reduction from $|G|^n$ to $n|G|$, by a factor of $|G|^{n-1}/n$. Even though deeper networks require more data to train well, they can generate invariance to more complicatedly transformed data more efficiently. Further, lower layers having a smaller receptive field helps, since the cardinality of the transformation groups acting on smaller sized inputs is lower than for larger sized inputs. This helps the network in factorizing transformations, handling smaller, less complicated transformations before dealing with larger, more complicated non-linear ones.
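The counting argument can be made concrete with a small worked example. The sketch below assumes four layers whose groups each have cardinality 8; these numbers are illustrative only, and the function names are hypothetical.

```python
from math import prod

def naive_count(cardinalities):
    """Transforms a single layer must integrate over: product of group sizes."""
    return prod(cardinalities)

def hierarchical_count(cardinalities):
    """Transforms integrated over by a hierarchy: sum of group sizes."""
    return sum(cardinalities)

sizes = [8, 8, 8, 8]              # assumed |G_l| for a 4-layer network
naive = naive_count(sizes)        # 8**4
hier = hierarchical_count(sizes)  # 4 * 8
reduction = naive / hier          # equals |G|**(n-1) / n
```

Here the naive approach needs 4096 transforms while the hierarchy needs 32, a reduction by a factor of 128.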

### 3.4 The need for multiple templates or channels

Up until now in our analysis, we have assumed that the TN nodes have a single channel or a single template. The feature at layer 2 and above was multi-dimensional merely due to the distinct support sets over the image. Our results, however, extend naturally to multiple channels with multiple templates, since we make no assumption regarding the relation between the templates. Anselmi et al. (2013) suggest the need for multiple templates as a way of better measuring the invariant probability distribution (over pixels) of a group of transformed images. Indeed, the quantity is a 1-D projection along of the distribution of the set . The more templates or channels there are, the better the estimate of the probability distribution, by the Cramér-Wold theorem. This result also holds true for our framework, since the dot-products in a TN are 1-D projections of the transformed data onto a TN node template. This probability distribution is important because Anselmi et al. (2013) show that it is itself invariant to the action of the group . Therefore, moments of the distribution are also invariant, including the first (leading to mean pooling) and the infinite moment (leading to max pooling). Group integrals can be seen as measuring the first moment. Thus our results can be integrated with theirs almost seamlessly.
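The invariance of the projected distribution and of its moments can be checked numerically. This is a minimal sketch under stated assumptions: cyclic shifts stand in for the transformation group, the templates are random, and the helper names `projections` and `signature` are hypothetical.

```python
import numpy as np

def projections(x, template):
    """All dot-products <g(x), t> as g ranges over cyclic shifts of x."""
    return np.array([np.roll(x, s) @ template for s in range(len(x))])

def signature(x, templates):
    """Mean (first moment) and max (infinite moment) per template."""
    return np.array([[projections(x, t).mean(), projections(x, t).max()]
                     for t in templates])

rng = np.random.default_rng(1)
x = rng.normal(size=6)
templates = rng.normal(size=(3, 6))   # three channels, chosen arbitrarily
shifted = np.roll(x, 2)               # a transformed version of the input

# The set of projections is only permuted, so its moments are unchanged.
print(np.allclose(signature(x, templates), signature(shifted, templates)))
```

Each additional template adds one more 1-D view of the same orbit, which is the sense in which more channels give a better estimate of the distribution.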

## 4 Towards Convolutional Architectures

Our framework for Transformation Networks models the transforming templates in each TN node as unitary groups. In order to apply supervised learning or back-propagation to these architectures, one must address the crucial issue of maintaining group structure in the template sets while optimizing the templates themselves. If back-propagation is applied naively to all templates in a template set, assuming they start with the group structure intact, they will converge to the same template throughout the set and the group structure will be lost. One way of addressing this issue is to assume all groups in the TN to be identical and parametric. A parametric transformation that can be applied efficiently would allow us to explicitly generate the template set on the fly for pooling or group integration. In doing so, back-propagation needs to update only one of the templates in each template set, because the group structure is explicitly maintained by applying the parametric transformations to that template to generate the rest of the template set. ConvNets adopt this exact approach, with the transformation of choice being translation, since translations can be efficiently implemented at run-time as convolutions.
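The idea of generating the template set on the fly from a single stored template can be sketched as follows. This is an illustration only: it assumes full cyclic translations of a 1-D template (a ConvNet would use small local kernels), and the function names are hypothetical. It shows that the dot-products against every translated template coincide with a circular cross-correlation, which can be computed without materializing the set.

```python
import numpy as np

def template_set(base):
    """Explicitly generate every cyclic translation of the single stored template."""
    return np.stack([np.roll(base, s) for s in range(len(base))])

def responses_explicit(x, base):
    """Dot-product of x with each generated template."""
    return template_set(base) @ x

def responses_fft(x, base):
    """Same responses as a circular cross-correlation computed with the FFT."""
    return np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(base))).real

rng = np.random.default_rng(4)
x = rng.normal(size=16)
base = rng.normal(size=16)   # the only template that would receive updates
print(np.allclose(responses_explicit(x, base), responses_fft(x, base)))
```

Because only `base` is stored, any update to it propagates to the whole set, so the group structure cannot be lost during optimization.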

Our results apply directly to ConvNets since they are simply TNs instantiated with the unitary groups being discrete translations. We therefore find that the architecture of a ConvNet itself allows it to model non-linear transformations. The weight sharing property of ConvNets (leading to convolutions), originally meant for regularization or merely local translation invariance, therefore has the very powerful by-product of generating invariance to much more complicated non-linear transforms overall. Goodfellow et al. (2009) have studied the problem of visualizing and measuring these invariances generated by a ConvNet and provided empirical justifications for increasing depth. Recall that given a ConvNet with layers, it is invariant to a class of transformations with at least non-linearities. By increasing the depth of a ConvNet, we are essentially adding a layer of non-linearity to the class of transformations that the ConvNet can be invariant to. This provides a theoretical justification for the well-known fact that depth can improve the performance of a ConvNet.

## 5 Conclusion

We have shown that TNs (and thereby ConvNets) can be invariant to non-linear transformations of the input despite pooling over mere local unitary transformations. We also showed that deeper networks are able to model much richer classes of transformations. Further, we find that a hierarchical architecture allows the network to generate invariance much more efficiently than a non-hierarchical network.

## 6 Proofs of theoretical results

All group-theoretic results hold true for finite groups as well.

### 6.1 Proof of Lemma 3.1

###### Proof.

We have,

$$g'\left(\int_G g(x)\,dg\right) = \int_G g' \circ g(x)\,dg = \int_G g''(x)\,dg'' = \int_G g(x)\,dg$$

since the normalized Haar measure is invariant, i.e. $dg = dg''$. Intuitively, $g'$ simply rearranges the group integral owing to elementary group properties. ∎
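For a finite group the integral becomes an average, and the lemma can be checked directly. The sketch below assumes the four-element group of 90-degree rotations acting on a square image; the name `group_average` is hypothetical.

```python
import numpy as np

def group_average(x):
    """Average of x over the group of 90-degree rotations (finite group integral)."""
    return np.mean([np.rot90(x, k) for k in range(4)], axis=0)

rng = np.random.default_rng(2)
x = rng.normal(size=(5, 5))
avg = group_average(x)

# Applying any group element to the average only reorders the summands.
fixed = all(np.allclose(np.rot90(avg, k), avg) for k in range(4))
print(fixed)
```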

### 6.2 Proof of Lemma 3.2

###### Proof.

We have,

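$$\begin{aligned}
\Upsilon(g'(x)) &= \eta\left(\int_G \langle g'(x),\, g(t)\rangle\,dg\right) && (14)\\
&= \eta\left(\int_G \langle x,\, g'^{-1}(g(t))\rangle\,dg\right) && (15)\\
&= \eta\left(\int_G \langle x,\, g''(t)\rangle\,dg''\right) && (16)\\
&= \Upsilon(x) && (17)
\end{aligned}$$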

Eq. 15 uses the fact that $g'$ is unitary. Eq. 16 uses a change of variable: since $g' \in G$, we have $g'' = g'^{-1} g \in G$. Further, $dg = dg''$ since the Haar measure is invariant. ∎
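The same steps can be verified numerically for a finite, non-commutative group. This sketch assumes the eight-element group of rotations and reflections of a square image, a random template, and `np.tanh` as a stand-in for the non-linearity; `d4_orbit` and `upsilon` are hypothetical names.

```python
import numpy as np

def d4_orbit(img):
    """The 8 rotations and reflections of a square image."""
    flipped = np.fliplr(img)
    return ([np.rot90(img, k) for k in range(4)]
            + [np.rot90(flipped, k) for k in range(4)])

def upsilon(x, t, eta=np.tanh):
    """eta of the group average of <x, g(t)> over all transformed templates."""
    return eta(np.mean([np.sum(x * g_t) for g_t in d4_orbit(t)]))

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 4))
t = rng.normal(size=(4, 4))

# g' is a rotation followed by a vertical flip, itself an element of the group.
x_transformed = np.flipud(np.rot90(x))
print(np.isclose(upsilon(x, t), upsilon(x_transformed, t)))
```

Transforming the input is equivalent to applying the inverse transformation to every template, which merely permutes the template set and leaves the average unchanged.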

### 6.3 Proof of Theorem 3.3

###### Proof.

We have , since the function is unitary. We define as the action or transformation of on . This is one of the requirements of a unitary operator, however needs to be linear. Linearity of can be derived from the linearity of the inner product and its preservation under in . For an arbitrary vector and a scalar , we have

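$$\begin{aligned}
&\|\alpha g_\eta(p) - g_\eta(\alpha p)\|^2 && (18)\\
&= \langle \alpha g_\eta(p) - g_\eta(\alpha p),\ \alpha g_\eta(p) - g_\eta(\alpha p)\rangle && (19)\\
&= \|\alpha g_\eta(p)\|^2 + \|g_\eta(\alpha p)\|^2 - 2\langle \alpha g_\eta(p),\ g_\eta(\alpha p)\rangle && (20)\\
&= |\alpha|^2\|p\|^2 + \|\alpha p\|^2 - 2\alpha^2\langle p, p\rangle = 0 && (21)
\end{aligned}$$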

Similarly for vectors , we have

We now prove that the set is a group. We start with proving the closure property. We have for any fixed

$$g_\eta(g'_\eta(\eta(x))) = g_\eta(\eta(g'(x))) = \eta(g(g'(x))) = \eta(g''(x)) = g''_\eta(\eta(x))$$

Since therefore by definition. Also, and thus closure is established. Associativity, identity and inverse properties can be proved similarly. The set is therefore a unitary-group in . ∎