1 Introduction
It is well known that deep Convolutional Networks (ConvNets) LeCun et al. (1998) generate invariance to local translations through convolutions followed by a form of pooling. In practice, however, studies such as Krizhevsky et al. (2012) have applied these models very successfully to domains such as vision, which typically involve data undergoing highly nonlinear transformations. It is therefore clear that these models can capture invariance to such global nonlinear transformations despite solely employing pooling over local translations. Further, Simonyan & Zisserman (2014) observed that a deeper ConvNet usually performs better (and thus is more invariant) on large scale tasks. This raises some fundamental questions.
Problem 1: How does a ConvNet generate invariance to global nonlinear transformations through pooling over mere local translations?
Problem 2: How does invariance increase with depth in ConvNets?
Problem 3: How does a hierarchical architecture help?
These have been long-standing problems in vision since the inception of these networks. Although intuitions and empirical observations abound, the problems are still not completely addressed from a theoretical standpoint.
Main results: In this paper, we take a significant step towards answering these questions.
Addressing Problem 1: We show that these nonlinear invariances arise from the architecture of the network itself rather than from the exact features learnt. More specifically, the entire pipeline of convolution followed by pooling and then a nonlinearity contributes towards learning such powerful invariances. Although optimizing the features is important for capturing the most "information" and providing descriptive features, invariance, strictly speaking, is not generated by the features themselves; it is a byproduct of the architecture. Our main result shows that a layered ConvNet (and also a generalization of such architectures, introduced here as Transformation Networks or TNs) generates invariance to transformations of the input of the form σ(gx), where g is a unitary transformation and σ is a pointwise nonlinearity satisfying certain conditions of unitarity and stability. A very good approximation of such a nonlinearity is the hard ReLU, which is prevalent in practice Nair & Hinton (2010), thereby providing a theoretical justification for it. This form of transformation is highly nonlinear: even though unitary transforms include commonly known "elementary" transforms such as translation and in-plane rotation, their composition with nonlinearities makes the overall transformation very rich and powerful.

Addressing Problem 2: This result immediately shows why depth is an important parameter in ConvNet architecture design. Increasing depth in our model allows invariance to a more expressive class of transformations. Loosely speaking, each layer of the ConvNet can be said to generate invariance to one pair of g and σ. The precise form of the overall transformation depends on the exact hierarchy employed by the architecture and is discussed in more detail in a later section. The architecture of a ConvNet is itself a way of incorporating a prior on the kind of nuisance transformations expected in the data. This is complementary to the regularization implications of weight sharing.
Addressing Problem 3: We also show that the hierarchical nature of a ConvNet significantly improves the efficiency of generating invariance. For local transformation groups of cardinality c and N transformation blocks in total, a layered ConvNet reduces the number of required observations of transformed inputs for training from the order of c^N to Nc, a reduction of the order c^(N-1)/N.
Intuitive Proof Sketch: We first prove that each node at the first layer of a ConvNet (or Transformation Network) generates invariance towards, or factors out, local translations (and more general unitary transforms for TNs). Then we place two conditions (unitarity and stability) on the pointwise nonlinearity used in these networks such that transformations that were not factored out in the first layer are propagated to the second layer. We find that the implicit mapping of a fractional-degree polynomial kernel exactly satisfies unitarity and very closely approximates stability for a well chosen range of degrees; this function is also a very close approximation of the hard ReLU nonlinearity. The nonlinearity helps preserve the group structure of the transformed inputs in the feature space. We finally show that every second layer node is then able to generate invariance to the leftover transformations (those not captured in the first layer) even if they acted on the input after a nonlinearity. In this way, the second layer node is overall invariant to a nonlinear transformation of the input. As we pass through more layers, each adds the ability to be invariant to additional nonlinearities and complexity.
Prior Art: Deep learning, despite its great success in learning useful representations, has yet to gain a concrete theoretical foundation. Nonetheless, there have been many attempts at a deeper understanding of its mechanics. For instance, Kawaguchi (2016) proved important results for deep neural networks, whereas Cohen et al. (2015); Haeffele & Vidal (2015) approached deep learning from the perspective of general tensor decompositions. All of these studies, however, have focused on the supervised version of deep learning. Under supervision, theoretical results are broadly concerned with the optimality of a solution or with properties of the optimization landscape. Given the success of supervised models, such an approach is definitely beneficial in advancing overall understanding. It does, however, consider more general architectures, since supervised results for specialized architectures are more difficult to obtain.
Unsupervised deep learning, however, promises to play an important role in the future, not to mention kindling interest from a neuroscientific perspective. The analysis of our models is therefore aimed at the unsupervised setting and focuses more on the invariance properties of such networks. This reveals new insights into properties of the architecture itself and provides an explanation as to why increasing depth is useful on many fronts. Even though there have been theoretical efforts Delalleau & Bengio (2011); Martens & Medabalimi (2014) to provide results related to the "depth" of a network, the models studied do not immediately resemble the most successful architecture class in practice: ConvNets and their variants. We present results on a generalization of ConvNets called Transformation Networks (TNs) which are directly applicable to ConvNets. In fact, TNs are very closely related to ConvNets and become identical under a very simple constraint.
There have been a few important efforts towards providing results from an unsupervised standpoint Anselmi et al. (2013); Mallat (2012). Mallat (2012) shows that local translation invariance leads to contractions in space; however, it is not clear whether those contractions are due to nonlinear invariances. Anselmi et al. (2013) approach the problem in a fashion more similar to ours, with the use of unitary groups to "transfer" invariance. They show that for a hierarchical feedforward network with unitary group structure, the features at the top layers are exactly invariant to groups of transformations acting over a larger receptive field. Our main result, on the other hand, is more precise. We show that the top layer features are in fact invariant to nonlinear transformations despite only pooling over linear transforms. Further, these nonlinear transformations need not form a group overall; they are only required to form a group locally at every layer. The architecture we consider is very closely related to practical ConvNet architectures, whereas Anselmi et al. (2013) model the architecture using simple and complex cell constructions from a more biologically motivated approach. Further, they hypothesize that the nonlinearity serves as a way of measuring bins of the CDF of an invariant distribution. We, on the other hand, consider the nonlinearity to be an integral part of the process of preserving unitary group structure in the feature space. This also makes the nonlinearity part of the class of transformations to be invariant towards. In turn, this observation leads to the critical result that the overall architecture is invariant to nonlinear transformations despite pooling over linear transforms.
Finally, Bruna et al. (2013); Paul & Venkatasubramanian (2014) also applied group theory, to a certain extent, to the problem of representation learning. These works provide useful insights into stabilization with groups, where stabilization means a contraction or nonexpansion of the space. Nonetheless, they do not explore exact invariance to explicitly nonlinear transforms as our study does.
2 Transformation Networks
We introduce the paradigm of Transformation Networks (TN), a more general way of looking at feedforward architectures such as ConvNets; we present results on these networks and then apply them directly to ConvNets. We first briefly review the notion of unitary groups and group invariant functions.
Premise and Notations:
We denote images and general vectors by x. Given such an x, we define a support set P, a subset of pixels or dimensions of x; x_P denotes the subset of pixels of x contained in the set of indices P, arranged in a column. Given an image x, we consider it divided into small nonoverlapping regions covering the entire image. The i-th support set at layer l is denoted by P_l^i, as shown in Fig. 1. A support set P_{l+1}^i is a union of certain sets P_l^j, as defined by the hierarchy of the architecture (say in a ConvNet). For instance, in Fig. 1, P_2^1 (shaded light blue) is the union of the supports P_1^1, ..., P_1^4 in the image plane. This union of supports is similar to the hierarchical structure observed in ConvNets and is defined by the specific architecture.

Unitary Group: A group is a set of elements along with the properties of closure, associativity, invertibility and identity (we will mostly deal with continuous groups; however, our results also hold for discrete groups). A unitary group G is any group whose elements are unitary in nature, i.e. the dot product is preserved under each transformation: ⟨gx, gy⟩ = ⟨x, y⟩ for all g ∈ G, where gx denotes the action of the group element (or transformation) g on x. The action of a group element can also be constrained to a support set: for instance, g_P is a unitary transform acting only on the support set P, and we express the action of a transform restricted to a support P by g_P x_P.
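As a concrete sanity check of these definitions, the short sketch below treats cyclic translations as a discrete unitary group acting on vectors (the `shift` helper and the specific vectors are illustrative choices, not the paper's notation):

```python
import numpy as np

def shift(x, k):
    """Action of the k-th cyclic translation (a unitary group element) on x."""
    return np.roll(x, k)

rng = np.random.default_rng(0)
x, y = rng.normal(size=8), rng.normal(size=8)

# Unitarity: the dot product is preserved under every group element.
for k in range(8):
    assert np.isclose(shift(x, k) @ shift(y, k), x @ y)

# Group axioms in action: closure (a shift of a shift is a shift)
# and invertibility (shifting by -k undoes shifting by k).
assert np.allclose(shift(shift(x, 2), 3), shift(x, 5))
assert np.allclose(shift(shift(x, 2), -2), x)
```

The same checks pass for any permutation or rotation matrix, since all of these preserve dot products.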
The Unitary Nonlinear Image Transformation Model: Unitary groups are very useful for modelling linear transformations in domains such as images. Indeed, translation and in-plane rotation can be modelled as unitary and expressed as x' = gx. However, coupled with a nonlinearity and restricted supports on the image x, unitary transforms can model a far richer class of images. Let Gx = {gx : g ∈ G} be the set of all transformations of x generated by G. Now, for a given nonlinearity σ, consider a nonlinear transformation of the form x' = σ(g_{P_1} x_{P_1}, ..., g_{P_n} x_{P_n}), where the g_{P_i} apply individual transforms over their specified supports. Notice that the g_{P_i} are jointly unitary: each g_{P_i} is a unitary transformation over the support P_i, and the supports are nonoverlapping. Lastly, the combined transform over the union of the P_i is a unitary transformation over the larger support. This expression of a nonlinear transformation of x is more powerful than the simply linear gx, primarily due to the nonlinearity σ, thereby allowing the modelling of much richer variation in data.
Transformation Networks (TN): Transformation Networks (TNs) are essentially feedforward networks that operate primarily on the principle of generating invariance towards a group or set of transformations through pooling, modelled as group integration. The architecture of these networks is hierarchical in nature; they explicitly invoke invariances only locally and can potentially have multiple layers. In doing so, they implicitly model global invariances. Consider a TN with L layers. Each layer l has a number of TN nodes, each with a receptive field of size r_l, i.e. each cell or node in the layer can only look at patches of size r_l of the output from the previous layer. Every node at layer l can take in a number of input channels from the previous layer and output a number of channels to the next layer. Further, each node has a set of filters or templates {g t : g ∈ G_l} of size r_l, where G_l is any unitary group specific to the output of the layer. We call this set a template set (henceforth assumed to be under some specified G_l); it is simply a set of templates transformed under the action of G_l, and each such set constitutes one transformation block in layer l. Every node contains a pooling operation which performs group integration over the template set (essentially mean pooling). Further, there is a pointwise nonlinearity applied to the pooled feature.
TN Node: A TN node (a node at layer l) provides a single dimensional feature given a patch of size r_l. The node output, for a given nonlinearity σ, template t, and input x, is given by
f_l(x) = σ( ∫_{G_l} ⟨x, g t⟩ dg )   (1)

f_l(x) = σ( (1/|G_l|) Σ_{g ∈ G_l} ⟨x, g t⟩ )   (2)
Here, recall that G_l is a unitary group and t is the template for that particular node. Note that Equation 2 models an average pooled ConvNet exactly for σ being the hard ReLU function and G_l being the translation group. However, the results for the TN node also hold for max pooling. Equation 2 is the version in which the group is discrete and finite; all results hold for both the continuous and the discrete case. Fig. 2 illustrates a single channeled TN node observing two support sets.

Learnable Components in a Transformation Network: The only learnable parameters in a Transformation Network (after the architecture is finalized) are the sets of filters in each layer. Each set has two components to be learnt: 1) the template t for the node at layer l (analogous to a feature), and 2) the group G_l with which the template transforms. Note that once a single template is specified along with the corresponding group G_l, all transformed templates in the template set are specified. Thus, contrary to convolutional architectures, which only learn the filters, Transformation Networks are required to learn both the transformations and the filters. Though the main focus of this paper is the invariance properties of these networks, we briefly investigate how one could learn a Transformation Network.
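A minimal numeric sketch of a single TN node in the sense of Eq. 2, assuming the translation group as G_l (the ConvNet case); the names `tn_node`, `hard_relu`, and `template` are illustrative, not the paper's notation:

```python
import numpy as np

def hard_relu(z):
    """The pointwise nonlinearity applied to the pooled feature."""
    return np.maximum(z, 0.0)

def tn_node(x, template, nonlinearity=hard_relu):
    """One TN node: dot products against the template set, mean pool, nonlinearity."""
    n = len(template)
    # Template set: the orbit of one template under cyclic translation.
    template_set = np.stack([np.roll(template, k) for k in range(n)])
    pooled = (template_set @ x).mean()   # group integration (mean pooling)
    return nonlinearity(pooled)          # single-dimensional node output

rng = np.random.default_rng(0)
x = rng.normal(size=16)
t = rng.normal(size=16)
out = tn_node(x, t)
```

Only the single template `t` and the group (here: cyclic shifts) are specified; the whole template set follows from them, as described above.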
Unsupervised Learning of a Transformation Network: In the unsupervised setting, TNs can be trained in a greedy layerbylayer fashion. The training data is passed through layer 1 of the TN to learn the templates and the corresponding transformation groups at the same time. One simple way is to sample the transforming input sequence. Doing so specifies both the templates and the corresponding groups simultaneously. Unsupervised feature learning techniques such as ICA can also be applied. Once layer 1 is trained, layer 1 features can be extracted from the training data before being passed to layer 2 for training the second layer. This process can be repeated until all layers are trained.
Supervised Learning of a Transformation Network: Under the supervised setting, one can assume that gradients are available. It is harder to train in this setting, since the gradients need to update each template set or transformation block while keeping its group structure intact. One way of addressing this issue is to assume a particular group structure throughout the TN. This is the exact assumption that ConvNets make: ConvNets model all transformation groups in the network as the translation group, which is parametric. The parametric nature allows one to compute the transformed template set on the fly, so the only learnable parameters are the initial templates or filters. This brings us to the realization that a TN modelling general groups might model invariances better than a ConvNet, an observation we explore more in the following section. Nonetheless, our main result shows that ConvNets (and TNs in general) can in fact model nonlinear invariances.
3 Invariances in a Transformation Network
3.1 Linear Unitary Group Invariance in single layer Transformation Networks
We will show that a single layered TN, more specifically a single TN node, can be invariant to any unitary group in the following sense.
Definition 3.1 (Invariant Function).
For any group G, we define a function f to be invariant if f(gx) = f(x) for all g ∈ G and all x.
An invariant to any group can be generated through the following (previously known) property utilizing group integration. This is a basic property of groups and arises due to the invariance of the Haar measure (proof in the supplementary).
Lemma 3.1.
(Invariance Property) Given a vector x, any group G with normalized Haar measure dg, and any fixed g' ∈ G, the following is true for any function f: ∫_G f(g g' x) dg = ∫_G f(g x) dg.
One layer TN is invariant to unitary transformation groups in the input space: Consider a TN with just a single layer of TN nodes. Each of these nodes looks at a patch of the same size. Each output feature of the network is given by Eq. 2, although to study the properties of such a construction, we will utilize Eq. 1. Utilizing Lemma 3.1 along with the definition of a TN node, we have the following.
Lemma 3.2.
(TN node linear Invariance) Let G be the unitary group under whose action the filters or templates of a TN node are transformed. Then the node output is invariant to the action of G on the input x, i.e. f(g'x) = f(x) for all g' ∈ G.
The proof is provided in the supplementary. This result shows that the TN node is invariant to local linear transformations (locality depending on the size of the receptive field). There are two main properties of the unitary group which allow for such invariance. First, the group structure itself allows for an invariant to be computed through group integration. Secondly, the unitary property of each element allows for the transformation to be "transferred" from the template to the input, i.e. ⟨g't, x⟩ = ⟨t, g'⁻¹x⟩. Thus, the integral over G is the same whether we compute it over transformations of the input or of the template. Transformation Networks compute this integration over the pre-transformed templates, thereby computing an invariant feature of x even though the network has never observed any other transformation of x. The unitarity of the transformations allows us to be invariant to transformed versions of the input even though we might never have observed them in training.
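A numeric illustration of Lemma 3.2, sketched under assumed notation and using the max-pooling variant that the text notes the result also covers: a node that pools dot products over all cyclic translations of one template returns the same value for x and for any translated copy of x, even though no translated x was ever "observed" during training.

```python
import numpy as np

def node(x, template):
    """Max-pool dot products over the template's translation orbit, then hard ReLU."""
    n = len(template)
    orbit = np.stack([np.roll(template, k) for k in range(n)])
    return np.maximum((orbit @ x).max(), 0.0)

rng = np.random.default_rng(1)
x = rng.normal(size=12)
t = rng.normal(size=12)

base = node(x, t)
# Check every element of the translation group: the transform transfers
# from the input onto the (pre-transformed) templates, so the set of
# dot products is merely permuted and the pooled value is unchanged.
for k in range(12):
    assert np.isclose(node(np.roll(x, k), t), base)
```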
In the following section, we show that under certain conditions that are very closely approximated in practice, exact invariance can be achieved to nonlinear transformations of the input as well. This is a fundamental problem in generalized (supervised and unsupervised) deep learning: how does a deep feedforward network generate invariance to the highly nonlinear transformations in data? Much of the attention for the answer to this question has gone to learnable features. We find that the inherent structure of the network itself (such as in ConvNets) is ideal for invoking invariance. In our group theoretic framework, these "features" or filter weights are the point from which the transformed filters are generated, i.e. the template t in the template set {g t : g ∈ G}.
3.2 Nonlinear Activation in Transformation Networks
In the case of a TN with 2 or more layers, the nonlinear activation function (under certain conditions) can help generate invariance to nonlinear transformations in the input space. To show this, we first show that under the unitarity condition, the nonlinear activation preserves the unitary group structure in its range space: a unitary transformation in the input domain of the activation function corresponds to a (different) unitary transformation in its range. This unitary group structure is observed by TN nodes downstream (higher up the layers), which can then generate invariance to it through group integration, utilizing Lemma 3.2.

Conditions on the nonlinear activation function: We now state the conditions on the nonlinear activation function σ.

Condition 1: (Unitarity) We define a function σ to be a unitary function if, for every element g of a unitary group G, it satisfies ⟨σ(gx), σ(gy)⟩ = ⟨σ(x), σ(y)⟩ for all x, y.

Condition 2: (Stability) We define a function σ to be stable if σ(σ(x)) = σ(x).
Many functions prevalent in machine learning are unitary in the sense of Condition 1. One example is the class of polynomial kernels k(x, y) = ⟨x, y⟩^p. Since the kernel employs an actual dot product, it is clear that the function is unitary. The activation function of interest in this case is the nonlinear implicit map Φ that the kernel defines from the input space to the Reproducing Kernel Hilbert Space (RKHS), i.e. ⟨Φ(x), Φ(y)⟩ = k(x, y). For an example of a function that is stable in the sense of Condition 2, consider the Rectified Linear Unit or hard ReLU activation function (σ(x) = max(0, x)), which is prevalent in deep learning Nair & Hinton (2010). Note that both unitarity (Condition 1) and stability (Condition 2) need to be satisfied by the activation function σ. We find such a class of nonlinear functions in the implicit kernel map of the polynomial kernel with degree p strictly less than but close to 1. Although not prevalent, such kernels are valid Rossius et al. (1998). These functions are exactly unitary and approximately stable (p being arbitrarily close to 1 but not equal) for the range of values typical in activation functions. For the 1D case, Φ(x) = x^p. Restricting the function to produce only real values, it rejects all negative values in its domain. This behavior is a very close approximation of the hard rectified linear unit, i.e. Φ(x) ≈ max(0, x), as illustrated in Fig. 3.

Group structure is preserved in the range of σ: One of our central results is that group invariance can be invoked through group integration in the nonlinear feature space as well. This is the crux of the invariance generation capabilities of ConvNets and of Transformation Networks in general.
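A hedged numeric check of the two conditions for the polynomial kernel with fractional degree: since the RKHS inner product of the implicit map equals the kernel value, and a unitary transform preserves ⟨x, y⟩, the mapped points' inner product is unchanged. The permutation below is one concrete unitary transform, and the vectors are kept positive so the fractional power stays real, mirroring the rejection of negative values discussed above.

```python
import numpy as np

def rkhs_inner(x, y, p):
    """<phi(x), phi(y)> for the implicit map of k(x, y) = <x, y>**p."""
    return np.dot(x, y) ** p

rng = np.random.default_rng(2)
x = np.abs(rng.normal(size=6)) + 0.1
y = np.abs(rng.normal(size=6)) + 0.1
perm = rng.permutation(6)   # a permutation is a unitary transform g
p = 0.95                    # degree strictly less than, but close to, 1

# Condition 1 (unitarity): the RKHS inner product is preserved under g.
assert np.isclose(rkhs_inner(x[perm], y[perm], p), rkhs_inner(x, y, p))

# Condition 2 (approximate stability) in the 1D case phi(v) = v**p:
# applying the map twice gives v**(p*p), close to v**p when p is near 1.
v = 2.0
assert abs((v ** p) ** p - v ** p) < 0.1
```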
We define an operator g_σ by g_σ σ(x) = σ(gx) for any x, where g is unitary. g_σ is thus a mapping within the range of σ. For unitary σ, we then have the following result.
Theorem 3.3.
(Covariance in the range of σ) If σ is a unitary function in the sense of Condition 1, then g_σ is unitary, and the set G_σ = {g_σ : g ∈ G} is a unitary group in the range of σ. This implies σ(gx) = g_σ σ(x) with g_σ unitary.
Theorem 3.3 shows that a unitary group transformation in the input space of a TN node can be expressed as a unitary group transformation in the range or feature space of the node (proof in the supplementary). Since the group structure is preserved in the nonlinear space, unitary group integration allows for transformation invariance of the input.
3.3 Nonlinear Group Invariance in multilayer Transformation Networks
Analysis of a two-layered TN: Consider a simple 2-layered TN with four TN nodes in layer 1, each looking at a nonoverlapping patch of the raw input image. Let one TN node at layer 2 receive the spatially concatenated input from layer 1, as shown in Fig. 1. Let all nodes be single channel nodes with templates t_1^1, ..., t_1^4, t_2 and corresponding unitary groups G_1^1, ..., G_1^4, G_2 (weight sharing would imply the same templates and groups; our results are not affected by that constraint). The transformed templates were learnt according to the unsupervised learning protocol, or according to a supervised protocol which preserves the group structure of each template set. Since layer 1 TN nodes are already invariant to the unitary groups G_1^i, the only transformations that layer 2 nodes might observe are the ones that are not captured by these groups. Let G_2 be a transformation group that is unitary over the support P_2, whose receptive field is the union of P_1^1, ..., P_1^4. An element of G_2 is not necessarily unitary over the individual supports (the receptive fields defined by the P_1^i) of the layer 1 nodes.
The output of the layer 1 node on support P_1^i, for a single channel with template t_1^i, is of the form σ(⟨x_{P_1^i}, t_1^i⟩_{G_1^i}), where we write the group integral over the dot product as ⟨·,·⟩_G to emphasize an invariant feature. Note that since the templates are learnt in an unsupervised fashion by passing input through the previous layers, each transformed template at layer 2 is itself of the form g σ(·); this nonlinearity appears due to the nonlinear activation function of the previous layer. Even in the case of learning by backpropagation, weights pass through multiple nonlinearities, resulting in similar forms for the templates. For a two-layered TN network, writing Φ(x) = [σ(⟨x_{P_1^1}, t_1^1⟩_{G_1^1}), ..., σ(⟨x_{P_1^4}, t_1^4⟩_{G_1^4})] for the concatenated invariant feature vector and applying the pointwise nonlinearity over the entire vector, the layer 2 feature output becomes,
f_2(x) = σ( ⟨Φ(x), t_2⟩_{G_2} ) = σ( (1/|G_2|) Σ_{g ∈ G_2} ⟨Φ(x), g t_2⟩ )   (3)
Here the receptive field P_2 of the layer 2 node is the union of the receptive fields of the four layer 1 nodes. Since layer 1 features are invariant to their respective groups, variation in Φ(x) occurs only due to transformations that do not fall into the groups modelled by the layer 1 nodes. However, in order for group integration to be applied at layer 2, the transformation needs to be propagated, i.e. be covariant, in the layer 1 feature space. We express this formally through a property which was previously shown to be true for hierarchical architectures employing group integrals Anselmi et al. (2013).
Property 3.4.
(Covariance in the TN node feature space) Given a unitary g over P_2 that is not modelled by the unitary groups in layer 1, i.e. g_{P_1^i} ∉ G_1^i, there exists a unitary g' s.t. Φ(gx) = g' Φ(x), where g_{P_1^i} is the transformation g restricted to the support P_1^i and Φ is the layer 1 feature map.
This property allows a unitary transformation acting on the support P_2 in the input space to have a corresponding action or effect in the feature space. For instance, if one considers an in-plane rotation of an image, then pooling over small blocks of pixels (which is essentially feature extraction with the identity template) still preserves the rotation to a large degree. Feature extraction with general templates will also preserve the transformation, due to the linearity of the dot product.
Applying Property 3.4 to the layer 2 features, we have, for any transformation g acting over the support P_2,
f_2(gx) = σ( ∫_{G_2} ⟨Φ(gx), g_2 t_2⟩ dg_2 )   (4)

= σ( ∫_{G_2} ⟨g' Φ(x), g_2 t_2⟩ dg_2 )   (5)

= σ( ∫_{G_2} ⟨Φ(x), g'⁻¹ g_2 t_2⟩ dg_2 )   (6)

= σ( ∫_{g'⁻¹G_2} ⟨Φ(x), g̃ t_2⟩ dg̃ )   (7)

= σ( ∫_{G_2} ⟨Φ(x), g_2 t_2⟩ dg_2 )   (8)

= f_2(x)   (9)
Equation 6 utilizes Theorem 3.3 (which guarantees that g' is unitary), and Equation 8 utilizes the fact that the templates are modelled using the same transformation model as the raw input images due to training (thereby layer 2 always observes the same group structure G_2). This implies that left multiplication by g'⁻¹ forms a bijective mapping of G_2 onto itself, and thus the transformation results in a rotation of the group elements, or a reordering of the group (here the identity element maps to g'⁻¹ and g' maps to the identity). The group as a set is unchanged, and hence the group integral does not change. We therefore arrive at the layer 2 linear invariance expression, which is
f_2(gx) = σ( ⟨Φ(gx), t_2⟩_{G_2} ) = σ( ⟨Φ(x), t_2⟩_{G_2} ) = f_2(x)   (10)
Here Φ(x) is simply the activation response or output of the layer 1 TN nodes. Intuitively, the layer 2 TN node is invariant to linear transformations over the larger support P_2 in the input space. The invariance expression of f_2, however, is coupled with the nonlinearity from the previous layer (i.e. layer 1). This coupling is what allows the node to model more general nonlinear invariances, as we will see shortly. To highlight the invariance specifically, we rewrite the invariance expression for a general input z over P_2 and a unitary group element g ∈ G_2, where the subscript 2 denotes the layer 2 node. Therefore, we have f_2(gz) = f_2(z).
Further, it is interesting to note that if the transformation acts on the input before the nonlinearity, i.e. the layer observes σ(gx), we then arrive at our main result.
Theorem 3.5.
(Two-layer TN node Nonlinear Invariance) Under a unitary group G_2 acting on the support P_2, the output f_2 of the second layer node covering the support P_2 is invariant to the nonlinear transformation σ(g ·) of any input x for g ∈ G_2, i.e. f_2(σ(gx)) = f_2(x),
for σ satisfying the conditions of unitarity and stability.
Proof.
We have,

f_2(σ(gx)) = f_2(g_σ σ(x))   (11)

= f_2(g_σ x)   (12)

= f_2(x)   (13)

The first equality follows from Theorem 3.3.
The second equality utilizes the stability property of σ, whereas the third equality arises from the invariance of f_2 as demonstrated in Equation 10. ∎
This shows that a TN node at layer 2 is invariant to a nonlinear transformation σ(g ·) of any input over the support P_2. Combining the invariance generated by the first layer as well, the node overall is invariant to the composite nonlinear transformation: the four layer 1 TN nodes are invariant to the four local group elements acting within their supports, and the layer 2 node is invariant to the remaining element combined with the nonlinearity. This result extends directly to multiple layers.
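The two-layer construction can be exercised numerically. The toy check below is a sketch in the spirit of the analysis, with assumed groups and the max-pooling variant for a non-degenerate demo: layer 1 pools over cyclic shifts inside each patch, layer 2 pools over cyclic permutations of the four layer-1 features (a unitary group over the larger support), and the output is unchanged by a within-patch shift composed with a patch permutation, a transform no single layer pools over globally.

```python
import numpy as np

def pool_node(v, template):
    """Max-pool dot products over the template's cyclic orbit, then hard ReLU."""
    n = len(template)
    orbit = np.stack([np.roll(template, k) for k in range(n)])
    return np.maximum((orbit @ v).max(), 0.0)

def two_layer_tn(x, t1, t2):
    """Four layer-1 nodes on nonoverlapping patches, one layer-2 node on top."""
    feats = np.array([pool_node(p, t1) for p in x.reshape(4, 4)])
    return pool_node(feats, t2)   # layer 2 pools over the four features

rng = np.random.default_rng(3)
x = rng.normal(size=16)    # image made of four nonoverlapping 4-pixel patches
t1 = rng.normal(size=4)    # shared layer-1 template (weight sharing)
t2 = rng.normal(size=4)    # layer-2 template over the patch features

base = two_layer_tn(x, t1, t2)
# Nuisance transform: shift pixels within each patch, then permute patches.
nuis = np.roll(x.reshape(4, 4), 2, axis=1)   # local (layer-1) transform
nuis = np.roll(nuis, 1, axis=0).ravel()      # larger-support (layer-2) transform
assert np.isclose(two_layer_tn(nuis, t1, t2), base)
```

The nonlinearity sits between the layers, so the layer-2 pool acts on transformed, ReLU-ed features, which is what makes the composite invariance nontrivial.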
Rich nonlinear invariance in the case of a multi-layered TN: Our result can be naturally extended to multi-layered TNs. Consider a TN with L layers with nonoverlapping receptive fields at each layer, where a node at layer l observes a receptive field that is a union of layer l-1 receptive fields. Assuming the receptive field of the last layer L covers the entire image, the layer L node is invariant to transformations of the form g_L σ(g_{L-1} σ( ... σ(g_1 x))), where we collapse all unitary transforms within a layer into one variable g_l. This class of transformations contains L-1 nonlinearities and is extremely rich. The structure of a TN itself, along with unitary group modelling and a special class of nonlinearities, allows for generating invariance to such a large class of transformations of the input.
Hierarchy helps in efficient invariance generation: Consider the class of transformations that an L-layered TN is invariant to. Using a naive single-stage approach to be invariant to this class, one would need to generate all composite transforms it contains and integrate over them. If all individual groups are finite with the same cardinality c, and there are N transformation blocks in total across the network, then the number of composite transforms is of the order c^N. However, with a hierarchical architecture that generates invariance to the individual groups at every layer, the machine only needs to integrate over the c transforms of each block separately, so the total number of transforms integrated over becomes Nc. This is a significant reduction from c^N to Nc, by an order of c^(N-1)/N. Even though deeper networks require more data to train well, they can generate invariance to more complicatedly transformed data more efficiently. Further, the smaller receptive fields of the lower layers help, since the cardinality of a transformation group acting on a smaller input is lower than that for a larger input. This helps the network factorize transformations into smaller, less complicated transformations before dealing with larger, more complicated nonlinear ones.
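The counting argument can be made concrete with illustrative, made-up numbers: L layers, every local group of the same cardinality c, and (for simplicity) one transformation block per layer.

```python
# Hierarchical vs. flat invariance generation, with toy numbers.
c, L = 10, 5

flat = c ** L          # naive single stage: every composite transform
layered = L * c        # hierarchical: one local group per layer
assert flat == 100_000 and layered == 50

reduction = flat // layered   # order c**(L-1) / L
assert reduction == 2_000
```

With just ten transforms per group and five layers, the flat approach must integrate over two thousand times as many transforms as the hierarchical one.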
3.4 The need for multiple templates or channels
Up until now, our analysis has assumed that the TN nodes have a single channel or a single template; the feature at layer 2 and above was multidimensional merely due to the distinct support sets over the image. Our results, however, extend naturally to multiple channels with multiple templates, since we make no assumption regarding the relation between the templates. Anselmi et al. (2013) suggest the need for multiple templates as a way of better measuring the invariant probability distribution (over pixels) of a group of transformed images. Indeed, the distribution of the quantity ⟨gx, t⟩ over g ∈ G is a 1D projection, along the template t, of the distribution of the transformed set. The more templates or channels, the better the estimate of the probability distribution, due to the Cramer-Wold theorem. This result also holds in our framework, since the dot products in a TN are 1D projections of the transformed data onto a TN node template. This probability distribution is important because Anselmi et al. (2013) show that it is itself invariant to the action of the group. Therefore, moments of the distribution are also invariant, including the first moment (leading to mean pooling) and the infinite moment (leading to max pooling). Group integrals can be seen as measuring the first moment. Thus our results can be integrated with theirs almost seamlessly.
4 Towards Convolutional Architectures
Our framework for Transformation Networks models the transforming templates in each TN node as unitary groups. In order to apply supervised learning or backpropagation to these architectures, one must address the crucial issue of maintaining group structure in the template sets while optimizing the templates themselves. If backpropagation is applied naively to all templates in a template set, assuming they start with a group structure intact, they will converge to the same template throughout the set and the group structure will be lost. One way of addressing this issue is to assume all groups in the TN to be identical and parametric. A parametric transformation that can be efficiently applied would allow us to explicitly generate the template set on the fly for pooling or group integration. In doing so, backpropagation needs to only update one of the templates in each template set. This is because the group structure is explicitly maintained by applying the parametric transformations to that template to generate rest of the template set. ConvNets adopt this exact approach with the transformation of choice being translation since translations can be efficiently implemented during runtime as convolutions.
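A hedged sketch of why the translation group is so cheap to use: the orbit of a template under cyclic shifts never has to be stored, because every dot product ⟨roll(t, k), x⟩ is one lag of the circular cross-correlation of x with t, computable in a single pass in the Fourier domain. (Real ConvNets use non-circular convolution over local windows; the cyclic 1D case below keeps the group structure exact for this illustration.)

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=8)
t = rng.normal(size=8)

# Explicit template set: one dot product per translation group element.
orbit_responses = np.array([np.roll(t, k) @ x for k in range(8)])

# Same responses via circular cross-correlation in the Fourier domain.
corr = np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(t))).real

assert np.allclose(orbit_responses, corr)
```

This equivalence is exactly the "compute the transformed template set on the fly" step described above, which is why backpropagation only needs to update the single base template.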
Our results apply directly to ConvNets, since they are simply TNs instantiated with the unitary groups being discrete translations. We therefore find that the architecture of a ConvNet itself allows it to model nonlinear transformations. The weight-sharing property of ConvNets (leading to convolutions), originally meant for regularization or merely local translation invariance, therefore has a very powerful byproduct: it generates invariance to much more complicated nonlinear transforms overall. Goodfellow et al. (2009) studied the problem of visualizing and measuring the invariances generated by a ConvNet and provided empirical justification for increasing depth. Recall that a ConvNet is invariant to a class of transformations containing at least as many nonlinearities as the network has layers. By increasing the depth of a ConvNet, we thus add a layer of nonlinearity to the class of transformations that the network can be invariant to. This provides a theoretical justification for the well-known observation that depth can improve the performance of a ConvNet.
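The layered pipeline can be sketched concretely (our toy construction, not the paper's experiments): each stage computes dot-products with translated templates, applies a nonlinearity, and pools, and the stacked network's output is invariant to the group action on the input while containing one nonlinearity per layer.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
t1 = rng.normal(size=n)   # layer-1 template
t2 = rng.normal(size=n)   # layer-2 template

def layer(x, t):
    """One conv + nonlinearity stage over the cyclic translation group.
    The feature map is equivariant: shifting x permutes the responses."""
    responses = np.array([x @ np.roll(t, s) for s in range(len(x))])
    return np.maximum(responses, 0.0)   # ReLU nonlinearity

def network(x):
    h = layer(x, t1)           # layer 1: equivariant feature map
    return layer(h, t2).max()  # layer 2 + global max pool: invariant output

x = rng.normal(size=n)
# The output is invariant to every translation of the input, and each extra
# layer contributes one more nonlinearity to the invariance class.
assert all(np.isclose(network(x), network(np.roll(x, s))) for s in range(n))
```

The same pattern stacks to any depth: equivariance propagates through each conv + nonlinearity stage, and the final pooling converts it to invariance.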
5 Conclusion
We have shown that TNs (and thereby ConvNets) can be invariant to nonlinear transformations of the input despite pooling over mere local unitary transformations. We also showed that deeper networks can model much richer classes of transformations. Further, we find that a hierarchical architecture allows the network to generate invariance much more efficiently than a non-hierarchical network.
References
 Anselmi et al. (2013) Fabio Anselmi, Joel Z Leibo, Lorenzo Rosasco, Jim Mutch, Andrea Tacchetti, and Tomaso Poggio. Unsupervised learning of invariant representations in hierarchical architectures. arXiv preprint arXiv:1311.4158, 2013.
 Bruna et al. (2013) Joan Bruna, Arthur Szlam, and Yann LeCun. Learning stable group invariant representations with convolutional networks. arXiv preprint arXiv:1301.3537, 2013.
 Cohen et al. (2015) Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. CoRR, abs/1509.05009, 2015. URL http://arxiv.org/abs/1509.05009.
 Delalleau & Bengio (2011) Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pp. 666–674, 2011.
 Goodfellow et al. (2009) Ian Goodfellow, Honglak Lee, Quoc V Le, Andrew Saxe, and Andrew Y Ng. Measuring invariances in deep networks. In Advances in neural information processing systems, pp. 646–654, 2009.
 Haeffele & Vidal (2015) Benjamin D. Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. CoRR, abs/1506.07540, 2015. URL http://arxiv.org/abs/1506.07540.
 Kawaguchi (2016) Kenji Kawaguchi. Deep learning without poor local minima. In Advances In Neural Information Processing Systems, pp. 586–594, 2016.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Mallat (2012) Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.
 Martens & Medabalimi (2014) James Martens and Venkatesh Medabalimi. On the expressive efficiency of sum product networks. arXiv preprint arXiv:1411.7717, 2014.
 Nair & Hinton (2010) Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.
 Paul & Venkatasubramanian (2014) Arnab Paul and Suresh Venkatasubramanian. Why does deep learning work? A perspective from group theory. arXiv preprint arXiv:1412.6621, 2014.
 Rossius et al. (1998) Rolf Rossius, Gérard Zenker, Andreas Ittner, and Werner Dilger. A short note about the application of polynomial kernels with fractional degree in support vector learning. In European Conference on Machine Learning, pp. 143–148. Springer, 1998.
6 Proofs of theoretical results
All group-theoretic results hold true for finite groups as well.
6.1 Proof of Lemma 3.1
Proof.
We have,
Since the normalized Haar measure is invariant, i.e. $\int_G f(g'g)\,dg = \int_G f(g)\,dg$ for any fixed $g' \in G$, acting by $g'$ simply rearranges the group integral, owing to elementary group properties. ∎
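Since the results hold for finite groups as well, the Haar-invariance step can be illustrated numerically (our sketch, with the function `f` and cyclic group chosen by us): averaging over the orbit of the input is unchanged when the input is first acted on by any fixed group element, because that action merely rearranges the terms of the sum.

```python
import numpy as np

# Discrete analogue of the invariance of the Haar measure: on the cyclic
# group C_n, the group average of f over the orbit of x is invariant to
# pre-transforming x by any fixed group element g'.
n = 6
x = np.arange(n, dtype=float)
w = np.linspace(0.5, 1.5, n)       # fixed weights so that f is not itself
                                   # permutation-invariant

def f(v):
    return float(np.tanh(v) @ w)   # an arbitrary nonlinear function

def group_average(v):
    # Discrete group integral: uniform (normalized counting) measure on C_n.
    return sum(f(np.roll(v, s)) for s in range(n)) / n

for g_prime in range(n):
    # Acting by g' only relabels the summands, so the average is unchanged.
    assert np.isclose(group_average(np.roll(x, g_prime)), group_average(x))
```

The same rearrangement argument, with the sum replaced by an integral against the normalized Haar measure, is exactly the content of the lemma.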
6.2 Proof of Lemma 3.2
6.3 Proof of Theorem 3.3
Proof.
We have $\langle Ux, Uy\rangle = \langle x, y\rangle$, since the function $U$ is unitary. We define $Ux$ as the action or transformation of $U$ on $x$. Preserving inner products is one of the requirements of a unitary operator; however, $U$ also needs to be linear. Linearity of $U$ can be derived from the linearity of the inner product and its preservation under $U$. For an arbitrary vector $x$ and a scalar $\alpha$, we have
(18)  $\|U(\alpha x) - \alpha U(x)\|^{2} = \langle U(\alpha x), U(\alpha x)\rangle - 2\alpha\langle U(\alpha x), U(x)\rangle + \alpha^{2}\langle U(x), U(x)\rangle$
(19)  $\qquad = \langle \alpha x, \alpha x\rangle - 2\alpha\langle \alpha x, x\rangle + \alpha^{2}\langle x, x\rangle$
(20)  $\qquad = \alpha^{2}\|x\|^{2} - 2\alpha^{2}\|x\|^{2} + \alpha^{2}\|x\|^{2}$
(21)  $\qquad = 0,$
so that $U(\alpha x) = \alpha U(x)$. Similarly, for vectors $x$ and $y$, expanding $\|U(x+y) - U(x) - U(y)\|^{2}$ in the same way yields $U(x+y) = U(x) + U(y)$.
We now prove that the set is a group, starting with the closure property. For any fixed pair of elements of the set, their composition again preserves inner products, and therefore belongs to the set by definition; thus closure is established. Associativity, identity and inverse properties can be proved similarly. The set is therefore a unitary group. ∎