1 Introduction
Deep neural networks have revolutionized computer vision. Convolutional neural networks (CNNs) are particularly popular, especially for tasks such as classification, object detection, and segmentation. CNNs are motivated by certain invariances that are claimed to be inherent in their architectures. One of these is invariance to in-plane translations of the image: the output of a CNN is claimed to be unchanged as the object is translated, a desirable property for an object classifier. This property is said to be built into the architecture, so that it holds regardless of the learned parameters; the choice of convolution and pooling in CNN layers is said to produce it. For more general properties of CNNs related to invariances, see [20, 2, 1], and [11] for an experimental analysis.
However, recent work [3] has shown that existing CNNs (e.g., ResNets [8], VGG [19], Inception Net [21]) are highly unstable to basic transformations to which one would expect CNNs to be invariant (or at least stable), including small translations of the image, scalings, and natural perturbations across frames in videos (see also [15]). This instability is not a rare occurrence, as with adversarial perturbations that are specifically constructed to “attack” the network; rather, it is common. For instance, [3] reports that even a 1-pixel shift in an input test image changes the resulting classification with probability 30% in existing modern CNNs. This is despite the data augmentation typically done in training, i.e., augmenting the training set with shifted versions of the training images.
It is claimed in [3] that this lack of translation invariance is caused by the subsampling or pooling operation between CNN layers, which ignores the classical Shannon-Nyquist sampling theorem; this was noted as early as [18]. According to the theorem, the subsampling rate must be compatible with the Nyquist rate, i.e., at least twice the highest frequency in the feature map before subsampling. Thus, to avoid aliasing, one would need to blur the feature map to eliminate the high frequencies that the desired subsampling rate cannot support. This blurring is not performed in existing CNNs. Moreover, as claimed in [3], even with blurring the approach does not achieve full translation invariance, because the nonlinearity may introduce aliasing even in the presence of blur.
The only solution, as suggested by [3], seems to be to avoid subsampling. However, that leads to two problems: 1) to retain the receptive field sizes of conventional CNNs with subsampling, the kernel supports would have to grow with the depth of the layer, which would require learning an exorbitant number of parameters to represent the growing kernel support and poses a problem for training efficiency (see [23] for a possible solution); and 2) the memory footprint of non-subsampled feature maps is a problem in backpropagation during training, which may preclude the use of many GPUs that do not have sufficient memory, and the lack of subsampling also leads to greater training and inference times.
In this paper, we tackle both of these problems. To do this, we represent convolutional kernels with an orthogonal Gauss-Hermite basis whose coefficients are learned in convolution layers, rather than representing a kernel directly in terms of its pixel values. (This basis has been used before in CNNs [9], but the essential properties of translation invariance and insensitivity are not mentioned or explored in that work, whose focus is rather on the reduced number of parameters.) Without a subsampling layer, this leads to a fully translation invariant representation that keeps the number of parameters in kernels constant across layers, while being able to capture receptive fields as big as those of modern CNNs, with even fewer parameters per layer. To address the memory limitations, we show how the layers, due to the smooth Gauss-Hermite approximation, can be subsampled in a way that retains a weaker notion of translation invariance that we term translation insensitivity. This leads to stability of classifications with respect to translations, in contrast to existing subsampled CNNs, which do not exhibit translation insensitivity.
Contributions: Our specific contributions are as follows.
1. We introduce a CNN architecture, which we call GaussNets, that is both translation invariant and has equivalent (or bigger) receptive fields with fewer parameters per kernel than existing modern CNNs.
2. We introduce a CNN architecture, called Subsampled GaussNets, that exhibits the aforementioned properties with respect to receptive field and parameter usage and, by performing subsampling, has computational efficiency comparable to modern CNNs. This architecture retains a weaker form of translation invariance, which we call translation insensitivity, that gives robustness to classifications.
3. We provide analytic proofs that these architectures exhibit the aforementioned properties. These analytic tools serve as a framework for analyzing other architectures.
4. We experimentally demonstrate the insensitivity to translation.
1.1 Related Work
The lack of translation invariance in modern CNNs (such as ResNet, VGG, InceptionNet) is due to the subsampling or pooling operations. One obvious approach to deal with this lack of invariance is data augmentation, i.e., augmenting the training set with shifted images. However, [3] shows that this only gives invariance to shifts on images statistically very similar to the training set, and the lack of invariance remains on the test set. The only work, to the best of our knowledge, that attempts to address the lack of translation invariance due to subsampling is [24]. It proposes to add an antialiasing layer by applying a fixed smoothing filter before the subsampling. Though the approach is not translation invariant, it is shown empirically to improve robustness to translation on the test set. We show analytically that if the right smoothing kernel is chosen, such antialiasing gives what we call translation insensitivity, something that was not previously known. While antialiasing does provide a solution, the approach we introduce in addition lends itself to other invariances, such as scaling and deformation, which will be the subject of our future work.
In [13, 4, 17] (see also [12] for an early approach), the Scattering Transform is introduced as a representation of an image that is invariant to basic transformations, such as translation, scale, rotation, and small deformations, for classification. The transform is computed by convolving the image with a wavelet filter bank, applying a pointwise nonlinearity (complex modulus), and then low-pass filtering. These operations are stacked to create a hierarchical representation. It is proven that such a representation provides not strict invariance but a weaker notion, which we refer to as insensitivity. We note that all such proofs assume continuous data, and the subsampling operations are not analyzed. The transform has been used with success in a number of classification problems, such as texture discrimination, but since the features are handcrafted, it seems difficult to apply them to complex classification tasks such as ImageNet, where handcrafted feature combinations would have difficulty competing with learned CNN approaches. Some more recent approaches along the lines of [13, 4, 17] are [6, 5] (based on steerable filters [7, 14]), which obtain equivariance, i.e., a group action such as rotation on the input results in the same rotation in the feature map, while having learned parameters like CNNs.
In [9], a hybrid approach between Scattering transforms and CNNs is taken, motivated by the desire to perform well on both small datasets with limited training data and large datasets. In that work, it is observed that the filters learned in CNNs from large datasets are often spatially coherent; thus, rather than learning that spatially coherent structure directly from data, it is enforced in the filters to limit the training data requirements. To that end, a fixed structured basis, similar to the filter bank in Scattering transforms, is chosen to represent the kernel, and the coefficients are learned as in CNNs. The particular basis chosen is derivatives of Gaussians [10], which is coincidentally the same basis that we use in this paper. However, our motivation is quite different from that of [9]: we seek to obtain translation invariance in CNNs, which we show is maintained by our architecture without subsampling. Furthermore, we show that translation insensitivity is maintained despite subsampling. Such properties were not explored in [9]. Other approaches using Gaussian filters, though not as a basis, include [16], where the filter shape and size are learned, and [22], which uses a mixture of Gaussians at different spatial locations to define kernels, which is parameter efficient.
2 GaussNet CNN Architecture
We introduce our GaussNet architecture. The key idea is that, rather than representing a convolution kernel directly in terms of its pixelized values (corresponding to coefficients of a basis of shifted delta functions), we represent the kernel in terms of an (orthogonal) basis of smooth functions given by derivatives of the Gaussian. In fact, any function $k$ can be approximated as
(1)  $k(x) \approx \sum_{n=0}^{N} a_n \cdot D^n G_\sigma(x)$
where $G_\sigma$ is the 2D Gaussian function with standard deviation $\sigma$, $D^n$ represents the $n$-th derivative operator (the tensor of all partials of up to $n$ derivatives), $a_n$ is a tensor of coefficients, and $\cdot$ represents the sum of elementwise products between the two arguments. In practice, we use an approximation with derivatives up to order 2, so that the kernels we represent are given by
(2)  $k(x) = a_0 G_\sigma(x) + a_1 \cdot \nabla G_\sigma(x) + a_2 \cdot \nabla^2 G_\sigma(x)$
where $a_0, a_1, a_2$ represent the coefficients of the Gaussian derivatives up to order 2, and $\nabla = (\partial_{x_1}, \partial_{x_2})$ represents the vector of partials in the $x_1$ and $x_2$ directions. The representation in equation 2 is used in place of the pixelized representations that are typical (e.g., the pixelized kernels used in VGG and ResNet). The weights $a_0, a_1, a_2$ are learned. By using this basis, the kernels are enforced to be smooth, while remaining flexible enough to have the discriminative power to separate object classes. As we show in the next section, the smoothness of this basis leads to the translation insensitivity that we desire.
A basic layer of a GaussNet is given by
(3)  $F_W(I) = S\, r\big( W\, [\mathcal{D} G_\sigma * I] \big)$
where $W$ is an $m \times 6n$ weight matrix, $I$ is an $n$-dimensional input image (feature), $m$ is the output feature map dimension, $\mathcal{D} G_\sigma = (G_\sigma, \nabla G_\sigma, \nabla^2 G_\sigma)$ collects the Gaussian derivatives of equation 2, and $*$ is the convolution. Thus, each input channel is convolved with derivatives of Gaussians, and linear combinations of these are formed with the weight matrix $W$. $r$ is the rectified linear unit, i.e., $r(x) = \max(x, 0)$. $S$ represents the subsampling operator. In one form of our architecture that is fully translation invariant, $S$ is not included. The subscript in $F_W$ is used to indicate the free parameters that are to be learned. In experiments, we demonstrate our approach on an architecture motivated by ResNet. In this architecture, one sums the input feature map to a layer with the result above after the rectification, i.e.,
(4)  $F_W(I) = S \big[ I + r\big( W\, [\mathcal{D} G_\sigma * I] \big) \big]$
In the analysis in the next section, for simplicity, we analyze the basic layer in equation 3, but the results we prove also hold for a ResNet-like layer.
Multiple layers are cascaded to form a deep CNN, which we call the GaussNet. As in a ResNet-like structure, our final feature consists of an average pooling layer, i.e.,
(5)  $f(I) = \frac{1}{N} \sum_{x} \big( F_{W_L} \circ \cdots \circ F_{W_1} \big)(I)(x)$
where $N$ is the number of pixels in the final feature maps, and $L$ is the number of layers.
Note that the effective receptive field size of the GaussNet is controlled by the parameter $\sigma$. In the GaussNet without subsampling, in order to maintain the overall receptive field size of the corresponding traditional CNN, $\sigma$ would have to grow with the layer number according to the subsampling rate of the traditional CNN. We efficiently evaluate such large receptive field convolutions with the Fast Fourier Transform (FFT) (see Section 4). In the subsampled GaussNet, $\sigma$ remains fixed over layers, and because of the subsampling, the overall receptive field is similar to that of the corresponding traditional CNN.
Note that, in comparison to existing CNNs, the GaussNet has fewer parameters: a common choice (e.g., in ResNet and VGG) is to use $3 \times 3$ kernels in convolution operations, which results in 9 parameters to be learned per filter, whereas we use 6 coefficients per convolution filter, while still obtaining a test accuracy similar to that of the traditional CNN.
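The six-coefficient parameterization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `gaussian_derivative_basis` and `kernel_from_coefficients`, the sampling radius, and the use of analytic derivative formulas are all assumptions made here for clarity (Section 4 describes computing derivatives with difference operators instead).

```python
import numpy as np

def gaussian_derivative_basis(sigma, radius):
    """Sample a 2D Gaussian and its partials up to order 2 (6 basis functions)."""
    t = np.arange(-radius, radius + 1, dtype=float)
    x1, x2 = np.meshgrid(t, t, indexing="ij")
    s2 = sigma ** 2
    g = np.exp(-(x1 ** 2 + x2 ** 2) / (2.0 * s2)) / (2.0 * np.pi * s2)
    return np.stack([
        g,                                     # G
        -x1 / s2 * g,                          # dG/dx1
        -x2 / s2 * g,                          # dG/dx2
        (x1 ** 2 / s2 ** 2 - 1.0 / s2) * g,    # d^2G/dx1^2
        x1 * x2 / s2 ** 2 * g,                 # d^2G/dx1 dx2
        (x2 ** 2 / s2 ** 2 - 1.0 / s2) * g,    # d^2G/dx2^2
    ])

def kernel_from_coefficients(coeffs, basis):
    """Form a smooth kernel as a linear combination of the 6 basis functions."""
    return np.tensordot(np.asarray(coeffs, dtype=float), basis, axes=1)

basis = gaussian_derivative_basis(sigma=2.0, radius=8)
coeffs = [0.5, 1.0, -1.0, 0.2, 0.0, 0.2]   # hypothetical learned coefficients
kernel = kernel_from_coefficients(coeffs, basis)
print(basis.shape, kernel.shape, len(coeffs))
```

The kernel support (here 17 by 17) can grow with $\sigma$ while the number of learned coefficients per filter stays at 6, which is the point of the comparison with 9-parameter pixelized kernels.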
3 Translation Insensitivity of GaussNets
3.1 Terminology
In this section, we denote an image or feature map as $I(x)$, where $x \in \mathbb{Z}^2$. In our proofs, we assume that the data is defined on the infinite discrete set $\mathbb{Z}^2$ for simplicity of notation, as we may just zero-pad the finite data with an infinite number of zeros. The set of all such images is denoted $\mathcal{I}$. For simplicity in the notation, we consider just one feature map in the input and output of each layer. We denote an operation from one feature map to produce another as $F : \mathcal{I} \to \mathcal{I}$. This typically corresponds to an output of a layer of the network. We denote an operation from an image or feature map to a vector as $f : \mathcal{I} \to \mathbb{R}^n$. This typically corresponds to the last layer of the network, which produces a vector representation of the image.
We define notation for some operations that are used extensively in the rest of the paper. We denote by $T_s$ the translation operator, i.e.,
(6)  $(T_s I)(x) = I(x - s)$
which shifts the image by $s$. Note that the translation operator is defined only on infinite domains, as $x - s$ must always be in the domain of the image. However, finite data can be extended to an infinite domain by zero-padding. Next, we define the subsampling operator $S_M$, which subsamples data by a factor of $M$, as follows:
(7)  $(S_M I)(x) = I(Mx)$
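The two operators can be illustrated with a small 1D sketch (the paper works in 2D; one dimension is used here only to keep the example short). Signals are modeled as functions on the integers that return zero outside the finite data, mirroring the zero-padding convention above. The helper names `make_signal`, `translate`, and `subsample` are hypothetical.

```python
def make_signal(values, origin=0):
    """Finite data viewed on the infinite integer grid, zero outside its support."""
    def signal(x):
        i = x - origin
        return values[i] if 0 <= i < len(values) else 0.0
    return signal

def translate(signal, s):
    """Translation operator: (T_s I)(x) = I(x - s)."""
    return lambda x: signal(x - s)

def subsample(signal, M):
    """Subsampling operator: (S_M I)(x) = I(M x)."""
    return lambda x: signal(M * x)

I = make_signal([1.0, 2.0, 3.0, 4.0])
shifted = translate(I, 1)
coarse = subsample(I, 2)

window = range(-1, 6)
print([shifted(x) for x in window])  # [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 0.0]
print([coarse(x) for x in window])   # [0.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0]
```

Composing the two shows why shifts interact awkwardly with subsampling: shifting the input by $M$ and then subsampling equals subsampling and then shifting by 1, but an input shift that is not a multiple of $M$ has no integer counterpart after subsampling.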
In the following sections, we show that GaussNets are well-behaved with respect to the translation operator. We now make this behavior precise.
Definition 1 (Translation Covariant Operator).
An operator $F : \mathcal{I} \to \mathcal{I}$ is $\phi$-translation covariant if for all translations $s$ and all inputs $I$, we have that
(8)  $F(T_s I) = T_{\phi(s)} F(I)$
where $\phi$ is a monotone bijective function. When $\phi(s) = s$, we simply say that $F$ is translation covariant.
This definition says that a shift in the input map of a translation covariant operator results in a predictable shift in the output map.
We now introduce the notion of translation invariance, ideally a property inherent in a CNN.
Definition 2 (Translation Invariance).
A function $f : \mathcal{I} \to \mathbb{R}^n$ is translation invariant if for all translations $s$ and all inputs $I$, we have that
(9)  $f(T_s I) = f(I)$
This says that the feature does not change as the image is translated. In practice, we will have to settle for a weaker property, which we call translation insensitivity:
Definition 3 (Translation Insensitivity).
A function $f$ is translation insensitive if there exists a positive constant $C$ such that
(10)  $| f(T_s I) - f(I) | \le C |s|$
for all translations $s$.
This is what is known as Lipschitz continuity: the feature representation does not change much under a shift, i.e., it changes at most at a linear rate in the shift size.
3.2 Invariance of GaussNets
We define a layer of a GaussNet (without subsampling) as follows:
(11)  $F_W(I) = r\big( W\, [\mathcal{D} G_\sigma * I] \big)$
where $r$ is the rectified linear unit, $G_\sigma$ is the Gaussian, $\mathcal{D} G_\sigma$ denotes its derivatives up to order 2, $W$ is a weight matrix, and $*$ is the convolution operator.
We now show that a network consisting of stacked layers of the form in equation 11 is translation covariant. In fact, the property is true for any CNN that performs no subsampling (or downsampling), and thus also for a GaussNet:
Lemma 1 (GaussNets are Translation Covariant).
A deep GaussNet of the form
(12)  $F = F_{W_L} \circ F_{W_{L-1}} \circ \cdots \circ F_{W_1}$
where each $F_{W_i}$ is of the form in equation 11, which does no subsampling, is translation covariant.
Proof.
The composition of translation covariant operators is also translation covariant. The convolution is translation covariant, as is the rectified linear unit since it is a pointwise function of the input; therefore each $F_{W_i}$ is translation covariant, and thus so is the composition $F$. ∎
The translation covariant property of a composition of several GaussNet layers allows us to now define a network from this composition that is translation invariant by using average pooling. Note that this also holds for ordinary CNN layers that perform no subsampling / downsampling.
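Theorem 2 (Average Pooling of a GaussNet is Translation Invariant).
The deep GaussNet $F$ followed by an average pooling layer at the end, i.e.,
(13)  $f(I) = \frac{1}{N} \sum_{x} F(I)(x)$
is translation invariant.
Proof.
Using the translation covariant property of the composition of Gaussian layers, we have that
(14)  $f(T_s I) = \frac{1}{N} \sum_{x} F(T_s I)(x)$
(15)  $= \frac{1}{N} \sum_{x} (T_s F(I))(x)$
(16)  $= \frac{1}{N} \sum_{x} F(I)(x - s)$
(17)  $= \frac{1}{N} \sum_{x'} F(I)(x') = f(I)$
where $x' = x - s$ and a change of variables was performed. ∎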
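The theorem can be checked numerically on a toy example. The sketch below is a simplified 1D stand-in, not the paper's network: it assumes a periodic domain (so that a shift is a circular roll) instead of the infinite zero-padded domain used in the proofs, uses only a Gaussian and a central-difference first derivative per layer, and all function names and weights are hypothetical.

```python
import math

def gaussian_taps(sigma, radius):
    taps = [math.exp(-(t * t) / (2.0 * sigma * sigma)) for t in range(-radius, radius + 1)]
    total = sum(taps)
    return [v / total for v in taps]

def circular_conv(signal, taps):
    """(k * I)(x) = sum_t k(t) I(x - t) on a periodic domain."""
    n, radius = len(signal), len(taps) // 2
    return [sum(taps[j] * signal[(x - (j - radius)) % n] for j in range(len(taps)))
            for x in range(n)]

def layer(signal, w, taps):
    """Non-subsampled layer: ReLU of a weighted sum of Gaussian and derivative responses."""
    g = circular_conv(signal, taps)
    n = len(g)
    dg = [(g[(x + 1) % n] - g[(x - 1) % n]) / 2.0 for x in range(n)]
    return [max(w[0] * g[x] + w[1] * dg[x], 0.0) for x in range(n)]

def pooled_feature(signal, weights, taps):
    out = signal
    for w in weights:          # cascade of layers without subsampling
        out = layer(out, w, taps)
    return sum(out) / len(out)  # global average pooling

def shift(signal, s):
    n = len(signal)
    return [signal[(x - s) % n] for x in range(n)]

signal = [0.0, 0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 5.0, 1.0, 0.0, 0.0, 0.0]
weights = [(1.0, -0.5), (0.7, 1.2)]
taps = gaussian_taps(sigma=1.0, radius=3)
base = pooled_feature(signal, weights, taps)
print(max(abs(pooled_feature(shift(signal, s), weights, taps) - base) for s in range(12)))
```

The printed maximum deviation is at the level of floating-point rounding for every shift, since the layer outputs are merely permuted by the shift and the average is unaffected by the permutation.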
3.3 Insensitivity of Subsampled GaussNets
While the GaussNets in the previous section exhibit translation invariance and are able to capture receptive fields as large as those of conventional CNNs with downsampling (with an appropriate choice of $\sigma$ in each layer) with fewer parameters per layer, the lack of downsampling poses a problem for backpropagation, as the large memory requirements preclude the use of some GPUs. Thus, we would like to subsample in practice. With subsampling, strict translation invariance is lost; however, we show that translation insensitivity is retained.
We define a layer of the subsampled GaussNet as:
(18)  $F_W(I) = S_M\, r\big( W\, [\mathcal{D} G_\sigma * I] \big)$
where, as defined before, $S_M$ is the subsampling operator. We first show that a single-layer GaussNet with subsampling followed by average pooling is translation insensitive. We then use this result to generalize to the multilayer case.
Theorem 3.
Average pooling of a layer of the subsampled GaussNet in equation 18 is translation insensitive, i.e.,
(19) 
where is a constant and is the number of pixels in the feature map. The constant is given by
(20) 
where is the Lipschitz constant of the Gaussian and its derivatives.
Proof.
We compute
(21) 
and
where we have performed a change of variables. Thus,
(22) 
Now,
(23)  
(24)  
(25)  
(26)  
(27) 
where and the last inequality is due to the Lipschitz continuity of the rectified linear unit, i.e., . Therefore,
(28) 
Since the Gaussian (and its derivatives) are Lipschitz continuous, with Lipschitz constant , we have that
∎
We now proceed to show that average pooling of a multilayer GaussNet is also translation insensitive. To do this, we first show that a layer of a GaussNet with subsampling is $\phi$-translation covariant with $\phi(s) = s/M$:
Lemma 4 (SubSampled GaussNet Layer is Translation Covariant).
The GaussNet layer with subsampling defined in equation 18 is $\phi$-translation covariant with $\phi(s) = s/M$ (a shift of $s$ in the input corresponds to a shift of $s/M$ in the output map).
Remark 5 (Fractional Shifts).
Note that $s/M$ may not be an integer. Since the image/feature map is defined discretely, we must specify how this operation is defined. The answer arises naturally in the proof: the fractionally shifted feature map is defined by a formula that needs only the Gaussian on the non-subsampled domain, where a shift of $s/M$ in the subsampled domain corresponds to the integer shift $s$.
Proof.
From the proof of the previous theorem, we have that
(29)  $S_M W [\mathcal{D} G_\sigma * T_s I](x) = \sum_{y} W\, \mathcal{D} G_\sigma(Mx - y)\, I(y - s)$
(30)  $= \sum_{y} W\, \mathcal{D} G_\sigma(Mx - s - y)\, I(y)$
(31)  $= \sum_{y} W\, \mathcal{D} G_\sigma\big(M(x - s/M) - y\big)\, I(y)$
(32)  $= T_{s/M} S_M W [\mathcal{D} G_\sigma * I](x)$
Note that in the previous equation, the spatial argument $x$ is discrete and $s/M$ may not be an integer. However, the expression makes perfect sense, as it is applied as an argument to the Gaussian, which is defined on the non-subsampled domain, where the fractional shift corresponds to an integral value. Therefore, by the translation covariance of the rectified linear unit, we have that
(33)  $F_W(T_s I) = T_{s/M} F_W(I)$
∎
We now move to proving that average pooling for a deep GaussNet is translation insensitive. First, we need a lemma that shows that if the input feature maps are close, then the outputs through a layer of a GaussNet are close.
Lemma 6.
If and are feature maps such that for all and is a GaussNet layer with subsampling, then
(34) 
where , and is the spatial size of the output feature map.
Proof.
We estimate the difference:
(35)  
(36)  
(37)  
(38)  
(39)  
(40) 
where , and we applied the Lipschitz continuity of in the first step, and the fact that . ∎
Using the previous lemma, we can now show that the output of a deep GaussNet is translation insensitive.
Theorem 7.
A deep subsampled GaussNet that is followed by a global average pooling layer, i.e.,
is translation insensitive.
Proof.
We may now apply Lemma 6 successively to the network
(41) 
to arrive at
(42) 
where is the spatial size in pixels of the feature map at the layer.
We may set and , and apply equation 42. By the translation covariance of a layer of a GaussNet, we have that . Therefore (using computations in Theorem 3),
(43) 
Now combining equation 42 and equation 43 with in equation 42, we have that
(44) 
Therefore average the previous result over , we have that
(45) 
∎
Remark 8 (Lack of Translation Invariance of SubSampled GaussNets).
We now show that subsampled GaussNets are not in general translation invariant, thus justifying the need for the weaker translation insensitivity. Recall that
(46)  $f(T_s I) = \frac{1}{N} \sum_{x} r\Big( \sum_{y} W\, \mathcal{D} G_\sigma(Mx - s - y)\, I(y) \Big)$
(47)  $f(I) = \frac{1}{N} \sum_{x} r\Big( \sum_{y} W\, \mathcal{D} G_\sigma(Mx - y)\, I(y) \Big)$
For translation invariance to hold, the above two sums should be equal. One would like to attempt the change of variable $x' = x - s/M$, which would result in $Mx' = Mx - s$. Note that the ranges of summation (of $x$ and $x'$) would only be equal when $s$ is a multiple of $M$, so that the Gaussian in the two sums above would have the same arguments; thus the summations are not equal (for an arbitrary $I$) unless $s$ is a multiple of $M$. Hence, GaussNets with subsampling in general will not be translation invariant.
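The role of the subsampling factor in this remark can be seen in a small 1D experiment. This is a sketch under simplifying assumptions that are not in the paper: a periodic domain, a single Gaussian (no derivative terms), a unit weight, and hypothetical helper names. A deliberately small $\sigma$ is used so that the loss of invariance is easy to see.

```python
import math

def gaussian_taps(sigma):
    radius = max(1, math.ceil(4 * sigma))
    taps = [math.exp(-(t * t) / (2.0 * sigma * sigma)) for t in range(-radius, radius + 1)]
    total = sum(taps)
    return [v / total for v in taps]

def circular_conv(signal, taps):
    n, radius = len(signal), len(taps) // 2
    return [sum(taps[j] * signal[(x - (j - radius)) % n] for j in range(len(taps)))
            for x in range(n)]

def pooled_subsampled_layer(signal, sigma, M):
    """Average pooling of S_M r(G * I) on a periodic 1D domain."""
    response = circular_conv(signal, gaussian_taps(sigma))
    samples = [max(response[x], 0.0) for x in range(0, len(response), M)]
    return sum(samples) / len(samples)

def shift(signal, s):
    n = len(signal)
    return [signal[(x - s) % n] for x in range(n)]

signal = [0.0, 4.0, 0.0, 1.0, 0.0, 3.0, 0.0, 2.0]
M, sigma = 2, 0.5
base = pooled_subsampled_layer(signal, sigma, M)
for s in range(4):
    moved = pooled_subsampled_layer(shift(signal, s), sigma, M)
    print(s, abs(moved - base))
```

Shifts that are multiples of $M$ leave the pooled feature unchanged up to rounding, because the same samples are visited, while odd shifts select a different set of samples and change the result.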
Remark 9 (Choice of ).
We note that in the proofs, the translation insensitivity arises from the fact that is small, which was bounded by a Lipschitz estimate. In practice, should be large compared to to be less sensitive (where is shift of maximum desired insensitivity)  in this case the difference between the Gaussian and shifted Gaussian is small. Note that the subsampling rate contracts the Gaussian, and thus the factor of in choosing the . So the higher the desired insensitivity to large shifts and the subsampling rate, the more smoothing one has to perform.
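The effect of the smoothing scale on sensitivity can be illustrated with a short 1D sketch. The assumptions here are ours, not the paper's: a periodic domain, a single Gaussian response per layer with unit weight, subsampling factor 2, and hypothetical helper names. The sketch measures how much the pooled feature changes under a 1-pixel shift for several values of $\sigma$.

```python
import math

def gaussian_taps(sigma):
    radius = max(1, math.ceil(4 * sigma))
    taps = [math.exp(-(t * t) / (2.0 * sigma * sigma)) for t in range(-radius, radius + 1)]
    total = sum(taps)
    return [v / total for v in taps]

def circular_conv(signal, taps):
    n, radius = len(signal), len(taps) // 2
    return [sum(taps[j] * signal[(x - (j - radius)) % n] for j in range(len(taps)))
            for x in range(n)]

def pooled_subsampled_layer(signal, sigma, M):
    response = circular_conv(signal, gaussian_taps(sigma))
    samples = [max(response[x], 0.0) for x in range(0, len(response), M)]
    return sum(samples) / len(samples)

def shift(signal, s):
    n = len(signal)
    return [signal[(x - s) % n] for x in range(n)]

signal = [0.0, 4.0, 0.0, 1.0, 0.0, 3.0, 0.0, 2.0]
M = 2
sigmas = [0.5, 1.0, 2.0]
sensitivity = [
    abs(pooled_subsampled_layer(shift(signal, 1), s, M) - pooled_subsampled_layer(signal, s, M))
    for s in sigmas
]
for s, d in zip(sigmas, sensitivity):
    print(s, d)
```

On this example the change under a 1-pixel shift drops by orders of magnitude as $\sigma$ increases, in line with the remark that more smoothing yields less sensitivity.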
Remark 10 (Contractive Operator).
We have thus far shown that the sensitivity of the GaussNet layer to a shift in the input is proportional to the “size” of the weights, the input, and the Gaussian derivatives. Thus, weight regularization and batch normalization are also both beneficial in terms of adding robustness to a CNN. This sensitivity can be driven to zero as more GaussNet layers are added if each GaussNet layer is a contraction operator (i.e., the Lipschitz constant is less than 1) on the shift perturbations. For the standard 1d Gaussian density function, its order derivatives are bounded by for . In theory, to ensure that each Gauss layer is contractive, one only needs to apply normalization techniques on to ensure and learning with hard constraints to ensure .
3.4 Existing CNNs are Not Translation Insensitive
We show that translation insensitivity is not a property of existing CNNs with subsampling. We analyze a layer of a general subsampled CNN. Let represent the learned kernel in a layer of a CNN. Then a layer (for instance, VGG with an average pool) is given by
(48) 
We show that this architecture is not translation insensitive:
(49)  
(50)  
(51) 
where . For any this sum cannot be controlled by without any additional smoothness properties on the class of , which we do not have on general images, nor feature maps. For instance, in the case that and , we have that the difference above is , and since the ranges of the functions and are in general distinct, the difference of the sums cannot be controlled.
Note that $k$, the learned kernel, does not in general have smoothness guarantees. Therefore, the difference cannot be Lipschitz bounded. In practice, CNNs are typically implemented with small supports (e.g., $3 \times 3$ kernels), and thus shifted differences of the kernel can be large, breaking translation insensitivity. For translation insensitivity, the kernel would need to smoothly die down to zero at the boundary of its support.
From the argument, the lack of translation insensitivity is due to the lack of smoothness on the kernel , rather than just subsampling; of course larger subsampling rates would require smoother learned kernels.
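A concrete worst case makes the argument tangible. The sketch below is an illustration under our own assumptions (1D periodic signal, a delta kernel standing in for a non-smooth learned kernel, subsampling factor 2, hypothetical function names): an alternating input pooled after a subsampled layer changes from its maximum to zero under a 1-pixel shift.

```python
def circular_conv(signal, kernel):
    """(k * I)(x) = sum_t k(t) I(x - t) with a centered kernel on a periodic domain."""
    n, radius = len(signal), len(kernel) // 2
    return [sum(kernel[j] * signal[(x - (j - radius)) % n] for j in range(len(kernel)))
            for x in range(n)]

def pooled_cnn_layer(signal, kernel, M):
    """Average pooling of a subsampled conv + ReLU layer with a pixelized kernel."""
    response = circular_conv(signal, kernel)
    samples = [max(response[x], 0.0) for x in range(0, len(response), M)]
    return sum(samples) / len(samples)

signal = [1.0, 0.0] * 4                 # alternating pattern
delta_kernel = [0.0, 1.0, 0.0]          # non-smooth kernel (identity)
shifted = signal[-1:] + signal[:-1]     # circular shift by one pixel

print(pooled_cnn_layer(signal, delta_kernel, 2),
      pooled_cnn_layer(shifted, delta_kernel, 2))  # 1.0 0.0
```

No bound of the form $C|s|$ with a small constant can hold here, since a unit shift moves the feature across its entire range.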
3.5 Some AntiAliased CNNs are Insensitive
One may ask whether blurring the output before subsampling in layers of the CNN to remove high frequency components and avoid aliasing effects would lead to a translation insensitive CNN. As we show now, provided the kernel is chosen appropriately to be smooth and to die down to zero, such as a Gaussian, the answer is in the affirmative. A layer with antialiasing is given by
(52) 
where is a typical pixelized kernel (e.g., as in ResNet). Similar to the computation in the previous subsection, the antialiased CNN is not translation invariant. It is, however, translation insensitive. Consider a singlelayer antialiased CNN with average pooling:
(53)  
(54)  
(55)  
(56)  
(57) 
where we have performed a change of variables in going from equation 54 to equation 55, and used the Lipschitz continuity of the Gaussian and the ReLU in going from equation 55 to equation 56.
Thus, one obtains translation insensitivity, similar to our approach in the previous subsection, with a similar Lipschitz constant. The only requirement on the antialiasing filter is that it be Lipschitz continuous. In practice, for finite data this implies that the filter needs to die down to zero at the border of its support.
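The benefit of a smooth antialiasing filter can be sketched on an alternating 1D input. The assumptions are ours: a periodic domain, a delta kernel as the pixelized kernel, a Gaussian blur applied after the ReLU and before subsampling (one plausible ordering; the source equation is not reproduced here), and hypothetical function names.

```python
import math

def gaussian_taps(sigma):
    radius = max(1, math.ceil(4 * sigma))
    taps = [math.exp(-(t * t) / (2.0 * sigma * sigma)) for t in range(-radius, radius + 1)]
    total = sum(taps)
    return [v / total for v in taps]

def circular_conv(signal, kernel):
    n, radius = len(signal), len(kernel) // 2
    return [sum(kernel[j] * signal[(x - (j - radius)) % n] for j in range(len(kernel)))
            for x in range(n)]

def pooled_layer(signal, kernel, M, blur_sigma=None):
    """Conv + ReLU, optional Gaussian blur, subsample by M, then average pool."""
    rectified = [max(v, 0.0) for v in circular_conv(signal, kernel)]
    if blur_sigma is not None:
        rectified = circular_conv(rectified, gaussian_taps(blur_sigma))
    samples = rectified[::M]
    return sum(samples) / len(samples)

signal = [1.0, 0.0] * 4
shifted = signal[-1:] + signal[:-1]
delta_kernel = [0.0, 1.0, 0.0]

plain_gap = abs(pooled_layer(signal, delta_kernel, 2) - pooled_layer(shifted, delta_kernel, 2))
blurred_gap = abs(pooled_layer(signal, delta_kernel, 2, blur_sigma=1.0)
                  - pooled_layer(shifted, delta_kernel, 2, blur_sigma=1.0))
print(plain_gap, blurred_gap)
```

Without the blur the pooled feature moves by its full range under a 1-pixel shift; with the Gaussian blur the change is small but not zero, which is the distinction between insensitivity and invariance.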
4 Experiments
We compare the robustness performance (to small translations) of our new architecture against ResNet.
Datasets: We performed experiments on the CIFAR10 dataset for image classification. We also created a second dataset derived from CIFAR10, in which we downsampled each original CIFAR10 image to $30 \times 30$ and then zero-padded it by one pixel on each side to create a $32 \times 32$ image. We call this second dataset CIFAR10ZP.
Shifted Test Sets: We evaluate each network’s insensitivity to shift perturbations by testing it on shifted test sets. The shifted test sets are constructed by shifting every image in the test sets of CIFAR10 and CIFAR10ZP by at most 1 pixel in the horizontal and/or vertical directions (8 different shifts). The missing values at the borders after the shift are filled by copying the closest pixel in the image to the missing pixel on the border. (On CIFAR10ZP, these shifts do not remove any content of the original image.)
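The construction of the eight shifted test sets can be sketched as follows. This is an illustrative sketch, not the authors' code: the `(N, H, W, C)` array layout, the function name `shift_batch`, and the use of NumPy edge padding to replicate the closest border pixel are assumptions.

```python
import numpy as np

# The 8 non-trivial shifts of at most 1 pixel vertically and/or horizontally.
SHIFTS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]

def shift_batch(images, dy, dx):
    """Shift a batch of (N, H, W, C) images by (dy, dx) pixels.

    Border pixels exposed by the shift are filled by replicating the
    closest image pixel (edge padding followed by a crop).
    """
    n, h, w, c = images.shape
    pad = max(abs(dy), abs(dx))
    padded = np.pad(images, ((0, 0), (pad, pad), (pad, pad), (0, 0)), mode="edge")
    top, left = pad - dy, pad - dx
    return padded[:, top:top + h, left:left + w, :]

# Stand-in data in place of a real test set.
test_images = np.arange(2 * 4 * 4 * 1, dtype=float).reshape(2, 4, 4, 1)
shifted_sets = {shift: shift_batch(test_images, *shift) for shift in SHIFTS}
print(len(shifted_sets), shifted_sets[(0, 1)].shape)
```

Each entry of `shifted_sets` has the same shape as the original test set, so a trained classifier can be evaluated on it without modification.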
Architectures: Given a ResNet architecture ResNetN, we replace its convolution layers that have kernel size greater than $1 \times 1$ with GaussNet layers. We keep the batch normalization and the residual block structure. For every layer in ResNetN in which there is subsampling, we add a corresponding subsampling layer (implemented using strided average pooling) into the new architecture GaussNetN. For each GaussNetN, we also create a version (no sub.) that is without any subsampling. We also evaluate ResNetN with antialiasing; this architecture is created by applying a Gaussian filter at every subsampling layer within ResNetN. Finally, for GaussNet50, we evaluate a Large version that is created by also replacing the Conv1x1 layers within ResNet50 with GaussNet convolution layers. The architectures and their estimated sizes are listed in Table 1.
Table 1: Architectures and their estimated sizes.
Arch/Size | # of Params. | Fwd/Back Pass (MB) | Model Size (MB)
ResNet18 | 11,173,962 | 11.25 | 53.89
GaussNet18 no sub. | 7,515,466 | 120.5 | 149.18
GaussNet18 | 7,515,466 | 15.72 | 44.4
ResNet50 | 23,520,842 | 17.41 | 107.14
GaussNet50 | 19,751,690 | 22.64 | 98.
GaussNet50 Large | 66,567,370 | 22.64 | 276.59
Implementation Details: To implement a GaussNet layer, we choose Gaussians whose supports are the sizes of the feature maps. The Gaussians have to be chosen with wide support and should die down near the border of the feature map, so as to avoid edge effects that would contribute to sensitivity. We use FFTs to compute the convolutions with large supports efficiently. The FFT of the Gaussian needs to be computed only once. To compute the derivatives of a Gaussian, we simply apply difference operators (Sobel-Feldman in our case) directly to the Gaussian-filtered outputs. For experiments on small image data, the output of the convolution, due to its large support, is inaccurate because of the lack of data near the borders. To mitigate this, rather than performing plain average pooling, we first multiply the result just before the pooling by a Gaussian centered at the center of the feature map and then compute the average.
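The FFT-based filtering described above might look roughly like the following sketch. It is not the authors' implementation: the function names are hypothetical, the boundary handling is periodic (the natural behavior of FFT convolution, assumed here for simplicity), and the Sobel-Feldman operator is written with circular shifts and normalized to estimate a unit-spacing derivative.

```python
import numpy as np

def gaussian_transfer(shape, sigma):
    """FFT of a unit-sum Gaussian sampled on the full feature-map grid.

    The Gaussian is centered at index (0, 0) with wrap-around, so filtering
    introduces no spatial offset. This only needs to be computed once per shape.
    """
    h, w = shape
    y = np.fft.fftfreq(h) * h   # signed integer offsets 0, 1, ..., -1
    x = np.fft.fftfreq(w) * w
    yy, xx = np.meshgrid(y, x, indexing="ij")
    g = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return np.fft.rfft2(g / g.sum())

def gaussian_filter_fft(feature, transfer):
    """Circular convolution of a 2D feature map with the precomputed Gaussian."""
    return np.fft.irfft2(np.fft.rfft2(feature) * transfer, s=feature.shape)

def sobel_x(a):
    """Sobel-Feldman horizontal derivative with periodic boundaries."""
    smooth = np.roll(a, 1, axis=0) + 2.0 * a + np.roll(a, -1, axis=0)
    return (np.roll(smooth, -1, axis=1) - np.roll(smooth, 1, axis=1)) / 8.0

feature = np.zeros((16, 16))
feature[5, 7] = 1.0                       # unit impulse as a stand-in feature map
transfer = gaussian_transfer(feature.shape, sigma=2.0)
smoothed = gaussian_filter_fft(feature, transfer)
dx_response = sobel_x(smoothed)           # approximates the d/dx Gaussian response
print(smoothed.sum(), dx_response.shape)
```

Applying the difference operator to the already filtered output means only one large-support convolution is needed per input channel; the derivative responses then cost a few local operations each.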
Robustness Measures: We measure both the probability of a change in classification in the shifted test sets, , where is the test set, is its size, and is the classifier output. We also measure the probability that at least one of the eight shifts of a test image results in a different classification, i.e., .
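The two measures can be computed from predicted labels as in the sketch below. The function names, the list-based data layout, and the choice to average the first measure over all eight shifts are assumptions made for illustration; the labels are made-up stand-ins, not results.

```python
def mean_change_probability(base_preds, shifted_preds):
    """Fraction of (shift, image) pairs whose predicted label differs from the unshifted one."""
    total = len(shifted_preds) * len(base_preds)
    changed = sum(p != b for preds in shifted_preds for p, b in zip(preds, base_preds))
    return changed / total

def any_change_probability(base_preds, shifted_preds):
    """Fraction of images for which at least one shift changes the predicted label."""
    flipped = [any(preds[i] != base_preds[i] for preds in shifted_preds)
               for i in range(len(base_preds))]
    return sum(flipped) / len(base_preds)

# Toy labels for 4 test images under the 8 one-pixel shifts.
base = [0, 1, 2, 1]
shifted = [list(base) for _ in range(8)]
shifted[0][1] = 2
shifted[3][1] = 0
shifted[5][3] = 2

print(mean_change_probability(base, shifted), any_change_probability(base, shifted))  # 0.09375 0.5
```

The second measure is always at least as large as the first, since an image counted once by the second measure may contribute anywhere from one to eight changes to the first.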
Training Details: Using ADAM, we train each architecture including ResNetN on both CIFAR10 and CIFAR10ZP. We perform no training data augmentation in order to test the inherent invariances built in the architectures. Without using data augmentations, the test accuracy of each architecture plateaus after a quick initial convergence to . We did not push the trainings to beyond epochs to get a better test accuracy because our robustness measures are applicable at any level of test accuracy. The important thing is to select models that are comparable in terms of their test accuracies. For the robustness benchmark, we selected trained models of around or near test accuracy.
Benchmark Results: The results are shown in Tables 2-5. The GaussNets significantly outperformed the ResNets and their antialiased versions in all combinations of datasets and robustness measures. There is a degradation of robustness across the board going from CIFAR10 to CIFAR10ZP, ranging from only small degradations for the GaussNets to very large degradations for ResNets with antialiasing.
In training time, GaussNetN architectures take longer than their ResNet counterparts, mostly because of the FFT Gaussian filtering. However, their training times are still comparable and are within the same order of magnitude. For example, on a single GPU laptop, GaussNet50 took about s to complete 50 epochs of training compared to s for ResNet50. GaussNet18 with no subsampling and GaussNet50 Large, in contrast, take 1-2 orders of magnitude longer to train and only gain a slight advantage in robustness over their unmodified GaussNetN counterparts.
Arch/Robustness  Testerror  

ResNet18  
ResNet18 + AntiAlias  
GaussNet18  
GaussNet18 no sub. 
Arch/Robustness  Testerror  

ResNet18  
ResNet18 + AntiAlias  
GaussNet18  
GaussNet18 no sub. 
Arch/Robustness  Testerror  

ResNet50  
ResNet50 + AntiAlias  
GaussNet50  
GaussNet50 Large 
Arch/Robustness  Testerror  

ResNet50  
ResNet50 + AntiAlias  
GaussNet50  
GaussNet50 Large 
5 Conclusion
We addressed the lack of translation invariance in modern CNNs by introducing a new CNN architecture, GaussNet that is translation invariant and a subsampled version that is translation insensitive. We showed analytically why existing CNNs are not only not translation invariant but also not translation insensitive. We proved analytically that our new architecture is translation insensitive, owing to the enforced smoothness of the kernels. Empirically, we showed that GaussNets could be trained to achieve similar test accuracy as modern CNNs, while being much less sensitive to shifts. This came at a reasonable increase in training and inference times. We showed that the subsampled GaussNet was as insensitive as the nonsubsampled one, allowing considerable gains in speed and less memory usage. Some aspects not explored in this paper were insensitivities to other transformations, e.g., scalings and deformations. GaussNets are naturally adaptable to address these as well. We plan to explore this in future work.
Appendix A Additional Experimental Analysis
A.1 Plots of Translation Sensitivity Over Epochs
In this section, we show additional evaluations of the robustness/insensitivity of the trained models versus the epoch of training. Recall from the main paper how robustness/insensitivity is evaluated. We first shift the test data of CIFAR10 and CIFAR10ZP by the possible 1-pixel translations in the x and y directions, which results in a pair of shifted test sets. We then evaluate the robustness of the trained models on the respective shifted test sets via two measures of the probability of change in classification. The first is the probability of a change in classification in the shifted test sets. The second is the probability that at least one of the eight shifts of a test image results in a different classification. Lower values indicate greater insensitivity.
Figures 1-4 show the plots of sensitivity. Note that the GaussNet18 (and GaussNet50) architectures are less sensitive than both their corresponding ResNet and ResNet-antialiased architectures, uniformly over all epochs.
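The two sensitivity measures described above can be sketched as follows. This is an illustrative reading of the text rather than the paper's evaluation code: the function name `shift_sensitivity`, the array layout, and the choice to average the first measure over all (shift, image) pairs are assumptions, and the toy predictions are made up for the example.

```python
import numpy as np

# The eight 1-pixel shifts (dy, dx), excluding the zero shift.
EIGHT_SHIFTS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]

def shift_sensitivity(original_preds, shifted_preds):
    """Compute two change-in-classification probabilities.

    original_preds: (n,) predicted labels on the unshifted test images.
    shifted_preds:  (k, n) predicted labels on the k shifted copies.
    """
    original_preds = np.asarray(original_preds)
    shifted_preds = np.asarray(shifted_preds)
    changed = shifted_preds != original_preds[None, :]
    # Fraction of (shift, image) pairs whose predicted class changed.
    prob_change = float(changed.mean())
    # Fraction of images for which at least one shift changed the class.
    prob_any_change = float(changed.any(axis=0).mean())
    return prob_change, prob_any_change

# Toy predictions for 4 images under 8 shifts (hypothetical values).
original = np.array([0, 1, 2, 3])
shifted = np.tile(original, (len(EIGHT_SHIFTS), 1))
shifted[0, 1] = 9  # image 1 flips under the first shift
shifted[5, 1] = 9  # image 1 flips again under the sixth shift
shifted[2, 3] = 7  # image 3 flips under the third shift
avg_change, any_change = shift_sensitivity(original, shifted)
print(avg_change, any_change)
```

The second measure is always at least as large as the first, since an image counts as changed if any single shift flips its label.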
A.2 Plots of Test Errors Over Epochs
In this section, we analyze the test error (on the original test set) over epochs to show that our insensitivity does not come at the price of an appreciable loss in test error. We include the test-error performance versus epoch of training for all the combinations of models and test sets from CIFAR10 and CIFAR10ZP. This is shown in Figures 5-6. Notice in Figure 5 that the architectures based on ResNet18 all perform around the 20% error mark. ResNet18-antialias seems to give the best test error, though it also fluctuates the most over epochs. ResNet18 is next best in terms of test error, though not much different from the GaussNet18 architectures. Figure 6 shows the plot of the test error for the architectures based on ResNet50. Notice that all the architectures perform similarly and the differences between them are even less pronounced than for the ResNet18-based architectures.
A.3 Analysis of Sensitivity as the Hyperparameter Varies
In all of the experiments and the previous plots, we chose . In practice, this is a hyperparameter to optimize. Here, we analyze sensitivity as this hyperparameter varies. We chose . The plots of sensitivity for GaussNet50 are shown in Figure 7, which shows the sensitivity for 5 sample epochs. This shows that the sensitivity does vary, but not by much. Note that for each such value, GaussNet50 is still more translation insensitive than ResNet50. For values that are too small or too large, the sensitivity is large, and thus there seems to be an optimal value in the middle.
A.4 Analysis of Inference Times
In this section, we show the inference times for each of the architectures evaluated in this study. These times were recorded on a single NVIDIA GeForce RTX 2080 Max-Q GPU. Tables 6 and 7 show the absolute inference times, measured in seconds for the entire CIFAR10 test set, as well as the normalized times. While it may look surprising that GaussNet50 is faster than GaussNet18 in inference (and in training as well), note that in GaussNet50, the Conv1x1 layers of ResNet50 are not replaced since they are already shift covariant. ResNet18, in contrast, does not have 1x1 convolutions, and thus all its convolution layers are replaced with the Gauss-Hermite approximation. Despite having a deeper architecture, the actual number of GaussNet layers in GaussNet50 is the same as in GaussNet18. In addition, on average the GaussNet layers within GaussNet50 are deeper; these deep layers have smaller feature maps due to the earlier subsamplings, and thus are faster to process. Finally, there is one extra subsampling in GaussNet50 in comparison to GaussNet18, thus further reducing the average size of the input feature maps to the GaussNet layers.
Thus, the translation insensitivity comes at a moderate price in inference time (GaussNet50 is slower by a factor of 7.34x than its corresponding ResNet50 architecture), and roughly 3x in training time. However, we have not explored optimizing the GaussNet layer computations (e.g., the Gaussian filtering and derivatives) in terms of model parallelization and memory usage, which could lead to gains in speed. Furthermore, we simply did a direct replacement of convolutions in ResNet50 and have not explored the most optimized GaussNet architecture, so there are potential speedups from a reduced number of layers.
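The remark above that 1x1 convolutions are already shift covariant can be checked numerically. The sketch below is illustrative only: `conv1x1` is a hypothetical helper that implements a 1x1 convolution as per-pixel channel mixing, the tensor sizes are arbitrary, and a circular shift via `np.roll` stands in for image translation.

```python
import numpy as np

def conv1x1(feature_maps, weights):
    """Apply a 1x1 convolution as pointwise channel mixing.

    feature_maps: (C_in, H, W) array; weights: (C_out, C_in) array.
    """
    return np.einsum("oc,chw->ohw", weights, feature_maps)

# Hypothetical sizes: 3 input channels, 4 output channels, 8x8 maps.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
w = rng.standard_normal((4, 3))

# Shifting the input and then convolving ...
shift_then_conv = conv1x1(np.roll(x, 1, axis=2), w)
# ... matches convolving and then shifting the output.
conv_then_shift = np.roll(conv1x1(x, w), 1, axis=2)
print(np.allclose(shift_then_conv, conv_then_shift))
```

Because each output pixel depends only on the input pixel at the same location, the operation commutes with any spatial shift, which is consistent with leaving these layers unmodified.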
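Table 6: Inference times on the CIFAR10 test set for the ResNet18-based architectures.
Arch/Stats  Absolute (s)  Normalized
ResNet18  2.64  1.00x
ResNet18 + AntiAlias  3.70  1.40x
GaussNet18  60.16  22.77x
GaussNet18 no sub.  414.45  156.89x

Table 7: Inference times on the CIFAR10 test set for the ResNet50-based architectures.
Arch/Stats  Absolute (s)  Normalized
ResNet50  3.47  1.00x
ResNet50 + AntiAlias  4.11  1.19x
GaussNet50  25.47  7.34x
GaussNet50 Large  160.79  46.33x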
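As a small worked check, the normalized column can be read as each absolute time divided by the time of the corresponding unmodified ResNet baseline. The sketch below recomputes the ratios from the rounded absolute times listed above; the dictionary and function names are hypothetical. The recomputed ratios agree with the reported ones to within about 0.1x, and the small differences are presumably due to the absolute times being rounded before tabulation (an assumption).

```python
# Absolute inference times in seconds, copied from the tables above.
absolute_seconds = {
    "ResNet18": 2.64,
    "ResNet18 + AntiAlias": 3.70,
    "GaussNet18": 60.16,
    "GaussNet18 no sub.": 414.45,
    "ResNet50": 3.47,
    "ResNet50 + AntiAlias": 4.11,
    "GaussNet50": 25.47,
    "GaussNet50 Large": 160.79,
}

# Normalized values as reported in the tables above.
reported_normalized = {
    "ResNet18": 1.00,
    "ResNet18 + AntiAlias": 1.40,
    "GaussNet18": 22.77,
    "GaussNet18 no sub.": 156.89,
    "ResNet50": 1.00,
    "ResNet50 + AntiAlias": 1.19,
    "GaussNet50": 7.34,
    "GaussNet50 Large": 46.33,
}

def baseline_for(name):
    """Return the unmodified ResNet that an architecture is normalized against."""
    return "ResNet18" if "18" in name else "ResNet50"

def normalized_time(name):
    """Ratio of an architecture's inference time to that of its baseline."""
    return absolute_seconds[name] / absolute_seconds[baseline_for(name)]

for name in absolute_seconds:
    print(f"{name}: recomputed {normalized_time(name):.2f}x, reported {reported_normalized[name]:.2f}x")
```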
References

[1] Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018.
[2] Fabio Anselmi, Joel Z Leibo, Lorenzo Rosasco, Jim Mutch, Andrea Tacchetti, and Tomaso Poggio. Unsupervised learning of invariant representations. Theoretical Computer Science, 633:112–121, 2016.
[3] Aharon Azulay and Yair Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
[4] Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.
[5] Xiuyuan Cheng, Qiang Qiu, Robert Calderbank, and Guillermo Sapiro. RotDCF: Decomposition of convolutional filters for rotation-equivariant deep networks. arXiv preprint arXiv:1805.06846, 2018.
[6] Taco S Cohen and Max Welling. Steerable CNNs. arXiv preprint arXiv:1612.08498, 2016.
[7] William T Freeman and Edward H Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):891–906, 1991.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[9] Jorn-Henrik Jacobsen, Jan van Gemert, Zhongyu Lou, and Arnold WM Smeulders. Structured receptive fields in CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2610–2619, 2016.
[10] Jan J Koenderink and Andrea J van Doorn. Representation of local geometry in the visual system. Biological Cybernetics, 55(6):367–375, 1987.
[11] Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991–999, 2015.
[12] Jitendra Malik and Pietro Perona. Preattentive texture discrimination with early vision mechanisms. JOSA A, 7(5):923–932, 1990.
[13] Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.
[14] Pietro Perona. Steerable-scalable kernels for edge detection and junction analysis. In European Conference on Computer Vision, pages 3–18. Springer, 1992.
[15] Vaishaal Shankar, Achal Dave, Rebecca Roelofs, Deva Ramanan, Benjamin Recht, and Ludwig Schmidt. A systematic framework for natural perturbations from videos. arXiv preprint arXiv:1906.02168, 2019.
[16] Evan Shelhamer, Dequan Wang, and Trevor Darrell. Blurring the line between structure and learning to optimize and adapt receptive fields. arXiv preprint arXiv:1904.11487, 2019.
[17] Laurent Sifre and Stéphane Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1233–1240, 2013.
[18] Eero P Simoncelli, William T Freeman, Edward H Adelson, and David J Heeger. Shiftable multiscale transforms. IEEE Transactions on Information Theory, 38(2):587–607, 1992.
[19] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[20] Stefano Soatto and Alessandro Chiuso. Visual representations: Defining properties and deep approximations. arXiv preprint arXiv:1411.7676, 2014.
[21] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[22] Domen Tabernik, Matej Kristan, and Aleš Leonardis. Spatially-adaptive filter units for compact and efficient deep neural networks. arXiv preprint arXiv:1902.07474, 2019.
[23] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[24] Richard Zhang. Making convolutional networks shift-invariant again. arXiv preprint arXiv:1904.11486, 2019.