Convolutional Neural Networks (CNNs) have demonstrated impressive performance in computer vision and natural language processing and are the state-of-the-art method for image classification. In several layers of trainable convolutions and subsampling interspersed with sigmoid nonlinearities, features of the input are extracted and fed into a trainable classifier. To classify images or solve pattern recognition tasks reliably, a CNN should combine invariance and discriminability with respect to the input. While high-level features of the input should be learned, the network's prediction should be robust to irrelevant input transformations. Learning features that are both selective and invariant is therefore a difficult task [goodfellow2009]. Despite the widespread use of CNNs, the mathematical theory of how features are extracted and how invariances to certain transformations are learned is not yet well understood [wiatowski2017].
In this paper, we propose a new method to extract the space of possible transformations $T$ such that for any image $x$, the transformed image $T(x)$ and the original image $x$ are classified similarly by the trained network. To achieve this, we propose the Invariant Transformer Net architecture, which allows us to introduce affine transformations specified by differentiable parameters and thus to access the space of possible image modifications to which the network is invariant.
We further evaluate the behavior of different CNN architectures by passing a set of affine and nonaffine transformations of increasing magnitude to the trained networks and analyzing which kinds of invariances are present. This enables us to define thresholds for the different transformations which, when exceeded, lead to a change in the classification result.
II Related Work
One of the earliest treatments of invariances in deep neural networks was concerned with the question of how to determine the quality of learned representations in an unsupervised fashion [goodfellow2009]. The authors argue that good representations should not only be able to achieve a high performance in a supervised setting (discriminability), but also generally be invariant to certain transformations in the inputs. They probe this invariance by defining an activity threshold for every hidden neuron based on its responses to random inputs and then applying transformations to inputs for which this neuron is considered active based on the threshold. If the neuron stays active under the input transformations, it is called invariant to them. They use translation and rotation (in 2D and 3D) from natural videos as transformations and test their method on stacked autoencoders and deep belief networks. They observe that invariance to those transformations increases with the depth of the model architecture.
Another work studies invariance in learned representations as a special case of equivariance in general features [lenc2015]. The authors examine equivariance in a representation by trying to learn a mapping from input space transformations to transformations in the feature space. In parts of the input space where this mapping approaches the identity function, the features can be considered invariant to the input transformations. They use rotations, rescaling and flips of images and find that the equivariance of latent features in deep convolutional networks to those transformations decreases with depth, while the invariance, interestingly, reaches its maximum in the middle layers of the networks. They also find that the representations in the first layers of different networks are largely equivalent to each other, which is not the case for the deeper layers.
III Models and Methods
III-A Affine and Nonaffine Transformations
Affine transformations map an input $x$ from a space $X$ into a space $Y$ using an affine map $f: X \to Y$. The affine transformation is of the form $f(x) = Ax + b$, where $A$ is a linear transformation on $X$ and $b$ a vector in $Y$. Affine transformations include rotations, translations, as well as scaling. Under affine transformations, parallel lines remain parallel [szeliski2010]. Nonaffine transformations comprise Gaussian noise, Gaussian blur, whiteness shifts, contrast and other nonlinearities. In this paper, the behavior of a neural network under both affine and nonaffine transformations is analyzed.
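The parallelism property can be checked numerically. The following is a minimal NumPy sketch, assuming the affine map is written as f(x) = A x + b; the concrete matrix, offset and line segments are illustrative values and not taken from the experiments.

```python
import numpy as np

def affine_map(points, A, b):
    """Apply f(x) = A x + b to an array of 2D points with shape (n, 2)."""
    return points @ A.T + b

def cross_2d(u, v):
    """Scalar 2D cross product; it is zero when u and v are parallel."""
    return u[0] * v[1] - u[1] * v[0]

# Illustrative affine map: scaling, then rotation by 30 degrees, then a shift.
theta = np.deg2rad(30.0)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
A = rotation @ np.diag([2.0, 0.5])
b = np.array([3.0, -1.0])

# Two parallel segments: same direction (1, 2), different starting points.
line_1 = np.array([[0.0, 0.0], [1.0, 2.0]])
line_2 = np.array([[5.0, 1.0], [6.0, 3.0]])

mapped_1 = affine_map(line_1, A, b)
mapped_2 = affine_map(line_2, A, b)

# Direction vectors of the mapped segments.
dir_1 = mapped_1[1] - mapped_1[0]
dir_2 = mapped_2[1] - mapped_2[0]
parallel_after = bool(np.isclose(cross_2d(dir_1, dir_2), 0.0))
```

The offset b cancels when taking the difference of two mapped points, so both direction vectors are mapped by A alone and stay parallel.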
III-B Convolutional Neural Network Architectures
Convolutional Neural Networks are deep, feed-forward artificial neural networks consisting of convolutional, pooling, fully connected and normalization layers. Inspired by the receptive fields found in the visual cortex of animals, CNNs outperform traditional approaches based on hand-crafted features in tasks such as image and video recognition, recommender systems and natural language processing [lecun1998, krizhevsky2012]. Due to its simplicity and well-studied behavior, we chose the pretrained network from [guerzhoy2016] with an AlexNet architecture [krizhevsky2012]. Furthermore, we used a pretrained network following the ResNet architecture (ResNet V1 101) published in the TensorFlow-Slim image classification model library. The ResNet architecture, with up to 152 layers on ImageNet, won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015 [he2016].
To analyze the behavior of the CNNs AlexNet and ResNet, we used the dataset published in ILSVRC, a well-known benchmark dataset in object category classification consisting of millions of images for the detection of hundreds of object categories [russakovsky2015].
III-C Strategy 1: Sensitivity to Transformations
Using the pretrained AlexNet, we apply affine and nonaffine transformations of increasing magnitude to the input images $x$ to obtain transformed images $x'$. We consider the following transformations:
Translation. We translate the image either horizontally or vertically by a given offset.
Rotation. We rotate the image coordinates by a given number of degrees around the center of the image.
Scale. We scale the coordinates in either the horizontal or the vertical direction by a given factor, keeping the image centered.
Zoom. We zoom into the center of the image. This can be written as a composition of the scale transformations in both directions.
Brightness. We add a bias $b$ to each color channel: $x' = x + b$.
Contrast. We multiply all color values by a factor $c$: $x' = c \cdot x$.
Grayscale. We interpolate linearly between the color image $x$ and the grayscale image $g(x)$, where $g(x)$ holds the grayscale image of $x$ in each channel: $x' = (1 - \lambda)\, x + \lambda\, g(x)$.
Gaussian Blur. We convolve the image in 2D with a Gaussian kernel $k_\sigma$ with a standard deviation of $\sigma$: $x' = k_\sigma * x$.
Gaussian Noise. We add a noise value to each pixel: $x' = x + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma)$ and $\mathcal{N}(0, \sigma)$ denotes the normal distribution with mean $0$ and a standard deviation of $\sigma$.
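The color and noise transformations in the list above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the code used for the experiments: images are float arrays of shape (H, W, 3) with values in [0, 1], the grayscale image is taken as the plain channel mean, results are clipped to the valid range, and all function and parameter names are hypothetical.

```python
import numpy as np

def brightness(image, bias):
    """Add the same bias to every color channel (clipping is an assumption)."""
    return np.clip(image + bias, 0.0, 1.0)

def contrast(image, factor):
    """Multiply all color values by a common factor."""
    return np.clip(image * factor, 0.0, 1.0)

def grayscale_blend(image, weight):
    """Interpolate between the color image (weight 0) and its grayscale version (weight 1)."""
    gray = image.mean(axis=-1, keepdims=True)  # assumption: unweighted channel mean
    return (1.0 - weight) * image + weight * gray

def gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each pixel."""
    noise = rng.normal(0.0, sigma, size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))           # random stand-in for an input image
gray_image = grayscale_blend(image, 1.0)
noisy_image = gaussian_noise(image, 0.1, rng)
```

Each function takes the magnitude of the transformation as a single scalar, which makes it straightforward to sweep that magnitude over a range of values.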
This enables us to define thresholds and identify critical points at which the network can no longer output correct predictions. We use two measures to quantify this:
First, we evaluate the average accuracy over the predictions for a set of test images from various classes. This can be used to analyze the sensitivity of the network to the applied transformations.
Second, we inspect different sets of test images of the same class. Here, we analyze the softmax output of the network as a function of the applied transformations.
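The first measure amounts to a sweep over transformation magnitudes. The sketch below shows one way to organize such a sweep and to read off a critical magnitude. The classifier `toy_predict`, the random data, and the minimum accuracy of 0.9 are hypothetical stand-ins chosen so that the example runs without a trained CNN.

```python
import numpy as np

def accuracy_under_transform(predict, images, labels, transform, magnitudes):
    """Mean accuracy of `predict` on transformed images, one value per magnitude."""
    return np.array([
        np.mean(predict(transform(images, m)) == labels) for m in magnitudes
    ])

def critical_magnitude(magnitudes, accuracies, min_accuracy):
    """First magnitude whose accuracy falls below `min_accuracy`, or None."""
    below = np.flatnonzero(accuracies < min_accuracy)
    return None if below.size == 0 else float(magnitudes[below[0]])

def toy_predict(images):
    """Hypothetical stand-in for a CNN: class 1 if the mean intensity exceeds 0.5."""
    return (images.mean(axis=(1, 2, 3)) > 0.5).astype(int)

def add_brightness(images, bias):
    """Brightness shift applied to a whole batch of images."""
    return np.clip(images + bias, 0.0, 1.0)

rng = np.random.default_rng(0)
dark = rng.uniform(0.0, 0.4, size=(20, 8, 8, 3))    # class 0
bright = rng.uniform(0.6, 1.0, size=(20, 8, 8, 3))  # class 1
images = np.concatenate([dark, bright])
labels = np.concatenate([np.zeros(20, dtype=int), np.ones(20, dtype=int)])

magnitudes = np.linspace(0.0, 0.6, 13)
accuracies = accuracy_under_transform(
    toy_predict, images, labels, add_brightness, magnitudes)
threshold = critical_magnitude(magnitudes, accuracies, min_accuracy=0.9)
```

With a real network, `toy_predict` would be replaced by the arg-max over the softmax output, and the same loop can record the mean softmax value per class for the second measure.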
III-D Strategy 2: Invariant Transformer Net
Instead of simply testing which transformations the network is invariant to, it is even more interesting to ask whether the network itself is capable of learning a family of transformations to which it is invariant, where the family is controlled by additional inputs so that different inputs yield different transformation functions.
The network architecture we used is shown in figure 1. The overall idea is to train the weights of the two layers that generate the spatial and the color transformation while keeping the weights of the CNN fixed. Each of those layers depends on its own control parameter. Both control parameters are sampled uniformly at random from a fixed interval during training, which allows us to learn a parametrized transformation function.
We decided to model color and spatial transformations. Both are differentiable and simple enough to avoid overfitting to single features of the images. Each of the two trainable layers consists of two fully connected layers with rectified linear activation functions. This allows us to learn almost arbitrary functions which only depend on the control parameters. We decided to split the parameters between the spatial and the color transformation such that both can be controlled independently.
The color transformer is a simple matrix multiplication applied to the color values of the input image (in homogeneous coordinates, to also allow brightness shifts). The spatial transformer was taken from [jaderberg2015] and enables differentiable affine transformations of the coordinates of the input images.
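To illustrate why homogeneous coordinates allow brightness shifts, here is a minimal NumPy sketch of a color transformer. It assumes RGB images of shape (H, W, 3) and a 3x4 color matrix; the function name and the example matrices are hypothetical, and in the actual network the matrix would be produced by the trainable layers rather than written by hand.

```python
import numpy as np

def apply_color_transform(image, color_matrix):
    """Apply a 3x4 affine color matrix to an (H, W, 3) image.

    Each pixel (r, g, b) is extended to (r, g, b, 1), so the last matrix
    column acts as an additive shift on the output channels.
    """
    ones = np.ones(image.shape[:-1] + (1,), dtype=image.dtype)
    homogeneous = np.concatenate([image, ones], axis=-1)  # shape (H, W, 4)
    return homogeneous @ color_matrix.T                    # shape (H, W, 3)

rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))

identity = np.hstack([np.eye(3), np.zeros((3, 1))])         # leaves colors unchanged
brighter = np.hstack([np.eye(3), np.full((3, 1), 0.1)])     # adds 0.1 to every channel
more_contrast = np.hstack([1.5 * np.eye(3), np.zeros((3, 1))])  # scales every channel

shifted = apply_color_transform(image, brighter)
scaled = apply_color_transform(image, more_contrast)
```

Without the appended 1, a plain 3x3 matrix could mix and scale the channels but could not express the additive brightness term.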
As both spatial and color transformations are affine, they can be described by two matrices, one for the spatial and one for the color transformation. We can further extend these to square matrices by setting the last row to $(0, \dots, 0, 1)$.
We define a stochastic loss for these extended matrices by measuring how far they move a set of random unit vectors.
The more often and the further a matrix moves one of these unit vectors away from its original position, the more this loss decreases. This models the behavior we aim to achieve: to learn transformations of high magnitude such that the resulting transformed images are still classified correctly by the subsequent CNN. In principle, other matrix norms can be used, but we achieved good results with this loss formulation.
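A minimal sketch of such a loss is given below. It assumes the loss is the negative mean Euclidean distance between each random unit vector and its image under the extended matrix; the precise form used in the experiments may differ, and all function names are hypothetical.

```python
import numpy as np

def extend_to_square(affine_matrix):
    """Append the row (0, ..., 0, 1) to an (n, n + 1) affine matrix."""
    n = affine_matrix.shape[0]
    last_row = np.zeros((1, n + 1))
    last_row[0, -1] = 1.0
    return np.vstack([affine_matrix, last_row])

def random_unit_vectors(count, dim, rng):
    """Draw `count` random directions and normalize them to unit length."""
    vectors = rng.normal(size=(count, dim))
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def displacement_loss(square_matrix, unit_vectors):
    """Negative mean distance between each vector and its transformed version.

    Assumption: Euclidean distance, averaged over the sampled vectors.
    """
    moved = unit_vectors @ square_matrix.T
    return -np.mean(np.linalg.norm(moved - unit_vectors, axis=1))

rng = np.random.default_rng(0)
vectors = random_unit_vectors(64, 3, rng)

identity_affine = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0]])   # 2x3 spatial identity
shift_affine = np.array([[1.0, 0.0, 0.5],
                         [0.0, 1.0, 0.0]])      # 2x3 horizontal shift

loss_identity = displacement_loss(extend_to_square(identity_affine), vectors)
loss_shift = displacement_loss(extend_to_square(shift_affine), vectors)
```

Under this assumption the identity transformation yields a loss of zero, and any matrix that moves the sampled vectors yields a negative loss, so minimizing the loss pushes the transformation away from the identity.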
Moreover, we want different values of the control parameters to lead to different generated transformations. We enforce this by incorporating the individual components of the control parameters into the loss function.
The loss must further ensure that the learned transformations do not impair the prediction capabilities of the original network. It proved rather difficult to train both objectives at once because of the extremely different value ranges of these losses. Thus, we decided to use the batch-wise accuracy to select whether we should increase the transformation impact or reduce the original loss. For each batch running through the network, we select the batch-wide loss as follows: if the batch accuracy falls below an accuracy threshold, we minimize the original classification loss; otherwise, we minimize the transformation loss, in which the color term is weighted by an additional hyperparameter. The accuracy threshold is selected based on the original performance on the dataset used, and the weighting hyperparameter is selected by hand to increase the influence of color transformations. This was needed because spatial transformations were learned much faster than color transformations.
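The batch-wise selection can be sketched as a small function. This is an interpretation under assumptions: the rule compares the batch accuracy with a fixed threshold, the transformation loss is the sum of a spatial term and a weighted color term, and all names and example values are hypothetical.

```python
def select_batch_loss(batch_accuracy, classification_loss,
                      spatial_loss, color_loss,
                      accuracy_threshold, color_weight):
    """Choose the loss to minimize for the current batch.

    Assumed rule: below the accuracy threshold, restore the original
    performance; otherwise, push the transformations to higher magnitude.
    """
    if batch_accuracy < accuracy_threshold:
        # Predictions degraded too much: reduce the original CNN loss.
        return classification_loss
    # Predictions are still good: increase the transformation impact.
    return spatial_loss + color_weight * color_loss

# Example values only; they are not taken from the experiments.
loss_when_poor = select_batch_loss(
    batch_accuracy=0.4, classification_loss=2.0,
    spatial_loss=-1.0, color_loss=-0.5,
    accuracy_threshold=0.6, color_weight=3.0)
loss_when_good = select_batch_loss(
    batch_accuracy=0.8, classification_loss=2.0,
    spatial_loss=-1.0, color_loss=-0.5,
    accuracy_threshold=0.6, color_weight=3.0)
```

Switching between the two objectives per batch avoids having to balance two losses with very different value ranges in a single weighted sum.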
While all tests were performed using the AlexNet architecture, the Invariant Transformer Net can be used with other CNN architectures as well. The Invariant Transformer Net is implemented in TensorFlow (version 1.5.0) [abadi2016].
IV-A Strategy 1: Sensitivity to Transformations
The first approach analyzed the behavior of the networks AlexNet as well as ResNet for various affine and nonaffine transformations of differing magnitude.
For each plot, we use all images of a specific class of the ImageNet [krizhevsky2012] test set and record the mean accuracy as well as the mean softmax output over all images of that class while varying the parameters of the transformations. In each plot, we show the three predicted classes with the highest softmax output over all predicted classes while varying the parameter. If there are too many classes, we condense them into one line ("others"), which shows the maximum softmax over all predicted classes for the given parameter.
Figure 2 shows exemplary results for the effect of Gaussian noise, rotation, zoom and translation on the performance of AlexNet, and figure 3 shows the same for ResNet. More effects of different transformations can be found in the appendix. Both network architectures show similar behavior with respect to the given transformations.
The results for both networks, AlexNet and ResNet, suggest that adding Gaussian noise of increasing magnitude leads to a switch of the class prediction, as indicated by the softmax output. Since the networks' confidence depends on the class, the critical level of Gaussian noise leading to a switch in the prediction depends on the class as well (fig. 1(a)). Rotation results in a strong reduction of the softmax output. For certain point-symmetric items such as an orange, the softmax output is constant and independent of the transformation, while for axisymmetric items the softmax value returns to high values at a rotation of 180 degrees (fig. 1(b)).
IV-B Strategy 2: Invariant Transformer Net
In the second approach, we tried to find transformations by training the Invariant Transformer Net with images of different classes from the validation set of ImageNet [krizhevsky2012]. The training shows a constant mean loss of the CNN but a decreasing loss of the transformers of the Invariant Transformer Net, as we expect, since we want to find transformations of high magnitude (fig. 3(a)).
In figure 3(b) we show that the classification of the transformed images does not change and thus, the network is expected to be invariant to the learned transformations.
Some of the actual transformations can be seen in figure 5. Note that changes of the spatial control parameter result in a changing spatial transformation (with increasing magnitude for increasing parameter values), and changes of the color control parameter result in a change of the color transformation.
The results of the large-scale screening, as described in the first strategy, are consistent with the behavior one expects from common CNN architectures. This approach represents a general method to systematically assess the invariances learned by CNNs and to extract thresholds at which the magnitude of different transformations leads to a misclassification of the input images. We showed that the learned invariances of ResNet correlate with the invariances learned by AlexNet and that both networks are highly sensitive to stronger affine and nonaffine transformations.
The second strategy reveals interesting insights: While the network is able to learn small transformations of the input, it never chooses a transformation with high information loss and thus never strongly zooms into the image or rotates it by more than a few degrees. Instead, it only zooms out of the image, which seems to merely compress the image without changing too much. Additionally, Convolutional Neural Networks are highly sensitive to color changes [engilberge2017]. This might be a reason why only color changes of rather low magnitude are learned.
In contrast to the approach proposed by Lenc et al. [lenc2015], we focused on the question of up to which magnitude of transformations CNNs are invariant.
For future work, it would be interesting to see different transformations learned with the Invariant Transformer Net approach described above. For example, one could also learn parametrized convolutions on input images.
This paper introduced the idea of learning the space of different affine transformation families in which the modified images are still correctly classified. The architecture of the Invariant Transformer Net can be used with different CNN architectures and allows controlling the transformation via differentiable parameters, which are passed as inputs to the network. Furthermore, the large-scale screening of affine and nonaffine transformations showed which invariances architectures like AlexNet and ResNet have learned. If the magnitude of a transformation exceeds a class- and transformation-dependent threshold, the prediction result becomes unstable and incorrect.
Appendix A Sensitivity to Transformations
Results showing effects of affine and nonaffine transformations on classification results achieved by the AlexNet and ResNet architectures.