CNNs achieve state-of-the-art accuracy on most benchmark datasets, such as ImageNet. A CNN has a set of parameters (convolutional kernels, biases, and the weights of the final fully connected layers) that are adjusted during training. The number of such parameters is typically very large (on the order of millions or tens of millions). Models with so many parameters nevertheless do not overfit the data much, for the following reasons:
Data augmentation. The training set is augmented during training in several ways: affine transformations, random subimage selection, and random per-pixel color distortions.
Efficient regularization techniques. Dropout is one of the most powerful regularization techniques; it corresponds to approximate ensembling over an exponential number of subnetworks.
Inner structure of the CNN. Weight sharing is used to enforce approximate invariance of the network output to translations of the input image.
In this work, we focus on CNNs for classification. We propose to make the network output invariant to horizontal image flips by introducing horizontally symmetric convolutional kernels. We thus modify the inner structure of the CNN to enforce an additional invariance and improve generalization to new data.
2 Symmetric image kernels
Consider a typical CNN architecture: several convolutional layers, each followed by an elementwise nonlinear function (in most cases ReLU), alternating with pooling layers (average or max pooling), followed by one or several fully connected layers with a softmax activation function. The network is trained with the categorical cross-entropy loss.
Consider the first convolutional layer of the network. This layer is translation equivariant: translating the input translates the output in the same way. For an arbitrary convolution kernel, however, it is not equivariant to horizontal image flips.
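We focus on kernels of size $3 \times 3$, which are the most widely used. A general convolution kernel has the form
$$K = \begin{pmatrix} k_{11} & k_{12} & k_{13} \\ k_{21} & k_{22} & k_{23} \\ k_{31} & k_{32} & k_{33} \end{pmatrix}.$$
We propose to use horizontally symmetric kernels, in which the first and last columns coincide:
$$K_{\mathrm{sym}} = \begin{pmatrix} k_{11} & k_{12} & k_{11} \\ k_{21} & k_{22} & k_{21} \\ k_{31} & k_{32} & k_{31} \end{pmatrix}.$$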
We show that in this case the convolutional layer becomes equivariant to horizontal image flips and that the whole network, under certain structural conditions, becomes invariant to horizontal flips.
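The parameter sharing implied by a horizontally symmetric kernel can be sketched in a few lines. This is an illustration only, assuming $3 \times 3$ kernels; the helper `build_symmetric_kernel` and the layout of the free weights are hypothetical and not taken from the original implementation.

```python
import numpy as np

def build_symmetric_kernel(free):
    """Expand a (3, 2) array of free weights into a 3x3 kernel.

    Column 0 of `free` is shared by the left and right kernel columns,
    column 1 of `free` is the centre kernel column.
    """
    side, centre = free[:, 0], free[:, 1]
    return np.stack([side, centre, side], axis=1)

# Six free weights instead of nine for a general 3x3 kernel.
free = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])
kernel = build_symmetric_kernel(free)

# The kernel is unchanged by a left-right mirror.
is_symmetric = np.array_equal(kernel, kernel[:, ::-1])
```

Storing only the free weights and expanding them before each convolution is one possible way to keep the symmetry exact during training.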
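It is enough to show equivariance in the one-dimensional case (for each row of the image). Consider an arbitrary vector
$$x = (x_1, x_2, \dots, x_n)$$
and a one-dimensional symmetric kernel
$$k = (a, b, a).$$
Assuming zero padding ($x_0 = x_{n+1} = 0$), the convolution output is $y_i = a x_{i-1} + b x_i + a x_{i+1}$. Convolution with the flipped vector $\tilde{x}_i = x_{n+1-i}$ gives
$$\tilde{y}_i = a \tilde{x}_{i-1} + b \tilde{x}_i + a \tilde{x}_{i+1} = a x_{n+2-i} + b x_{n+1-i} + a x_{n-i} = y_{n+1-i}.$$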
Thus the convolution of the flipped image with a symmetric kernel equals the flip of the convolution of the original image; that is, a symmetric kernel makes convolution flip equivariant. This result clearly generalizes to the 3D convolutions used in CNNs.
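The one-dimensional argument can be checked numerically. The following is a minimal sketch assuming a length-3 kernel and zero padding; the function name `conv1d_same` and the specific kernel values are chosen here for illustration.

```python
import numpy as np

def conv1d_same(x, kernel):
    """Slide a length-3 kernel over x with zero padding (output has the same length as x)."""
    padded = np.pad(x, 1)  # one zero on each side
    return np.array([padded[i:i + 3] @ kernel for i in range(len(x))])

rng = np.random.default_rng(0)
x = rng.normal(size=8)

a, b, c = 0.5, -1.2, 2.0
symmetric = np.array([a, b, a])  # symmetric kernel
general = np.array([a, b, c])    # general kernel with a != c

# Symmetric kernel: convolving the flipped vector equals flipping the convolved vector.
sym_equivariant = np.allclose(conv1d_same(x[::-1], symmetric),
                              conv1d_same(x, symmetric)[::-1])

# General kernel: the same identity does not hold.
gen_equivariant = np.allclose(conv1d_same(x[::-1], general),
                              conv1d_same(x, general)[::-1])
```

For the general kernel, flipping the input is equivalent to flipping the kernel, which is why the identity holds only when the first and last kernel elements coincide.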
Consider now the other operations performed in a CNN. Elementwise application of a nonlinear function and max and average pooling are also clearly flip equivariant. Thus any composition of 3D convolutions, nonlinear functions and poolings is flip equivariant as well.
The only transformation used in CNNs that lacks this property is the flatten layer, which maps a tensor to a vector before the fully connected layers. That is why we consider only networks in which global pooling (max or average) is applied before the fully connected layers. This condition is not restrictive, since recent architectures (such as DenseNet) use global pooling before the fully connected layers.
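The equivariance of the nonlinearity and of pooling can also be verified directly. The sketch below is illustrative and makes an assumption the text does not state: pooling is non-overlapping $2 \times 2$ and the feature map has even height and width. With an odd width the pooling windows no longer align after a flip, so exact equivariance can fail. The helper names are hypothetical.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def pool2x2(t, reduce):
    """Non-overlapping 2x2 pooling of an (H, W) map; H and W are assumed even."""
    h, w = t.shape
    blocks = t.reshape(h // 2, 2, w // 2, 2)
    return reduce(blocks, axis=(1, 3))

def hflip(t):
    """Horizontal (left-right) flip of an (H, W) map."""
    return t[:, ::-1]

rng = np.random.default_rng(1)
fmap = rng.normal(size=(6, 8))

# Each operation commutes with the horizontal flip.
relu_ok = np.allclose(relu(hflip(fmap)), hflip(relu(fmap)))
max_ok = np.allclose(pool2x2(hflip(fmap), np.max), hflip(pool2x2(fmap, np.max)))
avg_ok = np.allclose(pool2x2(hflip(fmap), np.mean), hflip(pool2x2(fmap, np.mean)))
```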
Since global pooling (which pools a tensor to a vector of the same depth) is invariant to horizontal flips, the whole network output becomes invariant to horizontal flips. Thus, if symmetric kernels are used, the posterior probabilities produced by the CNN are exactly the same for an image $X$ and its horizontally flipped version $\tilde{X}$:
$$p(c \mid X) = p(c \mid \tilde{X}) \quad \text{for every class } c.$$
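The invariance of the full pipeline can be illustrated on a toy network. The sketch below is not the network used in the experiments: it assumes a single $3 \times 3$ convolutional layer with zero padding, ReLU, global average pooling, and one fully connected softmax layer. All names and sizes are hypothetical.

```python
import numpy as np

def symmetrize(kernels):
    """Average each 3x3 kernel with its left-right mirror so columns 0 and 2 coincide."""
    return 0.5 * (kernels + kernels[..., ::-1])

def conv_layer(image, kernels):
    """Zero-padded 3x3 convolution plus ReLU. image: (C_in, H, W); kernels: (C_out, C_in, 3, 3)."""
    _, h, w = image.shape
    padded = np.pad(image, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((kernels.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, dy:dy + h, dx:dx + w]  # shifted view, shape (C_in, H, W)
            out += np.einsum('oc,chw->ohw', kernels[:, :, dy, dx], patch)
    return np.maximum(out, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(image, kernels, fc_weights):
    features = conv_layer(image, kernels).mean(axis=(1, 2))  # global average pooling
    return softmax(fc_weights @ features)

rng = np.random.default_rng(2)
image = rng.normal(size=(3, 8, 8))
general_kernels = rng.normal(size=(4, 3, 3, 3))
kernels = symmetrize(general_kernels)
fc_weights = rng.normal(size=(10, 4))

probs = predict(image, kernels, fc_weights)
probs_flipped = predict(image[:, :, ::-1], kernels, fc_weights)
invariant = np.allclose(probs, probs_flipped)
```

Replacing `kernels` with `general_kernels` in the two `predict` calls generally yields different probabilities for the image and its mirror, which is the behaviour the symmetric kernels remove.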
3 Levels of Symmetry
We experimented with several levels of symmetry of the convolutional kernels; they are summarized in Table 1. Note that the third column lists the equivariances induced in the convolutional layers, which in turn correspond to invariances of the network output (provided global pooling is used before the fully connected layer).
|Symmetry Level||Kernel form||Induced network invariances|
|0||No induced invariances|
Different symmetry levels are applicable to different datasets. For example, for the MNIST dataset levels 2 and higher are not applicable: the digit 9 can be obtained from the digit 6 by consecutive horizontal and vertical flips, so a network trained with such kernels cannot distinguish between 6 and 9. For datasets of real-world photos, on the other hand, high symmetry levels are applicable. Nevertheless, experiments show that training a network with a high symmetry level is difficult, so in practice levels higher than 2 should not be used.
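The 6 versus 9 argument can be made concrete with a small numerical sketch. It assumes that a level 2 kernel is symmetric under both left-right and up-down mirroring (an assumption consistent with the text above, since the kernel forms are not reproduced here), and it uses a crude hand-drawn binary "6"; neither the pattern nor the helper names come from the original experiments.

```python
import numpy as np

def conv_same(image, kernel):
    """Zero-padded 3x3 convolution of a single-channel (H, W) image."""
    h, w = image.shape
    padded = np.pad(image, 1)
    return sum(kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
               for dy in range(3) for dx in range(3))

rng = np.random.default_rng(3)
k = rng.normal(size=(3, 3))
# Assumed level 2 form: invariant to both horizontal and vertical mirroring.
level2 = (k + k[:, ::-1] + k[::-1, :] + k[::-1, ::-1]) / 4.0

six = np.array([[0, 1, 1, 0],
                [1, 0, 0, 0],
                [1, 1, 1, 0],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)  # crude "6"
nine = six[::-1, ::-1]  # horizontal flip followed by vertical flip

# A flip pair is the same as a 180 degree rotation.
same_as_rotation = np.array_equal(nine, np.rot90(six, 2))

# Globally pooled ReLU features coincide, so later layers cannot tell the digits apart.
act_six = np.maximum(conv_same(six, level2), 0.0)
act_nine = np.maximum(conv_same(nine, level2), 0.0)
same_features = (np.isclose(act_six.max(), act_nine.max())
                 and np.isclose(act_six.mean(), act_nine.mean()))
```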
4 Backpropagation equations
In this section we describe the modification of the backpropagation procedure used to compute gradients of the error function with respect to the network weights. For simplicity, we show the forward and backward passes only for 1-dimensional convolution and symmetry levels 0 and 1; the extension to 2D convolution and other symmetry levels is straightforward.
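Let us denote the elements of the convolutional layer as follows: input vector $x$, output vector $y$, general convolutional kernel $(a, b, c)$, symmetric convolutional kernel $(a, b, a)$. We denote by $\delta x$ and $\delta y$ the derivatives of the error function with respect to the vectors $x$ and $y$, and by $\delta a$, $\delta b$, $\delta c$ the derivatives of the error function with respect to the convolutional kernel elements. The equations for the forward and backward passes then become:
|Level 0, Forward||$y_i = a x_{i-1} + b x_i + c x_{i+1}$|
|Level 1, Forward||$y_i = a (x_{i-1} + x_{i+1}) + b x_i$|
|Level 0, Backward||$\delta x_i = a\,\delta y_{i+1} + b\,\delta y_i + c\,\delta y_{i-1}$; $\delta a = \sum_i \delta y_i\, x_{i-1}$; $\delta b = \sum_i \delta y_i\, x_i$; $\delta c = \sum_i \delta y_i\, x_{i+1}$|
|Level 1, Backward||$\delta x_i = a (\delta y_{i+1} + \delta y_{i-1}) + b\,\delta y_i$; $\delta a = \sum_i \delta y_i (x_{i-1} + x_{i+1})$; $\delta b = \sum_i \delta y_i\, x_i$|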
Note that the distributive law makes the forward and backward passes for level 1 slightly faster than for level 0. The same holds for higher symmetry levels.
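A level 1 forward and backward pass can be sketched and checked against finite differences. The code below is an illustration under stated assumptions: the symmetric kernel is written $(a, b, a)$, the input is zero padded, and the error is taken to be $E = \frac{1}{2} \sum_i y_i^2$ purely for the gradient check. Function names are hypothetical.

```python
import numpy as np

def forward_level1(x, a, b):
    """y_i = a * (x_{i-1} + x_{i+1}) + b * x_i, with zeros beyond both ends of x."""
    xp = np.pad(x, 1)
    return a * (xp[:-2] + xp[2:]) + b * xp[1:-1]

def backward_level1(x, a, b, dy):
    """Gradients of the error with respect to x, a and b, given dy = dE/dy."""
    xp = np.pad(x, 1)
    dyp = np.pad(dy, 1)
    dx = a * (dyp[:-2] + dyp[2:]) + b * dy   # neighbours of dy are summed before scaling by a
    da = np.sum(dy * (xp[:-2] + xp[2:]))
    db = np.sum(dy * x)
    return dx, da, db

def error(x, a, b):
    """Toy error function used only for the gradient check."""
    return 0.5 * np.sum(forward_level1(x, a, b) ** 2)

rng = np.random.default_rng(4)
x = rng.normal(size=7)
a, b = 0.3, -0.8

dy = forward_level1(x, a, b)  # dE/dy for E = 0.5 * sum(y^2)
dx, da, db = backward_level1(x, a, b, dy)

# Central finite differences for comparison.
eps = 1e-6
da_numeric = (error(x, a + eps, b) - error(x, a - eps, b)) / (2 * eps)
db_numeric = (error(x, a, b + eps) - error(x, a, b - eps)) / (2 * eps)
dx_numeric = np.array([(error(x + eps * e, a, b) - error(x - eps * e, a, b)) / (2 * eps)
                       for e in np.eye(len(x))])
```

Summing the two neighbours before multiplying by `a` needs one multiplication per output element where a general kernel needs two, which is the saving attributed to the distributive law.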
To test the proposed approach, we use the CIFAR-10 dataset, which consists of photos of size $32 \times 32$ (3 color channels) distributed among 10 classes that include animals, cars, ships and other categories. The training and test sets contain 50000 and 10000 images, respectively. As the base network we chose a variant of DenseNet, one of the most efficient recent architectures. The exact configuration of the network we use is given in the table.
|Dense block 1||Number of layers: 1; Convolutional depth: 30|
|Input: ; Output:|
|Pooling 1||Average pooling|
|Dense block 2||Number of layers: 1; Convolutional depth: 30|
|Pooling 2||Average pooling|
|Dense block 3||Number of layers: 1; Convolutional depth: 30|
|Pooling 3||Average pooling|
|Dense block 4||Number of layers: 1; Convolutional depth: 30|
|Pooling 4||Full Average pooling|
|Fully Connected||Input length: 123|
|+ Softmax||Output length: 10|
Note that we use the ReLU nonlinearity in each layer of the dense blocks.
We use this network architecture with each symmetry level of the convolutional kernels. Since the symmetry levels induce parameter sharing, the total number of parameters decreases with each successive level.
We train all networks with the Adam stochastic optimization method, using an initial learning rate of 0.02 that is multiplied by 0.97 after every 5 epochs. We use a minibatch size of 1280 in all cases.
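The learning rate schedule described above can be written as a short function. This is a sketch that assumes epochs are counted from 0 and that the decay is applied once each block of 5 epochs is complete; the function name and argument names are illustrative.

```python
def learning_rate(epoch, initial=0.02, decay=0.97, step=5):
    """Step-decayed learning rate for a 0-based epoch index.

    The rate starts at `initial` and is multiplied by `decay`
    after every `step` completed epochs.
    """
    return initial * decay ** (epoch // step)

# Rates for the first few decay steps (epochs 0, 5, 10, 15).
schedule = [learning_rate(e) for e in range(0, 20, 5)]
```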
Final results for different symmetry levels are given in the table.
|Level||Model coefficients||Train error||Train accuracy||Test error||Test accuracy|
To see whether the use of symmetric kernels improves regularization, we recorded the train and test error function values and accuracies after every 5th epoch during training. Scatterplots based on these records are shown in Figure 2.
In this work we presented symmetric kernels for convolutional neural networks. The use of such kernels guarantees that the network is invariant under certain transformations, such as horizontal flips for the lowest symmetry level and approximate rotational symmetry for the highest symmetry level.
We tested this approach by training a convolutional neural network with the same DenseNet architecture on the CIFAR-10 dataset under different symmetry levels. Since most of the parameters in such a network are convolutional kernels (all except the biases and the matrix of the last fully connected layer), the total number of coefficients adjusted during training varies considerably: from 21960 (highest symmetry level) to 95520 (no symmetry).
Experiments suggest that CNN training is more difficult for higher symmetry levels (as expected) and that only level 1 symmetry improves generalization. This can be seen in Figure 2, where the network without symmetries has higher test error values than the network with horizontally symmetric kernels at low train error levels (0.2 to 0.4). The same effect is observed in Figure 2, where the network with horizontally symmetric kernels stabilizes at the highest test accuracy level. This suggests that networks with horizontally symmetric kernels tend to overfit less.
Why do networks with higher symmetry levels (2, 3 and 4) not show improved generalization despite providing additional output invariances? In our view the reason is as follows. It is commonly held that a trained convolutional neural network extracts low-level image features such as edges and corners in the first convolutional layers and combines them into more complex shapes in subsequent layers. With the introduction of DenseNets this interpretation became less clear, since deeper layers depend directly on the input, but the convolutional kernel subtensors acting on the input still extract these low-level features. The problem with convolutional kernels of high symmetry levels is that they cannot extract image edges or corners of a particular orientation (in fact, units of the convolutional layer respond to edges of different orientations in the same way). Such units therefore cannot determine the joint orientation of edges within the image and, in addition, the overall network output is invariant under these transformations. We believe this is why networks with high symmetry levels do not show improved generalization.
We therefore suggest using convolutional neural networks with horizontally symmetric kernels (symmetry level 1) in practice, since they show lower test error values and higher test set accuracies than the same network with general convolutional kernels. At the same time, such networks have fewer parameters (approximately 2/3 as many) and their output is guaranteed to be invariant under horizontal image flips.
- Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press, http://www.deeplearningbook.org
- Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Proc. Adv. Conf. Neur. Inform. Proc. Syst (NIPS 2012), Lake Tahoe, NV
- Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15:1929–1958
- Huang G, Liu Z, van der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition. CoRR abs/1512.03385, URL http://arxiv.org/abs/1512.03385
- Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. In: Int. Conf. Learn. Representations, Banff, Canada