In recent years, convolutional neural networks (CNNs) have achieved state-of-the-art performance on many computer vision tasks, including image recognition [26, 23, 10], semantic segmentation [14, 5, 6], action recognition [8, 24], and video captioning [27, 22]. The success of CNNs comes from their ability to learn the two-dimensional structure of images, in which objects and patterns may appear at different locations. To detect and learn patterns regardless of their locations, the weights of local filters are shared when applied to different positions in the image.
Since distortions or shifts of the input can cause the positions of salient features to vary, weight sharing is very important for CNNs to detect invariant elementary features regardless of location changes of these features [18]. In addition, pooling reduces the sensitivity of the output to small local shifts and distortions by reducing the resolution of the input feature maps. However, another important property of weight sharing and pooling is that the location of a detected feature in the output feature maps is identical to that of the corresponding local patch in the input feature maps. As a result, location changes of the input visual patterns in lower layers propagate to higher convolutional layers. Because the local spatial support of pooling and of convolution kernels is typically small, large global location changes of patterns in the input (e.g., global rotation or translation of objects) even propagate to the feature maps of the final convolutional layer (as shown in Fig. 1). Consequently, the subsequent fully connected layers have to learn the location invariance to produce consistent predictions or representations, which restricts the use of the parameter budget for achieving more powerful outputs.
In this paper, we introduce a Patch Reordering (PR) module that can be embedded into a standard CNN architecture to improve its rotation and translation invariance. Output feature maps of the convolutional layer are first divided into multiple tiers of non-overlapping local patches at different spatial pyramid levels. We reorder these local patches at each level based on their energy (e.g., L1 or L2 norm of activations of the patch). To retain the spatial consistency of local patterns, we only reorder the patches of a given level locally (i.e., within each single patch of its upper level). In convolutional layers, a location change of the patterns in the input feature maps results in a corresponding location change in the output feature maps, while the local patterns (activations) in the output are equivalent. As a result, ranking these local patterns in a specific order leads to a consistent representation regardless of the locations of local patterns in the input, that is, rotation or translation invariance. The proposed module can be inserted into any convolutional layer and allows for end-to-end training of the models to which it is applied. In addition, we do not need any extra training supervision, modification to the training process, or preprocessing of input images.
The work in [4] showed that a linear transform of a good visual representation was equivalent to a combination of the elementary irreducible representations using the theory of group representations. Lenc and Vedaldi [19] estimated the linear relationships between representations of the original and transformed images. Gens and Domingos [7] proposed a generalization of CNNs that formed feature maps over arbitrary symmetry groups based on the theory of symmetry groups, resulting in feature maps that were more invariant to symmetry groups. Bruna and Mallat [1] proposed a wavelet scattering network to compute a translation-invariant image representation. Local linear transformations were adopted in the feature learning algorithms in [25] for the purpose of transformation-invariant feature learning.
Numerous recent works have focused on introducing spatial invariance in deep learning architectures explicitly. For unsupervised feature learning, Sohn and Lee [25] presented a transform-invariant restricted Boltzmann machine that compactly represented data by its weights and their transformations, which achieved invariance of the feature representation via probabilistic max pooling. Each hidden unit was augmented with a latent transformation assignment variable that described the selection of the transformed view of the weights associated with the unit in [15]. In both works, the transformed filters were only applied at the center of the largest receptive field size. In tiled convolutional neural networks [17], invariance was learned explicitly by square-root pooling hidden units computed by partially un-tied weights. Here, additional learned parameters were needed when un-tying weights.
In [13], feature maps in CNNs were scaled or rotated to multiple levels, and the same kernel was convolved across the input at each scale. Then, the responses of the convolution at each scale were normalized and pooled at each spatial location to obtain a locally scale-invariant representation. In this model, only limited scales were considered, and extra modules were needed in the feature extraction process. To address different transformation types in input images, Jaderberg et al. [11] proposed inserting a spatial transformer module between CNN layers, which explicitly transformed an input image into a proper appearance and fed the transformed input into the CNN model.
In conclusion, all aforementioned related works improve the transform invariance of deep learning models by adding extra feature extraction modules, more learnable parameters, or extra transformations on input images, which makes the trained CNN model problem-dependent and not generalizable to other datasets. In contrast, in this paper, we propose a very simple reordering on feature maps during the training of CNN models. No extra feature extraction modules or more learnable parameters are needed. Therefore, it is very easy to apply the trained model to other vision tasks.
Patch Reordering in Convolutional Neural Networks
Weight sharing in CNNs allows feature detectors to detect features regardless of their spatial locations in the image; however, the corresponding location of output patterns varies when subject to location changes of the local patterns in the input. Learning invariant representations causes parameter redundancy problems in current CNN models. In this section, we will reveal this phenomenon and propose the formulation of our Patch Reordering module.
Parameter Redundancy in Convolutional Neural Networks
Let $X \in \mathbb{R}^{c \times n}$ denote the output feature maps of a convolutional layer with $n$ elements ($n = h \times w$ for feature maps with height $h$ and width $w$). Each $x_{ij}$ is a $c$-dimensional input feature vector corresponding to location $(i, j)$. If it is followed by a fully connected layer, the output $z$ can be computed by

$$z = f\Big(\sum_{i,j} w_{ij}^{\top} x_{ij}\Big),$$

where $f$ is a non-linear activation function and $w_{ij}$ are the weights for location $(i, j)$. If there is some location change (such as a rotation or translation) of the input features, the resulting new input becomes $X'$. Since there are no value changes (except cropping or padding), for $x_{ij}$ in any position $(i, j)$, we can always find its correspondence in the transformed input, i.e., $x'_{i'j'} = x_{ij}$. If the network learns to be invariant under this type of location change, the output (or representation) should remain the same. Specifically,

$$f\Big(\sum_{i,j} w_{ij}^{\top} x_{ij}\Big) = f\Big(\sum_{i',j'} w_{i'j'}^{\top} x'_{i'j'}\Big).$$

Then, in the monotonic section of $f$, we have

$$\sum_{i,j} w_{ij}^{\top} x_{ij} = \sum_{i',j'} w_{i'j'}^{\top} x'_{i'j'}.$$

Since $x'_{i'j'} = x_{ij}$, the aforementioned equation can be simplified as:

$$\sum_{i,j} \big(w_{ij} - w_{i'j'}\big)^{\top} x_{ij} = 0,$$

where $(i', j')$ is the location to which $(i, j)$ is moved. Because $x_{ij}$ varies as the input image changes, we have $w_{ij} = w_{i'j'}$. That is to say, encoding rotation or translation invariance into CNNs leads to highly correlated parameters in higher layers. Therefore, the capacity of the model decreases. To validate this redundancy in CNN models, we compare the log histogram of cosine similarities between weights in fc6 and fc7 in AlexNet [16]. Fig. 2 shows that parameter redundancy of the model is significantly reduced because of a more consistent feature map after patch reordering.
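The redundancy check described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the weight vectors being compared are the rows of a matrix, and the function name `cosine_similarity_log_hist`, the bin count, and the random stand-in weights are all hypothetical.

```python
import numpy as np

def cosine_similarity_log_hist(weights, bins=50):
    """Log-count histogram of pairwise cosine similarities between rows of `weights`."""
    # Normalize each row (assumed: one row = one weight vector) to unit length.
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    unit = weights / np.maximum(norms, 1e-12)
    sims = unit @ unit.T
    # Keep each unordered pair once and drop the self-similarity diagonal.
    upper = np.triu_indices(weights.shape[0], k=1)
    counts, edges = np.histogram(sims[upper], bins=bins, range=(-1.0, 1.0))
    # log(1 + count) keeps empty bins finite.
    return np.log1p(counts), edges

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 256))  # stand-in for a fully connected weight matrix
log_counts, edges = cosine_similarity_log_hist(w)
```

Under these assumptions, a histogram with more mass far from zero would indicate more strongly correlated weight vectors, which is the kind of redundancy the argument above predicts.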
If one object is located at different positions in two images, the same visual features of the object will be located at different positions in the corresponding convolutional feature maps. The feature maps generated by deep convolutional layers are analogous to the feature maps in traditional methods [2, 3]. In those methods, image patches or SIFT vectors are densely extracted and then encoded. These encoded features compose the feature maps and are pooled into a histogram of bins. Reordering the pooled histogram achieves translation and rotation invariance. Likewise, since the deep convolutional feature maps are the encoded representations of images, reordering can be applied in a similar way.
Since convolutional kernels function as feature detectors, each activation in the output feature maps corresponds to a match of a specific visual pattern. Therefore, when the feature detectors slide over the whole input feature maps, the locations with matched patterns generate very high responses, and locations without a match generate low responses. Consequently, the “energy” distribution (L1 norm or L2 norm) of the local patches in the output feature maps presents some heterogeneity. Furthermore, patches with different energies correspond to different parts of the input object. Naturally, if we rank the patches by their energies in descending or ascending order, the output order will be quite consistent regardless of how rotation or translation changes the location of visual patterns in the input. Finally, a rotation- and translation-invariant representation is generated.
The details of the patch reordering module are illustrated in Fig. 3. The feature maps are divided into non-overlapped patches at level-. Here, is a predefined parameter (e.g., or ). Then, we rank the patches by energy (L1 or L2 norm) within each patch of level :
The patches are located from the upper left to the lower right in descending order of energy. The offset of each pixel in the patch can be obtained from the gap between the target patch location and the source patch location. Finally, the output feature map can be computed by
During the back-propagation process, we simply pass the error from the output pixel to its corresponding input pixel:
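A single-level version of the forward and backward passes can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' Caffe implementation: the function names, the grid size `n`, the choice of the L2 norm, and the requirement that the height and width be divisible by `n` are all choices made for this example.

```python
import numpy as np

def patch_reorder_forward(x, n):
    """Reorder the n x n non-overlapping patches of x (C, H, W) by descending L2 energy.

    Returns the reordered map and `order`, where output patch k (row-major,
    upper left to lower right) is copied from input patch order[k].
    """
    c, h, w = x.shape
    assert h % n == 0 and w % n == 0, "sketch assumes H and W divisible by n"
    ph, pw = h // n, w // n
    # Split into patches: (C, n, ph, n, pw) -> (n*n, C, ph, pw).
    blocks = x.reshape(c, n, ph, n, pw).transpose(1, 3, 0, 2, 4).reshape(n * n, c, ph, pw)
    energy = np.sqrt((blocks.reshape(n * n, -1) ** 2).sum(axis=1))
    order = np.argsort(-energy, kind="stable")
    # Gather patches in energy order and stitch them back into a (C, H, W) map.
    y = blocks[order].reshape(n, n, c, ph, pw).transpose(2, 0, 3, 1, 4).reshape(c, h, w)
    return y, order

def patch_reorder_backward(grad_y, order, n):
    """Pass each output gradient back to the input pixel it was copied from."""
    c, h, w = grad_y.shape
    ph, pw = h // n, w // n
    g = grad_y.reshape(c, n, ph, n, pw).transpose(1, 3, 0, 2, 4).reshape(n * n, c, ph, pw)
    grad_blocks = np.empty_like(g)
    grad_blocks[order] = g  # scatter: inverse of the forward gather
    return grad_blocks.reshape(n, n, c, ph, pw).transpose(2, 0, 3, 1, 4).reshape(c, h, w)

rng = np.random.default_rng(0)
x = rng.random((2, 4, 4))
y, order = patch_reorder_forward(x, n=2)
# A circular shift by one patch width only permutes the patches,
# so the reordered output is unchanged.
y_shifted, _ = patch_reorder_forward(np.roll(x, 2, axis=2), n=2)
grad_x = patch_reorder_backward(np.ones_like(y), order, n=2)
```

The module has no learnable parameters in this sketch: the forward pass is a gather by the sorted patch order, and the backward pass is the matching scatter.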
In this section, we evaluate our proposed CNN with the patch reordering module on several supervised learning tasks and compare our model with state-of-the-art methods, including traditional CNNs, SI-CNN [13], and ST-CNN [11]. First, we conduct experiments on the distorted versions of the MNIST handwriting dataset as in [11, 13]. The experimental results show that patch reordering is capable of achieving comparable or better classification performance. Second, to test the effectiveness of patch reordering on CNNs for large-scale real-world image recognition tasks, we compare our model with AlexNet [16] on the ImageNet-2012 dataset. The results demonstrate that patch reordering improves the learning capacity of the model and encodes translation and rotation invariance into the architecture even when trained on raw images only. Finally, to evaluate the generalization ability of the proposed model on other vision tasks with real-world transformations of images, we apply our model to the image retrieval task on the UK-Bench dataset [21]. The improvement in retrieval performance reveals that the proposed model generalizes well and is better at handling real-world transformation variations.
We implement our method using the open-source Caffe framework [12]. For patch energy, we tested both the L1 and L2 norms and found little difference between them. Our code and model will be available online. For SI-CNN and ST-CNN, we directly report their results from the original papers on MNIST. For ImageNet-2012, since these two methods did not report results on this dataset, we re-implemented them based on code forked from GitHub.
In this section, we use the MNIST handwriting dataset to evaluate all deep models. In particular, different neural networks are trained to classify MNIST data that have been transformed via rotation (R) and translation (T). The rotated dataset was generated by rotating digits with a random angle sampled from a uniform distribution. The translated dataset was generated by randomly locating the digit in a canvas.
All networks use the ReLU activation function and softmax classifiers. All CNN networks have a convolutional layer (stride , no padding), a max-pooling layer with stride , a subsequent convolutional layer (stride , no padding), and another max-pooling layer with stride before the final classification layer. All CNN networks have filters per layer. For SI-CNN, convolutional layers are replaced by rotation-invariant layers using six angles from to . For ST-CNN, the spatial transformer module is placed at the beginning of the network. In our patch reordering CNN, the patch reorder module is applied to the second convolutional layer. The feature maps are divided into blocks at level . Here, we set . All networks are trained with SGD for iterations, with a batch size, base learning rate, and no weight decay or dropout. The learning rate was reduced by a factor of every iterations. Weights were initialized randomly, and all networks shared the same random seed.
The experimental results are summarized in Table 1. They show that our model achieves better performance under translation and comparable performance under rotation. Because our model does not need any extra learnable parameters, feature extraction modules, or transformations on training images, the comparable performance still reflects the advantage of the patch reordering CNN. For ST-CNN, the best results reported in [11] are obtained by training with a narrower, manually selected class of transformations (affine transformations). In our method, we did not optimize with respect to transformation classes. Therefore, the comparison is unfair to our PR-CNN. We should instead compare with the most general ST-CNN, defined for a class of projection transformations: 0.8(R) and 0.8(T).
The ImageNet-2012 dataset consists of images from 1000 classes and is split into three subsets: training (1.3M), validation (50K), and testing (100K images with held-out class labels). The classification performance is evaluated using the top-1 and top-5 accuracy. The former is the standard multi-class classification accuracy. The latter is the main evaluation criterion used in ILSVRC and is defined as the proportion of images whose ground-truth category is among the top-5 predicted categories. We use this dataset to test the performance of our model on a large-scale image recognition task.
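The two metrics can be computed as in the short sketch below, taking top-k accuracy to mean the fraction of images whose ground-truth label is among the k highest-scoring classes (k = 1 and k = 5 for the metrics above). The function name `top_k_accuracy` and the toy scores are hypothetical; the toy example uses k = 2 only because it has three classes.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    # Indices of the k largest scores per row (order inside the top k is irrelevant).
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

scores = np.array([[0.1, 0.5, 0.4],
                   [0.7, 0.2, 0.1],
                   [0.2, 0.3, 0.5]])
labels = np.array([2, 0, 0])
top1 = top_k_accuracy(scores, labels, 1)  # only the second sample is correct
top2 = top_k_accuracy(scores, labels, 2)  # first and second samples are correct
```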
CNN models are trained on raw images and tested on both raw and transformed images. For all transform types, specific transformations are applied to the original images. Then, the transformed images are rescaled to have a smallest image side of pixels. Finally, the center crop is used for test. The rotated (R) dataset is generated by randomly rotating original images from to with a uniform distribution. The translated dataset (T) is generated by randomly shifting an image by a proportion of .
All the models follow the architecture of AlexNet. For SI-CNN, the first, second and fifth convolutional layers are replaced by rotation-invariant layers using six angles from to . For ST-CNN [11], the size of the spatial transformer network is about half the size of AlexNet. For our PR-CNN, the feature maps are divided into blocks at level , and the patch reorder module is applied to the fifth convolutional layer.
To train SI-CNN and PR-CNN, we use a base learning rate of and decay it by every iterations. Both networks are trained for iterations. We use a momentum of , a weight decay of , and a weight clip of . The convolutional kernel weights and bias are initialized by and , respectively. The weights and bias of fully connected layers are initialized by and . The bias learning rate is set to be the learning rate for the weights. For ST-CNN, since it does not converge under the aforementioned setting, we fine-tune the network with the classification network initialized by the pre-trained AlexNet. The spatial transformer module consists of convolutional layers, pooling layers, and fully connected layers. The first convolutional layer filters the input with kernels of size with a stride of pixels and is followed by a pooling layer with stride . The second convolutional layer has kernels of size with a stride of pixels, followed by a max pooling layer with stride . The output of the pooling layer is fed into two fully connected layers with neurons. Finally, the third fully connected layer maps the output into affine parameters. Then the 6-dimensional output is fed into the spatial transformer layer to get the transformed input image. During fine-tuning, the learning rate of the spatial transformer is set to be that of the classification network. We use a base learning rate of and decay it by every iterations; the training process converges after approximately iterations.
The results are presented in Table 2. They show that data augmentation, feature map augmentation, transform pre-processing and patch reordering are all effective ways to improve the rotation or translation invariance of CNNs. Our PR-CNN not only achieves a more consistent representation when faced with location changes in the input but also relieves the model of the need to encode invariance in its parameters. It improves the classification accuracy of the model even for the original test images.
We also evaluate our PR-CNN model on the popular image retrieval benchmark dataset UK-Bench [21]. This dataset includes 2550 groups of images, each containing 4 relevant samples concerning a certain object or scene from different viewpoints. Each of the 10200 images in total is used as one query to perform image retrieval, with the aim of finding each image’s counterparts. We choose UK-Bench since viewpoint variation is very common in the dataset. Although many of the variation types are beyond the three types of geometric transformations that we attempt to address, we demonstrate the effectiveness of PR-CNN in handling many severe rotation, translation and scale variance cases in the image retrieval task.
We directly apply the models trained on ImageNet-2012 for evaluation. The outputs of the fc6 and fc7 layers are used as the features for each image. Then, we take the square root of each dimension and perform L2 normalization. To perform image retrieval on UK-Bench, the Euclidean distances between the query image and all database images are computed and sorted. Images with the smallest distances are returned as the top-ranked images. NS-Score (average top-four accuracy) is used to evaluate the performance, and a score of 4 indicates that all the relevant images are successfully retrieved in the top-four results.
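The retrieval pipeline described above can be sketched as follows. This is a hedged illustration, not the authors' evaluation script: the function names, the `group_ids` array, and the toy two-dimensional features are hypothetical, activations are assumed to be non-negative, and the query itself is assumed to stay in the database and count toward its own top-four list.

```python
import numpy as np

def root_l2_normalize(features):
    """Element-wise square root followed by L2 normalization of each row."""
    # Assumes non-negative activations (e.g. post-ReLU fully connected outputs).
    rooted = np.sqrt(np.maximum(features, 0.0))
    norms = np.linalg.norm(rooted, axis=1, keepdims=True)
    return rooted / np.maximum(norms, 1e-12)

def ns_score(features, group_ids, top=4):
    """Average number of same-group images among each query's `top` nearest images."""
    f = root_l2_normalize(features)
    # Squared Euclidean distances between all pairs (same ranking as Euclidean).
    sq = (f ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (f @ f.T)
    ranked = np.argsort(dist, axis=1, kind="stable")[:, :top]
    hits = group_ids[ranked] == group_ids[:, None]
    return hits.sum(axis=1).mean()

# Toy database: two well-separated groups of four images each.
features = np.array([[4.0, 0.0], [4.1, 0.1], [3.9, 0.2], [4.2, 0.05],
                     [0.0, 4.0], [0.1, 4.1], [0.2, 3.9], [0.05, 4.2]])
group_ids = np.repeat([0, 1], 4)
score = ns_score(features, group_ids)
```

With perfectly separated groups, every query retrieves its whole group in the top four, which gives the maximum score.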
As shown in Table 3, data augmentation, feature map augmentation and the spatial transformer network do not show a considerable capacity for transform invariance when applied to an unrelated new task. These models may need to be fine-tuned when transferred to a new dataset, and the spatial transformer block is content and task dependent. Patch reordering transfers better because it encodes invariance only into the architecture, independently of the content of the input. This demonstrates that our PR-CNN model can be seamlessly transferred to other applications based on image recognition (e.g., image retrieval) without any re-training or fine-tuning. Meanwhile, for the other models, fc7 presents better invariance than fc6, whereas for our PR-CNN, fc6 is better. Fig. 2 offers a clue: fc6 presents less parameter redundancy than fc7 in PR-CNN.
We evaluate the transform invariance achieved by our model using the invariance measure proposed in . In this approach, a neuron is considered to be firing when its response is above a certain threshold . Each is chosen to satisfy the condition that is greater than , where is the number of inputs. Then, the local firing rate is computed as the proportion of transformed inputs to which a neuron fires. To ensure that a neuron is selective and with a high local firing rate (invariance to the set of the transformed inputs), the invariance score of a neuron is computed based on the ratio of its invariance to selectivity, i.e., . We report the average of the top highest scoring neurons (), as in . Please refer to  for more details.
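A firing-rate based invariance score of this kind can be sketched as below. This is a simplified illustration under stated assumptions and may differ from the cited measure in its details: the threshold is taken as a per-neuron quantile so that a neuron fires on roughly a fraction `global_rate` of reference inputs, the local firing rate pools all transformed inputs, and both `global_rate` and `top_fraction` are hypothetical placeholder parameters rather than values taken from the text.

```python
import numpy as np

def invariance_scores(ref_resp, trans_resp, global_rate=0.01):
    """Per-neuron ratio of local firing rate to global firing rate.

    ref_resp:   (N, U) responses of U neurons to N reference inputs.
    trans_resp: (M, U) responses of the same neurons to M transformed inputs.
    """
    # Per-neuron threshold: the (1 - global_rate) quantile of reference responses.
    thresholds = np.quantile(ref_resp, 1.0 - global_rate, axis=0)
    global_firing = (ref_resp > thresholds).mean(axis=0)   # selectivity term
    local_firing = (trans_resp > thresholds).mean(axis=0)  # invariance term
    return local_firing / np.maximum(global_firing, 1e-12)

def mean_top_scores(scores, top_fraction):
    """Average score of the highest-scoring fraction of neurons."""
    k = max(1, int(round(top_fraction * scores.size)))
    return np.sort(scores)[-k:].mean()

# Two toy neurons with identical reference responses 0..999.
ref = np.stack([np.arange(1000.0), np.arange(1000.0)], axis=1)
# Neuron 0 always fires on transformed inputs; neuron 1 never does.
trans = np.stack([np.full(50, 2000.0), np.zeros(50)], axis=1)
scores = invariance_scores(ref, trans)
summary = mean_top_scores(scores, top_fraction=0.5)
```

In this toy case the first neuron fires on 1% of reference inputs and on every transformed input, so its score is about 100, while the second neuron scores 0.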
Here, we build the transformed dataset by applying rotation ( with a step size of ) and translation ( with a step size of ) on the validation images of ImageNet-2012. Fig. 4 shows the invariance score of CNN and PR-CNN measured at the end of each layer. We can see that by applying patch reordering to feature maps during training, the invariance of the subsequent layers is significantly improved.
Effect of Patch Reordering on Image Representations
To investigate the effect of applying patch reordering on the representations of transformed images, we show the output feature maps of Conv5 in AlexNet in Fig. 6. We can see that, with patch reordering, the feature map is much more consistent than that of the original CNN when faced with global rotations and translations.
Effect of Patch Reordering on Different Layers
To investigate the effect of applying patch reordering to different convolutional layers and the effect of pyramid levels, we train different PR-CNN models with patch reordering applied to convolutional layers with level or . When , we divide the feature maps into blocks. For , the feature maps are first divided into blocks, and each block is further divided into sub-blocks. The experimental results are presented in Fig. 5. We can see that the performance drops significantly when we perform patch reordering in low layers. Meanwhile, for higher convolutional layers, multi-level reordering does not differ significantly from single-level reordering. Low-level features, such as edges and corners, are detected in low layers, and they must be combined within a local spatial range to support further recognition. Because patch reordering breaks this local spatial correlation and treats each block as an independent feature, the generated representation becomes less meaningful. This explanation also accounts for the observation that the multi-level division of feature maps significantly improves model performance in lower layers: a hierarchical reordering preserves more local spatial relationships than a single-level one.
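The difference between single-level and multi-level reordering can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: the function names, the grid size `n`, the use of the L2 norm, and the requirement that each side be divisible by `n` at every level are choices made for this example.

```python
import numpy as np

def reorder_level(x, n):
    """Return the n x n non-overlapping blocks of x (C, H, W), sorted by descending L2 energy."""
    c, h, w = x.shape
    ph, pw = h // n, w // n
    blocks = x.reshape(c, n, ph, n, pw).transpose(1, 3, 0, 2, 4).reshape(n * n, c, ph, pw)
    energy = np.sqrt((blocks.reshape(n * n, -1) ** 2).sum(axis=1))
    return blocks[np.argsort(-energy, kind="stable")]

def assemble(blocks, n):
    """Lay blocks out row-major (upper left first) into a single (C, H, W) map."""
    _, c, ph, pw = blocks.shape
    return blocks.reshape(n, n, c, ph, pw).transpose(2, 0, 3, 1, 4).reshape(c, n * ph, n * pw)

def hierarchical_reorder(x, n, levels):
    """Reorder blocks at the current level, then reorder sub-blocks inside each block."""
    blocks = reorder_level(x, n)
    if levels > 1:
        # Sub-blocks are only reordered within their parent block.
        blocks = np.stack([hierarchical_reorder(b, n, levels - 1) for b in blocks])
    return assemble(blocks, n)

rng = np.random.default_rng(0)
x = rng.random((1, 8, 8))
single = hierarchical_reorder(x, n=2, levels=1)
double = hierarchical_reorder(x, n=2, levels=2)
# Swapping the left and right halves permutes top-level blocks only,
# so the two-level output is unchanged.
double_shifted = hierarchical_reorder(np.roll(x, 4, axis=2), n=2, levels=2)
```

Because sub-blocks never leave their parent block, the multi-level variant keeps coarse spatial neighbourhoods together while still sorting within them.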
In this paper, we introduce a very simple and effective way to improve the rotation and translation invariance of CNN models. By reordering the feature maps of CNN layers, the model is relieved from encoding location invariance into its parameters. Meanwhile, CNN models are able to generate more consistent representations when faced with location changes of local patterns in input. Our architecture does not need any extra parameters or pre-processing on input images. Experiments show that our model outperforms CNN models in both image recognition and image retrieval tasks.
Acknowledgments This work is supported by NSFC under the contracts No.61572451 and No.61390514, the 973 project under the contract No.2015CB351803, the Youth Innovation Promotion Association CAS CX2100060016, Fok Ying Tung Education Foundation WF2100060004, the Fundamental Research Funds for the Central Universities WK2100060011, Australian Research Council Projects: FT-130101457, DP-140102164, and LE140100061.
- [1] (2013) Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell. 35, pp. 1872–1886.
- [2] (2011) The devil is in the details: an evaluation of recent feature encoding methods. In Proceedings of the British Machine Vision Conference, pp. 76.1–76.12.
- [3] (2011) The importance of encoding versus training with sparse coding and vector quantization. In ICML, pp. 921–928.
- [4] (2015) Transformation properties of learned visual representations. ICLR.
- [5] (2015) Long-term recurrent convolutional networks for visual recognition and description. CVPR.
- [6] (2015) From captions to visual concepts and back. CVPR.
- [7] (2014) Deep symmetry networks. NIPS.
- [8] (2015) Contextual action recognition with R*CNN. ICCV.
- [9] (2009) Measuring invariances in deep networks. NIPS.
- [10] (2016) Deep residual learning for image recognition. CVPR.
- [11] (2015) Spatial transformer networks. NIPS.
- [12] (2014) Caffe: convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.
- [13] (2014) Locally scale-invariant convolutional neural networks. NIPS.
- [14] (2015) Deep visual-semantic alignments for generating image description. CVPR.
- [15] Transformation equivariant Boltzmann machines. Artificial Neural Networks and Machine Learning, pp. 1–9.
- [16] (2012) ImageNet classification with deep convolutional neural networks. NIPS, pp. 1097–1105.
- [17] (2010) Tiled convolutional neural networks. NIPS.
- [18] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE, pp. 2278–2324.
- [19] (2015) Understanding image representations by measuring their equivariance and equivalence. CVPR.
- [20] (2015) Fully convolutional networks for semantic segmentation. CVPR.
- [21] (2006) Scalable recognition with a vocabulary tree. In CVPR, pp. 2161–2168.
- [22] (2016) Joint modeling embedding and translation to bridge video and language. CVPR.
- [23] (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
- [24] (2014) Two-stream convolutional networks for action recognition in videos. NIPS.
- [25] (2012) Learning invariant representations with local transformations. In ICML.
- [26] (2015) Going deeper with convolutions. CVPR.
- [27] (2015) Describing videos by exploiting temporal structure. ICCV.