Deep convolutional neural networks (CNNs) are trending toward wider xie2016aggregated ; DBLP:conf/bmvc/ZagoruykoK16 and deeper simonyan2015very ; szegedy2015going ; he2016deep ; he2016identity architectures. However, millions of parameters make CNNs prone to overfitting when training data is not sufficient. In practice, plenty of regularization approaches have been adopted to improve the generalization ability of CNNs, such as weight decay krogh1991simple , dropout hinton2012improving ; srivastava2014dropout , batch normalization ioffe2015batch , etc. This paper provides an alternative regularization option during CNN training with application in image classification.
Overfitting is a long-standing issue in the machine learning community. The nature of overfitting is that the model adapts to the noise rather than capturing the underlying key factors of variation in the data zhang2016understanding . For image classification, when training data is insufficient, the learned model may be misled by irrelevant local information, which can be regarded as noise. Moreover, in most classification tasks, it is the overall structure rather than the detailed local pixels that has a large influence on the performance of a CNN model. The model should be able to identify the input image correctly if pixels within local structures change in a way that does not destroy the overall view of the image. From another perspective, since a human will probably not be confused about the image content under a moderate extent of local blur, a data model such as a CNN is expected to behave similarly.
In this paper, we propose a new regularization approach, PatchShuffle, which is a beneficial supplement to existing regularization techniques krogh1991simple ; krizhevsky2012imagenet ; srivastava2014dropout ; ioffe2015batch . In the training stage, an image or feature map within a mini-batch is randomly chosen to undergo one of two actions: 1) be kept unchanged, or 2) be transformed in such a way that pixels within each patch are shuffled. On the one hand, when PatchShuffle is applied on the images, the shuffled images have nearly the same global structures as the original ones but possess rich local variations, which are expected to benefit the training of CNNs. On the other hand, when PatchShuffle is applied on the feature maps of the convolutional layers, it can be viewed as implementing model ensemble. In fact, locally shuffling the pixels within a patch is equivalent to shuffling the convolutional kernels given unshuffled patches. Thus at each iteration, the model is trained from different kernel instantiations. PatchShuffle can also be considered to enable weight sharing within each patch. By shuffling, the pixel instantiation at a specific position of an image can be viewed as being sampled from its neighboring pixels within a patch with equal probability. Therefore, across different iterations, patch pixels with different original locations share the same weight.
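The equivalence between shuffling pixels and shuffling kernel weights can be checked numerically for a single kernel window. The following is a minimal sketch, assuming the window coincides exactly with one 3x3 patch; the array names and sizes are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

patch = rng.normal(size=(3, 3))    # one image patch covered by the kernel window
kernel = rng.normal(size=(3, 3))   # the convolutional kernel weights

perm = rng.permutation(patch.size)  # a random shuffle of the 9 positions
inverse = np.argsort(perm)          # the permutation that undoes `perm`

# Response of the original kernel on the shuffled patch.
shuffled_patch = patch.ravel()[perm]
response_a = shuffled_patch @ kernel.ravel()

# Response of the inversely shuffled kernel on the original patch.
shuffled_kernel = kernel.ravel()[inverse]
response_b = patch.ravel() @ shuffled_kernel

print(np.isclose(response_a, response_b))
```

Both responses are the same sum of products with the terms in a different order, which is why a different shuffle at each iteration can be read as training a different kernel instantiation.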
One might argue that PatchShuffle is a type of data augmentation technique since new images are generated. However, because it is applied to only a very small percentage of the images/feature maps in a mini-batch, we speculate that PatchShuffle is more of a regularization method than data augmentation. (In lecun2015deep , data augmentation is considered as belonging to regularization. We differentiate the two concepts in this paper: data augmentation enlarges the training set to a large extent, while regularization makes more elaborate data changes without noticeably enlarging the data volume.) We also differentiate PatchShuffle from dropout. The latter samples activations from the hidden units, while PatchShuffle samples from all the possible permutations of pixels within patches and no hidden units are discarded.
In summary, the PatchShuffle regularization has the following merits.
-  An efficient method that costs negligible extra time and memory. It can be easily adopted in a variety of CNN models without changing the learning strategy.
-  A complementary technique to existing regularization approaches. On four representative classification datasets, PatchShuffle further improves the classification accuracy when combined with multiple regularization techniques.
-  Improved robustness of CNNs to data that is noisy or loses partial information. For example, when adding salt-and-pepper noise to the MNIST dataset, our approach outperforms the baseline by more than 20 percent.
2 Related Work
We briefly review several aspects that are closely related to this paper, i.e., data augmentation, regularization, and transformation equivariant and invariant networks.
Data augmentation. The direct strategy against overfitting is to train CNNs on more data. Data augmentation addresses this problem by creating new data from existing data to augment the training set. Data augmentation is widely adopted in the training of deep neural networks Goodfellow-et-al-2016 ; krizhevsky2012imagenet ; he2016deep ; gan2015learning ; lin2013network . An effective way to perform data augmentation is to do various transformations, such as flipping, translation, cropping, etc. From the perspective of generating more images for training, PatchShuffle shares some properties with data augmentation methods.
Regularization. Regularization is an effective way to reduce the impact of overfitting. Various types of regularization methods have been proposed hinton2012improving ; ioffe2015batch ; krogh1991simple ; DBLP:conf/nips/SinghHF16 ; srivastava2014dropout ; xie2016disturblabel . PatchShuffle relates to two kinds of regularization. 1) Model ensemble. It adopts model averaging, in which several separately trained models vote on the output given a test sample. The voting procedure is robust to prediction errors made by individual classifiers. Many methods implicitly implement model ensemble, such as dropout hinton2012improving ; srivastava2014dropout , stochastic depth DBLP:conf/eccv/HuangSLSW16 and swapout DBLP:conf/nips/SinghHF16 . Dropout averages architectures with different layer widths by randomly discarding a group of hidden units. Stochastic depth averages architectures with various depths by randomly skipping layers. Swapout samples from an abundant set of architectures, with dropout and stochastic depth as its special cases. 2) Weight sharing. It forces a set of weights to be equal nowlan1992simplifying and has been used in the architecture of deep convolutional neural networks lecun1998gradient . Networks regularized by weight sharing typically have transformation invariant properties. For example, through a weight sharing framework, Ravanbakhsh et al. ravanbakhsh2016deep propose the permutation equivariant layer that gains robustness to permutations of the input. PatchShuffle is more of a regularization method because the generated images/feature maps share the global structures with the original ones and PatchShuffle is applied on only a very small number of images/feature maps.
Transformation equivariant and invariant networks. PatchShuffle regularization is also related to the family of transformation equivariant and invariant networks. Deep symmetry networks gens2014deep generalize the vanilla CNN architecture to model arbitrary symmetry groups. In dieleman2016exploiting , a series of rotation equivariant operations are proposed, such as cyclic slicing, pooling and rolling. Ravanbakhsh et al. ravanbakhsh2016deep propose a kind of permutation equivariant layer that is robust to permutations of the inputs through designed weight sharing. All of the aforementioned neural networks mainly focus on transformations of whole images and aim to make the networks robust to several specific parametric transformation types. They are problem-driven and not easily generalized to other datasets. Moreover, few of them investigate and exploit various kinds of transformations for regularizing deep neural networks in a general sense.
In a recent work, Shen et al. DBLP:conf/aaai/ShenTST17 propose using patch reordering to achieve rotation and translation invariance. They divide the feature maps into non-overlapping local patches and reorder the patches according to a norm of the activations of the patches. Their work is similar to PatchShuffle in that they also break the original arrangement of an image or feature maps during training, but critical differences should be clarified. 1) Shen et al. DBLP:conf/aaai/ShenTST17 reorder the patches, while we shuffle the pixels within each local patch, which does not destroy the global structure. 2) Shen et al. DBLP:conf/aaai/ShenTST17 perform ranking according to a specific heuristic rule, while PatchShuffle performs the shuffle operation randomly. Randomness has proved useful for regularizing the training of CNNs by explicitly performing model averaging krizhevsky2012imagenet ; hinton2012improving ; srivastava2014dropout ; DBLP:conf/eccv/HuangSLSW16 ; DBLP:conf/nips/SinghHF16 . 3) Shen et al. DBLP:conf/aaai/ShenTST17 concentrate on the rotation and translation invariance of models, whereas we adopt PatchShuffle as a regularizer.
3 PatchShuffle Regularization
3.1 PatchShuffle Transformation
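Formulations. Let us consider a matrix $X$ with the size of $N \times N$ elements. A random switch $r$ controls whether $X$ needs to be transformed (PatchShuffled). Supposing the random variable $r$ follows a Bernoulli distribution, i.e., $r = 1$ with probability $\epsilon$ and $r = 0$ with probability $1 - \epsilon$, the resulting matrix $\tilde{X}$ can be represented as

$$\tilde{X} = (1 - r)\, X + r\, T(X), \tag{1}$$

where $T(\cdot)$ denotes the PatchShuffle transformation. When $X$ is partitioned into a block matrix with non-overlapping patches of $n \times n$ elements, i.e.,

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1,N/n} \\ x_{21} & x_{22} & \cdots & x_{2,N/n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N/n,1} & x_{N/n,2} & \cdots & x_{N/n,N/n} \end{bmatrix}, \tag{2}$$

the PatchShuffle transformation acts on each patch and can be formulated as follows,

$$T(x_{ij}) = p_{ij}\, x_{ij}\, p'_{ij}, \tag{3}$$

where $x_{ij}$ denotes a patch located at the $i$-th row and $j$-th column of the block matrix $X$. $p_{ij}$ and $p'_{ij}$ are permutation matrices. Pre-multiplying the patch $x_{ij}$ with $p_{ij}$ permutes the rows of $x_{ij}$, whereas post-multiplying the patch $x_{ij}$ with $p'_{ij}$ results in the permutation of the columns of $x_{ij}$.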
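The effect of pre- and post-multiplying a patch by permutation matrices can be illustrated on a small example. This is a sketch with a hand-picked 3x3 patch and two arbitrary permutations; none of the concrete values come from the paper.

```python
import numpy as np

x = np.arange(9).reshape(3, 3)               # a 3x3 patch with distinct entries

# Permutation matrices built by reordering the rows or columns of the identity.
p_row = np.eye(3, dtype=int)[[2, 0, 1]]      # used for pre-multiplication
p_col = np.eye(3, dtype=int)[:, [1, 2, 0]]   # used for post-multiplication

rows_permuted = p_row @ x    # rows of x taken in the order 2, 0, 1
cols_permuted = x @ p_col    # columns of x taken in the order 1, 2, 0
both = p_row @ x @ p_col     # rows and columns permuted together

print(rows_permuted)
print(cols_permuted)
```

Pre-multiplication only moves whole rows and post-multiplication only moves whole columns, so every entry of the patch is kept and only its position changes.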
In practice, we first split $X$ into non-overlapping patches with sizes of $n \times n$ elements. Within each patch, the elements are randomly shuffled, as shown in Fig. 1. So each patch will undergo one of the $(n^2)!$ different permutations. For matrix $X$, the number of possible permutations is $((n^2)!)^{(N/n)^2}$. For each patch, after a finite number of shuffles, it will recover the original order. Note that although we describe the PatchShuffle transformation assuming $X$ and its patches are square, in practice, $X$ and its patches don't need to be square. Our method can be trivially extended to the case of non-square matrices.
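The split-then-shuffle step described above can be sketched in NumPy as follows. This is one possible implementation, not the authors' Caffe code: the function name `patch_shuffle` is hypothetical, the array is assumed to be 2-D with sides divisible by the patch size, and `Generator.permuted` requires NumPy 1.20 or later.

```python
import numpy as np

def patch_shuffle(x, patch_h, patch_w, rng):
    """Shuffle the elements inside each non-overlapping patch of a 2-D array.

    Assumes the height and width of `x` are divisible by the patch size.
    """
    h, w = x.shape
    if h % patch_h or w % patch_w:
        raise ValueError("array size must be divisible by the patch size")
    rows, cols = h // patch_h, w // patch_w
    # Rearrange to (rows, cols, patch_h * patch_w): one flat vector per patch.
    patches = (x.reshape(rows, patch_h, cols, patch_w)
                .transpose(0, 2, 1, 3)
                .reshape(rows, cols, patch_h * patch_w))
    # Draw an independent random permutation for every patch.
    shuffled = rng.permuted(patches, axis=-1)
    # Undo the rearrangement to recover the original layout.
    return (shuffled.reshape(rows, cols, patch_h, patch_w)
                    .transpose(0, 2, 1, 3)
                    .reshape(h, w))

rng = np.random.default_rng(0)
image = np.arange(16).reshape(4, 4)
result = patch_shuffle(image, 2, 2, rng)
print(result)
```

Separate height and width arguments are used so that the same sketch covers the non-square case mentioned in the text.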
PatchShuffle on images. For CNNs, the input and output of a convolutional layer are feature maps (or images) which can be viewed as matrices. Thus the PatchShuffle transformation is readily applicable. Following Eq. (1), Eq. (2) and Eq. (3), PatchShuffle can be easily applied on images. Image samples are shown in Fig. 2. Intuitively, we observe that for an image, a moderate PatchShuffle transformation does not prevent recognition of the corresponding object, which will benefit the training of a deep neural network.
PatchShuffle on feature maps. We also perform the PatchShuffle transformation on the feature maps of all the convolutional layers. Here, we treat each feature map as an image, and PatchShuffle is performed on the feature maps independently. That is, each feature map is randomly chosen to undergo the PatchShuffle transformation, regardless of the original image or other feature maps. For the feature maps of lower or middle layers, the spatial structures of the image are preserved to a large extent, so we expect that applying PatchShuffle to these layers can regularize training. For the higher convolutional layers, recall that PatchShuffle enables weight sharing among neighboring pixels; this property is beneficial for the higher level feature maps where neighboring pixels have largely overlapping receptive fields projected onto the original image. We will verify this in the experiment part.
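Treating every feature map as its own image means drawing one random switch per map and shuffling only the selected maps. The sketch below assumes a batch laid out as (N, C, H, W), square patches, and spatial sizes divisible by the patch size; the function name `patch_shuffle_maps` and the argument names are hypothetical.

```python
import numpy as np

def patch_shuffle_maps(maps, patch, prob, rng):
    """Apply PatchShuffle independently to each map in an (N, C, H, W) batch."""
    n, c, h, w = maps.shape
    rows, cols = h // patch, w // patch   # assumes H and W are divisible by `patch`
    # One Bernoulli switch per feature map, independent of all other maps.
    switch = rng.random((n, c)) < prob
    # Flatten every patch of every map into its own vector.
    patches = (maps.reshape(n, c, rows, patch, cols, patch)
                   .transpose(0, 1, 2, 4, 3, 5)
                   .reshape(n, c, rows, cols, patch * patch))
    shuffled = rng.permuted(patches, axis=-1)
    shuffled = (shuffled.reshape(n, c, rows, cols, patch, patch)
                        .transpose(0, 1, 2, 4, 3, 5)
                        .reshape(n, c, h, w))
    # Keep the original map wherever its switch is off.
    return np.where(switch[:, :, None, None], shuffled, maps), switch

rng = np.random.default_rng(0)
maps = rng.normal(size=(2, 3, 4, 4))
out, switch = patch_shuffle_maps(maps, patch=2, prob=0.3, rng=rng)
print(switch)
```

For simplicity the sketch shuffles every map and then discards the result where the switch is off; an implementation concerned with speed would shuffle only the selected maps.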
PatchShuffle also faces the typical bias-variance dilemma. On the side of reducing the gap between training and test performance, PatchShuffle creates new images and feature maps, which increases the variety of the training data. However, on the side of bias, the data distributions of the new images and feature maps are probably different from those of real-world data, which may induce more bias into the CNN model. Therefore, in the application of PatchShuffle, only a small percentage of the images/feature maps undergo the PatchShuffle transformation (i.e., the shuffle probability is small) to achieve a bias-variance trade-off.
3.2 CNN Training and Inference
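Objective function. We take PatchShuffling images as an example. In this case, the training objective function can be formalized as

$$\tilde{\mathcal{L}}(X, y, \theta) = (1 - r)\, \mathcal{L}(X, y, \theta) + r\, \mathcal{L}(T(X), y, \theta), \tag{4}$$

where $\tilde{\mathcal{L}}$ and $\mathcal{L}$ denote the objective functions for training with and without PatchShuffle, respectively. $X$ and $T(X)$ represent the original and PatchShuffled images, respectively. The label of the training sample is denoted as $y$, and $\theta$ encodes the weights of the neural network.

In Eq. (4), when the random switch $r$ is set to 1, we have $\tilde{\mathcal{L}}(X, y, \theta) = \mathcal{L}(T(X), y, \theta)$, which implies that the network is chosen to be trained with PatchShuffle. When $r = 0$, we have $\tilde{\mathcal{L}}(X, y, \theta) = \mathcal{L}(X, y, \theta)$, denoting that the network is trained without PatchShuffle. Taking the expectation over $r$, which follows a Bernoulli distribution, Eq. (4) becomes

$$\mathbb{E}_r\big[\tilde{\mathcal{L}}(X, y, \theta)\big] = (1 - \epsilon)\, \mathcal{L}(X, y, \theta) + \epsilon\, \mathcal{L}(T(X), y, \theta), \tag{5}$$

where $\epsilon$ denotes the shuffle probability, and $\mathcal{L}(T(X), y, \theta)$ works as a regularizer.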
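The expectation step can be made concrete by enumerating the two values of the random switch. The sketch below assumes the switched form described in the text (the original loss when the switch is 0, the PatchShuffled loss when it is 1); the two loss values and the probability 0.05 are made-up numbers chosen only for illustration.

```python
def expected_objective(loss_original, loss_shuffled, epsilon):
    """Expectation over r in {0, 1} of (1 - r) * loss_original + r * loss_shuffled."""
    total = 0.0
    for r, prob in ((0, 1.0 - epsilon), (1, epsilon)):
        total += prob * ((1 - r) * loss_original + r * loss_shuffled)
    return total

# Hypothetical loss values for one training sample.
value = expected_objective(loss_original=0.40, loss_shuffled=0.90, epsilon=0.05)
print(value)
```

With a small shuffle probability the expected objective stays close to the original loss, and the PatchShuffled term enters as a lightly weighted extra penalty.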
Training procedure. During the training process, PatchShuffle is applied to the training images and the feature maps of all the convolutional layers, each with an independently sampled random switch and independent shuffling operations. Note that the sizes of the non-overlapping patches and the shuffle probability do not need to be the same across layers. We use the same procedure when applying PatchShuffle to different layers. Without loss of generality, in Algorithm 1, we summarize the training procedure using PatchShuffle applied on the feature maps of one convolutional layer.
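Because the shuffle sits inside the forward pass during training, back-propagation has to pass through it. The text does not spell out this step, so the following is only a sketch of one plausible way to handle it: a shuffle is a permutation, and the gradient can be routed back through the inverse permutation. The class name `ShuffleOp` is hypothetical, and a flat array stands in for the elements of one patch.

```python
import numpy as np

class ShuffleOp:
    """A fixed permutation of a flat array with forward and backward passes."""

    def __init__(self, size, rng):
        self.perm = rng.permutation(size)

    def forward(self, x):
        # y[i] = x[perm[i]]
        return x[self.perm]

    def backward(self, grad_y):
        # Each input element receives the gradient of the output slot it moved to.
        grad_x = np.empty_like(grad_y)
        grad_x[self.perm] = grad_y
        return grad_x

rng = np.random.default_rng(0)
x = rng.normal(size=8)
w = rng.normal(size=8)     # stands in for the gradient arriving from later layers
op = ShuffleOp(x.size, rng)
y = op.forward(x)
grad_x = op.backward(w)
print(np.allclose(grad_x @ x, w @ y))
```

A new `ShuffleOp` would be drawn at every iteration for each selected image or feature map, and at test time the operation is simply skipped.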
Inference. At the test stage, the network performs a forward process without any transformation applied to the images or feature maps.
4 Experiments
In this section, we report results on four image classification datasets: CIFAR-10, SVHN, STL-10 and MNIST. CIFAR-10 krizhevsky2009learning contains 50,000+10,000 (training+test) 32×32 color images of 10 object classes. SVHN netzer2011reading consists of 73,257+26,032 (training+test) color images of street view house numbers. STL-10 coates2011analysis contains 5,000+8,000 (training+test) color images of 10 categories. MNIST lecun1998gradient consists of 60,000+10,000 (training+test) 28×28 greyscale images of hand-written digits. We first show that our algorithm is robust to changes of the hyper-parameters within a wide range. Then we demonstrate the improved generalization ability achieved by PatchShuffle on the benchmarks. Finally, we illustrate that CNNs trained with PatchShuffle are more robust to corruptions such as salt-and-pepper noise and occlusions. All experiments are implemented using Caffe jia2014caffe .
4.1 Experiment Settings
In all of our experiments, we compare the CNN models trained with and without PatchShuffle. The training method without PatchShuffle is denoted as standard back-propagation (BP). For the same deep architecture, all the models are trained from the same weight initializations. Note that some popular regularization techniques (i.e., weight decay, batch normalization and dropout) and various data augmentations (i.e., flipping, padding and cropping) are employed in the experiments. The hyper-parameters for training with PatchShuffle and standard BP are all the same, except that the patch size and shuffle probability are chosen through validation for PatchShuffle. Our experiments build on various CNN architectures, which are summarized as follows:
CNNs for CIFAR-10. Three CNN models are adopted in the experiments on CIFAR-10: Network in Network lin2013network (NIN), pre-activation ResNet-110 he2016identity (ResNet-110-PreAct), and a modification of the original ResNet-110 he2016deep (ResNet-110-Modified). The architecture of ResNet-110-Modified is the same as the original ResNet-110 designed for CIFAR he2016deep except that it discards the ReLU unit after each summation of the shortcut and residual function. This small modification improves the performance of the original ResNet to be comparable to that of pre-activation ResNet he2016identity . The training procedure of ResNet-110-PreAct and ResNet-110-Modified is the same as in he2016deep ; he2016identity , and that of NIN is the same as in lin2013network .
CNNs for SVHN. The CNNs adopted for SVHN are Plain-SVHN and ResNet-110-Modified. The architecture of Plain-SVHN is the same as the one provided in the Caffe examples (https://github.com/BVLC/caffe/blob/master/examples/cifar10/cifar10_full_train_test.prototxt), which was originally designed by krizhevskycuda for training on CIFAR-10. Because the architecture is general and the image resolutions are the same between SVHN and CIFAR-10, it can be employed for SVHN.
CNNs for STL-10. The CNN architecture adopted for STL-10 is denoted by ResNet-STL-10. It is similar to ResNet-110-Modified, but due to the higher resolution of the images, several modifications are made. 1) The kernel size for the first convolutional layer becomes 7 with a stride of 2. 2) Four residual stages are employed, each containing two residual units. The spatial sizes of the feature maps are successively halved after each residual stage. The channels of the feature maps for the four residual stages are 32, 64, 128 and 256, respectively. 3) Dropout is applied on the final fully-connected layer with a dropout ratio of 0.5.
CNNs for MNIST. The CNN architecture adopted on MNIST is the one provided in the Caffe examples (https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_train_test.prototxt).
4.2 The Impact of Hyper-parameters
When applying PatchShuffle to CNN training, we have two hyper-parameters to evaluate, i.e., the patch size and the shuffle probability. To demonstrate the impact of these two hyper-parameters on the performance of the model, we conduct experiments on CIFAR-10 based on ResNet-110-Modified under different hyper-parameter settings, with PatchShuffle applied on the images. Note that all the models are trained with the simple data augmentation as in he2016deep ; he2016identity : 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. The results are compared with standard BP and shown in percentage in Table 1 and Fig. 3. We arrive at two findings.
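The pad-and-crop augmentation mentioned above can be sketched as follows. Zero padding and a flip probability of 0.5 are assumptions here, since the text only states that 4 pixels are padded and that the crop comes from the padded image or its horizontal flip; the function name `pad_crop_flip` is hypothetical.

```python
import numpy as np

def pad_crop_flip(image, rng, pad=4):
    """Zero-pad an (H, W, C) image, take a random H x W crop, and flip it half the time."""
    h, w, _ = image.shape
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))   # assumption: pad with zeros
    top = rng.integers(0, 2 * pad + 1)     # crop offsets range over 0 .. 2 * pad inclusive
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:                 # assumption: flip with probability 0.5
        crop = crop[:, ::-1]               # horizontal flip
    return crop

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
augmented = pad_crop_flip(image, rng)
print(augmented.shape)
```

In this setting PatchShuffle would be applied after the augmentation, on the 32×32 crop that is fed to the network.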
First, PatchShuffle consistently outperforms standard BP under a wide range of hyper-parameters. On CIFAR-10, our best result reduces the classification error by 0.67% compared with standard BP.
Second, PatchShuffle is robust to parameter changes to some extent. When the shuffle probability and patch size increase, the recognition error first decreases, reaches a minimum, and then increases. In fact, within a certain range, increasing both parameters improves the variety of the training samples without introducing too much bias. But under larger values, the benefit brought by diversity is gradually overtaken by the classifier bias, so the error rate increases. In the following experiments, we use , patch size = when not specified.
4.3 Classification Performance
CIFAR-10. For the training of NIN and ResNet-110-Modified, we apply PatchShuffle on the images only, while for the training of ResNet-110-PreAct, PatchShuffle is applied on the images and on the feature maps between two successive convolutional layers in each residual unit. The test errors are shown in percentage in Table 2. It can be seen that the models trained with PatchShuffle are consistently superior to those trained with standard BP across the three CNN architectures.
We further evaluate the impact of the size of the training set on recognition accuracy on CIFAR-10. We use the ResNet-110-Modified model. Results are shown in Table 2. In all of the experiments, we use the same hyper-parameter setting (i.e., with patch size , and shuffle probability 0.05). Although this hyper-parameter setting may not be optimal for small training sets, Table 2 still indicates that PatchShuffle improves the recognition accuracy, especially when the training set is small (see the results when the training set size is 9,000).
SVHN. PatchShuffle is applied on the images. The results are shown in Table 3, indicating that PatchShuffle consistently outperforms standard BP.
STL-10. Here we illustrate the impact of applying PatchShuffle on the feature maps of CNNs. The classification results on STL-10 are summarized in Table 4. The five-bit binary code denotes on which stages PatchShuffle is applied. The first bit denotes the input layer, and the other four bits correspond to the four residual stages. Applying PatchShuffle on a stage of ResNet means applying it on the feature maps between two adjacent convolutional layers of each residual unit in this stage. We set the shuffle probability to 0.30 in this experiment.
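The five-bit code is a compact per-stage configuration and can be decoded mechanically. In the sketch below, the stage names and the example code "10110" are hypothetical and chosen only to show the bit order described above.

```python
STAGES = ("input", "stage1", "stage2", "stage3", "stage4")  # hypothetical stage names

def stages_from_code(code):
    """Return the stages switched on by a five-character binary string."""
    if len(code) != len(STAGES) or set(code) - {"0", "1"}:
        raise ValueError("expected five characters, each '0' or '1'")
    return [name for name, bit in zip(STAGES, code) if bit == "1"]

selected = stages_from_code("10110")
print(selected)
```

Under this reading, a code of all zeros corresponds to standard BP and a code of all ones applies PatchShuffle to the input and to every residual stage.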
Table 4 reveals that the generalization ability of the CNN models trained with PatchShuffle is significantly higher than that of the models trained with standard BP. More significant improvement over the baseline can be observed when using PatchShuffle on more convolutional layers. In addition, increasing the sizes of the patches also brings notable improvement in terms of generalization performance. Note that all the models are trained with dropout applied on the output layer, which suggests that PatchShuffle can reduce overfitting beyond dropout.
4.4 Robustness to the Noise
Finally, we show the robustness of PatchShuffle against noise and occlusions. In the experiments, we add different levels of salt-and-pepper noise and occlusions to the MNIST dataset. Salt-and-pepper noise is added to the image by changing the pixel to white or black with probability . For the occlusion, each pixel is randomly chosen to be covered by a black block of a certain size centered on it with probability . The size of the block adopted in our experiment is . Results are presented in Table 5.
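The two corruptions can be sketched as follows. The sketch assumes 2-D greyscale images with values in [0, 1], an equal chance of white and black for a corrupted pixel, and an odd block size; the probabilities and the block size used in the example are hypothetical and are not the values used in the experiments.

```python
import numpy as np

def salt_and_pepper(image, prob, rng):
    """Set each pixel to white (1) or black (0) with probability `prob`."""
    noisy = image.copy()
    hit = rng.random(image.shape) < prob
    salt = rng.random(image.shape) < 0.5    # assumption: white and black equally likely
    noisy[hit & salt] = 1.0
    noisy[hit & ~salt] = 0.0
    return noisy

def occlude(image, prob, block, rng):
    """Centre a black block on each pixel chosen with probability `prob` (odd `block`)."""
    occluded = image.copy()
    half = block // 2
    centres = np.argwhere(rng.random(image.shape) < prob)
    for r, c in centres:
        # Slices are clipped at the image border.
        occluded[max(r - half, 0):r + half + 1, max(c - half, 0):c + half + 1] = 0.0
    return occluded

rng = np.random.default_rng(0)
digit = np.full((28, 28), 0.5)              # stand-in for an MNIST image
noisy = salt_and_pepper(digit, prob=0.3, rng=rng)
blocked = occlude(digit, prob=0.05, block=3, rng=rng)
print(noisy.mean(), blocked.mean())
```

Both functions return a corrupted copy and leave the input untouched, so the same clean test image can be reused across corruption levels.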
A clear observation is that as the level of corruption increases, the performance of both standard BP and PatchShuffle drops quickly. Nevertheless, at each corruption level, our method yields a consistently lower error rate than standard BP. For salt-and-pepper noise, the performance gap is largest (23.74%) at a noise level of 50%. For occlusion, our method exceeds standard BP by 3.46% at an occlusion extent of 25%. These results indicate that PatchShuffle improves the robustness of CNNs against common image corruptions such as noise and occlusion.
5 Conclusion
This paper introduces PatchShuffle, a new regularization method for generalizable CNN training. This method is efficient to compute, complementary to existing regularizers, and improves the robustness of CNNs to noise and occlusions. We will explore PatchShuffle on more complex tasks in the future, e.g., object detection and language modeling.
References
-  Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223, 2011.
-  Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. 2016.
-  Zhe Gan, Ricardo Henao, David Carlson, and Lawrence Carin. Learning deep sigmoid belief networks with data augmentation. In Artificial Intelligence and Statistics, pages 268–276, 2015.
-  Robert Gens and Pedro M Domingos. Deep symmetry networks. In NIPS, 2014.
-  Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
-  Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. 2012.
-  Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, pages 646–661, 2016.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
-  Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
-  A Krizhevsky. cuda-convnet, 2012. https://code.google.com/p/cuda-convnet/.
-  Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Citeseer, 2009.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
-  Anders Krogh and John A Hertz. A simple weight decay can improve generalization. In NIPS, 1991.
-  Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. volume 521, pages 436–444. Nature Research, 2015.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. volume 86, pages 2278–2324. IEEE, 1998.
-  Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In ICLR, 2014.
-  Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
-  Steven J Nowlan and Geoffrey E Hinton. Simplifying neural networks by soft weight-sharing. volume 4, pages 473–493. MIT Press, 1992.
-  Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Deep learning with sets and point clouds. 2016.
-  Xu Shen, Xinmei Tian, Shaoyan Sun, and Dacheng Tao. Patch reordering: A novel way to achieve rotation and translation invariance in convolutional neural networks. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 2534–2540, 2017.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
-  Saurabh Singh, Derek Hoiem, and David A. Forsyth. Swapout: Learning an ensemble of deep architectures. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 28–36, 2016.
-  Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. volume 15, pages 1929–1958, 2014.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
-  Lingxi Xie, Jingdong Wang, Zhen Wei, Meng Wang, and Qi Tian. Disturblabel: Regularizing cnn on the loss layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4753–4762, 2016.
-  Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. 2016.
-  Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, 2016.
-  Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.