Chainer Implementation of Parallel Grid Pooling for Data Augmentation
Convolutional neural network (CNN) architectures utilize downsampling layers, which restrict the subsequent layers to learn spatially invariant features while reducing computational costs. However, such a downsampling operation makes it impossible to use the full spectrum of input features. Motivated by this observation, we propose a novel layer called parallel grid pooling (PGP) which is applicable to various CNN models. PGP performs downsampling without discarding any intermediate feature. It works as data augmentation and is complementary to commonly used data augmentation techniques. Furthermore, we demonstrate that a dilated convolution can naturally be represented using PGP operations, which suggests that the dilated convolution can also be regarded as a type of data augmentation technique. Experimental results based on popular image classification benchmarks demonstrate the effectiveness of the proposed method. Code is available at: https://github.com/akitotakeki
Deep learning using convolutional neural networks (CNNs) has achieved excellent performance for a wide range of tasks, such as image recognition [1, 2, 3], object detection [4, 5, 6], and semantic segmentation [7, 8]. Most CNN architectures utilize spatial operation layers for downsampling. Spatial downsampling restricts subsequent layers in a CNN to learn spatially invariant features and reduces computational costs. Modern architectures often use multi-stride layers, such as a convolution with a stride of 2 [1, 3, 9, 10, 11, 12] or average pooling with a stride of 2. Other downsampling methods have also been proposed [13, 14, 15, 16].
One drawback of these downsampling operations is that they make it impossible to use the full spectrum of input features, and most previous works overlook this drawback. To clarify the issue, we provide an example of such a downsampling operation in Fig. 1. Let $\mathcal{F}_{k,s}$ be a spatial operation (e.g., convolution or pooling) with a kernel size of $k \times k$ and a stride of $s \times s$ ($s \geq 2$). In the first step, $\mathcal{F}_{k,1}$ is performed on the input and yields an intermediate output. In the second step, the intermediate output is split into an $s \times s$ grid pattern and downsampled by selecting the coordinate $(0, 0)$ (top-left corner) from each grid, resulting in an output feature that is $s^2$ times smaller than the input feature. In this paper, the second-step downsampling operation is referred to as grid pooling (GP) for the coordinate $(0, 0)$. From this two-step example, one can see that the typical downsampling operation utilizes only $1/s^2$ of the intermediate output and discards the remaining $1 - 1/s^2$.
Motivated by this observation, we propose a novel layer for CNNs called the parallel grid pooling (PGP) layer (Fig. 2). PGP performs downsampling without discarding any intermediate feature and can be regarded as a data augmentation technique. Specifically, PGP splits the intermediate output into an $s \times s$ grid pattern and performs a grid pooling for each of the $s^2$ coordinates in the grid. This means that PGP transforms an input feature into $s^2$ feature maps, each $s^2$ times smaller than the input. The layers following PGP compute feature maps shifted by several pixels in parallel, which can be interpreted as data augmentation in the feature space.
Furthermore, we demonstrate that dilated convolution [7, 17, 18, 19] can naturally be decomposed into a convolution and PGP. In general, a dilated convolution is considered an efficient operation for spatial (and temporal) information, and it achieves state-of-the-art performance for various tasks, such as semantic segmentation [7, 20, 21] and speech recognition [19, 22]. In this sense, we provide a novel interpretation of the dilated convolution operation as a data augmentation technique in the feature space.
The proposed PGP layer has several remarkable properties. PGP can be implemented easily by inserting it after a multi-stride convolutional and/or pooling layer without altering the original learning strategy. PGP does not have any trainable parameters and can even be utilized for test-time augmentation (i.e., even if a CNN was not trained using PGP layers, the performance can still be improved by inserting PGP layers into the pretrained CNN at testing time).
We evaluate PGP on CIFAR-10, CIFAR-100, SVHN, and ImageNet. Experimental results demonstrate that PGP can improve the performance of recent state-of-the-art network architectures and is complementary to widely used data augmentation techniques (e.g., random flipping, random cropping, and random erasing).
The major contributions of this paper are as follows:
We propose PGP, which performs downsampling without discarding any intermediate feature information and can be regarded as a data augmentation technique. PGP is easily applicable to various CNN models without altering their learning strategies or their number of parameters.
We demonstrate that PGP is implicitly used in a dilated convolution operation, which suggests that dilated convolution can also be regarded as a type of data augmentation technique.
PGP is most closely related to data augmentation. In the context of training a CNN, data augmentation is a standard regularization technique that is used to avoid overfitting and artificially enlarge the size of a training dataset. For example, applying crops, flips, and various affine transforms in a fixed order or at random is a widely used technique in many computer vision tasks. Data augmentation techniques can be divided into two categories: data augmentation in the image space and data augmentation in the feature space.
Data augmentation in the image space: AlexNet, which achieved state-of-the-art performance on the ILSVRC2012 dataset, applies random flipping, random cropping, color perturbation, and noise based on principal component analysis. TANDA learns a generative sequence model that produces data augmentation strategies using adversarial techniques on unlabeled data. Another method trains a class-conditional model of diffeomorphisms by interpolating between nearest-neighbor data. Recently, training techniques that randomly mask rectangular regions of images, such as random erasing and cutout, have been proposed. Using these augmentation techniques, recent methods have achieved state-of-the-art results for image classification [31, 32]. Mix-up and BC-learning generate between-class images by mixing two images belonging to different classes with a random ratio.
Data augmentation in the feature space: Some techniques generate augmented training sets by interpolating or extrapolating features from the same class. DAGAN utilizes conditional generative adversarial networks to conduct feature space data augmentation for one-shot learning. A-Fast-RCNN generates hard examples with occlusions and deformations using an adversarial learning strategy in feature space. Smart augmentation trains a neural network to combine training samples in order to generate augmented data. As another type of feature space data augmentation, AGA and Fatten perform augmentation in the feature space by using properties of training images (e.g., depth and pose).
Our PGP performs downsampling without discarding any intermediate feature by splitting the input feature map into $s^2$ branches, where $s$ is the stride; this is discussed in detail in Sec. 3.1. The operations following PGP (e.g., convolution layers or fully connected layers) compute slightly shifted feature maps, which can be interpreted as a novel data augmentation technique in the feature space. In addition, dilated convolution can be regarded as a data augmentation technique in the feature space, which is discussed in detail in Sec. 3.2.
An overview of PGP is illustrated in Fig. 2. Let $\mathcal{X} \in \mathbb{R}^{b \times c \times h \times w}$ be an input feature map, where $b$, $c$, $h$, and $w$ are the batch size, number of channels, height, and width, respectively. $\mathcal{F}_{k,s}$ is a spatial operation (e.g., a convolution or a pooling) with a kernel size of $k \times k$ and a stride of $s \times s$. The height and width of the output feature map are $h/s$ and $w/s$, respectively.
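$\mathcal{F}_{k,s}$ can be divided into two steps (Fig. 1). In the first step, $\mathcal{F}_{k,1}$ is performed on the input feature map $\mathcal{X}$. This operation makes full use of the input feature map information with a stride of one and produces an intermediate feature map $\mathcal{X}' \in \mathbb{R}^{b \times c \times h \times w}$. Then, $\mathcal{X}'$ is split into blocks of size $s \times s$. Let the coordinate in each grid be $(i, j)$, with $0 \leq i, j \leq s - 1$. In the second step, $\mathcal{X}'$ is downsampled by selecting the coordinate $(0, 0)$ (top-left corner) in each grid. Let the downsampling operation that selects the coordinate $(i, j)$ in each grid be called grid pooling (GP), denoted $\mathrm{GP}_s^{i,j}$. To summarize, given a kernel size of $k$ and stride of $s$, the multi-stride spatial operation layer is represented as
$$\mathcal{F}_{k,s}(\mathcal{X}) = \mathrm{GP}_s^{0,0}\left(\mathcal{F}_{k,1}(\mathcal{X})\right).$$
Our proposed PGP method is defined as:
$$\mathrm{PGP}_s(\mathcal{X}') = \left\{ \mathrm{GP}_s^{i,j}(\mathcal{X}') \mid 0 \leq i, j \leq s - 1 \right\}.$$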
The conventional multi-stride operation uses only the grid pooling at coordinate $(0, 0)$, $\mathrm{GP}_s^{0,0}$, which keeps $1/s^2$ of the intermediate output and discards the remaining $1 - 1/s^2$ of the feature map. In contrast, PGP retains all $s^2$ possible choices of grid pooling, meaning that PGP makes full use of the feature map, including the portion of the intermediate output that the conventional method discards.
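The two-step view above can be checked numerically. The following is a minimal NumPy sketch, not the authors' Chainer implementation: the function names (`avg_pool`, `grid_pooling`, `parallel_grid_pooling`) are hypothetical, average pooling without padding stands in for the spatial operation, and the input size is chosen so that the intermediate map divides evenly into grids.

```python
import numpy as np


def avg_pool(x, k, s):
    """Average pooling with a k x k window and stride s (no padding) on (b, c, h, w)."""
    b, c, h, w = x.shape
    out_h, out_w = (h - k) // s + 1, (w - k) // s + 1
    out = np.empty((b, c, out_h, out_w))
    for p in range(out_h):
        for q in range(out_w):
            window = x[:, :, p * s:p * s + k, q * s:q * s + k]
            out[:, :, p, q] = window.mean(axis=(2, 3))
    return out


def grid_pooling(x, s, i, j):
    """Keep the element at offset (i, j) of every s x s grid cell."""
    return x[:, :, i::s, j::s]


def parallel_grid_pooling(x, s):
    """Return all s*s grid poolings instead of only the (0, 0) one."""
    return [grid_pooling(x, s, i, j) for i in range(s) for j in range(s)]


rng = np.random.default_rng(0)
x = rng.standard_normal((1, 3, 9, 9))

intermediate = avg_pool(x, k=2, s=1)        # stride-1 operation, shape (1, 3, 8, 8)
strided = avg_pool(x, k=2, s=2)             # conventional stride-2 operation
branches = parallel_grid_pooling(intermediate, s=2)

# The strided operation equals the stride-1 operation followed by GP at (0, 0).
print(np.allclose(strided, grid_pooling(intermediate, 2, 0, 0)))
# Together, the branches cover every element of the intermediate output.
print(sum(br.size for br in branches) == intermediate.size)
```

In this sketch the four branches partition the intermediate map, so nothing is discarded, whereas the strided result matches only the first branch.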
A network architecture with a PGP layer can be viewed as one with $s^2$ internal branches that perform grid pooling in parallel. Note that the weights (i.e., parameters) of the following network are shared between all branches, hence no additional parameters are introduced anywhere in the network. PGP performs downsampling while maintaining the spatial structure of the intermediate features, thereby producing output features shifted by several pixels. Because the operations in each layer following PGP share the same weights across branches, the subsequent layers are trained over a set of slightly shifted feature maps simultaneously. This works as data augmentation: with PGP, the layers are trained on mini-batches that are $s^2$ times larger than without PGP.
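One way to picture the weight sharing is to stack the branches along the batch axis, so that an unchanged layer processes all of them at once. The sketch below assumes this batch-stacking layout and uses a 1x1 convolution as a stand-in for "the following layers"; `pgp_to_batch` and `pointwise_conv` are illustrative names, not part of any library.

```python
import numpy as np


def pgp_to_batch(x, s):
    """Stack the s*s grid poolings of a (b, c, h, w) array along the batch axis."""
    branches = [x[:, :, i::s, j::s] for i in range(s) for j in range(s)]
    return np.concatenate(branches, axis=0)


def pointwise_conv(x, weight):
    """1x1 convolution; weight has shape (c_out, c_in)."""
    return np.einsum("oc,bchw->bohw", weight, x)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 32, 32))   # mini-batch of 8 feature maps
weight = rng.standard_normal((4, 16))      # one set of weights, shared by all branches

stacked = pgp_to_batch(x, s=2)             # 4 branches -> effective batch of 32
y = pointwise_conv(stacked, weight)

print(stacked.shape, y.shape)
```

No parameter is added: the same `weight` is applied to every branch, and the layer simply sees a batch that is four times larger for a stride of 2.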
We here discuss the relationship between PGP and dilated convolution [7, 20]. Dilated convolution is a convolutional operation applied to an input with a defined gap, and it is considered to be an operation that efficiently utilizes spatial and temporal information [7, 18, 19, 20, 42]. For brevity, let us consider one-dimensional signals. Dilated convolution is defined as:
$$y_i = \sum_{l=0}^{k-1} x_{i + r l}\, w_l,$$
where $x_i$ is an input feature, $y_i$ is an output feature, $w_l$ is a convolution filter of kernel size $k$, and $r$ is a dilation rate. A dilated convolution is equivalent to a regular convolution when $r = 1$. Let $i = rj + q$, where $0 \leq q < r$. Then, the output is
$$y_{rj + q} = \sum_{l=0}^{k-1} x_{r(j + l) + q}\, w_l.$$
This suggests that the output $[y_q, y_{r+q}, y_{2r+q}, \ldots]$, where $0 \leq q < r$, can be obtained from the input $[x_q, x_{r+q}, x_{2r+q}, \ldots]$. To summarize, an $r$-dilated convolution is equivalent to an operation where the input is split into $r$ branches and convolutions are performed with the same kernel $w_l$, then spatially rearranged into a single output.
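The one-dimensional equivalence can be verified directly. The following sketch assumes a zero-based filter index and no padding (so only valid output positions are computed); both function names are illustrative.

```python
import numpy as np


def dilated_conv1d(x, w, r):
    """Direct r-dilated convolution: y[i] = sum_l x[i + r*l] * w[l] (valid positions only)."""
    k = len(w)
    n_out = len(x) - r * (k - 1)
    return np.array([sum(x[i + r * l] * w[l] for l in range(k)) for i in range(n_out)])


def dilated_conv1d_via_branches(x, w, r):
    """Split x into r strided branches, run a regular convolution on each, interleave."""
    k = len(w)
    y = np.empty(len(x) - r * (k - 1))
    for q in range(r):
        branch = x[q::r]                       # x[q], x[r + q], x[2r + q], ...
        y[q::r] = dilated_conv1d(branch, w, 1)  # same kernel w for every branch
    return y


rng = np.random.default_rng(0)
x = rng.standard_normal(12)
w = rng.standard_normal(3)

for r in (1, 2, 3):
    print(r, np.allclose(dilated_conv1d(x, w, r), dilated_conv1d_via_branches(x, w, r)))
```

Each branch is processed by an ordinary (undilated) convolution with the shared kernel, and writing the results back with the same stride plays the role of the spatial rearrangement.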
A dilated convolution in a two-dimensional space is illustrated in Fig. 3. The $r$-dilated convolution downsamples an input feature with an interval of $r$ pixels, then produces $r^2$ branches as outputs. Therefore, the dilated convolution is equivalent to the following three-step operation: $\mathrm{PGP}_r$, a convolution sharing the same weight, and the inverse of PGP ($\mathrm{PGP}_r^{-1}$), which rearranges the cells to their original positions. For example, let the size of the input features be $b \times c \times h \times w$, where $b$, $c$, $h$, and $w$ are the batch size, number of channels, height, and width, respectively. Using $\mathrm{PGP}_r$, the input feature is divided into $r^2 b \times c \times \frac{h}{r} \times \frac{w}{r}$, and subsequently a convolutional filter (with shared weight parameters) is applied to each branch in parallel (Fig. 3). Then, the $r^2$ intermediate feature maps are rearranged by $\mathrm{PGP}_r^{-1}$, resulting in an output of the same size as the input, $b \times c \times h \times w$. In short, PGP is embedded in a dilated convolution, which suggests that the success of dilated convolution can be attributed to data augmentation in the feature space.
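The three-step decomposition can also be checked in two dimensions. The sketch below is an illustration under stated assumptions: a 3x3 kernel, zero padding equal to the dilation rate (so the spatial size is preserved), height and width divisible by the dilation rate, and branches stacked along the batch axis. The functions `pgp`, `pgp_inverse`, and `conv2d_same` are hypothetical helpers written for this example.

```python
import numpy as np


def pgp(x, r):
    """(b, c, h, w) -> (r*r*b, c, h/r, w/r); branch (i, j) equals x[:, :, i::r, j::r]."""
    b, c, h, w = x.shape
    assert h % r == 0 and w % r == 0
    x = x.reshape(b, c, h // r, r, w // r, r)   # axes 3 and 5 hold the offsets (i, j)
    x = x.transpose(3, 5, 0, 1, 2, 4)           # (i, j, b, c, h/r, w/r)
    return x.reshape(r * r * b, c, h // r, w // r)


def pgp_inverse(y, r):
    """Undo pgp: put every cell back at its original spatial position."""
    rrb, c, hr, wr = y.shape
    b = rrb // (r * r)
    y = y.reshape(r, r, b, c, hr, wr)
    y = y.transpose(2, 3, 4, 0, 5, 1)           # (b, c, h/r, i, w/r, j)
    return y.reshape(b, c, hr * r, wr * r)


def conv2d_same(x, weight, dilation=1):
    """3x3 convolution, zero padding = dilation; weight has shape (c_out, c_in, 3, 3)."""
    d = dilation
    b, c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (0, 0), (d, d), (d, d)))
    out = np.zeros((b, weight.shape[0], h, w))
    for u in range(3):
        for v in range(3):
            patch = xp[:, :, u * d:u * d + h, v * d:v * d + w]
            out += np.einsum("oc,bchw->bohw", weight[:, :, u, v], patch)
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 8, 8))
weight = rng.standard_normal((5, 3, 3, 3))
r = 2

direct = conv2d_same(x, weight, dilation=r)
decomposed = pgp_inverse(conv2d_same(pgp(x, r), weight, dilation=1), r)

print(pgp(x, r).shape)                  # (8, 3, 4, 4), i.e. r^2 * b branches
print(np.allclose(direct, decomposed))
```

The agreement relies on the padding choice: a zero border of width 1 around each branch corresponds to a zero border of width $r$ around the full-resolution input.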
Fig. 4: (a) Base CNN (Base-CNN). (b) CNN with dilated convolution (DConv-CNN). (c) DConv-CNN represented by PGP. (d) CNN with PGP (PGP-CNN).
The structure of the network model with PGP is better for learning than that with dilated convolutions. For example, consider the cases where two downsampling layers in a base CNN (Base-CNN) are replaced by either dilated convolution (Fig. 4(b)) or PGP (Fig. 4(d)). Following the dilated network strategy, the conventional multi-stride downsampling layer and all subsequent convolution layers can be replaced with dilated convolution layers with twice the dilation rate (Fig. 4(b)). As mentioned in Section 3.2, a dilated convolution is equivalent to ($\mathrm{PGP}_r$ + Convolution + $\mathrm{PGP}_r^{-1}$). When $n$ $r$-dilated convolutions are stacked, each $\mathrm{PGP}_r^{-1}$ cancels the $\mathrm{PGP}_r$ that follows it, so the stack is expressed by ($\mathrm{PGP}_r$ + Convolution $\times\, n$ + $\mathrm{PGP}_r^{-1}$). This means that each branch in a CNN with dilated convolution (DConv-CNN) is independent from the others.
Let us consider the case of Fig. 4(b), in which dilated convolutions with two different dilation rates (e.g., $r = 2$ and $r = 4$) are stacked. Using the fact that a sequence of 4-dilated convolutions is ($\mathrm{PGP}_2$ applied twice + Convolution + $\mathrm{PGP}_2^{-1}$ applied twice), this architecture can readily be transformed into the one given in Fig. 4(c), which is the equivalent form with PGP. We can clearly see that each branch split by the former dilation layer is split again by the latter, and all the branches are eventually rearranged by $\mathrm{PGP}^{-1}$ layers prior to the global average pooling (GAP) layer. In contrast, a CNN with PGP (PGP-CNN) is implemented as follows. The stride of each convolution or pooling layer that decreases resolution is set to 1, after which PGP is inserted (Fig. 4(d)). In DConv-CNN, all the branches are aggregated through $\mathrm{PGP}^{-1}$ layers just before the GAP layer, which can be regarded as a feature ensemble, while PGP-CNN does not use any $\mathrm{PGP}^{-1}$ layer throughout the network, so the intermediate features are not fused until the end of the network.
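The claim that a stride-4 split is the same as two successive stride-2 splits can be illustrated on a single 2D map. In this sketch, `branches` is a hypothetical helper, and the offset bookkeeping ($i = i_1 + 2 i_2$, $j = j_1 + 2 j_2$) is our own derivation from the slicing rule rather than notation from the paper.

```python
import numpy as np


def branches(x, s):
    """All s*s grid poolings of an (h, w) array, keyed by in-grid offset (i, j)."""
    return {(i, j): x[i::s, j::s] for i in range(s) for j in range(s)}


x = np.arange(8 * 8).reshape(8, 8)

# Split with stride 2, then split every branch with stride 2 again.
twice = {}
for (i1, j1), first in branches(x, 2).items():
    for (i2, j2), second in branches(first, 2).items():
        # Offset (i2, j2) inside branch (i1, j1) is offset (i1 + 2*i2, j1 + 2*j2) in x.
        twice[(i1 + 2 * i2, j1 + 2 * j2)] = second

# Split once with stride 4.
once = branches(x, 4)

print(len(twice), len(once))
print(all(np.array_equal(twice[key], once[key]) for key in once))
```

Both routes produce the same 16 sub-grids, only enumerated in a different order, which is why each branch created by the first dilation stage is split again by the second.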
Unlike a feature ensemble (i.e., the case of DConv-CNN), the PGP-CNN structure forces every branch to learn to classify correctly. As mentioned above, DConv-CNN averages an ensemble of intermediate features, and the likelihood of each class is then predicted from the average. Even if only some branches acquire useful features (e.g., for image classification) while the others do not, the CNN attempts to determine the correct class based on the average, which prevents the less useful branches from learning further. In contrast, thanks to the independence of its branches, PGP-CNN can perform additional learning with superior regularization.
Compared to DConv-CNN, PGP-CNN has a significant advantage in terms of weight transfer. Specifically, the weights learned on PGP-CNN work well on the Base-CNN structure when transferred. We demonstrate later in the experiments that this can improve the performance of the base architecture. However, DConv-CNN does not have the same structure as Base-CNN because of $\mathrm{PGP}^{-1}$. The replacement of downsampling by dilated convolutions makes it difficult to transfer weights from the pre-trained base model. The base CNN with multi-stride convolutions (the top of Fig. 5 (a)) is represented by single-stride convolutions and grid pooling (GP) layers (the bottom of Fig. 5 (a)). DConv-CNN (the top of Fig. 5 (b)) is equivalent to the base CNN with $\mathrm{PGP}$ and $\mathrm{PGP}^{-1}$ (the bottom of Fig. 5 (b)). Focusing on the input features of a convolution, their size in Base-CNN differs from their size in DConv-CNN. This difference in resolution makes it impossible to directly transfer weights from the pre-trained model. Unlike DConv-CNN, PGP-CNN maintains the same structure as Base-CNN (Fig. 5 (c)), which results in better transferability and better performance.
This weight transferability of PGP has two advantages. First, applying PGP to a Base-CNN with learned weights in the test phase can improve recognition performance without retraining. For example, if one does not have sufficient computing resources to train PGP-CNN, using PGP only in the test phase can still make full use of the features discarded by downsampling in the training phase to achieve higher accuracy. Second, the weights learned by PGP-CNN can work well in Base-CNN. As an example, on an embedded system specialized for a particular CNN, one cannot change the calculations inside the CNN. However, the weights learned on PGP-CNN can result in superior performance, even when used in Base-CNN at test time.
We evaluated PGP on benchmark datasets: CIFAR-10, CIFAR-100, and SVHN. Both CIFAR-10 and CIFAR-100 consist of $32 \times 32$ color images, with 50,000 images for training and 10,000 images for testing. The CIFAR-10 dataset has 10 classes, each containing 6,000 images. There are 5,000 training images and 1,000 testing images for each class. The CIFAR-100 dataset has 100 classes, each containing 600 images. There are 500 training images and 100 testing images for each class. We adopted a standard data augmentation scheme (mirroring/shifting) that is widely used for these two datasets [1, 2, 11]. For preprocessing, we normalized the data using the channel means and standard deviations. We used all 50,000 training images for training and calculated the final testing error after training.
We trained a pre-activation ResNet (PreResNet), All-CNN, WideResNet, ResNeXt, PyramidNet, and DenseNet from scratch. We used a 164-layer network for PreResNet. We used WRN-28-10, ResNeXt-29 (8x64d), PyramidNet-164 (Bottleneck, alpha=48), and DenseNet-BC-100 (k=12) in the same manner as the original papers.
When comparing the base CNN (Base), the CNN with dilated convolutions (DConv), and the CNN with PGP (PGP), we utilized the same hyperparameter values (e.g., learning rate strategy) to carefully monitor the performance improvements. We adopted the cosine-shaped learning rate schedule, which smoothly anneals the learning rate from its initial value down to its final value [46, 47]. For CIFAR and SVHN, we trained the networks using mini-batches of size 64 for 300 epochs and 40 epochs, respectively. All networks were optimized using stochastic gradient descent (SGD) with a momentum of 0.9 and weight decay. We adopted a previously proposed weight initialization method. We used the Chainer framework and ChainerCV for the implementation of all experiments.
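For readers unfamiliar with the schedule, a common form of cosine annealing without restarts is sketched below. The function name and the values `lr_max=0.1` and `lr_min=0.0` are placeholders chosen for illustration; they are assumptions and not the settings used in these experiments.

```python
import math


def cosine_lr(step, total_steps, lr_max, lr_min):
    """Cosine annealing from lr_max at step 0 down to lr_min at step total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Placeholder values for illustration only (not taken from the experiments).
total_epochs = 300
schedule = [cosine_lr(epoch, total_epochs, lr_max=0.1, lr_min=0.0)
            for epoch in range(total_epochs + 1)]

print(schedule[0], schedule[total_epochs // 2], schedule[-1])
```

The rate decreases slowly at the start and end of training and fastest in the middle, reaching the midpoint between the two values halfway through.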
The results of applying DConv and PGP on CIFAR-10, CIFAR-100, and SVHN with different networks are listed in Table 1. In all cases, PGP outperformed Base and DConv.
We compare our method to random flipping (RF), random cropping (RC), and random erasing (RE) in Table 2. PGP achieves better performance than any other data augmentation method when applied alone. When applied in combination with other data augmentation methods, PGP works in a complementary manner. Combining RF, RC, RE, and PGP yields an error rate of 19.16%, which is an 11.74% improvement over the baseline without any augmentation methods.
| Network | RF | RC | RE | w/o PGP | w/ PGP |
We compare our method to the classical ensemble method in Table 3. On CIFAR-10 and SVHN, a single CNN with the PGP model achieves better performance than an ensemble of three basic CNN models. Furthermore, PGP works in a complementary manner when applied to the ensemble.
The results of applying PGP as a test-time data augmentation method on CIFAR-10, CIFAR-100, and SVHN with different networks are listed in Table 4. PGP achieves better performance than the base model, except on the All-CNN and ResNeXt models. The reason for this incompatibility is left for future work.
The results of applying PGP as a training-time data augmentation technique on the CIFAR-10, CIFAR-100, and SVHN datasets with different networks are listed in Table 5. The models trained with PGP, which have the same structure as the base model in the test phase, outperformed the base model and the model with dilated convolutions. The models trained with dilated convolutions performed worse than the baseline models.
We further evaluated PGP using the ImageNet 2012 classification dataset. The ImageNet dataset comprises 1.28 million training images and 50,000 validation images from 1,000 classes. With the data augmentation scheme for training presented in [9, 11, 51], input images were randomly cropped from a resized image using scale and aspect ratio augmentation. We reimplemented a derivative of ResNet for training. According to the dilated network strategy [7, 20], we used dilated convolutions or PGP after the conv4 stage. All networks were optimized using SGD with a momentum of 0.9 and weight decay. The models were trained over 90 epochs using mini-batches of size 256 on eight GPUs. We adopted a previously proposed weight initialization method and a cosine-shaped learning rate schedule, which gradually reduces the learning rate from its initial value to its final value.
The results when applying dilated convolution and PGP on ImageNet with ResNet-50 and ResNet-101 are listed in Table 6. As can be seen in the results, the network with PGP outperformed the baseline. PGP works in a complementary manner when applied in combination with 10-crop testing and can be used as a training-time or testing-time data augmentation technique.
| Network | #Params | Train | Test | 1 crop | 10 crops |
We further evaluated PGP for multi-label classification on the NUS-WIDE and MS-COCO datasets. NUS-WIDE contains 269,648 images with a total of 5,018 tags collected from Flickr. These images are manually annotated with 81 concepts, including objects and scenes. Following previous work, we used 161,789 images for training and 107,859 images for testing. MS-COCO contains 82,783 images for training and 40,504 images for testing, which are labeled with 80 common objects. The input images were resized, randomly cropped, and then resized again.
We trained a derivative of ResNet, pre-trained on the ImageNet dataset. We used the same pre-trained weights throughout all experiments for fair comparison. All networks were optimized using SGD with a momentum of 0.9 and weight decay. The models were trained for 16 and 30 epochs using mini-batches of 96 on four GPUs for NUS-WIDE and MS-COCO, respectively. We adopted a previously proposed weight initialization method. For fine-tuning, the learning rate was divided by 10 after the 8th and 12th epochs for NUS-WIDE and after the 15th and 22nd epochs for MS-COCO. We employed mean average precision (mAP), macro/micro F-measure, precision, and recall for performance evaluation. If the confidence for a class was greater than 0.5, the label was predicted as positive.
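The thresholded metrics can be sketched as follows. This is an illustrative NumPy version under stated assumptions: macro scores are the mean of per-class values, micro scores pool the counts over all classes, and a ratio with a zero denominator is set to 0. The exact conventions behind the reported numbers are not given here, and mAP is omitted. The function name and the toy data are made up for the example.

```python
import numpy as np


def multilabel_prf(scores, targets, threshold=0.5):
    """Macro and micro precision/recall/F1 for (n_samples, n_classes) arrays."""
    pred = scores > threshold                  # a label is positive if confidence > 0.5
    true = targets.astype(bool)
    tp = (pred & true).sum(axis=0).astype(float)
    fp = (pred & ~true).sum(axis=0).astype(float)
    fn = (~pred & true).sum(axis=0).astype(float)

    def safe_div(num, den):
        # Element-wise num / den, with 0 where the denominator is 0.
        return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

    # Macro: compute each metric per class, then average over classes.
    p_c, r_c = safe_div(tp, tp + fp), safe_div(tp, tp + fn)
    f_c = safe_div(2 * p_c * r_c, p_c + r_c)
    macro = (p_c.mean(), r_c.mean(), f_c.mean())

    # Micro: pool the counts over classes, then compute each metric once.
    tp_s, fp_s, fn_s = tp.sum(), fp.sum(), fn.sum()
    p = tp_s / (tp_s + fp_s) if tp_s + fp_s > 0 else 0.0
    r = tp_s / (tp_s + fn_s) if tp_s + fn_s > 0 else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return {"macro": macro, "micro": (p, r, f)}


# Toy data: 4 samples, 3 classes (made up for illustration).
scores = np.array([[0.9, 0.2, 0.6],
                   [0.8, 0.7, 0.1],
                   [0.3, 0.6, 0.4],
                   [0.1, 0.4, 0.7]])
targets = np.array([[1, 0, 1],
                    [1, 1, 0],
                    [1, 0, 0],
                    [0, 0, 1]])

result = multilabel_prf(scores, targets)
print(result["macro"], result["micro"])
```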
The results when applying dilated convolution and PGP on ResNet-50 and ResNet-101 are listed in Table 7. The networks with PGP achieved higher mAP than the others. In contrast, the networks with dilated convolutions achieved lower scores than the baseline, which suggests that the ImageNet pre-trained weights were not successfully transferred to ResNet with dilated convolutions because of the difference in resolution mentioned in Sec. 3.4. This problem can be avoided by maintaining the dilation rate in the layers that perform downsampling, as in dilated residual networks (DRN). This allows the CNN with dilated convolutions to perform convolution at the same resolution as the base CNN. The DRN achieved better mAP scores than the base CNN and nearly the same score as the CNN with PGP. The CNN with PGP was, however, superior to DRN in that it offers weight transferability.
In this paper, we proposed PGP, which performs downsampling without discarding any intermediate feature and can be regarded as a data augmentation technique. PGP is easily applicable to various CNN models without altering their learning strategies. Additionally, we demonstrated that PGP is implicitly used in dilated convolution, which suggests that dilated convolution is also a type of data augmentation technique. Experiments on CIFAR-10/100 and SVHN with six network models demonstrated that PGP outperformed the base CNN and the CNN with dilated convolutions. PGP also obtained better results than the base CNN and the CNN with dilated convolutions on the ImageNet dataset for image classification and on the NUS-WIDE and MS-COCO datasets for multi-label image classification.
Improved variational autoencoders for text modeling using dilated convolutions. In: ICML. (2017)
In: NIPS Workshop on Machine Learning Systems. Volume 2011. (2011)