Parallel Grid Pooling for Data Augmentation

03/30/2018 ∙ by Akito Takeki, et al. ∙ The University of Tokyo

Convolutional neural network (CNN) architectures utilize downsampling layers, which restrict the subsequent layers to learn spatially invariant features while reducing computational costs. However, such a downsampling operation makes it impossible to use the full spectrum of input features. Motivated by this observation, we propose a novel layer called parallel grid pooling (PGP) which is applicable to various CNN models. PGP performs downsampling without discarding any intermediate feature. It works as data augmentation and is complementary to commonly used data augmentation techniques. Furthermore, we demonstrate that a dilated convolution can naturally be represented using PGP operations, which suggests that the dilated convolution can also be regarded as a type of data augmentation technique. Experimental results based on popular image classification benchmarks demonstrate the effectiveness of the proposed method. Code is available at: https://github.com/akitotakeki

1 Introduction

Deep learning using convolutional neural networks (CNNs) has achieved excellent performance in a wide range of tasks, such as image recognition [1, 2, 3], object detection [4, 5, 6], and semantic segmentation [7, 8]. Most CNN architectures utilize spatial operation layers for downsampling. Spatial downsampling restricts the subsequent layers of a CNN to learning spatially invariant features and reduces computational costs. Modern architectures often use multi-stride convolutional layers, i.e., a convolution with a stride of 2 [1, 3, 9, 10, 11, 12], or average pooling with a stride of 2 [2]. Other downsampling methods have also been proposed [13, 14, 15, 16].

Figure 1: A two-step view of a spatial operation f (e.g., convolution or pooling) with a kernel size of k and a stride of s. (a) is equivalent to (b).

One drawback of such downsampling is that it makes it impossible to use the full spectrum of input features, and most previous works overlook the importance of this drawback. To clarify this issue, we provide an example of such a downsampling operation in Fig. 1. Let f_{k,s} be a spatial operation (e.g., convolution or pooling) with a kernel size of k and a stride of s. In the first step, f_{k,1} is performed on the input and yields an intermediate output. In the second step, the intermediate output is split into an s×s grid pattern and downsampled by selecting the coordinate (0,0) (top-left corner) from each grid, resulting in an output feature that is s times smaller than the intermediate output in each spatial dimension. In this paper, this second-step downsampling operation is referred to as grid pooling (GP) for the coordinate (0,0). From this two-step view, one can see that the typical downsampling operation utilizes only 1/s² of the intermediate output and discards the remaining (s²−1)/s².

Motivated by this observation, we propose a novel layer for CNNs called the parallel grid pooling (PGP) layer (Fig. 2). PGP performs downsampling without discarding any intermediate feature and can be regarded as a data augmentation technique. Specifically, PGP splits the intermediate output into an s×s grid pattern and performs grid pooling for each of the s² coordinates (i, j), 0 ≤ i, j ≤ s−1, in the grid. This means that PGP transforms an input feature into s² feature maps, each downsampled by a factor of s. The layers following PGP compute feature maps shifted by several pixels in parallel, which can be interpreted as data augmentation in the feature space.
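As a concrete illustration, the following minimal numpy sketch (ours, not the authors' Chainer implementation) expresses GP as a strided slice and PGP as the collection of all s² such slices:

```python
import numpy as np

def grid_pool(x, s, i=0, j=0):
    """GP_(i,j): keep one cell per s-by-s grid of a feature map x of shape (C, H, W)."""
    return x[:, i::s, j::s]

def parallel_grid_pool(x, s):
    """PGP: all s*s grid-pooled views of x, each s times smaller in height and width."""
    return [grid_pool(x, s, i, j) for i in range(s) for j in range(s)]

x = np.arange(16, dtype=np.float32).reshape(1, 4, 4)
print(grid_pool(x, s=2).shape)            # (1, 2, 2): the usual downsampling keeps 1/4 of the cells
print(len(parallel_grid_pool(x, s=2)))    # 4: PGP keeps every cell, spread over s*s shifted maps
```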

Furthermore, we demonstrate that a dilated convolution [7, 17, 18, 19] can naturally be decomposed into a convolution and PGP operations. In general, a dilated convolution is considered an operation that efficiently exploits spatial (and temporal) information, and it achieves state-of-the-art performance in various tasks, such as semantic segmentation [7, 20, 21], speech recognition [18], and natural language processing [19, 22]. In this sense, we provide a novel interpretation of the dilated convolution operation as a data augmentation technique in the feature space.

The proposed PGP layer has several remarkable properties. PGP can be implemented easily by inserting it after a multi-stride convolutional and/or pooling layer without altering the original learning strategy. PGP does not have any trainable parameters and can even be utilized for test-time augmentation (i.e., even if a CNN was not trained using PGP layers, the performance can still be improved by inserting PGP layers into the pretrained CNN at testing time).

We evaluate our method on standard image classification benchmarks: CIFAR-10, CIFAR-100 [23], SVHN [24], and ImageNet [25]. Experimental results demonstrate that PGP can improve the performance of recent state-of-the-art network architectures and is complementary to widely used data augmentation techniques (e.g., random flipping, random cropping, and random erasing).

The major contributions of this paper are as follows:

  • We propose PGP, which performs downsampling without discarding any intermediate feature information and can be regarded as a data augmentation technique. PGP is easily applicable to various CNN models without altering their learning strategies or increasing the number of parameters.

  • We demonstrate that PGP is implicitly used in a dilated convolution operation, which suggests that dilated convolution can also be regarded as a type of data augmentation technique.

2 Related Works

PGP is most closely related to data augmentation. In the context of training a CNN, data augmentation is a standard regularization technique that is used to avoid overfitting and artificially enlarge the size of a training dataset. For example, applying crops, flips, and various affine transforms in a fixed order or at random is a widely used technique in many computer vision tasks. Data augmentation techniques can be divided into two categories: data augmentation in the image space and data augmentation in the feature space.

Data augmentation in the image space: AlexNet [26], which achieved state-of-the-art performance on the ILSVRC2012 dataset, applies random flipping, random cropping, color perturbation, and noise based on principal component analysis. TANDA [27] learns a generative sequence model that produces data augmentation strategies using adversarial techniques on unlabeled data. The method in [28] trains a class-conditional model of diffeomorphisms by interpolating between nearest-neighbor data. Recently, training techniques that randomly mask rectangular image regions, such as random erasing [29] and cutout [30], have been proposed. Using these augmentation techniques, recent methods have achieved state-of-the-art results for image classification [31, 32]. Mix-up [33] and BC-learning [34] generate between-class images by mixing two images belonging to different classes with a random ratio.

Data augmentation in the feature space: Some techniques generate augmented training sets by interpolating [35] or extrapolating [36] features from the same class. DAGAN [37] utilizes conditional generative adversarial networks to conduct feature space data augmentation for one-shot learning. A-Fast-RCNN [38] generates hard examples with occlusions and deformations using an adversarial learning strategy in feature space. Smart augmentation [39] trains a neural network to combine training samples in order to generate augmented data. As another type of feature space data augmentation, AGA [40] and Fatten [41] perform augmentation in the feature space by using properties of training images (e.g., depth and pose).

Our PGP also performs downsampling without discarding any intermediate feature; it splits the input feature map into s² branches, which will be discussed in detail in Sec. 3.1. The operations following PGP (e.g., convolutional or fully connected layers) compute slightly shifted feature maps, which can be interpreted as a novel data augmentation technique in the feature space. In addition, dilated convolution can be regarded as a data augmentation technique in the feature space, which will be discussed in detail in Sec. 3.2.

3 Proposed Method

3.1 Parallel Grid Pooling (PGP)

Figure 2: An overview of Parallel Grid Pooling (PGP)

An overview of PGP is illustrated in Fig. 2. Let x ∈ R^{B×C×H×W} be an input feature map, where B, C, H, and W are the batch size, number of channels, height, and width, respectively. f_{k,s} denotes a spatial operation (e.g., a convolution or a pooling) with a kernel size of k and a stride of s. The height and width of the output feature map are H/s and W/s, respectively.

f_{k,s} can be divided into two steps (Fig. 1). In the first step, f_{k,1} is performed; this operation makes full use of the input feature map information with a stride of one, producing an intermediate feature map of size H×W. Then, the intermediate feature map is split into grids of size s×s. Let the coordinates within each grid be (i, j), 0 ≤ i, j ≤ s−1. In the second step, the intermediate feature map is downsampled by selecting the coordinate (0,0) (top-left corner) in each grid. Let this downsampling operation of selecting the coordinate (i, j) in each grid be called grid pooling, GP_{(i,j)}. To summarize, given a kernel size of k and a stride of s, the multi-stride spatial operation layer is represented as

f_{k,s} = GP_{(0,0)} ∘ f_{k,1}.   (1)

Our proposed PGP method is defined as:

PGP ∘ f_{k,1} = (GP_{(0,0)} ∘ f_{k,1}, GP_{(0,1)} ∘ f_{k,1}, …, GP_{(s−1,s−1)} ∘ f_{k,1}).   (2)

The conventional multi-stride operation uses only GP_{(0,0)}, which consists of just 1/s² of the intermediate output, and discards the remaining (s²−1)/s² of the feature map. In contrast, PGP retains all s² possible grid poolings, meaning that PGP makes full use of the feature map without discarding the portion of the intermediate output lost by the conventional method.
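The decomposition in Eq. (1) can be checked directly. Below is a small numpy sketch (illustrative only, not the authors' implementation) that instantiates the spatial operation f as average pooling:

```python
import numpy as np

def avg_pool(x, k, s):
    """f_{k,s} instantiated as average pooling: k-by-k windows taken with stride s ('valid' positions)."""
    C, H, W = x.shape
    hs, ws = (H - k) // s + 1, (W - k) // s + 1
    return np.array([[[x[c, i * s:i * s + k, j * s:j * s + k].mean() for j in range(ws)]
                      for i in range(hs)] for c in range(C)])

def grid_pool(u, s, i=0, j=0):
    """GP_(i,j): keep one cell per s-by-s grid."""
    return u[:, i::s, j::s]

x = np.random.randn(3, 8, 8)
k, s = 2, 2
lhs = avg_pool(x, k, s)                # conventional strided pooling f_{k,s}
rhs = grid_pool(avg_pool(x, k, 1), s)  # two-step form: stride-1 pooling, then GP_(0,0)
print(np.allclose(lhs, rhs))           # True, matching Eq. (1)
```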

A network architecture with a PGP layer can be viewed as one with s² internal branches obtained by performing grid pooling in parallel. Note that the weights (i.e., parameters) of the following layers are shared between all branches; hence, no additional parameters are introduced anywhere in the network. PGP performs downsampling while maintaining the spatial structure of the intermediate features, thereby producing output features shifted by several pixels. As described above, because the operations in each layer following PGP share the same weights across branches, the subsequent layers in each branch are trained over a set of slightly shifted feature maps simultaneously. This works as data augmentation: with PGP, the layers are trained on an s² times larger number of mini-batch samples compared to the case without PGP.
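In practice, the s² branches can be folded into the batch dimension so that the shared-weight layers that follow see an s² times larger mini-batch. A minimal numpy sketch of this bookkeeping (one possible implementation, not necessarily the authors'):

```python
import numpy as np

def pgp_batchfold(x, s=2):
    """Stack the s*s grid-pooled views of x (N, C, H, W) along the batch axis -> (s*s*N, C, H/s, W/s)."""
    views = [x[:, :, i::s, j::s] for i in range(s) for j in range(s)]
    return np.concatenate(views, axis=0)

x = np.random.randn(64, 16, 8, 8)   # a mini-batch of 64 intermediate feature maps
h = pgp_batchfold(x, s=2)
print(h.shape)                      # (256, 16, 4, 4): the shared-weight layers now see 4x as many samples
```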

3.2 PGP vs. Dilated Convolution.

We here discuss the relationship between PGP and dilated convolution [7, 20]. A dilated convolution is a convolution applied to an input with a defined gap (the dilation) between sampled positions, and it is considered an operation that efficiently utilizes spatial and temporal information [7, 18, 19, 20, 42]. For brevity, let us consider one-dimensional signals. Dilated convolution is defined as

y[t] = Σ_{τ=0}^{k−1} w[τ] · x[t + d·τ],   (3)

where x is the input feature, y is the output feature, w is a convolution filter of kernel size k, and d is the dilation rate. A dilated convolution is equivalent to a regular convolution when d = 1. Let t = d·t′ + r with 0 ≤ r ≤ d−1. Then, the output is

y[d·t′ + r] = Σ_{τ=0}^{k−1} w[τ] · x[d·t′ + r + d·τ]   (4)
            = Σ_{τ=0}^{k−1} w[τ] · x[d·(t′ + τ) + r].   (5)

This suggests that the output y_r[t′] := y[d·t′ + r], where 0 ≤ r ≤ d−1, can be obtained from the input x_r[t′] := x[d·t′ + r] by a regular (non-dilated) convolution with the same kernel w. To summarize, a d-dilated convolution is equivalent to an operation in which the input is split into d branches, a convolution with the same kernel w is performed on each branch, and the results are spatially rearranged into a single output.
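This equivalence is easy to verify numerically. The sketch below uses helper functions written for this illustration (not taken from the paper's code) to compare a 1-D d-dilated convolution with the split-convolve-interleave form:

```python
import numpy as np

def dilated_conv1d(x, w, d):
    """'Valid' 1-D dilated convolution in the form of Eq. (3): y[t] = sum_tau w[tau] * x[t + d*tau]."""
    k = len(w)
    t_max = len(x) - d * (k - 1)
    return np.array([sum(w[tau] * x[t + d * tau] for tau in range(k)) for t in range(t_max)])

def conv1d(x, w):
    return dilated_conv1d(x, w, d=1)   # regular convolution is the d = 1 case

rng = np.random.default_rng(0)
d, k = 2, 3
x = rng.standard_normal(16)
w = rng.standard_normal(k)

y = dilated_conv1d(x, w, d)            # direct d-dilated convolution

# Equivalent form: split x into d branches x_r[t'] = x[d*t' + r], convolve each
# branch with the same kernel w, then interleave the branch outputs.
branches = [conv1d(x[r::d], w) for r in range(d)]
y_alt = np.empty_like(y)
for r in range(d):
    y_alt[r::d] = branches[r][: len(y[r::d])]

print(np.allclose(y, y_alt))           # True
```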

Figure 3: PGP is implicitly used by dilated convolution.

A dilated convolution in two-dimensional space is illustrated in Fig. 3 (for the case d = 2). The d-dilated convolution samples the input feature with an interval of d pixels, which yields d² branches. Therefore, the dilated convolution is equivalent to the following three-step operation: PGP, a convolution sharing the same weights across branches, and the inverse of PGP (PGP⁻¹), which rearranges the cells to their original positions. For example, let the size of the input feature be B×C×H×W, where B, C, H, and W are the batch size, number of channels, height, and width, respectively. Using PGP with d = 2, the input feature is divided into four branches of size B×C×(H/2)×(W/2); subsequently, a convolutional filter (with shared weight parameters) is applied to each branch in parallel (Fig. 3). Then, the intermediate feature maps are rearranged by PGP⁻¹, resulting in an output with the same spatial size as the input. In short, PGP is embedded in a dilated convolution, which suggests that the success of dilated convolution can be attributed to data augmentation in the feature space.
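A compact numpy sketch of the rearrangement steps in this three-step view follows; the pgp helper repeats the batch folding from the earlier sketch so the block is self-contained, and the shared-weight convolution step is omitted since it does not affect the rearrangement itself:

```python
import numpy as np

def pgp(x, s=2):
    """PGP: fold the s*s shifted views of x (N, C, H, W) into the batch axis."""
    return np.concatenate([x[:, :, i::s, j::s] for i in range(s) for j in range(s)], axis=0)

def pgp_inverse(y, s=2):
    """PGP^-1: rearrange the s*s branches of y (s*s*N, C, H/s, W/s) back to (N, C, H, W)."""
    n, c, h, w = y.shape[0] // (s * s), y.shape[1], y.shape[2], y.shape[3]
    out = np.empty((n, c, h * s, w * s), dtype=y.dtype)
    for b, (i, j) in enumerate((i, j) for i in range(s) for j in range(s)):
        out[:, :, i::s, j::s] = y[b * n:(b + 1) * n]
    return out

x = np.random.randn(2, 3, 8, 8)
print(np.allclose(pgp_inverse(pgp(x)), x))   # True: PGP^-1 exactly undoes PGP
```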

3.3 Architectural Differences between dilated convolution and PGP

(a) Base CNN (Base-CNN)
(b) CNN with dilated convolution (DConv-CNN)
(c) DConv-CNN represented by PGP
(d) CNN with PGP (PGP-CNN)
Figure 4: Comparison between DConv-CNN and PGP-CNN. As mentioned in Section 3.1, Base-CNN is expressed as the architecture using GP (a). (b) is equivalent to (c).

The structure of the network with PGP is better suited for learning than that with dilated convolutions. For example, consider the case in which two downsampling layers in a base CNN (Base-CNN) are replaced by either dilated convolutions (Fig. 4(b)) or PGP (Fig. 4(d)). According to [7], a conventional multi-stride downsampling layer and all subsequent convolution layers can be replaced with dilated convolution layers whose dilation rate is doubled (Fig. 4(b)). As mentioned in Section 3.2, a dilated convolution is equivalent to (PGP + convolution + PGP⁻¹). When d-dilated convolutions are stacked, the inner PGP⁻¹ and PGP operations cancel, so the stack is expressed as (PGP + convolutions + PGP⁻¹). This means that each branch in a CNN with dilated convolutions (DConv-CNN) is processed independently of the others.

Let us consider the case of Fig. 4(b), in which dilated convolutions with two different dilation rates (e.g., 2 and 4) are stacked. Using the fact that a sequence of 4-dilated convolutions is (PGP + convolutions + PGP⁻¹), this architecture can readily be transformed into the one given in Fig. 4(c), which is the equivalent form with PGP. We can clearly see that each branch split by the former dilation is split again by the latter, and all branches are eventually rearranged by PGP⁻¹ layers prior to the global average pooling (GAP) layer. In contrast, a CNN with PGP (PGP-CNN) is implemented as follows: the stride of each convolution or pooling layer that reduces resolution is set to 1, and PGP is inserted immediately after it (Fig. 4(d)). In DConv-CNN, all branches are aggregated through PGP⁻¹ layers just before the GAP layer, which can be regarded as a feature ensemble, whereas PGP-CNN does not use any PGP⁻¹ layer, so the intermediate features are not fused until the end of the network.

Unlike a feature ensemble (i.e., the DConv-CNN case), this structure forces every branch to produce the correct classification result. As mentioned above, DConv-CNN averages an ensemble of intermediate features, and the likelihood of each class is then predicted from this average. Even if some branches contain useful features (e.g., for image classification) while the others do not, the CNN attempts to determine the correct class based on the average, which prevents the less useful branches from learning further. In contrast, thanks to the independence of its branches, PGP-CNN benefits from additional learning with stronger regularization.

3.4 Weight transfer using PGP

Compared to DConv-CNN, PGP-CNN has a significant advantage in terms of weight transfer. Specifically, the weights learned by PGP-CNN work well on the Base-CNN structure when transferred. We demonstrate later in the experiments that this can improve the performance of the base architecture. In contrast, DConv-CNN does not have the same structure as Base-CNN because of PGP⁻¹: replacing downsampling with dilated convolutions makes it difficult to transfer weights from the pre-trained base model. The base CNN with multi-stride convolutions (top of Fig. 5(a)) can be represented by single-stride convolutions and grid pooling (GP) layers (bottom of Fig. 5(a)). DConv-CNN (top of Fig. 5(b)) is equivalent to the base CNN with PGP and PGP⁻¹ (bottom of Fig. 5(b)). Focusing on the input features of the subsequent convolutions, their resolution is reduced by downsampling in Base-CNN, whereas it remains at the full input resolution in DConv-CNN. This difference in resolution makes it impossible to directly transfer weights from the pre-trained model. Unlike DConv-CNN, PGP-CNN maintains the same structure as Base-CNN (Fig. 5(c)), which results in better transferability and better performance.

This weight transferability of PGP has two advantages. First, applying PGP to a Base-CNN with learned weights at test time can improve recognition performance without retraining. For example, if one does not have sufficient computing resources to train PGP-CNN, using PGP only in the test phase can still make full use of the features discarded by downsampling during training to achieve higher accuracy. Second, the weights learned by PGP-CNN work well in Base-CNN. For example, on an embedded system specialized for a particular CNN, one cannot change the calculations inside the CNN. However, the weights learned on PGP-CNN can yield superior performance even when used in Base-CNN at test time.

(a) Base-CNN (b) DConv-CNN (c) PGP-CNN
Figure 5: The structure performing downsampling inside Base-CNN, DConv-CNN, and PGP-CNN. This downsampling structure (a convolution with a stride of 2) is used in a derivative of ResNet [9], WideResNet [10], and ResNeXt [11].

4 Experimental Results

4.1 Image classification

4.1.1 Datasets

We evaluated PGP on the following benchmark datasets: CIFAR-10, CIFAR-100 [23], and SVHN [24]. Both CIFAR-10 and CIFAR-100 consist of 32×32 color images, with 50,000 images for training and 10,000 images for testing. The CIFAR-10 dataset has 10 classes, each containing 6,000 images; there are 5,000 training images and 1,000 testing images per class. The CIFAR-100 dataset has 100 classes, each containing 600 images; there are 500 training images and 100 testing images per class. We adopted a standard data augmentation scheme (mirroring/shifting) that is widely used for these two datasets [1, 2, 11]. For preprocessing, we normalized the data using the channel means and standard deviations. We used all 50,000 training images for training and calculated the final testing error after training.

The Street View House Numbers (SVHN) dataset consists of 32×32 color digit images. There are 73,257 images in the training set, 26,032 images in the test set, and 531,131 additional images available for training. Following [2, 10], we used all the training data without any data augmentation.

4.1.2 Implementation details

We trained a pre-activation ResNet (PreResNet) [12], All-CNN [43], WideResNet [10], ResNeXt [11], PyramidNet [44], and DenseNet [2] from scratch. We used a 164-layer network for PreResNet. We used WRN-28-10, ResNeXt-29 (8×64d), PyramidNet-164 (bottleneck, alpha=48), and DenseNet-BC-100 (k=12) in the same manner as in the original papers.

When comparing the base CNN (Base), the CNN with dilated convolutions (DConv), and the CNN with PGP (PGP), we utilized the same hyperparameter values (e.g., learning rate strategy) to carefully monitor the performance improvements. We adopted the cosine-shaped learning rate schedule [45], which smoothly anneals the learning rate from its initial value over the course of training [46, 47]. Following [2], for CIFAR and SVHN, we trained the networks using mini-batches of size 64 for 300 epochs and 40 epochs, respectively. All networks were optimized using stochastic gradient descent (SGD) with a momentum of 0.9 and weight decay. We adopted the weight initialization method introduced in [48]. We used the Chainer framework [49] and ChainerCV [50] to implement all experiments.
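For reference, a cosine-shaped schedule of this kind can be written in a few lines; the concrete lr_max value below is a placeholder for illustration, not the value used in the paper:

```python
import math

def cosine_lr(epoch, total_epochs, lr_max, lr_min=0.0):
    """Cosine-shaped annealing [45]: smoothly decay the learning rate from lr_max to lr_min."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * epoch / total_epochs))

# Example with the 300-epoch CIFAR setting; lr_max = 0.1 is a placeholder value.
print([round(cosine_lr(e, 300, lr_max=0.1), 4) for e in (0, 150, 300)])  # [0.1, 0.05, 0.0]
```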

4.1.3 Results

The results of applying DConv and PGP on CIFAR-10, CIFAR-100, and SVHN with different networks are listed in Table 1. In all cases, PGP outperformed Base and DConv.

Dataset Network #Params Base DConv PGP
CIFAR-10 PreResNet-164 1.7M 4.71±0.01 4.15±0.12 3.77±0.09
All-CNN 1.4M 8.42±0.23 8.68±0.43 7.17±0.32
WideResNet-28-10 36.5M 3.44±0.06 3.88±0.21 3.13±0.15
ResNeXt-29 (8×64d) 34.4M 3.86±0.16 3.87±0.13 3.22±0.34
PyramidNet-164 (α=28) 1.7M 3.91±0.15 3.72±0.07 3.38±0.09
DenseNet-BC-100 (k=12) 0.8M 4.60±0.07 4.35±0.22 4.11±0.15
CIFAR-100 PreResNet-164 1.7M 22.68±0.27 20.71±0.27 20.18±0.45
All-CNN 1.4M 32.43±0.31 33.10±0.39 29.79±0.33
WideResNet-28-10 36.5M 18.70±0.39 19.69±0.50 18.20±0.17
ResNeXt-29 (8×64d) 34.4M 19.63±1.12 17.90±0.57 17.18±0.29
PyramidNet-164 (α=28) 1.7M 19.65±0.24 19.10±0.32 18.12±0.16
DenseNet-BC-100 (k=12) 0.8M 22.62±0.29 22.20±0.03 21.69±0.37
SVHN PreResNet-164 1.7M 1.96±1.74 1.74±0.07 1.54±0.02
All-CNN 1.4M 1.94±0.06 1.77±0.07 1.75±0.02
WideResNet-28-10 36.5M 1.81±0.03 1.53±0.02 1.38±0.03
ResNeXt-29 (8×64d) 34.4M 1.81±0.02 1.66±0.03 1.52±0.01
PyramidNet-164 (α=28) 1.7M 2.02±0.05 1.86±0.06 1.79±0.05
DenseNet-BC-100 (k=12) 0.8M 1.97±0.12 1.77±0.06 1.67±0.07
Table 1: Test errors (%) with various architectures on CIFAR-10, CIFAR-100, and SVHN. DConv: dilated convolution, PGP: parallel grid pooling.

4.1.4 Comparison to other data augmentation methods

We compare our method to random flipping (RF), random cropping (RC), and random erasing (RE) in Table 2. PGP achieves better performance than any other single data augmentation method when applied alone. When applied in combination with the other data augmentation methods, PGP works in a complementary manner. Combining RF, RC, RE, and PGP yields an error rate of 19.16%, an improvement of 11.74 points over the baseline without any augmentation.

Network RF RC RE w/o PGP w/ PGP
PreResNet-164 − − − 30.90±0.32 22.68±0.27
✓ − − 26.40±0.18 20.70±0.34
− ✓ − 23.94±0.07 22.34±0.23
− − ✓ 27.52±0.22 21.32±0.21
✓ ✓ − 22.68±0.27 20.18±0.45
✓ − ✓ 24.39±0.55 20.01±0.14
− ✓ ✓ 22.59±0.09 20.71±0.02
✓ ✓ ✓ 21.48±0.21 19.16±0.20
Table 2: Test errors (%) with different data augmentation methods on CIFAR-100 based on PreResNet-164. RF: random flipping, RC: random cropping, RE: random erasing, PGP: parallel grid pooling.

4.1.5 Comparison to classical ensemble method

We compare our method to the classical ensemble method in Table 3. On CIFAR-10 and SVHN, a single CNN with PGP achieves better performance than an ensemble of three basic CNN models. Furthermore, PGP works in a complementary manner when applied to the ensemble.

Dataset Network ×1 (single model) ×3 (ensemble of three)
Base DConv PGP Base DConv PGP
CIFAR-10 PreResNet-164 4.71 4.15 3.77 3.98 3.37 3.14
CIFAR-100 22.68 20.71 20.18 19.16 17.35 17.15
SVHN 1.96 1.74 1.54 1.65 1.51 1.39
Table 3: Ensembled test errors (%) on CIFAR-10/100 and SVHN based on PreResNet-164. “×3” denotes an ensemble of three CNN models. DConv: dilated convolution, PGP: parallel grid pooling.

4.1.6 Weight transfer

The results of applying PGP as a test-time data augmentation method on CIFAR-10, CIFAR-100, and SVHN with different networks are listed in Table 4. PGP achieves better performance than the base model, except for the All-CNN and ResNeXt models. The reason for this incompatibility is left for future work.

Dataset Network #Params Base PGP
CIFAR-10 PreResNet-164 1.7M 4.71±0.01 4.56±0.14
All-CNN 1.4M 8.42±0.23 9.03±0.30
WideResNet-28-10 36.5M 3.44±0.06 3.39±0.02
ResNeXt-29 (8×64d) 34.4M 3.86±0.16 4.01±0.21
PyramidNet-164 (α=28) 1.7M 3.91±0.15 3.82±0.05
DenseNet-BC-100 (k=12) 0.8M 4.60±0.07 4.53±0.12
CIFAR-100 PreResNet-164 1.7M 22.68±0.27 22.19±0.26
All-CNN 1.4M 32.43±0.31 33.24±0.30
WideResNet-28-10 36.5M 18.70±0.39 18.60±0.39
ResNeXt-29 (8×64d) 34.4M 19.63±1.12 20.02±0.98
PyramidNet-164 (α=28) 1.7M 19.65±0.24 19.34±0.28
DenseNet-BC-100 (k=12) 0.8M 22.62±0.29 22.33±0.26
SVHN PreResNet-164 1.7M 1.96±0.07 1.83±0.06
All-CNN 1.4M 1.94±0.06 3.86±0.05
WideResNet-28-10 36.5M 1.81±0.03 1.77±0.03
ResNeXt-29 (8×64d) 34.4M 1.81±0.02 2.47±0.51
PyramidNet-164 (α=28) 1.7M 2.02±0.05 2.08±0.08
DenseNet-BC-100 (k=12) 0.8M 1.97±0.12 1.89±0.08
Table 4: Test errors (%) when applying PGP as a test-time data augmentation with various architectures on CIFAR-10, CIFAR-100, and SVHN. PGP: parallel grid pooling.

The results of applying PGP as a training-time data augmentation technique on CIFAR-10, CIFAR-100, and SVHN with different networks are listed in Table 5. The models trained with PGP, which have the same structure as the base model at test time, outperformed both the base model and the model with dilated convolutions. The models trained with dilated convolutions performed worse than the baseline models.

Dataset Network #Params Base DConv PGP
CIFAR-10 PreResNet-164 1.7M 4.71±0.01 7.30±0.20 4.08±0.09
All-CNN 1.4M 8.42±0.23 38.77±0.85 7.30±0.31
WideResNet-28-10 36.5M 3.44±0.06 7.90±0.90 3.30±0.13
ResNeXt-29 (8×64d) 34.4M 3.86±0.16 16.91±0.45 3.36±0.27
PyramidNet-164 (α=28) 1.7M 3.91±0.15 6.82±0.46 3.55±0.08
DenseNet-BC-100 (k=12) 0.8M 4.60±0.07 7.03±0.70 4.36±0.10
CIFAR-100 PreResNet-164 1.7M 22.68±0.27 24.90±0.47 21.01±0.54
All-CNN 1.4M 32.43±0.31 64.27±2.21 30.26±0.44
WideResNet-28-10 36.5M 18.70±0.39 26.90±0.80 18.56±0.21
ResNeXt-29 (8×64d) 34.4M 19.63±1.12 44.57±13.78 17.67±0.31
PyramidNet-164 (α=28) 1.7M 19.65±0.24 25.28±0.30 18.58±0.13
DenseNet-BC-100 (k=12) 0.8M 22.62±0.29 27.56±0.78 22.46±0.16
SVHN PreResNet-164 1.7M 1.96±1.74 3.21±0.32 1.67±0.04
All-CNN 1.4M 1.94±0.06 4.64±0.60 1.80±0.04
WideResNet-28-10 36.5M 1.81±0.03 3.11±0.17 1.45±0.03
ResNeXt-29 (8×64d) 34.4M 1.81±0.02 5.64±0.61 1.58±0.03
PyramidNet-164 (α=28) 1.7M 2.02±0.05 3.23±0.10 1.90±0.09
DenseNet-BC-100 (k=12) 0.8M 1.97±0.12 2.74±0.10 1.78±0.07
Table 5: Test errors (%) when applying PGP as a training-time data augmentation technique with various architectures on CIFAR-10, CIFAR-100, and SVHN. DConv: dilated convolution, PGP: parallel grid pooling.

4.2 ImageNet classification

We further evaluated PGP using the ImageNet 2012 classification dataset [25]. The ImageNet dataset comprises 1.28 million training images and 50,000 validation images from 1,000 classes. Following the training-time data augmentation scheme of [9, 11, 51], input images of size 224×224 were randomly cropped from a resized image using scale and aspect ratio augmentation. We reimplemented a derivative of ResNet [9] for training. Following the dilated network strategy [7, 20], we used dilated convolutions or PGP after the conv4 stage. All networks were optimized using SGD with a momentum of 0.9 and weight decay. The models were trained for 90 epochs using mini-batches of size 256 on eight GPUs. We adopted the weight initialization method introduced in [48] and a cosine-shaped learning rate schedule that gradually reduced the learning rate from its initial value.

The results when applying dilated convolution and PGP on ImageNet with ResNet-50 and ResNet-101 are listed in Table 6. As can be seen in the results, the network with PGP outperformed the baseline performance. PGP works in a complementary manner when applied in combination with 10-crop testing and can be used as a training-time or testing-time data augmentation technique.

Network #Params Train Test 1 crop 10 crops
top-1 top-5 top-1 top-5
ResNet-50 25.6M Base Base 23.69 7.00 21.87 6.07
DConv DConv 22.47 6.27 21.17 5.58
PGP PGP 22.40 6.30 20.83 5.56
Base PGP 23.32 6.85 21.62 5.93
DConv Base 31.44 11.40 26.79 8.19
PGP Base 23.01 6.66 21.22 5.74
ResNet-101 44.5M Base Base 22.49 6.38 20.85 5.50
DConv DConv 21.26 5.61 20.03 5.02
PGP PGP 21.34 5.65 19.81 5.00
Base PGP 22.13 6.21 20.46 5.36
DConv Base 25.63 8.01 22.40 6.05
PGP Base 21.80 5.95 20.18 5.15
Table 6: Classification error (%) on the ImageNet 2012 validation set. The error was evaluated with and without 10-crop testing for 224×224-pixel inputs. DConv: dilated convolution, PGP: parallel grid pooling.

4.3 Multi-label classification

We further evaluated PGP for multi-label classification on the NUS-WIDE [52] and MS-COCO [53] datasets. NUS-WIDE contains 269,648 images with a total of 5,018 tags collected from Flickr. These images are manually annotated with 81 concepts, including objects and scenes. Following [54], we used 161,789 images for training and 107,859 images for testing. MS-COCO contains 82,783 images for training and 40,504 images for testing, which are labeled with 80 common object categories. The input images were resized, randomly cropped, and then resized again to the network input size.

We trained a derivative of ResNet [9] pre-trained on the ImageNet dataset. We used the same pre-trained weights throughout all experiments for a fair comparison. All networks were optimized using SGD with a momentum of 0.9 and weight decay. The models were trained for 16 and 30 epochs using mini-batches of size 96 on four GPUs for NUS-WIDE and MS-COCO, respectively. We adopted the weight initialization method introduced in [48]. For fine-tuning, the learning rate was divided by 10 after the 8th and 12th epochs for NUS-WIDE and after the 15th and 22nd epochs for MS-COCO. We employed mean average precision (mAP), macro/micro F-measure, precision, and recall for performance evaluation. If the confidence for a class was greater than 0.5, the label was predicted as positive.
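As an illustration of this evaluation protocol, the sketch below computes mAP and the macro/micro F-measure with a 0.5 threshold using scikit-learn; the score and label arrays are toy values, not results from the paper:

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

# Toy multi-label predictions: rows are images, columns are concepts/categories.
scores = np.array([[0.9, 0.2, 0.7],
                   [0.4, 0.8, 0.6]])
labels = np.array([[1, 0, 1],
                   [0, 1, 1]])

preds = (scores > 0.5).astype(int)                    # a label is positive if its confidence exceeds 0.5
mAP = average_precision_score(labels, scores, average='macro')
macro_f = f1_score(labels, preds, average='macro')    # F-measure per class, averaged over classes
micro_f = f1_score(labels, preds, average='micro')    # F-measure over all (image, class) decisions
print(mAP, macro_f, micro_f)
```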

The results when applying dilated convolution and PGP to ResNet-50 and ResNet-101 are listed in Table 7. The networks with PGP achieved higher mAP than the others. In contrast, the networks with dilated convolutions achieved lower scores than the baseline, which suggests that the ImageNet pre-trained weights were not successfully transferred to the ResNet with dilated convolutions due to the difference in resolution mentioned in Sec. 3.4. This problem can be avoided by maintaining the dilation rate in the layers that perform downsampling, as in dilated residual networks (DRN) [17]; this allows the CNN with dilated convolutions to perform convolution at the same resolution as the base CNN. The DRN achieved better mAP scores than the base CNN and nearly the same score as the CNN with PGP. The CNN with PGP is, however, superior to DRN in that it retains weight transferability.

Dataset Network #Params Arch. mAP Macro Micro
F P R F P R
NUS-WIDE ResNet-50 25.6M Base 55.7 52.1 63.3 47.0 70.2 75.3 65.8
DConv 55.3 51.7 63.3 46.6 70.0 75.7 65.2
DRN 55.9 52.2 43.8 47.0 70.1 75.8 65.3
PGP 56.1 51.9 64.4 46.2 70.0 76.4 64.6
ResNet-101 44.5M Base 56.5 53.0 63.6 48.3 70.4 75.4 66.1
DConv 56.3 53.4 62.7 49.2 70.5 74.9 66.5
DRN 56.8 53.1 63.1 48.9 70.5 75.1 66.5
PGP 57.2 53.5 64.6 48.7 70.6 75.7 66.2
MS-COCO ResNet-50 25.6M Base 73.4 67.5 78.2 60.9 72.8 82.2 65.4
DConv 72.7 66.6 77.9 59.7 72.2 81.9 64.5
DRN 74.2 68.3 78.8 61.8 73.4 82.5 66.1
PGP 74.2 67.9 80.0 60.5 73.1 83.7 64.9
ResNet-101 44.5M Base 74.6 68.9 77.6 62.8 73.8 82.2 66.9
DConv 74.0 68.3 78.0 62.0 73.2 81.6 66.4
DRN 75.3 69.7 78.2 63.8 74.3 81.8 68.1
PGP 75.5 69.4 79.8 62.9 74.2 83.1 67.0
Table 7: Quantitative results on NUS-WIDE and MS-COCO. DConv: dilated convolution, DRN: dilated residual networks, PGP: parallel grid pooling.

5 Conclusion

In this paper, we proposed PGP, which performs downsampling without discarding any intermediate feature and can be regarded as a data augmentation technique. PGP is easily applicable to various CNN models without altering their learning strategies. Additionally, we demonstrated that PGP is implicitly used in dilated convolution, which suggests that dilated convolution is also a type of data augmentation technique. Experiments on CIFAR-10/100 and SVHN with six network models demonstrated that PGP outperformed the base CNN and the CNN with dilated convolutions. PGP also obtained better results than the base CNN and the CNN with dilated convolutions on the ImageNet dataset for image classification and on the NUS-WIDE and MS-COCO datasets for multi-label image classification.

References

  • [1] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. (2016)
  • [2] Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR. (2017)
  • [3] Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 (2017)
  • [4] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: Single shot multibox detector. In: ECCV. (2016)
  • [5] Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger. In: CVPR. (2017)
  • [6] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: CVPR. (2017)
  • [7] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI (2017)
  • [8] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR. (2017)
  • [9] Gross, S., Wilber, M.: Training and investigating residual nets. Facebook AI Research blog. http://torch.ch/blog/2016/02/04/resnets.html (2016)
  • [10] Zagoruyko, S., Komodakis, N.: Wide residual networks. In: BMVC. (2016)
  • [11] Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR. (2017)
  • [12] He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: ECCV. (2016)
  • [13] Zeiler, M.D., Fergus, R.: Stochastic pooling for regularization of deep convolutional neural networks. In: ICLR. (2013)
  • [14] Graham, B.: Fractional max-pooling. arXiv preprint arXiv:1412.6071 (2014)
  • [15] Lee, C.Y., Gallagher, P.W., Tu, Z.: Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In: AISTATS. (2016)
  • [16] Zhai, S., Wu, H., Kumar, A., Cheng, Y., Lu, Y., Zhang, Z., Feris, R.: S3Pool: Pooling with stochastic spatial sampling. In: CVPR. (2017)
  • [17] Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: CVPR. (2017)
  • [18] Oord, A.v.d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)
  • [19] Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A.v.d., Graves, A., Kavukcuoglu, K.: Neural machine translation in linear time. arXiv preprint arXiv:1610.10099 (2016)
  • [20] Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR. (2016)
  • [21] Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611 (2018)
  • [22] Yang, Z., Hu, Z., Salakhutdinov, R., Berg-Kirkpatrick, T.: Improved variational autoencoders for text modeling using dilated convolutions. In: ICML. (2017)
  • [23] Hinton, G.E.: Learning multiple layers of representation. Trends in cognitive sciences 11(10) (2007) 428–434
  • [24] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Machine Learning Systems. (2011)

  • [25] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. IJCV 115(3) (2015) 211–252
  • [26] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS. (2012)
  • [27] Ratner, A.J., Ehrenberg, H., Hussain, Z., Dunnmon, J., Ré, C.: Learning to compose domain-specific transformations for data augmentation. In: NIPS. (2017)
  • [28] Hauberg, S., Freifeld, O., Larsen, A.B.L., Fisher, J., Hansen, L.: Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In: AISTATS. (2016)
  • [29] Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. arXiv preprint arXiv:1708.04896 (2017)
  • [30] DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
  • [31] Yamada, Y., Iwamura, M., Kise, K.: Shakedrop regularization. arXiv preprint arXiv:1802.02375 (2018)
  • [32] Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548 (2018)
  • [33] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: ICLR. (2018)
  • [34] Yuji, T., Yoshitaka, U., Tatsuya, H.: Between-class learning for image classification. In: CVPR. (2018)
  • [35] Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. JAIR 16 (2002) 321–357
  • [36] DeVries, T., Taylor, G.W.: Dataset augmentation in feature space. ICLR workshop (2017)
  • [37] Fawzi, A., Samulowitz, H., Turaga, D., Frossard, P.: Adaptive data augmentation for image classification. In: ICIP. (2016)
  • [38] Wang, X., Shrivastava, A., Gupta, A.: A-fast-rcnn: Hard positive generation via adversary for object detection. In: CVPR. (2017)
  • [39] Lemley, J., Bazrafkan, S., Corcoran, P.: Smart augmentation-learning an optimal data augmentation strategy. IEEE Access (2017)
  • [40] Dixit, M., Kwitt, R., Niethammer, M., Vasconcelos, N.: AGA: Attribute-guided augmentation. In: CVPR. (2017)
  • [41] Liu, B., Dixit, M., Kwitt, R., Vasconcelos, N.: Feature space transfer for data augmentation. In: CVPR. (2018)
  • [42] Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: CVPR. (2017)
  • [43] Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: The all convolutional net. In: ICLR Workshop. (2015)
  • [44] Han, D., Kim, J., Kim, J.: Deep pyramidal residual networks. In: CVPR. (2017)
  • [45] Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. In: ICLR. (2017)
  • [46] Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J.E., Weinberger, K.Q.: Snapshot ensembles: Train 1, get m for free. In: ICLR. (2017)
  • [47] Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: CVPR. (2018)
  • [48] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: ICCV. (2015)
  • [49] Tokui, S., Oono, K., Hido, S., Clayton, J.: Chainer: a next-generation open source framework for deep learning. In: NIPS Workshop. (2015)
  • [50] Niitani, Y., Ogawa, T., Saito, S., Saito, M.: ChainerCV: a library for deep learning in computer vision. In: ACMMM. (2017)
  • [51] Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., Feng, J.: Dual path networks. In: NIPS. (2017)
  • [52] Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.: NUS-WIDE: a real-world web image database from national university of singapore. In: ACM-CIVR. (2009)
  • [53] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV. (2014)
  • [54] Zhu, F., Li, H., Ouyang, W., Yu, N., Wang, X.: Learning spatial regularization with image-level supervisions for multi-label image classification. In: CVPR. (2017)