Max pooling in convolutional neural networks (CNNs) is the operation that selects the maximum value in each kernel, as shown in Fig. 1(a). It plays several important roles in CNN-based image recognition. One is the dimensionality reduction of convolutional features; by using a max pooling operation with an appropriate stride length, we can reduce the size of the convolutional feature map and expect efficient computation as well as information aggregation. Another role is deformation compensation. Even if the convolutional features undergo local (i.e., small) spatial translations due to deformations in the input images, the reduced feature maps are invariant to the translations. Consequently, the CNN becomes robust to deformations in the input images.
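The two roles described above can be illustrated with a minimal NumPy sketch. The function name `max_pool2d` and its parameters `k` (kernel size) and `s` (stride) are names chosen here for illustration, and no padding is assumed. The example shrinks a 6 x 6 map to 2 x 2 and checks that a one-pixel shift of the response leaves the pooled map unchanged.

```python
import numpy as np

def max_pool2d(x, k, s):
    """Max pooling over a 2-D map with a k x k kernel and stride s (no padding)."""
    rows = (x.shape[0] - k) // s + 1
    cols = (x.shape[1] - k) // s + 1
    out = np.empty((rows, cols), dtype=x.dtype)
    for m in range(rows):
        for n in range(cols):
            # Keep only the largest value inside the (m, n)-th kernel.
            out[m, n] = x[m * s:m * s + k, n * s:n * s + k].max()
    return out

# A 6 x 6 map with a single strong response, and a copy shifted right by one pixel.
feature = np.zeros((6, 6))
feature[1, 0] = 1.0
shifted = np.roll(feature, 1, axis=1)  # the response moves from column 0 to column 1

pooled = max_pool2d(feature, k=3, s=3)
pooled_shifted = max_pool2d(shifted, k=3, s=3)
print(pooled.shape)                            # (2, 2)
print(np.array_equal(pooled, pooled_shifted))  # True
```

The invariance holds here only because the shifted response stays inside the same kernel; a shift that crosses a kernel boundary would change the pooled map.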
This paper is motivated by the fact that the deformation compensation ability of the max pooling operation is excessive for actual deformations. Most of the actual deformations are topology-preserving, i.e., spatially continuous within each object region; if a part of an object shifts to a certain direction, its neighboring part also shifts to a similar direction. However, since the value selection by max pooling is performed for each kernel independently, it not only compensates for intra-class topology-preserving deformations, but may also “over-compensate” for essential differences between similar classes.
Fig. 1(a) illustrates the excessive flexibility of max pooling, where the kernel size and the stride are both set to 3 for a simpler illustration. (In the later experiments, the stride was often smaller than the kernel size.) Each arrow on the convolutional feature map shows the direction of the maximum value (depicted as the darkest point) from the center of the kernel. The value selection is performed independently at each kernel, and therefore the arrows point in random directions across the map. If we consider that each arrow represents a local displacement, the arrows work as a spatial warping that compensates for the deformations in the map. We can thus understand that these random directions do not fit continuous deformations. In other words, the flexibility of the max pooling operation is excessive for the actual deformations.
The lower part of Fig. 1(a) shows the result of the max pooling operation. Due to the greedy selection of the maximum value at each kernel, the result of pooling consists mostly of large feature values (i.e., darker colors). However, the original convolutional feature map is not always large; it exhibits a trend that the upper-left side has smaller values and the lower-right side has larger values. The result of max pooling no longer exhibits this trend. This means that max pooling easily overlooks small-valued but important parts and thus might ignore essential differences between similar classes.
Figs. 2(a) and (b) show digit and texture images and their results when the max pooling operation is applied to their first convolutional feature map, respectively. The digit images have one or two holes (‘8,’ ‘9,’ and ‘A’) or near-hole concave parts (‘m’). These hole parts have smaller values and thus nearly disappear by the max pooling operation despite their importance for their discrimination. For example, the max pooling result of ‘8’ might be confused with that of ‘5’ (In fact, this ‘8’ is misrecognized as ‘5’ by a CNN with the max pooling operation). Texture images are composed of coarse and fine structures. Their fine structures are lost by max pooling, whereas the coarse ones are still preserved.
In this paper, we propose a regularized pooling operation, where the flexibility of the pooling operation is regularized to fit the characteristics of actual deformations. Fig. 1(b) illustrates the proposed regularized pooling operation. The key idea is to smooth the value selection directions in the pooling operation. By taking the average of max value directions in the neighboring kernels in a window (the dotted squares in (a)), a non-maximum value can be selected and then over-compensation is suppressed. Note that, unlike average pooling, this regularization does not affect the feature values themselves; it only affects the selection of the value from the kernel.
Fig. 2(c) shows the result of the regularized pooling operation (with the window). The holes of digit images and fine structures of texture images are well preserved even after the pooling operations, compared with the max pooling operation (b). We can thus expect that our regularized pooling can avoid over-compensation and thus keep the separability among classes. It should be noted that this property will lead to a stable training process with a faster convergence because it will be possible to avoid local minima due to the over-compensation.
The main contributions of this paper are summarized as follows:
- We propose a regularized pooling operation whose capability in terms of deformation compensation fits the characteristics of actual deformations. To the best of the authors’ knowledge, this is the first proposal of a regularized pooling operation.
- Since the regularized pooling operation can avoid over-compensation and thus preserve essential inter-class differences, it has positive effects on both the training and testing steps. We experimentally show these effects; our regularized pooling operation accelerates the training step (i.e., provides quick convergence) and improves the recognition accuracy, especially by avoiding confusion between similar classes, such as ‘7’ and ‘9’ and ‘a’ and ‘e.’ In a qualitative study, we also observed that the proposed method can preserve important inter-class differences.
- We investigate when the proposed method is superior to max pooling using different datasets such as handwritten character image datasets and a texture image dataset. The experiment with texture images also shows the structure preservation capability of the regularized pooling operation.
2 Related Work
In recent years, many researchers have focused on pooling operations to improve the performance of deep learning-based architectures [7, 5, 12]. Pooling operations can reduce the dimension of the input features and render them invariant to small shifts and deformations [17]. However, the spatial information lost in the traditional pooling layers causes problems that limit the learning capability of deep neural networks [4, 16].
2.1 Traditional pooling operations
To handle the problems in the traditional pooling operations, many methods have been proposed to extend or improve them in different ways [8, 25, 26, 14, 24]. To solve the problems that MP2-pooling ($2 \times 2$ max pooling) reduces the size of the hidden layers quickly and that the disjoint nature of the pooling regions can limit generalization, Graham [8] proposed fractional max pooling (FMP), which reduces the size of the image by a factor of $\alpha$ with $1 < \alpha < 2$. Zhai et al. [26] proposed S3Pool, which extends standard max pooling by decomposing pooling into two steps: max pooling with stride one and a non-deterministic spatial downsampling step that randomly samples rows and columns from a feature map. They observed that this general stochasticity acts as a strong regularizer, and can also be seen as performing implicit data augmentation by introducing distortions to the feature maps. To regularize CNN-based architectures, Yu et al. [25] proposed mixed pooling, which was inspired by the random dropout [11] and DropConnect [23] methods. Similarly, Wei et al. [24] proposed an intermediate form between max and average pooling called polynomial pooling (P-pooling) to provide an optimally balanced and self-adjusted pooling strategy for semantic segmentation. To compensate for spatial information lost in the max pooling layer, Zheng et al. [28] extracted displacement directions from the max pooling layers and combined them with the original max pooling features to capture structural deformations in text recognition tasks.
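As a rough illustration of the two-step decomposition described for S3Pool above, the following NumPy sketch applies stride-one max pooling and then keeps a random subset of rows and columns. It is a simplified sketch, not the original implementation: the function name `s3pool_sketch`, the parameter `out_size`, and uniform sampling without replacement over the whole map are all assumptions made here.

```python
import numpy as np

def s3pool_sketch(x, k, out_size, rng):
    """Two-step stochastic pooling sketch on a 2-D map.

    Step 1: max pooling with a k x k kernel and stride 1 (no padding).
    Step 2: keep `out_size` randomly chosen rows and columns, in their original order.
    """
    h = x.shape[0] - k + 1
    w = x.shape[1] - k + 1
    dense = np.empty((h, w), dtype=x.dtype)
    for i in range(h):
        for j in range(w):
            dense[i, j] = x[i:i + k, j:j + k].max()
    # Assumption: rows and columns are drawn uniformly without replacement.
    rows = np.sort(rng.choice(h, size=out_size, replace=False))
    cols = np.sort(rng.choice(w, size=out_size, replace=False))
    return dense[np.ix_(rows, cols)]

rng = np.random.default_rng(0)
x = rng.random((9, 9))
y = s3pool_sketch(x, k=2, out_size=4, rng=rng)
print(y.shape)  # (4, 4)
```

Because the sampled rows and columns change on every call, the same input yields slightly different downsampled maps, which is the source of the regularizing distortion mentioned above.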
2.2 Recent pooling operations
Considering the limitations of traditional pooling methods, many pooling operations and layers have recently been proposed to address problems pertaining to specific applications such as image detection and classification [9, 21, 13, 6], handwriting and text recognition [8, 28, 20], semantic segmentation [2, 10, 24], and other challenging computer vision tasks [1, 21, 27, 19]. He et al. [9] introduced a spatial pyramid pooling (SPP) layer to remove the fixed-size constraint on the network, thereby making the network robust to object deformation. Kobayashi [13] proposed a trainable local pooling function guided by global features beyond local ones. The parameterized pooling form is derived from a probabilistic perspective to flexibly represent various types of pooling, and the parameters are estimated by using statistics of the input feature map. More recently, Gao et al. [6] proposed Local Importance-based Pooling (LIP), which can automatically enhance discriminative features during the downsampling procedure by learning adaptive importance weights based on the inputs. LIP addresses the problem that traditional downsampling layers can prevent discriminative details from being well preserved, which is crucial for recognition and detection tasks.
Compared with prevalent pooling operations, the proposed regularized pooling considers spatial information and regulates the directions of pooling to be homogenized around the neighboring kernels. The advantage of the proposed method is that it compensates for deformations when the neighboring parts shift to random directions. In this way, the proposed method becomes more effective than conventional pooling methods at accelerating convergence.
3 Regularized Pooling
Fig. 3 shows an overview of regularized pooling. Regularized pooling takes a convolutional feature map as its input and outputs a new feature map. Although the outline of the calculation is similar to that of max pooling, the main difference is that the direction to the maximum value in a kernel, called the displacement direction, is extracted and then revised by the displacement directions at the neighboring kernels.
Specifically, the displacement direction is first extracted from the input feature map by the max pooling operation. Assume that we can conduct the max pooling operation $M$ times vertically and $N$ times horizontally by sliding a $k \times k$ kernel with stride $s$. (To be specific, for a convolutional feature map of size $H \times W$ given as input, $M$ and $N$ follow from $H$, $W$, and $s$ if we add a proper size of padding to the input.) For the $(m, n)$-th pooling kernel ($m = 1, \ldots, M$; $n = 1, \ldots, N$), the displacement direction $\mathbf{d}_{m,n}$ is defined as the direction from the center of the kernel to the maximum value. The possible values of each element of $\mathbf{d}_{m,n}$ depend on the parity of $k$. For an odd $k$, each element lies in $\{-(k-1)/2, \ldots, 0, \ldots, (k-1)/2\}$, whereas for an even $k$ it lies in $\{-k/2, \ldots, -1, 1, \ldots, k/2\}$. The displacement directions are then regularized by considering the adjacent displacement directions. The regularization is based on spatial smoothing of the displacement directions. The regularized displacement direction $\bar{\mathbf{d}}_{m,n}$ is calculated as follows:

$$\bar{\mathbf{d}}_{m,n} = \frac{1}{w^2} \sum_{i=-l}^{l} \sum_{j=-l}^{l} \mathbf{d}_{m+i,\,n+j}, \tag{1}$$

where the odd integer $w$ is the size of the smoothing window and $l = (w-1)/2$. Finally, the output feature map is generated by using the regularized displacement directions. The pixel value in the $(m, n)$-th kernel indicated by the regularized displacement direction $\bar{\mathbf{d}}_{m,n}$ is extracted as the $(m, n)$-th value of the reduced feature map.

$\bar{\mathbf{d}}_{m,n}$ can be a non-integer vector due to the smoothing in Eq. (1), while it should be an integer vector for the acquisition of a reduced feature map. Therefore, we quantize $\bar{\mathbf{d}}_{m,n}$ if it is a non-integer. For an odd $k$, each element of $\bar{\mathbf{d}}_{m,n}$ is rounded to the nearest integer (if the fractional part is exactly 0.5, it is rounded away from zero). For an even $k$, each element is rounded away from zero, so as not to be zero.
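The three steps described in this section (extract displacement directions, smooth them over neighboring kernels, quantize and pick the indicated value) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `regularized_pool2d`, `k` (kernel size), `s` (stride), and `w` (odd smoothing window size) are chosen here, no padding is used, kernels at the border average only over the neighbors that exist, and for an even kernel a smoothed component that is exactly zero falls back to the kernel's own displacement.

```python
import numpy as np

def index_to_offset(idx, k):
    """Map an in-kernel index (0..k-1) to a displacement component."""
    if k % 2 == 1:
        return idx - (k - 1) // 2                       # ..., -1, 0, 1, ...
    return idx - k // 2 if idx < k // 2 else idx - k // 2 + 1  # zero is skipped

def offset_to_index(off, k):
    """Inverse of index_to_offset."""
    if k % 2 == 1:
        return off + (k - 1) // 2
    return off + k // 2 if off < 0 else off + k // 2 - 1

def quantize(v, k, fallback):
    """Turn one smoothed displacement component back into a valid integer offset."""
    if k % 2 == 1:
        return int(np.sign(v) * np.floor(abs(v) + 0.5))  # halves go away from zero
    if v == 0:
        return int(fallback)  # assumption: keep the kernel's own (nonzero) offset
    return int(np.sign(v) * np.ceil(abs(v)))             # always away from zero

def regularized_pool2d(x, k, s, w):
    rows = (x.shape[0] - k) // s + 1
    cols = (x.shape[1] - k) // s + 1
    # Step 1: displacement direction of the maximum in every kernel.
    d = np.zeros((rows, cols, 2))
    for m in range(rows):
        for n in range(cols):
            patch = x[m * s:m * s + k, n * s:n * s + k]
            r, c = np.unravel_index(np.argmax(patch), patch.shape)
            d[m, n] = (index_to_offset(r, k), index_to_offset(c, k))
    # Steps 2 and 3: smooth the directions, quantize, and read the indicated value.
    half = (w - 1) // 2
    out = np.empty((rows, cols), dtype=x.dtype)
    for m in range(rows):
        for n in range(cols):
            # Assumption: the window is truncated at the border of the direction map.
            window = d[max(m - half, 0):m + half + 1, max(n - half, 0):n + half + 1]
            mean = window.reshape(-1, 2).mean(axis=0)
            dr = quantize(mean[0], k, d[m, n, 0])
            dc = quantize(mean[1], k, d[m, n, 1])
            out[m, n] = x[m * s + offset_to_index(dr, k), n * s + offset_to_index(dc, k)]
    return out

x = np.random.default_rng(0).random((12, 12))
print(regularized_pool2d(x, k=3, s=3, w=3).shape)  # (4, 4)
```

With `w=1` the window contains only the kernel itself, so the sketch reduces to ordinary max pooling; with a larger window the selected value can be a non-maximum, but the feature values themselves are never averaged.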
4 Experiment on Character Images
We first assessed the effectiveness of the regularized pooling operation by comparing it with traditional pooling operations. In particular, we verified that regularized pooling improves the convergence speed of learning through a comparison of performance profiles. Second, we qualitatively show that regularized pooling reduces the dimensionality of the input feature map while preserving detailed structures via example-based evaluation. Finally, we evaluate the effects of the kernel size, smoothing window size, and stride, which are important hyperparameters of regularized pooling.
We evaluated our regularized pooling on two standard benchmark datasets of handwritten character images, MNIST [18] and EMNIST [3]. Character images often undergo various and severe deformations; however, those deformations are still continuous and topology-preserving so as not to spoil inter-class differences. Therefore, character images are the most suitable for understanding the characteristics of the proposed regularized pooling operation.
MNIST is comprised of handwritten digit images and split to training samples and test samples. EMNIST is comprised of uppercase and lowercase English alphabet letters with 37 classes (after several identifications between indistinguishable classes, such as ‘o’ and ‘O’) and for training and for test.
4.2 Experimental setup
The network was based on VGG [22], with some convolutional blocks and fully connected layers removed to fit the network to the size of the input image. Two convolutional layers with a ReLU activation function were cascaded as a block, and a pooling layer was connected after the convolutional block. After repeating this convolutional and pooling connection three times, a fully-connected (FC) layer with a softmax activation was connected as the last layer. Dropout was used for the last FC layer. Regularized pooling was applied to the first pooling layer. For comparison, we used max pooling and average pooling.
In all experiments, we calculated the average of five trials by changing the initial weights of the network when computing classification accuracy. To clarify the effect of pooling, all images were resized to . Zero-padding was not used in any pooling operation. We used the SGD optimizer for weight updating. The learning rate was for MNIST and for EMNIST. The number of learning epochs and the batch size were set to and , respectively. We employed cross entropy as a loss function.
|pool1||regularized pool, , or|
|max pool, , or|
|FC||FC + softmax|
4.3 Performance comparison with traditional pooling methods
Fig. 4 shows the comparison of performance profiles among regularized pooling, max pooling, and average pooling on the test datasets of MNIST and EMNIST. In this figure, the pooling kernel size , smoothing window size , and stride were set to , , and , respectively. Note that every line shows the average of five trials by changing the initial weights of the network.
These results confirmed that the learning convergence of regularized pooling is faster than those of max pooling and average pooling. Compared to max pooling, our regularized pooling could suppress the excessive deformation compensations and thus could avoid the local minima they cause, especially in the early training stages, when the feature values tend to be random-like and the deformation compensation ability of max pooling is abused. Examples that support the above hypotheses are provided in the next subsection.
It is also very important that regularized pooling is better than average pooling. Regularized pooling still keeps important (large) feature values compared to average pooling. This is because feature values themselves are smoothed by average pooling, whereas they are not smoothed by our regularized pooling—regularized pooling just smooths the selection direction.
4.4 Qualitative evaluation
We qualitatively evaluated the differences between regularized pooling and traditional pooling methods by visualizing the feature maps after the application of the pooling operations. The visualization examples are shown in Fig. 1. In max pooling, the shapes of the characters collapsed due to over-compensation. For example, the holes of ‘8’ and ‘Q’ are filled with white pixels. In average pooling, the outlines of the characters are blurred although their shapes are preserved better than by max pooling. This is because average pooling considered the surrounding information by smoothing the feature values directly. Conversely, regularized pooling preserved both the shapes and the outlines of the characters better than max pooling and average pooling because it considers surrounding information by regularizing the deformation features, without directly smoothing the input feature maps.
We verified how the qualitative differences among the pooling methods in the above visualization affected recognition errors. Fig. 6 shows the number of misrecognitions between certain class pairs along with the learning epochs. In Figs. 6(a) and 6(b), ‘7’ and ‘9,’ and ‘a’ and ‘e,’ are given as the pairs whose structural differences are subtle, i.e., confusing pairs. In addition, Figs. 6(c) and 6(d) show the misrecognitions between the pairs of ‘2’ and ‘7,’ and ‘C’ and ‘O,’ where there are clear structural differences in the handwritten images, i.e., easy pairs. For the confusing pairs, regularized pooling reduced misrecognitions compared with max pooling and average pooling, whereas there was no remarkable difference among the three pooling methods for the easy pairs. These results show that regularized pooling preserves the detailed structure of the input feature map by suppressing over-compensations and thus effectively distinguishes between class pairs with subtle structural differences.
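The per-pair error count used in this kind of analysis can be computed from true and predicted labels with a few lines of NumPy. The sketch below is illustrative only: the function name `pair_misrecognitions` is chosen here, the toy labels are made up, and counting both directions of confusion (a recognized as b, and b recognized as a) is an assumption about how the pair count is defined.

```python
import numpy as np

def pair_misrecognitions(y_true, y_pred, a, b):
    """Count samples of class a predicted as b plus samples of class b predicted as a."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    a_as_b = np.sum((y_true == a) & (y_pred == b))
    b_as_a = np.sum((y_true == b) & (y_pred == a))
    return int(a_as_b + b_as_a)

# Toy labels for illustration only (not experimental data).
y_true = [7, 7, 9, 9, 2, 7]
y_pred = [9, 7, 7, 9, 7, 7]
print(pair_misrecognitions(y_true, y_pred, 7, 9))  # 2
print(pair_misrecognitions(y_true, y_pred, 2, 7))  # 1
```

Evaluating this count after every epoch for a fixed class pair gives one curve per pooling method.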
4.5 Effect of hyperparameters
We evaluated the effect of the hyperparameters, i.e., the pooling kernel size , smoothing window size , and stride . Fig. 9 shows the performance profiles when and were varied to and and and . The results by max pooling are also shown for comparison. These results suggest that the effect of on the results is more significant than . Moreover, the difference between regularized pooling and max pooling was clearer when was larger. This is because the larger the value of was, the stronger the effect of over-compensation due to max pooling was, whereas regularized pooling suppressed it.
The effect of the stride is shown in Fig. 9. This figure summarizes the performance profiles of regularized pooling and max pooling on the MNIST dataset while the stride was varied. The result shows that regularized pooling converged faster than max pooling at all stride values, while a smaller stride yielded better performance.
5 Experiment on Texture Images
In this experiment, we aimed to clarify the characteristics of regularized pooling by analyzing the results of classification for texture images with various structures. In particular, we reveal the kind of images for which regularized pooling is effective.
We used the Kylberg texture dataset [15] that contains classes with unique samples each ( samples for training, samples for testing). Each sample is a grayscale image of size pixels. We resized all images to in the experiment. For weight updating, we used the Adam optimizer with parameters of , , and . The batch size was set to . The network architecture and other experimental conditions were the same as in the experiments described in Section 4.
Fig. 9 shows the confusion matrix on the test set obtained by using max pooling and regularized pooling at 10 and 40 epochs. According to Fig. 9(a), certain classes, such as classes 6, 19, 20, and 21, are almost completely correctly recognized in the early stage of learning. Example images from the classes improved by regularized pooling are shown in Fig. 10(a). The common feature of these images is that they have a periodic structure. Regularized pooling could retain this periodic structure to some extent and thus showed superiority. In Fig. 9(b), however, several classes, such as classes 10, 23, 26, and 27, were not correctly recognized by regularized pooling, even at epoch 40. Example images from these classes are shown in Fig. 10(b), and it can be seen that they are near-random patterns without any specific periodicity, i.e., no clear structure. These results demonstrate that regularized pooling is effective for patterns with a periodic structure. This is because regularized pooling performs spatially continuous operations between adjacent kernels, and therefore preserves frequency information to some extent in the feature map after pooling.
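A confusion matrix and the per-class accuracy read from its diagonal can be computed as in the following NumPy sketch. The function names and the toy labels are illustrative assumptions, and the row/column convention (true class in rows, predicted class in columns) is chosen here rather than taken from the figure.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns index the predicted class."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when an index pair repeats.
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def per_class_accuracy(cm):
    """Fraction of each true class that was recognized correctly."""
    totals = cm.sum(axis=1)
    return np.divide(np.diag(cm), totals, out=np.zeros(len(cm)), where=totals > 0)

# Toy labels for illustration only (not experimental data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 1, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)
print(per_class_accuracy(cm))
```

Classes whose diagonal entry is close to the row total at an early epoch correspond to the classes that are recognized almost completely in the early stage of learning.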
6 Conclusion
We proposed regularized pooling, which enables a local pooling operation suitable for actual deformations. In the traditional max pooling operation, the value selection direction is determined as the maximum value position at each kernel independently. When pooling is viewed as a deformation compensation process, this independent strategy causes over-compensation. In contrast, our regularized pooling operation smooths the value selection directions over the neighboring kernels to suppress over-compensation and thus stabilizes the training process. Through experiments on image recognition, we demonstrated that regularized pooling improves the separability of similar classes and the convergence of learning compared with the conventional pooling methods.
In future work, we will further consider another strategy for smoothing the value selection directions, although we have shown that even simple average-based smoothing is already effective. For example, using an adaptive window size controlled by some spatial and/or channel-wise attention mechanisms will be a possible choice.
References
- [1] (2019) Global sum pooling: a generalization trick for object counting with small datasets of large images. In Proceedings of CVPR Deep Vision Workshop.
- [2] (2017) Loss max-pooling for semantic image segmentation. In Proceedings of CVPR, pp. 7082–7091.
- [3] (2017) EMNIST: extending MNIST to handwritten letters. In Proceedings of IJCNN, pp. 2921–2926.
- [4] (2011) Geometric $\ell_p$-norm feature pooling for image classification. In Proceedings of CVPR, pp. 2609–2704.
- [5] (2016) Compact bilinear pooling. In Proceedings of CVPR, pp. 317–326.
- [6] (2019) LIP: local importance-based pooling. In Proceedings of ICCV, pp. 3355–3364.
- [7] (2014) Multi-scale orderless pooling of deep convolutional activation features. In Proceedings of ECCV, pp. 392–407.
- [8] (2014) Fractional max-pooling. arXiv preprint arXiv:1412.6071.
- [9] (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9), pp. 1904–1916.
- [10] (2017) STD2P: RGBD semantic segmentation using spatio-temporal data-driven pooling. In Proceedings of CVPR, pp. 4837–4846.
- [11] (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
- [12] REMAP: multi-layer entropy-guided pooling of dense CNN features for image retrieval. IEEE Transactions on Image Processing 28 (10), pp. 5201–5213.
- [13] (2019) Global feature guided local pooling. In Proceedings of ICCV, pp. 3365–3374.
- [14] (2018) Ordinal pooling networks: for preserving information over shrinking feature maps. arXiv preprint arXiv:1804.02702.
- [15] (2011) The Kylberg texture dataset v. 1.0. External report (Blue series) Technical Report 35, Centre for Image Analysis, Swedish University of Agricultural Sciences and Uppsala University, Uppsala, Sweden.
- [16] (2016) TI-pooling: transformation-invariant pooling for feature learning in convolutional neural networks. In Proceedings of CVPR, pp. 289–297.
- [17] (2015) Deep learning. Nature 521 (7553), pp. 436–444.
- [18] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- [19] (2019) A simple pooling-based design for real-time salient object detection. In Proceedings of CVPR, pp. 3917–3926.
- [20] (2019) A pooling based scene text proposal technique for scene text reading in the wild. Pattern Recognition 87, pp. 118–129.
- [21] (2018) Detail-preserving pooling in deep networks. In Proceedings of CVPR, pp. 9108–9116.
- [22] (2015) Very deep convolutional networks for large-scale image recognition. In Proceedings of ICLR.
- [23] (2013) Regularization of neural networks using DropConnect. In Proceedings of ICML, pp. 1058–1066.
- [24] (2019) Building detail-sensitive semantic segmentation networks with polynomial pooling. In Proceedings of CVPR, pp. 7115–7123.
- [25] (2014) Mixed pooling for convolutional neural networks. In Proceedings of RSKD, pp. 364–375.
- [26] (2017) S3Pool: pooling with stochastic spatial sampling. In Proceedings of CVPR, pp. 4970–4978.
- [27] (2019) Local temporal bilinear pooling for fine-grained action parsing. In Proceedings of CVPR, pp. 12005–12015.
- [28] (2019) Mining the displacement of max-pooling for text recognition. Pattern Recognition 93, pp. 558–569.