1 Introduction
Deep learning has achieved remarkable performance in many applications, especially in image-related tasks [1, 2, 3]. In image recognition, convolutional neural networks (CNNs) are widely used [4], built from a few components such as convolutional layers and pooling (or subsampling) layers. Pooling layers not only reduce the size of the feature map but also extract features that are more robust to shifts in position, which helps object recognition [4].
Despite its importance, to the best of our knowledge, max pooling or arithmetic average pooling is usually selected for the pooling layers without much consideration. Max pooling selects the highest value in the pooling window, and arithmetic average pooling takes the arithmetic average over the window area. However, these two pooling methods are not optimal [5]: arithmetic average pooling degrades CNN performance by losing the crucial information carried by strong activation values, while max pooling ignores all information except the largest value.
In addition to the max and the arithmetic average, there are many other averaging methods, including the geometric and harmonic averages. How, then, can we find the best one? Selecting it by hand is not in line with the philosophy of deep learning. Moreover, it is practically impossible to find a proper averaging method for each layer whenever the network architecture is reshaped. This may be the fundamental reason why diverse pooling methods are not used in practice. To avoid these limitations, it is desirable to find an optimal averaging method for the pooling layers automatically from the training data.
On the other hand, α-integration was proposed as a general data integration framework [6]. α-integration combines positive values, and its characteristics are determined by the parameter α. It finds the optimal integration of the input values in the sense of minimizing α-divergence. Many averaging models, such as the mixture (or product) of experts model [7, 8], can be considered special cases of α-integration [6]. In addition, a training algorithm was proposed to find the best α value from training data for a given task [9].
In this paper, we propose a new pooling algorithm, α-pooling, which applies α-integration to the pooling layers. α-pooling finds the optimal α values for the pooling layers automatically from training data by backpropagation, since the α values of the layers are parameters like the other parameters of the network (i.e., weights and biases). Thus, when we need subsampling, we do not have to predefine a specific pooling type: with α-pooling, the model finds the optimal pooling for the task from the training data. In our experiments, α-pooling significantly improves the accuracy of image recognition; in other words, max pooling is not the best pooling method. After training, we found that the layers converge to different α values, which means the optimal average type differs from layer to layer.
The rest of the paper is organized as follows. In Section 2, we briefly review α-integration and the pooling methods used in CNNs. In Section 3, we propose α-pooling. Section 4 presents experimental results that confirm our method, followed by the conclusion in Section 5.
2 Background
In this section, we briefly review α-integration [6] and the pooling methods used in CNNs.
2.1 α-Integration
Given two positive measures $m_1(x)$ and $m_2(x)$ of a random variable $x$, the $\alpha$-mean $m_\alpha(x)$ is defined by

$$m_\alpha(x) = f_\alpha^{-1}\Big(\tfrac{1}{2}\big(f_\alpha(m_1(x)) + f_\alpha(m_2(x))\big)\Big), \qquad (1)$$

where $f_\alpha$ is a differentiable monotone function given by

$$f_\alpha(z) = \begin{cases} z^{\frac{1-\alpha}{2}}, & \alpha \neq 1, \\ \log z, & \alpha = 1. \end{cases} \qquad (2)$$

The $\alpha$-mean includes many conventional means as special cases. That is, the arithmetic mean, geometric mean, harmonic mean, minimum, and maximum are specific cases of the $\alpha$-mean with $\alpha = -1$, $1$, $3$, $\infty$, or $-\infty$, respectively, as shown in Fig. 1. As $\alpha$ decreases, the $\alpha$-mean approaches the larger of $m_1$ and $m_2$, and when $\alpha \to -\infty$, the $\alpha$-mean behaves as the max operation.

Given $n$ positive values $x_1, \dots, x_n$, the $\alpha$-mean can be generalized to $\alpha$-integration, which is defined by

$$m_\alpha = f_\alpha^{-1}\Big(\sum_{i=1}^{n} w_i\, f_\alpha(x_i)\Big), \qquad (3)$$

where we assume that all values have the same weight, $w_i = 1/n$.
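As a concrete illustration of Eqs. (1)–(3), the following minimal NumPy sketch (our own, not from the paper; function names are ours) computes the $\alpha$-integration of a list of positive values by applying $f_\alpha$, averaging with equal weights, and inverting:

```python
import numpy as np

def f_alpha(z, alpha):
    """The monotone function f_alpha of Eq. (2)."""
    if alpha == 1:
        return np.log(z)
    return z ** ((1.0 - alpha) / 2.0)

def f_alpha_inv(y, alpha):
    """Inverse of f_alpha."""
    if alpha == 1:
        return np.exp(y)
    return y ** (2.0 / (1.0 - alpha))

def alpha_integration(x, alpha):
    """Alpha-integration of positive values x with equal weights, Eq. (3)."""
    x = np.asarray(x, dtype=float)
    return f_alpha_inv(np.mean(f_alpha(x, alpha)), alpha)

x = [1.0, 2.0, 4.0]
print(alpha_integration(x, -1.0))   # arithmetic mean
print(alpha_integration(x, 1.0))    # geometric mean
print(alpha_integration(x, 3.0))    # harmonic mean
print(alpha_integration(x, -50.0))  # approaches the maximum as alpha decreases
```

For $\alpha = -1$, $f_\alpha$ is the identity and the ordinary arithmetic mean is recovered; for strongly negative $\alpha$, the large exponent in $f_\alpha$ lets the largest input dominate the sum, reproducing the max-like behavior described above.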
2.2 Max Pooling and Average Pooling
CNNs are composed of convolutional layers, nonlinear activation functions, and pooling layers. Convolutional layers extract patterns from local regions of the input images [4]: a filter produces a high value where the corresponding portion of the input matches its feature. A nonlinear function, ReLU in most cases, is then applied to these values, and its output moves on to the pooling layer. Pooling provides positional invariance: the pooled features are less sensitive to the precise locations of structures within the image than the original feature maps. This is a crucial transformation for classification tasks.
Although pooling is an important component of CNNs, max pooling (or sometimes arithmetic average pooling) is selected in most cases without much consideration. Max pooling chooses only the highest value in the pooling window, while arithmetic average pooling takes the arithmetic average of all values in the window area. However, there is no guarantee that these two pooling methods are optimal at all times.
In general, arithmetic average pooling degrades performance in CNNs when ReLU is used. Averaging ReLU activations shrinks the high values, which may carry crucial information, because many zero elements are included in the average. Activation functions that produce both positive and negative values make the problem even worse, because strong positive and negative activations cancel each other out in the average. Although max pooling does not suffer from this problem, it has another one: it gives weight 1 to the largest value and weight 0 to all the others, ignoring all information except the maximum. Moreover, max pooling easily overfits to the training dataset [5]. Thus, there have been many attempts to find better pooling methods [5, 11, 12, 13, 14].
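A small NumPy example (ours, not from the paper) makes the dilution concrete: averaging a post-ReLU window that is mostly zeros shrinks the one strong activation, while max keeps only that value and discards everything else.

```python
import numpy as np

# A typical post-ReLU pooling window: sparse, mostly zeros.
window = np.array([0.0, 0.0, 0.0, 8.0])

avg = window.mean()  # the strong activation 8.0 is diluted by the zeros
mx = window.max()    # the three other entries contribute nothing at all
print(avg, mx)       # prints 2.0 8.0
```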
3 α-Pooling
To find an optimal averaging method for the pooling layers automatically from training data, we propose a generalized pooling method, α-pooling, which applies α-integration to the pooling layers. We treat each pooling layer's α value as a parameter, like the other parameters of the network model (i.e., weights and biases), so that α can be trained by backpropagation, although α can also be trained differently from the other network parameters.
To apply α-integration to the pooling layers, we must meet one constraint: all input values to α-pooling must be positive. This constraint is not a big problem, because CNNs use ReLU as the activation function in most cases, and the output of ReLU contains no negative values. We only need to be careful to avoid zeros, because α-integration cannot be computed when a zero is included. Therefore, we slightly revise the ReLU function by adding a small constant $\epsilon$ to its output, which leads to a new activation function:

$$g(x) = \max(0, x) + \epsilon, \qquad (4)$$

where $\epsilon$ is a small positive number, fixed in our experiments.
After applying $g$, we integrate the activation values with the current $\alpha$, since all values are now positive. Fig. 2 shows an example of how α-pooling works: with different $\alpha$ values, the output of the pooling layer differs. Note that when $\alpha \to -\infty$, the α-integration works as the max operation.
Now, with α-pooling, the model can find the optimal pooling for the task from the training data. All pooling layers can share a single $\alpha$ value, or each layer can learn a different pooling type with its own $\alpha$ value.
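The forward pass described in this section can be sketched as follows (a minimal NumPy illustration, ours rather than the paper's code; the $\epsilon$ value is a hypothetical choice, and in training $\alpha$ would be a learnable parameter updated by backpropagation):

```python
import numpy as np

EPS = 1e-4  # epsilon of Eq. (4); a hypothetical value, not the paper's exact choice

def relu_eps(x):
    """Modified ReLU of Eq. (4): strictly positive so f_alpha is well defined."""
    return np.maximum(x, 0.0) + EPS

def alpha_pool2x2(fmap, alpha):
    """Alpha-pooling over non-overlapping 2x2 windows of a 2D feature map.

    Forward pass only; during training, alpha would be trained along with
    the weights and biases of the network.
    """
    h, w = fmap.shape
    # Group each 2x2 window into the last axis of shape (h//2, w//2, 4).
    x = relu_eps(fmap).reshape(h // 2, 2, w // 2, 2)
    x = x.transpose(0, 2, 1, 3).reshape(h // 2, w // 2, 4)
    # Apply f_alpha, average with equal weights, then invert (Eq. (3)).
    fx = np.log(x) if alpha == 1 else x ** ((1.0 - alpha) / 2.0)
    m = fx.mean(axis=-1)
    return np.exp(m) if alpha == 1 else m ** (2.0 / (1.0 - alpha))

fmap = np.array([[1.0, 3.0],
                 [0.0, 2.0]])
print(alpha_pool2x2(fmap, -1.0))    # ~ arithmetic average pooling
print(alpha_pool2x2(fmap, -100.0))  # ~ max pooling: output approaches 3.0
```

With $\alpha = -1$ the layer reproduces (ε-shifted) average pooling, and as $\alpha$ becomes strongly negative the output approaches the window maximum, so a single learned scalar interpolates between the two classical pooling types.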
4 Experiments
We present experimental results with different models on two datasets: MNIST and CIFAR10. The results confirm that α-pooling outperforms max pooling, which implies that max pooling is not the optimal pooling type. Moreover, the layers converge to different optimal pooling policies, which means there is no single optimal pooling type.
4.1 Model
As shown in Fig. 3, we use two CNN models in the experiments. First, we set up a simple CNN model to minimize the impact of other techniques and to isolate the effect of α-pooling on image recognition. This model consists of two convolutional layers and two pooling layers. Second, we use the VGG model, which has five max pooling layers, to check whether α-pooling works well in complex models.
4.2 Data
To evaluate the different pooling methods, we train the CNN models described above on the MNIST and CIFAR10 datasets. MNIST includes handwritten digit images of 10 classes (digits 0–9), split into a training set (60K images) and a test set (10K images). The CIFAR10 dataset includes images of 10 classes, with 50K training images and 10K test images.
4.3 Results
Since models with α-pooling find the optimal pooling type, we can make meaningful observations based on the learned $\alpha$ values. First, max pooling may not be the optimal pooling method. As presented in Table 1, α-pooling outperforms max pooling on the image recognition tasks; in other words, α-pooling with a specific $\alpha$ value fits the given tasks better than the conventional max pooling does. If max pooling were optimal for our experimental models, the $\alpha$ values should converge to $-\infty$. However, as Fig. 4 shows, all $\alpha$ values converge to values between $-10$ and $0$, rather than decreasing without bound to turn α-pooling into max pooling.
Table 1: Accuracy (%)

Datasets   Models   Max pooling   α-pooling
MNIST      CNNs     99.37         99.49
CIFAR10    CNNs     72.52         74.07
CIFAR10    VGG      92.38         93.71
In addition, Fig. 4 shows that the $\alpha$ values of different layers converge to different values. This implies that each layer has a different optimal pooling because its role is different, and hence that there is no single optimal pooling type.
5 Conclusion
In this paper, we questioned the common choice of pooling methods, which compute a representative value within each window. We proposed α-pooling, which includes the previous pooling methods as special cases. The parameter $\alpha$ of α-pooling is trainable from the training data, and the converged $\alpha$ value determines the best pooling type automatically.
Experimental results confirm that α-pooling improves performance, implying that max pooling is not optimal. Moreover, different pooling layers learn different $\alpha$ values, so there is no single optimal pooling type for all cases. As future work, we plan to analyze the meaning of the different $\alpha$ values in detail.
6 Acknowledgement
This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018000749, Development of virtual network management technology based on artificial intelligence) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2017R1D1A1B03033341).
References
[1] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.
[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1106–1114.
[3] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Association for the Advancement of Artificial Intelligence (AAAI), 2017, pp. 4278–4284.
[4] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org.
[5] Matthew D. Zeiler and Rob Fergus, "Stochastic pooling for regularization of deep convolutional neural networks," CoRR, vol. abs/1301.3557, 2013.
[6] S. Amari, "Integration of stochastic models by minimizing α-divergence," Neural Computation, vol. 19, pp. 2780–2796, 2007.
[7] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, pp. 79–81, 1991.
[8] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, pp. 1771–1800, 2002.
[9] H. Choi, S. Choi, and Y. Choe, "Parameter learning for alpha-integration," Neural Computation, vol. 25, no. 6, pp. 1585–1604, 2013.
[10] H. Choi, A. Katake, S. Choi, and Y. Choe, "Alpha-integration of multiple evidence," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, TX, 2010, pp. 2210–2213.
[11] Antoine Miech, Ivan Laptev, and Josef Sivic, "Learnable pooling with context gating for video classification," CoRR, vol. abs/1706.06905, 2017.
[12] Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell, "Compact bilinear pooling," in Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 317–326.
[13] Matthijs Douze, Jérôme Revaud, Cordelia Schmid, and Hervé Jégou, "Stable hyper-pooling and query expansion for event detection," in International Conference on Computer Vision (ICCV), 2013, pp. 1825–1832.
[14] Çaglar Gülçehre, KyungHyun Cho, Razvan Pascanu, and Yoshua Bengio, "Learned-norm pooling for deep feedforward and recurrent neural networks," in European Conference on Machine Learning (ECML), 2014, pp. 530–546.