Deep learning has achieved remarkable performance in many applications, especially in image-related tasks [1, 2, 3]. In image recognition, convolutional neural networks (CNNs) are heavily used, built from a few components such as convolutional layers and pooling (or sub-sampling) layers. Pooling layers not only reduce the size of the feature map but also extract features that are more robust to position or movement, which helps object recognition.
Despite its importance, to the best of our knowledge, max pooling and arithmetic average pooling are usually selected for the pooling layers without much consideration. Max pooling selects the highest value in the pooling window, and arithmetic average pooling takes the arithmetic average over the window area. However, neither pooling method is optimal. Arithmetic average pooling degrades performance in CNNs by losing crucial information carried by strong activation values, while max pooling ignores all information except the largest value.
In addition to the max and the arithmetic average, there are many other average methods, including the geometric and harmonic averages. How, then, can we find the best average method? Selecting it by hand is not in line with the philosophy of deep learning. Moreover, it is practically impossible to find a proper average method for every layer whenever the network architecture is reshaped. This might be the fundamental reason why diverse pooling methods are not used in practice. To avoid these limitations, it is desirable to find an optimal average method for the pooling layers automatically from training data.
On the other hand, α-integration was proposed as a general data integration framework. α-integration integrates positive values, and the characteristics of the integration are determined by the parameter α. It finds the optimal integration of the input values in the sense of minimizing the α-divergence. Many average models, such as the mixture (or product) of experts model [7, 8], can be considered special cases of α-integration. In addition, a training algorithm was proposed to find the best α value from training data for a given task.
In this paper, we propose a new pooling algorithm, alpha-pooling, which applies α-integration to the pooling layers. Alpha-pooling finds the optimal α values for the pooling layers automatically from training data by backpropagation, since the α values are parameters like the other parameters of the network (i.e., weights or biases). So, when we need sub-sampling, we do not have to predefine a specific pooling type. With alpha-pooling, the model finds the optimal pooling for the task from training data.
In experiments, alpha-pooling significantly improves the accuracy of image recognition; in other words, max pooling is not the best pooling method. After training, we found that the layers have different α values, which means the optimal average type differs from layer to layer.
The rest of the paper is organized as follows. In Section 2, we briefly review α-integration and the pooling methods in CNNs. In Section 3, we propose alpha-pooling. In Section 4, experimental results confirm our method, followed by the conclusion in Section 5.
In this section, we briefly review α-integration and pooling methods in CNNs.
Given two positive measures of a random variable $x$, $m_1(x)$ and $m_2(x)$, the α-mean is defined by

$$ m_\alpha(x) = f_\alpha^{-1}\left( \frac{f_\alpha(m_1(x)) + f_\alpha(m_2(x))}{2} \right), $$

where $f_\alpha$ is a differentiable monotone function given by

$$ f_\alpha(z) = \begin{cases} z^{\frac{1-\alpha}{2}}, & \alpha \neq 1, \\ \log z, & \alpha = 1. \end{cases} $$
The α-mean includes many average methods as special cases, such as the arithmetic mean, the geometric mean, and the harmonic mean with $\alpha = -1$, $1$, or $3$, respectively, as shown in Fig. 1. As $\alpha$ decreases, the α-mean approaches the larger of $m_1(x)$ or $m_2(x)$, and when $\alpha = -\infty$, the α-mean behaves as the max operation.
Given $M$ positive values $m_1, \dots, m_M$, the α-mean can be generalized to α-integration, which is defined by

$$ \hat{m}_\alpha = f_\alpha^{-1}\left( \frac{1}{M} \sum_{i=1}^{M} f_\alpha(m_i) \right), $$

where we assume that all the values have the same weight.
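The α-integration above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names are ours, and we assume the form $f_\alpha(z) = z^{(1-\alpha)/2}$ for $\alpha \neq 1$ and $\log z$ for $\alpha = 1$.

```python
import numpy as np

def f_alpha(z, alpha):
    # Monotone function: z^((1-alpha)/2) for alpha != 1, log z for alpha = 1.
    if alpha == 1:
        return np.log(z)
    return z ** ((1.0 - alpha) / 2.0)

def f_alpha_inv(y, alpha):
    # Inverse of f_alpha, valid for positive inputs.
    if alpha == 1:
        return np.exp(y)
    return y ** (2.0 / (1.0 - alpha))

def alpha_integration(values, alpha):
    # Equal-weight alpha-integration of M positive values.
    values = np.asarray(values, dtype=float)
    return f_alpha_inv(np.mean(f_alpha(values, alpha)), alpha)
```

In this sketch, $\alpha = -1$ reproduces the arithmetic mean, $\alpha = 1$ the geometric mean, $\alpha = 3$ the harmonic mean, and a large negative $\alpha$ approaches the maximum of the inputs.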
2.2 Max Pooling and Average Pooling
CNNs are composed of convolutional layers, nonlinear functions, and pooling layers. Convolutional layers extract patterns from local regions of the input images. Filters in convolutional layers produce high values when a portion of the input image matches their feature. Then, a nonlinear function
is applied to the values, which in most cases is ReLU. The output of the nonlinear function moves to the pooling layer. Pooling provides positional invariance to the features, which become less sensitive to the precise locations of structures within the image than the original feature maps. This is a crucial transformation for classification tasks.
Although pooling is an important component of CNNs, max pooling (or sometimes arithmetic average pooling) is selected in most cases without much consideration. Max pooling chooses only the highest value in the pooling window, while arithmetic average pooling takes the arithmetic average of all values in the window area. However, there is no guarantee that these two pooling methods are the best at all times.
In general, arithmetic average pooling degrades the performance of CNNs when ReLU is used: averaging ReLU activation values dilutes the high values, which might carry crucial information, because many zero elements are included in the average. Activation functions that also produce negative values make the problem worse, since strong positive and negative activations cancel each other out in the average. Although max pooling does not suffer from this problem, it has another: it gives weight 1 to the largest value and weight 0 to all the others, so it ignores all information except the largest value. Moreover, max pooling easily overfits the training dataset. Thus, there have been many attempts to find better pooling methods [5, 11, 12, 13, 14].
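The two failure modes can be seen in a toy example (our own illustration, not from the paper): a single strong ReLU activation surrounded by zeros.

```python
import numpy as np

# A flattened 2x2 pooling window after ReLU: one strong activation among zeros.
window = np.array([6.0, 0.0, 0.0, 0.0])

avg = window.mean()  # 1.5: the strong activation is diluted by the zero entries
mx = window.max()    # 6.0: preserved, but every non-maximal value is discarded
```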
3 Alpha Pooling
To find an optimal average method for the pooling layers automatically from training data, we propose a generalized pooling method, alpha-pooling, which applies α-integration to the pooling layers. We treat alpha-pooling's α value as a parameter like the other parameters of the network model (i.e., weights or biases), so that α can be trained by backpropagation, although α could also be trained differently from the other network parameters.
To apply α-integration to the pooling layers, we must meet one constraint: all input values to alpha-pooling must be positive. This constraint is not a big problem, because CNNs use ReLU as the activation function in most cases, and there are no negative values in the output of ReLU. We just need to be careful to avoid zeros, because the α-integration cannot be calculated when zero is included. Therefore, we slightly revise the ReLU function by adding a small constant $\varepsilon$ to its output, which leads to a new activation function:

$$ \mathrm{ReLU}_\varepsilon(x) = \max(0, x) + \varepsilon, $$

where $\varepsilon$ is a small positive number.
After applying this modified ReLU, we α-integrate the activation values with the current α, since all the values are now positive. Fig. 2 shows an example of how alpha-pooling works: with different α values, the output of the pooling layer differs. Note that when $\alpha \to -\infty$, the integration works as the max operation.
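This forward pass can be sketched in NumPy as follows. This is our own illustration, not the authors' implementation: the names, the 2x2 window size, and the epsilon value are assumptions, and the $\alpha = 1$ branch is omitted for brevity.

```python
import numpy as np

EPS = 1e-4  # assumed small constant; the paper's exact epsilon is not reproduced here

def relu_eps(x):
    # ReLU shifted by epsilon so every pooled value is strictly positive.
    return np.maximum(x, 0.0) + EPS

def alpha_pool_2x2(feature_map, alpha):
    # Non-overlapping 2x2 alpha-pooling on an (H, W) map with even H and W,
    # assuming f_alpha(z) = z^((1-alpha)/2) with alpha != 1.
    a = relu_eps(feature_map)
    h, w = a.shape
    # Gather each 2x2 window into the last axis.
    windows = a.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(h // 2, w // 2, 4)
    p = (1.0 - alpha) / 2.0
    # Apply f_alpha, average over the window, then apply the inverse.
    return np.mean(windows ** p, axis=-1) ** (1.0 / p)
```

In a deep learning framework, alpha would be registered as a trainable parameter so that backpropagation updates it together with the weights; here it is a plain argument.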
Now, with alpha-pooling, the model can find the optimal pooling for the task from training data. All pooling layers can share a single α value, or each layer can obtain a different pooling type with its own α value through training.
We present experimental results with different models on two datasets: MNIST and CIFAR10. The results confirm that alpha-pooling outperforms max pooling, which implies that max pooling is not the optimal pooling type. Also, different layers converge to different optimal pooling policies, which means there is no single optimal pooling type.
As shown in Fig. 3, we use two CNN models for the experiments. First, we set up a simple CNN model to minimize the impact of other techniques and isolate the effect of alpha-pooling on image recognition. This model consists of two convolutional layers and two pooling layers. Second, we take the VGG model to check whether alpha-pooling works well in complex models. The VGG model has 5 max pooling layers.
To evaluate different pooling methods, we train the CNN models described above on the MNIST and CIFAR10 datasets. MNIST consists of handwritten digit images of 10 classes (digits 0-9), split into a training set (60K images) and a test set (10K images). The CIFAR10 dataset also includes images of 10 classes, with 50K training images and 10K test images.
Since models with alpha-pooling find the optimal pooling type, we can make meaningful observations based on the α values. First, max pooling may not be the optimal pooling method. As presented in Table 1, alpha-pooling outperforms max pooling on the image recognition tasks. In other words, alpha-pooling with a specific α value is better suited to the given tasks than the conventional max pooling. If max pooling were optimal for our experimental models, the α values should converge to $-\infty$. However, as shown in Fig. 4, all α values converge to values between -10 and 0, rather than decreasing further toward the point where alpha-pooling becomes max pooling.
In addition, Fig. 4 shows that the α values for different layers converge to different values. This implies that each layer has a different optimal pooling because its role is different, and that there is no single optimal pooling type.
In this paper, we questioned the standard pooling methods, which find a representative value within a window. We then proposed alpha-pooling, which includes the previous pooling methods as special cases. The α parameter of alpha-pooling is trainable from data, and the converged α value determines the best pooling type automatically.
Experimental results confirm that alpha-pooling improves performance, implying that max pooling is not optimal. Also, different pooling layers converge to different α values, so there is no single optimal pooling type for all cases. As future work, we plan to analyze the meaning of the different α values in detail.
This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00749, Development of virtual network management technology based on artificial intelligence) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2017R1D1A1B03033341).
-  Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1106–1114.
-  Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” in Association for the Advancement of Artificial Intelligence (AAAI), 2017, pp. 4278–4284.
-  Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org.
-  Matthew D. Zeiler and Rob Fergus, “Stochastic pooling for regularization of deep convolutional neural networks,” CoRR, vol. abs/1301.3557, 2013.
-  S. Amari, “Integration of stochastic models by minimizing α-divergence,” Neural Computation, vol. 19, pp. 2780–2796, 2007.
-  R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, pp. 79–81, 1991.
-  G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Computation, vol. 14, pp. 1771–1800, 2002.
-  H. Choi, S. Choi, and Y. Choe, “Parameter learning for alpha-integration,” Neural Computation, vol. 25, no. 6, pp. 1585–1604, 2013.
-  H. Choi, A. Katake, S. Choi, and Y. Choe, “Alpha-integration of multiple evidence,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, TX, 2010, pp. 2210–2213.
-  Antoine Miech, Ivan Laptev, and Josef Sivic, “Learnable pooling with context gating for video classification,” CoRR, vol. abs/1706.06905, 2017.
-  Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell, “Compact bilinear pooling,” in Computer Vision and Pattern Recognition (CVPR), 2016, pp. 317–326.
-  Matthijs Douze, Jérôme Revaud, Cordelia Schmid, and Herve Jegou, “Stable hyper-pooling and query expansion for event detection,” in International Conference on Computer Vision (ICCV), 2013, pp. 1825–1832.
-  Çaglar Gülçehre, KyungHyun Cho, Razvan Pascanu, and Yoshua Bengio, “Learned-norm pooling for deep feedforward and recurrent neural networks,” in European Conference on Machine Learning (ECML), 2014, pp. 530–546.