1 Introduction
Convolutional neural networks (CNNs) [12], which are particularly suited for visual tasks such as image classification [11], object detection [17], segmentation [6], and modeling video evolution [3], are one of the main drivers of deep learning. A typical deep CNN architecture consists of three types of layers: 1) convolutional layers, which extract features or activations from an input image or feature maps; 2) pooling layers, a downsampling technique that aggregates the elements within a pooling region so that the spatial size of the feature maps shrinks; and 3) fully connected layers, which carry out the classification from the extracted features at the end of the network. Many types of CNNs have been reported in the literature, for instance network-in-network (NIN) [14], residual networks (ResNets) [7], inception networks [20], squeeze-and-excitation networks (SENets) [8], and densely connected convolutional networks (DenseNets) [9].
Replicating convolution kernels across the spatial dimensions in CNNs enables weight sharing across space. This helps achieve equivariance, i.e. a translation of an object in an input image results in an equivalent translation in the activations of the output feature map. The pooling operation, on the other hand, tends to achieve translational invariance, i.e. a translation of an object in an input image does not influence the output of the network. This pooling operation is most commonly performed either by average pooling (shortened to avg-pooling in the following), where all the activations in a pooling region are averaged together, or by max-pooling, where only the element with the maximum activation is retained. A theoretical analysis of these two pooling operations reveals that neither technique is optimal [2]. Yet, it has sometimes been argued that max-pooling achieves better performances than avg-pooling because avg-pooling treats all the elements equivalently irrespective of their activations, which results in an undervaluation of the elements with higher activations, while the elements with smaller activations are overestimated [18, 1].
This work presents an alternative pooling scheme, named ordinal pooling, that generalizes the classic avg- and max-pooling operations and resolves the issue of unfair valuation of the elements in a pooling region, while still preserving the information from the other activations. In this scheme, all the elements in a pooling region are first ordered based upon their activations and then combined via a weighted sum, where the weights are assigned depending upon the orders of the elements and are learned with a standard gradient-based optimization during the training phase. Moreover, a key difference between ordinal pooling and a classic pooling layer is that while a typical pooling acts upon each feature map in the same way, ordinal pooling learns a different set of weights for each feature map and therefore allows much more flexibility in the pooling layer.
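The two-step scheme described above (sort, then weighted sum) can be sketched in a few lines of plain Python; the function name and the flat-list representation of a pooling region are illustrative choices, not part of the original formulation:

```python
def ordinal_pool(region, weights):
    """Ordinal pooling of a single region: sort the activations in
    decreasing order, then take a weighted sum with rank-based weights."""
    ordered = sorted(region, reverse=True)  # largest activation first
    return sum(w * a for w, a in zip(weights, ordered))

region = [3.0, 1.0, 4.0, 2.0]  # a flattened 2x2 pooling region

# Particular weight kernels recover the classic pooling operators:
max_out = ordinal_pool(region, [1.0, 0.0, 0.0, 0.0])     # max-pooling -> 4.0
avg_out = ordinal_pool(region, [0.25, 0.25, 0.25, 0.25])  # avg-pooling -> 2.5
min_out = ordinal_pool(region, [0.0, 0.0, 0.0, 1.0])     # min-pooling -> 1.0
```

During training, the weights are free to move between these extremes, which is what allows ordinal pooling to interpolate between avg-, max-, and even min-pooling behaviors.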
2 Related Works
The idea of a rank-based weighted aggregation was first introduced by Kolesnikov et al. [10] in the context of image segmentation, who proposed a global weighted rank-pooling (GWRP) to estimate the score associated with a segmentation class. However, GWRP is used only as a global pooling procedure, as it acts upon all the elements of a feature map to generate the score of a particular segmentation class. Also, contrary to ordinal pooling, the weights that are assigned based on the order of the elements are determined from a hyperparameter and therefore do not change during the training.
In addition to GWRP [10], other variants of rank-based pooling have been introduced by Shi et al. [19], who proposed three pooling schemes based upon the rank of the elements: 1) average, 2) weighted, and 3) stochastic. Unlike ordinal pooling, all of these schemes require an additional hyperparameter, which is thus not learned in a differentiable way. Indeed, in the first scheme, the hyperparameter is fixed to determine the threshold for choosing the activations to be averaged. In the second scheme, it is fixed to generate the weights to be applied to the activations, which remain the same across all the feature maps, while in the third scheme, a set of probabilities is generated based upon this hyperparameter and is used to select an element in a pooling region.
Other works focus upon generalizing the pooling operation. Gulcehre et al. [5] regard pooling as an $L_p$ norm, where the values $p = 1$ and $p \to \infty$ of the parameter correspond to avg- and max-pooling, whereas $p$ itself is learned during the training. Pinheiro et al. [16] use a smooth convex approximation of max-pooling, called Log-Sum-Exp, where a hyperparameter controls the smoothness of the approximation, so that pixels with similar scores have a similar weight in the training process. Lee et al. [13] propose mixing avg-pooling and max-pooling together through a trainable parameter, and also introduce the idea of tree pooling to learn different pooling filters and combine these filters responsively.
Since a $2 \times 2$ max-pooling with a stride of 2 in each spatial dimension discards three quarters of the activations of a feature map upon its application, it is an aggressive operation, which after a series of applications can result in a significant loss of information. To apply pooling in a gentler manner, a fractional max-pooling [4] has been proposed, where the dimensions of the feature map can be reduced by a non-integer factor. In the spirit of also allowing information from other activations within a pooling region to pass to the next layer, a stochastic version of pooling has been proposed by Zeiler et al. [21], where an element in a pooling region is selected based upon its probability within the multinomial distribution constructed from all the activations inside the pooling region. Another stochastic variant of pooling, S3Pool [22], is a two-step pooling technique, where in the first step, a max-pooling with a stride of 1 is applied, while in the second step, a stochastic downsampling is performed. The combination of these operations makes S3Pool act as a strong regularization technique.

3 Method
A pooling operator can be seen as a real-valued function defined on the finite non-empty subsets of real numbers. In particular, the avg-pooling and max-pooling operators, noted $p_{\mathrm{avg}}$ and $p_{\max}$, are respectively defined by

$$p_{\mathrm{avg}}(S) = \frac{1}{|S|} \sum_{s \in S} s \qquad \text{and} \qquad p_{\max}(S) = \max_{s \in S} s. \quad (1)$$
In CNNs, a pooling layer is used to decrease the spatial resolution of the feature maps obtained after the application of a nonlinear activation on responses to trainable convolutional filters. A pooling layer thus transforms an input tensor $T \in \mathbb{R}^{H \times W \times C}$ of spatial resolution $H \times W$ with $C$ feature maps (or channels) to an output tensor $T' \in \mathbb{R}^{H' \times W' \times C}$ with $H' < H$ and $W' < W$. This is commonly done via a pooling operation, which consists of slicing $T$ into pooling regions and applying a same pooling operator to each channel of each region. In a conventional CNN, the same pooling operator is used for all the feature maps and remains fixed as the network trains.

In this work, we introduce the ordinal pooling layer, whose pooling operator involves trainable weights that are specific to each feature map. In the case of a $k \times k$ ordinal pooling layer, a trainable weight kernel $w^{(c)} \in \mathbb{R}^{k \times k}$ is used to pool the $k \times k$ regions located on the $c$-th feature map of the input tensor $T$. The ordinal pooling operator associated with $w^{(c)}$, defined on a region $R \in \mathbb{R}^{k \times k}$, is given by

$$p_{w^{(c)}}(R) = \sum_{m=1}^{k} \sum_{n=1}^{k} w^{(c)}_{m,n} \, o(R)_{m,n}, \quad (2)$$
where $o$ is a function from $\mathbb{R}^{k \times k}$ to $\mathbb{R}^{k \times k}$ that reorders the values of its input tensor based upon a given ranking process. In this work, we consider that $o$ reorders the activations of a tensor based upon the decreasing order of their values, such that, for $1 \le m, m' \le k$ and $1 \le n, n' \le k$:

$$o(R)_{m,n} \ge o(R)_{m',n'} \quad \text{whenever} \quad k\,(m-1) + n \le k\,(m'-1) + n'. \quad (3)$$
This implies that, for example, the first (resp. last) weight of the kernel always multiplies the largest (resp. smallest) value of the reordered pooling region. An illustration of ordinal pooling is represented in Figure 1. In practice, we constrain each kernel to contain only positive weights that sum to 1. This is imposed to adhere to the common principle that a pooling operation is designed to aggregate the values comprised in a tensor and should thus output a value located in their convex hull. In particular, this guarantees that the output value is comprised between the minimum and the maximum values of the input tensor. An algorithm of the main workflow for the forward pass and the update of the weights is detailed in supplementary material to show how ordinal pooling can be implemented in the usual case. Let us note that, since ordinal pooling employs a different set of weights for each feature map, the total number of parameters introduced by this operation is $k^2 C$ (one $k \times k$ kernel per feature map, with $C$ feature maps), which is negligible compared with convolutional and fully connected layers.
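As a rough sketch of how such a layer could be implemented (the per-channel kernel layout, the softmax reparameterization used to keep the weights positive and summing to 1, and all names below are our own assumptions, not the authors' reference implementation):

```python
import numpy as np

def ordinal_pool_layer(x, kernels, k=2):
    """Ordinal pooling over non-overlapping k x k regions.

    x:       input of shape (C, H, W), with H and W divisible by k
    kernels: per-channel weight kernels of shape (C, k*k); each row is
             assumed positive and normalized to sum to 1
    """
    C, H, W = x.shape
    out = np.empty((C, H // k, W // k))
    for c in range(C):
        for i in range(0, H, k):
            for j in range(0, W, k):
                region = x[c, i:i + k, j:j + k].ravel()
                ordered = np.sort(region)[::-1]  # decreasing order
                out[c, i // k, j // k] = ordered @ kernels[c]
    return out

def normalize(raw):
    """Softmax over each row: positive weights that sum to 1."""
    e = np.exp(raw - raw.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

With `kernels[c] = [1/4, 1/4, 1/4, 1/4]`, this reduces to a $2 \times 2$ avg-pooling; with `[1, 0, 0, 0]`, it reduces to a $2 \times 2$ max-pooling. Reparameterizing through a softmax (as in `normalize`) is one way to enforce the positivity and sum-to-one constraints with unconstrained gradient descent.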
Ordinal pooling generalizes the commonly used avg- and max-pooling operators. Indeed, avg-pooling is a particular case of ordinal pooling for which every weight of the kernel equals $1/k^2$. Likewise, max-pooling corresponds to the case where the weight applied to the largest value equals 1 and all the other weights equal 0. Also, compared with the other trainable pooling operations in the literature, ordinal pooling is the only technique that can lead to a min-pooling behavior.
4 Experiments
4.1 Proof-of-concept
Setup.
We perform the following proof-of-concept experiment on MNIST. Let us consider a baseline network N comprising average pooling layers and its ordinal counterpart ON, wherein the average pooling layers are replaced by ordinal pooling layers. Since N and ON have the same structure, we initialize them in exactly the same way, with the same weights for the non-pooling layers. The sole difference between N and ON is the additional weights required by the ordinal pooling layers. These weights are initialized with an “average pooling” initialization, i.e. for an ordinal pooling kernel of size $k \times k$, each weight is initialized as $1/k^2$. This implies that, before starting the training of a baseline network N and its ON counterpart, the two networks are in exactly the same state: they produce the same output if they are fed with the same input. Moreover, we fix all the random seeds, so that the two networks experience exactly the same batches of images, in the same order, with the same data augmentation, at any time over the course of their training. To guarantee the reproducibility of the experiments and to avoid GPU-based non-determinism, these experiments are carried out on CPU. This setting allows us to compare the results of N and ON pairwise, for each run of the experiment, which provides a fairer and more significant insight on the intrinsic superiority of one network over the other.

Networks compared.
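The “average pooling” initialization can be checked numerically: with every weight set to $1/k^2$, the sorting step becomes irrelevant and an ordinal pooling layer is exactly an average pooling layer (a small self-contained check; the function names are illustrative):

```python
def ordinal_pool(region, weights):
    """Sort the region in decreasing order, then take a weighted sum."""
    ordered = sorted(region, reverse=True)
    return sum(w * a for w, a in zip(weights, ordered))

def avg_pool(region):
    return sum(region) / len(region)

k = 2
avg_init = [1.0 / k**2] * (k * k)  # "average pooling" initialization

region = [0.7, -1.2, 3.4, 0.0]
# Sorting is irrelevant when all weights are equal, so the two match:
assert abs(ordinal_pool(region, avg_init) - avg_pool(region)) < 1e-12
```

This is what guarantees that a baseline network N and its ordinal counterpart ON start from the same state: until the ordinal weights move away from $1/k^2$, both compute identical outputs.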
Three “baseline networks” are used in the experiment, described as follows with standard compact notations:

“Baseline”: Conv , pooling, Conv , global pooling, FC(), softmax.

“Baseline2”: Conv ZP, pooling, Conv ZP, pooling, Conv ZP, global pooling, FC(), softmax.

“LeNet-5”: based on the LeNet-5 architecture [12].
These networks have their “ordinal” counterpart, e.g. “Ordinal baseline2”, for which the “(global) pooling” layers are replaced by “(global) ordinal pooling” layers. As mentioned above, “(global) average pooling” layers are used as pooling layers in the classic pooling setting, while the “average pooling” initialization is used to instantiate the weights of the ordinal pooling kernels. More details about the training of these networks are provided in supplementary material.
Results.
Each network is run multiple times, where the runs differ by their initial random seeds. Figure 2 shows the average learning curves for the network “Baseline” and its “Ordinal baseline” counterpart. It can be seen that ordinal pooling allows the network to reach better performances in terms of accuracy and loss, while also speeding up the training process.
The pairwise comparison of the performances of the networks versus their ordinal counterparts is represented in Figure 2. As can be seen, the “Baseline” and “Baseline2” networks employing ordinal pooling always achieve a smaller training loss at the end of the training, and the same holds for the “LeNet-5” network in most runs. Regarding test error rates, the ordinal pooling networks outperform the classic ones in a majority of the runs for the “Baseline”, “Baseline2”, and “LeNet-5” cases. Even though ordinal pooling seems less beneficial to LeNet-5, the results have to be put in perspective with respect to the extra cost in parameters that the ordinal pooling layers require. In fact, as can be inferred from Table 1, LeNet-5 has by far the best ratio between the relative improvement in performances (both in terms of average and variance in test error rate) and the number of additional parameters contained in the ordinal pooling layers. Table 1 suggests that, overall, the performances of the networks are boosted with ordinal pooling by a comfortable margin through a reduced average test error rate, and that ordinal pooling provides a more consistent convergence between different runs through a reduced variance in test error rate. These benefits come at a moderate cost in terms of number of parameters.

Relative variation in | Baseline | Baseline2 | LeNet-5
average/variance of training loss | | |
average/variance of test loss | | |
average/variance of test error rate | | |
number of parameters | | |
Distribution of ordinal pooling kernels.
The use of ordinal pooling instead of a classic pooling operation makes it possible to study the distributions of the weight kernels in the ordinal pooling layers, as learned by the networks, and helps discover how the trained networks chose to perform the pooling operations. For that purpose, we compare the learned kernels with some template kernels that characterize various categories of behaviors, including avg- and max-pooling-like behaviors. These template kernels are chosen based on the behavior that they induce, as explained below.
In the case of a $2 \times 2$ ordinal pooling layer, a weight kernel (written here as a vector of four weights sorted by rank, from the weight applied to the largest value to the weight applied to the smallest) leads to max-pooling if it converges to $(1, 0, 0, 0)$ and to avg-pooling if each weight equals $1/4$. In the first case, the network “promotes” only the largest value of each pooling region, while in the second case, all the values are equally “promoted”. Nevertheless, a network may prefer to promote, for example, the lowest value of the regions (thus a min-pooling behavior) by using the kernel $(0, 0, 0, 1)$, or its first two largest values with $(1/2, 1/2, 0, 0)$. The template kernels are determined based upon this idea of enumerating all the ways that some values can be promoted by the network. In fact, it has the possibility to promote any one of the four ordered values of the regions by making the kernel converge to one of the following four template kernels: $(1, 0, 0, 0)$, $(0, 1, 0, 0)$, $(0, 0, 1, 0)$, $(0, 0, 0, 1)$.
In the same spirit, it may prefer to promote equally two, three, or all four values of the ordered regions, making the kernel converge to a template whose nonzero weights are equal and sum to 1, such as $(1/2, 1/2, 0, 0)$, $(1/3, 1/3, 1/3, 0)$, or $(1/4, 1/4, 1/4, 1/4)$.
We note $\mathcal{T}_n$ the set of template kernels having $n$ nonzero values. After the training of a network, for each kernel of a $2 \times 2$ ordinal pooling layer, we identify its closest template kernel in terms of Euclidean distance. We examine the distribution of the learned kernels by grouping them by “closest template kernel” to find out how the network chooses to perform the pooling operations. These distributions for the ordinal pooling layers of “Baseline2”, grouped by $\mathcal{T}_n$ and aggregated over the runs, are displayed in Figure 3 (top left and top right).
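The nearest-template assignment can be sketched as follows; the enumeration of all rank subsets for the $2 \times 2$ case and the names used are our assumptions, intended only to illustrate the grouping procedure:

```python
from itertools import combinations

def make_templates(size=4):
    """All kernels that promote a subset of the ranked positions equally:
    n nonzero entries, each equal to 1/n, for every subset of positions."""
    templates = []
    for n in range(1, size + 1):
        for idx in combinations(range(size), n):
            t = [0.0] * size
            for i in idx:
                t[i] = 1.0 / n
            templates.append(tuple(t))
    return templates

def closest_template(kernel, templates):
    """Return the template minimizing the Euclidean distance to the kernel."""
    return min(templates,
               key=lambda t: sum((a - b) ** 2 for a, b in zip(kernel, t)))

templates = make_templates()  # 4 + 6 + 4 + 1 = 15 templates for a 2x2 kernel
# A learned kernel slightly tilted towards the two largest values:
learned = (0.55, 0.25, 0.12, 0.08)
best = closest_template(learned, templates)  # -> (0.5, 0.5, 0.0, 0.0)
```

Grouping the learned kernels by the value of `best` then yields the distributions discussed below.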
It can be seen that, even though all the kernels were initialized as the avg-pooling kernel $(1/4, 1/4, 1/4, 1/4)$, after the training less than half of them remain closer to it than to any other template kernel. Another observation is that the network seems to prefer promoting contiguous “extreme” (largest or smallest) values of the sorted regions, i.e. when two (resp. three) values are promoted, the associated kernels are preferably closer to the templates promoting the two (resp. three) largest or the two (resp. three) smallest values. This “extreme” aspect is reinforced in one group of the top right plot, in which almost only the templates promoting the single largest and the single smallest values are present. Hence, some kernels actually display a “min-pooling” behavior.
Also, the network prefers to promote the largest value of the regions, which manifests in the fact that the largest weight of a learned kernel is most often the one applied to the largest value, thus indicating that a behavior between average pooling and max-pooling is often desired. This observation is illustrated in Figure 3 (bottom left and bottom right), where the kernels are first distributed according to the position of their largest weight, then subdivided following their closest template kernel. Similar trends are also observed for “Ordinal baseline” and “Ordinal LeNet-5” (provided in supplementary material). A complementary analysis of the kernels related to the global ordinal pooling layer is presented in supplementary material.
4.2 Influence of ordinal weights initialization
In this section, we examine the performances of the networks under various weight initializations in the ordinal pooling layers. As before, we report average test error rates over multiple runs of each experiment to compare the results in Table 2.
For the ordinal pooling kernels, the initializations investigated are average, max, and min. For the average (resp. max, min) case, each $2 \times 2$ kernel is instantiated as $(1/4, 1/4, 1/4, 1/4)$ (resp. $(1, 0, 0, 0)$, $(0, 0, 0, 1)$). For the global ordinal pooling layers, the average pooling initialization is used.
Table 2 shows that the networks with ordinal pooling consistently outperform the classic ones. Also, the performances are less dependent on the initialization of the ordinal pooling kernels than they are in the classic setting. Ordinal pooling thus alleviates the problem of choosing the appropriate type of pooling layer to incorporate in the networks.
Max-pooling initialization performs better in this experiment. However, even with min-pooling initialization, the ordinal networks are still able to reach performances close to those obtained with the standard initializations, while this is not the case for the classic networks. An explanation may reside in the use of ReLU activations before the pooling layers. In the classic setting, min-pooling forces the networks to select the lowest value of the pooled regions, hence it can be assumed that many zero values are propagated through the network, decreasing the amount of useful information and thus leading to lower performances. In the ordinal pooling setting, the network has enough flexibility to circumvent this fixed min-pooling behavior, and this only requires small variations in the ordinal weights. Indeed, we observed that the closest template kernel after the training was still most often the min-pooling one.
We also examined a “uniform” initialization, where for each kernel, all the weights are randomly initialized from a uniform distribution and are then normalized so as to sum to 1. The results do not differ much from those obtained with the “average” initialization and are discussed in supplementary material.

Pooling | init | glob. init | Bas. | Bas.2 | LN5 | Bas.2, None | Bas.2, ReLU | Bas.2, tanh
classic | average | average | 1.22 | 1.26 | 0.79 | 49.50 | 1.26 | 1.62
ordinal | average | average | 0.89 | 1.00 | 0.75 | 1.13 | 1.00 | 0.99
classic | max | average | 0.98 | 1.01 | 0.74 | 3.69 | 1.01 | 1.58
ordinal | max | average | 0.82 | 0.89 | 0.71 | 1.01 | 0.89 | 0.90
classic | min | average | 1.48 | 1.34 | 0.94 | 3.82 | 1.34 | 1.55
ordinal | min | average | 0.97 | 1.02 | 0.84 | 1.05 | 1.02 | 0.90

(The last three columns report the “Baseline2” structure with the activation functions “None”, “ReLU”, and “tanh”, as used in Section 4.3.)
4.3 Influence of activation functions
The case of the ordinal minpooling initialization raises the question of the influence of the activation function used before the pooling operation. We thus compare the average test error rates of various initializations and three types of activation functions: “None” (no activation), “ReLU”, “tanh”. The results for the “Baseline2” structure are reported in Table 2.
The results obtained for a given initialization with ordinal pooling are less sensitive to the choice of the activation function than those obtained in the networks with classic pooling schemes. Conversely, for a given activation, the results with ordinal pooling are less sensitive to the choice of initialization compared to the networks employing classic poolings.
One of the most striking results may be related to the performances obtained without any activation. Indeed, while it is well-known that CNNs need nonlinear activations to achieve competitive performances, the networks with ordinal pooling layers still manage to obtain good performances without activation. In this case, some results are even better than others obtained in the classic setting with activations. The sorting procedure in the ordinal pooling layer is itself a nonlinearity, which explains these results. This is especially striking in the avg-pooling case, since an avg-pooling layer without a prior activation merely composes linear operations and is thus essentially useless. For better performances, it still appears that using an activation is beneficial even with ordinal pooling layers, but the choice of the function may not be as crucial as in the networks with classic pooling. Similar trends were observed with the “Baseline” and “LeNet-5” structures.
4.4 Results on other datasets and best use cases
Ordinal pooling can be used in any CNN architecture involving pooling layers. Its benefits vary from one use case to another, as indicated by the following additional results, reported as average test error rates over five trials. On CIFAR-10, with a CNN made of five Conv(128)-ReLU-Pooling blocks and a FC layer, ordinal pooling outperforms avg-pooling. With DenseNet-BC-100-12 [9], the results with ordinal and avg-pooling are mostly equivalent with ReLU and tanh activations, but ordinal pooling is clearly better without activation; similar observations hold on CIFAR-100.
Even though exhaustive performance-related experiments still need to be carried out as future work, the present results are in line with those reported previously and confirm that ordinal pooling mainly helps on relatively simple architectures, as typically considered e.g. for embedded applications. To further support this statement, we performed experiments with quantized networks, as described in [15]. It appears that the more the model is quantized, the more ordinal pooling helps: with quantized versions of ResNet-14 and ResNet-20 on CIFAR-10, ordinal pooling performs noticeably better than max-pooling. It also reduces the gap between the binary ResNet-14 and ResNet-20 compared with max-pooling, which certainly opens interesting prospects for ordinal pooling, as it helps simpler models achieve performances comparable with those of more complex models.
For the record, our experiments on MNIST and CIFAR-10 with the pooling of [13] lead to results comparable with those presented above with ordinal pooling. A comprehensive comparison with the pooling methods present in the literature could be the subject of a survey article (along with defining benchmark tests to assess the performances of a pooling method), and is thus beyond the scope of this work.
5 Conclusion
A novel trainable pooling scheme, Ordinal Pooling, is introduced in this work, which operates in two steps. In the first step, all the elements of a pooling region are reordered in decreasing order of their activations. Then, a trainable weight kernel is combined with the rearranged pooling region through a weighted sum to compute the output of the ordinal pooling operation. The usual avg- and max-pooling operations can be recovered as particular cases of ordinal pooling.
In our experiments, replacing classic avg- and max-pooling operations with ordinal pooling produces large relative improvements in classification performances at a moderate cost in additional parameters, and also leads to a faster convergence. Ordinal pooling makes it possible to perform the pooling operation differently in distinct feature maps. The analysis of the learned kernels reveals that the networks take advantage of this extra flexibility by using various types of pooling for different feature maps within the same pooling layer. A general trend is that a hybrid behavior between avg- and max-pooling is often desired, even though the lowest elements of the pooling regions are not always discarded. Moreover, the performances of the networks are less inclined to fluctuate when different initializations of the ordinal pooling kernels are used than when different classic pooling operations are imposed. Besides, even when no nonlinear activation function is applied after the convolutional layers, the intrinsic nonlinearity introduced by ordinal pooling alone generally suffices to produce performances that are better than those of either avg- or max-pooling used along with activation functions. Finally, our experiments suggest that ordinal pooling might be of particular interest for lightweight or quantized architectures, as typically considered in embedded resource-constrained systems.
As future work, as already mentioned, it will be interesting to perform more experiments with more datasets and various architectures to determine the configurations that best benefit from the ordinal pooling operation. From a technical point of view, the value of the elements of the pooling regions is chosen here as the criterion for ordering the region; however, other criteria could also be envisioned to further extend the ordinal pooling scheme. Finally, conducting experiments with our baseline models but with other types of pooling methods proposed in the literature is certainly our next step in order to rank ordinal pooling among the available pooling operations.
Acknowledgements
This research is supported by the DeepSport project of the Walloon Region, Belgium. C. De Vleeschouwer is funded by the F.R.S.-FNRS.
References

[1] Y. Boureau, N. Le Roux, F. Bach, J. Ponce, and Y. LeCun. Ask the locals: Multi-way local pooling for image recognition. In IEEE Int. Conf. Comput. Vision (ICCV), pages 2651–2658, Barcelona, Spain, Nov. 2011.
[2] Y. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in visual recognition. In Int. Conf. Mach. Learn. (ICML), pages 111–118, Haifa, Israel, June 2010.
 [3] B. Fernando, E. Gavves, M. Jose Oramas, A. Ghodrati, and T. Tuytelaars. Modeling video evolution for action recognition. In IEEE Int. Conf. Comput. Vision and Pattern Recogn. (CVPR), pages 5378–5387, Boston, MA, USA, June 2015.
[4] B. Graham. Fractional max-pooling. CoRR, abs/1412.6071, Dec. 2014.

[5] C. Gulcehre, K. Cho, R. Pascanu, and Y. Bengio. Learned-norm pooling for deep feedforward and recurrent neural networks. In Machine Learning and Knowledge Discovery in Databases, volume 8724 of Lecture Notes Comp. Sci., pages 530–546. Springer, 2014.
[6] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. In IEEE Int. Conf. Comput. Vision (ICCV), pages 2980–2988, Venice, Italy, Oct. 2017.
 [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Int. Conf. Comput. Vision and Pattern Recogn. (CVPR), pages 770–778, Las Vegas, NV, USA, June 2016.
[8] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. CoRR, abs/1709.01507, 2017.
 [9] G. Huang, Z. Liu, L. van der Maaten, and K. Weinberger. Densely connected convolutional networks. In IEEE Int. Conf. Comput. Vision and Pattern Recogn. (CVPR), pages 2261–2269, Honolulu, HI, USA, July 2017.
[10] A. Kolesnikov and C. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In Eur. Conf. Comput. Vision (ECCV), volume 9908 of Lecture Notes Comp. Sci., pages 695–711. Springer, 2016.
 [11] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Adv. in Neural Inform. Process. Syst. (NeurIPS), volume 25, pages 1097–1105, 2012.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of IEEE, 86(11):2278–2324, Nov. 1998.
 [13] C.Y. Lee, P. Gallagher, and Z. Tu. Generalizing pooling functions in CNNs: Mixed, gated, and tree. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):863–875, Apr. 2018.
 [14] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, Dec. 2013.
 [15] B. Moons, K. Goetschalckx, N. Van Berckelaer, and M. Verhelst. Minimum energy quantized neural networks. In Asilomar Conference on Signals, Systems, and Computers, pages 1921–1925, Pacific Grove, CA, USA, 2017.
[16] P. O. Pinheiro and R. Collobert. From image-level to pixel-level labeling with convolutional networks. In IEEE Int. Conf. Comput. Vision and Pattern Recogn. (CVPR), June 2015.
[17] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 39(6):1137–1149, June 2017.
 [18] D. Scherer, A. Müller, and S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Int. Conf. Artificial Neural Networks (ICANN), volume 6354 of Lecture Notes Comp. Sci., pages 92–101. Springer, 2010.
[19] Z. Shi, Y. Ye, and Y. Wu. Rank-based pooling for deep convolutional neural networks. Neural Networks, 83:21–31, Nov. 2016.

[20] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI Conf. Artificial Intell., pages 4278–4284, San Francisco, CA, USA, Feb. 2017.
[21] M. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In Int. Conf. on Learn. Rep. (ICLR), Scottsdale, Arizona, May 2013.
 [22] S. Zhai, H. Wu, A. Kumar, Y. Cheng, Y. Lu, Z. Zhang, and R. Feris. S3Pool: Pooling with stochastic spatial sampling. In IEEE Int. Conf. Comput. Vision and Pattern Recogn. (CVPR), pages 4003–4011, Honolulu, HI, USA, July 2017.