1 Introduction
Convolutional Neural Networks (CNNs) have revolutionized the field of computer vision (Krizhevsky et al., 2012a; He et al., 2016b, 2017). Their success (compared to fully connected networks) is often attributed to weight sharing in the form of convolution, which reduces the number of learnable parameters (Krizhevsky et al., 2012b). In addition, the “shift invariance” property of convolution is believed to be crucial for improved generalization in vision tasks (Fukushima and Miyake, 1982), although some modifications may be required (Azulay and Weiss, 2018; Zhang, 2019). Shift invariance, while crucial for handling translation in images, covers only a very limited subset of real-world geometric transformations. For instance, convolutional representations are neither invariant nor equivariant to other basic transforms such as image rotation and scaling (Azulay and Weiss, 2018).
There have been recent attempts to incorporate additional forms of invariance, such as rotation, reflection, and scaling (Sifre and Mallat, 2013; Bruna and Mallat, 2013; Esteves et al., 2018a; Kanazawa et al., 2014; Worrall et al., 2017). However, these methods engineer the invariance into the network, which requires identifying the invariance types of interest a priori. In this work, our goal is to achieve invariance using a data-driven approach that can handle geometric transforms that may not fall into the categories mentioned above.
One way to achieve invariance or equivariance for transforms beyond translation is nonuniform subsampling of the image (to be fed as input to a convolutional layer). For example, log-polar sampling of an image results in a new image, where rotation and scaling in the original image become equivalent to a translation in the resulting image (Esteves et al., 2018b). This suggests that by adapting the pooling operator, one can build representations that are better suited for a variety of geometric transforms.
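As a small numerical illustration of the log-polar property (a sketch, not the sampling code of Esteves et al.), the snippet below maps 2-D points to log-polar coordinates and checks that rotating and scaling the points about the origin displaces every log-polar coordinate by the same constant offset. The helper name `to_log_polar` and the chosen angle and scale are illustrative assumptions.

```python
import numpy as np

def to_log_polar(points):
    """Map (x, y) points (origin excluded) to (log radius, angle)."""
    x, y = points[:, 0], points[:, 1]
    return np.stack([np.log(np.hypot(x, y)), np.arctan2(y, x)], axis=1)

rng = np.random.default_rng(0)
# Points in the first quadrant, so the rotated angles stay below pi (no wrap-around).
points = rng.uniform(0.5, 2.0, size=(5, 2))

angle, scale = 0.3, 1.7  # illustrative rotation (radians) and scaling factor
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
transformed = scale * points @ rotation.T

# Displacement of each point in log-polar coordinates.
shift = to_log_polar(transformed) - to_log_polar(points)
```

Each row of `shift` equals `(log(scale), angle)`, so the rotation and scaling act as a pure translation in the log-polar domain.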
Nonuniform sampling is also the scheme chosen by nature; foveal vision implements a spatially varying sampling similar to the log-polar transform (Larson and Loschky, 2009). Central and peripheral regions are sampled at different frequencies and both contribute to efficient and effective human vision. In addition, it is known that nonuniform sampling of an image can facilitate image registration when geometric transforms go beyond translation (Mobahi et al., 2012). Our results with the learned pooling operator confirm the advantage of spatially varying pooling. For example, Figure 0(a) shows the response map of a learned pooling operator on the SVHN dataset. The operator places more weight on the center pixels to take advantage of the fact that SVHN digits are mostly in the center. Contrast this with the operator learned for a CIFAR100 model, which places weights all across the spatial field (see Supplementary). The form of the learned pooling operator also affects the pooled feature maps. In Figure 0(b), the pooled feature maps are more tightly clustered around the mean feature map of each class, compared to the feature maps produced by a regular CNN. This results in better separability of classes and better generalization, as seen in Table 1.
In addition to adapting to the geometric transforms present in the data, and hence improving generalization, our learned pooling operator helps with the robustness of the model. It has been observed that small geometric transforms in the image can result in prediction errors in existing deep models, which can be traced to the pooling operator (Azulay and Weiss, 2018; Zhang, 2019); max pooling results in aliasing effects in the representation. While average pooling can prevent the aliasing issues (because it acts as a low-pass filter), the blurring causes loss of information and hence inferior classification performance compared to max pooling. However, by adapting the pooling operator to the data in a way that provides more class separability, relevant information is automatically picked up. In fact, our experiments show that the learned pooling can outperform the naive uniform downsampling scheme that is used in most state-of-the-art models (strided pooling) and yet is robust to geometric perturbations on robustness benchmark datasets (CIFAR10C and CIFAR10P (Hendrycks and Dietterich, 2019)).
2 Related Work
Convolutional Neural Networks (CNNs) rely on pooling or subsampling to reduce the size of the hidden representation. This is known to have important implications for the kinds of invariances and generalization abilities of a network (Cohen and Shashua, 2016). Earlier architectures relied on average pooling (LeCun et al., 1998) and max pooling (Krizhevsky et al., 2012a), whereas modern ones learn the parameters of pooling through strided convolutions (He et al., 2016a).
The history of pooling in computer vision, however, predates the popularity of CNNs. For instance, Boureau et al. (2011) combine SIFT with pooling performed separately over learned clusters of features. Malinowski and Fritz (2013) learn pooling parameters of all spatial lower-level features through a fully connected network, whereas Gong et al. (2014) learn pooling separately at each scale using VLAD (Jégou et al., 2010). Girshick et al. (2015) define distance transform pooling as part of deformable part models using a quadratic function of distance from the center, and use a latent SVM (Felzenszwalb et al., 2009) to learn it on top of a pretrained CNN. Li et al. (2015) use pooling at different scales in the spatial dimensions and also perform pooling on the color channels while aggregating using the max operator.
Pooling has also been used to aggregate inputs of varying size into a fixed-size representation. Passalis and Tefas (2017) use a fixed number of RBF neurons on top of a regular CNN to output a fixed-size representation irrespective of the input image size. Zhou et al. (2017) use a specially designed pooling function in a multi-instance learning setting to output tags for a video from the tags predicted for each frame. Miech et al. (2017) use NetVLAD, defined in Arandjelovic et al. (2016), and approximations of Bag of Words and Fisher vector encoding to aggregate features across time for video classification.
Recently, there have been some attempts to learn the parameters of local pooling operations of CNNs in an end-to-end fashion through gradient descent. Sun et al. (2017) propose learning a local pooling operator, one per channel, and train a deep neural network for classification. Saeedan et al. (2018) try to preserve small details in the input while pooling and introduce two new parameters per input feature map to control which details are preserved. Lee et al. (2016) experiment with learned and fixed combinations of average and max pooling, and also suggest organizing the outputs of multiple local filters in the form of a binary tree to learn the parameters for mixing them. Although these approaches learn pooling parameters from data, the pooling operator is restricted to be spatially uniform; the same sampling scheme is used to pool each output pixel. As we discuss in Section 1, spatially varying pooling is necessary to learn efficient and robust representations for transformations other than translation.
3 Method
3.1 Notation
We first formalize the definition of linear pooling. Let . Given a spatial domain and set of intensity values , a feature map of depth is a map . We can represent a feature map in matrix form:
(1)  
(2)  
(3) 
where denotes the domain size and converts a matrix into a column vector by concatenating the columns of the matrix.
We define linear pooling as the operator which maps into another feature map where . That is, the operator may shrink the input spatially but maintains the number of channels. Formally, the output of the linear operator is an element in , where and . Note that the operator is applied to each channel (column) of the input matrix to generate the corresponding channel (column) in the output matrix.
(4) 
3.2 Formulation
Average pooling is a special case of the linear operator, obtained when the entries of the operator are set in a specific way. However, one may wonder if, within the space of all linear operators, there could be better choices than average pooling. The answer is data dependent, and hence we let the data itself discover the operator that suits the task at hand. In the classification setting, a good operator should help with classification. One possible way to quantify this is through improved separability of the classes, as explained below.
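To make the special case concrete, the following sketch builds the matrix of a non-overlapping 2×2 average pooling and applies it to every channel of a vectorized feature map. It assumes the column-stacking vectorization described in Section 3.1; the function name `average_pooling_matrix` and the toy sizes are illustrative and not taken from the paper.

```python
import numpy as np

def average_pooling_matrix(height, width, k=2):
    """Return P such that P @ vec(X) equals vec(k x k average pooling of X).

    vec() stacks the columns of X (column-major order).
    """
    out_h, out_w = height // k, width // k
    P = np.zeros((out_h * out_w, height * width))
    for out_col in range(out_w):
        for out_row in range(out_h):
            out_idx = out_col * out_h + out_row          # column-major output index
            for dc in range(k):
                for dr in range(k):
                    r, c = out_row * k + dr, out_col * k + dc
                    P[out_idx, c * height + r] = 1.0 / (k * k)
    return P

H, W, D = 4, 6, 3                                   # toy spatial size and depth
rng = np.random.default_rng(0)
maps = rng.normal(size=(H, W, D))

X = maps.reshape(H * W, D, order="F")               # one vectorized channel per column
P = average_pooling_matrix(H, W)
pooled = P @ X                                      # the same operator acts on every channel

# Reference result computed directly on the spatial grid.
reference = maps.reshape(H // 2, 2, W // 2, 2, D).mean(axis=(1, 3))
reference = reference.reshape((H // 2) * (W // 2), D, order="F")
```

Here `pooled` matches `reference`. A learned operator keeps the same shape as `P` but is free to place arbitrary weights in each row.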
To simplify the formulation, we focus on finding each row of separately. Let be the ’th row of , arranged as a column vector. Similarly, let be the ’th row of , also arranged as a column vector. Then the pooling identity in (4) can be equivalently expressed using vectors as:
(5) 
To reduce mathematical clutter, we drop the index from and . The reader should remember that the following result needs to be applied for each choice of separately. Hence with abuse of notation we proceed as:
(6) 
To define separability, we require a training set. Consider a set of feature maps , whose elements are associated with a label , with being the number of classes in the dataset. We define the following total and per class average quantities:
(7) 
Inspired by Linear Discriminant Analysis (LDA), we quantify separability of the classification as the ratio of between-class scatter and within-class scatter .
(8) 
To achieve a good representation for classification, we aim to improve separability of the data points by maximizing the ratio:
(9) 
Plugging definitions from (7) and (6) into the above objective function yields:
(10) 
where are defined based on ’s in a similar way done for in (7). For brevity, define:
(11)  
(12) 
This way our goal is to maximize separability:
(13) 
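The separability criterion can be sketched numerically as follows. The code uses the standard LDA scatter matrices, which may differ from the paper's definitions by a normalization constant, and treats each vectorized single-channel feature map as one sample. The names `S_b`, `S_w`, `scatter_matrices` and `separability` are assumptions introduced for illustration.

```python
import numpy as np

def scatter_matrices(samples, labels):
    """Standard LDA between-class (S_b) and within-class (S_w) scatter matrices.

    samples: (N, n) array, one vectorized feature map per row.
    labels:  (N,) integer class labels.
    """
    mean_all = samples.mean(axis=0)
    n = samples.shape[1]
    S_b = np.zeros((n, n))
    S_w = np.zeros((n, n))
    for c in np.unique(labels):
        cls = samples[labels == c]
        mean_c = cls.mean(axis=0)
        diff = (mean_c - mean_all)[:, None]
        S_b += len(cls) * (diff @ diff.T)       # spread of class means
        centered = cls - mean_c
        S_w += centered.T @ centered            # spread inside each class
    return S_b, S_w

def separability(w, S_b, S_w):
    """Ratio of between-class to within-class scatter along direction w."""
    return float(w @ S_b @ w) / float(w @ S_w @ w)

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
samples = rng.normal(size=(200, 4))
samples[labels == 1, 0] += 5.0                  # classes differ only along coordinate 0

S_b, S_w = scatter_matrices(samples, labels)
w_good = np.array([1.0, 0.0, 0.0, 0.0])
w_poor = np.array([0.0, 1.0, 0.0, 0.0])
ratio_good = separability(w_good, S_b, S_w)
ratio_poor = separability(w_poor, S_b, S_w)
```

In this toy example the direction that carries the class difference scores a much higher ratio. The ratio is also unchanged when `w` is rescaled, which is why a scale constraint is needed before the maximizer is well defined.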
3.3 Closed-Form Solution
The problem in (13) is ill-posed; if some is a solution, then so is for any . To avoid such freedom of scale, we anchor to a fixed value, e.g. and then solve:
(14) 
In addition, we wish to keep localized so that the operator respects the topology of the space . This is important when the pooled feature maps are to be processed by convolution operators (in the future layers). We can encourage localization by introducing a penalty term of the form , where is a diagonal matrix with positive components. How the elements of are chosen is described in Section 3.4.
Applying localization penalty and anchoring of results in the following optimization:
(15) 
where is the penalty coefficient. It can be shown (see the supplementary appendix for the proof) that the solution must satisfy the following generalized eigenvalue problem for some (footnote 1: it turns out that is proportional to and hence still serves as a penalty coefficient; see the supplementary appendix for details):
(16) 
One way to solve the generalized eigenvalue problem is by matrix inversion. If the matrix on the r.h.s. is invertible, then we have:
(17) 
which implies that the optimal must be an eigenvector (in fact, the leading eigenvector) of the following matrix:
(18) 
Since the matrix is diagonal with positive entries, and the matrix is positive semidefinite, the matrix has a regularization effect when computing the inverse of . Thus we refer to as a regularization matrix.
3.4 Regularization Matrix
We now explain how the components of are chosen. For clarity, we temporarily (throughout this subsection) switch from the brief notation to the full notation . We also need to switch from to accordingly. Note that each component of corresponds to a coordinate . To show this relationship we use the notation . Similarly, for the space , each index is associated with a coordinate , and the relationship is shown via . We penalize the ’th component of (recall each is a vector of size , thus ) by its coordinate distance from that of . Since in general , a scale correction needs to be done. This way, the amount of penalty for the ’th component of , which is encoded in the diagonal element , is set to , where is the scale factor . Here refers to the ’th component of the matrix .
In words, this penalty scheme means that if a point in the source feature maps contributes to a point in the destination feature map, where the latter is far from the source point, then that contribution is penalized.
3.5 Algorithm
The resulting procedure is shown in Algorithm 1. Note that the matrices and are the same for any . The algorithm computes these matrices from its beginning up to line 16. However, the matrix and the vector both depend on , and thus Line 20 to the end loops over to compute each and the resulting .
3.6 Implementation Details
Choice of Norm. The last line of the algorithm returns . We will explore and norms in the experiments.
Normalization of Feature Maps. The pooling operator is shared across all channels. However, the intensity values in each channel could potentially have a different center and scale, making it hard for the same pooling to provide similar effect on all channels. To fix this, we normalize feature maps before forming the matrices and applying Algorithm 1. More precisely, for a given feature map (, with being size of the training set), the normalized feature map is defined as , where and . After the pooling operator is applied, we transform the feature map back to its original scale and center by multiplying by and adding . This helps the output of the pooling operation be consistent with rest of the network.
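The normalization round trip can be sketched as below. Since the exact definitions of the statistics were lost from the text above, the sketch assumes a per-channel mean and standard deviation computed over all training maps and spatial positions; the function name and the toy operators are illustrative only.

```python
import numpy as np

def pool_with_channel_normalization(X, P, mean, std):
    """Normalize each channel, apply the shared pooling operator, then undo the normalization.

    X: (n, D) vectorized feature map, P: (m, n) pooling operator,
    mean, std: (D,) per-channel statistics (assumed to come from the training set).
    """
    X_norm = (X - mean) / std
    Y_norm = P @ X_norm
    return Y_norm * std + mean

rng = np.random.default_rng(0)
N, n, D, m = 50, 16, 3, 4
# Synthetic training maps whose channels have different centers and scales.
train = rng.normal(size=(N, n, D)) * np.array([1.0, 5.0, 0.2]) + np.array([0.0, 10.0, -3.0])
mean = train.mean(axis=(0, 1))
std = train.std(axis=(0, 1))

P_avg = np.full((m, n), 1.0 / n)     # toy operator whose rows sum to one
P_other = 2.0 * P_avg                # toy operator whose rows do not sum to one
out_avg = pool_with_channel_normalization(train[0], P_avg, mean, std)
out_other = pool_with_channel_normalization(train[0], P_other, mean, std)
```

For an operator whose rows sum to one, the round trip gives the same result as pooling the raw map. It changes the output for operators whose rows do not sum to one, which a learned operator need not do.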
Use in Deep Networks. Consider a trained deep network using some typical pooling operator. We can convert the pooling operator at any given layer to a learned one by treating the hidden representation at that layer as the input feature maps to Algorithm 1 (after applying the channel normalization described above). We then adapt the network weights to the learned pooling by retraining the network. This process can be repeated for multiple layers. In our experiments, however, we observe that learned pooling at even a single layer can sometimes already give a boost in test accuracy.
Number of Eigenvectors. For simplicity of presentation, Algorithm 1 uses the top eigenvector. In principle, however, the top few eigenvectors could be used instead. In fact, modern architectures often double the number of output channels while downsampling via strided convolutions. To imitate that, we chose to select the top two eigenvectors from (18), which results in two feature maps per input feature map. This keeps the size of the hidden representation after pooling consistent between our method and the common practice.
Computing the Generalized Eigenvalue. For simpler exposition, in Section 3.3 and in Algorithm 1 we have used matrix inversion to solve the generalized eigenvalue problem. However, there are more efficient approaches for solving a generalized eigenvalue problem without matrix inversion. In addition, we only need the top-1 or top-2 eigenvectors, which allows further efficiency in computation. There are numerical routines that can leverage these two properties, such as scipy.sparse.linalg.eigs, which we used for our Python implementation.
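The following sketch shows how the top two generalized eigenvectors can be obtained without forming an explicit inverse. It is not the authors' implementation: it uses `scipy.sparse.linalg.eigsh`, the symmetric counterpart of the `eigs` routine mentioned above, and the matrices `S_b`, `S_w`, `R` and the coefficient `beta` are random, hypothetical stand-ins for the quantities of Section 3.3.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(0)
n = 30

# Hypothetical stand-ins (random, for illustration only).
A = rng.normal(size=(n, 5))
S_b = A @ A.T                                   # symmetric PSD, low rank
C = rng.normal(size=(n, n))
S_w = C @ C.T / n                               # symmetric PSD
R = np.diag(rng.uniform(0.1, 1.0, size=n))      # diagonal with positive entries
beta = 0.5                                      # illustrative penalty coefficient

M = S_w + beta * R                              # symmetric positive definite

# Top-2 generalized eigenpairs of S_b v = lambda M v, without forming inv(M) @ S_b.
vals, vecs = eigsh(S_b, k=2, M=M, which="LA")
order = np.argsort(vals)[::-1]                  # sort in descending order
vals, vecs = vals[order], vecs[:, order]

# Dense reference solution of the same generalized problem (ascending eigenvalues).
dense_vals = eigh(S_b, M, eigvals_only=True)
```

The two leading values from `eigsh` agree with the two largest entries of `dense_vals`, while only two eigenpairs are computed.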
Learning Pooling by SGD. One may wonder why we do not use gradient descent to optimize a total loss (the sum of the usual cross-entropy and the separability criterion) instead of Algorithm 1, and thus learn the network weights and the pooling operator simultaneously. The answer is that this is either impractical or leads to inferior performance. To learn the regularized pooling operator it is necessary to store the matrix for each location in the output feature map. This would incur a memory cost of and would be extremely space inefficient. The performance with an unregularized pooling map is reported in Table 4 in the appendix. It performs worse than our approach in almost all cases and, in some, even worse than the baseline.
4 Experiments
We study the performance of our pooling operator on the SVHN (Netzer et al., 2011) and CIFAR10/CIFAR100 (Krizhevsky and Hinton, 2009) datasets. For the SVHN dataset we also experiment with a reduced 5% subset to measure the performance of our algorithm in the presence of limited labelled data. We use 2 models for our experiments: a CNN model, which is a 4-layer ConvNet, and an 18-layer ResNet (He et al., 2016a) model. Both models have 3 pooling layers in which they downsample via strided convolutions. (Footnote 2: Additional details about the models, datasets and training can be found in the supplementary material, along with the source code. The implementation will be open-sourced with the camera-ready version.)
4.1 Effect on generalization
Table 1 shows the effect of our pooling operator on generalization. We are able to improve on the ResNet model in all settings, and on the CNN model for both versions of the SVHN dataset. The largest gain is observed with the CNN model on the reduced SVHN dataset of over . Even where the CNN model with learned pooling fails to improve on the CIFAR datasets, its performance is on par with the baseline CNN.
Model  Dataset  Baseline Error  Pooling Layer  Norm  Error with replaced pooling  
CNN  Reduced SVHN  3  1  
SVHN  2  25  2  
CIFAR10  3  
CIFAR100  3  5  2  
ResNet  Reduced SVHN  2  1  
SVHN  2  
CIFAR10  3, 1  
CIFAR100  2 
Table 1: Effect of replacing the pooling operator on generalization. We report the mean test error and standard deviation after averaging over 5 trials. When multiple pooling layers are replaced, it is indicated by separating the hyperparameters by a comma. Experiments which result in improvements are highlighted in bold.
4.2 Robustness to corruptions and perturbations
Hendrycks and Dietterich (2019) have developed a dataset of real-world corruptions to test model robustness. For this set of experiments, the model is trained on the original CIFAR10 training set and evaluated on the modified test sets provided. We use the given CIFAR10C and CIFAR10P test sets and evaluate our approach by measuring the suggested quantities. For all of these measurements, we use the original ResNet architecture as the baseline.
In Table 2 we measure Corruption Error on the CIFAR10C dataset as suggested by Hendrycks and Dietterich (2019). In the bottommost row we report the average corruption error for each corruption type. We note that, among others, the model with the replaced pooling operator is more robust in the presence of geometric transformations in 4 out of 5 cases. We define geometric transformations as those which can move or displace pixels.
In Table 3, we measure how our algorithm responds to gradually applied perturbations with the CIFAR10P dataset. Each cell reports the Flip Probability, which indicates the probability of the predicted label changing in the presence of a perturbation. The bottom row reports the Flip Rate, which is the ratio of the flip probability of our model to the flip probability of the original ResNet model. We note that our model does better than the original network for 10 out of 14 perturbations and for 6 out of 8 geometric perturbations.
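The two quantities can be sketched as follows. The text above only states that the flip probability is the probability of the predicted label changing under a perturbation. Comparing consecutive frames of each perturbation sequence is an assumption about the benchmark protocol, and the prediction arrays are made-up toy values, not results.

```python
import numpy as np

def flip_probability(predictions):
    """Fraction of consecutive frame pairs whose predicted label differs.

    predictions: (num_sequences, num_frames) array of predicted labels,
    one row per perturbation sequence.
    """
    predictions = np.asarray(predictions)
    return float(np.mean(predictions[:, 1:] != predictions[:, :-1]))

# Toy predictions on two perturbation sequences of four frames each.
baseline_preds = [[3, 5, 5, 3],
                  [1, 1, 1, 1]]
replaced_preds = [[3, 3, 5, 5],
                  [1, 1, 1, 1]]

fp_baseline = flip_probability(baseline_preds)   # 2 flips out of 6 frame pairs
fp_replaced = flip_probability(replaced_preds)   # 1 flip out of 6 frame pairs
flip_rate = fp_replaced / fp_baseline            # values below 1 favor the replaced model
```

A flip rate below one means the model with replaced pooling changes its prediction less often than the baseline on the same sequences.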
Corruption  
Model  Severity  Geometric  NonGeometric  
Defocus.  Frost.  Motion.  Zoom.  Elastic  Gauss.  Shot.  Impulse.  Snow  Frost  Fog  Bright.  Contr.  Pixel.  Jpeg.  
ResNet  
ResNet with pooling replaced  
CE 
Corruption  
Model  Severity  Geometric  NonGeometric  
Scale  Rot.  Tilt  Tran.  Shear  Motion.  Zoom.  Ga. B  Bright.  Spatter  Snow  Shot.  Speckle  Ga. N  
ResNet  
ResNet with pooling replaced  
Flip Rate 
5 Conclusion
We propose a more general pooling operator than those currently used in the literature. We also present an algorithm to learn the pooling operator in closed form given the distribution of its inputs. Compared to pooling operations that are shared throughout the spatial dimensions, ours allows more flexibility by being spatially varying. We replace the standard pooling operations in a CNN and a ResNet model and see benefits in generalization on the CIFAR10/CIFAR100 and SVHN datasets. The operator is demonstrably more robust to unseen geometric transformations, which we show by evaluating on the CIFAR10C and CIFAR10P test sets.
References

 Arandjelovic et al. (2016) Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., and Sivic, J. (2016). NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5297–5307.
 Azulay and Weiss (2018) Azulay, A. and Weiss, Y. (2018). Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177.
 Boureau et al. (2011) Boureau, Y.-L., Le Roux, N., Bach, F., Ponce, J., and LeCun, Y. (2011). Ask the locals: multi-way local pooling for image recognition. In ICCV’11 - The 13th International Conference on Computer Vision.
 Bruna and Mallat (2013) Bruna, J. and Mallat, S. (2013). Invariant scattering convolution networks. IEEE transactions on pattern analysis and machine intelligence, 35(8):1872–1886.
 Cohen and Shashua (2016) Cohen, N. and Shashua, A. (2016). Inductive bias of deep convolutional networks through pooling geometry. arXiv preprint arXiv:1605.06743.

 Esteves et al. (2018a) Esteves, C., Allen-Blanchette, C., Zhou, X., and Daniilidis, K. (2018a). Polar transformer networks. In International Conference on Learning Representations.
 Esteves et al. (2018b) Esteves, C., Allen-Blanchette, C., Zhou, X., and Daniilidis, K. (2018b). Polar transformer networks. In International Conference on Learning Representations.
 Felzenszwalb et al. (2009) Felzenszwalb, P. F., Girshick, R. B., McAllester, D., and Ramanan, D. (2009). Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627–1645.
 Fukushima and Miyake (1982) Fukushima, K. and Miyake, S. (1982). Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and cooperation in neural nets, pages 267–285. Springer.
 Girshick et al. (2015) Girshick, R., Iandola, F., Darrell, T., and Malik, J. (2015). Deformable part models are convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 437–446.
 Gong et al. (2014) Gong, Y., Wang, L., Guo, R., and Lazebnik, S. (2014). Multi-scale orderless pooling of deep convolutional activation features. In European conference on computer vision, pages 392–407. Springer.
 He et al. (2017) He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969.
 He et al. (2016a) He, K., Zhang, X., Ren, S., and Sun, J. (2016a). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
 He et al. (2016b) He, K., Zhang, X., Ren, S., and Sun, J. (2016b). Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer.
 Hendrycks and Dietterich (2019) Hendrycks, D. and Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261.
 Ioffe and Szegedy (2015) Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
 Jégou et al. (2010) Jégou, H., Douze, M., Schmid, C., and Pérez, P. (2010). Aggregating local descriptors into a compact image representation. In CVPR 2010 - 23rd IEEE Conference on Computer Vision & Pattern Recognition, pages 3304–3311. IEEE Computer Society.
 Kanazawa et al. (2014) Kanazawa, A., Sharma, A., and Jacobs, D. W. (2014). Locally scale-invariant convolutional neural networks. CoRR, abs/1412.5104.
 Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Krizhevsky and Hinton (2009) Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, Citeseer.
 Krizhevsky et al. (2012a) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012a). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.
 Krizhevsky et al. (2012b) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012b). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.
 Larson and Loschky (2009) Larson, A. M. and Loschky, L. C. (2009). The contributions of central versus peripheral vision to scene gist recognition. Journal of Vision, 9(10):6–6.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
 Lee et al. (2016) Lee, C.-Y., Gallagher, P. W., and Tu, Z. (2016). Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In Artificial Intelligence and Statistics, pages 464–472.
 Lee et al. (2014) Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., and Tu, Z. (2014). Deeply-supervised nets. arXiv preprint arXiv:1409.5185.
 Li et al. (2015) Li, C., Reiter, A., and Hager, G. D. (2015). Beyond spatial pooling: fine-grained representation learning in multiple domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4913–4922.
 Liu (2018) Liu, K. (2018). pytorchcifar. https://github.com/kuangliu/pytorchcifar.
 Malinowski and Fritz (2013) Malinowski, M. and Fritz, M. (2013). Learnable pooling regions for image classification. arXiv preprint arXiv:1301.3516.
 Miech et al. (2017) Miech, A., Laptev, I., and Sivic, J. (2017). Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905.
 Mobahi et al. (2012) Mobahi, H., Zitnick, C. L., and Ma, Y. (2012). Seeing through the blur. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1736–1743. IEEE.
 Netzer et al. (2011) Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning.
 Passalis and Tefas (2017) Passalis, N. and Tefas, A. (2017). Learning bag-of-features pooling for deep convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5755–5763.
 Saeedan et al. (2018) Saeedan, F., Weber, N., Goesele, M., and Roth, S. (2018). Detail-preserving pooling in deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9108–9116.
 Sifre and Mallat (2013) Sifre, L. and Mallat, S. (2013). Rotation, scaling and deformation invariant scattering for texture discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1233–1240.
 Sun et al. (2017) Sun, M., Song, Z., Jiang, X., Pan, J., and Pang, Y. (2017). Learning pooling for convolutional neural network. Neurocomputing, 224:96–104.
 Worrall et al. (2017) Worrall, D. E., Garbin, S. J., Turmukhambetov, D., and Brostow, G. J. (2017). Harmonic networks: Deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5028–5037.
 Zhang (2019) Zhang, R. (2019). Making convolutional networks shift-invariant again. arXiv preprint arXiv:1904.11486.
 Zhou et al. (2017) Zhou, Y., Sun, X., Liu, D., Zha, Z., and Zeng, W. (2017). Adaptive pooling in multi-instance learning for web video annotation. In Proceedings of the IEEE International Conference on Computer Vision, pages 318–327.
6 Supplementary Appendix
6.1 Derivation of Closed Form
The goal is to solve the following optimization:
(19) 
Using Lagrange multiplier , the optimization has the following Lagrangian:
(20) 
The derivative of w.r.t. is:
(21) 
It is not difficult to verify that . Hence, defining , we learn that (because .
(22) 
6.2 Dataset details
In our descriptions, an epoch indicates the number of steps necessary to perform one full pass over the training data. Whenever reduced datasets are used, the number of steps in each epoch is scaled accordingly. To choose hyperparameters we use cross-validation, evaluating performance on 5 distinct random held-out subsets of the training data.

CIFAR10/CIFAR100 For both CIFAR datasets, we normalize the images by subtracting the mean and dividing by the standard deviation computed over the entire training set. During training, images are augmented using the technique described in (Lee et al., 2014), which consists of padding images by 4 pixels and randomly cropping a piece, along with adding horizontal flips. All models on the CIFAR datasets are trained for epochs with the learning rate for the ResNet model decayed at epochs and . For cross-validation we use 10% subsets containing 5000 samples.
SVHN For the SVHN dataset each image is normalized separately, with no additional data augmentation applied. All models on this dataset are trained for epochs with the learning rate for the ResNet model decayed at epochs and . Cross-validation is done by holding out 5% of the training data.

Reduced SVHN We use this dataset to measure the performance of our algorithm in the presence of fewer labelled samples. This is a reduced version of the SVHN dataset in which we train with only 5% of the training data. For cross-validation, 20% of the reduced dataset is held out.
6.3 Model Details
We use our pooling operator within two models. These models perform spatial pooling by using stride 2 convolutions, and we experiment with replacing the 3 different layers in which they reduce spatial dimensions, with the exception of the final average pooling layer:

CNN This is a 4-layer ConvNet with convolution kernels of size . The first convolution uses channels and is followed by a ReLU nonlinearity. The 3 convolution layers which follow are strided convolutions with a stride of 2 (to reduce spatial dimensions) and double the number of channels from the previous layer; each of them is followed by batch normalization (Ioffe and Szegedy, 2015) and ReLU. Finally, the feature map is aggregated via global average pooling and fed into a linear layer which outputs logits. The CNN model is trained using the Adam optimizer (Kingma and Ba, 2014) with a fixed learning rate of .
ResNet The second architecture we use is an 18-layer ResNet described in (He et al., 2016b), with its hyperparameters chosen from the implementation by (Liu, 2018). The network is trained with SGD with a momentum coefficient of and a starting learning rate of , decayed by a factor of after a fixed number of epochs for each dataset.
6.4 Training Procedure
As the first step of our algorithm we train our models until convergence using the default pooling operation in each model. This is followed by using Algorithm 1 (in the main paper) to compute our pooling operator along with the normalization parameters and . While estimating matrices and , for both models, we use at most samples per class. We then replace each pooling layer, one at a time, with our own pooling operator with various values of and choice of norm, and retrain the network from a random initialization. We choose the setting that leads to the best average cross-validation error. Using this setting, we train on the full training dataset and report numbers on the test set.
It is possible to use this procedure multiple times to replace more than one pooling layer. In our experiments, we tried replacing multiple pooling layers on the CIFAR10 and CIFAR100 datasets. Only on CIFAR10 did replacing the 3rd and the 1st pooling layers, in that order, in a ResNet lead to a nontrivial reduction in cross-validation error. For all other models and datasets we report performance after replacing only a single pooling layer inside the model.
6.5 Additional visualizations
6.5.1 ResNet on reduced SVHN
6.5.2 ResNet on CIFAR100
6.6 Comparing operator learned through SGD
Model  Dataset  Baseline Error  Pooling Layer  Error with SGD  
CNN  Reduced SVHN  3  
SVHN  3  
CIFAR10  3  
CIFAR100  3  
ResNet  Reduced SVHN  3  
SVHN  3  
CIFAR10  1  
CIFAR100  1 