in 1990, with three main components: convolution, pooling, and activation. Pooling is an important component of CNNs. Even before the resurgence of CNNs, pooling was used to extract dimension-reduced feature vectors and to acquire invariance to small transformations of the input, inspired by the complex cells of the animal visual cortex.
Pooling is crucial for reducing computational cost, providing some degree of translation invariance, and increasing the receptive field of neural networks. In shallow and mid-sized networks, max or average pooling is most widely used, as in AlexNet, VGG, and GoogLeNet. Deeper networks often prefer strided convolution for architectural simplicity. One notable example is ResNet. However, these common pooling methods have drawbacks. For example, max pooling has the striking side effect of discarding at least 75% of the data: the maximum value picked out of each local region reflects only very rough information. Average pooling goes to the opposite extreme: it uniformly attenuates the contribution of every grid point in a local region and ignores the importance of local structure. Both methods suffer from sharp dimensionality reduction and lead to implausible-looking results (see the first and second rows of Fig. 1). Strided convolution may cause aliasing, since it simply picks one node at a fixed position in each local region, regardless of the significance of its activation.
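The three downscaling behaviours described above can be compared on a tiny input. The following is a minimal NumPy sketch, assuming non-overlapping 2×2 regions; the helper name `pool2x2` and the choice of the top-left node for the strided case are illustrative assumptions, not taken from any of the cited networks.

```python
import numpy as np

def pool2x2(x, mode):
    """Downscale a 2-D array by 2 using non-overlapping 2x2 regions."""
    h, w = x.shape
    # Gather each 2x2 region into the last axis: shape (h/2, w/2, 4).
    blocks = (x.reshape(h // 2, 2, w // 2, 2)
               .transpose(0, 2, 1, 3)
               .reshape(h // 2, w // 2, 4))
    if mode == "max":      # keeps 1 of 4 values, discards the other 75%
        return blocks.max(axis=-1)
    if mode == "avg":      # every value contributes with weight 1/4
        return blocks.mean(axis=-1)
    if mode == "stride":   # always the top-left node, whatever its activation
        return blocks[..., 0]
    raise ValueError(f"unknown mode: {mode}")

x = np.arange(16, dtype=float).reshape(4, 4)
max_out = pool2x2(x, "max")        # [[ 5.  7.] [13. 15.]]
avg_out = pool2x2(x, "avg")        # [[ 2.5  4.5] [10.5 12.5]]
stride_out = pool2x2(x, "stride")  # [[ 0.  2.] [ 8. 10.]]
```

In all three cases a 4×4 map collapses to 2×2 in a single step, which is the sharp dimensionality reduction referred to above.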
There have been several attempts to mitigate the harmful effects of max and average pooling, such as linear combinations and extensions of them, and nonlinear pooling layers [14, 15]. In most common implementations, max- or average-related pooling layers directly downscale the spatial dimension of feature maps by a scaling factor. $L_p$ pooling provides better generalization than max pooling, with $p = 1$ corresponding to average pooling and $p = \infty$ reducing to max pooling. Yu et al. proposed mixed pooling, which combines max pooling and average pooling and switches between the two randomly. Instead of picking the maximum value within each pooling region, stochastic pooling and S3Pool stochastically pick a node in each pooling region, and the former favors strong activations. In some networks, strided convolutions are also used for pooling. Notably, all of these pooling methods use an integer stride larger than 1. To reduce the loss of information caused by the dramatic dimension reduction, fractional max-pooling randomly generates pooling regions with stride 1 or 2 to achieve an effective pooling stride of less than 2. Recently, Saeedan et al. proposed detail-preserving pooling, which aims to preserve low-level details and to fill the gap between max pooling and average pooling. We refer to all of these methods as spatial pooling because they perform pooling in the spatial domain.
Recently, Rippel et al. proposed Fourier spectral pooling. It downsamples the feature maps in the frequency domain using low-pass filtering. Specifically, it selects the pooling region in the Fourier frequency domain by extracting the low-frequency subset (i.e. truncation). This approach can alleviate the issues of the spatial pooling strategies mentioned above, and it shows good information-preserving ability. However, it introduces imaginary numbers, which must be treated carefully in real-valued CNNs. Besides, the frequency truncation may destroy the conjugate symmetry of the Fourier representation of a real input. Rippel et al. suggest the RemoveRedundancy and RecoverMap algorithms to ensure that the downsampled spatial approximation is real (see their supplementary material), but these are shown to be time-consuming. Following this work, Smith and Wilamowski proposed the discrete cosine transform (DCT) pooling layer. Although the DCT pooling layer uses real numbers for the frequency representation, it doubles the number of operations compared with the FFT pooling layer.
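The conjugate-symmetry issue can be illustrated with a minimal NumPy sketch. This is not the implementation of Rippel et al.; the helper name `fourier_crop_pool` and the array sizes are assumptions chosen for illustration. Cropping the centered spectrum to an even size leaves the Nyquist row and column without their conjugate partners, so the inverse transform is no longer real, whereas an odd crop keeps the symmetry.

```python
import numpy as np

def fourier_crop_pool(x, out_shape):
    """Naive Fourier spectral pooling: crop the centered spectrum, then invert."""
    m, n = x.shape
    h, w = out_shape
    spectrum = np.fft.fftshift(np.fft.fft2(x))       # DC moved to the center
    top, left = m // 2 - h // 2, n // 2 - w // 2     # keep DC at the crop center
    cropped = spectrum[top:top + h, left:left + w]
    return np.fft.ifft2(np.fft.ifftshift(cropped))   # complex-valued in general

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))

even = fourier_crop_pool(x, (4, 4))   # symmetry broken by plain truncation
odd = fourier_crop_pool(x, (5, 5))    # symmetry preserved

even_imag = np.abs(even.imag).max()   # clearly non-zero
odd_imag = np.abs(odd.imag).max()     # zero up to rounding error
print(even_imag, odd_imag)
```

Extra bookkeeping is therefore needed in the Fourier setting whenever the spectrum is truncated to an even size.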
Inspired by these spectral approaches, we present Hartley transform-based spectral pooling in this paper. The Hartley transform uses only real numbers for real input, and it has a fast discrete algorithm whose operation count is almost the same as that of the Fourier transform. The presented Hartley spectral pooling layer therefore avoids complex arithmetic and can be plugged into modern CNNs with little effort. Moreover, it reduces the amount of computation by dropping the two auxiliary steps (RemoveRedundancy and RecoverMap) of the algorithms of Rippel et al. In addition, we provide the useful observation that preserving more information can help the training of modern CNNs converge.
II-A Hartley transform
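The Hartley transform is an integral transform closely related to the Fourier transform [23, 24]. It has some advantages over the Fourier transform in the analysis of real signals, as it avoids the use of complex arithmetic. These advantages prompted extensive research on its applications and fast implementations during the 1990s [25, 26, 27, 28]. In two dimensions, we denote by $H(u,v)$ and $F(u,v)$ the Hartley transform and the Fourier transform of $f(x,y)$, respectively. With the Fourier convention $F(u,v) = \iint f(x,y)\,e^{-2\pi i(ux+vy)}\,dx\,dy$ and $\mathrm{cas}(t) = \cos t + \sin t$, the Hartley transform is defined by

$$H(u,v) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,\mathrm{cas}\big(2\pi(ux+vy)\big)\,dx\,dy,$$

and the inverse transform by

$$f(x,y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} H(u,v)\,\mathrm{cas}\big(2\pi(ux+vy)\big)\,du\,dv.$$

The relationship between these two transforms is easily derived (refer to the references above for details), giving

$$H(u,v) = \frac{1+i}{2}\,F(u,v) + \frac{1-i}{2}\,F(-u,-v).$$

In the case that the function $f$ is real, its Fourier transform is Hermitian, i.e.

$$F(-u,-v) = \overline{F(u,v)},$$

so the Fourier transform possesses some redundancy on the real $u$-$v$ plane, which leads to the conjugate-symmetry constraint used to reduce the number of trainable parameters in frequency-domain neural networks [1, 29].

The Hermitian property above shows that the Hartley transform of a real function can be written as

$$H(u,v) = \Re\{F(u,v)\} - \Im\{F(u,v)\},$$

where $\Re$ and $\Im$ denote the real and imaginary parts, respectively. Note that, given the definition above, the Hartley transform is a real linear operator. Writing $\mathcal{H}$ for this operator, it is symmetric as well as an involution and a unitary operator:

$$\mathcal{H} = \mathcal{H}^{\top} = \mathcal{H}^{-1}.$$

For the Hartley transform, the imaginary part and conjugate symmetry no longer need to be considered for real inputs such as images.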
Here we discuss how to propagate the gradient through the Hartley transform, which will be used in CNNs. Define $x \in \mathbb{R}^{M \times N}$ and $y = \mathcal{H}(x)$ to be the input and output of a discrete Hartley transform (DHT), respectively, and $R: \mathbb{R}^{M \times N} \to \mathbb{R}$ a real-valued loss function applied to $y$. Since the DHT is a linear operator, its gradient is simply the transform matrix itself. During back-propagation, this gradient, by the unitarity of the DHT, corresponds to the application of the Hartley transform:

$$\frac{\partial R}{\partial x} = \mathcal{H}\left(\frac{\partial R}{\partial y}\right).$$
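These properties can be checked numerically. The sketch below is an illustration only: it assumes an orthonormal normalization (so that the discrete transform is exactly its own inverse), computes the DHT from NumPy's FFT as the real part minus the imaginary part, and uses a hypothetical linear loss $R(x) = \sum w \odot \mathcal{H}(x)$ whose gradient with respect to $y$ is simply $w$.

```python
import numpy as np

def dht2(x):
    """Orthonormal 2-D discrete Hartley transform of a real array.

    Computed as Re(F) - Im(F), where F is the orthonormal 2-D DFT.
    """
    f = np.fft.fft2(x, norm="ortho")
    return f.real - f.imag

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 5))
w = rng.standard_normal((6, 5))   # plays the role of dR/dy

# Involution: applying the transform twice recovers the input.
roundtrip_error = np.abs(dht2(dht2(x)) - x).max()

# Unitarity: the Euclidean norm is preserved.
norm_error = abs(np.linalg.norm(dht2(x)) - np.linalg.norm(x))

# Gradient: with R(x) = sum(w * dht2(x)), dR/dx should equal dht2(w).
def loss(z):
    return float(np.sum(w * dht2(z)))

analytic_grad = dht2(w)
numeric_grad = np.zeros_like(x)
eps = 1e-6
for idx in np.ndindex(x.shape):
    step = np.zeros_like(x)
    step[idx] = eps
    # Central difference; exact up to rounding because the loss is linear.
    numeric_grad[idx] = (loss(x + step) - loss(x - step)) / (2 * eps)
grad_error = np.abs(analytic_grad - numeric_grad).max()

print(roundtrip_error, norm_error, grad_error)
```

All three errors should be at the level of floating-point rounding, which is consistent with back-propagation through the DHT being one more application of the DHT.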
II-B Hartley-based spectral pooling
Spectral pooling preserves considerably more information and structure for the same number of parameters (see the third row of Fig. 1) because the frequency domain provides a sparse basis for inputs with spatial structure. The spectral power of a typical input is heavily concentrated in the lower frequencies, while the higher frequencies mainly tend to encode noise. This non-uniformity of spectral power means that removing the high frequencies does minimal damage to the input information.
To avoid the time-consuming RemoveRedundancy and RecoverMap steps of Fourier spectral pooling, we suggest Hartley transform-based spectral pooling. This spectral pooling is straightforward to understand and much easier to implement. Assume we have an input $x \in \mathbb{R}^{M \times N}$ and some desired output map dimensionality $h \times w$. First, we compute the DHT of the input, $y = \mathcal{H}(x)$, to move into the frequency domain, and shift the DC component to the center of the domain. Then we crop the frequency representation by keeping only the central $h \times w$ submatrix of frequencies, denoted $\hat{y}$. Finally, we apply the DHT again, $\hat{x} = \mathcal{H}(\hat{y})$, to map the frequency approximation back into the spatial domain, obtaining the downsampled spatial approximation. The back-propagation of this spectral pooling is similar to its forward propagation, since the Hartley transform is differentiable.
The steps in the forward and backward propagation of this spectral pooling are listed in Algorithms 1 and 2, respectively. These algorithms are simpler than Fourier-based spectral pooling because the Hartley transform of a real function is real rather than complex. Figure 1 demonstrates the effect of this spectral pooling for various dimensionality reduction factors.
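The forward and backward steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than a reproduction of Algorithms 1 and 2: it handles a single 2-D map, uses an orthonormal DHT computed through the FFT, and the function names `hartley_pool_forward` and `hartley_pool_backward` are hypothetical. The backward pass is the transpose of the forward pass: transform the incoming gradient, zero-pad it back to the input size, and transform again.

```python
import numpy as np

def dht2(x):
    """Orthonormal 2-D discrete Hartley transform (real in, real out)."""
    f = np.fft.fft2(x, norm="ortho")
    return f.real - f.imag

def _offsets(in_shape, out_shape):
    (m, n), (h, w) = in_shape, out_shape
    # Offsets that keep the centered DC component at the crop center.
    return m // 2 - h // 2, n // 2 - w // 2

def hartley_pool_forward(x, out_shape):
    """DHT -> center the DC term -> crop low frequencies -> DHT."""
    h, w = out_shape
    top, left = _offsets(x.shape, out_shape)
    y = np.fft.fftshift(dht2(x))
    y_crop = y[top:top + h, left:left + w]
    return dht2(np.fft.ifftshift(y_crop))

def hartley_pool_backward(grad_out, in_shape):
    """Transpose of the forward pass: DHT -> zero-pad -> DHT."""
    h, w = grad_out.shape
    top, left = _offsets(in_shape, grad_out.shape)
    z = np.fft.fftshift(dht2(grad_out))
    z_pad = np.zeros(in_shape)
    z_pad[top:top + h, left:left + w] = z
    return dht2(np.fft.ifftshift(z_pad))

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28))
pooled = hartley_pool_forward(x, (14, 14))          # real-valued 14x14 map
grad_out = rng.standard_normal(pooled.shape)
grad_in = hartley_pool_backward(grad_out, x.shape)  # real-valued 28x28 gradient

# Adjoint check: <A x, g> must equal <x, A^T g> for a linear layer A.
adjoint_gap = abs(np.sum(pooled * grad_out) - np.sum(x * grad_in))
print(pooled.shape, grad_in.shape, adjoint_gap)
```

No step in either direction involves complex numbers or symmetry repair. With the orthonormal convention assumed here, amplitudes are rescaled by $\sqrt{MN/(hw)}$; a different normalization would change only this constant factor.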
We compare the efficiency of the proposed pooling layer with that of the Fourier pooling layer of Rippel et al. Both spectral pooling methods are implemented in plain Python with the NumPy package. As shown in Fig. 2, the running time of Fourier spectral pooling increases rapidly as the image size grows. We attribute this to the time-consuming RemoveRedundancy and RecoverMap steps during back-propagation in the Fourier spectral pooling algorithm.
We verify the effectiveness of Hartley spectral pooling on image classification tasks using the MNIST and CIFAR-10 datasets. The trained networks include a toy CNN model (Table I), ResNet-16, and ResNet-20. The toy network uses max pooling, while the ResNets employ strided convolutions for downscaling. In these experiments, spectral pooling shows favorable results. Our implementation is based on PyTorch.
III-A Datasets and configurations
The MNIST database is a large database of handwritten digits that is commonly used for training convolutional neural networks. It contains 60000 training examples and 10000 test examples, all of which are 28×28 grayscale images. In our experiments, we do not perform any preprocessing or augmentation on this dataset. The Adam optimization algorithm is used in all MNIST classification experiments, with the hyper-parameters $\beta_1$ and $\beta_2$ configured as suggested and a mini-batch size of 100. The initial learning rate is set to 0.001 and is divided by 10 every 5 epochs. No regularization is used in these experiments.
The CIFAR-10 dataset consists of 60000 color natural images in 10 classes, with 6000 images per class, of which 5000 are used for training and 1000 for testing. Each image is of size 32×32. For data augmentation we follow standard practice, applying horizontal flips and randomly sampling 32×32 crops from the image padded by 4 pixels on each side. Normalization is performed during data preprocessing using the channel means and standard deviations. In experiments on this dataset, we use stochastic gradient descent with Nesterov momentum and the cross-entropy loss. The initial learning rate is 0.1 and is multiplied by 0.1 at epochs 80 and 120. Weight decay is applied, and momentum is set to 0.9 without dampening. The mini-batch size is 128.
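The learning-rate schedules and the CIFAR-10 augmentation described above can be sketched in a framework-independent way. The following is a minimal NumPy illustration, not the PyTorch code used in the experiments; the function names, zero-based epoch counting, zero-valued padding, and a flip probability of 0.5 are assumptions.

```python
import numpy as np

def mnist_lr(epoch, base_lr=0.001):
    """Initial rate 0.001, divided by 10 every 5 epochs."""
    return base_lr * 0.1 ** (epoch // 5)

def cifar_lr(epoch, base_lr=0.1, milestones=(80, 120)):
    """Initial rate 0.1, multiplied by 0.1 at each milestone epoch."""
    return base_lr * 0.1 ** sum(epoch >= m for m in milestones)

def augment(image, rng, pad=4):
    """Random horizontal flip, then a random crop of the original size
    taken from the image padded by `pad` pixels on each side.

    `image` has shape (channels, height, width).
    """
    _, h, w = image.shape
    if rng.random() < 0.5:
        image = image[:, :, ::-1]                             # horizontal flip
    padded = np.pad(image, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[:, top:top + h, left:left + w]

def normalize(image, mean, std):
    """Per-channel normalization with the given means and standard deviations."""
    return (image - mean[:, None, None]) / std[:, None, None]

rng = np.random.default_rng(0)
sample = rng.random((3, 32, 32))
augmented = augment(sample, rng)
standardized = normalize(augmented, sample.mean(axis=(1, 2)), sample.std(axis=(1, 2)))
print(augmented.shape, mnist_lr(5), cifar_lr(80))
```

In a PyTorch training loop, the same schedules would typically be expressed with the library's step-based learning-rate schedulers.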
III-B Classification results on MNIST
We first use a network with max pooling (see Table I). Each convolution layer is followed by batch normalization and a ReLU nonlinearity. We test Hartley-based spectral pooling by replacing max pooling in this architecture. The training procedure lasts 10 epochs and is repeated 10 times. The classification error on the test set in each epoch is shown in Fig. 3.
Compared with max pooling, spectral pooling shows strong results, yielding a reduction of more than 15% in classification error in our experiment. Since everything is equal except the pooling layers, we attribute this improvement to the better information-preserving ability of Hartley-based spectral pooling.
| layer name | output size | Max Pooling model | Spectral Pooling model |
|---|---|---|---|
| pool1 | 14×14 | Max, stride=2 | Spectral, 14×14 |
| pool2 | 7×7 | Max, stride=2 | Spectral, 7×7 |
Next, we use ResNet-20, a deeper network. We do not use deeper residual nets such as ResNet-110, because the much larger number of parameters in such architectures may give rise to overfitting. ResNet is composed of numerous residual building blocks. It is a more modern convolutional neural network that does not explicitly use pooling layers but instead embeds a stride-2 convolution layer inside some of the building blocks to perform downsampling. For our experiments, we replace the stride-2 convolutional layer with spectral pooling and remove the skip connection [9, 34] in those downscaling blocks. In addition, we let the output size of successive spectral pooling layers decrease linearly (by 8 along each axis after each spectral pooling layer). The manually tuned network architecture is depicted in Table II (building blocks are shown in brackets, with the number of blocks stacked; SP stands for the spectral pooling layer, and its subscript indicates the output size). We leave the global average pooling untouched. Each experiment is repeated 5 times, and the best result is reported in Table III.
The spectral pooling network, SP-ResNet21, performs slightly better than its ResNet20 counterpart, even though it has only about half as many parameters. Further, we illustrate one of the five training runs in Fig. 4. SP-ResNet21 converges faster than ResNet20 (left panel) and performs better in classification (right panel). This indicates that spectral pooling eases optimization by providing faster convergence at the early stage.
| block name | output size | 21-layer |
|---|---|---|
| | 1×1 | avg pool, 10-d fc |
III-C Classification results on CIFAR-10
For training on CIFAR-10, the architecture is similar to the SP-ResNet21 trained on MNIST. We set the output sizes of the spectral pooling layers to be , , sequentially. In these experiments we use ResNet16 as the counterpart, since it contains almost the same number of parameters as SP-ResNet21 (see Table IV). The spectral pooling network, SP-ResNet21, outperforms ResNet16 by nearly 0.25%.
We present a fully real-valued spectral pooling method in this paper. It reduces the computational complexity compared with previous spectral pooling work. Based on this approach, we provide results on several commonly used benchmark datasets (including MNIST and CIFAR-10) by training modified residual nets. We also investigate the contribution of this spectral pooling method to the convergence of neural network training. We demonstrate that spectral pooling yields higher classification accuracy than its max pooling counterpart. In residual nets, it improves the convergence rate of training by expanding the range of spatial dimensionalities available for downsampling. It is also easy to implement and fast to compute during training.
-  O. Rippel, J. Snoek, and R. P. Adams, “Spectral representations for convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 2449–2457.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
-  Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a back-propagation network,” in Advances in Neural Information Processing Systems (NIPS), 1990, pp. 396–404.
-  D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex,” The Journal of Physiology, vol. 160, no. 1, pp. 106–154, 1962.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich et al., “Going deeper with convolutions.” CVPR, 2015.
-  J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv preprint arXiv:1412.6806, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
-  J. G. Proakis and D. G. Manolakis, Digital signal processing. Pearson Education, 2013.
-  C.-Y. Lee, P. W. Gallagher, and Z. Tu, “Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree,” in Artificial Intelligence and Statistics, 2016, pp. 464–472.
-  C. Gulcehre, K. Cho, R. Pascanu, and Y. Bengio, “Learned-norm pooling for deep feedforward and recurrent neural networks,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2014, pp. 530–546.
-  J. Bruna, A. Szlam, and Y. LeCun, “Signal recovery from pooling representations,” arXiv preprint arXiv:1311.4025, 2013.
-  D. Yu, H. Wang, P. Chen, and Z. Wei, “Mixed pooling for convolutional neural networks,” in International Conference on Rough Sets and Knowledge Technology. Springer, 2014, pp. 364–375.
-  M. D. Zeiler and R. Fergus, “Stochastic pooling for regularization of deep convolutional neural networks,” arXiv preprint arXiv:1301.3557, 2013.
-  S. Zhai, H. Wu, A. Kumar, Y. Cheng, Y. Lu, Z. Zhang, and R. Feris, “S3pool: Pooling with stochastic spatial sampling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4970–4978.
-  B. Graham, “Fractional max-pooling,” arXiv preprint arXiv:1412.6071, 2014.
-  F. Saeedan, N. Weber, M. Goesele, and S. Roth, “Detail-preserving pooling in deep networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9108–9116.
-  Y.-L. Boureau, J. Ponce, and Y. LeCun, “A theoretical analysis of feature pooling in visual recognition,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 111–118.
-  J. S. Smith and B. M. Wilamowski, “Discrete cosine transform spectral pooling layers for convolutional neural networks,” in International Conference on Artificial Intelligence and Soft Computing. Springer, 2018, pp. 235–246.
-  R. V. Hartley, “A more symmetrical fourier analysis applied to transmission problems,” Proceedings of the IRE, vol. 30, no. 3, pp. 144–150, 1942.
-  R. N. Bracewell, The Hartley transform. Oxford University Press, Inc., 1986.
-  ——, “Discrete hartley transform,” JOSA, vol. 73, no. 12, pp. 1832–1835, 1983.
-  ——, “The fast hartley transform,” Proceedings of the IEEE, vol. 72, no. 8, pp. 1010–1018, 1984.
-  R. Millane, “Analytic properties of the hartley transform and their implications,” Proceedings of the IEEE, vol. 82, no. 3, pp. 413–428, 1994.
-  J. Agbinya, “Fast interpolation algorithm using fast hartley transform,” Proceedings of the IEEE, vol. 75, no. 4, pp. 523–524, 1987.
-  H. Pratt, B. Williams, F. Coenen, and Y. Zheng, “Fcnn: Fourier convolutional neural networks,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2017, pp. 786–798.
-  A. Torralba and A. Oliva, “Statistics of natural image categories,” Network: Computation in Neural Systems, vol. 14, no. 3, pp. 391–412, 2003.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.
-  E. Kussul and T. Baidyk, “Improved method of handwritten digit recognition tested on mnist database,” Image and Vision Computing, vol. 22, no. 12, pp. 971–981, 2004.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015.