I Introduction
Convolutional neural networks (CNNs) [2, 3, 4] have been the dominant machine learning approach for computer vision, and they are finding increasing application in many other fields. The modern framework of CNNs was established by LeCun et al. [5] in 1990, with three main components: convolution, pooling, and activation. Pooling is an important component of CNNs. Even before the resurgence of CNNs, pooling was used to extract features, yielding dimension-reduced feature vectors and invariance to small transformations of the input, inspired by the complex cells in the animal visual cortex [6].
Pooling is crucial for reducing computational cost, providing some translation invariance, and increasing the receptive field of neural networks. In shallow and mid-sized networks, max or average pooling is most widely used, as in AlexNet [7], VGG [8], and GoogleNet [9]. Deeper networks often prefer strided convolution for architecture-design simplicity [10]. One notable example is ResNet [11]. However, these common pooling methods have drawbacks. For example, max pooling discards at least 75% of the data: the maximum value picked out in each local region reflects only very rough information. Average pooling goes to the opposite extreme, uniformly attenuating the contribution of each individual grid point in the local region and ignoring the importance of local structure. Both suffer from sharp dimensionality reduction and lead to implausible-looking results (see the first and second rows in Fig. 1). Strided convolution may cause aliasing since it simply picks one node at a fixed position in each local region, regardless of the significance of its activation [12].
There have been a few attempts to mitigate the harmful effects of max and average pooling, such as a linear combination and extension of them [13], and nonlinear pooling layers [14, 15]. In most common implementations, max- or average-related pooling layers directly downscale the spatial dimension of feature maps by a scaling factor. $L_p$ pooling [15] provides better generalization than max pooling, with $p=1$ corresponding to average pooling and $p=\infty$ reducing to max pooling. Yu et al. [16] proposed mixed pooling, which combines max pooling and average pooling and switches between these two methods randomly. Instead of picking the maximum value within each pooling region, stochastic pooling [17] and S3Pool [18] stochastically pick a node in each pooling region, and the former favors strong activations. In some networks, strided convolutions are also used for pooling. Notably, all of these pooling methods use an integer stride larger than 1. To abate the loss of information caused by the dramatic dimension reduction, fractional max-pooling [19] randomly generates pooling regions with stride 1 or 2 to achieve an effective pooling stride of less than 2. Recently, Saeedan et al. [20] proposed detail-preserving pooling, which aims at preserving low-level details and filling the gap between max pooling and average pooling [21]. We refer to the pooling methods mentioned above as spatial pooling because they perform pooling in the spatial domain.
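The way $L_p$ pooling interpolates between average and max pooling can be illustrated with a short NumPy sketch. The helper name `lp_pool2d`, the non-overlapping window, and the mean-normalized form $(\frac{1}{k^2}\sum |x|^p)^{1/p}$ are assumptions made for illustration and are not taken from [15]; with this form, $p=1$ equals average pooling only for non-negative activations.

```python
import numpy as np

def lp_pool2d(x, k=2, p=2.0):
    """Non-overlapping k x k L_p pooling of a 2-D array (illustrative helper)."""
    h, w = x.shape
    # Drop ragged borders so the map tiles exactly into k x k regions.
    x = x[: h - h % k, : w - w % k]
    # Rearrange into (rows, cols, k*k): one flattened window per output node.
    blocks = x.reshape(h // k, k, w // k, k).transpose(0, 2, 1, 3)
    blocks = np.abs(blocks).reshape(h // k, w // k, k * k)
    if np.isinf(p):
        return blocks.max(axis=-1)          # limiting case: max pooling
    return np.mean(blocks ** p, axis=-1) ** (1.0 / p)

x = np.arange(16, dtype=float).reshape(4, 4)
avg_like = lp_pool2d(x, k=2, p=1.0)         # equals average pooling here
max_like = lp_pool2d(x, k=2, p=np.inf)      # equals max pooling
in_between = lp_pool2d(x, k=2, p=4.0)       # lies between the two
```

For non-negative inputs, increasing `p` moves each output monotonically from the window mean toward the window maximum.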
Recently, Rippel et al. [1] proposed Fourier spectral pooling. It downsamples the feature maps in the frequency domain using low-pass filtering. Specifically, it selects the pooling region in the Fourier frequency domain by extracting the low-frequency subset (i.e., truncation). This approach can alleviate the issues of the spatial pooling strategies mentioned above, and it shows good information-preserving ability. However, it introduces complex-valued quantities, which must be treated carefully in real-valued CNNs. Besides, the frequency truncation may destroy the conjugate symmetry of the Fourier representation of a real input. Rippel et al. suggest the RemoveRedundancy and RecoverMap algorithms to ensure that the downsampled spatial approximation is real (see the supplementary material of [1]), but these algorithms turn out to be time-consuming. Following the work in [1], [22] proposes the discrete cosine transform (DCT) pooling layer. Although the DCT pooling layer uses real numbers for the frequency representation, it doubles the number of operations compared with the FFT pooling layer.
Inspired by the work of [1], we present Hartley transform-based spectral pooling in this paper. The Hartley transform uses only real numbers for real input, and it has a fast discrete algorithm whose operation count is almost the same as that of the Fourier transform. The presented Hartley spectral pooling layer therefore avoids complex arithmetic and can be plugged into modern CNNs effortlessly. Moreover, it reduces the amount of computation by dropping the two auxiliary steps of the algorithms in [1]. Meanwhile, we provide a useful observation that preserving more information can contribute to the convergence of training modern CNNs.
II Method
II-A Hartley transform
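The Hartley transform is an integral transform closely related to the Fourier transform [23, 24]. It has some advantages over the Fourier transform in the analysis of real signals, as it avoids the use of complex arithmetic. These advantages attracted considerable research on its applications and fast implementations during the 1980s and 1990s [25, 26, 27, 28]. In two dimensions, we denote $H(u,v)$ and $F(u,v)$ as the Hartley transform and Fourier transform of $f(x,y)$, respectively. The Hartley transform is defined as in [27] by
$$H(u,v) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,\mathrm{cas}\big(2\pi(ux+vy)\big)\,dx\,dy, \tag{1}$$
where $\mathrm{cas}(t) = \cos t + \sin t$, and the inverse transform by
$$f(x,y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} H(u,v)\,\mathrm{cas}\big(2\pi(ux+vy)\big)\,du\,dv. \tag{2}$$
With the Fourier transform taken as $F(u,v) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,e^{-2\pi i(ux+vy)}\,dx\,dy$, the relationship between these two transforms is easily derived, giving (refer to [27] for details)
$$H(u,v) = \frac{1+i}{2}\,F(u,v) + \frac{1-i}{2}\,F(-u,-v). \tag{3}$$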
In the case that the function $f$ is real, its Fourier transform $F$ is Hermitian, i.e.,
$$F(-u,-v) = F^{*}(u,v), \tag{4}$$
so the Fourier transform possesses some redundancy on the real $(u,v)$ plane, which leads to the conjugate-symmetry constraint used to reduce the number of training parameters in frequency-domain neural networks [1, 29].
The Hermitian property above shows that the Hartley transform $H$ of a real function can be written as
$$H(u,v) = \Re\big[F(u,v)\big] - \Im\big[F(u,v)\big], \tag{5}$$
where $\Re$ and $\Im$ denote the real and imaginary parts, respectively. Note that, given the definition above, the Hartley transform is a real linear operator. Denoting this operator by $\mathcal{H}$, it is symmetric as well as an involution and a unitary operator:
$$\mathcal{H} = \mathcal{H}^{\mathsf{T}} = \mathcal{H}^{-1}. \tag{6}$$
For the Hartley transform, imaginary parts and conjugate symmetry no longer need to be considered for real inputs such as images.
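The discrete counterpart of relations (5) and (6) can be checked numerically. The following is a minimal NumPy sketch, assuming the unitary normalization (`norm="ortho"`) and the non-separable kernel $\mathrm{cas}(2\pi(ur/M + vc/N))$; the function names `dht2` and `dht2_direct` are illustrative and not part of any library.

```python
import numpy as np

def dht2(x):
    """Unitary 2-D discrete Hartley transform of a real array via the FFT."""
    f = np.fft.fft2(x, norm="ortho")
    return f.real - f.imag                  # H = Re(F) - Im(F)

def dht2_direct(x):
    """Same transform evaluated directly from the cas kernel (O(M^2 N^2))."""
    m, n = x.shape
    u = np.arange(m).reshape(m, 1, 1, 1)
    v = np.arange(n).reshape(1, n, 1, 1)
    r = np.arange(m).reshape(1, 1, m, 1)
    c = np.arange(n).reshape(1, 1, 1, n)
    theta = 2.0 * np.pi * (u * r / m + v * c / n)
    cas = np.cos(theta) + np.sin(theta)
    return np.einsum("uvrc,rc->uv", cas, x) / np.sqrt(m * n)

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))
h = dht2(x)

matches_definition = np.allclose(h, dht2_direct(x))
is_involution = np.allclose(dht2(h), x)                     # applying it twice returns x
preserves_energy = np.isclose(np.sum(h ** 2), np.sum(x ** 2))  # unitary
```

The output `h` is a real array of the same shape as `x`, so no complex bookkeeping is needed downstream.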
Differentiation.
Here we discuss how to propagate the gradient through the Hartley transform, which will be used in CNNs. Define $x$ and $y = \mathcal{H}(x)$ to be the input and output of a discrete Hartley transform (DHT), respectively, and $L$ a real-valued loss function applied to $y$. Since the DHT is a linear operator, its gradient is simply the transform matrix itself. During backpropagation, this gradient, by the symmetry and unitarity of the DHT, corresponds to an application of the Hartley transform:
$$\frac{\partial L}{\partial x} = \mathcal{H}\left(\frac{\partial L}{\partial y}\right). \tag{7}$$
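The backpropagation rule (7) can be verified against finite differences. The sketch below assumes a unitary DHT computed through the FFT and uses a hypothetical quadratic loss `loss(y, w)` chosen only for the check; neither comes from the original implementation.

```python
import numpy as np

def dht2(x):
    """Unitary 2-D discrete Hartley transform of a real array via the FFT."""
    f = np.fft.fft2(x, norm="ortho")
    return f.real - f.imag

def loss(y, w):
    """Hypothetical weighted quadratic loss applied to the DHT output y."""
    return 0.5 * np.sum(w * y ** 2)

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 5))
w = rng.standard_normal((4, 5))

grad_y = w * dht2(x)        # dL/dy for the quadratic loss above
grad_x = dht2(grad_y)       # Eq. (7): apply the DHT to dL/dy

# Central finite differences on every input entry.
eps = 1e-6
numeric = np.zeros_like(x)
for idx in np.ndindex(x.shape):
    d = np.zeros_like(x)
    d[idx] = eps
    numeric[idx] = (loss(dht2(x + d), w) - loss(dht2(x - d), w)) / (2 * eps)

max_err = np.max(np.abs(numeric - grad_x))
```

Here `max_err` stays at the level of floating-point rounding, which is what one would expect if the backward pass is itself a Hartley transform.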
II-B Hartley-based spectral pooling
Spectral pooling preserves considerably more information and structure for the same number of parameters [1] (see the third row of Fig. 1) because the frequency domain provides a sparse basis for inputs with spatial structure. The spectral power of typical inputs is heavily concentrated in the lower frequencies, while the higher frequencies mainly tend to encode noise [30]. This non-uniformity of spectral power means that removing the high frequencies does minimal damage to the input information.
To avoid the time-consuming RemoveRedundancy and RecoverMap steps in [1], we suggest Hartley transform-based spectral pooling. This spectral pooling is straightforward to understand and much easier to implement. Assume we have an input $x \in \mathbb{R}^{M \times N}$ and some desired output map dimensionality $M' \times N'$. First, we compute the DHT of the input into the frequency domain as $y = \mathcal{H}(x)$, and shift the DC component of the input to the center of the domain. Then we crop the frequency representation by keeping only the central $M' \times N'$ submatrix of frequencies, denoted as $\hat{y}$. Finally, we take the DHT again as $\hat{x} = \mathcal{H}(\hat{y})$ to map the frequency approximation back into the spatial domain, obtaining the downsampled spatial approximation. The backpropagation of this spectral pooling is similar to its forward propagation since the Hartley transform is differentiable.
The steps of both the forward and backward propagation of this spectral pooling are listed in Algorithms 1 and 2, respectively. These algorithms simplify Fourier-based spectral pooling, benefiting from the fact that the Hartley transform of a real function is real rather than complex. Figure 1 demonstrates the effect of this spectral pooling for various dimensionality reduction factors.
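The forward and backward steps described above can be sketched in NumPy for a single 2-D feature map. This is an illustrative reading of the procedure, not a transcription of Algorithms 1 and 2: the function names, the unitary normalization, and the rescaling factor that keeps the mean intensity unchanged are all assumptions.

```python
import numpy as np

def dht2(x):
    """Unitary 2-D discrete Hartley transform of a real array via the FFT."""
    f = np.fft.fft2(x, norm="ortho")
    return f.real - f.imag

def center_slices(shape, out_shape):
    """Slices selecting the central block of a DC-centered spectrum."""
    return tuple(
        slice(n // 2 - k // 2, n // 2 - k // 2 + k)
        for n, k in zip(shape, out_shape)
    )

def hartley_pool_forward(x, out_shape):
    """Downsample a 2-D map to out_shape by truncating its Hartley spectrum."""
    y = np.fft.fftshift(dht2(x))                    # DHT, DC moved to the center
    y_crop = y[center_slices(x.shape, out_shape)]   # keep the low frequencies
    # Assumed rescaling so that the mean of the map is preserved.
    scale = np.sqrt(np.prod(out_shape) / np.prod(x.shape))
    return scale * dht2(np.fft.ifftshift(y_crop))   # back to the spatial domain

def hartley_pool_backward(grad_out, in_shape):
    """Gradient w.r.t. the input: the transpose of the forward map."""
    z = np.fft.fftshift(dht2(grad_out))
    z_pad = np.zeros(in_shape)
    z_pad[center_slices(in_shape, grad_out.shape)] = z   # zero-pad high frequencies
    scale = np.sqrt(np.prod(grad_out.shape) / np.prod(in_shape))
    return scale * dht2(np.fft.ifftshift(z_pad))

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28))
g = rng.standard_normal((14, 14))        # stand-in for an upstream gradient
pooled = hartley_pool_forward(x, (14, 14))
grad_x = hartley_pool_backward(g, x.shape)
```

Because the whole pipeline is linear and real, the backward pass mirrors the forward pass with cropping replaced by zero-padding. Any output size not larger than the input can be requested, including sizes that are not integer fractions of the input size.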
II-C Comparison
Here we compare the efficiency of the proposed pooling layer with that of the Fourier pooling layer in [1]. Both spectral pooling methods are implemented in plain Python with the NumPy package. As shown in Fig. 2, the running time of Fourier spectral pooling increases rapidly as the image size grows. We attribute this to the time-consuming RemoveRedundancy and RecoverMap steps in the backpropagation of the Fourier spectral pooling algorithm (refer to [1]).
III Experiments
We verify the effectiveness of Hartley spectral pooling through image classification tasks on the MNIST and CIFAR-10 datasets. The trained networks include a toy CNN model (Table I), ResNet16, and ResNet20 [11]. The toy network uses max pooling, while the ResNets employ strided convolutions for downscaling. In these experiments, spectral pooling shows favorable results. Our implementation is based on PyTorch [31].
III-A Datasets and configurations
MNIST
The MNIST database [32] is a large database of handwritten digits that is commonly used for training various convolutional neural networks. This dataset contains 60000 training examples and 10000 testing examples. All examples are grayscale images of size $28 \times 28$. In our experiments, we do not perform any preprocessing or augmentation on this dataset. The Adam optimization algorithm [33] is used in all classification experiments on MNIST, with the hyperparameters $\beta_1$ and $\beta_2$ configured as suggested in [33] and a mini-batch size of 100. The initial learning rate is set to 0.001 and is divided by 10 every 5 epochs. No regularization is used in these experiments.
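The learning-rate schedule described for MNIST can be written as a small pure-Python function. The name `step_decay_lr` and the zero-based epoch index are assumptions made for illustration.

```python
def step_decay_lr(epoch, base_lr=1e-3, factor=0.1, step=5):
    """Learning rate at a zero-based epoch: base_lr scaled by `factor` every `step` epochs."""
    return base_lr * factor ** (epoch // step)

# Rates seen over the 10 training epochs used for the shallow network.
schedule = [step_decay_lr(epoch) for epoch in range(10)]
```

In PyTorch, a comparable schedule would presumably be obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)` on top of `torch.optim.Adam`.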
CIFAR-10
The CIFAR-10 dataset consists of 60000 color natural images in 10 classes, with 6000 images per class, of which 5000 are for training and 1000 for testing. Each image is of size $32 \times 32$. For data augmentation we follow the practice in [11], applying horizontal flips and randomly sampling $32 \times 32$ crops from the image padded by 4 pixels on each side. Normalization is performed in data preprocessing using the channel means and standard deviations. In experiments on this dataset, we use stochastic gradient descent with Nesterov momentum and the cross-entropy loss. The initial learning rate is 0.1, and it is multiplied by 0.1 at epochs 80 and 120. Weight decay is configured to  and momentum to 0.9 without dampening. The mini-batch size is set to 128.
III-B Classification results on MNIST
Shallow network
We first use a network with max pooling layers (see Table I). Each convolution layer is followed by batch normalization and a ReLU nonlinearity. We test Hartley-based spectral pooling by replacing max pooling in this architecture. The training procedure lasts 10 epochs and is repeated 10 times. The classification error on the test set in each epoch is shown in Fig. 3. Compared with max pooling, spectral pooling shows strong results, yielding a reduction of more than 15% in classification error in our experiment. Since everything is equal except the pooling layers, we attribute this improvement to the better information-preserving ability of Hartley-based spectral pooling.
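| layer name | output size | Max Pooling model | Spectral Pooling model |
|---|---|---|---|
| conv1 | 28×28 | 5×5, 16 | 5×5, 16 |
| pool1 | 14×14 | Max, stride=2 | Spectral, 14×14 |
| conv2 | 14×14 | 5×5, 32 | 5×5, 32 |
| pool2 | 7×7 | Max, stride=2 | Spectral, 7×7 |
| fc | 1×1 | 10-d fc | 10-d fc |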
[Figure: compare_Acc_epochs_10.png | compare_AccBar_epochs_10.png]
ResNet
Next, we use ResNet20 [11], a deeper network. We do not use deeper residual nets such as ResNet110 because the much larger number of parameters in such architectures may give rise to overfitting. ResNet is composed of numerous residual building blocks. It is a more modern convolutional neural network that does not explicitly use pooling layers but instead embeds a stride-2 convolution layer inside some of the building blocks to perform downsampling. For our experiments, we replace the stride-2 convolutional layer with spectral pooling and remove the skip connection [9, 34] in those downscaling blocks. Besides, we set the output sizes of successive spectral pooling layers to decrease linearly (reducing by 8 along each axis after each spectral pooling layer). The manually tuned network architecture is depicted in Table II (building blocks are shown in brackets, with the number of blocks stacked; SP stands for the spectral pooling layer, and its subscript indicates the output size). We leave the global average pooling untouched. Each experiment is repeated 5 times and the best result is reported in Table III.
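The linearly decreasing output sizes can be contrasted with the halving produced by stride-2 convolutions in a few lines of Python. The helper names are hypothetical, and the stride-2 sizes assume a 3×3 convolution with padding 1, which is an assumption rather than a detail stated here.

```python
def linear_pool_sizes(input_size, num_pools, step=8):
    """Output sizes when each spectral pooling layer removes `step` pixels per axis."""
    sizes = []
    size = input_size
    for _ in range(num_pools):
        size -= step
        if size <= 0:
            raise ValueError("too many pooling layers for this input size")
        sizes.append(size)
    return sizes

def stride2_sizes(input_size, num_downsamples):
    """Output sizes after successive stride-2, 3x3, padding-1 convolutions (assumed)."""
    sizes = []
    size = input_size
    for _ in range(num_downsamples):
        size = (size + 1) // 2      # equals floor((size + 2 - 3) / 2) + 1
        sizes.append(size)
    return sizes

spectral = linear_pool_sizes(28, 3)   # three spectral pooling layers on MNIST
strided = stride2_sizes(28, 2)        # two stride-2 stages on the same input
```

Since spectral pooling accepts arbitrary output sizes, the resolution can be reduced in three gentler steps instead of two halvings.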
Surprisingly, the spectral pooling network SP-ResNet21 performs slightly better than its ResNet20 counterpart, even though it has only about half as many parameters (0.15M vs. 0.27M). Further, we illustrate one of the five training runs in Fig. 4. SP-ResNet21 converges faster than ResNet20 (left panel) and performs better in classification (right panel). This indicates that spectral pooling eases the optimization by providing faster convergence at the early stage.
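| block name | output size | 21-layer |
|---|---|---|
| conv1 | 28×28 | 3×3, 16 |
| conv2 | 28×28 | |
| downsample1 | 20×20 | |
| conv3 | 20×20 | |
| downsample2 | 12×12 | |
| conv4 | 8×8 | |
| downsample3 | 4×4 | |
| conv5 | 4×4 | |
| | 1×1 | avg pool, 10-d fc |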
[Figure: compare_Train_Loss.png | compare_Train_Acc.png]
| method | # params | error (%) |
|---|---|---|
| ResNet20 | 0.27M | 0.36 |
| SP-ResNet21 | 0.15M | 0.32 |
III-C Classification results on CIFAR-10
For the network trained on CIFAR-10, the architecture is similar to the SP-ResNet21 trained on MNIST. We set the sizes of output of spectral pooling layers to be , , sequentially. In these experiments we use ResNet16 as the counterpart, since it contains almost the same number of parameters as SP-ResNet21 (see Table IV). The spectral pooling network, SP-ResNet21, outperforms ResNet16 by 0.24 percentage points in classification error (8.63% vs. 8.87%).
| method | # params | error (%) |
|---|---|---|
| ResNet16 | 0.18M | 8.87 |
| SP-ResNet21 | 0.15M | 8.63 |
IV Conclusion
We present a fully real-valued spectral pooling method in this paper. It reduces the computational complexity compared with previous spectral pooling work. Based on this approach, we provide results on commonly used benchmark datasets (MNIST and CIFAR-10) by training modified residual nets. We also investigate the contribution of this spectral pooling method to the convergence of training neural networks. We demonstrate that spectral pooling yields higher classification accuracy than its max pooling counterpart. In residual nets, it improves the convergence rate of training by expanding the range of spatial dimensionalities available for downsampling. It is also easy to implement and fast to compute during training.
References
 [1] O. Rippel, J. Snoek, and R. P. Adams, “Spectral representations for convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 2449–2457.
 [2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [3] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
 [4] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
 [5] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a backpropagation network,” in Advances in Neural Information Processing Systems (NIPS), 1990, pp. 396–404.
 [6] D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex,” The Journal of Physiology, vol. 160, no. 1, pp. 106–154, 1962.
 [7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
 [8] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich et al., “Going deeper with convolutions,” in CVPR, 2015.
 [10] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv preprint arXiv:1412.6806, 2014.
 [11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
 [12] J. G. Proakis and D. G. Manolakis, Digital signal processing. Pearson Education, 2013.
 [13] C.-Y. Lee, P. W. Gallagher, and Z. Tu, “Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree,” in Artificial Intelligence and Statistics, 2016, pp. 464–472.
 [14] C. Gulcehre, K. Cho, R. Pascanu, and Y. Bengio, “Learned-norm pooling for deep feedforward and recurrent neural networks,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2014, pp. 530–546.
 [15] J. Bruna, A. Szlam, and Y. LeCun, “Signal recovery from pooling representations,” arXiv preprint arXiv:1311.4025, 2013.
 [16] D. Yu, H. Wang, P. Chen, and Z. Wei, “Mixed pooling for convolutional neural networks,” in International Conference on Rough Sets and Knowledge Technology. Springer, 2014, pp. 364–375.
 [17] M. D. Zeiler and R. Fergus, “Stochastic pooling for regularization of deep convolutional neural networks,” arXiv preprint arXiv:1301.3557, 2013.
 [18] S. Zhai, H. Wu, A. Kumar, Y. Cheng, Y. Lu, Z. Zhang, and R. Feris, “S3Pool: Pooling with stochastic spatial sampling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4970–4978.
 [19] B. Graham, “Fractional max-pooling,” arXiv preprint arXiv:1412.6071, 2014.
 [20] F. Saeedan, N. Weber, M. Goesele, and S. Roth, “Detail-preserving pooling in deep networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9108–9116.
 [21] Y.-L. Boureau, J. Ponce, and Y. LeCun, “A theoretical analysis of feature pooling in visual recognition,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 111–118.
 [22] J. S. Smith and B. M. Wilamowski, “Discrete cosine transform spectral pooling layers for convolutional neural networks,” in International Conference on Artificial Intelligence and Soft Computing. Springer, 2018, pp. 235–246.
 [23] R. V. Hartley, “A more symmetrical Fourier analysis applied to transmission problems,” Proceedings of the IRE, vol. 30, no. 3, pp. 144–150, 1942.
 [24] R. N. Bracewell, The Hartley transform. Oxford University Press, Inc., 1986.
 [25] R. N. Bracewell, “Discrete Hartley transform,” JOSA, vol. 73, no. 12, pp. 1832–1835, 1983.
 [26] R. N. Bracewell, “The fast Hartley transform,” Proceedings of the IEEE, vol. 72, no. 8, pp. 1010–1018, 1984.
 [27] R. Millane, “Analytic properties of the Hartley transform and their implications,” Proceedings of the IEEE, vol. 82, no. 3, pp. 413–428, 1994.
 [28] J. Agbinya, “Fast interpolation algorithm using fast Hartley transform,” Proceedings of the IEEE, vol. 75, no. 4, pp. 523–524, 1987.
 [29] H. Pratt, B. Williams, F. Coenen, and Y. Zheng, “FCNN: Fourier convolutional neural networks,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2017, pp. 786–798.
 [30] A. Torralba and A. Oliva, “Statistics of natural image categories,” Network: Computation in Neural Systems, vol. 14, no. 3, pp. 391–412, 2003.
 [31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” 2017.
 [32] E. Kussul and T. Baidyk, “Improved method of handwritten digit recognition tested on MNIST database,” Image and Vision Computing, vol. 22, no. 12, pp. 971–981, 2004.
 [33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [34] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015.