1 Introduction
Convolutional neural networks (CNNs)[29] have achieved remarkable performance in many computer vision tasks, e.g., object detection[31, 40, 41] and semantic segmentation[43][1], by capturing and representing multi-level features from huge volumes of data. However, existing experiments[2, 10] demonstrate that CNNs are often surprisingly fragile[20]. Injecting only minute perturbations, e.g., random noise, contrast changes, or blurring, can lead to significant degradation of model performance; i.e., CNN models usually lack the ability to generalize across domains. A variety of explanations for this vulnerability have been proposed, e.g., the limited scale of datasets, the inconsistency between the distribution of real data and that of training data, and computational constraints[5], which have resulted in a variety of coping strategies, such as data augmentation[49], adversarial training[13, 7, 35], and parameter regularization[47, 12]. Indeed, these strategies propel CNN models to encode invariant features and neglect variable information during the learning phase. In essence, convolutions are signal processing operations that amplify certain frequencies of the input and attenuate others. This leads us to ask: can we prompt CNNs to "remember" invariant features by explicitly representing certain frequency components of their convolution layers? And how can we find those frequency components for different layers? In this paper, we show that the answers to these questions lead to some surprising new perspectives on model robustness and generalization.
The low frequency components of a training set are easier to learn than the high frequency components, because low-frequency signals are numerous but vary little. For a finite training set, there exists a valid frequency range: information below the lower bound is usually dataset bias, while information above the upper bound is often noise. This phenomenon probably explains why a CNN with common settings always first quickly captures the dominant low frequency components, but easily overfits when the application scenario shifts. Therefore, we might be able to promote the generalization and convergence performance of CNN models by putting frequency range constraints on convolution layers during the learning phase.
The architecture of CNNs is designed to abstract information layer by layer from low to high[39]. It is generally assumed that low layers are in charge of extracting low frequency information, such as dots, lines, and texture, while high layers are responsible for high frequency information, such as shapes and sketches. Intuitively, we could drive a CNN model to pinpoint the valid frequency range of the training set by imposing low frequency constraints on the early convolution layers and high frequency constraints on the later ones. However, due to other factors, e.g., shortcut connections[19], learning methods, and sample distribution, the valid frequency range is probably entangled across different layers rather than continuous.
In this paper, we propose a novel frequency domain regularization on convolution layers, which improves the generalization and convergence performance of CNN models by automatically untangling the spectrum of the convolution layers and navigating the model to the valid frequency range of the training set. In a nutshell, our main contributions can be summarized as follows.

An extremely small but valid spectral range for different layers is pinpointed.

A general training approach with frequency domain regularization on convolution layers, for improving the generalization and convergence performance of CNN models. Compared with data augmentation and other implicit regularization techniques, our training technique improves the transferability of the model.

A comprehensive evaluation to investigate the effectiveness of the proposed approach, and to demonstrate how it can raise the generalization of CNN models.
2 Related work
Promoting the generalization of models is very important for deep learning. Generally, there are three branches of techniques to achieve this target, i.e., data augmentation, regularization, and spectrum analysis.
Data augmentation: Data augmentation[21, 9] is a common idea for reducing overfitting, increasing the amount of training data using only information already present in the training data. Simple techniques, such as cropping, rotating, and flipping, are prevalent in CNN model training and usually improve validation performance a little. However, such simple techniques cannot provide any practical defense against adversarial examples[8], which has led to an emerging direction of data augmentation, i.e., adversarial training[28, 3]. Indeed, adversarial training performs unsupervised generation of new samples using GANs[17], which can provide a large number of hard examples for training. However, recent studies[49] demonstrate that adversarial training usually improves robustness to corruptions concentrated in the high frequency domain while reducing robustness to corruptions concentrated in the low frequency domain.
Regularization: [50] comprehensively evaluates the performance of explicit and implicit regularization techniques, i.e., dropout[44], weight decay[27, 32][23], and early stopping, and concludes that although regularizers can provide marginal improvement, they seem not to be the fundamental reason for generalization; the architecture is. None of the regularization techniques mentioned above has much effect on preventing a model from quickly fitting randomly labeled data. Sharpness and norms offer another perspective on generalization[36]. There is a tight connection between the spectral norm and Lipschitz continuity, which can be used to flatten minima and bound the generalization error[4, 35, 13]. The Jacobian penalty[22] and orthogonality of weights[38] can also be used to improve generalization. But none of these regularization techniques focuses on the transferability of a model to unseen domains, nor can they explicitly pinpoint the valid range of features to help the model shield against background and noise.
Spectrum analysis: Indeed, convolution is a common method for extracting specific spectra in signal processing. Inspired by this, there is substantial recent interest in studying the spectral properties of CNNs, with applications to model compression[6], speeding up model inference[33, 46], memory reduction[18], theoretical understanding of CNN capacity[39, 45], and, eventually, better training methodologies[49, 15, 42, 30]. The works that leverage spectral properties of CNNs to design better training methodologies are the most relevant to this paper. For example, recent studies[24, 39] find that a CNN model is usually biased towards lower Fourier frequencies, while natural images tend to have the bulk of their Fourier spectrum concentrated in the low to mid-range frequencies. Following this discovery, some works try to drop the high frequency components of the inputs to improve the generalization of the model, e.g., spectral dropout[25] and band-limited training[11]. In practice, however, high-frequency components may be non-robust yet highly predictive[49]. Therefore, although high frequency components contain noise, we do not simply drop them in our work; a more in-depth discussion of the valid spectral range is needed.
3 Method
In this section, we introduce our regularization method to constrain the frequency spectra of the convolution. The overview of our method is illustrated in Figure 1.
3.1 FFT-based convolution
The convolution theorem ensures that convolution in the spatial domain is equivalent to element-wise multiplication in the spectral domain. The main intuition of frequency analysis is that an image represented in the spatial domain is highly redundant, while a spectral domain representation lets a filter target specific length scales and orientations[42, 14]. Convergence speed-up and lower computational cost are additional benefits. The Fourier transform of a convolution kernel is called the spectrum of the convolution.
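As a concrete sanity check (an illustrative example of ours, not taken from the paper), the convolution theorem can be verified numerically: a circular convolution computed directly in the spatial domain matches the inverse FFT of the element-wise product of the two spectra.

```python
import numpy as np

# Numerically verify the convolution theorem on a small random example:
# circular convolution in the spatial domain equals element-wise
# multiplication of the spectra in the frequency domain.
N = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((N, N))   # input feature map
w = rng.standard_normal((N, N))   # filter, zero-padded to the input size

# Direct circular convolution in the spatial domain.
y_spatial = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        for m in range(N):
            for n in range(N):
                y_spatial[i, j] += x[m, n] * w[(i - m) % N, (j - n) % N]

# Element-wise product of the spectra, then inverse FFT.
y_spectral = np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(w)).real
# y_spatial and y_spectral agree to numerical precision.
```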
3.2 Mask design
Mask design is the key component of our method. The mask helps us pinpoint the valid frequency range entangled across different layers, and using back propagation to update it is the main difference from similar work.
Binarized mask
Our regularization tries to mask out the frequencies of background and noise, maintaining only the frequencies that are useful for classification; a binarized mask limits the spectrum of each convolution layer.
Gradient Computation and Accumulation:
The gradients of the mask are accumulated in real-valued variables, as shown in Algorithm 1.
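A minimal sketch of this scheme — binarize on the forward pass, accumulate gradients in a real-valued variable — might look as follows. The class name, the 0.5 binarization threshold, and the plain SGD update are our illustrative assumptions, not necessarily the paper's Algorithm 1.

```python
import numpy as np

# Sketch of a binarized spectral mask with a real-valued gradient
# accumulator. The 0.5 threshold and plain SGD step are assumptions;
# Algorithm 1 in the paper may differ in detail.
class SpectralMask:
    def __init__(self, shape, seed=0):
        rng = np.random.default_rng(seed)
        # Real-valued variable, initialized as in Section 4.1 so that
        # most frequencies are kept at the start of training.
        self.real = rng.normal(0.8, np.sqrt(0.2), size=shape)

    def forward(self):
        # Binarize: 1 keeps a frequency bin, 0 drops it.
        return (self.real >= 0.5).astype(np.float64)

    def backward(self, grad_wrt_mask, lr=0.01):
        # Straight-through style update: the gradient w.r.t. the binary
        # mask is accumulated directly into the real-valued variable.
        self.real -= lr * grad_wrt_mask

mask = SpectralMask((4, 4))
spectrum = np.ones((4, 4), dtype=complex)
masked_spectrum = spectrum * mask.forward()   # element-wise masking
```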
4 Experiments
We demonstrate the effectiveness of our regularization method on various datasets and architectures and compare it with several state-of-the-art methods. We also explore its properties in detail.
4.1 Experimental settings
Datasets: We conduct our experiments with CIFAR-10[26], which contains 10 classes, with 50000 images for training and 10000 images for testing.
Baseline training
All models use SGD with momentum set to 0.9. For the CIFAR dataset, the learning rate is set to 0.01; for the ImageNet dataset, it is set to 0.1. If weight decay and dropout are used, weight decay is set to and the keep-prob is set to 0.9. When training from scratch, the mask is initialized with random numbers from a normal distribution with mean 0.8 and variance 0.2. This means we do not drop any frequency at the beginning; as the model learns, the accumulated gradient of the mask pinpoints the valid frequency range. For fine-tuning, the mask is initialized with random numbers from a normal distribution with mean 0.6 and variance 0.1 to accelerate the learning of the mask.
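To make the initialization concrete, the following sketch (our own, not from the paper) checks what fraction of frequency bins an initial mask would keep; the 0.5 binarization threshold is an assumption for illustration.

```python
import numpy as np

# Illustrative check of the from-scratch mask initialization: mean 0.8,
# variance 0.2. The 0.5 binarization threshold is an assumption.
rng = np.random.default_rng(0)
mask_real = rng.normal(0.8, np.sqrt(0.2), size=(64, 3, 3))

kept = (mask_real >= 0.5).mean()
# With this initialization most entries start above the threshold, so
# the bulk of the spectrum is retained before any gradient updates.
```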
4.2 Experimental results and analysis
Method       | Clear | Impulse noise | Fog    | Contrast
Natural      | 93.5  | 50.436        | 85.14  | 70.858
Gauss        |       |               |        |
Adversarial  |       |               |        |
Our method   | 94.06 | 57.344        | 86.752 | 73.432
Conv layer     | Mask percentage
layer1conv1.1  | 0.6938
layer1conv1.2  | 0.5810
layer1conv2.1  | 0.5213
layer1conv2.2  | 0.4889
layer2conv1.1  | 0.3354
layer2conv1.2  | 0.3696
layer2conv2.1  | 0.3124
layer2conv2.2  | 0.2644
layer3conv1.1  | 0.2082
layer3conv1.2  | 0.2621
layer3conv2.1  | 0.2453
layer3conv2.2  | 0.1477
layer4conv1.1  | 0.0810
layer4conv1.2  | 0.0455
layer4conv2.1  | 0.0337
layer4conv2.2  | 0.0175
It has been observed that Gaussian data augmentation and adversarial training improve robustness to all noise and many of the blurring corruptions, while degrading robustness to fog and contrast[49]. Our method yields better results against fog, contrast, and impulse noise, which shows that it alleviates the low-frequency brittleness caused by adversarial training.
5 Discussion
5.1 Binarized mask or continuous mask
When a binarized mask is used in the spatial domain, it produces side effects such as boundary artifacts. In the spectral domain, however, this kind of side effect is not obvious, and the mask can be seen as a novel way of denoising. Benefiting from the property of the spectral domain that it represents features in an invariant and sparse way, our method can suppress wrong activations in the spectral domain through the binarized mask.
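A toy one-dimensional illustration of this denoising effect (our own example, not from the paper): binarizing away high-frequency bins of a noisy signal recovers the clean low-frequency content.

```python
import numpy as np

# Toy illustration: a binary mask in the spectral domain acts as a
# denoiser when the signal is low frequency and the noise is broadband.
N = 256
t = np.arange(N)
clean = np.sin(2 * np.pi * 3 * t / N)          # low-frequency signal
rng = np.random.default_rng(0)
noisy = clean + 0.3 * rng.standard_normal(N)   # broadband noise added

spec = np.fft.fft(noisy)
mask = np.zeros(N)
mask[:8] = 1.0    # keep low positive frequencies (bins 0..7)
mask[-7:] = 1.0   # keep the symmetric negative-frequency bins
denoised = np.fft.ifft(spec * mask).real

err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((denoised - clean) ** 2)
# Masking suppresses the noise energy outside the kept bins while
# preserving the signal, so err_denoised is far smaller than err_noisy.
```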
5.2 Our method with random drop
Our method pinpoints the valid frequency range of the training set. What if we randomly drop some frequencies after applying our method: would the model learn redundant features, rely less heavily on those frequencies, and perform better? In the last line of Table 1, we show that this is not a better choice, which verifies from another perspective that our method pinpoints the valid frequency range.
5.3 Inter-class mask
If we train a spectral domain mask for each class, it may achieve better performance and transfer implicitly among the same category across different datasets. However, at test time we would need to determine the category of an image before applying the inter-class mask, so we may need to change the architecture of the model, which is left to future work.
6 Conclusion
We proposed a novel regularization method that explicitly removes unimportant frequencies at training time. 1) We pinpoint the valid frequency range entangled across different layers. 2) We demonstrate that a model trained with our regularization is more robust on unseen data.
Compared with band-limited training[11] and spectral dropout[25], which also impose restrictions in the spectral domain, our method differs in two aspects: 1) Unlike energy-based compression techniques, our method does not drop high frequency components indiscriminately. Our goal is not to minimize the approximation error between the masked input and filter and the unmasked ones, but to find the frequencies most important for classification and force the model to shield against background and noise. 2) We do not use a keep-percentage hyperparameter to determine the threshold for masking. Our method uses back propagation to figure out the mask, so it can be used end-to-end.
Compared with self-supervised learning strategies[34, 37, 48], our method does not require a complex architecture. We try to leverage the transferability of frequencies to address the transferability of models and domain adaptation.

References

[1] (2018) Convolutional image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5561–5570.
[2] (2018) Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177.
[3] (2017) CVAE-GAN: fine-grained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2745–2754.
[4] (2017) Spectrally-normalized margin bounds for neural networks. In NIPS.
[5] (2018) Adversarial examples from computational constraints. In ICML.
[6] (2016) Compressing convolutional neural networks in the frequency domain. In KDD '16.
[7] (2017) Parseval networks: improving robustness to adversarial examples. In ICML.
[8] (2017) Intriguing properties of adversarial examples. arXiv preprint arXiv:1711.02846.
[9] (2018) AutoAugment: learning augmentation policies from data. arXiv abs/1805.09501.
[10] (2017) A study and comparison of human and deep learning recognition performance under visual distortions. In 2017 26th International Conference on Computer Communication and Networks (ICCCN), pp. 1–7.
[11] (2019) Band-limited training and inference for convolutional neural networks. In International Conference on Machine Learning, pp. 1745–1754.
[12] (2018) A spectral approach to generalization and optimization in neural networks.
[13] (2018) Generalizable adversarial training via spectral normalization. arXiv abs/1811.07457.
[14] (1987) Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, Optics and Image Science.
[15] (2018) Wavelet convolutional neural networks. arXiv abs/1805.08620.
[16] (2018) DropBlock: a regularization method for convolutional networks. arXiv abs/1810.12890.
[17] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
[18] (2019) SpecNet: spectral domain convolutional neural network. arXiv abs/1905.10915.
[19] (2015) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
[20] (2019) Why deep-learning AIs are so easy to fool. Nature 574 (7777), pp. 163.
[21] (2020) AugMix: a simple data processing method to improve robustness and uncertainty. In Proceedings of the International Conference on Learning Representations (ICLR).
[22] (2019) Robust learning with Jacobian regularization. arXiv abs/1908.02729.
[23] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
[24] (2017) Measuring the tendency of CNNs to learn surface statistical regularities. arXiv abs/1711.11561.
[25] (2017) Regularization of deep neural networks with spectral dropout. Neural Networks 110, pp. 82–90.
[26] (2009) Learning multiple layers of features from tiny images.
[27] (1992) A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pp. 950–957.
[28] (2015) Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.
[29] (1990) Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pp. 396–404.
[30] (2018) Multi-level wavelet-CNN for image restoration. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 886–88609.
[31] (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
[32] (2017) Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101.
[33] (2013) Fast training of convolutional networks through FFTs. CoRR abs/1312.5851.
[34] (2019) Self-supervised learning of pretext-invariant representations. arXiv abs/1912.01991.
[35] (2018) Spectral normalization for generative adversarial networks. arXiv abs/1802.05957.
[36] (2017) Exploring generalization in deep learning. In NIPS.
[37] (2020) Unsupervised intra-domain adaptation for semantic segmentation through self-supervision. arXiv abs/2004.07703.
[38] (2018) RePr: improved training of convolutional filters. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10658–10667.
[39] (2018) On the spectral bias of neural networks. In ICML.
[40] (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767.
[41] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
[42] (2015) Spectral representations for convolutional neural networks. arXiv abs/1506.03767.
[43] (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
[44] (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958.
[45] (2018) On the structural sensitivity of deep convolutional networks to the directions of Fourier basis functions. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 51–60.
[46] (2014) Fast convolutional nets with fbfft: a GPU performance evaluation. CoRR abs/1412.7580.
[47] (2016) Learning structured sparsity in deep neural networks. In NIPS.
[48] (2020) Exploring categorical regularization for domain adaptive object detection. arXiv abs/2003.09152.
[49] (2019) A Fourier perspective on model robustness in computer vision. In NeurIPS.
[50] (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
[51] (2016) Learning deep features for discriminative localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929.