A variety of explanations of this vulnerability have been proposed, e.g., the limited scale of data-sets, the mismatch between the distribution of real data and that of the training data, and computational constraints. These explanations have in turn produced a variety of coping strategies, such as data augmentation, adversarial training [13, 7, 35], and parameter regularization [47, 12]. In effect, these strategies propel CNN models to encode invariant features while neglecting variable information during the learning phase. In essence, convolutions are signal processing operations that amplify certain frequencies of the input and attenuate others. This leads us to ask: can we prompt a CNN to "remember" invariant features by explicitly representing certain frequency components in its convolution layers? And how do we find those frequency components for different layers? In this paper, we show that answering these questions leads to some surprising new perspectives on model robustness and generalization.
The low frequency components of a training set are easier to learn than the high frequency components, because low-frequency signals are numerous but vary little. For a finite training set, there exists a valid frequency range: information below the lower bound is usually data-set bias, while information above the upper bound is often noise. This probably explains why a CNN with common settings always first quickly captures the dominant low frequency components, but easily over-fits when the application scenario shifts. Therefore, we might be able to improve the generalization and convergence of CNN models by imposing frequency range constraints on the convolution layers during the learning phase.
The architecture of CNNs is designed to abstract information layer by layer, from low to high. It is generally assumed that low layers extract low frequency information, such as dots, lines, and texture, while high layers are responsible for high frequency information, such as shapes and sketches. Intuitively, we could drive a CNN model to pinpoint the valid frequency range of the training set by imposing low frequency constraints on the earlier convolution layers and high frequency constraints on the later ones. However, due to other factors, e.g., shortcut connections, learning methods, and sample distribution, the valid frequency range is probably entangled across different layers rather than continuous.
In this paper, we propose a novel frequency domain regularization on convolution layers, which improves the generalization and convergence performance of CNN models by automatically untangling the spectra of the convolution layers and navigating the model to the valid frequency range of the training set. In a nutshell, our main contributions can be summarized as follows.
-  We pinpoint an extremely small but valid spectral range for different layers.
-  We propose a general training approach with frequency domain regularization on convolution layers, which improves the generalization and convergence performance of CNN models. Compared with data augmentation and other implicit regularization techniques, our training technique improves the transferability of the model.
-  We conduct a comprehensive evaluation to investigate the effectiveness of the proposed approach and demonstrate how it raises the generalization of CNN models.
2 Related work
Promoting the generalization of models is very important for deep learning. Generally, there are three branches of techniques for achieving this goal, i.e., data augmentation, regularization, and spectrum analysis.
Data augmentation: Data augmentation [21, 9] is a common way to reduce over-fitting: it increases the amount of training data using information contained only in the training data. Simple techniques, such as cropping, rotating, and flipping, are prevalent in CNN model training and usually yield a small improvement in validation performance. However, such simple techniques cannot provide any practical defense against adversarial examples, which has led to an emerging direction of data augmentation, i.e., adversarial training [28, 3]. Indeed, adversarial training performs unsupervised generation of new samples using GANs, which can provide a large number of hard examples for training. However, recent studies demonstrate that adversarial training usually improves robustness to corruptions concentrated in the high frequency domain while reducing robustness to corruptions concentrated in the low frequency domain.
Spectrum analysis: Convolution is a common method for extracting a specific spectrum in signal processing. Inspired by this, there is substantial recent interest in studying the spectral properties of CNNs, with applications to model compression, speeding up model inference [33, 46], memory reduction, theoretical understanding of CNN capacity [39, 45], and, eventually, better training methodologies [49, 15, 42, 30]. The works that leverage spectral properties of CNNs to design better training methodologies are the most relevant to this paper. For example, recent studies [24, 39] find that a CNN model is usually biased towards lower Fourier frequencies, while natural images tend to have the bulk of their Fourier spectrum concentrated in the low to mid-range frequencies. Based on this discovery, some works try to drop the high frequency components of the inputs to improve the generalization of the model, e.g., spectral dropout and band-limited training. In practice, high-frequency components are perhaps non-robust but highly predictive. Therefore, although high frequency components contain noise, we do not simply drop them in our work; the valid spectral range requires a more in-depth discussion.
In this section, we introduce our regularization method, which constrains the frequency spectra of the convolution layers. An overview of our method is illustrated in Figure 1.
3.1 FFT-based convolution
Given a tensor t, the Fourier transform is used to map t to the spectral domain.
The convolution theorem ensures that convolution in the spatial domain is equal to element-wise multiplication in the spectral domain. The main intuition of frequency analysis is that an image represented in the spatial domain is significantly redundant, while representing it in the spectral domain lets filters target specific length-scales and orientations [42, 14]. Convergence speedup and lower computational cost are additional benefits.
The Fourier transform of the convolution kernel is called the spectrum of the convolution.
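The equivalence between spatial convolution and element-wise multiplication of spectra can be sketched numerically. The helpers `circular_conv` and `fft_conv` below are our own illustration, not the paper's implementation, and use 1-D circular convolution for simplicity:

```python
import numpy as np

def circular_conv(x, k):
    """Circular convolution of 1-D signals x and k, computed directly."""
    n = len(x)
    return np.array([sum(x[(i - j) % n] * k[j] for j in range(n))
                     for i in range(n)])

def fft_conv(x, k):
    """The same convolution via the spectral domain: multiply the spectra."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([0.5, 0.25, 0.0, 0.0])
assert np.allclose(circular_conv(x, k), fft_conv(x, k))
```

The same identity extends to 2-D images and kernels with `np.fft.fft2`, which is what makes masking a layer's spectrum equivalent to constraining its convolution.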
3.2 Mask design
Mask design is the key component of our method. The mask helps us pinpoint the valid frequency range entangled across different layers, and updating it by back propagation is the main difference from similar work.
Our regularization tries to mask the frequencies of background and noise, maintaining only the frequencies that are useful for classification. The mask limits the spectrum of the convolution.
Gradient Computation and Accumulation:
The gradients of the mask are accumulated in real-valued variables, as shown in Algorithm 1.
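The following toy sketch illustrates the idea of accumulating gradients in a real-valued variable while the forward pass uses a hard mask, in the style of straight-through estimation. The variable names, toy loss, and threshold are our own assumptions, not the paper's Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(0)
real_mask = rng.normal(0.8, np.sqrt(0.2), size=8)  # real-valued accumulator
spectrum = rng.normal(size=8)                      # stand-in layer spectrum
target = spectrum                                  # toy target: keep all frequencies
lr = 0.1

def binarize(m, threshold=0.5):
    """Hard mask used in the forward pass."""
    return (m > threshold).astype(float)

init = real_mask.copy()
for _ in range(10):
    b = binarize(real_mask)          # forward: binary mask
    out = spectrum * b               # masked spectrum
    grad_out = out - target          # d/d(out) of 0.5 * ||out - target||^2
    grad_mask = grad_out * spectrum  # straight-through: gradient w.r.t. mask
    real_mask -= lr * grad_mask      # accumulate in the real-valued variable

# Suppressed frequencies hurt this toy loss, so mask entries only grow here.
assert np.all(real_mask >= init)
```

The point of the real-valued accumulator is that small gradients can add up over many steps and eventually flip a binary mask entry, which a purely binary variable could not do.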
We demonstrate the effectiveness of our regularization method on various datasets and architectures and compare it with several state-of-the-art methods. We also explore its properties in detail.
4.1 Experimental settings
Datasets We conduct our experiments on CIFAR-10, which contains 10 classes, with 50000 images for training and 10000 images for testing.
All models use SGD with momentum set to 0.9. For the CIFAR dataset, the learning rate is set to 0.01; for the ImageNet dataset, it is set to 0.1. If weight decay and dropout are used, weight decay is set to
and the keep-prob is set to 0.9. When training from scratch, the mask is initialized with random numbers drawn from a normal distribution with mean 0.8 and variance 0.2. This means we do not drop any frequency at the beginning; as the model learns, the accumulated gradient of the mask pinpoints the valid frequency range. For fine-tuning, the mask is initialized from a normal distribution with mean 0.6 and variance 0.1 to accelerate the learning of the mask.
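A minimal sketch of this initialization is shown below. The helper `init_mask` and its arguments are our own naming; note that a "variance" of 0.2 corresponds to a standard deviation of sqrt(0.2):

```python
import numpy as np

def init_mask(shape, finetune=False, seed=0):
    """Draw an initial spectral mask; shapes and names are illustrative."""
    rng = np.random.default_rng(seed)
    if finetune:
        return rng.normal(0.6, np.sqrt(0.1), size=shape)  # mean 0.6, var 0.1
    return rng.normal(0.8, np.sqrt(0.2), size=shape)      # mean 0.8, var 0.2

m = init_mask((4, 4))  # from-scratch initialization
assert m.shape == (4, 4)
```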
4.2 Experimental results and analysis
[Table: mask percentage retained per convolution layer]
It was observed that Gaussian data augmentation and adversarial training improve robustness to all noise corruptions and many of the blurring corruptions, while degrading robustness to fog and contrast. Our method achieves better results against fog, contrast, and impulse noise, which shows that it alleviates the low frequency brittleness caused by adversarial training.
5.1 Binarized mask or continuous mask
When a binarized mask is used in the spatial domain, it has side effects such as generating boundaries. In the spectral domain, however, this kind of side effect is not obvious, and the mask can be seen as a novel way of denoising. Benefiting from the property that the spectral domain represents features in an invariant and sparse way, our method can suppress wrong activations in the spectral domain through the binarized mask.
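The denoising behavior of a binarized spectral mask can be seen on a toy 1-D signal. This example is our own illustration (signal, noise level, and threshold are assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)
signal = np.sin(2 * np.pi * np.arange(64) / 8)  # clean low-frequency tone
noisy = signal + 0.1 * rng.normal(size=64)      # additive Gaussian noise

spec = np.fft.fft(noisy)
mask = (np.abs(spec) > 5.0).astype(float)       # binarized spectral mask
denoised = np.real(np.fft.ifft(spec * mask))

# Zeroing small spectral coefficients removes most of the noise energy.
assert np.mean((denoised - signal) ** 2) < np.mean((noisy - signal) ** 2)
```

Because the clean tone concentrates its energy in two spectral bins while the noise spreads thinly over all bins, a hard threshold in the spectral domain discards mostly noise, which is the sparsity property the paragraph above appeals to.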
5.2 Our method with random drop
Our method pinpoints the valid frequency range for the training set. What if we randomly drop some frequencies after applying our method: would the model learn redundant features, stop relying heavily on those frequencies, and perform better? The last row of Table 1 shows that this is not a better choice, which verifies, from another perspective, that our method pinpoints the valid frequency range.
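The random-drop ablation can be sketched as an extra Bernoulli mask applied on top of the learned one. The helper below is our own illustration of the ablation, not the paper's code:

```python
import numpy as np

def random_drop(spectrum, mask, drop_prob, rng):
    """Apply the learned mask, then randomly zero surviving frequencies."""
    keep = rng.random(spectrum.shape) >= drop_prob
    return spectrum * mask * keep

rng = np.random.default_rng(0)
spec = rng.normal(size=16)                  # stand-in layer spectrum
mask = (rng.random(16) > 0.3).astype(float) # stand-in learned binary mask
out = random_drop(spec, mask, 0.2, rng)

# Frequencies removed by the learned mask stay removed.
assert np.all(out[mask == 0] == 0)
```

Since the learned mask has already discarded the invalid frequencies, the extra random drop can only remove frequencies the model still needs, which is consistent with the degraded result reported in Table 1.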
5.3 Inter-class mask
If we trained a spectral domain mask for each class, it might perform better and transfer implicitly among the same category across different datasets. However, at test time we would need to determine the category of an image before applying the inter-class mask, so we might need to change the architecture of the model; we leave this to future work.
We proposed a novel regularization method that explicitly removes unimportant frequencies at training time. 1) We pinpoint the valid frequency range entangled across different layers. 2) We demonstrate that a model trained with our regularization is more robust on unseen data.
These works impose restrictions on the spectral domain. Our method differs from them in two aspects: 1) Compared with energy-based compression techniques, our method does not drop high frequency components indiscriminately. Our goal is not to minimize the approximation error between masked inputs and filters and their unmasked counterparts, but to find the frequencies most important for classification and force the model to shield against background and noise. 2) We do not use a keep-percentage hyperparameter to determine the masking threshold. Our method figures out the mask through back propagation, so it can be used end-to-end.
References

-  (2018) Convolutional image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5561–5570.
-  (2018) Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177.
-  (2017) CVAE-GAN: fine-grained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2745–2754.
-  (2017) Spectrally-normalized margin bounds for neural networks. In NIPS.
-  (2018) Adversarial examples from computational constraints. In ICML.
-  (2016) Compressing convolutional neural networks in the frequency domain. In KDD '16.
-  (2017) Parseval networks: improving robustness to adversarial examples. In ICML.
-  (2017) Intriguing properties of adversarial examples. arXiv preprint arXiv:1711.02846.
-  (2018) AutoAugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501.
-  (2017) A study and comparison of human and deep learning recognition performance under visual distortions. In 2017 26th International Conference on Computer Communication and Networks (ICCCN), pp. 1–7.
-  (2019) Band-limited training and inference for convolutional neural networks. In International Conference on Machine Learning, pp. 1745–1754.
-  (2018) A spectral approach to generalization and optimization in neural networks.
-  (2018) Generalizable adversarial training via spectral normalization. arXiv preprint arXiv:1811.07457.
-  (1987) Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, Optics and Image Science.
-  (2018) Wavelet convolutional neural networks. arXiv preprint arXiv:1805.08620.
-  (2018) DropBlock: a regularization method for convolutional networks. arXiv preprint arXiv:1810.12890.
-  (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
-  (2019) SpecNet: spectral domain convolutional neural network. arXiv preprint arXiv:1905.10915.
-  (2015) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
-  (2019) Why deep-learning AIs are so easy to fool. Nature 574 (7777), pp. 163.
-  (2020) AugMix: a simple data processing method to improve robustness and uncertainty. In Proceedings of the International Conference on Learning Representations (ICLR).
-  (2019) Robust learning with Jacobian regularization. arXiv preprint arXiv:1908.02729.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
-  (2017) Measuring the tendency of CNNs to learn surface statistical regularities. arXiv preprint arXiv:1711.11561.
-  (2017) Regularization of deep neural networks with spectral dropout. Neural Networks 110, pp. 82–90.
-  (2009) Learning multiple layers of features from tiny images.
-  (1992) A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pp. 950–957.
-  (2015) Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.
-  (1990) Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pp. 396–404.
-  (2018) Multi-level wavelet-CNN for image restoration. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 886–88609.
-  (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
-  (2017) Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101.
-  (2013) Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851.
-  (2019) Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991.
-  (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
-  (2017) Exploring generalization in deep learning. In NIPS.
-  (2020) Unsupervised intra-domain adaptation for semantic segmentation through self-supervision. arXiv preprint arXiv:2004.07703.
-  (2018) RePr: improved training of convolutional filters. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10658–10667.
-  (2018) On the spectral bias of neural networks. In ICML.
-  (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767.
-  (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
-  (2015) Spectral representations for convolutional neural networks. arXiv preprint arXiv:1506.03767.
-  (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
-  (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
-  (2018) On the structural sensitivity of deep convolutional networks to the directions of Fourier basis functions. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 51–60.
-  (2014) Fast convolutional nets with fbfft: a GPU performance evaluation. arXiv preprint arXiv:1412.7580.
-  (2016) Learning structured sparsity in deep neural networks. In NIPS.
-  (2020) Exploring categorical regularization for domain adaptive object detection. arXiv preprint arXiv:2003.09152.
-  (2019) A Fourier perspective on model robustness in computer vision. In NeurIPS.
-  (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
-  (2016) Learning deep features for discriminative localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929.