Robust Learning with Frequency Domain Regularization

07/07/2020 ∙ by Weiyu Guo, et al. ∙ 0

Convolutional neural networks (CNNs) have achieved remarkable performance in many computer vision tasks. However, CNNs tend to be biased toward low-frequency components: they prioritize capturing low-frequency patterns, which leads them to fail when the application scenario changes. Meanwhile, adversarial examples imply that these models are very sensitive to high-frequency perturbations. In this paper, we introduce a new regularization method that constrains the frequency spectra of the model's filters. Unlike band-limited training, our method assumes that the valid frequency range is probably entangled across different layers rather than continuous, and it learns the valid frequency range end-to-end by backpropagation. We demonstrate the effectiveness of our regularization by (1) defending against adversarial perturbations; (2) reducing the generalization gap in different architectures; and (3) improving generalization in transfer learning scenarios without fine-tuning.


1 Introduction

Convolutional neural networks (CNNs)[29] have achieved remarkable performance in many computer vision tasks, e.g., object detection[31, 40, 41], semantic segmentation[43], and image captioning[1], by capturing and representing multi-level features from a huge volume of data. However, existing experiments[2, 10] demonstrate that CNNs are often very fragile[20]. Injecting only a minute perturbation, e.g., random noise, contrast change, or blurring, can lead to significant degradation of model performance; that is, CNN models usually lack the ability to generalize under such shifts.

A variety of explanations for this vulnerability have been proposed, e.g., the limited scale of data sets, the inconsistency between the distribution of real data and that of training data, and computational constraints[5]. These have resulted in a variety of coping strategies, such as data augmentation[49], adversarial training[13, 7, 35], and parameter regularization[47, 12]. Indeed, these strategies push CNN models to encode invariant features while neglecting variable information in the learning phase. In essence, convolutions are signal processing operations that amplify certain frequencies of the input and attenuate others. This leads us to ask: can we prompt CNNs to "remember" invariant features by explicitly representing certain frequency components of their convolution layers? And how can we find those frequency components for different layers? In this paper, we show that the answers to these questions lead to some surprising new perspectives on model robustness and generalization.

The low-frequency components of a training set are easier to learn than the high-frequency components, because low-frequency signals are numerous but vary little. For a finite training set, there exists a valid frequency range: information below the lower bound is usually data-set bias, while information above the upper bound is often noise. This phenomenon probably leads a CNN with common settings to first quickly capture the dominant low-frequency components, but to over-fit easily when the application scenario changes. Therefore, we might be able to improve the generalization and convergence of CNN models by placing frequency range constraints on convolution layers during the learning phase.

The architecture of CNNs is designed to abstract information layer by layer from low to high[39]. It is generally assumed that low layers are in charge of extracting low-frequency information, such as dots, lines, and texture, while high layers are responsible for high-frequency information, such as shapes and sketches. Intuitively, we can drive a CNN model to pinpoint the valid frequency range of the training set by imposing low-frequency constraints on the early convolution layers and high-frequency constraints on the later ones. However, due to other factors, e.g., shortcut connections[19], learning methods, and sample distribution, the valid frequency range is probably entangled across different layers rather than continuous.

In this paper, we propose a novel frequency domain regularization on convolution layers, which improves the generalization and convergence performance of CNN models by automatically untangling the spectrum of the convolution layers and steering the model toward the valid frequency range of the training set. In a nutshell, our main contributions can be summarized as follows.

  • An extremely small but valid spectral range for different layers is pinpointed.

  • A general training approach with frequency domain regularization on convolution layers, for improving the generalization and convergence performance of CNN models. Compared with data augmentation and other implicit regularization techniques, our training technique improves the transferability of the model.

  • A comprehensive evaluation that investigates the effectiveness of the proposed approach and demonstrates how it can improve the generalization of CNN models.

2 Related work

Promoting the generalization of models is very important for deep learning. Generally, there are three branches of techniques toward this goal, i.e., data augmentation, regularization, and spectrum analysis.

Data augmentation: Data augmentation[21, 9] is a common way to reduce overfitting; it increases the amount of training data using only information in the training data. Simple techniques, such as cropping, rotating, and flipping, are prevalent in CNN training and usually improve validation performance slightly. However, such simple techniques cannot provide any practical defense against adversarial examples[8], which has led to an emerging direction of data augmentation, i.e., adversarial training[28, 3]. Adversarial training performs unsupervised generation of new samples using GANs[17], which can provide a large number of hard examples for training. However, recent studies[49] demonstrate that adversarial training usually improves robustness to corruptions concentrated in the high-frequency domain while reducing robustness to corruptions concentrated in the low-frequency domain.

Regularization: [50] comprehensively evaluates the performance of explicit and implicit regularization techniques, i.e., dropout[44], weight decay[27, 32], batch normalization[23], and early stopping. It concludes that although regularizers can provide marginal improvement, they do not seem to be the fundamental reason for generalization; the architecture is. All of the regularization techniques mentioned above have little effect on preventing the model from quickly fitting randomly labeled data. Sharpness and norms offer another perspective on generalization[36]. There is a tight connection between the spectral norm and Lipschitz continuity, which can be used to flatten minima and bound the generalization error[4, 35, 13]. A Jacobian penalty[22] and orthogonality of weights[38] can also be used to improve generalization. However, none of these regularization techniques focuses on the transferability of a model to unseen domains, nor can they explicitly pinpoint the valid range of features to help the model shield against background and noise.

Spectrum analysis: Convolution is a common way to extract a specific part of the spectrum in signal processing. Inspired by this, there is substantial recent interest in studying the spectral properties of CNNs, with applications to model compression[6], speeding up model inference[33, 46], memory reduction[18], theoretical understanding of CNN capacity[39, 45], and, eventually, better training methodologies[49, 15, 42, 30]. The works that leverage the spectral properties of CNNs to design better training methodologies are the most relevant to this paper. For example, recent studies[24, 39] find that a CNN model is usually biased towards lower Fourier frequencies, while natural images tend to have the bulk of their Fourier spectrum concentrated in the low to mid-range frequencies. Based on this finding, some works drop the high-frequency components from the inputs to improve the generalization of the model, e.g., spectral dropout[25] and band-limited training[11]. In practice, however, high-frequency components may be non-robust but highly predictive[49]. Therefore, although high-frequency components contain noise, we do not simply drop them in our work. A more in-depth discussion of the valid spectral range is needed.

3 Method

In this section, we introduce our regularization method to constrain the frequency spectra of the convolution. The overview of our method is illustrated in Figure 1.

Figure 1: Overview of our method

3.1 FFT-based convolution

Given a tensor t, the Fourier transform is used to transform t to the spectral domain.

FFT-based convolution

The convolution theorem ensures that convolution in the spatial domain is equivalent to element-wise multiplication in the spectral domain. The main intuition behind frequency analysis is that an image represented in the spatial domain is highly redundant, whereas a spectral representation lets filters capture specific length-scales and orientations[42, 14]. Faster convergence and lower computational cost are additional benefits.
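The equivalence between spatial convolution and element-wise multiplication of spectra can be checked numerically. The following is a minimal NumPy sketch, assuming circular (periodic) boundary conditions and a filter zero-padded to the input size; the function names and the 8x8 toy sizes are illustrative and do not come from the paper.

```python
import numpy as np

def circular_conv2d(x, k):
    """Direct 2-D circular convolution of x with k (k has the same shape as x)."""
    h, w = x.shape
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            for a in range(h):
                for b in range(w):
                    out[i, j] += x[a, b] * k[(i - a) % h, (j - b) % w]
    return out

def fft_conv2d(x, k):
    """The same convolution, computed as an element-wise product of spectra."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k)))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))           # toy input
k = np.zeros((8, 8))
k[:3, :3] = rng.standard_normal((3, 3))   # 3x3 filter zero-padded to the input size

spatial = circular_conv2d(x, k)
spectral = fft_conv2d(x, k)
max_diff = np.max(np.abs(spatial - spectral))  # agreement up to floating-point error
```

The two results agree up to rounding error, which is why a constraint placed on the filter's spectrum acts directly on what the convolution can pass through.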

is called the spectrum of the convolution.

3.2 Mask design

Mask design is the key component of our method. The mask helps us pinpoint the valid frequency range entangled across different layers, and updating it by backpropagation is the main difference from similar work.

Binarized mask

Our regularization tries to mask the frequencies of background and noise, keeping only the frequencies that are useful for classification. The mask limits the spectrum.
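As an illustration of how a binarized mask could limit a filter's spectrum, here is a minimal NumPy sketch. The threshold of 0.5, the uniform toy mask values, and the helper names `binarize` and `masked_spectrum` are assumptions made for this example; the source does not specify them.

```python
import numpy as np

def binarize(mask_real, threshold=0.5):
    """Map a real-valued mask to {0, 1}; the 0.5 threshold is an assumption."""
    return (mask_real >= threshold).astype(np.float64)

def masked_spectrum(kernel, mask_real, size):
    """Zero-pad the kernel to size x size, transform it, and apply the binary mask."""
    spectrum = np.fft.fft2(kernel, s=(size, size))
    return spectrum * binarize(mask_real)

rng = np.random.default_rng(0)
size = 8
kernel = rng.standard_normal((3, 3))              # toy 3x3 filter
mask_real = rng.uniform(0.0, 1.0, (size, size))   # hypothetical real-valued mask

masked = masked_spectrum(kernel, mask_real, size)

# Filter a toy input with the masked spectrum. An arbitrary mask breaks the
# conjugate symmetry of the spectrum, so only the real part is kept here.
x = rng.standard_normal((size, size))
y = np.real(np.fft.ifft2(np.fft.fft2(x) * masked))
```

Frequencies whose mask entry binarizes to zero contribute nothing to the output, while a mask of all ones leaves the filter unchanged.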

Gradient Computation and Accumulation:

The gradients of the mask are accumulated in real-valued variables, as shown in Algorithm 1.

Require: a minibatch of inputs and targets.
Ensure: updated masks and weights.
  for each layer do
      Binarize the mask
  end for
  for each layer do
      compute the gradients
  end for
  for each layer do
      Update the mask and weights
  end for
Algorithm 1: Forward and back propagation. The listing uses the cost function for the minibatch, the number of layers, and element-wise multiplication. The function Binarize() specifies how to binarize the masks, and Clip() specifies how to clip them.
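The idea of binarizing in the forward pass while accumulating gradients in a real-valued mask can be sketched on a toy problem. The example below assumes a straight-through estimator (the gradient with respect to the binary mask is applied directly to the real-valued mask), a threshold of 0.5, a clip range of [0, 1], a quadratic toy loss, and a real-valued stand-in for the spectrum. None of these details are given in the source, so this is an illustration of the general scheme, not the authors' exact algorithm.

```python
import numpy as np

def binarize(m, threshold=0.5):
    """Forward-pass binarization; the 0.5 threshold is an assumption."""
    return (m >= threshold).astype(np.float64)

def clip(m, low=0.0, high=1.0):
    """Keep the real-valued mask in a bounded range; [0, 1] is an assumption."""
    return np.clip(m, low, high)

def train_step(m_real, spectrum, target, lr=0.1):
    """One update of the real-valued mask on a toy quadratic loss."""
    b = binarize(m_real)                   # forward pass uses the binary mask
    residual = b * spectrum - target
    loss = 0.5 * np.sum(residual ** 2)
    grad_b = residual * spectrum           # d(loss)/d(b)
    m_real = clip(m_real - lr * grad_b)    # straight-through: apply it to m_real
    return m_real, loss

rng = np.random.default_rng(0)
spectrum = rng.uniform(1.0, 2.0, size=(4, 4))                 # real-valued stand-in
true_mask = (rng.random((4, 4)) > 0.5).astype(np.float64)     # hypothetical valid range
target = true_mask * spectrum

m_real = np.full((4, 4), 0.8)   # start with every frequency kept
for _ in range(20):
    m_real, loss = train_step(m_real, spectrum, target)

learned = binarize(m_real)
```

In this toy setting the real-valued mask drifts below the threshold only at entries where keeping the frequency increases the loss, so the binarized mask ends up matching the hypothetical valid range.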

4 Experiments

We demonstrate the effectiveness of our regularization method on various datasets and architectures and compare it with several state-of-the-art methods. We also explore the properties of our approach in detail.

4.1 Experimental settings

Datasets: We conduct our experiments on CIFAR-10[26], which contains 10 classes, with 50,000 images for training and 10,000 images for testing.

Baseline training

All models use SGD with momentum set to 0.9. For the CIFAR dataset, the learning rate is set to 0.01. For the ImageNet dataset, the learning rate is set to 0.1. If weight decay and dropout are used, weight decay is set to a fixed coefficient and the keep-prob is set to 0.9. When training from scratch, the mask is initialized with random numbers from a normal distribution with mean 0.8 and variance 0.2. This means that we do not drop any frequency at the beginning; as the model learns, the accumulated gradient of the mask pinpoints the valid frequency range. For fine-tuning, the mask is initialized with random numbers from a normal distribution with mean 0.6 and variance 0.1 to accelerate the learning of the mask.
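The two initializations can be sketched as follows. Note that NumPy's normal sampler takes a standard deviation, so the stated variance has to be converted with a square root. The helper name `init_mask` and the 256x256 shape are hypothetical choices for this example.

```python
import numpy as np

def init_mask(shape, mean, variance, rng):
    """Draw a real-valued mask from N(mean, variance).

    rng.normal expects a standard deviation, hence the square root.
    """
    return rng.normal(loc=mean, scale=np.sqrt(variance), size=shape)

rng = np.random.default_rng(0)
shape = (256, 256)  # hypothetical mask size

scratch_mask = init_mask(shape, mean=0.8, variance=0.2, rng=rng)    # training from scratch
finetune_mask = init_mask(shape, mean=0.6, variance=0.1, rng=rng)   # fine-tuning

scratch_stats = (scratch_mask.mean(), scratch_mask.var())
finetune_stats = (finetune_mask.mean(), finetune_mask.var())
```

The lower mean used for fine-tuning places the initial values closer to whatever binarization threshold is used, which is consistent with the stated aim of accelerating the learning of the mask.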

4.2 Experimental results and analysis

Figure 2: Training from scratch and fine-tuning

Model Test accuracy
LeNet baseline 58.48
LeNet + normalization 60.41
LeNet + normalization + random crop + data augmentation 75.06
LeNet + normalization + random crop + data augmentation + weight decay 76.23
LeNet + our method + weight decay 66.7
LeNet + our method + normalization + weight decay 68.3
LeNet + our method + random crop 74.0
LeNet + our method + data augmentation 69.8
LeNet + our method + random drop(0.2) 62.1

Table 1: Summary of test accuracy on Cifar dataset for LeNet architecture. For dropout, DropPath, and SpatialDropout, we trained models with the best keep_prob values reported by [16]
Clear Impulse_noise Fog Contrast
Natural 93.5 50.436 85.14 70.858
Gauss
Adversarial
Our method 94.06 57.344 86.752 73.432
Table 2: Comparison between naturally trained model, Gaussian data augmentation, adversarial training, and our method on clean images and Cifar10-C for resnet-20 architecture.
Conv Layer Mask Percentage
layer1conv1.1 0.6938
layer1conv1.2 0.5810
layer1conv2.1 0.5213
layer1conv2.2 0.4889
layer2conv1.1 0.3354
layer2conv1.2 0.3696
layer2conv2.1 0.3124
layer2conv2.2 0.2644
layer3conv1.1 0.2082
layer3conv1.2 0.2621
layer3conv2.1 0.2453
layer3conv2.2 0.1477
layer4conv1.1 0.0810
layer4conv1.2 0.0455
layer4conv2.1 0.0337
layer4conv2.2 0.0175
Table 3: The percentage of frequency each convolution layer masks for resnet-20 architecture.
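A per-layer figure like the one in Table 3 could be computed from the binarized masks as shown below. This sketch assumes that "Mask Percentage" means the fraction of spectral entries a layer's mask sets to zero; the toy masks and layer names are hypothetical.

```python
import numpy as np

def masked_fraction(binary_mask):
    """Fraction of spectral entries that the mask sets to zero."""
    binary_mask = np.asarray(binary_mask)
    return float(np.mean(binary_mask == 0))

# Hypothetical toy masks; real masks would come from the trained layers.
toy_masks = {
    "early_layer": np.array([[1, 0, 0, 0], [0, 0, 1, 0]]),
    "late_layer": np.array([[1, 1, 1, 1], [1, 1, 1, 0]]),
}

report = {name: masked_fraction(m) for name, m in toy_masks.items()}
# report == {'early_layer': 0.75, 'late_layer': 0.125}
```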

It has been observed that Gaussian data augmentation and adversarial training improve robustness to all noise and many of the blurring corruptions, while degrading robustness to fog and contrast[49]. Our method achieves better results than the naturally trained model on fog, contrast, and impulse noise, which suggests that it alleviates the low-frequency brittleness caused by adversarial training.

Figure 3: Class activation mapping (CAM)[51] for resnet-18 model.

With the amount of suppression shown in Table 3 and the CAM illustration in Figure 3, our method uses the valid frequency range to capture the frequencies most important for classification.

5 Discussion

5.1 Binarized mask or continuous mask

When a binarized mask is used in the spatial domain, it has side effects such as generating boundaries. In the spectral domain, however, this kind of side effect is not obvious, and masking can be seen as a novel way of denoising. Because the spectral domain represents features in an invariant and sparse way, our method can suppress wrong activations in the spectral domain through a binarized mask.

5.2 Our method with random drop

Our method pinpoints the valid frequency range for the training set. What if we randomly drop some frequencies after applying our method? Would the model learn redundant features, rely less heavily on those frequencies, and perform better? The last entry of Table 1 shows that this is not a better choice. This verifies, from another perspective, that our method pinpoints the valid frequency range.

5.3 Inter-class mask

If we train a spectral domain mask for each class, it may perform better and provide an implicit transformation between the same category in different datasets. However, at test time, we would need to determine the category of an image before applying the inter-class mask. We may therefore need to change the architecture of the model, which is left to future work.

6 Conclusion

We proposed a novel regularization method that explicitly removes unimportant frequencies at training time. 1) We pinpoint the valid frequency range entangled across different layers. 2) We demonstrate that a model trained with our regularization is more robust on unseen data.

Band-limited training[11] and spectral dropout[25] also impose restrictions on the spectral domain. Our method differs from them in two aspects: 1) Compared with energy-based compression techniques, our method does not drop high-frequency components indiscriminately. Our goal is not to minimize the approximation error between the masked input and filter and the unmasked ones, but to find the most important frequencies for classification and force the model to shield against background and noise. 2) We do not use a keep-percentage hyperparameter to determine the threshold for masking. Our method uses backpropagation to learn the mask, so it can be used end-to-end.

Compared with self-supervised learning strategies[34, 37, 48], our method does not require a complex architecture. We try to leverage the transferability of frequency to address model transferability and domain adaptation.

References

  • [1] J. Aneja, A. Deshpande, and A. G. Schwing (2018) Convolutional image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5561–5570. Cited by: §1.
  • [2] A. Azulay and Y. Weiss (2018) Why do deep convolutional networks generalize so poorly to small image transformations?. arXiv preprint arXiv:1805.12177. Cited by: §1.
  • [3] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua (2017) CVAE-gan: fine-grained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2745–2754. Cited by: §2.
  • [4] P. L. Bartlett, D. J. Foster, and M. Telgarsky (2017) Spectrally-normalized margin bounds for neural networks. In NIPS, Cited by: §2.
  • [5] S. Bubeck, E. Price, and I. P. Razenshteyn (2018) Adversarial examples from computational constraints. In ICML, Cited by: §1.
  • [6] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen (2016) Compressing convolutional neural networks in the frequency domain. In KDD ’16, Cited by: §2.
  • [7] M. Cissé, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier (2017) Parseval networks: improving robustness to adversarial examples. In ICML, Cited by: §1.
  • [8] E. D. Cubuk, B. Zoph, S. S. Schoenholz, and Q. V. Le (2017) Intriguing properties of adversarial examples. arXiv preprint arXiv:1711.02846. Cited by: §2.
  • [9] E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le (2018) AutoAugment: learning augmentation policies from data. ArXiv abs/1805.09501. Cited by: §2.
  • [10] S. Dodge and L. Karam (2017) A study and comparison of human and deep learning recognition performance under visual distortions. In 2017 26th international conference on computer communication and networks (ICCCN), pp. 1–7. Cited by: §1.
  • [11] A. Dziedzic, J. Paparrizos, S. Krishnan, A. Elmore, and M. Franklin (2019) Band-limited training and inference for convolutional neural networks. In International Conference on Machine Learning, pp. 1745–1754. Cited by: §2, §6.
  • [12] F. Farnia, J. M. Zhang, and D. Tse (2018) A spectral approach to generalization and optimization in neural networks. Cited by: §1.
  • [13] F. Farnia, J. M. Zhang, and D. Tse (2018) Generalizable adversarial training via spectral normalization. ArXiv abs/1811.07457. Cited by: §1, §2.
  • [14] D. J. Field (1987) Relations between the statistics of natural images and the response properties of cortical cells.. Journal of the Optical Society of America. A, Optics and image science. Cited by: §3.1.
  • [15] S. Fujieda, K. Takayama, and T. Hachisuka (2018) Wavelet convolutional neural networks. ArXiv abs/1805.08620. Cited by: §2.
  • [16] G. Ghiasi, T. Lin, and Q. V. Le (2018) DropBlock: a regularization method for convolutional networks. ArXiv abs/1810.12890. Cited by: Table 1.
  • [17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.
  • [18] B. Guan, J. Zhang, W. A. Sethares, R. Kijowski, and F. Liu (2019) SpecNet: spectral domain convolutional neural network. ArXiv abs/1905.10915. Cited by: §2.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §1.
  • [20] D. Heaven (2019) Why deep-learning ais are so easy to fool. Nature 574 (7777), pp. 163. Cited by: §1.
  • [21] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan (2020) AugMix: a simple data processing method to improve robustness and uncertainty. Proceedings of the International Conference on Learning Representations (ICLR). Cited by: §2.
  • [22] J. Hoffman, D. A. Roberts, and S. Yaida (2019) Robust learning with jacobian regularization. ArXiv abs/1908.02729. Cited by: §2.
  • [23] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §2.
  • [24] J. Jo and Y. Bengio (2017) Measuring the tendency of cnns to learn surface statistical regularities. ArXiv abs/1711.11561. Cited by: §2.
  • [25] S. H. Khan, M. Hayat, and F. M. Porikli (2017) Regularization of deep neural networks with spectral dropout. Neural networks : the official journal of the International Neural Network Society 110, pp. 82–90. Cited by: §2, §6.
  • [26] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
  • [27] A. Krogh and J. A. Hertz (1992) A simple weight decay can improve generalization. In Advances in neural information processing systems, pp. 950–957. Cited by: §2.
  • [28] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther (2015) Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300. Cited by: §2.
  • [29] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel (1990) Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pp. 396–404. Cited by: §1.
  • [30] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo (2018) Multi-level wavelet-cnn for image restoration. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 886–88609. Cited by: §2.
  • [31] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1.
  • [32] I. Loshchilov and F. Hutter (2017) Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101. Cited by: §2.
  • [33] M. Mathieu, M. Henaff, and Y. LeCun (2013) Fast training of convolutional networks through ffts. CoRR abs/1312.5851. Cited by: §2.
  • [34] I. Misra and L. van der Maaten (2019) Self-supervised learning of pretext-invariant representations. ArXiv abs/1912.01991. Cited by: §6.
  • [35] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. ArXiv abs/1802.05957. Cited by: §1, §2.
  • [36] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro (2017) Exploring generalization in deep learning. In NIPS, Cited by: §2.
  • [37] F. Pan, I. Shin, F. Rameau, S. Lee, and I. S. Kweon (2020) Unsupervised intra-domain adaptation for semantic segmentation through self-supervision. ArXiv abs/2004.07703. Cited by: §6.
  • [38] A. Prakash, J. A. Storer, D. A. F. Florêncio, and C. Zhang (2018) RePr: improved training of convolutional filters. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10658–10667. Cited by: §2.
  • [39] N. Rahaman, A. Baratin, D. Arpit, F. Dräxler, M. Lin, F. A. Hamprecht, Y. Bengio, and A. C. Courville (2018) On the spectral bias of neural networks. In ICML, Cited by: §1, §2.
  • [40] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §1.
  • [41] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1.
  • [42] O. Rippel, J. Snoek, and R. P. Adams (2015) Spectral representations for convolutional neural networks. ArXiv abs/1506.03767. Cited by: §2, §3.1.
  • [43] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §1.
  • [44] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §2.
  • [45] Y. Tsuzuku and I. Sato (2018) On the structural sensitivity of deep convolutional networks to the directions of fourier basis functions. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 51–60. Cited by: §2.
  • [46] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun (2014) Fast convolutional nets with fbfft: a gpu performance evaluation. CoRR abs/1412.7580. Cited by: §2.
  • [47] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li (2016) Learning structured sparsity in deep neural networks. In NIPS, Cited by: §1.
  • [48] C. Xu, X. Zhao, X. Jin, and X. Wei (2020) Exploring categorical regularization for domain adaptive object detection. ArXiv abs/2003.09152. Cited by: §6.
  • [49] D. Yin, R. G. Lopes, J. Shlens, E. D. Cubuk, and J. Gilmer (2019) A fourier perspective on model robustness in computer vision. In NeurIPS, Cited by: §1, §2, §2, §4.2.
  • [50] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §2.
  • [51] B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929. Cited by: Figure 3.