
Towards Robust 2D Convolution for Reliable Visual Recognition

by   Lida Li, et al.
Xidian University

2D convolution (Conv2d), which is responsible for extracting features from the input image, is one of the key modules of a convolutional neural network (CNN). However, Conv2d is vulnerable to image corruptions and adversarial samples. Whether a more robust alternative to Conv2d can be designed for more reliable feature extraction is an important yet rarely investigated problem. In this paper, inspired by the recently developed learnable sparse transform, which learns to convert CNN features into a compact and sparse latent space, we design a novel building block, denoted by RConv-MK, to strengthen the robustness of the extracted convolutional features. Our method leverages a set of learnable kernels of different sizes to extract features at different frequencies and employs a normalized soft thresholding operator to adaptively remove noise and trivial features at different corruption levels. Extensive experiments on clean images, corrupted images and adversarial samples validate the effectiveness of the proposed robust module for reliable visual recognition. The source code is enclosed in the submission.



1 Introduction

Deep convolutional neural networks (CNNs) have proven powerful in a wide range of computer vision tasks, especially in image recognition [12, 52]. Despite this success, a well-performing CNN model can fail when handling images with the various types of corruption encountered in the real world [22]. In addition, a CNN model can be easily fooled by deliberately designed adversarial samples whose subtle perturbations are imperceptible to human eyes [37]. It is therefore crucial to improve the robustness of CNN models against image corruption and adversarial attacks.

To improve the robustness of CNN models to corrupted images, most existing methods choose to improve the quality of the input data. Based on priors from image denoising, some works [16, 23, 48] transform the input data from the spatial (pixel) domain into a certain frequency domain for noise removal before it is fed into the network. Though CNNs adapted to a specific type of corruption can be more robust to it, they may be fragile to corruptions outside the adapted type. Besides, their implementation requires manual settings for each task, which is less practical. For example, one needs to manually adjust the noise level to find a good balance between noise removal and preservation of image detail.

To improve the robustness of CNN models against adversarial attacks, many methods [34, 8, 50, 49] have been developed to generate adversarial samples for training robust CNN models. Almost all of them view the CNN model as a “black-box” and focus on the adversarial sample generation process, while little work has been done on improving the CNN architectures to improve robustness.
As discussed above, though methods have been developed for visual recognition with corrupted images and adversarial attack, to the best of our knowledge, none of them considered an important question: can we improve the robustness of 2D convolution (Conv2d), which is the key component of a CNN, so that more reliable features can be extracted from images with corruption or adversarial samples? In this paper, we make the first attempt along this line and develop a robust alternative of the Conv2d layer. Our work is mainly inspired by LST-Net [29]. During training, LST-Net learns a set of channel and spatial transforms at each layer to convert the given CNN features from the spatial domain into a compact and sparse latent frequency space. A soft thresholding (ST) operator is applied to remove noises and trivial features in the learned domain. Though LST-Net is effective and efficient for visual recognition, it can be improved from two aspects. First, it is noticed from both the design principles and visualization results that the output features of the channel transforms are organized in a fairly fixed order by frequency. Such a property can be directly exploited to design more effective spatial transforms following their channel transforms. Second, the threshold of ST in LST-Net is fixed for features of all frequencies, which is less accurate and flexible to process complex input with various corruptions.
In this paper, we present a robust alternative to the Conv2d layer as a building block for more reliable visual recognition. To make better use of the output of the channel transform, multiple kernels of different sizes are adopted in the subsequent spatial transform. Large kernels are used to handle low frequency signals effectively and to avoid misclassification caused by a limited receptive field. Meanwhile, small kernels are used to better handle high frequency signals while greatly reducing the overhead, as high frequency signals are usually sparse. By sequentially partitioning the input into suitable groups, the proposed module has nearly the same overhead as LST-Net with negligible extra parameters. Besides, as image corruption levels often vary from one sample to another, we propose a normalized soft thresholding (NST) operator to adapt to the unknown corruption level of each sample. As a result, the proposed module, denoted by RConv-MK (“R” for robust and “MK” for multiple kernels), is more robust than the conventional Conv2d as well as LST, and the robustness of the entire CNN model is enhanced accordingly. Our extensive experiments on clean images, corrupted images and adversarial samples validate the effectiveness of RConv-MK under some popular CNN architectures. Our contributions can be summarized as follows:

  • Multiple kernels of different sizes are utilized to deal with different frequency signals along the channels dimension, reducing the negative impacts of noises without losing efficiency.

  • A normalized soft thresholding operator is proposed to adaptively suppress the effect of different corruptions at different levels enabling the entire CNN model to be more robust.

2 Related Work

2.1 Image recognition on corrupted images

In practical applications of visual recognition, images can easily be corrupted for many reasons, such as improper lighting conditions, defects of imaging devices, bad weather, defocus blur, etc. While a CNN model may perform well on clean images, it can fail when handling corrupted photos. Existing image restoration methods [51, 17, 15] are basically developed to improve image quality according to criteria such as PSNR or SSIM [43] rather than recognition accuracy. Therefore, they are not suitable for direct use in image recognition with various types of corruptions.

It is intuitive to suppress noise in noise-corrupted images for reliable visual recognition. A few algorithms [16, 23, 48] have been developed to convert the input data from the spatial domain into a certain frequency domain before feeding it into CNNs, because noise is easy to identify and suppress in the transformed domain. Franzen [16] converted gray-scale images into the DCT domain and fed the responses into a Multi-Layer Perceptron model with 2 hidden layers for classification. Hossain et al. [23] inserted a DCT module before a pre-trained VGG-16 model and fine-tuned it on a dataset with various types of common corruptions. However, the robustness of those models may be limited to the specific frequency domain, and the models generalize poorly to unseen corruptions. Meanwhile, these methods require manual adjustment of the noise level during the domain transform, where one needs to trade off between noise removal and image cue preservation.

In this paper, we follow the nature of features in different frequencies to learn frequency-adaptive kernels of different sizes for better domain transformation.

2.2 Adversarial attack and defense

Generally speaking, an adversarial attack refers to deliberately designing inputs (adversarial examples) to fool a trained network model and force it to produce wrong outputs. The purpose of adversarial attacks may vary under different scenarios [5]. In this paper, we focus on image recognition, where the goal of adversarial attacks is to cause misclassification.

Adversarial attack algorithms can be roughly classified into two categories based on whether gradients are used. Optimization-based attacks are by far the most popular methods. Given an input image and its associated ground truth label, these methods generate adversarial samples by computing gradients according to the CNN architecture and a pre-defined loss function, such as the cross-entropy loss, the C&W loss [7], etc. Meanwhile, the distortion metrics of [9], [37, 7] and [34, 6] are commonly used to measure the budget of adversarial examples. In comparison, gradient-free attack methods are developed for the cases where the network architecture is unavailable. Some representative works can be found in [10, 40, 26].

To defend against adversarial attacks, adversarial training [2] is one of the most popular and natural choices. It augments the training data by generating adversarial examples with certain attack methods. Though a re-trained model can deal with unseen data, it may still fail when facing adversarial samples generated by other attack methods. This is because one model can hardly cover the entire input space by training with a certain number of searching steps. To narrow the gap, Deng et al. [14] modelled the potential adversarial examples from the perspective of distribution. Other defence algorithms focus on designing better objective functions. Chen et al. [8] proposed a novel loss function to neutralize the probability of wrong predictions and maximize the probability of right predictions.

Different from the existing anti-attack methods, we focus on robust feature extraction of CNNs by proposing a new module. Our method is complementary to the existing adversarial attack defenders, and it can be readily used to improve the anti-attack performance of existing methods.

2.3 Normalization layers

Normalization layers are critical components of CNNs, aiming at reducing the internal covariate shift [27] between the input distribution and the output distribution. Batch Normalization (BN) [27] is the first work to mitigate this issue. It normalizes the whole batch by computing sample statistics (mean and standard deviation) during mini-batch based training. Layer Normalization (LN) [3] is designed to normalize all the activations of a single layer for each sample of a batch along the channel dimension, where it collects statistics from every unit within the layer. Besides, Instance Normalization (IN) [41] performs a BN-like computation on each sample, where a sample refers to a unit of the space spanned by the batch and channel axes. In addition, Group Normalization (GN) [47] improves BN by partitioning channels into groups and computing the mean and standard deviation for each group.

In this paper, we propose a normalized soft-thresholding (NST) operator to deal with the covariate shift caused by different corruptions.

3 Proposed Method

3.1 Problem formulation

Denote by $x$ an input image in the spatial domain and by $y$ its ground truth label. A CNN model of $L$ layers can be regarded as a sequence of functions as follows:

$$\hat{y} = f_L\big(f_{L-1}(\cdots f_1(x; \theta_1) \cdots; \theta_{L-1}); \theta_L\big),$$

where $f_l$ is the function of the $l$-th layer of the CNN model and its associated parameters are denoted as $\theta_l$, $l = 1, \dots, L$. We further denote the input and the output of $f_l$ by $u_l$ and $v_l$, respectively. Note that the output of the last layer, $v_L$, is the prediction of the ground truth label $y$, denoted by $\hat{y}$. Some loss function $\ell(\hat{y}, y)$, e.g., the cross-entropy loss, can be defined to measure the distance between $\hat{y}$ and $y$ and to update the CNN parameters.

In practice, the input image can be corrupted, and the corrupted sample can be written as

$$\tilde{x} = x + \delta,$$

where $\delta$ refers to the corruption, e.g., noise or the perturbation deliberately generated by a specific adversarial algorithm, aiming to enforce a wrong classification on the corrupted sample.

In this paper, we do not assume any specific distribution on $\delta$. It is expected that a robust CNN model could consistently make correct predictions for either the clean input image $x$ or its corrupted counterparts $\tilde{x}$.

3.2 Brief review of LST

This work is inspired by LST-Net [29], and we briefly review it here. An LST contains three primitive transforms: a channel transform $T_c$, a spatial transform $T_s$, and a resize transform $T_r$. Both $T_c$ and $T_s$ work closely together to reduce redundancies by converting the input into a frequency domain along the channel and spatial dimensions, in that order. Initialized with the DCT for training, they can be implemented by a learnable pointwise convolutional layer (PWConv) [31] and a depthwise separable convolutional layer (DWConv) [24], respectively. Besides, their outputs are organized along the channel dimension for removal of noise and trivial features by soft-thresholding (ST). $T_s$ is always arranged right after $T_c$ to save parameters and computational cost [30]. Thus, the output features of $T_c$ (i.e., the input features of $T_s$) are expected to correspond to the weights of $T_c$ along the channel dimension, while it is assumed that high frequency signals always come after low frequency ones. $T_r$ is placed before or after the composition of $T_c$ and $T_s$ to obtain the desired output size by using a PWConv.
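To illustrate why DCT initialization leaves the channels ordered by frequency, the following is a minimal sketch in plain Python. It assumes an orthonormal DCT-II basis is used as the weight of a pointwise (1x1) transform; the function names and the 8-channel size are hypothetical and are not taken from LST-Net.

```python
import math

def dct_matrix(n):
    """Return the n x n orthonormal DCT-II matrix as a list of rows."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def pointwise_transform(pixel, weight):
    """Apply a 1x1 (pointwise) transform to the channel vector of one pixel."""
    return [sum(w * c for w, c in zip(row, pixel)) for row in weight]

def sign_changes(row):
    """Count sign changes along a basis vector, a simple proxy for its frequency."""
    signs = [v > 0 for v in row]
    return sum(a != b for a, b in zip(signs, signs[1:]))

weight = dct_matrix(8)                    # hypothetical 8-channel layer
flat_pixel = [1.0] * 8                    # the same value in every input channel
coeffs = pointwise_transform(flat_pixel, weight)
frequencies = [sign_changes(row) for row in weight]
print(frequencies)                        # output channel k oscillates k times
print([round(c, 6) for c in coeffs])      # energy concentrates in channel 0
```

Because row k of the basis oscillates k times, the output channels of such a transform are sorted from low to high frequency at initialization, and a smooth input lands almost entirely in the first channel. Whether this ordering persists after training is an empirical claim of the paper, not something the sketch shows.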

We argue that the LST is restricted by its fixed kernel size from effective feature extraction for different frequency components. Besides, ST is less powerful to deal with corruptions at different levels. In this paper, we study both the problems and mitigate them with the proposed RConv-MK.

3.3 The module structure

Figure 1 presents the structure of LST [29] and the proposed RConv-MK. Both of them consist of three primitive transforms, $T_c$, $T_s$ and $T_r$, where $T_s$ follows $T_c$, and $T_r$ is placed either before (as illustrated in Figure 1) or after (see our supplementary material) the other two transforms. LST and our RConv-MK differ in two aspects. First, we design a spatial transform with multiple kernels to exploit the specific characteristics of different frequencies along the channel dimension. Second, we replace ST with NST to deal with corruptions at different levels.

Figure 1: Structure comparison of (a) LST and (b) our RConv-MK. PWConv/DWConv in regular and bold font suggests random and DCT initialization of the associated weights, respectively.

3.4 Spatial transform with multiple kernels

Given the kernel size and the expansion rate of the DWConv, the spatial transform $T_s$ of LST [29] repeatedly applies a convolutional kernel of that single size to each channel of its input, where the subscript $s$ indicates association with the spatial transform. The learned weights of the channel transform $T_c$ in LST maintain two important properties (which are also possessed by our RConv-MK). First, the transformed features of $T_c$ (i.e., the input features of $T_s$) are structured, where the low frequency features are placed at one end while the high frequency features are located at the other end in the channel dimension. Second, low frequency features are dense while high frequency features are sparse. Without loss of generality, in the remainder of this paper, we assume that the output features of $T_c$ (or the input features of $T_s$) are arranged from low to high frequencies along the channel dimension.

The $T_s$ in LST, unfortunately, loses the properties of $T_c$. The fixed kernel size (usually 3×3 in most modern CNNs) may be too small to identify the genuine low frequency signals. Thus, some high frequency signals will be misclassified as low frequency ones due to the limited window size. Meanwhile, the fixed kernel size may not be suitable to efficiently process sparse high frequency signals (see our supplementary material for more discussions).

Figure 2: Comparison of the spatial transform $T_s$ in (a) LST and (b) our RConv-MK, where “LF” and “HF” indicate the expected location of low and high frequency signals along the channel axis, respectively. In each case, the input channels are arranged by frequency and expanded by the expansion rate. The $T_s$ of LST only adopts a single DWConv kernel, while our RConv-MK incorporates DWConv kernels of different sizes. The low-frequency components are computed with large kernels and the high-frequency components with small kernels.

Figure 2 compares the design of $T_s$ in LST and our RConv-MK. Channels of the input and the output of $T_s$ are highlighted with the prism color map, where red means low frequency signals while blue means high frequency signals. By using kernels of suitable sizes for signals of different frequencies, we can improve $T_s$ to produce better features for low frequency signals and accelerate the computation for high frequency ones. We leverage a set of $n$ kernels of different sizes and sort them by kernel size in descending order. The associated weights can be written as $W_s^{(1)}, \dots, W_s^{(n)}$, whose kernel sizes satisfy $k_1 > k_2 > \dots > k_n$. Accordingly, we partition the input $X$ of $T_s$ into $n$ groups along the channel dimension to have $X^{(1)}, \dots, X^{(n)}$, where $X^{(i)}$ contains $C_i$ channels, $i = 1, \dots, n$. Obviously, $\sum_{i=1}^{n} C_i = C$, the number of channels of $X$. The $T_s$ of RConv-MK applies each $W_s^{(i)}$ to its related group of input channels $X^{(i)}$, $i = 1, \dots, n$, so that low frequency signals are assigned large kernels while high frequency ones are assigned small kernels. In this way, we can make good use of signals of different frequencies in the feature domain. Finally, $T_s$ ends by concatenating the features of all $n$ groups along the channel dimension, producing the transformed features of $T_s$ in our RConv-MK.
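The partition-convolve-concatenate procedure described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the kernels are placeholder averaging filters rather than learned weights, the expansion rate is ignored, and the names `multi_kernel_spatial_transform`, `box_kernel`, the kernel sizes `[7, 5, 3]` and the channel split `[2, 2, 4]` are hypothetical choices, not the paper's settings.

```python
def depthwise_conv2d(channel, kernel):
    """Zero-padded 'same' 2D correlation of one channel with one square kernel."""
    k = len(kernel)
    pad = k // 2
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - pad, x + dx - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy][dx] * channel[yy][xx]
            out[y][x] = acc
    return out

def box_kernel(k):
    """Placeholder k x k averaging kernel (learned weights in a real layer)."""
    return [[1.0 / (k * k)] * k for _ in range(k)]

def multi_kernel_spatial_transform(features, kernel_sizes, group_channels):
    """features: list of 2D channels, assumed ordered from low to high frequency."""
    assert sum(group_channels) == len(features)
    assert kernel_sizes == sorted(kernel_sizes, reverse=True)  # large kernels first
    outputs, start = [], 0
    for k, n_ch in zip(kernel_sizes, group_channels):
        for channel in features[start:start + n_ch]:
            outputs.append(depthwise_conv2d(channel, box_kernel(k)))
        start += n_ch
    return outputs  # appending group by group concatenates along the channel axis

features = [[[1.0] * 6 for _ in range(6)] for _ in range(8)]   # 8 channels, 6x6
out = multi_kernel_spatial_transform(features, [7, 5, 3], [2, 2, 4])
print(len(out), len(out[0]), len(out[0][0]))
```

The first channels (assumed low frequency) see a 7x7 window while the last ones (assumed high frequency) see only 3x3, and the channel count and spatial size are unchanged by the transform.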

3.5 Normalized soft thresholding

In LST, soft-thresholding is used to remove noise and trivial features. However, the threshold $\tau$ is determined manually based on the noise level in the corresponding feature domain. A mismatch between $\tau$ and the features may cause a performance drop. A large $\tau$ may cut away useful cues, while noise still survives if a small $\tau$ is set. In fact, a fixed threshold threatens the robustness of a convolutional layer since the noise level of a corrupted sample may vary dramatically from one sample to another. It is highly desirable to develop an adaptive thresholding scheme to remove noise and trivial features for robust convolution. We develop a normalized soft thresholding (NST) method to this end. Mathematically, NST first normalizes each sample $z_i$ ($i = 1, \dots, N$) in an $N$-sized mini-batch as

$$\hat{z}_i = \frac{z_i - \mu(z_i)}{\sigma(z_i)},$$

where $\mu(\cdot)$ and $\sigma(\cdot)$ compute the mean and the standard deviation, respectively. In this way, corrupted samples at different levels are scaled to nearly the same level in an adaptive manner, so that the normalized corrupted samples are expected to approach the distribution of their corresponding clean samples. We are then able to further mitigate the internal covariate shift of the mini-batch by applying BN [27] to them, obtaining $\bar{z}_i$. The above procedures of NST can be easily implemented with a non-parametric LN followed by a standard BN, both of which can be found in many existing toolkits [1, 11, 35].

Finally, the corruptions in the normalized feature domain can be suppressed by a standard ST operation

$$\mathrm{NST}(z_i) = \operatorname{sign}(\bar{z}_i)\,\max\big(|\bar{z}_i| - \tau,\, 0\big),$$

where $\tau$ is the threshold and $\mathrm{NST}(z_i)$ is the NST output of $z_i$. With the introduction of normalization in NST, we can easily use the same fixed $\tau$ in all experiments.
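The three NST steps (per-sample normalization, batch normalization, soft thresholding) can be sketched in plain Python as below. This is a simplified illustration, not the authors' implementation: each sample is a flat list of features, BN is applied per feature index with training-mode statistics and no affine parameters, and the names `nst`, `standardize`, the small constant `eps` and the threshold value `tau=0.5` are assumptions made for the example.

```python
import math

def standardize(values, eps=1e-5):
    """Subtract the mean and divide by the standard deviation (eps avoids 0-division)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def soft_threshold(v, tau):
    """Shrink v towards zero by tau; values with |v| <= tau become zero."""
    return math.copysign(max(abs(v) - tau, 0.0), v)

def nst(batch, tau):
    """batch: list of samples, each a flat list of feature values."""
    # Step 1: normalize every sample with its own statistics (non-parametric LN).
    per_sample = [standardize(sample) for sample in batch]
    # Step 2: normalize every feature across the mini-batch (BN without affine).
    n_feat = len(per_sample[0])
    columns = [standardize([s[j] for s in per_sample]) for j in range(n_feat)]
    normalized = [[columns[j][i] for j in range(n_feat)] for i in range(len(batch))]
    # Step 3: suppress small responses with a single fixed threshold.
    return [[soft_threshold(v, tau) for v in sample] for sample in normalized]

clean = [[0.1, 0.9, 0.2, 0.4], [0.5, 0.3, 0.8, 0.1], [0.7, 0.2, 0.6, 0.9]]
# Hypothetical "corruption": the second sample is amplified by a factor of 10.
corrupted = [clean[0], [10.0 * v for v in clean[1]], clean[2]]
out_clean = nst(clean, tau=0.5)
out_corrupted = nst(corrupted, tau=0.5)
max_diff = max(abs(a - b) for ra, rb in zip(out_clean, out_corrupted)
               for a, b in zip(ra, rb))
print(max_diff)
```

Because step 1 rescales each sample by its own statistics, amplifying one sample barely changes the output, which is why one threshold can serve samples at different corruption levels. Real corruptions are of course not pure rescalings, so this only illustrates the scale-adaptive part of the argument.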

| Method | Auxiliary branch | Noise removal | Receptive field | Low frequency kernel | High frequency kernel |
|---|---|---|---|---|---|
| Conv2d | | N.A. | Uniform | N.A. | N.A. |
| Conv2d-MK | | N.A. | Varied | N.A. | N.A. |
| Conv2d+SE [25] | | N.A. | Uniform | N.A. | N.A. |
| Conv2d+CBAM [45] | | N.A. | Uniform | N.A. | N.A. |
| LST [29] | | ST | Uniform | N.A. | N.A. |
| RConv-UK | | NST | Uniform | N.A. | N.A. |
| RConv-RMK | | NST | Varied | Small | Large |
| RConv-DMK | | N.A. | Varied | Large | Small |
| RConv-LMK | | LN | Varied | Large | Small |
| RConv-SMK | | ST | Varied | Large | Small |
| RConv-MK | | NST | Varied | Large | Small |

Table 1: Methods for comparison in this paper.

3.6 Implementation details and complexity analysis

By using the proposed spatial transform with multiple kernels and the NST operator, we are able to construct a robust convolutional layer, namely RConv-MK, for more reliable visual recognition. We expect that RConv-MK has almost the same overhead as LST at the cost of a negligible number of extra parameters for the same setting of and . We set and fix , and in this paper. Besides, we set to half of so that the kernel of size

will be computed with the majority of the input features. As for DWConv, when the input shape and stride are the same, the overhead is proportional to the kernel size. Therefore, we have

, . With the above settings, we encourage increase of input channels in high frequency (those convolved with kernels of smaller size) and decrease of input channels in low frequency (those convolved with kernels of larger size), which well matches recent findings that high frequency features are critical to the model generalization ability [42] while low frequency features are rather vulnerable to adversarial attacks [19]. We conduct a grid search to specify the proportion for some popular kernel sizes. In this paper, we set when , and when .

Compared to LST, the extra parameters of RConv-MK come only from the multiple kernels of $T_s$. In modern CNN architectures, the total number of parameters of RConv-MK is dominated by $T_c$ and $T_r$, which are the same as their counterparts in LST. The weights of $T_s$ discussed in this paper only occupy a tiny fraction. Take the ResNet architecture [21] as an example: the number of extra parameters only occupies approximately 0.01% to 1% of the total number of parameters in an RConv-MK.
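The parameter argument can be made concrete with a small counting sketch. It assumes bias-free layers, a depthwise kernel cost of channels times kernel area, and a pointwise cost of input channels times output channels; the channel count (256), kernel sizes and channel split below are hypothetical values chosen only for illustration and are not the paper's configuration.

```python
def depthwise_params(channels, kernel_size):
    """Weights of a bias-free depthwise convolution: one k x k kernel per channel."""
    return channels * kernel_size * kernel_size

def pointwise_params(in_channels, out_channels):
    """Weights of a bias-free 1x1 (pointwise) convolution."""
    return in_channels * out_channels

def multi_kernel_params(group_channels, kernel_sizes):
    """Depthwise weights when each channel group uses its own kernel size."""
    return sum(depthwise_params(c, k) for c, k in zip(group_channels, kernel_sizes))

channels = 256                                              # hypothetical layer width
uniform = depthwise_params(channels, 3)                     # single 3x3 kernel size
multi = multi_kernel_params([32, 64, 160], [7, 5, 3])       # hypothetical split
extra = multi - uniform
pointwise = pointwise_params(channels, channels)            # one pointwise transform
print(extra, pointwise, round(100.0 * extra / pointwise, 2))
```

Under these assumed numbers, the larger kernels add a few thousand weights while a single pointwise transform of the same width already holds tens of thousands, which is the sense in which the spatial transform occupies only a small share of the module.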

4 Experiments

We perform extensive experiments to evaluate the robustness of the proposed RConv-MK to common types of corruptions and adversarial attacks. Ablation studies are also conducted to set the number of multiple kernels and the channel split in RConv-MK. All experiments are conducted on a 10-way NVIDIA RTX server. We use PyTorch [35] for implementation. Due to the page limit, results of the proposed RConv-MK under more CNN architectures and further ablation studies can be found in the supplementary material.

4.1 Experiment setup and datasets

Methods for comparison. As an alternative to Conv2d in a CNN, we compare the proposed RConv-MK with Conv2d and its variants as well as LST [29]. Besides, we further test some variants of RConv-MK to better understand the roles of its different components. Table 1 lists the competing methods and their attributes, including the use of an auxiliary branch, the noise removal method, the receptive field, and the low and high frequency kernel sizes. Similar to RConv-MK, Conv2d-MK splits the input along the channel dimension into groups, performs Conv2d with a different kernel size for each group, and then concatenates the results of all groups. We also combine two popular attention modules, i.e., SE [25] and CBAM [45], with Conv2d, denoted as Conv2d+SE and Conv2d+CBAM. As for the variants of RConv-MK, RConv-UK adopts a uniform kernel in $T_s$, RConv-RMK reverses the order of kernels, RConv-LMK replaces NST with LN, RConv-SMK substitutes NST with ST, and RConv-DMK removes all NST operators.

Tasks and datasets. We compare the competing methods on three tasks: visual recognition on corrupted images, white-box adversarial attacks, as well as recognition on clean images. The ImageNet-C [22] dataset is employed to evaluate the robustness of each method to common corruptions. CIFAR-10/100 are used for the evaluation under white-box adversarial attacks. The ImageNet [12] dataset is employed for evaluating classification performance on clean images. In addition, we also employ the MS-COCO [33] dataset to evaluate the proposed method for object detection and instance segmentation. On each dataset, we closely follow the standard experimental settings for fair comparison. Details can be found in our supplementary material.

4.2 Evaluation on images with corruptions

To study the generalization ability of models trained with clean images to various corruptions, ImageNet-C [22] is constructed by applying 19 types of distinct corruptions to the validation set of ImageNet [12]. The mean corruption error (mCE) is used as the criterion (the lower the better) for performance evaluation. We build CNNs with the different competing methods under ResNet-50. The best snapshot of each method on ImageNet [12] is used for comparison.
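For readers unfamiliar with the metric, the sketch below computes mCE under the assumption that the standard ImageNet-C definition is used: for each corruption type, the model's error rates summed over severity levels are divided by those of a reference model (AlexNet in the original benchmark), and the resulting ratios are averaged and reported in percent. All error rates in the example are made-up placeholders, not results from this paper.

```python
def corruption_error(model_errors, reference_errors):
    """Ratio of summed error rates over severity levels for one corruption type."""
    return sum(model_errors) / sum(reference_errors)

def mean_corruption_error(model, reference):
    """Average the per-corruption ratios and report the result in percent."""
    ratios = [corruption_error(model[name], reference[name]) for name in model]
    return 100.0 * sum(ratios) / len(ratios)

# Placeholder error rates (fractions) at five severity levels per corruption.
reference = {
    "gaussian_noise": [0.70, 0.80, 0.90, 0.95, 0.97],
    "defocus_blur":   [0.60, 0.70, 0.80, 0.85, 0.90],
}
model = {
    "gaussian_noise": [0.40, 0.55, 0.70, 0.85, 0.92],
    "defocus_blur":   [0.45, 0.55, 0.65, 0.75, 0.82],
}
mce = mean_corruption_error(model, reference)
print(round(mce, 2))
```

A value of 100 means the model is exactly as error-prone as the reference on corrupted data, and lower values indicate better robustness, which is how the mCE column of Table 2 should be read.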

| Method | Top-1/5 error rate (%) | mCE |
|---|---|---|
| Conv2d | 23.85/7.13 | 77.01 |
| Conv2d+SE [25] | 23.14/6.70 | 74.47 |
| Conv2d+CBAM [45] | 22.98/6.68 | 72.56 |
| Conv2d-MK | 24.96/7.51 | 77.17 |
| LST [29] | 22.78/6.66 | 70.54 |
| RConv-LMK | 22.76/7.05 | 70.34 |
| RConv-SMK | 22.98/6.64 | 70.80 |
| RConv-DMK | 23.31/6.88 | 70.93 |
| RConv-RMK | 23.10/6.80 | 70.81 |
| RConv-UK | 22.59/6.58 | 69.79 |
| RConv-MK | 22.22/6.32 | 67.91 |

Table 2: Comparison of robustness to common corruptions under ResNet-50 architecture on ImageNet-C.

Table 2 shows the best top-1/5 error rates on clean ImageNet and the corresponding mCE values on ImageNet-C. RConv-MK obtains a lower mCE than all its competitors and reduces the mCE of the baseline Conv2d by 9.10. In terms of corruption suppression methods, RConv-MK is better than RConv-LMK and RConv-SMK, both of which are better than RConv-DMK. This shows that both LN and ST improve the model robustness to common corruptions, while the proposed NST produces more robust results. Besides, RConv-MK is better than LST, which is in turn better than RConv-SMK. This suggests our NST is more robust to unseen corruptions than ST for multiple kernels, as NST normalizes the features into the same range for noise and redundancy removal. In contrast, ST considers the amplitude values only. Though the adoption of multiple kernels helps feature extraction by frequency, RConv-SMK may also increase the amplitude change in some supporting frequencies of a corrupted image, which weakens its robustness. Furthermore, when it comes to the arrangement of multiple kernels, RConv-MK (normal order) is better than RConv-UK (uniform kernel), which is better than RConv-RMK (reversed order). This suggests that signals of different frequencies are sensitive to the kernel size, and that a mismatch is even worse than using a uniform kernel. In addition, it is critical to group and concatenate channels for multiple kernels in the frequency domain. RConv-MK is better than RConv-UK in the frequency domain, while Conv2d-MK is worse than Conv2d because signals in the spatial domain are not well structured.

4.3 Evaluation on adversarial attacks

Our RConv-MK is developed from the perspective of network architecture, so a comparison against existing adversarial training algorithms lies outside our main focus. In fact, our method is complementary to these methods in practice. Below, we compare the robustness of RConv-MK and its competing network building blocks to adversarial attacks on CIFAR-10/100. We build models under WRN34-10 (results under ResNet-18 are presented in our supplementary material). We conduct adversarial training of each model on each dataset under the PGD attack for 100 epochs with common hyper-parameter settings for the perturbation size and step size, and with 10 attack steps. The learning rate starts at 0.1 and is reduced by a factor of 10 after 75, 90 and 100 epochs, respectively. We fix the batch size as 128 and the weight decay as 0.0002. We test each trained model under untargeted white-box attacks with five representative algorithms, including FGSM [18], PGD [34], FFGSM [44], ODI [38] and AWP [46]. We use the official implementations of both ODI and AWP, and advertorch [13] for the rest.
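As a reminder of what the PGD attack used for adversarial training does, here is a minimal sketch on a toy logistic-regression model in plain Python. It is not the training code of the paper: the model, weights and input are invented, and the perturbation size `eps = 8/255` and step size `2/255` are illustrative assumptions; only the number of steps (10) comes from the text.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(x, y, w, b):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss, [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_attack(x, y, w, b, eps, step_size, num_steps, seed=0):
    """Untargeted L-infinity PGD: ascend the loss, then project back into the eps-ball."""
    rng = random.Random(seed)
    # Random start inside the eps-ball, clipped to the valid pixel range [0, 1].
    x_adv = [min(1.0, max(0.0, xi + rng.uniform(-eps, eps))) for xi in x]
    for _ in range(num_steps):
        _, grad = loss_and_grad(x_adv, y, w, b)
        stepped = [xa + step_size * sign(g) for xa, g in zip(x_adv, grad)]
        # Project onto the eps-ball around the clean input, then onto [0, 1].
        x_adv = [min(1.0, max(0.0, min(xi + eps, max(xi - eps, s))))
                 for xi, s in zip(x, stepped)]
    return x_adv

w, b = [1.5, -2.0, 0.5], 0.1            # hypothetical model parameters
x, y = [0.2, 0.7, 0.5], 1               # hypothetical clean input and label
eps, step_size = 8.0 / 255.0, 2.0 / 255.0   # assumed values, not from the paper
x_adv = pgd_attack(x, y, w, b, eps, step_size, num_steps=10)
clean_loss, _ = loss_and_grad(x, y, w, b)
adv_loss, _ = loss_and_grad(x_adv, y, w, b)
print(clean_loss < adv_loss)
```

In adversarial training, such perturbed inputs replace or augment the clean mini-batch at every iteration; at test time the same procedure, run against the trained model, yields the robust accuracy reported for the PGD attack.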

| Attack | Dataset | Conv2d | Conv2d+SE | Conv2d+CBAM | Conv2d-MK | LST | RConv-LMK | RConv-SMK | RConv-DMK | RConv-RMK | RConv-UK | RConv-MK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FFGSM [44] | C10 | 60.78 | 60.75 | 60.49 | 60.87 | 62.80 | 64.06 | 64.18 | 62.32 | 63.60 | 64.05 | 64.55 |
| FFGSM [44] | C100 | 32.15 | 32.04 | 32.35 | 31.12 | 34.08 | 32.48 | 33.91 | 33.84 | 33.66 | 34.19 | 34.55 |
| FGSM [18] | C10 | 57.31 | 56.70 | 56.68 | 57.63 | 59.02 | 60.16 | 60.18 | 57.71 | 58.56 | 59.86 | 60.67 |
| FGSM [18] | C100 | 29.22 | 28.98 | 29.62 | 28.29 | 30.25 | 30.12 | 30.82 | 29.90 | 30.04 | 31.07 | 31.50 |
| PGD [34] | C10 | 47.05 | 46.51 | 46.89 | 47.04 | 50.46 | 50.92 | 50.95 | 49.60 | 50.24 | 51.38 | 52.64 |
| PGD [34] | C100 | 24.00 | 23.34 | 24.05 | 22.94 | 24.42 | 24.90 | 25.42 | 25.06 | 25.39 | 25.51 | 26.63 |
| ODI [38] | C10 | 45.94 | 45.36 | 45.62 | 45.64 | 48.21 | 49.19 | 50.53 | 48.08 | 49.23 | 50.29 | 51.05 |
| ODI [38] | C100 | 22.85 | 22.20 | 22.84 | 21.56 | 24.37 | 23.81 | 24.94 | 24.10 | 24.93 | 25.03 | 25.39 |
| AWP [46] | C10 | 56.17 | 56.11 | 55.90 | 56.21 | 56.22 | 57.76 | 57.81 | 56.89 | 57.32 | 57.59 | 58.22 |
| AWP [46] | C100 | 28.80 | 28.62 | 28.86 | 27.61 | 29.03 | 29.03 | 29.10 | 29.09 | 29.07 | 29.14 | 29.46 |

Table 3: Results (robust accuracy, %) by different methods under untargeted white-box attacks on CIFAR-10/100.

Table 3 presents the accuracy obtained by different methods. We make the following observations. First, the proposed RConv-MK outperforms all its competitors under adversarial attacks. Second, the attention modules, including Conv2d+SE and Conv2d+CBAM, have almost the same performance as the baseline under various untargeted white-box attacks. By definition, both attention modules pay more attention to local patterns while suppressing trivial features for image recognition. Although this kind of mechanism helps the recognition of clean images, it may hurt the backbone model under adversarial attacks because corrupted local patterns have a bigger chance of imposing incorrect excitation on the target features. Third, in order of decreasing robustness, the ranking is RConv-MK, then RConv-LMK and RConv-SMK, then RConv-DMK on CIFAR-10, and RConv-MK, RConv-SMK, RConv-DMK, RConv-LMK on CIFAR-100. ST improves the robustness to adversarial attacks as it effectively plays the role of a gradient mask under adversarial attacks. LN shows competitive performance on CIFAR-10 but performs poorly on CIFAR-100. This may result from over-fitting of LN in adversarial training. As the number of categories increases, the decision boundaries are expected to be less smooth in the feature space shaped by LN, so the model becomes vulnerable to unseen adversarial samples at test time. Fourth, the arrangement of multiple kernels also matters under adversarial attacks: RConv-MK (normal order) is better than RConv-UK (uniform kernel), which is better than RConv-RMK (reversed order). Fifth, Conv2d-MK is consistently worse than the baseline on CIFAR-100 and brings no consistent gain on CIFAR-10, due to its poor structure in the spatial domain for channel split and concatenation. In contrast, our RConv-MK improves on RConv-UK under all attacks, as the channel operations are conducted in a well-structured space.

| Depth | Method | Param/FLOPs | Top-1/Top-5 |
|---|---|---|---|
| 18 | Conv2d | 11.69M/1.81G | 30.24/10.92 |
| 18 | LST [29] | 8.03M/1.48G | 26.55/8.59 |
| 18 | RConv-MK | 8.03M/1.48G | 26.26/8.48 |
| 34 | Conv2d | 21.79M/3.66G | 26.70/8.58 |
| 34 | LST [29] | 13.82M/2.56G | 23.92/7.24 |
| 34 | RConv-MK | 13.82M/2.56G | 23.54/6.99 |
| 50 | Conv2d | 25.56M/4.09G | 23.85/7.13 |
| 50 | LST [29] | 23.33M/4.05G | 22.78/6.66 |
| 50 | RConv-MK | 23.33M/4.05G | 22.22/6.32 |
| 101 | Conv2d | 44.55M/7.80G | 22.63/6.44 |
| 101 | LST [29] | 42.36M/7.75G | 21.63/5.94 |
| 101 | RConv-MK | 42.36M/7.75G | 21.41/5.93 |
| 152 | Conv2d | 60.19M/11.51G | 21.69/5.94 |
| 152 | LST [29] | 58.02M/11.46G | 20.02/5.26 |
| 152 | RConv-MK | 58.02M/11.46G | 19.77/5.15 |

Table 4: Results (error rates, %) of RConv-MK under ResNet architecture on ImageNet.

4.4 Evaluation on clean images

We further study the performance of RConv-MK on clean images. We evaluate it on tasks of image recognition, object detection and instance segmentation.

Image recognition. The ImageNet [12] dataset is used to evaluate the performance of RConv-MK on image recognition with clean images. We construct the models under the ResNet [21] architecture and train/test them with the standard settings. Table 4 shows the results. One can see that RConv-MK reduces the top-1 error rate of Conv2d by 1.22% to 3.98% at a lower cost, and that of LST by about 0.2% to 0.5% at almost the same cost. This validates that RConv-MK can also improve the generalization performance of a CNN on clean images.
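The quoted ranges can be re-derived from the top-1 columns of Table 4 with a few lines of Python. The numbers are copied from the table; the variable names are only for illustration.

```python
# Top-1 error rates (%) from Table 4: depth -> (Conv2d, LST, RConv-MK).
top1 = {
    18:  (30.24, 26.55, 26.26),
    34:  (26.70, 23.92, 23.54),
    50:  (23.85, 22.78, 22.22),
    101: (22.63, 21.63, 21.41),
    152: (21.69, 20.02, 19.77),
}
gain_vs_conv2d = [round(conv - rconv, 2) for conv, _, rconv in top1.values()]
gain_vs_lst = [round(lst - rconv, 2) for _, lst, rconv in top1.values()]
print(min(gain_vs_conv2d), max(gain_vs_conv2d))
print(min(gain_vs_lst), max(gain_vs_lst))
```

The largest gain over Conv2d occurs at depth 18 and the smallest at depth 101, while the gain over LST stays within a few tenths of a percent at every depth.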

We also compare RConv-MK with DCTNet [48] with 64 input channels under the same ResNet-50 architecture on ImageNet. RConv-MK reduces the top-1/5 error rates of DCTNet from 22.84/6.53 to 22.22/6.32. In addition, RConv-MK runs at 27.78 FPS (including data loading and pre-processing on a single CPU thread plus computation on the GPU), which is 3.39 FPS faster than DCTNet. Although DCTNet can reduce the latency of data transmission to some extent by performing the DCT sequentially on the CPU, we note that this step remains expensive (even with the support of advanced CPU instructions, e.g., AVX512).

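| Detector | Backbone method | AP | AP50 | AP75 |
|---|---|---|---|---|
| Faster R-CNN [36] | Conv2d | 37.4 | 58.1 | 40.4 |
| Faster R-CNN [36] | LST [29] | 40.8 | 62.2 | 44.3 |
| Faster R-CNN [36] | RConv-MK | 41.3 | 62.6 | 45.0 |
| RetinaNet [32] | Conv2d | 36.5 | 55.4 | 39.1 |
| RetinaNet [32] | LST [29] | 38.7 | 58.5 | 41.7 |
| RetinaNet [32] | RConv-MK | 39.4 | 60.0 | 42.0 |
| FCOS [39] | Conv2d | 36.6 | 55.7 | 38.8 |
| FCOS [39] | LST [29] | 38.8 | 58.7 | 41.5 |
| FCOS [39] | RConv-MK | 39.6 | 60.0 | 42.2 |
| Mask R-CNN [20] | Conv2d | 38.2 | 58.8 | 41.4 |
| Mask R-CNN [20] | LST [29] | 41.3 | 62.5 | 45.0 |
| Mask R-CNN [20] | RConv-MK | 41.8 | 63.3 | 45.9 |
| Cascade Mask R-CNN [4] | Conv2d | 41.2 | 59.4 | 45.0 |
| Cascade Mask R-CNN [4] | LST [29] | 43.9 | 62.6 | 47.9 |
| Cascade Mask R-CNN [4] | RConv-MK | 44.4 | 63.0 | 48.4 |

Table 5: Object detection results (%) of RConv-MK on MS-COCO validation set.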
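
| Detector | Backbone method | AP | AP50 | AP75 |
|---|---|---|---|---|
| Mask R-CNN [20] | Conv2d | 34.7 | 55.7 | 37.2 |
| Mask R-CNN [20] | LST [29] | 37.1 | 59.3 | 39.4 |
| Mask R-CNN [20] | RConv-MK | 37.6 | 59.9 | 40.2 |
| Cascade Mask R-CNN [4] | Conv2d | 35.9 | 56.6 | 38.4 |
| Cascade Mask R-CNN [4] | LST [29] | 38.1 | 59.7 | 40.9 |
| Cascade Mask R-CNN [4] | RConv-MK | 38.6 | 60.2 | 41.5 |

Table 6: Instance segmentation results (%) of RConv-MK on MS-COCO validation set.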

Object detection and instance segmentation. We test RConv-MK for object detection and instance segmentation on MS-COCO [33] using representative detectors such as Faster R-CNN [36] and Mask R-CNN [20]. Table 5 and Table 6 report the object detection and instance segmentation results on the MS-COCO validation set, respectively. RConv-MK achieves better mAP than Conv2d and LST on both tasks. Across all object detectors, RConv-MK improves the mAP of Conv2d by 2.9% to 3.9% and that of LST by 0.5% to 1.7%. On instance segmentation, RConv-MK outperforms Conv2d by 2.7% to 2.9% and LST by 0.5% with both Mask R-CNN and Cascade Mask R-CNN.
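As a quick check of the gains over Conv2d, the first metric column of Table 5 and Table 6 can be differenced directly. This is a minimal sketch that assumes the first column is the mAP the text refers to; `DETECTION_AP`, `SEGMENTATION_AP` and `gain_range` are hypothetical names introduced here for illustration, and the values are copied from the tables.

```python
# First metric column (assumed to be mAP, in %) from Table 5 and Table 6.
# Each tuple is (Conv2d, LST, RConv-MK). Names are illustrative only.
DETECTION_AP = {
    "Faster R-CNN": (37.4, 40.8, 41.3),
    "RetinaNet": (36.5, 38.7, 39.4),
    "FCOS": (36.6, 38.8, 39.6),
    "Mask R-CNN": (38.2, 41.3, 41.8),
    "Cascade Mask R-CNN": (41.2, 43.9, 44.4),
}
SEGMENTATION_AP = {
    "Mask R-CNN": (34.7, 37.1, 37.6),
    "Cascade Mask R-CNN": (35.9, 38.1, 38.6),
}


def gain_range(table, baseline_index):
    """Return (min, max) gain of RConv-MK over the chosen baseline column.

    baseline_index is 0 for Conv2d and 1 for LST. Gains are rounded to one
    decimal, matching the precision of the tables.
    """
    gains = [round(row[2] - row[baseline_index], 1) for row in table.values()]
    return min(gains), max(gains)


det_vs_conv = gain_range(DETECTION_AP, 0)
seg_vs_conv = gain_range(SEGMENTATION_AP, 0)
seg_vs_lst = gain_range(SEGMENTATION_AP, 1)

print("detection, over Conv2d:", det_vs_conv)
print("segmentation, over Conv2d:", seg_vs_conv)
print("segmentation, over LST:", seg_vs_lst)
```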

Figure 3 (left) presents visual comparisons of object detection with FCOS. Although RConv-MK fuses intermediate features of the lowest dimension among the three methods, it obtains better detection results than Conv2d and LST on challenging objects, for example, objects with severe occlusion (see the teddy bears in the left image), of various sizes (see the persons in the middle image), and with different levels of out-of-plane rotation (see the clocks in the right image). These results demonstrate the robustness of RConv-MK in feature learning. In Figure 3 (right), we visualize instance segmentation results using Mask R-CNN. With the improved spatial transform, the proposed RConv-MK (last row) generates more precise segmentation results than Conv2d and LST (rows above it). For example, in the middle column, RConv-MK successfully segments the back of the left bear in the shadow, while Conv2d misses this part and LST misclassifies it as another object.

Figure 3: Comparisons of object detection results based on FCOS [39] and instance segmentation results based on Mask R-CNN [20] (from top to bottom: Conv2d, LST [29] and our RConv-MK).

5 Conclusion

In this paper, we proposed RConv-MK, a robust alternative to the Conv2d layer, as a reliable feature extractor for visual recognition with corrupted images and adversarial samples. RConv-MK uses a set of kernels of different sizes so that they can be flexibly applied to input features of different frequencies and exploit their specific characteristics. A normalized soft thresholding (NST) operator then adaptively suppresses the effect of different corruptions at different levels using a uniform threshold. RConv-MK can be implemented easily and efficiently with existing toolkits. Extensive experiments on corrupted images, adversarial samples, and clean images validated the effectiveness of RConv-MK under popular CNN architectures.
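To illustrate why normalizing before thresholding allows a single, uniform threshold, here is a minimal sketch of generic soft thresholding applied to standardized features. It is not the paper's NST definition, which is given in the method section and is not reproduced here: the choice of zero-mean, unit-variance standardization, the function names, and the `eps` constant are all assumptions made for this example.

```python
import math


def soft_threshold(x, tau):
    """Standard soft thresholding: shrink x toward zero by tau.

    Values with |x| <= tau are set to zero; larger values keep their sign
    and lose tau in magnitude.
    """
    return math.copysign(max(abs(x) - tau, 0.0), x)


def normalized_soft_threshold(features, tau, eps=1e-5):
    """Standardize a feature vector, then soft-threshold it with one tau.

    ASSUMPTION: the normalization here is plain standardization (zero mean,
    unit variance); the paper's NST operator may normalize differently.
    Because the features are rescaled first, the same tau can be used
    whatever the original magnitude of the features.
    """
    n = len(features)
    mean = sum(features) / n
    var = sum((v - mean) ** 2 for v in features) / n
    std = math.sqrt(var + eps)
    return [soft_threshold((v - mean) / std, tau) for v in features]


# Three small responses and one strong response: only the strong one survives.
out = normalized_soft_threshold([0.0, 0.1, -0.1, 5.0], tau=1.0)
print(out)
```

Scaling the input by a constant factor leaves the standardized values essentially unchanged, which is the property that makes one threshold usable across inputs of different magnitudes in this sketch.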


References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng (2016) TensorFlow: a system for large-scale machine learning. In Symp. Oper. Syst. Des. Implement., Cited by: §3.5.
  • [2] A. Athalye, N. Carlini, and D. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In Int. Conf. Mach. Learn., Cited by: §2.2.
  • [3] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §2.3.
  • [4] Z. Cai and N. Vasconcelos (2019) Cascade R-CNN: high quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. Cited by: Table 5, Table 6.
  • [5] N. Carlini, A. Athalye, N. Papernot, W. Brendel, J. Rauber, D. Tsipras, I. Goodfellow, A. Madry, and A. Kurakin (2019) On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705. Cited by: §2.2.
  • [6] N. Carlini, G. Katz, C. Barrett, and D. L. Dill (2017) Provably minimally-distorted adversarial examples. arXiv preprint arXiv:1709.10207. Cited by: §2.2.
  • [7] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In IEEE Symp. Secur. Priv., Cited by: §2.2.
  • [8] H. Chen, J. Liang, S. Chang, J. Pan, Y. Chen, W. Wei, and D. Juan (2019) Improving adversarial robustness via guided complement entropy. In Int. Conf. Comput. Vis., Cited by: §1, §2.2.
  • [9] P. Chen, Y. Sharma, H. Zhang, J. Yi, and C. Hsieh (2018) EAD: Elastic-Net attacks to deep neural networks via adversarial examples. In AAAI, Cited by: §2.2.
  • [10] P. Chen, H. Zhang, Y. Sharma, J. Yi, and C. Hsieh (2017) ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. arXiv preprint arXiv:1708.03999. Cited by: §2.2.
  • [11] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang (2016) MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. In Adv. Neural Inform. Process. Syst. Worksh., Cited by: §3.5.
  • [12] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: §1, §4.1, §4.2, §4.4.
  • [13] G. W. Ding, L. Wang, and X. Jin (2019) AdverTorch v0.1: an adversarial robustness toolbox based on pytorch. arXiv preprint arXiv:1902.07623. Cited by: §4.3.
  • [14] Y. Dong, Z. Deng, T. Pang, H. Su, and J. Zhu (2020) Adversarial distributional training for robust deep learning. In Adv. Neural Inform. Process. Syst., Cited by: §2.2.
  • [15] Y. Fan, J. Yu, D. Liu, and T. S. Huang (2020) Scale-wise convolution for image restoration. In AAAI, Cited by: §2.1.
  • [16] F. Franzen (2018) Image classification in the frequency domain with neural networks and absolute value DCT. In Int. Conf. Image Signal Process., Cited by: §1, §2.1.
  • [17] R. Furuta, N. Inoue, and T. Yamasaki (2019) Fully convolutional network with multi-step reinforcement learning for image processing. In AAAI, Cited by: §2.1.
  • [18] I. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In Int. Conf. Learn. Represent., Cited by: §4.3, Table 3.
  • [19] C. Guo, J. S. Frank, and K. Q. Weinberger (2020) Low frequency adversarial perturbation. In Uncertain. Artif. Intell., Cited by: §3.6.
  • [20] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In Int. Conf. Comput. Vis., Cited by: Figure 3, §4.4, Table 5, Table 6.
  • [21] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In Eur. Conf. Comput. Vis., Cited by: §3.6, §4.4.
  • [22] D. Hendrycks and T. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. In Int. Conf. Learn. Represent., Cited by: §1, §4.1, §4.2.
  • [23] T. Hossain, S. W. Teng, D. Zhang, S. Lim, and G. Lu (2019) Distortion robust image classification using deep convolutional neural network with discrete cosine transform. In IEEE Int. Conf. Image Process., Cited by: §1, §2.1.
  • [24] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §3.2.
  • [25] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Table 1, §4.1, Table 2, Table 3.
  • [26] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin (2018) Black-box adversarial attacks with limited queries and information. In Int. Conf. Mach. Learn., Cited by: §2.2.
  • [27] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Int. Conf. Mach. Learn., Cited by: §2.3, §3.5.
  • [28] A. Krizhevsky and G. E. Hinton (2009) Learning multiple layers of features from tiny images. Technical Report TR-2009, University of Toronto. Cited by: §4.1.
  • [29] L. Li, K. Wang, S. Li, X. Feng, and L. Zhang (2020) LST-Net: learning a convolutional neural network with a learnable sparse transform. In Eur. Conf. Comput. Vis., Cited by: §1, §3.2, §3.3, §3.4, Table 1, Figure 3, §4.1, Table 2, Table 3, Table 4, Table 5, Table 6.
  • [30] L. Li, K. Wang, S. Li, X. Feng, and L. Zhang (2020) Remarks on Tc and Ts. Cited by: §3.2.
  • [31] M. Lin, Q. Chen, and S. Yan (2014) Network in network. In Int. Conf. Learn. Represent., Cited by: §3.2.
  • [32] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Int. Conf. Comput. Vis., Cited by: Table 5.
  • [33] T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár (2014) Microsoft COCO: common objects in context. In Eur. Conf. Comput. Vis., Cited by: §4.1, §4.4.
  • [34] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In Int. Conf. Learn. Represent., Cited by: §1, §2.2, §4.3, Table 3.
  • [35] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Adv. Neural Inform. Process. Syst., H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. Cited by: §3.5, §4.
  • [36] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Adv. Neural Inform. Process. Syst., Cited by: §4.4, Table 5.
  • [37] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In Int. Conf. Learn. Represent., Cited by: §1, §2.2.
  • [38] Y. Tashiro, Y. Song, and S. Ermon (2020) Output diversified initialization for adversarial attacks. arXiv preprint arXiv:2003.06878. Cited by: §4.3, Table 3.
  • [39] Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. In Int. Conf. Comput. Vis., Cited by: Figure 3, Table 5.
  • [40] J. Uesato, B. O'Donoghue, P. Kohli, and A. van den Oord (2018) Adversarial risk and the dangers of evaluating against weak attacks. In Int. Conf. Mach. Learn., Cited by: §2.2.
  • [41] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §2.3.
  • [42] H. Wang, X. Wu, Z. Huang, and E. P. Xing (2020) High-frequency component helps explain the generalization of convolutional neural networks. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: §3.6.
  • [43] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13 (4), pp. 600–612. Cited by: §2.1.
  • [44] E. Wong, L. Rice, and J. Z. Kolter (2020) Fast is better than free: revisiting adversarial training. In Int. Conf. Learn. Represent., Cited by: §4.3, Table 3.
  • [45] S. Woo, J. Park, J. Lee, and I. So Kweon (2018) CBAM: convolutional block attention module. In Eur. Conf. Comput. Vis., Cited by: Table 1, §4.1, Table 2, Table 3.
  • [46] D. Wu, S. Xia, and Y. Wang (2020) Adversarial weight perturbation helps robust generalization. In Adv. Neural Inform. Process. Syst., Cited by: §4.3, Table 3.
  • [47] Y. Wu and K. He (2018) Group normalization. In Eur. Conf. Comput. Vis., Cited by: §2.3.
  • [48] K. Xu, M. Qin, F. Sun, Y. Wang, Y. Chen, and F. Ren (2020) Learning in the frequency domain. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: §1, §2.1, §4.4.
  • [49] H. Zhang and J. Wang (2019) Defense against adversarial attacks using feature scattering-based adversarial training. In Adv. Neural Inform. Process. Syst., Cited by: §1.
  • [50] H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan (2019) Theoretically principled trade-off between robustness and accuracy. In Int. Conf. Mach. Learn., Cited by: §1.
  • [51] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang (2017) Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26 (7), pp. 3142–3155. Cited by: §2.1.
  • [52] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2018) Places: a 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40 (6), pp. 1452–1464. Cited by: §1.