Recent research reveals that deliberately crafted adversarial perturbations can lead Deep Neural Networks (DNNs) to make wrong predictions, not only when attackers know the architecture of the DNN, i.e., the white-box setting, but also when they only have access to its input-output pairs, i.e., the black-box setting. This discovery exposes a potential danger in existing machine learning applications and motivates defenses against adversarial attacks. These defenses fall into two families, namely reactive defenses and proactive defenses. Reactive defenses Xu et al. (2017); Guo et al. (2017); Sun et al. (2019); Xie et al. (2017) aim to gain robustness by introducing an extra element that recognizes or removes the adversarial context. Proactive defenses Goodfellow et al. (2014); Madry et al. (2017); Lyu et al. (2015); Chen et al. (2019) attempt to build networks that are inherently robust to adversarial attacks.
Transformations Xu et al. (2017); Guo et al. (2017), a typical reactive approach, remove adversarial effects by applying simple filters. They are cheap but perform poorly against strong attacks, e.g., PGD Madry et al. (2017), C&W Carlini and Wagner (2017), and DeepFool Moosavi-Dezfooli et al. (2016). To improve performance, randomness Raff et al. (2019); Prakash et al. (2018) and representation changes Moosavi-Dezfooli et al. (2018); Buckman et al. (2018); Liu et al. (2019) have been introduced into transformations. Transformation gains robustness but loses accuracy, since the original images are altered when the adversarial context is discarded: the networks learn from original data and cannot recognize the distorted information.
Adversarial Training Goodfellow et al. (2014); Madry et al. (2017), as a proactive defense, augments the training process with adversarial images so that the network learns the relevant knowledge. To produce these adversarial images, adversarial training employs a specific attack. Since different attacks have different preferences, the resulting model is vulnerable to unseen attacks. Ensembles are one way to amend this drawback: several sub-models learn from similar but different training sets, for instance by applying Gaussian noise to inputs and bootstrapping Strauss et al. (2017) to augment robustness. Incorporating randomness stabilizes the performance of ensemble models; Random Self-Ensemble (RSE) Liu et al. (2018) adds random noise layers to prevent strong gradient-based attacks.
Diversity is essential to ensembles. To increase the diversity of adversarial examples during training, ensemble adversarial training Tramèr et al. (2017) adds adversarial examples transferred from other pre-trained models. Ensemble-of-specialists Abbasi and Gagné (2017) multiplies adversarial examples targeted over different incorrect labels; this defense has been confirmed to be not robust enough He et al. (2017). Beyond data augmentation, Adaptive Diversity Promoting (ADP) Pang et al. (2019) and Diversity Training Kariyappa and Qureshi (2019) design regularizers to encourage diversity. However, these methods either fail to defend against strong attacks or are too expensive, because they need too many sub-models to achieve a decent diversity. So the question arises:
What is the advantageous diversity to improve the ensemble defenses against adversarial attacks?
This work investigates the answer to this question. Inspired by the transformation defenses, we train sub-models with different front filters, such as dimension reduction, color quantization, and frequency filters. A model trained on a particular front filter is sensitive to a specific type of distortion. These front filters distort adversarial contexts; at the same time, training with transformed data allows models to learn and maintain accuracy on them. We analyze the Pearson correlation coefficients among the models, as well as the performance of the models and their ensemble. We infer that sub-models with weakly correlated sensitivity constitute a more robust ensemble, and propose a simple and powerful defense framework for ensemble models based on this inference. Finally, the experimental results demonstrate that the proposed method improves the robustness of the network against adversarial examples.
In this section, we first state some basic notions of DNNs and the definition of DNN robustness in a local region. Then we recall the norm-based robustness region, as well as the Lipschitz constant of DNNs. After that, we give a brief introduction to a few existing attacks used in our experiments.
2.1 Deep Neural Network and Local Robustness
Our work concentrates on the image classification task. A DNN can be characterized as a function $f: \mathbb{R}^n \to \mathbb{R}^m$, which usually gives its prediction by maximizing the output vector, i.e., $\arg\max_i f_i(x)$, where $x \in \mathbb{R}^n$ represents an image. To optimize the network, we minimize the cost function $J(f(x), y)$, in which $y$ is the ground truth.
Intuitively, the local robustness of a DNN ensures the consistency of its behavior on a given input under certain perturbations, and a strict robustness condition ensures that there is no adversarial example around an input $x$. Formally, the local robustness of a DNN can be defined as below.
Definition 1 (DNN robustness).
Given a DNN $f$ and an input region $D$, we say that $f$ is (locally) robust in $D$ if for any $x_1, x_2 \in D$, we have $\arg\max_i f_i(x_1) = \arg\max_i f_i(x_2)$.
Typically, the region $D$ is defined as the neighborhood of an input, where $\ell_p$-norm balls are commonly used. In the case of the $\ell_p$-norm, the neighborhood of an input $x$ can be described as an $\ell_p$ ball: the (closed) ball with center $x$ and radius $r$ is defined as $B_p(x, r) = \{x' : \|x' - x\|_p \le r\}$.
Lipschitz constant of DNNs
The Lipschitz constant of a function measures the maximum ratio between variations in the output space and variations in the input space. In Ruan et al. (2018), DNNs are proved to be Lipschitz continuous; namely, there exists $L > 0$ s.t. for any $x_1, x_2$,
$$\|f(x_1) - f(x_2)\| \le L \, \|x_1 - x_2\|,$$
and here $L$ is called a Lipschitz constant of $f$. Generally, DNNs with a smaller Lipschitz constant are likely to be more robust.
2.2 Adversarial Attacks
Adversarial attacking methods attempt to find an imperceptible perturbation leading to misclassification; they are also regarded as testing methods for network robustness. We present several fundamental untargeted attacks widely used in the existing literature. Hereafter, we denote a potential adversarial example as $x'$, the adversarial perturbation as $\delta$ (so $x' = x + \delta$), and the gradient of the cost function with respect to the input as $\nabla_x J(f(x), y)$.
Fast Gradient Sign Method (FGSM)
FGSM (Goodfellow et al., 2014) simply uses the one-step gradient to generate the perturbation:
$$\delta = \epsilon \cdot \mathrm{sign}\big(\nabla_x J(f(x), y)\big).$$
It is the perturbation that maximizes the first-order approximation of the cost function under the constraint $\|\delta\|_\infty \le \epsilon$.
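As a sketch, FGSM fits in a few lines. Here a toy linear softmax classifier with an analytic cross-entropy gradient stands in for a DNN; the model, shapes, and names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def input_grad(W, x, y):
    """Gradient of the cross-entropy cost w.r.t. the input x for a toy
    linear model f(x) = W x (a stand-in for a DNN's backward pass)."""
    p = softmax(W @ x)
    p[y] -= 1.0          # dJ/dz = softmax(z) - onehot(y)
    return W.T @ p

def fgsm(W, x, y, eps):
    """One-step FGSM: x' = x + eps * sign(grad_x J(f(x), y))."""
    return x + eps * np.sign(input_grad(W, x, y))
```

The resulting perturbation always has $\ell_\infty$-norm exactly $\epsilon$ wherever the gradient is nonzero.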
Projected Gradient Descent (PGD)
PGD initializes $x^0 = x + \eta$ with a random perturbation $\eta$ and then iterates by progressing in the opposite direction of the gradient with stepsize $\alpha$. The accumulated distortion is projected onto an $\ell_p$-norm ball (Madry et al., 2017):
$$x^{t+1} = \Pi_{B_p(x, \epsilon)}\big(x^t - \alpha \, \nabla_x g(x^t)\big),$$
where $\Pi_{B_p(x, \epsilon)}$ denotes the projection, and $g$ is the attack objective function.
The $\ell_p$-norm ball used for projection is $B_p(x, \epsilon)$, centered at $x$ with radius $\epsilon$. Again, the attack does not end when $x^t$ hits the boundary of the ball for the first time; it continues and seeks to minimize the objective function while remaining on the sphere.
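A minimal PGD sketch under the same toy-linear-model assumption, descending a margin objective with raw gradients and projecting onto the $\ell_\infty$ ball; the choice of objective and every name here is illustrative:

```python
import numpy as np

def margin(W, x, y):
    """Attack objective g: true-class score minus the best other score;
    driving it below zero flips the prediction of f(x) = W x."""
    z = W @ x
    return z[y] - np.max(np.delete(z, y))

def margin_grad(W, x, y):
    """Analytic gradient of the margin objective for the linear model."""
    z = W @ x
    j = int(np.argmax(np.delete(z, y)))
    j = j if j < y else j + 1        # index in the original score vector
    return W[y] - W[j]

def pgd(W, x, y, eps, alpha, steps, seed=0):
    rng = np.random.default_rng(seed)
    xt = x + rng.uniform(-eps, eps, size=x.shape)   # random start in the ball
    for _ in range(steps):
        xt = xt - alpha * margin_grad(W, xt, y)     # descend the objective
        xt = x + np.clip(xt - x, -eps, eps)         # project onto the eps-ball
    return xt
```

After enough steps the iterate typically sits on the boundary of the ball while the objective keeps shrinking.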
Basic Iterative Method (BIM)
BIM (Kurakin et al., 2016), an iterative version of FGSM, employs the sign of the network gradients iteratively with stepsize $\alpha$ to update the adversarial perturbation. The principle of BIM is similar to an $\ell_\infty$-norm version of PGD: all pixels in the adversarial example are clipped into the range $[x - \epsilon, x + \epsilon]$, i.e., adversarial perturbations are kept within an $\ell_\infty$-norm ball with radius $\epsilon$. The main differences are that PGD utilizes a random initialization and uses the gradients directly.
Backward Pass Differentiable Approximation (BPDA)
BPDA (Athalye et al., 2018) allows attackers to generate adversarial perturbations targeted at the network with its defenses as a whole. BPDA approximates derivatives by computing the forward pass normally and computing the backward pass using a differentiable approximation of the defense function. For instance, if it is impossible to calculate gradients through a transformation, BPDA generates adversarial examples by including the transformation during the forward pass and replacing it with an identity function during the backward pass, under the assumption that the transformation output is close to the original input.
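A sketch of a single BPDA step, again with a toy linear model. The defense here is a non-differentiable rounding transform, approximated as the identity on the backward pass; all names are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def model_input_grad(W, z, y):
    """Cross-entropy gradient w.r.t. the model input for f(z) = W z."""
    p = softmax(W @ z)
    p[y] -= 1.0
    return W.T @ p

def bpda_step(W, transform, x, y, alpha):
    """Forward pass includes the (possibly non-differentiable) defense;
    backward pass approximates d(transform)/dx by the identity."""
    z = transform(x)                  # forward: defended input
    g = model_input_grad(W, z, y)     # backward: gradient taken at z, passed
    return x + alpha * np.sign(g)     # through as if transform were identity
```

With `transform = np.rint`, ordinary backpropagation would yield zero gradient almost everywhere; the identity approximation restores a usable attack direction.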
In this section, we state the structure of our ensemble framework. A filter is an image transformation that extracts some important features of the original image. We embed a filter in each sub-model as the core component, which provides the diversity of sub-models. Then we analyze the relationship between the correlations of the filters and the local robustness, which induces a principle of choosing the optimal filter combination. This improves the ensemble defense against adversarial attacks.
3.1 Filter-based Ensemble
In our framework, an input image is pre-processed by three different filters, and the obtained results are then respectively fed into three DNNs, which classify them and output classification labels individually. In the end, the results of the sub-models are combined by a voting mechanism. Formally, a certain sub-model applies a front filter, denoted by $g^{(i)}$, on the original input, and the DNN model $f^{(i)}$ follows the filter. Then an ensemble model $E$ with $k$ sub-models can be expressed as
$$E(x) = \mathrm{vote}\big(f^{(1)}(g^{(1)}(x)), \ldots, f^{(k)}(g^{(k)}(x))\big),$$
where the function $\mathrm{vote}$ outputs the mode of the results of the sub-models, i.e., the classification label which appears the most times.
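The voting ensemble above can be sketched as follows, with models and filters as arbitrary callables (the names are illustrative):

```python
import numpy as np
from collections import Counter

def ensemble_predict(models, filters, x):
    """Each sub-model classifies its own filtered view of x; the ensemble
    returns the mode of the predicted labels (ties go to the first seen)."""
    labels = [int(np.argmax(f(g(x)))) for f, g in zip(models, filters)]
    return Counter(labels).most_common(1)[0][0]
```

With three sub-models, a single fooled sub-model cannot change the ensemble output on a stable input, which is exactly the property Proposition 1 below relies on.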
We call an input stable for an ensemble model $E$ if the output labels of all the sub-models in $E$ are consistent. It is easy to obtain the following proposition.
Proposition 1.
The ensemble model cannot be attacked at a stable input $x$ by the perturbation $\delta$ if no two sub-models are attacked simultaneously at their respective inputs by the induced perturbations.
To defend against an adversarial attack, we propose to build a more robust ensemble model. Unlike network-based defenses like adversarial training, we do not focus on training skills for the sub-models to improve robustness. On the contrary, our principle is to enhance the diversity of sub-models by extracting partial features of inputs using differentiated front filters. Since it is hard for an adversarial attack to effectively affect all sub-models at the same time, the ensemble model achieves better robustness from the diversity of front filters.
So, the key to establishing this ensemble model is choosing a proper filter combination that provides both accuracy and robustness. In the following, we explain how we obtain a more robust ensemble model through the relation among the filters.
3.2 Low Correlation Implies Strong Robustness
In this subsection, we give a theoretical description of the intuition that a low correlation between the sensitivities of two filters implies a more robust ensemble model, under the assumption that the filters are of high quality. This will guide us in choosing the optimal filter combination from the candidates.
For an input $x$ and a perturbation $\delta$, we define a function
$$S_g(x, \delta) = \|g(x + \delta) - g(x)\|_2 \qquad (4)$$
to measure the sensitivity of a filter $g$, i.e., the $\ell_2$-norm of the perturbation affecting the input of the DNN in the sub-model. Considering $S_g(x, \delta)$ as a random variable, we invoke the Pearson correlation coefficient to evaluate the correlation of the sensitivity of two filters, which is expressed as
$$\rho(S_{g_1}, S_{g_2}) = \frac{\mathbb{E}[S_{g_1} S_{g_2}] - \mathbb{E}[S_{g_1}]\,\mathbb{E}[S_{g_2}]}{\sigma_{S_{g_1}} \sigma_{S_{g_2}}}. \qquad (5)$$
We assume that the filters in the ensemble model are of high quality, which means that the interpretation of the difference between two images by each filter sincerely reflects their semantic difference in statistics, i.e., the random variables $S_{g_1}$ and $S_{g_2}$ are identically distributed. Under this assumption, equation (5) indicates that $\mathbb{E}[S_{g_1} S_{g_2}]$ is monotonically increasing w.r.t. $\rho(S_{g_1}, S_{g_2})$.
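The quantities in (4) and (5) can be estimated empirically; a numpy sketch, where the filter functions and sampling scheme are assumptions for illustration:

```python
import numpy as np

def sensitivity(g, x, delta):
    """S_g(x, delta) = ||g(x + delta) - g(x)||_2, the magnitude of the
    perturbation that survives the filter g (equation (4))."""
    return np.linalg.norm(g(x + delta) - g(x))

def filter_correlation(g1, g2, samples):
    """Pearson correlation of the two filters' sensitivities (equation (5)),
    estimated over sampled (input, perturbation) pairs."""
    s1 = np.array([sensitivity(g1, x, d) for x, d in samples])
    s2 = np.array([sensitivity(g2, x, d) for x, d in samples])
    return np.corrcoef(s1, s2)[0, 1]
```

Two filters whose sensitivities are proportional yield a coefficient of exactly one, while filters reacting to disjoint perturbation types score near zero.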
For a certain DNN $f$ and an input $x$ classified into label $l$, we define the score difference by $\mathrm{diff}_f(x) = f_l(x) - \max_{i \neq l} f_i(x)$. Then the robust radius at an input can be estimated according to the following lemma.
Lemma 1 (Yang et al. (2021)).
Consider a DNN defined by $f$, whose Lipschitz constant is $L$. Then for an input $x$, the DNN is robust in $B(x, r)$ with $r = \mathrm{diff}_f(x) / (2L)$.
Then, we can infer that two DNNs $f_1$ and $f_2$ cannot be attacked simultaneously at $x_1$ and $x_2$ by the perturbations $\delta_1$ and $\delta_2$ respectively, if
$$\|\delta_1\| \cdot \|\delta_2\| < \frac{\mathrm{diff}_{f_1}(x_1)}{2L_1} \cdot \frac{\mathrm{diff}_{f_2}(x_2)}{2L_2}. \qquad (6)$$
In our framework, $x_1 = g_1(x)$ and $x_2 = g_2(x)$ are the inputs processed by two filters, and $\|\delta_1\| = S_{g_1}(x, \delta)$ and $\|\delta_2\| = S_{g_2}(x, \delta)$ are the sensitivities calculated by (4). It is clear that the right part in the inequality (6) is determined by the structure and parameters of the DNNs, while the left part is determined by the front filters.
The expectation $\mathbb{E}[S_{g_1} S_{g_2}]$ is a statistical description of the left-hand item in (6): a small expectation implies that the value of $S_{g_1}(x, \delta) \cdot S_{g_2}(x, \delta)$ tends to be small statistically. Consequently, by combining (5), (6) and Proposition 1, we infer that a low correlation of the sensitivity among filters implies strong robustness of ensemble models. This leads to our principle of ‘minimum correlation coefficients’ for choosing filter combinations, i.e., to optimize the robustness of our ensemble model, we choose the filters among which the correlation is the weakest.
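The quantities in Lemma 1 and inequality (6) are cheap to compute from model outputs; a sketch, assuming the radius takes the form $\mathrm{diff}_f(x)/(2L)$ and treating the margin and Lipschitz values as given placeholders:

```python
import numpy as np

def score_diff(scores):
    """diff_f(x): top score minus the best competing score."""
    s = np.sort(np.asarray(scores))[::-1]
    return s[0] - s[1]

def robust_radius(scores, lipschitz):
    """Lemma 1 sketch (assumed form): no perturbation with norm below
    diff_f(x) / (2 L) can change the predicted label."""
    return score_diff(scores) / (2.0 * lipschitz)

def cannot_both_be_attacked(d1, d2, scores1, L1, scores2, L2):
    """Inequality (6) sketch: if the product of perturbation magnitudes is
    below the product of robust radii, at least one sub-model keeps its
    label, so the two cannot be fooled simultaneously."""
    return d1 * d2 < robust_radius(scores1, L1) * robust_radius(scores2, L2)
```

The product form is what connects (6) to the expectation $\mathbb{E}[S_{g_1} S_{g_2}]$: if the product is below the threshold, at least one factor is below its own radius.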
Note that the cosine similarity between two filters is also a measure of their correlation. However, we only consider the sensitivity from the perspective of the magnitudes of the perturbation vectors generated by filters: since the gradient of an entire sub-model depends on both the filter and the DNN, we do not analyze perturbation directions without considering the following DNNs.
3.3 Filter Candidates
The original image is pre-processed by a filter before it is sent to the network. Therefore, some information is discarded and the overall entropy of the image is reduced. The filter can also be regarded as a manual feature extraction procedure that extracts the most important features benefiting the task. It is generally harder to attack the filtered image because there is less information for the attacking methods to utilize. The filters we use fall into the following four classes.
The easiest way to reduce the entropy of an image is to reduce its dimensionality. Color pictures have three dimensions, i.e., length, width, and color channels. It is simple to reduce the first two dimensions by downsizing an image, and a grayscale transformation can compress the color channels into one grayscale channel. Generally, downsizing and grayscale transformation preserve the overview of the original image with a certain loss of details. Bilinear interpolation is used in the downsizing filters. We use the ITU-R BT.601 luma transform for the grayscale filter (BT.601: Studio encoding parameters of digital television for standard 4:3 and wide screen 16:9 aspect ratios, https://www.itu.int/rec/R-REC-BT.601-7-201103-I/en).
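Both dimension-reduction filters can be sketched with numpy; the BT.601 luma weights are standard, while the array shapes and function names are illustrative assumptions:

```python
import numpy as np

def grayscale_bt601(img):
    """ITU-R BT.601 luma: Y = 0.299 R + 0.587 G + 0.114 B.
    img: H x W x 3 array in [0, 255]; returns an H x W array."""
    return img @ np.array([0.299, 0.587, 0.114])

def downsize_bilinear(img, out_h, out_w):
    """Minimal bilinear downsizing sketch for an H x W x C image."""
    h, w = img.shape[:2]
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = ys - y0, xs - x0
    top = img[y0][:, x0] * (1 - wx)[None, :, None] + img[y0][:, x1] * wx[None, :, None]
    bot = img[y1][:, x0] * (1 - wx)[None, :, None] + img[y1][:, x1] * wx[None, :, None]
    return top * (1 - wy)[:, None, None] + bot * wy[:, None, None]
```

In practice an image library's resize routine would replace the hand-rolled interpolation; the sketch only makes the filter's information loss explicit.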
Another way to reduce the complexity of an image is to reduce its number of colors. A CIFAR-10 image has a size of $32 \times 32 \times 3$, so it may have at most $32 \times 32 = 1024$ distinct colors, but it is still recognizable using far fewer. The fast octree algorithm Gervautz and Purgathofer (1988) is adopted to reduce the colors of the image. A full-size octree with a depth of seven can be used to partition the RGB color space: the octree subdivides the colorspace into eight octants recursively, and each leaf node represents an individual color. The fast octree algorithm builds the tree according to the colors of a given image and merges leaf nodes when the number of colors overflows.
In digital image processing, frequency filters are commonly applied to extract useful features from pictures. The high-frequency features are usually the noise and the details of the original image, and the low-frequency features are often its overview. The high-pass filters suppress the low-frequency features, and the low-pass filters do the opposite. Our low-pass and high-pass filters are based on the discrete Fourier Transform. Via shifting the low-frequency part to the center of the spectrum and multiplying it by a Gaussian mask (high-pass mask) element-wise, we obtain the low-pass (high-pass) filtered image.
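The frequency filters follow the standard FFT recipe; a low-pass sketch for a single-channel image, where the Gaussian mask width `sigma` is a free parameter we introduce for illustration:

```python
import numpy as np

def lowpass_fft(img, sigma):
    """Low-pass filter via the FFT: shift the DC component to the center of
    the spectrum, multiply by a Gaussian mask, and invert. A high-pass
    filter would use (1 - mask) instead."""
    h, w = img.shape
    F = np.fft.fftshift(np.fft.fft2(img))
    yy, xx = np.ogrid[:h, :w]
    d2 = (yy - h // 2) ** 2 + (xx - w // 2) ** 2
    mask = np.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian low-pass mask
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))
```

A constant image passes through unchanged, since all of its energy sits at the DC component where the mask equals one.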
Inputs to a DNN can be any real value, while the 8-bit RGB color model takes integer values in the range $[0, 255]$. To keep the practical meaning of an input, each real number is approximated by its closest integer. This is essential since a DNN for image classification should receive actual image data instead of arbitrary inputs, and when we use iterative methods to attack the DNN, discretization helps generate practical adversarial examples. In our ensemble model, every sub-model trained with the original data is equipped with a discretization filter.
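The discretization filter itself is a one-liner:

```python
import numpy as np

def discretize(x):
    """Round each value to the nearest integer and clip into the 8-bit
    range [0, 255], so perturbed inputs remain realizable images."""
    return np.clip(np.rint(x), 0, 255)
```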
4 Experimental Evaluation
In this section, we present experimental results to support our inference and our method. We first give a brief introduction to the experimental settings, and then present our study of the correlation between filters from Section 3.2. We measure the robustness of our models with respect to transferability, by calculating the accuracy of our models on adversarial examples produced by attacking the original network. In the end, two front filters are chosen by ‘minimum correlation coefficients’ to constitute the ensemble defense, and we compare its robustness with adversarial training. We generate adversarial examples under various attacks implemented in FoolBox Rauber et al. (2017, 2020), version 3.3.1, released under the MIT license. All experiments are conducted on a Windows 10 laptop with an Intel i7-9750H, a GTX 2060, and 16 GB RAM.
Dataset and Network
We train our models on the CIFAR-10 Krizhevsky et al. (2009) dataset under the MIT license. The CIFAR-10 dataset contains 60,000 RGB images in total with ten exclusive classes. We train our ResNet18 He et al. (2016) models on the 50,000 training images and test them on the 10,000 testing images. We use the stochastic gradient descent optimizer for training, decreasing the learning rate in stages.
4.1 The Optimal Filter Combination for Ensemble
We analyze the Pearson correlation coefficients for each pair of the candidate filters and pick the least correlated filters for the ensemble with the minimum correlation coefficients. We also evaluate the robustness of sub-models on adversarial examples produced on the original network.
Statistical Correlation Analysis for Filters
We apply random noise to images picked from the test set and evaluate the Pearson correlation coefficient according to (5). As reported in Figure 2, the correlation coefficient between the high-pass filtered data and the original inputs is the largest, and the grayscale filtered data ranks second to the original inputs. This indicates that adversarial examples produced on the original network transfer easily to the sub-models with these filters. The downsizing filter strongly correlates with the low-pass filter, implying that they are probably deceived by the same adversarial examples. The color reduction filter shows little correlation to the low-pass filter, and both have a relatively low correlation to the original data. According to the minimum correlation coefficients, the robust ensemble model includes the original network and the two sub-models trained with the low-pass filter and the color reduction filter.
Transfer-based Attack Analysis
We generate adversarial examples against the original network by FGSM and PGD attacks with different values of the attacking radius and test them on the sub-models trained with filtered data. In Figure 3, the sub-models with the low-pass, color quantization, and downsizing filters perform better than the original network against both FGSM and PGD attacks, and their accuracy remains high under the FGSM attack. The PGD attack is more powerful against the original network and drops its accuracy to nearly zero. The sub-model with the downsizing filter has the lowest accuracy among the three sub-models under the PGD attack. However, the sub-models with the grayscale and the high-pass filters are vulnerable to the transfer-based attack: they have lower accuracy than the original network under the FGSM attack. This is consistent with our analysis in Section 3.2, which suggests that these two filters should not be part of the ensemble.
[Figure 3: (a) FGSM, (b) PGD]
4.2 Comparison with Different Ensemble Methods
In this section, we compare the adversarial accuracy of our ensemble model with different ensemble models. The details of each ensemble model are as follows:
Minimum correlated Ensemble According to statistical correlation analysis for filters, we choose the color reduction filter and the low-pass filter for the ensemble, which have the lowest correlation. The ensemble includes the original network to maintain state-of-the-art accuracy on clean data.
Maximum correlated Ensemble The worst-case situation suggested by statistical correlation analysis is to constitute the ensemble with the original network and the two sub-models with the high-pass and grayscale filters, whose correlation is the highest as shown in Figure 2.
We compare the minimum correlated model with the Gaussian noise ensemble model and the maximum correlated ensemble model. We choose the BPDA attack based on BIM to attack the ensemble models, using the sum of the gradients of the sub-models to attack each ensemble model as a whole, with a fixed number of iterations and step size. Hereafter, the vote-based ensemble follows the voting mechanism described in Section 3, and the score-based ensemble outputs the class with the maximum average score.
According to Figure 4(a), across the tested disturbance sizes, the score-based accuracy of the minimum correlated ensemble model is consistently higher than both the Gaussian noise model and the maximum correlated model. The minimum correlated ensemble model also has higher vote-based adversarial accuracy than the Gaussian noise ensemble model and the maximum correlated ensemble model. This agrees with the previous analysis that an ensemble model with less correlated sub-models obtains better adversarial robustness.
Figure 4(b) demonstrates the accuracy of sub-models when the ensemble model is attacked as a whole. When we attack the Gaussian noise ensemble model, the accuracies of its sub-models stay close to one another, and its three sub-models decrease in a similar pattern. However, the sub-models of the minimum correlated ensemble model perform differently: the accuracy of the sub-model with the low-pass filter stabilizes at larger radii, while the accuracy of the sub-model with the 16-color filter and that of the original network decrease with a relatively large accuracy gap. This justifies that our minimum correlated ensemble model improves robustness against adversarial attacks by introducing advantageous diversity.
4.3 Comparison with Adversarial Training
Adversarial training is one of the most effective methods to improve the robustness of a DNN. We use the method proposed in Shafahi et al. (2019) and compare the robustness of our minimum correlated ensemble model with adversarial training. The adversarial training procedure uses a fixed number of iterations and maximal perturbation size.
Comparison with a single Adversarial Training Model
We first compare our ensemble model with a single adversarially trained model. The grey line in Figure 5(a) shows the adversarial robustness of a single adversarially trained model. Our ensemble model has better adversarial accuracy under all perturbations, and the score-based accuracy of our method is notably higher than the single adversarially trained model. It is worth highlighting that no sub-model in our ensemble has any defense mechanism acting on the network itself. In other words, our ensemble-based defense can build robust models competing with adversarial training without manipulating the network.
Comparison with Adversarial Training Ensembles
We compare our method with the ensemble of three independent adversarially trained models. The orange lines in Figure 5(a) show the performance of the ensemble model using adversarial training. Remarkably, our score-based ensemble model has better accuracy than its counterpart with adversarial training when the perturbation size is large. Meanwhile, our vote-based ensemble is very close to the one with adversarial training. Our ensemble model thus provides a defense comparable to the ensemble of adversarially trained sub-models.
The orange lines in the right part of Figure 5(b) depict the accuracy of the adversarially trained sub-models when the ensemble model is attacked as a whole. Compared with the sub-models trained with Gaussian noise in Figure 4, adversarial training does not significantly improve the diversity between sub-models: the accuracy difference between sub-models remains relatively small, which means the attacking methods can affect different sub-models simultaneously.
Since adversarial training works on the network level and our method works on the data level, it is natural to combine the two. We demonstrate our ensemble model with adversarial training in Figure 5(a) using the cyan lines. Our ensemble model with adversarial training reaches better robustness in both score-based and vote-based settings, and the vote-based ensemble model maintains high accuracy even at large perturbation sizes. Conclusively, we build an ensemble model with high adversarial robustness using both our filter-based defense and adversarial training.
In this work, we investigate the advantageous diversity of the ensemble model against adversarial attacks. By studying the robustness of ensemble DNNs and the Pearson correlation coefficient among models trained with filters, we propose the "minimum correlation coefficients" principle for choosing filters, which is instrumental in building the ensemble defense.
Beyond existing ensemble defenses, we consider the diversity of ensemble models from a new perspective. We obtain the diversity from the filtered training data and confirm it experimentally. We observe that our ensemble model, which uses no adversarial information, is more robust against adversarial attacks than adversarially trained models.
Our discovery not only contributes to proposing a decent robust ensemble model but also supplies data diversity. As our future work, it is interesting to study further how much robustness we could gain from data diversity and model ensemble. We are also considering extending our framework to larger datasets like ImageNet and training our sub-models using different network structures.
- Robustness to adversarial examples through an ensemble of specialists. arXiv preprint arXiv:1702.06856. Cited by: §1.
- Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420. Cited by: §2.2.
- Thermometer encoding: one hot way to resist adversarial examples. In Proceedings of International Conference on Learning Representations (ICLR), Cited by: §1.
- Towards evaluating the robustness of neural networks. In Proceedings of the IEEE Symposium on Security and Privacy (SP), Cited by: §1.
- Improving adversarial robustness via guided complement entropy. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4881–4889. Cited by: §1.
- A simple method for color quantization: octree quantization. In New Trends in Computer Graphics, pp. 219–231. Cited by: §3.3.
- Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §1, §2.2.
- Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117. Cited by: §1, §1.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §4.
- Adversarial example defense: ensembles of weak defenses are not strong. In Proceedings of 11th USENIX workshop on offensive technologies (WOOT 17), Cited by: §1.
- Improving adversarial robustness of ensembles with diversity training. arXiv preprint arXiv:1901.09981. Cited by: §1.
- Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.
- Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533. Cited by: §2.2.
- Towards robust neural networks via random self-ensemble. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 369–385. Cited by: §1.
- Feature distillation: dnn-oriented jpeg compression against adversarial examples. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 860–868. Cited by: §1.
- A unified gradient regularization family for adversarial examples. In Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 301–309. Cited by: §1.
- Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §1, §1, §1, §2.2.
- Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
- Divide, denoise, and defend against adversarial attacks. arXiv preprint arXiv:1802.06806. Cited by: §1.
- Improving adversarial robustness via promoting ensemble diversity. In Proceedings of International Conference on Machine Learning (ICML), pp. 4970–4979. Cited by: §1.
- Deflecting adversarial attacks with pixel deflection. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 8571–8580. Cited by: §1.
- Barrage of random transforms for adversarially robust defense. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 6528–6537. Cited by: §1.
- Foolbox: a python toolbox to benchmark the robustness of machine learning models. In Proceedings of International Conference on Machine Learning (ICML), Reliable Machine Learning in the Wild Workshop. Cited by: §4.
- Foolbox native: fast adversarial attacks to benchmark the robustness of machine learning models in pytorch, tensorflow, and jax. Journal of Open Source Software 5 (53), pp. 2607. Cited by: §4.
- Reachability analysis of deep neural networks with provable guarantees. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), pp. 2651–2659. Cited by: §2.1.
- Adversarial training for free! arXiv preprint arXiv:1904.12843. Cited by: §4.3.
- Ensemble methods as a defense to adversarial perturbations against deep neural networks. arXiv preprint arXiv:1709.03423. Cited by: §1, 3rd item.
- Adversarial defense by stratified convolutional sparse coding. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
- Ensemble adversarial training: attacks and defenses. arXiv preprint arXiv:1705.07204. Cited by: §1.
- Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991. Cited by: §1.
- Feature squeezing: detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155. Cited by: §1, §1.
- Enhancing robustness verification for deep neural networks via symbolic propagation. To appear in Formal Aspects of Computing. Cited by: Lemma 1.