Despite a large amount of work on adversarial robustness, many fundamental problems remain open. One of the challenges is to end the long-standing arms race between adversarial defenders and attackers: defenders design empirically robust algorithms which are later exploited by new attacks designed to undermine those defenses (Athalye et al., 2018). This motivates the study of certified robustness, i.e., algorithms that are provably robust to worst-case attacks, among which random smoothing has received significant attention in recent years. Algorithmically, random smoothing takes a base classifier as input, and outputs a smoothed classifier by repeatedly adding i.i.d. noise to the input examples and outputting the most-likely class. Random smoothing has many appealing properties that one could exploit: it is agnostic to network architecture, is scalable to deep networks, and perhaps most importantly, achieves state-of-the-art certified robustness for deep-learning-based classifiers (Cohen et al., 2019; Li et al., 2019; Lecuyer et al., 2019).
Open problems in random smoothing. Given the rotation invariance of the Gaussian distribution, most positive results for random smoothing have focused on the ℓ2 robustness achieved by smoothing with the Gaussian distribution (see Theorem 1). However, the existence of a noise distribution for general ℓp robustness has been posed as an open question by Cohen et al. (2019):
We suspect that smoothing with other noise distributions may lead to similarly natural robustness guarantees for other perturbation sets such as general ℓp norm balls.
Several special cases of the conjecture have been proven for p ≤ 2: Li et al. (2019) show that ℓ1 robustness can be achieved with the Laplacian distribution, and Lee et al. (2019) show that ℓ0 robustness can be achieved with a discrete distribution. Much remains unknown concerning the case when p > 2. On the other hand, the most standard threat model for adversarial examples is ℓ∞ robustness, among which 8-pixel and 16-pixel attacks have received significant attention in the computer vision community (i.e., the adversary can change every pixel by at most 8 or 16 intensity values, respectively). In this paper, we derive lower bounds on the magnitude of noise required for certifying ℓp robustness that highlight a phase transition at p = 2. In particular, for p > 2, the noise that must be added to each feature of the input examples grows with the dimension d in expectation, while it can be constant for p ≤ 2.
Preliminaries. Given a base classifier f and smoothing distribution D, the randomly smoothed classifier g is defined as follows: for each class c, define the score of class c at point x to be S_c(x) = Pr_{z∼D}(f(x + z) = c). Then the smoothed classifier outputs the class with the highest score: g(x) = argmax_c S_c(x).
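The definition above can be sketched as a Monte Carlo procedure (a minimal illustration, not the paper's implementation; `base_classifier`, `sigma`, and `n_samples` are illustrative names, and Gaussian noise is one possible choice of D):

```python
import random
from collections import Counter

def smoothed_predict(base_classifier, x, sigma=0.25, n_samples=1000, seed=0):
    """Monte Carlo estimate of g(x) = argmax_c Pr_{z ~ N(0, sigma^2 I)}(f(x+z) = c).

    `base_classifier` maps a feature vector to a class label; the noise
    distribution D is a modeling choice (isotropic Gaussian here).
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[base_classifier(noisy)] += 1
    # Return the most frequent (plurality) class among the noisy predictions.
    return votes.most_common(1)[0][0]

# Toy base classifier: the sign of the first coordinate.
f = lambda v: 1 if v[0] >= 0 else 0
```

With a small noise level, points far from the decision boundary keep their label under smoothing, which is the intuition behind the certification results below.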
The key property of smoothed classifiers is that the scores S_c change slowly as a function of the input point x (the rate of change depends on D). It follows that if there is a gap between the highest and second highest class scores at a point x, the smoothed classifier must be constant in a neighborhood of x. We denote the score gap by γ(x) = S_{c1}(x) − S_{c2}(x), where c1 = g(x) is the top class at x and c2 is the runner-up class.
Definition 1 ((S, γ)- and (ℓp, r, γ)-robustness).
For any set S ⊆ R^d and γ ∈ (0, 1], we say that the smoothed classifier g is (S, γ)-robust if for all x with γ(x) ≥ γ, we have that g(x + v) = g(x) for all v ∈ S. For a given ℓp norm, we also say that g is (ℓp, r, γ)-robust if it is (S, γ)-robust with S = {v ∈ R^d : ‖v‖p ≤ r}.
When the base classifier f and the smoothing distribution D are clear from context, we will simply write g, S_c, and γ(x). We often refer to a sample z from the distribution D as noise, and use noise magnitude to refer to a norm ‖z‖ of a noise sample. Finally, we use D + v to denote the distribution of z + v, where z ∼ D.
1.1 Our results
Our main results derive lower bounds on the magnitude of noise sampled from any distribution D that leads to (r, γ)-robustness with respect to ℓp for all possible base classifiers f. A major strength of randomized smoothing is that it provides certifiable robustness guarantees without making any assumptions on the base classifier f. For example, the results of Cohen et al. (2019) imply that using a Gaussian smoothing distribution with standard deviation σ = r/Φ⁻¹((1 + γ)/2) guarantees that g is (r, γ)-robust with respect to ℓ2 for every possible base classifier f. We show that there is a phase transition at p = 2: ensuring (r, γ)-robustness for all base classifiers with respect to ℓp norms with p > 2 requires that the noise magnitude grows non-trivially with the dimension d of the input space. In particular, for image classification tasks where the data is high dimensional and each feature is bounded in the range [0, 1], this implies that for sufficiently large dimensions, the necessary noise will dominate the signal in each example.
The following result, proved in Appendix A, shows that any distribution D that provides (S, γ)-robustness for every possible base classifier must be approximately translation-invariant to all translations v ∈ S. More formally, for every v ∈ S, we must have that the total variation distance between D and D + v, denoted by TV(D, D + v), is bounded by 1 − γ. The rest of our results will be consequences of this approximate translation-invariance property.

Lemma 1.1. Let D be a distribution on R^d with a density function such that for every (randomized) classifier f, the smoothed classifier g is (S, γ)-robust. Then for all v ∈ S, we have TV(D, D + v) ≤ 1 − γ.
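For the Gaussian case the approximate translation invariance can be made concrete: the total variation distance between N(0, σ²I) and its shift N(v, σ²I) has the standard closed form 2Φ(‖v‖2/(2σ)) − 1, which depends on v only through ‖v‖2. A small sketch (ours, for illustration):

```python
from statistics import NormalDist
from math import sqrt

def tv_shifted_gaussian(v, sigma):
    """TV distance between N(0, sigma^2 I) and N(v, sigma^2 I).

    Standard closed form: TV = 2*Phi(||v||_2 / (2*sigma)) - 1,
    where Phi is the standard Gaussian CDF.
    """
    norm_v = sqrt(sum(vi * vi for vi in v))
    return 2.0 * NormalDist().cdf(norm_v / (2.0 * sigma)) - 1.0
```

The TV distance grows from 0 (no shift) toward 1 (large shift), so the constraint TV(D, D + v) ≤ 1 − γ directly caps how large ‖v‖2 can be relative to σ.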
Lower bound on noise magnitude.
Our first result is a lower bound on the expected ℓ2 magnitude of a sample z ∼ D for any distribution D that is approximately invariant to ℓp translations of size r.
Theorem 1.1. Fix any p ≥ 1 and let D be a distribution on R^d such that there exists a radius r > 0 and a total variation bound ε ∈ [0, 1) satisfying that for all v with ‖v‖p ≤ r we have TV(D, D + v) ≤ ε. Then

E_{z∼D} ‖z‖2 ≥ r (d/2)^{1−1/p} · max{(1 − ε)/4, (1 − ε)²/(8ε)}.
As a consequence of Lemma 1.1, it follows that any distribution D that ensures (r, γ)-robustness with respect to ℓp for every base classifier must also satisfy the same lower bound with ε = 1 − γ.
Phase transition at p = 2.
The lower bound given by Theorem 1.1 implies a phase transition, occurring at p = 2, in the nature of distributions that are able to ensure (r, γ)-robustness with respect to ℓp. For p ≤ 2, the necessary expected ℓ2 magnitude of a sample from D grows only like √d, which is consistent with adding a constant level of noise to every feature of the input example (e.g., as would happen when using a Gaussian distribution with standard deviation σ = Θ(r)). On the other hand, for p > 2, the expected ℓ2 magnitude of a sample from D grows strictly faster than √d, which, intuitively, requires that the noise added to each component of the input example must grow with the input dimension d, rather than remaining constant as in the p ≤ 2 regime. More formally, we prove the following:
Theorem 1.2. Fix any p ≥ 1 and let D be a distribution on R^d such that for all v with ‖v‖p ≤ r we have TV(D, D + v) ≤ ε. Let z be a sample from D. Then at least 99% of the components z_i of z satisfy Var(z_i) = Ω((1 − ε)² r² d^{1−2/p}). Moreover, if D is a product measure of i.i.d. noise (i.e., D = ν^d), then the tail of ν cannot decay as fast as exp(−t^a) for any constant a > 0. In other words, ν is a heavy-tailed distribution.¹ (¹ A distribution is heavy-tailed if its tail is not an exponential function of t^a for any a > 0, i.e., it is neither sub-exponential nor sub-Gaussian (Vershynin, 2018).)
The phase transition at p = 2 is even more clearly evident from Theorem 1.2. In particular, the variance of most components of the noise must grow with d^{1−2/p}. Theorem 1.2 shows that any distribution that provides (r, γ)-robustness with respect to ℓp for p > 2 must have very high variance in most of its component distributions when the dimension is large. In particular, for p = ∞ the variance grows linearly with the dimension. Similarly, if we use a product distribution to achieve (r, γ)-robustness with respect to ℓp with p > 2, then each component of the noise distribution must be heavy-tailed and is likely to generate very large perturbations.
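The phase transition can be read off numerically. A small illustration (our numbers: the CIFAR-10 dimension d = 3072 and the common 8/255 radius; the d^{1/2−1/p} scaling is the point, and constant factors are suppressed):

```python
# Per-coordinate noise scale needed for l_p certification at radius r behaves
# like r * d**(0.5 - 1/p) up to constants: it does not grow with d for p <= 2,
# but grows like a positive power of d for p > 2.
d = 32 * 32 * 3          # CIFAR-10 input dimension
r = 8 / 255              # the common "8-pixel" l_inf radius

def per_coordinate_scale(p, r, d):
    exponent = 0.5 - (0.0 if p == float("inf") else 1.0 / p)
    return r * d ** exponent

scale_l2 = per_coordinate_scale(2, r, d)                # exactly r, constant in d
scale_linf = per_coordinate_scale(float("inf"), r, d)   # r * sqrt(d), about 1.7
```

For pixels bounded in [0, 1], a per-coordinate noise scale above 1 means the noise dominates the signal in each example, which is the practical reading of the hardness result.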
1.2 Technical overview
Total-variation bound of noise magnitude. Our results demonstrate a strong connection between the required noise magnitude in random smoothing and the total variation distance between D and its shifted distribution D + v in the worst-case direction v. The total variation distance has a very natural interpretation in terms of the hardness of testing D versus D + v: no test can distinguish a sample from D from a sample from D + v with success probability better than (1 + TV(D, D + v))/2.
Warm-up: one-dimensional case. We begin our analysis of Theorem 1.1 with the one-dimensional case, by studying the projection of the noise onto a direction u. Let ν be the distribution of the projected noise, let δ = TV(ν, ν + r), and let a = E|α| for α ∼ ν. A simple use of Markov's inequality implies a ≥ r(1 − δ)/4. To see this, let α be a sample from ν and let α′ = α + r so that α′ is a sample from ν + r. Define I = [−r/2, r/2) so that the intervals I and I + r are disjoint. From Markov's inequality, we have P(α ∉ I) ≤ P(|α| ≥ r/2) ≤ 2a/r. Similarly, P(α′ ∉ I + r) ≤ 2a/r and, since I and I + r are disjoint, this implies P(α′ ∈ I) ≤ 2a/r. Therefore, δ ≥ P(α ∈ I) − P(α′ ∈ I) ≥ 1 − 4a/r. The claim follows from rearranging this inequality.
The remainder of the one-dimensional case is to show E|α| ≥ r(1 − δ)²/(8δ). To this end, we exploit a nice property of the total variation distance in R: every interval of width r satisfies ν(I) ≤ δ. We note that for any w > 0, rearranging Markov's inequality gives P(|α| ≤ w) ≥ 1 − E|α|/w. We can cover the set [−w, w] using at most 2w/r + 1 intervals of width r and, by this property, each of those intervals has probability mass at most δ. It follows that 1 − E|α|/w ≤ (2w/r + 1)δ, implying E|α| ≥ w(1 − δ − 2δw/r). Finally, we optimize over w to obtain the bound E|α| ≥ r(1 − δ)²/(8δ), as desired.
Extension to the d-dimensional case. A bridge connecting the one-dimensional case with the d-dimensional case is the Pythagorean theorem: if there exists a set of Ω(d) orthogonal directions u_i such that ‖u_i‖p ≤ r and ‖u_i‖2 = Ω(r d^{1/2−1/p}) (the furthest ℓ2 distance to the origin within the ℓp ball of radius r), the Pythagorean theorem implies the result for the d-dimensional case straightforwardly. Such a set of orthogonal directions is easy to find for the case p = 2, because the ℓ2 ball is isotropic and any set of orthogonal bases of R^d satisfies the conditions. However, the problem is challenging for the case p ≠ 2, since the ℓp ball is not isotropic in general. In Corollary 7, we show that there exist at least d/2 directions u_i which satisfy the requirements. Using the Pythagorean theorem in the subspace spanned by these u_i's gives Theorem 1.1. This leads to only a constant-factor looseness of our bound: for certain distributions such as the Gaussian, our bound in Theorem 1.1 is tight up to a constant factor.
Peeling argument and tail probability. We now summarize our main techniques to prove Theorem 1.2. By ‖z‖∞ ≥ ‖z‖2/√d, Theorem 1.1 implies E‖z‖∞ = Ω((1 − ε) r d^{1/2−1/p}), which shows that at least one component of z is large. However, this guarantee only highlights the largest pixel of z. Rather than working with the ℓ∞ norm of z, we apply a similar argument to show that the variance of at least one component of z must be large. Next, we consider the (d − 1)-dimensional distribution obtained by removing the highest-variance feature. Applying an identical argument, the highest-variance remaining feature must also be large. Each time we repeat this procedure, the strength of the variance lower bound decreases since the dimensionality of the distribution is decreasing. However, we can apply this peeling strategy to any constant fraction of the components of z to obtain variance lower bounds. The tail-probability guarantee in Theorem 1.2 follows from a standard moment analysis as in (Vershynin, 2018).
Summary of our techniques. Our proofs, in particular the use of the Pythagorean theorem, show that defending against adversarial attacks in the ℓ∞ ball of radius r is almost as hard as defending against attacks in the ℓ2 ball of radius r√d. Therefore, the certification procedure of first using Gaussian smoothing to certify ℓ2 robustness and then dividing the certified radius by √d, as in (Salman et al., 2019), is an almost optimal random-smoothing approach for certifying ℓ∞ robustness. The principle might hold generally for other threat models beyond ℓ∞ robustness, and sheds light on the design of new random-smoothing schemes and on hardness proofs for other threat models broadly.
2 Related Works
ℓ2 robustness. Probably one of the most well-understood results for random smoothing is ℓ2 robustness. With Gaussian random noise, Lecuyer et al. (2019) and Li et al. (2019) provided the first guarantees for random smoothing, which were later improved by Cohen et al. (2019) with the following theorem.
Theorem 1 (Theorem 1 of Cohen et al. (2019)).
Let f be any deterministic or random classifier, and let z ∼ N(0, σ²I). Let g(x) = argmax_c Pr(f(x + z) = c). Suppose that c_A and p_A, p_B ∈ [0, 1] satisfy:

Pr(f(x + z) = c_A) ≥ p_A ≥ p_B ≥ max_{c ≠ c_A} Pr(f(x + z) = c).

Then g(x + v) = c_A for all ‖v‖2 < R, where

R = (σ/2) (Φ⁻¹(p_A) − Φ⁻¹(p_B)),

and Φ is the cumulative distribution function of the standard Gaussian distribution.
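Theorem 1's radius is straightforward to compute (a sketch using Python's standard-library normal quantile function; the function name is ours):

```python
from statistics import NormalDist

def cohen_l2_radius(sigma, p_a, p_b):
    """Certified l2 radius R = (sigma/2) * (Phi^{-1}(p_A) - Phi^{-1}(p_B))
    from Theorem 1; valid when 1 >= p_A >= p_B >= 0."""
    inv = NormalDist().inv_cdf
    return 0.5 * sigma * (inv(p_a) - inv(p_b))
```

Note that the radius is zero when p_A = p_B (no score gap) and grows with both the noise level σ and the gap between the top two class probabilities.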
Note that Theorem 1 holds for arbitrary classifiers f. Thus a hardness result for random smoothing (one in the opposite direction of Theorem 1) requires finding a hard instance of the classifier f such that either a conclusion similar to Theorem 1 does not hold, or the resulting smoothed classifier is trivial because the noise is too large. Our results in Theorems 1.1 and 1.2 are of the latter type. Beyond the top-1 prediction in Theorem 1, Jia et al. (2020) studied certified robustness for top-k predictions via randomized smoothing under Gaussian noise and derived a tight robustness bound in the ℓ2 norm. In this paper, however, we study the standard setting of top-1 predictions.
ℓp robustness. Beyond ℓ2 robustness, random smoothing also achieves state-of-the-art certified robustness for other ℓp norms. Lee et al. (2019) provided ℓ0 adversarial robustness guarantees and associated random-smoothing algorithms for the discrete case where the adversary is ℓ0 bounded. Li et al. (2019) suggested replacing Gaussian with Laplacian noise for ℓ1 robustness. Dvijotham et al. (2020) introduced a general framework for proving robustness properties of smoothed classifiers in the black-box setting using f-divergences. However, much remains unknown concerning the effectiveness of random smoothing for ℓp robustness with p > 2. Salman et al. (2019) proposed an algorithm for certifying ℓ∞ robustness, by first certifying ℓ2 robustness via the algorithm of Cohen et al. (2019) and then dividing the certified radius by √d. However, the certified ℓ∞ radius given by this procedure is as small as O(1/√d), in contrast to the constant certified ℓ2 radius, as discussed in this paper.
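The ℓ2-then-divide-by-√d procedure described above can be sketched as follows (our illustration of the idea, not the reference implementation):

```python
from math import sqrt
from statistics import NormalDist

def certified_linf_radius(sigma, p_a, p_b, d):
    """l_inf radius obtained by certifying l2 robustness (the radius formula
    of Cohen et al., 2019) and dividing by sqrt(d), following the procedure
    attributed to Salman et al. (2019)."""
    inv = NormalDist().inv_cdf
    r2 = 0.5 * sigma * (inv(p_a) - inv(p_b))
    return r2 / sqrt(d)
```

For CIFAR-10 (d = 3072) the certified ℓ∞ radius shrinks by a factor of √3072 ≈ 55 relative to the ℓ2 radius, which is why the certified ℓ∞ radii obtained this way are small.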
Training algorithms. While random smoothing certifies inference-time robustness for any given classifier f, the performance might vary a lot across different base classifiers. This motivates researchers to design new training algorithms for f that are particularly adapted to random smoothing. Zhai et al. (2020) trained a robust smoothed classifier by maximizing the certified radius. In contrast to the naturally trained classifiers in (Cohen et al., 2019), Salman et al. (2019) combined the adversarial training of Madry et al. (2017) with random smoothing in the training procedure of f. In our experiments, we introduce a new baseline which combines TRADES (Zhang et al., 2019) with random smoothing to train a robust smoothed classifier.
3 Analysis of Main Results
3.1 Analysis of Theorem 1.1
In this section we prove Theorem 1.1. Our proof has two main steps: first, we study the one-dimensional version of the problem and prove two complementary lower bounds on the magnitude of a sample drawn from a distribution ν over R with the property that TV(ν, ν + r) ≤ δ. Next, we show how to apply this argument to orthogonal one-dimensional subspaces of R^d to lower bound the expected magnitude of a sample drawn from a distribution D over R^d with the property that for all v with ‖v‖p ≤ r we have TV(D, D + v) ≤ ε.
One dimensional results.
Our first result lower bounds the magnitude of a sample from any distribution ν on R in terms of the total variation distance between ν and ν + r for any r > 0.
Lemma 2. Let ν be any distribution on R, α be a sample from ν, r > 0, and let δ = TV(ν, ν + r). Then we have E|α| ≥ r · max{(1 − δ)/4, (1 − δ)²/(8δ)}.
We prove Lemma 2 using two complementary lower bounds. The first lower bound is tighter for large δ, while the second lower bound is tighter when δ is close to zero. Taking the maximum of the two bounds proves Lemma 2.
Lemma 3. Let ν be any distribution on R, α be a sample from ν, r > 0, and let δ = TV(ν, ν + r). Then we have E|α| ≥ r(1 − δ)/4.
Let α′ = α + r so that α′ is a sample from ν + r, and define I = [−r/2, r/2) so that the sets I and I + r are disjoint. From Markov's inequality, we have that P(|α| ≥ r/2) ≤ 2E|α|/r. Further, since |α| < r/2 implies α ∈ I, we have P(α ∈ I) ≥ 1 − 2E|α|/r. Next, since I and I + r are disjoint and P(α′ ∈ I + r) = P(α ∈ I) ≥ 1 − 2E|α|/r, it follows that P(α′ ∈ I) ≤ 2E|α|/r. Finally, we have δ ≥ P(α ∈ I) − P(α′ ∈ I) ≥ 1 − 4E|α|/r. Rearranging this inequality proves the claim. ∎
Next, we prove a tighter bound when δ is close to zero. The key insight is that no interval of width r can have probability mass larger than δ under ν. This implies that the mass of ν cannot concentrate too close to the origin, leading to lower bounds on the expected magnitude of a sample from ν.
Lemma 4. Let ν be any distribution on R, α be a sample from ν, r > 0, and let δ = TV(ν, ν + r) > 0. Then we have E|α| ≥ r(1 − δ)²/(8δ).
The key step in the proof is to show that every interval of length r has probability mass at most δ under the distribution ν. Once we have established this fact, the proof is as follows: for any w > 0, rearranging Markov's inequality gives P(|α| ≤ w) ≥ 1 − E|α|/w. We can cover the set [−w, w] using at most 2w/r + 1 intervals of width r and each of those intervals has probability mass at most δ. It follows that 1 − E|α|/w ≤ (2w/r + 1)δ, implying E|α| ≥ w(1 − δ − 2δw/r). Finally, we optimize over w to get the strongest bound. The strongest bound is obtained at w = r(1 − δ)/(4δ), which gives E|α| ≥ r(1 − δ)²/(8δ).
It remains to prove the claim that all intervals of length r have probability mass at most δ. Let I be any such interval. The proof has two steps: first, we partition R using a collection of translated copies of the interval I, and show that the difference in probability mass between any pair of intervals in the partition is at most δ. Then, given that there must be intervals with probability mass arbitrarily close to zero, this implies that the probability mass of any interval in the partition (and in particular, the probability mass of I) is upper bounded by δ.
For each integer n, let I_n = I + nr be a copy of the interval I translated by nr. By construction the set of intervals I_n for n ∈ Z forms a partition of R. For any indices i < j, we can express the difference in probability mass between I_i and I_j as a telescoping sum: ν(I_j) − ν(I_i) = Σ_{n=i}^{j−1} (ν(I_{n+1}) − ν(I_n)). We will show that for any i < j, the telescoping sum is contained in [−δ, δ]. Let P ⊆ {i, …, j − 1} be the indices of the positive terms in the sum and let A = ∪_{n∈P} I_n. Then, since the telescoping sum is upper bounded by the sum of its positive terms and the intervals are disjoint, we have ν(I_j) − ν(I_i) ≤ Σ_{n∈P} (ν(I_{n+1}) − ν(I_n)) = ν(A + r) − ν(A). For all α we have α ∈ A if and only if α + r ∈ A + r, which implies (ν + r)(A + r) = ν(A). Combined with the definition of the total variation distance, it follows that ν(A + r) − ν(A) = ν(A + r) − (ν + r)(A + r) ≤ δ and therefore ν(I_j) − ν(I_i) ≤ δ. A similar argument applied to the negative terms of the telescoping sum guarantees that ν(I_j) − ν(I_i) ≥ −δ, proving that |ν(I_j) − ν(I_i)| ≤ δ.
Finally, for any ε′ > 0, there must exist an interval I_n such that ν(I_n) < ε′ (since otherwise the total probability mass of all the intervals would be infinite). Since no pair of intervals in the partition can have probability masses differing by more than δ, this implies that ν(I_m) ≤ δ + ε′ for any m. Taking the limit as ε′ → 0 shows that ν(I) ≤ δ, completing the proof. ∎
Finally, Lemma 2 follows from Lemma 3 and Lemma 4, together with the fact that for any δ ∈ (0, 1], the expected magnitude E|α| is lower bounded by the larger of the two bounds.
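As a sanity check of Lemma 2 as stated here, one can compare both sides in the Gaussian case, where E|α| and TV(ν, ν + r) have closed forms (a sketch under the assumption ν = N(0, σ²); function names are ours):

```python
from math import pi, sqrt
from statistics import NormalDist

def lemma2_lower_bound(r, delta):
    """r * max{(1-delta)/4, (1-delta)^2/(8*delta)}, as in Lemma 2."""
    return r * max((1 - delta) / 4, (1 - delta) ** 2 / (8 * delta))

def check_gaussian(sigma, r):
    # For nu = N(0, sigma^2): E|alpha| = sigma*sqrt(2/pi), and
    # TV(nu, nu + r) = 2*Phi(r/(2*sigma)) - 1 in closed form.
    expected_abs = sigma * sqrt(2 / pi)
    delta = 2 * NormalDist().cdf(r / (2 * sigma)) - 1
    return expected_abs >= lemma2_lower_bound(r, delta)
```

Sweeping σ and r shows the bound holds with a modest constant-factor gap, consistent with the claim that for Gaussians the bound is tight up to constants.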
Extension to the -dimensional case.
For the remainder of this section we turn to the analysis of distributions defined over R^d. First, we apply Lemma 2 to lower bound the magnitude of noise drawn from D when projected onto any one-dimensional subspace.
Lemma 5. Let D be any distribution on R^d, z be a sample from D, v ∈ R^d \ {0}, and let δ = TV(D, D + v). Then we have E|⟨z, v/‖v‖2⟩| ≥ ‖v‖2 · max{(1 − δ)/4, (1 − δ)²/(8δ)}.
Let z be a sample from D, z′ = z + v be a sample from D + v, and define α = ⟨z, v/‖v‖2⟩ and α′ = ⟨z′, v/‖v‖2⟩ = α + ‖v‖2. Then the total variation distance between the distributions of α and α′ is bounded by δ (projection cannot increase the total variation distance), and the distribution of α′ corresponds to a translation of the distribution of α by a distance ‖v‖2. Therefore, applying Lemma 2 with r = ‖v‖2, we have that E|α| ≥ ‖v‖2 · max{(1 − δ)/4, (1 − δ)²/(8δ)}, which proves the claim. ∎
Intuitively, Lemma 5 shows that for any vector v such that TV(D, D + v) is small, the expected magnitude of a sample z ∼ D when projected onto v can't be much smaller than the length of v. The key idea for proving Theorem 1.1 is to construct a large number of orthogonal vectors with small ℓp norms but large ℓ2 norms. Then z will have to be "spread out" in all of these directions, resulting in a large expected ℓ2 norm. We begin by showing that whenever the dimension d is a power of two, we can find an orthogonal basis for R^d contained in {−1, +1}^d.
Lemma 6. For any d = 2^k with k ≥ 0, there exist d orthogonal vectors u_1, …, u_d ∈ {−1, +1}^d.
The proof is by induction on k. For k = 0, we have d = 1 and the vector u_1 = (1) satisfies the requirements. Now suppose the claim holds for d = 2^k and let u_1, …, u_d be orthogonal vectors in {−1, +1}^d. For each i, define u′_i = (u_i, u_i) and u″_i = (u_i, −u_i) in {−1, +1}^{2d}. We will show that these 2d vectors are orthogonal. For any indices i and j, we can compute the inner products between pairs of vectors among u′_i, u′_j, u″_i, and u″_j: ⟨u′_i, u′_j⟩ = 2⟨u_i, u_j⟩, ⟨u″_i, u″_j⟩ = 2⟨u_i, u_j⟩, and ⟨u′_i, u″_j⟩ = ⟨u_i, u_j⟩ − ⟨u_i, u_j⟩ = 0. Therefore, for any i ≠ j, since ⟨u_i, u_j⟩ = 0, we are guaranteed that ⟨u′_i, u′_j⟩ = 0, ⟨u″_i, u″_j⟩ = 0, and ⟨u′_i, u″_j⟩ = 0. It follows that the vectors u′_1, …, u′_d, u″_1, …, u″_d are orthogonal. ∎
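The doubling step in this proof is the classical Sylvester construction of Hadamard matrices; a minimal sketch (function names are ours):

```python
def sylvester(k):
    """Orthogonal {-1,+1} vectors in dimension d = 2**k, built by the
    doubling step of Lemma 6: from each u, form (u, u) and (u, -u)."""
    vectors = [[1]]
    for _ in range(k):
        vectors = [u + u for u in vectors] + [u + [-x for x in u] for u in vectors]
    return vectors

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

For example, `sylvester(3)` returns 8 mutually orthogonal vectors in {−1, +1}^8, i.e., the rows of an 8×8 Hadamard matrix.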
From this, it follows that for any dimension d, we can always find a collection of at least d/2 vectors that are short in the ℓp norm, but long in the ℓ2 norm. Intuitively, these vectors are the vertices of a hypercube in a subspace of dimension at least d/2. Figure 1 depicts the construction.
Corollary 7. For any p > 0, radius r > 0, and dimension d ≥ 2, there exist k ≥ d/2 orthogonal vectors u_1, …, u_k ∈ R^d such that ‖u_i‖p ≤ r and ‖u_i‖2 ≥ r (d/2)^{1/2−1/p} for all i. This holds even when p = ∞ (with the convention 1/∞ = 0).
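A concrete instance of Corollary 7 for p = ∞ in dimension d = 4 (with an illustrative radius r = 0.5): scaling a Sylvester-Hadamard basis so that each vector has ℓ∞ norm exactly r yields orthogonal vectors of ℓ2 norm r√d.

```python
from math import isclose, sqrt

# Rows of a 4x4 Sylvester-Hadamard matrix, scaled so that each vector has
# l_inf norm exactly r; its l2 norm is then r*sqrt(d), matching Corollary 7
# with p = infinity.
r, d = 0.5, 4
hadamard4 = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]
vectors = [[r * x for x in row] for row in hadamard4]

linf_norms = [max(abs(x) for x in v) for v in vectors]
l2_norms = [sqrt(sum(x * x for x in v)) for v in vectors]
```

Here an ℓ∞ perturbation budget of r buys an ℓ2 displacement of r√d, which is exactly the gap the lower bound exploits.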
With this, we are ready to prove Theorem 1.1.
Let z be a sample from D. By scaling the vectors from Corollary 7, we obtain k ≥ d/2 orthogonal vectors u_1, …, u_k with ‖u_i‖p ≤ r and ‖u_i‖2 ≥ r(d/2)^{1/2−1/p}. By assumption we must have TV(D, D + u_i) ≤ ε, since ‖u_i‖p ≤ r, and Lemma 5 implies that E|⟨z, u_i/‖u_i‖2⟩| ≥ ‖u_i‖2 · M for each i, where M = max{(1 − ε)/4, (1 − ε)²/(8ε)}. We use this fact to bound E‖z‖2.
Let U ∈ R^{k×d} be the matrix whose i-th row is given by u_i/‖u_i‖2, so that P = UᵀU is the orthogonal projection matrix onto the subspace spanned by the vectors u_1, …, u_k. Then we have E‖z‖2 ≥ E‖Pz‖2 = E‖Uz‖2 ≥ (1/√k) E‖Uz‖1 = (1/√k) Σ_i E|⟨z, u_i/‖u_i‖2⟩| ≥ √k · r(d/2)^{1/2−1/p} · M, where the first inequality follows because orthogonal projections are non-expansive, the second inequality follows from the equivalence of the ℓ1 and ℓ2 norms, and the last inequality follows from Lemma 5. Using the fact that k ≥ d/2, we have E‖z‖2 ≥ √(d/2) · r(d/2)^{1/2−1/p} · M. Finally, since √(d/2) · (d/2)^{1/2−1/p} = (d/2)^{1−1/p}, we have E‖z‖2 ≥ r(d/2)^{1−1/p} · max{(1 − ε)/4, (1 − ε)²/(8ε)}, as required. ∎
3.2 Analysis of Theorem 1.2
In this section we prove the variance and heavy-tailed properties from Theorem 1.2 separately.
Combining Theorem 1.1 with a peeling argument, we are able to lower bound the marginal variance of most of the coordinates of z ∼ D.
Lemma 8. Fix any p ≥ 1 and let D be a distribution on R^d such that there exists a radius r and total variation bound ε ∈ [0, 1) so that for all v with ‖v‖p ≤ r we have TV(D, D + v) ≤ ε. Let z be a sample from D and π be the permutation of [d] such that Var(z_{π(1)}) ≥ Var(z_{π(2)}) ≥ ⋯ ≥ Var(z_{π(d)}). Then for any i ∈ [d], we have Var(z_{π(i)}) ≥ r² M² ((d − i + 1)/2)^{2−2/p} / (d − i + 1), where M = max{(1 − ε)/4, (1 − ε)²/(8ε)}.
For each index i ∈ [d], let P_i : R^d → R^{d−i+1} be the projection given by P_i(z) = (z_{π(i)}, z_{π(i+1)}, …, z_{π(d)}) and let D_i be the distribution of P_i(z). First we argue that for each i and any v ∈ R^{d−i+1} with ‖v‖p ≤ r, we must have TV(D_i, D_i + v) ≤ ε. To see this, let v′ ∈ R^d be the vector such that P_i(v′) = v and v′ is zero in the remaining coordinates. Then ‖v′‖p = ‖v‖p ≤ r. Next, since projections cannot increase the total variation distance, we must have TV(D_i, D_i + v) ≤ TV(D, D + v′) ≤ ε.
Now fix an index i, let d′ = d − i + 1, and let z(i) be a sample from D_i. Applying Theorem 1.1 to the centered distribution of z(i) − E[z(i)] (which satisfies the same total variation condition), we have that E‖z(i) − E[z(i)]‖2 ≥ r (d′/2)^{1−1/p} M. By Jensen's inequality, we have that Σ_j Var(z(i)_j) = E‖z(i) − E[z(i)]‖2² ≥ (E‖z(i) − E[z(i)]‖2)² ≥ r² M² (d′/2)^{2−2/p}. Since there must exist at least one index j whose variance is at least the average, it follows that at least one coordinate must satisfy Var(z(i)_j) ≥ r² M² (d′/2)^{2−2/p} / d′. Finally, since the coordinates of z(i) are the d′ coordinates of z with smallest variance, it follows that Var(z_{π(i)}) ≥ r² M² (d′/2)^{2−2/p} / d′, as required. ∎
Lemma 8 implies that any distribution D over R^d such that for all v with ‖v‖p ≤ r we have TV(D, D + v) ≤ ε must have high marginal variance in most of its coordinates. In particular, for any constant c ∈ (0, 1), the top (1 − c)-fraction of coordinates (by variance) must have marginal variance at least r² M² (cd/2)^{2−2/p} / (cd). For p > 2, this bound grows with the dimension d. Our next lemma shows that when D is a product measure of an i.i.d. one-dimensional distribution ν in the standard coordinates, the distribution ν must be heavy-tailed.
Lemma 9. Fix any a > 0 and t_0 > 0, and let X_1, …, X_d be random variables in R sampled i.i.d. from a distribution ν. Then the statement "Pr(|X_1| ≥ t) ≤ exp(−t^a) for all t ≥ t_0" implies "Var(X_1) ≤ C for a constant C = C(a, t_0) that does not depend on d". Since Lemma 8 forces the marginal variance to grow with d when p > 2, in sufficiently high dimensions ν must violate every such tail bound; that is, ν is a heavy-tailed distribution.
Denote by F̄(t) = Pr(|X_1| ≥ t) the complementary cumulative distribution function of |X_1|. We only need to show that F̄(t) ≤ exp(−t^a) for all t ≥ t_0 implies a dimension-independent bound on Var(X_1). We note that

Var(X_1) ≤ E[X_1²] = ∫_0^∞ 2t F̄(t) dt ≤ t_0² + ∫_{t_0}^∞ 2t exp(−t^a) dt < ∞,

as desired. ∎
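The moment computation in the proof can be checked numerically; with tail bound exp(−t) (i.e., a = 1 and t_0 = 0), the integral ∫_0^∞ 2t e^{−t} dt evaluates to 2 (a sketch; the midpoint integrator is ours):

```python
from math import exp

def integrate(f, a, b, n=200000):
    """Simple midpoint rule; adequate for this smooth, rapidly decaying integrand."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# E[X^2] = int_0^inf 2 t P(|X| > t) dt; with a sub-exponential tail
# P(|X| > t) <= exp(-t), the second moment is finite and dimension-free:
second_moment_bound = integrate(lambda t: 2 * t * exp(-t), 0.0, 50.0)
```

Truncating the integral at 50 loses only an exponentially small remainder, so the numerical value is essentially the exact answer 2Γ(2) = 2.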
4 Experiments

In this section, we evaluate the certified robustness and verify the tightness of our lower bounds by numerical experiments. Experiments are run on two NVIDIA GeForce RTX 2080 Ti GPUs. We release our code and trained models at https://github.com/hongyanz/TRADES-smoothing.
4.1 Certified robustness
Despite the hardness results for random smoothing on certifying ℓ∞ robustness with a large perturbation radius, we evaluate the certified robust accuracy of random smoothing on the CIFAR-10 dataset when the ℓ∞ perturbation radius is as small as 2/255, given that the data dimension is not too high relative to the 2-pixel attack. The goal of this experiment is to show that random-smoothing-based methods might be unable to achieve very promising robust accuracy even when the perturbation radius is as small as 2 pixels.
Experimental setups. Our experiments exactly follow the setups of (Salman et al., 2019). Specifically, we train the models on the CIFAR-10 training set and test them on the CIFAR-10 test set. We apply the ResNet-110 architecture (He et al., 2016) for the CIFAR-10 classification task; the output size of the last layer is 10. Our training procedure is a modification of (Salman et al., 2019): Salman et al. (2019) used adversarial training to train a soft-random-smoothing classifier by injecting Gaussian noise. In our training procedure, we replace the adversarial training with TRADES (Zhang et al., 2019), a state-of-the-art defense model which won first place in the NeurIPS 2018 Adversarial Vision Challenge (Brendel et al., 2020). In particular, we minimize the TRADES objective on noise-injected examples,

min_f E_{(X,Y)} E_η [ L(f(X + η), Y) + β · max_{X′: ‖X′−X‖∞ ≤ ε} L(f(X + η), f(X′ + η)) ],

where η is the injected Gaussian noise, L is the cross-entropy loss or KL divergence, β is the regularization parameter, and ε is the perturbation radius. We set the perturbation step size to 0.007, the number of perturbation iterations to 10, the initial learning rate to 0.1, the standard deviation of the injected Gaussian noise to 0.12, and the batch size to 256, and we run 55 epochs on the training dataset. We decay the learning rate by a factor of 0.1 at epoch 50. We use the random smoothing of Cohen et al. (2019) to certify the ℓ2 robustness of the base classifier, and obtain the certified ℓ∞ radius by scaling the certified ℓ2 radius by a factor of 1/√d. For fairness, we do not compare with models using extra unlabeled data, ImageNet pretraining, or ensembling tricks.
Experimental results. We compare TRADES + random smoothing with various baseline methods for certified robustness at radius 2/255. We summarize our results in Table 1. All results are reported according to the numbers in their original papers. (We report the performance of Salman et al. (2019) according to the results at https://github.com/Hadisalman/smoothing-adversarial/blob/master/data/certify/best_models/cifar10/ours/cifar10/DDN_4steps_multiNoiseSamples/4-multitrain/eps_255/cifar10/resnet110/noise_0.12/test/sigma_0.12, which is the best result in the folder "best models" by Salman et al. (2019). When a method was not tested under the 2/255 threat model in its original paper, we do not compare with it in our experiment.) The results show that TRADES with random smoothing has comparable performance with the state-of-the-art algorithm of Salman et al. (2019) on certifying 2/255 robustness and enjoys slightly higher robust accuracy than other methods. However, for all approaches, there are still significant gaps between the achieved robust accuracy and a level of accuracy acceptable in real security-related tasks, even when the certified radius is chosen as small as 2 pixels.
Table 1: Certified robust accuracy and natural accuracy at ℓ∞ radius 2/255 on CIFAR-10.

| Method | Robust Acc. | Natural Acc. |
| --- | --- | --- |
| Salman et al. (2019) | 60.8% | 82.1% |
| Zhang et al. (2020) | 54.0% | 72.0% |
| Wong et al. (2018) | 53.9% | 68.3% |
| Mirman et al. (2018) | 52.2% | 62.0% |
| Gowal et al. (2018) | 50.0% | 70.2% |
| Xiao et al. (2019) | 45.9% | 61.1% |
4.2 Effectiveness of lower bounds
For random smoothing, Theorem 1.1 suggests that the certified ℓ∞ robust radius scales proportionally to σ/√d, where σ is the standard deviation of the injected noise. In this section, we verify this dependency by numerical experiments on the CIFAR-10 dataset with Gaussian noise.
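The σ/√d dependency can be sketched with the certification formula used in Section 4.1 (our illustration; `p_top` is an assumed class-probability value): scaling σ proportionally to √d keeps the certified ℓ∞ radius fixed.

```python
from math import sqrt
from statistics import NormalDist

def linf_radius(sigma, d, p_top=0.9):
    """Certified l_inf radius sigma * Phi^{-1}(p_top) / sqrt(d), obtained by
    certifying l2 robustness (Cohen et al., 2019, taking p_B = 1 - p_A) and
    dividing by sqrt(d)."""
    return sigma * NormalDist().inv_cdf(p_top) / sqrt(d)

# Scaling sigma proportionally to sqrt(d) keeps the certified radius fixed,
# which is why the accuracy curves for different input sizes should align:
radii = [linf_radius(0.12 * sqrt(d / 3072), d) for d in (3072, 6912, 12288)]
```

This mirrors the experimental design below: the input size (hence d) varies while r√d/σ is held constant, so the certified-accuracy curves should coincide.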
Experimental setups. We apply the ResNet-110 architecture (He et al., 2016) for classification; the input size of the architecture is adaptive via an adaptive pooling layer, and the output size of the last layer is 10. We vary the size of the input images by calling the resize function, and keep the quantity r√d/σ an absolute constant across three input sizes by scaling the standard deviation σ of the injected noise and the perturbation radius r in the TRADES training procedure accordingly. Our goal is to show that the accuracy curves of the three input sizes behave similarly. In our training procedure, we set the perturbation step size to 0.007, the number of perturbation iterations to 10, and the learning rate to 0.1, use a batch size of 256, and run 55 epochs on the training dataset. We use random smoothing (Cohen et al., 2019) with varying σ's to certify the ℓ2 robustness. The certified ℓ∞ radius is obtained by scaling the certified ℓ2 radius by a factor of 1/√d.
We summarize our results in Figure 2. We observe that the three curves for the varying input sizes behave similarly. This empirically supports our theoretical finding in Theorem 1.1 that the certified robust radius should be proportional to the quantity σ/√d. In Figure 2, the certified accuracy is monotonically decreasing until reaching some point where it plummets to zero. This phenomenon has also been observed by Cohen et al. (2019) and is explained by a hard upper limit on the radius we can certify with a finite number of noise samples, which is reached when all samples are classified by the base classifier as the same class.
5 Conclusions

In this paper, we show a hardness result for random smoothing on certifying adversarial robustness against attacks in the ℓp ball of radius r when p > 2. We focus on a lower bound on the required noise magnitude: under certain regularity conditions, any noise distribution D on R^d that certifies (r, γ)-robustness with respect to ℓp must satisfy E_{z∼D} ‖z‖2 = Ω(γ r d^{1−1/p}).