Modern machine learning models such as deep neural networks have achieved great success in a wide range of tasks, but have been shown to be brittle against adversarial attacks. For instance, in image classification, small perturbations imperceptible to human eyes may severely deteriorate performance [szegedy2013intriguing]. Various heuristic approaches have been proposed either to attack classifiers or to defend against adversarial attacks by making classifiers robust. However, defenses that were empirically observed to be robust to specific types of attacks were later found to be vulnerable to stronger or adaptive attacks [carlini2017adversarial, athalye2018obfuscated, uesato2018adversarial]. Therefore, achieving provable/certifiable robustness has begun to draw attention. Here the goal is to guarantee, deterministically or probabilistically, that no attack within a certain region will alter the prediction of a classifier.
Recently, randomized smoothing has been shown to provide instance-specific robustness guarantees [Lcuyer2018CertifiedRT, Li2018CertifiedAR, cohen2019certified]. Specifically, given a base classifier, the prediction of the smoothed classifier, defined as the most probable prediction under random isotropic Gaussian perturbations, will not change within an $\ell_2$ ball whose radius may vary across inputs. This guarantee requires no assumptions on the base classifier, and randomized smoothing is one of the few methods shown to provide non-trivial robustness guarantees for large-scale classification tasks such as ImageNet.
Despite recent advances in understanding the theoretical properties of randomized smoothed classifiers, it remains under-investigated how to train a base classifier that achieves both high accuracy and robustness once smoothed under this framework. The training procedures employed in most previous works did not fully take into account this ultimate goal of high accuracy and robustness for the smoothed classifier. On the other hand, smoothed classifiers based on neural networks cannot be evaluated exactly (we discuss the technical details later). Providing a robustness guarantee under this framework therefore requires a certification algorithm that gives, for each instance, a lower bound of the certified radius that holds with high probability. However, how to certify the robustness of smoothed classifiers is also under-explored.
In this paper, we fill the aforementioned gaps and study how to train randomized smoothed classifiers and how to certify their robustness. For training, we derive a regularized risk and discuss how to implement it to train a good base classifier. Specifically, we propose ADRE, an ADaptive Radius Enhancing regularizer, which penalizes examples misclassified by the smoothed classifier while encouraging a large certified radius for correctly classified examples. This regularizer can be implemented efficiently and used in conjunction with other adversarial defense methods. In particular, we discuss how ADRE regularization can be extended to the adversarial training scheme that has been widely employed to improve adversarial robustness [kurakin2016adversarial, madry2017towards, salman2019provably]. In addition, we introduce T-CERTIFY, a new certification algorithm that provides a tighter lower bound of the certified radius, holding with high probability. This algorithm builds upon and extends previous certification approaches and can further improve the robustness guarantee. We assess the effectiveness of ADRE and T-CERTIFY on the CIFAR-10 and ImageNet datasets, and demonstrate that both approaches can improve the robustness of randomized smoothed classifiers.
Related Work and Preliminaries
Certified adversarial defenses
Certified defenses aim to provide robustness guarantees for classifiers. Specifically, for a certain type of attack, we say a classifier is provably/certifiably robust within some region, which may depend on the input, if the output of the classifier is constant over this region. For the well-studied norm-bounded attacks, a variety of methods have been proposed, based on techniques such as mixed integer linear programming [lomuscio2017approach, fischetti2017deep], satisfiability modulo theories [katz2017reluplex, ehlers2017formal, huang2017safety], bounding the local or global Lipschitz constant of the neural network [hein2017formal, cisse2017parseval, tsuzuku2018lipschitz, anil2018sorting], convex relaxation [wong2017provable, raghunathan2018semidefinite], and many others. However, these methods generally cannot certify large networks, mainly due to their intrinsic computational burden or loose relaxations, and thus cannot provide meaningful guarantees for tasks like ImageNet classification. Compared to these methods, a salient advantage of randomized smoothed classifiers is that they require no additional assumptions on the base classifier. They can therefore fully leverage large, expressive neural networks to produce a powerful smoothed classifier.
Notations and Randomized Smoothed Classifier
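Let $\mathcal{D}$ denote the distribution of $(x, y)$, where $x \in \mathbb{R}^d$ and $y \in \mathcal{Y} = \{1, \dots, K\}$. A soft classification function parameterized by $\theta$, $F_\theta: \mathbb{R}^d \to [0, 1]^K$, maps the input $x$ to a probability score for each class $c \in \mathcal{Y}$, and the corresponding (hard) classifier $f_\theta(x) = \arg\max_{c \in \mathcal{Y}} F_\theta^c(x)$ outputs the class label with the highest score. We use $F_\theta^c(x)$ to denote the probability score with respect to class $c$. For neural network classifiers, the probability scores are typically generated by the softmax function.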
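Given a (base) classifier $f_\theta$, the smoothed classifier $g_{\theta,\sigma}$ based on $f_\theta$ under isotropic Gaussian perturbation with variance $\sigma^2$ is defined as

$$ g_{\theta,\sigma}(x) = \arg\max_{c \in \mathcal{Y}} G_{\theta,\sigma}^c(x), \tag{1} $$

where $G_{\theta,\sigma}^c(x) = \mathbb{P}_{\delta \sim \mathcal{N}(0, \sigma^2 I)}\big(f_\theta(x + \delta) = c\big)$ is the smoothed probability score. Throughout the paper we simplify the notation by omitting the parameter $\theta$ and/or $\sigma$, and use $f$ and $g$ to denote the base and smoothed classifier, respectively. A useful property of $g$ is that, for any given $x$, $g$ yields the same prediction for all $x'$ with $\|x' - x\|_2 < R$. The certified radius $R$ depends on the top probability score $p_A = G^{c_A}(x)$, with $c_A = g(x)$, and the “runner-up” score $p_B = \max_{c \neq c_A} G^c(x)$ [Lcuyer2018CertifiedRT, Li2018CertifiedAR, cohen2019certified]. Without further assumptions on $f$, the tight radius is

$$ R = \frac{\sigma}{2}\left(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\right), \tag{2} $$

where $\Phi^{-1}$ is the inverse of the standard Gaussian CDF.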
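To make the smoothed score $G^c(x) = \mathbb{P}_{\delta \sim \mathcal{N}(0, \sigma^2 I)}(f(x+\delta) = c)$ and the radius $R = \frac{\sigma}{2}(\Phi^{-1}(p_A) - \Phi^{-1}(p_B))$ concrete, here is a minimal Monte Carlo sketch using only the Python standard library. The toy base classifier and all function names are hypothetical. A linear $f$ on $\mathbb{R}^2$ is used because its smoothed scores are known in closed form: at $x = 0$ with $\sigma = 0.5$, the sum $x_1 + x_2 + \delta_1 + \delta_2$ is Gaussian with variance $2\sigma^2 = 0.5$, so $p_A = \Phi(\sqrt{2})$, and the radius equals $1/\sqrt{2}$, exactly the $\ell_2$ distance from $x$ to the decision boundary.

```python
import random
from collections import Counter
from statistics import NormalDist


def certified_radius(p_a: float, p_b: float, sigma: float) -> float:
    """l2 radius sigma / 2 * (Phi^{-1}(p_a) - Phi^{-1}(p_b)); needs 0 < p_b < p_a < 1."""
    phi_inv = NormalDist().inv_cdf  # inverse standard Gaussian CDF
    return 0.5 * sigma * (phi_inv(p_a) - phi_inv(p_b))


def toy_base_classifier(x):
    """Hypothetical base classifier f on R^2: class 0 if x[0] + x[1] < 1, else class 1."""
    return 0 if x[0] + x[1] < 1.0 else 1


def smoothed_scores(f, x, sigma, n=10_000, seed=0):
    """Monte Carlo estimate of G^c(x) = P(f(x + delta) = c), delta ~ N(0, sigma^2 I)."""
    rng = random.Random(seed)
    counts = Counter(f([xi + rng.gauss(0.0, sigma) for xi in x]) for _ in range(n))
    return {c: k / n for c, k in counts.items()}


if __name__ == "__main__":
    sigma = 0.5
    scores = smoothed_scores(toy_base_classifier, [0.0, 0.0], sigma)
    ranked = sorted(scores.values(), reverse=True)
    print("g(x) =", max(scores, key=scores.get))
    # Plug-in radius from point estimates (not a certified bound).
    print("plug-in radius:", certified_radius(ranked[0], ranked[1], sigma))
    # With the exact scores of this linear f, R = 1 / sqrt(2).
    p_exact = NormalDist().cdf(2 ** 0.5)
    print("exact radius:", certified_radius(p_exact, 1 - p_exact, sigma))
```

Note that the plug-in radius uses point estimates of $p_A$ and $p_B$. A rigorous certificate instead needs a lower bound on $p_A$ and an upper bound on $p_B$ that hold with high probability.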
Training the Base Classifier
To train the base classifier, the most common approach has been to apply standard empirical risk minimization, adding a single draw of Gaussian noise to each training sample as data augmentation [Lcuyer2018CertifiedRT, cohen2019certified]. Stability training, which penalizes the difference between the logits of the original and the Gaussian-augmented example, was also proposed [Li2018CertifiedAR]. Very recently, adversarial training was applied to significantly improve the certified robustness of randomized smoothed classifiers [salman2019provably]; that work also used multiple Gaussian perturbations of a single training example. In this paper, we formalize single and multiple Gaussian augmentation as approximately minimizing a perturbed risk, from which we derive the proposed ADRE regularized risk. We further adapt adversarial training to our regularized procedure and demonstrate through experiments that the ADRE regularizer is also effective in this setting.
Computing the robustness radius for a given example under the framework of randomized smoothing requires identifying and evaluating $p_A$ and $p_B$. Unfortunately, for neural-network-based smoothed classifiers, exact evaluation is intractable. In practice, we can only estimate a lower bound for $p_A$ and an upper bound for $p_B$, denoted by $\underline{p_A}$ and $\overline{p_B}$, respectively. Because the radius in (2) increases with $p_A$ and decreases with $p_B$, plugging in these bounds yields a lower bound of the certified radius. Simultaneous confidence intervals for the multinomial distribution [sison1995simultaneous] were applied in [Li2018CertifiedAR]. However, from a statistical perspective, constructing confidence intervals for the class probabilities is not sufficient to provide rigorous robustness certification without prior knowledge of the true top and “runner-up” classes. Another approach, named CERTIFY, first estimates $\underline{p_A}$ and then sets $\overline{p_B} = 1 - \underline{p_A}$, which can be loose in some cases [cohen2019certified]. This choice ignores how the remaining probability mass is distributed across the other classes. The proposed ADRE regularizer encourages robustness by penalizing the “runner-up” probability for correctly classified examples, so this approach may not fully reflect the improved robustness. In contrast, the proposed T-CERTIFY estimates $\underline{p_A}$ and $\overline{p_B}$ separately and is shown to provide a tighter lower bound for the true certified radius.
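The contrast between setting $\overline{p_B} = 1 - \underline{p_A}$ and bounding $p_B$ on its own can be illustrated with a minimal sketch, written under our own assumptions. Both functions start from Monte Carlo class counts and use one-sided Clopper-Pearson bounds. The first follows the CERTIFY recipe described above, which reduces to the radius $\sigma\Phi^{-1}(\underline{p_A})$. The second bounds the runner-up probability separately. The identity of the runner-up class is unknown, so it takes the maximum upper bound over all non-top classes, splitting the failure probability with a Bonferroni correction so that all bounds hold jointly with probability at least $1 - \alpha$. This second function only illustrates the idea of separate estimation. It is not the T-CERTIFY algorithm, whose details are not given here. All function names, counts, and the values of $\alpha$ and $\sigma$ are hypothetical.

```python
import math
from statistics import NormalDist


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    log_p, log_q, log_n_fact = math.log(p), math.log1p(-p), math.lgamma(n + 1)
    total = 0.0
    for i in range(k + 1):
        total += math.exp(log_n_fact - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                          + i * log_p + (n - i) * log_q)
    return min(total, 1.0)


def cp_lower(k: int, n: int, alpha: float) -> float:
    """One-sided Clopper-Pearson lower bound, valid with probability >= 1 - alpha."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection: P(X >= k; p) increases with p
        mid = 0.5 * (lo + hi)
        if 1.0 - binom_cdf(k - 1, n, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo  # conservative side of the bracket


def cp_upper(k: int, n: int, alpha: float) -> float:
    """One-sided Clopper-Pearson upper bound, valid with probability >= 1 - alpha."""
    if k == n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection: P(X <= k; p) decreases with p
        mid = 0.5 * (lo + hi)
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi  # conservative side of the bracket


def radius_certify_style(counts, top, alpha, sigma):
    """CERTIFY-style: lower-bound p_A, then use 1 - (that bound) as the bound on p_B."""
    n = sum(counts.values())
    p_a = cp_lower(counts.get(top, 0), n, alpha)
    if p_a <= 0.5:
        return None  # abstain
    return sigma * NormalDist().inv_cdf(p_a)  # = sigma/2 * (Phi^-1(p_a) - Phi^-1(1 - p_a))


def radius_separate_bounds(counts, top, classes, alpha, sigma):
    """Hypothetical variant: bound p_A at level alpha/2 and each of the K-1 other
    classes at level alpha / (2 (K-1)); the largest upper bound bounds p_B."""
    n = sum(counts.values())
    others = [c for c in classes if c != top]
    p_a = cp_lower(counts.get(top, 0), n, alpha / 2)
    p_b = max(cp_upper(counts.get(c, 0), n, alpha / (2 * len(others))) for c in others)
    if p_a <= p_b:
        return None  # abstain
    phi_inv = NormalDist().inv_cdf
    return 0.5 * sigma * (phi_inv(p_a) - phi_inv(p_b))


if __name__ == "__main__":
    # Hypothetical counts from n = 1000 noisy samples; the top class 0 is assumed to
    # have been selected beforehand from a separate sample, as in CERTIFY.
    counts = {0: 800, 1: 23, 2: 23, 3: 22, 4: 22, 5: 22, 6: 22, 7: 22, 8: 22, 9: 22}
    print("CERTIFY-style radius:", radius_certify_style(counts, 0, 0.001, 0.25))
    print("separate-bound radius:", radius_separate_bounds(counts, 0, list(range(10)), 0.001, 0.25))
```

With these counts, the non-top mass is spread thinly over nine classes, so the separate bounds give a noticeably larger radius. When almost all of the non-top mass sits on a single class, the Bonferroni split can make the second approach no better, or even slightly looser, than using $1 - \underline{p_A}$.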
While the radius in (2) holds for an arbitrary base classifier, under the framework of randomized smoothing we wish to train a base classifier that consistently makes correct predictions under isotropic Gaussian perturbation, so as to achieve high accuracy and a large certified radius. Consequently, standard empirical risk minimization may not yield a desirable base classifier, since the original and perturbed samples can be very different in high dimensions, especially when $\sigma$ is large. Instead, consider the following perturbed risk:

$$ \mathcal{R}_{\mathcal{P}}(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\, \mathbb{E}_{\delta \sim \mathcal{P}}\left[\ell\big(F_\theta(x + \delta), y\big)\right], \tag{3} $$

where $\mathcal{P}$ is the perturbation distribution and $\ell$ is some loss function. Although $\mathcal{P}$ and $\ell$ can be arbitrary, in this paper we focus on $\mathcal{P} = \mathcal{N}(0, \sigma^2 I)$, independent of $x$, and the cross-entropy loss $\ell(F_\theta(x + \delta), y) = -\log F_\theta^y(x + \delta)$. Where no confusion arises, we use abbreviated notation for these quantities. Intuitively, minimizing (3) yields a classifier that has low risk, and thus high accuracy, under Gaussian perturbation.
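Gaussian data augmentation can be read as a Monte Carlo approximation of the perturbed risk $\mathbb{E}_{(x,y)}\,\mathbb{E}_{\delta \sim \mathcal{N}(0, \sigma^2 I)}[-\log F^y(x + \delta)]$. The minimal sketch below illustrates that approximation. It draws $m$ Gaussian perturbations per example, so $m = 1$ corresponds to single-draw augmentation and $m > 1$ to multiple-draw augmentation. The linear model, the data, and all function names are hypothetical. This is an illustration of the estimator, not the paper's training code.

```python
import math
import random


def linear_logits(x, weights, biases):
    """Hypothetical linear model: logit_c = w_c . x + b_c; softmax(logits) plays F(x)."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b for w, b in zip(weights, biases)]


def cross_entropy(logits, y):
    """-log softmax(logits)[y], computed stably as logsumexp(logits) - logits[y]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[y]


def perturbed_risk(data, weights, biases, sigma, m=8, seed=0):
    """Monte Carlo estimate of E_{(x,y)} E_{delta ~ N(0, sigma^2 I)}[-log F^y(x + delta)],
    averaging the loss over m independent Gaussian draws per example."""
    rng = random.Random(seed)
    total = 0.0
    for x, y in data:
        for _ in range(m):
            noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
            total += cross_entropy(linear_logits(noisy, weights, biases), y)
    return total / (len(data) * m)


if __name__ == "__main__":
    data = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
    weights, biases = [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]
    # sigma = 0 recovers the ordinary empirical risk, here log(1 + e^{-2}).
    print(perturbed_risk(data, weights, biases, sigma=0.0))
    print(perturbed_risk(data, weights, biases, sigma=1.0, m=200))
```

For this toy model, the cross-entropy is convex in the logits and the logits are linear in $x$. Jensen's inequality therefore implies that the perturbed risk is, in expectation, never below the clean risk. This gap between clean and perturbed risk is why plain empirical risk minimization need not produce a base classifier that performs well under Gaussian noise.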