# On Adversarial Risk and Training

## 1 Introduction

Recent works have shown that the output of deep neural networks is vulnerable to even a small amount of perturbation to the input (Goodfellow et al., 2014; Szegedy et al., 2013). These perturbations, usually referred to as “adversarial” perturbations, are imperceptible to humans and can deceive even state-of-the-art models into making incorrect predictions. Consequently, a line of work in deep learning has focused on defending against such attacks/perturbations (Goodfellow et al., 2014; Carlini and Wagner, 2016; Ilyas et al., 2017; Madry et al., 2017). This has resulted in several techniques for learning models that are robust to adversarial attacks. However, many of these techniques were later shown to be ineffective (Athalye and Sutskever, 2017; Carlini and Wagner, 2017; Athalye et al., 2018).

We present a brief, and necessarily incomplete, review of the existing literature on adversarial robustness. Existing works define an adversarial perturbation at a point $x$, for a classifier $f$, as any perturbation $\delta$ with a small norm, measured w.r.t. some distance metric, which changes the output of the classifier; that is, $f(x) \neq f(x+\delta)$. Most of the existing techniques for learning robust models minimize the following worst-case loss over all possible perturbations:

$$\mathbb{E}_{(x,y)\sim P}\left[\max_{\delta:\|\delta\|\le\epsilon}\ell(f(x+\delta),y)\right]. \tag{1}$$

Goodfellow et al. (2014); Carlini and Wagner (2017); Madry et al. (2017) use heuristics to approximately minimize the above objective. In each iteration of the optimization, these techniques first use heuristics to approximately solve the inner maximization problem and then compute a descent direction using the resulting maximizers.
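The two-step iteration described above (inner maximization, then a descent step at the maximizers) can be sketched on a toy model. The sketch below is not the procedure of any of the cited works: it assumes a linear classifier with logistic loss and an $\ell_\infty$ budget, a case where the inner maximization has an exact solution, so no heuristic is needed. The function names, learning rate, and step count are illustrative choices.

```python
import numpy as np

def logistic_loss(w, X, y):
    # Mean logistic loss log(1 + exp(-y * <w, x>)), computed stably.
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def worst_case_inputs(w, X, y, eps):
    # Inner maximization for a linear model under an l_inf budget:
    # shifting every coordinate by eps against the label is optimal.
    return X - eps * y[:, None] * np.sign(w)[None, :]

def adversarial_training(X, y, eps, lr=0.1, steps=200):
    # Assumed setup: labels in {-1, +1}, no bias term, plain gradient descent.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        X_adv = worst_case_inputs(w, X, y, eps)            # inner max
        margins = y * (X_adv @ w)
        # d/dw log(1 + exp(-m)) = -y * x / (1 + exp(m)); tanh form avoids overflow.
        coeff = -y * 0.5 * (1.0 - np.tanh(0.5 * margins))
        grad = (coeff[:, None] * X_adv).mean(axis=0)
        w -= lr * grad                                     # outer descent step
    return w
```

For deep networks the inner problem has no closed form, which is why the cited works resort to approximate maximizers.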

Tsuzuku et al. (2018) provide a training algorithm which tries to find large margin classifiers with small Lipschitz constants, thus ensuring robustness to adversarial perturbations. A recent line of work has focused on optimizing an upper bound of the above objective. Raghunathan et al. (2018); Kolter and Wong (2017) provide SDP and LP based upper bound relaxations of the objective, which can be solved efficiently for small networks. These techniques have the added advantage that they can be used to formally verify the robustness of any given model. Sinha et al. (2017) propose to optimize the following distributional robustness objective, which is a stronger form of robustness than the one used in Equation (1)

$$\min_{f}\ \sup_{Q:\,W(P,Q)\le\epsilon}\ \mathbb{E}_{(x,y)\sim Q}\left[\ell(f(x),y)\right], \tag{2}$$

where $W(P,Q)$ is the Wasserstein distance between the probability distributions $P$ and $Q$.

Another line of work on adversarial robustness has focused on studying adversarial risk from a theoretical perspective. Recently, Schmidt et al. (2018); Bubeck et al. (2018) study the generalization properties of adversarial risk and compare them with the generalization properties of standard risk. Fawzi et al. (2018, 2018); Franceschi et al. (2018) study the properties of adversarial perturbations and adversarial risk. These works characterize the robustness at a point $x$ in terms of how much perturbation a classifier can tolerate at that point without changing its prediction:

$$r(x)=\min_{\delta\in\mathcal{S}}\|\delta\| \quad \text{s.t.} \quad \mathrm{sign}(f(x))\neq \mathrm{sign}(f(x+\delta)), \tag{3}$$

where $\mathcal{S}$ is some subspace. Fawzi et al. (2018) theoretically study the expected adversarial radius of any classifier and suggest that there is a trade-off between adversarial robustness and standard accuracy. Specifically, their results suggest that if the prediction accuracy is high then the expected adversarial radius could be small.

However, these results are contrary to the standard notion that on tasks such as image classification, humans are robust classifiers with low error rate. A careful inspection of the definitions of adversarial perturbation and adversarial radius used in Equations (1) and (3) brings to light the inexactness of these definitions. For example, consider the definition of adversarial risk in Equation (1). A major issue with this definition is that it assumes the label $y$ remains the same in a neighborhood of $x$, and penalizes any classifier which doesn’t output $y$ in the entire neighborhood of $x$. However, the response variable need not remain the same in the neighborhood of $x$. If a perturbation $\delta$ is such that the “true label” at $x$ is not the same as the “true label” at $x+\delta$, then the classifier shouldn’t be penalized for not predicting $y$ at $x+\delta$. Moreover, such a perturbation shouldn’t be considered adversarial, since it changes the true label at $x$. Figure 1 illustrates this phenomenon on MNIST and CIFAR-10. As we show later in the paper, this inexact definition of adversarial perturbation has resulted in recent works claiming that there exists a trade-off between adversarial and standard risks.

## 2 Preliminaries

In this section, we set up the notation and review necessary background on risk minimization. To simplify the presentation in the paper, we only consider the binary classification problem. However, it is straightforward to extend the results and analysis in this paper to multi-class classification.

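Let $(x, y)$ denote the covariate, label pair, which follows a probability distribution $P$. Let $\{(x_i,y_i)\}_{i=1}^n$ be $n$ i.i.d. samples drawn from $P$. Let $f$ denote a score-based classifier, which assigns $x$ to class $1$ if $f(x)>0$. We define the population and empirical risks of classifier $f$ as

$$R_{0-1}(f)=\mathbb{E}_{(x,y)\sim P}\left[\ell_{0-1}(f(x),y)\right], \qquad R_{n,0-1}(f)=\frac{1}{n}\sum_{i=1}^{n}\ell_{0-1}(f(x_i),y_i),$$

where $\ell_{0-1}$ is defined as $\ell_{0-1}(f(x),y)=\mathbb{I}(\mathrm{sign}(f(x))\neq y)$, and $\mathrm{sign}(\alpha)=1$ if $\alpha>0$ and $-1$ otherwise. Given $\{(x_i,y_i)\}_{i=1}^n$, the objective of empirical risk minimization (ERM) is to estimate a classifier with low population risk $R_{0-1}(f)$. Since optimization of the 0/1 loss is computationally intractable, it is often replaced with a convex surrogate loss function $\ell(f(x),y)=\phi(yf(x))$ for some convex $\phi$. The logistic loss is a popularly used surrogate loss and is defined as $\ell(f(x),y)=\log(1+e^{-yf(x)})$. We let $R(f), R_n(f)$ denote the population and empirical risk functions obtained by replacing $\ell_{0-1}$ with $\ell$ in $R_{0-1}(f), R_{n,0-1}(f)$.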

A score based classifier is called Bayes optimal classifier if a.e. on the support of distribution . We call as Bayes decision rule. Note that the Bayes decision rule need not be unique. We assume that the set of points where has measure .

In this paper, we focus on the following robustness setting, which is also the focus of most of the past works on adversarial robustness: given a pre-trained model, there is an adversary which corrupts the inputs to the model such that the corrupted inputs lead to certain “unwanted” behavior in the model. Our goal is to design models that are robust to such adversaries. In what follows, we make the notions of an adversary, unwanted behavior more concrete and formally define adversarial perturbation and adversarial risk.

Consider an adversary which modifies any given data point $x$ to a perturbed point $x+\delta_x$, where $\delta_x$ is the perturbation chosen by the adversary at $x$. We assume that the perturbations are norm bounded, which is a standard restriction imposed on the capability of the adversary.

Our definition of adversarial perturbation is based on a reference or base classifier. For example, in vision tasks, this base classifier is the human vision system. A perturbation is adversarial to a classifier if it changes the prediction of the classifier, whereas the base/reference classifier assigns the perturbed point to the same class as the unperturbed point.

###### Definition (Adversarial Perturbation).

Let $f$ be a score-based classifier and $g$ be a base classifier. Then the perturbation $\delta_x$ chosen by an adversary at $x$ is said to be adversarial for $f$, w.r.t. base classifier $g$, if $\|\delta_x\|\le\epsilon$ and

$$\mathrm{sign}(f(x))=g(x), \qquad g(x)=g(x+\delta_x),$$

and

$$\mathrm{sign}(f(x+\delta_x))\neq g(x).$$

Equivalently, a perturbation $\delta_x$ is said to be adversarial for $f$, w.r.t. base classifier $g$, if $\|\delta_x\|\le\epsilon$, $g(x)=g(x+\delta_x)$, and

$$\ell_{0-1}(f(x+\delta_x),g(x))-\ell_{0-1}(f(x),g(x))=1.$$
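The conditions of this definition can be checked directly for a single point and a single candidate perturbation. The sketch below is only an illustration: the helper names, the default $\ell_\infty$ norm, and the sign convention (positive score maps to class $+1$, anything else to $-1$) are assumptions made for the example.

```python
import numpy as np

def sign(score):
    # Assumed convention: +1 for strictly positive scores, -1 otherwise.
    return 1 if score > 0 else -1

def is_adversarial(f, g, x, delta, eps, norm_ord=np.inf):
    """Check whether delta is adversarial for f at x w.r.t. base classifier g."""
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=float)
    within_budget = np.linalg.norm(delta, ord=norm_ord) <= eps
    f_agrees_at_x = sign(f(x)) == g(x)          # f is correct before perturbing
    label_unchanged = g(x) == g(x + delta)      # base classifier keeps its label
    f_flipped = sign(f(x + delta)) != g(x)      # f changes its prediction
    return bool(within_budget and f_agrees_at_x and label_unchanged and f_flipped)
```

For instance, with a base classifier that thresholds the first coordinate at $0$ and a score function that thresholds it at $0.5$, a shift from $1.0$ to $0.3$ is adversarial, while a shift that crosses $0$ is not, because the base classifier's label changes too.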

Note that, unlike the existing notion of adversarial risk, the above definition doesn’t consider a perturbation as adversarial if it changes the label of the base classifier. Moreover, if $f$ disagrees with $g$ at $x$, then the perturbation is not considered adversarial. This is reasonable because if $f$ disagrees with $g$ at $x$, it should be treated as a standard classification error rather than an adversarial error. Using the above definition of adversarial perturbation, we next define adversarial risk.

###### Definition (Adversarial Risk).

Let $f$ be a score-based classifier and $g$ be a base classifier. The adversarial risk of $f$ w.r.t. base classifier $g$ and a given adversary is defined as the fraction of points which can be adversarially perturbed by that adversary.

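It is typically assumed that the adversary is an “optimal” adversary; that is, at any given point $x$, the adversary tries to find a perturbation that is adversarial for $f$:

$$\delta_x\in\operatorname*{argmax}_{\substack{\|\delta\|\le\epsilon\\ g(x)=g(x+\delta)}}\ \ell_{0-1}(f(x+\delta),g(x))-\ell_{0-1}(f(x),g(x)).$$

The adversarial risk of a classifier $f$ w.r.t. an optimal adversary can then be written as

$$\mathbb{E}\left[\max_{\substack{\|\delta\|\le\epsilon\\ g(x)=g(x+\delta)}}\ \ell_{0-1}(f(x+\delta),g(x))-\ell_{0-1}(f(x),g(x))\right].$$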

In the sequel, we assume that the adversary is optimal and work with the above definition of adversarial risk.
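To make the expression above concrete, the sketch below estimates it on a finite sample by searching over a finite, caller-supplied set of candidate perturbations. This is an assumption-laden illustration rather than an exact computation: a finite search can only find some of the feasible perturbations, so the returned value is a lower bound on the empirical adversarial risk. The function names are hypothetical.

```python
import numpy as np

def zero_one(score, label):
    # 0-1 loss with the convention sign(score) = +1 iff score > 0.
    pred = 1 if score > 0 else -1
    return int(pred != label)

def adversarial_risk(f, g, X, candidates):
    """Average over X of the best gain found among the candidate perturbations.

    `candidates` is assumed to contain only perturbations inside the eps-ball.
    """
    gains = []
    for x in X:
        base = zero_one(f(x), g(x))
        best = 0  # delta = 0 is always feasible and yields a gain of 0
        for delta in candidates:
            if g(x + delta) != g(x):
                continue  # constraint g(x) = g(x + delta) is violated
            best = max(best, zero_one(f(x + delta), g(x)) - base)
        gains.append(best)
    return float(np.mean(gains))
```

Points where $f$ already disagrees with $g$ contribute $0$, which matches the remark that such errors are standard classification errors rather than adversarial ones.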

We also consider the adversarial risk obtained by replacing $\ell_{0-1}$ with a convex surrogate loss $\ell$, together with its empirical version. In the sequel we refer to the population quantities as the standard and adversarial risks, and to their sample counterparts as the corresponding empirical risks. The goal of adversarial training is to learn a classifier that has low adversarial and standard risks. One natural technique to estimate such a robust classifier is to minimize, over an appropriately chosen function class, a linear combination of both risks: the standard risk plus the adversarial risk weighted by a hyper-parameter. We refer to this joint objective as objective (4). The tuning parameter trades off standard risk with the excess risk incurred from adversarial perturbations, and allows us to tune the conservativeness of our classifier.

## 4 Bayes Optimal Classifier as Base Classifier

In this section we study the properties of minimizers of objective (4), under the assumption that the base classifier is a Bayes optimal classifier. This is a reasonable assumption because if we are interested in robustness with respect to a base classifier, it is likely we are getting labels from the base classifier itself. For instance, in many classification tasks the labels are generated by humans (i.e., the human is a Bayes optimal classifier for the classification task) and robustness is also measured w.r.t. a human. The following Theorem shows that under this condition, the minimizers of (4) are Bayes optimal.

###### Theorem 4.

Suppose the hypothesis class is the set of all measurable functions. Let the base classifier $g$ be a Bayes optimal classifier.

1. (0/1 loss). If $\ell$ is the 0/1 loss, then any minimizer of objective (4) is a Bayes optimal classifier.

2. (Logistic loss). Suppose $\ell$ is the logistic loss and suppose the probability distribution $P$ is such that $|P(y=1\mid x)-\tfrac{1}{2}|\ge\gamma$ a.e., for some positive constant $\gamma$. Then any minimizer of Equation (4) is a Bayes optimal classifier.

The first part of the above Theorem shows that minimizing the joint objective with the 0/1 loss, for any choice of the tuning parameter, results in a Bayes optimal classifier. This shows that there exist classifiers that are both robust and achieve high standard accuracy, and that there is no trade-off between adversarial and standard risks. More importantly, the Theorem shows that if there exists a unique Bayes decision rule (i.e., $\mathrm{sign}(f_1(x))=\mathrm{sign}(f_2(x))$ a.e. for any two Bayes optimal classifiers $f_1, f_2$), then standard training suffices to learn robust classifiers and there is no need for adversarial training.

The second part of the Theorem, which is perhaps the more interesting result, shows that using a convex surrogate for the 0/1 loss to minimize the joint objective also results in Bayes optimal classifiers. This result assures us that optimizing a convex surrogate does not hinder our search for a robust classifier that has low adversarial and standard risks. Finally, we note that the requirement on the conditional class probability is a mild condition, as $\gamma$ can be any small positive constant close to 0.

### 4.1 Approximate Bayes Optimal Classifier as Base Classifier

We now briefly discuss the scenario where the base classifier is not Bayes optimal. In this setting, the minimizers of the objective (4) need not be Bayes optimal. The first term in the objective biases the optimization towards a Bayes optimal classifier, whereas the second term biases the optimization towards the base classifier. Since the base classifier is not a Bayes optimal classifier, this results in a trade-off between the two terms, which is controlled by the tuning parameter. If the tuning parameter is small, then the minimizers of the joint objective will be close to a Bayes optimal classifier. If it is large, the minimizers will be close to the base classifier.

## 5 Old definition of Adversarial Risk

One natural question that Section 4 gives rise to is whether the results in Theorem 4 also hold for the definition of adversarial risk used by existing works. To answer this question, we now study the properties of minimizers of the adversarial training objective in Equation (1). We start by making a slight modification to our definition of adversarial risk and analyzing the minimizers of the resulting adversarial training objective. Consider the adversarial risk obtained by removing the constraint $g(x)=g(x+\delta)$ from the adversarial risk w.r.t. an optimal adversary:

$$\mathbb{E}\left[\max_{\|\delta\|\le\epsilon}\ \ell_{0-1}(f(x+\delta),g(x))-\ell_{0-1}(f(x),g(x))\right].$$

We call this the adversarial “smooth” risk, because by removing the constraint, we are implicitly assuming that the base classifier is smooth in the neighborhood of each point. We use the same name for the risk obtained by replacing $\ell_{0-1}$ in this expression with a convex surrogate loss $\ell$.

The following Theorem studies the minimizers of the adversarial training objective obtained using the adversarial smooth risk. Specifically, it shows that if there exists a Bayes decision rule which satisfies a “margin condition”, then minimizing the adversarial training objective using the adversarial smooth risk results in Bayes optimal classifiers.

###### Theorem 5.

Suppose the hypothesis class is the set of all measurable functions. Moreover, suppose the base classifier $g$ is a Bayes decision rule which satisfies the following margin condition:

 (5)

1. (0/1 loss). If $\ell$ is the 0/1 loss, then any minimizer of the joint objective (6) is a Bayes optimal classifier.

2. (Logistic loss). Suppose $\ell$ is the logistic loss. Moreover, suppose the probability distribution $P$ is such that $|P(y=1\mid x)-\tfrac{1}{2}|\ge\gamma$ a.e., for some positive constant $\gamma$. Then any minimizer of the joint objective (6) is a Bayes optimal classifier.

The margin condition in Equation (5) requires the Bayes decision rule to not change its prediction in the neighborhood of any given point. We note that this condition is necessary for the results of the above Theorem to hold. In Section 5.2 we show that without the margin condition, the minimizers of (6) need not be Bayes optimal. Theorem 5 also highlights the importance of the constraint “$g(x)=g(x+\delta)$” in the definition of adversarial risk, for Bayes optimality of the minimizers.

### 5.1 Replacing Base Classifier with Stochastic Label y

A natural step is to replace $g(x)$ in the definition of adversarial smooth risk with the stochastic label $y$ and study the properties of minimizers of the resulting objective. Our results show that the resulting adversarial training objective behaves similarly to Equation (6).

###### Theorem 5.1.

Consider the setting of Theorem 5. Consider the adversarial risk obtained by replacing $g(x)$ with $y$ in the adversarial smooth risk, and let objective (7) denote the resulting joint objective.

1. (0/1 loss). If $\ell$ is the 0/1 loss, then any minimizer of objective (7) is a Bayes optimal classifier.

2. (Logistic loss). Suppose $\ell$ is the logistic loss. Moreover, suppose the probability distribution $P$ is such that $|P(y=1\mid x)-\tfrac{1}{2}|\ge\gamma$ a.e., for some positive constant $\gamma$. Then any minimizer of objective (7) is a Bayes optimal classifier.

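Note that objective (1) is equivalent to objective (7) when the tuning parameter equals $1$, since the standard risk then cancels against the subtracted term of the adversarial risk. The Theorem thus shows that under the margin condition there is no trade-off between the popularly used definition of adversarial risk and standard risk.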

### 5.2 Importance of Margin

If no Bayes decision rule satisfies the margin condition, then the results of Theorems 5 and 5.1 do not hold and minimizers of the corresponding joint objectives need not be Bayes optimal.

###### Theorem 5.2 (Necessity of margin).

Consider the setting of Theorem 5. Suppose no Bayes decision rule satisfies the margin condition in Equation (5). Then there exists a choice of the tuning parameter such that the minimizers of the joint objectives (6) and (7) are not Bayes optimal.

The above Theorem shows that without the margin condition, performing adversarial training using the existing definition of adversarial risk can result in a loss of standard accuracy. Next, we consider a concrete example and empirically validate our findings from Theorems 5.1 and 5.2.

#### Synthetic Dataset.

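Consider the following data generation process in a 2D space. Let $S(c,s)$ denote the axis-aligned square of side length $s$, centered at $c$. The marginal distribution of $x$ follows a uniform distribution on $S([2,0]^T,2)\cup S([-2,0]^T,2)$. The conditional distribution of $y$ given $x$ is given by

$$y \mid x\in S([2,0]^T,2)=\begin{cases}1, & \text{w.p. } 0.7\\ -1, & \text{w.p. } 0.3,\end{cases}$$

$$y \mid x\in S([-2,0]^T,2)=\begin{cases}1, & \text{w.p. } 0.3\\ -1, & \text{w.p. } 0.7.\end{cases}$$

Note that the data satisfies the margin condition in Equation (5) for perturbations of norm smaller than $1$, since every point of the support is at distance at least $1$ from the line $x^{(1)}=0$, with the following Bayes decision rule

$$\eta(x)=\begin{cases}1 & \text{if } x^{(1)}\ge 0\\ -1 & \text{if } x^{(1)}<0.\end{cases}$$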

From Theorem 5.1 we know that for perturbations within the margin, minimizing Equation (7) results in Bayes optimal classifiers. To verify this, we generated training samples from this distribution and minimized objective (7) over the set of linear classifiers. Since the model is linear, we have a closed form expression for the adversarial risk. Moreover, objective (7) can be efficiently solved using gradient descent. Figure 2 shows the behavior of the standard risk of the resulting models as we vary $\epsilon$. We can see that for perturbations within the margin, the standard risk is equal to $0.3$, which is the Bayes optimal risk, whereas for larger perturbations the standard risk can be larger than $0.3$.
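The synthetic distribution can be sampled in a few lines, which also gives a quick sanity check of the Bayes risk. This is a minimal sketch and not the experimental code behind Figure 2: it assumes the marginal of $x$ puts equal mass on the two squares, and the function names and sample size are illustrative.

```python
import numpy as np

def sample_synthetic(n, rng):
    # Assumption: x is uniform on the union of the squares centred at (+-2, 0)
    # with side length 2, so each square is picked with probability 1/2.
    side = rng.choice([-1.0, 1.0], size=n)
    x1 = 2.0 * side + rng.uniform(-1.0, 1.0, size=n)
    x2 = rng.uniform(-1.0, 1.0, size=n)
    p_pos = np.where(side > 0, 0.7, 0.3)          # P(y = 1 | x) from the text
    y = np.where(rng.uniform(size=n) < p_pos, 1.0, -1.0)
    return np.column_stack([x1, x2]), y

def bayes_rule(X):
    # Bayes decision rule from the text: threshold the first coordinate at 0.
    return np.where(X[:, 0] >= 0, 1.0, -1.0)

def zero_one_risk(pred, y):
    return float(np.mean(pred != y))
```

On a large sample, `zero_one_risk(bayes_rule(X), y)` should be close to the minority-label probability of each square, and every sampled point has $|x^{(1)}|\ge 1$, which is the margin used in the discussion above.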

#### Benchmark Datasets.

A number of recent works try to explain the drop in standard accuracy in adversarially trained models (Fawzi et al., 2018; Tsipras et al., 2018). These works suggest that there could be an inherent trade-off between standard and adversarial risks. In contrast, our results show that as long as there exists a Bayes optimal classifier with sufficient margin, minimizers of objectives (1) and (7) have low standard and adversarial risks and there is no trade-off between the two risks. The important question then is, “Do benchmark datasets such as MNIST (LeCun, 1998) and CIFAR10 (Krizhevsky and Hinton, 2009) satisfy the margin condition?” Sharif et al. (2018) try to estimate the margin in the MNIST and CIFAR10 datasets via user studies. Their results suggest that for perturbations larger than what is typically used in practice, CIFAR10 does not satisfy the margin condition. Together with our results, this shows that for such large perturbations, adversarial training will result in models with low standard accuracy. However, it is still unclear whether the benchmark datasets satisfy the margin condition for the perturbation sizes typically used in practice. We believe answering this question can help us understand if it is possible to obtain a truly robust model without compromising on standard accuracy.

### 5.3 Standard training with increasing model complexity

Before we conclude the section, we show how our results from Theorem 5.1 can be used to explain an interesting phenomenon observed by Madry et al. (2017): even with standard risk minimization, complex networks result in more robust classifiers than simple networks. Define the standard and adversarial training objectives as

$$\text{(standard)}\quad \min_{f\in\mathcal{F}} R(f),$$

Let $\mathcal{F}$ be a small function class, such as the set of functions which can be represented using a particular neural network architecture. As we increase the complexity of $\mathcal{F}$, we expect the minimizer of the standard risk to move closer to a Bayes optimal classifier. Assuming the margin condition is satisfied, from Theorem 5.1 we know that the minimizer of the adversarial training objective is also a Bayes optimal classifier. So as we increase the complexity of $\mathcal{F}$, we expect the minimizer of the standard risk to also have low joint adversarial + standard risk, and the minimizer of the joint adversarial + standard risk to have low standard risk. The latter could thus serve as an explanation for the other empirically observed phenomenon that performing adversarial training on increasingly complex networks results in classifiers with better standard risk. Figures 44 illustrate these two phenomena on MNIST and CIFAR10 datasets. To ensure the margin condition is at least approximately satisfied, we use small perturbations in these experiments. More details about the hyper-parameters used in the experiments can be found in the Appendix.

We conclude the discussion by pointing out that in practice we optimize empirical risks instead of population risks, so that our explanations above are accurate only when empirical risks and the corresponding population risks have similar landscapes.

## 6 Importance of Adversarial Training

Recall, in Section 4 we studied the properties of adversarial training when the base classifier is Bayes Optimal. In particular, in Theorem 4 we showed that the minimizers of adversarial training objective (4) are Bayes optimal classifiers, which are also the minimizers of standard risk. This naturally leads us to the following question: Do we really need adversarial training? Will standard training suffice to learn robust classifiers? In this section, we investigate conditions under which standard risk minimization alone does not guarantee robust classifiers.

We first consider the setting where there is a single Bayes decision rule. Then, our theoretical results in Theorem 4 indeed show that there is no need for adversarial training, provided we are able to find the optimum hypothesis over the set of all measurable functions. However, in practice, we are never able to do so due to the finite amount of data available to us. What if we choose a smaller hypothesis class (such as the set of linear separators)? In Section 6.1 we show that standard risk minimization over restricted hypothesis classes can result in classifiers with low standard risk but high adversarial risk. More generally, in Section 6.2, we show that adversarial and standard risks are not calibrated, and that a small standard risk need not entail a small adversarial risk.

In Section 6.3, we consider the setting where there are multiple Bayes decision rules. For instance, when the data is separable or lies in a low-dimensional manifold, Bayes decision rule is not unique. In this setting, even if one has access to unlimited data (which allows us to optimize over the space of all measurable functions), we show that there is a need for adversarial training. Although all the Bayes decision rules have the same standard risk, they can differ on adversarial risk. In such cases, it is impossible to distinguish these classifiers using standard risk. As a result, one needs to perform adversarial training to learn a robust Bayes decision rule.

We study these questions theoretically using a mixture model where the data for each class is generated from a different mixture component. The distribution of conditioned on

follows a normal distribution:

, where

is the identity matrix and

. Note that in this setting is a Bayes optimal classifier. Moreover there is a unique Bayes decision rule.

### 6.1 Optimizing Standard Risk over Restricted Function Class

We study the effect of minimizing the standard risk over restricted function classes. Consider the restricted hypothesis class of all vectors which are non-zero in the top-

co-ordinates: . The following result shows that the exact minimizer of standard risk over this restricted hypothesis class need not be the minimizer of the adversarial risk over this class, even for perturbations as small as .

Consider the Gaussian mixture model with

, and let be the minimizer of the standard risk when restricted to . Then, even for a small enough perturbation of norm, we have that

where is measured .

### 6.2 Standard and Adversarial Risk are not Calibrated

We next explore whether the two risks are calibrated: does approximately minimizing the standard risk always lead to small adversarial risk? Suppose the mean of the Gaussian components is -sparse; that is, has non-zero entries. Then the Bayes optimal classifier depends only on a few features and there are a lot of irrelevant features. The following result shows that there exist linear separators which achieve near-optimal classification accuracy, but have a high adversarial risk, even for a adversarial perturbation of size .

Let be -sparse with non-zeros in the first coordinates. Let be a linear separator such that , . Then, there exists a constant such that if and , the excess risk of is small; that is, , where is the risk of the Bayes optimal classifier. However, even for a small enough perturbation w.r.t norm, the adversarial risk satisfies

where the base classifier is equal to . Note that the constructed classifier has very small weights on irrelevant features. Hence the classification error is low but not minimal. But since there are a lot of such irrelevant features, there exist adversarial perturbations which don’t change the prediction of Bayes classifier, but change the prediction of .

### 6.3 Multiple Bayes Decision Rules

In this section, we consider the setting where there could be multiple Bayes optimal decision rules. We consider the question whether different Bayes optimal solutions have different adversarial risks, and whether standard risk minimization gives us robust Bayes optimal solutions.

Suppose our data comes from low dimensional Gaussians embedded in a high-dimensional space, suppose and the covariance matrix of the conditional distributions is diagonal with diagonal entry . Notice that in this model any classifier such that is a Bayes optimal classifier. Observe the subtle difference between this setting and sparse linear model. In particular, in the previous example, the data is inherently high-dimensional, but with only a few relevant discriminatory features; on the contrary, here the data lies on a low dimensional manifold of a high dimensional subspace.

In this setting, we study the adversarial risk of classifiers obtained through minimization of using iterative methods such as gradient descent.

Let be such that , for some constant . Let and be any convex calibrated surrogate loss . Then gradient descent on

with random initialization using a Gaussian distribution with covariance

converges to a point such that with high probability,

where is the adversarial risk measured .

Note that Theorem 6.3 highlights the vulnerability of standard risk minimization using gradient descent by showing that it can lead to Bayes optimal solutions which have high adversarial risk. Moreover, observe that increasing results in classifiers that are less robust; even a perturbation can create adversarial examples with respect to . All our results in this section show that standard risk minimization is inherently insufficient in providing robustness. This suggests the need for adversarial training.

## 7 Regularization properties of Adversarial Training

In this section, we study the regularization properties of the adversarial training objective in Equation (4). Specifically, we show that the adversarial risk , effectively acts as a regularizer which biases the solution towards certain classifiers. The following Theorem explicitly shows this regularization effect of adversarial risk. Let be the dual norm of , which is defined as: Suppose is the logistic loss and suppose the classifier is differentiable a.e. Then for any the adversarial training objective (4) can be upper bounded as

where . Although the above Theorem only provides an upper bound, it still provides insights into the regularization effects of adversarial risk. It shows that adversarial risk effectively acts as a regularization term biasing the optimization towards two kinds of classifiers: 1) classifiers that are smooth with small gradients and 2) classifiers that are pointwise close to the base classifier . We now compare the regularization effect of adversarial risk in objective (4) with the regularization effect of existing notion of adversarial risk. Suppose is the logistic loss and suppose the classifier is differentiable a.e. Then for any the adversarial training objective (7) can be upper bounded as

Moreover, for linear classifiers , the adversarial training objective (7) can be upper and lower bounded as

Comparing Theorems 77, we can see that the major difference between the two adversarial risks is that the existing definition doesn’t necessarily bias the optimization towards the base classifier , whereas the new definition certainly biases the optimization towards .

For linear classifiers, the above Theorem provides a tight upper bound and shows that adversarial training using objective (7) essentially acts as a regularizer which penalizes the dual norm of . In a related work, Xu et al. (2009) focus on linear classifiers with hinge loss, and show that under separability conditions on the data and certain additional constraint on perturbations, the robust objective is equivalent to the regularized objective.
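The dual-norm effect for linear classifiers can be verified numerically on a single point. The sketch below assumes $\ell_\infty$-bounded perturbations (so the dual norm is $\ell_1$) and the logistic loss; it compares the closed-form worst-case loss with a brute-force search over the corners of the perturbation ball. The function names are illustrative, and the weighting by the tuning parameter in objective (7) is left out.

```python
import itertools
import numpy as np

def logistic(margin):
    # log(1 + exp(-margin)), computed stably.
    return np.logaddexp(0.0, -margin)

def worst_case_loss_closed_form(w, x, y, eps):
    # For l_inf perturbations the worst case shrinks the margin by eps * ||w||_1.
    return logistic(y * np.dot(w, x) - eps * np.linalg.norm(w, ord=1))

def worst_case_loss_brute_force(w, x, y, eps):
    # The loss is convex in delta, so its maximum over the l_inf ball
    # is attained at one of the 2^d corners of the ball.
    corners = itertools.product([-eps, eps], repeat=len(w))
    return max(logistic(y * np.dot(w, x + np.array(c))) for c in corners)
```

The two values agree, which shows that for a linear model the worst-case loss is the ordinary loss evaluated at a margin penalized by the dual norm of the weights.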

## 8 Summary and Future Work

In this work, we identified the inaccuracies in the existing definition of adversarial risk and proposed a new definition of adversarial risk which fixes these inaccuracies. We analyzed the properties of minimizers of the resulting adversarial training objective and showed that Bayes optimal classifiers are its minimizers and that there is no trade-off between adversarial and standard risks. We also studied the existing definition of adversarial risk and its relation to the new definition, and identified conditions under which its minimizers are Bayes optimal. Our analysis highlights the importance of margin for Bayes optimality of its minimizers.

An important direction for future work would be to design algorithms for minimization of the new adversarial training objective. One can consider two different approaches in this direction: 1) assuming we have black box access to the base classifier, one could design efficient optimization techniques which make use of the black box. 2) assuming we have access to an approximate base classifier (e.g., some complex model which is pre-trained on a lot of labeled data from a related domain, or a “teacher” network), one could use this classifier as a surrogate for the base classifier, to optimize the adversarial training objective.

## 9 Acknowledgements

We acknowledge the support of NSF via IIS-1149803, IIS-1664720, DARPA via FA87501720152, and PNC via the PNC Center for Financial Services Innovation.

## Appendix A Proof of Theorem 4

### A.1 Intermediate Results

Before we present the proof of the Theorem, we present useful intermediate results which we require in our proof. The following Lemmas present some monotonicity properties of the logistic loss. Let

be a discrete random variable such that

$$y=\begin{cases}1, & \text{with probability } \ge \tfrac{1}{2}+\gamma\\ -1, & \text{with probability } \le \tfrac{1}{2}-\gamma,\end{cases}$$

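for some $\gamma$. Let $z$ and $\xi$ be constants with $z < \xi$ and $(1 - \Pr(y = 1))\, e^{\xi} \le \Pr(y = 1)$, which are the conditions used in the proof below. Define $h$ as follows

$$
h(u) = \mathbb{E}_y\left[\log\left(1 + e^{-y((1-u)z + u\xi)}\right)\right].
$$

Then $h$ is a strictly decreasing function over the domain $u \in [0, 1)$.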

###### Proof.

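Let $p = \Pr(y = 1)$. The derivative of $h$ w.r.t. $u$ is given by

$$
h'(u) = p \times \frac{z - \xi}{1 + e^{(1-u)z + u\xi}} + (1 - p) \times \frac{(\xi - z)\, e^{(1-u)z + u\xi}}{1 + e^{(1-u)z + u\xi}}.
$$

We will now show that $h'(u) < 0$. Suppose $p < 1$ (otherwise it is easy to see that $h'(u) < 0$). Then

$$
\begin{aligned}
\left(\frac{1 + e^{(1-u)z + u\xi}}{\xi - z}\right) \times h'(u) &= -p + (1 - p)\, e^{(1-u)z + u\xi} \\
&= (1 - p)\left(e^{(1-u)z + u\xi} - \frac{p}{1 - p}\right) \\
&\le (1 - p)\left(e^{(1-u)z + u\xi} - e^{\xi}\right) \\
&= (1 - p)\, e^{\xi}\left(e^{(1-u)(z - \xi)} - 1\right) < 0.
\end{aligned}
$$

Here the inequality uses $e^{\xi} \le p/(1-p)$, and the final strict inequality uses $(1-u)(z - \xi) < 0$. Since the factor multiplying $h'(u)$ on the left is positive for $z < \xi$, this gives $h'(u) < 0$. ∎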
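A quick numerical check can make this monotonicity concrete. The sketch below evaluates $h$ on a grid of $u$ values for one illustrative choice of parameters; the specific numbers, and the choice $\xi = \log\frac{p}{1-p}$ (which satisfies $e^{\xi} \le p/(1-p)$ with equality), are assumptions made for the example rather than values from the paper.

```python
import math


def h(u, z, xi, p):
    """E_y[log(1 + exp(-y * ((1 - u) * z + u * xi)))] with P(y = 1) = p."""
    a = (1.0 - u) * z + u * xi
    return p * math.log1p(math.exp(-a)) + (1.0 - p) * math.log1p(math.exp(a))


# Illustrative values (assumptions for this sketch).
gamma = 0.2
p = 0.5 + gamma                  # P(y = 1)
xi = math.log(p / (1.0 - p))     # satisfies exp(xi) <= p / (1 - p)
z = xi - 1.5                     # any z < xi

us = [i / 20 for i in range(20)]           # grid on [0, 1)
values = [h(u, z, xi, p) for u in us]

# Each value should be strictly smaller than the one before it.
is_strictly_decreasing = all(a > b for a, b in zip(values, values[1:]))
print(is_strictly_decreasing)
```

Moving $u$ from $0$ toward $1$ moves the score from $z$ toward $\xi$, and the expected logistic loss falls along the way, which is the behaviour the Lemma asserts.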

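Let $u$ be such that $0 \le u \le 1$. Define functions $h_1, h_2$ as follows

$$
\begin{aligned}
h_1(z) &= \log\left(1 + e^{-(1-u)z - u\xi}\right) - \log\left(1 + e^{-z}\right), \\
h_2(z) &= \log\left(1 + e^{(1-u)z + u\xi}\right) - \log\left(1 + e^{z}\right).
\end{aligned}
$$

Then $h_1$ is an increasing function over the domain $z \le \xi$ and $h_2$ is a decreasing function over $z \ge \xi$.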

###### Proof.

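The derivative of $h_1$ w.r.t. $z$ is given by

$$
h_1'(z) = -\frac{1 - u}{1 + e^{(1-u)z + u\xi}} + \frac{1}{1 + e^{z}}.
$$

We will now show that $h_1'(z) \ge 0$.

$$
h_1'(z) = -\frac{1 - u}{1 + e^{(1-u)z + u\xi}} + \frac{1}{1 + e^{z}} \ge -\frac{1}{1 + e^{(1-u)z + u\xi}} + \frac{1}{1 + e^{z}} \ge -\frac{1}{1 + e^{z}} + \frac{1}{1 + e^{z}} = 0,
$$

where the first inequality follows from the fact that $u \ge 0$ and the second inequality follows from the fact that $(1-u)z + u\xi \ge z$ whenever $z \le \xi$. This shows that $h_1$ is increasing over $z \le \xi$.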

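We use a similar argument to show that $h_2$ is a decreasing function. Consider the derivative of $h_2$ w.r.t. $z$

$$
h_2'(z) = \frac{1 - u}{1 + e^{-(1-u)z - u\xi}} - \frac{1}{1 + e^{-z}}.
$$

We will now show that $h_2'(z) \le 0$.

$$
h_2'(z) = \frac{1 - u}{1 + e^{-(1-u)z - u\xi}} - \frac{1}{1 + e^{-z}} \le \frac{1}{1 + e^{-(1-u)z - u\xi}} - \frac{1}{1 + e^{-z}} \le \frac{1}{1 + e^{-z}} - \frac{1}{1 + e^{-z}} = 0,
$$

where the second inequality uses $(1-u)z + u\xi \le z$ whenever $z \ge \xi$. This shows that $h_2$ is decreasing over $z \ge \xi$. ∎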
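The two monotonicity claims can also be checked numerically. The sketch below evaluates $h_1$ on a grid with $z \le \xi$ and $h_2$ on a grid with $z \ge \xi$; the values of $u$ and $\xi$ and the grid ranges are illustrative assumptions, and the two domains are the ones on which the inequalities in the proof hold.

```python
import math


def h1(z, u, xi):
    """log(1 + exp(-(1 - u) z - u xi)) - log(1 + exp(-z))."""
    b = (1.0 - u) * z + u * xi
    return math.log1p(math.exp(-b)) - math.log1p(math.exp(-z))


def h2(z, u, xi):
    """log(1 + exp((1 - u) z + u xi)) - log(1 + exp(z))."""
    b = (1.0 - u) * z + u * xi
    return math.log1p(math.exp(b)) - math.log1p(math.exp(z))


# Illustrative values (assumptions for this sketch).
u, xi = 0.3, 1.0
left_grid = [xi - 5.0 + 0.25 * i for i in range(21)]    # z from xi - 5 up to xi
right_grid = [xi + 0.25 * i for i in range(21)]         # z from xi up to xi + 5

h1_values = [h1(z, u, xi) for z in left_grid]
h2_values = [h2(z, u, xi) for z in right_grid]

h1_increasing = all(a < b for a, b in zip(h1_values, h1_values[1:]))
h2_decreasing = all(a > b for a, b in zip(h2_values, h2_values[1:]))
print(h1_increasing, h2_decreasing)
```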

### A.2 Main Argument

#### 0/1 loss.

We first prove the Theorem for the 0/1 loss; that is, we show that any minimizer of the joint objective is a Bayes optimal classifier. We prove the result by contradiction. Let $f^*$ be a Bayes optimal classifier that agrees with the base classifier a.e. Suppose $\hat f$ is a minimizer of the joint objective. Let $\hat f$ disagree with $f^*$ over a set $X$ of non-zero measure. We show that the joint risk of $\hat f$ is strictly larger than that of $f^*$.

First, we show that the standard risk of $\hat f$ is strictly larger than that of $f^*$:

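$$
\begin{aligned}
R_{0-1}(\hat f) - R_{0-1}(f^*) &= \mathbb{E}_{(x,y)}\left[\ell_{0-1}(\hat f(x), y) - \ell_{0-1}(f^*(x), y)\right] \\
&= \Pr(x \in X) \times \mathbb{E}_{(x,y)}\left[\ell_{0-1}(\hat f(x), y) - \ell_{0-1}(f^*(x), y) \,\middle|\, x \in X\right] \\
&= \Pr(x \in X) \times \mathbb{E}_{x}\left[\mathbb{E}_{y}\left[\ell_{0-1}(\hat f(x), y) - \ell_{0-1}(f^*(x), y) \,\middle|\, x\right] \,\middle|\, x \in X\right] \\
&= \Pr(x \in X) \times \mathbb{E}_{x}\left[P\left(y \ne \mathrm{sign}(\hat f(x)) \mid x\right) - P\left(y \ne \mathrm{sign}(f^*(x)) \mid x\right) \,\middle|\, x \in X\right] > 0,
\end{aligned}
$$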

where the last inequality follows from the definition of Bayes optimal decision rule.
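The decomposition above can be verified on a small discrete example. The sketch below uses a hypothetical three-point distribution (the marginal `px` and the conditional probabilities `eta` are made-up numbers) and compares the 0/1 risk of the Bayes rule with that of a classifier that disagrees with it on a single point.

```python
# Hypothetical discrete distribution (illustrative numbers only).
px = {0: 0.5, 1: 0.3, 2: 0.2}     # marginal P(x)
eta = {0: 0.9, 1: 0.2, 2: 0.7}    # conditional P(y = 1 | x)


def bayes(x):
    """Bayes optimal decision rule: predict the more likely label."""
    return 1 if eta[x] >= 0.5 else -1


def other(x):
    """A classifier that disagrees with the Bayes rule only at x = 2."""
    return -bayes(x) if x == 2 else bayes(x)


def risk(f):
    """0/1 risk: sum over x of P(x) * P(y != f(x) | x)."""
    return sum(px[x] * ((1.0 - eta[x]) if f(x) == 1 else eta[x]) for x in px)


# Disagreement set and the gap predicted by the last line of the decomposition.
disagree = [x for x in px if other(x) != bayes(x)]
predicted_gap = sum(px[x] * abs(2.0 * eta[x] - 1.0) for x in disagree)

gap = risk(other) - risk(bayes)
print(risk(bayes), risk(other), gap, predicted_gap)
```

Here the risk gap equals the mass of the disagreement set times the difference in conditional error probabilities on that set, and it is strictly positive because the Bayes rule picks the more likely label at every point.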

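We now show that the adversarial risk of $\hat f$ is at least as large as that of $f^*$. Since the base classifier agrees with $f^*$ a.e., the adversarial risk of $f^*$ is zero.

Since the adversarial risk of any classifier is always non-negative, this shows that the adversarial risk of $\hat f$ is no smaller than that of $f^*$. Combining this with the above result on classification risk, we get that the joint risk of $\hat f$ is strictly larger than that of $f^*$.

This shows that $\hat f$ cannot be a minimizer of the joint objective, and hence any minimizer of the joint objective must be a Bayes optimal classifier.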

#### Logistic Loss.

We now consider the logistic loss and show that any minimizer of the joint objective is a Bayes optimal classifier. We again prove the result by contradiction. Let $\xi$ be a constant. Suppose $\hat f$ is a minimizer of the joint objective and is not Bayes optimal. Define the set $X$ as

$$
X = \{x : \hat f(x)\, g(x) < \xi\}.
$$

Note that, since $\hat f$ is not Bayes optimal, $X$ is a set with non-zero measure. Construct a new classifier $\bar f$ as follows

$$
\bar f(x) = \begin{cases} \hat f(x), & \text{if } x \notin X, \\ \hat f(x) + \tau\left(\xi - \hat f(x)\, g(x)\right) g(x), & \text{otherwise}, \end{cases}
$$

where $\tau$ is a constant. We now show that $\bar f$ has a strictly lower joint risk than $\hat f$. This will then contradict our assumption that $\hat f$ is a minimizer of the joint objective.
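To see how this construction connects to the intermediate results, one can expand the product of the new classifier with the base classifier on the set $X$. The derivation below is a sketch that assumes the base classifier $g$ takes values in $\{-1, +1\}$ (so that $g(x)^2 = 1$) and that $\tau \in (0, 1]$; neither assumption is stated explicitly in the surviving text.

```latex
\begin{aligned}
\bar f(x)\, g(x)
  &= \hat f(x)\, g(x) + \tau \left(\xi - \hat f(x)\, g(x)\right) g(x)^2 \\
  &= (1 - \tau)\, \hat f(x)\, g(x) + \tau\, \xi ,
  \qquad x \in X .
\end{aligned}
```

With $z = \hat f(x) g(x)$ and $u = \tau$, this is the quantity $(1-u)z + u\xi$ that appears inside the logistic loss in the intermediate results. Since $z < \xi$ on $X$, the new margin exceeds the old one by $\tau(\xi - z) > 0$ and remains at most $\xi$.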

Let be the adversarial risk at point , computed w.r.t base classifier

Define the conditional risk of at as

Note that is equal to the joint risk . We now show that .

#### Case 1.

Let $x \notin X$. Then $\bar f(x) = \hat f(x)$. So we have

where the last equality follows from the observation that and the logistic function is a monotonically decreasing function.

#### Case 2.

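Let $x \in X$. Then $\bar f(x) = \hat f(x) + \tau\left(\xi - \hat f(x)\, g(x)\right) g(x)$. Now, consider the difference between the conditional risks of $\bar f$ and $\hat f$ at $x$:

We show that both $T_1$ and $T_2$ (the standard-risk and adversarial-risk parts of this difference, respectively) are non-positive. Using the monotonicity property of the logistic loss in Lemma A.1, it is easy to verify that $T_1 \le 0$. We now bound $T_2$. First, observe that based on our construction of $\bar f$ and our definition of the set $X$, we have

Since the logistic loss is monotonically decreasing, this shows that both the inner maxima in $T_2$ are achieved at $\delta$'s such that $x + \delta \in X$. Using this observation, $T_2$ can be rewritten as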

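$$
\lambda\left(\max_{\substack{\|\delta\| \le \epsilon,\ x + \delta \in X \\ g(x) = g(x + \delta)}} \ell(\bar f(x + \delta), g(x)) - \ell(\bar f(x), g(x))\right) - \lambda\left(\max_{\substack{\|\delta\| \le \epsilon,\ x + \delta \in X \\ g(x) = g(x + \delta)}} \ell(\hat f(x + \delta), g(x)) - \ell(\hat f(x), g(x))\right).
$$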

The above expression can be rewritten as

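$$
\lambda\left(\max_{\substack{\|\delta\| \le \epsilon,\ x + \delta \in X \\ g(x) = g(x + \delta)}} \ell(\bar f(x + \delta), g(x)) - \max_{\substack{\|\delta\| \le \epsilon,\ x + \delta \in X \\ g(x) = g(x + \delta)}} \ell(\hat f(x + \delta), g(x))\right) - \lambda\left(\ell(\bar f(x), g(x)) - \ell(\hat f(x), g(x))\right).
$$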

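Note that $\ell(\bar f(x + \delta), g(x))$ in the above expression can equivalently be written as $\log\left(1 + e^{-\left((1 - \tau)\hat f(x + \delta)\, g(x) + \tau\xi\right)}\right)$. This shows that both $\ell(\bar f(x + \delta), g(x))$ and $\ell(\hat f(x + \delta), g(x))$ in the above expression are monotonically decreasing in $\hat f(x + \delta)\, g(x)$ and, as a result, the maximum of both the inner objectives is achieved at a $\delta$ which minimizes $\hat f(x + \delta)\, g(x)$. Let $\delta_x$ be the point at which the maximum is achieved. Then the above expression can be written as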

 T2=λ(ℓ(¯f(x+δx),g(x))−ℓ(