Adversarial training augments the training set with perturbations to improve
the robust error (over worst-case perturbations), but it often leads to an
increase in the standard error (on unperturbed test inputs). Previous
explanations for this tradeoff rely on the assumption that no predictor in the
hypothesis class has low standard and robust error. In this work, we precisely
characterize the effect of augmentation on the standard error in linear
regression when the optimal linear predictor has zero standard and robust
error. In particular, we show that the standard error could increase even when
the augmented perturbations have noiseless observations from the optimal linear
predictor. We then prove that the recently proposed robust self-training (RST)
estimator improves robust error without sacrificing standard error for
noiseless linear regression. Empirically, for neural networks, we find that RST
with different adversarial training methods improves both standard and robust
error for random and adversarial rotations and adversarial ℓ_∞
perturbations in CIFAR-10.


Code repository: robust_tradeoff — code for the ICML 2020 paper "Understanding and Mitigating the Tradeoff Between Robustness and Accuracy" by Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John Duchi, and Percy Liang. Paper available at https://arxiv.org/pdf/2002.10716.pdf.

Adversarial training methods (Goodfellow et al., 2015; Madry et al., 2017)
attempt to improve the robustness of neural networks against adversarial
examples (Szegedy et al., 2014)
by augmenting the training set (on-the-fly) with perturbed examples that
preserve the label but that fool the current model.
While such methods decrease the robust error, the error on worst-case
perturbed inputs, they have been observed to cause an undesirable increase in
the standard error, the error on unperturbed inputs (Madry et al., 2018; Zhang et al., 2019; Tsipras et al., 2019).

Previous works attempt to explain the tradeoff between standard error
and robust error in two settings: when no accurate classifier is

consistent with the perturbed
data (Tsipras et al., 2019; Zhang et al., 2019; Fawzi et al., 2018), and when the hypothesis class is not expressive
enough to contain the true classifier (Nakkiran, 2019).
In both cases, the tradeoff persists even with infinite data.
However, adversarial perturbations in practice are typically defined to be
imperceptible to humans (e.g. small ℓ∞ perturbations in
vision). Hence by definition, there exists a classifier (the human)
that is both robust and accurate with no tradeoff in
the infinite data limit. Furthermore, since deep neural networks are
expressive enough to fit not only adversarial but also randomly
labeled data perfectly (Zhang et al., 2017), the explanation
of a restricted hypothesis class does not perfectly capture empirical
observations either. Empirically, on Cifar-10 we find that the gap between the standard error of adversarial training
and that of standard training decreases as the labeled data size increases, suggesting that the tradeoff could disappear with infinite data (see Figure 1).

Figure 2:

We consider function interpolation via cubic splines.

(Left) The underlying distribution Px denoted by sizes of the circles. The true function is a staircase.
(Middle) With a small number of standard training samples (purple circles), an augmented estimator that fits local perturbations (green crosses) has a large error. In contrast, the standard estimator, which does not fit perturbations, is a simple straight line and has small error.
(Right) Robust self-training (RST) regularizes the predictions of an augmented estimator towards the predictions of the standard estimator thereby obtaining both small error on test points and their perturbations.

In this work, we provide a different explanation for the tradeoff
between standard and robust error that takes generalization
from finite data into account. We first consider a linear model where the true linear function
has zero standard and robust error. Adversarial training augments the
original training set with extra data, consisting of samples
(xext,y) where the perturbations xext are consistent,
meaning that the conditional distribution stays constant
Py(⋅∣xext)=Py(⋅∣x).
We show that even in this simple setting, the augmented estimator, i.e. the minimum norm
interpolant of the augmented data (standard + extra data),
could have a larger standard error than that of the standard estimator,
which is the minimum norm interpolant of the standard data alone.
We found this surprising given that adding consistent perturbations
enforces the predictor to satisfy invariances that the true model
exhibits. One might think adding this information would only restrict the hypothesis class and thus enable better generalization, not worse.

We show that this tradeoff stems from overparameterization. If the restricted hypothesis class (by
enforcing invariances) is still overparameterized, the inductive bias of the estimation procedure (e.g., the norm being minimized) plays a key role in determining the generalization of a model.

Figure 2 shows an illustrative example of this phenomenon
with cubic smoothing splines.
The predictor obtained via standard training (dashed blue) is a line that captures the global
structure and obtains low error. Training on augmented data with
locally consistent perturbations of the training data (crosses)
restricts the hypothesis class by encouraging the predictor to fit the
local structure of the high density points. Within this set,
the cubic splines predictor (solid orange) minimizes the second
derivative on the augmented data, compromising the global structure
and performing badly on the tails (Figure 2(b)).
More generally, as we characterize in Section 3, the
tradeoff stems from the inductive bias of the minimum norm
interpolant, which minimizes a fixed norm independent of the data, while the standard error depends on the geometry of the
covariates.

Recent works (Carmon et al., 2019; Najafi et al., 2019; Uesato et al., 2019) introduced robust self-training (RST), a robust variant of self-training that overcomes the sample complexity barrier of learning a model with low robust error by leveraging extra unlabeled data. In this paper, our theoretical understanding of the tradeoff between standard and robust error in linear regression motivates RST as a method to improve robust error without sacrificing standard error. In Section 4.2, we prove that RST eliminates the tradeoff
for linear regression—RST does not increase standard error compared to the standard estimator
while simultaneously achieving the best possible robust error, matching the standard error (see
Figure 2(c) for the effect of RST on the spline problem). Intuitively, RST regularizes the predictions of the robust estimator towards that of the standard estimator on the unlabeled data thereby eliminating the tradeoff.

As previous works only focus on the empirical evaluation of the gains in robustness via RST, we systematically
evaluate the effect of RST on both the standard and robust error on Cifar-10 when using unlabeled data from Tiny Images as sourced in Carmon et al. (2019).
We expand upon empirical results in two ways. First, we study the effect of the labeled training set size and find that RST improves both robust and standard error over vanilla adversarial training across all sample sizes. RST offers the largest gains at smaller sample sizes, where vanilla adversarial training increases the standard error the most. Second, we consider an additional family of perturbations, random and adversarial rotations/translations, and find that RST offers gains in both robust and standard error.

2 Setup

We consider the problem of learning a mapping from an input x∈X⊆Rd to a target y∈Y.
For our theoretical analysis, we focus on regression where Y=R while our empirical studies consider general Y.
Let Pxy be the underlying distribution, Px the marginal on the inputs and Py(⋅∣x) the conditional distribution of the targets given inputs.
Given n training pairs (xi,yi)∼Pxy,
we use Xstd to denote the measurement matrix [x1,x2,…,xn]⊤∈Rn×d and ystd the target vector [y1,y2,…,yn]⊤∈Rn.
Our goal is to learn a predictor fθ:X→Y that has (i) low standard error on inputs x and (ii) low robust error with respect to a set of perturbations T(x). Formally, for a loss function ℓ, the error metrics for a predictor fθ are

Lstd(fθ) = EPxy[ℓ(fθ(x), y)],

(1)

Lrob(fθ) = EPxy[max_{xadv∈T(x)} ℓ(fθ(xadv), y)].

(2)

Such transformations T(x) may consist of small rotations, horizontal flips, brightness or contrast changes (Krizhevsky et al., 2012; Yaeger et al., 1996), small ℓp perturbations in vision (Szegedy et al., 2014; Goodfellow et al., 2015), or word synonym replacements in NLP (Jia & Liang, 2017; Alzantot et al., 2018).

Noiseless linear regression.

In Section 3, we analyze noiseless linear regression on inputs x with targets y=x⊤θ⋆ for a true parameter θ⋆∈Rd (our analysis extends naturally to arbitrary feature maps ϕ(x)). For linear regression, ℓ is the squared loss, so the standard error (Equation 1) takes the form

Lstd(θ)=EPx[(x⊤θ−x⊤θ⋆)2]=(θ−θ⋆)⊤Σ(θ−θ⋆),

(4)

where Σ=EPx[xx⊤] is the population covariance.
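As a quick sanity check on Equation (4), the identity EPx[(x⊤θ − x⊤θ⋆)²] = (θ−θ⋆)⊤Σ(θ−θ⋆) can be verified by Monte Carlo; the particular covariance and parameters below are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # population covariance (illustrative)
theta_star = np.array([1.0, -1.0])           # true parameter (illustrative)
theta = np.array([0.5, 0.0])
delta = theta - theta_star

# Closed form: (theta - theta_star)^T Sigma (theta - theta_star)
closed = delta @ Sigma @ delta

# Monte Carlo estimate of E[(x^T theta - x^T theta_star)^2] with x ~ N(0, Sigma)
x = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)
mc = np.mean((x @ delta) ** 2)

print(closed, mc)  # the two agree up to sampling error
```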

Minimum norm estimators.

In this work, we focus on interpolating estimators in highly overparameterized models, motivated by modern machine learning models that achieve near zero training loss (on both standard and extra data). Interpolating estimators for linear regression have been studied in many recent works such as (Ma et al., 2018; Belkin et al., 2018; Hastie et al., 2019; Liang & Rakhlin, 2018; Bartlett et al., 2019).
We present our results for interpolating estimators with minimum Euclidean norm, but our analysis directly applies to more
general Mahalanobis norms via suitable reparameterization (see
Appendix A).

We consider robust training approaches that augment the standard training data (Xstd, ystd) ∈ Rn×d × Rn with extra training data (Xext, yext) ∈ Rm×d × Rm, where the rows of Xext consist of vectors in the set {xext : xext ∈ T(x), x ∈ Xstd}. (In practice, Xext is typically generated via iterative optimization, as in adversarial training (Madry et al., 2018), or by random sampling, as in data augmentation (Krizhevsky et al., 2012; Yaeger et al., 1996).) We refer to the standard data together with the extra data as the augmented data. We compare the following min-norm estimators: (i) the standard estimator ^θstd interpolating [Xstd, ystd] and (ii) the augmented estimator ^θaug interpolating X = [Xstd; Xext], Y = [ystd; yext]:

^θstd = argminθ {∥θ∥2 : Xstdθ = ystd},

^θaug = argminθ {∥θ∥2 : Xstdθ = ystd, Xextθ = yext}.

(5)
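The estimators in Equation (5) can be computed with the Moore–Penrose pseudoinverse, which returns the minimum Euclidean norm solution of an underdetermined linear system. A minimal numpy sketch (the data here is an illustrative choice, anticipating the 3D example of Section 3.1):

```python
import numpy as np

def min_norm_interpolant(X, y):
    """Minimum l2-norm theta satisfying X @ theta = y (X assumed full row rank)."""
    return np.linalg.pinv(X) @ y

theta_star = np.array([1.0, 0.1, 0.5])   # true parameter (illustrative)
X_std = np.array([[0.0, 0.0, 1.0]])      # standard data: one measurement
X_ext = np.array([[1.0, 1.0, 0.0]])      # extra (augmented) data
y_std = X_std @ theta_star               # noiseless targets
y_ext = X_ext @ theta_star

theta_std = min_norm_interpolant(X_std, y_std)
theta_aug = min_norm_interpolant(np.vstack([X_std, X_ext]),
                                 np.concatenate([y_std, y_ext]))

# Both interpolate the standard data; only theta_aug fits the extra point.
print(theta_std)  # -> (0, 0, 0.5) on this data
print(theta_aug)  # -> (0.55, 0.55, 0.5) on this data
```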

Notation.

For any vector z∈Rn, we use zi to denote the ith coordinate of z.

3 Analysis in the linear regression setting

In this section, we compare the standard errors of the standard
estimator and the augmented estimator in noiseless linear
regression. We begin with a simple toy example that describes the
intuition behind our results (Section 3.1) and
provide a more complete characterization in
Section 3.2. This section focuses only on the standard
error of both estimators; we revisit the robust error together with the standard error in
Section 4.

3.1 Simple illustrative problem

We consider a simple example in 3D where θ⋆∈R3
is the true parameter.
Let e1=[1,0,0];e2=[0,1,0];e3=[0,0,1] denote the standard basis vectors in R3. Suppose
we have one point in the standard training data Xstd=[0,0,1].
By definition (5), ^θstd satisfies Xstd^θstd=ystd and hence (^θstd)3=θ⋆3.
However, ^θstd is unconstrained on the subspace spanned by e1,e2 (the nullspace Null(Xstd)).
The min-norm objective chooses the solution with (^θstd)1=(^θstd)2=0.
Figure 3 visualizes the projection of various quantities on Null(Xstd). For simplicity of presentation, we omit the projection operator in the figure. The projection of ^θstd onto Null(Xstd) is the blue dot at the origin, and the parameter error θ⋆−^θstd is the projection of θ⋆ onto Null(Xstd).

Effect of augmentation on parameter error.

Suppose we augment with an extra data point Xext=[1,1,0]=e1+e2 which lies in Null(Xstd) (black dashed line in Figure 3).
The augmented estimator ^θaug still fits the standard data Xstd and thus (^θaug)3=θ⋆3=(^θstd)3. Due to fitting the extra data Xext, ^θaug (orange vector in Figure 3) must also satisfy an additional constraint Xext^θaug=Xextθ⋆.
The crucial observation is that additional constraints along one direction (e1+e2 in this case) could actually increase parameter error along other directions.
For example, let’s consider the direction e2 in Figure 3. Note that fitting Xext makes ^θaug have a large component along e2. Now if θ⋆2 is small (precisely, θ⋆2<θ⋆1/3), ^θaug has a larger parameter error along e2 than ^θstd, which was simply zero (Figure 3 (a)). Conversely, if the true component θ⋆2 is large enough (precisely, θ⋆2>θ⋆1/3), the parameter error of ^θaug along e2 is smaller than that of ^θstd.

Effect of parameter error on standard error.

The contribution of different components of the parameter error to the standard error is scaled by the population covariance Σ (see Equation 4). For simplicity, let Σ=diag([λ1,λ2,λ3]).
In our example, the parameter error along e3 is zero since both estimators interpolate the
standard training point Xstd=e3.
Then, the ratio between
λ1 and λ2 determines which component of the
parameter error contributes more to the standard error.

When is Lstd(^θaug) > Lstd(^θstd)?

Putting the two effects together, we see that when θ⋆2 is small as in Figure 3(a), ^θaug has larger parameter error than ^θstd in the direction e2. If λ2≫λ1, error in e2 is weighted much more heavily in the standard error, and consequently ^θaug would have a larger standard error.
Precisely, we have

Lstd(^θaug) − Lstd(^θstd) = (λ1+λ2)(θ⋆1−θ⋆2)²/4 − (λ1(θ⋆1)² + λ2(θ⋆2)²),

which is positive, for example, when θ⋆2 = 0 and λ2 > 3λ1.
We present a formal characterization of this tradeoff in general in the next section.
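The case of Figure 3(a) can also be checked numerically: with θ⋆2 < θ⋆1/3 and λ2 ≫ λ1, the augmented estimator below has strictly larger standard error than the standard estimator (the specific numbers are an illustrative choice consistent with the 3D example above):

```python
import numpy as np

theta_star = np.array([1.0, 0.1, 0.5])   # theta*_2 < theta*_1 / 3
Sigma = np.diag([1.0, 100.0, 1.0])       # lambda_2 >> lambda_1

X_std = np.array([[0.0, 0.0, 1.0]])      # one standard point (e3)
X_ext = np.array([[1.0, 1.0, 0.0]])      # extra point e1 + e2
X_aug = np.vstack([X_std, X_ext])

# Min-norm interpolants of the noiseless targets
theta_std = np.linalg.pinv(X_std) @ (X_std @ theta_star)
theta_aug = np.linalg.pinv(X_aug) @ (X_aug @ theta_star)

def L_std(theta):
    """Standard error (Equation 4)."""
    d = theta - theta_star
    return d @ Sigma @ d

print(L_std(theta_std))   # small
print(L_std(theta_aug))   # larger: fitting e1 + e2 hurts along e2
```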

Figure 3:
Illustration of the 3-D example described in Sec. 3.1.
(a)-(b) Effect of augmentation on parameter error for different θ⋆. We show the projections of the standard estimator ^θstd (blue circle), augmented estimator ^θaug (orange arrow), and true parameters θ⋆ (black arrow) on Null(Xstd), spanned by e1 and e2. For simplicity of presentation, we omit the projection operator in the figure labels. Depending on θ⋆, the parameter error of ^θaug along e2 could be larger or smaller than the parameter error of ^θstd along e2.
(c)–(d) Dependence of space of safe augmentations on Σ.
Visualization of the space of extra data points xext (orange) that do not cause an increase in the standard
error for the illustrated θ⋆ (black vector), as a result of Theorem 1.

3.2 General characterizations

In this section, we precisely characterize when the augmented estimator ^θaug that fits extra training data points Xext in addition to the standard points Xstd has higher standard error than the standard estimator ^θstd that only fits Xstd. In particular, this enables us to understand when there is a “tradeoff” where the augmented estimator ^θaug has lower robust error than ^θstd by virtue of fitting perturbations, but has higher standard error. In Section 3.1, we illustrated how the parameter error of ^θaug could be larger than ^θstd in some directions, and if these directions are weighted heavily in the population covariance Σ, the standard error of ^θaug would be larger.

Formally, let us define the parameter errors Δstd := θ⋆ − ^θstd and Δaug := θ⋆ − ^θaug. Recall that the standard errors are

Lstd(^θstd) = Δstd⊤ Σ Δstd,  Lstd(^θaug) = Δaug⊤ Σ Δaug,

(6)

where Σ is the population covariance of the underlying inputs drawn from Px.

To characterize the effect of the inductive bias of minimum norm interpolation on the standard errors, we define the following projection operators: Π⊥std, the projection matrix onto Null(Xstd) and Π⊥aug, the projection matrix onto Null([Xext;Xstd]) (see formal definition in Appendix B). Since ^θaug and ^θstd are minimum norm interpolants, Π⊥std^θstd=0 and Π⊥aug^θaug=0. Further, in noiseless linear regression, ^θstd and ^θaug have no error in the span of Xstd and [Xstd;Xext] respectively. Hence,

Δstd=Π⊥stdθ⋆,Δaug=Π⊥augθ⋆.

(7)

Our main result relies on the key observation that for any vector u, Π⊥std u can be decomposed into a sum of two orthogonal components v and w such that
Π⊥std u = v + w, with w = Π⊥aug u and v = Π⊥std Πaug u, where Πaug = I − Π⊥aug. This is because Null([Xstd; Xext]) ⊆ Null(Xstd) and thus Π⊥std Π⊥aug = Π⊥aug.
Now setting u=θ⋆ and using the error expressions in Equation 6
and Equation 7 gives a precise characterization of the difference in the standard
errors of ^θstd and ^θaug.

Theorem 1.

The difference in the standard errors of the standard estimator ^θstd and the augmented estimator ^θaug can be written as

Lstd(^θstd) − Lstd(^θaug) = v⊤Σv + 2w⊤Σv,

(8)

where v = Π⊥std Πaug θ⋆ and w = Π⊥aug θ⋆.

The proof of Theorem 1 is in
Appendix B.3. The increase in standard error of the augmented estimator can be understood in terms of the vectors w and v defined in Theorem 1. The first term v⊤Σv is always
nonnegative and corresponds to the decrease in the standard error of the augmented
estimator ^θaug by virtue of fitting extra training points in some directions. However, the second term 2w⊤Σv can be negative, and it intuitively measures the cost of a possible increase in the parameter error along other directions (similar to the increase along e2 in the simple setting of Figure 3(a)). When the cost outweighs the benefit, the standard error of ^θaug is larger. Note that both the cost and the benefit are determined by Σ, which governs how the parameter error affects the standard error.
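Theorem 1 is easy to verify numerically: building v = Π⊥std Πaug θ⋆ and w = Π⊥aug θ⋆ from explicit projection matrices, the identity in Equation (8) holds exactly. The random data below is purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, m = 8, 3, 2
X_std = rng.normal(size=(n, d))
X_ext = rng.normal(size=(m, d))
X_aug = np.vstack([X_std, X_ext])
theta_star = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T                          # arbitrary PSD population covariance

def null_proj(X):
    """Projection matrix onto Null(X)."""
    return np.eye(X.shape[1]) - np.linalg.pinv(X) @ X

P_std, P_aug = null_proj(X_std), null_proj(X_aug)

# Min-norm interpolants of the noiseless targets
theta_std = np.linalg.pinv(X_std) @ (X_std @ theta_star)
theta_aug = np.linalg.pinv(X_aug) @ (X_aug @ theta_star)

def L(theta):
    delta = theta - theta_star
    return delta @ Sigma @ delta

w = P_aug @ theta_star
v = P_std @ (np.eye(d) - P_aug) @ theta_star   # Pi_std^perp Pi_aug theta*
lhs = L(theta_std) - L(theta_aug)
rhs = v @ Sigma @ v + 2 * w @ Sigma @ v
print(np.isclose(lhs, rhs))  # True
```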

We can use the above expression (Theorem 1) for the difference in standard errors of ^θaug and ^θstd to characterize different “safe” conditions under which augmentation with extra data does not increase the standard error. See Appendix B.7 for a proof.

Corollary 1.

The following conditions are sufficient for Lstd(^θaug)≤Lstd(^θstd), i.e.
the standard error does not increase when fitting augmented data.

1. The population covariance Σ is the identity.

2. The augmented data [Xstd; Xext] spans the entire space, or equivalently Π⊥aug = 0.

3. The extra data xext ∈ Rd is a single point such that xext is an eigenvector of the population covariance Σ.

We would like to draw special attention to the first condition. When
Σ=I, notice that the norm that governs the standard error
(Equation 6) matches the norm that is minimized by the
interpolants (Equation 5). Intuitively, the estimators
have the “right” inductive bias; under this condition, the augmented
estimator ^θaug does not have higher standard error. In other
words, the observed increase in the standard error of ^θaug can be
attributed to the “wrong” inductive bias. In Section 4,
we will use this understanding to propose a method of robust training
which does not increase standard error over standard training.

Safe extra points.

We use Theorem 1 to plot the safe extra points xext ∈ Rd that do not lead to an increase in standard error for a given θ⋆ in the simple 3D setting of Section 3.1, for two different Σ (Figure 3 (c), (d)).
The safe points lie in cones that contain the eigenvectors of Σ (as expected from Corollary 1). The width and alignment of the cones depend on the alignment between θ⋆ and the eigenvectors of Σ.
We now tie our analysis back to the cubic splines interpolation problem from Figure 2. The inputs can be appropriately rotated and scaled such that the cubic spline interpolant is the minimum Euclidean norm interpolant (as in Equation 5). Under this transformation, the eigenvectors of the nullspace of the training data Null(Xstd) represent either "local" high-frequency components with small eigenvalues or "global" low-frequency components with large eigenvalues (see Figure 4). An augmentation that encourages fitting the local components in Null(Xstd) could increase the error along the global components (like the increase in error along e2 in Figure 3(a)). Such an increase, coupled with the fact that the global components have larger eigenvalues in Σ, results in the standard error of ^θaug being larger than that of ^θstd. See Figure 8 and Appendix C.3.1 for more details. This is similar to the recent observation that adversarial training with ℓ∞ perturbations encourages neural networks to fit the high-frequency components of the signal while compromising on the low-frequency components (Yin et al., 2019).

Model complexity.

Finally, we relate the magnitude of increase in standard error of the augmented estimator to the complexity of the true model.

Proposition 1.

For a given Xstd,Xext,Σ,

Lstd(^θaug)−Lstd(^θstd)>c⟹∥θ⋆∥22−∥^θstd∥22>γc

for some scalar γ>0 that depends on Xstd,Xext,Σ.

In other words, for a large increase in standard error upon augmentation, the true parameter θ⋆ needs to be sufficiently more complex (in the ℓ2 norm) than the standard estimator ^θstd. For example, the construction of the cubic splines interpolation problem relies on the
underlying function (staircase) being more complex with additional
local structure than the standard estimator—a linear function that
fits most points and can be learned with few
samples. Proposition 1 states that this
requirement holds more generally. The proof of
Proposition 1 appears in
Appendix B.5.
A similar intuition can be used to construct an example where augmentation can increase standard error for minimum ℓ1-norm interpolants when θ⋆ is dense (Appendix G).

4 Robust self-training

We now use insights from Section 3
to construct estimators with low robust error without increasing the
standard error. While Section 3 characterized the effect of
adding extra data Xext in general, in this section
we consider robust training which augments the dataset with extra data Xext that are consistent perturbations of the standard training data Xstd.

Since the standard estimator has small standard error, a natural strategy
to mitigate the tradeoff is to regularize the augmented estimator
to be closer to the standard estimator. The choice of distance between the estimators we regularize is very important.
Recall from Section 3.1 that the population covariance Σ determines how the parameter error affects the standard error. This suggests using a regularizer that incorporates information about Σ.

We first revisit the recently proposed robust self-training (RST) (Carmon et al., 2019; Najafi et al., 2019; Uesato et al., 2019) that incorporates additional unlabeled data via pseudo-labels from a standard estimator. Previous work only focused on the effectiveness of RST in improving the robust error. In Section 4.2, we prove that in linear regression, RST eliminates the tradeoff between standard and robust error (Theorem 2). The proof hinges on the connection between RST and the idea of regularizing towards the standard estimator discussed above. In particular, we show that the RST objective can be rewritten as minimizing a suitable Σ-induced distance to the standard estimator.

In Section 4.3, we expand upon
previous empirical RST results for Cifar-10 across various training set sizes and perturbations (rotations/translations in addition to ℓ∞).
We observe that across all settings, RST substantially improves the standard error while also
improving the robust error over the vanilla supervised robust training counterparts.

4.1 General formulation of RST

We first describe the general two-step robust self-training (RST) procedure (Carmon et al., 2019; Uesato et al., 2019) for a parameteric model fθ:

1. Perform standard training on the labeled data {(xi, yi)}_{i=1}^n to obtain
^θstd = argminθ ∑_{i=1}^n ℓ(fθ(xi), yi).

2. Perform robust training on both the labeled data and the unlabeled inputs {~xi}_{i=1}^m with pseudo-labels ~yi = f_{^θstd}(~xi) generated from the standard estimator ^θstd.

The second stage typically involves a combination of the standard loss ℓ and a robust loss ℓrob. The robust loss encourages invariance of the model over perturbations T(x), and is generally defined as

ℓrob(fθ(xi), yi) = max_{xadv ∈ T(xi)} ℓ(fθ(xadv), yi).

(9)

It is convenient to summarize the robust self-training estimator ^θrst as the minimizer of a weighted combination of four separate losses as follows.
We define the losses on the labeled dataset {(xi, yi)}_{i=1}^n as

^Lstd-lab(θ) = (1/n) ∑_{i=1}^n ℓ(fθ(xi), yi),

^Lrob-lab(θ) = (1/n) ∑_{i=1}^n ℓrob(fθ(xi), yi).

The losses on the unlabeled samples {~xi}_{i=1}^m, which are pseudo-labeled by the standard estimator, are

^Lstd-unlab(θ; ^θstd) = (1/m) ∑_{i=1}^m ℓ(fθ(~xi), f_{^θstd}(~xi)),

^Lrob-unlab(θ; ^θstd) = (1/m) ∑_{i=1}^m ℓrob(fθ(~xi), f_{^θstd}(~xi)).

Putting it all together, we have

^θrst := argminθ ( α ^Lstd-lab(θ) + β ^Lrob-lab(θ) + γ ^Lstd-unlab(θ; ^θstd) + λ ^Lrob-unlab(θ; ^θstd) ),

(10)

for fixed scalars α, β, γ, λ ≥ 0.
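For a linear model with squared loss and ℓ∞ perturbations T(x) = {x + δ : ∥δ∥∞ ≤ ε}, the inner maximization in the robust losses has a closed form, max_δ ((x+δ)⊤θ − y)² = (|x⊤θ − y| + ε∥θ∥1)², so the four-term RST objective of Equation (10) can be written down directly. A minimal numpy sketch under these assumptions (the data, weights, and ε are placeholders, not values from the paper):

```python
import numpy as np

def sq_loss(pred, y):
    return np.mean((pred - y) ** 2)

def robust_sq_loss(X, theta, y, eps):
    # Closed-form worst case over ||delta||_inf <= eps for a linear model:
    # max_delta ((x + delta)^T theta - y)^2 = (|x^T theta - y| + eps * ||theta||_1)^2
    return np.mean((np.abs(X @ theta - y) + eps * np.abs(theta).sum()) ** 2)

def rst_objective(theta, X_lab, y_lab, X_unlab, theta_std,
                  eps, alpha, beta, gamma, lam):
    """Weighted combination of the four RST losses (Equation 10)."""
    y_pseudo = X_unlab @ theta_std           # pseudo-labels from the standard model
    return (alpha * sq_loss(X_lab @ theta, y_lab)
            + beta * robust_sq_loss(X_lab, theta, y_lab, eps)
            + gamma * sq_loss(X_unlab @ theta, y_pseudo)
            + lam * robust_sq_loss(X_unlab, theta, y_pseudo, eps))

# Placeholder example: two labeled points, the same points reused as "unlabeled".
X = np.array([[1.0, 0.0], [0.0, 1.0]])
theta = np.array([1.0, 2.0])
y = X @ theta
print(rst_objective(theta, X, y, X, theta, 0.1, 1.0, 1.0, 1.0, 1.0))
```

The robust loss always upper bounds the standard loss, so the objective penalizes both fitting the data and sensitivity to perturbations.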

4.2 Robust self-training for linear regression

We now return to the noiseless linear regression as
described in Section 2 and specialize the general RST estimator described in Equation (10) to this setting. We prove that RST eliminates the decrease in standard error in this setting while achieving low robust error by showing that RST appropriately regularizes the augmented estimator towards the standard estimator.

Our theoretical results hold for RST procedures where the pseudo-labels are generated from any interpolating estimator θint-std satisfying Xstd θint-std = ystd. This includes, but is not restricted to, the minimum-norm standard estimator ^θstd defined in (5). We use the squared loss as the loss function ℓ.
For consistent perturbations T(⋅), we analyze the following RST estimator for linear regression

^θrst = argminθ { Lstd-unlab(θ; θint-std) : Lrob-unlab(θ) = 0, ^Lstd-lab(θ) = 0, ^Lrob-lab(θ) = 0 }.

(11)

Figure 5 shows the four losses of RST in this special case of linear regression.

Obtaining this specialized estimator from the general RST estimator in Equation (10) involves the following steps. First, for convenience of analysis, we assume access to the population covariance Σ via infinite unlabeled data, and thus replace the finite-sample losses on the unlabeled data ^Lstd-unlab(θ), ^Lrob-unlab(θ) by their population counterparts Lstd-unlab(θ), Lrob-unlab(θ). Second, the general RST objective minimizes a weighted combination of four losses. When specializing to noiseless linear regression, since ^Lstd-lab(θ⋆) = 0, rather than minimizing α ^Lstd-lab(θ), we set the coefficients such that the estimator satisfies the hard constraint ^Lstd-lab(θ) = 0. This constraint, which enforces interpolation on the labeled dataset (yi = xi⊤θ for all i = 1, …, n), allows us to rewrite the robust loss (Equation 9) on the labeled examples equivalently as a self-consistency loss defined independently of the labels:

^Lrob-lab(θ) = (1/n) ∑_{i=1}^n max_{xadv ∈ T(xi)} (xi⊤θ − xadv⊤θ)².

Since θ⋆ is invariant on perturbations T(x) by definition, we have ^Lrob-lab(θ⋆)=0 and thus we introduce a constraint ^Lrob-lab(θ)=0 in the estimator.

For the losses on the unlabeled data, since the pseudo-labels are not perfect, we minimize Lstd-unlab in the objective instead of enforcing a hard constraint on Lstd-unlab. However, similarly to the robust loss on labeled data, we can reformulate the robust loss on unlabeled samples Lrob-unlab as a self-consistency loss that does not use pseudo-labels. By definition, Lrob-unlab(θ⋆)=0 and thus we enforce Lrob-unlab(θ)=0 in the specialized estimator.

We now study the standard and robust error of the linear regression RST estimator defined above in Equation (11).

Theorem 2.

Assume the noiseless linear model y=x⊤θ⋆.
Let θint-std be an arbitrary interpolant of the standard data, i.e. Xstd θint-std = ystd.
Then the RST estimator (11) satisfies Lstd(^θrst) ≤ Lstd(θint-std) and Lrob(^θrst) = Lstd(^θrst).

The crux of the proof is that the optimization objective of RST is an inductive bias that regularizes the estimator to be close to the standard estimator, weighing directions by their contribution to the standard error via Σ.
To see this, we rewrite

Lstd-unlab(θ; θint-std) = EPx[(~x⊤θint-std − ~x⊤θ)²] = (θint-std − θ)⊤ Σ (θint-std − θ).

By incorporating an appropriate Σ-induced regularizer while satisfying constraints on the robust losses, RST ensures that the standard error of the estimator never exceeds the standard error of ^θstd. The robust error of any estimator is lower bounded by its standard error, and this gap can be arbitrarily large for the standard estimator. However, the robust error of the RST estimator matches its standard error, which in turn is bounded by the standard error of the standard estimator and hence is small. For graphical intuition, see Figure 2, which visualizes the RST estimator on the cubic splines interpolation problem that exemplifies the increase in standard error upon augmentation. RST captures the global
structure and obtains low standard error by matching ^θstd (a straight line) on unlabeled inputs. Simultaneously, RST enforces invariance to local transformations on both labeled and unlabeled inputs, and obtains low robust error by capturing the local structure across the domain.

Implementation of linear RST.

The constraint on the standard loss on labeled data simply corresponds to interpolation on the standard labeled data. The constraints on the robust self-consistency losses involve a maximization over a set of transformations. In the case of linear regression, such constraints can be equivalently represented by a set of at most d linear constraints, where d is the dimension of the covariates. Further, with this finite set of constraints, we only require access to the covariance Σ in order to constrain the population robust loss. Appendix D gives a practical iterative algorithm that computes the RST estimator for linear regression reminiscent of adversarial training in the semi-supervised setting.
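Under the reductions above, the linear RST estimator of Equation (11) is an equality-constrained quadratic program: minimize (θ − θint-std)⊤Σ(θ − θint-std) subject to a finite set of linear constraints Cθ = c (interpolation rows Xstdθ = ystd plus invariance rows (x − xadv)⊤θ = 0). When Σ is positive definite and C has full row rank, it has the standard KKT closed form sketched below; the constraint set here is a hypothetical example, not the iterative algorithm of Appendix D.

```python
import numpy as np

def linear_rst(Sigma, theta_int, C, c):
    """Minimize (theta - theta_int)^T Sigma (theta - theta_int) s.t. C theta = c.

    KKT closed form:
    theta = theta_int + Sigma^{-1} C^T (C Sigma^{-1} C^T)^{-1} (c - C theta_int).
    Assumes Sigma positive definite and C full row rank.
    """
    S_inv_Ct = np.linalg.solve(Sigma, C.T)
    lam = np.linalg.solve(C @ S_inv_Ct, c - C @ theta_int)
    return theta_int + S_inv_Ct @ lam

# Hypothetical example: interpolate one labeled point and enforce
# invariance to one perturbation direction (x - x_adv).
Sigma = np.diag([1.0, 100.0, 1.0])
theta_int = np.array([1.0, 0.0, 0.5])    # any interpolant of the standard data
C = np.array([[0.0, 0.0, 1.0],           # X_std theta = y_std
              [1.0, -1.0, 0.0]])         # (x - x_adv)^T theta = 0
c = np.array([0.5, 0.0])

theta_rst = linear_rst(Sigma, theta_int, C, c)
print(theta_rst)  # satisfies both constraints, close to theta_int in Sigma-norm
```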

Table 1 (excerpt), existing baselines using smaller models: Worst-of-10 (Engstrom et al., 2019; smaller ResNet model): 69.2% / 91.3%; Random (Yang et al., 2019; smaller ResNet-32 model): 58.3% / 91.8%.

Table 1:
Performance of robust self-training (RST) applied to different perturbations and adversarial training algorithms.
(Left) Cifar-10 standard and robust test accuracy against ℓ∞ perturbations of size ϵ=8/255. All methods use ϵ=8/255 during training and use the WRN-28-10 model. Robust accuracies are against a PGD-based attack with 20 steps.
(Right) Cifar-10 standard and robust test accuracy against a grid attack of rotations up to 30 degrees and translations up to ∼10% of the image size, following Engstrom et al. (2019). All adversarial and random methods use the same parameters during training and use the WRN-40-2 model.
For both tables, shaded rows make use of 500K unlabeled images from 80M Tiny Images sourced in (Carmon et al., 2019). RST improves both the standard and robust accuracy over the vanilla counterparts for different algorithms (AT and TRADES) and different perturbations (ℓ∞ and rotation/translations).

Figure 6: Effect of data augmentation on test error as we vary the number of training samples.
(a)-(b) We plot the difference in errors of the augmented estimator and standard estimator. In both the spline staircase simulations and data augmentation with adversarial ℓ∞ perturbations via adversarial training (AT) on Cifar-10, the increase in test error decreases as the training sample size increases. In (b), robust self-training (RST+AT) not only mitigates the increase in test error from AT but even improves test error beyond that of the standard estimator.

4.3 Empirical evaluation of RST

Carmon et al. (2019) empirically evaluate RST with a focus on studying gains in the robust error. In this work, we focus on both the standard and robust error and expand upon results from previous work. Carmon et al. (2019) used TRADES (Zhang et al., 2019) as the robust loss in the general RST formulation (10); we additionally evaluate RST with Projected Gradient Adversarial Training (AT) (Madry et al., 2018) as the robust loss. Carmon et al. (2019) considered ℓ∞ and ℓ2 perturbations. We study rotations and translations in addition to ℓ∞ perturbations, and also study the effect of labeled training set size on standard and robust error. Table 1 presents the main results. More experiment details appear in Appendix D.3.

Both RST+AT and RST+TRADES achieve lower robust and standard error than their supervised counterparts AT and TRADES across all perturbation types. This mirrors the theoretical analysis of RST in linear regression (Theorem 2), where the RST estimator achieves small robust error while provably never incurring larger standard error than the standard estimator.
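To make the RST recipe concrete in the noiseless linear-regression setting of Theorem 2, the sketch below pseudo-labels unlabeled inputs with the standard minimum-norm estimator and then fits a minimum-norm interpolant on the labeled, pseudo-labeled, and augmented (label-preserving transformed) points. This is a minimal illustration under our own naming, not the paper's exact estimator; in particular, the `transform` argument stands in for a generic consistency-preserving perturbation.

```python
import numpy as np

def min_norm_interpolator(X, y):
    """Minimum l2-norm solution of X theta = y (ridgeless regression)."""
    return np.linalg.pinv(X) @ y

def rst_linear(X_lab, y_lab, X_unlab, transform):
    # Step 1: standard (min-norm) estimator on the labeled data alone.
    theta_std = min_norm_interpolator(X_lab, y_lab)
    # Step 2: pseudo-label the unlabeled inputs with the standard estimator.
    y_pseudo = X_unlab @ theta_std
    # Step 3: robust training -- fit a min-norm interpolant on the labeled
    # data, the pseudo-labeled data, and label-preserving perturbations of
    # the unlabeled inputs (which keep their pseudo-labels).
    X_aug = np.vstack([X_lab, X_unlab, transform(X_unlab)])
    y_aug = np.concatenate([y_lab, y_pseudo, y_pseudo])
    return min_norm_interpolator(X_aug, y_aug)
```

For instance, if `transform` shifts inputs only along a direction on which the true parameter is zero, the augmented system stays consistent and the RST estimator still interpolates the labeled data while being forced to ignore that direction.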

Effect of labeled sample size.

Recall that our work motivates studying the tradeoff between robust and standard error while taking generalization from finite data into account. We showed that the gap between the standard error of a standard estimator and that of a robust estimator is large for small training sets and shrinks as the labeled dataset grows (Figure 1). We now study the effect of RST as we vary the training set size in Figure 6. We find that RST+AT has lower standard error than standard training across all sample sizes for small ϵ, while simultaneously achieving lower robust error than AT (see Appendix E.2.1). In the small-data regime, where vanilla adversarial training hurts the standard error the most, RST+AT gives about 3x more absolute improvement than in the large-data regime. We note that this set of experiments is complementary to the experiments of Schmidt et al. (2018), which study the effect of the training set size only on robust error.

Effect on transformations that do not hurt standard error.

We also test the effect of RST on perturbations where robust training slightly improves standard error rather than hurting it.
Since RST regularizes towards the standard estimator, one might suspect that the improvements from robust training disappear with RST.
In particular, we consider spatial transformations T(x) that consist of simultaneous rotations and translations.
We use two common forms of robust training for spatial perturbations, where we approximately maximize over T(x) with either adversarial (worst-of-10) or random augmentations (Yang et al., 2019; Engstrom et al., 2019). Table 1 (right) presents the results.
In the regime where vanilla robust training does not hurt standard error, RST in fact further improves the standard error by almost 1% and the robust error by 2-3% over the standard and robust estimators for both forms of robust training. Thus, in settings where vanilla robust training improves standard error, RST seems to further amplify the gains, while in settings where vanilla robust training hurts standard error, RST mitigates the harmful effect.
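The worst-of-10 procedure referenced above admits a simple generic sketch: sample k candidate transformations and keep the one with the highest loss under the current model. The callables below (`loss_fn`, `sample_transform`) are placeholders of ours, not the implementation used in the experiments.

```python
import numpy as np

def worst_of_k(x, y, loss_fn, sample_transform, k=10):
    """Approximate max over transformations T of loss(T(x), y) by sampling
    k random candidates and returning the one the model finds hardest."""
    candidates = [sample_transform(x) for _ in range(k)]
    losses = [loss_fn(c, y) for c in candidates]
    return candidates[int(np.argmax(losses))]
```

Random augmentation corresponds to k=1, while the adversarial variant considered here takes the worst of 10 sampled rotation/translation candidates per example.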

Comparison to other semi-supervised approaches.

The RST estimator minimizes both a robust loss and a standard loss on the unlabeled data with pseudo-labels (bottom row, Figure 5). Both of these losses are necessary to simultaneously improve both the standard and robust error over the vanilla supervised robust training. Standard self-training, which only uses standard loss on unlabeled data, has very high robust error (≈100%). Similarly, Robust Consistency Training, an extension of Virtual Adversarial Training (Miyato et al., 2018) that only minimizes a robust self-consistency loss on unlabeled data, marginally improves the robust error but actually hurts standard error. See Table 1.

Complementary methods for robustness and accuracy.

In Table 1, we also report the standard and robust errors of other methods that improve the tradeoff between standard and robust error. Interpolated Adversarial Training (IAT) (Lamb et al., 2019) considers a different training algorithm based on Mixup, and Neural Architecture Search (NAS) (Cubuk et al., 2017) uses reinforcement learning to search for more robust architectures. RST, IAT and NAS are incomparable, as they find different tradeoffs between standard and robust error. However, since RST provides a complementary statistical perspective on the tradeoff, we believe it can be combined with methods like IAT or NAS to yield further gains. We leave this to future work.

5 Conclusion

We studied the commonly observed increase in standard error caused by adversarial training, taking generalization from finite data into account. We showed that augmenting training data with perturbations, as in adversarial training, can surprisingly increase the standard error even in the simple setting of noiseless linear regression where the true linear function has zero standard and robust error. Our analysis reveals that the interplay between the inductive bias of models and the underlying geometry of the inputs causes the standard error to increase even when the augmented data is perfectly labeled. This insight yields a method that provably eliminates the increase in standard error upon augmentation in linear regression, by incorporating an appropriate regularizer based on the geometry of the inputs. While not immediately apparent, we show that this is a special case of the recently proposed robust self-training (RST) procedure that uses additional unlabeled data. Previous works view RST as a method to improve robust error by effectively using more samples. Our work provides some theoretical justification for why RST improves both the standard and robust error, thereby mitigating the tradeoff between accuracy and robustness in practice. How to best utilize unlabeled data, and whether sufficient unlabeled data would completely eliminate the tradeoff, remain open questions.

References

Alzantot et al. (2018)
Alzantot, M., Sharma, Y., Elgohary, A., Ho, B., Srivastava, M., and Chang, K.
Generating natural language adversarial examples.
In Empirical Methods in Natural Language Processing (EMNLP), 2018.

Bartlett et al. (2019)
Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A.
Benign overfitting in linear regression.
arXiv, 2019.

Belkin et al. (2018)
Belkin, M., Ma, S., and Mandal, S.
To understand deep learning we need to understand kernel learning.
In International Conference on Machine Learning (ICML), 2018.

Carmon et al. (2019)
Carmon, Y., Raghunathan, A., Schmidt, L., Liang, P., and Duchi, J. C.
Unlabeled data improves adversarial robustness.
In Advances in Neural Information Processing Systems
(NeurIPS), 2019.

Diamond & Boyd (2016)
Diamond, S. and Boyd, S.
CVXPY: A Python-embedded modeling language for convex
optimization.
Journal of Machine Learning Research (JMLR), 17(83):1–5, 2016.

Engstrom et al. (2019)
Engstrom, L., Tran, B., Tsipras, D., Schmidt, L., and Madry, A.
Exploring the landscape of spatial robustness.
In International Conference on Machine Learning (ICML), pp. 1802–1811, 2019.

Fawzi et al. (2018)
Fawzi, A., Fawzi, O., and Frossard, P.
Analysis of classifiers’ robustness to adversarial perturbations.
Machine Learning, 107(3):481–508, 2018.

Friedman et al. (2001)
Friedman, J., Hastie, T., and Tibshirani, R.
The elements of statistical learning, volume 1.
Springer series in statistics New York, NY, USA: Springer series in
statistics New York, NY, USA:, 2001.

Goodfellow et al. (2015)
Goodfellow, I. J., Shlens, J., and Szegedy, C.
Explaining and harnessing adversarial examples.
In International Conference on Learning Representations
(ICLR), 2015.

Jia & Liang (2017)
Jia, R. and Liang, P.
Adversarial examples for evaluating reading comprehension systems.
In Empirical Methods in Natural Language Processing (EMNLP),
2017.

Krizhevsky et al. (2012)
Krizhevsky, A., Sutskever, I., and Hinton, G. E.
Imagenet classification with deep convolutional neural networks.
In Advances in Neural Information Processing Systems
(NeurIPS), pp. 1097–1105, 2012.


Lamb et al. (2019)
Lamb, A., Verma, V., Kannala, J., and Bengio, Y.
Interpolated adversarial training: Achieving robust neural networks
without sacrificing too much accuracy.
arXiv, 2019.

Ma et al. (2018)
Ma, S., Bassily, R., and Belkin, M.
The power of interpolation: Understanding the effectiveness of SGD
in modern over-parametrized learning.
In International Conference on Machine Learning (ICML), 2018.

Madry et al. (2017)
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A.
Towards deep learning models resistant to adversarial attacks
(published at ICLR 2018).
arXiv, 2017.

Madry et al. (2018)
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A.
Towards deep learning models resistant to adversarial attacks.
In International Conference on Learning Representations
(ICLR), 2018.

Miyato et al. (2018)
Miyato, T., Maeda, S., Ishii, S., and Koyama, M.
Virtual adversarial training: a regularization method for supervised
and semi-supervised learning.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2018.

Najafi et al. (2019)
Najafi, A., Maeda, S., Koyama, M., and Miyato, T.
Robustness to adversarial perturbations in learning from incomplete
data.
In Advances in Neural Information Processing Systems
(NeurIPS), 2019.

Sajjadi et al. (2016)
Sajjadi, M., Javanmardi, M., and Tasdizen, T.
Regularization with stochastic transformations and perturbations for
deep semi-supervised learning.
In Advances in Neural Information Processing Systems
(NeurIPS), pp. 1163–1171, 2016.

Schmidt et al. (2018)
Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A.
Adversarially robust generalization requires more data.
In Advances in Neural Information Processing Systems
(NeurIPS), pp. 5014–5026, 2018.

Szegedy et al. (2014)
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I.,
and Fergus, R.
Intriguing properties of neural networks.
In International Conference on Learning Representations
(ICLR), 2014.

Tsipras et al. (2019)
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A.
Robustness may be at odds with accuracy.
In International Conference on Learning Representations
(ICLR), 2019.

Uesato et al. (2019)
Uesato, J., Alayrac, J., Huang, P., Stanforth, R., Fawzi, A., and Kohli, P.
Are labels required for improving adversarial robustness?
In Advances in Neural Information Processing Systems
(NeurIPS), 2019.

Yaeger et al. (1996)
Yaeger, L., Lyon, R., and Webb, B.
Effective training of a neural network character classifier for word
recognition.
In Advances in Neural Information Processing Systems
(NeurIPS), pp. 807–813, 1996.

Yang et al. (2019)
Yang, F., Wang, Z., and Heinze-Deml, C.
Invariance-inducing regularization using worst-case transformations
suffices to boost accuracy and spatial robustness.
In Advances in Neural Information Processing Systems
(NeurIPS), 2019.

Zagoruyko & Komodakis (2016)
Zagoruyko, S. and Komodakis, N.
Wide residual networks.
In British Machine Vision Conference, 2016.

Zhang et al. (2017)
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O.
Understanding deep learning requires rethinking generalization.
In International Conference on Learning Representations
(ICLR), 2017.

Zhang et al. (2019)
Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I.
Theoretically principled trade-off between robustness and accuracy.
In International Conference on Machine Learning (ICML), 2019.

Appendix A Transformations to handle arbitrary matrix norms

Consider a more general minimum-norm estimator of the following form.
Given inputs X and corresponding targets y as training data, we study the interpolation estimator

^θ = argminθ {θ⊤Mθ : Xθ = y},   (12)

where M is a positive definite (PD) matrix that incorporates prior knowledge about the true model.
For simplicity, we present our results in terms of the ℓ2 norm (ridgeless regression, i.e. M=I in Equation 12). However, all our results hold for arbitrary M-norms via appropriate rotations: given an arbitrary PD matrix M, the rotated covariates x←M−1/2x and rotated parameters θ←M1/2θ maintain y=Xθ, and the M-norm of the parameters simplifies to ∥θ∥2.
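This rotation argument is easy to verify numerically: rotate the covariates by M−1/2, take the plain minimum ℓ2-norm interpolant, and rotate the parameters back; the result matches the closed-form M-norm interpolant M−1X⊤(XM−1X⊤)−1y. A small NumPy sketch with our own function name:

```python
import numpy as np

def min_m_norm(X, y, M):
    """argmin of theta' M theta subject to X theta = y,
    computed via the rotation x <- M^{-1/2} x of Appendix A."""
    w, V = np.linalg.eigh(M)                  # M is symmetric PD
    M_inv_half = V @ np.diag(w ** -0.5) @ V.T
    X_rot = X @ M_inv_half                    # rotated covariates
    theta_rot = np.linalg.pinv(X_rot) @ y     # plain min l2-norm interpolant
    return M_inv_half @ theta_rot             # rotate parameters back
```

Since θ = M−1/2 θ_rot, the constraint Xθ = X_rot θ_rot = y is preserved and θ⊤Mθ = ∥θ_rot∥2², so minimizing the plain ℓ2 norm in the rotated coordinates minimizes the M-norm in the original ones.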

Appendix B Standard error of minimum norm interpolants

B.1 Projection operators

The projection operators Π⊥std and Π⊥aug are formally defined as follows.

Σstd = X⊤stdXstd,   Π⊥std = I − Σ+stdΣstd   (13)

Σaug = X⊤stdXstd + X⊤extXext,   Π⊥aug = I − Σ+augΣaug   (14)
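As a sanity check, both projections can be computed directly from the pseudoinverse. The following sketch (our naming, not from the released code) implements definitions (13)-(14):

```python
import numpy as np

def nullspace_projection(X):
    """Pi_perp = I - Sigma^+ Sigma with Sigma = X'X, as in (13)-(14):
    the orthogonal projection onto the nullspace of X."""
    Sigma = X.T @ X
    return np.eye(X.shape[1]) - np.linalg.pinv(Sigma) @ Sigma

def projections(X_std, X_ext):
    # Pi_perp_std from the original data; Pi_perp_aug from original
    # plus augmented data.
    return (nullspace_projection(X_std),
            nullspace_projection(np.vstack([X_std, X_ext])))
```

Both outputs are idempotent symmetric matrices, and Col(Π⊥aug) ⊆ Col(Π⊥std) since augmentation can only shrink the nullspace.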

b.2 Invariant transformations may have arbitrary nullspace components

We show that the transformations which satisfy the invariance condition (~x−x)⊤θ⋆=0 where ~x∈T(x) is a transformation of x may have arbitrary nullspace components for general transfomation mappings T.
Let Πstd and Π⊥std be the column space and nullspace projections for the original data Xstd.
The invariance condition is equivalent to

(~x−x)⊤θ⋆

=(Πstd(~x−x)+Π⊥std(~x−x))⊤θ⋆=0

(15)

which implies that as long as Π⊥stdθ⋆≠0, then for any choice of nullspace component Π⊥std(~x)∈Null(X⊤stdXstd), there is a choice of Πstd~x which satisfies the condition.
Thus, we consider augmented points Xext with arbitrary components in the nullspace of Xstd.

Recall the decomposition Π⊥stdθ⋆ = v + w, where v = Π⊥stdΠaugθ⋆ and w = Π⊥stdΠ⊥augθ⋆.
Note that the error difference scales with ∥θ⋆∥2, although the sign of the difference does not.

Corollary 1 presents three sufficient conditions under which the standard error of the augmented estimator Lstd(^θaug) is never larger than the standard error of the standard estimator Lstd(^θstd).

When the population covariance Σ=I, from Theorem 1 we see that

Lstd(^θstd) − Lstd(^θaug) = v⊤v + 2w⊤v = v⊤v ≥ 0,   (17)

since v=Π⊥stdΠaugθ⋆ and w=Π⊥augθ⋆ are orthogonal.

When Π⊥aug=0, the vector w in Theorem 1 is 0, and hence we get

Lstd(^θstd) − Lstd(^θaug) = v⊤v ≥ 0.   (18)

We prove the eigenvector condition in Section B.7 which studies the effect of augmenting with a single extra point in general.

The proof of Proposition 1 is based on the following two lemmas that are also useful for characterization purposes in Corollary 2.

Lemma 1.

If a PSD matrix Σ has non-equal eigenvalues, one can find two unit vectors w, v for which the following holds:

w⊤v = 0   and   w⊤Σv ≠ 0.   (19)

Hence, there exists a combination of original and augmentation datasets Xstd, Xext such that condition (19) holds for two directions v∈Col(Π⊥stdΠaug) and w∈Col(Π⊥stdΠ⊥aug)=Col(Π⊥aug).

Note that neither w nor v can be an eigenvector of Σ if both conditions in equation (19) are to hold. Given a population covariance and fixed original and augmentation data for which condition (19) holds, we can now explicitly construct θ⋆ for which augmentation increases standard error.
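Lemma 1's construction can be exhibited with a two-dimensional example: mixing two eigendirections of Σ with distinct eigenvalues yields orthogonal unit vectors that Σ maps to non-orthogonal directions. The specific Σ below is our own toy choice, not one from the paper:

```python
import numpy as np

# A PSD matrix with two distinct eigenvalues (eigenvectors = coordinate axes).
Sigma = np.diag([2.0, 1.0])

# Mix the two eigendirections equally: w and v are orthogonal unit vectors,
# and neither is an eigenvector of Sigma.
v = np.array([1.0, 1.0]) / np.sqrt(2)
w = np.array([1.0, -1.0]) / np.sqrt(2)

orthogonal = w @ v        # ~0.0: the first condition of (19)
coupled = w @ Sigma @ v   # ~0.5 = (2 - 1)/2: nonzero, the second condition
assert abs(orthogonal) < 1e-12 and abs(coupled - 0.5) < 1e-12
```

If Σ had equal eigenvalues it would act as a scalar, making w⊤Σv proportional to w⊤v = 0, which is why the lemma requires non-equal eigenvalues.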

Lemma 2.

Assume Σ, Xstd, Xext are fixed. Then condition (19) holds for two directions v∈Col(Π⊥stdΠaug) and w∈Col(Π⊥stdΠ⊥aug) iff there exists a θ⋆ such that Lstd(^θaug) − Lstd(^θstd) ≥ c for some c>0.
Furthermore, the ℓ2 norm of θ⋆ needs to satisfy the following lower bounds, with c1 := ∥^θaug∥2 − ∥^θstd∥2:

∥θ⋆∥2 − ∥^θaug∥2 ≥ β1c1 + β2c2/c1

∥θ⋆∥2 − ∥^θstd∥2 ≥ (β1+1)c1 + β2c2/c1   (20)

where the βi are constants that depend on Xstd, Xext, Σ.

Proposition 1 follows directly from the second statement of Lemma 2 by minimizing the bound (20) with respect to c1, which is a free parameter to be chosen during the construction of θ⋆ (see the proof of Lemma 2).
The minimum value of the bound is 2√((β1+1)β2c2).
We hence conclude that θ⋆ needs to be sufficiently more complex than a good standard solution, i.e., ∥θ⋆∥22 − ∥^θstd∥22 > γc, where γ>0 is a constant that depends on Xstd, Xext.

B.6 Proof of technical lemmas

In this section we prove the technical lemmas that are used to prove Theorem 1.

Any vector Π⊥stdθ∈Null(Σstd) can be decomposed into orthogonal components Π⊥stdθ = Π⊥stdΠ⊥augθ + Π⊥stdΠaugθ.
Using the minimum-norm property, we can then always decompose the (rotated) augmented estimator ^θaug∈Col(Π⊥aug)=Col(Π⊥stdΠ⊥aug) and true parameter θ⋆ by

^θaug = ^θstd + ∑vi∈ext ζivi

θ⋆ = ^θaug + ∑wj∈rest ξjwj,

where we define "ext" as the set of basis vectors which span Col(Π⊥stdΠaug) and, respectively, "rest" for Null(Σaug). Requiring the standard error increase to be some constant c>0 can be rewritten using identity (16) as follows:

Lstd(^θaug) − Lstd(^θstd) = c

⟺ (∑vi∈ext ζivi)⊤ Σ (∑vi∈ext ζivi) + c = −2 (∑wj∈rest ξjwj)⊤ Σ (∑vi∈ext ζivi)

⟺ (∑vi∈ext ζivi)⊤ Σ (∑vi∈ext ζivi) + c = −2 ∑wj∈rest, vi∈ext ξjζi w⊤jΣvi   (21)

The left-hand side of equation (21) is always positive; hence, for this equality to hold with any c>0, it is necessary that there exists at least one pair i,j such that w⊤jΣvi≠0, and one direction of the iff statement is proved.

For the other direction, we show that if there exist v∈Col(Π⊥stdΠaug) and w∈Col(Π⊥stdΠ⊥aug) for which condition (19) holds (w.l.o.g. we assume that w⊤Σv<0), we can construct a θ⋆ for which inequality (8) in Theorem 1 holds, as follows.

It is then necessary, by our assumption, that ξjζiw⊤jΣvi>0 for at least some i,j. We can then set ζi>0 such that ∥^θaug−^θstd∥2=∥ζ∥2=c1>0, i.e., the augmented estimator is not equal to the standard estimator (otherwise there can be no difference in error, and equality (21) cannot be satisfied for any desired error increase c>0).

The choice of ξ minimizing ∥θ⋆−^θaug∥22 = ∑jξ2j that also satisfies equation (21) is an appropriately scaled vector in the direction of x = W⊤ΣVζ, where we define W := [w1,…,w|rest|] and V :=