# Robust descent using smoothed multiplicative noise

To improve the off-sample generalization of classical procedures minimizing the empirical risk under potentially heavy-tailed data, new robust learning algorithms have been proposed in recent years, with generalized median-of-means strategies being particularly salient. These procedures enjoy performance guarantees in the form of sharp risk bounds under weak moment assumptions on the underlying loss, but typically suffer from a large computational overhead and substantial bias when the data happens to be sub-Gaussian, limiting their utility. In this work, we propose a novel robust gradient descent procedure which makes use of a smoothed multiplicative noise applied directly to observations before constructing a sum of soft-truncated gradient coordinates. We show that the procedure has competitive theoretical guarantees, with the major advantage of a simple implementation that does not require an iterative sub-routine for robustification. Empirical tests reinforce the theory, showing more efficient generalization over a much wider class of data distributions.

## Authors

• 14 publications
• ### Efficient learning with robust gradient descent

Minimizing the empirical risk is a popular training strategy, but for le...
06/01/2017 ∙ by Matthew J. Holland, et al. ∙ 0

• ### Better scalability under potentially heavy-tailed feedback

We study scalable alternatives to robust gradient descent (RGD) techniqu...
12/14/2020 ∙ by Matthew J. Holland, et al. ∙ 0

• ### Better scalability under potentially heavy-tailed gradients

We study a scalable alternative to robust gradient descent (RGD) techniq...
06/01/2020 ∙ by Matthew J. Holland, et al. ∙ 0

• ### Robust classification via MOM minimization

We present an extension of Vapnik's classical empirical risk minimizer (...
08/09/2018 ∙ by Guillaume Lecué, et al. ∙ 0

• ### How Robust is the Median-of-Means? Concentration Bounds in Presence of Outliers

In contrast to the empirical mean, the Median-of-Means (MoM) is an estim...
06/09/2020 ∙ by Pierre Laforgue, et al. ∙ 0

• ### Classification using margin pursuit

In this work, we study a new approach to optimizing the margin distribut...
10/11/2018 ∙ by Matthew J. Holland, et al. ∙ 0

• ### Improved scalability under heavy tails, without strong convexity

Real-world data is laden with outlying values. The challenge for machine...
06/02/2020 ∙ by Matthew J. Holland, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The risk minimization model of learning is ubiquitous in machine learning, and it effectively captures the key facets of any effective learning algorithm: we must have reliable statistical inference procedures, and practical implementations of these procedures. Formulated using the expected loss, or risk

, induced by a loss , where

is the parameter (vector, function, set, etc.) to be learned, and expectation is taken with respect to

. In practice, all we are given is data , and based on this the algorithm outputs some candidate . If is small with high confidence over the random sample, it provides some evidence for good generalization, subject to the assumptions placed on the underlying distribution. The statistical side is important because the risk is always unknown, and the implementation is important since the only we ever have in practice is one we can actually compute given finite data, time, and memory.

The vast majority of popular algorithms used today can be viewed as different implementations of empirical risk minimization (ERM), which admits any minimizer of . From an algorithmic perspective, ERM is ambiguous; there are countless ways to implement the ERM procedure, and important work in recent years has highlighted the fact that a tremendous gap exists between the quality of good and bad ERM solutions [8]

, for tasks as simple as multi-class pattern recognition

[7], let alone tasks with unbounded losses. Furthermore, even tried-and-true implementations such as ERM by gradient descent (ERM-GD) only have appealing guarantees when the data is distributed sharply around the mean in a sub-Gaussian sense, as demonstrated in important work by Lin and Rosasco [15]. These facts are important because ERM is ubiquitous in modern learning algorithms, and heavy-tailed data by no means exceptional [9]. Furthermore, these works suggest that procedures which have been designed to deal with finite samples of heavy-tailed data may be much more efficient than traditional ERM-based approaches, and indeed the theoretical promise of robust learning algorithms is being studied rigorously [13, 17, 18, 19].

#### Review of related work

Here we review the technical literature most closely related to our work. The canonical benchmark to be compared against is ERM-GD, for which Lin and Rosasco [15] in pathbreaking work provide generalization guarantees under sub-Gaussian data. There are naturally two points of interest: (1) How do competing algorithms perform in settings when ERM is optimal? (2) What about robustness to settings in which ERM is sub-optimal? Many interesting robust learning algorithms have been studied in the past few years. One important procedure is from Brownlees et al. [1], based on fundamental results due to Catoni [3]

. The basic idea is to minimize an M-estimator of the risk, namely

 \wwhat =\argmin\wwˆ\loss(\ww) ˆ\loss(\ww) =\argminθ∈\RRn∑i=1ρ(\loss(\ww;\zzi)−θ).

While the statistical guarantees are near-optimal under weak assumptions on the data, and the proxy loss can be computed accurately by an iterative procedure, its definition is implicit, and leads to rather significant computational roadblocks. Even if and and convex, the proxy loss need not be, and the non-linear optimization required by this method can be both unstable and costly in high dimensions.

Another important body of work looks at generalization of the classical “median of means” technique to higher dimensions. From Minsker [20] and Hsu and Sabato [10], the core idea is to partition the data into disjoint subsets , obtain ERM solutions on each subset, and then robustly aggregate these solutions such that poor candidates are effectively ignored. For example, using the geometric median approach of aggregation, we have

 \wwhat =\argmin\wwk∑m=1∥\ww−\wwtilm∥ \wwtilm =\argmin\ww∑i∈\DDm\loss(\ww;\zzi),m=1,…,k.

These robust aggregation methods can be implemented [27], and have appealing formal properties. An application of this technique to construct a robust loss was very recently proposed by Lecué et al. [14]. The main limitation of all these approaches is practical: when sample size is small relative to the number of parameters to be determined, very few subsets can be created, and significant error due to bias occurs; conversely, when is large enough to make many candidates, cheaper and less sophisticated methods often suffice. Furthermore, in the case of Lecué et al. [14] where an expensive sub-routine must be run at every iteration, the computational overhead is substantial.

Also in the recent literature, interesting work has begun to appear looking at “robust gradient descent” algorithms, which is to say steepest descent procedures which utilize a robust estimate of the gradient vector of the risk [5, 6, 24]. The basic idea is as follows. Assuming partial derivatives exist, writing for the risk gradient, we could iteratively solve this task by the following update:

Naturally, this procedure is ideal, since the underlying distribution is never known in practice, meaning is always unknown. As such, we must approximate this objective function and optimize it with incomplete information. In taking a steepest descent approach, all that is required is an accurate approximation of . Instead of first approximating and then using that approximation to infer , computational resources are better spent approximating directly with some data-dependent constructed using the loss gradients , and plugging this in to the iterative update, as

 \wwhat(t+1)=\wwhat(t)−α(t)\rgest(\wwhat(t)). (2)

Once again here, the median-of-means idea pops up in the literature, with Prasad et al. [24] using a robust aggregation of empirical mean estimates of the gradient. That is, after partitioning the data into subsets as before, the estimate vector is constructed as

and substituted within the gradient update (2). While conceptually a very appealing new proposition, there are a number of issues to be overcome. Their theoretical guarantees have statistical error terms which depend only on rather than , where is the dimension of the gradient, but the same error terms also grow with the number of iterations, leading to very slow error rates when the number of iterations grows with , as is required for -good performance (see Remark 11). Furthermore, computing via a geometric median sub-routine introduces the exact same overhead and bias issues as the procedures of Hsu and Sabato [10] just discussed, only that this time these costs are incurred at each step of the gradient descent procedure, and thus these costs and errors accumulate, and can propagate over time. Iterative approximations at each update take time and are typically distribution-dependent, while fast approximations leave a major gap between the estimators studied in theory and those used in practice.

#### Our contributions

To address the limitations of both ERM-GD and the robust alternatives discussed above, we take an approach that allows us to obtain a robust gradient estimate directly, removing the need for iterative approximations, without losing the theoretical guarantees. In this paper, we provide both theoretical and empirical evidence that using the proposed procedure, paying a small price in terms of bias and computational overhead is worth it when done correctly, leading to a large payout in terms of distributional robustness. Key contributions are as follows:

• A practical learning algorithm which can closely mimic ERM-GD when ERM is optimal, but which performs far better under heavy-tailed data when ERM deteriorates.

• Finite-sample risk bounds that hold with high probability for the proposed procedure, under weak moment assumptions on the distribution of the loss gradient.

• We demonstrate the ease of use and flexibility of our procedure in a series of experiments, testing performance using both controlled simulations and real-world datasets, and compared with numerous standard competitors.

#### Content overview

Key concepts and basic technical ideas underlying the proposed algorithm are introduced in section 2. We fill in the details and provide performance guarantees via theoretical analysis in section 3, and reinforce our formal argument with a variety of empirical tests in section 4. Finally, concluding remarks and a look ahead are given in section 5.

## 2 Overview of proposed algorithm

Our proposed procedure can be derived in a few simple steps. Let us begin with a one-dimensional example, in which for random variable

we try to estimate based on sample . Our problem of interest is the setting in which the underlying distribution may be heavy-tailed, but it also may not be, and this information is not available to the learner a priori.

#### Scaling and truncation

The first step involves a very primitive technique for ensuring the bias is small under well-behaved data, all while constraining the impact of outlying points. We re-scale, apply a soft truncation , and then put the truncated arithmetic mean back in the original scale, namely

 snn∑i=1ψ(xis)≈\exxx.

Here

should have the symmetry of an odd function (

), be non-decreasing on , with a slope of as , and be bounded on . A simple example is the hyperbolic tangent function, , but we shall consider other examples shortly. If the scale is set such that is near zero for all but errant observations, the impact of the non-deviant terms to the arithmetic mean will be approximately equal, while the deviant points will have a disproportionately small impact.

#### Noise multiplication

The second step involves applying multiplicative noise, albeit the purpose is rather unique. Let be our independent random noise, generated from a common distribution with . We multiply each datum by , and then pass each modified datum through the truncation function as above, yielding

 ˜x(\mvϵ)=snn∑i=1ψ(xi+ϵixis).

Multiplicative noise has received much attention in recent years in the machine learning literature, in particular with “dropout” in deep neural networks via Bernoulli random variables

[26], and more recent investigations using Gaussian multiplicative noise [21]. In using multiplicative noise with mean , the basic idea is as follows. For typical points, an increase or decrease of a certain small fraction should not change the estimator output much. On the other hand, for wildly deviant points, a push further in the wrong direction is likely to be harmless due to , while a push in the right direction could earn an additional valid point for the estimator.

#### Noise smoothing

In the third and final step, we smooth out the multiplicative noise by taking the expectation of this estimator with respect to the noise distribution. This smoothed version of the estimator, still a random variable dependent on the original sample, is the final estimator of interest, defined

 ˆx\defeq\exx˜x(\mvϵ)=snn∑i=1∫ψ(xi+ϵixis)d\prior(ϵi). (3)

Computationally, in order to obtain to approximate , we will not actually have to generate the and multiply the by , but instead will have to evaluate the integral.

#### Computational matters

Before we move to the high-dimensional setting of interest, how can we actually compute this ? Numerical integration is not appealing as the overhead will be too much for a sub-routine to be repeated many times. Naturally, the computational approach will depend on the noise distribution , and the truncation function . Using recent results in the statistical literature from Catoni and Giulini [4], if we set the truncation function to be

 ψ(u)\defeq⎧⎪⎨⎪⎩u−u3/6,−√2≤u≤√22√2/3,u>√2−2√2/3,u<−√2 (4)

and set the noise distribution to be , then the integral of interest can be given in an explicit form that is simple to compute, requiring no numerical integration or approximation. Written in a general form with shift parameter and scale parameter , we can express the integral as

 \exx\priorψ(a+b√βϵ)=a(a−b22)−a36+\corr(a,b) (5)

where is a correction term that is complicated to write, but extremely simple to implement (see Appendix A.3 for exact form).

#### Proposed learning algorithm

Let us now return to the high-dimensional setting of interest. At any candidate , we can evaluate the and for all points . The heart of our proposal: apply the sub-routine specified in (3) to each coordinate of the loss gradients, which can be computed directly using (5), and plug the resulting “robust gradient estimate” into the usual first-order update (2). Pseudocode for the proposed procedure is provided in Algorithm 1. All operations on vectors in the pseudo-code are element-wise, e.g., , , , and so forth. For readability, we abbreviate .

As a simple example of the guarantees that are available for this procedure, assuming just finite variance of the gradients, and setting

for simplicity, we have that

 \risk(\wwhat(T))−\risk∗≤O(d(log(dδ−1)+dlog(\diametern))n)+O((1−α\SCfactor)T)

with probability at least over the random draw of the sample, where is the dimension of the space the gradient lives in, is the diameter of , and the constant depends only on . Theoretically, these results are competitive with existing state of the art methods cited in the previous section, but with the computational benefits of zero computational error, direct computability, and the fact that per-step computation time is independent of the underlying distribution.

## 3 Theoretical analysis

In this section, we carry out some formal analysis of the generalization performance of Algorithm 1. More concretely, we provide guarantees in the form of high-probability upper bounds on the excess risk achieved by the proposed procedure, given a finite sample of observations, and finite budget of iterations. We start with a general sketch, followed by a precise description of key conditions, representative results, and subsequent discussion. All detailed proofs are given in Appendix A.2.

#### Sketch of the argument

Our approach can be broken down into three straightforward steps:

1. Obtain pointwise error bounds for .

2. Extend Step 1 to obtain error bounds uniform in .

3. Control distance of from minimizer at each step.

For the first step, we can leverage new technical results from Catoni and Giulini [4] to obtain strong guarantees for a novel estimator of the risk gradient, evaluated at an arbitrary, albeit pre-fixed, parameter . With this result established, since Algorithm 1 updates iteratively, and the sequence of parameters is data dependent, bounds which hold for all possible contingencies are required. Since has an infinite number of elements, naive union bounds are useless. However, a rather typical covering number argument offers an appealing solution. As long as a finite number of balls of radius can cover , then we can discretize the space: every is -close to at least one ball, and by the previous step we have strong pointwise guarantees that we can apply to each of the ball centers. Finally, while in practice any learning meachine has no choice but to use the approximate update (2), when the risk function is convex, we can show that the deviation from the ideal procedure (1) can be tightly controlled. Indeed, under strong convexity, the distance between and a minimizer of can be controlled by a sharp statistical error term, and optimization error equivalent to running the ideal gradient descent procedure (1).

#### Notation

Here we organize the key notation used in the remainder of our theoretical analysis and associated proofs (some are re-statements of definitions above). The observable loss is , where is the model from which the learning machine can select parameters , and is the space housing the data sample, . The data distribution is denoted , and the noise distribution featured in our algorithm is . The risk to be minimized is . The risk and loss gradients are respectively and . Estimates of based on observations of are denoted . We frequently use to denote a generic probability measure, typically the product measure induced by the sample, which should be clear from the context. Unless specified otherwise, shall denote the usual norm on Euclidean space. For integer , write .

#### Assumptions with examples

No algorithm can achieve arbitrarily good performance across all possible distributions [25]. In order to obtain meaningful theoretical results, we must place conditions on the underlying distribution, as well as the model and objective functions used. We give concrete examples to illustrate that these assumptions are reasonable, and that they include scenarios that allow for both sub-Gaussian and heavy-tailed data.

1. is a closed, convex subset of , with diameter .

2. Loss function is -smooth on .

3. is -smooth, and continuously differentiable on .

4. There exists at which .

5. is -strongly convex on .

6. There exists such that , for all , .

Of these assumptions, assuredly 3 is simplest: any ball (here in the norm) with finite radius will suffice, though far more exotic examples are assuredly possible. The remaining assumptions require some checking, but hold under very weak assumptions on the underlying distribution, as the following examples show.

###### Example 1 (Concrete example of assumption 3).

Consider the linear regression model

, where is almost surely bounded (say ), but the noise can have any distribution we desire. Consider the squared loss , and observe that for any , we have

and thus

Thus we have smoothness with , satisfying 3.

###### Example 2 (Concrete example of assumptions 3 and 3).

Consider a similar setup as in Example 1, but instead of requiring to be bounded, weaken the assumption to . Since taking the derivative under the integral we have . Clearly, , satisfying 3. Furthermore, it follows that

We thus have

meaning smoothness of the risk holds with , satisfying 3.

###### Example 3 (Concrete example of assumption 3).

Again consider a setting similar to Examples 12, but with added assumptions that , that the noise and input are independent, and that the components of are independent of each other. Some straightforward algebra shows that

 \exx(\lgsubj(\ww;\zz))2 =4(\exxx2j⟨\ww−\wwstar,\xx⟩2+\exxη2\exxx2j) ≤4(∥\ww−\wwstar∥2\exxx2j∥\xx∥2+\exxη2\exxx2j).

It follows that as long as the noise has finite variance (), and all inputs have finite fourth moments , then using assumption 3, we get

 \exx(\lgsubj(\ww;\zz))2≤4(\diameter2\exxx2j∥\xx∥2+\exxη2\exxx2j)<∞.

This holds for all , satisfying 3.

###### Example 4 (Concrete example of assumption 3).

Consider the same setup as Example 3. Since for each , it follows that the risk induced by the squared loss under this model takes a convenient quadratic form,

 \risk(\ww)=\exx\loss(\ww;\zz)=(\ww−\wwstar)TA(\ww−\wwstar)+b2,

with and . For concreteness, say all the components of have variance , recalling that by assumption. Then the Hessian matrix of is , for all . For any , taking an exact Taylor expansion, we have that

for some appropriate on the line segment between and . Since the Hessian is positive definite with factor , the last term on the right-hand side can be no smaller than . This implies a lower bound,

The exact same inequality holds for any choice of and . This is precisely the definition of strong convexity of given in (13), with convexity parameter , satisfying 3.

In the analysis that follows, 33 are assumed to hold.

#### Analysis of Algorithm 1 with discussion

Here we consider the learning performance of the proposed procedure given by Algorithm 1. Almost every step of the procedure is given explicitly, save for the means of setting the moment bound (see section 4), and the step size setting of . In the subsequent analysis of section 3, we shall specify exact settings of , and show how these settings impact the final guarantees that can be made.

We begin with a general fact that shows the sub-routine used to estimate each element of the risk gradient has sharp guarantees under weak assumptions.

###### Lemma 5 (Pointwise accuracy).

Consider data , with distribution . Assume finite second moments, and a known upper bound . With probability no less than , the estimator defined in (3), scaled using , satisfies

 |ˆx−\exx\ddistx|≤√2vlog(δ−1)n+√vn.
###### Remark 6 (Scaling in Lemma 5).

Note that the scale setting in the above lemma depends on the sample size and the second moment of the underlying distribution. This can be derived from an exponential tail bound on the deviations of this estimator, namely we have that for any choice of ,

 \prr{|ˆx−\exx\ddistx|≤v2s+slog(δ−1)n+√vn}≥1−δ.

Choosing to minimize this upper bound yields the final results given in the lemma. Having a scale large relative to the second moments ensures the estimator behaves similarly regardless of the underlying scale of the data. Note that it also increases with sample size

: this is intuitive since in most cases, given a larger sample, we can allow the estimator to be more sensitive to outliers, earning a reduction in bias.

Of critical importance is that Lemma 5 only assumes finite variance, nothing more. Higher-order moments may be infinite or undefined, and the results still hold. This means the results hold for both Gaussian-like well-behaved data, and heavy-tailed data which are prone to errant observations. Next, we show that this estimator has a natural continuity property.

###### Lemma 7 (Estimator is Lipschitz).

Considering the estimator defined in (3) as a function of the data , it satisfies the following Lipschitz property:

 |ˆx(\xx)−ˆx(\xx′)|≤c\priorn∥\xx−\xx′∥1, for all \xx,\xx′∈\RRn

where the factor takes the form

 c\prior=1−2Φ(−√β)+√2βπexp(−β2)

where

, the cumulative distribution function of the standard Normal distribution.

At each step in an iterative procedure, we have some candidate , at which we can evaluate the loss and/or the gradient over some or all data points . In traditional ERM-GD, one simply uses the empirical mean of the loss gradients to approximate . In our proposed robust gradient descent procedure, instead of just doing summation, we feed the loss gradients as data into the robust procedure (3), highlighted in Lemma 5. Running this sub-routine for each dimension results in a novel estimator of the risk gradient , to be plugged into (2), constructing a novel steepest descent update. Since the candidate at any step will depend on the random draw of the data set , upper bounds on the estimation error must be uniform in in order to capture all contingencies. More explicitly, we require for some bound that

Using the following lemma, we can show that such a bound does exist, and its form can be readily characterized.

###### Lemma 8 (Uniform accuracy of gradient estimates).

Consider the risk gradient approximation , defined at (for ) by

 \rgestsubj(\ww)\defeqsjnn∑i=1∫ψ(\lgsubj(\ww;\zzi)(1+ϵi)sj)d\prior(ϵi), scaled as s2j=nvj2log(δ−1), (7)

with any valid bound satisfying , for all . Then, with probability no less than , for any choice of , we have that

where writing , the error is

 ˜ε\defeq\paraSmooth(1+c\prior√d)+√2dV(log(dδ−1)+dlog(3\diameter√n/2))+√V.

We now have that (6) is satisfied by our underlying routine, as just proved in Lemma 8. The last remaining task is to disentangle the underlying optimization problem (minimization of unknown ) from the statistical estimation problem (approximating with ), in order to control the distance between the output of Algorithm 1 after iterations, denoted , and the minimizer of .

###### Lemma 9 (Distance control).

Consider the general approximate GD update (1), and assume that (6) holds with bound . Then, with probability no less than , the following statements hold.

1. Setting , with , we have

 ∥\wwhat(T)−\wwstar∥≤(1−α)T/2∥\wwhat(0)−\wwstar∥+2ε\SCfactor.
2. Setting , we have

 ∥\wwhat(T)−\wwstar∥≤1√t+2∥\wwhat(0)−\wwstar∥+ε\SCfactor.

Our preparatory lemmas are now complete, and we can finally focus on the risk itself. We are considering bounds on the excess risk, namely the difference between the risk achieved by our procedure , and , namely the best possible performance using .

###### Theorem 10 (Excess risk bounds, fixed step size).

Write for the output of Algorithm 1 after iterations, assuming step size , and moment bounds for all . It follows that

 \risk(\wwhat(T))−\risk∗ ≤(1−α)T\paraSmooth∥\wwhat(0)−\wwstar∥2+4\paraSmooth˜ε\paraSC2n =O((1−α)T)+O(d(log(dδ−1)+dlog(\diametern))n)

with probability no less than over the random draw of the sample , where and are as defined in Lemma 8.

An immediate observation is that if we let scale with such that as , we have convergence in probability as for any , it holds that

 limn→∞\prr{\risk(\wwhat(T))−\risk∗>ε}=0,

and that indeed to get arbitrarily good performance at confidence , one requires on the order of observations. In terms of learning efficiency, the rate of convergence becomes more important than the simple fact of consistency. In the remark below, we compare our results with closely related works in the literature.

###### Remark 11 (Comparison with other RGD).

Among recent work in the literature, conceptually the closest work to ours are those proposing and analyzing novel “robust gradient descent” algorithms. As with ours, these procedures look to replace the traditional sample mean-based risk gradient estimate with a more robust approximation, within the context of a first-order optimization scheme. As we mentioned in section 1, two particularly closely related works are Chen et al. [6] and Prasad et al. [24], both using a generalized median-of-means strategy. On the computational side, our Algorithm 1 requires only a fixed number of basic operations, applied to each coordinate and each data point; we have no iterative sub-routines here. This provides an advantage over median-of-means procedures which require iterative approximations of the geometric median at each update in the main loop. Theoretically, while the setting of Chen et al. [6] is that of parallel computing with robustness to failures, their formal guarantees have essentially the same dependence on and as our bounds in Theorem 10, under comparable assumptions. On the other hand, Prasad et al. [24], using the same tactic as Chen et al. [6], provide new formal guarantees with better dependence on the dimension, but at the cost of a new factor in the statistical error term. More concretely, we consider their Theorem 8, in which they provide error bounds on a robust linear regression procedure. Consider assumptions as in our Example 3, where and . Writing for the sequential output of their procedure. Under such assumptions, they assert -probability bounds of the form

 ∥\wwtil(T)−\wwstar∥≤O(aT)+O(σ1−a√Tdlog(δ−1)(n/k))

for a constant , where is the number of partitions made, and . The reason for this form is as follows. Using their Lemma 2, based on error bounds from Minsker [20], one gets a pointwise bound on the error gradient estimate. In their Theorem 1 proof, they take a union over all algorithm iterations, and conclude with bounds on taking the form just stated. A naive approach re-using the same data over each step of the algorithm means the loss gradient observations are no longer independent, then making Lemma 2 invalid. To get around this, the authors split the original independent observations into disjoint subsets to be used for their proposed sub-routine (involving a further -partition). Union bounds over steps are now perfectly valid, but the data size at each step becomes , and even leads to a very slow rate of . Our bounds have an extra factor, but are free of in the statistical error, while also maintaining rates.

###### Remark 12 (Projected RGD).

One implicit assumption in our analysis above is that for all steps of Algorithm 1. To enforce this, running a projection sub-routine after the parameter update step is sufficient, and all theoretical guarantees hold as-is. To see this, note that the update becomes

 \wwhat(t+1)=π\WW(\wwhat(t)−α(t)\rgest(t)(\wwhat(t))) (8)

where . Under 3, this projection is well-defined [16, Sec. 3.12, Thm. 3.12]. With this fact in hand, it follows that for any choice of . Again via 3, since and , we have

 ∥∥π\WW(\wwhat(t)−α(t)\rgest(t)(\wwhat(t)))−\wwstar∥∥≤∥\wwhat(t+1)−\wwstar∥

implying that Lemma 9 applies as-is to Algorithm 1 modified using projection to as in (8), thereby extending all subsequent results based on it.

One would expect that with robust estimates of the risk gradient that over a wide variety of distributions, that the updates of Algorithm 1 should have small variance given enough observations. The following result shows that this is true, with the procedure stabilizing to the best level available under the given sample as the procedure closes in on a valid solution.

###### Theorem 13 (Control of update variance).

Run Algorithm 1 under the same assumptions as Theorem 10, except with step-size left arbitrary. Then, for any step , taking expectation with respect to the sample conditioned on , we have

## 4 Empirical analysis

In the numerical experiments that follow, our primary goal is to elucidate the relationship that exists between factors of the learning task (e.g., sample size, model dimension, initial value, underlying data distribution) and the performance of the robust gradient descent procedure proposed in Algorithm 1. We are interested in how these factors impact algorithm behavior in an absolute sense, as well as performance relative to well-known competitors.

We are considering three basic types of experiments. First, we develop a risk minimization task based on noisy function (and thus noisy gradient) observations. These controlled simulations let us carefully examine how different factors influence performance over time. Next, we consider a regression task under a wide variety of noise distributions, which lets us examine the true utility of competing algorithms under a scenario where the data may or may not be heavy-tailed. Finally, we use real-world benchmark data sets to evaluate performance on classification tasks.

### 4.1 Controlled tests

#### Experimental setup

We begin with a “noisy convex risk minimization” task, designed as follows. The risk function itself takes a quadratic form, as , where , , and are constants set in advance. The learning task is to find a minimizer of , without direct access to , rather only access to random function data , with mapping from parameter space to a numerical penalty. This data is generated independently from a common distribution, and are centered at the true risk, namely for all . More concretely, we generate , , with and independent. The true minimum is denoted , and . The inputs are set to have a

-dimensional Gaussian distribution with all components uncorrelated. This means that

is positive definite, and is strongly convex.

We make use of three metrics for evaluating performance here: average excess empirical risk (averaging of ), average excess risk (computed using true ), and variance of the risk. The latter two are computed by averaging over trials; each trial means a new independent random sample. In all tests, we conduct 250 trials.

Regarding methods tested, we run three representative procedures. First is the idealized gradient descent procedure (1), denoted oracle, which is possible here since is designed by us. Second, as a de facto standard for most machine learning algorithms, we use ERM-GD, written erm. Here the update direction is simply the sample mean of the loss gradient. Finally, we compare our Algorithm 1, written rgdmult, against these two procedures. Variance bounds are computed using the simplest possible procedure, namely the empirical mean of the second moments of , divided by two.

#### Impact of heavy-tailed noise

Our first inquiry is a basic proof of concept: are there natural problem settings under which using rgdmult over ERM-GD is advantageous? How does this procedure perform when ERM-GD is known to be effectively optimal? Under Gaussian noise, ERM-GD is effectively optimal [15, Appendix C]. As a baseline, we start with Gaussian noise (mean

), and then consider centered log-Normal noise (log-location , log-scale ) as a representative example of asymmetric, heavy-tailed data. Performance results are given in Figure 2.

We see that when ERM-GD is known to be strong, both algorithms perform almost the same. In contrast, when we have heavy-tailed data, we see that rgdmult is far superior in terms of both the solution found, and the stability over the random draw of the sample. While compared with the oracle procedure, there is clearly some overfitting, we see that rgdmult departs from the oracle procedure at a much slower rate than ERM-GD, a desirable property.

Moving forward, we look more systematically at how different experimental settings lead to different performance, all else kept constant.

#### Impact of initialization

Having fixed the underlying distribution and number of observations , here we look at the impact of the initial guess . We look at three initializations, taking the form , with , and values ranging over , . Here larger values of correspond to potentially worse initialization. Results are displayed in Figure 3.

Several trends are clear. First, in the case where ERM-GD is essentially optimal, we see that rgdmult matches it. Furthermore, when the data is heavy-tailed, the proposed procedure is seen to be much more robust to a sub-par initial guess. While a bad start can lead to serious performance issues long-run in ERM-GD, we see that rgdmult can effectively recover.

#### Impact of distribution

In our risk minimization task construction, we take advantage of the fact that very distinct loss distributions can still lead to precisely the same risk function. Here we examine performance as the underlying distribution changes; since all changes are of a purely statistical nature the oracle procedure is not affected, only ERM-GD and rgdmult. We consider six settings, three for Gaussian noise, and three for log-Normal noise. Location and scale parameters for the former are . Log-location and log-scale parameters for the latter are . Results are given in Figure 4.

In the case of Gaussian data, where we expect ERM-based methods to perform well, we see that rgdmult is able to match ERM-GD in all settings. Under log-Normal noise, the performance of ERM falls rather sharply, and we see a gap in performance that widens as the variance grows. Guarantees of good performance over a wide class of distributions for Algorithm 1, without any prior knowledge of the underlying distribution are the key take-aways of the results culminating in Theorem 10, and are reinforced clearly by these empirical test results, as well as those in subsequent sub-sections.

#### Impact of sample size

As a direct measure of learning efficiency, we investigate how algorithm performance metrics change with the sample size , with dimension fixed. Figure 5 shows the accuracy of erm and ERM-GD in tests just like those given above. Initial values are common for all methods, and ranges over .

As is natural, both procedures see monotonic performance improvement over increasing . More salient is the strength of rgdmult under heavy-tailed observations, particularly when data is limited, giving clear evidence of better learning efficiency, in the sense of realizing better generalization in less time, with less data.

#### Impact of dimension

The number of parameters the learning algorithm has to determine, here denoted as dimension , makes a major impact on the overall difficulty of the learning task, and what sample sizes should be considered “small.” Fixing , we let range over , and investigate how each algorithm performs with access to progressively less sufficient information. For our Algorithm 1, we set the variance bound to the empirical second moments of , multiplied by . Figure 6 gives results for these tests.

We see that with larger , since is fixed, both non-oracle routines become less efficient, and need more time to converge. As with previous tests, the key difference appears under heavy tails, where we see Algorithm 1 is superior ove ERM-GD for all settings. While the ERM procedure saturates rather quickly, our procedure keeps improving for more iterations.

#### Comparison with robust loss minimizer

In section 1, we cited the important work of Brownlees et al. [1], which chiefly considered theoretical analysis of a robust learning procedure that minimizes a robust objective, in contrast to our use of a robust update direction. Our proposed procedure enjoys essentially the same theoretical guarantees, and we have claimed that it is more practical. Here we attempt to verify this claim empirically. Denote the method of Brownlees et al. [1] by bjl. To implement their approach, which does not specify any particular algorithmic technique, we implement bjl using the non-linear conjugate gradient method of Polak and Ribière [23]. This can be found as part of the the optimize module of the SciPy scientific computation library, called fmin_cg, with default parameter settings. We believe that using this standard first-order solver makes for a fair comparison between bjl and our Algorithm 1, again denoted rgdmult, and again with variance bound set to the empirical second moments of , multiplied by . For our routine, we have fixed the number of iterations to be for all settings. We compute the time required for computation using the Python time module. Multiple independent trials of each learning task (analogous to those previous) are carried out, with the median time taken over trials (for each setting) used as the final time record. We consider settings of . These times along with performance results are given in Figure 7.

We can first observe that in low dimensions, and with data subject to Gaussian noise, the performance of both methods is simiular, a reassuring fact considering their conceptual closeness. Moving to higher dimensions, however, and especially under heavy-tailed noise, we see that our rgdmult achieves better performance in much less time. Note that bjl is optimizing a completely different objective function, explaining the deviation in excess empirical risk. There are certainly other ways of implementing bjl, but there is no way of circumventing the optimization of an explicitly-defined objective, which may not be convex. Our proposed rgdmult looks to offer a more practical alternative, which still enjoys the same statistical guarantees.

#### Regression application

For our next class of experiments, we look at a more general regression task, under a diverse collection of data distributions. We then compare Algorithm 1 with well-known procedures specialized to regression, both classical and recent. In each experimental condition, and for each trial, we generate observations of the form for training. Each condition is defined by the setting of and . Throughout, we have inputs which are generated from a -dimensional Gaussian distribution, with each coordinate independent of the others. As such, to set requires setting the distribution of the noise, . We consider several families of distributions, each with 15 distinct parameter settings, or “noise levels.” These settings are carried out such that the standard deviation of increases over the range , in a roughly linear fashion as we increase from level 1 (lowest) to 15 (highest).

A range of signal/noise ratios can be captured by controlling the norm of the vector determining the model. For each trial, we generate randomly as follows. Considering the sequence , sample uniformly, with . The underlying vector is then set as . The signal to noise ratio then varies over the range . Here we consider four noise families: log-logistic (denoted llog in figures), log-Normal (lnorm), Normal (norm), and symmetric triangular (tri_s). Many more are considered in Appendix B, and even with just these four, we have representative distributions with both bounded and unbounded sub-Gaussian noise, and heavy-tailed data both with and without finite higher-order moments.

Here we do not compute the risk exactly, but rather use off-sample prediction error as the key metric for evaluating performance. This is computed as excess root mean squared error (RMSE) computed on an independent testing set. Performance is averaged over independent trials. For each condition and trial, a test set of independent observations is generated identically to the -sized training set that precedes testing. All competing methods use common samples for training and testing, for each condition and trial. In the th trial, each algorithm outputs an estimate . Using RMSE to approximate the -risk, compute , outputting prediction error as the excess error , averaged over trials. In all experiments, we have , .

We consider several methods against which we compare the proposed Algorithm 1

. As classical choices, we have ordinary least squares (ERM under the squared error,

ols) and least absolute deviations (ERM under absolute error, lad). For more recent methods, as described in section 1, we consider robust regression routines as given by Minsker [20] (geomed) and Hsu and Sabato [10] (hs). In the former, we partition the data, obtaining the ols solution on each subset, and these candidates are aggregated using the geometric median in the norm [27]. The number of partitions is set to . In the latter, we used source code published online by the authors. To compare our Algorithm 1 with these routines, we initialize rgdmult to the analytical ols solution, with step size for all iterations, and . Variance bounds are set to the empirical second moments of , divided by 2. In total, the number of iterations is constrained by a fixed budget: we allow for gradient evaluations in total. Representative results are provided in Figure 8.

To begin, we fix , and look at performance over settings (first row of Figure 8). Regardless of distribution, rgdmult is seen to provide highly competitive performance; whereas other methods perform well on some distributions and very poorly on others, a high level of generalization ability is uniformly maintained by the proposed procedure. We see that ols is strong under sub-Gaussian data (norm and tri_s), while it deteriorates under the heavy-tailed data. The more robust methods tend to perform better than ols on heavy-tailed data, but clearly suffer from biased estimates under sub-Gaussian data. These results illustrate how rgdmult realizes the best of both worlds, paying a tolerable price in bias for large payouts in terms of robustness to outliers.

In the second row of Figure 8, we examine performance over noise levels. It is encouraging that even with pre-fixed step size and budgets (since is fixed over all noise levels), the strong performance of rgdmult holds over very diverse settings.

Finally, in the third row of Figure 8, the ratio of to is fixed, and we see if and how performance changes when is increased. For all distributions, the performance of rgdmult is essentially constant over when scales with , which is what we would hope considering the risk bounds of Theorem 10. Certain competitive methods show more sensitivity to the absolute number of free parameters, particularly in the case of heavy-tailed data with asymmetric distributions.

### 4.2 Application to real-world benchmarks

As our final class of numerical experiments, we shift our focus to classification tasks, and this time make use of real-world data sets, to be described in detail below.

All methods use a common model, here multi-class logistic regression. If the number of classes is

, and we have input features, then the dimension of the model will be . A basic property of this model is that the loss function is convex in the parameters, with gradients that exist, thus placing the model firmly within our realm of interest. Furthermore, for all of these tests we shall add a squared -norm regularization term to the loss, where varies depending on the dataset. Once again, each algorithm is given a fixed budget, this time of , where is the size of the training set available, which again depends on the dataset (details below).

Here we give results for two well-known data sets used for benchmarking: the forest cover type dataset from the UCI repository, and the protein homology dataset used in a previous KDD Cup. For each dataset, we execute 10 independent trials, with training/testing subsets randomly sampled without replacement as is described shortly. For all datasets, we normalize input features to the unit interval in a per-feature fashion. For the cover type dataset, we consider binary classification of the second type against all other types. With and , we have and , with a training subset of size . The protein homology dataset has highly unbalanced labels, with only 1296 positive labels our of over 145,000 examples. We balance out training and testing data, randomly selecting 296 positive examples and the same number of negative examples, yielding a test set of 592 points. As for the training set size, we use all positive examples not used for testing (1000 points each time), plus a random selection of 1000 negatively labeled examples, so . With and , the dimension is , and . In all settings, initialization is done uniformly over the interval .

We investigate the utility of a random mini-batch version of Algorithm 1 here. We try mini-batch sizes of 10 and 20. Variance bounds are set to times the empirical mean of the second moments of , with ranging over . Furthermore, for the high-dimensional datasets, we consider a mini-batch in terms of random selection of which parameters to robustly update. At each iteration, we randomly choose indices, running Algorithm 1

for the resulting sub-vector, and the sample mean for the remaining coordinates. We compare our proposed algorithm with stochastic gradient descent (SGD), and stochastic variance-reduced gradient descent (SVRG) proposed by

Johnson and Zhang [11]. For each method, pre-fixed step sizes ranging over are tested. SGD has mini-batches of size 1, just as the SVRG inner loop. The inner loop of SVRG has iterations, and all methods continue running until the fixed budget of gradient evaluations is spent.

We share representative results in Figure 9. For each dataset and each method, we chose the top two parameters settings, written *_1 and *_2 here. Here the “top two” refers to performance as measured by the median test error for the last five iterations. In general, the proposed procedure is clearly competitive with the best settings of these popular routines, and in the case of the smaller data set (protein homology, right-most plot), we see a significant improvement over competitors. Assuredly, our mini-batch implementation of Algorithm 1 is merely a nascent application, but strong performance under even a very naive setup is promising in terms of developing even stronger procedures for real-world data.

## 5 Conclusion

We introduced and analyzed a novel machine learning algorithm, with a solid theoretical grounding based on firm statistical principles, and with the added benefit of a simple implementation, very few parameters to set, and a fast, non-iterative robustification procedure that does not throw away any data, but which is also not overly sensitive to errant observations. Based on the strong theoretical guarantees and appealing empirical performance, it appears that our approach of paying the price of a small bias for the reward of more distributionally robust gradient estimates is sound as a methodology, realizing better performance using less computational resources (data, time).

In looking ahead, we are particularly interested in moving beyond per-coordinate robustification, and considering operations that operate on the loss gradient vectors themselves as atomic units. The per-coordinate technique is easy to implement and theoretical analysis is also more straightforward, but the risk bounds have an extra factor that should be removable given more sophisticated procedures. Indeed, the high-dimensional mean estimation discussed by Catoni and Giulini [4] has such a vector estimator, but unfortunately there is no way to actually compute the estimator they analyze. Bridging this gap is an important next step, from the perspective of both learning theory and machine learning practice.

## Appendix A Technical appendix

### a.1 Preliminaries

Consider two probability measures and on measurable space . We say that is absolutely continuous with respect to , written , whenever implies for all . The Radon-Nikodym theorem guarantees that there exists a function , -measurable, such that

 Q(A)=∫AgdP, for all A∈A.

Furthermore, this is unique in the sense that if another exists satisfying the above equality, we have almost everywhere . It is common to call this function the Radon-Nikodym derivative of with respect to , written

. The relative entropy, or Kullback-Leibler divergence, between two probability measures

and on measurable space is defined

 \KL(P;Q)\defeq⎧⎪⎨⎪⎩−∫log(dQdP)dP, if Q≪P+∞, else. (9)

They key property of the truncation function utilized by Catoni and Giulini [4], defined in (4), is that for all , we have

 −