# Better scalability under potentially heavy-tailed gradients

We study a scalable alternative to robust gradient descent (RGD) techniques that can be used when the gradients can be heavy-tailed, though this will be unknown to the learner. The core technique is simple: instead of trying to robustly aggregate gradients at each step, which is costly and leads to sub-optimal dimension dependence in risk bounds, we choose a candidate which does not diverge too far from the majority of cheap stochastic sub-processes run for a single pass over partitioned data. In addition to formal guarantees, we also provide empirical analysis of robustness to perturbations to experimental conditions, under both sub-Gaussian and heavy-tailed data. The result is a procedure that is simple to implement, trivial to parallelize, which keeps the formal strength of RGD methods but scales much better to large learning problems.

## Authors

• 14 publications
• ### Better scalability under potentially heavy-tailed feedback

We study scalable alternatives to robust gradient descent (RGD) techniqu...
12/14/2020 ∙ by Matthew J. Holland, et al. ∙ 0

• ### Improved scalability under heavy tails, without strong convexity

Real-world data is laden with outlying values. The challenge for machine...
06/02/2020 ∙ by Matthew J. Holland, et al. ∙ 0

• ### Efficient learning with robust gradient descent

Minimizing the empirical risk is a popular training strategy, but for le...
06/01/2017 ∙ by Matthew J. Holland, et al. ∙ 0

• ### Learning with CVaR-based feedback under potentially heavy tails

We study learning algorithms that seek to minimize the conditional value...
06/03/2020 ∙ by Matthew J. Holland, et al. ∙ 0

• ### Robust descent using smoothed multiplicative noise

To improve the off-sample generalization of classical procedures minimiz...
10/15/2018 ∙ by Matthew J. Holland, et al. ∙ 0

• ### On Proximal Policy Optimization's Heavy-tailed Gradients

Modern policy gradient algorithms, notably Proximal Policy Optimization ...
02/20/2021 ∙ by Saurabh Garg, et al. ∙ 0

• ### Fork and Join Queueing Networks with Heavy Tails: Scaling Dimension and Throughput Limit

Parallel and distributed computing systems are foundational to the succe...
05/14/2018 ∙ by Yun Zeng, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Obtaining “strong contracts” for the performance of machine learning algorithms is difficult.

111This notion was described lucidly in a keynote lecture by L. Bottou [4]. Classical tasks in computer science, such as sorting integers or simple matrix operations, come with lucid worst-case guarantees. With enough resources, the job can be done correctly and completely. In machine learning, things are less simple. Since we only have access to highly impoverished information regarding the phenomena or goal of interest, inevitably the learning task is uncertain, and any meaningful performance guarantee can only be stated with some degree of confidence, typically over the random draw of the data used for training. This uncertainty is reflected in the standard formulation of machine learning tasks as “risk minimization” problems [36, 16]. Here we consider risk minimization over some set of candidates , where the risk of is defined as the expected loss to be incurred by , namely

 \risk\ddist(w)\defeq\exx\ddist\loss(w;Z)=∫\ZZ\loss(w;z)\ddist(\difz),w∈\WW.

Here we have a loss function

, and random data takes values in a set . At most, any learning algorithm will have access to data points sampled from , denoted . Write

to denote the output of an arbitrary learning algorithm. The usual starting point for analyzing algorithm performance is the estimation error

, where , or more precisely, the distribution of this error. Since we never know much about the underlying data-generating process, typically all we can assume is that belongs to some class

of probability measures on

, and typical guarantees are given in the form of

 \prr{\risk\ddist(\whatn)−\risk∗\ddist>ε(n,δ,\ddist,\WW)}≤δ,∀\ddist∈\PP.

Flipping the inequalities around, this says that the algorithm generating enjoys -good performance with -high confidence over the draw of the sample, where the error level depends on the sample size , the desired confidence level , the underlying data distribution , and any constraints encoded in , not to mention the nature of loss . Ideally, we would like formal guarantees to align as closely as possible with performance observed in the real world by machine learning practitioners. With this in mind, the following properties are important to consider.

1. Transparency: can we actually compute the output that we study in theory?

2. Strength: what form do bounds on take? How rich is the class ?

3. Scalability: how do computational costs scale with the above-mentioned factors?

Balancing these three points is critical to developing guarantees for algorithms that will actually be used in practice. If strong assumptions are made on the data distribution (i.e., is a “small” class), then most of the data any practitioner runs into will fall out of scope. If the error bound grows too quickly with or shrinks too slowly with , then either the guarantees are vacuous, or the procedure is truly sub-optimal. If the procedure outputting cannot be implemented, then we run into a gap between what we code, and what we study formally.

#### Our problem setting

In this work, we consider the setup of potentially heavy-tailed data, specialized to the case of strongly convex loss functions.222We handle general convex loss functions in a companion work to follow shortly. More concretely, all the learner can know is that for some ,

 \PP⊆{\ddist:supw∈\WW\exx\ddist|\loss(w;Z)|m<∞}, (1)

where typically . Thus, it is unknown whether the losses (or partial derivatives, etc.) are congenial in a sub-Gaussian sense (where (1) holds for all

), or heavy-tailed in the sense that all higher-order moments could be infinite or undefined. The goal then comes down to obtaining the strongest possible guarantees for a tractable learning algorithm, given (

1). We next review the related technical literature, and give an overview of our contributions.

## 2 Context and contributions

With the three properties of transparency, strength, and scalability highlighted in the previous section in mind, for the next few paragraphs we look at the characteristics of several important families of learning algorithms.

#### ERM: can scale well, but lacks robustness

Classical learning theory is primarily centered around empirical risk minimization (ERM) [37, 1], and studies the statistical properties that hold for any minimizer of the empirical risk, namely

 \whatn∈\argminw∈\WW1nn∑i=1\loss(w;Zi). (2)

Clearly, this leaves all algorithmic aspects of the problem totally abstract, and opens up the possibility for substantial gaps between the performance of “good” and “bad” ERM solutions, as studied by Feldman [15]

. Furthermore, the empirical mean is sensitive to outliers, and formally speaking is sub-optimal in the sense that it cannot achieve sub-Gaussian error bounds under potentially heavy tails, while other practical procedures can; see

Catoni [8] and Devroye et al. [13] for comprehensive studies. Roughly speaking, the empirical mean cannot guarantee better error bounds than those which scale as . In the context of machine learning, these statistical limitations provide an important implication about the feedback available to any learner which tries to directly minimize the empirical risk, effectively lower-bounding the statistical error (in contrast to the optimization error) incurred by any such procedure.

#### Robust risk minimizers: strong in theory, but lacking transparency

To deal with the statistical weaknesses of ERM, it is natural to consider algorithms based on more “robust” feedback, i.e., minimizers of estimators of the risk which provide stronger guarantees than the empirical mean under potentially heavy tails. A seminal example of this is the work of Brownlees et al. [7], who consider learning algorithms of the form

 \whatn∈\argminw∈\WWˆ\risk(w), where n∑i=1ψ⎛⎝ˆ\risk(w)−\loss(w;Zi)s⎞⎠=0. (3)

That is, they consider minimizers of an M-estimator of the risk, using influence function of the type studied by Catoni [8]. Under weak moment bounds like (1), their minimizers enjoy rates with dependence on the confidence. This provides a significant improvement in terms of the strength of guarantees compared with ERM, but unfortunately the issue of transparency remains. Like ERM, the algorithmic side of the problem is left abstract here, and in general may even be a much more difficult computational task. Observe that the new objective cannot be written in closed form, and even if is convex, this need not preserve such convexity. Direct optimization is hard, but verifying improvement in the function value is easy, and some researchers have utilized a guess-and-check strategy to make the approach viable in practice [19]. However, these methods are inexact, and due to optimization error, strictly speaking the algorithm being run does not enjoy the full guarantees given by Brownlees et al. [7] for the ideal case.

#### Robust gradient descent: transparent, but scales poorly

To try and address the issue of transparency without sacrificing the strength of formal guarantees, several new families of algorithms have been designed in the past few years to tackle the potentially heavy-tailed setting using a tractable procedure. Such algorithms may naturally be called robust gradient descent (RGD), the naming being appropriate since their core updates all take the form

 \whatt+1=\whatt−αtˆGn(\whatt), (4)

and they are “robust” in the sense that the estimate

has deviations with near-optimal confidence intervals under potentially heavy-tailed data (i.e., both the loss and partial gradients are potentially heavy-tailed). These strategies typically use biased estimators of the mean, in sharp contrast with traditional first-order oracle assumptions for stochastic gradient descent. Since we will be interested in making a direct comparison with these procedures in this work, we give a more detailed introduction to representative RGD methods in the next three paragraphs.

The most common strategy is a “median of means” approach, studied first by Chen et al. [9, 10] under a distributed learning setup with outliers, and subsequently by Prasad et al. [30] in the context of potentially heavy-tailed data.333In the context of distributed machine learning under outliers, there have been many variations on how to do the “aggregation” of gradients in a robust fashion [3, 38, 14, 31]. These amount to different special cases of doing the RGD update (4), within a different problem setting. The basic strategy is simple: at each step, set the update direction to be the median of means estimator of the risk gradient. The sample is partitioned into subsets with elements each. From each subset , one computes an empirical mean of gradients, and then merge these independent estimates by taking their geometric median; we denote this (see Algorithm 2 for details). More explicitly, we have

 ˆGn(w) =\geomed[{ˆG(1)(w),…,ˆG(k)(w)};∥⋅∥], (5) ˆG(j)(w) =1|\IIj|∑i∈\IIj∇\loss(w;Zi),j=1,…,k.

We refer to (4) implemented using (5) as RGD-by-MoM.

Another approach, first introduced by Holland and Ikeda [18, 20], does not use a sample-splitting mechanism, but rather takes a dimension-wise robustification strategy, updating using

 ˆGn(w) =(ˆθ1(w),…,ˆθd(w)), (6) ˆθj(w) =\argminθ∈\RRn∑i=1ρ(∇j\loss(w;Zi)−θs),j=1,…,d.

Here is a scaling parameter, and is a convex, even function that is approximately quadratic near zero, but grows linearly in the limit of . Since M-estimation is the key sub-routine, we refer to (4) implemented by (6) as RGD-M. Note that both RGD-by-MoM and RGD-M enjoy error bounds with optimal dependence on and under potentially heavy-tailed data, with the significant merit that the computational procedures are transparent and easy to implement as-is. Unfortunately, instead of a simple one-dimensional robust mean estimate as in Brownlees et al. [7], all RGD methods rely on sub-routines that work in -dimensions. This makes the procedures much more expensive computationally for “big” learning tasks, and leads to an undesirable dependence on the ambient dimension in the statistical guarantees as well, hampering their overall scalability.

An alternative approach to doing robust gradient descent comes from Lecué et al. [23], who utilize a neat mixture of the core ideas of robust risk minimizers and the robust gradient descent procedures. Taking a -partition of the data as just described, the update direction is set as

 ˆGn(w) =1|\II⋆|∑i∈\II⋆∇\loss(w;Zi), (7) ˆ\loss⋆(w) \defeq\med{ˆ\loss1(w),…,ˆ\lossk(w)}, ˆ\lossj(w) \defeq1|\IIj|∑i∈\IIj\loss(w;Zi),j=1,…,k.

That is, a robust estimator of the risk (median-of-means) is used to determine which subset to use for computing an empirical estimate of the risk gradient. This approach is meant to approximately achieve the minimization of , the median-of-means risk estimator; we refer to it as MoM-by-GD. Lecué et al. [23] prove strong statistical guarantees for the true minimizer of , specialized to the binary classification task, without requiring bounded inputs; this is an appealing technical improvement over what can be guaranteed using the machinery of Brownlees et al. [7]. Under some technical conditions (their Sec. 4.2), they prove convergence of their algorithm, which scales well computationally since the only high-dimensional operation required is summation over a small subset. Unfortunately, since the rate of convergence is unclear, there may exist a substantial gap between the statistical error guaranteed for the median-of-means risk minimizer and the output of this procedure.

Summarizing the above points, first of all, ERM and robust risk minimizers leave the potential for a severe gap between what is guaranteed on paper and what is done in practice. On the other hand, both formal guarantees and computational requirements for RGD methods do not scale well to high-dimensional learning tasks. For comparison, we show concrete error bounds for RGD procedures under strongly convex losses in Table 1, accompanied by more detailed discussion in section 3.4. The key issues are clear: even when working with the Euclidean geometry, a quick glance at the proofs in the cited works on RGD shows that direct dependence on in the error bounds is unavoidable. Furthermore, the extra computational overhead, scaling at least linearly in , must be incurred at every step in the iterative procedure, which severely hurts scalability.

#### Our contributions

Considering the issues highlighted above, in particular with the scalability of modern RGD techniques in terms of loose guarantees and computational cost, here we investigate a different algorithmic approach of equal generality, with the goal of achieving as-good or better dependence on , , and , under the same assumptions, and in provably less time for larger problems. The core technique uses distance-based rules to select among independent weak candidates, which are implemented using inexpensive stochastic gradient-based updates. Our main contributions:

• We analyze a general-purpose learning procedure (Algorithm 1), and obtain sharp high-probability error bounds (Theorems 1 and 6) that improve on the poor dimension dependence of existing RGD routines under strongly convex risks when both the losses and gradients can be heavy-tailed (comparison in section 3.4).

• The procedure outlined in Algorithm 1 is simple to implement and amenable to distributed computation, providing superior computational scalability over existing serial RGD procedures, without sacrificing theoretical guarantees.

• Empirically, we study the efficiency and robustness of the proposed algorithm and its key competitors in a tightly controlled simulated setting (section 4), verifying a substantial improvement in the cost-performance tradeoff, robustness to heavy-tailed data, and performance that scales well to higher dimensions.

Taken together, our results suggest a promising class of algorithms for strongly convex risk minimization, which achieve an appealing balance between transparency, strength and scalability.

## 3 Theoretical analysis

### 3.1 Preliminaries

First we establish some basic notation, and organize technical details in one place for ease of reference.

#### Notation

For any positive integer , write . For any index , write , defined analogously for independent copy . To keep the notation simple, in the special case of , we write . We shall use as a generic symbol to denote computing probability; in most cases this will be the product measure induced by the sample or . For any function , denote by the sub-differential of evaluated at

. Variance of the loss is denoted by

for each . When we write , this refers to the indicator function which returns when event is true, and otherwise.

#### Technical conditions

The two key running assumptions that we make are related to independence and convexity. First, we assume that all the observed data are independent, i.e., the random variables

and taken over all are independent copies of . Second, for each , we assume the map is a real-valued convex function over , and that the parameter set is non-empty, convex, and compact. All results derived in the next sub-section will be for an arbitrary choice of , where satisfies (1) with . Finally, to make formal statements technically simpler, we assume that achieves its minimum on the interior of .

Several special properties of the underlying feedback provided to the learner will be of interest at different points in this paper; for convenience, we organize them all here.

• -Lipschitz loss: there exists such that for all and all , we have .

• -SC risk: There exists a such that the map is -strongly convex on in norm .

• -smooth risk: The map is differentiable over , and -smooth on in norm with .

• -smooth loss: The map is differentiable over , and -smooth on in norm with , for all .

Definitions of strong convexity and smoothness are given in the technical appendix. To keep the statement of technical results as succinct as possible, we shall refer directly to these conditions as required. For example, the assumption of a -strongly convex risk will be written , the assumption of losses with a -Lipschitz gradient will be written , and so forth. Observe that for clarity, we differentiate between properties which hold for the risk and those which hold for the loss by using an asterisk (e.g., note ), since depending on the setting we make use of both the weak and strong versions of this condition. We will never have need for strongly convex losses.

#### Choice of sub-process

We study a general-purpose learning algorithm using a divide-and-conquer strategy, summarized in Algorithm 1. Following a partition of the data into disjoint subsets, a sub-routine is used to generate candidates, and from these candidates, a final output is determined by . Three concrete examples of are given in Algorithms 24, and they will be discussed in further detail shortly. As for , for concreteness we are considering traditional (projected) stochastic gradient descent. The core update of arbitrary point given data is given by

 \SGD[w;Z,α,\WW]\defeq\proj\WW(w−αG(w;Z)). (8)

Here denotes a step-size parameter, denotes projection to with respect to the

norm, and the standard assumption is that the random vector

satisfies , for each

. That is, we assume access to an unbiased estimate of some sub-gradient of the true risk. For an arbitrary sequence

of length , let . Note that using (8), the right-hand side is defined recursively, and bottoms out at , using pre-fixed initial value . Note that we suppress the step sizes from this notation for readability. For any arbitrary sub-index , sequence is defined analogously; since the are iid, the sequence order does not matter.

### 3.2 Illustrative theorem for heavy-tailed losses

We start with a statement of a theorem that holds for potentially heavy-tailed losses, but with bounded gradients. This result is simple to state and effectively illustrates how Algorithm 1 can be used to obtain strong learning guarantees under potentially heavy-tailed data. After sketching out the proof of this theorem, in the following sub-section we will extend this to the case where the sub-gradients can also be heavy-tailed.

###### Theorem 1.

Let , , and hold in the norm. Run Algorithm 1 with , and step-size . Then, with probability no less than , we have

 \risk\ddist(\what\DC)−\risk∗\ddist≤(\parasm20\parasm21\parasc3)clog(δ−1)n

where the constant when , when , and when (see Lemma 2 for details).

Proving such a theorem is straightforward using the quadratic growth property of strongly convex functions in conjunction with -smoothness. When the risk is -strongly convex, we have the critical property that points which are -far away from the minimum must be -bad in terms of excess risk. As such, simple distance-based robust aggregation metrics can be used to efficiently boost the confidence. To start, we need a few basic facts which will be used to characterize a valid operation.444Procedures with this property are called “robust distance approximation” by Hsu and Sabato [21]. Given points , the basic requirement here is that we want the output of to be close to the majority of these points, in the appropriate norm. To make this concrete, define

 \diameter(u;γ,{u1,…,uk},∥⋅∥)\defeqinf{r≥0:|{j:∥uj−u∥≤r}|>k(12+γ)}. (9)

When the other parameters are obvious from the context, we shall write simply . In words, is the radius of the smallest ball centered at which contains a -majority of the points . Using this quantity, our requirement on is that for any and , we have

 ∥ˆu−u∥≤cγ\diameter(u;γ,{u1,…,uk}), where ˆu=[{u1,…,uk};∥⋅∥]. (10)

Here is a factor that is independent of the choice of or the points given, which depends only on the choice of . In the following lemma, we summarize how different sub-routines provide different guarantees (proofs for lemmas are given in the appendix).

###### Lemma 2.

The following implementations of satisfy (10):

• (Algorithm 2), with .

• (Algorithm 3), with .

• (Algorithm 4) for case, we have .

Considering the partitioning scheme of Algorithm 1, the ideal case is of course where, given some desired performance level , the sub-routine returns an -good candidate for all subsets. In practice, we will not always be so lucky, but the following lemma shows that with enough candidates, most of them will be -good with high confidence.

###### Lemma 3.

Let be any normed linear space. Let be iid random entities taking values in , and fix . For , write , . For any , it follows that

 \prr{k∑i=1ai(ε)>k(12+γ)}≥1−exp(−2k(γ+δε−12)2).

Applying Lemma 3 using the event , we see that when scales with , we can guarantee that there is a probability good event in which at least a -majority of the candidates are -good. On this good event, via strong convexity it follows that a -majority of the candidates are -close to , which means . Leveraging the requirement (10) on , one obtains the following general-purpose boosting procedure.

###### Lemma 4 (Boosting the confidence, under strong convexity).

Assume , and hold. Assume that we have a learning algorithm which for and achieves

 \prr{\risk\ddist(\what\textscold)−\risk∗\ddist>ε\ddist(n)δ0}≤δ0.

For desired confidence level , split into disjoint subsets, and let be the outputs of run on these subsets. Then setting

 \what\textscnew=[{\what(1)\textscold,…,\what(k)\textscold};∥⋅∥],

if is any of the sub-routines given in Lemma 2, then for any and , we have that

 \risk\ddist(\what\textscnew)−\risk∗\ddist≤4c2γ\parasm1\parascε\ddist((1−γ)2n8log(δ−1))

with probability no less than .

With these basic results in place, we can readily prove Theorem 1.

###### Proof of Theorem 1.

Using the assumptions provided in the hypothesis, we can obviously leverage Lemma 4. The key remaining point is to fill in the bound for last-iterate SGD as specified. Standard arguments yield a probability event on which

 \risk\ddist(\what(j))−\risk∗\ddist≤\parasm1(n/k)(\parasm0\parasc)2(1δ),

and this holds for each . See Theorem 9 in the appendix for a more detailed statement and complete proof. We then plug this into Lemma 4, where the correspondence with Algorithm 1 is and . It follows that on the high-probability good event, we have

 \risk\ddist(\what\DC)−\risk∗\ddist≤⎛⎝4c2γ\parasm20\parasm21\parasc3⎞⎠8log(δ−1)n(1−γ)2.

This holds for any routine satisfying (10). In the case of , by Lemma 2, we have for all . Note that is a free parameter that does not impact the algorithm being executed, and thus we can set . For case, we end up with a factor of the form . Direct computation shows that this is minimized at a value between and . The bounds hold for all , and thus taking the factor equals . Basic arithmetic in each case then immediately yields the desired bound. ∎

###### Remark 5 (Additional related literature).

The excess risk bounds given by Theorem 1 give us an example of the guarantees that are possible under potentially heavy-tailed data, for arguably the simplest divide-and-conquer strategy one could conceive of. Here we remark that the core idea of using robust aggregation methods to boost the confidence of independent candidates under potentially heavy-tailed data can be seen in various special cases throughout the literature. For example, influential work from Minsker [25, Sec. 4] applies the geometric median (here,

) to robustify both PCA and high-dimensional linear regression procedures, under potentially heavy-tailed observations.

Hsu and Sabato [21, Sec. 4.2] look at merging ERM solutions when the empirical risk is strongly convex, using a smallest-ball strategy (here, ). In contrast, we do not require the losses to be strongly convex, and our computational procedure is explicit, yielding bounds which incorporate error of both a statistical and computational nature, unlike ERM-type guarantees.

Broadening our viewpoint slightly, it is worth noting that the general approach seen in the preceding works actually dates back to at least Nemirovsky and Yudin [27, Sec. 6.6.4, p. 243–246], albeit in a slightly different algorithmic form. The smallest-ball strategy was adopted in interesting recent work by Davis et al. [12], who investigate a generic strategy to give stochastic algorithms high-probability error bounds, by solving an additional proximal sub-problem at each iteration, in which the new candidate is within a small-enough ball of the previous candidate. Also quite recently, new work on stochastic convex optimization under potentially heavy-tailed data has appeared from Juditsky et al. [22], who study a robust stochastic mirror descent procedure, which fixes an “anchor” direction, and only updates using the stochastic gradient oracle if that vector is close enough to the anchor. Under the setting of Theorem 6 to follow shortly, we remark that the error bounds for their procedure are similar to ours (e.g., their Section 6, Thm. 3), but rely critically on the quality of the anchor direction and the threshold level; when such quantities are unknown, the anchor is just set to zero, with the norm threshold being modulated by the size of the entire parameter space, which propagates into the error bounds.

### 3.3 Extension to allow heavy-tailed gradients

Note that the assumptions in Theorem 1 clearly allow for potentially heavy-tailed losses, but the Lipschitz condition is equivalent to requiring bounded partial derivatives, meaning that heavy-tailed gradients are ruled out, which is not meaningful for algorithms based entirely on first-order information. This is the only assumption made in Theorem 1 that does not appear in the existing RGD literature. On the other hand, existing RGD arguments use a -smoothness requirement on the loss (e.g., Holland and Ikeda [20, Sec. 3.2]), which is a stronger requirement than we have made in Theorem 1. Here we show that when we align our assumptions to that of the existing RGD theory, it only requires a minor adjustment to the sub-routine used in Algorithm 1 to obtain analogous results, now allowing for both the loss and gradient to be potentially heavy-tailed. This is summarized in the following result; the statement is slightly more complicated than the preceding illustrative theorem, but the proof follows using a perfectly analogous argument.

###### Theorem 6.

Let , and hold in the norm. Run Algorithm 1 with a sample size at least , where

 k=⌈8log(δ−1)⌉,M∗\defeq4\parasm1\parasc(max{\parasm1\parasc∥\what0−\wstar∥2\exx\ddist∥G(\wstar;Z)∥2,1}−1).

For the initial update set , and subsequent step sizes for , with , and set such that for all . Then, with probability no less than , we have

 \risk\ddist(\what\DC)−\risk∗\ddist≤\exx\ddist∥G(\wstar;Z)∥2(a\parasm1\parasc)22clog(δ−1)n−Mδ

where , and is exactly as in Theorem 1.

###### Proof of Theorem 6.

As with the preceding illustrative proof, the key is to fill in for the final iterate of standard SGD, using the prescribed step sizes. It is well-known that for averaged SGD, one does not need to require that the losses be Lipschitz. On the other hand, for last-iterate SGD, it was only quite recently that Nguyen et al. [29], in a nice argument building upon Bottou et al. [5], showed that the Lipschitz condition is not required if we have -smooth losses. For our purposes, this implies that for each of the candidates in Algorithm 1, we get

 \risk\ddist(\what(j))−\risk∗\ddist≤\exx\ddist∥G(\wstar;Z)∥2(n/k)−M∗+b(1δ)(2a2\parasm1\parasc)

with probability no less than . A detailed statement of this property is given in Theorem 10 in the appendix. The rest of the argument goes through exactly as in the proof of Theorem 1, noting that we shall end up with in the denominator, for arbitrary choice of . To cover all choices of and thus settings, we simply use the rough upper bound in the stated result. ∎

### 3.4 Comparison with RGD

Considering the three points of interest highlighted in section 1 (transparency, strength, and scalability), let us compare Algorithm 1 with the existing RGD algorithms introduced in section 2. In Table 1, we summarize some concrete metrics on the statistical and computational side. Let us unpack and discuss this here. First, the technical assumptions being made here are precisely that of our Theorem 6 for the first three rows of the table. The guarantees follow from Chen et al. [9, Thm. 4] for RGD-by-MoM, Holland and Ikeda [20, Thm. 5] for RGD-M, and Theorem 6 for DC-SGD. While the factors concealed by the notation (chiefly and ) certainly cannot be ignored, in terms of explicit dependence on , , and , we see that the dependence is as good or better in all respects, in particular the direct dependence on is removed. As for MoM-by-GD, the result holds for binary classification using a Lipschitz convex surrogate of the 0-1 loss, from Lecué et al. [23, Thm. 2]. Here denotes the true minimizer of given in (7), and the output of MoM-by-GD after steps. Since convergence rates for are not available, the overall guarantees are weaker than the above-cited RGD procedures.

Regarding computational costs, let us consider basic estimates for the temporal cost in terms of arithmetic operations required. Starting with Algorithm 1, for each subset , we need operations to obtain candidate , and these computations can be done independently on processors running in parallel over the entire learning process, until the final merge. With this in mind, the time cost to obtain all candidates will be , and then all that remains is one call to , yielding a total cost of . The table above reflects a setting of to match Theorem 6. For comparison, RGD-by-MoM (5) requires operations to compute one subset mean, and again assuming the computations for each , , are done across cores in parallel, then if iterations are done, the total cost is , since the -based merging must be done times. Regarding , the geometric median is a convex program, and can efficiently be solved to arbitrary accuracy; Cohen et al. [11] give an implementation such that the objective is -good (relative value), with time complexity of for points. This cost is incurred at each step, in contrast with , which in the case of , only incurs such a cost once. For RGD-M (6), note that solving for can be done readily using a fixed-point update, and in practice the number of iterations is , fixed independently of and , which means operations will be required for each of the steps. Assuming a standard empirical estimate of the per-coordinate variance is plugged in, this will require an additional arithmetic operations. Finally for MoM-by-GD, sorting can be done in steps, update directions require just operations, and these costs are incurred at all steps.

All else equal, under potentially heavy-tailed losses/gradients, there appears to be fairly strong formal evidence that better statistical guarantees may be possible at substantially improved computational cost, by choosing DC-SGD over the existing RGD procedures in the literature. Since DC-SGD only requires iterations in total, we see the obvious potential for costs to be improved by an order of magnitude, e.g., when for other routines. That said, there are other factors that remain to be considered, such as the variance over time and across independent samples, and the impact in performance for risk functions with different ratios, in both low- and high-dimensional problem settings. To elucidate how the formal guarantees derived above play out in practice, we conduct a detailed empirical analysis in section 4.

## 4 Empirical analysis

In this section, we use controlled simulations to investigate how the differences in formal performance guarantees discussed in the previous section work out in practice.

#### Experimental setup

We essentially follow the “noisy convex minimization” tests done by Holland and Ikeda [20] to compare the performance of robust gradient descent procedures with traditional ERM minimizers. For simplicity, we start with a risk function that takes a quadratic form , where , , and are constants that depend on the experimental conditions. Now in order to line the experimental setting up with the theory of section 3, the idea is to construct an easily manipulated loss distribution such that the expectation aligns precisely with the quadratic just given. To achieve this, one can naturally compute losses of the form , where is a pre-defined vector unknown to the learner, is a -dimensional random vector, is zero-mean random noise, and and are independent of each other. Note that

in this case corresponds to the joint distribution of

and , although all the learner sees is the loss value. It is readily confirmed under such a setting we have in the quadratic form given above with , , and (see appendix). For any non-trivial choice of the distribution of , the resulting matrix will be positive definite, implying that is strongly convex. Furthermore, since the gradients are clearly Lipschitz continuous whenever has a bounded support, the two key assumptions of Theorem 6 are satisfied.

Regarding the methods being compared, as classical baselines, empirical risk minimization using batch GD (denoted ERM-GD) and stochastic GD (denoted SGD) are used. We also implement the robust GD methods described in section 2. In particular, RGD-by-MoM is denoted here as RGD-MoM, MoM-by-GD is denoted as RGD-Lec, and RGD-M is denoted as RGD-M. Finally, Algorithm 1 (with ) is denoted by DC-SGD. Everything is implemented in Python (ver. 3.8), chiefly relying upon the numpy library (ver. 1.18).555An online repository to re-create the experiments here will be made public in the near future. Documentation for numpy: https://numpy.org/doc/1.18/index.html. The basic idea of these tests is to calibrate and fix the methods to the case of “nice” data characterized by additive Gaussian noise, and then to see how the performance of each method changes as different experimental parameters are modified. For all algorithms that use a -partition of the data, we have fixed throughout all tests. Partitioning is done using the split_array function in numpy, which means each subset gets at least points. Details regarding step-size settings will be described shortly. Finally, the initial value for all methods is determined randomly, using for each coordinate, with unless otherwise specified.

The key performance metric that we look at in the figures to follow is “excess risk,” computed as , where is the output of any learning algorithm being studied, and is the pre-fixed minimum described in the previous paragraph. Each experimental setting is characterized by the triplet , which we modify in many different ways to investigate different phenomena. For each setting, we run multiple independent trials, and compute performance statistics based on these trials. For example, when we give the average (denoted ave

) and standard deviation (denoted

sd) of excess risk, these statistics are computed over all trials. All box-plots are also computed based on multiple independent trials; the actual number of trials will be described in the subsequent exposition. For convenience, here we list the key contents of our empirical analysis:

• Error trajectories in low dimensions (fixed and , many iterations).

• Statistical error in high dimensions ( grows, fixed).

• Actual computation times ( grows, fixed/grows).

• Impact of initialization on error trajectories ( grows).

• Impact of noise level on error trajectories (signal/noise ratio gets worse).

We proceed to describe these experimental settings and related results one by one.

#### (E1) Error trajectories in low dimensions

We start with a simple setting, fixing and , and comparing how pre-fixed algorithms perform depending on whether is Normal or log-Normal. To be more precise, let . We consider the cases of and , respectively with and . These distinct settings of

are set to keep the width of the inter-quartile range of the noise in both cases approximately equal. The case of Normal noise is used as a baseline to calibrate fixed step sizes for all methods. The four batch methods all have

, the two sequential methods have , and these settings are fixed throughout the remaining experiments to evaluate the impact of different conditions on pre-fixed algorithms.666The reasoning for having the step size shrink with is because the smoothness parameter scales with and thus , all else equal. See constant in Theorem 6 for the motivation behind having shrink with in Algorithm 1. The same requirement is made for RGD methods [20, Sec. 3.2]. All learning algorithms are given access to exactly data points , based on which they can evaluate and at any and . We examine how all algorithms behave under the settings just described, given a large time budget. More specifically, cost is measured in terms of gradient evaluations, so that every time an algorithm computes for some and , we increment , until . Here we set . Note that this means SGD and DC-SGD will make many passes over the data; we shall look at the case of few passes over the data in short order. The number of independent trials here is . Results for this experimental setting are given in Figure 2 (Normal case) and Figure 2 (log-Normal case).

#### (E2) Statistical error in high dimensions

Next, we consider how a larger number of parameters impacts the statistical error of the methods being compared, in contrast with the computational error (e.g., errors in Table 1 free of , shrinking as grows). To do so, we consider a range of with fixed , and we control the initialization error for all methods such that for a constant that does not depend on , by initializing as for each . Furthermore, we let the batch methods run for many iterations with a gradient budget of , whereas we now restrict DC-SGD (Algorithm 1) to just . This is done to ensure that the computational error terms of batch methods are sufficiently small. The number of independent trials here is , and the noise distribution settings are as discussed previously. In Figure 3, we give box-plots of the final excess risk achieved by each method for different sizes. As a complementary result, we can also consider fixing and taking very large to evaluate the impact of non- factors in the statistical error bounds. Such a result is given in Figure 5, noting that we are still considering just single pass DC-SGD against the many-pass batch methods.

#### (E3) Actual computation times

It is natural to consider how much time is actually required to achieve the results given above, for both the many-pass batch methods and the single-pass DC-SGD (Algorithm 1), in particular when the sub-processes used to compute the in Algorithm 1 are run in parallel. Once again for all -dependent methods we have fixed just as in previous experiments, and consider two types of tests. First, and move together, with and . Second, is fixed, and dimension ranges over as before. Stopping conditions based on budget constraints are precisely as in the experiments described in the previous paragraph. We measure computation time of each experiment using the Python module time as follows.777Documentation: https://docs.python.org/3/library/time.html. For each method, we record the time immediately after is generated and passed to the learning algorithm, and the time immediately after the stopping condition costbudget is achieved. Computation time is then defined as simply . We run independent trials, and compute the median times for each method. For the parallel implementation of DC-SGD (Algorithm 1), we use the Python module multiprocessing to allocate each SGD sub-process to independent worker processes that can be run in parallel.888Documentation: https://docs.python.org/3.8/library/multiprocessing.html. Specifically, we generate an instance of the Pool class, iterating over worker functions that return . We note that computation of ending time for DC-SGD comes after the step, and thus all the overhead due to multiprocessing is included in the computation times recorded. We have made every effort to ensure all algorithms are implemented as efficiently as possible, and source code will be uploaded to a public repository in the near future. The median times for each method in both experimental settings are shown in Figure 5.

#### (E4) Impact of initialization on error trajectories

We now return to the setting of (E1), and investigate the impact that a larger initialization error has on the resulting trajectory of each method. Recall that our baseline setup has us initializing in a coordinate-wise fashion, namely . The default setting was , but here we consider , for both Normal and log-Normal noise cases. As with (E1), the excess risk values are averaged over independent trials. To ensure the plots are legible, we choose four representative methods to highlight the key trends. Results are shown in Figure 6 (we denote by sup in the legend).

#### (E5) Impact of noise level on error trajectories

Continuing with the same basic setup as (E4), we keep the default initialization error range, and instead here modify the signal to noise ratio. The nature of

and is kept constant, so the strength of the “signal” does not change. Recall that setting , we consider two cases of additive noise, namely where (Normal case) and (log-Normal case). In the Normal case, we take . In the log-Normal case, we take . Starting from small to large, we denote these levels as . As with our earlier experimental settings, we have selected the parameters for these three “noise levels” such that at each level, the inter-quartile range of is approximately equal for both the Normal and log-Normal cases. As before, we select four representative methods and show the impact of noise level on excess risk, averaged over independent trials. Results are given in Figure 6.

#### Discussion of results

As an overall take-away from the preceding empirical test results, it is clear that even with no fine-tuning of algorithm parameters, it is possible for Algorithm 1 to achieve performance comparable to robust gradient descent methods using far less computational resources. Clearly, even when the underlying sub-processes used by Algorithm 1 are very noisy, only a few passes over the data are necessary to match the best-performing RGD methods, both on average and in terms of between-trial variance (Figures 22). Furthermore, without any algorithm adjustments, this robustness holds over changes to initialization error and the signal/noise ratio (Figures 67). It is clear that due to sample splitting, our procedure can take a small hit in terms of statistical error as the sample size grows very large (Figure 5), but makes up for this in scalability. As the dimensionality of the task grows, under heavy-tailed data the proposed procedure establishes an even more stark advantage over competing methods (Figure 3), at only a small fraction of the computational cost (Figure 5). These initial results are encouraging, and additional tests looking at basic principles and data-driven strategies for optimizing the number of partitions is a natural point of interest.

## 5 Future directions

As discussed in section 2, this paper presents evidence, both formal and empirical, that a general-purpose learning algorithm following the archetype drawn out in Algorithm 1 should be able to improve significantly on the cost-performance tradeoff achieved by current state of the art robust gradient descent methods under potentially heavy-tailed losses and gradients. Clearly, this work represents only a first step in this direction. Extending the theory to other algorithms besides vanilla SGD is a straightforward exercise; less straightforward is when we start considering a stage-wise strategy, when partition size can change from stage to stage. Extending results to allow for multiple passes is also of natural interest; the work of Lin and Rosasco [24] does this for the squared error, but without heavy tails. We only covered the norm case here, but extensions to cover other geometries (via stochastic mirror descent for example) are also of interest. Empirically, much work remains to be done, including a more thorough study of the impact of partition sizes for different algorithms, simulations looking at the impact as , and real-world data applications using non-linear models where global convexity assumptions fall apart. The RGD methods cited in this work are chiefly designed for convex optimization problems; they sacrifice exploration in favor of exploitation, and it will be interesting to investigate how approaches like Algorithm 1 fare in non-convex settings due to their higher tendency to “explore.”

## Appendix A Technical supplement

### a.1 Helper facts

###### Lemma 7.

Let and be convex. Then, is -Lipschitz with respect to norm if and only if for all and , where the dual norm is defined by .

###### Proof.

See Shalev-Shwartz [34, Lem. 2.6] for a proof. ∎

#### Strong convexity

Let function be continuous on closed convex set . We say that is -strongly convex on with respect to norm if there exists such that for all and we have

 f(αu+(1−α)v)≤αf(u)+(1−α)f(v)−\parasc2α(1−α)∥u−v∥2. (11)

The definition given in (11) is intuitively appealing, as the third term on the right-hand side specifies how large the distance is from the graph of to the chord between and . All else equal, when taking over , a larger value means that the graph dips farther down below the chord. Furthermore, this definition is technically appealing in that it does not require to be differentiable. When is differentiable, then the more familiar characterizing property is that

 f(u)−f(v)≥⟨∇f(v),u−v⟩+\parasc2∥u−v∥2. (12)

for all . Among other useful properties, (11) implies that has a unique minimum , and that for any ,

 f(u)−f(u∗)≥\parasc2∥u−u