
# Heavy-tailed Streaming Statistical Estimation

We consider the task of heavy-tailed statistical estimation given streaming p-dimensional samples. This could also be viewed as stochastic optimization under heavy-tailed distributions, with an additional O(p) space complexity constraint. We design a clipped stochastic gradient descent algorithm and provide an improved analysis, under a more nuanced condition on the noise of the stochastic gradients, which we show is critical when analyzing stochastic optimization problems arising from general statistical estimation problems. Our results guarantee convergence not just in expectation but with exponential concentration, and moreover do so using O(1) batch size. We provide consequences of our results for mean estimation and linear regression. Finally, we provide empirical corroboration of our results and algorithms via synthetic experiments for mean estimation and linear regression.


## 1 Introduction

Statistical estimators are typically random, since they depend on a random training set; their statistical guarantees are typically stated in terms of the expected loss between estimated and true parameters [31, 12, 51, 26]. A bound on the expected loss, however, might not be sufficient in higher-stakes settings, such as autonomous driving and risk-laden health care, among others, since the deviation of the estimator from its expected behavior could be large. In such settings, we might instead prefer a bound on the loss that holds with high probability. Such high-probability bounds are, however, often stated only under strong assumptions (e.g. sub-Gaussianity or boundedness) on the tails of the underlying distributions [25, 24, 45, 18]; conditions which often do not hold in real-world settings. There has also been a burgeoning line of recent work that relaxes these assumptions and allows for heavy-tailed underlying distributions [3, 28, 35], but the resulting algorithms are often not only complex, but are also specifically batch learning algorithms that require the entire dataset, which limits their scalability. For instance, many popular polynomial-time algorithms for heavy-tailed mean estimation [9, 42, 5, 43, 6, 32] and heavy-tailed linear regression [28, 46, 44, 38] need to store the entire dataset. On the other hand, most successful practical modern learning algorithms are iterative, light-weight, and access data in a “streaming” fashion. As a consequence, we focus on designing and analyzing iterative statistical estimators which only use constant storage in each step. To summarize, motivated by practical considerations, we have three desiderata: (1) allowing for heavy-tailed underlying distributions (weak modeling assumptions), (2) high-probability bounds on the loss between estimated and true parameters instead of just its expectation (strong statistical guarantees), and (3) estimators that access data in a streaming fashion while only using constant storage (scalable, simple algorithms).

A useful alternative viewpoint of the statistical estimation problem above is that of stochastic optimization: we have access to the optimization objective function (which in the statistical estimation case is simply the population risk of the estimator) only via samples of the objective function or its higher order derivatives (typically just the gradient). Here again, most of the literature on stochastic optimization typically provides bounds in expectation [26, 51, 22], or places strong assumptions on the tails of the stochastic function derivatives, such as bounded distributions [24, 45] and sub-Gaussian distributions [25, 34]. Figure 1 shows that even for the simple stochastic optimization task of mean estimation, the deviation of stochastic gradient descent (SGD) is much worse for heavy-tailed distributions than sub-Gaussian ones. Therefore, bounds on expected behavior, or strong assumptions on the tails of the stochastic noise distribution, are no longer sufficient.

While there has been a line of work on heavy-tailed stochastic optimization, these methods require non-trivial storage complexity or batch sizes, making them unsuitable for streaming settings or large-scale problems [16, 37]. Specifically, these existing works require batch sizes that grow with the target accuracy to obtain an approximate solution under heavy-tailed noise [44, 7, 19, 39] (see Section B in the Appendix for further discussion). In other words, to achieve a typical $O(1/n)$ convergence rate (on the squared error), where $n$ is the number of samples, they would need a batch size of nearly the entire dataset.

Therefore, we investigate the following question:

Can we develop a stochastic optimization method that satisfies our three desiderata?

Our answer is that a simple algorithm suffices: stochastic gradient descent with clipping (clipped-SGD). In particular, we first prove a high-probability bound for clipped-SGD under heavy-tailed noise, with a decaying step-size sequence for strongly convex objectives. By using a decaying step size, we improve the analysis of Gorbunov et al. [19] and develop the first robust stochastic optimization algorithm in a fully streaming setting, i.e. with $O(1)$ batch size. We then consider corollaries for statistical estimation where the optimization objective is the population risk of the estimator, and derive the first streaming robust mean estimation and linear regression algorithms that satisfy all three desiderata above.

We summarize our contributions as follows:

• We prove the first high-probability bounds for clipped-SGD with $O(1/t)$-decaying step size for strongly convex and smooth objectives, without sub-Gaussian assumptions on stochastic gradients and with constant batch sizes. To the best of our knowledge, this is the first stochastic optimization algorithm that uses constant batch size in this setting. See Section 1.1 and Table 1 for more details and comparisons. A critical ingredient is a nuanced condition on the stochastic gradient noise.

• We show that our proposed stochastic optimization algorithm can be used for a broad class of statistical estimation problems. As corollaries of this framework, we present a new class of robust heavy-tailed estimators for streaming mean estimation and linear regression.

• Lastly, we conduct synthetic experiments to corroborate our theoretical and methodological developments. We show that clipped-SGD not only outperforms SGD and a number of baselines in average performance but also has a well-controlled tail performance.

### 1.1 Related Work

#### Batch Heavy-tailed mean estimation.

In the batch setting, Hopkins [27] proposed a robust mean estimator that can be computed in polynomial time and which matches the error guarantees achieved by the empirical mean on Gaussian data. After this work, efficient algorithms with improved asymptotic runtimes were given in [10, 6, 5, 32, 8]. Related works [42, 38] provide more practical and computationally efficient algorithms with slightly sub-optimal rates. We note that these estimators are targeted at the batch learning setting, and in particular need to store all data points. Some exceptions are [38, 8], whose storage complexity scales only logarithmically in the failure probability $\delta$. However, these works rely on the median-of-means framework and are not easy to extend to the streaming setting. For more details on these recent algorithms, see the recent survey [35].

#### Heavy-tailed stochastic optimization.

A line of work in stochastic convex optimization have proposed bounds that achieve sub-Gaussian concentration around their mean (a common step towards providing sharp high-probability bounds), while only assuming that the variance of stochastic gradients is bounded (i.e. allowing for heavy-tailed stochastic gradients).

Davis et al. [7] proposed proxBoost, which is based on robust distance estimation and proximal operators. Prasad et al. [44] utilized the geometric median-of-means to robustly estimate gradients in each mini-batch. Gorbunov et al. [19] and Nazin et al. [39] proposed clipped-SSTM and RSMD, respectively, based on truncation of stochastic gradients for stochastic mirror/gradient descent. Zhang et al. [51] analyzed the convergence of clipped-SGD in expectation, but focus on a different noise regime where the distribution of stochastic gradients has bounded $\alpha$th moments for some $\alpha \in (1,2]$. However, all the above works [7, 44, 19, 39] have an unfavorable dependency on the batch size to obtain the typical $O(1/n)$ convergence rate (on the squared error). We note that our bound is comparable to the above approaches while using a constant batch size. See Section B in the Appendix for more details.

#### Heavy-tailed regression.

For the setting where the regression noise is heavy-tailed with bounded variance, Huber’s estimator is known to achieve sub-Gaussian style rates [14, 47]. For the case where the covariates and the noise are both heavy-tailed, several recent works have proposed computationally efficient estimators that achieve sub-Gaussian style error rates based on the median-of-means framework [28, 38] and thresholding techniques [46]. However, as we noted before, computing all of these estimators [28, 38, 46] requires storing the entire dataset.

## 2 Background and Problem formulation

In this paper, we consider the following statistical estimation/stochastic optimization setting: we assume that there is some class of functions parameterized by $\theta \in \Theta$, where $\Theta$ is a convex subset of $\mathbb{R}^p$; some random vector $x$ with distribution $P$; and a loss function $\bar{\mathcal{L}}(\theta, x)$ which takes $\theta$ and $x$ and outputs the loss of $\theta$ at the point $x$. In this setting, we want to recover the true parameter $\theta^*$ defined as the minimizer of the population risk function $\mathcal{R}(\theta)$:

$$\theta^* = \arg\min_{\theta\in\Theta} \mathcal{R}(\theta) = \arg\min_{\theta\in\Theta} \mathbb{E}_{x\sim P}\big[\bar{\mathcal{L}}(\theta,x)\big]. \tag{1}$$

We assume that $\mathcal{R}$ is differentiable and convex, and further impose two regularity conditions on the population risk: there exist $\tau_\ell > 0$ and $\tau_u > 0$ such that

$$\frac{\tau_\ell}{2}\,\|\theta_1-\theta_2\|_2^2 \;\le\; \mathcal{R}(\theta_1)-\mathcal{R}(\theta_2)-\langle \nabla\mathcal{R}(\theta_2),\, \theta_1-\theta_2\rangle \;\le\; \frac{\tau_u}{2}\,\|\theta_1-\theta_2\|_2^2 \tag{2}$$

holds for all $\theta_1, \theta_2 \in \Theta$. The parameters $\tau_\ell$ and $\tau_u$ are called the strong-convexity and smoothness parameters of the function $\mathcal{R}$.

To solve the minimization problem defined in Eq. (1), we assume that we can access the stochastic gradient of the population risk, $\nabla\bar{\mathcal{L}}(\theta,x)$, at any point $\theta$ given a sample $x$. We note that this is an unbiased gradient estimator, i.e.

$$\mathbb{E}_{x\sim P}\big[\nabla\bar{\mathcal{L}}(\theta,x)\big] = \nabla\mathcal{R}(\theta).$$

Our goal in this work is to develop robust statistical methods under heavy-tailed distributions, where only very low order moments may be finite, e.g. the Student's t distribution or the Pareto distribution. In this work, we assume the stochastic gradient distribution only has bounded second moment. Formally, for any $\theta \in \Theta$, we assume that there exist $\alpha(P,\bar{\mathcal{L}})$ and $\beta(P,\bar{\mathcal{L}})$ such that

$$\mathbb{E}_{x\sim P}\Big[\big\|\nabla\bar{\mathcal{L}}(\theta,x)-\nabla\mathcal{R}(\theta)\big\|_2^2\Big] \;\le\; \alpha(P,\bar{\mathcal{L}})\,\|\theta-\theta^*\|_2^2 \;+\; \beta(P,\bar{\mathcal{L}}). \tag{3}$$

In other words, the variance of the $\ell_2$-norm of the gradient noise depends on a uniform constant $\beta(P,\bar{\mathcal{L}})$ and a position-dependent term $\alpha(P,\bar{\mathcal{L}})\,\|\theta-\theta^*\|_2^2$, which allows the variance of the gradient noise to be large when $\theta$ is far from the true parameter $\theta^*$. We note that this is a more general assumption compared to prior works, which assumed that the variance is uniformly bounded by some $\sigma^2$. It can be seen that our condition is more nuanced and can be weaker: even when our condition holds, the uniform bound on the variance may need to be large, $\sigma^2 = \sup_{\theta\in\Theta}\alpha(P,\bar{\mathcal{L}})\,\|\theta-\theta^*\|_2^2+\beta(P,\bar{\mathcal{L}})$, whereas a uniform bound on the variance can always be cast in our condition as $\alpha(P,\bar{\mathcal{L}})=0$ and $\beta(P,\bar{\mathcal{L}})=\sigma^2$. We will show that this more nuanced assumption is essential to obtain tight bounds for linear regression problems.

We next provide some running examples to instantiate the above:

1. Mean estimation: Given observations $x_1,\dots,x_N$ drawn from a distribution $P$ with mean $\mu$ and a bounded covariance matrix $\Sigma$. The minimizer of the following squared loss is the mean of the distribution $P$:

$$\bar{\mathcal{L}}(\theta,x)=\frac{1}{2}\|x-\theta\|_2^2 \quad\text{and}\quad \mu=\arg\min_{\theta\in\mathbb{R}^p}\mathbb{E}_{x\sim P}\big[\bar{\mathcal{L}}(\theta,x)\big]. \tag{4}$$

In this case, $\nabla\bar{\mathcal{L}}(\theta,x)=\theta-x$, and $\alpha(P,\bar{\mathcal{L}})=0$ and $\beta(P,\bar{\mathcal{L}})=\mathrm{tr}(\Sigma)$ satisfy the assumption in Eq. (3).

2. Linear regression: Given covariate-response pairs $(x_i,y_i)$, where the covariates $x_i$ are sampled from a distribution $P_x$ and have a linear relationship with the responses, i.e. $y_i=\langle x_i,\theta^*\rangle+w_i$, where $\theta^*$ is the true parameter we want to estimate and $w_i$ is drawn from a zero-mean distribution with variance $\sigma^2$. Suppose that under the distribution $P_x$ the covariates have mean $0$ and non-singular covariance matrix $\Sigma$. In this setting, we consider the squared loss:

$$\bar{\mathcal{L}}(\theta,(x,y))=\frac{1}{2}\big(y-\langle x,\theta\rangle\big)^2 \quad\text{and}\quad \mathcal{R}(\theta)=\frac{1}{2}(\theta-\theta^*)^\top\Sigma\,(\theta-\theta^*). \tag{5}$$

The true parameter $\theta^*$ is the minimizer of $\mathcal{R}(\theta)$. We also note that $\tau_\ell=\lambda_{\min}(\Sigma)$ and $\tau_u=\lambda_{\max}(\Sigma)$ satisfy the assumption in Eq. (2), and that $\alpha(P,\bar{\mathcal{L}})=O\big((C_4+1)\,p\,\tau_u^2\big)$ and $\beta(P,\bar{\mathcal{L}})=O\big(\sigma^2 p\,\tau_u\big)$ (with $C_4$ the fourth-moment constant defined in Eq. (11)) satisfy the assumption in Eq. (3). Note that if we had to uniformly bound the variance of the gradients as in previous stochastic optimization work, that bound would need to scale as $\alpha(P,\bar{\mathcal{L}})\,R^2+\beta(P,\bar{\mathcal{L}})$, where $R=\sup_{\theta\in\Theta}\|\theta-\theta^*\|_2$, which will yield much looser bounds.
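To make the position-dependent noise condition concrete, the following sketch (our own illustration; the Gaussian-covariate setup, sample sizes, and helper names are assumptions, not the paper's code) estimates the variance of the linear-regression stochastic gradient at two distances from $\theta^*$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 100_000
theta_star = np.ones(p)
X = rng.standard_normal((n, p))              # covariates (Gaussian for simplicity)
Y = X @ theta_star + rng.standard_normal(n)  # responses with unit-variance noise

def grad_noise_var(theta):
    """Monte-Carlo estimate of E||grad - E[grad]||^2 at a fixed theta,
    for the squared-loss stochastic gradient -(y - <x, theta>) x of Eq. (5)."""
    grads = -(Y - X @ theta)[:, None] * X        # per-sample gradients
    centered = grads - grads.mean(axis=0)        # subtract ~ the true gradient
    return np.sum(centered ** 2, axis=1).mean()

near = grad_noise_var(theta_star)        # only the uniform beta term remains
far = grad_noise_var(theta_star + 10.0)  # alpha * ||theta - theta*||^2 dominates
```

At $\theta^*$ the noise variance reduces to the uniform term $\beta(P,\bar{\mathcal{L}})$, while far from $\theta^*$ the $\alpha(P,\bar{\mathcal{L}})\,\|\theta-\theta^*\|_2^2$ term dominates, so `far` comes out orders of magnitude larger than `near`.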

## 3 Main results

In this section, we introduce our clipped stochastic gradient descent algorithm. We begin by formally defining clipped stochastic gradients. For a clipping parameter $\lambda>0$:

$$\mathrm{clip}\big(\nabla\bar{\mathcal{L}}(\theta,x),\lambda\big)=\min\left(1,\frac{\lambda}{\|\nabla\bar{\mathcal{L}}(\theta,x)\|_2}\right)\nabla\bar{\mathcal{L}}(\theta,x), \tag{6}$$

where $\lambda$ is the clipping level and $\nabla\bar{\mathcal{L}}(\theta,x)$ is the stochastic gradient. The overall algorithm is summarized in Algorithm 1, where we use $\Pi_\Theta$ to denote the (Euclidean) projection onto the domain $\Theta$.
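To make the procedure concrete, here is a minimal streaming sketch of the update (our own reimplementation of the idea in Algorithm 1; the function names and defaults are ours, and the projection defaults to the identity for $\Theta=\mathbb{R}^p$):

```python
import numpy as np

def clip(g, lam):
    """Shrink g to Euclidean norm at most lam, as in Eq. (6)."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else (lam / norm) * g

def clipped_sgd(theta0, sample_grad, samples, lam, tau_l, gamma,
                project=lambda t: t):
    """Streaming clipped-SGD: one sample per step, O(p) memory.

    sample_grad(theta, x) returns the stochastic gradient at theta for
    sample x; the step size eta_t = 1 / (tau_l * (t + gamma)) follows Eq. (7).
    """
    theta = np.array(theta0, dtype=float)
    for t, x in enumerate(samples, start=1):
        eta = 1.0 / (tau_l * (t + gamma))
        theta = project(theta - eta * clip(sample_grad(theta, x), lam))
    return theta
```

For mean estimation, `sample_grad` is simply `lambda th, x: th - x`; each step touches a single sample, so the memory footprint stays at $O(p)$.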

Next, we state our main convergence result for clipped-SGD in Theorem 1.

(Streaming heavy-tailed stochastic optimization) Suppose that the population risk $\mathcal{R}$ satisfies the regularity conditions in Eq. (2) and the stochastic gradient noise satisfies the condition in Eq. (3). Let

$$\delta\in(0,2e^{-1}) \quad\text{and}\quad \gamma \,\triangleq\, 144\max\left\{\frac{\tau_u}{\tau_\ell},\,\frac{96\,\alpha(P,\bar{\mathcal{L}})}{\tau_\ell^2}\right\}\log(2/\delta)+1.$$

Given $N$ samples $x_1,\dots,x_N \sim P$, the Algorithm 1 initialized at $\theta_1$ with

$$\eta_t \,\triangleq\, \frac{1}{\tau_\ell(t+\gamma)} \quad\text{and}\quad \lambda \,\triangleq\, C_1\sqrt{\frac{\tau_\ell^2\,\gamma(\gamma-1)\,\|\theta_1-\theta^*\|_2^2}{\log(2/\delta)^2}+\frac{(N+\gamma)\,\beta(P,\bar{\mathcal{L}})}{\log(2/\delta)}}, \tag{7}$$

where $C_1$ is a scaling constant that can be chosen by the user, returns $\theta_{N+1}$ such that with probability at least $1-\delta$, we have

$$\|\theta_{N+1}-\theta^*\|_2 \;\le\; 100\,C_1\left(\frac{\gamma\,\|\theta_1-\theta^*\|_2}{N+\gamma}+\frac{1}{\tau_\ell}\sqrt{\frac{\beta(P,\bar{\mathcal{L}})\log(2/\delta)}{N+\gamma}}\right). \tag{8}$$

We provide a proof sketch to explain the values we choose for these hyperparameters in Section 5, and give a complete proof in Section E in the Appendix.

#### Remarks:

a) This theorem says that with a properly chosen clipping level $\lambda$, clipped-SGD has an asymptotic convergence rate of $O(1/\sqrt{N})$ and enjoys sub-Gaussian style concentration around the true minimizer (alternatively, its high-probability bound scales logarithmically in the confidence parameter $1/\delta$). The first term in the error bound is related to the initialization error and the second term is governed by the stochastic gradient noise. These two terms have different convergence rates: at early iterations, when the initialization error is large, the first term dominates but quickly decreases at the rate of $O(1/N)$. At later iterations the second term dominates and decreases at the usual statistical rate of convergence of $O(1/\sqrt{N})$.

b) Note that $\eta_t\propto 1/t$ is a common choice for optimizing $\tau_\ell$-strongly convex functions [24, 45]. The only difference is that we add a delay parameter $\gamma$ to "slow down" the learning rate and to stabilize the training process. The delay parameter depends on the position-dependent variance term $\alpha(P,\bar{\mathcal{L}})$ and the condition number $\tau_u/\tau_\ell$. From a theoretical viewpoint, the delay parameter ensures that the true gradient is within the clipping region with high probability, i.e. $\|\nabla\mathcal{R}(\theta_t)\|_2\le\lambda/2$ for $t=1,\dots,N$ with high probability, and this in turn allows us to control the variance and the bias incurred by clipped gradients. Moreover, it controls the position-dependent variance term $\alpha(P,\bar{\mathcal{L}})\,\|\theta_t-\theta^*\|_2^2$, especially during the initial iterations, when the error (and the variance of stochastic gradients) is large.

c) We choose the clipping level $\lambda$ to be proportional to $\sqrt{N}$ to balance the variance and bias of clipped gradients. Roughly speaking, the bias is inversely proportional to the clipping level (Lemma 5). As the error converges at the rate $O(1/\sqrt{t})$, this in turn suggests that we should choose the clipping level to be $O(\sqrt{N})$.

We also provide an error bound and sample complexity where, as in prior work, we assume the variance of stochastic gradients is uniformly bounded by $\sigma^2$, i.e. $\alpha(P,\bar{\mathcal{L}})=0$ and $\beta(P,\bar{\mathcal{L}})=\sigma^2$ (as before, we assume that the population loss is strongly convex and smooth). We have the following corollary:

Under the same assumptions and with the same hyper-parameters as in Theorem 1, with probability at least $1-\delta$, we have the following error bound:

$$\mathcal{R}(\theta_{N+1})-\mathcal{R}(\theta^*) \;\le\; O\!\left(\frac{\tau_u^3}{\tau_\ell^3}\cdot\frac{r_0\log(1/\delta)^2}{N^2}+\frac{\tau_u}{\tau_\ell^2}\cdot\frac{\sigma^2\log(1/\delta)}{N}\right), \tag{9}$$

where $r_0$ is the initialization error. In other words, to achieve $\mathcal{R}(\theta_{N+1})-\mathcal{R}(\theta^*)\le\epsilon$ with probability at least $1-\delta$, we need $N=O\big(\sigma^2\log(1/\delta)/\epsilon\big)$ samples. With our general results in place, we now turn our attention to deriving some important consequences for mean estimation and linear regression.

## 4 Consequences for Heavy-tailed Parameter Estimation

In this section, we investigate the consequences of Theorem 1 for statistical estimation in the presence of heavy-tailed noise. We plug the respective loss functions $\bar{\mathcal{L}}$, and the terms $\alpha(P,\bar{\mathcal{L}})$ and $\beta(P,\bar{\mathcal{L}})$ capturing the underlying stochastic gradient distribution, into Theorem 1 to obtain high-probability bounds for the respective statistical estimators.

### 4.1 Heavy-tailed Mean Estimation

We assume that the distribution $P$ has a bounded covariance matrix $\Sigma$. Then clipped-SGD for mean estimation has the following guarantee. (Streaming Heavy-tailed Mean Estimation) Given $N$ samples $x_1,\dots,x_N$ from a distribution $P$ and a confidence level $\delta\in(0,2e^{-1})$, the Algorithm 1 instantiated with the loss function in Eq. (4), initial point $\theta_1$, and

$$\gamma=144\log(2/\delta)+1,\quad \eta_t=\frac{1}{t+\gamma}\quad\text{and}\quad \lambda=C_1\sqrt{\frac{\gamma(\gamma-1)\,\|\theta_1-\theta^*\|_2^2}{\log(2/\delta)^2}+\frac{(N+\gamma)\,\mathrm{tr}(\Sigma)}{\log(2/\delta)}},$$

where $C_1$ is a scaling constant that can be chosen by the user, returns $\theta_{N+1}$ such that with probability at least $1-\delta$, we have

$$\|\theta_{N+1}-\theta^*\|_2 \;\le\; 100\left(\frac{\gamma\,\|\theta_1-\theta^*\|_2}{N+\gamma}+\sqrt{\frac{\mathrm{tr}(\Sigma)\log(2/\delta)}{N+\gamma}}\right). \tag{10}$$

#### Remarks:

The proposed mean estimator matches the error bound of the well-known geometric median-of-means estimator [38], achieving $\|\theta_{N+1}-\theta^*\|_2=O\big(\sqrt{\mathrm{tr}(\Sigma)\log(1/\delta)/N}\big)$. This guarantee is still sub-optimal compared to the optimal sub-Gaussian rate $O\big(\sqrt{\mathrm{tr}(\Sigma)/N}+\sqrt{\|\Sigma\|_2\log(1/\delta)/N}\big)$ [35]. Existing polynomial-time algorithms with optimal performance are for the batch setting and either require storing the entire dataset [6, 5, 36, 32, 11] or have storage complexity that grows with $\log(1/\delta)$ [8]. On the other hand, we argue that trading off some statistical accuracy for a large savings in memory and computation is favorable in practice.
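For intuition, the corollary can be instantiated in a few lines. In this sketch (ours, not the paper's code), the unknown initial error $\|\theta_1-\theta^*\|_2$ is replaced by a user-supplied guess `d0`, and $\mathrm{tr}(\Sigma)$ by a plug-in estimate from the first few samples; the theory treats both quantities as known:

```python
import numpy as np

def clipped_mean(samples, delta=0.05, c1=1.0, d0=1.0):
    """Streaming mean estimation via clipped-SGD with the corollary's
    hyper-parameters (gamma, eta_t, lambda). `d0` guesses ||theta_1 - theta*||."""
    n, p = samples.shape
    log_term = np.log(2.0 / delta)
    gamma = 144.0 * log_term + 1.0
    trace_est = samples[: min(n, 100)].var(axis=0).sum()  # plug-in tr(Sigma)
    lam = c1 * np.sqrt(
        gamma * (gamma - 1.0) * d0 ** 2 / log_term ** 2
        + (n + gamma) * trace_est / log_term
    )
    theta = np.zeros(p)
    for t, x in enumerate(samples, start=1):
        g = theta - x                        # stochastic gradient of Eq. (4)
        ng = np.linalg.norm(g)
        if ng > lam:
            g = (lam / ng) * g               # clip to norm lam (Eq. 6)
        theta = theta - g / (t + gamma)      # eta_t = 1 / (t + gamma)
    return theta
```

In practice the paper tunes $\gamma$ and $\lambda$ with the sequential validation scheme of Section 6 rather than plugging in guesses.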

### 4.2 Heavy-tailed Linear Regression

We consider the linear regression model described in Eq. (5). Assume that the covariates have bounded fourth moments and a non-singular covariance matrix $\Sigma$ with bounded operator norm, and that the noise $w$ has bounded second moment $\sigma^2$. We denote the minimum and maximum eigenvalues of $\Sigma$ by $\tau_\ell=\lambda_{\min}(\Sigma)$ and $\tau_u=\lambda_{\max}(\Sigma)$. More formally, we say a random vector $x$ has a bounded fourth moment if there exists a constant $C_4$ such that for every unit vector $v$, we have

$$\mathbb{E}\big[\langle x-\mathbb{E}[x],\,v\rangle^4\big] \;\le\; C_4\,\Big(\mathbb{E}\big[\langle x-\mathbb{E}[x],\,v\rangle^2\big]\Big)^2. \tag{11}$$

(Streaming Heavy-tailed Regression) Given $N$ samples $(x_1,y_1),\dots,(x_N,y_N)$ and a confidence level $\delta\in(0,2e^{-1})$, the Algorithm 1 instantiated with the loss function in Eq. (5), initial point $\theta_1$, and

$$\gamma=144\max\left\{\frac{\tau_u}{\tau_\ell},\,\frac{192\,(C_4+1)\,p\,\tau_u^2}{\tau_\ell^2}\right\}\log(2/\delta)+1,\qquad \eta_t=\frac{1}{\tau_\ell(t+\gamma)}$$
$$\text{and}\qquad \lambda=C_1\sqrt{\frac{\tau_\ell^2\,\gamma(\gamma-1)\,\|\theta_1-\theta^*\|_2^2}{\log(2/\delta)^2}+\frac{(N+\gamma)\,\sigma^2 p\,\tau_u}{\log(2/\delta)}},$$

where $C_1$ is a scaling constant that can be chosen by the user and $C_4$ is the constant in Eq. (11), returns $\theta_{N+1}$ such that with probability at least $1-\delta$, we have

$$\|\theta_{N+1}-\theta^*\|_2 \;\le\; 100\,C_1\left(\frac{\gamma\,\|\theta_1-\theta^*\|_2}{N+\gamma}+\frac{\sigma}{\tau_\ell}\sqrt{\frac{p\,\tau_u\log(2/\delta)}{N+\gamma}}\right). \tag{12}$$

## 5 Sketch of proof

In this section, we provide an overview of the arguments that constitute the proof of Theorem 1 and explain our theoretical contribution. The full details of the proof can be found in Section E in the Appendix. Our proof improves upon the previous analysis of clipped-SGD with constant learning rate and batch size [19], and upon high-probability bounds for SGD with $O(1/t)$ step size in the sub-Gaussian setting [25]. Our analysis consists of three steps: (i) expansion of the update rule, (ii) selection of the clipping level, and (iii) concentration of bounded martingale sequences.

#### Notations:

Recall that we use step size $\eta_t=\frac{1}{\tau_\ell(t+\gamma)}$. We write $\epsilon_t=\mathrm{clip}\big(\nabla\bar{\mathcal{L}}(\theta_t,x_t),\lambda\big)-\nabla\mathcal{R}(\theta_t)$ for the noise, i.e. the difference between the clipped stochastic gradient and the true gradient at step $t$. Let $\mathcal{F}_t$ be the $\sigma$-algebra generated by the first $t$ steps of clipped-SGD. We note that clipping introduces bias, so the mean of $\epsilon_t$ is no longer zero; we therefore decompose the noise term into a bias term $\epsilon_t^b$ and a zero-mean variance term $\epsilon_t^v$, i.e.

$$\epsilon_t=\epsilon_t^b+\epsilon_t^v, \quad\text{where}\quad \epsilon_t^b=\mathbb{E}_{x_t}\big[\epsilon_t\,\big|\,\mathcal{F}_{t-1}\big] \quad\text{and}\quad \epsilon_t^v=\epsilon_t-\mathbb{E}_{x_t}\big[\epsilon_t\,\big|\,\mathcal{F}_{t-1}\big]. \tag{13}$$

#### (i) Expansion of the update rule:

We start with the following lemma, which is standard in the analysis of SGD for strongly convex functions. It can be obtained by unrolling the update rule and using properties of $\tau_\ell$-strongly-convex and $\tau_u$-smooth functions. Under the conditions in Theorem 1, for any $i\in\{1,\dots,N\}$, we have

$$\|\theta_{i+1}-\theta^*\|_2^2 \;\le\; \underbrace{\frac{\gamma(\gamma-1)\,\|\theta_1-\theta^*\|_2^2}{(i+\gamma)(i+\gamma-1)}}_{\text{the initialization error}} \;+\; \underbrace{\frac{\sum_{t=1}^{i}(t+\gamma-1)\,\langle \epsilon_t^b+\epsilon_t^v,\;\theta_t-\theta^*\rangle}{\tau_\ell\,(i+\gamma)(i+\gamma-1)}}_{\text{the first noise term}} \;+\; \underbrace{\frac{2\sum_{t=1}^{i}\big(\|\epsilon_t^v\|_2^2+\|\epsilon_t^b\|_2^2\big)}{\tau_\ell^2\,(i+\gamma)(i+\gamma-1)}}_{\text{the second noise term}}.$$

#### (ii) Selection of the clipping level:

Now, to upper bound the noise terms, we need to choose the clipping level properly to balance the variance term $\epsilon_t^v$ and the bias term $\epsilon_t^b$. Specifically, we use the inequalities of Gorbunov et al. [19], which provide upper bounds for the magnitude and variance of these noise terms. (Lemma F.5, [19]) For any $\lambda>0$, we have

$$\|\epsilon_t^v\|_2 \le 2\lambda. \tag{14}$$

Moreover, for all $t$, assume that the variance of the stochastic gradients is bounded by $\sigma_t^2$, and assume that the norm of the true gradient is less than $\lambda/2$, i.e. $\|\nabla\mathcal{R}(\theta_t)\|_2\le\lambda/2$. Then we have

$$\|\epsilon_t^b\|_2 \le \frac{4\sigma_t^2}{\lambda} \quad\text{and}\quad \mathbb{E}_{x_t}\big[\|\epsilon_t^v\|_2^2\,\big|\,\mathcal{F}_{t-1}\big] \le 18\sigma_t^2 \quad\text{for all } t=1,2,\dots,N. \tag{15}$$

This lemma gives us the dependencies between the variance, the bias and the clipping level: a larger clipping level leads to a smaller bias but a larger magnitude of the variance term, while the variance of the variance term remains constant. These three inequalities are also essential for us to use concentration inequalities for martingales. However, we highlight the condition necessary for these inequalities to hold: the true gradient must lie in the clipping region up to a constant, i.e. $\|\nabla\mathcal{R}(\theta_t)\|_2\le\lambda/2$. Without this condition, we could not upper bound the bias and variance terms. Therefore, the clipping level should be chosen in a very accurate way. Below we informally describe how we choose it.

We note that the error $\|\theta_t-\theta^*\|_2$ should converge at the rate $O(1/\sqrt{t})$ for strongly convex functions with step size $\eta_t\propto 1/t$ [45, 25]. To make sure the first noise term is upper bounded at the same order, one should expect each summand $(t+\gamma-1)\,\langle\epsilon_t^b+\epsilon_t^v,\theta_t-\theta^*\rangle$ to remain controlled for $t=1,\dots,N$, which implies that the bias should decay along with the error. This motivates us to choose the clipping level to be proportional to $\sqrt{N}$ by Eq. (15). Also, from the detailed proof in Section E, we will show that the delay parameter $\gamma$ makes sure that $\|\nabla\mathcal{R}(\theta_t)\|_2\le\lambda/2$ holds with high probability and that the position-dependent noise $\alpha(P,\bar{\mathcal{L}})\,\|\theta_t-\theta^*\|_2^2$ is controlled.

#### (iii) Concentration of bounded martingale sequences:

A significant technical step of our analysis is the use of the following Freedman’s inequality.

(Freedman’s inequality [15]) Let $d_1,\dots,d_T$ be a martingale difference sequence with a uniform bound $b$ on the steps, i.e. $|d_i|\le b$. Let $V_s$ denote the sum of conditional variances, i.e. $V_s=\sum_{i=1}^{s}\mathbb{E}\big[d_i^2\,\big|\,\mathcal{F}_{i-1}\big]$. Then, for every $a,v>0$,

$$\Pr\left(\sum_{i=1}^{s} d_i \ge a \ \ \text{and}\ \ V_s \le v \ \ \text{for some } s\le T\right) \;\le\; \exp\left(\frac{-a^2}{2(v+ba)}\right).$$

Freedman’s inequality says that if we know the boundedness and variance of a martingale difference sequence, its partial sums exhibit exponential concentration around their expected value, uniformly over all prefixes $s\le T$.

Now we turn our attention to the variance term in the first noise term, i.e. $\sum_{t=1}^{i}(t+\gamma-1)\,\langle\epsilon_t^v,\theta_t-\theta^*\rangle$. It is the summation of a martingale difference sequence, since $\mathbb{E}_{x_t}\big[\langle\epsilon_t^v,\theta_t-\theta^*\rangle\,\big|\,\mathcal{F}_{t-1}\big]=0$. Note that Lemma 5 gives us upper bounds on the boundedness/variance of $\epsilon_t^v$. However, the main technical difficulty is that each summand involves the error of past iterates, i.e. $\|\theta_t-\theta^*\|_2$.

Our solution is the use of Freedman’s inequality, which gives us loose control of all past error terms $\|\theta_t-\theta^*\|_2$, $t=1,\dots,N$, with high probability. On the contrary, a recurrence technique used in past works [21, 20, 19] uses increasing clipping levels and calls the Bernstein inequality (which only provides an upper bound for the entire sequence, $s=T$) $N$ times in order to control $\|\theta_t-\theta^*\|_2$ for every $t$. As a result, it incurs an extra factor, logarithmic in $N$, on their bound, since it imposes too strong a control over past error terms.

Finally, we describe at a high level why clipped-SGD allows an $O(1)$ batch size. Previous works on stochastic optimization with strongly convex and smooth objectives use a constant step size throughout the training process [21, 44, 7, 39]. However, to ensure that their approaches make constant progress at each iteration, they must use exponentially growing batch sizes to reduce the variance of the gradients. In our approach, we instead explicitly control the variance by using a decaying learning rate and clipping the gradients. Therefore, we are able to provide a careful analysis of the resulting bounded martingale sequences.

## 6 Experiments

To corroborate our theoretical findings, we conduct experiments on mean estimation and linear regression to study the performance of clipped-SGD. Another experiment, on logistic regression, is presented in Section D.3 in the Appendix.

#### Methods.

We compare clipped-SGD with vanilla SGD, which uses the stochastic gradient to update without a clipping function. For linear regression, we also compare clipped-SGD with Lasso regression [40], Ridge regression and Huber regression [47, 29]. All methods use the same step size $\eta_t$ at step $t$.

To simulate heavy-tailed samples, we draw from a scaled standardized Pareto distribution with a given tail parameter, for which the $k$th moment only exists when $k$ is smaller than the tail parameter. The smaller the tail parameter, the more heavy-tailed the distribution. Due to space constraints, we defer other results with different setups to the Appendix.
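Concretely, a draw from a "scaled standardized Pareto" can be produced by shifting and scaling a raw Pareto sample to zero mean and unit variance using its analytic moments. The helper below is our own sketch of that preprocessing (NumPy's `pareto` samples the Lomax variant; we assume this matches the paper's construction up to scaling):

```python
import numpy as np

def standardized_pareto(size, tail=3.0, rng=None):
    """Zero-mean, unit-variance samples from a Pareto (Lomax) distribution.

    `tail` is the tail parameter (assumed > 2 so the variance exists); the
    k-th moment is finite only for k < tail. np.random.Generator.pareto
    draws Lomax samples with mean 1/(tail-1) and variance
    tail / ((tail-1)**2 * (tail-2)).
    """
    rng = np.random.default_rng(rng)
    raw = rng.pareto(tail, size=size)
    mean = 1.0 / (tail - 1.0)
    var = tail / ((tail - 1.0) ** 2 * (tail - 2.0))
    return (raw - mean) / np.sqrt(var)
```

With `tail=3.0`, only moments of order $k<3$ exist, so the standardized samples have unit variance but an infinite fourth moment.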

#### Choice of Hyper-parameter.

Note that in Theorem 1, the clipping level $\lambda$ depends on the initialization error, i.e. $\|\theta_1-\theta^*\|_2$, which is not known in advance. Moreover, we found that the suggested value of $\lambda$ has too large a constant factor and substantially decreases the convergence rate, especially for small $N$. However, standard hyper-parameter selection techniques such as cross-validation and hold-out validation require storing the entire validation set, and are therefore not suitable for streaming settings.

Consequently, we use a sequential validation method [30], where we do "training" and "evaluation" on the last $q$ fraction of the data. Formally, given $m$ candidate solutions trained from samples $x_1,\dots,x_N$, let $\hat{\theta}_j^t$ be the estimated parameter for candidate $j$ at iteration $t$. Then we choose the candidate that minimizes the empirical mean of the risk over the last $q$ fraction of the samples, i.e.

$$j^* = \arg\min_{1\le j\le m}\; \frac{1}{qN}\sum_{t=(1-q)N+1}^{N}\bar{\mathcal{L}}\big(\hat{\theta}_j^t,\,x_t\big). \tag{16}$$

That is, during the last $q$ fraction of the iterations, we compute the risk incurred by $\hat{\theta}_j^t$ on the sample $x_t$ before using that sample to update the parameter. In our experiments, we use this scheme to tune the delay parameter $\gamma$, the clipping level $\lambda$, and the regularization factors for Lasso, Ridge and Huber regression.
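A single-pass version of this scheme can be sketched as follows (an illustration with hypothetical helper names; `update` stands in for one step of whichever estimator is being tuned):

```python
import numpy as np

def sequential_validation(candidates, samples, loss, update, q=0.1):
    """Pick among hyper-parameter candidates in one streaming pass.

    `candidates` maps a candidate name to its initial parameter;
    `update(name, theta, x)` performs one step for that candidate;
    `loss(theta, x)` evaluates the sample loss. During the last q fraction
    of the stream, each sample is scored with the current parameter
    *before* it is used for the update, as in Eq. (16).
    """
    n = len(samples)
    start_eval = int((1.0 - q) * n)
    params = dict(candidates)
    scores = {name: 0.0 for name in params}
    for t, x in enumerate(samples):
        for name, theta in params.items():
            if t >= start_eval:
                scores[name] += loss(theta, x)   # score before updating
            params[name] = update(name, theta, x)
    return min(scores, key=scores.get)
```

Every candidate is updated on every sample, but only the last `q` fraction of samples contribute to its score, so no validation set ever needs to be stored.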

#### Metric.

For any estimator $\hat{\theta}$, we use the loss $\|\hat{\theta}-\theta^*\|_2$ as our primary metric. To measure the tail performance of estimators (not just their expected loss), we also use $Q_\delta\big(\|\hat{\theta}-\theta^*\|_2\big)$, the bound on the loss that we wish to hold with probability $1-\delta$. This could also be viewed as the $(1-\delta)$-quantile of the loss (e.g. if $\delta=0.05$, then this would be the 95th percentile of the loss).
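Over repeated trials, this metric is just an empirical quantile of the per-trial losses; a minimal helper (the name is ours):

```python
import numpy as np

def quantile_loss(errors, delta=0.05):
    """(1 - delta)-quantile of the per-trial losses: the smallest bound on
    the loss that holds with probability at least 1 - delta over trials."""
    return float(np.quantile(np.asarray(errors), 1.0 - delta))
```

For instance, with `delta=0.05` this returns the 95th percentile of the recorded losses.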

### 6.1 Synthetic Experiments: Mean estimation

#### Setup.

We obtain samples from a scaled standardized Pareto distribution. We initialize the mean parameter estimate away from the true mean, and fix the step size to $\eta_t=1/(t+\gamma)$ for clipped-SGD and vanilla SGD. We note that in this setting, simple algebra shows that vanilla SGD with step size $1/t$ computes exactly the empirical mean of the samples. Each metric is reported over 50,000 trials. For reference, we also compare our approach with coordinate-wise/geometric median-of-means (MoM), which use more than $O(p)$ space.

#### Results.

Figure 2 shows the performance of all algorithms. In the first panel, we can see that the expected error of clipped-SGD converges at the rate $O(1/\sqrt{N})$, as our theory indicates. Also, as shown in the second panel, the 99.9 percent quantile loss of vanilla SGD is over 10 times worse than its expected error, while the tail performance of clipped-SGD is similar to its expected performance. Finally, we note that though coordinate-wise/geometric median-of-means use more space, they are still worse than our clipped-SGD algorithm.

In the third panel, we can see that clipped-SGD performs better across different clipping levels $\lambda$. When the clipping level is too small, the effective steps are too small and the final error is high, while if we use a very large clipping level, the performance of clipped-SGD approaches that of vanilla SGD.

### 6.2 Synthetic Experiments: Linear regression

#### Setup

We generate covariates from a scaled standardized Pareto distribution. The true regression parameter is set to $\theta^*$ and the initial parameter is set to $\theta_1$. The response is generated by $y=\langle x,\theta^*\rangle+w$, where $w$ is sampled from a scaled standardized Pareto distribution with mean $0$ and variance $\sigma^2$. Each metric is reported over 50,000 trials.

#### Results

We note that in this experiment, hyperparameters are selected with the sequential validation technique described above. Figures 3(a) and 3(b) show that clipped-SGD performs the best among all baselines in terms of average error and quantile errors across different probability levels $\delta$. Also, the quantile loss grows linearly in $\sqrt{\log(1/\delta)}$, as Corollary 4.2 indicates. In Figure 3(c), we plot the quantile loss against different clipping levels; it shows a similar trend to the mean estimation experiment.

Next, in Figure 3(d), we plot the averaged convergence curves for different delay parameters $\gamma$. This shows that it is necessary to use a large enough $\gamma$ to prevent the iterates from diverging; this phenomenon can also be observed for the different baseline models. Figures 3(e) and 3(f) show that clipped-SGD performs the best across different noise levels $\sigma$ and different dimensions $p$.

## 7 Discussion

In this paper, we provide a streaming algorithm for statistical estimation under heavy-tailed distributions. In particular, we close a gap in the theory of clipped stochastic gradient descent with heavy-tailed noise. We show that clipped-SGD can not only be used for parameter estimation tasks, such as mean estimation and linear regression, but also for more general stochastic optimization problems with heavy-tailed noise. There are several avenues for future work, including a better understanding of clipped-SGD under different distributions, such as those with higher-order bounded moments or symmetric distributions, where the clipping technique incurs less bias. Finally, it would also be of interest to extend our results to different robustness settings, such as Huber’s $\epsilon$-contamination model [29], where a constant portion of the observed samples are arbitrary outliers.

## References

• [1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp. 308–318. Cited by: Appendix B.
• [2] S. Bubeck (2014) Convex optimization: algorithms and complexity. arXiv preprint arXiv:1405.4980. Cited by: Appendix E.
• [3] O. Catoni (2012) Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’IHP Probabilités et statistiques, Vol. 48, pp. 1148–1185. Cited by: §1.
• [4] X. Chen, S. Z. Wu, and M. Hong (2020) Understanding gradient clipping in private SGD: a geometric perspective. Advances in Neural Information Processing Systems 33. Cited by: Appendix B.
• [5] Y. Cheng, I. Diakonikolas, R. Ge, and M. Soltanolkotabi (2020) High-dimensional robust mean estimation via gradient descent. In International Conference on Machine Learning, pp. 1768–1778. Cited by: §1.1, §1, §4.1.
• [6] Y. Cherapanamjeri, N. Flammarion, and P. L. Bartlett (2019) Fast mean estimation with sub-gaussian rates. In Conference on Learning Theory, pp. 786–806. Cited by: §1.1, §1, §4.1.
• [7] D. Davis, D. Drusvyatskiy, L. Xiao, and J. Zhang (2021) From low probability to high confidence in stochastic convex optimization. Journal of Machine Learning Research 22 (49), pp. 1–38. Cited by: Table 1, §1.1, §1, §5.
• [8] J. Depersin and G. Lecué (2019) Robust subgaussian estimation of a mean vector in nearly linear time. arXiv preprint arXiv:1906.03058. Cited by: §1.1, §4.1.
• [9] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart (2017) Being robust (in high dimensions) can be practical. In International Conference on Machine Learning, pp. 999–1008. Cited by: §1.
• [10] I. Diakonikolas, D. M. Kane, and A. Pensia (2020) Outlier robust mean estimation with subgaussian rates via stability. arXiv preprint arXiv:2007.15618. Cited by: §1.1.
• [11] Y. Dong, S. B. Hopkins, and J. Li (2019) Quantum entropy scoring for fast robust mean estimation and improved outlier detection. arXiv preprint arXiv:1906.11366. Cited by: §4.1.
• [12] J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization.. Journal of machine learning research 12 (7). Cited by: §1.
• [13] P. Dvurechensky and A. Gasnikov (2016) Stochastic intermediate gradient method for convex problems with stochastic inexact oracle. Journal of Optimization Theory and Applications 171 (1), pp. 121–145. Cited by: Table 1.
• [14] J. Fan, Q. Li, and Y. Wang (2017) Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society. Series B, Statistical methodology 79 (1), pp. 247. Cited by: §1.1.
• [15] D. A. Freedman (1975) On tail probabilities for martingales. the Annals of Probability, pp. 100–118. Cited by: §5.
• [16] V. Fritsch, B. Da Mota, E. Loth, G. Varoquaux, T. Banaschewski, G. J. Barker, A. L. Bokde, R. Brühl, B. Butzek, P. Conrod, et al. (2015) Robust regression for large-scale neuroimaging studies. Neuroimage 111, pp. 431–441. Cited by: §1.
• [17] S. Garg, J. Zhanson, E. Parisotto, A. Prasad, J. Z. Kolter, S. Balakrishnan, Z. C. Lipton, R. Salakhutdinov, and P. Ravikumar (2021) On proximal policy optimization’s heavy-tailed gradients. arXiv preprint arXiv:2102.10264. Cited by: Appendix B.
• [18] S. Ghadimi and G. Lan (2013) Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM Journal on Optimization 23 (4), pp. 2061–2089. Cited by: Table 1, §1.
• [19] E. Gorbunov, M. Danilova, and A. Gasnikov (2020) Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. arXiv preprint arXiv:2005.10785. Cited by: Table 1, Appendix E, §1.1, §1, §1, §5, §5, §5.
• [20] E. Gorbunov, D. Dvinskikh, and A. Gasnikov (2019) Optimal decentralized distributed algorithms for stochastic convex optimization. arXiv preprint arXiv:1911.07363. Cited by: §5.
• [21] E. Gorbunov, P. Dvurechensky, and A. Gasnikov (2018) An accelerated method for derivative-free smooth stochastic convex optimization. arXiv preprint arXiv:1802.09022. Cited by: §5, §5.
• [22] R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik (2019) SGD: general analysis and improved rates. In International Conference on Machine Learning, pp. 5200–5209. Cited by: §1.
• [23] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville (2017) Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028. Cited by: Appendix B.
• [24] N. J. Harvey, C. Liaw, Y. Plan, and S. Randhawa (2019) Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pp. 1579–1613. Cited by: §1, §1, §3.
• [25] N. J. Harvey, C. Liaw, and S. Randhawa (2019) Simple and optimal high-probability bounds for strongly-convex stochastic gradient descent. arXiv preprint arXiv:1909.00843. Cited by: Appendix B, §1, §1, §5, §5.
• [26] E. Hazan and S. Kale (2014) Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research 15 (1), pp. 2489–2512. Cited by: §1, §1.
• [27] S. B. Hopkins (2020) Mean estimation with sub-gaussian rates in polynomial time. Annals of Statistics 48 (2), pp. 1193–1213. Cited by: §1.1.
• [28] D. Hsu and S. Sabato (2016) Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research 17 (1), pp. 543–582. Cited by: §1.1, §1.
• [29] P. J. Huber et al. (1973) Robust regression: asymptotics, conjectures and monte carlo. Annals of statistics 1 (5), pp. 799–821. Cited by: §6, §7.
• [30] R. J. Hyndman and G. Athanasopoulos (2018) Forecasting: principles and practice. OTexts. Cited by: §6.
• [31] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §1.
• [32] Z. Lei, K. Luh, P. Venkat, and F. Zhang (2020) A fast spectral algorithm for mean estimation with sub-gaussian rates. In Conference on Learning Theory, pp. 2598–2612. Cited by: §1.1, §1, §4.1.
• [33] M. Li, M. Soltanolkotabi, and S. Oymak (2020) Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 4313–4324. Cited by: Appendix B.
• [34] X. Li and F. Orabona (2019) On the convergence of stochastic gradient descent with adaptive stepsizes. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 983–992. Cited by: §1.
• [35] G. Lugosi and S. Mendelson (2019) Mean estimation and regression under heavy-tailed distributions: a survey. Foundations of Computational Mathematics 19 (5), pp. 1145–1190. Cited by: Table 2, §1.1, §1, §4.1.
• [36] G. Lugosi and S. Mendelson (2019) Robust multivariate mean estimation: the optimality of trimmed mean. arXiv preprint arXiv:1907.11391. Cited by: §4.1.
• [37] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al. (2013) Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1222–1230. Cited by: §1.
• [38] S. Minsker et al. (2015) Geometric median and robust estimation in banach spaces. Bernoulli 21 (4), pp. 2308–2335. Cited by: §1.1, §1.1, §1, §4.1.
• [39] A. V. Nazin, A. S. Nemirovsky, A. B. Tsybakov, and A. B. Juditsky (2019) Algorithms of robust stochastic optimization based on mirror descent method. Automation and Remote Control 80 (9), pp. 1607–1627. Cited by: Table 1, §1.1, §1, §5.
• [40] N. H. Nguyen and T. D. Tran (2012) Robust lasso with missing and grossly corrupted observations. IEEE transactions on information theory 59 (4), pp. 2036–2058. Cited by: §6.
• [41] R. Pascanu, T. Mikolov, and Y. Bengio (2012) On the difficulty of training recurrent neural networks. CoRR, abs/1211.5063 2 (417), pp. 1. Cited by: Appendix B.
• [42] A. Prasad, S. Balakrishnan, and P. Ravikumar (2019) A unified approach to robust mean estimation. arXiv preprint arXiv:1907.00927. Cited by: §1.1, §1.
• [43] A. Prasad, S. Balakrishnan, and P. Ravikumar (2020) A robust univariate mean estimator is all you need. In International Conference on Artificial Intelligence and Statistics, pp. 4034–4044. Cited by: §1.
• [44] A. Prasad, A. S. Suggala, S. Balakrishnan, and P. Ravikumar (2018) Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485. Cited by: Table 1, §F.3, §1.1, §1, §1, §5.
• [45] A. Rakhlin, O. Shamir, and K. Sridharan (2011) Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647. Cited by: Appendix B, §1, §1, §3, §5.
• [46] A. S. Suggala, K. Bhatia, P. Ravikumar, and P. Jain (2019) Adaptive hard thresholding for near-optimal consistent robust regression. In Conference on Learning Theory, pp. 2892–2897. Cited by: §1.1, §1.
• [47] Q. Sun, W. Zhou, and J. Fan (2020) Adaptive huber regression. Journal of the American Statistical Association 115 (529), pp. 254–265. Cited by: §1.1, §6.
• [48] H. Victor et al. (1999) A general class of exponential inequalities for martingales and ratios. Annals of probability 27 (1), pp. 537–564. Cited by: Appendix E.
• [49] Y. You, I. Gitman, and B. Ginsburg (2017) Scaling SGD batch size to 32K for ImageNet training. arXiv preprint arXiv:1708.03888 6, pp. 12. Cited by: Appendix B.
• [50] J. Zhang, T. He, S. Sra, and A. Jadbabaie (2019) Why gradient clipping accelerates training: a theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881. Cited by: Appendix B.
• [51] J. Zhang, S. P. Karimireddy, A. Veit, S. Kim, S. J. Reddi, S. Kumar, and S. Sra (2019) Why are adaptive methods good for attention models? arXiv preprint arXiv:1912.03194. Cited by: §1.1, §1, §1.

## Appendix A Organization

The appendices contain additional technical content and are organized as follows. In Appendix B, we provide additional details on related work, including a detailed comparison with previous work on stochastic optimization. In Appendix C, we detail the hyperparameters used for the experiments in Section 6. In Appendix D, we present supplementary experimental results for different setups for heavy-tailed mean estimation and linear regression; additionally, we show a synthetic experiment on logistic regression. Finally, in Appendices E and F, we give the proofs of Theorem 1 in Section 3 and the corollaries in Section 4.

## Appendix B Related work : Additional details

#### Heavy-tailed stochastic optimization.

In this section, we present a detailed comparison of existing results. In Table 1, we compare existing high probability bounds of stochastic optimization for strongly convex and smooth objectives.

Since our work focuses on the large-scale setting (where we need to access data in a streaming fashion), we assume the number of samples is large, so that the required is small. In such a setting, is the dominating term and the error is driven by the stochastic noise term . Ignoring differences in logarithmic factors and assuming is small, all methods for heavy-tailed noise in Table 1 achieve and are comparable to algorithms derived under the sub-Gaussian noise assumption.

However, all of the existing methods except ours require batch size . Their batch sizes are not constant because they use a constant step size throughout training: although this lets them achieve linear convergence on the initialization error , they must use an exponentially growing batch size to reduce the variance induced by the gradient noise. Moreover, in the large-scale setting where the noise term dominates, linear convergence of the initial error is not important. We instead choose a step size, which is widely used in stochastic optimization for strongly convex objectives [45, 25]. Our proposed clipped-SGD can therefore enjoy a convergence rate while using a constant batch size.
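A minimal sketch of the constant-batch clipped-SGD update with a decaying step size is given below. The clipping threshold, delay parameter `t0`, step-size constant `c`, and the toy least-squares problem are illustrative choices, not the paper's tuned hyperparameters or its exact schedule.

```python
import numpy as np

def clipped_sgd(grad_fn, theta0, n_steps, clip, t0=10.0, c=1.0):
    """Clipped SGD with batch size 1 and a c/(t0 + t) step size.
    grad_fn(theta, t) returns one stochastic gradient; clip is the
    threshold on the gradient's Euclidean norm."""
    theta = np.array(theta0, dtype=float)
    for t in range(n_steps):
        g = grad_fn(theta, t)
        norm = np.linalg.norm(g)
        if norm > clip:                 # shrink the step, never reverse it
            g = g * (clip / norm)
        theta -= (c / (t0 + t)) * g
    return theta

# Toy strongly convex problem: least squares with heavy-tailed noise.
rng = np.random.default_rng(0)
theta_star = np.ones(5)

def grad_fn(theta, t):
    x = rng.standard_normal(5)
    y = x @ theta_star + rng.standard_t(df=2.5)   # heavy-tailed noise
    return (x @ theta - y) * x

theta_hat = clipped_sgd(grad_fn, np.zeros(5), n_steps=20000, clip=5.0)
```

Note that every iteration consumes exactly one sample, in contrast to the exponentially growing batches discussed above; the delay `t0` plays the role of the delay parameter varied in Figure 3(d).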

Finally, our analysis improves the dependency on the confidence-level term : it has no extra logarithmic factors and does not depend on . Although our bound has a worse dependency on the condition number , we argue that the extra factor arises because our bound is derived for the squared parameter error, i.e. , instead of the difference between objective values . As a result, we believe the dependency on in our bounds can be improved by modifying Lemma E.