 # Algorithms of Robust Stochastic Optimization Based on Mirror Descent Method

We propose an approach to construction of robust non-Euclidean iterative algorithms for convex composite stochastic optimization based on truncation of stochastic gradients. For such algorithms, we establish sub-Gaussian confidence bounds under weak assumptions about the tails of the noise distribution in convex and strongly convex settings. Robust estimates of the accuracy of general stochastic algorithms are also proposed.


## 1 Introduction

In this paper, we consider the problem of convex composite stochastic optimization:

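$$\min_{x\in X} F(x),\qquad F(x)=\mathbf{E}\{\Phi(x,\omega)\}+\psi(x), \tag{1}$$

where $X$ is a compact convex subset of a finite-dimensional real vector space $E$ with norm $\|\cdot\|$, $\omega$ is a random variable on a probability space $\Omega$ with distribution $P$, function $\psi: X\to\mathbf{R}$ is convex and continuous, and function $\Phi: X\times\Omega\to\mathbf{R}$. Suppose that the expectation

$$\phi(x):=\mathbf{E}\{\Phi(x,\omega)\}=\int_\Omega\Phi(x,\omega)\,dP(\omega)$$

is finite for all $x\in X$, and is a convex and differentiable function of $x$. Under these assumptions, the problem (1) has a solution with optimal value $F_*=\min_{x\in X}F(x)$.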

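Assume that there is an oracle, which for any input $(x,\omega)\in X\times\Omega$ returns a stochastic gradient, that is, a vector $G(x,\omega)$ satisfying

$$\mathbf{E}\{G(x,\omega)\}=\nabla\phi(x)\quad\text{and}\quad\mathbf{E}\{\|G(x,\omega)-\nabla\phi(x)\|_*^2\}\le\sigma^2,\quad\forall x\in X, \tag{2}$$

where $\|\cdot\|_*$ is the conjugate norm to $\|\cdot\|$, and $\sigma>0$ is a constant. The aim of this paper is to construct $(1-\alpha)$-reliable approximate solutions of the problem (1), i.e., solutions $\hat x_N$, based on $N$ queries of the oracle and satisfying the condition

$$\mathrm{Prob}\{F(\hat x_N)-F_*\le\delta_N(\alpha)\}\ge1-\alpha,\quad\forall\alpha\in(0,1), \tag{3}$$

with $\delta_N(\alpha)>0$ as small as possible.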

Note that stochastic optimization problems of the form (1) arise in the context of penalized risk minimization, where the confidence bounds (3) are directly converted into confidence bounds for the accuracy of the obtained estimators. In this paper, the bounds (3) are derived with $\delta_N(\alpha)$ of order $\sqrt{\ln(1/\alpha)/N}$. Such bounds are often called sub-Gaussian confidence bounds. Standard results on sub-Gaussian confidence bounds for stochastic optimization algorithms assume boundedness of exponential or subexponential moments of the stochastic noise of the oracle $G(x,\omega)-\nabla\phi(x)$ (cf. [1, 2, 3]). In the present paper, we propose robust stochastic algorithms that satisfy sub-Gaussian bounds of type (3) under a significantly less restrictive condition (2).

Recall that the notion of robustness of statistical decision procedures was introduced by J. Tukey and P. Huber [5, 6, 7] in the 1960ies, which led to the subsequent development of robust stochastic approximation algorithms. In particular, in the 1970ies–1980ies, algorithms that are robust for wide classes of noise distributions were proposed for problems of stochastic optimization and parametric identification. Their asymptotic properties as the sample size increases have been well studied, see, for example, [8, 9, 10, 11, 12, 13, 14, 15, 16] and references therein. An important contribution to the development of the robust approach was made by Ya.Z. Tsypkin. Thus, a significant place in the monographs [17, 18] is devoted to the study of iterative robust identification algorithms.

The interest in robust estimation resumed in the 2010ies due to the need to develop statistical procedures that are resistant to noise with heavy tails in high-dimensional problems. Some recent work [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29] develops the method of median of means for constructing estimates that satisfy sub-Gaussian confidence bounds for noise with heavy tails. Thus, the median of means approach was used to construct a $(1-\alpha)$-reliable version of stochastic approximation with averaging (“batch” algorithm) in a stochastic optimization setting similar to (1). Other original approaches were developed in [31, 32, 33, 34, 35], in particular, the geometric median techniques for robust estimation of signals and covariance matrices with sub-Gaussian guarantees [34, 35]. There was also a renewal of interest in robust iterative algorithms. Thus, it was shown that robustness of stochastic approximation algorithms can be enhanced by using the geometric median of stochastic gradients [36, 37]. Another variant of the stochastic approximation procedure for calculating the geometric median was studied in [38, 39], where a specific property of the problem (boundedness of the stochastic gradients) allowed the authors to construct $(1-\alpha)$-reliable bounds under a very weak assumption about the tails of the noise distribution.

This paper discusses an approach to the construction of robust stochastic algorithms based on truncation of the stochastic gradients. It is shown that this method satisfies sub-Gaussian confidence bounds. In Sections 2 and 3, we define the main components of the optimization problem under consideration. In Section 4, we define the robust stochastic mirror descent algorithm and establish confidence bounds for it. Section 5 is devoted to robust accuracy estimates for general stochastic algorithms. Finally, Section 6 establishes robust confidence bounds for problems in which $F$ has quadratic growth. The Appendix contains the proofs of the results of the paper.

## 2 Notation and Definitions

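Let $E$ be a finite-dimensional real vector space with norm $\|\cdot\|$ and let $E^*$ be the conjugate space to $E$. Denote by $\langle s,x\rangle$ the value of the linear function $s\in E^*$ at point $x\in E$ and by $\|\cdot\|_*$ the conjugate to the norm $\|\cdot\|$ on $E^*$, i.e.,

$$\|s\|_*=\max_x\{\langle s,x\rangle:\ \|x\|\le1\},\quad s\in E^*.$$

On the unit ball

$$B=\{x\in E:\ \|x\|\le1\},$$

we consider a continuous convex function $\theta: B\to\mathbf{R}$ with the following property:

$$\langle\theta'(x)-\theta'(x'),x-x'\rangle\ge\|x-x'\|^2,\quad\forall x,x'\in B, \tag{4}$$

where $\theta'(\cdot)$ is a continuous version of the subgradient of $\theta(\cdot)$ and $\partial\theta(x)$ denotes the subdifferential of function $\theta(\cdot)$ at point $x$, i.e., the set of all subgradients at this point. In other words, function $\theta(\cdot)$ is strongly convex on $B$ with coefficient 1 with respect to the norm $\|\cdot\|$. We will call $\theta(\cdot)$ the normalized proxy function. Examples of such functions are: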

• for ;

• with for ;

•   with  for , where is the space of symmetric matrices equipped with the nuclear norm and

are eigenvalues of matrix

.

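Here and in what follows, $\|\cdot\|_p$ denotes the $\ell_p$-norm in $\mathbf{R}^n$, $p\ge1$. Without loss of generality, we will assume below that

$$0=\arg\min_{x\in B}\theta(x).$$

We also introduce the notation

$$\Theta=\max_{x\in B}\theta(x)-\min_{x\in B}\theta(x)\ge\tfrac12.$$

Now, let $X$ be a convex compact subset in $E$ and let $x_0\in X$ and $R>0$ be such that $\max_{x\in X}\|x-x_0\|\le R$. We equip $X$ with a proxy function

$$\vartheta(x)=R^2\,\theta\!\left(\frac{x-x_0}{R}\right).$$

Note that $\vartheta(\cdot)$ is strongly convex with coefficient 1 and

$$\max_{x\in X}\vartheta(x)-\min_{x\in X}\vartheta(x)\le R^2\Theta.$$

Let $D:=\max_{x,x'\in X}\|x-x'\|$ be the diameter of the set $X$. Then $D\le2R$.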

We will also use the Bregman divergence

$$V_x(z)=\vartheta(z)-\vartheta(x)-\langle\vartheta'(x),z-x\rangle,\quad\forall z,x\in X.$$
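
As a simple illustration, assume the Euclidean setup $E=\mathbf{R}^n$, $\|\cdot\|=\|\cdot\|_2$, with the normalized proxy function $\theta(x)=\frac12\|x\|_2^2$, which is strongly convex with coefficient 1. Under this assumption the definitions above reduce to

$$\vartheta(x)=\tfrac12\|x-x_0\|_2^2,\qquad \vartheta'(x)=x-x_0,\qquad V_x(z)=\tfrac12\|z-x\|_2^2,\qquad \Theta=\tfrac12,$$

so the Bregman divergence is half the squared Euclidean distance and does not depend on $x_0$ or $R$.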

In the following, we denote by $C$ and $C'$ positive numerical constants, not necessarily the same in different cases.

## 3 Assumptions

Consider a convex composite stochastic optimization problem (1) on a convex compact set $X\subset E$. Assume in the following that the function

$$\phi(x)=\mathbf{E}\{\Phi(x,\omega)\}$$

is convex on $X$, differentiable at each point of the set $X$ and its gradient satisfies the Lipschitz condition

$$\|\nabla\phi(x')-\nabla\phi(x)\|_*\le L\|x-x'\|,\quad\forall x,x'\in X. \tag{5}$$

Assume also that function $\psi$ is convex and continuous. In what follows, we assume that we have at our disposal a stochastic oracle, which for any input $(x,\omega)\in X\times\Omega$ returns a random vector $G(x,\omega)$ satisfying the conditions (2). In addition, it is assumed that for any $a\in E^*$ and $\beta>0$ an exact solution of the minimization problem

$$\min_{z\in X}\{\langle a,z\rangle+\psi(z)+\beta\vartheta(z)\}$$

is available. This assumption is fulfilled for typical penalty functions $\psi$, such as convex power functions of the $\ell_p$-norm (if $X$ is a convex compact in $\mathbf{R}^n$) or negative entropy $\psi(x)=\kappa\sum_{j=1}^n x_j\ln x_j$, where $\kappa>0$ (if $X$ is the standard simplex in $\mathbf{R}^n$). Finally, it is assumed that a vector $g(\bar x)$ is available, where $\bar x$ is a point in the set $X$ such that

$$\|g(\bar x)-\nabla\phi(\bar x)\|_*\le\upsilon\sigma \tag{6}$$

with a constant $\upsilon\ge0$. This assumption is motivated as follows.

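First, if we know a priori that the global minimum of function $\phi$ is attained at an interior point of the set $X$ (which is common in statistical applications of stochastic approximation), the gradient $\nabla\phi$ vanishes at this point. Therefore, choosing $\bar x$ to be this minimizer, one can put $g(\bar x)=0$ and assumption (6) holds automatically with $\upsilon=0$.

Second, in general, one can choose $\bar x$ as any point of the set $X$ and $g(\bar x)$ as a geometric median of stochastic gradients $G(\bar x,\omega_i)$, $i=1,\dots,m$, over $m$ oracle queries. It follows from the properties of the geometric median that if $m$ is of order $\ln(1/\varepsilon)$ with some sufficiently small $\varepsilon>0$, then

$$\mathrm{Prob}\{\|g(\bar x)-\nabla\phi(\bar x)\|_*>\upsilon\sigma\}\le\varepsilon. \tag{7}$$

Thus, the confidence bounds obtained below will remain valid up to an $\varepsilon$-correction in the probability of deviations.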
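
The second option can be sketched in code. The snippet below is a minimal illustration that assumes the Euclidean norm and uses Weiszfeld's fixed-point iteration to approximate the geometric median; the text does not prescribe a particular solver, and the function name, the iteration count and the `eps` safeguard are illustrative choices.

```python
import math


def geometric_median(points, iters=100, eps=1e-12):
    """Approximate the Euclidean geometric median of a list of vectors.

    Weiszfeld's iteration: repeatedly re-average the points with weights
    inversely proportional to their distance from the current estimate.
    `eps` (an illustrative safeguard) avoids division by zero.
    """
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    m = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    for _ in range(iters):
        weights = []
        for p in points:
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, m)))
            weights.append(1.0 / max(dist, eps))
        total = sum(weights)
        m = [sum(w * p[j] for w, p in zip(weights, points)) / total
             for j in range(dim)]
    return m
```

Here `points` would hold the $m$ stochastic gradients observed at $\bar x$. Unlike the sample mean, the result is insensitive to a minority of very large observations, which is what makes it a candidate for $g(\bar x)$ under heavy-tailed noise.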

## 4 Accuracy bounds for Algorithm RSMD

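In what follows, we assume that the assumptions of Section 3 are fulfilled. Introduce a composite proximal transform

$$\begin{aligned}\mathrm{Prox}_{\beta,x}(\xi)&:=\arg\min_{z\in X}\{\langle\xi,z\rangle+\psi(z)+\beta V_x(z)\}\\&\;=\arg\min_{z\in X}\{\langle\xi-\beta\vartheta'(x),z\rangle+\psi(z)+\beta\vartheta(z)\},\end{aligned} \tag{8}$$

where $\beta>0$ is a tuning parameter.

For $i=1,2,\dots$, define the algorithm of Robust Stochastic Mirror Descent (RSMD) by the recursion

$$x_i=\mathrm{Prox}_{\beta_{i-1},x_{i-1}}(y_i),\quad x_0\in X, \tag{9}$$

$$y_i=\begin{cases}G(x_{i-1},\omega_i), & \text{if } \|G(x_{i-1},\omega_i)-g(\bar x)\|_*\le L\|\bar x-x_{i-1}\|+\lambda+\upsilon\sigma,\\ g(\bar x), & \text{otherwise.}\end{cases} \tag{10}$$

Here $\beta_i>0$, $i=0,1,\dots$, and $\lambda>0$ are tuning parameters that will be defined below, and $\omega_1,\omega_2,\dots$ are independent identically distributed (i.i.d.) realizations of the random variable $\omega$, corresponding to the oracle queries at each step of the algorithm.

The approximate solution of problem (1) after $N$ iterations is defined as the weighted average

$$\hat x_N=\Big[\sum_{i=1}^N\beta_{i-1}^{-1}\Big]^{-1}\sum_{i=1}^N\beta_{i-1}^{-1}x_i. \tag{11}$$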
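
The recursion (9)–(10) with the averaging (11) can be sketched as follows. This is a minimal illustration under simplifying assumptions that are not part of the text: Euclidean norm with $V_x(z)=\frac12\|z-x\|_2^2$, $\psi\equiv0$, $X$ a Euclidean ball of radius $R$ centred at $x_0$, and a constant $\beta_i=\beta$. In this case the proximal step is a projected gradient step and (11) is a plain average. The names `rsmd`, `project_ball` and `oracle` are hypothetical.

```python
import math


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def project_ball(z, center, radius):
    """Euclidean projection of z onto {x : ||x - center|| <= radius}."""
    d = [zi - ci for zi, ci in zip(z, center)]
    n = norm(d)
    if n <= radius:
        return list(z)
    return [ci + radius * di / n for ci, di in zip(center, d)]


def rsmd(oracle, x0, R, L, lam, beta, N, x_bar, g_bar, upsilon_sigma=0.0):
    """Sketch of RSMD: Euclidean case, psi = 0, constant step parameter beta.

    oracle(x) returns one stochastic gradient G(x, omega);
    g_bar plays the role of g(x_bar) in the truncation rule.
    """
    x = list(x0)
    avg = [0.0] * len(x0)
    for _ in range(N):
        G = oracle(x)
        dist = norm([a - b for a, b in zip(x_bar, x)])
        dev = norm([a - b for a, b in zip(G, g_bar)])
        # Truncation: keep G only if it is close enough to g(x_bar).
        y = G if dev <= L * dist + lam + upsilon_sigma else list(g_bar)
        # Prox step: for V_x(z) = 0.5*||z - x||^2 this is a projected step.
        x = project_ball([xi - yi / beta for xi, yi in zip(x, y)], x0, R)
        # Constant beta: the weighted average is the mean of x_1, ..., x_N.
        avg = [a + xi / N for a, xi in zip(avg, x)]
    return avg
```

An observation that violates the threshold is replaced by `g_bar` rather than rescaled, so a single extreme gradient cannot move the iterate by more than a step along $g(\bar x)$.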

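If the global minimum of function $\phi$ is attained at an interior point of the set $X$ and $\bar x$ is chosen to be this point, then definition (10) is simplified. In this case, replacing $\|\bar x-x_{i-1}\|$ by the upper bound $D$ and putting $\upsilon=0$ and $g(\bar x)=0$ in (10), we define the truncated stochastic gradient by the formula

$$y_i=\begin{cases}G(x_{i-1},\omega_i), & \text{if } \|G(x_{i-1},\omega_i)\|_*\le LD+\lambda,\\ 0, & \text{otherwise.}\end{cases}$$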

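The next result describes some useful properties of the mirror descent recursion (9). Define

$$\xi_i=y_i-\nabla\phi(x_{i-1})$$

and

$$\varepsilon(x^N,z)=\sum_{i=1}^N\Big(\beta_{i-1}^{-1}\big[\langle\nabla\phi(x_{i-1}),x_i-z\rangle+\psi(x_i)-\psi(z)\big]+\tfrac12V_{x_{i-1}}(x_i)\Big), \tag{12}$$

where $x^N=(x_0,\dots,x_N)$.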

###### Proposition 1

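Let $\beta_i\ge2L$ for all $i=0,1,\dots$, and let $\hat x_N$ be defined in (11), where $x_i$ are the iterations (9) for any values $y_i$, not necessarily given by (10). Then for any $z\in X$ we have

$$\Big[\sum_{i=1}^N\beta_{i-1}^{-1}\Big]\big[F(\hat x_N)-F(z)\big]\le\sum_{i=1}^N\beta_{i-1}^{-1}\big[F(x_i)-F(z)\big]\le\varepsilon(x^N,z) \tag{13}$$

$$\begin{aligned}&\le V_{x_0}(z)+\sum_{i=1}^N\left[\frac{\langle\xi_i,z-x_{i-1}\rangle}{\beta_{i-1}}+\frac{\|\xi_i\|_*^2}{\beta_{i-1}^2}\right]\\&\le2V_{x_0}(z)+\sum_{i=1}^N\left[\frac{\langle\xi_i,z_{i-1}-x_{i-1}\rangle}{\beta_{i-1}}+\frac32\,\frac{\|\xi_i\|_*^2}{\beta_{i-1}^2}\right],\end{aligned} \tag{14}$$

where $z_i$ is a random vector with values in $X$ depending only on $x_0,\xi_1,\dots,\xi_i$.

Using Proposition 1 we obtain the following bounds on the expected error of the approximate solution of problem (1) based on the RSMD algorithm. In what follows, we denote by $\mathbf{E}$ the expectation with respect to the distribution of $\omega^N=(\omega_1,\dots,\omega_N)$.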

###### Corollary 1

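Set $M=LR$. Assume that $\lambda\ge\max\{M,\sigma\sqrt N\}+\upsilon\sigma$ and $\beta_i\ge2L$ for all $i=0,1,\dots$. Let $\hat x_N$ be the approximate solution (11), where $x_i$ are the iterations of the RSMD algorithm defined by relations (9) and (10). Then

$$\mathbf{E}\{F(\hat x_N)\}-F_*\le\Big[\sum_{i=1}^N\beta_{i-1}^{-1}\Big]^{-1}\left[R^2\Theta+\sum_{i=1}^N\left(\frac{2D\sigma}{\beta_{i-1}\sqrt N}+\frac{4\sigma^2}{\beta_{i-1}^2}\right)\right]. \tag{15}$$

In particular, if $\beta_i=\bar\beta$ for all $i=0,1,\dots$, where

$$\bar\beta=\max\left\{2L,\ \frac{\sigma\sqrt N}{R\sqrt\Theta}\right\}, \tag{16}$$

then the following inequalities hold:

$$\mathbf{E}\{F(\hat x_N)\}-F_*\le\frac{\bar\beta}{N}\,\mathbf{E}\Big\{\sup_{z\in X}\varepsilon(x^N,z)\Big\}\le C\max\left\{\frac{LR^2\Theta}{N},\ \frac{\sigma R\sqrt\Theta}{\sqrt N}\right\}. \tag{17}$$

Moreover, in this case we have the following inequality with explicit constants:

$$\mathbf{E}\{F(\hat x_N)\}-F_*\le\max\left\{\frac{2LR^2\Theta}{N}+\frac{4R\sigma(1+\sqrt\Theta)}{\sqrt N},\ \frac{2R\sigma(1+4\sqrt\Theta)}{\sqrt N}\right\}.$$

This result shows that if the truncation threshold $\lambda$ is large enough, then the expected error of the proposed algorithm is bounded similarly to the expected error of the standard mirror descent algorithm with averaging, i.e., the algorithm in which stochastic gradients are taken without truncation: $y_i=G(x_{i-1},\omega_i)$.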

The following theorem gives confidence bounds for the proposed algorithm.

###### Theorem 1

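Let $\beta_i=\bar\beta\ge2L$ for all $i=0,1,\dots$, and let, for a given parameter $\tau$,

$$\lambda=\max\left\{\sigma\sqrt{\frac N\tau},\ M\right\}+\upsilon\sigma. \tag{18}$$

Let $\hat x_N$ be the approximate solution (11), where $x_i$ are the RSMD iterations defined by relations (9) and (10). Then there is a random event of probability at least $1-2e^{-\tau}$ on which the following inequalities hold:

$$F(\hat x_N)-F_*\le\frac{\bar\beta}{N}\sup_{z\in X}\varepsilon(x^N,z)\le\frac CN\left(\bar\beta R^2\Theta+R\max\{\sigma\sqrt{\tau N},M\tau\}+\bar\beta^{-1}\max\{N\sigma^2,M^2\tau\}\right).$$

In particular, choosing $\bar\beta$ as in formula (16) we have, on the same event,

$$F(\hat x_N)-F_*\le\max\left\{C_1\frac{LR^2[\tau\vee\Theta]}{N},\ C_2\,\sigma R\sqrt{\frac{\tau\vee\Theta}{N}}\right\}, \tag{19}$$

where $C_1>0$ and $C_2>0$ are numerical constants.

The values of the numerical constants $C_1$ and $C_2$ in (19) can be obtained from the proof of the theorem, cf. the bound in (40).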
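
The tuning parameters of Theorem 1 are explicit, so they can be computed before the algorithm is run. The helper below is a small sketch, assuming that $M=LR$ and that the threshold (18) is read as $\max\{\sigma\sqrt{N/\tau},M\}+\upsilon\sigma$; the function names are illustrative.

```python
import math


def step_size(L, sigma, N, R, Theta):
    """Step parameter (16): max{2L, sigma*sqrt(N) / (R*sqrt(Theta))}."""
    return max(2.0 * L, sigma * math.sqrt(N) / (R * math.sqrt(Theta)))


def threshold(L, R, sigma, N, tau, upsilon=0.0):
    """Truncation threshold (18), assuming M = L*R.

    Setting tau = 1 gives a threshold that does not depend on the
    confidence level.
    """
    M = L * R
    return max(sigma * math.sqrt(N / tau), M) + upsilon * sigma
```

For instance, with $L=R=1$, $\sigma=2$, $N=100$ and $\tau=4$ the threshold equals $\max\{2\sqrt{25},1\}=10$: a larger $\tau$ (higher confidence) lowers the threshold and truncates more aggressively.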

Confidence bound (19) in Theorem 1 contains two terms corresponding to the deterministic error and to the stochastic error. Unlike the case of noise with a “light tail” and the bound in expectation (17), the deterministic error $LR^2[\tau\vee\Theta]/N$ depends on $\tau$. Note also that Theorem 1 gives a sub-Gaussian confidence bound (the order of the stochastic error is $\sigma R\sqrt{[\tau\vee\Theta]/N}$). However, the truncation threshold $\lambda$ depends on the confidence level $\tau$. This can be inconvenient for the implementation of the algorithms. Some simple but coarser confidence bounds can be obtained by using a universal threshold independent of $\tau$, which is $\lambda=\max\{\sigma\sqrt N,M\}+\upsilon\sigma$. In particular, we have the following result.

###### Theorem 2

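Let $\beta_i=\bar\beta\ge2L$ for all $i=0,1,\dots$, and let $\tau$ be a given parameter. Set

$$\lambda=\max\{\sigma\sqrt N,\ M\}+\upsilon\sigma.$$

Let $\hat x_N=N^{-1}\sum_{i=1}^Nx_i$, where $x_i$ are the iterations of the RSMD algorithm defined by relations (9) and (10). Then there is a random event of probability at least $1-2e^{-\tau}$ on which the following inequalities hold:

$$F(\hat x_N)-F_*\le\frac{\bar\beta}{N}\sup_{z\in X}\varepsilon(x^N,z)\le\frac CN\left(\bar\beta R^2\Theta+\tau R\max\{\sigma\sqrt N,M\}+\tau\bar\beta^{-1}\max\{N\sigma^2,M^2\}\right).$$

In particular, choosing $\bar\beta$ as in formula (16) we have

$$F(\hat x_N)-F_*\le\frac{\bar\beta}{N}\sup_{z\in X}\varepsilon(x^N,z)\le C\max\left\{\frac{LR^2[\tau\vee\Theta]}{N},\ \tau\sigma R\sqrt{\frac\Theta N}\right\} \tag{20}$$

on the same event.

The values of the numerical constants in Theorem 2 can be obtained from the proof, cf. the bound in (40).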

## 5 Robust Confidence Bounds for Stochastic Optimization Methods

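Consider an arbitrary algorithm for solving the problem (1) based on $N$ queries of the stochastic oracle. Assume that we have a sequence $(x_i,G(x_i,\omega_{i+1}))$, $i=0,\dots,N$, where $x_i\in X$ are the search points of some stochastic algorithm and $G(x_i,\omega_{i+1})$ are the corresponding observations of the stochastic gradient. It is assumed that $x_i$ depends only on $\{(x_{j-1},\omega_j),\ j=1,\dots,i\}$. The approximate solution of the problem (1) is defined in the form:

$$\hat x_N=N^{-1}\sum_{i=1}^Nx_i.$$

Our goal is to construct a confidence interval with sub-Gaussian accuracy for $F(\hat x_N)-F_*$. To do this, we use the following fact. Note that for any $t\ge L$ the value

$$\epsilon_N(t)=N^{-1}\sup_{z\in X}\left\{\sum_{i=1}^N\big[\langle\nabla\phi(x_{i-1}),x_i-z\rangle+\psi(x_i)-\psi(z)+tV_{x_{i-1}}(x_i)\big]\right\} \tag{21}$$

is an upper bound on the accuracy of the approximate solution $\hat x_N$:

$$F(\hat x_N)-F_*\le\epsilon_N(t) \tag{22}$$

(see Lemma 1 in Appendix). This fact is true for any sequence of points $x_0,\dots,x_N$ in $X$, regardless of how they are obtained. However, since the function $\nabla\phi(\cdot)$ is not known, the estimate (22) cannot be used in practice. Replacing the gradients $\nabla\phi(x_{i-1})$ in (21) with their truncated estimates $y_i$ defined in (10) we get an implementable analogue $\hat\epsilon_N(t)$ of $\epsilon_N(t)$:

$$\hat\epsilon_N(t)=N^{-1}\sup_{z\in X}\left\{\sum_{i=1}^N\big[\langle y_i,x_i-z\rangle+\psi(x_i)-\psi(z)+tV_{x_{i-1}}(x_i)\big]\right\}. \tag{23}$$

Note that computing $\hat\epsilon_N(t)$ reduces to solving an auxiliary problem of the form $\min_{z\in X}\{\langle a,z\rangle+\psi(z)+\beta\vartheta(z)\}$ considered in Section 3, with $\beta=0$. Thus, it is computationally not more complex than, for example, one step of the RSMD algorithm. Replacing $\nabla\phi(x_{i-1})$ with $y_i$ introduces a random error. In order to get a reliable upper bound for $\epsilon_N(t)$, we need to compensate this error by slightly increasing $\hat\epsilon_N(t)$. Specifically, we add to $\hat\epsilon_N(t)$ the value $\bar\rho_N(\tau)/N$, where

$$\begin{aligned}\bar\rho_N(\tau)={}&4R\sqrt{5\Theta\max\{N\sigma^2,M^2\tau\}}+16R\max\{\sigma\sqrt{N\tau},M\tau\}\\&+\min_{\mu\ge0}\left\{20\mu\max\{N\sigma^2,M^2\tau\}+\mu^{-1}\sum_{i=1}^NV_{x_{i-1}}(x_i)\right\}.\end{aligned}$$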

###### Proposition 2

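Let $(x_i,G(x_i,\omega_{i+1}))$, $i=0,\dots,N$, be the trajectory of a stochastic algorithm for which $x_i$ depends only on $\{(x_{j-1},\omega_j),\ j=1,\dots,i\}$. Let $\tau$ be a given parameter and let $y_i$ be the truncated stochastic gradients defined in (10), where the threshold $\lambda$ is chosen in the form (18). Then for any $t\ge L$ the value

$$\Delta_N(\tau,t)=\hat\epsilon_N(t)+\bar\rho_N(\tau)/N$$

is an upper bound for $\epsilon_N(t)$ with probability $1-2e^{-\tau}$, so that

$$\mathrm{Prob}\{F(\hat x_N)-F_*\le\Delta_N(\tau,t)\}\ge1-2e^{-\tau}.$$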
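
The bound of Proposition 2 can be evaluated from the trajectory alone. The sketch below does this under assumptions that are not part of the text: Euclidean norm with $V_x(z)=\frac12\|z-x\|_2^2$ and $\Theta=\frac12$, $\psi\equiv0$, $X$ a Euclidean ball of radius $R$ centred at $x_0$, and $M=LR$. In this case the supremum in (23) is available in closed form, and the minimum over $\mu$ in $\bar\rho_N(\tau)$ equals $2\sqrt{20AB}$ with $A=\max\{N\sigma^2,M^2\tau\}$ and $B=\sum_iV_{x_{i-1}}(x_i)$. All function names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bregman_sum(xs):
    """Sum of V_{x_{i-1}}(x_i) = 0.5*||x_i - x_{i-1}||^2 over the trajectory."""
    return sum(0.5 * sum((a - b) ** 2 for a, b in zip(x, xp))
               for x, xp in zip(xs[1:], xs[:-1]))


def eps_hat(xs, ys, x0, R, t):
    """Implementable estimate for psi = 0 and X a ball of radius R around x0.

    xs = [x_0, ..., x_N] are search points, ys = [y_1, ..., y_N] the
    truncated gradients. The sup over z of <S, x0 - z> is R*||S||_2.
    """
    N = len(ys)
    S = [sum(col) for col in zip(*ys)]
    linear = sum(dot(y, [a - b for a, b in zip(x, x0)])
                 for y, x in zip(ys, xs[1:]))
    return (linear + R * math.sqrt(dot(S, S)) + t * bregman_sum(xs)) / N


def rho_bar(xs, R, L, sigma, tau, Theta=0.5):
    """Correction term; min over mu of 20*mu*A + B/mu equals 2*sqrt(20*A*B)."""
    N = len(xs) - 1
    M = L * R  # assumption: M = L*R
    A = max(N * sigma ** 2, M ** 2 * tau)
    B = bregman_sum(xs)
    return (4 * R * math.sqrt(5 * Theta * A)
            + 16 * R * max(sigma * math.sqrt(N * tau), M * tau)
            + 2 * math.sqrt(20 * A * B))


def delta(xs, ys, x0, R, L, sigma, tau, t):
    """Upper confidence bound: eps_hat(t) + rho_bar(tau) / N."""
    return eps_hat(xs, ys, x0, R, t) + rho_bar(xs, R, L, sigma, tau) / len(ys)
```

Since only the observed points and truncated gradients enter the computation, the same routine applies to any algorithm that produced the trajectory, not only to RSMD.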

Since $\Delta_N(\tau,t)$ monotonically increases in $t$ it suffices to use this bound for $t=L$ when $L$ is known. Note that, although $\Delta_N(\tau,t)$ gives an upper bound for $\epsilon_N(t)$, Proposition 2 does not guarantee that $\Delta_N(\tau,t)$ is sufficiently close to $\epsilon_N(t)$. However, this property holds for the RSMD algorithm with a constant step, as follows from the next result.

###### Corollary 2

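Under the conditions of Proposition 2, let the vectors $x_1,\dots,x_N$ be given by the RSMD recursion (9)–(10), where $\beta_i=\bar\beta\ge2L$, $i=0,\dots,N-1$. Then

$$\begin{aligned}\bar\rho_N(\tau)\le{}&N\epsilon_N(\bar\beta)+4R\sqrt{5\Theta\max\{N\sigma^2,M^2\tau\}}\\&+16R\max\{\sigma\sqrt{N\tau},M\tau\}+20\bar\beta^{-1}\max\{N\sigma^2,M^2\tau\}.\end{aligned} \tag{24}$$

Moreover, if $\bar\beta$ is chosen as in (16) then

$$\bar\rho_N(\tau)\le N\epsilon_N(\bar\beta)+C_3LR^2[\Theta\vee\tau]+C_4\sigma R\sqrt{N[\Theta\vee\tau]},$$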

and with probability at least $1-4e^{-\tau}$ the value $\Delta_N(\tau,\bar\beta)$ satisfies the inequalities

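$$\epsilon_N(\bar\beta)\le\Delta_N(\tau,\bar\beta)\le3\epsilon_N(\bar\beta)+2C_3\frac{LR^2[\Theta\vee\tau]}{N}+2C_4\,\sigma R\sqrt{\frac{\Theta\vee\tau}{N}}, \tag{25}$$

where $C_3>0$ and $C_4>0$ are numerical constants.

The values of the numerical constants $C_3$ and $C_4$ can be derived from the proof of this corollary.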

## 6 Robust Confidence Bounds for Quadratic Growth Problems

In this section, it is assumed that $F$ is a function with quadratic growth on $X$ in the following sense. Let $F$ be a continuous function on $X$ and let $X_*\subset X$ be the set of its minimizers on $X$. Then $F$ is called a function with quadratic growth on $X$ if there is a constant $\kappa>0$ such that for any $x\in X$ there exists $\bar x(x)\in X_*$ such that the following inequality holds:

$$F(x)-F_*\ge\frac\kappa2\|x-\bar x(x)\|^2. \tag{26}$$

Note that every strongly convex function $F$ on $X$ with the strong convexity coefficient $\kappa$ is a function with quadratic growth on $X$. However, the assumption of strong convexity, when used together with the Lipschitz condition with constant $L$ on the gradient of $F$, has the disadvantage that, except for the case when $\|\cdot\|$ is the Euclidean norm, the ratio $L/\kappa$ depends on the dimension of the space $E$. For example, in the important cases where $\|\cdot\|$ is the $\ell_1$-norm, the nuclear norm, the total variation norm, etc., one can easily check that there are no functions with Lipschitz continuous gradient such that the ratio $L/\kappa$ is smaller than the dimension of the space. Replacing the strong convexity with the growth condition (26) eliminates this problem.

On the other hand, assumption (26) is quite natural in the composite optimization problem since in many interesting examples the function $\phi$ is smooth and the non-smooth part $\psi$ of the objective function is strongly convex. In particular, if $E=\mathbf{R}^n$ and the norm is the $\ell_1$-norm, this allows us to consider such strongly convex components as the negative entropy (if $X$ is the standard simplex in $\mathbf{R}^n$), power functions of the $\ell_p$-norm with the corresponding choice of parameters (if $X$ is a convex compact in $\mathbf{R}^n$) and others. In all these cases, condition (26) is fulfilled with a known constant $\kappa$, which allows for the use of the approach of [2, 42] to improve the confidence bounds of the stochastic mirror descent.

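The RSMD algorithm for quadratically growing functions will be defined in stages. At each stage, for specially selected $r>0$ and $y\in X$ it solves an auxiliary problem

$$\min_{x\in X_r(y)}F(x)$$

using the RSMD. Here

$$X_r(y)=\{x\in X:\ \|x-y\|\le r\}.$$

We initialize the algorithm by choosing arbitrary $y_0=x_0\in X$ and $r_0\ge\max_{z\in X}\|z-x_0\|$. We set $r_k^2=2^{-k}r_0^2$, $k=1,2,\dots$. Let $C_1$ and $C_2$ be the numerical constants in the bound (19) of Theorem 1. For a given parameter $\tau>0$ and $k=1,2,\dots$ we define the values

$$\overline N_k=\max\left\{4C_1\frac{L[\tau\vee\Theta]}{\kappa},\ 16C_2\frac{\sigma^2[\tau\vee\Theta]}{\kappa^2r_{k-1}^2}\right\},\qquad N_k=\lceil\overline N_k\rceil. \tag{27}$$

Here $\lceil t\rceil$ denotes the smallest integer greater than or equal to $t$. Set

$$m(N):=\max\Big\{k:\ \sum_{j=1}^kN_j\le N\Big\}.$$

Now, let $k\in\{1,2,\dots,m(N)\}$. At the $k$-th stage of the algorithm, we solve the problem of minimization of $F$ on the ball $X_{r_{k-1}}(y_{k-1})$: we find its approximate solution $\hat x_{N_k}$ according to (9)–(11), where we replace $x_0$ by $y_{k-1}$, $X$ by $X_{r_{k-1}}(y_{k-1})$, $R$ by $r_{k-1}$, $N$ by $N_k$, and set

$$\lambda=\max\left\{\sigma\sqrt{\frac{N_k}{\tau}},\ Lr_{k-1}\right\}+\upsilon\sigma,$$

and

$$\beta_i\equiv\max\left\{2L,\ \frac{\sigma\sqrt{N_k}}{r_{k-1}\sqrt\Theta}\right\}.$$

It is assumed that, at each stage of the algorithm, an exact solution of the minimization problem

$$\min_{z\in X_{r_{k-1}}(y_{k-1})}\{\langle a,z\rangle+\psi(z)+\beta\vartheta(z)\}$$

is available for any $a\in E^*$ and $\beta>0$. At the output of the $k$-th stage of the algorithm, we obtain $y_k:=\hat x_{N_k}$.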
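
The stage lengths and radii of the multi-stage procedure can be tabulated in advance. The sketch below assumes the radius schedule $r_k^2=r_{k-1}^2/2$ and takes the constants $C_1$, $C_2$ of (19) as user-supplied parameters, since their values are not given here; the defaults of 1.0 are placeholders, not values from the paper, and the function name is illustrative.

```python
import math


def stage_schedule(N, L, sigma, kappa, r0, tau, Theta, C1=1.0, C2=1.0):
    """Return (lengths, radii) of the stages that fit into a budget of N queries.

    lengths[k-1] is N_k from (27); radii[k-1] is r_{k-1}, the radius used
    at stage k. The number of completed stages m(N) is len(lengths).
    C1 and C2 are placeholders for the unspecified numerical constants.
    """
    s = max(tau, Theta)
    lengths, radii = [], []
    used, r_sq = 0, r0 ** 2
    while True:
        n_bar = max(4 * C1 * L * s / kappa,
                    16 * C2 * sigma ** 2 * s / (kappa ** 2 * r_sq))
        n_k = math.ceil(n_bar)
        if used + n_k > N:
            break
        lengths.append(n_k)
        radii.append(math.sqrt(r_sq))
        used += n_k
        r_sq /= 2.0  # assumed schedule: squared radius halves after each stage
    return lengths, radii
```

With $L=\sigma=\kappa=r_0=\tau=1$, $\Theta=\frac12$ and the placeholder constants, a budget of $N=100$ queries allows two stages of lengths 16 and 32; the third stage would need 64 queries. Stage lengths grow as the radius shrinks because the stochastic term in (27) scales as $r_{k-1}^{-2}$.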

###### Theorem 3

Assume that $m(N)\ge1$, i.e. at least one stage of the algorithm described above is completed. Then there is a random event of probability at least $1-2m(N)e^{-\tau}$ on which the approximate solution $y_{m(N)}$ obtained after $m(N)$ stages of the algorithm satisfies the inequality

$$F(y_{m(N)})-F_*\le C\max\left\{\kappa r_0^2\,2^{-N/4},\ \kappa r_0^2\exp\left(-\frac{C'\kappa N}{L[\tau\vee\Theta]}\right),\ \frac{\sigma^2[\tau\vee\Theta]}{\kappa N}\right\}. \tag{28}$$

Theorem 3 shows that, for functions with quadratic growth, the deterministic error component can be significantly reduced: it becomes exponentially decreasing in $N$. The stochastic error component is also significantly reduced. Note that the factor $m(N)$ is of logarithmic order, as follows from (27), and has little effect on the probability of deviations. Neglecting this factor in the probability of deviations and considering the stochastic component of the error, we see that the confidence bound of Theorem 3 is approximately sub-exponential rather than sub-Gaussian.

## 7 Conclusion

We have considered algorithms of smooth stochastic optimization when the distribution of noise in observations has heavy tails. It is shown that by truncating the observed gradients with a suitable threshold one can construct confidence sets for the approximate solutions that are similar to those in the case of “light tails”. It should be noted that the order of the deterministic error in the obtained bounds is suboptimal: it is substantially greater than the optimal rates achieved by the accelerated algorithms [3, 40], namely, $O(LR^2N^{-2})$ in the case of convex objective function and $O(\exp(-N\sqrt{\kappa/L}))$ in the strongly convex case. On the other hand, the proposed approach cannot be used to obtain robust versions of the accelerated algorithms since applying it to such algorithms leads to accumulation of the bias caused by the truncation of the gradients. The problem of constructing accelerated robust stochastic algorithms with optimal guarantees remains open.

## Appendix

###### Lemma 1

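Assume that $\phi$ and $\psi$ satisfy the assumptions of Section 3, and let $x_0,\dots,x_N$ be some points of the set $X$. Define

$$\varepsilon_{i+1}(z):=\langle\nabla\phi(x_i),x_{i+1}-z\rangle+\langle\psi'(x_{i+1}),x_{i+1}-z\rangle+LV_{x_i}(x_{i+1}).$$

Then for any $z\in X$ the following inequality holds:

$$F(x_{i+1})-F(z)\le\varepsilon_{i+1}(z).$$

In addition, for $\hat x_N=N^{-1}\sum_{i=1}^Nx_i$ we have

$$F(\hat x_N)-F(z)\le N^{-1}\sum_{i=1}^N[F(x_i)-F(z)]\le N^{-1}\sum_{i=0}^{N-1}\varepsilon_{i+1}(z).$$

Proof  Using the property $V_x(z)\ge\frac12\|x-z\|^2$, the convexity of functions $\phi$ and $\psi$ and the Lipschitz condition on $\nabla\phi$ we get that, for any $z\in X$,

$$\begin{aligned}F(x_{i+1})-F(z)&=[\phi(x_{i+1})-\phi(z)]+[\psi(x_{i+1})-\psi(z)]\\&=[\phi(x_{i+1})-\phi(x_i)]+[\phi(x_i)-\phi(z)]+[\psi(x_{i+1})-\psi(z)]\\&\le[\langle\nabla\phi(x_i),x_{i+1}-x_i\rangle+LV_{x_i}(x_{i+1})]+\langle\nabla\phi(x_i),x_i-z\rangle+\psi(x_{i+1})-\psi(z)\\&\le\langle\nabla\phi(x_i),x_{i+1}-z\rangle+\langle\psi'(x_{i+1}),x_{i+1}-z\rangle+LV_{x_i}(x_{i+1})=\varepsilon_{i+1}(z).\end{aligned}$$

Summing up over $i$ from 0 to $N-1$ and using the convexity of $F$ we obtain the second result of the lemma.

In what follows, we denote by $\mathbf{E}_{x_{i-1}}\{\cdot\}$ the conditional expectation for fixed $x_{i-1}$.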

###### Lemma 2

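Let the assumptions of Section 3 be fulfilled and let $x_i$ and $y_i$ satisfy the RSMD recursion, cf. (9) and (10). Then

$$\begin{aligned}&\text{(a)}\quad\|\xi_i\|_*\le2(M+\upsilon\sigma)+\lambda,\\&\text{(b)}\quad\|\mathbf{E}_{x_{i-1}}\{\xi_i\}\|_*\le(M+\upsilon\sigma)\Big(\frac\sigma\lambda\Big)^2+\frac{\sigma^2}{\lambda},\\&\text{(c)}\quad\big(\mathbf{E}_{x_{i-1}}\{\|\xi_i\|_*^2\}\big)^{1/2}\le\sigma+(M+\upsilon\sigma)\frac\sigma\lambda.\end{aligned} \tag{29}$$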

Proof  Set . Note that by construction
. We have

 ξi =