 # Accelerated Stochastic Mirror Descent Algorithms For Composite Non-strongly Convex Optimization

We consider the problem of minimizing the sum of an average of a large number of smooth convex component functions and a general, possibly non-differentiable, convex function. Although many methods have been proposed to solve this problem under the assumption that the sum is strongly convex, few methods support the non-strongly convex case. Adding a small quadratic regularization is a common trick used to tackle non-strongly convex problems; however, it may worsen the quality of solutions or weaken the performance of the algorithms. Avoiding this trick, we propose a new accelerated stochastic mirror descent method for solving the problem without the strong convexity assumption. Our method extends the deterministic accelerated proximal gradient methods of Paul Tseng and can be applied even when proximal points are computed inexactly. Our direct algorithms can be proven to achieve the optimal convergence rate $O(1/k^2)$ under a suitable choice of the errors in calculating the proximal points. We also propose a scheme for solving the problem when the component functions are non-smooth and finally apply the new algorithms to a class of composite convex-concave optimization problems.


## 1. Introduction

Let $E$ be a finite dimensional real vector space equipped with an inner product $\langle\cdot,\cdot\rangle$. We consider the following composite convex optimization problem

$$\min_{x\in E}\left\{f^P(x):=F(x)+P(x)\right\}, \tag{1}$$

where $F(x)=\frac{1}{n}\sum_{i=1}^n f_i(x)$. Throughout this paper we focus on problems satisfying the following assumption.

###### Assumption 1.1.

The function $P$ is lower semi-continuous and convex. The domain of $P$, $\mathrm{dom}(P)$, is closed. Each function $f_i$ is convex and $L_i$-Lipschitz smooth, i.e., it is differentiable on an open set containing $\mathrm{dom}(P)$ and its gradient is Lipschitz continuous with constant $L_i$:

$$\|\nabla f_i(x)-\nabla f_i(y)\|\le L_i\|x-y\|,\quad\forall x,y\in\mathrm{dom}(P).$$

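Problems of this form often appear in machine learning and statistics. For example, logistic regression and the Lasso both fit this template: each $f_i$ is the loss on the $i$-th sample (the logistic loss and the squared loss, respectively), and for the Lasso $P$ is a multiple of the $\ell_1$ norm. More generally, any $\ell_1$-regularized empirical risk minimization problem

$$\min_{x\in\mathbb{R}^p}\frac{1}{n}\sum_{i=1}^n f_i(x)+\lambda\|x\|_1$$

with smooth convex loss functions $f_i$ belongs to the framework of (1). We can also use the function $P$ for modelling purposes. For example, when $P$ is the indicator function of a closed convex set $X$, i.e., $P(x)=0$ if $x\in X$ and $P(x)=+\infty$ otherwise, (1) becomes the popular constrained finite sum optimization problem

$$\min_{x\in X}\left\{F(x):=\frac{1}{n}\sum_{i=1}^n f_i(x)\right\}.$$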

### Previous methods and motivation for our work:

One well-known method to solve (1) is the proximal gradient descent (PGD) method. Let the proximal mapping of a convex function $P$ be defined as

$$\mathrm{prox}_P(x)=\arg\min_u\left\{P(u)+\frac{1}{2}\|u-x\|_2^2\right\}.$$

At each iteration, PGD calculates a proximal point

$$x_k=\mathrm{prox}_{\gamma_k P}\big(x_{k-1}-\gamma_k\nabla F(x_{k-1})\big)=\arg\min_x\left\{\langle\nabla F(x_{k-1}),x\rangle+\frac{1}{2\gamma_k}\|x-x_{k-1}\|^2+P(x)\right\},$$

where $\gamma_k$ is the step size at the $k$-th iteration. Methods such as gradient descent, which computes $x_k=x_{k-1}-\gamma_k\nabla F(x_{k-1})$, or projection gradient descent, which computes $x_k=\Pi_X\big(x_{k-1}-\gamma_k\nabla F(x_{k-1})\big)$ with $\Pi_X$ the Euclidean projection onto $X$, are considered to be in the class of PGD algorithms. In particular, PGD becomes gradient descent when $P\equiv 0$ and becomes projection gradient descent when $P$ is the indicator function of $X$. If $F$ and $P$ are general convex functions and $F$ is $L$-Lipschitz smooth then PGD has the convergence rate $O(1/k)$ (see ). However, this convergence rate is not optimal. Nesterov, for the first time in , proposed an acceleration method for solving (1) with $P$ being an indicator function and achieved the optimal convergence rate $O(1/k^2)$. Later, in a series of works [25, 27, 28], he introduced two other acceleration techniques, which make one or two proximal calls together with interpolation at each iteration to accelerate the convergence. Nesterov's ideas have been further studied and applied for solving many practical optimization problems such as rank reduction in multivariate linear regression and sparse covariance selection (see e.g., [4, 5, 10] and references therein). Auslender and Teboulle  used the acceleration technique in the context of Bregman divergence, which generalizes the squared Euclidean distance. Tseng  unified the analysis of all these acceleration techniques, proposed new variants and gave a simple analysis for the proof of the optimal convergence rate.
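The PGD iteration described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the least-squares objective, the box constraint, the fixed step size $1/L$, and the names `proximal_gradient`, `prox_zero` and `prox_box` are all assumptions chosen for the example. It shows how the same loop gives gradient descent when $P\equiv 0$ and projection gradient descent when $P$ is an indicator function.

```python
import numpy as np

def proximal_gradient(grad_F, prox, x0, step, iters=200):
    """Iterate x_k = prox(x_{k-1} - step * grad_F(x_{k-1}), step) with a fixed step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = prox(x - step * grad_F(x), step)
    return x

# Illustrative problem (an assumption, not taken from the paper):
# F(x) = 0.5 * ||A x - b||^2 and P = indicator of the box [0, 1]^p.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)
grad_F = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of grad_F (squared spectral norm)

prox_zero = lambda u, step: u                    # P = 0: plain gradient descent
prox_box = lambda u, step: np.clip(u, 0.0, 1.0)  # P = indicator: Euclidean projection

x_gd = proximal_gradient(grad_F, prox_zero, np.zeros(5), 1.0 / L, iters=5000)
x_box = proximal_gradient(grad_F, prox_box, np.zeros(5), 1.0 / L, iters=5000)
```

A fixed point of the box-constrained loop satisfies `x == clip(x - grad_F(x) / L, 0, 1)`, which is the usual optimality condition for the constrained problem.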

When the number of component functions $n$ is very large, applying PGD can be unappealing since computing the full gradient in each iteration is very expensive. An effective alternative is the randomized version of PGD, which is usually called the stochastic proximal gradient descent (SPGD) method,

$$x_k=\arg\min_x\left\{\langle\nabla f_{i_k}(x_{k-1}),x\rangle+\frac{1}{2\gamma_k}\|x-x_{k-1}\|^2+P(x)\right\},$$

where $i_k$ is uniformly drawn from $\{1,\ldots,n\}$ at each iteration. For a suitably chosen decreasing step size $\gamma_k$, SPGD was proven to have the sublinear convergence rate $O(1/k)$ in the case of strong convexity (see e.g., ). Many authors have proposed methods to achieve a better convergence rate when the objective is strongly convex. For instance, stochastic average gradient (SAG) incorporates a memory of previous gradients to get a linear convergence rate. Stochastic variance reduced gradient (SVRG) and proximal SVRG use the variance reduction technique, which updates an estimate $\tilde x$ of the optimal solution and calculates the full gradient periodically after every $m$ SPGD iterations. They achieve the same optimal convergence rate as SAG.

Although PGD has been extended to randomized variants for solving the large scale problem (1) and many of them achieve the optimal convergence rate for a strongly convex objective, extending the accelerated proximal gradient descent (APG) methods to achieve the optimal convergence rate for a non-strongly convex objective is still an open question. Additionally, there have been very few algorithms that support the non-strongly convex case. One such algorithm is SAGA, which only achieves the convergence rate $O(1/k)$ (see ). Another algorithm is the randomized coordinate gradient descent method (RCGD), which has recently been successfully extended to accelerated versions that achieve the optimal convergence rate $O(1/k^2)$ (see e.g., ). However, accelerated RCGD is only applicable to block separable regularization $P$, i.e., $P(x)=\sum_i P_i(x_i)$, where $x_i$ and $P_i$ are correspondingly the $i$-th coordinate block of $x$ and $P$. To the best of our knowledge, although APG has different versions, it is the only algorithm that obtains the optimal convergence rate for solving (1) under Assumption 1.1. Therefore, finding stochastic algorithms that also achieve the optimal convergence rate and perform even better than APG for the large scale problem (1) is one of our goals.

As another goal, our work aims to solve problem (1) even in cases where the proximal point with respect to $P$ cannot be computed explicitly. We note that for several choices of $P$, the proximal points at each iteration of the above-mentioned algorithms can be calculated efficiently. For example, when $P(x)=\lambda\|x\|_1$, the proximal points are explicitly calculated by a soft threshold operator (see ). However, in many situations such as nuclear norm regularization and total variation regularization, it would be very expensive to compute them exactly. For that reason, many efficient methods have been proposed to calculate proximal points inexactly (see e.g., [9, 13, 21]). In the context of proximal-point algorithms, basic algorithms that allow inexact computation of proximal points were first studied by Rockafellar . Since then, there has been growing interest in both inexact proximal point algorithms and inexact accelerated proximal point algorithms (see e.g., [8, 9, 12, 16, 33, 34, 38, 41]). However, although many works have shown impressive empirical performance of inexact (accelerated) PGD methods, there has been no analysis of randomized versions of inexact proximal gradient descent methods. Our work gives such an analysis for an inexact algorithm for this problem.
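As a concrete instance of an explicitly computable proximal point, the soft threshold operator for the $\ell_1$ norm can be sketched as follows. This is a minimal illustration under the standard definition $\mathrm{prox}_{\tau\|\cdot\|_1}(x)=\arg\min_u\{\tau\|u\|_1+\frac{1}{2}\|u-x\|^2\}$; the function names and the test vector are assumptions made for the example.

```python
import numpy as np

def soft_threshold(x, tau):
    """Componentwise prox of tau * ||.||_1: shrink each entry toward zero by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_objective(u, x, tau):
    """Objective whose minimizer over u is the proximal point of x."""
    return tau * np.abs(u).sum() + 0.5 * np.sum((u - x) ** 2)

x = np.array([3.0, -0.5, 0.2, -2.0])
p = soft_threshold(x, 1.0)  # entries with |x_i| <= 1 are set to zero, the rest shrink by 1
```

Because the objective is separable across coordinates, the operator costs $O(p)$ per call, which is why $\ell_1$ regularization is a convenient case for exact proximal steps.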

In a concurrent work [2], the author independently gives an analysis of an exact accelerated stochastic gradient descent method, see [2, Algorithm 2], and the same convergence rate is proved. Compared with [2, Algorithm 2], as we are extending the general deterministic acceleration framework of Tseng , our analysis for exact ASMD is put in a more general framework but employs simpler and neater proofs in the setting of Bregman distance. The analysis of [2] depends entirely on specific choices of  in Update (3) of our Algorithm 1. Specifically, the proofs there rely on variant 1 or variant 2 of our Example 3.1.

### Summary of contributions:

• Our main contribution is the incorporation of the variance reduction technique into the general acceleration methods of Tseng to propose a framework of exact as well as inexact accelerated stochastic mirror descent (ASMD and iASMD, respectively) algorithms for the large-scale non-strongly convex optimization problem (1). At each stage of our inexact algorithms, proximal points are allowed to be calculated inexactly. Under a suitable choice of the errors in the calculation of proximal points, our algorithms can achieve the optimal convergence rate.

• When the component functions are non-smooth, we give a scheme for minimizing the corresponding non-smooth problem. The rate obtained using our smoothing scheme significantly improves the rate obtained using subgradient methods or stochastic subgradient methods.

• We later apply the scheme to solve a class of composite convex concave optimization, which represents many optimization problems such as data-driven robust optimization, two stage stochastic programming, or some stochastic interdiction models.

The paper is organized as follows. In Section 2, we give some preliminaries for the paper. The exact and inexact accelerated stochastic mirror descent methods and their convergence analysis are considered in Section 3. In Section 4.1, we give an introduction to a class of composite convex concave optimization, consider a smoothing approximation for max-type functions and propose a version of iASMD to solve the problem. Finally, computational results are given in Section 5.

## 2. Preliminaries

For a given continuous function , a convex set and a non-negative number , we write to denote such that . We use to denote the gradient of the function at . We now give some important definitions and lemmas that will be used in the paper.

###### Definition 2.1.

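Let $h:\Omega\to\mathbb{R}$ be a strictly convex function that is differentiable on an open set containing $\mathrm{dom}(P)$. Then the Bregman distance is defined as

$$D(x,y)=h(x)-h(y)-\langle\nabla h(y),x-y\rangle,\quad\forall y\in\mathrm{dom}(P),\ x\in\Omega.$$

###### Example 2.1.

1. Let $h(x)=\frac{1}{2}\|x\|_2^2$. Then $D(x,y)=\frac{1}{2}\|x-y\|_2^2$, which is the Euclidean distance.

2. Let $\Omega=\{x\in\mathbb{R}^n_+:\sum_{i=1}^n x_i=1\}$ and $h(x)=\sum_{i=1}^n x_i\log x_i$. Then $D(x,y)=\sum_{i=1}^n x_i\log\frac{x_i}{y_i}$, which is called the entropy distance.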
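The two distances in this example can be checked numerically. The sketch below assumes the standard generating functions $h(x)=\frac{1}{2}\|x\|_2^2$ and $h(x)=\sum_i x_i\log x_i$ on the probability simplex; the helper `bregman` and the sample points are illustrative choices, not part of the paper.

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """Bregman distance D(x, y) = h(x) - h(y) - <grad h(y), x - y>."""
    return h(x) - h(y) - grad_h(y) @ (x - y)

# Assumed generating functions (standard choices).
h_euc = lambda x: 0.5 * x @ x
grad_euc = lambda x: x
h_ent = lambda x: np.sum(x * np.log(x))
grad_ent = lambda x: np.log(x) + 1.0

# Two points in the interior of the probability simplex.
x = np.array([0.2, 0.3, 0.5])
y = np.array([0.6, 0.3, 0.1])

d_euc = bregman(h_euc, grad_euc, x, y)   # equals 0.5 * ||x - y||^2
d_ent = bregman(h_ent, grad_ent, x, y)   # equals sum_i x_i * log(x_i / y_i) on the simplex
kl = np.sum(x * np.log(x / y))
```

On the simplex the linear terms of the entropy cancel because both points sum to one, which is why `d_ent` coincides with the Kullback-Leibler divergence `kl`.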

The following lemmas are fundamental properties of Bregman distance.

###### Lemma 2.1.

If $h$ is strongly convex with constant $\sigma$, i.e.,

$$h(y)\ge h(x)+\langle\nabla h(x),y-x\rangle+\frac{\sigma}{2}\|x-y\|^2,$$

then

$$D(x,y)\ge\frac{\sigma}{2}\|x-y\|^2.$$

###### Lemma 2.2.

Let be a proper convex function whose domain is an open set containing . For any , if , then for all we have

$$\phi(x)+D(x,z)\ge\phi(z^*)+D(z^*,z)+D(x,z^*).$$

###### Lemma 2.3.

$$D(x,y)+D(y,z)=D(x,z)+\langle x-y,\nabla h(z)-\nabla h(y)\rangle.$$

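If we replace the Euclidean distance in PGD by the Bregman distance, we obtain the mirror descent method

$$x_k=\arg\min_x\left\{\langle\nabla F(x_{k-1}),x\rangle+\frac{1}{\gamma_k}D(x,x_{k-1})+P(x)\right\}.$$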
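To make the mirror descent step concrete, here is a minimal sketch for one well-known special case: the entropy distance on the probability simplex with $P$ taken as the indicator of the simplex, where the step has a closed-form multiplicative update. The toy objective $F(x)=\frac{1}{2}\|x-t\|^2$, the step size, and the function names are assumptions made for illustration.

```python
import numpy as np

def entropic_mirror_step(x, g, step):
    """One mirror descent step on the probability simplex with the entropy distance.

    Solves argmin_u { <g, u> + (1/step) * sum_i u_i * log(u_i / x_i) } over the simplex,
    whose closed form is a multiplicative update followed by normalization.
    """
    w = x * np.exp(-step * (g - g.min()))  # subtracting g.min() only rescales w
    return w / w.sum()

def step_objective(u, x, g, step):
    """Objective minimized by one mirror step (hypothetical helper for checking)."""
    return g @ u + np.sum(u * np.log(u / x)) / step

# Toy problem (an assumption): F(x) = 0.5 * ||x - t||^2 with t inside the simplex.
t = np.array([0.7, 0.2, 0.1])
x = np.full(3, 1.0 / 3.0)
for _ in range(2000):
    x = entropic_mirror_step(x, x - t, step=0.5)
```

The update never leaves the simplex, so no explicit projection is needed; this is the usual motivation for choosing a Bregman distance adapted to the constraint set.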

We refer the readers to [7, 19, 35, 36] and references therein for proofs of these lemmas, other properties, and applications of the Bregman distance as well as mirror descent methods. Since we can scale $h$ if necessary, we assume $h$ is strongly convex with constant $\sigma=1$ in this paper. The following lemma on smooth functions is crucial for our analysis. Proofs are given in .

###### Lemma 2.4.

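Let $f_i$ be a convex function that is $L_i$-smooth. The following inequalities hold:

• $f_i(x)\le f_i(y)+\langle\nabla f_i(y),x-y\rangle+\frac{L_i}{2}\|x-y\|^2.$
• $\frac{1}{2L_i}\|\nabla f_i(y)-\nabla f_i(x)\|^2+\langle\nabla f_i(y),x-y\rangle\le f_i(x)-f_i(y).$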
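Both inequalities can be sanity-checked numerically on a familiar smooth convex function. The sketch below uses a single logistic loss $f(x)=\log(1+\exp(-b\,a^\top x))$ with $b=\pm 1$, whose gradient is Lipschitz with constant $\|a\|^2/4$; the data vector `a`, the label `b`, and the helper `lemma_2_4_gaps` are illustrative assumptions.

```python
import numpy as np

# One logistic loss component (illustrative data, not from the paper).
a = np.array([1.0, -2.0, 0.5])
b = 1.0
L = 0.25 * (a @ a)  # smoothness constant of the logistic loss below

def f(x):
    return np.logaddexp(0.0, -b * (a @ x))  # log(1 + exp(-b * a.x)), computed stably

def grad_f(x):
    return -b * a / (1.0 + np.exp(b * (a @ x)))

def lemma_2_4_gaps(x, y):
    """Return the slack of each inequality; both values should be non-negative."""
    lin = grad_f(y) @ (x - y)
    upper = f(y) + lin + 0.5 * L * np.sum((x - y) ** 2) - f(x)
    lower = f(x) - f(y) - lin - np.sum((grad_f(y) - grad_f(x)) ** 2) / (2.0 * L)
    return upper, lower

gaps = lemma_2_4_gaps(np.array([0.3, 0.1, -1.0]), np.array([-0.5, 0.4, 2.0]))
```

The first slack measures how loose the quadratic upper bound is; the second is the strengthened convexity lower bound that the variance analysis later relies on.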

## 3. Accelerated Stochastic Mirror Descent Methods

In this section, we describe the framework of ASMD, give analysis on convergence rate of exact ASMD as well as inexact ASMD, and propose a scheme for solving (1) when some component functions are non-smooth.

### 3.1. Algorithm description

We start with some known initial vectors. At each stage $s$, we perform $m$ accelerated stochastic mirror descent steps under a non-uniform sampling setting. At each stochastic step, we keep a part of the snapshot $\tilde x_{s-1}$ of the previous stage, weighted by $\alpha_3$, so that the effect of the acceleration techniques is maintained throughout the stages. For inexact ASMD, we let $\varepsilon_s$ be the error in calculating the proximal points of the stochastic steps at stage $s$. When $\varepsilon_s=0$, we recover exact ASMD. The non-uniform sampling method improves the complexity of our algorithms when the constants $L_i$ are different. Algorithm 1 fully describes our framework.

The set at stage should be chosen such that it contains a solution of (1), see Proposition 3.2. Choosing is the simplest variant of . To accelerate convergence, we can always choose a smaller . We refer the readers to [36, Section 3] for examples of choosing smaller and ignore the details here.

The update rule (3) of is very general. We give some examples that satisfy (3).

###### Example 3.1.
1. . For this choice, each stochastic step only finds 1 inexact/exact proximal point , or we can say it finds 1 inexact/exact projection.

2. . For this choice, each stochastic step needs to find 2 proximal points and .

3. If , are 2 variants of that satisfy (3) then their convex combination , where , is also a choice of .

It is easy to check that , where , satisfies the update rule (2). For the update rule (4), we can choose or , where

### 3.2. Convergence analysis

We first give an upper bound for the variance of $v_k$.

###### Lemma 3.1.

Conditioned on , we have the following expectation inequality with respect to

$$\mathbb{E}\|\nabla F(y_{k,s})-v_k\|^2\le\mathbb{E}\frac{1}{(nq_{i_k})^2}\left\|\nabla f_{i_k}(y_{k,s})-\nabla f_{i_k}(\tilde x_{s-1})\right\|^2.$$

The following proof is also given in [39, Corollary 3].

###### Proof.

We have

where the second equality is from the fact that . ∎
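The bound can be illustrated numerically. Since Algorithm 1 is not reproduced in this text, the sketch below assumes the usual non-uniform variance-reduced estimator $v_k=\nabla F(\tilde x)+\frac{1}{nq_{i_k}}\big(\nabla f_{i_k}(y)-\nabla f_{i_k}(\tilde x)\big)$ with $i_k$ drawn with probability $q_{i_k}$; that form, the least-squares components, and the choice $q_i\propto L_i$ are assumptions. Expectations are computed exactly by enumerating all $n$ indices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

def grad_fi(i, x):
    """Gradient of the assumed component f_i(x) = 0.5 * (a_i.x - b_i)^2."""
    return (A[i] @ x - b[i]) * A[i]

def grad_F(x):
    """Gradient of F = (1/n) * sum_i f_i."""
    return A.T @ (A @ x - b) / n

L = np.sum(A ** 2, axis=1)  # L_i = ||a_i||^2
q = L / L.sum()             # assumed sampling distribution, proportional to L_i

def vr_gradient(i, y, x_tilde, full_grad_tilde):
    """Assumed estimator: snapshot gradient plus a reweighted correction."""
    return full_grad_tilde + (grad_fi(i, y) - grad_fi(i, x_tilde)) / (n * q[i])

y = rng.standard_normal(p)
x_tilde = rng.standard_normal(p)
g_tilde = grad_F(x_tilde)
V = np.array([vr_gradient(i, y, x_tilde, g_tilde) for i in range(n)])

mean_v = q @ V                                        # exact E[v_k]
variance = q @ np.sum((V - grad_F(y)) ** 2, axis=1)   # exact E||grad F(y) - v_k||^2
bound = sum(q[i] * np.sum((grad_fi(i, y) - grad_fi(i, x_tilde)) ** 2) / (n * q[i]) ** 2
            for i in range(n))                        # right-hand side of the lemma
```

In a stochastic run one would draw a single index per step, for example with `rng.choice(n, p=q)`, and recompute `g_tilde` only once per stage.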

The following lemma serves as the cornerstone to obtain Proposition 3.1, which provides a recursive inequality within the stochastic steps at each stage $s$.

###### Lemma 3.2.

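For each stage $s$, the following inequality holds true

$$f^P(x_{k,s})\le F(y_{k,s})+\frac{\alpha_3}{2L_Q}\|\nabla F(y_{k,s})-v_k\|^2+\alpha_{2,s}\big(\langle v_k,z_{k,s}-z_{k-1,s}\rangle+\theta_s D(z_{k,s},z_{k-1,s})\big)+P(\hat x_{k,s}).$$

###### Proof.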

For simplicity, we ignore the subscript when necessary. Applying the first inequality of Lemma 2.4, we have

where the last inequality uses . Together with the update rule (3), Lemma 2.1 with , and noting that , we get

###### Lemma 3.3.

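Let $\bar z_{k,s}=\arg\min_{x\in X_s}\{\langle v_k,x\rangle+P(x)+\theta_s D(x,z_{k-1,s})\}$, i.e., $\bar z_{k,s}$ is the exact proximal point of the stochastic step at stage $s$. For all $x\in X_s$ we have

$$\langle v_k,z_{k,s}\rangle+\theta_s D(z_{k,s},z_{k-1,s})\le\langle v_k,x\rangle+P(x)+\theta_s\big(D(x,z_{k-1,s})-D(x,z_{k,s})\big)-P(z_{k,s})+\varepsilon_s-\theta_s\big(D(z_{k,s},\bar z_{k,s})-\langle x-z_{k,s},\nabla h(\bar z_{k,s})-\nabla h(z_{k,s})\rangle\big).$$

###### Proof.

Let $\phi(x)=\frac{1}{\theta_s}\big(\langle v_k,x\rangle+P(x)\big)$, then $\bar z_{k,s}=\arg\min_{x\in X_s}\{\phi(x)+D(x,z_{k-1,s})\}$. Lemma 2.2 yields that for all $x\in X_s$ we have

$$\frac{1}{\theta_s}\big(\langle v_k,x\rangle+P(x)\big)+D(x,z_{k-1,s})\ge\min_{x\in X_s}\{\phi(x)+D(x,z_{k-1,s})\}+D(x,\bar z_{k,s}).$$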

Together with , we get

 (5)

On the other hand, by Lemma 2.3,

$$D(x,\bar z_{k,s})=D(x,z_{k,s})+D(z_{k,s},\bar z_{k,s})-\langle x-z_{k,s},\nabla h(\bar z_{k,s})-\nabla h(z_{k,s})\rangle.$$

The result follows then. ∎

###### Proposition 3.1.

Denote

Considering a fixed stage , for any , if then we have

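$$\mathbb{E}f^P(x_{k,s})\le\alpha_{1,s}\mathbb{E}f^P(x_{k-1,s})+\alpha_{2,s}f^P(x)+\alpha_3 f^P(\tilde x_{s-1})+\alpha_{2,s}^2\bar L\big(\mathbb{E}D(x,z_{k-1,s})-\mathbb{E}D(x,z_{k,s})\big)+r_{k,s}.$$

###### Proof.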

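For simplicity, we omit the subscript $s$ when convenient. Applying Lemma 3.2, we have

$$f^P(x_k)\le F(y_k)+\frac{\alpha_3}{2L_Q}\|\nabla F(y_k)-v_k\|^2+\alpha_2\big(\langle v_k,z_k-z_{k-1}\rangle+\theta_s D(z_k,z_{k-1})\big)+P(\hat x_k). \tag{6}$$

We now apply Lemma 3.3 to yield that for all $x\in X_s$ we have

$$\langle v_k,z_k\rangle+\theta_s D(z_k,z_{k-1})\le\langle v_k,x\rangle+P(x)-P(z_k)+\theta_s\big(D(x,z_{k-1})-D(x,z_k)\big)+\varepsilon_s-\theta_s\big(D(z_k,\bar z_k)-\langle x-z_k,\nabla h(\bar z_k)-\nabla h(z_k)\rangle\big).$$

Together with (6) we deduce

$$f^P(x_k)\le F(y_k)+\frac{\alpha_3}{2L_Q}\|\nabla F(y_k)-v_k\|^2+\alpha_2\big(\langle v_k,x-z_{k-1}\rangle+P(x)-P(z_k)\big)+P(\hat x_k)+\alpha_2\theta_s\big(D(x,z_{k-1})-D(x,z_k)-D(z_k,\bar z_k)+\langle x-z_k,\nabla h(\bar z_k)-\nabla h(z_k)\rangle\big)+\alpha_2\varepsilon_s. \tag{7}$$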

Taking expectation with respect to $i_k$ conditioned on , and noting that and , it follows from (7) that

 (8)

On the other hand, applying Lemma 3.1, the second inequality of Lemma 2.4 and noting that , we have

 (9)

Therefore, (8) and (9) imply that

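$$\begin{aligned}
\mathbb{E}f^P(x_k)&\le(1-\alpha_3)F(y_k)+\alpha_3 f^P(\tilde x_{s-1})+\alpha_2\langle\nabla F(y_k),x-y_k\rangle+\alpha_2 P(x)+\alpha_2\langle\nabla F(y_k),y_k-z_{k-1}\rangle\\
&\quad-\alpha_3\langle\nabla F(y_k),\tilde x_{s-1}-y_k\rangle+\alpha_1 P(x_{k-1})+\alpha_2\theta_s\big(D(x,z_{k-1})-\mathbb{E}D(x,z_k)\big)+r_k\\
&\le(1-\alpha_3)F(y_k)+\alpha_3 f^P(\tilde x_{s-1})+\alpha_2\big(F(x)-F(y_k)\big)+\alpha_2 P(x)+\alpha_1\langle\nabla F(y_k),x_{k-1}-y_k\rangle\\
&\quad+\alpha_1 P(x_{k-1})+\alpha_2\theta_s\big(D(x,z_{k-1})-\mathbb{E}D(x,z_k)\big)+r_k\\
&\le(1-\alpha_3-\alpha_2)F(y_k)+\alpha_3 f^P(\tilde x_{s-1})+\alpha_2 f^P(x)+\alpha_1\big(F(x_{k-1})-F(y_k)\big)+\alpha_1 P(x_{k-1})\\
&\quad+\alpha_2\theta_s\big(D(x,z_{k-1})-\mathbb{E}D(x,z_k)\big)+r_k\\
&=\alpha_1 f^P(x_{k-1})+\alpha_2 f^P(x)+\alpha_3 f^P(\tilde x_{s-1})+\alpha_2\theta_s\big(D(x,z_{k-1})-\mathbb{E}D(x,z_k)\big)+r_k.
\end{aligned}$$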

Here and were used in the second inequality. The third inequality uses . Finally, we take expectation with respect to to get the result. ∎

As appears in the recursive inequality, the acceleration effect works through the outer stage . The following Proposition is a consequence of Proposition 3.1. It leads to the optimal convergence rate of our scheme, which is stated in Theorem 3.1 and Theorem 3.2.

###### Proposition 3.2.

Denote and . Let be the optimal solution of (1). Let be the value of at . Suppose , then we have

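$$\tilde d_s\le\alpha_{2,s+1}^2\left(\frac{(1-\alpha_{2,1})\,d_{m,0}}{\alpha_{2,1}^2\alpha_3 m}+\frac{1}{m\alpha_{2,1}^2}\sum_{k=1}^{m-1}d_{k,0}+\frac{\bar L}{m\alpha_3}\big(\mathbb{E}D(x^*,z_{m,0})-\mathbb{E}D(x^*,z_{m,s})\big)\right)+\alpha_{2,s+1}^2\sum_{i=1}^{s}\frac{\sum_{k=1}^{m}\bar r_{k,i}}{m\alpha_3\alpha_{2,i}^2}.$$

###### Proof.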

Applying Proposition 3.1 with we have

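$$\mathbb{E}\big(f^P(x_{k,s})-f^P(x^*)\big)\le\alpha_{1,s}\mathbb{E}\big(f^P(x_{k-1,s})-f^P(x^*)\big)+\alpha_3\big(f^P(\tilde x_{s-1})-f^P(x^*)\big)+\alpha_{2,s}^2\bar L\big(\mathbb{E}D(x^*,z_{k-1,s})-\mathbb{E}D(x^*,z_{k,s})\big)+\bar r_{k,s}.$$

Denote , then

$$d_{k,s}\le\alpha_{1,s}d_{k-1,s}+\alpha_3\tilde d_{s-1}+\alpha_{2,s}^2\bar L\big(\mathbb{E}D(x^*,z_{k-1,s})-\mathbb{E}D(x^*,z_{k,s})\big)+\bar r_{k,s},$$

which implies

$$\frac{1}{\alpha_{2,s}^2}d_{k,s}\le\frac{\alpha_{1,s}}{\alpha_{2,s}^2}d_{k-1,s}+\frac{\alpha_3}{\alpha_{2,s}^2}\tilde d_{s-1}+\bar L\big(\mathbb{E}D(x^*,z_{k-1,s})-\mathbb{E}D(x^*,z_{k,s})\big)+\frac{\bar r_{k,s}}{\alpha_{2,s}^2}.$$

Summing up from $k=1$ to $m$ we get

$$\frac{1}{\alpha_{2,s}^2}d_{m,s}+\frac{1-\alpha_{1,s}}{\alpha_{2,s}^2}\sum_{k=1}^{m-1}d_{k,s}\le\frac{\alpha_{1,s}}{\alpha_{2,s}^2}d_{0,s}+\frac{\alpha_3}{\alpha_{2,s}^2}m\tilde d_{s-1}+\bar L\big(\mathbb{E}D(x^*,z_{0,s})-\mathbb{E}D(x^*,z_{m,s})\big)+\frac{\sum_{k=1}^{m}\bar r_{k,s}}{\alpha_{2,s}^2}.$$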

Using the update rule (4), , and we get

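$$\frac{1}{\alpha_{2,s}^2}d_{m,s}+\frac{1-\alpha_{1,s}}{\alpha_{2,s}^2}\sum_{k=1}^{m-1}d_{k,s}\le\frac{1-\alpha_{2,s}}{\alpha_{2,s}^2}d_{m,s-1}+\frac{\alpha_3}{\alpha_{2,s}^2}\sum_{k=1}^{m-1}d_{k,s-1}+\bar L\big(\mathbb{E}D(x^*,z_{m,s-1})-\mathbb{E}D(x^*,z_{m,s})\big)+\frac{\sum_{k=1}^{m}\bar r_{k,s}}{\alpha_{2,s}^2}.$$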

Combining with the update rule (2) we obtain

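$$\frac{1-\alpha_{2,s+1}}{\alpha_{2,s+1}^2}d_{m,s}+\frac{\alpha_3}{\alpha_{2,s+1}^2}\sum_{k=1}^{m-1}d_{k,s}\le\frac{1-\alpha_{2,s}}{\alpha_{2,s}^2}d_{m,s-1}+\frac{\alpha_3}{\alpha_{2,s}^2}\sum_{k=1}^{m-1}d_{k,s-1}+\bar L\big(\mathbb{E}D(x^*,z_{m,s-1})-\mathbb{E}D(x^*,z_{m,s})\big)+\frac{\sum_{k=1}^{m}\bar r_{k,s}}{\alpha_{2,s}^2}. \tag{10}$$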

Therefore,

 α3α22,s+1m~ds≤α3α22,s+1m∑k=1d