be a finite dimensional real vector space equipped with an inner product. We consider the following composite convex optimization problem
where . Throughout this paper we focus on problems satisfying the following assumption.
The function is lower semi-continuous and convex. The domain of , , is closed. Each function is convex and -Lipschitz smooth, i.e., it is differentiable on an open set containing and its gradient is Lipschitz continuous with constant
Problems of this form often appear in machine learning and statistics. For example, in the case of logistic regression we have , where , , and ; in the case of Lasso we have and . More generally, any -regularized empirical risk minimization problem
with smooth convex loss functions belongs to the framework of (1). We can also use the function for modeling purposes. For example, when is the indicator function, i.e., if and otherwise, (1) becomes the popular constrained finite-sum optimization problem
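As a concrete illustration (a minimal sketch; the ℓ1-regularized forms and all function names are our own choices for illustration, not notation from this paper), the logistic regression and Lasso examples above can be written as composite objectives of the form (1):

```python
import numpy as np

def logistic_loss(w, a, b):
    """One component f_i(w) = log(1 + exp(-b_i * <a_i, w>)) (assumed form)."""
    return np.log1p(np.exp(-b * a.dot(w)))

def lasso_component(w, a, b):
    """One component f_i(w) = (1/2) * (<a_i, w> - b_i)^2 (assumed form)."""
    return 0.5 * (a.dot(w) - b) ** 2

def composite_objective(w, A, b, lam, loss):
    """F(w) = (1/n) * sum_i f_i(w) + lam * ||w||_1, an instance of (1)."""
    n = A.shape[0]
    smooth = np.mean([loss(w, A[i], b[i]) for i in range(n)])
    return smooth + lam * np.linalg.norm(w, 1)
```

Here the smooth average plays the role of the finite-sum part and the ℓ1 term is the (possibly non-smooth) regularizer.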
Previous methods and motivation for our work:
One well-known method to solve (1) is the proximal gradient descent (PGD) method. Let the proximal mapping of a convex function be defined as
At each iteration, PGD calculates a proximal point
where is the step size at the -th iteration. Methods such as gradient descent, which computes , or projected gradient descent, which computes , belong to the class of PGD algorithms. In particular, PGD becomes gradient descent when and becomes projected gradient descent when is the indicator function. If and are general convex functions and is -Lipschitz smooth, then PGD has the convergence rate (see ). However, this convergence rate is not optimal. Nesterov, in , was the first to propose an acceleration method for solving (1) with being an indicator function, achieving the optimal convergence rate . Later, in a series of works [25, 27, 28], he introduced two other acceleration techniques, which make one or two proximal calls together with interpolation at each iteration to accelerate convergence. Nesterov's ideas have been further studied and applied to many practical optimization problems, such as rank reduction in multivariate linear regression and sparse covariance selection (see e.g., [4, 5, 10] and references therein). Auslender and Teboulle  used the acceleration technique in the context of the Bregman divergence , which generalizes the squared Euclidean distance . Tseng  unified the analysis of all these acceleration techniques, proposed new variants, and gave a simple proof of the optimal convergence rate.
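The PGD iteration described above can be sketched for the Lasso instance, where the proximal mapping of the ℓ1 norm is the soft-thresholding operator (a minimal sketch; the problem instance, step size, and function names are assumed for illustration):

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal mapping of tau * ||.||_1: elementwise shrinkage toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def pgd_lasso(A, b, lam, step, iters):
    """PGD for (1/(2n)) * ||A w - b||^2 + lam * ||w||_1 with a constant step size."""
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = A.T.dot(A.dot(w) - b) / n   # gradient of the smooth part
        w = soft_threshold(w - step * grad, step * lam)
    return w
```

Each iteration is exactly one gradient step on the smooth part followed by one proximal step on the regularizer.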
When the number of component functions is very large, applying PGD can be unappealing since computing the full gradient at each iteration is very expensive. An effective alternative is the randomized version of PGD, usually called the stochastic proximal gradient descent (SPGD) method,
where is uniformly drawn from at each iteration. With a suitably chosen decreasing step size , SPGD was proven to have the sublinear convergence rate in the strongly convex case (see e.g., ). Many authors have proposed methods that achieve a better convergence rate when is strongly convex. For instance, stochastic average gradient (SAG) 
incorporates a memory of previous gradients to get a linear convergence rate. Stochastic variance reduced gradient (SVRG) and proximal SVRG 
use the variance reduction technique, which maintains an estimate of the optimal solution and recomputes the full gradient periodically after every SPGD iterations. They achieve the same optimal convergence rate as SAG.
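The two ingredients just discussed can be sketched as follows: a plain SPGD loop with a decreasing step size, and the SVRG-style variance-reduced gradient estimator built from a snapshot point (a minimal sketch; the step-size rule and all names are our own illustrative choices, not the algorithms of the cited works):

```python
import numpy as np

def spgd(grad_i, prox, n, w0, c, iters, rng):
    """Plain SPGD: w <- prox(w - eta_k * grad f_{i_k}(w)), with a decreasing
    step size eta_k = c / (k + 1) (one common choice)."""
    w = np.asarray(w0, dtype=float).copy()
    for k in range(iters):
        i = rng.integers(n)          # uniformly drawn component index
        eta = c / (k + 1)
        w = prox(w - eta * grad_i(w, i), eta)
    return w

def svrg_estimator(grad_i, i, w, w_snap, full_grad_snap):
    """SVRG-style variance-reduced estimator:
    grad f_i(w) - grad f_i(w_snap) + full gradient at the snapshot w_snap.
    It is unbiased, and its variance shrinks as w and w_snap approach the optimum."""
    return grad_i(w, i) - grad_i(w_snap, i) + full_grad_snap
```

Note that at the snapshot itself the estimator coincides with the full gradient, which is the mechanism behind the reduced variance.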
Although PGD has been extended to randomized variants for solving the large-scale problem (1), many of which achieve the optimal convergence rate for strongly convex , extending accelerated proximal gradient descent (APG) methods to achieve the optimal convergence rate for non-strongly convex remains an open question. Moreover, very few algorithms support the non-strongly convex case. One such algorithm is SAGA, which only achieves the convergence rate (see ). Another is the randomized coordinate gradient descent method (RCGD), which has recently been extended to accelerated versions that achieve the optimal convergence rate (see e.g., ). However, accelerated RCGD is only applicable to block-separable regularization , i.e., , where and are the -th coordinate blocks of and , respectively. To the best of our knowledge, although APG has different versions, it is the only algorithm that attains the optimal convergence rate for solving (1) under Assumption 1.1. Therefore, finding stochastic algorithms that also achieve the optimal convergence rate and perform even better than APG on the large-scale problem (1) is one of our goals.
As another goal, our work aims to solve problem (1) even when the proximal point with respect to cannot be computed explicitly. We note that for several choices of , the proximal points at each iteration of the above-mentioned algorithms can be computed efficiently. For example, when , the proximal points are given explicitly by a soft-thresholding operator (see ). However, in many situations, such as nuclear norm regularization and total variation regularization, computing them exactly is very expensive. For this reason, many efficient methods have been proposed to compute proximal points inexactly (see e.g., [9, 13, 21]). In the context of proximal-point algorithms, basic algorithms that allow inexact computation of proximal points were first studied by Rockafellar . Since then, there has been growing interest in both inexact proximal point algorithms and inexact accelerated proximal point algorithms (see e.g., [8, 9, 12, 16, 33, 34, 38, 41]). However, although many works have demonstrated impressive empirical performance of inexact (accelerated) PGD methods, there has been no analysis of randomized versions of proximal gradient descent methods. Our work gives such an analysis for an inexact algorithm for this problem.
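One common way to formalize an inexact proximal point, assumed here purely for illustration, is a point whose prox-subproblem objective is within ε of the minimum. A minimal sketch of this criterion (all names are ours):

```python
import numpy as np

def prox_subproblem_value(y, v, g, eta):
    """Objective of the prox subproblem: (1/(2*eta)) * ||y - v||^2 + g(y)."""
    return 0.5 / eta * np.dot(y - v, y - v) + g(y)

def is_eps_prox(y, v, g, eta, y_exact, eps):
    """y is an eps-inexact proximal point if its subproblem value is within
    eps of the minimum (one common notion of inexactness)."""
    return prox_subproblem_value(y, v, g, eta) <= \
        prox_subproblem_value(y_exact, v, g, eta) + eps
```

For g = ||.||_1 the exact minimizer is the soft-thresholding of v, so the criterion can be checked directly; for expensive regularizers one instead bounds the suboptimality of an inner iterative solver.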
In a concurrent work 
, the author independently gives an analysis of an exact accelerated stochastic gradient descent, see [2, Algorithm 2], and proves the same convergence rate. Compared with [2, Algorithm 2], since we extend the general deterministic acceleration framework of Tseng , our analysis of exact ASMD is set in a more general framework yet employs simpler and neater proofs in the setting of Bregman distance. The analysis of  depends entirely on specific choices of in Update (3) of our Algorithm 1. Specifically, the proofs there rely on variant 1 or variant 2 of our Example 3.1.
Summary of contributions:
Our main contribution is the incorporation of the variance reduction technique into the general acceleration methods of Tseng to propose a framework of exact as well as inexact accelerated stochastic mirror descent (ASMD and iASMD, respectively) algorithms for the large-scale non-strongly convex optimization problem (1). At each stage of our inexact algorithms, proximal points may be computed inexactly. Under a suitable choice of the errors in the computation of proximal points, our algorithms achieve the optimal convergence rate.
When the component functions are non-smooth, we give a smoothing scheme for minimizing the corresponding non-smooth problem. The rate obtained using our smoothing scheme significantly improves upon the rate obtained using subgradient or stochastic subgradient methods.
We then apply the scheme to solve a class of composite convex-concave optimization problems, which encompasses many problems such as data-driven robust optimization, two-stage stochastic programming, and some stochastic interdiction models.
The paper is organized as follows. In Section 2, we give some preliminaries for the paper. The exact and inexact accelerated stochastic mirror descent methods and their convergence analysis are considered in Section 3. In Section 4.1, we give an introduction to a class of composite convex concave optimization, consider a smoothing approximation for max-type functions and propose a version of iASMD to solve the problem. Finally, computational results are given in Section 5.
For a given continuous function , a convex set and a non-negative number , we write to denote such that . We use to denote the gradient of the function at . We now give some important definitions and lemmas that will be used in the paper.
Let be a strictly convex function that is differentiable on an open set containing . Then the Bregman distance is defined as
Let . Then , which is the Euclidean distance.
Let and . Then , which is called the entropy distance.
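A minimal numerical sketch of the two Bregman distances above, using the generic definition together with the Euclidean and entropy generators (all names are illustrative):

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """Bregman distance D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>."""
    return h(x) - h(y) - grad_h(y).dot(x - y)

# h(x) = (1/2)||x||^2 yields half the squared Euclidean distance.
h_euc = lambda x: 0.5 * x.dot(x)
gh_euc = lambda x: x

# h(x) = sum_j x_j log x_j (negative entropy, x > 0) yields the entropy distance,
# which equals the KL divergence on probability vectors.
h_ent = lambda x: np.sum(x * np.log(x))
gh_ent = lambda x: np.log(x) + 1.0
```

The Euclidean generator recovers the usual squared-distance geometry, while the entropy generator adapts the geometry to the probability simplex.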
The following lemmas are fundamental properties of Bregman distance.
If is strongly convex with constant , i.e.,
Let be a proper convex function whose domain is an open set containing . For any , if , then for all we have
If we replace the Euclidean distance in PGD by the Bregman distance, we obtain the mirror descent method
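As a concrete instance, with the entropy distance on the probability simplex the mirror descent update reduces to the exponentiated gradient step (a minimal sketch; the simplex setting and step size are assumed for illustration):

```python
import numpy as np

def md_entropy_step(x, grad, eta):
    """Mirror descent step with the entropy Bregman distance on the simplex:
    x_j <- x_j * exp(-eta * grad_j), then renormalize (exponentiated gradient)."""
    y = x * np.exp(-eta * grad)
    return y / y.sum()
```

Repeated steps on a linear objective concentrate the iterate on the coordinate with the smallest cost, illustrating how the geometry of the Bregman distance keeps iterates feasible without a projection.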
We refer the readers to [7, 19, 35, 36] and references therein for proofs of these lemmas, other properties, and applications of Bregman distances as well as mirror descent methods. Since we can scale if necessary, we assume throughout this paper that is strongly convex with constant . The following lemma on -smooth functions is crucial for our analysis. Proofs are given in .
Let be a convex function that is -smooth. The following inequalities hold
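For an $L$-Lipschitz-smooth convex function $f$, the standard pair of inequalities (presumably the ones intended here, stated for reference) is

```latex
f(x) + \langle \nabla f(x), y - x \rangle \;\le\; f(y)
\;\le\; f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\,\|y - x\|^2 ,
```

together with the co-coercivity-type consequence

```latex
\|\nabla f(x) - \nabla f(y)\|^2 \;\le\; 2L \bigl( f(x) - f(y) - \langle \nabla f(y), x - y \rangle \bigr).
```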
3. Accelerated Stochastic Mirror Descent Methods
In this section, we describe the framework of ASMD, give analysis on convergence rate of exact ASMD as well as inexact ASMD, and propose a scheme for solving (1) when some component functions are non-smooth.
3.1. Algorithm description
We start with some known initial vectors and . At each stage , we perform accelerated stochastic mirror descent steps under a non-uniform sampling setting. At each stochastic step, we keep a part of of the previous stage, specifically, , so that the effect of the acceleration techniques is maintained throughout the stages. For inexact ASMD, we let be the error in computing the proximal point of the stochastic step at stage . When , we recover exact ASMD. Non-uniform sampling improves the complexity of our algorithms when the are different. Algorithm 1 fully describes our framework.
Choose a probability on . Denote .
pick randomly according to
Choose such that
The set at stage should be chosen so that it contains a solution of (1); see Proposition 3.2. Choosing is the simplest variant of . To accelerate convergence, we can always choose a smaller . We refer the readers to [36, Section 3] for examples of choosing a smaller and omit the details here.
. For this choice, each stochastic step finds only one inexact/exact proximal point ; in other words, it finds one inexact/exact projection.
. For this choice, each stochastic step needs to find two proximal points and .
If and are two variants of satisfying (3), then their convex combination , where , is also a valid choice of .
3.2. Convergence analysis
We first give an upper bound for the variance of .
Conditioned on , we have the following expectation inequality with respect to
The following proof is also given in [39, Corollary 3].
where the second equality is from the fact that . ∎
The following lemma serves as the cornerstone for obtaining Proposition 3.1, which provides a recursive inequality over the stochastic steps at each stage .
For each stage , the following inequality holds true
Let , i.e., is the exact proximal point of the stochastic step at stage . For all we have
Considering a fixed stage , for any , if then we have
For simplicity, we drop the subscript when no confusion can arise. Applying Lemma 3.2, we have
We now apply Lemma 3.3 to deduce that for all we have
Together with (6) we deduce
Taking expectation with respect to conditioned on , and noting that and , it follows from (7) that
Here and were used in the second inequality. The third inequality uses . Finally, we take expectation with respect to to obtain the result. ∎
As appears in the recursive inequality, the acceleration effect propagates through the outer stage . The following proposition is a consequence of Proposition 3.1. It leads to the optimal convergence rate of our scheme, which is stated in Theorem 3.1 and Theorem 3.2.
Denote and . Let be the optimal solution of (1). Let be the value of at . Suppose . Then we have