The basic problem of interest in this paper is the convex programming (CP) problem given by
Here, is a closed convex set, is a relatively simple convex function, , , are smooth convex functions with Lipschitz continuous gradient, i.e., such that
is a strongly convex function with modulus w.r.t. an arbitrary norm , i.e.,
and is a given constant. Hence, the objective function is strongly convex whenever . For notational convenience, we also denote and . It is easy to see that for some ,
Throughout this paper, we assume subproblems of the form
are easy to solve. CP given in the form of (1.1) has recently found a wide range of applications in machine learning, statistics, and image processing, and has hence become the subject of intensive study over the past few years.
Stochastic (sub)gradient descent (SGD) (a.k.a. stochastic approximation (SA)) type methods have proven useful for solving problems given in the form of (1.1). SGD was originally designed to solve stochastic optimization problems given by
is a random variable with support. Problem (1.1) can be viewed as a special case of (1.6) by setting
to be a discrete random variable supported on with and , . Since each iteration of SGD needs to compute the (sub)gradient of only one randomly selected component (observe that the subgradients of and are not required due to the assumption in (1.5)), its iteration cost is significantly smaller than that of deterministic first-order methods (FOMs), which involve the computation of first-order information of and thus all the (sub)gradients of ’s. Moreover, when ’s are general nonsmooth convex functions, by properly specifying the probabilities , (suppose that are Lipschitz continuous with constants and let us denote ; we should set in order to get the optimal complexity for SGD), it can be shown (see NJLS09-1 ) that the iteration complexities of SGD and FOMs are of the same order of magnitude. Consequently, the total number of subgradients required by SGD can be times smaller than that required by FOMs.
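The sampling argument above can be illustrated with a minimal sketch. It assumes hypothetical least-squares components f_i(x) = 0.5 (a_i^T x - b_i)^2 and the importance weight 1/(m p_i); both choices, and all function names, are illustrative assumptions rather than the paper's definitions. The sketch checks that one weighted component gradient is, in expectation, the gradient of the averaged smooth part.

```python
import random

def component_grad(a_i, b_i, x):
    """Gradient of the hypothetical component f_i(x) = 0.5 * (a_i^T x - b_i)^2."""
    residual = sum(aj * xj for aj, xj in zip(a_i, x)) - b_i
    return [residual * aj for aj in a_i]

def full_grad(A, b, x):
    """Gradient of the average (1/m) * sum_i f_i(x): needs all m components."""
    m = len(A)
    g = [0.0] * len(x)
    for a_i, b_i in zip(A, b):
        gi = component_grad(a_i, b_i, x)
        g = [gj + gij / m for gj, gij in zip(g, gi)]
    return g

def sampled_grad(A, b, x, p, rng):
    """One stochastic gradient: draw i with probability p_i, reweight by 1/(m p_i)."""
    m = len(A)
    i = rng.choices(range(m), weights=p, k=1)[0]
    return [gij / (m * p[i]) for gij in component_grad(A[i], b[i], x)]

def expected_estimator(A, b, x, p):
    """Exact expectation of sampled_grad, summing over all outcomes of i."""
    m = len(A)
    g = [0.0] * len(x)
    for a_i, b_i, p_i in zip(A, b, p):
        gi = component_grad(a_i, b_i, x)
        g = [gj + p_i * gij / (m * p_i) for gj, gij in zip(g, gi)]
    return g

A = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
b = [1.0, -2.0, 0.5]
x = [0.3, -0.7]
p = [0.2, 0.3, 0.5]          # any strictly positive probabilities summing to one
g_full = full_grad(A, b, x)
g_expect = expected_estimator(A, b, x, p)
g_one = sampled_grad(A, b, x, p, random.Random(0))  # costs one component gradient
```

The estimator stays unbiased for any strictly positive probabilities; the choice of probabilities only affects its variance.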
Note, however, that there is a significant gap in the complexity bounds between SGDs and deterministic FOMs if ’s are smooth convex functions. For the sake of simplicity, let us focus on the strongly convex case when and let be the optimal solution of (1.1). In order to find a solution s.t. , the total number of gradient evaluations for ’s performed by optimal FOMs can be bounded by
which was first achieved by the well-known Nesterov’s accelerated gradient method Nest83-1 ; Nest04 , see also relevant extensions in Nest13-1 ; BecTeb09-2 ; tseng08-1 . On the other hand, a direct application of optimal SGDs to the aforementioned stochastic optimization reformulation of (1.1) would yield an
denotes variance of the stochastic gradients. Clearly, the latter bound is significantly better than the one in (1.7) in terms of its dependence on , but much worse in terms of its dependence on accuracy and a few other problem parameters (e.g., and ).
It should be noted that the optimality of (1.8) for general stochastic programming (1.6) does not preclude the existence of more efficient algorithms for solving (1.1), because (1.1) is a special case of (1.6) with finite support . The last few years have seen very active and fruitful research in this field (e.g., SchRouBac13-1 ; JohnsonZhang13-1 ; DefBacLac14-1 ; ShaZhang15-1 ; Yuchen14 ). In particular, Schmidt, Roux and Bach SchRouBac13-1
presented a stochastic average gradient (SAG) method, which recursively computes an estimator of the full gradient by aggregating the gradient of a randomly selected component with some other previously computed gradient information. They proved that the complexity of SAG is bounded by ; see also Johnson and Zhang JohnsonZhang13-1 and Defazio et al. DefBacLac14-1 for similar complexity results for solving (1.1). In a related but different line of research, Shalev-Shwartz and Zhang ShaZhang15-1 studied a special class of CP problems given in the form of (1.1) with given by , where denotes an affine mapping. Under the assumption that , they presented an accelerated stochastic dual coordinate ascent (A-SDCA) method, obtained by properly restarting a stochastic coordinate ascent method in ShaZhang13-1 applied to the dual of (1.1). Shalev-Shwartz and Zhang showed that the iteration complexity of this method can be bounded by . However, each iteration of A-SDCA requires, instead of the computation of , the solution of a subproblem given in the form of
where denotes the conjugate function of . Moreover, these methods were also designed for solving a more special class of problems than (1.1). More recently, Lin, Lu, and Xiao LinLuXiao14-1 proposed to apply the accelerated coordinate descent methods of Nesterov Nest10-1 and of Fercoq and Richtárik fr13 to obtain similar results for solving these “regularized empirical loss functions” as in ShaZhang15-1 . Zhang and Xiao Yuchen14 also obtained similar results by using different stochastic primal-dual coordinate decomposition techniques.
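For concreteness, the aggregation idea behind SAG described above can be sketched as follows. This is a minimal illustration, assuming hypothetical least-squares components f_i(x) = 0.5 (a_i^T x - b_i)^2, no regularization terms, and a constant stepsize of 1/(16 L_max) taken from the SAG literature; the function name and test data are ours and are not taken from SchRouBac13-1.

```python
import random

def sag_least_squares(A, b, steps, step_size, seed=0):
    """Stochastic average gradient sketch for (1/m) * sum_i 0.5*(a_i^T x - b_i)^2."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    x = [0.0] * n
    table = [[0.0] * n for _ in range(m)]  # last gradient stored for each component
    grad_sum = [0.0] * n                   # running sum of the stored gradients
    for _ in range(steps):
        i = rng.randrange(m)               # only component i is evaluated this iteration
        residual = sum(a * xj for a, xj in zip(A[i], x)) - b[i]
        new_grad = [residual * a for a in A[i]]
        # Swap the stale gradient of component i for the fresh one in the aggregate.
        grad_sum = [s - old + new for s, old, new in zip(grad_sum, table[i], new_grad)]
        table[i] = new_grad
        # Step along the average of the stored (partly stale) gradients.
        x = [xj - step_size * s / m for xj, s in zip(x, grad_sum)]
    return x

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
x_true = [1.0, -2.0]
b = [sum(a * xt for a, xt in zip(row, x_true)) for row in A]  # consistent system
L_max = max(sum(a * a for a in row) for row in A)             # largest component constant
x_sag = sag_least_squares(A, b, steps=2000, step_size=1.0 / (16.0 * L_max))
sag_error = max(abs(u - v) for u, v in zip(x_sag, x_true))
```

Each iteration costs one component gradient plus O(n) bookkeeping, at the price of storing m gradients.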
In this paper, we focus on randomized incremental gradient methods that can access the first-order information of only one randomly selected smooth component at each iteration (see Bertsekas Bertsekas10-1 for an introduction to incremental gradient methods). It should be noted that while the algorithms in SchRouBac13-1 ; JohnsonZhang13-1 ; DefBacLac14-1 belong to incremental gradient methods, generally speaking, the dual coordinate algorithms in LinLuXiao14-1 ; ShaZhang15-1 ; Yuchen14 cannot be considered as incremental gradient methods because they require the solutions of a different subproblem rather than the computation of the gradient of . The previous attempts to improve the complexity of the existing incremental gradient methods, e.g., based on the extrapolation idea in Nesterov Nest83-1 , however, turned out to be tricky and unsuccessful, see Section 1.2 of Bertsekas Bertsekas10-1 and Section 5 of Agarwal and Bottou AgrBott14-1 for more discussions. Another important yet unresolved issue is that there does not exist a valid lower complexity bound for randomized incremental gradient methods in the literature. Hence, it remains unknown what would be the best possible performance that one can expect for these types of methods. Regarding this question, Agarwal and Bottou AgrBott14-1 recently suggested a lower complexity bound for solving problems given in the form of (1.1). However, as pointed out by them in a recent ISMP talk in 2015, the lower complexity bound in AgrBott14-1 is deterministic by construction, and hence cannot be used to justify the optimality or suboptimality for the randomized incremental gradient methods in SchRouBac13-1 ; JohnsonZhang13-1 ; DefBacLac14-1 or dual coordinate methods in LinLuXiao14-1 ; ShaZhang15-1 ; Yuchen14 .
Our contributions in this paper lie mainly in the following aspects. Firstly, we present a new class of deterministic FOMs, referred to as the primal-dual gradient (PDG) methods, which can achieve the optimal black-box iteration complexity in (1.7) for solving (1.1). The novelty of these methods lies in: 1) a proper reformulation of (1.1) as a primal-dual saddle point problem and 2) the incorporation of a new non-differentiable prox-function (or Bregman distance) based on the conjugate functions of in the dual space. As a consequence, we are able to show that the PDG method covers a variant of Nesterov’s well-known accelerated gradient method as a special case. In particular, the computation of the gradient at the extrapolation point of the accelerated gradient method is equivalent to a primal prediction step combined with a dual ascent step (employed with the aforementioned dual prox-function) in the PDG method. While it is often difficult to interpret Nesterov’s method, the development of the PDG method allows us to view this method as a natural iterative buyer-supplier game. Such a game-theoretic view of the accelerated gradient method seems to be new in the literature. In fact, the obtained complexity results for the PDG method are slightly stronger than the one in (1.7) and those in Nest83-1 ; Nest04 for Nesterov’s accelerated gradient method, because a stronger primal-dual termination criterion has been used in our analysis.
Secondly, we develop a randomized primal-dual gradient (RPDG) method, which is an incremental gradient method using only one randomly selected component at each iteration. A variant of PDG, this algorithm incorporates an additional dual prediction step before performing the primal descent step (with a properly defined primal prox-function). We prove that the number of iterations (and hence the number of gradients) required by RPDG is bounded by
both in expectation and with high probability. The complexity bounds of the RPDG method are established in terms of not only the distance from the iterate to the optimal solution, but also the primal optimality gap based on the ergodic mean of the iterates. In comparison with the accelerated stochastic dual coordinate ascent method in ShaZhang15-1 , RPDG deals with a wider class of problems and can be applied to the cases when ’s involve a more complicated composite structure (see examples in Bertsekas10-1 ) and/or a more general regularization term that is strongly convex with respect to an arbitrary norm (see open problems in Section 7 of ShaZhang15-1 ). Moreover, each iteration of RPDG only involves the computation , rather than the more complicated subproblem in (1.9), which sometimes may not have explicit solutions ShaZhang15-1
(e.g., the logistic regression problem). The RPDG method also admits an interesting game-theoretic interpretation, implying that by properly incorporating randomization, the buyer and supplier can reach the equilibrium with possibly fewer price changes at the expense of more order transactions.
Thirdly, we show that the number of gradient evaluations required by any randomized incremental gradient method to find an -solution of (1.1), i.e., a point s.t. , cannot be smaller than
whenever the dimension is sufficiently large. This bound is obtained by carefully constructing a special class of separable quadratic programming problems and tightly bounding the expected distance to the optimal solution for any arbitrary distribution used to choose at each iteration. Comparing (1.10) with (1.11), we conclude that the complexity of the RPDG method is optimal if is large enough. To the best of our knowledge, this is the first time that such a lower complexity bound has been presented for randomized incremental gradient methods in the literature. As a byproduct, we also derived a lower complexity bound for randomized block coordinate descent methods by utilizing the separable structure of the aforementioned worst-case instances. These methods have been intensively studied recently, but a valid lower complexity bound is still missing in the literature.
Finally, we generalize RPDG for problems which are not necessarily strongly convex (i.e., ) and/or involve structured nonsmooth terms . We show that for all these cases, the RPDG can save times gradient computations (up to certain logarithmic factors) in comparison with the corresponding optimal deterministic FOMs. In particular, we show that when both the primal and dual of (1.1) are not strongly convex, the total number of iterations performed by the RPDG method can be bounded by (up to some logarithmic factors), which is times better, in terms of the total number of dual subproblems to be solved, than Nesterov’s smoothing technique Nest05-1 , Nemirovski’s mirror-prox method Nem05-1 , or Chambolle and Pock’s primal-dual method ChamPoc11-1 . It seems that this complexity result has not been obtained before in the literature.
It is worth mentioning a few works relevant to our development. The two most closely related ones were conducted independently by Dang and Lan DangLan14-1 , and Zhang and Xiao Yuchen14 . Both of these papers deal with randomized variants of the primal-dual method presented by Chambolle and Pock ChamPoc11-1 (see also extensions in CheLanOu13-1 ) for solving saddle point problems. Zhang and Xiao’s development Yuchen14 was based on a variant of the primal-dual method for solving strongly convex saddle point problems ChamPoc11-1 . They were able to show that a block-wise randomized version of the algorithm can achieve a complexity similar to that of the A-SDCA method in ShaZhang15-1 . Since Zhang and Xiao’s algorithm targets a similar class of problems and requires the solution of a subproblem similar to that in ShaZhang15-1 , it appears that the aforementioned possible advantages of RPDG over A-SDCA are also applicable to the stochastic primal-dual coordinate method in Yuchen14 . Moreover, the complexity bound of Zhang and Xiao’s algorithm is only established in terms of the Euclidean distances of the iterate , to the optimal solution. They did not deal with the convergence of the ergodic mean of iterates. On the other hand, Dang and Lan’s work was motivated by the observation in ChenHeYeYuan13-1 that a direct extension of the alternating direction method of multipliers (ADMM) does not necessarily converge for multi-block problems. Their work in DangLan14-1 then focuses on the non-strongly convex case and shows that a randomized primal-dual method, which is equivalent to a randomized pre-conditioned ADMM for linearly constrained problems, does converge for multi-block problems. Without incorporating the aforementioned dual prediction step, the complexity obtained in DangLan14-1 is times worse than that of Chambolle and Pock’s method. Nevertheless, this is the first time that randomized algorithms for saddle point optimization with an complexity have been presented in the literature.
More recently, close to the end of the preparation of this paper, we noticed that Lin, Mairal, and Harchaoui LinMaiHar15-1 in a concurrent work presented a catalyst scheme that can be used to accelerate the SAG method in SchRouBac13-1 and thus possibly achieve the complexity bound in (1.10) (under the Euclidean setting). While their approach is an indirect one obtained by properly restarting SAG (or other “non-accelerated” first-order methods), the proposed randomized primal-dual gradient method is a direct approach with a “built-in” acceleration. Also, none of these works DangLan14-1 ; Yuchen14 ; LinMaiHar15-1 discussed the lower complexity bound for randomized methods.
This paper is organized as follows. We first study the deterministic primal-dual method in Section 2. Section 3 is devoted to the design and analysis of the randomized primal-dual method for the strongly convex case, as well as the development of the lower complexity bound in (1.11). In Section 4, we generalize the RPDG method to different classes of CP problems that are not necessarily strongly convex. Important technical results and proofs of the main theorems in Sections 2 and 3 are provided in Section 5. Some brief concluding remarks are made in Section 6.
Notation and terminology. We use to denote an arbitrary norm in , which is not necessarily associated with the inner product . We also use to denote the conjugate norm of . For any convex function , is the set of subdifferential at . Given any , we say a convex function is nonsmooth if for any . We say that a convex function is smooth if it is Lipschitz continuously differentiable with Lipschitz constant , i.e., for any . For any , denotes the standard -norm in , i.e.,
For any real number , and denote the nearest integer to from above and below, respectively. and , respectively, denote the set of nonnegative and positive real numbers. denotes the set of natural numbers .
2 An optimal primal-dual gradient method
Our goal in this section is to present a novel primal-dual gradient (PDG) method for solving (1.1), which will also provide a basis for the development of the randomized primal-dual gradient methods in later sections. We establish the optimal convergence of this algorithm in terms of the primal-dual optimality gap under the assumption that the gradient of is computed at each iteration. We show that PDG generalizes one variant of the well-known Nesterov’s accelerated gradient method, and allows a natural game interpretation, and hence that the latter algorithm also admits a similar interpretation.
2.1 Preliminaries: primal and dual prox-functions
In this subsection, we discuss both primal and dual prox-functions (proximity control functions) in the primal and dual spaces, respectively.
Recall that the function in (1.1) is strongly convex with modulus with respect to . We can define a primal prox-function associated with as
where is an arbitrary subgradient of at . Clearly, by the strong convexity of , we have
Note that the prox-function described above generalizes the Bregman distance in the sense that is not necessarily differentiable (see Breg67 ; AuTe06-1 ; BBC03-1 ; Kiw97-1 and references therein). Throughout this paper, we assume that the prox-mapping associated with , , and , given by
is easily computable for any , , and . Clearly this is equivalent to the assumption that (1.5) is easy to solve. Whenever is non-differentiable, we need to specify a particular selection of the subgradient before performing the prox-mapping. We assume throughout this paper that such a selection of is defined recursively as follows. Denote . By the optimality condition of (2.3), we have
where denotes the normal cone of at . Once such a satisfying the above relation is identified, we will use it as a subgradient when defining in the next iteration.
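As a small illustration of the primal prox-function discussed above, the sketch below assumes the standard Bregman form P(x0, x) = omega(x) - omega(x0) - <omega'(x0), x - x0> and evaluates it for two common distance-generating functions: half the squared Euclidean norm, and the negative entropy on the simplex (strongly convex with modulus 1 with respect to the l1 norm). The function names and test points are hypothetical, and both examples are differentiable, so no subgradient selection is needed here.

```python
import math

def prox_function(omega, omega_grad, x0, x):
    """Bregman-type distance P(x0, x) generated by omega (assumed standard form)."""
    g0 = omega_grad(x0)
    linear = sum(gi * (xi - x0i) for gi, xi, x0i in zip(g0, x, x0))
    return omega(x) - omega(x0) - linear

def half_sq_norm(x):
    return 0.5 * sum(xi * xi for xi in x)

def half_sq_norm_grad(x):
    return list(x)

def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)

def neg_entropy_grad(x):
    return [math.log(xi) + 1.0 for xi in x]

x0 = [0.2, 0.3, 0.5]   # two points in the interior of the probability simplex
x = [0.6, 0.1, 0.3]

p_euclid = prox_function(half_sq_norm, half_sq_norm_grad, x0, x)
p_entropy = prox_function(neg_entropy, neg_entropy_grad, x0, x)

# Strong convexity with modulus 1 gives P(x0, x) >= 0.5 * ||x - x0||^2,
# in the Euclidean norm for the first choice and the l1 norm for the second.
half_l2_sq = 0.5 * sum((xi - x0i) ** 2 for xi, x0i in zip(x, x0))
half_l1_sq = 0.5 * sum(abs(xi - x0i) for xi, x0i in zip(x, x0)) ** 2
```

For the Euclidean choice the prox-function coincides with half the squared distance; for the entropy choice on the simplex it is the Kullback-Leibler divergence.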
Now let us consider the dual space , where the gradients of reside, and equip it with the conjugate norm . Let be the conjugate function of such that
It is clear that is strongly convex with modulus w.r.t. . Therefore, we can define its associated dual prox-functions and dual prox-mappings as
for any . Again, may not be uniquely defined since is not necessarily differentiable. Instead of choosing similarly to , we can explicitly specify such selections as will be discussed later in this paper.
The following simple result shows that the computation of the dual prox-mapping associated with is equivalent to the computation of .
Let and be given and be defined in (2.5). For any , let us denote . Then we have .
In view of the definition of in (2.5), we have
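The equivalence between the dual prox-mapping and a gradient evaluation can be checked numerically on a simple instance. The sketch below assumes a hypothetical diagonal quadratic f(x) = 0.5 sum_j c_j x_j^2, whose conjugate is J_f(g) = 0.5 sum_j g_j^2 / c_j, and assumes that the dual prox-function is the Bregman distance of J_f with the subgradient at g0 = grad f(x_anchor) selected as x_anchor. Under these assumptions, the minimizer of <-x_tilde, g> + J_f(g) + tau D_f(g0, g) is the gradient of f at (x_tilde + tau x_anchor) / (1 + tau).

```python
def grad_f(c, x):
    """Gradient of the illustrative f(x) = 0.5 * sum_j c_j * x_j**2 (c_j > 0)."""
    return [cj * xj for cj, xj in zip(c, x)]

def conj_f(c, g):
    """Conjugate J_f(g) = 0.5 * sum_j g_j**2 / c_j of the quadratic above."""
    return 0.5 * sum(gj * gj / cj for cj, gj in zip(c, g))

def dual_prox_objective(c, x_tilde, x_anchor, tau, g):
    """<-x_tilde, g> + J_f(g) + tau * D_f(g0, g), with g0 = grad f(x_anchor).

    D_f is the Bregman distance of J_f, using x_anchor as the (sub)gradient
    of J_f at g0, which is valid because g0 = grad f(x_anchor).
    """
    g0 = grad_f(c, x_anchor)
    bregman = (conj_f(c, g) - conj_f(c, g0)
               - sum(xa * (gi - g0i) for xa, gi, g0i in zip(x_anchor, g, g0)))
    return -sum(xt * gi for xt, gi in zip(x_tilde, g)) + conj_f(c, g) + tau * bregman

def dual_prox_via_gradient(c, x_tilde, x_anchor, tau):
    """Candidate minimizer: one gradient of f at a convex combination of the inputs."""
    mixed = [(xt + tau * xa) / (1.0 + tau) for xt, xa in zip(x_tilde, x_anchor)]
    return grad_f(c, mixed)

c = [1.0, 4.0, 0.5]
x_tilde = [0.7, -1.2, 2.0]
x_anchor = [0.1, 0.4, -0.3]
tau = 0.8
g_star = dual_prox_via_gradient(c, x_tilde, x_anchor, tau)
best = dual_prox_objective(c, x_tilde, x_anchor, tau, g_star)
```

Perturbing any coordinate of g_star increases the objective, which is what one expects from the minimizer of a strictly convex function.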
2.2 Primal-dual gradient method, Nesterov’s method, and a game interpretation
The primal-dual gradient method in Algorithm 1 can be viewed as a game iteratively performed by a primal player (buyer) and a dual player (supplier) for finding the optimal solution (order quantity and product price) of the saddle point problem in (2.7). In this game, both the buyer and supplier have access to their local cost and , respectively, as well as their interactive cost (or revenue) represented by a bilinear function . Our goal is to design an algorithm such that the buyer and supplier can achieve an equilibrium as soon as possible. In the proposed algorithm, the supplier first applies (2.8) to predict the demand based on historical information, i.e., and . She then determines in (2.9) the price in a way to maximize the predicted profit , regularized by the dual prox-function with a certain weight . Once the supplier has made her decision, the buyer determines his action according to (2.10) in order to minimize the cost , regularized by the primal prox-function with a certain weight .
In order to implement the above primal-dual gradient method, it is more convenient to rewrite step (2.9) in a form involving the computation of gradient rather than the dual prox-mapping . In order to do so, we shall specify explicitly the selection of the subgradient in (2.9). Denoting , we can easily see from that . Using this relation and letting in (see (2.5)), we then conclude from Lemma 1 that for any , (2.9) reduces to
With the above selection of the dual prox-function, we can specialize the primal-dual gradient method as follows.
Observe that one potential problem associated with this scheme is that the search points defined in (2.11) and (2.12), respectively, may fall outside . As a result, we need to assume to be differentiable over . However, it can be shown that by properly specifying and , we can guarantee and thus relax such restrictions on the differentiability of (see (2.31) and (2.32) below).
The above PDG method is related to the well-known Nesterov’s accelerated gradient (AG) method. Let us focus on a simple variant of the AG method that has been extensively studied in the literature (e.g., Nest04 ; tseng08-1 ; Lan10-3 ; GhaLan12-2a ; GhaLan13-1 ; GhaLan13-2 ). Given , this AG algorithm updates by
Therefore, (2.15) is equivalent to (2.11) and (2.12) with and . Moreover, (2.16) is identical to (2.14) (and (2.10)), and (2.17) basically defines the output of the AG algorithm as an ergodic mean of the iterates . We then conclude that the above variant of Nesterov’s AG method is a special case of Algorithm 2 (and Algorithm 1). It should be noted, however, that Algorithm 1 provides more flexibility in the specification of parameters, which will be used later in the development of the RPDG method. Moreover, the presentation of the PDG method helps us to reveal a natural game interpretation out of the intertwined and somewhat mysterious updating of the three search sequences in the AG method.
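The three-sequence structure of the AG variant discussed above (a search point, a prox-type update, and an ergodic mean as output) can be sketched as follows. This is a minimal illustration for the unconstrained Euclidean case without the simple and strongly convex terms; the weights alpha_t = 2/(t+1) and stepsizes t/(2L) are one standard choice assumed here, not necessarily the parameter setting analyzed in this paper. For this choice, a standard argument gives f(x_bar_N) - f* <= 2 L ||x_0 - x*||^2 / (N (N+1)).

```python
def accelerated_gradient(grad, L, x0, iters):
    """Sketch of a three-sequence accelerated gradient method (Euclidean, unconstrained)."""
    x = list(x0)        # iterates updated by gradient (prox-type) steps
    x_bar = list(x0)    # ergodic mean of the iterates, returned as output
    for t in range(1, iters + 1):
        alpha = 2.0 / (t + 1)
        # Search point at which the gradient is evaluated.
        x_under = [(1 - alpha) * xb + alpha * xi for xb, xi in zip(x_bar, x)]
        g = grad(x_under)
        # Gradient step taken from the previous iterate (not from the search point).
        step = t / (2.0 * L)
        x = [xi - step * gi for xi, gi in zip(x, g)]
        # Weighted averaging of the iterates.
        x_bar = [(1 - alpha) * xb + alpha * xi for xb, xi in zip(x_bar, x)]
    return x_bar

# Hypothetical test problem: f(x) = 0.5 * sum_j c_j * (x_j - s_j)^2, so f* = 0 at x = s.
c = [1.0, 10.0, 100.0]
s = [1.0, -1.0, 2.0]

def f(x):
    return 0.5 * sum(cj * (xj - sj) ** 2 for cj, xj, sj in zip(c, x, s))

def grad_f(x):
    return [cj * (xj - sj) for cj, xj, sj in zip(c, x, s)]

L = max(c)                      # Lipschitz constant of the gradient
N = 200
x_out = accelerated_gradient(grad_f, L, [0.0, 0.0, 0.0], N)
bound = 2.0 * L * sum(sj ** 2 for sj in s) / (N * (N + 1))
```

Here `f(x_out)` can be compared against `bound` to observe the O(1/N^2) behavior on this instance.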
Algorithm 1 is also closely related to Chambolle and Pock’s primal-dual method for solving saddle point problems ChamPoc11-1 , which explains the origin of its name. Two versions of primal-dual methods were discussed in ChamPoc11-1 . One is designed for solving general saddle point problems without assuming the strong convexity of and the other one is to deal with the case when is strongly convex by incorporating an additional extrapolation step. As pointed out in Remark 3 of ChamPoc11-1 , the rate of convergence for the latter primal-dual method is only suboptimal for solving (1.1) as it uses a weaker termination criterion. On the other hand, the PDG method does not involve any additional extrapolation steps and so it shares a similar scheme to the basic version of the primal-dual method in ChamPoc11-1 . Moreover, the original primal-dual methods in ChamPoc11-1 do not employ general prox-functions, which, as shown in Lemma 1, is crucial to relate the dual step (2.9) to the computation of the gradients. It should be noted that some recent extensions of the primal-dual method in CheLanOu13-1 ; DangLan14-1 ; ChamPoc14-1 indeed consider the incorporation of prox-functions, but restricted to problems without strong convexity. Hence, none of these earlier primal-dual methods can be viewed as a generalized accelerated gradient method.
2.3 Convergence properties of the primal-dual gradient method
Our goal in this subsection is to show that Algorithm 1 exhibits an optimal rate of convergence for solving problem (1.1). It is worth mentioning that our analysis significantly differs from the previous studies on optimal gradient methods and those on primal-dual methods for saddle point problems.
Given a pair of feasible solutions and of (2.7), we define the primal-dual gap function by
It can be easily seen that (resp., ) is an optimal solution of (2.7) if and only if for any (resp., for any ). Therefore, one can assess the solution quality of by the primal-dual optimality gap:
It should be noted that may not be well-defined, for example, when is unbounded and is not strictly convex. In these cases, we can define a slightly modified primal-dual gap
for an arbitrary optimal solution of (1.1). Since is strongly convex, is well-defined.
The following result establishes some relationship between the primal optimality gap and the above primal-dual optimality gaps.
It follows from the definitions of , and the gap function that
Relation (2.22) follows directly from the definitions of and .
Theorem 2.1 below describes the main convergence properties of the PDG method. More specifically, we provide in Theorem 2.1.a) a constant stepsize policy which works for the strongly convex case where , and a different parameter setting that works for the non-strongly convex case with in Theorem 2.1.b). Note that for the strongly convex case, we estimate the solution quality for the iterates , as well as that for their ergodic mean
for some , while only establishing the error bounds for for the non-strongly convex case. We put the proof of Theorem 2.1 in Section 5 since it shares many basic elements with the convergence analysis of the RPDG method.
Observe that when the algorithmic parameters are set to (2.24), by using an inductive argument, we can easily show that
In other words, can be written as a convex combination of and hence for any . Similarly, when the algorithmic parameters are set to (2.28), we can show by using induction that
which implies . Therefore, we only need to assume the differentiability of over rather than the whole .
In view of the results obtained in Theorem 2.1, the primal-dual gradient method is an optimal method for convex optimization. In fact, the rates of convergence in (2.26), (2.27), (2.29) and (2.30) associated with the ergodic mean employ the primal-dual optimality gaps and , which are stronger than the primal optimality gap used in previous studies of accelerated gradient methods. Moreover, whenever is bounded, the primal-dual optimality gap gives us a computable online accuracy certificate to check the quality of the solution (see lns11 ; GhaLan12-2a for some related discussions). Also observe that each iteration of the PDG method requires the computation of , and hence all the components . In the next section, we will develop a randomized PDG method that can possibly reduce the number of gradient evaluations for by utilizing the finite-sum structure of problem (1.1).
3 Randomized primal-dual gradient methods
In this section, we present a randomized primal-dual gradient (RPDG) method which needs to compute the gradient of only one randomly selected component function at each iteration. We show that RPDG can possibly achieve a better complexity than PDG in terms of the total number of gradient evaluations.
3.1 Multi-dual-player reformulation and the RPDG algorithm
We start by introducing a different saddle point reformulation of (1.1) than (2.7). Let be the conjugate functions of and , , denote the dual spaces where the gradients of reside. For the sake of notational convenience, let us denote , , and for any , . Clearly, we can reformulate problem (1.1) equivalently as a saddle point problem:
where is given by
is the identity matrix in. Given a pair of feasible solutions and of (3.1), we define the primal-dual gap function by
It is well-known that is an optimal solution of (3.1) if and only if for any .
Since , are strongly convex with modulus w.r.t. , we can define their associated dual prox-functions and dual prox-mappings as
for any . Accordingly, we define
Again, may not be uniquely defined since are not necessarily differentiable. However, we will discuss how to specify the particular selection of later in this subsection.
We are now ready to describe the randomized primal-dual method, which is obtained by properly modifying the primal-dual gradient method as follows. Firstly, in (3.8), we only compute a randomly selected dual prox-mapping rather than the dual prox-mapping as in Algorithm 1. Secondly, in addition to the primal prediction step (3.7), we add a new dual prediction step (3.9), and then use the predicted dual variable for the computation of the new search point in (3.10). It can be easily seen that the RPDG method reduces to the PDG method whenever this algorithm is directly applied to (2.7) (i.e., , , and ).