1 Introduction
In this paper, we are interested in solving a class of convex optimization problems with both noncomposite and composite regularization terms:

(1) $\min_{x \in \mathcal{X}} \; f(x) + r(x) + g(Fx),$

where $\mathcal{X} \subseteq \mathbb{R}^d$ is a convex compact set with diameter $D$, the regularization terms $r(\cdot)$ and $g(\cdot)$ are both convex but possibly nonsmooth, and $g$ is composed with a possibly nondiagonal penalty matrix $F$ specifying the desired structured sparsity pattern in $x$. We denote by $\ell(x, \xi)$ a convex and smooth loss function of a decision rule $x$ for a data sample $\xi$, and define the corresponding expectation as $f(x) = \mathbb{E}_\xi[\ell(x, \xi)]$. The above formulation covers quite a few popular models arising from statistics and machine learning, such as Lasso [25], obtained by setting $\ell(x, \xi) = \frac{1}{2}(a^\top x - b)^2$ with $\xi = (a, b)$, $r(x) = \lambda \|x\|_1$, and $g \equiv 0$, and linear SVM [5], obtained by letting $\ell(x, \xi)$ be a (smoothed) hinge loss, $r(x) = \frac{\lambda}{2}\|x\|_2^2$, and $g \equiv 0$, where $\lambda$ is a regularization parameter. More importantly, we can accommodate problem (1) with more complicated structures by imposing the nontrivial regularization term $g(Fx)$, such as fused Lasso [26], fused logistic regression, and graph-guided regularized minimization [7].
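For concreteness, the following Python sketch instantiates problem (1) for the fused Lasso under the notation above; the first-order difference matrix $F$ and the specific loss and penalty weights are illustrative assumptions, not prescribed by the paper.

```python
import numpy as np

def make_fused_lasso_F(d):
    """First-order difference matrix: (Fx)_i = x_i - x_{i+1}."""
    F = np.zeros((d - 1, d))
    for i in range(d - 1):
        F[i, i], F[i, i + 1] = 1.0, -1.0
    return F

def objective(x, A, b, F, lam1, lam2):
    f = 0.5 * np.mean((A @ x - b) ** 2)   # f(x): empirical least-squares loss
    r = lam1 * np.linalg.norm(x, 1)       # r(x): plain l1 regularizer
    g = lam2 * np.linalg.norm(F @ x, 1)   # g(Fx): fused (structured) penalty
    return f + r + g

rng = np.random.default_rng(0)
A, b = rng.normal(size=(100, 20)), rng.normal(size=100)
x = rng.normal(size=20)
print(objective(x, A, b, make_fused_lasso_F(20), 0.1, 0.1))
```

Note that it is precisely the nondiagonal $F$ that makes the composite term $g(Fx)$ hard to handle with a plain proximal step, as discussed next.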
The standard algorithm for solving problem (1) is proximal gradient descent [19]. However, there are two main difficulties: 1) computing the exact proximal gradient is intractable, since a closed-form solution to the proximal mapping of $r(x) + g(Fx)$, or even of $g(Fx)$ alone, is usually unavailable; 2) the computational complexity of the full gradient grows rapidly with the number of samples, and is hence prohibitively expensive for modern data-intensive applications.
A common way to overcome the former difficulty is to introduce an auxiliary variable $y$ with $y = Fx$ and reformulate problem (1) as a linearly constrained convex problem with respect to the two variables $x$ and $y$ as follows:

(2) $\min_{x \in \mathcal{X},\, y} \; f(x) + r(x) + g(y) \quad \text{s.t.} \quad Fx - y = 0.$
Then one can resort to the Linearized Alternating Direction Method of Multipliers (LADMM) [4, 28]. Very recently, Lin et al. [12] explored the efficiency of extragradient descent [10, 11], and further showed that the hybrid ExtraGradient ADM (EGADM) is very efficient on moderate-size problems. However, these methods are computationally expensive due to the computation of the full gradient in each iteration.
To address the computational issue, several stochastic ADMM algorithms [18, 23, 2, 29, 30] have been proposed. The idea is to draw a mini-batch of samples and compute a noisy subgradient on that mini-batch in each iteration. However, for problem (1) with nonsmooth regularization (which is common in practice), these subgradient-type alternating direction methods may be slow and unstable [6].
In this work, we propose Stochastic Primal-Dual Proximal ExtraGradient Descent (SPDPEG), which inherits the advantages of both EGADM and stochastic methods. Basically, the proposed method computes two noisy gradients of $f$ at the $k$-th iteration by randomly drawing two data samples $\xi_{k,1}$ and $\xi_{k,2}$, and then performs extragradient descent along these noisy gradients. We demonstrate that the proposed algorithm is efficient and stable in solving problem (1) with possibly nonsmooth terms at large scale.
Our contribution: We propose a novel Stochastic Primal-Dual Proximal ExtraGradient Descent (SPDPEG) method. SPDPEG is efficient in solving large-scale problems with composite and nonsmooth regularizations. We establish its convergence for both convex and strongly convex objectives. For convex objectives, SPDPEG attains an $O(1/\sqrt{t})$ convergence rate in expectation with uniformly averaged iterates. This rate is known to be the best possible for minimizing a general convex objective with a first-order noisy oracle [1]. When the objective is strongly convex, SPDPEG converges at the rates of $O(\log t / t)$ and $O(1/t)$ in expectation with uniformly and non-uniformly averaged iterates, respectively. This matches the convergence rate of stochastic ADMM while offering significantly stronger numerical robustness, as confirmed by encouraging experiments on fused logistic regression and graph-guided regularized minimization tasks.
Related work: The first line of related work comprises various stochastic alternating direction methods [18, 23, 2, 8, 31, 24, 29, 30] developed to solve problem (1). They fall into two camps: 1) compute a noisy subgradient on a mini-batch of data samples and perform subgradient descent [18, 23, 2, 8, 29]; 2) approximate problem (1) using a finite-sum loss and perform variance-reduced gradient descent or dual coordinate ascent [31, 24, 30]. For the first group of algorithms, drawing a noisy subgradient may lead to unstable numerical performance, especially on large-scale problems. In the experimental section, we compare our algorithm against SGADM [8] and demonstrate a significant improvement.
For the second group of algorithms, it is not always feasible to use a finite-sum loss, since we may know nothing about the underlying distribution of the data. Specifically, Zhong and Kwok [31] proposed a Stochastic Averaged Gradient-based ADM (SAG-ADM) with an $O(1/t)$ convergence rate. However, SAG-ADM needs to store past gradient information and thus incurs a very high memory cost. Suzuki [24] proposed a linearly convergent Stochastic Dual Coordinate Ascent ADM (SDCA-ADM), but it imposes stronger assumptions on $\ell$ and $r$, such as strong convexity and smoothness. Zheng and Kwok [30] proposed a Stochastic Variance-Reduced Gradient-based ADM (SVRG-ADM) for convex and nonconvex problems, but SVRG-ADM only handles the finite-sum setting. In contrast, our SPDPEG approach applies to problem (1) in its very general form.
Very recently, a stochastic variant of the hybrid gradient method, namely SPDHG [20], was proposed to solve a class of compositely regularized minimization problems with very special regularization: the regularizers must satisfy the restrictive structural condition stated in Assumption 3 of [20]. Such an assumption is strong and does not hold for many compositely regularized minimization problems. This motivates us to consider problem (1) and develop the SPDPEG approach.
The second line of related work concerns various extragradient methods. The idea is not new; it was originally proposed by Korpelevich for solving saddle-point problems and variational inequalities [10, 11]. The convergence and iteration complexity of extragradient methods were established in [17] and [16], respectively. Several variants also exist. Solodov and Svaiter proposed a hybrid proximal extragradient method [22], whose iteration complexity was established by Monteiro and Svaiter [13, 14, 15]. Bonettini and Ruggiero studied a generalized extragradient method for the total-variation-based image restoration problem [3]. To the best of our knowledge, this is the first time a stochastic primal-dual variant of extragradient-type methods has been introduced to solve problem (1).
2 Problem Set-Up and Methods
Throughout the paper, we make the following assumptions, which are common in the optimization literature and usually hold in practice:
Assumption 1.
The optimal set of problem (1) is nonempty.
Assumption 2.
$\ell(\cdot, \xi)$ is continuously differentiable with Lipschitz continuous gradient. That is, there exists a constant $L > 0$ such that
$\|\nabla \ell(x, \xi) - \nabla \ell(y, \xi)\| \le L \|x - y\|, \quad \forall x, y \in \mathcal{X}.$
Assumption 2 holds for many problems in machine learning. For example, the least squares and logistic losses
$\ell(x, \xi) = \frac{1}{2}(a^\top x - b)^2 \qquad \text{and} \qquad \ell(x, \xi) = \log\big(1 + \exp(-b\, a^\top x)\big),$
where $\xi = (a, b)$ is a single data sample, are two standard examples.
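As a minimal sketch, these two losses and their sample gradients can be coded as follows; the pairing $\xi = (a, b)$ follows the text, while the function names are our own.

```python
import numpy as np

def least_squares(x, a, b):
    """Loss (a^T x - b)^2 / 2 and its gradient in x for one sample (a, b)."""
    res = a @ x - b
    return 0.5 * res ** 2, res * a

def logistic(x, a, b):
    """Loss log(1 + exp(-b a^T x)) and its gradient in x, computed stably."""
    z = -b * (a @ x)
    loss = np.logaddexp(0.0, z)           # log(1 + e^z) without overflow
    grad = -b * a / (1.0 + np.exp(-z))    # sigmoid(z) * dz/dx
    return loss, grad
```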
Assumption 3.
The regularization functions $r$ and $g$ are both continuous but possibly nonsmooth; the proximal mapping associated with each individual regularizer admits a closed-form solution, i.e.,

(3) $\operatorname{prox}_{\eta h}(v) = \arg\min_{u} \Big\{ h(u) + \frac{1}{2\eta}\|u - v\|^2 \Big\}$

can be calculated in closed form for $h \in \{r, g\}$.
Remark 4.
We remark that Assumption 3 is reasonable for a class of optimization problems regularized by the $\ell_1$ norm or the nuclear norm, such as fused Lasso, fused logistic regression, and graph-guided regularized minimization problems. The proximal mapping of the $\ell_1$ norm can be computed by soft thresholding:
$\big[\operatorname{prox}_{\eta \|\cdot\|_1}(v)\big]_i = \operatorname{sign}(v_i) \max\{|v_i| - \eta,\, 0\}.$
We clarify that the proximal mapping of $g$ and that of $g(F\,\cdot)$ are totally different and have different properties. For example, the proximal mapping of $g$ admits a closed-form solution, but that of $g(F\,\cdot)$ does not in general when $F$ is nondiagonal. We only assume in Assumption 3 that the proximal mapping of $g$ admits a closed-form solution, yet we aim to address the case where the proximal mapping of $g(F\,\cdot)$ has no closed form.
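The soft-thresholding operator of Remark 4 is a one-liner; the helper name prox_l1 is ours.

```python
import numpy as np

def prox_l1(v, tau):
    """Prox of tau * ||.||_1 at v: sign(v_i) * max(|v_i| - tau, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# Caveat from the remark: prox_l1(F @ x, tau) is NOT the prox of g(Fx) when F is
# nondiagonal; this is exactly why the auxiliary variable y = Fx is used in (2).
print(prox_l1(np.array([1.5, -0.3, 0.7]), 0.5))   # -> [ 1.  -0.   0.2]
```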
Assumption 5.
The gradient of the objective function $f(x) = \mathbb{E}_\xi[\ell(x, \xi)]$ is easy to estimate. Any stochastic gradient estimate $G(x, \xi) = \nabla_x \ell(x, \xi)$ for $f$ at $x$ satisfies
$\mathbb{E}_\xi[G(x, \xi)] = \nabla f(x) \qquad \text{and} \qquad \mathbb{E}_\xi\big[\|G(x, \xi) - \nabla f(x)\|^2\big] \le \sigma^2,$
where $\sigma > 0$ is a constant.
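As a quick numerical illustration of the unbiasedness part of Assumption 5, the snippet below verifies that the per-sample least-squares gradient averages to the full gradient on synthetic data (our own toy setup).

```python
import numpy as np

rng = np.random.default_rng(1)
A, b = rng.normal(size=(5000, 10)), rng.normal(size=5000)
x = rng.normal(size=10)

full_grad = A.T @ (A @ x - b) / len(b)                        # gradient of f
sample_grads = A * (A @ x - b)[:, None]                       # G(x, xi) for every sample
print(np.linalg.norm(sample_grads.mean(axis=0) - full_grad))  # ~ 1e-16: unbiased
```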
Assumption 6.
$f$ is strongly convex on $\mathcal{X}$. In other words, there exists a constant $\mu > 0$ such that
$f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{\mu}{2}\|y - x\|^2, \quad \forall x, y \in \mathcal{X}.$
We remark that our algorithm works even without Assumption 6; however, a lower iteration complexity is obtained when Assumption 6 holds.
We now introduce the Stochastic Primal-Dual Proximal ExtraGradient (SPDPEG) method and then discuss the choice of stepsize. We define the augmented Lagrangian function for the reformulated problem (2) as
$\mathcal{L}_\beta(x, y, \lambda) = f(x) + r(x) + g(y) - \langle \lambda,\, Fx - y \rangle + \frac{\beta}{2}\|Fx - y\|^2,$
where $\lambda$ is the dual variable associated with the constraint $Fx - y = 0$ and $\beta > 0$ is the penalty parameter.
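A direct transcription of this definition into code might look as follows; f, r, and g are passed as callables, and the function name is our own.

```python
import numpy as np

def aug_lagrangian(x, y, lam, F, beta, f, r, g):
    """L_beta(x, y, lam) = f(x) + r(x) + g(y) - <lam, Fx - y> + beta/2 ||Fx - y||^2."""
    resid = F @ x - y
    return f(x) + r(x) + g(y) - lam @ resid + 0.5 * beta * (resid @ resid)
```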
The SPDPEG algorithm is based on a primal-dual update scheme in which $(x, y)$ are the primal variables and $\lambda$ is the dual variable, and it can be seen as an inexact augmented Lagrangian method. The details are presented in Algorithm 1.
We provide details on the following four important issues: how to solve the primal and dual subproblems efficiently, how to apply the noisy gradient and perform extragradient descent, how to choose the stepsize, and how to determine the weights for the non-uniformly averaged iterates.

Stochastic Gradient: According to Assumption 5, the gradient of $f$ with respect to $x$ is easy to estimate, with the stochastic gradient estimate defined as $G(x, \xi) = \nabla_x \ell(x, \xi)$. To update $(x, \lambda)$, the SPDPEG algorithm takes a proximal extragradient step using stochastic gradient estimates and different stepsizes: a prediction step with sample $\xi_{k,1}$,

(5) $x_{k+\frac{1}{2}} = \Pi_{\mathcal{X}}\big[\operatorname{prox}_{\eta_k r}\big(x_k - \eta_k\, \widehat{\nabla}_x \mathcal{L}_\beta(x_k, y_{k+\frac{1}{2}}, \lambda_k;\, \xi_{k,1})\big)\big],$

(6) $\lambda_{k+\frac{1}{2}} = \lambda_k - \eta_k\,\big(F x_{k+\frac{1}{2}} - y_{k+\frac{1}{2}}\big),$

and a correction step with an independent sample $\xi_{k,2}$,

(7) $x_{k+1} = \Pi_{\mathcal{X}}\big[\operatorname{prox}_{\eta_k r}\big(x_k - \eta_k\, \widehat{\nabla}_x \mathcal{L}_\beta(x_{k+\frac{1}{2}}, y_{k+\frac{1}{2}}, \lambda_{k+\frac{1}{2}};\, \xi_{k,2})\big)\big],$

(8) $\lambda_{k+1} = \lambda_k - \eta_k\,\big(F x_{k+\frac{1}{2}} - y_{k+\frac{1}{2}}\big),$

where $\widehat{\nabla}_x \mathcal{L}_\beta(x, y, \lambda;\, \xi) = G(x, \xi) - F^\top \lambda + \beta F^\top (Fx - y)$ is the stochastic gradient of the smooth part of the augmented Lagrangian, and the $y$-iterate is available in closed form as $y_{k+\frac{1}{2}} = \operatorname{prox}_{g/\beta}\big(F x_k - \lambda_k / \beta\big)$ by Assumption 3.
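The following sketch implements one prediction-correction iteration as reconstructed in (5)-(8). The helpers grad_est, prox_r, prox_g, and project_X correspond to Assumptions 5 and 3 and the projection onto $\mathcal{X}$; their names, like the reconstruction itself, are our assumptions rather than the paper's reference implementation.

```python
import numpy as np

def spdpeg_step(x, lam, F, beta, eta, grad_est, prox_r, prox_g, project_X, xi1, xi2):
    def grad_x(x_, y_, lam_, xi):
        # stochastic gradient of the smooth part of the augmented Lagrangian
        return grad_est(x_, xi) - F.T @ lam_ + beta * F.T @ (F @ x_ - y_)

    # y-subproblem in closed form, then the prediction step (5)-(6)
    y_half = prox_g(F @ x - lam / beta, 1.0 / beta)
    x_half = project_X(prox_r(x - eta * grad_x(x, y_half, lam, xi1), eta))
    lam_half = lam - eta * (F @ x_half - y_half)

    # correction step (7)-(8), restarted from (x_k, lam_k) with a fresh sample
    x_new = project_X(prox_r(x - eta * grad_x(x_half, y_half, lam_half, xi2), eta))
    lam_new = lam - eta * (F @ x_half - y_half)
    return x_new, lam_new
```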
Stepsize $\eta_k$: The choice of stepsize depends on whether the objective function is strongly convex, and the rate of convergence varies with the stepsize rule. Moreover, a sequence of vanishing stepsizes is necessary, since the proposed algorithm does not adopt any variance-reduction technique.
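A minimal sketch of the two vanishing schedules used in Section 3, with a tunable constant c and, in the strongly convex case, the modulus mu (both are assumptions to be set per problem):

```python
import numpy as np

def stepsize_convex(k, c=1.0):
    """O(1/sqrt(k)) schedule for general convex objectives."""
    return c / np.sqrt(k + 1)

def stepsize_strongly_convex(k, mu, c=1.0):
    """O(1/(mu k)) schedule under strong convexity (Assumption 6)."""
    return c / (mu * (k + 1))
```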

Non-Uniformly Averaged Iterates: [2] showed that non-uniformly averaged iterates generated by stochastic algorithms converge in fewer iterations. Inspired by this work, by non-uniformly averaging the iterates of the SPDPEG algorithm and adopting a slightly modified stepsize, we establish an accelerated convergence rate of $O(1/t)$ in expectation.
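A sketch of this averaging scheme with weights $w_k \propto k$, a common choice in this line of work; the exact weights used in Theorem 9 may differ.

```python
import numpy as np

def weighted_average(iterates):
    """Non-uniform average with w_k proportional to k (later iterates count more)."""
    w = np.arange(1, len(iterates) + 1, dtype=float)
    w /= w.sum()
    return sum(wk * xk for wk, xk in zip(w, iterates))
```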
3 Main Results
In this section, we present the main results of this paper. For general convex objectives, the uniformly averaged iterates generated by the SPDPEG algorithm converge in expectation at an $O(1/\sqrt{t})$ rate, while for strongly convex objectives, the uniformly and non-uniformly averaged iterates converge in expectation at $O(\log t / t)$ and $O(1/t)$ rates, respectively. The corresponding total computational complexities are $O(d\,\epsilon^{-2})$, $O(d\,\epsilon^{-1}\log(\epsilon^{-1}))$, and $O(d\,\epsilon^{-1})$, since the per-iteration complexity, namely the cost of the noisy gradients at $x_k$ and $x_{k+\frac{1}{2}}$ plus the proximal mappings, is linear in the dimension $d$ of the decision variable. The main theoretical results with respect to the different settings are summarized as follows:

Assuming that $f$ is a general convex objective function, the stepsize is $\eta_k = O(1/\sqrt{k})$, and the iterates are weighted uniformly, the proposed SPDPEG algorithm converges at the $O(1/\sqrt{t})$ rate in expectation.

Assuming that $f$ is a strongly convex objective function, the stepsize is $\eta_k = O(1/(\mu k))$, and the iterates are weighted uniformly, the proposed SPDPEG algorithm converges at the $O(\log t / t)$ rate in expectation.

Assuming that $f$ is a strongly convex objective function, the stepsize is $\eta_k = O(1/(\mu k))$, the iterates are weighted non-uniformly with $w_k \propto k$, and the dual variables are bounded by a constant (this assumption is standard and also adopted in [2]), the proposed SPDPEG algorithm converges at the $O(1/t)$ rate in expectation.
In the above, the constant hidden in the rates depends on the variance bound $\sigma^2$, the diameter $D$ of $\mathcal{X}$, and $\sigma_{\max}(F^\top F)$, where $\sigma_{\max}(F^\top F)$ denotes the largest eigenvalue of $F^\top F$; moreover, $\mu = 0$ when $f$ is a general convex objective function. The sketch below ties together the components introduced in Section 2; we then present the main theoretical result for uniformly averaged iterates under general convex objective functions in Theorem 7.
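Under these assumptions, a hypothetical end-to-end driver that reuses spdpeg_step, stepsize_convex, and weighted_average from the sketches above might look as follows; it targets the fused Lasso instance from Section 1 and is illustrative only.

```python
import numpy as np

def run_spdpeg(A, b, F, lam1, lam2, beta=1.0, iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x, lam = np.zeros(d), np.zeros(F.shape[0])
    grad_est = lambda x_, i: (A[i] @ x_ - b[i]) * A[i]   # per-sample gradient (Assumption 5)
    prox_r = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t * lam1, 0.0)
    prox_g = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t * lam2, 0.0)
    project_X = lambda v: v            # take X large enough that projection is trivial
    xs = []
    for k in range(iters):
        eta = stepsize_convex(k)
        i, j = rng.integers(n), rng.integers(n)          # two independent samples
        x, lam = spdpeg_step(x, lam, F, beta, eta, grad_est,
                             prox_r, prox_g, project_X, i, j)
        xs.append(x)
    return weighted_average(xs)        # or np.mean(xs, axis=0) for uniform averaging
```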
Theorem 7.
Consider the SPDPEG algorithm with uniformly averaged iterates $(\bar{x}_t, \bar{y}_t) = \frac{1}{t} \sum_{k=1}^{t} (x_k, y_k)$. For any optimal solution $(x^\star, y^\star)$, it holds that

(9) $\mathbb{E}\big[f(\bar{x}_t) + r(\bar{x}_t) + g(\bar{y}_t) - f(x^\star) - r(x^\star) - g(y^\star)\big] = O\big(1/\sqrt{t}\big),$

(10) $\mathbb{E}\big[\|F \bar{x}_t - \bar{y}_t\|\big] = O\big(1/\sqrt{t}\big).$

Note that this implies that the SPDPEG algorithm converges in expectation at the $O(1/\sqrt{t})$ rate in terms of both the objective error and the constraint violation.
We present the main theoretical result for uniformly averaged iterates under a strongly convex objective function in the following theorem.
Theorem 8.
Consider the SPDPEG algorithm with uniformly averaged iterates under Assumption 6. For any optimal solution $(x^\star, y^\star)$, it holds that

(11) $\mathbb{E}\big[f(\bar{x}_t) + r(\bar{x}_t) + g(\bar{y}_t) - f(x^\star) - r(x^\star) - g(y^\star)\big] = O\big(\log t / t\big),$

(12) $\mathbb{E}\big[\|F \bar{x}_t - \bar{y}_t\|\big] = O\big(\log t / t\big).$

Note that this implies that the SPDPEG algorithm converges in expectation at the $O(\log t / t)$ rate in terms of both the objective error and the constraint violation.
We present the main theoretical result for non-uniformly averaged iterates under a strongly convex objective function in the following theorem.
Theorem 9.
Consider the SPDPEG algorithm with non-uniformly averaged iterates $(\tilde{x}_t, \tilde{y}_t) = \sum_{k=1}^{t} w_k (x_k, y_k)$, where $w_k \propto k$, under Assumption 6. For any optimal solution $(x^\star, y^\star)$, it holds that

(13) $\mathbb{E}\big[f(\tilde{x}_t) + r(\tilde{x}_t) + g(\tilde{y}_t) - f(x^\star) - r(x^\star) - g(y^\star)\big] = O\big(1/t\big),$

(14) $\mathbb{E}\big[\|F \tilde{x}_t - \tilde{y}_t\|\big] = O\big(1/t\big).$

Note that this implies that the SPDPEG algorithm converges in expectation at the $O(1/t)$ rate in terms of both the objective error and the constraint violation.
4 Proof
Lemma 10.
The sequence $\{(x_k, y_k, \lambda_k)\}$ generated by the SPDPEG algorithm satisfies the following inequality:
(15)  
where the two error terms are denoted respectively by
(16) 
Proof.
The first-order optimality condition for the $x$-update is given by
(17) 
For any feasible point, the first-order optimality conditions for the two remaining updates are given respectively by
(18)  
(19) 
Choosing appropriate test points in (18) and (19), and summing the two resulting inequalities, yields
which implies that
(20) 
Therefore, we get
where the first inequality is obtained from (18) and the second inequality follows from (20). Furthermore, we have
where the error terms are defined in (16). Substituting the preceding bound into the last inequality, and then summing the resulting inequality with (19), we have
On the other hand, we have
where the last inequality holds since
4.1 Proof of Theorem 7
Lemma 11.
Suppose that $\{(x_k, y_k, \lambda_k)\}$ are generated by the SPDPEG algorithm, with the relevant quantities defined as in the main paper. For any optimal solution $(x^\star, y^\star)$, it holds that
Proof.
By the definition of the uniform averaging weights, plugging them into (15) yields
Moreover, we have
Therefore, we conclude that