# Stochastic Primal-Dual Proximal ExtraGradient Descent for Compositely Regularized Optimization

We consider a wide range of regularized stochastic minimization problems with two regularization terms, one of which is composed with a linear function. This optimization model abstracts a number of important applications in artificial intelligence and machine learning, such as fused Lasso, fused logistic regression, and a class of graph-guided regularized minimization problems. The computational challenges of this model are twofold. On the one hand, a closed-form solution of the proximal mapping associated with the composed regularization term or the expected objective function is not available. On the other hand, computing the full gradient of the expectation in the objective is very expensive when the number of input data samples is large. To address these issues, we propose a stochastic variant of extra-gradient type methods, namely Stochastic Primal-Dual Proximal ExtraGradient descent (SPDPEG), and analyze its convergence properties for both convex and strongly convex objectives. For general convex objectives, the uniformly averaged iterates generated by SPDPEG converge in expectation at an O(1/√t) rate. For strongly convex objectives, the uniformly and non-uniformly averaged iterates generated by SPDPEG converge at O(log(t)/t) and O(1/t) rates, respectively. These rates are known to match the best convergence rates attainable by first-order stochastic algorithms. Experiments on fused logistic regression and graph-guided regularized logistic regression problems show that the proposed algorithm performs very efficiently and consistently outperforms other competing algorithms.



## 1 Introduction

In this paper, we are interested in solving a class of convex optimization problems with both non-composite and composite regularization terms:

 min_{x∈X} E_ξ[l(x,ξ)] + r1(x) + r2(Fx), (1)

where X is a convex compact set with a finite diameter, the regularization terms r1 and r2 are both convex but possibly nonsmooth, and r2 is composed with a possibly non-diagonal penalty matrix F specifying the desired structured sparsity pattern in x. We denote by l(x,ξ) a convex and smooth loss function of the decision variable x for a data sample ξ, and define the corresponding expectation as l(x) = E_ξ[l(x,ξ)].

The above formulation covers quite a few popular models arising from statistics and machine learning, such as the Lasso [25], obtained by setting l(x,ξi) = ½(aiᵀx − bi)², r1(x) = ρ∥x∥1 and r2 = 0, and the linear SVM [5], obtained by letting l(x,ξi) = max(0, 1 − bi·aiᵀx), r1(x) = (ρ/2)∥x∥² and r2 = 0, where ρ > 0 is a regularization parameter. More importantly, we can accommodate problem (1) with more complicated structures by imposing a non-trivial regularization term r2(Fx), as in the fused Lasso [26], fused logistic regression, and graph-guided regularized minimization [7].
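To make the composite term r2(Fx) concrete, the sketch below builds a first-order difference matrix F, as in the fused Lasso, and evaluates the penalty with r2 = ∥·∥1. The function name and the test vector are illustrative, not from the paper.

```python
import numpy as np

def difference_matrix(d):
    """(d-1) x d first-order difference matrix F: (Fx)_i = x_{i+1} - x_i.
    With r2 = ||.||_1, r2(Fx) is the fused-Lasso penalty on x."""
    F = np.zeros((d - 1, d))
    for i in range(d - 1):
        F[i, i], F[i, i + 1] = -1.0, 1.0
    return F

F = difference_matrix(5)
x = np.array([1.0, 1.0, 3.0, 3.0, 0.0])
penalty = np.abs(F @ x).sum()   # sum of |x_{i+1} - x_i|
print(F @ x, penalty)           # successive differences and the fusion penalty
```

Because F is non-diagonal, the proximal mapping of r2(F·) has no closed form even though that of r2 itself does, which is precisely the difficulty the reformulation below addresses.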

The standard algorithm for solving problem (1) is proximal gradient descent [19]. However, there are two main difficulties: 1) computing the exact proximal gradient is intractable, since a closed-form solution to the proximal mapping of r1(x) + r2(Fx), or even of r2(Fx) alone, is usually unavailable; 2) the computational complexity of the full gradient rapidly increases as the number of samples grows, and is hence prohibitively expensive for modern data-intensive applications.

A common way to circumvent the former difficulty is to introduce an auxiliary variable z with z = Fx and reformulate problem (1) as a linearly constrained convex problem with respect to the two variables x and z as follows:

 min_{x∈X, z} E_ξ[l(x,ξ)] + r1(x) + r2(z), s.t. Fx − z = 0. (2)

One can then resort to the Linearized Alternating Direction Method of Multipliers (LADMM) [4, 28]. Very recently, Lin et al. [12] explored the efficiency of extra-gradient descent [10, 11], and further showed that the hybrid Extra-Gradient ADM (EGADM) is very efficient on moderate-size problems. However, these methods are computationally expensive due to the computation of the full gradient in each iteration.

To address the computational issue, several stochastic ADMM algorithms [18, 23, 2, 29, 30] have been proposed. The idea is to draw a mini-batch of samples and then compute a noisy sub-gradient of the loss on the mini-batch in each iteration. However, for problem (1) with non-smooth regularization (which is common in practice), these sub-gradient type alternating direction methods may be slow and unstable [6].

In this work, we propose Stochastic Primal-Dual Proximal Extra-Gradient Descent (SPDPEG), which inherits the advantages of EGADM and stochastic methods. Basically, the proposed method computes two noisy gradients of the loss at the k-th iteration by randomly drawing two data samples ξ1^{k+1} and ξ2^{k+1}, and then performs extra-gradient descent along the noisy gradients. We demonstrate that the proposed algorithm is very efficient and stable in solving problem (1) at large scale, with possibly non-smooth terms.

Our contribution: We propose a novel Stochastic Primal-Dual Proximal Extra-Gradient Descent (SPDPEG) method. SPDPEG is efficient in solving large-scale problems with composite and nonsmooth regularizations. We establish its theoretical convergence for both convex and strongly convex objectives. For convex objectives, SPDPEG converges at the rate O(1/√t) in expectation with uniformly averaged iterates. This rate is known to be the best possible for minimizing a general convex objective using a first-order noisy oracle [1]. When the objective is strongly convex, SPDPEG converges at the rates O(log(t)/t) and O(1/t) in expectation with the uniformly and non-uniformly averaged iterates, respectively. This matches the convergence rate of stochastic ADMM while offering significantly stronger robustness in numerical performance, as confirmed by encouraging experiments on fused logistic regression and graph-guided regularized minimization tasks.

Related work: The first line of related work comprises various stochastic alternating direction methods [18, 23, 2, 8, 31, 24, 29, 30] developed to solve problem (1). They fall into two camps: 1) computing a noisy sub-gradient of the loss on a mini-batch of data samples and performing sub-gradient descent [18, 23, 2, 8, 29]; 2) approximating problem (1) using the finite-sum loss and performing variance-reduced gradient descent or dual coordinate ascent [31, 24, 30].

For the first group of algorithms, drawing a noisy sub-gradient may lead to unstable numerical performance, especially on large-scale problems. In the experimental section, we compare our algorithm against SGADM [8] and demonstrate a significant improvement.

Very recently, a stochastic variant of the hybrid gradient method, namely SPDHG [20], was proposed to solve a class of compositely regularized minimization problems with a very special regularization structure (see Assumption 3 in [20]). However, this assumption is very strong and does not hold for many compositely regularized minimization problems. This motivates us to consider problem (1) and develop the SPDPEG approach.

The second line of related work comprises various extra-gradient methods. The idea is not new; it was originally proposed by Korpelevich for solving saddle-point problems and variational inequalities [10, 11]. The convergence and iteration complexity of extra-gradient methods are established in [17] and [16], respectively. There are also several variants of extra-gradient methods. Solodov and Svaiter proposed a hybrid proximal extra-gradient method [22], whose iteration complexity was established by Monteiro and Svaiter in [13, 14, 15]. Bonettini and Ruggiero studied a generalized extra-gradient method for the total variation based image restoration problem [3]. To the best of our knowledge, this is the first time that a stochastic primal-dual variant of extra-gradient type methods has been introduced to solve problem (1).

## 2 Problem Set-Up and Methods

Throughout the paper, we make the following assumptions, which are common in the optimization literature and usually hold in practice:

###### Assumption 1.

The optimal set of problem (1) is nonempty.

###### Assumption 2.

l(·) is continuously differentiable with Lipschitz continuous gradient. That is, there exists a constant L > 0 such that

 ∥∇l(x1)−∇l(x2)∥≤L∥x1−x2∥,∀x1,x2∈X.

Assumption 2 holds for many problems in machine learning. For example, the least squares and logistic losses are two standard instances:

 l(x,ξi) = ½(aiᵀx − bi)²  or  l(x,ξi) = log(1 + exp(−bi·aiᵀx)),

where ξi = (ai, bi) is a single data sample.
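To make the stochastic oracle of Assumption 5 concrete for the logistic loss above, the sketch below (all names and data are illustrative, not from the paper) computes a single-sample gradient; drawing the sample index uniformly makes the estimate unbiased for the full gradient.

```python
import numpy as np

def logistic_grad(x, a, b):
    """Gradient of l(x, xi) = log(1 + exp(-b * a^T x)) for one sample xi = (a, b)."""
    z = -b * (a @ x)
    s = np.exp(-np.logaddexp(0.0, -z))  # numerically stable sigmoid(z)
    return -b * s * a

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))        # 200 samples of dimension d = 5
bs = np.sign(rng.standard_normal(200))   # labels in {-1, +1}
x = np.zeros(5)

i = rng.integers(200)                         # uniformly drawn sample index
g_noisy = logistic_grad(x, A[i], bs[i])       # unbiased estimate of grad l(x)
g_full = np.mean([logistic_grad(x, A[j], bs[j]) for j in range(200)], axis=0)
```

Averaging `g_noisy` over many independent draws would recover `g_full`, which is exactly the unbiasedness requirement E_ξ[∇l(x,ξ)] = ∇l(x) of Assumption 5.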

###### Assumption 3.

The regularization functions r1 and r2 are both continuous but possibly non-smooth; the proximal mapping associated with each individual regularizer admits a closed-form solution, i.e.,

 prox_{ri}(x) = argmin_y ri(y) + ½∥y − x∥²₂ (3)

can be calculated in closed form for i = 1, 2.

###### Remark 4.

We remark that Assumption 3 is reasonable for a class of optimization problems regularized by the ℓ1-norm or the nuclear norm, such as fused Lasso, fused logistic regression, and graph-guided regularized minimization problems. The proximal mapping of the ℓ1-norm can be computed coordinate-wise as follows:

 [prox_{∥⋅∥1}(x)]i = argmin_{yi} |yi| + ½(yi − xi)² = sign(xi)(|xi| − 1) if |xi| > 1, and 0 if |xi| ≤ 1.

We clarify that the proximal mapping of r2 and that of r2(F·) are quite different and have different properties. For example, the proximal mapping of r2 admits a closed-form solution, but the proximal mapping of r2(F·) does not in general when F is non-diagonal. Assumption 3 only requires that the proximal mapping of r2 admit a closed-form solution; our goal is precisely to handle the composition r2(F·), whose proximal mapping does not admit a closed-form solution in general.
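The coordinate-wise soft-thresholding formula above takes only a few lines to implement. This is a minimal sketch of the closed-form prox that Assumption 3 requires; the function name and the step parameter `tau` are our own (tau = 1 recovers the display above).

```python
import numpy as np

def prox_l1(x, tau=1.0):
    """prox of tau*||.||_1: argmin_y tau*||y||_1 + 0.5*||y - x||^2,
    computed coordinate-wise by soft-thresholding at level tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

# tau = 1: shrink each coordinate by 1 and zero out those with |x_i| <= 1.
print(prox_l1(np.array([2.5, -0.3, 1.0, -4.0])))
```

The same closed form with tau = c_{k+1} serves as prox_{c_{k+1} r1} in the x-updates of the algorithm when r1 is an ℓ1-norm.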

###### Assumption 5.

The gradient of the objective function l is easy to estimate. Any stochastic gradient estimate ∇l(x,ξ) of ∇l(x) at x satisfies

 Eξ[∇l(x,ξ)]=∇l(x),

and

 Eξ[∥∇l(x,ξ)−∇l(x)∥2]≤σ2,

where σ > 0 is a constant.

###### Assumption 6.

l(·) is μ-strongly convex on X. In other words, there exists a constant μ > 0 such that

 l(x1) ≥ l(x2) + ⟨∇l(x2), x1 − x2⟩ + (μ/2)∥x1 − x2∥², ∀x1, x2 ∈ X.

We remark that our algorithm works even without Assumption 6; however, a lower iteration complexity is obtained when Assumption 6 holds.

We now introduce the Stochastic Primal-Dual Proximal ExtraGradient (SPDPEG) method and discuss the choice of step-size. We define the augmented Lagrangian function for problem (2) as

 Lγ(z,x,λ) = r2(z) + r1(x) + ϕ(z,x,λ) + (γ/2)∥Fx − z∥²,

where λ is the dual variable associated with the constraint Fx − z = 0, and ϕ is defined as

 ϕ(z,x,λ) = l(x) − ⟨λ, Fx − z⟩.

The SPDPEG algorithm is based on a primal-dual update scheme in which (z, x) are primal variables and λ is a dual variable, and can be seen as an inexact augmented Lagrangian method. The details are presented in Algorithm 1.

We elaborate on the following four important issues: how to solve the primal and dual sub-problems easily, how to apply the noisy gradient and perform extra-gradient descent, how to choose the step-size, and how to determine the weights for the non-uniformly averaged iterates.

1. Update for z: The first sub-problem in Algorithm 1 is to minimize the augmented Lagrangian function with respect to z, i.e.,

 z^{k+1} := argmin_z Lγ(z, x^k; λ^k), (4)

which is equivalent to computing the proximal mapping of r2 and hence admits a closed-form solution by Assumption 3.

2. Stochastic Gradient: According to Assumption 5, the gradient of l with respect to x is easy to estimate, and the stochastic gradient estimate is defined as

 G(z,x,λ;ξ) = ∇l(x,ξ) − F⊤λ.

To update x and λ, the SPDPEG algorithm takes a proximal extra-gradient step using stochastic gradient estimates and different step-sizes, i.e.,

 x̄^{k+1} := prox_{c_{k+1} r1}(x^k − c_{k+1} G(z^{k+1}, x^k, λ^k; ξ1^{k+1})), (5)
 λ̄^{k+1} := λ^k − γ(Fx^k − z^{k+1}), (6)

and

 x^{k+1} := prox_{c_{k+1} r1}(x^k − c_{k+1} G(z^{k+1}, x̄^{k+1}, λ̄^{k+1}; ξ2^{k+1})), (7)
 λ^{k+1} := λ^k − γ(Fx̄^{k+1} − z^{k+1}). (8)
3. Step-Size c_{k+1}: The choice of step-size depends on whether the objective function is strongly convex, and the rate of convergence varies with the step-size rule. Moreover, a sequence of vanishing step-sizes is necessary, since the proposed algorithm does not adopt any variance-reduction technique.

4. Non-Uniformly Averaged Iterates: [2] showed that non-uniformly averaged iterates generated by stochastic algorithms converge in fewer iterations. Inspired by this work, by non-uniformly averaging the iterates of the SPDPEG algorithm and adopting a slightly modified step-size, we establish an accelerated convergence rate of O(1/t) in expectation.
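Putting the four pieces together, one SPDPEG iteration, eqs. (4)-(8), can be sketched as follows for an assumed instance with a least-squares loss and ℓ1 regularizers r1 = ρ1∥·∥1, r2 = ρ2∥·∥1; here the z-update (4) reduces to prox_{r2/γ} evaluated at Fx^k − λ^k/γ. All problem data, parameter values, and function names are illustrative, not the paper's experimental setup.

```python
import numpy as np

def soft(v, tau):
    """Soft-thresholding: closed-form prox of tau*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def spdpeg_step(x, lam, A, b, F, gamma, c, rho1, rho2, rng):
    """One iteration of eqs. (4)-(8) for
    min E[0.5*(a^T x - b)^2] + rho1*||x||_1 + rho2*||Fx||_1 (illustrative instance)."""
    n = A.shape[0]

    # (4): z-update = prox of r2/gamma at Fx - lam/gamma (closed form, Assumption 3)
    z = soft(F @ x - lam / gamma, rho2 / gamma)

    def G(xv, lv, i):
        # stochastic gradient G(z, x, lam; xi) = grad l(x, xi) - F^T lam
        return (A[i] @ xv - b[i]) * A[i] - F.T @ lv

    # (5)-(6): predictor step with the first random sample xi_1
    i1 = rng.integers(n)
    x_bar = soft(x - c * G(x, lam, i1), c * rho1)
    lam_bar = lam - gamma * (F @ x - z)

    # (7)-(8): extra-gradient (corrector) step with the second random sample xi_2
    i2 = rng.integers(n)
    x_new = soft(x - c * G(x_bar, lam_bar, i2), c * rho1)
    lam_new = lam - gamma * (F @ x_bar - z)
    return x_new, lam_new, z

# Tiny run with vanishing step-sizes c_k and uniform averaging of the iterates
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 4)); b = A @ np.array([1.0, 0.0, 0.0, -1.0])
F = np.eye(4, 4, 1)[:3] - np.eye(3, 4)     # 3x4 first-order difference matrix
x, lam, avg = np.zeros(4), np.zeros(3), np.zeros(4)
for k in range(200):
    x, lam, z = spdpeg_step(x, lam, A, b, F, 1.0, 0.1 / np.sqrt(k + 1), 0.01, 0.01, rng)
    avg += x / 200                          # uniformly averaged iterate
```

Note the extra-gradient structure: the predictor (5)-(6) and corrector (7)-(8) each consume a fresh random sample, matching the two stochastic gradients per iteration described earlier.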

## 3 Main Result

In this section, we present the main results of this paper. For general convex objectives, the uniformly averaged iterates generated by the SPDPEG algorithm converge in expectation at an O(1/√t) rate, while for strongly convex objectives, the uniformly and non-uniformly averaged iterates converge in expectation at O(log(t)/t) and O(1/t) rates, respectively. The overall computational complexity follows from the per-iteration cost, namely that of computing the noisy gradients on ξ1^{k+1} and ξ2^{k+1} and the proximal mapping, which scales with the dimension d of the decision variable. The main theoretical results for the different settings are summarized as follows:

1. Assuming that l is a general convex objective function, with step-size c_{k+1} = 1/(√(k+1) + ~L) and uniform weight 1/t on the iterates, the proposed SPDPEG algorithm converges at the rate O(1/√t) in expectation.

2. Assuming that l is a μ-strongly convex objective function, with a vanishing step-size of order O(1/(μk)) and uniform weight 1/t on the iterates, the proposed SPDPEG algorithm converges at the rate O(log(t)/t) in expectation.

3. Assuming that l is a μ-strongly convex objective function, with a vanishing step-size of order O(1/(μk)), weights on the iterates proportional to the iteration index, and dual variables bounded by a constant (this assumption is standard and also adopted in [2]), the proposed SPDPEG algorithm converges at the rate O(1/t) in expectation.

In the above, ~L is defined as

 ~L = max{8γσ_max(F⊤F) + μ, √(8L² + γσ_max(F⊤F)) + μ},

where σ_max(F⊤F) denotes the largest eigenvalue of F⊤F, and μ = 0 when l is a general convex objective function.

We present the main theoretic result for uniformly average iterates under general convex objective functions in the following theorem.

###### Theorem 7.

Consider the SPDPEG algorithm with uniformly averaged iterates. For any optimal solution (z∗, x∗), it holds that

 ∣∣E[l(~xt)]+E[r1(~xt)]+E[r2(~zt)]−l(x∗)−r1(x∗)−r2(z∗)∣∣ =O(1/√t), (9) ∥∥FE[~xt]−E[~zt]∥∥ =O(1/√t). (10)

Note that this implies that the SPDPEG algorithm converges in expectation at the rate O(1/√t) in terms of both the objective error and the constraint violation.

We present the main theoretic result for uniformly average iterates under a strongly convex objective function in the following theorem.

###### Theorem 8.

Consider the SPDPEG algorithm with uniformly averaged iterates. For any optimal solution (z∗, x∗), it holds that

 ∣∣E[l(~xt)]+E[r1(~xt)]+E[r2(~zt)]−l(x∗)−r1(x∗)−r2(z∗)∣∣ =O(log(t)/t), (11) ∥∥FE[~xt]−E[~zt]∥∥ =O(log(t)/t). (12)

Note that this implies that the SPDPEG algorithm converges in expectation at the rate O(log(t)/t) in terms of both the objective error and the constraint violation.

We present the main theoretic result for non-uniformly average iterates under a strongly convex objective function in the following theorem.

###### Theorem 9.

Consider the SPDPEG algorithm with non-uniformly averaged iterates. For any optimal solution (z∗, x∗), it holds that

 ∣∣E[l(~xt)]+E[r1(~xt)]+E[r2(~zt)]−l(x∗)−r1(x∗)−r2(z∗)∣∣ =O(1/t), (13) ∥∥FE[~xt]−E[~zt]∥∥ =O(1/t). (14)

Note that this implies that the SPDPEG algorithm converges in expectation at the rate O(1/t) in terms of both the objective error and the constraint violation.

## 4 Proof

We first prove a key technical lemma that is central to the proofs of Theorems 7-9.

###### Lemma 10.

The sequence {(z^k, x^k, λ^k)} generated by the SPDPEG algorithm satisfies the following inequality:

 r1(x)+r2(z)−r1(¯xk+1)−r2(zk+1)+⎛⎜ ⎜⎝z−zk+1x−¯xk+1λ−¯λk+1⎞⎟ ⎟⎠⊤⎛⎜ ⎜ ⎜⎝¯λk+1G(zk+1,¯xk+1,¯λk+1;ξk+12)F¯xk+1−zk+1⎞⎟ ⎟ ⎟⎠ (15) ≥ 12ck+1∥∥x−xk+1∥∥2−12ck+1∥∥x−xk∥∥2−4ck+1∥∥δk+1∥∥2−4ck+1∥∥¯δk+1∥∥2 −12γ∥∥λ−λk∥∥2+12γ∥∥λ−λk+1∥∥2+[12γ−4ck+1σmax(F⊤F)]∥∥λk−¯λk+1∥∥2 +[12ck+1−γσmax(F⊤F)2−4ck+1L2]∥∥xk−¯xk+1∥∥2+12ck+1∥∥xk+1−¯xk+1∥∥2,

where δ^{k+1} and ¯δ^{k+1} are denoted respectively by

 δk+1=∇l(xk,ξk+11)−∇l(xk)and¯δk+1=∇l(¯xk+1,ξk+12)−∇l(¯xk+1). (16)
###### Proof.

The first-order optimality condition for updating z is given by

 r2(z) − r2(z^{k+1}) + ⟨z − z^{k+1}, ¯λ^{k+1}⟩ ≥ 0, ∀z. (17)

For any x ∈ X, the first-order optimality conditions for updating ¯x^{k+1} and x^{k+1} are given respectively by

 r1(x)−r1(¯xk+1)+⟨x−¯xk+1,¯xk+1−xkck+1+G(zk+1,xk,λk;ξk+11)⟩ ≥0, (18) r1(x)−r1(xk+1)+⟨x−xk+1,xk+1−xkck+1+G(zk+1,¯xk+1,¯λk+1;ξk+12)⟩ ≥0. (19)

Setting x = x^{k+1} in (18) and x = ¯x^{k+1} in (19), and summing the two resulting inequalities, yields

 1ck+1∥∥xk+1−¯xk+1∥∥2 ≤ ⟨xk+1−¯xk+1,G(zk+1,xk,λk;ξk+11)−G(zk+1,¯xk+1,¯λk+1;ξk+12)⟩ ≤ ∥∥xk+1−¯xk+1∥∥∥∥G(zk+1,xk,λk;ξk+11)−G(zk+1,¯xk+1,¯λk+1;ξk+12)∥∥,

which implies that

 ∥∥xk+1−¯xk+1∥∥≤ck+1∥∥G(zk+1,xk,λk;ξk+11)−G(zk+1,¯xk+1,¯λk+1;ξk+12)∥∥. (20)

Therefore, we get

 r1(xk+1)−r1(¯xk+1)+⟨xk+1−¯xk+1,G(zk+1,¯xk+1,¯λk+1;ξk+12)⟩ ≥ ⟨xk+1−¯xk+1,G(zk+1,¯xk+1,¯λk+1;ξk+12)−G(zk+1,xk,λk;ξk+11)⟩ −⟨xk+1−¯xk+1,¯xk+1−xkck+1⟩ ≥ −ck+1∥∥G(zk+1,¯xk+1,¯λk+1;ξk+12)−G(zk+1,xk,λk;ξk+11)∥∥2 −12ck+1∥∥xk+1−xk∥∥2+12ck+1∥∥xk+1−¯xk+1∥∥2+12ck+1∥∥¯xk+1−xk∥∥2.

where the first inequality is obtained by letting x = x^{k+1} in (18) and the second inequality follows from (20). Furthermore, we have

 ∥∥G(zk+1,¯xk+1,¯λk+1;ξk+12)−G(zk+1,xk,λk;ξk+11)∥∥2 = ∥∥¯δk+1+∇l(¯xk+1)−F⊤¯λk+1−[δk+1+∇l(xk)−F⊤λk]∥∥2 ≤ 4∥∥δk+1∥∥2+4∥∥¯δk+1∥∥2+4L2∥∥xk−¯xk+1∥∥2+4σmax(F⊤F)∥∥λk−¯λk+1∥∥2,

where δ^{k+1} and ¯δ^{k+1} are defined in (16). Substituting this bound into the previous inequality, and then summing the resulting inequality with (19), we have

 r1(x)−r1(¯xk+1)+⟨x−¯xk+1,G(zk+1,¯xk+1,¯λk+1;ξk+12)⟩ ≥ −4ck+1∥∥δk+1∥∥2−4ck+1∥∥¯δk+1∥∥2−4ck+1σmax(F⊤F)∥∥λk−¯λk+1∥∥2 −4ck+1L2∥∥xk−¯xk+1∥∥2−12ck+1∥∥xk+1−xk∥∥2+12ck+1∥∥xk+1−¯xk+1∥∥2 = −4ck+1∥∥δk+1∥∥2−4ck+1∥∥¯δk+1∥∥2−4ck+1σmax(F⊤F)∥∥λk−¯λk+1∥∥2 −12ck+1∥∥x−xk∥∥2+12ck+1∥∥x−xk+1∥∥2.

On the other hand, we have

 ⟨λ−¯λk+1,F¯xk+1−zk+1⟩ = 1γ⟨λ−λk+1+λk+1−¯λk+1,λk−λk+1⟩ = −12γ∥∥λ−λk∥∥2+12γ∥∥λ−λk+1∥∥2−12γ∥∥λk+1−¯λk+1∥∥2+12γ∥∥λk−¯λk+1∥∥2 ≥ −12γ∥∥λ−λk∥∥2+12γ∥∥λ−λk+1∥∥2+12γ∥∥λk−¯λk+1∥∥2−γσmax(F⊤F)2∥∥xk−¯xk+1∥∥2,

where the last inequality holds since

 λk+1−¯λk+1=γ(F¯xk+1−zk+1)−γ(Fxk−zk+1)=γ(F¯xk+1−Fxk).

Finally, combining (17) with the two inequalities above yields (15). ∎

### 4.1 Proof of Theorem 7

###### Lemma 11.

Suppose that {(z^k, x^k, λ^k)} is generated by the SPDPEG algorithm, and that c_{k+1} and ~L are defined as in the main paper. For any optimal solution (z∗, x∗), it holds that

 l(x∗)+r1(x∗)+r2(z∗)−E[l(¯xk+1)]−E[r1(¯xk+1)]−E[r2(zk+1)] +E⎡⎢ ⎢⎣⎛⎜ ⎜⎝z∗−zk+1x∗−¯xk+1λ−¯λk+1⎞⎟ ⎟⎠⊤⎛⎜ ⎜⎝¯λk+1−F⊤¯λk+1F¯xk+1−zk+1⎞⎟ ⎟⎠⎤⎥ ⎥⎦ ≥ √k+1+~L2E∥∥x∗−xk+1∥∥2−√k+1+~L2E∥∥x∗−xk∥∥2−8σ2√k+1
###### Proof.

By the definition of c_{k+1} = 1/(√(k+1) + ~L), we have 1/(2γ) − 4c_{k+1}σ_max(F⊤F) ≥ 0 and 1/(2c_{k+1}) − γσ_max(F⊤F)/2 − 4c_{k+1}L² ≥ 0. Plugging these into (15) yields

 r1(x)+r2(z)−r1(¯xk+1)−r2(zk+1)+⎛⎜ ⎜⎝z−zk+1x−¯xk+1λ−¯λk+1⎞⎟ ⎟⎠⊤⎛⎜ ⎜ ⎜⎝¯λk+1G(zk+1,¯xk+1,¯λk+1;ξk+12)F¯xk+1−zk+1⎞⎟ ⎟ ⎟⎠ ≥ 12ck+1∥∥x−xk+1∥∥2−12ck+1∥∥x−xk∥∥2−4ck+1∥∥δk+1∥∥2−4ck+1∥∥¯δk+1∥∥2 −12γ∥∥λ−λk∥∥2+12γ∥∥λ−λk+1∥∥2.

Moreover, we have

 (x−¯xk+1)⊤G(zk+1,¯xk+1,¯λk+1;ξk+12) = (x−¯xk+1)⊤∇l(¯xk+1)+(x−¯xk+1)⊤¯δk+1+(x−¯xk+1)⊤[−F⊤¯λk+1] ≤ l(x)−l(¯xk+1)+(x−¯xk+1)⊤[−F⊤¯λk+1]+(x−¯xk+1)⊤¯δk+1.

Therefore, we conclude that

 l(x)+r1(x)+r2(z)−l(¯xk+1)−r1(¯xk+1)−r2(zk+1)+⎛⎜ ⎜⎝z−zk+1x−¯xk+1λ−¯λk+1⎞⎟ ⎟⎠⊤⎛⎜ ⎜⎝¯λk+1−F⊤¯λk+1F¯xk+1−zk+1⎞⎟ ⎟⎠ ≥ √k+1+~L2∥∥x−xk+1∥∥2−√k+1+~L2∥∥x−xk∥∥2−4√k+1+~L(∥∥δk+1∥∥2+∥∥¯δk+1∥