Stochastic Primal-Dual Proximal ExtraGradient Descent for Compositely Regularized Optimization

08/20/2017 ∙ by Tianyi Lin, et al.

We consider a wide range of regularized stochastic minimization problems with two regularization terms, one of which is composed with a linear function. This optimization model abstracts a number of important applications in artificial intelligence and machine learning, such as fused Lasso, fused logistic regression, and a class of graph-guided regularized minimization problems. The computational challenges of this model are twofold. On one hand, no closed-form solution is available for the proximal mapping associated with the composed regularization term or with the expected objective function. On the other hand, calculating the full gradient of the expectation in the objective is very expensive when the number of input data samples is large. To address these issues, we propose a stochastic variant of extra-gradient type methods, namely Stochastic Primal-Dual Proximal ExtraGradient descent (SPDPEG), and analyze its convergence properties for both convex and strongly convex objectives. For general convex objectives, the uniformly averaged iterates generated by SPDPEG converge in expectation at an O(1/√(t)) rate. For strongly convex objectives, the uniformly and non-uniformly averaged iterates generated by SPDPEG converge at O(log(t)/t) and O(1/t) rates, respectively. These rates are known to match the best convergence rates attainable by first-order stochastic algorithms. Experiments on fused logistic regression and graph-guided regularized logistic regression problems show that the proposed algorithm performs very efficiently and consistently outperforms other competing algorithms.


1 Introduction

In this paper, we are interested in solving a class of convex optimization problems with both non-composite and composite regularization terms:

   min_{x ∈ X} f(x) + g(x) + h(Ax),    (1)

where X ⊆ R^d is a convex compact set with diameter D, the regularization terms g(·) and h(·) are both convex but possibly nonsmooth, and h is composed with a possibly non-diagonal penalty matrix A specifying the desired structured sparsity pattern in x. We denote by ℓ(x, ξ) a convex and smooth loss function of a decision rule x for a data sample ξ, and define the corresponding expectation as f(x) = E_ξ[ℓ(x, ξ)].

The above formulation covers quite a few popular models arising in statistics and machine learning, such as Lasso [25], obtained by setting ℓ to the least-squares loss, g(x) = ρ‖x‖₁ and h ≡ 0, and linear SVM [5], obtained by letting ℓ be the hinge loss, g(x) = (ρ/2)‖x‖² and h ≡ 0, where ρ > 0 is a regularization parameter. More importantly, problem (1) accommodates more complicated structures through the non-trivial regularization term h(Ax), such as fused Lasso [26], fused logistic regression, and graph-guided regularized minimization [7].
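As a concrete instance of (1) (written in our notation; the constants ρ₁, ρ₂ are illustrative, not taken from the paper), fused Lasso couples an ℓ₁ penalty on the coefficients with an ℓ₁ penalty on their successive differences:

   min_{x ∈ X} E_{(a,b)}[ (a^⊤x − b)²/2 ] + ρ₁‖x‖₁ + ρ₂‖Ax‖₁,

where A is the (d−1)×d first-order difference matrix whose i-th row has a 1 in position i and a −1 in position i+1. Here g(x) = ρ₁‖x‖₁ has an easy proximal mapping, while A is non-diagonal and the proximal mapping of h(Ax) = ρ₂‖Ax‖₁ admits no closed form, which is exactly the difficulty this paper targets.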

The standard algorithm for solving problem (1) is proximal gradient descent [19]. However, there are two main difficulties: 1) computing the exact proximal gradient is intractable, since the closed-form solution to the proximal mapping of g(·) + h(A·), or even of h(A·) alone, is usually unavailable; 2) the computational complexity of the full gradient grows rapidly with the number of samples, and is hence prohibitively expensive for modern data-intensive applications.

A common way to overcome the first difficulty is to introduce an auxiliary variable y with y = Ax and reformulate problem (1) as a linearly constrained convex problem in the two variables x and y:

   min_{x ∈ X, y} f(x) + g(x) + h(y)   s.t.   Ax = y.    (2)

Then one can resort to the Linearized Alternating Direction Method of Multipliers (LADMM) [4, 28]. Very recently, Lin et al. [12] explored the efficiency of extra-gradient descent [10, 11] and further showed that the hybrid Extra-Gradient ADM (EGADM) is very efficient on moderately sized problems. However, these methods are computationally expensive because they compute the full gradient in each iteration.

To address the computational issue, several stochastic ADMM algorithms [18, 23, 2, 29, 30] have been proposed. The idea is to draw a mini-batch of samples and compute a noisy sub-gradient of the objective on the mini-batch in each iteration. However, for problem (1) with non-smooth regularization (which is common in practice), these sub-gradient-type alternating direction methods may be slow and unstable [6].

In this work, we propose a Stochastic Primal-Dual Proximal Extra-Gradient Descent (SPDPEG) method, which inherits the advantages of EGADM and of stochastic methods. Basically, the proposed method computes two noisy gradients of f at the t-th iteration by randomly drawing two data samples ξ_t and ξ_t′, and then performs an extra-gradient step along these noisy gradients. We demonstrate that the proposed algorithm is very efficient and stable in solving problem (1), possibly with non-smooth terms, at large scale.

Our contribution: We propose a novel Stochastic Primal-Dual Proximal Extra-Gradient Descent (SPDPEG) method, which is efficient for large-scale problems with composite and nonsmooth regularizations. We establish its convergence for both convex and strongly convex objectives. For convex objectives, SPDPEG attains an O(1/√(t)) convergence rate in expectation with uniformly averaged iterates; this rate is known to be the best possible for minimizing a general convex objective with a first-order noisy oracle [1]. When the objective is strongly convex, SPDPEG converges at the rates O(log(t)/t) and O(1/t) in expectation with uniformly and non-uniformly averaged iterates, respectively. This matches the convergence rate of stochastic ADMM while exhibiting significantly stronger robustness in numerical performance, as confirmed by encouraging experiments on fused logistic regression and graph-guided regularized minimization tasks.

Related work: The first line of related work comprises various stochastic alternating direction methods [18, 23, 2, 8, 31, 24, 29, 30] developed to solve problem (1). They fall into two camps: 1) compute a noisy sub-gradient of the objective on a mini-batch of data samples and perform sub-gradient descent [18, 23, 2, 8, 29]; 2) approximate problem (1) using the finite-sum loss and perform variance-reduced gradient descent or dual coordinate ascent [31, 24, 30].

For the first group of algorithms, drawing a noisy sub-gradient may lead to unstable numerical performance, especially on large-scale problems. In the experimental section, we compare our algorithm against SGADM [8] and demonstrate a significant improvement.

For the second group of algorithms, it is not always feasible to use the finite-sum loss, since we may know nothing about the underlying distribution of the data. Specifically, Zhong and Kwok [31] proposed a Stochastic Averaged Gradient-based ADM (SAG-ADM) with an O(1/t) iteration complexity. However, SAG-ADM needs to store past gradient information and incurs a very high memory cost. Suzuki [24] proposed a linearly convergent Stochastic Dual Coordinate Ascent ADM (SDCA-ADM); however, it imposes stronger assumptions, such as strong convexity and smoothness, on the loss and regularizers. Zheng and Kwok [30] proposed a Stochastic Variance-Reduced Gradient-based ADM (SVRG-ADM) for convex and non-convex problems, but SVRG-ADM only handles the finite-sum problem. In contrast, our SPDPEG approach can solve problem (1) in its full generality.

Very recently, a stochastic variant of the hybrid gradient method, namely SPDHG [20], has been proposed to solve a class of compositely regularized minimization problems with a very special form of regularization: restrictive structural conditions are imposed on the regularizers (see Assumption 3 in [20]). However, such an assumption is very strong and does not hold for many compositely regularized minimization problems. This motivates us to consider problem (1) and develop the SPDPEG approach.

The second line of related work comprises various extra-gradient methods. The idea is not new; it was originally proposed by Korpelevich for solving saddle-point problems and variational inequalities [10, 11]. The convergence and iteration complexity of extra-gradient methods are established in [17] and [16], respectively. Several variants exist: Solodov and Svaiter proposed a hybrid proximal extra-gradient method [22], whose iteration complexity was established by Monteiro and Svaiter [13, 14, 15], and Bonettini and Ruggiero studied a generalized extra-gradient method for total-variation-based image restoration [3]. To the best of our knowledge, this is the first time a stochastic primal-dual variant of extra-gradient-type methods has been introduced to solve problem (1).

2 Problem Set-Up and Methods

Throughout the paper, we make the following assumptions, which are common in the optimization literature and usually hold in practice:

Assumption 1.

The optimal set of problem (1) is nonempty.

Assumption 2.

f is continuously differentiable with a Lipschitz continuous gradient; that is, there exists a constant L > 0 such that ‖∇f(x) − ∇f(x′)‖ ≤ L‖x − x′‖ for all x, x′ ∈ X.

Assumption 2 holds for many problems in machine learning. For example, the least-squares and logistic losses are two standard instances:

   ℓ(x, (a, b)) = (a^⊤x − b)²/2   and   ℓ(x, (a, b)) = log(1 + exp(−b · a^⊤x)),

where (a, b) is a single data sample with feature vector a and label b.
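As a quick sketch (the variable names are ours, not the paper's), these two losses and their Lipschitz-gradient evaluations can be implemented as:

    import numpy as np

    def least_squares(x, a, b):
        # l(x; a, b) = (a^T x - b)^2 / 2; gradient is (a^T x - b) * a
        r = a @ x - b
        return 0.5 * r * r, r * a

    def logistic(x, a, b):
        # l(x; a, b) = log(1 + exp(-b * a^T x)), label b in {-1, +1};
        # gradient is -b * a / (1 + exp(b * a^T x))
        m = b * (a @ x)
        return np.log1p(np.exp(-m)), -b / (1.0 + np.exp(m)) * a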

Assumption 3.

The regularization functions g and h are both continuous but possibly non-smooth, and the proximal mapping associated with each individual regularizer admits a closed-form solution; i.e.,

   prox_{γφ}(v) = argmin_u { φ(u) + (1/(2γ))‖u − v‖² }    (3)

can be calculated in closed form for φ ∈ {g, h} and any γ > 0.

Remark 4.

We remark that Assumption 3 is reasonable for a class of optimization problems regularized by the ℓ₁-norm or the nuclear norm, such as fused Lasso, fused logistic regression, and graph-guided regularized minimization problems. The proximal mapping of the ℓ₁-norm can be computed by entrywise soft-thresholding:

   [prox_{γ‖·‖₁}(v)]_i = sign(v_i) · max(|v_i| − γ, 0).

We emphasize that the proximal mapping of h and that of the composition h(A·) are totally different objects with different properties. For example, the proximal mapping of h admits a closed-form solution, but that of h(A·) does not in general when A is non-diagonal. In Assumption 3 we only assume that the proximal mapping of h admits a closed-form solution, while our goal is precisely to handle the composition h(A·), whose proximal mapping has no closed form in general.
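A minimal sketch of this soft-thresholding operator (our naming), which also serves the prox steps of the algorithm below:

    import numpy as np

    def prox_l1(v, gamma):
        # Entrywise soft-thresholding: sign(v_i) * max(|v_i| - gamma, 0)
        return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)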

Assumption 5.

The gradient of the objective function f is easy to estimate: any stochastic gradient estimate G(x, ξ) of ∇f(x) at x satisfies

   E_ξ[G(x, ξ)] = ∇f(x)   and   E_ξ[‖G(x, ξ) − ∇f(x)‖²] ≤ σ²,

where σ > 0 is a constant.
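For instance (a sketch under our assumptions, not the paper's construction), uniformly sampling one data point and returning its per-sample logistic gradient yields an unbiased estimate of the full-data gradient, satisfying the first condition above:

    import numpy as np

    def stochastic_grad(x, A_data, b_data, rng):
        # Draw one sample uniformly at random; in expectation this equals
        # the average (full) gradient over the dataset.
        i = rng.integers(len(b_data))
        a, b = A_data[i], b_data[i]
        return -b / (1.0 + np.exp(b * (a @ x))) * a

    # Usage sketch:
    # rng = np.random.default_rng(0)
    # g_hat = stochastic_grad(x, A_data, b_data, rng)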

Assumption 6.

f is μ-strongly convex on X; in other words, there exists a constant μ > 0 such that

   f(x′) ≥ f(x) + ⟨∇f(x), x′ − x⟩ + (μ/2)‖x′ − x‖²   for all x, x′ ∈ X.

We remark that our algorithm works even without Assumption 6; however, a lower iteration complexity is obtained when it holds.

  Initialize: x₁, λ₁, and the step-size sequence {η_t}.
  for t = 1, 2, … do
     choose two data samples ξ_t and ξ_t′ randomly;
     update y_t according to Eq. (4);
     update x̂_t according to Eq. (5);
     update λ̂_t according to Eq. (6);
     update x_{t+1} according to Eq. (7);
     update λ_{t+1} according to Eq. (8);
  end for
  Output: the (weighted) averages of the iterates x_t, y_t, and λ_t.
Algorithm 1 Stochastic Primal-Dual Proximal ExtraGradient (SPDPEG)

We now introduce the Stochastic Primal-Dual Proximal ExtraGradient (SPDPEG) method and discuss the choice of step-size. We define the augmented Lagrangian function for problem (2) as

   L_β(x, y, λ) = f(x) + g(x) + h(y) − ⟨λ, Ax − y⟩ + (β/2)‖Ax − y‖²,

where λ is the dual variable associated with the constraint Ax = y and β > 0 is the penalty parameter. The SPDPEG algorithm is based on a primal-dual update scheme in which (x, y) are the primal variables and λ is the dual variable, and it can be seen as an inexact augmented Lagrangian method. The details are presented in Algorithm 1.

We provide details on the following four important issues: how to solve the primal and dual sub-problems easily, how to apply the noisy gradient and perform the extra-gradient step, how to choose the step-size, and how to determine the weights for the non-uniformly averaged iterates; a schematic code sketch follows the list.

  1. Update for y: The first sub-problem in Algorithm 1 minimizes the augmented Lagrangian function with respect to y, i.e.,

     y_t = argmin_y L_β(x_t, y, λ_t),    (4)

     which is equivalent to computing the proximal mapping of h and hence admits a closed-form solution by Assumption 3.

  2. Stochastic Gradient: By Assumption 5, the gradient of f is easy to estimate with respect to x; given a sample ξ, the stochastic gradient estimate at x is denoted G(x, ξ). To update x, the SPDPEG algorithm takes a proximal extra-gradient step using stochastic gradient estimates and suitable step-sizes: a predictor pair (x̂_t, λ̂_t) is computed from (x_t, λ_t) using the noisy gradient G(x_t, ξ_t) in Eqs. (5)–(6), and the next iterates (x_{t+1}, λ_{t+1}) are then computed in Eqs. (7)–(8) using a fresh noisy gradient G(x̂_t, ξ_t′) evaluated at the predictor.
  3. Step-Size η_t: The choice of step-size depends on whether the objective function is strongly convex, and the rate of convergence varies with the step-size rule. Moreover, a sequence of vanishing step-sizes is necessary, since the proposed algorithm does not employ any variance-reduction technique.

  4. Non-Uniformly Averaged Iterates: [2] showed that non-uniformly averaged iterates generated by stochastic algorithms converge in fewer iterations. Inspired by this work, by non-uniformly averaging the iterates of the SPDPEG algorithm and adopting a slightly modified step-size, we establish an accelerated convergence rate of O(1/t) in expectation.
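To make the scheme concrete, here is a minimal runnable sketch of the loop in Algorithm 1 under our own reading and naming (the exact update equations (4)–(8) are given in the paper; this instantiation assumes g(x) = ρ₁‖x‖₁, h(y) = ρ₂‖y‖₁, a linearized augmented-Lagrangian x-step, and a predictor-corrector pair of noisy gradients — an illustration, not the authors' exact method):

    import numpy as np

    def prox_l1(v, gamma):
        # Entrywise soft-thresholding.
        return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)

    def spdpeg_sketch(grad_est, A, rho1, rho2, beta, x0, n_steps, eta0):
        # grad_est(x) must return a fresh noisy estimate of grad f(x),
        # computed on an independently drawn sample at each call.
        x, lam = x0.copy(), np.zeros(A.shape[0])
        x_bar = np.zeros_like(x0)
        for t in range(1, n_steps + 1):
            eta = eta0 / np.sqrt(t)                     # vanishing step-size
            # y-step: proximal mapping of h (closed form by Assumption 3)
            y = prox_l1(A @ x - lam / beta, rho2 / beta)
            # predictor: noisy gradient at x (first sample xi_t)
            v = grad_est(x) - A.T @ lam + beta * A.T @ (A @ x - y)
            x_hat = prox_l1(x - eta * v, eta * rho1)    # proximal mapping of g
            lam_hat = lam - eta * (A @ x_hat - y)
            # corrector: fresh noisy gradient at the predictor (second sample xi_t')
            v_hat = grad_est(x_hat) - A.T @ lam_hat + beta * A.T @ (A @ x_hat - y)
            x = prox_l1(x - eta * v_hat, eta * rho1)
            lam = lam - eta * (A @ x - y)
            x_bar += (x - x_bar) / t                    # uniform averaging
        return x_bar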

3 Main Result

In this section, we present the main results of this paper. For general convex objectives, the uniformly averaged iterates generated by the SPDPEG algorithm converge in expectation at an O(1/√(t)) rate, while for strongly convex objectives, the uniformly and non-uniformly averaged iterates converge in expectation at O(log(t)/t) and O(1/t) rates, respectively. The overall computational complexities follow directly from these rates, since the per-iteration cost consists of the two noisy gradient evaluations and the proximal mappings and is linear in the dimension d of the decision variable. The main theoretical results for the different settings are summarized as follows:

  1. Assuming that f is a general convex objective function, the step-size is η_t = O(1/√(t)), and the iterates are weighted uniformly, the proposed SPDPEG algorithm converges at an O(1/√(t)) rate in expectation.

  2. Assuming that f is a μ-strongly convex objective function, the step-size is η_t = O(1/(μt)), and the iterates are weighted uniformly, the proposed SPDPEG algorithm converges at an O(log(t)/t) rate in expectation.

  3. Assuming that f is a μ-strongly convex objective function, the step-size is η_t = O(1/(μt)), the weight of the k-th iterate is proportional to k (non-uniform averaging), and the dual variables are bounded (this assumption is standard and also adopted in [2]), the proposed SPDPEG algorithm converges at an O(1/t) rate in expectation.

In the above, the constant in the step-size is determined by the Lipschitz constant L, the variance bound σ², and the largest eigenvalue of A^⊤A, with μ set to 0 when f is a general convex objective function.
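To make the non-uniform averaging concrete, a standard weighting consistent with [2] (our notation; the paper's exact weights may differ) is

   x̄_t = Σ_{k=1}^{t} (2k / (t(t+1))) · x_k,   with   Σ_{k=1}^{t} 2k/(t(t+1)) = 1,

which places more weight on later, more accurate iterates than the uniform average x̄_t = (1/t) Σ_{k=1}^{t} x_k.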

We first present the main theoretical result for uniformly averaged iterates under a general convex objective function.

Theorem 7.

Consider the SPDPEG algorithm with uniformly averaged iterates. For any optimal solution, the expected objective error and the expected constraint violation are bounded as in (9) and (10), respectively, and both bounds are of order O(1/√(t)).

Note that this implies that the SPDPEG algorithm converges in expectation at an O(1/√(t)) rate in terms of both the objective error and the constraint violation.

We next present the main theoretical result for uniformly averaged iterates under a strongly convex objective function.

Theorem 8.

Consider the SPDPEG algorithm with uniformly averaged iterates. For any optimal solution, the expected objective error and the expected constraint violation are bounded as in (11) and (12), respectively, and both bounds are of order O(log(t)/t).

Note that this implies that the SPDPEG algorithm converges in expectation at an O(log(t)/t) rate in terms of both the objective error and the constraint violation.

Finally, we present the main theoretical result for non-uniformly averaged iterates under a strongly convex objective function.

Theorem 9.

Consider the SPDPEG algorithm with non-uniformly averaged iterates. For any optimal solution, the expected objective error and the expected constraint violation are bounded as in (13) and (14), respectively, and both bounds are of order O(1/t).

Note that this implies that the SPDPEG algorithm converges in expectation at an O(1/t) rate in terms of both the objective error and the constraint violation.

4 Proof

We first prove a key technical lemma that is central to the proofs of Theorems 7–9.

Lemma 10.

The sequence generated by the SPDPEG algorithm satisfies the inequality (15), where the two error terms appearing in (15) are defined in (16).
Proof.

The first-order optimality condition for updating y_t is given by (17). For any x ∈ X, the first-order optimality conditions for the two x-updates are given respectively by (18) and (19).

Setting the appropriate test points in (18) and in (19), and summing the two resulting inequalities, yields that

which implies that

(20)

Therefore, we get

where the first inequality is obtained by an appropriate choice of the test point in (18) and the second inequality follows from (20). Furthermore, we have

where the error terms are as defined in (16). Substituting the preceding bound into the inequality above, and then summing the resulting inequality with (19), we have

On the other hand, we have

where the last inequality holds since

Finally, combining (17) with the two inequalities above yields (15). ∎

4.1 Proof of Theorem 7

Lemma 11.

Suppose that the iterates are generated by the SPDPEG algorithm, with the relevant quantities as defined in the main paper. For any optimal solution, it holds that

Proof.

By the definitions above, plugging the corresponding quantities into (15) yields that

Moreover, we have

Therefore, we conclude that