Exploiting Strong Convexity from Data with Primal-Dual First-Order Algorithms

03/07/2017 ∙ by Jialei Wang, et al.

We consider empirical risk minimization of linear predictors with convex loss functions. Such problems can be reformulated as convex-concave saddle point problems and are thus well suited to primal-dual first-order algorithms. However, primal-dual algorithms often require explicit strongly convex regularization in order to obtain fast linear convergence, and the required dual proximal mapping may not admit a closed-form or efficient solution. In this paper, we develop both batch and randomized primal-dual algorithms that can exploit strong convexity from data adaptively and are capable of achieving linear convergence even without regularization. We also present dual-free variants of the adaptive primal-dual algorithms that do not require computing the dual proximal mapping, which are especially suitable for logistic regression.


1 Introduction

We consider the problem of regularized empirical risk minimization (ERM) of linear predictors. Let $x_1, \ldots, x_n \in \mathbb{R}^d$ be the feature vectors of $n$ data samples, $\phi_i : \mathbb{R} \to \mathbb{R}$ be a convex loss function associated with the linear prediction $x_i^T w$, for $i = 1, \ldots, n$, and $g : \mathbb{R}^d \to \mathbb{R}$ be a convex regularization function for the predictor $w \in \mathbb{R}^d$. ERM amounts to solving the following convex optimization problem:

$$\min_{w \in \mathbb{R}^d} \; \Big\{ P(w) \;\equiv\; \frac{1}{n} \sum_{i=1}^n \phi_i\big(x_i^T w\big) + g(w) \Big\}. \qquad (1)$$

Examples of the above formulation include many well-known classification and regression problems. For binary classification, each feature vector $x_i$ is associated with a label $y_i \in \{+1, -1\}$. In particular, logistic regression is obtained by setting $\phi_i(z) = \log\big(1 + \exp(-y_i z)\big)$. For linear regression problems, each feature vector $x_i$ is associated with a dependent variable $b_i \in \mathbb{R}$, and $\phi_i(z) = \frac{1}{2}(z - b_i)^2$. Then we get ridge regression with $g(w) = \frac{\lambda}{2}\|w\|_2^2$, and elastic net with $g(w) = \lambda_1 \|w\|_1 + \frac{\lambda_2}{2}\|w\|_2^2$.
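To make the formulation concrete, here is a minimal Python/NumPy sketch of the primal objective $P(w)$ for the logistic and ridge cases above (function and variable names are illustrative only, not from the paper):

```python
import numpy as np

def logistic_erm(w, X, y, lam):
    """P(w) = (1/n) * sum_i log(1 + exp(-y_i * x_i^T w)) + (lam/2)*||w||^2."""
    z = X @ w                                   # linear predictions x_i^T w
    loss = np.mean(np.logaddexp(0.0, -y * z))   # stable log(1 + exp(.))
    return loss + 0.5 * lam * np.dot(w, w)

def ridge_erm(w, X, b, lam):
    """P(w) = (1/(2n)) * sum_i (x_i^T w - b_i)^2 + (lam/2)*||w||^2."""
    r = X @ w - b                               # residuals of the predictions
    return 0.5 * np.mean(r ** 2) + 0.5 * lam * np.dot(w, w)
```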

Let $X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$ be the data matrix. Throughout this paper, we make the following assumptions:

Assumption 1.

The functions $\{\phi_i\}$, $g$ and matrix $X$ satisfy:

  • Each $\phi_i$ is $\delta$-strongly convex and $(1/\gamma)$-smooth, where $\gamma > 0$ and $\delta \ge 0$, and $\gamma\delta \le 1$;

  • $g$ is $\lambda$-strongly convex, where $\lambda \ge 0$;

  • $\lambda + \delta\mu^2 > 0$, where $\mu = \sigma_{\min}(X)/\sqrt{n}$.

The strong convexity and smoothness mentioned above are with respect to the standard Euclidean norm, denoted as $\|\cdot\|$. (See, e.g., Nesterov (2004, Sections 2.1.1 and 2.1.3) for the exact definitions.) Let $R = \max_i \|x_i\|$ and assume $\lambda > 0$; then $\kappa = R^2/(\lambda\gamma)$ is a popular definition of the condition number used in analyzing the complexity of different algorithms. The last condition above means that the primal objective function $P$ is strongly convex, even if $\lambda = 0$.

There has been extensive research in recent years on developing efficient algorithms for solving problem (1). A broad class of randomized algorithms that exploit the finite-sum structure of the ERM problem have emerged as very competitive, both in terms of theoretical complexity and practical performance. They can be put into three categories: primal, dual, and primal-dual.

Primal randomized algorithms work with the ERM problem (1) directly. They are modern versions of randomized incremental gradient methods (e.g., Bertsekas, 2012; Nedic & Bertsekas, 2001) equipped with variance reduction techniques. Each iteration of such algorithms processes only one data point $x_i$, with complexity $O(d)$. They include SAG (Roux et al., 2012), SAGA (Defazio et al., 2014), and SVRG (Johnson & Zhang, 2013; Xiao & Zhang, 2014), which all achieve the iteration complexity $O\big((n + \kappa)\log(1/\epsilon)\big)$ to find an $\epsilon$-optimal solution. In fact, they are capable of exploiting strong convexity from data, meaning that the condition number $\kappa$ in the complexity can be replaced by a more favorable one in which $\lambda$ is replaced by the total strong convexity $\lambda + \delta\mu^2$. This improvement can be achieved without explicit knowledge of $\mu$.

Dual algorithms solve the Fenchel dual of (1) by maximizing

$$D(\alpha) \;=\; \frac{1}{n}\sum_{i=1}^n -\phi_i^*(\alpha_i) \;-\; g^*\Big(-\frac{1}{n} X^T \alpha\Big) \qquad (2)$$

using randomized coordinate ascent algorithms. (Here $\phi_i^*$ and $g^*$ denote the conjugate functions of $\phi_i$ and $g$.) They include SDCA (Shalev-Shwartz & Zhang, 2013), Nesterov (2012) and Richtárik & Takáč (2014). They have the same complexity $O\big((n + \kappa)\log(1/\epsilon)\big)$, but it is hard for them to exploit strong convexity from data.

Primal-dual algorithms solve the convex-concave saddle point problem $\min_w \max_\alpha f(w, \alpha)$, where

$$f(w, \alpha) \;=\; \frac{1}{n}\sum_{i=1}^n \big(\alpha_i\, x_i^T w - \phi_i^*(\alpha_i)\big) + g(w). \qquad (3)$$

In particular, SPDC (Zhang & Xiao, 2015) achieves an accelerated linear convergence rate with iteration complexity $O\big((n + \sqrt{\kappa n})\log(1/\epsilon)\big)$, which is better than the aforementioned non-accelerated complexity when $\kappa > n$. Lan & Zhou (2015) developed dual-free variants of accelerated primal-dual algorithms, but without considering the linear predictor structure in ERM. Balamurugan & Bach (2016) extended SVRG and SAGA to solving saddle point problems.

Accelerated primal and dual randomized algorithms have also been developed. Nesterov (2012), Fercoq & Richtárik (2015) and Lin et al. (2015b) developed accelerated coordinate gradient algorithms, which can be applied to solve the dual problem (2). Allen-Zhu (2016) developed an accelerated variant of SVRG. Acceleration can also be obtained using the Catalyst framework (Lin et al., 2015a). They all achieve the same $O\big((n + \sqrt{\kappa n})\log(1/\epsilon)\big)$ complexity. A common feature of accelerated algorithms is that they require a good estimate of the strong convexity parameter. This makes it hard for them to exploit strong convexity from data, because the minimum singular value of the data matrix $X$ is very hard to estimate in general.

In this paper, we show that primal-dual algorithms are capable of exploiting strong convexity from data if the algorithm parameters (such as step sizes) are set appropriately. While the optimal settings depend on knowledge of the convexity parameter $\mu$ of the data, we develop adaptive variants of primal-dual algorithms that can tune this parameter automatically. Such adaptive schemes rely critically on the ability of primal-dual algorithms to evaluate the primal-dual optimality gap.

A major disadvantage of primal-dual algorithms is that the required dual proximal mapping may not admit a closed-form or efficient solution. We follow the approach of Lan & Zhou (2015) to derive dual-free variants of the primal-dual algorithms customized for ERM problems with the linear predictor structure, and show that they can also exploit strong convexity from data with correct choices of parameters or using an adaptation scheme.

2 Batch primal-dual algorithms

Before diving into randomized primal-dual algorithms, we first consider batch primal-dual algorithms, which exhibit similar properties to their randomized variants. To this end, we consider a “batch” version of the ERM problem (1),

$$\min_{w \in \mathbb{R}^d} \; \big\{ P(w) \;\equiv\; f(Xw) + g(w) \big\}, \qquad (4)$$

where $f : \mathbb{R}^n \to \mathbb{R}$, and make the following assumption:

Assumption 2.

The functions $f$, $g$ and matrix $X$ satisfy:

  • $f$ is $\delta$-strongly convex and $(1/\gamma)$-smooth, where $\gamma > 0$ and $\delta \ge 0$, and $\gamma\delta \le 1$;

  • $g$ is $\lambda$-strongly convex, where $\lambda \ge 0$;

  • $\lambda + \delta\mu^2 > 0$, where $\mu = \sigma_{\min}(X)$.

For exact correspondence with problem (1), we have $f(u) = \frac{1}{n}\sum_{i=1}^n \phi_i(u_i)$ with $u = Xw$. Under Assumption 1, the function $f$ is $(\delta/n)$-strongly convex and $\big(1/(\gamma n)\big)$-smooth, and $f^*$ is $(\gamma n)$-strongly convex and $(n/\delta)$-smooth. However, such correspondences alone are not sufficient to exploit the structure of (1); i.e., substituting them into the batch algorithms of this section will not produce the efficient algorithms for solving problem (1) that we will present in Sections 3 and 4.2. So we do not make such correspondences explicit in this section; rather, we treat them as independent assumptions with the same notation.

0:  parameters $\tau, \sigma, \theta$, initial point $(w^{(0)}, \alpha^{(0)})$ with $\bar w^{(0)} = w^{(0)}$
  for $t = 0, 1, 2, \ldots$ do
      $\alpha^{(t+1)} = \mathrm{prox}_{\sigma f^*}\big(\alpha^{(t)} + \sigma X \bar w^{(t)}\big)$
      $w^{(t+1)} = \mathrm{prox}_{\tau g}\big(w^{(t)} - \tau X^T \alpha^{(t+1)}\big)$
      $\bar w^{(t+1)} = w^{(t+1)} + \theta\big(w^{(t+1)} - w^{(t)}\big)$
  end for
Algorithm 1 Batch Primal-Dual (BPD) Algorithm

Using conjugate functions, we can derive the dual of (4) as

$$\max_{\alpha \in \mathbb{R}^n} \; \big\{ D(\alpha) \;\equiv\; -f^*(\alpha) - g^*\big(-X^T \alpha\big) \big\}, \qquad (5)$$

and the convex-concave saddle point formulation is

$$\min_{w \in \mathbb{R}^d} \max_{\alpha \in \mathbb{R}^n} \; \Big\{ \mathcal{L}(w, \alpha) \;\equiv\; g(w) + \alpha^T X w - f^*(\alpha) \Big\}. \qquad (6)$$

We consider the primal-dual first-order algorithm proposed by Chambolle & Pock (2011, 2016) for solving the saddle point problem (6), which is given as Algorithm 1. Here we call it the batch primal-dual (BPD) algorithm. Assuming that $f$ is smooth and $g$ is strongly convex, Chambolle & Pock (2011, 2016) showed that Algorithm 1 achieves an accelerated linear convergence rate if $\lambda > 0$. However, they did not consider the case where an additional, or the sole, source of strong convexity comes from $f(Xw)$.
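As an illustration, here is a minimal sketch of Algorithm 1 instantiated for the ridge-regression saddle point, where $f(u) = \frac{1}{2}\|u - b\|^2$ (so $\mathrm{prox}_{\sigma f^*}$ has a closed form) and $g(w) = \frac{\lambda}{2}\|w\|^2$; the step sizes tau, sigma, theta are assumed to be supplied, e.g., from Theorem 1 below:

```python
import numpy as np

def bpd_ridge(X, b, lam, tau, sigma, theta, iters=1000):
    """Sketch of the BPD iteration for f(u) = 0.5*||u - b||^2 and
    g(w) = 0.5*lam*||w||^2, where both proximal mappings are closed-form."""
    n, d = X.shape
    w = np.zeros(d)
    w_bar = w.copy()
    alpha = np.zeros(n)
    for _ in range(iters):
        # dual step: alpha = prox_{sigma f*}(alpha + sigma * X @ w_bar),
        # with f*(a) = <a, b> + 0.5*||a||^2, so the prox is a simple shrinkage
        alpha = (alpha + sigma * (X @ w_bar) - sigma * b) / (1.0 + sigma)
        # primal step: w = prox_{tau g}(w - tau * X^T alpha)
        w_new = (w - tau * (X.T @ alpha)) / (1.0 + tau * lam)
        # extrapolation step
        w_bar = w_new + theta * (w_new - w)
        w = w_new
    return w, alpha
```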

In the following theorem, we show how to set the parameters $\tau$, $\sigma$ and $\theta$ to exploit both sources of strong convexity to achieve fast linear convergence.

Theorem 1.

Suppose Assumption 2 holds and $(w^\star, \alpha^\star)$ is the unique saddle point of $\mathcal{L}$ defined in (6). Let $L = \|X\|_2$. If we set the parameters in Algorithm 1 as

(7)

and where

(8)

then we have

where

The proof of Theorem 1 is given in Appendices B and C. Here we give a detailed analysis of the convergence rate. Substituting $\tau$ and $\sigma$ from (7) into the expression for $\theta$ in (8), and assuming $\lambda + \delta\mu^2 > 0$, we have

Since the overall condition number of the problem is $L^2/\big(\gamma(\lambda + \delta\mu^2)\big)$, it is clear that $\theta$ is an accelerated convergence rate. Next we examine the rate in two special cases.

The case of $\lambda > 0$ but $\delta = 0$.

In this case, we have $\delta\mu^2 = 0$, and thus

Therefore, this is indeed an accelerated convergence rate, recovering the result of Chambolle & Pock (2011, 2016).

The case of $\delta > 0$ but $\lambda = 0$.

In this case, we have $\lambda = 0$ and $\delta\mu^2 > 0$, and

Notice that $1/(\gamma\delta)$ is the condition number of $f$. Next we examine how the rate varies with $\gamma\delta$.

  • If $1/(\gamma\delta) \ge L^2/\mu^2$, meaning $f$ is badly conditioned, then

    Because the overall condition number is $\frac{1}{\gamma\delta}\cdot\frac{L^2}{\mu^2}$, this is an accelerated linear rate, and so is $\theta$.

  • If $1/(\gamma\delta) \le L^2/\mu^2$, meaning $f$ is mildly conditioned, then

    This represents a half-accelerated rate, because the overall condition number is $\frac{1}{\gamma\delta}\cdot\frac{L^2}{\mu^2}$.

  • If $\gamma\delta = 1$, i.e., $f$ is a simple quadratic function, then

    This rate does not have acceleration, because the overall condition number is $L^2/\mu^2$.

In summary, the extent of acceleration in the dominating factor (which determines $\theta$) depends on the relative size of $1/(\gamma\delta)$ and $L^2/\mu^2$, i.e., the relative conditioning between the function $f$ and the matrix $X$. In general, we have full acceleration if $1/(\gamma\delta) \ge L^2/\mu^2$. The theory predicts that the acceleration degrades as the function $f$ becomes better conditioned. However, in our numerical experiments, we often observe acceleration even when $\gamma\delta$ gets close to 1.

As explained in Chambolle & Pock (2011), Algorithm 1 is equivalent to a preconditioned ADMM. Deng & Yin (2016) characterized conditions for ADMM to obtain linear convergence without both parts of the objective function being strongly convex, but they did not derive a convergence rate for this case.

2.1 Adaptive batch primal-dual algorithms

0:  problem constants $L$, $\gamma$, $\delta$ and $\lambda$, an initial estimate $\hat\mu$, initial
    point $(w^{(0)}, \alpha^{(0)})$ with $\bar w^{(0)} = w^{(0)}$, and adaptation period $T$.
  Compute $\tau$, $\sigma$, and $\theta$ as in (7) and (8) using $\hat\mu$
  for $t = 0, 1, 2, \ldots$ do
      $\alpha^{(t+1)} = \mathrm{prox}_{\sigma f^*}\big(\alpha^{(t)} + \sigma X \bar w^{(t)}\big)$
      $w^{(t+1)} = \mathrm{prox}_{\tau g}\big(w^{(t)} - \tau X^T \alpha^{(t+1)}\big)$
      $\bar w^{(t+1)} = w^{(t+1)} + \theta\big(w^{(t+1)} - w^{(t)}\big)$
      if $\mathrm{mod}(t+1, T) = 0$ then
           $(\tau, \sigma, \theta, \hat\mu) \leftarrow \text{BPD-Adapt}\big(\hat\mu, T, P^{(t+1)}, D^{(t+1)}, P^{(t+1-T)}, D^{(t+1-T)}\big)$
      end if
  end for
Algorithm 2 Adaptive Batch Primal-Dual (Ada-BPD)

In practice, it is often very hard to obtain good estimates of the problem-dependent constants, especially $\mu = \sigma_{\min}(X)$, in order to apply the algorithmic parameters specified in Theorem 1. Here we explore heuristics that enable adaptive tuning of such parameters, which often leads to much improved performance in practice.

A key observation is that the convergence rate of the BPD algorithm changes monotonically with the overall strong convexity parameter $\lambda + \delta\mu^2$, regardless of the extent of acceleration. In other words, the larger $\lambda + \delta\mu^2$ is, the faster the convergence. Therefore, if we can monitor the progress of the convergence and compare it with the convergence rate predicted by Theorem 1, then we can adjust the algorithmic parameters to exploit the fastest possible convergence. More specifically, if the observed convergence is slower than the predicted rate, then we should reduce the estimate of $\mu$; if the observed convergence is faster than the predicted rate, then we can increase the estimate for even faster convergence.

0:  previous estimate $\hat\mu$, adaptation period $T$, primal and
    dual objective values $P^{(t)}, D^{(t)}$ and $P^{(t-T)}, D^{(t-T)}$
  if $P^{(t)} - D^{(t)} \le \theta^T \big(P^{(t-T)} - D^{(t-T)}\big)$ then
      $\hat\mu \leftarrow 2\hat\mu$
  else
      $\hat\mu \leftarrow \hat\mu/2$
  end if
  Compute $\tau$, $\sigma$, and $\theta$ as in (7) and (8) using $\hat\mu$
  Output: new parameters $(\tau, \sigma, \theta, \hat\mu)$
Algorithm 3 BPD-Adapt (simple heuristic)

We formalize the above reasoning in an Adaptive BPD (Ada-BPD) algorithm, described in Algorithm 2. This algorithm maintains an estimate $\hat\mu$ of the true constant $\mu$, and adjusts it every $T$ iterations. We use $P^{(t)}$ and $D^{(t)}$ to represent the primal and dual objective values at $w^{(t)}$ and $\alpha^{(t)}$, respectively. We give two implementations of the tuning procedure BPD-Adapt:

  • Algorithm 3 is a simple heuristic for tuning the estimate $\hat\mu$, where the increasing and decreasing factor $2$ can be changed to other values larger than $1$ (see also the sketch after this list);

  • Algorithm 4 is a more robust heuristic. It does not rely on the specific convergence rate established in Theorem 1. Instead, it simply compares the current estimate of the objective reduction rate $\tilde\rho$ with the previous estimate ($\hat\rho$). It also specifies a non-tuning range of changes in $\tilde\rho$, specified by the interval $\big(c_{\mathrm{dec}}\,\hat\rho, \; c_{\mathrm{inc}}\,\hat\rho\big)$.

One can also devise more sophisticated schemes; e.g., once the estimate $\hat\mu$ has stabilized, no more tuning is necessary.
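For concreteness, a minimal sketch of the simple heuristic (Algorithm 3) follows; the factor 2 and the callback compute_params, which re-evaluates (7) and (8) from the estimate, are illustrative assumptions:

```python
def bpd_adapt_simple(mu_hat, theta, T, gap_now, gap_prev, compute_params):
    """One adaptation step: compare the observed duality-gap reduction over
    the last T iterations with the reduction theta**T predicted by Theorem 1."""
    if gap_now <= theta ** T * gap_prev:
        mu_hat *= 2.0   # at least as fast as predicted: try a larger estimate
    else:
        mu_hat /= 2.0   # slower than predicted: the estimate was too optimistic
    tau, sigma, theta = compute_params(mu_hat)  # re-evaluate (7) and (8)
    return tau, sigma, theta, mu_hat
```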

The capability of accessing both the primal and dual objective values allows primal-dual algorithms to obtain good estimates of the convergence rate, which enables effective tuning heuristics. Automatic tuning of primal-dual algorithms has also been studied by, e.g., Malitsky & Pock (2016) and Goldstein et al. (2013), but with different goals.

Finally, we note that Theorem 1 only establishes the convergence rate for the distance to the optimal point and the quantity $\mathcal{L}(w^{(t)}, \alpha^\star) - \mathcal{L}(w^\star, \alpha^{(t)})$, which is not quite the duality gap $P(w^{(t)}) - D(\alpha^{(t)})$. Nevertheless, the same convergence rate can also be established for the duality gap (see Zhang & Xiao, 2015, Section 2.2), which can be used to better justify the adaptation procedure.
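For the ridge instance used earlier, the duality gap that drives the adaptation can be evaluated directly from (4) and (5); a sketch with $f^*(\alpha) = \langle \alpha, b\rangle + \frac{1}{2}\|\alpha\|^2$ and $g^*(v) = \|v\|^2/(2\lambda)$, valid for $\lambda > 0$ (names are illustrative):

```python
import numpy as np

def duality_gap_ridge(w, alpha, X, b, lam):
    """P(w) - D(alpha) for f(u) = 0.5*||u - b||^2, g(w) = 0.5*lam*||w||^2."""
    primal = 0.5 * np.sum((X @ w - b) ** 2) + 0.5 * lam * np.dot(w, w)
    v = X.T @ alpha
    # D(alpha) = -f*(alpha) - g*(-X^T alpha)
    dual = -np.dot(alpha, b) - 0.5 * np.dot(alpha, alpha) \
           - np.dot(v, v) / (2.0 * lam)
    return primal - dual
```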

0:  previous rate estimate $\hat\rho$, previous estimate $\hat\mu$, period $T$,
    constants $c_{\mathrm{dec}} < 1$ and $c_{\mathrm{inc}} > 1$, and $P^{(t)}, D^{(t)}, P^{(t-T)}, D^{(t-T)}$
  Compute new rate estimate $\tilde\rho = \big(P^{(t)} - D^{(t)}\big)\big/\big(P^{(t-T)} - D^{(t-T)}\big)$
  if $\tilde\rho \le c_{\mathrm{dec}}\,\hat\rho$ then
      $\hat\mu \leftarrow 2\hat\mu$,   $\hat\rho \leftarrow \tilde\rho$
  else if $\tilde\rho \ge c_{\mathrm{inc}}\,\hat\rho$ then
      $\hat\mu \leftarrow \hat\mu/2$,  $\hat\rho \leftarrow \tilde\rho$
  else
      keep $\hat\mu$ and $\hat\rho$ unchanged
  end if
  Compute $\tau$ and $\sigma$ as in (7) using $\hat\mu$
  Compute $\theta$ using (8), or set $\theta = 1$
  Output: new parameters $(\tau, \sigma, \theta, \hat\mu, \hat\rho)$
Algorithm 4 BPD-Adapt (robust heuristic)

3 Randomized primal-dual algorithm

In this section, we come back to the ERM problem (1), which has a finite-sum structure that allows the development of randomized primal-dual algorithms. In particular, we extend the stochastic primal-dual coordinate (SPDC) algorithm (Zhang & Xiao, 2015) to exploit strong convexity from data in order to achieve a faster convergence rate.

First, we show that, by setting algorithmic parameters appropriately, the original SPDC algorithm can directly benefit from strong convexity in the loss function. We note that the SPDC algorithm is a special case of the Adaptive SPDC (Ada-SPDC) algorithm presented in Algorithm 5, obtained by setting the adaptation period $T = \infty$ (not performing any adaptation). The following theorem is proved in Appendix E.

Theorem 2.

Suppose Assumption 1 holds. Let $(w^\star, \alpha^\star)$ be the saddle point of the function $f$ defined in (3), and let $R = \max_i \|x_i\|$. If we set $T = \infty$ in Algorithm 5 (no adaptation) and let

(9)

and where

(10)

then we have

where the expectation is taken with respect to the history of random indices drawn at each iteration.

0:  parameters $\tau, \sigma, \theta$, initial point $(w^{(0)}, \alpha^{(0)})$,
    and adaptation period $T$.
  Set $u^{(0)} = \frac{1}{n} X^T \alpha^{(0)}$ and $\bar w^{(0)} = w^{(0)}$
  for $t = 0, 1, 2, \ldots$ do
      pick $k \in \{1, \ldots, n\}$ uniformly at random
      for $i = 1, \ldots, n$ do
           if $i = k$ then
                $\alpha_k^{(t+1)} = \mathrm{prox}_{\sigma \phi_k^*}\big(\alpha_k^{(t)} + \sigma\, x_k^T \bar w^{(t)}\big)$
           else
                $\alpha_i^{(t+1)} = \alpha_i^{(t)}$
           end if
      end for
      $w^{(t+1)} = \mathrm{prox}_{\tau g}\Big(w^{(t)} - \tau\big(u^{(t)} + (\alpha_k^{(t+1)} - \alpha_k^{(t)})\, x_k\big)\Big)$
      $u^{(t+1)} = u^{(t)} + \frac{1}{n}\big(\alpha_k^{(t+1)} - \alpha_k^{(t)}\big)\, x_k$
      $\bar w^{(t+1)} = w^{(t+1)} + \theta\big(w^{(t+1)} - w^{(t)}\big)$
      if $\mathrm{mod}(t+1, Tn) = 0$ then
           $(\tau, \sigma, \theta) \leftarrow \text{SPDC-Adapt}\big(\{P^{(s)}, D^{(s)}\}\big)$
      end if
  end for
Algorithm 5 Adaptive SPDC (Ada-SPDC)
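A sketch of one Ada-SPDC iteration (omitting the adaptation branch) for squared losses $\phi_i(z) = \frac{1}{2}(z - b_i)^2$, so that $\mathrm{prox}_{\sigma\phi_k^*}$ is closed-form, is given below; the vector u maintains $\frac{1}{n}X^T\alpha$ so each iteration costs $O(d)$ (names and signatures are illustrative):

```python
import numpy as np

def spdc_step(w, w_bar, alpha, u, X, b, lam, tau, sigma, theta, rng):
    """One SPDC iteration for phi_i(z) = 0.5*(z - b_i)^2, g(w) = 0.5*lam*||w||^2."""
    n = X.shape[0]
    k = rng.integers(n)                      # pick k uniformly at random
    # dual coordinate step: prox_{sigma phi_k*}(alpha_k + sigma * x_k^T w_bar)
    a_new = (alpha[k] + sigma * (X[k] @ w_bar) - sigma * b[k]) / (1.0 + sigma)
    d_k = a_new - alpha[k]
    alpha[k] = a_new
    # primal step with the variance-corrected direction u + (a_new - alpha_k) x_k
    w_new = (w - tau * (u + d_k * X[k])) / (1.0 + tau * lam)
    u += (d_k / n) * X[k]                    # maintain u = (1/n) * X^T alpha
    w_bar = w_new + theta * (w_new - w)      # extrapolation
    return w_new, w_bar, alpha, u

# usage: rng = np.random.default_rng(0), then call spdc_step repeatedly
```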

Below we give a detailed discussion of the expected convergence rate established in Theorem 2.

The case of $\lambda > 0$ but $\delta = 0$.

In this case we have $\delta\mu^2 = 0$, and

Hence these recover the parameters and convergence rate of the standard SPDC (Zhang & Xiao, 2015).

The case of $\delta > 0$ but $\lambda = 0$.

In this case we have $\lambda = 0$, and

Since the objective is smooth and strongly convex, with all strong convexity coming from the data, this can be an accelerated rate depending on the problem parameters. We consider different situations:

  • If , then we have which is an accelerated rate. So is .

  • If and , then which represents an accelerated rate. The iteration complexity of SPDC is , which is better than that of SVRG in this case, which is .

  • If and , then we get This is a half-accelerated rate, because in this case SVRG would require iterations, while iteration complexity here is .

  • If and , meaning the $\phi_i$’s are well conditioned, then we get which is a non-accelerated rate. The corresponding iteration complexity is the same as that of SVRG.

3.1 Parameter adaptation for SPDC

The SPDC-Adapt procedure called in Algorithm 5 follows the same logic as the batch adaptation schemes in Algorithms 3 and 4, and we omit the details here. One thing we emphasize here is that the adaptation period $T$ is in terms of epochs, or number of passes over the data. In addition, we only compute the primal and dual objective values after each pass or every few passes, because computing them exactly usually requires a full pass over the data.

Another important issue is that, unlike the batch case where the duality gap usually decreases monotonically, the duality gap for randomized algorithms can fluctuate wildly. So instead of using only the two end values $P^{(t-T)} - D^{(t-T)}$ and $P^{(t)} - D^{(t)}$, we can use more points to estimate the convergence rate through a linear regression. Suppose the primal-dual values at the end of each of the past $m+1$ passes are

$$\big\{\, P^{(s)}, \; D^{(s)} \;:\; s = 0, 1, \ldots, m \,\big\},$$

and we need to estimate $\rho$ (the rate per pass) such that

$$P^{(s)} - D^{(s)} \;\approx\; \big(P^{(0)} - D^{(0)}\big)\, \rho^{\,s}, \qquad s = 1, \ldots, m.$$

We can turn this into a linear regression problem after taking logarithms, and obtain the estimate of $\rho$ as the exponential of the slope of the least-squares fit of $\log\big(P^{(s)} - D^{(s)}\big)$ against $s$. The rest of the adaptation procedure can follow the robust scheme in Algorithm 4. In practice, we can compute the primal-dual values more sporadically, say every few passes, and modify the regression accordingly.
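A minimal sketch of this rate estimation, fitting $\log(P^{(s)} - D^{(s)})$ against the pass index by least squares (the helper name is illustrative):

```python
import numpy as np

def estimate_rate(gaps):
    """Fit gap_s ~ gap_0 * rho**s by least squares on log(gap_s);
    `gaps` holds the (positive) duality gaps recorded after each pass."""
    s = np.arange(len(gaps))
    slope, _intercept = np.polyfit(s, np.log(gaps), 1)
    return float(np.exp(slope))   # estimated rate rho per pass

# e.g., estimate_rate([1.0, 0.5, 0.26, 0.12]) is roughly 0.5
```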

4 Dual-free Primal-dual algorithms

Compared with primal algorithms, one major disadvantage of primal-dual algorithms is the requirement of computing the proximal mapping of the dual function $f^*$ or $\phi_i^*$, which may not admit a closed-form solution or efficient computation. This is especially the case for logistic regression, one of the most popular loss functions used in classification.

Lan & Zhou (2015) developed “dual-free” variants of primal-dual algorithms that avoid computing the dual proximal mapping. Their main technique is to replace the Euclidean distance in the dual proximal mapping with a Bregman divergence defined over the dual loss function itself. We show how to apply this approach to solve the structured ERM problems considered in this paper. They can also exploit strong convexity from data if the algorithmic parameters are set appropriately or adapted automatically.

4.1 Dual-free BPD algorithm

0:  parameters $\tau, \sigma, \theta$, initial point $(w^{(0)}, v^{(0)})$
  Set $\alpha^{(0)} = \nabla f\big(v^{(0)}\big)$ and $\bar w^{(0)} = w^{(0)}$
  for $t = 0, 1, 2, \ldots$ do
      $v^{(t+1)} = \big(v^{(t)} + \sigma X \bar w^{(t)}\big)/(1 + \sigma)$,   $\alpha^{(t+1)} = \nabla f\big(v^{(t+1)}\big)$
      $w^{(t+1)} = \mathrm{prox}_{\tau g}\big(w^{(t)} - \tau X^T \alpha^{(t+1)}\big)$
      $\bar w^{(t+1)} = w^{(t+1)} + \theta\big(w^{(t+1)} - w^{(t)}\big)$
  end for
Algorithm 6 Dual-Free BPD Algorithm

First, we consider the batch setting. We replace the dual proximal mapping (computing $\alpha^{(t+1)}$) in Algorithm 1 with

$$\alpha^{(t+1)} \;=\; \arg\min_{\alpha \in \mathbb{R}^n} \Big\{ f^*(\alpha) - \alpha^T X \bar w^{(t)} + \frac{1}{\sigma} D\big(\alpha, \alpha^{(t)}\big) \Big\}, \qquad (11)$$

where $D$ is the Bregman divergence of a strictly convex kernel function $h$, defined as

$$D(\alpha, \beta) \;=\; h(\alpha) - h(\beta) - \big\langle \nabla h(\beta), \, \alpha - \beta \big\rangle.$$

Algorithm 1 is obtained in the Euclidean setting with $h(\alpha) = \frac{1}{2}\|\alpha\|^2$ and $D(\alpha, \beta) = \frac{1}{2}\|\alpha - \beta\|^2$. While our convergence results would apply for an arbitrary Bregman divergence, we only focus on the case of using $f^*$ itself as the kernel, because this allows us to compute $\alpha^{(t+1)}$ in (11) very efficiently. The following lemma explains the details (cf. Lan & Zhou, 2015, Lemma 1).

Lemma 1.

Let the kernel $h \equiv f^*$ in the Bregman divergence $D$. If we construct a sequence of vectors $\{v^{(t)}\}$ such that $\alpha^{(0)} = \nabla f\big(v^{(0)}\big)$ and, for all $t \ge 0$,

$$v^{(t+1)} \;=\; \frac{v^{(t)} + \sigma X \bar w^{(t)}}{1 + \sigma}, \qquad (12)$$

then the solution to problem (11) is $\alpha^{(t+1)} = \nabla f\big(v^{(t+1)}\big)$.

Proof.

Suppose $v^{(t)} = \nabla f^*\big(\alpha^{(t)}\big)$ (true for $t = 0$), then

$$D\big(\alpha, \alpha^{(t)}\big) \;=\; f^*(\alpha) - f^*\big(\alpha^{(t)}\big) - \big\langle v^{(t)}, \, \alpha - \alpha^{(t)} \big\rangle.$$

The solution to (11) can be written as

$$\alpha^{(t+1)} \;=\; \arg\min_{\alpha} \Big\{ \Big(1 + \tfrac{1}{\sigma}\Big) f^*(\alpha) - \Big\langle X \bar w^{(t)} + \tfrac{1}{\sigma} v^{(t)}, \, \alpha \Big\rangle \Big\} \;=\; \nabla f\bigg( \frac{v^{(t)} + \sigma X \bar w^{(t)}}{1 + \sigma} \bigg) \;=\; \nabla f\big(v^{(t+1)}\big),$$

where in the last equality we used the property of the conjugate function, $\nabla f = (\nabla f^*)^{-1}$, which holds when $f$ is strongly convex and smooth. Moreover, $v^{(t+1)} = \nabla f^*\big(\alpha^{(t+1)}\big)$, which completes the proof. ∎

According to Lemma 1, we only need to provide an initial point $v^{(0)}$ such that $\alpha^{(0)} = \nabla f\big(v^{(0)}\big)$ is easy to compute. We do not need to compute $\nabla f^*\big(\alpha^{(t)}\big)$ directly for any $t \ge 1$, because it can be updated as in (12). Consequently, we can update $\alpha^{(t+1)}$ in the BPD algorithm using the gradient $\nabla f\big(v^{(t+1)}\big)$, without the need of the dual proximal mapping. The resulting dual-free algorithm is given in Algorithm 6.
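To illustrate why this is attractive for logistic regression, here is a sketch of one iteration of Algorithm 6 with $f(u) = \sum_i \log\big(1 + \exp(-y_i u_i)\big)$ and $g(w) = \frac{\lambda}{2}\|w\|^2$: only the gradient $\nabla f$ is ever evaluated, and no dual proximal mapping is needed (a hypothetical instantiation for illustration, not the paper's experimental code):

```python
import numpy as np

def grad_logistic(v, y):
    """Gradient of f(u) = sum_i log(1 + exp(-y_i * u_i)) evaluated at u = v."""
    return -y / (1.0 + np.exp(y * v))

def dual_free_bpd_step(w, w_bar, v, X, y, lam, tau, sigma, theta):
    """One dual-free BPD iteration: alpha = grad f(v), no dual prox needed."""
    v = (v + sigma * (X @ w_bar)) / (1.0 + sigma)   # update (12)
    alpha = grad_logistic(v, y)                     # alpha = grad f(v)
    w_new = (w - tau * (X.T @ alpha)) / (1.0 + tau * lam)
    w_bar = w_new + theta * (w_new - w)             # extrapolation
    return w_new, w_bar, v
```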

Lan & Zhou (2015) considered a general setting which does not possess the linear predictor structure we focus on in this paper, and assumed that only the regularization  is strongly convex. Our following result shows that dual-free primal-dual algorithms can also exploit strong convexity from data with appropriate algorithmic parameters.

Theorem 3.

Suppose Assumption 2 holds and let $(w^\star, \alpha^\star)$ be the unique saddle point of $\mathcal{L}$ defined in (6). If we set the parameters in Algorithm 6 as

(13)

and where

(14)

then we have

where

Theorem 3 is proved in Appendices B and D. Assuming $\lambda + \delta\mu^2 > 0$, we have