1 Introduction
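We consider the problem of regularized empirical risk minimization (ERM) of linear predictors. Let $a_1, \ldots, a_n \in \mathbb{R}^d$ be the feature vectors of $n$ data samples, $\phi_i : \mathbb{R} \to \mathbb{R}$ be a convex loss function associated with the linear prediction $a_i^T x$, for $i = 1, \ldots, n$, and $g : \mathbb{R}^d \to \mathbb{R}$ be a convex regularization function for the predictor $x \in \mathbb{R}^d$. ERM amounts to solving the following convex optimization problem:
$$\min_{x \in \mathbb{R}^d} \Big\{ P(x) := \frac{1}{n} \sum_{i=1}^n \phi_i(a_i^T x) + g(x) \Big\} \tag{1}$$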
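Examples of the above formulation include many well-known classification and regression problems. For binary classification, each feature vector $a_i$ is associated with a label $b_i \in \{\pm 1\}$. In particular, logistic regression is obtained by setting $\phi_i(z) = \log(1 + \exp(-b_i z))$. For linear regression problems, each feature vector $a_i$ is associated with a dependent variable $b_i \in \mathbb{R}$, and $\phi_i(z) = \frac{1}{2}(z - b_i)^2$. Then we get ridge regression with $g(x) = \frac{\lambda}{2}\|x\|_2^2$, and elastic net with $g(x) = \lambda_1 \|x\|_1 + \frac{\lambda_2}{2}\|x\|_2^2$.
Let $A = [a_1, \ldots, a_n]^T \in \mathbb{R}^{n \times d}$ be the data matrix. Throughout this paper, we make the following assumptions: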
Assumption 1.
The functions , and matrix satisfy:

Each is strongly convex and smooth where and , and ;

is strongly convex where ;

, where .
The strong convexity and smoothness mentioned above are with respect to the standard Euclidean norm, denoted as . (See, e.g., Nesterov (2004, Sections 2.1.1 and 2.1.3) for the exact definitions.) Let and assuming , then is a popular definition of condition number for analyzing complexities of different algorithms. The last condition above means that the primal objective function is strongly convex, even if .
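As a concrete illustration of problem (1), the following minimal sketch evaluates a regularized ERM objective for logistic loss with a squared Euclidean (ridge) regularizer, and computes the smallest singular value of the data matrix, the quantity that governs how much strong convexity the data can contribute. The function name `erm_objective`, the regularization weight `lam`, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def erm_objective(A, b, x, lam):
    """Logistic-loss ERM objective with ridge regularization (illustrative).

    Computes (1/n) * sum_i log(1 + exp(-b_i * a_i^T x)) + (lam / 2) * ||x||^2,
    where the rows of A are the feature vectors a_i and b holds labels in {-1, +1}.
    """
    margins = b * (A @ x)
    # logaddexp(0, -m) is a numerically stable log(1 + exp(-m))
    loss = np.mean(np.logaddexp(0.0, -margins))
    return loss + 0.5 * lam * np.dot(x, x)

# Synthetic data (hypothetical sizes and seed)
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = np.where(rng.standard_normal(20) > 0, 1.0, -1.0)
lam = 0.1

p_zero = erm_objective(A, b, np.zeros(5), lam)  # every margin is 0 at x = 0
sigma_min = np.linalg.svd(A, compute_uv=False).min()
```

At `x = 0` every margin is zero, so the objective reduces to `log(2)`, which gives a quick sanity check of the implementation.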
There have been extensive research activities in recent years on developing efficient algorithms for solving problem (1). A broad class of randomized algorithms that exploit the finite sum structure in the ERM problem has emerged as very competitive both in terms of theoretical complexity and practical performance. They can be put into three categories: primal, dual, and primal-dual.
Primal randomized algorithms work with the ERM problem (1) directly. They are modern versions of randomized incremental gradient methods (e.g., Bertsekas, 2012; Nedic & Bertsekas, 2001) equipped with variance reduction techniques. Each iteration of such algorithms processes only one data point with complexity . They include SAG (Roux et al., 2012), SAGA (Defazio et al., 2014), and SVRG (Johnson & Zhang, 2013; Xiao & Zhang, 2014), which all achieve the iteration complexity to find an optimal solution. In fact, they are capable of exploiting the strong convexity from data, meaning that the condition number in the complexity can be replaced by the more favorable one . This improvement can be achieved without explicit knowledge of from data.
Dual algorithms solve the Fenchel dual of (1) by maximizing
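$$D(y) := \frac{1}{n} \sum_{i=1}^n -\phi_i^*(y_i) - g^*\Big(-\frac{1}{n} \sum_{i=1}^n y_i a_i\Big) \tag{2}$$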
using randomized coordinate ascent algorithms. (Here $\phi_i^*$ and $g^*$ denote the conjugate functions of $\phi_i$ and $g$.) They include SDCA (Shalev-Shwartz & Zhang, 2013), Nesterov (2012) and Richtárik & Takáč (2014). They have the same complexity , but it is hard for them to exploit strong convexity from data.
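To make the primal-dual relationship concrete, here is a small numerical sketch for ridge regression, assuming the squared loss $\phi_i(z) = \frac{1}{2}(z - b_i)^2$ and the regularizer $g(x) = \frac{\lambda}{2}\|x\|^2$, whose conjugates are available in closed form. It further assumes the dual objective has the standard Fenchel form $D(y) = -\frac{1}{n}\sum_i \phi_i^*(y_i) - g^*(-\frac{1}{n}\sum_i y_i a_i)$. The function names and the synthetic data are illustrative.

```python
import numpy as np

def primal(A, b, x, lam):
    """P(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2 + (lam / 2) * ||x||^2."""
    n = A.shape[0]
    r = A @ x - b
    return 0.5 * (r @ r) / n + 0.5 * lam * (x @ x)

def dual(A, b, y, lam):
    """D(y) = -(1/n) * sum_i phi_i^*(y_i) - g^*(-(1/n) * A^T y)."""
    n = A.shape[0]
    conj_loss = 0.5 * (y @ y) + b @ y      # sum of phi_i^*(y_i) = 0.5 * y_i^2 + b_i * y_i
    v = -(A.T @ y) / n
    return -conj_loss / n - (v @ v) / (2.0 * lam)  # g^*(v) = ||v||^2 / (2 * lam)

rng = np.random.default_rng(1)
n, d, lam = 30, 4, 0.5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Closed-form primal solution and the matching dual point y_i = phi_i'(a_i^T x)
x_star = np.linalg.solve(A.T @ A / n + lam * np.eye(d), A.T @ b / n)
y_star = A @ x_star - b
gap_opt = primal(A, b, x_star, lam) - dual(A, b, y_star, lam)

# Weak duality at arbitrary points: the gap is never negative
x_any = rng.standard_normal(d)
y_any = rng.standard_normal(n)
gap_any = primal(A, b, x_any, lam) - dual(A, b, y_any, lam)
```

The gap vanishes (up to rounding) at the optimal pair and is nonnegative elsewhere, which is the property that makes the duality gap usable as a computable progress measure.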
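Primal-dual algorithms solve the convex-concave saddle point problem $\min_x \max_y L(x, y)$, where
$$L(x, y) := \frac{1}{n} \sum_{i=1}^n \big( y_i \langle a_i, x \rangle - \phi_i^*(y_i) \big) + g(x) \tag{3}$$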
In particular, SPDC (Zhang & Xiao, 2015) achieves an accelerated linear convergence rate with iteration complexity , which is better than the aforementioned non-accelerated complexity when . Lan & Zhou (2015) developed dual-free variants of accelerated primal-dual algorithms, but without considering the linear predictor structure in ERM. Balamurugan & Bach (2016) extended SVRG and SAGA to solving saddle point problems.
Accelerated primal and dual randomized algorithms have also been developed. Nesterov (2012), Fercoq & Richtárik (2015) and Lin et al. (2015b) developed accelerated coordinate gradient algorithms, which can be applied to solve the dual problem (2). Allen-Zhu (2016) developed an accelerated variant of SVRG. Acceleration can also be obtained using the Catalyst framework (Lin et al., 2015a). They all achieve the same complexity. A common feature of accelerated algorithms is that they require a good estimate of the strong convexity parameter. This makes it hard for them to exploit strong convexity from data because the minimum singular value of the data matrix is very hard to estimate in general.
In this paper, we show that primal-dual algorithms are capable of exploiting strong convexity from data if the algorithm parameters (such as step sizes) are set appropriately. While this optimal setting depends on knowledge of the convexity parameter from the data, we develop adaptive variants of primal-dual algorithms that can tune the parameter automatically. Such adaptive schemes rely critically on the capability of primal-dual algorithms to evaluate the primal-dual optimality gap.
A major disadvantage of primal-dual algorithms is that the required dual proximal mapping may not admit a closed-form or efficient solution. We follow the approach of Lan & Zhou (2015) to derive dual-free variants of the primal-dual algorithms customized for ERM problems with the linear predictor structure, and show that they can also exploit strong convexity from data with correct choices of parameters or using an adaptation scheme.
2 Batch primal-dual algorithms
Before diving into randomized primal-dual algorithms, we first consider batch primal-dual algorithms, which exhibit properties similar to those of their randomized variants. To this end, we consider a “batch” version of the ERM problem (1),
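$$\min_{x \in \mathbb{R}^d} \big\{ P(x) := f(Ax) + g(x) \big\} \tag{4}$$
where $A \in \mathbb{R}^{n \times d}$, and make the following assumption: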
Assumption 2.
The functions , and matrix satisfy:

is strongly convex and smooth where and , and ;

is strongly convex where ;

, where .
For exact correspondence with problem (1), we have with . Under Assumption 1, the function is strongly convex and smooth, and is strongly convex and smooth. However, such correspondences alone are not sufficient to exploit the structure of (1), i.e., substituting them into the batch algorithms of this section will not produce the efficient algorithms for solving problem (1) that we will present in Sections 3 and 4.2. So we do not make such correspondences explicit in this section. Rather, we treat Assumptions 1 and 2 as independent assumptions that share the same notation.
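Using conjugate functions, we can derive the dual of (4) as
$$\max_{y \in \mathbb{R}^n} \big\{ D(y) := -f^*(y) - g^*(-A^T y) \big\} \tag{5}$$
and the convex-concave saddle point formulation is
$$\min_{x \in \mathbb{R}^d} \max_{y \in \mathbb{R}^n} \big\{ L(x, y) := g(x) + y^T A x - f^*(y) \big\} \tag{6}$$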
We consider the primal-dual first-order algorithm proposed by Chambolle & Pock (2011, 2016) for solving the saddle point problem (6), which is given as Algorithm 1. Here we call it the batch primal-dual (BPD) algorithm. Assuming that is smooth and is strongly convex, Chambolle & Pock (2011, 2016) showed that Algorithm 1 achieves an accelerated linear convergence rate if . However, they did not consider the case where an additional or the sole source of strong convexity comes from .
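For orientation, the sketch below runs a Chambolle-Pock-style primal-dual iteration on a batch ridge regression instance, assuming $f(z) = \frac{1}{2}\|z - b\|^2$ and $g(x) = \frac{\lambda}{2}\|x\|^2$ so that both proximal mappings have closed forms. The step sizes (`tau`, `sigma`) and extrapolation weight (`theta`) used here are a generic conservative choice satisfying $\tau \sigma \|A\|^2 < 1$; they are an assumption of this example and are not the parameter settings of Theorem 1. The function name `bpd_ridge` is hypothetical.

```python
import numpy as np

def bpd_ridge(A, b, lam, num_iters=5000):
    """Primal-dual iteration for min_x 0.5 * ||A x - b||^2 + (lam / 2) * ||x||^2."""
    n, d = A.shape
    L = np.linalg.norm(A, 2)          # spectral norm of A
    tau = sigma = 0.9 / L             # generic step sizes with tau * sigma * L^2 < 1
    theta = 1.0                       # extrapolation weight
    x = np.zeros(d)
    x_bar = x.copy()
    y = np.zeros(n)
    for _ in range(num_iters):
        # Dual proximal step with f^*(y) = 0.5 * ||y||^2 + b^T y
        y = (y + sigma * (A @ x_bar - b)) / (1.0 + sigma)
        # Primal proximal step with g(x) = (lam / 2) * ||x||^2
        x_new = (x - tau * (A.T @ y)) / (1.0 + tau * lam)
        # Extrapolation of the primal iterate
        x_bar = x_new + theta * (x_new - x)
        x = x_new
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((40, 8))
b = rng.standard_normal(40)
lam = 0.1

x_hat = bpd_ridge(A, b, lam)
x_ref = np.linalg.solve(A.T @ A + lam * np.eye(8), A.T @ b)  # closed-form solution
```

Comparing `x_hat` with the closed-form solution `x_ref` confirms that the saddle point of the primal-dual formulation recovers the primal minimizer on this instance.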
In the following theorem, we show how to set the parameters , and to exploit both sources of strong convexity to achieve fast linear convergence.
Theorem 1.
The proof of Theorem 1 is given in Appendices B and C. Here we give a detailed analysis of the convergence rate. Substituting and in (7) into the expressions for and in (8), and assuming , we have
Since the overall condition number of the problem is , it is clear that is an accelerated convergence rate. Next we examine in two special cases.
The case of but .
The case of but .
In this case, we have and , and
Notice that is the condition number of . Next we assume and examine how varies with .

If , meaning is badly conditioned, then
Because the overall condition number is , this is an accelerated linear rate, and so is .

If , meaning is mildly conditioned, then
This represents a half-accelerated rate, because the overall condition number is .

If , i.e., is a simple quadratic function, then
This rate does not have acceleration, because the overall condition number is .
In summary, the extent of acceleration in the dominating factor (which determines ) depends on the relative size of and , i.e., the relative conditioning between the function and the matrix . In general, we have full acceleration if . The theory predicts that the acceleration degrades as the function gets better conditioned. However, in our numerical experiments, we often observe acceleration even if gets closer to 1.
As explained in Chambolle & Pock (2011), Algorithm 1 is equivalent to a preconditioned ADMM. Deng & Yin (2016) characterized conditions for ADMM to obtain linear convergence without assuming that both parts of the objective function are strongly convex, but they did not derive a convergence rate for this case.
2.1 Adaptive batch primal-dual algorithms
In practice, it is often very hard to obtain a good estimate of the problem-dependent constants, especially , in order to apply the algorithmic parameters specified in Theorem 1. Here we explore heuristics that enable adaptive tuning of such parameters, which often leads to much improved performance in practice.
A key observation is that the convergence rate of the BPD algorithm changes monotonically with the overall strong convexity parameter , regardless of the extent of acceleration. In other words, the larger is, the faster the convergence. Therefore, if we can monitor the progress of the convergence and compare it with the predicted convergence rate in Theorem 1, then we can adjust the algorithmic parameters to exploit the fastest possible convergence. More specifically, if the observed convergence is slower than the predicted convergence rate, then we should reduce the estimate of ; if the observed convergence is better than the predicted rate, then we can try to increase for even faster convergence.
We formalize the above reasoning in an Adaptive BPD (AdaBPD) algorithm described in Algorithm 2. This algorithm maintains an estimate of the true constant , and adjusts it every iterations. We use and to represent the primal and dual objective values at and , respectively. We give two implementations of the tuning procedure BPDAdapt:

Algorithm 3 is a simple heuristic for tuning the estimate , where the increasing and decreasing factor can be changed to other values larger than 1;

Algorithm 4 is a more robust heuristic. It does not rely on the specific convergence rate established in Theorem 1. Instead, it simply compares the current estimate of the objective reduction rate with the previous estimate (). It also specifies a non-tuning range of changes in , given by the interval .
One can also devise more sophisticated schemes; e.g., if we estimate that , then no more tuning is necessary.
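The tuning logic described above can be sketched as a single decision step: compare the observed reduction of the primal-dual gap over one adaptation period with the reduction predicted by the current estimate, then scale the estimate up or down. This is only a sketch of the reasoning in the text, not a transcription of Algorithm 3 or 4; the function name `adapt_estimate`, its arguments, and the default factor of 2 are assumptions made for illustration.

```python
def adapt_estimate(mu_hat, gap_now, gap_prev, predicted_rate, period, factor=2.0):
    """Return an updated estimate after one adaptation period (illustrative).

    mu_hat          current estimate of the strong convexity constant
    gap_now         primal-dual gap at the end of the period
    gap_prev        primal-dual gap at the start of the period
    predicted_rate  per-iteration contraction factor implied by mu_hat
    period          number of iterations between adaptations
    factor          multiplicative change, any value larger than 1
    """
    predicted_reduction = predicted_rate ** period
    observed_reduction = gap_now / gap_prev
    if observed_reduction <= predicted_reduction:
        # Converging at least as fast as predicted: try a larger estimate
        return mu_hat * factor
    # Converging more slowly than predicted: the estimate was too optimistic
    return mu_hat / factor

# Example: predicted reduction over 5 iterations at rate 0.5 is 0.5**5 = 0.03125
faster = adapt_estimate(1.0, gap_now=1e-3, gap_prev=1.0, predicted_rate=0.5, period=5)
slower = adapt_estimate(1.0, gap_now=0.5, gap_prev=1.0, predicted_rate=0.5, period=5)
```

In the first call the observed reduction (0.001) beats the prediction, so the estimate doubles; in the second it falls short (0.5), so the estimate is halved.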
The capability of accessing both the primal and dual objective values allows primal-dual algorithms to obtain a good estimate of the convergence rate, which enables effective tuning heuristics. Automatic tuning of primal-dual algorithms has also been studied by, e.g., Malitsky & Pock (2016) and Goldstein et al. (2013), but with different goals.
Finally, we note that Theorem 1 only establishes a convergence rate for the distance to the optimal point and the quantity , which is not quite the duality gap . Nevertheless, the same convergence rate can also be established for the duality gap (see Zhang & Xiao, 2015, Section 2.2), which can be used to better justify the adaptation procedure.
3 Randomized primal-dual algorithm
In this section, we come back to the ERM problem (1), which has a finite sum structure that allows the development of randomized primal-dual algorithms. In particular, we extend the stochastic primal-dual coordinate (SPDC) algorithm (Zhang & Xiao, 2015) to exploit the strong convexity from data in order to achieve a faster convergence rate.
First, we show that, by setting algorithmic parameters appropriately, the original SPDC algorithm may directly benefit from strong convexity from the loss function. We note that the SPDC algorithm is a special case of the Adaptive SPDC (AdaSPDC) algorithm presented in Algorithm 5, obtained by setting the adaptation period (not performing any adaptation). The following theorem is proved in Appendix E.
Theorem 2.
Below we give a detailed discussion on the expected convergence rate established in Theorem 2.
The cases of but .
In this case we have and , and
Hence . These recover the parameters and convergence rate of the standard SPDC (Zhang & Xiao, 2015).
The cases of but .
In this case we have and , and
Since the objective is smooth and strongly convex, is an accelerated rate if (otherwise ). For , we consider different situations:

If , then we have which is an accelerated rate. So is .

If and , then which represents an accelerated rate. The iteration complexity of SPDC is , which is better than that of SVRG in this case, which is .

If and , then we get This is a half-accelerated rate, because in this case SVRG would require iterations, while the iteration complexity here is .

If and , meaning the ’s are well conditioned, then we get which is a non-accelerated rate. The corresponding iteration complexity is the same as that of SVRG.
3.1 Parameter adaptation for SPDC
The SPDCAdapt procedure called in Algorithm 5 follows the same logic as the batch adaptation schemes in Algorithms 3 and 4, and we omit the details here. One thing we emphasize here is that the adaptation period is in terms of epochs, or number of passes over the data. In addition, we only compute the primal and dual objective values after each pass or every few passes, because computing them exactly usually requires a full pass over the data.
Another important issue is that, unlike the batch case where the duality gap usually decreases monotonically, the duality gap for randomized algorithms can fluctuate wildly. So instead of using only the two end values and , we can use more points to estimate the convergence rate through a linear regression. Suppose the primal-dual values at the end of each of the past passes are
and we need to estimate (rate per pass) such that
We can turn it into a linear regression problem after taking logarithm and obtain the estimate through
The rest of the adaptation procedure can follow the robust scheme in Algorithm 4. In practice, we can compute the primal-dual values more sporadically, say every few passes, and modify the regression accordingly.
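The regression idea above can be sketched as follows: if the gap after pass $k$ behaves like $c \rho^k$, then its logarithm is linear in $k$ with slope $\log \rho$, so an ordinary least-squares fit of the log-gaps recovers a per-pass rate estimate. The exact estimator used in the paper may differ (for example, in how the intercept is handled); the function name `estimate_rate_per_pass` and the sample values are assumptions for illustration.

```python
import numpy as np

def estimate_rate_per_pass(gaps):
    """Estimate rho from duality gaps recorded at the end of consecutive passes.

    Fits log(gap_k) ~ intercept + k * log(rho) by least squares and returns rho.
    All gaps must be strictly positive.
    """
    gaps = np.asarray(gaps, dtype=float)
    passes = np.arange(len(gaps))
    slope, _intercept = np.polyfit(passes, np.log(gaps), 1)
    return float(np.exp(slope))

# Noise-free geometric decay: the fit recovers the true rate
clean_gaps = [3.0 * 0.8 ** k for k in range(6)]
rho_clean = estimate_rate_per_pass(clean_gaps)

# Fluctuating gaps: the fit smooths over the oscillation
noisy_gaps = [g * (1.2 if k % 2 else 0.9) for k, g in enumerate(clean_gaps)]
rho_noisy = estimate_rate_per_pass(noisy_gaps)
```

Using several points rather than only the two end values makes the estimate less sensitive to a single unusually large or small gap.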
4 Dual-free primal-dual algorithms
Compared with primal algorithms, one major disadvantage of primal-dual algorithms is the requirement of computing the proximal mapping of the dual function or , which may not admit a closed-form solution or efficient computation. This is especially the case for logistic regression, one of the most popular loss functions used in classification.
Lan & Zhou (2015) developed “dual-free” variants of primal-dual algorithms that avoid computing the dual proximal mapping. Their main technique is to replace the Euclidean distance in the dual proximal mapping with a Bregman divergence defined over the dual loss function itself. We show how to apply this approach to solve the structured ERM problems considered in this paper. The resulting algorithms can also exploit strong convexity from data if the algorithmic parameters are set appropriately or adapted automatically.
4.1 Dual-free BPD algorithm
First, we consider the batch setting. We replace the dual proximal mapping (computing ) in Algorithm 1 with
(11) 
where is the Bregman divergence of a strictly convex kernel function , defined as
Algorithm 1 is obtained in the Euclidean setting with and . While our convergence results would apply for an arbitrary Bregman divergence, we only focus on the case of using itself as the kernel, because this allows us to compute in (11) very efficiently. The following lemma explains the details (cf. Lan & Zhou, 2015, Lemma 1).
Lemma 1.
Let the kernel in the Bregman divergence . If we construct a sequence of vectors such that and for all ,
(12) 
then the solution to problem (11) is
Proof.
Suppose (true for ), then
The solution to (11) can be written as
where in the last equality we used the property of conjugate function when is strongly convex and smooth. Moreover,
which completes the proof. ∎
According to Lemma 1, we only need to provide initial points such that is easy to compute. We do not need to compute directly for any , because it can be updated as in (12). Consequently, we can update in the BPD algorithm using the gradient , without the need for a dual proximal mapping. The resulting dual-free algorithm is given in Algorithm 6.
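The mechanism behind Lemma 1 can be checked numerically on a simple case. The sketch below assumes that the Bregman-proximal step (11) has the form $y^+ = \arg\min_y \{ f^*(y) - y^T A \tilde{x} + \frac{1}{\sigma} D(y, y^{\text{old}}) \}$ with kernel $f^*$, and that the auxiliary update (12) is the weighted average $v^+ = (v + \sigma A \tilde{x}) / (1 + \sigma)$ followed by $y^+ = \nabla f(v^+)$; neither formula is visible in this text, so both are stated here as assumptions. It uses $f(z) = \frac{1}{2}\|z - b\|^2$, for which $\nabla f^*$ is known explicitly and the optimality condition can be verified independently. All function names are hypothetical.

```python
import numpy as np

def bregman(h, grad_h, u, w):
    """Bregman divergence D(u, w) = h(u) - h(w) - <grad h(w), u - w>."""
    return h(u) - h(w) - grad_h(w) @ (u - w)

def dual_free_step(v, Ax_bar, sigma, grad_f):
    """Average v with A @ x_bar, then map through grad f to get the new dual point."""
    v_new = (v + sigma * Ax_bar) / (1.0 + sigma)
    return v_new, grad_f(v_new)

rng = np.random.default_rng(3)
n, sigma = 6, 0.7
b = rng.standard_normal(n)

def grad_f(v):
    return v - b          # f(z) = 0.5 * ||z - b||^2

def grad_f_conj(y):
    return y + b          # f^*(y) = 0.5 * ||y||^2 + b^T y

v_old = rng.standard_normal(n)
Ax_bar = rng.standard_normal(n)   # stands in for A @ x_bar
y_old = grad_f(v_old)
v_new, y_new = dual_free_step(v_old, Ax_bar, sigma, grad_f)

# Gradient of f^*(y) - y^T A x_bar + (1 / sigma) * D_{f^*}(y, y_old) at y_new
residual = grad_f_conj(y_new) - Ax_bar + (grad_f_conj(y_new) - grad_f_conj(y_old)) / sigma

# With the kernel h = 0.5 * ||.||^2 the Bregman divergence is the squared distance
def half_sq(u):
    return 0.5 * (u @ u)

def identity(u):
    return u

euclid = bregman(half_sq, identity, v_new, v_old)
```

The residual vanishes, so the new dual point is obtained from one averaging step and one gradient evaluation of $f$, with no proximal subproblem to solve.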
Lan & Zhou (2015) considered a general setting which does not possess the linear predictor structure we focus on in this paper, and assumed that only the regularization is strongly convex. Our following result shows that dual-free primal-dual algorithms can also exploit strong convexity from data with appropriate algorithmic parameters.