We consider the problem of regularized empirical risk minimization (ERM)
of linear predictors.
Let $a_1,\ldots,a_n\in\mathbb{R}^d$ be the feature vectors of $n$ data samples, $\phi_i:\mathbb{R}\to\mathbb{R}$ be a convex loss function associated with the linear prediction $a_i^T x$, for $i=1,\ldots,n$, and $g:\mathbb{R}^d\to\mathbb{R}$ be a convex regularization function for the predictor $x\in\mathbb{R}^d$. ERM amounts to solving the following convex optimization problem:
$$\min_{x\in\mathbb{R}^d}\ P(x) := \frac{1}{n}\sum_{i=1}^n \phi_i(a_i^T x) + g(x). \qquad (1)$$
Examples of the above formulation include many well-known classification and regression problems. For binary classification, each feature vector $a_i$ is associated with a label $b_i\in\{\pm 1\}$. In particular, logistic regression is obtained by setting $\phi_i(z)=\log(1+\exp(-b_i z))$. For linear regression problems, each feature vector $a_i$ is associated with a dependent variable $b_i\in\mathbb{R}$, and $\phi_i(z)=(1/2)(z-b_i)^2$. Then we get ridge regression with $g(x)=(\lambda/2)\|x\|_2^2$, and elastic net with $g(x)=\lambda_1\|x\|_1+(\lambda_2/2)\|x\|_2^2$.
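As a small illustration of the formulation above, the following sketch evaluates the regularized ERM objective for the logistic and squared losses with ridge or elastic net regularization. The function names and the list-based data layout are assumptions made for this example only.

```python
import math

def logistic_loss(z, b):
    """Logistic loss log(1 + exp(-b*z)) for a label b in {-1, +1}."""
    m = -b * z
    # Numerically stable evaluation of log(1 + exp(m)).
    return max(m, 0.0) + math.log1p(math.exp(-abs(m)))

def squared_loss(z, b):
    """Squared loss 0.5 * (z - b)^2 for a real-valued target b."""
    return 0.5 * (z - b) ** 2

def ridge(x, lam):
    """Ridge regularizer (lam / 2) * ||x||_2^2."""
    return 0.5 * lam * sum(v * v for v in x)

def elastic_net(x, lam1, lam2):
    """Elastic net regularizer lam1 * ||x||_1 + (lam2 / 2) * ||x||_2^2."""
    return lam1 * sum(abs(v) for v in x) + 0.5 * lam2 * sum(v * v for v in x)

def erm_objective(features, targets, x, loss, reg):
    """Average loss of the linear predictions plus the regularizer."""
    total = 0.0
    for a, b in zip(features, targets):
        z = sum(ai * xi for ai, xi in zip(a, x))  # linear prediction a^T x
        total += loss(z, b)
    return total / len(features) + reg(x)
```

For example, `erm_objective(features, labels, x, logistic_loss, lambda v: ridge(v, 0.1))` gives the regularized logistic regression objective at `x`.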
Let be the data matrix. Throughout this paper, we make the following assumptions:
The functions , and matrix satisfy:
Each is -strongly convex and -smooth where and , and ;
is -strongly convex where ;
, where .
The strong convexity and smoothness mentioned above are with respect to the standard Euclidean norm, denoted as . (See, e.g., Nesterov (2004, Sections 2.1.1 and 2.1.3) for the exact definitions.) Let and assuming , then is a popular definition of condition number for analyzing complexities of different algorithms. The last condition above means that the primal objective function is strongly convex, even if .
There have been extensive research activities in recent years on developing efficient algorithms for solving problem (1). A broad class of randomized algorithms that exploit the finite sum structure in the ERM problem have emerged as very competitive both in terms of theoretical complexity and practical performance. They can be put into three categories: primal, dual, and primal-dual.
Primal randomized algorithms work with the ERM problem (1) directly. They are modern versions of randomized incremental gradient methods (e.g., Bertsekas, 2012; Nedic & Bertsekas, 2001) equipped with variance reduction techniques. Each iteration of such algorithms only processes one data point with complexity . They include SAG (Roux et al., 2012), SAGA (Defazio et al., 2014), and SVRG (Johnson & Zhang, 2013; Xiao & Zhang, 2014), which all achieve the iteration complexity to find an -optimal solution. In fact, they are capable of exploiting the strong convexity from data, meaning that the condition number in the complexity can be replaced by the more favorable one . This improvement can be achieved without explicit knowledge of from data.
Dual algorithms solve the Fenchel dual of (1) by maximizing
using randomized coordinate ascent algorithms. (Here and denote the conjugate functions of and .) They include SDCA (Shalev-Shwartz & Zhang, 2013), Nesterov (2012) and Richtárik & Takáč (2014). They have the same complexity , but it is hard for them to exploit strong convexity from data.
Primal-dual algorithms solve the convex-concave saddle point problem where
In particular, SPDC (Zhang & Xiao, 2015) achieves an accelerated linear convergence rate with iteration complexity , which is better than the aforementioned non-accelerated complexity when . Lan & Zhou (2015) developed dual-free variants of accelerated primal-dual algorithms, but without considering the linear predictor structure in ERM. Balamurugan & Bach (2016) extended SVRG and SAGA to solving saddle point problems.
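To make the relation between the primal problem, its Fenchel dual, and the duality gap concrete, here is a minimal sketch for ridge regression. It assumes the squared loss `0.5*(z - b_i)^2`, whose conjugate is `0.5*y^2 + b_i*y`, and the regularizer `(lam/2)*||x||^2`, whose conjugate is `||v||^2/(2*lam)`. The function names are hypothetical.

```python
def primal_ridge(A, b, x, lam):
    """P(x) = (1/n) * sum_i 0.5*(a_i^T x - b_i)^2 + (lam/2)*||x||^2."""
    n = len(A)
    preds = [sum(a * v for a, v in zip(row, x)) for row in A]
    loss = sum(0.5 * (z - bi) ** 2 for z, bi in zip(preds, b)) / n
    return loss + 0.5 * lam * sum(v * v for v in x)

def dual_ridge(A, b, y, lam):
    """D(y) = -(1/n) * sum_i phi_i^*(y_i) - g^*(-(1/n) * A^T y)."""
    n, d = len(A), len(A[0])
    # Conjugate of the squared loss, averaged over the samples.
    conj = sum(0.5 * yi ** 2 + bi * yi for yi, bi in zip(y, b)) / n
    # u = (1/n) * A^T y; the conjugate of the ridge term is ||u||^2 / (2*lam).
    u = [sum(A[i][j] * y[i] for i in range(n)) / n for j in range(d)]
    return -conj - sum(v * v for v in u) / (2.0 * lam)

def duality_gap(A, b, x, y, lam):
    """Nonnegative for any pair (x, y); zero at a saddle point."""
    return primal_ridge(A, b, x, lam) - dual_ridge(A, b, y, lam)
```

With `A = [[1.0], [1.0]]`, `b = [1.0, 3.0]` and `lam = 1.0`, the minimizer is `x = [1.0]`, the matching dual point is the vector of residuals `y = [0.0, -2.0]`, and the gap is zero there.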
Accelerated primal and dual randomized algorithms have also been developed. Nesterov (2012), Fercoq & Richtárik (2015) and Lin et al. (2015b) developed accelerated coordinate gradient algorithms, which can be applied to solve the dual problem (2). Allen-Zhu (2016) developed an accelerated variant of SVRG. Acceleration can also be obtained using the Catalyst framework (Lin et al., 2015a). They all achieve the same complexity. A common feature of accelerated algorithms is that they require a good estimate of the strong convexity parameter. This makes it hard for them to exploit strong convexity from data because the minimum singular value of the data matrix is very hard to estimate in general.
In this paper, we show that primal-dual algorithms are capable of exploiting strong convexity from data if the algorithm parameters (such as step sizes) are set appropriately. While these optimal settings depend on knowledge of the convexity parameter from the data, we develop adaptive variants of primal-dual algorithms that can tune the parameter automatically. Such adaptive schemes rely critically on the capability of evaluating the primal-dual optimality gaps by primal-dual algorithms.
A major disadvantage of primal-dual algorithms is that the required dual proximal mapping may not admit a closed-form or efficient solution. We follow the approach of Lan & Zhou (2015) to derive dual-free variants of the primal-dual algorithms customized for ERM problems with the linear predictor structure, and show that they can also exploit strong convexity from data with correct choices of parameters or using an adaptation scheme.
2 Batch primal-dual algorithms
Before diving into randomized primal-dual algorithms, we first consider batch primal-dual algorithms, which exhibit similar properties as their randomized variants. To this end, we consider a “batch” version of the ERM problem (1),
where , and make the following assumption:
The functions , and matrix satisfy:
is -strongly convex and -smooth where and , and ;
is -strongly convex where ;
, where .
For exact correspondence with problem (1), we have with . Under Assumption 1, the function is -strongly convex and -smooth, and is -strongly convex and -smooth. However, such correspondences alone are not sufficient to exploit the structure of (1), i.e., substituting them into the batch algorithms of this section will not produce the efficient algorithms for solving problem (1) that we will present in Sections 3 and 4.2. So we do not make such correspondences explicit in this section. Rather, treat them as independent assumptions with the same notation.
Using conjugate functions, we can derive the dual of (4) as
and the convex-concave saddle point formulation is
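As a sketch of the conjugate-function step, suppose the batch problem has the composite form $\min_x f(Ax)+g(x)$ with closed convex $f$ and $g$. This form, and the symbols $f$, $g$, $A$, $x$, $y$, are assumptions made for this illustration. Writing $f$ through its conjugate $f^*$ then yields both the saddle point and the dual formulations:

```latex
\begin{align*}
f(Ax) &= \max_{y}\ \bigl\{\, y^T A x - f^*(y) \,\bigr\}, \\
\min_{x}\ \bigl\{ f(Ax) + g(x) \bigr\}
  &= \min_{x}\,\max_{y}\ \bigl\{\, g(x) + y^T A x - f^*(y) \,\bigr\}, \\
\max_{y}\,\min_{x}\ \bigl\{\, g(x) + y^T A x - f^*(y) \,\bigr\}
  &= \max_{y}\ \bigl\{\, -f^*(y) - g^*(-A^T y) \,\bigr\}.
\end{align*}
```

The last line uses $\min_x \{ g(x) + (A^T y)^T x \} = -g^*(-A^T y)$, which is the definition of the conjugate $g^*$ evaluated at $-A^T y$.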
We consider the primal-dual first-order algorithm proposed by Chambolle & Pock (2011, 2016) for solving the saddle point problem (6), which is given as Algorithm 1. Here we call it the batch primal-dual (BPD) algorithm. Assuming that is smooth and is strongly convex, Chambolle & Pock (2011, 2016) showed that Algorithm 1 achieves an accelerated linear convergence rate if . However, they did not consider the case where an additional or the sole source of strong convexity comes from .
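The following is a minimal sketch of a Chambolle-Pock style primal-dual iteration, specialized to the ridge problem min_x 0.5*||Ax - b||^2 + 0.5*lam*||x||^2 so that both proximal mappings have closed forms. The step sizes (tau = sigma = 1/||A||_F), the extrapolation parameter theta = 1, and all function names are assumptions of this sketch; they are conservative defaults, not the parameter choices analyzed in this section.

```python
import math

def matvec(A, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def rmatvec(A, y):
    """Compute A^T y for a matrix stored as a list of rows."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def bpd_ridge(A, b, lam, iters=1000):
    """Primal-dual iteration for min_x 0.5*||Ax - b||^2 + 0.5*lam*||x||^2."""
    n, d = len(A), len(A[0])
    # ||A||_F upper-bounds the spectral norm, so tau * sigma * ||A||^2 <= 1.
    fro = math.sqrt(sum(a * a for row in A for a in row))
    tau = sigma = 1.0 / fro
    theta = 1.0
    x, x_bar, y = [0.0] * d, [0.0] * d, [0.0] * n
    for _ in range(iters):
        z = matvec(A, x_bar)
        # Dual step: prox of sigma * f^*, where f^*(y) = 0.5*||y||^2 + b^T y.
        y = [(yi + sigma * (zi - bi)) / (1.0 + sigma) for yi, zi, bi in zip(y, z, b)]
        # Primal step: prox of tau * g, where g(x) = 0.5*lam*||x||^2.
        g = rmatvec(A, y)
        x_new = [(xj - tau * gj) / (1.0 + tau * lam) for xj, gj in zip(x, g)]
        # Extrapolation of the primal iterate.
        x_bar = [xn + theta * (xn - xo) for xn, xo in zip(x_new, x)]
        x = x_new
    return x, y
```

At a fixed point of this iteration the dual variable equals the residual `Ax - b` and the primal variable satisfies `A^T(Ax - b) + lam*x = 0`, which is the optimality condition of the ridge problem.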
In the following theorem, we show how to set the parameters , and to exploit both sources of strong convexity to achieve fast linear convergence.
Since the overall condition number of the problem is , it is clear that is an accelerated convergence rate. Next we examine in two special cases.
The case of but .
The case of but .
In this case, we have and , and
Notice that is the condition number of . Next we assume and examine how varies with .
If , meaning is badly conditioned, then
Because the overall condition number is , this is an accelerated linear rate, and so is .
If , meaning is mildly conditioned, then
This represents a half-accelerated rate, because the overall condition number is .
If , i.e., is a simple quadratic function, then
This rate does not have acceleration, because the overall condition number is .
In summary, the extent of acceleration in the dominating factor (which determines ) depends on the relative size of and , i.e., the relative conditioning between the function and the matrix . In general, we have full acceleration if . The theory predicts that the acceleration degrades as the function gets better conditioned. However, in our numerical experiments, we often observe acceleration even if gets closer to 1.
As explained in Chambolle & Pock (2011), Algorithm 1 is equivalent to a preconditioned ADMM. Deng & Yin (2016) characterized conditions for ADMM to obtain linear convergence without assuming that both parts of the objective function are strongly convex, but they did not derive a convergence rate for this case.
2.1 Adaptive batch primal-dual algorithms
In practice, it is often very hard to obtain good estimates of the problem-dependent constants, especially , in order to apply the algorithmic parameters specified in Theorem 1. Here we explore heuristics that can enable adaptive tuning of such parameters, which often lead to much improved performance in practice.
A key observation is that the convergence rate of the BPD algorithm changes monotonically with the overall strong convexity parameter , regardless of the extent of acceleration. In other words, the larger is, the faster the convergence. Therefore, if we can monitor the progress of the convergence and compare it with the predicted convergence rate in Theorem 1, then we can adjust the algorithmic parameters to exploit the fastest possible convergence. More specifically, if the observed convergence is slower than the predicted convergence rate, then we should reduce the estimate of ; if the observed convergence is better than the predicted rate, then we can try to increase for even faster convergence.
We formalize the above reasoning in an Adaptive BPD (Ada-BPD) algorithm described in Algorithm 2. This algorithm maintains an estimate of the true constant , and adjusts it every iterations. We use and to represent the primal and dual objective values at and , respectively. We give two implementations of the tuning procedure BPD-Adapt:
Algorithm 3 is a simple heuristic for tuning the estimate , where the increasing and decreasing factor can be changed to other values larger than 1;
Algorithm 4 is a more robust heuristic. It does not rely on the specific convergence rate established in Theorem 1. Instead, it simply compares the current estimate of objective reduction rate with the previous estimate (). It also specifies a non-tuning range of changes in , specified by the interval .
One can also devise more sophisticated schemes; e.g., if we estimate that , then no more tuning is necessary.
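The reasoning behind the simple heuristic can be sketched as follows: compare the observed reduction of the duality gap over one adaptation period with the reduction predicted for the current estimate, then scale the estimate up or down. The function name, its arguments, and the default factor of 2 are hypothetical choices for this sketch; it is not a transcription of Algorithm 3.

```python
def adapt_estimate(mu_hat, gap_prev, gap_now, predicted_rate, period, factor=2.0):
    """Return an updated strong-convexity estimate after one adaptation period.

    mu_hat         : current estimate of the strong convexity constant
    gap_prev       : duality gap at the start of the period
    gap_now        : duality gap at the end of the period
    predicted_rate : per-iteration contraction factor in (0, 1) that the theory
                     predicts if mu_hat were correct (hypothetical input)
    period         : number of iterations in the adaptation period
    factor         : increase/decrease factor, must be larger than 1
    """
    if factor <= 1.0:
        raise ValueError("factor must be larger than 1")
    predicted_gap = (predicted_rate ** period) * gap_prev
    if gap_now < predicted_gap:
        # Observed convergence is faster than predicted: try a larger estimate.
        return mu_hat * factor
    # Observed convergence is not faster than predicted: reduce the estimate.
    return mu_hat / factor
```

After each call, the step sizes would be recomputed from the new estimate before the next period starts.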
The capability of accessing both the primal and dual objective values allows primal-dual algorithms to have a good estimate of the convergence rate, which enables effective tuning heuristics. Automatic tuning of primal-dual algorithms has also been studied by, e.g., Malitsky & Pock (2016) and Goldstein et al. (2013), but with different goals.
Finally, we note that Theorem 1 only establishes a convergence rate for the distance to the optimal point and the quantity , which is not quite the duality gap . Nevertheless, the same convergence rate can also be established for the duality gap (see Zhang & Xiao, 2015, Section 2.2), which can be used to better justify the adaptation procedure.
3 Randomized primal-dual algorithm
In this section, we come back to the ERM problem (1), which has a finite sum structure that allows the development of randomized primal-dual algorithms. In particular, we extend the stochastic primal-dual coordinate (SPDC) algorithm (Zhang & Xiao, 2015) to exploit the strong convexity from data in order to achieve a faster convergence rate.
First, we show that, by setting algorithmic parameters appropriately, the original SPDC algorithm may directly benefit from strong convexity from the loss function. We note that the SPDC algorithm is a special case of the Adaptive SPDC (Ada-SPDC) algorithm presented in Algorithm 5, by setting the adaption period (not performing any adaption). The following theorem is proved in Appendix E.
Below we give a detailed discussion on the expected convergence rate established in Theorem 2.
The cases of but .
In this case we have and , and
Hence . These recover the parameters and convergence rate of the standard SPDC (Zhang & Xiao, 2015).
The cases of but .
In this case we have and , and
Since the objective is -smooth and -strongly convex, is an accelerated rate if (otherwise ). For , we consider different situations:
If , then we have which is an accelerated rate. So is .
If and , then which represents an accelerated rate. The iteration complexity of SPDC is , which is better than that of SVRG in this case, which is .
If and , then we get This is a half-accelerated rate, because in this case SVRG would require iterations, while the iteration complexity here is .
If and , meaning the ’s are well conditioned, then we get which is a non-accelerated rate. The corresponding iteration complexity is the same as SVRG.
3.1 Parameter adaptation for SPDC
The SPDC-Adapt procedure called in Algorithm 5 follows the same logic as the batch adaptation schemes in Algorithms 3 and 4, and we omit the details here. One thing we emphasize here is that the adaptation period is in terms of epochs, or number of passes over the data. In addition, we only compute the primal and dual objective values after each pass or every few passes, because computing them exactly usually requires a full pass over the data.
Another important issue is that, unlike the batch case where the duality gap usually decreases monotonically, the duality gap for randomized algorithms can fluctuate wildly. So instead of using only the two end values and , we can use more points to estimate the convergence rate through a linear regression. Suppose the primal-dual values at the end of each past passes are
and we need to estimate (rate per pass) such that
We can turn it into a linear regression problem after taking logarithm and obtain the estimate through
The rest of the adaption procedure can follow the robust scheme in Algorithm 4. In practice, we can compute the primal-dual values more sporadically, say every few passes, and modify the regression accordingly.
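The regression idea can be sketched as follows, assuming the duality gaps recorded at the end of consecutive passes are stored in a list and modeled as `gaps[k] ≈ rho**k * gaps[0]`. Taking logarithms gives a linear model in `k`; the sketch fits it by least squares through the origin. The function name and the through-the-origin choice are assumptions of this example.

```python
import math

def estimate_rate(gaps):
    """Estimate the per-pass rate rho from positive duality gaps.

    Fits log(gaps[k] / gaps[0]) = k * log(rho) for k = 1..N by least squares
    through the origin and returns rho.
    """
    if len(gaps) < 2 or any(g <= 0 for g in gaps):
        raise ValueError("need at least two positive gap values")
    num = sum(k * math.log(gaps[k] / gaps[0]) for k in range(1, len(gaps)))
    den = sum(k * k for k in range(1, len(gaps)))
    return math.exp(num / den)
```

Because all recorded points enter the fit, a single pass with an unusually large or small gap has less influence than it would if only the two end values were compared.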
4 Dual-free Primal-dual algorithms
Compared with primal algorithms, one major disadvantage of primal-dual algorithms is the requirement of computing the proximal mapping of the dual function or , which may not admit a closed-form solution or efficient computation. This is especially the case for logistic regression, one of the most popular loss functions used in classification.
Lan & Zhou (2015) developed “dual-free” variants of primal-dual algorithms that avoid computing the dual proximal mapping. Their main technique is to replace the Euclidean distance in the dual proximal mapping with a Bregman divergence defined over the dual loss function itself. We show how to apply this approach to solve the structured ERM problems considered in this paper. They can also exploit strong convexity from data if the algorithmic parameters are set appropriately or adapted automatically.
4.1 Dual-free BPD algorithm
First, we consider the batch setting. We replace the dual proximal mapping (computing ) in Algorithm 1 with
where is the Bregman divergence of a strictly convex kernel function , defined as
Algorithm 1 is obtained in the Euclidean setting with and . While our convergence results would apply for arbitrary Bregman divergence, we only focus on the case of using itself as the kernel, because this allows us to compute in (11) very efficiently. The following lemma explains the details (Cf. Lan & Zhou, 2015, Lemma 1).
Let the kernel in the Bregman divergence . If we construct a sequence of vectors such that and for all ,
then the solution to problem (11) is
Suppose (true for ), then
The solution to (11) can be written as
where in the last equality we used the property of conjugate function when is strongly convex and smooth. Moreover,
which completes the proof. ∎
According to Lemma 1, we only need to provide initial points such that is easy to compute. We do not need to compute directly for any , because it can be updated as in (12). Consequently, we can update in the BPD algorithm using the gradient , without the need for a dual proximal mapping. The resulting dual-free algorithm is given in Algorithm 6.
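A one-dimensional sketch of such a dual-free update for the logistic loss is given below. It assumes the dual step minimizes $\phi^*(y) - yz + (1/\sigma) D(y, y^t)$, where $D$ is the Bregman divergence of the kernel $\phi^*$. The first-order condition then reads $(\phi^*)'(y) = ((\phi^*)'(y^t) + \sigma z)/(1+\sigma)$, so it suffices to track an auxiliary point $v = (\phi^*)'(y)$ and recover $y = \phi'(v)$. The names `v`, `z`, `sigma` and both function names are assumptions of this sketch.

```python
import math

def logistic_grad(v, b):
    """Derivative of phi(v) = log(1 + exp(-b*v)) with respect to v."""
    # Equals -b / (1 + exp(b*v)), written to avoid overflow for large |b*v|.
    m = b * v
    if m >= 0:
        e = math.exp(-m)
        return -b * e / (1.0 + e)
    return -b / (1.0 + math.exp(m))

def dual_free_step(v, z, sigma, b):
    """One dual-free update for a single coordinate.

    v     : auxiliary point with y = phi'(v)  (assumed notation)
    z     : current (extrapolated) linear prediction
    sigma : dual step size
    b     : label in {-1, +1}
    Returns the new auxiliary point and the new dual variable.
    """
    v_new = (v + sigma * z) / (1.0 + sigma)
    return v_new, logistic_grad(v_new, b)
```

Only the gradient of the loss is evaluated; the conjugate function and its proximal mapping never appear in the update.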
Lan & Zhou (2015) considered a general setting which does not possess the linear predictor structure we focus on in this paper, and assumed that only the regularization is strongly convex. Our following result shows that dual-free primal-dual algorithms can also exploit strong convexity from data with appropriate algorithmic parameters.