We consider regularized empirical risk minimization (ERM) problems of the following form:
$$\min_{x\in\mathbb{R}^p}\; P(x) \equiv \frac{1}{n}\sum_{i=1}^{n}\phi_i(a_i^\top x) + g(x), \qquad (1)$$
where $a_1,\dots,a_n\in\mathbb{R}^p$ are $n$ data points with $p$ features, $\phi_i$ is a convex loss function of the linear predictor $a_i^\top x$, for $i=1,\dots,n$, and $g$ is a convex regularization function for the coefficient vector $x$ in the linear predictor. We assume $g$ has a decomposable structure, namely,
$$g(x)=\sum_{j=1}^{p}g_j(x_j),$$
where $g_j$ is only a function of $x_j$, the $j$-th coordinate of $x$. We further assume that $g_j$ is $\lambda$-strongly convex for each $j$, i.e., $g_j(u)\ge g_j(v)+g_j'(v)(u-v)+\frac{\lambda}{2}(u-v)^2$ for any $u,v\in\mathbb{R}$, where $g_j'(v)$ is any subgradient of $g_j$ at $v$, and that the loss function $\phi_i$ is $\frac{1}{\gamma}$-smooth for each $i$, i.e., $|\phi_i'(u)-\phi_i'(v)|\le\frac{1}{\gamma}|u-v|$ for any $u,v\in\mathbb{R}$. For simplicity, we first consider univariate coordinates $x_j$. In Section 3, the proposed method will be generalized to problems with a block-wise decomposable structure and multivariate blocks.
The problem (1) has many applications in business analytics, statistics, machine learning and data mining, and has triggered many studies in the optimization community. Typically, for each data point $a_i$, there is an associated response value $b_i$, which can be continuous (in regression problems) or discrete (in classification problems). Examples of the loss function $\phi_i$ associated to $(a_i,b_i)$ include: the square loss, where $b_i\in\mathbb{R}$, $\phi_i(z)=\frac{1}{2}(z-b_i)^2$ and $\gamma=1$; the logistic loss, where $b_i\in\{1,-1\}$, $\phi_i(z)=\log(1+\exp(-b_iz))$ and $\gamma=4$; and the smooth hinge loss, where $b_i\in\{1,-1\}$, $\gamma=1$ and
$$\phi_i(z)=\begin{cases}0 & \text{if } b_iz\ge 1,\\ \frac{1}{2}-b_iz & \text{if } b_iz\le 0,\\ \frac{1}{2}(1-b_iz)^2 & \text{otherwise.}\end{cases}$$
The commonly used regularization terms include the $\ell_2$-regularization with $g(x)=\frac{\lambda}{2}\|x\|_2^2$ and the elastic-net regularization with $g(x)=\frac{\lambda}{2}\|x\|_2^2+\mu\|x\|_1$, $\mu\ge 0$.
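As a concrete check of these definitions, the three losses can be written in a few lines (the function names are ours; the $\gamma$ values follow the smoothness constants stated above):

```python
import numpy as np

def square(z, b):        # 1-smooth, gamma = 1
    return 0.5 * (z - b) ** 2

def logistic(z, b):      # (1/4)-smooth, gamma = 4
    return np.log1p(np.exp(-b * z))

def smooth_hinge(z, b):  # 1-smooth, gamma = 1
    m = b * z
    if m >= 1:
        return 0.0
    if m <= 0:
        return 0.5 - m
    # Quadratic interpolation between the flat and linear parts.
    return 0.5 * (1 - m) ** 2

assert smooth_hinge(2.0, 1) == 0.0            # correctly classified margin
assert smooth_hinge(-1.0, 1) == 1.5           # linear regime
assert abs(smooth_hinge(0.5, 1) - 0.125) < 1e-12  # quadratic regime
```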
We often call (1) the primal problem, and its conjugate dual problem is
$$\max_{y\in\mathbb{R}^n}\; D(y) \equiv -\frac{1}{n}\sum_{i=1}^{n}\phi_i^*(y_i) - g^*\!\left(-\frac{1}{n}A^\top y\right), \qquad (4)$$
where $\phi_i^*$ and $g^*$ are the convex conjugates of $\phi_i$ and $g$, respectively, and $A=[a_1,\dots,a_n]^\top\in\mathbb{R}^{n\times p}$ is the data matrix. The primal and dual problems are connected through the convex-concave saddle-point formulation
$$\min_{x\in\mathbb{R}^p}\max_{y\in\mathbb{R}^n}\; f(x,y) \equiv \frac{1}{n}y^\top Ax + g(x) - \frac{1}{n}\sum_{i=1}^{n}\phi_i^*(y_i). \qquad (5)$$
In this paper, we propose a doubly stochastic primal-dual coordinate (DSPDC) method for solving problem (5) that randomly samples $q$ out of $p$ primal and $m$ out of $n$ dual coordinates to update in each iteration. We show that the DSPDC method generates a sequence of primal-dual iterates that linearly converges to the saddle point $(x^\star,y^\star)$, and that the primal-dual objective gap along this sequence also linearly converges to zero. In addition, we generalize this approach to bilinear saddle-point problems with a block-wise decomposable structure, and provide upper and lower bounds on its iteration complexity for finding an $\epsilon$-optimal solution. We also discuss the situations in which the proposed methods are superior to existing coordinate methods.
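To illustrate the relation between the primal and saddle-point formulations, the following sketch numerically verifies, for the square loss with an assumed $\ell_2$ regularizer, that maximizing the saddle function over $y$ recovers the primal objective; for the square loss, $\phi_i^*(y)=\frac{1}{2}y^2+b_iy$ and the maximizer for fixed $x$ is $y_i=a_i^\top x-b_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 10, 0.1
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)
x = rng.standard_normal(p)

# Primal objective: square loss with l2 regularization (assumed instance).
P = 0.5 * np.mean((A @ x - b) ** 2) + 0.5 * lam * x @ x

# Closed-form inner maximizer of the saddle function for the square loss.
y = A @ x - b
f = (y @ (A @ x)) / n + 0.5 * lam * x @ x - np.mean(0.5 * y ** 2 + b * y)

assert abs(P - f) < 1e-8   # max_y f(x, y) equals P(x)
```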
Let $[d]$ represent the set $\{1,2,\dots,d\}$. For $v\in\mathbb{R}^d$, let $v_i$ be its $i$-th coordinate for $i\in[d]$ and $v_I$ be the sub-vector of $v$ that consists of the coordinates of $v$ indexed by a set $I\subseteq[d]$. Given an $n\times p$ matrix $A$, we denote its $i$-th row and $j$-th column by $A_{i\cdot}$ and $A_{\cdot j}$, respectively. For $I\subseteq[n]$ and $J\subseteq[p]$, the matrices $A_{I\cdot}$ and $A_{\cdot J}$ represent the sub-matrices of $A$ that consist of the rows indexed by $I$ and the columns indexed by $J$, respectively. We denote the entry of $A$ in the $i$-th row and $j$-th column by $A_{ij}$, and let $A_{IJ}$ be the sub-matrix of $A$ where the rows indexed by $I$ intersect with the columns indexed by $J$.
Let $\langle\cdot,\cdot\rangle$ be the inner product in a Euclidean space, $\|\cdot\|$ be the $\ell_2$-norm of a vector, and $\|\cdot\|_2$ be the spectral norm of a matrix. For integers $m\le n$ and $q\le p$, we define $\Lambda_{m,q}$ as a scale constant of the data $A$ as follows:
$$\Lambda_{m,q} \equiv \max_{I\subseteq[n],\,|I|=m,\;J\subseteq[p],\,|J|=q}\|A_{IJ}\|_2^2, \qquad (8)$$
which affects the iteration complexity of many first-order methods.
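Treating (8) as the maximum squared spectral norm over all $m\times q$ sub-matrices (our reading of the definition), the constant can be computed by brute force for tiny matrices; the two tractable extremes mentioned later correspond to the largest squared entry and the squared spectral norm of the whole matrix:

```python
import numpy as np
from itertools import combinations

def scale_constant(A, m, q):
    # Exhaustive search over all m x q sub-matrices; only feasible for
    # very small matrices, which is exactly the tractability issue noted
    # in the text for intermediate m and q.
    n, p = A.shape
    return max(
        np.linalg.norm(A[np.ix_(I, J)], 2) ** 2
        for I in combinations(range(n), m)
        for J in combinations(range(p), q)
    )

A = np.array([[1.0, -3.0], [2.0, 0.5]])
assert scale_constant(A, 1, 1) == 9.0  # largest squared entry
assert np.isclose(scale_constant(A, 2, 2), np.linalg.norm(A, 2) ** 2)
```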
1.2 Contributions and related work
For solving problem (1) or (5), deterministic first-order methods such as [22, 23, 21, 41] usually have to compute a full gradient in each iteration, which requires going through all features of all instances and can be inefficient for big data. Therefore, stochastic optimization methods that sample one instance or one feature in each iteration have become more popular. There are two major categories of stochastic optimization algorithms actively studied in recent years: stochastic gradient methods and stochastic coordinate methods. The DSPDC method we propose belongs to the second category.
Recently, there has been increasing interest in stochastic variance reduced gradient (SVRG) methods [12, 39, 25, 14]. SVRG runs in multiple stages. At each stage, it computes a full gradient and then performs iterative updates with stochastic gradients constructed from sampled instances. Since the full gradient is computed only once in each stage, SVRG has a lower per-iteration cost than deterministic gradient methods, and it needs $O\left(\left(n+\frac{1}{\lambda\gamma}\right)\log\frac{1}{\epsilon}\right)$ iterations to find an $\epsilon$-optimal solution for problem (1).
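A minimal SVRG sketch for ridge regression may clarify the stage structure (the step size and stage length below are illustrative choices, not tuned constants):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 100, 5, 0.1
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

def full_grad(x):
    return A.T @ (A @ x - b) / n + lam * x

def partial_grad(x, i):
    # Gradient of the i-th summand; its average over i is full_grad.
    return A[i] * (A[i] @ x - b[i]) + lam * x

x, eta = np.zeros(p), 0.02
for stage in range(30):
    x_snap, mu = x.copy(), full_grad(x)      # one full gradient per stage
    for _ in range(2 * n):
        i = rng.integers(n)
        # Variance-reduced stochastic gradient.
        v = partial_grad(x, i) - partial_grad(x_snap, i) + mu
        x -= eta * v

# Closed-form ridge solution for reference.
x_star = np.linalg.solve(A.T @ A / n + lam * np.eye(p), A.T @ b / n)
```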
Stochastic incremental gradient methods [31, 30, 6, 7, 19, 15] are also widely studied in the recent literature. Different from SVRG, a stochastic incremental gradient method computes a full gradient only once at the beginning, but maintains and updates the average of historical stochastic gradients using one sampled instance per iteration. Most stochastic incremental gradient methods have the same per-iteration cost as SVRG and also need $O\left(\left(n+\frac{1}{\lambda\gamma}\right)\log\frac{1}{\epsilon}\right)$ iterations to find an $\epsilon$-optimal solution, except the recent RPDG method by Lan & Zhou, which needs only $O\left(\left(n+\sqrt{\frac{n}{\lambda\gamma}}\right)\log\frac{1}{\epsilon}\right)$ iterations. This iteration complexity achieved by Lan & Zhou is proved to be optimal.
In contrast to stochastic gradient methods, stochastic coordinate methods work by updating randomly sampled coordinates of the decision variables [24, 29, 32, 10, 17, 4, 16, 8]. Shalev-Shwartz & Zhang [33, 34] proposed a stochastic dual coordinate ascent (SDCA) method to solve the dual formulation (4). SDCA has an iteration complexity of $O\left(\left(n+\frac{1}{\lambda\gamma}\right)\log\frac{1}{\epsilon}\right)$, which has been further improved by the accelerated SDCA (ASDCA) method to the optimal complexity $O\left(\left(n+\sqrt{\frac{n}{\lambda\gamma}}\right)\log\frac{1}{\epsilon}\right)$ up to a logarithmic factor. This optimal complexity is also obtained by the accelerated proximal coordinate gradient (APCG) method by Lin et al. when it is applied to the dual problem (4). Extending a deterministic algorithm for saddle-point problems, Zhang & Xiao recently proposed a stochastic primal-dual coordinate (SPDC) method for (5), which alternates between maximizing over a randomly chosen dual variable and minimizing over all primal variables, and also achieves the optimal iteration complexity.
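For intuition, here is a minimal SDCA sketch for ridge regression with the square loss, using the closed-form coordinate maximization and the usual sign convention of the SDCA literature (which differs from (4) by a sign):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 80, 4, 0.5
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

# The primal iterate is recovered from the dual one via x = A^T y / (lam n).
y, x = np.zeros(n), np.zeros(p)
for t in range(20 * n):
    i = rng.integers(n)
    # Closed-form maximization of the dual over coordinate y_i (square loss).
    delta = (b[i] - A[i] @ x - y[i]) / (1.0 + (A[i] @ A[i]) / (lam * n))
    y[i] += delta
    x += delta * A[i] / (lam * n)   # keep x consistent with the dual iterate

# Closed-form ridge solution for reference.
x_star = np.linalg.solve(A.T @ A / n + lam * np.eye(p), A.T @ b / n)
```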
Some recent works [43, 3, 13, 20, 5, 8] made attempts at combining stochastic gradient and stochastic coordinate methods. The authors of [43, 20, 5] proposed randomized block coordinate methods, which utilize a stochastic partial gradient of the selected block based on randomly sampled instances and features in each iteration. However, these methods face a constant variance of the stochastic partial gradient, so that they need $O(1/\epsilon)$ iterations. These techniques are further improved in [13, 43] with stochastic variance reduced partial gradients, but only obtain a sub-optimal iteration complexity.
Although the aforementioned stochastic coordinate methods achieve great performance on the ERM problem (5), they sample either only primal coordinates or only dual coordinates to update in each iteration. Therefore, it is natural to ask the following three questions.
What is the iteration complexity of a coordinate method for problem (5) that samples both primal and dual coordinates to update in each iteration?
When is this type of algorithm better than purely primal and purely dual coordinate methods?
What is the limit of the power of this type of algorithm?
To contribute to the answers to these questions, we propose the DSPDC method, which samples over both the features and the instances of the dataset by randomly choosing the associated primal and dual coordinates to update in each iteration. This method generalizes the SPDC method, which samples the dual coordinates only.
To answer the first question, we show that, if $q$ primal and $m$ dual coordinates are sampled and updated in each iteration, the number of iterations DSPDC needs to find an $\epsilon$-optimal solution for (5) interpolates between known optimal rates: it matches the optimal iteration complexity $O\left(\left(n+\sqrt{\frac{n}{\lambda\gamma}}\right)\log\frac{1}{\epsilon}\right)$ of dual coordinate methods [33, 16, 42] when $(q,m)=(p,1)$, and also matches the optimal iteration complexity of primal coordinate methods [17, 16] when $(q,m)=(1,n)$. We further generalize DSPDC and its complexity analysis to a bilinear saddle-point problem with a block-wise decomposable structure.
To study the second question, we compare different coordinate algorithms based on an overall complexity, defined as the number of iterations an algorithm needs to find an $\epsilon$-optimal solution multiplied by the computational cost of each iteration. For example, the per-iteration cost of SPDC for solving ERM is $O(p)$. The per-iteration cost of DSPDC contains an unavoidable full-dimensional inner product in the algorithm. If $n\ge p$, which is true for most ERM problems, the overall complexity of DSPDC is not lower than that of SPDC in general.
Nevertheless, we identify two cases in which DSPDC has a lower overall complexity than SPDC and other existing coordinate methods. The first case is when $A$ has a factorized structure, meaning that $A=UV$ with $U\in\mathbb{R}^{n\times d}$, $V\in\mathbb{R}^{d\times p}$ and $d\le\min\{n,p\}$. In this case, with a proper choice of $(q,m)$ and an efficient implementation, our DSPDC has a lower overall complexity than SPDC with the same efficient implementation. Motivations and examples of ERM with factorized data are provided in Section 2.2. The second case is when solving a block-wise decomposable bilinear saddle-point problem where the proximal mapping on each block is computationally expensive. For example, when each block of variables has to be a positive semi-definite matrix, the proximal mapping involves an eigenvalue decomposition, whose complexity is cubic in the block dimension in general. When sampling a single primal block per iteration, DSPDC requires solving an eigenvalue decomposition for only one block of variables, so that its overall complexity, derived in Section 3, is lower than the overall complexity of SPDC, which must perform the decomposition for every primal block in each iteration.
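The expensive proximal mapping mentioned above can be illustrated by the Euclidean projection onto the positive semi-definite cone, which requires a full eigenvalue decomposition of the block:

```python
import numpy as np

def prox_psd(Z):
    # Projection of a symmetric matrix onto the PSD cone: eigendecompose
    # (O(d^3) for a d x d block) and clip the negative eigenvalues.
    w, Q = np.linalg.eigh((Z + Z.T) / 2)
    return (Q * np.clip(w, 0.0, None)) @ Q.T

# The negative eigendirection is removed, the positive one is kept.
X = prox_psd(np.array([[1.0, 0.0], [0.0, -2.0]]))
assert np.allclose(X, [[1.0, 0.0], [0.0, 0.0]])
```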
Last but not least, we contribute an answer to the third question by providing a theoretical lower bound on the number of iterations for a family of primal-dual block coordinate methods for strongly convex bilinear saddle-point problems. We show that any coordinate algorithm that randomly updates one primal and one dual coordinate per iteration, i.e., $q=m=1$, needs $\Omega\left(\sqrt{np}\log\frac{1}{\epsilon}\right)$ iterations to find an $\epsilon$-optimal solution. This indicates that the $\sqrt{np}$ component in the complexity of the DSPDC method is optimal.
2 Doubly Stochastic Primal-Dual Coordinate Method
2.1 Algorithm and Convergence Properties
In this section, we propose the doubly stochastic primal-dual coordinate (DSPDC) method, presented in Algorithm 1, for problem (5). In Algorithm 1, the primal and dual solutions are updated as (12) and (10) in the randomly selected $q$ and $m$ coordinates indexed by $J$ and $I$, respectively (here, we hide the dependency of $I$ and $J$ on the iteration index to simplify the notation). These updates utilize the first-order information provided by the vectors $A\bar{x}^{(t)}$ and $A^\top\bar{y}^{(t)}$, where $\bar{x}^{(t)}$ and $\bar{y}^{(t)}$ are updated using the momentum steps (13) and (11), which are commonly used to accelerate gradient (AG) methods [22, 23]. Algorithm 1 requires three control parameters $\tau$, $\sigma$ and $\theta$, and its convergence is ensured by a proper choice of these parameters, as shown in Theorem 1. The proofs of all theorems are deferred to the Appendix (Section A).
Remark 1 For a given $\theta$, the values of $\tau$ and $\sigma$ can be solved from the last two equations of (14) in closed form.
According to the convergence rate above, the best choice of $\theta$ is the smallest value allowed by (14). Although the exact computation of $\Lambda_{m,q}$ by definition (8) may be costly, for instance, when $1<m<n$ or $1<q<p$, it is tractable when $m$ and $q$ are close to $1$ or close to $n$ and $p$, respectively. In practice, we suggest choosing one of these tractable quantities as an approximation of $\Lambda_{m,q}$, which provides reasonably good empirical performance (see Section 5).
Besides the distance to the saddle point $(x^\star,y^\star)$, a useful quality measure for a solution $(x^{(t)},y^{(t)})$ is its primal-dual objective gap $P(x^{(t)})-D(y^{(t)})$, because it can be evaluated in each iteration and used as a stopping criterion in practice. The next theorem establishes the convergence rate of the primal-dual objective gap ensured by DSPDC.
The resulting iteration complexity is interesting since it matches the optimal iteration complexity of dual coordinate methods such as SPDC and others [33, 16] when $(q,m)=(p,1)$, and also matches the optimal iteration complexity of primal coordinate methods [17, 16] when $(q,m)=(1,n)$.
To implement Algorithm 1 efficiently, we just need to maintain and efficiently update either $A^\top\bar{y}^{(t)}$ or $A\bar{x}^{(t)}$, depending on whether $n$ or $p$ is larger. If $n\ge p$, we should maintain $A^\top\bar{y}^{(t)}$ during the algorithm, which is used in (12) and can be updated in $O(mp)$ time. We then directly compute $A_{i\cdot}\bar{x}^{(t)}$ for $i\in I$ in (10) in $O(mp)$ time. In fact, this is how SPDC is implemented, where $(q,m)=(p,1)$. On the other hand, if $n<p$, it is more efficient to maintain $A\bar{x}^{(t)}$, update it in $O(nq)$ time, and compute $(A^\top\bar{y}^{(t)})_j$ for $j\in J$ in (12) in $O(nq)$ time. Hence, the per-iteration cost of DSPDC is $O(mp)$ when $n\ge p$ and $O(nq)$ when $n<p$, and its overall complexity for finding an $\epsilon$-optimal solution is this cost multiplied by the iteration complexity above. Since the per-iteration cost of SPDC is $O(p)$, the DSPDC method is not more efficient for a general data matrix. However, in the next section, we show that DSPDC has an efficient implementation for a factorized data matrix, which leads to a lower overall complexity than SPDC with the same implementation.
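The bookkeeping described above can be sketched as follows: after $m$ dual coordinates change, the maintained vector $A^\top y$ is refreshed in $O(mp)$ time rather than recomputed from scratch in $O(np)$ time:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m = 1000, 50, 3
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)
u = A.T @ y                     # maintained vector A^T y, computed once

I = rng.choice(n, size=m, replace=False)   # m sampled dual coordinates
delta = rng.standard_normal(m)
y[I] += delta
u += A[I].T @ delta             # O(mp) refresh instead of O(np) recompute

assert np.allclose(u, A.T @ y)  # the maintained vector stays consistent
```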
2.2 Efficient Implementation for Factorized Data Matrix
In this section, we assume that the data matrix $A$ in (5) has a factorized structure $A=UV$, where $U\in\mathbb{R}^{n\times d}$ and $V\in\mathbb{R}^{d\times p}$ with $d\le\min\{n,p\}$. Such a matrix is often obtained as a low-rank or denoised approximation of a raw data matrix. Recently, there has been a surge of interest in using factorized data to alleviate the computational cost of handling big data. For example, Pham & Ghaoui proposed to use a low-rank approximation of the data matrix to solve multiple instances of lasso problems. For solving big-data kernel learning problems, the Nyström method, which approximates a kernel matrix $K\in\mathbb{R}^{n\times n}$ by $CW^{\dagger}C^\top$ with $C\in\mathbb{R}^{n\times l}$, $W\in\mathbb{R}^{l\times l}$ and $l\ll n$, has become popular. Moreover, recent advances in fast randomized algorithms for finding a low-rank approximation of a matrix render the proposed coordinate optimization algorithm more attractive for tackling factorized big-data problems.
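A small sketch of the Nyström construction (the RBF kernel and landmark sampling below are illustrative choices); a useful sanity check is that the approximation reproduces the sampled landmark block exactly:

```python
import numpy as np

rng = np.random.default_rng(4)
n, l = 200, 40
Z = rng.standard_normal((n, 5))
# RBF kernel matrix of 200 points (illustrative choice of kernel and data).
K = np.exp(-0.5 * ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))

S = rng.choice(n, size=l, replace=False)   # landmark indices
C = K[:, S]                                # n x l
W = K[np.ix_(S, S)]                        # l x l
K_nys = C @ np.linalg.pinv(W) @ C.T        # rank <= l, factorized approximation

# Nystrom is exact on the landmark block: W pinv(W) W = W for symmetric PSD W.
assert np.allclose(K_nys[np.ix_(S, S)], W, atol=1e-6)
```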
Factorized data also appears often in sparse recovery problems derived from randomized feature reduction or randomized instance reduction of (1). The sparse recovery problem from randomized feature reduction can also be formulated in the form of (5), where $A\in\mathbb{R}^{n\times p}$ is the original raw data matrix, $G\in\mathbb{R}^{d\times p}$ is a random measurement matrix with $d\ll p$, and the actual data matrix for (5) is $AG^\top G$, which factorizes with $U=AG^\top$ and $V=G$. This approximation approach has been employed to reduce the computational cost of solving underconstrained least-squares problems. Similarly, randomized instance reduction can be applied by replacing $A$ with $G'^{\top}G'A$, where $G'\in\mathbb{R}^{d\times n}$ is a random measurement matrix with $d\ll n$, and the data matrix factorizes with $U=G'^{\top}$ and $V=G'A$.
To solve (5) with $A=UV$, we implement DSPDC by maintaining the vectors $V\bar{x}^{(t)}\in\mathbb{R}^d$ and $U^\top\bar{y}^{(t)}\in\mathbb{R}^d$ and updating them in $O(qd)$ and $O(md)$ time, respectively, in each iteration. Then, we can obtain $A_{i\cdot}\bar{x}^{(t)}$ in (10) in $O(md)$ time by evaluating $U_{i\cdot}(V\bar{x}^{(t)})$ for each $i\in I$, where $U_{i\cdot}$ is the $i$-th row of $U$. Similarly, we can obtain $(A^\top\bar{y}^{(t)})_j$ in (12) in $O(qd)$ time by taking $(U^\top\bar{y}^{(t)})^\top V_{\cdot j}$ for each $j\in J$, where $V_{\cdot j}$ is the $j$-th column of $V$. This leads to the efficient implementation of DSPDC described in Algorithm 2, whose per-iteration cost is $O((m+q)d)$, lower than the $O(mp)$ or $O(nq)$ cost when $A$ is not factorized. A similar efficient implementation can also be applied to other coordinate methods to obtain a lower computational cost per iteration. To make a clear comparison between DSPDC and other methods when applied to factorized data, we summarize their numbers of iterations and per-iteration costs in Table 1. Here, we assume $d\le\min\{n,p\}$ and omit all the big-$O$ notations for simplicity.
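The factorized implementation rests on the identities $A_{i\cdot}x=U_{i\cdot}(Vx)$ and $(A^\top y)_j=(U^\top y)^\top V_{\cdot j}$, so each sampled coordinate costs $O(d)$ once the two $d$-dimensional vectors are maintained:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, d = 500, 400, 10
U = rng.standard_normal((n, d))
V = rng.standard_normal((d, p))
x = rng.standard_normal(p)
y = rng.standard_normal(n)

vx = V @ x      # maintained in O(qd) per iteration when q coordinates change
uy = U.T @ y    # maintained in O(md) per iteration when m coordinates change

i, j = 7, 11
row_pred = U[i] @ vx        # a_i^T x in O(d) instead of O(p)
col_pred = uy @ V[:, j]     # (A^T y)_j in O(d) instead of O(n)

assert np.allclose(row_pred, (U @ V)[i] @ x)
assert np.allclose(col_pred, (U @ V).T[j] @ y)
```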
Input: $x^{(0)}$, $y^{(0)}$, and parameters $\tau$, $\sigma$ and $\theta$.
Initialize: $\bar{x}^{(0)}=x^{(0)}$, $\bar{y}^{(0)}=y^{(0)}$, $V\bar{x}^{(0)}$, $U^\top\bar{y}^{(0)}$.
Uniformly and randomly choose index sets $J\subseteq[p]$ and $I\subseteq[n]$ of sizes $q$ and $m$, respectively.
Table 1. Columns: Algorithm, Num. of Iter., Per-Iter. Cost, Overall Compl.
According to the last column of Table 1, the efficient implementation of DSPDC has an overall complexity always lower than SDCA, SPDC and other methods comparable to them, no matter whether $n\ge p$ or $n<p$.
3 Extension with Block Coordinate Updates
3.1 Algorithm and Convergence Properties
With block-wise sampling and updates, DSPDC can be easily generalized and applied to the bilinear saddle-point problem (5) with a block-decomposable structure, and a similar linear convergence rate can be obtained. Although this is a straightforward extension, it is worth showing that, when the proximal mapping on each block is computationally expensive, our DSPDC can achieve a lower overall complexity than other coordinate methods.
We partition the space $\mathbb{R}^p$ into $\bar{p}$ subspaces as $\mathbb{R}^p=\mathbb{R}^{p_1}\times\cdots\times\mathbb{R}^{p_{\bar{p}}}$ such that $\sum_{k=1}^{\bar{p}}p_k=p$, and partition the space $\mathbb{R}^n$ into $\bar{n}$ subspaces as $\mathbb{R}^n=\mathbb{R}^{n_1}\times\cdots\times\mathbb{R}^{n_{\bar{n}}}$ such that $\sum_{l=1}^{\bar{n}}n_l=n$. With a little abuse of notation, we represent the corresponding partitions of $x$ and $y$ as $x=(x_1,\dots,x_{\bar{p}})$ with $x_k\in\mathbb{R}^{p_k}$ for $k\in[\bar{p}]$ and $y=(y_1,\dots,y_{\bar{n}})$ with $y_l\in\mathbb{R}^{n_l}$ for $l\in[\bar{n}]$, respectively.
We consider the following bilinear saddle-point problem:
$$\min_{x\in\mathbb{R}^p}\max_{y\in\mathbb{R}^n}\; \frac{1}{n}y^\top Ax + \sum_{k=1}^{\bar{p}}g_k(x_k) - \frac{1}{n}\sum_{l=1}^{\bar{n}}\phi_l^*(y_l), \qquad (25)$$
where $g_k$ and $\phi_l^*$ are functions of $x_k$ and $y_l$, respectively. Moreover, we assume $g_k$ and $\phi_l^*$ are strongly convex with strong convexity parameters $\lambda$ and $\gamma$, respectively. Due to the partitions of $x$ and $y$, we partition the matrix $A$ into $\bar{n}\times\bar{p}$ blocks accordingly, so that
$$A=\begin{pmatrix}A_{11}&\cdots&A_{1\bar{p}}\\ \vdots&\ddots&\vdots\\ A_{\bar{n}1}&\cdots&A_{\bar{n}\bar{p}}\end{pmatrix},$$
where $A_{lk}\in\mathbb{R}^{n_l\times p_k}$ is the block of $A$ corresponding to $y_l$ and $x_k$.
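A block partition of $A$ of this kind can be indexed with cumulative offsets; a small sketch (the block sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n_sizes, p_sizes = [3, 2], [2, 2, 1]       # block sizes summing to n and p
n, p = sum(n_sizes), sum(p_sizes)
A = rng.standard_normal((n, p))

ro = np.cumsum([0] + n_sizes)              # row offsets of the blocks
co = np.cumsum([0] + p_sizes)              # column offsets of the blocks

def block(l, k):
    # The block A_{lk} corresponding to y_l and x_k.
    return A[ro[l]:ro[l + 1], co[k]:co[k + 1]]

assert block(1, 2).shape == (n_sizes[1], p_sizes[2])
# The blocks tile A exactly.
assert np.allclose(np.block([[block(l, k) for k in range(3)]
                             for l in range(2)]), A)
```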
We want to point out that problem (25) can be defined directly and solved as an independent problem, rather than as the saddle-point formulation of an ERM problem like (1). In other words, the function $\phi_l^*$ does not have to be the conjugate of a loss function $\phi_l$. We still use the same notation as in (5) so that readers can easily compare (25) with (5).
where $A_{IJ}$ is the sub-matrix of $A$ consisting of the blocks $A_{lk}$ with $l\in I$ and $k\in J$.
For this extension, convergence results similar to Theorem 1 and Theorem 2 can be derived with almost the same proofs; we skip the proofs and state the results directly. To find a pair of primal-dual solutions for (25) that either has an $\epsilon$-distance to the optimal solution or has an $\epsilon$-primal-dual objective gap, the number of iterations Algorithm 1 needs takes the same form as in Theorems 1 and 2, with the coordinate-wise quantities replaced by their block-wise counterparts.