1 Introduction
We consider regularized empirical risk minimization (ERM) problems of the following form:
(1) $\min_{x\in\mathbb{R}^p} \left\{ P(x) := \frac{1}{n}\sum_{i=1}^n \phi_i(a_i^\top x) + g(x) \right\},$
where $a_1,\dots,a_n \in \mathbb{R}^p$ are data points with $p$ features,
$\phi_i$ is a convex loss function of the linear predictor
$a_i^\top x$, for $i=1,\dots,n$, and $g$ is a convex regularization function for the coefficient vector
$x \in \mathbb{R}^p$ in the linear predictor. We assume $g$ has a decomposable structure, namely,
(2) $g(x) = \sum_{j=1}^p g_j(x_j),$
where $g_j$ is only a function of $x_j$, the $j$th coordinate of $x$. We further assume that $g_j$ is $\lambda$-strongly convex for $j=1,\dots,p$, i.e., $g_j(\beta) \ge g_j(\alpha) + g_j'(\alpha)(\beta-\alpha) + \frac{\lambda}{2}(\beta-\alpha)^2$ for any $\alpha,\beta\in\mathbb{R}$, and the function $\phi_i$ is $\frac{1}{\gamma}$-smooth for $i=1,\dots,n$, i.e., $|\phi_i'(\alpha) - \phi_i'(\beta)| \le \frac{1}{\gamma}|\alpha-\beta|$ for any $\alpha,\beta\in\mathbb{R}$. For simplicity, we first consider univariate coordinates $x_j$ and $y_i$. In Section 3, the proposed method will be generalized to problems with a blockwise decomposable structure and multivariate blocks $x_j$ and $y_i$.
The problem (1) has many applications in business analytics, statistics, machine learning and data mining, and has triggered many studies in the optimization community. Typically, for each data point $a_i$, there is an associated response value $b_i$, which can be continuous (in regression problems) or discrete (in classification problems). Examples of the loss function $\phi_i$ associated with $(a_i, b_i)$ include: the square loss, where $b_i \in \mathbb{R}$ and $\phi_i(z) = \frac{1}{2}(z-b_i)^2$; the logistic loss, where $b_i \in \{1,-1\}$ and $\phi_i(z) = \log(1+\exp(-b_i z))$; and the smooth hinge loss, where $b_i \in \{1,-1\}$ and
(3) $\phi_i(z) = \begin{cases} 0 & \text{if } b_i z \ge 1, \\ 1 - b_i z - \frac{\gamma}{2} & \text{if } b_i z \le 1-\gamma, \\ \frac{(1-b_i z)^2}{2\gamma} & \text{otherwise.} \end{cases}$
Commonly used regularization terms include the $\ell_2$ regularization $g(x) = \frac{\lambda}{2}\|x\|^2$ with $\lambda > 0$ and the elastic-net regularization $g(x) = \lambda_1\|x\|_1 + \frac{\lambda}{2}\|x\|^2$ with $\lambda_1, \lambda > 0$.
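For concreteness, the losses and the $\ell_2$ regularizer above can be written down directly. The following NumPy sketch is illustrative only; the function names are ours, and `gamma` denotes the smoothing parameter of the smooth hinge loss.

```python
import numpy as np

def square_loss(z, b):
    # phi_i(z) = (z - b_i)^2 / 2
    return 0.5 * (z - b) ** 2

def logistic_loss(z, b):
    # phi_i(z) = log(1 + exp(-b_i z)), with b_i in {+1, -1}
    return np.log1p(np.exp(-b * z))

def smooth_hinge_loss(z, b, gamma=1.0):
    # phi_i(z) = 0                       if b_i z >= 1
    #          = 1 - b_i z - gamma/2     if b_i z <= 1 - gamma
    #          = (1 - b_i z)^2/(2 gamma) otherwise
    t = b * z
    return np.where(t >= 1.0, 0.0,
           np.where(t <= 1.0 - gamma, 1.0 - t - gamma / 2.0,
                    (1.0 - t) ** 2 / (2.0 * gamma)))

def l2_reg(x, lam):
    # g(x) = (lam/2) ||x||^2, decomposable across coordinates
    return 0.5 * lam * np.dot(x, x)
```

Each of these losses is $\frac{1}{\gamma}$-smooth for an appropriate $\gamma$, and the $\ell_2$ regularizer is $\lambda$-strongly convex coordinate-wise, matching the assumptions above.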
We often call (1) the primal problem; its conjugate dual problem is
(4) $\max_{y\in\mathbb{R}^n} \left\{ D(y) := -\frac{1}{n}\sum_{i=1}^n \phi_i^*(y_i) - g^*\!\left(-\frac{1}{n}A^\top y\right) \right\},$
where $A = [a_1, a_2, \dots, a_n]^\top \in \mathbb{R}^{n\times p}$, and $\phi_i^*$ and $g^*$ are the convex conjugates of $\phi_i$ and $g$, respectively. The saddle-point problem associated with (1) and (4) is
(5) $\min_{x\in\mathbb{R}^p} \max_{y\in\mathbb{R}^n} \left\{ f(x,y) := \frac{1}{n} y^\top A x + g(x) - \frac{1}{n}\sum_{i=1}^n \phi_i^*(y_i) \right\}.$
Let $x^\star$ and $y^\star$ be the optimal solutions of (1) and (4), respectively. It is known that the pair $(x^\star, y^\star)$ is a saddle point of (5) in the sense that
(6) $x^\star = \arg\min_{x\in\mathbb{R}^p} f(x, y^\star),$
(7) $y^\star = \arg\max_{y\in\mathbb{R}^n} f(x^\star, y).$
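As a sanity check of the primal-dual relationship, the sketch below instantiates (1) and (4) for the square loss with $\ell_2$ regularization, where the conjugates are $\phi_i^*(\beta) = \frac{1}{2}\beta^2 + b_i\beta$ and $g^*(v) = \frac{1}{2\lambda}\|v\|^2$. The closed forms are our own derivation for this special case; the check verifies weak duality $D(y) \le P(x)$ and that the gap vanishes at the saddle point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 5, 3, 0.5
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

def P(x):
    # primal objective: (1/2n)||Ax - b||^2 + (lam/2)||x||^2
    return 0.5 * np.sum((A @ x - b) ** 2) / n + 0.5 * lam * np.dot(x, x)

def D(y):
    # dual objective: -(1/n) sum_i phi_i^*(y_i) - g^*(-(1/n) A^T y)
    v = -(A.T @ y) / n
    return -np.sum(0.5 * y ** 2 + b * y) / n - np.dot(v, v) / (2.0 * lam)

# closed-form saddle point: x* solves (A^T A/n + lam I) x = A^T b/n, y* = A x* - b
xs = np.linalg.solve(A.T @ A / n + lam * np.eye(p), A.T @ b / n)
ys = A @ xs - b
```

At $(x^\star, y^\star)$ the primal and dual objectives coincide, while any other pair exhibits a strictly positive gap, which is exactly the quantity the convergence analysis later bounds.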
In this paper, we propose a doubly stochastic primal-dual coordinate (DSPDC) method for solving problem (5) that randomly samples $q$ out of $p$ primal and $m$ out of $n$ dual coordinates to update in each iteration. We show that the DSPDC method generates a sequence of primal-dual iterates that linearly converges to $(x^\star, y^\star)$, and that the primal-dual objective gap along this sequence also linearly converges to zero. In addition, we generalize this approach to bilinear saddle-point problems with a blockwise decomposable structure, and provide upper and lower bounds on its iteration complexity for finding an $\epsilon$-optimal solution. We also discuss the situations in which the proposed methods are superior to existing coordinate methods.
1.1 Notation
Let $[d]$ represent the set $\{1,2,\dots,d\}$. For $v \in \mathbb{R}^d$, let $v_j$ be its $j$th coordinate for $j \in [d]$ and $v_Q$ be the subvector of $v$ that consists of the coordinates of $v$ indexed by a set $Q \subset [d]$. Given an $n\times p$ matrix $A$, we denote its $i$th row and $j$th column by $A_i$ and $A^j$, respectively. For $M \subset [n]$ and $Q \subset [p]$, the matrices $A_M$ and $A^Q$ represent the submatrices of $A$ that consist of the rows indexed by $M$ and the columns indexed by $Q$, respectively. We denote the entry of $A$ in the $i$th row and $j$th column by $A_i^j$ and let $A_M^Q$ be the submatrix of $A$ where the rows indexed by $M$ intersect with the columns indexed by $Q$.
Let $\langle\cdot,\cdot\rangle$ be the inner product in a Euclidean space, $\|\cdot\|$ be the $\ell_2$ norm of a vector and $\|\cdot\|_2$ be the spectral norm of a matrix. For integers $1 \le m \le n$ and $1 \le q \le p$, we define $\Lambda_{m,q}$ as a scale constant of the data $A$ as follows:
(8) $\Lambda_{m,q} := \max_{M\subset[n],\,|M|=m,\;Q\subset[p],\,|Q|=q} \|A_M^Q\|_2^2.$
The maximum $\ell_2$ norm of the data points is therefore $R := \max_{i\in[n]}\|a_i\| = \sqrt{\Lambda_{1,p}}$. The condition number of problems (1), (4), and (5) is usually defined as
(9) $\kappa := \frac{R^2}{\lambda\gamma},$
which affects the iteration complexity of many first-order methods.
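Under the definition sketched above, with $R$ the largest data-point norm, $\kappa$ is straightforward to compute from the data. The snippet below is a minimal illustration under that assumption; `lam` and `gamma` are the strong convexity and smoothness parameters.

```python
import numpy as np

def condition_number(A, lam, gamma):
    # R^2 = max_i ||a_i||^2 is the largest squared row norm of the data
    # matrix; kappa = R^2 / (lam * gamma) for lam-strongly convex g_j
    # and (1/gamma)-smooth phi_i.
    R2 = np.max(np.sum(A ** 2, axis=1))
    return R2 / (lam * gamma)
```

For instance, for a dataset whose largest data-point norm is $R = 5$, with $\lambda = 1$ and $\gamma = 2$, this gives $\kappa = 12.5$.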
1.2 Contributions and related work
For solving problem (1) or (5), deterministic first-order methods such as [22, 23, 21, 41] usually have to compute a full gradient in each iteration, which requires going through all features of all instances and can be inefficient for big data. Therefore, stochastic optimization methods that sample one instance or one feature per iteration have become more popular. There are two major categories of stochastic optimization algorithms that have been actively studied in recent years: stochastic gradient methods and stochastic coordinate methods. The DSPDC method we propose belongs to the second category.
Recently, there has been increasing interest in stochastic variance reduced gradient (SVRG) methods [12, 39, 25, 14]. SVRG runs in multiple stages. At each stage, it computes a full gradient and then performs iterative updates with stochastic gradients constructed from sampled instances. Since the full gradient is computed only once in each stage, SVRG has a lower per-iteration cost than deterministic gradient methods, and it needs $O((n+\kappa)\log(1/\epsilon))$ iterations to find an $\epsilon$-optimal solution for problem (1). The stochastic incremental gradient method [31, 30, 6, 7, 19, 15] has also been widely studied in the recent literature. Different from SVRG, a stochastic incremental gradient method computes a full gradient only once at the beginning, but maintains and updates the average of historical stochastic gradients using one sampled instance per iteration. Most stochastic incremental gradient methods have the same per-iteration cost as SVRG and also need $O((n+\kappa)\log(1/\epsilon))$ iterations to find an $\epsilon$-optimal solution, except the recent RPDG method by Lan & Zhou [15], which needs only $O((n+\sqrt{n\kappa})\log(1/\epsilon))$ iterations. This iteration complexity achieved by RPDG is proved to be optimal [15].
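To make the SVRG scheme concrete, here is a minimal sketch for ridge regression. This is our own toy instantiation, not the tuned algorithm of the cited works; the step size `eta`, the stage length, and the data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 20, 5, 0.1
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

def full_grad(x):
    # gradient of P(x) = (1/2n)||Ax - b||^2 + (lam/2)||x||^2
    return A.T @ (A @ x - b) / n + lam * x

def svrg(x0, eta=0.02, stages=30, inner=40):
    x_tilde = x0.copy()
    for _ in range(stages):
        mu = full_grad(x_tilde)          # full gradient, once per stage
        x = x_tilde.copy()
        for _ in range(inner):
            i = rng.integers(n)
            # variance-reduced stochastic gradient built from instance i
            gi = A[i] * (A[i] @ x - b[i]) + lam * x
            gi_tilde = A[i] * (A[i] @ x_tilde - b[i]) + lam * x_tilde
            x -= eta * (gi - gi_tilde + mu)
        x_tilde = x
    return x_tilde

# closed-form ridge solution for reference
x_star = np.linalg.solve(A.T @ A / n + lam * np.eye(p), A.T @ b / n)
```

The correction term `gi - gi_tilde + mu` is unbiased for the full gradient and its variance shrinks as the iterates approach the reference point, which is what permits a constant step size and linear convergence.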
In contrast to stochastic gradient methods, stochastic coordinate methods work by updating randomly sampled coordinates of the decision variables [24, 29, 32, 10, 17, 4, 16, 8]. Shalev-Shwartz & Zhang [34, 33] proposed the stochastic dual coordinate ascent (SDCA) method to solve the dual formulation (4). SDCA has an iteration complexity of $O((n+\kappa)\log(1/\epsilon))$ and has been further improved to the accelerated SDCA (ASDCA) method [33], which achieves the optimal complexity $O((n+\sqrt{n\kappa})\log(1/\epsilon))$ up to a logarithmic term. This optimal complexity is also obtained by the accelerated proximal coordinate gradient (APCG) method of Lin et al. [16] when it is applied to the dual problem (4). Extending the deterministic algorithm of [2] for saddle-point problems, Zhang & Xiao [42] recently proposed a stochastic primal-dual coordinate (SPDC) method for (5), which alternates between maximizing over a randomly chosen dual variable and minimizing over all primal variables, and also achieves the optimal iteration complexity.
Note that, when applied to the primal problem (1), APCG samples one feature of the data in each iterative update and finds an $\epsilon$-optimal solution after $O((p+\sqrt{p\kappa})\log(1/\epsilon))$ iterations, which is also optimal according to [15].
Some recent works [43, 3, 13, 20, 5, 8] have attempted to combine stochastic gradients with stochastic coordinates. The authors of [43, 20, 5] proposed randomized block coordinate methods, which utilize a stochastic partial gradient of the selected block based on randomly sampled instances and features in each iteration. However, these methods face a non-diminishing variance of the stochastic partial gradient, so they need $O(1/\epsilon)$ iterations. These techniques are further improved in [13, 43] with stochastic variance reduced partial gradients, but still obtain only a suboptimal iteration complexity.
Although the aforementioned stochastic coordinate methods have achieved strong performance on the ERM problem (5), they either sample only over primal coordinates or only over dual coordinates to update in each iteration. It is therefore natural to ask the following three questions.

What is the iteration complexity of a coordinate method for problem (5) that samples both primal and dual coordinates to update in each iteration?

When is this type of algorithm better than purely primal and purely dual coordinate methods?

What is the limit of the power of this type of algorithm?
To contribute answers to these questions, we propose the DSPDC method, which samples over both features and instances of the dataset by randomly choosing the associated primal and dual coordinates to update in each iteration. This method generalizes the SPDC method, which samples the dual coordinates only.
To answer the first question, we show that, if $q$ primal and $m$ dual coordinates are sampled and updated in each iteration, the number of iterations DSPDC needs to find an $\epsilon$-optimal solution for (5) is $O\big(\big(\frac{n}{m} + \frac{p}{q} + \sqrt{\frac{np\kappa}{mq}}\big)\log(1/\epsilon)\big)$. This iteration complexity is interesting since it matches the optimal iteration complexity of dual coordinate methods [33, 16, 42] when $q = p$ and $m = 1$, and also matches the optimal iteration complexity of primal coordinate methods [17, 16] when $q = 1$ and $m = n$. We further generalize DSPDC and its complexity to a bilinear saddle-point problem with a blockwise decomposable structure.
To study the second question, we compare different coordinate algorithms based on an overall complexity, defined as the number of iterations an algorithm needs to find an $\epsilon$-optimal solution multiplied by the computational cost of each iteration. For example, the per-iteration cost of SPDC [42] for solving ERM is $O(p)$, so its overall complexity is $O\big(p(n+\sqrt{n\kappa})\log(1/\epsilon)\big)$. When $q < p$, the per-iteration cost of DSPDC is $O(pm)$ or $O(nq)$ due to an unavoidable full-dimensional inner product in the algorithm, so the overall complexity of DSPDC is not lower than that of SPDC in general.
Nevertheless, we identify two cases in which DSPDC has a lower overall complexity than SPDC and other existing coordinate methods. The first case is when $A$ has a factorized structure, meaning that $A = UV$ with $U \in \mathbb{R}^{n\times d}$, $V \in \mathbb{R}^{d\times p}$ and $d < \min(n,p)$. In this case, choosing $m = q = 1$ and using an efficient implementation, our DSPDC has a lower overall complexity than SPDC with the same efficient implementation. Motivations and examples of ERM with factorized data are provided in Section 2.2. The second case is when solving a blockwise decomposable bilinear saddle-point problem where the proximal mapping on each block is computationally expensive. For example, when each block of variables has to be a positive semidefinite matrix, the proximal mapping involves an eigenvalue decomposition, whose complexity is cubic in the block size in general. When $q = 1$ and $m = n$, DSPDC requires solving an eigenvalue decomposition for only one block of primal variables per iteration, so its overall complexity is lower than that of SPDC, which performs the decomposition for every primal block in each iteration, as shown in Section 3.
Last but not least, we contribute an answer to the third question by providing a theoretical lower bound on the number of iterations for a family of primal-dual block coordinate methods for strongly convex bilinear saddle-point problems. We show that any coordinate algorithm that randomly updates one primal and one dual coordinate per iteration, i.e., $m = q = 1$, needs $\Omega\big(\sqrt{np\kappa}\log(1/\epsilon)\big)$ iterations to find an $\epsilon$-optimal solution. This indicates that the $\sqrt{\frac{np\kappa}{mq}}$ component in the complexity of the DSPDC method is optimal.
2 Doubly Stochastic PrimalDual Coordinate Method
2.1 Algorithm and Convergence Properties
(10) $y_i^{(t+1)} = \begin{cases} \arg\max_{\beta\in\mathbb{R}} \big\{ \beta (A\bar{x}^{(t)})_i - \phi_i^*(\beta) - \frac{1}{2\sigma}(\beta - y_i^{(t)})^2 \big\} & \text{if } i \in M, \\ y_i^{(t)} & \text{if } i \notin M, \end{cases}$
(11) $\bar{y}^{(t+1)} = y^{(t)} + \frac{n}{m}\big(y^{(t+1)} - y^{(t)}\big),$
(12) $x_j^{(t+1)} = \begin{cases} \arg\min_{\alpha\in\mathbb{R}} \big\{ \frac{\alpha}{n} (A^\top\bar{y}^{(t+1)})_j + g_j(\alpha) + \frac{1}{2\tau}(\alpha - x_j^{(t)})^2 \big\} & \text{if } j \in Q, \\ x_j^{(t)} & \text{if } j \notin Q, \end{cases}$
(13) $\bar{x}^{(t+1)} = x^{(t+1)} + \theta\,\frac{p}{q}\big(x^{(t+1)} - x^{(t)}\big).$
In this section, we propose the doubly stochastic primal-dual coordinate (DSPDC) method, given in Algorithm 1, for problem (5). In Algorithm 1, the primal and dual solutions are updated as in (12) and (10) in the randomly selected $q$ and $m$ coordinates indexed by $Q$ and $M$, respectively.² These updates utilize the first-order information provided by the vectors $A\bar{x}^{(t)}$ and $A^\top\bar{y}^{(t+1)}$, where $\bar{x}^{(t)}$ and $\bar{y}^{(t+1)}$ are updated using the momentum steps (13) and (11), which are commonly used to accelerate gradient (AG) methods [22, 23]. Algorithm 1 requires three control parameters $\tau$, $\sigma$ and $\theta$, and its convergence is ensured by a proper choice of these parameters, as shown in Theorem 1. The proofs of all theorems are deferred to the Appendix (Section A).
²Here, we hide the dependency of $Q$ and $M$ on $t$ to simplify the notation.
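For the square loss with $\ell_2$ regularization, where $\phi_i^*(\beta) = \frac{1}{2}\beta^2 + b_i\beta$ and $g_j(\alpha) = \frac{\lambda}{2}\alpha^2$, the coordinate proximal steps in (10) and (12) admit closed forms. The sketch below is our own derivation for this special case; `u` stands for the sampled entry of $A\bar{x}$ and `v` for the sampled entry of $A^\top\bar{y}$.

```python
def dual_prox_step(u, y_i, b_i, sigma):
    # argmax_beta { beta*u - (beta^2/2 + b_i*beta) - (beta - y_i)^2/(2*sigma) }
    # first-order condition: u - beta - b_i - (beta - y_i)/sigma = 0
    return (u - b_i + y_i / sigma) / (1.0 + 1.0 / sigma)

def primal_prox_step(v, x_j, lam, n, tau):
    # argmin_alpha { alpha*v/n + lam*alpha^2/2 + (alpha - x_j)^2/(2*tau) }
    # first-order condition: v/n + lam*alpha + (alpha - x_j)/tau = 0
    return (x_j / tau - v / n) / (lam + 1.0 / tau)
```

Because both subproblems are strongly concave or strongly convex scalar problems, each step costs $O(1)$ once the corresponding entry of $A\bar{x}$ or $A^\top\bar{y}$ is available, which is why the bookkeeping of these two vectors dominates the per-iteration cost.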
Theorem 1.
Remark 1. For a given $\theta$, the values of $\tau$ and $\sigma$ can be solved from the last two equations of (14) in closed form:
(15)  
(16) 
According to the convergence rate above, the best choice of $\theta$ depends on the scale constant $\Lambda_{m,q}$. Although the exact computation of $\Lambda_{m,q}$ by definition (8) may be costly, for instance, when $1 < m < n$ or $1 < q < p$, it is tractable when $m$ and $q$ are close to $1$ or close to $n$ and $p$. In practice, we suggest choosing an easily computable interpolation between these tractable cases as an approximation of $\Lambda_{m,q}$, which provides reasonably good empirical performance (see Section 5).
Besides the distance to the saddle point $(x^\star, y^\star)$, a useful quality measure for the solution $(x^{(t)}, y^{(t)})$ is its primal-dual objective gap, $P(x^{(t)}) - D(y^{(t)})$, because it can be evaluated in each iteration and used as a stopping criterion in practice. The next theorem establishes the convergence rate of the primal-dual objective gap ensured by DSPDC.
Theorem 2.
According to Theorems 1 and 2, in order to obtain a pair of primal and dual solutions whose expected distance to $(x^\star, y^\star)$ is at most $\epsilon$, i.e., $\mathbb{E}\|x^{(t)} - x^\star\|^2 \le \epsilon$ and $\mathbb{E}\|y^{(t)} - y^\star\|^2 \le \epsilon$, or whose expected objective gap is at most $\epsilon$, Algorithm 1 needs
$O\big(\big(\frac{n}{m} + \frac{p}{q} + \sqrt{\frac{np\kappa}{mq}}\big)\log(1/\epsilon)\big)$
iterations. This iteration complexity is interesting since it matches the optimal iteration complexity of dual coordinate methods such as SPDC [42] and others [33, 16] when $q = p$ and $m = 1$, and also matches the optimal iteration complexity of primal coordinate methods [17, 16] when $q = 1$ and $m = n$.
To implement Algorithm 1 efficiently, we just need to maintain and efficiently update either $A\bar{x}^{(t)}$ or $A^\top\bar{y}^{(t)}$, depending on whether $p$ or $n$ is larger. If $n \ge p$, we should maintain $A^\top\bar{y}^{(t)}$ during the algorithm, which is used in (12) and can be updated in $O(pm)$ time. We then directly compute $(A\bar{x}^{(t)})_i$ for $i \in M$ in (10) in $O(pm)$ time. In fact, this is how SPDC is implemented in [42], where $m = 1$ and $q = p$. On the other hand, if $n < p$, it is more efficient to maintain $A\bar{x}^{(t)}$, update it in $O(nq)$ time, and compute $(A^\top\bar{y}^{(t)})_j$ for $j \in Q$ in (12) in $O(nq)$ time. Hence, the per-iteration cost of DSPDC is $O(pm)$ when $n \ge p$ and $O(nq)$ when $n < p$, so the resulting overall complexity is not lower than that of SPDC for a general data matrix. However, in the next section, we show that DSPDC has an efficient implementation for a factorized data matrix that leads to a lower overall complexity than SPDC with the same implementation.
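The bookkeeping described above can be sketched as follows: after $m$ dual coordinates change, the maintained vector $A^\top y$ is corrected using only the $m$ affected rows of $A$, an $O(pm)$ operation, rather than recomputed from scratch in $O(np)$ time. The dimensions and indices below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 8, 6, 2
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)
s = A.T @ y                      # maintained vector A^T y, computed once in O(np)

M = np.array([1, 4])             # sampled dual coordinates
delta = rng.standard_normal(m)   # change in the sampled coordinates
y[M] += delta
s += A[M].T @ delta              # incremental correction: O(pm), not O(np)
```

The same pattern applies symmetrically to maintaining $A\bar{x}$ when $p > n$, with an $O(nq)$ correction after $q$ primal coordinates change.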
2.2 Efficient Implementation for Factorized Data Matrix
In this section, we assume that the data matrix $A$ in (5) has a factorized structure $A = UV$, where $U \in \mathbb{R}^{n\times d}$ and $V \in \mathbb{R}^{d\times p}$ with $d < \min(n,p)$. Such a matrix is often obtained as a low-rank or denoised approximation of a raw data matrix. Recently, there has been a surge of interest in using factorized data to alleviate the computational cost for big data. For example, Pham & El Ghaoui [27] proposed using a low-rank approximation of the data matrix to solve multiple instances of lasso problems. For solving big-data kernel learning problems, the Nyström method, which approximates a kernel matrix by a low-rank factorization, has become popular [40]. Moreover, recent advances in fast randomized algorithms [11] for finding a low-rank approximation of a matrix render the proposed coordinate optimization algorithm more attractive for tackling factorized big-data problems.
The factorized structure also appears often in the problem of sparse recovery from the randomized feature reduction or randomized instance reduction of (1). The sparse recovery problem from randomized feature reduction can also be formulated as (5):
(18) 
where $\bar{A}$ is the original raw data matrix, $G$ is a random measurement matrix that reduces the feature dimension, and the actual data matrix for (5) is $A = \bar{A}G$, which is factorized with $U = \bar{A}$ and $V = G$. This approximation approach has been employed to reduce the computational cost of solving under-constrained least-squares problems [18]. Similarly, randomized instance reduction [9] can be applied by replacing $\bar{A}$ in (18) with $H\bar{A}$, where $H$ is a random measurement matrix that reduces the number of instances, and the data matrix is $A = H\bar{A}$, factorized with $U = H$ and $V = \bar{A}$.
To solve (5) with $A = UV$, we implement DSPDC by maintaining the vectors $V\bar{x}^{(t)} \in \mathbb{R}^d$ and $U^\top\bar{y}^{(t)} \in \mathbb{R}^d$ and updating them in $O(dq)$ and $O(dm)$ time, respectively, in each iteration. Then, we can obtain $(A\bar{x}^{(t)})_i$ in (10) in $O(dm)$ time by evaluating $\langle U_i, V\bar{x}^{(t)}\rangle$ for each $i \in M$, where $U_i$ is the $i$th row of $U$. Similarly, we can obtain $(A^\top\bar{y}^{(t)})_j$ in (12) in $O(dq)$ time by taking $\langle V^j, U^\top\bar{y}^{(t)}\rangle$ for each $j \in Q$, where $V^j$ is the $j$th column of $V$. This leads to the efficient implementation of DSPDC described in Algorithm 2, whose per-iteration cost is $O(d(m+q))$, lower than the $O(pm)$ or $O(nq)$ cost when $A$ is not factorized. A similar efficient implementation can also be applied to other coordinate methods to obtain a lower computational cost per iteration. To make a clear comparison between DSPDC and other methods applied to factorized data, we summarize their numbers of iterations (for general $m$ and $q$) and per-iteration costs (when $m = q = 1$) in Table 1. Here, we assume $n \ge p > d$ and omit all the big-$O$ notations for simplicity.
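A minimal sketch of this factorized bookkeeping, assuming $A = UV$: the $d$-dimensional vectors $V x$ and $U^\top y$ are maintained incrementally, and sampled entries of $Ax$ and $A^\top y$ are recovered from them in $O(d)$ time each. The sizes and indices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, d = 10, 8, 3
U = rng.standard_normal((n, d))
V = rng.standard_normal((d, p))
x = rng.standard_normal(p)
y = rng.standard_normal(n)
v = V @ x          # maintained d-vector V x
u = U.T @ y        # maintained d-vector U^T y

# sampled entries of A x and A^T y recovered from the factors
i, j = 4, 6
Ax_i = U[i] @ v        # (A x)_i = <U_i, V x>, O(d) per sampled row
ATy_j = V[:, j] @ u    # (A^T y)_j = <V^j, U^T y>, O(d) per sampled column

# after coordinate x_j changes, v is corrected in O(d) instead of
# recomputing V x from scratch in O(dp)
delta = 0.5
x[j] += delta
v += V[:, j] * delta
```

Since every sampled entry and every correction touches only $d$-dimensional quantities, the per-iteration cost becomes $O(d(m+q))$, as claimed above.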
Input: $U$, $V$ and parameters $\tau$, $\sigma$, $\theta$
Initialize: $x^{(0)} = \bar{x}^{(0)}$, $y^{(0)} = \bar{y}^{(0)}$, $V\bar{x}^{(0)}$, $U^\top\bar{y}^{(0)}$
Iterate:
For

Uniformly and randomly choose $Q \subset [p]$ and $M \subset [n]$ of sizes $q$ and $m$, respectively.
(19) (20) (21) (22) (23) (24)
Output: $x^{(T)}$ and $y^{(T)}$
Algorithm  Num. of Iter.  PerIter. Cost  Overall Compl. 
DSPDC  
SPDC [42]  
ASDCA [33]  
APCG [16]  
RPDG [15]  
SDCA [34]  
SVRG [39]  
SAGA [6] 
According to the last column of Table 1, the efficient implementation of DSPDC has an overall complexity that is always lower than those of SDCA, SPDC and the other comparable methods, whether $n \ge p$ or $n < p$.
3 Extension with Block Coordinate Updates
3.1 Algorithm and Convergence Properties
With blockwise sampling and updates, DSPDC can be easily generalized and applied to the bilinear saddle-point problem (5) with a block-decomposable structure, and a similar linear convergence rate can be obtained. Although this is a straightforward extension, it is worth showing that, when the proximal mapping on each block is computationally expensive, our DSPDC can achieve a lower overall complexity than other coordinate methods.
We partition the space $\mathbb{R}^p$ into $\hat{p}$ subspaces as $\mathbb{R}^p = \mathbb{R}^{p_1} \times \cdots \times \mathbb{R}^{p_{\hat{p}}}$ such that $\sum_{j=1}^{\hat{p}} p_j = p$, and partition the space $\mathbb{R}^n$ into $\hat{n}$ subspaces as $\mathbb{R}^n = \mathbb{R}^{n_1} \times \cdots \times \mathbb{R}^{n_{\hat{n}}}$ such that $\sum_{i=1}^{\hat{n}} n_i = n$. With a slight abuse of notation, we represent the corresponding partitions of $x$ and $y$ as $x = (x_1, \dots, x_{\hat{p}})$ with $x_j \in \mathbb{R}^{p_j}$ for $j \in [\hat{p}]$ and $y = (y_1, \dots, y_{\hat{n}})$ with $y_i \in \mathbb{R}^{n_i}$ for $i \in [\hat{n}]$, respectively.
We consider the following bilinear saddle-point problem
(25) 
where $g_j$ and $\phi_i^*$ are functions of the blocks $x_j$ and $y_i$, respectively. Moreover, we assume $g_j$ and $\phi_i^*$ are strongly convex with strong convexity parameters $\lambda$ and $\gamma$, respectively. Due to the partitions on $x$ and $y$, we partition the matrix $A$ into blocks accordingly so that
where $A_i^j$ is the block of $A$ corresponding to $x_j$ and $y_i$.
We want to point out that problem (25) can be defined directly and solved as an independent problem rather than as the saddle-point formulation of an ERM problem like (1). In other words, the function $\phi_i^*$ does not have to be the conjugate of a loss function $\phi_i$. We still use the same notation as in (5) so that readers can easily compare (25) with (5).
It is easy to see that problem (5) is a special case of (25) obtained when $p_j = 1$ for $j \in [p]$ and $n_i = 1$ for $i \in [n]$, so that each block is a single coordinate. The scale constant defined in (8) can be similarly generalized as
(26) 
where $A_M^Q$ is the submatrix of $A$ consisting of the blocks $A_i^j$ with $i \in M$ and $j \in Q$.
Given these correspondences between (5) and (25), DSPDC can be easily extended to solve (25) by replacing (10) and (12) with
(27)  
(28) 
respectively, while $\bar{x}^{(t)}$ and $\bar{y}^{(t)}$ are updated in the same way as in (11) and (13). We call this algorithm the block DSPDC (B-DSPDC) method.
For this extension, convergence results similar to Theorem 1 and Theorem 2 can be derived with almost the same proof; we skip the proofs and directly state the results. To find a pair of primal-dual solutions for (25) that either has an $\epsilon$ distance to the optimal solution or has an $\epsilon$ primal-dual objective gap, the number of iterations the B-DSPDC method needs is