1 Introduction
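We consider the following generic optimization problem. Let $\phi_1,\ldots,\phi_n$ be a sequence of vector convex functions from $\mathbb{R}^d$ to $\mathbb{R}$, and let $g:\mathbb{R}^d\to\mathbb{R}$ be a strongly convex regularization function. Our goal is to solve $\min_{x\in\mathbb{R}^d}P(x)$, where
$$P(x)=\frac{1}{n}\sum_{i=1}^n\phi_i(x)+g(x). \tag{1}$$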
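For example, given a sequence of $n$ training examples $(v_1,y_1),\ldots,(v_n,y_n)$, where $v_i\in\mathbb{R}^d$ and $y_i\in\mathbb{R}$, ridge regression is obtained by setting $g(x)=\frac{\lambda}{2}\|x\|^2$ and $\phi_i(x)=(x^\top v_i-y_i)^2$. Regularized logistic regression is obtained by setting $\phi_i(x)=\log(1+\exp(-y_i x^\top v_i))$.

The dual problem of (1) is defined as follows: For each $i$, let $\phi_i^*:\mathbb{R}^d\to\mathbb{R}$ be the convex conjugate of $\phi_i$, namely, $\phi_i^*(u)=\max_{z\in\mathbb{R}^d}\,(z^\top u-\phi_i(z))$. Similarly, let $g^*$ be the convex conjugate of $g$. The dual problem is:
$$\max_{\alpha\in\mathbb{R}^{d\times n}}D(\alpha)\quad\text{where}\quad D(\alpha)=\frac{1}{n}\sum_{i=1}^n-\phi_i^*(-\alpha_i)-g^*\Big(\frac{1}{n}\sum_{i=1}^n\alpha_i\Big), \tag{2}$$
where for each $i$, $\alpha_i$ is the $i$'th column of the matrix $\alpha$.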
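To make the primal-dual pair concrete, the following minimal sketch evaluates both objectives for ridge regression. It assumes the squared loss $\phi_i(x)=(x^\top v_i-y_i)^2$ and restricts each dual vector to the form $\alpha_i=a_i v_i$ (the only direction in which $\phi_i^*$ is finite); the function names, the random data, and this parametrization are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def primal(x, V, y, lam):
    """P(x) = (1/n) * sum_i (x^T v_i - y_i)^2 + (lam/2) * ||x||^2."""
    residuals = V @ x - y
    return np.mean(residuals ** 2) + 0.5 * lam * (x @ x)

def dual(a, V, y, lam):
    """D(alpha) for dual vectors alpha_i = a_i * v_i (assumed parametrization).

    For the squared loss, phi_i^*(-a_i v_i) = a_i^2 / 4 - a_i * y_i, and for
    g(x) = (lam/2)||x||^2 the conjugate is g^*(w) = ||w||^2 / (2 * lam).
    """
    alpha_bar = V.T @ a / len(y)          # (1/n) * sum_i alpha_i
    return np.mean(a * y - a ** 2 / 4.0) - (alpha_bar @ alpha_bar) / (2.0 * lam)

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
V = rng.normal(size=(n, d))               # rows are the examples v_i
y = rng.normal(size=n)

# Weak duality: D(alpha) <= P(x) for arbitrary x and alpha.
random_gap = primal(rng.normal(size=d), V, y, lam) - dual(rng.normal(size=n), V, y, lam)

# At the optimum, alpha_i^* = -grad phi_i(x^*) and the duality gap closes.
x_star = np.linalg.solve(2.0 / n * V.T @ V + lam * np.eye(d), 2.0 / n * V.T @ y)
a_star = -2.0 * (V @ x_star - y)
optimal_gap = primal(x_star, V, y, lam) - dual(a_star, V, y, lam)
```

The gap `random_gap` is non-negative for any choice of `x` and `a`, while `optimal_gap` vanishes up to rounding error.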
The dual objective has a different dual vector associated with each primal function. Dual Coordinate Ascent (DCA) methods solve the dual problem iteratively, where at each iteration of DCA, the dual objective is optimized with respect to a single dual vector, while the rest of the dual vectors are kept intact. Recently, Shalev-Shwartz and Zhang (2013) analyzed a stochastic version of dual coordinate ascent, abbreviated by SDCA, in which at each round we choose which dual vector to optimize uniformly at random. In particular, let $x^*$ be the optimum of (1). We say that a solution $x$ is $\epsilon$-accurate if $P(x)-P(x^*)\le\epsilon$. Shalev-Shwartz and Zhang (2013) have derived the following convergence guarantee for SDCA: If $g(x)=\frac{\lambda}{2}\|x\|_2^2$ and each $\phi_i$ is $(1/\gamma)$-smooth, then for every $\epsilon>0$, if we run SDCA for at least
$$\Big(n+\frac{1}{\lambda\gamma}\Big)\log\Big(\Big(n+\frac{1}{\lambda\gamma}\Big)\cdot\frac{1}{\epsilon}\Big)$$
iterations, then the solution of the SDCA algorithm will be $\epsilon$-accurate (in expectation). This convergence rate is significantly better than that of the more commonly studied stochastic gradient descent (SGD) methods that are related to SDCA (an exception is the recent analysis given in Le Roux et al. (2012) for a variant of SGD).

Another approach to solving (1) is deterministic gradient descent methods. In particular, Nesterov (2007) proposed an accelerated gradient descent (AGD) method for solving (1). Under the same conditions mentioned above, AGD finds an $\epsilon$-accurate solution after performing
$$O\Big(\sqrt{\frac{1}{\lambda\gamma}}\,\log\Big(\frac{1}{\epsilon}\Big)\Big)$$
iterations.
The advantage of SDCA over AGD is that each iteration involves only a single dual vector and in general costs $O(d)$. In contrast, each iteration of AGD requires $\Omega(nd)$ operations. On the other hand, AGD has a better dependence on the condition number of the problem: the iteration bound of AGD scales with $\sqrt{1/(\lambda\gamma)}$ while the iteration bound of SDCA scales with $1/(\lambda\gamma)$.
In this paper we describe and analyze a new algorithm that interpolates between SDCA and AGD. At each iteration of the algorithm, we randomly pick a subset of $m$ indices from $\{1,\ldots,n\}$ and update the dual vectors corresponding to this subset. This subset is often called a mini-batch. The use of mini-batches is common with SGD optimization, and it is beneficial when the processing time of a mini-batch of size $m$ is much smaller than $m$ times the processing time of one example (a mini-batch of size $1$). For example, in the practical training of neural networks with SGD, one is always advised to use mini-batches because it is more efficient to perform matrix-matrix multiplications over a mini-batch than an equivalent amount of matrix-vector multiplication operations (each over a single training example). This is especially noticeable when a GPU is used: in some cases the processing time of a mini-batch of size 100 may be the same as that of a mini-batch of size 10. Another typical use of mini-batches is for parallel computing, which was studied by various authors for stochastic gradient descent (e.g., Dekel et al. (2012)). This is also the application scenario we have in mind, and it will be discussed in greater detail in Section 3.

Recently, Takáč et al. (2013) studied mini-batch variants of SDCA in the context of the Support Vector Machine (SVM) problem. They have shown that the naive mini-batching method, in which $m$ dual variables are optimized in parallel, might actually increase the number of iterations required. They then describe several “safe” mini-batching schemes and, based on the analysis of Shalev-Shwartz and Zhang (2013), show several speed-up results. However, their results are for the non-smooth case and hence they do not obtain a linear convergence rate. In addition, the speed-up they obtain requires some spectral properties of the training examples. We take a different approach and employ Nesterov’s acceleration method, which has previously been applied to mini-batch SGD optimization. This paper shows how to achieve acceleration for SDCA in the mini-batch setting. The pseudo code of our Accelerated Mini-Batch SDCA, abbreviated by ASDCA, is presented below.
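Procedure Accelerated Mini-Batch SDCA
Parameters scalars $\lambda$, $\gamma$ and $\theta\in[0,1]$; mini-batch size $m$
Initialize $\alpha_1^{(0)}=\cdots=\alpha_n^{(0)}=\bar\alpha^{(0)}=0$, $x^{(0)}=0$
Iterate: for $t=1,2,\ldots$
    $u^{(t-1)}=(1-\theta)\,x^{(t-1)}+\theta\,\nabla g^*(\bar\alpha^{(t-1)})$
    Randomly pick subset $I\subset\{1,\ldots,n\}$ of size $m$ and update the dual variables in $I$
        $\alpha_i^{(t)}=(1-\theta)\,\alpha_i^{(t-1)}-\theta\,\nabla\phi_i(u^{(t-1)})$ for $i\in I$
        $\alpha_j^{(t)}=\alpha_j^{(t-1)}$ for $j\notin I$
    $\bar\alpha^{(t)}=\bar\alpha^{(t-1)}+n^{-1}\sum_{i\in I}(\alpha_i^{(t)}-\alpha_i^{(t-1)})$
    $x^{(t)}=(1-\theta)\,x^{(t-1)}+\theta\,\nabla g^*(\bar\alpha^{(t)})$
end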
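The procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the update rules follow our reading of the pseudo code ($u$ and $x$ as convex combinations with $\nabla g^*(\bar\alpha)$, and each selected $\alpha_i$ moved toward $-\nabla\phi_i(u)$), the loss is taken to be the squared loss with unit-norm examples (so $\gamma=1/2$), and the value of `theta` is chosen by an assumed step-size rule of the kind analyzed in the next section. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def asdca_ridge(V, y, lam, m, theta, num_iters, seed=0):
    """Sketch of Accelerated Mini-Batch SDCA for phi_i(x) = (x^T v_i - y_i)^2
    and g(x) = (lam/2)||x||^2, for which grad g^*(a) = a / lam."""
    rng = np.random.default_rng(seed)
    n, d = V.shape
    alpha = np.zeros((n, d))      # row i holds the dual vector alpha_i
    alpha_bar = np.zeros(d)       # running average (1/n) * sum_i alpha_i
    x = np.zeros(d)
    for _ in range(num_iters):
        u = (1 - theta) * x + theta * alpha_bar / lam
        batch = rng.choice(n, size=m, replace=False)   # mini-batch I
        # grad phi_i(u) = 2 * (u^T v_i - y_i) * v_i for each i in the batch
        grads = 2.0 * (V[batch] @ u - y[batch])[:, None] * V[batch]
        new_alpha = (1 - theta) * alpha[batch] - theta * grads
        alpha_bar += (new_alpha - alpha[batch]).sum(axis=0) / n
        alpha[batch] = new_alpha
        x = (1 - theta) * x + theta * alpha_bar / lam
    return x

def primal(x, V, y, lam):
    return np.mean((V @ x - y) ** 2) + 0.5 * lam * (x @ x)

rng = np.random.default_rng(1)
n, d, m, lam = 200, 10, 10, 0.1
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # ||v_i|| = 1, hence gamma = 1/2
y = rng.normal(size=n)
gamma = 0.5
c = gamma * lam * n
# Assumed step-size rule (see the main theorem); treat it as a tunable choice.
theta = 0.25 * min(1.0, np.sqrt(c / m), c, c ** (2 / 3) / m ** (1 / 3))

x_hat = asdca_ridge(V, y, lam, m, theta, num_iters=3000)
x_star = np.linalg.solve(2.0 / n * V.T @ V + lam * np.eye(d), 2.0 / n * V.T @ y)
suboptimality = primal(x_hat, V, y, lam) - primal(x_star, V, y, lam)
```

Storing all dual vectors as an $n\times d$ array keeps the sketch short; for losses of the form $\phi_i(x)=\ell(x^\top v_i)$ one scalar per example would suffice.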
In the next section we present our main result: an analysis of the number of iterations required by ASDCA. We focus on the case of Euclidean regularization, namely, $g(x)=\frac{\lambda}{2}\|x\|^2$. Analyzing more general strongly convex regularization functions is left for future work. In Section 3 we discuss parallel implementations of ASDCA and compare it to parallel implementations of AGD and SDCA. In particular, we explain in which regimes ASDCA can be better than both AGD and SDCA. In Section 4 we present some experimental results, demonstrating how ASDCA interpolates between AGD and SDCA. The proof of our main theorem is presented in Section 5. We conclude with a discussion of our work in light of related works in Section 6.
2 Main Results
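Our main result is a bound on the number of iterations required by ASDCA to find an $\epsilon$-accurate solution. In our analysis, we only consider the squared Euclidean norm regularization,
$$g(x)=\frac{\lambda}{2}\|x\|^2,$$
where $\|\cdot\|$ is the Euclidean norm and $\lambda>0$ is a regularization parameter. The analysis for general strongly convex regularizers is left for future work. For the squared Euclidean norm we have
$$g^*(\alpha)=\frac{1}{2\lambda}\|\alpha\|^2$$
and
$$\nabla g^*(\alpha)=\frac{1}{\lambda}\alpha.$$
We further assume that each $\phi_i$ is $(1/\gamma)$-smooth with respect to $\|\cdot\|$, namely,
$$\forall x,z:\quad \phi_i(x)\le\phi_i(z)+\nabla\phi_i(z)^\top(x-z)+\frac{1}{2\gamma}\|x-z\|^2.$$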
For example, if , then it is smooth.
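The smoothness of $\phi_i$ also implies that $\phi_i^*$ is $\gamma$-strongly convex:
$$\forall\,\theta\in[0,1]:\quad \phi_i^*((1-\theta)\alpha+\theta\beta)\le(1-\theta)\,\phi_i^*(\alpha)+\theta\,\phi_i^*(\beta)-\frac{\theta(1-\theta)\gamma}{2}\|\alpha-\beta\|^2.$$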
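As a quick sanity check of this duality between smoothness and strong convexity, consider a scalar quadratic (our own illustrative choice, not an example taken from the text):

```latex
\phi(a) = \frac{a^2}{2\gamma}
\quad\Longrightarrow\quad
\phi^*(u) = \max_{a}\Big(a u - \frac{a^2}{2\gamma}\Big) = \frac{\gamma u^2}{2}.
```

Here $\phi''=1/\gamma$, so $\phi$ is $(1/\gamma)$-smooth, while $(\phi^*)''=\gamma$, so $\phi^*$ is $\gamma$-strongly convex; the maximum is attained at $a=\gamma u$.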
Theorem 1.
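Assume that $g(x)=\frac{1}{2}\lambda\|x\|_2^2$ and for each $i$, $\phi_i$ is $(1/\gamma)$-smooth w.r.t. the Euclidean norm. Suppose that the ASDCA algorithm is run with parameters $\lambda,\gamma,m,\theta$, where
$$\theta\le\frac{1}{4}\min\Big\{1,\ \sqrt{\frac{\gamma\lambda n}{m}},\ \gamma\lambda n,\ \frac{(\gamma\lambda n)^{2/3}}{m^{1/3}}\Big\}. \tag{3}$$
Define the dual suboptimality by $\Delta D(\alpha)=D(\alpha^*)-D(\alpha)$, where $\alpha^*$ is the optimal dual solution, and the primal suboptimality by $\Delta P(x)=P(x)-D(\alpha^*)$. Then,
$$m\,\mathbb{E}\,\Delta P(x^{(t)})+n\,\mathbb{E}\,\Delta D(\alpha^{(t)})\le\Big(1-\frac{\theta m}{n}\Big)^{t}\big[m\,\Delta P(x^{(0)})+n\,\Delta D(\alpha^{(0)})\big].$$
It follows that after performing
$$t\ge\frac{n/m}{\theta}\,\log\Big(\frac{m\,\Delta P(x^{(0)})+n\,\Delta D(\alpha^{(0)})}{m\,\epsilon}\Big)$$
iterations, we have that $\mathbb{E}[P(x^{(t)})-D(\alpha^{(t)})]\le\epsilon$.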
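Let us now discuss the bound, assuming $\theta$ is taken to be the right-hand side of (3). The dominating factor of the bound on $t$ becomes
$$\frac{n}{m\theta}\ \propto\ \frac{n}{m}\cdot\max\Big\{1,\ \sqrt{\frac{m}{\gamma\lambda n}},\ \frac{1}{\gamma\lambda n},\ \frac{m^{1/3}}{(\gamma\lambda n)^{2/3}}\Big\} \tag{4}$$
$$=\max\Big\{\frac{n}{m},\ \sqrt{\frac{n}{\gamma\lambda m}},\ \frac{1/m}{\gamma\lambda},\ \frac{n^{1/3}}{(\gamma\lambda m)^{2/3}}\Big\}. \tag{5}$$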
Table 1 summarizes several interesting cases, and compares the iteration bound of ASDCA to the iteration bound of the vanilla SDCA algorithm (as analyzed in Shalev-Shwartz and Zhang (2013)) and the Accelerated Gradient Descent (AGD) algorithm of Nesterov (2007). In the table, we ignore constants and logarithmic factors.
Algorithm  

SDCA  
ASDCA  
AGD 
As can be seen in the table, the ASDCA algorithm interpolates between SDCA and AGD. In particular, ASDCA has the same bound as SDCA when $m=1$ and the same bound as AGD when $m=n$. Recall that the cost of each iteration of AGD scales with $n$ while the cost of each iteration of SDCA does not scale with $n$. The cost of each iteration of ASDCA scales with $m$. To compensate for the different cost per iteration of the different algorithms, we may also compare the complexity in terms of the number of examples processed, as in Table 2. This is also what we will study in our empirical experiments. It should be mentioned that this comparison is meaningful in a single-processor environment, but not in a parallel computing environment, where multiple examples can be processed simultaneously in a mini-batch. In the next section we discuss under what conditions the overall runtime of ASDCA is better than both AGD and SDCA.
Algorithm  

SDCA  
ASDCA  
AGD 
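The interpolation can also be checked numerically. The sketch below compares the leading terms of the three iteration bounds while ignoring constants and logarithmic factors. It assumes that condition (3) reads $\theta\le\frac14\min\{1,\sqrt{\gamma\lambda n/m},\gamma\lambda n,(\gamma\lambda n)^{2/3}/m^{1/3}\}$ and that the SDCA and AGD bounds scale as $n+1/(\lambda\gamma)$ and $\sqrt{1/(\lambda\gamma)}$ respectively; the function names and parameter values are hypothetical.

```python
import math

def theta_max(n, m, lam, gamma):
    """Largest theta allowed by the (assumed) step-size condition (3)."""
    c = gamma * lam * n
    return 0.25 * min(1.0, math.sqrt(c / m), c, c ** (2 / 3) / m ** (1 / 3))

def iteration_scales(n, m, lam, gamma):
    """Leading terms of the iteration bounds, ignoring constants and logs."""
    kappa = 1.0 / (lam * gamma)
    return {
        "SDCA": n + kappa,
        "ASDCA": (n / m) / theta_max(n, m, lam, gamma),
        "AGD": math.sqrt(kappa),
    }

n, gamma = 10_000, 1.0
lam = 1.0 / n                          # the regime gamma * lam * n = 1
results = {}
for m in (1, 100, n):
    iters = iteration_scales(n, m, lam, gamma)
    # Examples processed = iterations * examples touched per iteration.
    examples = {"SDCA": iters["SDCA"], "ASDCA": m * iters["ASDCA"], "AGD": n * iters["AGD"]}
    results[m] = {"iterations": iters, "examples": examples}

ratio_to_sdca = results[1]["iterations"]["ASDCA"] / results[1]["iterations"]["SDCA"]
ratio_to_agd = results[n]["iterations"]["ASDCA"] / results[n]["iterations"]["AGD"]
```

In this regime the ASDCA scale stays within a small constant factor of SDCA at $m=1$ and of AGD at $m=n$, which is the interpolation described above.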
3 Parallel Implementation
In recent years, there has been a lot of interest in implementing optimization algorithms using a parallel computing architecture (see Section 6). We now discuss how to implement AGD, SDCA, and ASDCA on a computing machine with $s$ parallel computing nodes.
In the calculations below, we use the following facts:

If each node holds a $d$-dimensional vector, we can compute the sum of these vectors in time $O(d\log(s))$ by applying a “tree-structure” summation (see for example the All-Reduce architecture in Agarwal et al. (2011)).

A node can broadcast a message with $c$ bits to all other nodes in time $O(c\log^2(s))$. To see this, order the nodes on the corners of the $\log_2(s)$-dimensional hypercube. Then, at each iteration, each node sends the message to its $\log(s)$ neighbors (namely, the nodes whose code word is at a Hamming distance of $1$ from the node). The message between the furthest away nodes will pass after $\log(s)$ iterations. Overall, we perform $\log(s)$ iterations and each iteration requires transmitting $c\log(s)$ bits.

All nodes can broadcast a message with $c$ bits to all other nodes in time $O(cs\log^2(s))$. To see this, simply apply the broadcasting of the different nodes mentioned above in parallel. The number of iterations will still be the same, but now, at each iteration, each node should transmit $cs$ bits to its $\log(s)$ neighbors. Therefore, it takes $O(cs\log^2(s))$ time.
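The logarithmic round counts in these facts can be illustrated with a small single-process simulation. The sketch below only counts communication rounds; it does not model bandwidth or real message passing, it assumes the number of nodes is a power of two for the hypercube, and the function names are hypothetical.

```python
import numpy as np

def tree_sum(vectors):
    """Sum one vector per node by pairwise (tree-structured) reduction.
    Returns the total and the number of communication rounds used."""
    level, rounds = list(vectors), 0
    while len(level) > 1:
        # Adjacent nodes are paired up; an unpaired last node is carried over.
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds

def hypercube_broadcast(s, source=0):
    """Simulate a one-to-all broadcast among s = 2^k nodes placed on the
    corners of a k-dimensional hypercube. Returns the number of rounds."""
    k = s.bit_length() - 1
    assert 1 << k == s, "this sketch assumes s is a power of two"
    informed, rounds = {source}, 0
    while len(informed) < s:
        # Every informed node forwards to all neighbors at Hamming distance 1.
        informed |= {node ^ (1 << bit) for node in informed for bit in range(k)}
        rounds += 1
    return rounds

s, d = 16, 4
vectors = [np.full(d, float(i)) for i in range(s)]   # node i holds the vector (i, ..., i)
total, sum_rounds = tree_sum(vectors)
bcast_rounds = hypercube_broadcast(s)
```

With $s=16$ nodes both procedures finish in $\log_2(16)=4$ rounds.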
For concreteness of the discussion, we consider problems in which $\phi_i(x)$ takes the form of $\ell(x^\top v_i,y_i)$, where $y_i$ is a scalar and $v_i\in\mathbb{R}^d$. This is the case in supervised learning of linear predictors (e.g. logistic regression or ridge regression). We further assume that the average number of non-zero elements of $v_i$ is $\bar d$. In very large-scale problems, a single machine cannot hold all of the data in its memory. However, we assume that a single node can hold a fraction of $1/s$ of the data in its memory.

Let us now discuss parallel implementations of the different algorithms, starting with deterministic gradient algorithms (such as AGD). The bottleneck operation of deterministic gradient algorithms is the calculation of the gradient. In the notation mentioned above, this amounts to performing order of $n\bar d$ operations. If the data is distributed over $s$ computing nodes, where each node holds $n/s$ examples, we can calculate the gradient in time $O(n\bar d/s+d\log(s))$ as follows. First, each node calculates the gradient over its own $n/s$ examples (which takes time $O(n\bar d/s)$). Then, the $s$ resulting vectors in $\mathbb{R}^d$ are summed up in time $O(d\log(s))$.
Next, let us consider the SDCA algorithm. On a single computing node, it was observed that SDCA is much more efficient than deterministic gradient descent methods, since each iteration of SDCA costs only $\Theta(\bar d)$ while each iteration of AGD costs $\Theta(n\bar d)$. When we have $s$ nodes, for the SDCA algorithm, dividing the examples into $s$ computing nodes does not yield any speed-up. However, we can divide the features into the $s$ nodes (that is, each node will hold $d/s$ of the features for all of the examples). This enables the computation of $x^\top v_i$ in (expected) time of $O(\bar d/s+s\log^2(s))$. Indeed, node $t$ will calculate $\sum_{j\in J_t}x_j v_{i,j}$, where $J_t$ is the set of features stored in node $t$ (namely, $|J_t|=d/s$). Then, each node broadcasts the resulting scalar to all the other nodes. Note that we will obtain a speed-up over the naive implementation only if $s\log^2(s)\ll\bar d$.
For the ASDCA algorithm, each iteration involves the computation of the gradient over $m$ examples. We can choose to implement it by dividing the examples to the $s$ nodes (as we did for AGD) or by dividing the features into the $s$ nodes (as we did for SDCA). In the first case, the cost of each iteration is $O(m\bar d/s+d\log(s))$ while in the latter case, the cost of each iteration is $O(m\bar d/s+ms\log^2(s))$. We will choose between these two implementations based on the relation between $d$ and $m$.
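The two partitioning schemes can be sketched on a toy problem. The snippet below simulates the nodes inside a single process and only checks that the partial results combine to the same quantities a single machine would compute; the squared loss, the sizes, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 12, 8, 4                 # examples, features, nodes (toy sizes)
V = rng.normal(size=(n, d))        # rows are the examples v_i
y = rng.normal(size=n)
x = rng.normal(size=d)

# Partition by examples: each node sums the gradients of
# phi_i(x) = (x^T v_i - y_i)^2 over its own n/s examples; the s partial
# d-dimensional vectors are then summed (e.g. by a tree-structured reduction).
example_parts = np.array_split(np.arange(n), s)
partial_grads = [2.0 * V[idx].T @ (V[idx] @ x - y[idx]) for idx in example_parts]
grad_by_examples = sum(partial_grads) / n

# Partition by features: node t holds the coordinates J_t of x and of every
# v_i, computes a partial inner product, and the s scalars are exchanged.
feature_parts = np.array_split(np.arange(d), s)
i = 3                              # index of the example being processed
partial_dots = [V[i, J] @ x[J] for J in feature_parts]
dot_by_features = sum(partial_dots)
```

The example partition communicates one $d$-dimensional vector per node, whereas the feature partition communicates one scalar per node and per example, which is why the preferred scheme depends on how $d$ compares with $m$.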
The runtime and communication time of each iteration is summarized in the table below.
Algorithm  partition type  runtime  communication time 

SDCA  features  
ASDCA  features  
ASDCA  examples  
AGD  examples 
We again see that ASDCA nicely interpolates between SDCA and AGD. In practice, it is usually the case that there is a non-negligible cost of opening communication channels between nodes. In that case, it will be better to apply ASDCA with a value of $m$ that reflects an adequate trade-off between the runtime of each node and the communication time. With the appropriate value of $m$ (which depends on constants like the cost of opening communication channels and sending packets of bits between nodes), ASDCA may outperform both SDCA and AGD.
4 Experimental Results
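In this section we demonstrate how ASDCA interpolates between SDCA and AGD. All of our experiments are performed for the task of binary classification with a smooth variant of the hinge-loss (see Shalev-Shwartz and Zhang (2013)). Specifically, let $(v_1,y_1),\ldots,(v_n,y_n)$ be a set of labeled examples, where for every $i$, $v_i\in\mathbb{R}^d$ and $y_i\in\{\pm1\}$. Define $\phi_i(x)$ to be
$$\phi_i(x)=\begin{cases}0 & \text{if } y_i x^\top v_i>1\\ \tfrac{1}{2}-y_i x^\top v_i & \text{if } y_i x^\top v_i<0\\ \tfrac{1}{2}(1-y_i x^\top v_i)^2 & \text{otherwise.}\end{cases}$$
We also set the regularization function to be $g(x)=\frac{\lambda}{2}\|x\|_2^2$ where $\lambda=1/n$. This is the default value for the regularization parameter taken in several optimization packages.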
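For reference, a smoothed hinge loss of the kind discussed by Shalev-Shwartz and Zhang (2013) can be written as a function of the margin $z=y_i x^\top v_i$. The sketch below uses the general $\gamma$-parametrized form, which is $(1/\gamma)$-smooth; treating $\gamma=1$ as the default is our assumption, and the function names are hypothetical.

```python
import numpy as np

def smoothed_hinge(z, gamma=1.0):
    """Smoothed hinge loss of the margin z = y * x^T v; it is (1/gamma)-smooth."""
    z = np.asarray(z, dtype=float)
    return np.where(z >= 1.0, 0.0,
                    np.where(z <= 1.0 - gamma, 1.0 - z - gamma / 2.0,
                             (1.0 - z) ** 2 / (2.0 * gamma)))

def smoothed_hinge_grad(z, gamma=1.0):
    """Derivative of the smoothed hinge loss with respect to the margin z."""
    z = np.asarray(z, dtype=float)
    return np.where(z >= 1.0, 0.0,
                    np.where(z <= 1.0 - gamma, -1.0, (z - 1.0) / gamma))

margins = np.linspace(-2.0, 2.0, 401)
losses = smoothed_hinge(margins)
slopes = smoothed_hinge_grad(margins)
```

The loss is zero for margins of at least 1, linear for margins below $1-\gamma$, and quadratic in between, so its derivative is continuous and changes at a rate of at most $1/\gamma$.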
Following Shalev-Shwartz and Zhang (2013), the experiments were performed on three large datasets with very different feature counts and sparsity. The astro-ph dataset classifies abstracts of papers from the physics ArXiv according to whether they belong in the astro-physics section; CCAT is a classification task taken from the Reuters RCV1 collection; and cov1 is class 1 of the covertype dataset of Blackard, Jock & Dean. The following table provides details of the dataset characteristics.
Dataset  Training Size  Testing Size  Features  Sparsity 

astro-ph
CCAT  
cov1 
We ran ASDCA with several values of the mini-batch size $m$. We also ran the SDCA algorithm and the AGD algorithm. In Figure 1 we depict the primal suboptimality of the different algorithms as a function of the number of examples processed. Note that each iteration of SDCA processes a single example, each iteration of ASDCA processes $m$ examples, and each iteration of AGD processes $n$ examples. As can be seen from the graphs, ASDCA indeed interpolates between SDCA and AGD. It is clear from the graphs that SDCA is much better than AGD when we have a single computing node. ASDCA performance is quite similar to SDCA when $m$ is not very large. As discussed in Section 3, when we have parallel computing nodes and there is a non-negligible cost of opening communication channels between nodes, running ASDCA with an appropriate value of $m$ (which depends on constants like the cost of opening communication channels) may yield the best performance.
Figure 1 (panels, left to right: astro-ph, CCAT, cov1). In all figures, the x-axis is the number of processed examples. The three columns are for the different datasets. Top: primal suboptimality. Middle: average value of the smoothed hinge loss function over a test set. Bottom: average value of the 0-1 loss over a test set.
5 Proof
We use the following notation:
In addition, we use the notation to denote the expectation over the choice of the set at iteration , conditioned on the values of and .
Our first lemma calculates the expected value of .
Lemma 1.
At each round , we have
Proof.
By the definition of the update,
Taking expectation w.r.t. the choice of and noting that we obtain that
∎
Next, we upper bound the “variance” of , in the sense of the expected squared norm of the difference between and its expectation.

Lemma 2.
At each round , we have
Proof.
We introduce the simplified notation and . Note that is independent of the choice of (thus can be considered as a deterministic number). Then when , and . We thus have
Note that for any : and can be regarded as zero-mean random vectors that are drawn uniformly at random from the same distribution without replacement. Therefore they are not positively correlated when . That is, we have
Therefore
∎
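The key step of this argument, that sampling without replacement cannot increase the variance of a mini-batch average, can be verified exactly on a toy population. The sketch below uses scalars instead of vectors for brevity and enumerates all subsets; the population values and the function name are illustrative assumptions.

```python
import itertools
import numpy as np

def minibatch_mean_variance(values, m):
    """Exact variance of the mean of m values drawn uniformly without
    replacement, computed by enumerating every subset of size m."""
    means = [np.mean(subset) for subset in itertools.combinations(values, m)]
    return np.var(means)

values = np.array([-3.0, -1.0, 0.5, 1.5, 2.0])   # zero-mean toy population
n, m = len(values), 2
sigma2 = np.var(values)                          # population variance

var_without = minibatch_mean_variance(values, m)
var_with = sigma2 / m                            # i.i.d. sampling with replacement
finite_population_factor = (n - m) / (n - 1)
```

The exact variance equals `var_with * finite_population_factor`, so it is never larger than the with-replacement value; this is the scalar analogue of the non-positive correlation used in the proof.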
Recall that the theorem upper bounds the expected value of , which in turn upper bounds the duality gap at round . The following lemma derives an upper bound on this quantity that depends on the value of this quantity at the previous iteration and three additional terms. We will later show that the sum of the additional terms is negative in expectation. The lemma uses standard algebraic manipulations as well as the assumptions on and .
Lemma 3.
For each round we have
where
Proof.
Since we have
Therefore, we need to show that the right-hand side of the above is upper bounded by .
Step 1:
We first bound . Using the smoothness of we have
and using the convexity of we also have
Combining the above two inequalities and rearranging terms we obtain
(6)  
Next, using the convexity of we have
Combining this with (6) we obtain
which yields
(7) 
Step 2:
Next, we bound . Using the definition of the dual update we have
(8) 
For all , we may use the definition of the update of in the algorithm, the strong convexity of , and the equality in Fenchel-Young for gradients to obtain:
Combining this with (8) we get
(9)  
Step 3:
∎
Lemma 4.
At each round , let be as defined in Lemma 3. Then,
Proof.
Recall,