 # Accelerated Mini-Batch Stochastic Dual Coordinate Ascent

Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of nesterov2007gradient.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider the following generic optimization problem. Let

be a sequence of vector convex functions from

to , and let be a strongly convex regularization function. Our goal is to solve where

 P(x)=[1nn∑i=1ϕi(x)+g(x)]. (1)

For example, given a sequence of training examples , where and

, ridge regression is obtained by setting

and

. Regularized logistic regression is obtained by setting

.

The dual problem of (1) is defined as follows: For each , let be the convex conjugate of , namely, . Similarly, let be the convex conjugate of . The dual problem is:

 maxα∈Rd×nD(α)   where   D(α)=[1nn∑i=1−ϕ∗i(−αi)−g∗(1nn∑i=1αi)] , (2)

where for each , is the ’th column of the matrix .

The dual objective has a different dual vector associated with each primal function. Dual Coordinate Ascent (DCA) methods solve the dual problem iteratively, where at each iteration of DCA, the dual objective is optimized with respect to a single dual vector, while the rest of the dual vectors are kept in tact. Recently, Shalev-Shwartz and Zhang (2013) analyzed a stochastic version of dual coordinate ascent, abbreviated by SDCA, in which at each round we choose which dual vector to optimize uniformly at random. In particular, let be the optimum of (1). We say that a solution is -accurate if . Shalev-Shwartz and Zhang (2013) have derived the following convergence guarantee for SDCA: If and each is -smooth, then for every , if we run SDCA for at least

 (n+1λγ)log((n+1λγ)⋅1ϵ)

iterations, then the solution of the SDCA algorithm will be

-accurate (in expectation). This convergence rate is significantly better than the more commonly studied stochastic gradient descent (SGD) methods that are related to SDCA

111An exception is the recent analysis given in Le Roux et al. (2012) for a variant of SGD..

Another approach to solving (1) is deterministic gradient descent methods. In particular, Nesterov (2007) proposed an accelerated gradient descent (AGD) method for solving (1). Under the same conditions mentioned above, AGD finds an -accurate solution after performing

 O(1√λγlog(1ϵ))

iterations.

The advantage of SDCA over AGD is that each iteration involves only a single dual vector and in general costs . In contrast, each iteration of AGD requires operations. On the other hand, AGD has a better dependence on the condition number of the problem — the iteration bound of AGD scales with while the iteration bound of SDCA scales with .

In this paper we describe and analyze a new algorithm that interpolates between SDCA and AGD. At each iteration of the algorithm, we randomly pick a subset of

indices from and update the dual vectors corresponding to this subset. This subset is often called a mini-batch. The use of mini-batches is common with SGD optimization, and it is beneficial when the processing time of a mini-batch of size is much smaller than times the processing time of one example (mini-batch of size

). For example, in the practical training of neural networks with SGD, one is always advised to use mini-batches because it is more efficient to perform matrix-matrix multiplications over a mini-batch than an equivalent amount of matrix-vector multiplication operations (each over a single training example). This is especially noticeable when GPU is used: in some cases the processing time of a mini-batch of size 100 may be the same as that of a mini-batch of size 10. Another typical use of mini-batch is for parallel computing, which was studied by various authors for stochastic gradient descent (e.g.,

Dekel et al. (2012)). This is also the application scenario we have in mind, and will be discussed in greater details in Section 3.

Recently, Takác et al. (2013)

studied mini-batch variants of SDCA in the context of the Support Vector Machine (SVM) problem. They have shown that the naive mini-batching method, in which

dual variables are optimized in parallel, might actually increase the number of iterations required. They then describe several “safe” mini-batching schemes, and based on the analysis of Shalev-Shwartz and Zhang (2013), have shown several speed-up results. However, their results are for the non-smooth case and hence they do not obtain linear convergence rate. In addition, the speed-up they obtain requires some spectral properties of the training examples. We take a different approach and employ Nesterov’s acceleration method, which has previously been applied to mini-batch SGD optimization. This paper shows how to achieve acceleration for SDCA in the mini-batch setting. The pseudo code of our Accelerated Mini-Batch SDCA, abbreviated by ASDCA, is presented below.

0.8

Procedure Accelerated Mini-Batch SDCA

 Parameters scalars λ,γ and θ∈[0,1]  ;  mini-batch size m Initialize α(0)1=⋯=α(0)n=¯α(t)=0, x(0)=0 Iterate: for t=1,2,… u(t−1)=(1−θ)x(t−1)+θ∇g∗(¯α(t−1)) Randomly pick subset I⊂{1,…,n} of size m and update the dual variables in I α(t)i=(1−θ)α(t−1)i−θ∇ϕi(u(t−1)) for i∈I α(t)j=α(t−1)j for j∉I ¯α(t)=¯α(t−1)+n−1∑i∈I(α(t)i−α(t−1)i) x(t)=(1−θ)x(t−1)+θ∇g∗(¯α(t)) end

In the next section we present our main result — an analysis of the number of iterations required by ASDCA. We focus on the case of Euclidean regularization, namely, . Analyzing more general strongly convex regularization functions is left for future work. In Section 3 we discuss parallel implementations of ASDCA and compare it to parallel implementations of AGD and SDCA. In particular, we explain in which regimes ASDCA can be better than both AGD and SDCA. In Section 4 we present some experimental results, demonstrating how ASDCA interpolates between AGD and SDCA. The proof of our main theorem is presented in Section 5. We conclude with a discussion of our work in light of related works in Section 6.

## 2 Main Results

Our main result is a bound on the number of iterations required by ASDCA to find an -accurate solution. In our analysis, we only consider the squared Euclidean norm regularization,

 g(x)=λ2∥x∥2,

where is the Euclidean norm and is a regularization parameter. The analysis for general -strongly convex regularizers is left for future work. For the squared Euclidean norm we have

 g∗(α)=12λ∥α∥2,

and

 ∇g∗(α)=1λα .

We further assume that each is -smooth with respect to , namely,

 ∀x,z,   ϕi(x)≤ϕi(z)+∇ϕi(z)⊤(x−z)+12γ∥x−z∥2.

For example, if , then it is -smooth.

The smoothness of also implies that is -strongly convex:

 ∀θ∈[0,1],  ϕ∗i((1−θ)α+θβ)≤(1−θ)ϕ∗i(α)+θϕ∗i(β)−θ(1−θ)γ2∥α−β∥2,
###### Theorem 1.

Assume that and for each , is -smooth w.r.t. the Euclidean norm. Suppose that the ASDCA algorithm is run with parameters , where

 θ≤14min⎧⎨⎩1 , √γλnm , γλn , (γλn)2/3m1/3⎫⎬⎭ . (3)

Define the dual sub-optimality by , where is the optimal dual solution, and the primal sub-optimality by . Then,

 mEΔP(x(t))+nEΔD(α(t))≤(1−θm/n)t[mΔP(x(0))+nΔD(α(0))].

It follows that after performing

 t≥n/mθlog(mΔP(x(0))+nΔD(α(0))mϵ)

iterations, we have that .

Let us now discuss the bound, assuming is taken to be the right-hand side of (3). The dominating factor of the bound on becomes

 nmθ =nm⋅max{1 , √mγλn , 1γλn , m1/3(γλn)2/3} (4) =max{nm , √n/mγλ , 1/mγλ , n1/3(γλm)2/3} . (5)

Table 1 summarizes several interesting cases, and compares the iteration bound of ASDCA to the iteration bound of the vanilla SDCA algorithm (as analyzed in Shalev-Shwartz and Zhang (2013)) and the Accelerated Gradient Descent (AGD) algorithm of Nesterov (2007). In the table, we ignore constants and logarithmic factors.

As can be seen in the table, the ASDCA algorithm interpolates between SDCA and AGD. In particular, ASDCA has the same bound as SDCA when and the same bound as AGD when . Recall that the cost of each iteration of AGD scales with while the cost of each iteration of SDCA does not scale with . The cost of each iteration of ASDCA scales with . To compensate for the difference cost per iteration for different algorithms, we may also compare the complexity in terms of the number of examples processed in Table 2. This is also what we will study in our empirical experiments. It should be mentioned that this comparison is meaningful in a single processor environment, but not in a parallel computing environment when multiple examples can be processed simultaneiously in a minibatch. In the next section we discuss under what conditions the overall runtime of ASDCA is better than both AGD and SDCA.

## 3 Parallel Implementation

In recent years, there has been a lot of interest in implementing optimization algorithms using a parallel computing architecture (see Section 6). We now discuss how to implement AGD, SDCA, and ASDCA when having a computing machine with parallel computing nodes.

In the calculations below, we use the following facts:

• If each node holds a -dimensional vector, we can compute the sum of these vectors in time by applying a “tree-structure” summation (see for example the All-Reduce architecture in Agarwal et al. (2011)).

• A node can broadcast a message with bits to all other nodes in time . To see this, order nodes on the corners of the -dimensional hypercube. Then, at each iteration, each node sends the message to its neighbors (namely, the nodes whose code word is at a hamming distance of from the node). The message between the furthest away nodes will pass after iterations. Overall, we perform iterations and each iteration requires transmitting bits.

• All nodes can broadcast a message with bits to all other nodes in time . To see this, simply apply the broadcasting of the different nodes mentioned above in parallel. The number of iterations will still be the same, but now, at each iteration, each node should transmit bits to its neighbors. Therefore, it takes time.

For concreteness of the discussion, we consider problems in which takes the form of , where is a scalar and

. This is the case in supervised learning of linear predictors (e.g. logistic regression or ridge regression). We further assume that the average number of non-zero elements of

is . In very large-scale problems, a single machine cannot hold all of the data in its memory. However, we assume that a single node can hold a fraction of of the data in its memory.

Let us now discuss parallel implementations of the different algorithms starting with deterministic gradient algorithms (such as AGD). The bottleneck operation of deterministic gradient algorithms is the calculation of the gradient. In the notation mentioned above, this amounts to performing order of operations. If the data is distributed over computing nodes, where each node holds examples, we can calculate the gradient in time as follows. First, each node calculates the gradient over its own examples (which takes time ). Then, the resulting vectors in are summed up in time .

Next, let us consider the SDCA algorithm. On a single computing node, it was observed that SDCA is much more efficient than deterministic gradient descent methods, since each iteration of SDCA costs only while each iteration of AGD costs . When we have nodes, for the SDCA algorithm, dividing the examples into computing nodes does not yield any speed-up. However, we can divide the features into the nodes (that is, each node will hold of the features for all of the examples). This enables the computation of in (expected) time of . Indeed, node will calculate , where is the set of features stored in node (namely, ). Then, each node broadcasts the resulting scalar to all the other nodes. Note that we will obtain a speed-up over the naive implementation only if .

For the ASDCA algorithm, each iteration involves the computation of the gradient over examples. We can choose to implement it by dividing the examples to the nodes (as we did for AGD) or by dividing the features into the nodes (as we did for SDCA). In the first case, the cost of each iteration is while in the latter case, the cost of each iteration is . We will choose between these two implementations based on the relation between and .

The runtime and communication time of each iteration is summarized in the table below.

Algorithm partition type runtime communication time
SDCA features
ASDCA features
ASDCA examples
AGD examples

We again see that ASDCA nicely interpolates between SDCA and AGD. In practice, it is usually the case that there is a non-negligible cost of opening communication channels between nodes. In that case, it will be better to apply the ASDCA with a value of that reflects an adequate tradeoff between the runtime of each node and the communication time. With the appropriate value of (which depends on constants like the cost of opening communication channels and sending packets of bits between nodes), ASDCA may outperform both SDCA and AGD.

## 4 Experimental Results

In this section we demonstrate how ASDCA interpolates between SDCA and AGD. All of our experiments are performed for the task of binary classification with a smooth variant of the hinge-loss (see Shalev-Shwartz and Zhang (2013)). Specifically, let be a set of labeled examples, where for every , and . Define to be

 ϕi(x)=⎧⎪⎨⎪⎩0yix⊤vi>11/2−yix⊤viyix⊤vi<012(1−yix⊤vi)2o.w.

We also set the regularization function to be where . This is the default value for the regularization parameter taken in several optimization packages.

Following Shalev-Shwartz and Zhang (2013)

, the experiments were performed on three large datasets with very different feature counts and sparsity. The astro-ph dataset classifies abstracts of papers from the physics ArXiv according to whether they belong in the astro-physics section; CCAT is a classification task taken from the Reuters RCV1 collection; and cov1 is class 1 of the covertype dataset of Blackard, Jock & Dean. The following table provides details of the dataset characteristics.

Dataset Training Size Testing Size Features Sparsity
astro-ph
CCAT
cov1

We ran ASDCA with values of from the set . We also ran the SDCA algorithm and the AGD algorithm. In Figure 1 we depict the primal sub-optimality of the different algorithms as a function of the number of examples processed. Note that each iteration of SDCA processes a single example, each iteration of ASDCA processes examples, and each iteration of AGD processes examples. As can be seen from the graphs, ASDCA indeed interpolates between SDCA and AGD. It is clear from the graphs that SDCA is much better than AGD when we have a single computing node. ASDCA performance is quite similar to SDCA when is not very large. As discussed in Section 3, when we have parallel computing nodes and there is a non-negligible cost of opening communication channels between nodes, running ASDCA with an appropriate value of (which depends on constants like the cost of opening communication channels) may yield the best performance. Figure 1: The figures presents the performance of AGD, SDCA, and ASDCA with different values of mini-batch size, m. In all figures, the x axis is the number of processed examples. The three columns are for the different datasets. Top: primal sub-optimality. Middle: average value of the smoothed hinge loss function over a test set. Bottom: average value of the 0-1 loss over a test set.

## 5 Proof

We use the following notation:

 f(x)=1nn∑i=1ϕi(x) , Δ¯α(t)=¯α(t)−¯α(t−1) .

In addition, we use the notation to denote the expectation over the choice of the set at iteration , conditioned on the values of and .

Our first lemma calculates the expected value of .

###### Lemma 1.

At each round , we have

 Et[Δ¯α(t)]=−θmn(¯α(t−1)+∇f(u(t−1))) .
###### Proof.

By the definition of the update,

 Δ¯α(t)=1n∑i∈I(α(t)i−α(t−1)i)=−θn∑i∈I(α(t−1)i+∇ϕi(u(t−1)))=−θnn∑i=11[i∈I](α(t−1)i+∇ϕi(u(t−1))) .

Taking expectation w.r.t. the choice of and noting that we obtain that

 Et[Δ¯α(t)]=−θmn2n∑i=1(α(t−1)i+∇ϕi(u(t−1)))=−θmn(¯α(t−1)+∇f(u(t−1))) .

Next, we upper bound the “variance” of

, in the sense of the expected squared norm of the difference between and its expectation.

###### Lemma 2.

At each round , we have

 Et∥Δ¯α(t)−EtΔ¯α(t)∥2≤mθ2n3n∑i=1∥α(t−1)i+∇ϕi(u(t−1))∥2.
###### Proof.

We introduce the simplified notation and . Note that is independent of the choice of (thus can be considered as a deterministic number). Then when , and . We thus have

 Et∥Δ¯α(t)−EtΔ¯α(t)∥2=Et∥n−1∑i∈I(βi−μ)∥2 = n−2Et∑i,j∈I(βi−μ)⊤(βj−μ) = n−2Et∑i∈I∥βi−μ∥2+n−2Et∑i≠j∈I(βi−μ)⊤(βj−μ).

Note that for any : and can be regarded as zero-mean random vectors that are drawn uniformly at random from the same distribution without replacement. Therefore they are not positively correlated when . That is, we have

 Et∑i≠j∈I(βi−μ)⊤(βj−μ)≤0.

Therefore

 Et∥Δ¯α(t)−EtΔ¯α(t)∥2≤ 1n2Et∑i∈I∥βi−μ∥2=mn3n∑i=1∥βi−μ∥2 ≤ mn3n∑i=1∥βi∥2=mθ2n3n∑i=1∥α(t−1)i+∇ϕi(u(t−1))∥2.

Recall that the theorem upper bounds the expected value of , which in turns upper bound the duality gap at round . The following lemma derives an upper bound on this quantity that depends on the value of this quantity at the previous iteration and three additional terms. We will later show that the sum of the additional terms is negative in expectation. The lemma uses standard algebraic manipulations as well as the assumptions on and .

###### Lemma 3.

For each round we have

 [mΔP(x(t))+nΔD(α(t))]−(1−θmn)[mΔP(x(t−1))+nΔD(α(t−1))]≤a(t)I+b(t)I+c(t)I ,

where

 a(t)I=m2γ∥x(t)−u(t−1)∥2−θ(1−θ)γ2∑i∈I∥α(t−1)i+∇ϕi(u(t−1))∥2  , b(t)I=θmf(u(t−1))+mθnn∑i=1ϕ∗i(−α(t−1)i)+m∇f(u(t−1))⊤(−θu(t−1)) +∑i∈I[−θϕ∗i(−α(t−1)i)+θ[∇ϕi(u(t−1))⊤u(t−1)−ϕi(u(t−1))]]  , c(t)I=m∇f(u(t−1))⊤(x(t)−(1−θ)x(t−1))+n+mθ2λ∥¯α(t)∥2−n−mθ2λ∥¯α(t−1)∥2  .
###### Proof.

Since we have

 [mΔP(x(t))+nΔD(α(t))]−(1−θmn)[mΔP(x(t−1))+nΔD(α(t−1))] =m[ΔP(x(t))−ΔP(x(t−1))]+n[ΔD(α(t))−ΔD(α(t−1))]+θm[ΔP(x(t−1))+ΔD(α(t−1))] =m[P(x(t))−P(x(t−1))]−n[D(α(t))−D(α(t−1))]+θm[P(x(t−1))−D(α(t−1))] =m[P(x(t))−(1−θ)P(x(t−1))]−n[D(α(t))−D(α(t−1))]−θmD(α(t−1)) .

Therefore, we need to show that the right-hand side of the above is upper bounded by .

### Step 1:

We first bound . Using the smoothness of we have

 f(x(t)) ≤f(u(t−1))+∇f(u(t−1))⊤(x(t)−u(t−1))+12γ∥x(t)−u(t−1)∥2 =(1−θ)f(u(t−1))+θf(u(t−1))+∇f(u(t−1))⊤(x(t)−u(t−1))+12γ∥x(t)−u(t−1)∥2

and using the convexity of we also have

 (1−θ)f(u(t−1))≤(1−θ)[f(x(t−1))−∇f(u(t−1))⊤(x(t−1)−u(t−1))] .

Combining the above two inequalities and rearranging terms we obtain

 f(x(t)) ≤(1−θ)f(x(t−1))+θf(u(t−1)) (6) +∇f(u(t−1))⊤(x(t)−θu(t−1)−(1−θ)x(t−1))+12γ∥x(t)−u(t−1)∥2 .

Next, using the convexity of we have

 g(x(t))=λ2∥x(t)∥2=λ2∥∥∥(1−θ)x(t−1)+θλ¯α(t)∥∥∥2≤λ(1−θ)2∥x(t−1)∥2+θ2λ∥¯α(t)∥2 .

Combining this with (6) we obtain

 P(x(t)) =f(x(t))+g(x(t)) ≤(1−θ)(f(x(t−1))+λ2∥x(t−1)∥2)+θf(u(t−1)) +∇f(u(t−1))⊤(x(t)−θu(t−1)−(1−θ)x(t−1))+12γ∥x(t)−u(t−1)∥2+θ2λ∥¯α(t)∥2 ,

which yields

 m[P(x(t))−(1−θ)P(x(t−1))] ≤mθf(u(t−1))+m∇f(u(t−1))⊤(x(t)−θu(t−1)−(1−θ)x(t−1)) +m2γ∥x(t)−u(t−1)∥2+θm2λ∥¯α(t)∥2 . (7)

### Step 2:

Next, we bound . Using the definition of the dual update we have

 −n[D(α(t))−D(α(t−1))]=∑i∈I[ϕ∗i(−α(t)i)−ϕ∗i(−α(t−1)i)]+n2λ[∥¯α(t)∥2−∥¯α(t−1)∥2] . (8)

For all , we may use the definition of the update of in the algorithm, the strong-convexity of , and the equality in Fenchel-Young for gradients to obtain:

 ϕ∗i(−α(t)i)≤(1−θ)ϕ∗i(−α(t−1)i)+θϕ∗i(∇ϕi(u(t−1)))−θ(1−θ)γ2∥α(t−1)i+∇ϕi(u(t−1))∥2 =(1−θ)ϕ∗i(−α(t−1)i)+θ[∇ϕi(u(t−1))⊤u(t−1)−ϕi(u(t−1))]−θ(1−θ)γ2∥α(t−1)i+∇ϕi(u(t−1))∥2 .

Combining this with (8) we get

 −n[D(α(t))−D(α(t−1))]≤ (9) ∑i∈I[−θϕ∗i(−α(t−1)i)+θ[∇ϕi(u(t−1))⊤u(t−1)−ϕi(u(t−1))]−θ(1−θ)γ2∥α(t−1)i+∇ϕi(u(t−1))∥2] +n2λ[∥¯α(t)∥2−∥¯α(t−1)∥2] .

### Step 3:

Summing (9), (7), and the equation

 −mθD(α(t−1))=mθn∑iϕ∗i(−α(t−1)i)+mθ2λ∥¯α(t−1)∥2

we obtain that

 m[P(x(t))−(1−θ)P(x(t−1))]−n[D(α(t))−D(α(t−1))]−θmD(α(t−1))≤ mθf(u(t−1))+m∇f(u(t−1))⊤(x(t)−θu(t−1)−(1−θ)x(t−1)) +m2γ∥x(t)−u(t−1)∥2+θm2λ∥¯α(t)∥2 +∑i∈I[−θϕ∗i(−α(t−1)i)+θ[∇ϕi(u(t−1))⊤u(t−1)−ϕi(u(t−1))]−θ(1−θ)γ2∥α(t−1)i+∇ϕi(u(t−1))∥2] +n2λ[∥¯α(t)∥2−∥¯α(t−1)∥2] +mθn∑iϕ∗i(−α(t−1)i)+mθ2λ∥¯α(t−1)∥2 =a(t)I+b(t)I+c(t)I .

###### Lemma 4.

At each round , let be as defined in Lemma 3. Then,

 Et[a(t)I+b(t)I+c(t)I]≤0 .
###### Proof.

Recall,

 a(t)I=m2γ∥x(t)−u(t−1)∥2−θ(1−θ)γ2∑i∈I∥α(t−1)i+∇ϕi(u(t−1))∥2  , b(t)I=θmf(u(t−1