 # mS2GD: Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting

We propose a mini-batching scheme for improving the theoretical complexity and practical performance of semi-stochastic gradient descent applied to the problem of minimizing a strongly convex composite function represented as the sum of an average of a large number of smooth convex functions and a simple nonsmooth convex function. Our method first performs a deterministic step (computation of the gradient of the objective function at the starting point), followed by a large number of stochastic steps. The process is repeated a few times, with the last iterate becoming the new starting point. The novelty of our method is the introduction of mini-batching into the computation of stochastic steps. In each step, instead of choosing a single function, we sample $b$ functions, compute their gradients, and compute the direction based on this. We analyze the complexity of the method and show that it benefits from two speedup effects. First, we prove that as long as $b$ is below a certain threshold, we can reach a predefined accuracy with less overall work than without mini-batching. Second, our mini-batching scheme admits a simple parallel implementation, and hence is suitable for further acceleration by parallelization.


## 1 Introduction

The problem we are interested in is to minimize a sum of two convex functions,

$$\min_{x \in \mathbb{R}^d} \{ P(x) := f(x) + R(x) \}, \tag{1}$$

where $f$ is the average of a large number of smooth convex functions $f_i$, i.e.,

$$f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x). \tag{2}$$

We further make the following assumptions:

###### Assumption 1.

The regularizer $R$ is convex and closed. The functions $f_i$ are differentiable and have Lipschitz continuous gradients with constant $L > 0$. That is,

$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|,$$

for all $x, y \in \mathbb{R}^d$, where $\|\cdot\|$ is the L2 norm.

Hence, the gradient of $f$ is also Lipschitz continuous with the same constant $L$.
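The step from Assumption 1 to this claim is the triangle inequality applied to the average in (2). A short derivation, using only the symbols defined above:

```latex
\|\nabla f(x) - \nabla f(y)\|
  = \Big\| \frac{1}{n} \sum_{i=1}^{n} \big( \nabla f_i(x) - \nabla f_i(y) \big) \Big\|
  \le \frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f_i(y)\|
  \le \frac{1}{n} \sum_{i=1}^{n} L \, \|x - y\|
  = L \, \|x - y\|.
```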

###### Assumption 2.

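$P$ is strongly convex with parameter $\mu > 0$, i.e., for all $x, y \in \operatorname{dom}(R)$,

$$P(y) \ge P(x) + \xi^T (y - x) + \frac{\mu}{2} \|y - x\|^2, \quad \forall \xi \in \partial P(x), \tag{3}$$

where $\partial P(x)$ is the subdifferential of $P$ at $x$.

Let $\mu_f \ge 0$ and $\mu_R \ge 0$ be the strong convexity parameters of $f$ and $R$, respectively (we allow both of them to be equal to $0$, so this is not an additional assumption). We assume that we have lower bounds available ($\nu_f \in [0, \mu_f]$ and $\nu_R \in [0, \mu_R]$).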

#### Related work

There has been intensive interest and activity in solving problems of the form (1) in the past years. An algorithm that made its way into many applied areas is FISTA. However, this method is impractical in the large-scale setting (big $n$) as it needs to process all $n$ functions in each iteration. Two classes of methods address this issue: randomized coordinate descent methods [13, 15, 16, 11, 5, 19, 9, 10, 14, 4] and stochastic gradient methods [22, 12, 6, 20]. This brief paper is closely related to the works on stochastic gradient methods with a technique of explicit variance reduction of the stochastic approximation of the gradient. In particular, our method is a mini-batch variant of S2GD; the proximal setting was motivated by SVRG [7, 21].

A typical stochastic gradient descent (SGD) method will randomly sample the $i$-th function and then update the variable $x$ using $\nabla f_i(x)$, an estimate of $\nabla f(x)$. An important limitation of SGD is that it is inherently sequential, and thus difficult to parallelize. In order to enable parallelism, mini-batching, which samples multiple examples per iteration, is often employed [17, 3, 2, 23, 20, 18].

#### Our Contributions.

In this work, we combine the variance reduction ideas for stochastic gradient methods with mini-batching. In particular, we develop and analyze mS2GD (Algorithm 1), a mini-batch proximal variant of S2GD. To the best of our knowledge, this is the first mini-batch stochastic gradient method with reduced variance for problem (1). We show that the method enjoys a twofold benefit compared to previous methods. Apart from admitting a parallel implementation (and hence a speedup in clock time in an HPC environment), our results show that in order to attain a specified accuracy $\epsilon$, our mini-batching scheme can get by with fewer gradient evaluations. This is formalized in Theorem 2, which predicts a more than linear speedup up to some threshold value of the mini-batch size $b$. Another advantage, compared to Prox-SVRG, is that we do not need to average the points in each loop, but instead simply continue from the last one (this is the approach employed in S2GD).

## 2 Proximal Algorithms

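A popular proximal gradient approach to solving (1) is to form a sequence $\{y_k\}$ via

$$y_{k+1} = \arg\min_{x \in \mathbb{R}^d} \left[ U_k(x) \stackrel{\text{def}}{=} f(y_k) + \nabla f(y_k)^T (x - y_k) + \frac{1}{2h} \|x - y_k\|^2 + R(x) \right].$$

Note that $U_k$ is an upper bound on $P$ if $h > 0$ is a stepsize parameter satisfying $1/h \ge L$. This procedure can be equivalently written using the proximal operator as follows:

$$y_{k+1} = \operatorname{prox}_{hR}\big(y_k - h \nabla f(y_k)\big), \qquad \operatorname{prox}_{R}(z) \stackrel{\text{def}}{=} \arg\min_{x \in \mathbb{R}^d} \left\{ \frac{1}{2} \|x - z\|^2 + R(x) \right\}.$$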
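As an illustration of the proximal gradient update above, here is a minimal sketch in plain Python. The choice $R(x) = \lambda \|x\|_1$ (whose proximal operator is componentwise soft-thresholding), the toy quadratic $f(x) = \frac{1}{2}\|x - c\|^2$ (so $L = 1$), and all function names are assumptions made for the example; none of them come from the paper.

```python
import math

def soft_threshold(z, tau):
    """Proximal operator of tau * ||x||_1, applied componentwise (assumed regularizer)."""
    return [math.copysign(max(abs(zi) - tau, 0.0), zi) for zi in z]

def prox_gradient_step(y, grad_f, h, lam):
    """One update y_{k+1} = prox_{hR}(y_k - h * grad f(y_k)) with R = lam * ||x||_1."""
    g = grad_f(y)
    shifted = [yi - h * gi for yi, gi in zip(y, g)]
    return soft_threshold(shifted, h * lam)

# Toy smooth part: f(x) = 0.5 * ||x - c||^2, so grad f(x) = x - c and L = 1.
c = [3.0, -0.2, 0.5]
grad_f = lambda x: [xi - ci for xi, ci in zip(x, c)]

y = [0.0, 0.0, 0.0]
for _ in range(50):
    y = prox_gradient_step(y, grad_f, h=1.0, lam=1.0)  # stepsize satisfies 1/h >= L
```

For this toy problem the minimizer of $P$ is the soft-thresholded vector $c$, that is $(2, 0, 0)$, and with $h = 1/L$ the iteration reaches it after a single step.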

In the large-scale setting it is more efficient to instead consider the stochastic proximal gradient approach, in which the proximal operator is applied to a stochastic gradient step:

$$y_{k+1} = \operatorname{prox}_{hR}(y_k - h v_k), \tag{4}$$

where $v_k$ is a stochastic estimate of the gradient $\nabla f(y_k)$. Of particular relevance to our work are the SVRG, S2GD and Prox-SVRG methods, where the stochastic estimate of $\nabla f(y_k)$ is of the form

$$v_k = \nabla f(x) + \frac{1}{n q_{i_k}} \big( \nabla f_{i_k}(y_k) - \nabla f_{i_k}(x) \big), \tag{5}$$

where $x$ is an "old" reference point for which the gradient $\nabla f(x)$ was already computed in the past, and $i_k \in \{1, \dots, n\}$ is a random index equal to $i$ with probability $q_i > 0$. Notice that $v_k$ is an unbiased estimate of the gradient:

$$\mathbf{E}_i[v_k] = \nabla f(x) + \sum_{i=1}^{n} q_i \frac{1}{n q_i} \big( \nabla f_i(y_k) - \nabla f_i(x) \big) = \nabla f(y_k).$$

Methods such as SVRG, S2GD and Prox-SVRG update the points $y_k$ in an inner loop, and the reference point $x$ in an outer loop. This ensures that $v_k$ has low variance, which ultimately leads to extremely fast convergence.
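The unbiasedness of the estimate (5) can be checked numerically by computing the expectation over the index exactly. The sketch below does this for hypothetical one-dimensional quadratics $f_i(x) = \frac{1}{2} a_i (x - c_i)^2$ with a non-uniform sampling distribution $q$; the data and function names are illustrative assumptions.

```python
# Hypothetical 1-D components f_i(x) = 0.5 * a[i] * (x - c[i])**2.
a = [1.0, 2.0, 4.0, 0.5]
c = [0.0, 1.0, -1.0, 3.0]
n = len(a)
q = [0.1, 0.2, 0.3, 0.4]  # sampling probabilities q_i > 0, summing to 1

def grad_i(i, x):
    """Gradient of the i-th component function."""
    return a[i] * (x - c[i])

def full_grad(x):
    """Gradient of f = (1/n) * sum_i f_i."""
    return sum(grad_i(i, x) for i in range(n)) / n

def vr_estimate(i, y, x_ref, g_ref):
    """Variance-reduced estimate (5) for a sampled index i."""
    return g_ref + (grad_i(i, y) - grad_i(i, x_ref)) / (n * q[i])

x_ref, y = 2.0, 0.7          # reference point and current inner iterate
g_ref = full_grad(x_ref)     # computed once per outer loop

# Exact expectation over i: sum_i q_i * v(i); it should equal grad f(y).
expected_v = sum(q[i] * vr_estimate(i, y, x_ref, g_ref) for i in range(n))
gap = abs(expected_v - full_grad(y))
```

Here `gap` is zero up to floating-point rounding, which mirrors the identity $\mathbf{E}_i[v_k] = \nabla f(y_k)$.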

## 3 Mini-batch S2GD

We now describe the mS2GD method (Algorithm 1).

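The main step of our method (Step 8) is given by the update (4), however with the stochastic estimate of the gradient instead formed using a mini-batch of examples of size $b$. We run the inner loop for $t_k$ iterations, where $t_k = t \in \{1, 2, \dots, m\}$ with probability $q_t$ given by

$$q_t \stackrel{\text{def}}{=} \frac{1}{\gamma} \left( \frac{1 - h\mu_f}{1 + h\nu_R} \right)^{m-t}, \quad \text{with} \quad \gamma \stackrel{\text{def}}{=} \sum_{t=1}^{m} \left( \frac{1 - h\mu_f}{1 + h\nu_R} \right)^{m-t}. \tag{6}$$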
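Algorithm 1 itself is not reproduced in this text, so the following is only a minimal sketch of the loop structure described above, not the authors' implementation. It assumes that mini-batches are drawn uniformly without replacement, uses a hypothetical one-dimensional problem with $f_i(x) = \frac{1}{2} a_i (x - c_i)^2$ and $R(x) = \frac{\lambda}{2} x^2$, and defaults the parameters `mu_f` and `nu_R` in (6) to $0$, which makes the distribution of $t_k$ uniform. The function name `ms2gd` and all data are illustrative.

```python
import random

def ms2gd(grad_i, n, prox, x0, h, m, b, epochs, mu_f=0.0, nu_R=0.0, seed=0):
    """Sketch of a mini-batch semi-stochastic proximal gradient loop (assumed structure)."""
    rng = random.Random(seed)
    ratio = (1.0 - h * mu_f) / (1.0 + h * nu_R)
    # Unnormalized weights proportional to q_t in (6), for t = 1..m.
    weights = [ratio ** (m - t) for t in range(1, m + 1)]
    x = x0
    for _ in range(epochs):
        # Deterministic step: full gradient at the reference point x.
        g = sum(grad_i(i, x) for i in range(n)) / n
        # Random inner-loop length t_k drawn according to q_t.
        t_k = rng.choices(range(1, m + 1), weights=weights)[0]
        y = x
        for _ in range(t_k):
            batch = rng.sample(range(n), b)  # b distinct indices (assumption)
            v = g + sum(grad_i(i, y) - grad_i(i, x) for i in batch) / b
            y = prox(y - h * v, h)           # update (4)
        x = y                                # continue from the last inner iterate
    return x

# Hypothetical toy problem in one dimension.
n, lam = 20, 0.1
a = [0.5 + 1.5 * i / (n - 1) for i in range(n)]          # curvatures, so L = 2
c = [1.0 + 2.0 * ((7 * i) % n) / n for i in range(n)]
grad_i = lambda i, x: a[i] * (x - c[i])
prox = lambda z, h: z / (1.0 + h * lam)                  # prox of h * (lam/2) * x^2

x_star = sum(ai * ci for ai, ci in zip(a, c)) / (sum(a) + n * lam)  # closed-form minimizer
x_hat = ms2gd(grad_i, n, prox, x0=0.0, h=0.1, m=50, b=4, epochs=30)
```

On this toy problem `x_hat` should agree with the closed-form minimizer `x_star` to high accuracy. The stepsize $h = 0.1$ is chosen conservatively relative to $1/L = 0.5$.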

## 4 Complexity Result

In this section, we state our main complexity result and comment on how to optimally choose the parameters of the method.

###### Theorem 1.

Let Assumptions 1 and 2 be satisfied and let $x_* = \arg\min_x P(x)$. In addition, assume that the stepsize satisfies $0 < h \le \min\left\{ \frac{1}{4L\alpha(b)}, \frac{1}{L} \right\}$ and that $m$ is sufficiently large so that

 (7)

where $\alpha(b) = \frac{n-b}{b(n-1)}$. Then mS2GD has linear convergence in expectation:

$$\mathbf{E}\big(P(x_k) - P(x_*)\big) \le \rho^k \big(P(x_0) - P(x_*)\big).$$

###### Remark 1.

If we consider the special case $\nu_f = 0$, $\nu_R = 0$ (i.e., if the algorithm does not have any nontrivial lower bounds on $\mu_f$ and $\mu_R$), we obtain

$$\rho = \frac{1}{m h \mu \big(1 - 4hL\alpha(b)\big)} + \frac{4hL\alpha(b)(m+1)}{m \big(1 - 4hL\alpha(b)\big)}. \tag{8}$$

In the special case when $b = 1$ we get $\alpha(b) = 1$, and the rate given by (8) exactly recovers the rate achieved by Prox-SVRG (in the case when the Lipschitz constants are all equal).
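To get a feel for how the rate (8) depends on the mini-batch size, the sketch below evaluates it for a few values of $b$. It assumes $\alpha(b) = \frac{n-b}{b(n-1)}$, which is consistent with $\alpha(1) = 1$ in the remark above; the function names and the numerical values of $n$, $L$, $\mu$, $h$ and $m$ are illustrative assumptions.

```python
def alpha(b, n):
    """Assumed mini-batch factor alpha(b) = (n - b) / (b * (n - 1))."""
    return (n - b) / (b * (n - 1))

def rate(h, m, b, n, L, mu):
    """Convergence factor rho from (8), i.e. the case nu_f = nu_R = 0."""
    a = 4.0 * h * L * alpha(b, n)
    if a >= 1.0:
        raise ValueError("stepsize too large: need 4*h*L*alpha(b) < 1")
    return 1.0 / (m * h * mu * (1.0 - a)) + a * (m + 1) / (m * (1.0 - a))

# Illustrative (hypothetical) problem constants.
n, L, mu = 100, 1.0, 0.01
h, m = 0.1, 10000

rates = {b: rate(h, m, b, n, L, mu) for b in (1, 2, 5, 10)}
```

With the stepsize and inner-loop size held fixed, the values in `rates` decrease as $b$ grows, because $\alpha(b)$ shrinks.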

### 4.1 Mini-batch speedup

To see the speedup gained from the mini-batch strategy, we need to fix some of the many parameters in the complexity result (Theorem 1). For simplicity, we will use $\nu_f$ and $\nu_R$ equal to $0$, so we can analyse (8) instead of (7). Let us consider the case when we also fix $k$ (the number of outer iterations). Once $k$ is fixed, a target accuracy $\epsilon$ determines the value of $\rho$ that guarantees the result.

Let us now fix the target decrease in a single epoch, $\rho$. For any mini-batch size $b$, define $(h_*^b, m_*^b)$ to be the optimal pair of stepsize and inner-loop size such that the rate (8) does not exceed this target. This pair is optimal in the sense that $m_*^b$ is the smallest possible, because we are interested in minimizing the computational effort, and thus $m$. If we set $b = 1$, we recover the optimal choice of parameters without mini-batching. If $m_*^b \le m_*^1 / b$, then we can reach the same accuracy with fewer evaluations of the gradient of a function $f_i$. The following theorem states the formulas for $h_*^b$ and $m_*^b$. Equation (9) shows that as long as the condition $\tilde{h}^b \le 1/L$ is satisfied, $m_*^b$ decreases at a rate faster than $1/b$. Hence, we can attain the same accuracy with less work, compared to the case when $b = 1$.

###### Theorem 2.

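Fix a target $\rho \in (0, 1)$ for the rate given by (8). Then, if we consider the mini-batch size $b$ to be fixed, the choice of stepsize $h_*^b$ and size of inner loop $m_*^b$ that minimizes the work done (the number of gradients evaluated) while keeping the rate (8) at most $\rho$ is given by the following formulas. Let

$$\tilde{h}^b := \sqrt{\left( \frac{1+\rho}{\rho\mu} \right)^2 + \frac{1}{4\mu\alpha(b)L}} - \frac{1+\rho}{\rho\mu}.$$

If $\tilde{h}^b \le 1/L$ then $h_*^b = \tilde{h}^b$ and

$$m_*^b = \frac{4}{\sqrt{\frac{\rho^2 \mu}{\alpha(b) L} + 4(1+\rho)^2} - 2(1+\rho)} = \frac{8\alpha(b)L}{\mu\rho^2} \left( 1 + \rho + \sqrt{\frac{\mu\rho^2}{4\alpha(b)L} + (1+\rho)^2} \right). \tag{9}$$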

Otherwise $h_*^b = 1/L$ and

$$m_*^b = \frac{L/\mu + 4\alpha(b)}{\rho - 4\alpha(b)(1+\rho)}.$$
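The parameter choice in Theorem 2 can be sketched numerically as follows. The code assumes $\alpha(b) = \frac{n-b}{b(n-1)}$ with $b < n$. For the case $\tilde{h}^b > 1/L$ it uses the value of $m$ obtained by solving (8) for $m$ at $h = 1/L$, which is a derivation made for this example. Function names and problem constants are hypothetical.

```python
import math

def alpha(b, n):
    """Assumed mini-batch factor alpha(b) = (n - b) / (b * (n - 1)); requires b < n here."""
    return (n - b) / (b * (n - 1))

def rate(h, m, a, L, mu):
    """Rate (8) for a given value a = alpha(b)."""
    s = 4.0 * h * L * a
    return 1.0 / (m * h * mu * (1.0 - s)) + s * (m + 1) / (m * (1.0 - s))

def optimal_parameters(rho, b, n, L, mu):
    """Stepsize and inner-loop size attaining the target rho in (8) with the least work."""
    a = alpha(b, n)
    c = (1.0 + rho) / (rho * mu)
    h_tilde = math.sqrt(c * c + 1.0 / (4.0 * mu * a * L)) - c
    if h_tilde <= 1.0 / L:
        h = h_tilde
        m = 8.0 * a * L / (mu * rho ** 2) * (
            1.0 + rho + math.sqrt(mu * rho ** 2 / (4.0 * a * L) + (1.0 + rho) ** 2)
        )  # second form of (9)
    else:
        h = 1.0 / L
        m = (L / mu + 4.0 * a) / (rho - 4.0 * a * (1.0 + rho))  # (8) solved for m at h = 1/L
    return h, m

# Illustrative constants (hypothetical): condition number L / mu = 1000.
n, L, mu, rho = 1000, 1.0, 1e-3, 0.5
h1, m1 = optimal_parameters(rho, 1, n, L, mu)
h10, m10 = optimal_parameters(rho, 10, n, L, mu)
h500, m500 = optimal_parameters(rho, 500, n, L, mu)  # large b: stepsize is capped at 1/L
```

In this example $b = 10$ stays below the threshold and `10 * m10` is smaller than `m1`, so the total number of stochastic gradient evaluations per epoch goes down. For $b = 500$ the stepsize is capped at $1/L$ and the real-valued `m500` would be rounded up in practice.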

## 5 Experiments

In this section we present a preliminary experiment and an insight into the possible speedup by parallelism. Figure 1 shows experiments on L2-regularized logistic regression on the RCV1 dataset (available at http://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/). We compare S2GD (blue, squares) and mS2GD (green, circles) with mini-batch size $b$, without any parallelism. The figure demonstrates that one can achieve the same accuracy with less work. The green dashed line is the ideal (most likely practically unachievable) result with parallelism (we divide passes through data by $b$). For comparison, we also include SGD with constant stepsize (purple, stars), chosen in hindsight to optimize performance. Figure 2 shows the possible speedup in terms of work done, formalized in Theorem 2. Notice that up to a certain threshold, we do not need any more work to achieve the same accuracy (the red straight line is the ideal speedup; the blue curved line is what mS2GD achieves).

Figure 1: Experiment on RCV1 dataset.