SVRG meets SAGA: k-SVRG --- A Tale of Limited Memory

05/02/2018 · by Anant Raj et al. · Max Planck Society and EPFL

In recent years, many variance reduced algorithms for empirical risk minimization have been introduced. In contrast to vanilla SGD, these methods converge linearly on strongly convex problems. To obtain the variance reduction, current methods either require frequent passes over the full data to recompute gradients, without making any progress during this time (like SVRG), or they require memory of the same size as the input problem (like SAGA). In this work, we propose k-SVRG, an algorithm that interpolates between these two extremes: it makes the best use of the available memory and, in turn, avoids full passes over the data during which no progress is made. We prove linear convergence of k-SVRG on strongly convex problems and convergence to stationary points on non-convex problems. Numerical experiments show the effectiveness of our method.


1 Introduction

We study optimization algorithms for empirical risk minimization problems of the form

  min_{x ∈ ℝ^d} f(x),   with   f(x) := (1/n) ∑_{i=1}^{n} f_i(x),        (1)

where each f_i: ℝ^d → ℝ is L-smooth.

Problems with this structure are omnipresent in machine learning, especially in supervised learning applications (Bishop, 2016).

Stochastic gradient descent (SGD) (Robbins and Monro, 1951) is frequently used to solve optimization problems in machine learning. One drawback of SGD is that it does not converge at the optimal rate on many problem classes (cf. (Nemirovski et al., 2009; Lacoste-Julien et al., 2012)). Variance reduced methods have been introduced to overcome this challenge. Among the first of these methods were SAG (Roux et al., 2012), SVRG (Johnson and Zhang, 2013), SDCA (Shalev-Shwartz and Zhang, 2013) and SAGA (Defazio et al., 2014). The variance reduced methods can roughly be divided into two classes, namely i) methods that achieve variance reduction by computing (non-stochastic) full gradients of f from time to time, as is done for example in SVRG, and ii) methods that maintain a table of previously computed stochastic gradients, as is done in SAGA.
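
To make the distinction concrete, the following illustrative Python sketch (our own, not from the paper; `grad` is a hypothetical oracle returning the gradient of f_i at a point) contrasts the two kinds of variance reduced gradient estimators:

def svrg_estimate(grad, x, i, x_snap, full_grad_snap):
    # SVRG-style: recompute the stochastic gradient at the stored snapshot
    # point on the fly; only x_snap and its full gradient are kept in memory.
    return grad(i, x) - grad(i, x_snap) + full_grad_snap

def saga_estimate(grad, x, i, grad_table, grad_table_mean):
    # SAGA-style: read the previously stored stochastic gradient from a table
    # of n vectors; the caller then writes g_new back into the table and
    # updates the running mean of the table.
    g_new = grad(i, x)
    return g_new - grad_table[i] + grad_table_mean, g_new

Both estimators are unbiased; they differ in whether the correction term is recomputed from a stored point (SVRG) or read from a stored table of gradients (SAGA).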

Whilst these techniques allow the variance reduced methods to converge at a faster rate than vanilla SGD, they do not scale well to problems of very large size. The reasons are simple: i) not only is computing a full batch gradient prohibitively expensive when the number of samples is large, the optimization progress of SVRG also completely stalls while this expensive computation takes place. This is avoided in SAGA, but ii) at the cost of substantial additional memory. When the data is sparse but the stochastic gradients are not, the memory requirements can thus surpass the size of the dataset by orders of magnitude.

In this work we address these issues and propose a class of variance reduced methods that have i) much shorter stalling phases, at the expense of only a moderate amount of additional memory. Here, k is a parameter that can be set freely by the user. To get short stalling phases, it is advisable to set k such that the additional memory matches the capacity of the fast memory of the system. We show that the new methods converge as fast as SVRG and SAGA on convex and non-convex problems, but are more practical for large n. As a side product of our analysis, we also crucially refine the previous theoretical analysis of SVRG, as we will outline in Section 1.2 below.

Method | Complexity | Additional memory | In situ computation | No full pass
Gradient Descent
SAGA
SVRG
SCSG
k-SVRG
Table 1: Comparison of running times and (additional) storage requirements for different algorithms on strongly convex functions, where κ denotes the condition number. Most algorithms require in situ computation of many stochastic gradients at the same point without making progress. The length of the longest such stalling phase is indicated, sometimes amounting to a full pass over the data (also indicated).

1.1 SVRG, SAGA and k-SVRG

SVRG is an iterative algorithm where, in each iteration, only stochastic gradients ∇f_i(x) for a random index i are computed, much like in SGD. In order to attain variance reduction, a full gradient is computed at a snapshot point every few epochs. There are three issues with SVRG: i) the computation of the full gradient requires a full pass over the dataset, and no progress (towards the optimal solution) is made during this time (see the illustration in Figure 1). On large scale problems, where one pass over the data might take several hours, this can lead to a wasteful use of resources; ii) the theory requires the algorithm to restart at every snapshot point, resulting in discontinuous behaviour (see Fig. 1); and iii) on strongly convex problems, the snapshot point can only be updated every Θ(κ) iterations (cf. (Bubeck, 2014; Johnson and Zhang, 2013)), where κ denotes the condition number (see (9)). When the condition number is large, this means that the algorithm relies for a long time on “outdated” deterministic information. In practice, as suggested in the original paper by Johnson and Zhang (2013), the update interval is often set to a small multiple of n, without theoretical justification.

SAGA circumvents the stalling phases by treating every iterate as a partial snapshot point. That is, for each index i a full d-dimensional vector is kept in memory and updated with the current iterate if index i is picked in the current iteration. Hence, intuitively, the gradient information at the partial snapshot points in SAGA is more recent than in SVRG.

A big drawback of this method is the memory consumption: unless there are specific assumptions on the structure of the f_i (cf. the discussion in (Defazio et al., 2014, Sec. 4)), this requires O(nd) memory (sparsity of the data does not necessarily imply sparsity of the gradients). For large scale problems it is impossible to keep all this data available in fast memory (i.e. cache or RAM), which means we cannot run SAGA on large scale problems that do not have GLM structure. Although SAGA can sometimes converge faster than SVRG (but not always, cf. (Defazio et al., 2014)), the high memory requirements prohibit its use. One main advantage of this algorithm is that convergence can be proven for every single iterate (more precisely, convergence is not shown directly on the iterates, but in terms of an auxiliary Lyapunov function), thus justifying stopping the algorithm at any arbitrary time, whereas for SVRG convergence can only be proven for the snapshot points.
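
As a rough back-of-the-envelope illustration (hypothetical problem sizes, not taken from the paper), the following sketch estimates the size of SAGA's table of dense gradients:

# Hypothetical sizes: n samples, d-dimensional dense gradients, 8 bytes per double.
n, d = 10**6, 10**4
bytes_per_entry = 8

saga_table_gb = n * d * bytes_per_entry / 10**9   # table of n stored gradients
print(f"SAGA gradient table: about {saga_table_gb:.0f} GB")
# ~80 GB here, even if the (sparse) dataset itself occupies only a fraction of this space.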

We propose k-SVRG, a class of algorithms that addresses the limitations of both SAGA and SVRG. Compared to SAGA, the proposed schemes have a much smaller memory footprint and therefore allow to optimally use the available (fast) memory. Compared to SVRG, the schemes avoid long stalling phases on large scale applications (see Fig. 1). The methods do not require restarts and show smoother convergence than SVRG (see Fig. 1). As for SVRG, convergence can only be guaranteed for the snapshot points. However, unlike in the original SVRG, the proposed k-SVRG updates the snapshot points at least once every epoch and thus provides more fine grained performance guarantees than the original SVRG, which may run many more iterations between snapshot points.

Figure 1: Convergence behavior of SAGA, SVRG and k-SVRG. Left & Middle: SVRG recomputes the gradient at the snapshot point, which leads to stalling for a full epoch, both with respect to computation (left) and memory access (middle). SAGA requires only one stochastic gradient computation per iteration (left), but also one memory access (middle: roughly identical performance to SVRG w.r.t. memory access). Right: k-SVRG does not reset the iterates at a snapshot point and distributes the stalling phases equally.

1.2 Contributions

We present k-SVRG, a limited memory variance reduced optimization algorithm that combines several good properties of SVRG as well as of SAGA. We propose two variants of k-SVRG that need to store only a moderate number of additional vectors and enjoy theoretical convergence guarantees, and one (more practical) variant, k²-SVRG, with an even stricter memory limit. Some key properties of our proposed approaches are:


  • Low memory requirements (like SVRG, unlike SAGA): We break the memory barrier of SAGA. The required additional memory can be chosen freely by the user (via the parameter k), and thus all available fast memory (but not more!) can be used by the algorithm.

  • Avoiding long stalling phases (like SAGA, unlike SVRG): This is particularly useful in large scale applications.

  • Refinement of the SVRG analysis: To the best of our knowledge, we present the first analysis that allows arbitrary sizes of the inner loops, not only sizes of the order of the condition number as supported by previous results.

  • Linear convergence on strongly-convex problems (like SVRG, SAGA), cf. Table 1.

  • Convergence on non-convex problems (like SVRG, SAGA).

Outline.

We informally introduce k-SVRG in Section 2 and give the full details in Section 3. All theoretical results are presented in Section 4; the proofs can be found in Appendices C and D. We discuss the empirical performance in Section 5.

1.3 Related Work

Variance reduction alone is not sufficient to obtain the optimal convergence rate on problem (1). Accelerated schemes that combine variance reduction with momentum, as in Nesterov's acceleration technique (Nesterov, 1983), achieve the optimal convergence rate (Allen-Zhu, 2017; Lin et al., 2015). We do not discuss accelerated methods in this paper; however, we expect that it should be possible to accelerate the presented algorithm with the usual techniques.

There have also been significant efforts in developing stochastic variance reduced methods for non-convex problems (Allen-Zhu and Yuan, 2016; Reddi et al., 2016b, 2015; Allen-Zhu and Hazan, 2016; Shalev-Shwartz, 2016; Paquette et al., 2018). We will especially build on the technique proposed in (Reddi et al., 2016b) to derive the convergence analysis in the non-convex setting.

Recent work has also addressed the issue of making the stalling phase of SVRG shorter. In (Lei and Jordan, 2017; Lei et al., 2017) the authors propose SCSG, a method that performs only a batch gradient update instead of a full gradient update. However, this gives a slower rate of convergence (cf. Table 1). In another line of work, the SVRG and SAGA approaches were combined in an asynchronous optimization setting (Reddi et al., 2015) (HSAG) to run different updates in parallel. HSAG interpolates between SAGA and SVRG “per datapoint”, which means that snapshot points corresponding to indices in a (fixed) set are updated as in SAGA, whereas all other snapshot points are updated after each epoch. This is orthogonal to our approach: we treat all datapoints “equally”, and all snapshot points are updated in the same, block-wise fashion. Also, convergence of HSAG is not guaranteed for every choice of this set. In yet another line of work, Hofmann et al. (2015) studied a version of SAGA with more than one update per iteration.

2 k-SVRG: A Limited Memory Approach

In this section, we informally introduce our proposed limited memory algorithm k-SVRG. For this, we first present a unified framework that allows us to describe the algorithms SVRG and SAGA in concise notation. Let x_t denote the iterates of the algorithm, where x_0 is the starting point. For each component f_i, i = 1, …, n, of the objective function (1) we denote by θ_i the corresponding snapshot point. The updates of the algorithms take the form

  x_{t+1} = x_t − η (∇f_{i_t}(x_t) − ∇f_{i_t}(θ_{i_t}) + (1/n) ∑_{j=1}^{n} ∇f_j(θ_j)),        (2)

where η denotes the stepsize and i_t an index (typically selected uniformly at random from the set {1, …, n}). The updates of SVRG and SAGA can both be written in this general form, as we review now.


SVRG

As mentioned before, SVRG maintains only one active snapshot point x̃, i.e. θ_i = x̃ for all i. Instead of storing all components ∇f_i(θ_i) separately, it suffices to store the single snapshot point x̃ as well as its full gradient ∇f(x̃) in memory, as the required component ∇f_{i_t}(θ_{i_t}) = ∇f_{i_t}(x̃) can be recomputed when applying the update (2). This results in a slight increase in the computation cost, but in a drastic reduction of the memory footprint.

SAGA

The update of SAGA takes exactly the form (2). In general θ_i ≠ θ_j for i ≠ j, thus all n snapshot points need to be kept in memory. In practice, often the gradient ∇f_i(θ_i) is stored instead of θ_i, as this avoids its recomputation.

k-SVRG

As a natural interpolation between those two algorithms we propose the following: instead of maintaining just one single snapshot point or n of them, maintain only a few. Precisely, the proposed algorithm maintains a small set Θ of snapshot points, with the property θ_i ∈ Θ for each index i. Therefore, it suffices to store only the elements of Θ in memory, together with a mapping from each index i to its corresponding element in Θ, which requires far less memory than SAGA. As opposed to SAGA, it is not advised to store the gradients ∇f_i(θ_i) directly, as this would require O(nd) memory.
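
A minimal sketch (our own illustration, with hypothetical names) of the compressed representation described above: only the distinct snapshot points are stored, together with an index map of length n.

import numpy as np

class SnapshotStore:
    """Stores a few distinct snapshot points plus an index map of length n."""

    def __init__(self, n, x0):
        self.points = [np.asarray(x0, dtype=float)]   # distinct snapshot points (d-dimensional)
        self.assign = np.zeros(n, dtype=int)           # assign[i] -> position of theta_i in self.points

    def snapshot_of(self, i):
        # Return the snapshot point theta_i of component i.
        return self.points[self.assign[i]]

    def update_block(self, indices, new_point):
        # Assign all components in `indices` to one new shared snapshot point.
        self.points.append(np.asarray(new_point, dtype=float))
        self.assign[indices] = len(self.points) - 1
        # Snapshot points that are no longer referenced could be garbage collected
        # to keep the memory at a few d-dimensional vectors plus the index map.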

k²-SVRG

We also propose a heuristic variant of k-SVRG, denoted k²-SVRG, that maintains at most a fixed number of snapshot points (of order k). This method comes without theoretical convergence rates; however, it shows very good performance in practice.

We give a formal definition of the algorithms in Section 3. Below we introduce some notation that will be needed later.

2.1 Notation

Our algorithm consists of updates of two types: updates of the iterates as in (2), performed in the inner loop, and updates of the snapshot points at the end of each inner loop (thus constituting the outer loop). We denote the iterates of the algorithm by x_t^m, where t denotes the counter of the inner loop (consisting of l iterations) and m the counter of the outer loop. For our algorithm (unlike in SAGA), the iterate at the end of an inner loop coincides with the first iterate of the next inner loop, x_0^{m+1} = x_l^m. Whenever we only consider the iterates x_0^m we drop the index zero for convenience.

For clarity, we also index the snapshot points by m, that is, we write θ_i^m for the snapshot point corresponding to the component f_i in the m-th outer loop. Thus the update (2) now reads

  x_{t+1}^m = x_t^m − η (∇f_{i_t}(x_t^m) − ∇f_{i_t}(θ_{i_t}^m) + ḡ^m),        (3)

where it is convenient to define the average of the gradients at the snapshot points,

  ḡ^m := (1/n) ∑_{j=1}^{n} ∇f_j(θ_j^m).        (4)

Notation for Expectation.

E denotes the full expectation with respect to the joint distribution of all sampled indices. Frequently, we only consider the updates within one outer loop and condition on the past iterates. Let I_t^m denote the set of indices chosen in the m-th outer loop up to the t-th inner loop iteration. Then E_{I_t^m} denotes the expectation with respect to the joint distribution of all indices in I_t^m. The algorithm k-SVRG-V2 samples additional indices, independently of I_t^m, and we denote the expectation over those samples by E_S. Finally, we use the obvious shorthand for these expectations whenever the indices are clear from the context.

3 The Algorithm

In this section, we present k-SVRG in detail. The pseudocode is given in Algorithm 1. k-SVRG consists of inner and outer loops, similar to SVRG; however, the inner loops are much shorter. Recall that t denotes the counter of the inner loop and m the counter of the outer loop. Similar as in SVRG, a new snapshot point (denoted by x̄^{m+1}) is computed as an average of the iterates of the inner loop. However, in our case it is a weighted average

  (5)

where the normalization constant is defined in line 3 of Algorithm 1. Note that for non-convex functions the weights are uniform, so that the weighted average in (5) reduces to a plain average of the iterates.

In Algorithm 1, we describe two variants of k-SVRG. These variants differ in the way the snapshot points are updated at the end of each inner loop.


V1

In k-SVRG-V1, we update the snapshot points as follows before moving to the next outer loop:

  θ_i^{m+1} = x̄^{m+1} for i ∈ S^m,   and   θ_i^{m+1} = θ_i^m otherwise.        (6)

The set S^m keeps track of the indices selected in the inner loop (line 10). Hence, we do not need to store a separate copy of the snapshot point for every index in memory; it suffices to store one copy and the set S^m, as mentioned in Section 2.

It is not required that the set of indices used to update the snapshot points is identical to the set of indices used to compute the iterates in the inner loop. Moreover, the number of points does not need to be the same either. The following version of k-SVRG makes this independence explicit.


V2

In k-SVRG-V2(q), we sample q indices without replacement from {1, …, n} at the end of the outer loop; these form the set S^m, and the snapshot points are then updated as before in (6). The suggested choice for q is the length l of the inner loop, and whenever we drop the argument, we simply set q = l.

1:  goal minimize
2:  init , , , , ,
3:  
4:  for
5:   init
6:   for
7:    pick uniformly at random
8:    
9:    
10:    
11:   end for
12:   
13:   
14:   if variant -SVRG-V2
15:    
16:   end if
17:   
18:   
19:  end for
20:  return
Algorithm 1 k-SVRG-V1 / k-SVRG-V2
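
As a complement to the pseudocode, the following Python sketch gives one possible reading of k-SVRG-V1/-V2 based on the description in the text (our own illustration: the inner loop length, the plain instead of the weighted average (5), and all names are simplifying assumptions, not the authors' exact Algorithm 1):

import numpy as np

def k_svrg(grad, n, x0, k, eta, outer_loops, variant="V1", rng=None):
    """Illustrative sketch of k-SVRG (not the exact Algorithm 1).

    grad(i, x) must return the gradient of f_i at x. For simplicity this
    sketch uses l = n // k inner iterations and a plain average of the
    inner-loop iterates instead of the weighted average (5).
    """
    rng = np.random.default_rng() if rng is None else rng
    l = max(1, n // k)                        # assumed inner-loop length
    x = np.asarray(x0, dtype=float).copy()
    points = [x.copy()]                       # distinct snapshot points
    assign = np.zeros(n, dtype=int)           # component i -> index into `points`
    g_bar = np.mean([grad(i, x) for i in range(n)], axis=0)   # one initial full pass

    for _ in range(outer_loops):
        x_sum = np.zeros_like(x)
        picked = []
        for _ in range(l):
            i = int(rng.integers(n))          # index picked uniformly at random
            v = grad(i, x) - grad(i, points[assign[i]]) + g_bar   # update (2)/(3)
            x = x - eta * v
            x_sum += x
            picked.append(i)
        x_bar = x_sum / l                     # new snapshot point (plain average)

        if variant == "V1":                   # reuse the indices picked in the inner loop
            S = np.unique(picked)
        else:                                 # "V2": l fresh indices, sampled without replacement
            S = rng.choice(n, size=l, replace=False)

        # Incremental update of g_bar: only gradients for indices in S are recomputed,
        # so no full pass over the data is needed (cf. Remark 1).
        for i in S:
            g_bar = g_bar + (grad(i, x_bar) - grad(i, points[assign[i]])) / n
        points.append(x_bar)                  # unreferenced snapshot points could be pruned
        assign[S] = len(points) - 1
    return x

A usage example would pass a closure grad(i, x) computing the gradient of f_i for the problem at hand; note that the iterates are never reset to a snapshot point.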

Memory Requirement.

To estimate the memory requirement we need to know the number of distinct elements in the set of snapshot points. The well-studied coupon-collector problem (cf. (Holst, 1986)) tells us that in expectation about n·log n uniform samples are needed to pick every one of n indices at least once. In Algorithm 1, precisely one index is sampled in each iteration of the inner loop, i.e. l per outer loop, which implies that, in expectation, every index gets picked at least once within a logarithmic number of epochs. Thus, at any time, the number of distinct snapshot points is in expectation only of the order of n/l, up to a logarithmic factor. These statements also hold with high probability at the expense of additional poly-logarithmic factors. Thus, memory of this order (times the dimension d) suffices to invoke Algorithm 1.
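
The coupon-collector estimate above can be checked with a tiny simulation (our own illustration, with arbitrary example values for n and k, assuming an inner loop length of n/k):

import numpy as np

def count_distinct_snapshots(n=10_000, k=10, outer_loops=200, seed=0):
    rng = np.random.default_rng(seed)
    l = n // k                             # assumed inner-loop length
    last_update = np.zeros(n, dtype=int)   # outer loop in which index i was last picked
    for m in range(1, outer_loops + 1):
        picked = rng.integers(n, size=l)   # indices sampled in this outer loop
        last_update[picked] = m
    # Indices sharing the same last update share one snapshot point.
    return len(np.unique(last_update))

print(count_distinct_snapshots())  # typically of order k*log(n), i.e. far fewer than n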

We can enforce a hard limit on the memory by slightly violating the random sampling assumption: instead of sampling without replacement in k-SVRG-V2, we simply process all indices in the order given by a random permutation, and reshuffle after each epoch (the pseudocode is given in Algorithm 2 in Appendix A). Clearly, as we process the indices in the order given by random permutations, each index gets picked at least once per epoch, i.e. within every n/l consecutive outer loops. Therefore, the number of distinct snapshot points at any time is deterministically bounded and of the order of n/l.


k²-SVRG

k²-SVRG deviates from k-SVRG-V2 in lines 14–16. Instead of sampling the indices for each outer loop independently, we process the indices in blocks. Concretely, we periodically sample a fresh random partition of the indices into blocks and then process one block per outer loop (to not clutter the notation, we assume here that the block size divides n). We give the pseudocode for k²-SVRG in Appendix A.
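
A sketch (our own, with hypothetical names) of such a block-wise index schedule: a fresh permutation is drawn once per epoch and processed block by block, so every index is updated at least once per epoch.

import numpy as np

def block_schedule(n, num_blocks, epochs, seed=0):
    # Yield, for each outer loop, the block of indices whose snapshot points
    # are updated. Reshuffling once per epoch caps the number of outer loops
    # between two consecutive updates of the same index.
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        perm = rng.permutation(n)
        for block in np.array_split(perm, num_blocks):
            yield block

for S in block_schedule(n=10, num_blocks=2, epochs=1):
    print(S)   # two disjoint blocks covering all 10 indices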

Remark 1 (Implementation).

One of the main advantages of k-SVRG is that no full pass over the data is required at the end of an outer loop. The update of x̄^{m+1} in line 12 can be computed on the fly with the help of a single extra variable. To implement the update of the snapshot points in line 17, we use the compressed representation of the set S^m as discussed above. The update of the average gradient ḡ in line 18 requires additional gradient computations for k-SVRG-V2, but comes at a lower cost for k-SVRG-V1, as the identity

  (7)

allows reusing values already computed for the indices picked in the inner loop.
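
To illustrate the last point, a minimal sketch (our own, with hypothetical names) of how the stored average gradient can be refreshed without a full pass over the data: only the components whose snapshot point changes contribute a correction.

import numpy as np

def refresh_avg_gradient(g_bar, grad, S, x_bar_new, snapshot_of, n):
    # g_bar        : current average (1/n) * sum_j of the gradients at the snapshot points
    # grad(i, x)   : gradient oracle for component i
    # S            : indices whose snapshot point is replaced by x_bar_new
    # snapshot_of  : callable returning the old snapshot point of index i
    # Only |S| gradient differences are added; no full pass over the data is needed.
    correction = np.zeros_like(g_bar)
    for i in S:
        correction += grad(i, x_bar_new) - grad(i, snapshot_of(i))
    return g_bar + correction / n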

4 Theoretical Analysis

In this section, we provide the theoretical analysis of the algorithms proposed in the previous section. We first discuss convergence in the convex case in Section 4.1, and then convergence in the non-convex setting in Section 4.2. For both cases we assume that the functions f_i, i = 1, …, n, are L-smooth. Let us recall the definition: a function f is L-smooth if it is differentiable and

  ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖   for all x, y ∈ ℝ^d.        (8)

4.1 Strongly Convex Problems

In this subsection we additionally assume f to be μ-strongly convex with μ > 0, i.e. we assume that

  f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (μ/2) ‖y − x‖²   for all x, y ∈ ℝ^d.        (9)

It will also come in handy to denote the condition number by κ := L/μ, following the notation in (Hofmann et al., 2015).

Lyapunov Function.

Similar as in (Defazio et al., 2014) and (Hofmann et al., 2015), we show convergence of the algorithm by studying a suitable Lyapunov function. In fact, we use the same family of functions as in (Hofmann et al., 2015), defined as follows:

  (10)

with a constant parameter that we will set later. We evaluate this function at tuples formed by the current iterate and the current snapshot points. In order to show convergence, we therefore also need to define a sequence of parameters that are updated in sync with the snapshot points; if these parameters dominate the corresponding error terms of the snapshot points, then convergence of the Lyapunov function implies convergence of the iterates. We will now proceed to define a sequence with this property. It is important to note that these quantities only show up in the analysis; they neither need to be computed nor updated by the algorithm.

Similar as in (Hofmann et al., 2015), we define per-component quantities whose sum upper bounds the second term of the Lyapunov function. Let us now proceed to define them precisely. For this, let

  (11)

We initialize these bounds (conceptually) at the start of the algorithm and then update them in the following manner:

  (12)

Here, S^m denotes the set of indices that are used in the update at the end of the m-th inner loop of either k-SVRG-V1 or k-SVRG-V2, see Algorithm 1.

Convergence Results.

We now show the linear convergence of k-SVRG-V1 (Theorem 2) and k-SVRG-V2 (Theorem 1).

Theorem 1.

Let x^m denote the iterates in the outer loop of k-SVRG-V2(q). If q, the parameter of the Lyapunov function, and the step size are chosen according to the conditions detailed in Appendix C, then

  (13)
Proof Sketch.

By applying Lemmas 3 and 4, we directly get a relation of the form

  (14)

where the involved constants will be specified in the proof. From this expression it becomes clear that we obtain the statement of the theorem if we can make both contraction factors small enough. These calculations are detailed in the proof in Appendix C. ∎

Theorem 2.

Let x^m denote the iterates in the outer loop of k-SVRG-V1. If the parameter of the Lyapunov function and the step size are chosen according to the conditions detailed in Appendix C, then

  (15)
Proof.

The proof of Theorem 2 is very similar to the one of Theorem 1. A detailed proof is provided in Appendix C. ∎

Let us state a few observations:

Remark 2 (Convergence rate).

Both results show convergence at a linear rate. The convergence factor is the same one that also appears in the convergence rates of SVRG and SAGA. For SAGA, a decrease by this factor can be shown in every iteration for the corresponding Lyapunov function; thus, after one inner loop worth of steps, SAGA achieves a decrease of the same order as k-SVRG (note that the decrease is not exactly identical if different stepsizes are used). On the other hand, the proof for SVRG only shows a decrease by a constant factor after a full (and much longer) inner loop. The same improvement is attained by k-SVRG after correspondingly many inner loops, i.e. after the same total number of updates. Hence, our rates do not fundamentally differ from the rates of SVRG and SAGA (compared to the former method we even improve), but they provide an interpolation between both results.

Remark 3 (Relation to SVRG).

For k = 1, our algorithms resemble SVRG with (geometric) averaging of the iterates. However, our proof gives the flexibility to prove convergence of SVRG with an inner loop size that does not depend on the condition number, in contrast to Johnson and Zhang (2013). The analysis of SVRG is further strengthened in many subtle details: for instance, we do not require the iterates to be reset to the snapshot point as in vanilla SVRG, we have shorter stalling phases (for k > 1), and the possibility to choose the inner loop length l and the number of sampled indices q differently opens more possibilities for tuning.

Remark 4 (Relation to SAGA).

In SAGA, exactly one snapshot point is updated per iteration. The same number of updates is performed (on average) per iteration in our setting when q equals the inner loop length l. Hofmann et al. (2015) study a variant of SAGA that performs more than one update per iteration, but do not propose a concrete choice for this number.

Remark 5 (Dependence of the convergence rate on q and l).

For ease of presentation we have stated the convergence results here in a simplified way, omitting the dependence on q entirely (see also Remark 2). However, some mild dependencies can be extracted from the proof. For instance, it is intuitively clear that choosing a larger q in Theorem 1 should yield a better rate, and this is indeed true. Moreover, setting q smaller will still give linear convergence, but at a lower rate. For our application we aim to choose q as small as possible (reducing computation), without sacrificing too much in the convergence rate.

In the rest of this subsection, we give some tools that are required to prove Theorems 1 and 2. The proofs of both statements are given in Appendix C. Lemma 3 establishes a recurrence relation between subsequent iterates of the outer loop.

Lemma 3.

Let x^m denote the iterates in the outer loop of Algorithm 1. Then it holds that

  (16)

where the involved constants are specified in Appendix C.

We further need to bound the expression that appears on the right-hand side of equation (16). Recall that we have already introduced bounds for this purpose. We now closely follow the machinery developed in (Hofmann et al., 2015) in order to show how these bounds decrease (in expectation) from one iteration to the next.

Lemma 4.

Let the sequence of bounds be defined as in Section 4.1 and updated according to equation (12), and let the snapshot points be as in Algorithm 1. Then it holds that

  (for k-SVRG-V1)  (17)
  (for k-SVRG-V2)  (18)

where the involved constant is specified in Appendix C.

4.2 Non-convex Problems

In this section, we discuss the convergence of the proposed algorithms for non-convex problems. In order to employ Algorithm 1 on non-convex problems, we use uniform averaging of the iterates (cf. the discussion after (5)). We limit our analysis to smooth non-convex functions.

Throughout the section, we assume that each f_i is L-smooth (8), and we provide the convergence rate of k-SVRG-V2 only. However, convergence of k-SVRG-V1 in the non-convex case can be shown in a similar way. The convergence also extends to the class of gradient dominated functions by standard techniques (cf. (Reddi et al., 2016b; Allen-Zhu and Yuan, 2016)). We follow the proof technique from (Reddi et al., 2016b) to provide the theoretical justification of our approach. However, the proof is not straightforward, due to the difficulty imposed by the block-wise update of the snapshot points in k-SVRG-V2.

Lyapunov Function.

For the analysis of our algorithms, we again choose a suitable Lyapunov function, similar to the one used in (Reddi et al., 2016b). In the following, let M denote the total number of outer loops performed. For each outer loop m ≤ M, define the Lyapunov function as

  (19)

where the coefficients form a sequence of parameters that we will introduce shortly (note the superscript indices). The sequence is initialized such that the Lyapunov function of the final outer loop reduces to the expected function value, and it is constructed such that the Lyapunov function dominates the expected function value in every outer loop. These two properties will be exploited in the proof below.

Similar to the previous section, we define per-component quantities whose average constitutes the second term of the Lyapunov function. We now define the sequence of coefficients and an auxiliary sequence that will be used in the proof:

  (20)
  (21)

with a parameter that will be specified later. As mentioned, the coefficient of the final outer loop is fixed, and (20) then provides the values of the earlier coefficients by a backward recursion. It will further be convenient to collect the update vectors of an outer loop as the columns of a matrix and to measure this matrix in the Frobenius norm. With the notation just defined, the relevant variance terms can be written compactly, and by the tower property of conditional expectations we obtain

  (22)

Convergence Results.

Now we provide the main theoretical result of this subsection. Theorem 5 shows sub-linear convergence to a stationary point for non-convex functions.

Theorem 5.

Let x_t^m denote the iterates of k-SVRG-V2. Let the coefficients be defined as in (20), with the parameters chosen such that the required stepsize condition holds for all outer loops. Then:

  (23)

where the involved constant is specified in Appendix D. In particular, for a suitable concrete choice of the parameters and the step size it holds that

  (24)
Proof Sketch.

We rely on some technical results that are presented in Lemmas 6, 7 and 8 below. Equation (23) can readily be derived from Lemma 8 by first taking expectations and then using a telescoping sum. Since the Lyapunov function dominates the expected function value, we get:

  (25)

Choosing the coefficients as described above and lower bounding the resulting constant is the final step of our proof. Details about all the constants are given in Appendix D. ∎

Remark 6 (Upper bound on the inner loop length).

It is important to note that, unlike in the convex setting, Theorem 5 does not allow the number of steps in the inner loop to be set arbitrarily large. This essentially means that the number of snapshot points cannot be reduced below a certain threshold in k-SVRG-V2 for non-convex problems. The limitation occurs because we cannot work with a Lyapunov function that depends only on the inner loop iterates, as done in (Reddi et al., 2016a); hence the expected variance keeps adding itself to the next variance term, which finally results in an extra dependence on the inner loop length. We believe, however, that this limitation can be improved further. Apart from this limitation, we obtain the same convergence rate for our method as for non-convex SVRG and non-convex SAGA.

Now we discuss the lemmas that are helpful in proving Theorem 5. The proofs of these lemmas are deferred to Appendix D. Lemma 6 establishes a recurrence relation for the second term of the Lyapunov function.

Lemma 6.

Consider the setting of Theorem 5. Then, conditioned on the iterates obtained before the m-th outer loop, it holds that:

  (26)

This result suggests that we should now relate the variance of the stochastic gradient update to the expected true gradient and the Lyapunov function. This is done in Lemma 7, with the help of the result of Lemma 11, which is provided in Appendix D.

Lemma 7.

Consider the setting of Theorem 5. Upon completion of the m-th outer loop it holds that:

  (27)

Finally, we can proceed to present the most important lemma of this section, from which the main Theorem 5 readily follows.

Lemma 8.

Consider the setting of Theorem 5, that is, the parameters are chosen such that the required stepsize condition holds. Then:

  (28)

5 Experiments

Figure 2: Residual loss on mnist for SVRG, k-SVRG-V1 (left), k-SVRG-V2 (middle) and k²-SVRG (right) for different values of k.
Figure 3: Residual loss on covtype (train) for SVRG, k-SVRG-V1 (left), k-SVRG-V2 (middle) and k²-SVRG (right) for different values of k.

To support the theoretical analysis, we present numerical results on ℓ2-regularized logistic regression problems, i.e. problems of the form

  f(x) = (1/n) ∑_{i=1}^{n} log(1 + exp(−b_i ⟨a_i, x⟩)) + (λ/2) ‖x‖²,        (29)

with data points a_i ∈ ℝ^d and labels b_i ∈ {−1, +1}. The regularization parameter λ is set as in (Nguyen et al., 2017). We use the datasets covtype (train, test) and mnist (binary); all datasets are available at http://manikvarma.org/code/LDKL/download.html. Some statistics of the datasets are summarized in Table 2. For all experiments we perform a warm start of the algorithms, that is, we provide the initial gradients as input. Several cold start procedures (where the stored gradients are injected one by one) have been suggested (cf. (Defazio et al., 2014)), but discussing the effects of these heuristics is not the focus of this paper.
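
For reference, the component functions and gradients used in (29) might be implemented as in the following sketch (our own illustration, with data point a_i, label b_i ∈ {−1, +1} and a hypothetical regularization parameter lam; not the authors' code):

import numpy as np

def logistic_f_i(x, a_i, b_i, lam):
    # Component f_i of (29): logistic loss plus l2 regularization.
    return np.logaddexp(0.0, -b_i * a_i.dot(x)) + 0.5 * lam * x.dot(x)

def logistic_grad_i(x, a_i, b_i, lam):
    # Gradient of f_i. Its smoothness constant is ||a_i||^2 / 4 + lam; the
    # factor 4 is the same one mentioned in the caption of Table 2.
    sigma = 1.0 / (1.0 + np.exp(b_i * a_i.dot(x)))   # sketch only; not numerically hardened
    return -b_i * sigma * a_i + lam * x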

Dataset          d    n        L
covtype (test)   54   58,102   1,311
covtype (train)  54   522,910  43,586
mnist            784  60,000   38,448
Table 2: Summary of the datasets used in the experiments. L denotes the smoothness constant derived from the data points; the factor of 4 in its derivation is due to the use of the logistic loss.

We conduct experiments with SAGA, SVRG (where we fix the size of the inner loop to one epoch) and the proposed k-SVRG for several values of k in all variants (k-SVRG-V1, k-SVRG-V2 and k²-SVRG). For simplicity, we use the default parameter settings throughout.

The running time of the algorithms is dominated by two components: the time for computation and the time to access the data. The actual proportions depend on the hardware and the problem instance.


Gradient Computations (#GC).

Fig. 1 (left). We count the number of gradient evaluations of the form ∇f_i(x). In SAGA, each step of the inner loop comprises only one such computation, whereas in SVRG, two gradients have to be computed per inner iteration. The figure nicely depicts the stalling of SVRG after one pass over the data (when a full gradient has to be computed in situ).

Effective Data Reads (#ER).

Fig. 1 (middle). We count the number of accesses to the data, that is, how often a d-dimensional vector needs to be fetched from memory. In the SVRG variants this is one data point in each iteration of the inner loop, plus the data points needed to update the stored gradients (see Remark 1). For SAGA, two values have to be fetched in each iteration. For the k-SVRG variants the stalling phases are more equally distributed (for k large). Moreover, there is no big jump in the function value, as the current iterate does not have to be reset (a difference to SVRG).

5.1 Illustrative Experiment, Figure 1

For the results displayed in Figure 1 in Section 1.1 we set the learning rate to an artificially low value for all algorithms. This allows us to emphasize the distinctive features of each method. Figure 4 in the appendix depicts additional k-SVRG variants in the same setting.

5.2 Experiments on Large Datasets

Due to the large memory requirements of SAGA, we do not run SAGA on the large scale problems. Even though for every method there is a theoretically safe stepsize, it is common practice to tune the stepsize for each dataset (cf. (Defazio et al., 2014; Schmidt et al., 2017)). By extensive testing we determined the stepsizes that achieve the smallest training error after a fixed budget of effective data reads (#ER) for covtype (test) and for mnist. (We emphasize that the optimal stepsize crucially depends on the maximal budget, i.e. the optimal values might differ if the application demands higher or lower accuracy.) The determined optimal learning rates are summarized in Table 3. For covtype (train) we found a single common stepsize to be a reasonable setting for all algorithms.

Algorithm / Dataset   covtype (test)   mnist   covtype (train)
SVRG
k-SVRG-V1
k-SVRG-V2
k²-SVRG
Table 3: Determined optimal stepsizes for the datasets covtype (test), mnist and covtype (train).

In Figure 2 we compare all algorithms on mnist. We observe that k²-SVRG performs best on mnist, followed by the other k-SVRG variants, which perform very similarly to SVRG. In Figure 3 we compare all algorithms on covtype (train) and the picture is similar: k²-SVRG works best, followed by k-SVRG-V1, then k-SVRG-V2, and all variants of k-SVRG outperform SVRG. We observe that the parameter k seems to affect the performance only by a small factor on these datasets. However, it is not easy to predict the best possible k without tuning it; on the other hand, larger values of k do not seem to degrade performance, which allows choosing k as large as the system at hand supports. Additional results are displayed in Appendix E.

6 Conclusion

We propose k-SVRG, a variance reduction technique suited for large scale optimization, and show convergence on convex and non-convex problems at the same theoretical rates as SAGA and SVRG. Our algorithms have very mild memory requirements compared to SAGA, and the memory footprint can be tuned according to the available resources. By tuning the parameter k, one can pick the algorithm that best fits the available system resources, i.e. one should pick a large k on systems with plenty of fast memory, and a smaller k when data access is slow (so that the additional memory still fits in RAM). This provides a great amount of flexibility in distributed optimization, as we can choose a different k on each machine. We could also imagine that automatically adapting k as the optimization progresses, i.e. adapting to the system resources, might yield the best performance in practice. However, this feature needs to be investigated further.

For future work, we plan to extend our analysis of k-SVRG using tools along the lines of the recently proposed analysis of reshuffled SGD (HaoChen and Sra, 2018). From a computational point of view, it is also important to investigate whether the gradients at the snapshot points could be replaced with inexact approximations that are cheaper to compute.

References

  • Allen-Zhu (2017) Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205, 2017.
  • Allen-Zhu and Hazan (2016) Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pages 699–707, 2016.
  • Allen-Zhu and Yuan (2016) Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In International Conference on Machine Learning, pages 1080–1089, 2016.
  • Bishop (2016) Christopher M Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2016.
  • Bubeck (2014) Sébastien Bubeck. Convex Optimization: Algorithms and Complexity. May 2014.
  • Defazio et al. (2014) Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
  • HaoChen and Sra (2018) Jeffery Z HaoChen and Suvrit Sra. Random shuffling beats SGD after finite epochs. arXiv preprint arXiv:1806.10077, 2018.
  • Hofmann et al. (2015) Thomas Hofmann, Aurelien Lucchi, Simon Lacoste-Julien, and Brian McWilliams. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, pages 2305–2313, 2015.
  • Holst (1986) Lars Holst. On birthday, collectors’, occupancy and other classical urn problems. International Statistical Review / Revue Internationale de Statistique, 54(1):15–27, 1986.
  • Johnson and Zhang (2013) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
  • Lacoste-Julien et al. (2012) Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.
  • Lei and Jordan (2017) Lihua Lei and Michael Jordan. Less than a single pass: Stochastically controlled stochastic gradient. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 148–156. PMLR, 2017.
  • Lei et al. (2017) Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2345–2355, 2017.
  • Lin et al. (2015) Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems 28, pages 3384–3392. 2015.
  • Nemirovski et al. (2009) Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  • Nesterov (1983) Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Dokl. Akad. Nauk SSSR, 269:543–547, 1983.
  • Nguyen et al. (2017) Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2613–2621, 2017.
  • Paquette et al. (2018) Courtney Paquette, Hongzhou Lin, Dmitriy Drusvyatskiy, Julien Mairal, and Zaid Harchaoui. Catalyst for gradient-based nonconvex optimization. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 613–622. PMLR, 2018.
  • Reddi et al. (2015) Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alexander J Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems, pages 2647–2655, 2015.
  • Reddi et al. (2016a) Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314–323, 2016a.
  • Reddi et al. (2016b) Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Fast incremental method for smooth nonconvex optimization. In Decision and Control (CDC), 2016 IEEE 55th Conference on, pages 1971–1977. IEEE, 2016b.
  • Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, September 1951.
  • Roux et al. (2012) Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25, pages 2663–2671. 2012.
  • Schmidt et al. (2017) Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
  • Shalev-Shwartz (2016) Shai Shalev-Shwartz. SDCA without duality, regularization, and individual convexity. In International Conference on Machine Learning, pages 747–754, 2016.
  • Shalev-Shwartz and Zhang (2013) Shai Shalev-Shwartz and Tong Zhang. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. JMLR, 14:567–599, February 2013.

Appendix A Pseudo-code for k²-SVRG

We provide the pseudocode for k²-SVRG in Algorithm 2 below. For simplicity, we assume here that n is divisible by the block size.

1:  goal minimize
2:  init , , , , and
3:  
4:  
5:  for
6:   ind randperm(n)
7:   for
8:    init
9:    for
10:     pick uniformly at random
11:     
12:     
13:     
14:    end for
15:    
16:    
17: