Trading-off variance and complexity in stochastic gradient descent

03/22/2016 ∙ by Vatsal Shah, et al. ∙ The University of Texas at Austin

Stochastic gradient descent is the method of choice for large-scale machine learning problems, by virtue of its light complexity per iteration. However, it lags behind its non-stochastic counterparts in convergence rate, due to the high variance introduced by the stochastic updates. The popular Stochastic Variance-Reduced Gradient (SVRG) method mitigates this shortcoming, introducing a new update rule that requires infrequent passes over the entire input dataset to compute the full gradient. In this work, we propose CheapSVRG, a stochastic variance-reduction optimization scheme. Our algorithm is similar to SVRG, but instead of the full gradient it uses a surrogate that can be efficiently computed on a small subset of the input data. It achieves a linear convergence rate, up to some error level that depends on the nature of the optimization problem, and features a trade-off between computational complexity and convergence rate. Empirical evaluation shows that CheapSVRG performs at least competitively with the state of the art.


1 Introduction

Several machine learning and optimization problems involve the minimization of a smooth, convex and separable cost function $F$:

$$\min_{x \in \mathbb{R}^d} \; F(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} f_i(x), \tag{1}$$

where the $d$-dimensional variable $x$ represents model parameters, and each of the $n$ functions $f_i$ depends on a single data point. Linear regression is such an example: given $n$ points $(a_i, b_i) \in \mathbb{R}^d \times \mathbb{R}$, one seeks the $x$ that minimizes the sum of the squared residuals $f_i(x) = (a_i^\top x - b_i)^2$, $i = 1, \dots, n$. Training of neural networks [7, 13], multi-class logistic regression [13, 23], image classification [5], matrix factorization [24] and many more tasks in machine learning entail an optimization of similar form.
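To make the setting concrete, the following minimal Python sketch (our own illustration, using the least-squares example above; function and variable names are not from the paper) evaluates the separable objective in (1) together with its atomic and full gradients, the basic operations that all methods discussed below build on.

```python
import numpy as np

def objective(x, A, b):
    """Separable objective F(x) = (1/n) * sum_i (a_i^T x - b_i)^2."""
    residuals = A @ x - b
    return np.mean(residuals ** 2)

def component_grad(x, A, b, i):
    """Atomic gradient of the i-th component f_i(x) = (a_i^T x - b_i)^2."""
    return 2.0 * (A[i] @ x - b[i]) * A[i]

def full_grad(x, A, b):
    """Full gradient of F: requires one pass over all n data points."""
    n = A.shape[0]
    return 2.0 * A.T @ (A @ x - b) / n
```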

Batch gradient descent schemes can effectively solve small- or moderate-scale instances of (1). Often, though, the volume of input data outgrows our computational capacity, posing major challenges. Classic batch optimization methods [19, 6] perform several passes over the entire input dataset to compute the full gradient, or even the Hessian (in this work, we focus on first-order methods only; extensions to higher-order schemes are left for future work), in each iteration, incurring a prohibitive cost for very large problems.

Stochastic optimization methods overcome this hurdle by computing only a surrogate of the full gradient, based on a small subset of the input data. For instance, the popular SGD [22] scheme in each iteration takes a small step in a direction determined by a single, randomly selected data point. This imperfect gradient step yields smaller progress per iteration, but many such steps can be performed in the time it would take a batch gradient descent method to compute a single full gradient [4].

Nevertheless, the approximate ‘gradients’ of stochastic methods introduce variance in the course of the optimization. Notably, vanilla SGD can deviate from the optimum even if it is initialized at the optimum [13]. To ensure convergence, the learning rate has to decay to zero, which results in sublinear convergence rates [22], a significant degradation from the linear rate achieved by batch gradient methods.
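As a point of reference, here is a minimal sketch of vanilla SGD with a decaying step size (a generic textbook variant, not code from the paper), reusing component_grad from the previous snippet.

```python
import numpy as np

def sgd(x0, A, b, n_iters=10_000, eta0=0.1, seed=0):
    """Vanilla SGD: one atomic gradient per step, step size decaying as eta0 / (t + 1)."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float)
    n = A.shape[0]
    for t in range(n_iters):
        i = rng.integers(n)                      # one data point, uniformly at random
        eta_t = eta0 / (t + 1)                   # decaying step size: needed for convergence,
        x -= eta_t * component_grad(x, A, b, i)  # but it also slows the rate to sublinear
    return x
```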

A recent line of work [23, 13, 14, 15] has made promising steps towards the middle ground between these two extremes. A full gradient computation is occasionally interleaved with the inexpensive steps of SGD, dividing the course of the optimization into epochs. Within an epoch, descent directions are formed as a linear combination of an approximate gradient (as in vanilla SGD) and a full gradient vector computed at the beginning of the epoch. Though not always up-to-date, the full gradient information reduces the variance of the gradient estimates and provably speeds up convergence.

Yet, as the size of the problem grows, even an infrequent computation of the full gradient may severely impede the progress of these variance-reduction approaches. For instance, when training large neural networks [20, 29, 7], the volume of the input data rules out the possibility of computing a full gradient within any reasonable time window. Moreover, in a distributed setting, accessing the entire dataset may incur significant tail latencies [33]. On the other hand, traditional stochastic methods exhibit slow convergence and in practice frequently fail to come close to the optimal solution in a reasonable amount of time.

Contributions.

The above motivate the design of algorithms that strike a compromise between the two extremes, circumventing the costly computation of the full gradient while admitting favorable convergence guarantees. In this work, we reconsider the computational resource allocation problem in stochastic variance-reduction schemes: given a limited budget of atomic gradient computations, how can we utilize those resources in the course of the optimization to achieve faster convergence? Our contributions can be summarized as follows:

  • We propose CheapSVRG, a variant of the popular Svrg scheme [13]. Similarly to Svrg, our algorithm divides time into epochs, but at the beginning of each epoch computes only a surrogate of the full gradient using a subset of the input data. Then, it computes a sequence of estimates using a modified version of SGD steps. Overall, CheapSVRG can be seen as a family of stochastic optimization schemes encompassing Svrg and vanilla SGD. It exposes a set of tuning knobs that control trade-offs between the per-iteration computational complexity and the convergence rate.

  • Our theoretical analysis shows that CheapSVRG achieves a linear convergence rate in expectation, up to a constant factor that depends on the problem at hand. Our analysis is along the lines of similar results for both deterministic and stochastic schemes [25, 18].

  • We supplement our theoretical analysis with experiments on synthetic and real data. Empirical evaluation supports our claims for linear convergence and shows that CheapSVRG performs at least competitively with the state of the art.

2 Related work

There is extensive literature on classic SGD approaches. We refer the reader to [4, 3] and the references therein for useful pointers. Here, we focus on works related to variance reduction using gradients, and consider only primal methods; see [27, 9, 17] for dual approaches.

Roux et al. [23] were among the first to consider variance-reduction methods in stochastic optimization. Their proposed scheme, Sag, achieves linear convergence under smoothness and strong convexity assumptions and is computationally efficient: it performs only one atomic gradient calculation per iteration. However, it is not memory efficient (the authors show how to reduce the memory requirements in the case where each $f_i$ depends on a linear combination of the optimization variables), as it requires storing all intermediate atomic gradients to generate approximations of the full gradient and, ultimately, achieve variance reduction.

In [13], Johnson and Zhang improve upon [23] with their Stochastic Variance-Reduced Gradient (Svrg) method, which both achieves linear convergence rates and does not require storing the full history of atomic gradients. However, Svrg requires a full gradient computation per epoch. The S2gd method of [14] follows steps similar to Svrg, with the main difference lying in the number of iterations within each epoch, which is chosen according to a specific geometric law. Both [13] and [14] rely on the assumptions that $F$ is strongly convex and the $f_i$'s are smooth.

More recently, Defazio et al. proposed Saga [8], a fast incremental gradient method in the spirit of Sag and Svrg. Saga works for both strongly convex and plain convex objective functions, as well as in proximal settings. However, similarly to its predecessor [23], it does not admit a low storage cost.

While writing this paper, we became aware of the independent works [10] and [12], where similar ideas are developed. The former considers a streaming version of the Svrg algorithm; both its purpose and its theoretical analysis differ from the present work. The results presented in the latter work are of the same flavor as ours.

Finally, we note that proximal [15, 8, 1, 31] and distributed [21, 16, 32] variants have also been proposed for such stochastic settings. We leave these variations out of comparison and consider similar extensions to our approach as future work.

3 Our variance reduction scheme

We consider the minimization in (1). In the $k$th iteration, vanilla SGD generates a new estimate

$$x_{k+1} = x_k - \eta_k \nabla f_{i_k}(x_k),$$

based on the previous estimate $x_k$ and the atomic gradient of a component $f_{i_k}$, where the index $i_k$ is selected uniformly at random from $[n]$. The intuition behind SGD is that, in expectation, its update direction aligns with the gradient descent update. But, contrary to gradient descent, SGD is not guaranteed to move towards the optimum in each single iteration. To guarantee convergence, it employs a decaying sequence of step sizes $\eta_k$, which in turn impacts the rate at which convergence occurs.

Svrg [13] alleviates the need for a decreasing step size by dividing time into epochs and interleaving a computation of the full gradient between consecutive epochs. The full gradient information $\nabla F(\tilde{x})$, where $\tilde{x}$ is the estimate available at the beginning of the $s$th epoch, is used to steer the subsequent steps and counterbalance the variance introduced by the randomness of the stochastic updates. Within the $s$th epoch, Svrg computes a sequence of estimates $x_1, \dots, x_m$, where $x_{k+1} = x_k - \eta\, v_k$ and

$$v_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}) + \nabla F(\tilde{x})$$

is a linear combination of full and atomic gradient information. Based on this sequence, it computes the next estimate $\tilde{x}_s$, which is passed down to the next epoch. Note that $v_k$ is an unbiased estimator of the gradient $\nabla F(x_k)$, i.e., $\mathbb{E}_{i_k}[v_k] = \nabla F(x_k)$.
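In the same Python style as the earlier snippets (helper names are ours), one Svrg epoch can be sketched as follows; the full gradient is computed once and then recycled in every inner step.

```python
import numpy as np

def svrg_epoch(x_tilde, A, b, m, eta, seed=0):
    """One Svrg epoch: a full gradient at x_tilde, then m variance-reduced steps."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    mu = full_grad(x_tilde, A, b)                # one full pass over the data
    x = x_tilde.copy()
    iterates = []
    for _ in range(m):
        i = rng.integers(n)
        v = (component_grad(x, A, b, i)
             - component_grad(x_tilde, A, b, i)
             + mu)                               # unbiased: E_i[v] = grad F(x)
        x = x - eta * v                          # constant step size
        iterates.append(x)
    return np.mean(iterates, axis=0)             # estimate passed to the next epoch
```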

As the number of components $n$ grows large, the computation of the full gradient $\nabla F(\tilde{x})$ at the beginning of each epoch becomes a computational bottleneck. A natural alternative is to compute a surrogate $\tilde{g}$ of the full gradient, using only a small subset of the input data.

Our scheme.

We propose CheapSVRG, a variance-reduction stochastic optimization scheme. Our algorithm can be seen as a unifying scheme of existing stochastic methods including Svrg and vanilla SGD. Its steps are outlined in Alg. 1.

1:  Input: initial estimate $\tilde{x}_0$, number of epochs $T$, epoch length $m$, surrogate size $s_1$, step size $\eta$.
2:  Output: $\tilde{x}_T$.
3:  for $s = 1, 2, \dots, T$ do
4:     Randomly select $\mathcal{S}_s \subseteq [n]$ with cardinality $s_1$.
5:     Set $\tilde{x} = \tilde{x}_{s-1}$ and $\mathcal{S} = \mathcal{S}_s$.
6:     $\tilde{g} = \frac{1}{s_1} \sum_{i \in \mathcal{S}} \nabla f_i(\tilde{x})$.
7:     $x_0 = \tilde{x}$.
8:     for $k = 0, 1, \dots, m-1$ do
9:        Randomly select $i_k \in [n]$.
10:        $v_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}) + \tilde{g}$.
11:        $x_{k+1} = x_k - \eta\, v_k$.
12:     end for
13:     $\tilde{x}_s = \frac{1}{m} \sum_{k=1}^{m} x_k$.
14:  end for
Algorithm 1 CheapSVRG

CheapSVRG divides time into epochs. The $s$th epoch begins at an estimate $\tilde{x}_{s-1}$, inherited from the previous epoch. For the first epoch, that estimate is given as input, $\tilde{x}_0$. The algorithm selects a set $\mathcal{S}_s \subseteq [n]$ uniformly at random, with cardinality $s_1$, for some parameter $s_1 \le n$. Using only the components of $F$ indexed by $\mathcal{S}_s$, it computes

$$\tilde{g} = \frac{1}{s_1} \sum_{i \in \mathcal{S}_s} \nabla f_i(\tilde{x}_{s-1}), \tag{2}$$

a surrogate of the full gradient $\nabla F(\tilde{x}_{s-1})$.

Within the $s$th epoch, the algorithm generates a sequence of estimates $x_1, \dots, x_m$ through an equal number of SGD-like iterations, using a modified, ‘biased’ update rule. Similarly to Svrg, starting from $x_0 = \tilde{x}_{s-1}$, in the $k$th iteration it computes

$$x_{k+1} = x_k - \eta\, v_k,$$

where $\eta$ is a constant step size and

$$v_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}_{s-1}) + \tilde{g}.$$

The index $i_k$ is selected uniformly at random from $[n]$, independently across iterations. (In the Appendix, we also consider the case where the inner loop uses a mini-batch instead of a single component; its cardinality is a user parameter.) The $m$ estimates obtained from the iterations of the inner loop (lines 8-12) are averaged to yield the estimate $\tilde{x}_s$ of the current epoch, which is used to initialize the next.

Note that during this SGD phase, the index set $\mathcal{S}_s$ is fixed. Taking the expectation w.r.t. the index $i_k$, we have

$$\mathbb{E}_{i_k}[v_k] = \nabla F(x_k) - \nabla F(\tilde{x}_{s-1}) + \tilde{g}.$$

Unless $\tilde{g} = \nabla F(\tilde{x}_{s-1})$, the update direction is a biased estimator of $\nabla F(x_k)$. This is a key difference from the update direction used by Svrg in [13]. Of course, since $\mathcal{S}_s$ is selected uniformly at random in each epoch, across epochs we have $\mathbb{E}_{\mathcal{S}_s}[\tilde{g}] = \nabla F(\tilde{x}_{s-1})$, where the expectation is with respect to the random choice of $\mathcal{S}_s$. Hence, in expectation, the update direction can be considered an unbiased surrogate of $\nabla F(x_k)$.

Our algorithm can be seen as a unifying framework, encompassing existing stochastic optimization methods. If the tuning parameter $s_1$ is set to its smallest value, the algorithm reduces to vanilla SGD, while for $s_1 = n$ we recover Svrg. Intuitively, $s_1$ establishes a trade-off between the quality of the full-gradient surrogate generated at the beginning of each epoch and the associated computational cost.
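Putting the pieces together, a minimal Python sketch of Alg. 1 (reusing the helpers from the earlier snippets; hyperparameter names and the averaging convention are ours) could look as follows.

```python
import numpy as np

def cheap_svrg(x0, A, b, n_epochs, m, s1, eta, seed=0):
    """CheapSVRG sketch: each epoch builds a surrogate gradient from s1 random components."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    x_tilde = x0.astype(float)
    for _ in range(n_epochs):
        S = rng.choice(n, size=s1, replace=False)        # subset for the surrogate (s1 >= 1 here)
        g_tilde = np.mean([component_grad(x_tilde, A, b, i) for i in S], axis=0)
        x = x_tilde.copy()
        iterates = []
        for _ in range(m):                               # inner SGD-like loop (lines 8-12)
            i = rng.integers(n)
            v = (component_grad(x, A, b, i)
                 - component_grad(x_tilde, A, b, i)
                 + g_tilde)                              # biased unless S covers all of [n]
            x = x - eta * v
            iterates.append(x)
        x_tilde = np.mean(iterates, axis=0)              # epoch estimate initializes the next
    return x_tilde
```

With s1 = n the sketch essentially coincides with the Svrg epoch above, while shrinking s1 towards its minimum moves the behavior towards that of plain SGD, at a fraction of the per-epoch cost.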

4 Convergence analysis

In this section, we provide a theoretical analysis of our algorithm under standard assumptions, along the lines of [25, 18]. We begin by defining those assumptions and the notation used in the remainder of this section.

Notation.

We use $[n]$ to denote the set $\{1, \dots, n\}$. For an index $i$ in $[n]$, $\nabla f_i$ denotes the atomic gradient of the $i$th component $f_i$. We use $\mathbb{E}_\xi$ to denote the expectation with respect to the random variable $\xi$. With a slight abuse of notation, we use $\mathbb{E}$ to denote the expectation with respect to all sources of randomness.

Assumptions.

Our analysis is based on the following assumptions, which are common across several works in the stochastic optimization literature.

Assumption 1 (Lipschitz continuity of $\nabla f_i$).

Each $f_i$ in (1) has $L$-Lipschitz continuous gradients, i.e., there exists a constant $L > 0$ such that for any $x, y \in \mathbb{R}^d$,

$$\|\nabla f_i(x) - \nabla f_i(y)\|_2 \le L \|x - y\|_2.$$

Assumption 2 (Strong convexity of $F$).

The function $F$ is $\mu$-strongly convex for some constant $\mu > 0$, i.e., for any $x, y \in \mathbb{R}^d$,

$$F(y) \ge F(x) + \nabla F(x)^\top (y - x) + \tfrac{\mu}{2}\|y - x\|_2^2.$$

Assumption 3 (Component-wise bounded gradient).

There exists a constant $G$ such that $\|\nabla f_i(x)\|_2 \le G$ for all $x$ in the domain of $F$ and all $i \in [n]$.

Observe that Asm. 3 is satisfied if the components $f_i$ are $G$-Lipschitz functions. Alternatively, Asm. 3 is satisfied when $F$ is a Lipschitz function and the norms of the atomic gradients are bounded by a multiple of $\|\nabla F(x)\|_2$; the latter is known as the strong growth condition [26]. (This condition is rarely satisfied in practice; however, similar assumptions have been used to show convergence of Gauss-Newton-based schemes [2], as well as deterministic incremental gradient methods [28, 30].)

Assumption 4 (Bounded Updates).

For each of the estimates $x_k$, we assume that the expected distance from the optimum is upper bounded by a constant. Equivalently, there exists a constant $B < \infty$ such that $\mathbb{E}\|x_k - x^\star\|_2 \le B$ for all $k$.

We note that Asm. 4 is non-standard, but is required for our analysis. An analysis without this assumption is an interesting open problem.

4.1 Guarantees

We show that, under Asm. 1-4, the algorithm converges in expectation with respect to the objective value, achieving a linear rate up to a constant neighborhood of the optimum that depends on the configuration parameters and the problem at hand. Similar results have been reported for SGD [25], as well as for deterministic incremental gradient methods [18].

Theorem 4.1 (Convergence).

Let $x^\star$ be the optimal solution of the minimization in (1). Further, let the step size $\eta$, the epoch length $m$, and the surrogate size $s_1$ be user-defined parameters such that

Under Asm. 1-4, CheapSVRG outputs $\tilde{x}_T$ such that

where .

We remark the following:


  • The condition ensures convergence up to a neighborhood around $x^\star$. In turn, we require that

    for an appropriate choice of the parameters.

  • The value of in Thm. 4.1 is similar to that of [13]: for sufficiently large , there is a -factor deterioration in the convergence rate, due to the parameter . We note, however, that our result differs from [13] in that Thm. 4.1 guarantees convergence only up to a neighborhood around $x^\star$. To achieve the same convergence rate as [13], we require , which in turn implies that . To see this, consider a case where the condition number is constant and . Based on the above, we need . This further implies that, in order to bound the additive term in Thm. 4.1, is required for .

  • When is sufficiently small, Thm. 4.1 implies that

    i.e., even leads to (linear) convergence. In Sec. 5, we empirically show cases where, even for , our algorithm works well in practice.

The following theorem establishes the analytical complexity of CheapSVRG; the proof is provided in the Appendix.

Theorem 4.2 (Complexity).

For some accuracy parameter , if , then for suitable , , and

Alg. 1 outputs such that . Moreover, the total complexity is atomic gradient computations.

4.2 Proof of Theorem 4.1

We proceed with an analysis of Alg. 1, and in turn the proof of Thm. 4.1, starting from its core inner loop (lines 8-12). We show that, in expectation, the steps of the inner loop make progress towards the optimum. Then, we move outwards to the ‘wrapping’ loop that defines consecutive epochs.

Inner loop. Fix an epoch, say the $s$th iteration of the outer loop. Starting from the point $x_0 = \tilde{x}_{s-1}$ (which is effectively the estimate of the previous epoch), the inner loop performs $m$ steps, using the partial gradient information vector $\tilde{g}$, for a fixed set $\mathcal{S}_s$.

Consider the $k$th iteration of the inner loop. For now, let the estimates generated during previous iterations be known history. By the update in Line 11, we have

(3)

where the expectation is with respect to the choice of the index $i_k$. We develop an upper bound for the right-hand side of (3). By the definition of $v_k$ in Line 10,

(4)

Similarly,

(5)

The first inequality follows from the fact that $\|a + b\|_2^2 \le 2\|a\|_2^2 + 2\|b\|_2^2$ for any vectors $a, b$. The second inequality is due to eq. (8) in [13]. Continuing from (3) and taking into account (4) and (5), we have

(6)

Inequality (6) establishes an upper bound on the distance of the -th iteration estimate from the optimum, conditioned on the random choices of previous iterations. Taking expectation over , for any , we have

(7)

Note that , , , and are constants w.r.t. . Summing over ,

(8)

For the second term on the right hand side, we have:

(9)

where the inequality follows from the convexity of . Continuing from (8), taking into account (9) and the fact that we obtain

(10)

By the convexity of ,

Also, by the strong convexity of (Asm. 2),

Continuing from (10), taking into account the above and recalling that , we obtain

(11)

The last sum in (11) can be further upper bounded:

The first inequality follows from Cauchy-Schwarz, the second from the fact that is independent of the random variables ( and are fixed), while the last one follows from Asm. 4. Incorporating the above upper bound into (11), we obtain

(12)

where The inequality in (12) effectively establishes a recursive bound on using only the estimate sequence produced by the epochs.

Outer Loop. Taking expectation over , assuming that is such that , we have

(13)

To further bound the right-hand side, note that:

where is due to and due to eq. (8) in [13]. By Asm. 3, . Using similar reasoning, under Asm. 3, Combining the above with (13), we get

(14)

Let . Also, let

and as defined in Thm. 4.1. Taking expectation with respect to , (14) becomes

Finally, unfolding the above recursion, we obtain the claimed bound, where is as defined in Thm. 4.1. This completes the proof of the theorem.

Figure 1: Convergence performance with respect to the objective value vs the number of effective data passes (i.e., the number of times data points were accessed) for three choices of the step size (left, middle, right). In all experiments, we generate noise such that . The plotted curves depict the median over the Monte Carlo iterations: random independent instances of (15), with several executions per instance for each scheme.

5 Experiments

We empirically evaluate CheapSVRG on synthetic and real data and compare mainly with Svrg [13]. We show that in some cases it improves upon existing stochastic optimization methods, and discuss its properties, strengths and weaknesses.

5.1 Properties of CheapSVRG

We consider a synthetic linear regression problem: given a set of training samples $(a_i, b_i)$, $i = 1, \dots, n$, where $a_i \in \mathbb{R}^d$ and $b_i \in \mathbb{R}$, we seek the solution to

$$\min_{x \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \left(a_i^\top x - b_i\right)^2. \tag{15}$$

We generate an instance of the problem as follows. First, we randomly select a point $x^\star$ from a spherical Gaussian distribution and rescale it to unit $\ell_2$-norm; this point serves as our ‘ground truth’. Then, we randomly generate a sequence of $a_i$'s i.i.d. according to a Gaussian distribution. Let $A$ be the matrix formed by stacking the samples $a_i$, $i = 1, \dots, n$. We compute $b = A x^\star + \varepsilon$, where $\varepsilon$ is a noise term drawn from a Gaussian distribution, with its $\ell_2$-norm rescaled to a desired value controlling the noise level.
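A sketch of this data-generation procedure is given below; the dimensions and noise level shown are placeholders, not the paper's exact settings.

```python
import numpy as np

def make_instance(n=1000, d=100, noise_norm=0.1, seed=0):
    """Synthetic least-squares instance: b = A @ x_star + noise, with ||x_star||_2 = 1."""
    rng = np.random.default_rng(seed)
    x_star = rng.standard_normal(d)
    x_star /= np.linalg.norm(x_star)             # ground truth, rescaled to unit l2-norm
    A = rng.standard_normal((n, d))              # i.i.d. Gaussian samples a_i
    noise = rng.standard_normal(n)
    noise *= noise_norm / np.linalg.norm(noise)  # rescale noise to the desired magnitude
    b = A @ x_star + noise
    return A, b, x_star
```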

We set and . Let , where denotes the maximum singular value of the data matrix. We run the classic SGD method with a decreasing step size, the Svrg method of Johnson and Zhang [13], and our CheapSVRG for a range of parameter values, covering a wide spectrum of possible configurations for $s_1$.

Step size selection.

We study the effect of the step size on the performance of the algorithms; see Figure 1. The horizontal axis represents the number of effective passes over the data: evaluating $n$ component gradients, or computing a single full gradient, is counted as one effective pass. The vertical axis depicts the progress of the objective in (15).
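Under this accounting, the cost of CheapSVRG in effective passes can be tallied as in the sketch below (an illustration under the assumption that each inner step evaluates two atomic gradients and nothing is cached).

```python
def effective_passes(n, n_epochs, m, s1):
    """Effective data passes: atomic gradient evaluations divided by the dataset size n."""
    atomic_per_epoch = s1 + 2 * m      # s1 for the surrogate, two atomic gradients per inner step
    return n_epochs * atomic_per_epoch / n
```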

We plot the performance for three choices of the step size. Observe that Svrg becomes slower if the step size is either too big or too small, as also reported in [13, 31]. The middle value was the best for Svrg in the range we considered (determined via binary search). Note that each algorithm achieves its peak performance for a different value of the step size. In subsequent experiments, however, we will use the above value, which was best for Svrg.

Overall, we observed that CheapSVRG is more ‘flexible’ in the choice of the step size. In Figure 1 (right), with a suboptimal choice of step size, Svrg oscillates and progresses slowly. On the contrary, CheapSVRG converges nicely even in that regime. It is also worth noting that CheapSVRG with $s_1 = 1$, i.e., effectively combining two data points in each stochastic update, achieves a substantial improvement over vanilla SGD.

Figure 2: Convergence performance with respect to the objective value vs. the effective number of passes over the data. We set an upper bound on the total atomic gradient calculations spent and vary the percentage of these resources allocated to the inner loop of the two-stage SGD schemes. Left: perc %. Middle: perc %. Right: perc %. In all experiments, we set . The plotted curves depict the median over the Monte Carlo iterations: random independent instances of (15), with several executions per instance for each scheme.

Resilience to noise.

We study the behavior of the algorithms with respect to the noise magnitude. We consider the cases . In Figure 3, we focus on four distinct noise levels and plot the distance of the estimate from the ground truth vs the number of effective data passes. For SGD, we use the sequence of step sizes .

We also note the following surprising result: in the noiseless case, it appears that even a very small $s_1$ is sufficient for linear convergence in practice; see Figure 3. In contrast, CheapSVRG is less resilient to noise than Svrg; however, we can still reach a good solution with lower computational complexity per iteration.

Figure 3: Distance from the optimum vs the number of effective data passes for the linear regression problem. We generate 10 independent random instances of (15). From left to right, we use noise with standard deviation equal to zero (noiseless) and three nonzero values. Each scheme is executed several times per instance. We plot the median over the Monte Carlo iterations.

Number of inner loop iterations.

Consider a budget of atomic gradient computations. We study how the objective value decreases with respect to the percentage perc of the budget allocated to the inner loop. We first run classic gradient descent with a fixed step size and record the number of iterations it needs to converge; based on this, we choose our global budget. We consider several values for perc. E.g., when perc is large, only a few atomic gradient calculations are spent in outer-loop iterations. The results are depicted in Fig. 2.

We observe that convergence is slower as fewer computations are spent in outer iterations. Also, in contrast to Svrg, our algorithm appears to be sensitive to the choice of perc: for some values, our scheme diverges, while Svrg finds a relatively good solution.

5.2 $\ell_2$-regularized logistic regression

We consider the $\ell_2$-regularized logistic regression problem, i.e., the minimization

$$\min_{x \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \log\!\left(1 + e^{-b_i\, a_i^\top x}\right) + \frac{\lambda}{2} \|x\|_2^2. \tag{16}$$

Here, $f_i(x) = \log(1 + e^{-b_i\, a_i^\top x}) + \frac{\lambda}{2}\|x\|_2^2$, where $b_i \in \{-1, +1\}$ indicates the binary label in a classification problem, $a_i \in \mathbb{R}^d$ represents the predictor, and $\lambda > 0$ is a regularization parameter.

We focus on the training loss in such a task. (By [13], we already know that such two-stage SGD schemes perform better than vanilla SGD.) We use the real datasets listed in Table 1. We pre-process the data as in [31]; this leads to a common upper bound on the Lipschitz constants of the components $f_i$. We set the step size for all algorithms under consideration according to [13, 31], and fix perc and $s_1$ across all problem cases.
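For reference, the atomic gradient used by all methods in this experiment would take the following form under the $\ell_2$-regularized logistic loss stated in (16) (a sketch; the variable names are ours).

```python
import numpy as np

def logistic_component_grad(x, a_i, b_i, lam):
    """Gradient of f_i(x) = log(1 + exp(-b_i * a_i^T x)) + (lam / 2) * ||x||^2."""
    margin = b_i * (a_i @ x)
    sigma = 1.0 / (1.0 + np.exp(margin))   # equals sigmoid(-margin)
    return -b_i * sigma * a_i + lam * x
```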

Fig. 4 depicts the convergence results for the marti0, reged0 and sido0 datasets. CheapSVRG achieves performance comparable to Svrg, while requiring less computational ‘effort’ per epoch: though smaller values of $s_1$ lead to slower convergence, CheapSVRG still makes progress towards the solution, while its complexity per epoch is significantly reduced.

Figure 4: Convergence performance of the algorithms for the $\ell_2$-regularized logistic regression objective. From left to right, we use the marti0, reged0, and sido0 datasets; a description of the datasets is given in Table 1. Plots depict the objective value vs the number of effective data passes. We use the same step size for all algorithms, as suggested by [13, 31]. The curves depict the median over the Monte Carlo iterations.
Dataset
marti0
reged0
sido0
Table 1: Summary of datasets [11].

6 Conclusions

We proposed CheapSVRG, a new variance-reduction scheme for stochastic optimization, based on [13]. The main difference is that, instead of computing a full gradient in each epoch, our scheme computes a surrogate using only part of the data, thus reducing the per-epoch complexity. CheapSVRG comes with convergence guarantees: under our assumptions, it achieves a linear convergence rate up to a constant neighborhood of the optimum. We empirically evaluated our method and discussed its strengths and weaknesses.

There are several future directions. On the theory front, it would be interesting to maintain similar convergence guarantees under fewer assumptions, extend our results beyond smooth convex optimization, e.g., to the proximal setting, or develop distributed variants. Finally, we seek to apply CheapSVRG to large-scale problems, e.g., training large neural networks. We hope that this will help us better understand the properties of CheapSVRG and the trade-offs associated with its various configuration parameters.

References

Appendix A Proof of Theorem 4.2

By the assumptions of the theorem, we have:

As mentioned in the remarks of Section 4, the above conditions are sufficient to guarantee , for some . Further, for a given accuracy parameter , we assume .

Let us define

as in the proof of Theorem 4.1. In order to satisfy , it is sufficient to find the required number of iterations such that . In particular: