On the Adaptivity of Stochastic Gradient-Based Optimization

April 9, 2019 · Lihua Lei et al. · UC Berkeley

Stochastic-gradient-based optimization has been a core enabling methodology in applications to large-scale problems in machine learning and related areas. Despite the progress, the gap between theory and practice remains significant, with theoreticians pursuing mathematical optimality at a cost of obtaining specialized procedures in different regimes (e.g., modulus of strong convexity, magnitude of target accuracy, signal-to-noise ratio), and with practitioners not readily able to know which regime is appropriate to their problem, and seeking broadly applicable algorithms that are reasonably close to optimality. To bridge these perspectives it is necessary to study algorithms that are adaptive to different regimes. We present the stochastically controlled stochastic gradient (SCSG) method for composite convex finite-sum optimization problems and show that SCSG is adaptive to both strong convexity and target accuracy. The adaptivity is achieved by batch variance reduction with adaptive batch sizes and a novel technique, which we refer to as geometrization, which sets the length of each epoch as a geometric random variable. The algorithm achieves strictly better theoretical complexity than other existing adaptive algorithms, while its tuning parameters depend only on the smoothness parameter of the objective.


1 Introduction

The application of gradient-based optimization methodology to statistical machine learning has been a major success story, in practice and in theory. Indeed, there is an increasingly detailed theory available for gradient-based algorithms that helps to explain their practical success. There remains, however, a significant gap between theory and practice, in that the designer of machine learning algorithms is required to make numerous choices that depend on parameters that are unlikely to be known in a real-world machine-learning setting. For example, existing theory asserts that different algorithms are preferred if a problem is strongly convex or merely convex, if the target accuracy is high or low, if the signal-to-noise ratio is high or low, and if data are independent or correlated. This poses a serious challenge to builders of machine-learning software, and to users of that software. Indeed, a distinctive aspect of machine-learning problems, especially large-scale problems, is that the user of an algorithm can be expected to know little or nothing about quantitative structural properties of the functions being optimized. It is hoped that the data and the data analysis will inform such properties, not the other way around.

To take a classical example, the stochastic gradient descent (SGD) algorithm takes different forms for strongly convex objectives and non-strongly convex objectives. In the former case, letting $\mu$ denote the strong-convexity parameter, if the step size is set as $O(1/\mu t)$ then SGD exhibits a convergence rate of $O(1/\mu\varepsilon)$, where $\varepsilon$ is the target accuracy (Nesterov, 2004). In the latter case, setting the step size to $O(1/\sqrt{t})$ yields a rate of $O(1/\varepsilon^{2})$ (Nemirovski et al., 2009). Using the former scheme for non-strongly convex objectives can significantly deteriorate the convergence (Nemirovski et al., 2009). It is sometimes suggested that one can ensure strong convexity by simply adding a quadratic regularizer $(\mu/2)\|x\|^{2}$ to the objective, using the coefficient of the regularizer as a conservative estimate of the strong-convexity parameter. But this produces a significantly faster rate only if $\mu \gg \varepsilon$, a regime that is unrealistic in many machine-learning applications, where $n$ is large and the statistically motivated target accuracy $\varepsilon$ is correspondingly small. Setting $\mu$ to such a large value would have a major effect on the statistical properties of the optimizer.
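To make the dichotomy concrete, the following sketch (ours, not from the paper) runs SGD under the two classical step-size schemes on a toy one-dimensional strongly convex problem; the objective, noise model and all constants are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.1  # strong-convexity parameter of f(x) = 0.5 * mu * x**2

def sgd(schedule, T=100_000):
    """Run SGD with noisy gradients g_t = mu * x + N(0, 1)."""
    x = 10.0
    for t in range(1, T + 1):
        x -= schedule(t) * (mu * x + rng.normal())
    return 0.5 * mu * x**2  # suboptimality f(x) - f(x*)

# Scheme tailored to strong convexity: eta_t = 1/(mu * t); requires mu.
print("eta = 1/(mu t): ", sgd(lambda t: 1.0 / (mu * t)))
# Scheme for merely convex objectives: eta_t = c/sqrt(t); ignores mu.
print("eta = c/sqrt(t):", sgd(lambda t: 0.1 / np.sqrt(t)))
```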

Similar comments apply to presumptions of knowledge of Lipschitz parameters, mini-batch sizes, variance-reduction tuning parameters, etc. Current practice often involves heuristics in setting these tuning parameters, but the use of these heuristics can change the algorithm and the optimality guarantees may disappear.

Our goal, therefore, should be that our algorithms are adaptive, in the sense that they perform as well as an algorithm that is assumed to know the “correct” choice of tuning parameters, even if they do not know those parameters. In particular, in the convex setting, we wish to derive an algorithm that does not involve $\mu$ in its implementation but whose convergence rate is better for larger $\mu$ while remaining reasonable for smaller $\mu$, including the non-strongly convex case where $\mu = 0$.

Such adaptivity has been studied implicitly in the classical literature. Ruppert (1988), Polyak (1990) and Polyak and Juditsky (1992) showed that the average iterate of SGD with stepsize $t^{-\alpha}$ for $\alpha \in (1/2, 1)$ satisfies a central limit theorem with information-theoretically optimal asymptotic variance. This implies adaptivity because the performance adapts to the underlying parameters of the problem, including the modulus of strong convexity, even though the algorithm does not require knowledge of them. The analysis by Polyak and Juditsky (1992) is, however, asymptotic and relies on the smoothness of the Hessian. Under similar assumptions on the Hessian, Moulines and Bach (2011) provided a non-asymptotic analysis establishing adaptivity of SGD with Polyak-Ruppert averaging. Further contributions to this line of work include Bach and Moulines (2013), Flammarion and Bach (2015) and Dieuleveut et al. (2017), who prove the adaptivity of certain versions of SGD with refined rates for self-concordant objectives, including least-squares regression and logistic regression.

This line of work relies on conditions on higher-order derivatives which are not required in the modern literature on stochastic gradient methods. In fact, under fairly standard assumptions for first-order methods, Moulines and Bach (2011) provided a non-asymptotic analysis for SGD with stepsize $O(1/\sqrt{t})$ without averaging and showed that this algorithm exhibits adaptivity to strong convexity while having a reasonable guarantee for non-strongly convex objectives. Specifically, when $\mu > 0$, their results show that the rate to achieve an $\varepsilon$-accurate solution for the expected function value improves with $\mu$, while degrading gracefully to the standard slow rate when $\mu = 0$, up to logarithmic factors. Further progress has been made by focusing on a setting that is particularly relevant to machine learning—that of finite-sum optimization. The objective function in this setting takes the following form:

$$\min_{x \in \mathcal{X}} F(x) \triangleq f(x) + \psi(x), \quad \text{where } f(x) \triangleq \frac{1}{n}\sum_{i=1}^{n} f_{i}(x), \tag{1}$$

where $\mathcal{X} \subseteq \mathbb{R}^{d}$ is the parameter space, $n$ is the number of data points, the functions $f_{i}$ are data-point-specific loss functions and $\psi$ is the regularization term. We assume that each $f_{i}$ is differentiable and convex and that $\psi$ is convex but possibly non-differentiable. The introduction of the parameter $n$ into the optimization problem has two important implications. First, it implies that the number of operations required to obtain a full gradient is $O(n)$, which is generally impractical in modern machine-learning applications, where the value of $n$ can be in the tens to hundreds of millions. This fact motivates us to make use of stochastic estimates of gradients. Such randomness introduces additional variance that interacts with the variability of the data, and tuning parameters are often introduced to control this variance.

Second, the finite-sum formulation highlights the need for adaptivity to the target accuracy $\varepsilon$, where that accuracy is related to the number of data points $n$ for statistical reasons. Unfortunately, different algorithms perform better in high-accuracy versus low-accuracy regimes, and the choice of regime is generally not clear to a user of machine-learning algorithms, given that target accuracy varies not only as a function of $n$, but also as a function of other parameters, such as the signal-to-noise ratio, that the user is not likely to know. Ideally, therefore, optimization algorithms should be adaptive to target accuracy, performing well in either regime.
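As a concrete instance of the finite-sum form (1), the following minimal NumPy sketch (our illustration; the data, loss and constants are invented) sets up a regularized least-squares objective together with batch stochastic gradients; computing one component gradient corresponds to one unit of sampling cost.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1_000, 5                       # number of data points, dimension
A = rng.normal(size=(n, d))
y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
lam = 1e-2                            # coefficient of the regularizer psi

def grad_f_i(x, i):
    """Gradient of the i-th loss f_i(x) = 0.5 * (a_i' x - y_i)**2."""
    return (A[i] @ x - y[i]) * A[i]

def F(x):
    """F(x) = (1/n) sum_i f_i(x) + psi(x), with psi(x) = 0.5*lam*||x||^2."""
    return 0.5 * np.mean((A @ x - y) ** 2) + 0.5 * lam * x @ x

def stochastic_grad(x, batch):
    """Unbiased gradient estimate from a sampled batch of indices."""
    return np.mean([grad_f_i(x, i) for i in batch], axis=0) + lam * x
```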

A recent line of research has shown that algorithms with lower complexity can be designed in the finite-sum setting with some adaptivity, generally via careful control of the variance. The stochastic average gradient (SAG) method opened this line of research, establishing a complexity of $O\left((n + L/\mu)\log(1/\varepsilon)\right)$ (Roux et al., 2012). Importantly, this result shows that SAG is adaptive to strong convexity. To achieve such adaptivity, however, SAG requires two sequences of iterates, the average iterate and the last iterate. Defazio et al. (2014) proposed SAGA, a single-sequence variant of SAG that is also adaptive to strong convexity, yet under the stronger assumption that each $f_{i}$ is strongly convex. Both methods suffer, however, from a prohibitive storage cost of $O(nd)$, where $d$ is the dimension of $x$. Further developments in this vein include the stochastic variance reduced gradient (SVRG) method (Johnson and Zhang, 2013) and the stochastic dual coordinate ascent (SDCA) method (Shalev-Shwartz and Zhang, 2013); they achieve the same computational complexity as SAG while reducing the storage cost to $O(d)$. They are not, however, adaptive to strong convexity.

Lei and Jordan (2016) presented a randomized variant of SVRG that achieves the same convergence rate and adaptivity as SAG but with the same storage cost as SVRG. However, as is the case with SAG, the complexity of $O(n/\varepsilon)$ for the non-strongly convex case is much larger than the oracle lower bound of $\Omega(n + \sqrt{n/\varepsilon})$ (Woodworth and Srebro, 2016). Xu et al. (2017) proposed another variant of SVRG which adapts to a more general condition, called a “Hölderian error bound,” with strong convexity being a special case. In contrast to Lei and Jordan (2016), they required an initial conservative estimate of the strong convexity parameter.

In this article we present an algorithm, the stochastically-controlled stochastic gradient (SCSG) algorithm, that exhibits adaptivity to both strong convexity and to target accuracy. SCSG is a nested procedure that is similar to the SVRG algorithm. Crucially, it does not require the computation of a full gradient in the outer loop as performed by SVRG, but makes use of stochastic estimates of gradients in both the outer loop and the inner loop. Moreover, it makes essential use of a randomization technique (“geometrization”) that allows terms to telescope across the outer loop and the inner loops; such telescoping does not happen in SVRG, a fact which leads to the loss of adaptivity for SVRG.

The rest of the article is organized as follows: Section 2 introduces notation, assumptions and definitions. In Section 3 and Section 4, we focus on the relatively simple setting of unregularized problems and Euclidean geometry, introducing the key ideas of geometrization and adaptive batching. We extend these results to regularized problems and to non-Euclidean geometry in Section 5. The extension relaxes standard assumptions for analyzing mirror descent methods and may be of independent interest. All proofs for the general case are relegated to Appendix A and some miscellaneous results are presented in Appendix B.

2 Notation, Assumptions and Definitions

We write $a \wedge b$ (resp. $a \vee b$) for $\min\{a, b\}$ (resp. $\max\{a, b\}$), and $a = O(b)$ (or $b = \Omega(a)$) for $a \le Cb$ with some universal constant $C$, throughout the paper. We adopt Landau’s notation ($O(\cdot)$, $o(\cdot)$), and we occasionally use $\tilde{O}(\cdot)$ to hide logarithmic factors. We define computational cost by making use of the IFO framework of Agarwal and Bottou (2014); Reddi et al. (2016), where we assume that sampling an index $i$ and computing the pair $(f_{i}(x), \nabla f_{i}(x))$ incurs a unit of cost.

In this section and the following section we focus on unregularized problems and Euclidean geometry, turning to regularized problems and non-Euclidean geometry in Section 5. Specifically, we consider the case $\psi \equiv 0$ and $\mathcal{X} = \mathbb{R}^{d}$, and make the following assumptions that target the finite-sum optimization problem:

  1. Each $f_{i}$ is convex with $L$-Lipschitz gradient,

    $\|\nabla f_{i}(x) - \nabla f_{i}(y)\| \le L\|x - y\|$ for all $x, y$,

    for some $L < \infty$;

  2. $f$ is strongly convex at $x^{*}$ with

    $f(x) - f(x^{*}) \ge \frac{\mu}{2}\|x - x^{*}\|^{2}$ for all $x$,

    for some $\mu \ge 0$.

Note that assumption A2 always holds with $\mu = 0$, corresponding to the non-strongly convex case. Note also that, with the exception of Roux et al. (2012), this assumption is weaker than those in most of the literature on smooth finite-sum optimization, where strong convexity of $f$ is required at every point.

Our analysis will make use of the following key quantity (Lei and Jordan, 2016):

$$\mathcal{G}^{*} \triangleq \frac{1}{n}\sum_{i=1}^{n}\left\|\nabla f_{i}(x^{*})\right\|^{2}, \tag{2}$$

where $x^{*}$ denotes the optimum of $F$. If multiple optima exist we take one that minimizes $\mathcal{G}^{*}$. We use $\mathcal{G}^{*}$, an average squared gradient norm at the optimum, in place of the uniform upper bound on the gradient that is often assumed in other work. The latter is not realistic for many practical problems in machine learning, including least squares, where the gradient is unbounded. We will write $\mathcal{G}^{*}$ as $\mathcal{G}$ when no confusion can arise.

We let $x_{0}$ denote the initial value (possibly random) and define the following measures of complexity:

$$\Delta_{f} \triangleq \mathbb{E}\left[f(x_{0}) - f(x^{*})\right], \qquad \Delta_{x} \triangleq \mathbb{E}\left\|x_{0} - x^{*}\right\|^{2}. \tag{3}$$

Recall that a geometric random variable, $N \sim \mathrm{Geom}(\gamma)$ with $\gamma \in (0, 1)$, is a discrete random variable with probability mass function

$$P(N = k) = \gamma^{k}(1 - \gamma), \quad k = 0, 1, 2, \ldots,$$

and expectation:

$$\mathbb{E}N = \frac{\gamma}{1 - \gamma}. \tag{4}$$

Geometric random variables will play a key role in the design and analysis of our algorithm.
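For concreteness, here is a small NumPy sketch (ours) of sampling from this distribution. NumPy’s built-in geometric sampler counts trials up to and including the first success, with support $\{1, 2, \ldots\}$, so matching the convention above requires setting its parameter to $1 - \gamma$ and subtracting one.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_geom(gamma):
    """Sample N ~ Geom(gamma) with P(N = k) = gamma**k * (1 - gamma), k >= 0."""
    return rng.geometric(1.0 - gamma) - 1

gamma = 0.9
samples = np.array([sample_geom(gamma) for _ in range(100_000)])
print(samples.mean(), gamma / (1 - gamma))  # both approximately 9
```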

Finally, we introduce two fundamental definitions that serve to clarify desirable properties of optimization algorithms. We refer to the first property as $\varepsilon$-independence.

Definition 1

An algorithm is $\varepsilon$-independent if it guarantees convergence at all target accuracies $\varepsilon > 0$.

$\varepsilon$-independence is a crucial property in practice because a target accuracy is usually not known exactly a priori. An $\varepsilon$-independent algorithm satisfies the “one-pass-for-all” property whereby the theoretical complexity analysis applies to the whole path of the iterates. In contrast, an $\varepsilon$-dependent algorithm has a theoretical guarantee only for a particular $\varepsilon$, whose value is often unknown in practice. To illustrate, consider SGD, where the iterate is updated by $x_{t} = x_{t-1} - \eta_{t}\nabla f_{i_{t}}(x_{t-1})$ and where $i_{t}$ is a uniform index from $\{1, \ldots, n\}$. There are two popular schemes for theoretical analysis: (1) $\eta_{t} = O(1/\sqrt{t})$; or (2) $\eta_{t} \equiv \eta = O(\varepsilon)$, with the iterates updated for $T$ steps, where $T = O(1/\varepsilon^{2})$. Although both versions have theoretical complexity $O(1/\varepsilon^{2})$, only the former is $\varepsilon$-independent.
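The contrast can be made concrete in code. In the sketch below (ours, with a toy one-dimensional objective and invented constants), the decaying-stepsize run can be stopped at any point with a valid guarantee, while the constant-stepsize run is tuned to a single target accuracy and must be rerun from scratch if $\varepsilon$ changes.

```python
import numpy as np

rng = np.random.default_rng(1)

def sgd_eps_independent(T):
    """eta_t = c/sqrt(t): one run covers every target accuracy."""
    x = 5.0
    for t in range(1, T + 1):
        x -= (0.5 / np.sqrt(t)) * (x + rng.normal())  # grad of x^2/2, plus noise
    return x

def sgd_eps_dependent(eps):
    """eta = O(eps) for T = O(1/eps^2) steps: tied to one target accuracy."""
    x, eta, T = 5.0, eps, int(1.0 / eps**2)
    for _ in range(T):
        x -= eta * (x + rng.normal())
    return x

print(sgd_eps_independent(10_000), sgd_eps_dependent(0.05))
```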

The second important property is referred to as almost universality.

Definition 2

An algorithm is almost universal if it requires knowledge only of the smoothness parameter $L$.

The term almost universality is motivated by the notion of universality introduced by Nesterov (2015), which requires no knowledge even of $L$ or of other parameters such as the variance of the stochastic gradients. Returning to the previous example, both versions of SGD are universal. It is noteworthy that universal gradient methods are usually either $\varepsilon$-dependent (e.g., Nesterov, 2015) or require imposing other assumptions such as uniformly bounded gradients (e.g., Nemirovski et al., 2009). The SCSG algorithm developed in this paper is both $\varepsilon$-independent and almost universal. This category also includes SGD for general convex functions (Nemirovski et al., 2009), SAG (Roux et al., 2012), SAGA (Defazio et al., 2014), SVRG++ (Allen-Zhu and Yuan, 2016), Katyusha for non-strongly convex functions (Allen-Zhu, 2017), and AMSVRG (Nitanda, 2015). In contrast, algorithms such as SGD for strongly convex functions (Nemirovski et al., 2009), SVRG (Johnson and Zhang, 2013), SDCA (Shalev-Shwartz and Zhang, 2013), APCG (Lin et al., 2014), Katyusha for strongly convex functions (Allen-Zhu, 2017) and adaptive SVRG (Xu et al., 2017) are $\varepsilon$-independent but not almost universal because they need full or partial knowledge of $\mu$. Furthermore, algorithms such as Catalyst (Lin et al., 2015) and AdaptReg (Allen-Zhu and Hazan, 2016) depend on additional unknown quantities, such as $F(x_{0}) - F(x^{*})$ or the variance of the stochastic gradients. In comparing algorithms we believe that clarity on these distinctions is critical, in addition to comparison of convergence rates.

3 Stochastically Controlled Stochastic Gradient (SCSG)

In this section we present SCSG, a computationally efficient framework for variance reduction in stochastic gradient descent algorithms. SCSG builds on the SVRG algorithm of Johnson and Zhang (2013), incorporating several essential modifications that yield not only computational efficiency but also adaptivity. Recall that SVRG is a nested procedure that computes a full gradient in each outer loop and uses that gradient as a baseline to reduce the variance of the stochastic gradients that are computed in an inner loop. The need to compute a full gradient, at a cost of $O(n)$ operations, unfortunately makes the SVRG procedure impractical for large-scale applications. SCSG seeks to remove this bottleneck by replacing the full gradient with an approximate, stochastic gradient, one that is based on a batch size that is significantly smaller than $n$ but larger than the mini-batch size used for the stochastic gradients in the inner loop. By carefully weighing the contributions to the bias and variance of these sampling-based estimates, SCSG achieves a small iteration complexity while also keeping the per-iteration complexity feasibly small.

Further support for the SCSG framework comes from the comparison with SVRG in the setting of strongly convex objectives. In this setting, SVRG relies heavily on a presumption of knowledge of the strong convexity parameter $\mu$. In particular, to achieve a complexity of $O\left((n + L/\mu)\log(1/\varepsilon)\right)$, the number of stochastic gradients queried in the inner loop of SVRG needs to scale as $L/\mu$. By contrast, the SCSG framework achieves the same complexity without knowledge of $\mu$. This is achieved by setting the number of inner-loop stochastic gradients to be a geometric random variable. As we discuss below, the usage of a geometric random variable—a technique that we refer to as “geometrization”—is crucial in the design and analysis of SCSG. We believe that it is a key theoretical tool for achieving adaptivity to strong convexity.

The original version of SCSG was $\varepsilon$-dependent and not almost universal, because it required knowledge of the parameter $\mathcal{G}$ (Lei and Jordan, 2016). Moreover, the algorithm had a sub-optimal rate in the high-accuracy regime. In further development of the SCSG framework, in the context of nonconvex optimization (Lei et al., 2017), we found that $\varepsilon$-independence and almost universality could be achieved by employing an increasing sequence of batch sizes.

In the remainder of this section, we bring these ideas together and present the general form of the SCSG algorithm, incorporating adaptive batching, geometrization and mini-batches in the inner loop. The resulting algorithm is adaptive, $\varepsilon$-independent and almost universal. Roughly speaking, the adaptive batching enables the adaptivity to target accuracy and the geometrization enables the adaptivity to strong convexity. The pseudocode for SCSG is shown in Algorithm 1. As can be seen, the algorithm is superficially complex, but, as in the case of line-search and trust-region methods that augment simple gradient-based methods in deterministic optimization, the relative lack of dependence on hyperparameters makes the algorithm robust and relatively easy to deploy.

Note that in Algorithm 1, and throughout the paper, we use $\tilde{x}_{j}$ to denote the iterate produced by the $j$th outer loop and $x_{k}^{(j)}$ to denote the iterate at the $k$th step of the $j$th inner loop.

Inputs: Number of stages $T$, initial iterate $\tilde{x}_{0}$, stepsizes $\{\eta_{j}\}$, block sizes $\{B_{j}\}$, inner-loop sizes $\{m_{j}\}$, mini-batch sizes $\{b_{j}\}$.

Procedure

1:for $j = 1, 2, \ldots, T$ do
2:     Uniformly sample a batch $\mathcal{I}_{j} \subset \{1, \ldots, n\}$ with $|\mathcal{I}_{j}| = B_{j}$;
3:     $g_{j} \leftarrow \frac{1}{B_{j}}\sum_{i \in \mathcal{I}_{j}} \nabla f_{i}(\tilde{x}_{j-1})$;
4:     $x_{0}^{(j)} \leftarrow \tilde{x}_{j-1}$;
5:     Generate $N_{j} \sim \mathrm{Geom}\left(m_{j}/(m_{j} + 1)\right)$;
6:     for $k = 1, \ldots, N_{j}$ do
7:         Uniformly sample a batch $\tilde{\mathcal{I}}_{k-1} \subset \{1, \ldots, n\}$ with $|\tilde{\mathcal{I}}_{k-1}| = b_{j}$;
8:         $\nu_{k-1}^{(j)} \leftarrow \frac{1}{b_{j}}\sum_{i \in \tilde{\mathcal{I}}_{k-1}}\left(\nabla f_{i}(x_{k-1}^{(j)}) - \nabla f_{i}(x_{0}^{(j)})\right) + g_{j}$;
9:         $x_{k}^{(j)} \leftarrow x_{k-1}^{(j)} - \eta_{j}\nu_{k-1}^{(j)}$;
10:     end for
11:     $\tilde{x}_{j} \leftarrow x_{N_{j}}^{(j)}$;
12:end for

Output: $\tilde{x}_{T}$.

Algorithm 1 SCSG for unconstrained finite-sum optimization
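The following NumPy sketch of Algorithm 1 is our own minimal illustration on a toy least-squares problem; the constant per-stage parameters and all numerical values are invented for readability and are not the schedules analyzed in Section 4.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy smooth finite-sum problem: f(x) = (1/n) sum_i 0.5 * (a_i' x - y_i)^2.
n, d = 500, 10
A = rng.normal(size=(n, d))
y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grad_batch(x, idx):
    """Average gradient of f_i over the index batch idx."""
    return A[idx].T @ (A[idx] @ x - y[idx]) / len(idx)

def scsg(T, eta, B, m, b, x0):
    """Algorithm 1 with constant eta_j, B_j, m_j, b_j for simplicity."""
    x_tilde, cost = x0.copy(), 0
    for _ in range(T):
        I = rng.choice(n, size=B, replace=False)       # line 2: outer batch
        g = grad_batch(x_tilde, I)                     # line 3: block gradient g_j
        x_anchor, x = x_tilde.copy(), x_tilde.copy()   # line 4
        N = rng.geometric(1.0 / (m + 1)) - 1           # line 5: N_j ~ Geom(m/(m+1))
        for _ in range(N):
            J = rng.choice(n, size=b, replace=False)   # line 7: inner mini-batch
            nu = grad_batch(x, J) - grad_batch(x_anchor, J) + g  # line 8
            x -= eta * nu                              # line 9
        x_tilde = x                                    # line 11
        cost += B + b * N                              # IFO accounting as in (6)
    return x_tilde, cost

x_star = np.linalg.lstsq(A, y, rcond=None)[0]
L = np.linalg.eigvalsh(A.T @ A / n).max()              # smoothness parameter
x_hat, cost = scsg(T=50, eta=0.1 / L, B=100, m=50, b=1, x0=np.zeros(d))
print("distance to optimum:", np.linalg.norm(x_hat - x_star), "IFO cost:", cost)
```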

To measure the computational complexity of SCSG, let $T(\varepsilon)$ denote the first stage at which $\tilde{x}_{T}$ is an $\varepsilon$-approximate solution:

$$T(\varepsilon) \triangleq \min\left\{T : \mathbb{E}\left[f(\tilde{x}_{T}) - f(x^{*})\right] \le \varepsilon\right\}. \tag{5}$$

The computational cost incurred in computing $\tilde{x}_{T}$ is

$$C(T) \triangleq \sum_{j=1}^{T}\left(B_{j} + b_{j}N_{j}\right). \tag{6}$$

Noting that $C(T(\varepsilon))$ is random, we consider the average complexity obtained by taking the expectation of $C(T(\varepsilon))$. Since $\mathbb{E}N_{j} = m_{j}$, we have:

$$\mathbb{E}\,C(T(\varepsilon)) = \sum_{j=1}^{T(\varepsilon)}\left(B_{j} + b_{j}m_{j}\right). \tag{7}$$

3.1 Two key ideas: adaptive batching and geometrization

The adaptivity of SCSG is achieved via two techniques: adaptive batching and geometrization. We provide intuitive motivation for these two ideas in this section.

The motivation for adaptive batching is straightforward. Heuristically, at the early stages of the optimization process, the iterate is far from the optimum and a small subset of data is sufficient to reduce the variance. On the other hand, at later stages, finer variance reduction is required to prevent the iterate from moving in the wrong direction. By allowing the batch sizes to increase, SCSG behaves like SGD for the purposes of low-accuracy computation while it behaves like SVRG for high-accuracy computation.
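As an illustration of such a schedule (with constants invented by us, not taken from the paper), batch sizes can grow geometrically and be capped at $n$, so that early stages are SGD-like and late stages approach SVRG:

```python
n = 100_000
# Geometrically increasing outer batch sizes B_j, capped at the full data size.
batch_sizes = [min(int(2 ** (1.5 * j)), n) for j in range(1, 14)]
print(batch_sizes)  # SGD-like small batches early; full-gradient regime late
```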

The motivation for geometrization is more subtle. To isolate its effect, let us consider a special case of SCSG in which the parameters are set as follows:

$$\eta_{j} \equiv \eta, \quad B_{j} \equiv n, \quad b_{j} \equiv 1, \quad m_{j} \equiv m.$$

Note that the above setting is only used to illustrate the effect of geometrization; the setting that leads to adaptivity to both strong convexity and target accuracy is more involved and is given in Section 4. In this simplified setting, SCSG reduces to SVRG if we replace line 5 by $N_{j} \sim \mathrm{Unif}\{0, 1, \ldots, m-1\}$, with $m$ a positive integer. (Although SVRG is usually implemented in practice by setting $N_{j}$ to be a fixed $m$, a uniformly random $N_{j}$ is crucial for the analysis of SVRG (Johnson and Zhang, 2013).) SVRG achieves a linear rate of convergence only if

$$\frac{1}{\mu\eta(1 - 2L\eta)m} + \frac{2L\eta}{1 - 2L\eta} < 1. \tag{8}$$

This requires $m = \Omega(L/\mu)$; hence, SVRG requires knowledge of $\mu$ to achieve the theoretical rate. We briefly sketch the step in the proof of the convergence of SVRG where this limitation arises, and we show how geometrization circumvents the need to know $\mu$. To simplify our arguments we follow Johnson and Zhang (2013) and make the assumption that strong convexity holds everywhere for $f$; note that this is stronger than our assumption A2.

In the proof of Theorem 1 of Johnson and Zhang (2013), the following argument appears:

$$\mathbb{E}\left\|x_{k}^{(j)} - x^{*}\right\|^{2} \le \mathbb{E}\left\|x_{k-1}^{(j)} - x^{*}\right\|^{2} - 2\eta(1 - 2L\eta)\,\mathbb{E}\left[f(x_{k-1}^{(j)}) - f(x^{*})\right] + 4L\eta^{2}\,\mathbb{E}\left[f(\tilde{x}_{j-1}) - f(x^{*})\right]. \tag{9}$$

Strong convexity implies that

$$f(x) - f(x^{*}) \ge \frac{\mu}{2}\left\|x - x^{*}\right\|^{2}. \tag{10}$$

Note that this conclusion is independent of the choice of $N_{j}$ and hence holds for both SVRG and SCSG. To assess the overall effect of the $j$th inner loop on the left-hand side, we let $k = N_{j} + 1$, thereby focusing on the last step of the inner loop, and we substitute $\tilde{x}_{j}$ for $x_{N_{j}}^{(j)}$ and $\tilde{x}_{j-1}$ for $x_{0}^{(j)}$. We have:

$$2\eta(1 - 2L\eta)\,\mathbb{E}\left[f(\tilde{x}_{j}) - f(x^{*})\right] \le \mathbb{E}\left[\left\|x_{N_{j}}^{(j)} - x^{*}\right\|^{2} - \left\|x_{N_{j}+1}^{(j)} - x^{*}\right\|^{2}\right] + 4L\eta^{2}\,\mathbb{E}\left[f(\tilde{x}_{j-1}) - f(x^{*})\right]. \tag{11}$$

For SVRG, $N_{j} \sim \mathrm{Unif}\{0, 1, \ldots, m-1\}$, and thus (11) reduces to

$$2\eta(1 - 2L\eta)\,\mathbb{E}\left[f(\tilde{x}_{j}) - f(x^{*})\right] \le \frac{1}{m}\,\mathbb{E}\left[\left\|\tilde{x}_{j-1} - x^{*}\right\|^{2} - \left\|x_{m}^{(j)} - x^{*}\right\|^{2}\right] + 4L\eta^{2}\,\mathbb{E}\left[f(\tilde{x}_{j-1}) - f(x^{*})\right]. \tag{12}$$

Unfortunately, given that $x_{m}^{(j)} \ne \tilde{x}_{j}$, the last two terms do not telescope, and one has to drop the final term, leading to the following conservative bound:

$$2\eta(1 - 2L\eta)\,\mathbb{E}\left[f(\tilde{x}_{j}) - f(x^{*})\right] \le \frac{1}{m}\,\mathbb{E}\left\|\tilde{x}_{j-1} - x^{*}\right\|^{2} + 4L\eta^{2}\,\mathbb{E}\left[f(\tilde{x}_{j-1}) - f(x^{*})\right]. \tag{13}$$

Without strong convexity (i.e., when $\mu = 0$), $\mathbb{E}\left\|\tilde{x}_{j-1} - x^{*}\right\|^{2}$ can be arbitrarily larger than $\mathbb{E}\left[f(\tilde{x}_{j-1}) - f(x^{*})\right]$ and hence (13) is not helpful. Thus Johnson and Zhang (2013) exploit strong convexity at this point, using (10) in the form $\left\|\tilde{x}_{j-1} - x^{*}\right\|^{2} \le \frac{2}{\mu}\left[f(\tilde{x}_{j-1}) - f(x^{*})\right]$. Then (13) implies that

$$2\eta(1 - 2L\eta)\,\mathbb{E}\left[f(\tilde{x}_{j}) - f(x^{*})\right] \le \left(\frac{2}{\mu m} + 4L\eta^{2}\right)\mathbb{E}\left[f(\tilde{x}_{j-1}) - f(x^{*})\right]. \tag{14}$$

This requires the coefficient on the left-hand side to be larger than that on the right-hand side, leading to the condition (8).

Summarizing, the reason that Johnson and Zhang (2013) rely on the knowledge of $\mu$ is that it permits the removal of the last term in (12). By contrast, if $N_{j}$ is a geometric random variable instead of a uniform random variable, the problem is completely circumvented, by making use of the following elementary lemma.

Lemma 3.1

Let $N \sim \mathrm{Geom}(\gamma)$ for $\gamma \in (0, 1)$. Then for any sequence $\{D_{k}\}_{k \ge 0}$ with $\mathbb{E}\left|D_{N}\right| < \infty$,

$$\mathbb{E}\left[D_{N} - D_{N+1}\right] = \left(\frac{1}{\gamma} - 1\right)\left(D_{0} - \mathbb{E}D_{N}\right).$$

Remark 1

The requirement $\mathbb{E}\left|D_{N}\right| < \infty$ is essential. A useful sufficient condition is that $D_{k}$ grows at most polynomially in $k$, because a geometric random variable has finite moments of any order.

Proof. By definition,

$$\mathbb{E}D_{N+1} = (1 - \gamma)\sum_{k=0}^{\infty}\gamma^{k}D_{k+1} = \frac{1 - \gamma}{\gamma}\sum_{k=1}^{\infty}\gamma^{k}D_{k} = \frac{1}{\gamma}\left(\mathbb{E}D_{N} - (1 - \gamma)D_{0}\right),$$

and hence

$$\mathbb{E}\left[D_{N} - D_{N+1}\right] = \left(1 - \frac{1}{\gamma}\right)\mathbb{E}D_{N} + \left(\frac{1}{\gamma} - 1\right)D_{0} = \left(\frac{1}{\gamma} - 1\right)\left(D_{0} - \mathbb{E}D_{N}\right),$$

where the rearrangement of the infinite sums is justified by the condition that $\mathbb{E}\left|D_{N}\right| < \infty$.
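The identity is easy to verify by simulation; the following sketch (ours) compares both sides of Lemma 3.1 by Monte Carlo for an arbitrary polynomially growing sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.8

def D(k):
    """An arbitrary sequence with at most polynomial growth: D_k = (k + 1)^2."""
    return (k + 1.0) ** 2

N = rng.geometric(1.0 - gamma, size=1_000_000) - 1   # N ~ Geom(gamma)
lhs = np.mean(D(N) - D(N + 1))
rhs = (1.0 / gamma - 1.0) * (D(0) - np.mean(D(N)))
print(lhs, rhs)  # the two sides agree up to Monte Carlo error
```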

Returning to (11) for SCSG with Lemma 3.1 in hand, where $D_{k} = \mathbb{E}\left\|x_{k}^{(j)} - x^{*}\right\|^{2}$ and $\gamma = m/(m+1)$ (so that $1/\gamma - 1 = 1/m$), and assuming that $\mathbb{E}\left|D_{N_{j}}\right| < \infty$, we obtain

$$2\eta(1 - 2L\eta)\,\mathbb{E}\left[f(\tilde{x}_{j}) - f(x^{*})\right] \le \frac{1}{m}\,\mathbb{E}\left[\left\|\tilde{x}_{j-1} - x^{*}\right\|^{2} - \left\|\tilde{x}_{j} - x^{*}\right\|^{2}\right] + 4L\eta^{2}\,\mathbb{E}\left[f(\tilde{x}_{j-1}) - f(x^{*})\right]. \tag{15}$$

The assumption that $\mathbb{E}\left|D_{N_{j}}\right| < \infty$ will be justified in our general theory and is taken for granted here to avoid distraction. The bound (15) can be rearranged to yield a function that provides a better assessment of progress than the function in (13):

$$2\eta(1 - 2L\eta)\,\mathbb{E}\left[f(\tilde{x}_{j}) - f(x^{*})\right] + \frac{1}{m}\,\mathbb{E}\left\|\tilde{x}_{j} - x^{*}\right\|^{2} \le 4L\eta^{2}\,\mathbb{E}\left[f(\tilde{x}_{j-1}) - f(x^{*})\right] + \frac{1}{m}\,\mathbb{E}\left\|\tilde{x}_{j-1} - x^{*}\right\|^{2}. \tag{16}$$

We accordingly view the left-hand side of (16) as a Lyapunov function and define:

$$\mathcal{L}_{j} \triangleq 2\eta(1 - 2L\eta)\,\mathbb{E}\left[f(\tilde{x}_{j}) - f(x^{*})\right] + \frac{1}{m}\,\mathbb{E}\left\|\tilde{x}_{j} - x^{*}\right\|^{2}.$$

Combining (16) with the strong-convexity bound (10), we then have:

$$\mathcal{L}_{j} \le \lambda\,\mathcal{L}_{j-1}, \quad \text{where } \lambda = \frac{4L\eta^{2} + 2/\mu m}{2\eta(1 - 2L\eta) + 2/\mu m} < 1 \text{ whenever } \eta < \frac{1}{4L}.$$

As a result,

$$\mathbb{E}\left[f(\tilde{x}_{T}) - f(x^{*})\right] \le \frac{\mathcal{L}_{T}}{2\eta(1 - 2L\eta)} \le \frac{\lambda^{T}\,\mathcal{L}_{0}}{2\eta(1 - 2L\eta)},$$

and, by (6),

$$\mathbb{E}\,C(T(\varepsilon)) = (n + m)\,T(\varepsilon), \quad \text{with } T(\varepsilon) = O\left(\frac{\log(\mathcal{L}_{0}/\varepsilon)}{\log(1/\lambda)}\right).$$

Suppose $\eta = \Theta(1/L)$ and $m = \Theta(n)$; then $\log(1/\lambda) = \Omega\left(\min\{1, \mu n/L\}\right)$. Therefore the complexity of SCSG is

$$\mathbb{E}\,C(T(\varepsilon)) = O\left(\left(n + \frac{L}{\mu}\right)\log\frac{1}{\varepsilon}\right).$$

In summary, the better control provided by geometrization enables SCSG to achieve the fast rate of SVRG without knowledge of $\mu$.

4 Convergence Analysis of SCSG for Unregularized Smooth Problems

4.1 One-epoch analysis

We start with the analysis for a single epoch. The key difficulty lies in controlling the deviation of the block gradient $g_{j}$ from the full gradient, which acts as a bias within the epoch, conditional on the iterate $\tilde{x}_{j-1}$ drawn at the beginning of the $j$th epoch. Since the batch $\mathcal{I}_{j}$ is drawn without replacement, we have:

$$\mathbb{E}\left[\left\|g_{j} - \nabla f(\tilde{x}_{j-1})\right\|^{2} \,\middle|\, \tilde{x}_{j-1}\right] \le \frac{n - B_{j}}{(n - 1)B_{j}} \cdot \frac{1}{n}\sum_{i=1}^{n}\left\|\nabla f_{i}(\tilde{x}_{j-1}) - \nabla f(\tilde{x}_{j-1})\right\|^{2}. \tag{17}$$

We deal with this extra bias by exploiting Lemma 3.1, obtaining the following theorem, which connects the iterates produced in consecutive epochs. The proof of the theorem is relegated to Section 4.5.

Theorem 4.1

Fix any $j \ge 1$. Assume that

(18)

Then under assumptions A1 and A2,

4.2 Multi-epoch analysis

We now turn to the multi-epoch analysis, focusing on using the one-epoch analysis to determine the setting of the hyperparameters. Interestingly, we require that the batch size scales as the square of the number of inner-loop iterations .

Theorem 4.2

Fix the constants specified below, and let

(19)

Take the parameters as in (19) and assume that

(20)

Then

where the quantities above are positive numbers such that

Proof. By Theorem 4.1, with the settings above,

(21)

Let

Then

where the last line uses the condition that

For any ,

When , we have , and thus

In summary,

(22)

Plugging this into (21), we conclude that

(23)

Finally, we prove the following statement by induction:

(24)

where

It is obvious that (24) holds in the base case. Supposing that it holds for the previous epoch, by (23) we have

where the last line uses the fact that for all . If , then and thus

If ,

Therefore, (24) is proved. The proof is then completed by noting that

and