Tight Dimension Independent Lower Bound on Optimal Expected Convergence Rate for Diminishing Step Sizes in SGD

10/10/2018, by Phuong Ha Nguyen, et al. (IBM, University of Connecticut)

We study convergence of Stochastic Gradient Descent (SGD) for strongly convex and smooth objective functions F. We prove a lower bound on the expected convergence rate which holds for any sequence of diminishing stepsizes that is designed based on only global knowledge, such as the fact that F is smooth and strongly convex and that the component functions are smooth and convex, together with additional information. Our lower bound comes within a factor 32 of the expected convergence rate of a sequence of stepsizes recently proposed at ICML 2018, which is based on exactly such knowledge. This shows that the stepsizes proposed in the ICML paper are close to optimal. Furthermore, we conclude that in order to construct stepsizes that beat our lower bound, more detailed information about F must be known. Our work significantly improves over the state-of-the-art lower bound, which we show is another factor 643·d worse, where d is the dimension. We are the first to prove a lower bound that comes within a small constant factor -- independent of any other problem-specific parameters -- of an optimal solution.


1 Introduction

We are interested in solving the following stochastic optimization problem

min_w F(w) = E[f(w; ξ)],   (1)

where ξ is a random variable obeying some distribution P. In the case of empirical risk minimization with a training set of n samples, ξ is a random variable that is defined by a single random sample pulled uniformly from the training set. Then, by defining f_i(w) as the loss on the i-th sample, empirical risk minimization reduces to

min_w F(w) = (1/n) Σ_{i=1}^n f_i(w).   (2)

Problems of this type arise frequently in supervised learning applications [7]. The classic first-order methods to solve problem (2) are the gradient descent (GD) [17] and stochastic gradient descent (SGD) [19] algorithms. (We note that even though the stochastic gradient method is referred to as SG in the literature, the term stochastic gradient descent (SGD) has been widely used in many important works on large-scale learning.) GD is a standard deterministic gradient method, which updates iterates along the negative full gradient with learning rate η_t as follows:

w_{t+1} = w_t − η_t ∇F(w_t).

We can choose a constant step size η_t = 1/L and achieve a linear convergence rate for the strongly convex case [14]. Upper bounds on the convergence rates of GD and SGD have been studied in [2, 4, 14, 20]. However, GD requires the evaluation of all n component gradients at each step, which is very expensive and therefore avoided in large-scale optimization. To reduce the computational cost of solving (2), a class of variance reduction methods [9, 6, 8, 15] has been proposed. The difference between GD and variance reduction methods is that GD needs to compute the full gradient at each step, while the variance reduction methods compute the full gradient only after a certain number of steps. In this way, variance reduction methods have a lower computational cost than GD. To avoid evaluating the full gradient at all, SGD generates an unbiased random variable ξ_t, i.e.,

E[∇f(w; ξ_t)] = ∇F(w),

and then evaluates the gradient ∇f(w_t; ξ_t) for ξ_t drawn from distribution P. After this, w_{t+1} is updated as follows

w_{t+1} = w_t − η_t ∇f(w_t; ξ_t).   (3)

Algorithm 1 provides a detailed description. Obviously, the computational cost of a single SGD iteration is n times cheaper than that of a GD iteration. However, as has been shown in the literature, we need to choose η_t = O(1/t), and the convergence rate of SGD is slowed down to O(1/t) [3], which is a sublinear convergence rate.
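The variance reduction methods mentioned above ([9, 6, 8, 15]) recompute the full gradient only once in a while. The following is a minimal sketch in the style of SVRG [8] on a made-up least-squares instance; the data, step size, and epoch count are all illustrative and not taken from any of the cited papers:

```python
import numpy as np

# Made-up least-squares instance: F(w) = (1/n) sum_i (a_i^T w - b_i)^2.
rng = np.random.default_rng(1)
n, d = 100, 5
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true                        # consistent system, so x_true minimizes F

grad_i = lambda w, i: 2 * A[i] * (A[i] @ w - b[i])   # gradient of the i-th component

def svrg(w0, eta=0.01, epochs=20):
    """SVRG-style sketch: one full gradient per outer epoch; inner steps use
    the variance-reduced estimate g_i(w) - g_i(w_snap) + full_grad(w_snap)."""
    w = w0.copy()
    for _ in range(epochs):
        w_snap = w.copy()
        full = sum(grad_i(w_snap, i) for i in range(n)) / n  # full gradient, once per epoch
        for _ in range(n):            # n cheap stochastic steps per full gradient
            i = rng.integers(n)
            w -= eta * (grad_i(w, i) - grad_i(w_snap, i) + full)
    return w

w_hat = svrg(np.zeros(d))
print(np.linalg.norm(w_hat - x_true))   # distance to the minimizer
```

The point of the sketch is the cost profile: each outer epoch pays for one full gradient plus n single-component gradients, instead of n full gradients as in GD.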

  Initialize: w_0
  Iterate:
  for t = 0, 1, 2, … do
     Choose a step size (i.e., learning rate) η_t > 0.
     Generate a random variable ξ_t.
     Compute a stochastic gradient ∇f(w_t; ξ_t).
     Update the new iterate w_{t+1} = w_t − η_t ∇f(w_t; ξ_t).
  end for
Algorithm 1 Stochastic Gradient Descent (SGD) Method
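As a concrete illustration, Algorithm 1 can be rendered in a few lines of Python on a synthetic least-squares instance. Everything below — the data, the estimated constants, and the burn-in shift E in the diminishing step size — is made up for this sketch; it is not the tuned schedule of [16]:

```python
import numpy as np

# Synthetic strongly convex ERM instance: F(w) = (1/n) * sum_i (a_i^T w - b_i)^2.
rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
b = A @ w_star                                   # consistent labels, so w_star minimizes F

mu = np.linalg.eigvalsh(2 * A.T @ A / n).min()   # strong convexity constant of F
L = 2 * (A ** 2).sum(axis=1).max()               # smoothness constant of the components
E = 4 * L / mu                                   # burn-in shift (illustrative choice)

# Algorithm 1: SGD with diminishing step sizes eta_t = 1/(mu * (t + E)).
w = np.zeros(d)
T = 50_000
for t in range(T):
    i = rng.integers(n)                          # generate the random variable xi_t
    g = 2 * A[i] * (A[i] @ w - b[i])             # stochastic gradient of f_i at w_t
    w = w - g / (mu * (t + E))                   # update w_{t+1} = w_t - eta_t * g

print(np.linalg.norm(w - w_star))                # squared distance decays roughly like O(1/t)
```

The shift E keeps the early step sizes below 1/(4L), so the first iterations do not overshoot; the 1/t tail gives the sublinear rate discussed above.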

In this paper we focus on the general problem (1) where F is strongly convex. Since F is strongly convex, a unique optimal solution of (1) exists, and throughout the paper we denote this optimal solution by w_*. The starting point for analysis is the recurrence

E[||w_{t+1} − w_*||²] ≤ (1 − μ η_t) E[||w_t − w_*||²] + η_t² N,   (4)

where N = 2 E[||∇f(w_*; ξ)||²] and η_t is upper bounded by 1/(2L); the recurrence has been shown to hold if we assume (1) N is finite, (2) F is μ-strongly convex, (3) f(w; ξ) is L-smooth, and (4) f(w; ξ) is convex [16, 10]; we detail these assumptions below:

Assumption 1 (μ-strongly convex).

The objective function F is μ-strongly convex, i.e., there exists a constant μ > 0 such that, for all w, w',

F(w) − F(w') ≥ ⟨∇F(w'), w − w'⟩ + (μ/2) ||w − w'||².   (5)

As shown in [14, 3], Assumption 1 implies

2μ (F(w) − F(w_*)) ≤ ||∇F(w)||² for all w.   (6)
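As a quick numerical sanity check of this implication (on a made-up one-dimensional example, not part of the paper's analysis), the inequality 2μ(F(w) − F(w_*)) ≤ ||∇F(w)||² can be verified on a grid of test points:

```python
import numpy as np

# F(w) = w^2 + exp(w) is mu-strongly convex with mu = 2, since F''(w) = 2 + e^w >= 2.
F = lambda w: w ** 2 + np.exp(w)
dF = lambda w: 2 * w + np.exp(w)
mu = 2.0

# Locate the unique minimizer w_* by bisection on F' (F' is strictly increasing).
lo, hi = -2.0, 0.0
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if dF(mid) < 0 else (lo, mid)
w_star = (lo + hi) / 2

# Check 2*mu*(F(w) - F(w_*)) <= (F'(w))^2 on a grid of test points.
ws = np.linspace(-3, 3, 1001)
assert np.all(2 * mu * (F(ws) - F(w_star)) <= dF(ws) ** 2 + 1e-9)
print("inequality (6) holds on all test points")
```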
Assumption 2 (L-smooth).

f(w; ξ) is L-smooth for every realization of ξ, i.e., there exists a constant L > 0 such that, for all w, w',

||∇f(w; ξ) − ∇f(w'; ξ)|| ≤ L ||w − w'||.   (7)

Assumption 2 implies that F is also L-smooth.

Assumption 3.

f(w; ξ) is convex for every realization of ξ, i.e., for all w, w',

f(w; ξ) − f(w'; ξ) ≥ ⟨∇f(w'; ξ), w − w'⟩.

We notice that the recurrence established earlier in [12] under the same set of assumptions is similar to, but worse than, (4), as it only holds for a more restricted range of step sizes than (4); only for sufficiently small step sizes does the recurrence of [12] provide a better bound than (4). In practical settings such as logistic regression, μ ~ 1/n, L ~ 1, and the total number of iterations T is at most a relatively small constant number of epochs, where a single epoch represents n iterations, resembling the complexity of a single GD computation. As we will show, for this parameter setting the optimally chosen step sizes are of order 1/(μt). This is the reason we focus in this paper on analyzing recurrence (4).

Problem Statement: It is well-known that, based on the above assumptions (without the so-called bounded gradient assumption) and knowledge of only μ and L, a sequence of stepsizes can be constructed such that E[||w_t − w_*||²] decreases like O(1/t) [16]. Knowing a tight lower bound on the expected convergence rate is important for the following reasons: (1) It helps us understand to what extent a given sequence of stepsizes leads to an optimal expected convergence rate. (2) The lower bound tells us that a sequence of step sizes constructed as a function of only μ and L cannot beat an expected convergence rate of order 1/t. More information is needed in the construction of the stepsizes if we want to achieve a strictly better expected convergence rate.

Related Work and Contribution: The authors of [13] initiated the formal study of lower bounds on the expected convergence rate of SGD. The authors of [1] and [18] independently studied this lower bound using information theory and were able to improve it.

As in this paper, the derivation in [1] is for SGD where the sequence of step sizes is fixed a priori based on global information regarding assumed stochastic parameters of the objective function F. Their proof uses the following three assumptions (in this paper we use the different set of assumptions listed above):

  1. The assumption of a strongly convex objective function, i.e., Assumption 1 (see Definition 3 in [1]).

  2. There exists a bounded convex set W such that the gradients are bounded for all w ∈ W (see Definition 1 in [1]). Notice that this is not the same as the bounded gradient assumption, where the domain is unbounded. (The bounded gradient assumption over an unbounded domain is in conflict with assuming strong convexity, as explained in [16].)

  3. The objective function F is a convex Lipschitz function, i.e., there exists a positive number K such that |F(w) − F(w')| ≤ K ||w − w'|| for all w, w' ∈ W. We notice that this assumption actually implies the assumption on bounded gradients as stated above.

To prove the lower bound for strongly convex and Lipschitz objective functions, the authors constructed a class of objective functions and showed that the lower bound for this class is, in terms of the notation used in this paper,

(8)

We revisit their derivation in Supplementary Material A, where we show how their lower bound transforms into (8). Notice that their lower bound depends on the dimension d.

In this paper we prove for strongly convex and smooth objective functions the lower bound

Our lower bound is independent of d and, in fact, it meets the expected convergence rate for a specifically constructed sequence of step sizes (based on only the parameter μ for strong convexity and the parameter L for smoothness) within a factor 32. This proves that this sequence of step sizes leads to an optimal expected convergence rate within the small factor of 32, and that our lower bound is tight within a factor of 32. Notice that we significantly improve over the state of the art, since (8) is a dimension-dependent factor larger than our lower bound and, more importantly, our lower bound is independent of d.

The specifically constructed sequence of step sizes mentioned above is from [16]; it yields an expected convergence rate within a factor 32 of our lower bound, which explains the factor-32 difference.

In [19], the authors proved that in order to make SGD converge, the stepsizes should satisfy the conditions

Σ_{t=0}^∞ η_t = ∞ and Σ_{t=0}^∞ η_t² < ∞.

In [12], the authors studied the expected convergence rates for another class of step sizes, of order 1/t^q where 0 < q ≤ 1. However, the authors of [19] and [12] do not discuss which stepsizes among those proposed are optimal, which is what is done in this paper.
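The two conditions of [19] — the step sizes must sum to infinity while their squares sum to a finite value — can be illustrated numerically for the family η_t = 1/(t+1)^q of the kind studied in [12] (the chosen exponents and horizon are illustrative):

```python
import numpy as np

def partial_sums(q, T=10 ** 6):
    """Partial sums of eta_t and eta_t^2 for eta_t = 1/(t+1)^q, t = 0..T-1."""
    eta = 1.0 / np.arange(1, T + 1) ** q
    return eta.sum(), (eta ** 2).sum()

# q = 1: sum eta_t diverges (grows like log T) while sum eta_t^2 converges (to pi^2/6).
s1, s2 = partial_sums(1.0)
print(f"q=1.0 : sum eta = {s1:.2f} (diverges), sum eta^2 = {s2:.4f} (converges)")

# q = 0.75: both conditions still hold, since 1/2 < q <= 1.
s1, s2 = partial_sums(0.75)
print(f"q=0.75: sum eta = {s1:.1f} (diverges), sum eta^2 = {s2:.3f} (converges)")

# q = 0.4: sum eta_t^2 also diverges, so the second Robbins-Monro condition fails.
s1, s2 = partial_sums(0.4)
print(f"q=0.4 : sum eta^2 = {s2:.1f} (diverges as T grows)")
```

For q ≤ 1/2 the squared step sizes no longer sum to a finite value, so the variance term never dies out; for q > 1 the step sizes are summable and the iterates can stall before reaching the optimum.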

Outline: The paper is organized as follows. Section 2 describes a class of strongly convex and smooth objective functions which is used to derive the lower bound. We verify our theory with experiments in Section 3. Supplementary Material A comprehensively studies the work in [1]. Section 4 concludes the paper.

2 Lower Bound and Optimal Stepsize for SGD

In this paper, we consider the following extended problem of SGD: when constructing a sequence of stepsizes, we do not only have access to μ and L; in addition, we also have access to the full gradient in the t-th iteration and to an oracle that knows further quantities in the t-th iteration. Notice that this allows the stepsizes to be adaptively constructed to some extent, and our lower bound will hold for this more general case.

Note that the construction of the stepsizes as analyzed in this paper does not depend on knowledge of the stochastic gradients. So, we do not consider step sizes that are adaptively computed based on the stochastic gradients.

We study the best lower bound on the expected convergence rate for any possible sequence of stepsizes that satisfy the requirements given above in the extended SGD setting.

In order to prove a lower bound, we propose a specific class of strongly convex and smooth objective functions, and we show in the extended SGD setting how to compute the optimal step size as a function of μ, L, and the oracle. For completeness, as in Algorithm 1, the next iterate is defined as w_{t+1} = w_t − η_t ∇f(w_t; ξ_t).

We consider the following class of objective functions: we take a multivariate normal distribution of a d-dimensional random vector ξ, i.e., ξ ∼ N(m, Σ), where m is the mean vector and Σ is the (symmetric positive semi-definite) covariance matrix; the density function of ξ is the corresponding multivariate normal density. We select component functions f(w; ξ), where the underlying function is constructed a priori according to the following random process:

  • With some probability, we draw the function's parameter from the uniform distribution over a first interval.

  • With the complementary probability, we draw it from the uniform distribution over a second interval.

The following theorem analyses the sequence of optimal step sizes for our class of objective functions and gives a lower bound on the corresponding expected convergence rates. The theorem states that we cannot find a better sequence of step sizes. In other words, without any additional information about the objective function (beyond what is needed for computing the stepsizes), we can at best prove a general upper bound which is at least the lower bound stated in the theorem. As explained in the introduction, an upper bound exists which is only a factor 32 larger than the theorem's lower bound.

As a disclaimer we notice that for some objective functions the expected convergence rate can be much better than what is stated in the theorem: this is due to the specific nature of the objective function itself. However, without knowledge of this nature, one can only prove a general upper bound on the expected convergence rate, and any such upper bound must be at least the lower bound proven in the next theorem. Therefore, as a conclusion of the theorem, we infer that only if more or other information is used for adaptively computing the stepsizes may it be possible to derive stronger upper bounds that beat the lower bound of the theorem.

Theorem 1.

We assume that the component functions are constructed according to the recipe described above. Then the corresponding objective function F is μ-strongly convex, and the component functions are L-smooth and convex.

If we run Algorithm 1 and assume that the oracle is given at the t-th iteration (our extended SGD problem setting), then an exact expression for the optimal sequence of stepsizes (see (12)) based on this oracle can be given. For this sequence of stepsizes,

(9)

and for ,

(10)

where

Proof.

Clearly, is -smooth where the maximum value of is equal to . That is, all functions are -smooth (and we cannot claim a smaller smoothness parameter). We notice that

and

With respect to and distribution we define

Since the construction assigns a random variable drawn from a distribution whose description is not a function of the iterate, the two random variables involved are statistically independent. Therefore,

Notice:

  1. .

  2. Since , we have .

  3. .

Therefore,

and this shows is -strongly convex and has minimum

Since

we have

In our notation

By using similar arguments as used above we can split the expectation and obtain

We already calculated

and we know

This yields

In the SGD algorithm we compute

We choose according to the following computation: We draw from its distribution and apply full gradient descent in order to find which minimizes for . Since

the minimum is achieved by . Therefore,

Let us consider the σ-algebra generated by the history of the algorithm up to the t-th iteration. We derive

which is equal to

(11)

Given this information, the current iterate is not a random variable. Furthermore, we can use linearity of expectation and, as above, split expectations:

Again notice that and . So, is equal to

In terms of , by taking the full expectation (also over ) we get

This is very close to recurrence (4).

The optimal step size in this case is found by taking the derivative with respect to the step size. The derivative is equal to

(12)

which shows the minimum is achieved for

(13)

giving

(14)

We note that for any . We proceed by proving a lower bound on . Clearly,

(15)

Let us define . We can rewrite (15) as follows:

In order to make the inequality above correct, we require for any . Since , we only need . This means

This is equivalent to which is obviously true since .

This implies

Since

we have the following inequality:

Reordering, substituting , and replacing by yields, for , the lower bound

where

The upper bound comes from the following fact. If we run Algorithm 1 with the stepsize sequence proposed in [16], then we have from [16] an expected convergence rate

where

Substituting this yields the stated bound. Since the former is the optimal stepsize and the latter is not, the optimal stepsize can only achieve a smaller expected value. That is, we have

Corollary 1.

Given the class of objective functions analyzed in Theorem 1, we run Algorithm 1 and assume an oracle with access to as well as the full gradient at the -th iteration. An exact expression for the optimal sequence of stepsizes based on and this extended oracle can be given. For this sequence of stepsizes, the same lower and upper bounds on the expected convergence rate as in Theorem 1 hold.

Proof.

The proof of this corollary follows directly from the reason why we are allowed to transform (11) into (2), i.e., the two random variables must be independent to get (2) from (11). If the construction of the stepsize does not depend on the drawn random variable (or the stochastic gradient), then only the oracle information is required to construct the optimal stepsize. This implies that the extra information is not useful, and we can reuse the proof of Theorem 1 to arrive at the result of this corollary. ∎

Let us consider the set of all possible objective functions which are μ-strongly convex and L-smooth. For an objective function F, define the smallest expected convergence rate that can be achieved by a stepsize construction, where the stepsize is computed as a function of μ, L, and the oracle at the t-th iteration. That is,

where the expected convergence rate is explicitly shown as a function of the objective function and the sequence of step sizes.

Among these objective functions, we consider the objective function which has the worst expected convergence rate at the t-th iteration, and we denote the corresponding expected convergence rate accordingly. Precisely,

The lower and upper bounds on this worst-case rate are stated in Corollary 2.

Corollary 2.

Given μ, L, and the oracle at the t-th iteration, the expected convergence rate of the worst strongly convex and smooth objective function, with the optimal stepsize based on this information, satisfies the same lower bounds on the expected convergence rate as in Theorem 1, with the appropriate substitution. As an upper bound we have

where

for

Notice that the scheme for constructing step sizes is independent of the oracle; in other words, its knowledge is not needed.

Proof.

By definition of the supremum, the worst-case rate is always at least the rate of each individual objective function. From Corollary 1 we infer that the latter is larger than the lower bound specified in Theorem 1. Since this holds for all objective functions in the class, it also holds for the supremum over them.

The upper bound follows from the result in [16]: for any given μ and L, we have

where

The importance of Corollary 2 is that, for the worst objective function, we can now compute the gap between the lower bound and the upper bound: they are separated by a factor of 32. This implies that no scheme for constructing a sequence of stepsizes based on μ, L, and the oracle can achieve a significantly better expected convergence rate. The only way to achieve a better expected convergence rate is to use a scheme that has access to information beyond what is given by μ, L, and the oracle. For example, the construction of the stepsizes may need to depend on the stochastic gradient as well as the full gradient, or we may have to develop a new update rule to replace the update rule of SGD.

Corollary 2 shows that the lower bound and the upper bound on the worst-case rate are both of order 1/t (see (9)). Furthermore, it offers a general strategy for computing step sizes which only depends on μ and L in order to realize the upper bound (which comes within a factor 32 of the lower bound). This means that we can finally conclude that there does not exist a significantly better construction for step sizes for classical SGD (as opposed to our extended SGD problem).

3 Numerical Experiments

We verify our theory by considering simulations with different values of the sample size n (1000, 10000, and 100000) and the vector size d (10, 100, and 1000). First, we generate n vectors of size d from a normal distribution with mean m and positive definite covariance matrix Σ. For simplicity, we generate m and a diagonal matrix Σ with each element of m and each diagonal element of Σ chosen uniformly at random. We performed 10 runs and report the average results.
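The simulation setup can be sketched as follows. This is a hedged reconstruction: the component functions f(w; ξ) = ||w − ξ||²/2 stand in for the paper's exact class, and the sampling interval [0, 1] for m and the diagonal of Σ, the burn-in shift in the step size, and the run lengths are our assumptions:

```python
import numpy as np

def run_sgd(n, d, T, seed):
    """One run: generate Gaussian data as in Section 3, then run SGD with a
    diminishing step size and track ||w_t - w_*||^2 at every iteration."""
    rng = np.random.default_rng(seed)
    m = rng.uniform(0.0, 1.0, size=d)              # mean vector (assumed interval)
    cov = np.diag(rng.uniform(0.0, 1.0, size=d))   # diagonal PSD covariance (assumed interval)
    xis = rng.multivariate_normal(m, cov, size=n)  # the n data vectors xi_i
    # Stand-in component functions f(w; xi) = ||w - xi||^2 / 2, so F is
    # 1-strongly convex and 1-smooth with minimizer w_* = mean of the xi_i.
    w_star = xis.mean(axis=0)
    mu = 1.0
    w = np.zeros(d)
    errs = []
    for t in range(T):
        xi = xis[rng.integers(n)]
        w = w - (2.0 / (mu * (t + 10))) * (w - xi)  # eta_t ~ 2/(mu*(t+E)), E=10 assumed
        errs.append(np.sum((w - w_star) ** 2))
    return np.array(errs)

# Average ||w_t - w_*||^2 over 10 runs, as in the experiments.
n, d, T = 1000, 10, 20_000
avg = np.mean([run_sgd(n, d, T, s) for s in range(10)], axis=0)
print(avg[100], avg[-1])   # averaged squared error decays roughly like O(1/t)
```

Averaging over several runs smooths the oscillation of the individual trajectories, which is exactly the variance effect discussed below.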

Figure 1: and its upper and lower bounds

We denote by “Upper Yt” (red line) and “Lower Yt” (violet line) in Figure 1 the upper and lower bounds on Y_t in (10) and (9), respectively; by “Ytopt” (orange line) the quantity defined in Theorem 1 with the given information from the oracle; and by “Yt” (green line) the squared norm of the difference between the iterate and the optimum, where the iterate is generated by Algorithm 1 with the learning rate in (13). We note that “Lower Yt” and “Ytopt” are very close to each other in Figure 1; the difference between them is shown in Figure 2. Note that “Yt” in Figure 1 is computed as the average over 10 runs (not the exact expectation).

Figure 2: The difference between “Lower Yt” and “Ytopt” (, )

Discussion: We draw a vertical line at a specific epoch because we expect the upper bound in (10) to take effect from that point on. The “Upper Yt” (red line), “Lower Yt” (violet line) and “Ytopt” (orange line) do not oscillate because they can be computed exactly using formulas (10), (9) and (14), respectively, i.e., these lines have no variation. The green line “Yt” in Figure 1 oscillates because our analysis does not consider the variance of the iterates. As shown in (4), the recurrence contains a variance term proportional to η_t².

It is clear that a decrease of the step size η_t leads to a decrease of the variance of the iterates. This fact is reflected in all subfigures in Figure 1. We expect that increasing d and n (the number of dimensions and the number of data points) would increase the variance. Hence, a larger t is required to make the variance approach zero, as shown in Figure 1. When t is sufficiently large, the optimality of the chosen stepsize is clearly visible in Figure 1, i.e., the green line lies between the red line (upper bound) and the violet line (lower bound). Moreover, these two bounds are quite close to each other when t is sufficiently large.

4 Conclusion

In this paper, we study the convergence of SGD. We show that for any stepsize sequence constructed based on μ, L, and an oracle at the t-th iteration, the best possible lower bound on the expected convergence rate is of order 1/t. Note that this extends classical SGD, where only μ and L are given for the construction of the stepsizes. This result implies that the best possible lower bound on the convergence rate for any possible stepsize based on μ and L is also of order 1/t, and it confirms the optimality of the stepsize sequence proposed in [16]. Compared to the result in [1], our proposed class of objective functions is simple and does not require many assumptions for the sake of the proof. Also, our lower bound is orders of magnitude tighter, and it is the first lower bound that is independent of the dimension d. In addition, [1] does not study the lower bound for the extended SGD problem.

References

  • [1] Alekh Agarwal, Peter L Bartlett, Pradeep Ravikumar, and Martin J Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. 2010.
  • [2] D.P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
  • [3] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv:1606.04838, 2016.
  • [4] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
  • [5] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, NY, USA, 1991.
  • [6] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654, 2014.
  • [7] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, 2nd edition, 2009.
  • [8] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.
  • [9] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2663–2671, 2012.
  • [10] Rémi Leblond, Fabian Pederegosa, and Simon Lacoste-Julien. Improved asynchronous parallel optimization analysis for stochastic incremental methods. arXiv preprint arXiv:1801.03749, 2018.
  • [11] Lucien LeCam. Convergence of estimates under dimensionality restrictions. The Annals of Statistics, 1(1):38–53, 1973.
  • [12] Eric Moulines and Francis R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
  • [13] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
  • [14] Yurii Nesterov. Introductory lectures on convex optimization : a basic course. Applied optimization. Kluwer Academic Publ., Boston, Dordrecht, London, 2004.
  • [15] Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In ICML, 2017.
  • [16] Lam M. Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtarik, Katya Scheinberg, and Martin Takac. SGD and Hogwild! Convergence without the bounded gradients assumption. In ICML, 2018.
  • [17] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, 2nd edition, 2006.
  • [18] Maxim Raginsky and Alexander Rakhlin. Information-Based Complexity, Feedback and Dynamics in Convex Programming. IEEE Trans. Information Theory, 57(10):7036–7056, 2011.
  • [19] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
  • [20] Marten van Dijk, Lam Nguyen, Phuong Ha Nguyen, and Dzung Phan. Characterization of Convex Objective Functions and Optimal Expected Convergence Rates for SGD. arXiv preprint, 2018.
  • [21] Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer, 1997.

Appendix A Related Work

In [1], the authors showed a lower bound on the expected convergence rate under the bounded gradient assumption for the objective function F over a convex set W. To show the lower bound, the authors use the following three assumptions on the objective function F:

  1. The assumption of a strongly convex objective function, i.e., Assumption 1 (see Definition 3 in [1]).

  2. There exists a bounded convex set W such that the gradients are bounded for all w ∈ W (see Definition 1 in [1]). Notice that this is not the same as the bounded gradient assumption, where the domain is unbounded.

  3. The objective function F is a convex Lipschitz function, i.e., there exists a positive number K such that |F(w) − F(w')| ≤ K ||w − w'|| for all w, w' ∈ W.