A Universally Optimal Multistage Accelerated Stochastic Gradient Method

01/23/2019 ∙ by Necdet Serhat Aybat, et al. ∙ MIT

We study the problem of minimizing a strongly convex and smooth function when we have noisy estimates of its gradient. We propose a novel multistage accelerated algorithm that is universally optimal in the sense that it achieves the optimal rate in both the deterministic and stochastic cases and operates without knowledge of the noise characteristics. The algorithm consists of stages that use a stochastic version of Nesterov's accelerated algorithm with a specific restart and parameters selected to achieve the fastest reduction of the bias-variance terms in the convergence rate bounds.


1 Introduction

First order optimization methods play a key role in solving large scale machine learning problems due to their low iteration complexity and scalability to large data sets. In several cases, these methods operate with noisy first order information, either because the gradient is estimated from random draws or from a subset of the components of the underlying objective function

Bach and Moulines (2011); Cohen et al. (2018); Flammarion and Bach (2015); Ghadimi and Lan (2012, 2013); Jain et al. (2018); Vaswani et al. (2018); d’Aspremont (2008); Devolder et al. (2014) or because noise is injected intentionally due to privacy or algorithmic considerations Bassily et al. (2014); Neelakantan et al. (2015); Raginsky et al. (2017); Gao et al. (2018a, b). A fundamental question in this setting is how to design fast algorithms with optimal convergence rates, matching the lower bounds in terms of the target accuracy and other important parameters, in both the deterministic and stochastic cases (i.e., without or with gradient errors).

In this paper, we design an optimal first order method to solve the problem

min_{x ∈ ℝᵈ} f(x),    f ∈ S_{μ,L}(ℝᵈ),    (1)

where, for scalars 0 < μ ≤ L, S_{μ,L}(ℝᵈ) is the set of continuously differentiable functions f that are strongly convex with modulus μ and have Lipschitz-continuous gradients with constant L, which imply that for every x, y ∈ ℝᵈ the function satisfies (see e.g. Nesterov (2004))

f(x) + ∇f(x)⊤(y − x) + (μ/2) ‖y − x‖² ≤ f(y),    (2a)
f(y) ≤ f(x) + ∇f(x)⊤(y − x) + (L/2) ‖y − x‖².    (2b)

We denote the solution of problem (1) by f^* = f(x^*), which is achieved at the unique optimal point x^*. Also, the ratio κ = L/μ is called the condition number of f.

We assume that the gradient information is available through a stochastic oracle, which at each iteration k, given the current iterate x^k, provides the noisy gradient ∇̃f(x^k, w^k), where {w^k} is a sequence of independent random variables such that for all k ≥ 0,

E[∇̃f(x^k, w^k) | x^k] = ∇f(x^k),    (3a)
E[‖∇̃f(x^k, w^k) − ∇f(x^k)‖² | x^k] ≤ σ².    (3b)

This oracle model is commonly considered in the literature (see e.g. Ghadimi and Lan (2012, 2013); Bubeck et al. (2015)), and is more general than the additive noise model where the gradient is corrupted by additive stochastic noise, see e.g. Cohen et al. (2018); Bassily et al. (2014).
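For concreteness, a minimal sketch of one oracle satisfying (3a)-(3b), using the simpler additive Gaussian noise model mentioned above, might look as follows; the function name and the Gaussian choice are illustrative and not part of the paper:

```python
import numpy as np

def noisy_gradient(grad_f, x, sigma, rng=None):
    """Stochastic first-order oracle: returns grad_f(x) + w, with E[w] = 0 and
    E[||w||^2] = sigma^2, so that conditions (3a)-(3b) hold."""
    rng = np.random.default_rng() if rng is None else rng
    w = rng.normal(scale=sigma / np.sqrt(x.size), size=x.shape)
    return grad_f(x) + w
```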

In this setting, the performance of many algorithms is characterized by the expected error of the iterates (in terms of the suboptimality in function values), which admits a bound given as the sum of two terms: a bias term that shows the decay of the initialization error and is independent of the noise parameter σ, and a variance term that depends on σ and is independent of the initial point x⁰. A lower bound on the bias term follows from the seminal work of Nemirovsky and Yudin (1983), which showed that, without noise (σ = 0) and after k iterations, the function suboptimality satisfies the lower bound (shown under an additional technical assumption)

f(x^k) − f^* ≥ (μ/2) ((√κ − 1)/(√κ + 1))^{2k} ‖x⁰ − x^*‖².    (4)

With noise, Raginsky and Rakhlin (2011) provided the following (much larger) lower bound on the function suboptimality, which also provides a lower bound on the variance term (the authors show this result for a particular parameter setting; nonetheless, it can be generalized by scaling the problem parameters appropriately):

E[f(x^k)] − f^* ≥ Ω( σ² / (μ k) ).    (5)

Several algorithms have been proposed in the recent literature attempting to achieve these lower bounds (here we review their error bounds after k iterations, highlighting the dependence on k, σ, and the initial point x⁰, and suppressing the dependence on the remaining problem constants). Xiao (2010) obtains performance guarantees in expected suboptimality for an accelerated version of the dual averaging method. Dieuleveut et al. (2017) consider a quadratic objective function and develop an algorithm with averaging that achieves an error bound of this bias-variance form. Hu et al. (2009) consider general strongly convex and smooth functions and achieve an error bound with similar dependence under the assumption of bounded noise. Ghadimi and Lan (2012) and Chen et al. (2012) extend this result to the noise model in (3) by introducing the accelerated stochastic approximation algorithm (AC-SA) and the optimal regularized dual averaging algorithm (ORDA), respectively. Both AC-SA and ORDA have multistage versions, presented in Ghadimi and Lan (2013) and Chen et al. (2012), where the authors improve the bias term to the optimal (linearly decaying) rate by exploiting knowledge of σ and an upper bound Δ on the initial optimality gap f(x⁰) − f^* in the operation of the algorithm. Another closely related paper is Cohen et al. (2018), which proposed AGD+ and showed, under an additive noise model, that it admits an error bound whose bias term decays at a polynomial rate of arbitrary degree p, with constants that grow with p.

In this paper, we introduce the class of Multistage Accelerated Stochastic Gradient (M-ASG) methods, which are universally optimal in the sense that they achieve the lower bounds both in the noiseless deterministic case and in the noisy stochastic case, up to constants independent of the noise level and the initial point. M-ASG proceeds in stages that use a stochastic version of Nesterov's accelerated method Nesterov (2004) with a specific restart and parameterization. Given an arbitrary length and a constant stepsize for the first stage, and geometrically growing lengths and shrinking stepsizes for the following stages, we first provide a general convergence rate result for M-ASG (see Theorem 3.4). Given the computational budget N, a specific choice for the length of the first stage is shown to achieve the optimal error bound without requiring knowledge of the noise bound σ or the initial optimality gap (see Corollary 3.8). To the best of our knowledge, this is the first algorithm that achieves such a lower bound under these informational assumptions.

In Table 1, we provide a comparison of our algorithm with other algorithms in terms of their required assumptions and the optimality of the bias and variance terms in their results. In particular, we consider AC-SA Ghadimi and Lan (2012), Multistage AC-SA Ghadimi and Lan (2013), ORDA and Multistage ORDA Chen et al. (2012), and the algorithm proposed in Cohen et al. (2018).

| Algorithm | Requires σ or Δ | Opt. Bias | Opt. Var. |
| --- | --- | --- | --- |
| AC-SA | ✗ | ✗ | ✓ |
| Multistage AC-SA | ✓ | ✓ | ✓ |
| ORDA | ✗ | ✗ | ✓ |
| Multistage ORDA | ✓ | ✓ | ✓ |
| Cohen et al. (2018) | ✗ | ✗ | ✓ |
| M-ASG (with parameters in Corollary 3.7) | ✗ | ✗ | ✓ |
| M-ASG (with parameters in Corollary 3.8) | ✗ (only the total budget N) | ✓ | ✓ |
| M-ASG (with parameters in Corollary 3.9) | ✓ (only Δ) | ✓ | ✓ |

Table 1: Comparison of algorithms. Here Δ denotes an upper bound on the initial optimality gap f(x⁰) − f^*, and N is the total iteration budget.

Our paper builds on an analysis of Nesterov’s accelerated stochastic method with a specific momentum parameter, presented in Section 2, which may be of independent interest. This analysis follows from a dynamical-system representation and study of first-order methods, which has recently gained attention in the literature Lessard et al. (2016); Hu and Lessard (2017); Aybat et al. (2018). In Section 3, we present the M-ASG algorithm and characterize its behavior under different assumptions, as summarized in Table 1. In particular, we show that it achieves the optimal convergence rate within a given budget of N iterations. In Section 4, we show how additional information, such as the noise level σ and a bound Δ on the initial optimality gap, can be leveraged in our framework to improve practical performance. Finally, in Section 5, we provide numerical results comparing our algorithm with some of the most recent methods in the literature.

1.1 Preliminaries and Notation

Let I_d and 0_d represent the d×d identity and zero matrices. For a matrix A, tr(A) and det(A) denote the trace and determinant of A, respectively. Also, for scalars i ≤ j and k ≤ l, we use A_{[i:j, k:l]} to denote the submatrix of A formed by rows i to j and columns k to l. We use the superscript ⊤ to denote the transpose of a vector or a matrix, depending on the context. Throughout this paper, all vectors are represented as column vectors. Let S₊ᵈ denote the set of all d×d symmetric positive semi-definite matrices. For two matrices A and B, their Kronecker product is denoted by A ⊗ B. For scalars 0 < μ ≤ L, S_{μ,L}(ℝᵈ) is the set of continuously differentiable functions that are strongly convex with modulus μ and have Lipschitz-continuous gradients with constant L.

2 Modeling Accelerated Gradient Method as a Dynamical System

In this section we study Nesterov’s Accelerated Stochastic Gradient method (ASG) Nesterov (2004) with the stochastic first-order oracle in (3):

x^{k+1} = y^k − α ∇̃f(y^k, w^k),    (6a)
y^{k+1} = (1 + β) x^{k+1} − β x^k,    (6b)

where α > 0 is the stepsize and β ≥ 0 is the momentum parameter. It is worth noting that Nesterov’s analysis in Nesterov (2004) for (6) is given for α = 1/L and β = (√κ − 1)/(√κ + 1), i.e., for β = (1 − √(αμ))/(1 + √(αμ)) with α = 1/L; more importantly, it does not guarantee convergence for other stepsize choices. This choice of momentum parameter has also been studied in other papers in the literature, e.g., Nitanda (2014); Wai et al. (2018); Shi et al. (2018). Here, in the following lemma, we provide a new motivation for it by showing that, for quadratic functions and in the noiseless setting, the momentum parameter β = (1 − √(αμ))/(1 + √(αμ)) achieves the fastest asymptotic convergence rate for a fixed stepsize α ∈ (0, 1/L]. The proof of this lemma is provided in Appendix A.
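As a sanity check on the recursion (6), a minimal NumPy sketch of the stochastic ASG update with this momentum parameter could read as follows; the oracle handle and function names are placeholders, not part of the paper:

```python
import numpy as np

def asg(grad_oracle, x0, alpha, mu, n_iters):
    """Stochastic ASG iterations (6) with the momentum parameter
    beta = (1 - sqrt(alpha*mu)) / (1 + sqrt(alpha*mu))."""
    beta = (1 - np.sqrt(alpha * mu)) / (1 + np.sqrt(alpha * mu))
    x = np.array(x0, dtype=float)
    y = x.copy()
    for _ in range(n_iters):
        x_new = y - alpha * grad_oracle(y)       # update (6a)
        y = (1 + beta) * x_new - beta * x        # update (6b)
        x = x_new
    return x
```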

Lemma 2.1.

Let f be a strongly convex quadratic function of the form f(x) = ½ x⊤Q x + p⊤x + r, where Q is a d-by-d symmetric positive definite matrix with all its eigenvalues in the interval [μ, L]. Consider the deterministic ASG iterations, i.e., σ = 0, as shown in (6), with constant stepsize α ∈ (0, 1/L]. Then, the fastest asymptotic convergence rate, i.e., the smallest ρ that satisfies the inequality ‖x^k − x^*‖ ≤ (ρ + ε_k)^k ‖x⁰ − x^*‖ for some non-negative sequence {ε_k} that goes to zero, is ρ = 1 − √(αμ), and it is achieved by β = (1 − √(αμ))/(1 + √(αμ)). As a consequence, for this choice of β, the iterates x^k converge to x^* linearly with rate ρ = 1 − √(αμ).
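A quick numerical illustration of the lemma on a small diagonal quadratic (the test matrix and iteration count are our own choices, not from the paper): the empirical per-iteration contraction should approximately match the predicted rate 1 − √(αμ).

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T Q x with eigenvalues in [mu, L]; minimizer x* = 0.
mu, L, d = 1.0, 100.0, 20
Q = np.diag(np.linspace(mu, L, d))
alpha = 1.0 / L
beta = (1 - np.sqrt(alpha * mu)) / (1 + np.sqrt(alpha * mu))

x = np.ones(d)
y = x.copy()
n_iters = 2000
for _ in range(n_iters):
    x_new = y - alpha * (Q @ y)            # deterministic ASG step (sigma = 0)
    y = (1 + beta) * x_new - beta * x
    x = x_new

# Empirical per-iteration contraction vs. the predicted rate 1 - sqrt(alpha*mu)
print(np.linalg.norm(x) ** (1 / n_iters), 1 - np.sqrt(alpha * mu))
```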

Our analysis builds on the reformulation of a first-order optimization algorithm as a linear dynamical system. Following Lessard et al. (2016); Hu and Lessard (2017), we write ASG iterations as

ξ^{k+1} = A ξ^k + B ∇̃f(y^k, w^k),    (7a)
y^k = C ξ^k,    (7b)

where ξ^k = [(x^k)⊤, (x^{k−1})⊤]⊤ is the state vector and A, B, and C are system matrices with appropriate dimensions, defined as the Kronecker products A = Ã ⊗ I_d, B = B̃ ⊗ I_d, and C = C̃ ⊗ I_d with

Ã = [ 1+β  −β ; 1  0 ],   B̃ = [ −α ; 0 ],   C̃ = [ 1+β  −β ].    (8)

We can also relate the state ξ^k to the iterate x^k in a linear fashion through the identity x^k = T ξ^k, where T := [1, 0] ⊗ I_d.
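A small NumPy sketch of the matrices in (7)-(8), under the state ordering ξ^k = [x^k; x^{k−1}] used above (variable names and dimensions are illustrative):

```python
import numpy as np

def asg_system_matrices(alpha, beta, d):
    """System matrices A, B, C of (7)-(8) for the state xi_k = [x_k; x_{k-1}]."""
    A_tilde = np.array([[1 + beta, -beta],
                        [1.0,       0.0]])
    B_tilde = np.array([[-alpha],
                        [0.0]])
    C_tilde = np.array([[1 + beta, -beta]])
    I_d = np.eye(d)
    return np.kron(A_tilde, I_d), np.kron(B_tilde, I_d), np.kron(C_tilde, I_d)

# One ASG step in state-space form: xi_next = A @ xi + B @ grad, with y = C @ xi,
# which reproduces x_{k+1} = y_k - alpha * grad and shifts the x-part of the state.
```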

We study the evolution of the ASG method through the following Lyapunov function, which also arises in the study of deterministic accelerated gradient methods:

V_P(ξ) := f(Tξ) − f(x^*) + (ξ − ξ^*)⊤ (P ⊗ I_d) (ξ − ξ^*),    (9)

where ξ^* := [(x^*)⊤, (x^*)⊤]⊤ and P is a symmetric positive semi-definite matrix. In particular, we first state the following lemma, which can be derived by a simple adaptation of the proof of Proposition 4.6 in Aybat et al. (2018) to our setting, where the noise assumption is less restrictive. Its proof can be found in Appendix B.

Lemma 2.2.

Let f ∈ S_{μ,L}(ℝᵈ). Consider the ASG iterations given by (6). Assume there exist ρ ∈ (0, 1) and a symmetric positive semi-definite matrix P, possibly depending on α and β, such that

(10)

where

Let V_P be the Lyapunov function in (9). Then, for every k ≥ 0,

(11)

We use this lemma to derive the following theorem, which characterizes the behavior of the ASG method when α ∈ (0, 1/L] and β = (1 − √(αμ))/(1 + √(αμ)) (see the proof in Appendix C).

Theorem 2.3.

Let f ∈ S_{μ,L}(ℝᵈ). Consider the ASG iterations given in (6) with α ∈ (0, 1/L] and β = (1 − √(αμ))/(1 + √(αμ)). Then,

(12)

for every k ≥ 0, where V_P is the Lyapunov function in (9) with P given by

(13)

This choice of P, which has been studied before only for rate analysis in the deterministic case σ = 0 in Hu and Lessard (2017), is, to the best of our knowledge, novel in the stochastic setting. The proof exploits the special structure of P, and the same structure is also exploited later in the proofs of our main results in Section 3.

3 A Class of Multistage ASG Algorithms

In this section, we introduce a class of multistage ASG algorithms, represented in Algorithm 1, which we denote by M-ASG. The main idea is to run ASG with properly chosen parameters at each stage, over a sequence of stages. In addition, each new stage depends on the previous one, as the first two initial iterates of the new stage are set to the last iterate of the previous stage.

Input : Initial iterate x̃⁰, the sequence {α_k} of stepsizes, the sequence {n_k} of the number of iterations at each stage.
1 Set x⁰_1 = y⁰_1 = x̃⁰;
2 for k = 1, 2, … do
3       Set β_k = (1 − √(α_k μ))/(1 + √(α_k μ));
4       for t = 0, 1, …, n_k − 1 do
5             Set g^t_k = ∇̃f(y^t_k, w^t_k);
6             Set x^{t+1}_k = y^t_k − α_k g^t_k;
7             Set y^{t+1}_k = (1 + β_k) x^{t+1}_k − β_k x^t_k;
8       end for
9      Set x⁰_{k+1} = y⁰_{k+1} = x^{n_k}_k;
10 end for
Algorithm 1 Multistage Accelerated Stochastic Gradient Algorithm (M-ASG)
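A compact Python sketch of Algorithm 1, taking the stepsize and stage-length sequences as inputs exactly as in the listing above; the oracle interface and variable names are assumptions of the sketch, not the paper's code:

```python
import numpy as np

def m_asg(grad_oracle, x0, mu, stepsizes, stage_lengths):
    """Multistage ASG (Algorithm 1): at stage k, run ASG with stepsize alpha_k
    for n_k iterations, restarting the stage from the last iterate of the
    previous stage (both initial iterates of the new stage are set to it)."""
    x_last = np.array(x0, dtype=float)
    for alpha, n in zip(stepsizes, stage_lengths):
        beta = (1 - np.sqrt(alpha * mu)) / (1 + np.sqrt(alpha * mu))
        x = x_last.copy()
        y = x_last.copy()
        for _ in range(n):
            x_new = y - alpha * grad_oracle(y)    # inner ASG update (6a)
            y = (1 + beta) * x_new - beta * x     # momentum update (6b)
            x = x_new
        x_last = x                                # restart point for the next stage
    return x_last
```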

In order to analyze Algorithm 1, in the following theorem we first characterize the evolution of the iterates within one stage through the Lyapunov function in (9). The details of the proof are provided in Appendix D.

Theorem 3.1.

Let f ∈ S_{μ,L}(ℝᵈ). Consider running the ASG method given in (6) for n iterations with stepsize α ∈ (0, 1/L] and momentum parameter β = (1 − √(αμ))/(1 + √(αμ)). Then,

(14)

where P is as in Theorem 2.3.

We use this result to choose a stepsize, given n iterations, such that we achieve an approximately optimal decay in the variance term, which yields the following corollary for the M-ASG algorithm with a single stage; its proof can be found in Appendix E.

Corollary 3.2.

Let f ∈ S_{μ,L}(ℝᵈ). Consider running M-ASG, i.e., Algorithm 1, for only one stage with n iterations and an appropriately chosen stepsize. Then,

(15)

provided that n is sufficiently large.

For the subsequent analysis, we define the state vector of stage k as ξ^t_k := [(x^t_k)⊤, (x^{t−1}_k)⊤]⊤ for t ≥ 1 and k = 1, …, K, where K is the number of stages. We analyze the performance of each stage with respect to a stage-dependent Lyapunov function V_{P_k}. The following lemma relates the performance bounds with respect to consecutive choices of Lyapunov functions, building on our specific restarting mechanism.

Lemma 3.3.

Let f ∈ S_{μ,L}(ℝᵈ). Consider M-ASG, i.e., Algorithm 1. Then, for every k ≥ 1,

(16)
Proof.

The proof can be found in Appendix F. ∎

Now, we are ready to state and prove the main result of the paper (see proof in Appendix G):

Theorem 3.4.

Let f ∈ S_{μ,L}(ℝᵈ). Consider running M-ASG, i.e., Algorithm 1, with the following parameters:

for any n₁ ≥ 1, with stage lengths that grow and stepsizes that shrink geometrically for the later stages. The last iterate of each stage, i.e., x^{n_k}_k, satisfies the following bound for all k ≥ 1:

(17)

We next define N_k as the number of iterations needed to run M-ASG for k stages, i.e.,

N_k := n_1 + n_2 + … + n_k = Σ_{j=1}^{k} n_j.    (18)

Note that, for k ≥ 2 and with the parameters given in Theorem 3.4,

(19)

Also, we define the sequence {x̄^t}_{t ≥ 0} such that x̄^t is the iterate generated by the M-ASG algorithm at the end of t total gradient steps; i.e., x̄^t = x^t_1 for 0 ≤ t ≤ N_1, and for t > N_1 we set x̄^t = x^{t − N_k}_{k+1}, where k is the stage index such that N_k < t ≤ N_{k+1}.

Remark 3.5.

In the absence of noise, i.e., σ = 0, the result of Theorem 3.4 recovers the linear convergence rate of deterministic gradient methods as a special case. Indeed, running M-ASG for only one stage, i.e., with n₁ equal to the total number of iterations, guarantees a linear decay of the suboptimality over all iterations.
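To make the multistage schedule concrete, the following hypothetical instantiation follows the geometric pattern described in the text (a first stage with stepsize 1/L, then stage lengths roughly doubling and stepsizes shrinking by roughly a factor of four); the exact constants of Theorem 3.4 are not reproduced here, and the variable names are ours:

```python
# Hypothetical instantiation of the geometric schedule discussed in the text.
L_smooth, n1, num_stages = 10.0, 100, 6
stepsizes     = [(1.0 / L_smooth) * 4.0 ** (-(k - 1)) for k in range(1, num_stages + 1)]
stage_lengths = [n1 * 2 ** (k - 1) for k in range(1, num_stages + 1)]
# x_hat = m_asg(grad_oracle, x0, mu, stepsizes, stage_lengths)   # using the sketch above
```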

The next theorem characterizes the behavior of M-ASG after running for N iterations with the parameters in the preceding theorem; its proof is provided in Appendix H.

Theorem 3.6.

Let f ∈ S_{μ,L}(ℝᵈ). Consider running Algorithm 1 for N iterations with the parameters given in Theorem 3.4 and a given first-stage length n₁. Then the error E[f(x̄^N)] − f^* is bounded by

(20)

Using Theorem 3.6, as stated in the following corollary, we can obtain a convergence rate result similar to those of AC-SA Ghadimi and Lan (2012) and ORDA Chen et al. (2012) without assuming any knowledge of σ or the initial optimality gap (see Appendix I for the proof).

Corollary 3.7.

Under the premise of Theorem 3.6, choosing n₁ appropriately, the suboptimality error of M-ASG after N iterations admits the following upper bound:

We continue this section by pointing out a few important special cases of our result. We first show in the next corollary how our algorithm is universally optimal, i.e., capable of achieving the lower bounds (4) and (5) simultaneously.

Corollary 3.8.

Under the premise of Theorem 3.6, consider a computational budget of N iterations. By setting n₁ = cN for some positive constant c ∈ (0, 1), we obtain a bound matching the lower bounds in (4) and (5), i.e.,

Proof.

The proof is straightforward by using (20) together with the above choice of n₁. ∎

Note that the bounds in Theorems 3.4 and 3.6 and in Corollaries 3.7 and 3.8 can be seen as the sum of two separate bias and variance terms.

The lower bound can also be stated as the minimum number of iterations needed to find an ε-solution, i.e., to find x such that E[f(x)] − f^* ≤ ε, for any given ε > 0. In the following corollary, under the additional assumption of knowing a bound Δ on the initial optimality gap, we state this version of the lower bound. The proof is provided in Appendix J.

Corollary 3.9.

Let f ∈ S_{μ,L}(ℝᵈ). Then, for any ε > 0, running M-ASG, displayed in Algorithm 1, with the parameters given in Theorem 3.4 and a first-stage length n₁ chosen using the bound Δ, one can compute an ε-solution within N_ε iterations in total, where

(21)

Recall that we presented a comparison of different versions of M-ASG with other state-of-the-art algorithms in Table 1. In particular, this table shows that Multistage AC-SA Ghadimi and Lan (2013) and Multistage ORDA Chen et al. (2012) also achieve the lower bounds provided that the noise parameters are known – note that we do not make this extra assumption for M-ASG. In the next remark, we compare M-ASG with these two algorithms from another perspective.

Remark 3.10.

In addition to the desirable property that the M-ASG parameters are independent of σ, the M-ASG iteration complexity bound in (21) has a better constant in front of the variance term, which is the dominant term of the bound, when compared to the bounds provided for Multistage AC-SA and Multistage ORDA. In fact, while our constant is less than 50, the constants in Multistage AC-SA and Multistage ORDA are 384 and 1024, respectively.

Finally, we conclude this section by taking a closer look at how M-ASG is related to the AC-SA and Multistage AC-SA algorithms proposed in Ghadimi and Lan (2012, 2013). In Appendix K, for the sake of completeness, we state the AC-SA algorithm and show that it can be cast as the ASG method in (6) with a specific varying stepsize rule. In fact, the AC-SA iterations can be written, for k ≥ 0, as follows:

(22a)
(22b)

As a consequence, Multistage AC-SA is a variant of the M-ASG algorithm that uses a different length for each stage and employs a specific varying stepsize rule, together with a different selection of the momentum parameter, at each stage. That said, the pattern of increase in the stage lengths of Multistage AC-SA is very similar to that of M-ASG. In particular, for Multistage AC-SA, the stage length is given by

which increases almost by a factor of two in every stage for sufficiently large stage index, similar to the stage lengths of M-ASG in Theorem 3.4. Moreover, it can be verified that, for the specific parameter sequences the authors suggest, the maximum stepsize at each stage decreases by almost a factor of four (for a large enough stage index), which is again behavior similar to that of the stepsize parameter of M-ASG.

4 M-ASG*: An improved bias-variance trade-off

In Section 3, we described a universal algorithm that requires knowledge of neither the initial suboptimality gap nor the noise magnitude to operate. However, as we argue in this section, our framework is flexible in the sense that additional information about the magnitude of σ or the initial gap can be leveraged to improve practical performance.

We first note that several algorithms in the literature assume that an upper bound on the initial optimality gap is known or can be estimated, as summarized in Table 1. This assumption is reasonable in a variety of applications where there is a natural lower bound on the optimal value f^*. For example, in supervised learning scenarios such as support vector machines, regression, or logistic regression problems, the loss function has non-negative values Vapnik (2013). Similarly, the noise level may be known or estimated; for instance, in private risk minimization Bassily et al. (2014), the noise is added by the user to ensure privacy and is therefore a known quantity.

There is a natural, well-known trade-off between constant and decaying stepsizes (decaying with the number of iterations) in stochastic gradient algorithms. Since the noise is multiplied by the stepsize, a stepsize that decays with the number of iterations leads to a decay in the variance term; however, this slows down the decay of the bias term, which is controlled essentially by the behavior of the underlying deterministic accelerated gradient (AG) algorithm and is fastest with a constant stepsize (note that when σ = 0, the bias term recovers the known performance bounds for the AG algorithm). The main idea behind the M-ASG algorithm, which allows it to achieve the lower bounds, is to exploit this trade-off and switch to decaying stepsizes at the right time, i.e., when the bias term is sufficiently small so that the variance term dominates and should be handled with decaying stepsizes. This insight is visible from Theorem 3.4, which indicates how the stepsize should be chosen at every stage to achieve the lower bounds. Theorem 3.4 shows that if M-ASG is run with a constant stepsize in the first stage, then the variance term admits a bound that does not decay with the number of iterations in the first stage. However, in later stages, when k ≥ 2, the stepsize is decreased as the number of iterations grows, and this results in a decay of the variance term. Overall, the choice of the length of the first stage, n₁, has a major impact in practice, which we highlight in our numerical experiments.

If an estimate of σ or Δ is known, it is desirable to choose n₁ as small as possible while ensuring that the bias term becomes smaller than the variance term at the end of the first stage. More specifically, applying Theorem 3.1 to the first stage, one can choose n₁ to balance the variance and the bias terms. The bias term at the end of the first stage, as shown in the proof of Lemma 3.3, can be bounded in terms of the initial suboptimality. Therefore, given an estimate Δ of an upper bound on f(x⁰) − f^*, n₁ can be set to the smallest number for which this bias bound does not exceed the variance term, i.e.,

(23)

This choice allows one to fine-tune the switching point for starting the decaying stepsizes within our framework as a function of σ and Δ. In scenarios where the noise level is small or the initial gap is large, n₁ is chosen large enough to guarantee a fast decay in the bias term. We would like to emphasize that this modified algorithm, which we denote by M-ASG*, only requires the knowledge of σ and Δ for selecting n₁; the rest of the parameters can be chosen as in Theorem 3.4, independently of both σ and Δ. Finally, the following theorem provides theoretical guarantees for this choice of n₁ within our framework. The proof is omitted as it is similar to the proofs of Theorems 3.4 and 3.6.
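Since the explicit formula in (23) is not reproduced above, the sketch below only illustrates the balancing rule in words: pick the smallest n₁ for which an estimated end-of-stage bias drops below the variance level. The bound functions and constants here are placeholders, not the paper's expressions:

```python
import numpy as np

def choose_n1(bias_bound, variance_level, n_max=10**6):
    """Smallest first-stage length n1 whose estimated end-of-stage bias does not
    exceed the variance level, mimicking the balancing rule behind (23).
    `bias_bound` and `variance_level` are user-supplied estimates."""
    for n in range(1, n_max + 1):
        if bias_bound(n) <= variance_level:
            return n
    return n_max

# Purely illustrative estimates: bias ~ Delta * (1 - sqrt(mu/L))^n and a
# variance level proportional to sigma^2 / mu (constants are placeholders).
mu, L_smooth, Delta, sigma = 1.0, 100.0, 50.0, 0.1
n1 = choose_n1(lambda n: Delta * (1 - np.sqrt(mu / L_smooth)) ** n,
               sigma ** 2 / mu)
```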

Theorem 4.1.

Let f ∈ S_{μ,L}(ℝᵈ). Consider running Algorithm 1 for N iterations with the parameters given in Theorem 3.4 and with n₁ set as in (23). Then, the expected suboptimality in function values admits the following bound:

In the next section, we present numerical experiments that illustrate the performance of our proposed algorithms and compare them to existing methods from the literature.

5 Numerical Experiments

In this section, we demonstrate the numerical performance of Algorithm 1 with parameters specified by Corollary 3.7 (M-ASG) and by Theorem 4.1 (M-ASG*), and compare them with other methods from the literature.

Figure 1: Comparison of different algorithms for a quadratic function under different levels of noise (first iteration budget).

Figure 2: Comparison of different algorithms for a quadratic function under different levels of noise (second iteration budget).

In our first experiment, we consider the strongly convex quadratic objective f(x) = ½ x⊤(A + λI)x − b⊤x, where A is the Laplacian of a cycle graph, b is a random vector, and λ is a regularization parameter. We assume the gradients are corrupted by additive zero-mean Gaussian noise whose variance is controlled by a parameter σ. We note that this example has been previously considered in the literature as a problem instance where Standard ASG (ASG iterations with the standard choice of parameters α = 1/L and β = (√κ − 1)/(√κ + 1)) performs badly compared to Standard GD (gradient descent with the standard constant stepsize) Hardt (2014).
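A sketch of how such a test problem could be set up (the dimension, regularization value, and noise scale are illustrative choices, not the exact values used in the experiments):

```python
import numpy as np

d, lam, sigma = 100, 0.1, 1.0
rng = np.random.default_rng(0)

# Laplacian of a cycle graph on d nodes: 2 on the diagonal, -1 on the
# neighbouring (and wrap-around) off-diagonals.
Lap = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
Lap[0, -1] = Lap[-1, 0] = -1

b = rng.standard_normal(d)
Q = Lap + lam * np.eye(d)          # quadratic f(x) = 0.5 x^T Q x - b^T x

def noisy_grad(x):
    """Gradient of the quadratic corrupted by Gaussian noise with E||w||^2 = sigma^2."""
    w = rng.normal(scale=sigma / np.sqrt(d), size=d)
    return Q @ x - b + w
```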

Figure 3: Comparison of GD, AGD, AGD+, and M-ASG for logistic regression under different levels of noise (first iteration budget).

Figure 4: Comparison of GD, AGD, AGD+, and M-ASG for logistic regression under different levels of noise (second iteration budget).

In Figures 1 and 2, we compare M-ASG and M-ASG* with Standard GD, Standard AG, AGD+ Cohen et al. (2018), Multistage AC-SA Ghadimi and Lan (2013), and the Flammarion-Bach algorithm proposed in Flammarion and Bach (2015). We consider a fixed problem dimension and initialize all the methods from the same initial point, and we run Multistage AC-SA, the Flammarion-Bach algorithm, and M-ASG* with access to the same parameter estimates. Figures 1-2 show the average performance of all the algorithms over 10 sample runs, for two total iteration budgets, as the noise level is varied. The simulation results reveal that both M-ASG and M-ASG* typically have a faster decay of the error in the beginning and in general outperform the other algorithms when the number of iterations is small to moderate. In this case, the speed-up obtained by M-ASG and M-ASG* is more prominent in the figures when the noise level is smaller. However, as the number of iterations grows, the performance of the algorithms becomes similar as the variance term dominates. In addition, we would like to highlight that when the noise is small, using n₁ as suggested in (23), M-ASG* runs stage one longer than M-ASG and hence enjoys the linear rate of decay for more iterations before the variance term becomes dominant.

For the second set of experiments, we consider a regularized logistic regression problem for binary classification. In particular, we generate a random matrix A and a random vector x̄, and compute b = sign(A x̄), the vector containing the signs of the inner products of the rows of A with x̄. Our goal is to recover x̄ by optimizing a regularized logistic objective when the gradient of the loss function is corrupted with additive Gaussian noise. We compare M-ASG and M-ASG* with Standard GD, Standard AG, AGD+ Cohen et al. (2018), and Multistage AC-SA Ghadimi and Lan (2013). The condition number of this problem is determined by the data matrix and the regularization parameter.
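A sketch of the data generation and noisy-gradient oracle described here (dimensions, regularization, and noise scale are again illustrative):

```python
import numpy as np

n, d, lam, sigma = 500, 50, 0.1, 1.0
rng = np.random.default_rng(1)

A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = np.sign(A @ x_true)            # labels: signs of inner products with rows of A

def noisy_grad(x):
    """Gradient of the l2-regularized logistic loss, plus Gaussian noise."""
    z = np.clip(-b * (A @ x), -30.0, 30.0)      # clipped for numerical stability
    grad = A.T @ (-b / (1.0 + np.exp(-z))) / n + lam * x
    return grad + rng.normal(scale=sigma / np.sqrt(d), size=d)
```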

Figures 3-4 illustrate the behavior of the algorithms for two iteration budgets and the same noise levels as before. It can be seen that both M-ASG and M-ASG* usually start faster and do not perform worse than the other algorithms in the different scenarios; moreover, they outperform the other algorithms when the iteration budget is limited or the noise level is small. Furthermore, note that in the setting where the noise is large, M-ASG* behaves better than M-ASG, as it terminates the first stage earlier, which is helpful because the variance term becomes dominant after only a few iterations of the first stage.

6 Conclusion

In this work, we considered strongly convex smooth optimization problems where we have access to noisy estimates of the gradients. We proposed a multistage method that adapts the parameters of Nesterov’s accelerated gradient method at each stage to achieve the optimal rate. Our method is universal in the sense that it does not require knowledge of the noise characteristics to operate and achieves the optimal rate in both the deterministic and stochastic settings. We provided numerical experiments showing that our algorithm can be faster than existing approaches in practice.

References

Appendix A Proof of Lemma 2.1

Let us denote the asymptotic convergence rate of the ASG method as a function of and by . It is well-known that has the following characterization (see e.g. Lessard et al. [2016], O’Donoghue and Candès [2015]):

(24)

where and is defined as:

(25)

with . Note that, since , we have for ; therefore, if and only if , which is equivalent to .

Using the fact that and is decreasing in , we obtain ; hence, for , we have both and . As a consequence, (24) implies that for , we have

(26)

Moreover, for , the two branches in (25) take the same value for and ; therefore, when is set to this critical value, we also get for . Note (26) is an increasing function of for any ; thus, given , the smallest rate possible is equal to , which is the rate given in the statement of Lemma and it is achieved by .

Now, we consider the case . From (24), if , then we also have . Thus, showing suffices to claim that for any , the best possible rate is and this can be achieved by setting . Indeed, as we discussed above, for the case , we have ; thus,

Therefore, to show , we just need to prove

(27)

Taking the square of both sides of (27), it follows that (27) is equivalent to

and this holds when . Therefore, for any , we have for , which completes the proof.

Appendix B Proof of Lemma 2.2

We first state the following lemma which is an extension of Lemma 4.1 in Aybat et al. [2018] for ASG.

Lemma B.1.

Let where and consider the function . Then we have