Stochastic Newton and Cubic Newton Methods with Simple Local Linear-Quadratic Rates

by Dmitry Kovalev et al.

We present two remarkably simple stochastic second-order methods for minimizing the average of a very large number of sufficiently smooth and strongly convex functions. The first is a stochastic variant of Newton's method (SN), and the second is a stochastic variant of cubically regularized Newton's method (SCN). We establish local linear-quadratic convergence results. Unlike existing stochastic variants of second-order methods, which require the evaluation of a large number of gradients and/or Hessians in each iteration to guarantee convergence, our methods do not have this shortcoming. For instance, the simplest variants of our methods in each iteration need to compute the gradient and Hessian of a single randomly selected function only. In contrast to most existing stochastic Newton and quasi-Newton methods, our approach guarantees local convergence faster than what is achievable with a first-order oracle, and adapts to the problem's curvature. Interestingly, our methods are not unbiased, so our theory provides new intuition for designing new stochastic methods.





1 Introduction

The problem of empirical risk minimization (ERM) is ubiquitous in machine learning and data science. Advances of the last few years Roux et al. (2012); Defazio et al. (2014a); Johnson and Zhang (2013); Shalev-Shwartz and Zhang (2013) have shown that the availability of a finite-sum structure and the ability to use extra memory allow for methods that solve a wide range of ERM problems Nguyen et al. (2017a, b); Ryu and Yin (2017); Allen-Zhu (2017); Mishchenko and Richtárik (2019); Kovalev et al. (2019). In particular, the Majorization-Minimization (MM) approach Lange et al. (2000), also known as successive upper-bound minimization Razaviyayn et al. (2016), has gained a lot of attention in first-order Mairal (2013); Defazio et al. (2014b); Bietti and Mairal (2017); Mokhtari et al. (2018b); Qian et al. (2019), second-order Mokhtari et al. (2018a), proximal Lin et al. (2015); Ryu and Yin (2017) and other settings Karimi and Moulines (2018); Mishchenko et al. (2018).

The aforementioned finite-sum structure is given by the minimization formulation

$$\min_{x\in\mathbb{R}^d}\ \left\{ f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) \right\},\qquad (1)$$

where each function $f_i$ is assumed to have a Lipschitz Hessian.

Since in practical applications $n$ is often very large, this problem is typically solved using stochastic first-order methods, such as Stochastic Gradient Descent (SGD) Robbins and Monro (1951), which have cheap iterations whose cost is independent of $n$. However, constant-stepsize SGD provides fast convergence to a neighborhood of the solution only Nguyen et al. (2018); Gower et al. (2019), whose radius is proportional to the variance of the stochastic gradient at the optimum. So-called variance-reduced methods resolve this issue Roux et al. (2012); Johnson and Zhang (2013); Gorbunov et al. (2019); Hanzely and Richtárik (2019), but their iteration complexity depends significantly on the condition number of the problem, which is equal to the ratio of the smoothness and strong convexity parameters.

Second-order methods, in contrast, adapt to the curvature of the problem and thereby decrease their dependence on the condition number Nesterov (2004). Among the most well-known second-order methods are Newton's method (see e.g. Nesterov (2004); Karimireddy et al. (2018); Gower et al. (2019)), a variant thereof with cubic regularization Nesterov and Polyak (2006), and BFGS Liu and Nocedal (1989); Avriel (2003). The vanilla deterministic Newton method uses a simple gradient update preconditioned by the inverse Hessian evaluated at the same iterate, i.e.,

$$x^{k+1} = x^k - \left(\nabla^2 f(x^k)\right)^{-1}\nabla f(x^k).$$

This can be seen as approximating the function $f$ locally around $x^k$ as¹

$$f(x) \approx f(x^k) + \langle\nabla f(x^k), x - x^k\rangle + \frac{1}{2}\left\|x - x^k\right\|^2_{\nabla^2 f(x^k)},\qquad (2)$$

and subsequently minimizing this approximation in $x$. However, this approximation is not necessarily an upper bound on $f$, which makes it hard to establish global rates for the method. Indeed, unless extra assumptions Karimireddy et al. (2018); Gower et al. (2019) or regularizations Nesterov and Polyak (2006); Doikov and Richtárik (2018); Doikov and Nesterov (2019) are employed, the method can diverge. Nevertheless, second-order methods are famous for their fast (superlinear or quadratic) local convergence rates.

¹ For $y\in\mathbb{R}^d$ and a positive definite matrix $\mathbf{M}\in\mathbb{R}^{d\times d}$, we let $\|y\|_{\mathbf{M}} := \sqrt{\langle\mathbf{M}y, y\rangle}$.
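To make the update concrete, here is a minimal one-dimensional sketch of the deterministic Newton step (our own illustration; the test function $f(x) = \cosh(x)$ and the parameter choices are ours, not from the paper):

```python
import math

def newton_1d(grad, hess, x0, n_iters=20):
    """Vanilla Newton's method in one dimension: the gradient is
    preconditioned by the inverse Hessian evaluated at the same point."""
    x = x0
    for _ in range(n_iters):
        x -= grad(x) / hess(x)
    return x

# f(x) = cosh(x) is strongly convex (f'' = cosh >= 1) with minimizer x* = 0
x_newton = newton_1d(math.sinh, math.cosh, x0=0.5)
```

Starting close enough to the optimum, the error is squared at every step, which is the local quadratic rate referred to above.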

Unfortunately, the fast convergence rates of Newton and cubically regularized Newton methods do not directly translate to the stochastic case. For instance, the quasi-Newton methods in Luo et al. (2016); Moritz et al. (2016); Gower et al. (2016) have complexity with a worse dependence on the condition number than that of Johnson and Zhang (2013), thus giving a weaker theoretical guarantee. Other theoretical gaps include inconsistent assumptions, such as bounded gradients together with strong convexity Berahas et al. (2016).

Surprisingly, a lot of effort Kohler and Lucchi (2017); Bollapragada et al. (2018); Zhou et al. (2018); Zhang et al. (2018); Tripuraneni et al. (2018); Zhou and Gu (2019) has been put into developing methods with massive (albeit sometimes still independent of $n$) batch sizes, typically proportional to $1/\varepsilon$, where $\varepsilon$ is the target accuracy. This essentially removes all randomness from the analysis by making the variance infinitesimal. In fact, these batch sizes are likely to exceed $n$, so one ends up using mere deterministic algorithms. Furthermore, to make all estimates unbiased, large batch sizes are combined with preconditioning sampled gradients using the Hessians of some other functions. We naturally expect that these techniques would not work in practice when only a single sample is used, since the Hessian of a randomly sampled function might be significantly different from the Hessians of other samples.

Unlike all these methods, ours will work even with batch size equal to one.

Thus, it is a major contribution of our work that we give up on taking unbiased estimates and show convergence despite the biased nature of our methods. This is achieved by developing new Lyapunov functions that are specific to second-order methods.

We are aware of two methods Rodomanov and Kropotov (2016); Mokhtari et al. (2018a) that do not suffer from the issues above. Nevertheless, their updates are cyclic rather than stochastic, which is known to be slower in the first-order setting due to a possibly bad choice of the iteration order Safran and Shamir (2019).

1.1 Basic definitions and notation

Convergence of our algorithms will be established under certain regularity assumptions on the functions in (1). In particular, we will assume that all functions are twice differentiable, have a Lipschitz Hessian, and are strongly convex. These concepts are defined next. Let $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ be the standard Euclidean inner product and norm in $\mathbb{R}^d$, respectively.

Definition 1 (Strong convexity).

A differentiable function $f:\mathbb{R}^d\to\mathbb{R}$ is $\mu$-strongly convex, where $\mu > 0$ is the strong convexity parameter, if for all $x, y \in \mathbb{R}^d$

$$f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2.$$
For twice differentiable functions, strong convexity is equivalent to requiring the smallest eigenvalue of the Hessian to be uniformly bounded above 0, i.e., $\nabla^2 f(x) \succeq \mu\mathbf{I}$ for all $x\in\mathbb{R}^d$, where $\mathbf{I}$ is the identity matrix, or equivalently $\lambda_{\min}(\nabla^2 f(x)) \ge \mu$ for all $x\in\mathbb{R}^d$. Given a twice differentiable strongly convex function $f$, we let $\|x\|_{\nabla^2 f(y)} := \sqrt{\langle\nabla^2 f(y)\,x, x\rangle}$ be the norm induced by the matrix $\nabla^2 f(y)$.

Definition 2 (Lipschitz Hessian).

A twice differentiable function $f$ has $H$-Lipschitz Hessian if for all $x, y \in \mathbb{R}^d$

$$\left\|\nabla^2 f(x) - \nabla^2 f(y)\right\| \le H\|x - y\|.$$

Recall (e.g., see Lemma 1 in Nesterov and Polyak (2006)) that a function with $H$-Lipschitz Hessian admits for any $x, y \in \mathbb{R}^d$ the following estimates:

$$\left\|\nabla f(y) - \nabla f(x) - \nabla^2 f(x)(y - x)\right\| \le \frac{H}{2}\|y - x\|^2,$$

$$\left|f(y) - f(x) - \langle\nabla f(x), y - x\rangle - \frac{1}{2}\langle\nabla^2 f(x)(y - x), y - x\rangle\right| \le \frac{H}{6}\|y - x\|^3.$$

Note that in the context of second-order methods it is natural to work with functions with a Lipschitz Hessian. Indeed, this assumption is standard even for the deterministic Newton Nesterov (2004) and cubically regularized Newton methods Nesterov and Polyak (2006).

2 Stochastic Newton method

In the big data case, i.e., when $n$ is very large, it is not efficient to evaluate the gradient, let alone the Hessian, of $f$ at each iteration. Instead, there is a desire to develop an incremental method which in each step resorts to computing the gradient and Hessian of a single function only. The challenge here is to design a scheme capable of using this stochastic first- and second-order information to progressively build, throughout the iterations, more accurate approximations of the Newton step, so as to ensure a global convergence rate. While establishing any meaningful global rate is challenging enough, a fast linear rate would be desirable.

In this paper we propose to approximate the gradient and the Hessian of $f$ at $x^k$ using the latest information available, i.e.,

$$\nabla f(x^k) \approx \frac{1}{n}\sum_{i=1}^n \nabla f_i(w_i^k), \qquad \nabla^2 f(x^k) \approx \frac{1}{n}\sum_{i=1}^n \nabla^2 f_i(w_i^k),$$

where $w_i^k$ is the last vector for which the gradient $\nabla f_i$ and Hessian $\nabla^2 f_i$ were computed. We then use these approximations to take a Newton-type step. Note, however, that $\nabla f_i(w_i^k)$ and $\nabla^2 f_i(w_i^k)$ serve as good approximations only when used together, so that we precondition gradients with the Hessians evaluated at the same iterates. Although the true gradient, $\nabla f(x^k)$, gives precise information about the linear approximation of $f$ at $x^k$, preconditioning it with the approximation $\frac{1}{n}\sum_{i=1}^n \nabla^2 f_i(w_i^k)$ would not make much sense. Our method is summarized in Algorithm 1.

Initialize: Choose starting iterates $w_1^0, \dots, w_n^0 \in \mathbb{R}^d$ and minibatch size $\tau \in \{1, \dots, n\}$
for $k = 0, 1, 2, \dots$ do
     $x^{k+1} = \left[\frac{1}{n}\sum_{i=1}^n \nabla^2 f_i(w_i^k)\right]^{-1}\left[\frac{1}{n}\sum_{i=1}^n\left(\nabla^2 f_i(w_i^k)\, w_i^k - \nabla f_i(w_i^k)\right)\right]$
     Choose a subset $S^k \subseteq \{1, \dots, n\}$ of size $\tau$ uniformly at random
     $w_i^{k+1} = x^{k+1}$ for $i \in S^k$ and $w_i^{k+1} = w_i^k$ for $i \notin S^k$
end for
Algorithm 1 Stochastic Newton (SN)
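To make Algorithm 1 concrete, the following is a minimal one-dimensional sketch (our own illustration: the functions $f_i(x) = \cosh(x - m_i)$, the seed and the iteration budget are arbitrary choices, not from the paper). In one dimension gradients and Hessians are scalars, so the Newton-type step and the lazy snapshot updates are easy to read off:

```python
import math
import random

def stochastic_newton_1d(grads, hesss, x0, n_iters=200, tau=1, seed=0):
    """1-d sketch of SN: keep per-function snapshots w_i together with the
    gradients g_i and Hessians h_i computed at those same snapshots."""
    rng = random.Random(seed)
    n = len(grads)
    w = [x0] * n                        # snapshot points w_i^k
    g = [grads[i](x0) for i in range(n)]
    h = [hesss[i](x0) for i in range(n)]
    x = x0
    for _ in range(n_iters):
        # Newton-type step: x^{k+1} = [sum_i h_i]^{-1} sum_i (h_i w_i - g_i)
        x = sum(h[i] * w[i] - g[i] for i in range(n)) / sum(h)
        # refresh a random minibatch S^k of size tau
        for i in rng.sample(range(n), tau):
            w[i], g[i], h[i] = x, grads[i](x), hesss[i](x)
    return x

# f_i(x) = cosh(x - m_i): 1-strongly convex with locally Lipschitz Hessian;
# by symmetry of the centers, the average is minimized at x* = 0
ms = [-1.0, 0.0, 1.0]
grads = [lambda x, m=m: math.sinh(x - m) for m in ms]
hesss = [lambda x, m=m: math.cosh(x - m) for m in ms]
x_sn = stochastic_newton_1d(grads, hesss, x0=0.5)
```

With $\tau = 1$, each iteration touches the gradient and Hessian of a single function only, yet the iterates still converge to the minimizer.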

2.1 Local convergence analysis

We now state a simple local linear convergence rate for Algorithm 1. By $\mathbb{E}_k$ we denote the expectation conditioned on all information available at iteration $k$.

Theorem 1.

Assume that every $f_i$ is $\mu$-strongly convex and has $H$-Lipschitz Hessian, and consider the following Lyapunov function:

$$\mathcal{W}^k := \frac{1}{n}\sum_{i=1}^n \left\|w_i^k - x^*\right\|^2.$$

Then for the random iterates of Algorithm 1 we have the recursion

$$\mathbb{E}_k\left[\mathcal{W}^{k+1}\right] \le \left(1 - \frac{\tau}{n}\right)\mathcal{W}^k + \frac{\tau}{n}\left(\frac{H}{2\mu}\right)^2\left(\mathcal{W}^k\right)^2.$$

Furthermore, if $\|w_i^0 - x^*\| \le \frac{\mu}{H}$ for $i = 1, \dots, n$, then

$$\mathbb{E}_k\left[\mathcal{W}^{k+1}\right] \le \left(1 - \frac{3\tau}{4n}\right)\mathcal{W}^k.$$
Clearly, when $\tau = n$, we achieve the standard superlinear (quadratic) rate of Newton's method. When $\tau < n$, we obtain a linear convergence rate that is independent of the conditioning; thus, the method provably adapts to the curvature.
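To see where the linear factor comes from, write $\mathcal{W}^k := \frac{1}{n}\sum_{i=1}^n \|w_i^k - x^*\|^2$ and note that each $w_i^k$ is replaced by $x^{k+1}$ with probability $\tau/n$. A short check (the constants below follow our reconstruction of the theorem and should be read as such):

```latex
\mathbb{E}_k\!\left[\mathcal{W}^{k+1}\right]
  = \left(1 - \tfrac{\tau}{n}\right)\mathcal{W}^k
    + \tfrac{\tau}{n}\,\bigl\|x^{k+1} - x^*\bigr\|^2
  \le \left(1 - \tfrac{\tau}{n}\right)\mathcal{W}^k
    + \tfrac{\tau}{n}\left(\tfrac{H}{2\mu}\right)^{2}\left(\mathcal{W}^k\right)^2 ,
\qquad\text{and}\qquad
\mathcal{W}^k \le \left(\tfrac{\mu}{H}\right)^{2}
\;\Longrightarrow\;
\left(\tfrac{H}{2\mu}\right)^{2}\left(\mathcal{W}^k\right)^2 \le \tfrac{1}{4}\,\mathcal{W}^k
\;\Longrightarrow\;
\mathbb{E}_k\!\left[\mathcal{W}^{k+1}\right] \le \left(1 - \tfrac{3\tau}{4n}\right)\mathcal{W}^k .
```

Hence the quadratic term only accelerates the contraction once the snapshots are inside the local region.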

3 Stochastic Newton method with cubic regularization

Motivated by the desire to develop a new principled variant of Newton's method that could enjoy global convergence guarantees, Nesterov and Polyak Nesterov and Polyak (2006) proposed the following "cubically regularized" second-order approximation of $f$:

$$f(x) \approx f(x^k) + \langle\nabla f(x^k), x - x^k\rangle + \frac{1}{2}\langle\nabla^2 f(x^k)(x - x^k), x - x^k\rangle + \frac{M}{6}\left\|x - x^k\right\|^3.$$

In contrast to (2), the above is a global upper bound on $f$ provided that $M \ge H$. This fact is then used to establish global rates.

3.1 New method

In this section we combine cubic regularization with the stochastic Newton technique developed in the previous section. Let us define the following second-order Taylor approximation of $f_i$ around $w$:

$$T_i(x; w) := f_i(w) + \langle\nabla f_i(w), x - w\rangle + \frac{1}{2}\langle\nabla^2 f_i(w)(x - w), x - w\rangle.$$

Having introduced $T_i$, we are now ready to present our Stochastic Cubic Newton method (Algorithm 2).

Initialize: Choose starting iterates $w_1^0, \dots, w_n^0 \in \mathbb{R}^d$, minibatch size $\tau \in \{1, \dots, n\}$ and regularization parameter $M \ge 0$
for $k = 0, 1, 2, \dots$ do
     $x^{k+1} = \arg\min_{x\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n\left[f_i(w_i^k) + \langle\nabla f_i(w_i^k), x - w_i^k\rangle + \frac{1}{2}\langle\nabla^2 f_i(w_i^k)(x - w_i^k), x - w_i^k\rangle + \frac{M}{6}\left\|x - w_i^k\right\|^3\right]$
     Choose a subset $S^k \subseteq \{1, \dots, n\}$ of size $\tau$ uniformly at random
     $w_i^{k+1} = x^{k+1}$ for $i \in S^k$ and $w_i^{k+1} = w_i^k$ for $i \notin S^k$
end for
Algorithm 2 Stochastic Cubic Newton (SCN)
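In general the cubic subproblem in Algorithm 2 has no closed-form solution; in one dimension, however, the model is convex with a strictly increasing derivative, so bisection on the derivative solves it. Below is a minimal sketch (our own illustration: the functions $f_i(x) = \cosh(x - m_i)$, the value of $M$, the bracket width and the seed are arbitrary choices, not from the paper):

```python
import math
import random

def scn_1d(grads, hesss, x0, M=8.0, n_iters=300, tau=1, seed=0):
    """1-d sketch of SCN. The model derivative
        d(x) = (1/n) sum_i [g_i + h_i (x - w_i) + (M/2) |x - w_i| (x - w_i)]
    is strictly increasing, so its unique root (the model minimizer)
    is found by bisection."""
    rng = random.Random(seed)
    n = len(grads)
    w = [x0] * n
    g = [grads[i](x0) for i in range(n)]
    h = [hesss[i](x0) for i in range(n)]

    def model_deriv(x):
        return sum(g[i] + h[i] * (x - w[i])
                   + 0.5 * M * abs(x - w[i]) * (x - w[i])
                   for i in range(n)) / n

    x = x0
    for _ in range(n_iters):
        lo, hi = x - 10.0, x + 10.0   # bracket for the monotone derivative's root
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if model_deriv(mid) > 0.0:
                hi = mid
            else:
                lo = mid
        x = 0.5 * (lo + hi)           # x^{k+1}: minimizer of the cubic model
        for i in rng.sample(range(n), tau):
            w[i], g[i], h[i] = x, grads[i](x), hesss[i](x)
    return x

ms = [-1.0, 0.0, 1.0]                 # f_i(x) = cosh(x - m_i), minimizer x* = 0
grads = [lambda x, m=m: math.sinh(x - m) for m in ms]
hesss = [lambda x, m=m: math.cosh(x - m) for m in ms]
x_scn = scn_1d(grads, hesss, x0=0.5)
```

Even with $\tau = 1$, each outer iteration only needs the gradient and Hessian of a single sampled function, and the iterates converge to the minimizer.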

3.2 Local convergence analysis

Our main theorem gives a bound on the expected improvement of a new Lyapunov function $\Phi^k$, defined below.

Theorem 2.

Let $f$ be $\mu$-strongly convex, assume every $f_i$ has $H$-Lipschitz Hessian, let $M \ge H$, and consider the following Lyapunov function:

$$\Phi^k := \frac{1}{n}\sum_{i=1}^n\left(f(w_i^k) - f(x^*)\right).$$

Then for the random iterates of Algorithm 2 we have the recursion

$$\mathbb{E}_k\left[\Phi^{k+1}\right] \le \left(1 - \frac{\tau}{n}\right)\Phi^k + \frac{\tau}{n}\cdot\frac{H + M}{6}\left(\frac{2}{\mu}\right)^{3/2}\frac{1}{n}\sum_{i=1}^n\left(f(w_i^k) - f(x^*)\right)^{3/2}.$$

Furthermore, if $f(w_i^0) - f(x^*) \le \frac{9\mu^3}{8(H + M)^2}$ for $i = 1, \dots, n$, then

$$\mathbb{E}_k\left[\Phi^{k+1}\right] \le \left(1 - \frac{\tau}{2n}\right)\Phi^k.$$

Again, with $\tau = n$ we recover the local superlinear convergence of the cubic Newton method. In addition, this rate is locally independent of the problem conditioning, showing that we indeed benefit from second-order information.


  • [1] Z. Allen-Zhu (2017) Katyusha: the first direct acceleration of stochastic gradient methods. The Journal of Machine Learning Research 18 (1), pp. 8194–8244. Cited by: §1.
  • [2] M. Avriel (2003) Nonlinear programming: analysis and methods. Courier Corporation. Cited by: §1.
  • [3] A. S. Berahas, J. Nocedal, and M. Takáč (2016) A multi-batch L-BFGS method for machine learning. In Advances in Neural Information Processing Systems, pp. 1055–1063. Cited by: §1.
  • [4] A. Bietti and J. Mairal (2017) Stochastic optimization with variance reduction for infinite datasets with finite sum structure. In Advances in Neural Information Processing Systems, pp. 1623–1633. Cited by: §1.
  • [5] R. Bollapragada, R. H. Byrd, and J. Nocedal (2018) Exact and inexact subsampled Newton methods for optimization. IMA Journal of Numerical Analysis 39 (2), pp. 545–578. Cited by: §1.
  • [6] A. Defazio, F. Bach, and S. Lacoste-Julien (2014) SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, Cited by: §1.
  • [7] A. Defazio, T. Caetano, and J. Domke (2014) Finito: a faster, permutable incremental gradient method for big data problems. In International Conference on Machine Learning, pp. 1125–1133. Cited by: §1.
  • [8] N. Doikov and Y. Nesterov (2019) Minimizing uniformly convex functions by cubic regularization of Newton method. arXiv preprint arXiv:1905.02671. Cited by: §1.
  • [9] N. Doikov and P. Richtárik (2018) Randomized block cubic Newton method. In Proceedings of the 35th International Conference on Machine Learning, Cited by: §1.
  • [10] E. Gorbunov, F. Hanzely, and P. Richtárik (2019) A unified theory of SGD: variance reduction, sampling, quantization and coordinate descent. arXiv preprint arXiv:1905.11261. Cited by: §1.
  • [11] R. Gower, D. Goldfarb, and P. Richtárik (2016) Stochastic block BFGS: squeezing more curvature out of data. In International Conference on Machine Learning, pp. 1869–1878. Cited by: §1.
  • [12] R. M. Gower, D. Kovalev, F. Lieder, and P. Richtárik (2019) RSN: randomized subspace Newton. In Advances in Neural Information Processing Systems, Cited by: §1.
  • [13] R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik (2019-09–15 Jun) SGD: general analysis and improved rates. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 5200–5209. External Links: Link Cited by: §1.
  • [14] F. Hanzely and P. Richtárik (2019) One method to rule them all: variance reduction for data, parameters and many new methods. arXiv preprint arXiv:1905.11266. Cited by: §1.
  • [15] R. Johnson and T. Zhang (2013) Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pp. 315–323. Cited by: Appendix D, §1, §1, §1.
  • [16] B. Karimi and E. Moulines (2018) MISSO: minimization by incremental stochastic surrogate for large-scale nonconvex optimization. Cited by: §1.
  • [17] S. P. Karimireddy, S. U. Stich, and M. Jaggi (2018) Global linear convergence of Newton’s method without strong-convexity or Lipschitz gradients. arXiv preprint arXiv:1806.00413. Cited by: §1.
  • [18] J. M. Kohler and A. Lucchi (2017) Sub-sampled cubic regularization for non-convex optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1895–1904. Cited by: §1.
  • [19] D. Kovalev, S. Horváth, and P. Richtárik (2019) Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. arXiv preprint arXiv:1901.08689. Cited by: §1.
  • [20] K. Lange, D. R. Hunter, and I. Yang (2000) Optimization transfer using surrogate objective functions. Journal of computational and graphical statistics 9 (1), pp. 1–20. Cited by: §1.
  • [21] H. Lin, J. Mairal, and Z. Harchaoui (2015) A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pp. 3384–3392. Cited by: §1.
  • [22] D. C. Liu and J. Nocedal (1989) On the limited memory BFGS method for large scale optimization. Mathematical programming 45 (1-3), pp. 503–528. Cited by: §1.
  • [23] L. Luo, Z. Chen, Z. Zhang, and W. Li (2016) A proximal stochastic quasi-Newton algorithm. arXiv preprint arXiv:1602.00223. Cited by: §1.
  • [24] J. Mairal (2013) Optimization with first-order surrogate functions. In International Conference on Machine Learning, pp. 783–791. Cited by: §1.
  • [25] K. Mishchenko, F. Iutzeler, J. Malick, and M. Amini (2018) A delay-tolerant proximal-gradient algorithm for distributed learning. In International Conference on Machine Learning, pp. 3584–3592. Cited by: §1.
  • [26] K. Mishchenko and P. Richtárik (2019) A stochastic decoupling method for minimizing the sum of smooth and non-smooth functions. arXiv preprint arXiv:1905.11535. Cited by: Appendix D, Appendix E, §1.
  • [27] A. Mokhtari, M. Eisen, and A. Ribeiro (2018) IQN: an incremental quasi-Newton method with local superlinear convergence rate. SIAM Journal on Optimization 28 (2), pp. 1670–1698. Cited by: §1, §1.
  • [28] A. Mokhtari, M. Gürbüzbalaban, and A. Ribeiro (2018) Surpassing gradient descent provably: a cyclic incremental method with linear convergence rate. SIAM Journal on Optimization 28 (2), pp. 1420–1447. Cited by: §1.
  • [29] P. Moritz, R. Nishihara, and M. Jordan (2016) A linearly-convergent stochastic L-BFGS algorithm. In Artificial Intelligence and Statistics, pp. 249–258. Cited by: §1.
  • [30] Y. Nesterov and B. T. Polyak (2006) Cubic regularization of Newton method and its global performance. Mathematical Programming 108 (1), pp. 177–205. Cited by: §1.1, §1.1, §1, §3.
  • [31] Y. Nesterov (2004) Introductory lectures on convex optimization: a basic course (applied optimization). Kluwer Academic Publishers. Cited by: §1.1, §1.
  • [32] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč (2017) SARAH: a novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2613–2621. Cited by: §1.
  • [33] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč (2017) Stochastic recursive gradient algorithm for nonconvex optimization. arXiv preprint arXiv:1705.07261. Cited by: §1.
  • [34] L. Nguyen, P. H. Nguyen, M. van Dijk, P. Richtárik, K. Scheinberg, and M. Takáč (2018-10–15 Jul) SGD and Hogwild! Convergence without the bounded gradients assumption. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 3750–3758. Cited by: §1.
  • [35] X. Qian, A. Sailanbayev, K. Mishchenko, and P. Richtárik (2019) MISO is making a comeback with better proofs and rates. arXiv preprint arXiv:1906.01474. Cited by: Appendix E, §1.
  • [36] M. Razaviyayn, M. Sanjabi, and Z. Luo (2016) A stochastic successive minimization method for nonsmooth nonconvex optimization with applications to transceiver design in wireless communication networks. Mathematical Programming 157 (2), pp. 515–545. Cited by: §1.
  • [37] H. Robbins and S. Monro (1951) A stochastic approximation method. Annals of Mathematical Statistics 22, pp. 400–407. Cited by: §1.
  • [38] A. Rodomanov and D. Kropotov (2016) A superlinearly-convergent proximal Newton-type method for the optimization of finite sums. In International Conference on Machine Learning, pp. 2597–2605. Cited by: Appendix E, §1.
  • [39] N. L. Roux, M. Schmidt, and F. R. Bach (2012) A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in neural information processing systems, pp. 2663–2671. Cited by: §1, §1.
  • [40] E. K. Ryu and W. Yin (2017) Proximal-proximal-gradient method. arXiv preprint arXiv:1708.06908. Cited by: §1.
  • [41] I. Safran and O. Shamir (2019) How good is SGD with random shuffling?. arXiv preprint arXiv:1908.00045. Cited by: §1.
  • [42] S. Shalev-Shwartz and T. Zhang (2013) Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research 14 (1), pp. 567–599. Cited by: §1.
  • [43] N. Tripuraneni, M. Stern, C. Jin, J. Regier, and M. I. Jordan (2018) Stochastic cubic regularization for fast nonconvex optimization. In Advances in neural information processing systems, pp. 2899–2908. Cited by: §1.
  • [44] J. Zhang, L. Xiao, and S. Zhang (2018) Adaptive stochastic variance reduction for subsampled Newton method with cubic regularization. arXiv preprint arXiv:1811.11637. Cited by: §1.
  • [45] D. Zhou and Q. Gu (2019) Stochastic recursive variance-reduced cubic regularization methods. arXiv preprint arXiv:1901.11518. Cited by: §1.
  • [46] D. Zhou, P. Xu, and Q. Gu (2018) Stochastic variance-reduced cubic regularized Newton method. In International Conference on Machine Learning, pp. 5985–5994. Cited by: §1.

Appendix A Proofs for Algorithm 1

A.1 Three lemmas

Lemma 1.

Let $f_i$ be $\mu$-strongly convex and have $H$-Lipschitz Hessian for all $i$. Then the iterates of Algorithm 1 satisfy

$$\left\|x^{k+1} - x^*\right\| \le \frac{H}{2\mu}\cdot\frac{1}{n}\sum_{i=1}^n\left\|w_i^k - x^*\right\|^2.$$

Proof. Letting $\mathbf{H}^k := \frac{1}{n}\sum_{i=1}^n \nabla^2 f_i(w_i^k)$, one step of the method can be written in the form

$$x^{k+1} = \left(\mathbf{H}^k\right)^{-1}\frac{1}{n}\sum_{i=1}^n\left[\nabla^2 f_i(w_i^k)\, w_i^k - \nabla f_i(w_i^k)\right].$$

Subtracting $x^*$ from both sides, and using the fact that $\frac{1}{n}\sum_{i=1}^n \nabla f_i(x^*) = 0$, this further leads to

$$x^{k+1} - x^* = \left(\mathbf{H}^k\right)^{-1}\frac{1}{n}\sum_{i=1}^n\left[\nabla^2 f_i(w_i^k)\left(w_i^k - x^*\right) - \left(\nabla f_i(w_i^k) - \nabla f_i(x^*)\right)\right].$$

Next, note that since each $f_i$ is $\mu$-strongly convex, we have $\nabla^2 f_i(w_i^k) \succeq \mu\mathbf{I}$ for all $i$. It implies that $\mathbf{H}^k \succeq \mu\mathbf{I}$, which in turn gives $\left\|\left(\mathbf{H}^k\right)^{-1}\right\| \le \frac{1}{\mu}$.

The rest of the proof follows similar reasoning as in the standard convergence proof of Newton's method, with only small modifications: applying the Lipschitz-Hessian estimate $\left\|\nabla f_i(w_i^k) - \nabla f_i(x^*) - \nabla^2 f_i(w_i^k)(w_i^k - x^*)\right\| \le \frac{H}{2}\left\|w_i^k - x^*\right\|^2$ to each term of the sum yields

$$\left\|x^{k+1} - x^*\right\| \le \frac{1}{\mu}\cdot\frac{1}{n}\sum_{i=1}^n\frac{H}{2}\left\|w_i^k - x^*\right\|^2 = \frac{H}{2\mu}\cdot\frac{1}{n}\sum_{i=1}^n\left\|w_i^k - x^*\right\|^2. \qquad\blacksquare$$

Lemma 2.

Assume that each $f_i$ is $\mu$-strongly convex and has $H$-Lipschitz Hessian. If $\|w_i^0 - x^*\| \le \frac{\mu}{H}$ for $i = 1, \dots, n$, then for all $k \ge 0$ and all $i$, with probability 1,

$$\left\|w_i^k - x^*\right\| \le \frac{\mu}{H} \qquad\text{and}\qquad \left\|x^{k+1} - x^*\right\| \le \frac{\mu}{2H}.$$

Proof. We claim that for all $k \ge 0$ and all $i$,

$$\left\|w_i^k - x^*\right\| \le \frac{\mu}{H}.\qquad (13)$$

The upper bound on $\|x^{k+1} - x^*\|$ then follows immediately from Lemma 1, since $\frac{H}{2\mu}\cdot\frac{1}{n}\sum_{i=1}^n\|w_i^k - x^*\|^2 \le \frac{H}{2\mu}\left(\frac{\mu}{H}\right)^2 = \frac{\mu}{2H}$.

We will now prove the claim by induction in $k$. The statement (13) holds for $k = 0$ by our assumption. Assume it holds for $k$ and let us show that it holds for $k + 1$. If $i \notin S^k$, $w_i$ is not updated and inequality (13) holds for $k + 1$ by the induction assumption. On the other hand, for every $i \in S^k$, we have

$$\left\|w_i^{k+1} - x^*\right\| = \left\|x^{k+1} - x^*\right\| \le \frac{\mu}{2H} \le \frac{\mu}{H}.$$

Thus, again, inequality (13) holds for $k + 1$. $\blacksquare$

Lemma 3.

The random iterates of Algorithms 1 and 2 satisfy the identity

$$\mathbb{E}_k\left[\frac{1}{n}\sum_{i=1}^n\left\|w_i^{k+1} - x^*\right\|^2\right] = \left(1 - \frac{\tau}{n}\right)\frac{1}{n}\sum_{i=1}^n\left\|w_i^k - x^*\right\|^2 + \frac{\tau}{n}\left\|x^{k+1} - x^*\right\|^2.$$

Proof. For each $i$, $w_i^{k+1}$ is equal to $x^{k+1}$ with probability $\frac{\tau}{n}$ and to $w_i^k$ with probability $1 - \frac{\tau}{n}$, hence the result. ∎

A.2 Proof of Theorem 1


Using Lemma 1 and Lemma 3, and writing $\mathcal{W}^k := \frac{1}{n}\sum_{i=1}^n\|w_i^k - x^*\|^2$, we obtain

$$\mathbb{E}_k\left[\mathcal{W}^{k+1}\right] = \left(1 - \frac{\tau}{n}\right)\mathcal{W}^k + \frac{\tau}{n}\left\|x^{k+1} - x^*\right\|^2 \le \left(1 - \frac{\tau}{n}\right)\mathcal{W}^k + \frac{\tau}{n}\left(\frac{H}{2\mu}\right)^2\left(\mathcal{W}^k\right)^2 \le \left(1 - \frac{3\tau}{4n}\right)\mathcal{W}^k,$$

where in the last step we have also used Lemma 2 and the assumption $\|w_i^0 - x^*\| \le \frac{\mu}{H}$ for $i = 1, \dots, n$. ∎

Appendix B Proofs for Algorithm 2

B.1 Four lemmas

Lemma 4.

Let each $f_i$ have $H$-Lipschitz Hessian and choose $M \ge H$. Then,

$$f(x^{k+1}) \le f(x^*) + \frac{H + M}{6}\cdot\frac{1}{n}\sum_{i=1}^n\left\|w_i^k - x^*\right\|^3.$$

Proof. We begin by establishing an upper bound for $f(x^{k+1})$: since $M \ge H$, each $f_i$ is upper bounded by its cubically regularized model, so for all $x \in \mathbb{R}^d$

$$f(x^{k+1}) \le \frac{1}{n}\sum_{i=1}^n\left[T_i(x^{k+1}; w_i^k) + \frac{M}{6}\left\|x^{k+1} - w_i^k\right\|^3\right] \le \frac{1}{n}\sum_{i=1}^n\left[T_i(x; w_i^k) + \frac{M}{6}\left\|x - w_i^k\right\|^3\right],\qquad (16)$$

where $T_i(x; w) := f_i(w) + \langle\nabla f_i(w), x - w\rangle + \frac{1}{2}\langle\nabla^2 f_i(w)(x - w), x - w\rangle$ and the last inequality follows from the definition of $x^{k+1}$ as the minimizer of the model. Evaluating both sides at $x = x^*$ readily gives a bound in terms of $T_i(x^*; w_i^k)$. We can further upper bound every $T_i(x^*; w_i^k)$ using the Lipschitz-Hessian estimate to obtain

$$T_i(x^*; w_i^k) \le f_i(x^*) + \frac{H}{6}\left\|w_i^k - x^*\right\|^3.\qquad (17)$$

Finally, by combining (16) and (17), we get

$$f(x^{k+1}) \le f(x^*) + \frac{H + M}{6}\cdot\frac{1}{n}\sum_{i=1}^n\left\|w_i^k - x^*\right\|^3. \qquad\blacksquare$$
Lemma 5.

Let $f$ be $\mu$-strongly convex and each $f_i$ have $H$-Lipschitz Hessian, and choose $M \ge H$. Then

$$f(x^{k+1}) - f(x^*) \le \frac{H + M}{6}\left(\frac{2}{\mu}\right)^{3/2}\cdot\frac{1}{n}\sum_{i=1}^n\left(f(w_i^k) - f(x^*)\right)^{3/2}.$$

Proof. The result readily follows from combining Lemma 4 and strong convexity of $f$. Indeed, we have

$$\left\|w_i^k - x^*\right\|^3 \le \left(\frac{2}{\mu}\left(f(w_i^k) - f(x^*)\right)\right)^{3/2}. \qquad\blacksquare$$
Lemma 6.

Let $f$ be $\mu$-strongly convex and each $f_i$ have $H$-Lipschitz Hessian. If $M \ge H$ and $f(w_i^0) - f(x^*) \le \frac{9\mu^3}{8(H + M)^2}$ for $i = 1, \dots, n$, then

$$f(w_i^k) - f(x^*) \le \frac{9\mu^3}{8(H + M)^2} \quad\text{for all } i \text{ and } k$$

holds with probability 1.

Proof. Denote for convenience $c := \frac{9\mu^3}{8(H + M)^2}$. We proceed by induction in $k$; for $k = 0$ the claim is satisfied by our assumptions. If $i \notin S^k$, then $w_i^{k+1} = w_i^k$ and the induction assumption gives the step. Let $i \in S^k$; then, by Lemma 5 and the induction hypothesis,

$$f(w_i^{k+1}) - f(x^*) = f(x^{k+1}) - f(x^*) \le \frac{H + M}{6}\left(\frac{2}{\mu}\right)^{3/2} c^{3/2} = \frac{c}{2},$$

so $f(w_i^{k+1}) - f(x^*) \le c$. ∎

Lemma 7.

The following equality holds for all $k$:

$$\mathbb{E}_k\left[\Phi^{k+1}\right] = \left(1 - \frac{\tau}{n}\right)\Phi^k + \frac{\tau}{n}\left(f(x^{k+1}) - f(x^*)\right).$$

Proof. The result follows immediately from the definition (8) of $\Phi^k = \frac{1}{n}\sum_{i=1}^n\left(f(w_i^k) - f(x^*)\right)$ and the fact that $w_i^{k+1} = x^{k+1}$ with probability $\frac{\tau}{n}$ and $w_i^{k+1} = w_i^k$ with probability $1 - \frac{\tau}{n}$. ∎

B.2 Proof of Theorem 2


Denote $\Phi^k := \frac{1}{n}\sum_{i=1}^n\left(f(w_i^k) - f(x^*)\right)$. By combining Lemmas 5, 6 and 7, we can show

$$\mathbb{E}_k\left[\Phi^{k+1}\right] = \left(1 - \frac{\tau}{n}\right)\Phi^k + \frac{\tau}{n}\left(f(x^{k+1}) - f(x^*)\right) \le \left(1 - \frac{\tau}{n}\right)\Phi^k + \frac{\tau}{n}\cdot\frac{H + M}{6}\left(\frac{2}{\mu}\right)^{3/2}\frac{1}{n}\sum_{i=1}^n\left(f(w_i^k) - f(x^*)\right)^{3/2} \le \left(1 - \frac{\tau}{2n}\right)\Phi^k,$$

where in the last step we have also used the assumption on $f(w_i^0) - f(x^*)$ for $i = 1, \dots, n$. ∎

Appendix C Efficient implementation for generalized linear models

Initialize: Choose starting iterates $w_1^0, \dots, w_n^0 \in \mathbb{R}^d$; compute $\alpha_i^0 = \varphi_i'\left(a_i^\top w_i^0\right)$, $\beta_i^0 = \varphi_i''\left(a_i^\top w_i^0\right)$ for $i = 1, \dots, n$, $g^0 = \frac{1}{n}\sum_{i=1}^n\left(\beta_i^0\, a_i^\top w_i^0 - \alpha_i^0\right)a_i$, $\mathbf{B}^0 = \left(\frac{1}{n}\sum_{i=1}^n \beta_i^0\, a_i a_i^\top + \lambda\mathbf{I}\right)^{-1}$
for $k = 0, 1, 2, \dots$ do
     $x^{k+1} = \mathbf{B}^k g^k$
     Choose $i \in \{1, \dots, n\}$ uniformly at random
     $\alpha_i^{k+1} = \varphi_i'\left(a_i^\top x^{k+1}\right)$, $\beta_i^{k+1} = \varphi_i''\left(a_i^\top x^{k+1}\right)$, with $\alpha_j, \beta_j$ unchanged for $j \ne i$
     $g^{k+1} = g^k + \frac{1}{n}\left[\left(\beta_i^{k+1}\, a_i^\top x^{k+1} - \alpha_i^{k+1}\right) - \left(\beta_i^k\, a_i^\top w_i^k - \alpha_i^k\right)\right]a_i$
     $\mathbf{B}^{k+1} = \mathbf{B}^k - \frac{\frac{1}{n}\left(\beta_i^{k+1} - \beta_i^k\right)\mathbf{B}^k a_i a_i^\top\mathbf{B}^k}{1 + \frac{1}{n}\left(\beta_i^{k+1} - \beta_i^k\right)a_i^\top\mathbf{B}^k a_i}$; $w_i^{k+1} = x^{k+1}$
end for
Algorithm 3 Stochastic Newton (SN) for generalized linear models with $\tau = 1$

In this section, we consider fitting a generalized linear model (GLM) with $\ell_2$ regularization, i.e., solving

$$\min_{x\in\mathbb{R}^d}\ f(x) := \frac{1}{n}\sum_{i=1}^n \varphi_i\left(a_i^\top x\right) + \frac{\lambda}{2}\|x\|^2,$$

where the vectors $a_1, \dots, a_n \in \mathbb{R}^d$ are given and $\varphi_1, \dots, \varphi_n : \mathbb{R}\to\mathbb{R}$ are some convex and smooth functions. For any $i$, we therefore have $\nabla f_i(x) = \varphi_i'\left(a_i^\top x\right)a_i + \lambda x$ and $\nabla^2 f_i(x) = \varphi_i''\left(a_i^\top x\right)a_i a_i^\top + \lambda\mathbf{I}$ for $f_i(x) := \varphi_i\left(a_i^\top x\right) + \frac{\lambda}{2}\|x\|^2$.

Let us define for convenience $\alpha_i^k := \varphi_i'\left(a_i^\top w_i^k\right)$, $\beta_i^k := \varphi_i''\left(a_i^\top w_i^k\right)$ and

$$g^k := \frac{1}{n}\sum_{i=1}^n\left(\beta_i^k\, a_i^\top w_i^k - \alpha_i^k\right)a_i, \qquad \mathbf{H}^k := \frac{1}{n}\sum_{i=1}^n \beta_i^k\, a_i a_i^\top + \lambda\mathbf{I}.$$

Then we can shorten the update rule for $x^{k+1}$ to $x^{k+1} = \left(\mathbf{H}^k\right)^{-1}g^k$. Our key observation is that the updates of $\alpha_i^k$, $\beta_i^k$, $g^k$ and $\mathbf{H}^k$ are very cheap both in terms of memory and compute.

First, it is evident that to update $g^k$ it is sufficient to maintain in memory only the scalars $\alpha_i^k$, $\beta_i^k$ and the inner products $a_i^\top w_i^k$. Similarly, for $\mathbf{H}^k$ we do not need to store the Hessians explicitly and can use instead the scalars $\beta_i^k$ and the vectors $a_i$. Then, when index $i$ is updated,

$$\mathbf{H}^{k+1} = \mathbf{H}^k + \frac{1}{n}\left(\beta_i^{k+1} - \beta_i^k\right)a_i a_i^\top.$$

The update above is low-rank, and there exist efficient ways of computing the inverse of $\mathbf{H}^{k+1}$ using the inverse of $\mathbf{H}^k$. For instance, the following proposition is at the core of efficient implementations of quasi-Newton methods, and it turns out to be quite useful for Algorithm 1 too.

Proposition 1 (Sherman-Morrison formula).


Let $\mathbf{A}\in\mathbb{R}^{d\times d}$ be an invertible matrix and $u, v \in \mathbb{R}^d$ be arbitrary two vectors such that $1 + v^\top\mathbf{A}^{-1}u \ne 0$; then we have

$$\left(\mathbf{A} + uv^\top\right)^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1}uv^\top\mathbf{A}^{-1}}{1 + v^\top\mathbf{A}^{-1}u}.$$

Consider for simplicity $\tau = 1$ and let $i$ be the sampled index; then $\mathbf{H}^{k+1} = \mathbf{H}^k + \frac{1}{n}\left(\beta_i^{k+1} - \beta_i^k\right)a_i a_i^\top$, and by Proposition 1

$$\left(\mathbf{H}^{k+1}\right)^{-1} = \left(\mathbf{H}^k\right)^{-1} - \frac{\frac{1}{n}\left(\beta_i^{k+1} - \beta_i^k\right)\left(\mathbf{H}^k\right)^{-1}a_i a_i^\top\left(\mathbf{H}^k\right)^{-1}}{1 + \frac{1}{n}\left(\beta_i^{k+1} - \beta_i^k\right)a_i^\top\left(\mathbf{H}^k\right)^{-1}a_i}.$$

Thus, the complexity of each update of Algorithm 3 is $\mathcal{O}(d^2)$, which is much faster than the $\mathcal{O}(d^3)$ cost of solving the linear system in the update of Newton's method.
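To illustrate Proposition 1 numerically, the following self-contained sketch (pure Python; the matrix $\mathbf{A} = 2\mathbf{I}$ and the vectors $u, v$ are arbitrary choices of ours) applies the Sherman-Morrison formula to a rank-one update and verifies the result, using only $\mathcal{O}(d^2)$ arithmetic given $\mathbf{A}^{-1}$:

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matmul(A, B):
    d = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]

def sherman_morrison(A_inv, u, v):
    """(A + u v^T)^{-1} = A^{-1} - (A^{-1} u)(v^T A^{-1}) / (1 + v^T A^{-1} u),
    valid whenever the denominator is nonzero; O(d^2) given A^{-1}."""
    d = len(A_inv)
    Au = matvec(A_inv, u)                                               # A^{-1} u
    vA = [sum(v[i] * A_inv[i][j] for i in range(d)) for j in range(d)]  # v^T A^{-1}
    denom = 1.0 + sum(vi * a for vi, a in zip(v, Au))
    return [[A_inv[i][j] - Au[i] * vA[j] / denom for j in range(d)]
            for i in range(d)]

# A = 2I, so A^{-1} = I/2; perturb by the rank-one matrix u v^T
A_inv = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
u, v = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
B = [[2.0 * (i == j) + u[i] * v[j] for j in range(3)] for i in range(3)]
B_inv = sherman_morrison(A_inv, u, v)
residual = matmul(B_inv, B)   # should be (numerically) the identity
```

In Algorithm 3 the same update is applied with $u = \frac{1}{n}\left(\beta_i^{k+1} - \beta_i^k\right)a_i$ and $v = a_i$, so the inverse is maintained without ever re-factorizing the Hessian.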

Appendix D Efficient implementations of Algorithm 2

The subproblem that we need to efficiently solve when running Algorithm 2 can be reduced to

$$\min_{x\in\mathbb{R}^d}\ \langle g, x\rangle + \frac{1}{2}\left\langle\mathbf{H}x, x\right\rangle + \frac{M}{6}\cdot\frac{1}{n}\sum_{i=1}^n\left\|x - w_i^k\right\|^3,$$

where $g := \frac{1}{n}\sum_{i=1}^n\left(\nabla f_i(w_i^k) - \nabla^2 f_i(w_i^k)\, w_i^k\right)$ and $\mathbf{H} := \frac{1}{n}\sum_{i=1}^n \nabla^2 f_i(w_i^k)$.