Stochastic-Average-Newton
We present two new remarkably simple stochastic second-order methods for minimizing the average of a very large number of sufficiently smooth and strongly convex functions. The first is a stochastic variant of Newton's method (SN), and the second is a stochastic variant of cubically regularized Newton's method (SCN). We establish local linear-quadratic convergence results. Unlike existing stochastic variants of second-order methods, which require the evaluation of a large number of gradients and/or Hessians in each iteration to guarantee convergence, our methods do not have this shortcoming. For instance, the simplest variants of our methods in each iteration need to compute the gradient and Hessian of a single randomly selected function only. In contrast to most existing stochastic Newton and quasi-Newton methods, our approach guarantees local convergence faster than what is achievable with a first-order oracle, and it adapts to the problem's curvature. Interestingly, our methods are not unbiased, so our theory provides new intuition for designing new stochastic methods.
The problem of empirical risk minimization (ERM) is ubiquitous in machine learning and data science. Advances of the last few years
Roux et al. (2012); Defazio et al. (2014a); Johnson and Zhang (2013); Shalev-Shwartz and Zhang (2013) have shown that the availability of a finite-sum structure and the ability to use extra memory allow for methods that solve a wide range of ERM problems Nguyen et al. (2017a, b); Ryu and Yin (2017); Allen-Zhu (2017); Mishchenko and Richtárik (2019); Kovalev et al. (2019). In particular, the Majorization-Minimization (MM) approach Lange et al. (2000), also known as successive upper-bound minimization Razaviyayn et al. (2016), has gained a lot of attention in first-order Mairal (2013); Defazio et al. (2014b); Bietti and Mairal (2017); Mokhtari et al. (2018b); Qian et al. (2019), second-order Mokhtari et al. (2018a), proximal Lin et al. (2015); Ryu and Yin (2017) and other settings Karimi and Moulines (2018); Mishchenko et al. (2018). The aforementioned finite-sum structure is given by the minimization formulation
$$\min_{x \in \mathbb{R}^d} \left[ f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) \right], \qquad (1)$$
where each function $f_i$ is assumed to have a Lipschitz Hessian.
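As a concrete instance (our example, not taken from the text), $\ell_2$-regularized logistic regression fits the template (1) with
$$f_i(x) = \log\left(1 + e^{-b_i a_i^\top x}\right) + \frac{\lambda}{2}\|x\|^2,$$
where $(a_i, b_i) \in \mathbb{R}^d \times \{-1, +1\}$ is the $i$-th training example; each such $f_i$ is $\lambda$-strongly convex and has a Lipschitz-continuous Hessian.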
Since in practical applications $n$ is often very large, this problem is typically solved using stochastic first-order methods, such as Stochastic Gradient Descent (SGD) Robbins and Monro (1951), which have cheap iterations independent of $n$. However, constant-stepsize SGD provides fast convergence to a neighborhood of the solution only Nguyen et al. (2018); Gower et al. (2019), whose radius is proportional to the variance of the stochastic gradient at the optimum. So-called variance-reduced methods resolve this issue
Roux et al. (2012); Johnson and Zhang (2013); Gorbunov et al. (2019); Hanzely and Richtárik (2019), but their iteration complexity relies significantly on the condition number of the problem, which is equal to the ratio of the smoothness and strong convexity parameters. Second-order methods, in contrast, adapt to the curvature of the problem and thereby decrease their dependence on the condition number Nesterov (2004). Among the most well-known second-order methods are Newton's method (see e.g. Nesterov (2004); Karimireddy et al. (2018); Gower et al. (2019)), a variant thereof with cubic regularization Nesterov and Polyak (2006), and BFGS Liu and Nocedal (1989); Avriel (2003). The vanilla deterministic Newton method uses a simple gradient update preconditioned by the inverse Hessian evaluated at the same iterate, i.e.,
$$x^{k+1} = x^k - \left(\nabla^2 f(x^k)\right)^{-1}\nabla f(x^k).$$
This can be seen as approximating the function $f$ locally around $x^k$ as (here, for $s \in \mathbb{R}^d$ and a positive definite matrix $\mathbf{M}$, we write $\|s\|_{\mathbf{M}} := \langle \mathbf{M}s, s\rangle^{1/2}$)
$$f(x) \approx f(x^k) + \langle \nabla f(x^k), x - x^k\rangle + \frac{1}{2}\left\|x - x^k\right\|^2_{\nabla^2 f(x^k)}, \qquad (2)$$
and subsequently minimizing this approximation in $x$. However, this approximation is not necessarily an upper bound on $f$, which makes it hard to establish global rates for the method. Indeed, unless extra assumptions Karimireddy et al. (2018); Gower et al. (2019) or regularizations Nesterov and Polyak (2006); Doikov and Richtárik (2018); Doikov and Nesterov (2019) are employed, the method can diverge. Nevertheless, second-order methods are famous for their fast (superlinear or quadratic) local convergence rates.
Unfortunately, the fast convergence rates of Newton and cubically regularized Newton methods do not directly translate to the stochastic case. For instance, the stochastic quasi-Newton methods in Luo et al. (2016); Moritz et al. (2016); Gower et al. (2016) have iteration complexities with a worse dependence on the problem parameters than the first-order method of Johnson and Zhang (2013), thus giving weaker theoretical guarantees. Other theoretical gaps include inconsistent assumptions, such as bounded gradients together with strong convexity Berahas et al. (2016).
Surprisingly, a lot of effort Kohler and Lucchi (2017); Bollapragada et al. (2018); Zhou et al. (2018); Zhang et al. (2018); Tripuraneni et al. (2018); Zhou and Gu (2019) has been put into developing methods with massive (albeit sometimes still independent of $n$) batch sizes, typically proportional to $1/\varepsilon$, where $\varepsilon$ is the target accuracy. This essentially removes all randomness from the analysis by making the variance infinitesimal. In fact, these batch sizes are likely to exceed $n$, so one ends up using mere deterministic algorithms. Furthermore, to make all estimates unbiased, large batch sizes are combined with preconditioning sampled gradients using the Hessians of some other functions. We naturally expect that these techniques would not work in practice when only a single sample is used, since the Hessian of a randomly sampled function might be significantly different from the Hessians of other samples. Unlike all these methods, ours work even with a batch size equal to one.
Thus, it is a major contribution of our work that we give up on taking unbiased estimates and show convergence despite the biased nature of our methods. This is achieved by developing new Lyapunov functions that are specific to second-order methods.
We are aware of two methods Rodomanov and Kropotov (2016); Mokhtari et al. (2018a) that do not suffer from the issues above. Nevertheless, their update is cyclic rather than stochastic, which is known to be slower in the first-order setting due to a possibly bad choice of the iteration order Safran and Shamir (2019).
Convergence of our algorithms will be established under certain regularity assumptions on the functions in (1). In particular, we will assume that all functions $f_i$ are twice differentiable, have a Lipschitz Hessian, and are strongly convex. These concepts are defined next. Let $\langle\cdot, \cdot\rangle$ and $\|\cdot\|$ be the standard Euclidean inner product and norm in $\mathbb{R}^d$, respectively.
A differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex, where $\mu > 0$ is the strong convexity parameter, if for all $x, y \in \mathbb{R}^d$
$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2. \qquad (3)$$
For twice differentiable functions, strong convexity is equivalent to requiring the smallest eigenvalue of the Hessian to be uniformly bounded away from 0, i.e., $\nabla^2 f(x) \succeq \mu\mathbf{I}$ for all $x \in \mathbb{R}^d$, where $\mathbf{I}$ is the identity matrix, or equivalently $\langle \nabla^2 f(x) h, h\rangle \ge \mu\|h\|^2$ for all $x, h \in \mathbb{R}^d$. Given a twice differentiable strongly convex function $f$, we let $\|h\|_{\nabla^2 f(x)} := \langle \nabla^2 f(x) h, h\rangle^{1/2}$ be the norm induced by the matrix $\nabla^2 f(x)$. Function $f$ has an $H$-Lipschitz Hessian if for all $x, y \in \mathbb{R}^d$
$$\left\|\nabla^2 f(x) - \nabla^2 f(y)\right\| \le H\|x - y\|.$$
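As a quick worked example (ours, not from the text): for $f_i(x) = \phi_i(a_i^\top x) + \frac{\lambda}{2}\|x\|^2$ with a thrice-differentiable convex $\phi_i$, we have $\nabla^2 f_i(x) = \phi_i''(a_i^\top x)\,a_i a_i^\top + \lambda\mathbf{I}$, so that $\mu = \lambda$ and
$$\left\|\nabla^2 f_i(x) - \nabla^2 f_i(y)\right\| \le \sup_t\left|\phi_i'''(t)\right|\cdot\|a_i\|^3\,\|x - y\|,$$
i.e., this $f_i$ has an $H$-Lipschitz Hessian with $H \le \sup_t|\phi_i'''(t)|\,\|a_i\|^3$.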
Recall (e.g., see Lemma 1 in Nesterov and Polyak (2006)) that a function with an $H$-Lipschitz Hessian admits for any $x, y \in \mathbb{R}^d$ the following estimates:
$$\left\|\nabla f(y) - \nabla f(x) - \nabla^2 f(x)(y - x)\right\| \le \frac{H}{2}\|y - x\|^2 \qquad (4)$$
and
$$\left| f(y) - f(x) - \langle \nabla f(x), y - x\rangle - \frac{1}{2}\left\langle \nabla^2 f(x)(y - x), y - x\right\rangle \right| \le \frac{H}{6}\|y - x\|^3. \qquad (5)$$
In the big data case, i.e., when $n$ is very large, it is not efficient to evaluate the gradient, let alone the Hessian, of $f$ at each iteration. Instead, there is a desire to develop an incremental method which in each step resorts to computing the gradient and Hessian of a single function only. The challenge here is to design a scheme capable of using this stochastic first- and second-order information to progressively build, throughout the iterations, more accurate approximations of the Newton step, so as to ensure a global convergence rate. While establishing any meaningful global rate is challenging enough, a fast linear rate would be desirable.
In this paper we propose to approximate the gradient and the Hessian of each $f_i$ using the latest information available, i.e.,
$$\nabla f_i(x^k) \approx \nabla f_i(w_i^k), \qquad \nabla^2 f_i(x^k) \approx \nabla^2 f_i(w_i^k),$$
where $w_i^k$ is the last vector for which the gradient $\nabla f_i$ and Hessian $\nabla^2 f_i$ were computed. We then use these approximations to take a Newton-type step. Note, however, that $\nabla f_i(w_i^k)$ and $\nabla^2 f_i(w_i^k)$ serve as good approximations only when used together, so that we precondition gradients with the Hessians evaluated at the same iterates. Although the true gradient, $\nabla f(x^k)$, gives precise information about the linear approximation of $f$ at $x^k$, preconditioning it with the approximation $\frac{1}{n}\sum_{i=1}^n \nabla^2 f_i(w_i^k)$ would not make much sense. Our method is summarized in Algorithm 1 (see also the sketch below). We now state a simple local linear convergence rate for Algorithm 1. By $\mathbb{E}_k$ we denote the expectation conditioned on all information prior to iteration $k$.
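Before stating the result, here is a minimal NumPy sketch of the single-sample variant described above. It is our own illustration rather than a verbatim transcription of Algorithm 1: the oracle handles `grad_i`, `hess_i` and the uniform sampling of one index per iteration are assumptions.

```python
import numpy as np

def stochastic_newton(grad_i, hess_i, x0, n, num_iters, rng=None):
    """Sketch of the Stochastic Newton (SN) idea: for every f_i keep the last point
    w_i at which its gradient and Hessian were evaluated, and build the Newton-type
    step from these delayed quantities."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    w = np.tile(x0, (n, 1))                                 # w[i] stores w_i^k
    grads = np.stack([grad_i(i, x0) for i in range(n)])     # nabla f_i(w_i^k)
    hesss = np.stack([hess_i(i, x0) for i in range(n)])     # nabla^2 f_i(w_i^k)

    for _ in range(num_iters):
        H = hesss.mean(axis=0)                              # (1/n) sum_i nabla^2 f_i(w_i^k)
        # b = (1/n) sum_i [ nabla^2 f_i(w_i^k) w_i^k - nabla f_i(w_i^k) ]
        b = (np.einsum('ijk,ik->ij', hesss, w) - grads).mean(axis=0)
        x = np.linalg.solve(H, b)                           # x^{k+1}
        i = rng.integers(n)                                 # refresh one random index
        w[i], grads[i], hesss[i] = x, grad_i(i, x), hess_i(i, x)
    return x
```

Note that the step computed above equals $x^{k+1} = \left(\mathbf{H}^k\right)^{-1}\frac{1}{n}\sum_i\left[\nabla^2 f_i(w_i^k)w_i^k - \nabla f_i(w_i^k)\right]$, which reduces to the exact Newton step when $n = 1$.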
Assume that every $f_i$ is $\mu$-strongly convex and has an $H$-Lipschitz Hessian, and consider the following Lyapunov function:
$$\mathcal{W}^k := \frac{1}{n}\sum_{i=1}^n \left\|w_i^k - x^*\right\|^2. \qquad (6)$$
Then for the random iterates of Algorithm 1 we have the recursion
$$\mathbb{E}_k\left[\mathcal{W}^{k+1}\right] \le \left(1 - \frac{1}{n}\right)\mathcal{W}^k + \frac{1}{n}\left(\frac{H}{2\mu}\right)^2\left(\mathcal{W}^k\right)^2. \qquad (7)$$
Furthermore, if $\|w_i^0 - x^*\| \le \frac{\mu}{H}$ for $i = 1, \ldots, n$, then
$$\mathbb{E}\left[\mathcal{W}^k\right] \le \left(1 - \frac{3}{4n}\right)^k \mathcal{W}^0.$$
Clearly, when $n = 1$, we achieve the standard superlinear (quadratic) rate of Newton's method. When $n > 1$, we obtain a linear convergence rate that is independent of the conditioning; thus, the method provably adapts to the curvature.
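To get a feel for recursion (7), the following toy computation (an illustration of the bound, not an experiment from the text) iterates the scalar map $\mathcal{W}\mapsto\left(1 - \tfrac{1}{n}\right)\mathcal{W} + \tfrac{1}{n}\left(\tfrac{H}{2\mu}\right)^2\mathcal{W}^2$ starting at the edge of the local region, $\mathcal{W}^0 = (\mu/H)^2$:

```python
# Iterate the scalar bound from (7): for n > 1 the decrease in this region is
# linear, with factor at most (1 - 3/(4n)).
n, H, mu = 10, 1.0, 0.1
W = (mu / H) ** 2
for k in range(1, 101):
    W = (1 - 1 / n) * W + (1 / n) * (H / (2 * mu)) ** 2 * W ** 2
    if k % 20 == 0:
        print(f"k={k:3d}  W={W:.3e}")
```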
Motivated by the desire to develop a new principled variant of Newton's method that could enjoy global convergence guarantees, Nesterov and Polyak Nesterov and Polyak (2006) proposed the following “cubically regularized” second-order approximation of $f$:
$$f(x) \approx f(x^k) + \langle \nabla f(x^k), x - x^k\rangle + \frac{1}{2}\left\langle \nabla^2 f(x^k)(x - x^k), x - x^k\right\rangle + \frac{M}{6}\left\|x - x^k\right\|^3.$$
In contrast to (2), the above is a global upper bound on $f$ provided that $M \ge H$. This fact is then used to establish global rates.
In this section we combine cubic regularization with the stochastic Newton technique developed in the previous section. Let us define the following second-order Taylor approximation of $f_i$:
$$T_i(x; z) := f_i(z) + \langle \nabla f_i(z), x - z\rangle + \frac{1}{2}\left\langle \nabla^2 f_i(z)(x - z), x - z\right\rangle.$$
Having introduced $T_i$, we are now ready to present our Stochastic Cubic Newton method (Algorithm 2): in each iteration, $x^{k+1}$ is obtained by minimizing the average of the cubically regularized models $T_i(x; w_i^k) + \frac{M}{6}\|x - w_i^k\|^3$, after which a single randomly selected $w_i$ is updated to $x^{k+1}$.
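Below is a minimal sketch of one step of this scheme (our illustration; the subproblem is handed to a generic solver from SciPy, whereas a practical implementation would likely use a dedicated cubic-subproblem solver):

```python
import numpy as np
from scipy.optimize import minimize

def scn_step(grads, hesss, ws, M, x_init):
    """Minimize the average of the models T_i(x; w_i) + (M/6) ||x - w_i||^3,
    where T_i is the second-order Taylor expansion of f_i around w_i."""
    n = len(ws)

    def model(x):
        val = 0.0
        for g, Hm, w in zip(grads, hesss, ws):
            s = x - w
            val += g @ s + 0.5 * s @ Hm @ s + (M / 6.0) * np.linalg.norm(s) ** 3
        return val / n                     # constant terms f_i(w_i) are dropped

    def model_grad(x):
        out = np.zeros_like(x)
        for g, Hm, w in zip(grads, hesss, ws):
            s = x - w
            out += g + Hm @ s + (M / 2.0) * np.linalg.norm(s) * s
        return out / n

    return minimize(model, x_init, jac=model_grad, method="BFGS").x
```

The outer loop is the same as for Algorithm 1: after computing $x^{k+1}$, one randomly chosen $w_i$ (together with its stored gradient and Hessian) is refreshed at $x^{k+1}$.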
Our main theorem gives a bound on the expected improvement of a new Lyapunov function $\Phi^k$.
Let $f$ be $\mu$-strongly convex, assume that every $f_i$ has an $H$-Lipschitz Hessian and that $M \ge H$, and consider the following Lyapunov function:
$$\Phi^k := \frac{1}{n}\sum_{i=1}^n \left( f(w_i^k) - f(x^*) \right). \qquad (8)$$
Then for the random iterates of Algorithm 2 we have the recursion
$$\mathbb{E}_k\left[\Phi^{k+1}\right] \le \left(1 - \frac{1}{n}\right)\Phi^k + \frac{\sqrt{2}\,(H+M)}{3n\mu^{3/2}}\cdot\frac{1}{n}\sum_{i=1}^n \left(f(w_i^k) - f(x^*)\right)^{3/2}.$$
Furthermore, if $f(w_i^0) - f(x^*) \le \frac{9\mu^3}{8(H+M)^2}$ for $i = 1, \ldots, n$, then
$$\mathbb{E}\left[\Phi^k\right] \le \left(1 - \frac{1}{2n}\right)^k \Phi^0.$$
Again, with $n = 1$ we recover the local superlinear convergence of cubic Newton. In addition, this rate is locally independent of the problem conditioning, showing that we indeed benefit from second-order information.
Let $f_i$ be $\mu$-strongly convex and have an $H$-Lipschitz Hessian for all $i$. Then the iterates of Algorithm 1 satisfy
$$\left\|x^{k+1} - x^*\right\| \le \frac{H}{2\mu}\cdot\frac{1}{n}\sum_{i=1}^n \left\|w_i^k - x^*\right\|^2. \qquad (9)$$
Letting $\mathbf{H}^k := \frac{1}{n}\sum_{i=1}^n \nabla^2 f_i(w_i^k)$, one step of the method can be written in the form
$$x^{k+1} = \left(\mathbf{H}^k\right)^{-1}\frac{1}{n}\sum_{i=1}^n\left[\nabla^2 f_i(w_i^k)\,w_i^k - \nabla f_i(w_i^k)\right].$$
Subtracting $x^*$ from both sides, and using the fact that $\frac{1}{n}\sum_{i=1}^n \nabla f_i(x^*) = \nabla f(x^*) = 0$, this further leads to
$$x^{k+1} - x^* = \left(\mathbf{H}^k\right)^{-1}\frac{1}{n}\sum_{i=1}^n\left[\nabla^2 f_i(w_i^k)\left(w_i^k - x^*\right) - \left(\nabla f_i(w_i^k) - \nabla f_i(x^*)\right)\right]. \qquad (10)$$
Next, note that since each $f_i$ is $\mu$-strongly convex, we have $\nabla^2 f_i(w_i^k) \succeq \mu\mathbf{I}$ for all $i$. It implies that $\mathbf{H}^k \succeq \mu\mathbf{I}$, which in turn gives
$$\left\|\left(\mathbf{H}^k\right)^{-1}\right\| \le \frac{1}{\mu}. \qquad (11)$$
The rest of the proof follows similar reasoning as in the standard convergence proof of Newton's method, with only small modifications:
$$\left\|x^{k+1} - x^*\right\| \overset{(10),(11)}{\le} \frac{1}{\mu}\cdot\frac{1}{n}\sum_{i=1}^n \left\|\nabla f_i(x^*) - \nabla f_i(w_i^k) - \nabla^2 f_i(w_i^k)\left(x^* - w_i^k\right)\right\| \overset{(4)}{\le} \frac{1}{\mu}\cdot\frac{1}{n}\sum_{i=1}^n \frac{H}{2}\left\|w_i^k - x^*\right\|^2 = \frac{H}{2\mu}\cdot\frac{1}{n}\sum_{i=1}^n\left\|w_i^k - x^*\right\|^2.$$
∎
Assume that each $f_i$ is $\mu$-strongly convex and has an $H$-Lipschitz Hessian. If $\|w_i^0 - x^*\| \le \frac{\mu}{H}$ for $i = 1, \ldots, n$, then with probability 1, for all $k$,
$$\mathcal{W}^k \le \left(\frac{\mu}{H}\right)^2. \qquad (12)$$
We claim that
$$\left\|w_i^k - x^*\right\| \le \frac{\mu}{H} \quad \text{for all } i = 1, \ldots, n. \qquad (13)$$
The upper bound (12) on $\mathcal{W}^k$ then follows immediately.
We will now prove the claim by induction in $k$. The statement (13) holds for $k = 0$ by our assumption. Assume it holds for $k$ and let us show that it holds for $k+1$. If $w_i^{k+1} = w_i^k$, i.e., $w_i$ is not updated, inequality (13) holds for $k+1$ by the induction assumption. On the other hand, for every $i$ with $w_i^{k+1} = x^{k+1}$, we have
$$\left\|w_i^{k+1} - x^*\right\| = \left\|x^{k+1} - x^*\right\| \overset{(9)}{\le} \frac{H}{2\mu}\cdot\frac{1}{n}\sum_{j=1}^n\left\|w_j^k - x^*\right\|^2 \le \frac{H}{2\mu}\left(\frac{\mu}{H}\right)^2 = \frac{\mu}{2H} \le \frac{\mu}{H}.$$
Thus, again, inequality (13) holds for $k+1$.
∎
Let each $f_i$ have an $H$-Lipschitz Hessian and choose $M \ge H$. Then,
$$f(x^{k+1}) - f(x^*) \le \frac{H + M}{6n}\sum_{i=1}^n \left\|w_i^k - x^*\right\|^3. \qquad (15)$$
We begin by establishing an upper bound for $f$: by (5) and the choice $M \ge H$, for every $x \in \mathbb{R}^d$ we have
$$f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x) \le \frac{1}{n}\sum_{i=1}^n\left[T_i(x; w_i^k) + \frac{M}{6}\left\|x - w_i^k\right\|^3\right].$$
Evaluating both sides at $x = x^{k+1}$ readily gives
$$f(x^{k+1}) \le \frac{1}{n}\sum_{i=1}^n\left[T_i(x^{k+1}; w_i^k) + \frac{M}{6}\left\|x^{k+1} - w_i^k\right\|^3\right] = \min_{x\in\mathbb{R}^d}\frac{1}{n}\sum_{i=1}^n\left[T_i(x; w_i^k) + \frac{M}{6}\left\|x - w_i^k\right\|^3\right], \qquad (16)$$
where the last equation follows from the definition of $x^{k+1}$ in Algorithm 2. We can further upper bound every $T_i(x; w_i^k)$ using $T_i(x; w_i^k) \le f_i(x) + \frac{H}{6}\|x - w_i^k\|^3$, which is again (5), to obtain
$$\min_{x\in\mathbb{R}^d}\frac{1}{n}\sum_{i=1}^n\left[T_i(x; w_i^k) + \frac{M}{6}\left\|x - w_i^k\right\|^3\right] \le \min_{x\in\mathbb{R}^d}\left\{f(x) + \frac{H + M}{6n}\sum_{i=1}^n\left\|x - w_i^k\right\|^3\right\}. \qquad (17)$$
Finally, by combining (16) and (17) and evaluating the right-hand side of (17) at $x = x^*$, we get
$$f(x^{k+1}) \le f(x^*) + \frac{H + M}{6n}\sum_{i=1}^n\left\|x^* - w_i^k\right\|^3,$$
which is precisely (15).
∎
Let $f$ be $\mu$-strongly convex, let $f_i$ have an $H$-Lipschitz Hessian for every $i$, and choose $M \ge H$. Then
$$f(x^{k+1}) - f(x^*) \le \frac{\sqrt{2}\,(H + M)}{3\mu^{3/2}}\cdot\frac{1}{n}\sum_{i=1}^n \left(f(w_i^k) - f(x^*)\right)^{3/2}. \qquad (18)$$
Let $f$ be $\mu$-strongly convex and let $f_i$ have an $H$-Lipschitz Hessian for every $i$. If $M \ge H$ and $f(w_i^0) - f(x^*) \le \frac{9\mu^3}{8(H+M)^2}$ for $i = 1, \ldots, n$, then
$$\frac{\sqrt{2}\,(H + M)}{3\mu^{3/2}}\cdot\frac{1}{n}\sum_{i=1}^n \left(f(w_i^k) - f(x^*)\right)^{3/2} \le \frac{1}{2}\Phi^k \qquad (19)$$
holds with probability 1.
Denote for convenience $D := \frac{9\mu^3}{8(H+M)^2}$. The lemma's claim follows as a corollary of a stronger statement, that for any $k$ and $i$ we have with probability 1
$$f(w_i^k) - f(x^*) \le D.$$
Indeed, this implies $\left(f(w_i^k) - f(x^*)\right)^{3/2} \le \sqrt{D}\left(f(w_i^k) - f(x^*)\right)$, and since $\frac{\sqrt{2}(H+M)}{3\mu^{3/2}}\sqrt{D} = \frac{1}{2}$, inequality (19) follows.
Let us show it by induction in $k$, where for $k = 0$ it is satisfied by our assumptions. If $w_i^{k+1} = w_i^k$, then $f(w_i^{k+1}) - f(x^*) = f(w_i^k) - f(x^*)$ and the induction assumption gives the step. Let $w_i^{k+1} = x^{k+1}$; then
$$f(w_i^{k+1}) - f(x^*) = f(x^{k+1}) - f(x^*) \overset{(18)}{\le} \frac{\sqrt{2}\,(H+M)}{3\mu^{3/2}}\cdot\frac{1}{n}\sum_{j=1}^n \left(f(w_j^k) - f(x^*)\right)^{3/2} \le \frac{\sqrt{2}\,(H+M)}{3\mu^{3/2}}\,D^{3/2} = \frac{D}{2},$$
so $f(w_i^{k+1}) - f(x^*) \le D$. ∎
The following equality holds for all $k$:
$$\mathbb{E}_k\left[\Phi^{k+1}\right] = \left(1 - \frac{1}{n}\right)\Phi^k + \frac{1}{n}\left(f(x^{k+1}) - f(x^*)\right). \qquad (20)$$
The result follows immediately from the definition (8) of $\Phi^k$ and the fact that $w_i^{k+1} = x^{k+1}$ with probability $\frac{1}{n}$ and $w_i^{k+1} = w_i^k$ with probability $1 - \frac{1}{n}$.
∎
Denote $\Delta^{k+1} := f(x^{k+1}) - f(x^*)$. By combining the results from our lemmas, we can show
$$\mathbb{E}_k\left[\Phi^{k+1}\right] \overset{(20)}{=} \left(1 - \frac{1}{n}\right)\Phi^k + \frac{1}{n}\Delta^{k+1} \overset{(18)}{\le} \left(1 - \frac{1}{n}\right)\Phi^k + \frac{1}{n}\cdot\frac{\sqrt{2}\,(H+M)}{3\mu^{3/2}}\cdot\frac{1}{n}\sum_{i=1}^n\left(f(w_i^k) - f(x^*)\right)^{3/2} \overset{(19)}{\le} \left(1 - \frac{1}{n}\right)\Phi^k + \frac{1}{2n}\Phi^k = \left(1 - \frac{1}{2n}\right)\Phi^k,
$$
where in the last step we also used the assumption on $f(w_i^0) - f(x^*)$ for $i = 1, \ldots, n$, through (19). ∎
In this section, we consider fitting a generalized linear model (GLM) with $\ell_2$ regularization, i.e. solving
$$\min_{x \in \mathbb{R}^d}\left[\frac{1}{n}\sum_{i=1}^n \phi_i\left(a_i^\top x\right) + \frac{\lambda}{2}\|x\|^2\right],$$
where $a_1, \ldots, a_n \in \mathbb{R}^d$ are given and $\phi_i : \mathbb{R} \to \mathbb{R}$ are some convex and smooth functions. For any $x$, we therefore have $\nabla f_i(x) = \phi_i'(a_i^\top x)\,a_i + \lambda x$ and $\nabla^2 f_i(x) = \phi_i''(a_i^\top x)\,a_i a_i^\top + \lambda\mathbf{I}$ for $f_i(x) := \phi_i(a_i^\top x) + \frac{\lambda}{2}\|x\|^2$.
Let us define for convenience
$$g^k := \frac{1}{n}\sum_{i=1}^n\left[\nabla^2 f_i(w_i^k)\,w_i^k - \nabla f_i(w_i^k)\right], \qquad \mathbf{H}^k := \frac{1}{n}\sum_{i=1}^n \nabla^2 f_i(w_i^k).$$
Then we can shorten the update rule for $x^{k+1}$ to $x^{k+1} = \left(\mathbf{H}^k\right)^{-1} g^k$. Our key observation is that the updates of $g^k$, $\mathbf{H}^k$ and $\left(\mathbf{H}^k\right)^{-1}$ are very cheap both in terms of memory and compute.
First, it is evident that to update $g^k$ it is sufficient to maintain in memory only the scalars $\gamma_i^k := \phi_i''(a_i^\top w_i^k)\,a_i^\top w_i^k - \phi_i'(a_i^\top w_i^k)$, since the regularizer terms cancel and $g^k = \frac{1}{n}\sum_{i=1}^n \gamma_i^k a_i$. Similarly, for $\mathbf{H}^k$ we do not need to store the Hessians explicitly and can use instead the scalars $\beta_i^k := \phi_i''(a_i^\top w_i^k)$ and the vectors $a_i$. Then, when a single index $i$ is updated,
$$\mathbf{H}^{k+1} = \mathbf{H}^k + \frac{1}{n}\left(\beta_i^{k+1} - \beta_i^k\right)a_i a_i^\top.$$
The update above is low-rank and there exist efficient ways of computing the inverse of $\mathbf{H}^{k+1}$ using the inverse of $\mathbf{H}^k$. For instance, the following proposition (the Sherman-Morrison formula) is at the core of efficient implementations of quasi-Newton methods, and it turns out to be quite useful for Algorithm 1 too.
Let $\mathbf{A} \in \mathbb{R}^{d\times d}$ be an invertible matrix and let $u, v \in \mathbb{R}^d$ be such that $1 + v^\top \mathbf{A}^{-1} u \neq 0$. Then $\mathbf{A} + u v^\top$ is invertible and
$$\left(\mathbf{A} + u v^\top\right)^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1} u v^\top \mathbf{A}^{-1}}{1 + v^\top \mathbf{A}^{-1} u}.$$
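As an illustration of how the proposition can be exploited in Algorithm 1 for GLMs (a sketch with our own function names, not the paper's implementation), the inverse of $\mathbf{H}^k$ can be refreshed in $O(d^2)$ after a single $w_i$ is replaced:

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, assuming 1 + v^T A^{-1} u != 0."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

def refresh_H_inverse(H_inv, a_i, beta_old, beta_new, n):
    """After w_i is replaced, H^{k+1} = H^k + ((beta_new - beta_old)/n) a_i a_i^T,
    where beta = phi_i''(a_i^T w_i); update the stored inverse accordingly."""
    u = ((beta_new - beta_old) / n) * a_i
    return sherman_morrison(H_inv, u, a_i)
```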
The subproblem that we need to efficiently solve when running Algorithm 2 can be reduced to
$$\min_{x \in \mathbb{R}^d}\left\{ -\langle g^k, x\rangle + \frac{1}{2}\left\langle \mathbf{H}^k x, x\right\rangle + \frac{M}{6n}\sum_{i=1}^n \left\|x - w_i^k\right\|^3 \right\}, \qquad (21)$$
where $g^k$ and $\mathbf{H}^k$ are as defined above; the linear and quadratic terms come from averaging the Taylor models $T_i(x; w_i^k)$ and dropping constants.
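Finally, a small sketch (ours; the variable names are not from the text, and for simplicity all $\phi_i$ are assumed to share the same scalar function $\phi$, as in logistic regression) showing that $g^k$ and $\mathbf{H}^k$, which enter both the step of Algorithm 1 and the subproblem (21), can be assembled from scalars and the data vectors only:

```python
import numpy as np

def glm_g_and_H(A, W, phi_prime, phi_double_prime, lam):
    """Assemble g^k and H^k for f_i(x) = phi(a_i^T x) + (lam/2)||x||^2.
    A: (n, d) array with rows a_i;  W: (n, d) array with rows w_i^k;
    phi_prime, phi_double_prime: vectorized first/second derivatives of phi."""
    n, d = A.shape
    t = np.einsum('ij,ij->i', A, W)               # t[i] = a_i^T w_i^k
    alpha = phi_prime(t)                          # phi'(a_i^T w_i^k)
    beta = phi_double_prime(t)                    # phi''(a_i^T w_i^k)
    gamma = beta * t - alpha                      # coefficient of a_i in g^k
    g = (gamma[:, None] * A).mean(axis=0)         # g^k = (1/n) sum_i gamma_i a_i
    H = (A.T * beta) @ A / n + lam * np.eye(d)    # H^k = (1/n) sum_i beta_i a_i a_i^T + lam I
    return g, H

# Newton-type step of Algorithm 1 for the GLM case: x_next = np.linalg.solve(H, g)
```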