A Unifying Framework for Convergence Analysis of Approximate Newton Methods

Haishan Ye et al. (Peking University), 02/27/2017

Many machine learning models are formulated as optimization problems, so solving large-scale optimization problems efficiently is important in big data applications. Recently, subsampled Newton methods have attracted much attention because of their low cost per iteration, rectifying the main weakness of the ordinary Newton method, which pairs a high convergence rate with a high per-iteration cost. Other efficient stochastic second order methods have also been proposed. However, the convergence properties of these methods are still not well understood, and there are several important gaps between the current convergence theory and the performance in real applications. In this paper, we aim to fill these gaps. We propose a unifying framework to analyze the local convergence properties of second order methods. Based on this framework, we give theoretical analysis that matches the empirical performance.


1 Introduction

Mathematical optimization is an important pillar of machine learning. We consider the following optimization problem:

$\min_{x \in \mathbb{R}^d} F(x) \triangleq \frac{1}{n}\sum_{i=1}^{n} f_i(x),$   (1)

where the $f_i$ are smooth functions. Many machine learning models can be expressed as (1), where each $f_i$ is the loss with respect to (w.r.t.) the $i$-th training sample. Examples include logistic regression, smoothed support vector machines, neural networks, and graphical models.
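As a concrete instance of (1), the sketch below writes a regularized logistic regression loss in this finite-sum form. The data matrix, labels, and regularization parameter are illustrative placeholders rather than objects from this paper.

```python
import numpy as np

def logistic_fi(x, a_i, b_i, lam=1e-3):
    """Per-sample loss f_i(x) for one example (a_i, b_i) with b_i in {-1, +1},
    plus an l2 term that keeps the objective strongly convex."""
    return np.log1p(np.exp(-b_i * a_i.dot(x))) + 0.5 * lam * x.dot(x)

def F(x, A, b, lam=1e-3):
    """Finite-sum objective F(x) = (1/n) * sum_i f_i(x), as in Eqn. (1)."""
    n = A.shape[0]
    return np.mean([logistic_fi(x, A[i], b[i], lam) for i in range(n)])

# toy usage with synthetic data
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))        # n = 100 samples, d = 5 features
b = np.sign(rng.standard_normal(100))    # labels in {-1, +1}
print(F(np.zeros(5), A, b))              # equals log(2) at x = 0
```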

Many optimization algorithms for solving the problem in (1) are based on the following iteration:

$x^{(t+1)} = x^{(t)} - \alpha_t Q_t \nabla F(x^{(t)}),$

where $\alpha_t > 0$ is the step length. If $Q_t$ is the identity matrix and $\alpha_t$ is a suitable constant, the resulting procedure is called Gradient Descent (GD), which achieves sublinear convergence for a general smooth convex objective function and linear convergence for a smooth strongly convex objective function. When $n$ is large, the full gradient method is inefficient because its iteration cost scales linearly in $n$. Consequently, stochastic gradient descent (SGD) has been a typical alternative [18, 12, 5]. To achieve a cheaper cost per iteration, such a method constructs an approximate gradient on a small mini-batch of data. However, its convergence rate can be significantly slower than that of the full gradient methods [15]. Thus, a great deal of effort has been devoted to modifications that achieve the convergence rate of the full gradient method while keeping a low iteration cost [10, 20, 21, 25].

If $Q_t$ is a positive definite matrix containing curvature information, this formulation leads us to second-order methods. It is well known that second order methods enjoy a superior convergence rate, in both theory and practice, compared with first-order methods, which only make use of gradient information. The standard Newton method, where $\alpha_t = 1$ and $Q_t = [\nabla^2 F(x^{(t)})]^{-1}$, achieves a quadratic convergence rate for smooth strongly convex objective functions. However, the Newton method costs $\mathcal{O}(nd^2 + d^3)$ per iteration, so it becomes extremely expensive when $n$ or $d$ is very large. As a result, one tries to construct an approximation of the Hessian such that the update is computationally feasible while sufficient second order information is kept. One class of such methods are quasi-Newton methods, which generalize the secant method for finding a root of the first derivative to multidimensional problems. The celebrated Broyden-Fletcher-Goldfarb-Shanno (BFGS) method and its limited-memory version (L-BFGS) are the most popular and widely used [16]; their per-iteration cost is only quadratic (BFGS) or even linear (L-BFGS) in $d$.
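To make the update rule concrete, here is a minimal NumPy sketch of the generic iteration $x^{(t+1)} = x^{(t)} - \alpha_t Q_t \nabla F(x^{(t)})$, instantiated with the exact Newton choice $Q_t = [\nabla^2 F(x^{(t)})]^{-1}$ and $\alpha_t = 1$. The quadratic test objective and the number of steps are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def newton_step(x, grad, hess, alpha=1.0):
    """One iteration x <- x - alpha * Q * grad(x) with Q = hess(x)^{-1};
    solving a linear system avoids forming the inverse explicitly."""
    p = np.linalg.solve(hess(x), grad(x))
    return x - alpha * p

# toy strongly convex quadratic F(x) = 0.5 x^T A x - b^T x
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)            # positive definite Hessian
b = rng.standard_normal(5)
grad = lambda x: A @ x - b
hess = lambda x: A

x = newton_step(np.zeros(5), grad, hess)
print(np.linalg.norm(A @ x - b))   # ~0: Newton solves a quadratic in one step
```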

Recently, for the case $n \gg d$, a class of methods called subsampled Newton methods has been proposed; they define an approximate Hessian matrix using a small subset of samples. The most naive approach is to sample a subset of the functions $f_i$ randomly [19, 3, 24] to construct a subsampled Hessian. Erdogdu and Montanari [8] proposed a regularized subsampled Newton method called NewSamp. When the Hessian can be written as $\nabla^2 F(x) = B(x)^T B(x)$, where $B(x)$ is an explicitly available matrix, Pilanci and Wainwright [17] used sketching techniques to approximate the Hessian and proposed the sketch Newton method. Similarly, Xu et al. [24] proposed to sample rows of $B(x)$ with a non-uniform probability distribution. Agarwal et al. [1] proposed an algorithm called LiSSA to approximate the inverse of the Hessian directly.

Although the convergence performance of stochastic second order methods has been analyzed, their convergence properties are still not well understood. There remain several important gaps between the convergence theory and real applications.

The first gap is the necessity of Lipschitz continuity of the Hessian. In previous work, to achieve a linear-quadratic convergence rate, stochastic second order methods all assume that $\nabla^2 F(x)$ is Lipschitz continuous. However, in real applications they may converge to the optimal point even without this assumption. For example, Erdogdu and Montanari [8] used NewSamp to successfully train the smoothed SVM, whose Hessian is not Lipschitz continuous.

The second gap is about the sketched size of sketch Newton methods. To obtain linear convergence, the sketched size required in [17] depends on the condition number $\kappa$ of the Hessian matrix in question; it was later improved in [24] using Gaussian sketching matrices, but it still grows with $\kappa$. However, sketch Newton empirically performs well even when the Hessian matrix is ill-conditioned: a sketched size of several tens of times, or even several times, $d$ can achieve a linear convergence rate in unconstrained optimization. In contrast, the theoretical results of Pilanci and Wainwright [17] and Xu et al. [24] imply that the sketched size may exceed $n$ in ill-conditioned cases.

The third gap is about the sample size in regularized subsampled Newton methods. In both [8] and [19], the theoretical analysis shows that the sample size of regularized subsampled Newton methods should be set to the same value as that of the conventional subsampled Newton method. In practice, however, adding a large regularizer can markedly reduce the sample size while preserving convergence. This contradicts the extant theoretical analysis [8, 19].

In this paper, we aim to fill these gaps between the current theory and empirical performance. More specifically, we first cast these second order methods into an algorithmic framework that we call approximate Newton. Then we propose a general result for analyzing the local convergence properties of second order methods. Based on this framework, we give a detailed theoretical analysis which matches the empirical performance. We summarize our contributions as follows:

  • We propose a unifying framework (Theorem 3) to analyze local convergence properties of second order methods including stochastic and deterministic versions. The convergence performance of second order methods can be analyzed easily and systematically in this framework.

  • We prove that the Lipschitz continuity condition of the Hessian is not necessary for achieving linear and superlinear convergence in variants of subsampled Newton, but it is needed to obtain quadratic convergence. This explains the phenomenon that NewSamp [8] can be used to train the smoothed SVM, for which the Lipschitz continuity condition of the Hessian is not satisfied. It also reveals why previous stochastic second order methods, such as subsampled Newton, sketch Newton, LiSSA, etc., all achieve a linear-quadratic convergence rate.

  • We prove that the sketched size is independent of the condition number of the Hessian matrix, which explains why sketch Newton performs well even when the Hessian matrix is ill-conditioned.

  • We provide a theoretical guarantee that adding a regularizer is an effective way to reduce the sample size in subsampled Newton methods while preserving convergence. Our theoretical analysis also shows that adding a larger regularizer leads to a slower convergence rate, because the smaller sample captures less curvature information.

1.1 Organization

The remainder of the paper is organized as follows. In Section 2 we present notation and preliminaries. In Section 3 we present a unifying framework for the local convergence analysis of second order methods. In Section 4 we analyze the local convergence properties of sketch Newton methods and prove that the sketched size is independent of the condition number of the Hessian matrix. In Section 5 we give the local convergence behavior of several variants of the subsampled Newton method; in particular, we reveal the relationship among sample size, regularizer, and convergence rate. In Section 6, we derive the local convergence properties of inexact Newton methods from our framework. In Section 7, we validate our theoretical results experimentally. Finally, we conclude our work in Section 8.

2 Notation and Preliminaries

Section 2.1 defines the notation used in this paper. Section 2.2 introduces matrix sketching techniques and their properties. Section 2.3 describes some important assumptions about the objective function.

2.1 Notation

Given a matrix $A \in \mathbb{R}^{n \times d}$ of rank $\ell$ and a positive integer $k \le \ell$, its condensed SVD is given as $A = U\Sigma V^T$, where $U$ and $V$ contain the left and right singular vectors of $A$, and $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_\ell)$ with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_\ell > 0$ contains the nonzero singular values of $A$. We will use $\sigma_{\max}(A)$ to denote the largest singular value and $\sigma_{\min}(A)$ to denote the smallest nonzero singular value. Thus, the condition number of $A$ is defined by $\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$. If $A$ is positive semidefinite, then $U = V$ and the square root of $A$ can be defined as $A^{1/2} = U\Sigma^{1/2}U^T$. It also holds that $\sigma_i(A) = \lambda_i(A)$, where $\lambda_i(A)$ denotes the $i$-th largest eigenvalue of $A$.

Additionally, $\|A\|_F$ denotes the Frobenius norm of $A$ and $\|A\|$ the spectral norm. Given a positive definite matrix $M$, $\|x\|_M = \|M^{1/2}x\|$ is called the $M$-norm of $x$. Given square matrices $A$ and $B$ of the same size, we write $A \preceq B$ if $B - A$ is positive semidefinite.

2.2 Randomized sketching matrices

We first give the $\epsilon$-subspace embedding property, which will be used to sketch Hessian matrices. Then we list some useful types of randomized sketching matrices, including Gaussian projection [9, 11], leverage score sampling [6], and count sketch [4, 14, 13].

A matrix $S \in \mathbb{R}^{s \times n}$ is said to be an $\epsilon$-subspace embedding matrix w.r.t. a fixed matrix $A \in \mathbb{R}^{n \times d}$, where $d < n$, if $\|SAx\|^2 = (1 \pm \epsilon)\|Ax\|^2$ (i.e., $(1-\epsilon)\|Ax\|^2 \le \|SAx\|^2 \le (1+\epsilon)\|Ax\|^2$) for all $x \in \mathbb{R}^d$.

From the definition of the $\epsilon$-subspace embedding matrix, we can derive the following property directly: $S$ is an $\epsilon$-subspace embedding matrix w.r.t. the matrix $A$ if and only if

$(1-\epsilon)\, A^T A \preceq A^T S^T S A \preceq (1+\epsilon)\, A^T A.$

Gaussian sketching matrix.

The most classical sketching matrix is the Gaussian sketching matrix $S \in \mathbb{R}^{s \times n}$, whose entries are i.i.d. from the normal distribution with mean $0$ and variance $1/s$. Owing to their well-known concentration properties [23], Gaussian random matrices are very attractive. Besides, $s = \mathcal{O}((d + \log(1/\delta))/\epsilon^2)$ rows are enough to guarantee the $\epsilon$-subspace embedding property for any fixed matrix $A$ with probability at least $1-\delta$, which is the tightest bound among the known types of sketching matrices. However, the Gaussian random matrix is usually dense, so it is costly to compute $SA$.
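The following NumPy sketch shows, under the assumptions above, how a Gaussian sketching matrix is built and how the two-sided embedding condition $(1-\epsilon)A^TA \preceq A^TS^TSA \preceq (1+\epsilon)A^TA$ can be checked numerically. The sketch size and test matrix are illustrative choices, not prescriptions from this paper.

```python
import numpy as np

def gaussian_sketch(s, n, rng):
    """Gaussian sketching matrix with i.i.d. N(0, 1/s) entries."""
    return rng.standard_normal((s, n)) / np.sqrt(s)

def embedding_epsilon(S, A):
    """Smallest eps with (1-eps) A^T A <= (SA)^T (SA) <= (1+eps) A^T A,
    obtained from the singular values of S applied to an orthonormal basis of A."""
    Q, _ = np.linalg.qr(A)                      # orthonormal basis of the column space of A
    sv = np.linalg.svd(S @ Q, compute_uv=False)
    return max(abs(sv.max()**2 - 1.0), abs(1.0 - sv.min()**2))

rng = np.random.default_rng(0)
n, d, s = 2000, 10, 400
A = rng.standard_normal((n, d))
S = gaussian_sketch(s, n, rng)
print(embedding_epsilon(S, A))                  # typically well below 1 for s >> d
```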

Leverage score sketching matrix.

A leverage score sketching matrix $S = D\Omega^T \in \mathbb{R}^{s \times n}$ w.r.t. $A \in \mathbb{R}^{n \times d}$ is defined by sampling probabilities $p_i$, a sampling matrix $\Omega \in \mathbb{R}^{n \times s}$, and a diagonal rescaling matrix $D \in \mathbb{R}^{s \times s}$. Specifically, we construct $S$ as follows. For every $j = 1, \dots, s$, independently and with replacement, pick an index $i$ from the set $\{1, 2, \dots, n\}$ with probability $p_i$, and set $\Omega_{ij} = 1$ and $\Omega_{kj} = 0$ for $k \ne i$, as well as $D_{jj} = 1/\sqrt{p_i s}$. The sampling probabilities $p_i$ are the leverage scores of $A$, defined as follows. Let $U$ be a column orthonormal basis of $A$, and let $u_i$ denote the $i$-th row of $U$. Then $p_i = \|u_i\|^2/d$ for $i = 1, \dots, n$ are the (normalized) leverage scores of $A$. To achieve the $\epsilon$-subspace embedding property w.r.t. $A$, $s = \mathcal{O}(d\log(d/\delta)/\epsilon^2)$ is sufficient.
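Below is a small NumPy sketch of this construction under the stated assumptions, with exact leverage scores computed from a thin QR factorization; in practice the scores are often approximated, and the sizes used here are purely illustrative.

```python
import numpy as np

def leverage_score_sketch(A, s, rng):
    """Build a leverage-score sampling sketch: sample rows of A with
    probabilities p_i = ||u_i||^2 / d and rescale by 1 / sqrt(p_i * s)."""
    n, d = A.shape
    Q, _ = np.linalg.qr(A)             # column orthonormal basis of A
    p = np.sum(Q**2, axis=1) / d       # leverage-score probabilities, sum to 1
    idx = rng.choice(n, size=s, replace=True, p=p)
    S = np.zeros((s, n))
    S[np.arange(s), idx] = 1.0 / np.sqrt(p[idx] * s)
    return S

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 10))
S = leverage_score_sketch(A, s=400, rng=rng)
print((S @ A).shape)                   # (400, 10): a much smaller sketch of A
```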

Sparse embedding matrix.

A sparse embedding (count sketch) matrix $S \in \mathbb{R}^{s \times n}$ is a matrix in which each column contains exactly one nonzero entry, whose position is chosen uniformly at random from $\{1, \dots, s\}$ and whose value is sampled uniformly from $\{+1, -1\}$ [4]. Hence, it is very efficient to compute $SA$, especially when $A$ is sparse. To achieve the $\epsilon$-subspace embedding property w.r.t. $A$, $s = \mathcal{O}(d^2/(\delta\epsilon^2))$ is sufficient [13, 23].
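The construction takes only a few lines; the sketch below (with hypothetical sizes) builds the sparse embedding as an explicit sparse matrix, although in practice $SA$ is usually computed by streaming over the nonzeros of $A$ without materializing $S$.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_embedding(s, n, rng):
    """Count-sketch matrix: each column holds a single +/-1 entry
    placed in a uniformly random row."""
    rows = rng.integers(0, s, size=n)          # target row for each column
    signs = rng.choice([-1.0, 1.0], size=n)    # random sign for each column
    return csr_matrix((signs, (rows, np.arange(n))), shape=(s, n))

rng = np.random.default_rng(0)
n, d = 2000, 10
A = rng.standard_normal((n, d))
S = sparse_embedding(s=500, n=n, rng=rng)
print((S @ A).shape)                            # (500, 10), computed in O(nnz(A)) time
```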

Other sketching matrices, such as the subsampled randomized Hadamard transform [7, 9], and their properties can be found in the survey [23].

2.3 Assumptions and Notions

In this paper, we focus on the problem described in Eqn. (1). Moreover, we will make the following two assumptions.

Assumption 1

The objective function $F$ is $\mu$-strongly convex, that is,

$F(y) \ge F(x) + \nabla F(x)^T (y - x) + \frac{\mu}{2}\|y - x\|^2 \quad \text{for all } x, y.$

Assumption 2

The gradient $\nabla F(x)$ is $L$-Lipschitz continuous, that is,

$\|\nabla F(x) - \nabla F(y)\| \le L\|x - y\| \quad \text{for all } x, y.$

Assumptions 1 and 2 imply that for any $x$, we have

$\mu I \preceq \nabla^2 F(x) \preceq L I,$

where $I$ is the identity matrix of appropriate size. With a slight abuse of notation, we define

$\kappa = \frac{L}{\mu}.$

In fact, $\kappa$ is an upper bound on the condition number of the Hessian matrix $\nabla^2 F(x)$ for any $x$.

Besides, if $\nabla^2 F(x)$ is Lipschitz continuous, then we have

$\|\nabla^2 F(x) - \nabla^2 F(y)\| \le \hat{L}\|x - y\|,$

where $\hat{L}$ is the Lipschitz constant of $\nabla^2 F(x)$.

Throughout this paper, we use the notions of linear, superlinear, and quadratic convergence rates. The convergence rates we use are defined w.r.t. the distance of the iterate $x^{(t)}$ to the optimal solution $x^*$ of Problem (1). A sequence of vectors $\{x^{(t)}\}$ is said to converge linearly to the limit point $x^*$ if, for some $0 < \rho < 1$,

$\|x^{(t+1)} - x^*\| \le \rho\,\|x^{(t)} - x^*\|.$

Similarly, superlinear convergence and quadratic convergence are respectively defined as

$\|x^{(t+1)} - x^*\| \le \rho_t\,\|x^{(t)} - x^*\| \text{ with } \rho_t \to 0, \qquad \text{and} \qquad \|x^{(t+1)} - x^*\| \le \rho\,\|x^{(t)} - x^*\|^2.$

We call it a linear-quadratic convergence rate if the following condition holds:

$\|x^{(t+1)} - x^*\| \le \rho_1\,\|x^{(t)} - x^*\| + \rho_2\,\|x^{(t)} - x^*\|^2,$

where $0 < \rho_1 < 1$ and $\rho_2 > 0$.

A small $\|x^{(t)} - x^*\|$ directly implies a small difference between $F(x^{(t)})$ and $F(x^*)$. Indeed, since $\nabla F(x^*) = 0$, Assumption 2 gives $F(x^{(t)}) - F(x^*) \le \frac{L}{2}\|x^{(t)} - x^*\|^2$, and the $\mu$-strong convexity gives $F(x^{(t)}) - F(x^*) \ge \frac{\mu}{2}\|x^{(t)} - x^*\|^2$. Hence the objective gap is controlled, up to the constants $L$ and $\mu$, by $\|x^{(t)} - x^*\|^2$.

3 Approximate Newton Methods and Local Convergence Analysis

The existing variants of stochastic second order methods share some important attributes. First, methods such as NewSamp [8], LiSSA [1], subsampled Newton with conjugate gradient [3], and subsampled Newton with non-uniform sampling [24] all have the same convergence property: a linear-quadratic convergence rate.

Second, they follow the same algorithmic procedure, summarized as follows. In each iteration, they first construct an approximate Hessian matrix $H^{(t)}$ such that

$(1 - \epsilon_0)\,\nabla^2 F(x^{(t)}) \preceq H^{(t)} \preceq (1 + \epsilon_0)\,\nabla^2 F(x^{(t)}),$   (2)

where $0 \le \epsilon_0 < 1$. Then they solve the following optimization problem

$\min_{p}\; \frac{1}{2}\, p^T H^{(t)} p - p^T \nabla F(x^{(t)})$   (3)

approximately or exactly to obtain the direction vector $p^{(t)}$. Finally, the update is given as $x^{(t+1)} = x^{(t)} - p^{(t)}$. Because of this shared procedure, we refer to these stochastic second order methods as approximate Newton methods.
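As a concrete illustration of this generic procedure, here is a minimal sketch of one approximate Newton iteration, with the approximate Hessian builder left abstract and the subproblem (3) solved inexactly by conjugate gradient. The helper names (`build_approx_hessian`, `cg`) and the tolerances are hypothetical, not part of the paper.

```python
import numpy as np

def cg(H, g, tol=1e-2, max_iter=50):
    """Conjugate gradient for H p = g; stops once the relative residual
    drops below tol, i.e., it returns an inexact solution of Problem (3)."""
    p = np.zeros_like(g)
    r = g.copy()                       # residual g - H p with p = 0
    d = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Hd = H @ d
        alpha = rs / (d @ Hd)
        p += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(g):
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

def approximate_newton_step(x, grad, build_approx_hessian):
    """One iteration: H approximates the Hessian at x (Eqn. (2)),
    p approximately solves H p = grad(x) (Problem (3)), then x <- x - p."""
    g = grad(x)
    H = build_approx_hessian(x)        # e.g., subsampled or sketched Hessian
    p = cg(H, g)
    return x - p
```

The different choices of the approximate Hessian builder discussed in the following sections (sketching, subsampling, regularized subsampling) all instantiate this template.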

In the following theorem, we propose a unifying framework which describes the convergence properties of the second order optimization procedure depicted above.

Theorem 3. Let Assumptions 1 and 2 hold. Suppose that $\nabla^2 F(x)$ exists and is continuous in a neighborhood of a minimizer $x^*$, and that $H^{(t)}$ is a positive definite matrix satisfying Eqn. (2) with $0 \le \epsilon_0 < 1$. Let $p^{(t)}$ be an approximate solution of Problem (3) such that

$\|\nabla F(x^{(t)}) - H^{(t)} p^{(t)}\| \le \epsilon_1 \|\nabla F(x^{(t)})\|,$   (4)

where $0 \le \epsilon_1 < 1$. Consider the iteration $x^{(t+1)} = x^{(t)} - p^{(t)}$.

(a) There exist a sufficiently small value $\gamma > 0$ and parameters $\epsilon_0$, $\epsilon_1$ such that when $\|x^{(t)} - x^*\| \le \gamma$, we have that

(5)

Besides, the additional terms in Eqn. (5) go to $0$ as $t$ goes to infinity.

(b) Furthermore, if $\nabla^2 F(x)$ is Lipschitz continuous with parameter $\hat{L}$ and the following condition is satisfied:

(6)

then it holds that

(7)

From Theorem 3, we can draw some important insights. First, Theorem 3 provides sufficient conditions for different convergence rates, including superlinear and quadratic rates. If $\epsilon_0$ and $\epsilon_1$ are constants, then the sequence converges linearly because the additional terms in (5) vanish as $t$ goes to infinity. If we choose $\epsilon_0$ and $\epsilon_1$ such that they decrease to $0$ as $t$ increases, then the sequence converges superlinearly. Similarly, if $\epsilon_0 = 0$, $\epsilon_1 = 0$, and $\nabla^2 F(x)$ is Lipschitz continuous, then the sequence converges quadratically.

Second, Theorem 3 makes it clear that the Lipschitz continuity of $\nabla^2 F(x)$ is not necessary for the linear or superlinear convergence of stochastic second order methods, including the subsampled Newton method, sketch Newton, NewSamp, etc. This reveals why NewSamp can be used to train the smoothed SVM, where the Lipschitz continuity of the Hessian matrix is not satisfied. The Lipschitz continuity condition is only needed to obtain quadratic or linear-quadratic convergence. This explains why LiSSA [1], NewSamp [8], subsampled Newton with non-uniform sampling [24], and sketch Newton [17] all have a linear-quadratic convergence rate: they all assume that the Hessian is Lipschitz continuous. In fact, it is well known that the Lipschitz continuity of $\nabla^2 F(x)$ is not necessary for a linear or superlinear convergence rate of inexact Newton methods.

Third, the unifying framework of Theorem 3 covers not only stochastic second order methods but also their deterministic counterparts. For example, letting $H^{(t)} = \nabla^2 F(x^{(t)})$ and using conjugate gradient to compute $p^{(t)}$, we obtain the famous "Newton-CG" method. In fact, different choices of $H^{(t)}$ and different ways of calculating $p^{(t)}$ lead to different second order methods. In the following sections, we use this framework to analyze the local convergence performance of these second order methods in detail.

4 Sketch Newton Method

In this section, we use Theorem 3 to analyze the local convergence properties of sketch Newton (Algorithm 1). We mainly focus on the case where the Hessian matrix is of the form

$\nabla^2 F(x) = B(x)^T B(x),$   (8)

where $B(x) \in \mathbb{R}^{n \times d}$ is an explicitly available matrix. Our result can be easily extended to the case where

$\nabla^2 F(x) = B(x)^T B(x) + Q(x),$

where $Q(x)$ is a positive semidefinite matrix related to the Hessian of a regularizer.

1:  Input: $x^{(0)}$, $\epsilon_0$, $\delta$;
2:  for $t = 0, 1, \dots$ until termination do
3:     Construct an $\epsilon_0$-subspace embedding matrix $S^{(t)}$ for $B(x^{(t)})$, where $\nabla^2 F(x^{(t)})$ is of the form $B(x^{(t)})^T B(x^{(t)})$, and calculate $H^{(t)} = \big(S^{(t)} B(x^{(t)})\big)^T S^{(t)} B(x^{(t)})$;
4:     Calculate the direction $p^{(t)}$ by solving $H^{(t)} p = \nabla F(x^{(t)})$ exactly or approximately (Eqn. (4));
5:     Update $x^{(t+1)} = x^{(t)} - p^{(t)}$;
6:  end for
Algorithm 1 Sketch Newton.
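A minimal NumPy sketch of one iteration of this scheme, under the assumption that a matrix $B(x)$ with $\nabla^2 F(x) = B(x)^T B(x)$ is available; a Gaussian sketch is used, the linear system is solved exactly, and the function names and sizes are illustrative.

```python
import numpy as np

def sketch_newton_step(x, grad, B, s, rng):
    """One sketch Newton iteration: approximate the Hessian B(x)^T B(x)
    by (S B(x))^T (S B(x)) with a Gaussian sketch S, then take a Newton-type step."""
    Bx = B(x)                                  # n x d factor with Bx^T Bx = Hessian
    n = Bx.shape[0]
    S = rng.standard_normal((s, n)) / np.sqrt(s)
    SB = S @ Bx                                # s x d sketched factor
    H = SB.T @ SB                              # approximate Hessian, cf. Eqn. (2)
    p = np.linalg.solve(H, grad(x))            # direction from Problem (3)
    return x - p

# toy usage: least squares F(x) = 1/(2n) ||Ax - b||^2, so B(x) = A / sqrt(n)
rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 20)); b = rng.standard_normal(5000)
n = A.shape[0]
grad = lambda x: A.T @ (A @ x - b) / n
B = lambda x: A / np.sqrt(n)
x = np.zeros(20)
for _ in range(5):
    x = sketch_newton_step(x, grad, B, s=200, rng=rng)
print(np.linalg.norm(grad(x)))                 # should be small after a few steps
```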

Theorem 4. Let the objective function $F$ satisfy the conditions described in Theorem 3, and assume the Hessian matrix is given as in Eqn. (8). Let $0 \le \epsilon_0 < 1$, $0 \le \epsilon_1 < 1$, and $0 < \delta < 1$ be given. Suppose $S^{(t)}$ is an $\epsilon_0$-subspace embedding matrix w.r.t. $B(x^{(t)})$ with probability at least $1 - \delta$, and the direction vector $p^{(t)}$ satisfies Eqn. (4). Then Algorithm 1 has the following convergence properties:

  (a) There exist a sufficiently small value $\gamma > 0$ and parameters $\epsilon_0$, $\epsilon_1$ such that when $\|x^{(t)} - x^*\| \le \gamma$, each iteration satisfies Eqn. (5) with probability at least $1 - \delta$.

  (b) If $\nabla^2 F(x)$ is also Lipschitz continuous and the condition in Eqn. (6) holds, then each iteration satisfies Eqn. (7) with probability at least $1 - \delta$.

Theorem 4 directly provides a bound on the sketched size. Using the leverage score sketching matrix as an example, a sketched size of $\mathcal{O}(d\log(d/\delta)/\epsilon_0^2)$ is sufficient. We compare our theoretical bound on the sketched size with those of Pilanci and Wainwright [17] and Xu et al. [24] in Table 1. As we can see, our sketched size is much smaller than the other two, especially when the Hessian matrix is ill-conditioned.

Theorem 4 shows that the sketched size is independent of the condition number of the Hessian matrix, as shown in Table 1. This explains why, when the Hessian matrix is ill-conditioned, sketch Newton performs well even when the sketched size is only several times $d$. For a large condition number, the theoretical bounds of both Xu et al. [24] and Pilanci and Wainwright [17] may exceed the number of samples $n$. Note that the theoretical results of [24] and [17] still hold for constrained optimization problems, whereas our result establishes the effectiveness of sketch Newton for unconstrained optimization in the ill-conditioned case.

Theorem 4 also allows the possibility of achieving an asymptotically superlinear rate by using an iteration-dependent sketching accuracy. In particular, we present the following corollary.

Suppose the objective function satisfies the properties described in Theorem 3. Consider Algorithm 1 in which the iteration-dependent accuracies $\epsilon_0^{(t)}$ and $\epsilon_1^{(t)}$ are chosen to decrease to $0$ as $t$ increases. If the initial point $x^{(0)}$ is close enough to the optimal point $x^*$, then the sequence $\{x^{(t)}\}$ converges superlinearly.

Reference                       Sketched size             Condition-number free?
Pilanci and Wainwright [17]     grows with $\kappa$       No
Xu et al. [24]                  grows with $\kappa$       No
Our result (Theorem 4)          independent of $\kappa$   Yes
Table 1: Comparison with previous work.

5 The Subsampled Newton Method and Variants

In this section, we apply Theorem 3 to analyze subsampled Newton methods. First, we make the assumption that each $f_i$ and $F$ have the following properties:

$\|\nabla^2 f_i(x)\| \le K \quad \text{for all } i \text{ and all } x,$   (9)
$\nabla^2 F(x) \succeq \mu I \quad \text{for all } x.$   (10)

It immediately follows from Eqns. (9) and (10) that $\hat{\kappa} \triangleq K/\mu \ge \kappa$. Accordingly, if $F$ is ill-conditioned, then the value $\hat{\kappa}$ is large.
5.1 The Subsampled Newton Method

The subsampled Newton method is depicted in Algorithm 2, and we now give its local convergence properties in the following theorem. Let the objective function $F$ satisfy the properties described in Theorem 3. Assume Eqn. (9) and Eqn. (10) hold, and let $0 \le \epsilon_0 < 1$ and $0 < \delta < 1$ be given. The sample size $|\mathcal{S}|$ and the subsampled Hessian $H^{(t)}$ are set as in Algorithm 2, and the direction vector $p^{(t)}$ satisfies Eqn. (4). Then, for $t = 0, 1, \dots$, Algorithm 2 has the following convergence properties:

  (a) There exist a sufficiently small value $\gamma > 0$ and parameters $\epsilon_0$, $\epsilon_1$ such that when $\|x^{(t)} - x^*\| \le \gamma$, each iteration satisfies Eqn. (5) with probability at least $1 - \delta$.

  (b) If $\nabla^2 F(x)$ is also Lipschitz continuous with parameter $\hat{L}$ and the condition in Eqn. (6) holds, then each iteration satisfies Eqn. (7) with probability at least $1 - \delta$.

As we can see, Algorithm 2 has almost the same convergence properties as Algorithm 1, except for several minor differences. The main difference lies in the manner of constructing $H^{(t)}$, which should satisfy Eqn. (2). Algorithm 2 relies on the assumption that each $\nabla^2 f_i(x)$ is upper bounded (i.e., Eqn. (9) holds), while Algorithm 1 is built on the Hessian matrix having the form of Eqn. (8).

1:  Input: $x^{(0)}$, $\epsilon_0$, $\delta$;
2:  Set the sample size $|\mathcal{S}|$.
3:  for $t = 0, 1, \dots$ until termination do
4:     Select a sample set $\mathcal{S} \subseteq \{1, \dots, n\}$ of size $|\mathcal{S}|$ uniformly at random and construct $H^{(t)} = \frac{1}{|\mathcal{S}|}\sum_{i \in \mathcal{S}} \nabla^2 f_i(x^{(t)})$;
5:     Calculate the direction $p^{(t)}$ by solving $H^{(t)} p = \nabla F(x^{(t)})$ exactly or approximately (Eqn. (4));
6:     Update $x^{(t+1)} = x^{(t)} - p^{(t)}$;
7:  end for
Algorithm 2 Subsampled Newton.
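A compact NumPy sketch of the subsampled Newton update, assuming access to the per-sample Hessians $\nabla^2 f_i$ through a callable; the sample size and the exact linear solve are illustrative choices.

```python
import numpy as np

def subsampled_newton_step(x, grad, hess_i, n, m, rng):
    """One subsampled Newton iteration: average m uniformly sampled per-sample
    Hessians to form H (cf. Eqn. (2)), then move along p = H^{-1} grad(x)."""
    idx = rng.choice(n, size=m, replace=True)     # uniform subsample of {0, ..., n-1}
    H = sum(hess_i(i, x) for i in idx) / m        # subsampled Hessian
    p = np.linalg.solve(H, grad(x))               # direction from Problem (3)
    return x - p
```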

5.2 Regularized Subsampled Newton

In ill-conditioned cases (i.e., $\hat{\kappa}$ is large), the subsampled Newton method in Algorithm 2 has to take a large number of samples because the required sample size depends on $\hat{\kappa}$ quadratically. To overcome this problem, one resorts to a regularized subsampled Newton method. The key idea is to add a regularization term $\alpha I$ to the original subsampled Hessian, as described in Algorithm 3. Erdogdu and Montanari [8] proposed NewSamp, another regularized subsampled Newton method, depicted in Algorithm 4. In the following analysis, we prove that adding a regularizer is an effective way to reduce the sample size while keeping convergence in theory.

We first give the theoretical analysis of the local convergence properties of Algorithm 3. Let the objective function $F$ satisfy the properties described in Theorem 3. Assume Eqns. (9) and (10) hold, and let $0 < \delta < 1$ be given. Assume the regularizer $\alpha$ is an appropriately chosen constant, the subsampled size $|\mathcal{S}|$ is sufficiently large, and $H^{(t)}$ is constructed as in Algorithm 3. Define $\epsilon_0$ by

(11)

which implies that $\epsilon_0 < 1$. Besides, the direction vector $p^{(t)}$ satisfies Eqn. (4). Then Algorithm 3 has the following convergence properties:

  (a) There exist a sufficiently small value $\gamma > 0$ and parameters $\epsilon_0$, $\epsilon_1$ such that when $\|x^{(t)} - x^*\| \le \gamma$, each iteration satisfies Eqn. (5) with probability at least $1 - \delta$.

  (b) If $\nabla^2 F(x)$ is also Lipschitz continuous with parameter $\hat{L}$ and the condition in Eqn. (6) holds, then each iteration satisfies Eqn. (7) with probability at least $1 - \delta$.

In Theorem 5.2, the parameter $\epsilon_0$ mainly determines the convergence properties of Algorithm 3. It is determined by two terms, as shown in Eqn. (11). These two terms depict the relationship among the sample size, the regularizer $\alpha$, and the convergence rate.

The first term describes the relationship between the regularizer and the sample size. Without loss of generality, consider a choice of $\alpha$ satisfying the condition of Theorem 5.2. Then the required sample size decreases as $\alpha$ increases. Hence Theorem 5.2 gives a theoretical guarantee that adding a regularizer is an effective approach for reducing the sample size when $\hat{\kappa}$ is large. Conversely, if we want to sample only a small fraction of the $f_i$'s, then we should choose a large $\alpha$; otherwise, $\epsilon_0$ will approach $1$, which means the sequence does not converge.

Though a large $\alpha$ can reduce the sample size, this comes at the expense of a slower convergence rate, as the second term shows. As we can see, $\epsilon_0$ approaches $1$ as $\alpha$ increases. Besides, the inexactness parameter $\epsilon_1$ then also has to decrease; otherwise, the overall contraction factor may exceed $1$, which means that Algorithm 3 will not converge.

In fact, the slower convergence rate caused by adding a regularizer arises because the sample size becomes small, so less curvature information is captured. However, a small sample size also implies a low computational cost per iteration. Therefore, a regularizer that properly balances the per-iteration cost and the convergence rate is the key to the regularized subsampled Newton algorithm.

1:  Input: $x^{(0)}$, $\delta$, regularizer parameter $\alpha$, sample size $|\mathcal{S}|$;
2:  for $t = 0, 1, \dots$ until termination do
3:     Select a sample set $\mathcal{S} \subseteq \{1, \dots, n\}$ of size $|\mathcal{S}|$ uniformly at random and construct $H_{\mathcal{S}}^{(t)} = \frac{1}{|\mathcal{S}|}\sum_{i \in \mathcal{S}} \nabla^2 f_i(x^{(t)})$;
4:     Calculate $H^{(t)} = H_{\mathcal{S}}^{(t)} + \alpha I$ and the direction $p^{(t)}$ by solving $H^{(t)} p = \nabla F(x^{(t)})$ exactly or approximately;
5:     Update $x^{(t+1)} = x^{(t)} - p^{(t)}$;
6:  end for
Algorithm 3 Regularized Subsampled Newton.
1:  Input: $x^{(0)}$, $\delta$, target rank $k$, sample size $|\mathcal{S}|$;
2:  for $t = 0, 1, \dots$ until termination do
3:     Select a sample set $\mathcal{S} \subseteq \{1, \dots, n\}$ of size $|\mathcal{S}|$ uniformly at random and get $H_{\mathcal{S}}^{(t)} = \frac{1}{|\mathcal{S}|}\sum_{i \in \mathcal{S}} \nabla^2 f_i(x^{(t)})$;
4:     Compute the rank-$k$ truncated SVD decomposition of $H_{\mathcal{S}}^{(t)}$ to get $U_k$ and $\Lambda_k$, and construct the approximate Hessian $H^{(t)}$ from them;
5:     Calculate the direction $p^{(t)}$ by solving $H^{(t)} p = \nabla F(x^{(t)})$ exactly or approximately;
6:     Update $x^{(t+1)} = x^{(t)} - p^{(t)}$;
7:  end for
Algorithm 4 NewSamp.
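To make the two constructions concrete, here is a short NumPy sketch of the approximate Hessians used by Algorithm 3 (add $\alpha I$ to a subsampled Hessian) and by NewSamp (rank-$k$ truncation of a subsampled Hessian). The truncated construction follows the usual description of NewSamp and should be read as an illustrative assumption rather than a verbatim restatement of Algorithm 4.

```python
import numpy as np

def subsampled_hessian(x, hess_i, n, m, rng):
    """Average of m per-sample Hessians drawn uniformly with replacement."""
    idx = rng.choice(n, size=m, replace=True)
    return sum(hess_i(i, x) for i in idx) / m

def regularized_subsampled_hessian(x, hess_i, n, m, alpha, rng):
    """Algorithm 3 style approximate Hessian: H_S + alpha * I."""
    H_S = subsampled_hessian(x, hess_i, n, m, rng)
    return H_S + alpha * np.eye(H_S.shape[0])

def newsamp_hessian(x, hess_i, n, m, k, rng):
    """NewSamp-style approximate Hessian: keep the top-k eigenpairs of the
    subsampled Hessian and replace the tail spectrum by its (k+1)-th eigenvalue."""
    H_S = subsampled_hessian(x, hess_i, n, m, rng)
    vals, vecs = np.linalg.eigh(H_S)           # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]     # sort in descending order
    U_k, lam_k, lam_next = vecs[:, :k], vals[:k], vals[k]
    d = H_S.shape[0]
    return U_k @ np.diag(lam_k) @ U_k.T + lam_next * (np.eye(d) - U_k @ U_k.T)
```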

Next, we give the theoretical analysis of local convergence properties of NewSamp (Algorithm 4).

Let the objective function $F$ satisfy the properties described in Theorem 3. Assume Eqn. (9) and Eqn. (10) hold, and let $0 < \delta < 1$ and the target rank $k$ be given. Let the truncation level be a constant chosen with respect to $\lambda_{k+1}$, the $(k+1)$-th largest eigenvalue of the Hessian matrix, set the subsampled size $|\mathcal{S}|$ sufficiently large, and define $\epsilon_0$ by

(12)

which implies $\epsilon_0 < 1$. Assume the direction vector $p^{(t)}$ satisfies Eqn. (4). Then, for $t = 0, 1, \dots$, Algorithm 4 has the following convergence properties:

  (a) There exist a sufficiently small value $\gamma > 0$ and parameters $\epsilon_0$, $\epsilon_1$ such that when $\|x^{(t)} - x^*\| \le \gamma$, each iteration satisfies Eqn. (5) with probability at least $1 - \delta$.

  (b) If $\nabla^2 F(x)$ is also Lipschitz continuous with parameter $\hat{L}$ and the condition in Eqn. (6) holds, then each iteration satisfies Eqn. (7) with probability at least $1 - \delta$.

Similar to Theorem 5.2, the parameter $\epsilon_0$ in NewSamp is also determined by two terms. The first term reveals the relationship between the target rank and the sample size: roughly speaking, a smaller target rank $k$ allows a smaller sample size. Conversely, if we want to sample only a small portion of the $f_i$'s, then we should choose a small $k$; otherwise, $\epsilon_0$ will approach $1$, which means the sequence does not converge. The second term shows that a small sample size leads to a poor convergence rate: with a very small target rank and sample size, $\epsilon_0$ approaches the contraction factor of first-order methods, so the convergence rate of NewSamp becomes almost the same as that of gradient descent. Similar to Algorithm 3, this regime also requires a precise solution to Problem (3) and an initial point close to the optimal point $x^*$.

It is worth pointing out that Theorem 5.2 explains the empirical observation that NewSamp is applicable to training the smoothed SVM, for which the Lipschitz continuity condition on $\nabla^2 F(x)$ is not satisfied [8].

We now compare the two theorems above, focusing on the parameter $\epsilon_0$, which mainly determines the convergence properties of Algorithm 3 and Algorithm 4. Specifically, if we set $\alpha = \lambda_{k+1}$ in Eqn. (11), then the resulting term equals the second term on the right-hand side of Eqn. (12). Hence, we can regard NewSamp as a special case of Algorithm 3. However, NewSamp provides an approach for choosing the regularizer automatically.

Recall that NewSamp involves another parameter, the target rank $k$, so NewSamp and Algorithm 3 have the same number of free parameters. If $k$ is not chosen properly, NewSamp will still perform poorly. Therefore, Algorithm 3 is theoretically preferable, because NewSamp needs extra cost to perform SVDs.

5.3 Subsampled Hessian and Gradient

In fact, we can also subsample the gradient to accelerate the subsampled Newton method. The detailed procedure is presented in Algorithm 5 [3, 19].

Let the objective function $F$ satisfy the properties described in Theorem 3. We also assume Eqn. (9) and Eqn. (10) hold, and let $0 \le \epsilon_0 < 1$ and $0 < \delta < 1$ be given. Let the Hessian sample size $|\mathcal{S}_H|$ and the gradient sample size $|\mathcal{S}_g|$ be set such that Eqn. (2) holds and the subsampled gradient is sufficiently close to the full gradient.

The direction vector $p^{(t)}$ is computed as in Algorithm 5. Then, for $t = 0, 1, \dots$, we have the following convergence properties:

  (a) There exist a sufficiently small value $\gamma > 0$ and parameters $\epsilon_0$, $\epsilon_1$ such that when $\|x^{(t)} - x^*\| \le \gamma$, each iteration satisfies a bound of the form of Eqn. (5), with an additional term caused by the subsampled gradient, with probability at least $1 - \delta$.

  (b) If $\nabla^2 F(x)$ is also Lipschitz continuous with parameter $\hat{L}$ and the condition in Eqn. (6) holds, then each iteration satisfies a bound of the form of Eqn. (7), with an additional term caused by the subsampled gradient, with probability at least $1 - \delta$.

In common cases, the subsampled gradient needs to be computed over a large fraction of the samples to guarantee convergence of the algorithm. Roosta-Khorasani and Mahoney [19] showed that the required gradient sample size grows as the gradient norm shrinks: when $x^{(t)}$ is close to $x^*$, $\|\nabla F(x^{(t)})\|$ is close to $0$, so the required gradient sample size goes to $n$ as the iterations proceed. This is the reason why the Newton method and the variants of the subsampled Newton method are very sensitive to the accuracy of the subsampled gradient.

1:  Input: $x^{(0)}$, $\epsilon_0$, $\delta$;
2:  Set the Hessian sample size $|\mathcal{S}_H|$ and the gradient sample size $|\mathcal{S}_g|$.
3:  for $t = 0, 1, \dots$ until termination do
4:     Select a sample set $\mathcal{S}_H$ of size $|\mathcal{S}_H|$ and construct $H^{(t)} = \frac{1}{|\mathcal{S}_H|}\sum_{i \in \mathcal{S}_H} \nabla^2 f_i(x^{(t)})$;
5:     Select a sample set $\mathcal{S}_g$ of size $|\mathcal{S}_g|$ and calculate $g^{(t)} = \frac{1}{|\mathcal{S}_g|}\sum_{i \in \mathcal{S}_g} \nabla f_i(x^{(t)})$.
6:     Calculate the direction $p^{(t)}$ by solving $H^{(t)} p = g^{(t)}$ exactly or approximately;
7:     Update $x^{(t+1)} = x^{(t)} - p^{(t)}$;
8:  end for
Algorithm 5 Subsampled Hessian and Subsampled Gradient.
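A compact sketch of the iteration of Algorithm 5 under the same assumptions as before (callable per-sample gradients and Hessians, uniform sampling); the sample sizes are illustrative.

```python
import numpy as np

def subsampled_hg_step(x, grad_i, hess_i, n, m_h, m_g, rng):
    """One iteration using both a subsampled Hessian and a subsampled gradient."""
    idx_h = rng.choice(n, size=m_h, replace=True)
    idx_g = rng.choice(n, size=m_g, replace=True)
    H = sum(hess_i(i, x) for i in idx_h) / m_h    # subsampled Hessian
    g = sum(grad_i(i, x) for i in idx_g) / m_g    # subsampled gradient
    p = np.linalg.solve(H, g)
    return x - p
```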

6 Inexact Newton Methods

Let $H^{(t)} = \nabla^2 F(x^{(t)})$, that is, $\epsilon_0 = 0$ in Eqn. (2). Then Theorem 3 describes the convergence properties of inexact Newton methods.
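In this special case the framework reduces to the classical Newton-CG scheme. The sketch below assumes the exact Hessian is available and makes the solve inexact simply by capping the number of conjugate gradient iterations; the leftover residual plays the role of $\epsilon_1$ in Eqn. (4). The function and parameter names are illustrative.

```python
import numpy as np
from scipy.sparse.linalg import cg as conjugate_gradient

def newton_cg_step(x, grad, hess, max_cg_iter=20):
    """Inexact Newton step: solve hess(x) p = grad(x) only approximately
    with a limited number of CG iterations, then update x <- x - p."""
    p, _ = conjugate_gradient(hess(x), grad(x), maxiter=max_cg_iter)
    return x - p
```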

Let the objective function $F$ satisfy the properties described in Theorem 3, and let $p^{(t)}$ be a direction vector such that

$\|\nabla F(x^{(t)}) - \nabla^2 F(x^{(t)})\, p^{(t)}\| \le \epsilon_1 \|\nabla F(x^{(t)})\|,$

where $0 \le \epsilon_1 < 1$ is a constant. Consider the iteration $x^{(t+1)} = x^{(t)} - p^{(t)}$.

(a) There exist a sufficiently small value $\gamma > 0$ and a parameter $\epsilon_1$ such that when $\|x^{(t)} - x^*\| \le \gamma$, the iteration satisfies Eqn. (5) with $\epsilon_0 = 0$.

(b) If $\nabla^2 F(x)$ is also Lipschitz continuous with parameter $\hat{L}$, and the condition in Eqn. (6) holds, then the iteration satisfies Eqn. (7) with $\epsilon_0 = 0$.