1 Introduction
Mathematical optimization is an important pillar of machine learning. We consider the following optimization problem
(1) $\min_{x \in \mathbb{R}^d} F(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$,
where the $f_i(x)$ are smooth functions. Many machine learning models can be expressed as (1), where each $f_i$ is the loss with respect to (w.r.t.) the $i$-th training sample. Examples include logistic regression, smoothed support vector machines, neural networks, and graphical models.
Many optimization algorithms for solving problem (1) are based on the following iteration:
$x^{(t+1)} = x^{(t)} - \eta_t Q_t\, g(x^{(t)})$,
where $\eta_t > 0$ is the step length. If $Q_t$ is the identity matrix and $g(x^{(t)}) = \nabla F(x^{(t)})$, the resulting procedure is called Gradient Descent (GD), which achieves sublinear convergence for a general smooth convex objective function and linear convergence for a smooth and strongly convex objective function. When $n$ is large, the full gradient method is inefficient because its iteration cost scales linearly in $n$. Consequently, stochastic gradient descent (SGD) has been a typical alternative [18, 12, 5]. To achieve a cheaper cost per iteration, such a method constructs an approximate gradient on a small minibatch of data. However, its convergence rate can be significantly slower than that of the full gradient methods [15]. Thus, a great deal of effort has been made to devise modifications that achieve the convergence rate of the full gradient method while keeping a low iteration cost [10, 20, 21, 25].
If $Q_t$ is a $d \times d$ positive definite matrix containing curvature information, this formulation leads to second-order methods. It is well known that second-order methods enjoy a superior convergence rate in both theory and practice compared with first-order methods, which only make use of gradient information. The standard Newton method, where $Q_t = [\nabla^2 F(x^{(t)})]^{-1}$, $g(x^{(t)}) = \nabla F(x^{(t)})$, and $\eta_t = 1$, achieves a quadratic convergence rate for smooth and strongly convex objective functions. However, the Newton method takes $O(nd^2 + d^3)$ cost per iteration, so it becomes extremely expensive when $n$ or $d$ is very large. As a result, one tries to construct an approximation of the Hessian such that the update is computationally feasible while retaining sufficient second-order information. One class of such methods is the quasi-Newton methods, which generalize the secant method for finding a root of the first derivative to multidimensional problems. The celebrated Broyden-Fletcher-Goldfarb-Shanno (BFGS) method and its limited-memory version (L-BFGS) are the most popular and widely used [16]. They take $O(nd + d^2)$ cost per iteration.
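The generic iteration above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration only: the objective is an $\ell_2$-regularized logistic regression on synthetic data, and the helper names (`make_problem`, `gradient`, `hessian`, `run`) are hypothetical. The same loop yields GD when $Q_t$ is the identity and Newton's method when $Q_t$ is the inverse Hessian.

```python
import numpy as np

def make_problem(n=200, d=5, lam=0.1, seed=0):
    """Synthetic l2-regularized logistic regression (illustrative data only)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    y = np.where(A @ rng.standard_normal(d) > 0, 1.0, -1.0)
    return A, y, lam

def gradient(x, A, y, lam):
    """Gradient of F(x) = (1/n) sum_i log(1 + exp(-y_i a_i^T x)) + (lam/2) ||x||^2."""
    s = 1.0 / (1.0 + np.exp(y * (A @ x)))    # sigmoid(-y_i a_i^T x)
    return -(A.T @ (y * s)) / len(y) + lam * x

def hessian(x, A, y, lam):
    """Hessian of the same objective."""
    s = 1.0 / (1.0 + np.exp(y * (A @ x)))
    w = s * (1.0 - s)                         # per-sample curvature weights
    return (A.T * w) @ A / len(y) + lam * np.eye(A.shape[1])

def run(A, y, lam, eta, newton, steps):
    """Iterate x <- x - eta * Q g with Q = I (GD) or Q = Hessian^{-1} (Newton)."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        g = gradient(x, A, y, lam)
        direction = np.linalg.solve(hessian(x, A, y, lam), g) if newton else g
        x = x - eta * direction
    return x

A, y, lam = make_problem()
x_gd = run(A, y, lam, eta=1.0, newton=False, steps=10)
x_nt = run(A, y, lam, eta=1.0, newton=True, steps=10)
print("GD     gradient norm:", np.linalg.norm(gradient(x_gd, A, y, lam)))
print("Newton gradient norm:", np.linalg.norm(gradient(x_nt, A, y, lam)))
```

With the same number of iterations, the Newton variant pays for forming and solving with a $d \times d$ Hessian in every step, which is the per-iteration cost the approximate methods discussed in this paper try to reduce.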
Recently, for the case $n \gg d$, a class of methods called subsampled Newton methods has been proposed, which define an approximate Hessian matrix using a small subset of samples. The most naive approach is to sample a subset of the functions $f_i$ randomly [19, 3, 24] to construct a subsampled Hessian. Erdogdu and Montanari [8] proposed a regularized subsampled Newton method called NewSamp. When the Hessian can be written as $\nabla^2 F(x) = [B(x)]^T B(x)$, where $B(x)$ is an available $n \times d$ matrix, Pilanci and Wainwright [17] used sketching techniques to approximate the Hessian and proposed the sketch Newton method. Similarly, Xu et al. [24] proposed to sample rows of $B(x)$ with a nonuniform probability distribution. Agarwal et al. [1] proposed an algorithm called LiSSA that approximates the inverse of the Hessian directly.
Although the convergence performance of stochastic second-order methods has been analyzed, their convergence properties are still not well understood. There are several important gaps between convergence theory and real applications.
The first gap is the necessity of Lipschitz continuity of the Hessian. In previous work, to achieve a linear-quadratic convergence rate, stochastic second-order methods all assume that $\nabla^2 F(x)$ is Lipschitz continuous. However, in real applications where this assumption does not hold, they might still converge to the optimal point. For example, Erdogdu and Montanari [8] used NewSamp to successfully train a smoothed SVM, in which case the Hessian is not Lipschitz continuous.
The second gap concerns the sketched size of sketch Newton methods. To obtain linear convergence, the sketched sizes required in [17], and later improved in [24] using Gaussian sketching matrices, depend on $\kappa$, the condition number of the Hessian matrix in question. However, sketch Newton empirically performs well even when the Hessian matrix is ill-conditioned. A sketched size of several tens of times, or even just several times, $d$ can achieve a linear convergence rate in unconstrained optimization. But the theoretical results of Pilanci and Wainwright [17] and Xu et al. [24] imply that the sketched size may exceed $n$ in ill-conditioned cases.
The third gap concerns the sample size in regularized subsampled Newton methods. In both [8] and [19], the theoretical analysis shows that the sample size of regularized subsampled Newton methods should be the same as that of the conventional subsampled Newton method. In practice, however, adding a large regularizer can markedly reduce the sample size while preserving convergence. This contradicts the extant theoretical analysis [8, 19].
In this paper, we aim to fill these gaps between the current theory and empirical performance. More specifically, we first cast these second order methods into an algorithmic framework that we call approximate Newton. Then we propose a general result for analysis of local convergence properties of second order methods. Based on this framework, we then give detailed theoretical analysis which matches the empirical performance. We summarize our contribution as follows:

We propose a unifying framework (Theorem 3) to analyze local convergence properties of second order methods including stochastic and deterministic versions. The convergence performance of second order methods can be analyzed easily and systematically in this framework.

We prove that the Lipschitz continuity condition of the Hessian is not necessary for achieving linear and superlinear convergence in variants of subsampled Newton, but it is needed to obtain quadratic convergence. This explains the phenomenon that NewSamp [8] can be used to train a smoothed SVM, in which the Lipschitz continuity condition of the Hessian is not satisfied. It also reveals the reason why previous stochastic second-order methods, such as subsampled Newton, sketch Newton, LiSSA, etc., all achieve a linear-quadratic convergence rate.

We prove that the sketched size is independent of the condition number of the Hessian matrix, which explains why sketch Newton performs well even when the Hessian matrix is ill-conditioned.

We provide a theoretical guarantee that adding a regularizer is an effective way to reduce the sample size in subsampled Newton methods while preserving convergence. Our theoretical analysis also shows that adding a regularizer will lead to poorer convergence behavior because the sample size decreases.
1.1 Organization
The remainder of the paper is organized as follows. In Section 2 we present notation and preliminaries. In Section 3 we present a unifying framework for local convergence analysis of second-order methods. In Section 4 we analyze the local convergence properties of sketch Newton methods and prove that the sketched size is independent of the condition number of the Hessian matrix. In Section 5 we give the local convergence behaviors of several variants of the subsampled Newton method. In particular, we reveal the relationship among the sample size, the regularizer, and the convergence rate. In Section 6, we derive the local convergence properties of the inexact Newton method from our framework. In Section 7, we validate our theoretical results experimentally. Finally, we conclude our work in Section 8.
2 Notation and Preliminaries
Section 2.1 defines the notation used in this paper. Section 2.2 introduces matrix sketching techniques and their properties. Section 2.3 describes some important assumptions about the objective function.
2.1 Notation
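Given a matrix $A = [a_{ij}] \in \mathbb{R}^{m \times n}$ of rank $\ell$ and a positive integer $k \le \ell$, its condensed SVD is given as $A = U \Sigma V^T = U_k \Sigma_k V_k^T + U_{\setminus k} \Sigma_{\setminus k} V_{\setminus k}^T$, where $U_k$ and $U_{\setminus k}$ contain the left singular vectors of $A$, $V_k$ and $V_{\setminus k}$ contain the right singular vectors of $A$, and $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_\ell)$ with $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_\ell > 0$ are the nonzero singular values of $A$. We will use $\sigma_{\max}(A)$ to denote the largest singular value and $\sigma_{\min}(A)$ to denote the smallest nonzero singular value. Thus, the condition number of $A$ is defined by $\kappa(A) = \sigma_{\max}(A) / \sigma_{\min}(A)$. If $A$ is positive semidefinite, then $U = V$ and the square root of $A$ can be defined as $A^{1/2} = U \Sigma^{1/2} U^T$. It also holds that $\lambda_i(A) = \sigma_i(A)$, where $\lambda_i(A)$ is the $i$-th largest eigenvalue of $A$, $\lambda_{\max}(A) = \sigma_{\max}(A)$, and $\lambda_{\min}(A) = \sigma_{\min}(A)$.
Additionally, $\|A\|_F = (\sum_{i,j} a_{ij}^2)^{1/2} = (\sum_i \sigma_i^2)^{1/2}$ is the Frobenius norm of $A$ and $\|A\| = \sigma_1$ is the spectral norm. Given a positive definite matrix $M$, $\|x\|_M = \|M^{1/2} x\|$ is called the $M$-norm of $x$. Given square matrices $A$ and $B$ of the same size, we write $A \preceq B$ if $B - A$ is positive semidefinite.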
2.2 Randomized sketching matrices
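We first give an $\epsilon$-subspace embedding property which will be used to sketch Hessian matrices. Then we list some useful types of randomized sketching matrices, including Gaussian projection [9, 11], leverage score sampling [6], and count sketch [4, 14, 13].
$S \in \mathbb{R}^{s \times m}$ is said to be an $\epsilon$-subspace embedding matrix w.r.t. a fixed matrix $A \in \mathbb{R}^{m \times d}$ where $d < m$, if $\|SAx\|^2 = (1 \pm \epsilon) \|Ax\|^2$ (i.e., $(1-\epsilon)\|Ax\|^2 \le \|SAx\|^2 \le (1+\epsilon)\|Ax\|^2$) for all $x \in \mathbb{R}^d$.
From the definition of the subspace embedding matrix, we can derive the following property directly. $S$ is an $\epsilon$-subspace embedding matrix w.r.t. the matrix $A$ if and only if
$(1-\epsilon) A^T A \preceq A^T S^T S A \preceq (1+\epsilon) A^T A$.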
Gaussian sketching matrix.
The most classical sketching matrix is the Gaussian sketching matrix $S \in \mathbb{R}^{s \times m}$, whose entries are i.i.d. from the normal distribution with mean 0 and variance $1/s$. Owing to the well-known concentration properties [23], Gaussian random matrices are very attractive. Besides, $s = O(d/\epsilon^2)$ is enough to guarantee the $\epsilon$-subspace embedding property for any fixed matrix $A \in \mathbb{R}^{m \times d}$. Moreover, $s = O(d/\epsilon^2)$ is the tightest bound among known types of sketching matrices. However, the Gaussian random matrix is usually dense, so it is costly to compute $SA$.
Leverage score sketching matrix.
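A leverage score sketching matrix $S = D\Omega \in \mathbb{R}^{s \times m}$ w.r.t. $A \in \mathbb{R}^{m \times d}$ is defined by sampling probabilities $p_i$, a sampling matrix $\Omega \in \mathbb{R}^{s \times m}$ and a diagonal rescaling matrix $D \in \mathbb{R}^{s \times s}$. Specifically, we construct $S$ as follows. For every $j = 1, \dots, s$, independently and with replacement, pick an index $i$ from the set $\{1, 2, \dots, m\}$ with probability $p_i$, and set $\Omega_{ji} = 1$ and $\Omega_{jk} = 0$ for $k \ne i$ as well as $D_{jj} = 1/\sqrt{p_i s}$. The sampling probabilities $p_i$ are the leverage scores of $A$ defined as follows. Let $V \in \mathbb{R}^{m \times d}$ be the column orthonormal basis of $A$, and let $v_{i,*}$ denote the $i$-th row of $V$. Then $p_i = \|v_{i,*}\|^2 / d$ for $i = 1, \dots, m$ are the leverage scores of $A$. To achieve an $\epsilon$-subspace embedding property w.r.t. $A$, $s = O(d \log d / \epsilon^2)$ is sufficient.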
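The two constructions above can be checked numerically with a short sketch. The helper names (`gaussian_sketch`, `leverage_score_sketch`, `embedding_distortion`) and the test matrix are hypothetical and chosen only for illustration; the code assumes $A$ has full column rank. The distortion is measured as the smallest $\epsilon$ for which the sketched Gram matrix is sandwiched between $(1-\epsilon) A^T A$ and $(1+\epsilon) A^T A$.

```python
import numpy as np

def gaussian_sketch(A, s, rng):
    """Return S A for an s x m Gaussian S with i.i.d. N(0, 1/s) entries."""
    S = rng.standard_normal((s, A.shape[0])) / np.sqrt(s)
    return S @ A

def leverage_score_sketch(A, s, rng):
    """Return S A where S samples s rows by leverage score and rescales them."""
    m, d = A.shape
    V, _ = np.linalg.qr(A)                  # column-orthonormal basis of range(A)
    p = np.sum(V ** 2, axis=1) / d          # leverage scores, normalized to sum to 1
    idx = rng.choice(m, size=s, replace=True, p=p)
    scale = 1.0 / np.sqrt(s * p[idx])       # diagonal rescaling 1 / sqrt(s * p_i)
    return A[idx] * scale[:, None]          # equals S A without forming S

def embedding_distortion(A, SA):
    """Smallest eps with (1-eps) A^T A <= (SA)^T (SA) <= (1+eps) A^T A."""
    _, R = np.linalg.qr(A)
    SQ = np.linalg.solve(R.T, SA.T).T       # S Q, where A = Q R
    sv = np.linalg.svd(SQ, compute_uv=False)
    return float(np.max(np.abs(sv ** 2 - 1.0)))

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 5)) * np.logspace(0, 3, 5)   # ill-conditioned columns
eps_gauss = embedding_distortion(A, gaussian_sketch(A, 1000, rng))
eps_lev = embedding_distortion(A, leverage_score_sketch(A, 1000, rng))
print("Gaussian sketch distortion:      ", eps_gauss)
print("leverage score sketch distortion:", eps_lev)
```

The distortion is computed after factoring out $R$ from $A = QR$, so it depends only on how well the sketch preserves the column space of $A$, not on the conditioning of $A$ itself.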
Sparse embedding matrix.
2.3 Assumptions and Notions
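In this paper, we focus on the problem described in Eqn. (1). Moreover, we will make the following two assumptions.
Assumption 1
The objective function $F$ is $\mu$-strongly convex, that is,
$F(y) \ge F(x) + [\nabla F(x)]^T (y - x) + \frac{\mu}{2} \|y - x\|^2$, for $\mu > 0$.
Assumption 2
$\nabla F(x)$ is $L$-Lipschitz continuous, that is,
$\|\nabla F(x) - \nabla F(y)\| \le L \|y - x\|$, for $L > 0$.
Assumptions 1 and 2 imply that for any $x \in \mathbb{R}^d$, we have
$\mu I \preceq \nabla^2 F(x) \preceq L I$,
where $I$ is the identity matrix of appropriate size. With a slight abuse of notation, we define
$\kappa = L / \mu$.
In fact, $\kappa$ is an upper bound on the condition number of the Hessian matrix $\nabla^2 F(x)$ for any $x$.
Besides, if $\nabla^2 F(x)$ is Lipschitz continuous, then we have
$\|\nabla^2 F(x) - \nabla^2 F(y)\| \le \hat{L} \|x - y\|$,
where $\hat{L} > 0$ is the Lipschitz constant of $\nabla^2 F(x)$.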
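Throughout this paper, we use the notions of linear convergence rate, superlinear convergence rate and quadratic convergence rate. In our paper, the convergence rates we will use are defined w.r.t. $\|\cdot\|_{M^*}$, where $M^* = \nabla^2 F(x^*)$ and $x^*$ is the optimal solution to Problem (1). A sequence of vectors $\{x^{(t)}\}$ is said to converge linearly to a limit point $x^*$, if for some $0 < \rho < 1$,
$\limsup_{t \to \infty} \frac{\|x^{(t+1)} - x^*\|_{M^*}}{\|x^{(t)} - x^*\|_{M^*}} = \rho$.
Similarly, superlinear convergence and quadratic convergence are respectively defined as
$\limsup_{t \to \infty} \frac{\|x^{(t+1)} - x^*\|_{M^*}}{\|x^{(t)} - x^*\|_{M^*}} = 0$ and $\limsup_{t \to \infty} \frac{\|x^{(t+1)} - x^*\|_{M^*}}{\|x^{(t)} - x^*\|_{M^*}^2} = \rho$.
We call it the linear-quadratic convergence rate if the following condition holds:
$\|x^{(t+1)} - x^*\|_{M^*} \le \rho_1 \|x^{(t)} - x^*\|_{M^*} + \rho_2 \|x^{(t)} - x^*\|_{M^*}^2$,
where $0 < \rho_1 < 1$.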
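To make these notions concrete, the following sketch estimates the order of convergence from a sequence of error norms. It assumes the common model $e_{t+1} \approx C e_t^q$; the function name `estimated_order` and the synthetic error sequences are hypothetical. The last sequence follows a recursion of the form $e_{t+1} = \rho_1 e_t + \rho_2 e_t^2$, which is the usual shape of a linear-quadratic rate: the quadratic term dominates while the error is large, and the ratio $e_{t+1}/e_t$ settles at $\rho_1$ once the error is small.

```python
import math

def estimated_order(errors):
    """Estimate q in e_{t+1} ~ C * e_t^q from the last three error values."""
    e0, e1, e2 = errors[-3:]
    return math.log(e2 / e1) / math.log(e1 / e0)

# Linear convergence: e_{t+1} = 0.5 * e_t
linear = [0.5 ** t for t in range(1, 12)]

# Quadratic convergence: e_{t+1} = e_t ** 2
quadratic = [0.5 ** (2 ** t) for t in range(6)]

# Linear-quadratic recursion with rho1 = 0.01 and rho2 = 1 (illustrative values)
rho1, rho2 = 0.01, 1.0
linear_quadratic = [0.5]
for _ in range(8):
    e = linear_quadratic[-1]
    linear_quadratic.append(rho1 * e + rho2 * e * e)

print("order (linear):   ", estimated_order(linear))
print("order (quadratic):", estimated_order(quadratic))
print("final ratio (linear-quadratic):",
      linear_quadratic[-1] / linear_quadratic[-2])
```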
A small directly implies a small difference between and . If , then we have because and . We also have by the strong convexity. Hence, we have
whenever .
3 Approximate Newton Methods and Local Convergence Analysis
The existing variants of stochastic second-order methods share some important attributes. First, these methods, such as NewSamp [8], LiSSA [1], subsampled Newton with conjugate gradient [3], and subsampled Newton with nonuniform sampling [24], all have the same convergence properties; that is, they have a linear-quadratic convergence rate.
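Second, they also share the same algorithmic procedure, summarized as follows. In each iteration, they first construct an approximate Hessian matrix $H^{(t)}$ such that
(2) $(1 - \epsilon_0) H^{(t)} \preceq \nabla^2 F(x^{(t)}) \preceq (1 + \epsilon_0) H^{(t)}$,
where $0 \le \epsilon_0 < 1$. Then they solve the following optimization problem
(3) $\min_{p} \frac{1}{2} p^T H^{(t)} p - p^T \nabla F(x^{(t)})$
approximately or exactly to obtain the direction vector $p^{(t)}$. Finally, their update equation is given as $x^{(t+1)} = x^{(t)} - p^{(t)}$. With this procedure, we regard these stochastic second-order methods as approximate Newton methods.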
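The three-step procedure just described (build an approximate Hessian, solve the quadratic subproblem inexactly, update) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the subproblem is solved with a hand-written conjugate gradient loop that stops at a relative residual `rel_tol`, the test objective is a toy quadratic $F(x) = \frac{1}{2} x^T M x - b^T x$, and the approximate Hessian is simply $M / 0.9$, a hypothetical stand-in for a spectral approximation of the true Hessian.

```python
import numpy as np

def conjugate_gradient(H, g, rel_tol, max_iter=None):
    """Approximately solve H p = g, stopping once ||g - H p|| <= rel_tol * ||g||."""
    p = np.zeros_like(g)
    r = g.copy()                        # residual g - H p
    q = r.copy()                        # search direction
    rs = r @ r
    target = rel_tol * np.linalg.norm(g)
    for _ in range(max_iter or 10 * len(g)):
        if np.sqrt(rs) <= target:
            break
        Hq = H @ q
        step = rs / (q @ Hq)
        p += step * q
        r -= step * Hq
        rs_new = r @ r
        q = r + (rs_new / rs) * q
        rs = rs_new
    return p

def approximate_newton(grad, approx_hessian, x0, rel_tol, steps):
    """Repeat: build H, inexactly minimize 0.5 p^T H p - p^T grad, set x <- x - p."""
    x = x0.copy()
    history = [x.copy()]
    for _ in range(steps):
        g = grad(x)
        H = approx_hessian(x)
        p = conjugate_gradient(H, g, rel_tol)
        x = x - p
        history.append(x.copy())
    return history

# Toy quadratic with condition number 10; H = M / 0.9 is a deliberately inexact Hessian.
M = np.diag(np.linspace(1.0, 10.0, 6))
b = np.ones(6)
x_star = np.linalg.solve(M, b)
history = approximate_newton(lambda x: M @ x - b, lambda x: M / 0.9,
                             np.zeros(6), rel_tol=0.01, steps=8)
errors = [np.linalg.norm(x - x_star) for x in history]
print(errors)
```

In this toy setting an exact solve would shrink the error by a factor of 0.1 per step; the inexact solve adds at most `0.9 * rel_tol * cond(M)` to that factor, which illustrates why the solver tolerance has to be scaled against the condition number.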
In the following theorem, we propose a unifying framework which describes the convergence properties of the second order optimization procedure depicted above.
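Theorem 3
Let Assumptions 1 and 2 hold. Suppose that $\nabla^2 F(x)$ exists and is continuous in a neighborhood of a minimizer $x^*$. $H^{(t)}$ is a positive definite matrix that satisfies Eqn. (2) with $0 \le \epsilon_0 < 1$. Let $p^{(t)}$ be an approximate solution of Problem (3) such that
(4) $\|\nabla F(x^{(t)}) - H^{(t)} p^{(t)}\| \le \frac{\epsilon_1}{\kappa} \|\nabla F(x^{(t)})\|$,
where $0 < \epsilon_1 < 1$. Consider the iteration $x^{(t+1)} = x^{(t)} - p^{(t)}$.
(a) There exists a sufficiently small value $\gamma$, $0 < \nu(t) < 1$, and $0 < \eta(t) < 1$ such that when $\|x^{(t)} - x^*\| \le \gamma$, we have that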
(5) 
Besides, $\nu(t)$ and $\eta(t)$ will go to $0$ as $x^{(t)}$ goes to $x^*$.
(b) Furthermore, if $\nabla^2 F(x)$ is Lipschitz continuous with parameter $\hat{L}$, and $x^{(t)}$ satisfies
(6) 
where $0 < \nu(t) < 1$, then it holds that
(7) 
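Theorem 3 yields several important insights. First, it provides sufficient conditions for different convergence rates, including superlinear and quadratic convergence rates. If $\epsilon_0 + \frac{\epsilon_1}{1 - \epsilon_0}$ is a constant, then the sequence $\{x^{(t)}\}$ converges linearly because $\nu(t)$ and $\eta(t)$ go to $0$ as $t$ goes to infinity. If we set $\epsilon_0 = \epsilon_0(t)$ and $\epsilon_1 = \epsilon_1(t)$ such that $\epsilon_0(t)$ and $\epsilon_1(t)$ decrease to $0$ as $t$ increases, then the sequence $\{x^{(t)}\}$ converges superlinearly. Similarly, if $\epsilon_0(t) = O(\|\nabla F(x^{(t)})\|)$, $\epsilon_1(t) = O(\|\nabla F(x^{(t)})\|)$, and $\nabla^2 F(x)$ is Lipschitz continuous, then the sequence $\{x^{(t)}\}$ converges quadratically.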
Second, Theorem 3 makes it clear that the Lipschitz continuity of $\nabla^2 F(x)$ is not necessary for linear and superlinear convergence of stochastic second-order methods, including the subsampled Newton method, sketch Newton, NewSamp, etc. This reveals the reason why NewSamp can be used to train the smoothed SVM, where the Lipschitz continuity of the Hessian matrix is not satisfied. The Lipschitz continuity condition is only needed to obtain quadratic or linear-quadratic convergence. This explains the phenomenon that LiSSA [1], NewSamp [8], subsampled Newton with nonuniform sampling [24], and sketch Newton [17] have a linear-quadratic convergence rate: they all assume that the Hessian is Lipschitz continuous. In fact, it is well known that the Lipschitz continuity condition of $\nabla^2 F(x)$ is not necessary to achieve a linear or superlinear convergence rate for inexact Newton methods.
Third, the unifying framework of Theorem 3 covers not only stochastic second-order methods, but also their deterministic versions. For example, letting $H^{(t)} = \nabla^2 F(x^{(t)})$ and using conjugate gradient to obtain $p^{(t)}$, we obtain the famous “Newton-CG” method. In fact, different choices of $H^{(t)}$ and different ways of calculating $p^{(t)}$ lead to different second-order methods. In the following sections, we will use this framework to analyze the local convergence performance of these second-order methods in detail.
4 Sketch Newton Method
In this section, we use Theorem 3 to analyze the local convergence properties of Sketch Newton (Algorithm 1). We mainly focus on the case that the Hessian matrix is of the form
(8) $\nabla^2 F(x) = [B(x)]^T B(x)$,
where $B(x)$ is an explicitly available $n \times d$ matrix. Our result can be easily extended to the case that
$\nabla^2 F(x) = [B(x)]^T B(x) + Q(x)$,
where $Q(x)$ is a positive semidefinite matrix related to the Hessian of the regularizer.
Theorem 4
Let $F(x)$ satisfy the conditions described in Theorem 3. Assume the Hessian matrix is given as Eqn. (8). Let $\delta$, $\epsilon_0$, and $\epsilon_1$ be given. $S$ is an $\epsilon_0$-subspace embedding matrix w.r.t. $B(x)$ with probability at least $1 - \delta$, and the direction vector $p^{(t)}$ satisfies Eqn. (4). Then Algorithm 1 has the following convergence properties:

(a) There exists a sufficiently small value $\gamma$, $0 < \nu(t) < 1$, and $0 < \eta(t) < 1$ such that when $\|x^{(t)} - x^*\| \le \gamma$, each iteration satisfies Eqn. (5) with probability at least $1 - \delta$.
Theorem 4 directly provides a bound on the sketched size. Using the leverage score sketching matrix as an example, a sketched size of $s = O(d \log d / \epsilon_0^2)$ is sufficient. We compare our theoretical bound on the sketched size with those of Pilanci and Wainwright [17] and Xu et al. [24] in Table 1. As we can see, our sketched size is much smaller than the other two, especially when the Hessian matrix is ill-conditioned.
Theorem 4 shows that the sketched size is independent of the condition number of the Hessian matrix, just as shown in Table 1. This explains the phenomenon that, when the Hessian matrix is ill-conditioned, Sketch Newton performs well even when the sketched size is only several times $d$. For a large condition number, the theoretical bounds of both Xu et al. [24] and Pilanci and Wainwright [17] may exceed the number of samples $n$. Note that the theoretical results of [24] and [17] still hold in the constrained optimization problem. However, our result proves the effectiveness of the sketch Newton method for the unconstrained optimization problem in the ill-conditioned case.
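A small experiment makes the independence from the condition number tangible. The sketch below is only illustrative and rests on several assumptions not taken from the paper: the objective is a noise-free least squares problem $F(x) = \frac{1}{2n} \|Ax - y\|^2$, whose Hessian is $B^T B$ with $B = A/\sqrt{n}$; a fresh Gaussian sketch is drawn in every iteration; the linear system is solved exactly; and the function name `sketch_newton_least_squares` is hypothetical. The Hessian has a condition number of roughly $10^6$, while the sketched size is only $100 d$.

```python
import numpy as np

def sketch_newton_least_squares(A, y, s, steps, rng):
    """Sketch Newton for F(x) = (1/2n) ||A x - y||^2 with Hessian B^T B, B = A / sqrt(n)."""
    n, d = A.shape
    B = A / np.sqrt(n)
    x = np.zeros(d)
    iterates = [x.copy()]
    for _ in range(steps):
        g = A.T @ (A @ x - y) / n                      # exact gradient
        S = rng.standard_normal((s, n)) / np.sqrt(s)   # fresh Gaussian sketching matrix
        SB = S @ B
        H = SB.T @ SB                                  # sketched Hessian
        x = x - np.linalg.solve(H, g)
        iterates.append(x.copy())
    return iterates

rng = np.random.default_rng(0)
n, d, s = 5000, 5, 500
A = rng.standard_normal((n, d)) * np.logspace(0, 3, d)  # Hessian condition number ~ 1e6
x_true = rng.standard_normal(d)
y = A @ x_true                                           # minimizer is x_true
iterates = sketch_newton_least_squares(A, y, s, steps=10, rng=rng)

# Error measured in the norm induced by the Hessian M = A^T A / n.
m_norm_errors = [np.linalg.norm(A @ (x - x_true)) / np.sqrt(n) for x in iterates]
print(m_norm_errors)
```

For this quadratic objective the per-step contraction in the Hessian-induced norm is governed by how well $S$ embeds the column space of $B$, so rescaling the columns of $A$ changes the condition number but not the contraction factor.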
Theorem 4 also allows for an asymptotically superlinear rate by using an iteration-dependent sketching accuracy $\epsilon_0 = \epsilon_0(t)$. In particular, we present the following corollary.
5 The Subsampled Newton method and Variants
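In this section, we apply Theorem 3 to analyze subsampled Newton methods. First, we make the assumption that each $f_i(x)$ and $F(x)$ have the following properties:
(9) $\max_{1 \le i \le n} \|\nabla^2 f_i(x)\| \le K < \infty$,
(10) $\lambda_{\min}(\nabla^2 F(x)) \ge \sigma > 0$.
It immediately follows from Eqn. (9) that $\|\nabla^2 F(x)\| \le K$. Accordingly, if $\nabla^2 F(x)$ is ill-conditioned, then the value $K/\sigma$ is large.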
5.1 The Subsampled Newton method
The Subsampled Newton method is depicted in Algorithm 2, and we now give its local convergence properties in the following theorem.
Let $F(x)$ satisfy the properties described in Theorem 3. Assume Eqn. (9) and Eqn. (10) hold and let $\delta$, $\epsilon_0$, and $\epsilon_1$ be given. $|\mathcal{S}|$ and $H^{(t)}$ are set as in Algorithm 2, and the direction vector $p^{(t)}$ satisfies Eqn. (4). Then for $t = 1, \dots, T$, Algorithm 2 has the following convergence properties:

(a) There exists a sufficiently small value $\gamma$, $0 < \nu(t) < 1$, and $0 < \eta(t) < 1$ such that when $\|x^{(t)} - x^*\| \le \gamma$, each iteration satisfies Eqn. (5) with probability at least $1 - \delta$.
As we can see, Algorithm 2 has almost the same convergence properties as Algorithm 1, apart from several minor differences. The main difference is the way the approximate Hessian $H^{(t)}$, which should satisfy Eqn. (2), is constructed. Algorithm 2 relies on the assumption that each $\|\nabla^2 f_i(x)\|$ is upper bounded (i.e., Eqn. (9) holds), while Algorithm 1 is built on the setting in which the Hessian matrix has the form of Eqn. (8).
5.2 Regularized Subsampled Newton
In ill-conditioned cases (i.e., when the ratio $K/\sigma$ of the bounds in Eqns. (9) and (10) is large), the subsampled Newton method in Algorithm 2 must take many samples because the sample size depends on $K/\sigma$ quadratically. To overcome this problem, one resorts to a regularized subsampled Newton method. The key idea is to add a regularizer $\alpha I$ (a multiple of the identity matrix) to the original subsampled Hessian, as described in Algorithm 3. Erdogdu and Montanari [8] proposed NewSamp, which is another regularized subsampled Newton method, depicted in Algorithm 4. In the following analysis, we prove that, in theory, adding a regularizer is an effective way to reduce the sample size while preserving convergence.
We first give the theoretical analysis of local convergence properties of Algorithm 3. Let satisfy the properties described in Theorem 3. Assume Eqns. (9) and (10) hold, and let , and be given. Assume is a constant such that , the subsampled size satisfies , and is constructed as in Algorithm 3. Define
(11) 
which implies that . Besides, the direction vector satisfies Eqn. (4). Then Algorithm 3 has the following convergence properties:

(a) There exists a sufficiently small value $\gamma$, $0 < \nu(t) < 1$, and $0 < \eta(t) < 1$ such that when $\|x^{(t)} - x^*\| \le \gamma$, each iteration satisfies Eqn. (5) with probability at least $1 - \delta$.
In the theorem above, the parameter $\epsilon_0$ mainly determines the convergence properties of Algorithm 3. It is determined by two terms, as shown in Eqn. (11). These two terms depict the relationship among the sample size, the regularizer, and the convergence rate.
The first term describes the relationship between the regularizer and sample size. Without loss of generality, we set which satisfies . Then the sample size decreases as increases. Hence the theorem gives a theoretical guarantee that adding the regularizer is an effective approach for reducing the sample size when is large. Conversely, if we want to sample a small part of the $f_i$'s, then we should choose a large . Otherwise, will go to which means , i.e., the sequence does not converge.
Though a large can reduce the sample size, it is at the expense of slower convergence rate just as the second term shows. As we can see, goes to as increases. Besides, also has to decrease. Otherwise, may be beyond which means that Algorithm 3 will not converge.
In fact, the slower convergence rate caused by adding a regularizer arises because the sample size becomes small, which implies that less curvature information is obtained. However, a small sample size implies a low computational cost in each iteration. Therefore, a proper regularizer that balances the cost of each iteration against the convergence rate is the key in the regularized subsampled Newton algorithm.
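The trade-off discussed above can be explored numerically with a small sketch. It relies on assumptions that are not part of the paper: the component functions are least squares losses $f_i(x) = \frac{1}{2}(a_i^T x - y_i)^2$, so that $\nabla^2 f_i = a_i a_i^T$ and the full Hessian $M$ is constant; rows are sampled uniformly without replacement; the regularizer is written as `alpha`; and the helper names are hypothetical. For a quadratic objective and an exactly solved subproblem, the error in the $M$-norm contracts by at most the spectral norm of $I - M^{1/2} H^{-1} M^{1/2}$ per step, which `contraction_factor` computes.

```python
import numpy as np

def subsampled_hessian(A, idx, alpha=0.0):
    """(1/|S|) sum_{j in S} a_j a_j^T + alpha * I for f_i(x) = 0.5 (a_i^T x - y_i)^2."""
    As = A[idx]
    return As.T @ As / len(idx) + alpha * np.eye(A.shape[1])

def contraction_factor(H, M):
    """Spectral norm of I - M^{1/2} H^{-1} M^{1/2}, computed via a Cholesky factor of M."""
    L = np.linalg.cholesky(M)
    C = L.T @ np.linalg.solve(H, L)       # symmetric, same eigenvalues as H^{-1} M
    ev = np.linalg.eigvalsh((C + C.T) / 2.0)
    return float(np.max(np.abs(1.0 - ev)))

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.standard_normal((n, d)) * np.linspace(1.0, 3.0, d)
M = A.T @ A / n                            # full Hessian
lam = np.linalg.eigvalsh(M)                # eigenvalues in ascending order
idx = rng.choice(n, size=5, replace=False) # fewer samples than dimensions

# Without a regularizer, five samples give a rank-deficient (non-invertible) matrix.
rank_unregularized = np.linalg.matrix_rank(subsampled_hessian(A, idx))

# With alpha = lambda_max(M) the regularized matrix dominates M, so the step contracts.
alpha = lam[-1]
rate_small_sample = contraction_factor(subsampled_hessian(A, idx, alpha), M)

# Even with all samples, a larger regularizer gives a slower rate alpha / (lambda_min + alpha).
rate_full_small_alpha = contraction_factor(M + 0.1 * lam[0] * np.eye(d), M)
rate_full_large_alpha = contraction_factor(M + alpha * np.eye(d), M)
print(rank_unregularized, rate_small_sample,
      rate_full_small_alpha, rate_full_large_alpha)
```

The choice `alpha = lam[-1]` is only a convenient illustration: it guarantees convergence for any subsample in this quadratic setting, at the price of a contraction factor close to one.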
Next, we give the theoretical analysis of local convergence properties of NewSamp (Algorithm 4).
Let satisfy the properties described in Theorem 3. Assume Eqn. (9) and Eqn. (10) hold and let and target rank be given. Let be a constant such that , where is the th eigenvalue of . Set the subsampled size such that , and define
(12) 
which implies . Assume the direction vector satisfies Eqn. (4). Then for , Algorithm 4 has the following convergence properties:

(a) There exists a sufficiently small value $\gamma$, $0 < \nu(t) < 1$, and $0 < \eta(t) < 1$ such that when $\|x^{(t)} - x^*\| \le \gamma$, each iteration satisfies Eqn. (5) with probability at least $1 - \delta$.
Similar to the theorem for Algorithm 3, the parameter in NewSamp is also determined by two terms. The first term reveals the relationship between the target rank and sample size. Without loss of generality, we can set . Then the sample size is linear in . Hence, a small means that a small sample size is sufficient. Conversely, if we want to sample a small portion of the $f_i$'s, then we should choose a small . Otherwise, will go to which means , i.e., the sequence does not converge. The second term shows that a small sample size will lead to a poor convergence rate. If we set and , then will be . Consequently, the convergence rate of NewSamp is almost the same as that of gradient descent. Similar to Algorithm 3, a small means a precise solution to Problem (3) and the initial point being close to the optimal point .
It is worth pointing out that the theorem for NewSamp explains the empirical result that NewSamp is applicable to training an SVM in which the Lipschitz continuity condition of is not satisfied [8].
We now compare the theorem for Algorithm 3 with the theorem for NewSamp (Algorithm 4). We mainly focus on the parameter in these two theorems, which mainly determines the convergence properties of Algorithm 3 and Algorithm 4. Specifically, if we set in Eqn. (11), then which equals the second term on the right-hand side of Eqn. (12). Hence, we can regard NewSamp as a special case of Algorithm 3. However, NewSamp provides an approach for the automatic choice of .
Recall that NewSamp includes another parameter: the target rank $r$. Thus, NewSamp and Algorithm 3 have the same number of free parameters. If $r$ is not properly chosen, NewSamp will still have poor performance. Therefore, Algorithm 3 is theoretically preferred because NewSamp needs extra cost to perform SVDs.
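Algorithm 4 is not reproduced in this text, so the following is only a sketch of the kind of rank-truncated regularization that a target rank $r$ suggests, and it should be read as an assumption rather than as the NewSamp algorithm itself. The hypothetical helper `truncated_hessian` keeps the top $r$ eigenpairs of a subsampled Hessian and replaces every smaller eigenvalue by the $(r+1)$-th largest one, which makes the result positive definite whenever that eigenvalue is positive and shows where the extra eigendecomposition cost comes from.

```python
import numpy as np

def truncated_hessian(H_sub, r):
    """Keep the top-r eigenpairs of H_sub; lift all smaller eigenvalues to the (r+1)-th."""
    lam, U = np.linalg.eigh(H_sub)         # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]         # reorder to descending
    lifted = np.maximum(lam, lam[r])       # lam[r] is the (r+1)-th largest eigenvalue
    return (U * lifted) @ U.T

# Illustrative singular "subsampled Hessian" with eigenvalues 5, 3, 2, 0.5, 0.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
H_sub = (Q * np.array([5.0, 3.0, 2.0, 0.5, 0.0])) @ Q.T
H = truncated_hessian(H_sub, r=2)
print(np.linalg.eigvalsh(H))
```

Under this assumption the lifted eigenvalue plays the role of an automatically chosen regularizer, while the choice of `r` remains a free parameter.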
5.3 Subsampled Hessian and Gradient
In fact, we can also subsample the gradient to accelerate the subsampled Newton method. The detailed procedure is presented in Algorithm 5 [3, 19].
Let satisfy the properties described in Theorem 3. We also assume Eqn. (9) and Eqn. (10) hold and let and be given. Let and be set such that Eqn. (2) holds and it holds that
The direction vector is computed as in Algorithm 5. Then for , we have the following convergence properties:

(a) There exists a sufficiently small value $\gamma$, $0 < \nu(t) < 1$, and $0 < \eta(t) < 1$ such that when $\|x^{(t)} - x^*\| \le \gamma$, then for each iteration, it holds that
with probability at least $1 - \delta$.

(b) If $\nabla^2 F(x)$ is also Lipschitz continuous with parameter $\hat{L}$ and $x^{(t)}$ satisfies Eqn. (6), then for each iteration, it holds that
with probability at least $1 - \delta$.
In common cases, subsampled gradient needs to subsample over of samples to guarantee convergence of the algorithm. Roosta-Khorasani and Mahoney [19] showed that it needs , where for . When is close to , is close to . Hence will go to as the iterations proceed. This is the reason why the Newton method and variants of the subsampled Newton method are very sensitive to the accuracy of the subsampled gradient.
6 Inexact Newton Methods
Let $H^{(t)} = \nabla^2 F(x^{(t)})$, that is, $\epsilon_0 = 0$. Then Theorem 3 depicts the convergence properties of inexact Newton methods.
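Let $F(x)$ satisfy the properties described in Theorem 3, and let $p^{(t)}$ be a direction vector such that
$\|\nabla F(x^{(t)}) - \nabla^2 F(x^{(t)}) p^{(t)}\| \le \frac{\epsilon_1}{\kappa} \|\nabla F(x^{(t)})\|$,
where $0 < \epsilon_1 < 1$ is a constant. Consider the iteration $x^{(t+1)} = x^{(t)} - p^{(t)}$.
(a) There exists a sufficiently small value $\gamma$, $0 < \nu(t) < 1$, and $0 < \eta(t) < 1$ such that when $\|x^{(t)} - x^*\| \le \gamma$, then it holds that
(b) If $\nabla^2 F(x)$ is also Lipschitz continuous with parameter $\hat{L}$, and $x^{(t)}$ satisfies Eqn. (6), then it holds that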