An Investigation of Newton-Sketch and Subsampled Newton Methods

05/17/2017 · Albert S. Berahas et al. · Northwestern University

The concepts of sketching and subsampling have recently received much attention by the optimization and statistics communities. In this paper, we study Newton-Sketch and Subsampled Newton (SSN) methods for the finite-sum optimization problem. We consider practical versions of the two methods in which the Newton equations are solved approximately using the conjugate gradient (CG) method or a stochastic gradient iteration. We establish new complexity results for the SSN-CG method that exploit the spectral properties of CG. Controlled numerical experiments compare the relative strengths of Newton-Sketch and SSN methods and show that for many finite-sum problems, they are far more efficient than SVRG, a popular first-order method.


1 Introduction

A variety of first-order methods [6, 11, 16, 21] have recently been developed for the finite-sum problem

(1.1)    \min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum_{i=1}^{n} F_i(w),

where each component function F_i is smooth and convex. These methods reduce the variance of the stochastic gradient approximations, which allows them to enjoy a linear rate of convergence on strongly convex problems. In particular, the SVRG method [11] computes a variance-reduced direction by evaluating the full gradient at the beginning of a cycle and by updating this vector at every inner iteration using the gradient of just one component function F_i. These variance-reducing methods perform well in practice, when properly tuned, and it has been argued that their worst-case performance cannot be improved by incorporating second-order information [2].

Given these observations and the wide popularity of first-order methods, one may question whether methods that incorporate stochastic second-order information have much to offer in practice for solving large-scale instances of problem (1.1). In this paper, we argue that there are many important settings where this is indeed the case, provided second-order information is employed in a judicious manner. We pay particular attention to the Newton-Sketch algorithm recently proposed in [18] and Subsampled Newton methods (SSN) [3, 4, 8, 19, 20, 25].

The Newton-Sketch method is a randomized approximation of Newton’s method. It solves an approximate version of the Newton equations using an unbiased estimator of the true Hessian (Hessian Sketch; see [18, 22]), based on a decomposition of reduced dimension that takes into account all components of (1.1). The Newton-Sketch method has been tested only on a few synthetic problems, and its efficiency on practical problems has not yet been explored, to the best of our knowledge. In this paper, we evaluate the algorithmic innovations and computational efficiency of Newton-Sketch compared to SSN methods. We identify several interesting trade-offs involving storage, data access, speed of computation and overall efficiency. Conceptually, SSN methods are a special case of Newton-Sketch, but we argue here that they are more flexible, as well as more efficient over a wider range of applications.

Although Newton-Sketch and SSN methods are in principle capable of achieving superlinear convergence [3, 20, 18], we implement them so as to be linearly convergent because aiming for a faster rate is not cost effective. Thus, these two methods are direct competitors of the variance reduction methods mentioned above, which are also linearly convergent. Newton-Sketch and SSN methods are able to incorporate second-order information at low cost by exploiting the structure of the objective function (1.1); they aim to produce well-scaled search directions to facilitate the choice of the step length parameter, and diminish the effects of ill-conditioning.

Two crucial ingredients in methods that employ second-order information are the technique for defining (explicitly or implicitly) the linear systems comprising the Newton equations, and the algorithm employed for solving these systems. We discuss the computational demands of the first ingredient, and study the use of the conjugate gradient (CG) method and a stochastic gradient iteration (SGI) to solve the linear systems. We show, both theoretically and experimentally, the advantages of using the CG method, which is able to exploit favorable eigenvalue distributions in the Hessian approximations. We find that the SSN method is particularly appealing because one can coordinate the amount of second-order information collected at every iteration with the accuracy of the linear solve to fully exploit the structure of the finite-sum problem (1.1).

Contributions

  1. We consider the finite-sum problem and show that methods that incorporate stochastic second-order information can be far more efficient on badly-scaled or ill-conditioned problems than a first-order method, such as SVRG. These observations are in stark contrast with popular belief and worst-case complexity analysis that suggest little benefit of employing second-order information in this setting.

  2. Although the Newton-Sketch algorithm is a more effective technique for approximating the spectrum of the true Hessian matrix, we find that Subsampled Newton methods are simpler and more versatile than Newton-Sketch, equally efficient in terms of effective gradient evaluations, and generally faster as measured by wall-clock time. We discuss the algorithmic reasons for these advantages.

  3. We present a complexity analysis of a Subsampled Newton-CG method that exploits the full power of CG. We show that the complexity bounds can be much more favorable than those obtained using a worst-case analysis of the CG method, and that these benefits can be realized in practice. Moreover, we compare the work complexity of the SSN-CG method with that of the Newton-Sketch method.

The paper is organized as follows. In Section 2, we describe Newton-Sketch and Subsampled Newton methods. We discuss practical implementations of the algorithms in detail in Section 3. In Section 4, we provide a convergence analysis for the Subsampled Newton method, and compare the work complexity with that of the Newton-Sketch method. We present a thorough numerical investigation of the methods in Section 5. Finally, in Section 6, we provide some final remarks.

2 Newton-Sketch and Subsampled Newton Methods

The methods studied in this paper emulate Newton’s method,

(2.1)    w_{k+1} = w_k - \nabla^2 F(w_k)^{-1} \nabla F(w_k),

but instead of forming the Hessian matrix and solving the linear system exactly, which is prohibitively expensive in many applications, they define stochastic approximations to the Hessian that allow one to compute the step at a much lower cost. We assume throughout the paper that the gradient is exact, and therefore the stochastic nature of the iteration is entirely due to the way the Hessian approximation is defined. The reason we do not consider methods that compute stochastic approximations to the gradient is that we wish to analyze and compare only linearly convergent algorithms. This allows us to study competing second-order techniques in a uniform setting, and to make a controlled comparison with the SVRG method.

The iteration of the second-order methods considered here has the form

(2.2)    w_{k+1} = w_k + \alpha_k p_k,

where the search direction p_k is obtained, exactly or approximately, from a linear system of the form A_k p_k = -\nabla F(w_k), and A_k is an approximation of the Hessian \nabla^2 F(w_k). The methods presented below differ in the way the Hessian approximations are defined and in the way the linear systems are solved.

Newton-Sketch

The Newton-Sketch method [18] is a randomized approximation to Newton’s method that, instead of explicitly computing the true Hessian matrix, utilizes a decomposition of the matrix and approximates the Hessian via random projections in lower dimensions. At every iteration, the algorithm defines a random matrix known as a sketch matrix S_k \in \mathbb{R}^{m \times n}, with m < n, which has the property that \mathbb{E}[S_k^T S_k] = I_n. Next, the algorithm computes a square-root decomposition of the Hessian \nabla^2 F(w_k), i.e., a matrix B_k \in \mathbb{R}^{n \times d} such that B_k^T B_k = \nabla^2 F(w_k). The search direction p_k is obtained by solving the linear system

(2.3)    (S_k B_k)^T (S_k B_k)\, p_k = -\nabla F(w_k).

The intuition underlying the Newton-Sketch method (2.2)–(2.3) is that p_k corresponds to the minimizer of a random quadratic that, in expectation, is the standard quadratic approximation of the objective function (1.1) at w_k. Assuming that a square-root matrix is readily available and that the projections can be computed efficiently, forming and solving (2.3) is significantly cheaper than solving (2.1) [18].
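As an illustration, the following is a minimal sketch of one Newton-Sketch direction (2.3), given a square-root matrix B with B^T B equal to the Hessian at the current iterate. It uses a Gaussian sketch matrix purely for readability; the paper focuses on fast Hadamard-based sketches, and all function names here are ours, not part of any library.

```python
# Minimal sketch of the Newton-Sketch direction (2.3); the Gaussian sketch matrix
# is an illustrative assumption (the paper focuses on fast Hadamard-based sketches).
import numpy as np

def newton_sketch_direction(B, grad, m, rng):
    """Solve (S B)^T (S B) p = -grad, where B is a square-root of the Hessian."""
    n, d = B.shape
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # E[S^T S] = I_n
    SB = S @ B                                     # m x d sketched square-root matrix
    return np.linalg.solve(SB.T @ SB, -grad)       # d x d sketched Newton system
```

For example, with a square-root matrix B and gradient grad in hand, a call such as newton_sketch_direction(B, grad, m, np.random.default_rng(0)) returns a candidate direction p_k.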

Subsampled Newton (SSN)

Subsampling is a special case of sketching, in which the Hessian approximation is constructed by randomly sampling component functions of (1.1). Specifically, at the k-th iteration, one defines

(2.4)    A_k = \nabla^2 F_{T_k}(w_k) = \frac{1}{|T_k|} \sum_{i \in T_k} \nabla^2 F_i(w_k),

where the sample set T_k \subset \{1, \dots, n\} is chosen either uniformly at random (with or without replacement) or in a non-uniform manner as described in [25]. The search direction p_k is the exact or inexact solution of the system of equations

(2.5)    \nabla^2 F_{T_k}(w_k)\, p_k = -\nabla F(w_k).

SSN methods have received considerable attention recently, and some of their theoretical properties are well understood [3, 8, 19, 25]. Although Hessian subsampling can be thought of as a special case of sketching, as mentioned above, we view it here as a separate method because it has important algorithmic features that differentiate it from the other sketching techniques described in [18].
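For comparison, a minimal sketch of the subsampled Newton direction (2.4)–(2.5) follows. The helper hess_i, which returns the d x d Hessian of one component function, is a hypothetical placeholder, and the small system is solved directly only for simplicity.

```python
# Minimal sketch of the subsampled Newton direction (2.4)-(2.5); hess_i(w, i) is a
# hypothetical helper returning the d x d Hessian of the i-th component function.
import numpy as np

def ssn_direction(w, grad, hess_i, n, sample_size, rng):
    idx = rng.choice(n, size=sample_size, replace=False)    # uniform subsample T_k
    H_sub = sum(hess_i(w, i) for i in idx) / sample_size    # subsampled Hessian (2.4)
    return np.linalg.solve(H_sub, -grad)                    # Newton system (2.5)
```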

3 Practical Implementations of the Algorithms

We now discuss important implementation aspects of the Newton-Sketch and SSN methods. The per-iteration cost of the two methods can be decomposed into the cost of computing the full gradient, the cost of forming the Hessian approximation, and the cost of solving the linear system (2.3) or (2.5). Both methods require a full gradient computation at every iteration and the solution of a linear system, but as mentioned above, they differ significantly in the way the Hessian approximation is constructed.

3.1 Forming the Hessian approximation

Forming the Hessian approximation in the Newton-Sketch method is contingent on having a square-root decomposition of the true Hessian. In the case of generalized linear models (GLMs), where F(w) = \frac{1}{n}\sum_{i=1}^{n} \phi_i(a_i^T w), the square-root matrix can be expressed as B_k = D_k^{1/2} A, where A \in \mathbb{R}^{n \times d} is the data matrix, a_i^T denotes the i-th row of A, and D_k is a diagonal matrix given by (D_k)_{ii} = \frac{1}{n}\phi_i''(a_i^T w_k). This requires accessing the full data set in order to compute the square-root Hessian. Forming the linear system in (2.3) also has an associated algebraic cost, namely the projection of the square-root matrix onto a lower-dimensional space. When Newton-Sketch is implemented using randomized projections based on the fast Johnson-Lindenstrauss (JL) transform, e.g., Hadamard matrices, each iteration has complexity on the order of nd\log m + m d^2, which is substantially lower than the O(nd^2) cost of forming the exact Hessian when m \ll n.
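A minimal sketch of this square-root matrix for the regularized logistic regression objective used in our experiments (cf. (5.1)) is given below. Handling the l2 term by appending sqrt(lam) times the identity as extra rows is an implementation device we assume here, not a detail taken from [18].

```python
# Minimal sketch of the GLM square-root Hessian for regularized logistic loss:
# with B = D^{1/2} A / sqrt(n), we have B^T B = (1/n) A^T D A, and appending
# sqrt(lam) * I yields a square-root of the full regularized Hessian.
import numpy as np

def sqrt_hessian_logistic(A, w, lam):
    n, d = A.shape
    p = 1.0 / (1.0 + np.exp(-A @ w))                        # sigmoid of the margins a_i^T w
    B = (np.sqrt(p * (1.0 - p))[:, None] * A) / np.sqrt(n)  # D^{1/2} A / sqrt(n)
    return np.vstack([B, np.sqrt(lam) * np.eye(d)])         # (n + d) x d square-root matrix
```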

On the other hand, computations involving the Hessian approximations in the SSN method come at a much lower cost. The SSN method could be implemented using a sketch matrix S_k whose rows are randomly chosen (scaled) rows of the identity matrix. However, a practical implementation avoids the computation of the square-root matrix altogether by simply selecting a random subset of the component functions and forming the Hessian approximation as in (2.4). As such, the SSN method can be implemented without the need for projections, and thus does not require any additional computation beyond that of a Hessian-free Newton method.

Although the SSN method has a lower per-iteration cost, this may not always be advantageous. For example, when the individual component functions are highly dissimilar, taking a small sample at each iteration of the SSN method may not yield a useful approximation of the true Hessian. For such problems, the Newton-Sketch method, which forms its Hessian approximation from a combination of all component functions, may be more appropriate. In the next section, we investigate these trade-offs.

3.2 Solving the Linear Systems

To compare the relative strengths of the Newton-Sketch and SSN methods we must pay careful attention to a key component of these methods: the solution of the linear systems (2.3) and (2.5).

Direct Solves

For problems where the number of variables d is small, it is feasible to form the Hessian approximation and solve the linear systems using a direct solver. The Newton-Sketch method was originally proposed for problems where the number of component functions n is large compared to the number of variables (n \gg d). In this setting, the cost of solving the system (2.3) accurately is low compared to the cost of forming the linear system. SSN methods were designed for large problems (in both n and d), and do not require forming the linear system or solving it accurately.

The Conjugate Gradient Method

For large-scale problems, it is effective to employ a Hessian-free approach and compute an inexact solution of the systems (2.3) or (2.5) using an iterative method that requires only Hessian-vector products, rather than the Hessian matrix itself. The conjugate gradient (CG) method is the best known method in this category [9]. One can compute a search direction by running a limited number of CG iterations on the system

(3.1)    A_k p = -\nabla F(w_k),

where A_k is a positive definite matrix denoting either the sketched Hessian in (2.3) or the subsampled Hessian in (2.5). Under a favorable distribution of the eigenvalues of A_k, the CG method can generate a useful approximate solution in many fewer than d steps; we exploit this result in the analysis presented in Section 4.

We should note that the square-root Hessian decomposition associated with the Newton-Sketch method is not a standard decomposition, such as the LU or QR decompositions, and hence the advantages of standard decomposition techniques for solving the system cannot be realized. In fact, using an iterative method such as CG is more efficient because of its convergence properties. Since CG requires only Hessian-vector products, we need not form the entire sketched Hessian matrix; we can simply obtain sketched Hessian-vector products by performing two matrix-vector products with the sketched square-root matrix. More specifically, one can compute (S_k B_k)^T (S_k B_k) v in two steps as follows:

(3.2)    u = (S_k B_k)\, v, \qquad (S_k B_k)^T (S_k B_k)\, v = (S_k B_k)^T u.
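A minimal sketch of this two-step product, assuming the sketched square-root matrix SB has already been formed, is:

```python
# Minimal sketch of the two-step product (3.2): (SB)^T ((SB) v) costs two
# matrix-vector products and never forms the d x d sketched Hessian.
import numpy as np

def sketched_hessian_vector_product(SB, v):
    u = SB @ v          # first matrix-vector product (length m)
    return SB.T @ u     # second matrix-vector product (length d)
```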

The subsampled Hessian used in the SSN method is significantly simpler to define than the sketched Hessian, and the CG solver is well suited to a Hessian-free approach.

Stochastic Gradient Iteration

We can also compute an approximate solution of the linear system using a stochastic gradient iteration (SGI) [1, 3]. We do so in the context of the SSN method. The resulting SSN-SGI method operates in cycles of fixed length. At every iteration of the cycle, one chooses an index i at random and computes a step of the gradient method for the quadratic model at w_k,

(3.3)    Q_i(p) = F(w_k) + \nabla F(w_k)^T p + \tfrac{1}{2}\, p^T \nabla^2 F_i(w_k)\, p.

This yields the iteration

(3.4)    p^{t+1} = p^t - \beta \left( \nabla^2 F_{i_t}(w_k)\, p^t + \nabla F(w_k) \right),

where \beta > 0 is a fixed step length and i_t is the index sampled at the t-th inner iteration.

LiSSA, a method similar to SSN-SGI, is analyzed in [1], and [3] compares the complexity of LiSSA and SSN-CG. However, neither of those papers reports extensive numerical results.
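A minimal sketch of the SSN-SGI inner loop (3.3)–(3.4) follows. Here hess_vec(i, v) is a hypothetical helper returning the product of the i-th component Hessian at w_k with v, and beta is the (tuned) inner step length.

```python
# Minimal sketch of the SSN-SGI inner loop (3.3)-(3.4): starting from p = 0, each
# step samples one component Hessian and takes a gradient step on the quadratic model.
import numpy as np

def ssn_sgi_direction(grad, hess_vec, n, num_steps, beta, rng):
    p = np.zeros_like(grad)
    for _ in range(num_steps):
        i = rng.integers(n)                      # sample one component index
        p -= beta * (hess_vec(i, p) + grad)      # SGI step on the quadratic model gradient
    return p
```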

4 Analysis

The theoretical properties of the methods described in the previous sections have been the subject of several studies; see e.g., [3, 7, 8, 14, 18, 19, 20, 22, 23, 24]. We focus here on a topic that has not received sufficient attention even though it has significant impact in practice: the CG method can be much faster than other iterative linear solvers for many classes of problems. We show that the strengths of the CG method can be quantified in the context of an SSN-CG method.

A complexity analysis of the SSN-CG method based on the worst-case performance of CG is given in [3]. We argue that such an analysis underestimates the power of the SSN-CG method on problems with favorable eigenvalue distributions, and yields complexity estimates that are not indicative of its practical performance. We now present a more accurate analysis of SSN-CG by paying close attention to the behavior of the CG iteration.

When applied to a symmetric positive definite linear system Ax = b, the progress made by the CG method is not uniform, but depends on the gaps between eigenvalues. Specifically, if we denote the eigenvalues of A by 0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_d, then the iterates x^t, t = 0, 1, 2, \dots, generated by the CG method satisfy [13]

(4.1)    \| x^{t+1} - x^* \|_A^2 \le \left( \frac{\lambda_{d-t} - \lambda_1}{\lambda_{d-t} + \lambda_1} \right)^2 \| x^0 - x^* \|_A^2,

where x^* denotes the exact solution of the system.

We exploit this property in the following local linear convergence result for the SSN-CG iteration, which is established under standard assumptions. The SSN-CG method is given by (2.2), where A_k is defined by (2.4) and p_k is the iterate generated after r_k iterations of the CG method applied to the system (2.5). For simplicity, we assume that the size of the sample used to define the Hessian approximation is constant throughout the run of the SSN-CG method (but the sample itself varies at every iteration).

Assumptions

  1. (Bounded Eigenvalues of Hessians) Each function F_i is twice continuously differentiable and each component Hessian is positive definite with eigenvalues lying in a positive interval. That is, there exist positive constants \mu, L such that

    (4.2)    \mu I \preceq \nabla^2 F_i(w) \preceq L I, \quad \text{for all } w \in \mathbb{R}^d \text{ and } i = 1, \dots, n.

    We define \kappa = L/\mu, which is an upper bound on the condition number of the true Hessian.

  2. (Lipschitz Continuity of Hessian) The Hessian of the objective function F is Lipschitz continuous, i.e., there is a constant M > 0 such that

    (4.3)    \| \nabla^2 F(w) - \nabla^2 F(z) \| \le M \| w - z \|, \quad \text{for all } w, z \in \mathbb{R}^d.
  3. (Bounded Variance of Hessian Components) There is a constant \sigma such that, for all component Hessians, we have

    (4.4)    \left\| \mathbb{E}_i\!\left[ \left( \nabla^2 F_i(w) - \nabla^2 F(w) \right)^2 \right] \right\| \le \sigma^2, \quad \text{for all } w \in \mathbb{R}^d.
  4. (Bounded Moments of Iterates) There is a constant \gamma such that, for any iterate w_k generated by the SSN-CG method (2.2)–(2.5),

    (4.5)    \mathbb{E}\big[ \| w_k - w^* \|^2 \big] \le \gamma \left( \mathbb{E}\big[ \| w_k - w^* \| \big] \right)^2.

In practice, most machine learning problems are regularized and therefore Assumption A.1 is satisfied. Assumption A.2 is standard in the context of Newton-type methods. Concerning Assumption A.3, variances are always bounded for the finite-sum problem. Assumption A.4 is imposed on non-negative numbers and is less restrictive than assuming that the iterates are bounded.

We now state the main theorem of the paper; we follow the technique in [3, Theorem 3.2] to prove the result. That result is based, however, on the worst-case analysis of the CG method, whereas our result gives a more realistic bound that exploits the properties of the CG iteration. Let \lambda_1^k \le \lambda_2^k \le \cdots \le \lambda_d^k denote the eigenvalues of the subsampled Hessian \nabla^2 F_{T_k}(w_k).

Theorem 4.1.

Suppose that Assumptions A.1–A.4 hold. Let \{w_k\} be the iterates generated by the Subsampled Newton-CG method with unit step length, where the step p_k is defined by (2.5) with a sample size satisfying

(4.6)

and suppose that the number of CG steps performed at iteration k is the smallest integer r_k such that

(4.7)

Then, if the initial iterate w_0 is sufficiently close to w^*, we have

(4.8)
Proof.

We write

(4.9)

Term 1 was analyzed in [3], which establishes the bound

(4.10)

Using the result in Lemma 2.3 from [3] and (4.6), we have

(4.11)
(4.12)

Now, we analyze Term 2, which is the residual after r_k CG iterations. Assuming for simplicity that the initial CG iterate is zero, we obtain from (4.1)

where the norm is weighted by the matrix A_k = \nabla^2 F_{T_k}(w_k). To express this in terms of unweighted norms, note that, by Assumption A.1, \sqrt{\mu}\,\|v\| \le \|v\|_{A_k} \le \sqrt{L}\,\|v\| for any vector v.

Therefore, from Assumption A1 and due to the condition (4.7) on the number of CG iterations, we have

(4.13)

where the last equality follows from the definition .

Combining (4.11) and (4.13), we get

(4.14)

We now use induction to prove (4.8). For the base case we have,

Now suppose that (4.8) is true for the k-th iteration. Then,

Thus, if the initial iterate is near the solution w^*, the SSN-CG method converges to w^* at a linear rate. As we discuss below, when the spectrum of the Hessian approximation is favorable, condition (4.7) can be satisfied after a small number of CG iterations. We should note that, in a practical implementation of the SSN-CG method, the number of CG iterations is not chosen by testing condition (4.7), as this would involve computing the entire spectrum of the subsampled Hessian. Instead, one uses the well-known practical residual test [17]

\| r_j \| \le \zeta\, \| \nabla F(w_k) \|, \qquad r_j = \nabla^2 F_{T_k}(w_k)\, p_k^j + \nabla F(w_k),

where \zeta \in (0, 1) is a tuning parameter, as a proxy for choosing the number of CG iterations. Due to equation (4.1), the residual test is a good proxy for condition (4.7): under favorable eigenvalue distributions, it is satisfied after only a few CG iterations.
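A minimal sketch of the CG loop with this residual test is given below. It is written only in terms of Hessian-vector products A_matvec(v) and the tuning parameter zeta, and it assumes (as in the analysis above) that CG is started from zero; the names are ours.

```python
# Minimal sketch of CG applied to A p = -grad with the residual stopping test
# ||r_j|| <= zeta * ||grad||; only Hessian-vector products A_matvec(v) are needed.
import numpy as np

def cg_with_residual_test(A_matvec, grad, zeta=1e-1, max_iters=100):
    p = np.zeros_like(grad)                # start CG from zero
    r = -grad                              # residual of A p = -grad at p = 0
    q = r.copy()                           # first conjugate direction
    tol = zeta * np.linalg.norm(grad)
    for _ in range(max_iters):
        if np.linalg.norm(r) <= tol:
            break
        Aq = A_matvec(q)
        alpha = (r @ r) / (q @ Aq)         # exact line search along q
        p += alpha * q
        r_new = r - alpha * Aq
        q = r_new + ((r_new @ r_new) / (r @ r)) * q
        r = r_new
    return p
```

On a matrix with clustered eigenvalues, the loop above typically exits well before max_iters, which is the behavior quantified in Theorem 4.1.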

We now establish a work complexity result. By work complexity we refer to the total number of component gradient evaluations \nabla F_i and component Hessian-vector products \nabla^2 F_i(w)\, v. In establishing this result, we assume that the cost of evaluating a component Hessian-vector product is the same as the cost of evaluating a component gradient, which is the case for the problems investigated in this paper.

Corollary 4.2.

Suppose that Assumptions A.1–A.4 hold, and let the sample size and the number of CG iterations per step be chosen as in Theorem 4.1. Then, the work complexity required to obtain an \epsilon-accurate solution is

(4.15)

This result involves the maximum number of CG iterations performed at any iteration of the SSN-CG method, a quantity that depends on the eigenvalue distribution of the subsampled Hessians. Under certain eigenvalue distributions, such as clustered eigenvalues, this number can be much smaller than the bound predicted by a worst-case analysis of the CG method [3]. Such favorable eigenvalue distributions can be observed in many applications, including machine learning problems. Figure 1 depicts the spectrum for 4 data sets used in our logistic regression experiments; see Appendix B for more results. We should note that in Figure 1 the eigenvalues were computed at the optimal point w^*; however, similar plots were obtained at other iterates generated by the methods.

Figure 1: Distribution of eigenvalues of the subsampled Hessian at w^*, where the objective is the regularized logistic loss (5.1), for 4 data sets: synthetic3, gisette, MNIST, cina. The subsample sizes are chosen as fixed fractions of the whole data set.

The least favorable case is given by the synthetic data set, which was constructed to have a high condition number and a wide spectrum of eigenvalues. The behavior of CG on this problem is similar to that predicted by its worst-case analysis. However, for the rest of the problems one can observe how a small number of CG iterations can give rise to a rapid improvement in accuracy. For example, for cina, the CG method improves the accuracy in the solution by 2 orders of magnitude during CG iterations 20-24. For gisette, CG exhibits a fast initial increase in accuracy, after which it is not cost effective to continue performing CG steps due to the spectral distribution of this problem.

Let us now consider the Newton-Sketch algorithm, evaluate its total work complexity to obtain an \epsilon-accurate solution, and compare it with the work complexity of the SSN-CG method. In [18, Theorem 1], the authors provide a linear-quadratic convergence rate (stated in probability) for the exact Newton-Sketch method, similar in form to the bound (4.9) and the estimates that follow, with Term 2 in (4.9) equal to zero due to the exact solves of the linear systems. In order to achieve local linear convergence, the user-supplied parameter in that theorem must be chosen appropriately; as a result, the sketch dimension m is given by

where the second equality follows from the fact that the square of the Gaussian width is at most d [18]. The per-iteration cost of Newton-Sketch with a randomized Hadamard transform is on the order of nd\log m + m d^2.

Hence, the total work complexity required to obtain an \epsilon-accurate solution is

(4.16)

Note that in order to compare this with (4.15), we need to bound the sample size and the maximum number of CG iterations used by SSN-CG. Using the definitions given in the Assumptions, the sample size can be bounded in terms of \sigma and \mu, and CG converges in at most d iterations, so the number of CG steps per iteration is at most d. Using these worst-case bounds, we observe that Newton-Sketch has a slightly higher cost than SSN-CG.

We should note in passing that certain implicit assumptions are made about both of these algorithms. In the SSN method, it is assumed that the number of subsamples is less than the number of examples n; in Newton-Sketch, the sketch dimension m is assumed to be less than n. Since a useful subsample or sketch must contain at least on the order of d components or rows, both methods implicitly assume that n > d.

5 Numerical Study

In this section, we study the practical performance of the Newton-Sketch, Subsampled Newton (SSN) and SVRG methods. We consider two versions of the SSN method that differ in the choice of iterative linear solver; we refer to them as SSN-CG and SSN-SGI. Our main goals are to investigate whether the second-order methods have advantages over first-order methods, to identify the classes of problems for which this is the case, and to evaluate the relative strengths of the Newton-Sketch and SSN approaches. We should note that although SVRG updates the iterates at every inner iteration, its linear convergence rate applies to the outer iterates (i.e., to each complete cycle). A cycle in the SVRG method is defined as the iterations between full gradient evaluations, which makes it comparable to the Newton-Sketch and SSN methods.

We test the methods on binary classification problems where the training objective is given by the logistic loss with \ell_2 regularization:

(5.1)    F(w) = \frac{1}{n} \sum_{i=1}^{n} \log\!\left( 1 + e^{-y_i x_i^T w} \right) + \frac{\lambda}{2} \| w \|^2.

Here (x_i, y_i), i = 1, \dots, n, denote the training examples, with y_i \in \{-1, +1\}, and \lambda > 0 is the regularization parameter. For this objective function, the cost of computing the gradient is the same as the cost of computing a Hessian-vector product, an assumption made in the analysis of Section 4. When reporting the numerical performance of the methods, we make use of this observation.
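A minimal sketch illustrating this cost parity is given below, assuming the data matrix A has the examples x_i as rows and labels y in {-1, +1}; both routines are dominated by two matrix-vector products with A, and the function names are ours.

```python
# Minimal sketch for the regularized logistic loss (5.1): the gradient and a
# Hessian-vector product each require two matrix-vector products with the data.
import numpy as np

def logistic_gradient(A, y, w, lam):
    s = 1.0 / (1.0 + np.exp(y * (A @ w)))            # sigma(-y_i x_i^T w)
    return -(A.T @ (y * s)) / A.shape[0] + lam * w

def logistic_hessian_vector(A, y, w, lam, v):
    p = 1.0 / (1.0 + np.exp(-y * (A @ w)))           # sigma(y_i x_i^T w)
    d = p * (1.0 - p)                                # diagonal of D
    return (A.T @ (d * (A @ v))) / A.shape[0] + lam * v
```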

The data sets employed in our experiments are listed in Appendix A. They were selected to have different dimensions and condition numbers, and consist of both synthetic and established data sets.

5.1 Methods

We now give a more detailed description of the methods tested in our experiments. All methods have two hyper-parameters that need to be tuned to achieve good performance.

SVRG This method was implemented as described in [11], using Option I. The SVRG method has two hyper-parameters: the step length and the length of the inner cycle. The cost per cycle of the SVRG method is n component gradient evaluations for the full gradient plus two component gradient evaluations \nabla F_i at every inner iteration.

Newton-Sketch We implemented the Newton-Sketch method proposed in [18]. The method evaluates a full gradient at every iteration and computes a step by solving the linear system (2.3). In [18], the system is solved exactly; however, this limits the size of the problems that can be tested. Therefore, in our experiments we use the matrix-free CG method to solve the system inexactly, alleviating the need for constructing the full Hessian approximation matrix in (2.3). Our implementation of the Newton-Sketch method includes two parameters: the number of rows m of the sketch matrix S_k, and the maximum number of CG iterations used to compute the step. We define the sketch matrix through the Hadamard basis, which allows for a fast matrix-vector multiplication scheme to obtain S_k B_k. This does not require forming the sketch matrix explicitly and can be executed at a cost of O(nd\log m) flops. The maximum cost associated with every iteration is given by n component gradient evaluations plus two matrix-vector products with the m \times d sketched square-root matrix per CG iteration, where the factor two arises due to (3.2). We ignore the cost of forming the square-root matrix and multiplying by the sketch matrix, but note that these can be time-consuming operations; therefore, we report the most optimistic computational results for Newton-Sketch.
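A minimal sketch of this Hadamard-based sketching applied to the square-root matrix B follows. It uses a simple, non-optimized fast Walsh-Hadamard transform and uniform row subsampling purely for illustration; the implementation used in our experiments may differ in its details.

```python
# Minimal sketch of a subsampled randomized Hadamard sketch of B: pad rows to a
# power of two, flip signs randomly, apply the fast Walsh-Hadamard transform,
# subsample m rows, and rescale so that E[(SB)^T (SB)] = B^T B.
import numpy as np

def fwht(X):
    """Un-normalized fast Walsh-Hadamard transform along axis 0 (row count a power of 2)."""
    h, n = 1, X.shape[0]
    while h < n:
        for i in range(0, n, 2 * h):
            a = X[i:i + h].copy()
            X[i:i + h] = a + X[i + h:i + 2 * h]
            X[i + h:i + 2 * h] = a - X[i + h:i + 2 * h]
        h *= 2
    return X

def hadamard_sketch(B, m, rng):
    n, d = B.shape
    n_pad = 1 << (n - 1).bit_length()                    # next power of two
    X = np.zeros((n_pad, d))
    X[:n] = B * rng.choice([-1.0, 1.0], size=(n, 1))     # random sign flips
    X = fwht(X)                                          # apply H to the padded matrix
    rows = rng.choice(n_pad, size=m, replace=False)      # uniform row subsample
    return X[rows] / np.sqrt(m)                          # scaled m x d sketched matrix
```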

SSN-SGI The Subsampled Newton method that employs the SGI iteration for the step computation is described in Section 3. It computes a full gradient at the start of every inner cycle, builds the quadratic model (3.3), and uses the stochastic gradient method to compute the search direction via (3.4), at the cost of one component Hessian-vector product per SGI step. The total cost of an outer iteration is therefore n component gradient evaluations plus one component Hessian-vector product of the form \nabla^2 F_i(w_k)\, p per inner iteration. This method has two hyper-parameters: the step length and the length of the inner SGI cycle.

SSN-CG The Subsampled Newton-CG method computes the search direction from (2.5), where the linear system is solved by the CG method. The two hyper-parameters in this method are the size of the subsample, |T|, and the maximum number of CG iterations used to compute the step. The method computes a full gradient at every iteration. The maximum per-iteration cost of the SSN-CG method is therefore n component gradient evaluations plus |T| component Hessian-vector products per CG iteration.

For the Newton-Sketch and the SSN methods, the step length was determined by an Armijo back-tracking line search, starting from the unit step length, which was accepted most of the time. For methods that use CG to solve the linear system, we included an early termination option based on the residual of the system; e.g., we terminate the CG iteration in the SSN-CG method if

(5.2)    \| r_j \| \le \zeta\, \| \nabla F(w_k) \|,

where \zeta is a tuning parameter and r_j is the residual after j CG iterations. In all cases, the total cost reported below is based on the actual number of CG iterations performed for each problem.
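Putting the pieces together, a minimal sketch of one SSN-CG outer iteration with Armijo backtracking is shown below. It reuses the cg_with_residual_test helper from the CG sketch in Section 4, and loss_fn, grad_fn and sub_hess_vec_fn are hypothetical callables for the objective value, the full gradient, and the subsampled Hessian-vector product over an index set.

```python
# Minimal sketch of one SSN-CG outer iteration: subsample a Hessian, take an
# inexact Newton step via CG, and backtrack from the unit step length.
import numpy as np

def ssn_cg_step(w, loss_fn, grad_fn, sub_hess_vec_fn, n, sample_size,
                zeta, max_cg, rng, c1=1e-4, shrink=0.5):
    grad = grad_fn(w)
    idx = rng.choice(n, size=sample_size, replace=False)   # Hessian subsample T_k
    p = cg_with_residual_test(lambda v: sub_hess_vec_fn(w, idx, v), grad, zeta, max_cg)
    alpha, f0, slope = 1.0, loss_fn(w), grad @ p            # slope < 0: descent direction
    for _ in range(50):                                     # safeguarded Armijo backtracking
        if loss_fn(w + alpha * p) <= f0 + c1 * alpha * slope:
            break
        alpha *= shrink
    return w + alpha * p
```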

Experimental Setup

To determine the best implementation of each method, we attempt to find the setting of parameters that leads to the smallest amount of work required to reach the best possible function value. To do this, we allocated different per-iteration budgets, where a budget denotes the total amount of work performed between full gradient evaluations. For the second-order methods, the per-iteration budget is the amount of work done per step taken, while for the SVRG method it is the amount of work of a complete cycle of inner iterations. We tested a range of per-iteration budget levels for each method and independently tuned the associated hyper-parameters. For each budget level, we considered all admissible combinations of the hyper-parameters and chose the combination that gave the best performance. For example, for a given budget: (i) for SVRG, we set the cycle length according to the budget and tuned the step length; (ii) for Newton-Sketch, we chose pairs of sketch size and maximum number of CG iterations whose combined work matched the budget; (iii) for SSN-SGI, we set the length of the inner cycle according to the budget and tuned the step length; and (iv) for SSN-CG, we chose pairs of subsample size and maximum number of CG iterations whose combined work matched the budget.

5.2 Numerical Results

In Figure 2 we present a sample of the results obtained in our experiments on machine learning data sets (the rest are given in Appendix A). We report training error (F(w_k) - F(w^*)) vs. iterations, and training error vs. effective gradient evaluations (defined as the sum of all function evaluations, gradient evaluations and Hessian-vector products), where F(w^*) is the function value at the optimal solution w^*. Note that for SVRG one iteration denotes one outer iteration (a complete cycle). We do not report results in terms of CPU time, as the Newton-Sketch method would not be competitive for large-scale problems even using the best implementation known to us.

Let us consider the plots in the second row of Figure 2. Our first observation is that the SSN-SGI method is not competitive with the SSN-CG method. This was not clear to us before our experimentation, because the ability of the SGI method to use a new Hessian sample at each iteration seems advantageous. We observe, however, that in the context of the second-order methods investigated in this paper, the CG method is clearly more efficient.

Our results show that SVRG is superior on problems such as rcv1, where the benefits of using curvature information do not outweigh the increase in cost. On the other hand, problems like gisette and australian show the benefits of employing second-order information: it is striking how much faster the second-order methods are than SVRG, even though SVRG was carefully tuned. Problem australian-scale, which is well conditioned, illustrates that there are cases where all methods perform on par.

The Newton-Sketch and SSN-CG methods perform on par on most of the problems in terms of effective gradient evaluations. However, if we had reported CPU time, the SSN-CG method would show a dramatic advantage over Newton-Sketch. An unexpected finding is the fact that, although the number of CG iterations required for the best performance of the two methods is almost always the same, it appears that the best sample sizes (rows of the sketch matrix in Newton-Sketch and the number of subsamples in SSN-CG) differ: Newton-Sketch requires a smaller number of rows (when implemented using Hadamard matrices). This can also be seen in the eigenvalue figures in Appendix B, where the sketched Hessian with fewer rows is able to adequately capture the eigenvalues of the true Hessian.

Figure 2: Comparison of SSN-SGI, SSN-CG, Newton-Sketch and SVRG on four machine learning data sets: gisette; rcv1; australian_scale; australian. Each column corresponds to a different problem. The first row reports training error (F(w_k) - F(w^*)) vs. iterations, and the second row reports training error vs. effective gradient evaluations. The hyper-parameters of each method were tuned to achieve the best training error in terms of effective gradient evaluations. The cost of forming the linear systems in the Newton-Sketch method is ignored. The problems presented in these figures have different characteristics (in n, d and condition number) and illustrate the capabilities of the methods in different regimes.

6 Final Remarks

The theoretical properties of Newton-Sketch and subsampled Newton methods have been the subject of much attention, but their practical implementation and performance have not been investigated in depth. In this paper, we focus on the finite-sum problem and report that both methods live up to our expectations of being more efficient than first-order methods (as represented by SVRG) on problems that are ill-conditioned or badly-scaled. These results stand in contrast with conventional wisdom that states that first-order methods are to be preferred for large-scale finite-sum optimization problems. Newton-Sketch and SSN are also much less sensitive to the choice of hyper-parameters than SVRG; see Appendix A.3. One should note, however, that SVRG, when carefully tuned, can outperform second-order methods on problems where the objective function does not exhibit high curvature.

Although a sketched matrix based on Hadamard matrices carries more information than a subsampled Hessian, and approximates the spectrum of the true Hessian more accurately, the two methods perform on par in our tests, as measured by effective gradient evaluations. However, in terms of CPU time the Newton-Sketch method is generally much slower due to the high costs associated with the construction of the linear system (2.3). A Subsampled Newton method that employs the CG algorithm to solve the linear system (the SSN-CG method) is particularly appealing, as one can coordinate the amount of second-order information employed at every iteration with the accuracy of the linear solver to find an efficient implementation for a given problem.

The power of the CG algorithm is evident in our tests and is quantified in the complexity analysis presented in this paper, which shows that when the eigenvalues of the Hessian are clustered, as is common in practice, the behavior of the SSN-CG method is much better than predicted by a worst-case analysis of CG.

References

  • [1] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. Journal of Machine Learning Research, 18(116):1–40, 2017.
  • [2] Yossi Arjevani and Ohad Shamir. Oracle complexity of second-order methods for finite-sum problems. In International Conference on Machine Learning 30, pages 205–213, 2017.
  • [3] Raghu Bollapragada, Richard H Byrd, and Jorge Nocedal. Exact and inexact subsampled Newton methods for optimization. IMA Journal of Numerical Analysis, 2018.
  • [4] Richard H Byrd, Gillian M Chin, Will Neveitt, and Jorge Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.
  • [5] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  • [6] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pages 1646–1654. 2014.
  • [7] Petros Drineas, Michael W Mahoney, S Muthukrishnan, and Tamás Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.
  • [8] Murat A Erdogdu and Andrea Montanari. Convergence rates of sub-sampled Newton methods. In Advances in Neural Information Processing Systems 28, pages 3034–3042, 2015.
  • [9] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, second edition, 1989.
  • [10] Isabelle Guyon, Constantin F Aliferis, Gregory F Cooper, André Elisseeff, Jean-Philippe Pellet, Peter Spirtes, and Alexander R Statnikov. Design and analysis of the causation and prediction challenge. In WCCI Causation and Prediction Challenge, pages 1–33, 2008.
  • [11] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323, 2013.
  • [12] Yann LeCun, LD Jackel, Léon Bottou, Corinna Cortes, John S Denker, Harris Drucker, Isabelle Guyon, UA Muller, E Sackinger, Patrice Simard, et al. Learning algorithms for classification: A comparison on handwritten digit recognition. Neural networks: the statistical mechanics perspective, 261:276, 1995.
  • [13] David G. Luenberger. Linear and Nonlinear Programming. Addison-Wesley, New Jersey, second edition, 1984.
  • [14] Haipeng Luo, Alekh Agarwal, Nicolò Cesa-Bianchi, and John Langford. Efficient second order online learning by sketching. In Advances in Neural Information Processing Systems 29, pages 902–910, 2016.
  • [15] Indraneel Mukherjee, Kevin Canini, Rafael Frongillo, and Yoram Singer. Parallel boosting with momentum. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 17–32. Springer Berlin Heidelberg, 2013.
  • [16] Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pages 2613–2621, 2017.
  • [17] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer New York, 2 edition, 1999.
  • [18] Mert Pilanci and Martin J Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245, 2017.
  • [19] Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled Newton methods I: Globally convergent algorithms. arXiv preprint arXiv:1601.04737, 2016.
  • [20] Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled Newton methods II: Local convergence rates. arXiv preprint arXiv:1601.04738, 2016.
  • [21] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, page 1–30, 2016.
  • [22] Jialei Wang, Jason D Lee, Mehrdad Mahdavi, Mladen Kolar, Nathan Srebro, et al. Sketching meets random projection in the dual: A provable recovery algorithm for big and high-dimensional data. Electronic Journal of Statistics, 11(2):4896–4944, 2017.
  • [23] Shusen Wang, Alex Gittens, and Michael W Mahoney. Sketched ridge regression: Optimization perspective, statistical perspective, and model averaging. Journal of Machine Learning Research, 18(218):1–50, 2018.
  • [24] David P Woodruff et al. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014.
  • [25] Peng Xu, Jiyan Yang, Farbod Roosta-Khorasani, Christopher Ré, and Michael W Mahoney. Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems 29, pages 3000–3008, 2016.

Appendix A Complete Numerical Results

In this section, we present further numerical results on the data sets listed in Tables 1 and 2.

Table 1: Synthetic Datasets [15].
Dataset      n (train; test)      d
synthetic1   (9,000; 1,000)       100
synthetic2   (9,000; 1,000)       100
synthetic3   (90,000; 10,000)     100
synthetic4   (90,000; 10,000)     100

Table 2: Real Datasets.
Dataset      n (train; test)      d        Source
australian   (621; 69)            14       [5]
reged        (450; 50)            999      [10]
mushroom     (5,500; 2,624)       112      [5]
ijcnn1       (35,000; 91,701)     22       [5]
cina         (14,430; 1,603)      132      [10]
ISOLET       (7,017; 780)         617      [5]
gisette      (6,000; 1,000)       5,000    [5]
cov          (522,911; 58,101)    54       [5]
MNIST        (60,000; 10,000)     748      [12]
rcv1         (20,242; 677,399)    47,236   [5]
real-sim     (65,078; 7,231)      20,958   [5]

The synthetic data sets were generated randomly as described in [15]. These data sets were created to have different numbers of samples (n) and different condition numbers, as well as a wide spectrum of eigenvalues (see Figures 11–14). We also explored the performance of the methods on popular machine learning data sets chosen to have different numbers of samples (n), numbers of variables (d) and condition numbers. We used the provided testing sets where available; for data sets where a testing set was not provided, we randomly split the data into training and testing sets.

We focus on logistic regression classification; the objective function is given by

(A.1)    F(w) = \frac{1}{n} \sum_{i=1}^{n} \log\!\left( 1 + e^{-y_i x_i^T w} \right) + \frac{\lambda}{2} \| w \|^2,

where (x_i, y_i), i = 1, \dots, n, denote the training examples and \lambda > 0 is the regularization parameter.

For the experiments in Sections A.1, A.2 and A.3 (Figures 3–10), we ran four methods: SVRG, Newton-Sketch, SSN-SGI and SSN-CG. The implementation details of the methods are given in Section 5.1. In Section A.1 we show the performance of the methods on the synthetic data sets, in Section A.2 we show their performance on popular machine learning data sets, and in Section A.3 we examine the sensitivity of the methods to the choice of hyper-parameters. We report training error vs. iterations, and training error vs. effective gradient evaluations (defined as the sum of all function evaluations, gradient evaluations and Hessian-vector products). In Sections A.1 and A.2 we also report testing loss vs. effective gradient evaluations.

a.1 Performance of methods - Synthetic Data sets

Figure 3 illustrates the performance of the methods on four synthetic data sets. These problems were constructed to have high condition numbers, which is the setting where Newton-type methods show their strength over a first-order method such as SVRG, and to have a wide spectrum of eigenvalues, which is the setting that is unfavorable for the CG method.

As is clear from Figure 3, the Newton-Sketch and SSN-CG methods outperform the SVRG and SSN-SGI methods. This is expected due to the ill-conditioning of the problems. The Newton-Sketch method performs slightly better than the SSN-CG method; this can be attributed primarily to the fact that the component functions in these problems are highly dissimilar. The Hessian approximations constructed by the Newton-Sketch method use information from all component functions, and thus better approximate the true Hessian. It is interesting to observe that in the initial stage of these experiments the SVRG method outperforms the second-order methods; however, the progress made by the first-order method quickly stagnates.

Figure 3: Comparison of SSN-SGI, SSN-CG, Newton-Sketch and SVRG on four synthetic data sets: synthetic1; synthetic2; synthetic3; synthetic4. Each column corresponds to a different problem. The first row reports training error (F(w_k) - F(w^*)) vs. iterations, the second row reports training error vs. effective gradient evaluations, and the third row reports testing loss vs. effective gradient evaluations. The hyper-parameters of each method were tuned to achieve the best training error in terms of effective gradient evaluations. The cost of forming the linear systems in the Newton-Sketch method is ignored.

a.2 Performance of methods - Machine Learning Data sets

Figures 4–6 illustrate the performance of the methods on 12 popular machine learning data sets. As mentioned in Section 5.2, the performance of the methods is highly dependent on the problem characteristics. On the one hand, these figures show that there exists an SVRG sweet spot, a regime in which the SVRG method outperforms all of the stochastic second-order methods investigated in this paper; on the other hand, there are problems for which it is beneficial to incorporate some form of curvature information in order to avoid stagnation due to ill-conditioning. For such problems, the SVRG method makes slow progress because it requires a very small step length to ensure that it does not diverge.

Note: We did not run Newton-Sketch on the cov data set (Figure 6). This is because the number of data points in this data set (n = 522,911) is large, and it was therefore prohibitive, in terms of memory, to create the padded sketch matrices.

Figure 4: Comparison of SSN-SGI, SSN-CG, Newton-Sketch and SVRG on four machine learning data sets: australian-scale; australian; reged; mushroom. Each column corresponds to a different problem. The first row reports training error (F(w_k) - F(w^*)) vs. iterations, the second row reports training error vs. effective gradient evaluations, and the third row reports testing loss vs. effective gradient evaluations. The hyper-parameters of each method were tuned to achieve the best training error in terms of effective gradient evaluations. The cost of forming the linear systems in the Newton-Sketch method is ignored.
Figure 5: Comparison of SSN-SGI, SSN-CG, Newton-Sketch and SVRG on four machine learning data sets: ijcnn1; cina; ISOLET; gisette. Each column corresponds to a different problem. The first row reports training error (F(w_k) - F(w^*)) vs. iterations, the second row reports training error vs. effective gradient evaluations, and the third row reports testing loss vs. effective gradient evaluations. The hyper-parameters of each method were tuned to achieve the best training error in terms of effective gradient evaluations. The cost of forming the linear systems in the Newton-Sketch method is ignored.
Figure 6: Comparison of SSN-SGI, SSN-CG, Newton-Sketch and SVRG on four machine learning data sets: cov; MNIST; rcv1; real-sim. Each column corresponds to a different problem. The first row reports training error (F(w_k) - F(w^*)) vs. iterations, the second row reports training error vs. effective gradient evaluations, and the third row reports testing loss vs. effective gradient evaluations. The hyper-parameters of each method were tuned to achieve the best training error in terms of effective gradient evaluations. The cost of forming the linear systems in the Newton-Sketch method is ignored.

a.3 Sensitivity of methods

In Figures 7–10 we illustrate the sensitivity of the methods to the choice of hyper-parameters for 4 different data sets: synthetic1, australian-scale, mushroom and ijcnn1. One needs to be careful when interpreting these sensitivity results. At first glance all methods appear to have similar sensitivity to the chosen hyper-parameters; however, this is not the case. More specifically, if the step length is not chosen appropriately in the SVRG and SSN-SGI methods, these methods can diverge (step length too large) or make very slow progress (step length too small). We have excluded such runs from the figures presented in this section. Empirically, we observe that the Newton-Sketch and SSN-CG methods are more robust to the choice of hyper-parameters and easier to tune: for almost all choices of their respective hyper-parameters the methods converge, with the choice only affecting the speed of convergence.

Figure 7: synthetic1: Sensitivity of SVRG, SSN-SGI, Newton-Sketch and SSN-CG to the choice of hyper-parameters. Each row depicts a different method. The first column shows the performance of a given method when varying the first hyper-parameter, while the second column shows the performance when varying the second hyper-parameter. Note, for SVRG and Newton-SGI we only illustrate the performance of the methods in hyper-parameter regimes in which the methods did not diverge.
Figure 8: australian-scale: Sensitivity of SVRG, SSN-SGI, Newton-Sketch and SSN-CG to the choice of hyper-parameters. Each row depicts a different method. The first column shows the performance of a given method when varying the first hyper-parameter, while the second column shows the performance when varying the second hyper-parameter. Note, for SVRG and Newton-SGI we only illustrate the performance of the methods in hyper-parameter regimes in which the methods did not diverge.
Figure 9: mushroom: Sensitivity of SVRG, SSN-SGI, Newton-Sketch and SSN-CG to the choice of hyper-parameters. Each row depicts a different method. The first column shows the performance of a given method when varying the first hyper-parameter, while the second column shows the performance when varying the second hyper-parameter. Note, for SVRG and Newton-SGI we only illustrate the performance of the methods in hyper-parameter regimes in which the methods did not diverge.
Figure 10: ijcnn1: Sensitivity of SVRG, SSN-SGI, Newton-Sketch and SSN-CG to the choice of hyper-parameters. Each row depicts a different method. The first column shows the performance of a given method when varying the first hyper-parameter, while the second column shows the performance when varying the second hyper-parameter. Note, for SVRG and Newton-SGI we only illustrate the performance of the methods in hyper-parameter regimes in which the methods did not diverge.

Appendix B Eigenvalues

In this section, we present the eigenvalue spectrum for different data sets and different subsample sizes. Here |T| denotes the number of subsamples used in the subsampled Hessian and m denotes the number of rows of the sketch matrix. In order to make a fair comparison, we chose |T| = m in each figure.

To calculate the eigenvalues we used the following procedure. For each data set and subsample size we:

  1. Computed the true Hessian of (5.1) at w^*, and computed the eigenvalues of this true Hessian matrix (red).

  2. Computed 10 different subsampled Hessians of (5.1) at the optimal point w^*, using a different sample for each replication, computed the eigenvalues of each subsampled Hessian, and took the average of the eigenvalues across the 10 replications (blue).

  3. Computed 10 different sketched Hessians of (5.1) at the optimal point w^*, using a different sketch matrix for each replication, computed the eigenvalues of each sketched Hessian, and took the average of the eigenvalues across the 10 replications (green).

All eigenvalues were computed in MATLAB using the function eig. Figures 11–22 show the distribution of the eigenvalues for the different data sets. Each figure represents one data set and depicts the eigenvalue distribution for three different subsample and sketch sizes. The red marks, blue squares and green circles represent the eigenvalues of the true Hessian, the average eigenvalues of the subsampled Hessians and the average eigenvalues of the sketched Hessians, respectively. Since we computed 10 different subsampled and sketched Hessians, we also show the upper and lower bounds of each eigenvalue with error bars.
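An equivalent sketch of this averaging procedure in NumPy is shown below (the eigenvalues reported here were computed with MATLAB's eig); hessian_builder is a hypothetical callable that returns one random subsampled or sketched Hessian at w^*.

```python
# Minimal sketch of the spectrum-averaging procedure: build several random Hessian
# approximations, compute their eigenvalues, and report the mean, min and max of
# each (ascending) eigenvalue across replications.
import numpy as np

def averaged_spectrum(hessian_builder, num_replications=10):
    spectra = np.array([np.linalg.eigvalsh(hessian_builder())   # ascending eigenvalues
                        for _ in range(num_replications)])
    return spectra.mean(axis=0), spectra.min(axis=0), spectra.max(axis=0)
```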

As is clear from the figures below, provided the subsample and sketch sizes are reasonable, the eigenvalue spectra of the subsampled and sketched Hessians are able to capture the eigenvalue spectrum of the true Hessian. It appears, however, that the eigenvalue spectra of the sketched Hessians are closer to the spectrum of the true Hessian. By this we mean both that, with fewer rows of the sketch matrix, the average eigenvalues of the sketched Hessians are closer to the true eigenvalues of the Hessian, and that the error bars of the sketched Hessians are smaller than those of the subsampled Hessians. This is not surprising, as the sketched Hessians use information from all component functions, whereas the subsampled Hessians are constructed from a subset of the component functions. This is interesting in itself; however, the main reason we present these results is to demonstrate that the eigenvalue distributions arising in machine learning problems appear to be favorable for CG.

Figure 11: Distribution of eigenvalues, synthetic1.
Figure 12: Distribution of eigenvalues, synthetic2.
Figure 13: Distribution of eigenvalues, synthetic3.
Figure 14: Distribution of eigenvalues, synthetic4.
Figure 15: Distribution of eigenvalues, australian_scale and australian.
Figure 16: Distribution of eigenvalues, reged.
Figure 17: Distribution of eigenvalues, mushroom.
Figure 18: Distribution of eigenvalues, ijcnn1.
Figure 19: Distribution of eigenvalues, cina.
Figure 20: Distribution of eigenvalues, ISOLET.
Figure 21: Distribution of eigenvalues, gisette.
Figure 22: Distribution of eigenvalues, MNIST.