1 Introduction
A variety of first-order methods [6, 11, 16, 21] have recently been developed for the finite-sum problem,
(1.1)  min_{w ∈ R^d} F(w) = (1/n) Σ_{i=1}^{n} F_i(w),
where each component function F_i
is smooth and convex. These methods reduce the variance of the stochastic gradient approximations, which allows them to enjoy a linear rate of convergence on strongly convex problems. In particular, the SVRG method
[11] computes a variance-reduced direction by evaluating the full gradient at the beginning of a cycle and by updating this vector at every inner iteration using the gradient of just one component function
F_i. These variance-reducing methods perform well in practice, when properly tuned, and it has been argued that their worst-case performance cannot be improved by incorporating second-order information [2]. Given these observations and the wide popularity of first-order methods, one may question whether methods that incorporate stochastic second-order information have much to offer in practice for solving large-scale instances of problem (1.1). In this paper, we argue that there are many important settings where this is indeed the case, provided second-order information is employed in a judicious manner. We pay particular attention to the Newton-Sketch algorithm recently proposed in [18] and to Subsampled Newton methods (SSN) [3, 4, 8, 19, 20, 25].
The Newton-Sketch method is a randomized approximation of Newton’s method. It solves an approximate version of the Newton equations using an unbiased estimator of the true Hessian (Hessian Sketch; see
[18, 22]), based on a decomposition of reduced dimension that takes into account all components of (1.1). The Newton-Sketch method has been tested only on a few synthetic problems and, to the best of our knowledge, its efficiency on practical problems has not yet been explored. In this paper, we evaluate the algorithmic innovations and computational efficiency of Newton-Sketch compared to SSN methods. We identify several interesting tradeoffs involving storage, data access, speed of computation and overall efficiency. Conceptually, SSN methods are a special case of Newton-Sketch, but we argue here that they are more flexible, as well as more efficient over a wider range of applications. Although Newton-Sketch and SSN methods are in principle capable of achieving superlinear convergence [3, 20, 18], we implement them so as to be linearly convergent, because aiming for a faster rate is not cost effective. Thus, these two methods are direct competitors of the variance reduction methods mentioned above, which are also linearly convergent. Newton-Sketch and SSN methods are able to incorporate second-order information at low cost by exploiting the structure of the objective function (1.1); they aim to produce well-scaled search directions that facilitate the choice of the step length parameter and diminish the effects of ill-conditioning.
Two crucial ingredients in methods that employ second-order information are the technique for defining (explicitly or implicitly) the linear systems comprising the Newton equations, and the algorithm employed for solving these systems. We discuss the computational demands of the first ingredient, and study the use of the conjugate gradient (CG) method and of a stochastic gradient iteration (SGI) for solving the linear systems. We show, both theoretically and experimentally, the advantages of using the CG method, which is able to exploit favorable eigenvalue distributions in the Hessian approximations. We find the SSN method particularly appealing because one can coordinate the amount of second-order information collected at every iteration with the accuracy of the linear solve, so as to fully exploit the structure of the finite-sum problem (
1.1).

Contributions

We consider the finite-sum problem and show that methods that incorporate stochastic second-order information can be far more efficient on badly-scaled or ill-conditioned problems than a first-order method such as SVRG. These observations stand in stark contrast with popular belief and with worst-case complexity analyses that suggest little benefit from employing second-order information in this setting.

Although the Newton-Sketch algorithm is a more effective technique for approximating the spectrum of the true Hessian matrix, we find that Subsampled Newton methods are simpler, more versatile, equally efficient in terms of effective gradient evaluations, and generally faster in wall-clock time than the Newton-Sketch method. We discuss the algorithmic reasons for these advantages.

We present a complexity analysis of a Subsampled Newton-CG method that exploits the full power of CG. We show that the resulting complexity bounds can be much more favorable than those obtained from a worst-case analysis of the CG method, and that these benefits can be realized in practice. Moreover, we compare the work complexity of the SSN-CG method with that of the Newton-Sketch method.
The paper is organized as follows. In Section 2, we describe the Newton-Sketch and Subsampled Newton methods. We discuss practical implementations of the algorithms in detail in Section 3. In Section 4, we provide a convergence analysis for the Subsampled Newton method and compare its work complexity with that of the Newton-Sketch method. We present a thorough numerical investigation of the methods in Section 5. Finally, in Section 6, we provide some final remarks.
2 Newton-Sketch and Subsampled Newton Methods
The methods studied in this paper emulate Newton’s method,
(2.1)  w_{k+1} = w_k − ∇²F(w_k)^{−1} ∇F(w_k),
but instead of forming the Hessian matrix and solving the linear system exactly, which is prohibitively expensive in many applications, the methods define stochastic approximations to the Hessian that allow one to compute the step at a much lower cost. We assume throughout the paper that the gradient is exact, and therefore the stochastic nature of the iteration is entirely due to the way the Hessian approximation is defined. The reason we do not consider methods that compute stochastic approximations to the gradient is that we wish to analyze and compare only linearly convergent algorithms. This allows us to study competing second-order techniques in a uniform setting, and to make a controlled comparison with the SVRG method.
The iteration of secondorder methods has the form
(2.2)  w_{k+1} = w_k + α_k p_k,  where  A_k p_k = −∇F(w_k),
where A_k is an approximation of the Hessian ∇²F(w_k) and α_k is the step length. The methods presented below differ in the way the Hessian approximations are defined and in the way the linear systems are solved.
Newton-Sketch
The Newton-Sketch method [18]
is a randomized approximation to Newton’s method that, instead of explicitly computing the true Hessian matrix, utilizes a decomposition of the matrix and approximates the Hessian via random projections onto lower dimensions. At every iteration, the algorithm defines a random matrix known as a sketch matrix
S_k ∈ R^{m×n}, with m < n, which has the property that E[S_k^T S_k] = I. Next, the algorithm computes a square-root decomposition of the Hessian ∇²F(w_k), which we denote by ∇²F(w_k)^{1/2} ∈ R^{n×d}. The search direction p_k is obtained by solving the linear system
(2.3)  (S_k ∇²F(w_k)^{1/2})^T (S_k ∇²F(w_k)^{1/2}) p_k = −∇F(w_k)
The intuition underlying the Newton-Sketch method (2.2)–(2.3) is that p_k corresponds to the minimizer of a random quadratic model that, in expectation, is the standard quadratic approximation of the objective function (1.1) at w_k. Assuming that a square-root matrix is readily available and that the projections can be computed efficiently, forming and solving (2.3) is significantly cheaper than solving (2.1) [18].
Subsampled Newton (SSN)
Subsampling is a special case of sketching in which the Hessian approximation is constructed by randomly sampling component functions of (1.1). Specifically, at the kth iteration, one defines
(2.4)  ∇²F_{T_k}(w_k) = (1/|T_k|) Σ_{i∈T_k} ∇²F_i(w_k),
where the sample set T_k ⊂ {1, …, n} is chosen either uniformly at random (with or without replacement) or in a nonuniform manner, as described in [25]. The search direction p_k is the exact or inexact solution of the system of equations
(2.5)  ∇²F_{T_k}(w_k) p_k = −∇F(w_k).
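To make the iteration concrete, the following sketch runs a few SSN steps (2.2) with unit step length on a toy quadratic finite sum; the component functions, sample size, and iteration count are illustrative choices, not taken from the paper.

```python
# Toy illustration of Subsampled Newton steps (2.2), (2.4)-(2.5).
# The quadratic components, sample size, and iteration count are
# illustrative choices only.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5                                    # components, variables

# Components F_i(w) = 0.5 w^T Q_i w - b_i^T w, so gradients and Hessians
# are available in closed form.
Qs = np.stack([np.eye(d) + 0.1 * (M @ M.T)
               for M in rng.standard_normal((n, d, d))])
bs = rng.standard_normal((n, d))

def full_gradient(w):
    return Qs.mean(axis=0) @ w - bs.mean(axis=0)

w = np.zeros(d)
for _ in range(20):
    T = rng.choice(n, size=32, replace=False)    # sample set T_k
    H_T = Qs[T].mean(axis=0)                     # subsampled Hessian (2.4)
    p = np.linalg.solve(H_T, -full_gradient(w))  # Newton system (2.5)
    w = w + p                                    # unit step length
```

Because the subsampled Hessian is a reasonable approximation of the true Hessian here, the iteration drives the full gradient rapidly toward zero.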
SSN methods have received considerable attention recently, and some of their theoretical properties are well understood [3, 8, 19, 25]. Although Hessian subsampling can be thought of as a special case of sketching, as mentioned above, we view it here as a separate method because it has important algorithmic features that differentiate it from the other sketching techniques described in [18].
3 Practical Implementations of the Algorithms
We now discuss important implementation aspects of the Newton-Sketch and SSN methods. The per-iteration cost of the two methods can be decomposed as follows:
both methods require a full gradient computation at every iteration and the solution of a linear system of equations, (2.3) or (2.5), but, as mentioned above, the methods differ significantly in the way the Hessian approximation is constructed.
3.1 Forming the Hessian Approximation
Forming the Hessian approximation in the Newton-Sketch method is contingent on having a square-root decomposition of the true Hessian. In the case of generalized linear models (GLMs), the square-root matrix can be expressed as ∇²F(w)^{1/2} = D(w)^{1/2} A, where A ∈ R^{n×d} is the data matrix, x_i^T denotes the ith row of A, and D(w) is a diagonal matrix whose ith entry is the second derivative of the ith component loss evaluated at x_i^T w. This requires accessing the full data set in order to compute the square-root Hessian. Forming the linear system in (2.3) also has an associated algebraic cost, namely the projection of the square-root matrix onto the lower-dimensional space. When Newton-Sketch is implemented using randomized projections based on the fast Johnson-Lindenstrauss (JL) transform, e.g., Hadamard matrices, each iteration has complexity O(nd log m + md²), which is substantially lower than the O(nd²) complexity of Newton's method when m ≪ n.
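As an illustration of this GLM structure, the snippet below forms the square-root factor for a logistic-regression Hessian and checks that its Gram matrix reproduces the Hessian formed directly; the data and variable names are our own.

```python
# Hessian square-root for a GLM, using logistic regression as the example.
# The Hessian is (1/n) A^T D(w) A, so a square-root factor is
# (1/sqrt(n)) D(w)^{1/2} A. Data sizes are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 4
A = rng.standard_normal((n, d))          # data matrix, rows x_i^T
y = rng.choice([-1.0, 1.0], size=n)      # binary labels
w = rng.standard_normal(d)

z = y * (A @ w)                          # margins y_i x_i^T w
s = 1.0 / (1.0 + np.exp(-z))             # sigmoid
D = s * (1.0 - s)                        # diagonal of D(w)

sqrt_hess = np.sqrt(D)[:, None] * A / np.sqrt(n)   # square-root factor
H_direct = (A.T * D) @ A / n                       # Hessian formed directly

# The factor reproduces the Hessian: (D^{1/2}A)^T (D^{1/2}A) = A^T D A.
assert np.allclose(sqrt_hess.T @ sqrt_hess, H_direct)
```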
On the other hand, computations involving the Hessian approximation in the SSN method come at a much lower cost. The SSN method could be implemented using a sketch matrix whose rows are randomly selected rows of the (scaled) identity matrix. However, a practical implementation avoids the computation of the square-root matrix altogether by simply selecting a random subset of the component functions and forming the Hessian approximation as in (2.4). As such, the SSN method can be implemented without the need for projections, and thus does not require any computation beyond that of a Hessian-free Newton method.
Although the SSN method has a cheaper iteration, this is not always advantageous. For example, when the individual component functions are highly dissimilar, taking a small sample at each iteration of the SSN method may not yield a useful approximation of the true Hessian. For such problems, the Newton-Sketch method, which forms its approximation from a combination of all component functions, may be more appropriate. In the next section, we investigate these tradeoffs.
3.2 Solving the Linear Systems
Direct Solves
For problems where the number of variables is small, it is feasible to form the Hessian approximation explicitly and solve the linear systems using a direct solver. The Newton-Sketch method was originally proposed for problems where the number of component functions is large compared to the number of variables (n ≫ d). In this setting, the cost of solving the system (2.3) accurately is low compared to the cost of forming the linear system. SSN methods were designed for large problems (in both n and d), and do not require forming the linear system or solving it accurately.
The Conjugate Gradient Method
For large-scale problems, it is effective to employ a Hessian-free approach and compute an inexact solution of the systems (2.3) or (2.5) using an iterative method that requires only Hessian-vector products, rather than the Hessian matrix itself. The conjugate gradient method (CG) is the best-known method in this category [9]. One can compute a search direction by running a limited number of iterations of CG on the system
(3.1)  A_k p = −∇F(w_k),
where A_k is a positive definite matrix that denotes either the sketched Hessian (2.3) or the subsampled Hessian (2.5). Under a favorable distribution of the eigenvalues of A_k, the CG method can generate a useful approximate solution in far fewer than d steps; we exploit this result in the analysis presented in Section 4.
We should note that the square-root Hessian decomposition associated with the Newton-Sketch method is not a standard factorization, such as the LU or QR decomposition, and hence the advantages of standard factorization techniques for solving the system cannot be realized. In fact, using an iterative method such as CG is more efficient because of its convergence properties. Since CG requires only Hessian-vector products, we need not form the entire sketched Hessian matrix. We can obtain sketched Hessian-vector products by performing two matrix-vector products with the sketched square-root Hessian. More specifically, one can compute
the product (S_k ∇²F(w_k)^{1/2})^T (S_k ∇²F(w_k)^{1/2}) p in two steps, as follows:
(3.2)  t = (S_k ∇²F(w_k)^{1/2}) p,  followed by  (S_k ∇²F(w_k)^{1/2})^T t.
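The two-step product (3.2) can be sketched as follows; here the square-root factor and sketch matrix are random placeholders used only to illustrate the shapes and the cost.

```python
# The two matrix-vector products in (3.2): multiply by the sketched
# square-root factor, then by its transpose, without ever forming the
# d x d sketched Hessian (SA)^T (SA) explicitly. S and the square-root
# factor below are random placeholders.
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 500, 20, 64
sqrt_hess = rng.standard_normal((n, d))          # stands in for the square-root Hessian
S = rng.standard_normal((m, n)) / np.sqrt(m)     # sketch matrix

SA = S @ sqrt_hess                               # formed once per iteration
def sketched_hessian_vec(p):
    t = SA @ p                                   # first product: an m-vector
    return SA.T @ t                              # second product: a d-vector

p = rng.standard_normal(d)
explicit = (SA.T @ SA) @ p                       # what the two-step scheme avoids
assert np.allclose(sketched_hessian_vec(p), explicit)
```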
The subsampled Hessian used in the SSN method is significantly simpler to define than the sketched Hessian, and the CG solver is well suited to a Hessian-free approach.
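A minimal Hessian-free CG solver of the kind used for (3.1) is sketched below; this is the textbook iteration [9], requiring only a matrix-vector product callback, and the small test system is our own.

```python
# Hessian-free conjugate gradient: only products v -> A v are needed,
# matching how (3.1) is solved in the SSN-CG and Newton-Sketch settings.
import numpy as np

def cg(matvec, b, max_iter, tol=1e-10):
    """Approximately solve A x = b given only the map x -> A x."""
    x = np.zeros_like(b)
    r = b.copy()                     # residual b - A x for x = 0
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Quick check on a small symmetric positive definite system.
rng = np.random.default_rng(3)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)
b = rng.standard_normal(6)
x = cg(lambda v: A @ v, b, max_iter=50)
assert np.allclose(A @ x, b, atol=1e-6)
```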
Stochastic Gradient Iteration
We can also compute an approximate solution of the linear system using a stochastic gradient iteration (SGI) [1, 3]. We do so in the context of the SSN method. The resulting SSN-SGI method operates in cycles of length m. At every iteration of the cycle, one chooses an index i at random and computes a step of the gradient method for the quadratic model at w_k,
(3.3)  Q_k(p) = F(w_k) + ∇F(w_k)^T p + (1/2) p^T ∇²F(w_k) p.
This yields the iteration
(3.4)  p_{t+1} = p_t − β (∇²F_{i_t}(w_k) p_t + ∇F(w_k)),  t = 0, 1, …,
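One SGI cycle (3.3)-(3.4) might be sketched as follows on a toy quadratic model; the step length β, the cycle length, and the toy components are illustrative choices, not values from the paper.

```python
# One cycle of the SGI inner solver (3.3)-(3.4): stochastic gradient steps
# on the quadratic model, with one component Hessian-vector product per
# step. Toy components; beta and the cycle length are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n, d = 50, 4
Qs = np.stack([np.eye(d) + 0.05 * (M @ M.T)
               for M in rng.standard_normal((n, d, d))])
H = Qs.mean(axis=0)                 # true Hessian of the model
g = rng.standard_normal(d)          # gradient of F at w_k

p = np.zeros(d)
beta, cycle_len = 0.05, 4000
for _ in range(cycle_len):
    i = rng.integers(n)
    # Stochastic gradient of Q_k: one component Hessian times p, plus grad F.
    p = p - beta * (Qs[i] @ p + g)

exact = np.linalg.solve(H, -g)      # the direction SGI is approximating
```

With a constant step length the iterates settle into a neighborhood of the exact Newton direction rather than converging to it, which is one reason the CG solver compares favorably in the experiments below.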
LiSSA, a method similar to SSN-SGI, is analyzed in [1], and [3] compares the complexity of LiSSA and SSN-CG. However, neither of those papers reports extensive numerical results.
4 Analysis
The theoretical properties of the methods described in the previous sections have been the subject of several studies; see, e.g., [3, 7, 8, 14, 18, 19, 20, 22, 23, 24]. We focus here on a topic that has not received sufficient attention even though it has significant impact in practice: the CG method can be much faster than other iterative linear solvers for many classes of problems. We show that the strengths of the CG method can be quantified in the context of an SSN-CG method.
A complexity analysis of the SSN-CG method based on the worst-case performance of CG is given in [3]. We argue that such an analysis underestimates the power of the SSN-CG method on problems with favorable eigenvalue distributions and yields complexity estimates that are not indicative of its practical performance. We now present a more accurate analysis of SSN-CG by paying close attention to the behavior of the CG iteration.
When applied to a symmetric positive definite linear system Ax = b, the progress made by the CG method is not uniform, but depends on the gaps between eigenvalues. Specifically, if we denote the eigenvalues of A by λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_d, then the iterates x_j generated by the CG method satisfy [13]
(4.1)  ‖x_{j+1} − x*‖_A ≤ ((λ_{d−j} − λ_1)/(λ_{d−j} + λ_1)) ‖x_0 − x*‖_A,  where x* = A^{−1}b and ‖·‖_A is the norm induced by A.
We exploit this property in the following local linear convergence result for the SSN-CG iteration, which is established under standard assumptions. The SSN-CG method is given by (2.2), where the Hessian approximation is defined by (2.4) and the step is the iterate generated after r_k iterations of the CG method applied to the system (2.5). For simplicity, we assume that the size of the sample T_k used to define the Hessian approximation is constant throughout the run of the SSN-CG method (but the sample itself varies at every iteration).
Assumptions

(Bounded Eigenvalues of Hessians) Each function F_i is twice continuously differentiable and each component Hessian is positive definite with eigenvalues lying in a positive interval. That is, there exist positive constants μ and L such that
(4.2)  μ I ⪯ ∇²F_i(w) ⪯ L I  for all i and all w ∈ R^d.
We define κ = L/μ, which is an upper bound on the condition number of the true Hessian.

(Lipschitz Continuity of Hessian) The Hessian of the objective function F is Lipschitz continuous, i.e., there is a constant M > 0 such that
(4.3)  ‖∇²F(w) − ∇²F(z)‖ ≤ M ‖w − z‖  for all w, z ∈ R^d.
(Bounded Variance of Hessian Components) There is a constant σ such that, for all component Hessians, we have
(4.4)  E[‖∇²F_i(w) − ∇²F(w)‖²] ≤ σ²  for all w ∈ R^d.

(Bounded Moments of Iterates) There is a constant γ such that, for any iterate w_k generated by the method,
(4.5)  E[‖w_k − w*‖²] ≤ γ (E[‖w_k − w*‖])².
In practice, most machine learning problems are regularized, and therefore Assumption A.1 is satisfied. Assumption A.2 is standard in the context of Newton-type methods. Concerning Assumption A.3, variances are always bounded for the finite-sum problem. Assumption A.4 is imposed on nonnegative numbers and is less restrictive than assuming that the iterates are bounded. We now state the main theorem of the paper; we follow the technique in [3, Theorem 3.2] to prove the result. That result is based, however, on the worst-case analysis of the CG method, whereas our result gives a more realistic bound that exploits the properties of the CG iteration. Let λ_1^T ≤ λ_2^T ≤ ⋯ ≤ λ_d^T denote the eigenvalues of the subsampled Hessian ∇²F_{T_k}(w_k).
Theorem 4.1.
Suppose that Assumptions A.1–A.4 hold. Let {w_k} be the iterates generated by the Subsampled Newton-CG method with unit step length, where the step is defined by (2.5) with a sample size T = |T_k| satisfying
(4.6) 
and suppose that the number of CG steps performed at iteration k is the smallest integer r_k such that
(4.7) 
Then, if the initial iterate w_0 is sufficiently close to w*, we have
(4.8)  E[‖w_{k+1} − w*‖] ≤ (1/2) ‖w_k − w*‖.
Proof.
We write
(4.9)  E[‖w_{k+1} − w*‖] ≤ E[‖w_k + p_k − w*‖] + E[‖p_k^r − p_k‖],
where p_k denotes the exact solution of the linear system (2.5), p_k^r denotes the CG iterate after r_k steps, and the two terms on the right-hand side are Term 1 and Term 2, respectively.
Term 1 was analyzed in [3], which establishes the bound
(4.10) 
Using the result in Lemma 2.3 from [3] and (4.6), we have
(4.11)  
(4.12) 
Now, we analyze Term 2, which is the residual after r_k CG iterations. Assuming for simplicity that the initial CG iterate is zero, we obtain from (4.1)
where A = ∇²F_{T_k}(w_k). To express this in terms of unweighted norms, note that if ‖v‖_A² = v^T A v, then √μ ‖v‖ ≤ ‖v‖_A ≤ √L ‖v‖.
Therefore, from Assumption A.1 and due to the condition (4.7) on the number of CG iterations, we have
(4.13) 
where the last equality follows from the definition .
Thus, if the initial iterate w_0 is near the solution w*, then the SSN-CG method converges to w* at a linear rate. As we discuss below, when the spectrum of the Hessian approximation is favorable, condition (4.7) can be satisfied after a small number of CG iterations. We should note that in a practical implementation of the SSN-CG method the number of CG iterations is not chosen by testing condition (4.7), as this would involve computing the entire spectrum of the subsampled Hessian. Instead, one uses the well-known practical residual test [17]
‖r_j‖ ≤ ζ ‖r_0‖, where r_j is the residual after j CG iterations and ζ ∈ (0, 1) is a tolerance, as a proxy for choosing the number of CG iterations. Due to equation (4.1), the residual test reflects condition (4.7) well, i.e., under favorable eigenvalue distributions the residual test is satisfied after only a few CG iterations.
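The effect of a clustered spectrum on CG can be demonstrated directly. The two spectra below share the same condition number, but only one is clustered; the construction is a synthetic illustration, not one of the paper's data sets.

```python
# CG on two spectra with the same condition number: one with two tight
# clusters, one spread geometrically. After the same small number of
# iterations, the clustered system is solved to high accuracy.
import numpy as np

def cg_error(eigs, iters, rng):
    """Run plain CG on A = Q diag(eigs) Q^T; return the relative error."""
    d = len(eigs)
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    A = (Q * eigs) @ Q.T
    b = rng.standard_normal(d)
    x_true = np.linalg.solve(A, b)
    x = np.zeros(d); r = b.copy(); p = r.copy(); rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p; r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p; rs = rs_new
    return np.linalg.norm(x - x_true) / np.linalg.norm(x_true)

rng = np.random.default_rng(5)
d = 200
clustered = np.concatenate([np.full(d // 2, 1.0), np.full(d // 2, 100.0)])
clustered += 1e-3 * rng.random(d)          # two tight clusters
spread = np.geomspace(1.0, 100.0, d)       # same condition number, spread out

err_clustered = cg_error(clustered, iters=10, rng=rng)
err_spread = cg_error(spread, iters=10, rng=rng)
```

Ten iterations essentially solve the clustered system, while the spread system still carries a noticeable error, which is the behavior the residual test exploits.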
We now establish a work complexity result. By work complexity we refer to the total number of component gradient evaluations and component Hessian-vector products. In establishing this result, we assume that the cost of evaluating a component Hessian-vector product is the same as the cost of evaluating a component gradient, which is the case for the problems investigated in this paper.
Corollary 4.2.
Suppose that Assumptions A.1–A.4 hold, and let T and r_k be defined as in Theorem 4.1. Then, the work complexity required to obtain an ε-accurate solution is
(4.15) 
This result involves the quantity max_k r_k, the maximum number of CG iterations performed at any iteration of the SSN-CG method. This quantity depends on the eigenvalue distribution of the subsampled Hessians. Under certain eigenvalue distributions, such as clustered eigenvalues, it can be much smaller than the bound predicted by a worst-case analysis of the CG method [3]. Such favorable eigenvalue distributions are observed in many applications, including machine learning problems. Figure 1
depicts the spectrum for 4 data sets used in our logistic regression experiments; see Appendix
B for more results. We should note that in Figure 1 the eigenvalues were computed at the optimal point w*; however, similar plots were produced at other iterates obtained by the methods. The least favorable case is given by the synthetic data set, which was constructed to have a high condition number and a wide spectrum of eigenvalues. The behavior of CG on this problem is similar to that predicted by its worst-case analysis. However, for the rest of the problems one can observe how a small number of CG iterations can give rise to a rapid improvement in accuracy. For example, for cina, the CG method improves the accuracy of the solution by 2 orders of magnitude during CG iterations 20–24. For gisette, CG exhibits a fast initial increase in accuracy, after which it is not cost effective to continue performing CG steps, due to the spectral distribution of this problem.
Let us now consider the Newton-Sketch algorithm, evaluate the total work complexity required to obtain an ε-accurate solution, and compare it with the work complexity of the SSN-CG method. In [18, Theorem 1], the authors provide a linear-quadratic convergence rate (stated in probability) for the exact Newton-Sketch method, similar in form to (4.9), with Term 2 in (4.9) equal to zero due to the exact solves of the linear systems. In order to achieve local linear convergence, the user-supplied tolerance parameter must be chosen as a sufficiently small constant. As a result, the sketch dimension m is proportional to the square of the Gaussian width of the problem, which is at most d [18]. The per-iteration cost of Newton-Sketch with a randomized Hadamard transform is O(nd log m + md²).
Hence, the total work complexity required to obtain an ε-accurate solution is
(4.16) 
Note that in order to compare this with (4.15), we need to bound the sketch dimension and the number of CG iterations. Using the definitions given in the Assumptions, the sketch dimension is bounded by a multiple of d, and CG converges in at most d iterations. Using these worst-case bounds, we observe that Newton-Sketch has a slightly higher cost compared to SSN-CG.
We should note in passing that certain implicit assumptions are made about both of these algorithms. In the SSN method, the number of subsamples is assumed to be less than the number of examples n; in Newton-Sketch, the sketch dimension is assumed to be less than n. This implies that both methods make the implicit assumption that n > d.
5 Numerical Study
In this section, we study the practical performance of the Newton-Sketch, Subsampled Newton (SSN) and SVRG methods. We consider two versions of the SSN method that differ in the choice of iterative linear solver; we refer to them as SSN-CG and SSN-SGI. Our main goals are to investigate whether the second-order methods have advantages over first-order methods, to identify the classes of problems for which this is the case, and to evaluate the relative strengths of the Newton-Sketch and SSN approaches. We should note that although SVRG updates the iterates at every inner iteration, its linear convergence rate applies to the outer iterates (i.e., to each complete cycle). A cycle in the SVRG method is defined as the iterations between full gradient evaluations, which makes it comparable to the Newton-Sketch and SSN methods.
We test the methods on binary classification problems where the training function is given by a logistic loss with regularization:
(5.1)  F(w) = (1/n) Σ_{i=1}^{n} log(1 + exp(−y_i x_i^T w)) + λ ‖w‖².
Here x_i ∈ R^d and y_i ∈ {−1, 1}, i = 1, …, n, denote the training examples, and λ > 0 is a regularization parameter. For this objective function, the cost of computing the gradient is the same as the cost of computing a Hessian-vector product, an assumption made in the analysis of Section 4. When reporting the numerical performance of the methods, we make use of this observation.
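For reference, a sketch of the gradient and Hessian-vector product for this objective, assuming the standard regularizer λ‖w‖² as in (5.1); both quantities require two products with the data matrix, which is the cost-equivalence used above. The numerical values are arbitrary.

```python
# Gradient and Hessian-vector product for the regularized logistic loss.
# Both need one pass over the data (two products with A), so they cost
# roughly the same, as assumed in the analysis of Section 4.
import numpy as np

def logistic_grad_and_hessvec(A, y, w, v, lam):
    """Return (grad F(w), grad^2 F(w) v) for the loss in (5.1)."""
    n = A.shape[0]
    z = y * (A @ w)                              # margins y_i x_i^T w
    sig = 1.0 / (1.0 + np.exp(-z))
    grad = -(A.T @ (y * (1.0 - sig))) / n + 2 * lam * w
    D = sig * (1.0 - sig)                        # Hessian diagonal weights
    hess_vec = (A.T @ (D * (A @ v))) / n + 2 * lam * v
    return grad, hess_vec

rng = np.random.default_rng(6)
n, d = 64, 5
A = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], n)
w, v = rng.standard_normal(d), rng.standard_normal(d)
grad, hv = logistic_grad_and_hessvec(A, y, w, v, lam=0.01)
```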
The data sets employed in our experiments are listed in Appendix A. They were selected to have different dimensions and condition numbers, and consist of both synthetic and established data sets.
5.1 Methods
We now give a more detailed description of the methods tested in our experiments. All methods have two hyperparameters that need to be tuned to achieve good performance.
SVRG This method was implemented as described in [11], using Option I. The SVRG method has two hyperparameters: the step length and the length of the cycle, m. The cost per cycle of the SVRG method is n plus 2m evaluations of the component gradients ∇F_i.
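A minimal sketch of the SVRG cycle structure (Option I) on a toy quadratic finite sum follows; the step length, cycle length, and toy components are illustrative choices, not the tuned values used in the experiments.

```python
# Minimal SVRG (Option I of [11]) on a toy quadratic finite sum.
# The two hyperparameters are the step length and the cycle length m.
import numpy as np

rng = np.random.default_rng(7)
n, d = 100, 5
Qs = np.stack([np.eye(d) + 0.1 * (M @ M.T)
               for M in rng.standard_normal((n, d, d))])
bs = rng.standard_normal((n, d))

def grad_i(w, i):                      # gradient of component F_i
    return Qs[i] @ w - bs[i]

def full_grad(w):
    return Qs.mean(axis=0) @ w - bs.mean(axis=0)

w, step, m = np.zeros(d), 0.05, 2 * n
for _ in range(30):                    # outer iterations (cycles)
    w_ref = w.copy()
    g_ref = full_grad(w_ref)           # full gradient, once per cycle
    for _ in range(m):
        i = rng.integers(n)
        # Variance-reduced direction: grad F_i(w) - grad F_i(w_ref) + g_ref.
        w = w - step * (grad_i(w, i) - grad_i(w_ref, i) + g_ref)
```

Each inner step evaluates two component gradients, which is the source of the 2m term in the per-cycle cost above.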
Newton-Sketch We implemented the Newton-Sketch method proposed in [18]. The method evaluates a full gradient at every iteration and computes a step by solving the linear system (2.3). In [18], the system is solved exactly; however, this limits the size of the problems that can be tested. Therefore, in our experiments we use the matrix-free CG method to solve the system inexactly, alleviating the need to construct the full Hessian approximation matrix in (2.3). Our implementation of the Newton-Sketch method includes two parameters: the number of rows m of the sketch matrix S, and the maximum number of CG iterations, max_cg, used to compute the step. We define the sketch matrix through the Hadamard basis, which allows for a fast matrix-vector multiplication scheme. This does not require forming the sketch matrix explicitly and can be executed at a cost of O(nd log m) flops. The maximum cost associated with every iteration is n component gradients plus 2 · m · max_cg component matrix-vector products, where the factor 2 arises due to (3.2). We ignore the cost of forming the square-root matrix and multiplying by the sketch matrix, but note that these can be time-consuming operations. Therefore, we report the most optimistic computational results for Newton-Sketch.
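The fast Hadamard-based sketch can be sketched as follows; the subsampled randomized Hadamard transform shown here is one standard construction, and the sizes and sampling scheme (uniform rows, with replacement) are illustrative.

```python
# Sketching S A via the fast Walsh-Hadamard transform, without forming
# the sketch matrix S explicitly. Sizes are illustrative; n must be a
# power of two for this transform.
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform along axis 0."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            x[i:i + h] = a + x[i + h:i + 2 * h]
            x[i + h:i + 2 * h] = a - x[i + h:i + 2 * h]
        h *= 2
    return x

rng = np.random.default_rng(8)
n, d, m = 256, 8, 64
A = rng.standard_normal((n, d))            # stands in for the square-root Hessian
signs = rng.choice([-1.0, 1.0], size=n)    # random diagonal sign flips
rows = rng.integers(n, size=m)             # sampled Hadamard rows

HDA = fwht(signs[:, None] * A) / np.sqrt(n)   # orthonormal Hadamard times D A
SA = np.sqrt(n / m) * HDA[rows]               # the m x d sketched factor
```

The scaling is chosen so that (SA)^T (SA) is an unbiased estimate of A^T A, matching the sketching property used in Section 2.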
SSN-SGI The Subsampled Newton method that employs the SGI iteration for the step computation is described in Section 3. It computes a full gradient every m inner iterations, then builds a quadratic model (3.3) and uses the stochastic gradient method to compute the search direction (3.4), at the cost of one Hessian-vector product per SGI step. The total cost of an outer iteration of the method is n component gradients plus m component Hessian-vector products. This method has two hyperparameters: the step length β and the length m of the inner SGI cycle.
SSN-CG The Subsampled Newton-CG method computes the search direction from (2.5), where the linear system is solved by the CG method. The two hyperparameters in this method are the size of the subsample, |T|, and the maximum number of CG iterations used to compute the step, max_cg. The method computes a full gradient at every iteration. The maximum per-iteration cost of the SSN-CG method is n component gradients plus |T| · max_cg component Hessian-vector products.
For the Newton-Sketch and SSN methods, the step length was determined by an Armijo backtracking line search, starting from the unit step length, which was accepted most of the time. For methods that use CG to solve the linear system, we included an early termination option based on the residual of the system, i.e., we terminate the CG iteration if
(5.2)  ‖r_j‖ ≤ ζ ‖r_0‖,
where ζ ∈ (0, 1) is a tuning parameter. In all cases, the total cost reported below is based on the actual number of CG iterations performed for each problem.
Experimental Setup
To determine the best implementation of each method, we attempt to find the setting of parameters that leads to the smallest amount of work required to reach the best possible function value. To do this, we allocated different per-iteration budgets, where the budget denotes the total amount of work done between full gradient evaluations. For the second-order methods, the per-iteration budget is the amount of work done per step taken, while for the SVRG method it is the amount of work of a complete cycle of inner iterations. We tested 9 different per-iteration budget levels for each method and independently tuned the associated hyperparameters. For each budget level, we considered all possible combinations of the hyperparameters and chose the combination that gave the best performance. For example, for a given budget: (i) for SVRG, we set the cycle length to match the budget and tuned the step length; (ii) for Newton-Sketch, we chose pairs (m, max_cg) whose product matched the budget; (iii) for SSN-SGI, we set the inner cycle length to match the budget and tuned the step length; and (iv) for SSN-CG, we chose pairs (|T|, max_cg) whose product matched the budget.
5.2 Numerical Results
In Figure 2 we present a sample of the results obtained in our experiments on machine learning data sets (the rest are given in Appendix A). We report training error (F(w_k) − F*) vs. iterations, and training error vs. effective gradient evaluations (defined as the sum of all function evaluations, gradient evaluations and Hessian-vector products), where F* is the function value at the optimal solution w*. Note that for SVRG one iteration denotes one outer iteration (a complete cycle). We do not report results in terms of CPU time, as this would render the Newton-Sketch method uncompetitive for large-scale problems, using the best implementation known to us.
Let us consider the plots in the second row of Figure 2. Our first observation is that the SSN-SGI method is not competitive with the SSN-CG method. This was not clear to us before our experimentation, because the ability of the SGI method to use a new Hessian sample at each inner iteration seems advantageous. We observe, however, that in the context of the second-order methods investigated in this paper, the CG method is clearly more efficient.
Our results show that SVRG is superior on problems such as rcv1, where the benefits of using curvature information do not outweigh the increase in cost. On the other hand, problems like gisette and australian show the benefits of employing second-order information: it is striking how much faster the second-order methods are than SVRG, even though SVRG was carefully tuned. Problem australian_scale, which is well conditioned, illustrates that there are cases where all methods perform on par.
The Newton-Sketch and SSN-CG methods perform on par on most of the problems in terms of effective gradient evaluations. However, had we reported CPU time, the SSN-CG method would show a dramatic advantage over Newton-Sketch. An unexpected finding is that, although the number of CG iterations required for the best performance of the two methods is almost always the same, the best sample sizes (the number of rows of the sketch matrix in Newton-Sketch and the number of subsamples in SSN-CG) differ: Newton-Sketch requires a smaller number of rows (when implemented using Hadamard matrices). This can also be seen in the eigenvalue figures in Appendix B, where the sketched Hessian with fewer rows is able to adequately capture the eigenvalues of the true Hessian.
6 Final Remarks
The theoretical properties of Newton-Sketch and Subsampled Newton methods have been the subject of much attention, but their practical implementation and performance have not been investigated in depth. In this paper, we focus on the finite-sum problem and report that both methods live up to our expectations of being more efficient than first-order methods (as represented by SVRG) on problems that are ill-conditioned or badly scaled. These results stand in contrast with the conventional wisdom that first-order methods are to be preferred for large-scale finite-sum optimization problems. Newton-Sketch and SSN are also much less sensitive to the choice of hyperparameters than SVRG; see Appendix A.3. One should note, however, that SVRG, when carefully tuned, can outperform second-order methods on problems where the objective function does not exhibit high curvature.
Although a sketched Hessian based on Hadamard matrices contains more information than a subsampled Hessian, and approximates the spectrum of the true Hessian more accurately, the two methods perform on par in our tests, as measured by effective gradient evaluations. However, in terms of CPU time the Newton-Sketch method is generally much slower, due to the high costs associated with the construction of the linear system (2.3). A Subsampled Newton method that employs the CG algorithm to solve the linear system (the SSN-CG method) is particularly appealing, as one can coordinate the amount of second-order information employed at every iteration with the accuracy of the linear solver to find an efficient implementation for a given problem.
The power of the CG algorithm is evident in our tests and is quantified in the complexity analysis presented in this paper. The analysis shows that when the eigenvalues of the Hessian occur in clusters, as is common in practice, the SSN-CG method performs much better than predicted by a worst-case analysis of CG.
References
 [1] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. Journal of Machine Learning Research, 18(116):1–40, 2017.
 [2] Yossi Arjevani and Ohad Shamir. Oracle complexity of second-order methods for finite-sum problems. In International Conference on Machine Learning 30, pages 205–213, 2017.
 [3] Raghu Bollapragada, Richard H Byrd, and Jorge Nocedal. Exact and inexact subsampled newton methods for optimization. IMA Journal of Numerical Analysis, 2018.
 [4] Richard H Byrd, Gillian M Chin, Will Neveitt, and Jorge Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.
 [5] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
 [6] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pages 1646–1654, 2014.
 [7] Petros Drineas, Michael W Mahoney, S Muthukrishnan, and Tamás Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.
 [8] Murat A Erdogdu and Andrea Montanari. Convergence rates of subsampled Newton methods. In Advances in Neural Information Processing Systems 28, pages 3034–3042, 2015.
 [9] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, second edition, 1989.
 [10] Isabelle Guyon, Constantin F Aliferis, Gregory F Cooper, André Elisseeff, JeanPhilippe Pellet, Peter Spirtes, and Alexander R Statnikov. Design and analysis of the causation and prediction challenge. In WCCI Causation and Prediction Challenge, pages 1–33, 2008.
 [11] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323, 2013.
 [12] Yann LeCun, LD Jackel, Léon Bottou, Corinna Cortes, John S Denker, Harris Drucker, Isabelle Guyon, UA Muller, E Sackinger, Patrice Simard, et al. Learning algorithms for classification: A comparison on handwritten digit recognition. Neural Networks: The Statistical Mechanics Perspective, 261:276, 1995.
 [13] David G. Luenberger. Linear and Nonlinear Programming. Addison-Wesley, second edition, 1984.
 [14] Haipeng Luo, Alekh Agarwal, Nicolò Cesa-Bianchi, and John Langford. Efficient second order online learning by sketching. In Advances in Neural Information Processing Systems 29, pages 902–910, 2016.
 [15] Indraneel Mukherjee, Kevin Canini, Rafael Frongillo, and Yoram Singer. Parallel boosting with momentum. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 17–32. Springer Berlin Heidelberg, 2013.
 [16] Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pages 2613–2621, 2017.
 [17] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer, New York, second edition, 1999.
 [18] Mert Pilanci and Martin J Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245, 2017.
 [19] Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled Newton methods I: Globally convergent algorithms. arXiv preprint arXiv:1601.04737, 2016.
 [20] Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled Newton methods II: Local convergence rates. arXiv preprint arXiv:1601.04738, 2016.
 [21] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2016.
 [22] Jialei Wang, Jason D Lee, Mehrdad Mahdavi, Mladen Kolar, Nathan Srebro, et al. Sketching meets random projection in the dual: A provable recovery algorithm for big and high-dimensional data. Electronic Journal of Statistics, 11(2):4896–4944, 2017.
 [23] Shusen Wang, Alex Gittens, and Michael W Mahoney. Sketched ridge regression: Optimization perspective, statistical perspective, and model averaging. Journal of Machine Learning Research, 18(218):1–50, 2018.
 [24] David P Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.
 [25] Peng Xu, Jiyan Yang, Farbod Roosta-Khorasani, Christopher Ré, and Michael W Mahoney. Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems 29, pages 3000–3008, 2016.
Appendix A Complete Numerical Results
Dataset  n (train; test)  d

synthetic1  (9,000; 1,000)  100
synthetic2  (9,000; 1,000)  100
synthetic3  (90,000; 10,000)  100
synthetic4  (90,000; 10,000)  100
Dataset  n (train; test)  d  Source

australian  (621; 69)  14  [5]
reged  (450; 50)  999  [10]
mushroom  (5,500; 2,624)  112  [5]
ijcnn1  (35,000; 91,701)  22  [5]
cina  (14,430; 1,603)  132  [10]
ISOLET  (7,017; 780)  617  [5]
gisette  (6,000; 1,000)  5,000  [5]
cov  (522,911; 58,101)  54  [5]
MNIST  (60,000; 10,000)  748  [12]
rcv1  (20,242; 677,399)  47,236  [5]
real-sim  (65,078; 7,231)  20,958  [5]
The synthetic data sets were generated randomly as described in [15]. These data sets were created to have different numbers of samples (n) and different condition numbers (κ), as well as a wide spectrum of eigenvalues (see Figures 11–14). We also explored the performance of the methods on popular machine learning data sets chosen to have different numbers of samples (n), different numbers of variables (d) and different condition numbers (κ). We used the testing data sets where provided; for data sets where a testing set was not provided, we randomly split the data into a training set and a testing set.
We focus on logistic regression classification; the objective function is given by

(A.1)  F(w) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i x_i^T w}\right) + \frac{\lambda}{2} \|w\|^2,

where (x_i, y_i), for i = 1, \dots, n, denote the training examples and \lambda > 0 is the regularization parameter.
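For concreteness, a direct implementation of the objective (A.1) and its gradient might look as follows. This is a sketch of ours, not the paper's code; the function and variable names are illustrative, and `np.logaddexp` is used for numerical stability.

```python
import numpy as np

def logistic_objective(w, X, y, lam):
    """Regularized logistic loss (A.1) and its gradient.

    X: n-by-d matrix of training examples, y: labels in {-1, +1},
    lam: regularization parameter.
    """
    n = X.shape[0]
    z = y * (X @ w)
    # log(1 + exp(-z)) computed stably as logaddexp(0, -z).
    loss = np.mean(np.logaddexp(0.0, -z)) + 0.5 * lam * (w @ w)
    sigma = 1.0 / (1.0 + np.exp(z))       # = exp(-z) / (1 + exp(-z))
    grad = -(X.T @ (y * sigma)) / n + lam * w
    return loss, grad
```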
For the experiments in Sections A.1, A.2 and A.3 (Figures 3–10), we ran four methods: SVRG, Newton-Sketch, SSN-SGI and SSN-CG. The implementation details of the methods are given in Section 5.1. In Section A.1 we show the performance of the methods on the synthetic data sets, in Section A.2 we show the performance of the methods on popular machine learning data sets, and in Section A.3 we examine the sensitivity of the methods to the choice of hyperparameters. We report training error vs. iterations, and training error vs. effective gradient evaluations (defined as the sum of all function evaluations, gradient evaluations and Hessian-vector products). In Sections A.1 and A.2 we also report testing loss vs. effective gradient evaluations.
A.1 Performance of Methods: Synthetic Data Sets
Figure 3 illustrates the performance of the methods on four synthetic data sets. These problems were constructed to have high condition numbers, which is the setting where Newton-type methods show their strength compared to first-order methods such as SVRG, and to have a wide spectrum of eigenvalues, which is a setting that is unfavorable for the CG method.
As is clear from Figure 3, the Newton-Sketch and SSN-CG methods outperform the SVRG and SSN-SGI methods. This is expected due to the ill-conditioning of the problems. The Newton-Sketch method performs slightly better than the SSN-CG method; this can be attributed primarily to the fact that the component functions in these problems are highly dissimilar. The Hessian approximations constructed by the Newton-Sketch method use information from all component functions, and thus better approximate the true Hessian. It is interesting to observe that in the initial stage of these experiments the SVRG method outperforms the second-order methods; however, the progress made by the first-order method quickly stagnates.
A.2 Performance of Methods: Machine Learning Data Sets
Figures 4–6 illustrate the performance of the methods on 12 popular machine learning data sets. As mentioned in Section 5.2, the performance of the methods is highly dependent on the problem characteristics. On the one hand, these figures show that there exists an SVRG sweet spot, a regime in which the SVRG method outperforms all stochastic second-order methods investigated in this paper; on the other hand, there are problems for which it pays to incorporate some form of curvature information in order to avoid stagnation due to ill-conditioning. For such problems, the SVRG method makes slow progress because it requires a very small step length to ensure that the method does not diverge.
Note: We did not run Newton-Sketch on the cov data set (Figure 6). This is because the number of data points in this data set is large, and so it was prohibitive (in terms of memory) to create the padded sketch matrices.
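The padding arises because Hadamard-based sketches require the row dimension to be a power of two. The snippet below is an illustrative subsampled randomized Hadamard transform (SRHT), not the paper's implementation: the data matrix is sign-flipped, zero-padded to the next power of two, transformed, and then row-subsampled; for large n the zero-padded copy is the memory cost alluded to above.

```python
import numpy as np

def fwht(a):
    """In-place fast Walsh-Hadamard transform of each column of a.

    The number of rows of a must be a power of two.
    """
    h = 1
    n = a.shape[0]
    while h < n:
        for i in range(0, n, 2 * h):
            x = a[i:i + h].copy()
            y = a[i + h:i + 2 * h].copy()
            a[i:i + h] = x + y
            a[i + h:i + 2 * h] = x - y
        h *= 2
    return a

def srht_sketch(A, m, seed=0):
    """Return an m-row SRHT sketch S A of the n-by-d matrix A.

    A is zero-padded to the next power of two before the Hadamard
    transform, which is where the extra memory goes for large n.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    n_pad = 1 << (n - 1).bit_length()      # next power of two >= n
    Ap = np.zeros((n_pad, d))
    Ap[:n] = A * rng.choice([-1.0, 1.0], size=(n, 1))   # random signs
    H = fwht(Ap) / np.sqrt(n_pad)          # orthogonal Hadamard transform
    idx = rng.choice(n_pad, size=m, replace=False)      # row subsampling
    return np.sqrt(n_pad / m) * H[idx]
```

When m equals the padded dimension, the sketch preserves the Gram matrix exactly, i.e. (SA)^T (SA) = A^T A; for m much smaller, it holds in expectation.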
A.3 Sensitivity of Methods
In Figures 7–10 we illustrate the sensitivity of the methods to the choice of hyperparameters on 4 different data sets: synthetic1, australian, mushroom and ijcnn1. One needs to be careful when interpreting these results. It may appear that all methods have similar sensitivity to the chosen hyperparameters; however, this is not the case. More specifically, if the step length is not chosen appropriately in the SVRG and SSN-SGI methods, these methods can diverge (step length too large) or make very slow progress (step length too small). We have excluded such runs from the figures presented in this section. Empirically, we observe that the Newton-Sketch and SSN-CG methods are more robust to the choice of hyperparameters and easier to tune: for almost all choices of sketch size and sample size, respectively, the methods converge, with the choice of hyperparameters only affecting the speed of convergence.
Appendix B Eigenvalues
In this section, we present the eigenvalue spectra for different data sets and different subsample sizes. The subsample size denotes the number of samples used in the subsampled Hessian, and the sketch size denotes the number of rows of the sketch matrix. In order to make a fair comparison, we chose the subsample size equal to the sketch size in each figure.
To calculate the eigenvalues, for each data set and subsample size we:

1. Computed the true Hessian of (5.1) at the optimal point, and computed the eigenvalues of the true Hessian matrix (red).

2. Computed 10 different subsampled Hessians of (5.1) at the optimal point using different samples, computed the eigenvalues of each subsampled Hessian, and took the average of the eigenvalues across the 10 replications (blue).

3. Computed 10 different sketched Hessians of (5.1) at the optimal point using different sketch matrices, computed the eigenvalues of each sketched Hessian, and took the average of the eigenvalues across the 10 replications (green).
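The subsampled part of this procedure might be sketched in code as follows. This is a Python analogue of the MATLAB computation described above, under the assumption that (5.1) is the regularized logistic objective; the function name, default regularization value, and replication count are illustrative.

```python
import numpy as np

def hessian_spectra(X, w, sample_size, reps=10, lam=0.1, seed=0):
    """Eigenvalues of the true logistic-regression Hessian at w, alongside
    the average eigenvalues of `reps` subsampled Hessians.

    The Hessian at w is (1/n) X^T D X + lam*I with D_ii = s_i (1 - s_i),
    s_i = sigmoid(x_i^T w); the labels drop out of the logistic Hessian.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    s = 1.0 / (1.0 + np.exp(-(X @ w)))
    D = s * (1.0 - s)
    H_true = (X.T * D) @ X / n + lam * np.eye(d)
    true_eigs = np.sort(np.linalg.eigvalsh(H_true))
    sub_eigs = np.zeros(d)
    for _ in range(reps):
        idx = rng.choice(n, size=sample_size, replace=False)
        H_s = (X[idx].T * D[idx]) @ X[idx] / sample_size + lam * np.eye(d)
        sub_eigs += np.sort(np.linalg.eigvalsh(H_s))
    return true_eigs, sub_eigs / reps
```

Plotting `true_eigs` against the averaged subsampled (and, analogously, sketched) spectra for several sample sizes reproduces the kind of comparison shown in Figures 11–22.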
All eigenvalues were computed in MATLAB using the function eig. Figures 11–22 show the distribution of the eigenvalues for different data sets. Each figure represents one data set and depicts the eigenvalue distribution for three different subsample and sketch sizes. The red marks, blue squares and green circles represent the eigenvalues of the true Hessian, the average eigenvalues of the subsampled Hessians and the average eigenvalues of the sketched Hessians, respectively. Since we computed 10 different subsampled and sketched Hessians, we also show the upper and lower bounds of each eigenvalue with error bars.
As is clear from the figures below, contingent on a reasonable subsample size and sketch size, the eigenvalue spectra of the subsampled and sketched Hessians are able to capture the eigenvalue spectrum of the true Hessian. It appears, however, that the eigenvalue spectra of the sketched Hessians are closer to the spectrum of the true Hessian. By this we mean both that, with fewer rows of the sketch matrix, the average eigenvalues of the sketched Hessians are closer to the true eigenvalues, and that the error bars of the sketched Hessians are smaller than those of the subsampled Hessians. This is not surprising, as the sketched Hessians use information from all component functions whereas the subsampled Hessians are constructed from a subset of the component functions. This is interesting in itself; however, the main reason we present these results is to demonstrate that the eigenvalue distributions of machine learning problems appear to be favorable for CG.