1. Introduction
Machine learning involves data-intensive optimization problems with a large number of component functions, one per available data point. Traditional (classical) methods fail to perform well on such problems, so the challenge is to develop scalable and efficient algorithms to deal with these large-scale learning problems (Zhou et al., 2017; Chauhan et al., 2017, 2018b, 2018a, 2018c).
Gradient descent (GD) (Cauchy, 1847) is the classical method of choice for solving machine learning problems, but it trains slowly on large-scale learning problems due to its high per-iteration cost. Stochastic approximation (Robbins and Monro, 1951) is quite effective in such situations, but it converges slowly due to the noisy approximations of the gradient. So a variety of stochastic variance reduction techniques have come into existence, e.g., (Le Roux et al., 2012; Johnson and Zhang, 2013; Konečný and Richtárik, 2013; Shalev-Shwartz and Zhang, 2013; Defazio et al., 2014; Zhang and Xiao, 2015; Allen-Zhu, 2017; Chauhan et al., 2017, 2018d; Fanhua et al., 2018). But the major limitation of these methods is that their convergence rate is at best linear.
The Newton method is another classical method for solving optimization problems, and it can give up to a quadratic convergence rate (Boyd and Vandenberghe, 2004). But again, the (pure) Newton method is not feasible for large-scale learning problems due to its huge per-iteration computational complexity and the need to store a huge Hessian matrix. So, nowadays, one of the most significant open questions in optimization for machine learning is: "Can we develop stochastic second order methods with quadratic convergence, like the Newton method, but with low per-iteration complexity, like stochastic approximation methods?". After the success of stochastic first order methods, research is shifting its focus towards stochastic second order methods to leverage their faster convergence.
Inexact Newton (also called truncated Newton or Hessian-free) methods and quasi-Newton methods are two of the major research directions for developing stochastic second order methods (Bollapragada et al., 2016). Inexact Newton methods solve the Newton equation approximately without calculating and storing the Hessian matrix. On the other hand, quasi-Newton methods approximate the Hessian inverse and thereby avoid the need to store the Hessian matrix. Thus both families try to resolve the issues with the Newton method. The stochastic variants of inexact Newton and quasi-Newton methods further reduce the complexity by using subsampled gradient and Hessian calculations. In this paper, we propose a novel stochastic trust region inexact Newton (STRON) method, which introduces subsampling into the gradient and Hessian calculations. It uses progressive sampling to enjoy the benefits of both regimes: stochastic approximation and full-batch processing. We further extend the method using existing variance reduction techniques to deal with the noise produced by the subsampling of the gradient, and by proposing PCG for solving the trust region subproblem.
1.1. Optimization Problem
We consider the unconstrained convex optimization problem of minimizing the expected risk:

(1)   \min_{w \in \mathbb{R}^d} F(w) = \mathbb{E}_{(x, y) \sim P} \left[ f(w; x, y) \right],

where $f$ is a smooth composition of a linear prediction model and a loss function over a data point $(x, y)$ selected at random from the unknown distribution $P$, and $w$ is the model parameter. Since $P$ is unknown, it is not feasible to solve (1) directly, so the model is approximated by taking a set of $n$ data points from the unknown distribution and then solving the empirical risk minimization problem:

(2)   \min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum_{i=1}^{n} f(w; x_i, y_i).

For simplicity, we write $f_i(w) = f(w; x_i, y_i)$. Finite-sum optimization problems of type (2) exist across different fields, like signal processing, statistics, operations research, data science and machine learning, e.g., logistic regression and SVM in machine learning.
1.2. Solution Techniques
The simplest iterative classical method for solving (2) is gradient descent (GD) (Cauchy, 1847):

(3)   w_{k+1} = w_k - \alpha_k \nabla F(w_k),

where $k$ is the iteration number and $\alpha_k$ is called the learning rate or step size. The complexity of this iteration is $O(nd)$, which is large for large-scale learning problems. SGD (stochastic gradient descent) (Robbins and Monro, 1951) is very effective for such problems due to its low per-iteration complexity:

(4)   w_{k+1} = w_k - \alpha_k \nabla f_{i_k}(w_k),

for some randomly selected data point $i_k \in \{1, 2, \dots, n\}$. But it converges slowly because of the noisy gradient values. To deal with the noise issue, a number of techniques have been proposed, and some of the important ones (as discussed in (Csiba and Richtárik, 2016)) are: (a) decreasing learning rates (Shalev-Shwartz et al., 2007), (b) mini-batching (Yang et al., 2018), (c) importance sampling (Csiba and Richtárik, 2016), and (d) variance reduction (Johnson and Zhang, 2013).
The variance reduction methods can further be classified into three categories: primal methods (Johnson and Zhang, 2013; Schmidt et al., 2016; Chauhan et al., 2017, 2018d), dual methods (Shalev-Shwartz and Zhang, 2013) and primal-dual methods (Zhang and Xiao, 2015). These techniques are effective for large-scale learning problems because they have a low per-iteration complexity, like SGD, and fast linear convergence, like GD. They thus exploit the best of SGD and GD, but, as stochastic variants of first order methods, their convergence is limited to a linear rate, unlike second order methods, which can give up to a quadratic rate.

Another classical method for solving (2) is the Newton method:

(5)   w_{k+1} = w_k - \left[ \nabla^2 F(w_k) \right]^{-1} \nabla F(w_k).

The complexity of this iteration is $O(nd^2 + d^3)$, and it needs to calculate, store and invert the Hessian matrix $\nabla^2 F(w_k)$, which is computationally very expensive. That is why, during the last decade, first order methods and their stochastic variants, rather than second order methods, have been studied so extensively for solving large-scale learning problems. As stochastic first order methods have hit their limits, the main focus is shifting towards stochastic second order methods, and nowadays one of the most significant open questions in optimization for machine learning is: "Can we develop stochastic second order methods with quadratic convergence, like the Newton method, but with low per-iteration complexity, like stochastic approximation methods?".
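To make these update rules concrete, here is a minimal sketch (ours, not the paper's code) contrasting one iteration of GD, SGD and the pure Newton method for a generic finite-sum objective; `grads` and `hessians` are assumed lists of per-component gradient and Hessian callbacks.

```python
import numpy as np

def gd_step(w, grads, alpha):
    """Gradient descent, eq. (3): full gradient over all n components, cost O(nd)."""
    g = sum(g_i(w) for g_i in grads) / len(grads)
    return w - alpha * g

def sgd_step(w, grads, alpha, rng):
    """SGD, eq. (4): gradient of a single randomly chosen component, cost O(d)."""
    i = rng.integers(len(grads))
    return w - alpha * grads[i](w)

def newton_step(w, grads, hessians):
    """Pure Newton, eq. (5): forms and solves with the full d x d Hessian, cost O(nd^2 + d^3)."""
    g = sum(g_i(w) for g_i in grads) / len(grads)
    H = sum(H_i(w) for H_i in hessians) / len(hessians)
    return w - np.linalg.solve(H, g)
```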
There are two major research directions for second order methods: quasi-Newton methods and inexact Newton methods, both of which try to resolve the issues associated with the Newton method. Quasi-Newton methods approximate the Hessian during each iteration:

(6)   w_{k+1} = w_k - \alpha_k B_k^{-1} \nabla F(w_k),

where $B_k$ is an approximation of $\nabla^2 F(w_k)$; BFGS is one such method (Fletcher, 1980). On the other hand, inexact Newton methods solve the Newton equation approximately, e.g., Newton-CG (conjugate gradient) (Steihaug, 1983). Both families address the issues of the Newton method, but their complexities are still large for large-scale problems. So a number of stochastic variants of these methods have been proposed, e.g., (Byrd et al., 2011, 2016; Bollapragada et al., 2016, 2018; Bellavia et al., 2018), which introduce subsampling into the gradient and Hessian calculations.
1.3. Contributions
The contributions of the paper are listed below:

We propose a novel subsampled variant of the trust region inexact Newton method, which we call STRON. The method introduces subsampling into the gradient and Hessian calculations and is, to the best of our knowledge, the first stochastic variant of the trust region inexact Newton method. It uses a progressive batching scheme for the gradient and Hessian calculations to enjoy the benefits of both regimes: stochastic approximation and full-batch processing.

STRON has been extended using existing variance reduction techniques to deal with the noisy approximations of the gradient. The extended method uses SVRG for variance reduction, with static batching for the gradient calculations and progressive batching for the Hessian calculations.

We further extend the method to use PCG, instead of the CG method, to solve the Newton equation inexactly. We use a weighted average of the identity matrix and the diagonal of the Hessian as the preconditioner.

Our theoretical results prove superlinear convergence for STRON, without strong convexity and Lipschitz continuity assumptions.

Finally, our empirical experiments prove the efficacy of STRON against existing techniques on benchmark datasets.
2. Literature Review
Stochastic approximation methods are very effective for large-scale problems due to their low per-iteration cost, e.g., SGD (Robbins and Monro, 1951), but they lead to slow convergence rates due to the noisy approximations. This issue is addressed using the following major techniques: (i) mini-batching (Yang et al., 2018), (ii) importance sampling (Csiba and Richtárik, 2016), (iii) variance reduction techniques (Johnson and Zhang, 2013), and (iv) decreasing step sizes (Shalev-Shwartz et al., 2007); see (Csiba and Richtárik, 2016) for details. Variance reduction techniques can be classified into three categories: primal methods (Johnson and Zhang, 2013; Schmidt et al., 2016; Chauhan et al., 2017, 2018d), dual methods (Shalev-Shwartz and Zhang, 2013) and primal-dual methods (Zhang and Xiao, 2015). SVRG (Johnson and Zhang, 2013) is one of the most widely used variance reduction techniques; it has been extended to parallel and distributed settings and has also been used in second order methods (Zhang et al., 2018). These techniques are effective for large-scale learning problems because they have a low per-iteration complexity, like SGD, and fast linear convergence, like GD. But variance reduction techniques are limited to linear convergence, far from the quadratic convergence of second order methods.
Second order methods utilize the curvature information to guide the step direction towards the solution and exhibit faster convergence than first order methods. But the huge per-iteration cost of forming and inverting the Hessian matrix makes the training of models slow for large-scale problems. Certain techniques have therefore been developed to deal with the issues related to the Hessian matrix; quasi-Newton methods and inexact Newton methods are the two major directions for dealing with the huge computational cost of the Newton method. Quasi-Newton methods approximate the Hessian matrix during each iteration: BFGS (Fletcher, 1980) and its limited memory variant, called L-BFGS (Liu and Nocedal, 1989), are examples of the quasi-Newton class, which use gradient and parameter values from previous iterations to approximate the Hessian inverse; L-BFGS uses only recent information from previous iterations. On the other hand, inexact Newton methods solve the Newton equation approximately, e.g., Newton-CG (Steihaug, 1983).
Recently, several stochastic variants of BFGS and L-BFGS have been proposed to deal with large-scale problems. Schraudolph et al. (2007) proposed stochastic variants of BFGS and L-BFGS for the online setting, known as oBFGS. Mokhtari and Ribeiro (2014) extended oBFGS by adding regularization which enforces an upper bound on the eigenvalues of the approximate Hessian, known as RES. SQN (stochastic quasi-Newton) is another stochastic variant of L-BFGS, which collects curvature information at regular intervals instead of at each iteration (Byrd et al., 2016). VITE (variance-reduced stochastic Newton) extended RES and proposed to use variance reduction for the subsampled gradient values (Lucchi et al., 2015). Moritz et al. (2016) proposed stochastic L-BFGS (SLBFGS), which uses SVRG for variance reduction and Hessian-vector products to approximate the gradient differences when calculating the Hessian approximations. Berahas et al. (2016) proposed a multi-batch scheme for stochastic L-BFGS where the batch sample changes with some overlap with the previous iteration. Bollapragada et al. (2018) proposed progressive batching, a stochastic line search and stable Newton updates for L-BFGS. Bollapragada et al. (2016) study the conditions on the subsample sizes needed to obtain different convergence rates.

Stochastic inexact Newton methods have also been explored extensively. Byrd et al. (2011) proposed stochastic variants of Newton-CG along with an L-BFGS method. Bollapragada et al. (2016) study subsampled Newton methods and find conditions on the subsample sizes and the forcing term (the constant used in the residual condition) for linear convergence of the Newton-CG method. Bellavia et al. (2018) study the effect of the forcing term and line search to obtain linear and superlinear convergence of the Newton-CG method. Newton-SGI (stochastic gradient iteration) is another way of solving the linear system approximately and is studied in Agarwal et al. (2017).
TRON (trust region Newton method) is one of the most efficient solvers for large-scale linear classification problems (Lin et al., 2008). It is a trust region inexact Newton method which does not use any subsampling and is available in the LIBLINEAR library (Fan et al., 2008). Hsia et al. (2017) extend TRON by improving the trust region radius value. Hsia et al. (2018) further extend TRON and use preconditioned conjugate gradient (PCG), with a weighted average of the identity matrix and the diagonal of the Hessian as the preconditioner, to solve the trust region subproblem. Since subsampling is an effective way to deal with large-scale problems, in this paper we propose a stochastic variant of the trust region inexact Newton method, which has not been studied so far, to the best of our knowledge.
3. Trust Region Inexact Newton Method
Inexact Newton methods, also called truncated Newton or Hessian-free methods, solve the Newton equation (a linear system) approximately. The CG method is a commonly used technique to solve the resulting subproblem approximately. In this section, we discuss the inexact Newton method and its trust region variant.
3.1. Inexact Newton Method
The quadratic model of the objective around $w_k$, obtained using Taylor's theorem, is

(7)   Q_k(p) = \nabla F(w_k)^T p + \frac{1}{2} p^T \nabla^2 F(w_k) p \approx F(w_k + p) - F(w_k).

Minimizing this model, i.e., setting $\nabla Q_k(p) = 0$, we get

(8)   \nabla^2 F(w_k) \, p = -\nabla F(w_k),

which is the Newton equation, and whose solution $p_k$ gives the Newton method:

(9)   w_{k+1} = w_k + p_k.

The computational complexity of this iteration is $O(nd^2 + d^3)$, which is very expensive: it involves the calculation and inversion of a large Hessian matrix, which is not only very expensive to compute but also expensive to store. The CG method approximately solves the subproblem (8) without forming the Hessian matrix, using only Hessian-vector products, which resolves the issues of large computational complexity and of storing the large Hessian matrix. Each (outer) iteration runs CG for a given number of iterations or until the residual condition is satisfied:

(10)   \| r_k \| = \| \nabla^2 F(w_k) \, p_k + \nabla F(w_k) \| \le \eta_k \| \nabla F(w_k) \|,

where $r_k$ is the residual and $\eta_k \in (0, 1)$ is a small positive value, known as the forcing term (Nocedal and Wright, 1999).
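As an illustration of how the Newton equation (8) is solved with only Hessian-vector products, here is a minimal CG sketch (ours, under the assumption that `hvp(v)` returns $\nabla^2 F(w_k) v$ and `g` holds $\nabla F(w_k)$); it exits on the residual condition (10).

```python
import numpy as np

def newton_cg(g, hvp, eta=0.1, max_cg_iters=25):
    """Approximately solve the Newton equation H p = -g with CG, without forming H.

    Stops when ||H p + g|| <= eta * ||g|| (the forcing-term condition) or after
    max_cg_iters iterations."""
    p = np.zeros_like(g)
    r = -g.copy()                    # residual b - H p with b = -g and p = 0
    s = r.copy()                     # conjugate search direction
    tol = eta * np.linalg.norm(g)
    for _ in range(max_cg_iters):
        if np.linalg.norm(r) <= tol:
            break
        Hs = hvp(s)
        alpha = (r @ r) / (s @ Hs)   # exact line search along s
        p += alpha * s
        r_new = r - alpha * Hs
        beta = (r_new @ r_new) / (r @ r)
        s = r_new + beta * s
        r = r_new
    return p
```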
3.2. Trust Region Inexact Newton Method
The trust region is a region in which the quadratic model of the given function is trusted to approximate the function well. In trust region methods, we do not need to calculate the step size (also called the learning rate) directly; the step length is adjusted indirectly through the trust region radius. A trust region method solves the following subproblem to get the step direction $p_k$:

(11)   \min_{p} \; Q_k(p) \quad \text{subject to} \quad \| p \| \le \Delta_k,

where $Q_k(p)$ is the quadratic model given in (7) and $\Delta_k$ is the trust region radius. This subproblem can be solved similarly to Newton-CG, except that we now need to take care of the extra constraint $\|p\| \le \Delta_k$. TRON (trust region Newton method) (Lin et al., 2008) is one of the most famous and widely used such methods, and is used in LIBLINEAR (Fan et al., 2008) to solve l2-regularized logistic regression and l2-SVM problems. Hsia et al. (2017) extend TRON by proposing a better trust region radius. Hsia et al. (2018) use a PCG method, with a weighted average of the identity matrix and the diagonal of the Hessian as the preconditioner, to solve the trust region subproblem, and show that PCG can be effective for ill-conditioned problems.
Then, the ratio of the actual to the predicted reduction of the model is calculated:

(12)   \rho_k = \frac{F(w_k + p_k) - F(w_k)}{Q_k(p_k)}.

The parameters are updated for the $(k+1)$th iteration as

(13)   w_{k+1} = \begin{cases} w_k + p_k, & \text{if } \rho_k > \eta_0, \\ w_k, & \text{otherwise}, \end{cases}

where $\eta_0 > 0$ is a given constant. Then the trust region radius is updated according to the ratio of actual to predicted reduction; a framework for updating $\Delta_k$, as given in (Lin and Moré, 1999), is

(14)   \Delta_{k+1} \in \begin{cases} [\sigma_1 \min\{\|p_k\|, \Delta_k\}, \, \sigma_2 \Delta_k], & \text{if } \rho_k \le \eta_1, \\ [\sigma_1 \Delta_k, \, \sigma_3 \Delta_k], & \text{if } \rho_k \in (\eta_1, \eta_2), \\ [\Delta_k, \, \sigma_3 \Delta_k], & \text{if } \rho_k \ge \eta_2, \end{cases}

where $0 < \eta_1 < \eta_2 < 1$ and $0 < \sigma_1 < \sigma_2 < 1 < \sigma_3$. If $\rho_k \le \eta_1$, the Newton step is considered unsuccessful and the trust region radius is shrunk. On the other hand, if $\rho_k \ge \eta_2$, the step is successful and the trust region radius is enlarged. We have implemented this framework as given in the LIBLINEAR library (Fan et al., 2008).
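A minimal sketch of this acceptance test and radius update follows (our own code; the constants eta0, eta1, eta2, sigma1, sigma3 are illustrative placeholders, and we pick one representative value from each interval of (14), whereas the actual implementation follows LIBLINEAR).

```python
import numpy as np

def trust_region_update(F, Q, w, p, Delta,
                        eta0=1e-4, eta1=0.25, eta2=0.75, sigma1=0.25, sigma3=4.0):
    """Accept or reject the step p and update the trust region radius Delta.

    F : objective function, Q : quadratic model value Q_k(p) (negative for a descent step)."""
    rho = (F(w + p) - F(w)) / Q(p)        # ratio of actual to predicted reduction, eq. (12)
    w_new = w + p if rho > eta0 else w    # accept the step only if the reduction is real, eq. (13)

    if rho <= eta1:                       # unsuccessful step: shrink the trust region
        Delta_new = sigma1 * min(np.linalg.norm(p), Delta)
    elif rho < eta2:                      # acceptable step: keep the radius
        Delta_new = Delta
    else:                                 # very successful step: enlarge the trust region
        Delta_new = sigma3 * Delta
    return w_new, Delta_new
```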
4. STRON
STRON introduces stochasticity into the trust region inexact Newton method and uses subsampled function, gradient and Hessian values to solve the trust region subproblem:

(15)   \min_{p} \; Q_k(p) = g_{X_k}^T p + \frac{1}{2} p^T H_{S_k} p \quad \text{subject to} \quad \| p \| \le \Delta_k,

where $H_{S_k}$ and $g_{X_k}$ are the subsampled Hessian and gradient over the subsamples $S_k$ and $X_k$, respectively, defined as

(16)   g_{X_k} = \frac{1}{|X_k|} \sum_{i \in X_k} \nabla f_i(w_k), \qquad H_{S_k} = \frac{1}{|S_k|} \sum_{i \in S_k} \nabla^2 f_i(w_k),

where the subsamples are increasing, i.e., $|X_k| \le |X_{k+1}|$ and $|S_k| \le |S_{k+1}|$, and $f_{X_k}(w) = \frac{1}{|X_k|} \sum_{i \in X_k} f_i(w)$ is the subsampled function value used for calculating $\rho_k$. STRON solves (15) approximately for a given number of CG iterations or until the following residual condition is satisfied:

(17)   \| r_k \| = \| H_{S_k} p_k + g_{X_k} \| \le \eta_k \| g_{X_k} \|,

where $\eta_k \in (0, 1)$ is the forcing term.
STRON is presented in Algorithm 1. It randomly selects subsamples $X_k$ and $S_k$ for the $k$th (outer) iteration; $X_k$ is used for calculating the gradient. It then solves the trust region subproblem using the CG solver (inner iterations), which uses the subsampled Hessian in the Hessian-vector products. CG stops when the residual condition (17) is satisfied, when it reaches the maximum number of CG iterations, or when it reaches the trust region boundary. The ratio of actual to predicted reduction is calculated as in (12), but using the subsampled function value, and is used for updating the parameters as in (13). Then the trust region radius is updated according to $\rho_k$ as in (14), and these steps are repeated for a given number of iterations or until convergence.
STRON uses progressive subsampling, i.e., dynamically growing subsamples, to calculate the function, gradient and Hessian values, and solves the Newton system approximately. It is effective for large-scale problems since it uses subsampling and solves the subproblem approximately, without forming the Hessian matrix, using only Hessian-vector products. It thus handles the complexity issues associated with the Newton method.
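The following rough Python sketch shows the STRON outer loop as we read Algorithm 1 (all names, subsample sizes, growth rate and constants are our own illustrative choices; the callbacks fun_b, grad_b and hvp_b for subsampled function, gradient and Hessian-vector products are assumed to be supplied by the user, and the negative-curvature branch of Steihaug's method is omitted since the objective is convex):

```python
import numpy as np

def steihaug_cg(g, hvp, Delta, eta=0.1, max_cg=25):
    """CG on H p = -g, truncated at the trust region boundary ||p|| <= Delta."""
    p, r = np.zeros_like(g), -g.copy()
    s, tol = r.copy(), eta * np.linalg.norm(g)
    for _ in range(max_cg):
        if np.linalg.norm(r) <= tol:          # residual condition (17)
            break
        Hs = hvp(s)
        alpha = (r @ r) / (s @ Hs)
        if np.linalg.norm(p + alpha * s) >= Delta:
            # the step would leave the trust region: move to the boundary and stop
            a, b, c = s @ s, 2 * (p @ s), p @ p - Delta ** 2
            tau = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
            return p + tau * s
        p += alpha * s
        r_new = r - alpha * Hs
        s = r_new + ((r_new @ r_new) / (r @ r)) * s
        r = r_new
    return p

def stron(fun_b, grad_b, hvp_b, n, d, iters=50, Delta=1.0, batch0=0.05,
          growth=1.5, eta0=1e-4, eta1=0.25, eta2=0.75, seed=0):
    """Sketch of the STRON outer loop: progressive subsampling + trust region inexact Newton."""
    rng = np.random.default_rng(seed)
    w, batch = np.zeros(d), max(1, int(batch0 * n))
    for k in range(iters):
        Xk = rng.choice(n, size=batch, replace=False)    # gradient/function subsample
        Sk = rng.choice(n, size=batch, replace=False)    # Hessian subsample
        g = grad_b(w, Xk)
        hvp = lambda v: hvp_b(w, v, Sk)
        p = steihaug_cg(g, hvp, Delta)                   # solve subproblem (15)
        pred = g @ p + 0.5 * (p @ hvp(p))                # predicted reduction Q_k(p)
        rho = (fun_b(w + p, Xk) - fun_b(w, Xk)) / pred   # eq. (12) with subsampled function
        if rho > eta0:                                   # accept the step, eq. (13)
            w = w + p
        # radius update in the spirit of eq. (14)
        Delta = (0.25 * min(np.linalg.norm(p), Delta) if rho <= eta1
                 else (Delta if rho < eta2 else 4.0 * Delta))
        batch = min(n, int(growth * batch))              # progressive sampling
    return w
```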
4.1. Complexity
The per-iteration complexity of the trust region inexact Newton (TRON) method depends on the function, gradient and CG subproblem solver; it is dominated by the CG subproblem solver and is given by

(18)   O(\text{nnz} \times \#\text{CG iterations}),

where nnz is the number of nonzero values in the dataset. For the subsampled trust region inexact Newton (STRON) method, the per-iteration complexity is

(19)   O(\text{nnz}(S_k) \times \#\text{CG iterations}),

where $\text{nnz}(S_k)$ is the number of nonzero values in the subsample $S_k$. Since the number of CG iterations taken by TRON and STRON does not differ much, the per-iteration complexity of STRON is smaller than that of TRON in the initial iterations and later becomes equal to that of TRON due to progressive sampling, i.e., when $|S_k| = n$.
5. Analysis
In this section, we derive the convergence rate for STRON, without assuming strong convexity or Lipschitz continuity.
Theorem 5.1 (Superlinear Convergence).
For a parameter sequence $\{w_k\}$ converging to the solution $w^*$, STRON, presented in Algorithm 1, converges superlinearly.
Proof.
For $k$ sufficiently large and $\|p_k\| < \Delta_k$, the CG subproblem solver uses the residual exit criterion and the step is accepted, i.e., $w_{k+1} = w_k + p_k$ (Steihaug, 1983).
(20) 
second inequality follows from, assuming and using the residual definition, and last inequality follows by the residual exit condition.
Now,
(21) 
Substituting inequality (21) in (20), we get,
(22) 
inequality follows assuming .
For some and , taking , then .
Further, as , and assuming , we get,
(23) 
And for as , we get,
(24) 
which proves superlinear convergence. ∎
6. Experimental Results
In this section, we discuss the experimental settings and results. The experiments have been conducted with the benchmark binary datasets given in Table 1, which are available for download from the LibSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).
Table 1: Benchmark binary datasets.

Dataset          #features    #datapoints
rcv1.binary      47,236       20,242
covtype.binary   54           581,012
ijcnn1           22           49,990
news20.binary    1,355,191    19,996
real-sim         20,958       72,309
Adult            123          32,561
mushroom         112          8,124
We use the following methods in the experiments:

TRON (Hsia et al., 2018): This is a trust region inexact Newton method without any subsampling. It uses the preconditioned CG method to solve the trust region subproblem and is used in the current version of the LIBLINEAR library (Fan et al., 2008).

STRON: This is the proposed stochastic trust region inexact Newton method, with a progressive sampling technique for the gradient and Hessian calculations. It uses the CG method to solve the trust region subproblem.

STRON-PCG: This is an extension of STRON using PCG for solving the trust region subproblem, as discussed in Subsection 7.1.

STRON-SVRG: This is another extension of STRON, using variance reduction for the subsampled gradient calculations, as discussed in Subsection 7.2.

Newton-CG (Byrd et al., 2011): This is a stochastic inexact Newton method which uses the CG method to solve the subproblem. It uses progressive subsampling similar to STRON.

SVRG-SQN (Moritz et al., 2016): This is a stochastic L-BFGS method with variance reduction for the gradient calculations.

SVRG-LBFGS (Kolte et al., 2015): This is another stochastic L-BFGS method with variance reduction. It differs from SVRG-SQN in the way the Hessian information is sampled.
6.1. Experimental Setup
All the datasets have been divided into 80% and 20% splits, for training and testing, respectively. We have used for all methods, and the maximum number of CG iterations has been set to a sufficiently large value of 25, as all the inexact Newton methods use 5-10 iterations and hardly go beyond 20 iterations. The progressive batching scheme uses an initial batch size of 1% to 10%, with a suitable rate for increasing the batch size. The quasi-Newton methods (SVRG-SQN and SVRG-LBFGS) use a mini-batch size of 10% with stochastic backtracking line search to find the step size, and mini-batches of the same size are taken for the gradient and Hessian subsampling. A memory of is used in the quasi-Newton methods, with as the update frequency of the Hessian inverse approximation for the SVRG-SQN method. All the methods are implemented in C++ (the code will be made available soon) as a large-scale machine learning library for second order methods with a MATLAB interface, and the experiments have been executed on a MacBook Air (8 GB 1600 MHz DDR3 RAM, 1.6 GHz Intel Core i5 and 256 GB SSD).
6.2. Comparative Study
The experiments have been performed with the strongly convex and smooth l2-regularized logistic regression problem:

(25)   \min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \log\left( 1 + \exp\left( -y_i w^T x_i \right) \right) + \frac{\lambda}{2} \| w \|^2,

where $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$ and $\lambda > 0$ is the regularization parameter.
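For concreteness, a possible implementation of the subsampled function, gradient and Hessian-vector product for this objective follows (our own sketch, not the paper's C++ code; X is the n x d data matrix, y the ±1 labels, lam the regularization parameter and idx the subsample indices).

```python
import numpy as np

def logreg_fun(w, X, y, lam, idx):
    """Subsampled l2-regularized logistic loss over the index set idx."""
    z = y[idx] * (X[idx] @ w)
    return np.mean(np.log1p(np.exp(-z))) + 0.5 * lam * (w @ w)

def logreg_grad(w, X, y, lam, idx):
    """Subsampled gradient: -(1/|idx|) sum_i sigma(-z_i) y_i x_i + lam * w."""
    z = y[idx] * (X[idx] @ w)
    coef = -y[idx] / (1.0 + np.exp(z))
    return X[idx].T @ coef / len(idx) + lam * w

def logreg_hvp(w, v, X, y, lam, idx):
    """Subsampled Hessian-vector product without forming the Hessian:
       H v = (1/|idx|) X^T D X v + lam * v, with D_ii = sigma(z_i) (1 - sigma(z_i))."""
    z = y[idx] * (X[idx] @ w)
    sig = 1.0 / (1.0 + np.exp(-z))
    dvec = sig * (1.0 - sig)
    return X[idx].T @ (dvec * (X[idx] @ v)) / len(idx) + lam * v
```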
The results have been plotted as suboptimality ($F(w_k) - F(w^*)$) versus training time (in seconds) and as accuracy versus training time for high-accuracy solutions, as given in Figs. 1, 2 and 3. As is clear from the results, STRON converges faster than all the other methods and shows an improvement over TRON on accuracy vs. time. Moreover, the quasi-Newton methods converge more slowly than the inexact Newton methods, as already established in the literature (Lin et al., 2008). As per intuition, STRON takes an initial advantage over TRON due to subsampling, and as it reaches the solution region the progressive batching scheme reaches the full batching scheme and converges at the same rate as TRON. That is why, in most of the figures, we can observe STRON and TRON converging along parallel lines. Moreover, we observe a horizontal line in the accuracy vs. time plot for the covtype dataset because all methods give 100% accuracy.
Generally, models are trained for low-accuracy solutions, so we present the results for such a case in Table 2. The results are reported as means over 5 runs of the experiments. As is clear from the table, STRON converges faster than the other solvers in most cases and has nearly the same accuracy as the others. We observe large variations in the training time for the mushroom dataset because its training time is small and it is difficult to measure precisely.
Table 2: Training time (mean ± std over 5 runs) and accuracy for low-accuracy solutions.

Dataset    Metric      SVRG_LBFGS    STRON        SVRG_SQN     Newton_CG    TRON
covtype    Time (s)    0.86±0.31     0.06±0.01    0.77±0.03    0.85±0.01    0.68±0.00
           Accuracy    1.00          1.00         1.00         1.00         1.00
real-sim   Time (s)    1.85±0.15     0.88±0.28    1.15±0.01    0.68±0.01    0.88±0.10
           Accuracy    0.9693        0.9696       0.9670       0.9690       0.9696
rcv1       Time (s)    2.63±0.18     0.53±0.01    3.04±0.07    0.61±0.05    0.61±0.04
           Accuracy    0.9634        0.9632       0.9632       0.9634       0.9627
news20     Time (s)    49.30±0.70    6.00±0.08    68.98±0.77   7.01±0.13    6.23±0.42
           Accuracy    0.9313        0.9320       0.9323       0.9320       0.9320
mushroom   Time (s)    0.15±0.02     0.11±0.03    0.27±0.06    0.14±0.06    0.17±0.06
           Accuracy    0.9988        0.9988       0.9988       0.9988       0.9988
ijcnn1     Time (s)    0.40±0.16     0.28±0.01    0.28±0.03    0.19±0.01    0.28±0.14
           Accuracy    0.9239        0.9233       0.9225       0.9239       0.9238
Adult      Time (s)    0.42±0.16     0.18±0.12    0.44±0.19    0.28±0.17    0.22±0.02
           Accuracy    0.8469        0.8468       0.8460       0.8471       0.8483
6.3. Results with SVM
We extend STRON to solve the l2-SVM problem, which is not smooth (the squared hinge loss is not twice differentiable):

(26)   \min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \max\left( 0, \, 1 - y_i w^T x_i \right)^2 + \frac{\lambda}{2} \| w \|^2.
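Since the squared hinge loss has no second derivative at the hinge, trust region Newton solvers for (26) typically work with the generalized Hessian over the active examples (those with $y_i w^T x_i < 1$). The following is our own illustrative sketch of the corresponding subsampled function, gradient and generalized Hessian-vector product, mirroring the logistic regression sketch above.

```python
import numpy as np

def l2svm_fun(w, X, y, lam, idx):
    """Subsampled l2-SVM (squared hinge) objective over the index set idx."""
    margin = np.maximum(0.0, 1.0 - y[idx] * (X[idx] @ w))
    return np.mean(margin ** 2) + 0.5 * lam * (w @ w)

def l2svm_grad(w, X, y, lam, idx):
    """Subsampled gradient: only examples with y_i w^T x_i < 1 contribute."""
    z = y[idx] * (X[idx] @ w)
    coef = np.where(z < 1.0, -2.0 * y[idx] * (1.0 - z), 0.0)
    return X[idx].T @ coef / len(idx) + lam * w

def l2svm_hvp(w, v, X, y, lam, idx):
    """Subsampled generalized Hessian-vector product: (2/|idx|) X_A^T X_A v + lam * v,
       where A is the set of active examples within idx."""
    z = y[idx] * (X[idx] @ w)
    active = (z < 1.0).astype(float)
    return 2.0 * X[idx].T @ (active * (X[idx] @ v)) / len(idx) + lam * v
```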
The results are reported in Fig. 4 for the news20 and real-sim datasets. As is clear from the figure, STRON outperforms all the other methods.
7. Extensions
In this section, we discuss extensions of the proposed method: using PCG for solving the trust region subproblem, and using a variance reduction technique.
7.1. PCG Subproblem Solver
The number of iterations required by the CG method to solve the subproblem depends on the condition number of the Hessian matrix, so for ill-conditioned problems CG can converge very slowly. To avoid such situations, a nonsingular matrix $M$, called a preconditioner, is generally used as follows: instead of the linear system $A p = b$, we solve the system

(27)   M^{-1} A \, p = M^{-1} b.

Generally, $M$ is factored as $M = L L^T$ to preserve the symmetry and positive definiteness of the preconditioned system. PCG can be useful for solving ill-conditioned problems, but it involves extra computational overhead. We follow (Hsia et al., 2018) and use as preconditioner a weighted average of the identity matrix and the diagonal of the Hessian:

(28)   M = \alpha \, \mathrm{diag}(H) + (1 - \alpha) I,

where $H$ is the Hessian matrix and $\alpha \in [0, 1]$. For $\alpha = 0$ there is no preconditioning, and for $\alpha = 1$ it is a diagonal preconditioner. In the experiments, we have used the same value of $\alpha$ for TRON and STRON-PCG, following (Hsia et al., 2018). To apply PCG to the trust region subproblem, we can use Algorithm 2 without any modifications after changing the trust region subproblem (11) to (Steihaug, 1983)

(29)   \min_{p} \; Q_k(p) \quad \text{subject to} \quad \| p \|_M \le \Delta_k,

where $\| p \|_M = \sqrt{p^T M p}$.
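A sketch of this preconditioning follows (our own illustrative code, with the trust region boundary handling of the Steihaug solver omitted for brevity): build the diagonal preconditioner of (28) from the diagonal of the (subsampled) Hessian and run PCG, which only needs to apply $M^{-1}$, a cheap element-wise division here since $M$ is diagonal. The default alpha=0.01 is a placeholder, not necessarily the value used in the paper.

```python
import numpy as np

def diag_preconditioner(hess_diag, alpha=0.01):
    """M = alpha * diag(H) + (1 - alpha) * I, stored as a vector (eq. (28)).
       alpha = 0.01 is an illustrative default, not necessarily the paper's value."""
    return alpha * hess_diag + (1.0 - alpha) * np.ones_like(hess_diag)

def pcg(g, hvp, M, eta=0.1, max_cg=25):
    """Preconditioned CG for H p = -g with a diagonal preconditioner M (given as a vector)."""
    p = np.zeros_like(g)
    r = -g.copy()                       # residual b - H p with b = -g
    z = r / M                           # apply M^{-1}
    s = z.copy()
    tol = eta * np.linalg.norm(g)
    for _ in range(max_cg):
        if np.linalg.norm(r) <= tol:
            break
        Hs = hvp(s)
        step = (r @ z) / (s @ Hs)
        p += step * s
        r_new = r - step * Hs
        z_new = r_new / M
        beta = (r_new @ z_new) / (r @ z)
        s = z_new + beta * s
        r, z = r_new, z_new
    return p
```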
STRON using PCG as the trust region subproblem solver is denoted by STRON-PCG, and the results are reported in Fig. 5, which compares TRON, STRON and STRON-PCG on the news20 and rcv1 datasets. As is clear from the figure, both STRON and STRON-PCG outperform TRON.
The PCG trust region subproblem solver involves an extra cost for calculating the preconditioner. For TRON, the overhead due to the preconditioner is

(30)   O(\text{nnz}),

and for STRON-PCG, the preconditioner involves an extra cost of

(31)   O(\text{nnz}(S_k))

per iteration.
7.2. Stochastic Variance Reduced Trust Region Inexact Newton Method
To improve the quality of the search direction, we use SVRG as the variance reduction technique for the gradient calculations:

(32)   g_k = \nabla f_{X_k}(w_k) - \nabla f_{X_k}(\tilde{w}) + \nabla F(\tilde{w}),

where $\tilde{w}$ is the parameter value at the start of the outer iteration and $\nabla f_{X_k}(w) = \frac{1}{|X_k|} \sum_{i \in X_k} \nabla f_i(w)$. STRON-SVRG uses variance reduction for the gradient calculations and progressive batching for the Hessian calculations, as given in Algorithm 3.
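A short sketch of this variance-reduced gradient follows (our own code; grad_b(w, idx) is a subsampled-gradient callback and full_grad_tilde is the full gradient evaluated once at the snapshot w_tilde taken at the start of the outer iteration).

```python
import numpy as np

def svrg_gradient(w, w_tilde, full_grad_tilde, grad_b, idx):
    """Variance-reduced gradient estimate of eq. (32):
       g = grad_{X_k}(w) - grad_{X_k}(w_tilde) + full_grad(w_tilde)."""
    return grad_b(w, idx) - grad_b(w_tilde, idx) + full_grad_tilde

# Illustrative usage inside an outer iteration:
#   w_tilde = w.copy()
#   full_grad_tilde = grad_b(w_tilde, np.arange(n))   # full gradient at the snapshot
#   g = svrg_gradient(w, w_tilde, full_grad_tilde, grad_b, Xk)
#   # g then replaces the plain subsampled gradient g_{X_k} in the trust region subproblem.
```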
The experimental results are presented in Fig. 6 for the news20 and rcv1 datasets. As is clear from the figures, STRON-SVRG lags behind STRON and TRON, i.e., variance reduction is not sufficient to beat the progressive batching used for the gradient calculations in STRON.
8. Conclusion
We have proposed a novel subsampled trust region inexact Newton method, called STRON. STRON uses a progressive batching scheme to deal with the noisy approximations of the gradient and Hessian, and enjoys the benefits of both regimes: stochastic approximation and full-batch processing. The proposed method has been extended to use preconditioned CG as the trust region subproblem solver, to use variance reduction for the gradient calculations, and to solve the SVM problem. Theoretical analysis proves superlinear convergence for the proposed method, and empirical results prove the efficacy of STRON.
Acknowledgements.
The first author is thankful to the Ministry of Human Resource Development, Government of India, for providing a fellowship (University Grants Commission - Senior Research Fellowship) to pursue his PhD.

References
 Agarwal et al. (2017) Naman Agarwal, Brian Bullins, and Elad Hazan. 2017. Second-Order Stochastic Optimization for Machine Learning in Linear Time. Journal of Machine Learning Research 18, 116 (2017), 1–40.
 Allen-Zhu (2017) Zeyuan Allen-Zhu. 2017. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods. Journal of Machine Learning Research (to appear) (2017). Full version available at http://arxiv.org/abs/1603.05953.
 Bellavia et al. (2018) Stefania Bellavia, Nataša Krejic, and Nataša Krklec Jerinkic. 2018. Subsampled Inexact Newton methods for minimizing large sums of convex functions. Optimization Online (2018). http://www.optimizationonline.org/DB_HTML/2018/01/6432.html
 Berahas et al. (2016) Albert S. Berahas, Jorge Nocedal, and Martin Takáč. 2016. A Multi-Batch L-BFGS Method for Machine Learning. In Advances in Neural Information Processing Systems 29. 1055–1063.
 Bollapragada et al. (2016) R. Bollapragada, R. Byrd, and J. Nocedal. 2016. Exact and Inexact Subsampled Newton Methods for Optimization. arXiv (2016). https://arxiv.org/abs/1609.08502
 Bollapragada et al. (2018) Raghu Bollapragada, Jorge Nocedal, Dheevatsa Mudigere, Hao-Jun Shi, and Ping Tak Peter Tang. 2018. A Progressive Batching L-BFGS Method for Machine Learning. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research), Vol. 80. PMLR, 620–629.
 Boyd and Vandenberghe (2004) Stephen Boyd and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press, New York, NY, USA.
 Byrd et al. (2011) R. Byrd, G. Chin, W. Neveitt, and J. Nocedal. 2011. On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning. SIAM Journal on Optimization 21, 3 (2011), 977–995. https://doi.org/10.1137/10079923X
 Byrd et al. (2016) Richard H. Byrd, S. L. Hansen, Jorge Nocedal, and Yoram Singer. 2016. A Stochastic Quasi-Newton Method for Large-Scale Optimization. SIAM Journal on Optimization 26, 2 (2016), 1008–1031.
 Cauchy (1847) AugustinLouis Cauchy. 1847. Méthode générale pour la résolution des systèmes d’équations simultanées. Compte Rendu des S’eances de L’Acad’emie des Sciences XXV S’erie A, 25 (18 Oct. 1847), 536–538.
 Chauhan et al. (2017) Vinod Kumar Chauhan, Kalpana Dahiya, and Anuj Sharma. 2017. Mini-batch Block-coordinate based Stochastic Average Adjusted Gradient Methods to Solve Big Data Problems. In Proceedings of the Ninth Asian Conference on Machine Learning, Vol. 77. PMLR, 49–64. http://proceedings.mlr.press/v77/chauhan17a.html
 Chauhan et al. (2018a) Vinod Kumar Chauhan, Kalpana Dahiya, and Anuj Sharma. 2018a. Faster Algorithms for Large-scale Machine Learning using Simple Sampling Techniques. arXiv (2018). https://arxiv.org/abs/1801.05931v2
 Chauhan et al. (2018b) Vinod Kumar Chauhan, Kalpana Dahiya, and Anuj Sharma. 2018b. Problem formulations and solvers in linear SVM: a review. Artificial Intelligence Review (2018). https://doi.org/10.1007/s10462-018-9614-6
 Chauhan et al. (2018c) Vinod Kumar Chauhan, Anuj Sharma, and Kalpana Dahiya. 2018c. Faster learning by reduction of data access time. Applied Intelligence 48, 12 (01 Dec 2018), 4715–4729. https://doi.org/10.1007/s10489-018-1235-x
 Chauhan et al. (2018d) Vinod Kumar Chauhan, Anuj Sharma, and Kalpana Dahiya. 2018d. SAAGs: Biased Stochastic Variance Reduction Methods for Large-scale Learning. arXiv (July 2018). arXiv:1807.08934 http://arxiv.org/abs/1807.08934
 Csiba and Richtárik (2016) Dominik Csiba and Peter Richtárik. 2016. Importance Sampling for Minibatches. arXiv (2016), 1–19. arXiv:1602.02283
 Defazio et al. (2014) Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. 2014. SAGA: A Fast Incremental Gradient Method with Support for Non-strongly Convex Composite Objectives. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14). MIT Press, Cambridge, MA, USA, 1646–1654.
 Fan et al. (2008) R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. 2008. LIBLINEAR: A library for large linear classification. JMLR 9 (2008), 1871–1874.
 Fanhua et al. (2018) Shang Fanhua, Kaiwen Zhou, James Cheng, Ivor W. Tsang, Lijun Zhang, and Dacheng Tao. 2018. VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning. arXiv (2018). https://arxiv.org/abs/1802.09932
 Fletcher (1980) R Fletcher. 1980. Practical Methods of Optimization, Vol. 1, Unconstrained Optimization.
 Hsia et al. (2018) Chih-Yang Hsia, Wei-Lin Chiang, and Chih-Jen Lin. 2018. Preconditioned Conjugate Gradient Methods in Truncated Newton Frameworks for Large-scale Linear Classification. In Proceedings of the Tenth Asian Conference on Machine Learning (Proceedings of Machine Learning Research). PMLR.
 Hsia et al. (2017) Chih-Yang Hsia, Ya Zhu, and Chih-Jen Lin. 2017. A Study on Trust Region Update Rules in Newton Methods for Large-scale Linear Classification. In Proceedings of the Ninth Asian Conference on Machine Learning (Proceedings of Machine Learning Research), Vol. 77. PMLR, 33–48.
 Johnson and Zhang (2013) Rie Johnson and Tong Zhang. 2013. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 315–323.
 Kolte et al. (2015) Ritesh Kolte, Murat Erdogdu, and Ayfer Ozgur. 2015. Accelerating SVRG via second-order information. In NIPS Workshop on Optimization for Machine Learning.
 Konečný and Richtárik (2013) Jakub Konečný and Peter Richtárik. 2013. Semi-Stochastic Gradient Descent Methods. arXiv (2013). arXiv:1312.1666 http://arxiv.org/abs/1312.1666
 Le Roux et al. (2012) Nicolas Le Roux, Mark Schmidt, and Francis Bach. 2012. A Stochastic Gradient Method with an Exponential Convergence Rate for Strongly-Convex Optimization with Finite Training Sets. Technical Report. INRIA.
 Lin and Moré (1999) C. Lin and J. Moré. 1999. Newton's Method for Large Bound-Constrained Optimization Problems. SIAM Journal on Optimization 9, 4 (1999), 1100–1127. https://doi.org/10.1137/S1052623498345075
 Lin et al. (2008) Chih-Jen Lin, Ruby C. Weng, and S. Sathiya Keerthi. 2008. Trust Region Newton Method for Logistic Regression. JMLR 9 (June 2008), 627–650.
 Liu and Nocedal (1989) Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming 45, 1 (1989), 503–528.
 Lucchi et al. (2015) Aurélien Lucchi, Brian McWilliams, and Thomas Hofmann. 2015. A Variance Reduced Stochastic Newton Method. arXiv (2015). http://arxiv.org/abs/1503.08316
 Mokhtari and Ribeiro (2014) A. Mokhtari and A. Ribeiro. 2014. RES: Regularized Stochastic BFGS Algorithm. IEEE Transactions on Signal Processing 62, 23 (2014), 6089–6104.
 Moritz et al. (2016) Philipp Moritz, Robert Nishihara, and Michael I. Jordan. 2016. A Linearly-Convergent Stochastic L-BFGS Algorithm. In AISTATS.
 Nocedal and Wright (1999) J. Nocedal and S. J. Wright. 1999. Numerical Optimization. Springer, New York.
 Robbins and Monro (1951) Herbert Robbins and Sutton Monro. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22 (1951), 400–407.
 Schmidt et al. (2016) Mark Schmidt, Nicolas Le Roux, and Francis Bach. 2016. Minimizing finite sums with the stochastic average gradient. Math. Program. (2016), 1–30.
 Schraudolph et al. (2007) Nicol N. Schraudolph, Jin Yu, and Simon Günter. 2007. A Stochastic Quasi-Newton Method for Online Convex Optimization. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research), Marina Meila and Xiaotong Shen (Eds.), Vol. 2. PMLR, 436–443.

 Shalev-Shwartz et al. (2007) Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. 2007. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proceedings of the 24th International Conference on Machine Learning (ICML '07). ACM, New York, NY, USA, 807–814.
 Shalev-Shwartz and Zhang (2013) Shai Shalev-Shwartz and Tong Zhang. 2013. Stochastic Dual Coordinate Ascent Methods for Regularized Loss. J. Mach. Learn. Res. 14, 1 (2013), 567–599.
 Steihaug (1983) Trond Steihaug. 1983. The Conjugate Gradient Method and Trust Regions in Large Scale Optimization. SIAM J. Numer. Anal. 20, 3 (1983), 626–637.
 Yang et al. (2018) Zhuang Yang, Cheng Wang, Zhemin Zhang, and Jonathan Li. 2018. Random Barzilai–Borwein step size for minibatch algorithms. Eng. Appl. Artif. Intell. 72 (2018), 124 – 135.
 Zhang et al. (2018) Gong-Duo Zhang, Shen-Yi Zhao, Hao Gao, and Wu-Jun Li. 2018. Feature-Distributed SVRG for High-Dimensional Linear Classification. arXiv (2018). http://arxiv.org/abs/1802.03604
 Zhang and Xiao (2015) Yuchen Zhang and Lin Xiao. 2015. Stochastic Primal-dual Coordinate Method for Regularized Empirical Risk Minimization. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 (ICML'15). 353–361.
 Zhou et al. (2017) Lina Zhou, Shimei Pan, Jianwu Wang, and Athanasios V. Vasilakos. 2017. Machine learning on big data: Opportunities and challenges. Neurocomputing 237 (2017), 350 – 361. https://doi.org/10.1016/j.neucom.2017.01.026