Stochastic Trust Region Inexact Newton Method for Large-scale Machine Learning

12/26/2018
by Vinod Kumar Chauhan, et al.

Nowadays stochastic approximation methods are one of the major research directions for dealing with large-scale machine learning problems. From stochastic first order methods, the focus is now shifting to stochastic second order methods due to their faster convergence. In this paper, we propose a novel Stochastic Trust RegiOn inexact Newton method, called STRON, which uses conjugate gradient (CG) to solve the trust region subproblem. The method uses progressive subsampling in the calculation of gradient and Hessian values to take advantage of both the stochastic approximation and full batch regimes. We extend STRON using existing variance reduction techniques to deal with noisy gradients, and using preconditioned conjugate gradient (PCG) as the subproblem solver. We further extend STRON to solve SVM. Finally, the theoretical results prove superlinear convergence for STRON and the empirical results demonstrate the efficacy of the proposed method against existing methods on benchmark datasets.


1. Introduction

Machine learning involves data-intensive optimization problems which have a large number of component functions corresponding to the large amount of available data. Traditional/classical methods fail to perform well on such optimization problems. So the challenge is to develop scalable and efficient algorithms to deal with these large-scale learning problems (Zhou et al., 2017; Chauhan et al., 2017, 2018b, 2018a, 2018c).
To solve machine learning problems, gradient descent (GD) (Cauchy, 1847) is the classical method of choice, but it trains slowly on large-scale learning problems due to its high per-iteration cost. Stochastic approximation (Robbins and Monro, 1951) is quite effective in such situations, but it converges slowly due to the noisy approximations of the gradient. So a variety of stochastic variance reduction techniques have come into existence, e.g., (Le Roux et al., 2012; Johnson and Zhang, 2013; Konečný and Richtárik, 2013; Shalev-Shwartz and Zhang, 2013; Defazio et al., 2014; Zhang and Xiao, 2015; Allen-Zhu, 2017; Chauhan et al., 2017, 2018d; Fanhua et al., 2018). But the major limitation of these methods is that they can converge at up to a linear rate only.
The Newton method is another classical method to solve optimization problems, which can give up to a quadratic convergence rate (Boyd and Vandenberghe, 2004). But again, the (pure) Newton method is not feasible for large-scale learning problems due to its huge per-iteration computational complexity and the need to store a huge Hessian matrix. So nowadays one of the most significant open questions in optimization for machine learning is: "Can we develop stochastic second order methods with quadratic convergence, like the Newton method, but with low per-iteration complexity, like stochastic approximation methods?". After success with stochastic first order methods, research is shifting its focus towards stochastic second order methods to leverage the faster convergence of second order methods.
Inexact Newton (also called truncated Newton or Hessian-free) methods and quasi-Newton methods are two of the major research directions for developing stochastic second order methods (Bollapragada et al., 2016). Inexact Newton methods solve the Newton equation approximately without calculating and storing the Hessian matrix. On the other hand, quasi-Newton methods approximate the Hessian inverse and avoid the need to store the Hessian matrix. Thus both classes of methods try to resolve the issues with the Newton method. The stochastic variants of inexact Newton and quasi-Newton methods further reduce the complexity by using subsampled gradient and Hessian calculations. In this paper, we propose a novel stochastic trust region inexact Newton (STRON) method which introduces subsampling into the gradient and Hessian calculations. It uses progressive sampling to enjoy the benefits of both regimes: stochastic approximation and full batch processing. We further extend the method using existing variance reduction techniques to deal with the noise produced by subsampling of the gradient, and by proposing PCG for solving the trust region subproblem.

1.1. Optimization Problem

We consider the unconstrained convex optimization problem of minimizing the expected risk:

(1)   $\min_{w \in \mathbb{R}^d}\; F(w) = \mathbb{E}_{(x,y)\sim P}\left[ f(w; x, y) \right]$,

where $f$ is a smooth composition of a linear prediction model and a loss function over a randomly selected data point $(x, y)$ from the unknown distribution $P$, parameterized by the model parameter $w \in \mathbb{R}^d$. Since it is not feasible to solve (1) as $P$ is unknown, the model is approximated by taking a set of $n$ data points $\{(x_i, y_i)\}_{i=1}^n$ from the unknown distribution and then solving the empirical risk minimization problem:

(2)   $\min_{w \in \mathbb{R}^d}\; F(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)$,

where, for simplicity, we write $f_i(w) = f(w; x_i, y_i)$. Finite-sum optimization problems of type (2) arise across different fields, like signal processing, statistics, operations research, data science and machine learning, e.g., logistic regression and SVM in machine learning.
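To make the finite-sum structure of (2) concrete, here is a minimal Python sketch that evaluates an empirical risk of the form (2); the squared-error component loss, the helper names and the random data are illustrative assumptions only (the paper's experiments use logistic regression and l2-SVM losses):

import numpy as np

def component_loss(w, x_i, y_i):
    # Illustrative smooth component f_i(w): squared error of a linear model.
    return 0.5 * (x_i @ w - y_i) ** 2

def empirical_risk(w, X, y):
    # F(w) = (1/n) * sum_i f_i(w), the finite-sum objective in (2).
    n = X.shape[0]
    return sum(component_loss(w, X[i], y[i]) for i in range(n)) / n

# Toy usage with random data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w = np.zeros(5)
print(empirical_risk(w, X, y))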

1.2. Solution Techniques

The simple iterative classical method to solve (2) is gradient descent (GD) (Cauchy, 1847):

(3)   $w^{t+1} = w^t - \alpha_t \nabla F(w^t)$,

where $t$ is the iteration number and $\alpha_t$ is called the learning rate or step size. The complexity of this iteration is $O(nd)$, which is large for large-scale learning problems. SGD (stochastic gradient descent) (Robbins and Monro, 1951) is very effective for such problems due to its low per-iteration complexity:

(4)   $w^{t+1} = w^t - \alpha_t \nabla f_{i_t}(w^t)$,

for some randomly selected data point $i_t \in \{1, 2, \dots, n\}$. But it converges slowly because of noisy gradient values. To deal with the noise issue, a number of techniques have been proposed, and some of the important ones (as discussed in (Csiba and Richtárik, 2016)) are: (a) decreasing learning rates (Shalev-Shwartz et al., 2007), (b) mini-batching (Yang et al., 2018), (c) importance sampling (Csiba and Richtárik, 2016), and (d) variance reduction (Johnson and Zhang, 2013). The variance reduction methods can further be classified into three categories: primal methods (Johnson and Zhang, 2013; Schmidt et al., 2016; Chauhan et al., 2017, 2018d), dual methods (Shalev-Shwartz and Zhang, 2013) and primal-dual methods (Zhang and Xiao, 2015). These techniques are effective for large-scale learning problems because they have low per-iteration complexity, like SGD, and fast linear convergence, like GD. They thus exploit the best of SGD and GD, but for these stochastic variants of first order methods the convergence is limited to a linear rate, unlike second order methods which can attain up to a quadratic rate.
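To illustrate the per-iteration cost versus noise trade-off discussed above, the following sketch contrasts one GD step (3) with one SGD step (4) on a toy least-squares finite sum; the data, loss and step size are illustrative assumptions, not taken from the paper:

import numpy as np

rng = np.random.default_rng(1)
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)

def full_gradient(w):
    # Full gradient of the least-squares finite sum: O(nd) per step, as in GD (3).
    return X.T @ (X @ w - y) / X.shape[0]

def component_gradient(w, i):
    # Gradient of a single component f_i: O(d) per step, as in SGD (4).
    return (X[i] @ w - y[i]) * X[i]

alpha = 0.1
w = np.zeros(10)
w_gd = w - alpha * full_gradient(w)              # one exact but expensive GD step
i = int(rng.integers(X.shape[0]))
w_sgd = w - alpha * component_gradient(w, i)     # one cheap but noisy SGD step
print(np.linalg.norm(w_gd - w_sgd))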
Another classical second order method to solve (2) is the Newton method:

(5)   $w^{t+1} = w^t - \left(\nabla^2 F(w^t)\right)^{-1} \nabla F(w^t)$.

The complexity of this iteration is $O(nd^2 + d^3)$ and it needs to store and invert the Hessian matrix $\nabla^2 F(w^t)$, which is computationally very expensive. That is why, during the last decade, first order methods and their stochastic variants have been studied far more extensively than second order methods for solving large-scale learning problems. As stochastic first order methods have hit their limits, the focus is shifting towards stochastic second order methods, and nowadays one of the most significant open questions in optimization for machine learning is: "Can we develop stochastic second order methods with quadratic convergence, like the Newton method, but with low per-iteration complexity, like stochastic approximation methods?".
There are two major research directions for second order methods: quasi-Newton methods and inexact Newton methods, both of which try to resolve the issues associated with the Newton method. Quasi-Newton methods approximate the Hessian matrix during each iteration:

(6)   $w^{t+1} = w^t - B_t^{-1} \nabla F(w^t)$,

where $B_t$ is an approximation of $\nabla^2 F(w^t)$; BFGS is one such method (Fletcher, 1980). On the other hand, inexact Newton methods solve the Newton equation approximately, e.g., Newton-CG (conjugate gradient) (Steihaug, 1983). Both classes resolve issues related to the Newton method, but their complexities are still large for large-scale problems. So a number of stochastic variants of these methods have been proposed, e.g., (Byrd et al., 2011, 2016; Bollapragada et al., 2016, 2018; Bellavia et al., 2018), which introduce subsampling into the gradient and Hessian calculations.

1.3. Contributions

The contributions of the paper are listed below:

  • We have proposed a novel subsampled variant of the trust region inexact Newton method, which we call STRON. The method introduces subsampling into the gradient and Hessian calculations and is, to the best of our knowledge, the first stochastic variant of the trust region inexact Newton method. The proposed method uses a progressive batching scheme for the gradient and Hessian calculations to enjoy the benefits of both regimes: stochastic approximation and full batch processing.

  • STRON has been extended using existing variance reduction techniques to deal with the noisy approximations of the gradient. The extended method uses SVRG for variance reduction, with static batching for the gradient calculations and progressive batching for the Hessian calculations.

  • We further extend the method and use PCG, instead of the CG method, to solve the Newton equation inexactly. We use a weighted average of the identity matrix and the diagonal of the Hessian as the preconditioner.

  • Our theoretical results prove superlinear convergence for STRON, without strong convexity and Lipschitz continuity assumptions.

  • Finally, our empirical experiments demonstrate the efficacy of STRON against existing techniques on benchmark datasets.

2. Literature Review

Stochastic approximation methods are very effective for large-scale problems due to their low per-iteration cost, e.g., SGD (Robbins and Monro, 1951), but they lead to slow convergence due to the noisy approximations. This issue is addressed using the following major techniques: (i) mini-batching (Yang et al., 2018), (ii) importance sampling (Csiba and Richtárik, 2016), (iii) variance reduction techniques (Johnson and Zhang, 2013), and (iv) decreasing step sizes (Shalev-Shwartz et al., 2007) (see (Csiba and Richtárik, 2016) for details). Variance reduction techniques can be classified into three categories: primal methods (Johnson and Zhang, 2013; Schmidt et al., 2016; Chauhan et al., 2017, 2018d), dual methods (Shalev-Shwartz and Zhang, 2013) and primal-dual methods (Zhang and Xiao, 2015). SVRG (Johnson and Zhang, 2013) is one of the most widely used variance reduction techniques; it has been extended to parallel and distributed settings and has also been used in second order methods (Zhang et al., 2018). These techniques are effective for large-scale learning problems because they have low per-iteration complexity, like SGD, and fast linear convergence, like GD. But variance reduction techniques are limited to linear convergence, far from the quadratic convergence of second order methods.
Second order methods utilize the curvature information to guide the step direction towards the solution and exhibit faster convergence than first order methods. But the huge per-iteration cost, due to the Hessian matrix and its inversion, makes the training of models slow for large-scale problems. So certain techniques have been developed to deal with the issues related to the Hessian matrix; quasi-Newton methods and inexact Newton methods are two major directions for dealing with the huge computational cost of the Newton method. Quasi-Newton methods approximate the Hessian matrix during each iteration, e.g., BFGS (Fletcher, 1980) and its limited memory variant, called L-BFGS (Liu and Nocedal, 1989), are examples of the quasi-Newton class which use gradient and parameter values from previous iterations to approximate the Hessian inverse. L-BFGS uses only recent information, from the previous $m$ iterations. On the other hand, inexact Newton methods solve the Newton equation approximately, e.g., Newton-CG (Steihaug, 1983).
Recently, several stochastic variants of BFGS and L-BFGS have been proposed to deal with large-scale problems. Schraudolph et al. (2007) proposed stochastic variants of BFGS and L-BFGS for the online setting, known as oBFGS. Mokhtari and Ribeiro (2014) extended oBFGS by adding regularization which enforces an upper bound on the eigenvalues of the approximate Hessian, known as RES. SQN (stochastic quasi-Newton) is another stochastic variant of L-BFGS, which collects curvature information at regular intervals instead of at each iteration (Byrd et al., 2016). VITE (variance-reduced stochastic Newton) extended RES and proposed to use variance reduction for the subsampled gradient values (Lucchi et al., 2015). Moritz et al. (2016) proposed stochastic L-BFGS (SLBFGS), which uses SVRG for variance reduction and Hessian-vector products to approximate the gradient differences used in calculating the Hessian approximations.
Berahas et al. (2016) proposed a multi-batch scheme for stochastic L-BFGS where the batch sample changes with some overlap with the previous iteration. Bollapragada et al. (2018) proposed progressive batching, stochastic line search and stable Newton updates for L-BFGS. Bollapragada et al. (2016) study the conditions on the subsample sizes required to attain different convergence rates.
Stochastic inexact Newton methods have also been explored extensively. Byrd et al. (2011) proposed stochastic variants of Newton-CG along with an L-BFGS method. Bollapragada et al. (2016) study subsampled Newton methods and find conditions on the subsample sizes and the forcing term (the constant used in the residual condition) for linear convergence of the Newton-CG method. Bellavia et al. (2018) study the effect of the forcing term and line search to obtain linear and superlinear convergence of the Newton-CG method. Newton-SGI (stochastic gradient iteration) is another way of solving the linear system approximately and is studied in Agarwal et al. (2017).
TRON (trust region Newton method) is one of the most efficient solvers for large-scale linear classification problems (Lin et al., 2008). It is a trust region inexact Newton method which does not use any subsampling and is present in the LIBLINEAR library (Fan et al., 2008). Hsia et al. (2017) extend TRON by improving the trust region radius value. Hsia et al. (2018) further extend TRON and use preconditioned conjugate gradient (PCG), with a weighted average of the identity matrix and the diagonal of the Hessian as the preconditioner, to solve the trust region subproblem. Since subsampling is an effective way to deal with large-scale problems, in this paper we propose a stochastic variant of the trust region inexact Newton method, which, to the best of our knowledge, has not been studied so far.

3. Trust Region Inexact Newton Method

Inexact Newton methods, also called truncated Newton or Hessian-free methods, solve the Newton equation (a linear system) approximately. The CG method is a commonly used technique for solving the resulting subproblem approximately. In this section, we discuss the inexact Newton method and its trust region variant.

3.1. Inexact Newton Method

The quadratic model of $F$ around $w^t$ obtained using Taylor's theorem is:

(7)   $Q_t(s) = F(w^t) + \nabla F(w^t)^\top s + \tfrac{1}{2}\, s^\top \nabla^2 F(w^t)\, s$.

Setting the gradient of this model with respect to $s$ to zero, we get

(8)   $\nabla^2 F(w^t)\, s = -\nabla F(w^t)$,

which is the Newton equation, and its solution gives the Newton method:

(9)   $w^{t+1} = w^t + s^t$, with $s^t = -\left(\nabla^2 F(w^t)\right)^{-1} \nabla F(w^t)$.

The computational complexity of this iteration is $O(nd^2 + d^3)$, which is very expensive. It involves the calculation and inversion of a large Hessian matrix, which is not only expensive to compute but also expensive to store. The CG method solves the subproblem (8) approximately without forming the Hessian matrix, which removes both the large computational cost and the need to store the large Hessian matrix. Each (outer) iteration runs CG for a given number of iterations or until the residual condition is satisfied:

(10)   $\left\| \nabla^2 F(w^t)\, s + \nabla F(w^t) \right\| \le \eta_t \left\| \nabla F(w^t) \right\|$,

where $\eta_t \in (0, 1)$ is a small positive value known as the forcing term (Nocedal and Wright, 1999).
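The following is a minimal sketch of such an inexact (truncated) CG solve for the Newton equation (8), stopping on the forcing-term condition (10); the Hessian is accessed only through a hypothetical Hessian-vector product callback, and the toy test matrix and parameter names are assumptions for illustration:

import numpy as np

def cg_inexact(hess_vec, grad, eta=0.1, max_iters=25):
    # Approximately solve H s = -g, stopping when ||H s + g|| <= eta * ||g||.
    s = np.zeros_like(grad)
    r = -grad                                   # residual of H s + g = 0 at s = 0
    d = r.copy()
    g_norm = np.linalg.norm(grad)
    for _ in range(max_iters):
        if np.linalg.norm(r) <= eta * g_norm:   # forcing-term condition (10)
            break
        Hd = hess_vec(d)
        alpha = (r @ r) / (d @ Hd)
        s += alpha * d
        r_new = r - alpha * Hd
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
    return s

# Toy usage: H is SPD and accessed only via matrix-vector products.
rng = np.random.default_rng(2)
A = rng.normal(size=(20, 20))
H = A @ A.T + 20 * np.eye(20)
g = rng.normal(size=20)
s = cg_inexact(lambda v: H @ v, g)
print(np.linalg.norm(H @ s + g) / np.linalg.norm(g))  # below the forcing term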

3.2. Trust Region Inexact Newton Method

A trust region is a region in which the quadratic model of the given function is trusted to approximate the function well. In trust region methods, we do not need to choose the step size (also called learning rate) directly; it is adjusted indirectly via the trust region radius. The trust region method solves the following subproblem to get the step direction $s^t$:

(11)   $\min_{s}\; Q_t(s)$ subject to $\|s\| \le \Delta_t$,

where $Q_t$ is the quadratic model of $F$, as given in (7), and $\Delta_t$ is the trust region radius. This subproblem can be solved similarly to Newton-CG, except that now we need to take care of the extra constraint $\|s\| \le \Delta_t$. TRON (trust region Newton method) (Lin et al., 2008) is one of the most famous and widely used such methods, and is used in LIBLINEAR (Fan et al., 2008) to solve l2-regularized logistic regression and l2-SVM problems. Hsia et al. (2017) extend TRON by proposing a better trust region radius. Hsia et al. (2018) use a PCG method, with a weighted average of the identity matrix and the diagonal of the Hessian as the preconditioner, to solve the trust region subproblem, and show that PCG can be effective for ill-conditioned problems.
Then, the ratio of the actual to the predicted reduction of the model is calculated:

(12)   $\rho_t = \dfrac{F(w^t + s^t) - F(w^t)}{Q_t(s^t) - Q_t(0)}$.

The parameters are updated for the $(t+1)$-th iteration as:

(13)   $w^{t+1} = \begin{cases} w^t + s^t, & \rho_t > \eta_0, \\ w^t, & \rho_t \le \eta_0, \end{cases}$

where $\eta_0 > 0$ is a given constant. Then the trust region radius is updated according to the ratio of actual to predicted reduction; a framework for updating $\Delta_t$, as given in (Lin and Moré, 1999), is:

(14)   $\Delta_{t+1} \in \begin{cases} \left[\sigma_1 \min\{\|s^t\|, \Delta_t\},\; \sigma_2 \Delta_t\right], & \rho_t \le \eta_1, \\ \left[\sigma_1 \Delta_t,\; \sigma_3 \Delta_t\right], & \rho_t \in (\eta_1, \eta_2), \\ \left[\Delta_t,\; \sigma_3 \Delta_t\right], & \rho_t \ge \eta_2, \end{cases}$

where $0 < \eta_1 < \eta_2 \le 1$ and $0 < \sigma_1 < \sigma_2 < 1 < \sigma_3$. If $\rho_t \le \eta_1$ then the Newton step is considered unsuccessful and the trust region radius is shrunk, whereas if $\rho_t \ge \eta_2$ then the step is successful and the trust region radius is enlarged. We have implemented this framework as given in the LIBLINEAR library (Fan et al., 2008).
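As a rough illustration of the acceptance test (13) and the radius update framework (14), the sketch below computes the reduction ratio and then picks one concrete radius from the corresponding interval in (14); the constants eta0, eta1, eta2, sigma1 and sigma3 are common illustrative choices, not necessarily the values used in LIBLINEAR:

import numpy as np

def trust_region_update(f, w, s, pred_red, delta,
                        eta0=1e-4, eta1=0.25, eta2=0.75,
                        sigma1=0.25, sigma3=4.0):
    # One acceptance/radius update in the spirit of (12)-(14).
    # pred_red is the predicted reduction Q_t(0) - Q_t(s^t) > 0.
    actual_red = f(w) - f(w + s)
    rho = actual_red / pred_red                  # reduction ratio, cf. (12)

    w_new = w + s if rho > eta0 else w           # accept or reject the step (13)

    if rho <= eta1:                              # unsuccessful step: shrink radius
        delta_new = sigma1 * min(np.linalg.norm(s), delta)
    elif rho < eta2:                             # moderately successful: keep radius
        delta_new = delta
    else:                                        # very successful: enlarge radius
        delta_new = sigma3 * delta
    return w_new, delta_new

# Toy usage on a quadratic objective, where the model is exact and rho = 1.
f = lambda w: 0.5 * (w @ w)
w, s = np.array([1.0, 1.0]), np.array([-0.5, -0.5])
pred_red = f(w) - f(w + s)
print(trust_region_update(f, w, s, pred_red, delta=1.0))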

4. STRON

STRON introduces stochasticity into the trust region inexact Newton method and uses subsampled function, gradient and Hessian values to solve the trust region subproblem:

(15)   $\min_{s}\; \nabla F_{X_t}(w^t)^\top s + \tfrac{1}{2}\, s^\top \nabla^2 F_{S_t}(w^t)\, s$ subject to $\|s\| \le \Delta_t$,

where $\nabla^2 F_{S_t}(w^t)$ and $\nabla F_{X_t}(w^t)$ are the subsampled Hessian and gradient over the subsamples $S_t$ and $X_t$, respectively, defined as:

(16)   $\nabla F_{X_t}(w^t) = \dfrac{1}{|X_t|} \sum_{i \in X_t} \nabla f_i(w^t)$,   $\nabla^2 F_{S_t}(w^t) = \dfrac{1}{|S_t|} \sum_{i \in S_t} \nabla^2 f_i(w^t)$,

where the subsample sizes are increasing, i.e., $|X_t| \le |X_{t+1}|$ and $|S_t| \le |S_{t+1}|$, and the subsampled function value $F_{X_t}(w^t)$ is used for calculating $\rho_t$. STRON solves (15) approximately for a given number of CG iterations or until the following residual condition is satisfied:

(17)   $\left\| \nabla^2 F_{S_t}(w^t)\, s + \nabla F_{X_t}(w^t) \right\| \le \eta_t \left\| \nabla F_{X_t}(w^t) \right\|$,

where $\eta_t \in (0, 1)$.
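A minimal sketch of a progressive (geometrically increasing) subsampling schedule of the kind used for $X_t$ and $S_t$; the initial fraction and growth factor are illustrative assumptions (the experiments report initial batch sizes of 1% to 10%):

import numpy as np

def progressive_batches(n, init_frac=0.05, growth=1.5, seed=0, max_epochs=50):
    # Yield index subsamples whose size grows until it reaches the full dataset.
    rng = np.random.default_rng(seed)
    size = max(1, int(init_frac * n))
    for _ in range(max_epochs):
        yield rng.choice(n, size=size, replace=False)
        size = min(n, int(np.ceil(growth * size)))   # |X_t| <= |X_{t+1}| <= n

# Example: batch sizes grow 50 -> 75 -> 113 -> ... -> 1000 for n = 1000.
for t, idx in enumerate(progressive_batches(1000)):
    if t > 5:
        break
    print(t, idx.size)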

  Input: $w^0$, $\Delta_0$
  Result: $w^T$
  for $t = 0, 1, 2, \dots, T-1$ do
     Randomly select subsamples $X_t$ and $S_t$
     Calculate the subsampled gradient $\nabla F_{X_t}(w^t)$
     Solve the trust region subproblem (15) using Algorithm 2 to get the step direction $s^t$
     Calculate the ratio $\rho_t$
     Update the parameters $w^{t+1}$ using (13)
     Update the trust region radius $\Delta_{t+1}$ using (14)
  end for
ALGORITHM 1 STRON
  Inputs: $\nabla F_{X_t}(w^t)$, $\Delta_t$, $\eta_t$, maximum CG iterations
  Result: step direction $s$
  Initialize $s_0 = 0$, $r_0 = -\nabla F_{X_t}(w^t)$, $d_0 = r_0$
  for $j = 0, 1, 2, \dots$ do
     if $\|r_j\| \le \eta_t \|\nabla F_{X_t}(w^t)\|$ then
        return $s = s_j$
     end if
     Calculate the subsampled Hessian-vector product $v_j = \nabla^2 F_{S_t}(w^t)\, d_j$
     $\alpha_j = r_j^\top r_j / d_j^\top v_j$
     $s_{j+1} = s_j + \alpha_j d_j$
     if $\|s_{j+1}\| \ge \Delta_t$ then
        Calculate $\tau \ge 0$ such that $\|s_j + \tau d_j\| = \Delta_t$
        return $s = s_j + \tau d_j$
     end if
     $r_{j+1} = r_j - \alpha_j v_j$, $\beta_j = r_{j+1}^\top r_{j+1} / r_j^\top r_j$
     $d_{j+1} = r_{j+1} + \beta_j d_j$
  end for
ALGORITHM 2 CG Subproblem Solver
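For concreteness, here is a runnable Python sketch of a Steihaug-style CG subproblem solver in the spirit of Algorithm 2: it extends the earlier plain CG sketch with the trust region boundary step, needs only Hessian-vector products, and exits on the residual test (17), the iteration cap, or the boundary. Names and the toy problem are mine, and negative-curvature handling is omitted since the objectives considered here are convex; this is a sketch, not the paper's implementation:

import numpy as np

def steihaug_cg(hess_vec, grad, delta, eta=0.1, max_iters=25):
    # Approximately minimize g^T s + 0.5 s^T H s subject to ||s|| <= delta.
    s = np.zeros_like(grad)
    r = -grad                                    # residual of H s + g = 0 at s = 0
    d = r.copy()
    g_norm = np.linalg.norm(grad)
    for _ in range(max_iters):
        if np.linalg.norm(r) <= eta * g_norm:    # residual exit test, cf. (17)
            return s
        Hd = hess_vec(d)
        alpha = (r @ r) / (d @ Hd)
        s_next = s + alpha * d
        if np.linalg.norm(s_next) >= delta:      # hit the trust region boundary
            # choose tau >= 0 with ||s + tau * d|| = delta (positive quadratic root)
            a, b, c = d @ d, 2 * (s @ d), s @ s - delta ** 2
            tau = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
            return s + tau * d
        r_next = r - alpha * Hd
        beta = (r_next @ r_next) / (r @ r)
        s, r, d = s_next, r_next, r_next + beta * d
    return s

# Toy usage with an explicit SPD Hessian (STRON would use subsampled Hv products).
rng = np.random.default_rng(3)
A = rng.normal(size=(30, 30))
H, g = A @ A.T + 30 * np.eye(30), rng.normal(size=30)
print(np.linalg.norm(steihaug_cg(lambda v: H @ v, g, delta=1.0)))  # <= 1.0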

STRON is presented in Algorithm 1. It randomly selects subsamples $X_t$ and $S_t$ for the $t$-th (outer) iteration; $X_t$ is used for calculating the gradient. It then solves the trust region subproblem using the CG solver (inner iterations), which uses the subsampled Hessian over $S_t$ in the Hessian-vector products. CG stops when the residual condition (17) is satisfied, when it reaches the maximum number of CG iterations, or when it reaches the trust region boundary. The ratio of actual to predicted reduction is calculated as in (12), but using the subsampled function value, and is used for updating the parameters as given in (13). Then the trust region radius is updated according to (14), and these steps are repeated for a given number of iterations or until convergence.
STRON uses progressive subsampling, i.e., dynamically increasing subsample sizes for the function, gradient and Hessian calculations, and solves the Newton system approximately. It is effective for large-scale problems since it uses subsampling and solves the subproblem approximately, without forming the Hessian matrix, using only Hessian-vector products. It thus avoids the complexity issues associated with the Newton method.
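Since the experiments use l2-regularized logistic regression, the subsampled Hessian-vector product needed by the CG solver can be formed without building the Hessian. The sketch below assumes the logistic loss with labels $y_i \in \{-1, +1\}$ and a placeholder regularization constant lam; all names are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def subsampled_hess_vec(w, v, X, y, idx, lam=1.0):
    # (1/|S|) * sum_{i in S} d_i * x_i x_i^T v + lam * v for the logistic loss,
    # where d_i = sigma_i * (1 - sigma_i). The d x d Hessian is never formed.
    Xs, ys = X[idx], y[idx]
    sig = sigmoid(ys * (Xs @ w))
    d = sig * (1.0 - sig)                        # per-sample curvature weights
    return Xs.T @ (d * (Xs @ v)) / len(idx) + lam * v

# Toy usage on random data with a random subsample S_t.
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 20))
y = rng.choice([-1.0, 1.0], size=500)
idx = rng.choice(500, size=50, replace=False)
w, v = np.zeros(20), rng.normal(size=20)
print(subsampled_hess_vec(w, v, X, y, idx).shape)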

4.1. Complexity

The complexity of trust region inexact Newton (TRON) method depends on function, gradient and CG subproblem solver. This is dominated by CG subproblem solver and is given by

(18)

where is number of non-zeros values in the dataset. For subsampled trust region inexact Newton (STRON) method, the complexity per-iteration is given by

(19)

where is number of non-zeros values in the subsample . Since #CG iterations taken by TRON and STRON do not differ much so the per-iteration complexity of STRON is smaller than TRON in the initial iterations and later becomes equal to TRON due to progressive sampling, i.e., when .

5. Analysis

In this section, we derive the convergence rate for STRON, without assuming strong convexity or Lipschitz continuity.

Theorem 5.1 (Superlinear Convergence).

For a parameter sequence $\{w^t\}$ converging to the solution $w^*$, STRON, as presented in Algorithm 1, converges superlinearly.

Proof.

For $t$ sufficiently large, the CG subproblem solver exits on the residual criterion and the step is accepted, i.e., $w^{t+1} = w^t + s^t$ (Steihaug, 1983).

(20)

The second inequality follows from the stated assumption and the definition of the residual, and the last inequality follows from the residual exit condition.
Now,

(21)

Substituting inequality (21) in (20), we get,

(22)

inequality follows assuming . For some and , taking , then .
Further, as , and assuming , we get,

(23)

And for as , we get,

(24)

which proves superlinear convergence. ∎

6. Experimental Results

In this section, we discuss the experimental settings and results. The experiments have been conducted on the benchmark binary datasets given in Table 1, which are available for download from the LibSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).

Dataset #features #datapoints
rcv1.binary 47,236 20,242
covtype.binary 54 581,012
ijcnn1 22 49,990
news20.binary 1,355,191 19,996
real-sim 20,958 72,309
Adult 123 32,561
mushroom 112 8,124
Table 1. Datasets used in experimentation

We use the following methods in the experiments:
TRON (Hsia et al., 2018): This is a trust region inexact Newton method without any subsampling. It uses a preconditioned CG method to solve the trust region subproblem and is used in the current version of the LIBLINEAR library (Fan et al., 2008).
STRON: This is the proposed stochastic trust region inexact Newton method with a progressive sampling technique for the gradient and Hessian calculations. It uses the CG method to solve the trust region subproblem.
STRON-PCG: This is an extension of STRON using PCG for solving the trust region subproblem, as discussed in Subsection 7.1.
STRON-SVRG: This is another extension of STRON using variance reduction for the subsampled gradient calculations, as discussed in Subsection 7.2.
Newton-CG (Byrd et al., 2011): This is a stochastic inexact Newton method which uses the CG method to solve the subproblem. It uses progressive subsampling similar to STRON.
SVRG-SQN (Moritz et al., 2016): This is a stochastic L-BFGS method with variance reduction for the gradient calculations.
SVRG-LBFGS (Kolte et al., 2015): This is another stochastic L-BFGS method with variance reduction. It differs from SVRG-SQN in the way the Hessian information is sampled.

6.1. Experimental Setup

All the datasets have been divided into 80% and 20% splits for training and testing, respectively. The maximum number of CG iterations has been set to a sufficiently large value of 25 for all methods, as the inexact Newton methods use 5-10 iterations and hardly go beyond 20 iterations. The progressive batching scheme uses an initial batch size of 1% to 10% with a suitable rate for increasing the batch size. The quasi-Newton methods (SVRG-SQN and SVRG-LBFGS) use a mini-batch size of 10% with stochastic backtracking line search to find the step size, and mini-batches of the same size are taken for gradient and Hessian subsampling. A fixed memory size is used in the quasi-Newton methods, with a fixed update frequency for the Hessian inverse approximation in SVRG-SQN. All the methods are implemented in C++ (code will be made available soon) as a large-scale machine learning library for second order methods with a MATLAB interface, and experiments have been executed on a MacBook Air (8 GB 1600 MHz DDR3, 1.6 GHz Intel Core i5 and 256 GB SSD).

6.2. Comparative Study

The experiments have been performed with the strongly convex and smooth l2-regularized logistic regression problem:

(25)   $\min_{w}\; \dfrac{1}{n} \sum_{i=1}^n \log\left(1 + \exp(-y_i\, x_i^\top w)\right) + \dfrac{\lambda}{2} \|w\|^2$.

The results have been plotted as suboptimality ($F(w) - F(w^*)$) versus training time (in seconds) and accuracy versus training time for high-accuracy solutions, as given in Figs. 1, 2 and 3. As is clear from the results, STRON converges faster than all the other methods and shows an improvement over TRON on accuracy versus time. Moreover, the quasi-Newton methods converge slower than the inexact Newton methods, as already established in the literature (Lin et al., 2008). As per intuition, STRON takes an initial advantage over TRON due to subsampling, and as it approaches the solution region the progressive batching scheme reaches the full batching scheme and converges at the same rate as TRON. That is why, in most of the figures, we observe STRON and TRON converging along parallel lines. Moreover, we observe a horizontal line in the accuracy versus time plot for the covtype dataset because all methods give 100% accuracy.

Figure 1. First column presents suboptimality versus training time (in seconds) and second column presents accuracy versus training time, on the mushroom (first row), rcv1 (second row) and news20 (third row) datasets.
Figure 2. First column presents suboptimality versus training time (in seconds) and second column presents accuracy versus training time, on the Adult (first row), covtype (second row) and ijcnn1 (third row) datasets.
Figure 3. First column presents suboptimality versus training time (in seconds) and second column presents accuracy versus training time, on the real-sim dataset.

Generally, models are trained to low-accuracy (ε = 0.01) solutions, so we present the results for such a case in Table 2. The results are reported as the mean over 5 runs of the experiments. As is clear from the table, STRON converges faster than the other solvers in most of the cases and has nearly the same accuracy as the others. We observe large variations in the training time on the mushroom dataset because its training time is small and is difficult to measure precisely.

Datasets Methods SVRG_LBFGS STRON SVRG_SQN Newton_CG TRON
covtype Time (s) 0.86±0.31 0.06±0.01 0.77±0.03 0.85±0.01 0.68±0.00
Accuracy 1.00 1.00 1.00 1.00 1.00
real-sim Time (s) 1.85±0.15 0.88±0.28 1.15±0.01 0.68±0.01 0.88±0.10
Accuracy 0.9693 0.9696 0.9670 0.9690 0.9696
rcv1 Time (s) 2.63±0.18 0.53±0.01 3.04±0.07 0.61±0.05 0.61±0.04
Accuracy 0.9634 0.9632 0.9632 0.9634 0.9627
news20 Time (s) 49.30±0.70 6.00±0.08 68.98±0.77 7.01±0.13 6.23±0.42
Accuracy 0.9313 0.9320 0.9323 0.9320 0.9320
mushroom Time (s) 0.15±0.02 0.11±0.03 0.27±0.06 0.14±0.06 0.17±0.06
Accuracy 0.9988 0.9988 0.9988 0.9988 0.9988
ijcnn1 Time (s) 0.40±0.16 0.28±0.01 0.28±0.03 0.19±0.01 0.28±0.14
Accuracy 0.9239 0.9233 0.9225 0.9239 0.9238
Adult Time (s) 0.42±0.16 0.18±0.12 0.44±0.19 0.28±0.17 0.22±0.02
Accuracy 0.8469 0.8468 0.8460 0.8471 0.8483
Table 2. Comparison of training time (mean ± std over 5 runs) for low-accuracy (ε = 0.01) solutions

6.3. Results with SVM

We extend STRON to solve the l2-SVM problem, which is a non-smooth problem:

(26)   $\min_{w}\; \dfrac{1}{n} \sum_{i=1}^n \max\left(0,\, 1 - y_i\, x_i^\top w\right)^2 + \dfrac{\lambda}{2} \|w\|^2$.
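The l2-SVM loss is not twice differentiable, so a generalized Hessian is commonly used (as in TRON for LIBLINEAR). Assuming the averaged objective form written in (26) above, a sketch of the corresponding generalized Hessian-vector product, with illustrative names and a placeholder regularization constant lam, is:

import numpy as np

def l2svm_generalized_hess_vec(w, v, X, y, lam=1.0):
    # Generalized Hessian-vector product for
    # (lam/2)||w||^2 + (1/n) sum_i max(0, 1 - y_i x_i^T w)^2.
    # Only rows in the active set I = {i : 1 - y_i x_i^T w > 0} contribute.
    n = X.shape[0]
    margins = 1.0 - y * (X @ w)
    Xa = X[margins > 0]                          # active rows
    return lam * v + (2.0 / n) * (Xa.T @ (Xa @ v))

# Toy usage.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
y = rng.choice([-1.0, 1.0], size=200)
w, v = rng.normal(size=10), rng.normal(size=10)
print(l2svm_generalized_hess_vec(w, v, X, y).shape)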
Figure 4. Experiments with l2-SVM on the news20 (first row) and real-sim (second row) datasets.

The results are reported in Fig. 4 for the news20 and real-sim datasets. As is clear from the figure, STRON outperforms all the other methods.

7. Extensions

In this section, we discuss extensions of the proposed method: using PCG to solve the trust region subproblem, and using a variance reduction technique for the gradient calculations.

7.1. PCG Subproblem Solver

The number of iterations required by the CG method to solve the subproblem depends on the condition number of the Hessian matrix, so for ill-conditioned problems CG can converge very slowly. To avoid such situations, a non-singular matrix $M$, called a preconditioner, is generally used as follows: for the linear system $Hs = -g$, we solve the preconditioned system

(27)   $\hat{H}\hat{s} = -\hat{g}$,  where $\hat{H} = P^{-1} H P^{-\top}$, $\hat{g} = P^{-1} g$ and $s = P^{-\top}\hat{s}$.

Generally, $M = P P^\top$ is taken to ensure the symmetry and positive definiteness of $\hat{H}$. PCG can be useful for solving ill-conditioned problems but it involves extra computational overhead. We follow (Hsia et al., 2018) and use a preconditioner that is a weighted average of the identity matrix and the diagonal of the Hessian:

(28)   $M = \alpha\, \mathrm{diag}(H) + (1 - \alpha) I$,

where $H$ is the Hessian matrix and $\alpha \in [0, 1]$. For $\alpha = 0$ there is no preconditioning, and for $\alpha = 1$ it is a pure diagonal preconditioner. In the experiments, we take the same value of $\alpha$ for TRON and STRON-PCG, following (Hsia et al., 2018). To apply PCG to the trust region subproblem, we can use Algorithm 2 without any modifications, after changing the trust region subproblem (11) as follows (Steihaug, 1983):

(29)   $\min_{\hat{s}}\; \hat{Q}_t(\hat{s})$ subject to $\|\hat{s}\| \le \Delta_t$, with $\hat{s} = P^\top s$.
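A minimal sketch of a diagonally preconditioned CG solve in the spirit of (27)-(28): the preconditioner is a weighted average of the identity and the Hessian diagonal, applied here to a plain linear system; the mixing weight alpha, the toy matrix and all names are illustrative assumptions, and the trust region boundary handling of Algorithm 2 is omitted for brevity:

import numpy as np

def diag_preconditioner(H_diag, alpha=0.01):
    # M = alpha * diag(H) + (1 - alpha) * I, stored as a vector, cf. (28).
    return alpha * H_diag + (1.0 - alpha) * np.ones_like(H_diag)

def pcg(hess_vec, grad, M_diag, eta=0.1, max_iters=25):
    # Preconditioned CG for H s = -g with a diagonal preconditioner M.
    s = np.zeros_like(grad)
    r = -grad
    z = r / M_diag
    d = z.copy()
    g_norm = np.linalg.norm(grad)
    for _ in range(max_iters):
        if np.linalg.norm(r) <= eta * g_norm:
            break
        Hd = hess_vec(d)
        alpha_cg = (r @ z) / (d @ Hd)
        s += alpha_cg * d
        r_new = r - alpha_cg * Hd
        z_new = r_new / M_diag
        beta = (r_new @ z_new) / (r @ z)
        d = z_new + beta * d
        r, z = r_new, z_new
    return s

# Toy usage with an ill-conditioned SPD matrix.
rng = np.random.default_rng(6)
H = np.diag(np.logspace(0, 4, 25)) + 0.1 * np.eye(25)
g = rng.normal(size=25)
M = diag_preconditioner(np.diag(H))
print(np.linalg.norm(H @ pcg(lambda v: H @ v, g, M) + g))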

STRON using PCG as the trust region subproblem solver is denoted STRON-PCG, and the results are reported in Fig. 5, which compares TRON, STRON and STRON-PCG on the news20 and rcv1 datasets. As is clear from the figure, both STRON and STRON-PCG outperform TRON.

Figure 5. Comparative study of STRON-PCG, STRON and TRON on the news20 (first row) and rcv1 (second row) datasets.

The PCG trust region subproblem solver involves an extra cost for calculating the preconditioner. For TRON, the per-iteration overhead due to the preconditioner is

(30)   $O(nnz)$,

and for STRON-PCG the preconditioner involves an extra per-iteration cost of

(31)   $O(nnz(S_t))$.

7.2. Stochastic Variance Reduced Trust Region Inexact Newton Method

To improve the quality of the search direction, we use SVRG as the variance reduction technique for the gradient calculations:

(32)   $\tilde{\nabla} F_{X_t}(w^t) = \nabla F_{X_t}(w^t) - \nabla F_{X_t}(\tilde{w}) + \nabla F(\tilde{w})$,

where $\tilde{w}$ is the parameter value at the start of the outer iteration. STRON-SVRG uses variance reduction for the gradient calculations and progressive batching for the Hessian calculations, as given in Algorithm 3.

  Inputs: $w^0$, $\Delta_0$, inner loop length $m$
  Result: $w^T$
  for $k = 0, 1, 2, \dots$ do
     Calculate the full gradient $\nabla F(\tilde{w})$ and set the snapshot $\tilde{w}$ to the current iterate
     for $t = 0, 1, \dots, m-1$ do
        Randomly select subsamples $X_t$ and $S_t$
        Calculate the subsampled gradient $\nabla F_{X_t}(w^t)$
        Calculate the variance reduced gradient using (32)
        Solve the trust region subproblem using Algorithm 2 with the variance reduced gradient, instead of the subsampled gradient, to get the step direction $s^t$
        Calculate the ratio $\rho_t$
        Update the parameters using (13)
        Update the trust region radius using (14)
     end for
  end for
ALGORITHM 3 STRON with Variance Reduction
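A sketch of the variance-reduced subsampled gradient (32) used in Algorithm 3, assuming the l2-regularized logistic regression loss from the experiments; the snapshot handling mirrors SVRG, and all names and constants are illustrative:

import numpy as np

def logistic_grad(w, X, y, lam=1.0):
    # Gradient of (1/m) sum_i log(1 + exp(-y_i x_i^T w)) + (lam/2)||w||^2.
    z = y * (X @ w)
    coeff = -y / (1.0 + np.exp(z))               # derivative of the logistic loss
    return X.T @ coeff / X.shape[0] + lam * w

def svrg_gradient(w, w_snapshot, full_grad_snapshot, X, y, idx, lam=1.0):
    # Variance-reduced gradient in the spirit of (32):
    # grad_X(w) - grad_X(w_snapshot) + full_grad(w_snapshot).
    Xs, ys = X[idx], y[idx]
    return (logistic_grad(w, Xs, ys, lam)
            - logistic_grad(w_snapshot, Xs, ys, lam)
            + full_grad_snapshot)

# Toy usage: snapshot at the start of an outer iteration, then a subsampled step.
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 15))
y = rng.choice([-1.0, 1.0], size=400)
w_tilde = np.zeros(15)
mu = logistic_grad(w_tilde, X, y)                # full gradient at the snapshot
idx = rng.choice(400, size=40, replace=False)
g_vr = svrg_gradient(w_tilde + 0.01, w_tilde, mu, X, y, idx)
print(np.linalg.norm(g_vr))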

The experimental results are presented in Fig. 6 for the news20 and rcv1 datasets. As is clear from the figures, STRON-SVRG lags behind STRON and TRON, i.e., variance reduction is not sufficient to beat the progressive batching used for the gradient calculations in STRON.

Figure 6. Comparative study of STRON-SVRG, STRON and TRON on the news20 (first row) and rcv1 (second row) datasets.

8. Conclusion

We have proposed a novel subsampled trust region inexact Newton method, called STRON. STRON uses a progressive batching scheme to deal with the noisy approximations of the gradient and Hessian, and enjoys the benefits of both regimes: stochastic approximation and full batch processing. The proposed method has been extended to use preconditioned CG as the trust region subproblem solver, to use variance reduction for the gradient calculations, and to solve the SVM problem. The theoretical analysis proves superlinear convergence for the proposed method and the empirical results demonstrate the efficacy of STRON.

Acknowledgements.
The first author is thankful to the Ministry of Human Resource Development, Government of India, for providing a fellowship (University Grants Commission - Senior Research Fellowship) to pursue his PhD.

References