1 Introduction
Generalized Linear Models (GLMs) play a crucial role in numerous statistical and machine learning problems. GLMs formulate the natural parameter in exponential families as a linear model and provide a flexible framework for statistical methodology and supervised learning tasks. Celebrated examples include linear, logistic, and multinomial regressions, and applications to graphical models [NB72, MN89, KF09].

In this paper, we focus on how to solve the maximum likelihood problem efficiently in the GLM setting when the number of observations $n$ is much larger than the dimension $p$ of the coefficient vector $\beta$, i.e., $n \gg p$. The GLM optimization task is typically expressed as a minimization problem where the objective function is the negative log-likelihood, denoted by $\ell(\beta)$, where $\beta \in \mathbb{R}^p$ is the coefficient vector. Many optimization algorithms are available for such minimization problems [Bis95, BV04, Nes04]. However, only a few use the special structure of GLMs. In this paper, we consider updates that are specifically designed for GLMs, which are of the form
(1.1) $\beta^{t+1} = \beta^t - \gamma_t Q^t \nabla\ell(\beta^t),$
where $\gamma_t$ is the step size and $Q^t$ is a scaling matrix which provides curvature information.
For updates of the form Eq. (1.1), the performance of the algorithm is mainly determined by the scaling matrix $Q^t$. Classical Newton's method (NM) and Natural Gradient (NG) descent can be recovered by simply taking $Q^t$ to be the inverse Hessian and the inverse Fisher information at the current iterate, respectively [Ama98, Nes04]. Second order methods may achieve a quadratic convergence rate, yet they suffer from the excessive cost of computing the scaling matrix at every iteration. On the other hand, if we take $Q^t$ to be the identity matrix, we recover the simple Gradient Descent (GD) method, which has a linear convergence rate. Although GD's convergence rate is slow compared to that of second order methods such as NM, its modest per-iteration cost makes it practical for large-scale optimization.

The tradeoff between the convergence rate and per-iteration cost has been extensively studied [Bis95, BV04, Nes04]. In the $n \gg p$ regime, the main objective is to construct a scaling matrix $Q^t$ that is computationally feasible and provides sufficient curvature information. For this purpose, several quasi-Newton methods have been proposed [Bis95, Nes04]. Updates given by quasi-Newton methods satisfy an equation which is often called the quasi-Newton relation. A well-known member of this class of algorithms is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [Bro70, Fle70, Gol70, Sha70].
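The trade-off above can be seen concretely by swapping the scaling matrix in the same iteration. The sketch below runs the generic update of Eq. (1.1) on a toy quadratic loss; the loss, its coefficients, and the step sizes are illustrative choices and not part of the paper's setup.

```python
import numpy as np

# Toy quadratic loss l(b) = 0.5 * b^T A b - c^T b (illustrative; the
# paper's l is the GLM negative log-likelihood). Gradient: A b - c.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, 1.0])
grad = lambda b: A @ b - c

def iterate(Q, gamma, steps=50):
    """Run the generic update of Eq. (1.1): b <- b - gamma * Q * grad(b)."""
    b = np.zeros(2)
    for _ in range(steps):
        b = b - gamma * Q @ grad(b)
    return b

b_star = np.linalg.solve(A, c)                        # exact minimizer
b_gd = iterate(np.eye(2), gamma=0.2)                  # Q = I: gradient descent
b_nm = iterate(np.linalg.inv(A), gamma=1.0, steps=1)  # Q = inverse Hessian: Newton

print(np.allclose(b_nm, b_star))  # True: Newton solves a quadratic in one step
```

GD needs many iterations but each one is cheap; Newton converges immediately on a quadratic but pays for the inverse Hessian, which is the trade-off the scaling matrix $Q^t$ mediates.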
In this paper, we propose an algorithm which utilizes the special structure of GLMs by relying on a Stein-type lemma [Ste81]. It attains fast convergence rates with a low per-iteration cost. We call our algorithm the Newton-Stein method, which we abbreviate as NewSt. Our contributions can be summarized as follows:

We recast the problem of constructing a scaling matrix as an estimation problem and apply a Stein-type lemma along with subsampling to form a computationally feasible $Q^t$.

Our formulation allows further improvements through subsampling techniques and eigenvalue thresholding.

Newton's method's $O(np^2 + p^3)$ per-iteration cost is replaced by an $O(np + p^2)$ per-iteration cost and a one-time $O(|S|p^2)$ cost, where $|S|$ is the subsample size.

Assuming that the rows of the design matrix are i.i.d. and have bounded support (or are subgaussian), and denoting the iterates of the Newton-Stein method by $\{\hat\beta^t\}_{t \geq 0}$, we prove a bound of the form
(1.2) $\|\hat\beta^{t+1} - \beta_*\|_2 \leq \tau_1 \|\hat\beta^t - \beta_*\|_2 + \tau_2 \|\hat\beta^t - \beta_*\|_2^2,$
where $\beta_*$ is the minimizer and $\tau_1$, $\tau_2$ are the convergence coefficients. The above bound implies that the convergence starts with a quadratic phase and transitions into a linear phase as the iterate gets closer to $\beta_*$.

We demonstrate the performance of NewSt on four datasets by comparing it to commonly used algorithms.
The rest of the paper is organized as follows: Section 1.1 surveys the related work and Section 1.2 introduces the notation used throughout the paper. Section 2 briefly discusses the GLM framework and its relevant properties. In Section 3, we introduce the Newton-Stein method, develop its intuition, and discuss computational aspects. Section 4 covers the theoretical results, and in Section 4.4 we discuss how to choose the algorithm parameters. Section 5 provides the empirical results, where we compare the proposed algorithm with several other methods on four datasets. Finally, in Section 6, we conclude with a brief discussion along with a few open questions.
1.1 Related work
There are numerous optimization techniques that can be used to find the maximum likelihood estimator in GLMs. For moderate values of $n$ and $p$, classical second order methods such as NM are commonly used. In large-scale problems, data dimensionality is the main factor when choosing the right optimization method. Large-scale optimization has been extensively studied through online and batch methods. Online methods use a gradient (or subgradient) of a single, randomly selected observation to update the current iterate [RM51]. Their per-iteration cost is independent of $n$, but the convergence rate might be extremely slow. There are several extensions of the classical stochastic gradient descent algorithm (SGD), providing significant improvement and/or stability [Bot10, DHS11, SRB13].
On the other hand, batch algorithms enjoy faster convergence rates, though their per-iteration cost may be prohibitive. In particular, second order methods enjoy quadratic convergence, but constructing the Hessian matrix generally requires an excessive amount of computation. Many algorithms aim at forming an approximate, cost-efficient scaling matrix. In particular, this idea lies at the core of quasi-Newton methods [Bis95, Nes04].
Another approach to constructing an approximate Hessian makes use of subsampling techniques [Mar10, BCNN11, VP12, EM15]. Many contemporary learning methods rely on subsampling as it is simple and provides a significant boost over first order methods. Further improvements through conjugate gradient methods and Krylov subspaces are available. Subsampling can also be used to obtain an approximate solution, with certain large deviation guarantees [DLFU13].
There are many composite variants of the aforementioned methods, which mostly combine two or more techniques. Well-known composite algorithms include combinations of subsampling and quasi-Newton [SYG07, BHNS14], SGD and GD [FS12], NG and NM [LRF10], NG and low-rank approximation [LRMB08], and subsampling and eigenvalue thresholding [EM15].
1.2 Notation
Let $[n] = \{1, 2, \ldots, n\}$, and denote by $|S|$ the size of a set $S$. The gradient and the Hessian of $f$ with respect to $\beta$ are denoted by $\nabla f$ and $\nabla^2 f$, respectively. The $k$-th derivative of a function $\Phi$ is denoted by $\Phi^{(k)}$. For a vector $x$ and a symmetric matrix $M$, $\|x\|_2$ and $\|M\|_2$ denote the $\ell_2$ and spectral norms, respectively. $\|\cdot\|_{\psi_2}$ denotes the subgaussian norm, which will be defined later. $S^{p-1}$ denotes the $p$-dimensional unit sphere. $\mathcal{P}_C$ denotes the Euclidean projection onto the set $C$, and $B_p(R)$ denotes the $\ell_2$ ball of radius $R$. For a random variable $x$ and density $f$, $x \sim f$ means that the distribution of $x$ follows the density $f$. The multivariate Gaussian density with mean $\mu$ and covariance $\Sigma$ is denoted by $N_p(\mu, \Sigma)$. For random variables $x$ and $z$, $\mathcal{D}(x, z)$ and $\tilde{\mathcal{D}}(x, z)$ denote probability metrics (to be explicitly defined later) measuring the distance between the distributions of $x$ and $z$. $N_{[\,]}(\mathcal{F}, \epsilon)$ and $N(\mathcal{F}, \epsilon)$ denote the bracketing number and $\epsilon$-net, respectively.

2 Generalized Linear Models
The distribution of a random variable $y \in \mathbb{R}$ belongs to an exponential family with natural parameter $\eta \in \mathbb{R}$ if its density is
$f(y \mid \eta) = \exp\{\eta y - \Phi(\eta)\}\, h(y),$
where $\Phi$ is the cumulant generating function and $h$ is the carrier density. Let $y_1, y_2, \ldots, y_n$ be independent observations such that $y_i \sim f(y_i \mid \eta_i)$. Denoting $\eta = (\eta_1, \ldots, \eta_n)$, the joint likelihood can be written as
(2.1) $f(y_1, \ldots, y_n \mid \eta) = \exp\Big\{\textstyle\sum_{i=1}^n \big[y_i \eta_i - \Phi(\eta_i)\big]\Big\} \prod_{i=1}^n h(y_i).$
We consider the problem of learning the maximum likelihood estimator in the above exponential family framework, where the vector $\eta$ is modeled through the linear relation
$\eta_i = \langle x_i, \beta \rangle,$
for some design matrix $X \in \mathbb{R}^{n \times p}$ with rows $x_i \in \mathbb{R}^p$, and a coefficient vector $\beta \in \mathbb{R}^p$. This formulation is known as Generalized Linear Models (GLMs). The cumulant generating function $\Phi$ determines the class of GLMs, i.e., $\Phi(z) = z^2/2$ for ordinary least squares (OLS) and $\Phi(z) = \log(1 + e^z)$ for logistic regression (LR).

Finding the maximum likelihood estimator in the above formulation is equivalent to minimizing the negative log-likelihood function $\ell(\beta)$,
(2.2) $\ell(\beta) = \frac{1}{n}\sum_{i=1}^n \big[\Phi(\langle x_i, \beta\rangle) - y_i \langle x_i, \beta\rangle\big],$
where $\langle x, \beta \rangle$ is the inner product between the vectors $x$ and $\beta$. The relation to OLS and LR can be seen more easily by plugging the corresponding $\Phi$ into Eq. (2.2). The gradient and the Hessian of $\ell(\beta)$ can be written as:
(2.3) $\nabla\ell(\beta) = \frac{1}{n}\sum_{i=1}^n \big[\Phi^{(1)}(\langle x_i,\beta\rangle) - y_i\big]\, x_i, \qquad \nabla^2\ell(\beta) = \frac{1}{n}\sum_{i=1}^n \Phi^{(2)}(\langle x_i,\beta\rangle)\, x_i x_i^T.$
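The gradient and Hessian above are straightforward to compute in code. The sketch below instantiates them for logistic regression, where $\Phi^{(1)}$ is the sigmoid and $\Phi^{(2)}(z) = \sigma(z)(1-\sigma(z))$; the $1/n$ normalization convention and the synthetic data are illustrative assumptions.

```python
import numpy as np

def glm_grad_hess(X, y, beta):
    """Gradient and Hessian of the (1/n-normalized) GLM negative
    log-likelihood for logistic regression: Phi(z) = log(1 + e^z)."""
    n = X.shape[0]
    z = X @ beta
    p1 = 1.0 / (1.0 + np.exp(-z))        # Phi'(z): the sigmoid
    w = p1 * (1.0 - p1)                  # Phi''(z)
    grad = X.T @ (p1 - y) / n            # (1/n) sum_i [Phi'(<x_i,b>) - y_i] x_i
    hess = (X * w[:, None]).T @ X / n    # (1/n) sum_i Phi''(<x_i,b>) x_i x_i^T
    return grad, hess

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = (rng.random(200) < 0.5).astype(float)
g, H = glm_grad_hess(X, y, np.zeros(3))  # at beta = 0, Phi'' = 1/4 exactly
```

At $\beta = 0$ the Hessian reduces to $\frac{1}{4n} X^T X$, a quick correctness check for the implementation.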
For a sequence of scaling matrices $\{Q^t\}_{t \geq 0}$, we consider iterations of the form
$\beta^{t+1} = \beta^t - \gamma_t Q^t \nabla\ell(\beta^t),$
where $\gamma_t$ is the step size. The above iteration is our main focus, but with a new approach to computing the sequence of matrices $\{Q^t\}_{t \geq 0}$. We will formulate the problem of finding a scalable $Q^t$ as an estimation problem and apply a Stein-type lemma that provides us with a computationally efficient update.
3 Newton-Stein Method
The classical Newton-Raphson update is generally used for training GLMs. However, its per-iteration cost makes it impractical for large-scale optimization. The main bottleneck is the computation of the Hessian matrix, which requires $O(np^2)$ flops and is prohibitive when $n$ is large. Numerous methods have been proposed to achieve NM's fast convergence rate while keeping the per-iteration cost manageable. To this end, a popular approach is to construct a scaling matrix $Q^t$ which approximates the true Hessian at every iteration $t$.
The task of constructing an approximate Hessian can be viewed as an estimation problem. Assuming that the rows of $X$ are i.i.d. random vectors, the Hessian of GLMs with cumulant generating function $\Phi$ has the form
$\nabla^2\ell(\beta) = \frac{1}{n}\sum_{i=1}^n \Phi^{(2)}(\langle x_i,\beta\rangle)\, x_i x_i^T,$
i.e., a sum of i.i.d. matrices. Hence, the true Hessian is nothing but a sample mean estimator of its expectation $\mathbb{E}\big[\Phi^{(2)}(\langle x,\beta\rangle)\, x x^T\big]$. Another natural estimator would be the subsampled Hessian suggested by [Mar10, BCNN11]. Our goal, therefore, is to propose an estimator that is computationally efficient and well-approximates the true Hessian.
We use the following Steintype proposition to find a more efficient estimator to the expectation of the Hessian.
Lemma 3.1 (Steintype lemma).
Assume that $x \sim N_p(0, \Sigma)$ and $\beta \in \mathbb{R}^p$ is a constant vector. Then for any function $f: \mathbb{R} \to \mathbb{R}$ that is twice "weakly" differentiable, we have
(3.1) $\mathbb{E}\big[f(\langle x,\beta\rangle)\, x x^T\big] = \mathbb{E}\big[f(\langle x,\beta\rangle)\big]\, \Sigma + \mathbb{E}\big[f^{(2)}(\langle x,\beta\rangle)\big]\, \Sigma\beta\beta^T\Sigma.$
The proof of Lemma 3.1 is given in the Appendix. The right-hand side of Eq. (3.1) is a rank-1 update to the first term. Hence, its inverse can be computed with $O(p^2)$ cost. The quantities that change at each iteration are the ones that depend on $\beta$, i.e.,
$\mathbb{E}\big[\Phi^{(2)}(\langle x,\beta\rangle)\big] \quad \text{and} \quad \mathbb{E}\big[\Phi^{(4)}(\langle x,\beta\rangle)\big].$
Note that both are scalar quantities and can be estimated by their corresponding sample means $\hat\mu_2(\beta)$ and $\hat\mu_4(\beta)$ (explicitly defined at Step 3 of Algorithm 1), respectively, with only $O(np)$ computation.
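The identity in Lemma 3.1 can be sanity-checked by Monte Carlo. The sketch below uses the test function $f(z) = z^2$ (so $f^{(2)} = 2$) rather than a cumulant generating function, and an illustrative covariance and coefficient vector; only the structure of the identity is being verified.

```python
import numpy as np

# Monte Carlo check of Eq. (3.1): for x ~ N(0, S),
#   E[f(<x,b>) x x^T] = E[f(<x,b>)] S + E[f''(<x,b>)] (S b)(S b)^T.
# Test function f(z) = z^2, so f'' = 2 (illustrative choice).
rng = np.random.default_rng(1)
S = np.array([[2.0, 0.5], [0.5, 1.0]])   # illustrative covariance
b = np.array([0.7, -0.3])                # illustrative coefficient vector
x = rng.multivariate_normal(np.zeros(2), S, size=500_000)
z = x @ b

lhs = (x * (z**2)[:, None]).T @ x / len(x)        # sample mean of f(z) x x^T
Sb = S @ b
rhs = np.mean(z**2) * S + 2.0 * np.outer(Sb, Sb)  # Sigma term + rank-1 term
print(np.max(np.abs(lhs - rhs)))                  # small sampling error
```

The left-hand side is a full $p \times p$ expectation, while the right-hand side only needs $\Sigma$, two scalars, and a rank-1 correction; this is exactly the structure NewSt exploits.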
[Figure 1: The left plot shows the accuracy of the proposed Hessian estimator across covariate distributions. The right plot shows the phase transition in the convergence rate of the Newton-Stein method (NewSt): convergence starts with a quadratic rate and transitions into linear. Plots are obtained using the Covertype dataset.]

To complete the estimation task suggested by Eq. (3.1), we need an estimator for the covariance matrix $\Sigma$. A natural estimator is the sample mean, where we only use a subsample $S \subset [n]$ so that the cost is reduced from $O(np^2)$ to $O(|S|p^2)$. The subsampling-based sample mean estimator is denoted by $\hat\Sigma_S$; it is widely used in large-scale problems [Ver10]. We highlight the fact that Lemma 3.1 replaces NM's $O(np^2)$ per-iteration cost with a one-time cost of $O(np^2)$. We further use subsampling to reduce this one-time cost to $O(|S|p^2)$.
In general, important curvature information is contained in the largest few spectral features [EM15]. For a given threshold $r$, we take the largest $r$ eigenvalues of the subsampled covariance estimator, setting the rest to the $(r+1)$-th eigenvalue. This operation helps denoising and takes only $O(rp^2)$ computation. Step 2 of Algorithm 1 performs this procedure.
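The thresholding step can be sketched as follows; the input matrix and the threshold $r$ are illustrative, and the function name is ours rather than the paper's.

```python
import numpy as np

def threshold_eigenvalues(C, r):
    """Keep the largest r eigenvalues of C; set the rest to the
    (r+1)-th eigenvalue, as in Step 2 of Algorithm 1."""
    vals, vecs = np.linalg.eigh(C)            # eigenvalues in ascending order
    vals = vals[::-1]                         # descending
    vecs = vecs[:, ::-1]
    floor = vals[r]                           # the (r+1)-th largest eigenvalue
    vals[r:] = floor                          # flatten the noise tail
    return (vecs * vals) @ vecs.T             # reconstruct the matrix

C = np.diag([10.0, 5.0, 1.2, 1.0, 0.8])       # illustrative spectrum
C_r = threshold_eigenvalues(C, 2)             # keep the top 2 eigenvalues
# resulting spectrum: 10, 5, 1.2, 1.2, 1.2
```

Only the top $r$ eigenpairs and the floor value are needed, so iterative eigensolvers keep the cost low when $r \ll p$.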
Inverting the constructed Hessian estimator can take advantage of the low-rank structure several times. First, notice that the updates in Eq. (3.1) are based on rank-1 matrix additions, so we can simply use a matrix inversion formula to derive an explicit equation (see Step 3 of Algorithm 1). This formulation imposes another inverse operation on the covariance estimator. Since the covariance estimator is also based on a rank-$r$ approximation, one can utilize the low-rank inversion formula again. We emphasize that this operation is performed only once. Therefore, instead of NM's per-iteration cost of $O(np^2 + p^3)$ due to Hessian formation and inversion, the Newton-Stein method (NewSt) requires an $O(np + p^2)$ per-iteration cost and a one-time cost of $O(|S|p^2)$. Assuming that NewSt and NM converge in $T_1$ and $T_2$ iterations respectively, the overall complexity of NewSt is $O\big((np + p^2)T_1 + |S|p^2\big)$, whereas that of NM is $O\big((np^2 + p^3)T_2\big)$.
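The rank-1 inversion referred to above is the Sherman-Morrison formula. A minimal sketch, with a diagonal $A$ and fixed vectors $u$, $v$ standing in for the scaled covariance term and the Stein correction (all illustrative placeholders):

```python
import numpy as np

def sherman_morrison_inv(A_inv, u, v):
    """(A + u v^T)^{-1} = A^{-1} - A^{-1} u v^T A^{-1} / (1 + v^T A^{-1} u).
    Requires A^{-1} only; costs O(p^2) instead of a fresh O(p^3) inverse."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

A = np.diag([2.0, 3.0, 4.0])          # illustrative invertible base matrix
u = np.array([1.0, 0.5, -0.2])
v = np.array([0.3, -1.0, 0.8])
Q = sherman_morrison_inv(np.linalg.inv(A), u, v)
print(np.allclose(Q, np.linalg.inv(A + np.outer(u, v))))  # True
```

Since the base matrix changes only when the covariance estimate is refreshed, its inverse is reused across iterations and only the cheap rank-1 correction is recomputed.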
Even though Lemma 3.1 assumes that the covariates are multivariate Gaussian random vectors, in Section 4 the only assumption we make on the covariates is either bounded support or subgaussianity, both of which cover a wide class of random variables including Bernoulli, elliptical distributions, bounded variables, etc. The left plot of Figure 1 shows that the estimation is accurate for many distributions. This is a consequence of the fact that the proposed estimator in Eq. (3.1) relies on the distribution of $x$ only through inner products of the form $\langle x, \beta \rangle$, which are approximately normally distributed due to the central limit theorem. We will discuss this in detail in Section 4.

The convergence rate of NewSt has two phases. Convergence starts quadratically and transitions into a linear rate when it gets close to the true minimizer. This phase transition behavior can be observed in the right plot of Figure 1. It is a consequence of the bound provided in Eq. (1.2), which is the main result of the theorems stated in Section 4.
4 Theoretical results
We start by introducing the terms that will appear in the theorems. Then we provide two technical results, on bounded and subgaussian covariates, respectively. The proofs of the theorems are technical and are provided in the Appendix.
4.1 Preliminaries
The Hessian estimation described in the previous section relies on a Gaussian approximation. For theoretical purposes, we use the following probability metric to quantify the gap between the distribution of the $x_i$'s and that of a normal vector.
Definition 1.
Given a family of functions $\mathcal{H}$ and random vectors $x, z \in \mathbb{R}^p$, define
$d_{\mathcal{H}}(x, z) = \sup_{h \in \mathcal{H}} \big| \mathbb{E}[h(x)] - \mathbb{E}[h(z)] \big|.$
Many probability metrics can be expressed in this form by choosing a suitable function class $\mathcal{H}$. Examples include the Total Variation (TV), Kolmogorov, and Wasserstein metrics [GS02, CGS10]. Based on the second and the fourth derivatives of the cumulant generating function, we define the following function classes:
where $B_p(R)$ is the ball of radius $R$. Exact calculation of such probability metrics is often difficult. The general approach is to upper bound the distance by a more intuitive metric. In our case, we observe that the resulting metrics can be easily upper bounded, up to a scaling constant, by a more standard metric when the covariates have bounded support.
We further assume that the covariance matrix follows the spiked model, i.e.,
$\Sigma = \sigma^2 I_p + \sum_{i=1}^r \theta_i u_i u_i^T,$
which is commonly encountered in practice [BS06]. The first $r$ eigenvalues of the covariance matrix are large, and the rest are small and equal to each other. The small eigenvalues of $\Sigma$ (denoted by $\sigma^2$) can be thought of as the noise component.
4.2 Bounded covariates
We have the following per-step bound for the iterates generated by NewSt when the covariates are supported on a ball.
Theorem 4.1.
Assume that the covariates $x_1, \ldots, x_n$ are i.i.d. random vectors supported on a ball of radius $K$ with
$\mathbb{E}[x_i] = 0 \quad \text{and} \quad \mathbb{E}[x_i x_i^T] = \Sigma,$
where $\Sigma$ follows the spiked model. Further assume that the cumulant generating function $\Phi$ has bounded 2nd-5th derivatives and that $R$ is the radius of the projection. For iterates $\{\hat\beta^t\}_{t \geq 0}$ given by the Newton-Stein method with step size $\gamma$, define the event
(4.1) 
for some positive constant $c$ and the optimal value $\beta_*$. If $n$, $|S|$, and $p$ are sufficiently large, then there exist constants $\tau_1$ and $\tau_2$ depending on the radii $K$, $R$, and the bounds on the derivatives of $\Phi$, such that conditioned on the event $E$, with high probability, we have
(4.2) $\|\hat\beta^{t+1} - \beta_*\|_2 \leq \tau_1 \|\hat\beta^t - \beta_*\|_2 + \tau_2 \|\hat\beta^t - \beta_*\|_2^2,$
where the coefficients $\tau_1$ and $\tau_2$ are deterministic constants defined as
(4.3) 
and $\mathcal{D}(x, z)$ is defined as
(4.4) 
for $z$, a multivariate Gaussian random variable with the same mean and covariance as the $x_i$'s.
The bound in Eq. (4.2) holds with high probability, and the coefficients $\tau_1$ and $\tau_2$ are deterministic constants which describe the convergence behavior of the Newton-Stein method. Observe that the coefficient $\tau_1$ is a sum of two terms: $\mathcal{D}(x, z)$ measures how accurate the Hessian estimation is, and the second term depends on the subsample size $|S|$ and the data dimensions.

Theorem 4.1 shows that the convergence of the Newton-Stein method can be upper bounded by a compositely converging sequence; that is, the squared term dominates at first, giving a quadratic rate, and the convergence then transitions into a linear phase as the iterate gets close to the optimal value. The coefficients $\tau_1$ and $\tau_2$ govern the linear and quadratic phases, respectively. The effect of subsampling appears in the coefficient of the linear term. In theory, there is a threshold for the subsampling size $|S|$ beyond which further subsampling has no effect. The transition point between the quadratic and the linear phases is determined by the subsampling size and the properties of the data. The phase transition behavior can be observed in the right plot of Figure 1.
Using the above theorem, we state the following corollary.
Corollary 4.2.
The bound stated in the above corollary is an analogue of the composite convergence (given in Eq. (4.2)) in expectation. Note that our results make strong assumptions on the derivatives of the cumulant generating function $\Phi$. We emphasize that these assumptions are valid for linear and logistic regressions. An example that does not fit in our scheme is Poisson regression, with $\Phi(z) = e^z$. However, we observed empirically that the algorithm still provides significant improvement.
The following theorem states a sufficient condition for the convergence of composite sequence.
Theorem 4.3.
Let $\{\hat\beta^t\}_{t \geq 0}$ be a compositely converging sequence, with convergence coefficients $\tau_1$ and $\tau_2$ as in Eq. (4.2), to the true minimizer $\beta_*$. Let the starting point satisfy $\|\hat\beta^0 - \beta_*\|_2 = \vartheta < (1 - \tau_1)/\tau_2$. Then the sequence of distances $\{\|\hat\beta^t - \beta_*\|_2\}_{t \geq 0}$ converges to 0. Further, the number of iterations to reach a tolerance of $\epsilon$ can be upper bounded by $\mathcal{J}(\vartheta)$, where
(4.5) 
The above theorem gives an upper bound on the number of iterations needed to reach a tolerance of $\epsilon$. The first and second terms on the right-hand side of Eq. (4.5) stem from the quadratic and linear phases, respectively.
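The two-phase behavior can be illustrated by iterating the composite bound itself; the values of $\tau_1$, $\tau_2$, and the starting distance below are arbitrary choices that satisfy the basin condition of Theorem 4.3.

```python
# Iterate the composite bound of Eq. (4.2): d_{t+1} = tau1*d_t + tau2*d_t^2.
# Early on, the quadratic term dominates; once d_t << tau1/tau2, the
# per-step contraction factor approaches the linear rate tau1.
tau1, tau2 = 0.2, 50.0
d = 0.015                      # start inside the basin: tau1 + tau2*d < 1
rates = []
for _ in range(12):
    d_next = tau1 * d + tau2 * d * d
    rates.append(d_next / d)   # contraction factor at this step
    d = d_next

# rates fall monotonically from near 1 toward tau1 = 0.2
print(rates[0], rates[-1])
```

The printed contraction factors start near 1 (quadratic phase, large steps of progress measured multiplicatively) and settle at $\tau_1$ (linear phase), mirroring the phase transition in Figure 1.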
4.3 Subgaussian covariates
In this section, we carry our analysis to the more general case, where the covariates are subgaussian vectors.
Theorem 4.4.
Assume that $x_1, \ldots, x_n$ are i.i.d. subgaussian random vectors with subgaussian norm at most $K$ such that
$\mathbb{E}[x_i] = 0 \quad \text{and} \quad \mathbb{E}[x_i x_i^T] = \Sigma,$
where $\Sigma$ follows the spiked model. Further assume that the cumulant generating function $\Phi$ is uniformly bounded, has bounded 2nd-5th derivatives, and that $R$ is the radius of the projection. For iterates $\{\hat\beta^t\}_{t \geq 0}$ given by the Newton-Stein method and the event $E$ in Eq. (4.1), if $n$, $|S|$, and $p$ are sufficiently large and
then there exist constants $\tau_1$ and $\tau_2$ depending on the eigenvalues of $\Sigma$, the radius $R$, the subgaussian norm $K$, and the bounds on the derivatives of $\Phi$, such that conditioned on the event $E$, with high probability,
where $\mathcal{D}(x, z)$ is defined as in Eq. (4.4), for $z$ a Gaussian random variable with the same mean and covariance as the $x_i$'s.
The above theorem is more restrictive than Theorem 4.1: we require $n$ to be much larger than the dimension $p$. Also note that an additional factor appears in the coefficient of the quadratic term, and the threshold for the subsample size is reduced.
We have the following analogue of Corollary 4.2.
4.4 Algorithm parameters
The Newton-Stein method takes three input parameters, for which we suggest near-optimal choices based on our theoretical results.

Subsample size: NewSt uses a subset of indices $S \subset [n]$ to approximate the covariance matrix $\Sigma$. Corollary 5.50 of [Ver10] proves that a sample size of $O(p)$ is sufficient for subgaussian covariates, and $O(p \log p)$ is sufficient for arbitrary distributions supported in some ball, to estimate the covariance matrix by its sample mean estimator. In the regime we consider, $n \gg p$, we suggest using a sample size of $|S| = O(p \log p)$.

Rank: Many methods have been suggested to improve the estimation of a covariance matrix, and almost all of them rely on the concept of shrinkage [CCS10, DGJ13]. Eigenvalue thresholding can be considered a shrinkage operation which retains only the important second order information. The rank threshold $r$ can be chosen directly from the sample mean estimator of $\Sigma$: after obtaining the subsampled estimate, one can either plot the spectrum and choose $r$ manually, or use an optimal technique from [DG13].

Step size: The step size choices of NewSt are quite similar to those of Newton's method (see [BV04]). The main difference comes from the eigenvalue thresholding. If the data follows the spiked model, the optimal step size will be close to 1 if there is no subsampling. However, due to fluctuations resulting from subsampling, we suggest the following step size choice for NewSt:
(4.6) This formula yields a step size larger than 1. A detailed discussion can be found in Section E of the Appendix.
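The subsample-size guideline above can be sketched numerically. In the code below, the covariates, the constant in $|S| = O(p \log p)$, and the accuracy threshold are all illustrative assumptions made for the demonstration.

```python
import numpy as np

# Subsampled covariance estimation: average x_i x_i^T over a random
# index set S with |S| = O(p log p) instead of over all n rows.
rng = np.random.default_rng(3)
n, p = 20_000, 10
X = rng.standard_normal((n, p))             # covariates with Sigma = I (illustrative)

size = int(10 * p * np.log(p))              # |S| = O(p log p); constant 10 is arbitrary
S = rng.choice(n, size=size, replace=False) # random subsample of indices
Sigma_sub = X[S].T @ X[S] / size            # subsampled sample-mean estimator

# spectral-norm error against the true covariance (identity here)
err = np.linalg.norm(Sigma_sub - np.eye(p), 2)
print(size, err < 1.0)
```

Here a few hundred rows, rather than all 20,000, already give a coarse spectral approximation, which is then refined by the eigenvalue thresholding step.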
5 Experiments
In this section, we validate the performance of NewSt through extensive numerical studies. We experiment on two commonly used GLM optimization problems, namely logistic regression (LR) and linear regression (OLS). LR minimizes Eq. (2.2) for the logistic function $\Phi(z) = \log(1 + e^z)$, whereas OLS minimizes the same equation for $\Phi(z) = z^2/2$. In the following, we briefly describe the algorithms that are used in the experiments:

Newton's Method (NM) uses the inverse Hessian evaluated at the current iterate, and may achieve quadratic convergence. NM steps require $O(np^2 + p^3)$ computation, which makes the method impractical for large-scale datasets.

Broyden-Fletcher-Goldfarb-Shanno (BFGS) forms a curvature matrix by accumulating information from the iterates and the gradients at each iteration. Under certain assumptions, the convergence rate is locally superlinear and the per-iteration cost is comparable to that of first order methods.

Limited Memory BFGS (L-BFGS) is similar to BFGS, but uses only the most recent iterates to construct the curvature matrix, yielding significant savings in memory usage.

Gradient Descent (GD) takes a step proportional to the negative of the full gradient evaluated at the current iterate. Under smoothness assumptions, GD achieves a linear convergence rate with an $O(np)$ per-iteration cost.

Accelerated Gradient Descent (AGD), proposed by Nesterov [Nes83], improves over gradient descent by using a momentum term. The performance of AGD strongly depends on the smoothness of the function.
For all the algorithms, we use a constant step size that provides the fastest convergence. The subsample size $|S|$, rank $r$, and the constant step size for NewSt are selected by following the guidelines described in Section 4.4. The rank threshold (which is an input to the algorithm) is specified in the title of each plot.
5.1 Simulations with synthetic subgaussian datasets
The synthetic datasets S3 and S20 are generated from a multivariate Gaussian distribution whose covariance matrix follows the spiked model, with $r = 3$ for S3 and $r = 20$ for S20. To generate the covariance matrix, we first generate a random orthogonal matrix, say $U$. Next, we generate a diagonal matrix $\Lambda$ containing the eigenvalues, i.e., the first $r$ diagonal entries are chosen to be large and the rest are equal to 1. Then we let $\Sigma = U \Lambda U^T$. For the dimensions of the datasets, see Table 1. We also emphasize that the data dimensions are chosen so that Newton's method still does well.

The simulation results are summarized in Figure 2. Further details regarding the experiments can be found in Table 2 in Appendix F. We observe that NewSt provides a significant improvement over the classical techniques.
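The covariance construction described above can be sketched directly; the spike magnitudes and dimensions below are illustrative placeholders, not the values used in S3 or S20.

```python
import numpy as np

# Build a spiked covariance Sigma = U Lam U^T: a random orthogonal U,
# r large eigenvalues, and the remaining eigenvalues equal to 1.
rng = np.random.default_rng(4)
p, r = 20, 3
U, _ = np.linalg.qr(rng.standard_normal((p, p)))  # random orthogonal matrix
lam = np.ones(p)
lam[:r] = [100.0, 50.0, 20.0]                     # r large spikes (illustrative)
Sigma = (U * lam) @ U.T                           # U @ diag(lam) @ U.T

evals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]  # spectrum: spikes + flat tail
```

Sampling the covariates as `rng.multivariate_normal(np.zeros(p), Sigma, size=n)` then yields a synthetic dataset of the kind used in this section.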
Observe that the convergence rate of NewSt has a clear phase transition point in the top left plot of Figure 2. As argued earlier, this point depends on various factors including the subsampling size $|S|$, the data dimensions $n$ and $p$, the rank threshold $r$, and the structure of the covariance matrix. Predicting the phase transition point is an interesting line of research; however, our convergence guarantees are conservative and cannot be used for this purpose.
5.2 Experiments with Real datasets
We experimented on two real datasets downloaded from the UCI repository [Lic13]. Both datasets satisfy $n \gg p$, but we highlight the difference between their proportions $p/n$. See Table 1 for details.
We observe that the Newton-Stein method performs better than classical methods on real datasets as well. More specifically, the methods that come closest to NewSt are Newton's method for moderate $n$ and $p$, and BFGS when the problem size is large.
The optimal step size for NewSt will typically be larger than 1, which is mainly due to the eigenvalue thresholding operation. This feature is desirable if one is able to obtain a large step size that still provides convergence; in such cases, the convergence is likely to be faster, yet more unstable, compared to smaller step size choices. We observed that, similar to other second order algorithms, NewSt is also susceptible to the step size selection. If the data is not well-conditioned and the subsample size is not sufficiently large, the algorithm might perform poorly. This is mainly because the subsampling operation is performed only once, at the beginning. Therefore, it might be good practice to subsample once every few iterations.
6 Discussion
In this paper, we proposed an efficient algorithm for training GLMs. We call our algorithm the Newton-Stein method (NewSt), as it takes a Newton update at each iteration relying on a Stein-type lemma. The algorithm requires a one-time $O(|S|p^2)$ cost to estimate the covariance structure and an $O(np + p^2)$ per-iteration cost to form the update equations. We observe that the convergence of NewSt exhibits a phase transition from a quadratic rate to a linear rate. This observation is justified theoretically, along with several other guarantees for subgaussian covariates, such as per-step convergence bounds and conditions for convergence. Parameter selection guidelines for NewSt are based on our theoretical results. Our experiments show that NewSt provides significant improvement over several optimization methods.
Relaxing some of the theoretical constraints is an interesting line of research. In particular, the strong assumptions on the cumulant generating function might be loosened. Another interesting direction is to determine when the phase transition point occurs, which would provide a better understanding of the effects of subsampling and the rank threshold.
Acknowledgements
The author is grateful to Mohsen Bayati and Andrea Montanari for stimulating conversations on the topic of this work. The author would like to thank Bhaswar B. Bhattacharya and Qingyuan Zhao for carefully reading this article and providing valuable feedback.
References
[Ama98] Shun-Ichi Amari, Natural gradient works efficiently in learning, Neural Computation 10 (1998), no. 2, 251–276.
[BCNN11] Richard H. Byrd, Gillian M. Chin, Will Neveitt, and Jorge Nocedal, On the use of stochastic Hessian information in optimization methods for machine learning, SIAM Journal on Optimization 21 (2011), no. 3, 977–995.
[BD99] Jock A. Blackard and Denis J. Dean, Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables, Computers and Electronics in Agriculture (1999), 131–151.
[BHNS14] Richard H. Byrd, S. L. Hansen, Jorge Nocedal, and Yoram Singer, A stochastic quasi-Newton method for large-scale optimization, arXiv preprint arXiv:1401.7020 (2014).
[Bis95] Christopher M. Bishop, Neural networks for pattern recognition, Oxford University Press, Inc., NY, USA, 1995.
[Bot10] Léon Bottou, Large-scale machine learning with stochastic gradient descent, COMPSTAT, 2010, pp. 177–186.
[Bro70] Charles G. Broyden, The convergence of a class of double-rank minimization algorithms 2. The new algorithm, IMA Journal of Applied Mathematics 6 (1970), no. 3, 222–231.
[BS06] Jinho Baik and Jack W. Silverstein, Eigenvalues of large sample covariance matrices of spiked population models, Journal of Multivariate Analysis 97 (2006), no. 6, 1382–1408.
[BV04] Stephen Boyd and Lieven Vandenberghe, Convex optimization, Cambridge University Press, New York, NY, USA, 2004.
[CCS10] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen, A singular value thresholding algorithm for matrix completion, SIAM Journal on Optimization 20 (2010), no. 4, 1956–1982.
[CGS10] Louis H. Y. Chen, Larry Goldstein, and Qi-Man Shao, Normal approximation by Stein's method, Springer Science, 2010.
[DE15] Lee H. Dicker and Murat A. Erdogdu, Flexible results for quadratic forms with applications to variance components estimation, arXiv preprint arXiv:1509.04388 (2015).
[DG13] David L. Donoho and Matan Gavish, The optimal hard threshold for singular values is 4/sqrt(3), arXiv preprint arXiv:1305.5870 (2013).
[DGJ13] David L. Donoho, Matan Gavish, and Iain M. Johnstone, Optimal shrinkage of eigenvalues in the spiked covariance model, arXiv preprint arXiv:1311.0851 (2013).
[DHS11] John Duchi, Elad Hazan, and Yoram Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research 12 (2011), 2121–2159.
[DLFU13] Paramveer Dhillon, Yichao Lu, Dean P. Foster, and Lyle Ungar, New subsampling algorithms for fast least squares regression, Advances in Neural Information Processing Systems 26, 2013, pp. 360–368.
[EM15] Murat A. Erdogdu and Andrea Montanari, Convergence rates of sub-sampled Newton methods, arXiv preprint arXiv:1508.02810 (2015).
[FHT10] Jerome Friedman, Trevor Hastie, and Rob Tibshirani, Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software 33 (2010), no. 1, 1.
[Fle70] Roger Fletcher, A new approach to variable metric algorithms, The Computer Journal 13 (1970), no. 3, 317–322.
[FS12] Michael P. Friedlander and Mark Schmidt, Hybrid deterministic-stochastic methods for data fitting, SIAM Journal on Scientific Computing 34 (2012), no. 3, A1380–A1405.
[GKS11] Franz Graf, Hans-Peter Kriegel, Matthias Schubert, Sebastian Pölsterl, and Alexander Cavallaro, 2D image registration in CT images using radial image descriptors, MICCAI 2011, Springer, 2011, pp. 607–614.
[Gol70] Donald Goldfarb, A family of variable-metric methods derived by variational means, Mathematics of Computation 24 (1970), no. 109, 23–26.
[GS02] Alison L. Gibbs and Francis Edward Su, On choosing and bounding probability metrics, International Statistical Review 70 (2002), no. 3, 419–435.
[KF09] Daphne Koller and Nir Friedman, Probabilistic graphical models: principles and techniques, MIT Press, 2009.
[Lic13] M. Lichman, UCI machine learning repository, 2013.
[LRF10] Nicolas Le Roux and Andrew W. Fitzgibbon, A fast natural Newton method, ICML, 2010, pp. 623–630.
[LRMB08] Nicolas Le Roux, Pierre-A. Manzagol, and Yoshua Bengio, Topmoumoute online natural gradient algorithm, NIPS, 2008.
[LWK08] Chih-Jen Lin, Ruby C. Weng, and Sathiya Keerthi, Trust region Newton method for logistic regression, JMLR (2008).
[Mar10] James Martens, Deep learning via Hessian-free optimization, ICML, 2010, pp. 735–742.
[MN89] Peter McCullagh and John A. Nelder, Generalized linear models, vol. 2, Chapman and Hall, London, 1989.
[NB72] John A. Nelder and R. Jacob Baker, Generalized linear models, Wiley Online Library, 1972.
[Nes83] Yurii Nesterov, A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2), Doklady AN SSSR, vol. 269, 1983, pp. 543–547.
[Nes04] Yurii Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer, 2004.
[RM51] Herbert Robbins and Sutton Monro, A stochastic approximation method, Annals of Mathematical Statistics (1951).
[Sha70] David F. Shanno, Conditioning of quasi-Newton methods for function minimization, Mathematics of Computation 24 (1970), no. 111, 647–656.
[SRB13] Mark Schmidt, Nicolas Le Roux, and Francis Bach, Minimizing finite sums with the stochastic average gradient, arXiv preprint arXiv:1309.2388 (2013).
[Ste81] Charles M. Stein, Estimation of the mean of a multivariate normal distribution, Annals of Statistics (1981), 1135–1151.
[SYG07] Nicol Schraudolph, Jin Yu, and Simon Günter, A stochastic quasi-Newton method for online convex optimization.
[VdV00] Aad W. van der Vaart, Asymptotic statistics, vol. 3, Cambridge University Press, 2000.
[Ver10] Roman Vershynin, Introduction to the non-asymptotic analysis of random matrices, arXiv preprint arXiv:1011.3027 (2010).
[VP12] Oriol Vinyals and Daniel Povey, Krylov subspace descent for deep learning, AISTATS, 2012.
Appendix A Proof of Stein-type lemma
Proof of Lemma 3.1.
The proof follows from integration by parts over multivariate variables. Let be the density of , i.e.,
and . We write
which is the desired result. ∎
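The scalar instance of such a Stein-type identity, namely $\mathbb{E}[f(z)(z-\mu)] = \sigma^2\,\mathbb{E}[f'(z)]$ for $z \sim \mathcal{N}(\mu, \sigma^2)$, can be verified numerically by quadrature; the test function $f(z) = z^3$ and the parameter values below are arbitrary illustrative choices, not taken from the paper:

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal quadrature of samples y over grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2))

# Scalar Stein identity: for z ~ N(mu, sigma^2) and smooth f,
#   E[f(z)(z - mu)] = sigma^2 * E[f'(z)].
# f(z) = z^3 is an arbitrary test function chosen for illustration.
mu, sigma = 0.5, 1.3
z = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 200001)
density = np.exp(-(z - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

lhs = trapezoid(z ** 3 * (z - mu) * density, z)          # E[f(z)(z - mu)]
rhs = sigma ** 2 * trapezoid(3 * z ** 2 * density, z)    # sigma^2 * E[f'(z)]
print(abs(lhs - rhs))  # agreement up to quadrature error
```

Both sides reduce to the same Gaussian moment after one integration by parts, which is exactly the computation in the proof above.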
Appendix B Preliminary concentration inequalities
In this section, we provide concentration results that will be useful in proving the main theorem. We start with some basic definitions concerning a special class of random variables.
Definition 2 (Sub-Gaussian).
For a given constant $K$, a random variable $x$ is called sub-Gaussian if it satisfies
$\mathbb{P}(|x| > t) \le \exp(1 - t^2/K^2)$ for all $t \ge 0$.
The smallest such $K$ is the sub-Gaussian norm of $x$, and it is denoted by $\|x\|_{\psi_2}$. Similarly, a random vector $y$ is a sub-Gaussian vector if there exists a constant $K$ such that
$\sup_{\|v\|_2 = 1} \|\langle y, v \rangle\|_{\psi_2} \le K$.
Definition 3 (Sub-exponential).
For a given constant $K$, a random variable $x$ is called sub-exponential if it satisfies
$\mathbb{P}(|x| > t) \le \exp(1 - t/K)$ for all $t \ge 0$.
The smallest such $K$ is the sub-exponential norm of $x$, and it is denoted by $\|x\|_{\psi_1}$. Similarly, a random vector $y$ is a sub-exponential vector if there exists a constant $K$ such that
$\sup_{\|v\|_2 = 1} \|\langle y, v \rangle\|_{\psi_1} \le K$.
We state the following lemmas from [Ver10] for the convenience of the reader (see Theorem 5.39 and the subsequent remark for sub-Gaussian distributions, and Theorem 5.44 for distributions with arbitrary bounded support):
Lemma B.1 ([Ver10]).
Let be an index set and, for , let be i.i.d. sub-Gaussian random vectors with
There exist absolute constants , depending only on the sub-Gaussian norm, such that with probability ,
Remark 1.
We are interested in the case where , hence the right-hand side becomes . In most cases, we will simply let and obtain a bound of order on the right-hand side. For this, we need , which is a reasonable assumption in the regime we consider.
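An empirical illustration of this type of covariance concentration (not the paper's exact statement; standard Gaussian rows with $\Sigma = I_p$ and the specific values of $p$ and $n$ are assumptions for the demo): the operator-norm error of the sample covariance shrinks at roughly the rate $\sqrt{p/n}$.

```python
import numpy as np

# With i.i.d. standard Gaussian rows (a sub-Gaussian choice), Sigma = I_p,
# the sample covariance converges in operator norm at rate ~ sqrt(p/n).
rng = np.random.default_rng(0)
p = 20
errors = {}
for n in [500, 2000, 8000]:
    X = rng.standard_normal((n, p))     # n i.i.d. sub-Gaussian rows in R^p
    S_hat = X.T @ X / n                 # sample covariance estimate
    errors[n] = np.linalg.norm(S_hat - np.eye(p), 2)
print(errors)  # operator-norm error decreases as n grows with p fixed
```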
The following lemma will be helpful in showing a similar concentration result for the matrix :
Lemma B.2.
Let the assumptions in Lemma B.1 hold. Further, assume that follows the spiked model. If is sufficiently large, then there exist absolute constants , depending only on the sub-Gaussian norm, such that with probability ,
Proof.
By Weyl’s inequality for the eigenvalues, we have
Hence, the result follows by applying the previous lemma to . ∎
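Weyl's inequality, as used in the proof above, states that each eigenvalue of a symmetric matrix moves by at most the operator norm of a symmetric perturbation. A minimal numerical sketch (random matrices chosen purely for illustration):

```python
import numpy as np

# Weyl's inequality: for symmetric A and E,
#   |lambda_k(A + E) - lambda_k(A)| <= ||E||_2   for every k.
rng = np.random.default_rng(1)
p = 30
A = rng.standard_normal((p, p)); A = (A + A.T) / 2   # symmetric matrix
E = rng.standard_normal((p, p)); E = (E + E.T) / 2   # symmetric perturbation

lam_A = np.linalg.eigvalsh(A)       # eigenvalues in ascending order
lam_AE = np.linalg.eigvalsh(A + E)

max_shift = np.max(np.abs(lam_AE - lam_A))
print(max_shift <= np.linalg.norm(E, 2))  # True, as the inequality guarantees
```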
Lemmas B.1 and B.2 are straightforward concentration results for random matrices with i.i.d. sub-Gaussian rows. We state their analogues for covariates sampled from arbitrary distributions with bounded support.
Lemma B.3 ([Ver10]).
Let be an index set and, for , let be i.i.d. random vectors with
Then, for some absolute constant , with probability , we have
Remark 2.
We will choose which will provide us with a probability of . Therefore, if the sample size is sufficiently large, i.e.,
we can estimate the true covariance matrix quite well for arbitrary distributions with bounded support. In particular, with probability , we obtain
where .
Lemma B.4.
Let the assumptions in Lemma B.3 hold. Further, assume that follows the spiked model. If is sufficiently large, then for , with probability , we have
where is an absolute constant.
The proof of Lemma B.4 is the same as that of Lemma B.2. Before proceeding, we note that the bounds given in Lemmas B.2 and B.4 also apply to .
In the following, we will focus on the empirical processes and obtain more technical bounds for the approximate Hessian. To that end, we provide some basic definitions that will be useful later in the proofs. For a more detailed discussion of the machinery used throughout the next section, we refer the interested reader to [VdV00].
Definition 4.
On a metric space $(\mathcal{X}, d)$, for a given $\epsilon > 0$, a subset $N_\epsilon \subset \mathcal{X}$ is called an $\epsilon$-net over $\mathcal{X}$ if for every $x \in \mathcal{X}$, there exists $y \in N_\epsilon$ such that $d(x, y) \le \epsilon$.
In the following, we will use the $L_2$ distance between two functions $f$ and $g$, namely $d(f, g) = \|f - g\|_{L_2}$. Note that the same distance definition carries over to random variables, as they are simply real-valued measurable functions.
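A common way to produce an $\epsilon$-net over a finite point set is a greedy pass: keep a point as a new center only if it is farther than $\epsilon$ from every center kept so far. A small sketch (the point cloud in $[-1,1]^2$ and the value of $\epsilon$ are illustrative assumptions):

```python
import numpy as np

def greedy_eps_net(points, eps):
    """Greedy eps-net: keep a point only if it is > eps from all kept centers."""
    centers = []
    for x in points:
        if all(np.linalg.norm(x - c) > eps for c in centers):
            centers.append(x)
    return centers

rng = np.random.default_rng(2)
pts = rng.uniform(-1.0, 1.0, size=(500, 2))   # a finite metric space in [-1, 1]^2
eps = 0.3
net = greedy_eps_net(pts, eps)

# Net property: every point lies within eps of some center.
covered = all(min(np.linalg.norm(x - c) for c in net) <= eps for x in pts)
print(len(net), covered)  # covered is True by construction
```

Every point is either itself a center (distance 0) or was rejected because some earlier center is within $\epsilon$, so the covering property holds automatically.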
Definition 5.
Given a function class $\mathcal{F}$, and any two functions $l$ and $u$ (not necessarily in $\mathcal{F}$), the bracket $[l, u]$ is the set of all functions $f$ such that $l \le f \le u$. A bracket satisfying $l \le u$ and $d(l, u) \le \epsilon$ is called an $\epsilon$-bracket in $\mathcal{F}$. The bracketing number $N_{[\,]}(\epsilon, \mathcal{F}, d)$ is the minimum number of $\epsilon$-brackets needed to cover $\mathcal{F}$.
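As a standard textbook illustration (not a claim from this paper; the sup-norm distance and the Lipschitz parameterization are assumptions made for the example), a smoothly parameterized class has small bracketing numbers. Let $\mathcal{F} = \{f_\theta : \theta \in [0,1]\}$ with $\sup_x |f_\theta(x) - f_{\theta'}(x)| \le L\,|\theta - \theta'|$. Place grid points $\theta_j = j\varepsilon/L$ for $j = 0, 1, \dots, \lceil L/\varepsilon \rceil$ and form the brackets

```latex
\[
[l_j, u_j] \;=\; \Big[\, f_{\theta_j} - \tfrac{\varepsilon}{2},\; f_{\theta_j} + \tfrac{\varepsilon}{2} \,\Big],
\qquad \|u_j - l_j\|_\infty = \varepsilon .
\]
```

Any $f_\theta$ lies in the bracket of the nearest grid point, since $|\theta - \theta_j| \le \varepsilon/(2L)$ implies $\|f_\theta - f_{\theta_j}\|_\infty \le \varepsilon/2$. Hence $N_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|_\infty) \le \lceil L/\varepsilon \rceil + 1$.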
The preliminary tools presented in this section will be utilized to obtain the concentration results in Section C.
Appendix C Main lemmas
C.1 Concentration of covariates with bounded support
Lemma C.1.
Let , for , be i.i.d. random vectors supported on a ball of radius , with mean and covariance matrix . Also, let be a uniformly bounded function such that for some , we have , and assume is Lipschitz continuous with constant . Then, there exist constants such that
where the constants depend only on the bound and the radii and .
Proof of Lemma C.1.
We start by using the Lipschitz property of the function , i.e., for all ,
Now let be an -net over . Then there exists such that the right-hand side of the above inequality is smaller than . Hence, we can write
(C.1) 
By choosing
and taking the supremum over the corresponding sets on both sides, we obtain the following inequality:
Now, since we have and, for fixed and , the random variables are i.i.d., by Hoeffding’s concentration inequality we have
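Hoeffding's inequality for i.i.d. variables bounded in $[a, b]$ bounds the tail of the sample mean by $2\exp(-2nt^2/(b-a)^2)$. A quick simulation sketch (Uniform$[0,1]$ variables and the particular $n$, $t$ are illustrative assumptions, not the quantities in the proof):

```python
import numpy as np

# Hoeffding's inequality for i.i.d. variables bounded in [0, 1]:
#   P(|mean - E[mean]| > t) <= 2 * exp(-2 * n * t^2).
rng = np.random.default_rng(3)
n, trials, t = 200, 2000, 0.1
X = rng.uniform(0.0, 1.0, size=(trials, n))   # bounded i.i.d. samples
deviations = np.abs(X.mean(axis=1) - 0.5)     # deviation of each sample mean

empirical = float(np.mean(deviations > t))    # observed tail frequency
bound = 2 * np.exp(-2 * n * t ** 2)           # two-sided Hoeffding bound
print(empirical, bound)  # the empirical frequency stays below the bound
```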