Approximate Newton-based statistical inference using only stochastic gradients

05/23/2018 ∙ by Tianyang Li, et al. ∙ The University of Texas at Austin Rice University 0

We present a novel inference framework for convex empirical risk minimization, using approximate stochastic Newton steps. The proposed algorithm is based on the notion of finite differences and allows the approximation of a Hessian-vector product from first-order information. In theory, our method efficiently computes the statistical error covariance in M-estimation, both for unregularized convex learning problems and high-dimensional LASSO regression, without using exact second order information, or resampling the entire data set. In practice, we demonstrate the effectiveness of our framework on large-scale machine learning problems, that go even beyond convexity: as a highlight, our work can be used to detect certain adversarial attacks on neural networks.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Statistical inference is an important tool for assessing uncertainties, both for estimation and prediction purposes [FHT01, EH16]. E.g.

, in unregularized linear regression and high-dimensional LASSO settings

[vdGBRD14, JM15, TWH15]

, we are interested in computing coordinate-wise confidence intervals and p-values of a

-dimensional variable, in order to infer which coordinates are active or not [Was13]. Traditionally, the inverse Fisher information matrix [Edg08] contains the answer to such inference questions; however it requires storing and computing a matrix structure, often prohibitive for large-scale applications [TRVB06]. Alternatively, the Bootstrap method is a popular statistical inference algorithm, where we solve an optimization problem per dataset replicate, but can be expensive for large data sets [KTSJ14].

While optimization is mostly used for point estimates, recently it is also used as a means for statistical inference in large scale machine learning [LLKC18, CLTZ16, SZ18, FXY17]. This manuscript follows this path: we propose an inference framework that uses stochastic gradients to approximate second-order, Newton steps. This is enabled by the fact that we only need to compute Hessian-vector products; in math, this can be approximated using , where is the objective function, and , denote the gradient and Hessian of . Our method can be interpreted as a generalization of the SVRG approach in optimization [JZ13] (Appendix E); further, it is related to other stochastic Newton methods (e.g. [ABH17]) when . We defer the reader to Section 6 for more details. In this work, we apply our algorithm to unregularized -estimation, and we use a similar approach, with proximal approximate Newton steps, in high-dimensional linear regression.

Our contributions can be summarized as follows; a more detailed discussion is deferred to Section 6:

  • [leftmargin=0.5cm]

  • For the case of unregularized -estimation, our method efficiently computes the statistical error covariance, useful for confidence intervals and p-values. Compared to state of the art, our scheme guarantees consistency of computing the statistical error covariance, exploits better the available information (without wasting computational resources to compute quantities that are thereafter discarded), and converges to the optimum (without swaying around it).

  • For high-dimensional linear regression, we propose a different estimator (see (13

    )) than the current literature. It is the result of a different optimization problem that is strongly convex with high probability. This permits the use of linearly convergent proximal algorithms

    [XZ14, LSS14] towards the optimum; in contrast, state of the art only guarantees convergence to a neighborhood of the LASSO solution within statistical error. Our model also does not assume that absolute values of the true parameter’s non-zero entries are lower bounded.

  • For statistical inference in non-i.i.d. time series analysis, we sample contiguous blocks of indices (instead of uniformly sampling) to compute stochastic gradients. This is similar to the Newey-West estimator [NW86] for HAC (heteroskedasticity and autocorrelation consistent) covariance estimation, but does not waste computational resources to compute the entire matrix.

  • The effectiveness of our framework goes even beyond convexity. As a highlight, we show that our work can be used to detect certain adversarial attacks on neural networks.

2 Unregularized -estimation

In unregularized, low-dimensional -estimation problems, we estimate a parameter of interest:

using empirical risk minimization (ERM) on i.i.d. data points :

Statistical inference, such as computing one-dimensional confidence intervals, gives us information beyond the point estimate , when has an asymptotic limit distribution [Was13]. E.g., under regularity conditions, the -estimator satisfies asymptotic normality [vdV98, Theorem 5.21]. I.e.,

weakly converges to a normal distribution:

where and . We can perform statistical inference when we have a good estimate of . In this work, we use the plug-in covariance estimator for , where:

Observe that, in the naive case of directly computing and , we require both high computational- and space-complexity. Here, instead, we utilize approximate stochastic Newton motions from first order information to compute the quantity .

2.1 Statistical inference with approximate Newton steps using only stochastic gradients

Based on the above, we are interested in solving the following -dimensional optimization problem:

Notice that can be written as , which can be interpreted as the covariance of stochastic –inverse-Hessian conditioned– gradients at . Thus, the covariance of stochastic Newton steps can be used for statistical inference.

1:  Parameters: ; ; Initial state:  
2:  for  do  // approximate stochastic Newton descent
3:     
4:      uniformly sample indices with replacement from
5:     
6:     for  do  // solving (1) approximately using SGD
7:         and
8:         uniformly sample indices without replacement from
9:        
10:     end for
11:     Use for statistical inference, where
12:     
13:  end for
Algorithm 1 Unregularized M-estimation statistical inference

Algorithm 1 approximates each stochastic Newton step using only first order information. We start from which is sufficiently close to , which can be effectively achieved using SVRG [JZ13]; a description of the SVRG algorithm can be found in Appendix E. Lines 4, 5 compute a stochastic gradient whose covariance is used as part of statistical inference. Lines 6 to 12 use SGD to solve the Newton step,

(1)

which can be seen as a generalization of SVRG; this relationship is described in more detail in Appendix E. In particular, these lines correspond to solving (1) using SGD by uniformly sampling a random , and approximating:

(2)

Finally, the outer loop (lines 2 to 13

) can be viewed as solving inverse Hessian conditioned stochastic gradient descent, similar to stochastic natural gradient descent

[Ama98].

In terms of parameters, similar to [PJ92, Rup88], we use a decaying step size in Line 8 to control the error of approximating . We set to control the error of approximating Hessian vector product using a finite difference of gradients, so that it is smaller than the error of approximating using stochastic approximation. For similar reasons, we use a decaying step size in the outer loop to control the optimization error.

The following theorem characterizes the behavior of Algorithm 1.

Theorem 1.

For a twice continuously differentiable and convex function where each is also convex and twice continuously differentiable, assume satisfies

  • [leftmargin=0.5cm]

  • strong convexity: , ;

  • , each , which implies that has Lipschitz gradient: , ;

  • each is Lipschitz continuous: , .

In Algorithm 1, we assume that batch sizes —in the outer loop—and —in the inner loops—are . The outer loop step size is

(3)

In each outer loop, the inner loop step size is

(4)

The scaling constant for Hessian vector product approximation is

(5)

Then, for the outer iterate we have

(6)
and
(7)

In each outer loop, after steps of the inner loop, we have:

(8)

and at each step of the inner loop, we have:

(9)

After steps of the outer loop, we have a non-asymptotic bound on the “covariance”:

(10)

where and .

Some comments on the results in Theorem 1. The main outcome is that (10) provides a non-asymptotic bound and consistency guarantee for computing the estimator covariance using Algorithm 1. This is based on the bound for approximating the inverse-Hessian conditioned stochastic gradient in (8), and the optimization bound in (6). As a side note, the rates in Theorem 1 are very similar to classic results in stochastic approximation [PJ92, Rup88]

; however the nested structure of outer and inner loops is different from standard stochastic approximation algorithms. Heuristically, calibration methods for parameter tuning in subsampling methods (

[ET94], Ch.18; [PRW12], Ch. 9) can be used for hyper-parameter tuning in our algorithm.

In Algorithm 1, does not have asymptotic normality. I.e., does not weakly converge to ; we give an example using mean estimation in Section D.1. For a similar algorithm based on SVRG (Algorithm 6 in Appendix D), we show that we have asymptotic normality and improved bounds for the “covariance”; however, this requires a full gradient evaluation in each outer loop. In Appendix C, we present corollaries for the case where the iterations in the inner loop increase, as the counter in the outer loop increases (i.e., is an increasing series). This guarantees consistency (convergence of the covariance estimate to ), although it is less efficient than using a constant number of inner loop iterations. Our procedure also serves as a general and flexible framework for using different stochastic gradient optimization algorithms [TA17, HAV15, LH15, DLH16] in the inner and outer loop parts.

Finally, we present the following corollary that states that the average of consecutive iterates, in the outer loop, has asymptotic normality, similar to [PJ92, Rup88].

Corollary 1.

In Algorithm 1’s outer loop, the average of consecutive iterates satisfies

(11)
and
(12)

where weakly converges to , and when and ( .

Corollary 1 uses 2nd , 4thmoment bounds on individual iterates (eqs. (6), (7) in the above theorem), and the approximation of inverse Hessian conditioned stochastic gradient in (9).

3 High dimensional LASSO linear regression

In this section, we focus on the case of high-dimensional linear regression. Statistical inference in such settings, where

, is arguably a more difficult task: the bias introduced by the regularizer is of the same order with the estimator’s variance. Recent works

[ZZ14, vdGBRD14, JM15] propose statistical inference via de-biased LASSO estimators. Here, we present a new -norm regularized objective and propose an approximate stochastic proximal Newton algorithm, using only first order information.

We consider the linear model , for some sparse . For each sample, is i.i.d. noise. And each data point .

  • [leftmargin=0.5cm]

  • Assumptions on : is -sparse; , which implies that .

  • Assumptions on : is sparse, where each column (and row) has at most non-zero entries;111This is satisfied when is block diagonal or banded. Covariance estimation under this sparsity assumption has been extensively studied [BL08, BRT09, CZ12], and soft thresholding is an effective yet simple estimation method [RLZ09]. is well conditioned: all of

    ’s eigenvalues are

    ; is diagonally dominant ( for all ), and this will be used to bound the norm of [Var75]. A commonly used design covariance that satisfies all of our assumptions is .

We estimate using:

(13)

where is an estimate of by soft-thresholding each element of with [RLZ09]. Under our assumptions, is positive definite with high probability when (Lemma 4), and this guarantees that the optimization problem (13) is well defined. I.e., we replace the degenerate Hessian in regular LASSO regression with an estimate, which is positive definite with high probability under our assumptions.

We set the regularization parameter

which is similar to LASSO regression [BvdG11, NRWY12] and related estimators using thresholded covariance [YLR14, JD11].

Point estimate.

Theorem 2 provides guarantees for our proposed point estimate (13).

Theorem 2.

When , the solution in (13) satisfies

(14)
(15)

with probability at least .

Confidence intervals.

We next present a de-biased estimator (16), based on our proposed estimator. can be used to compute confidence intervals and p-values for each coordinate of , which can be used for false discovery rate control [JJ18]. The estimator satisfies:

(16)

Our de-biased estimator is similar to [ZZ14, vdGBRD14, JM14, JM15]. however, we have different terms, since we need to de-bias covariance estimation. Our estimator assumes , since then is positive definite with high probability (Lemma 4). The assumption that is diagonally dominant guarantees that the norm is bounded by with high probability when .

Theorem 3 shows that we can compute valid confidence intervals for each coordinate when This is satisfied when . And the covariance is similar to the sandwich estimator [Hub67, Whi80].

Theorem 3.

Under our assumptions, when , we have:

(17)

where the conditional distribution satisfies , and with probability at least .

Our estimate in (13) has similar error rates to the estimator in [YLR14]; however, no confidence interval guarantees are provided, and the estimator is based on inverting a large covariance matrix. Further, although it does not match minimax rates achieved by regular LASSO regression [RWY11], and the sample complexity in Theorem 3 is slightly higher than other methods [vdGBRD14, JM14, JM15], our criterion is strongly convex with high probability: this allows us to use linearly convergent proximal algorithms [XZ14, LSS14], whereas provable linearly convergent optimization bounds for LASSO only guarantees convergence to a neighborhood of the LASSO solution within statistical error [ANW10]. This is crucial for computing the de-biased estimator, as we need the optimization error to be much less than the statistical error.

In Appendix A, we present our algorithm for statistical inference in high dimensional linear regression using stochastic gradients. It estimates the statistical error covariance using the plug-in estimator:

which is related to the empirical sandwich estimator [Hub67, Whi80]. Algorithm 2 computes the statistical error covariance. Similar to Algorithm 1, Algorithm 2 has an outer loop part and an inner loop part, where the outer loops correspond to approximate proximal Newton steps, and the inner loops solve each proximal Newton step using proximal SVRG [XZ14]. To control the variance, we use SVRG and proximal SVRG to solve the Newton steps. This is because in the high dimensional setting, the variance is too large when we use SGD [MB11] and proximal SGD [AFM17] for solving Newton steps. However, since we have , instead of sampling by sample, we sample by feature. When we set , we can estimate the statistical error covariance with element-wise error less than with high probability, using numerical operations. And Algorithm 3 calculates the de-biased estimator (16) via SVRG. For more details, we defer the reader to Appendix A.

4 Time series analysis

In this section, we present a sampling scheme for statistical inference in time series analysis using -estimation, where we sample contiguous blocks of indices, instead of uniformly.

We consider a linear model , where , but

may not be i.i.d. as this is a time series. And we use ordinary least squares (OLS)

to estimate . Applications include multifactor financial models for explaining returns [BBMS13, RM73]. For non-i.i.d. time series data, OLS may not be the optimal estimator, as opposed to the maximum likelihood estimator [SS11], but OLS is simple yet often robust, compared to more sophisticated models that take into account time series dynamics. And it is widely used in econometrics for time series analysis [Ber91]. To perform statistical inference, we use the asymptotic normality

(18)

where and , with . The difference compared with the i.i.d. case (Section 2) is that now includes autocovariance terms. We use the plug-in estimate as before, and we estimate using the Newey-West covariance estimator [NW86] for HAC (heteroskedasticity and autocorrelation consistent) covariance estimation

(19)

where is sample autocovariance weight, such as Bartlett weight , and is the lag

parameter, which captures data dependence across time. Note that this is an essential building block in time series statistical inference procedures, such as Driscoll-Kraay standard errors

[DK98, KD99], moving block bootstrap [Kun89], and circular bootstrap [PR92, PR94].

In our framework, we solve OLS using our approximate Newton procedure with a slight modification to Algorithm 1. Instead of uniformly sampling indices as in line 4 of Algorithm 1, we uniformly select some , and set the outer mini-batch indexes to the random contiguous block , where we let the indexes circularly wrap around, as in line 4 of Algorithm 5, and this sampling scheme is similar to circular bootstrap. Here is the lag parameter, similar to the Newey-West estimator. And the stochastic gradient’s expectation is still the full gradient. The complete algorithm is in Algorithm 5, and its guarantees are given in Corollary 2. Our approximate Newton statistical inference procedure is equivalent to using weight in the Newey-West covariance estimator (19), with negligible terms for blocks that wrap around, and this is the same as circular bootstrap. Note that the connection between sampling scheme and Newey-West estimator was also observed in [Kun89]. Following [PR92], we can set the lag parameter such that , and run at least outer loops. In practice, other methods for tuning the lag parameter can be used, such as [NW94]. For more details, we refer the reader to Appendix B.

5 Experiments

5.1 Synthetic data

The coverage probability is defined as where is the estimated confidence interval for the coordinate. The average confidence interval length is defined as where is the estimated confidence interval for the coordinate. In our experiments, coverage probability and average confidence interval length are estimated through simulation. Result given as a indicates (coverage probability, confidence interval length).

Approximate Newton Bootstrap Inverse Fisher information Averaged SGD
Lin1 (0.906, 0.289) (0.933, 0.294) (0.918, 0.274) (0.458, 0.094)
Lin2 (0.915, 0.321) (0.942, 0.332) (0.921,0.308) (0.455 0.103)
Table 1: Linear regression (low dimensional): synthetic data confidence interval (coverage, length)
Approximate Newton Jackknife Inverse Fisher information Averaged SGD
Log1 (0.902, 0.840) (0.966 1.018) (0.938, 0.892) (0.075 0.044)
Log2 (0.925, 1.006) (0.979, 1.167) (0.948, 1.025) (0.065 0.045)
Table 2: Logistic regression (low dimensional): synthetic data confidence interval (coverage, length)
Low dimensional problems.

Table 1 and Table 2 show 95% confidence interval’s coverage and length of 200 simulations for linear and logistic regression. The exact configurations for linear/logistic regression examples are provided in Section H.1.1. Compared with Bootstrap and Jackknife [ET94], Algorithm 1 uses less numerical operations, while achieving similar results. Compared with the averaged SGD method [LLKC18, CLTZ16], our algorithm performs much better, while using the same amount of computation, and is much less sensitive to the choice hyper-parameters. And we observe that calibrated approximate Newton confidence intervals [ET94, PRW12] are better than bootstrap and inverse Fisher information (Table 3).

Figure 1:

Distribution of two-sided Z-test p-values under the null hypothesis (high dimensional)

High dimensional linear regression.

Figure 1 shows p-value distribution under the null hypothesis for our method and the de-biased LASSO estimator with known covariance, using 600 i.i.d. samples generated from a model with ,

, and we can see that it is close to a uniform distribution, similar results are observed for other high dimensional statistical inference procedures such as

[CFJL18]. And visualization of confidence intervals computed by our algorithm is shown in Figure 3.

Time series analysis.

In our linear regression simulation, we generate i.i.d. random explanatory variables, and the observation noise is a 0-mean moving average (MA) process independent of the explanatory variables. Results on average 95% confidence interval coverage and length are given in Section H.1.3, and they validate our theory.

5.2 Real data

Neural network adversarial attack detection.

Here we use ideas from statistical inference to detect certain adversarial attacks on neural networks. A key observation is that neural networks are effective at representing low dimensional manifolds such as natural images [BJ16, CM16], and this causes the risk function’s Hessian to be degenerate [SEG17]. From a statistical inference perspective, we interpret this as meaning that the confidence intervals in the null space of is infinity, where is the pseudo-inverse of the Hessian (see Section 2). When we make a prediction using a fixed data point as input (i.e., conditioned on ), using the delta method [vdV98], the confidence interval of the prediction can be derived from the asymptotic normality of

To detect adversarial attacks, we use the score

to measure how much lies in null space of , where is the projection matrix onto the range of . Conceptually, for the same image, the randomly perturbed image’s score should be larger than the original image’s score, and the adversarial image’s score should be larger than the randomly perturbed image’s score.

We train a binary classification neural network with 1 hidden layer and softplus activation function, to distinguish between “Shirt” and “T-shirt/top” in the Fashion MNIST data set

[XRV17]. Figure 2 shows distributions of scores of original images, adversarial images generated using the fast gradient sign method [GSS14], and randomly perturbed images. Adversarial and random perturbations have the same norm. The adversarial perturbations and example images are shown in Section H.2.1. Although the scores’ values are small, they are still significantly larger than 64-bit floating point precision (). We observe that scores of randomly perturbed images is an order of magnitude larger than scores of original images, and scores of adversarial images is an order of magnitude larger than scores of randomly perturbed images.

Figure 2: Distribution of scores for original, randomly perturbed, and adversarially perturbed images
High dimensional linear regression.

We apply our high dimensional inference procedure to the dataset in [RTW06] to detect mutations related to HIV drug resistance, where we randomly sub-sample the dataset so that the number of features is larger than the number of samples. When we control the family-wise error rate (FWER) at using the Bonferroni correction [Bon36], our procedure is able to detect verified mutations in an expert dataset [JBVC05] (Table 4), and the details are given in Section H.2.2. Another experiment with a genomic data set concerning riboflavin (vitamin B2) production rate [BKM14] is given in the appendix.

Time series analysis.

Using monthly equities returns data from [FP14], we use our approximate Newton statistical inference procedure to show that the correlation between US equities market returns and non-US global equities market returns is statistically significant, which validates the capital asset pricing model (CAPM) [Sha64, Lin65, FF04]. The details are given in Section H.2.3.

6 Related work

Unregularized M-estimation.

This work provides a general, flexible framework for simultaneous point estimation and statistical inference, and improves upon previous methods, based on averaged stochastic gradient descent [LLKC18, CLTZ16].

Compared to [CLTZ16] (and similar works [SZ18, FXY17] using SGD with decreasing step size), our method does not need to increase the lengths of “segments” (inner loops) to reduce correlations between different “replicates”. Even in that case, if we use replicates and increasing “segment” length (number of inner loops is ) with a total of stochastic gradient steps, [CLTZ16] guarantees , whereas our method guarantees . Further, [CLTZ16] is inconsistent, whereas our scheme guarantees consistency of computing the statistical error covariance.

[LLKC18] uses fixed step size SGD for statistical inference, and discards iterates between different “segments” to reduce correlation, whereas we do not discard any iterates in our computations. Although [LLKC18] states empirically constant step SGD performs well in statistical inference, it has been empirically shown [DDB17] that averaging consecutive iterates in constant step SGD does not guarantee convergence to the optimal – the average will be “wobbling” around the optimal, whereas decreasing step size stochastic approximation methods ([PJ92, Rup88] and our work) will converge to the optimal, and averaging consecutive iterates guarantees “fast” rates.

Finally, from an optimization perspective, our method is similar to stochastic Newton methods (e.g. [ABH17]); however, our method only uses first-order information to approximate a Hessian vector product (). Algorithm 1’s outer loops are similar to stochastic natural gradient descent [Ama98]. Also, we demonstrate an intuitive view of SVRG [JZ13] as a special case of approximate stochastic Newton steps using first order information (Appendix E).

High dimensional linear regression.

[CLTZ16]’s high dimensional inference algorithm is based on [ANW12], and only guarantees that optimization error is at the same scale as the statistical error. However, proper de-biasing of the LASSO estimator requires the optimization error to be much less than the statistical error, otherwise the optimization error introduces additional bias that de-biasing cannot handle. Our optimization objective is strongly convex with high probability: this permits the use of linearly convergent proximal algorithms [XZ14, LSS14] towards the optimum, which guarantees the optimization error to be much smaller than the statistical error.

Our method of de-biasing the LASSO in Section 3 is similar to [ZZ14, vdGBRD14, JM14, JM15]. Our method uses a new regularized objective (13) for high dimensional linear regression, and we have different de-biasing terms, because we also need to de-bias the covariance estimation. In Algorithm 2, our covariance estimate is similar to the classic sandwich estimator [Hub67, Whi80]. Previous methods require space which unsuitable for large scale problems, whereas our method only requires space.

Similar to our -norm regularized objective, [YLR14, JD11] shows similar point estimate statistical guarantees for related estimators; however there are no confidence interval results. Further, although [YLR14] is an elementary estimator in closed form, it still requires computing the inverse of the thresholded covariance, which is challenging in high dimensions, and may not computationally outperform optimization approaches.

Finally, for feature selection, we do not assume that absolute values of the true parameter’s non-zero entries are lower bounded (“beta-min” condition).

[FGLS18, Loh17, LW17, BvdG11, Wai09].

Time series analysis.

Our approach of sampling contiguous blocks of indices to compute stochastic gradients for statistical inference in time series analysis is similar to resampling procedures in moving block or circular bootstrap [Car86, Kun89, Büh02, DH97, ET94, Lah13, PR92, PR94, KL12], and conformal prediction [BHV14, SV08, VGS05]. Also, our procedure is similar to Driscoll-Kraay standard errors [DK98, KD99, Hoe07], but does not waste computational resources to explicitly store entire matrices, and is suited for large scale time series analysis.

References

  • [ABH17] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187, 2017.
  • [AFM17] Yves F. Atchadé, Gersende Fort, and Eric Moulines. On perturbed proximal gradient algorithms. J. Mach. Learn. Res, 18(1):310–342, 2017.
  • [Ama98] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
  • [ANW10] Alekh Agarwal, Sahand Negahban, and Martin Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems, pages 37–45, 2010.
  • [ANW12] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions. In Advances in Neural Information Processing Systems, pages 1538–1546, 2012.
  • [BBMS13] Jennifer Bender, Remy Briand, Dimitris Melas, and Raman Subramanian. Foundations of factor investing. 2013.
  • [Ber91] Ernst Berndt. The practice of econometrics: classic and contemporary. Addison Wesley Publishing Company, 1991.
  • [BHV14] Vineeth Balasubramanian, Shen-Shyang Ho, and Vladimir Vovk. Conformal prediction for reliable machine learning: theory, adaptations and applications. Newnes, 2014.
  • [BJ16] Ronen Basri and David Jacobs. Efficient representation of low-dimensional manifolds using deep networks. arXiv preprint arXiv:1602.04723, 2016.
  • [BKM14] Peter Bühlmann, Markus Kalisch, and Lukas Meier. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1:255–278, 2014.
  • [BL08] Peter Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, pages 2577–2604, 2008.
  • [Bon36] C. Bonferroni. Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze, 8:3–62, 1936.
  • [BRT09] Peter Bickel, Ya’acov Ritov, and Alexandre Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.
  • [Bub15] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
  • [Büh02] Peter Bühlmann. Bootstraps for time series. Statistical science, pages 52–72, 2002.
  • [Büh13] Peter Bühlmann. Statistical significance in high-dimensional linear models. Bernoulli, 19(4):1212–1242, 2013.
  • [BvdG11] Peter Bühlmann and Sara van de Geer.

    Statistics for high-dimensional data: methods, theory and applications

    .
    Springer Science & Business Media, 2011.
  • [Car86] Edward Carlstein. The use of subseries values for estimating the variance of a general statistic from a stationary sequence. The Annals of Statistics, 14(3):1171–1179, 1986.
  • [CFJL18] Emmanuel Candes, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold:‘model-x’knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577, 2018.
  • [CLTZ16] Xi Chen, Jason Lee, Xin Tong, and Yichen Zhang. Statistical inference for model parameters in stochastic gradient descent. arXiv preprint arXiv:1610.08637, 2016.
  • [CM16] Charles Chui and Hrushikesh Narhar Mhaskar. Deep nets for local manifold learning. arXiv preprint arXiv:1607.07110, 2016.
  • [CZ12] Tony Cai and Harrison Zhou. Optimal rates of convergence for sparse covariance matrix estimation. The Annals of Statistics, pages 2389–2420, 2012.
  • [DDB17] Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains. arXiv preprint arXiv:1707.06386, 2017.
  • [DH97] Anthony Christopher Davison and David Victor Hinkley. Bootstrap methods and their application. Cambridge University Press, 1997.
  • [Dip08a] Jürgen Dippon. Asymptotic expansions of the Robbins-Monro process. Mathematical Methods of Statistics, 17(2):138–145, 2008.
  • [Dip08b] Jürgen Dippon. Edgeworth expansions for stochastic approximation theory. Mathematical Methods of Statistics, 17(1):44–65, 2008.
  • [DK98] John Driscoll and Aart Kraay. Consistent covariance matrix estimation with spatially dependent panel data. Review of Economics and Statistics, 80(4):549–560, 1998.
  • [DLH16] Hadi Daneshmand, Aurelien Lucchi, and Thomas Hofmann. Starting small-learning with adaptive sample sizes. In International conference on machine learning, pages 1463–1471, 2016.
  • [Edg08] Francis Ysidro Edgeworth. On the probable errors of frequency-constants. Journal of the Royal Statistical Society, 71(2):381–397, 1908.
  • [EH16] B. Efron and T. Hastie. Computer age statistical inference. Cambridge University Press, 2016.
  • [ET94] Bradley Efron and Robert J. Tibshirani. An introduction to the bootstrap. CRC press, 1994.
  • [FF04] Eugene F. Fama and Kenneth R. French. The capital asset pricing model: Theory and evidence. Journal of economic perspectives, 18(3):25–46, 2004.
  • [FGLS18] Jianqing Fan, Wenyan Gong, Chris Junchi Li, and Qiang Sun. Statistical sparse online regression: A diffusion approximation perspective. In

    International Conference on Artificial Intelligence and Statistics

    , pages 1017–1026, 2018.
  • [FHT01] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
  • [FP14] Andrea Frazzini and Lasse Heje Pedersen. Betting against beta. Journal of Financial Economics, 111(1):1–25, 2014.
  • [FXY17] Yixin Fang, Jinfeng Xu, and Lei Yang. On Scalable Inference with Stochastic Gradient Descent. arXiv preprint arXiv:1707.00192, 2017.
  • [GP17] Sébastien Gadat and Fabien Panloup. Optimal non-asymptotic bound of the Ruppert-Polyak averaging without strong convexity. arXiv preprint arXiv:1709.03342, 2017.
  • [GSS14] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • [HAV15] Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konecny, and Scott Sallinen. StopWasting My Gradients: Practical SVRG. In Advances in Neural Information Processing Systems 28, pages 2251–2259, 2015.
  • [Hoe07] Daniel Hoechle. Robust standard errors for panel regressions with cross-sectional dependence. Stata Journal, 7(3):281–312, 2007.
  • [Hub67] Peter Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the fifth Berkeley symposium on Mathematical Statistics and Probability, pages 221–233, Berkeley, CA, 1967. University of California Press.
  • [JBVC05] Victoria A Johnson, Francoise Brun-Vezinet, Bonaventura Clotet, Brian Conway, Daniel R. Kuritzkes, Deenan Pillay, Jonathan Schapiro, Amalio Telenti, and Douglas Richman. Update of the drug resistance mutations in HIV-1: 2005. Topics in HIV medicine: a publication of the International AIDS Society, USA, 13(1):51–57, 2005.
  • [JD11] Jessie Jeng and John Daye. Sparse covariance thresholding for high-dimensional variable selection. Statistica Sinica, pages 625–657, 2011.
  • [JJ18] Adel Javanmard and Hamid Javadi. False Discovery Rate Control via Debiased Lasso. arXiv preprint arXiv:1803.04464, 2018.
  • [JM14] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15(1):2869–2909, 2014.
  • [JM15] Adel Javanmard and Andrea Montanari. De-biasing the Lasso: Optimal sample size for Gaussian designs. arXiv preprint arXiv:1508.02757, 2015.
  • [JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
  • [KD99] Aart Kraay and John Driscoll. Spatial Correlations in Panel Data. The World Bank, 1999.
  • [KL12] Jens-Peter Kreiss and Soumendra Nath Lahiri. Bootstrap methods for time series. In Handbook of statistics, volume 30, pages 3–26. Elsevier, 2012.
  • [KTSJ14] Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, and Michael Jordan. A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):795–816, 2014.
  • [Kun89] Hans Kunsch. The jackknife and the bootstrap for general stationary observations. The annals of Statistics, pages 1217–1241, 1989.
  • [Lah13] Soumendra Nath Lahiri. Resampling methods for dependent data. Springer Science & Business Media, 2013.
  • [LH15] Ilya Loshchilov and Frank Hutter. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015.
  • [Lin65] John Lintner. The valuation of risk assets and the selection of risky investments in stock portfolios and capital budgets. The Review of Economics and Statistics, pages 13–37, 1965.
  • [LLKC18] Tianyang Li, Liu Liu, Anastasios Kyrillidis, and Constantine Caramanis. Statistical inference using SGD. In The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 2018.
  • [Loh17] Po-Ling Loh. Statistical consistency and asymptotic normality for high-dimensional robust -estimators. The Annals of Statistics, 45(2):866–896, 2017.
  • [LSS14] Jason Lee, Yuekai Sun, and Michael Saunders. Proximal Newton-type methods for minimizing composite functions. SIAM Journal on Optimization, 24(3):1420–1443, 2014.
  • [LW17] Po-Ling Loh and Martin Wainwright. Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics, 45(6):2455–2482, 2017.
  • [MB11] Eric Moulines and Francis R. Bach.

    Non-asymptotic analysis of stochastic approximation algorithms for machine learning.

    In Advances in Neural Information Processing Systems, pages 451–459, 2011.
  • [MMB09] Nicolai Meinshausen, Lukas Meier, and Peter Bühlmann. -values for high-dimensional regression. Journal of the American Statistical Association, 104(488):1671–1681, 2009.
  • [NRWY12] Sahand Negahban, Pradeep Ravikumar, Martin Wainwright, and Bin Yu. A Unified Framework for High-Dimensional Analysis of M-Estimators with Decomposable Regularizers. Statistical Science, pages 538–557, 2012.
  • [NW86] Whitney Newey and Kenneth West. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. National Bureau of Economic Research Cambridge, Mass., USA, 1986.
  • [NW94] Whitney Newey and Kenneth West. Automatic lag selection in covariance matrix estimation. The Review of Economic Studies, 61(4):631–653, 1994.
  • [PJ92] Boris Polyak and Anatoli Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
  • [PR92] Dimitris Politis and Joseph Romano. A circular block-resampling procedure for stationary data. Exploring the limits of bootstrap, pages 263–270, 1992.
  • [PR94] Dimitris Politis and Joseph Romano. The stationary bootstrap. Journal of the American Statistical association, 89(428):1303–1313, 1994.
  • [PRW12] D. N. Politis, J. P. Romano, and M. Wolf. Subsampling. Springer Series in Statistics. Springer New York, 2012.
  • [RLZ09] Adam Rothman, Elizaveta Levina, and Ji Zhu. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association, 104(485):177–186, 2009.
  • [RM73] Barr Rosenberg and Walt McKibben. The prediction of systematic and specific risk in common stocks. Journal of Financial and Quantitative Analysis, 8(2):317–333, 1973.
  • [RTW06] Soo-Yon Rhee, Jonathan Taylor, Gauhar Wadhera, Asa Ben-Hur, Douglas L. Brutlag, and Robert W. Shafer. Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences, 103(46):17355–17360, 2006.
  • [Rup88] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.
  • [RWY11] Garvesh Raskutti, Martin Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over -balls. IEEE transactions on information theory, 57(10):6976–6994, 2011.
  • [SEG17] Levent Sagun, Utku Evci, V. Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical Analysis of the Hessian of Over-Parametrized Neural Networks. arXiv preprint arXiv:1706.04454, 2017.
  • [Sha64] William F. Sharpe. Capital asset prices: A theory of market equilibrium under conditions of risk. The journal of finance, 19(3):425–442, 1964.
  • [SS11] Robert Shumway and David Stoffer. Time series regression and exploratory data analysis. In Time series analysis and its applications, pages 47–82. Springer, 2011.
  • [SV08] Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(Mar):371–421, 2008.
  • [SZ18] Weijie Su and Yuancheng Zhu. Statistical Inference for Online Learning and Stochastic Approximation via Hierarchical Incremental Gradient Descent. arXiv preprint arXiv:1802.04876, 2018.
  • [TA17] Panos Toulis and Edoardo M. Airoldi. Asymptotic and finite-sample properties of estimators based on stochastic gradients. The Annals of Statistics, 45(4):1694–1727, 2017.
  • [Tro15] Joel Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.
  • [TRVB06] F. Tuerlinckx, F. Rijmen, G. Verbeke, and P. Boeck.

    Statistical inference in generalized linear mixed models: A review.

    British Journal of Mathematical and Statistical Psychology, 59(2):225–255, 2006.
  • [TWH15] R. Tibshirani, M. Wainwright, and T. Hastie. Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC, 2015.
  • [Var75] James Varah.

    A lower bound for the smallest singular value of a matrix.

    Linear Algebra and its Applications, 11(1):3–5, 1975.
  • [vdGBRD14] Sara van de Geer, Peter Bühlmann, Ya’acov Ritov, and Ruben Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.
  • [vdV98] Aad W. van der Vaart. Asymptotic statistics. Cambridge University Press, 1998.
  • [VGS05] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic learning in a random world: conformal prediction. Springer, 2005.
  • [Wai09] Martin Wainwright. Sharp thresholds for High-Dimensional and noisy sparsity recovery using -Constrained Quadratic Programming (Lasso). IEEE transactions on information theory, 55(5):2183–2202, 2009.
  • [Wai17] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. To appear, 2017.
  • [Was13] Larry Wasserman. All of statistics: a concise course in statistical inference. Springer Science & Business Media, 2013.
  • [Whi80] Halbert White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica: Journal of the Econometric Society, pages 817–838, 1980.
  • [XRV17] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • [XZ14] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
  • [YLR14] Eunho Yang, Aurelie Lozano, and Pradeep Ravikumar. Elementary estimators for high-dimensional linear regression. In International Conference on Machine Learning, pages 388–396, 2014.
  • [ZZ14] Cun-Hui Zhang and Stephanie Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242, 2014.

Appendix A High dimensional linear regression statistical inference using stochastic gradients (Section 3)

a.1 Statistical inference using approximate proximal Newton steps with stochastic gradients

Here, we present a statistical inference procedure for high dimensional linear regression via approximate proximal Newton steps using stochastic gradients. It uses the plug-in estimator:

which is related to the empirical sandwich estimator [Hub67, Whi80]. Lemma 1 shows this is a good estimate of the covariance when .

Algorithm 2 performs statistical inference in high dimensional linear regression (13), by computing the statistical error covariance in Theorem 3, based on the plug-in estimate in Lemma 1. We denote the soft thresholding of by as an element-wise procedure . For a vector , we write ’s th coordinate as . The optimization objective (13) is denoted as:

where Further,

where is the basis vector where the th coordinate is and others are , and is computed in a column-wise manner.

For point estimate optimization, the proximal Newton step [LSS14] at solves the optimization problem

to determine a descent direction. For statistical inference, we solve a Newton step:

to compute , whose covariance is the statistical error covariance.

To control variance, we solve Newton steps using SVRG and proximal SVRG [XZ14], because in the high dimensional setting, the variance using SGD [MB11] and proximal SGD [AFM17] for solving Newton steps is too large. However because , instead of sampling by sample, we sample by feature. We start from sufficiently close to (see Theorem 4 for details), which can be effectively achieved using proximal SVRG (Section A.3). Line 7 corresponds to SVRG’s outer loop part that computes the full gradient, and line 12 corresponds to SVRG’s inner loop update. Line 8 corresponds to proximal SVRG’s outer loop part that computes the full gradient, and line 13 corresponds to proximal SVRG’s inner loop update.

The covariance estimate bound, asymptotic normality result, and choice of hyper-parameters are described in Section A.4. When , we can estimate the covariance with element-wise error less than with high probability, using numerical operations. Calculation of the de-biased estimator (16) via SVRG is described in Section A.2.

1:  Parameters: ; ;  Initial state:  
2:  for  to  do
3:      uniformly sample indices with replacement from
4:     
5:     
6:     for  to  do  // solving Newton steps using SVRG
7:        
8: