# Statistical Inference for Algorithmic Leveraging

The age of big data has produced data sets that are computationally expensive to analyze. To deal with such large-scale data sets, the method of algorithmic leveraging proposes that we sample according to some special distribution, rescale the data, and then perform analysis on the smaller sample. Ma, Mahoney, and Yu (2015) provides a framework to determine the statistical properties of algorithmic leveraging in the context of estimating the regression coefficients in a linear model with a fixed number of predictors. In this paper, we discuss how to perform statistical inference on regression coefficients estimated using algorithmic leveraging. In particular, we show how to construct confidence intervals for each estimated coefficient and present an efficient algorithm for doing so when the error variance is known. Through simulations, we confirm that our procedure controls the type I errors of significance tests for the regression coefficients and show that it has good power for those tests.

## Authors

• 3 publications
06/23/2013

### A Statistical Perspective on Algorithmic Leveraging

One popular method for dealing with large-scale data sets is sampling. F...
10/19/2021

### Simulating the Power of Statistical Tests: A Collection of R Examples

This paper illustrates how to calculate the power of a statistical test ...
06/06/2021

### Statistical Inference for Cox Proportional Hazards Models with a Diverging Number of Covariates

For statistical inference on regression models with a diverging number o...
02/18/2020

### Post-selection inference on high-dimensional varying-coefficient quantile regression model

Quantile regression has been successfully used to study heterogeneous an...
02/03/2021

### Statistical Inference for Ordinal Predictors in Generalized Linear and Additive Models with Application to Bronchopulmonary Dysplasia

Discrete but ordered covariates are quite common in applied statistics, ...
09/04/2019

### Group Inference in High Dimensions with Applications to Hierarchical Testing

Group inference has been a long-standing question in statistics and the ...
04/11/2014

### Estimating nonlinear regression errors without doing regression

A method for estimating nonlinear regression errors and their distributi...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

A popular method to deal with modern massive data sets with large volume is to sample a representative data set that is much smaller than the original data set and then carry out analysis on the representative data set. The sampling of the representative data set is usually dependent on the analysis method employed. As long as the sampling process is not too computationally expensive, if the size of the sample is much smaller than the size of the original data set, we could have considerable savings on the total computation time.

Algorithmic leveraging refers to the special case where the probability an observation is sampled is positively correlated with the influence of each observation on the results of data analysis. The definition of influence varies based on the data analysis problem considered.

We are concerned with algorithmic leveraging for the problem of least squares approximation (linear regression) [5], but algorithmic leveraging has been applied to many other data analysis problems with large data sets. They include low-rank matrix approximation [10] and least absolute deviations regression [3].

Previous literature has primarily been concerned with estimation, providing high probability bounds on errors. However, in many practical regression settings, we also would like to perform uncertainty quantification. Therefore, in this paper we explore how to efficiently construct confidence intervals and tests of significance for the estimated coefficients. We utilize the framework of Ma et al. [9], which computes the expectation and variance of algorithmic leveraging estimates for linear regression.

Assume that the data come from the following model.

###### Model 1.

For

 yi=xTiβ+ϵi,ϵiiid∼N(0,σ2)

where are the predictors and are the responses.

The statistically efficient estimator of the regression coefficient vector

is the ordinary least squares (OLS) estimator

, which requires time to compute (see Section 2.1). When both and are large, that would be expensive. This is why a data reduction method like algorithmic leveraging is desirable.

In section 2, we introduce ordinary least squares and algorithmic leveraging for linear regression, and present some useful lemmas. Section 3 shows how to construct confidence intervals and tests of significance using those estimated coefficients in time. The confidence intervals are exact when is known and asymptotically exact when is not known. In section 4, using simulated data, we confirm that our proposed confidence intervals have at least the nominal coverage probability and that our tests of significance control the type 1 error rate and have low type 2 error rates. In contrast, the bootstrap, a popular approach to obtaining confidence intervals for complex estimators, is more computationally expensive and may lead to confidence intervals with too small coverage probabilities. Section 5 concludes and discusses possible future work.

## 2 Background and Preliminary Work

In this section, we provide some background on least squares estimation of linear regression models and describe algorithmic leveraging as applied to that problem. We end with a characterization of the distribution of the resulting regression coefficient estimates.

Let indicate the th entry of a vector , and indicate entry of a matrix .

### 2.1 Ordinary Least Squares and Statistical Leverage Scores

Suppose that we have data from Model (1). Let be the matrix with rows and be the -dimensional vector of ’s. The OLS problem is

 argminβ∈Rp∥Y−Xβ∥22

and its solution is

 ^βOLS =(XTX)−1XTY (1)

with

 Var(^βOLS) =σ2(XTX)−1.

By the Gauss-Markov Theorem [12],

is the minimum variance unbiased estimator of

. By using the singular value decomposition (SVD) of

, and, when is known, can be computed in time [7]. However, when both and are large, doing so is expensive.

Let the OLS fitted values be

 ^Y =X(XTX)−1XTY=HY (2)

where is called the hat matrix.

The statistical leverage score of the th observation is defined to be the th diagonal entry of , . The statistical leverage scores may be directly computed from the SVD of , and thus requires computation time. is considered to be the influence of the th observation on the regression results [2]. For linear regression, algorithmic leveraging samples observations so that those with higher statistical leverage scores are more likely to be sampled and solves a least squares problem on the sample.

### 2.2 Algorithmic Leveraging in Linear Regression

This section summarizes the traditional algorithmic leveraging procedure for linear regression. Given data from Model (1) and a distribution over the observations, we execute Algorithm 1.

Note that in theory could be any natural number, but in practice we would choose .

Step of Algorithm 1 has computational complexity , and step . Step , as in section 2.1, requires computation time. Therefore, as long as the computation time to obtain the distribution is , the total computation time of Algorithm 1 is .

The ideal distribution ; by doing so, we are more likely to sample observations that are influential. Because computing the statistical leverage scores requires time, in practice, approximations such as those of Clarkson et al. [3] and Drineas et al. [4] are used; they are computable in time and time, respectively.

However, for our theoretical results we do not assume that was computed using either of the above algorithms. We only assume that is not dependent on the responses and that does not have any entries equal to zero.

Let be the matrix where entry is if the th sample is the th observation in the original data set and otherwise. The sampled data can be written as . Let be the diagonal matrix with th diagonal element if the th sample is the th observation of the original data set.

Then, step of Algorithm 1 can be written as

 argminβ∈Rp∥DSTXY−DSTXXβ∥22 (3)

with solution

 ^βW=(XTWX)−1XTWYW=SXD2STX. (4)

For convenience, we call the result of Algorithm 1, , the algorithmic leveraging estimate.

### 2.3 Asymptotic Normality of the Algorithmic Leveraging Estimates

In this section, we study the distribution of . We first note that , as defined in , is a diagonal matrix. Therefore, we can write , where .

###### Lemma 1.

For given , as , converges almost surely to .

###### Proof.

See Section 6.1. ∎

Using the multivariate Lindeberg-Feller central limit theorem

[8] and results from Ma et al. [9], we obtain

###### Theorem 1.

Suppose that as , converges to a finite positive definite matrix and . Then as , is approximately distributed as

 N(β,σ2(XTX)−1+σ2r(XTX)−1XTdiag((1−hii)2πi)X(XTX)−1).
###### Proof.

See Section 6.2

This theorem can be used to construct confidence intervals for each element of that have the correct coverage probability as and approach infinity. However, doing so requires computing , which requires computation time, and thus we proceed in a different direction.

### 2.4 Bootstrap

Because the main bottleneck in applying Theorem 1 to construct confidence intervals is in computing the variance of , we could consider using the bootstrap as follows. For a total of times, the bootstrap samples with replacement observations from the data set and applies Algorithm 1 to the sample. Then, for

, the standard deviation

of the th coordinate of the algorithmic leveraging estimates is computed; the level bootstrap confidence interval for is constructed as

 (^βW)j±z1−α/2δj

where is the quantile of the standard normal distribution. The corresponding test of significance would be to reject the null hypothesis when .

However, since in practice we must use fairly large values of , usually at least on the order of , the computational complexity of the bootstrap procedure is at least . It is more expensive than our proposal presented in the next section, and we show that experimentally it may not lead to valid confidence intervals. Moreover, the asymptotic guarantees of the bootstrap would require both and to approach infinity, while our proposal’s guarantees only require to approach infinity.

## 3 Inference on ^βW

This section presents our proposed approach for uncertainty quantification for the algorithmic leveraging estimator , both when the error variance is known and when it is unknown.

Consider the distribution of conditional on , i.e. conditional on the sample. is normally distributed with mean and variance

 E(^βW∣W)=β (5) Var(^βW∣W)=σ2(XTWX)−1XTW2X(XTWX)−1 (6)

We write

 XTW2X =(STXX)TD2STXSXD2(STXX)

Entry of is if the th and th sample are the same observation, and otherwise. Thus, can be computed in time. Note that we already know and from computing . Since is , is diagonal of size , and is , can be computed in time. Because can be found from the computation of , can be computed in time. If we choose for some , then that computation time is .

### 3.1 σ2 is Known

By the argument in the previous paragraph, using the following theorem, we can get exact confidence intervals for each element of in time.

###### Theorem 2.

For , an exact level confidence interval for based on is

 (^βW)j±z1−α/2σ√((XTWX)−1XTW2X(XTWX)−1)jj.
###### Proof.

See Section 6.3

By the correspondence between confidence intervals and hypothesis testing, in time we can also test for the significance of each regression coefficient estimated by algorithmic leveraging using, for , the hypothesis . Specifically, a test with significance level rejects when

 |(^βW)j|≥z1−α/2σ√((XTWX)−1XTW2X(XTWX)−1)jj. (7)

### 3.2 σ2 Is Unknown

In practice, the error variance is rarely known. In this section, we discuss this more realistic situation.

We estimate analogously to OLS. Letting be the predicted values, define

 ^σ2=1N−p∥Y−^YW∥22=1N−p∥Y−HWY∥22=1N−p∥(I−HW)ϵ∥22, (8)

where and . can be computed in time given .

Classical statistical inference, which allows us to find exact confidence intervals for regression coefficients, computes the exact distribution of . However, in our case, that distribution depends on the singular values of , which in general cannot be computed in time. Instead, we utilize Lemma 1, which states that for large enough, with probability one , where is the

-dimensional identity matrix.

Intuitively, we may make the approximation that . Then, , the hat matrix from OLS, and our estimates and

approximate the regression coefficients and standard error estimate from OLS.

Therefore, inspired by the classic confidence interval for OLS regression coefficients, we propose the following approximate level confidence interval for based on :

 (^βW)j±tN−p,1−α/2^σ√((XTWX)−1XTW2X(XTWX)−1)jj (9)

where is the quantile of .

These confidence intervals have the correct coverage probability asymptotically as approaches infinity.

###### Theorem 3.

If has full rank, as , the confidence interval (9) has coverage probability .

###### Proof.

See Section 6.4

By the argument in the previous section and the fact that is computed in time, the above confidence interval can be computed in time. As before, in time we can also test for the significance of each regression coefficient estimated by algorithmic leveraging using, for , the hypothesis . Specifically, we propose a test with approximate significance level that rejects when

 |(^βW)j|≥t1−α/2,N−p^σ√((XTWX)−1XTW2X(XTWX)−1)jj. (10)

## 4 Experimental Results

In this section, we illustrate the behavior of our proposed confidence intervals (9) and significance tests (10) and compare them to the bootstrap. Our experiments were carried out in R, using the special package mvtnorm [6]. They follow the same setup as in Ma et al. [9].

We use data simulated from model (1). We consider the data sizes and . For each tuple , we generate independently from the multivariate

-distribution with three degrees of freedom and covariance matrix

, where . This will give a matrix with some large statistical leverage scores and some small ones [9]. The sampling distribution was chosen to be proportional to the approximate leverage scores computed using the algorithm in Drineas et al. [4].

For each , we randomly choose half the entries of to be zero, a quarter to be , and the rest to be . Then, for each tuple we repeat the following procedure times:

1. Generate

2. For ,

• Carry out Algorithm 1 with to obtain .

• Compute using (8) and as described in Section 3.1. Record the computation time used.

• Compute the bootstrap standard deviations as described in Section 2.4 with and record the computation time used.

• For ,

• For , compute a level confidence interval for using (9). Calculate the actual coverage probability, i.e. the fraction of ’s for which the interval includes the true value of ; the nominal coverage probability is .

• For , test at significance level using (10). Compute the proportion of type 1 and 2 errors.

• Repeat the above two steps for the bootstrap, constructing the confidence intervals as described in Section 2.4 and using the corresponding tests of significance.

Finally, for each tuple , for both our algorithm and bootstrap we compute the averages of the computation times, actual coverage probabilities, and proportions of type 1 and type 2 errors over the iterations.

For each tuple , we plot four graphs to evaluate and compare the quality of our proposed confidence intervals and bootstrap confidence intervals. The first plots the computation time of our algorithm and that of bootstrap versus . The second plots, for both our algorithm and the bootstrap, the actual coverage probability versus the nominal coverage probability of the confidence intervals at three small values of . The third and fourth plot, for our algorithm and the bootstrap, the average type 1 error rate and average type 2 error rate of the significance tests at versus .

The graphs are shown in Figures 14. In Section 6.5, for each pair of values of and , we also present plots of the receiver operating characteristic (ROC) curve for various values of

for both our algorithm and the bootstrap. We defer them to the appendix because they are usually used to evaluate classifiers and not methods for uncertainty quantification.

We see that our algorithm is indeed faster than the bootstrap, with the gap widening as increases. Indeed, the rate of increase in computation time is greater for the bootstrap.

For , the actual coverage probability of the confidence intervals for both our algorithm and bootstrap is approximately equal to the nominal coverage probability. The confidence intervals constructed by our algorithm are conservative, but the actual coverage probability approaches the nominal coverage probability as increases. The latter is expected, since for larger the approximation made in Section 3.2 should be more accurate. The conservativeness of the confidence intervals constructed by our algorithm seems to increase with . The bootstrap confidence intervals are generally less conservative than those from our algorithm, but may have less than the nominal coverage probability as shown in Figure 3(a).

The type 1 error of our algorithm is generally below that of bootstrap, especially for smaller . While type 1 error of our algorithm is generally smaller than the nominal , the type 1 error of the bootstrap may be much more, in particular for . This is consistent with the fact that the bootstrap coverage probability may be less than the nominal value. It also appears that for fixed and , as increases the type 1 errors of our algorithm and of bootstrap decreases.

Analogously, the type 2 error of our algorithm is usually above that of bootstrap. However, for both approaches, the type 2 error is decreasing with , below for , and close to at . Therefore, as long as we don’t use too small samples, we can obtain an estimate of in less time without sacrificing much power in our significance tests. As increases the type 2 errors of both approaches increase.

## 5 Conclusion and Future Work

Learning from and mining massive data sets post great challenges given our limited storage and computational resources. Many data reduction approaches have been devised to overcome such challenges. Algorithmic leveraging is such an approach. In this paper, for linear regression coefficients estimated using algorithmic leveraging, we described how to efficiently construct finite sample confidence intervals and significance tests. Simulations show that our proposed confidence intervals have the desired coverage probability and our proposed significance tests control the type 1 error rate and have low type 2 error rates. The simulations also show that bootstrap confidence intervals may have smaller than the desired coverage probability.

There are several avenues for future work investigating the statistical properties of algorithmic leveraging applied to data analyses beyond simple linear regression. For instance, we believe that determining how sampling affects feature selection in Lasso regression

[11] could have important practical implications. Finally, we may consider uncertainty quantification for estimates from other data reduction methods, such as sketching algorithms, which are another popular method to deal with massive data sets.

## Acknowledgements

The author was supported by US NSF under grants DGE-114747, DMS-1407397, and DMS-1521145.

## References

• [1] P. Billingsley. Probability and Measure. Wiley Series in Probability and Statistics. Wiley, 2012.
• [2] Samprit Chatterjee and Ali S. Hadi.

Influential observations, high leverage points, and outliers in linear regression: Rejoinder.

Statistical Science, 1(3):415–416, 1986.
• [3] Kenneth L. Clarkson, Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, Xiangrui Meng, and David P. Woodruff. The fast Cauchy transform and faster robust linear regression. In Proceedings of the Twenty-fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 466–477. Society for Industrial and Applied Mathematics, 2013.
• [4] Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage.

Journal of Machine Learning Research

, 13:3475–3506, 2012.
• [5] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Sampling algorithms for L2 regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1127–1136. Society for Industrial and Applied Mathematics, 2006.
• [6] Alan Genz, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, Friedrich Leisch, Fabian Scheipl, and Torsten Hothorn. mvtnorm: Multivariate Normal and t Distributions, 2014. R package version 1.0-2.
• [7] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 3rd edition, 1996.
• [8] William H. Greene. Econometric Analysis. Pearson, 6th edition, 2008.
• [9] Ping Ma, Michael W. Mahoney, and Bin Yu. A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 16:861–911, 2015.
• [10] Michael W. Mahoney and Petros Drineas. CUR matrix decompositions for improved data analysis. Proceedings of National Academy of Sciences, 106:697–702, 2009.
• [11] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.
• [12] Sanford Weisberg. Applied Linear Regression. Wiley Series in Probability and Statistics. Wiley, 2005.

## 6 Appendix

### 6.1 Proof of Lemma 1

We apply the strong law of large numbers

[1]. Denote the th row of a matrix by and the th column by . Then,

 wj =(SX)j⋅D2(STX)⋅j=r∑i=1(SX)2j,irπj=1rr∑i=1(SX)j,iπj w =1rr∑i=1((SX)1,iπ1,…,(SX)N,iπN)T

Thus, is the average of

independent and identically distributed random variables. Each has mean

, since is with probability . The result follows.

### 6.2 Proof of Theorem 1

We first need the following lemma from Ma et al. [9].

###### Lemma 2.
 E(w) =1N (11) Var(w) (12)

where is the -dimensional vector of ones and .

Since , it follows that converges in probability to as with fixed.

We rewrite . Let be the -dimensional vector of ’s.

 ^βW =(XTWX)−1XTWY =(XTWX)−1XTW(Xβ+ϵ) =β+(XTWX/N)−1XTWϵ/N

By Slutsky’s Theorem [1] and Lemma 1, as for fixed , converges in probability to . We have assumed that as , converges to a finite, positive definite matrix .

Therefore, as , converges in probability to . In order to show that is asymptotically normal, we just have to show that is asymptotically normal.

With fixed, as , converges in probability to .

 XTϵN =1NN∑i=1xiϵi

where has mean zero and variance . We have assumed that as , converges to . Assuming that for each ,

 limN→∞(N∑i=1xixTi)−1xixTi=0,

by the multivariate Lindeberg-Feller central limit theorem [8], is asymptotically normal.

Hence, as , is normally distributed. Then, is asymptotically Gaussian. Its mean and variance are computed from approximate formulas given in Ma et al. [9], using the fact that as and is fixed converges in probability to .

### 6.3 Proof of Theorem 2

From (5) and (6), we have that

 1−α =P(|(^βW)j−βj|≤z1−α/2σ√((XTWX)−1XTW2X(XTWX)−1)jj∣W)

Taking expectation of the above equation with respect to , we have

 1−α =P(|(^βW)j−βj|≤z1−α/2σ√((XTWX)−1XTW2X(XTWX)−1)jj)

from which the result follows.

### 6.4 Proof of Theorem 3

From Lemma 1, as , almost surely. Then, with probability , . Therefore, as ,

 (^βW−β,(I−HW)ϵ)D→(^βOLS−β,(I−H)ϵ)

Considering only the th predictor and scaling appropriately, we have

 ((^βW)j−βj√((XTWX)−1XTW2X(XTWX)−1)jj,(I−HW)ϵ√N−p)D→((^βOLS)j−βj√((XTX)−1)jj,(I−H)ϵ√N−p)

Applying the continuous mapping theorem, we have

 (^βW)j−βj√∥(I−HW)ϵ∥2/(N−p)√((XTWX)−1XTW2X(XTWX)−1)jj D→(^βOLS)j−βj√∥(I−H)ϵ∥2/(N−p)√((XTX)−1)jj ∼tN−p

where the last line follows from [12] assuming that is full rank.

Therefore, as , has coverage probability .

### 6.5 ROC Curves

Here are the ROC curves for the simulations in the main paper.

As expected, as or increase, the tests of significance improve in terms of the area under the ROC curve. However, there does not seem to be an appreciable difference between the areas under the ROC curve for our algorithm and the bootstrap. Our algorithm seems to have more power when the false positive rate is small and the bootstrap seems to have more power when the false positive rate is large. Since in practice we would like to limit the false positive rate in order to avoid any costly actions, we believe that our algorithm is superior.