# An Econometric View of Algorithmic Subsampling

Datasets that are terabytes in size are increasingly common, but computer bottlenecks often frustrate a complete analysis of the data. While more data are better than less, diminishing returns suggest that we may not need terabytes of data to estimate a parameter or test a hypothesis. But which rows of data should we analyze, and might an arbitrary subset of rows preserve the features of the original data? This paper reviews a line of work that is grounded in theoretical computer science and numerical linear algebra, and which finds that an algorithmically desirable sketch of the data must have a subspace embedding property. Building on this work, we study how prediction and inference is affected by data sketching within a linear regression setup. The sketching error is small compared to the sample size effect which is within the control of the researcher. As a sketch size that is algorithmically optimal may not be suitable for prediction and inference, we use statistical arguments to provide `inference conscious' guides to the sketch size. When appropriately implemented, an estimator that pools over different sketches can be nearly as efficient as the infeasible one using the full sample.

## Authors

• 1 publication
• 9 publications
• ### An Econometric Perspective of Algorithmic Sampling

Datasets that are terabytes in size are increasingly common, but compute...
07/03/2019 ∙ by Sokbae Lee, et al. ∙ 0

• ### Sketching for Two-Stage Least Squares Estimation

When there is so much data that they become a computation burden, it is ...
07/15/2020 ∙ by Sokbae Lee, et al. ∙ 0

• ### Sketched MinDist

We consider sketch vectors of geometric objects J through the function ...
07/04/2019 ∙ by Jeff M. Phillips, et al. ∙ 0

• ### Sketching Meets Random Projection in the Dual: A Provable Recovery Algorithm for Big and High-dimensional Data

Sketching techniques have become popular for scaling up machine learning...
10/10/2016 ∙ by Jialei Wang, et al. ∙ 0

• ### Sample Size Calculations in Simple Linear Regression: Trials and Tribulations

The problem tackled in this paper is the determination of sample size fo...
07/24/2019 ∙ by Tianyuan Guan, et al. ∙ 0

• ### STORM: Foundations of End-to-End Empirical Risk Minimization on the Edge

Empirical risk minimization is perhaps the most influential idea in stat...
06/25/2020 ∙ by Benjamin Coleman, et al. ∙ 6

• ### SALSA: Self-Adjusting Lean Streaming Analytics

Counters are the fundamental building block of many data sketching schem...
02/24/2021 ∙ by Ran Ben Basat, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The availability of terabytes of data for economic analysis is increasingly common. But analyzing large datasets is time consuming and sometimes beyond the limits of our computers. The need to work around the data bottlenecks was no smaller decades ago when the data were in megabytes than it is today when data are in terabytes and petabytes. One way to alleviate the bottleneck is to work with a sketch of the data.111The term ‘synopsis’ and ‘coresets’ have also been used. See Comrode et al. (2011), and Agarwal and Varadarajan (2004). We generically refer to these as sketches. These are data sets of smaller dimensions and yet representative of the original data. We study how the design of linear sketches affects estimation and inference in the context of the linear regression model. Our formal statistical analysis complements those in the theoretical computer science and numerical linear algebra derived using different notions of accuracy and whose focus is computation efficiency.

There are several motivations for forming sketches of the data from the full sample. If the data are too expensive to store and/or too large to fit into computer memory, the data would be of limited practical use. It might be cost effective in some cases to get a sense from a smaller sample whether an expensive test based on the full sample is worth proceeding. Debugging is certainly faster with fewer observations. A smaller dataset can be adequate while a researcher is learning how to specify the regression model, as loading a gigabyte of data is much faster than a terabyte even we have enough computer memory to do so. With confidentiality reasons, one might only want to circulate a subset rather than the full set of data.

For a sketch of the data to be useful, the sketch must preserve the characteristics of the original data. Early work in the statistics literature used a sketching method known as ‘data squashing’. The idea is to approximate the likelihood function by merging data points with similar likelihood profiles, such as by taking their mean. There are two different ways to squash the data. One approach is to construct subsamples randomly. Deaton and Ng (1998) uses ‘binning methods’ and uniform sampling to speed up estimation of non-parametric average derivatives. While these methods work well for the application under investigation, its general properties are not well understood. An alternative is to take the data structure into account. Du Mouchel et al. (1999)

also forms multivariate bins of the data, but they match low order moments within the bin by non-linear optimization.

Owen (1990) reweighs a random sample of to fit the moments using empirical likelihood estimation. Madigan et al. (1999) uses likelihood-based clustering to select data points that match the target distribution. While theoretically appealing, modeling the likelihood profiles can itself be time consuming and not easily scalable.

Data sketching is of also interest to computer scientists because they are frequently required to provide summaries (such as frequency, mean, and maximum) of data that stream by continously.222The seminal paper on frequency moments is Alon et al. (1999). For a review of the literature, see Comrode et al. (2011) and Cormode (2011). Instead of an exact answer which would be costly to compute, pass-efficient

randomized algorithms are designed to run fast, requires little storage, and guarantee the correct answer with a certain probability. But this is precisely the underlying premise of data sketching in statistical analysis.

333Pass-efficient algorithms read in data at most a constant number of times. A computational method is referred to as a streaming model if only one pass is needed.

Though randomized algorithms are increasingly used for sketching in a wide range of applications, the concept remains largely unknown to economists except for a brief exposition in Ng (2017). This paper provides a gentle introduction to these algorithms in Sections 2 to 4. To our knowledge, this is the first review on sketching in the econometrics literature. We will use the term algorithmic subsampling to refer to randomized algorithms designed for the purpose of sketching, to distinguish them from bootstrap and subsampling methods developed for frequentist inference. In repeated sampling, we only observe one sample drawn from the population. Here, the complete data can be thought of as the population which we observe, but we can only use a subsample. Algorithmic subsampling does not make distributional assumptions, and balancing between fast computation and favorable worst case approximation error often leads to algorithms that are oblivious to the properties the data. In contrast, exploiting the probabilistic structure is often an important aspect of econometric modeling.

Perhaps the most important tension between the algorithmic and the statistical view is that while fast and efficient computation tend to favor sketches with few rows, efficient estimation and inference inevitably favor using as many rows as possible. Sampling schemes that are optimal from an algorithmic perspective may not be desirable from an econometric perspective, but it is entirely possible for schemes that are not algorithmically optimal to be statistically desirable. As there are many open questions about the usefulness of these sampling schemes for statistical modeling, there is an increased interest in these methods within the statistics community. Recent surveys on sketching with a regression focus include Ahfock et al. (2017) and Geppert, Ickstadt, Munteanu, Qudedenfeld and Sohler (2017), among others. Each paper offers distinct insights, and the present paper is no exception.

Our focus is on efficiency of the estimates for prediction and inference within the context of the linear regression model. Analytical and practical considerations confine our focus eventually back to uniform sampling, and to a smaller extent, an algorithm known as the countsketch. The results in Sections 5 and 6 are new. It will be shown that data sketching has two effects on estimation, one due to sample size, and one due to approximation error, with the former dominating the later in all quantities of empirical interest. Though the sample size effect has direct implications for the power of any statistical test, it is at the discretion of a researcher. We show that moment restrictions can be used to guide the sketch size, with fewer rows being needed when more moments exist. By targeting the power of a test at a prespecified alternative, the size of the sketch can also be tuned so as not to incur excessive power loss in hypothesis testing. We refer to this as the ‘inference conscious’ sketch size.

There is an inevitable trade-off between computation cost and statistical efficiency, but the statistical loss from using fewer rows of data can be alleviated by combining estimates from different sketches. By the principle of ‘divide and conquer’, running several estimators in parallel can be statistically efficient and still computationally inexpensive. Both uniform sampling and the countsketch are amenable to parallel processing which facilitates averaging of quantities computed from different sketches. We assess two ways of combining estimators: one that averages the parameter estimates, and one that averages test statistics. Regardless of how information from the different sketches are combined, pooling over subsamples always provides more efficient estimates and more powerful tests. It is in fact possible to bring the power of a test arbitrarily close to the one using the full sample, as will be illustrated in Section 6.

### 1.1 Motivating Examples

The sketching problem can be summarized as follows. Given an original matrix , we are interested in constructed as

 ˜A=ΠA

where , . In a linear regression setting, where is the dependent variable, and are the regressors. Computation of the least squares estimator takes time which becomes costly when the number of rows, is large. Non-parametric regressions fit into this setup if is a matrix of sieve basis. Interest therefore arises to use fewer rows of without sacrificing too much information.

To motivate why the choice of the sampling scheme (ie. ) matters, consider as an example a matrix

 A=(10−.25.25001.5−.50)T.

The rows have different information content as the row norm is . Consider now three matrices constructed as follows:

 Π1 =(1000001000), ˜A1=Π1A=(1001) Π2 =(1000000001), ˜A2=Π2A=(1000) Π3 =(0011001−111), ˜A3=Π3A=(00.50).

Of the three sketches, only preserves the rank of . The sketch defined by fails because it chooses row 5 which has no information. The third sketch is obtained by taking linear combination of rows that do not have independent information. The point is that unless is chosen appropriately, may not have the same rank as .

Of course, when is large, changing rank is much less likely and one may also wonder if this pen and pencil problem can ever arise in practice. Consider now estimation of a Mincer equation which has the logarithm of wage as the dependent variable, estimated using the IPMUS (2019) dataset which provides a preliminary but complete count data for the 1940 U.S. Census. This data is of interest because it was the first census with information on wages and salary income. For illustration, we use a sample of 24 million white men between the age of 16 and 64 as the ‘full sample’. The predictors that can be considered are years of education, denoted (edu), and potential experience, denoted (exp).

Two Mincer equations with different covariates are considered:

 log wage = β0+β1edu+β2exp+β3% exp2+error (1a) log wage = β0+β1edu+11∑j=0β2+j1{j≤exp<(j+5)}+error. (1b)

Model (1a) uses exp and as control variables. Model (1b) replaces potential experience with indicators of experience in five year intervals. Even though there are three predictors including the intercept, the number of covariates is four in the first model and thirteen in the second. In both cases, the parameter of interest is the coefficient for years of education (). The full sample estimate of is 0.12145 in specification (1a) and 0.12401 in specification (1b).

Figure 1 shows the histogram of exp. The values of exp range from 0 to 58. The problem in this example arises because there are few observations with over 50 years of experience. Hence there is no guarantee that an arbitrary subsample will include observations with . Without such observations, the subsampled covariate matrix may not have full rank. Specification (1b) is more vulnerable to this problem especially when is small.

We verify that rank failure is empirically plausible in a small experiment with sketches of size extracted using two sampling schemes. The first method is uniform sampling without replacement which is commonly used in economic applications. The second is the countsketch which will be further explained below. Figure 2 and Figure 3 show the histograms of subsample estimates for uniform sampling and the countsketch, respectively. The left panel is for specification (1a) and the right panel is for specification (1b). In our experiments, singular matrices never occurred with specification (1a); the OLS estimates can be computed using both sampling algorithms and both performed pretty well. However, uniform sampling without replacement produced singular matrices for specification (1b) 77% of the time. The estimates seem quite different from the full sample estimates, suggesting not only bias in the estimates, but also that the bias might not be random. In contrast, the countsketch failed only once out of 100 replications. The estimates are shown in the right panel of Figure 3 excluding the singular case.

This phenomenon can be replicated in a Monte Carlo experiment with normally distributed predictors. Instead of

, it is assumed that we only observe a value of one if its latent value is three standard deviation from the mean. Together with an intercept, there are four regressors. As in the Mincer equation, the regressor matrix has a reduced rank of 3 with probability of 0.58, 0.25, 0.076 when

rows are sampled uniformly but is always full rank only when . In contrast, the countsketch never encounters this problem even with . The simple example underscores the point that the choice of sampling scheme matters. As will be seen below, the issue remains in a more elaborate regression with several hundred covariates. This motivates the need to better understand how to form sketches for estimation and inference.

## 2 Matrix Sketching

This section presents the key concepts in algorithmic sampling. The material is based heavily on the monographs by Mahoney (2011) and Woodruff (2014), as well as the seminal work of Drineas, Mahoney and Muthukrishnan (2006), and subsequent refinements developed in Drineas et al. (2011), Nelson and Nguyen (2013a), Nelson and Nguyen (2014), Cohen, Nelson and Woodruff (2015), Wang, Gittens and Mahoney (2018), among many others.

We begin by setting up the notation. Consider an matrix positive definite . Let denote its -th column of and be its -th row. Then

 A=⎛⎜ ⎜⎝A(1)⋮A(n)⎞⎟ ⎟⎠=(A(1)…A(n))

and

. The singular value decomposition of

is where and

are the left and right eigenvectors of dimensions (

) and respectively. The matrix is is diagonal with entries containing the singular values of denoted , which are ordered such that is the largest. Since is positive definite, its

-th eigenvalue

equals , for . The best rank approximation of is given by

 Ak=UkUTkA≡PUkA

where is an

orthonormal matrix of left singular vectors corresponding to the

largest singular values of , and is the projection matrix.

The Frobenius norm (an average type criterion) is The spectral norm (a worse-case type criterion) is where is the Euclidean norm of a vector . The spectral norm is bounded above by the Frobenius norm since

Let and be real valued functions defined on some unbounded subset of real positive numbers . We say that if for some constant . This means that is at most a constant multiple of for sufficiently large values of . We say that if for all . This means that is at least for some constant . We say that if for all . This means that is at least and at most .

### 2.1 Approximate Matrix Multiplication

Suppose we are given two matrices, and and are interested in the matrix . The textbook approach is to compute each element of by summing over dot products:

 Cij=[ATB]ij=n∑k=1ATikBkj.

Equivalently, each element is the inner product of two vectors and . Computing the entire entails three loops through , , and . An algorithmically more efficient approach is to form from outer products:

 C=ATBd×p=n∑i=1AT(i)B(i)(d×1)×(1×p),

making a sum of matrices each of rank-1. Viewing as a sum of terms suggests to approximate it by summing terms only. But which amongst the possible terms to sum? Consider the following Approximate Matrix Multiplication algorithm (AMM). Let be the probability that row will be sampled.

Algorithm AMM:

The algorithm essentially produces

 ˜C = (ΠA)TΠB=1mm∑s=11pksAT(ks)B(ks) (2)

where denotes the index for the non-zero entry in the row of the matrix

 Π=1√m⎛⎜ ⎜ ⎜ ⎜⎝01√pk10…0……………00…1√pkm0⎞⎟ ⎟ ⎟ ⎟⎠.

The matrix only has only one non-zero element per row, and the -th entry with probability . In the case of uniform sampling with for all , reduces to a sampling matrix scaled by .

While defined by (2) is recognized in econometrics as the estimator of Horvitz and Thompson (1952) which uses inverse probability weighting to account for different proportions of observations in stratified sampling, is a sketch of produced by the Monte Carlo algorithm AMM in the theoretical science literature literature.444In Mitzenmacher and Upfal (2006), a Monte Carlo algorithm is a randomized algorithm that may fail or return an incorrect answer but whose time complexity is deterministic and does not depend on the particular sampling. This contrasts with a Las Vegas algorithm which always returns the correct answer but whose time complexity is random. See also Eriksson-Bique et al. (2011). The Monte-Carlo aspect is easily understood if we take and to be vectors. Then where the last step follows from mean-value theorem for . Approximating by gives as the Monte Carlo estimate of .

Two properties of produced by AMM are noteworthy. Under independent sampling,

 E[AT(ks)B(ks)mp(ks)]=m∑k=1pkAT(k)B(k)mpk=[ATB]ij.

Hence regardless of the sampling distribution,

is unbiased. The variance of

defined in terms of the Frobenius norm is

 E[∥˜C−C∥2F] = 1mn∑k=11pk∥AT(k)∥22∥B(k)∥22−1m∥C∥2F

which depends on the sampling distribution . Drineas, Kannan and Mahoney (2006, Theorem 1) shows that minimizing with respect to subject to the constraint gives555The first order condition satifies . Solving for and imposing the constraint gives the result stated. Eriksson-Bique et al. (2011) derives probabilities that minimize expected variance for given distribution of the matrix elements.

 pk=∥A(k)∥2∥B(k)∥2∑ns=1∥A(s)∥2∥B(s)∥2.

This optimal yields a variance of

 E[∥˜C−C∥2F] ≤1m[n∑k=1∥A(k)∥2∥B(k)∥2]2≤1m∥A∥2F∥B∥2F.

It follows from Markov’s inequality that for given error of size and failure probability ,

 P(∥˜C−C∥2F>ε2∥A∥2F∥B∥2F)

implying that to have an approximation error no larger than with probability , the number of rows used in the approximation must satisfy

The approximate matrix multiplication result is the building block of many of the theoretical results to follow. The result also holds under the spectral norm since it is upper bounded by the Frobenius norm. Since is a rank one matrix, . Many of the results to follow are in spectral norm because it is simpler to work with a product of two Euclidean vector norms. Furthermore, putting , we have

 P(∥(ΠA)T(ΠA)−ATA)∥2≥ϵ∥A∥22)<δ.

One may think of the goal of AMM as preserving the second moment properties of . The challenge in practice is to understand the conditions that validate the approximation. For example, even though uniform sampling is the simplest of sampling schemes, it cannot be used blindly. Intuitively, uniform sampling treats all data points equally, and when information in the rows are not uniformly dispersed, the influential rows will likely be omitted. From the above derivations, we see that when , which can be prohibitively large. The Mincer equation in the Introduction illustrates the pitfall with uniform sampling when is too small, but that the problem can by and large be alleviated when . Hence, care must be taken in using the algorithmic sampling schemes. We will provide some guides below.

### 2.2 Subspace Embedding

To study the properties of the least squares estimates using sketched data, we first need to make clear what features of need to be preserved in . Formally, the requirement is that

has a ‘subspace embedding’ property. An embedding is a linear transformation of the data that has the Johnson-Lindenstrauss (JL) property, and a subspace embedding is a matrix generalization of an embedding. Hence it is useful to start with the celebrated JL Lemma.

The JL Lemma, due to Johnson and Lindenstauss (1994), is usually written for linear maps that reduce the number of columns from an matrix to . Given that our interest is ultimately in reducing the number of rows from to while keeping fixed, we state the JL Lemma as follows:

###### Lemma 1 (JL Lemma)

Let and be a set of points in with . Let . There exists a linear map such that

 (1−ϵ)||ai−aj||22≤||Πai−Πaj||22≤(1+ϵ)||ai−aj||22.

In words, the Lemma states that every set of points in Euclidean space of dimension can be represented by a Euclidean space of dimension with all pairwise distance preserved up to a factor. Notice that is logarithmic in and does not depend on . A sketch of the proof is given in the online Appendix.

The JL Lemma establishes that vectors in can be embedded into dimensions. But there are situations when we need to preserve the information in the columns jointly. This leads to the notion of ‘subspace embedding’ which requires that the norm of vectors in the column space of be approximately preserved by with high probability.

###### Definition 1 (Subspace-Embedding)

Let be an matrix. An subspace embedding for the column space of is an matrix such that ,

 (1−ϵ)∥Ax∥22≤∥ΠAx∥22≤(1+ϵ)∥Ax∥22. (3)

Subspace embedding is an important concept and it is useful to understand it from different perspectives. Given that , preserving the column space of means preserving the information in . The result can analogously be written as

 ∥ΠAx∥22∈[(1−ϵ)∥Ax∥22,(1+ϵ)∥Ax∥22].

Since where and is orthonormal, a change of basis gives:

 ∥ΠUz∥22 ∈ [(1−ϵ)∥Uz∥22,(1+ϵ)∥Uz∥22] = [(1−ϵ)∥z∥22,(1+ϵ)∥z∥22] ⇔ ∥(ΠU)T(ΠU)−UTU∥2≤ϵ ⇔ zT((ΠU)T(ΠU)−Id)z≤ϵ.

The following Lemma defines subspace embedding in terms of singular value distortions.

###### Lemma 2

Let

be a unitary matrix and

be a subspace embedding for the column space of . Let is the -th singular value of . Then (3) is equivalent to

 σ2k(ΠU)∈[1−ϵ,1+ϵ]∀k∈[1,d].

To understand Lemma 2, consider the Rayleigh quotient form of :666For a Hermitian matrix , the Rayleigh quotient is for a nonzero vector . By Rayleigh-Ritz Theorem, with equalities when is the eigenvector corresponding to the smallest and largest eigenvalues of , respectively. See, e.g. Hogben (2007, Section 8.2).

 ωk((ΠU)T(ΠU))=vTk(ΠU)T(ΠU)vkvTkvk

for some vector . As ,

 ωk((ΠU)T(ΠU)) = vTkvk−vTk(Id−(ΠU)T(ΠU))vkvTkvk = 1−ωk(Id−(ΠU)T(ΠU)).

This implies that . It follows that

 |1−σ2k(ΠU)| = ∣∣∣σk(UTU−(ΠU)T(ΠU))∣∣∣ ≤ σmax(UTU−(ΠU)T(ΠTU)) = ∥UTU−(ΠU)T(ΠU)∥2≤ϵ ⇔σ2k(ΠU) ∈ [1−ϵ,1+ϵ]∀k∈[1,d].

Hence the condition is equivalent to generating small singular value distortions. Nelson and Nguyen (2013a)

relates this condition to similar results in random matrix theory.

777 Consider a

matrix of random variables with mean zero and unit variance with

. In random matrix theory, the largest and smallest eigenvalues of the sample covariance matrix have been shown to converge to , respectively. See, e.g., Yin et al. (1988) and Bai and Yin (1993).

But where to find these embedding matrices? We can look for data dependent or data oblivious ones. We say that is a data oblivious embedding if it can be designed without knowledge of the input matrix. The idea of oblivious subspace-embedding first appeared in Sarlos (2006) in which it is suggested that can be drawn from a distribution with the JL properties.

###### Definition 2

A random matrix drawn from a distribution forms a JL transform with parameters if there exists a function such that for any and , holds with probability at least for all -vector .

A JL transform is often written JLT( for short. Embedding matrices that are JLT guarantee good approximation to matrix products in terms of Frobenius norm. This means that for such s with a suitable choice of , it holds that for conformable matrices having rows:

 (4)

The Frobenius norm bound has many uses. If , then with high probability. The result also holds in the spectral norm, Sarlos (2006, Corollary 11).

## 3 Random Sampling, Random Projections, and the Countsketch

There are two classes of s with the JL property: random sampling which reduces the row dimension by randomly picking rows of , and random projections which form linear combinations from the rows of . A scheme known as a countsketch that is not a JL transform can also achieve subspace embedding efficiently. We will use a pen and pencil example with and to provide a better understanding of the three types of s. In this example, has 9 rows given by .

### 3.1 Random Sampling (RS)

Let be a diagonal rescaling matrix with in the -th diagonal and is the probability that row is chosen. Under random sampling,

 Π = DS,

where if row is selected in the -th draw and zero otherwise so that the -th row of the selection matrix is the th-row of an dimensional indentity matrix. Examples of sampling schemes are:

• Uniform sampling without replacement: , , for all . Each row is sampled at most once.

• Uniform sampling with replacement: , , for all . Each row can be sampled more than once.

• Bernoulli sampling uses an matrix , where , is initialized to be and the -th diagonal entry is updated by

 Sjj={1with probability mn0with probability 1−mn

Each row is sampled at most once, and is the expected number of sampled rows.

• Leverage score sampling: the sampling probabilities are taken from importance sampling distribution

 pi=ℓi∑ni=1ℓi=ℓid, (5)

where for with svd,

 ℓi=∥U(i)∥22=∥eTiU∥22,

is the leverage score for row , , and is a standard basis vector.

Notably, the rows of the sketch produced by random sampling are the rows of the original matrix . For example, If rows 9,5,1 are randomly chosen by uniform sampling, RS1 would give

 ˜A=D⎛⎜⎝000000001000001000100000000⎞⎟⎠A=√9√3⎛⎜⎝A9A5A1⎞⎟⎠.

Ipsen and Wentworth (2014, Section 3.3) shows that sampling schemes RS1-RS3 are similar in terms of the condition number and rank deficiency in the matrices that are being subsampled. Unlike these three sampling schemes, leverage score sampling is not data oblivious and warrants further explanation.

As noted above, uniform sampling may not efficient. More precisely, uniform sampling does not work well when the data have high coherence, where coherence refers to the maximum of the row leverage scores defined above. Early work suggests to use sampling weights that depend on the Euclidean norm, See, e.g., Drineas, Kannan and Mahoney (2006) and Drineas and Mahoney (2005). Subsequent work finds that a better approach is to sample according to the leverage scores which measure the correlation between the left singular vectors of with the standard basis, and thus indicates whether or not information is spread out. The idea of leverage-sampling, first used in Jolliffe (1972), is to sampling a row more frequently if it has more information.888There are other variations of leverage score sampling. McWilliams et al. (2014) considers subsampling in linear regression models when the observations of the covariances may be corrupted by an additive noise. The influence of observation is defined by where is the OLS residual and is the leverage score. Unlike leverage scores,

takes into account the relation between the predictor variables and the

. Of course, is simply the -th diagonal element of the hat matrix , known to contain information about influential observations. In practice, computation of the leverage scores requires an eigen decomposition which is itself expensive. Drineas et al. (2012) and Cohen, Lee, Musco, Musco, Peng and Sidford (2015) suggest fast approximation of leverage scores.

### 3.2 Random Projections (RP)

Some examples of random projections are:

• Sub-Gaussian random projections:999A mean-zero vector is 1-sub-Gaussian if for any and for all , .

• Gaussian random projection, where

• A random matrix with entries of , Sarlos (2006), Achiloptas (2003).

• Randomized Orhthogonal Systems: where is an is diagonal Rademacher matrix with entries of , is a sparse matrix, and is an orthonormal matrix.

• Sparse Random Projections (SRP)

 Π=DS

where is a diagonal matrix of and

 Sij=⎧⎪ ⎪⎨⎪ ⎪⎩−1with probability 12s0with probability 1−1s1with probability 12s

The rows of the sketch produced by random projections are linear combinations of the rows of the original matrix. For example, RP3 with could give

 ˜A=D⎛⎜⎝[r]00110−100010−10−1000101000010−1⎞⎟⎠A=√3√9⎛⎜⎝A3+A4−A6A1−A3−A5+A9A2+A7−A9⎞⎟⎠

Early work on random projections such as Dasgupta et al. (2010) uses s that are dense, an example being RP1. Subsequent work favors sparser s, an example being RP3. Achiloptas (2003) initially considers . Li et al. (2006) suggests to increase to . Given that uniform sampling is algorithmically inefficient when information is concentrated, the idea of randomized orthogonal systems is to first randomize the data by the matrix to destroy uniformity, so that sampling in a data oblivious manner using and rescaling by remains appropriate. The randomization step is sometimes referred to as ‘preconditioning’. Common choices of are the Hadamard matrix as in the SRHT of Ailon and Chazelle (2009)101010The Hadamard matrix is defined recursively by . A constraint is that must be in powers of two.

and the discrete Fourier transform as in FJLT of

Woolfe et al. (2008).

### 3.3 Countsketch

While sparse s reduce computation cost, Kane and Nelson (2014, Theorem 2.3) shows that each column of must have non-zero entries to create a subspace embedding. This would seem to suggest that cannot be too sparse. However, Clarkson and Woodruff (2013) argues that if the non-zero entries of are carefully chosen, need not be a JLT and a very sparse subspace embedding is actually possible. Their insight is that need not preserve the norms of an arbitrary subset of vectors in , but only those that sit in the -dimensional subspace of . The sparse embedding matrix considered in Clarkson and Woodruff (2013) is the countsketch.111111The definition is taken from Dahiya et al. (2018). Given input , a count-sketch matrix can also be characterized by a hash function such that . Then with equal probability .

A countsketch of sketching dimension is a random linear map where is an random diagonal matrix with entries chosen independently to be or with equal probability. Furthermore, is an binary matrix such that and 0 otherwise, and is a random map such that for each , for with probability . As an example, a countsketch might be

 ˜A=⎛⎜⎝[r]00101−1001−100−1000−100−10000100⎞⎟⎠A=⎛⎜⎝A3+A5−A6+A9A1−A4−A8A2+A7⎞⎟⎠

Like random projections, the rows of a countsketch are also a linear combinations of the rows of .

Though the countsketch is not a JLT, Nelson and Nguyen (2013b) and Meng and Mahoney (2013) show that the following Frobenius norm bound holds for the countsketch with appropriate choice of :

 P(∥(ΠU)T(ΠU)−Id∥2>3ϵ)≤δ (6)

which implies that countsketch provides a subspace embedding for the column space of in spite of not being a JLT, see Woodruff (2014, Theorem 2.6).

The main appeal of the countsketch is that the run time needed to compute can be reduced to , where nnz(A) denotes the number of non-zero entries of . The efficiency gain is due to extreme sparsity of a countsketch which only has one non-zero element per column. Still, the matrix can be costly to store when is large. Fortunately, it is possible to compute the sketch without constructing .

The streaming version of the countsketch is a variant of the frequent-items algorithm where we recall that having to compute summaries such as the most frequent item in the data that stream by was instrumental to the development of sketching algorithms. The streaming algorithm proceeds by initializing to an matrix of zeros. Each row of is updated as

 ˜Ah(i)=˜Ah(i)+g(i)A(i)

where sampled uniformly at random from and sampled from are independent. Computation can be done one row at a time.121212See Ghashami et al. (2016). Similar schemes have been proposed in Charika et al. (2002); Conmode and Muthukrishnan (2005). The online Appendix provides the streaming implementation of the example above.

### 3.4 Properties of the Πs

To assess the actual performance of the different s, we conduct a small Monte Carlo experiment with 1000 replications. For each replication , we simulate an matrix and construct the seven JL embeddings considered above. For each embedding, we count the number of times that is within of for all pairs of distinct . The success rate for the replication is the total count divided by . We also record where is a vector of singular values of . According to theory, the pairwise distortion of the vectors should be small if . We set and . Four values of are considered. We draw

from the (i) normal distribution, and (ii) the exponential distribution. In

matlab, these are generated as X=randn(n,d) and X=exprnd(d,[n d]). Results for the Pearson distribution using X=pearsrnd(0,1,1,5,n,d) are similar and not reported.

Table 1 reports the results averaged over 1000 simulations. With probability around 0.975, the pairwise distance between columns with 1000 rows is close to the pairwise distance between columns with 20000 rows. The average singular value distortion also levels off with about 1000 rows of data. Hence, information in rows can be well summarized by a smaller matrix with rows. The takeaway from Table 1 is that the performance of the different s are quite similar, making computation cost and analytical tractability two important factor in deciding which ones to use.

Choosing a is akin to choosing a kernel and many will work well, but there are analytical differences. Any can be written as where is a generic diagonal and is a generic matrix with zeros in each diagonal entry. Two features will be particularly useful.

 ΠTΠ =In+R11 (7a) ΠΠT =nmIm (7b)

Property (7a) imposes that is a diagonal matrix, allowing but restricting . Each can also be written as where is a generic diagonal and is a generic matrix with zeros in each diagonal entry. Property (7b) requires that

is proportional to an identity matrix, and hence that

.

For the s previously considered, we summarize their properties as follows:

(7a) (7b)
RS1 (Uniform,w/o) yes yes
RS2 (Uniform,w) yes no
RS3 (Bernoulli) yes no
RS4 (Leverage) yes no
RP1 (Gaussian) no no
RP2 (SRHT) yes yes
RP3 (SRP) no no
RP4 (Countsketch) no no

Property (7a) holds for all three random sampling methods but of all the random projection methods considered, the property only holds for SRHT. This is because SRHT effectively performs uniform sampling of the randomized data. For property (7b), it is easy to see that and when uniform sampling is done without replacement and . By implication, the condition also holds for SRHT if sampling is done without replacement since and are orthonormal matrices. But uniform sampling is computationally much cheaper than SRHT and has the distinct advantange over the SRHT that the rows of the sketch are those of the original matrix and hence interpretable. For this reason, we will subsequently focus on uniform sampling and use its special structure to obtain precise statistical results. Though neither condition holds for the countsketch, its extreme sparse structure permits useful statements to be made. This is useful because of the computational advantages associted with the countsketch.

## 4 Algorithmic Results for the Linear Regression Model

The linear regression model with regressors is . The least squares estimator minimizes with respect to and is defined by

 ^β=(XTX)−1XTy=VΣ−1UTy.

We are familiar with the statistical properties of under assumptions about and . But even without specifying a probabilistic structure, with is an over-determined system of equations and can be solved by algebraically. The svd solution gives where , the pseudoinverse is . The ’Choleski’ solution starts with the normal equations and factorizes . The algebraic properties of these solutions are well studied in the matrix computations literature when all data are used.

Given an embedding matrix and sketched data , minimizing with respect to gives the sketched estimator

 ˜β=((ΠX)TΠX)−1(ΠX)TΠy.

Let be the full sample sum of squared residuals. For an embedding matrix , let be the sum of squared residuals from using the sketched data. Assume that the following two conditions hold with probability for :

 |1−σ2k(ΠU)| ≤ 1√2∀k=1,…,K; (8a) ∥(ΠU)TΠ(y−X^β)∥22 ≤ ϵˆ\textscssr2/2. (8b)

Condition (8a) is is equivalent to as discussed above. Since