# Sharpened Error Bounds for Random Sampling Based ℓ_2 Regression

Given a data matrix X ∈ R^n× d and a response vector y ∈ R^n, suppose n>d, it costs O(n d^2) time and O(n d) space to solve the least squares regression (LSR) problem. When n and d are both large, exactly solving the LSR problem is very expensive. When n ≫ d, one feasible approach to speeding up LSR is to randomly embed y and all columns of X into a smaller subspace R^c; the induced LSR problem has the same number of columns but much fewer number of rows, and it can be solved in O(c d^2) time and O(c d) space. We discuss in this paper two random sampling based methods for solving LSR more efficiently. Previous work showed that the leverage scores based sampling based LSR achieves 1+ϵ accuracy when c ≥ O(d ϵ^-2 d). In this paper we sharpen this error bound, showing that c = O(d d + d ϵ^-1) is enough for achieving 1+ϵ accuracy. We also show that when c ≥ O(μ d ϵ^-2 d), the uniform sampling based LSR attains a 2+ϵ bound with positive probability.

## Authors

• 20 publications
12/01/2020

### Time-Space Lower Bounds for Simulating Proof Systems with Quantum and Randomized Verifiers

A line of work initiated by Fortnow in 1997 has proven model-independent...
05/19/2021

### L1 Regression with Lewis Weights Subsampling

We consider the problem of finding an approximate solution to ℓ_1 regres...
01/08/2020

05/04/2015

### An Explicit Sampling Dependent Spectral Error Bound for Column Subset Selection

In this paper, we consider the problem of column subset selection. We pr...
03/27/2018

### Distributed Adaptive Sampling for Kernel Matrix Approximation

Most kernel-based methods, such as kernel or Gaussian process regression...
04/06/2014

### Provable Deterministic Leverage Score Sampling

We explain theoretically a curious empirical phenomenon: "Approximating ...
07/21/2020

### Backfitting for large scale crossed random effects regressions

Regression models with crossed random effect error models can be very ex...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Given data instances , each of dimension , and responses , it is interesting to find a model such that . If , there will not in general exist a solution to the linear system, so we instead seek to find a model such that . This can be formulated as the least squares regression (LSR) problem:

 \boldmathβ\unboldmathlsr=argmin\boldmathβ\unboldmath∈Rd∥∥y−X\boldmathβ\unboldmath∥∥22. (1)

Suppose , in general it takes time and space to compute

using the Cholesky decomposition, the QR decomposition, or the singular value decomposition (SVD)

[9].

LSR is perhaps one of the most widely used method in data processing, however, solving LSR for big data is very time and space expensive. In the big-data problems where and are both large, the time complexity and space complexity make LSR prohibitive. So it is of great interest to find efficient solution to the LSR problem. Fortunately, when , one can use a small portion of the instances instead of using the full data to approximately compute , and the computation cost can thereby be significantly reduced. Random sampling based methods [5, 11, 12] and random projection based methods [2, 6] have been applied to make LSR more efficiently solved.

Formally speaking, let be a random sampling/projection matrix, we solve the following problem instead of (1):

 ~\boldmathβ\unboldmathS=argmin\boldmathβ\unboldmath∈Rd∥∥Sy−SX\boldmathβ\unboldmath∥∥22. (2)

This problem can be solved in only time and space. If the random sampling/projection matrix is constructed using some special techniques, then it is ensured theoretically that and that

 ∥∥y−X~\boldmathβ\unboldmathS∥∥22≤(1+ϵ)∥∥y−X% \boldmathβ\unboldmathlsr∥∥22 (3)

hold with high probability. There are two criteria to evaluate random sampling/projection based LSR.

• Running Time. That is, the total time complexity in constructing and computing .

• Dimension after Projection. Given an error parameter , we hope that there exists a polynomial function such that if , the inequality (3) holds with high probability for all and . Obviously is the smaller the better because the induced problem (2) can be solved in less time and space if is small.

### 1.1 Contributions

The leverage scores based sampling is an important random sampling technique widely studied and empirically evaluated in the literature [2, 4, 5, 8, 11, 13, 12, 15, 16]. When applied to accelerate LSR, error analysis of the leverage scores based sampling is available in the literature. It was shown in [4] showed that by using the leverage scores based sampling with replacement, when

 c≥O(d2ϵ−2),

the inequality (3) holds with high probability. Later on, [5] showed that by using the leverage scores based sampling without replacement,

 c≥O(dϵ−2logd)

is sufficient to make (3) hold with high probability. In Theorem 3.1 we show that (3) holds with high probability when

 c≥O(dlogd+dϵ−1),

using the same leverage scores based sampling without replacement. Our results are described in Theorem 3.1. Our proof techniques are based on the previous work [5, 6], and our proof is self-contained.

Though uniform sampling can have very bad worst-case performance when applied to the LSR problem [11], it is still the simplest and most efficient strategy for column sampling. Though the uniform sampling is uniformly worse than the leverage score based sampling from an algorithmic perspective, it is not true from a statistical perspective [11]. Furthermore, when the leverage scores of are very homogeneous, the uniform sampling has virtually the same performance as the leverage scores based sampling [11]. So uniform sampling is still worth of study. We provide in Theorem 3.2 an error bound for the uniform sampling based LSR. We show that when

 c>O(μdϵ−2logd),

the uniform sampling based LSR attains a bound with positive probability. Here denotes the matrix coherence of .

## 2 Preliminaries and Previous Work

For a matrix , we let be its -th row, be its -th column, be its Frobenius norm, and be its spectral norm. We let be an identity matrix and let

be an all-zero matrix with proper size.

We let the thin singular value decomposition of be

 X=UX\boldmathΣ\unboldmathXVTX=d∑i=1σi(X)uX,ivTX,i.

Here , , and are of sizes , , and , and the singular values are in non-increasing order. We let be an

column orthogonal matrix such that

. The condition number of is defined by .

Based on SVD, the (row) statistical leverage scores of is defined by

 li=∥∥u(i)X∥∥22, i=1,⋯,n,

where is the -th row of . It is obvious that . Exactly computing the leverages scores costs time, which is as expensive as exactly solving the LSR problem (1). Fortunately, if is a skinny matrix, the leverages scores can be highly efficiently computed within arbitrary accuracy using the techniques of [2, 3].

There are many ways to construct the random sampling/projection matrix , and below we describe some of them.

• Uniform Sampling. The sampling matrix is constructed by sampling rows of the identity matrix uniformly at random. This method is the simplest and fastest, but in the worst case its performance is very bad [11].

• Leverage Scores Based Sampling. The sampling matrix is computed by Algorithm 1; has rows in expectation. This method is proposed in [4, 5].

• Subsampled Randomized Hadamard Transform (SRHT). The random projection matrix is called SRHT [1, 6, 14] if

• is a subset of rows from the identity matrix, where the rows are chosen uniformly at random and without replacement;

• is a normalized Walsh–Hadamard matrix;

• is an random diagonal matrix with each diagonal entry independently chosen to be or with equal probability.

SRHT is a fast version of the Johnson-Lindenstrauss transform. The performance of SRHT based LSR is analyzed in [6].

• Sparse Embedding Matrices. The sparse embedding matrix enables random projection performed in time only linear in the number of nonzero entries of [2]. The random linear map is defined by

• is a random map so that for each , for with probability ;

• is a binary matrix with , and all remaining entries ;

• is the same to the matrix of SRHT.

Sparse embedding matrices based LSR is guaranteed theoretically in [2].

## 3 Main Results

We provide in Theorem 3.1 an improved error bound for the leverage scores sampling based LSR.

###### Theorem 3.1 (The Leverage Score Based Sampling)

Use the leverage score based sampling without replacement (Algorithm 1) to construct the sampling matrix where

 c≥O(dlnd+dϵ−1),

and solve the approximate LSR problem (2) to obtain . Then with probability at least the following inequalities hold:

 ∥∥y−X~\boldmathβ\unboldmathS∥∥22 ≤ (1+ϵ)∥∥y−X\boldmathβ% \unboldmathlsr∥∥22, ∥∥\boldmathβ\unboldmathlsr−~\boldmathβ\unboldmathS∥∥22 ≤ ϵσ2min(X)∥∥y−X\boldmathβ\unboldmathlsr∥∥22≤ϵκ2(X)(γ−2−1)∥∥% \boldmathβ\unboldmathlsr∥∥22,

where is defined by .

We show in Theorem 3.2 an error bound for the uniform sampling based LSR.

###### Theorem 3.2 (Uniform Sampling)

Use the uniform sampling without replacement to sample

 c≥1000μd(lnd+7)

rows of and compute the approximate LSR problem (2) to obtain . Then with probability the following inequalities hold:

 ∥y−X~\boldmathβ\unboldmathS∥22 ≤ 2.2∥y−X\boldmathβ\unboldmathlsr∥22, ∥∥\boldmathβ\unboldmathlsr−~\boldmathβ\unboldmathS∥∥22 ≤ 1.2σ2min(X)∥∥y−X\boldmathβ\unboldmathlsr∥∥22≤1.2κ2(X)(γ−2−1)∥∥\boldmathβ\unboldmathlsr∥∥22,

where is defined by .

Since computing costs only time, so one can repeat the procedure times and choose the solution that attains the minimal error . In this way, the error bounds hold with probability which can be arbitrarily high.

## 4 Proof

In Section 4.1 we list some of the previous work that will be used in our proof. In Section 4.2 we prove Theorem 3.1. We prove the theorem by using the techniques in the proof of Lemma 1 and 2 of [5] and Lemma 1 and 2 of [6]. For the sake of self-contain, we repeat some of the proof of [5] in our proof. In Section 4.3 we prove Theorem 3.2 using the techniques in [6, 7, 14].

### 4.1 Key Lemmas

Lemma 1 is a deterministic error bound for the sampling/projection based LSR, which will be used to prove both of Theorem 3.1 and Theorem 3.2

. The random matrix multiplication bounds in Lemma

2 will be used to prove Theorem 3.1. The matrix variable tail bounds in Lemma 3 will be used to prove Theorem 3.2.

###### Lemma 1 (Deterministic Error Bound, Lemma 1 and 2 of [6])

Suppose we are given an overconstrained least squares approximation problem with and . We let be defined in (1) and be defined in (2), and define such that . Then the following equality and inequalities hold deterministically:

 ∥∥y−X~\boldmathβ\unboldmathS∥∥22 = ∥∥y−X\boldmathβ\unboldmathlsr∥∥22+∥∥UXzS∥∥22, ∥∥\boldmathβ\unboldmathlsr−~\boldmathβ\unboldmathS∥∥22 ≤ ∥UXzS∥22σ2min(X), ∥∥zS∥∥2 ≤ ∥∥UTXSTSU⊥XU⊥XTy∥∥2σ2min(SUX).

By further assuming that , it follows that

###### Proof

The equality and the first two inequalities follow from Lemma 1 of [6]. The last inequality follows from Lemma 2 of [6].

###### Lemma 2 (Theorem 7 of [5])

Suppose , , and , and we let be the sampling matrix computed by Algorithm 1 taking and as input, then

 E∥∥XTY−XTSTSY∥∥F ≤ 1√c∥∥X∥∥F∥∥Y∥∥F, E∥∥XTX−XTSTSX∥∥F ≤ O(√logcc)∥∥X∥∥2∥∥X∥∥F.
###### Lemma 3 (Theorem 2.2 of [14])

Let be a finite set of positive semidefinite matrices with dimension , and suppose that

 maxW∈Wλmax(W)≤R.

Sample uniformly at random from without replacement. We define and . Then for any and , the following inequalities hold:

 P{λmin(c∑i=1Wi)≤θ1ξmin} ≤ P{λmax(c∑i=1Wi)≥θ2ξmax} ≤

### 4.2 Proof of Theorem 3.1

###### Proof

We first bound the term as follows. Applying a singular value inequality in [10], we have that for all

 ∣∣1−σ2i(SUX)∣∣ = ∣∣σi(UTXUX)−σi(UTXSTSUX)∣∣ ≤ σmax(UTXUX−UTXSTSTUX) = ∥∥UTXUX−UTXSTSTUX∥∥2.

Since the leverage scores of are also the leverage scores of , it follows from Lemma 2 that

 E∥∥UTXUX−UTXSTSTUX∥∥2≤O(√lncc)∥UX∥F∥UX∥2=O(√dlncc).

It then follows from Markov’s inequality that the inequality

 ∣∣1−σ2i(SUX)∣∣≤δ−11O(√dlncc)

holds with probability at least . When

 c≥O(dδ−21ϵ−21ln(dδ−21ϵ−21)), (4)

the inequality

 σ2min(SUX)≥1−ϵ1 (5)

holds with probability at least .

Now we bound the term . Since , we have that

 ∥∥UTXSTSU⊥XU⊥XTy∥∥2 = ∥∥(UTX)(U⊥XU⊥XTy)−(UTX)STS(U⊥XU⊥XTy)∥∥2.

Since the leverage scores of are also the leverage scores of , it follows from Lemma 2 that

 E∥∥(UTX)(U⊥XU⊥XTy)−(UTX)STS(U⊥XU⊥XTy)∥∥2 ≤1√c∥∥UX∥∥F∥∥U⊥XU⊥XTy∥∥2=√dc∥∥U⊥XU⊥XTy∥∥2.

It follows from the Markov’s inequality that the following inequality holds with probability at least :

 ∥∥UTXSTSU⊥XU⊥XTy∥∥2≤δ−12√d√c∥∥U⊥XU⊥XTy∥∥2. (6)

Thus when

 c≥dδ−22ϵ−22(1−ϵ1)−2, (7)

it follows from (5), (6), and the union bound that the inequality

 ∥∥UTXSTSU⊥XU⊥XTy∥∥2σ2min(SUX)≤ϵ2∥∥U⊥XU⊥XTy∥∥2 (8)

holds with probability at least . We let , , , and let be defined in Lemma 1. When

 c≥max{O(dlnd),400dϵ−1},

it follows from (4), (7), (8), and Lemma 1 that with probability at least the following inequality holds:

 ∥zS∥2≤√ϵ∥∥U⊥XU⊥XTy∥∥2.

Since , the theorem follows directly from Lemma 1.

### 4.3 Proof of Theorem 3.2

###### Proof

We first follow some of the techniques of [7] to bound the two terms

 σ2max(SUX) = σmax(UTXSTSUX), σ2min(SUX) = σmin(UTXSTSUX).

We let be the -th column of , and let be matrices sampled i.i.d. from uniformly at random without replacement. Obviously, . We accordingly define

 R=maxjλmax(Wj)=maxi∥∥ui∥∥22=dnμ,

where is the row matrix coherence of , and define

 ξmin = cλmin(EW1)=cnλmin(UTXUX)=cn, ξmax = cλmax(EW1)=cnλmax(UTXUX)=cn.

Then we apply Lemma 3 and obtained the following inequality:

 P[λmin(c∑i=1Wi)≤θ1cn] ≤ P[λmax(c∑i=1Wi)≥θ2cn] ≤

where , , and are arbitrary real numbers. We set

 c=max{μdln(d/δ1)θ1lnθ1−θ1+1,μdln(d/δ2)θ2lnθ2−θ2+1}, (9)

it then follows that with probability at least , both of the following two inequalities hold:

 σmax(SUX)≤√θ2cn and σ−2min(SUX)≤nθ1c. (10)

Now we seek to bound the term . Let be an index set of cardinality with each element chosen from uniformly at random without replacement, and let , then we have that

 E∥∥SU⊥XU⊥XTy∥∥22=E∥∥Sy⊥∥∥22=E∑j∈C(y⊥j)2=cn∥∥y⊥∥∥22.

Thus with probability at least the following inequality holds:

 ∥∥SU⊥XU⊥XTy∥∥22≤cnδ3∥∥U⊥XU⊥XTy∥∥22. (11)

Finally, it follows from inequalities (10, 11) and Lemma 1 that

 ∥zS∥2 ≤ ∥∥UTXSTSU⊥XU⊥XTy∥∥2σ2min(SUX)≤∥∥UTXST∥∥2∥∥SU⊥XU⊥XTy∥∥2σ2min(SUX) ≤ √θ2cnnθ1c√cnδ3∥∥U⊥XU⊥XTy∥∥2=1θ1√θ2δ3∥∥U⊥XU⊥XTy∥∥2.

Here the first two inequalities hold deterministically, and the third inequality holds with probability at least .

We set , , , and . Since when is close to , it follows from (9) that when , the inequality

 ∥∥zS∥∥22≤1+ϵ(1−ϵ)2(1−3δ)∥∥U⊥XU⊥XTy∥∥22

holds with probability at least .

Setting , , , and , we conclude that when , the inequality holds with probability at least . Then the theorem follows directly from Lemma 1.

## References

• [1] C. Boutsidis and A. Gittens. Improved matrix algorithms via the subsampled randomized hadamard transform. SIAM Journal on Matrix Analysis and Applications, 34(3):1301–1340, 2013.
• [2] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. In

Annual ACM Symposium on theory of computing (STOC)

. ACM, 2013.
• [3] P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff. Fast approximation of matrix coherence and statistical leverage.

Journal of Machine Learning Research

, 13:3441–3472, 2012.
• [4] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Sampling algorithms for regression and applications. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm. ACM, 2006.
• [5] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881, Sept. 2008.
• [6] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.
• [7] A. Gittens. The spectral norm error of the naive Nyström extension. arXiv preprint arXiv:1110.5305, 2011.
• [8] A. Gittens and M. W. Mahoney. Revisiting the nyström method for improved large-scale machine learning. In International Conference on Machine Learning (ICML), 2013.
• [9] G. H. Golub and C. F. van Loan. Matrix computations. The Johns Hopkins, 1996.
• [10] R. A. Horn and C. R. Johnson. Topics in matrix analysis. 1991. Cambridge University Presss, Cambridge.
• [11] P. Ma, M. W. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. In International Conference on Machine Learning (ICML), 2014.
• [12] M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.
• [13] M. W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.
• [14] J. A. Tropp. Improved analysis of the subsampled randomized hadamard transform. Advances in Adaptive Data Analysis, 3(01–02):115–126, 2011.
• [15] S. Wang and Z. Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. Journal of Machine Learning Research, 14:2729–2769, 2013.
• [16] S. Wang and Z. Zhang. Efficient algorithms and error analysis for the modified nyström method. In

International Conference on Artificial Intelligence and Statistics (AISTATS)

, 2014.