# An Explicit Sampling Dependent Spectral Error Bound for Column Subset Selection

In this paper, we consider the problem of column subset selection. We present a novel analysis of the spectral norm reconstruction for a simple randomized algorithm and establish a new bound that depends explicitly on the sampling probabilities. The sampling dependent error bound (i) allows us to better understand the tradeoff in the reconstruction error due to sampling probabilities, (ii) exhibits more insights than existing error bounds that exploit specific probability distributions, and (iii) implies better sampling distributions. In particular, we show that a sampling distribution with probabilities proportional to the square root of the statistical leverage scores is always better than uniform sampling and is better than leverage-based sampling when the statistical leverage scores are very nonuniform. And by solving a constrained optimization problem related to the error bound with an efficient bisection search we are able to achieve better performance than using either the leverage-based distribution or that proportional to the square root of the statistical leverage scores. Numerical simulations demonstrate the benefits of the new sampling distributions for low-rank matrix approximation and least square approximation compared to state-of-the art algorithms.

## Authors

• 62 publications
• 60 publications
• 34 publications
• 17 publications
• ### Completing Any Low-rank Matrix, Provably

Matrix completion, i.e., the exact and provable recovery of a low-rank m...
06/12/2013 ∙ by Yudong Chen, et al. ∙ 0

• ### Tighter Low-rank Approximation via Sampling the Leveraged Element

In this work, we propose a new randomized algorithm for computing a low-...
10/14/2014 ∙ by Srinadh Bhojanapalli, et al. ∙ 0

• ### Provable Deterministic Leverage Score Sampling

We explain theoretically a curious empirical phenomenon: "Approximating ...
04/06/2014 ∙ by Dimitris Papailiopoulos, et al. ∙ 0

• ### Relaxed Leverage Sampling for Low-rank Matrix Completion

We consider the problem of exact recovery of any m× n matrix of rank ϱ f...
03/22/2015 ∙ by Abhisek Kundu, et al. ∙ 0

• ### Relating Leverage Scores and Density using Regularized Christoffel Functions

Statistical leverage scores emerged as a fundamental tool for matrix ske...
05/21/2018 ∙ by Edouard Pauwels, et al. ∙ 0

• ### Sharpened Error Bounds for Random Sampling Based ℓ_2 Regression

Given a data matrix X ∈ R^n× d and a response vector y ∈ R^n, suppose n>...
03/30/2014 ∙ by Shusen Wang, et al. ∙ 0

• ### Fast determinantal point processes via distortion-free intermediate sampling

Given a fixed n× d matrix X, where n≫ d, we study the complexity of samp...
11/08/2018 ∙ by Michal Derezinski, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Give a data matrix , column subset selection (CSS) is an important technique for constructing a compressed representation and a low rank approximation of

by selecting a small number of columns. Compared with conventional singular value decomposition (SVD), CSS could yield more interpretable output while maintaining performance as close as SVD

(Mahoney, 2011). Recently, CSS has been applied successfully to problems of interest to geneticists such as genotype reconstruction, identifying substructure in heterogeneous populations, etc. (Paschou et al., 2007b, a; Drineas et al., 2010; Javed et al., 2011).

Let be the matrix formed by selected columns of . The key question to CSS is how to select the columns to minimize the reconstruction error:

 ∥A−PCA∥ξ,

where denotes the projection onto the column space of with being the pseudo inverse of and or denotes the spectral norm or the Frobenius norm. In this paper, we are particularly interested in the spectral norm reconstruction with respect to a target rank .

Our analysis is based on a randomized algorithm that selects columns from according to sampling probabilities . Building on advanced matrix concentration inequalities (e.g., matrix Chernoff bound and Bernstein inequality), we develop a novel analysis of the spectral norm reconstruction and establish a sampling dependent relative spectral error bound with a high probability as following:

 ∥A−PCA∥2≤(1+ϵ(s))∥A−Ak∥2,

where is the best rank- approximation of based on SVD and is a quantity dependent on the sampling probabilities besides the scalars . As revealed in our main theorem (Theorem 1), the quantity also depends on the statistical leverage scores (SLS) inherent to the data, based on which are several important randomized algorithms for CSS.

To the best of our knowledge, this is the first such kind of error bound for CSS. Compared with existing error bounds, the sampling dependent error bound brings us several benefits: (i) it allows us to better understand the tradeoff in the spectral error of reconstruction due to sampling probabilities, complementary to a recent result on the tradeoff from a statistical perspective (Ma et al., 2014)

for least square regression; (ii) it implies that a distribution with sampling probabilities proportional to the square root of the SLS is always better than the uniform sampling, and is potentially better than that proportional to the SLS when they are skewed; (iii) it motivates an optimization approach by solving a constrained optimization problem related to the error bound to attain better performance. In addition to the theoretical analysis, we also develop an efficient bisection search algorithm to solve the constrained optimization problem for finding better sampling probabilities.

By combining our analysis with recent developments for spectral norm reconstruction of CSS (Boutsidis et al., 2011), we also establish the same error bound for an exact rank- approximation, i.e.,

 ∥∥A−Π2C,k(A)∥∥2≤(1+ϵ(s))∥A−Ak∥2,

where is the best approximation to within the column space of that has rank at most .

The remainder of the paper is organized as follows. We review some closely related work in Section 2, and present the main result in Section 4 with some preliminaries in Section 3. We conduct some empirical studies in Section 5 and present the detailed analysis in Section 6. Finally, conclusion is made.

## 2 Related Work

In this section, we review some previous work on CSS, low-rank matrix approximation, and other closely related work on randomized algorithms for matrices. We focus our discussion on the spectral norm reconstruction.

Depending on whether the columns are selected deterministically or randomly, the algorithms for CSS can be categorized into deterministic algorithms and randomized algorithms. Deterministic algorithms select columns with some deterministic selection criteria. Representative algorithms in this category are rank revealing QR factorization and its variants from the filed of numerical linear algebra (Gu & Eisenstat, 1996; Pan, 2000; Pan & Tang, 1999). A recent work (Boutsidis et al., 2011) based on the dual set spectral sparsification also falls into this category which will be discussed shortly. Randomized algorithms usually define sampling probabilities and then select columns based on these sampling probabilities. Representative sampling probabilities include ones that depend the squared Euclidean norm of columns (better for Frobenius norm reconstruction) (Frieze et al., 2004), the squared volume of simplices defined by the selected subsets of columns (known as volume sampling) (Deshpande & Rademacher, 2010), and the SLS (known as leverage-based sampling or subspace sampling) (Drineas et al., 2008; Boutsidis et al., 2009).

Depending on whether is allowed, the error bounds for CSS are different. Below, we review several representative error bounds. If exactly columns are selected to form , the best bound was achieved by the rank revealing QR factorization (Gu & Eisenstat, 1996) with the error bound given by:

 ∥A−PCA∥2≤√1+k(n−k)∥A−Ak∥2. (1)

with a running time . The same error bound was also achieved by using volume sampling (Deshpande & Rademacher, 2010). The running time of volume sampling based algorithms can be made close to linear to the size of the target matrix. Boutsidis et al. (2009) proposed a two-stage algorithm for selecting exactly columns and provided error bounds for both the spectral norm and the Frobenius norm, where in the firs stage columns are sampled based on a distribution related to the SLS and more if for the spectral norm reconstruction and in the second stage columns are selected based on the rank revealing QR factorization. The spectral error bound in this work that holds with a constant probability 0.8 is following:

 ∥A− PCA∥2≤ (2) Θ(klog1/2k+n1/2k3/4log1/4(k))∥A−Ak∥2

The time complexity of their algorithm (for the spectral norm reconstruction) is given by since it requires SVD of the target matrix for computing the sampling probabilities.

If more than columns are allowed to be selected, i.e., , better error bounds can be achieved. In the most recent work by Boutsidis et al. (2011), nearly optimal error bounds were shown by selecting columns with a deterministic selection criterion based on the dual set spectral sparsification. In particular, a deterministic polynomial-time algorithm 111A slower deterministic algorithm with a time complexity was also presented with an error bound , where is the rank of . was proposed that achieves the following error bound:

 ∥A−PCA∥2 ≤(1+1+√n/ℓ1−√k/ℓ)∥A−Ak∥2 (3)

in time where is the time needed to compute the top

right singular vectors of

and is the time needed to compute the selection scores. This bound is close to the lower bound established in their work. It is worth mentioning that the selection scores in (Boutsidis et al., 2011) computed based on the dual set spectral sparsification is difficult to understand than the SLS.

Although our sampling dependent error bound is not directly comparable to these results, our analysis exhibits that the derived error bound could be much better than that in (2). When the SLS are nonuniform, our new sampling distributions could lead to a better result than (3). Most importantly, the sampling probabilities in our algorithm are only related to the SLS and that can be computed more efficiently (e.g., exactly in or approximately in  (Drineas et al., 2012)). In simulations, we observe that the new sampling distributions could yield even better spectral norm reconstruction than the deterministic selection criterion in (Boutsidis et al., 2011), especially when the SLS are nonuniform.

For low rank matrix approximation, several other randomized algorithms have been recently developed. For example, Halko et al. (2011) used a random Gaussian matrix

or a subsampled random Fourier transform to construct a matrix

and then project into the column space of , and they established numerous spectral error bounds. Among them is a comparable error bound to (3) using the subsampled random Fourier transform. Other randomized algorithm for low rank approximation include CUR decomposition (Drineas et al., 2008; Wang & Zhang, 2012, 2013) and the Nyström based approximation for PSD matrices (Drineas & Mahoney, 2005; Gittens & Mahoney, 2013).

Besides low rank matrix approximation and column selection, CSS has also been successfully applied to least square approximation, leading to faster and interpretable algorithms for over-constrained least square regression. In particular, if let denote a scaled sampling matrix corresponding to selecting rows from , the least square problem can be approximately solved by  (Drineas et al., 2008, 2006b, 2011). At ICML 2014, Ma et al. (2014)

studied CSS for least square approximation from a statistical perspective. They showed the expectation and variance of the solution to the approximated least square with uniform sampling and leverage-based sampling. They found that leveraging based estimator could suffer from a large variance when the SLS are very nonuniform while uniform sampling is less vulnerable to very small SLS. This tradeoff is complementary to our observation. However, our observation follows directly from the spectral norm error bound. Moreover, our analysis reveals that the sampling distribution with probabilities proportional to the square root of the SLS is always better than uniform sampling, suggesting that intermediate sampling probabilities between SLS and their square roots by solving a constrained optimization problem could yield better performance than the mixing strategy that linearly combines the SLS and uniform probabilities as suggested in

(Ma et al., 2014).

There are much more work on studying the Frobenius norm reconstruction of CSS (Drineas et al., 2006a; Guruswami & Sinop, 2012; Boutsidis et al., 2011; Drineas et al., 2008; Boutsidis et al., 2009). For more references, we refer the reader to the survey (Mahoney, 2011)

. It remains an interesting question to establish sampling dependent error bounds for other randomized matrix algorithms.

## 3 Preliminaries

Let be a matrix of size and has a rank of . Let be a target rank to approximate . We write the SVD decomposition of as

 A=U(Σ100Σ2)(V⊤1V⊤2)

where , , and . We use to denote the singular values of in the descending order, and and to denote the maximum and minimum eigen-values of a PSD matrix

. For any orthogonal matrix

, let denote an orthogonal matrix whose columns are an orthonormal basis spanning the subspace of that is orthogonal to the column space of .

Let be a set of scores such that  222For the sake of discussion, we are not restricting the sum of these scores to be one but to be , which does not affect our conclusions., one for each column of . We will drawn independent samples with replacement from the set using a multinomial distribution where the probability of choosing the th column is . Let be the indices of selected columns 333Note that some of the selected columns could be duplicate., and be the corresponding sampling matrix, i.e,

 Si,j={1, if i=ij0,otherwise,

and be a diagonal rescaling matrix with . Given , we construct the matrix as

 C=AS=(Ai1,…,Aiℓ). (4)

Our interest is to bound the spectral norm error between and for a given sampling matrix , i.e., , where projects onto the column space of . For the benefit of presentation, we define to denote the sampling-and-rescaling matrix, and

 Y=AΩ,Ω1=V⊤1Ω,Ω2=V⊤2Ω, (5)

where and . Since the column space of is the same to that of , therefore

 ∥A−PCA∥2=∥A−PYA∥2

and we will bound in our analysis. Let and . It is easy to verify that

 Ω1=(vi1,…,viℓ)D,Ω2=(ui1,…,uiℓ)D

Finally, we let denote the SLS of relative to the best rank- approximation to  (Mahoney, 2011), i.e., . It is not difficult to show that .

## 4 Main Result

Before presenting our main result, we first characterize scores in by two quantities as follows:

 c(s)=max1≤i≤ns∗isi,q(s)=max1≤i≤n√s∗isi (6)

Both quantities compare to the SLS . With and , we are ready to present our main theorem regarding the spectral error bound.

###### Theorem 1.

Let has rank and contain the selected columns according to sampling scores in . With a probability , we have

 ∥A−PCA∥2≤σk+1(1+ϵ(s))

where is

 ϵ(s)=3⎡⎢ ⎢ ⎢⎣ ⎷c(s)k(ρ+1−k)log[ρδ]ℓ+q(s)klog[ρδ]ℓ⎤⎥ ⎥ ⎥⎦

where is the th singular value of .

Remark: Clearly, the spectral error bound and the successful probability in Theorem 1 depend on the quantities and . In the subsection below, we study the two quantities to facilitate the understanding of the result in Theorem 1.

The result in Theorem 1 implies that the smaller the quantities and , the better the error bound. Therefore, we first study when and achieve their minimum values. The key results are presented in the following two lemmas with their proofs deferred to the supplement.

###### Lemma 1.

The set of scores in that minimize is given by , i.e., .

Remark: The sampling distribution with probabilities that are proportional to the square root of falls in between the uniform sampling and the leverage-based sampling.

###### Lemma 2.

such that . The set of scores in that minimize is given by , and the minimum value of is .

Next, we discuss three special samplings with (i) proportional to the square root of the SLS, i.e., (referred to as square-root leverage-based sampling or sqL-sampling for short), (ii) equal to the SLS, i.e., (referred to as leverage-based sampling or L-sampling for short), and (iii) equal to uniform scalars (referred to as uniform sampling or U-sampling for short). Firstly, if , achieves its minimum value and we have the two quantities written as

 qsqL=1kn∑i=1√s∗icsqL=maxis∗i∑i√s∗ik√s∗i=qsqLmaxi√s∗i (7)

In this case, when is flat (all SLS are equal), then and . The bound becomes that suppresses logarithmic terms. To analyze and for skewed SLS, we consider a power-law distributed SLS, i.e., there exists a small constant and power index , such that ranked in descending order satisfy

 s∗[i]≤a2i−p,i=1,…,n

Then it is not difficult to show that

 1kn∑i=1√si≤ak(1+2p−2)

which is independent of . Then the error bound in Theorem 1 becomes , which is better than that in (3).

Secondly, if , then achieves its minimum value and we have the two quantities written as

 qL=maxi1√s∗i,cL=1 (8)

In this case, when is flat, we have and and the same bound follows. However, when is skewed, i.e., there exist very small SLS, then could be very large. As a comparison, the for sqL-sampling is always smaller than that for L-sampling due the following inequality

 qsqL =1kn∑i=1√s∗i=1kn∑i=1s∗i√s∗i

Lastly, we consider the uniform sampling . Then the two quantities become

 qU=maxin√s∗ik,cU=maxins∗ik (9)

Similarly, if is flat, and . Moreover, it is interesting to compare the two quantities for the sqL-sampling in (7) and for the uniform sampling in (9).

 qsqL =1kn∑i=1√s∗i≤maxin√sik=qU csqL =maxi1k√s∗in∑i=1√s∗i≤maxins∗ik=cU

From the above discussions, we can see that when is a flat vector, there is no difference between the three sampling scores for . The difference comes from when tends to be skewed. In this case,

works almost for sure better than uniform distribution and could also be potentially better than

according to the sampling dependent error bound in Theorem 1. A similar tradeoff between the L-sampling and U-sampling but with a different taste was observed in (Ma et al., 2014), where they showed that for least square approximation by CSS leveraging-based least square estimator could have a large variance when there exist very small SLS. Nonetheless, our bound here exhibits more insights, especially on the sqL-sampling. More importantly, the sampling dependent bound renders the flexibility in choosing the sampling scores by adjusting them according to the distribution of the SLS. In next subsection, we present an optimization approach to find better sampling scores. In Figure 1, we give a quick view of different sampling strategies.

### 4.2 Optimizing the error bound

As indicated by the result in Theorem 1, in order to achieve a good performance, we need to make a balance between an , where affects not only the error bound but also the successful probability. To address this issue, we propose a constrained optimization approach. More specifically, to ensure that the failure probability is no more than , we impose the following constraint on

 ℓ8kc(s)≥log(kδ),i.e.,maxis∗isi≤ℓ8klog(kδ):=γ (10)

Then we cast the problem into minimizing under the constraint in (10), i.e.,

 mins∈Rn+max1≤i≤n√s∗isi s.t.s⊤1=k,s∗i≤γsi,i=1,…,n (11)

It is easy to verify that the optimization problem in (11) is convex. Next, we develop an efficient bisection search algorithm to solve the above problem with a linear convergence rate. To this end, we introduce a slack variable and rewrite the optimization problem in (11) as

 mins∈Rn+,t≥0 t,s.t.s⊤1=k (12) and s∗isi≤min(γ,t√s∗i),i=1,…,n

We now find the optimal solution by performing bisection search on . Let and be the upper and lower bounds for . We set and decide the feasibility of by simply computing the quantity

 f(t)=n∑i=1s∗imin(γ,t√s∗i)

Evidently, is a feasible solution if and is not if . Hence, we will update if and if . To run the bisection algorithm, we need to decide initial and . We can set . To compute , we make an explicit construction of by distributing the share of the largest element of to the rest of the list. More specifically, let be the index for the largest entry in . We set and for . Evidently, this solution satisfies the constraints for . With this construction, we can show that

 q(s)≤max(γ√∥s∗∥∞,n−1√∥s∗∥∞(1−γ−1))

Therefore, we set initial to the value in R.H.S of the above inequality. Given the optimal value of we compute the optimal value of by The corresponding sampling distribution clearly lies between L-sampling and sqL-sampling. In particular, when the resulting sampling distribution is L-sampling due to Lemma 2 and when the resulting sampling distribution approaches sqL-sampling.

Finally, we comment on the value of . In order to make the constraint in (10) feasible, we need to ensure . Therefore, we need .

### 4.3 Subsequent Applications

Next, we discuss two subsequent applications of CSS, one for low rank approximation and one for least square approximation.

Rank- approximation. If a rank- approximation is desired, we need to do some postprocessing since might has rank larger than . We can use the same algorithm as presented in (Boutsidis et al., 2011). In particular, given the constructed , we first orthonormalize the columns of to construct a matrix with orthonormal columns, then compute the best rank- approximation of denoted by , and finally construct the low-rank approximation as . It was shown that (Lemma 2.3 in (Boutsidis et al., 2011))

 ∥A−Q(Q⊤A)k∥2≤√2∥A−Π2C,k(A)∥2

where is the best approximation to within the column space of that has rank at most . The running time of above procedure is . Regarding its error bound, the above inequality together with the following theorem implies that its spectral error bound is only amplified by a factor of compared to that of .

###### Theorem 2.

Let has rank and contain the selected columns according to sampling scores in . With a probability , we have

 ∥A−Π2C,k(A)∥2≤σk+1(1+ϵ(s))

where is given in Theorem 1.

Least Square Approximation. CSS has been used in least square approximation for developing faster and interpretable algorithms. In these applications, an over-constrained least square problem is considered, i.e., given and with , to solve the following problem:

 (13)

The procedure for applying CSS to least square approximation is (i) to sample a set of rows from and form a sampling-and-rescaling matrix denoted by  444We abuse the same notation .; (ii) to solve the following reduced least square problem:

 ˆxopt=argminx∈Rn∥ΩAx−Ωb∥22 (14)

It is worth pointing out that in this case the SLS are computed based on the the left singular vectors of by , where is the -th row of . One might be interested to see whether we can apply our analysis to derive a sampling dependent error bound for the approximation error similar to previous bounds of the form . Unfortunately, naively combining our analysis with previous analysis is a worse case analysis, and consequentially yields a worse bound. The reason will become clear in our later discussions. However, the statistical analysis in (Ma et al., 2014) does indicate that by using sqL-sampling could have smaller variance than that using L-sampling.

## 5 Numerical Experiments

Before delving into the detailed analysis, we present some experimental results. We consider synthetic data with the data matrix generated from one of the three different classes of distributions introduced below, allowing the SLS vary from nearly uniform to very nonuniform.

• Nearly uniform SLS (GA). Columns of

are generated from a multivariate normal distribution

, where . This data is referred to as GA data.

• Moderately nonuniform SLS (). Columns of are generated from a multivariate -distribution with degree of freedom and covariance matrix as before. This data is referred to as data.

• Very nonuniform SLS (). Columns of are generated from a multivariate -distribution with degree of freedom and covariance matrix as before. This data is referred to as data.

These distributions have been used in (Ma et al., 2014) to generate synthetic data for empirical evaluations.

We first compare the spectral norm reconstruction error of the three different samplings, namely L-sampling, U-sampling and the sqL-sampling, and the deterministic dual set spectral sparsification algorithm. We generate synthetic data with and repeat the experiments 1000 times. We note that the rank of the generated data matrix is . The averaged results are shown in Figure 2. From these results we observe that (i) when the SLS are nearly uniform, the three sampling strategies perform similarly as expected; (ii) when the SLS become nonuniform, sqL-sampling performs always better than U-sampling and better than the L-sampling when the target rank is small (e.g., ) or the sample size is large; (iii) when the SLS are non-uniform, the spectral norm reconstruction error of sqL-sampling decreases faster than L-sampling w.r.t the sample size ; (iv) randomized algorithms generally perform better than the deterministic dual set sparsification algorithm.

Second, we compare the sampling scores found the constrained optimization with L-sampling and sqL-sampling. We vary the value of from (corresponding to L-sampling) to (corresponding to sqL-sampling). A result with sampling size is shown in Figure 3. It demonstrate that intermediate samplings found by the proposed constrained optimization can perform better than both L-sampling and sqL-sampling.

Finally, we apply CSS to over-constrained least square regression. To this end, we generate a synthetic data matrix with and similarly to (Ma et al., 2014). The output is generated by where and . We compare the variance and bias of the obtained estimators over runs for different sampling distributions. The results shown in Figure 4 demonstrate the sqL-sampling gives smaller variance and better bias of the estimators than L-sampling and U-sampling. We also compare the proposed optimization approach with the simple mixing strategy (Ma et al., 2014) that uses a convex combination of the L-sampling and the U-sampling. The results are shown in Figure 5, which again support our approach.

More results including relative error versus varying size of the target matrix, performance on a real data set and the Frobenius norm reconstruction error can be found in supplement.

## 6 Analysis

In this section, we present major analysis of Theorem 1 and Theorem 2 with detailed proofs included in supplement. The key to our analysis is the following Theorem.

###### Theorem 3.

Let be defined in  (5). Assume that has full row rank. We have

 ∥A−PYA∥2ξ≤∥Σ2∥2ξ+∥∥Σ2Ω2Ω†1∥∥2ξ

and

where could be and .

The first inequality was proved in (Halko et al., 2011) (Theorem 9.1) and the second inequality is credited to (Boutsidis et al., 2011) (Lemma 3.2) 555In fact, the first inequality is implied by the second inequality. . Previous work on the spectral norm analysis also start from a similar inequality as above. They bound the second term by using and then bound the two terms separately. However, we will first write using the fact has full row rank, and then bound and separately. To this end, we will apply the Matrix Chernoff bound as stated in Theorem 4 to bound and apply the matrix Bernstein inequality as stated in Theorem 5 to bound .

###### Theorem 4 (Matrix Chernoff (Tropp, 2012)).

Let be a finite set of PSD matrices with dimension , and suppose that . Sample independently from . Compute

 μmax=ℓλmax(E[X1]),μmin=ℓλmin(E[X1])

Then

 Pr{λmax(ℓ∑i=1Xi)≥(1+δ)μmax}≤k[eδ(1+δ)1+δ]μmaxB Pr{λmin(ℓ∑i=1Xi)≤(1−δ)μmin}≤k[e−δ(1−δ)1−δ]μminB
###### Theorem 5 (Noncommutative Bernstein Inequality (Recht, 2011)).

Let be independent zero-mean random matrices of dimension . Suppose and almost surely for all . Then, for any ,

 Pr⎡⎣∥∥ ∥∥L∑j=1Zj∥∥ ∥∥2>ϵ⎤⎦≤(d1+d2)exp⎡⎣−ϵ2/2∑Lj=1τ2j+Mϵ/3⎤⎦

Following immediately from Theorem 3, we have

 ∥A−PYA∥2 ≤σk+1√1+∥Ω2Ω⊤1(Ω1Ω⊤1)−1∥22 ≤σk+1√1+∥Ω2Ω⊤1∥22λ−2min(Ω1Ω⊤1) ≤σk+1(1+∥Ω2Ω⊤1∥2λ−1min(Ω1Ω⊤1)),

where the last inequality uses the fact . Below we bound from below and bound from above.

### 6.1 Bounding ∥(Ω1Ω⊤1)−1∥2

We will utilize Theorem 4 to bound . Define . It is easy to verify that

 Ω1Ω⊤1=ℓ∑j=11sijvijv⊤ij=ℓ∑j=1Xij

and , where we use and . Therefore we have . Then the theorem below will follow Theorem 4.

###### Theorem 6.

With a probability , we have

 λmin(Ω1Ω⊤1)≥(1−δ)ℓk

Therefore, with a probability we have .

### 6.2 Bounding ∥Ω2Ω⊤1∥2

We will utilize Theorem 5 to bound . Define . Then

 Ω2Ω⊤1=ℓ∑j=11sijuijv⊤ij=l∑j=1Zj

and . In order to use the matrix Bernstein inequality, we will bound and . Then we can prove the following theorem.

###### Theorem 7.

With a probability , we have

 ∥Ω2Ω⊤1∥2≤√2c(s)(ρ+1−k)ℓlog(ρk)k+2q(s)log(ρk)3.

We can complete the proof of Theorem 1 by combining the bounds for and and by setting in Theorem 6 and using union bounds.

## 7 Discussions and Open Problems

From the analysis, it is clear that the matrix Bernstein inequality is the key to derive the sampling dependent bound for . For bounding , similar analysis using matrix Chernoff bound has been exploited before for randomized matrix approximation (Gittens, 2011).

Since Theorem 3 also holds for the Frobenius norm, it might be interested to see whether we can derive a sampling dependent Frobenius norm error bound that depends on and , which, however, still remains as an open problem for us. Nonetheless, in experiments (included in the supplement) we observe similar phenomena about the performance of L-sampling, U-sampling and sqL-sampling.

Finally, we briefly comment on the analysis for least square approximation using CSS. Previous results (Drineas et al., 2008, 2006b, 2011) were built on the structural conditions that are characterized by two inequalities

 λmin(ΩUU⊤Ω)≥1/√2 ∥U⊤Ω⊤ΩU⊥U⊥⊤b∥22≤ϵ2∥U⊥U⊥⊤b∥22

The first condition can be guaranteed by Theorem 6 with a high probability. For the second condition, if we adopt a worse case analysis

 ∥U⊤Ω⊤ΩU⊥U⊥⊤b∥22≤∥U⊤Ω⊤ΩU⊥∥22∥U⊥⊤b∥22

and bound the first term in R.H.S of the above inequality using Theorem 7, we would end up with a worse bound than existing ones that bound the left term as a whole. Therefore the naive combination can’t yield a good sampling dependent error bound for the approximation error of least square regression.

## 8 Conclusions

In this paper, we have presented a sampling dependent spectral error bound for CSS. The error bound brings a new distribution with sampling probabilities proportional to the square root of the statistical leverage scores and exhibits more tradeoffs and insights than existing error bounds for CSS. We also develop a constrained optimization algorithm with an efficient bisection search to find better sampling probabilities for the spectral norm reconstruction. Numerical simulations demonstrate that the new sampling distributions lead to improved performance.

## References

• Boutsidis et al. (2009) Boutsidis, Christos, Mahoney, Michael W., and Drineas, Petros. An improved approximation algorithm for the column subset selection problem. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 968–977, 2009.
• Boutsidis et al. (2011) Boutsidis, Christos, Drineas, Petros, and Magdon-Ismail, Malik. Near optimal column-based matrix reconstruction. In The Annual Symposium on Foundations of Computer Science, pp. 305–314, 2011.
• Deshpande & Rademacher (2010) Deshpande, Amit and Rademacher, Luis. Efficient volume sampling for row/column subset selection. CoRR, abs/1004.4057, 2010.
• Drineas & Mahoney (2005) Drineas, Petros and Mahoney, Michael W. On the nystrom method for approximating a gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2005, 2005.
• Drineas et al. (2006a) Drineas, Petros, Mahoney, Michael W., and Muthukrishnan, S. Subspace sampling and relative-error matrix approximation: Column-based methods. In APPROX-RANDOM, volume 4110, pp. 316–326, 2006a.
• Drineas et al. (2006b) Drineas, Petros, Mahoney, Michael W., and Muthukrishnan, S. Sampling algorithms for l2 regression and applications. In ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1127–1136, 2006b.
• Drineas et al. (2008) Drineas, Petros, Mahoney, Michael W., and Muthukrishnan, S. Relative-error cur matrix decompositions. SIAM Journal Matrix Analysis Applications, 30:844–881, 2008.
• Drineas et al. (2010) Drineas, Petros, Lewis, Jamey, and Paschou, Peristera. Inferring geographic coordinates of origin for Europeans using small panels of ancestry informative markers. PLoS ONE, 5(8):e11892, 2010.
• Drineas et al. (2011) Drineas, Petros, Mahoney, Michael W., Muthukrishnan, S., and Sarlós, Tamàs. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, February 2011.
• Drineas et al. (2012) Drineas, Petros, Magdon-Ismail, Malik, Mahoney, Michael W., and Woodruff, David P. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:3475–3506, 2012.
• Frieze et al. (2004) Frieze, Alan, Kannan, Ravi, and Vempala, Santosh. Fast monte-carlo algorithms for finding low-rank approximations. Journal of ACM, 51(6):1025–1041, 2004.
• Gittens (2011) Gittens, Alex. The spectral norm errors of the naive nystrom extension. CoRR, abs/1110.5305, 2011.
• Gittens & Mahoney (2013) Gittens, Alex and Mahoney, Michael W. Revisiting the nystrom method for improved large-scale machine learning. In Proceedings of International Conference of Machine Learning, volume 28, pp. 567–575, 2013.
• Gu & Eisenstat (1996) Gu, Ming and Eisenstat, Stanley C. Efficient algorithms for computing a strong rank-revealing qr factorization. SIAM Journal on Scientific Computing, 17(4):848–869, 1996.
• Guruswami & Sinop (2012) Guruswami, Venkatesan and Sinop, Ali Kemal. Optimal column-based low-rank matrix reconstruction. In Proceedings of the Twenty-third Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1207–1214, 2012.
• Halko et al. (2011) Halko, N., Martinsson, P. G., and Tropp, J. A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53:217–288, 2011.
• Javed et al. (2011) Javed, A., Drineas, P., Mahoney, M. W., and Paschou, P.

Efficient genomewide selection of PCA-correlated tSNPs for genotype imputation.

Annals of Human Genetics, 75(6):707–722, Nov 2011.
• Ma et al. (2014) Ma, Ping, Mahoney, Michael W., and Yu, Bin. A statistical perspective on algorithmic leveraging. In Proceedings of the 31th International Conference on Machine Learning (ICML), pp. 91–99, 2014.
• Mahoney (2011) Mahoney, Michael W. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.
• Pan (2000) Pan, C.-T. On the existence and computation of rank-revealing lu factorizations. Linear Algebra and its Applications, 316(1–3):199 – 222, 2000. Special Issue: Conference celebrating the 60th birthday of Robert J. Plemmons.
• Pan & Tang (1999) Pan, Ching-Tsuan and Tang, PingTakPeter. Bounds on singular values revealed by qr factorizations. BIT Numerical Mathematics, 39(4):740–756, 1999. ISSN 0006-3835.
• Paschou et al. (2007a) Paschou, Peristera, Mahoney, Michael W., Javed, A., Kidd, J. R., Pakstis, A. J., Gu, S., Kidd, K. K., and Drineas, Petros. Intra- and interpopulation genotype reconstruction from tagging SNPs. Genome Research, 17(1):96–107, Jan 2007a.
• Paschou et al. (2007b) Paschou, Peristera, Ziv, Elad, Burchard, Esteban G., Choudhry, Shweta, Rodriguez-Cintron, William, Mahoney, Michael W., and Drineas, Petros. PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genetics, 3(9):e160+, 2007b.
• Recht (2011) Recht, Benjamin. A simpler approach to matrix completion. Journal Machine Learning Research (JMLR), pp. 3413–3430, 2011.
• Tropp (2012) Tropp, Joel A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 12(4):389–434, August 2012. ISSN 1615-3375.
• Wang & Zhang (2012) Wang, Shusen and Zhang, Zhihua. A scalable cur matrix decomposition algorithm: Lower time complexity and tighter bound. In Advances in Neural Information Processing Systems 25, pp. 656–664. 2012.
• Wang & Zhang (2013) Wang, Shusen and Zhang, Zhihua. Improving cur matrix decomposition and the nyström approximation via adaptive sampling. Journal of Machine Learning Research, 14(1):2729–2769, 2013.