# Tail bounds for volume sampled linear regression

The n × d design matrix in a linear regression problem is given, but the response for each point is hidden unless explicitly requested. The goal is to observe only a small number k ≪ n of the responses, and then produce a weight vector whose sum of square loss over all points is at most 1+ϵ times the minimum. A standard approach to this problem is to use i.i.d. leverage score sampling, but this approach is known to perform poorly when k is small (e.g., k = d); in such cases, it is dominated by volume sampling, a joint sampling method that explicitly promotes diversity. How these methods compare for larger k was not previously understood. We prove that volume sampling can have poor behavior for large k, indeed worse than leverage score sampling. We also show how to repair volume sampling using a new padding technique. We prove that padded volume sampling has at least as good a tail bound as leverage score sampling: sample size k = O(d log d + d/ϵ) suffices to guarantee total loss at most 1+ϵ times the minimum with high probability. The main technical challenge is proving tail bounds for the sums of dependent random matrices that arise from volume sampling.

## 1 Introduction

Consider a linear regression problem where the input points in $\mathbb{R}^d$ are provided, but the associated response for each point is withheld unless explicitly requested. The goal is to sample the responses for just a small subset of inputs, and then produce a weight vector whose total square loss on all $n$ points is at most $1+\epsilon$ times that of the optimum. (Equivalently, the total loss of the algorithm being at most $1+\epsilon$ times the loss of the optimum can be rewritten as the regret being at most $\epsilon$ times the optimum.) This scenario is relevant in many applications where data points are cheap to obtain but responses are expensive. Surprisingly, with the aid of having all input points available, such multiplicative loss bounds are achievable without any range dependence on the points or responses common in on-line learning (see, e.g., [8]).

A natural and intuitive approach to this problem is volume sampling, since it prefers “diverse” sets of points that will likely result in a weight vector with low total loss, regardless of what the corresponding responses turn out to be [11]. Volume sampling is closely related to optimal design criteria [18, 26], which are appropriate under statistical models of the responses; here we study a worst-case setting where the algorithm must use randomization to guard itself against worst-case responses.

Volume sampling and related determinantal point processes are employed in many machine learning and statistical contexts, including linear regression [26, 11, 13], clustering and matrix approximation [15, 14, 4], summarization and information retrieval [24, 23, 19], and fairness [6, 7]. The availability of fast algorithms for volume sampling [26, 11] has made it an important technique in the algorithmic toolbox alongside i.i.d. leverage score sampling [17] and spectral sparsification [5, 25].

It is therefore surprising that using volume sampling in the context of linear regression, as suggested in previous works [11, 26], may lead to suboptimal performance. We construct an example in which, even after sampling up to half of the responses, the loss of the weight vector from volume sampling is a fixed factor larger than the minimum loss. Indeed, this poor behavior arises because for any sample size $k > d$, the marginal probabilities from volume sampling are a mixture of uniform probabilities and leverage score probabilities, and uniform sampling is well-known to be suboptimal when the leverage scores are highly non-uniform.

A possible recourse is to abandon volume sampling in favor of leverage score sampling [17, 33]. However, all i.i.d. sampling methods, including leverage score sampling, suffer from a coupon collector problem that prevents their effective use at small sample sizes [13]. Moreover, the resulting weight vectors are biased (regarded as estimators for the least squares solution using all responses), which is a nuisance when averaging multiple solutions (e.g., as produced in distributed settings). In contrast, volume sampling offers multiplicative loss bounds even with sample sizes as small as $k = d$, and it is the only known non-trivial method that gives unbiased weight vectors [11].

We develop a new solution, called leveraged volume sampling, that retains the aforementioned benefits of volume sampling while avoiding its flaws. Specifically, we propose a variant of volume sampling based on rescaling the input points to “correct” the resulting marginals. On the algorithmic side, this leads to a new determinantal rejection sampling procedure which offers significant computational advantages over existing volume sampling algorithms, while at the same time being strikingly simple to implement. We prove that this new sampling scheme retains the benefits of volume sampling (like unbiasedness) but avoids the bad behavior demonstrated in our lower bound example. Along the way, we prove a new generalization of the Cauchy-Binet formula, which is needed for the rejection sampling denominator. Finally, we develop a new method for proving matrix tail bounds for leveraged volume sampling. Our analysis shows that the unbiased least-squares estimator constructed this way achieves a $1+\epsilon$ approximation factor from a sample of size $O(d \log d + d/\epsilon)$, addressing an open question posed by [11].

#### Experiments.

Figure 1 presents experimental evidence on a benchmark dataset (cpusmall from the libsvm collection [9]) that the potential bad behavior of volume sampling proven in our lower bound does occur in practice. Appendix E shows more datasets and a detailed discussion of the experiments. In summary, leveraged volume sampling avoids the bad behavior of standard volume sampling, and performs considerably better than leverage score sampling, especially for small sample sizes $k$.

#### Related work.

Despite the ubiquity of volume sampling in many contexts already mentioned above, it has only recently been analyzed for linear regression. Focusing on small sample sizes, [11] proved multiplicative bounds for the expected loss of size $k = d$ volume sampling. Because the estimators produced by volume sampling are unbiased, averaging a number of such estimators produced an estimator based on a sample of size $k = O(d^2/\epsilon)$ with expected loss at most $1+\epsilon$ times the optimum. It was shown in [13] that if the responses are assumed to be linear functions of the input points plus white noise, then size $k = O(d/\epsilon)$ volume sampling suffices for obtaining the same expected bounds. These noise assumptions on the response vector are also central to the task of A-optimal design, where volume sampling is a key technique [18, 28, 2, 29]. All of these previous results were concerned with bounds that hold in expectation; it is natural to ask if similar (or better) bounds can also be shown to hold with high probability, without noise assumptions. Concentration bounds for volume sampling and other strong Rayleigh measures were studied in [30], but these results are not sufficient to obtain the tail bounds for volume sampling.

Other techniques applicable to our linear regression problem include leverage score sampling [17] and spectral sparsification [5, 25]. Leverage score sampling is an i.i.d. sampling procedure which achieves tail bounds matching the ones we obtain here for leveraged volume sampling. However, it produces biased weight vectors, and experimental results (see [13] and Appendix E) show that it has weaker performance for small sample sizes. A different and more elaborate sampling technique based on spectral sparsification [5, 25] was recently shown to be effective for linear regression [10]. However, this method also does not produce unbiased estimates, which is a primary concern of this paper and desirable in many settings. Unbiasedness seems to require delicate control of the sampling probabilities, which we achieve using determinantal rejection sampling.

#### Outline and contributions.

We set up our task of subsampling for linear regression in the next section and present our lower bound for standard volume sampling. A new variant of rescaled volume sampling is introduced in Section 3. We develop techniques for proving matrix expectation formulas for this variant which show that for any rescaling the weight vector produced for the subproblem is unbiased.

Next, we show that when rescaling with leverage scores, a new algorithm based on rejection sampling is surprisingly efficient (Section 4): Other than the preprocessing step of computing leverage scores, the runtime does not depend on $n$ (a major improvement over existing volume sampling algorithms). Then, in Section 4.1 we prove multiplicative loss bounds for leveraged volume sampling by establishing two important properties which are hard to prove for joint sampling procedures. We conclude in Section 5 with an open problem and with a discussion of how rescaling with approximate leverage scores gives further time improvements for constructing an unbiased estimator.

## 2 Volume sampling for linear regression

In this section, we describe our linear regression setting, and review the guarantees that standard volume sampling offers in this context. Then, we present a surprising lower bound which shows that under worst-case data, this method can exhibit undesirable behavior.

### 2.1 Setting

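Suppose the learner is given $n$ input vectors $x_1,\dots,x_n \in \mathbb{R}^d$, which are arranged as the rows of an $n \times d$ input matrix $X$. Each input vector $x_i$ has an associated response variable $y_i \in \mathbb{R}$ from the response vector $y \in \mathbb{R}^n$. The goal of the learner is to find a weight vector $w \in \mathbb{R}^d$ that minimizes the square loss:

$$w^* \overset{\text{def}}{=} \operatorname*{argmin}_{w \in \mathbb{R}^d} L(w), \quad\text{where}\quad L(w) \overset{\text{def}}{=} \sum_{i=1}^n (x_i^\top w - y_i)^2 = \|Xw - y\|^2.$$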

Given both matrix $X$ and vector $y$, the least squares solution can be directly computed as $w^* = X^+ y$, where $X^+$ is the pseudo-inverse. Throughout the paper we assume w.l.o.g. that $X$ has (full) rank $d$. (Otherwise just reduce $X$ to a subset of independent columns. Also assume $X$ has no rows of all zeros; every weight vector has the same loss on such rows, so they can be removed.)

In our setting, the learner is only given the input matrix $X$, while response vector $y$ remains hidden. The learner is allowed to select a subset $S$ of row indices in $[n] = \{1,\dots,n\}$ for which the corresponding responses $y_i$ are revealed. The learner constructs an estimate $\hat{w}$ of $w^*$ using matrix $X$ and the partial vector of observed responses. The learner is evaluated by the loss over all rows of $X$ (including the ones with unobserved responses), and the goal is to obtain a multiplicative loss bound, i.e., that for some $\epsilon > 0$,

$$L(\hat{w}) \le (1+\epsilon)\, L(w^*).$$

### 2.2 Standard volume sampling

Given $X \in \mathbb{R}^{n\times d}$ and a size $k \ge d$, standard volume sampling jointly chooses a set $S$ of $k$ indices in $[n]$ with probability

$$\Pr(S) = \frac{\det(X_S^\top X_S)}{\binom{n-d}{k-d}\,\det(X^\top X)},$$

where $X_S$ is the submatrix of the rows from $X$ indexed by the set $S$. The learner then obtains the responses $y_i$, for $i \in S$, and uses the optimum solution $w^*_S = (X_S)^+ y_S$ for the subproblem $(X_S, y_S)$ as its weight vector. The sampling procedure can be performed using reverse iterative sampling, which, if carefully implemented, takes $O(nd^2)$ time (see [11, 13]).
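Reverse iterative sampling can be sketched as follows. This is a minimal, unoptimized illustration and not the careful implementation the cited works analyze: it assumes the removal rule in which, starting from all $n$ rows, row $i$ of the current set $S$ is dropped with probability proportional to $\det(X_{S\setminus i}^\top X_{S\setminus i})/\det(X_S^\top X_S) = 1 - x_i^\top (X_S^\top X_S)^{-1} x_i$. The function name and the use of NumPy are choices made for this example.

```python
import numpy as np

def reverse_iterative_volume_sample(X, k, rng):
    """Sample a size-k subset of row indices of X by reverse iterative sampling.

    Row i of the current set S is removed with probability proportional to
    1 - x_i^T (X_S^T X_S)^{-1} x_i, which equals the determinant ratio
    det(X_{S minus i}^T X_{S minus i}) / det(X_S^T X_S).
    """
    n, d = X.shape
    if not d <= k <= n:
        raise ValueError("need d <= k <= n")
    S = list(range(n))
    while len(S) > k:
        XS = X[S]
        inv = np.linalg.inv(XS.T @ XS)
        # Leverage scores of the rows inside the current subproblem.
        lev = np.einsum("ij,jk,ik->i", XS, inv, XS)
        p = np.clip(1.0 - lev, 0.0, None)
        p /= p.sum()
        S.pop(int(rng.choice(len(S), p=p)))
    return sorted(S)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
S = reverse_iterative_volume_sample(X, 4, rng)
```

Recomputing the inverse at every step keeps the sketch short but makes it slower than the $O(nd^2)$ bound quoted above.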

The key property (unique to volume sampling) is that the subsampled estimator $w^*_S$ is unbiased, i.e.

$$\mathbb{E}[w^*_S] = w^*, \quad\text{where}\quad w^* = \operatorname*{argmin}_w L(w).$$

As discussed in [11], this property has important practical implications in distributed settings: Mixtures of unbiased estimators remain unbiased (and can conveniently be used to reduce variance). Also if the rows of $X$ are in general position, then for volume sampling

$$\mathbb{E}\big[(X_S^\top X_S)^{-1}\big] = \frac{n-d+1}{k-d+1}\,(X^\top X)^{-1}. \tag{1}$$

This is important because in A-optimal design bounding $\operatorname{tr}\big((X_S^\top X_S)^{-1}\big)$ is the main concern. Given these direct connections of volume sampling to linear regression, it is natural to ask whether this distribution achieves a loss bound of $(1+\epsilon)$ times the optimum for small sample sizes $k$.
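Both expectation identities can be checked numerically on a tiny instance by enumerating every size-$k$ subset. The sketch below is only a sanity check under assumed toy dimensions ($n=6$, $d=2$, $k=3$) and random Gaussian data, which is in general position with probability one; the helper name is hypothetical.

```python
import itertools
import math
import numpy as np

def volume_sampling_table(X, k):
    """All size-k subsets S with Pr(S) = det(X_S^T X_S) / (C(n-d, k-d) det(X^T X))."""
    n, d = X.shape
    Z = math.comb(n - d, k - d) * np.linalg.det(X.T @ X)
    subsets = list(itertools.combinations(range(n), k))
    probs = np.array([np.linalg.det(X[list(S)].T @ X[list(S)]) / Z for S in subsets])
    return subsets, probs

rng = np.random.default_rng(0)
n, d, k = 6, 2, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

subsets, probs = volume_sampling_table(X, k)
w_star = np.linalg.pinv(X) @ y

# Exact expectations over the volume sampling distribution.
E_w = sum(p * (np.linalg.pinv(X[list(S)]) @ y[list(S)]) for S, p in zip(subsets, probs))
E_inv = sum(p * np.linalg.inv(X[list(S)].T @ X[list(S)]) for S, p in zip(subsets, probs))
rhs = (n - d + 1) / (k - d + 1) * np.linalg.inv(X.T @ X)
```

Here `E_w` should agree with `w_star` and `E_inv` with `rhs` up to floating-point error, and `probs` should sum to one.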

### 2.3 Lower bound for standard volume sampling

We show that standard volume sampling cannot guarantee multiplicative loss bounds on some instances, unless over half of the rows are chosen to be in the subsample.

###### Theorem 1

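Let $(X, y)$ be an $n \times d$ least squares problem, such that

$$X = \begin{pmatrix} I_{d\times d} \\ \gamma\, I_{d\times d} \\ \vdots \\ \gamma\, I_{d\times d} \end{pmatrix}, \qquad y = \begin{pmatrix} 1_d \\ 0_d \\ \vdots \\ 0_d \end{pmatrix}, \qquad\text{where } \gamma > 0.$$

Let $w^*_S = (X_S)^+ y_S$ be obtained from size $k$ volume sampling for $(X, y)$. Then,

$$\lim_{\gamma \to 0} \frac{\mathbb{E}[L(w^*_S)]}{L(w^*)} \ge 1 + \frac{n-k}{n-d}, \tag{2}$$

and there is a $\gamma > 0$ such that for any $k \le n/2$,

$$\Pr\Big(L(w^*_S) \ge \big(1 + \tfrac{1}{2}\big) L(w^*)\Big) > \frac{1}{4}. \tag{3}$$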

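Proof  In Appendix A we show part (2), and that for the chosen $(X, y)$ we have $L(w^*) = \sum_{i=1}^d (1 - l_i)$ (see (8)), where $l_i = x_i^\top (X^\top X)^{-1} x_i$ is the $i$-th leverage score of $X$. Here, we show (3). The marginal probability of the $i$-th row under volume sampling (as given by [12]) is

$$\Pr(i \in S) = \theta\, l_i + (1-\theta)\cdot 1 = 1 - \theta\,(1 - l_i), \quad\text{where } \theta = \frac{n-k}{n-d}. \tag{4}$$

Next, we bound the probability that all of the first $d$ input vectors were selected by volume sampling:

$$\Pr([d] \subseteq S) \overset{(*)}{\le} \prod_{i=1}^d \Pr(i \in S) = \prod_{i=1}^d \Big(1 - \frac{n-k}{n-d}(1 - l_i)\Big) \le \exp\Big(-\frac{n-k}{n-d} \underbrace{\sum_{i=1}^d (1 - l_i)}_{L(w^*)}\Big),$$

where $(*)$ follows from negative associativity of volume sampling (see [26]). If for some $i \in [d]$ we have $i \notin S$, then $L(w^*_S) \ge 1$. So for $\gamma$ such that $L(w^*) = 2/3$ and any $k \le n/2$:

$$\Pr\Big(L(w^*_S) \ge \big(1 + \tfrac{1}{2}\big) \underbrace{2/3}_{L(w^*)}\Big) \ge 1 - \exp\Big(-\frac{n-k}{n-d}\cdot\frac{2}{3}\Big) \ge 1 - \exp\Big(-\frac{1}{2}\cdot\frac{2}{3}\Big) > \frac{1}{4}.$$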

Note that this lower bound only makes use of the negative associativity of volume sampling and the form of the marginals. However the tail bounds we prove in Section 4.1 rely on more subtle properties of volume sampling. We begin by creating a variant of volume sampling with rescaled marginals.
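The two ingredients of this argument, the marginal formula (4) and the identity $L(w^*) = \sum_{i=1}^d (1 - l_i)$, can be checked by brute force on a small instance of the construction. The sketch below assumes the block structure of $X$ described in Theorem 1 (an identity block followed by copies of $\gamma I$) with toy values $d = 2$, $n = 6$, $k = 3$ and $\gamma = 0.5$, for which $L(w^*)$ happens to equal $2/3$.

```python
import itertools
import math
import numpy as np

def theorem1_instance(d, copies, gamma):
    """Stack I_d on top of `copies` blocks gamma * I_d; responses are 1 on the first d rows."""
    X = np.vstack([np.eye(d)] + [gamma * np.eye(d)] * copies)
    y = np.concatenate([np.ones(d), np.zeros(d * copies)])
    return X, y

d, copies, k = 2, 2, 3
X, y = theorem1_instance(d, copies, gamma=0.5)
n = X.shape[0]

G_inv = np.linalg.inv(X.T @ X)
lev = np.einsum("ij,jk,ik->i", X, G_inv, X)      # leverage scores l_i
w_star = G_inv @ X.T @ y
loss_opt = np.sum((X @ w_star - y) ** 2)         # L(w*)

# Exact marginals Pr(i in S) by enumerating all size-k subsets.
Z = math.comb(n - d, k - d) * np.linalg.det(X.T @ X)
marginals = np.zeros(n)
for S in itertools.combinations(range(n), k):
    idx = list(S)
    marginals[idx] += np.linalg.det(X[idx].T @ X[idx]) / Z

theta = (n - k) / (n - d)
formula = 1.0 - theta * (1.0 - lev)              # right-hand side of (4)
```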

## 3 Rescaled volume sampling

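Given any size $k \ge d$, our goal is to jointly sample $k$ row indices $\pi_1,\dots,\pi_k$ with replacement (instead of a subset $S$ of $[n]$ of size $k$, we get a sequence $\pi \in [n]^k$). The second difference to standard volume sampling is that we rescale the $i$-th row (and response) by $\frac{1}{\sqrt{q_i}}$, where $q = (q_1,\dots,q_n)$ is any discrete distribution over the set of row indices $[n]$, such that $\sum_{i=1}^n q_i = 1$ and $q_i > 0$ for all $i \in [n]$. We now define $q$-rescaled size $k$ volume sampling as a joint sampling distribution over $\pi \in [n]^k$, s.t.

$$q\text{-rescaled size } k \text{ volume sampling:}\quad \Pr(\pi) \propto \det\Big(\sum_{i=1}^k \frac{1}{q_{\pi_i}}\, x_{\pi_i} x_{\pi_i}^\top\Big) \prod_{i=1}^k q_{\pi_i}. \tag{5}$$

Using the rescaling matrix $Q_\pi \overset{\text{def}}{=} \sum_{i=1}^k \frac{1}{q_{\pi_i}}\, e_{\pi_i} e_{\pi_i}^\top \in \mathbb{R}^{n\times n}$, we rewrite the determinant as $\det(X^\top Q_\pi X)$. As in standard volume sampling, the normalization factor in rescaled volume sampling can be given in a closed form through a novel extension of the Cauchy-Binet formula (proof in Appendix B.1).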

###### Proposition 2

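For any $X \in \mathbb{R}^{n\times d}$, $k \ge d$ and $q_1,\dots,q_n > 0$, such that $\sum_{i=1}^n q_i = 1$, we have

$$\sum_{\pi \in [n]^k} \det(X^\top Q_\pi X) \prod_{i=1}^k q_{\pi_i} = k(k-1)\cdots(k-d+1)\,\det(X^\top X).$$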
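The identity in Proposition 2 can be verified numerically by summing over all $n^k$ sequences when $n$ and $k$ are tiny. The following sketch does this for assumed toy values $n = 4$, $d = 2$, $k = 3$ with a random distribution $q$; the helper `rescaled_gram` is a name introduced for this example.

```python
import itertools
import math
import numpy as np

def rescaled_gram(X, q, pi):
    """X^T Q_pi X = sum_i (1 / q_{pi_i}) x_{pi_i} x_{pi_i}^T for a sequence pi."""
    idx = list(pi)
    rows = X[idx] / np.sqrt(q[idx])[:, None]
    return rows.T @ rows

rng = np.random.default_rng(1)
n, d, k = 4, 2, 3
X = rng.standard_normal((n, d))
q = rng.random(n) + 0.1
q /= q.sum()

# Left-hand side: sum over every sequence pi in [n]^k.
lhs = sum(
    np.linalg.det(rescaled_gram(X, q, pi)) * np.prod(q[list(pi)])
    for pi in itertools.product(range(n), repeat=k)
)
# Right-hand side: k (k-1) ... (k-d+1) det(X^T X).
rhs = math.perm(k, d) * np.linalg.det(X.T @ X)
```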

Given a matrix $X \in \mathbb{R}^{n\times d}$, vector $y \in \mathbb{R}^n$ and a sequence $\pi \in [n]^k$, we are interested in a least-squares problem $(Q_\pi^{1/2} X,\, Q_\pi^{1/2} y)$, which selects instances indexed by $\pi$, and rescales each of them by the corresponding $1/\sqrt{q_i}$. This leads to a natural subsampled least squares estimator

$$w^*_\pi = \operatorname*{argmin}_w \sum_{i=1}^k \frac{1}{q_{\pi_i}} \big(x_{\pi_i}^\top w - y_{\pi_i}\big)^2 = (Q_\pi^{1/2} X)^+ Q_\pi^{1/2} y.$$

The key property of standard volume sampling is that the subsampled least-squares estimator is unbiased. Surprisingly, this property is retained for any $q$-rescaled volume sampling (proof in Section 3.1). As we shall see, this gives us great leeway for choosing $q$ to optimize our algorithms.

###### Theorem 3

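Given a full rank $X \in \mathbb{R}^{n\times d}$ and a response vector $y \in \mathbb{R}^n$, for any $q$ as above, if $\pi$ is sampled according to (5), then

$$\mathbb{E}[w^*_\pi] = w^*, \quad\text{where}\quad w^* = \operatorname*{argmin}_w \|Xw - y\|^2.$$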

The matrix formula (1), discussed in Section 2 for standard volume sampling, has a natural extension to any rescaled volume sampling, turning here into an inequality (proof in Appendix B.2).

###### Theorem 4

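Given a full rank $X \in \mathbb{R}^{n\times d}$ and any $q$ as above, if $\pi$ is sampled according to (5), then

$$\mathbb{E}\big[(X^\top Q_\pi X)^{-1}\big] \preceq \frac{1}{k-d+1}\,(X^\top X)^{-1}.$$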
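Theorems 3 and 4 can likewise be checked by exact enumeration on a toy problem. The sketch below assumes $n = 4$, $d = 2$, $k = 3$ and an arbitrary positive distribution $q$; sequences with zero volume have probability zero under (5) and are skipped, which is also where the inequality in Theorem 4 can become strict.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 4, 2, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
q = rng.random(n) + 0.1
q /= q.sum()

total = 0.0
sum_w = np.zeros(d)
sum_inv = np.zeros((d, d))
for pi in itertools.product(range(n), repeat=k):
    idx = list(pi)
    scale = 1.0 / np.sqrt(q[idx])
    A = X[idx] * scale[:, None]                   # rows of Q_pi^{1/2} X picked by pi
    b = y[idx] * scale
    G = A.T @ A                                   # X^T Q_pi X
    weight = np.linalg.det(G) * np.prod(q[idx])   # unnormalized Pr(pi) from (5)
    if weight <= 1e-12:
        continue                                  # zero-volume sequences have probability zero
    total += weight
    sum_w += weight * (np.linalg.pinv(A) @ b)     # weight times w*_pi
    sum_inv += weight * np.linalg.inv(G)

E_w = sum_w / total
E_inv = sum_inv / total
w_star = np.linalg.pinv(X) @ y
# Theorem 4 says this matrix is positive semidefinite.
gap = np.linalg.inv(X.T @ X) / (k - d + 1) - E_inv
```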

### 3.1 Proof of Theorem 3

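We show that the least-squares estimator $w^*_\pi = (Q_\pi^{1/2} X)^+ Q_\pi^{1/2} y$ produced from any $q$-rescaled volume sampling is unbiased, illustrating a proof technique which is also useful for showing Theorem 4, as well as Propositions 2 and 5. The key idea is to apply the pseudo-inverse expectation formula for standard volume sampling (see e.g., [11]) first on the subsampled estimator $w^*_\pi$, and then again on the full estimator $w^*$. In the first step, this formula states:

$$\underbrace{(Q_\pi^{1/2} X)^+ Q_\pi^{1/2} y}_{w^*_\pi} = \sum_{S \in \binom{[k]}{d}} \frac{\det(X^\top Q_{\pi_S} X)}{\det(X^\top Q_\pi X)}\, \underbrace{(Q_{\pi_S}^{1/2} X)^+ Q_{\pi_S}^{1/2} y}_{w^*_{\pi_S}},$$

where $\binom{[k]}{d} = \{S \subseteq \{1,\dots,k\} : |S| = d\}$ and $\pi_S$ denotes a subsequence of $\pi$ indexed by the elements of set $S$. Note that since $S$ is of size $d$, we can decompose the determinant:

$$\det(X^\top Q_{\pi_S} X) = \det(X_{\pi_S})^2 \prod_{i \in S} \frac{1}{q_{\pi_i}}.$$

Whenever this determinant is non-zero, $w^*_{\pi_S}$ is the exact solution of a system of $d$ linear equations:

$$\frac{1}{\sqrt{q_{\pi_i}}}\, x_{\pi_i}^\top w = \frac{1}{\sqrt{q_{\pi_i}}}\, y_{\pi_i}, \quad\text{for } i \in S.$$

Thus, the rescaling of each equation by $\frac{1}{\sqrt{q_{\pi_i}}}$ cancels out, and we can simply write $w^*_{\pi_S} = (X_{\pi_S})^+ y_{\pi_S}$. Note that this is not the case for sets larger than $d$ whenever the optimum solution incurs positive loss. We now proceed with summing over all $\pi \in [n]^k$. Following Proposition 2, we define the normalization constant as $Z = d!\binom{k}{d} \det(X^\top X)$, and obtain:

$$\begin{aligned}
Z\,\mathbb{E}[w^*_\pi] &= \sum_{\pi \in [n]^k} \Big(\prod_{i=1}^k q_{\pi_i}\Big) \det(X^\top Q_\pi X)\, w^*_\pi = \sum_{\pi \in [n]^k} \sum_{S \in \binom{[k]}{d}} \Big(\prod_{i \in [k]\setminus S} q_{\pi_i}\Big) \det(X_{\pi_S})^2\, (X_{\pi_S})^+ y_{\pi_S} \\
&\overset{(1)}{=} \binom{k}{d} \sum_{\bar{\pi} \in [n]^d} \det(X_{\bar{\pi}})^2\, (X_{\bar{\pi}})^+ y_{\bar{\pi}} \sum_{\tilde{\pi} \in [n]^{k-d}} \prod_{i=1}^{k-d} q_{\tilde{\pi}_i} \\
&\overset{(2)}{=} \binom{k}{d}\, d! \sum_{S \in \binom{[n]}{d}} \det(X_S)^2\, (X_S)^+ y_S \Big(\sum_{i=1}^n q_i\Big)^{k-d} \\
&\overset{(3)}{=} \binom{k}{d}\, d!\, \det(X^\top X)\, w^* = Z\, w^*.
\end{aligned}$$

Note that in $(1)$ we separate $\pi$ into two parts (subset $S$ and its complement, $[k]\setminus S$) and sum over them separately. The binomial coefficient $\binom{k}{d}$ counts the number of ways that $S$ can be “placed into” the sequence $\pi$. In $(2)$ we observe that whenever $\bar{\pi}$ has repetitions, determinant $\det(X_{\bar{\pi}})$ is zero, so we can switch to summing over sets. Finally, $(3)$ again uses the standard size $d$ volume sampling unbiasedness formula, now for the least-squares problem $(X, y)$, and the fact that the $q_i$'s sum to 1.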

## 4 Leveraged volume sampling: a natural rescaling

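Rescaled volume sampling can be viewed as selecting a sequence $\pi \in [n]^k$ of rank-1 covariates from the covariance matrix $X^\top X = \sum_{i=1}^n x_i x_i^\top$. If $\pi_1,\dots,\pi_k$ are sampled i.i.d. from $q$, i.e. $\Pr(\pi) = \prod_{i=1}^k q_{\pi_i}$, then matrix $\frac{1}{k} X^\top Q_\pi X$ is an unbiased estimator of the covariance matrix because $\mathbb{E}[q_{\pi_i}^{-1} x_{\pi_i} x_{\pi_i}^\top] = X^\top X$. In rescaled volume sampling (5), $\Pr(\pi) \propto \big(\prod_{i=1}^k q_{\pi_i}\big) \frac{\det(X^\top Q_\pi X)}{\det(X^\top X)}$, and the latter volume ratio introduces a bias to that estimator. However, we show that this bias vanishes when $q$ is exactly proportional to the leverage scores (proof in Appendix B.3).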

###### Proposition 5

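For any $q$ and $X$ as before, if $\pi \in [n]^k$ is sampled according to (5), then

$$\mathbb{E}[Q_\pi] = (k-d)\, I + \operatorname{diag}\Big(\frac{l_1}{q_1}, \dots, \frac{l_n}{q_n}\Big), \quad\text{where}\quad l_i \overset{\text{def}}{=} x_i^\top (X^\top X)^{-1} x_i.$$

In particular, $\mathbb{E}\big[\frac{1}{k} X^\top Q_\pi X\big] = X^\top \mathbb{E}\big[\frac{1}{k} Q_\pi\big] X = X^\top X$ if and only if $q_i = \frac{l_i}{d} > 0$ for all $i \in [n]$.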
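A brute-force check of Proposition 5 on a toy instance illustrates why leverage scores are the natural rescaling. The sketch below (assumed toy sizes $n = 4$, $d = 2$, $k = 3$; the function name is hypothetical) computes the diagonal of $\mathbb{E}[Q_\pi]$ exactly for an arbitrary $q$ and for $q_i = l_i/d$; in the second case every diagonal entry should equal $k$.

```python
import itertools
import numpy as np

def expected_Q_diagonal(X, q, k):
    """Diagonal of E[Q_pi] under q-rescaled size-k volume sampling, by enumeration."""
    n, _ = X.shape
    total, acc = 0.0, np.zeros(n)
    for pi in itertools.product(range(n), repeat=k):
        idx = list(pi)
        counts = np.bincount(idx, minlength=n)
        Q_diag = counts / q                         # diagonal of Q_pi
        G = (X * Q_diag[:, None]).T @ X             # X^T Q_pi X
        weight = np.linalg.det(G) * np.prod(q[idx])
        total += weight
        acc += weight * Q_diag
    return acc / total

rng = np.random.default_rng(3)
n, d, k = 4, 2, 3
X = rng.standard_normal((n, d))
lev = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

q_any = rng.random(n) + 0.1
q_any /= q_any.sum()
diag_any = expected_Q_diagonal(X, q_any, k)   # expected: (k - d) + l_i / q_i

q_lev = lev / d
diag_lev = expected_Q_diagonal(X, q_lev, k)   # expected: k in every entry
```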

This special rescaling, which we call leveraged volume sampling, has other remarkable properties. Most importantly, it leads to a simple and efficient algorithm we call determinantal rejection sampling: Repeatedly sample $O(d^2)$ indices $\pi_1,\dots,\pi_s$ i.i.d. from $q = \big(\frac{l_1}{d},\dots,\frac{l_n}{d}\big)$, and accept the sample with probability proportional to its volume ratio. Having obtained a sample, we can further reduce its size via reverse iterative sampling. We show next that this procedure not only returns a $q$-rescaled volume sample but also, by exploiting the fact that $q$ is proportional to the leverage scores, requires (surprisingly) only a constant number of iterations of rejection sampling with high probability.

###### Theorem 6

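Given the leverage score distribution $q = \big(\frac{l_1}{d},\dots,\frac{l_n}{d}\big)$ and the determinant $\det(X^\top X)$ for matrix $X \in \mathbb{R}^{n\times d}$, determinantal rejection sampling returns sequence $\pi_S$ distributed according to leveraged volume sampling, and w.p. at least $1-\delta$ finishes in time $O\big((d^2 + k)\, d^2 \ln(\frac{1}{\delta})\big)$.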

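Proof  We use a composition property of rescaled volume sampling (proof in Appendix B.4):

###### Lemma 7

Consider the following sampling procedure, for $s > k$:

$$\pi \overset{s}{\sim} X \quad (q\text{-rescaled size } s \text{ volume sampling}), \qquad S \overset{k}{\sim} \begin{pmatrix} \frac{1}{\sqrt{q_{\pi_1}}}\, x_{\pi_1}^\top \\ \vdots \\ \frac{1}{\sqrt{q_{\pi_s}}}\, x_{\pi_s}^\top \end{pmatrix} = \big(Q_{[1..n]}^{1/2} X\big)_\pi \quad (\text{standard size } k \text{ volume sampling}).$$

Then $\pi_S$ is distributed according to $q$-rescaled size $k$ volume sampling from $X$.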

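First, we show that the rejection sampling probability in line 5 of the algorithm is bounded by 1:

$$\begin{aligned}
\frac{\det\big(\frac{1}{s} X^\top Q_\pi X\big)}{\det(X^\top X)} &= \det\Big(\frac{1}{s} X^\top Q_\pi X (X^\top X)^{-1}\Big) \overset{(*)}{\le} \Big(\frac{1}{d} \operatorname{tr}\Big(\frac{1}{s} X^\top Q_\pi X (X^\top X)^{-1}\Big)\Big)^d \\
&= \Big(\frac{1}{ds} \operatorname{tr}\big(Q_\pi X (X^\top X)^{-1} X^\top\big)\Big)^d = \Big(\frac{1}{ds} \sum_{i=1}^s \frac{d}{l_{\pi_i}}\, x_{\pi_i}^\top (X^\top X)^{-1} x_{\pi_i}\Big)^d = 1,
\end{aligned}$$

where $(*)$ follows from the geometric-arithmetic mean inequality for the eigenvalues of the underlying matrix. This shows that sequence $\pi$ is drawn according to $q$-rescaled volume sampling of size $s$. Now, Lemma 7 implies correctness of the algorithm. Next, we use Proposition 2 to compute the expected value of the acceptance probability from line 5 under the i.i.d. sampling of line 4:

$$\sum_{\pi \in [n]^s} \Big(\prod_{i=1}^s q_{\pi_i}\Big) \frac{\det\big(\frac{1}{s} X^\top Q_\pi X\big)}{\det(X^\top X)} = \frac{s(s-1)\cdots(s-d+1)}{s^d} \ge \Big(1 - \frac{d}{s}\Big)^d \ge 1 - \frac{d^2}{s} \ge \frac{3}{4},$$

where we also used Bernoulli’s inequality and the fact that $s \ge 4d^2$ (see line 2). Since the expected value of the acceptance probability is at least $\frac{3}{4}$, an easy application of Markov’s inequality shows that at each trial there is at least a 50% chance of it being above $\frac{1}{2}$. So, the probability of at least $r$ trials occurring is less than $(1 - \frac{1}{4})^r$. Note that the computational cost of one trial is no more than the cost of SVD decomposition of matrix $X^\top Q_\pi X$ (for computing the determinant), which is $O(sd^2)$. The cost of reverse iterative sampling (line 7) is also $O(sd^2)$ with high probability (as shown by [13]). Thus, the overall runtime is $O\big((d^2 + k)\, d^2 r\big)$, where $r \le \ln(\frac{1}{\delta}) / \ln(\frac{4}{3})$ w.p. at least $1 - \delta$.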
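The algorithm listing that the proof refers to by line number is not reproduced in this text, so the following is a hedged reconstruction of determinantal rejection sampling from the description and proof above. The mapping of comments to "line 2", "line 4", "line 5" and "line 7" is an assumption, as are the function names; the size reduction uses a plain, unoptimized reverse iterative sampler.

```python
import numpy as np

def volume_sample(A, k, rng):
    """Standard size-k volume sampling of the rows of A by reverse iterative sampling."""
    S = list(range(A.shape[0]))
    while len(S) > k:
        AS = A[S]
        lev = np.einsum("ij,jk,ik->i", AS, np.linalg.inv(AS.T @ AS), AS)
        p = np.clip(1.0 - lev, 0.0, None)
        S.pop(int(rng.choice(len(S), p=p / p.sum())))
    return S

def determinantal_rejection_sample(X, k, rng):
    """Return (pi, q): a length-k index sequence from leveraged volume sampling and q."""
    n, d = X.shape
    G = X.T @ X
    lev = np.einsum("ij,jk,ik->i", X, np.linalg.inv(G), X)
    q = lev / lev.sum()                            # leverage score distribution l_i / d
    s = max(k, 4 * d * d)                          # "line 2": s >= 4 d^2
    det_G = np.linalg.det(G)
    while True:
        pi = rng.choice(n, size=s, p=q)            # "line 4": i.i.d. draws from q
        A = X[pi] / np.sqrt(q[pi])[:, None]        # rows of Q^{1/2} X indexed by pi
        accept_prob = np.linalg.det(A.T @ A / s) / det_G   # "line 5": at most 1
        if rng.random() < accept_prob:
            break
    S = volume_sample(A, k, rng)                   # "line 7": reduce to size k
    return pi[S], q

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)

pi, q = determinantal_rejection_sample(X, 10, rng)
scale = 1.0 / np.sqrt(q[pi])
# Subsampled estimator w*_pi = (Q_pi^{1/2} X)^+ Q_pi^{1/2} y, using only y[pi].
w_hat = np.linalg.pinv(X[pi] * scale[:, None]) @ (y[pi] * scale)
```

Only the responses `y[pi]` enter the estimator, which matches the setting where unobserved responses stay hidden.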

### 4.1 Tail bounds for leveraged volume sampling

An analysis of leverage score sampling, essentially following [33, Section 2] (which in turn draws from [31]), highlights two basic sufficient conditions on the (random) subsampling matrix $Q_\pi$ that lead to multiplicative tail bounds for $L(w^*_\pi)$.

It is convenient to shift to an orthogonalization of the linear regression task $(X, y)$ by replacing matrix $X$ with a matrix $U = X (X^\top X)^{-1/2} \in \mathbb{R}^{n\times d}$. It is easy to check that the columns of $U$ have unit length and are orthogonal, i.e., $U^\top U = I$. Now, $v^* = U^\top y$ is the least-squares solution for the orthogonal problem $(U, y)$ and prediction vector $U v^* = U U^\top y$ for $(U, y)$ is the same as the prediction vector $X w^* = X (X^\top X)^{-1} X^\top y$ for the original problem $(X, y)$. The same property holds for the subsampled estimators, i.e., $U v^*_\pi = X w^*_\pi$, where $v^*_\pi = (Q_\pi^{1/2} U)^+ Q_\pi^{1/2} y$. Volume sampling probabilities are also preserved under this transformation, so w.l.o.g. we can work with the orthogonal problem. Now $L(v^*_\pi)$ can be rewritten as

$$L(v^*_\pi) = \|U v^*_\pi - y\|^2 \overset{(1)}{=} \|U v^* - y\|^2 + \|U (v^*_\pi - v^*)\|^2 \overset{(2)}{=} L(v^*) + \|v^*_\pi - v^*\|^2, \tag{6}$$

where $(1)$ follows via Pythagorean theorem from the fact that $U (v^*_\pi - v^*)$ lies in the column span of $U$ and the residual vector $r = y - U v^*$ is orthogonal to all columns of $U$, and $(2)$ follows from $U^\top U = I$. By the definition of $v^*_\pi$, we can write $\|v^*_\pi - v^*\|$ as follows:

$$\|v^*_\pi - v^*\| = \big\|(U^\top Q_\pi U)^{-1} U^\top Q_\pi (y - U v^*)\big\| \le \big\|\underbrace{(U^\top Q_\pi U)^{-1}}_{d \times d}\big\|\; \big\|\underbrace{U^\top Q_\pi r}_{d \times 1}\big\|, \tag{7}$$

where $\|A\|$ denotes the matrix 2-norm (i.e., the largest singular value) of $A$; when $a$ is a vector, then $\|a\|$ is its Euclidean norm. This breaks our task down to showing two key properties:

1. Matrix multiplication: Upper bounding the Euclidean norm $\|U^\top Q_\pi r\|$,

2. Subspace embedding: Upper bounding the matrix 2-norm $\|(U^\top Q_\pi U)^{-1}\|$.
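The orthogonalization and the decomposition in (6) and (7) are deterministic identities, so they can be checked for any index sequence and any positive rescaling. The sketch below assumes a uniform $q$ and an i.i.d. sequence purely for illustration; it does not use leveraged volume sampling, and all variable names are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 50, 3, 12
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# U = X (X^T X)^{-1/2}, via the symmetric eigendecomposition of X^T X.
evals, evecs = np.linalg.eigh(X.T @ X)
U = X @ (evecs / np.sqrt(evals)) @ evecs.T

v_star = U.T @ y
r = y - U @ v_star                                  # residual, orthogonal to the columns of U

# Arbitrary rescaling and sequence (an assumption of this sketch).
q = np.full(n, 1.0 / n)
pi = rng.choice(n, size=k, p=q)
Q_diag = np.bincount(pi, minlength=n) / q           # diagonal of Q_pi

M = U.T @ (Q_diag[:, None] * U)                     # U^T Q_pi U
M_inv = np.linalg.inv(M)
v_pi = M_inv @ (U.T @ (Q_diag * y))                 # subsampled estimator v*_pi

diff = v_pi - v_star
loss_pi = np.sum((U @ v_pi - y) ** 2)
loss_star = np.sum((U @ v_star - y) ** 2)

lhs7 = np.linalg.norm(diff)
mid7 = np.linalg.norm(M_inv @ (U.T @ (Q_diag * r)))
bound7 = np.linalg.norm(M_inv, 2) * np.linalg.norm(U.T @ (Q_diag * r))
```

Here `loss_pi` should equal `loss_star + diff @ diff` as in (6), and `lhs7 == mid7 <= bound7` as in (7).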

We start with a theorem that implies strong guarantees for approximate matrix multiplication with leveraged volume sampling. Unlike with i.i.d. sampling, this result requires controlling the pairwise dependence between indices selected under rescaled volume sampling. Its proof is an interesting application of a classical Hadamard matrix product inequality from [3] (proof in Appendix C).

###### Theorem 8

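Let $U \in \mathbb{R}^{n\times d}$ be a matrix s.t. $U^\top U = I$. If sequence $\pi \in [n]^k$ is selected using leveraged volume sampling of size $k \ge \frac{2d}{\epsilon}$, then for any $r \in \mathbb{R}^n$,

$$\mathbb{E}\Big[\Big\|\frac{1}{k}\, U^\top Q_\pi r - U^\top r\Big\|^2\Big] \le \epsilon\, \|r\|^2.$$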

Next, we turn to the subspace embedding property. The following result is remarkable because standard matrix tail bounds used to prove this property for leverage score sampling are not applicable to volume sampling. In fact, obtaining matrix Chernoff bounds for negatively associated joint distributions like volume sampling is an active area of research, as discussed in [21]. We address this challenge by defining a coupling procedure for volume sampling and uniform sampling without replacement, which leads to a curious reduction argument described in Appendix D.

###### Theorem 9

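Let $U \in \mathbb{R}^{n\times d}$ be a matrix s.t. $U^\top U = I$. There is an absolute constant $C$, s.t. if sequence $\pi \in [n]^k$ is selected using leveraged volume sampling of size $k \ge C\, d \ln(\frac{d}{\delta})$, then

$$\Pr\Big(\lambda_{\min}\Big(\frac{1}{k}\, U^\top Q_\pi U\Big) \le \frac{1}{8}\Big) \le \delta.$$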

Theorems 8 and 9 imply that the unbiased estimator $w^*_\pi$ produced from leveraged volume sampling achieves multiplicative tail bounds with sample size $k = O(d \log d + d/\epsilon)$.

###### Corollary 10

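Let $X \in \mathbb{R}^{n\times d}$ be a full rank matrix. There is an absolute constant $C$, s.t. if sequence $\pi \in [n]^k$ is selected using leveraged volume sampling of size $k \ge C \big(d \ln(\frac{d}{\delta}) + \frac{d}{\epsilon\delta}\big)$, then for estimator

$$w^*_\pi = \operatorname*{argmin}_w \big\|Q_\pi^{1/2} (Xw - y)\big\|^2,$$

we have $L(w^*_\pi) \le (1+\epsilon)\, L(w^*)$ with probability at least $1 - \delta$.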

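Proof  Let $U = X (X^\top X)^{-1/2}$. Combining Theorem 8 with Markov’s inequality, we have that for large enough $C$, $\|U^\top Q_\pi r\|^2 \le \epsilon \frac{k^2}{8^2} \|r\|^2$ w.h.p., where $r = y - U v^*$. By Theorem 9, we also have $\|(U^\top Q_\pi U)^{-1}\| \le \frac{8}{k}$ w.h.p. Finally, following (6) and (7) above and using $\|r\|^2 = L(w^*)$, we have that w.h.p.

$$L(w^*_\pi) \le L(w^*) + \big\|(U^\top Q_\pi U)^{-1}\big\|^2\, \big\|U^\top Q_\pi r\big\|^2 \le L(w^*) + \frac{8^2}{k^2}\, \epsilon\, \frac{k^2}{8^2}\, \|r\|^2 = (1+\epsilon)\, L(w^*).$$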

## 5 Conclusion

We developed a new variant of volume sampling which produces the first known unbiased subsampled least-squares estimator with strong multiplicative loss bounds. In the process, we proved a novel extension of the Cauchy-Binet formula, as well as other fundamental combinatorial equalities. Moreover, we proposed an efficient algorithm called determinantal rejection sampling, which is to our knowledge the first joint determinantal sampling procedure that (after an initial preprocessing step for computing leverage scores) produces its samples in time independent of the data size $n$. When $n$ is very large, the preprocessing time can be reduced by rescaling with sufficiently accurate approximations of the leverage scores. Surprisingly, the estimator stays unbiased and the loss bound still holds with only slightly revised constants. For the sake of clarity we presented the algorithm based on rescaling with exact leverage scores in the main body of the paper. However, we outline the changes needed when using approximate leverage scores in Appendix F.

In this paper we focused on tail bounds. However, we conjecture that expected bounds of the form $\mathbb{E}[L(w^*_\pi)] \le (1+\epsilon)\, L(w^*)$ also hold for a variant of volume sampling of size $O(\frac{d}{\epsilon})$.

## References

• [1] Nir Ailon and Bernard Chazelle. The fast johnson–lindenstrauss transform and approximate nearest neighbors. SIAM Journal on computing, 39(1):302–322, 2009.
• [2] Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, and Yining Wang. Near-optimal design of experiments via regret minimization. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 126–135, International Convention Centre, Sydney, Australia, 2017. PMLR.
• [3] T Ando, Roger A. Horn, and Charles R. Johnson. The singular values of a hadamard product: A basic inequality. 21:345–365, 12 1987.
• [4] Haim Avron and Christos Boutsidis. Faster subset selection for matrices and applications. SIAM Journal on Matrix Analysis and Applications, 34(4):1464–1499, 2013.
• [5] Joshua Batson, Daniel A Spielman, and Nikhil Srivastava. Twice-ramanujan sparsifiers. SIAM Journal on Computing, 41(6):1704–1721, 2012.
• [6] L Elisa Celis, Amit Deshpande, Tarun Kathuria, and Nisheeth K Vishnoi. How to be fair and diverse? arXiv preprint arXiv:1610.07183, 2016.
• [7] L Elisa Celis, Vijay Keswani, Damian Straszak, Amit Deshpande, Tarun Kathuria, and Nisheeth K Vishnoi. Fair and diverse dpp-based data summarization. arXiv preprint arXiv:1802.04023, 2018.
• [8] N. Cesa-Bianchi, P. M. Long, and M. K. Warmuth. Worst-case quadratic loss bounds for on-line prediction of linear functions by gradient descent. IEEE Transactions on Neural Networks, 7(3):604–619, 1996. Earlier version in 6th COLT, 1993.
• [9] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
• [10] Xue Chen and Eric Price. Condition number-free query and active learning of linear families. CoRR, abs/1711.10051, 2017.
• [11] Michał Dereziński and Manfred K Warmuth. Unbiased estimates for linear regression via volume sampling. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3087–3096. Curran Associates, Inc., 2017.
• [12] Michał Dereziński and Manfred K. Warmuth. Unbiased estimates for linear regression via volume sampling. CoRR, abs/1705.06908, 2017.
• [13] Michał Dereziński and Manfred K. Warmuth. Subsampling for ridge regression via regularized volume sampling. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, 2018.
• [14] Amit Deshpande and Luis Rademacher. Efficient volume sampling for row/column subset selection. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS ’10, pages 329–338, Washington, DC, USA, 2010. IEEE Computer Society.
• [15] Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix approximation and projective clustering via volume sampling. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, SODA ’06, pages 1117–1126, Philadelphia, PA, USA, 2006. Society for Industrial and Applied Mathematics.
• [16] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. J. Mach. Learn. Res., 13(1):3475–3506, December 2012.
• [17] Petros Drineas, Michael W Mahoney, and S Muthukrishnan. Sampling algorithms for regression and applications. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 1127–1136. Society for Industrial and Applied Mathematics, 2006.
• [18] Valerii V. Fedorov, William J. Studden, and E. M. Klimko, editors. Theory of optimal experiments. Probability and mathematical statistics. Academic Press, New York, 1972.
• [19] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. Bayesian low-rank determinantal point processes. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys ’16, pages 349–356, New York, NY, USA, 2016. ACM.
• [20] David Gross and Vincent Nesme. Note on sampling without replacing from a finite collection of matrices. arXiv preprint arXiv:1001.2738, 2010.
• [21] Nicholas JA Harvey and Neil Olver. Pipage rounding, pessimistic estimators and matrix concentration. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 926–945. SIAM, 2014.
• [22] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13–30, 1963.
• [23] Alex Kulesza and Ben Taskar. k-DPPs: Fixed-Size Determinantal Point Processes. In Proceedings of the 28th International Conference on Machine Learning, pages 1193–1200. Omnipress, 2011.
• [24] Alex Kulesza and Ben Taskar. Determinantal Point Processes for Machine Learning. Now Publishers Inc., Hanover, MA, USA, 2012.
• [25] Yin Tat Lee and He Sun. Constructing linear-sized spectral sparsification in almost-linear time. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 250–269. IEEE, 2015.
• [26] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Polynomial time algorithms for dual volume sampling. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5045–5054. Curran Associates, Inc., 2017.
• [27] Michael W. Mahoney. Randomized algorithms for matrices and data. Found. Trends Mach. Learn., 3(2):123–224, February 2011.
• [28] Zelda E. Mariet and Suvrit Sra. Elementary symmetric polynomials for optimal experimental design. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2136–2145. Curran Associates, Inc., 2017.
• [29] Aleksandar Nikolov, Mohit Singh, and Uthaipon Tao Tantipongpipat. Proportional volume sampling and approximation algorithms for A-optimal design. CoRR, abs/1802.08318, 2018.
• [30] Robin Pemantle and Yuval Peres. Concentration of Lipschitz functionals of determinantal and other strong Rayleigh measures. Combinatorics, Probability and Computing, 23(1):140–160, 2014.
• [31] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’06, pages 143–152, Washington, DC, USA, 2006. IEEE Computer Society.
• [32] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, Aug 2012.
• [33] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014.

## Appendix A Proof of part (2) from Theorem 1

First, let us calculate $L(w^*)$. Observe that

 (X⊤X)−1 =c(1+n−ddγ2)−1 I, andw∗ =cX⊤y=c1d.

The loss $L(w)$ of any $w\in\mathbb{R}^d$ can be decomposed as $L(w)=\sum_{i=1}^d L_i(w)$, where $L_i(w)$ is the total loss incurred on all input vectors $e_i$ or $\gamma e_i$:

 Li(w∗)=(1−c)2+1c−1n−ddγ2c2=1−c,

Note that the $i$-th leverage score of $X$ is equal to $l_i = c$ for $i\in\{1,\dots,d\}$, so we obtain that

$$L(w^*) = d\,(1-c) = \sum_{i=1}^{d} (1-l_i). \tag{8}$$
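As a sanity check on (8), the following minimal sketch evaluates these quantities numerically on a small instance. It assumes the design behind this proof consists of the basis vectors $e_i$ with response 1, plus $(n-d)/d$ copies of each $\gamma e_i$ with response 0; that construction is stated outside this appendix, and the helper `build_instance` and the chosen sizes are hypothetical.

```python
import numpy as np

def build_instance(d, copies, gamma):
    """Assumed design: rows e_1..e_d (response 1), then `copies` rows gamma*e_i per i (response 0)."""
    I = np.eye(d)
    X = np.vstack([I] + [gamma * I] * copies)
    y = np.concatenate([np.ones(d), np.zeros(d * copies)])
    return X, y

d, copies, gamma = 3, 4, 0.2
X, y = build_instance(d, copies, gamma)
n = X.shape[0]                                   # n = d * (1 + copies)
c = 1.0 / (1.0 + (n - d) / d * gamma**2)         # the constant c from the proof

G_inv = np.linalg.inv(X.T @ X)                   # should equal c * I
w_star = G_inv @ X.T @ y                         # should equal c * 1_d
loss = float(np.sum((X @ w_star - y) ** 2))      # should equal d * (1 - c)
leverage = np.einsum("ij,jk,ik->i", X, G_inv, X) # l_i = x_i^T (X^T X)^{-1} x_i

print(c, loss, d * (1 - c), leverage[:d])
```

Under this assumed design the first $d$ leverage scores all equal $c$, which is what makes the two expressions in (8) agree.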

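Next, we compute $L(w^*_S)$. Suppose that $S\subseteq\{1,\dots,n\}$ is produced by size $k$ standard volume sampling. Note that if for some $1\le i\le d$ we have $i\notin S$, then $(w^*_S)_i = 0$ and therefore $L_i(w^*_S) = 1$. Moreover, denoting $b_i = \mathbf{1}_{[i\in S]}$,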

$$(X_S^\top X_S)^{-1} \succeq (X^\top X)^{-1} = c\,I, \quad\text{and}\quad X_S^\top y_S = (b_1,\dots,b_d)^\top,$$

so if $i\in S$, then $(w^*_S)_i \ge c$ and

$$L_i(w^*_S) \ge \frac{n-d}{d}\gamma^2 c^2 = \Big(\frac{1}{c}-1\Big)c^2 = c\,L_i(w^*).$$

Putting the cases of $i\in S$ and $i\notin S$ together, we get

$$\begin{aligned}
L_i(w^*_S) &\ge c\,L_i(w^*) + \big(1 - c\,L_i(w^*)\big)(1-b_i) \\
&\ge c\,L_i(w^*) + c^2\,(1-b_i).
\end{aligned}$$

Applying the marginal probability formula for volume sampling (see (4)), we note that

$$\mathbb{E}[1-b_i] = 1-\Pr(i\in S) = \frac{n-k}{n-d}\,(1-c) = \frac{n-k}{n-d}\,L_i(w^*).$$

Taking expectation over $S$ and summing the components over $i\in\{1,\dots,d\}$, we get

$$\mathbb{E}\big[L(w^*_S)\big] \ge L(w^*)\Big(c + c^2\,\frac{n-k}{n-d}\Big).$$

Note that as $\gamma\to 0$, we have $c\to 1$, thus showing (2).
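The bound above can be checked by brute force on a tiny instance. The sketch below is an illustration under stated assumptions: it reuses the assumed design (rows $e_i$ with response 1 and copies of $\gamma e_i$ with response 0), takes size-$k$ standard volume sampling to mean $\Pr(S)\propto\det(X_S^\top X_S)$ over all size-$k$ subsets, and enumerates every subset exactly instead of sampling. All variable names are hypothetical.

```python
import itertools
import numpy as np

d, copies, gamma, k = 2, 2, 0.1, 3
I = np.eye(d)
X = np.vstack([I] + [gamma * I] * copies)        # assumed design, n = d * (1 + copies)
y = np.concatenate([np.ones(d), np.zeros(d * copies)])
n = X.shape[0]
c = 1.0 / (1.0 + (n - d) / d * gamma**2)

def total_loss(w):
    """Sum of squared losses over all n points."""
    return float(np.sum((X @ w - y) ** 2))

w_star = np.linalg.lstsq(X, y, rcond=None)[0]
L_star = total_loss(w_star)

Z = 0.0
exp_loss = 0.0
marginal = np.zeros(n)
for S in itertools.combinations(range(n), k):
    idx = list(S)
    XS, yS = X[idx], y[idx]
    weight = np.linalg.det(XS.T @ XS)            # unnormalized probability of S
    if weight <= 1e-12:
        continue                                 # rank-deficient subsets have probability zero
    w_S = np.linalg.solve(XS.T @ XS, XS.T @ yS)  # least squares on the sampled rows
    Z += weight
    exp_loss += weight * total_loss(w_S)
    marginal[idx] += weight
exp_loss /= Z
marginal /= Z

lower_bound = L_star * (c + c**2 * (n - k) / (n - d))
print(exp_loss, lower_bound, marginal[:d])
```

Exact enumeration is only feasible for very small $n$ and $k$; it is used here because it removes sampling noise from the comparison.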

## Appendix B Properties of rescaled volume sampling

We give proofs of the properties of rescaled volume sampling which hold for any rescaling distribution $q$. In this section, we will use $Z = d!\binom{k}{d}\det(X^\top X)$ as the normalization constant for rescaled volume sampling.

### B.1 Proof of Proposition 2

First, we apply the Cauchy-Binet formula to the determinant term specified by a fixed sequence $\pi\in[n]^k$:

Next, we compute the sum, using the above identity:

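$$\begin{aligned}
\sum_{\pi\in[n]^k}\det(X^\top Q_\pi X)\prod_{i=1}^k q_{\pi_i}
&= \binom{k}{d}\sum_{\bar\pi\in[n]^d}\det(X_{\bar\pi})^2\sum_{\tilde\pi\in[n]^{k-d}}\prod_{i=1}^{k-d} q_{\tilde\pi_i} \\
&= \binom{k}{d}\sum_{\bar\pi\in[n]^d}\det(X_{\bar\pi})^2\Big(\sum_{i=1}^n q_i\Big)^{k-d} \\
&= \binom{k}{d}\,d!\sum_{S\in\binom{[n]}{d}}\det(X_S)^2 = k(k-1)\cdots(k-d+1)\,\det(X^\top X),
\end{aligned}$$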

where the steps closely follow the corresponding derivation for Theorem 3, given in Section 3.1.
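Both steps of this proof can be checked numerically. The sketch below assumes $Q_\pi=\sum_{i=1}^k q_{\pi_i}^{-1}e_{\pi_i}e_{\pi_i}^\top$, so that $X^\top Q_\pi X=\sum_{i=1}^k x_{\pi_i}x_{\pi_i}^\top/q_{\pi_i}$; this definition is given outside this appendix and is an assumption here. Under it, the Cauchy-Binet expansion for a fixed $\pi$ would read

$$\det(X^\top Q_\pi X)=\sum_{S\in\binom{[k]}{d}}\det(X_{\pi_S})^2\prod_{i\in S}\frac{1}{q_{\pi_i}},$$

where $X_{\pi_S}$ (hypothetical notation) stacks the rows $x_{\pi_i}$, $i\in S$. The code verifies this for one sequence and then checks the total over all of $[n]^k$.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 4, 2, 3
X = rng.standard_normal((n, d))
q = rng.random(n) + 0.1
q /= q.sum()                                     # any distribution with positive entries

def gram(pi):
    """X^T Q_pi X under the assumed Q_pi: sum_i x_{pi_i} x_{pi_i}^T / q_{pi_i}."""
    idx = list(pi)
    rows = X[idx]
    return (rows / q[idx][:, None]).T @ rows

# Cauchy-Binet expansion for one fixed sequence (with a repeated index).
pi = (0, 2, 2)
lhs = np.linalg.det(gram(pi))
rhs = sum(
    np.linalg.det(X[[pi[i] for i in S]]) ** 2 / np.prod([q[pi[i]] for i in S])
    for S in itertools.combinations(range(k), d)
)

# Normalization: sum over all n^k sequences.
total = sum(
    np.linalg.det(gram(p)) * np.prod(q[list(p)])
    for p in itertools.product(range(n), repeat=k)
)
falling = math.prod(range(k - d + 1, k + 1))     # k (k-1) ... (k-d+1)
expected = falling * np.linalg.det(X.T @ X)

print(lhs, rhs, total, expected)
```

The enumeration has $n^k$ terms, so this is only a check for toy sizes, not a way to compute the normalization in practice.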

### B.2 Proof of Theorem 4

We will prove that for any vector $v\in\mathbb{R}^d$,

$$\mathbb{E}\big[v^\top (X^\top Q_\pi X)^{-1} v\big] \le \frac{v^\top (X^\top X)^{-1} v}{k-d+1},$$

which immediately implies the corresponding matrix inequality. First, we use Sylvester’s formula, which holds whenever a matrix $A\in\mathbb{R}^{d\times d}$ is full rank:

$$\det(A+vv^\top)=\det(A)\,\big(1+v^\top A^{-1}v\big).$$
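A quick numerical illustration of this identity on a random full-rank matrix follows. The matrix `A`, the vector `v`, and the seed are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
B = rng.standard_normal((d + 3, d))
A = B.T @ B                                      # positive definite, hence full rank (almost surely)
v = rng.standard_normal(d)

lhs = np.linalg.det(A + np.outer(v, v))
# Use a linear solve rather than an explicit inverse for v^T A^{-1} v.
rhs = np.linalg.det(A) * (1.0 + v @ np.linalg.solve(A, v))

print(lhs, rhs)
```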

Note that whenever the matrix $X^\top Q_\pi X$ is not full rank, its determinant is $0$ (in which case we avoid computing the matrix inverse), so we have for any $\pi\in[n]^k$:

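$$\det(X^\top Q_\pi X)\; v^\top (X^\top Q_\pi X)^{-1} v \;\le\; \det(X^\top Q_\pi X+vv^\top)-\det(X^\top Q_\pi X) \;\overset{(*)}{=}\; \sum_{S\in\binom{[k]}{d-1}}\det\big(X^\top Q_{\pi_S} X+vv^\top\big),$$

where $(*)$ follows from applying the Cauchy-Binet formula to both of the determinants, and cancelling out common terms. Next, we proceed in a standard fashion, summing over all $\pi\in[n]^k$: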

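$$\begin{aligned}
Z\;\mathbb{E}\big[v^\top (X^\top Q_\pi X)^{-1} v\big]
&= \sum_{\pi\in[n]^k} v^\top (X^\top Q_\pi X)^{-1} v\,\det(X^\top Q_\pi X)\prod_{i=1}^k q_{\pi_i} \\
&\le \frac{d!\binom{k}{d}}{k-d+1}\Big(\det(X^\top X+vv^\top)-\det(X^\top X)\Big) = Z\,\frac{v^\top (X^\top X)^{-1} v}{k-d+1}.
\end{aligned}$$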
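The resulting inequality can be checked by exact enumeration on a toy problem. The sketch below assumes, as an illustration rather than the paper's stated definition, that $X^\top Q_\pi X=\sum_{i=1}^k x_{\pi_i}x_{\pi_i}^\top/q_{\pi_i}$ and that a sequence $\pi$ is drawn with probability proportional to $\det(X^\top Q_\pi X)\prod_i q_{\pi_i}$; the sizes, seed, and variable names are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 4, 2, 3
X = rng.standard_normal((n, d))
q = rng.random(n) + 0.1
q /= q.sum()
v = rng.standard_normal(d)

Z = 0.0
weighted_inv = np.zeros((d, d))
for pi in itertools.product(range(n), repeat=k):
    idx = list(pi)
    rows = X[idx]
    A = (rows / q[idx][:, None]).T @ rows        # assumed form of X^T Q_pi X
    weight = np.linalg.det(A) * np.prod(q[idx])  # unnormalized probability of pi
    if weight <= 1e-10:
        continue                                 # rank-deficient: probability zero, no inverse needed
    Z += weight
    weighted_inv += weight * np.linalg.inv(A)
expected_inv = weighted_inv / Z                  # E[(X^T Q_pi X)^{-1}]

bound = np.linalg.inv(X.T @ X) / (k - d + 1)
quad_lhs = v @ expected_inv @ v
quad_rhs = v @ bound @ v
slack_eigs = np.linalg.eigvalsh(bound - expected_inv)

print(quad_lhs, quad_rhs, slack_eigs)
```

Checking that `bound - expected_inv` has nonnegative eigenvalues tests the matrix inequality directly, while the quadratic forms test the scalar statement for one particular `v`.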

### B.3 Proof of Proposition 5

First, we compute the marginal probability of a fixed element of sequence $\pi$ containing a particular index $i\in[n]$ under $q$-rescaled volume sampling:

 Z

where the first term can be computed by following the derivation in Appendix B.1, obtaining