Improved guarantees and a multiple-descent curve for the Column Subset Selection Problem and the Nyström method

02/21/2020
by   Michał Dereziński, et al.

The Column Subset Selection Problem (CSSP) and the Nyström method are among the leading tools for constructing small low-rank approximations of large datasets in machine learning and scientific computing. A fundamental question in this area is: how well can a data subset of size k compete with the best rank k approximation? We develop techniques which exploit spectral properties of the data matrix to obtain improved approximation guarantees which go beyond the standard worst-case analysis. Our approach leads to significantly better bounds for datasets with known rates of singular value decay, e.g., polynomial or exponential decay. Our analysis also reveals an intriguing phenomenon: the approximation factor as a function of k may exhibit multiple peaks and valleys, which we call a multiple-descent curve. A lower bound we establish shows that this behavior is not an artifact of our analysis, but rather it is an inherent property of the CSSP and Nyström tasks. Finally, using the example of a radial basis function (RBF) kernel, we show that both our improved bounds and the multiple-descent curve can be observed on real datasets simply by varying the RBF parameter.


1 Introduction

Figure 1: An empirical study showing the expected approximation factor for subsets sampled from a k-DPP, across different subset sizes k, compared to our theory. We use a data matrix whose spectrum exhibits two sharp drops, demonstrating the multiple-descent phenomenon. The lower bounds are based on Theorem 3, whereas, as our upper bound, we plot the minimum over all functions from Theorem 1. Note that when the spectrum of the matrix exhibits sufficiently smooth decay (instead of the sharp drops), then the multiple-descent curve vanishes. In such cases, our upper bounds significantly improve on the worst-case analysis of Deshpande et al. (2006) for all subset sizes (see Theorem 2).

We consider the task of selecting a small but representative sample of column vectors from a large matrix. Known as the Column Subset Selection Problem (CSSP), this is a well-studied combinatorial optimization task with many applications in machine learning (e.g., feature selection, see Guyon & Elisseeff, 2003; Boutsidis et al., 2008), scientific computing (e.g., Chan & Hansen, 1992; Drineas et al., 2008) and signal processing (e.g., Balzano et al., 2010). In a commonly studied variant of this task, we aim to minimize the squared error of projecting all columns of the matrix onto the subspace spanned by the chosen column subset.

Definition 1 (CSSP).

Given an $m \times n$ matrix $A$, pick a set $S$ of $k$ column indices to minimize
$$\mathrm{Er}_A(S) \;=\; \big\|A - P_S A\big\|_F^2,$$
where $\|\cdot\|_F$ is the Frobenius norm, $P_S$ is the projection onto $\mathrm{span}\{a_i : i \in S\}$, and $a_i$ denotes the $i$th column of $A$.

Another variant of the CSSP, of particular interest in machine learning, emerges in the kernel setting under the name Nyström method (Williams & Seeger, 2001; Drineas & Mahoney, 2005; Gittens & Mahoney, 2016). We also discuss this variant, showing how our analysis applies in this context. Both the CSSP and the Nyström method are ways of constructing accurate low-rank approximations by using submatrices of the target matrix. Therefore, it is natural to ask how close we can get to the best possible rank $k$ approximation error:
$$\mathrm{OPT}_k \;=\; \min_{B:\,\mathrm{rank}(B) \le k} \|A - B\|_F^2.$$

Our goal is to find a subset $S$ of size $k$ for which the ratio between $\mathrm{Er}_A(S)$ and $\mathrm{OPT}_k$ is small. Furthermore, a brute force search requires iterating over all $\binom{n}{k}$ subsets, which is prohibitively expensive, so we would like to find our subset more efficiently.
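For concreteness, here is a minimal NumPy sketch of the two quantities being compared; the matrix and the subset below are arbitrary placeholders, and the helper names are illustrative rather than taken from the paper.

```python
import numpy as np

def cssp_error(A, S):
    """Squared Frobenius error of projecting A onto the span of the columns in S."""
    Q, _ = np.linalg.qr(A[:, list(S)])          # orthonormal basis for the selected columns
    residual = A - Q @ (Q.T @ A)
    return float(np.linalg.norm(residual, "fro") ** 2)

def best_rank_k_error(A, k):
    """OPT_k: squared Frobenius error of the best rank-k approximation, from the SVD."""
    sigma = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(sigma[k:] ** 2))

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))              # placeholder data matrix
S = [0, 1, 2, 3, 4]                             # placeholder subset of k = 5 columns
factor = cssp_error(A, S) / best_rank_k_error(A, len(S))
print(f"approximation factor Er_A(S) / OPT_k = {factor:.3f}")
```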

In terms of worst-case analysis, Deshpande et al. (2006) gave a randomized method which returns a set $S$ of size $k$ such that:

$$\mathbb{E}\big[\mathrm{Er}_A(S)\big] \;\le\; (k+1)\,\mathrm{OPT}_k. \qquad (1)$$

While the original algorithm was slow, efficient implementations have been provided since then (e.g., see Deshpande & Rademacher, 2010). The method belongs to the family of cardinality constrained determinantal point processes (see Definition 3), and will be denoted as k-DPP. The approximation factor $k+1$ is optimal in the worst case, since for any $k$ and $\epsilon > 0$, a matrix $A$ can be constructed for which $\mathrm{Er}_A(S) \ge (1-\epsilon)(k+1)\,\mathrm{OPT}_k$ for all subsets $S$ of size $k$. Yet it is known that, in practice, CSSP algorithms perform better than worst-case, so the question we consider is: how can we go beyond the usual worst-case analysis to accurately reflect what is possible in the CSSP?

Contributions. We provide improved guarantees for the CSSP approximation factor, which go beyond the worst-case analysis and which lead to surprising conclusions.

  1. New upper bounds: We develop a family of upper bounds on the CSSP approximation factor (Theorem 1), which we call the Master Theorem as they can be used to derive a number of new guarantees. In particular, we show that when the data matrix exhibits a known spectral decay, then (1) can often be drastically improved (Theorem 2).

  2. New lower bound: Even though the worst-case upper bound in (1) can often be loose, there are cases when it cannot be improved. We give a new lower bound construction (Theorem 3) showing that there are matrices for which multiple different subset sizes exhibit worst-case behavior.

  3. Multiple-descent curve: Our upper and lower bounds reveal that for some matrices the CSSP approximation factor can exhibit peaks and valleys as a function of the subset size (see Figure 1). We show that this phenomenon is an inherent property of the CSSP (Corollary 1), which leads us to a new connection with the recently discovered double descent curve (Belkin et al., 2019a; Dereziński et al., 2019b).

1.1 Main results

Our upper bounds rely on the notion of effective dimensionality called stable rank (Alaoui & Mahoney, 2015). Here, we use an extended version of this concept, as defined by Bartlett et al. (2019).

Definition 2 (Stable rank).

Let $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$ denote the eigenvalues of the matrix $A^\top A$. For $s \ge 0$, we define the stable rank of order $s$ as
$$\mathrm{sr}_s(A) \;=\; \frac{\sum_{i > s} \lambda_i}{\lambda_{s+1}}.$$
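A minimal sketch of this quantity, following the definition above (tail eigenvalue mass of $A^\top A$ divided by the $(s+1)$-st eigenvalue); the test matrix is a placeholder.

```python
import numpy as np

def stable_rank(A, s=0):
    """Stable rank of order s: sum of the eigenvalues of A^T A beyond index s,
    divided by the (s+1)-st eigenvalue (eigenvalues in decreasing order)."""
    lam = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
    return float(lam[s:].sum() / lam[s])

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 100))
print(stable_rank(A, s=0), stable_rank(A, s=10))   # s = 0 recovers the classic stable rank
```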

In the following result, we define a family of functions which bound the approximation factor over a range of subset sizes $k$ determined by the spectrum of $A$. We call this the Master Theorem because we use it to derive a number of more specific upper bounds.

Theorem 1 (Master Theorem).

Given , let , and suppose that , where . If -, then

where .

Note that we separated out the dependence on from the function , because the term is an artifact of a concentration of measure analysis that is unlikely to be of practical significance. In fact, we believe that the dependence on can be eliminated from the statement entirely (see Conjecture 1).

We next examine the consequences of the Master Theorem, starting with a sharp transition that occurs as the subset size $k$ approaches the stable rank of $A$.

Remark 1 (Sharp transition).

For any it is true that:

  1. For all , if , then there exists a subset of size such that .

  2. There is such that and for every subset of size we have .

Part 1 of the remark follows from the Master Theorem by setting , whereas part 2 follows from the lower bound of Guruswami & Sinop (2012). Observe how the worst-case approximation factor jumps from to , as approaches . An example of this sharp transition is shown in Figure 1, where the stable rank of is around .

While certain matrices directly exhibit the sharp transition from Remark 1, many do not. In particular, for matrices with a known rate of spectral decay, the Master Theorem can be used to provide improved guarantees on the CSSP approximation factor over all subset sizes.

To illustrate this, we give novel bounds for the two most commonly studied decay rates: polynomial and exponential.

Theorem 2 (Examples without sharp transition).

Let be the eigenvalues of . There is an absolute constant such that for any , with , if:
1. (polynomial spectral decay) , with , then - satisfies

2. (exponential spectral decay) , with , then - satisfies

Note that for polynomial decay, unlike in (1), the approximation factor is constant, i.e., it does not depend on . For exponential decay, our bound provides an improvement over (1) when . To illustrate how these types of bounds can be obtained from the Master Theorem, consider the function for some . The first term in the function, , decreases with , whereas the second term (the square root) increases, albeit at a slower rate. This creates a U-shaped curve which, if sufficiently wide, has a valley where the approximation factor can get arbitrarily close to 1. This will occur when is large, i.e., when the spectrum of has a relatively flat region after the th eigenvalue (Figure 1 for between 20 and 50). Note that a peak value of some function may coincide with a valley of some , so only taking a minimum over all functions reveals the true approximation landscape predicted by the Master Theorem. To prove Theorem 2, we show that the stable ranks are sufficiently large so that any lies in the valley of some function (see Section 4).

The peaks and valleys of the CSSP approximation factor suggested by Theorem 1 are in fact an inherent property of the problem, rather than an artifact of our analysis or the result of using a particular algorithm. We prove this by constructing a family of matrices for which the best possible approximation factor is large, i.e., close to the worst-case upper bound of Deshpande et al. (2006), not just for one size $k$, but for a sequence of increasing sizes.

Theorem 3 (Lower bound).

For any and , there is a matrix such that for any subset of size , where ,

Combining the Master Theorem with the lower bound of Theorem 3 we can easily provide an example matrix for which the optimal solution to the CSSP problem exhibits multiple peaks and valleys. We refer to this phenomenon as the multiple-descent curve.

Corollary 1 (Multiple-descent curve).

For and , there is a sequence and such that for any :

The multiple-descent phenomenon that emerges from our analysis bears similarity to the double descent curve described by Belkin et al. (2019a). This curve illustrates the sharp transition between the generalization error of under- and over-parameterized machine learning models as we change the ratio between the number of parameters and the number of samples. We further discuss this connection in Section 2.

1.2 The Nyström method

We briefly discuss how our results translate to guarantees for the Nyström method, a variant of the CSSP in the kernel setting which has gained considerable interest in the machine learning literature (Drineas & Mahoney, 2005; Gittens & Mahoney, 2016). In this context, rather than being given the column vectors explicitly, we consider the $n \times n$ kernel matrix $K$ whose $(i,j)$ entry is the dot product between the $i$th and $j$th data vector in the kernel space. A Nyström approximation of $K$ based on a subset $S$ is defined as $\hat K(S) = C\, W^+ C^\top$, where $W = K_{S,S}$ is the submatrix of $K$ indexed by $S$, $C = K_{[n],S}$ is the submatrix with columns indexed by $S$, and $W^+$ denotes the Moore-Penrose pseudoinverse. The Nyström method has numerous applications in machine learning, including kernel machines (Williams & Seeger, 2001), Gaussian Process regression (Burt et al., 2019) and Independent Component Analysis (Bach & Jordan, 2003).

Remark 2.

If $K = B^\top B$ and $\|\cdot\|_*$ denotes the trace norm, then
$$\big\|K - \hat K(S)\big\|_* \;=\; \mathrm{Er}_B(S).$$
Moreover, the trace norm error of the best rank $k$ approximation of $K$ is equal to the squared Frobenius norm error of the best rank $k$ approximation of $B$, i.e.,
$$\min_{\mathrm{rank}(\tilde K)\le k} \|K - \tilde K\|_* \;=\; \min_{\mathrm{rank}(\tilde B)\le k} \|B - \tilde B\|_F^2.$$

This connection was used by Belabbas & Wolfe (2009) to adapt the approximation factor bound of Deshpande et al. (2006) to the Nyström method. Similarly, all of our results for the CSSP, including the multiple-descent curve that we have observed, can be translated into analogous statements for the trace norm approximation error in the Nyström method. Of particular interest are the improved bounds for kernel matrices with known eigenvalue decay rates. Such matrices arise naturally in machine learning when using standard kernel functions such as the Gaussian Radial Basis Function (RBF) kernel (a.k.a. the squared exponential kernel) and the Matérn kernel (Burt et al., 2019).
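As a quick numerical check of this connection (illustrative only; the matrices below are placeholders), the following sketch forms the Nyström approximation $\hat K(S) = C\,W^+ C^\top$ and compares its trace norm error with the CSSP projection error of $B$ when $K = B^\top B$.

```python
import numpy as np

def nystrom_error(K, S):
    """Trace norm error of the Nystrom approximation K_hat(S) = C W^+ C^T.
    Since K - K_hat(S) is p.s.d., its trace norm equals its trace."""
    S = list(S)
    C = K[:, S]
    W = K[np.ix_(S, S)]
    K_hat = C @ np.linalg.pinv(W) @ C.T
    return float(np.trace(K - K_hat))

def cssp_error(B, S):
    """Squared Frobenius error of projecting B onto the span of the columns in S."""
    Q, _ = np.linalg.qr(B[:, list(S)])
    return float(np.linalg.norm(B - Q @ (Q.T @ B), "fro") ** 2)

rng = np.random.default_rng(0)
B = rng.standard_normal((30, 20))
K = B.T @ B                                   # kernel matrix induced by the columns of B
S = [0, 3, 7, 11, 15]
print(nystrom_error(K, S), cssp_error(B, S))  # the two errors should agree
```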

RBF kernel: If and the data comes from , then, for large , , where , and (Santa et al., 1997), so Theorem 2 yields an approximation factor of , better than when . Note that the parameter defines the size of a neighborhood around which the data points are deemed similar by the RBF kernel. Therefore, smaller means that each data point has fewer similar neighbors.
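To see this effect of the RBF parameter empirically, the following illustrative sketch builds an RBF kernel matrix on synthetic Gaussian data and prints consecutive eigenvalue ratios for a few bandwidths; the convention $K_{ij} = \exp(-\|x_i - x_j\|^2/\sigma^2)$ and the data here are assumptions for the sake of the example, with smaller bandwidths flattening the spectrum and larger ones sharpening the decay.

```python
import numpy as np

def rbf_kernel(X, sigma):
    """Gaussian RBF kernel matrix K_ij = exp(-||x_i - x_j||^2 / sigma^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))             # synthetic Gaussian data
for sigma in (0.5, 2.0, 8.0):
    lam = np.sort(np.linalg.eigvalsh(rbf_kernel(X, sigma)))[::-1]
    print(f"sigma={sigma}: eigenvalue ratios", np.round(lam[1:6] / lam[:5], 3))
```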

Matérn kernel: If is the Matérn kernel with parameters and and the data is distributed according to a uniform measure in one dimension, then (Rasmussen & Williams, 2006), so Theorem 2 yields a Nyström approximation factor of for any subset size .

In Section 6, we also empirically demonstrate our improved guarantees and the multiple-descent curve for the Nyström method with the RBF kernel.

2 Related work

The Column Subset Selection Problem is one of the most classical tasks in matrix approximation (Boutsidis et al., 2008). The original version of the problem compares the projection error of a subset of size to the best rank approximation error. The techniques used for finding good subsets have included many randomized methods (Deshpande et al., 2006; Boutsidis et al., 2008; Belhadji et al., 2018), as well as deterministic methods (Gu & Eisenstat, 1996). Later on, most works have relaxed the problem formulation by allowing the number of selected columns to exceed the rank . These approaches include deterministic sparsification based algorithms (Boutsidis et al., 2011), greedy selection (e.g., Altschuler et al., 2016) and randomized methods (e.g., Drineas et al., 2008; Guruswami & Sinop, 2012; Paul et al., 2015). Note that we study the original version of the CSSP (i.e., without the relaxation), where the number of columns must be equal to the rank .

The Nyström method has been given significant attention independently of the CSSP. The guarantees most comparable to our setting are due to Belabbas & Wolfe (2009), who show the approximation factor for the trace norm error. Many recent works allow the subset size to exceed the target rank , which enables the use of i.i.d. sampling techniques such as leverage scores (Gittens & Mahoney, 2016) and ridge leverage scores (Alaoui & Mahoney, 2015; Musco & Musco, 2017). In addition to the trace norm error, these works consider other types of guarantees, e.g., based on spectral and Frobenius norms, which are not as readily comparable to the CSSP error bounds.

The double descent curve was introduced by Belkin et al. (2019a)

to explain the remarkable success of machine learning models which generalize well despite having more parameters than training data. This research has been primarily motivated by the success of deep neural networks, but double descent has also been observed in linear regression

(Belkin et al., 2019b; Bartlett et al., 2019; Dereziński et al., 2019b) and other learning models. Double descent occurs when we plot the generalization error as a function of the number of parameters used in the learning model. The definition of generalization error relies on assuming a probabilistic generative model of the data. Importantly, our setting is different in that it is a deterministic combinatorial optimization problem. In particular, Corollary 1 shows that our multiple-descent curve can occur as a purely deterministic property of the optimal CSSP solution.

Determinantal point processes have been shown to provide near-optimal guarantees not only for the CSSP but also other tasks in numerical linear algebra, such as least squares regression (e.g., Avron & Boutsidis, 2013; Dereziński & Warmuth, 2018; Dereziński et al., 2019). They are also used in recommender systems, stochastic optimization and other tasks in machine learning (for a review, see Kulesza & Taskar, 2012). Efficient algorithms for sampling from these distributions have been proposed both in the CSSP setting (i.e., given matrix ; see, e.g., Deshpande & Rademacher, 2010; Dereziński, 2019) and in the Nyström setting (i.e., given kernel ; see, e.g., Anari et al., 2016; Dereziński et al., 2019). The term “cardinality constrained DPP” (also known as a “k-DPP” or “volume sampling”) was introduced by Kulesza & Taskar (2011) to differentiate from standard DPPs which have random cardinality. Our proofs rely in part on converting DPP bounds to k-DPP bounds via a refinement of the concentration of measure argument used by Dereziński et al. (2019a).

3 Determinantal point processes

Since our main results rely on randomized subset selection via determinantal point processes (DPPs), we provide a brief overview of the relevant aspects of this class of distributions. First introduced by Macchi (1975), a determinantal point process is a probability distribution over subsets $S \subseteq [n]$, where we use $[n]$ to denote the set $\{1, \dots, n\}$. The relative probability of a subset being drawn is governed by a positive semidefinite (p.s.d.) matrix $L$, as stated in the definition below, where we use $L_{S,S}$ to denote the submatrix of $L$ with rows and columns indexed by $S$.

Definition 3.

For an $n \times n$ p.s.d. matrix $L$, define $\mathrm{DPP}(L)$ as a distribution over all subsets $S \subseteq [n]$ so that
$$\Pr(S) \;=\; \frac{\det(L_{S,S})}{\det(I + L)}.$$
A restriction to subsets of size $k$ is denoted as $k\text{-}\mathrm{DPP}(L)$.
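For intuition, here is a tiny brute-force sketch (exponential in n, purely illustrative, with a placeholder kernel) that enumerates all subsets, computes the determinantal probabilities of a DPP and of its fixed-size restriction, and checks the normalization.

```python
import numpy as np
from itertools import combinations

def dpp_probabilities(L):
    """Brute-force DPP(L): Pr(S) = det(L_{S,S}) / det(I + L), over all subsets of [n]."""
    n = L.shape[0]
    Z = np.linalg.det(np.eye(n) + L)
    probs = {}
    for size in range(n + 1):
        for S in combinations(range(n), size):
            det_S = np.linalg.det(L[np.ix_(S, S)]) if size > 0 else 1.0
            probs[S] = det_S / Z
    return probs

def k_dpp_probabilities(L, k):
    """Brute-force k-DPP(L): the DPP conditioned on |S| = k."""
    probs = {S: p for S, p in dpp_probabilities(L).items() if len(S) == k}
    total = sum(probs.values())
    return {S: p / total for S, p in probs.items()}

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 5))
L = B.T @ B                                   # a 5 x 5 p.s.d. kernel
print(sum(dpp_probabilities(L).values()))     # ~1.0 (normalization check)
print(k_dpp_probabilities(L, k=2))
```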

DPPs can be used to introduce diversity in the selected set or to model the preference for selecting dissimilar items, where the similarity is stated by the kernel matrix . DPPs are commonly used in many machine learning applications where these properties are desired, e.g., recommender systems (Warlop et al., 2019), model interpretation (Kim et al., 2016), text and video summarization (Gong et al., 2014), and others (Kulesza & Taskar, 2012).

Given a p.s.d. matrix $L$ with eigenvalues $\lambda_1, \dots, \lambda_n$, the size of the set $S \sim \mathrm{DPP}(L)$ is distributed as a Poisson binomial random variable, namely, the number of successes in $n$ Bernoulli random trials where the probability of success in the $i$th trial is given by $\lambda_i/(\lambda_i + 1)$. This leads to a simple expression for the expected subset size:
$$\mathbb{E}\big[|S|\big] \;=\; \sum_{i=1}^{n} \frac{\lambda_i}{\lambda_i + 1}. \qquad (2)$$

Note that if $L = \sigma^{-2} A^\top A$, where $\sigma > 0$, then $\det(L_{S,S})$ is proportional to $\det\big((A^\top A)_{S,S}\big)$ with a factor that depends only on $|S|$, so rescaling the kernel by a scalar only affects the distribution of the subset sizes, giving us a way to set the expected size to a desired value (larger $\sigma$ means smaller expected size). Nevertheless, it is still often preferable to restrict the size of $S$ to a fixed $k$, obtaining a $k$-DPP (Kulesza & Taskar, 2011).
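To illustrate how the rescaling controls the expected subset size via (2), here is a small sketch (with a placeholder matrix; `calibrate_sigma` is an illustrative helper, not from the paper) that uses bisection to match a target expected size.

```python
import numpy as np

def expected_dpp_size(A, sigma):
    """Expected sample size of DPP(A^T A / sigma^2): sum_i lambda_i / (lambda_i + sigma^2),
    where lambda_i are the eigenvalues of A^T A."""
    lam = np.linalg.eigvalsh(A.T @ A)
    return float(np.sum(lam / (lam + sigma ** 2)))

def calibrate_sigma(A, target_size, lo=1e-6, hi=1e6, tol=1e-8):
    """Bisection in sigma: the expected size is decreasing in sigma."""
    while hi - lo > tol * hi:
        mid = np.sqrt(lo * hi)
        if expected_dpp_size(A, mid) > target_size:
            lo = mid                      # expected size too large -> increase sigma
        else:
            hi = mid
    return np.sqrt(lo * hi)

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 60))
sigma = calibrate_sigma(A, target_size=10)
print(sigma, expected_dpp_size(A, sigma))   # expected size ~10
```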

Both DPPs and k-DPPs can be sampled efficiently, with some of the first algorithms provided by Hough et al. (2006), Deshpande & Rademacher (2010), Kulesza & Taskar (2011) and others. These approaches rely on an eigendecomposition of the kernel, at a cost of $O(n^3)$. When the kernel is given as $A^\top A$, as in the CSSP, and the dimensions of $A$ satisfy $m \ll n$, then this can be improved to $O(n m^2)$. More recently, algorithms that avoid computing the eigendecomposition have been proposed (Anari et al., 2016; Dereziński et al., 2019; Dereziński, 2019), with further improved running times both when given the matrix $A$ and when given the kernel $K$, assuming a small desired subset size. See Gautier et al. (2019) for an efficient Python implementation of DPP sampling.
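For completeness, a minimal usage sketch of the DPPy library referenced above; the calls shown (FiniteDPP with a likelihood kernel and sample_exact_k_dpp) follow DPPy's documented interface, but this is only a sketch and the library's documentation should be treated as authoritative.

```python
import numpy as np
from dppy.finite_dpps import FiniteDPP      # pip install dppy

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))
L = A.T @ A                                 # likelihood kernel for the CSSP setting

dpp = FiniteDPP('likelihood', **{'L': L})
dpp.sample_exact_k_dpp(size=5)              # draw one subset of exactly 5 column indices
print(dpp.list_of_samples[-1])
```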

The key property of DPPs that enables our analysis is a formula for the expected value of the random matrix that is the orthogonal projection onto the subspace spanned by the vectors selected by the DPP. In the special case when $A$ is a square full rank matrix, the following result can be derived as a corollary of Theorem 1 by Mutný et al. (2019), and a variant for DPPs over continuous domains can be found as Lemma 8 of Dereziński et al. (2019b). For completeness, we also provide a proof in Appendix A.

Lemma 1.

For any $A$ and $\sigma > 0$, let $P_S$ be the projection onto $\mathrm{span}\{a_i : i \in S\}$. If $S \sim \mathrm{DPP}(\sigma^{-2} A^\top A)$, then
$$\mathbb{E}\big[P_S\big] \;=\; A A^\top \big(A A^\top + \sigma^2 I\big)^{-1}.$$

Lemma 1 implies a simple closed form expression for the expected error in the CSSP. Here, we use a rescaling parameter $\sigma$ for controlling the distribution of the subset sizes. Note that it is crucial that we are using a DPP with random subset size, because the corresponding expression for the expected error of the fixed size k-DPP is combinatorial, and therefore much harder to work with.

Lemma 2.

For any $\sigma > 0$, if $S \sim \mathrm{DPP}(\sigma^{-2} A^\top A)$, then
$$\mathbb{E}\big[\mathrm{Er}_A(S)\big] \;=\; \sigma^2 \sum_{i} \frac{\lambda_i}{\lambda_i + \sigma^2},$$
where $\lambda_1 \ge \lambda_2 \ge \dots$ are the eigenvalues of $A^\top A$.

Proof.

Using Lemma 1, the expected loss is given by:
$$\mathbb{E}\big[\mathrm{Er}_A(S)\big]
= \mathbb{E}\Big[\mathrm{tr}\big(A^\top (I - P_S)\, A\big)\Big]
= \mathrm{tr}\Big(A^\top \big(I - \mathbb{E}[P_S]\big) A\Big)
= \sigma^2\, \mathrm{tr}\Big(A^\top \big(A A^\top + \sigma^2 I\big)^{-1} A\Big)
\overset{(*)}{=} \sigma^2\, \mathrm{tr}\Big(\big(A^\top A + \sigma^2 I\big)^{-1} A^\top A\Big)
= \sigma^2 \sum_i \frac{\lambda_i}{\lambda_i + \sigma^2},$$
where $(*)$ follows from the matrix identity $A^\top (A A^\top + \sigma^2 I)^{-1} = (A^\top A + \sigma^2 I)^{-1} A^\top$. ∎

The challenge in using the above formula for the expected error is that it applies to a DPP with a randomized subset size rather than a fixed size k-DPP. Therefore, the random subset with some positive probability has cardinality much greater than $k$. Our strategy in addressing this is to choose the rescaling parameter $\sigma$ so that the subset size is bounded by $k$ with sufficiently high probability, via a concentration of measure argument, as discussed in the following section.
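As a brute-force sanity check on the closed form in Lemma 2 (illustrative only; exhaustive enumeration over a tiny placeholder matrix), the exact expectation of the projection error under $\mathrm{DPP}(\sigma^{-2}A^\top A)$ can be compared against $\sigma^2 \sum_i \lambda_i/(\lambda_i+\sigma^2)$.

```python
import numpy as np
from itertools import combinations

def cssp_error(A, S):
    if len(S) == 0:
        return float(np.linalg.norm(A, "fro") ** 2)
    Q, _ = np.linalg.qr(A[:, list(S)])
    return float(np.linalg.norm(A - Q @ (Q.T @ A), "fro") ** 2)

def expected_error_bruteforce(A, sigma):
    """E[Er_A(S)] under DPP(A^T A / sigma^2), by enumerating all subsets."""
    n = A.shape[1]
    L = (A.T @ A) / sigma ** 2
    Z = np.linalg.det(np.eye(n) + L)
    total = 0.0
    for size in range(n + 1):
        for S in combinations(range(n), size):
            p = (np.linalg.det(L[np.ix_(S, S)]) if size > 0 else 1.0) / Z
            total += p * cssp_error(A, S)
    return total

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
sigma = 1.5
lam = np.linalg.eigvalsh(A.T @ A)
closed_form = sigma ** 2 * np.sum(lam / (lam + sigma ** 2))
print(expected_error_bruteforce(A, sigma), closed_form)   # the two values should match
```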

4 Upper bounds

In this section, we derive the upper bound given in Theorem 1 by using the expectation formula for the squared projection error of a DPP (Lemma 2). We then show how this result can be used to obtain improved guarantees for matrices with known eigenvalue decays, i.e., Theorem 2.

Recall that both the expected error formula and the expected subset size of $S \sim \mathrm{DPP}(\sigma^{-2}A^\top A)$ depend on the rescaling parameter $\sigma$, and our analysis relies on a careful selection of this parameter. To illustrate this, consider setting it to $\sigma^2 = \sum_{i>k}\lambda_i$, where $\lambda_1 \ge \lambda_2 \ge \dots$ are the eigenvalues of $A^\top A$ in decreasing order. Now, (2) implies that:
$$\mathbb{E}\big[|S|\big] \;=\; \sum_i \frac{\lambda_i}{\lambda_i + \sigma^2} \;\le\; k + \frac{1}{\sigma^2}\sum_{i>k}\lambda_i \;=\; k+1.$$

Together with Lemma 2, this recovers the upper bound of Deshpande et al. (2006), since $\mathbb{E}\big[\mathrm{Er}_A(S)\big] = \sigma^2\,\mathbb{E}\big[|S|\big] \le (k+1)\sum_{i>k}\lambda_i = (k+1)\,\mathrm{OPT}_k$, except that the subset size is randomized with expectation bounded by $k+1$, instead of a fixed subset size equal to $k$. However, a more refined choice of the parameter $\sigma$ allows us to significantly improve on the above error bound in certain regimes, as shown below.

Lemma 3.

For any , and , where , suppose that for and . Then:

where .

Note that, setting , the above lemma implies that we can achieve approximation factor with a DPP whose expected size is bounded by . We introduce so that we can convert the bound from DPP to the fixed size k-DPP via a concentration argument. Intuitively, our strategy is to show that the randomized subset size of a DPP is sufficiently concentrated around its expectation that with high probability it will be bounded by , and for this we need the expectation to be strictly below . A careful application of the Chernoff bound for a Poisson binomial random variable yields the following concentration bound.

Lemma 4.

Let be sampled as in Lemma 3 with . If , then .

Finally, any expected bound for random size DPPs can be converted to an expected bound for a fixed size k-DPP via the following result.

Lemma 5.

For any , and , if and -, then

The above inequality may seem intuitively obvious since adding more columns to a set to complete it to size always reduces the error. However, a priori, it could happen that going from subsets of size to subsets of size results in a redistribution of probabilities to the subsets with larger error. To show that this will not happen, our proof relies on classic but non-trivial combinatorial bounds called Newton’s inequalities. Putting together Lemmas 3, 4 and 5, we obtain our Master Theorem.

Proof of Theorem 1  Let be sampled as in Lemma 3, and let -. We have:

where follows from Lemma 5 and follows from Lemmas 3 and 4. Since , we have , which completes the proof.  

Figure 2: Illustration of the upper bound functions for different values of , with a matrix such that the th eigenvalue of is set to: (top) ; (bottom) for and for . For each function, we marked the window of applicable ’s with a horizontal line. For polynomial spectral decay (top), the stable rank (i.e., the width of the window starting at ) increases, while for the sharp spectrum drop (bottom) the stable rank shrinks as the window approaches the drop, causing a peak in the upper bound.

We now demonstrate how Theorem 1 can be used as the Master Theorem to derive new bounds on the CSSP approximation factor under additional assumptions on the singular value decay of the matrix $A$. Rather than a single upper bound, Theorem 1 provides a family of upper bounds, each applicable to a range of subset sizes $k$. Since each bound forms a U-shaped curve, its smallest point falls near the middle of that range. In Figure 2 we visualize these bounds as a sliding window that sweeps across the axis representing possible subset sizes. The width of the window varies: when it starts at $s$, its width is the stable rank $\mathrm{sr}_s(A)$. The wider the window, the lower the valley of the corresponding U-curve. Thus, when bounding the approximation factor for a given $k$, we should choose the widest window such that $k$ falls near the bottom of its U-curve. Showing a guarantee that holds for all $k$ requires lower-bounding the stable ranks for each $s$. This is straightforward for both polynomial and exponential decay. Specifically, using the notation from Theorem 2, in Appendix C we prove that:

As an example, Figure 2 (top) shows that the stable rank, i.e., the width of the window starting at $s$, grows linearly with $s$ for polynomially decaying eigenvalues. As a result, the bottom of each U-shaped curve remains at roughly the same level, making the CSSP approximation factor independent of $k$, as in Theorem 2. In contrast, Figure 2 (bottom) provides the same plot for a different matrix with a sharp drop in the spectrum. The U-shaped curves cannot slide smoothly across that drop because of the shrinking stable ranks, which results in a peak similar to the ones observed in Figure 1.

5 Lower bound

As discussed in the previous section, our upper bounds for the CSSP approximation factor exhibit a peak (a high point, with the bound decreasing on either side) around a subset size $k$ when there is a sharp drop in the spectrum of $A$ around the $k$th singular value. It is natural to ask whether this peak is an artifact of our analysis, or a property of the k-DPP distribution, or whether even optimal CSSP subsets exhibit this phenomenon. In this section, we extend a lower bound construction of Deshpande et al. (2006) and use it to show that for certain matrices the approximation factor of the optimal CSSP subset, i.e., $\min_{S:\,|S|=k} \mathrm{Er}_A(S)\,/\,\mathrm{OPT}_k$, can exhibit not just one but any number of peaks as a function of $k$, showing that the multiple-descent curve from Figure 1 describes an inherent phenomenon in the CSSP.

The lower bound construction of Deshpande et al. (2006) relies on arranging the column vectors of a matrix into a centered symmetric $k$-dimensional simplex. This way, the columns span a $k$-dimensional subspace which contains the leading singular vectors of the matrix. They then proceed to shift the columns slightly in the direction orthogonal to that subspace so that the $(k+1)$-st singular value becomes non-zero. This results in an instance of the CSSP with a sharp drop in the spectrum. Due to the symmetry in this construction, all subsets of size $k$ have an identical squared projection error. It is easy to show that this error satisfies $\mathrm{Er}_A(S) \ge (1-\epsilon)(k+1)\,\mathrm{OPT}_k$, where $\epsilon$ is a parameter which depends on the condition number of the matrix and can be driven arbitrarily close to 0. Another variant of this construction was also provided by Guruswami & Sinop (2012). The key limitation of both of these constructions is that they only provide a lower bound for a single subset size in a given matrix, whereas our goal is to show that the CSSP can exhibit the multiple-descent curve, which requires lower bounds for multiple different values of $k$ holding with respect to the same matrix $A$.

Our strategy for constructing the lower bound matrix is to concatenate together multiple sets of columns, each of which represents a simplex spanning some subspace of . The key challenge that we face in this approach is that, unlike in the construction of Deshpande et al. (2006), different subsets of the same size will have different projection errors. Nevertheless, we are able to lower bound these errors.

Lemma 6.

Fix and consider unit vectors in general position, where , , such that for each , and for any , if then is orthogonal to . Also, let unit vectors be orthogonal to each other and to all . There are positive scalars for such that matrix with columns over all and satisfies:

Proof of Theorem 3  We let and then for we set . We then construct the vectors that satisfy Lemma 6 by letting each set be the corners of a centered -dimensional regular simplex. We ensure that each simplex is orthogonal to every other simplex by placing them in orthogonal subspaces.  
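For intuition about the geometric ingredient of this construction, one standard way to generate the corners of a centered regular simplex (a sketch of the idea only, not the paper's exact scalings) is to center and normalize the standard basis vectors.

```python
import numpy as np

def centered_simplex(k):
    """Return k+1 unit vectors in R^(k+1) forming a centered regular k-dimensional simplex:
    equal norms, equal pairwise inner products, and the corners sum to zero."""
    E = np.eye(k + 1)
    V = E - E.mean(axis=0)                   # subtract the centroid of the standard basis
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    return V                                 # rows are the simplex corners

V = centered_simplex(4)
print(np.round(V @ V.T, 3))                  # ones on the diagonal, constant off-diagonal
print(np.round(V.sum(axis=0), 10))           # ~0: the corners are centered
```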

We also use Lemma 6 in Appendix E to construct a matrix which exhibits the multiple-descent curve (Corollary 1).

6 Empirical evaluation

Figure 3: Top three plots show the Nyström approximation factor, where the subset is sampled from a k-DPP (experiments using greedy selection instead of a k-DPP are in Appendix F), for a toy dataset (with varying condition number) and two Libsvm datasets (with varying RBF parameter $\sigma$). Error bars show three times the standard error of the mean over 1000 trials. Bottom three plots show the spectral decay for the top 40 eigenvalues of each kernel $K$. Note that the peaks in the approximation factor align with the drops in the spectrum.

In this section, we provide an empirical evaluation designed to demonstrate how our improved guarantees for the CSSP and Nyström method, as well as the multiple-descent phenomenon, can be easily observed on real datasets. We use a standard experimental setup for data subset selection using the Nyström method (Gittens & Mahoney, 2016), where an $n \times n$ kernel matrix $K$ for a dataset $x_1, \dots, x_n$ of size $n$ is defined so that the $(i,j)$ entry is computed using the Gaussian Radial Basis Function (RBF) kernel: $K_{ij} = \exp\!\big(-\|x_i - x_j\|^2/\sigma^2\big)$, where $\sigma > 0$ is a free parameter. We are particularly interested in the effect of varying $\sigma$. Nyström subset selection is performed using a $k$-DPP (Definition 3), and we plot the expected approximation factor $\mathbb{E}\big[\|K - \hat K(S)\|_*\big] \big/ \|K - K_k\|_*$ (averaged over 1000 runs), where $\hat K(S)$ is the Nyström approximation of $K$ based on the subset $S$ (see Section 1.2), $\|\cdot\|_*$ is the trace norm, and $\|K - K_k\|_*$ is the trace norm error of the best rank $k$ approximation $K_k$. Additional experiments, using greedy selection instead of a k-DPP, are in Appendix F. As discussed in Section 1.2, this task is equivalent to the CSSP task defined on a matrix $B$ such that $B^\top B = K$.

The aim of our empirical evaluation is to verify the following two claims motivated by our theory (and to illustrate that doing so is as easy as varying the RBF parameter ):

  1. When the spectral decay is sufficiently slow/smooth, the approximation factor for CSSP/Nyström is much better than suggested by previous worst-case bounds.

  2. A drop in the spectrum around the $k$th eigenvalue results in a peak in the approximation factor near subset size $k$. Several drops result in the multiple-descent curve.

In Figure 3 (top), we plot the approximation factor against the subset size $k$ (in the range of 1 to 40) for an artificial toy dataset and for two benchmark regression datasets from the Libsvm repository (bodyfat and eunite2001, see Chang & Lin, 2011). The toy dataset is constructed by scaling the eigenvalues of a random Gaussian matrix so that the spectrum is flat with a single drop. For each dataset, in Figure 3 (bottom), we also show the top 40 eigenvalues of the kernel in decreasing order. For the toy dataset, to maintain full control over the spectrum we use the linear kernel, and we show results for three different values of the condition number of the kernel $K$. For the benchmark datasets, we show results on the RBF kernel with three different values of the parameter $\sigma$.
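A sketch of one way to build such a toy kernel (illustrative; the paper's exact drop location and condition number values are not specified here, so the parameters below are placeholders): draw a random orthogonal eigenbasis and assign a flat spectrum with a single drop.

```python
import numpy as np

def toy_kernel(n=60, drop_at=20, condition_number=100.0, seed=0):
    """Kernel with a flat spectrum except for a single drop after `drop_at` eigenvalues:
    the top eigenvalues equal `condition_number`, the rest equal 1."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal eigenbasis
    eigvals = np.ones(n)
    eigvals[:drop_at] = condition_number
    return (Q * eigvals) @ Q.T                          # Q diag(eigvals) Q^T

K = toy_kernel()
lam = np.sort(np.linalg.eigvalsh(K))[::-1]
print(np.round(lam[:25], 2))   # flat at `condition_number`, then a drop to 1
```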

Examining the toy dataset (Figure 3, left), it is apparent that a larger drop in the spectrum leads to a sharper peak in the approximation factor as a function of the subset size $k$, whereas a flat spectrum results in the approximation factor staying close to 1. A similar trend is observed for the dataset bodyfat (Figure 3, center), where a large parameter $\sigma$ results in a peak that is aligned with a spectrum drop, while decreasing $\sigma$ makes the spectrum flatter and the factor closer to 1. Finally, the dataset eunite2001 (Figure 3, right) exhibits a full multiple-descent curve with up to three peaks for large values of $\sigma$, and the peaks are once again aligned with the spectrum drops. Decreasing $\sigma$ gradually eliminates the peaks, resulting in a uniformly small approximation factor. Thus, both of our theoretical claims can easily be verified on this dataset simply by adjusting the RBF parameter.

While the right choice of the parameter $\sigma$ ultimately depends on the downstream machine learning task, it has been observed that varying $\sigma$ has a pronounced effect on the spectral properties of the kernel matrix (see, e.g., Gittens & Mahoney, 2016; Lawlor et al., 2016; Wang et al., 2019). The main takeaway from our results here is that, depending on the structure of the problem, we may end up in the regime where the Nyström approximation factor exhibits a multiple-descent curve (e.g., due to a hierarchical nature of the data) or in the regime where it is relatively flat.

7 Conclusions and open problems

We derived new guarantees for the Column Subset Selection Problem (CSSP) and the Nyström method, going beyond worst-case analysis by exploiting the structural properties of a dataset, e.g., when the spectrum exhibits a known rate of decay. Our upper and lower bounds for the CSSP/Nyström approximation factor reveal an intriguing phenomenon we call the multiple-descent curve: the approximation factor can exhibit a highly non-monotonic behavior as a function of the subset size $k$, with multiple peaks and valleys. These observations suggest a connection to the double descent curve exhibited by the generalization error of many machine learning models (see Section 2). This new connection is remarkable, since, unlike generalization error, the CSSP approximation factor is a deterministic objective in a combinatorial optimization problem without any underlying statistical model.

Our analysis technique relies on converting an error bound from random-size DPPs to fixed-size k-DPPs, which results in an additional constant factor of in Theorem 1. We put forward a conjecture which would eliminate this factor from Theorem 1 and is of independent interest to the study of elementary symmetric polynomials, a classical topic in combinatorics (Hardy et al., 1952).

Conjecture 1.

The following function is convex with respect to for any positive sequence :

Deshpande et al. (2006) showed that if - and are the eigenvalues of , then . If is convex then Jensen’s inequality implies:

where is chosen so that . This would allow us to use the bound from Lemma 3 directly on a k-DPP without relying on the concentration argument of Lemma 4, thereby improving the bounds in Theorems 1 and 2.

Acknowledgements

MWM would like to acknowledge ARO, DARPA, NSF and ONR for providing partial support of this work. Also, the authors thank the NSF for funding via the NSF TRIPODS program.

References

  • Alaoui & Mahoney (2015) Alaoui, A. E. and Mahoney, M. W. Fast randomized kernel ridge regression with statistical guarantees. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pp. 775–783, Montreal, Canada, December 2015.
  • Altschuler et al. (2016) Altschuler, J., Bhaskara, A., Fu, G., Mirrokni, V., Rostamizadeh, A., and Zadimoghaddam, M. Greedy column subset selection: New bounds and distributed algorithms. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 2539–2548, New York, New York, USA, 20–22 Jun 2016. PMLR.
  • Anari et al. (2016) Anari, N., Gharan, S. O., and Rezaei, A. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In Feldman, V., Rakhlin, A., and Shamir, O. (eds.), 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pp. 103–115, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR.
  • Avron & Boutsidis (2013) Avron, H. and Boutsidis, C. Faster subset selection for matrices and applications. SIAM Journal on Matrix Analysis and Applications, 34(4):1464–1499, 2013.
  • Bach & Jordan (2003) Bach, F. R. and Jordan, M. I. Kernel independent component analysis. J. Mach. Learn. Res., 3:1–48, March 2003. ISSN 1532-4435.
  • Balzano et al. (2010) Balzano, L., Recht, B., and Nowak, R. High-dimensional matched subspace detection when data are missing. In 2010 IEEE International Symposium on Information Theory, pp. 1638–1642. IEEE, 2010.
  • Bartlett et al. (2019) Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. Benign overfitting in linear regression. Technical Report Preprint: arXiv:1906.11300, 2019.
  • Belabbas & Wolfe (2009) Belabbas, M.-A. and Wolfe, P. J. Spectral methods in machine learning and new strategies for very large datasets. Proceedings of the National Academy of Sciences, 106(2):369–374, 2009. ISSN 0027-8424. doi: 10.1073/pnas.0810600105.
  • Belhadji et al. (2018) Belhadji, A., Bardenet, R., and Chainais, P. A determinantal point process for column subset selection. arXiv e-prints, art. arXiv:1812.09771, Dec 2018.
  • Belkin et al. (2019a) Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. Natl. Acad. Sci. USA, 116:15849–15854, 2019a.
  • Belkin et al. (2019b) Belkin, M., Hsu, D., and Xu, J. Two models of double descent for weak features. arXiv preprint arXiv:1903.07571, 2019b.
  • Boutsidis et al. (2008) Boutsidis, C., Mahoney, M., and Drineas, P. An improved approximation algorithm for the column subset selection problem. Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, 12 2008.
  • Boutsidis et al. (2011) Boutsidis, C., Drineas, P., and Magdon-Ismail, M. Near optimal column-based matrix reconstruction. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, pp. 305–314, Oct 2011.
  • Burt et al. (2019) Burt, D., Rasmussen, C. E., and Van Der Wilk, M. Rates of convergence for sparse variational Gaussian process regression. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 862–871, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
  • Chan & Hansen (1992) Chan, T. F. and Hansen, P. C. Some applications of the rank revealing QR factorization. SIAM Journal on Scientific and Statistical Computing, 13(3):727–741, 1992.
  • Chang & Lin (2011) Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
  • Chung & Lu (2006) Chung, F. and Lu, L. Complex Graphs and Networks (Cbms Regional Conference Series in Mathematics). American Mathematical Society, Boston, MA, USA, 2006. ISBN 0821836579.
  • Dereziński (2019) Dereziński, M. Fast determinantal point processes via distortion-free intermediate sampling. In Beygelzimer, A. and Hsu, D. (eds.), Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pp. 1029–1049, Phoenix, USA, 25–28 Jun 2019.
  • Dereziński & Warmuth (2018) Dereziński, M. and Warmuth, M. K. Reverse iterative volume sampling for linear regression. Journal of Machine Learning Research, 19(23):1–39, 2018.
  • Dereziński et al. (2019) Dereziński, M., Calandriello, D., and Valko, M. Exact sampling of determinantal point processes with sublinear time preprocessing. In Wallach, H., Larochelle, H., Beygelzimer, A., d Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 11542–11554. Curran Associates, Inc., 2019.
  • Dereziński et al. (2019) Dereziński, M., Clarkson, K. L., Mahoney, M. W., and Warmuth, M. K. Minimax experimental design: Bridging the gap between statistical and worst-case approaches to least squares regression. In Beygelzimer, A. and Hsu, D. (eds.), Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pp. 1050–1069, Phoenix, USA, 25–28 Jun 2019.
  • Dereziński et al. (2019a) Dereziński, M., Liang, F., and Mahoney, M. W. Bayesian experimental design using regularized determinantal point processes. arXiv e-prints, art. arXiv:1906.04133, Jun 2019a.
  • Dereziński et al. (2019b) Dereziński, M., Liang, F., and Mahoney, M. W. Exact expressions for double descent and implicit regularization via surrogate random design. arXiv e-prints, art. arXiv:1912.04533, Dec 2019b.
  • Deshpande & Rademacher (2010) Deshpande, A. and Rademacher, L. Efficient volume sampling for row/column subset selection. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pp. 329–338, Las Vegas, USA, October 2010.
  • Deshpande et al. (2006) Deshpande, A., Rademacher, L., Vempala, S., and Wang, G. Matrix approximation and projective clustering via volume sampling. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, pp. 1117–1126, Miami, FL, USA, January 2006.
  • Drineas & Mahoney (2005) Drineas, P. and Mahoney, M. W. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.
  • Drineas et al. (2008) Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881, 2008.
  • Gautier et al. (2019) Gautier, G., Polito, G., Bardenet, R., and Valko, M. DPPy: DPP sampling with Python. Journal of Machine Learning Research - Machine Learning Open Source Software (JMLR-MLOSS), in press, 2019. Code at http://github.com/guilgautier/DPPy/; documentation at http://dppy.readthedocs.io/.
  • Gittens & Mahoney (2016) Gittens, A. and Mahoney, M. W. Revisiting the Nyström method for improved large-scale machine learning. J. Mach. Learn. Res., 17(1):3977–4041, January 2016. ISSN 1532-4435.
  • Gong et al. (2014) Gong, B., Chao, W.-L., Grauman, K., and Sha, F. Diverse sequential subset selection for supervised video summarization. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 2069–2077. Curran Associates, Inc., 2014.
  • Gu & Eisenstat (1996) Gu, M. and Eisenstat, S. C. Efficient algorithms for computing a strong rank-revealing qr factorization. SIAM Journal on Scientific Computing, 17(4):848–869, 1996.
  • Guruswami & Sinop (2012) Guruswami, V. and Sinop, A. K. Optimal column-based low-rank matrix reconstruction. In Proceedings of the Twenty-third Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1207–1214, Kyoto, Japan, January 2012.
  • Guyon & Elisseeff (2003) Guyon, I. and Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res., 3(null):1157–1182, March 2003. ISSN 1532-4435.
  • Hardy et al. (1952) Hardy, G., Littlewood, J., and Pólya, G. Inequalities. Cambridge University Press, 2nd edition, 1952.
  • Hough et al. (2006) Hough, J. B., Krishnapur, M., Peres, Y., Virág, B., et al. Determinantal processes and independence. Probability surveys, 3:206–229, 2006.
  • Kim et al. (2016) Kim, B., Khanna, R., and Koyejo, O. Examples are not enough, learn to criticize! criticism for interpretability. In Advances in Neural Information Processing Systems, 2016.
  • Kulesza & Taskar (2011) Kulesza, A. and Taskar, B. k-DPPs: Fixed-Size Determinantal Point Processes. In Proceedings of the 28th International Conference on Machine Learning, pp. 1193–1200, Bellevue, WA, USA, June 2011.
  • Kulesza & Taskar (2012) Kulesza, A. and Taskar, B. Determinantal Point Processes for Machine Learning. Now Publishers Inc., Hanover, MA, USA, 2012.
  • Lawlor et al. (2016) Lawlor, D., Budavári, T., and Mahoney, M. W. Mapping the similarities of spectra: Global and locally-biased approaches to SDSS galaxy data. Astrophysical Journal, 833(1), 12 2016.
  • Macchi (1975) Macchi, O. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1):83–122, 1975. ISSN 00018678.
  • Musco & Musco (2017) Musco, C. and Musco, C. Recursive sampling for the nystrom method. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 3833–3845. Curran Associates, Inc., 2017.
  • Mutný et al. (2019) Mutný, M., Dereziński, M., and Krause, A. Convergence analysis of the randomized newton method with determinantal sampling. arXiv e-prints, art. arXiv:1910.11561, Oct 2019.
  • Paul et al. (2015) Paul, S., Magdon-Ismail, M., and Drineas, P. Column selection via adaptive sampling. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pp. 406–414, Cambridge, MA, USA, 2015. MIT Press.
  • Rasmussen & Williams (2006) Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.
  • Santa et al. (1997) Santa, H. Z., Zhu, H., Williams, C. K. I., Rohwer, R., and Morciniec, M. Gaussian regression and optimal finite dimensional linear models. In Neural Networks and Machine Learning, pp. 167–184. Springer-Verlag, 1997.
  • Wang et al. (2019) Wang, R., Li, Y., Mahoney, M. W., and Darve, E. Block basis factorization for scalable kernel evaluation. SIAM Journal on Matrix Analysis and Applications, 40(4):1497–1526, 2019.
  • Warlop et al. (2019) Warlop, R., Mary, J., and Gartrell, M. Tensorized determinantal point processes for recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, pp. 1605–1615, New York, NY, USA, 2019. ACM. ISBN 978-1-4503-6201-6.
  • Williams & Seeger (2001) Williams, C. K. I. and Seeger, M. Using the Nyström method to speed up kernel machines. In Leen, T. K., Dietterich, T. G., and Tresp, V. (eds.), Advances in Neural Information Processing Systems 13, pp. 682–688. MIT Press, 2001.

Appendix A Proof of Lemma 1

We will use the following standard determinantal summation identity (see Theorem 2.1 in Kulesza & Taskar, 2012) which corresponds to computing the normalization constant for a DPP.

Lemma 7.

For any $n \times n$ p.s.d. matrix $L$, we have
$$\sum_{S \subseteq [n]} \det\big(L_{S,S}\big) \;=\; \det(I + L).$$

We now proceed with the proof of Lemma 1 (restated below for convenience).

Lemma’ 1.

For any $A$ and $\sigma > 0$, let $P_S$ denote the projection onto $\mathrm{span}\{a_i : i \in S\}$. If $S \sim \mathrm{DPP}(\sigma^{-2} A^\top A)$, then
$$\mathbb{E}\big[P_S\big] \;=\; A A^\top \big(A A^\top + \sigma^2 I\big)^{-1}.$$

Proof.

Fix as the column dimension of and let denote the submatrix of consisting of the columns indexed by . We have , where denotes the Moore-Penrose inverse and . Let be an arbitrary vector. When is invertible, then a standard determinantal identity states that:

When is not invertible then , because the rank of cannot be higher than the rank of . Thus,

where involves two applications of Lemma 7. Since the above calculation holds for arbitrary vector , the claim follows. ∎

Appendix B Proofs omitted from Section 4

Lemma’ 3.

For any , and , where , suppose that for and . Then:

where .

Proof.

Let be the eigenvalues of . Note that scaling the matrix by any constant and scaling by preserves the distribution of as well as the approximation ratio, so without loss of generality, assume that . Furthermore, using the shorthands and , we have and so . We now lower bound the optimum as follows:

We will next define an alternate sequence of eigenvalues which is in some sense “worst-case”, by shifting the spectral mass away from the tail. Let , and for set , where . Additionally, define:

(3)

and note that . Moreover, for , we let , while for we set . We proceed to bound the expected subset size by converting all the eigenvalues from to and to