Approximating Spectral Clustering via Sampling: a Review

Spectral clustering refers to a family of unsupervised learning algorithms that compute a spectral embedding of the original data based on the eigenvectors of a similarity graph. This non-linear transformation of the data is both the key of these algorithms' success and their Achilles heel: forming a graph and computing its dominant eigenvectors can indeed be computationally prohibitive when dealing with more that a few tens of thousands of points. In this paper, we review the principal research efforts aiming to reduce this computational cost. We focus on methods that come with a theoretical control on the clustering performance and incorporate some form of sampling in their operation. Such methods abound in the machine learning, numerical linear algebra, and graph signal processing literature and, amongst others, include Nyström-approximation, landmarks, coarsening, coresets, and compressive spectral clustering. We present the approximation guarantees available for each and discuss practical merits and limitations. Surprisingly, despite the breadth of the literature explored, we conclude that there is still a gap between theory and practice: the most scalable methods are only intuitively motivated or loosely controlled, whereas those that come with end-to-end guarantees rely on strong assumptions or enable a limited gain of computation time.

Authors

• 21 publications
• 24 publications
06/25/2020

Scalable Spectral Clustering with Nystrom Approximation: Practical and Theoretical Aspects

Spectral clustering techniques are valuable tools in signal processing a...
02/05/2016

Compressive Spectral Clustering

Spectral clustering has become a popular technique due to its high perfo...
06/10/2014

Graph Approximation and Clustering on a Budget

We consider the problem of learning from a similarity matrix (such as sp...
04/07/2017

Fast Spectral Clustering Using Autoencoders and Landmarks

In this paper, we introduce an algorithm for performing spectral cluster...
01/09/2013

Spectral Clustering Based on Local PCA

We propose a spectral clustering method based on local principal compone...
06/07/2020

09/08/2019

Iterative Spectral Method for Alternative Clustering

Given a dataset and an existing clustering as input, alternative cluster...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Clustering is a cornerstone of our learning process and, thus, of our understanding of the world. Indeed, we can all distinguish between a rose and a tulip precisely because we have learned what these flowers are. Plato would say that we learned the Idea –or Form [120]– of both the rose and the tulip, which then enables us to recognize all instances of such flowers. A machine learner would say that we learned two classes: their most discriminating features (shape, size, number of petals, smell, etc.) as well as their possible intra-class variability.

Mathematically speaking, the first step on the road to classifying objects (such as flowers) is to create an abstract representation of these objects: with each object

we associate a feature vector

, where the dimension of the vector corresponds to the number of features one chooses to select for the classification task. The space in this context is sometimes called the feature space. The choice of representation will obviously have a strong impact on the subsequent classification performance. Say that in the flower example we choose to represent each flower by only

features: the average color of each RGB channel (level of red, green and blue) of its petals. This choice is not fail-proof: even though the archetype of the rose is red and the archetype of the tulip is yellow, we know that some varieties of both flowers can have very similar colors and thus a classification solely based on the color will necessarily lead to confusion. In fact, there are many different ways of choosing features: from features based on the expert knowledge of a botanist, to features learned by a deep learning architecture from many instances of labeled images of roses and tulips, via features obtained by hybrid methods more-or-less based on human intelligence (such as the first few components of a Principal Component Analysis of expert-based features).

The second step on the road to classifying objects is to choose a machine learning algorithm that groups the set of points in classes ( may be known in advance or determined by the algorithm itself). Choosing an appropriate algorithm depends on the context:

• Availability of pre-labeled data. Classifying the points in classes may be seen as assigning a label (such as “rose” or ”tulip” in our example) to each of the points. If one has access to some pre-labeled data, we are in the case of supervised learning

: a more-or-less parametrized model is first learned from the pre-labeled data and then applied to the unlabeled points that need classification. If one does not have access to any pre-labeled data, we are in the case of

unsupervised learning where classes are typically inferred only via geometrical consideration of the distribution of points in the feature space. If one has only access to a few labeled data, we are in the in-between case of semi-supervised learning where the known labels are typically propagated in one form or another in the feature space.

• Inductive vs transductive learning. Another important characteristic of a classification algorithm is whether it can be used to classify only the set of points at hand (transductive), or if it can also be directly used to classify any never-seen data point (inductive).

This review focuses on the family of algorithms jointly referred to as spectral clustering. These algorithms are unsupervised and transductive: no label is known in advance and one may not naturally111Out-of-sample extensions of spectral clustering do exist (see for instance Section 5.3.6 of [144]), but they require additional work. extend the results obtained on to never-seen data points. Another particularity of spectral clustering algorithms is that the number of classes is known in advance.

Spectral clustering algorithms have received a large attention in the last two decades due to their good performance on a wide range of different datasets, as well as their ease of implementation. In a nutshell, they combine three steps:

1. Graph construction. A sparse similarity graph is built between the points.

2. Spectral embedding. The first eigenvectors of a graph representative matrix (such as the Laplacian) are computed.

3. Clustering. -means is performed on these spectral features, to obtain clusters.

For background information about spectral clustering, such as several justifications of its performance, out-of-sample extensions, as well as comparisons with local methods, the interested reader is referred to the recent book chapter [144].

One of the drawbacks of spectral clustering is its computational cost as , , and/or become large (see Section 2.3 for a discussion on the cost). Since the turn of the century, a large number of authors have striven to reduce the computational cost while keeping the high level of classification performance. The majority of such accelerating methods are based on sampling: they reduce the dimension of a sub-problem of spectral clustering, compute a low-cost solution in small dimension, and lift the result back to the original space.

The goal of this paper is to review existing sampling methods for spectral clustering, focusing especially on their approximation guarantees. Some of the fundamental questions we are interested in are: where is the sampling performed and what is sampled precisely? how should the reduced approximate solutions be lifted back to the original space? what is the computational gain? what is the control on performances—if it exists? Given the breadth of the literature on the subject, we do not try to be exhaustive, but rather to illustrate the key ways that sampling can be used to provide acceleration, paying special attention on recent developments on the subject.

Paper organization. We begin by recalling in Section 2 the prototypical spectral clustering algorithm. We also provide some intuitive and formal justification of why it works. The next three sections classify the different methods of the literature depending on where the sampling is performed with respect to the three steps of spectral clustering:

• Section 3 details methods that sample directly in the original feature space.

• Section 4 assumes that the similarity graph is given and details methods that sample nodes and/or edges to approximate the spectral embedding.

• Section 5 assumes that the spectral embedding is given and details methods to accelerate the -means step.

Finally, Section 6 gives perspective on the limitations of existing works and discusses key open problems.

Notation. Scalars, such as or , are written with low-case letters. Vectors, such as , or the all-one vector , are denoted by low-case bold letters. Matrices, such as or are denoted with bold capital letters. Ensembles are denoted by serif font capital letters, such as or . The “tilde” will denote approximations, such as in or . We use so-called Matlab notations to slice matrices: given a set of indices of size and an matrix , is reduced to the lines indexed by , is reduced to the columns indexed by , and is reduced to the lines and columns indexed by . The equation defines as the reduction of to its first columns. Also, is the transpose of matrix and its Moore-Penrose pseudo-inverse. The operator takes as an input a vector and returns an diagonal matrix featuring in its main diagonal, i.e., if and , otherwise. Finally, we will consider graphs in a large part of this paper. We will denote by the undirected weighted graph of nodes interconnected by edges: is the edge connecting nodes and , with weight . Matrix is the adjacency matrix of . As is considered undirected, is also symmetric.

2 Spectral clustering

The input of spectral clustering algorithms consists of (i) a set of points (also called featured vectors) representing objects in a feature space of dimension , and (ii) the number of classes in which to classify these objects. The output is a partition of the objects in disjoint clusters. The prototypical spectral clustering algorithm [121, 102], dates back in fact to fundamental ideas by Fiedler [46] and entails the following steps:

[ breakable, left=0pt, right=0pt, top=0pt, bottom=0pt, colback=gray!25, colframe=gray!25, width=1.01enlarge left by=-1mm, boxsep=5pt, arc=0pt,outer arc=0pt, ] Algorithm 1. The prototypical Spectral Clustering algorithm
Input:
A set of points in dimension and a number of desired clusters .

1. [itemsep=-0.2ex]

2. Graph construction (optional)

1. [topsep=-1cm,itemsep=-0.2ex]

2. Compute the kernel matrix : .

3. Compute , a sparsified version of .

4. Interpret as the adjacency matrix of a weighted undirected graph .

3. Spectral embedding

1. [topsep=-1cm,itemsep=-0.2ex]

2. Compute the eigenvectors associated with the

smallest eigenvalues of a graph representative matrix

of .

3. Set .

4. Embed the -th node to , with a normalizing function.

4. Clustering

1. [topsep=-1cm,itemsep=-0.2ex]

2. Use -means on in order to identify centroids .

3. Voronoi tesselation: construct one cluster per centroid and assign each object to the cluster of the centroid closest to .

Output: A partition of the points in clusters.

A few comments are in order:

• A common choice of kernel in step 1a is the radial basis function (RBF) kernel

for some user-defined . The sparsification of usually entails setting the diagonal to and keeping only the largest entries of each column (i.e., set all others to ). The obtained matrix is not symmetric in general and a final “symmetrization” step is necessary to obtain a matrix interpretable as the adjacency matrix of a weighted undirected graph222Each node of represents a point , an undirected edge exists between nodes and if and only if , and the weight of that connection is . . This graph is called the nearest neighbour (-NN) similarity graph (note that the used in this paragraph has nothing to do with the number of clusters). Other kernel functions and sparsification methods are possible (see Section 2 of [138] for instance).

• There are several possibilities for choosing the graph representative matrix in step 2a. We consider three main choices [138]. Let us denote by the diagonal degree matrix such that is the (weighted) degree of node . We define the combinatorial graph Laplacian matrix , the normalized graph Laplacian matrix , and the random walk Laplacian . Other popular choices include333In some of these examples, the largest eigenvalues (instead of the lowest in the Laplacian cases) of the representative matrix, and especially their corresponding eigenvectors, are of interest. This is only a matter of sign of the matrix and has no impact on the general discussion. the non-backtracking matrix [73], degree-corrected versions of the modularity matrix [2], the Bethe-Hessian matrix [114] or similar deformed Laplacians [34].

• The normalizing function depends on which representative matrix is chosen. In the case of the Laplacians, experimental evidence as well as some theoretical arguments [138] support using a unit norm normalization for the eigenvectors of (i.e. is the identity function), and no normalization for the eigenvectors of and (i.e. ).

• Step 1 of the algorithm is “optional” in the sense that in some cases the input is not a set of points but directly a graph. For instance, it could be a graph representing a social network between individuals, where each node is an individual and there is an edge between two nodes if they know each other. The weight on each edge can represent the strength of their relation (for instance close to 0 if they barely know each other, and close to 1 if they are best friends). The goal is then to classify individuals based on the structure of these social connections and is usually referred to as community detection in this context [47]. Given the input graph, and the number of communities to identify, one can run spectral algorithms starting directly at step 2. Readers only interested in such applications can skip Section 3, which is devoted to sampling techniques designed to accelerate step 1.

After the spectral embedding has been identified, spectral clustering uses -means in order to find the set of centroids that best represents the data. Formally, the -means cost function to minimize reads:

 f(C;X)=∑x∈Xminc∈C∥x−c∥2. (1)

We would ideally hope to identify the set of centroids minimizing . Solving exactly this problem is NP-hard [41]

, so one commonly resorts to approximation and heuristic solutions (see for instance

[128] for details on different such heuristics). The most famous is the so-called Lloyd-Max heuristic algorithm: [ breakable, left=0pt, right=0pt, top=0pt, bottom=0pt, colback=gray!25, colframe=gray!25, width=1.01enlarge left by=-1mm, boxsep=5pt, arc=0pt,outer arc=0pt, ] Algorithm 2. The Lloyd-Max algorithm [87]
Input:
A set of points and a number of desired clusters .

1. [itemsep=-0.2ex]

2. Start from an initial guess of centroids

3. Iterate until convergence:

1. [topsep=-1cm,itemsep=-0.2ex]

2. Assign each point to its closest centroid to obtain a partition of in components.

3. Update each centroid as the average position of all points in component .

Output: A set of centroids . When the clusters are sufficiently separated and is not too far from the optimal centroids, then the Lloyd-Max algorithm converges to the correct solution [75]. Otherwise, it typically ends up in a local minimum.

A remark on notation. Two quantities of fundamental importance in spectral clustering are the eigenvalues and especially the eigenvectors of the graph Laplacian matrix. We adopt the graph theoretic convention of sorting eigenvalues in non-decreasing order: Also, for reasons of brevity, we overload notation and use the same symbol for the spectrum of the three Laplacians , and . Thus, we advise the reader to rely on the context in order to discern which Laplacian gives rise to the eigenvalues and eigenvectors. Finally, the reader should keep in mind that the largest eigenvalue if always bounded by for and .

2.1 An illustration of spectral clustering

The first two steps of the algorithm can be understood as a non-linear transformation from the initial feature space to another feature space (that we call spectral feature space or spectral embedding): a transformation of features in to spectral features in . The first natural question that arises is why do we run -means on the spectral features that are subject to parameter tuning and costly to compute, rather than directly run -means on the original ? Figures 1 and 2 illustrate the answer.

In Figure 1, we show the result of -means directly on a set of artificial features known as the two-half moons dataset. In this example, the intuitive ground truth is that each half-moon corresponds to a class, that we want to recover. Running -means directly in this feature space will necessarily output a linear separation between the two obtained Voronoi cells and will thus necessarily fail, as no straight line can separate the two half-moons.

Spectral clustering, via the computation of the spectral features of a similarity graph, transforms these original features in spectral features that are typically linearly separable by -means: the two half-moons are succesfully recovered! We illustrate this in Figure 2. To understand why this works, there are several theoretical arguments of varying rigour. We propose a few in the following.

2.2 Justification of spectral clustering

A popular approach –and by no means the only one, see Section 2.2.3– to justify spectral clustering algorithms stems from its connection to graph partitioning. Suppose that the similarity graph has been obtained and we want to compute a partition444By definition, a partition of the nodes is such that and of the nodes in groups. Intuitively, a good clustering objective function should favor strongly connected nodes to end up in the same subset, and nodes that are far apart in the graph to end up in different subsets. This intuition can be formalized with graph cuts.

Considering two groups and , define to be the total weight of all links connecting to . Also, denote by the complement of in , such that is the total weight one needs to cut in order to disconnect from the rest of the graph. Given these definitions, the simplest graph cut objective function, denoted by cut, is:

 cut(P={V1,…,Vk})=12k∑ℓ=1w(Vℓ,¯Vℓ). (2)

The best partition according to the cut criterion is . For , solving this problem can be done exactly in  amortized time using the Stoer-Wagner algorithm [126] and approximated in nearly linear time [68]. Nevertheless, this criterion is not satisfactory as it often separates an individual node from the rest of the graph, with no attention to the balance of the sizes or volumes of the groups. In clustering, one usually wants to partition into groups that are “large enough”. There are two famous ways to balance the previous cost in the machine learning literature555The reader should note that in the graph theory literature, the measure of conductance is preferred over ncut. Conductance is . The two measures are equivalent when .: the ratio cut [143] and normalized cut [121] cost functions, respectively defined as:

 rcut(P)=12k∑ℓ=1w(Vℓ,¯Vℓ)|Vℓ|    and    ncut(P)=12k∑ℓ=1w(Vℓ,¯Vℓ)vol(Vℓ), (3)

where is the number of nodes in and is the so-called volume of . The difference between them is that ncut favors clusters of large volume, whereas rcut only considers cluster size—though for a -regular graph with unit weights the two measures match (up to multiplication by ). Unfortunately, it is hard to minimize these cost functions directly: minimizing these two balanced costs is NP-hard [139, 121] and one needs to search over the space of all possible partitions which is of exponential size.

A continuous relaxation. Spectral clustering may be interpreted as a continuous relaxation of the above minimization problems. Without loss of generality, in the following we concentrate on relaxing the rcut minimization problem (ncut is relaxed almost identically). Given a partition , let us define

 C=(z1√|V1||…|zk√|Vk|)∈Rn×k, (4)

where is the indicator vector of :

 zℓ(i) ={1if node i∈Vℓ,0otherwise. (5)

It will prove useful in the following to remark that, independently of how the partitions are chosen, we always have that

, the identity matrix in dimension

. With this in place, the problem of minimizing rcut can be rewritten as (see discussion in [138]):

 minC∈Rn×ktr(C⊤LC)s.t.C⊤C=Iand    C as in Eq.~{}(???) (6)

To understand why this equivalence holds, one should simply note that

 tr(C⊤LC)=k∑ℓ=11|Vℓ|z⊤ℓLzℓ =k∑ℓ=11|Vℓ|∑i>jW(i,j)(zℓ(i)−zℓ(j))2 =k∑ℓ=1w(Vℓ,¯Vℓ)|Vℓ|=2rcut(P).

Solving Eq. (6) is obviously still NP-hard as the only thing we have achieved is to rewrite the rcut minimization problem in matrix form. Yet, in this form, it is easier to realize that one may find an approximate solution by relaxing the discreteness constraint “ as in Eq. (4)”. In the absence of the hard-to-deal-with constraint, the relaxed problem is not only polynomially solvable but also possesses a closed-form solution! By the Courant–Fischer–Weyl (min-max) theorem, the solution is given by the first eigenvectors of :

 Uk=argminC∈Rn×ktr(C⊤LC)% subject toC⊤C=I.

This relaxation is not unique to the combinatorial Laplacian. In the same spirit, the minimum ncut optimization problem can be formulated in terms of the normalized Laplacian matrix , and the relaxed problem’s solution is given by the first eigenvectors of .

A difficulty still lies before us: how do we go from a real-valued to a partition of the nodes? The two next subsections aim to motivate the use of -means as a rounding heuristic. The exposition starts from the simple case when there are only two clusters () before considering the general case (arbitrary ).

2.2.1 The case of two clusters: thresholding suffices

For simplicity, we first consider the case of two clusters. If one constructs a partitioning with and for every level set , then it is a folklore result that

 rcut(P∗)≤mintrcut(Pt)≤2√rcut(P∗)(dmax% −λ22), (7)

with being the optimal partitioning, is the maximum degree of any node in , and the second smallest eigenvalue of . The upper bound is achieved by the tree-cross-path graph constructed by Guattery and Miller [57]. In an analogous manner, if is the optimal partitioning w.r.t. the ncut cost and every has been constructed by thresholding the second eigenvector of , then

 ncut(P∗)≤mintncut(Pt)≤2√ncut(P∗). (8)

Equation (8) can be derived as a consequence of the Cheeger inequality, a key result of spectral graph theory [32], which for the normalized Laplacian reads:

 λ22≤ncut(P∗)≤minVw(V,¯V)min{w(V),w(¯V)}≤mintncut(Pt)≤√2λ2.

As a consequence, we have

 ncut(P∗)≤mintncut(Pt)≤√2λ2≤√4ncut(P∗)=2√% ncut(P∗),

as desired. The derivation of the rcut bound given in equation (7) follows similarly.

2.2.2 More than two clusters: use k-means

As the number of clusters increases, the brute-force approach of testing every level set becomes quickly prohibitive. But why is -means the right way to obtain the clusters in the spectral embedding? Though a plethora of experimental evidence advocate the use of -means, a rigorous justification is still lacking. The interested reader may refer to [83] for an example of an analysis of spectral partitioning without -means.

More recently, Peng et al. [107] came up with a mathematical argument showing that, if is well clusterable and we use a -means algorithm (e.g.,  [76]) which guarantees that the identified solution abides to

 f(~C;X)≤(1+ϵ)f(C∗;X),

where is the optimal solution of the -means problem, then the partitioning produced by spectral clustering when using has ncut cost provably close to that of the optimal partitioning . In particular, it was shown that, as long as , then

 ncut(P∗)≤ncut(~P)≤ζncut(P∗)(1+ϵk3λk+1),

for some constants that are independent of and (see also [71]). Note that, using the higher-order Cheeger inequality [83] , the condition implies

 λk+1λk≥ck22=Ω(k2).

Though hopefully milder than this one666To construct an example possibly verifying such a strong gap assumption, consider cliques of size connected together via only edges, so as to form a loosely connected chain. Even though this is a straightforward clustering problem known to be easy for spectral clustering algorithms, the above theorem’s assumption implies which, independently of , can only be satisfied when is a small (recall that the eigenvalues of are necessarily between and )., such gap assumptions are very common in the analysis of spectral clustering. Simply put, the larger the gap is, the stronger the cluster structure and the easier it is to identify a good clustering. Besides quantifying the difficulty of the clustering problem, the gap also encodes the robustness of the spectral embedding to errors induced by approximation algorithms [36]. The eigenvectors of a perturbed Hermitian matrix exhibit an interesting property: instead of changing arbitrarily, the leakage of information is localized w.r.t. the eigenvalue axis [89]. More precisely, if is the -th eigenvector of after perturbation, then the inner products decrease proportionally with . As such, demanding that is large is often helpful in the analysis of spectral clustering algorithms in order to ensure that the majority of useful information (contained within ) is preserved (in ) despite approximation errors777Usually, one needs to ensure that remains bounded..

2.2.3 Choice of relaxation

The presented relaxation approach is not unique and other relaxation could be equally valid (see for instance [17, 24, 112]). This relaxation has nevertheless the double advantage of being theoretically simple and computationally easy to implement. Also, justification of spectral clustering algorithms does not only come from this graph cut perspective and in fact encompasses several approaches that we will not detail here: perturbation approaches or hitting time considerations [138], a polarization theorem [23], consistency derivations [135, 84], etc. Interestingly, recent studies (for instance [18]) on the Stochastic Block Models have shown that spectral clustering (on other matrices than the Laplacian, such as the non-backtracking matrix [73], the Bethe-Hessian matrix [114] or other similar deformed Laplacians [34]) perform well up to the detectability threshold of the block structure.

2.3 Computational complexity considerations

What is the computational complexity of spectral clustering as a function of the number of points , their dimension and the number of desired clusters ? Let us examine the three steps involved one by one.

The first step entails the construction of a sparse similarity graph from the input points, which is dominated by the kernel computation and costs . In the second step, given the graph consisting of nodes and edges888with of the order of if the sparsification step was well conducted, one needs to compute the spectral embedding (step 2 of Algorithm 1). Without exploiting the special structure of a graph Laplacian —other than its sparsity that is— there are two main options:

• Using power iterations, one may identify sequentially each non-trivial eigenvector in time , where is the -th eigenvalue gap and is the number of edges of the graph [136]. Computing the spectral embedding therefore takes with . Unfortunately, there exist graphs999The combinatorial Laplacian of a complete balanced binary tree on levels and nodes has  [56]. such that , bringing the overall worst-case complexity to .

• The Lanczos method can be used to approximate the first eigenvectors in roughly time. This procedure is often numerically unstable resulting to a loss of orthogonality in the computed Krylov subspace basis. The most common way to circumvent this problem is by implicit restart [26], whose computational complexity is not easily derived. The number of restarts, empirically, depend heavily on the eigenvalue distribution in the vicinity of : if is in an eigenvalue bulk, the algorithms takes longer than when is isolated. We decide to write the complexity of restarted Arnoldi as with modeling the number of restarts. Note that throughout this paper, will generically refer to a number of iterations in algorithm complexities. We refer the interested reader to [13] for an in-depth perspective on Lanczos methods.

The third step entails solving the -means problem, typically by using the Lloyd-Max algorithm to converge to a local minimum of . Since there is no guarantee that this procedure will find a good local minimum, it is usually rerun multiple times, starting in each case from randomly selected centroids . The computational complexity of this third step is , where is a bound on the number of iterations required until convergence multiplied by the number of retries (typically around 10).

2.4 A taxonomy of sampling methods for spectral clustering

For the remainder of the paper, we propose to classify sampling methods aiming at accelerating one or more of these three steps according to when they sample. If they sample before step 1, they are detailed in Section 3. Methods that assume that the similarity graph is given or well-approximated and sample between steps 1 and 2 will be found in Section 4. Finally, methods that assume that the spectral embedding has been exactly computed or well-approximated and sample before the -means step are explained in Section 5. This classification of methods, like all classification systems, bears a few flaws. For instance, Nyström methods can be applied to both the context of Sections 3 and 4 and are thus mentioned in both. Also, we decided to include the pseudo-code of only a few chosen algorithms that we think are illustrative of the literature. This choice is of course subjective and debatable. Notwithstanding these inherent flaws, we hope that this classification clarifies the landscape of existing methods.

3 Sampling in the original feature space

This section is devoted to methods that ambitiously aim to reduce the dimension of the spectral clustering problem even before the graph has been formed. Indeed, the naive way of building the similarity graph (step 1 of spectral clustering algorithms) costs and, as such, is one of the the key bottlenecks of spectral clustering. It should be remarked that the present discussion fits into the wider realm of kernel approximation, a proper review of which cannot fit in this paper: we will thus concentrate on methods that were in practice used for spectral clustering.

3.1 Nyström-based methods

The methods of this section aim to obtain an approximation of the exact spectral embedding via a sampling procedure in the original feature space.

The Nyström method is a well known algorithm for obtaining a rough low rank approximation of a positive semi-definite (PSD) matrix . Here is a high level description of the steps entailed: [ breakable, left=0pt, right=0pt, top=0pt, bottom=0pt, colback=gray!25, colframe=gray!25, width=1.01enlarge left by=-1mm, boxsep=5pt, arc=0pt,outer arc=0pt, ] Algorithm 3. Nyström’s method
Input: PSD matrix , number of samples , desired rank

1. [itemsep=-0.2ex]

2. Let be column indices chosen by some sampling procedure.

3. Denote by and the sub-matrices indexed by .

4. Let be the eigen-decomposition of with the diagonal of sorted in decreasing magnitude.

5. Compute the rank- approximation of as , where and .

Possible outputs:

• [topsep=-1cm,itemsep=-0.2ex]

• A low-rank approximation of

• A rank- approximation of

• The top eigenvectors of , stacked as columns in matrix , obtained by orthonormalizing the columns of

Various guarantees are known for the quality of depending on the type of sampling utilized and the preferred notion of error (spectral vs frobenius vs trace norm) [54, 77, 50, 148]. For instance:

Theorem 3.1 (Lemma 8 for q=1 in [54]).

Let and . Consider columns drawn i.i.d. uniformly at random (with or without replacement). Then:

 ∥A−~A∥2≤(1+n(1−ϵ)m)∥A−Ak∥2

holds with probability at least

, provided that ; where

 μ=nkmaxi=1,…,n∥Vk(i,:)∥22

is the coherence associated with the first eigenvectors of , and is the best rank- approximation of .

Guarantees independent of the coherence can be obtained for more advanced sampling methods. Perhaps the most well known method is that of leverage scores, where one draws samples independently by selecting (with replacement) the -th column with probability , called leverage scores.

Theorem 3.2 (Lemma 5 for q=1 in [54]).

Let and . Consider

columns drawn i.i.d. with replacement from such a probability distribution. Then:

 ∥A−~A∥2≤∥A−Ak∥2+ϵ2∥A−Ak∥∗

holds with probability at least provided that .

Computing leverage scores exactly is computationally prohibitive since it necessitates a partial SVD decomposition of , which we are trying to avoid in the first place. Nevertheless, it is possible to approximate all leverage scores with a multiplicative error guarantee in time roughly if has non-zero entries. (see Algorithms 1 to 3 in [54]). Many variants of the above exist [77, 78], but to the best of our knowledge, the fastest current Nyström algorithm utilizes ridge leverage scores with a complex recursive sampling scheme and runs in time nearly linear in  [100].

Nyström for spectral clustering. Though initially conceived for low-rank approximation, Nyström’s method can also be used to accelerate spectral clustering. The key observation is that , the tailing eigenvectors of the graph representative matrix , can be interpreted as the top eigenvectors of the PSD matrix . As such, the span of the top eigenvectors of obtained by running Algorithm 3 on is an approximation of the span of the exact spectral embedding. Different variants of this idea have been considered for the acceleration of spectral clustering [48, 141, 85, 19, 97, 86].

Following our taxonomy, we hereby focus on the case where we have at our disposal points in dimension , and the similarity graph has yet to be formed. The case where the graph is known is deferred to Section 4.

In this case, we cannot run Algorithm 3 on as the graph, and a fortiori its representative matrix , has not yet been formed. What we can have access to efficiently is and as these require only a partial computation of the kernel and cost only . Note that is a sparsification function that is applied on a subset of the kernel matrix.

The following pseudo-code exemplifies how Nyström-based techniques can be used to approximate the first eigenvectors associated with the normalized Laplacian matrix (i.e., here ): [ breakable, left=0pt, right=0pt, top=0pt, bottom=0pt, colback=gray!25, colframe=gray!25, width=1.01enlarge left by=-1mm, boxsep=5pt, arc=0pt,outer arc=0pt, ] Algorithm 3b. Nyström for spectral clustering [85]
Input:
The set of points , the number of desired clusters , a sampling set of size

1. [itemsep=-0.5ex]

2. Compute the sub-matrices and , where is a sparsification function.

3. Let be the degree matrix.

4. Compute the top eigenvalues and eigenvectors of .

5. Set .

6. Let be the degree matrix.

7. Compute obtained by orthogonalizing .

Output: , an approximation of the spectral embedding . This algorithm runs in time, which is small when depends mildly on the other parameters of interest. Nevertheless, the algorithm (and others like it) suffers from several issues:

• Alg. 3b attempts in fact to apply Nyström on , via the exact computation of two submatrices of . It makes two strong (and uncontrolled) approximations. First of all, the sparsification step (step 1 in Alg. 3b) is applied to the sub-matrices and , deviating from the correct sparsification procedure that takes into account the entire kernel matrix . Second, the degree matrix is never exactly computed as knowing it exactly would entail computing exactly , which is precisely what we are trying to avoid. Existing methods thus rely on heuristic approximations of the degree in order to bypass this difficulty (see steps 2 and 5 of Alg. 3b).

• Since we don’t have direct access to the kernel matrix, we cannot utilize advanced sampling methods such as leverage scores to draw the sampling set . This is particularly problematic if (due to sparsification), matrices and are sparse, as for sparse matrices uniform sampling is known to perform poorly [97]. Techniques that rely on distances between columns do not fair much better. Landmark-based approaches commonly perform better in simple problems but suffer when the clusters are non-convex [19]. We refer the reader to the work by Mohan et al. [97] for more information on landmark-based methods. The latter work also describes an involved sampling scheme that is aimed at general (i.e., non-convex) clusters.

For the reasons highlighted above, the low-rank approximation guarantees accompanying the classical Nyström method cannot be directly used here. A fortiori, it is an open question how much the quality of the spectral clustering solution is affected by using the centroids obtained by running -means on .

Column sampling. Akin in spirit to Nyström methods, an alternative approach to accelerating spectral clustering was inspired by column sampling low-rank approximation techniques [42, 37].

An instance of such algorithms was put forth under the name of cSPEC (column sampling spectral clustering) by Wang et al. [141]. Let

be the singular value decomposition of the

matrix . Then, matrices

 ~Σ=√nmΣCand~U=CVCΣ+C

are interpreted as an approximation of the actual eigenvalues and eigenvectors of and thus can be substituted by the first columns of . This algorithm runs in .

Authors in [30] propose a hybrid method, between column sampling and the representative-based methods discussed in Section 3.3, where they propose the following approximate factorization of the data matrix:

 (p1|…|pn)≃FZ∈Rd×n, (9)

where concatenates the feature vectors of sampled points and represents all unsampled points as approximate linear combinations of the representatives, computed via sparse coding techniques [82]101010Authors in [116] have a very similar proposition as [30], adding a projection phase at the beginning to reduce the dimension (see Section 3.4.2). Similar ideas may also be found in [137].. The SVD of , with the row-sum of , is then computed to obtain an approximation of . The complexity of their algorithm is also .

In these methods, the choice of the sample set is of course central and has been much debated. Popular options are uniformly at random or via better-taylored probability distributions, via a first -means (with ) pass on , or via other selective sampling methods. Also, as with most extensions of Nyström’s method to spectral clustering, column sampling methods for spectral clustering do not come with end-to-end approximation guarantees on .

In the world of low-rank matrix approximation the situation is somewhat more advanced. Recent work in column sampling utilizes adaptive sampling with leverage scores in time , or uniformly i.i.d. after preconditioning by a fast randomized Hadamard transform [145, 43]. Others have also used a correlated version called volume sampling to obtain column indices [37]. Nevertheless, this literature extends beyond the scope of this review and we invite the interested reader to consider the aforementioned references for a more in depth perspective.

3.2 Random Fourier features

Out of several sketching techniques one could a priori use to accelerate spectral clustering, we focus on random Fourier features (RFF) [110]: a method that samples in the Fourier space associated to the original feature space. Even though RFFs have originally been developed to approximate a kernel matrix in time linear in instead of the quadratic time necessary for its exact computation, they can in fact be used to obtain an approximation of the exact spectral embedding .

Let us denote by the RBF kernel, i.e.,

, whose Fourier transform is:

 (10)

The above takes real values as is symmetric. One may write:

 κ(p,q)=κ(p−q)=1Z∫Rd^κ(ω)expiω⊤(p−q)dω, (11)

where, in order to ensure that , the normalization constant is set to . According to Bochner’s theorem, and due to the fact that is positive-definite,

is a valid probability density function.

may thus be interpreted as the expected value of provided that is drawn from :

 κ(p,q)=Eω(expiω⊤(p−q)) (12)

Drawing from the distribution is equivalent to drawing independently each of its entries according to the normal law of mean

and variance

. Indeed: and , leading to

In practice, we draw independently such vectors to obtain the set of sampled frequencies For each data point , and given this set of samples , we define the associated random Fourier feature vector:

 ψi=1√m[cos(ω⊤1pi)|⋯|cos(ω⊤mpi)|sin(ω⊤1pi)|⋯|sin(ω⊤mpi)]⊤∈R2m, (13)

and call the RFF matrix. Other embeddings are possible in the RFF framework, but this one was shown to be the most appropriate to the Gaussian kernel [127]. As increases, concentrates around its expected value : . Proposition 1 of [127] states the tightness of this concentration: it shows that the approximation starts to be valid with high probability for . The Gaussian kernel matrix is thus well approximated as . With such a low-rank approximation of

, one can: estimate the degrees

111111an approximation of the degree of node is where . All degrees can thus be estimated in time ., degree-normalize to obtain a low-rank approximation of the normalized Laplacian and perform an SVD to directly obtain an approximation of the spectral embedding . The total cost to obtain this approximation is . These ideas were developed in Refs. [31, 146] for instance.

As in Nyström methods however, the concentration guarantees of RFFs for do not extend to the degree-normalized case; moreover, the sparsification step 1b of spectral clustering is ignored. Note that improving over RFFs in terms of efficiency and concentration properties is the subject of recent research (see for instance [81]).

3.3 The paradigm of representative points

The methods detailed here sample in the original feature space and directly obtain a control on the misclustering rate due to the sampling process. They are based on the following framework:

1. Sample so-called representatives.

2. Run spectral clustering on the representatives.

3. Lift the solution back to the entire dataset.

Let us illustrate this with the example of KASP: [ breakable, left=0pt, right=0pt, top=0pt, bottom=0pt, colback=gray!25, colframe=gray!25, width=1.01enlarge left by=-1mm, boxsep=5pt, arc=0pt,outer arc=0pt, ] Algorithm 4. KASP: -means-based approximate spectral clustering [147]
Input: A set of points in dimension , a number of desired clusters , and a number of representatives

1. [itemsep=-0.2ex]

2. Perform -means with on and obtain:

1. [topsep=-1cm,itemsep=-0.2ex]

2. the cluster centroids as the representative points.

3. a correspondence table to associate each to its nearest representative

3. Run spectral clustering on : obtain a -way cluster membership for each .

4. Lift the cluster membership to each by looking up the cluster membership of its representative in the correspondence table.

Output: clusters The complexity of KASP is bounded by121212It is in fact for step 1, and bounded by for step 2. As and , the total complexity is bounded by . . For a summary of the analysis given in [147], let us consider the cluster memberships given by exact spectral clustering on as well as the memberships given by exact spectral clustering on where the are any small perturbations on the initial points. Let us write (resp. ) the Laplacian matrix of the similarity graph on (resp. ). The analysis concentrates on the study of the misclustering rate :

 ρ=# of points with different membershipsn. (14)

The main result, building upon preliminary work in [63], stems from a perturbation approach and reads:

Theorem 3.3.

Under assumptions of Theorem 3 in [147], verifies: where is a value depending on the spectral gap. Also, under the assumptions of Theorem 6 in [147], one has, with high probability:

 ∥L−~L∥F≤O(σ(2)ϵ+σ(4)ϵ) (15)

with and

the second and fourth moments of the perturbation’s norms

.

Combining both bounds, one obtains an upper bound on the misclustering rate that depends on the second and fourth moments of the perturbation’s norms . The “collapse” of points onto the representative points, interpreted as a perturbation on the original points, should thus tend to minimize these two moments, leading the authors to propose distortion-minimizing algorithms, such as KASP. A very similar algorithm, eSPEC, is described in [141].

3.4 Other methods

3.4.1 Approximate nearest neighbour search algorithms

The objective here is to approximate the nearest neighbour graph efficiently. Even though these methods are not necessarily based on sampling, we include them in the discussion as they are frequently used in practice.

Given the feature vectors and a query point , the exact nearest neighbour search (exact NNS) associated to and is where dist stands for any distance. Different distances are possible depending on the choice of kernel . We will here consider the Euclidean norm as it enters the definition of the classical RBF kernel. Computing the exact NNS costs . The goal of the approximate NNS field of research is to provide faster algorithms that have the following control on the error.

Definition 3.4.

Point is an -approximate nearest neighbor of query , if

 ∀p∈Pdist(q,p∗)≤(1+ϵ)dist(q,p).

For , this reduces to exact NNS.

Extensions of this objective to the -nearest neighbour goal are considered in the NNS literature. A -nearest neighbour graph can then be constructed simply by running an approximate -NNS query for each object . Thus, approximate NSS algorithms are interesting candidates to approximate the adjacency matrix of the nearest-neighbour affinity graph, that we need in step 1 of spectral clustering. Many algorithms exist, their respective performances depending essentially on the dimension of the feature vectors. According to [9], randomized - forests as implemented in the library FLANN [98] are considered state-of-the-art for dimension of around , whereas methods based on Balanced Box Decomposition (BBD) [7, 4] are known to perform well for roughly smaller than

. In high dimensions, to avoid the curse of dimensionality, successful approaches are for instance based on hashing methods (such as Locality Sensitive Hashing (LSH)

[5], Product Quantization (PQ) [66]) or -

generalized random forests

[9]. Finally, proximity graph methods, that sequentially improve over a first coarse approximation of the -NN graph (or other graph structures such as navigable graphs) have received a large attention recently and are becoming state-of-the-art in regimes where quality of approximation primes (see for instance [94, 40, 51, 8]). Such tools come with various levels of guarantees and computation costs, the details of which are not in the scope of this review.

Experimentally, to obtain an approximate -NN graph with a typical recall rate131313The recall rate for a node is the number of correctly identified -NN divided by . The recall rate for a -NN graph is the average recall rate over all nodes. of , these algorithms are observed to achieve a complexity of with close to ( in [40] for instance).

3.4.2 Feature selection and feature projection

Some methods work on reducing the factor of the complexity

of the kernel computation via feature selection, i.e., the sampling of features deemed more useful for the underlying clustering task, or feature projection, i.e., the projection on usually random subspaces of dimension

. Feature selection methods are usually designed to improve the classification by removing features that are too noisy or useless for the classification. We thus do not detail further these methods as they are not approximation algorithms per se. The interested reader will find some entries in the literature via references [35, 60, 149, 25]. Projection methods use random projections of the original points on spaces of dimension in order to take advantage of the Johnson-Lindenstrauss lemma of norm conservation: the kernel computed from the projected features in dimension is thus an approximation of the true kernel with high probability. We refer to the works [116, 64] for more details.

4 Sampling given the similarity graph

We now suppose that the similarity graph is either given (e.g., in cases where the original data is a graph) or has been well approximated (by approximate k-NN search for instance) and concentrate on sampling-based methods that aim to reduce the cost of computing the first eigenvectors of .

These methods predominantly aim to approximate by a smaller matrix of size . The eigen-decomposition is done in which can be significantly cheaper when . In addition, each method comes with a fast way of lifting vectors from back to (this is usually a linear transformation). After lifting, the eigenvectors of are used as a proxy for those of .

Unlike the previous section where a strong approximation guarantee of the exact embedding by an efficiently computed was a distant and difficult goal to achieve in itself; we will see in this Section that the knowledge of the similarity graph not only enables to obtain such strong approximation guarantees, but also enables to control how the error on transfers as an error on the -means cost.

To be more precise, recall Eq. (1) defining the -means cost associated to the points and a centroid set . Now, suppose that we have identified a set of points that are meant to approximate the exact spectral embedding . Moreover, let (resp. ) be the optimal set of centroids minimizing the -means cost on (resp. ). We will see that several (not all) approximation methods of this Section achieve an end-to-end approximation guarantee of the form:

 ∣∣f(C∗;X)\sfrac12−f(~C∗;X)\sfrac12∣∣≤ϵ

for some small with -at least- constant probability. Such an end-to-end guarantee is indeed more desirable than a simple guarantee on the distance between and : it informs us on the approximation quality of the attained clustering.

4.1 Nyström-based methods

Once again, Nyström-based methods are applicable. Let us concentrate on the choice to illustrate the main ideas. As explained in Section 3.1, the tailing eigenvectors of , , can be interpreted as the top eigenvectors of the PSD matrix . As such, the span of the top- eigenvectors of , , obtained by running Algorithm 3 on should approximate the span of . Now, how does ones goes from Nyström theorems such as Theorem 3.2 to error bounds on the -means cost function?

The first step towards an end-to-end guarantee relies on the following result:

Lemma 4.1 (see the proof of Theorem 6 in [21]).

Denote by the optimal centroid set obtained by solving -means on the lines of . It holds that

 ∣∣f(C∗;X)\sfrac12−f(~C∗;X)\sfrac12∣∣≤2∥E∥F, (16)

where

This means that the error made by considering the optimal -means solution based on (instead of ) is controlled by the Frobenius norm of the projector difference . Furthermore, since141414Based on three arguments: (i) for any two matrices and of rank and it holds that , (ii) for any matrix or rank , , and (iii) both and are of rank . and , we can apply the Davis-Kahan