Scalable Spectral Clustering with Nyström Approximation: Practical and Theoretical Aspects

06/25/2020
by   Farhad Pourkamali-Anaraki, et al.
UMass Lowell

Spectral clustering techniques are valuable tools in signal processing and machine learning for partitioning complex data sets. The effectiveness of spectral clustering stems from constructing a non-linear embedding based on creating a similarity graph and computing the spectral decomposition of the Laplacian matrix. However, spectral clustering methods fail to scale to large data sets because of high computational cost and memory usage. A popular approach for addressing these problems utilizes the Nyström method, an efficient sampling-based algorithm for computing low-rank approximations to large positive semi-definite matrices. This paper demonstrates that the popular approach to Nyström-based spectral clustering has severe limitations: existing time-efficient methods ignore critical information by prematurely reducing the rank of the similarity matrix associated with sampled points, and the current understanding of how the Nyström approximation affects the quality of the spectral embedding is limited. To address these limitations, this work presents a principled spectral clustering algorithm that makes full use of the information obtained from the Nyström method. The proposed method exhibits linear scalability in the number of input data points, allowing us to partition large complex data sets. We provide theoretical results that narrow the current gap between the accuracy of the Nyström approximation and the quality of the resulting spectral embedding, and we present numerical experiments with real and synthetic data. Empirical results demonstrate the efficacy and efficiency of the proposed method compared to existing spectral clustering techniques based on the Nyström method and other efficient methods. The overarching goal of this work is to provide an improved baseline for future research on accelerating spectral clustering.


I Introduction

Cluster analysis is a fundamental problem in signal processing and exploratory data analysis that divides a data set into several groups using only the information found in the data. Among several techniques [saxena2017review, rodriguez2019clustering], spectral clustering [ng2002spectral, von2007tutorial] is one of the most prominent and successful methods for capturing complex structures, such as non-spherical clusters. In these scenarios, spectral clustering outperforms popular Euclidean clustering techniques, such as K-means clustering [pourkamali2017preconditioned, vijayaraghavan2017clustering]. Hence, spectral clustering has found applications in various domains, including computer vision [shi2000normalized, tacsdemir2015approximate, chen2017linear], biology [shi2017spectral], neuroscience [thirion2014fmri], recommender systems [li2019new], and blind source separation [bach2006learning].

Spectral clustering expresses data clustering as a graph partitioning problem by constructing an undirected similarity graph with each point in the data set being a node. A popular connectivity measure employs the radial basis kernel function of the form $\kappa(\mathbf{x}_i,\mathbf{x}_j)=\exp(-\|\mathbf{x}_i-\mathbf{x}_j\|_2^2/(2\sigma^2))$, where $\mathcal{X}=\{\mathbf{x}_1,\ldots,\mathbf{x}_n\}$ is the set of data points in $\mathbb{R}^d$ to be partitioned into $k$ clusters, and $\sigma$ is the bandwidth parameter. Thus, the first step of spectral clustering involves forming a positive semi-definite kernel matrix $\mathbf{K}\in\mathbb{R}^{n\times n}$ with the $(i,j)$-th entry $K_{ij}=\kappa(\mathbf{x}_i,\mathbf{x}_j)$, which describes similarities among input data points. Therefore, a significant challenge in applying spectral clustering is the computation and storage of the entire kernel matrix, which requires $O(n^2)$ time and space. The quadratic complexity in the number of input data points renders spectral clustering intractable for large data sets.
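To make the quadratic cost concrete, the following minimal NumPy sketch forms the dense Gaussian kernel matrix exactly as described above; the function name, data sizes, and bandwidth below are illustrative choices rather than part of the paper.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Dense RBF kernel matrix with entries exp(-||x_i - x_j||^2 / (2 * sigma^2)).

    Forming this matrix explicitly takes O(n^2 d) time and O(n^2) memory,
    which is the bottleneck that motivates the Nystrom approximation below.
    """
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)  # guard against tiny negative values
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 10))    # 500 points in R^10 (illustrative sizes)
    K = gaussian_kernel_matrix(X, sigma=2.0)
    print(K.shape, bool(np.allclose(K, K.T)))
```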

In this work, we focus on the popular method of normalized cut [shi2000normalized, he2016iterative] to partition the resulting similarity graph. This method forms the normalized Laplacian matrix as follows:

$\mathbf{L} = \mathbf{I}_n - \mathbf{D}^{-1/2}\mathbf{K}\mathbf{D}^{-1/2}$,   (1)

where $\mathbf{D}=\mathrm{diag}(\mathbf{K}\mathbf{1}_n)$ is the diagonal degree matrix associated with $\mathbf{K}$, $\mathbf{I}_n$ is the $n\times n$ identity matrix, and $\mathbf{1}_n$ is the vector of all ones. The spectral embedding of the original data $\mathbf{U}\in\mathbb{R}^{n\times k}$ is then obtained by solving the following trace minimization problem:

$\min_{\mathbf{U}\in\mathbb{R}^{n\times k}:\,\mathbf{U}^T\mathbf{U}=\mathbf{I}_k}\ \mathrm{tr}(\mathbf{U}^T\mathbf{L}\mathbf{U})$,   (2)

where $\mathrm{tr}(\cdot)$ represents the matrix trace. The above optimization problem has a closed-form solution; its optimizer is obtained by the eigenvectors corresponding to the $k$ smallest eigenvalues of the Laplacian matrix $\mathbf{L}$. Thus, spectral clustering learns a non-linear map that embeds the original data into the eigenspace of $\mathbf{L}$ for uncovering the intrinsic structure of the input data.

We can also view the spectral embedding process as solving a low-rank approximation problem [zhu2018global]. To this end, we substitute $\mathbf{L}$ from (1) into the above minimization problem. Using the constraint $\mathbf{U}^T\mathbf{U}=\mathbf{I}_k$ allows rewriting the minimization problem in (2) as follows:

$\max_{\mathbf{U}\in\mathbb{R}^{n\times k}:\,\mathbf{U}^T\mathbf{U}=\mathbf{I}_k}\ \mathrm{tr}(\mathbf{U}^T\mathbf{M}\mathbf{U})$,   (3)

where we used the fact that $\mathrm{tr}(\mathbf{U}^T\mathbf{U})=k$ is a constant, and introduced the modified kernel matrix:

$\mathbf{M} := \mathbf{D}^{-1/2}\mathbf{K}\mathbf{D}^{-1/2}$.   (4)

The solution of this maximization problem is obtained by computing the eigenvectors corresponding to the $k$ largest eigenvalues of $\mathbf{M}$, denoted by $\mathbf{U}_k\in\mathbb{R}^{n\times k}$. That is, we should compute the rank-$k$ approximation of $\mathbf{M}$ to map the original data from $\mathbb{R}^d$ into $\mathbb{R}^k$. The computational complexity associated with finding the $k$ leading eigenvectors of the modified kernel matrix is $O(n^2k)$ without making assumptions on the structure of the kernel matrix, such as being sparse [halko2011finding]. When the number of data points is large, the exact solution of this step becomes computationally prohibitive and suffers from large memory overhead.

The third step of spectral clustering is to partition the rows of $\mathbf{U}_k$ using K-means clustering [bachem2018scalable], where the goal is to find $k$ centroids and assign each embedded point to the closest centroid for partitioning the original data. Although solving this problem is NP-hard [drineas1999clustering], it is common to use iterative algorithms that lead to expected approximation guarantees [kmeansplus, ahmadian2019better]. As these algorithms should compute distances between the $n$ embedded points and $k$ centroids in $\mathbb{R}^k$, the third step requires $O(nk^2)$ operations per iteration, and a few tens of iterations typically suffice to cluster the data. Since the third step takes linear time in the number of data points, the central challenge that arises in large-scale data settings is computing the $k$ leading eigenvectors of the modified kernel matrix $\mathbf{M}$, which is the main focus of this paper.

While other variants exist in the literature, including methods that are robust to noisy data [bojchevski2017robust, kim2020outer], this work focuses on the standard formulation of spectral clustering. Alg. 1 summarizes the three main steps of prototypical spectral clustering. Empirical evidence suggests that normalizing the rows of $\mathbf{U}_k$ improves stability and accuracy [von2007tutorial]. As these row vectors are in $\mathbb{R}^k$, the normalization cost scales linearly in the number of data points. Cross-validation is a common technique for selecting the kernel parameter $\sigma$.

Input: data set $\mathcal{X}=\{\mathbf{x}_1,\ldots,\mathbf{x}_n\}$, kernel parameter $\sigma$, number of clusters $k$.

1:function SC($\mathcal{X}$, $\sigma$, $k$)
2:     Form the kernel matrix $\mathbf{K}\in\mathbb{R}^{n\times n}$ with $K_{ij}=\kappa(\mathbf{x}_i,\mathbf{x}_j)$;
3:     Construct $\mathbf{M}=\mathbf{D}^{-1/2}\mathbf{K}\mathbf{D}^{-1/2}$, where $\mathbf{D}=\mathrm{diag}(\mathbf{K}\mathbf{1}_n)$;
4:     Compute the $k$ leading eigenvectors of $\mathbf{M}$ to obtain the spectral embedding $\mathbf{U}_k\in\mathbb{R}^{n\times k}$;
5:     Normalize each row of $\mathbf{U}_k$ to have unit length;
6:     Perform K-means clustering over the rows of the normalized matrix; return Clustering results.
Algorithm 1 Prototypical Spectral Clustering (SC)
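The following Python sketch mirrors the steps of Alg. 1 on a small synthetic example. It is an illustrative reimplementation under the stated formulation (Gaussian kernel, modified kernel matrix, K-means on the normalized embedding), not the authors' Matlab code, and the toy data set and parameter values are assumptions.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(X, sigma, k):
    """Illustrative NumPy version of Algorithm 1 (prototypical spectral clustering)."""
    # Step 2: dense Gaussian kernel matrix (O(n^2) time and memory).
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / (2.0 * sigma ** 2))
    # Step 3: modified kernel matrix M = D^{-1/2} K D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
    M = d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]
    # Step 4: k leading eigenvectors of the symmetric matrix M.
    _, eigvecs = np.linalg.eigh(M)                 # eigenvalues in ascending order
    U = eigvecs[:, -k:]                            # eigenvectors of the k largest eigenvalues
    # Step 5: normalize each row of the spectral embedding to unit length.
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    # Step 6: K-means on the embedded points.
    _, labels = kmeans2(U, k, iter=10, minit='++')
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two noisy concentric circles: a non-spherical case where plain K-means struggles.
    t = rng.uniform(0.0, 2.0 * np.pi, 400)
    r = np.repeat([1.0, 3.0], 200)
    X = np.c_[r * np.cos(t), r * np.sin(t)] + 0.1 * rng.standard_normal((400, 2))
    print(np.bincount(spectral_clustering(X, sigma=0.5, k=2)))
```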

I-A Related Work on Accelerating Spectral Clustering

Various methods have been proposed to accelerate spectral clustering by computing an approximate spectral embedding of the original data. Recent work [tremblay2020approximating] presented an excellent review of the literature on this topic for interested readers. In this paper, we divide the related work into two main categories: (1) methods that circumvent the computation of the full kernel matrix, and (2) techniques that consider the similarity graph as one of the inputs to spectral clustering and, thus, ignore the cost associated with step 2 of Algorithm 1. The former is more realistic since constructing full kernel matrices is computationally prohibitive, even for medium-sized data. We further divide the first line of work into three sub-categories:

  1. explicit approximation of the kernel matrix;

  2. random Fourier features to approximate the radial basis kernel function;

  3. forming a sparse similarity graph.

The Nyström method is one of the most popular techniques for approximating positive semi-definite matrices. In a nutshell, the Nyström method [williams2001using] selects $m$ points from the original data set using a sampling strategy, such as uniform sampling or a more complicated non-uniform selection technique [kumar2012sampling, sun2015review, pourkamali2018randomized]. After choosing this subset of the data, the so-called landmarks, which we denote by $\mathcal{Z}=\{\mathbf{z}_1,\ldots,\mathbf{z}_m\}$, one should compute similarities between the original data $\mathcal{X}$ and $\mathcal{Z}$, as well as pairwise similarities among the elements of $\mathcal{Z}$. Hence, the Nyström method constructs two matrices $\mathbf{C}\in\mathbb{R}^{n\times m}$ and $\mathbf{W}\in\mathbb{R}^{m\times m}$ such that $C_{ij}=\kappa(\mathbf{x}_i,\mathbf{z}_j)$ and $W_{ij}=\kappa(\mathbf{z}_i,\mathbf{z}_j)$, which takes $O(nmd)$ time. We then obtain an approximation of the kernel matrix in the form of:

$\hat{\mathbf{K}} = \mathbf{C}\mathbf{W}^{\dagger}\mathbf{C}^T$,   (5)

where $\mathbf{W}^{\dagger}$ is the pseudo-inverse of $\mathbf{W}$.
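A minimal sketch of this sampling step is given below, assuming uniform landmark sampling and the Gaussian kernel; the helper names and the brute-force accuracy check (which forms the full kernel matrix only for comparison) are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Cross-kernel matrix with entries exp(-||a_i - b_j||^2 / (2 * sigma^2))."""
    sq = (np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def nystrom_factors(X, m, sigma, rng):
    """Return C (n x m) and W (m x m) for K ~= C W^+ C^T with m uniform landmarks."""
    idx = rng.choice(X.shape[0], size=m, replace=False)
    Z = X[idx]                           # landmark set
    return gaussian_kernel(X, Z, sigma), gaussian_kernel(Z, Z, sigma)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((2000, 5))
    C, W = nystrom_factors(X, m=100, sigma=1.5, rng=rng)
    # Brute-force check on a small problem; never form K or K_hat at large scale.
    K_hat = C @ np.linalg.pinv(W) @ C.T
    K = gaussian_kernel(X, X, 1.5)
    print(np.linalg.norm(K - K_hat) / np.linalg.norm(K))   # relative Frobenius error
```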

When employing the Nyström method, the number of landmarks $m$ should exceed the desired number of clusters $k$, and increasing $m$ is a common practice to improve accuracy [wang2019scalable, pourkamali2019improved]. The Nyström approximation results in linear time complexity in the size of the original data for a fixed landmark set [bach2013sharp]. However, a critical task is to efficiently integrate the Nyström method with spectral clustering because the ultimate goal is to estimate the $k$ leading eigenvectors of the modified kernel matrix in linear time (instead of quadratic time in $n$). Therefore, several variants of Nyström-based spectral clustering have been proposed [fowlkes2004spectral, li2011time, mohan2017exploiting], where the underlying theme is to compute the spectral decomposition of the $m\times m$ inner matrix to save on computational resources and lift the solution from $\mathbb{R}^m$ back to $\mathbb{R}^n$. Although these methods reduce time complexity, a downside is the lack of a theoretical framework and understanding concerning how the Nyström method affects the quality of the resulting spectral embedding.

The second sub-category seeks to directly approximate the radial basis kernel function in the form of $\kappa(\mathbf{x}_i,\mathbf{x}_j)\approx\mathbf{r}(\mathbf{x}_i)^T\mathbf{r}(\mathbf{x}_j)$, where $\mathbf{r}(\mathbf{x})\in\mathbb{R}^D$ is known as the random Fourier feature vector [rahimi2008random]. The main idea behind this approach is to use the Fourier transform of shift-invariant kernel functions, including Gaussian kernels, for the efficient computation of feature vectors [sutherland2015error, wu2016revisiting, he2018fast]. Recent work [wu2018scalable] utilized this strategy to implicitly approximate the kernel matrix for performing spectral clustering in linear time. The introduced method requires the dimension $D$ of the feature vectors to be significantly greater than the ambient dimension $d$. However, each feature vector should contain only a few non-zero entries to reduce the subsequent eigenvalue decomposition cost.
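For concreteness, the sketch below implements the standard random Fourier feature construction of [rahimi2008random] for the Gaussian kernel; it does not reproduce the sparse feature vectors used in [wu2018scalable], and the feature dimension in the demo is an arbitrary choice.

```python
import numpy as np

def random_fourier_features(X, D, sigma, rng):
    """Random Fourier features r(x) with r(x_i)^T r(x_j) ~= exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    d = X.shape[1]
    freqs = rng.standard_normal((d, D)) / sigma     # frequencies drawn from N(0, sigma^{-2} I)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=D)  # random phase offsets
    return np.sqrt(2.0 / D) * np.cos(X @ freqs + phases)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 8))
    R = random_fourier_features(X, D=2048, sigma=2.0, rng=rng)
    exact = np.exp(-np.sum((X[0] - X[1]) ** 2) / (2.0 * 2.0 ** 2))
    print(exact, float(R[0] @ R[1]))    # the estimate concentrates around the exact value
```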

The third sub-category utilizes fast nearest neighbor search algorithms [muja2014scalable] to form similarity graphs with sparse kernel matrices, which may substantially reduce the computational and memory complexities associated with the spectral decomposition step [dong2011efficient, lucinska2012spectral]. In this case, the number of nearest neighbors is an additional tuning parameter that strongly influences the connectivity of the similarity graph and the resulting spectral embedding.

The second class of accelerated spectral clustering techniques assumes that the similarity graph is one of the inputs. Therefore, the main task is to compute the eigenvalue decomposition of the modified kernel matrix $\mathbf{M}$, which is assumed to be available at no cost. A possible solution focuses on utilizing tools from the randomized numerical linear algebra literature, such as randomized subspace iteration [saibaba2019randomized]. These methods typically employ random projections to identify a subspace that approximately captures the range of $\mathbf{M}$ [boutsidis2015spectral]. Another approach seeks to form a sparse similarity matrix according to the effective resistances of all nodes [spielman2011graph], which can be approximated in nearly linear time. However, these techniques are practical only when kernel matrices are accessible.

I-B Main Contributions

In this paper, we design and study an efficient method for incorporating the Nyström approximation into prototypical spectral clustering. The main feature of the proposed approach is to fully utilize the collected information from the Nyström method while achieving linear scalability in terms of the data size. Hence, our approach is suitable for clustering complex data sets containing tens or hundreds of thousands of samples.

As we will discuss in detail, efficient Nyström-based spectral clustering methods take a two-step approach, which entails restricting the rank of the kernel matrix associated with the landmark set, followed by lifting the solution back to the original space. Decoupling the two similarity matrices $\mathbf{C}$ and $\mathbf{W}$ when computing the spectral decomposition gives rise to several issues. First, performing the rank reduction step too early adversely affects the spectral embedding process. Second, the produced eigenvectors are not necessarily orthogonal, which requires additional orthogonalization steps. Third, providing theoretical guarantees to understand the relationship between the Nyström approximation error and the quality of the resulting spectral embedding becomes complicated. A serious concern is that a small perturbation of the kernel matrix may have an outsized influence on the modified kernel matrix due to the perturbation of the degree matrix $\mathbf{D}$, which is used for normalizing the rows and columns of the kernel matrix.

This work improves Nyström-based spectral clustering by utilizing both matrices $\mathbf{C}$ and $\mathbf{W}$ at the same time. Our proposed approach automatically exploits decay in the spectrum of the inner matrix $\mathbf{W}$ to reduce time complexity and memory footprint, instead of enforcing its rank to be the number of clusters $k$ as prescribed by the prior work. We then implicitly form the modified kernel matrix and compute its leading eigenvectors.

A further advantage of the proposed approach is that it reduces the current gap in the literature between a provably good low-rank approximation of the kernel matrix and a provably accurate estimate of the modified kernel matrix $\mathbf{M}$. We derive an upper bound for the perturbation of the modified kernel matrix due to the Nyström method by making use of the Taylor series expansion for matrix functions [deadman2016taylor]. Our analysis shows that a relatively small perturbation of the kernel matrix results in a practical upper bound for approximating the modified kernel matrix, or equivalently the normalized Laplacian matrix. We present numerical experiments to examine the main assumptions and bounds involved in our theoretical results.

Finally, we present an extensive empirical evaluation of the proposed approach, using both synthetic and real data. We compare our introduced method with other state-of-the-art approximate spectral clustering methods that circumvent the formation of the entire similarity graph, including techniques based on random Fourier features and sparse similarity graphs. We also corroborate the scalability of our proposed spectral clustering method concerning the size of input data and the number of landmarks.

I-C Paper Organization

The rest of the paper is outlined as follows. We first thoroughly review the related work on Nyström-based spectral clustering and exemplify several drawbacks in Section II. Then, Section III explains our proposed scalable approach for improving Nyström-based spectral clustering while providing new theoretical results to reduce the current gap in the literature. We also present numerical experiments to verify the assumptions made in our analysis. Section IV empirically demonstrates trade-offs between accuracy and efficiency of the proposed Nyström-based spectral clustering method on benchmark data sets. We present concluding remarks in Section V.

I-D Notation and Preliminaries

We denote column vectors with lower-case bold letters and matrices with upper-case bold letters. We take $\mathbf{1}_n$ to be the $n$-dimensional vector of all ones and $\mathbf{I}_n$ represents the $n\times n$ identity matrix. For a vector $\mathbf{x}$, $\|\mathbf{x}\|_2$ is the Euclidean norm and $\mathrm{diag}(\mathbf{x})$ returns a diagonal matrix with the elements of $\mathbf{x}$ on the main diagonal. Given a matrix $\mathbf{A}$, the $(i,j)$-th element is denoted by $A_{ij}$ and $\mathbf{A}^T$ is the transpose of $\mathbf{A}$. The matrix $\mathbf{A}$ admits a factorization, known as the truncated singular value decomposition (SVD), in the form of $\mathbf{A}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$, where $\mathbf{U}$ and $\mathbf{V}$ are matrices with orthonormal columns referred to as the left singular vectors and right singular vectors, respectively. The parameter $\rho$ represents the rank of $\mathbf{A}$ and the diagonal matrix $\boldsymbol{\Sigma}\in\mathbb{R}^{\rho\times\rho}$ contains the singular values of $\mathbf{A}$ in descending order, i.e., $\sigma_1(\mathbf{A})\geq\cdots\geq\sigma_\rho(\mathbf{A})>0$. In this paper, the expression "$\mathbf{A}$ has rank $r$" means that the rank of $\mathbf{A}$ does not exceed $r$. Using this factorization, we can define several standard matrix norms, including the Frobenius norm $\|\mathbf{A}\|_F$ and the spectral norm $\|\mathbf{A}\|_2$, which equals the largest singular value of $\mathbf{A}$.

If $\mathbf{A}$ is symmetric positive semi-definite, we have $\mathbf{U}=\mathbf{V}$ in the previous factorization, which is called the reduced eigenvalue decomposition (EVD) or spectral decomposition. The columns of $\mathbf{U}$ are the eigenvectors of $\mathbf{A}$ and $\boldsymbol{\Sigma}=\mathrm{diag}(\lambda_1,\ldots,\lambda_\rho)$ contains the corresponding eigenvalues in descending order. Thus, we get $\mathbf{A}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{U}^T$, where $\lambda_1\geq\cdots\geq\lambda_\rho>0$. The Moore-Penrose pseudo-inverse of $\mathbf{A}$ can be obtained from the EVD as $\mathbf{A}^{\dagger}=\mathbf{U}\boldsymbol{\Sigma}^{-1}\mathbf{U}^T$, where $\boldsymbol{\Sigma}^{-1}=\mathrm{diag}(\lambda_1^{-1},\ldots,\lambda_\rho^{-1})$. When $\mathbf{A}\in\mathbb{R}^{n\times n}$ is full rank, i.e., $\rho=n$, we have $\mathbf{A}^{\dagger}=\mathbf{A}^{-1}$. The trace of $\mathbf{A}$ is equal to the sum of its eigenvalues, i.e., $\mathrm{tr}(\mathbf{A})=\sum_{i=1}^{\rho}\lambda_i$. The matrix $\mathbf{A}$ is positive semi-definite of rank $r$ if and only if there exists a matrix $\mathbf{B}$ of rank $r$ such that $\mathbf{A}=\mathbf{B}\mathbf{B}^T$.

In this paper, given an integer $l$ that does not exceed the rank parameter $\rho$, we denote the first $l$ columns of $\mathbf{U}$, i.e., the $l$ leading eigenvectors of $\mathbf{A}$, by $\mathbf{U}_l$. Similarly, $\boldsymbol{\Sigma}_l$ represents a diagonal sub-matrix that contains the $l$ largest eigenvalues of $\mathbf{A}$. Based on the Eckart-Young-Mirsky theorem, $\mathbf{A}_l:=\mathbf{U}_l\boldsymbol{\Sigma}_l\mathbf{U}_l^T$ is the best rank-$l$ approximation to $\mathbf{A}$ because it minimizes the approximation error for any unitarily invariant norm over all matrices of rank at most $l$. The error incurred in the spectral norm by the best rank-$l$ approximation can be identified as:

$\|\mathbf{A}-\mathbf{A}_l\|_2 = \lambda_{l+1}$.   (6)

II Review of Spectral Clustering Using Nyström Approximation

Previous research on integrating the Nyström method with spectral clustering has proposed several techniques to compute an approximate spectral embedding in linear time with respect to the data size $n$. The key challenge is how to utilize the Nyström approximation of the kernel matrix, i.e., $\hat{\mathbf{K}}=\mathbf{C}\mathbf{W}^{\dagger}\mathbf{C}^T$, to compute the top $k$ eigenvectors of the approximate modified kernel matrix $\hat{\mathbf{M}}=\hat{\mathbf{D}}^{-1/2}\hat{\mathbf{K}}\hat{\mathbf{D}}^{-1/2}$, where $\hat{\mathbf{D}}=\mathrm{diag}(\hat{\mathbf{K}}\mathbf{1}_n)$. Obviously, the desired linear complexity in terms of time and space does not permit forming square matrices of size $n\times n$. Thus, the previous research focused on constructing a small matrix of size $m\times m$ as a proxy for the kernel matrix $\mathbf{K}$. Although this strategy reduces the cost, a major concern is that critical information regarding the structure of the matrix $\mathbf{C}$ may be ignored during this process, adversely impacting the accuracy of spectral clustering. In this section, we review two relevant prior techniques and highlight their limitations to motivate our proposed approach.

Exploiting the Nyström approximation for estimating the leading eigenvectors of the modified kernel matrix was first introduced in [fowlkes2004spectral]. The main idea behind this approach is to find the exact EVD of an $m\times m$ matrix, where $m$ refers to the number of selected landmarks. Then, a linear transformation from $\mathbb{R}^m$ to $\mathbb{R}^n$ generates an approximate spectral embedding of the original data. Let us write the Nyström approximation using the two sub-matrices $\mathbf{W}\in\mathbb{R}^{m\times m}$ and $\mathbf{E}\in\mathbb{R}^{m\times(n-m)}$, assuming for notational convenience that the landmark points are listed first:

$\hat{\mathbf{K}} = \begin{bmatrix}\mathbf{W} & \mathbf{E}\\ \mathbf{E}^T & \mathbf{E}^T\mathbf{W}^{-1}\mathbf{E}\end{bmatrix}$,   (7)

where $\mathbf{W}$, representing similarities between distinct landmark points using the Gaussian kernel function, has full rank [scholkopf2001learning], which allows calculating the inverse of $\mathbf{W}$ (this discussion also holds when using the pseudo-inverse). Similarly, the information regarding connectivity measures between $\mathcal{Z}$ and the remaining data points is encoded in $\mathbf{E}$. One can compute and decompose the approximate degree matrix into two diagonal sub-matrices as follows:

$\hat{\mathbf{D}} = \mathrm{diag}(\hat{\mathbf{K}}\mathbf{1}_n) = \begin{bmatrix}\hat{\mathbf{D}}_1 & \mathbf{0}\\ \mathbf{0} & \hat{\mathbf{D}}_2\end{bmatrix}, \quad \hat{\mathbf{D}}_1=\mathrm{diag}(\mathbf{W}\mathbf{1}_m+\mathbf{E}\mathbf{1}_{n-m}), \quad \hat{\mathbf{D}}_2=\mathrm{diag}(\mathbf{E}^T\mathbf{1}_m+\mathbf{E}^T\mathbf{W}^{-1}\mathbf{E}\mathbf{1}_{n-m})$.   (8)

Thus, we can compute the two normalized matrices $\bar{\mathbf{W}}=\hat{\mathbf{D}}_1^{-1/2}\mathbf{W}\hat{\mathbf{D}}_1^{-1/2}$ and $\bar{\mathbf{E}}=\hat{\mathbf{D}}_1^{-1/2}\mathbf{E}\hat{\mathbf{D}}_2^{-1/2}$. The next step is to define $\mathbf{S}=\bar{\mathbf{W}}+\bar{\mathbf{W}}^{-1/2}\bar{\mathbf{E}}\bar{\mathbf{E}}^T\bar{\mathbf{W}}^{-1/2}$ and calculate its EVD:

$\mathbf{S} = \mathbf{U}_S\boldsymbol{\Lambda}_S\mathbf{U}_S^T$.   (9)

The last step is to find the relationship between the spectral decomposition of $\mathbf{S}$ and the modified kernel matrix $\hat{\mathbf{M}}$. To this end, define the matrix

$\mathbf{V} = \begin{bmatrix}\bar{\mathbf{W}}\\ \bar{\mathbf{E}}^T\end{bmatrix}\bar{\mathbf{W}}^{-1/2}\mathbf{U}_S\boldsymbol{\Lambda}_S^{-1/2}\in\mathbb{R}^{n\times m}$,   (10)

whose columns are orthonormal. Thus, the main contribution of [fowlkes2004spectral] was to generate a factorization of $\hat{\mathbf{M}}$ without explicitly computing all its entries:

$\hat{\mathbf{M}} = \mathbf{V}\boldsymbol{\Lambda}_S\mathbf{V}^T$.   (11)

Although this technique estimates the leading eigenvectors of the modified kernel matrix, a significant downside is the need to use $\mathbf{E}$, or equivalently $\mathbf{C}$, more than once to form the matrix $\mathbf{S}$ and to map its eigenspace from $\mathbb{R}^m$ to $\mathbb{R}^n$ via (10). Hence, this method incurs high computational complexity and memory overhead for clustering huge data sets.
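The following NumPy sketch follows the construction in (7)-(11) as written above, assuming the landmark points are ordered first; it is meant only to illustrate why the cost grows quadratically with the number of landmarks, and the small numerical guards are implementation details not discussed in the paper.

```python
import numpy as np

def one_shot_nystrom_embedding(W, E, k):
    """Approximate top-k eigenvectors of the modified kernel matrix via (7)-(11).

    W : (m, m) kernel among landmarks; E : (m, n-m) kernel between landmarks and
    the remaining points. Rows of the output are ordered with landmarks first.
    """
    ones_rest = np.ones(E.shape[1])
    # Approximate degrees under K_hat, split into landmark / remaining blocks, cf. (8).
    d1 = W.sum(axis=1) + E @ ones_rest
    d2 = E.sum(axis=0) + E.T @ np.linalg.solve(W, E @ ones_rest)
    Wb = W / np.sqrt(np.outer(d1, d1))
    Eb = E / np.sqrt(np.outer(d1, d2))
    # Symmetric inverse square root of the normalized landmark block.
    lam, Q = np.linalg.eigh(Wb)
    Wb_isqrt = (Q / np.sqrt(np.maximum(lam, 1e-12))) @ Q.T
    # S in (9) and its EVD (eigenvalues returned in ascending order).
    S = Wb + Wb_isqrt @ (Eb @ Eb.T) @ Wb_isqrt
    lam_s, U_s = np.linalg.eigh((S + S.T) / 2.0)
    # Lift back to R^n with orthonormal columns, cf. (10); the cost is O(n m^2).
    V = np.vstack([Wb, Eb.T]) @ Wb_isqrt @ U_s / np.sqrt(np.maximum(lam_s, 1e-12))
    return V[:, -k:]     # eigenvectors associated with the k largest eigenvalues

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1200, 4))
    sq = lambda A, B: np.maximum(np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2*A@B.T, 0.0)
    Z, rest = X[:60], X[60:]                       # first 60 points act as landmarks
    W, E = np.exp(-sq(Z, Z) / 2.0), np.exp(-sq(Z, rest) / 2.0)
    U = one_shot_nystrom_embedding(W, E, k=3)
    print(U.shape, np.round(U.T @ U, 3))           # columns are (numerically) orthonormal
```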

In order to improve the scalability of integrating the Nyström method with spectral clustering, another approach was proposed in [li2011time]. The main idea is to treat the inner matrix $\mathbf{W}$ as a proxy for the full kernel matrix. That is, one computes the best rank-$k$ approximation of the normalized matrix $\mathbf{D}_W^{-1/2}\mathbf{W}\mathbf{D}_W^{-1/2}$, and then the solution is lifted from $\mathbb{R}^m$ back to $\mathbb{R}^n$ while using the matrix $\mathbf{C}$ only once. To be formal, the first step is to compute the following EVD:

$\mathbf{D}_W^{-1/2}\mathbf{W}\mathbf{D}_W^{-1/2} = \mathbf{U}_W\boldsymbol{\Lambda}_W\mathbf{U}_W^T$,   (12)

where $\mathbf{D}_W=\mathrm{diag}(\mathbf{W}\mathbf{1}_m)$ is the degree matrix associated with $\mathbf{W}$. This approach then immediately utilizes the best rank-$k$ approximation of $\mathbf{D}_W^{-1/2}\mathbf{W}\mathbf{D}_W^{-1/2}$ to generate a rank-$k$ approximation of the modified kernel matrix as follows:

$\hat{\mathbf{M}} \approx \hat{\mathbf{D}}^{-1/2}\mathbf{R}\mathbf{R}^T\hat{\mathbf{D}}^{-1/2}$.   (13)

In this equation, we defined the following matrix, which consists of the multiplication of the matrix $\mathbf{C}$ by the $k$ leading eigenvalues and eigenvectors of the normalized matrix $\mathbf{D}_W^{-1/2}\mathbf{W}\mathbf{D}_W^{-1/2}$:

$\mathbf{R} := \mathbf{C}\mathbf{D}_W^{-1/2}\mathbf{U}_{W,k}\boldsymbol{\Lambda}_{W,k}^{-1/2}\in\mathbb{R}^{n\times k}$,   (14)

and $\hat{\mathbf{D}}=\mathrm{diag}(\mathbf{R}\mathbf{R}^T\mathbf{1}_n)$ is a diagonal degree matrix, which takes linear time with respect to the data size $n$.

Therefore, unlike the approach in [fowlkes2004spectral], this method requires a single pass over the matrix $\mathbf{C}$, which is a notable gain when the number of samples is large. However, this reduction of time complexity leads to two drawbacks. Information loss may occur because the strict rank reduction step is performed on the inner matrix $\mathbf{W}$ without taking into consideration the structure of $\mathbf{C}$. Also, the produced eigenvectors in (13) are not guaranteed to be orthogonal. The authors in [li2011time] proposed an orthogonalization step to tackle this problem.
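A compact sketch of this one-shot rank-k procedure, following the reading of (12)-(14) given above, is shown below; the exact eigenvalue scaling is one reasonable interpretation rather than a verbatim transcription of [li2011time], and the returned embedding columns are, as noted, not orthogonal in general.

```python
import numpy as np

def rank_k_nystrom_embedding(C, W, k):
    """One-shot rank-k spectral embedding in the spirit of (12)-(14).

    C : (n, m) similarities to landmarks, W : (m, m) landmark kernel. Returns an
    n x k matrix whose columns approximate the top-k eigenvectors of the modified
    kernel matrix; no n x n matrix is ever formed, but the columns are not orthogonal.
    """
    dW = W.sum(axis=1)
    Wn = W / np.sqrt(np.outer(dW, dW))               # D_W^{-1/2} W D_W^{-1/2}, cf. (12)
    lam, U = np.linalg.eigh(Wn)                      # ascending eigenvalues
    lam_k, U_k = lam[-k:], U[:, -k:]                 # best rank-k part
    R = (C / np.sqrt(dW)[None, :]) @ U_k / np.sqrt(np.maximum(lam_k, 1e-12))   # cf. (14)
    d_hat = np.maximum(R @ (R.T @ np.ones(C.shape[0])), 1e-12)                 # degrees of R R^T
    return R / np.sqrt(d_hat)[:, None]               # cf. (13); may need re-orthogonalization
```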

Fig. 1: Comparing accuracy and time complexity of our approach with the previous research (Approaches 1 and 2 refer to [fowlkes2004spectral] and [li2011time], respectively).

In a nutshell, we discussed two related Nyström-based techniques that seek to accelerate the prototypical spectral clustering algorithm. The first approach builds on the exact spectral decomposition of the modified kernel matrix. In contrast, the second approach reduces the computational complexity while compromising accuracy by prematurely restricting the rank of the kernel matrix associated with the landmark set. To further illustrate the merits and limitations of these two approaches, we present a numerical experiment to report both accuracy and time complexity as the number of landmarks increases. In this experiment, we use a benchmark data set from LIBSVM [CC01a], named mushrooms, which consists of 8,124 samples with 112 attributes. In Fig. 1, we report the results of our accuracy and timing comparison as a function of the number of landmarks $m$. Since the landmark selection process involves uniform sampling, we use independent trials for each value of $m$, and we fix the kernel parameter $\sigma$. Moreover, the mushrooms data set contains two classes, i.e., $k=2$, and we thus investigate the accuracy of estimating the two leading eigenvectors of the modified kernel matrix as follows [fowlkes2004spectral]:

(15)

When the two matrices comprising the leading eigenvectors are identical, this metric reaches its maximum value of 1. Also, higher values indicate more accurate estimates of the leading eigenvectors for partitioning the embedded data in the last step of spectral clustering.
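One standard way to compute such an alignment score between two orthonormal bases, consistent with the description of (15) (maximum value 1 when the two leading eigenspaces coincide), is sketched below; the paper's exact metric may differ in normalization.

```python
import numpy as np

def eigenvector_alignment(U_exact, U_approx):
    """Alignment between two n x k orthonormal bases: equals 1 when the spanned
    eigenspaces coincide, and decreases toward 0 for larger discrepancies."""
    k = U_exact.shape[1]
    return np.linalg.norm(U_exact.T @ U_approx, 'fro') ** 2 / k
```

For instance, feeding the exact leading eigenvectors of the modified kernel matrix and the output of either approximate method into such a function yields the type of comparison reported in Fig. 1.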

Fig. 1(a) shows that the first approach achieves higher accuracy levels than the second method introduced in [li2011time]. This observation is consistent with our previous discussion because the method proposed in [fowlkes2004spectral] does not enforce a strict rank reduction step. However, according to Fig. 1(b), the first approach suffers from high computational complexity, which is a scalability barrier for clustering massive data sets.

To be formal, let us compare the time complexity of these two approaches. We ignore the shared cost of forming $\mathbf{C}$ and $\mathbf{W}$ in the Nyström method, which is $O(nmd)$ arithmetic operations. Moreover, we only report dominant costs involving the number of data points $n$ (e.g., we omit the $O(m^3)$ cost of the EVD of $m\times m$ matrices). The time complexity of the first approach is $O(nm^2)$ to form $\mathbf{S}$ and the estimated eigenvectors based on (9) and (10). However, the second approach takes $O(nmk)$ operations based on (13) and (14). Although the cost of both approaches scales linearly as a function of $n$, the time complexity of the first approach scales quadratically with the number of landmarks $m$. To address this problem, we propose a new approach that provides tunable trade-offs between accuracy and efficiency. As we see in Fig. 1, our approach generates accurate estimates of the leading eigenvectors, and its time complexity is similar to that of the efficient approach in [li2011time]. Another shortcoming of the previous research is the lack of theoretical guarantees for the perturbation analysis of the modified kernel matrix and its eigenvectors when leveraging the Nyström approximation.

III The Proposed Method

This paper presents a systematic treatment of utilizing the Nyström method for improving the accuracy and scalability of approximate spectral clustering. The proposed approach automatically exploits decay in the spectrum of the inner matrix $\mathbf{W}$ to reduce the computational complexity and memory overhead of the prototypical spectral clustering algorithm presented in Alg. 1. We take advantage of the extracted information to construct an approximation of the modified kernel matrix in the form of $\hat{\mathbf{M}}=\mathbf{B}\mathbf{B}^T$, where $\mathbf{B}\in\mathbb{R}^{n\times l}$ and $l\leq m$ represents the underlying low-rank structure of $\mathbf{W}$. This strategy allows us to use standard SVD solvers for computing the leading left singular vectors of $\mathbf{B}$ to obtain the spectral embedding of the original data in linear time with respect to both the number of data points and the size of the landmark set. We also offer new theoretical results for the perturbation analysis of the (modified) kernel matrix due to exploiting the low-rank structure of $\mathbf{W}$ and utilizing the Nyström method. Furthermore, we present two numerical experiments to verify the assumptions made in the analysis and investigate the obtained upper bounds. While the presented results are based on the spectral norm, our analysis can be easily generalized to other unitarily invariant matrix norms, including the Frobenius norm.

III-A The Proposed Scalable Spectral Clustering Algorithm

The first step of our proposed approach exploits decay in the spectrum of the inner matrix $\mathbf{W}$, which is obtained from the Nyström approximation $\hat{\mathbf{K}}=\mathbf{C}\mathbf{W}^{\dagger}\mathbf{C}^T$, by computing its spectral decomposition as follows:

$\mathbf{W} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^T$,   (16)

where the diagonal matrix $\boldsymbol{\Lambda}=\mathrm{diag}(\lambda_1,\ldots,\lambda_m)$, with $\lambda_1\geq\cdots\geq\lambda_m>0$, contains the decaying spectrum of $\mathbf{W}$. A key aspect of our approach is that eigenvectors corresponding to small eigenvalues generate very little error for a low-rank approximation of $\mathbf{W}$. Therefore, we propose to retain the eigenvectors whose corresponding eigenvalues are above a threshold $\tau>0$:

$l := \#\{\, i\in\{1,\ldots,m\}:\ \lambda_i\geq\tau \,\}$.   (17)

The parameter $\tau$ should be chosen so that the obtained $l$ exceeds the number of clusters $k$. Hence, the first step of our approach utilizes the decaying spectrum of the inner matrix for replacing $\mathbf{W}$ with its best rank-$l$ approximation $\mathbf{W}_l=\mathbf{V}_l\boldsymbol{\Lambda}_l\mathbf{V}_l^T$, where $\mathbf{V}_l\in\mathbb{R}^{m\times l}$ and $\boldsymbol{\Lambda}_l\in\mathbb{R}^{l\times l}$ contain the $l$ leading eigenvectors and eigenvalues, respectively. Therefore, the parameter $\tau$ allows controlling the fraction of spectral energy captured by the low-rank approximation (note that $l\leq m$):

$\sum_{i=1}^{l}\lambda_i \,\Big/\, \sum_{i=1}^{m}\lambda_i$.   (18)

Next, we generate an approximation of $\hat{\mathbf{K}}$ as follows:

$\hat{\mathbf{K}}_l = \mathbf{C}\mathbf{W}_l^{\dagger}\mathbf{C}^T = \mathbf{G}\mathbf{G}^T, \quad \mathbf{G}:=\mathbf{C}\mathbf{V}_l\boldsymbol{\Lambda}_l^{-1/2}\in\mathbb{R}^{n\times l}$.   (19)

The next step is to find the diagonal degree matrix without explicitly computing $\hat{\mathbf{K}}_l$:

$\hat{\mathbf{D}} = \mathrm{diag}(\hat{\mathbf{K}}_l\mathbf{1}_n) = \mathrm{diag}\big(\mathbf{G}(\mathbf{G}^T\mathbf{1}_n)\big)$,   (20)

which shows that we can compute $\hat{\mathbf{D}}$ by two matrix-vector multiplications. We thus approximate the modified kernel matrix using the two matrices $\mathbf{G}$ and $\hat{\mathbf{D}}$ in the following form:

$\hat{\mathbf{M}} = \hat{\mathbf{D}}^{-1/2}\hat{\mathbf{K}}_l\hat{\mathbf{D}}^{-1/2} = \mathbf{B}\mathbf{B}^T, \quad \mathbf{B}:=\hat{\mathbf{D}}^{-1/2}\mathbf{G}\in\mathbb{R}^{n\times l}$.   (21)

The last step involves computing the $k$ leading left singular vectors of $\mathbf{B}$ due to the following relationship between the SVD of $\mathbf{B}$ and the EVD of $\hat{\mathbf{M}}$:

$\mathbf{B}=\mathbf{U}_B\boldsymbol{\Sigma}_B\mathbf{V}_B^T \ \Longrightarrow\ \hat{\mathbf{M}}=\mathbf{B}\mathbf{B}^T=\mathbf{U}_B\boldsymbol{\Sigma}_B^2\mathbf{U}_B^T$.   (22)

The proposed spectral clustering algorithm is summarized in Alg. 2. The key advantage of our approach is that it requires a single pass over the matrix $\mathbf{C}$, which encodes similarities between all the input data points and the landmark set. Given $\mathbf{V}_l$ and $\boldsymbol{\Lambda}_l$, the dominant computational cost for forming the matrix $\mathbf{G}$ in (19) is $O(nml)$ arithmetic operations. Moreover, unlike the previous approach in [li2011time], we do not enforce a strict rank-$k$ approximation of the inner matrix $\mathbf{W}$. Instead, we exploit decay in the spectrum of $\mathbf{W}$, and the cost of forming $\mathbf{G}$ scales linearly with the underlying rank parameter $l$. Also, it takes linear time in $n$ to compute the $k$ leading left singular vectors of $\mathbf{B}$. Hence, the parameter $\tau$ provides tunable trade-offs between efficiency and accuracy of the produced spectral embedding for large data sets. While we assumed that the computational cost of steps that do not involve $n$ is not a scalability barrier, one can utilize fast solvers for accelerating the computation of the spectral decomposition of the inner matrix in the first step of our approach. For example, the previous work [li2014large] employed randomized low-rank matrix approximation algorithms to compute the spectral decomposition of $\mathbf{W}$ in the Nyström method. Thus, we can apply the same strategy to further accelerate our proposed spectral clustering algorithm.

Input: data set $\mathcal{X}$, landmark set $\mathcal{Z}$, kernel parameter $\sigma$, number of clusters $k$, threshold parameter $\tau$.

1:function SCNys($\mathcal{X}$, $\mathcal{Z}$, $\sigma$, $k$, $\tau$)
2:     Form $\mathbf{C}\in\mathbb{R}^{n\times m}$ and $\mathbf{W}\in\mathbb{R}^{m\times m}$, where $C_{ij}=\kappa(\mathbf{x}_i,\mathbf{z}_j)$ and $W_{ij}=\kappa(\mathbf{z}_i,\mathbf{z}_j)$;
3:     Compute the exact (or approximate) EVD of $\mathbf{W}$ and find the value of $l$ as in (17);
4:     Construct $\mathbf{G}=\mathbf{C}\mathbf{V}_l\boldsymbol{\Lambda}_l^{-1/2}$;
5:     Compute the degree matrix $\hat{\mathbf{D}}=\mathrm{diag}(\mathbf{G}(\mathbf{G}^T\mathbf{1}_n))$;
6:     Find the $k$ leading left singular vectors of $\mathbf{B}=\hat{\mathbf{D}}^{-1/2}\mathbf{G}$ to obtain the spectral embedding $\hat{\mathbf{U}}_k\in\mathbb{R}^{n\times k}$;
7:     Normalize each row of $\hat{\mathbf{U}}_k$ to have unit length;
8:     Perform K-means clustering over the rows of the normalized matrix; return Clustering results.
Algorithm 2 Proposed Spectral Clustering Algorithm
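A NumPy sketch of Alg. 2 is given below for illustration; it follows (16)-(22) as written above, uses uniform landmark sampling in the demo, and adds small numerical safeguards (clipping tiny eigenvalues and degrees) that are implementation details rather than part of the algorithm's description.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def sc_nystrom(X, Z, sigma, k, tau):
    """Illustrative NumPy version of Algorithm 2 (SCNys)."""
    def kernel(A, B):
        sq = (np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

    # Step 2: Nystrom factors C and W.
    C, W = kernel(X, Z), kernel(Z, Z)
    # Step 3: EVD of the inner matrix and data-driven rank selection as in (17).
    lam, V = np.linalg.eigh(W)                        # ascending eigenvalues
    l = max(int(np.sum(lam >= tau)), k)               # keep at least k eigenvalues
    lam_l, V_l = lam[-l:], V[:, -l:]
    # Step 4: G = C V_l Lambda_l^{-1/2}, so that G G^T approximates C W_l^+ C^T.
    G = C @ V_l / np.sqrt(np.maximum(lam_l, 1e-12))
    # Step 5: degree matrix from two matrix-vector products, then B = D^{-1/2} G.
    d_hat = np.maximum(G @ (G.T @ np.ones(X.shape[0])), 1e-12)
    B = G / np.sqrt(d_hat)[:, None]
    # Step 6: k leading left singular vectors of B give the spectral embedding.
    U = np.linalg.svd(B, full_matrices=False)[0][:, :k]
    # Steps 7-8: row normalization followed by K-means.
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    _, labels = kmeans2(U, k, iter=10, minit='++')
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = rng.uniform(0.0, 2.0 * np.pi, 1000)
    r = np.repeat([1.0, 3.0], 500)
    X = np.c_[r * np.cos(t), r * np.sin(t)] + 0.1 * rng.standard_normal((1000, 2))
    Z = X[rng.choice(1000, size=50, replace=False)]   # uniform landmark sampling
    print(np.bincount(sc_nystrom(X, Z, sigma=0.5, k=2, tau=1e-3)))
```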

III-B Theoretical Results

The goal of this section is to present theoretical insights and develop a better understanding of the role of the two low-rank approximations that we leveraged in our Nyström-based spectral clustering algorithm. As a reference, the modified kernel matrix of the original data that we introduced in (4) is $\mathbf{M}=\mathbf{D}^{-1/2}\mathbf{K}\mathbf{D}^{-1/2}$, and our method substitutes the kernel matrix $\mathbf{K}$ with the Nyström approximation $\hat{\mathbf{K}}$. While various theoretical guarantees exist on the quality of the Nyström approximation, e.g., upper bounds for $\|\mathbf{K}-\hat{\mathbf{K}}\|$ with respect to unitarily invariant matrix norms [gittens2016revisiting], a serious concern is that utilizing this approximation may have an outsized influence on the normalized kernel matrix due to the perturbation of the degree matrix. The lack of theoretical understanding regarding the approximation quality of $\hat{\mathbf{M}}$ is a critical disadvantage of Nyström-based spectral clustering.

Moreover, the first step of our approach exploits the rank-$l$ approximation of the inner matrix $\mathbf{W}$. While the incurred error $\|\mathbf{W}-\mathbf{W}_l\|_2$ is negligible for small values of the threshold parameter $\tau$, this approximation does not necessarily guarantee that the error term $\|\mathbf{W}^{\dagger}-\mathbf{W}_l^{\dagger}\|_2$ is small. To explain this point, note that for any invertible matrices $\mathbf{A}_1$ and $\mathbf{A}_2$, we have $\mathbf{A}_1^{-1}-\mathbf{A}_2^{-1}=\mathbf{A}_1^{-1}(\mathbf{A}_2-\mathbf{A}_1)\mathbf{A}_2^{-1}$. Thus, a small norm of $\mathbf{A}_1-\mathbf{A}_2$ cannot be directly used to conclude that their inverses are close to each other or even bounded. In this section, we start with providing theoretical guarantees for understanding the influence of the rank-$l$ approximation of $\mathbf{W}$.

Theorem 1 (Rank-$l$ approximation of $\mathbf{W}$).

Consider a set of $n$ distinct data points $\mathcal{X}$ and a landmark set $\mathcal{Z}$ sampled from $\mathcal{X}$ using a set of indices $\mathcal{I}\subset\{1,\ldots,n\}$ with $|\mathcal{I}|=m$. Let $\mathbf{K}\in\mathbb{R}^{n\times n}$ denote the kernel matrix associated with $\mathcal{X}$ using the Gaussian kernel function $\kappa$. Also, let $\mathbf{C}\in\mathbb{R}^{n\times m}$ be the subset of the columns of $\mathbf{K}$ selected according to the set of indices $\mathcal{I}$. The Nyström method computes $\mathbf{C}$ and $\mathbf{W}$ to form $\hat{\mathbf{K}}=\mathbf{C}\mathbf{W}^{-1}\mathbf{C}^T$. The error incurred by the best rank-$l$ approximation of the inner matrix $\mathbf{W}$ can be characterized as follows for any $l\leq m$:

(23)

where $\mathbf{U}_C$ represents the left singular vectors of $\mathbf{C}$ and $\mathbf{U}_{C,l}$ denotes its first $l$ columns.

Proof.

Let us start with the truncated SVD of $\mathbf{C}$, which allows us to rewrite the inner matrix $\mathbf{W}$ as follows:

(24)

and we get the following representation for its inverse:

(25)

Using (24) and (25), it is straightforward to show that:

(26)

where we used the invertibility of $\mathbf{W}$, since the kernel matrix associated with $\mathcal{Z}$ is full rank when employing the Gaussian kernel function on a set of distinct data points.

Next, we use the best rank-$l$ approximation of $\mathbf{W}$, i.e., $\mathbf{W}_l$, to simplify the Nyström approximation after replacing $\mathbf{W}$ with $\mathbf{W}_l$ as follows:

(27)

Note that the right singular vectors involved in this expression can be decomposed as:

(28)

where the second factor collects the trailing right singular vectors and we used the fact that the columns of the matrix of right singular vectors are orthonormal. Hence, we see that the resulting product has a block structure:

(29)

As a result, we obtain the characterization stated in (23), which completes the proof. ∎

The presented result in (23) allows us to develop a better understanding of utilizing the rank-$l$ approximation of $\mathbf{W}$ via the SVD of $\mathbf{C}$ without any matrix inversion. Based on this result and using standard inequalities for the spectral norm, we conclude that the error incurred by the low-rank approximation is always bounded for any value of $l$:

(30)

which can be simplified as:

(31)

A significant advantage of this upper bound compared to prior results, such as the bound presented in [drineas2005nystrom], is that our theoretical analysis does not make any assumptions on the landmark selection process involved in the Nyström method.
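The brute-force check below empirically measures the truncation error discussed in this subsection on a small synthetic problem; it forms the n x n matrices explicitly only for verification, fixes the bandwidth to 1, and does not reproduce the specific normalization used in (30)-(31).

```python
import numpy as np

def nystrom_truncation_error(C, W, l):
    """Spectral-norm error || C W^{-1} C^T - C W_l^+ C^T ||_2 for a given rank l.

    Brute-force check for small problems only: it forms n x n matrices, which
    the proposed algorithm itself never does.
    """
    lam, V = np.linalg.eigh(W)
    lam, V = lam[::-1], V[:, ::-1]                    # descending eigenvalues
    full = C @ ((V / lam) @ V.T) @ C.T                # C W^{-1} C^T
    trunc = C @ ((V[:, :l] / lam[:l]) @ V[:, :l].T) @ C.T
    return np.linalg.norm(full - trunc, 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((800, 6))
    Z = X[rng.choice(800, size=80, replace=False)]
    sq = lambda A, B: np.maximum(np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2*A@B.T, 0.0)
    C, W = np.exp(-sq(X, Z) / 2.0), np.exp(-sq(Z, Z) / 2.0)    # bandwidth fixed to 1
    for l in (10, 20, 40, 80):
        print(l, nystrom_truncation_error(C, W, l))
```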

Fig. 2: Investigating the influence of the threshold parameter $\tau$ on the normalized approximation error for a fixed number of landmarks $m$.

We provide numerical evidence measuring the normalized approximation error in (31) using the mushrooms data set that we considered in Section II. Fig. 2 reports the mean approximation error over independent trials with a fixed number of landmarks $m$ and varying values of the threshold parameter $\tau$. As discussed before, the normalized approximation error remains small even when the threshold parameter is set large enough that only a rank-$k$ approximation of $\mathbf{W}$ is retained. In practice, one has the flexibility to reduce the value of $\tau$ to lower the resulting rank-$l$ approximation error, achieving a trade-off between accuracy and scalability. As we increase the threshold parameter in this experiment, the mean value of the retained rank parameter $l$ decreases accordingly.

Our second theoretical analysis focuses on developing an upper bound for the spectral norm of the perturbation of the modified kernel matrix when utilizing the Nyström approximation. Therefore, our theoretical result reduces the current gap between a high-quality approximation of the kernel matrix $\mathbf{K}$ and a provably reasonable estimate of the modified kernel matrix $\mathbf{M}$. Since we have the relationship $\mathbf{L}=\mathbf{I}_n-\mathbf{M}$ between the modified kernel matrix and the graph Laplacian matrix, our result immediately translates to the perturbation analysis of the graph Laplacian under the Nyström approximation $\hat{\mathbf{K}}$.

Theorem 2 (Low-rank approximation of $\mathbf{M}$).

Consider the kernel matrix $\mathbf{K}$ representing pairwise similarities and the modified kernel matrix $\mathbf{M}=\mathbf{D}^{-1/2}\mathbf{K}\mathbf{D}^{-1/2}$, where $\mathbf{D}$ is the diagonal degree matrix. Let $\hat{\mathbf{K}}$ and $\hat{\mathbf{D}}$ denote the perturbed kernel matrix and its diagonal degree matrix, respectively. If the amount of perturbation is small in the sense that $\|\mathbf{D}^{-1}(\hat{\mathbf{D}}-\mathbf{D})\|_2<1$, then the normalized difference between $\mathbf{M}$ and $\hat{\mathbf{M}}=\hat{\mathbf{D}}^{-1/2}\hat{\mathbf{K}}\hat{\mathbf{D}}^{-1/2}$ is bounded in the spectral norm as follows:

(32)

where

(33)

and

(34)
Proof.

Using $\hat{\mathbf{K}}=\mathbf{K}+(\hat{\mathbf{K}}-\mathbf{K})$ and $\hat{\mathbf{D}}=\mathbf{D}+(\hat{\mathbf{D}}-\mathbf{D})$, we express the perturbed modified kernel matrix in the following form:

(35)

where we used the following relationship:

(36)

and the fact that $\mathbf{D}$ and $\hat{\mathbf{D}}$ commute because both are diagonal matrices. Also, for any invertible matrices $\mathbf{A}_1$ and $\mathbf{A}_2$, the matrix product $\mathbf{A}_1\mathbf{A}_2$ is invertible and we have $(\mathbf{A}_1\mathbf{A}_2)^{-1}=\mathbf{A}_2^{-1}\mathbf{A}_1^{-1}$.

Next, we use the Taylor series for matrix functions under the assumption that the perturbation of the degree matrix satisfies $\|\mathbf{D}^{-1}(\hat{\mathbf{D}}-\mathbf{D})\|_2<1$, which yields [yan2009fast]:

(37)

We then substitute (37) into the expression for the perturbed modified kernel matrix given in (35). It is straightforward to show that $\hat{\mathbf{M}}$ can be written as the sum of three terms, where the first term is:

(38)

Similarly, we get the second term:

(39)

and we find the third term:

(40)

Therefore, we compute $\mathbf{M}-\hat{\mathbf{M}}$ and use standard inequalities for the spectral norm to find the following upper bound:

(41)

and we divide both sides by $\|\mathbf{M}\|_2$ to complete the proof. ∎

This theorem provides an upper bound for the spectral norm of the perturbation of the modified kernel matrix $\mathbf{M}$. This result also applies to the perturbation analysis of the normalized graph Laplacian $\mathbf{L}=\mathbf{I}_n-\mathbf{M}$. The main underlying assumption is that the term $\|\mathbf{D}^{-1}(\hat{\mathbf{D}}-\mathbf{D})\|_2$ should be less than $1$, where both $\mathbf{D}$ and $\hat{\mathbf{D}}$ are diagonal matrices. That is, the absolute error incurred by estimating the degree of each node when utilizing the perturbed kernel matrix of the similarity graph should be less than the actual degree of the corresponding node. To better understand the influence of utilizing an approximate kernel matrix on the degree matrix, we have the following connection between $\hat{\mathbf{D}}-\mathbf{D}$ and $\hat{\mathbf{K}}-\mathbf{K}$:

$\hat{\mathbf{D}}-\mathbf{D} = \mathrm{diag}\big((\hat{\mathbf{K}}-\mathbf{K})\mathbf{1}_n\big)$.   (42)

Next, we find an upper bound for the spectral norm of this diagonal matrix as follows:

$\|\hat{\mathbf{D}}-\mathbf{D}\|_2 = \|(\hat{\mathbf{K}}-\mathbf{K})\mathbf{1}_n\|_\infty \leq \|(\hat{\mathbf{K}}-\mathbf{K})\mathbf{1}_n\|_2 \leq \sqrt{n}\,\|\hat{\mathbf{K}}-\mathbf{K}\|_2$,   (43)

where $\|\cdot\|_\infty$ represents the infinity norm for vectors and we used the inequality $\|\mathbf{x}\|_\infty\leq\|\mathbf{x}\|_2$ for vector norms. Therefore, the error associated with estimating the degree matrix depends on the kernel matrix approximation error. Furthermore, according to (32), the upper bound depends on the quality of the Nyström approximation, i.e., $\|\mathbf{K}-\hat{\mathbf{K}}\|_2$. As mentioned before, this error term is known to be bounded in the Nyström method for unitarily invariant norms and various landmark selection strategies such as uniform sampling without replacement [kumar2012sampling].
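For small problems where the exact kernel matrix can be formed, the following sketch computes the two diagnostic quantities discussed here: the degree perturbation term $\|\mathbf{D}^{-1}(\hat{\mathbf{D}}-\mathbf{D})\|_2$ that Theorem 2 assumes is below 1, and the spectral-norm error of the modified kernel matrix normalized by $\|\mathbf{M}\|_2$ (one reasonable reading of the normalized difference in (32)).

```python
import numpy as np

def perturbation_diagnostics(K, K_hat):
    """Diagnostics behind Theorem 2 for a given kernel approximation K_hat.

    Returns (eta, err) where eta = ||D^{-1}(D_hat - D)||_2, which the theorem
    assumes is below 1, and err is the spectral-norm error of the modified
    kernel matrix normalized by ||M||_2. Brute force: only for small n.
    """
    d, d_hat = K.sum(axis=1), K_hat.sum(axis=1)
    eta = float(np.max(np.abs(d_hat - d) / d))        # spectral norm of a diagonal matrix
    M = K / np.sqrt(np.outer(d, d))
    M_hat = K_hat / np.sqrt(np.outer(d_hat, d_hat))   # assumes all approximate degrees are positive
    err = np.linalg.norm(M - M_hat, 2) / np.linalg.norm(M, 2)
    return eta, err
```

Plugging in the exact kernel matrix of a small subsample together with its Nyström approximation reproduces the kind of trend reported in Fig. 3.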

To further explain the assumption made in Theorem 2 and understand the upper bound for the normalized difference between $\mathbf{M}$ and $\hat{\mathbf{M}}$, we revisit the numerical simulation performed on the mushrooms data set. For varying values of the number of landmarks $m$, we report the average and standard deviation values of $\|\mathbf{D}^{-1}(\hat{\mathbf{D}}-\mathbf{D})\|_2$ and the normalized approximation error of the modified kernel matrix over independent trials in Fig. 3(a) and Fig. 3(b), respectively. We also set the threshold parameter $\tau$ in our proposed approach so that the error introduced by the low-rank approximation of the inner matrix $\mathbf{W}$ is taken into account. As the number of landmarks $m$ increases, we observe a decreasing trend for both quantities because the Nyström kernel matrix approximations become more accurate. Note that even for small values of $m$, the underlying assumption of Theorem 2 is satisfied since $\|\mathbf{D}^{-1}(\hat{\mathbf{D}}-\mathbf{D})\|_2<1$. We also notice that the normalized approximation error for the modified kernel matrix continues to decrease as $m$ grows.

Fig. 3: Empirical investigation of the main assumption and the upper bound for the normalized approximation error in Theorem 2.

IV Experimental Results

In this section, we conduct extensive experiments to assess the performance and time complexity of the proposed spectral clustering method on several synthetic and real data sets. We compare our proposed method with other competing techniques that circumvent the construction of full similarity graphs in large-scale settings. All the studied spectral clustering methods are implemented in Matlab, and Matlab built-in functions are used for computing standard matrix factorizations, including the SVD and EVD. We also use Matlab's internal K-means in the last step of spectral clustering to partition the produced spectral embedding into $k$ clusters, with the maximum number of iterations set to 10, since increasing the number of iterations does not make any notable difference based on our evaluation.

This section reports two evaluation metrics to assess the performance of spectral clustering techniques using ground-truth labels. Let $\mathcal{G}=\{\mathcal{G}_1,\ldots,\mathcal{G}_k\}$ be the ground-truth partition of the data, and let $\mathcal{P}=\{\mathcal{P}_1,\ldots,\mathcal{P}_k\}$ represent the output of a spectral clustering algorithm. Thus, $|\mathcal{G}_i\cap\mathcal{P}_j|$ denotes the number of shared samples in $\mathcal{G}_i$ and $\mathcal{P}_j$. Hence,