Scalable Kernel K-Means Clustering with Nyström Approximation: Relative-Error Bounds

06/09/2017 ∙ by Shusen Wang, et al. ∙ Rensselaer Polytechnic Institute ∙ University of California, Berkeley ∙ Stevens Institute of Technology

Kernel k-means clustering can correctly identify and extract a far more varied collection of cluster structures than the linear k-means clustering algorithm. However, kernel k-means clustering is computationally expensive when the non-linear feature map is high-dimensional and there are many input points. Kernel approximation, e.g., the Nyström method, has been applied in previous works to approximately solve kernel learning problems when both of the above conditions are present. This work analyzes the application of this paradigm to kernel k-means clustering, and shows that applying the linear k-means clustering algorithm to k/ϵ (1 + o(1)) features constructed using a so-called rank-restricted Nyström approximation results in cluster assignments that satisfy a 1 + ϵ approximation ratio in terms of the kernel k-means cost function, relative to the guarantee provided by the same algorithm without the use of the Nyström method. As part of the analysis, this work establishes a novel 1 + ϵ relative-error trace norm guarantee for low-rank approximation using the rank-restricted Nyström approximation. Empirical evaluations on the 8.1 million instance MNIST8M dataset demonstrate the scalability and usefulness of kernel k-means clustering with Nyström approximation. This work argues that spectral clustering using Nyström approximation---a popular and computationally efficient, but theoretically unsound approach to non-linear clustering---should be replaced with the efficient and theoretically sound combination of kernel k-means clustering with Nyström approximation. The superior performance of the latter approach is empirically verified.


1 Introduction

Cluster analysis divides a data set into several groups using information found only in the data points. Clustering can be used in an exploratory manner to discover meaningful groupings within a data set, or it can serve as the starting point for more advanced analyses. As such, applications of clustering abound in machine learning and data analysis, including, inter alia: genetic expression analysis (Sharan et al., 2002), market segmentation (Chaturvedi et al., 1997), social network analysis (Handcock et al., 2007), image segmentation (Haralick and Shapiro, 1985), anomaly detection (Chandola et al., 2009), collaborative filtering (Ungar and Foster, 1998), and fast approximate learning of non-linear models (Si et al., 2014).

Linear $k$-means clustering is a standard and well-regarded approach to cluster analysis that partitions input vectors $x_1, \dots, x_n \in \mathbb{R}^d$ into $k$ clusters, in an unsupervised manner, by assigning each vector to the cluster with the nearest centroid. Formally, linear $k$-means clustering seeks to partition the set $\{x_1, \dots, x_n\}$ into $k$ disjoint sets $\mathcal{C}_1, \dots, \mathcal{C}_k$ by solving

$$\min_{\mathcal{C}_1,\dots,\mathcal{C}_k}\; \sum_{j=1}^{k} \sum_{x_i \in \mathcal{C}_j} \Big\| x_i - \frac{1}{|\mathcal{C}_j|} \sum_{x_l \in \mathcal{C}_j} x_l \Big\|_2^2. \qquad\qquad (1)$$

Lloyd’s algorithm (Lloyd, 1982) is a standard approach for finding local minimizers of (1), and is a staple in data mining and machine learning.
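As a concrete (and purely illustrative) example of running a Lloyd-type linear $k$-means solver, the sketch below uses scikit-learn, which is also the linear $k$-means implementation used in our single-machine experiments (Section 4.3); the toy data and parameters are our own choices, not values from the paper.

```python
# Minimal sketch: linear k-means (Lloyd-type iterations) via scikit-learn.
# The three isotropic blobs mimic the "easy" setting of Figure 1(a).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) + center
               for center in ([0.0, 0.0], [6.0, 0.0], [3.0, 5.0])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])   # cluster assignments
print(km.inertia_)       # value of the k-means objective (1)
```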

Despite its popularity, linear $k$-means clustering is not a universal solution to all clustering problems. In particular, linear $k$-means clustering strongly biases the recovered clusters towards isotropy and sphericity. Applied to the data in Figure 1(a), Lloyd’s algorithm is perfectly capable of partitioning the data into three clusters which fit these assumptions. However, the data in Figure 1(b) do not fit these assumptions: the clusters are ring-shaped and have coincident centers, so minimizing the linear $k$-means objective does not recover these clusters.

Figure 1: Figures 1(a) and 1(b) show two sets of two-dimensional points, with different colors indicating different clusters. Figure 1(a) is separable using linear $k$-means; Figure 1(b) is inseparable using linear $k$-means. Figure 1(c) shows the first two dimensions of the feature vectors derived from a kernel.

To extend the scope of $k$-means clustering to include anisotropic, non-spherical clusters such as those depicted in Figure 1(b), Schölkopf et al. (1998) proposed to perform linear $k$-means clustering in a nonlinear feature space instead of the input space. After choosing a feature function $\phi(\cdot)$ that maps the input vectors non-linearly into feature vectors $\phi(x_1), \dots, \phi(x_n)$, they propose minimizing the objective function

$$\min_{\mathcal{C}_1,\dots,\mathcal{C}_k}\; \sum_{j=1}^{k} \sum_{x_i \in \mathcal{C}_j} \Big\| \phi(x_i) - \frac{1}{|\mathcal{C}_j|} \sum_{x_l \in \mathcal{C}_j} \phi(x_l) \Big\|_2^2, \qquad\qquad (2)$$

where $\{\mathcal{C}_1, \dots, \mathcal{C}_k\}$ denotes a $k$-partition of the data set. The “kernel trick” enables us to minimize this objective without explicitly computing the potentially high-dimensional features, as inner products in feature space can be computed implicitly by evaluating the kernel function $\kappa(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$. Thus the information required to solve the kernel $k$-means problem (2) is present in the kernel matrix $K \in \mathbb{R}^{n\times n}$, where $K_{ij} = \kappa(x_i, x_j)$.

Let $K = U \Lambda U^T$ be the full eigenvalue decomposition (EVD) of the kernel matrix and let $y_1^T, \dots, y_n^T$ be the rows of $U \Lambda^{1/2} \in \mathbb{R}^{n\times n}$. It can be shown (see Appendix B.3) that the solution of (2) is identical to the solution of

$$\min_{\mathcal{C}_1,\dots,\mathcal{C}_k}\; \sum_{j=1}^{k} \sum_{y_i \in \mathcal{C}_j} \Big\| y_i - \frac{1}{|\mathcal{C}_j|} \sum_{y_l \in \mathcal{C}_j} y_l \Big\|_2^2. \qquad\qquad (3)$$

To demonstrate the power of kernel $k$-means clustering, consider the dataset in Figure 1(b). We use the Gaussian RBF kernel $\kappa(x_i, x_j) = \exp\!\big(-\tfrac{1}{2\sigma^2}\|x_i - x_j\|_2^2\big)$ with a suitable kernel width $\sigma$, and form the corresponding kernel matrix $K$ of the data in Figure 1(b). Figure 1(c) scatterplots the first two coordinates of the feature vectors $y_1, \dots, y_n$. Clearly, the first coordinate of the feature vectors already separates the two classes well, so $k$-means clustering using the non-linear features has a better chance of separating the two classes.
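The following sketch makes this construction concrete: it builds the RBF kernel matrix of a two-ring data set, forms feature vectors from the full EVD as in (3), and clusters them with linear $k$-means. The ring data and the kernel width are illustrative assumptions, not the exact settings behind Figure 1.

```python
# Sketch: exact kernel k-means via the EVD feature construction of Eqn. (3).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
n = 400
radii = np.repeat([1.0, 3.0], n // 2)                  # two concentric rings
theta = rng.uniform(0.0, 2.0 * np.pi, n)
X = np.c_[radii * np.cos(theta), radii * np.sin(theta)] + 0.05 * rng.randn(n, 2)

sigma = 1.0                                            # assumed kernel width
K = rbf_kernel(X, gamma=1.0 / (2.0 * sigma ** 2))      # K_ij = exp(-||xi-xj||^2 / (2 sigma^2))

lam, U = np.linalg.eigh(K)                             # full EVD: K = U diag(lam) U^T
Y = U * np.sqrt(np.clip(lam, 0.0, None))               # rows of U Lambda^{1/2} are the y_i

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Y)
print(np.bincount(labels))                             # ideally, each ring forms one cluster
```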

Although it is more generally applicable than linear $k$-means clustering, kernel $k$-means clustering is computationally expensive. As a baseline, we consider the cost of optimizing (3). The formation of the kernel matrix $K$ given the $n$ input vectors costs $O(n^2 d)$ time. The objective in (3) can then be (approximately) minimized using Lloyd’s algorithm at a cost of $O(n^2 k)$ time per iteration. This requires the $n$-dimensional non-linear feature vectors $y_1, \dots, y_n$ obtained from the full EVD of $K$; computing these feature vectors takes $O(n^3)$ time, because $K$ is, in general, full-rank. Thus, approximately solving the kernel $k$-means clustering problem by optimizing (3) costs $O(n^2 d + n^3 + n^2 k T)$ time, where $T$ is the number of iterations of Lloyd’s algorithm.

Kernel approximation techniques, including the Nyström method (Nyström, 1930; Williams and Seeger, 2001; Gittens and Mahoney, 2016) and random feature maps (Rahimi and Recht, 2007), have been applied to decrease the cost of solving kernelized machine learning problems: the idea is to replace $K$ with a low-rank approximation, which allows for more efficient computations. Chitta et al. (2011, 2012) proposed to apply kernel approximations to efficiently approximate kernel $k$-means clustering. Although kernel approximations mitigate the computational challenges of kernel $k$-means clustering, the aforementioned works do not provide guarantees on the clustering performance: how accurate must the low-rank approximation of $K$ be to ensure near optimality of the approximate clustering obtained via this method?

We propose a provable approximate solution to the kernel $k$-means problem based on the Nyström approximation. Our method has three steps: first, extract $s$ ($\ll n$) features using the Nyström method; second, reduce the features to $c$ ($\le s$) dimensions using the truncated SVD;¹ third, apply any off-the-shelf linear $k$-means clustering algorithm upon the $c$-dimensional features to obtain the final clusters. The total time complexity of the first two steps is $O(nsd + ns^2)$, which is linear in $n$. The time complexity of the third step depends on the specific linear $k$-means algorithm; for example, using Lloyd’s algorithm, the per-iteration complexity is $O(nck)$, and the number of iterations may depend on $c$.² [Footnote 1: This is why our variant is called the rank-restricted Nyström approximation. The rank-restriction serves two purposes. First, although we do not know whether the rank-restriction is necessary for the bound, we are unable to establish the bound without it. Second, the rank-restriction makes the third step, linear $k$-means clustering, much less costly. For the computational benefit, previous works (Boutsidis et al., 2009, 2010, 2015; Cohen et al., 2015; Feldman et al., 2013) have considered dimensionality reduction for linear $k$-means clustering.] [Footnote 2: Without the rank-restriction, the per-iteration cost would be $O(nsk)$, and the number of iterations may depend on $s$.]

Our method comes with a strong approximation ratio guarantee. Suppose we set the target rank to roughly $k/\epsilon$ and choose the sketch size $s$ accordingly for any $\epsilon > 0$, where the required $s$ depends on the coherence parameter $\mu$ of the dominant singular space of the kernel matrix $K$. Also suppose the standard kernel $k$-means and our approximate method use the same linear $k$-means clustering algorithm, e.g., Lloyd’s algorithm or some other algorithm that comes with different provable approximation guarantees. As guaranteed by Theorem 1, when the quality of the clustering is measured by the cost function defined in (2), with high probability our algorithm returns a clustering that is at most $1+\epsilon$ times worse than the standard kernel $k$-means clustering. Our theory makes explicit the trade-off between accuracy and computation: larger sketch size and target rank lead to high accuracy and also high computational cost.

Spectral clustering (Shi and Malik, 2000; Ng et al., 2002) is a popular alternative to kernel $k$-means clustering that can also partition non-linearly separable data such as those in Figure 1(b). Unfortunately, because it requires computing an $n\times n$ affinity matrix and the top $k$ eigenvectors of the corresponding graph Laplacian, spectral clustering is inefficient for large $n$. Fowlkes et al. (2004) applied the Nyström approximation to increase the scalability of spectral clustering. Since then, spectral clustering with Nyström approximation has been used in many works, e.g., (Arikan, 2006; Berg et al., 2004; Chen et al., 2011; Wang et al., 2016b; Weiss et al., 2009; Zhang and Kwok, 2010). Despite its popularity in practice, this approach does not come with guarantees on the approximation ratio for the obtained clustering. Our algorithm, which combines kernel $k$-means with Nyström approximation, is an equally computationally efficient alternative that comes with strong bounds on the approximation ratio, and can be used wherever spectral clustering is applied.

1.1 Contributions

Using tools developed in (Boutsidis et al., 2015; Cohen et al., 2015; Feldman et al., 2013), we rigorously analyze the performance of approximate kernel $k$-means clustering with the Nyström approximation, and show that a rank-restricted Nyström approximation delivers a $1+\epsilon$ approximation ratio guarantee, relative to the guarantee provided by the same algorithm without the use of the Nyström method.

As part of the analysis of kernel $k$-means with Nyström approximation, we establish the first relative-error bound for rank-restricted Nyström approximation,³ which has independent interest. [Footnote 3: Similar relative-error bounds were independently developed by contemporaneous work of Tropp et al. (2017), in service of the analysis of a novel streaming algorithm for fixed-rank approximation of positive semidefinite matrices. Preliminary versions of this work and theirs were simultaneously submitted to arXiv in June 2017.]

Kernel $k$-means clustering and spectral clustering are competing solutions to the nonlinear clustering problem, neither of which scales well with $n$. Fowlkes et al. (2004) introduced the use of Nyström approximations to make spectral clustering scalable; this approach has become popular in machine learning. We identify fundamental mathematical problems with this heuristic. These concerns and an empirical comparison establish that our proposed combination of kernel $k$-means with rank-restricted Nyström approximation is a theoretically well-founded and empirically competitive alternative to spectral clustering with Nyström approximation.

Finally, we demonstrate the scalability of this approach by measuring the performance of an Apache Spark implementation of a distributed version of our approximate kernel $k$-means clustering algorithm using the MNIST8M data set, which has 8.1 million instances and 10 classes.

1.2 Relation to Prior Work

The key to our analysis of the proposed approximate kernel $k$-means clustering algorithm is a novel relative-error trace norm bound for a rank-restricted Nyström approximation. We restrict the rank of the Nyström approximation in a non-standard manner (see Remark 1). Our relative-error trace norm bound is not a simple consequence of the existing bounds for non-rank-restricted Nyström approximation such as the ones provided by Gittens and Mahoney (2016). The relative-error bound which we provide for the rank-restricted Nyström approximation is potentially useful in other applications involving the Nyström method.

The projection-cost preservation (PCP) property (Cohen et al., 2015; Feldman et al., 2013) is an important tool for analyzing approximate linear $k$-means clustering. We apply our novel relative-error trace norm bound as well as existing tools in (Cohen et al., 2015) to prove that the rank-restricted Nyström approximation enjoys the PCP property. We do not rule out the possibility that the non-rank-restricted (rank-$s$) Nyström approximation satisfies the PCP property and/or also enjoys a $1+\epsilon$ approximation ratio guarantee when applied to kernel $k$-means clustering. However, the cost of the linear $k$-means clustering step in the algorithm is proportional to the dimensionality of the feature vectors, so the rank-restricted Nyström approximation, which produces $c$-dimensional feature vectors, where $c \le s$, is more computationally desirable.

Musco and Musco (2017) similarly establishes a $1+\epsilon$ approximation ratio for the kernel $k$-means objective when a Nyström approximation is used in place of the full kernel matrix. Specifically, Musco and Musco (2017) shows that when $O(k/\epsilon)$ columns of $K$ are sampled using ridge leverage score (RLS) sampling (Alaoui and Mahoney, 2015; Cohen et al., 2017; Musco and Musco, 2017) and are used to form a Nyström approximation, then applying linear $k$-means clustering to the resulting Nyström features returns a clustering that has objective value at most $1+\epsilon$ times as large as the objective value of the best clustering. Our theory is independent of that in Musco and Musco (2017), and differs in that (1) Musco and Musco (2017) applies specifically to Nyström approximations formed using RLS sampling, whereas our guarantees apply to any sketching method that satisfies the “subspace embedding” and “matrix multiplication” properties (see Lemma A.2 for definitions of these two properties); (2) Musco and Musco (2017) establishes a $1+\epsilon$ approximation ratio for the non-rank-restricted RLS-Nyström approximation, whereas we establish a $1+\epsilon$ approximation ratio for the (more computationally efficient) rank-restricted Nyström approximation.

1.3 Paper Organization

In Section 2, we start with a definition of the notation used throughout this paper as well as a background on matrix sketching methods. Then, in Section 3, we present our main theoretical results: Section 3.1 presents an improved relative-error rank-restricted Nyström approximation; Section 3.2 presents the main theoretical results on kernel $k$-means with Nyström approximation; and Section 3.3 studies kernel $k$-means with kernel principal component analysis. Section 4 discusses and evaluates the theoretical and empirical merits of kernel $k$-means clustering versus spectral clustering, when each is approximated using Nyström approximation. Section 5 empirically compares the Nyström method and random feature maps for kernel $k$-means clustering on medium-scale data. Section 6 presents a large-scale distributed implementation in Apache Spark and its empirical evaluation on a data set with 8.1 million points. Section 7 provides a brief conclusion. Proofs are provided in the Appendices.

2 Notation

This section defines the notation used throughout this paper. A set of commonly used parameters is summarized in Table 1.

   Notation    Definition
   $n$        number of samples
   $d$        number of features (attributes)
   $k$        number of clusters
   $c$        target rank of the Nyström approximation
   $s$        sketch size of the Nyström approximation
Table 1: Commonly used parameters. It holds that $k \le c \le s \ll n$.

Matrices and vectors.

We take $I_n$ to be the $n\times n$ identity matrix, $\mathbf{0}$ to be a vector or matrix of all zeros of the appropriate size, and $\mathbf{1}_n$ to be the $n$-dimensional vector of all ones.

Sets.

The set $\{1, 2, \dots, n\}$ is written as $[n]$. We call $\{\mathcal{C}_1, \dots, \mathcal{C}_k\}$ a $k$-partition of $[n]$ if $\mathcal{C}_1 \cup \cdots \cup \mathcal{C}_k = [n]$ and $\mathcal{C}_i \cap \mathcal{C}_j = \emptyset$ whenever $i \ne j$. Let $|\mathcal{C}|$ denote the cardinality of the set $\mathcal{C}$.

Singular value decomposition (SVD).

Let $A \in \mathbb{R}^{m\times n}$ and $\rho = \mathrm{rank}(A)$. A (compact) singular value decomposition (SVD) is defined by

$$A \;=\; U \Sigma V^T \;=\; \sum_{i=1}^{\rho} \sigma_i\, u_i v_i^T,$$

where $U \in \mathbb{R}^{m\times\rho}$, $\Sigma \in \mathbb{R}^{\rho\times\rho}$, and $V \in \mathbb{R}^{n\times\rho}$ are a column-orthogonal matrix, a diagonal matrix with nonnegative entries, and a column-orthogonal matrix, respectively. If $A$ is symmetric positive semi-definite (SPSD), then $U = V$, and this decomposition is also called the (reduced) eigenvalue decomposition (EVD). By convention, we take $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_\rho > 0$.

Truncated SVD.

The matrix $A_c = \sum_{i=1}^{c} \sigma_i\, u_i v_i^T$ is a rank-$c$ truncated SVD of $A$, and is an optimal rank-$c$ approximation to $A$ when the approximation error is measured in a unitarily invariant norm.

The Moore-Penrose inverse of $A$ is defined by $A^\dagger = V \Sigma^{-1} U^T$.

Leverage score and coherence.

Let $U \in \mathbb{R}^{m\times\rho}$ be defined in the above and $u_{i,:}$ be the $i$-th row of $U$. The row leverage scores of $A$ are $\ell_i = \|u_{i,:}\|_2^2$ for $i \in [m]$. The row coherence of $A$ is $\mu = \frac{m}{\rho}\max_i \|u_{i,:}\|_2^2$. The leverage scores for a matrix $A$ can be computed exactly in the time it takes to compute the matrix $U$; and the leverage scores can be approximated (in theory (Drineas et al., 2012) and in practice (Gittens and Mahoney, 2016)) in roughly the time it takes to apply a random projection matrix to the matrix $A$.
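As a small illustration (our own helper, not code from the paper), the leverage scores and coherence defined above can be computed directly from a thin SVD:

```python
# Sketch: row leverage scores and row coherence of A, per the definitions above.
import numpy as np

def leverage_scores_and_coherence(A, tol=1e-10):
    U, svals, _ = np.linalg.svd(A, full_matrices=False)
    rho = int(np.sum(svals > tol * svals.max()))   # numerical rank
    U = U[:, :rho]
    lev = np.sum(U ** 2, axis=1)                   # l_i = ||u_{i,:}||_2^2
    mu = A.shape[0] / rho * lev.max()              # row coherence
    return lev, mu

A = np.random.randn(1000, 20) @ np.random.randn(20, 50)   # a rank-20 matrix
lev, mu = leverage_scores_and_coherence(A)
print(round(lev.sum()), mu)   # leverage scores sum to rank(A); 1 <= mu <= m/rho
```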

Matrix norms.

We use three matrix norms in this paper:

Frobenius norm: $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\sum_{i} \sigma_i^2}$.
Spectral norm: $\|A\|_2 = \sigma_1$.
Trace norm: $\|A\|_* = \sum_{i} \sigma_i$.

Any square matrix $A$ satisfies $\|A\|_2 \le \|A\|_F \le \|A\|_*$. If additionally $A$ is SPSD, then $\|A\|_* = \mathrm{tr}(A)$.

Matrix sketching.

Here, we briefly review matrix sketching methods that are commonly used within randomized linear algebra (RLA) (Mahoney, 2011).

Given a matrix $A \in \mathbb{R}^{m\times n}$, we call $C = AS$ (typically with $s \ll n$) a sketch of $A$ and $S \in \mathbb{R}^{n\times s}$ a sketching matrix. Within RLA, sketching has emerged as a powerful primitive, where one is primarily interested in using random projections and random sampling to construct randomized sketches (Mahoney, 2011; Drineas and Mahoney, 2016). In particular, sketching is useful as it allows large matrices to be replaced with smaller matrices which are more amenable to efficient computation, but provably retain almost optimal accuracy in many computations (Mahoney, 2011; Woodruff, 2014). The columns of $C$ typically comprise a rescaled subset of the columns of $A$, or random linear combinations of the columns of $A$; the former type of sketching is called column selection or random sampling, and the latter is referred to as random projection.

Column selection forms $C$ using a randomly sampled and rescaled subset of the columns of $A$. Let $p_1, \dots, p_n$ be the sampling probabilities associated with the columns of $A$ (so that, in particular, $\sum_{i=1}^{n} p_i = 1$). The columns of the sketch $C$ are selected identically and independently as follows: each column of $C$ is randomly sampled from the columns of $A$ according to the sampling probabilities and rescaled by $\frac{1}{\sqrt{s\, p_j}}$, where $j$ is the index of the column of $A$ that was selected. In our matrix multiplication formulation for sketching, column selection corresponds to a sketching matrix $S \in \mathbb{R}^{n\times s}$ that has exactly one non-zero entry in each column, whose position and magnitude correspond to the index of the column selected from $A$. Uniform sampling is column sampling with $p_1 = \cdots = p_n = \frac{1}{n}$, and leverage score sampling takes $p_i = \frac{\ell_i}{\sum_{j} \ell_j}$ for $i \in [n]$, where $\ell_i$ is the $i$-th leverage score of some matrix (typically $A$, $A_c$, or a randomized approximation thereto) (Drineas et al., 2012).
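The sketch below (a hypothetical helper, with names of our choosing) implements this column-selection procedure; passing no probabilities gives uniform sampling, and passing normalized leverage scores gives leverage score sampling.

```python
# Sketch: column selection. Each of the s columns of C is an i.i.d. draw from the
# columns of A with probabilities p, rescaled by 1/sqrt(s * p_j).
import numpy as np

def column_selection_sketch(A, s, p=None, seed=None):
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    if p is None:                                   # uniform sampling: p_i = 1/n
        p = np.full(n, 1.0 / n)
    idx = rng.choice(n, size=s, replace=True, p=p)
    return A[:, idx] / np.sqrt(s * p[idx])          # C = A S

A = np.random.randn(300, 2000)
C = column_selection_sketch(A, s=200, seed=0)
# E[C C^T] = A A^T: the relative error of the sketched Gram matrix shrinks as s grows.
print(np.linalg.norm(C @ C.T - A @ A.T) / np.linalg.norm(A @ A.T))
```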

Gaussian projection is a type of random projection where the sketching matrix is taken to be $S = \frac{1}{\sqrt{s}} G \in \mathbb{R}^{n\times s}$; here the entries of $G$ are i.i.d. $\mathcal{N}(0, 1)$ random variables. Gaussian projection is inefficient relative to column sampling: the formation of a Gaussian sketch of a dense matrix $A \in \mathbb{R}^{m\times n}$ requires $O(mns)$ time. The Subsampled Randomized Hadamard Transform (SRHT) is a more efficient alternative that enjoys similar properties to the Gaussian projection (Drineas et al., 2011; Lu et al., 2013; Tropp, 2011), and can be applied to a dense matrix in only $O(mn \log s)$ time. The CountSketch is even more efficient: it can be applied to any matrix $A$ in $O(\mathrm{nnz}(A))$ time (Clarkson and Woodruff, 2013; Meng and Mahoney, 2013; Nelson and Nguyên, 2013), where $\mathrm{nnz}(A)$ denotes the number of nonzero entries in a matrix.
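For completeness, here is a minimal CountSketch example (again our own illustrative code): each column of $A$ is hashed to one of $s$ buckets and multiplied by a random sign, so applying the sketch touches each nonzero of $A$ exactly once.

```python
# Sketch: CountSketch applied to the columns of a sparse matrix in O(nnz(A)) time.
import numpy as np
import scipy.sparse as sp

def countsketch(A, s, seed=None):
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    buckets = rng.integers(0, s, size=n)            # hash h: [n] -> [s]
    signs = rng.choice([-1.0, 1.0], size=n)         # Rademacher signs
    S = sp.csr_matrix((signs, (np.arange(n), buckets)), shape=(n, s))
    return A @ S                                    # the sketch A S

A = sp.random(20000, 5000, density=1e-3, format="csr", random_state=0)
C = countsketch(A, s=256, seed=0)
print(C.shape)   # (20000, 256)
```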

3 Our Main Results: Improved SPSD Matrix Approximation and Kernel $k$-Means Approximation

In this section, we present our main theoretical results. We start, in Section 3.1, by presenting Theorem 3.1, a novel result on SPSD matrix approximation with the rank-restricted Nyström method. This result is of independent interest, and so we present it in detail, but in this paper we will use it to establish our main result. Then, in Section 3.2, we present Theorem 1, which is our main result for approximate kernel $k$-means with the Nyström approximation. In Section 3.3, we establish novel guarantees on kernel $k$-means with dimensionality reduction.

3.1 The Nyström Method

The Nyström method (Nyström, 1930) is the most popular kernel approximation method in the machine learning community. Let $K \in \mathbb{R}^{n\times n}$ be an SPSD matrix and $S \in \mathbb{R}^{n\times s}$ be a sketching matrix. The Nyström method approximates $K$ with $C W^\dagger C^T$, where $C = KS$ and $W = S^T K S$. The Nyström method was introduced to the machine learning community by Williams and Seeger (2001); since then, numerous works have studied its theoretical properties, e.g., (Drineas and Mahoney, 2005; Gittens and Mahoney, 2016; Jin et al., 2013; Kumar et al., 2012; Wang and Zhang, 2013; Yang et al., 2012).

Empirical results in (Gittens and Mahoney, 2016; Wang et al., 2016b; Yang et al., 2012) demonstrated that the accuracy of the Nyström method significantly increases when the spectrum of $K$ decays fast. This suggests that the Nyström approximation captures the dominant eigenspaces of $K$, and that error bounds comparing the accuracy of the Nyström approximation of $K$ to that of the best rank-$c$ approximation $K_c$ (for $c \ll n$) would provide a meaningful measure of the performance of the Nyström kernel approximation. Gittens and Mahoney (2016) established the first relative-error bounds showing that for sufficiently large $s$, the trace norm error $\|K - C W^\dagger C^T\|_*$ is comparable to $\|K - K_c\|_*$. Such results quantify the benefits of spectral decay to the performance of the Nyström method, and are sufficient to analyze the performance of Nyström approximations in applications such as kernel ridge regression (Alaoui and Mahoney, 2015; Bach, 2013) and kernel support vector machines (Cortes et al., 2010).

However, Gittens and Mahoney (2016) did not analyze the performance of rank-restricted Nyström approximations; they compared the approximation accuracies of the rank-$s$ matrix $C W^\dagger C^T$ and the rank-$c$ matrix $K_c$ (recall $c \le s$). In our application to approximate kernel $k$-means clustering, it is a rank-$c$ approximation that is of relevance. Given $C$ and $W$, the required rank-$c$ truncated SVD can be found using $O(n s^2)$ time. Then the rank-$c$ Nyström approximation can be written as

$$\tilde{K}_c \;=\; \big( C W^\dagger C^T \big)_c \;=\; \big[ C W^{\dagger/2} \big]_c \big[ C W^{\dagger/2} \big]_c^T. \qquad\qquad (4)$$

Theorem 3.1 provides a relative-error trace norm approximation guarantee for the sketch (4) and is novel; a proof is provided in Appendix A.

Theorem 3.1 [Relative-Error Rank-Restricted Nyström Approximation] Let $K \in \mathbb{R}^{n\times n}$ be an SPSD matrix, $c$ be the target rank, and $\epsilon \in (0, 1)$ be an error parameter. Let $S \in \mathbb{R}^{n\times s}$ be a sketching matrix corresponding to one of the sketching methods listed in Table 2. Let $C = KS$ and $W = S^T K S$. Then

$$\big\| K - (C W^\dagger C^T)_c \big\|_* \;\le\; (1+\epsilon)\, \big\| K - K_c \big\|_*$$

holds with probability at least $0.9$. In addition, there exists an $n\times c$ column orthogonal matrix $\bar{U}$ such that $(C W^\dagger C^T)_c = \bar{U}\bar{U}^T \big(C W^\dagger C^T\big) \bar{U}\bar{U}^T$.

   sketching             sketch size ($s$)    time complexity
   uniform sampling
   leverage sampling
   Gaussian projection
   SRHT
   CountSketch
Table 2: Sufficient sketch sizes $s$, and the corresponding sketching time complexities, for the sketching methods used to construct Nyström approximations to the matrix $K$. Let $K_c = U_c \Sigma_c U_c^T$ be the rank-$c$ truncated SVD of $K$; then $\mu$ denotes the row coherence of $U_c$, and leverage score sampling is done using the row leverage scores of $U_c$.
Remark 1 (Rank Restrictions)

The traditional rank-restricted Nyström approximation, $C W_c^\dagger C^T$ (Drineas and Mahoney, 2005; Fowlkes et al., 2004; Gittens and Mahoney, 2016; Li et al., 2015), is not known to satisfy a relative-error bound of the form guaranteed in Theorem 3.1. Pourkamali-Anaraki and Becker (2016) pointed out the drawbacks of the traditional rank-restricted Nyström approximation and proposed the use of the rank-restricted Nyström approximation $(C W^\dagger C^T)_c$ in applications requiring kernel approximations, but provided only empirical evidence of its performance. This work provides guarantees on the approximation error of the rank-restricted Nyström approximation $(C W^\dagger C^T)_c$, and applies this approximation to the kernel $k$-means clustering problem. The contemporaneous work Tropp et al. (2017) provides similar guarantees on the approximation error of $(C W^\dagger C^T)_c$, and uses this Nyström approximation as the basis of a streaming algorithm for fixed-rank approximation of positive-semidefinite matrices.
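To make the construction concrete, the following sketch builds the rank-restricted Nyström approximation $(C W^\dagger C^T)_c$ from uniformly sampled columns and compares its trace-norm error against $\|K - K_c\|_*$ on a synthetic SPSD matrix. It is an illustration under the notation reconstructed above and parameter choices of our own, not the paper's experimental code; the small-eigenvalue cutoff is a numerical-stability heuristic.

```python
# Sketch: rank-restricted Nystrom approximation (C W^+ C^T)_c via uniform sampling,
# returned in factored form L with L L^T = (C W^+ C^T)_c.
import numpy as np

def rank_restricted_nystrom(K, s, c, seed=None):
    rng = np.random.default_rng(seed)
    idx = rng.choice(K.shape[0], size=s, replace=False)   # uniform column sampling
    C, W = K[:, idx], K[np.ix_(idx, idx)]
    lam, V = np.linalg.eigh(W)
    keep = lam > 1e-10 * lam.max()
    B = C @ (V[:, keep] / np.sqrt(lam[keep]))             # B B^T = C W^+ C^T
    U, sig, _ = np.linalg.svd(B, full_matrices=False)
    return U[:, :c] * sig[:c]                             # L: best rank-c factor

# Synthetic SPSD test matrix with c dominant eigenvalues and a decaying tail.
n, c, s = 2000, 10, 200
Q, _ = np.linalg.qr(np.random.randn(n, n))
eigs = np.concatenate([np.ones(c), 1e-2 * np.exp(-0.01 * np.arange(n - c))])
K = (Q * eigs) @ Q.T

L = rank_restricted_nystrom(K, s, c, seed=0)
err_nystrom = np.trace(K) - np.sum(L ** 2)   # trace-norm error (K - K~ is SPSD)
err_best = eigs[c:].sum()                    # ||K - K_c||_*
print(err_nystrom / err_best)                # a ratio close to 1 illustrates the relative-error bound
```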

3.2 Main Result for Approximate Kernel $k$-Means

In this section we establish the approximation ratio guarantees for the objective function of kernel $k$-means clustering. We first define $\gamma$-approximate $k$-means algorithms (where $\gamma \ge 1$), then present our main result in Theorem 1.

Let $B \in \mathbb{R}^{n\times c}$ be a matrix with rows $b_1^T, \dots, b_n^T$. The objective function for linear $k$-means clustering over the rows of $B$ is

$$f\big(\{\mathcal{C}_j\}_{j=1}^{k};\, B\big) \;=\; \sum_{j=1}^{k} \sum_{i \in \mathcal{C}_j} \Big\| b_i - \frac{1}{|\mathcal{C}_j|} \sum_{l \in \mathcal{C}_j} b_l \Big\|_2^2.$$

The minimization of $f$ w.r.t. the $k$-partition $\{\mathcal{C}_j\}_{j=1}^{k}$ is NP-hard (Garey et al., 1982; Aloise et al., 2009; Dasgupta and Freund, 2009; Mahajan et al., 2009; Awasthi et al., 2015), but approximate solutions can be obtained in polynomial time. $\gamma$-approximate algorithms capture one useful notion of approximation.

Definition 1 ($\gamma$-Approximate Algorithms)

A linear $k$-means clustering algorithm $\mathcal{A}$ takes as input a matrix $B$ with $n$ rows and outputs a $k$-partition $\{\hat{\mathcal{C}}_j\}_{j=1}^{k}$. We call $\mathcal{A}$ a $\gamma$-approximate algorithm if, for any such matrix $B$,

$$f\big(\{\hat{\mathcal{C}}_j\}_{j=1}^{k};\, B\big) \;\le\; \gamma \cdot \min_{\{\mathcal{C}_j\}_{j=1}^{k}} f\big(\{\mathcal{C}_j\}_{j=1}^{k};\, B\big).$$

Here $\{\hat{\mathcal{C}}_j\}_{j=1}^{k}$ and $\{\mathcal{C}_j\}_{j=1}^{k}$ are $k$-partitions of $[n]$.

Many $(1+\epsilon)$-approximate algorithms have been proposed, but they are computationally expensive (Chen, 2009; Har-Peled and Mazumdar, 2004; Kumar et al., 2004; Matousek, 2000). There are also relatively efficient constant-factor approximate algorithms, e.g., (Arthur and Vassilvitskii, 2007; Kanungo et al., 2002; Song and Rajasekaran, 2010).

Let $\phi(\cdot)$ be a feature map, $\Phi$ be the matrix with rows $\phi(x_1)^T, \dots, \phi(x_n)^T$, and $K = \Phi \Phi^T$ be the associated kernel matrix. Analogously, we denote the objective function for kernel $k$-means clustering by

$$f\big(\{\mathcal{C}_j\}_{j=1}^{k};\, \Phi\big) \;=\; \sum_{j=1}^{k} \sum_{i \in \mathcal{C}_j} \Big\| \phi(x_i) - \frac{1}{|\mathcal{C}_j|} \sum_{l \in \mathcal{C}_j} \phi(x_l) \Big\|_2^2,$$

where $\{\mathcal{C}_j\}_{j=1}^{k}$ is a $k$-partition of $[n]$.

Theorem 1 [Kernel $k$-Means with Nyström Approximation] Set the target rank $c \ge k/\epsilon$, and choose a sketching matrix $S \in \mathbb{R}^{n\times s}$ and sketch size $s$ consistent with Table 2. Let $\tilde{K}_c = (C W^\dagger C^T)_c$ be the previously defined rank-restricted Nyström approximation of $K$. Let $B \in \mathbb{R}^{n\times c}$ be any matrix satisfying $B B^T = \tilde{K}_c$. Let the $k$-partition $\{\hat{\mathcal{C}}_j\}_{j=1}^{k}$ be the output of a $\gamma$-approximate algorithm applied to the rows of $B$. With probability at least $0.9$,

$$f\big(\{\hat{\mathcal{C}}_j\}_{j=1}^{k};\, \Phi\big) \;\le\; \gamma\,(1+\epsilon) \min_{\{\mathcal{C}_j\}_{j=1}^{k}} f\big(\{\mathcal{C}_j\}_{j=1}^{k};\, \Phi\big).$$

Remark 2

Kernel $k$-means clustering is an NP-hard problem. Therefore, instead of comparing with the exactly optimal clustering, we compare with clusterings obtained using $\gamma$-approximate algorithms. Theorem 1 shows that, when uniform sampling is used to form the Nyström approximation, if the target rank satisfies $c \ge k/\epsilon$ and the sketch size $s$ is sufficiently large relative to $\mu c$ (see Table 2), then the returned clustering has an objective value that is at most a factor of $1+\epsilon$ larger than the objective value of the kernel $k$-means clustering returned by the $\gamma$-approximate algorithm.

Remark 3

Assume we are in a practical setting where $s$, the budget of column samples one can use to form a Nyström approximation, and $k$, the number of desired cluster centers, are fixed. The pertinent question is how to choose the target rank $c$ to produce a high-quality approximate clustering. Theorem 1 shows that for uniform sampling, the error ratio contains two sources of error: one that shrinks as $c$ grows relative to $k$, and one that grows with $\mu c$ relative to $s$.

To balance the two sources of error, $c$ must be larger than $k$, but not too large a fraction of $s$. To minimize the error ratio, $c$ should be selected on the order of $\sqrt{sk/\mu}$. Since the matrix coherence ($\mu$) is unknown, it can be heuristically treated as a constant.

We empirically study the effect of the values of $s$ and $c$ using a data set comprising 8.1 million samples. Note that computing the kernel $k$-means clustering objective function requires the formation of the entire kernel matrix $K$, which is infeasible for a data set of this size; instead, we use normalized mutual information (NMI) (Strehl and Ghosh, 2002)—a standard measure of the performance of clustering algorithms—to measure the quality of the clustering obtained by approximating kernel $k$-means clustering using Nyström approximations formed through uniform sampling. NMI scores range from zero to one, with a larger score indicating better performance. We report the results in Figure 2. The complete details of the experiments, including the experimental setting and time costs, are given in Section 6.

(a) Fix the target rank $c$ and vary the sketch size $s$.
(b) Fix the sketch size $s$ and vary the target rank $c$.
Figure 2: Performance of approximate kernel $k$-means clustering on the MNIST8M data set, for which $n = 8.1\times 10^6$, $d = 784$, and $k = 10$. We use Nyström approximations formed using uniform sampling, and the built-in $k$-means algorithm of Spark MLlib (a parallelized variant of the $k$-means++ method).

From Figure 2(a) we observe that larger values of $s$ lead to better and more stable clusterings: the mean of the NMI increases and its standard deviation decreases. This is reasonable and in accordance with our theory. However, larger values of $s$ incur more computation, so one should choose $s$ to trade off computation and accuracy.

Figure 2(b) shows that for fixed $s$ and $k$, the clustering performance is not monotonic in $c$, which matches Theorem 1 (see the discussion in Remark 3). Setting $c$ as small as $k$ results in poor performance. Setting $c$ over-large not only incurs more computation, but also negatively affects clustering performance; this may suggest the necessity of rank-restriction. Furthermore, in this example, the best-performing values of $c$ are close to $\sqrt{ks}$, which corroborates the suggestion made in Remark 3 that setting $c$ around $\sqrt{ks/\mu}$ (where $\mu$ is unknown but can be treated as a constant larger than $1$) can be a good choice.

Remark 4

Musco and Musco (2017) established a $1+\epsilon$ approximation ratio for the kernel $k$-means objective value when a non-rank-restricted Nyström approximation is formed using ridge leverage score (RLS) sampling; their analysis is specific to RLS sampling and does not extend to other sketching methods. By way of comparison, our analysis covers several popular sampling schemes and applies to rank-restricted Nyström approximations, but does not extend to RLS sampling.

3.3 Approximate Kernel $k$-Means with KPCA and Power Method

The use of dimensionality reduction to increase the computational efficiency of $k$-means clustering has been widely studied, e.g., in (Boutsidis et al., 2010, 2015; Cohen et al., 2015; Feldman et al., 2013; Zha et al., 2002). Kernel principal component analysis (KPCA) is particularly well-suited to this application (Dhillon et al., 2004; Ding et al., 2005). Applying Lloyd’s algorithm to the rows of $\Phi$ or $U\Lambda^{1/2}$ has an $O(n^2 k)$ per-iteration complexity; if $c$ features are extracted using KPCA and Lloyd’s algorithm is applied to the resulting $c$-dimensional feature map, then the per-iteration cost reduces to $O(nck)$. Proposition 3.3 states that, to obtain a $1+\epsilon$ approximation ratio in terms of the kernel $k$-means objective function, it suffices to use $O(k/\epsilon)$ KPCA features. This proposition is a simple consequence of (Cohen et al., 2015).
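A minimal sketch of this KPCA route (formalized in Proposition 3.3 below), assuming illustrative data, kernel width, and $c \approx k/\epsilon$:

```python
# Sketch: extract c KPCA features from the rank-c truncated EVD of K, then run
# linear k-means on the rows of U_c diag(lam_c)^(1/2).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(1500, 20)                        # placeholder data
k, eps = 10, 0.5
c = int(np.ceil(k / eps))                      # c = 20 features

K = rbf_kernel(X, gamma=0.05)                  # assumed kernel and width
lam, U = np.linalg.eigh(K)                     # eigenvalues in ascending order
lam, U = lam[::-1][:c], U[:, ::-1][:, :c]      # keep the top-c eigenpairs
B = U * np.sqrt(np.clip(lam, 0.0, None))       # n x c KPCA feature matrix

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(B)
```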

Proposition 3.3 [KPCA] Let $\Phi$ be a matrix with $n$ rows, and $K = \Phi\Phi^T$ be the corresponding kernel matrix. Let $K_c = U_c \Lambda_c U_c^T$ be the rank-$c$ ($c \ge k/\epsilon$) truncated SVD of $K$ and take $B = U_c \Lambda_c^{1/2}$. Let the $k$-partition $\{\hat{\mathcal{C}}_j\}_{j=1}^{k}$ be the output of a $\gamma$-approximate algorithm applied to the rows of $B$. Then

$$f\big(\{\hat{\mathcal{C}}_j\}_{j=1}^{k};\, \Phi\big) \;\le\; \gamma\,(1+\epsilon) \min_{\{\mathcal{C}_j\}_{j=1}^{k}} f\big(\{\mathcal{C}_j\}_{j=1}^{k};\, \Phi\big).$$
In practice, the truncated SVD (equivalently EVD) of $K$ is computed using the power method or Krylov subspace methods. These numerical methods do not compute the exact decomposition $U_c \Lambda_c U_c^T$, so Proposition 3.3 is not directly applicable. It is useful to have a theory that captures the effect of realistically inaccurate estimates of $U_c \Lambda_c U_c^T$ on the clustering process. As one particular example, consider that general-purpose implementations of the truncated SVD attempt to mitigate the fact that the computed decompositions are inaccurate by returning very high-precision solutions, e.g., solutions computed to near machine precision. Understanding the trade-off between the precision of the truncated SVD solution and the impact on the approximation ratio of the approximate kernel $k$-means solution allows us to more precisely manage the computational complexity of our algorithms. Are such high-precision solutions necessary for kernel $k$-means clustering?

Theorem 2 answers this question by establishing that highly accurate eigenspaces are not significantly more useful in approximate kernel $k$-means clustering than eigenspace estimates with lower accuracy. A low-precision solution obtained by running the power iteration for a few rounds suffices for kernel $k$-means clustering applications. We prove Theorem 2 in Appendix C.

1:  Input: kernel matrix $K \in \mathbb{R}^{n\times n}$, number of clusters $k$, target dimension $c$ ($\ge k$), sketch size $s$ ($\ge c$), number of iterations $t$ ($\ge 1$)
2:  Draw a Gaussian projection matrix $Y \in \mathbb{R}^{n\times s}$;
3:  for all $i = 1, \dots, t$ do
4:     $Y \leftarrow K\,Y$;
5:  end for
6:  Orthogonalize $Y$ to obtain $Q \in \mathbb{R}^{n\times s}$;
7:  Compute $C = K Q$ and $W = Q^T K Q$;
8:  Compute a $B \in \mathbb{R}^{n\times c}$ satisfying $B B^T = (C W^\dagger C^T)_c$;
9:  return  $B$.
Algorithm 1 Approximate Kernel $k$-Means using the Power Method.
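A rough Python rendering of Algorithm 1 as reconstructed above (the parameter names $c$, $s$, $t$ and the final factorization step are assumptions consistent with Section 3.1, not verbatim from the paper):

```python
# Sketch: power-method features for approximate kernel k-means (Algorithm 1).
import numpy as np

def power_method_features(K, c, s, t, seed=None):
    rng = np.random.default_rng(seed)
    Y = rng.standard_normal((K.shape[0], s))     # Gaussian projection matrix
    for _ in range(t):                           # power iterations: Y <- K Y
        Y = K @ Y
    Q, _ = np.linalg.qr(Y)                       # orthogonalize, Q is n x s
    C = K @ Q
    W = Q.T @ C                                  # = Q^T K Q
    lam, V = np.linalg.eigh(W)
    keep = lam > 1e-10 * lam.max()
    B = C @ (V[:, keep] / np.sqrt(lam[keep]))    # B B^T = C W^+ C^T
    U, sig, _ = np.linalg.svd(B, full_matrices=False)
    return U[:, :c] * sig[:c]                    # rows are c-dimensional features
```

In practice one may re-orthonormalize $Y$ inside the loop to avoid overflow when $t$ is large; a small $t$ is consistent with the low-precision message of the theorem below.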

Theorem 2 [The Power Method] Let $\Phi$ be a matrix with $n$ rows, $K = \Phi\Phi^T$ be the corresponding kernel matrix, and $\sigma_i$ be the $i$-th singular value of $K$. Fix an error parameter $\epsilon \in (0, 1)$. Run Algorithm 1, with a number of power iterations $t$ that grows only logarithmically in $n$ and $1/\epsilon$, to obtain $B \in \mathbb{R}^{n\times c}$. Let the $k$-partition $\{\hat{\mathcal{C}}_j\}_{j=1}^{k}$ be the output of a $\gamma$-approximate algorithm applied to the rows of $B$. If $\sigma_c > \sigma_{c+1}$, then

$$f\big(\{\hat{\mathcal{C}}_j\}_{j=1}^{k};\, \Phi\big) \;\le\; \gamma\,(1+\epsilon) \min_{\{\mathcal{C}_j\}_{j=1}^{k}} f\big(\{\mathcal{C}_j\}_{j=1}^{k};\, \Phi\big)$$

holds with probability at least $0.9$. If $\sigma_c = \sigma_{c+1}$, then the above inequality holds with probability at least $0.9 - \theta$, where $\theta$ is a positive constant (Tao and Vu, 2010).

Note that the power method requires forming the entire kernel matrix $K$, which may not fit in memory even in a distributed setting. Therefore, in practice, the power method may not be as efficient as the Nyström approximation with uniform sampling, which avoids forming most of $K$.

Theorem 1, Proposition 3.3, and Theorem 2 are highly interesting from a theoretical perspective. These results demonstrate that $O(k/\epsilon)$ features are sufficient to ensure a $1+\epsilon$ approximation ratio. Prior work (Dhillon et al., 2004; Ding et al., 2005) set the number of features equal to $k$ and did not provide approximation ratio guarantees. Indeed, a lower bound in the linear $k$-means clustering case due to (Cohen et al., 2015) shows that $\Omega(k/\epsilon)$ features are necessary to obtain a $1+\epsilon$ approximation ratio.

4 Comparison to Spectral Clustering with Nyström Approximation

In this section, we provide a brief discussion and empirical comparison of our clustering algorithm, which uses the Nyström method to approximate kernel $k$-means clustering, with the popular alternative algorithm that uses the Nyström method to approximate spectral clustering.

4.1 Background

Spectral clustering is a method with a long history (Cheeger, 1969; Donath and Hoffman, 1972, 1973; Fiedler, 1973; Guattery and Miller, 1995; Spielman and Teng, 1996). Within machine learning, spectral clustering is more widely used than kernel $k$-means clustering (Ng et al., 2002; Shi and Malik, 2000), and the use of the Nyström method to speed up spectral clustering has been popular since Fowlkes et al. (2004). Both spectral clustering and kernel $k$-means clustering can be approximated in time linear in $n$ by using the Nyström method with uniform sampling. Practitioners reading this paper may ask:

How does the approximate kernel $k$-means clustering algorithm presented here, which uses Nyström approximation, compare to the popular heuristic of combining spectral clustering with Nyström approximation?

Based on our theoretical results and empirical observations, our answer to this reader is:

Although they have equivalent computational costs, kernel $k$-means clustering with Nyström approximation is both more theoretically sound and more effective in practice than spectral clustering with Nyström approximation.

We first formally describe spectral clustering, and then substantiate our claim regarding the theoretical advantage of our approximate kernel $k$-means method. Our discussion is limited to the normalized and symmetric graph Laplacians used in Fowlkes et al. (2004), but spectral clustering using asymmetric graph Laplacians encounters similar issues.

4.2 Spectral Clustering with Nyström Approximation

The input to the spectral clustering algorithm is an affinity matrix $A \in \mathbb{R}^{n\times n}$ that measures the pairwise similarities between the points being clustered; typically $A$ is a kernel matrix or the adjacency matrix of a weighted graph constructed using the data points as vertices. Let $D = \mathrm{diag}(A \mathbf{1}_n)$ be the diagonal degree matrix associated with $A$, and $L = I_n - D^{-1/2} A D^{-1/2}$ be the associated normalized graph Laplacian matrix. Let $V_k \in \mathbb{R}^{n\times k}$ denote the bottom $k$ eigenvectors of $L$, or equivalently, the top $k$ eigenvectors of $D^{-1/2} A D^{-1/2}$. Spectral clustering groups the data points by performing linear $k$-means clustering on the normalized rows of $V_k$. Fowlkes et al. (2004) popularized the application of the Nyström approximation to spectral clustering. This algorithm computes an approximate spectral clustering by: (1) forming a Nyström approximation to $A$, denoted $\tilde{A}$; (2) computing the degree matrix $\tilde{D}$ of $\tilde{A}$; (3) computing the top $k$ singular vectors of $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, which are equivalent to the bottom $k$ eigenvectors of $\tilde{L} = I_n - \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$; (4) performing linear $k$-means over the normalized rows of the computed singular vectors.
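For concreteness, the following sketch (our own hypothetical rendering of the heuristic just described, not the code of Fowlkes et al.) carries out these four steps with a uniform-sampling Nyström approximation of the affinity matrix; note that the approximate degrees can come out negative, which is one of the issues discussed below.

```python
# Sketch: approximate spectral clustering with a Nystrom approximation of A.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def psd_inv_sqrt(M, tol=1e-10):
    lam, Q = np.linalg.eigh(M)
    keep = lam > tol * lam.max()
    return (Q[:, keep] / np.sqrt(lam[keep])) @ Q[:, keep].T     # M^(-1/2) on its range

def nystrom_spectral_clustering(X, k, s, gamma, seed=None):
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=s, replace=False)
    C = rbf_kernel(X, X[idx], gamma=gamma)                      # n x s block of A
    W = C[idx, :]                                               # s x s block
    # (2) degrees of A~ = C W^+ C^T, computed without forming A~.
    d = C @ (np.linalg.pinv(W) @ C.sum(axis=0))
    if np.any(d <= 0):
        raise ValueError("negative approximate degrees: the heuristic breaks down")
    # (3) top-k singular vectors of D~^(-1/2) A~ D~^(-1/2) via its factor B.
    B = (C / np.sqrt(d)[:, None]) @ psd_inv_sqrt(W)
    U, _, _ = np.linalg.svd(B, full_matrices=False)
    V = U[:, :k]
    V = V / np.linalg.norm(V, axis=1, keepdims=True)            # (4) normalize rows
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(V)
```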

To the best of our knowledge, spectral clustering with Nyström approximation does not have a bounded approximation ratio relative to exact spectral clustering. In fact, it seems unlikely that the approximation ratio could be bounded, as there are fundamental problems with the application of the Nyström approximation to the affinity matrix.

  • The affinity matrix used in spectral clustering must be elementwise nonnegative. However, the Nyström approximation $\tilde{A}$ of such a matrix can have numerous negative entries, so it is, in general, not proper input for the spectral clustering algorithm. In particular, the approximated degree matrix $\tilde{D}$ may have negative diagonal entries, so $\tilde{D}^{-1/2}$ is not guaranteed to be a real matrix; such exceptions must be handled heuristically. The approximate asymmetric Laplacian $\tilde{D}^{-1}\tilde{A}$ does avoid the introduction of complex values; however, the negative diagonal entries of $\tilde{D}$ negate whole rows of $\tilde{D}^{-1}\tilde{A}$ (equivalently, columns of $\tilde{A}\tilde{D}^{-1}$), leading to less meaningful negative similarities/distances.

  • Even if $\tilde{D}^{-1/2}$ is real, the matrix $\tilde{L} = I_n - \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ may not be SPSD, much less a Laplacian matrix. Thus the bottom eigenvectors of $\tilde{L}$ cannot be viewed as useful coordinates for linear $k$-means clustering in the same way that the eigenvectors of $L$ can be.

  • Such approximation is also problematic in terms of matrix approximation accuracy. Even when $\tilde{A}$ approximates $A$ well, which can be theoretically guaranteed, the approximate Laplacian $\tilde{L}$ can be far from $L$. This is because a small perturbation in $A$ can have an out-sized influence on the eigenvectors of $L$.

  • One may propose to approximate $D^{-1/2} A D^{-1/2}$, rather than $A$, with a Nyström approximation; this ensures that the approximate normalized graph Laplacian is SPSD. However, this approach requires forming the entirety of $A$ in order to compute the degree matrix $D$, and thus has quadratic (with $n$) time and memory costs. Furthermore, although the resulting approximation to the normalized graph Laplacian is SPSD, it is not a graph Laplacian: its off-diagonal entries are not guaranteed to be non-positive, and its smallest eigenvalue may be nonzero.

In summary, spectral clustering using the Nyström approximation (Fowlkes et al., 2004), which has proven to be a useful heuristic, and which is composed of theoretically principled parts, is less principled when viewed in its entirety. Approximate kernel $k$-means clustering using Nyström approximation is an equivalently efficient, but theoretically more principled alternative.

4.3 Empirical Comparison with Approximate Spectral Clustering using Nyström Approximation

     dataset                                #instances ($n$)   #features ($d$)   #clusters ($k$)
MNIST (LeCun et al., 1998)                   60,000             780               10
Mushrooms (Frank and Asuncion, 2010)         8,124              112               2
PenDigits (Frank and Asuncion, 2010)         7,494              16                10
Table 3: Summary of the data sets used in our comparisons.

To complement our discussion of the relative merits of the two methods, we empirically compared the performance of our novel method of approximate kernel -means clustering using the Nyström method with the popular method of approximate spectral clustering using the Nyström method. We used three classification data sets, described in Table 3. The data sets used are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

Figure 3: Comparisons between the runtimes and accuracies of approximate spectral clustering (SC) and approximate kernel $k$-means clustering (KK), both using uniform column sampling Nyström approximation.

Let $x_1, \dots, x_n$ be the input vectors. We take both the affinity matrix for spectral clustering and the kernel matrix for kernel $k$-means to be the RBF kernel matrix $K$, where $K_{ij} = \exp\!\big(-\tfrac{1}{2\sigma^2}\|x_i - x_j\|_2^2\big)$ and $\sigma$ is the kernel width parameter. We choose $\sigma$ based on the average interpoint distance in the data sets as

$$\sigma \;=\; \eta \cdot \frac{1}{n^2} \sum_{i=1}^{n}\sum_{j=1}^{n} \|x_i - x_j\|_2, \qquad\qquad (5)$$

where we take the scaling factor $\eta$ to be one of three fixed values.

The algorithms under comparison are all implemented in Python 3.5.2. Our implementation of approximate spectral clustering follows the code in (Fowlkes et al., 2004). To compute linear -means clusterings, we use the function sklearn.cluster.KMeans present in the scikit-learn package. Our algorithm for approximate kernel -means clustering is described in more detail in Section 5.1. We ran the computations on a MacBook Pro with a 2.5GHz Intel Core i7 CPU and 16GB of RAM.

We compare approximate spectral clustering (SC) with approximate kernel $k$-means clustering (KK), with both using the rank-restricted Nyström method with uniform sampling.⁴ We used normalized mutual information (NMI) (Strehl and Ghosh, 2002) to evaluate clustering performance: the NMI falls between 0 (representing no mutual information between the true and approximate clusterings) and 1 (perfect correlation of the two clusterings), so larger NMI indicates better performance. The target dimension is taken to be $k$; and, for each method, the sketch size $s$ is varied over a range of values. We record the time cost of the two methods, excluding the time spent on the $k$-means clustering required in both algorithms.⁵ We repeat this procedure several times and report the averaged NMI and average elapsed time. [Footnote 4: Uniform sampling is appropriate for the kernel width used in Eqn. (5); see Gittens and Mahoney (2016) for a detailed discussion of the effect of varying $\sigma$.] [Footnote 5: For both SC and KK with Nyström approximation, the extracted feature matrices have $k$ columns, so the $k$-means clusterings required by both SC and KK have identical cost.]
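For reference, NMI can be computed with scikit-learn (an illustrative call, consistent with the evaluation protocol described above):

```python
# Sketch: evaluating a clustering with normalized mutual information (NMI).
from sklearn.metrics import normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]     # a relabeled but otherwise perfect clustering
print(normalized_mutual_info_score(true_labels, pred_labels))   # 1.0
```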

We note that, at small sketch sizes $s$, exceptions often arise during approximate spectral clustering due to negative entries in the degree matrix. (This is an example, as discussed in Section 4.2, of when approximate spectral clustering heuristics do not perform well.) We discard the trials where such exceptions occur.

Our results are summarized in Figure 3. Figure 3 illustrates the NMI of SC and KK as a function of the sketch size and as a function of elapsed time for both algorithms. While there are quantitative differences between the results on the three data sets, the plots all show that KK is more accurate as a function of the sketch size or elapsed time than SC.

5 Single-Machine Medium-Scale Experiments

In this section, we empirically compare the Nyström method and random feature maps (Rahimi and Recht, 2007) for kernel $k$-means clustering. We conduct experiments on the data listed in Table 3. For the Mushrooms and PenDigits data, we are able to evaluate the objective function value of kernel $k$-means clustering.

1:  Input: data set $\{x_i\}_{i=1}^{n}$, number of clusters $k$, target dimension $c$ ($\ge k$), arbitrary integer $l$ ($\ge c$), sketch size $s$ ($\ge l$), kernel function $\kappa(\cdot, \cdot)$.
2:  // Step 1: The Nyström Method
3:  Form sketches $C = KS \in \mathbb{R}^{n\times s}$ and $W = S^T K S \in \mathbb{R}^{s\times s}$, where $K$ is the kernel matrix and $S$ is some sketching matrix, e.g., uniform sampling;
4:  Compute a matrix $B \in \mathbb{R}^{n\times l}$ such that $B B^T = C W_l^\dagger C^T$;
5:  // Step 2: Dimensionality Reduction
6:  Compute the rank-$c$ truncated SVD $B_c = U_c \Sigma_c V_c^T$;
7:  Let $F = B V_c \in \mathbb{R}^{n\times c}$;
8:  // Step 3: Linear $k$-Means Clustering
9:  Perform $k$-means clustering over the rows of $F$;
10:  return  the clustering results.
Algorithm 2 Approximate Kernel $k$-Means Clustering using Nyström Approximation.

5.1 Single-Machine Implementation of Approximate Kernel $k$-Means

Our algorithm for approximate kernel $k$-means clustering comprises three steps: Nyström approximation, dimensionality reduction, and linear $k$-means clustering. Both the single-machine as well as the distributed variants of the algorithm are governed by three parameters: $c$, the number of features used in the clustering; $l$, a regularization parameter; and $s$, the sketch size. These parameters satisfy $k \le c \le l \le s \ll n$.

  1. Nyström approximation. Let $s$ be the sketch size and $S \in \mathbb{R}^{n\times s}$ be a sketching matrix. Let $C = KS \in \mathbb{R}^{n\times s}$ and $W = S^T K S \in \mathbb{R}^{s\times s}$. The standard Nyström approximation is $C W^\dagger C^T$; small singular values in $W$ can lead to instability in the Moore-Penrose inverse, so a widely used heuristic is to choose $l < s$ and use $W_l^\dagger$ instead of $W^\dagger$ in the standard Nyström approximation.⁶ We set $l$ (arbitrarily) to a fixed fraction of $s$. Let $U_l \Lambda_l U_l^T$ be the rank-$l$ truncated EVD of $W$ and return $B = C\, U_l \Lambda_l^{-1/2} \in \mathbb{R}^{n\times l}$ as the output of the Nyström method. [Footnote 6: The Nyström approximation $C W^\dagger C^T$ is correct in theory, but the Moore-Penrose inverse often causes numerical errors in practice. The Moore-Penrose inverse drops all the zero singular values; however, due to the finite numerical precision, it is difficult to determine whether a singular value, say $\sigma$, should be zero or not, and this makes the computation unstable: if such a small singular value is believed to be zero, it will be dropped; otherwise, the Moore-Penrose inverse will invert it to obtain the very large singular value $1/\sigma$. Dropping some portion of the smallest singular values is a simple heuristic that avoids this instability. This is why we heuristically use $W_l^\dagger$ instead of $W^\dagger$. Currently we do not have theory for this heuristic. Chiu and Demanet (2013) considers the theoretical implications of this regularization heuristic, but their results do not apply to our problem.]

  2. Dimensionality reduction. Let $V_c \in \mathbb{R}^{l\times c}$ contain the dominant $c$ right singular vectors of $B$. Let $F = B V_c \in \mathbb{R}^{n\times c}$. It can be verified that $F F^T = \big(B B^T\big)_c = \big(C W_l^\dagger C^T\big)_c$, which is our desired rank-restricted Nyström approximation.

  3. Linear $k$-means clustering. With $F \in \mathbb{R}^{n\times c}$ at hand, use an arbitrary off-the-shelf linear $k$-means clustering algorithm to cluster the rows of $F$.

Figure 4: Quality of approximate kernel $k$-means clusterings using Nyström, random feature maps (Rahimi and Recht, 2007), and the two-step method (Chitta et al., 2011). (Dimensionality reduction is not applicable to the two-step method.) The vertical axis reports the normalized mutual information (NMI).

See Algorithm 2 for the single-machine version of this approximate kernel $k$-means clustering algorithm. Observe that we can use uniform sampling to form $C$ and $W$, and thereby avoid computing most of $K$.
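The following is a compact single-machine sketch of Algorithm 2 with uniform sampling, so that only $n\times s$ and $s\times s$ blocks of $K$ are ever formed. Parameter names ($k$, $c$, $l$, $s$) follow the reconstruction above; the concrete data, kernel width, and the choice $l = s/2$ are illustrative assumptions rather than the paper's settings.

```python
# Sketch of Algorithm 2: Nystrom -> rank-c reduction -> linear k-means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def approx_kernel_kmeans(X, k, c, s, gamma, l=None, seed=None):
    rng = np.random.default_rng(seed)
    l = l if l is not None else s // 2                  # regularization parameter
    # Step 1: Nystrom with uniform sampling (only K S and S^T K S are computed).
    idx = rng.choice(X.shape[0], size=s, replace=False)
    C = rbf_kernel(X, X[idx], gamma=gamma)              # C = K S
    W = C[idx, :]                                       # W = S^T K S
    lam, V = np.linalg.eigh(W)
    lam, V = lam[::-1][:l], V[:, ::-1][:, :l]           # top-l eigenpairs of W
    B = C @ (V / np.sqrt(np.clip(lam, 1e-12, None)))    # B B^T ~ C W_l^+ C^T
    # Step 2: dimensionality reduction to the rank-c restricted features.
    _, _, Vt = np.linalg.svd(B, full_matrices=False)
    F = B @ Vt[:c].T                                    # F F^T = (B B^T)_c
    # Step 3: linear k-means on the rows of F.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(F)

X = np.random.randn(5000, 30)                           # placeholder data
labels = approx_kernel_kmeans(X, k=10, c=40, s=400, gamma=0.05, seed=0)
```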

Let $\hat{\Phi} \in \mathbb{R}^{n\times s}$ be the feature matrix computed by random feature maps (RFM). To make the comparison fair, we perform dimensionality reduction for RFM (Rahimi and Recht, 2007) in the same way as described in Algorithm 2 to compute $c$-dimensional features, and apply linear $k$-means clustering on the rows of the reduced feature matrix.

5.2 Comparing Nyström, Random Feature Maps, and Two-Step Method

We empirically compare the clustering performances of kernel approximations formed using Nyström, random feature map (RFM) (Rahimi and Recht, 2007), and the two-step method (Chitta et al., 2011) on the data sets detailed in Table 3.

We use the RBF kernel with width parameter given by (5); Figure 3 indicates that this is a good choice of kernel width for these data sets. We conduct dimensionality reduction for both Nyström and RFM to obtain $c$-dimensional features, and consider three settings of $c$: two fixed values and no dimensionality reduction (equivalently, $c = s$).

Figure 5: Quality of approximate kernel $k$-means clusterings obtained using rank-restricted Nyström, RFMs (Rahimi and Recht, 2007), and the two-step method (Chitta et al., 2011). (Dimensionality reduction is not applicable to the two-step method.) The vertical axis reports the kernel $k$-means clustering objective function value.

The quality of the clusterings is quantified using both normalized mutual information (NMI) (Strehl and Ghosh, 2002) and the objective function value

$$\sum_{j=1}^{k} \Big( \sum_{i \in \mathcal{C}_j} K_{ii} \;-\; \frac{1}{|\mathcal{C}_j|} \sum_{i \in \mathcal{C}_j} \sum_{l \in \mathcal{C}_j} K_{il} \Big), \qquad\qquad (6)$$

which is the kernel $k$-means objective (2) expressed in terms of the entries $K_{il}$ of the kernel matrix $K$; the disjoint sets $\mathcal{C}_1, \dots, \mathcal{C}_k$ reflect the clustering.

We repeat the experiments multiple times and report the results in Figures 4 and 5. The experiments show that, as measured by both NMI and objective value, the Nyström method outperforms RFM in most cases. Both the Nyström method and RFM are consistently superior to the two-step method of Chitta et al. (2011), which requires a large sketch size. All the compared methods improve as the sketch size increases.

Judging from these medium-scale experiments, the target rank $c$ has little impact on the NMI and clustering objective value. This phenomenon is not general; in the large-scale experiments of the next section we see that setting $c$ properly allows one to obtain a better NMI than an over-small or over-large $c$.

6 Large-Scale Experiments using Distributed Computing

In this section, we empirically study our approximate kernel $k$-means clustering algorithm on large-scale data. We state a distributed version of the algorithm, implement it in Apache Spark,⁷ and evaluate its performance on NERSC’s Cori supercomputer. We investigate the effects of increased parallelism, sketch size $s$, and target dimension $c$. [Footnote 7: This implementation is available at https://github.com/wangshusen/SparkKernelKMeans.git.]

Algorithm 3 is a distributed version of our method described in Section 5.1. Again, we use uniform sampling to form $C$ and $W$, so as to avoid computing most of $K$. We mainly focus on the Nyström approximation step, as the other two steps are well supported by distributed computing systems such as Apache Spark.

1:  Input: data set $\{x_i\}_{i=1}^{n}$ distributed among the executors, number of clusters $k$, target dimension $c$ ($\ge k$), arbitrary integer $l$ ($\ge c$), sketch size $s$ ($\ge l$), kernel function $\kappa(\cdot, \cdot)$.
2:  // Step 1: The Nyström Method
3:  Sample $s$ vectors from $\{x_i\}_{i=1}^{n}$ to form $\{\tilde{x}_j\}_{j=1}^{s}$ and send them to the driver;
4:  Driver computes $W = [\kappa(\tilde{x}_i, \tilde{x}_j)]_{s\times s}$ and a matrix $R \in \mathbb{R}^{s\times l}$ satisfying $R R^T = W_l^\dagger$;
5:  Broadcast $\{\tilde{x}_j\}_{j=1}^{s}$ and $R$ to all executors;
6:  for all $i = 1, \dots, n$ (in parallel) do
7:     Each executor locally computes $c_i = [\kappa(x_i, \tilde{x}_1), \dots, \kappa(x_i, \tilde{x}_s)]^T$ and $b_i = R^T c_i \in \mathbb{R}^{l}$;
8:  end for
9:  // Step 2: Dimensionality Reduction
10:  Let $V_c \in \mathbb{R}^{l\times c}$ contain the dominant $c$ right singular vectors of the distributed row matrix $B = [b_1, \dots, b_n]^T$ (computed using a distributed truncated SVD algorithm);
11:  Broadcast $V_c$ to all executors;
12:  for all $i = 1, \dots, n$ (in parallel) do
13:     Each executor locally computes $f_i = V_c^T b_i$;
14:  end for
15:  // Step 3: Linear $k$-Means Clustering
16:  Perform $k$-means clustering over the Nyström features $f_1, \dots, f_n$ (using a distributed linear $k$-means clustering algorithm).
17:  return  the clustering results.
Algorithm 3 Distributed Approximate Kernel $k$-Means Clustering using Nyström Approximation.
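To illustrate how the steps of Algorithm 3 map onto Spark primitives, here is a rough PySpark sketch. It is a hypothetical, simplified rendering (using the RDD-based MLlib API), not the authors' released implementation; see the repository linked above for that. All names and parameter values are our own.

```python
# Rough PySpark sketch of Algorithm 3 (uniform sampling, broadcast of the small
# s x l map R with R R^T = W_l^+, distributed SVD for the rank-c reduction).
import numpy as np
from pyspark.sql import SparkSession
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg.distributed import RowMatrix

def rbf_row(x, samples, gamma):
    return np.exp(-gamma * np.sum((samples - x) ** 2, axis=1))   # kappa(x, samples)

spark = SparkSession.builder.appName("approx-kernel-kmeans").getOrCreate()
sc = spark.sparkContext

k, c, l, s, gamma = 10, 40, 200, 400, 0.05
X = np.random.randn(100000, 30)                    # stand-in for a distributed data set
rows = sc.parallelize(list(X), numSlices=64)

# Step 1: Nystrom. Driver forms W from s sampled points and broadcasts R.
samples = np.asarray(rows.takeSample(False, s, seed=0))
W = np.exp(-gamma * np.sum((samples[:, None, :] - samples[None, :, :]) ** 2, axis=2))
lam, V = np.linalg.eigh(W)
lam, V = lam[::-1][:l], V[:, ::-1][:, :l]
R = sc.broadcast(V / np.sqrt(np.clip(lam, 1e-12, None)))
S_bc = sc.broadcast(samples)
B = rows.map(lambda x: rbf_row(x, S_bc.value, gamma) @ R.value)   # rows b_i = R^T c_i

# Step 2: rank-c reduction via a distributed truncated SVD of B.
svd = RowMatrix(B).computeSVD(c, computeU=False)
Vc = sc.broadcast(np.array(svd.V.toArray()))
F = B.map(lambda b: b @ Vc.value)                  # c-dimensional Nystrom features

# Step 3: distributed linear k-means on the features.
model = KMeans.train(F, k, maxIterations=50, seed=0)
labels = F.map(model.predict)
```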