1 Introduction
Cluster analysis divides a data set into several groups using information found only in the data points. Clustering can be used in an exploratory manner to discover meaningful groupings within a data set, or it can serve as the starting point for more advanced analyses. As such, applications of clustering abound in machine learning and data analysis, including, inter alia: genetic expression analysis (Sharan et al., 2002), market segmentation (Chaturvedi et al., 1997), social network analysis (Handcock et al., 2007), image segmentation (Haralick and Shapiro, 1985), anomaly detection (Chandola et al., 2009), collaborative filtering (Ungar and Foster, 1998), and fast approximate learning of nonlinear models (Si et al., 2014).

Linear $k$-means clustering is a standard and well-regarded approach to cluster analysis that partitions input vectors $\mathbf{a}_1, \ldots, \mathbf{a}_n \in \mathbb{R}^d$ into $k$ clusters, in an unsupervised manner, by assigning each vector to the cluster with the nearest centroid. Formally, linear $k$-means clustering seeks to partition the set $\{\mathbf{a}_i\}_{i=1}^n$ into $k$ disjoint sets $\mathcal{C}_1, \ldots, \mathcal{C}_k$ by solving

$$\min_{\mathcal{C}_1, \ldots, \mathcal{C}_k} \; \sum_{j=1}^{k} \sum_{i \in \mathcal{C}_j} \Big\| \mathbf{a}_i - \frac{1}{|\mathcal{C}_j|} \sum_{l \in \mathcal{C}_j} \mathbf{a}_l \Big\|_2^2. \qquad (1)$$
Lloyd’s algorithm (Lloyd, 1982) is a standard approach for finding local minimizers of (1), and is a staple in data mining and machine learning.
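Lloyd's algorithm alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points. The following NumPy sketch (our own illustrative implementation, not the paper's code) makes the two steps concrete:

```python
import numpy as np

def lloyd(A, k, n_iter=50, init=None, seed=0):
    """Lloyd's algorithm for linear k-means on the rows of A (n x d)."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Initialize with k distinct input points chosen at random.
        cent = A[rng.choice(len(A), size=k, replace=False)].astype(float)
    else:
        cent = np.array(init, dtype=float)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        d2 = ((A[:, None, :] - cent[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                cent[j] = A[labels == j].mean(axis=0)
    return labels, cent
```

Passing explicit initial centroids makes a run deterministic; in practice one would use a seeding rule such as k-means++.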
Despite its popularity, linear $k$-means clustering is not a universal solution to all clustering problems. In particular, linear $k$-means clustering strongly biases the recovered clusters towards isotropy and sphericity. Applied to the data in Figure 1(a), Lloyd's algorithm is perfectly capable of partitioning the data into three clusters which fit these assumptions. However, the data in Figure 1(b) do not fit these assumptions: the clusters are ring-shaped and have coincident centers, so minimizing the linear $k$-means objective does not recover these clusters.
To extend the scope of $k$-means clustering to include anisotropic, non-spherical clusters such as those depicted in Figure 1(b), Schölkopf et al. (1998) proposed to perform linear $k$-means clustering in a nonlinear feature space instead of the input space. After choosing a feature function $\phi(\cdot)$ to map the input vectors nonlinearly into feature vectors $\phi(\mathbf{a}_1), \ldots, \phi(\mathbf{a}_n)$, they propose minimizing the objective function

$$\min_{\mathcal{C}_1, \ldots, \mathcal{C}_k} \; \sum_{j=1}^{k} \sum_{i \in \mathcal{C}_j} \Big\| \phi(\mathbf{a}_i) - \frac{1}{|\mathcal{C}_j|} \sum_{l \in \mathcal{C}_j} \phi(\mathbf{a}_l) \Big\|_2^2, \qquad (2)$$

where $\{\mathcal{C}_1, \ldots, \mathcal{C}_k\}$ denotes a partition of $\{1, \ldots, n\}$. The "kernel trick" enables us to minimize this objective without explicitly computing the potentially high-dimensional features, as inner products in feature space can be computed implicitly by evaluating the kernel function

$$\kappa(\mathbf{a}_i, \mathbf{a}_j) = \langle \phi(\mathbf{a}_i), \phi(\mathbf{a}_j) \rangle.$$

Thus the information required to solve the kernel $k$-means problem (2) is present in the kernel matrix $\mathbf{K} \in \mathbb{R}^{n \times n}$, where $K_{ij} = \kappa(\mathbf{a}_i, \mathbf{a}_j)$.
Let

$$\mathbf{K} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^T$$

be the full eigenvalue decomposition (EVD) of the kernel matrix, and let $\mathbf{b}_1^T, \ldots, \mathbf{b}_n^T$ be the rows of $\mathbf{B} = \mathbf{U} \mathbf{\Lambda}^{1/2}$. It can be shown (see Appendix B.3) that the solution of (2) is identical to the solution of

$$\min_{\mathcal{C}_1, \ldots, \mathcal{C}_k} \; \sum_{j=1}^{k} \sum_{i \in \mathcal{C}_j} \Big\| \mathbf{b}_i - \frac{1}{|\mathcal{C}_j|} \sum_{l \in \mathcal{C}_j} \mathbf{b}_l \Big\|_2^2. \qquad (3)$$
To demonstrate the power of kernel $k$-means clustering, consider the dataset in Figure 1(b). We use the Gaussian RBF kernel

$$\kappa(\mathbf{a}_i, \mathbf{a}_j) = \exp\Big( -\frac{\| \mathbf{a}_i - \mathbf{a}_j \|_2^2}{2 \sigma^2} \Big)$$

with an appropriately chosen bandwidth $\sigma$, and form the corresponding kernel matrix $\mathbf{K}$ of the data in Figure 1(b). Figure 1(c) scatterplots the first two coordinates of the feature vectors $\mathbf{b}_1, \ldots, \mathbf{b}_n$. Clearly, the first coordinate of the feature vectors already separates the two classes well, so $k$-means clustering using the nonlinear features has a better chance of separating the two classes.
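To make the construction concrete, the following snippet (our own illustration; the ring radii and the bandwidth $\sigma = 1$ are assumptions, not the values used for Figure 1) builds the RBF kernel matrix of two concentric rings and recovers the feature vectors $\mathbf{U} \mathbf{\Lambda}^{1/2}$ from the EVD:

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    # kappa(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Two concentric rings (illustrative radii 1 and 5).
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=100)
radius = np.repeat([1.0, 5.0], 50)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

K = rbf_kernel(X, X, sigma=1.0)
# Feature vectors: rows of U Lambda^{1/2}, from the EVD K = U Lambda U^T.
lam, U = np.linalg.eigh(K)          # eigenvalues in ascending order
order = np.argsort(lam)[::-1]
B = U[:, order] * np.sqrt(np.maximum(lam[order], 0.0))
```

Each row of `B` is the nonlinear feature vector of one data point; plotting its first two coordinates reproduces a picture like Figure 1(c).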
Although it is more generally applicable than linear $k$-means clustering, kernel $k$-means clustering is computationally expensive. As a baseline, we consider the cost of optimizing (3). The formation of the kernel matrix $\mathbf{K}$ given the $n$ input vectors costs $O(n^2 d)$ time. The objective in (3) can then be (approximately) minimized using Lloyd's algorithm at a cost of $O(n^2 k)$ time per iteration. This requires the $n$-dimensional nonlinear feature vectors obtained from the full EVD of $\mathbf{K}$; computing these feature vectors takes $O(n^3)$ time, because $\mathbf{K}$ is, in general, full-rank. Thus, approximately solving the kernel $k$-means clustering problem by optimizing (3) costs $O(n^2 d + n^3 + n^2 k T)$ time, where $T$ is the number of iterations of Lloyd's algorithm.
Kernel approximation techniques, including the Nyström method (Nyström, 1930; Williams and Seeger, 2001; Gittens and Mahoney, 2016) and random feature maps (Rahimi and Recht, 2007), have been applied to decrease the cost of solving kernelized machine learning problems: the idea is to replace $\mathbf{K}$ with a low-rank approximation, which allows for more efficient computations. Chitta et al. (2011, 2012) proposed to apply kernel approximations to efficiently approximate kernel $k$-means clustering. Although kernel approximations mitigate the computational challenges of kernel $k$-means clustering, the aforementioned works do not provide guarantees on the clustering performance: how accurate must the low-rank approximation of $\mathbf{K}$ be to ensure near optimality of the approximate clustering obtained via this method?
We propose a provable approximate solution to the kernel $k$-means problem based on the Nyström approximation. Our method has three steps: first, extract $s$ ($s \ll n$) features using the Nyström method; second, reduce the features to $m$ dimensions ($k \leq m \ll s$) using the truncated SVD (this is why our variant is called the rank-restricted Nyström approximation; the rank-restriction serves two purposes: first, although we do not know whether the rank-restriction is necessary for the $1+\epsilon$ bound, we are unable to establish the bound without it, and second, the rank-restriction makes the third step, linear $k$-means clustering, much less costly; for this computational benefit, previous works (Boutsidis et al., 2009, 2010, 2015; Cohen et al., 2015; Feldman et al., 2013) have considered dimensionality reduction for linear $k$-means clustering); third, apply any off-the-shelf linear $k$-means clustering algorithm to the $m$-dimensional features to obtain the final clusters. The total time complexity of the first two steps is $O(nsd + ns^2)$. The time complexity of the third step depends on the specific linear $k$-means algorithm; for example, using Lloyd's algorithm, the per-iteration complexity is $O(nmk)$, and the number of iterations required may depend on the data. (Without the rank-restriction, the per-iteration cost would be $O(nsk)$.)
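The three steps can be sketched end to end as follows (a minimal illustration of the pipeline, using uniform sampling, an RBF kernel, and, for simplicity, a restricted rank equal to the number of clusters; the helper names are ours):

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def nystrom_features(X, s, k, sigma=1.0, seed=0):
    """Steps 1-2: rank-restricted Nystrom features (n x k), uniform sampling."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=s, replace=False)  # uniform column sampling
    C = rbf(X, X[idx], sigma)                        # C = K S   (n x s)
    W = C[idx]                                       # W = S^T K S   (s x s)
    lam, V = np.linalg.eigh(W)
    pos = lam > 1e-10 * lam.max()
    L = C @ (V[:, pos] / np.sqrt(lam[pos])) @ V[:, pos].T  # L = C W^{+1/2}
    U, svals, _ = np.linalg.svd(L, full_matrices=False)
    return U[:, :k] * svals[:k]                      # top-k part of L

def lloyd_labels(B, k, n_iter=30, seed=0):
    """Step 3: any off-the-shelf linear k-means; a basic Lloyd iteration here."""
    rng = np.random.default_rng(seed)
    cent = B[rng.choice(len(B), size=k, replace=False)]
    for _ in range(n_iter):
        lab = ((B[:, None] - cent[None]) ** 2).sum(-1).argmin(1)
        cent = np.array([B[lab == j].mean(0) if (lab == j).any() else cent[j]
                         for j in range(k)])
    return lab
```

Only the $n \times s$ matrix $\mathbf{C}$ is ever formed, so the cost of the feature extraction stays linear in $n$.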
Our method comes with a strong approximation ratio guarantee. Suppose we set the target rank $m = \lceil k/\epsilon \rceil$ and the sketch size $s$ as prescribed in Table 2 for any $\epsilon \in (0, 1)$; for uniform sampling, $s$ scales with the coherence parameter $\mu$ of the dominant $m$-dimensional singular space of the kernel matrix $\mathbf{K}$. Also suppose the standard kernel $k$-means method and our approximate method use the same linear $k$-means clustering algorithm, e.g., Lloyd's algorithm or some other algorithm that comes with different provable approximation guarantees. As guaranteed by Theorem 1, when the quality of the clustering is measured by the cost function defined in (2), with constant probability, our algorithm returns a clustering that is at most $1+\epsilon$ times worse than the standard kernel $k$-means clustering. Our theory makes explicit the tradeoff between accuracy and computation: larger values of $s$ and $m$ lead to higher accuracy but also higher computational cost.

Spectral clustering (Shi and Malik, 2000; Ng et al., 2002) is a popular alternative to kernel $k$-means clustering that can also partition nonlinearly separable data such as those in Figure 1(b). Unfortunately, because it requires computing an $n \times n$ affinity matrix and the top $k$ eigenvectors of the corresponding graph Laplacian, spectral clustering is inefficient for large $n$. Fowlkes et al. (2004) applied the Nyström approximation to increase the scalability of spectral clustering. Since then, spectral clustering with Nyström approximation has been used in many works, e.g., (Arikan, 2006; Berg et al., 2004; Chen et al., 2011; Wang et al., 2016b; Weiss et al., 2009; Zhang and Kwok, 2010). Despite its popularity in practice, this approach does not come with guarantees on the approximation ratio for the obtained clustering. Our algorithm, which combines kernel $k$-means with Nyström approximation, is an equally computationally efficient alternative that comes with strong bounds on the approximation ratio, and can be used wherever spectral clustering is applied.
1.1 Contributions
Using tools developed in (Boutsidis et al., 2015; Cohen et al., 2015; Feldman et al., 2013), we rigorously analyze the performance of approximate kernel $k$-means clustering with the Nyström approximation, and show that the rank-restricted Nyström approximation delivers a $1+\epsilon$ approximation ratio guarantee, relative to the guarantee provided by the same algorithm without the use of the Nyström method.
As part of the analysis of kernel $k$-means with Nyström approximation, we establish the first relative-error bound for the rank-restricted Nyström approximation, which is of independent interest. (Similar relative-error bounds were independently developed in the contemporaneous work of Tropp et al. (2017), in service of the analysis of a novel streaming algorithm for fixed-rank approximation of positive semidefinite matrices. Preliminary versions of this work and theirs were simultaneously submitted to arXiv in June 2017.)
Kernel $k$-means clustering and spectral clustering are competing solutions to the nonlinear clustering problem, neither of which scales well with $n$. Fowlkes et al. (2004) introduced the use of Nyström approximations to make spectral clustering scalable; this approach has become popular in machine learning. We identify fundamental mathematical problems with this heuristic. These concerns and an empirical comparison establish that our proposed combination of kernel $k$-means with rank-restricted Nyström approximation is a theoretically well-founded and empirically competitive alternative to spectral clustering with Nyström approximation.

Finally, we demonstrate the scalability of this approach by measuring the performance of an Apache Spark implementation of a distributed version of our approximate kernel $k$-means clustering algorithm on the MNIST8M data set, which has 8.1 million instances and 10 classes.
1.2 Relation to Prior Work
The key to our analysis of the proposed approximate kernel $k$-means clustering algorithm is a novel relative-error trace norm bound for a rank-restricted Nyström approximation. We restrict the rank of the Nyström approximation in a nonstandard manner (see Remark 1). Our relative-error trace norm bound is not a simple consequence of the existing bounds for non-rank-restricted Nyström approximations, such as the ones provided by Gittens and Mahoney (2016). The relative-error bound which we provide for the rank-restricted Nyström approximation is potentially useful in other applications involving the Nyström method.
The projection-cost preservation (PCP) property (Cohen et al., 2015; Feldman et al., 2013) is an important tool for analyzing approximate linear $k$-means clustering. We apply our novel relative-error trace norm bound, as well as existing tools from (Cohen et al., 2015), to prove that the rank-restricted Nyström approximation enjoys the PCP property. We do not rule out the possibility that the non-rank-restricted (rank-$s$) Nyström approximation satisfies the PCP property and/or also enjoys a $1+\epsilon$ approximation ratio guarantee when applied to kernel $k$-means clustering. However, the cost of the linear $k$-means clustering step in the algorithm is proportional to the dimensionality of the feature vectors, so the rank-restricted Nyström approximation, which produces feature vectors of dimension much smaller than $s$, is more computationally desirable.
Musco and Musco (2017) similarly establishes a $1+\epsilon$ approximation ratio for the kernel $k$-means objective when a Nyström approximation is used in place of the full kernel matrix. Specifically, Musco and Musco (2017) shows that when $s$ columns of $\mathbf{K}$ are sampled using ridge leverage score (RLS) sampling (Alaoui and Mahoney, 2015; Cohen et al., 2017; Musco and Musco, 2017) and are used to form a Nyström approximation, then applying linear $k$-means clustering to the $s$-dimensional Nyström features returns a clustering whose objective value is at most $1+\epsilon$ times as large as the objective value of the best clustering. Our theory is independent of that in Musco and Musco (2017), and differs in that (1) Musco and Musco (2017) applies specifically to Nyström approximations formed using RLS sampling, whereas our guarantees apply to any sketching method that satisfies the "subspace embedding" and "matrix multiplication" properties (see Lemma A.2 for definitions of these two properties); and (2) Musco and Musco (2017) establishes a $1+\epsilon$ approximation ratio for the non-rank-restricted RLS-Nyström approximation, whereas we establish a $1+\epsilon$ approximation ratio for the (more computationally efficient) rank-restricted Nyström approximation.
1.3 Paper Organization
In Section 2, we define the notation used throughout this paper and provide background on matrix sketching methods. In Section 3, we present our main theoretical results: Section 3.1 presents an improved relative-error rank-restricted Nyström approximation; Section 3.2 presents the main theoretical results on kernel $k$-means with Nyström approximation; and Section 3.3 studies kernel $k$-means with kernel principal component analysis. Section 4 discusses and evaluates the theoretical and empirical merits of kernel $k$-means clustering versus spectral clustering, when each is approximated using the Nyström approximation. Section 5 empirically compares the Nyström method and random feature maps for kernel $k$-means clustering on medium-scale data. Section 6 presents a large-scale distributed implementation in Apache Spark and its empirical evaluation on a data set with 8.1 million points. Section 7 provides a brief conclusion. Proofs are provided in the Appendices.

2 Notation
This section defines the notation used throughout this paper. A set of commonly used parameters is summarized in Table 1.
Notation    Definition

$n$         number of samples
$d$         number of features (attributes)
$k$         number of clusters
$m$         target rank of the Nyström approximation
$s$         sketch size of the Nyström approximation
Matrices and vectors.
We take $\mathbf{I}_n$ to be the $n \times n$ identity matrix, $\mathbf{0}$ to be a vector or matrix of all zeros of the appropriate size, and $\mathbf{1}_n$ to be the $n$-dimensional vector of all ones.
Sets.
The set $\{1, 2, \ldots, n\}$ is written as $[n]$. We call $\{\mathcal{C}_1, \ldots, \mathcal{C}_k\}$ a partition of $[n]$ if $\mathcal{C}_1 \cup \cdots \cup \mathcal{C}_k = [n]$ and $\mathcal{C}_i \cap \mathcal{C}_j = \emptyset$ when $i \neq j$. Let $|\mathcal{C}|$ denote the cardinality of the set $\mathcal{C}$.
Singular value decomposition (SVD).
Let $\mathbf{A} \in \mathbb{R}^{n \times d}$ and $\rho = \operatorname{rank}(\mathbf{A})$. A (compact) singular value decomposition (SVD) is defined by

$$\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T = \sum_{i=1}^{\rho} \sigma_i \mathbf{u}_i \mathbf{v}_i^T,$$

where $\mathbf{U} \in \mathbb{R}^{n \times \rho}$, $\mathbf{\Sigma} \in \mathbb{R}^{\rho \times \rho}$, and $\mathbf{V} \in \mathbb{R}^{d \times \rho}$ are a column-orthogonal matrix, a diagonal matrix with nonnegative entries, and a column-orthogonal matrix, respectively. If $\mathbf{A}$ is symmetric positive semidefinite (SPSD), then $\mathbf{U} = \mathbf{V}$, and this decomposition is also called the (reduced) eigenvalue decomposition (EVD). By convention, we take $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_\rho > 0$.

Truncated SVD.
The matrix $\mathbf{A}_k = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^T$ is a rank-$k$ truncated SVD of $\mathbf{A}$, and is an optimal rank-$k$ approximation to $\mathbf{A}$ when the approximation error is measured in a unitarily invariant norm.
Moore-Penrose inverse.
The Moore-Penrose inverse of $\mathbf{A}$ is defined by $\mathbf{A}^\dagger = \mathbf{V} \mathbf{\Sigma}^{-1} \mathbf{U}^T$.
Leverage score and coherence.
Let $\mathbf{U}$ be defined as above, and let $\mathbf{u}_{(i)}$ be the $i$th row of $\mathbf{U}$. The row leverage scores of $\mathbf{A}$ are $l_i = \| \mathbf{u}_{(i)} \|_2^2$ for $i \in [n]$. The row coherence of $\mathbf{A}$ is $\mu(\mathbf{A}) = \frac{n}{\rho} \max_i \| \mathbf{u}_{(i)} \|_2^2$. The leverage scores of a matrix can be computed exactly in the time it takes to compute the matrix $\mathbf{U}$; and the leverage scores can be approximated (in theory (Drineas et al., 2012) and in practice (Gittens and Mahoney, 2016)) in roughly the time it takes to apply a random projection matrix to the matrix $\mathbf{A}$.
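As a concrete illustration (our own helper, not from the paper), leverage scores and coherence can be read off the thin SVD:

```python
import numpy as np

def leverage_and_coherence(A):
    """Row leverage scores l_i = ||u_i||_2^2 (u_i = i-th row of U) and
    row coherence mu = (n / rank) * max_i l_i."""
    U, svals, _ = np.linalg.svd(A, full_matrices=False)
    tol = svals.max() * max(A.shape) * np.finfo(float).eps
    rho = int((svals > tol).sum())          # numerical rank
    lev = (U[:, :rho] ** 2).sum(axis=1)     # leverage scores, each in [0, 1]
    mu = A.shape[0] / rho * lev.max()       # coherence, in [1, n / rank]
    return lev, mu
```

The scores always sum to the rank of $\mathbf{A}$; matrices whose rows contribute evenly have coherence near $1$, while a matrix with a few dominant rows has coherence near $n/\rho$.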
Matrix norms.
We use three matrix norms in this paper:

Frobenius norm:  $\| \mathbf{A} \|_F = \big( \sum_{i,j} A_{ij}^2 \big)^{1/2} = \big( \sum_i \sigma_i^2 \big)^{1/2}$
Spectral norm:   $\| \mathbf{A} \|_2 = \sigma_1$
Trace norm:      $\| \mathbf{A} \|_* = \sum_i \sigma_i$

Any matrix satisfies $\| \mathbf{A} \|_2 \leq \| \mathbf{A} \|_F \leq \| \mathbf{A} \|_*$. If additionally $\mathbf{A}$ is SPSD, then $\| \mathbf{A} \|_* = \operatorname{tr}(\mathbf{A})$.
Matrix sketching.
Here, we briefly review matrix sketching methods that are commonly used within randomized linear algebra (RLA) (Mahoney, 2011).

Given a matrix $\mathbf{A}$ with $n$ columns, we call $\mathbf{C} = \mathbf{A} \mathbf{S}$ a sketch of $\mathbf{A}$, and $\mathbf{S} \in \mathbb{R}^{n \times s}$ (typically with $s \ll n$) a sketching matrix. Within RLA, sketching has emerged as a powerful primitive, where one is primarily interested in using random projections and random sampling to construct randomized sketches (Mahoney, 2011; Drineas and Mahoney, 2016). In particular, sketching is useful as it allows large matrices to be replaced with smaller matrices, which are more amenable to efficient computation but provably retain almost optimal accuracy in many computations (Mahoney, 2011; Woodruff, 2014). The columns of $\mathbf{C}$ typically comprise a rescaled subset of the columns of $\mathbf{A}$, or random linear combinations of the columns of $\mathbf{A}$; the former type of sketching is called column selection or random sampling, and the latter is referred to as random projection.
Column selection forms $\mathbf{C}$ using a randomly sampled and rescaled subset of the columns of $\mathbf{A}$. Let $p_1, \ldots, p_n$ be the sampling probabilities associated with the columns of $\mathbf{A}$ (so that, in particular, $\sum_{i=1}^n p_i = 1$). The columns of the sketch are selected identically and independently as follows: each column of $\mathbf{C}$ is randomly sampled from the columns of $\mathbf{A}$ according to the sampling probabilities and rescaled by $\frac{1}{\sqrt{s p_j}}$, where $j$ is the index of the column of $\mathbf{A}$ that was selected. In our matrix multiplication formulation of sketching, column selection corresponds to a sketching matrix $\mathbf{S}$ that has exactly one nonzero entry in each column, whose position and magnitude correspond to the index of the column selected from $\mathbf{A}$. Uniform sampling is column sampling with $p_1 = \cdots = p_n = \frac{1}{n}$, and leverage score sampling takes $p_i = \frac{l_i}{\sum_{j=1}^{n} l_j}$ for $i \in [n]$, where $l_i$ is the $i$th leverage score of some matrix (typically $\mathbf{A}$, $\mathbf{A}_k$, or a randomized approximation thereto) (Drineas et al., 2012).
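In code, such a sampling sketching matrix can be drawn as follows (our own illustrative helper); with uniform probabilities, every nonzero entry equals $\sqrt{n/s}$:

```python
import numpy as np

def sampling_sketch(n, s, p, rng):
    """Column-selection sketching matrix S (n x s): each column has a single
    nonzero, in a row j drawn from p, with magnitude 1 / sqrt(s * p_j)."""
    rows = rng.choice(n, size=s, replace=True, p=p)
    S = np.zeros((n, s))
    S[rows, np.arange(s)] = 1.0 / np.sqrt(s * p[rows])
    return S
```

With this scaling, $\mathbb{E}[\mathbf{S}\mathbf{S}^T] = \mathbf{I}_n$, so the sketch $\mathbf{C} = \mathbf{A}\mathbf{S}$ satisfies $\mathbb{E}[\mathbf{C}\mathbf{C}^T] = \mathbf{A}\mathbf{A}^T$.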
Gaussian projection is a type of random projection where the sketching matrix is taken to be $\mathbf{S} = \frac{1}{\sqrt{s}} \mathbf{G}$; here the entries of $\mathbf{G}$ are i.i.d. $\mathcal{N}(0, 1)$ random variables. Gaussian projection is inefficient relative to column sampling: the formation of a Gaussian sketch of a dense matrix requires $O(nds)$ time. The Subsampled Randomized Hadamard Transform (SRHT) is a more efficient alternative that enjoys similar properties to the Gaussian projection (Drineas et al., 2011; Lu et al., 2013; Tropp, 2011), and can be applied to a dense matrix in only $O(nd \log s)$ time. The CountSketch is even more efficient: it can be applied to any matrix $\mathbf{A}$ in $O(\operatorname{nnz}(\mathbf{A}))$ time (Clarkson and Woodruff, 2013; Meng and Mahoney, 2013; Nelson and Nguyên, 2013), where $\operatorname{nnz}(\mathbf{A})$ denotes the number of nonzero entries in a matrix.
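A CountSketch matrix has a single $\pm 1$ entry in each row, so applying it amounts to hashing each column of $\mathbf{A}$ into one of $s$ signed buckets. The sketch below (our own illustration) shows both the matrix form and the one-pass application:

```python
import numpy as np

def countsketch_matrix(n, s, rng):
    """CountSketch matrix S (n x s): each row has a single +/-1 entry,
    placed in a uniformly random column."""
    bucket = rng.integers(0, s, size=n)      # hash h(j) in {0, ..., s-1}
    sign = rng.choice([-1.0, 1.0], size=n)   # random sign sigma(j)
    S = np.zeros((n, s))
    S[np.arange(n), bucket] = sign
    return S

def apply_countsketch(A, S):
    """Streaming application equivalent to A @ S: one pass over A's columns,
    adding each (signed) column into its hashed bucket."""
    rows, buckets = np.nonzero(S)
    C = np.zeros((A.shape[0], S.shape[1]))
    for j, b in zip(rows, buckets):
        C[:, b] += S[j, b] * A[:, j]
    return C
```

Because each column of $\mathbf{A}$ is touched exactly once, the application costs $O(\operatorname{nnz}(\mathbf{A}))$ time when $\mathbf{A}$ is sparse.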
3 Our Main Results: Improved SPSD Matrix Approximation and Kernel $k$-means Approximation
In this section, we present our main theoretical results. We start, in Section 3.1, by presenting Theorem 3.1, a novel result on SPSD matrix approximation with the rank-restricted Nyström method. This result is of independent interest, and so we present it in detail; in this paper, we use it to establish our main result. Then, in Section 3.2, we present Theorem 1, which is our main result for approximate kernel $k$-means with the Nyström approximation. In Section 3.3, we establish novel guarantees on kernel $k$-means with dimensionality reduction.
3.1 The Nyström Method
The Nyström method (Nyström, 1930) is the most popular kernel approximation method in the machine learning community. Let $\mathbf{K} \in \mathbb{R}^{n \times n}$ be an SPSD matrix and $\mathbf{S} \in \mathbb{R}^{n \times s}$ be a sketching matrix. The Nyström method approximates $\mathbf{K}$ with $\mathbf{C} \mathbf{W}^\dagger \mathbf{C}^T$, where $\mathbf{C} = \mathbf{K} \mathbf{S}$ and $\mathbf{W} = \mathbf{S}^T \mathbf{K} \mathbf{S}$. The Nyström method was introduced to the machine learning community by Williams and Seeger (2001); since then, numerous works have studied its theoretical properties, e.g., (Drineas and Mahoney, 2005; Gittens and Mahoney, 2016; Jin et al., 2013; Kumar et al., 2012; Wang and Zhang, 2013; Yang et al., 2012).
Empirical results in (Gittens and Mahoney, 2016; Wang et al., 2016b; Yang et al., 2012) demonstrated that the accuracy of the Nyström method significantly increases when the spectrum of $\mathbf{K}$ decays fast. This suggests that the Nyström approximation captures the dominant eigenspaces of $\mathbf{K}$, and that error bounds comparing the accuracy of the Nyström approximation of $\mathbf{K}$ to that of the best rank-$k$ approximation $\mathbf{K}_k$ (for $k < s$) would provide a meaningful measure of the performance of the Nyström kernel approximation. Gittens and Mahoney (2016) established the first relative-error bounds, showing that for sufficiently large $s$, the trace norm error $\| \mathbf{K} - \mathbf{C} \mathbf{W}^\dagger \mathbf{C}^T \|_*$ is comparable to $\| \mathbf{K} - \mathbf{K}_k \|_*$. Such results quantify the benefits of spectral decay to the performance of the Nyström method, and are sufficient to analyze the performance of Nyström approximations in applications such as kernel ridge regression (Alaoui and Mahoney, 2015; Bach, 2013) and kernel support vector machines (Cortes et al., 2010).

However, Gittens and Mahoney (2016) did not analyze the performance of rank-restricted Nyström approximations; they compared the approximation accuracies of the rank-$s$ matrix $\mathbf{C} \mathbf{W}^\dagger \mathbf{C}^T$ and the rank-$k$ matrix $\mathbf{K}_k$ (recall that $k < s$). In our application to approximate kernel $k$-means clustering, it is a rank-$k$ approximation that is of relevance. Given $\mathbf{C}$ and $\mathbf{W}$, the truncated SVD of $\mathbf{C} (\mathbf{W}^\dagger)^{1/2}$ can be found in $O(n s^2)$ time. The rank-$k$ (rank-restricted) Nyström approximation can then be written as

$$\big( \mathbf{C} \mathbf{W}^\dagger \mathbf{C}^T \big)_k. \qquad (4)$$
Theorem 3.1 provides a relative-error trace norm approximation guarantee for the rank-restricted approximation (4) and is novel; a proof is provided in Appendix A.
[Relative-Error Rank-Restricted Nyström Approximation] Let $\mathbf{K} \in \mathbb{R}^{n \times n}$ be an SPSD matrix, $k$ be the target rank, and $\epsilon \in (0, 1)$ be an error parameter. Let $\mathbf{S} \in \mathbb{R}^{n \times s}$ be a sketching matrix corresponding to one of the sketching methods listed in Table 2. Let $\mathbf{C} = \mathbf{K} \mathbf{S}$ and $\mathbf{W} = \mathbf{S}^T \mathbf{K} \mathbf{S}$. Then

$$\big\| \mathbf{K} - \big( \mathbf{C} \mathbf{W}^\dagger \mathbf{C}^T \big)_k \big\|_* \leq (1 + \epsilon) \, \| \mathbf{K} - \mathbf{K}_k \|_*$$

holds with probability at least $0.9$. In addition, there exists an $n \times k$ column-orthogonal matrix $\mathbf{U}$ such that $\big( \mathbf{C} \mathbf{W}^\dagger \mathbf{C}^T \big)_k = \mathbf{U} \mathbf{U}^T \big( \mathbf{C} \mathbf{W}^\dagger \mathbf{C}^T \big) \mathbf{U} \mathbf{U}^T$.
Table 2: sketch sizes ($s$) and sketching time complexities for the five sketching methods considered: uniform sampling, leverage score sampling, Gaussian projection, SRHT, and CountSketch.
Remark 1 (Rank Restrictions)
The traditional rank-restricted Nyström approximation, $\mathbf{C} \mathbf{W}_k^\dagger \mathbf{C}^T$ (Drineas and Mahoney, 2005; Fowlkes et al., 2004; Gittens and Mahoney, 2016; Li et al., 2015), is not known to satisfy a relative-error bound of the form guaranteed in Theorem 3.1. Pourkamali-Anaraki and Becker (2016) pointed out the drawbacks of the traditional rank-restricted Nyström approximation and proposed the use of the rank-restricted Nyström approximation $(\mathbf{C} \mathbf{W}^\dagger \mathbf{C}^T)_k$ in applications requiring kernel approximations, but provided only empirical evidence of its performance. This work provides guarantees on the approximation error of the rank-restricted Nyström approximation $(\mathbf{C} \mathbf{W}^\dagger \mathbf{C}^T)_k$, and applies this approximation to the kernel $k$-means clustering problem. The contemporaneous work of Tropp et al. (2017) provides similar guarantees on the approximation error of $(\mathbf{C} \mathbf{W}^\dagger \mathbf{C}^T)_k$, and uses this Nyström approximation as the basis of a streaming algorithm for fixed-rank approximation of positive semidefinite matrices.
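Because $\mathbf{C} \mathbf{W}^\dagger \mathbf{C}^T = \mathbf{L} \mathbf{L}^T$ with $\mathbf{L} = \mathbf{C} (\mathbf{W}^\dagger)^{1/2}$, the rank-restricted approximation $(\mathbf{C} \mathbf{W}^\dagger \mathbf{C}^T)_k = \mathbf{L}_k \mathbf{L}_k^T$ can be computed from the sketch alone, without forming an $n \times n$ intermediate. A small NumPy sketch of this computation (our own illustration), which can be checked against a direct dense calculation:

```python
import numpy as np

def rank_restricted_nystrom(K, idx, k):
    """(C W^+ C^T)_k from the sampled columns idx, via L = C (W^+)^{1/2}:
    the top-k part of L L^T equals the top-k part of C W^+ C^T."""
    C = K[:, idx]                        # C = K S
    W = K[np.ix_(idx, idx)]              # W = S^T K S
    lam, V = np.linalg.eigh(W)
    pos = lam > 1e-10 * lam.max()
    L = C @ (V[:, pos] / np.sqrt(lam[pos])) @ V[:, pos].T
    U, svals, _ = np.linalg.svd(L, full_matrices=False)
    Lk = U[:, :k] * svals[:k]
    return Lk @ Lk.T
```

In the clustering pipeline one would keep the $n \times k$ factor `Lk` as the feature matrix rather than forming the product.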
3.2 Main Result for Approximate Kernel $k$-means
In this section we establish approximation ratio guarantees for the objective function of kernel $k$-means clustering. We first define $\gamma$-approximate $k$-means algorithms (where $\gamma \geq 1$), then present our main result in Theorem 1.
Let $\mathbf{B}$ be a matrix with rows $\mathbf{b}_1^T, \ldots, \mathbf{b}_n^T$. The objective function for linear $k$-means clustering over the rows of $\mathbf{B}$ is

$$f\big( \{\mathcal{C}_j\}_{j=1}^k ; \mathbf{B} \big) = \sum_{j=1}^{k} \sum_{i \in \mathcal{C}_j} \Big\| \mathbf{b}_i - \frac{1}{|\mathcal{C}_j|} \sum_{l \in \mathcal{C}_j} \mathbf{b}_l \Big\|_2^2.$$
The minimization of this objective with respect to the partition $\{\mathcal{C}_j\}_{j=1}^k$ is NP-hard (Garey et al., 1982; Aloise et al., 2009; Dasgupta and Freund, 2009; Mahajan et al., 2009; Awasthi et al., 2015), but approximate solutions can be obtained in polynomial time. $\gamma$-approximate algorithms capture one useful notion of approximation.
Definition 1 ($\gamma$-Approximate Algorithms)
A linear $k$-means clustering algorithm $\mathcal{A}$ takes as input a matrix $\mathbf{B}$ with $n$ rows and outputs a partition $\{\tilde{\mathcal{C}}_j\}_{j=1}^k$ of $[n]$. We call $\mathcal{A}$ a $\gamma$-approximate algorithm ($\gamma \geq 1$) if, for any such matrix $\mathbf{B}$,

$$f\big( \{\tilde{\mathcal{C}}_j\}_{j=1}^k ; \mathbf{B} \big) \leq \gamma \, \min_{\{\mathcal{C}_j\}_{j=1}^k} f\big( \{\mathcal{C}_j\}_{j=1}^k ; \mathbf{B} \big).$$

Here $\{\tilde{\mathcal{C}}_j\}_{j=1}^k$ and $\{\mathcal{C}_j\}_{j=1}^k$ are partitions of $[n]$.
Many $(1+\epsilon)$-approximate algorithms have been proposed, but they are computationally expensive (Chen, 2009; Har-Peled and Mazumdar, 2004; Kumar et al., 2004; Matousek, 2000). There are also relatively efficient constant-factor approximate algorithms, e.g., (Arthur and Vassilvitskii, 2007; Kanungo et al., 2002; Song and Rajasekaran, 2010).
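For concreteness, the seeding step of the k-means++ algorithm of Arthur and Vassilvitskii (2007), whose seeding alone already yields an $O(\log k)$-approximation in expectation, can be sketched as follows (our own implementation):

```python
import numpy as np

def kmeanspp_seed(B, k, rng):
    """k-means++ seeding: each new center is drawn with probability
    proportional to its squared distance to the nearest existing center."""
    centers = [B[rng.integers(len(B))]]
    d2 = ((B - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        i = rng.choice(len(B), p=d2 / d2.sum())
        centers.append(B[i])
        d2 = np.minimum(d2, ((B - B[i]) ** 2).sum(axis=1))
    return np.array(centers)
```

The returned centers are typically refined with Lloyd iterations.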
Let $\phi$ be a feature map, $\mathbf{\Phi}$ be the matrix with rows $\phi(\mathbf{a}_1)^T, \ldots, \phi(\mathbf{a}_n)^T$, and $\mathbf{K} = \mathbf{\Phi} \mathbf{\Phi}^T$ be the associated kernel matrix. Analogously, we denote the objective function for kernel $k$-means clustering by

$$f\big( \{\mathcal{C}_j\}_{j=1}^k ; \mathbf{\Phi} \big),$$

where $\{\mathcal{C}_j\}_{j=1}^k$ is a partition of $[n]$.
[Kernel $k$-Means with Nyström Approximation] Choose a sketching matrix $\mathbf{S} \in \mathbb{R}^{n \times s}$ and sketch size $s$ consistent with Table 2 for an error parameter $\epsilon \in (0, 1)$. Let $(\mathbf{C} \mathbf{W}^\dagger \mathbf{C}^T)_m$, with target rank $m = \lceil k/\epsilon \rceil$, be the previously defined rank-restricted Nyström approximation of $\mathbf{K}$. Let $\mathbf{B}$ be any matrix satisfying $\mathbf{B} \mathbf{B}^T = (\mathbf{C} \mathbf{W}^\dagger \mathbf{C}^T)_m$. Let the partition $\{\tilde{\mathcal{C}}_j\}_{j=1}^k$ be the output of a $\gamma$-approximate $k$-means algorithm applied to the rows of $\mathbf{B}$. With probability at least $0.9$,

$$f\big( \{\tilde{\mathcal{C}}_j\}_{j=1}^k ; \mathbf{\Phi} \big) \leq (1 + \epsilon) \, \gamma \, \min_{\{\mathcal{C}_j\}_{j=1}^k} f\big( \{\mathcal{C}_j\}_{j=1}^k ; \mathbf{\Phi} \big).$$
Remark 2
Kernel $k$-means clustering is an NP-hard problem. Therefore, instead of comparing with the exactly optimal clustering, we compare with clusterings obtained using $\gamma$-approximate algorithms. Theorem 1 shows that, when uniform sampling is used to form the Nyström approximation, if the sketch size $s$ and the target rank are set as prescribed above, then the returned clustering has an objective value that is at most a factor of $1+\epsilon$ larger than the objective value of the kernel $k$-means clustering returned by the $\gamma$-approximate algorithm.
Remark 3
Assume we are in a practical setting where $s$, the budget of column samples one can use to form a Nyström approximation, and $k$, the number of desired cluster centers, are fixed. The pertinent question is how to choose the target rank $m$ of the rank-restriction to produce a high-quality approximate clustering. Theorem 1 shows that, for uniform sampling, the error ratio is governed by two competing sources of error: one that shrinks as $m$ grows, and one that grows as $m$ approaches the sketch size.

To balance the two sources of error, $m$ must be larger than $k$, but not too large a fraction of $s$. To minimize the error ratio, $m$ should be selected on the order of $s/\mu$. Since the matrix coherence ($\mu$) is unknown, it can be heuristically treated as a constant.
We empirically study the effect of the values of $s$ and $m$ using a data set comprising 8.1 million samples. Note that computing the kernel $k$-means clustering objective function requires the formation of the entire kernel matrix $\mathbf{K}$, which is infeasible for a data set of this size; instead, we use normalized mutual information (NMI) (Strehl and Ghosh, 2002), a standard measure of the performance of clustering algorithms, to measure the quality of the clustering obtained by approximating kernel $k$-means clustering using Nyström approximations formed through uniform sampling. NMI scores range from zero to one, with a larger score indicating better performance. We report the results in Figure 2. The complete details of the experiments, including the experimental setting and time costs, are given in Section 6.
From Figure 2(a) we observe that larger values of $s$ lead to better and more stable clusterings: the mean of the NMI increases and its standard deviation decreases. This is reasonable and in accordance with our theory. However, larger values of $s$ incur more computation, so one should choose $s$ to trade off computation and accuracy.

Figure 2(b) shows that for fixed $s$ and $k$, the clustering performance is not monotonic in the target rank $m$, which matches Theorem 1 (see the discussion in Remark 3). Setting $m$ as small as $k$ results in poor performance. Setting $m$ overly large not only incurs more computation, but also negatively affects clustering performance; this may suggest the necessity of rank-restriction. Furthermore, the best-performing choice in this example corroborates the suggestion made in Remark 3 that setting $m$ around $s/\mu$ (where $\mu$ is unknown but can be treated as a constant larger than $1$) can be a good choice.
Remark 4
Musco and Musco (2017) established a $1+\epsilon$ approximation ratio for the kernel $k$-means objective value when a non-rank-restricted Nyström approximation is formed using ridge leverage score (RLS) sampling; their analysis is specific to RLS sampling and does not extend to other sketching methods. By way of comparison, our analysis covers several popular sampling schemes and applies to rank-restricted Nyström approximations, but does not extend to RLS sampling.
3.3 Approximate Kernel $k$-means with KPCA and the Power Method
The use of dimensionality reduction to increase the computational efficiency of $k$-means clustering has been widely studied, e.g., in (Boutsidis et al., 2010, 2015; Cohen et al., 2015; Feldman et al., 2013; Zha et al., 2002). Kernel principal component analysis (KPCA) is particularly well-suited to this application (Dhillon et al., 2004; Ding et al., 2005). Applying Lloyd's algorithm to the rows of $\mathbf{\Phi}$ or $\mathbf{U} \mathbf{\Lambda}^{1/2}$ has an $O(n^2 k)$ per-iteration complexity; if a small number $m \ll n$ of features are extracted using KPCA and Lloyd's algorithm is applied to the resulting $m$-dimensional feature vectors, then the per-iteration cost reduces to $O(nmk)$. Proposition 3.3 states that, to obtain a $1+\epsilon$ approximation ratio in terms of the kernel $k$-means objective function, it suffices to use $m = \lceil k/\epsilon \rceil$ KPCA features. This proposition is a simple consequence of (Cohen et al., 2015).
[KPCA] Let $\mathbf{\Phi}$ be a matrix with $n$ rows, and $\mathbf{K} = \mathbf{\Phi} \mathbf{\Phi}^T$ be the corresponding kernel matrix. Let $m = \lceil k/\epsilon \rceil$, let $\mathbf{U}_m \mathbf{\Lambda}_m \mathbf{U}_m^T$ be the rank-$m$ truncated EVD of $\mathbf{K}$, and take $\mathbf{B} = \mathbf{U}_m \mathbf{\Lambda}_m^{1/2}$. Let the partition $\{\tilde{\mathcal{C}}_j\}_{j=1}^k$ be the output of a $\gamma$-approximate algorithm applied to the rows of $\mathbf{B}$. Then

$$f\big( \{\tilde{\mathcal{C}}_j\}_{j=1}^k ; \mathbf{\Phi} \big) \leq (1 + \epsilon) \, \gamma \, \min_{\{\mathcal{C}_j\}_{j=1}^k} f\big( \{\mathcal{C}_j\}_{j=1}^k ; \mathbf{\Phi} \big).$$
In practice, the truncated SVD (equivalently, EVD) of $\mathbf{K}$ is computed using the power method or Krylov subspace methods. These numerical methods do not compute the exact decomposition, so Proposition 3.3 is not directly applicable. It is useful to have a theory that captures the effect of realistically inaccurate eigenspace estimates on the clustering process. As one particular example, consider that general-purpose implementations of the truncated SVD attempt to mitigate the fact that the computed decompositions are inexact by returning very high-precision solutions. Understanding the tradeoff between the precision of the truncated SVD solution and the impact on the approximation ratio of the approximate kernel $k$-means solution allows us to more precisely manage the computational complexity of our algorithms. Are such high-precision solutions necessary for kernel $k$-means clustering?

The following theorem answers this question by establishing that highly accurate eigenspaces are not significantly more useful in approximate kernel $k$-means clustering than eigenspace estimates with lower accuracy. A low-precision solution obtained by running the power iteration for a few rounds suffices for kernel $k$-means clustering applications. We prove this theorem in Appendix C.
[The Power Method] Let $\mathbf{\Phi}$ be a matrix with $n$ rows, $\mathbf{K} = \mathbf{\Phi} \mathbf{\Phi}^T$ be the corresponding kernel matrix, and $\sigma_i$ be the $i$th singular value of $\mathbf{K}$. Fix an error parameter $\epsilon \in (0, 1)$. Run Algorithm 1, with a number of power iterations depending on $\epsilon$, to obtain the approximate features $\tilde{\mathbf{B}}$. Let the partition $\{\tilde{\mathcal{C}}_j\}_{j=1}^k$ be the output of a $\gamma$-approximate algorithm applied to the rows of $\tilde{\mathbf{B}}$. If $\sigma_m > \sigma_{m+1}$, then

$$f\big( \{\tilde{\mathcal{C}}_j\}_{j=1}^k ; \mathbf{\Phi} \big) \leq (1 + \epsilon) \, \gamma \, \min_{\{\mathcal{C}_j\}_{j=1}^k} f\big( \{\mathcal{C}_j\}_{j=1}^k ; \mathbf{\Phi} \big)$$

holds with probability at least $0.9$. If $\sigma_m = \sigma_{m+1}$, then the above inequality holds with probability $0.9 - n^{-c}$, where $c$ is a positive constant (Tao and Vu, 2010).
Note that the power method requires forming the entire kernel matrix $\mathbf{K}$, which may not fit in memory even in a distributed setting. Therefore, in practice, the power method may not be as efficient as the Nyström approximation with uniform sampling, which avoids forming $\mathbf{K}$ in its entirety.
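As an illustration of the low-precision regime the theorem addresses, a block power iteration for the top-$k$ eigenspace of an SPSD matrix can be written as follows (our own sketch; Algorithm 1 in the paper may differ in its details):

```python
import numpy as np

def power_method_features(K, k, n_iter, rng):
    """Block power iteration: approximate the top-k eigenvectors of an SPSD
    matrix K, returned as k-means-ready features U * sqrt(lambda)."""
    Y = rng.standard_normal((K.shape[0], k))
    for _ in range(n_iter):
        Y, _ = np.linalg.qr(K @ Y)       # re-orthogonalize each round
    lam = np.maximum(((K @ Y) * Y).sum(axis=0), 0.0)  # Rayleigh quotients
    return Y * np.sqrt(lam)
```

Each iteration costs one multiplication by $\mathbf{K}$, and even a modest number of iterations yields features whose clustering quality the theorem above controls.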
Theorem 1, Proposition 3.3, and the power method theorem above are highly interesting from a theoretical perspective. These results demonstrate that $O(k/\epsilon)$ features are sufficient to ensure a $1+\epsilon$ approximation ratio. Prior work (Dhillon et al., 2004; Ding et al., 2005) set the number of features equal to $k$ and did not provide approximation ratio guarantees. Indeed, a lower bound in the linear $k$-means clustering case due to (Cohen et al., 2015) shows that $\Omega(k/\epsilon)$ features are necessary to obtain a $1+\epsilon$ approximation ratio.
4 Comparison to Spectral Clustering with Nyström Approximation
In this section, we provide a brief discussion and empirical comparison of our clustering algorithm, which uses the Nyström method to approximate kernel $k$-means clustering, with the popular alternative algorithm that uses the Nyström method to approximate spectral clustering.
4.1 Background
Spectral clustering is a method with a long history (Cheeger, 1969; Donath and Hoffman, 1972, 1973; Fiedler, 1973; Guattery and Miller, 1995; Spielman and Teng, 1996). Within machine learning, spectral clustering is more widely used than kernel $k$-means clustering (Ng et al., 2002; Shi and Malik, 2000), and the use of the Nyström method to speed up spectral clustering has been popular since Fowlkes et al. (2004). Both spectral clustering and kernel $k$-means clustering can be approximated in time linear in $n$ by using the Nyström method with uniform sampling. Practitioners reading this paper may ask:
How does the approximate kernel $k$-means clustering algorithm presented here, which uses Nyström approximation, compare to the popular heuristic of combining spectral clustering with Nyström approximation?
Based on our theoretical results and empirical observations, our answer to this reader is:
Although they have equivalent computational costs, kernel $k$-means clustering with Nyström approximation is both more theoretically sound and more effective in practice than spectral clustering with Nyström approximation.
We first formally describe spectral clustering, and then substantiate our claim regarding the theoretical advantage of our approximate kernel $k$-means method. Our discussion is limited to the normalized, symmetric graph Laplacians used in Fowlkes et al. (2004), but spectral clustering using asymmetric graph Laplacians encounters similar issues.
4.2 Spectral Clustering with Nyström Approximation
The input to the spectral clustering algorithm is an affinity matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ that measures the pairwise similarities between the points being clustered; typically $\mathbf{A}$ is a kernel matrix or the adjacency matrix of a weighted graph constructed using the data points as vertices. Let $\mathbf{D} = \operatorname{diag}(\mathbf{A} \mathbf{1}_n)$ be the diagonal degree matrix associated with $\mathbf{A}$, and $\mathbf{L} = \mathbf{I}_n - \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}$ be the associated normalized graph Laplacian matrix. Let $\mathbf{V}$ denote the bottom $k$ eigenvectors of $\mathbf{L}$, or equivalently, the top $k$ eigenvectors of $\mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}$. Spectral clustering groups the data points by performing linear $k$-means clustering on the normalized rows of $\mathbf{V}$. Fowlkes et al. (2004) popularized the application of the Nyström approximation to spectral clustering. This algorithm computes an approximate spectral clustering by: (1) forming a Nyström approximation to $\mathbf{A}$, denoted by $\tilde{\mathbf{A}}$; (2) computing the degree matrix $\tilde{\mathbf{D}}$ of $\tilde{\mathbf{A}}$; (3) computing the top $k$ singular vectors of $\tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$, which are equivalent to the bottom eigenvectors of the approximate Laplacian $\tilde{\mathbf{L}} = \mathbf{I}_n - \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$; and (4) performing linear $k$-means over the normalized rows of the resulting singular vector matrix.
To the best of our knowledge, spectral clustering with Nyström approximation does not have a bounded approximation ratio relative to exact spectral clustering. In fact, it seems unlikely that the approximation ratio could be bounded, as there are fundamental problems with the application of the Nyström approximation to the affinity matrix.

The affinity matrix $K$ used in spectral clustering must be elementwise nonnegative. However, the Nyström approximation $\tilde{K}$ of such a matrix can have numerous negative entries, so it is, in general, not proper input for the spectral clustering algorithm. In particular, the approximated degree matrix $\tilde{D}$ may have negative diagonal entries, so $\tilde{D}^{-1/2}$ is not guaranteed to be a real matrix; such exceptions must be handled heuristically. The approximate asymmetric Laplacian $I_n - \tilde{D}^{-1} \tilde{K}$ does avoid the introduction of complex values; however, the negative entries in $\tilde{D}$ negate whole rows of $\tilde{D}^{-1} \tilde{K}$, leading to less meaningful negative similarities/distances.
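This failure mode is easy to exhibit numerically. In the sketch below (our own construction, not an example from the paper), a four-point RBF affinity matrix is elementwise nonnegative, yet the standard Nyström approximation built from its first two columns assigns a negative similarity to the pair of remaining points:

```python
import numpy as np

# Four points on a line, chosen so that the two sampled points represent
# the other two poorly in opposite directions.
x = np.array([0.0, 0.325, -2.0, 2.325])
K = np.exp(-(x[:, None] - x[None, :]) ** 2)   # RBF affinities, all >= 0

idx = [0, 1]                                  # sample the first two columns
C = K[:, idx]
W = K[np.ix_(idx, idx)]
K_nys = C @ np.linalg.pinv(W) @ C.T           # standard Nystrom approximation

print(K_nys[2, 3])   # negative, although every entry of K is nonnegative
```

The entry $\tilde{K}_{34}$ here is roughly $-8 \times 10^{-4}$, while the corresponding entry of $K$ is (a tiny) positive number.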

Even if $\tilde{D}^{-1/2}$ is real, the matrix $\tilde{L} = I_n - \tilde{D}^{-1/2} \tilde{K} \tilde{D}^{-1/2}$ may not be SPSD, much less a Laplacian matrix. Thus the bottom eigenvectors of $\tilde{L}$ cannot be viewed as useful coordinates for linear $k$-means clustering in the same way that the eigenvectors of $L$ can be.

Such approximation is also problematic in terms of matrix approximation accuracy. Even when $\tilde{K}$ approximates $K$ well, which can be theoretically guaranteed, the approximate Laplacian $\tilde{L}$ can be far from $L$. This is because a small perturbation in $K$ can have an outsized influence on the eigenvectors of $L$.

One may propose to approximate $D^{-1/2} K D^{-1/2}$, rather than $K$, with a Nyström approximation $M$; this ensures that the approximate normalized graph Laplacian $I_n - M$ is SPSD. However, this approach requires forming the entirety of $K$ in order to compute the degree matrix $D$, and thus has quadratic (in $n$) time and memory costs. Furthermore, although the resulting approximation, $I_n - M$, is SPSD, it is not a graph Laplacian: its off-diagonal entries are not guaranteed to be nonpositive, and its smallest eigenvalue may be nonzero.
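The SPSD claim can be checked numerically: the Nyström approximation $M$ of an SPSD matrix always underestimates it in the Loewner order, so $M \preceq D^{-1/2} K D^{-1/2} \preceq I_n$ and hence $I_n - M$ has no negative eigenvalues. A small self-contained check (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq / 2.0)                          # RBF affinity matrix

d = K.sum(axis=1)
M_exact = K / np.sqrt(np.outer(d, d))          # D^{-1/2} K D^{-1/2}

idx = rng.choice(60, size=10, replace=False)
C = M_exact[:, idx]
W = M_exact[np.ix_(idx, idx)]
M_nys = C @ np.linalg.pinv(W) @ C.T            # Nystrom approximation of M_exact

L_apx = np.eye(60) - M_nys                     # approximate normalized Laplacian
print(np.linalg.eigvalsh(L_apx).min())         # nonnegative, up to roundoff
```

Of course, this construction still requires the degrees $d$, and therefore all of $K$, which is exactly the quadratic cost noted above.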
In summary, spectral clustering using the Nyström approximation (Fowlkes et al., 2004), which has proven to be a useful heuristic, and which is composed of theoretically principled parts, is less principled when viewed in its entirety. Approximate kernel $k$-means clustering using the Nyström approximation is an equally efficient, but theoretically more principled, alternative.
4.3 Empirical Comparison with Approximate Spectral Clustering using Nyström Approximation
Table 3: The classification data sets used in our experiments.

dataset | #instances ($n$) | #features ($d$) | #clusters ($k$)
MNIST (LeCun et al., 1998) | 60,000 | 780 | 10
Mushrooms (Frank and Asuncion, 2010) | 8,124 | 112 | 2
PenDigits (Frank and Asuncion, 2010) | 7,494 | 16 | 10
To complement our discussion of the relative merits of the two methods, we empirically compared the performance of our novel method of approximate kernel $k$-means clustering using the Nyström method with the popular method of approximate spectral clustering using the Nyström method. We used three classification data sets, described in Table 3. The data sets used are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
Let $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ be the input vectors. We take both the affinity matrix for spectral clustering and the kernel matrix for kernel $k$-means to be the RBF kernel matrix $K$, where $k_{ij} = \exp\big(-\tfrac{1}{2\sigma^2} \|\mathbf{x}_i - \mathbf{x}_j\|_2^2\big)$ and $\sigma > 0$ is the kernel width parameter. We choose $\sigma$ based on the average interpoint distance in the data sets as

$\sigma = \gamma \cdot \frac{1}{n^2} \sum_{i,j=1}^{n} \|\mathbf{x}_i - \mathbf{x}_j\|_2$,   (5)

where $\gamma$ is a scaling parameter for which we consider three values.
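A sketch of this choice of $\sigma$ and the resulting RBF kernel matrix (our own illustration; the exact averaging convention in Eqn. (5) is an assumption here, and the function name is ours):

```python
import numpy as np

def rbf_kernel_scaled(X, gamma):
    """RBF kernel K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), with sigma set
    to gamma times the average interpoint distance, in the spirit of Eqn. (5).
    (Averaging over all n^2 ordered pairs is our assumed convention.)"""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    sigma = gamma * np.sqrt(sq).mean()
    return np.exp(-sq / (2.0 * sigma ** 2))
```

Scaling $\sigma$ to the data in this way keeps the kernel entries well spread between 0 and 1 regardless of the units of the input features.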
The algorithms under comparison are all implemented in Python 3.5.2. Our implementation of approximate spectral clustering follows the code of Fowlkes et al. (2004). To compute linear $k$-means clusterings, we use the class sklearn.cluster.KMeans from the scikit-learn package. Our algorithm for approximate kernel $k$-means clustering is described in more detail in Section 5.1. We ran the computations on a MacBook Pro with a 2.5GHz Intel Core i7 CPU and 16GB of RAM.
We compare approximate spectral clustering (SC) with approximate kernel $k$-means clustering (KK), with both using the rank-restricted Nyström method with uniform sampling. (Uniform sampling is appropriate for the value of $\sigma$ used in Eqn. (5); see Gittens and Mahoney (2016) for a detailed discussion of the effect of varying $\sigma$.) We used normalized mutual information (NMI) (Strehl and Ghosh, 2002) to evaluate clustering performance: the NMI falls between 0 (representing no mutual information between the true and approximate clusterings) and 1 (perfect correlation of the two clusterings), so larger NMI indicates better performance. The target dimension $k$ is taken to be the number of clusters; and, for each method, the sketch size $s$ is varied over a range of values. We record the time cost of the two methods, excluding the time spent on the $k$-means clustering required in both algorithms. (For both SC and KK with Nyström approximation, the extracted feature matrices have the same dimensions, so the $k$-means clusterings required by both SC and KK have identical cost.) We repeat this procedure multiple times and report the averaged NMI and average elapsed time.
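As a reminder of how NMI behaves at its extremes, here is a small illustration using scikit-learn's implementation (the label vectors are our own toy example):

```python
from sklearn.metrics import normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_same   = [2, 2, 0, 0, 1, 1]   # identical partition, labels permuted
pred_indep  = [0, 1, 0, 1, 0, 1]   # statistically independent of the truth

print(normalized_mutual_info_score(true_labels, pred_same))    # 1.0
print(normalized_mutual_info_score(true_labels, pred_indep))   # 0.0
```

Because NMI compares partitions rather than label values, it is insensitive to the arbitrary relabeling of clusters that any clustering algorithm may produce.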
We note that, at small sketch sizes $s$, exceptions often arise during approximate spectral clustering due to negative entries in the approximate degree matrix. (This is an example, as discussed in Section 4.2, of when approximate spectral clustering heuristics do not perform well.) We discard the trials in which such exceptions occur.
Our results are summarized in Figure 3. Figure 3 illustrates the NMI of SC and KK as a function of the sketch size and as a function of elapsed time for both algorithms. While there are quantitative differences between the results on the three data sets, the plots all show that KK is more accurate as a function of the sketch size or elapsed time than SC.
5 Single-Machine Medium-Scale Experiments
In this section, we empirically compare the Nyström method and random feature maps (Rahimi and Recht, 2007) for kernel $k$-means clustering. We conduct experiments on the data listed in Table 3. For the Mushrooms and PenDigits data, we are able to evaluate the objective function value of kernel $k$-means clustering.
5.1 Single-Machine Implementation of Approximate Kernel $k$-Means
Our algorithm for approximate kernel $k$-means clustering comprises three steps: Nyström approximation, dimensionality reduction, and linear $k$-means clustering. Both the single-machine as well as the distributed variants of the algorithm are governed by three parameters: $k$, the number of features used in the clustering; $l$, a regularization parameter; and $s$, the sketch size. These parameters satisfy $k \le l \le s$.

Nyström approximation. Let $s$ be the sketch size and $P \in \mathbb{R}^{n \times s}$ be a sketching matrix. Let $C = KP \in \mathbb{R}^{n \times s}$ and $W = P^T K P \in \mathbb{R}^{s \times s}$. The standard Nyström approximation is $\tilde{K} = C W^\dagger C^T$; small singular values in $W$ can lead to instability in the Moore-Penrose inverse, so a widely used heuristic is to choose $l < s$ and use $C W_l^\dagger C^T$, where $W_l$ is the best rank-$l$ approximation to $W$, instead of the standard Nyström approximation. (The Nyström approximation $C W^\dagger C^T$ is correct in theory, but the Moore-Penrose inverse often causes numerical errors in practice. The Moore-Penrose inverse drops all the zero singular values; however, due to finite numerical precision, it is difficult to determine whether a tiny singular value should be treated as zero, and this makes the computation unstable: if such a singular value is treated as zero, it is dropped; otherwise, the Moore-Penrose inverse inverts it, producing an enormous singular value. Dropping some portion of the smallest singular values is a simple heuristic that avoids this instability, which is why we use $C W_l^\dagger C^T$ instead of $C W^\dagger C^T$. Currently we do not have theory for this heuristic; Chiu and Demanet (2013) consider the theoretical implications of this regularization heuristic, but their results do not apply to our problem.) We fix $l$ (the specific choice is arbitrary). Let $U_l \Lambda_l U_l^T$ be the rank-$l$ truncated SVD of $W$ and return $B = C U_l \Lambda_l^{-1/2} \in \mathbb{R}^{n \times l}$ as the output of the Nyström method.
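A minimal sketch of this step with uniform column sampling (the function name is ours; the sketching matrix $P$ is implicit in the index selection):

```python
import numpy as np

def nystrom_features(K, s, l, rng):
    """Regularized Nystrom step: uniformly sample s columns, truncate W to
    its top-l eigenpairs, and return B such that
    B @ B.T == C @ pinv(W_l) @ C.T  (requires l <= s)."""
    n = K.shape[0]
    idx = rng.choice(n, size=s, replace=False)
    C = K[:, idx]                      # C = K P, an n x s matrix
    W = K[np.ix_(idx, idx)]            # W = P^T K P, an s x s SPSD matrix
    vals, vecs = np.linalg.eigh(W)     # eigenvalues in ascending order
    U_l, lam_l = vecs[:, -l:], vals[-l:]
    return C @ (U_l / np.sqrt(lam_l))  # B = C U_l diag(lam_l)^{-1/2}
```

When $K$ itself has rank at most $l$ and the sampled columns span its range, this reconstruction is exact, which gives a convenient sanity check.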

Dimensionality reduction. Let $V_k \in \mathbb{R}^{l \times k}$ contain the $k$ dominant right singular vectors of $B$. Let $B_k = B V_k \in \mathbb{R}^{n \times k}$. It can be verified that $B_k B_k^T$ is the best rank-$k$ approximation to $B B^T = C W_l^\dagger C^T$, which is our desired rank-restricted Nyström approximation.

Linear $k$-means clustering. With $B_k \in \mathbb{R}^{n \times k}$ at hand, use an arbitrary off-the-shelf linear $k$-means clustering algorithm to cluster the rows of $B_k$.
See Algorithm 2 for the single-machine version of this approximate kernel $k$-means clustering algorithm. Observe that we can use uniform sampling to form $C$ and $W$, and thereby avoid computing most of $K$.
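Putting the three steps together, a single-machine sketch of the algorithm (our own condensation of the procedure described above; scikit-learn supplies the linear $k$-means step, and the function name is ours):

```python
import numpy as np
from sklearn.cluster import KMeans

def approx_kernel_kmeans(K, k, s, l, rng, random_state=0):
    """Three-step approximate kernel k-means: (1) regularized Nystrom
    approximation with uniform sampling, (2) reduction to the top-k right
    singular vectors, (3) linear k-means on the rows of the n x k feature
    matrix.  Assumes k <= l <= s."""
    n = K.shape[0]
    # Step 1: Nystrom approximation, with B @ B.T == C @ pinv(W_l) @ C.T.
    idx = rng.choice(n, size=s, replace=False)
    C, W = K[:, idx], K[np.ix_(idx, idx)]
    vals, vecs = np.linalg.eigh(W)               # ascending eigenvalues
    B = C @ (vecs[:, -l:] / np.sqrt(vals[-l:]))  # n x l feature matrix
    # Step 2: dimensionality reduction via the top-k right singular vectors.
    _, _, Vt = np.linalg.svd(B, full_matrices=False)
    B_k = B @ Vt[:k].T                           # n x k feature matrix
    # Step 3: off-the-shelf linear k-means on the rows of B_k.
    return KMeans(n_clusters=k, n_init=10,
                  random_state=random_state).fit_predict(B_k)
```

Note that only the $s$ sampled columns of $K$ are ever touched, so for kernel functions evaluated on the fly the cost is linear in $n$.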
5.2 Comparing Nyström, Random Feature Maps, and the Two-Step Method
We empirically compare the clustering performances of kernel approximations formed using the Nyström method, random feature maps (RFM) (Rahimi and Recht, 2007), and the two-step method of Chitta et al. (2011) on the data sets detailed in Table 3.
We use the RBF kernel with width parameter $\sigma$ given by (5); Figure 3 indicates that this is a good choice of $\sigma$ for these data sets. We conduct dimensionality reduction for both Nyström and RFM to obtain $k$-dimensional features, and consider three choices of $k$: two smaller values, and no dimensionality reduction (equivalently, taking $k$ equal to the sketch size $s$).
The quality of the clusterings is quantified using both normalized mutual information (NMI) (Strehl and Ghosh, 2002) and the objective function value:

$f(\mathcal{C}_1, \ldots, \mathcal{C}_k) = \sum_{j=1}^{k} \sum_{i \in \mathcal{C}_j} \Big\| \mathbf{k}_i - \frac{1}{|\mathcal{C}_j|} \sum_{l \in \mathcal{C}_j} \mathbf{k}_l \Big\|_2^2$,   (6)

where $\mathbf{k}_1, \ldots, \mathbf{k}_n$ are the columns of the kernel matrix $K$, and the disjoint sets $\mathcal{C}_1, \ldots, \mathcal{C}_k$ reflect the clustering.
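A direct implementation of this evaluation (our own helper, written under the column-based reading of the objective stated above):

```python
import numpy as np

def kernel_kmeans_objective(K, labels):
    """For each cluster, accumulate the total squared Euclidean distance of
    its kernel-matrix columns to their within-cluster mean column."""
    obj = 0.0
    for c in np.unique(labels):
        cols = K[:, labels == c]                  # columns k_i with i in C_c
        centroid = cols.mean(axis=1, keepdims=True)
        obj += ((cols - centroid) ** 2).sum()
    return obj
```

Lower values indicate tighter clusters; a clustering that places every point in its own cluster attains an objective of exactly zero.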
We repeat the experiments multiple times and report the results in Figures 4 and 5. The experiments show that, as measured by both NMI and objective value, the Nyström method outperforms RFM in most cases. Both the Nyström method and RFM are consistently superior to the two-step method of Chitta et al. (2011), which requires a large sketch size. All the compared methods improve as the sketch size increases.
Judging from these medium-scale experiments, the target rank $k$ has little impact on the NMI and clustering objective value. This phenomenon is not general; in the large-scale experiments of the next section, we see that setting $k$ properly allows one to obtain a better NMI than an over-small or over-large $k$ does.
6 Large-Scale Experiments using Distributed Computing
In this section, we empirically study our approximate kernel $k$-means clustering algorithm on large-scale data. We state a distributed version of the algorithm, implement it in Apache Spark (the implementation is available at https://github.com/wangshusen/SparkKernelKMeans.git), and evaluate its performance on NERSC's Cori supercomputer. We investigate the effects of increased parallelism, sketch size $s$, and target dimension $k$.
Algorithm 3 is a distributed version of our method described in Section 5.1. Again, we use uniform sampling to form $C$ and $W$, and thereby avoid computing most of $K$. We mainly focus on the Nyström approximation step, as the other two steps are well supported by distributed computing systems such as Apache Spark.