Modern applications require machine learning algorithms to be accurate as well as computationally efficient, since data-sets are increasing in size and dimensions. Understanding the interplay and trade-offs between statistical and computational requirements is then crucialRudi et al. (2015); Rudi and Rosasco (2017). In this paper, we consider this question in the context of clustering, considering a popular nonparametric approach, namely kernel k-means Schölkopf et al. (1998). K-means is arguably one of most common approaches to clustering and produces clusters with piece-wise linear boundaries. Its kernel version allows to consider nonlinear boundaries, greatly improving the flexibility of the approach. Its statistical properties have been studied Canas et al. (2012); Maurer and Pontil (2010); Biau et al. (2008) and from a computational point of view it requires manipulating an empirical kernel matrix. As for other kernel methods, this latter operation becomes unfeasible for large scale problems and deriving approximate computations is subject of recent works, see for example Tropp et al. (2017); Chitta et al. (2011); Rahimi and Recht (2007); Wang et al. (2017); Musco and Musco (2017) and reference therein.
In this paper we are interested into quantifying the statistical effect of computational approximations. Arguably one could expect the latter to induce some loss of accuracy. In fact, we prove that, perhaps surprisingly, there are favorable regimes where it is possible maintain optimal statistical accuracy while significantly reducing computational costs. While a similar phenomenon has been recently shown in supervised learningRudi et al. (2015); Rudi and Rosasco (2017); Calandriello et al. (2017a), we are not aware of similar results for other learning tasks.
Our approach is based on considering a Nyström approach to kernel k-means based on sampling a subset of training set points (landmarks) that can be used to approximate the kernel matrix Alaoui and Mahoney (2015); Tropp et al. (2017); Calandriello et al. (2017b, c); Wang et al. (2017); Musco and Musco (2017). While there is a vast literature on the properties of Nyström methods for kernel approximations Musco and Musco (2017); Alaoui and Mahoney (2015), experience from supervised learning show that better results can be derived focusing on the task of interest, see discussion in Bach (2013). The properties of Nyström approximations for k-means has been recently studied in Wang et al. (2017); Musco and Musco (2017). Here they focus only on the computational aspect of the problem, and provide fast methods that achieve an empirical cost only a multiplicative factor larger than the optimal one.
Our analysis is aimed at combining both statistical and computational results. Towards this end we derive a novel additive bound on the empirical cost that can be used to bound the true object of interest: the expected cost. This result can be combined with probabilistic results to show that optimal statistical accuracy can be obtained considering only Nyström landmark points, where is the number of training set of points. Moreover, we show similar bounds not only for the optimal solution, which is hard to compute in general, but also for approximate solutions that can be computed efficiently using -means++. From a computational point of view this leads to massive improvements reducing the memory complexity from to . Experimental results on large scale data-sets confirm and illustrate our findings.
The rest of the paper is organized as follows. We first overview kernel -means, and introduce our approximate kernel -means approach based on Nyström embeddings. We then prove our statistical and computational guarantees and empirically validate them. Finally, we present some limits of our analysis, and open questions.
Given an input space , a sampling distribution , and samples drawn i.i.d. from , we denote with the empirical distribution. Once the data has been sampled, we use the feature map to maps into a Reproducing Kernel Hilbert Space (RKHS) Scholkopf and Smola (2001), that we assume separable, such that for any we have . Intuitively, in the rest of the paper the reader can assume that with or even infinite. Using the kernel trick Aizerman et al. (1964) we also know that , where is the kernel function associated with and is a short-hand for the inner product in . With a slight abuse of notation we will also define the norm , and assume that for any . Using , we denote with the input dataset. We also represent the dataset as the map with as its -th column. We denote with the empirical kernel matrix with entries . Finally, given we denote as the orthogonal projection matrix on the span of the dataset.
Given our dataset, we are interested in partitioning it into disjoint clusters each characterized by its centroid . The Voronoi cell associated with a centroid is defined as the set , or in other words a point belongs to the -th cluster if is its closest centroid. Let be a collection of centroids from . We can now formalize the criterion we use to measure clustering quality.
The empirical and expected squared norm criterion are defined as
The empirical risk minimizer (ERM) is defined as .
The sub-script in indicates that it minimizes for the samples in . Biau et al. (2008) gives us a bound on the excess risk of the empirical risk minimizer.
Proposition 1 (Biau et al. (2008)).
The excess risk of the empirical risk minimizer satisfies
where is the optimal clustering risk.
From a theoretical perspective, this result is only times larger than a corresponding lower bound Graf and Luschgy (2000), and therefore shows that the ERM achieve an excess risk optimal in . From a computational perspective, Definition 1 cannot be directly used to compute , since the points in cannot be directly represented. Nonetheless, due to properties of the squared norm criterion, each must be the mean of all associated with that center, i.e., belongs to . Therefore, it can be explicitly represented as a sum of the points included in the -th cluster, i.e., all the points in the -th Voronoi cell . Let be the space of all possible disjoint partitions . We can use this fact, together with the kernel trick, to reformulate the objective .
Proposition 2 (Dhillon et al. (2004)).
We can rewrite the objective
While the combinatorial search over can now be explicitly computed and optimized using the kernel matrix , it still remains highly inefficient to do so. In particular, simply constructing and storing takes time and space and does not scale to large datasets.
A simple approach to reduce computational cost is to use approximate embeddings, which replace the map and points with a finite-dimensional approximation .
Nyström kernel -means
Given a dataset , we denote with a dictionary (i.e., subset) of points from , and with the map with these points as columns. These points acts as landmarks Williams and Seeger (2001), inducing a smaller space spanned by the dictionary. As we will see in the next section, should be chosen so that is close to the whole span of the dataset.
Let be the empirical kernel matrix between all points in , and denote with
the orthogonal projection on . Then we can define an approximate ERM over as
since any component outside of is just a constant in the minimization. Note that the centroids are still points in , and we cannot directly compute them. Instead, we can use the eigen-decomposition of to rewrite . Defining now we have a finite-rank embedding into . Substituting in Eq. 2
where are the embedded points. Replacing and searching over instead of searching over , we obtain (similarly to Proposition 2)
where we do not need to resort to kernel tricks, but can use the -dimensional embeddings to explicitly compute the centroid . Eq. 3 can now be solved in multiple ways. The most straightforward is to run a parametric -means algorithm to compute , and then invert the relationship to bring back the solution to , i.e., . This can be done in in space and time using steps of Lloyd’s algorithm Lloyd (1982) for clusters. More in detail, computing the embeddings is a one-off cost taking time. Once the -rank Nyström embeddings are computed they can be stored and manipulated in time and space, with an improvement over the time and space required to construct .
3.1 Uniform sampling for dictionary construction
Due to its derivation, the computational cost of Algorithm 1 depends on the size of the dictionary . Therefore, for computational reasons we would prefer to select as small a dictionary as possible. As a conflicting goal, we also wish to optimize well, which requires a and rich enough to approximate well. Let be the projection orthogonal to . Then when
We will now introduce the concept of a -preserving dictionary to control the quantity .
We define the subspace and dictionary as -preserving w.r.t. space if
Notice that the inverse on the right-hand side of the inequality is crucial to control the error . In particular, since , we have that in the worst case the error is bounded as . Conversely, since we know that in the best case the error can be reduced up to . Note that the directions associated with the larger eigenvalues are the ones that occur most frequently in the data. As a consequence, Definition 2 guarantees that the overall error across the whole dataset remains small. In particular, we can control the residual after the projection as .
To construct -preserving dictionaries we focus on a uniform random sampling approachBach (2013). Uniform sampling is historically the first Williams and Seeger (2001), and usually the simplest approach used to construct . Leveraging results from the literature Bach (2013); Calandriello et al. (2017c); Musco and Musco (2017) we can show that uniformly sampling landmarks generates a
-preserving dictionary with high probability.
For a given , construct by uniformly sampling landmarks from . Then w.p. at least the dictionary is -preserving.
Musco and Musco (2017) obtains a similar result, but instead of considering the operator they focus on the finite-dimensional eigenvectors of . Moreover, their bound is weaker and would not be sufficient to satisfy our definition of -accuracy. A result equivalent to Lemma 1 was obtained by Alaoui and Mahoney (2015), but they also only focus on the finite-dimensional eigenvectors of , and did not investigate the implications for .
Proof sketch of Lemma 1.
We can now rewrite as
In other words, using uniform sampling we can reduce the size of the search space by a factor (from to ) in exchange for a additive error, resulting in a computation/approximation trade-off that is linear in .
4 Theoretical analysis
Exploiting the error bound for -preserving dictionaries we are now ready for the main result of this paper: showing that we can improve the computational aspect of kernel -means using Nyström embedding, while maintaining optimal generalization guarantees.
Given a -preserving dictionary
From a statistical point of view, Theorem 1 shows that if is -preserving, the ERM in achieves the same excess risk as the exact ERM from up to an additional error. Therefore, choosing the solution achieves the generalization Biau et al. (2008). Our results can also be applied to traditional -means in Euclidean space, i.e. the special case of and a linear kernel. In this case Theorem 1 shows that regardless of the size of the input space , it is possible to embed all points in and preserve optimal statistical rates. In other words, given samples we can always construct features that are sufficient to well preserve the objective regardless of the original number of features . In the case when , this can lead to substantial computational improvements.
From a computational point of view, Lemma 1 shows that we can construct an -preserving dictionary simply by sampling points uniformly111 hides logarithmic dependencies on and ., which greatly reduces the embedding size from to , and the total required space from to .
Time-wise, the bottleneck becomes the construction of the embeddings , which takes time, while each iterations of Lloyd’s algorithm only requires time. In the full generality of our setting this is practically optimal, since computing a -preserving dictionary is in general as hard as matrix multiplication Musco and Woodruff (2017); Backurs et al. (2017), which requires time. In other words, unlike the case of space complexity, there is no free lunch for time complexity, that in the worst case must scale as similarly to the exact case. Nonetheless embedding the points is an embarrassingly parallel problem that can be easily distributed, while in practice it is usually the execution of the Lloyd’s algorithm that dominates the runtime.
Finally, when the dataset satisfies certain regularity conditions, the size of can be improved, which reduces both embedding and clustering runtime. Denote with the so-called effective dimension Alaoui and Mahoney (2015) of . Since , we have that , and therefore can be seen as a soft version of the rank. When it is possible to construct a -preserving dictionary with only landmarks in time using specialized algorithms Calandriello et al. (2017c) (see Section 6). In this case, the embedding step would require only , improving both time and space complexity.
Morever, to the best of our knowledge, this is the first example of an unsupervised non-parametric problem where it is always (i.e., without assumptions on ) possible to preserve the optimal risk rate while reducing the search from the whole space to a smaller subspace.
Proof sketch of Theorem 1.
We can separate the distance between in a component that depends on how close is to , bounded using Proposition 1, and a component that depends on the distance between and
Given a -preserving dictionary
To show this we can rewrite the objective as (see Dhillon et al. (2004))
where is a -rank projection matrix associated with the exact clustering . Then using Definition 2 we have and we obtain an additive error bound
Since , is a projection matrix, and we have
Conversely, if we focus on the matrix we have
Since both bounds hold simultaneously, we can simply take the minimum to conclude our proof. ∎
We now compare the theorem with previous work. Many approximate kernel -means methods have been proposed over the years, and can be roughly split in two groups.
Low-rank decomposition based methods try to directly simplify the optimization problem from Proposition 2, replacing the kernel matrix with an approximate that can be stored and manipulated more efficiently. Among these methods we can mention partial decompositions Bach and Jordan (2005), Nyström approximations based on uniform Williams and Seeger (2001), -means++ Oglic and Gärtner (2017), or ridge leverage score (RLS) samplingWang et al. (2017); Musco and Musco (2017); Calandriello et al. (2017c), and random-feature approximations Avron et al. (2017). None of these optimization based methods focus on the underlying excess risk problem, and their analysis cannot be easily integrated in existing results, as the approximate minimum found has no clear interpretation as a statistical ERM.
Other works take the same embedding approach that we do, and directly replace the exact with an approximate , such as Nyström embeddings Williams and Seeger (2001), Gaussian projections Biau et al. (2008), and again random-feature approximations Rahimi and Recht (2007). Note that these approaches also result in approximate that can be manipulated efficiently, but are simpler to analyze theoretically. Unfortunately, no existing embedding based methods can guarantee at the same time optimal excess risk rates and a reduction in the size of , and therefore a reduction in computational cost.
To the best of our knowledge, the only other result providing excess risk guarantee for approximate kernel -means is Biau et al. (2008), where the authors consider the excess risk of the ERM when the approximate is obtained using Gaussian projections. Biau et al. (2008) notes that the feature map can be expressed using an expansion of basis functions , with very large or infinite. Given a matrix where each entry is a standard Gaussian r.v., Biau et al. (2008) proposes the following -dimensional approximate feature map . Using Johnson-Lindenstrauss (JL) lemma Johnson and Lindenstrauss (1984), they show that if then a multiplicative error bound of the form holds. Reformulating their bound, we obtain that and .
Note that to obtain a bound comparable to Theorem 1, and if we treat as a constant, we need to take which results in . This is always worse than our result for uniform Nyström embedding. In particular, in the risk rate setting Gaussian projections would require resulting in random features, which would not bring any improvement over computing . Moreover when is infinite, as it is usually the case in the non-parametric setting, the JL projection is not explicitly computable in general and Biau et al. (2008) must assume the existence of a computational oracle capable of constructing . Finally note that, under the hood, traditional embedding methods such as those based on JL lemma, usually provide only bounds of the form , and an error (see the discussion of Definition 2). Therefore the error can be larger along multiple directions, and the overall error across the dictionary can be as large as rather than .
Recent work in RLS sampling has also focused on bounding the distance between empirical errors. Wang et al. (2017) and Musco and Musco (2017) provide multiplicative error bounds of the form for uniform and RLS sampling. Nonetheless, they only focus on empirical risk and do not investigate the interaction between approximation and generalization, i.e., statistics and computations. Moreover, as we already remarked for Biau et al. (2008), to achieve the excess risk rate using a multiplicative error bound we would require an unreasonably small , resulting in a large that brings no computational improvement over the exact solution.
Finally, note that when Rudi et al. (2015)
showed that a favourable trade-off was possible for kernel ridge regression (KRR), they strongly leveraged the fact that KRR is a-regularized problem. Therefore, all eigenvalues and eigenvectors in the covariance matrix smaller than the regularization do not influence significantly the solution. Here we show the same for kernel -means, a problem without regularization. This hints at a deeper geometric motivation which might be at the root of both problems, and potentially similar approaches could be leveraged in other domains.
4.1 Further results: beyond ERM
So far we provided guarantees for , that this the ERM in . Although is much smaller than , solving the optimization problem to find the ERM is still NP-Hard in general Aloise et al. (2009). Nonetheless, Lloyd’s algorithm Lloyd (1982), when coupled with a careful -means++ seeding, can return a good approximate solution .
Proposition 3 (Arthur and Vassilvitskii (2007)).
For any dataset
where is the randomness deriving from the -means++ initialization.
Note that, similarly to Wang et al. (2017); Musco and Musco (2017), this is a multiplicative error bound on the empirical risk, and as we discussed we cannot leverage Lemma 2 to bound the excess risk . Nonetheless we can still leverage Lemma 2 to bound only the expected risk , albeit with an extra error term appearing that scales with the optimal clustering risk (see Proposition 1).
Given a -preserving dictionary
From a statistical perspective, we can once again, set to obtain a rate for the first part of the bound. Conversely, the optimal clustering risk is a -dependent quantity that cannot in general be bounded in , and captures how well our model, i.e., the choice of and how well the criterion , matches reality.
From a computational perspective, we can now bound the computational cost of finding . In particular, each iteration of Lloyd’s algorithm will take only time. Moreover, when -means++ initialization is used, the expected number of iterations required for Lloyd’s algorithm to converge is only logarithmic Ailon et al. (2009). Therefore, ignoring the time required to embed the points, we can find a solution in time and space instead of the cost required by the exact method, with a strong improvement.
Finally, if the data distribution satisfies some regularity assumption the following result follows Canas et al. (2012).
If we denote by the support of the distribution and assume to be a -dimensional manifold, then , and given a -preserving dictionary the expected cost satisfies
We now evaluate experimentally the claims of Theorem 1, namely that sampling increases the excess risk by an extra factor, and that is sufficient to recover the optimal rate. We use the Nystroem and MiniBatchKmeans classes from the sklearn python library Pedregosa et al. (2011)to implement kernel -means with Nyström embedding (Algorithm 1) and we compute the solution .
For our experiments we follow the same approach as Wang et al. (2017), and test our algorithm on two variants of the MNIST digit dataset. In particular, mnist60k LeCun and Cortes (2010) is the original MNIST dataset containing pictures each with pixels. We divide each pixel by 255, bringing each feature in a interval. We split the dataset in two part, samples are used to compute the centroids, and we leave out unseen samples to compute , as a proxy for . To test the scalability of our approach we also consider the mnist8m dataset from the infinite MNIST project Loosli et al. (2007), constructed using non-trivial transformations and corruptions of the original mnist60k images. Here we compute using images, and compute on unseen images. As in Wang et al. (2017) we use Gaussian kernel with bandwidth .
mnist60k: these experiments are small enough to run in less than a minute on a single laptop with 4 cores and 8GB of RAM. The results are reported in Fig. 1. On the left we report in blue , where the shaded region is a confidence interval for the mean over 10 runs. As predicted, the expected cost decreases as the size of increases, and plateaus once we achieve , in line with the statistical error. Note that the normalized mutual information (NMI) between the true digit classes and the computed cluster assignments also plateaus around . While this is not predicted by the theory, it strengthens the intuition that beyond a certain capacity expanding is computationally wasteful.
mnist8m: to test the scalability of our approach, we run the same experiment on millions of points. Note that we carry out our mnist8m experiment on a single 36 core machine with 128GB of RAM, much less than the setup of Wang et al. (2017), where at minimum a cluster of 8 such nodes are used. The behaviour of and NMI are similar to mnist60k, with the increase in dataset size allowing for stronger concentration and smaller confidence intervals. Finalle, note that around uniformly sampled landmarks are sufficient to achieve , matching the NMI reported by Wang et al. (2017) for a larger , although smaller than the NMI they report for when using a slower, PCA based method to compute the embeddings, and RLS sampling to select the landmarks. Nonetheless, computing takes less than 6 minutes on a single machine, while their best solution required more than hr on a cluster of 32 machines.
6 Open questions and conclusions
Combining Lemma 1 and Lemma 2, we know that using uniform sampling we can linearly trade-off a decrease in sub-space size with a increase in excess risk. While this is sufficient to maintain the rate, it is easy to see that the same would not hold for a rate, since we would need to uniformly sample landmarks losing all computational improvements.
To achieve a better trade-off we must go beyond uniform sampling and use different probabilities for each sample, to capture their uniqueness and contribution to the approximation error.
Definition 3 (Alaoui and Mahoney (2015)).
The -ridge leverage score (RLS) of point is defined as
The sum of the RLSs
is the empirical effective dimension of the dataset.
Ridge leverage scores are closely connected to the residual after the projection discussed in Definition 2. In particular, using Lemma 2 we have that the residual can be bounded as . It is easy to see that, up to a factor , high-RLS points are also high-residual points. Therefore it is not surprising that sampling according to RLSs quickly selects any high-residual points and covers , generating a -preserving dictionary.
Calandriello (2017) For a given , construct by sampling landmarks from proportionally to their RLS. Then w.p. at least the dictionary is -preserving.
Note there exist datasets where the RLSs are uniform,and therefore in the worst case the two sampling approaches coincide. Nonetheless, when the data is more structured can be much smaller than the dictionary size required by uniform sampling.
Finally, note that computing RLSs exactly also requires constructing and time and space, but in recent years a number of fast approximate RLSs sampling methods Calandriello et al. (2017c) have emerged that can construct -preserving dictionaries of size in just time. Using this result, it is trivial to sharpen the computational aspects of Theorem 1 in special cases.
In particular, we can generate a -preserving dictionary with only elements instead of the required by uniform sampling. Using concentration arguments Rudi et al. (2015) we also know that w.h.p. the empirical effective dimension is at most three times