MiniBatchSpectralClustering
None
view repo
The cost of computing the spectrum of Laplacian matrices hinders the application of spectral clustering to large data sets. While approximations recover computational tractability, they can potentially affect clustering performance. This paper proposes a practical approach to learn spectral clustering based on adaptive stochastic gradient optimization. Crucially, the proposed approach recovers the exact spectrum of Laplacian matrices in the limit of the iterations, and the cost of each iteration is linear in the number of samples. Extensive experimental validation on data sets with up to half a million samples demonstrate its scalability and its ability to outperform state-of-the-art approximate methods to learn spectral clustering for a given computational budget.
READ FULL TEXT VIEW PDF
Spectral clustering techniques are valuable tools in signal processing a...
read it
The eigendeomposition of nearest-neighbor (NN) graph Laplacian matrices ...
read it
Clustering of data sets is a standard problem in many areas of science a...
read it
Clustering via graph-Laplacian spectral imbedding is ubiquitous in data
...
read it
Robust automation of analysis procedures capable of handling diverse dat...
read it
Spectral clustering uses a graph Laplacian spectral embedding to enhance...
read it
Constrained clustering has been well-studied for algorithms such as K-me...
read it
None
Over the past two decades, spectral clustering has established itself as one of the most prominent clustering methods (Shi and Malik, 2000; Ng et al., 2002). The effectiveness of spectral clustering in identifying complex structures in data is a direct consequence of its close connection with kernel machines (Dhillon et al., 2004). Because of this connection, however, it is also apparent that spectral clustering inherits the scalability issues of kernel machines. In spectral clustering, the computational challenge is to determine the spectral properties of the so called Laplacian matrix (von Luxburg, 2007). Denoting by the number of samples, storing the Laplacian matrix requires space while calculating the spectrum of the Laplacian requires computations.
Several approaches have been proposed to reduce the complexity of spectral clustering, such as employing power methods to identify the principal eigenvectors of the Laplacian
(Boutsidis et al., 2015). While this approach is exact in the limit of iterations and does not require storing the Laplacian, the complexity is dominated by the iterative multiplication of the Laplacian matrix by vectors, leading to
computations. In order to further reduce this complexity to , a number of approximations are proposed in the literature. A popular technique based on the Nyström approximation relies on a small set of inducing points to approximate the spectrum of the Laplacian matrix (Fowlkes et al., 2004). Other recent approximations which attempt to compress the dataset appear in Yan et al. (2009) and Li et al. (2016). These approximations recover tractability and make it possible to apply spectral clustering to large data sets. However, approximations can generally affect the quality of the clustering solution, as we illustrate in the experiments.This paper proposes a novel iterative way to solve spectral clustering in , while retaining exactness in the calculation of the spectrum of the Laplacian in the limit of the iterations. Denoting by the Laplacian matrix, the idea hinges on the possibility to cast the algebraic problem of identifying its principal eigenvectors as the following trace optimization problem
(1) |
We propose to solve the constrained optimization problem by means of stochastic gradient optimization. In view of the orthonormality constraint the elements of lie on the so called Stiefel manifold, and appealing to theoretical guarantees of convergence of stochastic gradient optimization on manifolds, we can prove that our proposal computes the exact spectrum of in the limit of the iterations (Bonnabel, 2013). In order to simplify the tuning of the optimization procedure, we adapt Adagrad (Duchi et al., 2011) for stochastic gradient optimization on the Stiefel manifold. The novelty of our proposal stems from the use of stochastic linear algebra techniques to compute stochastic gradients in . This leads to computations of stochastic gradient that require processing of a limited number of columns of the full Laplacian matrix, motivating us to name our proposal Mini-Batch Spectral Clustering.
The results on a variety of clustering problems with up to give credence to the value of our proposal. We can tackle large scale spectral clustering problems achieving the same level of accuracy of the approach that uses the exact spectrum of at a fraction of the computing time. We also compare against approximate spectral clustering methods and show that approximations lead to faster solutions that are suboptimal compared to what we can achieve with the proposed method, especially for large data sets^{1}^{1}1Code to reproduce all results in the paper is available upon request..
(i) We formulate the solution of spectral clustering as a constrained optimization problem that we solve using adaptive stochastic gradient optimization; (ii) We present a novel way of computing stochastic gradients linearly in the number of data that does not require storing the Laplacian matrix; (iii) We analyze the variance of the proposed estimator of the exact gradient to explain the impact of algorithm parameters. (iv) We demonstrate that our proposal allows us to tackle large-scale spectral clustering problems by reporting results on data sets of size up to
. Crucially, we can achieve clustering solutions of similar accuracy and orders of magnitude faster compared to the approach that computes the exact spectrum of , and higher accuracy compared to approximate methods at a comparable cost.Define to be a set of samples. The formulation of spectral clustering introduces an undirected graph based on , where the nodes of represent the input data in , and the edges are weighted according to a similarity measure between the inputs. The graph is expressed by an adjacency matrix , where each entry determines the weight associated with the edge connecting inputs and
. Typically, the elements of the adjacency matrix are defined through off-the-shelf kernel functions, e.g., the Radial Basis Function (RBF) kernel
Schölkopf and Smola (2001).Spectral clustering attempts to cluster the elements of by analyzing the spectral properties of the graph . In particular, the objective of spectral clustering is to partition the graph so as to minimize some graph cut criterion, e.g., the normalized cut (Shi and Malik, 2000). The graph cut problem is generally NP-hard, but its relaxation leads to the definition of the clustering problem as the solution of an algebraic problem (Chung, 1997). In particular, following the spectral clustering algorithm proposed by Ng et al. (2002), the graph Laplacian is defined as a normalized version of the adjacency matrix
(2) |
where is the diagonal matrix of the degrees of the nodes. Spectral clustering represents each data point using the corresponding component of the top eigenvectors of the Laplacian , and computes the solution to the clustering problem by applying -means in this representation. The difficulty in solving the graph cut problem then becomes calculating the spectrum ; this requires computations and space, making it prohibitive – if not unfeasible – for large data sets. The aim of this paper is to address this scaling issue of spectral clustering without sacrificing the accuracy of the solution, as explained next.
The first step to reduce the complexity in finding the top eigenvectors of , is to cast this algebraic operation as solving the constrained optimization problem in Eq. 1. This is an optimization problem involving parameters representing the components of the top eigenvectors of . The objective function rewards maximization of a score that, at convergence, is the sum of the
largest eigenvalues. The constraint
gives rise to the so called Stiefel manifold in ; this imposes orthonormality on the columns of which, at convergence, represent the eigenvectors associated with the largest eigenvalues.The constrained optimization problem in Eq. 1 can be solved by formulating standard optimization algorithms to deal with the Riemannian metric on the manifold (Smith, 2014), which typically rely on search operations along geodesics. Alternative schemes, such as the Cayley transform, have been proposed to tackle optimization problems on Riemannian manifolds (Wen and Yin, 2013). All these optimization schemes require calculating the gradient of the objective function
(3) |
This costs and does not require storing anymore, which is an improvement with respect to computing the full spectrum of . However, while casting spectral clustering as a constrained optimization problem improves scalability, when is large it may still take a prohibitive amount of time to be practical. In the next section we present our proposal to reduce the complexity to solve the constrained optimization problem to with the guarantee of obtaining the exact solution of the in the limit of the iterations.
The intuition behind our proposal is that it is possible to solve the constrained optimization problem in Eq. 1 relying exclusively on stochastic gradients Bonnabel (2013). By inspecting the expression of the exact gradient in Eq. 3
, we show how stochastic linear algebra techniques can be employed to unbiasedly estimate exact gradients (
cost) with stochastic gradients at cost.Stochastic gradient optimization on manifolds is based on the notion of Riemannian gradients, which are elements of the tangent space at a given that determine the direction of steepest increase of the objective function on the manifold. For the Stiefel manifold, the Riemannian gradient is (Bonnabel, 2013)
(4) |
In the case where an unbiased version of the gradient is available, namely , an intuitively sensible strategy is to perform stochastic gradient optimization on the manifold. Formally, this would amount in moving along geodesics for a given length based on the noisy version of the gradient. Defining as the equivalent of the usual step-sizes in stochastic gradient optimization, and as the exponential map at , the update equation would then be , where is the unbiased Riemannian gradient of the objective function. This approach can be shown to converge to the exact solution of the constrained optimization problem in Eq. 1 Bonnabel (2013). However, computing the exponential map to simulate the trajectory of the solver on the manifold requires solving potentially expensive differential equations.
An alternative that avoids computing the exponential map altogether is to replace this calculation with an approximation that is much easier to calculate. If the so called retraction satisfies the property that , where is an alement of the tangent space, then it is still possible to prove convergence to the exact solution Bonnabel (2013). A simple and computationally convenient retraction function that satisfies this property is
(5) |
where extracts the orthonormal factor of a decomposition. This simple retraction moves the optimization in the direction of the stochastic Riemannian gradient and applies an orhtonormalization step to ensure that the update is projected back onto the manifold. Under this choice of retraction, we can appeal to the theoretical results in Bonnabel (2013) that ensure convergence to the exact solution in the limit of the iterations similarly to standard stochastic gradient optimization. An illustration of the retraction scheme is provided in Fig.1.
The introduction of stochasticity in the calculation of follows on from ideas that have been proposed to calculate unbiased stochastic approximations to algebraic quantities, such as traces and log-determinants (Chen et al., 2011; Filippone and Engler, 2015). In particular, we define a vector such that and we rewrite the expensive matrix product as
(6) |
which suggests that we can replace the exact calculation of with the estimator
(7) |
The key to making computations linear in is to define the components of the random vectors as drawn from the set
with probabilities
respectively. It is straightforward to verify that , and can be chosen to enforce any proportion of zeros in the vectors. With this mechanism to inject stochasticity in the calculation of the gradients, we are effectively ignoring some columns of the matrix whenever there is a zero in the corresponding positions of the vectors. This makes it possible to update the parameters during the solution of the constrained optimization problem in Eq. 1, by only selecting a few columns of the full Laplacian. If the average number of non-zero elements is chosen to be independent of , the calculation of the stochastic gradient is , making the proposed iterative solver linear in the number of samples. The memory footprint of the algorithm is a distinctive feature of our proposal; if the degree matrix is precomputed, calculating the columns of requires to compute stochastic gradients requires evaluating and normalizing only elements of the adjacency matrix .Given that only a subset of the columns in are used at each iteration to calculate stochastic gradients, we term our proposal Mini-Batch Spectral Clustering (MBSC). Instead of defining a probability to select columns and to repeat this times, we can fix the number and indices of columns that are selected at each iteration (size of the mini-batch) to be and interpret . While this is intuitively sensible, fixing the indices of the mini-batches would violate the property that . One easy way around this issue is to constantly change the way data are split into mini-batches, e.g., by shuffling the data, and this would recover the property . Even though it is not the focus of the current paper, we envisage the possibility to develop a distributed version of the proposed MBSC algorithm based on our formulation. The proposed MBSC algorithm is sketched in Algorithms 1 and 2, where, for the sake of clarity, is assumed to be stored. For memory constrained systems where storing the whole Laplacian matrix is unfeasible, our proposal can easily be adapted to avoid storing it, and we report on the performance of this variant in the experiments.
Here we are interested in quantifying the impact of the choice of and in the calculation of stochastic gradients; to this end, we analyze the variance of the proposed estimator of the exact gradients. Without loss of generality, we focus on the variance of a given column of the stochastic gradient , namely the one associated with a given eigenvector, say . Recall that the exact gradient with respect to would be and assume that we use a single vector to unbiasedly estimate this as . The covariance of is
After some manipulations, that we leave to the supplementary material, we obtain
Given that , then when has positive elements. Also, since has positive diagonal entries, we can further bound
(8) |
The first term contains the square of the components of the exact gradient that will vanish at convergence. The second term depends on the choice of . We can rewrite the bound in terms of the mini-batch size , for which , and when vectors are used to calculate stochastic gradients as in Eq. 7.
(9) |
This reveals that we have two ways of reducing the variance of stochastic gradients; one is to increase and another is to increase . Imagine that we fix the mini-batch size ; is it better to increase and reduce , or the other way around? For the second term it does not matter. For the first instead, given that it depends only on , it is clear that we should favor increasing and reducing . This entails that we should consider averaging stochastic gradients over several subsets of a mini-batch instead of a few large ones. This result is interesting because in other popular mini-batch approaches increasing the mini-batch size or the number of repetitions is equivalent. In our proposed MBSC, because of the nonlinearity of the estimator with respect to the vectors , the bound on the variance shows an unintuitive asymmetry between and . A further consideration we can make is that this suggestion is most useful during the first phase of the optimization; towards convergence, the first term will be small and dominated by the second term that is inversely proportional to the mini-batch size .
Throughout the experiments, we make use of the Radial Basis Function (RBF) adjacency function:
where we set when . The RBF adjacency function assigns a large weight to the edge connection inputs that are close in the input space, whereas it assigns a small weight to edges connecting inputs that are far apart. The parameter determines the rate of decay of the adjacency function, which can be tuned, e.g., using local statistics on the distances between pairs of points Zelnik-Manor and Perona (2004).
We assess clustering performance using the normalized mutual information (NMI) score between the cluster labels and the ground truth class labels. To reliably measure computational cost of all methods involved in the comparison, we count the amount of floating-point addition and multiplication operations they require, given the affinity matrix, and we also report running time statistics. We implemented all the algorithms in Python using the
numpy and scikit-learn packages. All our experiments are conducted under Ubuntu Linux with -core CPU and memory.Table 1 summaries the statistics of the data sets, taken from LibSVM (Chang and Lin, 2011), that we consider in the experiments. To construct the Covtype-I set, we randomly sample samples from the first two classes of the original Covtype data set and merge the data samples of classes , and into one single class. The purpose is to avoid severe imbalances between classes. We make use of all data samples of the original Covtype data set to build the Covtype-II set.
We organize the experimental study in two parts. In both parts, we aim to show the performance of the proposed MBSC algorithm against other state-of-the-art algorithms for solving spectral clustering. In particular, we compare our proposal with the solution of the power iteration-based algorithm in Boutsidis et al. (2015) and the Nyström approximation-based spectral clustering in Fowlkes et al. (2004). In the first part, this comparison is carried out on moderately large data sets comprising , and samples. For these, we can also report the performance of the spectral clustering approach in Shi and Malik (2000) where the spectrum of the normalized Laplacian is computed exactly (denoted as “Exact” in the plots). In the second part, we repeat the same comparison on two larger sized data sets, comprising and samples, where we cannot compute the exact spectrum of .
Data set | of samples | of features | of classes | |
---|---|---|---|---|
Pendigits | 10992 | 16 | 10 | 223.61 |
Shuttle | 58000 | 9 | 7 | 0.45 |
MNIST | 60000 | 780 | 10 | 4.08 |
Covtype-I | 100000 | 54 | 5 | 1.15 |
Covtype-II | 581012 | 54 | 7 | 1.15 |
We conduct a comparative evaluation of the proposed MBSC with state-of-the-art spectral clustering algorithms on the Pendigits, Shuttle and MNIST data sets (Chang and Lin, 2011)
. In the proposed MBSC approach, we experiment with different choices of the mini-batch size to assess its impact on performance. To comprehensively analyze the performance of the proposed method, the iterative stochastic gradient descent on the manifold runs until we make one full pass through the whole data set. In practice, the stochastic gradient descent process can be stopped either if the difference between the spectral embedding
derived at two successive iteration steps is less than a threshold, or if a fixed number of iteration is achieved. For the Nyström approximation, we report a few choices on the number of samples selected to construct the approximate eigenvectors, namely , , , and .Figure 2
illustrates the NMI scores of the proposed MBSC algorithm, the power method, and the Nyström approximation versus the amount of floating-point operations. In the figure, for the proposed MSBC and the Nyström approximation we report the average plus and minus one standard deviation of the NMI score over
repetitions.Recalling that is the mini-batch size, is the number of top eigenvectors and the number of clusters, each iteration of MBSC requires floating point operations. The power method starts by generating an -by-
Gaussian random matrix
, which costs operations. It then computes a matrix product between the -by- affinity matrix and , and iteratively applies the same multiplication for a total cost offloating point operations. The final step of the power method performs Singular Value Decomposition (SVD) on an
-by- matrix. Since , we adopt the estimate of SVD complexity in (Golub and Van Loan, 1996), which costs operations. Finally, in order to calculate the number of floating point operations for the Nyström approximation we follow the pseudo code in (Fowlkes et al., 2004), for which the total count of floating point operations is .In Figure 2, we can observe how the variance of the clustering performance of MSBC diminishes throughout iterations. Larger mini-batches lead to faster variance reduction of the stochastic gradient, thus producing faster convergence to the solution. Compared with the power method, the proposed MBSC needs distinctively less computations, while achieving higher or similar clustering accuracy. Another interesting observation is that the proposed MBSC achieves stable clustering accuracy before it makes a full pass through the whole data set. The Nyström approximation is computationally fast on all three data sets. Nevertheless, its time and space complexity increase drastically if the number of inducing points increases. On the Pendigits and Shuttle data sets, with less running time, the proposed MBSC method requires smaller mini batches to conduct clustering and achieves better clustering performance than the Nyström approximation. On the MNIST data set, instead, the Nyström approximation produces strikingly good clustering results when the number of selected samples is larger than . However, the proposed MBSC method converges to the clustering accuracy of the exact spectral clustering even when the size of the mini-batch size is small, e.g., less than .
Covtype-I | Covtype-II | ||||||
Algorithm | NMI | time (s) | Algorithm | NMI | time (s) | ||
MBSC-E-400 | 200 | 0.40 | 998 | MBSC-E-1000 | 200 | 0.14 | 7500 |
MBSC-E-800 | 200 | 0.40 | 2100 | MBSC-E-2000 | 200 | 0.14 | 12610 |
Nystrom-400 | - | 0.38 | 78 | Nystrom-1000 | - | 0.09 | 2520 |
Nystrom-800 | - | 0.38 | 965 | Nystrom-2000 | - | 0.11 | 11420 |
Power method | 3 | 0.40 | 9300 | Power Method | Too expensive |
To demostrate that the proposed approach can tackle large-scale spectral clustering problems, we implemented the proposed MBSC algorithm without storing the Laplacian matrix, and applied it on two data sets comprising and samples. This variant of the code, that we denote as MBSC-E, computes the necessary columns of to construct stochastic gradients on-the-fly.
Table 2 reports the overall running time and NMI of MBSC-E on the two data sets, where MBSC-E produces stable clustering results after iterations. In addition, we compare against the Nyström approximation and the power method. In the comparison, the Nyström approximation selects a subset of the same size of the mini-batch in MBSC-E. On the Covtype-II data set, the power method fails to obtain clustering results within an acceptable time, and we omit it from Table 2.
While running time is heavily dependent on implementation and system architecture, we argue that this is probably in favor of the Nyström approximation, for which we are using well optimized scientific computing packages. In any case, the purpose of this experiment is to demonstrate that the proposed MBSC algorithm, being exact in the limit of iterations, can achieve higher performance than approximate methods on large scale spectral clustering problems. Crucially, we demonstrate that this is possible at a comparable computational cost with approximate methods.
On both Covtype-I and Covtype-II data sets, MBSC achieves consistently better clustering accuracy than the Nyström approximation. Because the computational cost of the Nyström approximation rapidly increases with the size of the approximating set, it requires longer than MBSC-E to achieve a comparable clustering accuracy. Furthermore, compared with the power method, MBSC-E shows superior computational efficiency for large-scale spectral clustering problems. Remarkably, MBSC-E requires less than and memory to run on the two data sets, respectively.
With the aim of improving scalability of spectral clustering, in this work we formulated normalized cut spectral clustering as an optimization problem with an orthonormality constraint that we could solve using stochastic gradient optimization. We proposed a novel adaptive stochastic gradient optimization framework on Stiefel manifolds to compute the spectrum of Laplacian matrices, with computation of stochastic gradients linear in the number of samples. We provided theoretical justifications and empirical analyses to demonstrate how our proposal can tackle large-scale spectral clustering problems in a practical way.
The proposed stochastic optimization is characterized by attractive robustness to parameter selection and scalability properties, leading to the same clustering accuracy of spectral clustering approaches that use the exact spectrum of the Laplacian at a fraction of the cost. The results also support the motivation behind our proposal that approximate methods can potentially affect clustering performance. In cases where approximate methods perform well, as we reported in one of the experiments, we can see our proposal as a practical way to obtain the gold-standard of the “exact” approach at a reasonable cost. Furthermore, the proposed method does not need to load the whole Laplacian matrix into memory, making it especially suitable for handling large-scale spectral clustering with a limited memory footprint.
There are a number of extensions that we are currently investigating, such as the possibility to combine our framework with other approximate methods, for example to be able to afford more inducing points when performing spectral clustering using the Nyström approximation. Another extension is to leverage approximations to reduce the variance of the stochastic gradients without introducing any bias to accelerate stochastic gradient optimization. A Spark/TensorFlow implementation of the proposed algorithm is under development.
The Authors would like to thank Yun Shen from Symantec Research Labs for his support in setting up the experiments and optimizing the implementation of the algorithms.
On Spectral Clustering: Analysis and an algorithm.
In Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002.Kernel k-means: spectral clustering and normalized cuts.
In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 551–556, New York, NY, USA, 2004.Proceedings of the 32nd International Conference on Machine Learning
, pages 40–48, 2015.Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA.
, pages 1809–1815, 2016.Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond
. MIT Press, Cambridge, MA, USA, 2001.Recall that the covariance of the stochastic gradient is:
By expanding the first term and realizing that the second term in the right hand side is just the outer product of the th column of with itself, we obtain that the variance of the components of the stochastic gradient is:
The focus of this supplement is to derive an expression for the expectation
needed to calculate the variance of the stochastic gradients. The element of the expectation is
(10) | |||||
(11) |
We need to consider all possible cases for the indices going from to .
- there are of these cases - all variables are independent and the expectation factorizes into the product of the expectations, which is zero.
two pairs are equal - there are of these cases. It is useful to realize the following expectation:
Within this set of cases we distinguish three cases, and there are for each of these:
- these are useful for the calculation of the diagonal of the expectation and they give
- these give
- these give
So for the off-diagonal elements of the expectation we have the sum of the last two cases above, giving .
two out of four are equal and the other two are different - there are of these cases and they give a zero.
three out of four are equal - there are of these combinations - these cases give a zero.
- there are of these combinations - when the indices are all the same, we have nonzero in the two cases of being or , giving
We are now ready to compute the expectation above. We distinguish two cases:
- the off-diagonal of the expectation which is:
- the diagonal of the expectation which is:
Because of these expressions, the expectation can be rewritten in matrix form as:
We realize that
because of the orthonormality constraint, so after substituting this expression and gathering terms we obtain
Plugging this into the expression of the variance of the components of the stochastic gradient we obtain
This can be simplifed into
and finally into
Comments
There are no comments yet.