Data Clustering and Graph Partitioning via Simulated Mixing

Spectral clustering approaches have led to well-accepted algorithms for finding accurate clusters in a given dataset. However, their application to large-scale datasets has been hindered by the computational complexity of eigenvalue decompositions. Several algorithms have been proposed in the recent past to accelerate spectral clustering; however, they compromise on the accuracy of spectral clustering to achieve faster speed. In this paper, we propose a novel spectral clustering algorithm based on a mixing process on a graph. Unlike existing spectral clustering algorithms, our algorithm does not require computing eigenvectors. Specifically, it finds the equivalent of a linear combination of eigenvectors of the normalized similarity matrix weighted with corresponding eigenvalues. This linear combination is then used to partition the dataset into meaningful clusters. Simulations on real datasets show that partitioning datasets based on such linear combinations of eigenvectors achieves better accuracy than standard spectral clustering methods as the number of clusters increases. Our algorithm can easily be implemented in a distributed setting.

1 Introduction

Data clustering is a fundamental problem in pattern recognition, data mining, computer vision, machine learning, bioinformatics and several other related disciplines. It has a long history, and researchers in various fields have proposed numerous solutions. Several spectral clustering algorithms have been proposed [1, 2, 3, 4], which have enjoyed great success and have been widely used to cluster data. However, spectral clustering does not scale well to large-scale problems due to its considerable computational cost. In general, spectral clustering algorithms seek a low-dimensional embedding of the dataset by computing the eigenvectors of a Laplacian or similarity matrix. For a dataset with $n$ instances, this eigenvector computation has time complexity $O(n^3)$ [5], which is significant for large-scale problems.

In the past few years, efforts have been focused on addressing the scalability of spectral clustering. A natural way to achieve scalability is to perform spectral clustering on a sample of the given dataset and then generalize the result to the rest of the data. For example, Fowlkes et al. [6] find an approximate solution by first performing spectral clustering on a small random sample from the dataset and then using the Nyström method to extrapolate the solution to the entire dataset. In [7], Sakai and Imiya also find an approximate spectral clustering by clustering a random sample of the dataset; they additionally reduce the dimension of the dataset using random projections. Another approach, proposed by Yan et al. [8], works by first determining a smaller set of representative points using $k$-means (each centroid is a representative point) and then performing spectral clustering on the representative points. Finally, the original dataset is clustered by assigning each point to the cluster of its representative. In [9], Chen et al. deal with large-scale data by parallelizing both computation and memory use on distributed computers.

These methods sacrifice the accuracy of spectral clustering to achieve a fast implementation. In this paper, we perform spectral clustering without explicitly calculating eigenvectors. Rather, we compute a linear combination of the right eigenvectors weighted with corresponding eigenvalues. Moreover, unlike many traditional algorithms, our algorithm does not require a predefined number of clusters, $k$, as input; it can automatically detect and adapt to any number of clusters based on a preselected tolerance. We apply our algorithm to large stochastic block models to illustrate its scalability and demonstrate that it can handle large datasets for which traditional spectral algorithms result in memory errors. We compare the accuracy and speed of our algorithm to the normalized cut algorithm [1] on real datasets and show that our approach achieves similar accuracy at a faster speed. We also show that our algorithm is faster and more accurate than both the Nyström method for spectral clustering [6] and fast approximate spectral clustering [8].

Notation.

Throughout this paper we use boldface to distinguish vectors from scalars. For example, $\mathbf{x}$ is in boldface to identify a vector, while $x$, a scalar, is not. We use $\mathbf{1}$ to denote the vector of all ones. We denote matrices by capital letters, such as $A$, and use $a_{ij}$ to represent the entries of $A$. Calligraphic font is used to denote sets, with the single exception that $\mathcal{G}$ is reserved to denote graphs. The norm $\|\cdot\|$ denotes the Euclidean norm for vectors and the spectral norm for matrices.

2 Problem Statement

Consider a set $\mathcal{X} = \{x_1, x_2, \dots, x_n\}$ of $n$ data points in a $d$-dimensional space. We will often use the shorthand $i$ to denote the vector $x_i$. The goal is to find $k$ clusters in the dataset such that points in the same cluster are similar to each other, while points in differing clusters are dissimilar under some predefined notion of similarity. In particular, suppose pairwise similarity between points is given by a similarity function $s(x_i, x_j)$, often abbreviated $s_{ij}$, where it is usually assumed that the function is symmetric. Also, the similarity function is non-negative, $s_{ij} \ge 0$, for $i \ne j$ and is equal to zero for $i = j$. A similarity matrix $W$ is an $n \times n$ symmetric matrix whose entry $w_{ij}$ is equal to the value of the similarity function between points $x_i$ and $x_j$.

The data points together with the similarity function form a weighted undirected similarity graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$; $\mathcal{V}$ is the set of nodes (vertices) of the graph and $\mathcal{E}$ is the set of edges. We note that the problem of finding clusters in a dataset can be converted into a graph partitioning problem; we will make this precise in the sequel. In particular, in our case each data point represents a vertex (node) of the graph. Two vertices $i$ and $j$ are connected by an edge if their similarity $s_{ij}$ is positive, and the edge weight is given by $w_{ij}$. Different similarity measures lead to different similarity graphs. The objective of constructing a similarity graph is to model the local neighborhood relationships and capture the geometric structure of the dataset using the similarity function. Some commonly used similarity graphs for spectral clustering are described below, followed by a short construction sketch.

  1. Gaussian similarity graphs: Gaussian similarity graphs are based on the distance between points. Typically every pair of vertices is connected by an edge, and the edge weight is determined by the Gaussian function (radial basis function) $s(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)$, where the parameter $\sigma$ controls how quickly the similarity fades as the distance between the points increases. It is often helpful to remove edges whose weights fall below a certain threshold in order to sparsify the graph. Thus the similarity function for this type of graph is the Gaussian weight above when it exceeds the threshold, and zero otherwise.

  2. $k$-nearest neighbor graphs: As the name suggests, each vertex is connected to its $k$ nearest neighbors, where the nearness between $x_i$ and $x_j$ is measured by the distance $\|x_i - x_j\|$. This similarity measure results in a graph which is not necessarily symmetric in its similarity function, since the nearness relationship is not symmetric: if $x_j$ is among the $k$ nearest neighbors of $x_i$, it is not necessary for $x_i$ to be among the $k$ nearest neighbors of $x_j$. We make the similarity measure symmetric by placing an edge between two vertices $i$ and $j$ if either $x_i$ is among the $k$ nearest neighbors of $x_j$ or $x_j$ is among the $k$ nearest neighbors of $x_i$; that is, $w_{ij} = s_{ij}$ if either condition holds and $w_{ij} = 0$ otherwise.

  3. $\epsilon$-neighborhood graphs: In an $\epsilon$-neighborhood graph, we connect two vertices $i$ and $j$ if the distance $\|x_i - x_j\|$ is less than $\epsilon$, giving us the following similarity function:
$$w_{ij} = \begin{cases} 1 & \text{if } \|x_i - x_j\| < \epsilon, \\ 0 & \text{otherwise.} \end{cases}$$
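As an illustration, a minimal NumPy sketch of these three constructions is given below; the helper names and default parameter values are illustrative choices only.

import numpy as np

def gaussian_graph(X, sigma=1.0, threshold=0.0):
    # Fully connected Gaussian (RBF) similarity graph; weights below
    # `threshold` are removed to sparsify the graph.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)            # no self-loops
    W[W < threshold] = 0.0
    return W

def knn_graph(X, k=10, sigma=1.0):
    # Symmetrized k-nearest-neighbor graph: keep w_ij if x_j is among
    # the k nearest neighbors of x_i or vice versa.
    W = gaussian_graph(X, sigma)
    n = X.shape[0]
    nn = np.argsort(-W, axis=1)[:, :k]  # k most similar points for each vertex
    mask = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), k)
    mask[rows, nn.ravel()] = True
    mask = mask | mask.T                # "i near j" OR "j near i"
    return np.where(mask, W, 0.0)

def epsilon_graph(X, eps=1.0):
    # epsilon-neighborhood graph: connect points closer than eps.
    dists = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    W = (dists < eps).astype(float)
    np.fill_diagonal(W, 0.0)
    return W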

Thus, finding clusters in a dataset is equivalent to finding partitions of the similarity graph such that the sum of edge weights between partitions is small and the partitions themselves are dense subgraphs. The degree of vertex $i$ of the graph is given by $d_i = \sum_{j=1}^{n} w_{ij}$. The degree matrix $D$ is a diagonal matrix with diagonal elements $d_i$ for $i = 1, \dots, n$. We assume that the degree of each vertex is positive. This in turn allows us to define the normalized similarity matrix $P = D^{-1}W$.
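Given any of the above weight matrices, the degree matrix and the normalized similarity matrix can be formed directly, as in the following short sketch (which assumes every vertex has positive degree).

import numpy as np

def normalized_similarity(W):
    # Degree matrix D and normalized similarity P = D^{-1} W.
    d = W.sum(axis=1)                  # d_i = sum_j w_ij
    assert np.all(d > 0), "every vertex must have positive degree"
    return np.diag(d), W / d[:, None]  # row-wise division implements D^{-1} W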

3 Mixing processes

As a visualization of our clustering approach, consider a mixing process in which one imagines that every vertex in the graph moves towards (mixes with) other vertices in discrete time steps. At each time step, vertex $i$ moves towards (mixes with) vertex $j$ by a distance proportional to the similarity $w_{ij}$. Thus, the larger the similarity $w_{ij}$, the larger the distance vertices $i$ and $j$ move towards each other, i.e., the greater the mixing. Moreover, a point will move away from the points with which it has weak similarity. Thus, similar points will move towards each other, forming dense clusters, and dissimilar points will move away from each other, increasing the separability between clusters. Clusters in this transformed distribution of points can then easily be identified by the $k$-means algorithm.

To describe the above idea more precisely, consider the following model, in which each point moves according to the following equation, starting at its original position at time $t = 0$:

$$x_i(t+1) = x_i(t) + \delta \sum_{j=1}^{n} \frac{w_{ij}}{d_i}\big(x_j(t) - x_i(t)\big). \qquad (1)$$

The parameter $\delta \in (0, 1]$ is the step size, which controls the speed of movement (or mixing rate) in each time interval. Observe that if the underlying graph has a bipartite component and $\delta = 1$, then in each time step all points on one side of this component would move to the other side and vice versa. Therefore, points in this component would not actually mix even after a large number of iterations (for details see [10]). For such graphs we must have $\delta$ bounded away from 1; we can use $\delta = 1$ for graphs without a bipartite component. Assuming each point $x_i$ is a row vector, we express equation (1) in matrix form:

$$X(t+1) = \big((1-\delta)I + \delta P\big)X(t). \qquad (2)$$

The matrix $X(t)$ is an $n \times d$ matrix whose $i$th row represents the position of point $x_i$ at time $t$, $I$ is the $n \times n$ identity matrix, and we define $P = D^{-1}W$. Note that the matrix $M = (1-\delta)I + \delta P$ is essentially the transition matrix of a lazy random walk with probability of staying in place given by $1-\delta$. Since $P$ also captures the similarity of the data points, one would expect that, for $t$ large enough, the process in equation (2) would reveal the data clusters, since $M$ will mix the data points according to their similarities. Using this intuition, one can construct a heuristic clustering algorithm based on equation (2), as given in the following algorithm.

1:Input: Set of data points $\mathcal{X} = \{x_1, \dots, x_n\}$, number of clusters $k$
2:Represent the data points in the matrix $X(0)$, with point $x_i$ being the $i$th row.
3:Compute $M = (1-\delta)I + \delta D^{-1}W$.
4:Loop over $t = 0, 1, 2, \dots$
5:     $X(t+1) = M X(t)$
6:until the stopping criterion is met.
7:Find $k$ clusters from the rows of $X(t)$ using the $k$-means algorithm.
8:Output: Clustering obtained in the final iteration.
Algorithm 1 Point-Based Resource Diffusion (PRD)
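A minimal sketch of Algorithm 1 is given below; the step size, the relative-change stopping rule, and the iteration cap are illustrative choices, since the stopping criterion is left open here.

import numpy as np
from sklearn.cluster import KMeans

def prd_clustering(X, W, k, delta=0.5, tol=1e-4, max_iter=200):
    # Point-Based Resource Diffusion (Algorithm 1), sketched:
    # iterate X <- ((1 - delta) I + delta P) X, then run k-means on the rows.
    d = W.sum(axis=1)
    P = W / d[:, None]                     # P = D^{-1} W
    Xt = X.astype(float).copy()
    for _ in range(max_iter):
        X_next = (1.0 - delta) * Xt + delta * (P @ Xt)
        if np.linalg.norm(X_next - Xt) <= tol * np.linalg.norm(Xt):
            Xt = X_next
            break                          # illustrative stopping rule
        Xt = X_next
    return KMeans(n_clusters=k, n_init=10).fit_predict(Xt)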

Algorithm 1 has two limitations: (a) it does not scale well with the dimension of the data points, because the number of computations in each iteration grows with the dimension $d$ as well as with $n$, and (b) it fails to identify clusters contained within other clusters. For example, in the case of two concentric circular clusters, points in both clusters will move towards the center and merge into one cluster, losing the geometric structure inherent in the data; it then becomes impossible to discern these clusters using the $k$-means algorithm in Step 7. To overcome these limitations, we associate an agent with each point and carry out the calculations in the agent space. Agents are generated by choosing $n$ points uniformly at random from a bounded interval $[0, c]$ (note that $c$ is a scaling parameter and does not change the resulting clustering; for the sake of simplicity we use a probability vector for the analysis in the subsequent section). We rewrite equation (2) in the agent space as:

$$\mathbf{y}(t+1) = \big((1-\delta)I + \delta P\big)\,\mathbf{y}(t) = M\,\mathbf{y}(t), \qquad (3)$$

where $\mathbf{y}(t)$ is the $n$-dimensional vector whose $i$th entry is the agent associated with point $x_i$ at time $t$. We refer to this iterative equation as the Mixing Process. In the following section we analyze this process using the properties of the random walk matrix $M$.
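The Mixing Process can be simulated with a single $n$-dimensional vector of agents, as in the following sketch; the interval bound $c$, the step size, and the stopping rule are illustrative choices.

import numpy as np

def mixing_process(W, delta=0.5, c=1.0, tol=1e-6, max_iter=1000, rng=None):
    # Run the Mixing Process y(t+1) = ((1 - delta) I + delta P) y(t)
    # starting from agents drawn uniformly at random from [0, c].
    rng = np.random.default_rng() if rng is None else rng
    n = W.shape[0]
    P = W / W.sum(axis=1)[:, None]
    y = rng.uniform(0.0, c, size=n)
    for _ in range(max_iter):
        y_next = (1.0 - delta) * y + delta * (P @ y)
        if np.linalg.norm(y_next - y) <= tol:   # illustrative stopping rule
            return y_next
        y = y_next
    return y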

4 Analysis of the Mixing Process

The matrix $M$ captures the similarity structure of the data, and the idea behind the iterative process (3) is that, after a sufficient number of iterations, the entries of the vector $\mathbf{y}(t)$ will reveal clusters on the real line that are representative of the clusters in the data. In fact, the processes (2) and (3) mix at the same speed, which is governed by the random walk matrix $M$. Thus, the hope is that through the process in (3) we determine the weakly coupled components of the matrix $P$, which lead us to the clusters of the data points.

The Mixing Process resembles the power iteration. Unlike the power iteration, however, we use the mixing process to discover the strongly coupled components of $P$, which translate into data clusters.

4.1 Properties of the matrix $M$

We first show that the matrix $M$ is diagonalizable, which allows for a more straightforward analysis. By definition,

$$M = (1-\delta)I + \delta D^{-1}W = I - \delta D^{-1/2}\mathcal{L}D^{1/2}, \qquad (4)$$

where $\mathcal{L} = D^{-1/2}(D - W)D^{-1/2}$ is the normalized Laplacian of the graph $\mathcal{G}$. Let $\mathbf{v}$ be a right eigenvector of $\mathcal{L}$ with eigenvalue $\lambda$; then $D^{-1/2}\mathbf{v}$ is a right eigenvector of $M$ with eigenvalue $1 - \delta\lambda$, that is,

$$M\big(D^{-1/2}\mathbf{v}\big) = (1 - \delta\lambda)\,D^{-1/2}\mathbf{v}.$$

This gives us a useful relationship between the spectra of the random walk matrix $M$ and the normalized Laplacian $\mathcal{L}$. It is well known that the eigenvalues of a normalized Laplacian lie in the interval $[0, 2]$; see, for example, [10]. Thus, if $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n \le 2$ are the eigenvalues of $\mathcal{L}$, then the corresponding eigenvalues of $M$ are $1 - \delta\lambda_1 \ge 1 - \delta\lambda_2 \ge \cdots \ge 1 - \delta\lambda_n$. It is worth noting that we are considering the right eigenvectors of the random walk matrix $M$; one should not confuse these with its left eigenvectors.

Although the matrix $M$ is not symmetric, $\mathcal{L}$ is a symmetric positive semi-definite matrix; thus its normalized eigenvectors $\mathbf{v}_1, \dots, \mathbf{v}_n$ form an orthonormal basis for $\mathbb{R}^n$ and we can express $\mathcal{L}$ in the following form:

$$\mathcal{L} = \sum_{i=1}^{n} \lambda_i \mathbf{v}_i \mathbf{v}_i^T. \qquad (5)$$

Using (4) and (5), we obtain

$$M^t = \sum_{i=1}^{n} (1 - \delta\lambda_i)^t\, D^{-1/2}\mathbf{v}_i \mathbf{v}_i^T D^{1/2}. \qquad (6)$$

We will exploit this relationship in our proofs later.
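This spectral relationship can be checked numerically, as in the following quick sketch on a random weighted graph (the graph size and step size are arbitrary).

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((30, 30))
W = (A + A.T) / 2.0                              # random symmetric weight matrix
np.fill_diagonal(W, 0.0)

d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(30) - D_inv_sqrt @ W @ D_inv_sqrt     # normalized Laplacian
lam = np.sort(np.linalg.eigvalsh(L))             # eigenvalues lie in [0, 2]

delta = 0.5
M = (1.0 - delta) * np.eye(30) + delta * (W / d[:, None])
mu = np.sort(np.linalg.eigvals(M).real)          # eigenvalues of M

# Eigenvalues of M should equal 1 - delta * lambda_i (up to ordering).
print(np.allclose(np.sort(1.0 - delta * lam), mu))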

4.2 The ideal case

For the sake of analysis, it is worthwhile to consider the ideal case, in which all points form tight, well-separated clusters. By well-separated, we mean that if points $x_i$ and $x_j$ lie in different clusters, then their similarity $s_{ij} = 0$. Suppose that the data consists of $k$ clusters $\mathcal{C}_1, \dots, \mathcal{C}_k$ with $n_1, \dots, n_k$ points, respectively, such that $n_1 + n_2 + \cdots + n_k = n$. For ease of exposition, we also assume that the points are numbered in such a way that $x_1, \dots, x_{n_1}$ are in cluster $\mathcal{C}_1$, the points $x_{n_1+1}, \dots, x_{n_1+n_2}$ are in cluster $\mathcal{C}_2$, and so on.

The underlying graph in the ideal case consists of $k$ connected components $\mathcal{G}_1, \dots, \mathcal{G}_k$, where each component $\mathcal{G}_j$ consists of the vertices in the corresponding cluster $\mathcal{C}_j$. We represent this ideal graph by $\hat{\mathcal{G}}$, its normalized Laplacian by $\hat{\mathcal{L}}$ and its similarity matrix by $\hat{W}$. The $n$-dimensional characteristic vector $\mathbf{u}_j$ of the $j$th component is defined as

$$(\mathbf{u}_j)_i = \begin{cases} 1 & \text{if } x_i \in \mathcal{C}_j, \\ 0 & \text{otherwise.} \end{cases}$$

The ideal similarity matrix and, consequently, the ideal normalized Laplacian of a graph with $k$ connected components are both block-diagonal, with the $j$th block representing the component $\mathcal{G}_j$, i.e.,

$$\hat{W} = \operatorname{diag}\big(\hat{W}_1, \dots, \hat{W}_k\big), \qquad \hat{\mathcal{L}} = \operatorname{diag}\big(\hat{\mathcal{L}}_1, \dots, \hat{\mathcal{L}}_k\big).$$

Since $\hat{\mathcal{L}}$ is block-diagonal, its spectrum is the union of the spectra of $\hat{\mathcal{L}}_1, \dots, \hat{\mathcal{L}}_k$. The eigenvalue $0$ of $\hat{\mathcal{L}}$ has multiplicity $k$, with $k$ linearly independent normalized eigenvectors $\mathbf{v}_1, \dots, \mathbf{v}_k$; each of these eigenvectors is given by $\mathbf{v}_j = D^{1/2}\mathbf{u}_j / \|D^{1/2}\mathbf{u}_j\|$. In the following theorem, we prove that if we have an ideal graph, then the iterate sequence generated by the Mixing Process (3) converges to a linear combination of the characteristic vectors $\mathbf{u}_j$ of the components of the graph. The $\mathbf{u}_j$'s are also right eigenvectors of $\hat{M} = (1-\delta)I + \delta D^{-1}\hat{W}$, with eigenvalue $1$, corresponding to the first $k$ eigenvalues $\lambda_1 = \cdots = \lambda_k = 0$ of $\hat{\mathcal{L}}$.
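As a numerical illustration of the ideal case, the following sketch builds a block-diagonal similarity matrix and runs the mixing process; the resulting agent vector is (approximately) constant on each component, i.e., a linear combination of the characteristic vectors. The block sizes, weights, and iteration count are arbitrary.

import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(1)

def random_block(m):
    # dense symmetric within-cluster similarities, no self-loops
    A = rng.random((m, m))
    W = (A + A.T) / 2.0
    np.fill_diagonal(W, 0.0)
    return W

# Ideal similarity matrix: three connected components of sizes 10, 15 and 20.
W_hat = block_diag(random_block(10), random_block(15), random_block(20))

delta = 0.5
P = W_hat / W_hat.sum(axis=1)[:, None]
y = rng.uniform(0.0, 1.0, size=W_hat.shape[0])
for _ in range(2000):
    y = (1.0 - delta) * y + delta * (P @ y)

# The per-block spread is ~0: y is constant on each block (one value per cluster).
blocks = [slice(0, 10), slice(10, 25), slice(25, 45)]
print([float(np.std(y[b])) for b in blocks])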

Theorem 1.

Suppose that we have an ideal dataset which consists of $k$ clusters as defined previously, and let $\mathbf{y}(0)$ be any vector such that each $y_i(0) \ge 0$ and $\sum_{i=1}^{n} y_i(0) = 1$. Then

$$\lim_{t \to \infty}\left\|\mathbf{y}(t) - \sum_{j=1}^{k} \alpha_j \mathbf{u}_j\right\| = 0,$$

where $\alpha_j = \frac{1}{\operatorname{vol}(\mathcal{C}_j)} \sum_{i \in \mathcal{C}_j} d_i\, y_i(0)$, $\operatorname{vol}(\mathcal{C}_j) = \sum_{i \in \mathcal{C}_j} d_i$, and $d_i$ is the degree of the $i$th node.

Proof.

We begin by noting that the iterative process (3) is equivalent to $\mathbf{y}(t) = M^t \mathbf{y}(0)$. Using (6), we have

$$\mathbf{y}(t) = \sum_{i=1}^{n} (1-\delta\lambda_i)^t\, D^{-1/2}\mathbf{v}_i \mathbf{v}_i^T D^{1/2}\,\mathbf{y}(0).$$

Separating the first $k$ terms in the sum and using the fact that $\lambda_j = 0$ with eigenvector $\mathbf{v}_j = D^{1/2}\mathbf{u}_j / \|D^{1/2}\mathbf{u}_j\|$ for $j = 1, \dots, k$, the above equation can be simplified as

We can simplify the first term in the norm as

where the term $\|D^{1/2}\mathbf{u}_j\|^2 = \mathbf{u}_j^T D\,\mathbf{u}_j$ is equal to the sum of the degrees of the vertices in component $\mathcal{G}_j$ (commonly called the volume of $\mathcal{C}_j$), which can also be expressed as $\operatorname{vol}(\mathcal{C}_j) = \sum_{i \in \mathcal{C}_j} d_i$, to have

In the last equation we have used . Now using the properties of the norm, we can separate the terms, giving us

Note that we can always choose such that . Thus the preceding inequality can be written as

For any $\epsilon > 0$ there exists some $t_0$ such that

$$\left\|\mathbf{y}(t) - \sum_{j=1}^{k} \alpha_j \mathbf{u}_j\right\| \le \epsilon \quad \text{for all } t \ge t_0. \qquad (7)$$

Specifically, taking the log and simplifying, we have

4.3 The general case

In practice, the graph under consideration may not have $k$ connected components, but rather $k$ nearly connected components, i.e., dense subgraphs sparsely connected to each other by bridge edges. We can obtain $k$ connected components from such a graph by removing only a small fraction of the edges. This means that the matrices $W$ and $\mathcal{L}$ have non-zero off-diagonal blocks, but both matrices have dominant blocks on the diagonal. The general case is thus a perturbed version of the ideal case.

Let $W = \hat{W} + E$ be the similarity matrix for a dataset $\mathcal{X}$, where $\hat{W}$ is the similarity matrix corresponding to the true clusters (an ideal similarity matrix), which is block-diagonal and symmetric. We obtain $\hat{W}$ by replacing the off-diagonal block elements of $W$ with zeros and adding the sum of the off-diagonal block weights in each row to the corresponding diagonal element. This results in the matrices $W$ and $\hat{W}$ having the same degree matrix $D$. The matrix $E$ is then a symmetric matrix with row and column sums equal to zero, with the $i$th diagonal entry given by $e_{ii} = -\sum_{j \notin \mathcal{C}(i)} w_{ij}$, where $\mathcal{C}(i)$ denotes the cluster containing $x_i$. The off-diagonal entries of $E$ are the same as the entries in the off-diagonal blocks of $W$. A small construction sketch is given below.
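The decomposition $W = \hat{W} + E$ can be computed mechanically from a similarity matrix and a candidate cluster assignment, as in the following sketch (the function name and the label-based interface are illustrative).

import numpy as np

def ideal_decomposition(W, labels):
    # Split W into W_hat + E: zero the off-diagonal blocks of W and move the
    # removed row weights onto the diagonal, so that W and W_hat share the
    # same degree matrix and E has zero row and column sums.
    labels = np.asarray(labels)
    same_cluster = labels[:, None] == labels[None, :]
    removed = np.where(same_cluster, 0.0, W).sum(axis=1)  # off-block weight per row
    W_hat = np.where(same_cluster, W, 0.0) + np.diag(removed)
    E = W - W_hat
    return W_hat, E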

Using the definitions above, one can then show that the following result holds.

Lemma 1.

If $W = \hat{W} + E$ is the similarity matrix for a dataset $\mathcal{X}$, then the normalized Laplacian of the corresponding graph is $\mathcal{L} = \hat{\mathcal{L}} - D^{-1/2} E D^{-1/2}$, where $\hat{\mathcal{L}}$ is the normalized Laplacian of the ideal graph corresponding to the true clusters.

Since eigenvalues and eigenvectors are continuous functions of the entries of a matrix, the eigenvalues $\lambda_i$ of $\mathcal{L}$ can be written as

$$\lambda_i = \hat{\lambda}_i + \epsilon_i,$$

where $\hat{\lambda}_i$ is the $i$th eigenvalue of the ideal normalized Laplacian $\hat{\mathcal{L}}$ and the $\epsilon_i$ depend continuously on the entries of $E$. Similarly, the eigenvectors $\mathbf{v}_i$ of $\mathcal{L}$ can be expressed as

$$\mathbf{v}_i = \hat{\mathbf{v}}_i + \mathbf{e}_i,$$

where $\hat{\mathbf{v}}_i$ is the $i$th eigenvector of the ideal normalized Laplacian and the $\mathbf{e}_i$ depend continuously on the entries of $E$. Note that the pair $(\epsilon_i, \mathbf{e}_i)$ is not necessarily an eigenvalue/eigenvector pair of the perturbation $D^{-1/2} E D^{-1/2}$. We assume that $E$, and consequently the perturbation $D^{-1/2} E D^{-1/2}$, is small enough so that the $\epsilon_i$ and $\mathbf{e}_i$ are also small.

Theorem 2.

Suppose that we have a dataset $\mathcal{X}$ which consists of $k$ clusters, and let $\mathbf{y}(0)$ be any vector such that each $y_i(0) \ge 0$ and $\sum_{i=1}^{n} y_i(0) = 1$. Then $\mathbf{y}(t)$ satisfies a bound of the same form as in Theorem 1, with additional error terms that depend on the perturbations $\epsilon_i$ and $\mathbf{e}_i$, where $\alpha_j$ is as defined in Theorem 1 and $d_i$ is the degree of the $i$th node.

Proof.

Using (6) and separating the first $k$ terms, we get

Substituting $\lambda_i = \hat{\lambda}_i + \epsilon_i$ and $\mathbf{v}_i = \hat{\mathbf{v}}_i + \mathbf{e}_i$ in the above equation and using the fact that $\hat{\lambda}_i = 0$ for $i \le k$, we have

(8)

where

Substituting the above expression in (8) and using the triangle inequality, we have

(9)

Since ,  , we can further simplify inequality (9) as

(10)

Note that we can always choose such that and the above expression then becomes

Observe that $1 - \delta\lambda_i < 1$ for $\lambda_i > 0$. Thus, assuming that the perturbation is small, the first $k$ eigenvalues of the Laplacian are close to zero. If the eigengap $\lambda_{k+1} - \lambda_k$ is large enough, then for some $t$ we will have both $(1 - \delta\lambda_{k+1})^t$ close to zero and $(1 - \delta\lambda_j)^t$, $j \le k$, close to one. This results in the effective vanishing of the term involving $(1 - \delta\lambda_{k+1})^t$ in the above expression after a sufficient number of iterations, with the $\alpha_j$'s being bounded away from zero. According to Theorem 2, we will then have an approximate linear combination of the characteristic vectors of the graph, i.e., $\|\mathbf{y}(t) - \sum_{j=1}^{k}\alpha_j\mathbf{u}_j\|$ will be small. The small-perturbation assumption also ensures that the contribution of the $\epsilon_i$ and $\mathbf{e}_i$ terms is relatively small. Note that this eigengap condition is equivalent to Assumption A1 in [3].

It is worth noting that, as the number of clusters grows, it becomes increasingly difficult to distinguish the clusters from the vector $\mathbf{y}(t)$ using the classical $k$-means algorithm, because the perturbation terms accompany the eigenvectors in $\mathbf{y}(t)$. We therefore devise a recursive bi-partitioning mechanism to find the clusters, sketched below.
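A minimal sketch of such a recursive bi-partitioning scheme is given below: run the mixing process on a subset of the vertices, sort the resulting agent values, split at the largest gap, and recurse until no significant gap remains. The early-stopping iteration count, the gap tolerance, and the minimum cluster size are illustrative choices; the precise gap definition is given in the next subsection.

import numpy as np

def _mix(W, delta=0.5, iters=50, rng=None):
    # Mixing Process y <- ((1 - delta) I + delta D^{-1} W) y, stopped early so
    # that only a local (per-cluster) equilibrium is reached (cf. Section 4.4).
    rng = np.random.default_rng() if rng is None else rng
    P = W / (W.sum(axis=1)[:, None] + 1e-12)   # guard against isolated vertices
    y = rng.uniform(0.0, 1.0, size=W.shape[0])
    for _ in range(iters):
        y = (1.0 - delta) * y + delta * (P @ y)
    return y

def recursive_bipartition(W, idx=None, tol=0.05, min_size=2):
    # Recursively bi-partition by splitting the sorted agent values at their
    # largest gap; returns a list of index arrays, one per discovered cluster.
    if idx is None:
        idx = np.arange(W.shape[0])
    if len(idx) < 2 * min_size:
        return [idx]
    y = _mix(W[np.ix_(idx, idx)])
    order = np.argsort(y)
    gaps = np.diff(y[order])
    split = int(np.argmax(gaps))
    if gaps[split] < tol * (y.max() - y.min() + 1e-12):
        return [idx]                           # no significant gap: one cluster
    left, right = idx[order[:split + 1]], idx[order[split + 1:]]
    return (recursive_bipartition(W, left, tol, min_size)
            + recursive_bipartition(W, right, tol, min_size))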

4.4 Clustering algorithm

Our analysis in the previous section suggests that points in the same cluster mix quickly, whereas points in different clusters mix slowly. Simon and Ando's [11] theory of nearly completely decomposable systems also shows that states in the same subsystem reach local equilibria long before the system as a whole attains a global equilibrium. Therefore, an efficient clustering algorithm should stop when a local equilibrium is achieved; we can then distinguish the clusters based on the mixing of the points. The two clusters in this case correspond to two aggregations of the elements of $\mathbf{y}(t)$, so a simple search for the largest gap in the sorted $\mathbf{y}(t)$ can reveal the clusters. This cluster-separating gap is directly proportional to $c$, since we initialize $\mathbf{y}(0)$ by choosing points uniformly at random from the interval $[0, c]$; furthermore, it is inversely proportional to the size of the dataset. Thus, we define the gap between two consecutive elements of the sorted $\mathbf{y}(t)$ as: