Sampling Clustering
We propose an efficient graph-based divisive cluster analysis approach called sampling clustering. It constructs a compact yet informative dendrogram by recursively dividing a graph into subgraphs. In each recursive call, a graph is sampled first with a set of vertices being removed to disconnect latent clusters, then condensed by adding edges to the remaining vertices to avoid graph fragmentation caused by vertex removals. We also present some sampling and condensing methods and discuss their effectiveness in this paper. Our implementations run in linear time and achieve outstanding performance on various types of datasets. Experimental results show that they outperform state-of-the-art clustering algorithms with significantly lower computing resource requirements.
Cluster analysis is a fundamental unsupervised learning approach widely used in various fields such as data mining and computer vision. It categorizes a set of objects into groups (or clusters) based on the similarities between them to help understand and analyze data. Clusters are aggregations of objects whose intra-cluster similarities are higher than inter-cluster ones.
With the long history of research on clustering, a large number of algorithms have been proposed. However, there still exist several common limitations among them.
Classical center-based algorithms like k-Means [1] use central vectors to partition the data space, so they may lack the ability to find arbitrarily shaped clusters. Many of them are also very sensitive to center initialization. Distribution-based algorithms including Gaussian Mixture Models have similar shortcomings. In addition to that, for spatial data, as the dimension grows, the number of parameters to be determined also increases, and the running time and memory required may become unacceptable.
Density-based clustering methods define clusters as contiguous regions with high density. DBSCAN [2] assumes that the clusters have homogeneous densities. A global threshold needs to be specified in advance to distinguish between high-density and low-density areas. It judges whether two regions belong to the same cluster purely from connectivity, which can result in the “single-link effect” [3] and lead to undesirable combinations. Another critical problem is that, for some applications, there may not be a global threshold at all to separate all clusters at once, which makes such tasks unsolvable for DBSCAN.
Algorithms above generate disjoint partitions of the dataset, and nesting is not allowed. However, in many cases, some data in reality are hierarchical. As shown in Fig. 1, it is hard, or even impossible, to find an absolutely reasonable and unambiguous division of the data. In such cases, a possible way is to generate multiple partitions on the same dataset with different granularities to construct a hierarchy. However, it requires multiple runs and is difficult to handle.
Hierarchical algorithms are traditionally divided into two categories, agglomerative methods and divisive ones. Such algorithms generate hierarchical structures called dendrograms, which are much more informative than non-nested partitions. When necessary, a dendrogram can also be converted into partitions with different numbers of clusters as needed without multiple runs.
Typically, agglomerative methods initially treat each object as a cluster, or start with a large number of tiny clusters, and then merge them in pairs. The dendrogram is usually very deep and large, which makes it difficult to analyze. In fact, in most cases, low-level branches are not so meaningful, and are often pruned to make the dendrogram easier to analyze, which is itself a challenging task.
Other than that, due to clusters being merged in pairs, dendrograms are usually binary and thus cannot represent the real structures accurately. For example, consider a cluster consisting of more than two identical subclusters: the cluster can’t be built by merging all subclusters simultaneously, and the structure of the dendrogram depends largely on the order of selections. Fig. 2 depicts one case.
For many agglomerative methods, to select a pair of clusters for the next merge, the similarities between every two clusters must be calculated. The complexity of an agglomerative method is therefore often very high, usually at least O(n^2), which can be unacceptable for a mass of data.
In general, a divisive method initially groups all objects into one cluster, and divides each cluster into two subclusters recursively.
Compared to the large number of agglomerative algorithms, divisive methods are fewer. One major reason is that separation is usually much more difficult than merging. For an agglomerative method, given m subclusters, in order to find the two most similar ones to merge, only a maximum of m(m − 1)/2 comparisons are required. Correspondingly, there are 2^(n − 1) − 1 divisions that split a cluster of size n into two subclusters.
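The exponential count of two-way splits can be verified by brute force; this standalone check (not part of the clustering algorithm itself) enumerates one side of each split of an n-element set and confirms the 2^(n − 1) − 1 formula:

```python
from itertools import combinations

def count_bipartitions(n):
    """Count the ways to split a set of n items into two non-empty,
    unordered subsets by enumerating the smaller side of each split."""
    items = range(n)
    count = 0
    for size in range(1, n // 2 + 1):
        for subset in combinations(items, size):
            # When both halves have equal size, keep only the subset
            # containing item 0 to avoid counting each split twice.
            if 2 * size == n and 0 not in subset:
                continue
            count += 1
    return count

# Matches the closed form 2^(n - 1) - 1 for every tested n.
assert all(count_bipartitions(n) == 2 ** (n - 1) - 1 for n in range(2, 12))
```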
Divisive methods have similar problems to agglomerative ones. Besides, in addition to the difficulty of finding a proper division, the stopping condition (e.g., the commonly used minimum cluster size) is usually hard to define too. However, with a well-defined termination, a large number of unnecessary divisions can be avoided. Unlike agglomerative methods, this makes it possible to output smaller and more analyzable dendrograms directly without any pruning.
Many methods require the data to be spatial, which limits their applications. Graph-based methods use similarity graphs, and usually transform spatial data into nearest neighbor graphs (e.g., k-NN, mutual k-NN [4] and XNN [5]). Although the transformation may cause loss of information, processing on graphs still brings many significant benefits. 1) There are metrics in graph theory, like reachability, that can measure the similarities between objects better than traditional ones (e.g., the Euclidean distance). 2) Querying the most similar objects in a graph is often faster than other approaches. 3) It has been shown in [6] that nearest neighbor graphs are proper and efficient representations for data lying on a low dimensional manifold embedded in a high dimensional space.
With the considerations and inspired by [7], we propose a graph-based divisive clustering approach called Sampling Clustering. It recursively divides a graph into subgraphs through removing vertices from the graph.
The approach has the following advantages.
A non-binary dendrogram is generated, which is much more representative than non-nested partitions and binary dendrograms.
The division stops in time, and there are fewer undesirable splits. The dendrogram is small and easy to analyze.
It works well and outperforms state-of-the-art algorithms on various types of datasets.
Only a few hyper-parameters are required and none of them plays a key role.
It requires significantly fewer computing resources, and the time complexity is usually linear. It is also easily parallelizable.
The approach is very simple, and can be extended to handle domain-specific data.
The paper is organized as follows. We present details of the approach in Section 2, and compare it with other methods theoretically in Section 3. Section 4 shows the experimental results. (The implementation and other helpful resources are available at http://res.ctarn.io/sampling-clustering. There is also a published reproducible version at https://doi.org/10.24433/CO.1783197.v1.) We discuss the experiments further in Section 5. The conclusions are in Section 6.
The main idea of this approach is that, a big graph can be summarized as a smaller one that still retains its main structure but has fewer details. The smaller graph is easier to analyze, and can be further summarized again. By summarizing repeatedly, as shown in Fig. 3, the global structural information can be gradually revealed.
In this process, a graph can be divided into several disconnected subgraphs. Like traditional methods, connectivity can be used as the criterion to group objects into clusters. Each subgraph, that is, a connected component, is treated as a subcluster. Thus, a hierarchy can be obtained by splitting recursively.
For summarization, it can be further divided into two steps: selecting a set of vertices from the original graph as the representatives and then building a new graph by reconnecting these vertices.
Specifically, as shown in Fig. 4, given a similarity graph of the data, the approach constructs a dendrogram through dividing a graph into subgraphs recursively.
In each recursive call, the graph is sampled first to disconnect latent subclusters. The selected vertices are removed from the graph along with the edges.
After each sampling, the density of the graph may decrease and the connections between vertices become weaker. The remaining edges are usually not enough to connect the vertices stably and maintain the structure. Multiple samplings may even divide the graph into a lot of tiny parts unnecessarily. To avoid the fragmentation of a cluster, condensing is applied on the sampled graph. It keeps the connections stable enough by finding new nearest neighbors for each remaining vertex and connecting each vertex with its new neighbors to fill the vacancies around it.
The recursive process terminates if no vertex or all vertices of a graph are removed by sampling.
After sampling, if the process is not terminated, each vertex to be deleted from the graph is associated with a nearest remaining vertex as its representative. Otherwise, if terminated, the call returns the set of all vertices of the graph along with the vertices associated, perhaps indirectly, with them.
For association, since the graph is connected and there exists at least one remaining vertex, it is guaranteed that each vertex can always find a nearest one that is not to be removed. If the initial input is not connected, we first divide it into subgraphs without sampling.
An overview is given as Algorithm 1.
In the rest of this section, we describe these steps in more detail. For summarization, we present a set of sampling and condensing methods, and introduce some measures to evaluate them to help choose or design one method that suits a specific application. We also provide a fine-tuning method to further improve the quality of a dendrogram, and simple algorithms to convert a dendrogram into partitions. In the end we briefly discuss the complexity of the approach.
For the convenience of the following description, assuming a directed graph G = (V, E), core concepts are defined as below.
u is a neighbor of v if and only if (v, u) ∈ E.
The neighborhood of v is the set of all neighbors of v, denoted as N(v).
In addition, we denote N(v) ∪ {v} as N⁺(v).
A dendrogram is a tree of which all leaves are exclusive subsets of V.
As described in Section 1.3, transforming datasets into graphs brings many benefits. In addition, a graph is a flexible data structure, and can be easily edited or built. Converting data in other forms into a graph is usually much easier than the reverse. For example, building a spatial dataset based on similarity graphs is usually more difficult than transforming a spatial dataset into a similarity graph (e.g., a nearest neighbor graph). It means that a graph-based method can accept more types of data and thus can be applied to more fields.
The method of constructing a similarity graph usually depends on the original data type and the distance, or similarity, metric used. For general spatial data, in addition to brute-force search, fast indexing structures such as KD Tree [8] or Ball Tree [9] can be used to speed up the search of nearest neighbors.
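For illustration, a brute-force k-nn graph construction in plain Python is sketched below (quadratic in the number of points, so only suitable for small data; in practice the inner sort would be replaced by a KD Tree or Ball Tree query). The function name and the data layout, a dict mapping each vertex index to its list of out-neighbors, are our own choices:

```python
import math

def knn_graph(points, k):
    """Build a directed k-nearest-neighbor graph: vertex i points to
    the indices of its k closest points under the Euclidean distance."""
    graph = {}
    for i, p in enumerate(points):
        # Brute-force distance computation; a KD Tree or Ball Tree
        # would make this search much faster on large datasets.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph[i] = [j for _, j in dists[:k]]
    return graph

# Two well-separated groups: the out-edges stay inside each group.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
graph = knn_graph(points, k=2)
```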
Although the following discussions are mainly based on unweighted directed nearest neighbor graphs, the approach can also be applied to various types of graphs such as undirected or weighted graphs. Some methods introduced below may need to be slightly adjusted.
Sampling vertices from a graph leads to several benefits. 1) By reducing the number of vertices, both analysis and computing become easier. With a proper subset of the graph, the main structural information can be well maintained. 2) Multiple samplings can break the weak connections caused by noise between components. This avoids common problems of linkage methods, like the “single-link effect”, and makes the approach more tolerant to noise. 3) More importantly, the sequence of separations contains rich structural information and can be used to reconstruct the hierarchical structure of the data.
The sampling algorithm is described as Algorithm 2. It uses a measure to calculate a score for each vertex that indicates whether it should be deleted or not, and returns the set of vertices with the lowest scores, that is, those that should be removed. We use a sampling rate to control the number of vertices to be removed.
A basic sampling measure assigns a random score to each vertex. Another type of method tends to remove vertices at junctions of clusters instead of vertices in core areas, so that the latent clusters can be disconnected quickly and undesirable divisions can be avoided. Such methods can be boundary detection based, like [10]. It is shown in [10] that a vertex at junctions usually has a smaller indegree than others. In particular, for a k-nn graph, the outdegree of each vertex is k, and the indegree of a vertex at boundaries is usually less than k, as shown in Fig. 5.
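A minimal sketch of such an indegree-based sampling step follows, assuming the graph is a dict mapping each vertex to its list of out-neighbors; the function name and the rounding of the removal count are our own choices, not the paper's:

```python
def sample_by_indegree(graph, rate):
    """Score each vertex by its indegree and mark the lowest-scoring
    `rate` fraction of vertices for removal; in a k-nn graph, vertices
    at cluster junctions tend to have small indegrees."""
    indegree = {v: 0 for v in graph}
    for neighbors in graph.values():
        for u in neighbors:
            indegree[u] += 1
    n_remove = int(len(graph) * rate)
    by_score = sorted(graph, key=lambda v: indegree[v])
    return set(by_score[:n_remove])

# Vertex 2 receives no in-edges, so it is the first to be removed.
removed = sample_by_indegree({0: [1], 1: [0], 2: [0, 1]}, rate=0.34)
```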
Another similar one, defined as Definition 5, identifies boundaries based on the number of mutual neighbors of each vertex.
To measure the effectiveness of a sampling method, we first formally define the vertices at junctions as Definition 6.
A vertex v is at junctions if and only if some neighbor u ∈ N(v) and v belong to different clusters. A vertex not at junctions is called a positive vertex.
A sampled graph should be more separable. We use the proportion of positive vertices of the graph to measure its separability.
The proportion of positive vertices of a graph is called its vertex positivity.
To avoid being affected by the neighborhood size decreasing, we label the positivity of each vertex before sampling and only recalculate the proportion after that. Obviously, sampling randomly doesn’t change the proportion of positive vertices. Fig. 6 shows that both the indegree-based and the mutual-neighbor-based measures increase the vertex positivity of a graph significantly.
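When ground-truth labels are available, vertex positivity is easy to compute; the sketch below (the names are ours) counts a vertex as positive when all of its out-neighbors carry its own label:

```python
def vertex_positivity(graph, labels):
    """Fraction of vertices that are not at junctions, i.e. whose
    out-neighbors all share the vertex's ground-truth label."""
    positive = sum(
        all(labels[u] == labels[v] for u in graph[v]) for v in graph
    )
    return positive / len(graph)

# Vertex 2 sits at a junction: its only neighbor has a different label.
score = vertex_positivity({0: [1], 1: [0], 2: [1]},
                          {0: "a", 1: "a", 2: "b"})
```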
Since only the last remaining vertices are grouped into clusters directly, in order to completely cluster all objects, the rest, which have been removed, should be bound to these classified ones. As shown in Fig. 7, after each sampling, the vertices to be removed are associated with remaining ones. For each deleted one, there is always a vertex (indirectly) bound with it that has never been removed, and they are grouped into the same cluster.
A simple method to query a nearest remaining vertex, described as Algorithm 3, is breadth-first search starting from each vertex to be removed, and, in general, it is also very fast.
However, the worst-case performance of this method is quadratic. For example, assuming a long chain of n vertices where only the vertex at one end is not marked for deletion, the total number of visits on all edges is O(n^2).
In order to theoretically guarantee that the algorithm is linear in any case, we introduce another equivalent algorithm as an alternative here. Instead of using breadth-first search starting from each vertex to be removed, we start breadth-first search from all remaining vertices at the same time, and only search once. Since each edge is visited at most once, the search is guaranteed to finish in linear time. We call this method multi-source breadth-first search, and the details are shown in Algorithm 4.
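A minimal sketch of multi-source breadth-first search, assuming a dict-of-lists directed graph; the search runs over the reversed edges so it flows outward from the remaining vertices, touching each edge at most once (function and variable names are our own choices):

```python
from collections import deque

def associate(graph, removed):
    """Map every removed vertex to its nearest remaining vertex using
    one breadth-first search seeded from all remaining vertices."""
    # Reverse the edges so the search expands from remaining vertices
    # toward the removed ones.
    reverse = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for u in neighbors:
            reverse[u].append(v)
    representative = {v: v for v in graph if v not in removed}
    queue = deque(representative)
    while queue:
        v = queue.popleft()
        for u in reverse[v]:
            if u not in representative:
                representative[u] = representative[v]
                queue.append(u)
    return representative
```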
A sampled graph is usually sparser, and connections between vertices become weaker. Multiple samplings may break the graph into pieces. We condense the graph by increasing the number of edges to avoid graph fragmentation caused by vertex removals. For a sampled nearest neighbor graph, we search for new nearest vertices of each vertex and connect each vertex with its new neighbors.
For a vertex v, we use breadth-first search starting from v to find a set of candidates. To avoid searching too deep unnecessarily, we limit the depth: the search stops once enough candidates have been collected and the searching depth exceeds the limit. A measure is used to calculate a score for each candidate which indicates the distance, or dissimilarity, between it and v.
The new neighborhood of a vertex consists of the candidates with the highest similarities. A new graph is built using the edges selected by scores. The algorithm is described as Algorithm 5.
The simplest measure is based on the breadth-first search visiting sequence. We simply assign the visiting index to a candidate as its score; that is, the first vertices visited are selected.
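A sketch of condensing with the visiting-sequence measure, assuming a dict-of-lists graph already stripped of the sampled vertices; m is the desired neighborhood size, and the depth limit of 3 is an arbitrary illustrative default, not a value from the paper:

```python
from collections import deque

def condense(graph, m, depth_limit=3):
    """Rebuild each vertex's neighborhood with the first m vertices met
    by a depth-limited breadth-first search, filling the vacancies left
    by sampling (the BFS visiting-sequence measure)."""
    new_graph = {}
    for v in graph:
        seen = {v}
        candidates = []
        queue = deque([(v, 0)])
        while queue and len(candidates) < m:
            u, depth = queue.popleft()
            if depth >= depth_limit:
                continue
            for w in graph[u]:
                if w not in seen:
                    seen.add(w)
                    candidates.append(w)
                    queue.append((w, depth + 1))
        new_graph[v] = candidates[:m]
    return new_graph
```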
Another measure is based on shared neighbors. It has been shown in [13] that a high similarity between two neighborhoods also indicates that the two vertices are similar. We use the Jaccard index, also known as Intersection over Union (IoU), to measure the similarity between two neighborhoods.
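The shared-neighbor score is just the Jaccard index of two out-neighborhoods; a few-line sketch (the function name is ours):

```python
def shared_neighbor_score(graph, u, v):
    """Jaccard index (IoU) of the out-neighborhoods of u and v; a high
    value suggests the two vertices belong to the same cluster."""
    a, b = set(graph[u]), set(graph[v])
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```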
For spatial data, distance metrics (e.g., the Euclidean distance) can also be used in some applications.
Effective condensing methods should maintain the structure of the graph well, which can be considered from two aspects. 1) Inter-cluster connections should be avoided. A vertex should belong to the same cluster as its neighbors to make clusters separable. 2) To avoid a cluster being divided into pieces, vertices in it need to be strongly connected.
For the first consideration, we use a measure similar to Definition 7, called the edge positivity of a graph.
An edge (u, v) is positive if and only if u and v belong to the same cluster.
The proportion of positive edges of a graph is called its edge positivity.
For the second one, the average connectivity of a graph [14], defined as follows, is used to measure the strength of a cluster’s internal connections. It can also be used to evaluate the stability of a sampling method.
The connectivity of a graph G = (V, E) is measured with

    κ̄(G) = ( Σ_{u,v ∈ V} κ(u, v) ) / C(|V|, 2),

where κ(u, v) is the value of the maximum flow from u to v, and the sum runs over all unordered pairs of distinct vertices.
Fig. 8 shows that graph-based measures are usually more effective than geometry-based.
Although the shared-neighbor measure is very effective, it is not a stable condensing method on a big graph. With condensing repeated multiple times on a graph, it tends to divide a component into lots of tiny parts. The size of such a part is usually about the condensing size, and the vertices in it are almost fully connected, while inter-part connections are extremely weak. The intra-cluster connections are destroyed and the clusters are divided into pieces.
Fig. 9 shows that the connectivity of the graph decreases rapidly after a number of iterations if the shared-neighbor measure is employed. In fact, if a k-nn graph is not sampled after each iteration, tiny parts are generated even faster, while the BFS visiting-sequence measure doesn’t change an unsampled k-nn graph at all.
At the end of each recursive call, the processed graph is partitioned into several subgraphs, and this generates a branch on the dendrogram. Since the similarities between elements can be measured by reachability, the graph can simply be divided into connected components. For data that are difficult to partition, in order to split more thoroughly, the graph is divided into strongly-connected components. Otherwise, or when the graph is undirected, weakly-connected components can be used instead.
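Strongly connected components can be found in linear time with Tarjan's or Kosaraju's algorithm; below is a compact Kosaraju sketch over a dict-of-lists graph (our own helper, not the paper's implementation):

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm: order vertices by finish time on the
    original graph, then collect components on the reversed graph."""
    order, seen = [], set()

    def dfs(v):
        # Iterative DFS that records vertices in order of completion.
        stack = [(v, iter(graph[v]))]
        seen.add(v)
        while stack:
            u, it = stack[-1]
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(graph[w])))
                    break
            else:
                order.append(u)
                stack.pop()

    for v in graph:
        if v not in seen:
            dfs(v)

    reverse = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for u in neighbors:
            reverse[u].append(v)

    components, assigned = [], set()
    for v in reversed(order):
        if v not in assigned:
            component, stack = set(), [v]
            assigned.add(v)
            while stack:
                u = stack.pop()
                component.add(u)
                for w in reverse[u]:
                    if w not in assigned:
                        assigned.add(w)
                        stack.append(w)
            components.append(component)
    return components
```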
Additionally, dividing a graph into components also makes sure that all subgraphs to be sampled in the next iteration are connected, which guarantees that each vertex to be removed can be bound to a remaining one.
Pruning is used to reduce the number of leaves of a dendrogram and generate a partition from it. We provide two simple pruning algorithms.
The first algorithm strictly conforms to the original structure of the dendrogram. A node can be merged if and only if it is an end branch, defined as Definition 12.
An end branch of a dendrogram is a node of the dendrogram all of whose children are leaves.
The size of a node is defined as the total number of objects in its descendants, or itself.
The smallest end branch is merged into one leaf first. The algorithm stops if the next merge would cause the number of leaves to be less than the desired value, or the largest leaves already contain enough of the objects. It is described as Algorithm 6.
The second algorithm allows two leaves belonging to the same parent to be merged together directly. A leaf is moved down to find a leaf brother, as shown in Fig. 10, if it has a brother and the brother is not a leaf. The detailed description is shown in Algorithm 7. It generates a more balanced partition with the number of clusters being precisely controlled. However, it does not conform strictly to the original structure of the dendrogram and doesn’t work well on unbalanced data.
We also introduce a simple method called smoothing to adjust dendrograms. It is based on the observation that, firstly, some association trees, which are generated after sampling, may cross together slightly, so that a small number of vertices, especially at junctions, are grouped into different clusters from their neighbors; secondly, a few isolated vertices may become clusters unexpectedly, and such tiny leaves should also be removed.
We simply regroup each vertex into the most common cluster in its neighborhood. This operation can be repeated multiple times, and our experiments show that it improves the quality of dendrograms on most datasets and also converges very fast.
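Smoothing reduces to a neighborhood majority vote; a minimal sketch, where the vertex's own label is included in the vote (that tie-break and the names are our own choices):

```python
from collections import Counter

def smooth(graph, labels, rounds=1):
    """Reassign each vertex to the most common cluster label in its
    neighborhood; a few repetitions remove tiny spurious clusters."""
    for _ in range(rounds):
        updated = {}
        for v, neighbors in graph.items():
            votes = Counter(labels[u] for u in neighbors)
            votes[labels[v]] += 1  # the vertex votes for its own label
            updated[v] = votes.most_common(1)[0][0]
        labels = updated
    return labels
```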
At the end, we briefly analyze the complexity of the approach. Given a k-nn graph with n vertices and kn edges, first we discuss the complexity of each step in one recursive call.
Assuming a sampling measure with a per-vertex complexity of O(s), calculating the scores costs O(ns). Selection methods are used to find the threshold, and can be finished at the cost of O(n). Therefore, the complexity of sampling is O(ns).
The complexity of sampling is O(ns).
As described in Section 2.3, since each edge is visited at most once, the worst-case performance of the multi-source breadth-first search method is O(kn).
The complexity of association is O(kn).
With the searching depth limited to a constant d, for each vertex, the number of candidates does not exceed O(k^d), and thus both calculating and sorting the scores cost only O(k^d (c + d log k)) per vertex, where c is the complexity of the measure.
The complexity of condensing is O(n k^d (c + d log k)).
Methods for computing strongly or weakly connected components also run in linear time.
The complexity of partition is O(kn).
Therefore, when k, d and the measure costs are constants, the complexity of a recursive call is linear in the size of its input graph.
Since the total size of all subgraphs across recursive calls is at most n + n(1 − r) + n(1 − r)² + … = n/r, where r is the sampling rate, the complexity of the approach is only a constant factor 1/r larger than that of a single call on the full graph.
The complexity of Sampling Clustering is linear in n for a fixed sampling rate.
The complexity of every sampling or condensing measure introduced above is constant per vertex (for fixed k), and thus the complexity of an implementation based on them is linear.
In the case where both the measures of sampling and condensing are constant-time, the complexity of Sampling Clustering is O(n).
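The geometric bound behind this result, n + n(1 − r) + n(1 − r)² + … = n/r, can be sanity-checked numerically:

```python
def total_processed_size(n, rate, levels=10_000):
    """Total number of vertices processed across recursion levels when
    each sampling removes a `rate` fraction of the remaining vertices."""
    return sum(n * (1 - rate) ** i for i in range(levels))

# With r = 0.5, the total work is about 2n; in general it tends to n/r.
total = total_processed_size(1000, rate=0.5)
```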
We compare Sampling Clustering with other cluster analysis methods theoretically in this section.
Almost all existing algorithms, including the classical methods mentioned above, can be classified into two categories. For the first type, the methods operate on static original data distributions or graphs. For the second one, the methods are usually iterative, and adjust the distribution of all objects or the structure of a graph dynamically.
Global characteristics matter. Methods using static distributions often lack the ability to reveal global characteristics. A typical example is DBSCAN [2]. As mentioned in Section 1.1, since the method simply groups objects into a cluster based on local connectivity, DBSCAN often merges independent clusters by mistake. Other derivatives of DBSCAN [2], like HDBSCAN [15], do not solve this very well either. Iterative algorithms based on static modeling, including k-Means [1], k-Means++ [16], k-Medoids [17], k-Medians [18], Gaussian Mixture Models and BIRCH [19], perform better in this regard. However, since their modeling abilities are also limited, most of them are only suitable for specific types of distributions, mainly convex structures. Additionally, many of them, especially BIRCH, are sensitive to parameters, and thus require a good understanding of the data.
HCS [20], Highly Connected Subgraphs, is a positive example. It recursively splits a graph into two subgraphs based on minimum cuts, which is a good global measure. Although it does not always work well either [21], HCS is more tolerant to noise. However, it is obvious that computing minimum cuts multiple times is a time-consuming task.
Chameleon Clustering [22] is an effective agglomerative method. It groups data into a large number of tiny subclusters, and repeatedly merges two clusters that are relatively close and interconnected. Unfortunately, the metrics used to measure the similarity and interconnectivity between pairs of clusters only perform well in low-dimensional spaces.
GDL [23], Graph Degree Linkage, is another graph-based agglomerative algorithm. It uses indegrees and outdegrees to measure the similarities between clusters, and merges them in pairs. The parameters are usually difficult to specify properly, and often require multiple runs to find usable settings. However, in practice, it is often hard to evaluate the quality of a clustering result without ground-truth labels, which means there is no reliable way to tune the parameters. Moreover, just like most agglomerative methods, it is also very time-consuming and not applicable to big datasets.
RCC [24], Robust Continuous Clustering, is an iterative method that expresses clustering as the optimization of a continuous objective. The method associates each data point with a representative, and optimizes them to reveal the structure of the data distribution. In the optimization process, the representatives gradually gather into several clusters. It is fast and also works well in high-dimensional spaces. Another notable feature is that the number of clusters need not be specified in advance. However, this also makes the granularity of clustering uncontrollable. Even worse, since the optimized distribution cannot easily be interpreted further, it is hard to split or merge existing clusters manually.
There are also a set of cluster analysis methods based on dimensionality reduction. They either require a lot of computing resources (e.g., t-SNE [25]), or perform poorly.
As far as we know, the algorithm schema proposed in this paper, Sampling Clustering, is the first approach that adjusts the structure of a graph dynamically by removing vertices repeatedly. The approach converts graphs into smaller ones without losing much structural information, which allows the global characteristics to be revealed easily with significantly fewer computing resources.
We follow the experiments in [24], Robust Continuous Clustering, and use the datasets preprocessed and publicly provided by the authors.
For YaleB [26], we only use the frontal face images processed using gamma correction and DoG filter. For TYF [27], the video frames of the first 40 subjects sorted in chronological order are used. For Reuters-21578, the train and test sets of the Modified Apte are used, and categories with less than five instances are not considered. For RCV1 [28], the target clusters are defined as four root categories. We use a random subset of 10,000 instances. For text datasets, Reuters and RCV1, only the 2,000 most frequently occurring word stems are considered. There is no additional preprocessing for other datasets. Unlike [24], for methods other than Robust Continuous Clustering, the features are not normalized or reduced to a low dimension.
A brief summary is shown in Table I.
We compare Sampling Clustering (SC) with both partitional methods, including k-Means++ (KM) [16], Mean Shift (MS) [32], Gaussian Mixture Models (GMM), Affinity Propagation (AP) [33] and Robust Continuous Clustering (RCC, RCC-DR), and hierarchical methods, including three classical agglomerative algorithms (AC-Average, AC-Complete, AC-Ward), Graph Degree Linkage (GDL-U, AGDL) [23] and Hierarchical DBSCAN (HD) [15].
For RCCs (RCC, RCC-DR) and GDLs (GDL-U, AGDL), we use implementations publicly provided by the authors, and for others, we use scikit-learn and scikit-learn-contrib.
Sampling Clustering: The Euclidean distance metric is used to build a k-nn graph of each dataset. We use indegree-based sampling and the condensing method based on the BFS visiting sequence. Although the shared-neighbor condensing measure is not stable and may divide a graph into pieces after multiple runs, it is still usable and very effective, and thus we run condensing on the input with it only once to improve the quality of the graph. The sampling rate and condensing size are fixed across all datasets. Additionally, for condensing, we run BFS on the graph with the sampled vertices removed and limit the searching depth. Graphs are divided into strongly connected components. Smoothing runs 16 times before and after pruning. We use the hard pruning method on unbalanced or noise-rich datasets (YTF, Reuters, RCV1, Shuttle and Mice Protein) and the soft pruning method on the others.
KM, GMM: Run each algorithm multiple times.
MS: quantile .
AP: max iter , convergence iter , damping .
RCCs: max iter , inner iter . The weighted graphs are provided by the authors.
GDLs: , .
HD: min cluster size , where is the average size of ground-truth clusters.
The default setting is used if not mentioned. For algorithms that run multiple times, including KM, GMM, MS, GDLs and HD, we use the best results.
We use a computer with an Intel Core i7-6770HQ CPU and GiB memory, running Ubuntu Desktop 18.04. The results are shown in Table II. Some algorithms may require too much memory on a dataset and thus are not applicable, marked as MLE.
The computing resource costs are also compared. Considering the scalability of some algorithms, as shown in Table III, we only evaluate them on RCV1.
| Dataset | KM | MS | GMM | AP | RCC | RCC-DR | AC-A | AC-C | AC-W | GDL-U | AGDL | HD | SC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MNIST | 0.496 | 0.226 | 0.281 | MLE | 0.869 | 0.746 | MLE | MLE | MLE | MLE | MLE | 0.189 | 0.895 |
| COIL100 | 0.793 | 0.706 | MLE | 0.635 | 0.924 | 0.924 | 0.514 | 0.650 | 0.825 | 0.936 | 0.936 | 0.860 | 0.862 |
| YTF | 0.775 | 0.714 | MLE | 0.578 | 0.788 | 0.783 | 0.429 | 0.621 | 0.803 | 0.576 | 0.563 | 0.758 | 0.791 |
| YaleB | 0.593 | 0.272 | MLE | 0.526 | 0.958 | 0.958 | 0.106 | 0.387 | 0.726 | 0.955 | 0.955 | 0.618 | 0.900 |
| Reuters | 0.381 | 0.000 | 0.384 | 0.198 | 0.379 | 0.398 | 0.462 | 0.289 | 0.357 | 0.431 | 0.429 | 0.282 | 0.376 |
| RCV1 | 0.506 | 0.000 | 0.556 | 0.129 | 0.106 | 0.365 | 0.057 | 0.106 | 0.306 | 0.066 | 0.144 | 0.181 | 0.160 |
| Pendigits | 0.665 | 0.680 | 0.712 | 0.427 | 0.730 | 0.800 | 0.566 | 0.557 | 0.707 | 0.422 | 0.422 | 0.699 | 0.857 |
| Shuttle | 0.136 | 0.267 | 0.223 | MLE | 0.290 | 0.365 | 0.010 | 0.011 | 0.160 | MLE | MLE | 0.597 | 0.293 |
| Mice Protein | 0.457 | 0.438 | 0.416 | 0.384 | 0.500 | 0.520 | 0.239 | 0.274 | 0.486 | 0.403 | 0.403 | 0.379 | 0.526 |

MLE: Memory Limit Exceeded.
| Costs on RCV1 | KM | MS | GMM | AP | RCC | RCC-DR | AC-A | AC-C | AC-W | GDL-U | AGDL | HD | SC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Time (sec) | 25 | 114 | 187 | 757 | 6906 | 458 | 64 | 64 | 64 | 26 | 21 | 359 | 22 |
| Memory (MiB) | 479 | 1279 | 1055 | 4017 | 1983 | 1356 | 935 | 935 | 935 | 1807 | 1804 | 912 | 686 |

Time = Elapsed time × Percent of CPU. Memory = Maximum resident set size.
The costs of computing graphs (RCCs, GDLs and SC), distance matrices (GDLs) and bandwidths (MS) are not included.
The RCCs and GDLs are implemented in Matlab and the others in Python. Our implementation is mainly based on scipy, numpy and networkx.
It shows that the implementation based on Sampling Clustering achieves the best results on three datasets and requires significantly fewer computing resources, while every other algorithm, including state-of-the-art ones like RCCs and GDLs, works best on at most one dataset.
GMM failed on high dimensional datasets. MNIST is the biggest dataset, and all agglomerative methods and affinity propagation are not applicable on it. GDLs and AP also failed on another big dataset, Shuttle.
We discuss the results further in Section 5.
We evaluate the measures on four datasets: MNIST, Pendigits, YaleB and YTF. The results are shown in Table IV. The number in parentheses after each dataset name indicates the average size of its ground-truth clusters.
MNIST () | | | Pendigits () | |
---|---|---|---|---|---
0.630 | 0.895 | 0.817 | 0.698 | 0.857 | 0.845
0.631 | 0.527 | 0.578 | 0.853 | 0.847 | 0.851
0.260 | 0.421 | 0.000 | 0.817 | 0.845 | 0.845
YaleB () | | | YTF () | |
0.888 | 0.900 | 0.892 | 0.790 | 0.791 | 0.791
0.916 | 0.905 | 0.891 | 0.791 | 0.790 | 0.790
0.809 | 0.886 | 0.889 | 0.791 | 0.791 | 0.791
It shows that random sampling usually leads to poor results in most combinations. The main reason is that sampling randomly slows down the process of generating tiny components, while the other measures tend to remove vertices at junctions, so vertices in the core areas are divided into pieces very quickly.
As the data volume increases, the differences between the measures become more and more significant. On MNIST, where each cluster contains thousands of digits, several measures are poorly effective, while all measures are almost equally effective on YaleB and YTF.
We analyze the robustness to the sampling rate, the condensing size, and the k-nn graph size by varying each of them over a range; the other settings are the same as in Section 4.1.4.
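For context, the k-nn graph varied here can be built directly from pairwise distances. A minimal brute-force sketch (the `knn_graph` helper and its symmetrization rule are illustrative assumptions, not the paper's construction; practical implementations use spatial indexes instead of the O(n² log n) scan below):

```python
from math import dist  # Euclidean distance, Python 3.8+

def knn_graph(points, k):
    """Adjacency sets of a k-nearest-neighbour graph; an edge is
    kept if either endpoint selects the other (symmetrization)."""
    adj = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted((j for j in range(len(points)) if j != i),
                         key=lambda j: dist(p, points[j]))[:k]
        adj[i].update(nearest)
    for i in range(len(points)):  # make the graph undirected
        for j in list(adj[i]):
            adj[j].add(i)
    return adj
```

On four collinear points with one far outlier, `k = 1` already links the three close points into a chain while the outlier attaches only to its single nearest neighbour.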
We also test the effectiveness of smoothing. Instead of running it 16 times before and after pruning as in the previous experiments, to make the effect on partitions more obvious, we run it only once before pruning to remove tiny clusters and 16 times after it. The results, shown in Fig. 14, indicate that smoothing is effective on most datasets and converges very fast.
The visualizations of the clustering results on MNIST are shown in Figs. 15 and 16. The number on a branch indicates its size, and the images are random samples drawn from it. We also visualize the results in Section 4.1 using t-SNE [25], as shown in Appendix Fig. 17.
As can be seen, even without any pruning the dendrogram is quite small compared with the binary dendrograms output by classical hierarchical methods. The pruned dendrogram shows that the algorithm first separates one digit class from the whole and then another, after which the remaining digits are roughly divided into three groups.
As shown in Table II, Sampling Clustering (SC) is compared with 12 methods and achieves the best results on 3 of the 9 datasets. It is also the second fastest method on RCV1, only slightly slower than the fastest one. The memory required is less than that of every other algorithm except k-Means. Moreover, the implementation is guaranteed to run in linear time.
Four of the datasets are image datasets: MNIST, COIL100, YTF and YaleB. SC achieves the best result on the largest of them, MNIST, and the second best on YTF. On the other two, SC is worse only than the RCCs and GDLs. Additionally, it is one of the only two hierarchical methods applicable to MNIST: neither classical nor recent state-of-the-art agglomerative methods can handle such a large dataset. As mentioned earlier, GMM is not applicable to high-dimensional datasets.
Reuters and RCV1 are text datasets, and unfortunately almost no algorithm works well on them. The main reason is that both are complicated in structure and the similarity measure, the Euclidean distance of item frequencies, is not effective and thus cannot accurately represent similarity. Similar problems exist for Shuttle and Mice Protein. The results on such datasets are strongly influenced by random factors, which is why conservative methods like k-Means can achieve passable results instead. We believe such results are not reliable and should not be used as key evaluation criteria.
We present an efficient graph-based divisive clustering approach that is shown to be effective on various types of data, and offer an introductory discussion of strategies for dividing faster and avoiding undesirable divisions. The sampling and condensing measures discussed are very basic, yet they already work very well. We believe the performance of the approach can be further improved with more effective measures, especially for specific types of data.
As far as we know, it is the first schema that tries to group data into clusters by repeatedly summarizing them, and also one of the few effective hierarchical methods that run in linear time. Existing methods may be adapted to the schema, and their performance can be greatly improved. We hope this work inspires further research.
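One sampling-then-condensing round of the kind the abstract describes can be sketched as follows. This is a minimal illustration under stated assumptions (adjacency as dict-of-sets; the condensing rule links any kept vertices that shared a removed neighbour), not the paper's exact measures:

```python
def sample(adj, keep):
    """Sampling step: drop every vertex not in `keep`."""
    return {v: adj[v] & keep for v in keep}

def condense(adj, keep):
    """Condensing step: for each removed vertex, connect the kept
    vertices that were its neighbours, so a cluster held together
    only through removed vertices does not fragment."""
    out = {v: set(n) for v, n in sample(adj, keep).items()}
    for r in set(adj) - keep:
        survivors = adj[r] & keep
        for u in survivors:
            out[u] |= survivors - {u}
    return out

def components(adj):
    """Connected components; each becomes a child in the dendrogram."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps
```

On two 3-vertex paths (latent clusters 0-1-2 and 3-4-5), keeping {0, 2, 4} removes each path's middle or end vertices; condensing re-links 0 and 2 through removed vertex 1, so the round yields two components without ever merging the latent clusters.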
In addition to clustering algorithms, the effectiveness of cluster analysis is also influenced by many other factors.
Preprocessing is the basis. The experimental results show that most algorithms cannot handle datasets like Reuters directly. One of the main reasons is that the similarity measure used is not effective, which leads to poor-quality graphs.
For hierarchical methods, post-processing is also important. In many cases a dendrogram needs to be pruned or converted into partitions. Although this step is independent of the clustering approach, most existing pruning methods, such as depth-based ones, are not very effective on our dendrograms, so we introduce two simple algorithms in this paper to make the clustering process complete. As future work, since they are based simply on the sizes of leaves, they can clearly be further optimized.
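A size-based prune of the kind just described can be sketched as follows (internal nodes as tuples of children, leaves as lists of point ids; the single-threshold collapse rule is an illustrative assumption, not the paper's exact algorithms):

```python
def flatten(node):
    """All point ids under a dendrogram node."""
    if isinstance(node, tuple):
        return [p for child in node for p in flatten(child)]
    return list(node)

def prune(node, min_size):
    """Collapse every subtree holding fewer than `min_size`
    points into a single leaf, keeping larger branches intact."""
    if not isinstance(node, tuple):
        return list(node)
    points = flatten(node)
    if len(points) < min_size:
        return points  # tiny branch becomes one leaf
    return tuple(prune(child, min_size) for child in node)
```

For example, with `min_size = 3` the subtree `([0], [1])` collapses into the leaf `[0, 1]`, while the leaf `[2, 3, 4]` is left untouched.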
We want to thank Yangyan Li for lecturing Pattern Recognition, and we are also grateful for the support from the Computer Architecture Laboratory, Shandong University. In particular, Ching Tarn would like to thank Xiaojun Cai, Mengying Zhao, Jianhua Yin, Si-Min He, and his classmates for their concern and help.
J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14. Oakland, CA, USA, 1967, pp. 281–297.
M. Brito, E. Chavez, A. Quiroz, and J. Yukich, "Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection," Statistics & Probability Letters, vol. 35, no. 1, pp. 33–42, 1997.
L. Ertöz, M. Steinbach, and V. Kumar, "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data," in Proceedings of the 2003 SIAM International Conference on Data Mining. SIAM, 2003, pp. 47–58.
R. J. Campello, D. Moulavi, and J. Sander, "Density-based clustering based on hierarchical density estimates," in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2013, pp. 160–172.
D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035.
L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 6, pp. 643–660, 2001.

Dataset | KM | MS | GMM | AP | RCC | RCC-DR | AC-A | AC-C | AC-W | GDL-U | AGDL | HD | SC
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MNIST | 0.500 | 0.430 | 0.311 | MLE | 0.893 | 0.830 | MLE | MLE | MLE | MLE | MLE | 0.356 | 0.895
COIL100 | 0.840 | 0.852 | MLE | 0.843 | 0.963 | 0.963 | 0.687 | 0.754 | 0.862 | 0.961 | 0.961 | 0.905 | 0.894
YTF | 0.789 | 0.831 | MLE | 0.783 | 0.892 | 0.890 | 0.583 | 0.681 | 0.808 | 0.699 | 0.692 | 0.846 | 0.881
YaleB | 0.658 | 0.742 | MLE | 0.799 | 0.978 | 0.975 | 0.231 | 0.479 | 0.774 | 0.967 | 0.967 | 0.696 | 0.914
Reuters | 0.535 | 0.000 | 0.538 | 0.503 | 0.556 | 0.553 | 0.533 | 0.392 | 0.487 | 0.514 | 0.515 | 0.413 | 0.538
RCV1 | 0.511 | 0.000 | 0.566 | 0.355 | 0.138 | 0.437 | 0.142 | 0.108 | 0.375 | 0.116 | 0.182 | 0.301 | 0.169
Pendigits | 0.682 | 0.738 | 0.715 | 0.648 | 0.845 | 0.851 | 0.659 | 0.584 | 0.728 | 0.593 | 0.593 | 0.738 | 0.860
Shuttle | 0.216 | 0.375 | 0.356 | MLE | 0.488 | 0.546 | 0.040 | 0.039 | 0.246 | MLE | MLE | 0.622 | 0.530
Mice Protein | 0.479 | 0.600 | 0.426 | 0.593 | 0.662 | 0.636 | 0.377 | 0.324 | 0.514 | 0.513 | 0.513 | 0.602 | 0.581
MLE: Memory Limit Exceeded. |