Sampling Clustering

06/21/2018
by Ching Tarn, et al.

We propose an efficient graph-based divisive cluster analysis approach called sampling clustering. It constructs a compact, informative dendrogram by recursively dividing a graph into subgraphs. In each recursive call, the graph is first sampled, with a set of vertices removed to disconnect latent clusters, and then condensed by adding edges between the remaining vertices to avoid the graph fragmentation caused by vertex removals. We also present several sampling and condensing methods and discuss their effectiveness. Our implementations run in linear time and achieve outstanding performance on various types of datasets. Experimental results show that they outperform state-of-the-art clustering algorithms while requiring significantly fewer computing resources.


1 Introduction

Cluster analysis is a basic and widely used unsupervised learning approach, applied in fields such as data mining and computer vision. It categorizes a set of objects into groups (or clusters) based on the similarities between them to help understand and analyze data. Clusters are aggregations of objects whose intra-cluster similarities are higher than their inter-cluster similarities.

Over the long history of research on clustering, a large number of algorithms have been proposed. However, several common limitations remain among them.

1.1 Classical Partitional Methods

Classical center-based algorithms like k-Means [1] use central vectors to partition the data space, so they may lack the ability to find arbitrarily shaped clusters. Many of them are also very sensitive to center initialization. Distribution-based algorithms, including Gaussian Mixture Models, have similar shortcomings. In addition, for spatial data, the number of parameters to be determined grows with the dimension, and the running time and memory required may become unacceptable.

Density-based clustering methods define clusters as contiguous regions with high density. DBSCAN [2] assumes that the clusters have homogeneous densities: a global threshold needs to be specified in advance to distinguish between high-density and low-density areas. It judges whether two regions belong together purely from connectivity, which can result in the "single-link effect" [3] and lead to undesirable combinations. Another critical problem is that, for some applications, there may be no single global threshold that separates all clusters at once, which makes such tasks unsolvable for DBSCAN.

The algorithms above generate disjoint partitions of the dataset, and nesting is not allowed. However, in many cases, real data are hierarchical. As shown in Fig. 1, it is hard, or even impossible, to find an absolutely reasonable and unambiguous division of such data. In these cases, a possible way out is to generate multiple partitions of the same dataset with different granularities to construct a hierarchy, but this requires multiple runs and is difficult to handle.

Fig. 1: (a): an example; (b): a partition that divides the data into two clusters; (c): another partition that divides the data into four clusters.
Obviously, since the data distribution is hierarchical, neither partition can fully represent the structure of the data. In order to obtain sufficient structural information, a partitional algorithm must be run many times. Moreover, it is also difficult to build a hierarchy from multiple partitions.

1.2 Agglomerative and Divisive Methods

Hierarchical algorithms are traditionally divided into two categories, agglomerative methods and divisive ones. Such algorithms generate hierarchical structures called dendrograms, which are much more informative than non-nested partitions. When necessary, a dendrogram can also be converted into partitions with different numbers of clusters as needed without multiple runs.

Typically, agglomerative methods initially treat each object as a cluster, or start with a large number of tiny clusters, and then merge them in pairs. The resulting dendrogram is usually very deep and large, which makes it difficult to analyze. In fact, in most cases, low-level branches are not very meaningful and are often pruned to make the dendrogram easier to analyze, which is itself a challenging task.

Moreover, because clusters are merged in pairs, the dendrograms are usually binary and thus cannot represent the real structure accurately. For example, given a cluster consisting of more than two nearly identical subclusters, the cluster cannot be built by merging all subclusters simultaneously, and the structure of the dendrogram largely depends on the order in which merges are selected. Fig. 2 depicts one such case.

Fig. 2: (a): an example; (b): a typical binary dendrogram generated by an agglomerative method; (c): a more desirable dendrogram.
As can be seen, the dataset consists of three similar parts: A, B and C. Although the similarities between them are approximately equal, a binary dendrogram still groups A and B into one bigger cluster first, which falsely suggests that A and B are much more similar to each other than either is to C. The second dendrogram is non-binary and provides a more accurate representation.

For many agglomerative methods, to select a pair of clusters for the next merge, the similarities between every two clusters must be calculated. The complexity of an agglomerative method is therefore often very high, at least quadratic in the number of objects, which can be unacceptable for a mass of data.

In general, a divisive method initially groups all objects into one cluster, and divides each cluster into two subclusters recursively.

Compared to the large number of agglomerative algorithms, divisive methods are fewer. One major reason is that separation is usually much more difficult than mergence. For an agglomerative method, given m subclusters, at most m(m-1)/2 comparisons are required to find the two most similar ones to merge. Correspondingly, there are 2^(n-1) - 1 possible divisions that split a cluster of size n into two non-empty subclusters.

Divisive methods share similar problems with agglomerative ones. Besides the difficulty of finding a proper division, the stopping condition (e.g., the commonly used minimum cluster size) is usually hard to define as well. However, with a well-defined termination, a large number of unnecessary divisions can be avoided. Unlike agglomerative methods, this makes it possible to output smaller, more analyzable dendrograms directly, without any pruning.

1.3 Graph-based Methods

Many methods require the data to be spatial, which limits their applications. Graph-based methods use similarity graphs, and usually transform spatial data into nearest neighbor graphs (e.g., k-NN, mutual k-NN [4] and XNN [5]). Although the transformation may cause loss of information, processing on graphs still brings many significant benefits. 1) There are metrics in graph theory, like reachability, that can measure the similarities between objects better than traditional ones (e.g., the Euclidean distance). 2) Querying the most similar objects is often faster on a graph. 3) It has been shown in [6] that nearest neighbor graphs are proper and efficient representations for data lying on a low-dimensional manifold embedded in a high-dimensional space.

1.4 The Advantages of Sampling Clustering

With these considerations, and inspired by [7], we propose a graph-based divisive clustering approach called Sampling Clustering. It recursively divides a graph into subgraphs by removing vertices from the graph.

The approach has the following advantages.

  • A non-binary dendrogram is generated, which is much more representative than non-nested partitions and binary dendrograms.

  • The division stops in time, and there are fewer undesirable splits. The dendrogram is small and easy to analyze.

  • It works well and outperforms state-of-the-art algorithms on various types of datasets.

  • Only a few hyper-parameters are required and none of them plays a key role.

  • Computing resources required are significantly less, and the time complexity is usually linear. It is also easily parallelizable.

  • The approach is very simple, and can be extended to handle domain-specific data.

The paper is organized as follows. We present details of the approach in Section 2, and compare it with other methods theoretically in Section 3. Section 4 shows the experimental results. (The implementation and other helpful resources are available at http://res.ctarn.io/sampling-clustering; a published reproducible version is available at https://doi.org/10.24433/CO.1783197.v1.) We discuss the experiments further in Section 5. The conclusions are in Section 6.

2 The Approach

Fig. 3: (a): the original graph; (b): graph after first summarization; (c): graph after second summarization.
The figure shows that, after the first operation, some less important details are removed from the graph, and the main body becomes simpler and smaller. It is then simplified further. The second summarization removes weak connections inside the graph and clearly shows that the graph is made up of two parts.

The main idea of this approach is that, a big graph can be summarized as a smaller one that still retains its main structure but has fewer details. The smaller graph is easier to analyze, and can be further summarized again. By summarizing repeatedly, as shown in Fig. 3, the global structural information can be gradually revealed.

In this process, a graph can be divided into several disconnected subgraphs. Like traditional methods, connectivity can be used as the criterion to group objects into clusters. Each subgraph, that is, a connected component, is treated as a subcluster. And thus, a hierarchy can be obtained by splitting recursively.

Summarization can be further divided into two steps: selecting a set of vertices from the original graph as representatives, and then building a new graph by reconnecting these vertices.

Specifically, as shown in Fig. 4, given a similarity graph of the data, the approach constructs a dendrogram by dividing the graph into subgraphs recursively.

Fig. 4:

An ideal flow diagram. First the dataset is converted into a graph; given a spatial dataset, the nearest neighbor graph can be used. The graph is sampled, condensed and partitioned into smaller graphs, and the same operations are performed on each subgraph recursively until the termination condition is met. For sampling, a set of vertices is removed from the graph, and the remaining ones are treated as representatives. Each removed vertex is bound to a remaining one and belongs to the same cluster as it. After the next sampling, these remaining vertices may themselves be removed and bound to other vertices. When the termination condition is met, the last remaining vertices are grouped into clusters directly, and each removed vertex can always find a corresponding vertex that has been classified and is grouped into the same cluster as that vertex. After association, condensing is applied to reconnect the remaining vertices tightly and avoid the fragmentation of a cluster. Partition is performed after condensing: it splits the graph into subgraphs based on connectivity.

In each recursive call, the graph is sampled first to disconnect latent subclusters. The selected vertices are removed from the graph along with their edges.

After each sampling, the density of the graph may decrease and the connections between vertices become weaker. The remaining edges are usually not enough to connect the vertices stably and maintain the structure; multiple samplings may even divide the graph into many tiny parts unnecessarily. To avoid the fragmentation of a cluster, condensing is applied to the sampled graph. It keeps the connections stable by finding new nearest neighbors for each remaining vertex and connecting them, filling the vacancies left around them.

The recursive process terminates if no vertex or all vertices of a graph are removed by sampling.

After sampling, if the process is not terminated, each vertex to be deleted from the graph is associated with a nearest remaining vertex as its representative. Otherwise, if terminated, the call returns the set of all vertices of the graph along with the vertices associated, perhaps indirectly, with them.

For association, since the graph is connected and there exists at least one remaining vertex, it is guaranteed that each vertex can always find a nearest vertex that is not to be removed. If the initial input is not connected, we first divide it into connected subgraphs without sampling.

An overview is given in Algorithm 1.

procedure Cluster(G)
     S ← the set of vertices to be removed (sampling)
     if S = ∅ or S = V(G) then
          return a leaf of the dendrogram containing V(G) and the vertices associated with them
     else
          for all v ∈ S do
               u ← a nearest vertex of v that is not to be removed
               associate v to u
          end for
          G' ← the new graph on V(G) \ S (condensing)
          D ← an empty dendrogram node
          for all connected subgraphs G_i of G' do
               D_i ← Cluster(G_i)
               append D_i to D as a child
          end for
          return D
     end if
end procedure
Algorithm 1 Sampling Clustering
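As a concrete illustration, the following Python sketch (not the authors' implementation) assembles the recursion with deliberately simple placeholder choices: random sampling, association by a breadth-first search that ignores edge direction, a naive two-hop condensing step, and weakly connected components for partitioning. It assumes the input is a networkx DiGraph whose nodes are object ids and that each (sub)graph passed in is weakly connected.

import random
import networkx as nx

def sampling_clustering(graph, rate=0.5, seed=0):
    """Minimal sketch of the schema; returns a dendrogram as nested lists of object ids."""
    rng = random.Random(seed)

    def sample(g):
        # placeholder: random sampling (Section 2.2 describes better measures)
        r = int(rate * g.number_of_nodes())
        return set(rng.sample(sorted(g.nodes), r))

    def associate(g, removed):
        # multi-source BFS from the remaining vertices, ignoring edge direction
        label = {v: v for v in g if v not in removed}
        frontier = list(label)
        while frontier:
            nxt = []
            for u in frontier:
                for v in set(g.successors(u)) | set(g.predecessors(u)):
                    if v not in label:
                        label[v] = label[u]
                        nxt.append(v)
            frontier = nxt
        return label

    def condense(g, removed):
        # naive condensing: reconnect each remaining vertex to remaining
        # vertices reachable within two hops of the original graph
        kept = [v for v in g if v not in removed]
        h = nx.DiGraph()
        h.add_nodes_from(kept)
        for v in kept:
            for u in g.successors(v):
                for w in ([u] if u not in removed else g.successors(u)):
                    if w != v and w not in removed:
                        h.add_edge(v, w)
        return h

    def recurse(g, members):
        removed = sample(g)
        if not removed or len(removed) == g.number_of_nodes():
            return sorted(x for v in g for x in members[v])          # leaf
        label = associate(g, removed)
        for v in removed:
            members[label[v]].extend(members.pop(v))                 # association
        h = condense(g, removed)
        return [recurse(h.subgraph(part).copy(), {v: members[v] for v in part})
                for part in nx.weakly_connected_components(h)]       # partition

    return recurse(graph, {v: [v] for v in graph})

The indegree sampling and BFS-order condensing sketched later in this section could replace these placeholder steps.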

In the rest of this section, we describe these steps in more detail. For summarization, we present a set of sampling and condensing methods, and introduce some measures to evaluate them to help choose or design one method that suits a specific application. We also provide a fine-tuning method to further improve the quality of a dendrogram, and simple algorithms to convert a dendrogram into partitions. In the end we briefly discuss the complexity of the approach.

For the convenience of the following description, assuming a directed graph G = (V, E), core concepts are defined as below.

Definition 1 (neighbor).

A vertex u is a neighbor of a vertex v if and only if (v, u) ∈ E.

Definition 2 (neighborhood).

The neighborhood of a vertex v is the set of all neighbors of v, denoted as N(v).

In addition, we denote as .

Definition 3 (dendrogram).

A dendrogram is a tree whose leaves are disjoint subsets of V.

2.1 Graph Construction

As described in Section 1.3, transforming datasets into graphs brings many benefits. In addition, a graph is a flexible data structure and can be easily edited or built. Converting data from other forms into a graph is usually much easier than the reverse; for example, building a spatial dataset from a similarity graph is usually more difficult than transforming a spatial dataset into a similarity graph (e.g., a nearest neighbor graph). This means that a graph-based method can accept more types of data and thus be applied in more fields.

The method of constructing a similarity graph usually depends on the original data type and the distance, or similarity, metric used. For general spatial data, in addition to brute-force search, fast indexing structures such as KD Tree [8] or Ball Tree [9] can be used to speed up the search of nearest neighbors.

Although the following discussions are mainly based on unweighted directed nearest neighbor graphs, the approach can also be applied to various types of graphs, such as undirected or weighted graphs. Some of the methods introduced below may need to be slightly adjusted.
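For instance, a 16-nn graph for spatial data could be built with scikit-learn's tree-based neighbor search; the sketch below is ours (the helper name knn_graph and the choice of a networkx DiGraph are assumptions, not part of the paper).

import numpy as np
import networkx as nx
from sklearn.neighbors import NearestNeighbors

def knn_graph(X, k=16):
    """Unweighted directed k-nn graph: each point gets an edge to its k nearest
    neighbors under the Euclidean distance (KD tree or ball tree chosen automatically)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point returns itself
    _, idx = nn.kneighbors(X)
    g = nx.DiGraph()
    g.add_nodes_from(range(len(X)))
    for i, row in enumerate(idx):
        for j in [j for j in row if j != i][:k]:
            g.add_edge(i, int(j))
    return g

# example: g = knn_graph(np.random.rand(1000, 2), k=16)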

2.2 Sampling

Sampling vertices from a graph brings several benefits. 1) By reducing the number of vertices, both analysis and computation become easier; with a proper subset of the graph, the main structural information can be well maintained. 2) Multiple samplings can break the weak connections between components caused by noise. This avoids common problems of linkage methods, like the "single-link effect", and makes the approach more tolerant to noise. 3) More importantly, the sequence of separations contains rich structural information and can be used to reconstruct the hierarchical structure of the data.

The sampling algorithm is described as Algorithm 2. It uses a measure to calculate a score for each vertex indicating whether it should be deleted, and returns the set of vertices with the lowest scores, i.e., the vertices that should be removed. A sampling rate controls the number of vertices to be removed.

for all vertices v ∈ V do
     score(v) ← m(v)
end for
r ← the number of vertices to be removed (sampling rate times |V|)
t ← the r-th smallest score (threshold)
S ← ∅
for all vertices v ∈ V do
     if score(v) ≤ t then
          S ← S ∪ {v}
     end if
end for
return S
Algorithm 2 Sampling
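A sketch of this step in Python, using the indegree measure of Definition 4 on a networkx DiGraph (sorting is used instead of a linear-time selection for brevity; the mutual-neighbor score of Definition 5 is included as an alternative; the helper names are ours):

import networkx as nx

def indegree_sampling(g, rate=0.5):
    """Return the set of vertices to remove: the `rate` fraction with the
    lowest indegree (boundary vertices of a k-nn graph tend to have small indegree)."""
    ordered = sorted(g.nodes, key=g.in_degree)      # lowest scores first
    r = int(rate * g.number_of_nodes())
    return set(ordered[:r])

def mutual_neighbor_score(g, v):
    """Alternative score: the number of mutual neighbors of v (successors
    that also point back to v)."""
    return sum(1 for u in g.successors(v) if g.has_edge(u, v))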

A basic sampling measure assigns a random score to each vertex. Another type of measure tends to remove vertices at the junctions of clusters instead of vertices in core areas, so that the latent clusters can be disconnected quickly and undesirable divisions can be avoided. Such measures can be based on boundary detection, as in [10]. It is shown in [10] that a vertex at a junction usually has a smaller indegree than others. In particular, for a k-nn graph, the outdegree of each vertex is k, and the indegree of a vertex at the boundaries is usually less than k, as shown in Fig. 5.

Fig. 5: Indegree of each vertex on a 16-nn graph [11]. It shows that the indegree of a vertex on the boundaries is usually smaller than that of the others.
Definition 4 (indegree sampling).

The score of a vertex v is its indegree, i.e., the number of vertices whose neighborhoods contain v.

Another similar measure, defined as Definition 5, identifies boundaries based on the number of mutual neighbors of each vertex.

Definition 5 (mutual neighbor sampling).

The score of a vertex v is its number of mutual neighbors, i.e., |{u ∈ N(v) : v ∈ N(u)}|.

To measure the effectiveness of a sampling method, first we formally define the vertices at junctions as Definition 6.

Definition 6 (positive vertex).

A vertex v is at a junction if and only if v and some neighbor u ∈ N(v) belong to different clusters. A vertex not at a junction is called a positive vertex.

A sampled graph should be more separable. We use the proportion of positive vertices of the graph to measure its separability.

Definition 7 (vertex positivity).

The proportion of positive vertices of a graph is called its vertex positivity.

To avoid being affected by the decreasing neighborhood size, we label the positivity of each vertex before sampling and only recalculate the proportion afterwards. Obviously, sampling randomly does not change the proportion of positive vertices. Fig. 6 shows that both the indegree and the mutual neighbor sampling measures increase the vertex positivity of a graph significantly.

Fig. 6: Vertex positivity evaluated with a 16-nn graph built from MNIST [12]. It shows that both the indegree and the mutual neighbor sampling measures increase the vertex positivity of the graph significantly.
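When ground-truth labels are available for evaluation, the vertex positivity of Definition 7 can be computed directly; the small sketch below is ours and assumes a positive vertex is one whose out-neighbors all share its label.

def vertex_positivity(g, label):
    """Fraction of vertices whose successors all carry the vertex's own
    ground-truth label; `label` maps vertex id -> class id."""
    positive = sum(all(label[u] == label[v] for u in g.successors(v)) for v in g)
    return positive / g.number_of_nodes()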

2.3 Association

Since only the last remaining vertices are grouped into clusters directly, in order to completely cluster all objects, the rest, which have been removed, should be bound to these classified ones. As shown in Fig. 7, after each sampling, the vertices to be removed are associated with remaining ones. For each deleted one, there is always a vertex (indirectly) bound with it that has never been removed, and they are grouped into the same cluster.

Fig. 7: (a): an example; (b): associations between removed vertices and remaining ones after first iteration; (c): associations after second iteration.
The figure shows that, the first sampling removes all but 6 vertices, marked in gray. They are associated with a nearest remaining vertex respectively. After the second iteration, 4 of the 6 vertices are removed, and are associated with the last two that have not been removed yet. Assuming that the two vertices are divided into two clusters, other ones are also grouped into the two clusters respectively.

A simple method to query a nearest remaining vertex, described as Algorithm 3, is a breadth-first search starting from each vertex to be removed; in general, it is also very fast.

for all vertices v ∈ S do (vertices to be removed)
     Q ← a queue containing only v
     mark v as visited
     while v has not been associated do
          u ← pop a vertex from Q
          for all edges (u, w) do
               if w is not to be removed then
                    associate v to w
                    break
               else if w has not been visited then
                    push w into Q and mark it as visited
               end if
          end for
     end while
end for
Algorithm 3 Simple BFS Association

However, the worst-case performance of this method is quadratic. For example, given a long chain of n vertices in which only the vertex at one end is not marked for deletion, the total number of edge visits is on the order of n².

In order to theoretically guarantee that the algorithm is linear in every case, we introduce an equivalent algorithm as an alternative. Instead of running a breadth-first search from each vertex to be removed, we start a breadth-first search from all remaining vertices at the same time, and only search once. Since each edge is visited at most once, the search finishes in linear time. We call this method multi-source breadth-first search, and the details are shown in Algorithm 4.

for all vertices u ∉ S do (vertices not to be removed)
     label(u) ← u (label it as itself)
     push u into the queue Q
end for
while not every vertex has been labeled do
     u ← pop a vertex from Q
     for all edges (u, v) do
          if v has not been labeled then
               label(v) ← label(u)
               push v into Q
          end if
     end for
end while
for all vertices v ∈ S do
     associate v to label(v)
end for
Algorithm 4 Multi-source BFS Association
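A possible Python rendering of this association step (our helper; it treats edges as undirected during the search so that every removed vertex of a weakly connected graph is reachable, a detail the pseudocode leaves open):

from collections import deque

def associate(g, removed):
    """Bind every removed vertex to a remaining representative with one
    multi-source BFS started from all remaining vertices at once.
    Returns {removed vertex: remaining representative}."""
    label = {v: v for v in g if v not in removed}
    queue = deque(label)
    while queue:
        u = queue.popleft()
        for v in set(g.successors(u)) | set(g.predecessors(u)):
            if v not in label:
                label[v] = label[u]     # inherit the representative
                queue.append(v)
    return {v: label[v] for v in removed}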

2.4 Condensing

A sampled graph is usually sparser, and the connections between vertices become weaker. Multiple samplings may break the graph into pieces. We condense the graph by increasing the number of edges to avoid the graph fragmentation caused by vertex removals. For a sampled nearest neighbor graph, we search for new nearest vertices of each remaining vertex and connect it with its new neighbors.

For a vertex v, we use a breadth-first search starting from v to find a set of candidates. To avoid searching too deep unnecessarily, we limit the depth of the search: it stops once enough candidates have been found and the searching depth exceeds the limit. A measure is then used to calculate a score for each candidate indicating the distance, or dissimilarity, between it and v.

The new neighborhood of a vertex consists of the candidates with the highest similarities, and a new graph is built from the edges selected by these scores. The algorithm is described as Algorithm 5.

E' ← an empty edge set
for all vertices v ∈ V do
     C ← ∅ (candidates)
     initialize a BFS iterator starting from v
     while not enough candidates have been found or the depth of the BFS is within the limit do
          u ← the next vertex of the BFS iterator
          if there is no such u then
               break (there is no reachable vertex)
          else if u ≠ v then
               score(u) ← m(v, u)
               C ← C ∪ {u}
          end if
     end while
     k' ← min(k, |C|) (the actual amount)
     N ← the k' vertices in C with the lowest scores
     for all vertices u ∈ N do
          E' ← E' ∪ {(v, u)} (new edges)
     end for
end for
return E'
Algorithm 5 Condensing
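A Python sketch of this step with the BFS-visiting-order measure described next (our function; k and depth_limit are illustrative parameters, not the paper's exact settings):

from collections import deque
import networkx as nx

def condense(g, removed, k=16, depth_limit=2):
    """Reconnect every remaining vertex to the first k remaining vertices met
    by a depth-limited BFS through the original (still complete) graph."""
    kept = [v for v in g if v not in removed]
    h = nx.DiGraph()
    h.add_nodes_from(kept)
    for v in kept:
        seen, frontier, found, depth = {v}, [v], [], 0
        while frontier and depth < depth_limit and len(found) < k:
            nxt = []
            for u in frontier:
                for w in g.successors(u):
                    if w not in seen:
                        seen.add(w)
                        nxt.append(w)
                        if w not in removed:
                            found.append(w)     # candidate in visiting order
            frontier, depth = nxt, depth + 1
        h.add_edges_from((v, w) for w in found[:k])
    return h

The shared-neighbor variant would instead rank the candidates by the Jaccard index of their neighborhoods with N(v).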

The simplest measure is based on the breadth-first search visiting sequence: we simply assign its visiting index to each candidate as its score, so the first vertices visited are selected.

Another measure is the shared-neighbor measure. It has been shown in [13] that a high similarity between two neighborhoods also indicates that the two vertices are similar. We use the Jaccard index, also known as Intersection over Union (IoU), to measure the similarity between two neighborhoods.

Definition 8 (Jaccard index).

For two vertices u and v, J(u, v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|.

For spatial data, distance metrics (e.g., the Euclidean distance) can also be used in some applications.

Effective condensing methods should maintain the structure of the graph well, which can be considered from two aspects. 1) Inter-cluster connections should be avoided: a vertex should belong to the same cluster as its neighbors to keep the clusters separable. 2) To avoid a cluster being divided into pieces, the vertices in it need to be strongly connected.

For the first consideration, we use a measure similar to Definition 7, called the edge positivity of a graph.

Definition 9 (positive edge).

An edge (u, v) is positive if and only if u and v belong to the same cluster.

Definition 10 (edge positivity).

The proportion of positive edges of a graph is called its edge positivity.

For the second one, the average connectivity of a graph [14], defined as follows, is used to measure the strength of a cluster’s internal connections. It can also be used to evaluate the stability of a sampling method.

Definition 11 (graph connectivity).

The connectivity of a graph G with n vertices is measured with its average connectivity

κ̄(G) = (2 / (n(n − 1))) · Σ_{u ≠ v} κ(u, v),

where κ(u, v) is the value of the maximum flow from u to v.
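For reference, networkx exposes this quantity as the average of pairwise local connectivities, which coincides with the max-flow formulation for unweighted graphs; it requires a max-flow computation per vertex pair, so it is only practical for small evaluation graphs.

import networkx as nx

g = nx.random_regular_graph(4, 60, seed=0)   # a small test graph
print(nx.average_node_connectivity(g))       # mean of kappa(u, v) over all pairs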

Fig. 8 shows that graph-based measures are usually more effective than geometry-based ones.

Fig. 8: Edge positivity evaluated with a 16-nn graph built from MNIST. The condensing methods based on the different measures are applied to the graph independently. In each iteration, a fraction of the vertices is randomly removed from the graph. It shows that both the BFS-visiting-order and the shared-neighbor measures are significantly more effective than the Euclidean distance measure.

Although the shared-neighbor measure is very effective, it is not a stable condensing method on a big graph. When condensing is repeated multiple times on a graph, it tends to divide a component into lots of tiny parts. Such a part is usually small, and the vertices in it are almost fully connected, while the inter-part connections are extremely weak. The intra-cluster connections are destroyed and the clusters are divided into pieces.

Fig. 9 shows that the connectivity of the graph decreases rapidly after a number of iterations if the shared-neighbor measure is employed. In fact, if a k-nn graph is not sampled after each iteration, tiny parts are generated even faster, while the BFS-visiting-order measure does not change the graph at all in that case.

Fig. 9: Connectivity evaluated with a random graph in which each vertex is randomly connected to a fixed number of other vertices. In each iteration, condensing is applied and a fraction of the vertices is randomly removed from the graph. It shows that the condensing method based on the BFS visiting order maintains the connectivity well, while for the shared-neighbor-based method the connectivity decreases rapidly after a number of iterations.

2.5 Partition

At the end of each recursive call, the processed graph is partitioned into several subgraphs, which generates a branch of the dendrogram. Since the similarities between elements can be measured by reachability, the graph can simply be divided into connected components. For data that are difficult to partition, in order to split them more thoroughly, the graph is divided into strongly connected components; otherwise, or if the graph is undirected, weakly connected components can be used instead.

Additionally, dividing the graph into components also makes sure that all subgraphs to be sampled in the next iteration are connected, which guarantees that each vertex to be removed can be bound to a remaining one.
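With networkx, this partition step can be written in a few lines over the condensed graph (a sketch; the helper name partition is ours):

import networkx as nx

def partition(g):
    """Split a condensed directed graph into its strongly connected components,
    each treated as a subcluster."""
    return [g.subgraph(c).copy() for c in nx.strongly_connected_components(g)]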

2.6 Pruning

Pruning is used to reduce the number of leaves of a dendrogram and generate a partition from it. We provide two simple pruning algorithms.

The first algorithm strictly conforms to the original structure of the dendrogram. A node can be merged if and only if it is an end branch, as defined in Definition 12.

Definition 12 (end branch).

An end branch of a dendrogram is a node all of whose children are leaves.

The size of a node is defined as the total number of objects in its descendant leaves, or in the node itself if it is a leaf.

The smallest end branch is merged into one leaf first. The algorithm stops if the next merge would cause the number of leaves to fall below the desired value, or if the largest leaves already contain most of the objects. It is described as Algorithm 6.

while the largest leaves do not yet contain enough objects do
     B ← the smallest end branch of the dendrogram
     m ← the number of leaves after merging B
     if m is less than the desired number of leaves then
          break
     end if
     L ← a new empty leaf
     for all leaves of B do
          move their objects to the new leaf L
     end for
     replace B with L
end while
Algorithm 6 Hard Pruning

The second algorithm allows two leaves belonging to the same parent to be merged directly. A leaf is moved down to find a leaf brother, as shown in Fig. 10, if it has a brother and that brother is not a leaf. The detailed description is shown in Algorithm 7. It generates a more balanced partition with the number of clusters being precisely controlled. However, it does not conform strictly to the original structure of the dendrogram and does not work well on unbalanced data.

Fig. 10: Assuming that A is the smallest leaf of the dendrogram, and the branch B is the smallest brother of A. Since B is a branch and cannot be merged with A directly, A is moved into B so that it may be merged with a leaf child of B. The leaf A may be moved down multiple times, until its smallest brother is a leaf.
while the number of leaves is greater than the desired value do
     A ← the smallest leaf of the dendrogram
     P ← the parent of A
     if P has more than one child then
          B ← the smallest brother of A
          if B is a leaf then
               merge A and B
          else (B is a branch)
               move A into B
          end if
     end if
     if P has only one child then
          replace P with its child
     end if
end while
return the dendrogram
Algorithm 7 Soft Pruning

2.7 Fine-tuning

We also introduce a simple method called smoothing to adjust dendrograms. It is based on two observations: first, some association trees generated after sampling may cross slightly, so that a small number of vertices, especially at junctions, are grouped into clusters different from those of their neighbors; second, a few isolated vertices may become clusters unexpectedly, and such tiny leaves should also be removed.

We simply regroup each vertex into the most common cluster in its neighborhood. This operation can be repeated multiple times, and our experiments show that it improves the quality of dendrograms on most datasets and also converges very quickly.
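A minimal sketch of this pass (our helper; whether a vertex's own current label counts as a vote is an implementation detail we expose as a flag):

from collections import Counter

def smooth(g, assignment, rounds=16, include_self=False):
    """Repeatedly reassign every vertex to the most common cluster among its
    out-neighbors; `assignment` maps vertex id -> cluster id."""
    labels = dict(assignment)
    for _ in range(rounds):
        updated = {}
        for v in g:
            votes = Counter(labels[u] for u in g.successors(v))
            if include_self:
                votes[labels[v]] += 1
            updated[v] = votes.most_common(1)[0][0] if votes else labels[v]
        labels = updated
    return labels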

2.8 Complexity Analysis

Finally, we briefly analyze the complexity of the approach. Given a k-nn graph with n vertices, we first discuss the complexity of each step in one recursive call.

2.8.1 Sampling

Assuming a sampling measure whose score can be computed in O(s) time per vertex, calculating the scores costs O(sn). A linear-time selection method is used to find the threshold, which costs O(n). Therefore, the complexity of sampling is O(sn + n).

Lemma 1.

The complexity of sampling is O(sn + n).

2.8.2 Association

As described in Section 2.3, the worst-case performance of the multi-source breadth-first search method is O(kn), since each of the kn edges is visited at most once.

Lemma 2.

The complexity of association is O(kn).

2.8.3 Condensing

With the searching depth limited to a constant, the number of candidates for each vertex is bounded by a constant that depends only on k and the depth limit, and thus both calculating and sorting the scores cost only O(c) per vertex, where c is the complexity of the condensing measure.

Lemma 3.

The complexity of condensing is O(cn) for fixed k and depth limit.

2.8.4 Partition

Methods for computing strongly or weakly connected components also run in linear time, i.e., O(kn) for a k-nn graph.

Lemma 4.

The complexity of partition is O(kn).

Therefore, the complexity of one recursive call is linear in the size of its input graph, i.e., O((s + c + k) n') for a subgraph with n' vertices.

Since the total size of all subgraphs over all recursion levels is at most n/α, where α is the sampling rate and 0 < α < 1, the complexity of the approach is O((s + c + k) n / α).

Theorem 1.

The complexity of Sampling Clustering is O((s + c + k) n / α).

The sampling and condensing measures introduced above all cost constant time per vertex for fixed k, and thus the complexity of an implementation based on them is linear.

Theorem 2.

When both the sampling and the condensing measures cost constant time per vertex, the complexity of Sampling Clustering is O(n) for fixed k and α.

3 Related Work

We compare Sampling Clustering with other cluster analysis methods theoretically in this section.

Almost all existing algorithms, including the classical methods mentioned above, can be classified into two categories. Methods of the first type operate on static original data distributions or graphs. Methods of the second type are usually iterative and adjust the distribution of the objects or the structure of a graph dynamically.

3.1 Methods Processing on Static Distributions

Global characteristics matter. Methods using static distributions often lack the ability to reveal global characteristics. A typical example is DBSCAN [2]: as mentioned in Section 1.1, since the method groups objects into a cluster based only on local connectivity, DBSCAN often merges independent clusters by mistake. Other derivatives of DBSCAN [2], like HDBSCAN [15], do not solve this very well either. Iterative algorithms based on static modeling, including k-Means [1], k-Means++ [16], k-Medoids [17], k-Medians [18], Gaussian Mixture Models and BIRCH [19], perform better in this regard. However, since their modeling abilities are also limited, most of them are only suitable for specific types of distribution, mainly convex structures. Additionally, many of them, especially BIRCH, are sensitive to parameters and thus require a good understanding of the data.

HCS [20], Highly Connected Subgraphs, is a positive example. It recursively splits a graph into two subgraphs based on minimum cuts, which is a good global measure. Although the minimum cut criterion has its own weaknesses [21], HCS is more tolerant to noise. However, computing minimum cuts multiple times is obviously a time-consuming task.

Chameleon Clustering [22] is an effective agglomerative method. It groups data into a large number of tiny subclusters and repeatedly merges two clusters that are relatively close and interconnected. Unfortunately, the metrics used to measure the similarity and interconnectivity between pairs of clusters only perform well in low-dimensional spaces.

GDL [23], Graph Degree Linkage, is another graph-based agglomerative algorithm. It uses indegrees and outdegrees to measure the similarities between clusters and merges them in pairs. Its parameters are usually difficult to specify properly and often require multiple runs to find usable settings. In practice, however, it is often hard to evaluate the quality of a clustering result without ground-truth labels, so there is no reliable way to assess the parameter settings. Moreover, just like most agglomerative methods, it is very time-consuming and not applicable to big datasets.

3.2 Methods Adjusting Distributions Dynamically

RCC [24], Robust Continuous Clustering, is an iterative method that expresses clustering as the optimization of a continuous objective. The method associates each data point with a representative and optimizes the representatives to reveal the structure of the data distribution. In the optimization process, the representatives gradually gather into several clusters. It is fast and also works well in high-dimensional spaces. Another notable feature is that the number of clusters does not need to be specified in advance; however, this also makes the granularity of the clustering uncontrollable. Even worse, since the optimized distribution cannot easily be interpreted further, it is hard to split or merge existing clusters manually.

There are also a set of cluster analysis methods based on dimensionality reduction. They either require a lot of computing resources (e.g., t-SNE [25]), or perform poorly.

3.3 Sampling Clustering

As far as we know, the algorithm schema proposed in this paper, Sampling Clustering, is the first approach that adjusts the structure of a graph dynamically by removing vertices repeatedly. The approach converts graphs into smaller ones without losing much structural information. This allows the global characteristics to be revealed easily, with significantly fewer computing resources required.

4 Experiments

4.1 Comparison with Other Algorithms

4.1.1 Datasets

We follow the experiments in [24], Robust Continuous Clustering, and use the datasets preprocessed and publicly provided by the authors.

For YaleB [26], we only use the frontal face images processed using gamma correction and a DoG filter. For YTF [27], the video frames of the first 40 subjects sorted in chronological order are used. For Reuters-21578, the train and test sets of the Modified Apte split are used, and categories with fewer than five instances are not considered. For RCV1 [28], the target clusters are defined as the four root categories, and we use a random subset of 10,000 instances. For the text datasets, Reuters and RCV1, only the 2,000 most frequently occurring word stems are considered. There is no additional preprocessing for the other datasets. Unlike [24], for methods other than Robust Continuous Clustering, the features are not normalized or reduced to a low dimension.

A brief summary is shown in Table I.

Dataset Instances Dimensions Classes
MNIST[12] 70,000 784 10
COIL100[29] 7,200 49,152 100
YaleB[26] 2,414 32,256 38
YTF[27] 10,056 9,075 40
Reuters 9,082 2,000 50
RCV1[28] 10,000 2,000 4
Pendigits[30] 10,992 16 10
Shuttle 58,000 9 7
Mice Protein[31] 1,077 77 8
TABLE I: Datasets

4.1.2 Baselines

We compare Sampling Clustering (SC) with both partitional methods, including k-Means++ (KM) [16], Mean Shift (MS) [32], Gaussian Mixture Models (GMM), Affinity Propagation (AP) [33] and Robust Continuous Clustering (RCC, RCC-DR), and hierarchical methods, including three classical agglomerative algorithms (AC-Average, AC-Complete, AC-Ward), Graph Degree Linkage (GDL-U, AGDL) [23] and Hierarchical DBSCAN (HD) [15].

For RCCs (RCC, RCC-DR) and GDLs (GDL-U, AGDL), we use implementations publicly provided by the authors, and for others, we use scikit-learn and scikit-learn-contrib.

4.1.3 Measures

We use the adjusted mutual information (AMI) [34] provided by scikit-learn to evaluate all algorithms. (The measure used in [24] is NMI; results evaluated with NMI are also provided in Table V in the Appendix.)
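Both scores are available in scikit-learn and compare label assignments only up to a permutation of cluster ids, for example:

from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 0, 2, 2]          # same grouping, different cluster ids
print(adjusted_mutual_info_score(labels_true, labels_pred))    # 1.0
print(normalized_mutual_info_score(labels_true, labels_pred))  # 1.0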

4.1.4 Settings

  • Sampling Clustering: The Euclidean distance metric is used to build a k-nn graph of each dataset. We use indegree sampling and the condensing method based on the BFS visiting sequence. Although the shared-neighbor measure is not stable and may divide a graph into pieces after multiple runs, it is still usable and very effective, so we run condensing with it on the input only once to improve the quality of the graph. The parameters are fixed. Additionally, for condensing, we run the BFS on the graph with the sampled vertices removed and limit the searching depth. Graphs are divided into strongly connected components. Smoothing runs a fixed number of times before and after pruning. We use the hard pruning method on unbalanced or noise-rich datasets (YTF, Reuters, RCV1, Shuttle and Mice Protein) and the soft pruning method on the others.

  • KM, GMM: Run each algorithm several times.

  • MS: quantile .

  • AP: max iter , convergence iter , damping .

  • RCCs: max iter , inner iter . The weighted graphs are provided by the authors.

  • GDLs: , .

  • HD: min cluster size , where is the average size of ground-truth clusters.

The default setting is used if not mentioned. For algorithms that run multiple times, including KM, GMM, MS, GDLs and HD, we use the best results.

4.1.5 Results

We use a computer with an Intel Core i7-6770HQ CPU () and GiB memory, and running Ubuntu Desktop 18.04. The results are shown in Table II. Some algorithms may require too much memory on a dataset and thus are not applicable, marked as MLE.

The computing resource costs are also compared. Considering the scalability of some algorithms, as shown in Table III, we only evaluate them on RCV1.

Dataset       KM     MS     GMM    AP     RCC    RCC-DR  AC-A   AC-C   AC-W   GDL-U  AGDL   HD     SC
MNIST         0.496  0.226  0.281  MLE    0.869  0.746   MLE    MLE    MLE    MLE    MLE    0.189  0.895
COIL100       0.793  0.706  MLE    0.635  0.924  0.924   0.514  0.650  0.825  0.936  0.936  0.860  0.862
YTF           0.775  0.714  MLE    0.578  0.788  0.783   0.429  0.621  0.803  0.576  0.563  0.758  0.791
YaleB         0.593  0.272  MLE    0.526  0.958  0.958   0.106  0.387  0.726  0.955  0.955  0.618  0.900
Reuters       0.381  0.000  0.384  0.198  0.379  0.398   0.462  0.289  0.357  0.431  0.429  0.282  0.376
RCV1          0.506  0.000  0.556  0.129  0.106  0.365   0.057  0.106  0.306  0.066  0.144  0.181  0.160
Pendigits     0.665  0.680  0.712  0.427  0.730  0.800   0.566  0.557  0.707  0.422  0.422  0.699  0.857
Shuttle       0.136  0.267  0.223  MLE    0.290  0.365   0.010  0.011  0.160  MLE    MLE    0.597  0.293
Mice Protein  0.457  0.438  0.416  0.384  0.500  0.520   0.239  0.274  0.486  0.403  0.403  0.379  0.526
MLE: Memory Limit Exceeded.
TABLE II: Results measured by AMI
Costs on RCV1  KM   MS    GMM   AP    RCC   RCC-DR  AC-A  AC-C  AC-W  GDL-U  AGDL  HD   SC
Time (sec)     25   114   187   757   6906  458     64    64    64    26     21    359  22
Memory (MiB)   479  1279  1055  4017  1983  1356    935   935   935   1807   1804  912  686
Time = Elapsed time × Percent of CPU. Memory = Maximum resident set size.
The costs of computing graphs (RCCs, GDLs and SC), distance matrices (GDLs) and bandwidths (MS) are not included.
RCCs and GDLs are implemented in Matlab, the others in Python. Our implementation is mainly based on scipy, numpy and networkx.
TABLE III: Computing resource costs on RCV1

The results show that the implementation based on Sampling Clustering achieves the best results on three datasets while requiring significantly fewer computing resources, whereas every other algorithm, including state-of-the-art ones like RCCs and GDLs, is best on at most one dataset.

GMM failed on the high-dimensional datasets. MNIST is the biggest dataset, and none of the agglomerative methods or affinity propagation is applicable to it. GDLs and AP also failed on another big dataset, Shuttle.

We discuss the results further in Section 5.

4.2 Sampling and Condensing Measures

We evaluate the measures on four datasets: MNIST, Pendigits, YaleB and YTF. The results are shown in Table IV. The value in parentheses indicates the average size of the ground-truth clusters of each dataset.

MNIST () Pendigits ()
0.630 0.895 0.817 0.698 0.857 0.845
0.631 0.527 0.578 0.853 0.847 0.851
0.260 0.421 0.000 0.817 0.845 0.845
YaleB () YTF ()
0.888 0.900 0.892 0.790 0.791 0.791
0.916 0.905 0.891 0.791 0.790 0.790
0.809 0.886 0.889 0.791 0.791 0.791
TABLE IV: Comparison of measures

It shows that the shared-neighbor condensing measure usually leads to bad results, except when random sampling is used. The main reason is that sampling randomly slows down the process of generating tiny components, while the other sampling measures tend to remove vertices at junctions, so the vertices in the core areas are divided into pieces very quickly.

As the data volume increases, the differences between the measures become more and more significant. On MNIST, where each cluster contains thousands of digits, some combinations are poorly effective, while all measures are almost equally effective on YaleB and YTF.

4.3 Robustness

We analyze the robustness to the sampling rate, the condensing size, and the k-nn graph size by varying each of them over a wide range. The other settings are the same as in Section 4.1.4.

Due to limited space and the similarity between some measures, we only consider six settings and evaluate them on Pendigits. The results are shown in Figs. 11, 12 and 13, respectively. Evidently, no parameter plays a critical role for most measures over a wide range.

Fig. 11: Robustness to the sampling rate. It shows that the results are almost unchanged over a wide range of sampling rates, and some methods still perform very well even at very high rates.
Fig. 12: Robustness to the condensing size. It shows that the method is quite stable as the condensing size varies widely. However, it is worth noting that, since random sampling is less efficient at separating connected clusters than the boundary-based measures, its performance is significantly worse than the others when the condensing size is very large, that is, when the graph is more tightly connected.
Fig. 13: Robustness to the graph size. It clearly shows that all methods perform very well as the neighborhood size k varies over a wide range, which means the approach is not very demanding with respect to graph quality.

4.4 Effectiveness of Smoothing

We also test the effectiveness of smoothing. Instead of running it 16 times before and after pruning as in the previous experiments, to make the effect on partitions more obvious, we only run it once before pruning to remove tiny clusters and 16 times after that. The results in Fig. 14 show that it is effective on most datasets and converges very fast.

Fig. 14: Effectiveness of smoothing. It shows that the fine-tuning method is effective on most datasets, especially for data with poor clustering quality. It is also obvious that the method converges very fast.

4.5 Visualization

The visualizations of the clustering results on MNIST are shown in Figs. 15 and 16. The number on a branch indicates its size, and the images are random samples from it. We also visualize the results of Section 4.1 using t-SNE [25], as shown in Appendix Fig. 17.

Fig. 15: Visualization of the truncated dendrogram on the MNIST.
Fig. 16: Visualization of the dendrogram on the MNIST.

As can be seen, even without any pruning the dendrogram is quite small compared with the binary outputs of classical hierarchical methods. The pruned dendrogram shows that one digit class is first separated from the whole, and then another; the remaining digits are roughly divided into three groups.

5 Discussion on Experimental Results

As shown in Table II, Sampling Clustering (SC) is compared with 12 methods and achieves the best results on 3 out of 9 datasets. It is also the second fastest method on RCV1, only slightly slower than the fastest one. The memory required is also lower than that of all other algorithms except k-Means. Moreover, the implementation is guaranteed to run in linear time.

Four of the datasets are image datasets: MNIST, COIL100, YTF and YaleB. SC achieves the best result on the biggest dataset, MNIST, and the second best result on YTF. On the other two datasets, SC is only worse than RCCs and GDLs. Additionally, it is one of only two hierarchical methods applicable to MNIST; neither classical nor recent state-of-the-art agglomerative methods can handle such a large dataset. As mentioned earlier, GMM is not applicable to high-dimensional datasets.

Reuters and RCV1 are two text datasets. Unfortunately, almost none of the algorithms work well on them. This is mainly because both of them have complicated structures and the measure used, the Euclidean distance of term frequencies, is not effective and thus cannot accurately represent similarity. Similar problems exist for Shuttle and Mice Protein. The results on such datasets are greatly influenced by random factors, which allows conservative methods like k-Means to achieve reasonable results instead. We believe that such results are not very reliable and should not be used as key evaluation criteria like the others.

6 Conclusions

We present an efficient graph-based divisive clustering approach that is shown to be effective on various types of data. We also give an introductory discussion of strategies to divide faster and to avoid undesirable divisions. The sampling and condensing measures discussed are very basic, but they already work very well. We believe that the performance of the approach can be further improved with more effective measures, especially for data of specific types.

As far as we know, it is the first schema that groups data into clusters by repeatedly summarizing a graph, and also one of the few effective hierarchical methods running in linear time. Existing methods may be adapted based on the schema, and their performance can be greatly improved. We hope this work can inspire further research.

In addition to clustering algorithms, the effectiveness of cluster analysis is also influenced by many other factors.

Preprocessing is the basis. The experimental results show that most algorithms cannot handle datasets like Reuters directly. One of the main reasons is that the similarity measure used is not effective, which leads to graphs of poor quality.

For hierarchical methods, post-processing is also important. In many cases, a dendrogram needs to be pruned or converted into partitions. Although this is independent of the particular clustering approach, most existing pruning methods, like depth-based ones, are not very effective on our dendrograms, so we introduce two simple algorithms in this paper to make the clustering process complete. Since they are based only on the sizes of leaves, they can obviously be further optimized in future work.

Acknowledgments

We want to thank Yangyan Li for lecturing Pattern Recognition, and we also appreciate the support from the Computer Architecture Laboratory, Shandong University. In particular, Ching Tarn would like to thank Xiaojun Cai, Mengying Zhao, Jianhua Yin, Si-Min He, and his classmates for their concern and help.

References

  • [1] J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,” in

    Proceedings of the fifth Berkeley symposium on mathematical statistics and probability

    , vol. 1, no. 14.   Oakland, CA, USA, 1967, pp. 281–297.
  • [2] M. Ester, H.-P. Kriegel, J. Sander, X. Xu et al., “A density-based algorithm for discovering clusters in large spatial databases with noise.” in Kdd, vol. 96, no. 34, 1996, pp. 226–231.
  • [3] H.-P. Kriegel, P. Kröger, J. Sander, and A. Zimek, “Density-based clustering,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 3, pp. 231–240, 2011.
  • [4]

    M. Brito, E. Chavez, A. Quiroz, and J. Yukich, “Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection,”

    Statistics & Probability Letters, vol. 35, no. 1, pp. 33–42, 1997.
  • [5] P. Fränti, R. Mariescu-Istodor, and C. Zhong, “Xnn graph,” in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR).   Springer, 2016, pp. 207–217.
  • [6] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural computation, vol. 15, no. 6, pp. 1373–1396, 2003.
  • [7] N. Bar, H. Averbuch-Elor, and D. Cohen-Or, “Border-peeling clustering,” arXiv preprint arXiv:1612.04869, 2016.
  • [8] J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.
  • [9] S. M. Omohundro, Five balltree construction algorithms.   International Computer Science Institute Berkeley, 1989.
  • [10] H. Xiong, G. Pandey, M. Steinbach, and V. Kumar, “Enhancing data analysis with noise removal,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 3, pp. 304–319, 2006.
  • [11] L. Fu and E. Medico, “Flame, a novel fuzzy clustering method for the analysis of dna microarray data,” BMC bioinformatics, vol. 8, no. 1, p. 3, 2007.
  • [12] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [13]

    L. Ertöz, M. Steinbach, and V. Kumar, “Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data,” in

    Proceedings of the 2003 SIAM international conference on data mining.   SIAM, 2003, pp. 47–58.
  • [14] L. W. Beineke, O. R. Oellermann, and R. E. Pippert, “The average connectivity of a graph,” Discrete mathematics, vol. 252, no. 1-3, pp. 31–45, 2002.
  • [15]

    R. J. Campello, D. Moulavi, and J. Sander, “Density-based clustering based on hierarchical density estimates,” in

    Pacific-Asia conference on knowledge discovery and data mining.   Springer, 2013, pp. 160–172.
  • [16]

    D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” in

    Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms.   Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035.
  • [17] L. Kaufman and P. Rousseeuw, Clustering by Means of Medoids, ser. Delft University of Technology : reports of the Faculty of Technical Mathematics and Informatics.   Faculty of Mathematics and Informatics, 1987.
  • [18] A. K. Jain, R. C. Dubes et al., Algorithms for clustering data.   Prentice hall Englewood Cliffs, 1988, vol. 6.
  • [19] T. Zhang, R. Ramakrishnan, and M. Livny, “Birch: an efficient data clustering method for very large databases,” in ACM Sigmod Record, vol. 25, no. 2.   ACM, 1996, pp. 103–114.
  • [20] E. Hartuv and R. Shamir, “A clustering algorithm based on graph connectivity,” Information processing letters, vol. 76, no. 4-6, pp. 175–181, 2000.
  • [21] J. Shi and J. Malik, “Normalized cuts and image segmentation,” Departmental Papers (CIS), p. 107, 2000.
  • [22] K. George, E.-H. Han, and V. Kumar, “Chameleon: a hierarchical clustering algorithm using dynamic modeling,” IEEE computer, vol. 27, no. 3, pp. 329–341, 1999.
  • [23] W. Zhang, X. Wang, D. Zhao, and X. Tang, “Graph degree linkage: Agglomerative clustering on a directed graph,” in European Conference on Computer Vision.   Springer, 2012, pp. 428–441.
  • [24] S. A. Shah and V. Koltun, “Robust continuous clustering,” Proceedings of the National Academy of Sciences, vol. 114, no. 37, pp. 9814–9819, 2017.
  • [25] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,”

    Journal of machine learning research

    , vol. 9, no. Nov, pp. 2579–2605, 2008.
  • [26]

    A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,”

    IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 6, pp. 643–660, 2001.
  • [27] L. Wolf, T. Hassner, and I. Maoz, Face recognition in unconstrained videos with matched background similarity.   IEEE, 2011.
  • [28] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, “Rcv1: A new benchmark collection for text categorization research,” Journal of machine learning research, vol. 5, no. Apr, pp. 361–397, 2004.
  • [29] S. A. Nene, S. K. Nayar, and H. Murase, “Columbia object image library (coil-100),” 1996.
  • [30] F. Alimoglu and E. Alpaydin, “Combining multiple representations and classifiers for pen-based handwritten digit recognition,” in Proceedings of the Fourth International Conference on Document Analysis and Recognition, vol. 2.   IEEE, 1997, pp. 637–640.
  • [31] C. Higuera, K. J. Gardiner, and K. J. Cios, “Self-organizing feature maps identify proteins critical to learning in a mouse model of down syndrome,” PloS one, vol. 10, no. 6, p. e0129126, 2015.
  • [32] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward feature space analysis,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 5, pp. 603–619, 2002.
  • [33] B. J. Frey and D. Dueck, “Clustering by passing messages between data points,” science, vol. 315, no. 5814, pp. 972–976, 2007.
  • [34] N. X. Vinh, J. Epps, and J. Bailey, “Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance,” Journal of Machine Learning Research, vol. 11, no. Oct, pp. 2837–2854, 2010.

Appendix

Dataset       KM     MS     GMM    AP     RCC    RCC-DR  AC-A   AC-C   AC-W   GDL-U  AGDL   HD     SC
MNIST         0.500  0.430  0.311  MLE    0.893  0.830   MLE    MLE    MLE    MLE    MLE    0.356  0.895
COIL100       0.840  0.852  MLE    0.843  0.963  0.963   0.687  0.754  0.862  0.961  0.961  0.905  0.894
YTF           0.789  0.831  MLE    0.783  0.892  0.890   0.583  0.681  0.808  0.699  0.692  0.846  0.881
YaleB         0.658  0.742  MLE    0.799  0.978  0.975   0.231  0.479  0.774  0.967  0.967  0.696  0.914
Reuters       0.535  0.000  0.538  0.503  0.556  0.553   0.533  0.392  0.487  0.514  0.515  0.413  0.538
RCV1          0.511  0.000  0.566  0.355  0.138  0.437   0.142  0.108  0.375  0.116  0.182  0.301  0.169
Pendigits     0.682  0.738  0.715  0.648  0.845  0.851   0.659  0.584  0.728  0.593  0.593  0.738  0.860
Shuttle       0.216  0.375  0.356  MLE    0.488  0.546   0.040  0.039  0.246  MLE    MLE    0.622  0.530
Mice Protein  0.479  0.600  0.426  0.593  0.662  0.636   0.377  0.324  0.514  0.513  0.513  0.602  0.581
MLE: Memory Limit Exceeded.
TABLE V: Results measured by NMI
Fig. 17: Visualization of results on datasets.