Ball k-means

05/02/2020 ∙ by Shuyin Xia, et al. ∙ Xi'an Jiaotong University ∙ Tencent ∙ University of California, Riverside ∙ Tianjin University

This paper presents a novel accelerated exact k-means algorithm called the Ball k-means algorithm, which uses a ball to describe a cluster, focusing on reducing point-centroid distance computations. Ball k-means can exactly find the neighbor clusters of each cluster, so that distances are computed only between a point and the centroids of its neighbor clusters instead of all centroids. Moreover, each cluster can be divided into a stable area and an active area, and the latter can be further divided into annulus areas. The assigned cluster of a point in the stable area does not change in the current iteration, while a point in an annulus area can only be reassigned among a few neighbor clusters in the current iteration. Also, there are no upper or lower bounds in the proposed Ball k-means. Furthermore, reducing centroid-centroid distance computations between iterations makes it efficient for large-k clustering. Its fast speed, lack of extra parameters, and simple design make Ball k-means an all-around replacement for the naive k-means algorithm.


1 Ball k-means Clustering

In this section, the main idea of the Ball k-means algorithm is presented by introducing the ball cluster concept, neighbor cluster searching, ball cluster division, and the mechanism for reducing centroid-centroid distance computations between iterations.

1.1 Ball Cluster Concept

A ball structure is characterized by a centroid and a radius. Therefore, to describe a cluster and analyze the proposed method, we introduce the ball cluster concept, in which a ball is used to describe a cluster.

Definition 1

Given a cluster $C$, $C$ is called a ball cluster, defined by its centroid $c$ and radius $r$ as follows:

$$c = \frac{1}{|C|}\sum_{x \in C} x, \qquad r = \max_{x \in C} \|x - c\| \quad (1)$$

where $x$ denotes a point assigned to $C$, and $|C|$ denotes the number of samples in $C$.
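As a minimal illustration of Definition 1, the centroid and radius of a ball cluster can be computed as follows (the function name `ball_cluster` and the NumPy representation are our own choices, not part of the paper):

```python
import numpy as np

def ball_cluster(points):
    """Compute the centroid and radius of a ball cluster (Definition 1).

    The centroid is the mean of the cluster's points; the radius is the
    maximum distance from the centroid to any point in the cluster.
    """
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    radius = np.linalg.norm(points - centroid, axis=1).max()
    return centroid, radius
```

For the points (0,0), (2,0), (1,1), the centroid is (1, 1/3) and the radius is the distance from the centroid to the farthest point.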

1.2 Neighbor Cluster Searching

To skip the distance calculation between a point and a centroid that is very far away from that point, we introduce a method that finds the neighbor clusters of each cluster. Thus, distance computation is limited to points and their neighbor clusters. Neighbor clusters are defined in Definition 2.

Definition 2

Given two ball clusters $C_i$ and $C_j$, whose centroids are denoted as $c_i$ and $c_j$, and where $r_i$ represents the radius of $C_i$, if they satisfy the following inequality:

$$\|c_i - c_j\| < 2 r_i \quad (2)$$

then $C_j$ is a neighbor cluster of $C_i$.
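A small sketch of the neighbor test in Definition 2 (function name and data layout are illustrative, not from the paper). Note that only the radius of the queried cluster appears in the condition, which is what makes the relation asymmetric:

```python
import numpy as np

def neighbor_clusters(centroids, radii, i):
    """Return the indices j such that cluster j is a neighbor of the
    queried cluster i under Definition 2: ||c_i - c_j|| < 2 * r_i.
    Only r_i (the queried cluster's radius) enters the test, so
    j being a neighbor of i does not imply i is a neighbor of j.
    """
    centroids = np.asarray(centroids, dtype=float)
    dists = np.linalg.norm(centroids - centroids[i], axis=1)
    return [j for j in range(len(centroids))
            if j != i and dists[j] < 2 * radii[i]]
```

With centroids at (0,0), (1,0), (5,0) and radii 1.0, 0.3, 1.0, cluster 1 is a neighbor of cluster 0, but cluster 0 is not a neighbor of cluster 1, since 2 × 0.3 = 0.6 is below their centroid distance of 1.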

Equation (2) indicates that the neighbor relationship is not symmetric. Specifically, for two ball clusters $C_i$ and $C_j$, referring to Fig. 1, their neighbor relationship can be one of the following three types:

(1) $C_i$ and $C_j$ are neighbor clusters of each other, as illustrated by one of the pairs in Fig. 1. Since they are mutual neighbor clusters, some points in $C_i$ ($C_j$) may be adjusted into $C_j$ ($C_i$) in the current iteration.

(2) $C_j$ is a neighbor cluster of $C_i$, but $C_i$ is not a neighbor cluster of $C_j$, as illustrated by another pair in Fig. 1. Therefore, some points in $C_i$ may be adjusted into $C_j$, but no points in $C_j$ can be adjusted into $C_i$.

(3) $C_i$ and $C_j$ are not neighbor clusters of each other, as illustrated by the remaining pair in Fig. 1. Therefore, points in $C_i$ ($C_j$) cannot be adjusted into $C_j$ ($C_i$) in the current iteration.

Fig. 1: Schematic diagram of the neighbor relationships of the queried ball cluster. The red dashed line represents the perpendicular bisector of the centroids of two ball clusters. The yellow triangle and green line represent the centroid and radius of a cluster, respectively.
Theorem 1

Given two clusters $C_j$ and $C_k$ with centroids $c_j$ and $c_k$, respectively, and a queried ball cluster $C_i$ with centroid $c_i$ and radius $r_i$, if $C_j$ is a neighbor cluster of $C_i$ (i.e., $\|c_i - c_j\| < 2 r_i$) while $C_k$ is not a neighbor cluster of $C_i$ (i.e., $\|c_i - c_k\| \ge 2 r_i$), then it holds that some points in $C_i$ may be adjusted into $C_j$, while no points in $C_i$ can be adjusted into $C_k$.

For a queried cluster $C_i$, its neighbor ball clusters can be found exactly by Definition 2, so the distance computation from points in $C_i$ to the centroids of other clusters is limited to the neighbor clusters of $C_i$, resulting in a significant decrease in the amount of distance computation. A similar method for finding neighbor clusters was proposed in [1]. For two clusters $C_i$ and $C_j$, let $dist(c_i, c_j)$ represent the distance between the centroids of $C_i$ and $C_j$. In [1], if $dist(c_i, c_j)/2 < r_i + s_j$, then $C_j$ is a neighbor of $C_i$, where $r_i$ represents the radius of $C_i$ and $s_j$ represents half the distance between the centroid of $C_j$ and its closest other centroid. In contrast, in this paper, if $dist(c_i, c_j)/2 < r_i$, then $C_j$ is a neighbor of $C_i$. Therefore, in comparison with Definition 2, the condition in [1] contains one additional term, so it is looser than that in Definition 2. In other words, Definition 2 finds fewer but more exact neighbor clusters than the method in [1]. In Section 1.3, it is shown that ball cluster division can further decrease the amount of distance computation.

1.3 Ball Cluster Division

A queried ball cluster can be divided into two parts, a stable area and an active area, which are defined in Definition 3. The points in the stable area stay in their assigned cluster, as given in Theorem 2. The active area can be further divided into annulus areas, as given in Definition 4. Points in each annulus area require distance calculations only to some of the neighbor clusters, as given in Theorem 3.

1.3.1 Stable and Active Areas

The definitions of the stable area and active area are as follows:

Definition 3

Given a queried ball cluster $C_i$ with centroid $c_i$, let $N_i$ denote the centroid set of the neighbor clusters of $C_i$. If $N_i \ne \emptyset$, then the sphere whose centroid and radius are equal to $c_i$ and $\frac{1}{2}\min_{c_j \in N_i} \|c_i - c_j\|$, respectively, is defined as the stable area of $C_i$, and the rest of the ball cluster is defined as the active area of $C_i$.

Theorem 2

Given a cluster $C_i$, the points in the stable area of $C_i$ cannot be adjusted into any neighbor cluster in the current iteration.

However, in the special case where a ball cluster has no neighbor clusters, the stable area is equal to the whole ball cluster. A description similar to the stable area is provided only in [2], but it relies on an upper bound that is larger than the true distance when checking the filtering condition. In contrast, Definition 3 provides an exact definition that relies on no bounds.
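The stable/active split of Definition 3 (including the no-neighbor special case) can be sketched as follows; the function name `split_stable_active` and the array-based interface are illustrative assumptions:

```python
import numpy as np

def split_stable_active(points, centroid, neighbor_centroids):
    """Split a ball cluster's points into a stable area and an active area
    (Definition 3).  The stable area is the ball centred at the cluster
    centroid whose radius is half the distance to the closest neighbor
    centroid; its points keep their assignment this iteration (Theorem 2).
    If the cluster has no neighbors, the whole cluster is stable.
    """
    points = np.asarray(points, dtype=float)
    centroid = np.asarray(centroid, dtype=float)
    if len(neighbor_centroids) == 0:
        return points, points[:0]          # whole cluster stable, empty active area
    gaps = np.linalg.norm(np.asarray(neighbor_centroids, dtype=float) - centroid, axis=1)
    stable_r = gaps.min() / 2.0            # half the distance to the closest neighbor
    d = np.linalg.norm(points - centroid, axis=1)
    return points[d <= stable_r], points[d > stable_r]
```

Any point within half the distance to the closest neighbor centroid is necessarily closer to its own centroid than to any neighbor centroid, which is why the stable area needs no distance computations.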

1.3.2 Active Area Division

In this section, we show that the active area of a queried cluster can be divided into annulus areas that are generated by the neighbor clusters.

Definition 4

(Annulus area) Given a queried ball cluster $C_i$ with centroid $c_i$ and radius $r_i$, suppose $|N_i| = k'$, where $N_i$ represents the centroid set of the neighbor clusters of $C_i$. Let $c_j$ and $c_{j+1}$ represent the centroids of the $j$-th and $(j+1)$-th closest neighbor clusters of $C_i$, respectively ($1 \le j \le k' - 1$). For $C_i$, the $j$-th annulus area, denoted as $A_j$, is:

$$A_j = \{ x \in C_i : \|c_i - c_j\| / 2 < \|x - c_i\| \le \|c_i - c_{j+1}\| / 2 \}$$

with the outermost annulus area $A_{k'}$ bounded above by the radius $r_i$.

Theorem 3

Given a queried cluster $C_i$ with centroid $c_i$, supposing that a point lies in the $j$-th annulus area $A_j$, the point can be adjusted only within the first $j$ closest neighbor clusters of $C_i$ and $C_i$ itself ($1 \le j \le k'$).
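A sketch of the annulus division and the candidate sets of Theorem 3 (function name and return format are our own; neighbor indices refer to positions in the input list, ordered from closest to farthest):

```python
import numpy as np

def annulus_candidates(points, centroid, neighbor_centroids):
    """For each point of a queried cluster, return the indices of the
    neighbor clusters it may be adjusted into (Theorem 3): a point in
    the j-th annulus area (Definition 4) is compared only against the
    j closest neighbor centroids, plus its own cluster.  Points inside
    the innermost half-distance (the stable area) get no candidates.
    """
    points = np.asarray(points, dtype=float)
    centroid = np.asarray(centroid, dtype=float)
    nc = np.asarray(neighbor_centroids, dtype=float)
    order = np.argsort(np.linalg.norm(nc - centroid, axis=1))  # closest first
    half = np.linalg.norm(nc[order] - centroid, axis=1) / 2.0  # annulus boundaries
    out = []
    for p in points:
        d = np.linalg.norm(p - centroid)
        j = int(np.searchsorted(half, d))   # number of boundaries strictly below d
        out.append(order[:j].tolist())      # first-j closest neighbors
    return out
```

With neighbor centroids at (2,0) and (6,0), the boundaries sit at distances 1 and 3 from the centroid, so a point at distance 1.5 is compared against one neighbor only, and a point at distance 4 against both.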

1.4 Reducing Centroid-Centroid Distance Computation between Iterations

As presented in Section 1.2, finding the neighbor clusters of each ball cluster requires calculating all the centroid-centroid distances, which costs $O(k^2)$ distance computations per iteration; for large-$k$ clustering, this is a non-negligible cost. In this paper, the purpose of calculating centroid-centroid distances is to find the neighbor clusters of the next iteration. If a non-neighbor relationship in the next iteration can be found in advance from the relationship of the ball clusters in the current iteration, then the direct centroid-centroid distance calculation can be avoided. In this work, we develop a method that implements this idea, finding non-neighbor relationships in advance to avoid unnecessary centroid-centroid distance calculations. The specific process of this method is formulated as follows.

Let $c_i^t$ represent the centroid of cluster $C_i$ in the $t$-th iteration, $\delta_i^t = \|c_i^t - c_i^{t-1}\|$ represent the shift of the centroid of $C_i$ between the $(t-1)$-th and $t$-th iterations, and $dist^t(i, j) = \|c_i^t - c_j^t\|$ represent the distance between $c_i^t$ and $c_j^t$ in the $t$-th iteration.

Theorem 4

Given clusters $C_i$ and $C_j$, supposing that $dist^{t-1}(i, j) \ge 2 r_i^t + \delta_i^t + \delta_j^t$, it holds that $C_j$ cannot be a neighbor cluster of $C_i$ in the current iteration, and the centroid-centroid distance computation between them can be skipped.

Fig. 2: Schematic diagram of avoiding a direct centroid-centroid distance calculation. The red dashed line marks the midpoint between the two centroids. $C_j$ is not a neighbor cluster of $C_i$ in the $t$-th iteration.

Proof.  With the shift of the cluster centroids due to the centroid update, the triangle inequality gives

$dist^t(i, j) \ge dist^{t-1}(i, j) - \delta_i^t - \delta_j^t$.

Supposing that $dist^{t-1}(i, j) \ge 2 r_i^t + \delta_i^t + \delta_j^t$, it follows that

$dist^t(i, j) \ge 2 r_i^t + \delta_i^t + \delta_j^t - \delta_i^t - \delta_j^t = 2 r_i^t$.

As given in Definition 2 and Theorem 1, when $dist(c_i, c_j) \ge 2 r_i$, $C_j$ is not a neighbor of $C_i$. So, as shown in Fig. 2, when $dist^{t-1}(i, j) \ge 2 r_i^t + \delta_i^t + \delta_j^t$, it holds that $dist^t(i, j) \ge 2 r_i^t$ (i.e., $C_j$ cannot be a neighbor cluster of $C_i$ in the current iteration). Thus, the computation of the distance between $c_i^t$ and $c_j^t$ can be avoided.
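Theorem 4 amounts to a pairwise filter over the previous iteration's distance matrix; a vectorised sketch (names and matrix layout are our own assumptions):

```python
import numpy as np

def skip_mask(prev_dists, radii, shifts):
    """Theorem 4 as a vectorised filter.  prev_dists[i, j] is the
    centroid-centroid distance from the previous iteration, radii[i]
    the current radius r_i, and shifts[i] the centroid shift delta_i.
    Entry (i, j) is True when dist^{t-1}(i, j) >= 2*r_i + delta_i + delta_j,
    meaning C_j cannot be a neighbor of C_i this iteration and the
    fresh distance computation between them can be skipped.
    """
    prev_dists = np.asarray(prev_dists, dtype=float)
    radii = np.asarray(radii, dtype=float)
    shifts = np.asarray(shifts, dtype=float)
    bound = 2 * radii[:, None] + shifts[:, None] + shifts[None, :]
    return prev_dists >= bound
```

Since the mask depends only on r_i, it is asymmetric, matching the asymmetry of the neighbor relation itself.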

1.5 Stable Ball Cluster in Subsequent Iterations

According to the characteristics of the k-means algorithm, as the number of iterations increases, more and more ball clusters tend to become stable, i.e., the points in them are unchanged. In Ball k-means, a stable ball cluster can be simply described as one into which no points move and out of which no points move in the current iteration. Based on this characteristic of the k-means algorithm itself, we propose a method to find these stable ball clusters. In this method, a flag associated with each ball cluster is used to judge whether the ball cluster is stable. For a queried ball cluster, if no points in the queried ball cluster move into its neighbor clusters, and no points in other ball clusters move into the queried ball cluster in the current iteration, then its flag is marked TRUE.

Theorem 5

Ball k-means is implemented on a given data set $D$. For a queried ball cluster $C$, if the points in $C$ are not changed, it is called a stable ball cluster. In one iteration, if all the neighbor ball clusters of $C$ are stable, $C$ will not participate in the distance calculations in the next iteration.

Proof.  This proof is straightforward. For a queried ball cluster $C$, if all the neighbor ball clusters of $C$ are stable, then the division into the stable area and annulus areas is the same as in the previous iteration, so the assignment step of the queried ball cluster can be avoided.

During the iterations of the k-means algorithm, more and more ball clusters become stable, and the data points in those stable ball clusters do not participate in any distance calculations. Therefore, the per-iteration time complexity of Ball k-means becomes sublinear, and Ball k-means runs faster and faster with each iteration.
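The flag mechanism of Section 1.5 can be sketched as follows. This is our own reading of the text: we assume the assignment step is skipped only when both the cluster itself and all of its neighbor clusters are stable, since an unstable cluster's centroid would change its area division:

```python
def update_stable_flags(moved_in, moved_out, neighbors):
    """Maintain per-cluster stability flags (Section 1.5).  A cluster is
    flagged stable when no point moved into or out of it this iteration;
    if a cluster and all of its neighbor clusters are stable, its
    assignment step can be skipped in the next iteration (Theorem 5).

    moved_in[i]/moved_out[i]: whether any point entered/left cluster i.
    neighbors[i]: indices of the neighbor clusters of cluster i.
    """
    k = len(neighbors)
    stable = [not moved_in[i] and not moved_out[i] for i in range(k)]
    skip_next = [stable[i] and all(stable[j] for j in neighbors[i])
                 for i in range(k)]
    return stable, skip_next
```

Tracking only these boolean flags costs O(k) extra memory, in keeping with the algorithm's design goal of adding no extra parameters.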

References

  • [1] Petr Ryšavý and Greg Hamerly. Geometric methods to accelerate k-means algorithms. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 324–332. SIAM, 2016.
  • [2] Charles Elkan. Using the triangle inequality to accelerate k-means. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 147–153, 2003.