## I Background and Introduction

Clustering is one of the classic unsupervised machine learning problems. It has been shown to be NP-hard even with only two clusters Drineas2004 . In 1982, Lloyd Lloyd1982 gave a local-search solution to this problem, also known as k-means, which is one of "the most popular clustering algorithms used in scientific and industrial applications" Berkhin2002 . The total error is monotonically decreasing, and the process always terminates since the number of possible clusterings is finite (at most $k^n$, where $n$ is the total number of sample points) Arthur2006 . However, the accuracy of the k-means algorithm is not always good enough. In fact, many examples show that the algorithm can generate arbitrarily bad clusterings (the ratio $\phi/\phi_{OPT}$ is proved to be unbounded even if $n$ and $k$ are fixed, where $\phi_{OPT}$ is the optimal total error) Arthur2006 . Furthermore, the final clustering strongly depends on the initial choice of cluster centers. The k-means++ algorithm proposes a way to choose random starting centers with very specific probabilities Arthur2006 , which guarantees an upper bound of $8(\ln k + 2)\,\phi_{OPT}$ on the expected total error for any set of data points Arthur2006 , without sacrificing the fast computation speed and simplicity of the algorithm. In particular, "k-means++ is never worse than $O(\log k)$-competitive, and on very well formed data sets, it improves to being $O(1)$-competitive" Arthur2006 .

This article is organized as follows. In Section II, the traditional k-means and k-means++ algorithms are introduced based on Ref. Arthur2006 . In Section III, the relation between k-means and k-means++ is illustrated, and the initialization process of the k-means++ algorithm is generalized; the results indicate that selecting the most distant sample point from its nearest center as the new center can have the same (or very similar) effect as randomly selecting the new center from the entire weighted sample space.

## II Existing Algorithms

Suppose we are given an integer $k$ and a set $\mathcal{X}$ of $n$ data points Arthur2006 . The goal is to select $k$ centers $\mathcal{C}$ so as to minimize the potential function (total error)

$$\phi = \sum_{x \in \mathcal{X}} \min_{c \in \mathcal{C}} \|x - c\|^2 .$$

In this report, I will use the same notation as in Ref. Arthur2006 : $\mathcal{C}_{OPT}$ represents the optimal clustering, $\phi_{OPT}$ the corresponding optimal potential, and $\phi(A)$ represents the contribution of a subset $A \subset \mathcal{X}$ to the potential.

In general, the k-means algorithm has four steps Arthur2006 :

1. Randomly choose $k$ initial centers $\mathcal{C} = \{c_1, \ldots, c_k\}$.

2. For each $i \in \{1, \ldots, k\}$, set the cluster $C_i$ to be the set of points in $\mathcal{X}$ that are closer to $c_i$ than they are to $c_j$ for all $j \neq i$.

3. For each $i \in \{1, \ldots, k\}$, set $c_i$ to be the center of mass of all points in $C_i$: $c_i = \frac{1}{|C_i|}\sum_{x \in C_i} x$.

4. Repeat steps 2 and 3 until $\mathcal{C}$ no longer changes.
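The four steps above can be sketched as follows; this is a minimal NumPy illustration (the function name, the `seed` parameter, and the iteration cap are choices of this sketch, not part of the original algorithm description):

```python
import numpy as np

def kmeans(X, k, seed=None, max_iter=100):
    """Lloyd's algorithm: alternate the assignment and centroid steps."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centers uniformly at random from X.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its cluster
        # (keep the old center if a cluster happens to become empty).
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        # Step 4: stop once the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Each pass through the loop can only decrease the potential, which is why the procedure terminates.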

###### Lemma 1.

Let $S$ be a set of points with center of mass $c(S)$, and let $z$ be an arbitrary point. Then

$$\sum_{x \in S} \|x - z\|^2 = \sum_{x \in S} \|x - c(S)\|^2 + |S| \cdot \|c(S) - z\|^2 .$$
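As a quick sanity check, the identity of Lemma 1 can be verified numerically; the point set and dimensions below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(50, 3))   # a set S of 50 points in 3 dimensions
z = rng.normal(size=3)         # an arbitrary point z
c = S.mean(axis=0)             # center of mass c(S)

# Left side: total squared distance from S to z.
lhs = ((S - z) ** 2).sum()
# Right side: squared distances to the centroid plus |S| * ||c(S) - z||^2.
rhs = ((S - c) ** 2).sum() + len(S) * ((c - z) ** 2).sum()
assert np.isclose(lhs, rhs)
```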

Lemma 1 quantifies the contribution of a center to the cost improvement in a k-means step as a function of the distance it moves Peled2005 . Specifically, if in a k-means step a $k$-clustering $\{C_1, \ldots, C_k\}$ with centers $c_1, \ldots, c_k$ is changed to another $k$-clustering with centers $c_1', \ldots, c_k'$, then the total change of the potential function satisfies

$$\phi - \phi' \;\geq\; \sum_{i=1}^{k} |C_i| \cdot \|c_i - c_i'\|^2 . \qquad (1)$$

The reason that the loss function carries a no-less-than sign rather than an equal sign is that Lemma 1 only considers the improvement resulting from step 3 of the k-means algorithm, in which the centers are moved to the centroids of their clusters Peled2005 . However, there is an additional gain from reassigning the points in step 2 of the k-means algorithm Peled2005 . Therefore, the k-means algorithm guarantees that the potential function monotonically decreases over each iteration until a locally optimal clustering is reached, once the initial centers are given.

Let $D(x)$ denote the shortest distance from a data point $x$ to the closest center we have already chosen. Then, the k-means++ algorithm is Arthur2006 :

1a. Choose an initial center $c_1$ uniformly at random from $\mathcal{X}$.

1b. Choose the next center $c_i$, selecting $c_i = x' \in \mathcal{X}$ with probability $D(x')^2 / \sum_{x \in \mathcal{X}} D(x)^2$.

1c. Repeat step 1b until we have chosen a total of $k$ centers.

2-4. Proceed as with the standard k-means algorithm.

The weighting used in step 1b is called "$D^2$ weighting".
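Steps 1a–1c can be sketched as follows; this is a minimal NumPy illustration of $D^2$ weighting (the function name and `seed` parameter are choices of this sketch):

```python
import numpy as np

def kmeanspp_init(X, k, seed=None):
    """Choose k initial centers by D^2-weighted seeding (steps 1a-1c)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1a: first center uniformly at random from X.
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its nearest chosen center.
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Step 1b: draw the next center with probability D(x)^2 / sum_x D(x)^2.
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```

The centers returned here replace step 1 of the standard k-means algorithm; steps 2–4 then proceed unchanged.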
Ref. Arthur2006 proved the following important result:

###### Theorem 2.

If $\mathcal{C}$ is constructed with k-means++, then the corresponding potential function $\phi$ satisfies $E[\phi] \leq 8(\ln k + 2)\,\phi_{OPT}$.

## III Alternative Approaches and Their Relations

The k-means++ algorithm demonstrates that, during center initialization, it is much better to select centers with probability proportional to their squared distance to the nearest existing center. Equivalently, k-means++ is k-means with weighted initialization. In a more general setting, we can tune the portion of sample points that may be randomly selected as a new center. In particular, a hyper-parameter $m$ is set to determine the $m$ most distant points from their nearest existing centers (where $n$ is the size of the sample set and $1 \leq m \leq n$), and the new center is then selected from these points instead of from the entire dataset. The two extreme cases are (1) $m = 1$, so that we deterministically choose the most distant point from its nearest center (this saves computation time during center initialization at the cost of ignoring the distribution of the dataset), and (2) $m = n$, so that we recover the exact k-means++ algorithm. These values are tested on different datasets (e.g., the Wine and Spam datasets in UCI ) for different cluster numbers $k$ (e.g., 3, 10, 20) and compared with the traditional k-means algorithm. The computation time, average potential, and minimal potential are all very similar or exactly the same across the variants (the computation time is similar to traditional k-means, but the average and minimal potentials are one order of magnitude lower), even when the randomness of the first initial center is excluded (see Tables 1, 2, and 3). This might indicate that the main advantage of the k-means++ algorithm can be explained, or replaced, by selecting the most distant point from the nearest center.
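The generalized seeding rule described above can be sketched as follows; this is an illustrative NumPy reading of the text, in which `m` is the hyper-parameter restricting the candidate pool (the function name and `seed` parameter are assumptions of the sketch):

```python
import numpy as np

def topm_init(X, k, m, seed=None):
    """Generalized seeding: restrict the D^2-weighted draw to the m points
    farthest from their nearest chosen center. m=1 gives the deterministic
    farthest-point rule; m=len(X) recovers plain k-means++."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Keep only the m most distant candidate points ...
        idx = np.argsort(d2)[-m:]
        # ... and apply the D^2 weighting within that subset.
        centers.append(X[rng.choice(idx, p=d2[idx] / d2[idx].sum())])
    return np.array(centers)
```

With `m=1` the `rng.choice` call has a single candidate, so the draw is deterministic given the first center, which matches extreme case (1) above.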

Besides testing the potentials, it is also possible to evaluate the accuracy of the clustering for specific datasets. For instance, the Iris dataset in UCI has three classes. k-means-related algorithms mainly produce two distinct clusterings: one is the same as the ground-truth classification; the other groups virginica and versicolor into one cluster and splits setosa into two clusters. Table 4 shows the ratio of runs that obtain the correct classification for the different algorithms.

Table 1

| Algorithm | Avg Potential | Min Potential | Time (relative) |
|---|---|---|---|
| k-means | 3.8110 | 2.1810 | 1 |
| k-means++ | 2.5310 | 2.1810 | 1.05 |
| | 2.5510 | 2.1810 | 1.06 |
| No random | 2.5410 | 2.1810 | 1.09 |
| | 2.5410 | 2.1810 | 1.10 |

Table 2

| Algorithm | Avg Potential | Min Potential | Time (relative) |
|---|---|---|---|
| k-means | 4.1910 | 1.7510 | 1 |
| k-means++ | 9.3510 | 7.7010 | 1.05 |
| | 9.3510 | 7.7010 | 1.06 |
| No random | 9.6210 | 7.7010 | 1.05 |
| | 9.2310 | 7.7010 | 1.09 |

Table 3

| Algorithm | Avg Potential | Min Potential | Time (relative) |
|---|---|---|---|
| k-means | 2.5810 | 1.5010 | 1 |
| k-means++ | 2.5010 | 2.1410 | 1.35 |
| | 2.5010 | 2.1410 | 1.35 |
| No random | 2.4610 | 2.1410 | 1.42 |
| | 2.4610 | 2.1410 | 1.43 |

Table 4

| Algorithm | Accuracy | Time (relative) |
|---|---|---|
| k-means | 0.08 | 1 |
| k-means++ | 0.91 | 1.00 |
| | 0.91 | 0.99 |
| No random | 0.91 | 1.02 |
| | 0.91 | 1.01 |

## IV Summary

In this article, the existing k-means and k-means++ algorithms are briefly introduced. In the center initialization process, the former only considers the sample density distribution, while the latter also takes distance into account to modify the sample density distribution. Afterwards, the initialization process is generalized and a couple of alternative approaches are compared. It is found that choosing the most distant sample point from the nearest existing center can have largely the same effect as considering the entire sample space.

## References

- (1) D. Arthur, S. Vassilvitskii. k-means++: The Advantages of Careful Seeding. Technical Report, Stanford, 2006.
- (2) P. Drineas, A. Frieze, R. Kannan, S. Vempala, V. Vinay. Clustering large graphs via the singular value decomposition. Mach. Learn., 56(1-3):9-33, 2004.
- (3) S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-136, 1982.
- (4) P. Berkhin. Survey of clustering data mining techniques. Technical Report, Accrue Software, San Jose, CA, 2002.
- (5) S. Har-Peled, B. Sadri. How fast is the k-means method? In SODA '05: Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 877-885, Philadelphia, PA, USA, 2005.
- (6) UCI Machine Learning Repository, Center for Machine Learning and Intelligent Systems. https://archive.ics.uci.edu/ml/datasets.html.