t-k-means: A k-means Variant with Robustness and Stability

07/17/2019 · by Yang Zhang, et al. · Tsinghua University

Lloyd's k-means algorithm is one of the most classical clustering methods, widely used in data mining and as a data pre-processing procedure. However, due to the thin-tailed property of the Gaussian distribution, k-means suffers from relatively poor performance on heavy-tailed data and outliers. In addition, k-means has relatively weak stability, i.e., its result has a large variance, which reduces the credibility of the model. In this paper, we propose a robust and stable k-means variant, t-k-means, as well as its fast version, for solving the flat clustering problem. Theoretically, we detail the derivation of t-k-means and analyze its robustness and stability from the perspectives of the loss function, the influence function and the expression of the cluster center. A large number of experiments empirically demonstrate that our method is sound while preserving running efficiency.



Code repository: t-k-means — the implementation of the paper "t-k-means: A Robust and Stable k-means Variant", accepted by the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021.

1 Introduction

Lloyd's algorithm [Lloyd1982], commonly called the "standard k-means algorithm" ("k-means" for short), is one of the most classical methods for solving the clustering problem. It is widely used today in data mining [Yu et al.2009, Tsironis et al.2013], pattern recognition [Coates et al.2011, Coelho and Murphy2009], etc., or as a data pre-processing procedure in more complex algorithms [Gopalan2016, Zhang et al.2017].

It is known that k-means is a special case of the Gaussian mixture model (GMM) [Mclachlan and Basford1988] with every component sharing the same mixing coefficient and covariance matrix [Bishop2006]. However, due to the thin-tailed property of the Gaussian distribution, k-means (and GMM) may perform poorly on data that contain a group or groups of observations with heavy tails or outliers [Peel and Mclachlan2000]. Consequently, the t mixture model (TMM) [Liu and Rubin1995] has been introduced to gain robustness in the clustering task, since its base distribution (the t distribution) is a heavy-tailed generalization of the Gaussian distribution. However, due to the tremendous demands of the necessary parameter estimation (such as covariance matrices), TMM is unstable under arbitrary initialization and requires overwhelming time cost. These facts greatly prevent it from becoming a popular clustering method. In addition, since the update of each cluster center in k-means is based only on the samples within its cluster, k-means has a relatively large variance, which reduces the credibility of the model.

In this paper, to obtain a robust and stable clustering method while preserving running efficiency, we propose t-k-means. It is not only as extensible and fast as k-means, but also robust to heavy-tailed data and more stable than the classical k-means method. Throughout this paper, we elaborate on the derivation of t-k-means, prove its robustness and stability, and present an extensive empirical study.

In summary, our three major contributions are as follows.

  • We derive the t-k-means clustering method from TMM; it is a robust and stable generalization of k-means.

  • We theoretically prove that the proposed method is more robust and stable than k-means, from the perspectives of the loss function, the influence function and the cluster center expression.

  • Empirically, a large number of experiments demonstrate that our method is sound while preserving running efficiency.

2 Related Work

2.1 k-means Variants

k-medoids [Kaufman and Rousseeuw1987] chooses samples as cluster centroids and uses a generalization of the Manhattan norm, instead of the L2 norm, to define the distance between samples. k-medians [Arora et al.1998] calculates the median of each cluster, instead of the mean, to determine its centroid, as a result of using the L1 loss. k-means with the Mahalanobis distance metric [Mao and Jain1996] has been used to detect hyperellipsoidal clusters, but at the expense of higher computational cost. A variant of k-means using the Itakura–Saito distance [Linde et al.1980] has been used for vector quantization in speech processing. Banerjee et al. [Banerjee et al.2005] exploit the family of Bregman divergences for k-means [Jain2010].

In addition, k-means++ [Arthur and Vassilvitskii2007] is a preprocessing procedure for choosing the initial values of k-means, which avoids the occasionally poor k-means results caused by an arbitrarily bad initialization. It can also be seamlessly integrated into our proposed t-k-means method, as sketched below.
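As a concrete illustration of this seeding step, here is a minimal NumPy sketch of k-means++ initialization (the function name `kmeans_pp_init` and the toy data are ours, not from the paper); the centers it returns can seed t-k-means exactly as they seed k-means.

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """Pick K initial centers from X (shape N x d) with k-means++ seeding."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    centers = [X[rng.integers(N)]]                     # first center: uniform at random
    for _ in range(K - 1):
        C = np.array(centers)                          # (m, d) centers chosen so far
        d2 = ((X[:, None, :] - C[None]) ** 2).sum(-1).min(axis=1)   # D(x)^2 to nearest center
        centers.append(X[rng.choice(N, p=d2 / d2.sum())])           # sample prop. to D(x)^2
    return np.array(centers)

X = np.random.default_rng(1).normal(size=(500, 2))
print(kmeans_pp_init(X, K=3))
```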

2.2 Relation between k-means and GMM

To better explain TMM and t-k-means, we start by reviewing the most well-known technique, GMM.

Given the dataset $\mathcal{X} = \{x_1, \dots, x_N\}$, where $x_n$ denotes a $d$-dimensional sample, the Gaussian mixture model (GMM) is a linear superposition of $K$ Gaussian components [Mclachlan and Basford1988], i.e.,

$$p(x_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k), \qquad (1)$$

where $\pi_k$, $\mu_k$ and $\Sigma_k$ are the mixing coefficient, the mean vector and the covariance matrix of the $k$-th component, respectively, with $0 \le \pi_k \le 1$, $\sum_{k=1}^{K}\pi_k = 1$, and $k = 1, \dots, K$.

Usually, the EM algorithm is used to estimate the parameters. More specifically, the complete data of sample $x_n$ in the EM algorithm are given by $(x_n, z_n)$, where the latent variable $z_{nk}$ denotes whether $x_n$ belongs to the $k$-th component. In the M-step, the parameters of GMM are updated through the objective

$$\mathcal{Q} = \sum_{n=1}^{N}\sum_{k=1}^{K} \gamma_{nk}\left[\ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right], \qquad (2)$$

where $\gamma_{nk}$ is the expectation of $z_{nk}$. Let $I$ denote a $d$-dimensional identity matrix and $\epsilon$ be a known parameter. Assuming that all the components share one single mixing coefficient $\pi_k = 1/K$ and covariance matrix $\Sigma_k = \epsilon I$, maximizing Eq. (2) becomes equivalent to minimizing

$$J = \sum_{n=1}^{N}\sum_{k=1}^{K} \gamma_{nk}\,\|x_n - \mu_k\|^2. \qquad (3)$$

Eq. (3) is identical to the loss function of k-means. Clearly, k-means can be regarded as a special case of GMM with different components sharing the same mixing coefficient and covariance matrix [Mitchell and others1997].
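A hedged numerical illustration of this reduction: with hard responsibilities and fixed centers, the Eq. (3)-style objective coincides with the usual k-means within-cluster sum of squares (toy data and variable names are ours).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                       # toy data
mu = np.array([[1.0, 1.0], [-1.0, -1.0]])           # two fixed centers

d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)      # (N, K) squared distances
gamma = (d2 == d2.min(axis=1, keepdims=True)).astype(float)   # hard responsibilities

loss_eq3 = (gamma * d2).sum()                       # Eq. (3)-style objective
wcss = sum(((X[gamma[:, k] == 1] - mu[k]) ** 2).sum() for k in range(2))
print(loss_eq3, wcss)                               # the two values coincide
```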

3 Derivation of t-k-means

Similar to the relation between k-means and GMM, t-k-means is also a special case of TMM, under the condition that the covariance matrices are given parameters. To reduce the number of parameters further, following Liu and Rubin [Liu and Rubin1995], we also assume that all components share the same degrees of freedom $\nu$. These conditions regulate the model complexity, so that t-k-means can be robust while preserving running efficiency.

3.1 Log-Likelihood Function Derivation

Similar to GMM, the t mixture model (TMM) is a linear superposition of $K$ Student's $t$ components, i.e.,

$$p(x_n) = \sum_{k=1}^{K} \pi_k \, t(x_n \mid \mu_k, \Sigma_k, \nu_k),$$

where $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K}\pi_k = 1$.

Following the definition of the complete-data vector in GMM, we write it for TMM as $(x_n, z_n, u_n)$, where $z_n$ is defined in Section 2.2 and $u_n = (u_{n1}, \dots, u_{nK})$ are the additional missing data [Liu and Rubin1995], such that

$$x_n \mid u_{nk}, z_{nk} = 1 \sim \mathcal{N}\!\left(\mu_k, \Sigma_k / u_{nk}\right), \qquad u_{nk} \mid z_{nk} = 1 \sim \mathrm{Gamma}\!\left(\tfrac{\nu_k}{2}, \tfrac{\nu_k}{2}\right). \qquad (4)$$

Thus, the complete-data log-likelihood function can be written as

$$\ln L_c = \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\left[\ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k / u_{nk}) + \ln \mathrm{Gamma}(u_{nk} \mid \tfrac{\nu_k}{2}, \tfrac{\nu_k}{2})\right].$$
3.2 EM-based Log Likelihood Optimization

In this section, we detail how the EM algorithm is used to iteratively optimize the log likelihood.

In the EM algorithm, the objective function of a new iteration is the conditional expectation of the complete-data log-likelihood function given the current parameters, i.e.,

$$Q(\Theta^{\mathrm{new}} \mid \Theta) = \mathbb{E}\!\left[\ln L_c(\Theta^{\mathrm{new}}) \mid \mathcal{X}, \Theta\right], \qquad (5)$$

where $\Theta$ collects the model parameters. The parameters with superscript "new" will be estimated in the new iteration.

3.2.1 E-step

Estimate $\gamma_{nk}$

The posterior estimate of the latent variable $z_{nk}$ is

$$\gamma_{nk} = \mathbb{E}[z_{nk}] = \frac{\pi_k \, t(x_n \mid \mu_k, \Sigma_k, \nu_k)}{\sum_{j=1}^{K} \pi_j \, t(x_n \mid \mu_j, \Sigma_j, \nu_j)}. \qquad (6)$$
Estimate $\mathbb{E}[u_{nk}]$

Since $x_n \mid u_{nk}, z_{nk} = 1 \sim \mathcal{N}(\mu_k, \Sigma_k / u_{nk})$, from the property of the Gaussian distribution we know that $u_{nk}\,\delta_{nk}$ follows a $\chi^2_d$ distribution, where $\delta_{nk} = (x_n - \mu_k)^{\top}\Sigma_k^{-1}(x_n - \mu_k)$. Treating $x_n$ as data, from the property of the gamma distribution it is not hard to prove that the likelihood of $u_{nk}$ is

$$p(x_n \mid u_{nk}, z_{nk} = 1) \propto u_{nk}^{d/2}\exp\!\left(-\tfrac{1}{2}\,u_{nk}\,\delta_{nk}\right). \qquad (7)$$

According to Eq. (4) and Eq. (7), the posterior distribution of $u_{nk}$ given $x_n$ is

$$u_{nk} \mid x_n, z_{nk} = 1 \sim \mathrm{Gamma}\!\left(\frac{\nu_k + d}{2}, \frac{\nu_k + \delta_{nk}}{2}\right). \qquad (8)$$

Based on Eq. (8), we have

$$\mathbb{E}[u_{nk}] = \frac{\nu_k + d}{\nu_k + \delta_{nk}}. \qquad (9)$$
Estimate $\mathbb{E}[\ln u_{nk}]$

To estimate $\mathbb{E}[\ln u_{nk}]$, we make use of the following lemma from [Liu and Rubin1995].

Lemma 1.

If a random variable $u \sim \mathrm{Gamma}(\alpha, \beta)$, then $\mathbb{E}[\ln u] = \psi(\alpha) - \ln \beta$, where $\psi(\cdot)$ is the digamma function.

Applying Lemma 1 to Eq. (8), we obtain

$$\mathbb{E}[\ln u_{nk}] = \psi\!\left(\frac{\nu_k + d}{2}\right) - \ln\!\left(\frac{\nu_k + \delta_{nk}}{2}\right).$$
3.2.2 M-step

Given the results of the E-step, we can decompose $Q$ into the two parts given in Eq. (10) and Eq. (11):

(10)
(11)
Estimate $\pi_k$

$\pi_k$ is obtained by solving the equation

$$\frac{\partial}{\partial \pi_k}\left[Q + \lambda\Big(\sum_{j=1}^{K}\pi_j - 1\Big)\right] = 0, \qquad (12)$$

where $\lambda$ is a Lagrange multiplier enforcing $\sum_{k}\pi_k = 1$.
Estimate $\mu_k$

With the same technique used for estimating $\pi_k$, we solve $\partial Q / \partial \mu_k = 0$ and obtain

$$\mu_k = \frac{\sum_{n=1}^{N}\gamma_{nk}\,\mathbb{E}[u_{nk}]\,x_n}{\sum_{n=1}^{N}\gamma_{nk}\,\mathbb{E}[u_{nk}]}. \qquad (13)$$
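Assuming the mean update takes the standard TMM form sketched in Eq. (13) (a responsibility- and E[u]-weighted average, cf. Liu and Rubin), a two-line implementation looks as follows; `m_step_means` is our own helper name.

```python
import numpy as np

def m_step_means(X, gamma, e_u):
    """Eq. (13) sketch: each center is a weighted mean with weights gamma_nk * E[u_nk],
    so far-away samples (small E[u]) contribute little to the new centers."""
    w = gamma * e_u                               # (N, K) weights
    return (w.T @ X) / w.sum(axis=0)[:, None]     # (K, d) updated centers

# sanity check: with unit weights this reduces to the plain sample mean
X = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]])
print(m_step_means(X, np.ones((3, 1)), np.ones((3, 1))))   # [[2. 2.]]
```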
Estimate $\nu_k$

The estimate of $\nu_k$ is the solution of the equation

(14)

To solve Eq. (14), we apply the following lemma from the work of Abramowitz and Stegun [Abramowitz and Stegun1964].

Lemma 2. For $x > 0$, the digamma function $\psi(x)$ can be expanded as $\ln x$ minus a series whose coefficients are the Bernoulli numbers of the second kind, the leading term of the series being $\frac{1}{2x}$.

From Lemma 2, we have $\psi(x) = \ln x - \frac{1}{2x} + R(x)$, where $R(x)$ is the error term of the truncation. Denoting the constant term in Eq. (14) as $c$, it is not hard to show

(15)

Figure 1: Graph of the error term $R(x)$.

We illustrate the functional plot of $R(x)$ in Figure 1; it shows that when $x$ is not too small, $R(x)$ is approximately $0$. Therefore, we can update $\nu_k$ using Eq. (15).
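A quick numerical check of the truncation used above (assuming the retained leading terms are ψ(x) ≈ ln x − 1/(2x); the exact form of Eq. (15) is the authors'): scipy's digamma confirms that the error shrinks rapidly as the argument grows.

```python
import numpy as np
from scipy.special import digamma

x = np.array([1.0, 2.0, 5.0, 10.0, 50.0])
approx = np.log(x) - 1.0 / (2.0 * x)               # leading terms of the expansion in Lemma 2
print(np.round(np.abs(approx - digamma(x)), 5))    # errors fall from ~0.08 at x=1 to ~3e-5 at x=50
```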

3.2.3 A Fast Version of t-k-means

In TMM, if $\nu$ is unknown, the EM algorithm converges slowly [Liu and Rubin1995]. Therefore, following Vanhatalo et al. [Vanhatalo et al.2009], we fix $\nu$ as a constant. To further reduce the number of parameters, we apply an additional simplification, referring to Bishop [Bishop2006]. With these quantities fixed, we have a fast version of t-k-means, which we coin fast t-k-means.
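Putting the pieces together, here is a hedged, self-contained sketch of one possible fast t-k-means loop: identity covariances, equal mixing coefficients, and ν fixed to an illustrative value of 3 (the paper fixes ν to an unspecified constant). This is our own illustration, not the authors' MATLAB implementation.

```python
import numpy as np

def fast_t_k_means(X, K, nu=3.0, n_iter=50, seed=0):
    """EM-style fast t-k-means sketch: identity covariances, equal mixing
    coefficients, and a fixed nu (3.0 here is purely illustrative)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, K, replace=False)]                  # random init; k-means++ also works
    for _ in range(n_iter):
        delta = ((X[:, None, :] - mu[None]) ** 2).sum(-1)    # squared distances, (N, K)
        # E-step: responsibilities under the t density (Eq. (6)) and weights E[u] (Eq. (9))
        logp = -0.5 * (nu + d) * np.log1p(delta / nu)        # t log-density up to shared constants
        gamma = np.exp(logp - logp.max(1, keepdims=True))
        gamma /= gamma.sum(1, keepdims=True)
        e_u = (nu + d) / (nu + delta)
        # M-step: weighted-mean center update (cf. Eq. (13))
        w = gamma * e_u
        mu = (w.T @ X) / w.sum(0)[:, None]
    return mu, gamma.argmax(1)

# toy usage: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
centers, labels = fast_t_k_means(X, K=2)
print(np.round(centers, 2))
```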

Figure 2: Graph of the loss functions of k-means, k-medians and t-k-means.

3.3 Relation between t-k-means and k-means

If $\nu \to \infty$, then TMM degenerates to GMM. According to Section 2.2, k-means is a special case of GMM with all components sharing the same mixing coefficient and covariance matrix, while t-k-means is a special case of TMM under the same condition. Therefore, it is not hard to see that t-k-means is a robust generalization of k-means, i.e., t-k-means reduces to k-means when $\nu \to \infty$.
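A quick numerical check of this degeneration (a sketch using scipy's t and normal densities; not from the paper):

```python
import numpy as np
from scipy.stats import norm, t

x = np.linspace(-4, 4, 41)
for nu in [1, 5, 30, 1000]:
    gap = np.max(np.abs(t.pdf(x, df=nu) - norm.pdf(x)))
    print(nu, round(gap, 5))          # the gap vanishes as nu grows: t -> Gaussian
```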

4 Robustness and Stability Analysis

In this section, we prove that t-k-means is more robust than k-means from the perspectives of the loss function and the influence function [Koh and Liang2017], and explain why t-k-means is more stable than k-means.

4.1 Robustness Analysis

4.1.1 Loss Function Perspective

The log likelihood of t-k-means is given by

$$\ln L = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k\, t(x_n \mid \mu_k, \Sigma_k, \nu). \qquad (16)$$

Given Eq. (16), we can rewrite the loss function of t-k-means; focusing on the term related to data $x_n$, we have

$$\ell(x_n) \propto \ln\!\left(1 + \frac{\|x_n - \mu_k\|^2}{\nu}\right). \qquad (17)$$

Considering Eq. (17) and Eq. (3), we learn that Eq. (17) is a log-L2 loss of $\|x_n - \mu_k\|$, while the loss function of k-means, Eq. (3), is an L2 loss. Besides, from the work in [Arora et al.1998], it is known that the loss of k-medians is an L1 loss. On the other hand, an outlier (in this paper, we adopt the definition of outliers in [Tukey1977], i.e., a sample is an outlier if it lies below $Q_1 - 1.5\,\mathrm{IQR}$ or above $Q_3 + 1.5\,\mathrm{IQR}$, where $Q_1$ and $Q_3$ are the lower and upper quartiles, respectively, and $\mathrm{IQR} = Q_3 - Q_1$) is often distant from the component mean $\mu_k$. Thus, we plot the relationship between the loss values and the data-to-centre distance in Figure 2. The figure illustrates that the log-L2 loss is the least sensitive to the distance between $x_n$ and $\mu_k$. That is, in this regard, t-k-means is more robust than k-means and k-medians, as its objective function is far less sensitive to outliers than the other two.
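To make the sensitivity comparison of Figure 2 concrete, the snippet below evaluates the three per-sample losses at increasing data-to-centre distances; the log-L2 form follows our reading of Eq. (17), with illustrative ν and d.

```python
import numpy as np

r = np.array([1.0, 2.0, 5.0, 10.0, 20.0])          # distance ||x_n - mu_k||
nu, d = 3.0, 2                                      # illustrative values
losses = {
    "L2 (k-means)":       r ** 2,
    "L1 (k-medians)":     r,
    "log-L2 (t-k-means)": 0.5 * (nu + d) * np.log1p(r ** 2 / nu),
}
for name, loss in losses.items():
    print(f"{name:22s}", np.round(loss, 2))
# the log-L2 loss grows far more slowly with distance, so outliers contribute much less
```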

4.1.2 Influence Function Perspective

The influence function, a measure of the influence of upweighting a training sample on the estimation of the model parameters [Koh and Liang2017], is adopted in this section to compare the robustness of t-k-means and k-means. The influence of upweighting a training sample $x$ on the parameter estimate $\hat{\theta}$ is given by

$$\mathcal{I}(x) = -H_{\hat{\theta}}^{-1}\,\nabla_{\theta} L(x, \hat{\theta}),$$

where $L$ is the loss and $H_{\hat{\theta}}$ is the Hessian of the total loss at $\hat{\theta}$.

From Eq. (3), we can obtain the influence function of k-means for the parameter $\mu_k$. Now consider the influence function of t-k-means. It is clear that the difference between the influence of k-means and that of t-k-means lies in the coefficient. We denote these coefficients, for k-means and t-k-means respectively, as

(18)
(19)

Let $\delta_{nk}$ denote the distance between $x_n$ and the component mean $\mu_k$. From Eq. (19), Eq. (9) and Eq. (6), it is not hard to prove that the coefficient of t-k-means is a strictly decreasing function of $\delta_{nk}$. Assume that a sample is an outlier lying in the $k$-th component. Since the outlier is farther from the component mean than a clean sample, its coefficient is smaller than that of the clean sample, which implies that the outlier has less influence on the parameter estimate of t-k-means. Therefore, t-k-means is more robust to outliers than k-means from the view of the influence function.
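A small numerical illustration of the coefficient argument, using the E[u] factor from Eq. (9) as the distance-dependent part of the t-k-means coefficient (ν and d are illustrative):

```python
import numpy as np

nu, d = 3.0, 2
delta = np.array([0.5, 1.0, 4.0, 25.0, 100.0])   # squared distance to the component mean
weight = (nu + d) / (nu + delta)                  # Eq. (9): strictly decreasing in delta
print(np.round(weight, 3))                        # outliers (large delta) receive tiny weights
```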

4.2 Stability Analysis

The randomness of the k-means and t-k-means methods mainly comes from the selection of the initial cluster centers. Once the initial cluster centers are given, the clustering results of the two methods are fixed.

In the k-means method, the update of a cluster center is based only on the samples in its own cluster. However, according to Eq. (9) and Eq. (13), during the iterations the update of the cluster centers in t-k-means is determined by the information of all samples. In other words, no matter which samples are used as the initial cluster centers, the subsequent updates of the centers still depend on all samples. This use of global information significantly reduces the influence of the randomized initial centers on t-k-means, and it therefore enjoys stronger stability.

Methods A1 A2 A3 S1 S2
k-means 0.804±0.068 0.807±0.056 0.829±0.039 0.844±0.059 0.826±0.057
k-means++ 0.856±0.050 0.864±0.030 0.882±0.041 0.904±0.046 0.850±0.053
k-medoids 0.775±0.081 0.783±0.057 0.792±0.039 0.817±0.056 0.803±0.070
k-medians 0.760±0.060 0.780±0.060 0.780±0.040 0.810±0.070 0.780±0.070
GMM 0.088±0.013 0.052±0.008 0.035±0.012 0.127±0.009 0.122±0.002
TMM 0.483±0.189 0.295±0.153 0.264±0.110 0.409±0.185 0.483±0.143
t-k-means 0.851±0.061 0.853±0.041 0.882±0.038 0.932±0.062 0.872±0.050
fast t-k-means 0.922±0.035 0.928±0.025 0.929±0.028 0.986±0.000 0.937±0.000
fast t-k-means++ 0.954±0.045 0.948±0.021 0.945±0.020 0.986±0.000 0.936±0.000
Methods S3 S4 Unbalance dim32 dim64
k-means 0.639±0.039 0.584±0.026 0.589±0.306 0.650±0.081 0.639±0.091
k-means++ 0.671±0.035 0.589±0.028 0.909±0.078 0.985±0.028 0.995±0.018
k-medoids 0.649±0.039 0.583±0.033 0.652±0.076 0.771±0.094 0.756±0.095
k-medians 0.650±0.040 0.570±0.030 0.610±0.090 0.740±0.080 0.760±0.080
GMM 0.113±0.010 0.094±0.032 0.057±0.062 0.000±0.000 0.000±0.000
TMM 0.248±0.107 0.157±0.092 0.426±0.246 0.507±0.132 0.540±0.188
t-k-means 0.699±0.028 0.612±0.011 0.829±0.169 0.968±0.065 0.938±0.102
fast t-k-means 0.718±0.018 0.618±0.005 0.807±0.093 0.931±0.057 0.904±0.050
fast t-k-means++ 0.726±0.011 0.623±0.000 0.931±0.076 0.997±0.015 1.000±0.000
Table 1: ARI (mean±std) of the clustering results on synthetic datasets.

5 Experiments

5.1 Experiment Settings

The information of the datasets is shown in Table 2. The synthetic datasets are from [Pasi Franti2015] and the real-world datasets are from the UCI repository [Lichman2013]. In the experiments, the number of clusters $K$ is given by the selected datasets, and the hyper-parameter $\nu$ in fast t-k-means is set to a fixed value. The baselines include k-means [Lloyd1982], k-means++ [Arthur and Vassilvitskii2007], k-medoids [Kaufman and Rousseeuw1987], k-medians [Arora et al.1998], GMM [Mclachlan and Basford1988] and TMM [Liu and Rubin1995].

To evaluate the performance of the models, the Adjusted Rand Index (ARI) [Hubert and Arabie1985] is employed for labelled data, while the clustering mean squared error (MSE) [Tan and others2006] and W/B (W: within-cluster sum of squares; B: between-cluster sum of squares) [Kriegel et al.2017] are used for unlabelled data. Each experiment is repeated 100 times to reduce the effect of randomness. Among all methods, the one with the best performance is indicated in boldface and the underlined value denotes the second best. In addition, for fairness, all of the evaluated methods are implemented in MATLAB and run on an Intel(R) Core(TM) i7-7500U CPU at 2.7 GHz with 16 GB of RAM.
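For reference, the label-based and label-free measures can be computed as below; this is a Python sketch (the paper's own evaluation code is in MATLAB), using scikit-learn's ARI and a hand-rolled W/B with our own toy arrays.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def wb_ratio(X, labels):
    """W/B: within-cluster sum of squares over between-cluster sum of squares."""
    overall = X.mean(axis=0)
    W = B = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        W += ((Xk - Xk.mean(axis=0)) ** 2).sum()
        B += len(Xk) * ((Xk.mean(axis=0) - overall) ** 2).sum()
    return W / B

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([1, 1, 0, 0])
print(adjusted_rand_score(y_true, y_pred))        # 1.0: ARI is invariant to label permutation

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
print(wb_ratio(X, y_true))                        # small W/B means compact, well-separated clusters
```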

Data set Instances Features Clusters
A1 3000 2 20
A2 5250 2 35
A3 7500 2 50
S1 5000 2 15
S2 5000 2 15
S3 5000 2 15
S4 5000 2 15
Unbalance 6500 2 8
dim32 1024 32 16
dim64 1024 64 16
Iris 150 4 3
Bezdekiris 150 4 3
Seed 210 7 3
Wine 178 14 3
Table 2: Dataset description.

5.2 On Synthetic Datasets with Labels

In this part, we conduct experiments on the synthetic datasets, including S1–S4, A1–A3, Unbalance, dim32 and dim64 [Pasi Franti2015].

As illustrated in Table 1, GMM and TMM have relatively poor performance, since the mixture models demand heavy parameter estimation and are sensitive to the parameter initialization. With randomly initialized parameters, the proposed t-k-means and fast t-k-means outperform all k-means-type methods, GMM and TMM on all datasets. Besides, a new method, fast t-k-means++, obtained when fast t-k-means is initialized with k-means++ instead of random initialization, reaches the best performance on all 10 synthetic datasets. In addition, the t-k-means family has a smaller standard deviation than the k-means family on all datasets, which empirically demonstrates the stability of t-k-means.

Metrics Methods Bezdekiris Iris Seed Wine
MSE k-means 0.201±0.033 0.192±0.020 0.420±0.000 1.164±0.065
k-means++ 0.198±0.031 0.198±0.031 0.420±0.000 1.154±0.000
k-medoids 0.205±0.037 0.222±0.048 0.426±0.003 1.278±0.159
k-medians 0.215±0.045 0.226±0.051 0.434±0.053 1.258±0.138
GMM 0.323±0.000 0.324±0.000 0.966±0.301 1.779±0.239
TMM 0.425±0.238 0.293±0.106 0.606±0.121 1.612±0.209
t-k-means 0.186±0.000 0.187±0.000 0.420±0.000 1.154±0.000
fast t-k-means 0.187±0.000 0.187±0.000 0.420±0.000 1.153±0.000
fast t-k-means++ 0.187±0.000 0.187±0.000 0.420±0.000 1.153±0.000
W/B k-means 0.225±0.050 0.213±0.030 0.329±0.000 0.989±0.159
k-means++ 0.222±0.046 0.222±0.046 0.329±0.000 0.966±0.001
k-medoids 0.242±0.049 0.263±0.063 0.358±0.032 1.117±0.224
k-medians 0.254±0.066 0.275±0.082 0.349±0.060 1.142±0.201
GMM 0.418±0.000 0.419±0.000 2.349±3.241 3.280±2.102
TMM 1.020±1.028 0.410±0.275 0.582±0.183 2.538±1.068
t-k-means 0.202±0.000 0.202±0.000 0.333±0.000 1.015±0.000
fast t-k-means 0.216±0.000 0.217±0.000 0.333±0.000 0.975±0.000
fast t-k-means++ 0.216±0.000 0.217±0.000 0.333±0.000 0.975±0.000
Table 3: MSE and W/B (mean±std) of the clustering results on real datasets.

5.3 On Real-world Datasets with Labels

The methods are also evaluated on real-world datasets with labels, namely Iris and Bezdekiris. As shown in Table 4, the experiments lead to the same conclusion: the t-k-means family achieves the best performance on all datasets, with the best stability. However, the sample sizes of the real-world datasets are so small that the gap between t-k-means and the other methods is not pronounced.

Methods Iris Bezdekiris
k-means 0.665±0.104 0.670±0.097
k-means++ 0.694±0.068 0.695±0.067
k-medoids 0.678±0.124 0.687±0.112
k-medians 0.663±0.142 0.676±0.132
GMM 0.557±0.080 0.534±0.136
TMM 0.648±0.203 0.685±0.233
t-k-means 0.703±0.000 0.703±0.000
fast t-k-means 0.697±0.007 0.698±0.006
fast t-k-means++ 0.696±0.006 0.696±0.006
Table 4: ARI (mean±std) of the clustering results on real datasets.

5.4 On Real-world Datasets without Labels

In this section, we evaluate our methods on real-world datasets without labels (Bezdekiris, Iris, Seed and Wine); for Iris and Bezdekiris, the labels are ignored.

As shown in Table 3, with regard to both measures, the best performer on the real-world data is a member of the t-k-means family, except for W/B on Seed and Wine. Even when our methods do not perform the best with regard to a certain measure, they are very close to the best performers. In addition, for every measure–dataset pair, there is always at least one member of the t-k-means family that performs best (in most cases) or second best. The stability of t-k-means is also verified here again.

5.5 Runtime Efficiency

As shown in Table 5, t-k-means reduces the total runtime significantly compared with TMM. Notably, the speed of fast t-k-means and fast t-k-means++ is on the same order of magnitude as that of k-means.

Methods Iterations Total Time (sec)
k-means 9.58±1.67 0.0159±0.0048
k-means++ 8.50±1.81 0.0156±0.0047
k-medoids 7.46±1.20 0.0153±0.0034
k-medians 7.64±1.38 0.0186±0.0056
GMM 20.00±7.84 0.1462±0.0581
TMM 28.25±8.68 0.4136±0.1299
t-k-means 29.76±5.82 0.1043±0.0228
fast t-k-means 11.78±2.05 0.0183±0.0050
fast t-k-means++ 10.50±2.87 0.0181±0.0074
Table 5: Time cost (mean±std) on the Iris dataset.

6 Conclusion

This paper presents a novel TMM-based k-means variant, t-k-means, together with its fast version, in order to improve the robustness and stability of the conventional k-means method. We present the full mathematical derivation of t-k-means, and compare its robustness and stability with k-means with respect to the loss function, the influence function and the cluster center expression. Additionally, a large number of experiments empirically demonstrate that our method is sound while preserving running efficiency.

References

  • [Abramowitz and Stegun1964] Milton Abramowitz and Irene A Stegun. Handbook of mathematical functions: with formulas, graphs, and mathematical tables, volume 55. Courier Corporation, 1964.
  • [Arora et al.1998] Sanjeev Arora, Prabhakar Raghavan, and Satish Rao. Approximation schemes for Euclidean k-medians and related problems. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 106–113. ACM, 1998.
  • [Arthur and Vassilvitskii2007] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana, pages 1027–1035, 2007.
  • [Banerjee et al.2005] Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749, 2005.
  • [Bishop2006] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., 2006.
  • [Coates et al.2011] Adam Coates, Blake Carpenter, Carl Case, Sanjeev Satheesh, Bipin Suresh, Tao Wang, David J Wu, and Andrew Y Ng. Text detection and character recognition in scene images with unsupervised feature learning. In ICDAR, pages 440–445. IEEE, 2011.
  • [Coelho and Murphy2009] Luís Pedro Coelho and Robert F Murphy. Unsupervised unmixing of subcellular location patterns. In Proceedings of ICML-UAI-COLT 2009 Workshop on Automated Interpretation and Modeling of Cell Images, 2009.
  • [Gopalan2016] Raghuraman Gopalan. Bridging heterogeneous domains with parallel transport for vision and multimedia applications. In UAI, 2016.
  • [Hubert and Arabie1985] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
  • [Jain2010] Anil K Jain. Data clustering: 50 years beyond k-means. Pattern recognition letters, 31(8):651–666, 2010.
  • [Kaufman and Rousseeuw1987] Leonard Kaufman and Peter Rousseeuw. Clustering by means of medoids. North-Holland, 1987.
  • [Koh and Liang2017] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730, 2017.
  • [Kriegel et al.2017] Hans-Peter Kriegel, Erich Schubert, and Arthur Zimek. The art of runtime evaluation: Are we comparing algorithms or implementations? Knowledge and Information Systems, 52(2):341–378, Aug 2017.
  • [Lichman2013] M. Lichman. UCI machine learning repository, 2013.
  • [Linde et al.1980] Yoseph Linde, Andres Buzo, and Robert Gray. An algorithm for vector quantizer design. IEEE Transactions on Communications, 28(1):84–95, 1980.
  • [Liu and Rubin1995] Chuanhai Liu and Donald B Rubin. ML estimation of the t distribution using EM and its extensions, ECM and ECME. Statistica Sinica, pages 19–39, 1995.
  • [Lloyd1982] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
  • [Mao and Jain1996] Jianchang Mao and Anil K Jain. A self-organizing network for hyperellipsoidal clustering (HEC). IEEE Transactions on Neural Networks, 7(1):16–29, 1996.
  • [Mclachlan and Basford1988] Geoffrey J Mclachlan and Kaye E Basford. Mixture Models: Inference and Applications to Clustering. Applied Statistics, 38(2), 1988.
  • [Mitchell and others1997] Tom M Mitchell et al. Machine Learning. WCB/McGraw-Hill, 1997.
  • [Pasi Franti2015] Pasi Fränti et al. Clustering basic benchmark, 2015.
  • [Peel and Mclachlan2000] David Peel and Geoffrey J Mclachlan. Robust mixture modeling using the t distribution. Statistics and Computing, 10(4):339–348, 2000.
  • [Tan and others2006] Pang-Ning Tan et al. Introduction to data mining. Pearson Education India, 2006.
  • [Tsironis et al.2013] Serafeim Tsironis, Mauro Sozio, Michalis Vazirgiannis, and LE Poltechnique. Accurate spectral clustering for community detection in MapReduce. In NeurIPS Workshops. Citeseer, 2013.
  • [Tukey1977] John W Tukey. Exploratory Data Analysis. Addison-Wesley, Reading, MA, 1977.
  • [Vanhatalo et al.2009] Jarno Vanhatalo, Pasi Jylänki, and Aki Vehtari. Gaussian process regression with student-t likelihood. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, NeurIPS, pages 1910–1918. Curran Associates, Inc., 2009.
  • [Yu et al.2009] Shi Yu, BD Moor, and Yves Moreau. Clustering by heterogeneous data fusion: framework and applications. In NeurIPS workshop, 2009.
  • [Zhang et al.2017] Cheng Zhang, Hedvig Kjellström, and Stephan Mandt. Determinantal point processes for mini-batch diversification. In UAI. AUAI Press Corvallis, 2017.