1 Introduction
Clustering analysis is a fundamental unsupervised machine learning strategy with broad-ranging applications. It aims to group unlabelled data objects into clusters according to a certain similarity measure, so that objects within each cluster are more similar to each other than to objects in other clusters.
Many clustering algorithms have been investigated in the past decades [7, 24, 26]. In practice, clustering-friendly datasets rarely occur in nature, which makes it necessary to employ a two-step strategy. First, the raw data are kernelized [25, 9] or otherwise preprocessed with dimension-reduction methods, such as principal component analysis [10], nonnegative matrix factorization [31, 11], spectral embeddings [28, 20, 33], deep autoencoders [23, 16, 27] or generative adversarial networks [6]. Second, a clustering algorithm is applied to the latent embedding. Many clustering methods exist for this second step, including K-means [19, 18], single-linkage clustering [29], affinity propagation [12] and BIRCH [34], among which the classic K-means is arguably the method of choice in general situations.

Unfortunately, even with well-processed data the K-means algorithm (also called the Lloyd algorithm) still encounters a scalability bottleneck. Figure 1 demonstrates that the solution quality of the K-means algorithm deteriorates as the number of clusters K increases, while a newly proposed algorithm, called KindAP and introduced below, correctly recovers the ground-truth solutions in all tested cases. This set of experiments is performed on synthetic datasets with separable clusters (see the caption of Figure 1 for more details). Later we will show that similar phenomena occur in real image datasets as well.
[Figure 1: each synthetic dataset is embedded using the leading singular vectors of its data matrix. The Lloyd algorithm (with 10 random replications) and the proposed KindAP algorithm are applied to the processed data matrices; clustering accuracy and running time are recorded.]

The root cause of the scalability bottleneck is that greedy algorithms like K-means are highly sensitive to initializations and rely on multiple random replications to achieve good results. As K increases, the number of random replications needed for good results appears to rise out of control. To overcome this difficulty, some convex optimization models have been developed. Recent works include semidefinite programming (SDP) [22, 2]
and linear programming (LP) relaxations
[2] of K-means, and convex fusion methods [17, 14, 8]. However, the per-iteration complexity of these convex models is quadratic in the total number of samples, instead of linear as in K-means.

In the framework of spectral clustering, an algorithm called spectral rotation (SR) was proposed as an alternative to the K-means algorithm for clustering in embedded spaces [32, 15]. It was argued that the spectral rotation model would be less sensitive to initializations than K-means. Nevertheless, our experiments (see Section 6) indicate that, at least in some cases, the spectral rotation algorithm can be as sensitive as K-means.
The main contributions of this paper are summarized as follows:

A general clustering framework is proposed that directly solves for K indicators by subspace matching. Under suitable conditions, the K-means model and the model that we promote are two special cases of this general framework.

A semi-convex-relaxation scheme, called KindAP, is constructed to efficiently solve the particular K-indicators model. KindAP, which is essentially deterministic, can find high-quality solutions at a per-iteration complexity linear in the number of data points.

KindAP solutions can be used to warm-start the K-means algorithm without any replication, producing better results (measured by the K-means objective) than running K-means with a huge number of K-means++ random replications, when K is relatively large.

Extensive numerical results show a superior scalability of the proposed approach over both the K-means and spectral rotation models, especially when the number of clusters becomes large.
2 Preliminaries
This section provides definitions and concepts needed for our study. It also revisits the classical K-means model from an unusual angle that motivates our new K-indicators model.
2.1 A set of indicator matrices
Consider the problem of clustering a dataset of N objects into K clusters. A matrix H ∈ ℝ^{N×K} is called an indicator matrix if:

(1)    H ≥ 0 and each row of H has exactly one positive element,

where a positive element H_{ij} indicates that object i belongs to cluster j. This set of indicator matrices is the most general one, containing various subsets corresponding to different definitions of indicator matrices in the literature. For example, H is called a binary indicator matrix if H_{ij} ∈ {0, 1} [32], and a normalized indicator matrix if H_{ij} ∈ {0, 1/√(n_j)}, where n_j denotes the number of objects in cluster j [4].
For convenience, by default we define the set of indicator matrices as:
Definition 2.1.
(The set of indicator matrices)

(2)    ℋ := { H ∈ ℝ^{N×K} : H ≥ 0, H^T H = I_K, each row of H has exactly one nonzero element, and the nonzero elements in each column of H are equal }.
Clearly, ℋ is a discrete set, since each row of H can have only one positive element and the positive elements in each column share a common value. To emphasize the predetermined number K, we will refer to the columns of H ∈ ℋ collectively as K-indicators. The two terms, indicator matrix and K-indicators, will be used interchangeably.
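For concreteness, a normalized indicator matrix can be built directly from a label vector. The following sketch is our own illustration (not part of the paper's code) and verifies the defining properties:

```python
import numpy as np

def indicator_matrix(labels, K):
    """Build the normalized indicator matrix H from a label vector.

    Row i has a single positive entry 1/sqrt(n_j) in column j = labels[i],
    where n_j is the size of cluster j, so that H^T H = I_K.
    """
    labels = np.asarray(labels)
    N = labels.size
    H = np.zeros((N, K))
    counts = np.bincount(labels, minlength=K)
    H[np.arange(N), labels] = 1.0 / np.sqrt(counts[labels])
    return H

# Example: 5 objects grouped into K = 2 clusters.
H = indicator_matrix([0, 0, 1, 1, 1], K=2)
print(np.allclose(H.T @ H, np.eye(2)))  # True: columns are orthonormal
print((H > 0).sum(axis=1))              # [1 1 1 1 1]: one positive per row
```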
2.2 K-means model viewed as subspace matching
Although the classic K-means model is commonly written in terms of centroids, it can also be written in terms of indicator matrices, or K-indicators. Let U_d ∈ ℝ^{N×d} be a given data matrix whose rows correspond to the data vectors. It is well known that the classical K-means model can be rewritten as [4]:

(3)    min_{H ∈ ℋ} ‖U_d − H H^T U_d‖_F²,

where the constraints defining ℋ force the nonzero elements in each column of H to have the same value (i.e., 1/√(n_j)). In other words, ℋ, a subset of the general indicator set (1), contains exactly the normalized indicator matrices.
When the data matrix is orthonormal, i.e., U_d^T U_d = I, then after some simple calculations the K-means model (3) can be reduced to the following three equivalent optimization problems:

(4)    min_{H ∈ ℋ} ‖U_d U_d^T − H H^T‖_F²  ⟺  max_{H ∈ ℋ} ‖U_d^T H‖_F²  ⟺  max_{H ∈ ℋ} ∑_i σ_i²(U_d^T H),

where σ_i(·) denotes the i-th singular value of a matrix. We note that taking square roots of these objective functions does not change the equivalence.
These relationships provide a subspace-matching perspective on the K-means model. To see this, note that the distance between two subspaces can be measured by a distance between their unique orthogonal projections. For the first model in (4), the two orthogonal projections involved are U_d U_d^T and H H^T, respectively. On the other hand, minimizing a subspace distance under some norm is equivalent to maximizing a vector norm of the cosines of the principal angles between the two subspaces [3]. In (4), these cosines are the singular values of U_d^T H, where both U_d and H are orthonormal bases.
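The equivalences in (4) are easy to check numerically. The sketch below is our own illustration (with a balanced labeling for simplicity); it confirms that ‖U_d U_d^T − H H^T‖_F² equals 2(K − ∑_i σ_i²(U_d^T H)):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 50, 4

# An orthonormal "data matrix" U_d, e.g. K leading singular vectors.
U, _ = np.linalg.qr(rng.standard_normal((N, K)))

# A normalized indicator matrix H for a balanced labeling.
labels = np.arange(N) % K
H = np.zeros((N, K))
H[np.arange(N), labels] = 1.0 / np.sqrt(np.bincount(labels)[labels])

sigma = np.linalg.svd(U.T @ H, compute_uv=False)
lhs = np.linalg.norm(U @ U.T - H @ H.T, 'fro') ** 2  # projection distance
rhs = 2 * (K - np.sum(sigma ** 2))
print(np.isclose(lhs, rhs))  # True
```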
3 K-indicators Model
The subspace-matching perspective of the K-means model can be extended to a more general framework that solves for an indicator matrix. We will call it the K-indicators framework:

(5)    min_{H ∈ ℋ} dist( ℛ(U_d), ℛ(H) ),

where ℛ(U_d) refers to the range space of U_d (similarly for ℛ(H)), and dist(·,·) is a subspace distance, which can also be replaced by the distance squared, for example. Clearly, in the K-means model (3) the squared distance function is quartic in H and nonconvex. Can we have a distance function that is simpler in H?
Theorem 3.1.
Let U, H ∈ ℝ^{N×K} be such that U^T U = I_K and H^T H = I_K. Then

(6)    dist(U, H) := min { ‖U Z − H‖_F : Z ∈ ℝ^{K×K}, Z^T Z = I_K }

defines a distance between ℛ(U) and ℛ(H), satisfying

(7)    dist²(U, H) = 2 ( K − ∑_{i=1}^{K} σ_i(U^T H) ).

Moreover,

(8)    (1/2) dist²(U, H) ≤ K − ∑_{i=1}^{K} σ_i²(U^T H) ≤ dist²(U, H).
The proof of this theorem is provided in the supplementary material.
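Both the identity (7) and the inequalities (8) can be checked numerically. Below is a small sketch of ours using two random orthonormal bases, where the minimizing Z in (6) is the orthogonal Procrustes solution:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 60, 5

def rand_orth(n, k, rng):
    """A random n-by-k matrix with orthonormal columns."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    return Q

U, H = rand_orth(N, K, rng), rand_orth(N, K, rng)

# dist(U, H) = min_{Z'Z=I} ||U Z - H||_F: Z = P Q^T from the SVD of U^T H.
P, s, Qt = np.linalg.svd(U.T @ H)
dist2 = np.linalg.norm(U @ (P @ Qt) - H, 'fro') ** 2

print(np.isclose(dist2, 2 * (K - s.sum())))                      # identity (7)
print(0.5 * dist2 - 1e-9 <= K - np.sum(s ** 2) <= dist2 + 1e-9)  # inequality (8)
```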
Now, we propose the following model based on the distance function (6):

(9)    min_{H ∈ ℋ} dist²(U_d, H) = min_{H ∈ ℋ} min_{Z^T Z = I_K} ‖U_d Z − H‖_F².

For convenience, we will refer to this model as the K-indicators model. Other models can of course be constructed using different distance functions in (5).
Theorem 3.1 reveals the relations between the K-means model (3) and the K-indicators model (9). Both minimize a distance between the "data space" ℛ(U_d) and the "indicator space" ℛ(H); equivalently, both maximize a norm of the singular values of the matrix U_d^T H. In either case, H varies in a set of indicator matrices. In terms of those singular values, the difference between the K-means model (3) and the K-indicators model (9) lies in maximizing either the squared ℓ2-norm or the ℓ1-norm of the vector of singular values. It is important to note that the two models are distinct, and thus may give distinct solutions at optimality, but they are quite close, as indicated by the inequalities in (8).
4 KindAP Algorithm
We seek to design an algorithm for the K-indicators model (9) that has the potential to overcome the aforementioned scalability bottleneck. We have seen that the objective in (3) for K-means is by itself nonconvex as a function of H. In contrast, the objective in (9) for K-indicators is convex, representing a squared distance between two sets, 𝒰 := { U_d Z : Z^T Z = I_K } and ℋ, both of which are nonconvex. Of the two, ℋ is extremely nonconvex with a combinatorial structure, while 𝒰 is less difficult to handle; for one thing, the projection onto 𝒰 is unique in generic cases.
Our idea is to break the difficult problem of solving model (9) into a sequence of subproblems that satisfy two criteria: (i) each one is easier to solve, and (ii) the per-iteration complexity is kept linear in N. To balance the two criteria, we propose a semi-convex-relaxation scheme: we introduce a convex relaxation of ℋ but keep 𝒰 unchanged. This leads to the intermediate problem

(10)    min { ‖U − N‖_F² : U ∈ 𝒰, N ∈ 𝒩 },

where 𝒩 := { N ∈ ℝ^{N×K} : N ≥ 0 } is a closed convex set whose boundary contains ℋ.
The projection onto 𝒩 is trivial, while the projection onto 𝒰 is the so-called Procrustes problem. The closed forms of the two projections are collected in Proposition 4.1 below.
Proposition 4.1.
The projection of a matrix X ∈ ℝ^{N×K} onto the set 𝒩 is given by

(11)    P_𝒩(X) = max(X, 0),

where the maximum is taken elementwise, and the projection of X onto the set 𝒰 is given by

(12)    P_𝒰(X) = U_d P Q^T,

where U_d is an arbitrary orthonormal basis of the data subspace, and P Σ Q^T is a singular value decomposition of the matrix U_d^T X.

An alternating projection algorithm [30] appears a natural choice for attacking the semi-relaxation model (10), for which the computational complexity of the two projections, onto 𝒰 and 𝒩, remains linear with respect to N.
On top of the above semi-convex-relaxation scheme, we construct a double-layered alternating projection framework for approximately solving the K-indicators model (9); see Figure 2 for a schematic description. In our algorithm, each outer iteration consists of a loop going from 𝒰 to ℋ and then coming back. The route from 𝒰 to ℋ takes a detour through 𝒩 by solving the semi-convex model (10) via alternating projections, which constitute the inner iterations. The inner or outer iterations are stopped once a prescribed amount of improvement in the relevant objective value is no longer observed. We name this algorithm KindAP (K-indicators by Alternating Projections).
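The scheme can be summarized in a minimal sketch (ours, not the released implementation): it takes 𝒩 to be the nonnegative orthant, uses fixed sweep counts in place of the improvement-based stopping rules, and rounds onto ℋ by keeping the largest entry in each row (a simple heuristic):

```python
import numpy as np

def proj_U(X, Ud):
    """Project X onto {Ud @ Z : Z'Z = I} via Procrustes, as in (12)."""
    P, _, Qt = np.linalg.svd(Ud.T @ X)
    return Ud @ (P @ Qt)

def proj_N(X):
    """Project X onto the nonnegative orthant, as in (11)."""
    return np.maximum(X, 0.0)

def round_H(X):
    """Round X to an indicator matrix: keep each row's largest entry
    (a rounding heuristic) and renormalize the columns."""
    n, K = X.shape
    labels = X.argmax(axis=1)
    counts = np.bincount(labels, minlength=K)
    H = np.zeros_like(X)
    H[np.arange(n), labels] = 1.0 / np.sqrt(np.maximum(counts[labels], 1))
    return H, labels

def kindap(Ud, outer=10, inner=50):
    """Double-layered alternating projections for model (9): a sketch."""
    U = Ud.copy()
    for _ in range(outer):
        # Inner loop: semi-convex model (10), alternating between the
        # two sets via their projections.
        for _ in range(inner):
            U = proj_U(proj_N(U), Ud)
        # Outer step: round to an indicator matrix, then come back.
        H, labels = round_H(proj_N(U))
        U = proj_U(H, Ud)
    return labels
```

On an orthonormal embedding U_d (e.g., K leading eigenvectors of a normalized graph Laplacian), `kindap(U_d)` returns a label vector without any random initialization.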
5 Related Works
This section clarifies the relationship between our study and other relevant works. In the framework of spectral clustering, an approach called Program of Optimal Discretization (POD) was proposed to compute a binary indicator matrix H and a rotation matrix R from an input matrix U ∈ ℝ^{N×K} consisting of the K leading eigenvectors of a normalized Laplacian matrix [32]. The model is

(13)    min { ‖H − U R‖_F² : H ∈ 𝓑, R^T R = I_K },
where 𝓑 is the set of binary indicator matrices. The model aims to find a rotation matrix R so that UR is best matched by a binary indicator matrix H. The POD model is also called spectral rotation (SR) in [15], where an "alternating projection" type algorithm for solving it was proposed. For a fixed R, the binary indicator matrix is computed by
(14)    H_{ij} = 1 if j = arg max_k (UR)_{ik}, and H_{ij} = 0 otherwise,

for i = 1, …, N. For a fixed H, the rotation matrix is given by

(15)    R = P Q^T,

where P and Q are formed by the left and right singular vectors of U^T H, respectively.
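For reference, the SR iteration alternating (14) and (15) can be sketched as follows (our illustrative reimplementation, not the authors' code; R0 is the initial rotation to which the algorithm is sensitive):

```python
import numpy as np

def spectral_rotation(U, R0, iters=100):
    """Alternate the SR updates (14) and (15): a sketch.

    U  : N-by-K matrix of K leading eigenvectors.
    R0 : K-by-K initial rotation matrix.
    """
    n, K = U.shape
    R = R0
    for _ in range(iters):
        # (14): for fixed R, assign each object to the largest entry
        # in its row of UR, giving a binary indicator matrix H.
        labels = (U @ R).argmax(axis=1)
        H = np.zeros((n, K))
        H[np.arange(n), labels] = 1.0
        # (15): for fixed H, R = P Q^T from the SVD of U^T H.
        P, _, Qt = np.linalg.svd(U.T @ H)
        R = P @ Qt
    return labels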
There are two main differences between the SR and K-indicators clustering approaches. The first concerns the two models, which do look rather similar in appearance. Mathematically, the K-indicators framework is based on subspace matching, that is, on minimizing a distance, or a measure of the principal angles, between two subspaces. The motivation of the SR model, on the other hand, was to add an orthonormality restriction to the matrix R whose rows represent centers (note that in the K-means model these centers are unrestricted). Indeed, since the columns of a binary indicator matrix H do not form an orthonormal basis, the SR objective does not mathematically define a subspace distance, and the singular values of U^T H are not cosines of principal angles.
The second difference is about the algorithms used. For the K-indicators model, we propose a double-layered alternating projection framework based on a semi-convex-relaxation scheme, which is essentially a deterministic algorithm. In the SR algorithm, formula (14) is still greedy in nature, which makes the algorithm vulnerable to the same scalability bottleneck encountered by K-means, as we will see in Section 6.
Moreover, an extra benefit of using KindAP is that the intermediate variable N in (10) provides a posteriori information for evaluating clustering quality in the absence of ground-truth knowledge. Please see the supplementary materials for more details.
6 Numerical Experiments
KindAP performs well on synthetic datasets in terms of both quality and efficiency, as seen in Figure 1. However, we still need to validate it on "real" datasets commonly used in the literature. This section contains results from extensive numerical experiments on many real datasets.
All the algorithms used in this section are implemented and run in Matlab R2018a. Specifically, the Lloyd algorithm in use is Matlab's kmeans function with GPU support and the K-means++ initialization strategy [1]. In our notation, "KM r" denotes running the Lloyd algorithm with r random replications in its default setting; KM 1, KM 10, KM 30, and KM 10000 will be used in this section. KindAP is the proposed algorithm, and KindAP+L is a combination of KindAP and Lloyd in which the former is run first and the resulting centers are used to initialize the latter without further replication. We implement the SR algorithm based on [15]. Due to SR's sensitivity to initialization, we also run it with multiple random replications and output the solution with the lowest SR objective value; similarly, "SR r" denotes running the SR algorithm with r random replications. Our code is available at https://github.com/yangyuchen0340/Kind. The package also contains a Python implementation consistent with the well-known sklearn package of tools for data mining and data analysis. We remark that the package supports SR and KindAP in the same function, and allows flexibility in selecting different types of indicator matrices for KindAP.
6.1 Deterministic behavior of KindAP
First, we show that the KindAP algorithm is robust and essentially deterministic, as demonstrated on two datasets, YaleB and COIL100, both preprocessed by a technique called deep subspace clustering [16]. Six algorithms are tested, each with 200 random runs. The maximum, minimum, and standard deviation of clustering accuracy are plotted in Figure 3.

As expected, Figure 3 shows that random replications can reduce the deviation of accuracy for both K-means and SR. Most importantly, we observe that KindAP and KindAP+L are essentially deterministic on these two examples, achieving identical clustering results in 200 independent random runs. Interestingly, on both examples the K-indicators model happens to give much better clustering accuracy than the K-means model does. Moreover, when solving the K-means model on these two examples, KindAP+L produces higher accuracy than both KM 1 and KM 10 with 200 random replications (2000 replications in total for the latter).
6.2 Results on 35 real datasets
In this set of experiments, we compare the clustering accuracy and the K-means objective values of KindAP+L and KM 10000 on 35 real datasets. They include many UCI datasets as well as image and NLP datasets, with K ranging from 2 to 120. Although the underlying structures of these datasets vary from set to set, for practical reasons we preprocess all 35 raw datasets uniformly by the normalized cut [28, 20] of their k-nearest-neighbor similarity graphs [5]. As a result, the clustering quality is not uniformly high.
Table 1 (columns: Clustering Accuracy, then Objective Value):
Datasets  K  KindAP+L  KM 10000  KindAP+L  KM 10000
YaleB  38  36.54%  36.50%  6.0864e+00  6.0858e+00 
ORL  40  67.00%  66.50%  6.3029e+00  6.0932e+00 
Reuters  65  39.36%  39.26%  9.4345e+00  1.0090e+01 
PIE  68  16.17%  16.90%  1.0711e+01  1.2179e+01 
FERET  72  66.20%  66.20%  1.3391e+01  1.3398e+01 
AR  120  60.48%  58.33%  1.5452e+01  1.7324e+01 
It turns out that on the 29 datasets with K ≤ 20, KindAP+L and KM 10000 obtain identical clustering accuracies, presumably corresponding to the global optima for the given datasets (see the supplementary material for more details).
The results for the remaining six datasets with K ≥ 38 are reported in Table 1. We observe that on the two datasets with K ≤ 40, KM 10000 reached smaller K-means objective values, but on the four datasets with K ≥ 65, KindAP+L obtained better objective values, in fact significantly better in 3 out of the 4 cases. The fact that a single KindAP run provides better initializations than 10000 random replications of the state-of-the-art K-means++ scheme [1] speaks volumes for the merit of the proposed semi-convex-relaxation scheme used by KindAP. Since the running time of KindAP+L is of the same order as that of a single run of the Matlab kmeans function, KindAP+L is in essence thousands of times faster than K-means in reaching high accuracies on large-K problems.
In the previous section, we claimed that the iteration complexity of KindAP is linear with respect to the size of the dataset, but the efficiency of KindAP also depends on the number of iterations required for convergence. In practice, we observe that KindAP takes only a few outer iterations and dozens of inner iterations on the 35 datasets (see the supplementary material for more details).
6.3 Datasets with deep neural network features
Deep neural networks (DNNs) have become a major trend in data mining, and many deep clustering techniques have recently been developed to achieve higher clustering accuracy on big datasets. In this section, we select five image datasets, ORL, CIFAR100 (train and test), COIL100 and UKBench, and process them with a DNN-based procedure as follows. We input the raw data into a pretrained neural network [13], then extract the features represented by the neurons at a fully-connected layer (usually the second from the last). Afterwards, we perform ordinary spectral embeddings and apply KM, SR and KindAP to these DNN features.

To be specific, we extract features from layer avg pool of the DNN Xception to cluster ORL, COIL100 and UKBench, and from layer fc2 of the DNN VGG16 to cluster CIFAR100 (see [13]), with weights pretrained on ImageNet. All network architectures and pretrained weights are provided by the Python deep learning library Keras. Since data augmentation techniques and batch sizes do not make much difference to the final results, we use the default settings. Lastly, we perform 30 random replications for K-means and SR.

Table 2 (entries: accuracy / running time):
Datasets  K  KindAP  SR 30  KindAP+L  KM 30

ORL  40  86.50%/0.04  86.50%/0.21  86.25%/0.05  86.25%/0.57 
CIFAR100(train)  100  99.63%/3.75  84.80%/16.51  99.62%/6.53  94.40%/113.22 
CIFAR100(test)  100  68.98%/0.72  67.85%/4.45  65.48%/1.59  61.54%/ 52.23 
COIL100  100  98.71%/1.15  81.14%/ 3.61  98.70%/1.53  97.67%/ 16.58 
UKBench  2550  89.67%/4034  82.40%/ 4727  89.93%/4602  84.62%/17270 
Table 2 summarizes the performance of the four algorithms in clustering the DNN features. It shows that KindAP and KindAP+L are generally more accurate than SR and K-means with 30 random replications. We reiterate that our study is not about preprocessing techniques but about clustering methodologies. However, the use of DNN techniques that produce clustering-friendly features does enable us to better evaluate the performance of different clustering methods. On poorly processed data, one could hardly differentiate the behaviors of the methods, because all of them would produce almost equally poor clustering results.
In terms of timing, KindAP is usually slower than the average running time per replication of either Lloyd or SR, but of the same order. However, both Lloyd and SR require multiple replications in order to have a chance of reaching an accuracy level comparable with that of KindAP (sometimes they reach only a lower level of accuracy even after a huge number of replications). As indicated by the results in Table 2, KindAP runs much faster than KM 30 and SR 30 while attaining higher accuracies. In fact, the current version of the KindAP algorithm is still far from optimal in efficiency, and we are working on new algorithms to accelerate the solution of the semi-convex relaxation model.
To summarize our numerical experiments, we list several observations. (i) KindAP and KindAP+L are essentially deterministic without the need for random replications. (ii) On small-K problems, KindAP+L works as well as the classic K-means with very large numbers of replications. (iii) On large-K problems, KindAP and KindAP+L generally outperform their counterparts SR and Lloyd with multiple replications. (iv) The advantages of KindAP appear more pronounced on high-dimensional but clustering-friendly features extracted by advanced DNN techniques.
7 Conclusions
Data clustering usually consists of two tasks: first extracting suitable features and then applying a clustering method. The focus of this work is on the latter task, for which the method of choice has arguably been K-means. We propose the K-indicators framework (5), which includes as special cases the classic K-means model (3) and the particular K-indicators model (9), corresponding to two different subspace distances. We promote the K-indicators model (9) because it has a convex objective function and allows an effective semi-convex-relaxation scheme, leading to the construction of an efficient algorithm called KindAP, which is essentially deterministic without any need for random replications. Like the K-means algorithm, KindAP keeps the per-iteration complexity linear in the size of the dataset, making it practical for big-volume data clustering.
For synthetic data with separable clusters, experiments show that K-indicators can overcome the big-K bottleneck suffered by K-means (see Figure 1). Is this advantage really relevant in real-world applications? Our experimental results in Tables 1–2 strongly suggest an affirmative answer. On the one hand, more and more big data applications come with large K values. On the other hand, advances in feature extraction techniques, especially those using deep neural networks, make it possible to generate clustering-friendly, or even nearly separable, clusters in feature spaces. Therefore, a deterministic clustering method like KindAP, which is scalable in K and linear in the dataset size, will clearly become more desirable than K-means, which suffers from the big-K bottleneck due to its over-sensitivity to initializations.
We note that the K-means model (3) and the K-indicators model (9) are two distinct models that in general produce different clustering results at optimality. However, the two models are close enough (see (8)) that KindAP results can be used to initialize K-means. Our experiments show that a single KindAP initialization can generate better clustering results, as measured by the K-means objective, than those generated from large numbers of replications using the current state-of-the-art initialization. A limitation of KindAP is that it requires, at least in theory, the input data matrices to be orthonormal, which is always the case in spectral clustering and similar settings.
Finally, we mention that the development of a theoretic foundation for the KindAP algorithm is an ongoing effort that will be reported in a future work.
Acknowledgments
Y. Yang and Y. Zhang would like to acknowledge support from NSF Grant DMS-1418724 and from the Shenzhen Research Institute of Big Data (SRIBD). The work of L. Xu is partially supported by a Key Project of the Major Research Plan of NSFC (Grant No. 91630205).
References
 (1) David Arthur and Sergei Vassilvitskii. kmeans++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACMSIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
 (2) Pranjal Awasthi, Afonso S Bandeira, Moses Charikar, Ravishankar Krishnaswamy, Soledad Villar, and Rachel Ward. Relax, no need to round: Integrality of clustering formulations. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 191–200. ACM, 2015.
 (3) Ake Bjoerck and Gene H Golub. Numerical methods for computing angles between linear subspaces. Mathematics of Computation, 27(123):579–594, 1971.

 (4) Christos Boutsidis, Petros Drineas, and Michael W Mahoney. Unsupervised feature selection for the k-means clustering problem. In Advances in Neural Information Processing Systems, pages 153–161, 2009.
 (5) Deng Cai, Xiaofei He, and Jiawei Han. Document clustering using locality preserving indexing. IEEE Transactions on Knowledge and Data Engineering, 17(12):1624–1637, 2005.
 (6) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
 (7) Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE transactions on pattern analysis and machine intelligence, 17(8):790–799, 1995.
 (8) Eric C Chi and Kenneth Lange. Splitting methods for convex clustering. Journal of Computational and Graphical Statistics, 24(4):994–1013, 2015.
 (9) Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Kernel kmeans: spectral clustering and normalized cuts. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 551–556. ACM, 2004.
 (10) Chris Ding and Xiaofeng He. Kmeans clustering via principal component analysis. In Proceedings of the twentyfirst international conference on Machine learning, page 29. ACM, 2004.
 (11) Chris HQ Ding, Tao Li, and Michael I Jordan. Convex and seminonnegative matrix factorizations. IEEE transactions on pattern analysis and machine intelligence, 32(1):45–55, 2010.
 (12) Brendan J Frey and Delbert Dueck. Clustering by passing messages between data points. science, 315(5814):972–976, 2007.
 (13) Joris Guerin and Byron Boots. Improving image clustering with multiple pretrained CNN feature extractors. In British Machine Vision Conference, page 51, 2018.
 (14) Toby Dylan Hocking, Jean-Philippe Vert, Armand Joulin, and Francis R Bach. Clusterpath: an algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning, pages 745–752, 2011.
 (15) Jin Huang, Feiping Nie, and Heng Huang. Spectral rotation versus k-means in spectral clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 431–437, 2013.
 (16) Pan Ji, Tong Zhang, Hongdong Li, Mathieu Salzmann, and Ian Reid. Deep subspace clustering networks. In Advances in Neural Information Processing Systems, pages 24–33, 2017.
 (17) Fredrik Lindsten, Henrik Ohlsson, and Lennart Ljung. Just relax and come clustering!: A convexification of kmeans clustering. 2011.
 (18) Stuart P Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

 (19) James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.
 (20) Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2002.
 (21) Online. ORL dataset. http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.
 (22) Jiming Peng and Yu Wei. Approximating kmeanstype clustering via semidefinite programming. SIAM Journal on Optimization, 18(1):186–205, 2007.
 (23) Xi Peng, Shijie Xiao, Jiashi Feng, WeiYun Yau, and Zhang Yi. Deep subspace clustering with sparsity prior. In IJCAI, pages 1925–1931, 2016.
 (24) Alex Rodriguez and Alessandro Laio. Clustering by fast search and find of density peaks. Science, 344(6191):1492–1496, 2014.

 (25) Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
 (26) Sohil Shah and Vladlen Koltun. Robust continuous clustering. Proceedings of the National Academy of Sciences, 114(37):9814–9819, 2017.
 (27) Uri Shaham, Kelly P Stanton, Henry Li, Ronen Basri, Boaz Nadler, and Yuval Kluger. SpectralNet: Spectral clustering using deep neural networks. In International Conference on Learning Representations, 2018.
 (28) Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905, 2000.
 (29) Robin Sibson. Slink: an optimally efficient algorithm for the singlelink cluster method. The computer journal, 16(1):30–34, 1973.
 (30) J von Neumann. Functional operators. vol. ii. the geometry of orthogonal spaces, volume 22 (reprint of 1933 notes) of annals of math. Studies. Princeton University Press, 1950.
 (31) Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on nonnegative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 267–273. ACM, 2003.
 (32) Stella X Yu and Jianbo Shi. Multiclass spectral clustering. In Proceedings of the IEEE International Conference on Computer Vision, pages 313–319, 2003.
 (33) Hongyuan Zha, Xiaofeng He, Chris H Q Ding, Ming Gu, and Horst D Simon. Spectral relaxation for k-means clustering. In Advances in Neural Information Processing Systems, pages 1057–1064, 2001.
 (34) Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Birch: an efficient data clustering method for very large databases. In ACM Sigmod Record, volume 25, pages 103–114. ACM, 1996.
8 Supplementary Material
8.1 Proof of Theorem 3.1
Proof.
It is easy to verify that dist(·,·) satisfies nonnegativity, the identity of indiscernibles, and symmetry, so it suffices to show that it satisfies the triangle inequality.

First, let U_i ∈ ℝ^{N×K}, i = 1, 2, 3, be three orthonormal matrices, and denote by Z_{ij} the optimal solutions of the optimization problems

    Z_{ij} = arg min { ‖U_i Z − U_j‖_F : Z^T Z = I_K },   i, j ∈ {1, 2, 3}.

Then, since Z_{12} Z_{23} is orthogonal and ‖(U_1 Z_{12} − U_2) Z_{23}‖_F = ‖U_1 Z_{12} − U_2‖_F, we have

    dist(U_1, U_3) = ‖U_1 Z_{13} − U_3‖_F ≤ ‖U_1 Z_{12} Z_{23} − U_3‖_F ≤ ‖U_1 Z_{12} Z_{23} − U_2 Z_{23}‖_F + ‖U_2 Z_{23} − U_3‖_F = dist(U_1, U_2) + dist(U_2, U_3).

Therefore, the function dist defines a distance between ℛ(U) and ℛ(H).

For fixed U and H, the closed-form solution of the optimization problem in (6) is given by Z* = P Q^T, where P Σ Q^T is a singular value decomposition of U^T H, and the optimal value satisfies

(16)    dist²(U, H) = ‖U Z* − H‖_F² = 2K − 2 tr( (Z*)^T U^T H )
(17)                = 2K − 2 tr(Σ) = 2 ( K − ∑_{i=1}^{K} σ_i(U^T H) ),

which proves (7). Moreover, σ_i(U^T H) ≤ ‖U‖_2 ‖H‖_2 = 1, and this bound still holds when we replace H by HZ for any orthogonal Z, therefore

(18)    0 ≤ σ_i(U^T H) ≤ 1,   i = 1, …, K.

The singular values of U^T H are the cosines of the principal angles between ℛ(U) and ℛ(H), so σ_i ≤ 1, which implies 1 − σ_i² ≤ 2(1 − σ_i); this claim can also be derived from (1 − σ_i)² ≥ 0. Summing over i and using (7), we obtain the right part of inequality (8). Similarly, σ_i ≤ 1 implies σ_i² ≤ σ_i, hence

(19)    (1/2) dist²(U, H) = K − ∑_{i=1}^{K} σ_i(U^T H) ≤ K − ∑_{i=1}^{K} σ_i²(U^T H),

which is the left part of (8). ∎
8.2 Identical performance on 29 small-K datasets
Table 3 gives the characteristics of the 35 datasets used in the paper. Table 4 compares the performance of KindAP+L and KM 10000 on the 29 "small-K" (K ≤ 20) datasets.
Dataset  No. of Clusters  No. of Samples  No. of Attributes
Australian  2  
Breast  2  
Chess  2  
Crx  2  
Diabetes  2  
Heart  2  
Isolet  2  
Monk1  2  
Pima  2  
Vote  2  
Cars  3  
Iris  3  
Lenses  3  
Waveform21  3  
WINE  3  
Auto  6  
Control  6  
Dermatology  6  
glass  6  
Solar  6  
Segment  7  
ZOO  7  
Ecoli  8  
Yeast  10  
JAFFE  10  
USPS  10  
MNIST6000  10  
YALE  15  
COIL20  20  
YALEB  38  
ORL  40  
Reuters  65  
PIE  68  
FERET  72  
AR  120 
Datasets  K  Clustering Accuracy (KindAP+L, KM 10000)  Objective Value (KindAP+L, KM 10000)
Australian  2  68.70%  68.70%  1.530017e-01  1.530017e-01
Breast  2  51.94%  51.94%  2.413292e-02  2.413292e-02
Crx  2  53.33%  53.33%  1.304456e-01  1.304456e-01
Diabetes  2  64.71%  64.71%  6.455858e-02  6.455858e-02
Heart  2  62.22%  62.22%  3.106905e-01  3.106905e-01
Isolet  2  59.29%  59.29%  1.568688e-01  1.568688e-01
Monk1  2  66.90%  66.90%  1.568688e-01  1.568688e-01
Pima  2  65.36%  65.36%  3.827793e-01  3.827793e-01
Vote  2  56.78%  56.78%  1.318960e-01  1.318960e-01
Cars  3  67.60%  67.60%  2.180936e-01  2.180936e-01
Iris  3  67.60%  67.60%  2.244802e-01  2.244802e-01
Lenses  3  41.67%  41.67%  6.056203e-01  6.056203e-01
Waveform21  3  52.37%  52.37%  3.746717e-01  3.746717e-01
WINE  3  61.80%  61.80%  3.761429e-01  3.761429e-01
Auto  6  32.68%  32.68%  5.431086e-01  5.431086e-01
Control  6  58.33%  58.33%  4.644075e-01  4.644075e-01
Dermatology  6  95.90%  95.90%  3.648208e-01  3.648208e-01
glass  6  50.47%  50.47%  1.135033e+00  1.135033e+00
Solar  6  37.15%  37.15%  5.693965e-01  5.693965e-01
Segment  7  41.43%  41.43%  5.369213e-01  5.369213e-01
ZOO  7  51.49%  51.49%  4.726120e-01  4.726120e-01
Ecoli  8  56.55%  56.55%  2.058497e+00  2.058497e+00
Yeast  10  31.74%  31.74%  8.848088e-02  8.848088e-02
JAFFE  10  62.22%  62.22%  1.351849e+00  1.351849e+00
USPS  10  66.78%  66.78%  1.394255e+00  1.394255e+00
MNIST6000  10  63.38%  63.38%  1.777071e+00  1.777071e+00
YALE  15  56.36%  56.36%  1.855949e+00  1.855949e+00
COIL20  20  82.01%  82.01%  1.365658e+00  1.365658e+00
Table 5 demonstrates the convergence behavior of KindAP on the 35 datasets. For each outer iteration (from the rotation variable to the indicator variable and back), we record the number of inner iterations (alternating projections between the two constraint sets) required for convergence; the total numbers of outer and inner iterations (the latter in parentheses) are reported in the last column.
We observe that the first outer iteration of KindAP requires the largest number of inner iterations. In fact, the first outer iteration also makes the most significant progress. In particular, on the datasets Breast and Chess, KindAP converges after only one outer iteration, in which the K-indicators objective reaches the globally optimal value of zero (up to round-off error).
We also ran another experiment in which, after each KindAP outer iteration, we computed cluster centers from the current solution and started the Lloyd algorithm to solve the K-means model. Interestingly, on all 29 datasets in Table 4, the Lloyd algorithm returned identical K-means results after the first KindAP outer iteration. That is, for solving the K-means model, a single KindAP outer iteration suffices to generate a set of centers from which the Lloyd algorithm produces the presumably global optimum without any further replication. On the other hand, after the first outer iteration and until it stops, KindAP continues to improve the K-indicators objective value. This again highlights the fact that these two models are distinct.
Datasets  Outer iterations 1  2  3  4  5  6  7  Total
Australian  9  4  2(13)  
Breast  24  1(24)  
Chess  23  1(23)  
Crx  17  3  2(20)  
Diabetes  12  3  2(15)  
Heart  8  2  2(10)  
Isolet  8  2  2(10)  
Monk1  8  2  2(10)  
Pima  12  4  2(16)  
Vote  9  4  2(13)  
Cars  8  2  2(10)  
Iris  11  2  2(13)  
Lenses  40  4  2(44)  
Waveform21  9  3  3  2(15)  
WINE  15  3  2(18)  
Auto  14  4  2(18)  
Control  13  4  2(17)  
Dermatology  12  4  2(16)  
glass  34  5  2(39)  
Solar  9  4  2(13)  
Segment  12  5  2(17)  
ZOO  12  4  2(16)  
Ecoli  12  4  2(16)  
Yeast  15  4  2(19)  
JAFFE  14  4  2(18)  
USPS  14  4  4  4  2(15)  
MNIST6000  20  4  4  4  4(32)  
YALE  25  4  2(29)  
COIL20  14  4  2(18)  
YALEB  26  4  5  5  4(40)  
ORL  50  6  5  4  4  5(69)  
Reuters  37  4  4  4  4  4  6(57)  
PIE  35  5  5  5  4(50)  
FERET  48  5  5  5  5  5  6(73)  
AR  41  5  5  5  5  5(61) 
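The outer/inner structure tallied in Table 5 can be sketched in a few lines. The following is only an illustrative rendering, assuming an inner loop of alternating projections between the rotation set $\{UZ : Z \text{ orthogonal}\}$ and the nonnegative cone, and an outer step that rounds to a hard indicator matrix; all names, tolerances and the stopping rules are our own assumptions, not the authors' implementation:

```python
import numpy as np

def kindap_sketch(u, max_outer=10, max_inner=100, tol=1e-6):
    """Illustrative KindAP-style iteration (hypothetical sketch).

    Inner loop: alternating projections between {U Z : Z orthogonal}
    (via the Procrustes SVD solution) and the nonnegative cone.
    Outer loop: round to a hard indicator H (row-wise argmax), then
    map H back to an orthonormal matrix for the next round.
    Returns H and the number of inner iterations per outer iteration.
    """
    n, k = u.shape
    inner_counts = []
    for _ in range(max_outer):
        n_mat = np.maximum(u, 0)
        inners = 0
        for _ in range(max_inner):
            # project N onto {U Z}: nearest rotation of the current U
            p, _, qt = np.linalg.svd(u.T @ n_mat)
            # project the rotated U back onto the nonnegative cone
            n_new = np.maximum(u @ (p @ qt), 0)
            inners += 1
            done = np.linalg.norm(n_new - n_mat) < tol
            n_mat = n_new
            if done:
                break
        inner_counts.append(inners)
        # outer rounding: one-hot indicator from the largest entry per row
        h = np.zeros((n, k))
        h[np.arange(n), n_mat.argmax(axis=1)] = 1.0
        # orthonormalize H to restart the inner loop
        u_new, _ = np.linalg.qr(h)
        if np.allclose(u_new, u):
            break
        u = u_new
    return h, inner_counts
```

Consistent with Table 5, a run of this sketch typically spends most of its inner iterations in the first outer iteration, with later outer iterations needing only a handful.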
8.3 Uncertainty information
In unsupervised learning, it is usually difficult to evaluate performance due to the absence of ground truth. Clustering accuracy and normalized mutual information are the most commonly used metrics, but both require ground truth. However, the proposed KindAP algorithm is able to provide some
a posteriori information as a metric of performance. One output of the KindAP algorithm is a nonnegative matrix $N$, which generally contains more than one positive element in each row. It is observed in practice that the magnitude of $N_{ij}$ is positively correlated with the probability that data point $i$ belongs to cluster $j$. Let $\hat N$ hold the elements of $N$ sorted row-wise in descending order; we define a soft indicator vector $s$, which takes values between zero and one, as follows:
$$s_i = 1 - \hat N_{i2}/\hat N_{i1}, \tag{20}$$
where the ratio $\hat N_{i2}/\hat N_{i1}$ is between the second largest and the largest elements on the $i$-th row of $N$. Intuitively, the closer $s_i$ is to zero, the more uncertain the assignment of this data object, meaning it is more probable that the object belongs to more than one cluster. In contrast, $s_i$ close to one implies a safe clustering assignment. This intuition is clearly validated in the simple example presented in Fig. 4A. The results in Fig. 4B show the relationship between the soft indicator and clustering accuracy on the face image dataset ORL. We observe that large soft indicator values on average indicate high KindAP clustering accuracy. The soft indicator offered by the KindAP algorithm helps address the very challenging issue of assessing the quality and uncertainty of clustering assignments in the absence of ground truth.
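As a concrete illustration, the soft indicator of Eq. (20) can be computed from any nonnegative membership matrix in a few lines; the matrix below is synthetic and purely for illustration:

```python
import numpy as np

def soft_indicator(n_mat):
    """s_i = 1 - (second largest entry)/(largest entry) of row i, per Eq. (20).
    Values near 0 flag uncertain assignments; values near 1 flag safe ones."""
    top2 = -np.sort(-n_mat, axis=1)[:, :2]   # two largest entries per row
    return 1.0 - top2[:, 1] / top2[:, 0]

# Synthetic membership matrix: row 0 is a confident assignment,
# row 1 is ambiguous between the first two clusters.
n_mat = np.array([[0.90, 0.05, 0.05],
                  [0.48, 0.46, 0.06]])
s = soft_indicator(n_mat)
print(s)  # approx. [0.9444, 0.0417]: row 0 safe, row 1 uncertain
```

The confident row yields $s_i$ near one, while the ambiguous row yields $s_i$ near zero, matching the interpretation above.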