SAS_Hill_Climb
Source code for the paper "A Simple Approach to Sparse Clustering".
view repo
Consider the problem of sparse clustering, where it is assumed that only a subset of the features are useful for clustering purposes. In the framework of the COSA method of Friedman and Meulman, subsequently improved in the form of the Sparse Kmeans method of Witten and Tibshirani, a natural and simpler hillclimbing approach is introduced. The new method is shown to be competitive with these two methods and others.
READ FULL TEXT VIEW PDF
A new submodule clustering method via sparse and lowrank representation...
read it
We propose the Lasso Weighted kmeans (LWkmeans) algorithm as a
simple...
read it
Clustering, a fundamental activity in unsupervised learning, is notoriou...
read it
Clustering is one of the most fundamental tools in the artificial
intell...
read it
Text in video is useful and important in indexing and retrieving the vid...
read it
In many situations it is desirable to identify clusters that differ with...
read it
BIRCH clustering is a widely known approach for clustering, that has
inf...
read it
Source code for the paper "A Simple Approach to Sparse Clustering".
Consider a typical setting for clustering items based on pairwise dissimilarities, with denoting the dissimilarity between items . For concreteness, we assume that and for all . In principle, if we want to delineate clusters, the goal is (for example) to minimize the average withincluster dissimilarity. In detail, a clustering into groups may be expressed as an assignment function , meaning that indexes the cluster that observation is assigned to. Let denote the class of clusterings of items into groups. For , its average withincluster dissimilarity is defined as
(1) 
This dissimilarity coincides with the withincluster sum of squares commonly used in kmeans type of clustering algorithms, with . If under the Euclidean setting, we further define cluster centers
(2) 
then the withincluster dissimilarity can be rewritten as follows,
(3) 
Since this paper deals with nonEuclidean settings also, we will use the more general withincluster dissimilarity defined in (1). The resulting optimization problem is the following:
Given , minimize over .  (4) 
This problem is combinatorial and quickly becomes computationally too expensive, even for small datasets. A number of proposals have been suggested (ESL, )
, ranging from hierarchical clustering approaches to Kmedoids.
Following in the footsteps of friedman2004clustering , we consider a situation where we have at our disposal not 1 but measures of pairwise dissimilarities on the same set of items, with denoting the th dissimilarity between items . Obviously, these measures of dissimilarity could be combined into a single measure of dissimilarity, for example,
(5) 
Our working assumption, however, is that only a few of these measures of dissimilarity are useful for clustering purposes, but we do not know which ones. This is the setting of sparse clustering, where the number of useful measures is typically small compared to the whole set of available measures.
We assume henceforth that all dissimilarity measures are equally important (for example, when we do not have any knowledge a priori on the relative importance of these measures) and that they all satisfy
(6) 
which, in practice, can be achieved via normalization, meaning,
(7) 
This assumption is important when combining measures in the standard setting (5) and in the sparse setting (8) below.
Suppose for now that we know that at most measures are useful among the measures that we are given. For , define the dissimilarity as
(8) 
and the corresponding average withincluster dissimilarity for the cluster assignment as
(9) 
If the goal is to delineate clusters, then a natural objective is the following:
(10) 
In words, the goal is to find the measures (which play the role of features in this context) that lead to the smallest optimal average withincluster dissimilarity. The problem stated in (10) is at least as hard as the problem stated in (4), and in particular, is computationally intractable even for small item sets.
In many situations, but not all, measurements of possibly different types are taken from each item
, resulting in a vector of measurements
. This vector is not necessarily in a Euclidean space, although this is an important example — see Section 2.2 below. We recover our setting when we have available a dissimilarity measure between and . This special case justifies our using the terms ‘feature’ and ‘attribute’ when referring to a dissimilarity measure.The literature on sparse clustering is much smaller than that of sparse regression or classification. Nonetheless, it is substantial and we review some of the main proposals in this section. We start with the contributions of friedman2004clustering and witten2010framework , which inspired this work.
friedman2004clustering propose clustering objects on subsets of attributes (COSA), which (in its simplified form) amounts to the following optimization problem
(11)  
(12) 
Here is some function and is a tuning parameter. When , the objective function can be expressed as
(13) 
When , the minimization of (13) over (12) results in any convex combination of attributes with smallest average withincluster dissimilarity. If this smallest dissimilarity is attained by only one attribute, then all weights will concentrate on this attribute, with weights for this attribute and for the others. In general, , and the term it multiplies is the negative entropy of the weights seen as a distribution on . This penalty term encourages the weights to spread out over the attributes. Minimizing over the weights first leads to
(14)  
over any clustering .  (15) 
where the minimum is over the ’s satisfying (12). (Note that the needs to be tuned.) The minimization is carried out using an alternating strategy where, starting with an initialization of the weights (say all equal, for all ), the procedure alternates between optimizing with respect to the clustering assignment and optimizing with respect to the weights. (There is a closedform expression for that derived in that paper.) The procedure stops when achieving a local minimum. witten2010framework observe that an application of COSA rarely results in a sparse set of features, meaning that the weights are typically spread out. They propose an alternative method, which they call Sparse Kmeans, which, under (6), amounts to the following optimization problem
(16)  
over any clustering and any weights  (17)  
(18) 
The penalty on results in sparsity for small values of the tuning parameter , which is tuned by the gap statistic of tibshirani2001estimating . The penalty is also important, as without it, the solution would put all the weight on only one the attribute with smallest average withincluster dissimilarity. A similar minimization strategy is proposed, which also results in a local optimum.
As can be shown in later sections, Sparse Kmeans is indeed effective in practice. However, its asymptotic consistency remains unknown. sun2012regularized
propose Regularized Kmeans clustering for highdimensional data and prove its asymptotic consistency. This method aims at minimizing a regularized
withincluster sum of squares with an adaptive group lasso penalty term on the cluster centers:(19)  
over any clustering and any sets of centers .  (20) 
Consider points in space (denoted in ) that we want to cluster. A typical dissimilarity is the Euclidean metric, denoted by . Decomposing this into coordinate components, with , and letting , we have
(21) 
A normalization would lead us to consider a weighted version of these dissimilarities. But assuming that the data has been normalized to have (Euclidean) norm 1 along each coordinate, (6) holds and we are within the framework described above.
This Euclidean setting has drawn most of the attention. Some papers propose to perform clustering after reducing the dimensionality of the data (ghosh2002mixture, ; liu2002, ; tamayo2007metagene, ). However, the preprocessing step of dimensionality reduction is typically independent of the end goal of clustering, making such approaches noncompetitive.
A modelbased clustering approach is based maximizing the likelihood. Under the sparsity assumption made here, the likelihood is typically penalized. Most papers assume a Gaussian mixture model. Let
denote the density of the normal distribution with mean
and covariance matrix . The penalized negative loglikelihood (when the goal is to obtain clusters) is of the form(22) 
where gathers all the parameters, meaning, the mixture weights , the group means , and the group covariance matrices . For instance, assuming that the data has been standardized so that each feature has sample mean
and variance
, pan2007penalized use(23) 
This may be seen as a convex relaxation of
(24) 
Typically, this optimization will result in some coordinates set to zero and thus deemed not useful for clustering purposes. In another variant, wang2008variable use
(25) 
To shrink the difference between every pair of cluster centers for each variable , guo2010pairwise use the pairwise fusion penalty
(26) 
Taking into account the covariance matrices, and assuming they are diagonal, xie2008penalized use
(27) 
The assumption that the covariance matrices are diagonal is common in highdimensional settings and was demonstrated to be reasonable in the context of clustering (fraley2006mclust, ). Note that none of these proposals make the optimization problem (22) convex or otherwise tractable. The methods are implemented via an EMtype approach.
Another line of research on sparse clustering is based on coordinatewise testing for mixing. This constitutes the feature selection step. The clustering step typically amounts to applying a clustering algorithm to the resulting feature space. For example, jin2014important use a KolmogorovSmirnov test against the normal distribution, while jin2015phase use a (chisquared) variance test. The latter is also done in (azizyan2013, ) and in (verzelen2014detection, )
. This last paper also studies the case where the covariance matrix is unknown and proposes an approach via moments. In a nonparametric setting,
chan2010using use coordinatewise mode testing.Hillclimbing methods are iterative in nature, making ‘local’, that is, ‘small’ changes at each iteration. They have been studied in the context of graph partitioning, e.g., by kernighan1970efficient and carson2001hill , among others. In the context of sparse clustering, we find the Kmedoids variant of aggarwal1999fast , which includes a hillclimbing step.
Many of the methods cited in Section 2 use alternate optimization in some form (e.g., EM), which can be interpreted as hillclimbing. Our method is instead directly formulated as a hillclimbing approach, making it simpler and, arguably, more principled than COSA or Sparse Kmeans.
Let be an algorithm for clustering based on dissimilarities. Formally, , where is a class of dissimilarity matrices and , and for with of dimension , . Note that could be a hillclimbing method for graph partitioning, or Kmedoids (or Kmeans if we are provided with points in a vector space rather than dissimilarities), or a spectral method, namely, any clustering algorithm that applies to dissimilarities. (In this paper, we will use Kmeans for numerical data and Kmedoids for categorical data using hamming distances as dissimilarities.) For , define
(28) 
Our procedure is described in Algorithm 1.
The use of algorithm in Step 1 is an attempt to minimize over . The minimization in Step 2 is over of size and it is trivial. Indeed, the minimizing is simply made of the indices corresponding to the smallest . For the choice of parameters and , any standard method for tuning parameters of a clustering algorithm applies, for example, by optimization of the gap statistic of tibshirani2001estimating . We note that the initialization phase, by itself, is a pure coordinatewise approach that has analogs in the Euclidean setting as mentioned Section 2.2. The hillclimbing process is the iteration phase.
We tried another initialization in Algorithm 1 consisting of drawing a feature set at random. We found that the algorithm behaved similarly. (Results not reported here.)
Compared with COSA and Sparse Kmeans, and other methods based on penalties, we note that the choice of features in our SAS algorithm is much simpler, using a hillclimbing approach instead.
A first question of interest is whether the iterations improve the purely coordinatewise method, defined as the method that results from stopping after one pass through Steps 12 in Algorithm 1 (no iteration). Although this is bound to vary with each situation, we examine an instance where the data come from the mixture of three Gaussians with sparse means. In detail, the setting comprises 3 clusters with 30 observations each and respective distributions , and , with having 50 ’s and 450 zeros. We assume that and are both given, and we run the SAS algorithm and record the Rand indexes (rand1971objective, ) and symmetric differences
as the end of each iteration of Steps 12. The setting is repeated 400 times. The means and confidence intervals under different regimes (
, , , ) are shown in Figure 1. At least in this setting, the algorithm converges in a few iterations and, importantly, these few iterations bring significant improvements, particularly over the purely coordinatewise algorithm.We consider the problem of selecting , the number of clusters, as outside of the scope of this work, as it is intrinsic to the problem of clustering and has been discussed extensively in the literature — see (tibshirani2001estimating, ; kou2014estimating, ) and references therein. Thus we assume that is given. Besides , our algorithm has one tuning parameter, the sparsity parameter , which is the number of useful features for clustering, meaning, the cardinality of set in (10).
Inspired by the gap statistic of tibshirani2001estimating , which was designed for selecting the number of clusters in standard Kmeans clustering, we propose a permutation approach for selecting . Let denote the average withincluster dissimilarity of the clustering computed by the algorithm on the original data with input number of features . Let denote the same quantity but obtained from a random permutation of the data — a new sample is generated by independently permuting the observations within each feature. The gap statistic (for ) is then defined as
(29) 
In practice, the expectation is estimated by Monte Carlo, generating
random permuted datasets. A large gap statistic indicates a large discrepancy between the observed amount of clustering and that expected of a null model (here a permutation of the data) with no salient clusters.The optimization of the gap statistics over is a discrete optimization problem. An exhaustive search for would involve computing gap statistics, each requiring runs of the SAS algorithm. This is feasible when and are not too large.^{2}^{2}2In our experiments, we choose as in the code that comes with (witten2010framework, ). See Algorithm 2, which allows for coarsening the grid.
To illustrate the effectiveness of choosing using the gap statistic, we computed the gap statistic for all in the same setting as that of Section 3.2 with . The result of the experiment is reported in Figure 2. Note that, in this relatively high SNR setting, the gap statistic achieves its maximum at the correct number of features.
In this experiment, at least, the gap statistic seems unimodal (as a function of ). If it were the case, we could use a golden section search, which would be much faster than an exhaustive grid search.
We performed a number of numerical experiments, both on simulated data and on real (microarray) data to compare our method with other proposals. Throughout this section, we standardize the data coordinatewisely, we assume that the number of clusters is given, and we use the gap statistic of tibshirani2001estimating to choose the tuning parameter in our algorithm.
We compare our Algorithm 1 with IFPCAHCT (jin2014important, ) and Sparse Kmeans (witten2010framework, ) in the setting of Section 3.2. We note that IFPCAHCT was specifically designed for that model and that Sparse Kmeans was shown to numerically outperform a number of other approaches, including standard Kmeans, COSA (friedman2004clustering, ), modelbased clustering (raftery2006variable, ), the penalized loglikelihood approach of (pan2007penalized, ) and the classical PCA approach. We use the gap statistic to tune the parameters of SAS Clustering and Sparse Kmeans. (SAS_gs uses a grid search while SAS_gss uses a golden section search.) IFPCAHCT is tuningfree — it employs the higher criticism to automatically choose the number of features.
In Table 1a, we report the performance for these three methods in terms of Rand index (rand1971objective, ) for various combinations of and . Each situation was replicated 50 times. As can be seen from the table, SAS Clustering outperforms IFPCAHCT, and performs at least as well as Sparse Kmeans and sometimes much better (for example when and ). We examine a dataset from this situation in depth, and plot the weights resulted from Sparse Kmeans on this dataset, see Figure 3. As seen in this figure, and also as mentioned in (witten2010framework, ), Sparse Kmeans generally results in more features with nonzero weights than the truth. These extraneous features, even with small weights, may negatively impact the clustering result. In this specific example, the Rand index from Sparse Kmeans is 0.763 while our approach gives a Rand index of 0.956. Let denote the true feature set and the feature set that our method return. In this example, .
While both SAS Clustering and Sparse Kmeans use the gap statistic to tune the parameters, IFPCAHCT tunes itself analytically without resorting to permutation or resampling, and (not surprisingly) has the smallest computational time among these three methods. However, as can be seen from Table 1a, the clustering results given by IFPCAHCT are far worse than those resulted from the other two methods. In Table 1b, we report the performance of SAS Clustering and Sparse Kmeans in terms of the running time, under the same setting as that in Table 1a but with tuning parameters for both of the methods given (so that the comparisons are fair). As can be seen in Table 1b, SAS Clustering shows a clear advantage over Sparse Kmeans in terms of the running time, and as increases, the advantage becomes more obvious. (Note that both SAS and Sparse Kmeans are implemented in R code and, in particular, the code is not optimized.)
methods  p = 100  p = 200  p = 500  p = 1000  

0.6  SAS_gs  0.907 (0.048)  0.875 (0.066)  0.827 (0.076)  0.674 (0.096) 
SAS_gss  0.900 (0.054)  0.860 (0.066)  0.781 (0.008)  0.701(0.050)  
Sparse K  0.886 (0.068)  0.807 (0.064)  0.744 (0.046)  0.704 (0.043)  
IFPCA  0.664(0.042)  0.645(0.051)  0.605 (0.045)  0.593(0.038)  
0.7  SAS_gs  0.953 (0.030)  0.965 (0.028)  0.960 (0.032)  0.855 (0.102) 
SAS_gss  0.953 (0.031)  0.961 (0.031)  0.921 (0.088)  0.789 (0.104)  
Sparse K  0.942 (0.045)  0.915 (0.071)  0.802 (0.087)  0.790 (0.087)  
IFPCA  0.681(0.036)  0.653(0.044)  0.629(0.057)  0.614(0.055)  
0.8  SAS_gs  0.986 (0.020)  0.985 (0.022)  0.987 (0.016)  0.966 (0.052) 
SAS_gss  0.984 (0.020)  0.983 (0.019)  0.987 (0.0178)  0.892 (0.122)  
Sparse K  0.985 (0.020)  0.975 (0.029)  0.961 (0.07)  0.948 (0.074)  
IFPCA  0.691(0.043)  0.675(0.056)  0.639(0.068)  0.623(0.059)  
0.9  SAS_gs  0.997 (0.008)  0.997 (0.008)  0.997 (0.007)  0.995 (0.010) 
SAS_gss  0.996 (0.010)  0.996 (0.009)  0.997 (0.009)  0.969 (0.076)  
Sparse K  0.996 (0.010)  0.992 (0.013)  0.992(0.016)  0.993 (0.013)  
IFPCA  0.700(0.031)  0.682(0.051)  0.654(0.057)  0.627(0.065)  
1.0  SAS_gs  0.999 (0.005)  1.000 (0.003)  1.000 (0.003)  0.999 (0.004) 
SAS_gss  0.998 (0.007)  1.000 (0.003)  1.000 (0.004)  0.998 (0.006)  
Sparse K  0.998 (0.007)  0.999 (0.005)  0.996 (0.010)  0.996 (0.009)  
IFPCA  0.717(0.034)  0.710(0.039)  0.659(0.063)  0.639(0.060) 
. The reported values are the mean (and sample standard deviation) of the Rand indexes over 50 simulations.
methods  p = 100  p = 200  p = 500  p = 1000  

0.6  SAS  0.086 (0.031)  0.130 (0.044)  0.217 (0.088)  0.271 (0.118) 
Sparse K  0.113 (0.034)  0.220 (0.053)  0.445 (0.101)  0.850 (0.156)  
0.7  SAS  0.077 (0.021)  0.104 (0.027)  0.207 (0.085)  0.316 (0.147) 
Sparse K  0.107 (0.028)  0.235 (0.057)  0.471 (0.123)  0.945 (0.194)  
0.8  SAS  0.056 (0.019)  0.088 (0.022)  0.182 (0.062)  0.313 (0.118) 
Sparse K  0.091 (0.029)  0.213 (0.051)  0.574 (0.134)  0.984 (0.262)  
0.9  SAS  0.055 (0.017)  0.080 (0.024)  0.136 (0.048)  0.289 (0.131) 
Sparse K  0.094 (0.023)  0.196 (0.052)  0.482 (0.101)  0.982 (0.261)  
1.0  SAS  0.051(0.012)  0.089 (0.021)  0.146 (0.045)  0.272 (0.095) 
Sparse K  0.095(0.019)  0.186 (0.044)  0.554 (0.107)  1.225 (0.270) 
In Section 4.1, the three groups had identity covariance matrix. In this section, we continue comparing our approach with Sparse Kmeans and IFPCAHCT under a more difficult situation, where each of the 3 clusters have 30 points sampled from different variate normal distributions (), with different mean vectors
and same diagonal covariance matrix
across groups, a random matrix with eigenvalues in
. We used 50 repeats and varied from to . The results are reported in Table 2a. We see there that, in this setting, our method is clearly superior to Sparse Kmeans and IFPCAHCT. We also report the symmetric difference between the estimated feature set and the true feature set , as can be seen in Table 2b. Our algorithm is clearly more accurate in terms of feature selection.methods  p = 100  p = 200  p = 500  p = 1000  

0.6  SAS_gs  0.718 (0.037)  0.702 (0.037)  0.611 (0.044)  0.574 (0.027) 
SAS_gss  0.714 (0.028)  0.692 (0.038)  0.635 (0.044)  0.595(0.026)  
Sparse K  0.590 (0.030)  0.594 (0.034)  0.595 (0.034)  0.571 (0.023)  
IFPCA  0.619(0.037)  0.590(0.037)  0.572 (0.024)  0.564(0.020)  
0.8  SAS_gs  0.852 (0.047)  0.844 (0.052)  0.797 (0.066)  0.670 (0.082) 
SAS_gss  0.848 (0.050)  0.819 (0.070)  0.752 (0.060)  0.686 (0.043)  
Sparse K  0.662 (0.057)  0.646 (0.063)  0.657 (0.062)  0.639 (0.054)  
IFPCA  0.646(0.040)  0.634(0.047)  0.603(0.046)  0.575(0.040)  
1.0  SAS_gs  0.940 (0.035)  0.947 (0.033)  0.941 (0.037)  0.919 (0.065) 
SAS_gss  0.935 (0.037)  0.941 (0.038)  0.922 (0.059)  0.799 (0.099)  
Sparse K  0.798 (0.085)  0.814 (0.078)  0.742 (0.080)  0.708 (0.070)  
IFPCA  0.677(0.041)  0.644(0.056)  0.618(0.052)  0.604(0.047) 
methods  p = 100  p = 200  p = 500  p = 1000  

0.6  SAS_gs  26.3(4.7)  37.7 (7.9)  86.2 (20.5)  121.0(28.3) 
SAS_gss  27.9(5.3)  44.1 (12.1)  100.5 (43.3)  143.1(57.0)  
Sparse K  43.8 (7.3)  86.8(35.3)  163.9(105.9)  170.6 (124.6)  
IFPCA  49.4(3.8)  72.3(15.8)  129.7 (61.4)  185.8(126.3)  
0.8  SAS_gs  17.4(3.9)  19.0(3.9)  31.7 (17.2)  94.2 (48.3) 
SAS_gss  17.7 (4.6)  21.9 (6.5)  57.9 (43.7)  132.9 (85.0)  
Sparse K  28.5(13.2)  63.4(28.5)  163.6 (102.9)  218.6 (129.9)  
IFPCA  50.8(5.2)  75.4(16.5)  126.9(61.2)  209.3(130.5)  
1.0  SAS_gs  10.5 (3.7)  10.2 (3.4)  12.4 (3.7)  22.7(19.6) 
SAS_gss  11.7(3.9)  12.5 (4.1)  17.1 (12.8)  100.2 (84.8)  
Sparse K  13.6 (10.8)  49.7 (38.2)  204.88 (104.5)  265 (165.1)  
IFPCA  49.3(4.0)  67.9(14.1)  124.8(53.3)  226.3(146.7) 
In both Section 4.1 and Section 4.2, the three groups have the same covariance matrix. In this section, we continue comparing our approach with Sparse Kmeans and IFPCAHCT under an even more difficult situation, where the mean vectors are the same as in Section 4.2 with , but now the covariances are different: , and are random matrices with eigenvalues in , and , respectively. We used 50 repeats in this simulation. The results, reported in Table 3, are consistent with the results of Section 4.2: our method clearly outperforms Sparse Kmeans and IFPCAHCT, both in terms of clustering and feature selection.
Method  SAS_gs  SAS_gss  Sparse Kmeans  IFPCA 

Rand index  0.920 (0.054)  0.858 (0.098)  0.710 (0.022)  0.668 (0.041) 
8.7 (3.8)  13.0 (7.9)  297.2 (75.6)  118.6 (56.8) 
Notice that the 3 clusters are well separated in the first 50 features as can be seen from the construction of the data, but when 450 noise features are present in the datasets, the task of clustering becomes difficult. See Figure 4(b) as an example where we project a representative dataset onto the first two principal components of the whole data matrix. However, if we are able to successfully select out the first features and apply classical clustering algorithms, then we are able to achieve better results. See Figure 4(a), where we project the same dataset onto the first two principal components of the data submatrix consisting of the first 50 columns (features). To illustrate the comparisons, we also plot in Figure 4 the clustering results by these three methods.
In the previous simulations, all the datasets were Euclidean. In this section, we apply our algorithm on categorical data (with Hamming distance) and compare its performance with Sparse Kmedoids^{3}^{3}3We modified the function of Sparse Kmeans in the R package ‘sparcl’, essentially replacing Kmeans with Kmedoids, so that it can be used to cluster categorical data.. In this example, we generate clusters with data points each from three different distributions on the Hamming space of dimension
. Each distribution is the tensor product of Bernoulli distributions with success probabilities
for . For the first distribution, for and otherwise. For the second distribution, for and otherwise. For the third distribution, for and otherwise. See Table 4, where we compare these two methods in terms of Rand index for various combination of and . Each situation was replicated 50 times. As can be seen from the table, SAS Clustering significantly outperforms Sparse Kmedoids in most situations. We examined why, and it turns out that Sparse Kmedoids works well if the tuning parameter in equation (18) is given, but it happens that the gap statistic often fails to give a good estimate of in this categorical setting. We are not sure why.methods  p = 30  p = 60  p = 100  p = 200  

0.6  SAS_gs  0.878 (0.060)  0.872 (0.042)  0.864 (0.057)  0.863 (0.053) 
Sparse Kmedoids  0.694 (0.045)  0.663 (0.054)  0.654 (0.049)  0.639 (0.044)  
0.7  SAS_gs  0.954 (0.023)  0.960 (0.026)  0.942 (0.026)  0.948 (0.033) 
Sparse Kmedoids  0.807 (0.126)  0.763 (0.077)  0.716 (0.060)  0.686 (0.062)  
0.8  SAS_gs  0.989 (0.011)  0.984 (0.019)  0.983 (0.019)  0.978 (0.021) 
Sparse Kmedoids  0.946 (0.090)  0.889 (0.099)  0.846 (0.100)  0.787 (0.093)  
0.9  SAS_gs  0.998 (0.005)  0.999 (0.003)  0.997 (0.007)  0.997 (0.006) 
Sparse Kmedoids  0.997 (0.006)  0.994 (0.036)  0.983 (0.044)  0.966 (0.065) 
. The reported values are the mean (and standard error) of the Rand indexes over 50 simulations.
In Sections 4.1 – 4.4, we have fixed the number of clusters to be and considered the effects of cluster separation , sparsity () and cluster shape (Identity covariance, same and different covariance matrices across groups) in the comparisons. In this section, we continue to compare our approach with Sparse Kmeans and IFPCAHCT as the number of clusters increases from to . The setup here is different from the above sections. We sample subcenters from a 50variate normal distribution ^{4}^{4}4The constant 0.4 was chosen to make the task of clustering neither too easy nor too difficult. and concatenate each of the subcenters with 450 zeros to have random centers, , of length , which carry at least noise features. Once the centers are generated, we construct clusters with ( in the second setup) observations each, sampled from respective distributions with . Each setting is repeated times. The means and confidence intervals with different ’s are shown in Figure 5(a) and Figure 5(b). Once again, the results were consistent with earlier results in that SAS Clustering outperforms IFPCAHTC and performs at least as well as Sparse Kmeans with different ’s. We also notice that the clustering results given by all these three methods become better as increases. This can be explained by the increased effective sample sizes ( or ) as increases.


We compare our approach with others on real data from genetics. Specifically, we consider the same microarray datasets (listed in Table 5) used by jin2014important to evaluate their IFPCA method. Each of these 10 data sets consists of measurements of expression levels of genes in patients from different classes (e.g., normal, diseased). We notice from Table 5 that is much greater than , illustrating a highdimensional setting. We also mention that, although the true labels are given by the groups the individuals belong to, they are only used as the ground truth when we report the classification errors of the different methods in Table 6. For detailed descriptions and the access to these 10 datasets, we refer the reader to (jin2014important, ).
In Table 6, we report the classification errors of 10 different methods on these datasets. Among these 10 methods, the results from Kmeans, Kmeans++ (arthur2007k, ), hierarchical clustering, SpectralGem (lee2010spectral, ) and IFPCAHCT (jin2014important, ) are taken from (jin2014important, ). We briefly mention that Kmeans++ is Lloyd’s algorithm for Kmeans but with a more careful initialization than purely random; hierarchical clustering is applied to the normalized data matrix directly without feature selection; and SpectralGem is PCAtype method. In addition to these 5 methods, we also include 3 other methods: AHPGMM (wang2008variable, ), which is an adaptively hierarchically penalized Gaussianmixturemodel based clustering method, Regularized Kmeans (sun2012regularized, ), and Sparse Kmeans (witten2010framework, ).
We can offer several comments. First, our method is overall comparable to Sparse Kmeans and IFPCA, which in general outperform the other methods. It is interesting to note that SAS_gss outperforms SAS_gs on a couple of datasets. However, we caution the reader against drawing hard conclusions based on these numbers, as some of the datasets are quite small. For example, the Brain dataset has groups and a total sample size of , and is very highdimensional with . Second, for Breast Cancer, Prostate Cancer, SRBCT and SuCancer, all methods perform poorly with the best error rate exceeding . However, we note that even when the task is classification where class labels in the training sets are given, these data sets are still hard for some wellknown classification algorithms (dettling2004bagboosting, ; yousefi2010reporting, ). Third, we notice that in (sun2012regularized, ), clustering results of the Leukemia and Lymphoma datasets have also been compared. The error rate on Lymphoma given by Regularized Kmeans in (sun2012regularized, )
is the same as reported here, however, the error rate on Leukemia is smaller than the result reported here. This is due to the fact that they applied preprocessing techniques to screen out some inappropriate features and also imputed the missing values using 5 nearest neighbors on this data set. Interestingly,
wang2008variable also reported a better error rate on SRBCT data using their AHPGMM method. However, they split the data into training set and testing set, fit the penalized Gaussian mixture model and report the training error and testing error respectively.#  Data Name  (with sample size from each cluster)  

1  Brain  5  5597  42 (10+10+10+4+8) 
2  Breast  2  22215  276 (183+93) 
3  Colon  2  2000  62 (22+40) 
4  Lung  2  12533  181 (150+31) 
5  Lung(2)  2  12600  203 (139+64) 
6  Leukemia  2  3571  72 (47+25) 
7  Lymphoma  3  4026  62 (42+9+11) 
8  Prostate  2  6033  102 (50+52) 
9  SRBCT  4  2308  63 (23+8+12+20) 
10  SuCancer  2  7909  174 (83+91) 
Data set  Kmeans  Kmeans++  Hier  SpecGem  IFPCA  AHPGMM  RKmeans  Sparse K  

Brain  .286  .472  .524  .143  .262  .214  .262  .190  .310  .310 
Breast  .442  .430  .500  .438  .406  .460  .442  .449  .485  .445 
Colon  .443  .460  .387  .484  .403  .129  .355  .306  .129  .403 
Lung  .116  .196  .177  .122  .033  .116  .094  .122  .099  .099 
Lung(2)  .436  .439  .301  .434  .217  .438  .217  .315  .315  .315 
Leukemia  .278  .257  .278  .292  .069  .028  .347  .028  .028  .028 
Lymphoma  .387  .317  . 468  .226  .065  .484  .016  .016  .016  .016 
Prostate  .422  .432  .480  .422  .382  .422  .441  .373  .431  .431 
SRBCT  .556  .524  .540  .508  .444  .476  .556  .317  .460  .365 
SuCancer  .477  .459  .448  .489  .333  .477  .477  .477  .483  .483 
We presented here a simple method for feature selection in the context of sparse clustering. The method is arguably more natural and simpler to implement than COSA or Sparse Kmeans. At the same time, it performs comparably or better than these methods, both on simulated and on real data.
At the moment, our method does not come with any guarantees, other than that of achieving a local minimum if the iteration is stopped when no improvement is possible. Just like other iterative methods based on alternating optimization, such as Lloyd’s algorithm for Kmeans, proving a convergence to a good local optimum (perhaps even a global optimum) seems beyond reach at the moment. COSA and Sparse Kmeans present similar challenges and have not been analyzed theoretically. IFPCA has some theoretical guarantees developed in the context of a Gaussian mixture model (jin2014important, ) — see also jin2015phase . More theory for sparse clustering is developed in (azizyan2013, ; verzelen2014detection, ; chan2010using, ).
J. S. Liu, et al., Bayesian clustering with variable and transformation selections, Bayesian Statistics 7 (2003) 245–275.
W. Pan, X. Shen, Penalized modelbased clustering with application to variable selection, The Journal of Machine Learning Research 8 (2007) 1145–1164.
B. W. Kernighan, S. Lin, An efficient heuristic procedure for partitioning graphs, Bell system technical journal 49 (2) (1970) 291–307.
Comments
There are no comments yet.