Degrees of Freedom and Model Selection for kmeans Clustering

by David P. Hofmeyr, et al.

This paper investigates the problem of model selection for k-means clustering, based on conservative estimates of the model degrees of freedom. An extension of Stein's lemma, which is used in unbiased risk estimation, is applied to obtain an expression which allows one to approximate the degrees of freedom. Empirically based estimates of this approximation are obtained. The degrees of freedom estimates are then used within the popular Bayesian Information Criterion to perform model selection. The proposed estimation procedure is validated in a thorough simulation study, and its robustness is assessed through relaxations of the modelling assumptions and on data from real applications. Comparisons with popular existing techniques suggest that this approach performs extremely well when the modelling assumptions hold.





1 Introduction

Degrees of freedom arise explicitly in model selection as a way of accounting for the bias in the model log-likelihood for estimating generalisation error [1, Akaike Information Criterion, AIC] and, indirectly, Bayes factors [14, Bayesian Information Criterion, BIC]. In particular, degrees of freedom account for the complexity, or flexibility, of a model by measuring its effective number of parameters. In the context of clustering, model flexibility is varied by different choices of K, the number of clusters. In k-means, clusters are associated with compact collections of points arising around a set of cluster centroids. The optimal centroids are those which minimise the sum of squared distances between each point and its assigned centroid. Using the squared distance connects the k-means objective with the log-likelihood of a simple Gaussian Mixture Model (GMM). Pairing elements of the GMM log-likelihood with AIC and BIC type penalties, based on the number of explicitly estimated parameters, has motivated multiple model selection methods for k-means [10, 13, 11]. However, it has been observed that these approaches can lead to substantial over-estimation of the number of clusters [7].

We argue that these simple penalties are inappropriate, and do not account for the entire complexity of the model. We investigate more rigorously the degrees of freedom in the k-means model. The proposed formulation depends not only on the explicit dimension of the model, but also accounts for the uncertainty in the cluster assignments. This is intuitively appealing, as it allows the degrees of freedom to incorporate the difficulty of the clustering problem, which cannot be captured solely by the model dimension. This formulation draws on the work of [18], and is the first application, of which we are aware, of this approach to the problem of clustering. We validate the proposed formulation by applying it within the BIC to perform model selection for k-means. The approach is found to be extremely competitive with the state-of-the-art on a very large collection of benchmark data sets.

2 Degrees of Freedom in the k-means Model

From a probabilistic perspective, the standard modelling assumptions for k-means are that the data arose from a K-component Gaussian mixture with common isotropic covariance matrix, σ²I, and either equal mixing proportions [10, 4] or sufficiently small σ² [8]. With a slight abuse of notation, one may in general write the likelihood for the data, X, given model M, which we assume to include all parameters of the underlying distribution which are being estimated, as

L(X | M) = ∏_{i=1}^n ∑_{k=1}^K π_ik φ(X_i· ; μ_k, σ²I).

Here μ_1, …, μ_K are the component means, π_ik is the probability that the i-th datum arises from the k-th component, φ(· ; μ, Σ) denotes the Gaussian density, and the subscript “i·” is used to denote the i-th row of a matrix. The terms π_ik are usually assumed equal for fixed i, and have been used to represent mixing proportions [10]. Popular formulations of the k-means likelihood [10, 13, 11] use the so-called classification likelihood [6], which treats the cluster assignments as true class labels. For example, a simple BIC formulation has been expressed, up to an additive constant, as [13]

BIC(K) = log L(X | M̂) − (Kd/2) log(n).

Here only the means are assumed part of the estimation, and hence the model dimension is Kd, for K clusters in d dimensions. There is a fundamental mismatch in formulations such as this, however, including those in [10, 13, 11], between the log-likelihood component and the bias correction term. Specifically, by using the classification likelihood the assumption is that the model is also estimating the assignments of data to clusters. However, without incorporating this added estimation into the model degrees of freedom, the bias of the log-likelihood for estimating generalisation error, and Bayes factors, is severely under-estimated.
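To make this mismatch concrete, a plug-in version of such a classification-likelihood BIC can be sketched in a few lines. This is our own illustration (the function name and the exact plug-in variance are assumptions, not the precise expression of [13]): the penalty charges only the naïve dimension Kd, with no charge for the estimated assignments.

```python
import numpy as np

def naive_bic(X, labels, centroids):
    """Classification-likelihood BIC with the naive dimension K*d.

    The log-likelihood treats the cluster assignments as known labels
    under an isotropic Gaussian model; the penalty is (K*d/2) * log(n).
    Illustrative plug-in form only.
    """
    n, d = X.shape
    K = len(centroids)
    rss = ((X - centroids[labels]) ** 2).sum()  # within-cluster sum of squares
    sigma2 = rss / (n * d)                      # plug-in variance estimate
    loglik = -0.5 * n * d * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - 0.5 * K * d * np.log(n)
```

Because the assignments are charged nothing, criteria of this form tend to favour larger values of K than is warranted, consistent with the over-estimation reported in [7].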

In this work a modified formulation is considered which incorporates the cluster assignment into the modelling procedure. We find it convenient to assume that the data matrix X ∈ ℝ^{n×d} has been generated as

X = M + E,    (2)

where the mean matrix M is assumed to have K unique rows and the elements of E are independent realisations from a N(0, σ²) distribution. For this formulation it is possible to consider estimating pointwise the elements of M, under the above constraint, using a modelling procedure M̂ which maps X to the k-means fitted values,

M̂(X)_i· = c_{a(i)},  i = 1, …, n,

where c_1, …, c_K are the k-means centroids and a(i) is the index of the centroid to which the i-th datum is assigned. The matrix of centroids estimates the unique rows of M, and M̂(X) provides an approximation of the maximum likelihood solution under Eq (2). With this formulation we are able to address the estimation of the “effective degrees of freedom” [5], given by

df(M̂) = (1/σ²) ∑_{i=1}^n ∑_{j=1}^d Cov(M̂(X)_ij, X_ij).    (6)

This covariance offers an appealing interpretation in terms of model complexity/flexibility. A more complex model will respond more to variations in the data, in that additional flexibility will allow the model to attempt to “explain” this variation. The covariance between its fitted values and the data will therefore be higher. On the other hand, an inflexible model will, by definition, vary less due to changes in the observations. Furthermore, in numerous simple Gaussian error models there is an exact equality between this covariance and the model dimension. The remainder of this section is concerned with obtaining an appropriate approximation of the effective degrees of freedom for the k-means model. The following two lemmas are useful for obtaining such an estimate.
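The covariance definition of effective degrees of freedom lends itself to direct Monte Carlo estimation when the generating model is known. The following sketch is our own illustration under the model of Eq (2), not the implementation used for the experiments; `kmeans_fit` is a bare-bones Lloyd's algorithm introduced here for self-containment.

```python
import numpy as np

def kmeans_fit(X, K, n_iter=50, seed=0):
    """Bare-bones Lloyd's algorithm; returns the fitted values M_hat
    (each row of X replaced by its cluster's centroid)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        lab = dist.argmin(axis=1)
        for k in range(K):
            if np.any(lab == k):
                C[k] = X[lab == k].mean(axis=0)
    return C[lab]

def effective_df(M, sigma, K, reps=200):
    """Monte Carlo estimate of df = (1/sigma^2) sum_ij Cov(M_hat_ij, X_ij)
    under the model X = M + E with E_ij ~ N(0, sigma^2)."""
    Xs, Ms = [], []
    for r in range(reps):
        X = M + sigma * np.random.default_rng(r).normal(size=M.shape)
        Xs.append(X)
        Ms.append(kmeans_fit(X, K, seed=r))
    Xs, Ms = np.array(Xs), np.array(Ms)
    cov = ((Xs - Xs.mean(0)) * (Ms - Ms.mean(0))).mean(0).sum()
    return cov / sigma**2
```

For two well-separated clusters, for example, the estimate lands close to the nominal dimension Kd, with any excess attributable to uncertainty in the assignments.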

Lemma 1

Let X = M + E, with M ∈ ℝ^{n×d} fixed and with E_ij ~ N(0, σ²) independent for all i, j. Let f : ℝ^{n×d} → ℝ satisfy the following condition. For all X and each pair (i, j), there exists a finite set D_ij(X) = {δ_1, …, δ_m} s.t. t ↦ f(X + (t − X_ij)e_ij), viewed as a univariate function by keeping all other elements of X, {X_kl}_{(k,l) ≠ (i,j)}, fixed, is Lipschitz on each of (−∞, δ_1), (δ_1, δ_2), …, and (δ_m, ∞). Then for each (i, j), the quantity Cov(f(X), X_ij)/σ² is equal to

E[∂f(X)/∂X_ij] + E[ ∑_{δ ∈ D_ij(X)} φ_ij(δ) Δf_ij(δ) ],    (7)

provided the second term on the right hand side exists. Here e_ij has zero entries except in the (i, j)-th position, where it takes the value one, φ_ij is the N(M_ij, σ²) density, and

Δf_ij(δ) = lim_{t↓δ} f(X + (t − X_ij)e_ij) − lim_{t↑δ} f(X + (t − X_ij)e_ij)

is the size of the discontinuity at δ.

This result is very similar to [18, Lemma 5]; however, the proof we provide in the supplementary material is markedly different. The first term on the right hand side of (7) comes from Stein's influential result [15, Lemma 2]. Due to the discontinuities in the k-means model, which occur at points where the cluster assignments of some of the data change, the additional covariance at the discontinuity points needs to be accounted for. Consider an X_ij which is close to a point of discontinuity with respect to the (i, j)-th entry. Conditional on the fact that X_ij is close to such a point, f(X) takes values approximately equal to the left and right limits, depending on whether X_ij is below or above the discontinuity respectively. On a small enough scale each happens with roughly equal probability. After taking into account the probability of being close to the discontinuity point, and taking the limit as X_ij gets arbitrarily close to the discontinuity point, one can arrive at an intuitive justification for the additional term in (7). In the remainder this additional covariance term will be referred to as the excess degrees of freedom. The next lemma places Lemma 1 in the context of the k-means model.

Lemma 2

Let M̂ : ℝ^{n×d} → ℝ^{n×d} be the k-means fitted-value procedure, which replaces each row of X with the centroid to which it is assigned in the k-means solution. Then each entry M̂(X)_ij satisfies the conditions on the function f in the statement of Lemma 1, and moreover if X = M + E, with M fixed and with E_ij ~ N(0, σ²) independent for all i, j, then the second term on the right hand side of (7) exists and is finite.

One of the most important consequences of [15, Lemma 2], which leads to the first term on the right hand side of Eq. (7), is that this term is devoid of any of the parameters of the underlying distribution. An unbiased estimate of this term can be obtained by taking the partial derivatives of the model using the observed data. In the case of k-means one arrives at

∑_{i=1}^n ∑_{j=1}^d ∂M̂(X)_ij/∂X_ij = ∑_{i=1}^n ∑_{j=1}^d 1/n_{a(i)} = Kd,

where a(i) is the cluster to which the i-th datum is assigned and n_k is the number of data assigned to centroid k. Therefore, the first term in the effective degrees of freedom recovers the explicit model dimension. The excess degrees of freedom therefore equals the difference between the effective degrees of freedom and the explicit model dimension, i.e., the number of elements in the matrix of centroids, Kd. It may therefore be interpreted as the additional complexity in assigning data to clusters. This is intuitively pleasing in light of the fact that this additional covariance directly accounts for the potential assignment of the data to different clusters, in that these are what result in discontinuities in the model.
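Under fixed assignments each fitted value is the mean of the n_k points in its cluster, so each partial derivative is 1/n_{a(i)} and the divergence sums to Kd. A quick numerical check of this identity (our own illustration, with hypothetical helper names):

```python
import numpy as np

def fitted_values(X, labels, K):
    """k-means fitted values with the assignments held fixed:
    each row of X is replaced by the mean of its cluster."""
    M_hat = np.empty_like(X)
    for k in range(K):
        M_hat[labels == k] = X[labels == k].mean(axis=0)
    return M_hat

def divergence(X, labels, K, eps=1e-6):
    """Finite-difference estimate of sum_ij dM_hat_ij / dX_ij."""
    base = fitted_values(X, labels, K)
    total = 0.0
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Xp = X.copy()
            Xp[i, j] += eps
            total += (fitted_values(Xp, labels, K)[i, j] - base[i, j]) / eps
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
labels = np.arange(30) % 4       # four non-empty clusters
print(divergence(X, labels, 4))  # ≈ K * d = 12
```

The identity holds regardless of the cluster sizes: each cluster contributes d to the total, so the divergence equals Kd whenever all K clusters are non-empty.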

2.1 Approximating Excess Degrees of Freedom

The excess degrees of freedom reintroduces the unknown parameters to the degrees of freedom expression. Furthermore, as noted by [18], it is generally extremely difficult to determine the discontinuity points, making the computation of the excess degrees of freedom very challenging. This is perhaps even more so in the case of clustering. Consider the excess degrees of freedom arising from the (i, j)-th entry,

E[ ∑_{δ ∈ D_ij(X)} φ_ij(δ) Δf_ij(δ) ].

Since the observed data matrix, x, is assumed to have come from the distribution of X, the excess degrees of freedom are estimated using x and the corresponding clustering solution. Assume for now that the model parameters, M and σ², are fixed. We will discuss our approach for accommodating these unknown parameters in the next subsection. Now, recall that the discontinuities are those points at which the assignment of some of the data changes. That is, those δ for which there exists a datum whose assignment under the solution for X + (δ − X_ij)e_ij differs from its assignment under the solution for X.

The fact that discontinuities are determined in terms of the would-be solution, M̂(X + (δ − X_ij)e_ij), rather than the observed solution, M̂(x), is one of the reasons which make determining the discontinuity points extremely challenging. Indeed, one can construct examples where slight changes in only a single matrix entry can result in reassignments of arbitrarily large subsets of data, resulting in substantial and unpredictable changes in M̂. We are thus led to making some simplifications. We only consider discontinuities w.r.t. the (i, j)-th entry arising from reassignments of x_i·, the corresponding datum. This is a necessary simplification which maintains the intuitive interpretation of the excess degrees of freedom as the covariance arising from reassignments of data. Now, consider the value δ of X_ij at which the assignment of the i-th datum changes from its cluster, k, to some other cluster, k′. Ignoring all other clusters, we find that δ satisfies

‖x_i· + (δ − x_ij)e_j − c_k − ((δ − x_ij)/n_k)e_j‖² = ‖x_i· + (δ − x_ij)e_j − c_k′‖²,    (8)

where e_j is the j-th canonical basis vector for ℝ^d, c_k is the k-th centroid, n_k is the size of the k-th cluster, and the term ((δ − x_ij)/n_k)e_j accounts for the movement of the k-th centroid as x_ij varies. This is a quadratic equation which can easily be solved. A further simplification is adopted here. Rather than considering the paths of M̂ through multiple reassignments resulting from varying X_ij (which quickly become extremely difficult to calculate), the magnitude and location of a discontinuity at a value δ is determined as though no reassignments had occurred for values between zero and δ. Since the corresponding values of |δ| are generally large, the contributions from the density terms φ_ij(δ) are generally small. The excess degrees of freedom for the (i, j)-th entry is thus approximated using

∑_{k′ ≠ k} φ_ij(δ_{k′}) Δf_ij(δ_{k′}),

where δ_{k′} is the solution to Eq. (8) with smaller magnitude (when a solution exists). To determine the magnitude of the discontinuities, observe that when δ > x_ij, and we assume that no values between result in a reassignment of x_i·, the discontinuity is the difference between the fitted value just after reassignment to cluster k′ and the fitted value just before it. If δ < x_ij then we simply have the negative of the above.

2.1.1 Selecting Appropriate Values for M and σ²

The estimate of excess degrees of freedom depends on the values of M and σ². It is tempting to use the apparently natural candidates: the fitted values M̂(x) and an estimate of the within cluster variance. However, this is inappropriate for the purpose of comparing models. First, notice that the value M = M̂(x) will lead to an underestimation of the density terms, φ_ij(δ). This is because the values which result in a reassignment of the corresponding datum occur at the boundaries of the estimated clusters, and hence at the greatest distances from the fitted values. Furthermore, except for the correct value of K, M̂(x) will have a different number of unique rows from M. Interestingly, we have found this latter point to affect the estimate only moderately. In fact any reasonable mean matrix which is not precisely the minimiser, M̂(x), produces a similar estimate. When estimating degrees of freedom for a solution with K clusters, we simply replace M with the fitted values from the solution containing K + 1 clusters. We have also experimented with using a single replacement mean matrix for all models under consideration, this being the fitted values from the solution with one more cluster than the largest number considered. The results from both of these choices are very similar.

The choice of σ² is far more crucial for approximating degrees of freedom. Importantly, smaller values of σ² tend to result in a smaller value of the estimated degrees of freedom. It is therefore preferable that the same value(s) of σ² are used for all models. Otherwise the effect is that a hypothesis of a larger number of clusters tends to be self-fulfilling. This is because the within cluster variance estimate from a large number of clusters will be relatively small, decreasing the excess degrees of freedom artificially. We have not found a single estimation strategy which is satisfactory, and so compute effective degrees of freedom for a large range of equally spaced values of σ². When applying a model selection procedure, we obtain multiple selections of the number of clusters, and propose the most frequently occurring value. Note that under a correctly specified model, a sound model selection procedure should be robust to slight misspecification in the value of σ². The number of clusters associated with the true, unknown σ² therefore tends to occur repeatedly, whereas other selections tend to be scattered. Exceptions to this include scenarios where too many substantial under- or over-estimates of σ² are considered. In these cases over- and under-estimates of the number of clusters, respectively, will occur repeatedly. We recommend that a user considers this possibility. For the purpose of this paper we simply apply the automated approach described in the following section.
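The "most frequently occurring selection" rule can be stated in a few lines. The sketch below is our own illustration; the tie-breaking rule (prefer the smaller number of clusters) is an assumption made here for definiteness, not something specified above.

```python
from collections import Counter

def select_k(selections):
    """Propose the most frequently selected number of clusters across
    the grid of sigma^2 values; ties are broken in favour of the
    smaller number of clusters (an assumption for definiteness)."""
    counts = Counter(selections)
    return max(counts.items(), key=lambda kv: (kv[1], -kv[0]))[0]

print(select_k([3, 4, 4, 4, 5, 7, 4, 2]))  # -> 4
```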

3 Choosing K Using the BIC

The Bayesian Information Criterion approximates, up to unnecessary constants, the logarithm of the evidence for a model M, i.e., log p(x | M), using

BIC(M) = ℓ(x; M) − (df(M)/2) log(N).

Again ℓ(x; M) is the model log-likelihood and here N is the number of independent “residuals” in x. With the modelling assumptions in Eq (2), the estimated BIC for k-means is therefore obtained, up to an additive constant, by substituting the Gaussian log-likelihood of the fitted values and the estimated effective degrees of freedom into the expression above.

We introduced numerous simplifications in the approximation of the excess degrees of freedom. As a result, it is important to investigate the accuracy of the resulting method, and the effect that any inaccuracy can have on model selection. Figure 1 shows the results of a simulation study conducted to investigate this accuracy. Data sets of size 1000 were generated under the modelling assumptions in Eq (2). The number of clusters and the dimension were each set to 5, 10 and 20. The figure shows plots of the number of clusters, K, against the averages of the estimated degrees of freedom (– – –) from 30 replications, based on the proposed effective degrees of freedom approximation. The 30 individual realisations are also included to provide an indication of variation. For illustrative purposes we have used the correct value of σ², where the value of M is replaced using the clustering solution with one more cluster, as explained in the previous subsection. Based on Eq. (6), this quantity estimates the (scaled) covariance between the model and the data. The plots also show direct empirical estimates of this covariance obtained by sampling from the true distribution (——). That is, we generate multiple data sets according to Eq (2), apply k-means for each value of K, and compute the corresponding empirical covariance. This may therefore be seen as our target. For context we also include the plot of Kd (···), corresponding to the naïve degrees of freedom equated with the explicit model dimension.

Figure 1: Estimated covariance between the data and fitted values, computed through (i) direct sampling (——), (ii) the proposed method for approximating effective degrees of freedom (– – –) and (iii) the naïve estimate of degrees of freedom (···). Panels (a)–(i) show 5, 10 and 20 clusters in each of 5, 10 and 20 dimensions.

Given the number of simplifications made, and the difficulty of the problem in the abstract, we find the estimation to be acceptable for values of K up to and including the correct number. However, for values greater than the correct number the estimation becomes problematic. Note that from the point of view of model selection, a relatively larger underestimation of the degrees of freedom for a specific value of K will bias the model selection towards that value of K. It is therefore the apparent negative bias in the estimated degrees of freedom for higher dimensional cases and for values of K greater than the correct value which we find to be most problematic. To account for this we select the smallest value of K which corresponds to a local minimum in the estimated BIC curve, seen as a function of K. This “first extremum” approach for model selection has also been used by [17]. We also smooth the estimated degrees of freedom curves to mitigate the effect of variation, which is quite pronounced in, for example, Figures 1 (c), (f) and (i). Any simple smoothing is appropriate; we apply kernel smoothing with a leave-one-out cross validation bandwidth estimate. We have found this to provide reliable results in most cases, but expect conscientious users will investigate the BIC curves and also consider alternative values of K which correspond with repeated minima.
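The "first extremum" rule amounts to scanning the (smoothed) BIC curve for its first interior local minimum. A minimal sketch (our own illustration; the fallback to the global minimiser when no interior minimum exists is an assumption, not specified above):

```python
import numpy as np

def first_local_min(ks, bic):
    """Smallest K at an interior local minimum of the (smoothed) BIC
    curve; falls back to the global minimiser when no interior local
    minimum exists (an assumed fallback for definiteness)."""
    bic = np.asarray(bic, dtype=float)
    for i in range(1, len(bic) - 1):
        if bic[i] <= bic[i - 1] and bic[i] <= bic[i + 1]:
            return ks[i]
    return ks[int(np.argmin(bic))]

print(first_local_min(list(range(1, 8)), [9.0, 5.0, 3.0, 3.5, 2.0, 4.0, 6.0]))  # -> 3
```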

4 Experimental Results

This section briefly presents results from experiments using a large collection of 28 publicly available data sets associated with real applications from diverse fields. These are popular benchmark data sets taken from the UCI machine learning repository [2], with the exception of the Yeast and Phoneme data sets. All data sets were standardised to have unit variance in every dimension before applying any clustering, for which the implementation of k-means provided in R's base stats package was used. Clustering solutions considered for model selection were the best, in terms of the k-means objective, from ten random initialisations. For all data sets, values of K from 1 to 30 were considered. In addition to the proposed approach, we also experimented with the following popular existing methods for model selection:

1. The gap statistic [17], which is based on approximating, through Monte Carlo, the deviation of the (transformed) within cluster sum of squares from its expected value when the underlying data distribution contains no clusters. Due to high computation time, solutions for the Monte Carlo samples were based on a single initialisation. Using ten initialisations, as for the clustering solutions of the actual data sets, did not produce different results on data sets for which this approach terminated in a reasonable amount of time.
2. The method of [12] which uses the same motivation as the gap statistic, but determines the deviation of the sum of squares from its expected value analytically under the assumption that the data distribution meets the standard k-means assumptions.
3. The silhouette index [9], which is based on comparing the average dissimilarity of each point to its own cluster with its average dissimilarity to points in different clusters. Dissimilarity is determined by the Euclidean distance between points.
4. The jump statistic [16], which selects the number of clusters based on the first differences in the k-means objective raised to the power −d/2. This statistic is based on rate distortion theory, which approximates the mutual information between the complete data set and its summarisation by the centroids.
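For reference, the jump statistic can be sketched as follows. This is our own illustration of the transform described in [16], using the typical power −d/2; the transformed distortion at K = 0 is taken as zero.

```python
import numpy as np

def jump_statistic(distortions, d):
    """Select K by the jump statistic: transform the k-means distortions
    d_K by the power -d/2 and return the K with the largest first
    difference (the transformed distortion at K = 0 is taken as zero).
    `distortions` is indexed from K = 1."""
    y = np.asarray(distortions, dtype=float) ** (-d / 2.0)
    jumps = np.diff(np.concatenate([[0.0], y]))
    return int(np.argmax(jumps)) + 1

print(jump_statistic([10.0, 8.0, 1.0, 0.9, 0.85], 2))  # -> 3
```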

fK Gap Silh. Jump BIC Ideal
Wine (3) 2 0.37 3 0.90 3 0.90 30 0.13 3 0.90 3 0.90
Seeds (3) 2 0.48 3 0.77 2 0.48 29 0.11 3 0.77 3 0.77
Ionosphere (2) 2 0.17 9 0.13 4 0.28 30 0.11 4 0.28 3 0.29
Votes (2) 2 0.57 8 0.19 2 0.57 29 0.06 2 0.57 2 0.57
Iris (3) 2 0.57 2 0.57 2 0.57 27 0.14 3 0.62 3 0.62
Libras (15) 2 0.07 13 0.31 19 0.33 29 0.29 7 0.20 20 0.34
Heart (2) 2 0.34 2 0.34 3 0.30 29 0.04 5 0.29 2 0.34
Glass (6) 2 0.19 9 0.17 2 0.19 29 0.13 4 0.20 5 0.24
Mammography (2) 2 0.39 3 0.31 3 0.31 25 0.05 3 0.31 2 0.39
Parkinsons (2) 2 -0.10 7 0.07 2 -0.10 30 0.03 7 0.07 6 0.12
Yeast (5) 2 0.42 9 0.41 2 0.42 29 0.14 4 0.57 4 0.57
Forest (4) 2 0.18 5 0.39 2 0.18 30 0.15 5 0.39 4 0.45
Breast Cancer (2) 2 0.82 8 0.38 2 0.82 30 0.15 4 0.76 2 0.82
Dermatology (6) 2 0.21 9 0.65 4 0.80 28 0.26 5 0.84 5 0.84
Synth. (6) 2 0.27 7 0.61 2 0.27 30 0.35 10 0.65 8 0.67
Soy Bean (19) 2 0.05 13 0.45 2 0.05 30 0.51 7 0.24 18 0.55
Olive Oil (3) 2 0.46 9 0.37 5 0.58 30 0.11 5 0.58 4 0.62
Olive Oil (9) 0.34 0.63 0.76 0.27 0.76 5 0.76
Bank (2) 2 0.01 3 0.06 18 0.10 26 0.09 3 0.06 5 0.21
Optidigits (10) 2 0.13 17 0.57 20 0.60 30 0.47 18 0.65 18 0.65
Image Seg. (7) 2 0.17 14 0.46 6 0.48 28 0.30 6 0.48 9 0.51
M.F. Digits (10) 2 0.15 18 0.62 9 0.65 30 0.47 17 0.62 11 0.68
Satellite (6) 3 0.29 12 0.41 3 0.29 30 0.25 12 0.41 7 0.56
Texture (11) 2 0.11 23 0.41 2 0.11 30 0.41 14 0.47 20 0.50
Pen Digits (10) 2 0.13 30 0.45 8 0.45 28 0.46 9 0.48 14 0.64
Phoneme (5) 2 0.16 11 0.45 2 0.16 1 0.00 21 0.28 5 0.64
Frogs (4) 2 0.40 17 0.15 3 0.46 25 0.12 6 0.32 4 0.48
Frogs (8) 0.41 0.19 0.53 0.16 0.39 4 0.57
Frogs (10) 0.55 0.30 0.71 0.26 0.58 4 0.78
Auto (3) 2 -0.04 4 0.13 2 -0.04 30 0.04 4 0.13 5 0.16
Yeast UCI (10) 7 0.19 1 0.00 6 0.11 9 0.18 3 0.13 7 0.19
# times max ARI 5 8 10 1 18
Avg. 0.55 0.28 0.36 0.59 0.18
Avg. 0.26 0.13 0.14 0.32 0.07
Table 1: Results from benchmark data sets. The number of clusters selected by each method and the corresponding adjusted Rand index (ARI) are reported.

Table 1 shows the selected number of clusters and the adjusted Rand index (ARI) for all 28 data sets. For each data set we have also included the “Ideal” k-means solution, which corresponds with the solution that attains the highest ARI value. We find this to be pertinent since, when the data distribution deviates substantially from the k-means assumptions, it may be that the best k-means solution does not contain the same number of clusters as the ground truth. Finally, summaries of the contents of the table are also included. (Two of the data sets offer multiple “ground truth” label sets. For computing the summaries we averaged the performances of each method over the different label sets.) These summaries include the number of times each method achieved the highest performance, as well as the averages of absolute and relative regret when compared with the Ideal solution. Given the variety and number of the data sets used in these experiments, there is strong evidence that the proposed estimation procedure for the effective degrees of freedom leads to selection of models which enjoy very strong performance when compared with existing techniques.
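The ARI values in Table 1 compare each selected clustering with the ground-truth labels. For completeness, a self-contained sketch of the adjusted Rand index (the standard contingency-table formula; the function name is ours):

```python
import numpy as np
from math import comb

def adjusted_rand_index(a, b):
    """Adjusted Rand index between two label vectors, computed from the
    contingency table of the two partitions (standard formula)."""
    a, b = np.asarray(a), np.asarray(b)
    cont = np.array([[np.sum((a == i) & (b == j)) for j in np.unique(b)]
                     for i in np.unique(a)])
    s_all = sum(comb(int(m), 2) for m in cont.ravel())
    s_rows = sum(comb(int(m), 2) for m in cont.sum(axis=1))
    s_cols = sum(comb(int(m), 2) for m in cont.sum(axis=0))
    total = comb(len(a), 2)
    expected = s_rows * s_cols / total
    max_index = (s_rows + s_cols) / 2
    return (s_all - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```

The index is invariant to label permutations, equals 1 for identical partitions, and is close to 0 in expectation for random labellings.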

5 Discussion

This work investigated the effective degrees of freedom in the k-means model. An extension of Stein's lemma provided an expression for the degrees of freedom, and a few simplifications allowed us to approximate this value in practice. The approximation was validated through model selection within the Bayesian Information Criterion. Experiments using a large collection of publicly available benchmark data sets suggest that this approach is competitive with popular existing methods for model selection in k-means clustering.

Appendix: Proofs

Lemma 1.

Let X = M + E, with M fixed and with E_ij ~ N(0, σ²) independent for all i, j. Let f satisfy the following condition. For each X and each pair (i, j), there exists a finite set D_ij(X) = {δ_1, …, δ_m} s.t. t ↦ f(X + (t − X_ij)e_ij), viewed as a univariate function by keeping all other elements, {X_kl}_{(k,l) ≠ (i,j)}, fixed, is Lipschitz on each of (−∞, δ_1), (δ_1, δ_2), …, and (δ_m, ∞). Then for each (i, j),

Cov(f(X), X_ij)/σ² = E[∂f(X)/∂X_ij] + E[ ∑_{δ ∈ D_ij(X)} φ_ij(δ) Δf_ij(δ) ],

provided the second term on the right hand side exists. Here e_ij has zero entries except in the (i, j)-th position, where it takes the value one, φ_ij is the N(M_ij, σ²) density, and Δf_ij(δ) is the size of the discontinuity of f at δ.

Proof: Let m = 1 and consider any f which is Lipschitz on (−∞, δ) and (δ, ∞) for some δ. For each ε > 0 define a function f_ε which agrees with f outside (δ − ε, δ + ε) and interpolates linearly across this interval. Then f_ε is Lipschitz by construction, and so by [3, Lemma 3.2] we know f_ε is almost differentiable; hence by [15, Lemma 2] Stein's identity applies to f_ε. Taking the limit as ε → 0 gives the stated identity, as required. The extension to any f with finitely many such discontinuity points arises from a very simple induction.

We therefore have that the identity holds conditionally on the locations of the discontinuity points. The result follows from the law of total expectation.

Lemma 2.

Let M̂ be the k-means fitted-value procedure, which replaces each row of X with the centroid to which it is assigned. Then each entry M̂(X)_ij satisfies the conditions on the function f in the statement of Lemma 1, and moreover if X = M + E, with M fixed and with E_ij ~ N(0, σ²) independent for all i, j, then the second term in the identity of Lemma 1 exists and is finite.

Proof: Notice that the discontinuities in M̂(X)_ij can occur only when there is a change in the assignment of one of the observations. If this occurs at the point δ, then it is straightforward to show that the size of the discontinuity is bounded by a constant multiple of Diam(X), where Diam(X) is the diameter of the rows of X and the constant is independent of δ. There are also clearly finitely many such discontinuities, since there are finitely many cluster solutions arising from n data, and so their number is bounded by a constant independent of δ. Furthermore, the fitted values depend linearly on X_ij as long as all cluster assignments remain the same, and hence M̂(X)_ij is Lipschitz as a function of X_ij between points of discontinuity. Finally, the excess term is bounded by a constant multiple of E[Diam(X)], since the density φ_ij is bounded above by its value at the mode. Now, the tail of the distribution of Diam(X) is similar to that of the distribution of the maximum of finitely many χ² random variables with d degrees of freedom. Therefore E[Diam(X)] is clearly finite. The remaining term is also clearly finite, since each X_ij is normally distributed, and hence the expectation in Lemma 2 exists and is finite.


  • [1] Hirotogu Akaike. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–213. Springer, 1998.
  • [2] K. Bache and M. Lichman. UCI machine learning repository, 2013.
  • [3] Emmanuel J Candes, Carlos A Sing-Long, and Joshua D Trzasko. Unbiased risk estimates for singular value thresholding and spectral estimators. IEEE Transactions on Signal Processing, 61(19):4643–4657, 2013.
  • [4] Gilles Celeux and Gérard Govaert. A classification EM algorithm for clustering and two stochastic versions. Computational Statistics & Data Analysis, 14(3):315–332, 1992.
  • [5] Bradley Efron. How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81(394):461–470, 1986.
  • [6] Chris Fraley and Adrian E Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611–631, 2002.
  • [7] Greg Hamerly and Charles Elkan. Learning the k in k-means. In Advances in neural information processing systems, pages 281–288, 2004.
  • [8] Ke Jiang, Brian Kulis, and Michael I Jordan. Small-variance asymptotics for exponential family dirichlet process mixture models. In Advances in Neural Information Processing Systems, pages 3158–3166, 2012.
  • [9] Leonard Kaufman and Peter J Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis, volume 344. John Wiley & Sons, 2009.
  • [10] Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 1 edition, 2008.
  • [11] Dan Pelleg, Andrew W Moore, et al. X-means: Extending k-means with efficient estimation of the number of clusters. In ICML, volume 1, pages 727–734, 2000.
  • [12] Duc Truong Pham, Stefan S Dimov, and Cuong Du Nguyen. Selection of k in k-means clustering. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, 219(1):103–119, 2005.
  • [13] Stephen A Ramsey, Sandy L Klemm, Daniel E Zak, Kathleen A Kennedy, Vesteinn Thorsson, Bin Li, Mark Gilchrist, Elizabeth S Gold, Carrie D Johnson, Vladimir Litvak, et al. Uncovering a macrophage transcriptional program by integrating evidence from motif scanning and expression dynamics. PLoS computational biology, 4(3):e1000021, 2008.
  • [14] Gideon Schwarz et al. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
  • [15] Charles M Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pages 1135–1151, 1981.
  • [16] Catherine A Sugar and Gareth M James. Finding the number of clusters in a dataset: An information-theoretic approach. Journal of the American Statistical Association, 98(463):750–763, 2003.
  • [17] Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423, 2001.
  • [18] Ryan J Tibshirani. Degrees of freedom and model search. Statistica Sinica, pages 1265–1296, 2015.