Suppose we collect a sample from a population with hidden cluster structure. Of interest is the estimation of the number of clusters () in the sample . An alternative formulation is to consider a parsimonious representation of the data
where is the loading matrix that characterizes dependence between the rows of , which represent measurements for each sample unit, is a matrix whose columns are arbitrary latent vectors , and represents noise in the data. When the dependence structure in the sample corresponds to clusters, the above representation suggests that is a linear combination of the rows of , and each row corresponds to one of the clusters. Hence, the problem we are interested in can be translated to finding the minimal such that the matrix has rank , and the noise, , has independent and identically distributed rows. Henceforth, we refer to as the effective dimension of the data because, intuitively, a clustering structure in the sample reduces the data dimension from to . We will approach the estimation of within the probabilistic principal component analysis framework.
Probabilistic principal component analysis, introduced in the seminal paper of Tipping and Bishop , casts the estimation of PCs as a likelihood optimization problem. Although it is possible to use a Bayesian approach to estimate the number of PCs (
), the full Bayesian estimation using Markov Chain Monte Carlo (MCMC) can be computationally prohibitive for large datasets and approximations are needed, such as variational inference [7, 29]. Alternatively, Minka 
implemented Laplace’s method to approximate the posterior likelihood and showed it to be often superior to cross-validation and variational inference with the benefit of fast computation. For high-dimensional data with a small number of observations, noted the unsatisfactory performance of Laplace’s approximation and proposed to modify the Bayesian model using a Gaussian parametrization that showed improved performance. Observing the symmetry in the data structure,  approximated the Bayesian models for both and , and thus proposed a unifying criterion that works well under divergence of either the number of features, , or sample size, .
In the context of isotropic factor analysis (FA), Bai and Ng 
proposed to estimate the number of factors by finding some threshold to separate large and small eigenvalues of the data covariance matrix, but the approach depends on correct estimation of error variance. Using a different strategy, Passemier et al. tackled the estimation of the noise variance, which led to a bias-corrected criterion for estimating . Alternatively, Gavish and Donoho  proposed to remove the underlying noise in the sample eigenvalues via a hard threshold-based approach. In this case, the stopping rule based on a single threshold could be useful for recovering the original data in the sense of asymptotic mean squared error, but does not directly inform the dimension of the low-rank approximation.
The literature devoted to determining the number of principal components (PC) to retain is too extensive to be comprehensively described here, so we refer to two classic texts [21, 20] on principal component analysis (PCA), an excellent review by Jackson , as well as the state-of-the-art probabilistic methods discussed in Sobczyk et al. .
For small datasets, cross validation is also frequently used  with a general cross-validation (GCV) criterion proposed by  that also works well with large datasets. Another class of methods rely on asymptotic tests such as the likelihood ratio test (LRT) for equality of eigenvalues [3, 11, 23, 24, 35] and differ according to asymptotic conditions on data dimensions. Instead of an asymptotic test, Choi et al.,  recently proposed an exact method for hypothesis testing of signals in a noisy matrix to estimate the number of PCs that showed promising results in simulations.
The approach we propose here spearheads the use of penalized likelihood methods within the probabilistic formulation, where a penalty is introduce to reduce the number of principal components identified via profile likelihood maximization. The aim is to automatically select the effective dimension without manual tuning nor feature selection. Subsequent estimation of cluster structure can be achieved using either k-means or a mixture formulation that identifies the estimated effective dimension as the number of clusters or mixture components.
In Section 2, we will briefly review the key results of probabilistic principal component analysis, motivate the use of a penalty function, and propose a data-driven strategy to choose automatically. In Section 3, the method’s performance in finite samples is empirically explored via simulations performed under a number of scenarios, and is compared with alternative methods. In Section 4, we illustrate the use of our approach to estimate the number of cancer subtypes in three gene expression datasets. Finally, Section 5 discusses the advantages and limitations of our approach for estimating the number of clusters in gene expression and other potential applications.
2 Penalized PPCA
Consider a sample with each representing the measurements for the th sample unit. Denote the effective dimension for the sample units by . Our aim is to determine using a large number of features, .
Define the transpose where each is the random vector of the th feature measured for the sample. We assume that are independent and identically distributed random vectors from a population with mean and covariance . Without loss of generality, we assume that .
We propose to solve this problem by finding a lower dimensional representation for of the form
where is the loading matrix of rank , with each column independently following the and is the noise matrix with independent and identically distributed columns sampled from . The representation in (1) induces the following partition of the covariance matrix
In other words, the matrix completely accounts for the dependence structure between units in the sample and is the minimum rank of in the range such that the representation (1) holds.
2.2 A Brief Review of PPCA
The probabilistic formulation of principal component analysis decomposes the data
to a low-rank approximation and residual noise that follows a multivariate Gaussian distribution,. The log-likelihood function with respect to and is
where is the sample covariance matrix of . The maximum likelihood estimators can be derived via an eigendecomposition of and depend on the number of retained principal components (k):
where ’s are the sample eigenvalues of and , is an matrix with columns corresponding to the first eigenvectors of , is a diagonal matrix with non-zero entries each given by , and
is an arbitrary orthogonal matrix. Since we consider the parameter of interest to be , and nuisance parameters to be and , the profile likelihood can be expressed as:
The following result shows that the profile log-likelihood is monotonically non-decreasing in , suggesting that it can not be used as a criterion to select in finite samples.
Consider a sample with each column following a multivariate Gaussian distribution . If the sample covariance matrix of is positive semi-definite and , then the profile log-likelihood is non-decreasing for .
For the effective dimension is equal to the sample size and (3) is ill-defined due to the degeneracy of the noise distribution which becomes identically zero. In this case, is of rank and does not reveal any clustering structure.
One may be tempted to use a classical model selection criterion, e.g. Bayesian information criterion, to identify the effective dimension. However, the scale of the log-likelihood dominates the penalty when the number of features is large. A similar phenomenon has been reported in factor analysis  .
Assume that the true dimension of the probabilistic principal component model is . The population eigenvalues of the true covariance matrix have the form
and has multiplicity . When is large, the maximum likelihood estimator is approximately equal to that of the th population eigenvalue for any and , and thus for all . However, for finite samples, is always different from which implies for all .
2.3 Motivation for the Penalty
Penalized maximum likelihood approaches are widely used to induce sparsity in statistical models. The level of penalty imposed on the model is regularized via a tuning parameter which controls the trade-off between goodness-of-fit and complexity [40, 45, 5]. In the problem considered here, the model complexity is directly related to the effective dimension selected, while the fit corresponds to the amount of variance explained, i.e. .
We have investigated the use of the following penalty functions as they all account for both the amount of variance explained and the complexity of model with effective dimension :
We ultimately choose to use (4b
) as it leads to simpler analytical derivations and intuitive heuristics. The natural guiding principle is to favor a parsimonious cluster structure by penalizing small values ofas well as large values of .
2.4 Construction of penalized probabilistic principal component analysis
The penalized log-likelihood has the form:
where the tuning parameter controls the amount of penalty due to . Notice that the number of features is a scaling factor and does not affect the maximization other than through the computation of the sample covariance matrix .
Similarly to (2), the penalized maximum likelihood estimators, and , are functions of . Since is not present in the penalty function, the relationship between the penalized maximum likelihood estimator of and
remains the same, but the resulting singular values are different due to the presence of a non-zero-value. Given that the rank of is , the penalized profile likelihood estimator can be expressed in terms of and :
For a fixed , is unbounded as can be very close to 0 or even negative for large -values. Thus, the theoretical range of has an upper bound at such that is positive. This implies that the choice of poses a restriction of the range of , and vice versa. Henceforth, we will work with the scaled tuning parameter .
Similarly, the penalized profile log-likelihood can be written as a function of given each :
The introduction of penalty changes the monotonicity property of (3), and thus making it possible to select the correct dimension for appropriate choices of -value.
Consider a sample with each column following a multivariate Gaussian distribution . If the sample covariance matrix of is positive semi-definite and is the rank of , then there exist such that is maximized at .
Please see the Appendix.
Since and are not analytically available, we obtained conservative upper and lower bounds for using and such that . The proof of Proposition 2 also demonstrates that so that .
By definition, for , and are non-overlapping sets. Therefore, the realized range for is the union of all sets . But because of the restriction embedded in (5) and (7), we must have for each examined value of . Consequently, the restriction imposes a relationship whereby is non-increasing in (or ) and (or ). For example, when , we must have , while for , we must have .
The penalized approach requires proper calibration of so that is the maximizer of . The selection of appropriate tuning parameter values in other well-known problems, such as the selection of shrinkage tuning parameter in lasso [40, 44], uses either a model selection criterion, e.g. Bayesian information criterion, or cross-validation. However, the use of a cross-validation approach is based on optimizing a certain objective function which can be analytically expressed, a task that is difficult when of interest is determining the effective dimension. Our attempts at using an off-the-shelf information criterion have produced only modest results for finite samples.
So far, the best performance is obtained through a “voting” strategy in which each value of will lead to a vote for a particular estimate of . Since the same estimated can result from multiple -values, we ultimately select the estimate that has been obtained most often. We provide analytical and simulation support for this “voting” procedure in the next subsection.
2.5 A voting strategy to estimate
Therefore, the ratio of the boundaries for non-overlapping sets is asymptotically the largest for and this suggests a majority-voting strategy for estimating . We establish a grid of tuning parameter values on where is a user-specified integer, and . Each will result in (11) having support on a range of possible values for , say , and some choice of will be the maximizer of . Then, the number of times that each possible maximizes the penalized profile log-likelihood is counted and the one with the highest vote count is selected and denoted by .
Here we detail the construction of the search grid characterized by its range and the distance between adjacent grid values. Since the exact relationship between and is not analytically available, we rely on conservative bounds obtained via Taylor series approximations. The largest non-trivial choice for is , as when the log-likelihood is not defined. The grid of -values is constructed using a sequence of equidistant points on log scale from to . The value of needs to be large enough to identify a mode and in our simulations we used .
To demonstrate the validity of this voting procedure empirically, we illustrate in Figure 1 the behavior of and as a function of . For simplicity, we set , while fixing the number of observations at and number of subjects to cluster at
. The covariance structure was generated according to a singular value decomposition using simulated orthogonal matrices and specified squared singular values equal toor , both corresponding to .
Since maximizes the penalized profile log-likelihood for a particular value only when i) , and ii) simultaneously, we can visualize this on a panel of plots where the differences in i) and ii) are plotted as functions of for a few possible ’s including . With little or no penalty, i.e., when , can be shown to be negative and positive for all . As increases, the difference in penalized profile log-likelihood between and increases and becomes positive while the difference between and eventually drops below zero. It is easy to confirm visually that the approximations of the sets are non-overlapping and is indeed monotonically decreasing in . Notice that the log difference in is the largest for as compared to or , suggesting that if we took a grid-set of values that are equally spaced on scale, the majority would fall in .
Consequently, with the search grid constructed as proposed, the procedure would estimate the effective dimension by a majority vote, which is expected given the relationship between the log distance and the number of votes for that particular . We simulated two additional scenarios characterized by the singular values and such that . The scatterplots of the number of votes and shown in Figure 2 confirm the effective dimension can be recovered by a majority vote.
The readers interested in implementing this procedure on their data can use the R package available at https://github.com/WeiAkaneDeng/SPAC2.
3 Simulation Studies
3.1 Comparison with Alternative Methods
To mimic realistic scenarios, we assumed the first squared singular values decay at an exponential rate with their values determined by varying the two parameters and . The combinations yield different degrees of difficulty for recovering the effective dimension. The residual noise was assumed to have a multivariate Gaussian distribution with mean vector zero and . The number of subjects to cluster () is often directly associated with the difficulty of recovering the effective dimension. Thus, we examined , but varying . Lastly, the number of features or observations was set to .
3.2 Comparison with Alternative Methods
We compared the performance of our approach with alternative methods, for which details are provided in the Appendix. We included BIC, the Bayesian model selection approach proposed in  for PPCA, denoted Minka BIC, the best performer from a class of Bayesian criteria using PEnalized Semi-integrated Likelihood (PESEL; ), and the bias-corrected criterion for estimating (PASSEMIER; ). We have also included in the comparison a couple of empirical approaches designed to detect an “elbow” in the distribution of the sample eigenvalues: the difference between log cumulative mean of the sample eigenvalues and the mean of the cumulative log sample eigenvalues (Cumlog), defined by and variance of sample eigenvalues (VarD), defined by .
Some of the methods we do not consider in our comparison are Bishop’s ARD  and related methods followed it [10, 33] as they have been shown to be outperformed by methods using the Laplace approximation . We have also not included Bayesian methods that rely on MCMC sampling  as they become computationally prohibitive when is large, e.g., . For the same reason, we excluded cross-validation, but included the general cross-validation (GCV) criterion of  that has better scalability properties.
The performance was evaluated by the mean (Figure 3) and proportion (Figure 4) of correct estimates over 100 replicates. Even though our method was not the “best” in every scenario, the overall performance was strong, capturing the effective dimension for most scenarios when either or was large. As expected, our penalized criterion was more likely to slightly underestimate than overestimate across different scenarios. Notably, all the likelihood based methods performed well when is large. In particular, BIC approximated by Laplace’s method outperformed BIC in all scenarios, while BIC tended to overestimate and chose the maximum possible for smaller when . In addition, the penalized semi-integrated likelihood criterion  maintained consistent performance in all scenarios and showed a clear preference to small and large , i.e. a scenario characterized by a high signal-to-noise ratio (Figure 3).
One of the data attributes encountered in real world applications is a larger number of features,, than number of subjects to cluster, . To capture this characteristic, we fixed , , and . The relationship between and the estimated over 10 replicates is shown in Figure 5 and 6. As was increased, all methods except BIC approach the correct value. Notice that when and are both small, our method tended to under-estimation of as shown by the solid line representing the median of our estimates over 10 replicates. In contrast, although the penalized semi-integrated likelihood criterion and Laplace approximated BIC have a similar outlook as our penalized profile likelihood criterion, their estimates approach the correct value slower than our method in most of the scenarios considered. The results also showed that BIC had reasonable performance for moderate sizes of , but, as the number of features increases, their estimates became unreliable for whenever is small. This reaffirms our observation that for large values of , the leading difference in BIC between choices of , is of order and dominates the difference between respective profile log-likelihoods. In other words, for large , BIC behaves just like the profile log-likelihood and always prefers the full model with .
4 Application to sample-based Gene Expression Clustering
4.1 Sample-based gene clustering
Gene expression is an intermediary measurement of how genetic information is translated to the observed phenotype through production of mRNA and protein. As a consequence of the natural fluctuation in biochemical reactions, the measured gene expression can be seen as a single instance of a complex stochastic process and is often quite noisy . These unique characteristics of gene expression data make it possible to cluster either genes or samples, commonly referred to as gene-based clustering and sample-based clustering. For example, gene-based clustering aims to find genes that might share a common regulator  or co-express , or to find a group of genes that are enriched for specific functions in the same pathway . On the other hand, sample-based clustering relies on the correlation over a large number of measured expression levels between any pairs of individuals. Correlations could possibly lead to similarities between individuals in terms of cell compositions and reveal subtypes of disease.
Here we demonstrate the utility of our procedure on three gene expression datasets and discuss its potential use as a basis for clustering algorithms. It has been shown that the PCs are the continuous solution to the assigned membership of K-means clustering , implying an utility that goes beyond noise reduction. For convenience, we applied both K-means and a mixture model to cluster using the estimated number of PCs.
4.2 Data Analyses
Consider expression measurements from individuals , where , the number of features, is typically in the tens of thousands and , the number of individuals, in the hundreds. A summary of results on three gene expression data applications is shown in Table 1.
|Breast cancer data||NCI60||Ovarian cancer data|
|Penalized profile log-likelihood||9||16||21|
|Laplace’s method approximated BIC||40||40||52|
|Penalized semi-integrated likelihood||43||59||40|
|Elbow approach using cumulative||28||27||38|
Expression data from breast cancer subtypes: The expression measurements were collected on tissues from 153 patients with primary invasive breast cancer . The four cancer subtypes defined clinically are (number in parenthesis indicates patients with that type of breast cancer) Triple-negative (TN; 55), human epidermal growth factor receptor 2 (HER2; 39), Luminal A (29) and Luminal B tumour (30) , and the two types of controls are in the form of 11 normal tissue samples and 14 normal cell lines. A total of 29,874 genes probes were included as features. Our penalized approach estimated , with no other competing method giving any sensible estimates (Table 1).
|Known Molecular Subtypes||Normal Controls|
|Cluster||Triple negative||HER2||Luminal A||Luminal B||Normal tissue||Normal cell line|
It is expected that some of the subtypes would be clustered together for any choice of the clustering algorithm as a result of the original molecular classification for these subtypes of breast cancer. However, the normal tissue and the cell line derived samples are distinctive enough that they would always be assigned their own clusters. K-means clustering algorithm reveals that two of the identified clusters belong to TN, two belong to HER2, two for Luminal A and one for Luminal B (Table 2). Since triple-negative tumours are known to be a more heterogeneous group that contains mostly Basal type and also encompass all other subtypes, it is unsurprising to observe these results. Meanwhile, Luminal A and Luminal B have more homogenous expression signature. In particular, TN and HER2 are hormone-receptor negative, while Luminal A and Luminal B are hormone-receptor positive, and these two classes should have very little overlap. In other words, clustering cancer subtypes using expression from all genome-wide gene probes did not blur the subgroup structure, but rather introduced clarity to specific cases that could be further characterized.
NCI60 cancer cell line data: This dataset originally contained gene expression of nine types of cancer cell line , and has been recently profiled using array technology for individuals on gene probes . Our penalized approach estimated . Though our estimate is larger than the number of cancer cell lines, it is within a reasonable range as most of the cancers have known subtypes.
Human Ovarian Gene Expression Data: This experiment profiled expression levels of 285 ovarian cancer tumours at various stages with the objective to identify molecular subtypes . The study discovered six subtypes defined by molecular pattern in a subset of
probe sets. It is common in the literature to use only a subset of biologically relevant genes to cluster the subjects so that the clusters would be better separated and results of hierarchical clustering would be stable. It is not clear how many genes should be included since the relevant pathways are usually determined by experts and two separate sets of genes are usually used to ensure the robustness of the clusters. On gene probes, our method estimated , while none of the other methods provided a reasonable estimate (Table 1).
In this article, we propose to estimate the number of principal components via a penalized profile log-likelihood derived from the probabilistic principal components analysis formulation. The main aim is to simplify the data structure and provide a lower dimensional representation. Specifically, the focus of the method proposed here is to recover the effective dimension and thus to inform the cluster structure in the observed noisy data.
Under simulated scenarios, there was no universally best method, but PPPCA had the best performance when averaging over scenarios. This is advantageous in applications when only one opportunity is given to estimate the number of principal components. The successful recovery of the latter facilitates subsequent use of clustering algorithms that require a value for the number of clustered.
The main difficulty of clustering is the unsupervised nature, and mapping of complex data structure to independent latent space is usually not directly interpretable as each dimension might not necessarily correspond to, for example, the type of cancer, but possibly a common basis of several cancer types. For example, several clinically distinct cancer types might share similar physiological mechanism that influences the expression profile of the same subset of gene. The advantage of our method is that all features could be incorporated without assuming a homogenous feature space, so gene expression, f-MRI and environmental data could all be combined to improve estimation of the number of clusters. The large dimension of the feature space is also why cross-validation is difficult to implement beyond the heavy computational burden as data splitting can sometimes create biased signal in the data depending on how the held-out datasets are obtained, i.e. if the number of clusters is local to a subset of the features. In addition, this type of data almost always present certain violations to model assumptions, such as highly correlated features, non-normality, or a combination of the two. But as we have shown with simulated data, PPPCA is reasonably robust to dependent observations and moderately fat-tailed data.
In addition, we would like to point out that our method might not be suited to any application that hinges on minimizing reconstruction error, such as digit recognition or image processing. In addition, since principal components are essentially a low-rank approximation of the data by linear projection into subspace, we are restricted to signals in the linear space. Nevertheless, this is a reasonable assumption for gene expression data. Due to the nature of gene expression levels, both row and column correlations are present. Our method does not explicitly account for correlation in the columns (genes), but as we demonstrated in simulation with autocorrelated features that PPPCA was reasonably robust as long as there was a good separation between signal and noise, as well as a large number of observations . This has been observed by  and others [14, 16, 15, 25] for clustering of high-dimensional data that correlation does not strongly impact the results as long as the cluster centres are sufficiently separated.
Finally, we should mention that sparse PCA , where certain loadings within the top PCs are shrunk to zero to produce sparse loadings, is different from our approach since in our case, only loadings for the last PCs are shrunk to zero. The penalty function in PPPCA effectively shrinks the sample eigenvalues of the last PCs relative to the amount of noise variance to select the first PCs sufficient to account for the data structure, and not to select the features that are important for the first PCs. However, it would be interesting to combine the two methods to induce sparsity from both sides such that can be estimated based on important features with non-zero loadings.
The authors thank Andrey Feuerverger for helpful suggestions, Dehan Kong, Lei Sun, Qiang Sun, Stanislav Volgushev, and Fang Yao, for a critical reading of the original version of the paper. Wei Q. Deng is supported by Alexander Graham Bell Canada Graduate Scholarships-Doctoral Program from the Natural Sciences and Engineering Research Council of Canada.
Information Criteria: Let denote the number of free parameters in the probabilistic model, we can select the rank by minimizing the following commonly used model selection criteria: and .
The three gene expression datasets can be obtained publicly via the Gene Expression Omnibus from National Center for Biotechnology Information or ArrayExpress from the European Bioinformatics Institute (http://ebi.ac.uk
). Pre-processed datasets were used in the present analyses. We have transformed measurements at each probe to a standard normal distribution but no further cleaning nor filtering was performed. Here each probe measures the levels of expression of a specific DNA sequence, and can be regarded as a feature or observation.
Proof of Proposition 1
Following  we have:
Notice that the sample eigenvalues are decreasing thus implying .
For , based on the inequality for any we obtain
Proof of Proposition 2
The proof is based on the following lemmas whose proofs are included in the web appendix of the paper.
Consider a collection of samples with each row following a multivariate Gaussian distribution . Suppose is the rank of and further, the sample covariance matrix of is positive semi-definite. Then, the penalized profile log-likelihood at each fixed is a smooth function of on the interval and is monotonically decreasing on , where .
Suppose we consider , then for any fixed , is a monotonically increasing and concave function of and is a monotonically decreasing and convex function of .
For every , there exists such that . And if , if and only if
For , we have , if and only if ; and for , if and only if .
Then can be approximated by , where denote an upper bound for , and a lower bound for , such that .
When is at the true rank, as , we have and for .
of Proposition 2.
We prove the result for in the range
Following Lemma 1 and Lemma 2, when , the difference is a smooth and concave function of and is increasing. Similarly, is a smooth and convex function of and is decreasing. This enables us to define an interval
for any . Lemma 3 implies that on this interval, there exists such that , including .
such that .
Lemma 3 implies that on this interval, there exists such that .