1 Introduction
Recent technological advances have resulted in a wealth of high-dimensional data in biology, medicine, and the social sciences. In unsupervised contexts where the data is unlabeled, inferring distinct classes and identifying their characteristics is a key step towards mechanistic insight. This process of clustering relies on identifying modes of the data distribution which correspond to different classes. Clustering in high dimensions is difficult, however, both because of unavoidably low data density in high dimensions (the “curse of dimensionality” [1]) and because of the possibility that a small but unknown fraction of the measured features define the true clusters while the remaining features are uninformative of cluster identity [2, 3]. Identifying informative features has long been of interest in the statistical literature. When the data is labeled, allowing for a supervised analysis, there are successful techniques for extracting important features using high-dimensional regressions. When there is no labeled training data, unsupervised discovery of features relevant for clustering is difficult. Existing methods rely either on finding features that correlate with principal components [4], which can be misleading when these components do not capture the variation relevant to cluster identity [3], or on optimizing cost functions which require knowledge of the number of clusters [2]. When little is known about the underlying system, and no forward model exists, an unbiased discovery of the key features that define clusters and hence classes is essential for extracting insight.
Using gene expression data to understand processes in developmental biology highlights this challenge. In a developing embryo, multipotent cells make a sequence of decisions between different cell fates, eventually giving rise to all the differentiated cell types of the organism. The goal is both to determine the physiological and molecular features that define the diversity of cell states, and to uncover the molecular mechanisms that govern the generation of these states. Decades of challenging experimental work in developmental biology suggest that a small fraction of genes control specific cell fate decisions [5, 6, 7]. Recent experimental techniques measure tens of thousands of features – gene expression levels – from individual cells obtained from an embryo over the course of development, producing high-dimensional data sets [8, 9]. Clustering these data to extract cell states and identifying the small fraction of key genes that govern the generation of cellular diversity during development has been difficult [10, 11]. However, mapping cellular diversity back to specific molecular elements is a crucial step towards understanding how gene expression dynamics lead to the development of an embryo.
Here we show that as the fraction of relevant features decreases, existing clustering and dimensionality reduction techniques fail to discover the identity of relevant features. We show that when the linear separability of clusters is restricted to a subspace, the identity of the subspace can be found without knowing the correct clusters by averaging over discriminators trained on an ensemble of proposed clustering configurations. We then apply this approach to previously published single-cell RNA-seq data from the early developing mouse embryo [12], and discover a subspace of genes in which a greater diversity of cell types can be inferred. Further, the subspace of genes that we discover not only clusters the data but contains genes known from the experimental literature to be instrumental in the generation of the different cell types that arise at this stage. This approach provides unsupervised sparse feature detection to further mechanistic understanding and can be broadly applied in unsupervised data analysis.
2 Results
2.1 Uninformative Data Dimensions Corrupt Data Analysis
To understand how a decreasing fraction of relevant features affects data analysis, consider data drawn from $q$ classes in a space with $N$ features. Assume that the feature space can be partitioned into two subspaces: first, an informative subspace $I$ of dimension $n$, in which the clusters are separable; and second, an uninformative subspace $U$ of dimension $N - n$, in which the clusters are not separable. An example of such a distribution is shown in Figure 1 with two clusters.
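The cost of the uninformative subspace can be quantified through distances. For standard-normal data with identity covariance (an assumption made for this sketch; the general case is derived in 4.1), the correlation between squared distances computed in the full space and in $I$ is $\sqrt{n/N}$, which can be checked numerically:

```python
import numpy as np

def distance_correlation(n_informative, n_total, n_pairs=20000, rng=None):
    """Correlation between squared distances computed in the full space
    and in the informative subspace, for standard-normal data."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Differences between pairs of Gaussian points are themselves Gaussian.
    d = rng.standard_normal((n_pairs, n_total))
    full = (d ** 2).sum(axis=1)                     # |d|^2 in the full space
    sub = (d[:, :n_informative] ** 2).sum(axis=1)   # |d_I|^2 in the subspace
    return np.corrcoef(full, sub)[0, 1]

# The measured correlation tracks sqrt(n/N) as the informative fraction shrinks.
for n in (64, 16, 4):
    r = distance_correlation(n, 64)
    print(f"n/N = {n}/64: corr = {r:.2f}, predicted {np.sqrt(n / 64):.2f}")
```

As the informative fraction shrinks, full-space distances carry less and less information about the structure in $I$.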
The correlation between the distances computed in the full space and those computed in the relevant subspace scales as $\sqrt{n/N}$ (see 4.1). When the fraction of relevant features is small, or equivalently $n/N \ll 1$, correlations between samples become dominated by noise. In this regime, without the correct identification of $I$, unsupervised analysis of the data is difficult, and typical dimensionality-reduction techniques (PCA, ICA, UMAP, etc.) fail. We demonstrate this by constructing a Gaussian mixture model with 7 true clusters which are linearly separable in a subspace $I$ of dimension $n$, and drawn from the same distribution (thus not linearly separable) in the remaining $N - n$ dimensions. As the ratio $N/n$ increases, the separability of the clusters in various projections decreases (Figure 2). For the same reason, clustering methods and graph embeddings that rely on Euclidean distances between data points to measure similarity deteriorate as $N/n$ increases (Figure S1).

In many cases, identifying the “true” $I$ may be challenging. However, eliminating a fraction of the uninformative features and moving to a regime of smaller $N/n$ could allow for more accurate analysis using classical methods. We next outline a method to weight dimensions to construct an estimate of $I$ and to reduce $N/n$.
2.2 Minimally Informative Features: The Limit of Pairwise Informative Subspaces Separating Clusters
To develop a framework to identify the relevant features, consider data $X$ whose samples are represented in the measurement basis $\{e_i\}_{i=1}^{N}$. Assume that the data is structured such that each data point is a member of one of $q$ clusters $\{c_k\}$. Let $I_{kl}$ be the subspace of the feature space in which the data points belonging to the pair of clusters $c_k$, $c_l$ are linearly separable, and let $\hat{n}_{kl}$ be the unit vector normal to the max-margin hyperplane separating clusters $c_k$ and $c_l$. In the orthogonal complement of $I_{kl}$, the two clusters are not linearly separable. One can similarly define $\binom{q}{2}$ such subspaces $I_{kl}$ and associated hyperplane normals $\hat{n}_{kl}$, one for each pair of clusters. We define a weight for each dimension by its component on the $\hat{n}_{kl}$'s:

$$w_i = \sum_{k < l} \left| \hat{n}_{kl} \cdot e_i \right| \qquad (1)$$
Knowing the cluster configuration would allow us to directly compute $w$ by finding the max-margin classifiers and using Equation 1. Conversely, knowing $w$ would allow for better inference of the cluster configuration, because restriction to the subspace in which $w_i > 0$ would move the problem to a regime of smaller $N/n$ (Figure 2). Existing work has focused on finding $w$ and the cluster configuration simultaneously, through either generative models or optimization of a joint cost function. Such methods either rely on context-specific forward models, or tend to have problems with convergence on real data sets (see 2.7 and ref. [2]).

We focus here on estimating $w$ when the cluster configuration is unknown. We consider the limit in which the dimension of each $I_{kl}$ takes on the smallest possible value of 1, which maximizes the ratio $N/n$ for all pairs of clusters. This limit resides in the regime of large $N/n$, where conventional methods fail (Figure 2). We further consider the limit where the intersection between any pair of the subspaces $I_{kl}$ is null. In this limit, the marginal distribution of all of the data in any one of the $I_{kl}$ can appear unimodal, due to the dominance of data points from the clusters for which this subspace is irrelevant, despite the data in clusters $c_k$, $c_l$ showing a bimodal signature in this subspace. Hence finding the identity of the informative subspaces by distinguishing moments of the marginal distributions is not possible as $N/n$ grows or data density decreases, even in the case of normally distributed data (Figures 2, 3B). In this limit, the values of $w_i$ corresponding to informative dimensions are positive, and 0 for uninformative dimensions. Our reason for studying this limit of pairwise separability is that an algorithm that can find the informative subspaces in these limits should also be able to do so in instances where the dimensions of the $I_{kl}$ are larger than one and intersecting.

We generate data such that the means of the marginal distributions of clusters $c_k$ and $c_l$ along the specific direction whose span defines $I_{kl}$ are separated by $\Delta$, and the sample variance of each cluster's marginal distribution is $\sigma^2$. The marginal distribution of each cluster in all other dimensions is unimodal with zero mean and unit variance. Therefore, in all there are $\binom{q}{2}$ dimensions, in each of which one pair of clusters is linearly separable while the other clusters are not, and $N - \binom{q}{2}$ dimensions in which all clusters are drawn from the same unimodal distribution. Normalizing each feature to have unit variance leaves one free parameter, $\Delta/\sigma$, which controls the pairwise separability of clusters within their informative subspace (Figure 3B).
2.3 Identifying a Sparse Set of Pairwise Informative Features
We develop an approach to estimate the weight vector $w$ knowing neither the identity of the points belonging to each cluster nor the total number of clusters. In order to estimate $w$, we propose to average estimates of $w$ over an ensemble of clustering configurations. Specifically, we sample an ensemble of possible clustering configurations $\{C\}$, and from each configuration compute a collection of max-margin classifiers in order to estimate $w$ using Equation 1:

$$\langle w_i \rangle = \sum_{C} P(C \mid X)\, w_i(C) \qquad (2)$$

where $P(C \mid X)$ is the probability of a clustering configuration given the data. This sum can be approximated numerically through the sampling procedure outlined in Algorithm 1, where cluster proposals are sampled according to

$$C \sim P(C \mid K, X)\, P(K) \qquad (3)$$

where $K$ is the number of clusters, and $P(K)$ is our prior over the number of proposal clusters.
Consider one such proposed clustering configuration with $K$ clusters, denoted by $C = \{S_1, \ldots, S_K\}$, where each $S_k$ indexes the data points that belong to the $k$th proposed cluster. For this proposed clustering configuration, we compute a set of $\binom{K}{2}$ classifiers that separate each pair of clusters. Based on the assumption that $w$ is sparse, or equivalently that the true $I_{kl}$ are low-dimensional, we impose an L1-regularized max-margin classifier to compute $\hat{w}_{kl}$ from the data $X$ and the proposed cluster configuration as in [13]:

$$\hat{w}_{kl} = \operatorname*{arg\,min}_{w,\, b} \sum_{a \in S_k \cup S_l} \left[ 1 - y_a \left( w \cdot x_a + b \right) \right]_+ + \lambda \lVert w \rVert_1 \qquad (4)$$

where $y_a = \pm 1$ labels membership of point $x_a$ in $S_k$ or $S_l$, $[\cdot]_+$ indicates the positive component, and $\lambda$ is a sparsity parameter. We set $\lambda$ such that the expected number of nonzero components in each $\hat{w}_{kl}$ is 1. Specifically, we sample cluster configurations by clustering random subsets of the data, and average the weights of the max-margin classifiers over this ensemble:

$$\hat{w}_i = \frac{1}{M} \sum_{m=1}^{M} \sum_{k < l} \left| \hat{w}^{(m)}_{kl} \cdot e_i \right| \qquad (5)$$

where $M$ is the number of sampled configurations.
This procedure is outlined explicitly in Algorithm 1.
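Algorithm 1 can be sketched as follows. As an illustrative simplification, this sketch uses plain k-means on random subsamples for the proposal clusterings, and the normalized mean-difference direction between proposal clusters as a stand-in for the L1-regularized max-margin normal of Equation 4; these substitutions and all parameter values are assumptions, not the paper's exact procedure:

```python
import numpy as np

def kmeans(X, k, n_iter=25, rng=None):
    """Plain Lloyd's algorithm; returns cluster labels."""
    rng = np.random.default_rng() if rng is None else rng
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def ensemble_weights(X, n_configs=40, k_prior=(2, 3, 4), subsample=0.8, rng=None):
    """Average per-feature weights of pairwise separating directions over an
    ensemble of proposal clusterings (Equation 5). Mean-difference normals
    stand in for the max-margin classifiers."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = np.zeros(X.shape[1])
    for _ in range(n_configs):
        idx = rng.choice(len(X), size=int(subsample * len(X)), replace=False)
        k = rng.choice(k_prior)                  # draw K from the prior P(K)
        labels = kmeans(X[idx], k, rng=rng)
        for a in range(k):                       # every pair of proposal clusters
            for b in range(a + 1, k):
                Xa, Xb = X[idx][labels == a], X[idx][labels == b]
                if len(Xa) and len(Xb):
                    d = Xa.mean(0) - Xb.mean(0)  # separating direction
                    if np.linalg.norm(d) > 0:
                        w += np.abs(d) / np.linalg.norm(d)
    return w / w.sum()
```

On data whose cluster structure lives in a single dimension among many noise dimensions, that dimension accumulates the dominant averaged weight even though no proposal clustering is assumed correct.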
2.4 Scaling of Inferred Weights with Dimensionality and Data Density
In the challenging regime of large $N/n$, this algorithm can robustly identify key features in the data. In particular, as $N/n$ increases, there is a scaling of the algorithm's performance as a function of $N/n$, as well as a dependence on the number of data points. First, we sample a variety of proposal clusterings, each with $K$ clusters drawn from a prior $P(K)$. Using counting arguments (see 4.2) we can estimate how frequently proposed classifiers align with informative dimensions carrying a bimodal signature versus uninformative dimensions without one. The ratio of the average weights of informative dimensions to the average weights of uninformative dimensions, $R$, scales with the ratio of uninformative to informative dimensions. The scaling, however, also depends on data density. Specifically, the length scale $\ell$ separating two neighboring data points in the full space scales as $\ell \sim D^{-1/N}$, where $D$ is the number of data points. In the relevant subspace $I$, this length scale translates to a volume $\ell^n$, which must be compared to the characteristic volume $V$ in this subspace that reflects the multimodal structure of the data. If the identities of the true clusters in $I$ are known, one can ask what the errors are in clustering in the full space instead of in $I$ by computing the entropy $S$ of the composition of inferred clusters with respect to the true cluster identities of the data points. This entropy has to be a function of the ratio of the characteristic volumes, $\ell^n/V$; equivalently, the entropy of the clusters should be a monotonically increasing function $f(\ell^n/V)$, denoting increasing errors in clustering. The form of the function $f$ depends on the true data distribution and the clustering method. Therefore, our expectation for the ratio of counts for the informative dimensions to counts for the uninformative dimensions should scale like
(6) 
We numerically generated Gaussian-distributed data over a range of dimensions and data densities, ran 2000 iterations of the algorithm, and found close agreement over the range of parameters tested (Figure 4A, 4B).

2.5 Significance and Sources of Error
In order to estimate the significance of the frequencies produced by the algorithm, we constructed a null model to estimate the weights in the absence of signal. First, each column of the data matrix is shuffled independently to produce a null data set; this leaves the marginal distribution of each dimension unchanged while destroying any joint structure. Next, each feature is scored on the shuffled data, and the mean $\mu_i^{\text{null}}$ and standard deviation $\sigma_i^{\text{null}}$ of these null scores are computed. We then compute a Z-score for each dimension as $z_i = (\hat{w}_i - \mu_i^{\text{null}})/\sigma_i^{\text{null}}$. Motivated by the work of [14] and [15], more precise estimates could be obtained by generating synthetic marginals without multimodality, which is an area for future work.

Correlations in the uninformative subspace can lead to erroneous counts in the correlated axes. This is caused by correlations in uninformative dimensions biasing the proposal clusters to be differentially localized along these axes. Despite these false positives, the false negative rate remains low, resulting in minimal degradation of the ROC curve (Figure 4C). In practice, eliminating any number of uninformative dimensions is effective in restricting analysis to a regime of smaller $N/n$. Thus, even in the presence of false positives, removing uninformative dimensions prior to conventional analysis can increase the accuracy of clustering or dimensionality-reduction techniques.
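The column-shuffled null can be sketched generically; here `score_fn` is a placeholder for any per-feature scoring function (such as the ensemble weights of Algorithm 1), an assumed interface rather than the paper's exact code:

```python
import numpy as np

def permutation_zscores(X, score_fn, n_null=25, rng=None):
    """Z-score each feature's score against a null built by independently
    shuffling every column of X, which preserves all marginal distributions
    but destroys joint structure."""
    rng = np.random.default_rng(0) if rng is None else rng
    observed = score_fn(X)
    null = np.empty((n_null, X.shape[1]))
    for i in range(n_null):
        Xs = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
        null[i] = score_fn(Xs)
    mu, sd = null.mean(axis=0), null.std(axis=0)
    # Guard against zero-variance null scores.
    return (observed - mu) / np.where(sd > 0, sd, np.inf)
```

Features whose Z-score exceeds a chosen threshold are retained; all others are treated as uninformative.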
A free parameter in the synthetic data generated in Figure 4 is $\Delta/\sigma$, the ratio of the mean separation to the spread of the distributions in the informative subspace, which controls the separability of clusters. As $\Delta/\sigma$ decreases, we see degradation in the AUROC for our algorithm, but identification of key dimensions is still possible even as the mean separation approaches the noise level of the distributions (Figure 4D, inset).
2.6 Application to Single-Cell RNA-Sequencing from Early Mouse Development
A central challenge in developmental biology is the characterization of cell types that arise during the course of development, and an understanding of the genes which define and control the identity of cells as they transition between states. Starting at fertilization, embryonic cells undergo rapid proliferation and growth [6, 16]. In a mouse, these cells form the epiblast, a cup-shaped tissue surrounded by extraembryonic cells by E6, or 6 days after fertilization. Only the cells of the epiblast will go on to give rise to all the cells of the mouse. These cells are pluripotent, meaning they have the developmental potential to become any cell type in the adult mouse body [17]. At E6, proximal and distal subpopulations of both the epiblast and surrounding extraembryonic cell types begin secreting signaling proteins [18], which, when detected by nearby cells, can increase or decrease the expression of transcription factors – proteins that modulate gene expression, and can thus change the overall expression profile of a cell. Signaling factors direct genetic programs within cells to restrict their lineage potential and undergo transcriptional as well as physical changes. Posterior-proximal epiblast cells migrate towards the outside of the embryo, forming a population called the primitive streak, in a process called gastrulation which takes place between E6.5 and E8. This time frame is notably marked by the emergence of three populations of specified progenitors known as the germ layers [19]: endoderm cells, which later differentiate into the gastrointestinal tract and connected organs; mesoderm cells, which have the potential to form internal organs such as the musculoskeletal system, the heart, and the hematopoietic system; and ectoderm cells, which later form the skin and nervous system. The mesoderm can be subdivided into the intermediate mesoderm, paraxial mesoderm, and lateral plate mesoderm, which each have further restricted lineage potential. Identifying the key transcription factors that define and control the genetic programs that lead to these distinct subpopulations will allow for experimental interrogation and a greater understanding of the gene regulatory networks which control development.

Recent advances in single-cell RNA-sequencing technology allow for simultaneous measurement of tens of thousands of genes [8, 9] at multiple time points during development. These technological advances promise to provide insight into the identity and dynamics of key genes that guide the developmental process, yet even clustering cells into types of distinct developmental potential, and identifying the genes responsible for this diversity, has been difficult [20, 21]. Existing methods typically find signal in correlations between large numbers of genes with large coefficients of variation to determine a cell's state. However, experimental evidence suggests that perturbations of a small number of transcription factors are sufficient to alter a cell's developmental state and trajectory [5, 6, 7]. Further, recent work suggests that a small set of four to five key transcription factors is sufficient to encode each lineage decision [11, 22]. We therefore believe that the signature of structure in these data resides in a low-dimensional subspace.
While many existing methods rely on hand-picking known transcription factors responsible for developmental transitions [12], we attempt to discover a low-dimensional subspace of gene expression which encodes multimodal expression patterns indicating the existence of distinct cell states.
In [12], single cells were collected from mouse embryos between E6.5 and E8.5, encompassing the entirety of gastrulation, and profiled with RNA-sequencing to quantify transcript abundance. We considered 48,692 cells from E6.5 to E7.75 which had more than 10,000 reads mapped to them. We then subsampled reads such that each cell had 10,000 reads. Individual genes were removed from the analysis if they had a mean value of less than 0.05 or a standard deviation of less than 0.05 (based on [20, 21]). We restricted our analysis to transcription factors because, as regulators of other genes, variation in transcription factor expression is a strong indication of biological diversity between cells, or cell types. We normalized the 409 transcription factors with expression above these thresholds to have unit variance. A cell-cell correlation analysis followed by hierarchical clustering fails to capture the fine-grained diversity of cell types that is known to exist at this time point (Fig 5A).
We attempted to discover a low-dimensional subspace in which signatures of cell-type diversity could be inferred using the algorithm outlined in the previous section. We sampled 3000 clustering configurations based on hierarchical (Ward) clustering of 5000 subsampled cells, with the number of proposal clusters chosen to cover a range around the 37 clusters found in [12]. We find 27 transcription factors with a Z-score above our significance threshold, 18 of which have previously identified essential functions in the regulation of differentiation during gastrulation (see Table 1).
Our hypothesis is that variation in the 27 discovered transcription factors provides a subspace in which multimodal signatures allow the identification of cell types. However, single-cell measurements of individual genes are known to be subject to a variety of sources of technical noise [23]. In order to decrease reliance on individual measurements, we take each of the 27 high-scoring transcription factors and extend the subspace to include the 5 genes (potentially not transcription factors) that have the highest correlation with each of them, resulting in an expanded subspace of 83 genes in which to cluster the data (full list in the supplement). The cell-cell covariance matrix in this subspace (Figure 5B) reveals distinct cell types and subtypes, and a heat map of the expression levels of these 83 genes shows differential expression between subtypes of cells.
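This expansion step amounts to a nearest-neighbor lookup in correlation space; a minimal sketch, in which column indices stand in for genes and the selected set is assumed given:

```python
import numpy as np

def expand_subspace(X, selected, n_neighbors=5):
    """For each selected feature (column index of X), add the `n_neighbors`
    features most correlated with it; return the sorted union of indices."""
    corr = np.corrcoef(X, rowvar=False)
    keep = set(selected)
    for j in selected:
        order = np.argsort(-np.abs(corr[j]))        # most correlated first
        neighbors = [i for i in order if i != j][:n_neighbors]
        keep.update(neighbors)
    return sorted(keep)
```

Averaging over correlated partners in this way reduces the influence of technical noise in any single gene's measurement.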
We hierarchically clustered the cells into 35 cell types based on the expression of these 83 genes. The identity of these cell types was determined using the expression pattern of all genes (Table 2, Figure 5C), identifying extraembryonic populations (C5-9, C18-20), epiblast populations (C27-34), primitive streak populations (C14-17, C23-25), mesoderm subtypes (C2-4, C12, C13, C16, C17, C22), endoderm (C21), and primitive erythrocytes (C1). For example, we find a subpopulation of epiblast cells that have upregulated Nanog (as well as other early markers of the primitive streak), suggesting that these cells are positioned on the posterior-proximal end of the epiblast cup [24]. The large primitive streak population, which extends along the proximal side of the embryo, contains subtypes distinguished by Gsc [25] and Mesp1 [26], which give rise to distinct fates. We find a distinct population of anterior visceral endoderm cells, marked by Otx2 and Hhex, which defines the population responsible for the anterior-posterior body axis [27]. This population, which is distinguished from other Foxa2-expressing subpopulations of the visceral endoderm, is crucial for proper development.
Most importantly, in extracting the relevant features from the data, our algorithm identifies known and validated transcription factors that are crucial to the developmental processes happening in this time frame. Further, by eliminating extraneous measurements, we are able to identify clear differential expression patterns between subtypes of cells which were indistinguishable through previous methods. In particular, identification of primitive streak subpopulations provides novel insight into a central developmental process, and we identify key genes that would allow for experimental interrogation of the spatial organization of the subtypes and their dynamics.
Table 1  
Gene Name  Z-score  Associated Cell Type  Citation  
Creb3l3  34.20  
Tfeb  13.37  
Rhox6  10.15  
Elf5  9.45  Extraembryonic Ectoderm  [28] 
Gata1  8.77  Primitive Erythrocyte  [29] 
Pou5f1  8.38  Epiblast, Primitive Streak  [24] 
Sox17  5.61  Endoderm  [30] 
Nr0b1  5.58  
Hoxb1  4.54  Mesoderm  [31] 
Foxf1  3.36  Lateral Plate Mesoderm  [32] 
Gata2  3.27  Extraembryonic Mesoderm  [33] 
Prdm6  3.12  
Bcl11a  2.99  
Foxa2  2.70  Anterior Visceral endoderm, Anterior primitive streak  [27], [34] 
Gsc  2.67  Anterior Primitive Streak  [25] 
Hand1  2.57  Posterior Mesoderm, Lateral Plate Mesoderm  [35] 
Ascl2  2.52  Ectoplacental Cone  [36] 
Mesp1  2.46  Posterior Primitive Streak  [26] 
Hoxa1  2.44  Mesoderm  [31] 
Nanog  2.24  Epiblast  [24] 
Zfp42  2.14  Extraembryonic Ectoderm  [37] 
Cdx1  2.12  Paraxial Mesoderm  [38] 
Runx1  1.82  
Hoxb2  1.73  Mesoderm  [31] 
Id2  1.51  Extraembryonic Ectoderm  [39] 
Tbx3  1.31  
Pitx2  1.20  
Table 2  

Cluster ID  Label  Markers 
0  Unclustered  
1  Primitive Erythrocyte Progenitor [29]  Hba-x [40], Hbb-bh1 [41], Gata1 [29], Lmo2 [42] 
2  Mesoderm  Car3, Spag5, Hoxb1 [31], Cnksr3, Smad4, Zfp280d, Vim [43], Ifitm1 
3, 4  Lateral Plate Mesoderm  Foxf1 [32], Hand1 [35] 
5  Anterior Visceral Endoderm  Sox17, Foxa2 [27], Cer1 [44], Frat2, Lhx1 [45], Hhex [46], Gata6, Ovol2, Otx2 [27], Sfrp1 [47] 
6,7  Extraembryonic Mesoderm  Fgf3 [48], Lmo2 [42], Gata2 [33], Bmp4 [49] 
8,9  Visceral Endoderm 1  Rhox5 [50], Emb [51], Afp [52] 
10,11  Neuromesodermal Progenitor  Sox2, T [53] 
12,13  Paraxial Mesoderm / Presomitic Mesoderm  Hoxa1, Hoxb1 [31], Cdx1, Cdx2 [38] 
14  Posterior Primitive Streak  Mesp1 [26], Snai1 [54], Lhx1 [45, 55], Smad1 [56] 
15  Anterior Primitive Streak, Organizer-like Cells  Foxa2 [34], Gsc [25], Eomes [34] 
16,17  Posterior Primitive Streak-derived Mesoderm, Lateral Plate Mesoderm Progenitors  Msx2 [57], Snai1 [54], Foxf1 [32], Hand1 [35], Gata4 [58] 
18,19  Extraembryonic Ectoderm  Cdx2 [59], Rhox5 [50], Id2 [39], Gjb5 [60], Tfap2c [28], Zfp42(aka Rex1) [37], Elf5 [28], Gjb3 [60], Ets2 [61] 
20  Ectoplacental Cone  Plac1 [61], Ascl2 [36] 
21  Definitive Endoderm  Sox17 [30], Foxa2 [62], Apela [63] 
22  Mesendo Progenitor, Primitive Streak  Tcf15 [64], Cer1, Hhex [65] 
23  Posterior Primitive Streak, Cardiac Mesoderm Progenitors  Mesp1 [26], Gata4 [58], Lhx1 [55], Smad1 [56] 
24,25  Primitive Streak  T, Mixl1, Eomes, Fgf8, Wnt3 
26  Klf10, Gpbp1l1, Hmg20a, Rbm15b, Celf2  
27,28  Posterior-Proximal Epiblast  Nanog [24], Sox2 [66], Pou5f1 [24], Otx2 [67] 
29-34  Epiblast  Sox2 [66], Pou5f1 [24], Otx2 [67] 
2.7 Comparison to Existing Methods
Various methods exist to select features for clustering. Some are based on correlation analysis in the full space [68], which we have shown to be effective only in the regime of small $N/n$. Our approach is based on constructing max-margin classifiers, which do not rely on generative models of the data distribution, and does not attempt to exactly infer the clustering configuration. This is in general an advantage because it does not rely on context-specific knowledge of the underlying process. For a full discussion of the comparison, see [69]. However, the identification of a small subset of informative features can be formulated as a Bayesian inference problem, where a log-likelihood function is maximized over the hidden parameters via an expectation-maximization scheme [70]. In fact, model-based clustering has been explored in depth in [71, 72], and adapted to feature selection by the inclusion of a lasso term on the separation of the first moments in [73, 74, 75]. Advantages and drawbacks of these are discussed in [2], which provides a more general framework. In [2], a feature weight vector is introduced and learned by amending a clustering cost function with a penalty on the feature weights. In particular, their method adapts the K-means or hierarchical clustering cost function to include a sparsity constraint on feature weights, which is learned simultaneously with the clustering configuration during training via an iterative optimization. Their approach successfully adapts K-means and hierarchical clustering to perform better in the large-$N/n$ regime; however, there are two main drawbacks. 1) For the K-means variation, estimation of the cluster number relies on the noisy and unpredictable gap statistic (see, for example, the discussion in [76]), and estimating the correct number of clusters is challenging. When the number of clusters is not known, informative features are frequently missed (Figure S2). 2) The sparse hierarchical clustering of [2] was run 5 times on the mouse data described in the previous section and did not converge after 2 days. We then subsampled the cells from 48,692 to 5000 and ran 5 iterations on these subsampled data sets, none of which converged after 2 days. Our method is distinguished by integrating out the cluster configuration to provide a more general identification of key features, and by providing a faster approximation of key features on large data sets.

3 Discussion
Identifying subspaces which define classes and states from high-dimensional data is an emerging problem in scientific data analysis, where an increasing number of measurements push the limits of conventional statistical methods. Techniques such as PCA and ICA provide invaluable insight in data analysis, but can miss multimodal features, particularly in high-dimensional settings. These methods, which have reduced success in the large-$N/n$ regime, can be supplemented by our technique, which finds a lower-dimensional subspace in which further analysis can be conducted. Crucially, eliminating any uninformative dimensions decreases the ratio $N/n$, moving to a regime in which conventional methods are more effective. By reducing the dimensionality of the data, it is possible to artificially increase data density, and mitigate associated problems that are prevalent in high-dimensional inference. Further, as our algorithm can be a wrapper over any clustering algorithm used to construct the proposal clusters, it has varied applicability in settings where K-means or other specific clustering algorithms are unsuccessful.
Biological data from neural recordings, behavioral studies, or gene expression is increasingly high dimensional. Identifying the underlying constituents of the system that define distinct states is crucial in each setting. In contexts such as transcriptional analysis in developmental biology, finding the key genes that define cell states is a central problem that bridges the gap between high throughput measurements and mechanistic experimental follow ups. Identification of transcription factors with multimodal expression that define cellular states allows for the study of dynamics of state transitions and spatial patterning of the embryo. Our method rediscovers known factors in well studied developmental processes and predicts several gene candidates for further study. Identifying defining features in high dimensional data is a crucial step in understanding and experimentally perturbing systems in a range of biological domains.
Acknowledgements
We would like to thank Deniz Askel for extensive help with annotating the clusters. In addition we would like to thank Anirvan Sengupta, Massimo Vergassola, Sam Kou, Matt Thomson, Gautam Reddy, Sean Eddy, and members of the Ramanathan lab for discussions and comments on the manuscript. The work was supported by DARPA W911NF-19-2-0018, 1R01GM131105-01, and DMS-1764269.
4 Appendix
4.1 Correlations and Dimensionality
In situations where some dimensions or features do not carry relevant information, distances computed in the full space become dominated by noise. For data points $x, y \in \mathbb{R}^N$, let $I$ be the subspace with meaningful information, of dimension $n$, and let $U$ be its complement, which carries only noise. We are interested in distances between data points. Let $d = x - y$, and decompose it as $d = d_I + d_U$, the components of $d$ in $I$ and $U$ respectively. We want to compute the correlation between $|d|^2$ and $|d_I|^2$. By definition, $|d|^2 = |d_I|^2 + |d_U|^2$, and so

$$\mathrm{Cov}\left(|d|^2, |d_I|^2\right) = \mathrm{Var}\left(|d_I|^2\right) + \mathrm{Cov}\left(|d_U|^2, |d_I|^2\right).$$

The signal in $I$ and $U$ is uncorrelated, so the second term is $0$. $d_I$ and $d_U$ are both differences of multivariate Gaussians, and hence themselves multivariate Gaussian. For a multivariate Gaussian $z \sim \mathcal{N}(0, \Sigma)$, $\mathrm{Var}\left(|z|^2\right) = 2\,\mathrm{tr}\left(\Sigma^2\right)$. So

$$\mathrm{Corr}\left(|d|^2, |d_I|^2\right) = \frac{\mathrm{Var}\left(|d_I|^2\right)}{\sqrt{\mathrm{Var}\left(|d_I|^2\right)\mathrm{Var}\left(|d|^2\right)}} = \sqrt{\frac{\mathrm{tr}\left(\Sigma_I^2\right)}{\mathrm{tr}\left(\Sigma^2\right)}},$$

where $\Sigma_I$ is the covariance of $d_I$. The trace is the sum of the eigenvalues squared, so if the data is normalized such that the eigenvalues are of order 1, this scales like $\sqrt{n/N}$.
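The identity $\mathrm{Var}(|z|^2) = 2\,\mathrm{tr}(\Sigma^2)$ used above can be checked by simulation; a diagonal $\Sigma$ is assumed here for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)
eigvals = rng.uniform(0.5, 1.5, size=20)    # eigenvalues of a diagonal Sigma
n_samples = 200_000
# Sample z ~ N(0, Sigma) with Sigma = diag(eigvals).
z = rng.standard_normal((n_samples, eigvals.size)) * np.sqrt(eigvals)
empirical = (z ** 2).sum(axis=1).var()      # Var(|z|^2) from samples
predicted = 2 * (eigvals ** 2).sum()        # 2 tr(Sigma^2)
print(empirical, predicted)                 # agree to within sampling error
```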
4.2 False Positive and False Negative Rates
Let $m$ be the number of proposed clusters per true cluster (so that $K = mq$). We assume that the entropy relative to the true class labels is low in each of the proposed clusters (in practice, this condition is met in situations where $K \geq q$). This holds as long as the prior on the proposal cluster number is nonzero for $K \geq q$, because the expectation in Equation 5 is dominated by larger values of $K$, each proposal contributing $\binom{K}{2}$ terms to the sum. We start by counting the pairs of proposed clusters in which both clusters of the pair contain data points primarily belonging to the same true cluster: if true cluster $k$ is covered by $m_k$ proposals, there are $\sum_k \binom{m_k}{2}$ such pairs. The classifiers for these pairs separate proposals drawn from a single mode, so their weights fall on uninformative features; this sets the frequency of erroneous counts. If the true clusters have roughly equal sizes, then we can assume that $m_k = m$ for all $k$, and this count reduces to $q\binom{m}{2}$. The signal frequency is given by a count for each feature when the two proposal clusters in a pair contain points primarily from different true clusters, of which there are roughly $\sum_{k < l} m_k m_l$ pairs. These votes all identify an informative feature; when each $m_k = m$, this count reduces to $\binom{q}{2} m^2$. The ratio of signal to error counts tells us the regime in which the noise is much lower than the signal for meaningful features: the erroneous counts are spread over the $n_U$ uninformative features, while the signal counts concentrate on the $n_I$ informative features, so the ratio of the average weight of informative to uninformative features scales with $n_U/n_I$.
5 Data Availability
Single-cell RNA-seq data from mouse gastrulation [12]: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE87038
References
 [1] David L. Donoho. High-dimensional data analysis: The curses and blessings of dimensionality. In AMS Conference on Math Challenges of the 21st Century, 2000.
 [2] Daniela M. Witten and Robert Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010. PMID: 20811510.
 [3] Wei-Chien Chang. On using principal components before separating a mixture of two multivariate normal distributions. Journal of the Royal Statistical Society. Series C (Applied Statistics), 32(3):267–275, 1983.
 [4] Christos Boutsidis, Michael W. Mahoney, and Petros Drineas. Unsupervised feature selection for the k-means clustering problem. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, NIPS’09, pages 153–161, USA, 2009. Curran Associates Inc.
 [5] Thomas Graf and Tariq Enver. Forcing cells to change lineages. Nature, 462(7273):587–594, Dec 2009.
 [6] Scott Gilbert. Developmental biology. Sinauer Associates, Inc., Publishers, Sunderland, Massachusetts, 2016.
 [7] Kazutoshi Takahashi and Shinya Yamanaka. Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell, 126(4):663–676, Aug 2006.
 [8] Jeffrey A. Farrell, Yiqun Wang, Samantha J. Riesenfeld, Karthik Shekhar, Aviv Regev, and Alexander F. Schier. Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis. Science, 360(6392), 2018.
 [9] James A Briggs, Caleb Weinreb, Daniel E Wagner, Sean Megason, Leonid Peshkin, Marc W Kirschner, and Allon M Klein. The dynamics of gene expression in vertebrate embryogenesis at single-cell resolution. Science (New York, N.Y.), 360(6392):eaar5780, 06 2018.
 [10] Vladimir Yu Kiselev, Tallulah S. Andrews, and Martin Hemberg. Challenges in unsupervised clustering of single-cell RNA-seq data. Nature Reviews Genetics, 20(5):273–282, Jan 2019.
 [11] Leon A Furchtgott, Samuel Melton, Vilas Menon, and Sharad Ramanathan. Discovering sparse transcription factor codes for cell states and state transitions during development. eLife, 6, Mar 2017.
 [12] Blanca Pijuan-Sala, Jonathan A. Griffiths, Carolina Guibentif, Tom W. Hiscock, Wajid Jawaid, Fernando J. Calero-Nieto, Carla Mulas, Ximena Ibarra-Soria, Richard C. V. Tyser, Debbie Lee Lian Ho, Wolf Reik, Shankar Srinivas, Benjamin D. Simons, Jennifer Nichols, John C. Marioni, and Berthold Göttgens. A single-cell molecular map of mouse gastrulation and early organogenesis. Nature, 566(7745):490–495, 2019.
 [13] Ji Zhu, Saharon Rosset, Trevor Hastie, and Rob Tibshirani. 1-norm support vector machines. In Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS’03, pages 49–56, Cambridge, MA, USA, 2003. MIT Press.
 [14] Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423, 2001.
 [15] Emmanuel Candès, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577, 2018.
 [16] Richard Baldock. Kaufman’s atlas of mouse development supplement: coronal images. Academic Press, Amsterdam, 2015.
 [17] Janet Rossant and Patrick P.L. Tam. New insights into early human development: Lessons for stem cell derivation and differentiation. Cell Stem Cell, 20(1):18–28, Jan 2017.
 [18] Jaime A. Rivera-Pérez and Anna-Katerina Hadjantonakis. The dynamics of morphogenesis in the early mouse embryo. Cold Spring Harbor perspectives in biology, 7(11):a015867, 06 2014.
 [19] Patrick P. L. Tam and Richard R. Behringer. Mouse gastrulation: the formation of a mammalian body plan. Mechanisms of Development, 68(1):3–25, 1997.
 [20] Caleb Weinreb, Samuel Wolock, and Allon M Klein. SPRING: a kinetic interface for visualizing high dimensional single-cell expression data. Bioinformatics, 34(7):1246–1248, 12 2017.
 [21] Dominic Grün, Anna Lyubimova, Lennart Kester, Kay Wiebrands, Onur Basak, Nobuo Sasaki, Hans Clevers, and Alexander van Oudenaarden. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature, 525(7568):251–255, Aug 2015.
 [22] Mariela D. Petkova, Gašper Tkačik, William Bialek, Eric F. Wieschaus, and Thomas Gregor. Optimal decoding of cellular identities in a genetic network. Cell, 176(4):844–855.e15, Feb 2019.
 [23] Dominic Grün and Alexander van Oudenaarden. Design and analysis of singlecell sequencing experiments. Cell, 163(4):799–810, Nov 2015.
 [24] Carla Mulas, Gloryn Chia, Kenneth Alan Jones, Andrew Christopher Hodgson, Giuliano Giuseppe Stirparo, and Jennifer Nichols. Oct4 regulates the embryonic axis and coordinates exit from pluripotency and germ layer specification in the mouse embryo. Development (Cambridge, England), 145(12):dev159103, 06 2018.
 [25] Samara L. Lewis, Poh-Lynn Khoo, R. Andrea De Young, Heidi Bildsoe, Maki Wakamiya, Richard R. Behringer, Mahua Mukhopadhyay, Heiner Westphal, and Patrick P. L. Tam. Genetic interaction of gsc and dkk1 in head morphogenesis of the mouse. Mechanisms of Development, 124(2):157–165, 2007.
 [26] Sebastian J. Arnold and Elizabeth J. Robertson. Making a commitment: cell lineage allocation and axis patterning in the early mouse embryo. Nature Reviews Molecular Cell Biology, 10(2):91–103, Jan 2009.
 [27] A. Perea-Gomez, K.A. Lawson, M. Rhinn, L. Zakin, P. Brulet, S. Mazan, and S.L. Ang. Otx2 is required for visceral endoderm movement and for the restriction of posterior signals in the epiblast of the mouse embryo. Development, 128(5):753–765, 2001.
 [28] Paulina A. Latos, Arnold R. Sienerth, Alexander Murray, Claire E. Senner, Masanaga Muto, Masahito Ikawa, David Oxley, Sarah Burge, Brian J. Cox, and Myriam Hemberger. Elf5-centered transcription factor hub controls trophoblast stem cell self-renewal and differentiation through stoichiometry-sensitive shifts in target gene networks. Genes & Development, 29(23):2435–2448, Nov 2015.
 [29] Margaret H Baron. Concise review: early embryonic erythropoiesis: not so primitive after all. Stem cells (Dayton, Ohio), 31(5):849–856, 05 2013.
 [30] Manuel Viotti, Sonja Nowotschin, and Anna-Katerina Hadjantonakis. Sox17 links gut endoderm morphogenesis and germ layer segregation. Nature Cell Biology, 16:1146–1156, 11 2014.
 [31] Marta Carapuço, Ana Nóvoa, Nicoletta Bobola, and Moisés Mallo. Hox genes specify vertebral types in the presomitic mesoderm. Genes and Development, 19(18):2116–2121, 2005.
 [32] M. Mahlapuu, M. Ormestad, S. Enerback, and P. Carlsson. The forkhead transcription factor foxf1 is required for differentiation of extraembryonic and lateral plate mesoderm. Development, 128(2):155, 01 2001.
 [33] Louise Silver and James Palis. Initiation of murine embryonic erythropoiesis: A spatial analysis. Blood, 89(4):1154–1164, 1997.
 [34] Sebastian J. Arnold, Ulf K. Hofmann, Elizabeth K. Bikoff, and Elizabeth J. Robertson. Pivotal roles for eomesodermin during axis formation, epithelium-to-mesenchyme transition and endoderm specification in the mouse. Development, 135(3):501–511, 2008.
 [35] Paul Riley, Lynn Anson-Cartwright, and James C. Cross. The hand1 bhlh transcription factor is essential for placentation and cardiac morphogenesis. Nature Genetics, 18(3):271–275, 1998.
 [36] David G. Simmons and James C. Cross. Determinants of trophoblast lineage and cell subtype specification in the mouse placenta. Developmental Biology, 284(1):12–24, 2005.
 [37] T. A. Pelton, S. Sharma, T. C. Schulz, J. Rathjen, and P. D. Rathjen. Transient pluripotent cell populations during primitive ectoderm formation: correlation of in vivo and in vitro pluripotent cell development. Journal of Cell Science, 115(2):329–339, 2002.
 [38] Eric van den Akker, Sylvie Forlani, Kallayanee Chawengsaksophak, Wim de Graaff, Felix Beck, Barbara I. Meyer, and Jacqueline Deschamps. Cdx1 and cdx2 have overlapping functions in anteroposterior patterning and posterior axis elongation. Development, 129(9):2181–2193, 2002.
 [39] Yale Jen, Katia Manova, and Robert Benezra. Each member of the id gene family exhibits a unique expression pattern in mouse gastrulation and neurogenesis. Developmental Dynamics, 208(1):92–106, 1997.
 [40] A. Leder, A. Kuo, M.M. Shen, and P. Leder. In situ hybridization reveals coexpression of embryonic and adult alpha globin genes in the earliest murine erythrocyte progenitors. Development, 116(4):1041–1049, 1992.
 [41] Paul D. Kingsley, Jeffrey Malik, Rachel L. Emerson, Timothy P. Bushnell, Kathleen E. McGrath, Laura A. Bloedorn, Michael Bulger, and James Palis. “maturational” globin switching in primary primitive erythroid cells. Blood, 107(4):1665–1672, 2006.
 [42] James Palis. Primitive and definitive erythropoiesis in mammals. Frontiers in Physiology, 5:3, 2014.
 [43] Bechara Saykali, Navrita Mathiah, Wallis Nahaboo, Marie-Lucie Racu, Latifa Hammou, Matthieu Defrance, and Isabelle Migeotte. Distinct mesoderm migration phenotypes in extraembryonic and embryonic regions of the early mouse embryo. eLife, 8, Apr 2019.
 [44] Maria-Elena Torres-Padilla, Lucy Richardson, Paulina Kolasinska, Sigolène M. Meilhac, Merlin Verena Luetke-Eversloh, and Magdalena Zernicka-Goetz. The anterior visceral endoderm of the mouse embryo is established from both preimplantation precursor cells and by de novo gene expression after implantation. Developmental Biology, 309(1):97–112, Sep 2007.
 [45] Ita Costello, Sonja Nowotschin, Xin Sun, Arne W. Mould, Anna-Katerina Hadjantonakis, Elizabeth K. Bikoff, and Elizabeth J. Robertson. Lhx1 functions together with otx2, foxa2, and ldb1 to govern anterior mesendoderm, node, and midline development. Genes & Development, 29(20):2108–2122, Oct 2015.
 [46] Dominic P. Norris, Jane Brennan, Elizabeth K. Bikoff, and Elizabeth J. Robertson. The foxh1-dependent autoregulatory enhancer controls the level of nodal signals in the mouse embryo. Development, 129(14):3455–3468, 2002.
 [47] Sabine Pfister, Kirsten A. Steiner, and Patrick P.L. Tam. Gene expression pattern and progression of embryogenesis in the immediate post-implantation period of mouse development. Gene Expression Patterns, 7(5):558–573, Apr 2007.
 [48] L. Niswander and G.R. Martin. Fgf4 expression during gastrulation, myogenesis, limb and tooth development in the mouse. Development, 114(3):755–768, 1992.
 [49] Takeshi Fujiwara, N. Ray Dunn, and Brigid L. M. Hogan. Bone morphogenetic protein 4 in the extraembryonic mesoderm is required for allantois development and the localization and survival of primordial germ cells in the mouse. Proceedings of the National Academy of Sciences, 98(24):13739–13744, 2001.
 [50] Tzu-Ping Lin, Patricia A. Labosky, Laura B. Grabel, Christine A. Kozak, Jeffrey L. Pitman, Jeanine Kleeman, and Carol L. MacLeod. The pem homeobox gene is x-linked and exclusively expressed in extraembryonic tissues during early murine development. Developmental Biology, 166(1):170–179, 1994.
 [51] Akihiko Shimono and Richard R Behringer. Isolation of novel cdnas by subtractions between the anterior mesendoderm of single mouse gastrula stage embryos. Developmental Biology, 209(2):369–380, May 1999.
 [52] Gloria S. Kwon, Stuart T. Fraser, Guy S. Eakin, Michael Mangano, Joan Isern, Kenneth E. Sahr, Anna-Katerina Hadjantonakis, and Margaret H. Baron. Tg(afp-gfp) expression marks primitive and definitive endoderm lineages during mouse development. Developmental Dynamics, 235(9):2549–2558, 2006.
 [53] Frederic Koch, Manuela Scholze, Lars Wittler, Dennis Schifferl, Smita Sudheer, Phillip Grote, Bernd Timmermann, Karol Macura, and Bernhard G. Herrmann. Antagonistic activities of sox2 and brachyury control the fate choice of neuromesodermal progenitors. Developmental Cell, 42(5):514–526.e7, Sep 2017.
 [54] D.E. Smith, F. Franco del Amo, and T. Gridley. Isolation of sna, a mouse gene homologous to the drosophila genes snail and escargot: its expression pattern suggests multiple roles during postimplantation development. Development, 116(4):1033–1039, 1992.
 [55] W. Shawlot, M. Wakamiya, K.M. Kwan, A. Kania, T.M. Jessell, and R.R. Behringer. Lim1 is required in both primitive streak-derived tissues and visceral endoderm for head formation in the mouse. Development, 126(22):4925–4932, 1999.
 [56] Kimberly D. Tremblay, N. Ray Dunn, and Elizabeth J. Robertson. Mouse embryos lacking smad1 signals display defects in extraembryonic tissues and germ cell formation. Development, 128(18):3609–3621, 2001.
 [57] Katrina M. Catron, Hongyu Wang, Gezhi Hu, Michael M. Shen, and Cory Abate-Shen. Comparison of msx1 and msx2 suggests a molecular basis for functional redundancy. Mechanisms of Development, 55(2):185–199, Apr 1996.
 [58] Claire S. Simon, Lu Zhang, Tao Wu, Weibin Cai, Nestor Saiz, Sonja Nowotschin, Chen-Leng Cai, and Anna-Katerina Hadjantonakis. A gata4 nuclear gfp transcriptional reporter to study endoderm and cardiac development in the mouse. Biology Open, 7(12):bio036517, Dec 2018.
 [59] F. Beck, T. Erler, A. Russell, and R. James. Expression of cdx2 in the mouse embryo and placenta: Possible role in patterning of the extraembryonic membranes. Developmental Dynamics, 204(3):219–227, Nov 1995.
 [60] Stephen Frankenberg, Lee Smith, Andy Greenfield, and Magdalena Zernicka-Goetz. Novel gene expression patterns along the proximodistal axis of the mouse embryo before gastrulation. BMC Developmental Biology, 7(1):8, 2007.
 [61] Martyn Donnison, Ric Broadhurst, and Peter L. Pfeffer. Elf5 and ets2 maintain the mouse extraembryonic ectoderm in a dosage dependent synergistic manner. Developmental Biology, 397(1):77 – 88, 2015.
 [62] Ingo Burtscher and Heiko Lickert. Foxa2 regulates polarity and epithelialization in the endoderm germ layer of the mouse embryo. Development, 136(6):1029–1038, 2009.
 [63] Ali S. Hassan, Juan Hou, Wei Wei, and Pamela A. Hoodless. Expression of two novel transcripts in the mouse definitive endoderm. Gene Expression Patterns, 10(2-3):127–134, Feb 2010.
 [64] Jérome Chal, Ziad Al Tanoury, Masayuki Oginuma, Philippe Moncuquet, Bénédicte Gobert, Ayako Miyanari, Olivier Tassy, Getzabel Guevara, Alexis Hubaud, Agata Bera, Olga Sumara, Jean-Marie Garnier, Leif Kennedy, Marie Knockaert, Barbara Gayraud-Morel, Shahragim Tajbakhsh, and Olivier Pourquié. Recapitulating early development of mouse musculoskeletal precursors of the paraxial mesoderm in vitro. Development, 145(6), 2018.
 [65] P.Q. Thomas, A. Brown, and R.S. Beddington. Hex: a homeobox gene revealing peri-implantation asymmetry in the mouse embryo and an early transient marker of endothelial cell precursors. Development, 125(1):85–94, 1998.
 [66] A. A. Avilion. Multipotent cell lineages in early mouse development depend on sox2 function. Genes & Development, 17(1):126–140, Jan 2003.
 [67] Daisuke Kurokawa, Nobuyoshi Takasaki, Hiroshi Kiyonari, Rika Nakayama, Chiharu Kimura-Yoshida, Isao Matsuo, and Shinichi Aizawa. Regulation of otx2 expression and its functions in mouse epiblast and anterior neuroectoderm. Development, 131(14):3307–3317, 2004.
 [68] Smita Chormunge and Sudarson Jena. Correlation based feature selection with clustering for high dimensional data. Journal of Electrical Systems and Information Technology, 5(3):542 – 549, 2018.
 [69] Andrew Ng and Michael Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. Advances in Neural Information Processing Systems, 2, 2002.
 [70] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.
 [71] G. J. McLachlan and D. Peel. Finite mixture models. Wiley Series in Probability and Statistics, New York, 2000.
 [72] Chris Fraley and Adrian E. Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611–631, 2002.
 [73] Wei Pan and Xiaotong Shen. Penalized model-based clustering with application to variable selection. J. Mach. Learn. Res., 8:1145–1164, May 2007.
 [74] Sijian Wang and Ji Zhu. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics, 64(2):440–448, Jun 2008.
 [75] Benhuai Xie, Wei Pan, and Xiaotong Shen. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics, 2(0):168–212, 2008.
 [76] Wei Sun, Junhui Wang, and Yixin Fang. Regularized k-means clustering of high-dimensional data and its asymptotic consistency. Electronic Journal of Statistics, 6:148–167, 2 2012.