One of the first and most important tasks in the analysis of single-cell data is cell type annotation, where individual cells are categorized into one of many known cell type categories having well-characterized biological functions. The vast majority of studies perform annotation by first clustering cells based on their gene expression and then manually labeling the clusters based on upregulated marker genes within each cluster (tabula2018single). This is often time-intensive and arguably subjective, as the set of cell type labels used is inconsistent across studies: they vary based on scientific interests of the investigators, aims of the study, and availability of external data. In turn, a large number of automated methods have been developed to standardize the cell type annotation process, for example, see Table 1 of PASQUINI2021961 and references therein.
The vast majority of the existing approaches for automated cell type annotation fit a classification model using a single training dataset (e.g., a dataset collected and annotated by a single investigator/lab), treating normalized gene expression as predictors. Cell types in a new (unannotated) dataset are then predicted according to the fitted model. In abdelaal2019comparison, more than 20 such methods were benchmarked and shown to perform well in a variety of settings. However, these methods tended to perform poorly in terms of prediction across datasets (varying by batch, lab, or protocols) and in datasets with a large number of labels (i.e., high resolution cell type categories) (abdelaal2019comparison). Furthermore, a crucial choice for these methods is deciding which dataset should be used to train the model. Datasets can differ in numerous ways, but most relevant to the task we consider: they can have drastically different cell type labels and differ in the amount of detail provided by each label across datasets (ma2021evaluation). The existing annotation approaches are also limited to single training datasets or multiple datasets with consistent cell type labels. Here, we propose a novel approach for automated annotation that overcomes these limitations.
We begin by depicting the situation of differing degrees of resolution in labels used to annotate different datasets in Figure 1. In this hypothetical situation, one has access to three datasets, Datasets 1, 2, and 3, each of which has been expertly annotated manually. In Dataset 1, cells are labeled as either CD4+ or CD8+. If one trained a model using only Dataset 1, the only possible predicted labels for a new dataset would be CD4+ or CD8+. In Dataset 2, the cells are labeled as one of naive CD4+, effector memory CD4+, central memory CD4+, or CD8+, so if one instead trained the model using Dataset 2, it would be possible to predict/annotate the subcategories of CD4+ T-cells with finer resolution labels when compared to Dataset 1. Dataset 3 has finer resolution labels for subcategories of CD8+ cells than Dataset 2, but does not distinguish between the two finer CD4+ memory cell types like Dataset 2. Thus, using a single dataset to train annotation models presents a trade-off between fine resolution labels for subcategories of CD4+ and subcategories of CD8+ cells.
If one wanted to incorporate all datasets into the framework of existing annotation methods, the level of detail in the annotations of Dataset 2 and Dataset 3 must be reduced by labeling cells of all three datasets as one of CD4+ or CD8+. However, this results in a significant loss of information and may limit downstream scientific applications. Alternatively, one could mix-and-match subsets of cells from different datasets which have the most detail for specific cell types. In this example, it would mean taking the subcategories of CD4+ cells from Dataset 2 and subcategories of CD8+ cells from Dataset 3 and ignoring Dataset 1. As such, this approach would be less efficient than one which uses all available data, and moreover, will generalize poorly since technical differences across datasets (i.e., “batch effects”) may be confounded with some cell type categories. Despite the existence of hundreds of publicly available datasets with expertly annotated cell types, existing methods are limited in their ability to integrate a wide-array of datasets due to varying label resolution.
Ideally, we would like to use all the data from all three datasets to train an annotation model without any loss of information. To do so, our proposed approach takes advantage of the “binned” label structures (Figure 1d). For example, cells with the label CD4+ in Dataset 1, biologically, must belong to one of the following finest resolution categories: naive CD4+, effector memory CD4+, or central memory CD4+ The specific label, however, is unknown without additional analysis or manual annotation. In this article, we propose a new classification method which will allow investigators to (i) use all available datasets jointly to train a unified classification model without loss of information, and (ii) make cell type predictions/annotations at the finest resolution labels allowed by the union of all datasets’ labels. For example, given the datasets depicted in Figure 1, our method would fit a model using data from all cells from all three datasets, and would yield predicted probabilities of each cell belonging to the categories: naive CD4+, effector memory CD4+, central memory CD4+, naive CD8+, effector memory CD8+, or central memory CD8+ (i.e., the categories at the terminal nodes of the tree). Notably, our method does not require that labels are tree-structured as in this example. We require only that labels are amenable to “binning”, which we describe in Section 2.1.
1.2 Existing approaches
The issue of varying labels across datasets has been recognized in the recent single-cell literature. For example, shasha2021superscan
manually reannotated publicly available datasets which collected both single-cell gene expression and protein expression data, and fit a cell type classification model across all datasets using reannotated labels with extreme gradient boosting (XGBoost). To reannotate the data, they cleverly employed methods from the field of flow cytometry to “gate” cells based on protein expression using a series of bivariate protein expression plots and manually drawing shapes around groups of cells. This reannotation process, however, is very time-intensive and requires concurrent protein expression in cells. Even with this detailed approach, differences in protein measurements across datasets limited their ability to achieve consistently fine annotations across all datasets. Similarly,conde2021cross employed a two-step reannotation process. First, with expert input, they attempted to reconcile and rename labels across datasets to achieve a consistent set of labels. Second, they fit a ridge-penalized multinomial logistic regression model on datasets for which they successfully renamed labels for, and used this model to predict the labels for the remaining unresolved datasets. Cells were clustered in each remaining dataset based on gene expression, and each cluster was labeled on a majority vote of the predictions for cells in that cluster. The predicted cluster labels were then treated as true labels for these datasets, and the model was refit using all of the datasets. This approach motivates a two-step approximation to our proposed method, which we term relabel (see Section 5.2) and compare to throughout this paper.
|Dataset||# of labels||Reference(s)|
1.3 Motivating application
Our motivation for this work was to build a new and generalizable model for high-resolution cell type annotations for peripheral blood mononuclear cell (PBMC) samples by combining many publicly available datasets. We collected and processed a total of ten datasets sequenced using 10x Genomics technology, each with raw gene expression counts and curated cell type annotations available for each cell. We chose to work with PBMC data due to the complexity and hierarchy of immune cell types, as well as the common application of single-cell sequencing of PBMCs in clinical studies (su2020multi; haniffastephenson2021single; blishwilk2020single). Each of the ten datasets have labels at different resolutions, and although labels do not follow a tree-structure across datasets, they are amenable to binning. The number of distinct labels in each dataset, as well as references for the dataset, are shown in Table 1. The specific labels for each dataset are in Table 3 of the Supplementary Material. We display the relationships between labels represented in each of these datasets in Figure 2 as graphical representations of “binning functions,” which are further described in Section 2.1. The datasets we use are available through the R package AnnotatedPBMC at https://github.com/keshav-motwani/AnnotatedPBMC/, where we also provide an interface to our fitted model for predicting cell types from new single-cell gene expression data.
Suppose we observe datasets with single-cell gene expression profiles and cell types manually annotated. Let denote the set of labels used to annotate the th dataset for and let denote the set of labels at the desired finest resolution across all datasets. Let and
be the random variables corresponding to the annotated cell type and true (according to the finest resolution label set) cell type of theth cell in the th dataset for , , with supports and , respectively. For the remainder, let denote the cardinality of a set . Let be the observed gene expression matrix, and
be a vector of cell type annotations for theth dataset where is the observed realization of the random variable . Similarly, let for be the unobservable gene expression matrix which is free of batch effects. Our goal is to estimate probabilities for and any .
2.1 Binned categorical responses
As described earlier, each dataset may have a different degree of resolution in their cell type annotations. Again taking an informal example, we may have two datasets with observed cell type labels in for the first dataset and for the second dataset, with being the set of finest categories at which resolution we want to make predictions. We refer to the labels and as “coarse labels” since groups of cells with these labels can each be partitioned into finer, more detailed categories (cells with label can be further divided into categories or and cells with label can be further divided into categories , , or ), and refer to each of as “fine labels” since they cannot be divided any further into more detailed categories. We refer to data observed at the level of a coarse label as a binned observation, because labels from finer categories are binned into one coarser label. For example, cells that are truly of cell type and are both binned into a label called in the first dataset. We will now make these ideas and definitions more formal by setting up some additional notation.
Define the user-specified binning function which maps a finest resolution category to the label used to describe that category in the th dataset. For example, for dataset 1 above. This function bins fine categories together into the possibly coarser resolution labels which are used in annotating the data, hence the name. Also, define the “unbinning” function (inverse image) where for . This provides the set of fine categories to which a cell labeled at a coarser resolution category may be further categorized as. For fine categories that are truly not represented in a given dataset, can map from these categories to another label (named “unobserved” for example). While and the binning functions are user-specified, they must satisfy the condition that for all , there must exist and such that with . In other words, each of the finest resolution categories must actually be observed at least once in at least one of the training datasets.
Using this notation, we can now formally define to be a “coarse label” if (i.e., the label can be broken up into multiple finer resolution categories) and a “fine label” if (i.e., the label cannot be further partitioned). We also now define the relationship between and through the following equivalence of events
That is, a cell can be categorized within one of the finest resolution categories in the bin corresponding to the observed label, with the correspondence defined by . We thus have that
since the events are mutually exclusive as a cell can only be of one cell type.
2.2 Binned multinomial regression model
As mentioned, we are interested in modeling cell type probabilities as a function of gene expression. For now, we consider a model using unobserved gene expression , which is free of batch effects, and will extend this in the next section to the observed gene expression. Without loss of generality, we encode the sets of labels numerically so that and for We assume that each follows a categorical distribution (i.e., multinomial based on a single trial)
In addition, we assume that the probability functions adhere to the standard multinomial logistic regression link so that
where is an unknown vector of intercepts and is an unknown matrix of regression coefficients. Applying exactly the logic from (1), it follows that
Thus, our focus is the development of a method for estimating and . However, we first extend the model to account for potential batch effects in the observed gene expression.
2.3 Adjustment for batch effects
The gene expression can be assumed to be “noisy” in the sense that they may be measured with some batch effects specific to each of the datasets. For example, it may be reasonable to assume that where is the the unobserveable gene expression and is some noise. This additive assumption of batch effects is consistent with the existing literature on data integration for normalized gene expression data in single-cell datasets, which provide methods for estimating the (haghverdi2018batch; hao2020integrated). However, estimating the per-gene batch effect is not necessary for classification: we need only estimate a linear combination of this batch effect, as we now describe.
We can write the linear predictor for the th cell of the th dataset as Because the are not observable, we assume that there are some common sources of batch variation which are related to some cell-specific covariates , and that is some linear combination of these cell specific covariates for , , and coefficients . It follows that the linear predictor for the th cell in the th dataset is where , , and the are unknown. Letting (since both are unknown), we can see that Thus, we can write
In the simplest case, (i.e., provides an intercept adjustment), which implies a batch-specific shift in expression that is constant for all cells in the batch. Alternatively, can also contain the principal components of to capture interactions of batch with other directions of variation in the data. It is worth emphasizing that here, we have both batch specific coefficients to estimate, for , and coefficients shared across batches, ( With this, our goal will be to estimate , , and via penalized maximum likelihood based on the observed predictors for and
3.1 Penalized maximum likelihood estimator
From the probability functions described in Section 2.3, we see that the log-likelihood contribution for the th cell in the th dataset can be expressed
for and , where denotes the indicator function. We can therefore define the (scaled by ) negative log-likelihood as
where is the total sample size and . We thus estimate and , which are the shared across datasets, and for datasets jointly using penalized maximum likelihood. For ease of display, let be the space of the unknown parameters Formally, the estimator of we propose is
where denotes the th row of for , denotes the Euclidean norm of a vector, denotes the Frobenius norm of a matrix, and are user-specified tuning parameters. We now motivate the choice of penalties based on our application.
Manual single-cell annotation is often performed through the identification of upregulated genes within clusters of cells (amezquita2020orchestrating). For example, to label a cluster of cells as type CD4+ naive, an annotater often identifies a number of particular genes that are overexpressed in that cluster relative to the rest of the cells (wolf2018scanpy; hao2020integrated). This implies that a relatively small number of genes are necessary to characterize the relationship between cell type probabilities and gene expression. For this reason we use the group lasso type penalty on the rows of the optimization variable (yuan2006model; obozinski2011support; simon2013sparse). For large values of , this penalty will encourage estimates of which will have rows either entirely equal to zero or entirely nonzero. If the -th row of is zero, the -th gene is irrelevant for discriminating between cell types. The (vector)-norm penalty (i.e., the lasso penalty), in contrast, would not lead to easily interpreted variable selection since a zero in a particular entry of does not alone imply anything about whether the corresponding predictor affects the probabilities.
Regarding the ridge penalty on the : because the are specific to each of the training sets, we do not have corresponding coefficients for a test data point from a new (i.e., unobserved for training) dataset. Additionally, we expect that the batch effect does not contain information relevant to cell type classification. Therefore, we intuitively want to be close to the origin, so that on a test data point, we can simply use our estimates and from (4) to estimate probabilities with
as if . To encourage estimates of the to be small, we add a penalty of the squared Frobenius norm of each
Importantly, the coefficients we intend to estimate are not, in general, identifiable. This is because with for any , for any , , and for . However, if we impose the “sum-to-zero” condition that and similarly for the rows of the , then this issue may be resolved. It is perhaps surprising that the could be identifiable since may be distinct from , but one can see that replacing with will, in general, lead to distinct probabilities (3) unless . In the Supplementary Material, we discuss the (exceptionally rare) situations where this is not true. Fortunately, both our penalties naturally enforce the sum-to-zero constraints on and the . For example, see the Supplementary Material of molstad2021likelihood for a proof of this fact.
3.2 Related methods
The approach proposed here is closely related to a growing literature on methods for integrative analyses. We discuss this literature from two perspectives: that of statistical methodology and that of the analysis of multiple single-cell datasets jointly.
From a methodological perspective, there is a growing interest in developing methods for jointly analyzing datasets from heterogeneous sources. Most often, these methods assume distinct data generating models for each source and aim improve efficiency by exploiting similarities across sources (zhao2015integrative; huang2017promoting; ventz2021integration; molstad2021dimension). For example, huang2017promoting assumed a similar sparsity pattern for regression coefficients corresponding to separate populations. Similarly, molstad2021dimension assumed a shared low-dimensional linear combination of predictors explained the outcome in all sources. The focus of our work is different: the sources from which the data were collected are assumed to differ only in their response category label resolution (and, to a lesser degree, may measure predictors with batch effects). Thus, these approaches are, generally speaking, not directly applicable to our setting.
In the context of single-cell data analysis, integrative analyses often focus on the “alignment” of expression datasets in an attempt to remove batch effects for the purposes of clustering and visualization (haghverdi2018batch; hie2019efficient; korsunsky2019fast; hao2020integrated). As mentioned in the previous section, explicit estimation and removal of batch effects is not necessary for the goal of cell type prediction. In fact, ma2021evaluation found that removing batch effects through alignment-based methods actually decreased downstream cell type prediction accuracy. Our inclusion of batch specific effects in (3) can, loosely speaking, be thought of as performing alignment specifically tailored to prediction (assuming the are chosen appropriately).
In order to compute our proposed estimator, we must address that the group lasso penalty is nondifferentiable at zero and that the overall negative log-likelihood is nonconvex in general. In brief, we employ a blockwise proximal gradient descent scheme (xu2017globally) to overcome these challenges. Specifically, we obtain a new iterate by minimizing a penalized quadratic approximation to at the current iterate, which will ensure – by the majorize-minimize principle (lange2016mm) – a monotonically decreasing objective function value. Our approximations are chosen so as to admit simple, closed form updates for each block. In the remainder of this section, we motivate and derive each block update and summarize our algorithm. Code implementing the algorithm described here is available for download at https://github.com/keshav-motwani/IBMR/.
Let denote the objective function from (4). By construction, denotes the negative log-likelihood To describe our iterative procedure, we focus on the update for , but as we will show, this approach also applies to and the with minor modification. First, notice that given -th iterates of and , , by the Lipschitz continuity of the gradient of with respect to , we know that for any step size such that ,
for all , where denotes the gradient of . Letting denote the right-hand side of the above inequality, we can see that
for all with equality when . If we thus define as the argument minimizing , we are ensured that Hence, defining in this way, we have
where . The second equality above implies that is simply the proximal operator (parikh2014proximal; polson2015proximal) of the -norm (sum of Euclidean norms of the rows of its matrix-valued argument) at . Some straightforward derivations (e.g., see simon2013sparse) reveal that the th row of , can thus be obtained in closed form
We apply analogous arguments to update both with fixed and with fixed. For , yields a standard gradient descent update, whereas for the , each can be updated in parallel. Specifically, by the same motivation as in the update for , we define
With these updating expressions for and in hand, we formally state our iterative procedure for minimizing in Algorithm 1. Applying an identical series of arguments as those to prove that yields a decrement of the objective function, we have the following lemma regarding the sequence of iterates .
(Descent property) As long as each step size is sufficiently small and fixed or chosen by backtracking line search (see the Supplementary Material), the sequence of iterates is guaranteed to satisfy for i.e., Algorithm 1 has the descent property.
Initialize , , and for . Set .
For in parallel, compute
with chosen by backtracking line search.
For in parallel, compute
with the chosen by backtracking line search.
Compute with chosen by backtracking line search.
If objective function value has not converged, set and return to 1.
In the Supplementary Material, we derive explicit forms of the partial derivatives needed in Algorithm 1. Because they provide some insight, we discuss them here. For each , let be a matrix-valued function which maps input parameters to a matrix of (unconditional) probabilities. Specifically, has -th entry
Similarly, let be a matrix-valued function of conditional probabilities where
Intuitively, is the estimated probability that cell from dataset is of type . The conditional probability is the estimated probability that cell from dataset is of type given is the observed (possibly coarse) label. Of course, if is a singleton, then .
The gradients needed in Algorithm 1 can be expressed in terms of and In particular,
Examining the form of these gradients, loosely speaking, we see our algorithm descends in the direction determined the correlation between the predictors and the difference between the unconditional and conditional estimated probabilities. The functions and are also used later when we apply our method to the motivating data analysis.
In the Supplementary Material, we detail how we construct a set of candidate tuning parameters yielding sparse fitted models. In brief, we use the KKT condition for (4) to find a yielding and borrow an approach from glmnet for determining a reasonable set of values for .
5 Simulation studies
We performed extensive numerical experiments to study how the sample size, number of predictors, similarity of categories, and the magnitude of batch effects affect the performance of various methods for estimating finest resolution cell type probabilities.
5.1 Data generating models
For each replication, we generated a total of datasets: six datasets with sample size for fitting the model, six datasets with sample size for validation, and one dataset with sample size for evaluating performance. We considered to reflect the large number of cells available in real datasets. We set the number of finest resolution categories to be fixed at () and the binning functions fixed to have a structure inspired by the real data as shown by Figure 2. Specifically, in the real data, most cell types are observed at a coarse resolution in most datasets and at finest resolution in only a few datasets. Therefore, we chose to bin categories , and into groups of two for Datasets 1–4. That is categories and are binned together, and are binned together, and so on. However, we set it so that these categories would be observed at the finest resolution in Datasets 5 and 6. Also, in the real data, some cell types are labeled at the finest resolution in all datasets (for example, CD14+ Monocytes and CD16+ Monocytes in Figure 2). Hence, we chose categories and to be observed at the finest resolution in all datasets. A graphical representation of these binning functions is shown in Figure B.1 of the Supplementary Material. The validation datasets, Datasets 7–12, are generated in the same way as Datasets 1–6. For the test dataset, all observations are observed at the finest resolution in order to fully evaluate parameter estimation.
In manual single-cell annotation, cell types are binned together due to their similar gene expression. We reflected this to varying extents in the structure of , where we consider . We first randomly select of the rows to be nonzero in . Of these rows, we select many rows for which their coefficients are identical within the coarse groups described above, i.e. for these rows, the coefficients for category and are identical, coefficients for category and are identical, and so on. For the remaining nonzero rows of , the coefficients for all categories are unrelated. We sample each of the nonzero distinct elements from a Normal distribution. This structure to controls the similarity of fine cell types within a coarse label. With , even though two categories may be binned together, they are unrelated and there is no true hierarchy of cell types. With larger , fine categories within a coarse label are increasingly related, meaning there is true hierarchy to the cell type categories and cells are binned according to this hierarchy. We consider .
Finally, to simulate the effect of batch effects in the predictors, we generated where . Each row of is independently simulated from a -dimensional multivariate normal distribution with mean and AR(1) covariance matrix with lag . We consider a simple model for the batch effect itself, in which the batch effect is identical for every observation within a batch. This may also be reasonable in the real data, as the presence of background contamination, also known as ambient RNA, is a common source of batch effects, and it may affect all cells within the experiment similarly (after normalization) (young2020soupx). Therefore, we generate as a realization from a -dimensional mean zero multivariate normal distribution with covariance and set , where is a scalar chosen to control . We consider . The test dataset is observed with no batch effect, again in order to best evaluate parameter estimation.
5.2 Competing methods
We first consider two variants of our method, IBMR-int and IBMR-NG. For IBMR-int, we set for all , , and fit the proposed model using (4). For IBMR-NG, we set for all , where “NG” stands for “no Gamma”, and estimate only and using (4). This is a version of our method which ignores possible batches entirely.
We also consider two alternative methods, subset and relabel. For subset, we “mix-and-match” data from different datasets by subsetting each dataset to only the data that is annotated at the finest resolution and fit a model based on the stacked data. Specifically, define for , the set of indices in the th dataset for which the outcome was observed at the finest resolution: Then, we fit a group lasso-penalized multinomial logistic regression model using (4), but with replaced with for and , replaced with for , and replaced with . However, because of potential confounding, we do not consider a batch effect (i.e., require ). The model can thus be fit using existing software (e.g., glmnet), but since the objective function is identical to our method when using only subsetted data, we use our implementation for consistency in the algorithm and convergence criterion.
For the other method, relabel, we first obtain estimates of using subset, denoted . Using these estimates, we can “relabel” our training data to have outcomes at the finest resolution by choosing the category with the highest conditional probability (as defined in (7)) We then fit the multinomial logistic regression model to , treating these as the observed labels. To be clear, all the training responses are (synthetically) at the finest resolution, so one fits (4) but each is replaced with .
Finally, we also consider oracle (ORC) versions of these methods, in which data at the finest resolution for all datasets is available. IBMR-int-ORC is the same as IBMR-int, with coarse resolution data replaced by the (otherwise unobserved) fine resolution data. By definition of IBMR-NG, subset, and relabel, when all the data is at the finest resolution, the estimators are equivalent to the standard group lasso penalized multinomial logistic regression model. Therefore, we name the oracle version of these estimators GL-ORC, where “GL” stands for “group lasso.”
We present the complete simulation study results in Figure 3.
In the first column of Figure 3, we present results with the total sample size