In recent years the availability of rich biological datasets is challenging the flexibility and robustness of statistical techniques. In fact, while when the size of the sample is moderate it is customary and accepted to make quite strong assumptions on the underlying distributions, in the contest of big data this could often lead to obvious distortions and inconsistencies. This can be relevant in particular in the case of “omics” data (proteomics, genomics or transcriptomics), of which single-cell RNA sequencing (scRNA-seq) is a very recent and exceptionally difficult example [1, 2, 3].
Single-cell RNA-seq data are large matrices with genes in the rows and single cells in the columns, with integer read counts in each component. The vast majority (80% and more) of the read counts are zero and an even larger fraction (95% and more) of the genes has an average of less than 1 read count per cell. Nonetheless it is believed that around 20% of genes are typically active and functional in a cell, and many of these are transcription factors, whose subtle modulation controls the functions and state of the cell.
Given this preamble it is not surprising that the analysis of scRNA-seq is a very important topic and that a general robust approach is yet to be found. In this paper we present a new promising mathematical framework, and propose suitable parameter estimators and statistical inference for gene co-expression.
and references therein). There have been many attempts to normalize and variance-stabilize these quantities, but it remains a difficult problem (see for example[2, 6, 7]).
Our probabilistic model, on the other hand, belongs to the part of the literature that deals with the raw integer read counts (see among the others [8, 9, 10, 11, 6, 12] and references therein). Our model is in particular similar to the one introduced by BASiCS , but we use it in a non-Bayesian contest and we do not model spike-ins.
For each cell and gene , we model the number of read counts
with conditional Poisson distribution
a deterministic extraction efficiency parameter which modulates the expression of all genes for that cell, and
a random potential biological expression level for each gene.
For the first part of the paper, when estimating and , we make no assumptions on the distribution of
In the second part, we need to estimate the probability of zero read counts , and to this end we make the further assumption that this probability can be approximated by assuming that is gamma with mean , and variance that can be fitted on the total number of zero read counts for that gene. Equivalently, is considered of negative binomial distribution with mean and dispersion .
We remark that this assumption does not concern the whole distribution of the read counts, but only the probability of zero, which then takes the form typical of the negative binomial,
Often in the literature the frequency of zero read counts has been considered not completely explained, when using the most natural statistical models, and the concept of dropout has been introduced [9, 13, 14, 15]. Recently there have been criticism on this subject  and it is unclear if the need of zero-inflated distributions is really a technical issue, or in fact it is an artifact due to the use of log-transformations, or limited to the case of non-UMI datasets .
In our model, zero read counts are considered effects of biological variability and random extraction, and in fact the inference itself is based precisely on the occurrence of these events.
After the model is introduced in Section 2, the remainder of the paper is organized as follows.
In Section 3
we propose two fast methods to get estimates of the relevant parameters, both based on moment estimation, and discuss their validity. Maximum likelihood estimation is in good accordance with our methods, but it requires more resources and has to assume the class of(typically gamma); this is a choice that we defer until the inference in subsequent sections.
In Section 4 we introduce a way to estimate the probability of zero read counts, using the estimated parameters and making some assumptions on
, in particular that it can be approximated by a gamma distribution. A natural way to estimate its second parameter, is to fit the total number of cells with zero read counts for gene.
In Section 5
we build co-expression tables, which are similar to contingency tables, but count the number of cells in which two genes have been found expressed together. It is shown that these cannot be analysed like classical contingency tables, because the different efficiency of the cells would cause spurious correlations. Nevertheless the estimates built on the previous sections allow to design a statistical test for independence and a co-expression index. Extensions to differential expression analysis and to a global differentiation index are discussed.
In Section 6 we report the results of the numerical simulations with synthetic datasets, used to evaluate the estimators, the distribution of the statistics, and the false positive rate of the tests.
A twin paper with a computational-biology point of view (currently in the final stages of processing), deals with the application of this framework to real biological datasets and includes the software implementation of all the tools.
Single-cell RNA-seq data analysis is generally performed on a huge matrix of counts , where and are the sets of genes and cells respectively. Typical sizes are and –. The read counts are non-negative integers, with many zeros.
Usually for bulk RNA-seq, where there is no information at single cell level, the counts are modeled with the gamma-Poisson mixture (also known as negative binomial distribution), which is quite suited to the need, as it is supported on the non-negative integers and has two real parameters that ensure a good flexibility (see [18, 8] among the others).
From a physical point of view, this can be interpreted as a model in which the total amount of RNA molecules of gene
is approximated by a gamma random variablewith parameters depending on , and the number of reads has then Poisson conditional distribution with a small efficiency .
One of the challenges of single-cell RNA-seq is that one should consider a different efficiency for each cell , and that a single gamma distribution may not be able to account for two or more cell conditions or types inside the experiment’s population.
To reduce technical noise, which in our model is not accounted for, we make the assumption of dealing with a post-quality-control scRNA-seq dataset with UMI111Unique Molecular Identifiers are molecular labels that nearly eliminate amplification noise . counts as input.
Given these assumptions, we will model the counts as random variables with Poisson conditional distribution
and the real number of molecules with some unknown distribution.
Since and are everywhere multiplied together, they can only be known up to a multiplicative constant. Without loss of generality, we will assume throughout this paper that this constant is fixed in such a way that
hence will be rescaled accordingly, and it will not represent the real number of molecules, but just some typical value for the counts.
We will suppose that, for , the columns
are i.i.d. random vectors with distributionon , and that has expectation and covariance matrix so that,
In Section 3 we will show how to estimate the parameters and . The biological information on the differentiation of the cells in the sample, is instead encoded inside and will be the subject of the subsequent sections.
3 Parameter estimation
A direct computation shows that
The quantity represents the expected read count number, and takes into account the efficiency of cell and the average expression level of gene .
The formula for the variance can be obtained similarly, but it depends on one additional parameter and will not be used much, but we give it for completeness and reference,
Notice that the non-homogeneous dependence on explains quite well the fact that no scaling factor can be used to normalize data so that the variance is stabilized .
In the remainder of this section we develop two fast methods to estimate for all genes and cells . The first one is simple and straightforward, but may sometimes be affected by few genes with high level of expression and large biological variability. The second one is based on a variance stabilizing transformation that, to our knowledge, is used here for the first time for scRNA-seq data analysis. It shows some small bias but should be more stable with respect to random variations in the most expressed genes.
Both these methods are based on moment estimation and do not assume anything about the distribution . Maximum likelihood estimation on the other hand may be preferred when the distribution of can be safely assumed to be gamma. For example this is the case of a single cluster of cells of similar expression, and we used this approach in Section 6 to estimate parameters to generate synthetic datasets. We do not delve into this matter here.
Even though we give some provable statements to establish good properties of our estimators, it is quite difficult to assess their precision and accuracy. Section 6 explains how we generated several realistic synthetic datasets and used them to gauge the estimators. Figure 2 shows the results.
3.1 Average estimation
The most natural way to estimate the parameters is the following. Define the rows, columns and global averages by
The average estimators of the parameters, marked with the “hat” symbol, are given by
The average estimator of is unbiased. Moreover and .
On some real biological datasets this appears to be a poor way to estimate the unknown parameters. In particular there is evidence that may be too sensible to the few genes that have both high reads and large biological variability between cells. Since we plan to use estimates of to normalize the dataset, this would be a source of spurious correlations, and difficult to deal with.
3.1.1 Problems of average estimation
A mathematical explaination of the occasional weakness of these estimators could be the following.
Suppose we are looking for weights such that a linear combination of the counts is a good estimator of . Notice that is such an estimator, and it is characterized by having uniform weights , whose value is fixed by the additional constraint that ,
With this insight, let us consider under the somewhat simpler constraint . Then , so
is an unbiased estimator offor all choices of the weights. Since there is no independence between counts of different genes, the variance is more complicated and must be computed with conditional expectations,
If the term was not present, the variance of would have been minimized, under the constraint, by choosing , as a direct computation with Lagrange multipliers shows. The presence of this term, on the other hand, hints that the weights should be smaller for genes with large biological variability. Unfortunately it is very difficult to estimate it, as it depends on the whole covariance matrix , which is what actually holds the biological information on the differentiation of the cells in the experiment’s population.
Apart for this sub-optimality of the constant weights in terms of total variance, a second problem (which may even be more serious) is that even with optimal weights, the estimator would correlate in particular with high variance genes, while one of our targets is to have it as much uncorrelated as possible to single genes.
3.2 Square root estimation
To get estimates that may be more robust in the cases where average estimators are not, we recall that the square root of a Poisson random variable of mean has variance which depends weakly on , in particular, as . This useful property is at the base of a classical variance-stabilizing transformation that is expected to improve the robustness of averages at the cost of adding a small additional bias.
Let us introduce the square root counts and their rows and columns averages,
We will need also the corresponding sample variances
Then we introduce our main estimators, whose properties will be analyzed in the remarks and proposition below.
The square-root estimators of the parameters, marked with the “check” symbol, are given by
and is the inverse of the function defined by
The main term of the formula for is and for large , we have , so plus some correction terms, and analogously for . (See Figure 1 below.)
To see that , let be and consider ; then and , so that
where . This also implies that .
We stress that square (or square root) and average do not commute, and in particular by Jensen inequality we get,
Hence, as is an unbiased estimator, in itself would be a poor estimator of , with systematic negative bias. The square root estimator is a second order correction of the above approach.
The following proposition gives a non-rigorous argument to show that the terms in the square root estimators are the right ones to get the smallest bias.
The statistics , and estimate , and with a small bias depending on the unknown distribution of and an additional error of order .
The first step is to approximate with
in fact and by independence the error is of the order ,
Then we write and then approximate with a Taylor expansion to the second order,
If we substitute and and also average the whole expression for , then the linear term disappears:
with an error of order proportional to the third moment (skewness) of. (We remark moreover that for all .)
To use the above formulas, we will approximate with and with , adjusted appropriately.
To do so, let denote the vector of i.i.d. random variables and consider the following conditional expectations,
Hence we get immediately that approximates with an error of order ,
Finally we need to approximate . We start from :
Averaging for (with denominator ) yields,
and hence we do the following approximation, with an error of order .
The last term cannot be consistently estimated, and neither can one use Bayesian estimation, since the distribution of is completely unknown, so we resort to
which is reasonable in the regions of linearity of , so both in the approximate range and above about , which are most common for scRNA-seq data.
Finally we get the estimate,
The method for is analogous. In fact, one could reproduce the same passages to get
The result for follows. ∎
4 Probability of zero reads
In this section we want to build on the estimates of introduced in Section 3 to get an estimate of the probability that . In what follows the symbol denotes either the average or the square-root estimator of .
The probability of zero read counts can be expressed as
Here is the log-mgf of the law of (which does not depend on ) rescaled by its mean .
By conditioning on , and using the conditional Poisson distribution of , we get,
In full generality we cannot determine the functions for the different genes, because the distributions ’s are unknown; nevertheless some properties of log-mgfs are universal, in particular starts from the origin, is monotone increasing, concave and has derivative in 0.
Instead of trying to estimate , we choose to model it with a universal one-parameter family of functions with the same properties: , , monotone increasing and concave.
A simple, natural choice, based on , is for , which we choose to extend with continuity to
This model, for , corresponds to the gamma distribution with shape parameter , so we are implicitly making the assumption that (dropping the dependence on ) , so that , , and that the read counts are negative binomial with and , see equation (4).
We would like to use this model to infer, for any gene , some value , to which there corrisponds a reasonable estimate of the probability of zero reads in a cell , and to do so we impose the condition that the marginal expected number of zeros for gene equals the marginal observed number of zeros:
and solve this equation for . We remark that here denotes either the average or the square-root estimator of .
We call chance of expression of gene in cell , the quantity with which solves condition (9).
The following remarks and proposition will clarify that this is both a good definition and a reasonable one and that .
Condition (9) is both necessary and natural. Necessary because for arbitrary the quantities on left and right-hand side are typically very different, and would yield false positives in the tests we will be performing. Natural because it is completely analogous to what is done for classical contingency tables, where the unknown probabilities of the categories are estimated from the proportions of the observed marginals, in such a way that the analogous of equation (9) holds.
We had to extend the definition of equation (8) to the negative values of in order to be able to solve equation (9) for all samples. In fact since and are both random variables, it may well be that for some genes ,
The choice of is a simple, natural family of maps that extend with continuity the definition given for to the “forbidden” region of the plane. The interpretation is that in these cases could be underestimating and hence may correct the error in a suitable way.
The following statement shows that can be computed numerically with ease, for example by bisection, for each gene .
The value such that condition (9) holds, is always uniquely determined as long as .
We will prove that the map defined by
is a bijection under the hypothesis.
A direct computation shows that is monotone decreasing in for all . In fact,
and above formula is always negative, since
moreover is continuous in for . We deduce immediately that is monotone increasing as long as for some . The condition is true both for average and for square-root estimators if for some . Finally it is trivial to observe that and , so is a bijection. ∎
5 Co-expression tables
In this section we present a completely new tool for measuring and testing the co-expression of two genes, and introduce two useful statistical methods which considerably extend its scope.
Co-expression is a meaningful concept when the population of cells is not completely homogeneous, because in that case each gene is assumed to be independently expressed in all the cells of the sample (so that is supposed to be diagonal, see Section 2), and hence the read counts are all independent random variables.
In the case of non-homogeneous population, we assume that different cell types can be found in the sample, each type with different genes expressed, and hence two genes could have positive read counts in the same cells more (or less) often that should be expected if the population was homogeneous. Therefore genes co-expression can be a powerful yet indirect tool to infer cell type profiles .
Our approach to assess co-expression builds on the assumption that cell differentiation will typically shun to zero the expression of several genes and that most genes have so low expression at the single cell level that measuring fold change is not very informative.
Based on this assumption, our main test compares the number of cells with zero read count in couples of genes (jointly versus marginally), in a way similar to contingency tables, but generalized to experimental units with different efficiency.
For any pair of genes , their co-expression table is a contingency table of the form
where is the number of cells with non-zero read count for both genes, is the number of cells with non-zero read count for and zero read count for and so on,
and where the marginals are as usual the sums of rows and columns and we recall that is the total number of cells,
We stress that all the information on how large is is ignored. We consider only the two cases and . In principle this may be a weakness of this approach, but one should recall that very few genes have high counts, and this method is particularly suited to deal with low espressions and small integer counts which are typical in scRNA-seq databases.
5.1 Classical contingency tables
The naive approach with classical contingency tables does not work for our proposed scRNA-seq model, because the variability of efficiency between cells creates spurious correlation.
Consider for example the co-expression table below, relative to two constitutive genes (which, as such, should be expressed in all cells),
The marginals of gene are and , with a ratio , showing that in 1 cell out of 2 there are zero read counts for this gene. Despite the fact that gene should be certainly expressed in all cells, this can be explained because of the combination of low biological expression and low extraction efficiency. Something analogous happens for gene , with a corresponding ratio of about .
These ratios suggest that any cell has a probability of of having zero read count of and a probability of of having zero read count of . These two events would be independent if all cells had the same extraction efficiency: together with the independence of RNA fragments extraction, this would yield the expected read counts of classical contingency tables; for example and similarly,
Since is quite far from , giving alone a
deviation from the null hypothesis, the classical contingency table analysis would give high significance to the false hypothesis that the two constitutive genes are positively co-expressed, suggesting that there are at least two different categories of cells: cells in which both are expressed and cells where neither is expressed.
What’s really happening is that there are cells with high efficiency and cells with low efficiency. While the matematical model of contingency tables builds on the assumption that all experimental units are identically distributed, this does not hold in the case of scRNA-seq data.
5.2 Expected counts in co-expression tables
The definition of the observed cells given by equation (10) can be rewritten more succintly as
Since we are going to build a statistical test for the independence of expression of the two genes, we put ourselves in the null hypothesis, and deduce that the expected number of cells under independence is
By Definition 6, can be estimated using the chance of expression of the genes in each cell.
For any pair of genes , their table of expected cell counts under the hypothesis of independence (“table of expected” for short) is given by
where denotes the chance of expression of gene in cell .
For example, for the experiment with the two constitutive genes presented above, the table with the expected cells (average estimators) is the following,
As can be seen, these values are much closer to the observed.
5.3 Co-expression estimator and test
The interpretation of the co-expression table is now performed in a way similar to usual contingency tables, with a test for independence of expression based on the distribution, and with an additional co-expression index, based on the same framework and similar in principle to a classical correlation.
We stress that this is an approximate test, and one cannot prove in full generality that the distribution is exactly correct for the statistics under the null hypothesis. However we used the synthetic datasets described in Section 6 to get the empirical distribution of the
-value and found it in good accordance with the theory, which prescribes uniform distribution. Figure3 shows the result.
On the other hand the statistical power of this test emerges from the direct application to real data, which will be detailed in the twin paper.
Given two genes with co-expression table and table of expected , the statistics of the test for the independence of expression is
The co-expression index is
The statistics is defined in analogy with the traditional contingency tables, with a regularizing correction at the denominator to take into account the fact that could be much smaller than 1 for some low expression genes. In this way those cases will not become false positives.
The co-expression index is defined in such a way that with the sign that encodes the direction of the deviation from independence, so it will be positive when the genes are positively co-expressed, negative in the opposite case, and -distributed when there is independence.
Given two genes with co-expression table and table of expected , define for ,
and let and be as above. Then , and by definition and in the vector space , and have the same direction, so .
If the components of are supposed to be standard Gaussian, independent
but conditioned on the values of the marginals of the tables, then is
standard Gaussian and is a chi-square with 1 degree of freedom.
is a chi-square with 1 degree of freedom.
Firstly notice that, given the marginals, the value of any cell determines the other three, and the following relations hold: