Fully Bayesian imputation model for non-random missing data in qPCR

by Valeriia Sherina et al.

We propose a new statistical approach to obtain differential gene expression of non-detects in quantitative real-time PCR (qPCR) experiments through Bayesian hierarchical modeling. We propose to treat non-detects as non-random missing data, model the missing data mechanism, and use this model to impute Ct values or obtain direct estimates of relevant model parameters. A typical laboratory does not have the resources to perform experiments with a large number of replicates; therefore, we propose an approach that does not rely on large sample theory. We aim to demonstrate the possibilities that exist for analyzing qPCR data in the presence of non-random missingness through the use of Bayesian estimation. Bayesian analysis typically allows for smaller data sets to be analyzed without losing power while retaining precision. The heart of Bayesian estimation is that everything that is known about a parameter before observing the data (the prior) is combined with the information from the data itself (the likelihood), resulting in updated knowledge about the parameter (the posterior). In this work we introduce and describe our hierarchical model and chosen prior distributions, assess the model sensitivity to the choice of prior, perform convergence diagnostics for the Markov Chain Monte Carlo, and present the results of a real data application.




1 Introduction

There are three missing data mechanisms defined in the statistical literature: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) [1, 2]. Our group previously showed that the data in qPCR experiments are most likely MNAR [3]. Even when the proportion of missing values in a qPCR dataset is not substantial, it is not appropriate to ignore or remove the genes or samples with missing values [1, 2]. In both likelihood-based and Bayesian inference, if the parameters describing the measurement process are functionally independent of the parameters of the missing data mechanism, the missingness process is called ignorable; a non-random process is non-ignorable [4]. Under MNAR, the parameters of the missingness process and the parameters of the measurement process are not functionally independent, and hence the process is non-ignorable. This means that with MNAR data we need to explicitly address the missing data mechanism, because ignoring it can lead to invalid inference about the model parameters.

To properly characterize the missing data mechanism in qPCR data, we first need a good understanding of the experimental procedures and the related properties of the measurement instrument. In a typical qPCR experiment, we start with samples mixed with primers and nucleotides for the identification and amplification of the biological material. During each cycle we expect a doubling of the existing transcript. Due to the dynamic, real-time nature of the qPCR experiment, many factors can contribute to the presence of missing values in the final qPCR data. For example, low-abundance genes may fail to amplify, or the qPCR instrument may have trouble detecting a weak signal from low-abundance genes even if amplification occurs. Therefore, the smaller the signal, the higher the probability of observing a non-detect: a reaction that failed to be detected by the qPCR technology. Even though these experiments involve a detection threshold, instead of a censoring model we utilize the probabilistic model proposed by McCall et al. [3], and model the missing data process as a function of the transcript value to be measured.

In this manuscript, we propose a new approach to obtain differential gene expression for all genes, including non-detects, in qPCR experiments through Bayesian hierarchical modeling. Bayesian analysis does not assume large samples, as is the case with Maximum Likelihood Estimation (MLE); typically, smaller data sets can be analyzed without losing power while retaining precision [5, 6]. For more technical details we refer the reader to [7].

The heart of Bayesian estimation is that everything that is known about a parameter before observing the data (the prior) is combined with the information from the data itself (the likelihood), resulting in updated knowledge about the parameter (the posterior). The prior information can stem from a meta-analysis, previous studies with comparable research populations, a pilot study, experts, or a range of other sources. If such knowledge is used, we call the prior informative; if no knowledge is available (or used), we call the prior uninformative. Estimation of Bayesian models is frequently done through Markov chain Monte Carlo (MCMC). Unlike deterministic Maximum Likelihood algorithms, MCMC is a stochastic procedure that repeatedly generates random samples that characterize the posterior distribution of the parameters of interest. This is distinct from the sampling distribution of the estimators obtained by MLE. Generating the random samples is the role of the Markov chain; generating summary statistics from those samples is the role of Monte Carlo integration.
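As a toy illustration of the sampling idea (not the model fitted in this paper), a random-walk Metropolis sampler for the posterior mean of Normal data under a weak Normal prior can be sketched as follows; all numeric settings are illustrative.

```python
import numpy as np

def log_post(mu, data, prior_mean=0.0, prior_sd=10.0, sigma=1.0):
    # log prior + log likelihood, up to an additive constant
    log_prior = -0.5 * ((mu - prior_mean) / prior_sd) ** 2
    log_lik = -0.5 * np.sum(((data - mu) / sigma) ** 2)
    return log_prior + log_lik

def metropolis(data, n_draws=5000, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    mu = 0.0
    draws = np.empty(n_draws)
    for i in range(n_draws):
        proposal = mu + step * rng.standard_normal()  # random-walk proposal
        # accept with probability min(1, posterior ratio)
        if np.log(rng.uniform()) < log_post(proposal, data) - log_post(mu, data):
            mu = proposal
        draws[i] = mu
    return draws

data = np.random.default_rng(1).normal(loc=3.0, scale=1.0, size=30)
draws = metropolis(data)
# Monte Carlo integration: summarize the chain after a burn-in period
print(draws[1000:].mean())  # posterior mean estimate, close to the data mean
```

The Markov chain is the sequence of accepted values; averaging the retained draws is the Monte Carlo integration step.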

In the study of Galindo-Garre et al. [8], the authors specified an informative prior and concluded that the more information is captured by well-specified priors, the smaller the parameter bias. We consider situations in which, a priori, there is at most partial information available about the parameters of interest, so we use uninformative priors, weakly informative priors, or priors reflecting partial knowledge of the parameters. In summary, our main objective is to demonstrate methods to analyze small-sample data through the use of Bayesian estimation, and to determine the conditions under which Bayesian inference outperforms MLE. This will guide the analysis of qPCR data in the presence of MNAR data.

We first introduce and describe our hierarchical model and chosen prior distributions and perform a prior sensitivity analysis. We then present the results of a real data application, assess the convergence of the Markov chain Monte Carlo sampler, compare parameter estimates with existing methods described in Sherina et al. [9], and comment on open problems.

2 Modeling

Throughout, we analyze the expression of G genes, indexed by g = 1, ..., G. Each gene is measured under different conditions, perturbations, and in multiple replicates; we refer to these conditions and perturbations as sample-types throughout the paper. Let S_1, ..., S_K denote a partition of the samples into K sets of replicates.

Y_gj is the observed gene expression of gene g in sample j, a sample of sample-type k (j ∈ S_k). For each Y_gj there is an indicator Z_gj such that if Z_gj = 1, the gene expression is observed, and if Z_gj = 0, the value is a non-detect, with

    Pr(Z_gj = 1) = f(Y_gj),    (1)

where f is a function that models the dependence between the gene expression and the probability of the value being observed. The following logistic regression model is a natural choice for such a relationship:

    logit[ Pr(Z_gj = 1) ] = β_0 + β_1 Y*_gj,    (2)

where β_0 and β_1 are the logistic regression coefficients, and Y*_gj is an estimate of the gene expression for gene g, sample j that is potentially unobserved. One of the challenges that arises with the use of Y*_gj in Equation 2 is the presence of non-detects. When we do not have enough information to generate estimates for the individual Y*_gj, a possible solution is to borrow information across the replicates in S_k.
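As a concrete sketch of this logistic relationship, with purely illustrative coefficient values chosen so that the detection probability falls as Ct grows (in the actual model the coefficients are estimated from the data, and their signs depend on the scale used for expression):

```python
import math

def prob_observed(ct, beta0=20.0, beta1=-0.55):
    """Pr(Z = 1 | Ct) = logit^{-1}(beta0 + beta1 * Ct).

    beta0 and beta1 are illustrative stand-ins only: the negative slope
    encodes that larger Ct values (weaker signal) are more likely
    to end up as non-detects.
    """
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * ct)))

for ct in (25, 32, 38):
    print(ct, round(prob_observed(ct), 3))  # detection probability falls with Ct
```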

This is the data generating model proposed in McCall et al. [3] and described in more detail in Sherina et al. [9]. In this paper, we consider the gene expression and model it as follows:

    Y_gj = θ_gk + δ_j + ε_gj,    (3)

where Y_gj is again the observed gene expression, or the Ct value, from the qPCR experiment; θ_gk is the true gene expression for gene g in sample-type k; δ_j represents a global shift in expression across samples; and ε_gj captures the stochastic component of the model.

Bayesian statistics combines the knowledge from the data in the likelihood with prior information. The prior distribution should reflect what we know about the parameters in the model. The main parameters of interest are the average differential gene expression and its variability (denoted by θ_gk and σ_g²), together with the missing data mechanism parameters, β_0 and β_1.

All the parameters in the model need to have a prior distribution. The specification of the prior distributions consists of three steps. First, background knowledge is needed as input for the specification. Such knowledge can stem from a meta-analysis, previous studies with comparable research populations, a pilot study, experts, or a range of other sources. In our case, we used existing knowledge of the range of the parameters as inspiration. Second, for all the parameters a type of distribution has to be specified. We used a Normal distribution, denoted by N(μ, σ²), for the θ's; for the standard deviation parameters we specified a Uniform distribution, denoted by Unif(0, A); and for the missingness indicators Z_gj we used a Bernoulli distribution, denoted by Bern(p). As Gelman [10] pointed out, inferences are not sensitive to the choice of A for at least 3 groups and sufficiently large finite A.

The third step of the prior specification is to determine the shape of the prior distribution by choosing values of the hyper-parameters. For the mean parameters θ_gk, the hyper-parameters are the mean and the standard deviation of the Normal distribution, denoted by μ_θ and σ_θ. We fixed the hyper-parameter μ_θ, the mean of the average gene expression, at a constant larger than the maximum possible observed Ct, so that it pulls the imputed values of non-detects to be somewhat greater than the maximum possible observed Ct value. However, in the situation where all the replicates of a sample are non-detects, we really do not have any information about what the values should have been; an estimated Ct value of 60 or 80 simply tells us that there is no confidence in the presence of gene expression. For the standard deviation parameters σ_g, the hyper-parameter is the maximum of the uniform range, A; this prior is equivalent to an inverse-χ² distribution with -1 degrees of freedom [10]. We model β_0 and β_1 jointly as bivariate Normal with a zero mean vector, a variance of 100, and zero correlation. In addition, β_1 is restricted to be positive.

Below is a full specification of our Bayesian model for the observed data (Y_gj, Z_gj):

    Y_gj | θ_gk, δ_j, σ_g ~ N(θ_gk + δ_j, σ_g²),
    Z_gj | Y_gj ~ Bern(p_gj), and logit(p_gj) = β_0 + β_1 Y_gj.

Prior distributions for the model parameters:

    θ_gk ~ N(μ_θ, σ_θ²), σ_g ~ Unif(0, A), and (β_0, β_1) ~ N(0, 100 I), where μ_θ is fixed and A is the upper bound of the uniform range.

Prior for the hyper-parameter σ_θ: Unif(0, A).

2.1 Posterior distributions

In this section we derive the full conditional distributions of the parameters of interest and of the hyper-parameter σ_θ. Some of them do not have a known distributional form. Under the Unif(0, A) prior, the full conditional distribution of the hyper-parameter σ_θ (the standard deviation of the mean parameters θ_gk) is:

    p(σ_θ | θ, ·) ∝ σ_θ^(-GK) exp( -Σ_{g,k} (θ_gk - μ_θ)² / (2σ_θ²) ),  0 < σ_θ < A.
2.2 Parameter estimation

We consider two ways of summarizing the output from the MCMC:

  1. Fully Bayesian Imputation (FBI). By taking the median of all draws for the parameters of interest, the θ_gk's and σ_g's, we acquire direct parameter estimates of the mean and the variance of gene expression.

  2. Single Imputation Bayesian (SI Bayes). By taking the median of the draws of each missing Y_gj, we obtain a single imputation for each individual missing data point. We then calculate the means and variances of gene expression to get the parameter values for this method. Note that we expect the variance to be under-estimated here, because we are ignoring the fact that some Y_gj are missing and treating estimates of Y_gj as data.
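The two summaries differ only in what is taken from the MCMC output. A schematic sketch with hypothetical draw arrays (the numbers are made up, not the paper's actual sampler output):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MCMC output for one gene: 5000 posterior draws of its
# mean (theta) and standard deviation (sigma); values are made up.
theta_draws = rng.normal(28.0, 0.3, size=5000)
sigma_draws = np.abs(rng.normal(1.2, 0.1, size=5000))

# FBI: take posterior medians of the parameters of interest directly
theta_hat_fbi = np.median(theta_draws)
sigma_hat_fbi = np.median(sigma_draws)

# SI Bayes: take the median of the draws of a missing Ct value,
# plug it in as if observed, then compute sample summaries
observed = np.array([27.6, 28.4, 27.9])              # observed replicates
missing_ct_draws = rng.normal(28.0, 1.2, size=5000)  # draws of one missing Ct
y_imputed = np.median(missing_ct_draws)
completed = np.append(observed, y_imputed)
theta_hat_si = completed.mean()
# Treating the single imputed value as data ignores imputation uncertainty,
# so this variance estimate tends to be too small.
sigma_hat_si = completed.std(ddof=1)
print(theta_hat_fbi, theta_hat_si, sigma_hat_si)
```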

3 Simulation study

Based on the design and the parameters of the experimental data from Sampson et al. [11], we selected the true values of the model parameters (including β_0 and β_1) to be used in our simulation study. We set the number of genes G to 16, the number of replicates within each sample-type (the size of each S_k) to 6, and the number of sample-types K to 6. We assume the missing data mechanism is common across all genes.

To generate synthetic data, we first simulate complete data, then remove part of the data by constructing an MNAR missingness mechanism. In the first step, we generate the three components of the model: the true θ_gk, δ_j, and ε_gj. We generate the θ_gk from a Normal distribution truncated to the range 20 to 40.5, set the δ_j to 0, and simulate the ε_gj from N(0, σ_g²). In the second step, we generate the missing data indicators from Bern(p_gj), where p_gj is the probability of a point being missing according to the logit model in Equation 2, calculated for each data point. All individual Ct values above 40 and all values with missing indicators are replaced with a value of 40, indicating a non-detect.
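The two generation steps can be sketched as follows; the distributional settings for θ, σ_g, β_0, and β_1 here are illustrative stand-ins, since the exact simulation parameters were taken from the Sampson et al. [11] data.

```python
import numpy as np

rng = np.random.default_rng(42)
G, K, R = 16, 6, 6  # genes, sample-types, replicates per sample-type

# Step 1: complete data. Draw theta from a Normal truncated to (20, 40.5)
# by rejection; the location and scale here are illustrative.
theta = rng.normal(30.0, 4.0, size=(G, K))
bad = (theta < 20) | (theta > 40.5)
while bad.any():
    theta[bad] = rng.normal(30.0, 4.0, size=bad.sum())
    bad = (theta < 20) | (theta > 40.5)
sigma = rng.uniform(0.5, 1.5, size=G)  # per-gene sd (illustrative range)
y = theta[:, :, None] + sigma[:, None, None] * rng.normal(size=(G, K, R))

# Step 2: MNAR mechanism (illustrative beta values; higher Ct -> more missing)
beta0, beta1 = -25.0, 0.7
p_missing = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * y)))
missing = rng.uniform(size=y.shape) < p_missing

# Non-detects: flagged values and Ct values above 40 are recorded as 40
y_obs = y.copy()
y_obs[missing | (y > 40)] = 40.0
print(y_obs.shape, float(missing.mean()))
```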

We compared performance under several prior distributions and assessed sensitivity to the choice of prior. For every scenario we ran 100 simulations. We summarize the results as the 25th, 50th, and 75th percentiles of bias and MSE across all genes and samples. We have 16 genes and 6 sample-types, so for each synthetic data set there are 16 distinct values of σ_g and 16 × 6 = 96 distinct θ_gk. We show results for FBI because it performed better; results for SI Bayes are shown in the application.

3.1 Prior sensitivity analysis

The importance of a prior sensitivity analysis is hard to overestimate; the choice of prior distributions can strongly influence the estimation results. The more information that is captured by well-specified priors, the smaller the resulting parameter bias. We examined several choices of prior distribution for the variability components of the model: we specified priors for σ_g and σ_θ (or, equivalently, for σ_g² and σ_θ²). We compared conjugate inverse gamma priors with varying parameters and weakly informative uniform priors with two different boundaries: (0, 10) and (0, 100). The summary of bias and MSE for the mean (θ_gk) and the variance components (σ_g²) is given in Table 1.

Priors (σ_g, σ_θ)                      Bias (θ)  MSE (θ)  Bias (σ²)  MSE (σ²)
Unif(0, 100), Unif(0, 100)             -0.011    0.122    0.033      0.048
Unif(0, 10), Unif(0, 10)                0.016    0.120    0.036      0.049
IG(0.001, 0.5), IG(0.1, 0.1)           -0.010    0.124    0.047      0.048
IG(0.001, 0.5), IG(0.001, 0.5)         -0.010    0.124    0.047      0.048
IG(0.001, 0.5), IG(1, 10)              -0.008    0.124    0.046      0.048
IG(0.001, 0.5), IG(0.001, 0.001)       -0.010    0.124    0.047      0.048
IG(1, 10), IG(0.001, 0.001)            -0.004    0.232    0.670      0.587
IG(0.001, 0.001), IG(0.001, 0.001)     -0.011    0.118    0.008      0.041

Table 1: Varying prior distributions for σ_g and σ_θ.

The θ estimates are essentially unbiased, while σ² is slightly overestimated, meaning that this fully Bayesian model incorporates additional uncertainty into the estimation of the variance. MSEs are higher for θ and lower for σ² due to the bias-variance trade-off. The estimation results vary across the inverse gamma (IG) priors; in contrast, the results for the uniform priors are consistent and almost identical across the different ranges. Therefore, in the real data application, we decided to use the uniform distribution. If researchers have more information about the distribution of the parameters, they can specify the corresponding prior in our model.

4 Experimental data application

To study the performance of the proposed methods and compare them to existing methods of imputing MNAR data, we used an experimental study of the effect of p53 and/or Ras mutations on gene expression, with 3 or 4 replicates per sample-type [12]. Applicability to real data is an essential part of testing new methods, and when testing methods for missing data it is important to know the true values.

The data set has a small amount of missing values, about 3%. To obtain a complete data set, we started by removing all the genes with missing values; we call this data set complete. Thereafter, we replaced all Ct values greater than or equal to 30 with 30, resulting in a truncated data set with 2% missing values found in 12% of unique genes. We use this dataset to test FBI and SI Bayes. We initialized the MCMC and obtained 10000 draws from the posterior distributions of the model parameters. The average effective sample size of the parameters is 4872. We calculated estimates using the FBI and SI Bayes methods. Thereafter, we calculated the difference between the parameter estimates from the complete experimental data and those from several existing methods, and present the results in Figure 1. We compared the proposed FBI and SI Bayes methods to estimates obtained simply from the truncated data (Trunc), to a penalized EM algorithm incorporating non-random missingness (PEMM) proposed by Chen et al. [13], and to the MLE-based methods Direct Estimation (DirEst) and Multiple Imputation (MI) proposed by our group [9]. PEMM assumes a missing-data mechanism governed by a positive parameter, which we fixed. We calculated estimates of the means and variances for PEMM, DirEst, MI, SI Bayes, and FBI, and compared them to the estimates from the complete data.
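The effective sample size reported above can be estimated from a chain's autocorrelation. A rough, self-contained sketch (dedicated MCMC software implements more careful estimators):

```python
import numpy as np

def effective_sample_size(chain):
    """Crude ESS: n / (1 + 2 * sum of leading positive autocorrelations)."""
    chain = np.asarray(chain, dtype=float)
    n = chain.size
    centered = chain - chain.mean()
    autocov = np.correlate(centered, centered, mode="full")[n - 1:] / n
    rho = autocov / autocov[0]
    total = 0.0
    for lag in range(1, n):
        if rho[lag] < 0:  # truncate at the first negative autocorrelation
            break
        total += rho[lag]
    return n / (1 + 2 * total)

rng = np.random.default_rng(0)
iid_draws = rng.normal(size=4000)  # independent draws: ESS should be near n
ess = effective_sample_size(iid_draws)
print(round(ess))
```

For a correlated chain, the sum of autocorrelations grows and the ESS drops well below the number of draws.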

We then repeated the same process, truncating Ct values at 33. This time the resulting data set has overall 1% missing values, occurring in 7% of unique genes. The average effective sample size of the MCMC draws for the parameters is 4980. The results of this study are presented in Figure 2.

4.1 FBI outperforms MLE based methods

When there is a small percentage of missing data, there is very little information about the missing data mechanism, which is common across all the genes. Our Bayesian modeling framework allows for better estimation of the missing data mechanism in the presence of a small number of non-detects. Much attention has been paid to methods that can accommodate large proportions of missing data, and some have proven successful, for example PEMM [13]. In a laboratory setting with limited resources, however, researchers work with only a few replicates per condition, and missing data are relatively rare. When the missing data mechanism is non-ignorable, researchers therefore benefit from methods that are tailored to estimate the missing data mechanism well with little available information about it.

We see that for the data truncated at 30, DirEst and MI slightly outperform fully Bayesian imputation in estimating the average gene expression, while SI Bayes substantially underestimates gene expression in comparison to PEMM and even to the truncated data mean estimates. FBI shows very similar performance to DirEst and MI in terms of bias for the variance, while SI Bayes has higher bias, which shows that single imputation procedures are generally sub-optimal when dealing with non-random missing data.

When there is not much information about the missing data mechanism, we see the advantage of using FBI over likelihood-based approaches (Figure 2). The bias and MSE (not shown) for the mean and the variance are smallest for FBI compared to all other approaches. Recall that in this example, the missing data constitute only 1% of the total data. This demonstrates that FBI can better estimate the gene expression and variability needed to identify differentially expressed genes.

5 Discussion

In genomic studies researchers usually have a limited number of samples. They cannot afford to sacrifice information by simply ignoring observations with missing values, especially when the data are missing not at random. We proposed a fully Bayesian imputation method that works well for small sample sizes. For the purposes of this work, we assume there is one common missing data mechanism for the entire experiment.

In the simulation study we assessed the bias and MSE of the model parameters, performed a sensitivity analysis, and showed that our method is not sensitive to the choice of prior. We recommend using a non-informative uniform prior, but our method can be customized if prior information is available.

We compared this new approach with other existing methods for missing data imputation on an experimental dataset. The real data application revealed that fully Bayesian imputation outperforms other methods when little information about the missing data mechanism is available. Moreover, it performs similarly to MLE-based approaches when the proportion of missing data, and hence the information about the missing data mechanism, increases. In experimental situations, investigators try to limit missing data and often have no more than 3-4 replicates per condition. Therefore, it is useful to show that this method works well under moderate missingness and outperforms other proposed methods when the missingness is limited.

In this work, we assume that each gene has a common variance across sample-types and that gene expression values are normally distributed. It is possible to build a model with a gene-by-sample-type covariance matrix, but this is beyond the scope of this work. The FBI framework can be extended to non-Normal distributions and to other missing data mechanisms; however, this may require modifications to the model.

6 Funding

This work has been partially supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number R00HG006853 (to M.N.M.) and the University of Rochester CTSA award number UL1TR002001 (to M.N.M.) from the National Center for Advancing Translational Sciences of the National Institutes of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.


  • (1) Rubin DB. Inference and missing data. Biometrika 1976; 63(3): 581–592.
  • (2) Little RJ and Rubin DB. Statistical Analysis with Missing Data. John Wiley & Sons, 2014.
  • (3) McCall MN, McMurray HR, Land H et al. On non-detects in qPCR data. Bioinformatics 2014; 30(16): 2310–2316. DOI:10.1093/bioinformatics/btu239.
  • (4) Molenberghs G and Verbeke G. Models for Discrete Longitudinal Data. Springer, 2005.
  • (5) Hox JJ, van de Schoot R and Matthijsse S. How few countries will do? Comparative survey analysis from a Bayesian perspective. Survey Research Methods 2012; 6: 87–93.
  • (6) Hamra G, MacLehose R and Richardson D. Markov chain Monte Carlo: an introduction for epidemiologists. International Journal of Epidemiology 2013; 42(2): 627–634.
  • (7) Gelman A, Carlin JB, Stern HS et al. Bayesian Data Analysis. Chapman & Hall/CRC, 2014.
  • (8) Galindo-Garre F, Vermunt JK and Bergsma WP. Bayesian posterior estimation of logit parameters with small samples. Sociological Methods & Research 2004; 33(1): 88–117.
  • (9) Sherina V, McMurray H, Powers W et al. Statistical approaches to decreasing the discrepancy of non-detects in qPCR data. bioRxiv 2017; DOI:10.1101/231621.
  • (10) Gelman A. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis 2006; 1(3): 515–534.
  • (11) Sampson ER, McMurray HR, Hassane DC et al. Gene signature critical to cancer phenotype as a paradigm for anticancer drug discovery. Oncogene 2013; 32(33): 3809–3818.
  • (12) McMurray HR, Sampson ER, Compitello G et al. Synergistic response to oncogenic mutations defines gene class critical to cancer phenotype. Nature 2008; 453(7198): 1112–1116.
  • (13) Chen LS, Prentice RL and Wang P. A penalized EM algorithm incorporating missing data mechanism for Gaussian parameter estimation. Biometrics 2014; 70(2): 312–322.