Bayesian Detection of Abnormal ADS in Mutant Caenorhabditis elegans Embryos

by   Wei Liang, et al.
Xiamen University

Cell division timing is critical for cell fate specification and morphogenesis during embryogenesis. How division timings are regulated among cells during development is poorly understood. Here we focus on the comparison of asynchrony of division between sister cells (ADS) between wild-type and mutant individuals of Caenorhabditis elegans. Since the replicate number of mutant individuals of each mutated gene, usually one, is far smaller than that of wild-type, direct comparison of two distributions of ADS between wild-type and mutant type, such as Kolmogorov- Smirnov test, is not feasible. On the other hand, we find that sometimes ADS is correlated with the life span of corresponding mother cell in wild-type. Hence, we apply a semiparametric Bayesian quantile regression method to estimate the 95 curve of ADS with respect to life span of mother cell of wild-type individuals. Then, mutant-type ADSs outside the corresponding confidence interval are selected out as abnormal one with a significance level of 0.05. Simulation study demonstrates the accuracy of our method and Gene Enrichment Analysis validates the results of real data sets.


page 1

page 2

page 3

page 4


Robust Statistical Inference for Cell Type Deconvolution

Cell type deconvolution is a computational approach to infer proportions...

Computational tools for assessing gene therapy under branching process models of mutation

Multitype branching processes are ideal for studying the population dyna...

Imaging-based histological features are predictive of MET alterations in Non-Small Cell Lung Cancer

MET is a proto-oncogene whose somatic activation in non-small cell lung ...

Active feature selection discovers minimal gene-sets for classifying cell-types and disease states in single-cell mRNA-seq data

Sequencing costs currently prohibit the application of single cell mRNA-...

Inference after latent variable estimation for single-cell RNA sequencing data

In the analysis of single-cell RNA sequencing data, researchers often ch...

Algebra, Geometry and Topology of ERK Kinetics

The MEK/ERK signalling pathway is involved in cell division, cell specia...

A Geometric Descriptor for Cell-Division Detection

We describe a method for cell-division detection based on a geometric-dr...

1 Introduction

How a single-celled zygote develops into a mature embryo with different tissues is still a foundational but pendent conundrum in developmental biology. Without question, gene expression pattern and dynamics play a crucial role in this procedure.

The embryogenesis of C. elegans has been studied extensively in the past several decades, especially after development of high throughout automated experimental techniques were intruduced. [1] and [2] developed an automated system to analyze the continuous gene expression profiles in C. elegans with cellular resolution from zygote till mid-embryogenesis using time-lapse confocal laser microscope. With this system, [3, 4, 5, 6, 7, 8] analyzed the expression patterns of various genes and their relationships with cell fates and tissue differentiation. Among them, [7] stated that there’re reasons to believe that ADS can result in cell-specific division pattern and hence affect tissue formation. Previous study of cell division timing mainly focused on heterochronic genes during postembryonic stage of C. elegans, such as [9, 10]. However, the mechanisms of cell division in embryonic and postembryonic stages are quite different [11]. With the help of newly collected data sets by [1, 2], we can analyze the cell division timing systematically and quantitatively.

Here, we aim to distinguish whether the ADS of mutant type is abnormal. Since various mother cells demonstrate various types of cell division and ADS, we use mother cells to label different types of ADS in the following content. Due to 260 individuals of wild-type but only one copy of each mutation type, direct comparison of distributions between wild-type and target mutant type is not feasible. On the other hand, the above-mentioned confocal microscopy on C. elegans embryogenesis is designed to measure the expression level of one specific target gene on all existing cells of an individual embryo during development. Due to strain differences (such as the insertion of DNA sequence coding fluorescent protein into various locations of the C. elegans genome) and variability in experimental and environmental factors, even ADS of the same mother cell in different wild type individuals show high quantitative variation, indicating considerable noise. But the lack of principled statistical methods hinders the comprehensive understanding of these data sets. For example, [7] ignored the relationships between ADS and life time of corresponding mother cell, and used an ad hoc threshold to report the abnormal ADS.

In order to overcome these difficulties, we apply a Bayesian semiparametric quantile regression method to classify abnormal ADS in mutants. Bayesian quantile regression, which combines the advantages of quantile regression and Bayesian method, has been studied over a period of time and applied wildly in research areas, such as

[12, 13, 14, 15]. Our method is based on kernel function suggested by [16] to estimate the confidence interval curve of wild-type ADS for each mother cell. For more robust and stable estimation, we transform the optimization problem into a Bayesian framework with MCMC algorithm applied to obtain the estimation. Then we can easily classify the mutant-type ADS based on whether it is outside the corresponding confidence interval.

In Section 2, we present our Bayesian semiparametric quantile regression algorithm and the hypothesis testing framework of whether mutant-type ADS is abnormal compared to wild-type. In Section 3, we synthesize some mimic data sets to validate our algorithm and apply the algorithm to real data files. Section 4 concludes the paper.

2 Methodology

2.1 Experimental Data

Our real data application is based on the data provided by [3, 7]. It consists of 260 wild-type individuals and 83 mutant-type ones. Division timing of each cell is measured from its birth to its death with teporal resolution of approximately 1.5 minutes. So all the life times of cells were measured as discrete positive integer due to the technical restriction. For example, a measurement of minutes of life time may be to

minutes in reality. But it doesn’t matter to our study since all the times no matter wild-type or mutant type, are measured in the same way. It’s also worth noting that the distribution of life time of a given cell among wild-type can be multimodal and skewed as Figure 

1 shown.

Figure 1: Histogram of life time of cell ABprappap.

2.2 Assumptions

In total, we have 351 different mother cells corresponding to various types of ADS among wild-type individuals. We denote these data sets of ADS and corresponding life time of mother cell as . Each of them consists of about 260 data points collected from our 260 wild-type individuals, and each data point has two values, ADS, denoted by , and life time of corresponding mother cell, denoted by . For every data , we run Pearson’s test and Spearman’s test (see Appendix 5.1 for details) to see whether there’re linear or other monotonic relationship between and . Table 1 shows the testing results with .

Test No. of Failure Proportion of Failure
Pearson’s Test 99 38.98
Spearman’s Test 119 46.85
both Pearson’s Test and Spearman’s Test 90 35.43
Table 1: Results of Pearson’s Test and Spearman’s Test where means no given relationship between and .

We can see that in nearly half of

, we should reject null hypothesis and there’re probably some monotonic relationship between

and . On the other hand, 90 data sets of reject in both tests which means most of the monotonic relationships may be linear. Figure 2 is an example of data set which reject in both tests.

Figure 2: An example of data set . X-axis represents life time of mother cell ABarpaa and Y-axis represents ADS. Different shapes of points indicate different frequency as labeled.

Based on this analysis, we assume a linear function as parametric part in our quantile regression model and details are shown in Section 2.3.

2.3 The Bayesian Modelling Framework for Quantile Regression

In this section, we introduce a semiparametric quantile regression within Bayesian framework for the ADS data of wild-type individuals.

Since we assume that and may be monotonically related, we model the quantile function as the following semiparametric form,

where is the parametric component, is the nonparametric component.

Next, we choose the following Gaussian kernel to model the unknown function , that is

where . Hence the semiparametric quantile regression takes the form


For model (2.1), one important consideration is the smoothness of the curve. More specifically, we tend to prefer smooth curves than rough ones. Usually, the integrated squared second derivative is used to quantify the smoothness of a curve, see [17]. Therefore, from [18], to estimate , we consider the following quantile regression minimisation problem


is the loss function

and is the smoothing parameter. Denote , . Using the Riesz representation theorem,

where . Hence (2.3) can be rewritten as


Our main object of the current paper is on the inference for and conditional quantile curves simultaneously. Therefore, we will focus on the minimisation problem


It can be easily shown that the minimization of the loss function is exactly equivalent to the maximization of a likelihood function formed by combining independently distributed asymmetric Laplace densities ALD,


Therefore, in the following, we adopt a Baysesian approach to solve the problem (2.3).

As our approach is Bayesian, we begin by defining the prior density for parameters. Our prior for is defined the multivariate normal density

where is the generalized inverse matrix of . We next require a priority on the smoothing parameter , which is constrained by a lower limit of zero. Here, we follow [15] by using the gamma density as our prior for


are hyperparameters. Under this prior

, and , results can be used to guide hyperparameter choice.

Assume follows asymmetric Laplace distribution (2.5) ALD , , the likelihood for independent observations is given by

Using uniform distribution as the improper prior density for

, combining , , , and , we can write the posterior density function of parameters


Next we shall use Metropolis-Hastings algorithm and Gibbs Sampler to obtain Markov Chain Monte Carlo (MCMC) samples from the posterior distribution (

2.7). [19] introduced a multivariate statistics, which is called Multivariate Potential Scale Reduction Factor (MPSRF), that can be applied here to test the convergence of MCMC chains. We set the number of iterations to and promise a burn-in of iterations. Inference is based on thinned values of after burn-in.

2.4 Hypothesis Test

We consider the null hypothesis : the ADS of mutant type is normal, against the alternative hypothesis : the ADS of mutant type is abnormal. Firstly, we use Bayesian semiparametric quantile regression to get the estimators of quantile curves and , which are the lower and upper confidence bounds for wild-type embryos respectively. Within these curves, we get a confidence interval. If the ADS of mutant individual is located in the interval at corrrsponding life time of mother cell, we accept . If not, we reject .

3 Application

3.1 Synthesized Data

Here we report a simulation study designed to evaluate the performance of the Bayesian quantile regression model. Set and . Different quantile curves are considered in this section.

Scenario (1):
Scenario (2):
Scenario (3):
Scenario (4):
Scenario (5):
Scenario (6): .

For each scenario, we generate the data as follows. Firstly, we select 100 different from 351 datasets . Secondly, for each dataset , delete the data , and keep as the synthesized one. Then given , we simulate

from a Normal distribution N

, where

is the cumulative distribution function of the standard normal distribution. This setting guarantees that the

and conditional quantile curves of are and . For each dataset , repeat the above process 10 times. Hence we get 1000 synthesized datasets.

Each time, we run two independent MCMC chains and then evaluate the MPSRF to monitor the convergence of Markov chains. After the Markov chains converge, we get the mean estimator of and at all points. Table 2 shows the coverage rates (CR) of confidence intervals of and . We list the averaged Mean Square Error (A MSE) of and

and their variances (V MSE)over all synthesized data sets within each scenario in Table 

2. [7] discards the possible correlation between and and directly use and quantiles of all points to define the confidence interval and the results are also listed in Table 2 for comparison.

Table 2. Our Method [7] CR A(MSE) V(MSE) A(MSE) V(MSE) Scenario 1 0.5999 0.6234 8.1659 50.2678 0.5999 0.4862 0.1917 0.1393 Scenario 2 1.2797 3.4002 9.1810 55.6763 1.1438 2.1626 9.0735 54.2697 Scenario 3 0.4768 0.2063 9.4806 63.3023 0.4684 0.1870 3.5869 5.6692 Scenario 4 2.3845 5.1563 27.9158 651.5690 0.7315 0.8093 0.1931 0.1570 Scenario 5 1.9928 3.6880 28.8412 671.9742 0.5295 0.3305 3.7085 5.3046 Scenario 6 0.5264 0.2746 0.2869 0.1938 0.4916 0.2298 0.2405 0.1988

Table 2: Estimation results of our semiparametric method and [7]


As expected, our method is much more accurate than [7] in most scenarios except when is a constant function which meets [7] perfectly. Figure 3 demonstrates a fitting example of each scenario.

Figure 3: Fitting examples of Scenarios 1-6. Black solid line represents true quantile and blue bold line indicates mean estimation results of our semiparametric method with thinned blue lines standing for confidence inetrval for each estimated quantile curve.

3.2 Real Data

In this section, we apply our method to all the ADS of 84 different mutant-type individuals. We also apply method from [7] to do the comparison of detected results. We list part of the results in Table 3, and complete results can be found in Supplementary Materials.
















Table 3: Comparsion of detected results between our method and [7]. First column represents mutant types and first row represents different types of ADSs. indicates abnormal one detected only by our method while indicating significance only detected by [7] with represnting significant ADS detected by both methods.

Figure 4 is a typical example, showing the difference between two methods.

Figure 4: Black dots are the wild-type sample, and the red stars represent the mutant-type individuals.

Furthermore, we binarize the results with

indicating abnormal ADS detected by our method and summarize them into a binary vector for each mutant type with each entry corresponding to one type of ADS. Then we use the data to cluster the mutant genes into 3 types by kmodes method proposed by

[20] for discrete data. We analyze the significance of our clustering result by Gene Set Enrichment Analysis through DAVID developed by [21, 22]. Table 4 demonstrates the analysis results and the details of clustering result can be found in Appendix 5. Type 3 is not significant in any functional module which may be derived from only 4 genes involved.

Type Category Term P-value 1 BP Reproduction 2 BP Body morphogenesis BP DNA binding BP Transcription factor activity, sequence-specific DNA binding

Table 4: Gene Set Enrichment Analysis. Significant P-values smaller than are demonstrated in the table.

4 Conclusion

We provide a principled automatic procedure to detect abnormal ADS of C. elegans

mutant individuals. Simulation studies and real data examples show that our method can estimate the quantile curves precisely and detect the significant outliers efficiently. Gene Enrichment Analysis shows that our clustering results based on abnormal ADS make sense to a certain extent which also validates the importance of ADS. In general, our method can handle most cases well except for the case where the real quantile curve violates our model assumption severely, such as exponential or high-order polynomial function. A non-parametric model may be a good choice for this situation. However, the unsmooth problem of fitted curve is a considerable challenge.

5 Appendix

5.1 Spearman’s rho

Spearman’s rank correlation coefficient or Spearman’s rho, named after Charles Spearman, often denoted by or as , is a nonparametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function. Suppose is an i.i.d. copy of , and convert to ranks , then Spearman’s rho is computed from

where is the difference between rankings.

When ties are present in rankings, we should use average ranking instead of ranking. Let and be the numbers of each ties in and respectively. Denote , . The adjusted Spearman’s rho is showed as follow,

The sign of the Spearman correlation indicates the direction of association between and . If tends to increase when increases, the Spearman correlation coefficient is positive. If tends to decrease when increases, the Spearman correlation coefficient is negative. A Spearman correlation of zero indicates that there is no tendency for to either increase or decrease when increases. When and are perfectly monotonically related, the Spearman correlation coefficient becomes 1.

5.2 Pearson correlation and Spearman’s correlation of synthesized Data

For each scenario, we calculate the Pearson correlation and Spearman’s for synthesized datasets. Figure 5 shows the histograms of Pearson correlation and Spearman’s correlation under different Scenarios.

Figure 5: Scenario 1–6: Left column is an example of synthesized data set under each scenario. Middle and right columns show corresponding Pearson correlation and Spearman’s correlation respectively

5.3 MCMC convergence

Visual assessment of the convergence of the Metropolis-Hastings algorithm is shown as following.

Figure 6: Plot of iteration traces of , b, , .

Figure 6 shows that the time series plots of , b, , iteration traces. After initially examining the plots of the MCMC chain, we used the more formal MPSRF statistic to assess convergence of the MCMC chain. This statistic compares the variances between and within chains to monitor convergence. The value of MPSRF should certainly not exceed 1.2 as suggested in the literature. In fact, after 80000 interation, we get that MPSRF take values nearly 1.101. The time series plots together with satisfactory values of the MPSRF statistic gave us confidence that the Metropolis-Hastings algorithm was producing realizations from the posterior distribution .

5.4 Gene Clustering

Details of clustering are demonstrated in Table 5.

Type Count Genes 1 12 gad-1,Y55F3AM.3,cul-1,lit-1,ykt-6,hlh-2,dsh-2,ubc-9,cbp-1,ddx-23,gsk-3,wrm-1 2 49 pie-1,cogc-2,sptf-3,abce-1,pal-1,pop-1,mex-6,unc-62,grh-1,lag-1,nhr-25,frg-1,epc-1, egl-18,gei-17,ref-1,tbx-33,egl-27,efl-1,die-1,F57C9.4,ceh-43,rps-9,W06E11.1,tbp-1, dpy-22,vps-37,cdk-8,F13H8.9,kin-19,npp-2,src-1,pes-1,mom-2,lin-23,pfd-1,ceh-13, mom-5,M03C11.1,mdt-11,plp-1,dpy-28,mys-2,par-6,let-526,ham-1,skr-2,arx-1,cpt-2, 3 4 mex-1,par-2,sel-8,snfc-5

Table 5: Clustering results of 65 genes.


  • [1] Bao Z, Murray JI, Boyle T, Ooi SL, Sandel MJ, Waterston RH. Automated cell lineage tracing in Caenorhabditis elegans. Sciences-New York. 2006;103(8).
  • [2] Murray JI, Bao Z, Boyle TJ, Boeck ME, Mericle BL, Nicholas TJ, et al. Automated analysis of embryonic gene expression with cellular resolution in C. elegans. Nature methods. 2008;5(8):703–709. doi:10.1038/nmeth.1228.
  • [3] Murray JI, Boyle TJ, Preston E, Murray JI, Boyle TJ, Preston E, et al. Multidimensional regulation of gene expression in the C . elegans embryo. Genome Research. 2012; p. 1282–1294. doi:10.1101/gr.131920.111.
  • [4] Long F, Peng H, Liu X, Kim SK, Myers E. A 3D digital atlas of C. elegans and its application to single-cell analyses. Nature methods. 2009;6(9):667–72. doi:10.1038/nmeth.1366.
  • [5] Liu X, Long F, Peng H, Aerni SJ, Jiang M, Sanchez-Blanco A, et al. Analysis of Cell Fate from Single-Cell Gene Expression Profiles in C. elegans. Cell. 2009;139(3):623–633. doi:10.1016/j.cell.2009.08.044.
  • [6] Spencer WC, Zeller G, Watson JD, Henz SR, Watkins KL, Mcwhirter RD, et al. A spatial and temporal map of C. elegans gene expression — Genome Research. Genome Research21. 2011; p. 325–341. doi:10.1101/gr.114595.110.Freely.
  • [7] Ho VWS, Wong MK, An X, Guan D, Shao J, Ng HCK, et al. Systems-level quantification of division timing reveals a common genetic architecture controlling asynchrony and fate asymmetry. Molecular systems biology. 2015;11(6):814. doi:10.15252/msb.20145857.
  • [8] Hu J, Zhao Z, Yalamanchili HK, Wang J, Ye K, Fan X. Bayesian detection of embryonic gene expression onset in C. elegans. Annals of Applied Statistics. 2015;9(2):950–968. doi:10.1214/15-AOAS820.
  • [9] Ren H, Zhang H. Wnt signaling controls temporal identities of seam cells in Caenorhabditis elegans. Developmental Biology. 2010;345(2):144–155. doi:10.1016/j.ydbio.2010.07.002.
  • [10] Gleason JE, Eisenmann DM. Wnt signaling controls the stem cell-like asymmetric division of the epithelial seam cells during C. elegans larval development. Developmental Biology. 2010;348(1):58–66. doi:10.1016/j.ydbio.2010.09.005.
  • [11] Ambros V. The temporal control of cell cycle and cell fate in Caenorhabditis elegans. Novartis Foundation Symposia. 2001;237:203–214.
  • [12] Yu K, Moyeed RA. Bayesian quantile regression. Statistics & Probability Letters. 2001;54(January):437–447. doi:10.1002/jae.1069.
  • [13] Dunson DB, Taylor Ja.

    Approximate Bayesian inference for quantiles.

    Journal of Nonparametric Statistics. 2005;17(3):385–400. doi:10.1080/10485250500039049.
  • [14] Kottas A, KrnjajiĆ M. Bayesian semiparametric modelling in quantile regression. Scandinavian Journal of Statistics. 2009;36(2):297–319. doi:10.1111/j.1467-9469.2008.00626.x.
  • [15] Thompson P, Cai Y, Moyeed R, Reeve D, Stander J. Bayesian nonparametric quantile regression using splines. Computational Statistics and Data Analysis. 2010;54(4):1138–1150. doi:10.1016/j.csda.2009.09.004.
  • [16] Takeuchi I, Le QV, Sears T, Smola AJ. Nonparametric Quantile Regression.

    Journal of Machine Learning Research. 2005;7:1001–1032.

  • [17] Green PJ, Silverman BW. Nonparametric Regression and Generalized Linear Models. London: Chapman and Hall; 1994.
  • [18] Koenker R. Quantile Regression. Cambridge: Cambridge University Press; 2005.
  • [19] Brooks SP. General Methods for Monitoring Convergence of Iterative Simulations. Journal of Computational and Graphical Statistics. 1998;7 %6(4):434–455.
  • [20] Huang Z. A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. In: In Research Issues on Data Mining and Knowledge Discovery; 1997. p. 1–8.
  • [21] Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nature Protocols. 2009;4(1):44–57.
  • [22] Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acid Research. 1998;37(1):1–13.