1 Introduction
How a singlecelled zygote develops into a mature embryo with different tissues is still a foundational but pendent conundrum in developmental biology. Without question, gene expression pattern and dynamics play a crucial role in this procedure.
The embryogenesis of C. elegans has been studied extensively in the past several decades, especially after development of high throughout automated experimental techniques were intruduced. [1] and [2] developed an automated system to analyze the continuous gene expression profiles in C. elegans with cellular resolution from zygote till midembryogenesis using timelapse confocal laser microscope. With this system, [3, 4, 5, 6, 7, 8] analyzed the expression patterns of various genes and their relationships with cell fates and tissue differentiation. Among them, [7] stated that there’re reasons to believe that ADS can result in cellspecific division pattern and hence affect tissue formation. Previous study of cell division timing mainly focused on heterochronic genes during postembryonic stage of C. elegans, such as [9, 10]. However, the mechanisms of cell division in embryonic and postembryonic stages are quite different [11]. With the help of newly collected data sets by [1, 2], we can analyze the cell division timing systematically and quantitatively.
Here, we aim to distinguish whether the ADS of mutant type is abnormal. Since various mother cells demonstrate various types of cell division and ADS, we use mother cells to label different types of ADS in the following content. Due to 260 individuals of wildtype but only one copy of each mutation type, direct comparison of distributions between wildtype and target mutant type is not feasible. On the other hand, the abovementioned confocal microscopy on C. elegans embryogenesis is designed to measure the expression level of one specific target gene on all existing cells of an individual embryo during development. Due to strain differences (such as the insertion of DNA sequence coding fluorescent protein into various locations of the C. elegans genome) and variability in experimental and environmental factors, even ADS of the same mother cell in different wild type individuals show high quantitative variation, indicating considerable noise. But the lack of principled statistical methods hinders the comprehensive understanding of these data sets. For example, [7] ignored the relationships between ADS and life time of corresponding mother cell, and used an ad hoc threshold to report the abnormal ADS.
In order to overcome these difficulties, we apply a Bayesian semiparametric quantile regression method to classify abnormal ADS in mutants. Bayesian quantile regression, which combines the advantages of quantile regression and Bayesian method, has been studied over a period of time and applied wildly in research areas, such as
[12, 13, 14, 15]. Our method is based on kernel function suggested by [16] to estimate the confidence interval curve of wildtype ADS for each mother cell. For more robust and stable estimation, we transform the optimization problem into a Bayesian framework with MCMC algorithm applied to obtain the estimation. Then we can easily classify the mutanttype ADS based on whether it is outside the corresponding confidence interval.In Section 2, we present our Bayesian semiparametric quantile regression algorithm and the hypothesis testing framework of whether mutanttype ADS is abnormal compared to wildtype. In Section 3, we synthesize some mimic data sets to validate our algorithm and apply the algorithm to real data files. Section 4 concludes the paper.
2 Methodology
2.1 Experimental Data
Our real data application is based on the data provided by [3, 7]. It consists of 260 wildtype individuals and 83 mutanttype ones. Division timing of each cell is measured from its birth to its death with teporal resolution of approximately 1.5 minutes. So all the life times of cells were measured as discrete positive integer due to the technical restriction. For example, a measurement of minutes of life time may be to
minutes in reality. But it doesn’t matter to our study since all the times no matter wildtype or mutant type, are measured in the same way. It’s also worth noting that the distribution of life time of a given cell among wildtype can be multimodal and skewed as Figure
1 shown.2.2 Assumptions
In total, we have 351 different mother cells corresponding to various types of ADS among wildtype individuals. We denote these data sets of ADS and corresponding life time of mother cell as . Each of them consists of about 260 data points collected from our 260 wildtype individuals, and each data point has two values, ADS, denoted by , and life time of corresponding mother cell, denoted by . For every data , we run Pearson’s test and Spearman’s test (see Appendix 5.1 for details) to see whether there’re linear or other monotonic relationship between and . Table 1 shows the testing results with .
Test  No. of Failure  Proportion of Failure 

Pearson’s Test  99  38.98 
Spearman’s Test  119  46.85 
both Pearson’s Test and Spearman’s Test  90  35.43 
We can see that in nearly half of
, we should reject null hypothesis and there’re probably some monotonic relationship between
and . On the other hand, 90 data sets of reject in both tests which means most of the monotonic relationships may be linear. Figure 2 is an example of data set which reject in both tests.Based on this analysis, we assume a linear function as parametric part in our quantile regression model and details are shown in Section 2.3.
2.3 The Bayesian Modelling Framework for Quantile Regression
In this section, we introduce a semiparametric quantile regression within Bayesian framework for the ADS data of wildtype individuals.
Since we assume that and may be monotonically related, we model the quantile function as the following semiparametric form,
where is the parametric component, is the nonparametric component.
Next, we choose the following Gaussian kernel to model the unknown function , that is
where . Hence the semiparametric quantile regression takes the form
(2.1) 
For model (2.1), one important consideration is the smoothness of the curve. More specifically, we tend to prefer smooth curves than rough ones. Usually, the integrated squared second derivative is used to quantify the smoothness of a curve, see [17]. Therefore, from [18], to estimate , we consider the following quantile regression minimisation problem
where
is the loss function
and is the smoothing parameter. Denote , . Using the Riesz representation theorem,
where . Hence (2.3) can be rewritten as
(2.3) 
Our main object of the current paper is on the inference for and conditional quantile curves simultaneously. Therefore, we will focus on the minimisation problem
(2.4) 
It can be easily shown that the minimization of the loss function is exactly equivalent to the maximization of a likelihood function formed by combining independently distributed asymmetric Laplace densities ALD,
(2.5) 
Therefore, in the following, we adopt a Baysesian approach to solve the problem (2.3).
As our approach is Bayesian, we begin by defining the prior density for parameters. Our prior for is defined the multivariate normal density
where is the generalized inverse matrix of . We next require a priority on the smoothing parameter , which is constrained by a lower limit of zero. Here, we follow [15] by using the gamma density as our prior for
where
are hyperparameters. Under this prior
, and , results can be used to guide hyperparameter choice.Assume follows asymmetric Laplace distribution (2.5) ALD , , the likelihood for independent observations is given by
Using uniform distribution as the improper prior density for
, combining , , , and , we can write the posterior density function of parameters(2.7)  
Next we shall use MetropolisHastings algorithm and Gibbs Sampler to obtain Markov Chain Monte Carlo (MCMC) samples from the posterior distribution (
2.7). [19] introduced a multivariate statistics, which is called Multivariate Potential Scale Reduction Factor (MPSRF), that can be applied here to test the convergence of MCMC chains. We set the number of iterations to and promise a burnin of iterations. Inference is based on thinned values of after burnin.2.4 Hypothesis Test
We consider the null hypothesis : the ADS of mutant type is normal, against the alternative hypothesis : the ADS of mutant type is abnormal. Firstly, we use Bayesian semiparametric quantile regression to get the estimators of quantile curves and , which are the lower and upper confidence bounds for wildtype embryos respectively. Within these curves, we get a confidence interval. If the ADS of mutant individual is located in the interval at corrrsponding life time of mother cell, we accept . If not, we reject .
3 Application
3.1 Synthesized Data
Here we report a simulation study designed to evaluate the performance of the Bayesian quantile regression model. Set and . Different quantile curves are considered in this section.
Scenario (1):  

Scenario (2):  
.  
Scenario (3):  
Scenario (4):  
Scenario (5):  
Scenario (6):  . 
For each scenario, we generate the data as follows. Firstly, we select 100 different from 351 datasets . Secondly, for each dataset , delete the data , and keep as the synthesized one. Then given , we simulate
from a Normal distribution N
, whereis the cumulative distribution function of the standard normal distribution. This setting guarantees that the
and conditional quantile curves of are and . For each dataset , repeat the above process 10 times. Hence we get 1000 synthesized datasets.Each time, we run two independent MCMC chains and then evaluate the MPSRF to monitor the convergence of Markov chains. After the Markov chains converge, we get the mean estimator of and at all points. Table 2 shows the coverage rates (CR) of confidence intervals of and . We list the averaged Mean Square Error (A MSE) of and
and their variances (V MSE)over all synthesized data sets within each scenario in Table
2. [7] discards the possible correlation between and and directly use and quantiles of all points to define the confidence interval and the results are also listed in Table 2 for comparison.3.2 Real Data
In this section, we apply our method to all the ADS of 84 different mutanttype individuals. We also apply method from [7] to do the comparison of detected results. We list part of the results in Table 3, and complete results can be found in Supplementary Materials.
ABalpaap 
ABarppaa 
ABarpppa 
ABarpaa 
ABplpapp 
ABarpapp 
ABplaaap 
ABalapap 
ABalaap 
ABalaapp 
ABalappa 
ABalapp 
ABalppa 
ABaraaaa 
ABarapaa 

pop1  
gei17  
egl27  
ceh43  
tbp1  
tbx37+tbx38  
cdk8  
F11A101  
lin12+glp1  
lin23  
ceh13  
epc1  
frg1  
snfc5  
dpy28  
cbp1  
let526  
cblin40 
Figure 4 is a typical example, showing the difference between two methods.
Furthermore, we binarize the results with
indicating abnormal ADS detected by our method and summarize them into a binary vector for each mutant type with each entry corresponding to one type of ADS. Then we use the data to cluster the mutant genes into 3 types by kmodes method proposed by
[20] for discrete data. We analyze the significance of our clustering result by Gene Set Enrichment Analysis through DAVID developed by [21, 22]. Table 4 demonstrates the analysis results and the details of clustering result can be found in Appendix 5. Type 3 is not significant in any functional module which may be derived from only 4 genes involved.4 Conclusion
We provide a principled automatic procedure to detect abnormal ADS of C. elegans
mutant individuals. Simulation studies and real data examples show that our method can estimate the quantile curves precisely and detect the significant outliers efficiently. Gene Enrichment Analysis shows that our clustering results based on abnormal ADS make sense to a certain extent which also validates the importance of ADS. In general, our method can handle most cases well except for the case where the real quantile curve violates our model assumption severely, such as exponential or highorder polynomial function. A nonparametric model may be a good choice for this situation. However, the unsmooth problem of fitted curve is a considerable challenge.
5 Appendix
5.1 Spearman’s rho
Spearman’s rank correlation coefficient or Spearman’s rho, named after Charles Spearman, often denoted by or as , is a nonparametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function. Suppose is an i.i.d. copy of , and convert to ranks , then Spearman’s rho is computed from
where is the difference between rankings.
When ties are present in rankings, we should use average ranking instead of ranking. Let and be the numbers of each ties in and respectively. Denote , . The adjusted Spearman’s rho is showed as follow,
The sign of the Spearman correlation indicates the direction of association between and . If tends to increase when increases, the Spearman correlation coefficient is positive. If tends to decrease when increases, the Spearman correlation coefficient is negative. A Spearman correlation of zero indicates that there is no tendency for to either increase or decrease when increases. When and are perfectly monotonically related, the Spearman correlation coefficient becomes 1.
5.2 Pearson correlation and Spearman’s correlation of synthesized Data
For each scenario, we calculate the Pearson correlation and Spearman’s for synthesized datasets. Figure 5 shows the histograms of Pearson correlation and Spearman’s correlation under different Scenarios.
5.3 MCMC convergence
Visual assessment of the convergence of the MetropolisHastings algorithm is shown as following.
Figure 6 shows that the time series plots of , b, , iteration traces. After initially examining the plots of the MCMC chain, we used the more formal MPSRF statistic to assess convergence of the MCMC chain. This statistic compares the variances between and within chains to monitor convergence. The value of MPSRF should certainly not exceed 1.2 as suggested in the literature. In fact, after 80000 interation, we get that MPSRF take values nearly 1.101. The time series plots together with satisfactory values of the MPSRF statistic gave us confidence that the MetropolisHastings algorithm was producing realizations from the posterior distribution .
5.4 Gene Clustering
Details of clustering are demonstrated in Table 5.
References
 [1] Bao Z, Murray JI, Boyle T, Ooi SL, Sandel MJ, Waterston RH. Automated cell lineage tracing in Caenorhabditis elegans. SciencesNew York. 2006;103(8).
 [2] Murray JI, Bao Z, Boyle TJ, Boeck ME, Mericle BL, Nicholas TJ, et al. Automated analysis of embryonic gene expression with cellular resolution in C. elegans. Nature methods. 2008;5(8):703–709. doi:10.1038/nmeth.1228.
 [3] Murray JI, Boyle TJ, Preston E, Murray JI, Boyle TJ, Preston E, et al. Multidimensional regulation of gene expression in the C . elegans embryo. Genome Research. 2012; p. 1282–1294. doi:10.1101/gr.131920.111.
 [4] Long F, Peng H, Liu X, Kim SK, Myers E. A 3D digital atlas of C. elegans and its application to singlecell analyses. Nature methods. 2009;6(9):667–72. doi:10.1038/nmeth.1366.
 [5] Liu X, Long F, Peng H, Aerni SJ, Jiang M, SanchezBlanco A, et al. Analysis of Cell Fate from SingleCell Gene Expression Profiles in C. elegans. Cell. 2009;139(3):623–633. doi:10.1016/j.cell.2009.08.044.
 [6] Spencer WC, Zeller G, Watson JD, Henz SR, Watkins KL, Mcwhirter RD, et al. A spatial and temporal map of C. elegans gene expression — Genome Research. Genome Research21. 2011; p. 325–341. doi:10.1101/gr.114595.110.Freely.
 [7] Ho VWS, Wong MK, An X, Guan D, Shao J, Ng HCK, et al. Systemslevel quantification of division timing reveals a common genetic architecture controlling asynchrony and fate asymmetry. Molecular systems biology. 2015;11(6):814. doi:10.15252/msb.20145857.
 [8] Hu J, Zhao Z, Yalamanchili HK, Wang J, Ye K, Fan X. Bayesian detection of embryonic gene expression onset in C. elegans. Annals of Applied Statistics. 2015;9(2):950–968. doi:10.1214/15AOAS820.
 [9] Ren H, Zhang H. Wnt signaling controls temporal identities of seam cells in Caenorhabditis elegans. Developmental Biology. 2010;345(2):144–155. doi:10.1016/j.ydbio.2010.07.002.
 [10] Gleason JE, Eisenmann DM. Wnt signaling controls the stem celllike asymmetric division of the epithelial seam cells during C. elegans larval development. Developmental Biology. 2010;348(1):58–66. doi:10.1016/j.ydbio.2010.09.005.
 [11] Ambros V. The temporal control of cell cycle and cell fate in Caenorhabditis elegans. Novartis Foundation Symposia. 2001;237:203–214.
 [12] Yu K, Moyeed RA. Bayesian quantile regression. Statistics & Probability Letters. 2001;54(January):437–447. doi:10.1002/jae.1069.

[13]
Dunson DB, Taylor Ja.
Approximate Bayesian inference for quantiles.
Journal of Nonparametric Statistics. 2005;17(3):385–400. doi:10.1080/10485250500039049.  [14] Kottas A, KrnjajiĆ M. Bayesian semiparametric modelling in quantile regression. Scandinavian Journal of Statistics. 2009;36(2):297–319. doi:10.1111/j.14679469.2008.00626.x.
 [15] Thompson P, Cai Y, Moyeed R, Reeve D, Stander J. Bayesian nonparametric quantile regression using splines. Computational Statistics and Data Analysis. 2010;54(4):1138–1150. doi:10.1016/j.csda.2009.09.004.

[16]
Takeuchi I, Le QV, Sears T, Smola AJ.
Nonparametric Quantile Regression.
Journal of Machine Learning Research. 2005;7:1001–1032.
doi:10.1080/10485259508832642.  [17] Green PJ, Silverman BW. Nonparametric Regression and Generalized Linear Models. London: Chapman and Hall; 1994.
 [18] Koenker R. Quantile Regression. Cambridge: Cambridge University Press; 2005.
 [19] Brooks SP. General Methods for Monitoring Convergence of Iterative Simulations. Journal of Computational and Graphical Statistics. 1998;7 %6(4):434–455.
 [20] Huang Z. A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. In: In Research Issues on Data Mining and Knowledge Discovery; 1997. p. 1–8.
 [21] Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nature Protocols. 2009;4(1):44–57.
 [22] Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acid Research. 1998;37(1):1–13.