The increasing demands to analyze high dimensional data with complex structures have facilitated the development of novel statistical models for functional data, in which the outcomes can be interpreted as samples of random functions. Time series, longitudinal, and spatial data are some typical examples of functional data. One common feature among functional data is that, often, the linear regression does not appropriately explain the correlations between the outcomes that are close in the inputs of the function. The correlation is often expressed through a mapping from the functional inputs to the associated outcomes, usually modeled as a stochastic process, and the correlations between nearby inputs are captured through a covariance matrix. One natural choice of such stochastic process is the Gaussian stochastic process (GaSP), which has been widely used in many applications(Sacks et al., 1989; Bayarri et al., 2009; Gelfand et al., 2010).
GaSP models have also been popular in analyzing functional data with multiple functional outcomes, in which independent GaSP models are generally built separately for each outcome for simplicity. A more sophisticated approach is to define a separable GaSP model, where the correlations between functions and between inputs are modeled separately using a matrix normal distribution(Conti and O’Hagan, 2010)2008), with the weights of the basis functions modeled as independent GaSPs to model the correlation structure over the input space. This construction results in nonseparable covariance structures, meaning that the covariance matrix cannot be decomposed as a Kronecker product of two small covariance matrices.
For large-scale data, a GaSP model is often computationally expensive: the evaluation of the likelihood requires computational operations to compute the inverse of the covariance matrix, where is the number of observed data points. To ease the computation, many approximation methods have been proposed, including low rank approximation (Banerjee et al., 2008), covariance tapering (Kaufman et al., 2008), use of Gaussian Markov random field representations (Lindgren et al., 2011), and likelihood approximation (Eidsvik et al., 2013). Those approximation methods are sometimes preferred for computationally intensive problems, however, the exact computation is more desired if we can overcome the computational operations.
Our motivating study is to impute millions of DNA methylation levels at CpG sites across the human genome. DNA methylation is an epigenetic modification of DNA, playing important roles in DNA replication, gene transcription, aging, and cancer evolution (Das and Singal, 2004; Scarano et al., 2005). Methylation levels are quantified at every genomic CpG site, a region of DNA where a cytosine (C) nucleotide is followed by a guanine (G) nucleotide in the linear sequence of bases along its 5’ to 3’ direction. Single-site DNA methylation level can be quantified by whole-genome bisulfite sequencing (WGBS), in which approximately 26 million CpG sites in the human genome are evaluated for whether they are methylated or not. However, WGBS is expensive and hard to examine in certain genomic regions. This motivates alternative methylation assay technologies, such as Illumina HumanMethylation450 BeadChip (henceforth, Methylation450K) that measures DNA methylation levels at approximately CpG sites (less than of the total number of CpG sites). The goal is to impute DNA methylation levels at the CpG sites that are observed in the WGBS samples but unobserved in the Methylation450K data by exploiting the correlations among the full set of CpG sites in the WGBS samples.
The empirical correlation of methylation levels with distance smaller than 5000 bases in the WGBS samples is shown in the left panel of Figure 1. For each integer distance, we calculate the empirical correlation of every possible pair with this distance. The blue dots in the left panel are the average correlation of the methylation levels between two CpG sites at a given CpG distance smaller than 5000 bases. The methylation levels at nearby CpG sites are correlated to each other on average and the correlation gradually decays as the distance between the two CpG sites increases. Such phenomenon is called co-methylation and has been observed in previous studies (Zhang et al., 2015). Furthermore, as shown in the right panel of Figure 1, methylation levels for each CpG site across samples are well correlated for biological reasons. Since methylation plays important roles in suppressing gene expression levels, they are tightly regulated in cells and variability of such regulations is associated with disease risk (Das and Singal, 2004). Understanding the correlation patterns in methylations levels is thus meaningful for reducing the risk of diseases. These empirical findings motivate us to develop a statistical model to exploit the correlations of methylation levels across the unequally-spaced genome sites and across different samples for the goal of imputation.
In this paper, we develop a computationally efficient model to impute the methylation levels. Our contribution is three-fold. First, we propose a nonseparable GaSP model to integrates different correlation structures among the methylation levels across genome sites and across samples into a coherent model, while the previous regression method (Zhang et al., 2015) ignores the correlations across samples. Second, we develop an ultra-fast algorithm that computes the exact likelihood of the proposed nonseparable GaSP model with computational operations and storage when using a large class of covariance functions, building upon the connection between the Gaussian random field and Gaussian Markov random field (Hartikainen and Sarkka, 2010). Lastly, the proposed nonseparable GaSP model can be seen as a general statistical framework that unifies the linear regression and separable GaSP models. Though we focus on methylation levels imputation in this work , the methodology is widely applicable to many other studies, such as estimating the response curve from the electronic health record data, where each patient’s visit is modeled as a function of time (Xu et al. (2016)).
The rest of the paper is organized as follows. In Section 2, we study a class of nonseparable GaSP models. The closed form marginal likelihood and predictive distribution are derived for the imputation problem. In Section 3, the computational strategy for this class of nonseparable GaSP models is introduced, for which the computation scales linearly in the number of inputs of the function without approximation. In Section 4, we unify some other frequently used approaches, such as the linear regression and separable GaSP models, under the framework of the nonseparable GaSP models. Numerical examples and comparisons to alternative methods are provided in Section 5. We conclude the paper with discussion and future extensions in Section 6.
2 Modeling multiple functional data
Let be the methylation level of the sample at the CpG site, recording the proportion of probes for a single CpG site that is methylated, for , . Define two groups of sites, and , where the methylation levels of are observed for all samples and the methylation levels of are only available for the first samples but not available for the last samples. The total number of samples is and the total number of CpG sites is .
For the first samples, methylation levels are measured at all CpG sites, meaning that we observe and . However, for the remaining samples, the methylation levels are only observed at a small subset of CpG sites, denoted as . The methylation levels at the remaining CpG sites (
) of these samples are unknown. Our goal, then, is to interpolate the unobserved methylation levels of thesesamples using their observed methylation values at CpG sites and the full methylation values from the other samples. In other words, we seek the predictive distribution of conditional on , and .
The imputation of methylation levels across the whole genome is computationally challenging due to the large number of CpG sites. In the full WGBS data set, there are about CpG sites; even in the smaller Methylation450K data, there are roughly CpG sites, creating computational challenges. In contrast, the number of samples we are working with is relatively small: samples in the WGBS data and samples in the Methylation450K data. The key advantage of our method is that the computation required for imputation scales linearly in terms of the number of CpG sites.
Here we make several extensions of a class of GaSP models with the nonseparable structure, which has been used for modeling multivariate spatially correlated data and functional outputs (Gelfand et al., 2004; Higdon et al., 2008). First of all, we construct a flexible way to incorporate the correlations across samples and across sites for prediction, with closed form expression of the marginal likelihood and predictive distribution. These expressions enable us to establish the connection between this nonseparable model and other models, such as the linear regression and separable models, discussed in Section 4. Furthermore, we introduce a computationally feasible approach to large-scale problems with inputs (CpG sites) up to a million without approximating the likelihood function. The proof of this section is given in Appendix A. Without loss of generality, we assume the data are centered at zero. An extension to combine site-specified features is given in the supplementary materials.
2.1 Nonseparable GaSP model with homogeneous noises
We start with the model for the functional output, , for every site
where , is a matrix with being the basis function (vector) specified later, and . As shown in Figure 1, the correlation of the methylation levels at nearby CpG sites decays as the genomic distance increases. This motivates us to model each weight function independently as a zero mean GaSP
is an unknown variance parameter andis the correlation between sites. One popular choice of the correlation function is the Matérn kernel
where is the gamma function, is the modified Bessel function of the second kind with a roughness parameter , is a range parameter, and for any . The roughness parameter of the Matérn kernel controls how smooth the stochastic process is. When with , the GaSP with a Matérn kernel is sample path differentiable and the Matérn kernel has a closed form expression in these scenarios. For instance, the exponential kernel is equivalent to the Matérn kernel with and the Gaussian kernel is the Matérn kernel with . The flexibility of the Matérn kernel makes it widely applicable for modeling spatially correlated data (Gelfand et al., 2010).
Denote . Following Higdon et al. (2008)
, we apply the singular value decomposition (SVD) toand estimate as , for the following reasons. First of all, the computational order of estimating is linear to , which is essential when is at the size of . Secondly, we have if , and hence is a diagonal matrix, which substantially simplifies the computation of the likelihood. Moreover, , unifying the linear regression model under the framework of the nonseparable model shown later in Remark 1 of Section 4.
One can marginalize out the weight matrix explicitly for computing the likelihood. We first vectorize outputs and weight matrix , both of which are -dimensional vectors. Define a matrix . By simple algebra, model (1) can be written as
where and , with the entry of the being . Here , where means the block diagonal matrix between sites, and is the correlation matrix, . As shown in Higdon et al. (2008), directly marginalizing out leads to the sampling model, The straightforward computation of the likelihood, however, requires to evaluate the inverse of a covariance matrix , which is computationally infeasible. We have the following lemma to ease the computational challenge.
Assume and as the SVD decomposition. After integrating out , the marginal likelihood of in model (1) follows a product of independent multivariate normal distributions,
where denotes the multivariate normal density with mean and covariance , is the transpose of the row of .
Lemma 1 states that the marginal likelihood by model (1) can be written as a product of multivariate normal densities, which simplifies the computation. In particular, instead of computing the multivariate normal densities with a covariance matrix, one can evaluate the densities by independent multivariate normal distributions, each of which has an covariance matrix. The direct computation of the inverse of an covariance matrix, however, is still very hard in general, when is at the size of . An efficient algorithm that computes the exact likelihood will be provided in Section 3.
The goal of imputation is to find the predictive distribution at an unexamined site conditioning on the available data. Denote , where and are the column of and respectively. We have the following lemma.
We assume the same conditions in Lemma 1.
For every , one has
Here and , where with and being the transpose of the row of , , is a diagonal matrix with as the diagonal term and .
Denote the partition and . For every , the predictive distribution of the unobserved follows
where and .
In the methylation levels imputation study, can be used as predictions for for any site , by properly conditional on all observations.
2.2 Nonseparable GaSP model with heterogeneous noises
The nonseparable model (1) assumes a shared noise parameter , which is typically very restrictive. We generalize the model by assuming different noise parameters as follows
with for any , each weight function being assumed the same as the previous independent GaSP in (2) and being an independent zero-mean Gaussian noise with variance . We still assume where and are defined through the SVD decomposition of . In model (6), follows a zero-mean GaSP with noise
with the covariance , where is the nugget-variance ratio and is defined in (3) with a fixed and an unknown range parameter , for . Replacing with in Lemma 1, the marginal likelihood of can be written as a product of independent multivariate normal densities,
where is the transpose of the row of . The marginal likelihood in (5) is a special case of the marginal likelihood in (8) when for . Note that the above model is not identifiable between , and . In the following, we simply constrain to avoid the potential identifiability issue.
The held-out methylation levels (red circles) and prediction of methylation levels (black circles) by the nonseparable GaSP model for randomly selected 4 samples at 100 CpG sites. The 95% posterior predictive interval is dashed as the shaded area.
The predictive distributions of model (6) at unobserved CpG sites takes almost the same form in Lemma 2 by replacing and with and respectively. Figure 2 plots the predicted methylation levels as the black circles at 100 held-out CpG sites for 4 samples, with the examined methylation levels marked as the red circles. The prediction by the nonseparable GaSP model captures the pattern of the methylation levels reasonably well, with an adequate length of 95% predictive interval, graphed as the shaded area. A more detailed comparison between the nonseparable GaSP model and other computational feasible alternatives is given in Section 5.
3 Computational strategy
The computations of the likelihood in model (6) require to compute the inverse of , each with operations, making the implementation impractical when is large. We introduce a computationally efficient algorithm, based on the connection between the Gaussian random field and Gaussian Markov random field (GMRF). The idea is introduced in Whittle (1954, 1963); Hartikainen and Sarkka (2010), where the Matérn covariance was shown to be Markov. Unlike many other methods, no approximation to the likelihood is needed in this approach. We briefly review this computational strategy and extend it to compute the exact likelihood of the nonseparable GaSP model with linear operations.
Consider a continuous auto-regressive model with order , defined by a stochastic differential equation (SDE),
where is the derivative of and
is the standard Gaussian white noise process defined on. Here we set to avoid the nonidentifiability issue. The spectral density of equation (9) is , where is the imaginary number, and the operator is defined by . The form of the above spectral density is , which is a rational functional form. It has been shown in Whittle (1954, 1963) that the spectral density of GaSP with the Matérn covariance is
where with the range parameter and the roughness parameter . The spectral density in (10) follows a rational functional form, meaning that we can utilize the GMRF representation for computation, elaborated in the following subsection.
3.1 The computation by continuous time stochastic process
Here we assume the Matérn kernel with as for the demonstration purpose
with for any . The computational advantages introduced in this subsection hold for Matérn kernel with for all .
where being an independent noise with , for . Denote , where is the derivative of with regard to , . For each , GaSP with the correlation defined in (11) follows an SDE
where is a zero-mean Gaussian white noise process with variance and . The closed form expression of and is given in the Appendix B. Denote and . The solution of the above SDE can be represented explicitly as
where and for . The stationary distribution of is , with All , , , and the likelihood of are derived in the supplementary materials.
With the above setup, the posterior for , , can be computed by a forward filtering and backward sampling/smoothing (FFBS) algorithm (West and Harrison, 1997; Petris et al., 2009), which only requires computational operations, a lot smaller than operations in the direct computation of the GaSP model. The prediction at also only requires linear computational operations to the number of sites, so the total computational operations are only altogether. Furthermore, can be explicitly marginalized out for all instead of the posterior sampling, discussed in Section 3.2.
Figure 3 compares the posterior means of by the FFBS algorithm and direct computation of GaSP for some given set of parameters. Since they are the same quantities computed in two different ways, the difference only depends on the machine precision, which is extremely small. The main advantage of our method is that all summary statistics of interest, such as the posterior predictive mean and variance, as well as the marginal likelihood can be computed exactly.
The computational time between them, however, differs significantly. As shown in Figure 4, the computation by the FFBS algorithm is a lot more efficient than the direct evaluation of the likelihood, which requires for matrix inversion. For instance, when , evaluating the likelihood by the FFBS algorithm only takes around 0.1 second on a laptop, while the direct computation takes around 60 seconds.
Note when , the Matérn covariance matrix in (11) and its inversion are both dense matrices with rank . However, the covariance matrix of the latent states, is sparse, as shown in supplementary materials. The result holds for all Matérn classes when , with .
3.2 Parameter estimation
The most computationally intensive part of the FFBS algorithm is to sample latent states for and . Fortunately, one can avoid it by marginalizing out the latent states explicitly,
each term of which follows a normal distribution given in the one-step look ahead prediction in the FFBS algorithm (West and Harrison, 1997). The marginal likelihood can also be computed with linear computational operations in terms of , allowing us to evaluate the likelihood without making an approximation.
To complete the model, we assume the prior as follows,
Denote and . We assume the jointly robust prior (Gu, 2016; Gu et al., 2018) for the transformed parameters , for , where , and are prior parameters. This prior approximates the reference prior in tail rates and is robust for posterior mode estimation (Gu et al., 2017). For the prior parameter, we use the default choice , , and , where is the length of .
is large, the Markov chain Monte Carlo (MCMC) algorithm is still very slow for sampling the posterior distribution. We estimate the parameters by the posterior mode,
The posterior mode of the variance parameters is with . The posterior mean and MLE of are and , respectively. When is large, these estimations are almost the same.
4 Unification of linear regression, separable model and nonseparable model
In this section, we show the linear regression model that was used for methylation level imputation (Zhang et al., 2015), and the separable GaSP model used in computer model emulation (Conti and O’Hagan (2010); Gu and Berger (2016)) are both special cases of the nonseparable GaSP model introduced in Section 2.
4.1 Linear regression strategies
Assuming that the samples are independent to each other, a simple model is to apply the linear regression separately for each CpG site , , as follows,
where are the covariates for the CpG site of the sample and is an independent mean-zero Gaussian noise. In Zhang et al. (2015), some site-specific features, such as methylation levels at nearby CpG sites, are used as covariates for imputation. This approach assumes that methylation levels are independent across samples at every CpG site. However, methylation levels of different samples at a CpG site are generally correlated, as shown in Figure 1. Numerical results will be shown that exploiting the correlation between samples leads to drastic improvement in imputation in Section 5.
Alternatively, a regression model that exploits the correlations across samples follows
with . Here each site is treated independently. Assume the methylation levels were first centered to have a zero mean and let , meaning only the methylation levels of the samples with full observations are used as covariates, the model for can be expressed where is the row of and . The least squares (LS) estimator of is , where being the row of . The predictive mean of the methylation level of the sample at the unexamined site, , follows
with being the column of , for , . The following remark establishes the connection between the linear regression and nonseparable GaSP model.
, where and are defined through the SVD decomposition of .
is the realization of independent mean zero Gaussian noise with the same variance.
4.2 Separable model
Denote as a matrix where each row is a sample of methylation levels at CpG sites. Compared to the regression models, the following joint model is sometimes preferred
where is a zero-mean independent noise and is a random matrix modeled from a matrix-variate normal distribution, , with a mean matrix , a row covariance matrix , and an column correlation matrix .
We call model (19) the separable model, as the correlations across samples and across sites are expressed separately by and respectively. Assuming that is modeled by a correlation function, where the entry , with being the correlation function. The following remark states the separable model defined in (19) is a special case of the nonseparable model in (1).
5 Numerical comparison
We evaluate the nonseparable GaSP model in (6) and compare to several alternative methods: two linear regression strategies (by site in (16) and by sample in (17)), nearest neighborhood method (using only the observed methylation level closest to the unobserved site for prediction) and two regression strategies by random forest (Liaw and Wiener, 2002). We also introduce a localized Kriging method by partitioning the data into small blocks, and compare it with the nonseparable GaSP model in the supplementary materials.
The performance of these methods is evaluated on the out of sample prediction for WGBS data and Methylation450 data based on the following criteria
where for and , is the held-out methylation levels of the sample at the CpG site; is the predicted held-out methylation level of the sample at the CpG site; is the
posterior credible interval; andis the length of the posterior credible interval. An effective method is expected to have small out-of-sample RMSE,