Over the past years, there has been a significant development in high-throughput omics technologies e.g. metabolomics, transcriptomics, genomics, epigenomics and proteomics along with a growing interest into joint modelling of multi-omic data joyce2006,ebrahim2016. In metabolomics, several approaches are used to understand the response of a biological system as a function of an internal or external perturbation by monitoring “the chemical fingerprints that specific cellular processes leave behind” daviss2005. These chemical fingerprints are most commonly interrogated in terms of metabolite (i.e. low weight molecules) concentration, structure and transformation pathways (i.e set of chemical reactions) in order to identify a biomarker related to the studied process. Biomarker discovery consists of identifying a metabolite that has significant association patterns with a particular phenotype (disease, clinical variables, physical traits, etc) and that can be thus used as an indicator of that specific phenotype. Typical experimental platforms use analytical techniques such as nuclear magnetic resonance spectroscopy (NMR) reo2002 and mass spectrometry (MS) dettmer2007 to generate appropriate spectral metabolomic profiles of the studied biological system.
Metabolomic data sets are characterized by high correlation structures in that many spectral peaks can arise from the same metabolite and metabolites operate within networks of chemical reactions. In addition, further correlation structure is present in longitudinal metabolomic studies due to repeated measurements of observations over time. Additional challenges include not only the low number of time points and samples compared to the number of profiled metabolic variables, but also integration of a different omic data to the metabolomic data.
First, metabolomic time series are often short due to experimental costs or ethical considerations. Typically, less than 10 time points are available compared to a large number of metabolic variables profiled at each time point e.g. hundreds of metabolic variables for targeted experiments and thousands of metabolic variables for untargeted experiments. Taking into account the small number of time points compared to the large number of metabolic variables profiled, the number of temporal patterns that will arise is limited (due to the limited number of degrees of freedom). Some temporal patterns will be repeated and thus these patterns can be induced by randomness. Second, models fitted to a small number of data points are often prone to overfitting i.e. the model is very sensitive to small fluctuations. This can lead to a poor fitting to unseen data and a high generalisation error. Third, it is also important to consider the number of parameters of the statistical model and make use of the simplest models in order to avoid overparametrisation. Finally, monitoring heterogeneous omic variables can substantially enhance the understanding of the underlying biological mechanism and provide a systems biology approach as these omic variables represent entities that are often involved in related cellular processes joyce2006, ebrahim2016.
For all these reasons, metabolomics scientists need cautious estimates model parameters in order to ensure robust interpretation of the results. Hence, appropriate models are needed to integrate heterogeneous omic data and take into account both experimental conditions and biological variations in order to extract important information from the data. In this paper, we are interested in dose-response time course experiments where additional omic variables (bacteria, genes, transcripts, etc) are monitored along with metabolites in the context of biomarker discovery.
The main contribution of our work is a single probabilistic generative model that i) can overcome overfitting via the use of weakly informative priors ii) makes use of mixed effects models to model the experimental design iii) models metabolite interactions by using pathway information through a conditional auto-regressive (CAR) component and iv) uncovers multiple associations between metabolites and other omic variables by using a horseshoe prior. An additional benefit of our approach is that it naturally yields a list of perturbed metabolic pathways since it is based both on a mixed effects component and a CAR component.
2 Related Work
There is a growing interest in longitudinal experiments for heterogeneous omics data and statistical models to infer biomarkers of a particular treatment or disease over time. Different approaches attempt to infer influential or significant metabolites using dynamic metabolomic data under the assumption that metabolites are independant. These models include fitting smooth splines mixed effects models (SME) to time curves berk2011 and linear mixed effects models augmented with a variable selection approach Mei2009 . However between-metabolite correlation is richly structured and biologically relevant and should be modelled.
Seemingly unrelated regression accounts for metabolite correlation by using correlated regression errors and can be used to identify biologically significant metabolites Chen2015, Chen2017. In gene expression data analysis, [Pham2015]
recently proposed to use confirmatory factor analysis to capture gene-pathway relationship and a conditional autoregressive model to capture relationships between a set of pathways where pathway network has been constructed based on KEGG kanehisa2000 pathways. The latter accounts for biological variation in the data and allows ease of interpretation.
In the metabolomics literature, traditional frameworks for metabolomic data analysis use dimensionality reduction techniques, namely principal component analysis (PCA), partial least squares (PLS) wold1983 and PLS derived models (OPLS trygg2002, O2PLS trygg2003, OnPLS lofstedt2011) to take into account high correlation between metabolites. Extension of PLS to O2PLS and OnPLS allows for integrative analysis of heterogeneous omic data. PCA and PLS models are very popular among metabolomicists as tools for exploratory data analysis, visualisation purposes and ease of interpretation. One of the interests of PCA (PLS) derived models is to be able to visually assess whether or not there is a time effect in the data and identify metabolites that change over time by looking into time trajectories of each metabolite antti2002.
Extensions to PCA (PLS) for longitudinal analysis include lagged PCA (PLS) and dynamic PCA (PLS) where a backshift matrix is introduced to take into account time dependancy. Similarly, [tim2008] used a set piecewise orthogonal projections latent structures to describe changes between neighboring time points. PARAFAC bro1997 is a multi-linear unsupervised decomposition method that can account for the multi-way variation seen in dynamic metabolomics data. Recently, dynamic probabilistic PCA (DPPCA) was proposed in [2014dynamic] as a generative probabilistic model of longitudinal metabolomic data where a stochastic volatility model is used for the latent variables. The main inconvenience of these approaches is that further techniques such as multiple testing correction have to be separately applied to the data in order to identify biomarkers and they do not take heterogeneous data (i.e. data from different omics techniques) into account.
In contrast, we here provide a single model that takes into account metabolite interactions, time variation and experimental design, infers influential metabolites and also quantifies relationships between metabolites and additional omic variables (if any).
3 Model Framework
Given a metabolomics data where is the number of observations, the number of time points and the number of metabolites. is an additional continuous omic data measured along with where is the number of associated omic variables. The set of observations consists of a set of cases and controls. Throughout the paper, index always runs through observations, index runs through time points, index runs through metabolites and index runs through
variables. Vector quantities are written in bold. Matrices are written in bold capitals. Our goal is to build a simple model that can identify biomarkers relative to a specific treatment in time taking into account the multiple sources of variations in the data.
The model is built on three levels: First, a CAR component to capture interaction between metabolites. Second, a variable selection model to uncover associations between metabolites and data. Third, a mixed effects component to model experimental design. We give more details about each level of our model respectively in each of the following sections.
3.1 Metabolite interactions
In a similar fashion to [Pham2015]
, we model metabolite interactions via the CAR model. In fact, incorporation of pathway information in the CAR model through the variance matrix helps provide chemists with an easily interpretable model. First, we assume that the concentration of each metabolite is linearly influenced by concentration levels of metabolites in the same pathway. Letbe the design matrix quantifying metabolite interactions such that matrix elements , if metabolites and are in the same pathway and otherwise. Thus, metabolite intensity levels can be expressed as:
where represents measurements of metabolites of sample at time point excluding metabolite , is a function of covariates of sample for metabolite at time point . If we define thecan be explicitely written as cressie2015:
Chemists are most interested by identifying which pathways are “on” or “off” as an effect of treatment. In the CAR literature, the design matrix can be modeled as a scaled product of a diagonal weight matrix and an adjacency matrix. In order to allow for pathway perturbation inference we construct the distance matrix based on the individual contribution of each pathway. To be precise, we define where is the number of pathways. The distance matrices are a zero-diagonal symmetric adjacency matrices with elements equal to the inverse of the length of the shortest path between metabolites and if they are in pathway and otherwise. A path between two metabolites consists in the number of reactions that lead from one metabolite to the other. The shortest path is the path that contains the smallest number of reactions. In the diagonal matrices we use the reciprocal of the number of neighbors of each metabolite in pathway i.e so that the squared partial correlation is reduced when more metabolites from the same pathway are profiled cressie2015. The model parameter is estimated. It quantifies pathway contribution and is referred to as spatial-dependence parameter in the CAR literature.
The model needs to comply with the condition that is positive definite. If we assume that pathways are a priori equally perturbed, must fall in the interval where and
are the minimum and maximum eigenvalues of, respectively. In practice, strong interaction between observed metabolites of pathway is reproduced in CAR models only when the scaling parameter is quite close to one of the boundaries . Hence, we use a beta-type prior for that places substantial mass on large values of banerjee2014 :
where B is the beta function. The parameter captures variance heterogeneity in metabolite intensities and is given an inverse gamma prior . This prior provides pseudo-observations in addition to available observations. In order to build a reasonably informative prior we set .
3.2 Integrative analysis
Interactions between heterogeneous omic variables such as transcripts and metabolites or gene expression and metabolites is modeled via the following hierarchical shrinkage model:
where denotes the half Student-t distribution with degrees of freedom, represents treatment effect for metabolite , represents individual perturbations for metabolite , follows and auto-regressive process and represents temporal effects for metabolite of individual at time point . Finally, quantifies interactions between metabolite and other omic variables. is called local shrinkage parameter whilst is the global shrinkage parameter. For , this prior reduces to the horseshoe prior. Intuitively, for small values of the coefficient is very close to while for relevant variables will be large. In addition, controls the overall shrinkage level i.e sparsity of the vector is more important for small values of .
Define a random shrinkage coefficient such that when is large and when is small. This transformation implies the following prior distribution on :
This prior density is shown in figure 1 for different values of and . It reduces to a Beta distribution if and to a Beta which looks like a horseshoe, if in addition . When increases, Beta skews towards which increases the global shrinkage power. The expectation of given can be expressed as:
where and is a diagonal matrix of order with elements . Equation (8) introduces a penalty term where is a metabolite specific penalty term introduced by the horseshoe prior and is a global penalty term. Precisely, captures the overall sparsity level amongst all metabolites. The expectation of given is very similar to the estimate of
under ridge regression wheresimply reduces to .
The global sparsity level can be controlled using . Increasing the global sparsity level is a desired property in omic studies, as usually we deal with a large number of omic variables where only few are important. Moreover, when there is prior knowledge available, specifying a priori can optimize the inference and additionally, provide a more informative prior on . If we fix , integrating over gives the expected value of as :
where is Meijer’s G-function meijer1936. The equation above can be used to fix a priori by defining the expected proportion of shrunk coefficients. In practice, different values of are plugged into the equation above to get the desired proportion of shrunk coefficients. However, many definite integrals can be obtained using the tables of Meijer functions in [brychkov2008] for special values of parameters.
3.3 Experimental design
The covariance structure between metabolites might change drastically as a result of treatment if the latter affects metabolic pathways. The model can be extended to take into account the experimental design. As specified in the previous section, captures the treatment effect for metabolite , represents individual perturbations for metabolite , represents temporal effects for metabolite of individual at time point in equation (5). In addition, we allow covariance structures to be different for the control samples and the cases where designates experimental groups. This yields the overall hierarchical model:
Note that by specifying different dependence parameters for metabolite interactions in cases and controls metabolism, the model is able to identify perturbed pathways by comparing and .
In this section we perform experiments on both synthetic and real data to investigate whether our algorithm gives reasonable solutions. We first try our method on a simulated dataset in section 4.1 to get an understanding of the performance of our method. In section 4.2, we test our method on a data set using metabolomic and bacterial composition in a drug treatment experiment. In the following, we refer to our model as “ iCARH ” model for “ integrative CAR Horseshoe ” model.
4.1 Simulation study
To get better understanding of our method and test its applicability, we first perform our approach on synthetic datasets. We will mainly focus on the ability of our model to infer pathway perturbation.
Assessing pathway inference with beta like prior
In the first simulation our objective is to assess how the beta like prior in equation (3) improves the iCARH model. We first fixed the number of pathways P to 11, then simulate the design matrices . Specifically, a membership matrix with dimensions is randomly generated based on the density of the number of KEGG pathways in which a single metabolite is involved. Each design matrix is then equal to where is the th column of . Finally, we generated 10 datasets according to the model below in order to assess how our model infers perturbed pathways:
denotes the truncated normal distribution with boundaries, , and are the minimum and maximum eigenvalues of . The rest of the parameters is set as follows : number of bacterial variables , number of metabolites , number of time points , number of samples , global parameter fixed to 1.2, parameters , , , simulated according to equations (14), (13), (10), (9) respectively.
We set non-informative uniform priors on , , , . We set an informative prior on as we expect low variability amongst biological samples of the same group. We fix to the proportion of expected perturbed pathways. is fixed to a value of , to and to a large value. We compare inference of the model under a uniform prior for and the prior in equation (3). Inference is done using 2000 iterations of Hamiltonian Monte Carlo sampling and 1000 warm-up iterations.
The left plot in figure 2 shows the boxplots of the Area Under the Curve (AUC) for pathway perturbation inference for 10 simulated datasets with uniform prior on and a beta like prior on
. We infer perturbations based on the posterior probability thatand are different i.e. the credible interval of
does not contain zero. The AUC values for the beta like distribution is significantly higher than the AUC values for the uniform distribution. On average pathway peturbation inference under the uniform distribution reduces to a random guess with an average AUC of 0.53. This is likely due to the lack of variance of the uniform distribution. The right plot in figure2 shows the posterior distributions of under the uniform prior (white) and the beta like prior (grey) for each pathway. True perturbed pathways are printed in bold on the x axis.
Assessing pathway inference against design inaccuracies
It is very common in metabolomics data to find metabolites that are correlated but not in the same KEGG pathway. In the following simulation we assess how inaccuracies in the covariance structure between metabolites and the design matrices affect the iCARH model. We used the 10 datasets from the previous simulation and perturbed the design matrices by selecting a random fraction of metabolites in each pathway. We then randomly (falsely) assign these metabolites to no pathway, or to different pathways. We similarly run the model for 2000 iterations of Hamiltonian Monte Carlo sampling and 1000 warm-up iterations for each of the fractions of perturbed metabolites. Finally, in the same fashion, we assess perturbations based on the credible interval of . Figure 3 is a series of average Receiver Operating Characteristic (ROC) curves across 10 datasets for each of the factions . On average, the performance of our model reduces to a random guess (AUC of 0.5) if of the metabolites in each pathway is perturbed. The AUC of our model reaches 0.97 if no metabolites are perturbed and is about 0.88 if of the metabolites in each pathway are perturbed.
4.2 Case study
In this section, we test our model on an actual metabolomic data and 16S data for bacterial profiles. In this study we are interested in the influence of metformin on a non-diabetic model. Metformin is the first-line medicine to treat type 2 diabetes. It has also been suggested that metformin has anti-cancer, cardiovascular and anti-aging effects. Because of their very large metabolic capacity, the gut bacteria can influence toxicity and metabolism of drugs. Here, we are particularly looking for metabolic biomarkers indicative of microbiota changes as result of treatment.
The study design is as follows: metabolic profiles of 24 rats are acquired once a week using different mass spectrometry techniques from plasma samples over a period of 9 weeks. Bacterial profiles are acquired using miSeq. The study has allowed for two groups of 12 rats where metformin has been administrated to the second group (weeks 3 to 7) allowing for 2 weeks of acclimatation (weeks 1 and 2) and 2 weeks of recovery (weeks 8 and 9). After data processing and metabolite identification, a total of 56 metabolites and 6 bacteria species are further analysed using our model. Inference is done using 2000 iterations of Hamiltonian Monte Carlo sampling.
We assess performance of our model for different values of using the Watanabe-Akaike information criterion (WAIC). Tested values of comprise 1, 1.2, 5, 10 with corresponding WAIC values of 7317.296 , 7322.798 , 7317.457 , 7316.476 respectively. WAIC values are very similar for different values of which suggests to use the most selective model with as it is the simplest i.e with the smallest number of selected variables.
Assessing model fit
In order to assess our model fit, we perform posterior predictive checks of our model compared to DPPCA 2014dynamic. The DPPCA model is a multivariate model using PCA, where PCA scores are modeled via a stochastic volatility model. In the Bayesian framework, posterior predictive checks consist in comparing data simulated from the posterior predictive distribution with the observed data. We compared the simulated data and the observed data by means of mean absolute deviations (MADs) between the observed and the simulated covariance matrices for different numbers of metabolites included i.e only part of the data corresponding to these metabolites is considered. The same process was repeated for inference using the DPPCA model 2014dynamic. Figure 4 shows MADs of our model and the DPPCA model. As expected, MADs for both models decrease when the number of metabolites increases. Overall, our model clearly outperforms the DPPCA model.
In addition to posterior predictive checks, goodness of fit was also checked by using where denotes the Cholesky factor of . Zero-mean and normality were thus checked for (See figure 5).
Since the administrated drug was also profiled using mass spectrometry, we fit the iCARH model with . Figure 6 is a series of boxplots for for metabolites 13 to 31. We are mainly interested in “metabolite 27” as it is associated with some bacteria species.
Figures 7 and 8 show posterior distributions of for each pathway and estimates of effects of bacteria on metabolites. Results in section 4.1 suggest to compare the covariance structure of metabolites in the observed data with the covariance induced by the design matrices in order to have an a priori idea on the robustness of pathway inference (See figure 3). For a correlation threshold of , about of the metabolites are misspecified in the design matrices which corresponds to an AUC around according to figure 3. If we set a higher correlation threshold, a lower number of metabolites will be misspecified. This supports the use of the iCARH model for pathway perturbation inference for this data.
Estimates of effects of bacteria on metabolite profiles are captured by . Some metabolites present significant changes along with the bacterial profiles. For example, “metabolite 27”, a hydroxy fatty acid, is associated with alterations in abundance of 4 bacteria species. Figure 7 shows that, as a result of treatment, KEGG pathways are not significantly altered. However, distributions of for “fatty acids biosynthesis” and “biosynthesis of unsaturated fatty acids” KEGG pathways are remarkably flatter than the distributions of . These pathways involve the previously identified hydroxy fatty acid metabolite. Our analysis confirms previously reported studies that hydroxy fatty acids might be produced by the gut microbiome kishino2013, kimura2013.
Identifying biomarkers in time course metabolic data and inferring significant associations with heterogeneous omic variables is extremely challenging due to the several sources of variations of the data. In addition, existing methods developed to analyze such data are very scarce and have the limitations of i) overfitting to the few available data points or ii) confounding the experimental and longitudinal variation or iii) ignoring the metabolite interactions or iv) ignoring effects of other omic variables. In this paper, the model we have developed combines several approaches to take into account the different aspects of the data namely the number of time points, the experimental variation captured by , interactions between metabolites captured by and interactions with additional omic variables captured by .
Our results demonstrate that our model successfully addresses the main questions of a metabolomic study. Most importantly, our model is able to identify metabolic biomarkers related to treatment, infer perturbed pathways as a result of treatment and find significant associations with additional omic variables. We have shown that providing an informative prior on metabolic pathways and an informative prior over the parameter is a significant improvement over the DPPCA model. Particularly, our model is more robust to slight variations usually observed in short time series data thanks to the small number of covariance parameters needed to estimate compared to DPPCA. We have also shown through simulation that an informative beta like prior compares better than a non-informative uniform prior in inferring significant pathways. On real data, we have investigated how the number of profiled metabolites can affect the predictive ability of the model.
Several potential extensions arise naturally from our model. In terms of the metabolite interactions component, many research questions can arise. Alternative strategies to modeling metabolite interactions can be examined such as modeling the non-zero elements of the adjacency matrix
of each pathway as random variables. This strategy was adopted in the CAR literature by[lee2013, Rushworth2017] to take into account step changes in spatial variation. Step changes can potentially be useful to model changes in metabolites correlations as a result of treatment. [lee2011] provide an overview of different CAR models used in spatial modeling. The proposed models can be adapted to fit into the metabolomics literature.
From a practical point of view, the model has been fitted using HMC sampling but takes a large amount of time (about 1 hour) mostly because of the variable selection procedure and metabolites interdependence. This could be addressed by using variational Bayes. In fact, variational Bayes inference procedures offer cost-effective inference by means of principled approximations and appealing computational time for high dimensional data. A variational bayes inference of CAR models was proposed by[harrison2010] for high dimensional data, and a variational bayes approach for variable selection was recently proposed by [ormerod2017].
Metabolomics longitudinal profiling techniques are imperative to understand the effect of a drug or a disease across time and can provide enhanced understanding of the underlying biology of the system. In a data integration framework, we have illustrated the use of the CAR model to incorporate metabolites interactions in the model and the horseshoe prior to identify association with heterogeneous omic variables obtained by other omic techniques. The combination of the CAR and horseshoe levels yields the “integrative CAR Horseshoe” (iCARH) model which we presented in this article.
Although, it is computationally expensive, the iCARH model has various appealing features such that it is able to identify metabolic biomarkers related to treatment, infer perturbed pathways as a result of treatment and identify potential associations between heterogeneous omic variables. Clearly, these appealing features open up further research topics.
Thanks are due to Panagiotis Vorkas for providing the metabolic and 16S data. Infrastructure support for this work was provided by the NIHR Imperial Biomedical Research Centre.