MicroRNAs (miRNAs) are small non-coding, single-stranded RNAs that function in the post-transcriptional regulation of gene expression. There has been a tremendous and growing interest among researchers to investigate the role of miRNA in normal as well as in disease processes over the last decade. Aberrant expressions of these tiny regulatory RNA molecules do have direct functional implications in carcinogenic transformation or in further tumour progression towards lethal metastatic forms (Iorio and Croce, 2012; Jansson and Lund, 2012). From rigorous genetic studies over the years it has been found that miRNAs can function as both pro-tumorigenic and anti-tumorigenic molecules (Zhang et al., 2007; Shenouda and Alahari, 2009) depending upon the genes which they target.
Considering the important role miRNAs play in tumorigenic processes, we try to identify and study miRNAs which might play significant roles in head and neck carcinoma. According to WHO (2015), a high percentage of Indian population are regular and direct tobacco users and tobacco users have a significantly higher risk of cancer development. Gingivo buccal squamous cell carcinoma (GBSCC) is one of the most prevalent among tobacco users and categorized as one of the most prominent type of head and neck carcinoma as well. Head and neck squamous cell carcinoma stands as the fifth most common malignancy worldwide (Jemal et al., 2005) and widely found among the tobacco users of India. Thus we analyze miRNA expression data corresponding to cancer and normal tissues collected from GBSCC patients.
We propose a new Bayesian hierarchical model in this paper for differential expressions of the miRNAs harnessing information regarding their positional clustering. Subsequently we apply a novel Bayesian multiple testing methodology to detect miRNAs with statistically significant differential expressions. The specific data that we analyse in this context is detailed in Section 1.1 and in Section 1.2 we provide an overview of the Bayesian model and the multiple testing method that we employ in order to detect significant miRNAs.
1.1 The data details
We used publicly available de-identified 18 patients’ 531 miRNA expression data from De Sarkar et al. (2014). They generated these miRNA expression data from cancer tissues and normal tissues by a high throughput real time polymerase chain reaction (RTPCR) assay (TaqMan Low Density Arrays (TLDA), Applied Biosystems). Following the notations of Livak and Schmittgen (2001) we write, Cycles at which the PCR product quantity reaches a defined threshold, Centred
values with respect to the geometric mean of expression of 3 most stable endogenous control miRNAs in that tissue andof a miRNA in cancer tissue of that miRNA in control tissue. The values were recorded for case and control tissues for the 531 miRNAs across the 18 patients to derive the values. By design, some miRNA assays were done in duplicate; after the removal of such duplicates and technical control miRNA data points, 522 unique miRNAs values or expression deregulation data remained. It is important to note that, although the values are positive, the values need not be so as they are centred with respect to endogenous control. In this article, the value of each miRNA is referred to its respective expression level. Now note that for some miRNA implies that the number of cycles needed to reach the predefined threshold is in cancer versus normal tissues. Therefore the expression of the particular miRNA is 2 fold lower or higher in cancer tissues compared to its paired normal. If a tumour has at least 2 fold relatively higher expression as compared to its normal counterpart, it is usually referred to as up regulated (-1 or less value) and vice versa. Such up and down regulation of a critical miRNA can be interpreted as a biologically significant deviation from its normal quantity and could result in a possible significant functional impact.
1.2 A brief overview of our Bayesian hierarchical model and the multiple testing problem
Our interest lies in detecting those miRNAs that significantly influence the disease concerned in the population, based on an appropriate and new Bayesian model, a novel multiple testing method, and the available sample data set. We develop the model in three stages. We first propose an additive model for the expression level of the miRNAs associated with Gaussian error terms in Section 2. In Section 3, we model the transcription process of the miRNAs by a latent Gaussian process instrumental for the observed expression levels. Biological admissibility of such modelling is supported by a number of studies discussed in the same section. The latent process also realistically accounts for the dependence between the miRNAs that occurs in the course of the transcription process. In Section 4
we generalize the model to account for miRNAs expressed from different strands of different chromosomes and also formally state the multiple testing problem of interest. We also validate and check the goodness-of-fit of the proposed model using leave-one-out cross-validation based on relevant posterior predictive distributions.
In Section 5 we describe the novel Bayesian multiple testing method proposed by Chandra and Bhattacharya (2017) to detect significantly deregulated miRNAs. In their multiple testing method, the dependence structure between the miRNA is explicitly exploited. Application of their procedure to this problem yielded insights and information that were unrevealed in the previous study of the same data set by De Sarkar et al. (2014); the details are presented in Sections 6 and 7. Finally, we summarize our contributions and make concluding remarks in Section 8.
2 Modelling the data and formulation of the multiple testing problem
Let be the values of miRNAs corresponding to cancer tissues of the individual and be the values from the normal tissues of the same individual, where the suffix
denotes transpose of the corresponding vector. We also assume thatand are both independently and identically distributed for , and that and are independent of each other. Specifically, we assume that for and ,
where and are the mean expression levels of the miRNAs for case and control respectively and , are the random errors. We assume that
and , are independent for all . In the above, and stand for multivariate (-variate) normal with mean (the -component vector with all elements zero) and positive definite covariance matrices and , respectively.
We consider further dependence structure on and
by a bivariate Gaussian process to account their inherent dependence structure. Also note that, in contrast with the usual white noise errors, here we consider errors with dependence structures, which begs some explanation. This is discussed at the end of Section3.2.
3 Modelling the transcription process using bivariate Gaussian process
3.1 Motivation behind the latent process perspective
There has been quite substantial advancement in statistical modelling of miRNA expression data and identification of differentially expressed miRNA genes. Clustering approaches have been extensively used to find coexpressed genes. Genes in the same positional clusters are anticipated to be coexpressed together. Clustered expression analysis of miRNAs also helps in detecting biomarkers which otherwise would have been difficult due to the low intensity of their individual expression levels. Clustering of genes often reduces complexity and also yields greater power. Detailed review on popular clustering-based methods can be found in Wang et al. (2015); Conesa et al. (2016).
However, the existing methods do not take into account any dependence structure which occurs through the transcription process of the miRNAs. Many of the miRNAs are located in the intronic regions of host genes but may have their own promoter regions (Rupaimoole and Slack, 2017). Apart from the intronic regions of protein coding genes miRNAs are commonly found to be located in the clusters in the different intergenic regions of the genome (MacFarlane and R Murphy, 2010). Since miRNAs are found to be in the genomic clusters their transcription might be regulated under the same promoter and likely to be coexpressed as reported for a few miRNAs (Yang et al., 2008). Studies demonstrate that expression levels of different genes are not merely random, but the neighbouring genes in the genome have the tendency to be coexpressed together (Michalak, 2008). These studies provide fair reasons to believe that transcription of the genes in the genome is not just a random phenomenon but are subject to an underlying latent process. We believe that modelling the latent process and incorporating it into the analysis is necessary to account for the dependence between genes in a more accurate way to obtain reliable and interpretable results. In light of this discussion we propose and describe our approach of modelling the transcription process of miRNAs.
3.2 Specifics of the bivariate Gaussian process modelling of the latent transcription process
An important aspect of our modelling strategy, related to considering the transcription phenomenon as a latent stochastic process, is incorporation of positional clusterings of the miRNAs and their genomic coordinates into our model. miRNAs under the same positional cluster are likely to be coexpressed together. Now it is to be noted that there are 22 pairs of autosomes (Chr1 through Chr22) and a pair of sex chromosomes (XX/XY) in a human cell and each of them have two strands (we denote the two strands by “+” and “-”). Thus there are 46 strands in total. As regards the discussion in the previous section, miRNAs located close to each other on the genome, that is, neighbouring miRNAs located on the same strand are expected to be regulated under the same promoter. Therefore, within each strand nearby miRNAs can be regarded to be in the same positional cluster. To incorporate this biological information statistically, we model the latent transcription process of miRNAs by a stationary Gaussian process. In stationary Gaussian process structures, correlation between two miRNA expressions is inversely proportional to their distance provided the miRNAs are expressed from the same strand.
For miRNAs lying in different chromosomes as well as in different strands of the same chromosome, are not regulated by the same promoter. Therefore, the transcription processes of miRNAs for different strands are considered independent and hence we do not put any dependence structure between the strands a priori.
Now, and in (1) are the mean expression levels of the miRNAs corresponding to case and control tissues respectively. Since the case and control values for each miRNA are paired observations collected from cancer and normal tissues of the same individual, a biological association should be present within each pair. We account for this dependence, as well as the dependence between the miRNAs induced by the transcription process, by assuming a bivariate Gaussian process associated with and within each strand. For , where is the total number of strands, consider the strand and let and be the expression levels of the miRNA expressed from coordinate of the strand for case and control respectively. The genomic coordinates of the miRNAs were obtained from miRBase (Griffiths-Jones et al., 2006). Then
where, for any , is the bivariate mean function, is a positive definite matrix, and is a positive definite stationary covariance function. Since a priori information on the mean functions and are unavailable, we consider the same mean function for and , that is, we assume . As regards
, we have considered the Matérn covariance function here with their own set of different hyperparameters for different strands. Technical details onand Matérn covariance function are briefly discussed in S-10.
Before we proceed to generalize our model, it is worth shedding light on an important aspect of our model. Indeed, recall that in Section 2 we considered errors with dependence structures, rather than the white noise error traditionally assumed. Note that the mean expression levels of the miRNAs associated with different strands of the chromosomes are independent of each other as they are controlled independently by different promoters. However, exploratory data analysis exhibited significant correlation between the observed expression levels, even across the strands (see Figure 1). This suggests that there are hitherto unexplored biological factors responsible for such correlations. The structured error distributions account for such correlations, making the data dependent, within and across the clusters, ensuring consistency with the exploratory analysis.
4 Generalization of the model to account for miRNAs expressed from different strands of different chromosomes
Notably, some miRNAs can be expressed from two or more genomic regions also (the coloured miRNAs in (4) exhibit such instances). For such miRNAs, it is not possible to determine the exact expression level corresponding to each region separately from the results of the PCR experiment. As a result we only have total expression levels of the miRNAs in the dataset. For these types of miRNAs, we make some minor modifications to the model. Let be the number of miRNAs in the strand and be their corresponding coordinates in the genome. Note that the miRNAs which are transcribed from several genomic locations also appear in more than one positional cluster according to their genomic coordinates. Hence, the same miRNA is reported in multiple strands and therefore .
Now consider as an illustrating example. Let be its observed expression level for the individual from case tissues. From (4) we see that it is transcribed from two genomic locations. Let appear in the and the strand with genomic coordinates and , respectively. Then we consider the following modification:
In general, suppose a miRNA is in strands and be the corresponding genomic coordinates. Then
For miRNAs which are transcribed from some unique genomic location, we assume the following model:
Let be the set of all coordinates of the miRNAs across all strands and let be the corresponding mean expression levels, where for any , . Then
for a known matrix of order .
Similarly for expression levels corresponding to control tissues we have
However, we are mainly interested in identifying miRNAs which are significantly deregulated in case tissues as compared to control tissues. Therefore, it is sufficient to work with the differential expressions only, instead of dealing with both and . We define
Then where .
Then in terms of , in accordance with the objective discussed in Section 1.1, the hypothesis testing problem can be framed as
Working with differential expression values has certain advantages over considering both the case and control values. Although it is not possible to infer about and separately from the values, for the testing problem (11), considering the values is sufficient (notably, values are actually the values defined in Section 1.1
). It not only simplifies the model, but also reduces the number of parameters, thus significantly improving the performance of our Markov chain Monte Carlo (MCMC) strategy discussed subsequently.
For the re-framed model (10), we need to obtain the distribution of , where and are associated with the bivariate Gaussian process. The following theorem, the proof of which is straightforward, gives the desired distribution:
Let and be associated with a bivariate Gaussian process given by
Define for all . Then
where is the determinant of and is Gaussian process.
Following Theorem 4.1, we see that given the Matérn hyperparamaters and ,
follows multivariate normal distributiona priori (details presented in S-11).
Now, recall that
Here all are unknown parameters. It is to be noted that
that is, is the conditional dispersion matrix of the s. We put Inverse-Wishart prior on . Appropriate prior distributions are also considered over the Matérn hyperparameters. In S-12, we discuss our choice of prior distributions in detail.
Once the Bayesian hierarchical model is complete, samples from the posterior distribution of conditional on observed data requires to be generated to carry out the multiple hypothesis testing problem in (11). We do this by the fast and efficient Transformation based Markov chain Monte Carlo (TMCMC) method (Dutta and Bhattacharya, 2014). We also check the goodness-of-fit of our proposed model by means of the leave-one-out cross-validation by computing appropriate posterior predictive distribution. Details on the TMCMC method and goodness-of-fit results are discussed in S-13.
5 A new Bayesian non-marginal multiple testing proposal
5.1 Motivation for non-marginal Bayesian multiple testing procedure
An analysis of the expression data is done by De Sarkar et al. (2014), though the transcription process of the miRNAs was not considered in their model. They performed individual -test for all the miRNAs and applied the Benjamini-Hochberg (BH) method (Benjamini and Hochberg, 1995) for multiplicity correction. The BH method is a
-value based procedure of multiplicity adjustment; however, in the BH procedure, the dependence structure between the test statistics is not utilized. Although validity of the BH procedure holds under positive dependence(Benjamini and Yekutieli, 2001), for negative dependence the usefulness of BH is unclear.
Finner et al. (2007); Efron (2007) discussed the effect of dependence between test statistics, among others. In Bayesian approaches, a natural dependence occurs between hypotheses through hierarchical modelling and multiplicity correction is taken care of to some extent (Scott and Berger, 2010)
. Loss function based approaches have been discussed byMüller et al. (2004). It is interesting that in the aforementioned Bayesian works, positive dependence, unlike the popular BH procedure, is not required.
However, most of the multiple testing methods, either classical or Bayesian, primarily focus on the validity of the test procedure in the sense of controlling or oracle property corresponding to some loss function, whereas exploiting the information provided by the dependence structure might yield efficient closer to truth inference (Sun and Cai, 2009). When the decisions are not directly (deterministically) dependent, information provided by the joint structure inherent in the hypotheses are somewhat neglected by the marginal multiple testing approaches, even though the data (and the prior in the Bayesian case) possess dependence structure(s).
Keeping in mind the necessity of exploiting the dependence structure directly in the methodology, a novel Bayesian non-marginal multiple testing procedure is devised by Chandra and Bhattacharya (2017). The method, which is based on new notions of error and non-error terms, substantially enhances efficiency by judicious exploitation of the dependence structure between the hypotheses. In this procedure decisions regarding different hypotheses deterministically depend upon each other and hence the method is referred to as the non-marginal procedure. The decision rule also has the desirable oracle property corresponding to an additive “0-1” loss function (see Chandra and Bhattacharya (2017) for the details). In Section 5.2 we discuss the Bayesian non-marginal procedure briefly.
5.2 An overview of the new Bayesian procedure for obtaining non-marginal decisions
5.2.1 The general multiple testing set-up
be the observed data. Let the joint distribution ofgiven be where are the parameters of interest and for all . We put a prior on the parameter space. Let and
be the posterior probability and posterior expectation of, respectively, given . Here and represent the marginal distribution of and expectation with respect to this marginal distribution, respectively.
Consider the following hypotheses:
where and , for . In this particular problem .
Here we discuss the multiple comparison problem in a Bayesian decision theoretic framework, given data .
For , let us first define the following quantities:
5.2.2 New error based criterion
Let be the set of hypotheses (including hypothesis ) whose decisions are highly dependent to that of the
hypothesis. Define the following quantity:
If is a singleton, then we set .
Now consider the term
This is the number of cases for which , and ; in words, is the number of cases for which the decision correctly accepts , and all other decisions in , which may accept either or , for , are correct. We refer to this quantity as the number of true positives, and maximize its posterior expectation with respect to . But there are also errors to be controlled, and Chandra and Bhattacharya (2017) advocated control of the following error:
subject to maximizing .
Note that is the total number of cases for which , , that is, either the hypothesis is wrongly rejected or some other decision(s) in is wrong, or both. This is regarded as the number of false positives in our notion.
We will maximize the posterior expectation of given by (17) subject to controlling the posterior expectation of . Hence, with to be controlled, the function to be maximized is given by
where and is a constant lying in .
Let be the set of all -dimensional binary vectors denoting all possible decision configurations. Define
where . Then is the optimal decision configuration obtained as the solution of non-marginal multiple testing method.
It is to be noted that the performance of the non-marginal method heavily depends on the choice of since the decisions regarding different hypotheses depend upon each other through the terms defined in Section 5.2.2. Chandra and Bhattacharya (2017) provide detailed discussion regarding this issue, following which the groups have been formed on the basis of prior correlation between the parameters. Specifics on group formation are elaborated in S-14.1.
Analogous to Type-I and Type-II errors there also existfalse discovery rate and false non-discovery rate in multiple hypothesis testing. Sarkar et al. (2008) defined the posterior and ; we broadly adopt these definitions but significantly modify them in our Bayesian multiple testing method to account for dependence between the hypotheses. Details on multiple testing error measures are discussed in S-14.2.
6 Application of the Bayesian non-marginal multiple testing method to the miRNA problem
We apply the non-marginal method to detect miRNAs with statistically significant differential expression in cancer tissues as compared to normal tissues. Note that in the definition of the non-marginal procedure is the penalizing factor between Type-I and Type-II error. It balances between the posterior expectations of and
. How much Type-I error is allowed to be incurred depends on the choice of. In this application we choose such that
The estimatedbased on the TMCMC samples is 0.04. The discoveries by the non-marginal method are shown in the second column of Table 1 labelled by NMD. Out of a total of 522 miRNAs, the non-marginal method has identified 12 miRNAs.
|miRNA||Method||Deregulation||BF||95% CI of|
|hsa-miR-1293||5cmDeclared significant by|
|De Sarkar et al. (2014)||Up||0.15||-0.87|
6.1 Biological significances of the discoveries made by the non-marginal method
Most of the findings obtained by the non-marginal method have biological significance. hsa-miR-129-2-3p is reported earlier to promote chemo-resistance in breast cancer (Zhang et al., 2015), and hsa-miR-548k often triggers head and neck cancer by modifying TP53 gene (Gross et al., 2014). hsa-miR-147b is reported to trigger head and neck squamous cell carcinoma with high statistical significance (Yata et al., 2015), and hsa-miR-130b-5p is found to be upregulated in more than five cancer types (Mitra et al., 2015). One of the target genes of hsa-miR-124-3p is ADIPOR2 which was reported to be negatively associated with tumour progression in prostrate cancer (Hiyoshi et al., 2012).
hsa-miR-375 represses cell viability and proliferation via SLC7A11 in oral cancer (Wu et al., 2017). The miRNA hsa-miR-1249 is known to regulate tumour growth via positive feedback loop of Hedgehog signalling pathway (Ye et al., 2017). Both hsa-miR-1 and hsa-miR-133a-3p were reported earlier to inhibit cell proliferation and induce apoptosis via TAGLN2 gene (Nohata et al., 2011; Kawakami et al., 2012). hsa-miR-206 has been reported earlier as tumour suppressor (Song et al., 2009). Notably, the last 3 miRNAs mentioned are also found to be significantly downregulated by De Sarkar et al. (2014). hsa-miR-133b is significantly downregulated in our analysis and has been reported as tumour suppressor in esophageal and gastric cancer (Kano et al., 2010; Qiu et al., 2014). This was not reported significant by De Sarkar et al. (2014), however, hsa-mir-1, hsa-mir-133a, hsa-mir-206 and hsa-mir-133b were reported to be functionally related in human cancers (Nohata et al., 2012). This is a clear indication why incorporating the dependence structure in biological data is important and ignoring which is likely to lead to failure to detect important signals.
6.2 Results of testing the hypotheses using Bayes factor
Apart from applying the non-marginal multiple testing procedure, we also consider testing the hypotheses using Bayes factor (BF). The BF is a summary of the evidence provided by the data in favour of one scientific theory as compared to another, both represented by statistical models. It is often used to compare competing models and Bayesian hypotheses. In multiple hypothesis testing problems, for thehypothesis, the marginal Bayes factor is defined as
that is, the ratio of posterior odds ofto its prior odds for all . As summarized by Kass and Raftery (1995), the evidence against that is directed by the magnitude of is shown in Table 2.
|1 to 3.2||Not worth more than a bare mention|
|3.2 to 10||Substantial|
|10 to 100||Strong|
The BFs corresponding to the discoveries are shown in the 4-th column of Table 1. We see that for most of the discoveries by the non-marginal method, BF indicates very strong evidence against . However, in two cases the results of the non-marginal method, the BF based results and those reported in the literature do not agree. Details follow.
Although hsa-miR-124-3p has been declared significant by the NMD method, the corresponding BF shows evidence towards being true. Notably, BF provides evidence towards a belief but does not take into account the multiplicity effect. On the other hand, the discoveries by the NMD procedure are the results of proper control. Controlling the at a more conservative level would rule out the miRNAs with low BFs as discoveries, however, sometimes at the cost of missing out important signals. Hence, instead of such conservative approach, we recommend further biological investigation regarding this discovery.
Previous biological research has reported hsa-miR-622 to be tumour suppressor in esophageal cancer (Song et al., 2016), though our analysis indicates upregulation. The discrepancy may also be due to different tissue types where hsa-miR-622 have been reported to be downregulated in the literature. However, these findings are particularly interesting and further biological experiments should be conducted with different groups of patients to find their roles in tumorigenic processes.
7 Comparison of our non-marginal results with those of the BH procedure
Based on an independent normal model, De Sarkar et al. (2014) considered the BH multiplicity correction approach to identify the significant miRNAs. Before we compare their results with ours it is important to discuss some methodological issues associated with computing the -values needed for the BH procedure.
For the hypothesis testing problem defined in (11
), the null hypothesis corresponding to eachis composite and here even the marginal tests are not straightforward with the frequentist approach. To mitigate this problem, De Sarkar et al. (2014) computed the sample medians of , say . If they tested versus , else they tested versus for , and obtained the corresponding -values. In this way, each composite test boils down to one-sided -test by means of the monotone-likelihood property. However, such separate tests based on the sign of the sample medians is not well-justified statistically as the results of such tests may be non-negligibly different from those of the original composite tests of interest.
As a solution to the above problem, we propose the likelihood-ratio (LR) test for the original composite hypothesis testing problem with respect to the same independent normal model. To obtain the marginal -values we implement a parametric bootstrap method and subsequently apply the BH adjustment for multiplicity correction. We denote this newly implemented method by LRBH in this article. The discoveries are shown in the second column of Table 1 labelled by LRBH. The discoveries by the LRBH method are compared with the findings of the non-marginal method. The detailed comparison of the results is provided in Section 7.1. The -value approximation procedure is discussed in S-15.
7.1 Comparison of the results obtained by the LRBH and the non-marginal procedure
For comparability with the non-marginal method whose is controlled at level 0.10, we control the of the LRBH procedure at level 0.10 as well. As summarized in Table 1, out of the 522 miRNAs, the LRBH method has identified 7 miRNAs as significant whereas the non-marginal method has identified 12 miRNAs. Only three miRNAs, namely, hsa-miR-1, hsa-miR-133a-3p and hsa-miR-206 turned out to be the common discoveries by the LRBH and the non-marginal procedure. It is thus important to investigate the reasons for the discrepant findings, which we attempt below.
Note that the expression levels corresponding to different miRNAs including some of the discrepant discoveries exhibit negative correlations (see Figures 1 and 2). In case of negatively correlated test statistics validity of general BH procedure of multiplicity correction is not guaranteed theoretically. Also from Section S-15 we see that no dependence between the test statistics has been considered and working with only marginal -values practically omits the correlation between the test statistics in the analysis. Neglecting the correlation between test statistics might often leads to unstable inference (Qiu et al., 2005).
However, in our Bayesian model, information across miRNAs are pooled and the dependence structure is exploited by means of hierarchical modelling. Not only that, the non-marginal method employed to detect statistically significant miRNAs directly takes into account the dependence structure between dependent hypotheses, and the corresponding decisions deterministically depend upon each other through the terms defined in Section 5.2.2
. Extensive simulation studies show that this method indeed performs better in dependent situations as compared to some popular methods (including BH) and asymptotically minimizes the Kullback-Leibler divergence from the true model(Chandra and Bhattacharya, 2017). Also for discoveries which are found significant by the BH method only, the Bayes factors exhibit quite strong evidence towards being true. Hence, we argue that in this application, where the miRNAs possess inherent dependence structure, results yielded by the non-marginal method are more reliable as compared to the BH procedure.
8 Summary and conclusion
To the best of our knowledge our attempt to constitute a Bayesian hierarchical model that realistically accounts for case-control dependence and dependence among the miRNAs using the genomic coordinates via a bivariate Gaussian process, is the first ever in the literature. Also first ever in the literature, is the application of our Bayesian non-marginal multiple testing procedure that accounts for the dependence among the hypotheses in a way that the decisions on the hypotheses are dependent on each other and discovers miRNAs that hitherto seem to be unexplored. Indeed, the discoveries employing the non-marginal procedure with respect to this data differ significantly from those of the very popular BH procedure applied in the context of a simple independent normal model. This vindicates the importance of realistic modelling of the dependence structure and the associated realistic non-marginal Bayesian multiple testing procedure. The BF based evidences are mostly in keeping with the non-marginal testing results except hsa-miR-124-3p. As earlier discussed, performing stricter test will disregard the discoveries with low BF, but we do not recommend that.
Interestingly, all the 12 discoveries made by our non-marginal method are already flagged as significant elsewhere in the literature. However, hsa-miR-622 turned out to be upregulated by our method even though the literature suggests that both are downregulated. In this case, the BF also supports the non-marginal method.
The above discussion points to the fact that the results associated with our Bayesian non-marginal method are generally well-supported by the literature and the BF values. In realistic multiple testing situations where the number of hypotheses is much larger compared to the sample size, pooling information across hypotheses exploiting their inherent dependence structure seems crucial for realistic inference. The leave-one-out posterior predictive plots (S-21) suggest that the predictive prowess of our model is excellent even with a very small training set. Hence, our approach of Bayesian hierarchical modelling of miRNA expression data is well-supported.
Although further biological research is necessary to shed more light on the discrepant findings summarized above, from the statistical perspective, incorporation of dependence structure in our Bayesian model and the non-marginal testing method is the major cause for the discrepant discoveries. Indeed, in our experience, taking account of the underlying dependence structure in the model and the testing method is absolutely necessary for realistic inference in complex phenomena, as here.
9 Code with Example and Instructions
The codes are available in the following link
S-10 Details on bivariate Gaussian process
We now elaborately explain the latent process and its distribution. For any arbitrary set of miRNAs in the strand, let be the matrix of their expression levels, the rows of which denote the expression values corresponding to case and control. Then follows a matrix-variate normal distribution:
In the above, are the coordinates of the miRNA genes in the genome.
Let be the distance between two locations and , say. Then, the Matérn covariance function is given by
where is the gamma function, is the modified Bessel function of the second kind; and are the non-negative parameters of the covariance function. Here
is the process variance,is the smoothness parameter and is the correlation length. For different strands we allow the hyperparameters to be different and denote the vector of hyperparameters by , and , respectively. Stein (2012) discussed why the Matérn class of covariance functions is generally recommended in spatial models, following which we adopted the same.
The matrix normal distribution is related to the multivariate normal distribution () in the following way:
where denotes the Kronecker product and denotes vectorization of the matrix . In the above, is the covariance matrix of the columns and is the covariance matrix of the rows of . Matrix-normal distributions are particularly useful when there are reasons to believe that the vector valued observations are not independent. In that situation takes into account the dependence between the observations. Taking
to be the identity matrix and the rows ofto be identical reduces the matrix normal realization to multivariate-normal realizations.
It is to be noted that the covariance parameters of matrix-normal distribution are non-identifiable in the sense that for any scale factor, , we have:
A remedy to this identifiability problem is discussed by Glanz and Carvalho (2013). It is prescribed to consider and as correlation matrices with diagonals as 1, that is,
and to introduce a positive parameter
to estimate the scale of the overall covariance. After this amendment the re-parametrized model is:
Following this we also consider as a correlation matrix. The parameter which is already embedded in as a Matérn covariance function parameter, measures the scale of the covariance. A rigorous study on matrix-normal distribution and its properties can be found in Gupta and Nagar (1999).
S-11 Discussion on prior distribution of the miRNA expression levels
The strand-wise bivariate Gaussian process assumption implies that has the multivariate normal distribution with Matérn covariance given the hyperparameters. Specifically,
Provided a vector of locations , and represents the mean vector and convariance matrix of respectively. Given the hyperparameters, our assumption of independence among the strands makes the distribution of multivariate normal with block-diagonal covariance matrix (see Figure S-3).
the distribution of is given by
As we have considered the same mean function for and a priori, the conditional distribution of given hyperparameters is the following:
Since are independent bivariate Gaussian process over the strand , it follows that are independent Gaussian processes over the strands. Note that the process variance of is . Clearly, an identifiability problem arises here between and . As a remedy to this problem we consider as a single parameter . Let
that is, the vector of differential expression levels of the miRNAs across all strands. We define
Note that, similar to , is also a covariance matrix.
Then from Theorem 4.1 we have
S-12 Prior distributions on the unknown parameters
S-12.1 Prior on
The Inverse-Wishart distribution is a popular prior on unknown covariance matrices. It is also a conjugate prior for normal likelihoods. Hence, we consider Inverse-Wishart distribution as prior on
with degrees of freedomand parameter-matrix .
Here is the common scale parameter accounting for the variance of the s. We put inverse-gamma prior onto be integrated out from the posterior distribution. As is a order matrix , integrating it out from the posterior density reduces the number of parameters significantly. We have taken the degrees of freedom for a priori
second order moment existence.
S-12.2 Prior distributions on Matérn hyperparameters
It is to be noted that and are all unknown positive parameters. As discussed in Gelman et al. (2014), it is desirable that the prior distribution does not unduly influence the posterior distribution. As such we put vague prior on these parameters, that is, locally uniform over a widespread range where the true parameter is likely to lie. As ’s are the strand-wise Gaussian process variances, following the general practice we take the Inverse-Gamma (IG) prior on the s. The parameters of the IG distribution are adjusted such that the mode of prior distribution is 1 and the variance is 100. This prior gives positive probability to the positive part of the real line with maximum weight on 1; this modal value can be regarded as the common a priori summary choice of variance (see Figure (a)a).
For the prior on , note that determines the analytical smoothness of the Gaussian process such that is times mean-square differentiable, where is the floor function. In real data situations not much smoothness is to be expected. This belief is reflected more appropriately by the log-normal prior compared to the gamma prior since the log-normal is thin tailed in comparison; see Figure (b)b
where we plot gamma and log-normal densities both with mode 1 and variance 100. Also observe that the large variance of the log-normal distribution allows even large values ofif the data indicates so. As such, we consider the log-normal prior on with mode 1 and variance 100.
To put a prior on we first need to analyse the role and interpretation of in the model. This parameter is of the dimension of distance (Diggle and Ribeiro, 2007) and sometimes referred to as the correlation length (Gneiting et al., 2010). In this light, the value of should approximately be the length of the strand over which the observed miRNAs are expressed. We consider the mode of the prior distribution of s to be the chromosomal length corresponding to strand with variance 1000. Now the length of genome is really high (generally in order of ) and with such a large modal value and high variance, the gamma and log-normal densities are almost the same (see Figure (c)c). Hence, taking any of the distributions would practically be equivalent and we consider the log-normal prior for .
The joint posterior density of given data is:
where . Note that we have integrated out from the likelihood. This not only simplifies the likelihood but also reduces the number of parameters aiding the MCMC convergence.
S-13 Transformation based Markov chain Monte Carlo for sampling from the posterior distribution and leave-one-out cross-validation for model validation
S-13.1 Additive TMCMC method
TMCMC is employed to generate samples from the joint posterior distribution defined in (S-41). TMCMC is particularly useful for drawing samples from complex high-dimensional distributions. We now briefly describe the additive TMCMC method, which is what we employ. Suppose one is interested in sampling from the -dimensional distribution with density . Let be an arbitrary density with support , the positive part of the real line. We describe the algorithm in Algorithm 1.
It is important to note that all the variables are updated through a single at each iteration and Dutta and Bhattacharya (2014) discussed its advantages especially in high dimensions. Dey and Bhattacharya (2017) discussed optimal scaling properties of the additive algorithm described in Algorithm 1 where is a normal distribution with the negative part truncated, that is, for optimal acceptance rate of the algorithm and also prescribed a methodology to obtain approximate optimal scaling in practice. Following their prescription we have chosen the scaling parameters mentioned in Step 1 of Algorithm 1 and generated samples from the distribution of our interest. As many as samples were generated out of which the first samples were discarded as burn-in. A TMCMC sample is stored at every iteration. Traceplots of some selected parameters shown in Figure S-20 provide evidence towards excellent mixing.
S-13.2 Model validation
To validate our proposed model we conduct leave-one-out cross-validation for each of the data points by successively excluding the (for ) data point and computing the corresponding posterior predictive distribution based on the remaining data points. The posterior predictive distributions are also approximated by drawing samples by the TMCMC algorithm. In Figure S-21 we show the results associated with four data points. The dark coloured region represents the credible region of the posterior predictive distribution of the ’s. We see that the true data point lies well within the credible region in most of the cases, even though the 75% credible regions are much narrow compared to the traditional 95% regions that are usually advocated in general Bayesian analysis as a rule of thumb. On the basis of our leave-one-out posterior predictive results, we conclude that our postulated model very ably explains the variability of the data.
S-14 Discussion on multiple testing methodology
S-14.1 Choice of
In Section 3.1, with proper biological motivation we have discussed that miRNAs generated from nearby locations have high chance to be coexpressed together. Thus, it is natural to form groups with miRNAs with nearby locations. Now as regards the statistical dependence among the miRNAs, we have considered the Matérn covariance function within strand . Notably,
for the genomic locations and where is the distance between them. Thus forming groups of miRNAs on the basis of genomic distance is equivalent to forming groups based on prior correlation.
In practice, we first estimate the prior correlation matrix of from the model that we propose using the Monte Carlo method. Let be the correlation matrix with the element . We then consider the correlations between the and parameters, with , and obtain the percentile, , of these quantities. Then, in we include only those indices such that . Thus, the group contains indices of the parameters that are highly correlated with the parameter. If there exists no index such that , then we set . To reduce complexity we restrict the groups to contain at most 5 indices associated with at most 5 largest values of exceeding . This strategy has produced excellent results in the simulation studies by Chandra and Bhattacharya (2017). It was observed that limiting the group size to 5 increases the power without much affecting the method complexity. Therefore we employ this strategy in this application as well.
S-14.2 Error measures in multiple testing
The False Discovery Rate is defined as
where is the probability of choosing the decision configuration according to the associated multiple testing procedure given data . In case of non-randomized decision rules, for the decision configuration which is chosen to be the final decision rule. Notably, given a particular data set, the non-marginal procedure also gives a binary vector as the optimal decision configuration, not any randomized decision rule.
Under the prior distribution of , the posterior is defined as :
These can be regarded as measures of Type-I error in multiple testing. Similarly False Non-Discovery Rate stands as a measure of Type-II error. It is defined as:
and the posterior is