Aberrations in complex biological systems develop in the background of diverse genetic and environmental factors and are associated with multiple complex molecular events. These include changes in the genome, transcriptome, proteome and metabolome, as well as epigenetic effects. Advances in high-throughput profiling techniques have enabled a systematic and comprehensive exploration of the genetic and epigenetic basis of various diseases, including cancer (Lee et al., 2016; Kaushik et al., 2016), diabetes (Yuan et al., 2014; Sas et al., 2018), chronic kidney disease (Atzler et al., 2014), etc. Further, such multi-Omics collections have become available for patients belonging to different, but related disease subtypes, with The Cancer Genome Atlas (TCGA: Tomczak et al. (2015)) being a prototypical one. Hence, there is an increasing need for models that can integrate such complex data both vertically across multiple modalities and horizontally across different disease subtypes.
Figure 1 provides a schematic representation of the horizontal and vertical structure of such heterogeneous multi-modal Omics data as outlined above. A simultaneous analysis of all components in this complex layered structure has been coined in the literature as data integration. While it is common knowledge that this will result in a more comprehensive picture of the regulatory mechanisms behind diseases, phenotypes and biological processes in general, there is a dearth of rigorous methodologies that satisfactorily tackle all challenges that stem from attempts to perform data integration (Gomez-Cabrero et al., 2014; Joyce and Palsson, 2006; Gligorijević and Pržulj, 2015). A review of the present approaches towards achieving this goal, which are based mostly on specific case studies, can be found in Gligorijević and Pržulj (2015) and Zhang et al. (2017).
Gaussian Graphical Models (GGM) have been extensively used to model biological networks in the last few years. While the initial work on GGMs focused on estimating undirected edges within a single network through obtaining sparse estimates of the inverse covariance matrix from high-dimensional data (e.g. see references inBühlmann and van de Geer (2011)), attention has shifted to estimating parameters from more complex structures, including multiple related graphical models and hierarchical multilayer models comprising of both directed and undirected edges. For the first class of problems, Guo et al. (2011) and Xie et al. (2016) assumed perturbations over a common underlying structure to model multiple precision matrices, while Danaher et al. (2014) proposed using fused/group lasso type penalties for the same task. To incorporate prior information on the group structures across several graphs, Ma and Michailidis (2016) proposed the Joint Structural Estimation Method (JSEM), which uses group-penalized neighborhood regression and subsequent refitting for estimating precision matrices. For the second problem, a two-layered structure can be modeled by interpreting directed edges between the two layers as elements of a multitask regression coefficient matrix, while undirected edges inside either layer correspond to the precision matrix of predictors in that layer. While several methods exist in the literature for joint estimation of both sets of parameters (Lee and Liu, 2012; Cai et al., 2012a), only recently Lin et al. (2016a) made the observation that a multi-layer model can, in fact, be decomposed into a series of two-layer problems. Subsequently, they proposed an estimation algorithm and derived theoretical properties of the resulting estimators.
All the above approaches focus either on the horizontal or the vertical dimensions of the full hierarchical structure depicted in Figure 1. Hence, multiple related groups of heterogeneous data sets have to be modeled by analyzing all data in individual layers (i.e. models for , , ), and then separately analyzing individual hierarchies of datasets (i.e. separate models for ). In another line of work, Kling et al. (2015); Zhang et al. (2017) model all undirected edges within all nodes together using penalized log-likelihoods. The advantage of this approach is that it can incorporate feedback loops and connections between nodes in non-adjacent layers. However, it has two considerable caveats. Firstly, it does not distinguish between hierarchies, hence delineating the direction of a connection between two nodes across two different Omics modalities is not possible in such models. Secondly, computation becomes difficult when data from different Omics modalities are considered, since the number of estimable parameters increases at a faster late compared to a hierarchical model.
While there has been some progress for parameter estimation in multilayer models, little is known about the sampling distributions of resulting estimates. Current research on such distributions and related testing procedures for estimates from high-dimensional problems has been limited to single-response regression using lasso (Zhang and Zhang, 2014; Javanmard and Montanari, 2014, 2018; van de Geer et al., 2014) or group lasso (Mitra and Zhang, 2016) penalties, and partial correlations of single (Cai and Liu, 2016) or multiple (Belilovsky et al., 2016; Liu, 2017) GGMs. From a systemic perspective, testing and identifying downstream interactions that differ across experimental conditions or disease subtypes can offer important insights on the underlying biological process (Mao et al., 2017; Li et al., 2015). In the proposed integrative framework, this can be accomplished by developing a hypothesis testing procedure for entries in the within-layer regression matrices.
The contributions of this paper are two-fold. Firstly, we propose an integrative framework to conduct simultaneous inference for all parameters in multiple multi-layer graphical models, essentially formalizing the structure in Figure 1. We decompose the multi-layer problem into a series of two-layer problems, propose an estimation algorithm for them based on group penalization, and derive theoretical properties of the estimators. Imposing group structures on the model parameters allows us to incorporate prior information on within-layer or between-layer sub-graph components shared across some or all . For biological processes, such information can stem from experimental or mechanistic knowledge (for example a pathway-based grouping of genes). Secondly, we obtain debiased versions of within-layer regression coefficients in this two-layer model, and derive their asymptotic distributions using estimates of model parameters that satisfy generic convergence guarantees. Consequently, we formulate a global test, as well as a simultaneous testing procedure that controls for False Discovery Rate (FDR) to detect important pairwise differences among directed edges between layers.
Our proposed framework for knowledge discovery from heterogeneous data sources is highly flexible. The group sparsity assumptions in our estimation technique can be replaced by other structural restrictions, for example low-rank or low-rank-plus-sparse, as and when deemed appropriate by the prior dependency assumptions across parameters. As long as the resulting estimates converge to corresponding true parameters at certain rates, they can be used by the developed testing methodology.
Organization of paper.
We start with the model formulation in Section 2, then introduce our computational algorithm for a two-layer model, and derive theoretical convergence properties of the algorithm and resulting estimates. In section 3, we start by introducing the debiased versions of rows of the regression coefficient matrix estimates in our model, then use already computed parameter estimates that satisfy some general consistency conditions to obtain its asymptotic distribution. We then move on to pairwise testing, and use sparse estimates from our algorithm to propose a global test to detect overall differences in rows of the coefficient matrices, as well as a multiple testing procedure to detect elementwise differences and perform within-row thresholding of estimates in presence of moderate misspecification of the group sparsity structure. Section 4 is devoted to implementation of our methodology. We evaluate the performance of our estimation and testing procedure through several simulation settings, and give strategies to speed up the computational algorithm for high data dimensions. We conclude the paper with a discussion in Section 5. Proofs of all theoretical results, as well as some auxiliary results, are given in the appendix.
We denote scalars by small letters, vectors by bold small letters and matrices by bold capital letters. For any matrix, denote its element in the position. For , we denote the set of all real matrices by . For a positive semi-definite matrix
, we denote its smallest and largest eigenvalues byand , respectively. For any positive integer , define . For vectors and matrices , , or and or denote euclidean, and norms, respectively. The notation indicates the non-zero edge set in a matrix (or vector) , i.e. . For any set , denotes the number of elements in that set. For positive real numbers we write if there exists independent of model parameters such that . We use the ‘’ notation to define a quantity for the first time.
2 The Joint Multiple Multilevel Estimation Framework
Suppose there are independent datasets, each pertaining to an -layered Gaussian Graphical Model (GGM). The model has the following structure:
|Layer -||, with|
We assume known structured sparsity patterns, denoted by and , for the parameters of interest in the above model, i.e. the precision matrices and the regression coefficient matrices , respectively. These patterns provide information on horizontal dependencies across for the corresponding parameters, and our goal is to leverage them to estimate the full hierarchical structure of the network -specifically to obtain the undirected edges for the nodes inside a single layer, and the directed edges between two successive layers through jointly estimating and .
Consider now a two-layer model, which is a special case of the above model with :
wherein we want to estimate from data in presence of known grouping structures respectively and assuming for all for simplicity. We focus the theoretical discussion in the remainder of the paper on jointly estimating and . This is because for , within-layer undirected edges of any layer and between-layer directed edges from the layer to the layer can be estimated from the corresponding data matrices in a similar fashion (see details in Lin et al. (2016a)). On the other hand, parameters in the very first layer are analogous to , and can be estimated from using any method for joint estimation of multiple graphical models (e.g. Guo et al. (2011); Ma and Michailidis (2016)). This provides all building blocks for recovering the full hierarchical structure of our -layered multiple GGMs.
We assume an element-wise group sparsity pattern over for the precision matrices :
where each is a partition of , and consists of index groups such that . First introduced in Ma and Michailidis (2016), this formulation helps incorporate group structures that are common across some of the precision matrices being modeled. Figure 2 illustrates this through a small example. Subsequently, we use the Joint Structural Estimation Method (JSEM) (Ma and Michailidis, 2016) to estimate , which first uses the group structure given by in penalized nodewise regressions (Meinshausen and Bühlmann, 2006) to obtain neighborhood coefficients of each variable , then fits a graphical lasso model over the combined support sets to obtain sparse estimates of the precision matrices:
where , is a tuning parameter, and is the set of positive-definite matrices that have non-zero supports restricted to .
For the precision matrices , we assume an element-wise sparsity pattern defined in a similar manner as . The sparsity structure for is more general, each group being defined as:
In other words, any arbitrary partition of can be specified as the sparsity pattern of .
Denote the neighborhood coefficients of the variable in the lower layer by , and . We obtain sparse estimates of , and subsequently , by solving the following group-penalized least square minimization problem that has the tuning parameters and and then refitting:
The outcome of a node in the lower layer is thus modeled using all other nodes in that layer using the neighborhood coefficients , and nodes in the immediate upper layer using the regression coefficients .
Common sparsity structures across the same layer are incorporated into the regression by the group penalties over the element-wise groups , while sparsity pattern overlaps across the different regression matrices are handled by the group penalties over , which denote the collection of elements in that are in . Other kinds of structural assumptions on or can be handled within the above structure by swapping out the group norms in favor of other appropriate norm-based penalties.
2.2.1 Alternating Block Algorithm
The objective function in (2.5) is bi-convex, i.e. convex in for fixed , and vice-versa, but not jointly convex in . Consequently, we use an alternating iterative algorithm to solve for that minimizes (2.5) by iteratively cycling between and , i.e. holding one set of parameters fixed and solving for the other, then alternating until convergence.
Choice of initial values plays a crucial role in the performance of this algorithm as discussed in detail in Lin et al. (2016a). We choose the initial values
by fitting separate lasso regression models for eachand :
We obtain initial estimates of by performing group-penalized nodewise regression on the residuals :
The steps of our full estimation procedure, coined as the Joint Multiple Multi-Layer Estimation (JMMLE) method, are summarized in Algorithm 1.
2.2.2 Tuning parameter selection
The node-wise regression step in the JSEM model (2.2) uses a Bayesian Information Criterion (BIC) for tuning parameter selection. The step for updating , i.e. (2.10), in our JMMLE algorithm is analogous to this procedure, hence we use BIC to select the penalty parameter . In our setting the BIC for a given and fixed is given by:
where in subscript indicates the corresponding quantity is calculated taking as the tuning parameter, and . Every time is updated in the JMMLE algorithm, we choose the optimal as the one with the smallest BIC over a fixed set of values . Thus for a fixed , our final choice of will be .
We use the High-dimensional BIC (HBIC) to select the other tuning parameter, :
We choose an optimal as the minimizer of HBIC by training multiple JMMLE models using Algorithm 1 over a finite set of values : .
2.3 Properties of JMMLE estimators
We now provide theoretical results ensuring the convergence of our alternating algorithm, as well as the consistency of estimators obtained from the algorithm. We present statements of theorems in the main body of the paper, while detailed proofs and auxiliary results are delegated to the Appendix.
We introduce some additional notation and technical conditions that help establish the results that follow. Denote the true values of the parameters by . Sparsity levels of individual true parameters are indicated by . Also define , and .
Condition 1 (Bounded eigenvalues). A positive definite matrix is said to have bounded eigenvalues with constants if
Condition 2 (Diagonal dominance). A matrix is said to be strictly diagonally dominant if for all ,
Also denote .
Our first result establishes the convergence of Algorithm 1 for any fixed realization of and .
Suppose for any fixed , estimates in each iterate of Algorithm 1 are uniformly bounded by some quantity dependent on only and :
Then any limit point of the algorithm is a stationary point of the objective function, i.e. a point where partial derivatives along all coordinates are non-negative.
The next steps are to show that for random realizations of and ,
(a) successive iterates lie in this non-expanding ball around the true parameters, and
both with probability approaching 1 as.
To do so we break down the main problem into two sub-problems. Take as : any subscript or superscript on being passed on to . Denote by and the generic estimators given by
We assume the following conditions on the true parameter versions , defined from similarly as (2.14):
(E1) The matrices are diagonally dominant,
(E2) The matrices have bounded eigenvalues with constants that are common across .
Now we are in a position to establish the estimation consistency for (2.12), as well as the consistency of the final estimates using their support sets.
Consider random , any deterministic that satisfy the following bound
where depends only on . Then, for sample size the following hold:
(I) Denote . Then for the choice of tuning parameter
where depends on the model parameters only, we have
(II) For the choice of tuning parameter ,
both with probability , for some constants .
To prove an equivalent result for the solution of (2.13), we need the following conditions.
(E3) The matrices are diagonally dominant,
(E4) The matrices have bounded eigenvalues with common constants .
Given these, we next establish the required consistency results.
Assume random , and fixed so that for ,
for some dependent on only. Then, given the choice of tuning parameter
where depends on the population parameters only, the following hold
with probability , where and
with being the maximum degree .
Finally, we ensure that the starting values are satisfactory as previously discussed.
Putting all the pieces together, estimation consistency for the limit points of Algorithm 1 given our choice of starting values follows in a straightforward manner.
To save computation time for high data dimensions, an initial screening step, e.g. the debiased lasso procedure of Javanmard and Montanari (2014), can be used to first restrict the support set of before obtaining the initial estimates using (2.7). The consistency properties of resulting initial and final estimates follow along the lines of the special case discussed in Lin et al. (2016a), in conjunction with Theorem 2.4 and Corollary 2.5, respectively. We leave the details to the reader.
3 Hypothesis testing in multilayer models
In this section, we lay out a framework for hypothesis testing in our proposed joint multi-layer structure. Present literature in high-dimensional hypothesis testing either focuses on testing for simularities in the within-layer connections of single-layer networks (Cai and Liu, 2016; Liu, 2017), or coefficients of single response penalized regression (van de Geer et al., 2014; Zhang and Zhang, 2014; Mitra and Zhang, 2016). However, to our knowledge no method is available in the literature to perform testing for between-layer connections in a two-layer (or multi-layer) setup.
Denote the row of the coefficient matrix by , for . In this section we are generally interested in obtaining asymptotic sampling distributions of , and subsequently formulating testing procedures to detect similarities or differences across in the full vector or its elements. There are two main challenges in doing the above- firstly the need to mitigate the bias of the group-penalized JMMLE estimators, and secondly the dependency among response nodes translating into the need for controlling false discovery rate while simultaneously testing for several element-wise hypotheses concerning the true values . To this end, in Section 3.1 we first propose a debiased estimator for that makes use of already computed (using JSEM) node-wise regression coefficients in the upper layer, and establish asymptotic properties of scaled version of them. Section 3.2 is devoted to pairwise testing, where we assume
, and propose asymptotic global tests for detecting differential effects of a variable in the upper layer, i.e. testing for the null hypothesis, as well as pairwise simultaneous tests across for detecting the element-wise differences .
3.1 Debiased estimators and asymptotic normality
Zhang and Zhang (2014)
proposed a debiasing procedure for lasso estimates and subsequently calculate confidence intervals for individual coefficients
in high-dimensional linear regression:and for some . Given an initial lasso estimate their debiased estimator was defined as:
where is the vector of residuals from the -penalized regression of on . With centering around the true parameter value, say
, and proper scaling this has an asymptotic normal distribution:
Essentially, they obtain the debiasing factor for the coefficient by taking residuals from the regularized regression and scale them using the projection of onto a space approximately orthogonal to it. Mitra and Zhang (2016) later generalized this idea to group lasso estimates. Further, van de Geer et al. (2014) and Javanmard and Montanari (2014) performed debiasing on the entire coefficient vectors.
We start off by defining debiased estimates for individual rows of the coefficient matrices in our two-layer model: