# The conditional censored graphical lasso estimator

In many applied fields, such as genomics, different types of data are collected on the same system, and it is not uncommon that some of these datasets are subject to censoring as a result of the measurement technologies used, such as data generated by polymerase chain reactions and flow cytometry. When the overall objective is that of network inference, at possibly different levels of a system, information coming from different sources and/or different steps of the analysis can be integrated into one model with the use of conditional graphical models. In this paper, we develop a doubly penalized inferential procedure for a conditional Gaussian graphical model when data can be subject to censoring. The computational challenges of handling censored data in high dimensionality are met with the development of an efficient Expectation-Maximization algorithm, based on approximate calculations of the moments of truncated Gaussian distributions and on a suitably derived two-step procedure alternating graphical lasso with a novel block-coordinate multivariate lasso approach. We evaluate the performance of this approach in an extensive simulation study and on gene expression data generated by RT-qPCR technologies, where we are able to integrate network inference, differential expression detection and data normalization into one model.


## 1 Introduction

Conditional graphical models, also called conditional random fields, were originally introduced in LaffertyEtAl_ICML_01. Formally, let $X$ and $Y$ be $q$- and $p$-dimensional random vectors, respectively, and let $\mathcal{G} = (V, E)$ be a graph with vertex set $V$, indexing only the entries in $Y$, and edge set $E \subseteq V \times V$, where $(h, k) \in E$ iff there is a directed edge from the vertex $h$ to the vertex $k$ in $\mathcal{G}$. An edge $(h, k)$ is called undirected if both $(h, k)$ and $(k, h)$ are in $E$, and the graph is called undirected if it has only undirected edges, which is the case that we will consider in this paper. Denote by $f(y \mid x)$ the conditional density function of $Y$ given $X = x$. We shall say that the set $\{X, Y, \mathcal{G}, f(y \mid x)\}$ is a conditional graphical model when $f(y \mid x)$ obeys the Markov properties with respect to the graph $\mathcal{G}$. In the case of an undirected graph, two given random variables in $Y$, say $Y_h$ and $Y_k$, are conditionally independent given $X$ and all the remaining random variables in $Y$ iff the unordered pair $(h, k)$ does not belong to the edge set $E$.

The conditional Gaussian graphical model is a member of the class of conditional undirected graphical models based on the assumption that $Y \mid X = x$ follows a multivariate Gaussian distribution with mean $\mu(x)$ and covariance matrix $\Sigma$. The inverse of the covariance matrix $\Sigma$, denoted by $\Theta$, is called the precision matrix, and its entries have a one-to-one correspondence with the partial correlation coefficients. As in the unconditional Gaussian graphical model, $\Theta$ is the parametric tool relating the topological structure of the undirected graph to the factorization of the conditional density of $Y$ given $X = x$. That is, it is possible to show that the random variables $Y_h$ and $Y_k$ are conditionally independent given $X$ and all the remaining variables in $Y$ iff the corresponding partial correlation coefficient is zero (see for example Lauritzen_book_96). Consequently, the problem of estimating the edge set of the undirected graph is equivalent to the problem of finding the non-zero entries in $\Theta$. This property, together with recent advances in statistical methods for sparse estimation, has made these methods particularly appealing for a range of applications, such as genomics and bioinformatics, where different sources of data are typically collected on the same biological system and integrative approaches are in high demand. As a motivating example, one can consider the combined analysis of genetic and genomic data by YinEtAl_AOAS_11. In this kind of experiment both gene expression data, modeled by $Y$, and genetic variants, modeled by $X$, are measured on the same subjects. Although a direct analysis of the gene expression data alone can provide some insights into the underlying gene regulation network, such an analysis would ignore the effect of genetic variants on the gene expression, that is, the fact that high correlation among genes could be explained by the effects of shared genetic variants. In these cases, it is more informative to study the graph structure of $Y$ conditioned on $X$.
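
The correspondence between zeros of the precision matrix and conditional independence can be checked numerically. The following sketch uses Python and numpy (the paper's own software is in R; the matrix below is a toy example): a structural zero in $\Theta$ gives a zero partial correlation even though the marginal covariance is non-zero.

```python
import numpy as np

# Toy precision matrix with theta_13 = 0: variables 1 and 3 are
# conditionally independent given variable 2.
theta = np.array([[2.0, 0.6, 0.0],
                  [0.6, 2.0, 0.6],
                  [0.0, 0.6, 2.0]])

sigma = np.linalg.inv(theta)            # covariance matrix

# Partial correlations: rho_hk = -theta_hk / sqrt(theta_hh * theta_kk)
d = np.sqrt(np.diag(theta))
partial_corr = -theta / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

print(partial_corr[0, 2])               # zero: no edge between nodes 1 and 3
print(abs(sigma[0, 2]) > 0)             # True: marginal dependence remains
```

The marginal covariance between the first and third variable is non-zero (the dependence is mediated by the second variable), which is precisely why the precision matrix, and not the covariance matrix, encodes the graph.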

In order to overcome the inferential problems arising in a high-dimensional setting, that is, when the number of parameters is larger than the sample size $n$, penalized methods have recently been proposed in the literature. In particular, motivated by the biological data analysis mentioned above, YinEtAl_AOAS_11 proposed sparse inference for the conditional Gaussian graphical model defined by:

$$E(y \mid x) = \beta_0 + B^\top x, \qquad V(y \mid x) = \Sigma, \tag{1}$$

that is, assuming that $x$ influences the conditional expected value through the regression coefficient matrix $B$, whereas the conditional covariance matrix does not depend on $x$. Denoting with $\Theta = \Sigma^{-1}$ the corresponding precision matrix, the density function can be written as

$$(2\pi)^{-p/2}\, |\Theta|^{1/2} \exp\left\{ -\tfrac{1}{2} (y - B^\top x)^\top \Theta\, (y - B^\top x) \right\}, \tag{2}$$

where, with a little abuse of notation, we let $\phi(y \mid x; B, \Theta)$ denote this density. Given a sample of $n$ independent observations, the authors propose the following extension of the well-known graphical lasso (glasso) estimator (YuanEtAl_BioK_07):

$$\{\hat{B}, \hat{\Theta}\} = \operatorname*{arg\,max}_{B, \Theta}\ \frac{1}{n} \sum_{i=1}^{n} \log \phi(y_i \mid x_i; B, \Theta) - \lambda \|B\|_1 - \rho \|\Theta\|_1^-, \tag{3}$$

where $\|B\|_1 = \sum_{h,k} |\beta_{hk}|$ and $\|\Theta\|_1^- = \sum_{h \ne k} |\theta_{hk}|$. In this model, the two tuning parameters $\lambda$ and $\rho$ are used to control, respectively, the amount of sparsity in $\hat{B}$ and $\hat{\Theta}$. As elucidated in YinEtAl_AOAS_11, the first penalty function accounts for the biological expectation that each gene has only a few genetic regulators, i.e., each column of $B$ has only a few non-zero regression coefficients, whereas the second penalty function accounts for the expectation that the precision matrix is sparse for genetic networks. It is worth noting that the estimator (3) was independently proposed also in RothmanEtAl_JCGS_10. However, in their work the authors focus only on how to efficiently estimate the regression coefficient matrix when incorporating the covariance information. Wang_StatSinica_15, on the other hand, proposes a new algorithm for fitting the estimator (3). This is based on the idea of decomposing the initial problem into a series of simpler conditional problems involving, at each step, the conditional log-likelihood function of each response variable given all the remaining variables. The author also studies the asymptotic properties of the estimator for diverging $p$ and $q$ values. Recently, other works have proposed two-step procedures for fitting a sparse conditional Gaussian graphical model. Among these, LiEtAl_JASA_12 propose to first compute an initial non-sparse estimator of the conditional variance matrix using the theory of reproducing kernel Hilbert spaces; in the second step of the procedure, the resulting estimate is used to formulate a glasso-type problem for the estimation of $\Theta$. In contrast to this, the two-step procedure proposed in YinEtAl_JMA_13 uses an $\ell_1$-penalized multivariate linear regression model with independent errors for estimating the sparse regression coefficient matrix in the first step, whereas, in the second step, the precision matrix is estimated using the standard glasso estimator. Asymptotic results are obtained assuming that the error term is sub-Gaussian. A distribution-free approach is instead proposed in CaiEtAl_BioK_13. In this case, $B$ is first estimated through a constrained minimization problem not involving the conditional precision matrix, which can be considered as a multivariate extension of the Dantzig selector, and then the precision matrix is estimated using the method proposed in CaiEtAl_JASA_11. Finally, ChenEtAl_JASA_16 propose a two-step procedure where the scaled lasso is used in each step to select the amount of sparsity. The authors also provide theoretical results on how to select the optimal values of the two tuning parameters so as to make the entire procedure tuning free.

Another strand of literature on sparse inference for a conditional Gaussian graphical model is theoretically founded on the more stringent assumption that $X$ is also normally distributed, with expected value $\mu_x$ and covariance matrix $\Sigma_{xx}$. Under this assumption, letting $Z = (X^\top, Y^\top)^\top$, with expected value and precision matrix partitioned as follows:

$$\mu_z = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix} \quad \text{and} \quad \Theta_{zz} = \begin{pmatrix} \Theta_{xx} & \Theta_{xy} \\ \Theta_{yx} & \Theta_{yy} \end{pmatrix}, \tag{4}$$

and the conditional density (2) can be reparameterized using the following identities:

$$\beta_0 = \mu_y + \Theta_{yy}^{-1} \Theta_{yx} \mu_x, \qquad B = -\Theta_{xy} \Theta_{yy}^{-1}, \qquad \Theta = \Theta_{yy}.$$

Denoting with $\phi(y \mid x; \mu_z, \Theta_{xy}, \Theta_{yy})$ the reparameterized conditional density function (2), SohnEtAl_PICAIS_12, WytockEtAl_pmlr_13 and YuanEtAl_IEEETIT_14 independently proposed the following penalized estimator of a sparse conditional Gaussian graphical model:

$$\{\hat{\Theta}_{xy}, \hat{\Theta}_{yy}\} = \operatorname*{arg\,max}_{\Theta_{xy}, \Theta_{yy}}\ \frac{1}{n} \sum_{i=1}^{n} \log \phi(y_i \mid x_i; \hat{\mu}_z, \Theta_{xy}, \Theta_{yy}) - \lambda \|\Theta_{xy}\|_1 - \rho \|\Theta_{yy}\|_1^-, \tag{5}$$

where $\hat{\mu}_z$ denotes an estimate of $\mu_z$. Although estimator (5) is merely based on an alternative parameterization of the conditional density, it enjoys two important advantages over the estimator (3). Firstly, the partial correlation coefficients associated with $\Theta_{xy}$ are the natural parametric tools to infer the conditional independence between an entry of $X$ and an entry of $Y$ given all the remaining variables, whereas a zero regression coefficient only implies that the corresponding predictor has no effect on the expected value of the corresponding response, when the level of the remaining predictors is kept fixed. Moreover, the stronger the correlation in $X$ is, the more difficult it is to recover the non-zero entries of $\Theta_{xy}$ from $B$, which is why ChiquetEtAl_StatComp_17 suggest using the first penalty function to sparsify $\Theta_{xy}$ instead of $B$. Secondly, the log-likelihood function used in definition (5) is jointly convex in $(\Theta_{xy}, \Theta_{yy})$ (YuanEtAl_IEEETIT_14), and this facilitates both the optimization and the theoretical analysis. Although such advantages can only be viewed in light of the strong distributional assumptions on $X$, the estimator (5) has gained significant popularity. For example, ZhangEtAl_PLOS_14 use this estimator to infer genetic networks under SNP perturbations. More recently, HuangEtAl_IEEE_TKDE_18 propose an extension to model dynamic networks influenced by conditioning input variables, whereas HuangEtAl_IEEE_TNNLS_18 propose an extension to infer multiple conditional Gaussian graphical models with similar structures.

Despite a widespread literature on sparse conditional Gaussian graphical models, there is a large number of fields in applied research where modern measurement technologies make the use of this graphical model theoretically unfounded, even when the assumption of a multivariate Gaussian distribution is satisfied. An example of this is given by Reverse Transcription quantitative Polymerase Chain Reaction (RT-qPCR), a popular technology for gene expression profiling (DerveauxEtAl_Methods_10). This technique is used to measure the expression of a set of target genes in a given sample through repeated cycles of sequence-specific DNA amplification followed by expression measurements. The cycle at which the observed expression first exceeds a fixed threshold is commonly called the cycle-threshold (McCallEtAl_BioInfo_14). If a target is not expressed, the threshold is not reached after the maximum number of cycles (limit of detection) and the corresponding cycle-threshold is undetermined. For this reason, the resulting data are naturally right-censored, and specific methods have been developed to cope with this (PipelersEtAl_Plos1_17; McCallEtAl_BioInfo_14), including graphical modelling approaches for network inference (AugugliaroEtAl_BioStat_18).

In order to take into account the censoring mechanism as well as the availability of different sources of information on the same system, in this paper we propose a novel doubly penalized estimator to infer a sparse conditional Gaussian graphical model with censored data. The contributions of this paper are three-fold. Firstly, we introduce a suitable generalization of the notion of conditional Gaussian graphical model for data subject to censoring; secondly, we propose a doubly penalized estimator that accounts for different levels of sparsity of the network of response variables and of the network linking the predictors to the responses; and, finally, we develop an efficient Expectation-Maximization algorithm for inference, based, on the one hand, on approximate calculations of the moments of truncated Gaussian distributions and, on the other hand, on a suitably derived two-step procedure alternating graphical lasso with a novel block-coordinate multivariate lasso approach.

The outline of this paper is as follows. In Section 2 we introduce the notion of conditional censored Gaussian graphical model and in Section 3 we present our estimator. In this section we also consider the computational aspects of the proposed approach. The problem of tuning parameter selection is treated in Section 4, whereas in Section 5 we use an extensive simulation study to compare the behaviour of the proposed estimator with existing competitors. Finally, in Section 6, we study a real gene expression dataset generated by RT-qPCR technologies, where we are able to integrate network inference, differential expression detection and data normalization into one model, and in Section 7 we draw some conclusions. All proofs are reported in the Appendix.

## 2 The conditional censored Gaussian graphical model

In this section we extend the notion of conditional Gaussian graphical models to the case in which the random variables $X$ and $Y$ are subject to specific censoring mechanisms. To this end, let $Z = (X^\top, Y^\top)^\top$ be a (partially) latent random vector with joint density function $g(z; \xi)$, and let $l$ and $u$, with $l_k < u_k$ for each $k$, be the vectors of left and right censoring values, respectively. Then, $Z_k$ is observed only if it is inside the closed interval $[l_k, u_k]$, otherwise it is censored from below or above depending on whether $z_k < l_k$ or $z_k > u_k$, respectively. Following LittleEtAl_Book_02, the complete specification of the joint distribution of the observed data can be obtained using a discrete random vector, denoted by $R$, with support $\{-1, 0, 1\}^{p+q}$ and encoding the censoring patterns. Formally, the $k$th element of $R$ is defined as $R_k = I(z_k > u_k) - I(z_k < l_k)$, where $I(\cdot)$ is the usual indicator function. To simplify our notation, the realization of the random vector $R$ is denoted by $r$. In this way the index set can be partitioned into $o = \{k : r_k = 0\}$, $c^- = \{k : r_k = -1\}$ and $c^+ = \{k : r_k = +1\}$. Furthermore, we shall use the convention that a vector indexed by a set of indices denotes the corresponding subvector, so, for example, the subvector of observed elements in $z$ is denoted by $z_o$ and, consequently, the observed data is $(z_o, r)$. With such a notation, the joint density function of the observed data is given by

$$\varphi(z_o, r; \xi) = \int_{D_c} g(z_o, z_c; \xi)\, dz_c\; I(l_o \le z_o \le u_o), \tag{6}$$

where $c = c^- \cup c^+$ and $D_c = \times_{k \in c^-} (-\infty, l_k) \times_{k \in c^+} (u_k, +\infty)$ denotes the censoring region.

Similarly to the approach described in the Introduction, the notion of conditional censored Gaussian graphical model, formalized below, arises naturally by combining the density (6) with the assumption that the conditional distribution of $Y$ given $X = x$ is a multivariate Gaussian distribution.

###### Definition 1

Let $Z = (X^\top, Y^\top)^\top$ be a random vector whose density function can be factorized as follows:

$$g(z; \xi) = \phi(y \mid x; \mu_{y \mid x}, \Theta_{y \mid x})\, h(x; \delta),$$

where $\xi$ denotes the full set of parameters, $h(x; \delta)$ denotes the marginal distribution of $X$ and $\phi(y \mid x; \mu_{y \mid x}, \Theta_{y \mid x})$ the conditional distribution of $Y$ given $X = x$, which is assumed to follow a $p$-dimensional Gaussian distribution. Furthermore, let $R$ be a $(p+q)$-dimensional random censoring-data indicator defined by the censoring vectors $l$ and $u$. The conditional censored Gaussian graphical model is defined by the set $\{Z, R, \mathcal{G}, \varphi(z_o, r; \xi)\}$, where

$$\varphi(z_o, r; \xi) = \int_{D_c} \phi(y_o, y_c \mid x_o, x_c; \mu_{y \mid x}, \Theta_{y \mid x})\, h(x_o, x_c; \delta)\, dx_c\, dy_c \tag{7}$$

with $D_c$ defined as in (6), and $\phi(y \mid x; \mu_{y \mid x}, \Theta_{y \mid x})$ factorizes according to the undirected graph $\mathcal{G}$.

Although the proposed definition is general enough to cover a broad class of marginal distributions for the vector $X$, in the next section we shall focus on two specific cases. The first one, which we consider in Section 3.1, is the simplest and is obtained when only the random variable $Y$ is subject to a censoring mechanism. Section 3.3, on the other hand, considers the case in which the random variable $X$ is also censored but, in this case, as typically done in the literature on conditional Gaussian graphical models, we enforce distributional assumptions and require that $X$ is also Gaussian distributed.

## 3 Sparse inference on the conditional censored Gaussian graphical models

### 3.1 Sparse inference when x is fully observed

As proposed in YinEtAl_AOAS_11 and RothmanEtAl_JCGS_10, we begin by assuming that $x$ influences the parameters of the conditional distribution of $Y$ given $X = x$ only through a linear model for the conditional expected values, i.e., model (1). Then, under the assumption that $X$ is not subject to a censoring mechanism, density (7) can be significantly simplified, as it can be rewritten as the product of the marginal density of $X$ and the conditional density function of the observed response variables given $x$, i.e.:

$$\varphi(z_o, r; \xi) = h(x; \delta) \int_{D_c} \phi(y_o, y_c \mid x; B, \Theta)\, dy_c = h(x; \delta)\, \psi(y_o, r \mid x; B, \Theta), \tag{8}$$

with $\psi(y_o, r \mid x; B, \Theta) = \int_{D_c} \phi(y_o, y_c \mid x; B, \Theta)\, dy_c$. The factorization (8) implies that inference about $(B, \Theta)$ and $\delta$ can be done separately. Therefore, since all the relevant information needed to infer the conditional independence structure of $Y$ adjusted for $X$ is contained in $\psi(y_o, r \mid x; B, \Theta)$, in the remainder of this section we shall focus only on how to estimate the matrices $B$ and $\Theta$. Inference about $\delta$ can be achieved using, for example, the maximum likelihood method if the sample size is large enough, or a suitable penalized method otherwise.

Suppose that we have a set of $n$ independent observations, with $o_i = \{k : r_{ik} = 0\}$ and $c_i = \{k : r_{ik} \ne 0\}$ denoting the observed and censored entries of the $i$th response vector, respectively. Then the relevant average observed log-likelihood function is given by

$$\bar{\ell}(B, \Theta) = \frac{1}{n} \sum_{i=1}^{n} \log \int_{D_{c_i}} \phi(y_{i o_i}, y_{i c_i} \mid x_i; B, \Theta)\, dy_{i c_i}, \tag{9}$$

where $D_{c_i}$ denotes the censoring region of the $i$th observation. In a high-dimensional setting, that is, when the number of parameters exceeds the sample size $n$, inference about $B$ and $\Theta$ can be carried out only under the assumption that these matrices have a sparse structure, that is, only a few regression coefficients and partial correlation coefficients are different from zero. Then, similarly to the conditional glasso estimator (3), we propose to estimate the parameters of a conditional censored Gaussian graphical model by maximizing a new objective function whereby two specific lasso-type penalty functions are added to the average observed log-likelihood. The resulting estimator, which we call conditional cglasso, is defined as follows:

$$\{\hat{B}, \hat{\Theta}\} = \operatorname*{arg\,max}_{B, \Theta}\ \bar{\ell}(B, \Theta) - \lambda \sum_{k=1}^{p} \theta_{kk} \|\beta_k\|_1 - \rho \|\Theta\|_1^-, \tag{10}$$

where $\beta_k$ denotes the $k$th column of $B$. As in the standard conditional graphical lasso estimator, the tuning parameter $\lambda$ is used to control the amount of sparsity in the estimated regression coefficient matrix $\hat{B}$, whereas $\rho$ controls the sparsity in $\hat{\Theta}$ and, consequently, in the corresponding estimated graph $\hat{\mathcal{G}} = (V, \hat{E})$, where $\hat{E} = \{(h, k) : \hat{\theta}_{hk} \ne 0\}$. When $\rho$ is large enough, some $\hat{\theta}_{hk}$ are shrunken to zero, resulting in the removal of the corresponding links in $\hat{\mathcal{G}}$; on the other hand, when $\rho$ is equal to zero and the sample size is large enough, the estimator coincides with the maximum likelihood estimator of the precision matrix, which implies a fully connected estimated graph.

A close look at the definition (10) reveals that the penalty function associated with $B$ is different from what is proposed in the literature (RothmanEtAl_JCGS_10; YinEtAl_AOAS_11). In particular, we use the diagonal elements of the concentration matrix to scale the $\ell_1$-norm of the parameter vectors $\beta_k$. As we shall see in the next section, the use of these scaling factors significantly increases the computational efficiency of the proposed algorithm.

### 3.2 An efficient Expectation-Maximization algorithm

In order to reduce the computational burden associated with the integral involved in the definition of the average observed log-likelihood function (9), and to solve the maximization problem in definition (10) efficiently, we develop an Expectation-Maximization (EM) algorithm (DempsterEtAl_JRSSB_77). The EM algorithm is based on the idea of repeating two steps until a convergence criterion is met.

The first step, called the E-Step, requires the calculation of the conditional expected value of the complete log-likelihood function using the current estimates. As shown in McLachlanEtAl_Book_08, this step can be significantly simplified when the complete probability density function is a member of the regular exponential family since, in this case, the E-Step simply requires the computation of the conditional expected values of the sufficient statistics. In our model, the E-Step is theoretically founded on the following theorem, which relates the partial derivatives of the average observed log-likelihood function (9) to the expected value operator computed with respect to the conditional distribution of the censored entries given the observed data and truncated over a specific region. For the sake of simplicity, in the remaining part of this paper, when needed, we let $\vartheta = (B, \Theta)$ and we let $\vartheta_{hk}$ denote a generic parameter.

###### Theorem 1

Let $E(\,\cdot \mid y_{i c_i} \in D_{c_i}, x_i; \vartheta)$ be the expected value operator computed with respect to the conditional Gaussian distribution of $y_{i c_i}$ given $(y_{i o_i}, x_i)$ and truncated over the region $D_{c_i}$. Then $\partial \bar{\ell}(\vartheta) / \partial \vartheta_{hk}$ is equal to:

$$\frac{1}{n} \sum_{i=1}^{n} E\left( \frac{\partial \log \phi(y_{i o_i}, y_{i c_i} \mid x_i; \vartheta)}{\partial \vartheta_{hk}} \,\middle|\, y_{i c_i} \in D_{c_i}, x_i; \vartheta \right).$$

Let $\hat{\vartheta}$ be the current estimate of $\vartheta$. Let $\hat{Y} = (\hat{y}_{i,k})$, with

$$\hat{y}_{i,k} = \begin{cases} y_{ik} & \text{if } r_{ik} = 0, \\ E(y_{ik} \mid y_{i c_i} \in D_{c_i}, x_i; \hat{\vartheta}) & \text{otherwise,} \end{cases}$$

representing the current $n \times p$-dimensional matrix of the imputed response variables, obtained by replacing the right- and left-censored values by the expected values of the truncated normal distribution. In the same way, we let $\widehat{y_{ih} y_{ik}}$ denote the imputed second-order moments, where $\widehat{y_{ih} y_{ik}}$ is equal to:

$$\widehat{y_{ih} y_{ik}} = \begin{cases} y_{ih}\, y_{ik} & \text{if } r_{ih} = 0 \text{ and } r_{ik} = 0, \\ y_{ih}\, E(y_{ik} \mid y_{i c_i} \in D_{c_i}, x_i; \hat{\vartheta}) & \text{if } r_{ih} = 0 \text{ and } r_{ik} \ne 0, \\ E(y_{ih} \mid y_{i c_i} \in D_{c_i}, x_i; \hat{\vartheta})\, y_{ik} & \text{if } r_{ih} \ne 0 \text{ and } r_{ik} = 0, \\ E(y_{ih}\, y_{ik} \mid y_{i c_i} \in D_{c_i}, x_i; \hat{\vartheta}) & \text{if } r_{ih} \ne 0 \text{ and } r_{ik} \ne 0. \end{cases}$$

With such notation, the E-Step of the proposed algorithm requires the computation of the imputed empirical conditional covariance matrix:

$$\hat{S}_{y \mid x}(B) = n^{-1} \left\{ \hat{C}_{yy} - \hat{C}_{yx} B - B^\top \hat{C}_{xy} + B^\top C_{xx} B \right\},$$

where $\hat{C}_{yy}$ is the matrix with entries $\sum_{i=1}^{n} \widehat{y_{ih} y_{ik}}$, $\hat{C}_{xy} = \hat{C}_{yx}^\top = X^\top \hat{Y}$, $C_{xx} = X^\top X$ and $X$ denotes the usual design matrix.
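
The imputation performed in the E-Step can be illustrated in the simplest, univariate case with scipy (the paper works with multivariate truncated Gaussian moments, which it approximates; the one-dimensional sketch below, with hypothetical parameter values, only conveys the idea): a right-censored entry is replaced by the mean of the Gaussian truncated to the censoring region.

```python
import numpy as np
from scipy.stats import truncnorm

# Hypothetical current estimates of the marginal mean and standard
# deviation, and a right-censoring value u: the entry is observed
# only as "y > u".
mu, sigma, u = 1.0, 2.0, 3.0

a = (u - mu) / sigma                    # standardized lower truncation bound
y_hat = truncnorm.mean(a, np.inf, loc=mu, scale=sigma)        # E(y | y > u)
y2_hat = truncnorm.moment(2, a, np.inf, loc=mu, scale=sigma)  # E(y^2 | y > u)

print(y_hat > u)   # True: the imputed value lies inside the censoring region
```

The second moment `y2_hat` is what enters the imputed matrix of cross-products, so that the E-Step accounts for the residual variance of the censored entries rather than treating the imputed means as exact observations.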

The second step of the EM algorithm, the M-Step, requires the solution of a new maximization problem obtained by replacing the objective function of definition (10) with the so-called penalized $Q$-function:

$$Q(B, \Theta) = \log\det\Theta - \operatorname{tr}\{\Theta \hat{S}_{y \mid x}(B)\} - \lambda \sum_{k=1}^{p} \theta_{kk} \|\beta_k\|_1 - \rho \|\Theta\|_1^-. \tag{11}$$

The problem of maximizing this function is formally equivalent to the problem studied in RothmanEtAl_JCGS_10 and YinEtAl_AOAS_11. In particular, since the penalized $Q$-function (11) is a bi-convex function of $B$ and $\Theta$, its maximization can be obtained by repeating two sub-steps until a convergence criterion is met. In the first sub-step, given $\hat{\Theta}$, the regression coefficient matrix is estimated by solving the following sub-problem:

$$\min_{B}\ \operatorname{tr}\{\hat{\Theta} \hat{S}_{y \mid x}(B)\} + \lambda \sum_{k=1}^{p} \hat{\theta}_{kk} \|\beta_k\|_1, \tag{12}$$

whereas, in the second sub-step, given $\hat{B}$, the matrix $\Theta$ is estimated by solving the sub-problem:

$$\max_{\Theta \succ 0}\ \log\det\Theta - \operatorname{tr}\{\Theta \hat{S}_{y \mid x}(\hat{B})\} - \rho \|\Theta\|_1^-. \tag{13}$$

Algorithm 1 reports the pseudo-code of the proposed EM algorithm.

While problem (13) is a standard graphical lasso problem that can be efficiently solved using, for example, the block-coordinate descent algorithm proposed in FriedmanEtAl_biostat_08, problem (12) is similar to the problem studied by RothmanEtAl_JCGS_10 and YinEtAl_AOAS_11 in the case of no censoring. However, instead of solving this problem through a cyclic coordinate descent algorithm, we propose a more efficient and easy-to-implement block-coordinate descent algorithm. In particular, the algorithm that we propose in this paper is based on the following result.
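
A simplified sketch of the two alternating sub-steps, written with scikit-learn building blocks rather than the paper's algorithms: the column-wise lasso update below ignores the coupling between the columns of $B$ induced by the off-diagonal entries of $\hat{\Theta}$ (it is exact only when $\hat{\Theta}$ is diagonal), so it illustrates the alternation itself, not the multi-lasso update derived in Theorem 2. All data and tuning values are synthetic.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.covariance import graphical_lasso

rng = np.random.default_rng(0)
n, q, p = 200, 5, 4
X = rng.normal(size=(n, q))
B_true = np.zeros((q, p))
B_true[0, 0] = 1.5                       # a single true regression effect
Y = X @ B_true + rng.normal(size=(n, p))

lam, rho = 0.1, 0.1
B = np.zeros((q, p))
for _ in range(10):
    # Sub-step 1 (simplified): update each column of B by a lasso fit.
    for k in range(p):
        B[:, k] = Lasso(alpha=lam, fit_intercept=False).fit(X, Y[:, k]).coef_
    # Sub-step 2: graphical lasso on the residual covariance, as in (13).
    R = Y - X @ B
    S = (R.T @ R) / n
    _, Theta = graphical_lasso(S, alpha=rho)

print(abs(B[0, 0]) > 0.5)   # the true signal survives the penalty
```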

###### Theorem 2

Let $\hat{S}_{y \mid x}(\beta_k)$ be the matrix $\hat{S}_{y \mid x}(B)$ seen as a function of the $k$th column of $B$, denoted by $\beta_k$, while the remaining columns are held fixed at the current estimates. Then, the following minimization problem

$$\min_{\beta_k}\ \operatorname{tr}\{\hat{\Theta} \hat{S}_{y \mid x}(\beta_k)\} + \lambda \hat{\theta}_{kk} \|\beta_k\|_1,$$

is equivalent to

$$\min_{\beta_k}\ \frac{1}{n} \|\tilde{y}_k - X \beta_k\|_2^2 + \lambda \|\beta_k\|_1,$$

where $\tilde{y}_k$ is a working response vector whose $i$th element is defined in terms of the current estimates of the remaining parameters.

Theorem 2 reveals that the maximization of the penalized $Q$-function (11) with respect to $\beta_k$, while the remaining parameters are held fixed, is equivalent to a standard lasso problem (Tibshirani_JRSSB_96) that can be efficiently solved using the algorithm proposed in FriedmanEtAl_JSS_10. Algorithm 2 reports the pseudo-code of the method proposed to solve sub-problem (12), which we call the multi-lasso algorithm. Convergence to a global minimizer follows from the fact that the trace term in sub-problem (12) is a convex and differentiable function and the penalty function can be decomposed as a sum of convex functions (Tseng_JOTA_01). Furthermore, we underline that the computational efficiency of the proposed multi-lasso algorithm can be improved using the results given in WittenEtAl_JCGS_11: when $\rho$ is large enough, and after a permutation of the response variables, $\hat{\Theta}$ has a block diagonal structure. Therefore, Theorem 2 implies that $\hat{B}$ can be computed by running the multi-lasso algorithm in parallel across the blocks. This insight makes the proposed algorithm useful for the analysis of data sets where the number of variables is very large.
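
The screening result of WittenEtAl_JCGS_11 can be sketched numerically: for a given $\rho$, the blocks of the glasso solution are the connected components of the graph joining $h$ and $k$ whenever $|s_{hk}| > \rho$, and each block can then be solved independently, e.g. in parallel. The toy covariance matrix below is illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Toy empirical covariance: variables {1, 2} and {3, 4} are strongly
# coupled within each pair but only weakly coupled across pairs.
S = np.array([[1.0, 0.8, 0.0, 0.0],
              [0.8, 1.0, 0.1, 0.0],
              [0.0, 0.1, 1.0, 0.9],
              [0.0, 0.0, 0.9, 1.0]])
rho = 0.5

# Threshold the absolute off-diagonal entries at rho and take the
# connected components of the resulting adjacency matrix.
A = (np.abs(S) > rho).astype(int)
np.fill_diagonal(A, 0)
n_blocks, labels = connected_components(csr_matrix(A), directed=False)

print(n_blocks, labels)   # 2 blocks: {1, 2} and {3, 4}
```

Since the weak cross-coupling (0.1) falls below the threshold, the problem splits into two independent sub-problems, each involving only two variables.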

Finally, as discussed also in AugugliaroEtAl_BioStat_18 in the context of an (unconditional) Gaussian graphical model under censoring, the computational burden needed to compute the mixed moments can be excessive when the estimated precision matrix is dense. For this reason, also in this paper, we propose an approximate EM algorithm defined by replacing the imputed matrix of cross-products with the approximation suggested in GuoEtAl_JCGS_15. Through an extensive simulation study, AugugliaroEtAl_BioStat_18 show that such an approximation works well on a similar problem and for a large range of $\rho$-values.

### 3.3 Extension to the censored x case

Although the estimator proposed in Section 3.1 is developed under the assumption that the random vector of predictors is fully observed, the approach can be easily extended to the case in which both $X$ and $Y$ are subject to censoring mechanisms. Following the approaches described in the Introduction, in this section we enforce a distributional assumption and assume that $X$ is also normally distributed, with expected value $\mu_x$ and covariance matrix $\Sigma_{xx}$. Under this assumption, the likelihood function of the observed data set is formally equivalent to the likelihood function of the censored Gaussian graphical model proposed in AugugliaroEtAl_BioStat_18, i.e.:

$$\varphi(z_o, r; \mu_z, \Theta_{zz}) = \int_{D_c} \phi(z_o, z_c; \mu_z, \Theta_{zz})\, dz_c.$$

Given a set of $n$ independent observations, we therefore propose to estimate the sub-matrices of $\Theta_{zz}$ through the following penalized estimator:

$$\{\hat{\Theta}_{xy}, \hat{\Theta}_{yy}, \hat{\Theta}_{xx}\} = \operatorname*{arg\,max}\ \frac{1}{n} \sum_{i=1}^{n} \log \varphi(z_{i o_i}, r_i; \mu_z, \Theta_{zz}) - \lambda \|\Theta_{xy}\|_1 - \rho \left( \|\Theta_{xx}\|_1^- + \|\Theta_{yy}\|_1^- \right). \tag{14}$$

Estimator (14) is an extension of the censored glasso estimator of AugugliaroEtAl_BioStat_18, now specifically designed to estimate the submatrices $\Theta_{xx}$, $\Theta_{xy}$ and $\Theta_{yy}$ separately, in order to emphasize the different roles that they have in this kind of graphical model. It is worth noting that, differently from the approaches without censored data, the log-likelihood function used to define the estimator (14) depends on the submatrix $\Theta_{xx}$, and therefore this matrix cannot be ignored in the fitting process. More specifically, an accurate estimate of $\Theta_{xx}$ is needed in the E-Step to compute the conditional expected values of the sufficient statistics. Finally, although estimator (14) is defined using the same tuning parameter $\rho$ to control the amount of sparsity in $\hat{\Theta}_{xx}$ and $\hat{\Theta}_{yy}$, the definition can be generalized by using two specific tuning parameters, say $\rho_x$ and $\rho_y$. However, such a choice would significantly increase the computational burden of selecting the optimal values of these tuning parameters, making the application of the proposed estimator infeasible in the case of a large dataset.

## 4 Tuning parameter selection

The two tuning parameters play a central role in the proposed conditional cglasso estimator (10) since they are designed to control the complexity of the topological structure of the estimated graph and the effects of the predictors on the response variables. Standard search strategies proposed in the literature to select the optimal values of the tuning parameters of any penalized estimator are structured in two steps, each of which requires specific choices that must be carefully evaluated in order to reduce the computational burden of the global strategy.

The first step requires the definition of a two-dimensional grid of $(\lambda, \rho)$ values over which to evaluate the goodness-of-fit of the model. When the sample size is large enough, one can set the smallest $\lambda$- and/or $\rho$-value equal to zero, so that the maximum likelihood estimate is one of the points belonging to the coefficient path. On the other hand, when we are in a high-dimensional setting, one can use two values small enough to avoid overfitting the model. With respect to the largest values of the two tuning parameters, the next theorem gives both the exact formulas for computing $\lambda_{\max}$ and $\rho_{\max}$ and the corresponding conditional cglasso estimates.

###### Theorem 3

Let $\phi(y_k; \mu_k, \sigma_k^2)$ be the marginal density function of the random variable $Y_k$, where $\mu_k = E(Y_k)$ and $\sigma_k^2 = V(Y_k)$. For any index $k$, define the sets $o_k = \{i : r_{ik} = 0\}$, $c_k^- = \{i : r_{ik} = -1\}$ and $c_k^+ = \{i : r_{ik} = +1\}$, and compute the marginal maximum likelihood estimates $\hat{\mu}_k$ and $\hat{\sigma}_k^2$ by maximizing the following log-likelihood function:

$$\ell(\mu_k, \sigma_k^2) = \sum_{i \in o_k} \log \phi(y_{ik}; \mu_k, \sigma_k^2) + |c_k^-| \log \int_{-\infty}^{l_k} \phi(y; \mu_k, \sigma_k^2)\, dy + |c_k^+| \log \int_{u_k}^{+\infty} \phi(y; \mu_k, \sigma_k^2)\, dy.$$

Then, the estimates corresponding to the pair $(\lambda_{\max}, \rho_{\max})$ are $\hat{\beta}_0 = (\hat{\mu}_1, \dots, \hat{\mu}_p)^\top$, $\hat{B} = 0$ and $\hat{\Theta} = \operatorname{diag}(\hat{\sigma}_1^{-2}, \dots, \hat{\sigma}_p^{-2})$. Furthermore,

$$\lambda_{\max} = \max_{h,k}\ n^{-1} \Big| \sum_{i=1}^{n} x_{ih} (\hat{y}_{i,k} - \hat{\mu}_k) \Big|, \qquad \rho_{\max} = \|\hat{S}_{y \mid x}(\hat{B})\|_\infty^-,$$

where $\|\cdot\|_\infty^-$ denotes the maximum of the absolute values of the off-diagonal entries and $\hat{y}_{i,k}$ denotes the generic entry of $\hat{Y}$.
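
The formula for $\lambda_{\max}$ mirrors the familiar entry point of the standard lasso path, where the smallest penalty that zeroes every coefficient is $\max_h |\sum_i x_{ih} y_i| / n$ (with the marginal mean playing the role of $\hat{\mu}_k$ above). A quick numerical check with scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, q = 100, 6
X = rng.normal(size=(n, q))
y = X[:, 0] + rng.normal(size=n)
y = y - y.mean()                       # centering, as an intercept would

# Smallest penalty producing an all-zero lasso solution
# (sklearn's Lasso uses the objective (1/(2n))||y - Xb||^2 + alpha||b||_1,
# whose KKT conditions give b = 0 exactly when alpha >= max|X'y|/n).
lam_max = np.max(np.abs(X.T @ y)) / n

fit = Lasso(alpha=lam_max, fit_intercept=False).fit(X, y)
print(np.allclose(fit.coef_, 0.0))     # True: the path starts fully sparse
```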

The previous results can be easily extended to the case in which the predictor vector is also partially observed. In this case, the largest values of the two tuning parameters of the estimator (14) can be obtained through a generalization of the results given in AugugliaroEtAl_BioStat_18. In particular, denoting by $\hat{\mu}_k$ and $\hat{\sigma}_k^2$ the values resulting from the maximization of the following marginal log-likelihood function:

$$\ell(\mu_k, \sigma_k^2) = \sum_{i \in o_k} \log \phi(z_{ik}; \mu_k, \sigma_k^2) + |c_k^-| \log \int_{-\infty}^{l_k} \phi(z; \mu_k, \sigma_k^2)\, dz + |c_k^+| \log \int_{u_k}^{+\infty} \phi(z; \mu_k, \sigma_k^2)\, dz,$$

and by $\hat{S}(\hat{\mu}_z, \hat{\Theta}_{zz})$ the corresponding imputed empirical covariance matrix of $Z$, then

$$\lambda_{\max} = \|\hat{S}_{xy}(\hat{\mu}_z, \hat{\Theta}_{zz})\|_\infty, \qquad \rho_{\max} = \max\left\{ \|\hat{S}_{xx}(\hat{\mu}_z, \hat{\Theta}_{zz})\|_\infty^-,\ \|\hat{S}_{yy}(\hat{\mu}_z, \hat{\Theta}_{zz})\|_\infty^- \right\}.$$

Once $\lambda_{\max}$ and $\rho_{\max}$ are calculated, the entire coefficient path can be computed over the two-dimensional grid, using the estimates obtained for a given pair of tuning parameter values as warm starts for fitting the next conditional cglasso model. This strategy is commonly used also in other efficient lasso algorithms and R packages (FriedmanEtAl_JSS_10; glasso_manual). Motivated by our simulation study, showing the interplay between the estimation of $B$ and $\Theta$, we suggest the following strategy to compute the coefficient path: starting with the model fitted at $(\lambda_{\max}, \rho_{\max})$, the other models are fitted reducing $\rho$ faster than $\lambda$, that is, keeping $\lambda$ fixed we reduce $\rho$ from $\rho_{\max}$ to the smallest $\rho$-value in the grid. When $\rho$ reaches this value, we repeat the previous procedure using the next $\lambda$-value in the grid. The coefficient path is completed when we fit the model at the smallest pair of tuning parameter values.

The second step of a tuning strategy requires the definition of a criterion for evaluating the behaviour of the fitted models. Different choices are suggested in the literature: YinEtAl_AOAS_11, StadlerEtAl_StatComp_12 and LiEtAl_JASA_12 suggest using the Bayesian Information Criterion (BIC), whereas RothmanEtAl_JCGS_10, LeeEtAl_JMA_2012 and CaiEtAl_BioK_13 suggest using cross-validation. In this paper, we propose to select the optimal values of the two tuning parameters by using the BIC measure, defined as

$$\mathrm{BIC}(\lambda,\rho)=-2\sum_{i=1}^{n}\log\int_{D_{c_i}}\phi(y_{i,o_i},y_{i,c_i}\mid x_i;\hat B,\hat\Theta)\,dy_{i,c_i}+k(\lambda,\rho)\log n,$$

where $k(\lambda,\rho)$ denotes the number of non-zero parameters estimated by the proposed method. Although a search on the two-dimensional grid defined in the first step can be performed to select the pair $(\lambda,\rho)$ that minimizes $\mathrm{BIC}(\lambda,\rho)$, the computational burden of evaluating the log-likelihood function can make this strategy infeasible even for moderate-size problems. For this reason, following IbrahimEtAl_JASA_08, we propose to select the tuning parameters by minimizing the following approximate measure:

$$\overline{\mathrm{BIC}}(\lambda,\rho)=-n\Bigl[\log\det\hat\Theta_{\hat E}-\mathrm{tr}\bigl\{\hat\Theta_{\hat E}\,\hat S_{y\mid x}(\hat B_{\hat S})\bigr\}\Bigr]+k(\lambda,\rho)\log n,$$

i.e., by substituting the exact log-likelihood function with the $Q$-function used in the M-step of the proposed algorithm, which is easily obtained as a byproduct of the EM algorithm.
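The approximate criterion is cheap to evaluate once the EM output is available. A minimal sketch, under our own function and argument names:

```python
import numpy as np

def approx_bic(Theta_hat, S_res, n, n_nonzero):
    """Sketch of the approximate BIC: the exact log-likelihood is replaced
    by the Q-function produced by the EM algorithm.

    Theta_hat : estimated concentration matrix (restricted to fitted edges)
    S_res     : imputed residual covariance matrix S_{y|x}(B_hat)
    n         : sample size
    n_nonzero : k(lambda, rho), number of non-zero estimated parameters
    """
    sign, logdet = np.linalg.slogdet(Theta_hat)   # stable log-determinant
    q_term = logdet - np.trace(Theta_hat @ S_res)
    return -n * q_term + n_nonzero * np.log(n)
```

Minimizing this quantity over the fitted grid then selects the tuning pair without any numerical integration.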

## 5 Simulation studies

The global behavior of the proposed estimator (10), which we call conditional cglasso, is compared with the conditional MissGlasso (StadlerEtAl_StatComp_12), where $\ell_1$-penalized estimation is performed under the assumption that unobserved data are missing at random, and with the conditional glasso (3), where missing data are imputed using the censoring values. The three methods are evaluated on the basis of criteria that measure their performance along the entire coefficient path. In this way, on the one hand, we avoid the pitfalls of specific tuning-parameter strategies, which may favour one estimator over another; on the other hand, if an estimator is found to be uniformly better than the others across the entire path, that is, for any possible pair $(\lambda,\rho)$, then it will be preferable regardless of the method used to select the tuning parameters.

As in the dataset studied in Section 6, we consider right-censored response variables with censoring value equal to 50 for every response. To simulate a sample of size $n$ from a sparse conditional censored Gaussian graphical model we use the following procedure. First, the predictor vector is simulated from a multivariate Gaussian distribution with zero expected value and a covariance matrix obtained using the method implemented in the R package huge (huge_manual); more specifically, we simulate a random structure where the probability that a pair of nodes has an edge is equal to 0.2.

Then, the right-censored response vector is simulated in analogy with right-censored multivariate linear regression models. Firstly, we simulate the vector of errors from a multivariate Gaussian distribution with zero expected value and concentration matrix $\Theta$. The diagonal entries of $\Theta$ are fixed to 1, while the non-zero partial correlation coefficients are sampled from a uniform distribution. In terms of graph theory, what we have is a set of stars, i.e., complete bipartite graphs with only one internal vertex and four leaves. A similar setting for simulating the sparse concentration matrix was used in LiEtAl_JASA_12. Then, for each response, we randomly draw the set of active predictors; the corresponding regression coefficients are sampled from a uniform distribution. We use the intercepts to fix the probability of right-censoring. More specifically, each intercept is chosen in such a way that the probability of censoring attains a given level for a first block of response variables and a different level for the remaining response variables; this split is used to evaluate the effect of the number of censored response variables on the behaviour of the considered estimators. Finally, the censored response vector is obtained from the linear model, treating each simulated response greater than the censoring value as a right-censored value. We perform a simulation study in which the sample size is fixed and the numbers of responses and predictors define the different scenarios under which the competing estimators are evaluated, so that we can consider both a low-dimensional setting, where both dimensions are smaller than the sample size, and different high-dimensional settings. Finally, we let the censoring probabilities vary across a set of values.
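The simulation scheme can be sketched as follows. All dimensions, intervals, and the star layout below are illustrative choices under the stated assumptions (unit-diagonal stars with one hub and four leaves keep the concentration matrix positive definite); the intercept calibration of the censoring probability is omitted for brevity, and the predictors are drawn with identity covariance rather than via the huge package.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 10, 5          # sample size, responses, predictors (illustrative)
upper = 50.0                  # right-censoring value, as in the data of Section 6

# Star-shaped sparse concentration matrix: unit diagonal, each hub connected
# to four leaves with partial correlations drawn uniformly (illustrative range).
Theta = np.eye(p)
for hub in range(0, p, 5):
    for leaf in range(hub + 1, min(hub + 5, p)):
        val = rng.uniform(0.2, 0.4) * rng.choice([-1, 1])
        Theta[hub, leaf] = Theta[leaf, hub] = val

Sigma = np.linalg.inv(Theta)                   # error covariance
X = rng.standard_normal((n, q))                # predictors (identity covariance)
B = np.zeros((q, p))
for k in range(p):                             # a few active predictors per response
    active = rng.choice(q, size=2, replace=False)
    B[active, k] = rng.uniform(0.3, 0.7, size=2)
E = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Y_star = X @ B + E                             # latent responses
Y = np.minimum(Y_star, upper)                  # right censoring at the threshold
```

With four leaves per hub and partial correlations bounded by 0.4, the smallest eigenvalue of each star block is at least $1-\sqrt{4\cdot 0.4^2}=0.2$, so positive definiteness holds by construction.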

As explained earlier, our aim is the evaluation of the entire coefficient path of each competitor. To this end, in order to study the coefficient path for $\Theta$, we first control the amount of shrinkage on $B$ by keeping the ratio $\lambda/\lambda_{\max}$ fixed and then, for this fixed value, the path for $\Theta$ is computed across a decreasing sequence of ten evenly spaced $\rho$-values, from $\rho_{\max}$ to $\rho_{\min}$. The resulting path is evaluated both in terms of network recovery and parameter estimation. For the first aspect, we use the precision-recall curve, which shows the relationship between the quantities:

$$\text{Precision}=\frac{\#\{\hat\theta_{hk}\neq 0 \text{ and } \theta_{hk}\neq 0\}}{\#\{\hat\theta_{hk}\neq 0\}},\qquad \text{Recall}=\frac{\#\{\hat\theta_{hk}\neq 0 \text{ and } \theta_{hk}\neq 0\}}{\#\{\theta_{hk}\neq 0\}}.$$

An estimator is defined as globally preferable in terms of sparsity recovery when its precision-recall curve always lies above the other curves. To simplify our comparison, we summarize the precision-recall curve using the area under the curve (AUC). The second aspect is evaluated in terms of mean squared error, defined as
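The two quantities above translate directly into code. A minimal sketch, restricted to the off-diagonal entries of the concentration matrix (names and tolerance are ours):

```python
import numpy as np

def precision_recall(Theta_hat, Theta_true, tol=1e-8):
    """Sketch: precision and recall of an estimated sparsity pattern,
    computed on the off-diagonal entries only."""
    p = Theta_true.shape[0]
    off = ~np.eye(p, dtype=bool)               # mask of off-diagonal positions
    est = np.abs(Theta_hat[off]) > tol         # estimated non-zeros
    true = np.abs(Theta_true[off]) > tol       # true non-zeros
    tp = np.sum(est & true)                    # correctly recovered edges
    precision = tp / max(est.sum(), 1)         # guard against empty estimates
    recall = tp / max(true.sum(), 1)
    return precision, recall
```

Sweeping the estimator over the $\rho$-path and collecting these pairs traces out the precision-recall curve, whose area gives the AUC summary.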

$$\mathrm{MSE}(\hat\Theta)=E\bigl(\|\hat\Theta-\Theta\|_F^2\bigr),$$

where $\|\cdot\|_F^2$ denotes the squared Frobenius norm. An estimator is defined as globally preferable in terms of mean squared error when its path, obtained across the $\rho$-values, lies below the other paths. In order to summarize a given path, we use the minimum mean squared error attained along the path. Finally, in order to study the effects of the amount of shrinkage applied to $B$ on the different measures, we repeat the procedure across four possible values of the ratio $\lambda/\lambda_{\max}$, whereby the first value, equal to 1, corresponds to the case where the predictors are not included in the model. The comparison of the paths for the estimators of $B$ is done using the same strategy but inverting the role of $\lambda$ with $\rho$ and of $B$ with $\Theta$. The tables reporting the full results of our study, based on 50 replicates, can be found in the Supplementary Materials. Below, we summarize the results in a number of ways.

Firstly, we evaluate the network recovery ability of the considered estimators. As an example, Figure 1 shows the precision-recall curves for the scenarios with both dimensions equal to 50 and with the ratio $\lambda/\lambda_{\max}$ fixed to 0.25 (Figure 1(a)) and the ratio $\rho/\rho_{\max}$ fixed to 0.25 (Figure 1(b)).

These figures unveil two important features of the proposed estimator. Firstly, the conditional cglasso estimator behaves better than its competitors in terms of network recovery since, for each level of recall, it attains a higher level of precision; this aspect is more pronounced in the estimation of $\Theta$. Secondly, whereas the proposed estimator is rather robust to the number of censored response variables, the results of the two competitors depend strongly on the probability of censoring. In particular, the conditional glasso and the conditional MissGlasso respond to an increase in the number of censored values with a reduction in terms of both precision and recall. Also in this case, the effect seems to be more pronounced for the estimation of $\Theta$ than for that of $B$.

In order to compare the results of the low-dimensional case with the setting in which both dimensions exceed the sample size and, at the same time, in order to study the effect that the shrinkage applied to one parameter matrix has on the path of the other, Figure 2 shows the average AUCs as the ratios $\lambda/\lambda_{\max}$ and $\rho/\rho_{\max}$ vary. More specifically, Figure 2(a) shows the average AUC of the precision-recall curve for $B$ as the amount of sparsity in $\Theta$ is reduced, that is, as the ratio $\rho/\rho_{\max}$ goes from 1 to 0. In the same way, Figure 2(b) shows the average AUC for $\Theta$ as the shrinkage on $B$ is reduced, which is controlled through $\lambda/\lambda_{\max}$. Several comments arise from these figures. Firstly, the initial observations coming from the analysis of Figure 1 are confirmed in the other scenarios as well: the proposed method is preferable in terms of network recovery and is more robust to an increase in the probability of censoring. As expected, an increase in the number of variables causes a uniform loss of performance across all the methods considered. However, also in the more challenging setting, the conditional cglasso maintains its global superiority in recovering the sparsity structure of the matrices $B$ and $\Theta$. More interesting is the analysis of the crossing effects between the two estimators. Figure 2(a) suggests that the network recovery property of the three estimators of $B$ depends only weakly on the level of shrinkage on $\Theta$, since all three methods show only a slight increase in the AUC as $\rho$ is reduced. This result is not surprising: it was initially observed in RothmanEtAl_JCGS_10 and used, by the authors, to develop an approximate method to estimate the regression coefficient matrix. Such an idea was also followed by other authors, such as YinEtAl_JMA_13 and CaiEtAl_BioK_13. In contrast, Figure 2(b) shows that the performance of the three methods in the network recovery associated with $\Theta$ strongly depends on the ratio $\lambda/\lambda_{\max}$. This can be explained by observing that a reduction in the amount of shrinkage on $B$ improves the residual covariance matrix $\hat S_{y\mid x}(\hat B)$, which translates into a better estimate of the concentration matrix and therefore into an increased ability of network recovery.

Finally, we compare the three methods in terms of mean squared error. A quick look at Figures 3 and 4 suggests that the main conclusions previously obtained for network recovery remain valid when we study mean squared errors. Figure 3 shows the average MSEs for the scenario with both dimensions equal to 50 and with the ratios $\lambda/\lambda_{\max}$ and $\rho/\rho_{\max}$ both fixed to 0.25. We can clearly see that the proposed conditional cglasso is globally preferable and that the gap among the three methods widens as the probability of right censoring increases. Furthermore, also in terms of MSE, the proposed method is more robust to the percentage of censored values. In Figure 4 we use the minimum value of the mean squared error attained along the path to study the crossing effects among the estimators. Again, the proposed method shows a better performance, despite a global increase in MSE across all methods when we consider the high-dimensional setting.

Although our conclusions come from the analysis of the two extreme settings, the full tables reported in the Supplementary Materials show that our method is preferable also in the intermediate settings, that is, when only one of the two dimensions exceeds the sample size. This is true both in terms of AUC and of MSE.