1 Introduction
Undirected graphical models are extensively used to study the conditional independence structure between random variables (Jordan, 1998; Liu & Page, 2013; Liu et al., 2014). Important applications include image processing (Mignotte et al., 2000), finance (Barber & Kolar, 2018), and neuroscience (Zhu & Cribben, 2017), among others. A major challenge in real-world applications is that the underlying conditional independence structure can be distorted by confounders. Unfortunately, despite the large literature on graphical model estimation, there is limited work to date on estimation with observed confounding.
The observed confounding issue is ubiquitous. Consider the problem of estimating brain functional connectivity from functional magnetic resonance imaging (fMRI) data (Biswal et al., 1995; Fox & Raichle, 2007; Shine et al., 2015, 2016). Here, the connectivity estimate is known to be susceptible to confounding from physiological noise such as subject motion (Van Dijk et al., 2012; Goto et al., 2016). We emphasize that although the amount of motion is observed, the resulting confounding can significantly distort the connectivity matrix when estimated using conventional means, leading to incorrect scientific inferences (Laumann et al., 2016). Another example is in social network analysis. Social contagion (the effect that people close to each other in a social network have on one another) has been shown to be confounded by the effect of an individual's covariates on his or her behavior or other measurable responses (Shalizi & Thomas, 2011). As a result, a method that can recover the social contagion or the social network structure despite the individual confounding is clearly useful.
This manuscript is motivated by the question: is it possible to efficiently estimate sparse conditional independence structure between random variables with known confounders? We provide a positive answer for the important case of jointly Gaussian random variables. Although prior works have studied the issue of hidden confounders in generative undirected graphical models (Jordan, 1998), to our knowledge, this manuscript is among the first to develop methodology for observed confounders. We propose a new class of graphical models: the partially linear additive Gaussian graphical models (PLAGGMs), whose parameters capture the underlying relationships between random variables, and where these relationships take a partially linear additive form (Hastie, 2017). Further, we parametrize the model using two additive components: the target, i.e., the non-confounded structure, and the nuisance structure induced by observed confounders.
Importantly, we do not impose a parametric form on the nuisance structure, requiring only smoothness to facilitate nonparametric estimation. This significantly improves on prior work, which has required strong ad-hoc assumptions such as the linearity assumption (Van Dijk et al., 2012; Power et al., 2014) or the zero-expectation assumption (Lee & Liu, 2015; Geng et al., 2018a) on the nuisance parameter. PLAGGMs are applicable as long as not all the observed samples are highly confounded, so that the proposed procedure can compare the confounded samples with the non-confounded ones in order to remove the confounding influence.
We propose a pseudo-profile likelihood (PPL) estimator for learning PLAGGMs, which can be considered a pseudo-likelihood counterpart of the profile likelihood (Fan et al., 2005). By minimizing the regularized negative log PPL, we derive a sparsistent estimator of the target structure under mild assumptions. The sparsistency of the estimator indicates that the proposed method recovers the true underlying structure with high probability (Wainwright, 2009; Kolar et al., 2009; Kolar & Xing, 2011; Ravikumar et al., 2010). We also show that the convergence rate of the proposed estimator is faster than that of competing methods. The proposed PLAGGM can be considered a generative-model counterpart of partially linear additive discriminative models (Fan & Zhang, 2008; Cheng et al., 2014; Chouldechova & Hastie, 2015; Lou et al., 2016). Compared with these discriminative models, PLAGGM, as a generative model, focuses on estimating the relationships among random variables, and thus can be used to recover the conditional independence structure, which discriminative models like those of Sohn & Kim (2012) and Wytock & Kolter (2013) cannot. Recall that GGMs can be estimated as a collection of related regressions (Meinshausen & Bühlmann, 2006). Along similar lines, our PLAGGM approach requires studying multiple dependent discriminative models simultaneously.
Main Contributions
Our main technical contributions are summarized as follows:


To the best of our knowledge, PLAGGM is the first model to specifically deal with observed confounders in generative undirected graphical models. Without assuming a parametric form for the confounding, PLAGGM can accommodate a broad class of potential structural confounders.

We demonstrate that PLAGGMs facilitate sparsistent estimators by proposing the PPL method as a new objective for parameter estimation. Further, since the corresponding minimization problem is shown to be equivalent to a regularized weighted least-squares problem, the optimization can be carried out efficiently by leveraging the coordinate descent method (Friedman et al., 2010) and the corresponding strong screening rule (Tibshirani et al., 2012).
We demonstrate the utility of PLAGGMs using both synthetic data and the 1000 Functional Connectomes Project Cobre dataset (COBRE, 2019), a brain imaging dataset from the Center for Biomedical Research Excellence. The proposed PLAGGM demonstrates superior accuracy in structure recovery and can effectively detect abnormalities in the brain functional connectivity of subjects with schizophrenia.
2 Modeling
We begin by formulating PLAGGMs. For a continuous random vector $X$ and a confounder variable $Z$, we assume that the conditional distribution of $X$ given $Z$ follows a Gaussian graphical model (Yang et al., 2015a) with a parameter matrix, denoted by $\Theta(Z)$, that depends on $Z$, where we assume that the diagonal of the conditional covariance matrix of $X$ is normalized, without loss of generality (Yang et al., 2015a). Note that the parameter $\Theta(Z)$ captures the conditional independence structure of $X$ (we refer to the conditional independence structure simply as the structure for ease of presentation). Therefore, the structure of $X$ is allowed to vary based on the values of the confounders. This characteristic makes the proposed PLAGGM more general in scope and applicability compared to prior work in which the structure of $X$ is assumed to be unrelated to the confounders (see the discussion in Section 5).
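For concreteness, the conditional model can be written in the following standard form (a sketch in the notation above; the paper's exact diagonal normalization may differ):

```latex
% X | Z = z is multivariate Gaussian with a z-dependent precision matrix:
\[
  p(x \mid z) \;\propto\; \exp\!\Bigl( -\tfrac{1}{2}\, x^{\top} \Theta(z)\, x \Bigr),
  \qquad \Theta(z) \succ 0 ,
\]
% so that X_j and X_k are conditionally independent given the remaining
% coordinates and Z = z exactly when the entry Theta_{jk}(z) is zero.
```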
Let $\Theta^*$ represent the non-confounded structure. We assume that the parameter takes the partially linear additive form:
$\Theta(Z) = \Theta^* + \Theta^n(Z)$.  (1)
Our goal is to recover $\Theta^*$ given independent observations from the joint distribution of $(X, Z)$. The term $\Theta^n(Z)$ is a nuisance component that arises due to confounding. Thus, while the structure of $X$ varies over observations, we are only interested in a specific component, $\Theta^*$, whose sparsity pattern encodes the target non-confounded structure. It is clear that recovering $\Theta^*$ is impossible without constraints on $\Theta^n(\cdot)$. For instance, for any constant matrix $C$, the estimates $(\Theta^* + C, \Theta^n(\cdot) - C)$ and $(\Theta^*, \Theta^n(\cdot))$ result in the same likelihood, making it impossible to determine the true value of $\Theta^*$. To this end, we enforce a mild assumption: the effect of confounders is trivial when the size of the confounding itself is small. Specifically, for a known threshold $z_0$, we assume
$\Theta^n(z) = 0$  (2)
for any $z$ satisfying $\|z\| \le z_0$.
The assumption states that confounders with values smaller than $z_0$ do not have any effect on the structure of $X$, and thus it serves as a constraint on $\Theta^n(\cdot)$. Then, as long as we can observe some samples with confounders small enough (smaller than $z_0$), we should be able to distinguish $\Theta^*$ from $\Theta^n(\cdot)$. Such an assumption is much weaker than those used in existing works, including Van Dijk et al. (2012), Power et al. (2014), and Lee & Liu (2015), where linearity or zero expectation of the nuisance is often assumed. Note that when $z_0 = 0$, (2) degenerates to $\Theta^n(0) = 0$. Also, as $\Theta^n(\cdot)$ is smooth, there exist infinitely many values of $z_0$ satisfying the definition; we do not require using the largest possible one. The selection of $z_0$ in practice is discussed in Section 6.2.
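To make the role of (2) concrete, here is a minimal identifiability sketch in the notation above, with $C$ an arbitrary constant matrix introduced only for this illustration:

```latex
% Without (2), mass can be shifted freely between the two components:
\[
  \Theta^{*} + \Theta^{n}(z)
  \;=\;
  \bigl(\Theta^{*} + C\bigr) + \bigl(\Theta^{n}(z) - C\bigr)
  \quad \text{for every constant matrix } C .
\]
% With (2), evaluating at any z with ||z|| <= z_0 forces Theta^n(z) = 0,
% so the shifted nuisance Theta^n(z) - C = -C can satisfy (2) only when
% C = 0; the non-confounded samples pin down Theta^*.
```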
3 Pseudo-Profile Likelihood Method
PLAGGMs facilitate fast-converging estimators. In this section, we propose an estimation procedure for $\Theta^*$ in PLAGGMs.
3.1 Pseudo Likelihood
For a PLAGGM parameterized as above, given the observations, we first express the log pseudo likelihood (PL) as a collection of linear regressions in Lemma 1.
Lemma 1.
For each node, define the vector obtained from an observation by modifying the corresponding component, together with the associated column of the observation matrix and the associated column of the parameter matrix. The log pseudo likelihood of the PLAGGM follows
(3)
Note that (3) has the same form as the objective function of a set of linear regressions, one per node, each with the observations as samples. Specifically, for the regression associated with a given node, the covariate matrix is formed from the remaining coordinates of each observation, and the corresponding response is that node's coordinate.
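As an illustration of this regression viewpoint, a generic GGM pseudo likelihood decomposes as follows (a sketch up to the paper's sign and scaling conventions, with $\beta_{jk}$ denoting reparametrized entries of $\Theta$):

```latex
\[
  \ell_{\mathrm{PL}}
  \;=\;
  \mathrm{const}
  \;-\; \frac{1}{2} \sum_{j=1}^{p} \sum_{i=1}^{n}
  \Bigl( x_{ij} - \sum_{k \neq j} \beta_{jk}(z_i)\, x_{ik} \Bigr)^{2} ,
\]
% one least-squares problem per node j: the response is the j-th
% coordinate of each observation, the covariates are the remaining
% coordinates, and the coefficients beta_{jk}(z_i) are reparametrized
% entries of the precision matrix Theta(z_i).
```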
For graphical models without confounders, it is known that minimizing the regularized negative log PL (Geng et al., 2017, 2018b, 2018c; Kuang et al., 2017) can lead to sparsistent parameter estimators (Yang et al., 2015a). Unfortunately, this is no longer true for PLAGGMs, since the nuisance component is nonparametric and the number of unknown nuisance parameters is therefore far too large. Instead, we leverage kernel methods and propose an approximate PL.
3.2 Pseudo-Profile Likelihood
We propose a new inductive principle to estimate $\Theta^*$. As mentioned in Section 3.1, the varying nuisance terms $\Theta^n(z_i)$ are an obstruction to estimating $\Theta^*$. We summarize these varying effects observation by observation and rewrite (3) accordingly. There are then two unknown parts in the PL: $\Theta^*$ and the nuisance terms. Intuitively, if we can express the nuisance terms through $\Theta^*$, we will be able to omit them and focus on estimating $\Theta^*$. This leads to the following lemma on approximating the nuisance terms.
Lemma 2.
For the $i$-th observation, we define an $n \times n$ kernel weight matrix $W_i$, a diagonal matrix whose $k$-th diagonal entry is $K_h(z_k - z_i)$, where $K$ is a symmetric kernel density function, $K_h(\cdot) = K(\cdot / h) / h$, and $h$ is a user-specified bandwidth. Then, we define an auxiliary matrix built from a user-specified function satisfying the smoothing assumptions in Section 4.1. An estimator of the nuisance term can then be derived in closed form from the resulting kernel-weighted least-squares problem.
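A minimal sketch of the kernel-weight construction in Lemma 2, assuming a Gaussian kernel; the function and variable names, and the $1/h$ normalization, are illustrative choices rather than the paper's exact specification:

```r
# Kernel weight matrix W_i for observation i: an n x n diagonal matrix
# whose k-th entry weights observation k by its confounder's proximity
# to z_i, via a symmetric kernel with user-specified bandwidth h.
kernel_weights <- function(z, i, h) {
  k <- dnorm((z - z[i]) / h) / h   # Gaussian kernel K_h(z_k - z_i)
  diag(k)                          # place the weights on the diagonal
}

z <- runif(100)                    # toy confounder values
W_5 <- kernel_weights(z, i = 5, h = 0.1)
```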
The auxiliary function in Lemma 2 is user-specified. In Theorem 1, we show that its particular choice does not affect the sparsistency of the estimation, as long as it satisfies the definitions in Lemma 2.
Note that, given the observations, the estimator in Lemma 2 depends only on $\Theta^*$. Therefore, by replacing the nuisance terms with this estimator in (3), and after some additional transformations, we can derive an approximate log pseudo likelihood whose only unknown parameter is $\Theta^*$. We define this as the log pseudo-profile likelihood (PPL):
Definition 1 (PPL).
Following the notation above, the log PPL is defined as
(4)
where the auxiliary weight vector is an $n$-dimensional vector whose $i$-th component is one and whose other components are zeros.
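Conceptually, the PPL is the pseudo likelihood with the nuisance profiled out by its kernel estimate; under the notation above, it can be sketched as:

```latex
\[
  \ell_{\mathrm{PPL}}(\Theta^{*})
  \;=\;
  \ell_{\mathrm{PL}}\bigl( \Theta^{*},\; \widehat{\Theta}^{\,n}(\cdot \,; \Theta^{*}) \bigr),
\]
% where the second argument is the kernel-weighted nuisance estimate of
% Lemma 2, written as a function of Theta^* alone, so that Theta^* is
% the only remaining unknown in (4).
```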
The proposed PPL shares a close relationship with the profile likelihood (Speckman, 1988; Fan et al., 2005): if the components of $X$ were independent of each other, the form of the PPL would be equivalent to the log profile likelihood. However, we do not make any such independence assumption here, which makes the PPL a type of log pseudo likelihood. Such a rationale of intentionally overlooking the dependency is widely used in the derivation of various types of pseudo likelihoods, including the one in Huang et al. (2012). However, Huang et al. (2012) focus on Cox regression for longitudinal data analysis, which is different from our setting. Also, the inductive principle in Huang et al. (2012) emphasizes consistency, while we will show that a sparsistent estimator can be achieved using the PPL.
3.3 Regularized MaPPLE
With the proposed PPL (4), we can now derive an estimator for $\Theta^*$. For ease of presentation, we write $\ell_{\mathrm{PPL}}(\Theta^*)$ for the log PPL. Then, the regularized MaPPLE is derived as
$\widehat{\Theta}^* \in \operatorname*{arg\,min}_{\Theta^*} \; -\ell_{\mathrm{PPL}}(\Theta^*) + \lambda \lVert \Theta^* \rVert_1$,  (5)
where $\lambda$ is the regularization parameter.
Note that (5) has the same form as a regularized weighted least-squares problem. Therefore, the optimization can be efficiently solved using the coordinate descent method (Friedman et al., 2010), combined with the strong screening rule (Tibshirani et al., 2012). We implement the optimization using the R package glmnet (Friedman et al., 2010).
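A minimal sketch of this optimization step with glmnet; the design matrix, response, and weights below are hypothetical stand-ins for the transformed quantities produced by the PPL construction:

```r
# Solve a lasso-penalized weighted least-squares problem of the form (5).
library(glmnet)

set.seed(1)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)   # stand-in for the transformed design
y <- rnorm(n)                     # stand-in for the transformed response
w <- runif(n, 0.5, 1.5)           # stand-in for per-observation weights

# alpha = 1 selects the lasso penalty; glmnet runs coordinate descent and
# applies the strong screening rule of Tibshirani et al. (2012) internally.
fit <- glmnet(X, y, family = "gaussian", alpha = 1, weights = w)

# Coefficients at one penalty level; lambda is tuned by cross-validation
# in the experiments (Section 6.1).
beta_hat <- coef(fit, s = 0.1)
```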
4 Sparsistency of the Regularized MaPPLE
The regularized MaPPLE (5) is shown to be sparsistent under mild assumptions.
4.1 Assumptions
To start, we discuss the assumptions needed for the estimator. Since the estimation in Lemma 2 is based on kernel methods, we need some standard assumptions widely used in this literature (Mack & Silverman, 1982; Fan et al., 2005; Kolar et al., 2010b). The following assumptions concern the orders of the bandwidth and related quantities, as well as the required smoothness.
Assumption 1.
Define with and . Then, we assume that there exists , so that .
Assumption 2.
For any value of the confounder, the following matrices are all element-wise Lipschitz continuous with respect to it:
Also, since we do not impose parametric assumptions on the confounder distribution or the nuisance component, we further need the following assumptions on both.
Assumption 3.
The random variable $Z$ has bounded support, and its density is Lipschitz continuous and bounded away from 0 on that support. The nuisance component has a continuous second derivative.
Next, we introduce an assumption required for sparsistency. The following mutual incoherence condition is vital to sparsistency (Ravikumar et al., 2010). Here, we treat the underlying true parameter as a vector containing all of its distinct components, without repeats.
Assumption 4.
Define $S$ as the index set of the non-diagonal, nonzero components of the true parameter, $D$ as the index set of its diagonal components, and $S^c$ as the index set of its non-diagonal, zero components. Define the incoherence coefficient accordingly. Then there exists a constant such that the incoherence bounds stated below hold, where we use the index sets as subscripts to represent the corresponding components of a vector or a matrix.
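For reference, the standard form of the mutual incoherence condition in this literature (Ravikumar et al., 2010), which Assumption 4 instantiates for the PPL objective, can be sketched as:

```latex
\[
  \bigl\| Q_{S^{c} S}\, \bigl( Q_{S S} \bigr)^{-1} \bigr\|_{\infty}
  \;\le\; 1 - \alpha
  \qquad \text{for some } \alpha \in (0, 1] ,
\]
% where Q is the Fisher-information-like matrix of the objective at the
% true parameter, S indexes the true non-zero components, and S^c the
% true zeros; the condition limits how strongly the irrelevant
% parameters can be correlated with the relevant ones.
```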
Our final assumption is required by the fixed-point proof technique we apply (Ortega & Rheinboldt, 2000; Yang & Ravikumar, 2011), and may not be necessary for more refined proofs.
Assumption 5.
Define , where with , and for some . Then .
4.2 Main Theoretical Results
With the assumptions in Section 4.1 in place, the sparsistency of the regularized MaPPLE is established in Theorem 1.
Theorem 1.
According to Theorem 1, the regularized MaPPLE recovers the true structure of $X$ with high probability. Also, the scale of the estimation error is bounded by a quantity that converges to zero at the rate given in the theorem. In other words, the smallest magnitude of a nonzero component of the true parameter that the PPL method can distinguish from zero converges to zero at that rate. We refer to this result as sparsistency.
Such a convergence rate is faster than that of ordinary nonparametric methods (Speckman, 1988; Kolar et al., 2010b). Also, the sparsistency matches the results for semiparametric methods (Fan et al., 2005; Fan & Zhang, 2008) in the discriminative setting, where the estimated parametric part is shown to be consistent.
In Theorem 1, the value of $z_0$ does not affect the sparsistency of the estimator. In practice, however, if $z_0$ is too small, the relevant design matrix tends to be singular, since few samples are observed with confounders below the threshold; the PPL method is then not applicable. Therefore, we need to observe some non-confounded samples to implement the PPL method. The sparsistency is not directly related to the selected bandwidth either: in the proof of Theorem 1 in the Supplements, we note that the bandwidth (and thus the kernel weights) can only affect some auxiliary constants. Since this relationship is neither significant nor straightforward, we do not discuss it here.
5 Related Methods
After a thorough analysis of the proposed PLAGGM and the PPL method, we now discuss related methods that fall into four categories: Gaussian graphical models incorporating confounders as variables, denoted by CONGGMs; Gaussian graphical models using linear regression to deal with confounders (Van Dijk et al., 2012; Power et al., 2014), denoted by LRGGMs; original Gaussian graphical models using only non-confounded samples, denoted by GGMs; and time-varying Gaussian graphical models (Kolar et al., 2010b; Yang et al., 2015b), denoted by TVGGMs. Theoretically, the proposed PLAGGM is more general and facilitates faster-converging estimators than the existing models.
5.1 CONGGMs and LRGGMs
Although not designed for this task, it is possible to apply more standard graphical modeling approaches to deal with some of the effects of observed confounders. For instance, a straightforward alternative to PLAGGMs is to directly incorporate the confounder as a random variable in the GGM. Specifically, CONGGMs assume that the confounder $Z$ follows a GGM jointly with the random vector $X$, which means
(6) 
where the joint covariance matrix follows
Since the target structure is that of $X$ conditional on $Z$, we can estimate the joint model by the graphical lasso (Friedman et al., 2008) first, and then derive the inverse conditional covariance matrix of $X$ given $Z$.
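A minimal sketch of this baseline; the variable names and the penalty level rho are illustrative:

```r
# CONGGM baseline: graphical lasso on the joint (X, Z) covariance, then
# read off the conditional structure of X given Z.
library(glasso)

n <- 500; p <- 20
X <- matrix(rnorm(n * p), n, p)   # primary variables
Z <- matrix(rnorm(n), n, 1)       # observed confounder
S <- cov(cbind(X, Z))             # joint sample covariance

fit <- glasso(S, rho = 0.1)       # sparse estimate of the joint precision
Omega <- fit$wi                   # estimated joint precision matrix

# For a joint Gaussian, the precision matrix of X | Z is exactly the
# X-block of the joint precision, so no further inversion is needed.
Theta_X_given_Z <- Omega[1:p, 1:p]
```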
LRGGM is a model widely used in the neuroscience literature (Van Dijk et al., 2012; Power et al., 2014), assuming that the confounders cause a linear distortion of the observed samples. The model can be formulated as:
(7) 
where the noise term follows a Gaussian graphical model and the confounder satisfies Assumption 3. Since, conditional on the confounder, the model reduces to a GGM, the target parameter for the non-confounded structure is simply the precision matrix of the noise term. LRGGMs use linear regressions to recover the linear coefficients and thereby regress out the confounding. Finally, LRGGMs estimate the structure by the graphical lasso.
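A minimal sketch of the LRGGM pipeline; the linear confounding matrix B and the penalty level are illustrative toys:

```r
# LRGGM baseline: regress each coordinate of X on the confounder Z, keep
# the residuals, and run the graphical lasso on the residual covariance.
library(glasso)

n <- 500; p <- 20
Z <- matrix(rnorm(n), n, 1)
B <- matrix(rnorm(p), 1, p)                  # toy linear confounding effect
X <- Z %*% B + matrix(rnorm(n * p), n, p)    # confounded observations

# Regress out the (assumed linear) confounding effect coordinate-wise.
X_resid <- apply(X, 2, function(x) resid(lm(x ~ Z)))

fit <- glasso(cov(X_resid), rho = 0.1)
Theta_hat <- fit$wi   # estimated non-confounded structure
```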
By deriving the inverse covariance matrices of $X$ conditional on the confounder for both CONGGMs and LRGGMs, one can see that these inverse conditional covariance matrices do not depend on the value of the confounder. In other words, the confounder does not affect the conditional independence structure of $X$, which is often an unrealistic restriction. In contrast, PLAGGM specifically models the confounding of the structure itself. Following this direction, we can derive the following theorem, which describes the relationship among CONGGMs, LRGGMs, and PLAGGMs.
Theorem 2.

Thus, it is clear that CONGGMs and LRGGMs both assume a constant underlying structure that does not depend on the confounder, and are parametric special cases of the proposed PLAGGMs. Also, since the two methods assume no structural confounding, either exactly or asymptotically, they will treat the average of the varying structures as the underlying target and derive incorrect structures that are too dense.
5.2 GGMs and TVGGMs
In PLAGGMs, it is assumed that some non-confounded samples are observed. Therefore, we can directly apply a GGM to the non-confounded samples and estimate the structure. However, by doing this, only the information from the non-confounded data is used. The resulting estimators are clearly not sparsistent in our setting, considering that most of the observed samples are confounded and thus discarded by the GGM. Thus, the method is less accurate than PLAGGM.
Another class of relevant methods are time-varying graphical models (TVGMs) (Song et al., 2009a, b; Kolar et al., 2010b; Kolar & Xing, 2012), used to estimate a different parameter at each time point or observation. Specifically, TVGGMs assume that $X$ follows a Gaussian graphical model whose parameter varies over observations. Methods like the fused lasso (Yang et al., 2015b; Zhu & Koyejo, 2018) and kernel estimation (Kolar et al., 2010b) are applied to estimate the varying structure. One may consider applying such a model and then averaging the time-varying graphs to estimate the non-confounded component. However, since the target of such methods is a collection of structures, the estimators can only be guaranteed to be consistent. In contrast, PLAGGMs use all the samples to recover the single parameter representing the underlying non-confounded structure, and can achieve sparsistency.
5.3 Graphical Models with Nonparametric Methods
PLAGGM is not the first approach to incorporate nonparametric methods into graphical models. Prior works (Liu et al., 2009; Kolar et al., 2010a; Voorman et al., 2013; Wang & Kolar, 2014; Suggala et al., 2017; Lu et al., 2015, 2018) have tried to relax the parametric definition of graphical models to obtain more general models. However, these methods do not help much in dealing with observed confounders, since the structure among the random variables is assumed to be independent of the values of the confounders. Partially linear additive models have also been combined with directed acyclic graphs (Rothenhäusler et al., 2018), but that work was developed for causal inference rather than structure analysis.
6 Experiments
To demonstrate the empirical performance of the proposed PLAGGM and PPL method, we apply them to synthetic data for a structure recovery task in Section 6.1 and a real fMRI dataset for a brain functional connectivity estimation task in Section 6.2.
6.1 Structure Recovery
In this section, we use simulated data to compare the related methods discussed in Section 5 with the proposed PLAGGM for structure recovery. We simulate data from PLAGGMs following the procedure provided in the Supplement, considering several numbers of variables with a fixed number of samples for all settings. Then, the four methods are applied to the generated datasets to recover the underlying conditional independence structure. The regularization parameter is selected by 10-fold cross-validation from the series of automatically generated candidate values provided by glmnet. The bandwidth is determined according to Assumption 1: we use a polynomial-in-$n$ bandwidth whose constant is selected according to the designated setting. We have also studied other forms for the bandwidth, which did not significantly affect the performance.
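A minimal sketch of this tuning procedure; the bandwidth constant and exponent are illustrative choices standing in for the rate prescribed by Assumption 1:

```r
# Select lambda by 10-fold cross-validation over glmnet's automatically
# generated path, and set the bandwidth h by a polynomial-in-n rule.
library(glmnet)

n <- 400; p <- 30
X <- matrix(rnorm(n * p), n, p)   # stand-in design
y <- rnorm(n)                     # stand-in response

cv_fit <- cv.glmnet(X, y, nfolds = 10, alpha = 1)
lambda_hat <- cv_fit$lambda.min   # selected regularization parameter

h <- 1.0 * n^(-1 / 5)             # assumed bandwidth form c * n^(-1/5)
```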
The area under the receiver operating characteristic curve (AUC) achieved with the selected hyperparameters is reported in Figure 1. Consistent with the analysis in Section 5, the proposed PLAGGM achieves higher AUCs for structure recovery than the competing methods. Also, as the number of variables increases, the advantage of PLAGGM becomes more significant. This phenomenon results from the sparsistency of the regularized MaPPLE, which is more accurate and requires less data. It should also be noticed that the AUC achieved by CONGGM is always around 0.5. The reason is that, under the data simulation procedure, the true varying structures are always dense, although the target structure is sparse. As suggested by the analysis in Section 5, CONGGM treats the average of the varying structures as the target, and thus tends to recover an incorrectly dense structure.
6.2 Brain Functional Connectivity Estimation
We apply the PLAGGM to the 1000 Functional Connectomes Project Cobre dataset (COBRE, 2019) from the Center for Biomedical Research Excellence. The dataset contains 147 subjects: 72 subjects with schizophrenia and 75 healthy controls. For each subject, resting-state fMRI time series and the corresponding confounders are recorded. We use the 7 motion-related confounders provided in the dataset for the analysis, and apply the Harvard-Oxford atlas to select 48 atlas regions of interest (ROIs). Additional preprocessing details are deferred to the dataset authors (COBRE, 2019). The performance of PLAGGM is compared to LRGGM, which is the most widely used method to deal with motion confounding in the fMRI literature (Van Dijk et al., 2012; Power et al., 2014).
We fix the threshold $z_0$ for the following analysis. If the selected $z_0$ is less than the largest possible value, the estimation should still be accurate, since (2) is satisfied. However, a $z_0$ that is too small may induce a singular design and thus the failure of the PPL method. We select the smallest $z_0$ for which the PPL can be successfully implemented, and use the corresponding value. The results using other choices are reported in the Supplements. By Theorem 1, the form of the user-specified function does not affect the sparsistency of the estimator, and thus has a limited effect on the performance.
General Analysis
We first analyze the brain functional connectivity estimated by the PLAGGM. Specifically, following common practice in this area (Belilovsky et al., 2016), we assume that all the fMRIs from the subjects with schizophrenia follow a single PLAGGM with the same brain functional connectivity, and thus combine the preprocessed fMRIs from these subjects into one dataset. Then, the PPL method is applied to the combined dataset to estimate a target structure, which corresponds to the brain functional connectivity of all the subjects with schizophrenia. The same procedure is also implemented on the control subjects' fMRI datasets.
For comparison, we also apply the LRGGM discussed in Section 5 following the aforementioned procedure. The estimated brain functional connectivity for the schizophrenia subjects and the controls under the two methods is reported in Figure 2. The ROIs are denoted by nodes with different colors. Edges among nodes denote the estimated functional connectivity: red edges denote positive connections, while blue ones denote negative connections; the darker the color, the stronger the connection. We only provide the figure from one angle here; figures from other angles are provided in the Supplements.
Comparing the glass-brain figure for controls with the one for schizophrenia subjects estimated by PLAGGM, we find that the Occipital Pole and the Central Opercular Cortex are the two areas that differ the most. Interestingly, these two areas have been implicated in the literature as highly associated with schizophrenia (Sheffield et al., 2015). Also, comparing the results of PLAGGMs with those of LRGGMs, the LRGGM estimates are much denser and dominated by positive connections. This phenomenon is consistent with our analysis in Section 5: LRGGMs treat the average of the varying structures as the estimator of the target and derive incorrect, over-dense estimates. This strongly suggests that the dense structure here is an artifact of the confounders, which are not successfully accounted for by the regression.
Schizophrenia Diagnosis
Ideally, to demonstrate the accuracy of the proposed method for structure recovery, we should compare the estimated brain functional connectivity to the underlying ground truth, which, however, is not available in practice. Therefore, we consider a surrogate evaluation that uses the estimated functional connectivity for schizophrenia diagnosis. Intuitively, if the recovered connectivity is more accurate due to effectively removing the confounding, we should be able to improve schizophrenia diagnosis using the estimated connectivity as features. We apply PLAGGMs and LRGGMs to each subject separately, and calculate a connectivity estimate for every subject. Then, we use these estimates as the input for classification methods to classify the subjects. For this two-class classification task, we use $\ell_1$-regularized logistic regression, as this is a common approach in the literature (Patel et al., 2016). We first consider using only the structure (signs without values) of the estimates as the input for the classification, as sketched after this paragraph. Although the classification is of course more challenging, we can more clearly see how helpful the structure itself is for the diagnosis. The AUCs for the diagnosis are reported in Figure 2(a). For a thorough comparison, we use the regularization parameters suggested by the R package glmnet for the penalized logistic regression and report all the AUCs. Clearly, PLAGGMs result in more accurate predictions. Therefore, the brain connectivity estimates derived by PLAGGMs are more informative for schizophrenia diagnosis and more accurate than those of LRGGMs.
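A minimal sketch of this sign-structure classification; the inputs here are random stand-ins for the per-subject estimates:

```r
# Diagnosis from the sign pattern of per-subject connectivity estimates,
# using L1-regularized logistic regression (glmnet, family = "binomial").
library(glmnet)

n_subj <- 147; p <- 48
Theta_list <- replicate(n_subj, matrix(rnorm(p * p), p, p), simplify = FALSE)
labels <- rbinom(n_subj, 1, 0.5)   # 1 = schizophrenia, 0 = control (toy)

# Keep only the structure: vectorize the signs of the upper triangle.
features <- t(sapply(Theta_list, function(Th) sign(Th[upper.tri(Th)])))

cv_fit <- cv.glmnet(features, labels, family = "binomial", alpha = 1)
pred <- predict(cv_fit, features, s = "lambda.min", type = "response")
```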
We next include the values in the input and evaluate the accuracy of schizophrenia diagnosis. Again, we report the results for all regularization parameters in Figure 2(b). The green line indicates the best performance using LRGGM and regularized logistic regression achieved in Patel et al. (2016). Since we are using exactly the same dataset, we directly use their results for LRGGM combined with penalized logistic regression for a fair comparison. For most of the regularization parameters, PLAGGMs produce more accurate diagnoses than LRGGMs. Experiments using different settings are included in the Supplements; the results are similar. Therefore, we conclude that PLAGGMs more effectively address confounding and derive more accurate estimates of brain functional connectivity.
We note, however, that some specialized alternative classifiers have been developed in the brain connectivity literature (Patel et al., 2016; Arroyo-Relión et al., 2017; Andersen et al., 2018), and expect that our approach will improve performance for those classifiers as well. We leave such further analysis to future work.
7 Conclusions and Future Work
We propose PLAGGMs to study the relationships among random variables with observed confounders. PLAGGMs are notably general and facilitate sparsistent estimators. The utility of PLAGGMs is demonstrated using a real-world fMRI dataset for brain connectivity estimation. While we have used GGMs as a running example, the results can be generalized to other undirected graphical models, especially those based on univariate exponential family distributions (UEFDs) (Yang et al., 2015a). We leave the details to future work.
References

Andersen et al. (2018) Andersen, M., Winther, O., Hansen, L. K., Poldrack, R., and Koyejo, O. Bayesian structure learning for dynamic brain connectivity. In International Conference on Artificial Intelligence and Statistics, pp. 1436–1446, 2018.
 Arroyo-Relión et al. (2017) Arroyo-Relión, J. D., Kessler, D., Levina, E., and Taylor, S. F. Network classification with applications to brain connectomics. arXiv preprint arXiv:1701.08140, 2017.

Barber & Kolar (2018) Barber, R. F. and Kolar, M. Rocket: Robust confidence intervals via kendall's tau for transelliptical graphical models. Ann. Statist., 46(6B):3422–3450, 2018. ISSN 0090-5364. doi: 10.1214/17-AOS1663.
 Belilovsky et al. (2016) Belilovsky, E., Varoquaux, G., and Blaschko, M. B. Testing for differences in gaussian graphical models: applications to brain connectivity. In Advances in Neural Information Processing Systems, pp. 595–603, 2016.
 Biswal et al. (1995) Biswal, B., Zerrin Yetkin, F., Haughton, V. M., and Hyde, J. S. Functional connectivity in the motor cortex of resting human brain using echo-planar mri. Magnetic resonance in medicine, 34(4):537–541, 1995.
 Cheng et al. (2014) Cheng, G., Zhou, L., Huang, J. Z., et al. Efficient semiparametric estimation in generalized partially linear additive models for longitudinal/clustered data. Bernoulli, 20(1):141–163, 2014.
 Chouldechova & Hastie (2015) Chouldechova, A. and Hastie, T. Generalized additive model selection. arXiv preprint arXiv:1506.03850, 2015.
 COBRE (2019) COBRE. The center for biomedical research excellence. http://fcon_1000.projects.nitrc.org/indi/retro/cobre.html, 2019. (Accessed on 01/14/2019).
 Eaton (1983) Eaton, M. Multivariate statistics: a vector space approach. Lecture notes–monograph series. Institute of Mathematical Statistics, 1983. ISBN 9780940600690. URL https://books.google.com/books?id=WyvvAAAAMAAJ.
 Fan & Zhang (2008) Fan, J. and Zhang, W. Statistical methods with varying coefficient models. Statistics and its Interface, 1(1):179, 2008.
 Fan et al. (2005) Fan, J., Huang, T., et al. Profile likelihood inferences on semiparametric varyingcoefficient partially linear models. Bernoulli, 11(6):1031–1057, 2005.
 Fox & Raichle (2007) Fox, M. D. and Raichle, M. E. Spontaneous fluctuations in brain activity observed with functional magnetic resonance imaging. Nature Reviews Neuroscience, 8(9):700–711, 2007.
 Friedman et al. (2008) Friedman, J., Hastie, T., and Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
 Friedman et al. (2010) Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010. URL http://www.jstatsoft.org/v33/i01/.
 Geng et al. (2017) Geng, S., Kuang, Z., and Page, D. An efficient pseudo-likelihood method for sparse binary pairwise markov network estimation. arXiv preprint arXiv:1702.08320, 2017.
 Geng et al. (2018a) Geng, S., Kolar, M., and Koyejo, O. Joint nonparametric precision matrix estimation with confounding. arXiv preprint arXiv:1810.07147, 2018a.
 Geng et al. (2018b) Geng, S., Kuang, Z., Liu, J., Wright, S., and Page, D. Stochastic learning for sparse discrete markov random fields with controlled gradient approximation error. In Uncertainty in artificial intelligence: proceedings of the… conference. Conference on Uncertainty in Artificial Intelligence, volume 2018, pp. 156. NIH Public Access, 2018b.
 Geng et al. (2018c) Geng, S., Kuang, Z., Peissig, P., and Page, D. Temporal poisson square root graphical models. In International Conference on Machine Learning, pp. 1700–1709, 2018c.
 Goto et al. (2016) Goto, M., Abe, O., Miyati, T., Yamasue, H., Gomi, T., and Takeda, T. Head motion and correction methods in resting-state functional mri. Magnetic Resonance in Medical Sciences, 15(2):178–186, 2016.
 Hastie (2017) Hastie, T. J. Generalized additive models. In Statistical models in S, pp. 249–307. Routledge, 2017.
 Huang et al. (2012) Huang, C.-Y., Qin, J., and Follmann, D. A. A maximum pseudo-profile likelihood estimator for the cox model under length-biased sampling. Biometrika, 99(1):199–210, 2012.
 Jordan (1998) Jordan, M. I. Learning in graphical models, volume 89. Springer Science & Business Media, 1998.
 Kolar & Xing (2011) Kolar, M. and Xing, E. On time varying undirected graphs. In Proc. of AISTATS, 2011.
 Kolar & Xing (2012) Kolar, M. and Xing, E. P. Estimating networks with jumps. Electron. J. Stat., 6:2069–2106, 2012. ISSN 1935-7524. doi: 10.1214/12-EJS739. URL https://doi.org/10.1214/12-EJS739.
 Kolar et al. (2009) Kolar, M., Song, L., and Xing, E. P. Sparsistent learning of varying-coefficient models with structural changes. In Advances in neural information processing systems, pp. 1006–1014, 2009.
 Kolar et al. (2010a) Kolar, M., Parikh, A. P., and Xing, E. P. On sparse nonparametric conditional covariance selection. In Fürnkranz, J. and Joachims, T. (eds.), Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21–24, 2010, Haifa, Israel, pp. 559–566. Omnipress, 2010a. URL https://icml.cc/Conferences/2010/papers/197.pdf.
 Kolar et al. (2010b) Kolar, M., Song, L., Ahmed, A., Xing, E. P., et al. Estimating timevarying networks. The Annals of Applied Statistics, 4(1):94–123, 2010b.
 Kuang et al. (2017) Kuang, Z., Geng, S., and Page, D. A screening rule for l1-regularized ising model estimation. In Advances in neural information processing systems, pp. 720–731, 2017.
 Laumann et al. (2016) Laumann, T. O., Snyder, A. Z., Mitra, A., Gordon, E. M., Gratton, C., Adeyemo, B., Gilmore, A. W., Nelson, S. M., Berg, J. J., Greene, D. J., et al. On the stability of bold fmri correlations. Cerebral cortex, 27(10):4719–4732, 2016.
 Laurent & Massart (2000) Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pp. 1302–1338, 2000.
 Lee & Liu (2015) Lee, W. and Liu, Y. Joint estimation of multiple precision matrices with common structures. The Journal of Machine Learning Research, 16(1):1035–1062, 2015.
 Liu et al. (2009) Liu, H., Lafferty, J., and Wasserman, L. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10(Oct):2295–2328, 2009.
 Liu & Page (2013) Liu, J. and Page, D. Bayesian estimation of latently-grouped parameters in undirected graphical models. In Advances in neural information processing systems, pp. 1232–1240, 2013.
 Liu et al. (2014) Liu, J., Zhang, C., Burnside, E., and Page, D. Learning heterogeneous hidden markov random fields. In Artificial Intelligence and Statistics, pp. 576–584, 2014.
 Lou et al. (2016) Lou, Y., Bien, J., Caruana, R., and Gehrke, J. Sparse partially linear additive models. Journal of Computational and Graphical Statistics, 25(4):1126–1140, 2016.
 Lu et al. (2015) Lu, J., Kolar, M., and Liu, H. Kernel meets sieve: Postregularization confidence bands for sparse additive model. arXiv preprint arXiv:1503.02978, 2015.
 Lu et al. (2018) Lu, J., Kolar, M., and Liu, H. Post-regularization inference for time-varying nonparanormal graphical models. Journal of Machine Learning Research, 18(203):1–78, 2018. URL http://jmlr.org/papers/v18/17-145.html.
 Mack & Silverman (1982) Mack, Y.p. and Silverman, B. W. Weak and strong uniform consistency of kernel regression estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 61(3):405–415, 1982.
 Meinshausen & Bühlmann (2006) Meinshausen, N. and Bühlmann, P. High-dimensional graphs and variable selection with the lasso. The annals of statistics, 34(3):1436–1462, 2006.
 Mignotte et al. (2000) Mignotte, M., Collet, C., Perez, P., and Bouthemy, P. Sonar image segmentation using an unsupervised hierarchical mrf model. IEEE transactions on image processing, 9(7):1216–1231, 2000.
 Ortega & Rheinboldt (2000) Ortega, J. M. and Rheinboldt, W. C. Iterative solution of nonlinear equations in several variables. SIAM, 2000.

Patel et al. (2016) Patel, P., Aggarwal, P., and Gupta, A. Classification of schizophrenia versus normal subjects using deep learning. In Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing, pp. 28. ACM, 2016.
 Power et al. (2014) Power, J. D., Mitra, A., Laumann, T. O., Snyder, A. Z., Schlaggar, B. L., and Petersen, S. E. Methods to detect, characterize, and remove motion artifact in resting state fmri. Neuroimage, 84:320–341, 2014.
 Ravikumar et al. (2010) Ravikumar, P., Wainwright, M. J., Lafferty, J. D., et al. High-dimensional ising model selection using l1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.
 Rothenhäusler et al. (2018) Rothenhäusler, D., Ernest, J., Bühlmann, P., et al. Causal inference in partially linear structural equation models. The Annals of Statistics, 46(6A):2904–2938, 2018.
 Shalizi & Thomas (2011) Shalizi, C. R. and Thomas, A. C. Homophily and contagion are generically confounded in observational social network studies. Sociological methods & research, 40(2):211–239, 2011.
 Sheffield et al. (2015) Sheffield, J. M., Repovs, G., Harms, M. P., Carter, C. S., Gold, J. M., MacDonald III, A. W., Ragland, J. D., Silverstein, S. M., Godwin, D., and Barch, D. M. Frontoparietal and cinguloopercular network integrity and cognition in health and schizophrenia. Neuropsychologia, 73:82–93, 2015.
 Shine et al. (2015) Shine, J. M., Koyejo, O., Bell, P. T., Gorgolewski, K. J., Gilat, M., and Poldrack, R. A. Estimation of dynamic functional connectivity using multiplication of temporal derivatives. NeuroImage, 122:399–407, 2015.
 Shine et al. (2016) Shine, J. M., Koyejo, O., and Poldrack, R. A. Temporal metastates are associated with differential patterns of time-resolved connectivity, network topology, and attention. Proceedings of the National Academy of Sciences, 113(35):9888–9891, 2016.
 Sohn & Kim (2012) Sohn, K.-A. and Kim, S. Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization. In Artificial Intelligence and Statistics, pp. 1081–1089, 2012.
 Song et al. (2009a) Song, L., Kolar, M., and Xing, E. Keller: Estimating time-varying interactions between genes. Bioinformatics, 25(12):i128–i136, 2009a.

Song et al. (2009b) Song, L., Kolar, M., and Xing, E. Time-varying dynamic bayesian networks. In Bengio, Y., Schuurmans, D., Lafferty, J. D., Williams, C. K. I., and Culotta, A. (eds.), Proc. of NIPS, pp. 1732–1740, 2009b.
 Speckman (1988) Speckman, P. Kernel smoothing in partial linear models. Journal of the Royal Statistical Society. Series B (Methodological), pp. 413–436, 1988.
 Suggala et al. (2017) Suggala, A., Kolar, M., and Ravikumar, P. K. The expxorcist: nonparametric graphical models via conditional exponential densities. In Advances in neural information processing systems, pp. 4446–4456, 2017.
 Tibshirani et al. (2012) Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2):245–266, 2012.
 Van Dijk et al. (2012) Van Dijk, K. R., Sabuncu, M. R., and Buckner, R. L. The influence of head motion on intrinsic functional connectivity mri. Neuroimage, 59(1):431–438, 2012.
 Voorman et al. (2013) Voorman, A., Shojaie, A., and Witten, D. Graph estimation with joint additive models. Biometrika, 101(1):85–101, 2013.
 Wainwright (2009) Wainwright, M. J. Sharp thresholds for high-dimensional and noisy sparsity recovery using constrained quadratic programming (lasso). IEEE transactions on information theory, 55(5):2183–2202, 2009.
 Wang & Kolar (2014) Wang, J. and Kolar, M. Inference for sparse conditional precision matrices. ArXiv eprints, arXiv:1412.7638, December 2014.
 Wytock & Kolter (2013) Wytock, M. and Kolter, Z. Sparse gaussian conditional random fields: Algorithms, theory, and application to energy forecasting. In International conference on machine learning, pp. 1265–1273, 2013.
 Yang & Ravikumar (2011) Yang, E. and Ravikumar, P. On the use of variational inference for learning discrete graphical model. In Getoor, L. and Scheffer, T. (eds.), Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML '11, pp. 1009–1016, New York, NY, USA, June 2011. ACM. ISBN 978-1-4503-0619-5.
 Yang et al. (2015a) Yang, E., Ravikumar, P., Allen, G. I., and Liu, Z. Graphical models via univariate exponential family distributions. Journal of Machine Learning Research, 16(1):3813–3847, 2015a.
 Yang et al. (2015b) Yang, S., Lu, Z., Shen, X., Wonka, P., and Ye, J. Fused multiple graphical lasso. SIAM Journal on Optimization, 25(2):916–943, 2015b.
 Zhu & Cribben (2017) Zhu, Y. and Cribben, I. Graphical models for functional connectivity networks: best methods and the autocorrelation issue. bioRxiv, pp. 128488, 2017.
 Zhu & Koyejo (2018) Zhu, Y. and Koyejo, O. Clustered fused graphical lasso. In Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
Supplements
Appendix A Proof of Lemma 1
Following the definition, we can derive the PL for the PLAGGM as:
Then Lemma 1 can be proved using the definitions above.
Appendix B Proof of Lemma 2
Appendix C Proof of Theorem 1
In this section, we prove the sparsistency of the regularized MaPPLE by following the widely used primal-dual witness (PDW) proof technique (Wainwright, 2009; Ravikumar et al., 2010; Yang & Ravikumar, 2011; Yang et al., 2015a). PDW is characterized by the following Lemma 3:
Lemma 3.
Proof.
Bound
Before we use the PDW technique, we first provide a lemma bounding the relevant gradient term, which has been shown to be vital for PDW (Wainwright, 2009; Ravikumar et al., 2010; Yang & Ravikumar, 2011; Yang et al., 2015a).
Lemma 4.
Let . For any , with probability of at least , there exists and satisfying the following two inequalities:
(8) 
(9) 
for .
Proof of (8)
To begin with, we prove (8). We define
We use this notation to denote the PPL defined in Definition 1. Then, the derivative of the PPL is:
(10) 
where the subscript denotes the corresponding component of the vector. For ease of presentation, we define
Then, we consider
(11) 
whose component is just the target value
According to Assumption 1
(12) 
Now, we study . According to Lemma 9, we have
uniformly for . Note that is a vector. Therefore, for the component, we have
(13) 
It can be shown that the two quadratic forms are independent and each follows a chi-squared distribution with the appropriate degrees of freedom. By Lemma 1 in Laurent & Massart (2000), a linear combination of chi-squared random variables satisfies:
and
for any . Combining the previous four probabilistic bounds, we can derive
(14) 
and
(15) 
Taking (14) and (15) into (13), we can derive
(16) 
Then, by the definition of , for any , there exists so that for :
(17) 
Proof of (9)
PDW
By Lemma 3, we can prove sparsistency by constructing an optimal solution to (5) satisfying the strict dual feasibility (SDF) condition, as summarized in Lemma 3. Therefore, we now build such a solution by solving a restricted problem.
Solve a Restricted Problem
First of all, we derive the KKT condition of (5):
(19) 
To construct an optimal primal-dual pair solution, we define the candidate primal solution as an optimal solution to the restricted problem: