1 Introduction
Determining conditional independence relationships through undirected graphical models is a key component of the statistical analysis of complex observational data in a variety of domains such as bioinformatics, image analysis, physics, economics, etc. In many of these applications one is interested in estimating the undirected graphical model underlying a joint distribution of a vector of random variables which constitute a complex interacting system. In particular, in this work we deal with the problem of learning a GGM (Gaussian Graphical Model) which encodes the conditional dependence relationship between variables
.It is very important to note that the conditional dependence relationship is very different from the marginal dependence relationship and that the former does not imply the second nor vice versa, as pointed out in the well know YuleSimpson effect. More precisely, two variables and are conditionally independent (conditioned on the rest of the other system’s variables with ) if their conditional distribution is the product of the conditional marginal distributions, while two variables are independent (in the classical sense, i.e. marginally) if their joint distribution (i.e. the marginal of and ) is the product of the marginals. The concept of conditional independence, being more sophisticated with respect to the marginal one, can capture more fundamental relations between variables and this is the reason why it is becoming central in the analysis of complex system of variables. As an example, consider a data set which consists of simultaneous protein expression levels, measured in different cell types, hypothesizing that the joint distribution of the proteins can be modeled as a multivariate Gaussian. Starting from the dataset, you want to discriminate between direct and indirect proteins interaction. This is a classic example of biological network, where the marginal (indirect) relationship between different proteins is almost certainly present since the system of proteins is very complex, and thus we are not interested in it; yet the relationship of conditional (direct) dependence expresses a deeper and more interesting link from the biological point of view (see [7] for clear explanation).
In this contribution we propose an implementation of the symmetric parallel regression technique for learning a GGM, showing its performance in the case of high dimensional data. In particular, in Section 2 we formalize the problem and we describe the state of the art of the existing methods in literature. In Section 3 we study a variant of one of these methods proposing a smart algorithm for its implementation. Finally, in Section 4 we show a set of numerical tests that prove the effectiveness of the proposed algorithm.
2 Mathematical framework and state of the art
For a complete and exhaustive treatment of graphs theory we refer to [6]; below we give only definitions and properties necessary for this work. A finite graph consists of a finite collection of nodes and a collection of edges . For the scope of this work, we will consider graphs that are undirected, namely graphs whose edges are not ordered, i.e. there is no distinction between the edges and . Moreover, for any is the set of neighbours of node and is a clique if for all such that .
In this paper the notion of a graph is used to keep track of the conditional dependence relationship between random variables of a complex system. By complex system here we mean a jointly distributed vector of random variables that interact with each other. Moreover a formal definition of conditional independence relationship is the following:
Definition 1.
Two random variables (, ) of a random vector are conditionally independent, , if
(1) 
where stands for density distribution and .
Associated with an undirected graph and a system of random variables indexed in the vertex set there is a range of different Markov properties which establish how much the graph is explanatory of the conditional independence property of the random variables, see [6] for details. Specifically, in this work we deal only with system of random variables which are global Markov with respect to an undirected graph , and in particular it holds that
which establish conditional independence among two variables and iff their corresponding nodes in the graph are not connected, as well as the fact that any variable of the system is conditional independent from the set of variables indexed in given the set of variables indexed in , ie
Our perspective is inferential, therefore, given a statistical sample extracted from the unknown distribution , we want to learn as much as possible about it. The density estimation problem is really impossible in high dimension (
) unless you make very strong assumptions, and therefore in large dimensions you are content to learn the dependence/independence conditional relations between the system variables. Learning these relationships means learning the structure of the graph for which the distribution is global Markov, but even this problem turns out to be very difficult unless you put yourself in one of the following two hypotheses: i) the distribution of the system of variable is Gaussian, ii) the distribution of the system of variable is finite discrete with non zero probability mass function.
In this paper we deal with the first case, so our working hypothesis is that . We stress that the zeromean hypothesis is not restrictive at all because we always can center data before starting analysis; moreover from now on we also suppose i.e. the variables are considered standardized so covariance between two variables is indeed correlation. Again this is not a restrictive hypothesis, because we can standardize the columns of any data matrix before starting analysis.
2.1 Gaussian Graphical Models
Before to deal with the inference aspect, we recall some population results for the GGMs. Suppose with strictly positive definite, then we can write its distribution in terms of parameter as classically :
or equivalently in terms of
(2) 
From the remarkable HammersleyClifford theorem, it follows that, being , the global Markov property with respect to an undirected graph is equivalent to the factorization property over , i.e.
(3) 
where C is the set of all possible cliques of the graph and is a realvalued function of the subvector taking positive values. Then from eq.(2) it follows that factorizes as a product of strictly positive and realvalued functions, so the knowledge of the support of is equivalent to the knowledge of with respect to which the distribution is global Markov. This is a very important fact, because it allows to assert that two variables and are conditional independent, i.e. nodes and are not connected into graph if and only if . This fact can also be derived directly by the property of multivariate Gaussian distribution, as claimed in the following proposition:
Proposition 1.
If , then for any , the distribution of
given the rest is still Gaussian with mean and variance given by
The proof can be obtained in Lemma A.4 of page 215 of [5].
We can now turn to the inferential aspect we are interested in. Suppose we have a random sample from a , i.e. suppose we have a data matrix of dimension where each row is a realization of this random variable. The objective of our analysis is to estimate the graph for which the unknown distribution is global Markov. For previous results, we can equivalently state our problem as the following:
Given , estimate the support of . In the following sections we present the most used methods to solve this problem together with a variant of one of these that turns out to be more advantageous not only from the performance point of view but especially from the computational point of view.
2.2 Estimating G by multiple testing
The simplest method to estimate the support of is to invert numerically an estimate of and then test if its coefficients are zero. As a first step, given the data matrix , we have to evaluate estimator . Since the data are standardized, we observe that is indeed an estimator for the correlation matrix, then is indeed proportional to an estimator of the conditional correlation. Denote the th entry of matrix , then is an estimate of the conditional correlation between variables and . When , we have (see [2], Chapter 4.3)
(4) 
then, for each we can test the hypothesis
by using the test statistic in eq. (
4). It is instructive to observe that, for Gaussian variables, independence is equivalent to zero correlation and this is true also for conditional distribution which are still Gaussian as claimed in Proposition 1.While the empirical variance (in this case correlation) estimator does not suffer of instability when the dimension gets larger, its inversion become more and more unstable, being not invertible at all in the case . Hence alternative approaches have been proposed to deal with the GGMs learning problem in the high dimensional case and they are the object of the following sections.
2.3 Estimating G by maximum likelihood penalized technique
Try to infer graph is hopeless in the high dimensional setting without additional structural assumption, hence from now on we suppose that the underlying graph is sparse (it has a few edges). Since when there is no edge between nodes and , the sparsity of translates into coordinate sparsity for matrix . Given whose rows represent samples from a zeromean multivariate Gaussian distribution with , we can write the Loglikelihood function using expression in eq. (2) and standard property of the trace operator
(5) 
where is the empirical covariance matrix. The standard theory of (Maximum Likelihood Estimator) suggests to maximize function in (5), however since we are seeking for GGMs based on sparse graphs, in order to control the number of nonzeros entry of the MLE of matrix the following penalization approach is considered
(6) 
We point out that diagonal elements are not penalized because they are not expected to be zero. Solution of (6) has been studied by many authors but only in [1] a smart firstorder block coordinatedescendent algorithm has been proposed that made this technique famous with the name of Graphical Lasso or glasso.
2.4 Estimating G by parallel regression technique
Although the algorithm proposed in [1] is efficient, in highdimensional regime it can be less competitive; therefore in [8] an alternative approach for learning GGMs has been proposed under sparsity hypothesis.
Let us first observe that from Proposition 1, for each , there exists independent of , such that . Denote with , hence an estimate of can be obtained as the LS (Least Square) solution of the classical regression problem
(7) 
where is the sub matrix of with columns indexed in . Since is a scalar multiple of , if variables and are conditional independent, i.e. there is no edge between nodes and ; hence authors in [8] propose to learn adding a penalty term in criterion(7) to enforce sparsity. Formally for each the authors solve
(8) 
Unfortunately, there is a difficultly in order to learn from such an approach, because there is no constrain enforcing that when , hence it is possible that node is a neighbour of node and not vice versa. So we have to choose an arbitrary decision rule in order to construct an estimate of the edges set, for example in this paper we adopt the rule iff OR .
This method gives very good results and it is much less computational expensive with respect to glasso. Its efficiency is due especially to the fact that it is a nodewise approach learning the neighbours of each node separately, while glasso is a global approach learning the whole graph. Finally, it is important to stress that this parallel regression method can be reformulated in term of a unique multivariate regression problem. More precisely, denote the space of matrices with zero diagonal and the zero diagonal matrix whose th column has extradiagonal elements equal to defined in (8), then the regression problems can be expressed in a unique multivariate regression problem as:
(9) 
3 Estimating G by symmetric parallel regression technique
Looking at model (9) we can immediately see that it is separable, that is, the parallel regressions are in fact independent of each other. It is clear, however, that from an information point of view the regressions are not unrelated to each other because the conditional independence relationship is symmetric and therefore if the variable is zeroed in the regression on we expect that the variable is zeroed in the regression on . This information can be included into the estimation procedure replacing the penalty by a grouped penalty as proposed in [4]. Hence, in this contribution we study the following variant of the parallel regression technique:
(10) 
Note that takes into account the group size. This estimator has the nice property to be coordinate sparse with symmetric zeros, hence it is clear why we call it the symmetric parallel regression technique. The minimization problem (10) is convex, but it cannot be split in parallel subproblems, hence it is computationally more intensive. However, in this contribution we implement a blockwise descending algorithm inspired by the general algorithm presented in [3] which turns out to be very interesting for learning GGMs.
3.1 Algorithm
In this section we describe the algorithm obtained by adapting the general methodology presented in [3]. We fix , and consider defined in (10). As already mentioned, we assume data matrix standardized, i.e. and for each . The general methodology proposed in [3] provides for the updating of a group of variables at a time in a cyclical fashion until convergence is achieved. In our case, each group of variables consists of a symmetric pair of matrix B, example with , and therefore the total number of groups is . The reasoning behind this strategy is that the problem (10) can be separated into subproblems, each of which has only two variables and can therefore be easily solved by thinking of all the others frozen in the previous step. For our convenience, rewrite criterion (10) in the following form:
(11) 
For example, let us minimize (11) in the variable . When we can evaluate the following partial gradient of criterion (11) with respect to the variables :
Define with
hence minimizing criterion (11) in the variables gives
(12) 
Solution (12) is known as multivariate (here variate) Soft Threshold. In conclusion, the algorithm repeats the step just described for each pair of variables in a cyclic fashion until convergence is achieved. In this contribution the convergence is achieved if a maximum number of iteration steps is exceeded or if the norm of the difference between the current and that calculated in the previous step is smaller than a certain threshold.
4 Numerical experiments
In this section we show the performance of the methods discussed in terms of edge reconstruction and computational time. We focus on highdimensional regimes where method of Section 2.2 can not be applied being the empirical covariance matrix not invertible. So we will focus on the following three methods Graphical Lasso presented in Section 2.3, Parallel Regression presented in Section2.4 and Symmetric Parallel Regression presented in Section 3, here denoted , and respectively. For all methods the choice of is crucial and it can make the difference, so in order to be fair in our comparative study we fix which is known from the theory to be order of the optimal parameter. Since the goal of the methods is to correctly identify the undirected graph which encodes the conditional independence relations among variables, i.e. to correctly identify the support of matrix , we measure the method performance by the following index:
(13) 
where is the number of edges present in the graph and correctly identified (i.e. ), is the number of edges not present in the graph and correctly identified (i.e. ), is the number of edges present in the graph and not correctly identified (i.e. ) and is the number of edges not present in the graph and not correctly identified (i.e. ). In all the previous definitions is the estimator obtained in eqs (6), (9) and (11) respectively. Note that measure in (13) is a scaled measure inherit from the binary classification literature, , being more accurate methods with higher accuracy.
Aim of this section is to show how the technique can be competitive with the others two methods in high dimensional problems especially from a computational point of view. For that reason we analyse two different high dimensional scenarios: not severe and severe regime. If is the problem dimension and is the number of data, for not severe regime we intend , while for severe regime we intend . In particular in this section we analyse the following two situations with and respectively. We have conducted experiments for many types of graphs and here we report results for three graphs representing different type of categories, more precisely we show results for the following three graphs:
 G1

A Chain graph where each node has degree 2: , . (see Fig.1)
 G2

A Grid graph where each node has degree at most 4.(see Fig.2 left)
 G3

A Star graph where there is an hub node with maximum degree: , for all and for . (see Fig.2 right)
In figure 3, 4 and 5 we report results in terms of performance and computational time for graph , and respectively. First of all we note a certain robustness of results among the three different types of graph, secondly in terms of performance we note that method is not highly competitive with respect to the others, however it is clear how its implementation is really competitive with respect to the others methods. Finally, it is worth noting that, the method is more expensive because it also offers a good estimate of matrix and not just of its support. Then, at least for some situations, we can conclude that the method can be competitive with the existing methods for learning the structure of a GGM under sparsity hypothesis and high dimensional regime.
The Matlab codes used to produce results of this contribution are available at http://www.iac.cnr.it/ danielad/software.html.
References
 [1] J. Friedman, T. Hastie and R. Tibshirani, Sparse inverse covariance estimation with graphical Lasso, Biostatistics, 9:3, (208), 432441.
 [2] T.W. Anderson, An introduction to multivariate statistical analysis, Wiley, New York NY, second edition (1984).

[3]
P. Breheny and J. Huang, Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors,
Stat Comput., 25(2), (2015), 173187.  [4] J. Friedman, T. Hastie and R. Tibshirani, Application of the Lasso and grouped lasso to the estimation of sparse graphical models http://statweb.stanford.edu/ tibs/ftp/ggraph.pdf

[5]
C. Giraud,
Introduction to highdimensional statistics
, CRC Monographs on Statistics and Applied Probability 139 (2015).  [6] S.L. Lauritzen, Graphical Models, Oxford Science Pubblications, (1996).
 [7] S.M. Lundberg, W.B. Tu, B. Raught, L.Z. Penn, M.M. Hoffman and S.L. Lee, ChromNet: Learning the human chromatin network from all ENCODE ChIPseq data,Genome Biology, 17:82, (2016).
 [8] N. Meinshausen and P. Bühlmann, Highdimensional graphs and variable selection with the Lasso, The Annals of Statistics, 34, (2006), 14361462.