1. Introduction
Among the many appealing properties of multivariate normal distributions, their second moment matrix and its inverse contain complete information about the independence and conditional independence properties. Specifically, a zero in the $(i,j)$th entry of the covariance matrix means that variables $X_i$ and $X_j$ are marginally independent, while a zero in the $(i,j)$th entry of the precision (inverse covariance) matrix means that the two are conditionally independent given the remaining variables. For high-dimensional Gaussian data sets, it is often of interest to estimate either a sparse covariance matrix, or a sparse precision matrix, or, in some cases, both [3].
In general, this correspondence, between the second moment matrix and the independence properties and between its inverse and the conditional independence properties, does not hold for non-Gaussian distributions. Determining independence and conditional independence then becomes more complex than matrix estimation: the complexity of exhaustive pairwise testing techniques scales exponentially with the number of variables [7]; other methods compute scores or combine one-dimensional conditional distributions for the exponential family [8, 12, 11]; another approach (by two of the current coauthors) identifies conditional independence for arbitrary non-Gaussian distributions from the Hessian of the log density, but is so far computationally limited to rather small graphs [2]. Thus, it is of broad interest to analytically extract the marginal and conditional independence properties of a distribution prior to any estimation procedure.
In this paper, we show that the above correspondences between independence and the sparsity of covariance and precision matrices are approximately preserved for a broad class of distributions, namely those given by certain diagonal, mean-preserving transformations of a multivariate normal (sometimes referred to as “nonparanormal” distributions [9]). In particular, these distributions display the following behavior:

Variables $X_i$ and $X_j$ are marginally independent if and only if the $(i,j)$th entry of the covariance matrix is zero, exactly as in the normal case.

Variables $X_i$ and $X_j$ are conditionally independent if and only if the $(i,j)$th entry of the precision matrix is small, where “small” will be made precise later on.
In other words, under some assumptions, a Gaussian approximation to a non-Gaussian distribution of this form will exactly recover the marginal independence structure and approximately recover the conditional independence structure, which is often summarized as an undirected graphical model.
Covariance estimation for general non-Gaussian data sets is of course standard procedure, used (at least) to identify correlations between variables. Perhaps less common, but not unusual, is precision estimation for general data sets, used to identify a “partial correlation graph,” a sort of first approximation of the conditional independence properties [6, 10, 4, 1]. This work provides a new mathematical foundation connecting the sparsity of computed partial correlations to the conditional independence properties for a common class of non-Gaussian distributions.
The rest of the paper is organized as follows. Section 2 derives exactly how the entries of the covariance matrix transform, with the result that the covariance structure is exactly preserved by the diagonal transformation, and provides an explicit formula for higher moments of Gaussian random variables. Related computations are given in Appendix A. Section 3 computes the inverse covariance matrix after the transformation, showing that the conditional independence structure is approximately preserved. Numerical results for specific graphs are given in Section 4, and we conclude in Section 5.
2. Moments after transformation
Consider a multivariate normal random variable $X$ with density $\pi$. Let $\Omega = \Sigma^{-1}$ denote the inverse covariance, or precision, matrix of $X$. When $\pi$ satisfies some conditional independence properties, the precision matrix will have zero entries. Furthermore, the sparsity of $\Omega$ defines the minimal I-map of $\pi$, i.e., the minimal undirected graphical model satisfying the conditional independence properties of $\pi$.
Now apply a (nonlinear) univariate transformation $g$ to each element of $X$, i.e., $Z_i = g(X_i)$. We refer to the overall mapping $Z = g(X)$ as a diagonal transformation. Let us then say that $Z \sim \tilde{\pi}$, where $\tilde{\pi}$ is the pushforward density of $\pi$ through the diagonal transformation. For a general nonlinear $g$, $Z$ will have a non-Gaussian distribution. Nevertheless, its mean is given by $\mathbb{E}[g(X)]$ and its covariance is given by $\tilde{\Sigma} = \mathrm{Cov}[g(X)]$. For simplicity, we will assume that $g$ is mean-preserving, so that $\mathbb{E}[Z] = \mathbb{E}[X] = 0$. We let $\tilde{\Omega} = \tilde{\Sigma}^{-1}$ denote the precision matrix of $Z$. Also note that a Gaussian approximation to $\tilde{\pi}$ will have mean zero and covariance $\tilde{\Sigma}$.
Our ultimate goal is to derive conditions under which a Gaussian approximation to $\tilde{\pi}$ will approximately preserve the conditional independence properties (i.e., the corresponding entries of the inverse covariance matrix of $Z$ will be small). To do so, in this section we first characterize the first and second moments of $Z$.
2.1. Univariate moments after transformation
If $g$ is a smooth function of $x$, we can expand it in a Taylor series around zero as
(1) $g(x) = \sum_{n=0}^{\infty} \frac{g^{(n)}(0)}{n!}\, x^n.$
Using this expansion, we compute the first two moments of $Z_i = g(X_i)$ by taking advantage of the linearity of the expectation operator. They are
(2) $\mathbb{E}[g(X)] = \sum_{n=0}^{\infty} \frac{g^{(n)}(0)}{n!}\, \mathbb{E}[X^n],$
(3) $\mathbb{E}\left[g(X)^2\right] = \sum_{n=0}^{\infty} \sum_{m=0}^{\infty} \frac{g^{(n)}(0)\, g^{(m)}(0)}{n!\, m!}\, \mathbb{E}\left[X^{n+m}\right].$
For a mean-zero Gaussian random variable $X$ with variance $\sigma^2$, its moments are given by
(4) $\mathbb{E}[X^n] = \begin{cases} \sigma^n\, (n-1)!! & n \text{ even}, \\ 0 & n \text{ odd}, \end{cases}$
where $(n-1)!!$ denotes the double factorial. Since the odd moments vanish, we may ignore any terms in (2) and (3) that involve an odd exponent of $X$. Therefore, the first and second moments can be written only in terms of the variance of $X$. The first moment is given by
(5) $\mathbb{E}[g(X)] = \sum_{n=0}^{\infty} c_n\, \sigma^{2n},$
where $c_n = g^{(2n)}(0)/(2^n\, n!)$ is a constant depending on the higher-order derivatives of $g$ and the index $n$. Similarly, after a reparameterization of the indices over the sum $n + m = 2k$, the second moment is given by
(6) $\mathbb{E}\left[g(X)^2\right] = \sum_{k=0}^{\infty} d_k\, \sigma^{2k},$
where $d_k = (2k-1)!! \sum_{n+m=2k} \frac{g^{(n)}(0)\, g^{(m)}(0)}{n!\, m!}$ with $n, m \ge 0$.
Of course, the above computations assume convergence of each series. In the next section we will give criteria for these series to converge.
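The even-moment formula (4) is straightforward to check numerically. The following Python sketch (our own illustration, not part of the paper; function names are ours) implements the double-factorial formula and cross-checks it against Gauss-Hermite quadrature:

```python
import math
import numpy as np

def gaussian_moment(n, sigma):
    # E[X^n] for X ~ N(0, sigma^2): sigma^n * (n-1)!! for even n, 0 for odd n
    return 0.0 if n % 2 else sigma**n * math.prod(range(n - 1, 0, -2))

def gh_moment(n, sigma, deg=60):
    # Same moment computed by Gauss-Hermite quadrature as an independent check:
    # E[X^n] = (1/sqrt(pi)) * sum_i w_i (sqrt(2) * sigma * t_i)^n
    t, w = np.polynomial.hermite.hermgauss(deg)
    return float(np.sum(w * (math.sqrt(2.0) * sigma * t) ** n) / math.sqrt(math.pi))
```

For example, `gaussian_moment(4, 1.0)` returns the familiar fourth moment $3\sigma^4 = 3$, and the quadrature value agrees to machine precision.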
2.2. Multivariate moments after transformation
With a similar argument as above, we can also derive the second moment matrix of the transformation. Using the Taylor series expansion in (1), the $(i,j)$th entry of the second moment matrix is given by
(7) $\mathbb{E}\left[g(X_i)\, g(X_j)\right] = \mathbb{E}\left[\sum_{n=0}^{\infty} \frac{g^{(n)}(0)}{n!}\, X_i^n \sum_{m=0}^{\infty} \frac{g^{(m)}(0)}{m!}\, X_j^m\right] = \sum_{n=0}^{\infty} \sum_{m=0}^{\infty} \frac{g^{(n)}(0)\, g^{(m)}(0)}{n!\, m!}\, \mathbb{E}\left[X_i^n X_j^m\right],$
where in the second equality we switch the order of summation and expectation.
Starting from Isserlis’ theorem (or Wick’s probability theorem) [5], we compute the moments of products of Gaussian random variables as:
(8) $\mathbb{E}\left[X_i^n X_j^m\right] = \sum_{r} \binom{n}{r} \binom{m}{r}\, r!\, (n-r-1)!!\, (m-r-1)!!\; \Sigma_{ij}^{\,r}\, \Sigma_{ii}^{(n-r)/2}\, \Sigma_{jj}^{(m-r)/2},$
where the sum runs over $0 \le r \le \min(n,m)$ such that $n - r$ and $m - r$ are both even.
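This pairing-count consequence of Isserlis’ theorem can be sketched directly in code. The Python function below (our own illustration; the name `cross_moment` is an assumption) sums over the number $r$ of cross-pairings between the two variables:

```python
import math
from math import comb

def df(k):
    # Double factorial k!!, with 0!! = (-1)!! = 1 by convention
    return math.prod(range(k, 0, -2)) if k > 0 else 1

def cross_moment(n, m, sxx, syy, sxy):
    # E[X^n Y^m] for centered jointly Gaussian (X, Y) with Var(X) = sxx,
    # Var(Y) = syy, Cov(X, Y) = sxy, via Isserlis' theorem: sum over the
    # number r of X-Y pairs in each perfect matching of the n + m factors.
    total = 0.0
    for r in range(min(n, m) + 1):
        if (n - r) % 2 or (m - r) % 2:
            continue  # leftover factors must pair among themselves
        total += (comb(n, r) * comb(m, r) * math.factorial(r)
                  * df(n - r - 1) * df(m - r - 1)
                  * sxy**r * sxx**((n - r) // 2) * syy**((m - r) // 2))
    return total
```

For unit variances and $n = m = 3$ this reproduces the closed form $9\,\sigma_{xy} + 6\,\sigma_{xy}^3$, and for $n = m = 2$ the standard identity $\sigma_{xx}\sigma_{yy} + 2\sigma_{xy}^2$.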
Example 2.1.
For the transformation $g(x) = x^3$, the third-order truncation of the expansion in (7) is exact, and the derivatives evaluated at zero are all zero with the exception of $g^{(3)}(0) = 6$. Using this result, we have that
$\mathbb{E}\left[g(X_i)\, g(X_j)\right] = \mathbb{E}\left[X_i^3 X_j^3\right] = 9\, \Sigma_{ii}\, \Sigma_{jj}\, \Sigma_{ij} + 6\, \Sigma_{ij}^3.$
Just as was done above for the univariate case, one of our goals is to compute a general explicit form of (7). Before we go any further, however, we should prove that the above series actually makes sense, that is, that it converges. We do this now.
Lemma 2.1.
Suppose that the derivatives of the function $g$ at zero are all bounded. Then the series in (7) converges.
Proof.
We consider $\mathbb{E}[X_i^n X_j^m]$ for fixed $n$ and $m$. To begin, we use the Cauchy-Schwarz inequality:
the square of this expectation is bounded by the product of moments
$\left(\mathbb{E}\left[X_i^n X_j^m\right]\right)^2 \le \mathbb{E}\left[X_i^{2n}\right]\, \mathbb{E}\left[X_j^{2m}\right].$
From (4) we know that this last product is $(2n-1)!!\,(2m-1)!!\; \Sigma_{ii}^{\,n}\, \Sigma_{jj}^{\,m}$.
Let $M$ be a bound on $|g^{(n)}(0)|$ for all $n$, and recall we are assuming that the derivatives are bounded. Then the sum in question is bounded by
(9) $M^2 \sum_{n=0}^{\infty} \sum_{m=0}^{\infty} \frac{\sqrt{(2n-1)!!\,(2m-1)!!}}{n!\, m!}\; \Sigma_{ii}^{\,n/2}\, \Sigma_{jj}^{\,m/2}.$
Note that
$(2n-1)!! = \frac{(2n)!}{2^n\, n!} = \frac{2^n\, \Gamma\!\left(n + \tfrac{1}{2}\right)}{\sqrt{\pi}},$
where the last equality follows from the duplication formula for the Gamma function. Thus, the sum in (9) can be rewritten in terms of Gamma functions. Let us consider the inner sum over $m$. Using the symmetry in $n$ and $m$, this is at most
(10) 
The ratio of successive terms decreases as $m$ increases, and hence the sum in (10) is bounded by a geometric series. The sum over $m$ converges and hence it is bounded; the convergence is easily seen with the ratio test. Thus we are left with something bounded by a constant times a single sum over $n$, which also converges by the ratio test after grouping each even-indexed term with the following odd-indexed term. ∎
To end this section, we note that the derivatives of $g$ evaluated at zero need not be bounded to establish convergence. If the derivatives grow polynomially, or even geometrically in $n$, the same computation would work.
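The ratio-test argument above can also be seen numerically: for derivatives bounded by $M$, the partial sums of the dominating double series stabilize quickly. A small Python sketch (our own construction, following our reading of the bound in (9); the helper names are assumptions):

```python
import math

def df(k):
    # Double factorial k!!, with 0!! = (-1)!! = 1
    return math.prod(range(k, 0, -2)) if k > 0 else 1

def bound_partial_sum(sigma_i, sigma_j, M=1.0, N=40):
    # Partial sum (over 0 <= n, m < N) of the Cauchy-Schwarz bound
    #   M^2 * sum_{n,m} sqrt((2n-1)!! (2m-1)!!) / (n! m!) * sigma_i^n sigma_j^m,
    # which dominates the series (7) term by term when |g^{(n)}(0)| <= M.
    s = 0.0
    for n in range(N):
        for m in range(N):
            s += (math.sqrt(df(2 * n - 1) * df(2 * m - 1))
                  / (math.factorial(n) * math.factorial(m))
                  * sigma_i**n * sigma_j**m)
    return M * M * s
```

The ratio of consecutive one-dimensional terms behaves like $\sqrt{2n+1}\,\sigma/(n+1) \to 0$, so truncation at moderate $N$ already gives the limit to high accuracy.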
2.3. Transformation of matrix elements
The focus of this section is to describe how the individual entries of the covariance matrix are transformed when $g$ is applied to the random variable $X$. We assume that the variables are standardized, so that $\Sigma_{ii} = 1$. For the diagonal entries of the covariance, we already know the answer from (6):
$\mathbb{E}\left[g(X_i)^2\right] = \sum_{k=0}^{\infty} d_k\, \Sigma_{ii}^{\,k},$
where $d_k$ is given in (6). This means that if $\Sigma_{ii} = 1$ (or is close to one), then the diagonal elements of the transformed covariance are all equal to a constant, $\sum_k d_k$, that only depends on the derivatives of $g$ at zero. Here is an example.
Example 2.2.
Suppose that $\Sigma_{ii} = 1$. Then the diagonal entries of the transformed covariance equal the constant $\sum_k d_k$, which can be evaluated in closed form for particular choices of $g$.
We now examine what happens to the other elements of the covariance after the transformation. We will consider for now the case of an odd function $g$.
Theorem 2.2.
Suppose $g$ is odd and given by its Taylor series (1), so that only the odd-order derivatives $g^{(2k+1)}(0)$ are nonzero. Define from these derivatives a new function with associated coefficients $c_k$, as in the proof below. Then $\Sigma_{ij}$ is transformed to
(11) $\mathbb{E}\left[g(X_i)\, g(X_j)\right] = \sum_{k=0}^{\infty} c_k\, \Sigma_{ij}^{\,2k+1}.$
Proof.
After inserting formula (8) into (7), we combine all the terms that correspond to the same power of $\Sigma_{ij}$. By simplifying the double factorials, one has that the coefficient of $\Sigma_{ij}^{2k+1}$ is given by
(12) 
Introduce new summation variables; then the sum in (12) factors into simpler pieces. Changing variables once more, and recalling the functions defined in the statement of the theorem, the coefficient of $\Sigma_{ij}^{2k+1}$ reduces to $c_k$, and the result follows. ∎
While the answer may look complex, it is easy to compute in many cases. The computation is also often simplified by noting that the relevant coefficient arises as a derivative of the auxiliary function in the theorem. Here are some examples:

Let Then and . Thus . Summing over the odd indices we have that is transformed to

Let us also verify the computation done earlier for the function $g(x) = x^3$. For this case, the only nonzero derivative at zero is $g^{(3)}(0) = 6$. With unit variances, this yields a final answer of $9\, \Sigma_{ij} + 6\, \Sigma_{ij}^3$, in agreement with Example 2.1.

Let Then and Thus and we have that is transformed to

Let Then and Thus we have
In Appendix A, we also provide related and somewhat simpler computations that may be helpful in some circumstances. In particular, we compute the coefficient of the linear term in the expansion (7), i.e., the coefficient of $\Sigma_{ij}$.
To conclude this section, we present an important (and well-known) consequence of Theorem 2.2 concerning the marginal independence properties of random variables after diagonal transformations. For pairs of Gaussian variables $(X_i, X_j)$ that are marginally independent, we have $\Sigma_{ij} = 0$. Thus, by (5) and (11), we have that $\tilde{\Sigma}_{ij} = 0$, and the variable pair $(Z_i, Z_j)$ is also marginally independent. Therefore, diagonal transformations exactly preserve the sparsity of the covariance matrix: the zero elements of $\Sigma$ remain zero in the covariance matrix $\tilde{\Sigma}$ of $Z$.
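This exact preservation is easy to see with the cubic transformation of Example 2.1, for which the transformed covariance is available in closed form. The block-diagonal example below is our own illustration (Python/NumPy):

```python
import numpy as np

def transformed_cov_cubic(S):
    # Exact covariance of Z = X**3 (elementwise) for X ~ N(0, S), using
    # E[X_i^3 X_j^3] = 9 S_ii S_jj S_ij + 6 S_ij^3 from Isserlis' theorem;
    # the cube map is odd, so Z has mean zero and this moment is the covariance.
    d = np.diag(S)
    return 9.0 * np.outer(d, d) * S + 6.0 * S**3

# Block-diagonal covariance: variables {0, 1} are independent of {2, 3}
S = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, -0.3],
              [0.0, 0.0, -0.3, 1.0]])
C = transformed_cov_cubic(S)   # the zero blocks of S stay exactly zero in C
```

Every zero of `S` maps to a zero of `C`, because the transformed entry is an odd power series in $\Sigma_{ij}$ with no constant term.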
3. Properties of the inverse covariance
Our goal is to say something not only about the covariance matrix of the transformed variables, but also about what happens to the entries of the precision matrix. We may be faced with the following situation: we have the precision matrix of a multivariate normal variable with some zero entries to start. After transforming those variables, we would like to know how much (approximate) sparsity is recaptured in the inverse covariance matrix of the non-Gaussian variables. We probably cannot hope to do this in general, but we can say something specific about some particular cases that often do occur in applications. First, we begin with a technical lemma about matrix inverses.
Lemma 3.1.
Let $\Omega = I + A$, where the operator norm of $A$, $\|A\|$, is at most $\delta < 1$. Then $\Omega^{-1} = I - A + E$, where the norm of $E$ is at most $\delta^2/(1-\delta)$.
Proof.
From the Neumann series expansion for the inverse of $I + A$, we have
$(I + A)^{-1} = I - A + \sum_{k=2}^{\infty} (-A)^k.$
For $\delta < 1$, the norm of the last term is at most $\sum_{k=2}^{\infty} \delta^k = \delta^2/(1-\delta)$, a converging geometric series. ∎
We note that each entry of the matrix $E$ is bounded by its operator norm and is thus of order $\delta^2$. Lastly, we note that the inverse matrix can also be written as $\Omega^{-1} = I - A + A^2 + E'$, where the norm of $E'$ is at most $\delta^3/(1-\delta)$.
Now suppose that we have a precision matrix of the form $\Omega = I + A$ with $\|A\| \le \delta$. By the lemma above, whenever an entry of the precision matrix is zero, the corresponding entry of the covariance will be of order (at most) $\delta^2$. If the entry in the precision is not zero, then the covariance entry is of the form $-A_{ij}$ plus something again of order $\delta^2$. Here we attempt to keep track of what happens to those entries of the covariance under the transformation, given that our function $g$ is odd.
Lemma 3.2.
Suppose that $g$ is an odd function with derivatives at zero bounded by $M$, and the precision matrix is of the form $\Omega = I + A$, where $A$ has norm at most $\delta$. Then for $i \ne j$,
(13) $\tilde{\Sigma}_{ij} = c\, \Sigma_{ij} + O(\delta^3),$
where $c$ is a constant that depends only on $g$ and is given in Theorem 2.2, and the implied constant in the error depends on $M$.
Proof.
From Theorem 2.2 we have that the transformation of $\Sigma_{ij}$ is given by the odd power series (11). If $M$ is a bound on the derivatives of $g$ at zero, then it follows that the coefficients of this series are bounded in terms of $M$. Thus, the difference between the transformed entry and its linear term $c\,\Sigma_{ij}$ is bounded by the tail of the series beginning with the cubic term. This last estimate follows from the Taylor series error for the series in (11). Given that $|\Sigma_{ij}|$ is of order $\delta$, we have a bound of order at most $\delta^3$, although for a given function $g$ and smaller $\delta$ this bound can be improved. ∎
Lemma 3.3.
Suppose that $g$ is an odd function with bounded derivatives at zero, the precision matrix is of the form $\Omega = I + A$, where $A$ has norm at most $\delta$, $c$ is given as above, and $i = j$. Let $\kappa$ denote the value of the transformed diagonal entry when $\Sigma_{ii} = 1$.
Then $\tilde{\Sigma}_{ii} = \kappa + O(\delta^2)$.
Proof.
The proof of this is similar to the previous lemma. The difference is that we are expanding our function in a neighborhood of one instead of zero. Notice that the transformation of $\Sigma_{ii}$ is given by the series in (6), which we think of as a function of the variable $\Sigma_{ii}$. When $\Sigma_{ii}$ is one, this evaluates to $\kappa$. Recall also that $\Sigma_{ii} - 1$ is of order $\delta^2$, and thus, using Taylor’s theorem (since all derivatives are bounded in a neighborhood of one), the result follows. ∎
Lemma 3.4.
Suppose that $g$ is an odd function with bounded derivatives, the precision matrix is of the form $\Omega = I + A$, where $A$ has norm at most $\delta$, $c$ is given as above, and $i \ne j$. Then
$\tilde{\Sigma}_{ij} = -c\, A_{ij} + O(\delta^2).$
To see this, in Lemma 3.2 we replace $\Sigma_{ij}$ with its value $-A_{ij} + E_{ij}$. The difference is $c\, E_{ij}$, and since the norm of $E$ is at most $\delta^2/(1-\delta)$, the result follows.
To summarize, in the case of odd functions we have that the transformed covariance matrix has the form
(14) $\tilde{\Sigma} = \kappa\, I - c\, A + \mathcal{E},$
where $c$ and $\kappa$ are given in the previous lemmas, $A$ is as before, and $\mathcal{E}$ is the error. Let us now estimate the norm of the error.
Theorem 3.5.
Let the precision matrix be an $n \times n$ matrix, and suppose $\|A\| \le \delta$ is bounded as above. Then the Hilbert-Schmidt norm of $\mathcal{E}$ is at most of order $n\,\delta^2$.
Proof.
The diagonal elements of $\mathcal{E}$ are at most of order $\delta^2$, and hence the diagonal part of $\mathcal{E}$ has norm at most of order $\sqrt{n}\,\delta^2$. The off-diagonal elements of $\mathcal{E}$ are also of order $\delta^2$; thus the matrix consisting of these off-diagonal elements has Hilbert-Schmidt norm (equal to the Frobenius norm, and greater than or equal to the operator norm) at most of order $n\,\delta^2$, and the result follows. ∎
Note that these results can be made much more precise for specific functions and for specific matrices of fixed size when it is possible to keep track of the constants in the estimates.
Our final step is to compute the inverse of the covariance matrix in (14), which is given by
$\tilde{\Sigma}^{-1} = \frac{1}{\kappa}\left(I + \frac{c}{\kappa}\, A\right) + \mathcal{E}',$
for some error term $\mathcal{E}'$. Thus the off-diagonal terms of the transformed precision have much the same behavior as those of the original precision matrix: if $\Omega_{ij} = 0$, the corresponding entries are of order $\delta^2$; if they are not zero, then the first-order term scales with $A_{ij}$.
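As a concrete instance of this behavior, the following Python sketch builds a chain-graph precision, transforms the exact covariance through the cube map of Example 2.1 (our stand-in for a generic odd $g$), and compares the entries of the transformed precision on and off the graph:

```python
import numpy as np

def transformed_cov_cubic(S):
    # Exact covariance of the elementwise cube of X ~ N(0, S):
    # E[X_i^3 X_j^3] = 9 S_ii S_jj S_ij + 6 S_ij^3 (Isserlis' theorem)
    d = np.diag(S)
    return 9.0 * np.outer(d, d) * S + 6.0 * S**3

n, delta = 7, 0.1
Omega = np.eye(n)
for i in range(n - 1):                   # chain graph: delta on first off-diagonals
    Omega[i, i + 1] = Omega[i + 1, i] = delta
Sigma = np.linalg.inv(Omega)

Prec_Z = np.linalg.inv(transformed_cov_cubic(Sigma))

edges = np.abs(Omega) > 0                # sparsity pattern of the original precision
max_off_graph = np.abs(Prec_Z[~edges]).max()                         # O(delta^2)
min_on_graph = np.abs(Prec_Z[edges & ~np.eye(n, dtype=bool)]).min()  # O(delta)
```

The entries of the transformed precision that correspond to absent edges are an order of magnitude smaller (in $\delta$) than the entries on the graph, as the analysis above predicts.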
This will be illustrated by examples in the next section.
4. Applications to specific graphs
4.1. Chain graph
We begin with an example of a circulant matrix that corresponds to the starting precision of a chain graph. Consider
For this $\Omega$, its inverse is given by (rounded):
We now demonstrate the effect of applying the diagonal transformation $g$ to a multivariate normal vector $X$ with the covariance above. Using Theorem 2.2, the transformed covariance can be computed in closed form. We note that, as predicted, this matrix is circulant and preserves the marginal independence properties of $X$.
To verify the computation of the transformed covariance, we also estimate it from samples. To do so, we generated 100,000 samples from the distribution with the above covariance and then applied the function $g$ to the data. The resulting empirical covariance was
Our theory predicts the values of the main diagonal and of the upper and lower off-diagonals and corners, and this is reflected in both computations of the transformed covariance above. Notice that all other entries are much smaller. Finally, we compute the inverse of the transformed covariance matrix, which is given by
As expected from the theory in Section 3, the diagonal entries and the upper and lower off-diagonal entries and corners take the predicted values, while all other entries are small, of order $\delta^2$, as predicted.
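The sampling experiment described above can be reproduced in outline as follows (Python/NumPy; the cube map here is a stand-in, since the paper's exact $g$ and matrix entries are not reproduced, and the sample size is our choice):

```python
import numpy as np

rng = np.random.default_rng(1)
n, delta, n_samples = 8, 0.1, 400_000

# Circulant chain-graph precision: ones on the diagonal, delta on the
# first off-diagonals and in the two corners
Omega = np.eye(n)
for i in range(n):
    j = (i + 1) % n
    Omega[i, j] = Omega[j, i] = delta
Sigma = np.linalg.inv(Omega)

X = rng.multivariate_normal(np.zeros(n), Sigma, size=n_samples)
Z = X**3                                  # odd diagonal transformation
C_hat = (Z.T @ Z) / n_samples             # empirical covariance (Z is mean zero)

# Exact transformed covariance for the cube map, for comparison
d = np.diag(Sigma)
C_exact = 9.0 * np.outer(d, d) * Sigma + 6.0 * Sigma**3

# Empirical precision: large on the chain edges, small elsewhere
P_hat = np.linalg.inv(C_hat)
edges = np.abs(Omega) > 0
off_graph_max = np.abs(P_hat[~edges]).max()
on_graph_min = np.abs(P_hat[edges & ~np.eye(n, dtype=bool)]).min()
```

The empirical covariance matches the closed form up to Monte Carlo noise, and the empirical precision separates graph edges from non-edges by roughly a factor of $\delta$.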
4.2. Star graph
Let the precision matrix be that of a star graph, in which a single hub variable is coupled to every leaf, so that $\Omega = I + A$ with $A$ supported on the first row and column. Then the inverse of $\Omega$ is
An example is
and its inverse is
After applying the transformation $g$, and using our main theorem, we have the transformed covariance:
Once again the diagonal entries scale close to the predicted constant, and the off-diagonal entries in the first row and column have the predicted magnitude, which they do. Finally, the precision matrix is given by
which, as the reader can check, is exactly as predicted. All originally zero entries remain small in absolute value.
4.3. Grid graph
For a grid graph, with nodes ordered across the rows, ones on the diagonal, and $\delta$ on each edge, the precision matrix is block Toeplitz: