    # Diagonal Nonlinear Transformations Preserve Structure in Covariance and Precision Matrices

For a multivariate normal distribution, the sparsity of the covariance and precision matrices encodes complete information about independence and conditional independence properties. For general distributions, the covariance and precision matrices reveal correlations and so-called partial correlations between variables, but these do not, in general, have any correspondence with respect to independence properties. In this paper, we prove that, for a certain class of non-Gaussian distributions, these correspondences still hold, exactly for the covariance and approximately for the precision. The distributions – sometimes referred to as "nonparanormal" – are given by diagonal transformations of multivariate normal random variables. We provide several analytic and numerical examples illustrating these results.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

Among many appealing properties of multivariate normal distributions, their second moment matrix and its inverse contain complete information about the independence and conditional independence properties. Specifically, a zero in the

th entry of the covariance matrix means that variables and are marginally independent, while a zero in the

th entry of the precision (inverse covariance) means that the two are conditionally independent. For high-dimensional Gaussian data sets, it is often of interest to estimate either a sparse covariance matrix, or sparse precision matrix, or, in some cases, both

.

In general, this correspondence—between the second moment matrix and the independence properties, and between the inverse and conditional independence properties—does not hold for non-Gaussian distributions. Tests to determine independence and conditional independence become more complex than matrix estimation: the complexity of exhaustive pairwise testing techniques scales exponentially with the number of variables ; other methods compute scores or combine one-dimensional conditional distributions for the exponential family [8, 12, 11]; another approach (by two of the current co-authors) identifies conditional independence for arbitrary non-Gaussian distributions from the Hessian of the log density, but is so far computationally limited to rather small graphs . Thus, it is of broad interest to analytically extract marginal and conditional independence properties of a distribution a priori to any estimation procedure.

In this paper, we show that the above correspondences between independence and sparsity of covariance and precision matrices are approximately preserved for a broad class of distributions, namely those given by certain diagonal and mean-preserving transformations of a multivariate normal (sometimes referred to as “nonparanormal” ). In particular, these distributions display the following behavior:

1. Variables and are marginally independent if and only if the th entry of the covariance is zero, equivalent to the normal case.

2. Variables and are conditionally independent if and only if the th entry of the precision is small, where “small” will be made precise later on.

In other words, under some assumptions, a Gaussian approximation to a non-Gaussian distribution of this form will exactly recover the marginal independence structure, and approximately recover the conditional independence structure, which is often summarized as an undirected graphical model.

Covariance estimation for general non-Gaussian data sets is of course standard procedure, to (at least) identify correlations between variables. Perhaps less common but not unusual is precision estimation for general data sets, to identify a “partial correlation graph”—a sort of first approximation of the conditional independence properties [6, 10, 4, 1]. This work provides a new mathematical foundation to connect the sparsity of computed partial correlations to the conditional independence properties for a common class of non-Gaussian distributions.

The rest of the paper is organized as follows. Section 2 proves exactly how the entries of the covariance transform, with the result that the covariance structure is exactly preserved by the diagonal transformation, and provides an explicit formula to calculate higher moments of Gaussian random variables. Extra related computations are given in Appendix A. Section 3 computes the inverse covariance matrix after the transformation, showing that the conditional independence structure is approximately preserved. Numerical results for specific graphs are given in Section 4, and we conclude with Section 5.

## 2. Moments after transformation

Consider a multivariate normal random variable with density . Let be the inverse covariance or precision matrix of . For satisfying some conditional independence properties, the precision matrix will have zero entries. Furthermore, the sparsity of will define the minimal I-map of , i.e., the minimal undirected graphical model satisfying the conditional independence properties of .

Now apply a (nonlinear) univariate transformation to each element of , i.e., . We refer to the overall mapping as a diagonal transformation. Let us then say that where is the push-forward density of through the diagonal transformation. For a general nonlinear , will have a non-Gaussian distribution. Nevertheless, its mean is given by and its covariance is given by . For simplicity, we will assume that is mean-preserving, so that . We let denote the precision matrix of . Also note that a Gaussian approximation to will have mean and covariance .

Our ultimate goal is to derive conditions under which a Gaussian approximation to will approximately preserve the conditional independence properties (i.e., the corresponding entries of the inverse covariance matrix of will be small). To do so, in this section we first characterize the first and second moments of .

### 2.1. Univariate moments after transformation

If is a smooth function of , we can expand it using a Taylor series expansion around as

 (1) f(Xi)=∞∑k=0f(k)(0)k!Xki.

Using this expansion we compute the first two moments of by taking advantage of the linearity of the expectation operator. They are

 (2) Eρ[f(Xi)]=E[∞∑k=0f(k)(0)k!Xki]=∞∑k=0f(k)(0)k!Eρ[Xki],
 (3) Eρ[f(Xi)2]=E[(∞∑k=0f(k)(0)k!Xki)(∞∑l=0f(l)(0)l!Xli)]=∞∑k=0∞∑l=0f(k)(0)f(l)(0)k!l!Eρ[Xk+li].

For a mean-zero Gaussian random variable

with variance

, its moments are given by

 (4) Eρ[Xki]={0if k is oddσkii(k−1)!!if k is even,

where denotes the double factorial. Next we ignore any terms in (2) and (3) that involve odd terms in the exponent . Therefore, the first and second moments can be written only in terms of the variance of . The first moment is given by

 (5) Eρ[f(Xi)]=∞∑k=0even kf(k)(0)k!Eρ[Xki]=∞∑k=0even kakσkii,

where is a constant depending on the higher order derivatives of and the index . Similarly, after a re-parameterization of the indices over the sum , the second moment is given by

 Eρ[f(Xi)2] =∞∑n=0n∑p=0f(p)(0)f(n−p)(0)p!(n−p)!Eρ[Xni] (6) =∞∑n=0even nEρ[Xni](n∑p=0f(p)(0)f(n−p)(0)p!(n−p)!)=∞∑n=0even nbnσnii,

where with .

Of course, the above computations assume convergence of each series. In the next section we will give criteria for these series to converge.

### 2.2. Multivariate moments after transformation

With a similar argument as above, we can also derive the second moment matrix of the transformation. Using the Taylor series expansions in (1), the second moment matrix is given by

 Eρ[f(X)f(X)T] =Eρ[∞∑k=0∞∑l=0f(k)(0)f(l)(0)k!l!Xk(Xl)T] (7) =∞∑n=0n∑p=0f(p)(0)f(n−p)(0)p!(n−p)!Eρ[Xp(Xn−p)T]

where in the second line we re-parameterize and switch the order of summations and expectation.

Starting from Isserlis’ theorem (or Wick’s probability theorem)

, we compute the moments of the product of Gaussian random variables as:

 Eρ [XpiXn−pj] (8) =⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩0n odd∑0k=p by −2(p−k−1)!!(pk)(n−pk)k!(n−p−k−1)!!σ(p−k)/2iiσkijσ(n−p−k)/2jjn even% , p≤n/2∑0k=n−p by −2(p−k−1)!!(pk)(n−pk)k!(n−p−k−1)!!σ(p−k)/2iiσkijσ(n−p−k)/2jjn even% , p>n/2.
###### Example 2.1.

For the transformation , the third order truncation of the expansion in (7) is exact and the derivatives evaluated at zero are all zero with the exception of . Using this result, we have that

 (Σπ)ij=f′′′(0)236E[X3iX3j]=E[X3iX3j]=9σiiσjjσij+6σ3ij.

Just as what was done above for one of our goals is to compute the general explicit form of . Before we go any further, however, we should prove that the above series actually makes sense, that is, that it converges. We do this now.

###### Lemma 2.1.

Suppose that the derivatives of the function are all bounded at zero. Then the series

 ∑n≥2n∑p=0f(p)(0)f(n−p)(0)p!(n−p)!Eρ[Xp(Xn−p)T],

converges.

###### Proof.

We consider

 ∑n≥4n∑p=0f(p)(0)f(n−p)(0)p!(n−p)!Eρ[XpiXn−pj]

for fixed and To begin, we use the Cauchy-Schwartz inequality with

 Eρ[XpiXn−pj]=∫XpiXn−pjρ(X)dX.

The square of this expectation is bounded by the product of integrals

 (∫X2piρ(X)dX)(∫X2n−2pjρ(X)dX).

From (4) we know that this last product is then

 σ2pii(2p−1)!!σ2(n−p)jj(2(n−p)−1)!!.

Let be a bound on for all and recall we are assuming that the derivatives are bounded. Then the sum in question is bounded by

 (9) ∑n≥4n∑p=0((2p−1)!!(2(n−p)−1)!!)1/2p!(n−p)!Mn.

Note that

 (2p−1)!!=(2p−1)!(p−1)!2p−1=(2p)!p!2p=Γ(p+1/2)2pπ−1/2,

where the last equality follows from the duplication formula for the Gamma function . Thus, the sum in (9) is given by

 ∑n≥42n/2Mnπ−1/2n∑p=0(Γ(p+1/2)Γ(n−p+1/2))1/2p!(n−p)!.

Let us consider the inner sum over Using the symmetry in and this is at most

 (10) 2[n2]+1∑p=0(Γ(p+1/2)Γ(n−p+1/2))1/2p!(n−p)!.

The ratio

 Γ(k+1/2)1/2Γ(k+1)

decreases as increases and hence the sum in (10) is bounded by

 2Γ(n−[n/2]−1/2)1/2Γ(n−[n/2])∞∑p=0Γ(p+1/2)1/2p!.

The sum over converges and hence it is bounded. The convergence is easily seen with the ratio test. Thus we are left with something bounded by a constant times

 ∑n≥42n/2MnΓ(n−[n/2]−1/2)1/2Γ(n−[n/2])

which also converges by the ratio test after grouping each even indexed term with the following odd indexed term. ∎

To end this section we note that the derivatives of evaluated at zero need not be bounded to establish convergence. If the derivatives grow as a power or even geometrically as the same computation would work.

### 2.3. Transformation of matrix elements

This focus of this section is to describe how individual matrix elements of the covariance matrix get transformed when is applied to the random variable . We assume that and thus . For the diagonal entries of the covariance, we already know the answer from (6). Let Then

 τii=∞∑n=2even n(n−1)!!g(n)(0)n!σnii,

where . This means that if (or is close to one) then the diagonal elements of the transformed covariance are all equal to a constant factor that only depends on the derivatives of at zero. Here is an example.

###### Example 2.2.

Suppose , and that Then and we have

 τii=∞∑n=2even n(n−1)!!g(n)(0)n!=∞∑k=1(−1)k+12k−1(2k)!k!(2k)!=1−e−22 for % all i.

We now examine what happens to the other elements of the covariance after the transformation. We will consider for now the case of an odd function .

###### Theorem 2.2.

Suppose is given by

 f(x)=∑u=0f(2u+1)(0)(2u+1)!x2u+1.

Define a new function

 Fk(x)=∑u=0f(2u+k)(0)u!xu,

and Then is transformed to

 (11) τij=∑oddkGkij(1/2)σkijk!.
###### Proof.

After inserting formula (2.2) in (6), we combine all the terms that correspond to . By simplifying the double factorials, one has that the coefficient of is given by

 (12) ∑nevenn/2≥kn−k∑poddp=kf(p)(0)f(n−p)(0)σp−k2iiσn−p−k2jjk!2n/2−k(p−k2)!(n−p−k2)!.

Let , , and . Then the sum in (12) becomes

 1k!2−k∑m≥k12mm−s−1∑r=sf(2r+1)(0)f(2m−2r−1)σr−siiσm−r−s−1jj(r−s)!(m−r−s−1)!.

Again change variables. Let and we have

 1k!2−k∑m≥k12mm−2s−1∑t=0f(2(s+t)+1)(0)f(2m−2(s+t)−1)(0)σtiiσm−t−2s−1jjt!(m−t−2s−1)!.

Recall our functions

 Fk(x)=∑u=0f(2u+2s+1)(0)u!xu,

and Then the coefficient of becomes

 1k!2−k∑m≥k12mG(m−k)kij(0)(m−2s−1)!

or

 1k!∑m=012mG(m)kij(0)m!=1k!Gkij(1/2),

and the result follows. ∎

While the answer may look complex, it is easy to compute in many cases. The computation is also often simplified by noting that is the th derivative of Here are some examples:

1. Let Then and . Thus . Summing over the odd indices we have that is transformed to

 τij=e−σii+σjj2sinhσij.
2. Let us also verify the computation done earlier for the function For this case, , and if . Thus we have , and . This yields a final answer of

 τij=9σiiσijσjj+6σ3ij.
3. Let Then and Thus and we have that is transformed to

 τij=eσii+σjj2sinhσij.
4. Let Then and Thus we have

 τij=l∑s=0(σiiσjj)l−s4(l−s)(2s+1)!σ2s+1ij.

In Appendix A, we also provide related and somewhat simpler computations that may be helpful in some circumstances. In particular, we compute the coefficient of the linear term in expansion 7, i.e., the coefficient corresponding to .

To conclude this section, we present an important (and well-known) consequence of Theorem 2.2 about the marginal independence properties of random variables after diagonal transformations. For pairs of Gaussian variables that are marginally independent, we have . Thus, by (5) and (11) we have that and the variable pair is also marginally independent. Therefore, diagonal transformations exactly preserve the sparsity of the covariance matrix for , i.e., the zero elements in , in the covariance matrix of .

## 3. Properties of the inverse covariance

Our goal is to not only say something about the covariance matrix for the transformed variables, but also about what happens to the entries in the precision matrix. We may be faced with the following situation. We have a precision matrix from a multivariate normal variable with some zeros to start. After transforming those variables, we would like to know what (approximate) sparsity can be recaptured in the inverse covariance matrix of the non-Gaussian variables. We probably cannot hope to do this in general, but we can say something specific about some particular cases that often do occur in applications. First, we begin with a technical lemma about matrix inverses.

###### Lemma 3.1.

Let where the operator norm of , , is at most . Then where the norm of is at most .

###### Proof.

From the Neumann series expansion for the inverse of , we have

 A−1=∞∑k=0(−B)k=I−B+∞∑k=2(−B)k.

For , the norm of the last term is at most for a converging geometric series. ∎

We note that entry of the matrix must be of the form where and thus . Lastly, we note that the inverse matrix can also be written as where the norm of is at most

Now suppose that we have a precision matrix of the form with By the lemma above, whenever an entry of the precision matrix is zero, the corresponding entry of the covariance will be of order (at most) . If the entry in the precision is not zero, then the entry is of the from plus something again of order . Here we attempt to keep track of what happens to those entries of the covariance under the transformation given that our function is odd.

###### Lemma 3.2.

Suppose that is an odd function with derivatives at zero bounded by , and the precision matrix is of the form where has norm at most . Then for

 (13) τij=G1ijσij+O(ϵ3),

where is a constant that depends only on that is given in Theorem 2.2, and .

###### Proof.

From Theorem 2.2 we have that the transformation of is given by

 τij=∑oddkGkij(1/2)σkijk!.

If is a bound on the derivatives of at zero, then it follows that

 |Gkij(1/2)|≤N2e(σii+σjj)/2≤N2e1+ϵ.

Thus, the difference for any is bounded by

 16ϵ3N2e2coshϵ.

This last estimate follows from the Taylor series error for . Given that we have a bound of at most although for a given function and smaller this bound can be improved. ∎

###### Lemma 3.3.

Suppose that is an odd function with bounded derivatives at zero, and the precision matrix is of the form where has norm at most is given as above, and Let

 κ=∑odd kF2k(1/2)k!.

Then

 τii=κσii+O(ϵ2).
###### Proof.

The proof of this is similar to the previous lemma. The difference is that we are expanding our function in a neighborhood of instead of zero. Notice that the transformation of is given by

 ∑odd kGkii(1/2)σkiik!.

We think of this as a function of the variable . When is one, then this evaluates to . Recall also that and thus using Taylor’s theorem (since all derivatives are bounded in a neighborhood of one), the result follows. ∎

###### Lemma 3.4.

Suppose that is an odd function with bounded derivatives, the precision matrix is of the form where has norm at most , is given as above, and Then for

 τij=λσij+O(ϵ3).

where

To see this, in Lemma 3.2 we replace with its value at . The difference is , and since is at most , the result follows.

To summarize, in the case of odd functions we have that the transformed covariance matrix has the form

 (14) τ=κI+λB+E′

where and are given in the previous lemmas, is as before, and is the error. Let us now estimate the norm of the error.

###### Theorem 3.5.

Let the precision matrix be a by matrix, and suppose is bounded. Then the Hillbert-Schmidt norm of is at most .

###### Proof.

The diagonal elements of are at most and hence the diagonal part of has norm at most . The off-diagonal elements of are ; thus the matrix consisting of these off-diagonal elements has Hilbert-Schmidt norm (equal to the Frobenius norm, and greater than or equal to the operator norm) at most and the result follows. ∎

Note that these results can be made much more precise for specific functions and for specific matrices of fixed size when it is possible to keep track of the constants in the estimates.

Our final step is to compute the inverse of the covariance matrix in (14), which is given by

 κ−1(I−λκ(B+E′))−1 =κ−1(I+λκB+E′′)=1κI+λκ2B+1κE′′

for some error term . Thus the off-diagonal terms of the transformed precision have much the same behavior of the original precision matrix. If the off-diagonal entries are of order . If they are not zero, then the first order term scales with .

This will be illustrated by examples in the next section.

## 4. Applications to specific graphs

### 4.1. Chain graph

We begin with an example of a circulant matrix that corresponds to the starting precision of a chain graph. Consider

 A=Γρ=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝112200000122122112200000012211220000001221122000000122112200000012211220000001221122122000001221⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠

For this , Its inverse is given by (rounded to places)

 Γ−1ρ=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝1.0042−0.04570.0021−0.00010.−0.00010.0021−0.0457−0.04571.0042−0.04570.0021−0.00010.−0.00010.00210.0021−0.04571.0042−0.04570.0021−0.00010.−0.0001−0.00010.0021−0.04571.0042−0.04570.0021−0.00010.0.−0.00010.0021−0.04571.0042−0.04570.0021−0.0001−0.00010.−0.00010.0021−0.04571.0042−0.04570.00210.0021−0.00010.−0.00010.0021−0.04571.0042−0.0457−0.04570.0021−0.00010.−0.00010.0021−0.04571.0042⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠

and now We now demonstrate the effect of applying the diagonal transformation

to a multivariate normal vector

with the covariance above. Using Theorem 2.2, the transformed covariance is given by

 Σπ=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝0.4329−0.01680.00080.0.0.0.0008−0.0168−0.01680.4329−0.01680.00080.0.0.0.00080.0008−0.01680.4329−0.01680.00080.0.0.0.0.0008−0.01680.4329−0.01680.00080.0.0.0.0.0008−0.01680.4329−0.01680.00080.0.0.0.0.0008−0.01680.4329−0.01680.00080.00080.0.0.0.0008−0.01680.4329−0.0168−0.01680.00080.0.0.0.0008−0.01680.4329⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠.

We note that as predicted, this matrix is circulant and preserves the marginal independence properties of .

To verify the computation of , we also estimate the covariance using samples. To do so, we generated 100,000 samples from the distribution with the above covariance and then applied the function to the data. The resulting empirical covariance was

 ⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝0.4349−0.0167−0.00180.−0.00050.00070.0011−0.016−0.01670.4314−0.01840.00030.0009−0.00030.0024−0.0001−0.0018−0.01840.4332−0.01690.00160.0003−0.00170.00070.0.0003−0.01690.4305−0.01650.0024−0.001−0.0028−0.00050.00090.0016−0.01650.4348−0.01590.00080.00050.0007−0.00030.00030.0024−0.01590.4339−0.01750.00130.00110.0024−0.0017−0.0010.0008−0.01750.4341−0.0193−0.016−0.00010.0007−0.00280.00050.0013−0.01930.4331⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠.

Our theory says that the main diagonal should be and the upper and lower off-diagonals and corners should be . This is reflected in both computations of the transformed covariance above. Notice that all other entries are less than . Finally we compute the inverse of the transformed covariance matrix . This is given by

 ⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝2.3170.0895−0.00060.0.0.−0.00060.08950.08952.3170.0895−0.00060.0.0.−0.0006−0.00060.08952.3170.0895−0.00060.0.0.0.−0.00060.08952.3170.0895−0.00060.0.0.0.−0.00060.08952.3170.0895−0.00060.0.0.0.−0.00060.08952.3170.0895−0.0006−0.00060.0.0.−0.00060.08952.3170.08950.0895−0.00060.0.0.−0.00060.08952.317⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠.

As expected from the theory in Section 3, the diagonals entries should be and the upper and lower off-diagonal entries and corners should be ; all other entries are less than as predicted.

### 4.2. Star graph

Let , where is an -vector with . Then the inverse of is

 Γ−1ρ=11−∥b∥2[1−bT−b(1−∥b∥2)I+bbT].

An example for is

 ⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝11111111111111111000111010011100101110001⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠,

and its inverse is

 ⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝1.0342−0.094−0.094−0.094−0.094−0.0941.00850.00850.00850.0085−0.0940.00851.00850.00850.0085−0.0940.00850.00851.00850.0085−0.0940.00850.00850.00851.0085⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠.

After applying the transformation , and using our main theorem, we have the transformed covariance :

 ⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝0.4368−0.0339−0.0339−0.0339−0.0339−0.03390.43350.00310.00310.0031−0.03390.00310.43350.00310.0031−0.03390.00310.00310.43350.0031−0.03390.00310.00310.00310.4335⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠.

Once again the diagonal entries should scale close to . The first non-diagonal row and column entries should have magnitude which they do. Finally, the precision matrix is given by

 ⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝2.34510.17950.17950.17950.17950.17952.3212−0.0025−0.0025−0.00250.1795−0.00252.3212−0.0025−0.00250.1795−0.0025−0.00252.3212−0.00250.1795−0.0025−0.0025−0.00252.3212⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠,

which as the reader can check is exactly as predicted. All original zero terms are less than in absolute value.

### 4.3. Grid graph

For a grid graph, with nodes ordered across the rows, ones on the diagonal, and on each edge, the precision matrix is block Toeplitz:

 α⎛⎜⎝