    # Estimation When Both Covariance And Precision Matrices Are Sparse

We offer a method to estimate a covariance matrix in the special case that both the covariance matrix and the precision matrix are sparse — a constraint we call double sparsity. The estimation method is maximum likelihood, subject to the double sparsity constraint. In our method, only a particular class of sparsity pattern is allowed: both the matrix and its inverse must be subordinate to the same chordal graph. Compared to a naive enforcement of double sparsity, our chordal graph approach exploits a special algebraic local inverse formula. This local inverse property makes computations that would usually involve an inverse (of either precision matrix or covariance matrix) much faster. In the context of estimation of covariance matrices, our proposal appears to be the first to find such special pairs of covariance and precision matrices.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## Abstract

We offer a method to estimate a covariance matrix in the special case that both the covariance matrix and the precision matrix are sparse — a constraint we call double sparsity. The estimation method is maximum likelihood, subject to the double sparsity constraint. In our method, only a particular class of sparsity pattern is allowed: both the matrix and its inverse must be subordinate to the same chordal graph. Compared to a naive enforcement of double sparsity, our chordal graph approach exploits a special algebraic local inverse formula. This local inverse property makes computations that would usually involve an inverse (of either precision matrix or covariance matrix) much faster. In the context of estimation of covariance matrices, our proposal appears to be the first to find such special pairs of covariance and precision matrices.

## 1 Introduction

We begin with a quote from Rothman, Levina, and Zhu (14)As a rule of thumb in practice, if it is not clear from the problem whether it is preferable to regularize the covariance or the inverse, we would recommend fitting both and choosing the sparser estimate.” The authors are describing methods that can estimate a covariance matrix or a precision matrix when one – but not both – of these matrices is sparse. For example, Bickel and Levina (1) exploit sparsity patterns when estimating a covariance matrix or its inverse, and Ravikumar, Wainwright, Raskutti, and Yu (13) explore related problems for high dimensional estimation.

In this article we demonstrate that it is in fact possible to be greedy, and ask for simultaneous sparsity in both the covariance and precision matrices. Our estimator is a maximum likelihood estimator with the constraint that both the covariance and its inverse be sparse. To the best of our knowledge, this is the first proposal to seek efficient estimation for the doubly sparse case (when both covariance matrix and precision matrix are sparse).

Some of the advantages to imposing sparsity in both the covariance and its inverse are simplicity of interpretation and faster computation via local formulas that we describe below (these formulas allow us to work with both the matrix and its inverse in an efficient way).

Estimating a covariance matrix is an important problem in the subject of statistics. Methods of imposing sparsity in the covariance matrix — or more commonly in the inverse of the covariance matrix (the precision matrix) — have attracted a great deal of attention. A very popular example of such an approach is the graphical LASSO of Friedman, Hastie, and Tibshirani (5). One reason these methods are important is that the corresponding model is more interpretable when the precision matrix is sparse, which is important in the subjects of Gaussian Markov random fields (15) and of graphical models and covariance selection (10).

When there is a known relationship between entries in a matrix (covariance matrix, precision matrix, or Cholesky factor, for example) and coefficients in regression, then it is possible to apply many available methods from regression that impose zeros in the regression coefficients, as a way to impose zeros in the matrix. Financial applications of such methods include the shrinkage estimation algorithm of Ledoit and Wolf (11) and the missing data completion algorithm of Georgescu, Higham, and Peters (6).

Missing data completion is closely related to the Dempster completion, which is closely related to the local inverse formula that we describe later. The Dempster (3) completion of a covariance matrix in which there are missing entries has an especially satisfying form when the inverse is chordal. Then Georgescu, Higham, and Peters (6) note the completion brings together a number of attractive properties (17, 8, 4, 9):

• sub-blocks of these matrices away from the main diagonal have low rank, as in The Nullity Theorem (19, 18) ; these matrices are examples of semi-separable matrices (20);

• the inverse can be found directly, without completing the covariance matrix, via a ‘local inverse formula’ (as shown in the symmetric positive definite case by (10, 16)) that uses only information in the blocks on the main diagonal of the incomplete covariance matrix;

• the completed matrix maximizes the -determinant amongst the cone of symmetric positive definite matrices consistent with the original incomplete covariance matrix, and it is a maximum entropy estimate.

When an existing method succeeds in imposing sparsity in the precision matrix, then typically the corresponding covariance matrix is not sparse. Vice-versa, if the covariance matrix is sparse then the precision matrix is typically not sparse. In summary, all methods that are currently available in the literature are not able to impose sparsity simultaneously in both the covariance matrix and the precision matrix. Indeed, informally, if a sparse matrix is chosen “at random” then all entries of the corresponding inverse matrix are non-zero. A sparse matrix with a sparse inverse is an extremely exceptional case. That exceptional case, applied to the problem of estimating a covariance matrix, is the subject of the subsequent sections.

In all subsequent sections, we refer to a chordal graph and the junction tree for that graph. A definition of a chordal graph is that all cycles of four or more vertices have a chord. A chord is an edge that is not part of the cycle, but that connects two vertices of the cycle. There are other equivalent characterizations of chordal graphs, such as the graphs that have perfect elimination orderings. (See, e.g. (21, 17, 9).) Chordal graphs are also known as decomposable graphs, in the graphical models literature (10).

## 2 Doubly Sparse Covariance and Constrained Likelihood

Block diagonal covariance matrices are the simplest examples for which both the matrix and its inverse are sparse. A diagonal covariance matrix has inverse . Similarly, a block-diagonal matrix and its inverse have the same sparsity pattern.

Other than such (block) diagonal trivial examples, it is not clear if it is possible to construct nontrivial examples in which both the covariance and its inverse have the same sparsity pattern — a phenomenon we dub doubly sparse covariance.

Do such doubly sparse covariance matrices exist? The answer is ‘yes’, and examples can be constructed, where both the covariance matrix and its inverse (or precision matrix) are assumed chordal:

 (1)

with inverse (or precision matrix) given by

 (2)

It can be quickly checked this pair are also positive definite. Both V and are not block diagonal, demonstrating that the class of matrices we are considering in this article is richer than simply block diagonal matrices.

While chordal covariance matrices have been studied before, there seem to be no examples in the literature where both the covariance and its inverse are chordal. We will revisit this same matrix example later in (6) to show that it has some local inverse properties.

### 2.1 Constrained Maximum Likelihood

We want both the matrix M and the inverse to be subordinate to the same chordal graph. i.e. if there is no edge between nodes and then the entry of M is zero, and the entry of is zero. We assume that we are given iid sample data , drawn from for some unknown and M. Let the sample covariance matrix be

 \textslS=1nn∑i=1(Xi−^μ)(Xi−^μ)⊤,^μ=1nn∑i=1Xi.

Under the normality assumption and after eliminating the nuisance parameter (by replacing it with its MLE ), the maximization of the log-likelihood is equivalent to the minimization (with respect to M):

 trace(\textslS\textslM−1)+ln|\textslM|.

In our constrained MLE framework, we want to find a symmetric positive definite matrix that minimizes the objective function

 trace(\textslM−1\textslS)+ln|\textslM| (3)

subject to the constraint that

 \textslM and \textslM−1 are subordinate toG. (4)

The main novelty in our constrained optimization is that we will make use of the so-called local inverse formula to impose the doubly sparse constraint in (4). We describe that formula in detail in the next section. This is a simple formula that relates the inverse to inverses of sub-blocks of the matrix, and which applies when the graph is subordinate to a chordal graph.

The same idea allows us to compute in the objective function by a local formula for the determinant Johnson and Lundquist (9). A main advantage is that we can easily define a function based on the local inverse formula that allows us to parameterize the set of matrices in the constraint using only one subset of the variables. For example, we can use only the subset of nonzero entries of M, and then can be used in place of . (Or vice-versa: we could parameterize with only the subset of entries of , and then use in place of M.)

The optimization also exploits the following well-known observations:

• We need only optimize over a subset of the entries of a Cholesky factor, R, that correspond to the chordal graph. Then form M, if required, as , for example. That is, we parameterize only with the subset of entries in a Cholesky factor, R.

• Parameterizing via a Cholesky factor automatically ensures that we optimise over positive definite matrices. Also, parameterizing via a Cholesky factor exploits our knowledge of numerical analysis and our knowledge of chordal graphs, that chordal graphs also correspond to ‘perfect eliminators.’ That is, the sparsity pattern is preserved (2).

• Sometimes it may be better to parameterize by the entries of the precision matrix, or by the entries of the Cholesky factor of the precision matrix, as in Pourahmadi (12), but we have made no attempt to compare the relative merits of those two approaches.

Before proceeding with the details of the local inverse formula, we make two remarks.

#### Imposing the double sparse constraint in a naive way.

A naive approach to (3)+(4) is to apply off-the-shelf optimization software, that only optimizes over the subset of the entries in M that are allowed to be possibly nonzero (so that the first part of the constraint, M subordinate to , is automatically fulfilled), and then impose some form of penalty based on terms in that are nonzero and that are not allowed to be nonzero according to the graph . There are other naive ways that one could imagine to impose the doubly sparse constraint in (4). An issue with such naive approaches is that they do not make use of the underlying algebraic structure of such special pairs of matrices.

#### Separable nonlinear least squares problems.

Finding a pair of matrices M and that are both subordinate to the same graph , as in (4), is related to so-called separable nonlinear least squares problems (7). Roughly speaking, that class of problems corresponds to a nonlinear optimization problem where the variables can be partitioned into two subsets, and where knowing the values of the variables for one of the subsets, leads to a linear problem to be solved in order to find the unknown values of the remaining variables in the other subset. For our application at hand, the two subsets are the unknown entries of M and of . If we knew the true nonzero entries of M, then it is a simple linear problem to find the nonzero entries of (and vice-versa). In the context of separable nonlinear least squares problems, one approach is the so called variable projection method. Such an algorithm alternates between updating the two subsets of the variables. At first glance, the algorithm only involves linear optimization in each iteration. Unfortunately, such an alternating approach often has disappointing performance, and there are challenges with parameterization.

### 2.2 Local Inverse Formula

If the inverse matrix, , is subordinate to a chordal graph, then this local inverse formula (5) tells us how to find the inverse using only sub-blocks of the matrix V. The formula reads

 Θ≡\textslV−1=∑[c]∈C(\textslV[c])−1−∑[j]∈J(\textslV[j])−1, (5)

where denotes a square sub-block of the matrix V, which corresponds to a maximal clique in the set of maximal cliques that are the nodes of the clique tree associated with the chordal graph of the matrix V; and for each element in the set of edges of the clique tree, is a square sub-block of the matrix V that corresponds to the intersection (or ‘separator’) of two maximal cliques that are connected in the clique tree. Proofs of (5) can be found in the references, under mild assumptions Johnson and Lundquist (9). There are also other terminologies for the same thing, see, e.g., Bartlett’s lecture notes on Undirected graphical models: Chordal graphs, decomposable graphs, junction trees, and factorizations https://people.eecs.berkeley.edu/~bartlett/courses/2009fall-cs281a/, or (10, 16).

For our special class of matrices, that satisfy the doubly sparse constraint in (4), both are subordinate to the same chordal graph, so we are allowed to swap the roles of V and and the local inverse formula (5) still holds. We can exploit this local property in whichever way is most convenient. This is what distinguishes the class of covariance matrices we study here from the rest of the literature.

#### Example: A 5×5 Local Inverse Formula.

While the formula (5) scales to large matrices in a computer code, here we only give small examples that can be displayed on a page. The numerical conditioning depends on the conditioning of the cliques and the separators. How well the computations scale depends on the clique tree, and especially the size of the maximal cliques. In other words, the speed depends more on the graph, rather than the size of the matrix.

Let us revisit the same example in (1). We will demonstrate the Local Formula (5) holds. In plain English, formula (5) roughly states that “the inverse is the sum of the inverses of the blocks, minus the inverse of the overlaps,” as in this example:

 Θ=\textslV−1=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝⎛⎜ ⎜ ⎜⎝13842813214210621613⎞⎟ ⎟ ⎟⎠−1⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠+⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝⎛⎜ ⎜ ⎜⎝106116131010110138110813⎞⎟ ⎟ ⎟⎠−1⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠ (6)
 −⎛⎜ ⎜ ⎜ ⎜⎝(106613)−1⎞⎟ ⎟ ⎟ ⎟⎠.

We are also seeing examples of the sub-matrices that correspond to the two maximal cliques (corresponding to the two sets of indices and ) and to their intersection (corresponding to the set of indices ) in the clique tree coming from the chordal graph associated with this matrix example.

Recall that what is novel about the general class of matrices that we consider in this article is that they satisfy that local inverse formula, (5), in both directions. That is, for these special examples, we can swap the roles of V and , and the Local Formula (5) remains true! In this example we obtain:

 \textslV=21540⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝⎛⎜ ⎜ ⎜⎝2960−1690−90090−16902675150−15−9001508715−1218090−15−1218023835⎞⎟ ⎟ ⎟⎠−1⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠ (7)
 +21540⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝⎛⎜ ⎜ ⎜⎝8715−1218053855385−1218023835−10770−107705385−10770753932315385−1077032317539⎞⎟ ⎟ ⎟⎠−1⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠
 −21540⎛⎜ ⎜ ⎜ ⎜⎝(8715−12180−1218023835)−1⎞⎟ ⎟ ⎟ ⎟⎠.

Having simultaneously both (6) and (7) hold is an example of the local algebraic property that we exploit in this article for covariance matrix estimation, and is our main contribution.

#### Example: Local inversion for Block Matrices.

Consider the block matrix

 (8)

Suppose we know that the matrix is invertible and that the inverse has the sparsity pattern

 (9)

We are not specifying the entry in the top right, nor in the bottom left, of M, because it can be shown that the sparsity pattern of implies that those two blocks must be

 \textslM13=\textslM12\textslM−122\textslM23,

and

 \textslM31=\textslM32\textslM−122\textslM21.

There are some mild assumptions, such as that is invertible. Then the local inverse formula tells us that:

 (10)

Direct matrix multiplication of M with this claimed form of

to arrive at the (block) identity matrix is one way to prove this. Equation (

10) is an example of the Local Inverse Formula (5). Although this example is only , in fact the Local Inverse Formula (5) can be applied to arbitrarily large matrices (subject to mild assumptions and of course being subordinate to a chordal graph as previously stated), and the proof for larger matrices essentially boils down to this non-commutative block matrix algebra, applied in a recursive way (and the proof is thus by induction), and combined with the key property of chordal graphs that they always have a junction tree Johnson and Lundquist (9).

### 2.3 The Local Function

Let a chordal graph and its junction tree be given. Then we define a function by the right hand side of (5)

 L(\textslM)≡∑[c]∈C(\textslM[c])−1−∑[j]∈J(\textslM[j])−1. (11)

We now make the following observations:

1. L depends on a chordal graph, but that dependence is not explicit in the notation on the left side of (11), i.e., we could plausibly use notation to indicate the dependence on the chordal graph .

2. The dependence on the chordal graph is seen on the right side of (11) with the first sum being over the maximal cliques of the graph, and the second sum being over the separators . This pair , must correspond to a clique tree (sometimes called a junction tree) of the graph, which has a running intersection property. It is an equivalence characterisation of chordal graphs that a chordal graph always has at least one clique tree with this property.

3. At first glance, it also looks like L depends on the choice of junction tree, but it can be shown that in fact all junction trees would lead to the same final sum on the right of (11).

4. The domain of L is a strict subset of . The input does not need to be an example from our special class of matrices in order to apply L. The only requirement on the input matrix is that the submatrices on the right hand side of (11) are indeed invertible.

5. The output matrix is subordinate to the chordal graph, by definition in (11).

6. For an arbitrary matrix M, we usually expect . However, for matrices for which the inverse is known to be subordinate to a chordal graph, then we have by the results of the local formula that is indeed the inverse.

7. Recall the doubly sparse constraint in (4). We typically fulfill one of the conditions in that constraint (4) automatically by parameterizing by the allowed nonzero entries of, say, M. If the matrix M was truly in the special class that satisfied the constraint, then we would have that both and . We can attempt to exploit this when imposing the constraint, and to take advantage of the property that or will usually be much better to compute than an inverse, because of the local formula.

### 2.4 Optimization Summary and Theoretical Justification

The following two key points make it clear that the constraints in (4) can be enforced via the local formula, in the way that we describe in our algorithm below.

###### Theorem 1 ((2), (21).)

Let be a chordal graph, and let be a corresponding list of cliques and separators, which we call a perfect elimination ordering, in terms of a clique tree. Then M and a Cholesky factor R based on this ordering have the same sparsity pattern, in the sense that, if we consider only entries in the triangular part, i.e. ,

 \textslMij=0⇔\textslRij=0,(i

One direction of the above result is the ‘no fill-in’ property for perfect eliminators, which has been known a long time in the numerical linear algebra literature (see, e.g., references by Rose and by Rose, Tarjan, & Lueker in the exposition of (2), or Theorem 9.1 of Vandenberghe and Andersen (21)). The other direction, is discussed, in relation to monotone transitivity properties of chordal graphs, in Figure 4.1, and in equation (9.7) of Vandenberghe and Andersen (21).

###### Theorem 2 (Johnson and Lundquist (9))

Let be a chordal graph, and let L be defined as in (11) for this graph . Let M and be a given pair of matrices. Assume that , as defined by the right side of equation (11), is defined. The following two statements are equivalent. We have

 \textslM−1subordinate toG

is equivalent to

 \textslM−1=L(\textslM).

One direction of this result is trivial, by definition of our function L. The other direction is a result in Johnson and Lundquist (9) where it is more general, allowing non-symmetric matrices. More examples, exposition, and references can be found in Strang and MacNamara (17).

As a consequence of these theoretical results, the optimization (3) and (4), is equivalent to the following.

Let where is the number of edges in the given chordal graph . Let be a triangular matrix with a sparsity pattern corresponding to the chordal graph , i.e., each entry of corresponds to a particular entry of R. We will let

 \textslM(x)=\textslR(x)⊤\textslR(x).

Let S be the sample covariance matrix from observations. Then we find the numbers in that maximize the objective function

 tr(L(\textslR(x))L(\textslR(x))⊤\textslS)+2lndet(\textslR(x)). (12)

subject to the constraint that

 \textslC=\textsl0 (13)

where the matrix C is defined to be

 \textslC≡\textslML(\textslM)−\textslI.

The fact that we can meet the first constraint of (4) in the original optimization problem simply by optimizing over entries with the same sparsity pattern in a Cholesky factor, , depends on Theorem 1. The fact that we can meet the second constraint of (4) in our original problem by requiring depends on Theorem 2. (There is some redundancy in requiring , since in our application C is symmetric.)

## 3 Numerical Experiments

We provide two experiments, one with simulated data, and the other with financial data obtained from Yahoo Finance.

### 3.1 Simulated Data

In this experiment, we start with the ‘true’ matrix that is given in equation (1). We reproduce that example matrix here for convenience:

We draw

random samples from the multivariate normal distribution with zero mean and with this covariance matrix

, and we form the sample covariance matrix S. We then optimize with Matlab’s fmincon.m. We supply fmincon.m a function handle that it can call to compute the objective function (12) at a given using the local inverse formula. The only constraint is that as in (13), which is also supplied to fmincon.m as a function handle. We now make the following observations.

If is very large then the optimization returns an estimate that is close to the true covariance matrix. Also, the likelihood at the estimate is very close to the likelihood at the true matrix.

However, if is not large then the estimate can be very noticeably different to the true matrix, and the likelihood at the estimate is higher than the likelihood at the true matrix. For example, with samples, in one experiment, the sample covariance matrix is

and the estimate from the optimization procedure is

### 3.2 S&P100 Financial Data

To further showcase the proposed method of estimation, in Figure 1, we considered data based on daily prices of the S&P 100 stocks from 1 January 2019 to 26 March 2021, downloaded from Yahoo Finance. In financial markets, portfolio theory suggests nearly all assets have some correlation with common market factors, so perhaps insisting on zeros in the covariance matrix is only justified if we are looking at ‘residual’ covariance matrices, after conditioning on market factors. Therefore, we calculated the residuals, after regressing the log returns of all stocks in the S&P100 on the S&P100 index. Then we naively estimated the sample covariance matrix of those residuals. For the purpose of demonstration, we chose the subset of the S&P100 index that corresponds to ‘Consumer Discretionary’ (13 stocks) and ‘Information Technology’ (10 stocks), and the corresponding sample covariance matrix. To apply our proposed method of estimation, we also need a sparsity pattern that corresponds to a chordal graph, so we specified the junction tree to have two cliques, corresponding to the ‘Consumer Discretionary’ stocks, and to the ‘Information Technology’ stocks together with Amazon and Tesla, and specified the separator to correspond to the two stocks Amazon and Tesla.

The result of the procedure is displayed in Figure 1. The examples we used earlier to illustrate this special class of matrices, in (1), were in exact arithmetic, so everything was pleasingly exact and algebraic. Here in this application to real data, we obtain our estimates by an optimization procedure, and in finite precision. The norm of the constraint, evaluated at the final estimate M gives a sense of the accuracy, and in this numerical example the optimization terminated with .

Note that in the S&P100 constituents classification, Amazon and Tesla are classified as ‘Consumer Discretionary,’ but arguably these two stocks have more to do with the information technology sector, so visualizing the covariance matrix that we estimate with this sparsity pattern (as in Figure

1), could be useful in exploring the coupling of these two bigger industry groups, via Amazon and Tesla. In this case the right panel of Figure 1 shows the magnitude of the entries in rows corresponding to Amazon and Tesla, and which indicate the strength of the coupling.

Note also that we could not examine such couplings between groups if we were to only allow simply block diagonal matrices as our estimates, so it is important that the methods are more general, and that is one benefit of allowing the class of chordal graphs. All the examples considered here are suggestive of the potential applications of the method we propose. Figure 1: Left: sparsity pattern of the covariance matrix that is estimated by the method proposed in this article. Note this is NOT simply block diagonal. Right: Visualisation of the corresponding correlation matrix. Rows 14 and 15 correspond to Amazon and Tesla, while rows 1-13 correspond to ‘Consumer Discretionary’ stocks, and the remaining rows correspond to ‘Information Technology’ stocks, from the S&P100 data described in the main text.

### 3.3 Discussion

We now highlight a number of issues:

• We have chosen a maximum likelihood framework here merely because it is the simplest way to illustrate our ideas that estimation of doubly sparse matrices is indeed possible. However, note that the algebraic structures and methods for local computations that we describe do not depend on that maximum likelihood framework – so, for example, it should be possible to also use these computational approaches in other frameworks that do not make assumptions about the distribution.

• The choice of norm is likely important (for example, the norm may be preferable to the norm), but that issue is not explored here.

• Instead of prescribing the sparsity pattern, it would be better to estimate the sparsity graph from the data. (For instance, in the finance example, we would prefer not to require any hunch about the significance of Amazon or Tesla.) A crude first approach could be in three separate steps: first use existing methods such as the graphical LASSO to estimate a graph, and then second force the resulting graph to be chordal, and then third and finally apply the suggested method of this paper. But instead of such a crude approach of consecutive but separate estimation problems, it seems more natural to simultaneously estimate the graph together with the matrix estimation problem.

• We have required the matrix and its inverse be subordinate to the same chordal graph, because that it is the simplest first idea to explore. Note that – although we have not displayed such an example pair of matrices in this article – if the edge is present in the graph, then it is possible that the entry of M is nonzero and the entry of is zero, and that both matrices are subordinate to the same chordal graph. It is natural to also consider the generalisation of our problem to the case where the matrix and the inverse are subordinate to different chordal graphs, but we did not explore that generalisation here.

• If the matrix M was truly in the special doubly sparse class, then the matrix would be a fixed point of the special Local Function we defined in (11), composed with itself, i.e. . We have not explored fixed-point iteration algorithms. Nor have we explored the algebraic structure of this special class of matrices from this point of view of the properties of the local function , although The Nullity Theorem gives constraints on ranks of subblocks (18).

• Chordal graphs can be considerably more complicated than only the simplest examples we have illustrated here – see many varied examples of matrix sparsity patterns and corresponding graphs in the references, e.g., Vandenberghe and Andersen (21).

• Any given graph can be ‘forced’ to become a chordal graph by a known process of adding edges, thus allowing the methods of this paper to always be used. However, whether or not such a process is computationally worthwhile depends on the example.

## 4 Conclusion

We proposed a method to estimate a covariance matrix when both the covariance and the precision matrix are sparse (which we called double sparsity). This is a maximum likelihood approach, subject to the double sparsity constraint. This appears to be the first work to estimate such special pairs of covariance and precision matrices. The sparsity patterns we consider are restricted to the class of chordal graphs (also known as decomposable graphs in the graphical models literature). This class includes the banded matrices with banded inverse. Restricting to this class of sparsity pattern allows us to exploit a special algebraic structure – the local inverse formula, as we described – that can make computations faster (and that is computationally more attractive than simply naively imposing the double sparsity constraint during any optimization). For future work, it should be possible to extend these approaches to simultaneously estimate both the sparsity pattern, and the corresponding special pair of covariance matrix and precision matrix.

## References

•  P. J. Bickel and E. Levina (2008) Regularized estimation of large covariance matrices. The Annals of Statistics 36, pp. 199–227. Cited by: §1.
•  J.R.S. Blair and B. Peyton (1993) Introduction to chordal graphs and clique trees. In Graph Theory and Sparse Matrix Computation, Cited by: 2nd item, §2.4, Theorem 1.
•  A.P. Dempster (1972) Covariance selection. Biometrics 28 (1), pp. 157–175. Cited by: §1.
•  H. Dym and I. Gohberg (1981) Extensions of band matrices with band inverses. Linear algebra and its applications 36, pp. 1–24. Cited by: §1.
•  J. Friedman, T. Hastie, and R. Tibshirani (2008) Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 (3), pp. 432–41. Cited by: §1.
•  D. I. Georgescu, N. J. Higham, and G. W. Peters (2018) Explicit solutions to correlation matrix completion problems, with an application to risk management and insurance. Royal Society open science 5 (3), pp. 172348. Cited by: §1, §1.
•  G. Golub and V. Pereyra (2003) Separable nonlinear least squares: the variable projection method and its applications. Inverse problems 19 (2), pp. R1. Cited by: §2.1.
•  R. Grone, C. R. Johnson, E. M. Sá, and H. Wolkowicz (1984) Positive definite completions of partial hermitian matrices. Linear algebra and its applications 58, pp. 109–124. Cited by: §1.
•  C.R. Johnson and M. Lundquist (1998) Local inversion of matrices with sparse inverses. Linear Algebra and Its Applications 277, pp. 33–39. Cited by: §1, §1, §2.1, §2.2, §2.2, §2.4, Theorem 2.
•  S. Lauritzen (1996) Graphical models. Oxford University Press. Cited by: 2nd item, §1, §1, §2.2.
•  O. Ledoit and M. Wolf (2004) Honey, i shrunk the sample covariance matrix. The Journal of Portfolio Management 30 (4), pp. 110–119. Cited by: §1.
•  M. Pourahmadi (2013) High-dimensional covariance estimation. Wiley. Cited by: 3rd item.
•  P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu (2011) High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal of Statistics 5, pp. 935–980. Cited by: §1.
•  A. J. Rothman, E. Levina, and J. Zhu (2010) A new approach to cholesky-based covariance regularization in high dimensions. Biometrika 97, pp. 539–550. Cited by: §1.
•  P. Salemi, E. Song, B. Nelson, and J. Staum (2019) Gaussian markov random fields for discrete optimization via simulation: framework and algorithms. Operations Research 67 (1), pp. 250–266. Cited by: §1.
•  T.P. Speed and H.T. Kiiveri (1986) Gaussian Markov distributions over finite graphs. The Annals of Statistics 14, pp. 138–150. Cited by: 2nd item, §2.2.
•  G. Strang and S. MacNamara (2018) A local inverse formula and a factorization. In Contemporary Computational Mathematics - A Celebration of the 80th Birthday of Ian Sloan, W. H. Dick J. (Ed.), pp. 1109–1126. External Links: Cited by: §1, §1, §2.4.
•  G. Strang and T. Nguyen (2004) The interplay of ranks of submatrices. SIAM review 46 (4), pp. 637–646. Cited by: 1st item, 5th item.
•  G. Strang (2010) Fast transforms: banded matrices with banded inverses. Proceedings of the National Academy of Sciences 107 (28), pp. 12413–12416. Cited by: 1st item.
•  R. Vandebril, M. van Barel, and N. Mastronardi (2007) Matrix computations and semiseparable matrices. Johns Hopkins. Cited by: 1st item.
•  L. Vandenberghe and M. S. Andersen (2015) Chordal graphs and semidefinite optimization. Foundations and Trends in Optimization 1 (4), pp. 241–433. Cited by: §1, §2.4, 6th item, Theorem 1.

## Author Biographies

Shev MacNamara is a Senior Lecturer in the School of Mathematical and Physical Sciences at the University of Technology Sydney. His research interests include advanced stochastic modeling and simulation. Previously, he was a postdoctoral scholar in the Department of Mathematics at MIT, and a postdoctoral scholar in the Mathematical Institute at The University of Oxford. He has been a Fulbright Scholar, and has held a David G. Crighton Fellowship at The University of Cambridge. His webpage is https://www.uts.edu.au/staff/shev.macnamara, and his email is shev.macnamara@uts.edu.au.

Erik Schlögl is Professor and Director of the Quantitative Finance Research Centre, University of Technology Sydney, Broadway, NSW 2007, Australia. He also holds an honorary Professorship at the African Institute for Financial Markets and Risk Management (AIFMRM), University of Cape Town, Rondebosch 7701, South Africa; and an honorary affiliation with the Department of Statistics, Faculty of Science, University of Johannesburg, Auckland Park 2006, South Africa. His email address is Erik.Schlogl@uts.edu.au. His website is https://profiles.uts.edu.au/Erik.Schlogl.

Zdravko Botev

is a Lecturer of Statistics at UNSW Sydney. His research interest include: 1) Monte Carlo variance reduction methods, especially for rare-event probability estimation; 2) nonparametric kernel density estimation, and more recently 3) fast model-selection algorithms for large-scale statistical learning. He is well-known as the inventor of the widely-used method of

kernel density estimation via diffusion, as well as the generalized splitting algorithm for Monte Carlo rare-event simulation and combinatorial counting. His website is https://web.maths.unsw.edu.au/~zdravkobotev/ and email address is botev@unsw.edu.au.