Consider the problem of estimating a covariance matrix and its inverse from an n × p data matrix X whose rows are independently distributed according to the multivariate normal distribution with mean zero and covariance matrix Σ. The maximum likelihood estimator (MLE) of the precision matrix Ω = Σ^{-1} is given by

    Ω̂_MLE ∈ argmin_{Ω ∈ S₊} { ⟨S, Ω⟩ − log det Ω },    (1)

where S₊ denotes the set of all symmetric, positive semi-definite (PSD) p × p matrices, ⟨A, B⟩ = tr(AB) denotes the Frobenius inner product, and S is the sample covariance matrix, defined as

    S = (1/n) XᵀX.    (2)
It is well known that Ω̂_MLE exists if and only if S is nonsingular, in which case Ω̂_MLE = S^{-1}. In particular, in the high-dimensional setting where p > n, the MLE does not exist, since the minimum in (1) is not finite. Slawski and Hein (2015) observed, however, that if the optimizer in (1) is constrained to lie in the set of positive semidefinite matrices with nonpositive off-diagonal entries
, then, with probability one, the optimum is well-defined and attained for all n ≥ 2, regardless of the value of p. Specifically, let

    M_p := { Ω ∈ S₊ : Ω_ij ≤ 0 for all i ≠ j },

and observe that M_p is the convex cone of symmetric M-matrices, an important class of matrices appearing in many contexts (see, e.g., Berman and Plemmons, 1994, Chap. 6). Slawski and Hein (2015) proved that the optimizer

    Ω̂ ∈ argmin_{Ω ∈ M_p} { ⟨S, Ω⟩ − log det Ω }    (3)

exists uniquely as long as, in the observed sample, no two variables are perfectly positively correlated (i.e., S_ij < (S_ii S_jj)^{1/2} for all i ≠ j) and no variable is constant (i.e., S_ii > 0 for all i). Both conditions hold with probability one under the assumed Gaussian model for X, and thus, unlike the unconstrained MLE in (1), the estimator (3) is well-defined even in the high-dimensional regime.
The constrained MLE presents an elegant, tuning-free method for estimating precision matrices which works for all n ≥ 2 and all values of p under the assumption Ω ∈ M_p. Efficient algorithms for computing Ω̂ are given in Slawski and Hein (2015) and Lauritzen, Uhler and Zwiernik (2019). Note that the precision matrix having nonpositive off-diagonal entries is equivalent to all partial correlations being nonnegative (Bølviken, 1982). Examples of practical covariance estimation problems with nonnegative partial correlations abound (see, e.g., Lake and Tenenbaum, 2010; Slawski and Hein, 2015; Agrawal, Roy and Uhler, 2019). More generally, Karlin and Rinott (1983) showed that for the normal distribution the condition that the precision matrix belongs to M_p is equivalent to multivariate total positivity of order two (MTP2). MTP2 is a strong form of positive dependence (Colangelo, Scarsini and Shaked, 2005) that has been widely used in auction theory (Milgrom and Weber, 1982), actuarial sciences (Denuit et al., 2006), and educational evaluation and policy analysis (Chade, Lewis and Smith, 2014).
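To make the estimator concrete in the smallest nontrivial case, the following sketch (our own illustration, not code from the cited references) computes the sign-constrained MLE for p = 2, where the solution is available in closed form: if the off-diagonal entry of S is nonnegative, the sign constraint is inactive and the estimator is S^{-1}; otherwise the off-diagonal of the precision estimate is clamped to zero.

```python
import numpy as np

def constrained_mle_2x2(S):
    """Sign-constrained Gaussian MLE for p = 2.

    Solves min over PSD Omega with Omega_12 <= 0 of <S, Omega> - log det Omega.
    For p = 2 the constraint Omega_12 <= 0 holds for the unconstrained
    solution S^{-1} exactly when S_12 >= 0, so the constraint binds only
    when the sample covariance S_12 is negative.
    """
    if S[0, 1] >= 0:
        return np.linalg.inv(S)                          # constraint inactive
    return np.diag([1.0 / S[0, 0], 1.0 / S[1, 1]])       # off-diagonal clamped

# Case 1: positively correlated sample -> plain inverse.
S_pos = np.array([[2.0, 0.5], [0.5, 1.0]])
O_pos = constrained_mle_2x2(S_pos)

# Case 2: negatively correlated sample -> diagonal estimate.
S_neg = np.array([[2.0, -0.5], [-0.5, 1.0]])
O_neg = constrained_mle_2x2(S_neg)
```

In the second case one can check the optimality conditions directly: the fitted covariance inv(O_neg) matches S on the diagonal and dominates it entrywise, as the dual formulation discussed later requires.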
There is growing interest in M-matrix constrained estimation in the graph signal processing literature (Pavez and Ortega, 2016; Egilmez, Pavez and Ortega, 2017; Pavez, Egilmez and Ortega, 2018), where M-matrices are known as Generalized Graph Laplacians (GGL). Indeed, every graph Laplacian is a diagonally dominant M-matrix, and conversely every M-matrix can be viewed as a generalized graph Laplacian, in the sense that it has a sparse edge-incidence factorization Ω = BBᵀ, where B has at most two nonzero entries per column, whereas positive semidefinite matrices with other sign patterns typically require dense factorizations (Boman et al., 2005). This connection to nonnegatively weighted graphs has led to a host of other application areas in image processing and network analysis.
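The generalized-Laplacian viewpoint is easy to check directly. The sketch below (our own example, with an arbitrary small graph) builds an edge-incidence factor B for a weighted graph plus diagonal slack terms, so that the resulting Ω = BBᵀ is a generalized graph Laplacian and every column of B has at most two nonzero entries.

```python
import numpy as np

# Hypothetical example: a weighted graph on 4 nodes plus diagonal "slack"
# mass, illustrating the factorization Omega = B B^T with at most two
# nonzeros per column of B.
edges = [(0, 1, 2.0), (1, 2, 1.0), (2, 3, 0.5), (0, 3, 1.5)]
slack = np.array([0.3, 0.0, 0.2, 0.1])   # extra diagonal weight per vertex
p = 4

cols = []
for i, j, w in edges:                     # one column per edge: sqrt(w)(e_i - e_j)
    c = np.zeros(p)
    c[i], c[j] = np.sqrt(w), -np.sqrt(w)
    cols.append(c)
for i, d in enumerate(slack):             # one column per vertex: sqrt(d) e_i
    if d > 0:
        c = np.zeros(p)
        c[i] = np.sqrt(d)
        cols.append(c)

B = np.column_stack(cols)
Omega = B @ B.T                           # a generalized graph Laplacian
```

By construction Ω is PSD with nonpositive off-diagonal entries (each edge contributes −w to the corresponding off-diagonal pair), and Ω applied to the all-ones vector returns exactly the slack vector, recovering the ordinary Laplacian when the slack is zero.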
This paper studies the statistical properties of Ω̂ as an estimator of the unknown precision matrix Ω in the high-dimensional regime. Even though Ω̂ exists uniquely for all n ≥ 2 regardless of the value of p, rigorous results have not yet been proved for the accuracy of Ω̂ in the high-dimensional regime. In the classical low-dimensional asymptotic regime where p is fixed and n → ∞, Slawski and Hein (2015) apply standard results for M-estimators to show consistency of Ω̂. More recently, Lauritzen, Uhler and Zwiernik (2019) provide an elegant perspective on Ω̂ and a bound on its support graph, and Wang, Roy and Uhler (2019) develop a consistent estimator of the support graph.
The study of consistency and optimality properties of Ω̂ requires fixing an appropriate loss function. Because Ω̂ is defined via maximum likelihood, it is natural to work with the Stein loss:

    L_S(Ω̂, Ω) = (1/p) { tr(Ω̂ Ω^{-1}) − log det(Ω̂ Ω^{-1}) − p },

which, up to scaling by p/2, is the Kullback-Leibler divergence between multivariate mean-zero normal distributions with precision matrices Ω and Ω̂ respectively. The Stein loss has a long history of application in covariance matrix estimation (James and Stein, 1961; Stein, 1975, 1986; Dey and Srinivasan, 1985; Ledoit and Wolf, 2018; Donoho, Gavish and Johnstone, 2018). In this paper, we work with the symmetrized Stein loss (alternatively known as the divergence loss), defined as

    L(Ω̂, Ω) = (1/(2p)) { tr(Ω̂ Ω^{-1}) + tr(Ω Ω̂^{-1}) − 2p },    (5)

for positive definite Ω̂ and Ω. Note that L is symmetric in its two arguments and, up to a constant factor, dominates both the Stein loss and the reversed Stein loss (which is also known as the entropy loss). Properties of L are further discussed in Section 2.
We use the scaling 1/p in the loss function (5) because, as explained by Ledoit and Wolf (2018), this is necessary for consistency in the high-dimensional regime where the number of variables p may be much larger than the sample size n. Indeed, in the simple case where Σ is known to be diagonal, the natural estimator is the diagonal matrix with diagonal entries S_11, …, S_pp (where S is the sample covariance matrix defined in (2)). It is easy to see that the unscaled loss (without the 1/p factor) is of the order p/n, which will be far from zero in the high-dimensional regime where p ≳ n.
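The role of the 1/p factor can be seen in a quick simulation (our own illustration; the sizes n = 50, p = 500 are arbitrary): for the diagonal estimator above with Σ equal to the identity, the sum of the per-variable symmetrized Stein terms concentrates around p/n, while dividing by p brings the loss near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 500                       # arbitrary sizes with p >> n

X = rng.standard_normal((n, p))      # rows iid N(0, I_p)
s = (X ** 2).mean(axis=0)            # diagonal of the sample covariance

# Per-variable symmetrized Stein terms for the diagonal estimator diag(s)
# against the truth Sigma = I_p: (s_ii + 1/s_ii - 2)/2, each of order 1/n.
terms = (s + 1.0 / s - 2.0) / 2.0
unscaled = terms.sum()               # order p/n: far from zero when p >> n
scaled = unscaled / p                # order 1/n: close to zero
```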
We present results on the performance of Ω̂ in the symmetrized Stein loss in Section 2. Our main result in Theorem 1 implies that L(Ω̂, Ω) converges to zero as long as (log p)/n → 0. This implies high-dimensional consistency of Ω̂. Moreover, the rate of convergence is ((log p)/n)^{1/2}, which we prove in Theorem 2 is optimal in the minimax sense. Thus Ω̂ is minimax optimal in the high-dimensional regime under the symmetrized Stein loss. Our results provide rigorous support for the assertion that the nonpositive off-diagonal constraint provides strong implicit regularization in the high-dimensional regime. In Theorem 4, we also lower bound the loss L(Ω̂, Ω), which implies that the rate ((log p)/n)^{1/2} is not an artifact of our analysis even when the true precision matrix is diagonal.
High-dimensional consistency with the rate ((log p)/n)^{1/2} has appeared previously in many papers on covariance and precision matrix estimation—see for instance Rothman et al. (2008); Yuan (2010); Ravikumar et al. (2011); Cai, Liu and Luo (2011); Sun and Zhang (2013) and Cai, Ren and Zhou (2016) for a review of rates in structured covariance estimation. Most of these results are for estimators that use explicit regularizers (such as the ℓ1 penalty in the Graphical Lasso; Banerjee, El Ghaoui and d’Aspremont, 2008; Friedman, Hastie and Tibshirani, 2008; Mazumder and Hastie, 2012), which is crucially exploited by the proof techniques and assumptions employed in these papers. By contrast, the regularization induced by the assumption Ω ∈ M_p is implicit, and we consequently use different arguments relying on careful use of the KKT conditions underlying the optimization (3). Our analysis identifies a bound relating the entries of an M-matrix to its spectrum, providing new insight into the simplifying structure of the convex cone M_p.
The symmetrized Stein loss has the additional symmetry property of invariance under inversion: L(Ω̂, Ω) = L(Σ̂, Σ), where Σ̂ = Ω̂^{-1}. This means that Σ̂ is also a high-dimensionally consistent estimator of Σ. The choice of the loss function is quite crucial here. In Section 3, using the Perron-Frobenius theorem and a careful analysis of the entry-wise positive part of the sample covariance, we prove a negative result which shows that, for the maximum eigenvalue, Σ̂ can be much worse as an estimator of Σ than the sample covariance matrix S. This result indicates that enforcing the sign constraints can exacerbate bias in the estimation of the top eigenvalue.
The paper is organized as follows: Section 2 contains our main results establishing optimality of Ω̂; Section 3 establishes suboptimality under the spectral norm; and Section 4 contains a discussion touching upon some related issues, including misspecification (where Ω ∉ M_p), estimation of correlation matrices, and connections to shape-restricted regression. Finally, Section 5 contains proofs of all the results of the paper.
2 Symmetrized Stein Loss: Consistency and Optimality
This section contains our results on the high-dimensional consistency and optimality of Ω̂ under the symmetrized Stein loss L defined in (5). We start by describing some basic properties of L.
Recall that the Stein loss is proportional to the Kullback-Leibler (KL) divergence between centered multivariate Gaussian distributions with the corresponding precision matrices. It is well known that the KL divergence is not symmetric. When the inputs to the divergence are reversed, the resulting Bregman divergence is also known as the entropy loss. The sum of these two loss functions dominates each, and conveniently does not directly involve any determinants. Following Ledoit and Wolf (2018), we define L to be the average of the two loss functions. Commonly known as the symmetrized Stein loss or divergence loss, L is equal to the Jeffreys (1946) divergence between two centered multivariate Gaussian distributions, divided by p. Definition (5) entails a number of useful and important properties for the symmetrized Stein loss:
(Nonnegativity) For positive definite matrices A and B, L(A, B) ≥ 0, with equality if and only if A = B.
(Invariance under inversion) L(A, B) = L(A^{-1}, B^{-1}).
(Invariance under congruent transformations) For all nonsingular p × p matrices M, we have the scale-invariance property: L(M A Mᵀ, M B Mᵀ) = L(A, B).    (6)
The symmetrized Stein loss thus induces a natural geometry on the space of PSD matrices—see Moakher and Batchelor (2006) for a review and comparison to other geometries. We emphasize that the triangle inequality fails to hold for both L and its square root. As a loss, L treats the dual problems of estimating the covariance matrix and the precision matrix equally. It can also be shown that the symmetrized Stein loss is equivalent to the squared Frobenius norm when the input matrices have spectra bounded away from zero and infinity.
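The invariance properties above are easy to verify numerically. The sketch below (our own check, with arbitrary random positive definite inputs) implements L from (5) and confirms invariance under inversion and under congruent transformations.

```python
import numpy as np

def sym_stein_loss(A, B):
    """Symmetrized Stein loss (5): (tr(A B^-1) + tr(B A^-1) - 2p) / (2p)."""
    p = A.shape[0]
    return (np.trace(A @ np.linalg.inv(B))
            + np.trace(B @ np.linalg.inv(A)) - 2 * p) / (2 * p)

rng = np.random.default_rng(1)
p = 5
G, H = rng.standard_normal((p, p)), rng.standard_normal((p, p))
A = G @ G.T + p * np.eye(p)          # arbitrary positive definite input
B = H @ H.T + p * np.eye(p)          # arbitrary positive definite input
M = rng.standard_normal((p, p)) + p * np.eye(p)   # nonsingular transform

loss = sym_stein_loss(A, B)
loss_inv = sym_stein_loss(np.linalg.inv(A), np.linalg.inv(B))
loss_cong = sym_stein_loss(M @ A @ M.T, M @ B @ M.T)
```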
In terms of the eigenvalues λ_1, …, λ_p of Ω^{-1/2} Ω̂ Ω^{-1/2}, the symmetrized Stein loss is simply the goodness-of-fit measure

    L(Ω̂, Ω) = (1/p) Σ_{i=1}^p (λ_i + 1/λ_i − 2)/2.    (7)

This alternative representation provides further insight into the normalization of the loss (5) with a factor of 1/p. The symmetrized Stein loss is the expectation of the function λ ↦ (λ + 1/λ − 2)/2 with respect to the empirical spectral distribution of Ω^{-1/2} Ω̂ Ω^{-1/2}. This expectation measures how far the spectrum of Ω^{-1/2} Ω̂ Ω^{-1/2} deviates from a point mass at one, which is the spectrum of the identity I_p. In asymptotic settings where p → ∞ as n → ∞, a natural consistency criterion checks whether this expectation converges to zero.
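The eigenvalue representation (7) can be verified directly. In this sketch (our own, with arbitrary random inputs), the trace form of the loss is compared with the average of g(λ) = (λ + 1/λ − 2)/2 over the relevant spectrum.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 6
G, H = rng.standard_normal((p, p)), rng.standard_normal((p, p))
Omega = G @ G.T + p * np.eye(p)          # arbitrary "truth"
Omega_hat = H @ H.T + p * np.eye(p)      # arbitrary "estimate"

# Trace form of the symmetrized Stein loss (5).
trace_form = (np.trace(Omega_hat @ np.linalg.inv(Omega))
              + np.trace(Omega @ np.linalg.inv(Omega_hat)) - 2 * p) / (2 * p)

# Eigenvalue form (7): average of (lam + 1/lam - 2)/2 over the spectrum of
# Omega^{-1/2} Omega_hat Omega^{-1/2}.
w, V = np.linalg.eigh(Omega)
Omega_isqrt = V @ np.diag(w ** -0.5) @ V.T
lam = np.linalg.eigvalsh(Omega_isqrt @ Omega_hat @ Omega_isqrt)
eig_form = np.mean((lam + 1.0 / lam - 2.0) / 2.0)
```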
Our analysis of the symmetrized Stein loss involves the maximum population correlation between any two variables:

    ρ(Σ) := max_{i ≠ j} Σ_ij / (Σ_ii Σ_jj)^{1/2}.

We assume that the above quantity is strictly less than 1, which is clearly necessary for Σ to be nonsingular, i.e., for Ω to exist. Our bound on the loss will involve a quantity depending on Σ only through ρ(Σ):
It is natural for this quantity to enter the analysis in light of the existence result of Slawski and Hein (2015), which states that the maximum sample correlation must be less than one in order for the estimator Ω̂ to be well-defined. Because it is defined in terms of population correlations, the quantity is scale-invariant. Note that the loss L also has this scale-invariance property (see (6)).
Theorem 1. Let S denote the sample covariance matrix based on a data matrix X with i.i.d. N(0, Σ) rows, where Ω = Σ^{-1} ∈ M_p. Then the MLE Ω̂ defined in (3) satisfies the upper bound (9), of order ((log p)/n)^{1/2}, with high probability; the constants appearing in the bound are universal positive constants.
Theorem 1 states that Ω̂ is high-dimensionally consistent in the symmetrized Stein loss as long as (log p)/n → 0. We prove Theorem 1 in Section 5, deriving a basic inequality from the first order optimality conditions for (3) and showing that concentration of the intrinsic noise ‖S − Σ‖_∞ is sufficient to control the basic inequality. Crucially, we use the fact that every M-matrix is, up to diagonal scaling, equivalent to a diagonally dominant matrix (see Berman and Plemmons, 1994, Chap. 6).
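The diagonal-rescaling fact can be illustrated numerically: for a nonsingular M-matrix Ω, the vector x = Ω^{-1}1 is entrywise positive (the inverse of an M-matrix is entrywise nonnegative), and Ωx = 1 > 0, so scaling the columns by D = diag(x) produces a row-wise diagonally dominant matrix ΩD. The sketch below (our own illustration on an arbitrary tridiagonal M-matrix) checks this.

```python
import numpy as np

p = 6
# Arbitrary nonsingular tridiagonal M-matrix: positive diagonal,
# nonpositive off-diagonal entries.
Omega = 2.5 * np.eye(p) - np.diag(np.ones(p - 1), 1) - np.diag(np.ones(p - 1), -1)

x = np.linalg.solve(Omega, np.ones(p))   # x = Omega^{-1} 1, entrywise positive
D = np.diag(x)
A = Omega @ D                             # candidate diagonally dominant matrix

# Row-wise dominance margin: A_ii - sum_{j != i} |A_ij|.  Since the
# off-diagonal entries of A are nonpositive, this equals the row sum of A,
# which is (Omega x)_i = 1 > 0 for every row.
row_margin = np.diag(A) - np.abs(A - np.diag(np.diag(A))).sum(axis=1)
```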
We emphasize that the result holds without additional assumptions on the underlying precision matrix, such as sparsity. Consistency in the symmetrized Stein loss is a strong guarantee compared to the recent literature on optimal shrinkage of the sample covariance under high-dimensional asymptotics (Donoho, Gavish and Johnstone, 2018; Ledoit and Wolf, 2018), where the symmetrized Stein loss converges to a nonzero limit under the asymptotic regime p/n → γ ∈ (0, ∞) as n → ∞. By contrast, for the constrained MLE Ω̂ the loss converges in probability to zero whenever (log p)/n → 0.
Since the upper bound (9) depends on the true precision matrix only through the maximum population correlation, Theorem 1 in fact bounds the worst case risk in the divergence loss over all M-matrices with this correlation bounded away from one. It is natural to ask whether the rate ((log p)/n)^{1/2} is improvable. Our next result shows that, in the high-dimensional setting where p grows superlinearly in n, the minimax rate over the class of M-matrices with maximum correlation bounded away from one matches the rate from Theorem 1.
Theorem 2. Let X have i.i.d. N(0, Σ) rows, and suppose the number of variables p grows superlinearly in n. Then the minimax risk in the symmetrized Stein loss over M-matrices with maximum correlation bounded away from one is lower bounded by a constant multiple of ((log p)/n)^{1/2}; the constants involved are universal, apart from one constant depending only on the correlation bound.
Paired with Theorem 1, this result implies that Ω̂ is minimax optimal in the symmetrized Stein loss over M-matrices with correlations bounded away from one. Our proof adapts the construction of Cai, Liu and Zhou (2016), Theorem 4.1, which lower bounds the minimax risk in the spectral norm over a parameter set of sparse precision matrices formed by perturbing the identity by a scaled adjacency matrix A, where the scale depends on the problem parameters n and p. A key aspect of this approach is to allow for different perturbations over the rows and columns of A, in order to recover the log p factor in the rate (Kim, 2020).
The M-matrix constraint provides implicit regularization and is crucial for achieving the minimax rate ((log p)/n)^{1/2}. If this constraint is dropped, it is impossible for any estimator to achieve a rate better than p/n. This follows from the next result, where we prove a minimax lower bound of order p/n for the loss L over the entire class of positive semidefinite matrices when p ≤ n. On the other hand, M_p is much larger than the class of diagonal matrices, because the minimax rate of estimation over the class of positive diagonal matrices in the loss L is of order 1/n (this is also proved in the next result). In summary, the class of M-matrices acts as a strong high-dimensional regularizer while being considerably larger than the class of all positive diagonal matrices.
Theorem 3. Fix n and p. The minimax risk in the symmetrized Stein loss over positive diagonal precision matrices is of order 1/n. The minimax risk in the symmetrized Stein loss over all PSD precision matrices, when p ≤ n, is at least of order p/n.
Theorem 2 implies that the rate ((log p)/n)^{1/2} of Theorem 1 cannot be improved in the worst case over the entire class M_p. In the next result, we prove that the rate for Ω̂ cannot be improved even when the truth lies in the class of positive diagonal matrices. In other words, this shows that Ω̂ does not adapt to the faster minimax rate over diagonal matrices.
Theorem 4. Suppose Ω is a positive diagonal matrix. Then L(Ω̂, Ω) is bounded below by a constant multiple of ((log p)/n)^{1/2} with high probability, where the constants involved are universal positive constants.
3 Spectral Norm: Suboptimality
In this section, we prove a negative result which implies that Σ̂ and Ω̂ can be suboptimal for estimating spectral quantities of Σ and Ω respectively. Consider the case Σ = I_p, and consider estimation of the top eigenvalue λ_max(Σ) = 1. The performance of the sample covariance matrix S is well understood. Indeed, in the asymptotic setting p/n → γ ∈ (0, ∞), Geman (1980) proved that

    λ_max(S) → (1 + √γ)²

in probability as n → ∞. This implies that λ_max(S) is inconsistent for the estimation of λ_max(Σ) when p/n converges to a positive constant. Our next result proves that λ_max(Σ̂) is also inconsistent for estimating λ_max(Σ) and, more interestingly, its performance is substantially worse compared to λ_max(S): in the same asymptotic setting where p/n → γ, λ_max(Σ̂) converges in probability to a strictly larger limit as n → ∞. Thus the introduction of the sign constraints makes the resulting covariance matrix estimator Σ̂ much worse than S for estimating the principal eigenvalue. This should be contrasted with the high-dimensional minimax optimality results from the previous section in the symmetrized Stein loss.
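A small simulation (our own illustration; n = p = 200, a single draw) shows both phenomena at once: the top eigenvalue of S lands near the Geman limit (1 + √γ)² = 4, while the top eigenvalue of the entry-wise positive part of S, the object analyzed in the proof of Theorem 5, is markedly larger.

```python
import numpy as np

rng = np.random.default_rng(3)
n = p = 200                          # gamma = p/n = 1, so (1 + sqrt(gamma))^2 = 4

X = rng.standard_normal((n, p))      # rows iid N(0, I_p)
S = X.T @ X / n

lam_S = np.linalg.eigvalsh(S)[-1]    # near the Geman limit 4

# Entry-wise positive part of S: negative off-diagonal noise is clipped to
# zero (the diagonal entries are already positive).  By Perron-Frobenius,
# the top eigenvalue of this nonnegative matrix tracks its large row sums.
S_plus = np.where(S > 0, S, 0.0)
lam_plus = np.linalg.eigvalsh(S_plus)[-1]
```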
Theorem 5. Suppose Σ = I_p and p/n → γ ∈ (0, ∞). Then the stated lower bound on λ_max(Σ̂) holds with high probability, for some universal positive constants.
The proof starts from the dual formulation of (3) (see Lauritzen, Uhler and Zwiernik, 2019), in which Σ̂ maximizes log det Σ subject to diag(Σ) = diag(S) and Σ ≥ S,    (16) where the second constraint is an entry-wise inequality. This fact and the well-known observation that the inverse of an M-matrix is entry-wise nonnegative (see Berman and Plemmons, 1994, Chap. 6) together imply that Σ̂_ij ≥ max(S_ij, 0) for all i, j. This allows us to prove Theorem 5 by a careful analysis of the entry-wise positive part matrix of S.
Theorem 5 implies minimax suboptimality of Σ̂ in the spectral norm. To see this, note that the sample covariance satisfies a worst case risk bound in the spectral norm of strictly smaller order than the lower bound of Theorem 5. Hence Σ̂ is minimax suboptimal in the spectral norm for most choices of n and p.
Theorem 5 also implies inconsistency in the spectral norm for the precision matrix. Since Ω = Σ^{-1} = I_p and λ_min(Ω̂) = 1/λ_max(Σ̂), the lower bound on λ_max(Σ̂) yields an upper bound on λ_min(Ω̂) that holds with high probability. As γ grows, this upper bound approaches zero: the minimum eigenvalue of Ω̂ poorly estimates that of Ω. We record this as a separate corollary.
Corollary 6. Suppose Σ = I_p and p/n → γ ∈ (0, ∞). Then λ_min(Ω̂) falls below the stated threshold with high probability. Hence, Ω̂ is inconsistent in the spectral norm as n, p → ∞.
4 Discussion

In this paper, we establish the possibility of tuning-free estimation of a large precision matrix based only on the knowledge that it is an M-matrix, i.e., that it has nonpositive off-diagonal entries. Our main contribution is to identify a loss—namely, the symmetrized Stein loss—in which Ω̂ is both high-dimensionally consistent and minimax optimal. As the form (7) for the symmetrized Stein loss suggests, the quantity L(Ω̂, Ω) is an average measure of closeness across all of the eigenvalues. The estimator is inadequate, however, for estimating the extreme eigenvalues when p is large relative to n, and our other main result establishes that Σ̂ is minimax suboptimal in the spectral norm, even relative to the usual sample covariance matrix S. For the remainder of this section, we discuss some aspects that are naturally connected to our main results.
Misspecification. In practice, the assumption that all partial correlations are nonnegative may not hold exactly. Slawski and Hein (2015) empirically evaluate the impact of misspecification on the estimator Ω̂, defining the attractive part Ω* of the population precision matrix as the population analogue of the Bregman projection (3), with the sample covariance S replaced by the population covariance Σ. Under the symmetrized Stein loss, a straightforward extension of Theorem 1 shows that Ω̂ targets the attractive part Ω* even under misspecification.
Theorem 7. Let S denote the sample covariance based on a data matrix X with i.i.d. N(0, Σ) rows. Define the attractive part of the model as

    Ω* := argmin_{Ω′ ∈ M_p} { ⟨Σ, Ω′⟩ − log det Ω′ }.

Then the MLE Ω̂ defined in (3) satisfies the analogue of the bound (9), with Ω replaced by Ω*, with the same probability guarantee and universal positive constants as in Theorem 1.
Estimating the correlation matrix. One may also be interested, under the same nonnegative partial correlations assumption, in estimating the population correlation matrix R = D^{-1/2} Σ D^{-1/2} and its inverse (here D denotes the diagonal matrix whose diagonal is equal to that of Σ). It is natural to use the sample correlation matrix in place of S in (3) to estimate R^{-1}. One can check that the resulting estimator coincides with the appropriately rescaled Ω̂, because the optimization problem is equivariant with respect to diagonal scaling (see Lauritzen, Uhler and Zwiernik, 2019, Lemma 2.5). The high-dimensional consistency result of Theorem 1 also holds for this estimator of the inverse correlation matrix R^{-1}. This follows from an argument analogous to the proof of Theorem 1, with the tail bound for ‖S − Σ‖_∞ replaced by the corresponding tail bound on the maximum deviation of the sample correlations (see, e.g., Sun and Zhang, 2013, Lemma 19).
Non-Gaussian observations. We state Theorems 1 and 7 under the Gaussian assumption for simplicity and to remain consistent with the other results in this paper. In general, the upper bound depends on the tail behavior of ‖S − Σ‖_∞—see Lemma 8. A similar result holds when the rows of X are i.i.d. with sub-Gaussian components. As Ravikumar et al. (2011) note, estimators of the form (3) are motivated via maximum likelihood yet remain sensible for non-Gaussian X. For general X, the estimator Ω̂ is motivated as a Bregman projection of S with respect to the Stein loss.
Modifying Ω̂. Although we focus on properties of the tuning-free estimator Ω̂, additional processing, such as thresholding or pre-processing the sample covariance, may produce an estimator that is high-dimensionally consistent in the spectral norm. The tuning-free covariance estimate Σ̂ may also prove more useful for spectral analysis when the true covariance is a dense matrix. For instance, in the equicorrelation model where Σ has unit diagonal and every off-diagonal entry equal to a common positive value, the entry-wise inequalities in (16) may introduce less bias.
Related problems. Karlin and Rinott (1983), who pioneered the connection between M-matrices and MTP2, also considered repulsive models where the covariance matrix has nonpositive off-diagonal entries, in which case all marginal and partial correlations are nonpositive. This also defines an interesting model class which may similarly simplify estimation in high-dimensional problems. Note, however, that the constraint set of symmetric inverse-M matrices is non-convex, presenting potential difficulties for maximum likelihood estimation.
Connection to shape-restricted regression. As a subset of the symmetric positive semidefinite matrices, the M-matrices form a closed convex cone determined only by sign constraints. The sign constraints on the precision matrix are analogous to a shape constraint in shape-restricted regression, enabling the use of likelihood techniques without explicit regularization. In particular, one can define the Bregman projection of any positive definite matrix onto M_p (Slawski and Hein, 2015; Lauritzen, Uhler and Zwiernik, 2019). This work thus represents a first foray into the study of shape constraints for high-dimensional precision matrix estimation, inspired by results on regularization-free prediction in high-dimensional linear models via nonnegative least squares (Slawski and Hein, 2013). See Groeneboom and Jongbloed (2014) for a general introduction to shape-restricted regression and Guntuboyina and Sen (2018) for a recent survey with a focus on risk bounds.
5 Proofs

5.1 Proofs of Theorems 1 and 7

We first introduce two lemmas needed in the proof of Theorem 1. Following previous results on sparse precision matrix estimation (see, e.g., Cai, Liu and Luo, 2011; Ravikumar et al., 2011; Sun and Zhang, 2013), we rely on concentration of the entry-wise maximum deviation ‖S − Σ‖_∞ in the high-dimensional regime. A key technical tool in our analysis is the following lemma, which follows from an application of Bernstein’s inequality.
Lemma 8 (Jankova and van de Geer, 2015, Lemma 6). Suppose X has i.i.d. N(0, Σ) rows and let S = XᵀX/n. For any t > 0, the probability that ‖S − Σ‖_∞ exceeds t decays exponentially in nt², up to a factor polynomial in p.
The next lemma records a distinctive property of M-matrices, corresponding to the fact that M-matrices are generalized diagonally dominant (Plemmons, 1977).

Lemma 9. Every M-matrix Ω satisfies ‖Ω‖_1 ≤ 2 tr(Ω), where ‖Ω‖_1 = Σ_{i,j} |Ω_ij| denotes the element-wise ℓ1-norm.
Proof. Since Ω is symmetric PSD, there are vectors v_1, …, v_p such that Ω_ij = ⟨v_i, v_j⟩. Moreover, since Ω has nonpositive off-diagonal entries, ⟨v_i, v_j⟩ ≤ 0 for i ≠ j. Hence

    0 ≤ ‖v_1 + ⋯ + v_p‖² = Σ_i ‖v_i‖² + Σ_{i≠j} ⟨v_i, v_j⟩ = tr(Ω) − Σ_{i≠j} |Ω_ij|,

so that ‖Ω‖_1 = tr(Ω) + Σ_{i≠j} |Ω_ij| ≤ 2 tr(Ω). ∎
An illustrative example is the one-parameter family of symmetric matrices M(a) = (1 − a) I_p + a 11ᵀ (where 1 is the all-ones vector) with unit diagonal and every off-diagonal entry equal to a. Its eigenvalues are 1 − a (with multiplicity p − 1) and 1 + (p − 1)a. Thus M(a) is PSD if and only if −1/(p − 1) ≤ a ≤ 1, whereas M(a) is an M-matrix if and only if −1/(p − 1) ≤ a ≤ 0. Finally, note that tr M(a) = p and ‖M(a)‖_1 = p + p(p − 1)|a| for a ≤ 0. Taking a = −1/(p − 1) shows that Lemma 9 is tight. For general PSD matrices, the element-wise ℓ1-norm can be as large as p times the trace, but for M-matrices it can be at most twice the trace.
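A quick numerical check of this example and of Lemma 9 (our own sketch, with an arbitrary dimension p = 8):

```python
import numpy as np

p = 8
ones = np.ones((p, p))

def M(a):
    """Equicorrelation matrix: unit diagonal, every off-diagonal entry a."""
    return (1 - a) * np.eye(p) + a * ones

# Boundary of both the PSD and the M-matrix ranges: a = -1/(p-1).
a_edge = -1.0 / (p - 1)
evals = np.linalg.eigvalsh(M(a_edge))      # 1 - a (p-1 times) and 1 + (p-1)a

# Lemma 9 is tight here: elementwise l1-norm equals twice the trace.
tightness = np.abs(M(a_edge)).sum() / np.trace(M(a_edge))

# For comparison, the PSD matrix 11^T attains the worst-case factor p.
psd_ratio = np.abs(ones).sum() / np.trace(ones)
```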
We are now ready to prove Theorem 1.
Proof of Theorem 1.
For any positive diagonal matrix D,

    L(Ω̂(S), Ω) = L(D Ω̂(S) D, D Ω D) = L(Ω̂(D^{-1} S D^{-1}), D Ω D),

where the first step uses the fact that L is invariant under congruent transformations (6), and the second step uses the scale equivariance of the program (3): Ω̂(D^{-1} S D^{-1}) = D Ω̂(S) D. With a sample covariance based on Gaussian observations, the loss therefore has the same distribution for covariance matrices of the form D^{-1} Σ D^{-1}. In particular, taking D equal to the square root of the diagonal of Σ, we may assume without loss of generality that Σ is normalized; i.e., Σ has unit diagonal or equivalently equals the population correlation matrix.
Let f(Ω′) := ⟨S, Ω′⟩ − log det Ω′. Since the estimator Ω̂ solves the constrained convex optimization problem min_{Ω′ ∈ M_p} f(Ω′), it is characterized by the first order optimality condition ⟨∇f(Ω̂), Ω′ − Ω̂⟩ ≥ 0 for all Ω′ ∈ M_p, where ∇f(Ω̂) = S − Ω̂^{-1}. Hence, taking Ω′ = Ω,

    ⟨S − Ω̂^{-1}, Ω − Ω̂⟩ ≥ 0.

Rearranging, and using the identity 2p L(Ω̂, Ω) = ⟨Ω̂ − Ω, Ω^{-1} − Ω̂^{-1}⟩, yields the basic inequality

    2p L(Ω̂, Ω) ≤ ⟨Ω̂ − Ω, Σ − S⟩.
Let Δ := Ω̂ − Ω. Using Hölder’s inequality, we have:

    ⟨Δ, Σ − S⟩ ≤ ‖S − Σ‖_∞ ‖Δ‖_1.

Now applying the triangle inequality and Lemma 9 to the element-wise ℓ1-norm,

    ‖Δ‖_1 ≤ ‖Ω̂‖_1 + ‖Ω‖_1 ≤ 2 tr(Ω̂) + 2 tr(Ω).
Since we have assumed without loss of generality that Σ has unit diagonal, the traces tr(Ω̂) and tr(Ω) can be controlled, where we have again used Lemma 9, along with the facts that Ω̂_ij and Ω_ij are nonpositive for i ≠ j and that Σ̂ and Σ are entry-wise nonnegative (see Berman and Plemmons, 1994, Chap. 6). Combining the last three displays and using the characterization in (8), we get a self-bounding inequality for L(Ω̂, Ω).
On the event that ‖S − Σ‖_∞ is at most a suitable threshold of order ((log p)/n)^{1/2}, the self-bounding inequality can be solved. Applying Lemma 8 with this choice of threshold, the event occurs with the probability stated in Theorem 1, and a short calculation yields

    L(Ω̂, Ω) ≤ C ((log p)/n)^{1/2} + C (log p)/n

with the stated probability. Since log p ≤ n, the ((log p)/n)^{1/2} term dominates the (log p)/n term, and the bound (9) follows. ∎
Proof of Theorem 7.
Since the attractive part Ω* is an M-matrix, the first order optimality conditions for (3) give

    ⟨S − Ω̂^{-1}, Ω* − Ω̂⟩ ≥ 0.

Using the first order optimality conditions for Ω* (the population program with Σ in place of S) and the fact that Ω̂ ∈ M_p,

    ⟨Σ − (Ω*)^{-1}, Ω̂ − Ω*⟩ ≥ 0.

Adding these and rearranging yields the basic inequality

    2p L(Ω̂, Ω*) ≤ ⟨Ω̂ − Ω*, Σ − S⟩.

The remainder of the proof proceeds as the proof of Theorem 1, substituting Ω with Ω*. ∎
5.2 Proof of Theorem 2
Proof of Theorem 2.
As in Cai, Liu and Zhou (2016, Proof of Theorem 4.1), we consider precision matrices of the form

    Ω = I_p − εA,

where A is a sparse binary matrix with k nonzero entries per row and at most k nonzero entries per column, for some positive integer k and some ε > 0 to be chosen later. As long as εk < 1, the matrix Ω is a diagonally dominant M-matrix. Its inverse is given by the Neumann series

    Ω^{-1} = Σ_{m=0}^∞ ε^m A^m.
From the last display, it is clear that Ω^{-1} is entry-wise nonnegative, so the maximum correlation of the correlation matrix corresponding to Σ = Ω^{-1} can be controlled. Furthermore, by the triangle inequality, the largest off-diagonal entry of the top left diagonal block is at most
where we use that the first term has zero off-diagonal. This yields
By similar bounds on the other blocks of , it can be shown that
A simple sufficient condition to guarantee the required correlation bound is thus a suitable restriction on ε and k, so it suffices to lower bound the minimax rate in the Frobenius norm.
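The Neumann expansion used above is easy to sanity-check numerically. The sketch below (our own, with an arbitrary small symmetric adjacency matrix, k = 2 and ε = 0.2 so that εk < 1) compares a truncated series with the exact inverse and confirms that the inverse is entry-wise nonnegative.

```python
import numpy as np

p, eps = 6, 0.2                           # eps * k = 0.4 < 1 with k = 2 below
# Arbitrary symmetric binary adjacency with k = 2 ones per row (a ring).
A = np.zeros((p, p))
for i in range(p):
    A[i, (i + 1) % p] = 1.0
    A[(i + 1) % p, i] = 1.0

Omega = np.eye(p) - eps * A               # diagonally dominant M-matrix
exact = np.linalg.inv(Omega)

# Truncated Neumann series: Omega^{-1} = sum_m (eps A)^m.
approx = np.zeros((p, p))
term = np.eye(p)
for _ in range(50):                       # geometric tail (0.4)^50 is negligible
    approx += term
    term = term @ (eps * A)
```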
Now let denote the set of all binary matrices with nonzero entries per row and at most nonzero entries per column, and . Finally, let denote a vector of ones of length . Given and , the matrix has the same shape as , where the row is nonzero if and only if . Let
As we have shown, . By (Cai and Zhou, 2012, Lemma 3)
where denotes the Hamming distance and denotes the total variation affinity between the measures and , where is the uniform mixture over