 # A nonparametric empirical Bayes approach to covariance matrix estimation

We propose an empirical Bayes method to estimate high-dimensional covariance matrices. Our procedure centers on vectorizing the covariance matrix and treating matrix estimation as a vector estimation problem. Drawing from the compound decision theory literature, we introduce a new class of decision rules that generalizes several existing procedures. We then use a nonparametric empirical Bayes g-modeling approach to estimate the oracle optimal rule in that class. This allows us to let the data itself determine how best to shrink the estimator, rather than shrinking in a pre-determined direction such as toward a diagonal matrix. Simulation results and a gene expression network analysis show that our approach can outperform a number of state-of-the-art proposals in a wide range of settings, sometimes substantially.


## 1 Introduction

Covariance matrix estimation is a fundamental statistical problem that plays an essential role in various applications, such as portfolio management (Ledoit and Wolf, 2003), genomics (Schäfer and Strimmer, 2005), and array signal processing (Abramovich et al., 2001). However, in modern problems the number of features can be of the same order as or exceed the sample size, and the standard sample covariance matrix estimator behaves poorly in this regime. To overcome these issues, various methods have been developed to estimate high-dimensional covariance matrices. These can roughly be divided into two groups, according to whether they impose assumptions on the structure of the population covariance matrix.

Structured methods make structural assumptions about the population covariance matrix. One class models the population covariance matrix as sparse. The most common approach to this problem is thresholding (Rothman et al., 2009; Cai and Liu, 2011). Penalized likelihood methods (Xue et al., 2012) can also estimate large-scale sparse covariance matrices by penalizing a log-likelihood function. Another class of methods assumes the data arise from a factor model (Fan et al., 2008), so that the covariance matrix has low intrinsic dimension. Other common structured methods assume that the covariance matrix is banded (Li et al., 2017) or Toeplitz (Liu et al., 2017).

In contrast, unstructured methods do not make any assumptions about the population covariance matrix, yet can still outperform the sample covariance matrix. A first example was the linear shrinkage approach of Ledoit et al. (2004), which shrinks the sample covariance matrix toward a scaled identity matrix. More recently, nonlinear shrinkage methods were developed (Ledoit et al., 2012; Ledoit and Wolf, 2019; Lam and others, 2016). These shrink the eigenvalues of the sample covariance matrix toward clusters. Linear shrinkage can be viewed as a special case of nonlinear shrinkage, as it shrinks sample eigenvalues toward their global mean.

Nonlinear shrinkage estimators have desirable optimality properties (Ledoit and Wolf, 2018) and show excellent performance. However, they modify only the sample eigenvalues and not the sample eigenvectors. It is known that sample eigenvectors are not consistent estimators of population eigenvectors when the dimension and the sample size increase at the same rate (Mestre, 2008). This suggests that there may exist a class of unstructured estimators that can outperform nonlinear shrinkage.

Here we propose a new unstructured estimator for high-dimensional covariance matrices. Our approach centers on vectorizing the covariance matrix and treating matrix estimation as a vector estimation problem. We do this because it allows us to use a nonparametric empirical Bayes shrinkage procedure, which has been shown in the compound decision literature to have excellent properties (Jiang and Zhang, 2009; Koenker and Mizera, 2014). We then reassemble the estimated vector into matrix form and project onto the space of positive-definite matrices to give our final estimator. Surprisingly, though our vectorized approach essentially ignores the matrix structure, it can still substantially outperform a number of state-of-the-art proposals in simulations and a real data analysis.

The article is organized as follows. In Section 2, we briefly review compound decision theory and then introduce our proposed approach. In Section 3 we illustrate the performance of our method in simulations and a gene expression dataset. Finally, Section 4 concludes with a discussion. Our procedure is implemented in the R package cole, available on GitHub.

## 2 Method

### 2.1 Compound decision problem formulation

Suppose we have observations $X_1, \ldots, X_n \in \mathbb{R}^p$ independently generated from $N_p(0, \Sigma)$. The purpose of this paper is to find an estimator $\delta$ of $\Sigma$ that minimizes the scaled squared Frobenius risk

$$R(\Sigma, \delta) = \frac{1}{p^2} \sum_{j,k=1}^{p} E[\{\delta_{jk}(X) - \sigma_{jk}\}^2], \tag{1}$$

where $\sigma_{jk}$ is the $(j,k)$th entry of $\Sigma$ and $\delta_{jk}(X)$ is its corresponding estimate.

Our proposed approach is motivated by two observations. First, (1) shows that estimating $\Sigma$ under Frobenius risk is equivalent to simultaneously estimating every component of the vector $(\sigma_{11}, \sigma_{12}, \ldots, \sigma_{pp})$ under a loss function that aggregates errors across components. Second, this type of vector estimation problem has been well-studied in the compound decision literature. Thus recent advances in vector estimation may be profitably applied to covariance matrices. This section briefly reviews compound decision theory and its connection to covariance matrix estimation.

Compound decision problems, introduced by Robbins (1951), study the simultaneous estimation of multiple parameters $\theta = (\theta_1, \ldots, \theta_n)$ given data $Y = (Y_1, \ldots, Y_n)$, with $Y_i \sim f(\cdot \mid \theta_i)$. Specifically, the goal is to develop a decision rule $\delta(Y) = (\delta_1(Y), \ldots, \delta_n(Y))$ that minimizes the compound risk

$$R(\theta, \delta) = \frac{1}{n} \sum_{i=1}^{n} E\, L(\theta_i, \delta_i(Y)), \tag{2}$$

where $L$ is a loss function measuring the accuracy of $\delta_i(Y)$ as an estimate of $\theta_i$. A classical example is the homoscedastic Gaussian sequence problem, where $Y_i \sim N(\theta_i, 1)$ independently and $L$ is squared error loss (Johnstone, 2017). Clearly, covariance matrix estimation under the Frobenius risk (1) can be viewed as a compound decision problem.

A key property of compound decision problems is that while a given $Y_i$ seems to offer no information about any specific $\theta_j$ with $j \neq i$, borrowing information across all components of $Y$ to estimate each $\theta_i$ is superior to estimating each $\theta_i$ using the corresponding $Y_i$ alone. A classical example of this phenomenon is the James-Stein estimator (James and Stein, 1961), which estimates $\theta$ in the Gaussian sequence problem by shrinking each $Y_i$ toward 0 by a factor that depends on all components of $Y$. It is well-known that when $n \ge 3$, the James-Stein estimator dominates the maximum likelihood estimator, which simply estimates $\theta_i$ using $Y_i$. Generalizing this idea, Lindley (1962) proposed shrinking $Y$ toward a pre-defined subspace instead of toward 0. A long line of subsequent work has led to much more sophisticated and accurate procedures for estimating $\theta$ (Brown and Greenshtein, 2009; Jiang and Zhang, 2009; Johnstone, 2017; Fourdrinier et al., 2018).
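The dominance phenomenon described above is easy to verify numerically. The following sketch (our own illustration, with hypothetical means and seeds, not from the paper) compares the positive-part James-Stein rule to the maximum likelihood estimator under squared error loss:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                 # number of means; dominance requires n >= 3
theta = rng.normal(0.0, 2.0, size=n)   # arbitrary true mean vector (illustrative)

def james_stein(y):
    """Positive-part James-Stein: shrink y toward 0 by a common data-driven factor."""
    factor = max(0.0, 1.0 - (len(y) - 2) / float(np.sum(y ** 2)))
    return factor * y

# Monte Carlo comparison of average squared-error loss
reps = 2000
mle_loss = 0.0
js_loss = 0.0
for _ in range(reps):
    y = theta + rng.standard_normal(n)          # Y_i ~ N(theta_i, 1)
    mle_loss += np.mean((y - theta) ** 2)
    js_loss += np.mean((james_stein(y) - theta) ** 2)

mle_risk, js_risk = mle_loss / reps, js_loss / reps   # js_risk < mle_risk
```

The shrinkage factor depends on all components of $Y$ through $\sum_i Y_i^2$, which is exactly the information-borrowing described above.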

We propose to apply some of these recent vector estimation ideas to covariance matrix estimation, leveraging the close connection between the two tasks. Some existing covariance matrix estimation procedures can already be interpreted as taking a vector approach. The sample covariance matrix $S$, for instance, can be thought of as estimating each component of $\mathrm{vec}(\Sigma)$ using maximum likelihood. Less trivially, Cai and Liu (2011) studied sparse high-dimensional covariance matrices and explicitly appealed to the vector perspective. Their adaptive thresholding method is a version of the soft thresholding method of Donoho and Johnstone (1995), which was originally developed to estimate a sparse mean vector in the Gaussian sequence problem.

Interestingly, we can also show that the celebrated linear shrinkage covariance matrix estimator of Ledoit et al. (2004) can be interpreted as a vector estimator. The estimator is defined as

$$\hat\Sigma_{LW} = \hat\rho_{LW} S + (1 - \hat\rho_{LW}) \hat\mu I, \tag{3}$$

where $\hat\rho_{LW}$ is an estimated shrinkage intensity and $\hat\mu = \mathrm{tr}(S)/p$ is the average of the sample variances. Now consider the problem of estimating the vectorized $\Sigma$ under risk (1). We restrict attention to decision rules that estimate each component $\sigma_{jk}$ using

$$\beta_S s_{jk} + \beta_I u_{jk}, \tag{4}$$

where $s_{jk}$ is the $(j,k)$th entry of $S$, $u_{jk}$ is the $(j,k)$th entry of $I$, and the class is indexed by the parameters $(\beta_S, \beta_I)$. It is straightforward to show that

$$\hat R(\beta_S, \beta_I) = \frac{1}{p^2} \sum_{j,k=1}^{p} \left[ (2\beta_S - 1) \hat\Delta_{jk}^2 + \{(1 - \beta_S) s_{jk} - \beta_I u_{jk}\}^2 \right]$$

is an unbiased estimate of the risk (1), where $\hat\Delta_{jk}^2$ is the sample variance of $s_{jk}$. The optimal estimator in this class can now be chosen by minimizing $\hat R$ over $\beta_S$ and $\beta_I$. It can be shown that this is equivalent to estimating the vectorized $\Sigma$ by shrinking toward the one-dimensional subspace spanned by $\mathrm{vec}(I)$ (Biscarri, 2019). Proposition 1 shows that this subspace shrinkage estimator is identical to the Ledoit et al. (2004) estimator (3).

###### Proposition 1.

Define the estimator $\hat\delta$ such that its $(j,k)$th entry obeys $\hat\delta_{jk} = \hat\beta_S s_{jk} + \hat\beta_I u_{jk}$, where $(\hat\beta_S, \hat\beta_I) = \arg\min_{\beta_S, \beta_I} \hat R(\beta_S, \beta_I)$. Then $\hat\delta = \mathrm{vec}(\hat\Sigma_{LW})$.
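To make the vector viewpoint concrete, the following sketch (our own illustration, with a hypothetical true covariance and a simple grid search in place of the closed-form minimizer) estimates $\hat\Delta_{jk}^2$ from the empirical variance of the products $x_{ij} x_{ik}$ and minimizes the unbiased risk estimate $\hat R$ over $(\beta_S, \beta_I)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 8
Sigma = 0.3 * np.ones((p, p)) + 0.7 * np.eye(p)     # hypothetical true covariance
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

S = X.T @ X / n                                     # sample covariance (mean-zero data)
U = np.eye(p)                                       # shrinkage-target entries u_jk

# Estimate Delta^2_jk, the variance of s_jk, from the products x_ij * x_ik
prods = X[:, :, None] * X[:, None, :]               # n x p x p array of products
Delta2 = prods.var(axis=0, ddof=1) / n

def risk_hat(bS, bI):
    """Unbiased estimate of the scaled Frobenius risk of the rule bS*s_jk + bI*u_jk."""
    return np.mean((2 * bS - 1) * Delta2 + ((1 - bS) * S - bI * U) ** 2)

# Grid search over the two shrinkage coefficients
grid = np.linspace(0.0, 1.0, 101)
bS, bI = min(((a, b) for a in grid for b in grid), key=lambda ab: risk_hat(*ab))
Sigma_hat = bS * S + bI * U
```

Since $\beta_S = 1$, $\beta_I = 0$ (the sample covariance) is on the grid, the selected rule can only have a lower estimated risk.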

### 2.2 Proposed estimator

The previous section argues that treating covariance matrix estimation as a vector estimation problem can be a fruitful strategy, but discusses only estimators for $\sigma_{jk}$ that are linear in $s_{jk}$. We propose to consider a larger class, the class of so-called separable rules. In the standard compound decision problem of estimating $\theta$ using $Y$, a separable decision rule is one where $\delta_i(Y) = t(Y_i)$ for some fixed function $t$ (Robbins, 1951). Here we generalize this to the problem of estimating a vectorized matrix. For decision rules $\delta$ that estimate $\mathrm{vec}(\Sigma)$, define the class of separable rules

$$\mathcal{S} = \{\delta : \delta_{jk} = t(X_{\cdot j}, X_{\cdot k})\}, \tag{5}$$

where $X_{\cdot j}$ is the vector of observed values of the $j$th feature. In other words, rules in $\mathcal{S}$ estimate the $(j,k)$th entry of the covariance matrix using a fixed function of only the observations from features $j$ and $k$.

We propose to search for the optimal estimator within $\mathcal{S}$. This is sensible because $\mathcal{S}$ includes the sample covariance matrix $S$, the class of linear estimators (4) used by Ledoit et al. (2004), which can be expressed as

$$\delta_{jk}(X) = \beta_S X_{\cdot j}^\top X_{\cdot k} / n + \beta_I I(X_{\cdot j} = X_{\cdot k}),$$

and the class of adaptive thresholding estimators for sparse covariance matrices studied by Cai and Liu (2011). Therefore the optimal separable estimator that minimizes the scaled squared Frobenius risk (1) will perform at least as well as these three estimators, and may perform better. Targeting the optimal separable rule is standard in the compound decision literature (Zhang, 2003).

The optimal $\delta \in \mathcal{S}$ is an oracle estimator and cannot be calculated in practice, as the true risk is unknown. In the classical compound decision framework, empirical Bayes methods are used to estimate the oracle optimal separable rule (Robbins, 1955; Zhang, 2003; Brown and Greenshtein, 2009; Jiang and Zhang, 2009; Efron, 2014, 2019). We take a similar approach here. To simplify notation, denote $(X_{\cdot j}, X_{\cdot k})$ as $A_{jk}$. The density of $A_{jk}$ is parameterized by $\eta_{jk} = (\sigma_j, \sigma_k, r_{jk})$, where $\sigma_j$ and $\sigma_k$ are standard deviations and $r_{jk}$ is the correlation. When $j \neq k$, $A_{jk}$ is comprised of $n$ independent mean-zero bivariate normals with covariance matrices

$$C_{jk} = \begin{pmatrix} \sigma_j^2 & \sigma_j \sigma_k r_{jk} \\ \sigma_j \sigma_k r_{jk} & \sigma_k^2 \end{pmatrix}.$$

When $j = k$, $A_{jj}$ consists of $n$ mean-zero univariate normals with variances $\sigma_j^2$.
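A minimal sketch of the pairwise likelihood just described, assuming mean-zero bivariate normal pairs; the function name and test data are our own illustration:

```python
import numpy as np

def pair_loglik(a, sj, sk, r):
    """Summed log-density of n iid mean-zero bivariate normal rows of a (n x 2)
    with standard deviations (sj, sk) and correlation r."""
    C = np.array([[sj ** 2, sj * sk * r],
                  [sj * sk * r, sk ** 2]])
    Cinv = np.linalg.inv(C)
    _, logdet = np.linalg.slogdet(C)
    quad = np.einsum("ni,ij,nj->n", a, Cinv, a)   # quadratic form per observation
    return float(np.sum(-np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad))

rng = np.random.default_rng(2)
truth = np.array([[1.0, 0.5], [0.5, 1.0]])
a = rng.multivariate_normal([0.0, 0.0], truth, size=200)
# The log-likelihood is larger near the truth (sj = sk = 1, r = 0.5)
```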

Now consider the Bayesian model

$$A \mid \eta \sim f(\cdot \mid \eta), \qquad \eta \sim G_p(a, b, \gamma) = \frac{1}{p^2} \sum_{j,k=1}^{p} I(\sigma_j \le a, \sigma_k \le b, r_{jk} \le \gamma). \tag{6}$$

By the fundamental theorem of compound decisions (Robbins, 1951; Jiang and Zhang, 2009), this is closely related to the vectorized covariance matrix estimation problem under Frobenius risk (1): for any separable $\delta$ (5), the Frobenius risk can be written as

$$R(\Sigma, \delta) = \frac{1}{p^2} \sum_{j,k=1}^{p} \int \{t(A) - \sigma_{jk}\}^2 f(A \mid \eta_{jk}) \, dA = \iint \{t(A) - g(\eta)\}^2 f(A \mid \eta) \, dG_p(\eta) \, dA = E[\{t(A) - g(\eta)\}^2],$$

where $g(\eta) = \sigma_j \sigma_k r_{jk} = \sigma_{jk}$ and the final expectation is the Bayes risk of estimating $g(\eta)$. The optimal oracle separable rule therefore has $(j,k)$th entry equal to $t^*(A_{jk})$, where $t^*$ minimizes the Bayes risk.

Based on this result, we propose the following empirical Bayes procedure. We first use nonparametric maximum likelihood (Kiefer and Wolfowitz, 1956) to estimate the prior $G_p$. Under the Bayesian model (6), and the working assumption that the $A_{jk}$ are independent across $(j,k)$, we estimate $G_p$ using

$$\hat G_p = \arg\max_{G \in \mathcal{G}_p} \prod_{j,k=1}^{p} \int f(A_{jk} \mid \eta) \, dG(\eta), \tag{7}$$

where $\mathcal{G}_p$ is the family of all distributions supported on $[0, \infty)^2 \times [-1, 1]$. Of course, the $A_{jk}$ are not independent, so $\hat G_p$ does not maximize a likelihood but rather a pairwise composite likelihood (Varin et al., 2011). Using $\hat G_p$, we estimate the vectorized $\Sigma$ using

$$\hat\delta(X) = (\hat t(A_{11}), \ldots, \hat t(A_{pp})), \qquad \hat t(A) = \frac{\int g(\eta) f(A \mid \eta) \, d\hat G_p(\eta)}{\int f(A \mid \eta) \, d\hat G_p(\eta)}. \tag{8}$$

The $\hat t$ estimates the Bayes rule $t^*$, and $\hat\delta$ estimates the optimal oracle separable rule.

Our proposed procedure is an example of what Efron (2014) calls g-modeling, an approach to empirical Bayes problems that proceeds by modeling the prior. A major advantage of nonparametric estimation of the prior is that it allows the data itself to determine how best to shrink the estimator. In contrast, most existing methods shrink in a pre-determined direction, such as toward a diagonal matrix in the case of Ledoit et al. (2004). Theoretical justification of our proposed $\hat\delta$ is difficult and is discussed in Section 4. Nevertheless, our numerical results in Section 3 show that in practice, our $\hat\delta$ can outperform many existing covariance matrix estimators.

### 2.3 Implementation

Calculating the estimated prior (7) is difficult, as it is an infinite-dimensional optimization problem over the class of all probability distributions $\mathcal{G}_p$. Lindsay (1983) showed that the solution is atomic and is supported on at most as many points as there are observations. The EM algorithm has traditionally been used to estimate the locations of the support points and the masses at those points (Laird, 1978), but this is a difficult nonconvex optimization problem.

Instead, we maximize the pairwise composite likelihood over a fixed grid of support points, similar to recent g-modeling procedures for standard compound decision problems; this restores convexity (Jiang and Zhang, 2009; Koenker and Mizera, 2014; Feng and Dicker, 2018). Specifically, we assume that the prior for the $\eta_{jk}$ is supported on $D$ fixed support points $\xi_1, \ldots, \xi_D$. We can then use the EM algorithm to estimate the masses $\hat w_\tau$ at those points via the iteration

$$\hat w_\tau^{(m)} = \frac{1}{p^2} \sum_{j,k=1}^{p} \frac{\hat w_\tau^{(m-1)} f(A_{jk} \mid \xi_\tau)}{\sum_{l=1}^{D} \hat w_l^{(m-1)} f(A_{jk} \mid \xi_l)}$$

over $\tau = 1, \ldots, D$, where $m$ indexes the EM iteration. Early stopping of the EM algorithm can be useful (Koenker et al., 2019), and more sophisticated convex optimization procedures can be used as well (Koenker and Mizera, 2014). Our proposed estimator (8) then becomes

$$\hat\delta(X) = (\hat t(A_{11}), \ldots, \hat t(A_{pp})), \qquad \hat t(A) = \frac{\sum_{\tau=1}^{D} g(\xi_\tau) f(A \mid \xi_\tau) \hat w_\tau}{\sum_{\tau=1}^{D} f(A \mid \xi_\tau) \hat w_\tau}.$$
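Putting the pieces together, the following toy sketch (our own illustration, with a hypothetical small grid, toy data, and log-space computations for numerical stability) runs the fixed-grid EM iteration and then applies the posterior-mean rule to each off-diagonal pair:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

rng = np.random.default_rng(3)
n, p = 30, 6
X = rng.standard_normal((n, p))        # toy mean-zero data (true Sigma = I)

# Hypothetical small grid of support points xi = (sigma_j, sigma_k, r)
sigmas = np.linspace(0.5, 2.0, 4)
cors = np.linspace(-0.9, 0.9, 5)
grid = [(a, b, g) for a in sigmas for b in sigmas for g in cors]
D = len(grid)

# Pairwise composite log-likelihood: rows index pairs (j,k), columns grid points
pairs = [(j, k) for j in range(p) for k in range(p) if j < k]
logL = np.empty((len(pairs), D))
for col, (a, b, g) in enumerate(grid):
    C = np.array([[a * a, a * b * g], [a * b * g, b * b]])
    for row, (j, k) in enumerate(pairs):
        logL[row, col] = multivariate_normal.logpdf(X[:, [j, k]], cov=C).sum()

# EM iterations for the grid masses, starting from uniform weights
w = np.full(D, 1.0 / D)
for _ in range(200):
    lp = logL + np.log(w)
    post = np.exp(lp - logsumexp(lp, axis=1, keepdims=True))  # posterior over grid
    w = post.mean(axis=0)                                     # updated masses

# Posterior-mean rule: estimate sigma_jk by E[g(eta) | A_jk], g(eta) = a * b * r
gvals = np.array([a * b * g for a, b, g in grid])
sigma_hat = post @ gvals               # one estimate per off-diagonal pair
```

Working with log-likelihoods avoids the underflow that the raw product of $n$ densities would cause.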

We constructed our support points by taking the Cartesian product of grids along each of the three dimensions of the points $\xi_\tau = (a, b, \gamma)$. The first two components $a$ and $b$ correspond to the standard deviations $\sigma_j$ and $\sigma_k$, so we assume that these are supported on $d$ equally spaced points $\tilde\sigma_1, \ldots, \tilde\sigma_d$ between $\min_j s_j$ and $\max_j s_j$, where $s_j^2$ is the sample variance of the $j$th feature. The last component corresponds to the correlation $r_{jk}$, so we assume it is supported on $d$ equally spaced points $\tilde\gamma_1 = -1, \ldots, \tilde\gamma_d = 1$. Therefore $d^3$ grid points would be sufficient for the support of $\hat G_p$. However, we can reduce the number of required points thanks to the symmetry of $\Sigma$: $A_{jk} = A_{kj}$, and when $j = k$, $r_{jj} = 1$. Furthermore, we assume that $a \le b$. Therefore we only consider $\hat G_p$ supported on

$$\{(a, b, \gamma) \in (\tilde\sigma_1, \ldots, \tilde\sigma_d)^2 \times (\tilde\gamma_2, \ldots, \tilde\gamma_{d-1}) : a \le b\} \cup \{(a, a, 1) : a \in (\tilde\sigma_1, \ldots, \tilde\sigma_d)\}. \tag{9}$$

This requires many fewer than $d^3$ grid points.

Ideally, the number of grid points should be chosen to be as large as possible. However, the fact that $\eta$ is multivariate poses difficulties: for example, our strategy of using a grid of $d$ points in each dimension requires on the order of $d^3$ grid points in total. Better computational solutions are necessary. One promising alternative is the approach of Tao (2014), who studied the dual problem of (7) and used unequally spaced grid points. Another alternative might be parametric or semiparametric modeling of $G_p$, along the lines of Efron (2019). We found that our implementation strategy is computationally feasible for $d$ up to about 30, and the numerical results in Section 3 suggest that these values work well enough.

### 2.4 Positive definiteness correction

Our proposed estimator (8) is not guaranteed to be positive-definite. To correct this, we reshape our vector estimator back into a matrix and then identify the closest positive-definite matrix. Higham (1988) and Huang et al. (2017) showed that the projection of a symmetric matrix $B$ onto the space of positive semi-definite matrices is

$$P_0(B) = \arg\min_{A \succeq 0} \|A - B\| = Q \,\mathrm{diag}\{\max(\lambda_1, 0), \max(\lambda_2, 0), \ldots, \max(\lambda_p, 0)\}\, Q^\top,$$

where $\|\cdot\|$ denotes the Frobenius norm, $Q$ is the matrix of eigenvectors of $B$, and $\lambda_1, \ldots, \lambda_p$ are its eigenvalues. Projections in terms of other matrix norms are also possible.

To guarantee positive-definiteness, we follow Huang et al. (2017) and replace non-positive eigenvalues with a chosen positive value $c$ smaller than the least positive eigenvalue, so that the corrected estimate is

$$P_c(B) = Q \,\mathrm{diag}\{\max(\lambda_1, c), \max(\lambda_2, c), \ldots, \max(\lambda_p, c)\}\, Q^\top. \tag{10}$$

Huang et al. (2017) suggest choosing $c$ as a fraction of the least positive eigenvalue, with the tuning parameter selected by minimizing a discrepancy criterion over a uniform partition; we followed their recommendation in this paper.
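The eigenvalue correction (10) takes only a few lines; this sketch uses a hypothetical indefinite matrix and an arbitrary floor c:

```python
import numpy as np

def pd_correct(B, c):
    """Eigenvalue clipping as in (10): floor the eigenvalues of B at c > 0,
    keeping the eigenvectors unchanged."""
    lam, Q = np.linalg.eigh((B + B.T) / 2.0)      # symmetrize for numerical safety
    return Q @ np.diag(np.maximum(lam, c)) @ Q.T

# A symmetric but indefinite "estimate" (hypothetical numbers)
B = np.array([[1.0, 0.9, 0.9],
              [0.9, 1.0, -0.9],
              [0.9, -0.9, 1.0]])
A = pd_correct(B, c=1e-4)                         # now positive-definite
```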

## 3 Numerical Results

### 3.1 Methods compared

We implemented our proposed estimator (8) as described in Section 2.3. We term our approach MSG: Matrix Shrinkage via g-modeling. To illustrate the effect of the number of grid points $d$, we used both $d = 20$ and $d = 30$; these are denoted MSG_d20 and MSG_d30 in the subsequent sections. We also studied the effect of applying the positive-definiteness correction (10); these variants are abbreviated below as MSGCor_d20 and MSGCor_d30.

We compared our approach to several existing high-dimensional covariance matrix estimation methods mentioned in the Introduction.

• Sample: the sample covariance matrix.

• Linear: the linear shrinkage estimator of Ledoit et al. (2004) given in (3).

• QIS: the Quadratic-Inverse Shrinkage estimator of Ledoit and Wolf (2019), a recently developed nonlinear shrinkage method. QIS performs linear shrinkage on the sample eigenvalues of the covariance matrix in inverse eigenvalue space. A bandwidth parameter is required, which we choose following the paper’s recommendation.

• NERCOME: the Nonparametric Eigenvalue-Regularized COvariance Matrix Estimator of Lam and others (2016). This nonlinear shrinkage method randomly splits the samples into two groups, one for estimating eigenvectors and the other for estimating eigenvalues. Combining the estimates gives a matrix. Following the article, we repeated this procedure 50 times and took the final covariance matrix estimator to be the average of the individual matrices.

• Adap: the adaptive thresholding method of Cai and Liu (2011) for sparse covariance matrices, which applies soft thresholding to the entries of the sample covariance matrix. The threshold is adaptive to each entry’s variance and involves a tuning parameter, which we fixed at 2, as recommended.

In addition to the above five estimators, we also implemented the following two oracle estimators, which cannot be used in practice as they require the unknown $\Sigma$.

• OracNonlin: the optimal rotation-invariant covariance estimator, defined in Ledoit and Wolf (2019) as $\hat Q \,\mathrm{diag}(d_1, \ldots, d_p)\, \hat Q^\top$, where $\hat Q = (\hat q_1, \ldots, \hat q_p)$ is the sample eigenvector matrix and the oracle eigenvalues are $d_i = \hat q_i^\top \Sigma \hat q_i$. The sample covariance, the linear shrinkage estimator of Ledoit et al. (2004), and the nonlinear shrinkage estimators QIS and NERCOME are all rotation-invariant.

• OracMSG: the optimal covariance estimator in the class of separable estimators (5). It equals our proposed estimator (8) except that it uses the true $G_p$ (6) instead of $\hat G_p$ (7). The adaptive thresholding method of Cai and Liu (2011) also targets a separable estimator.

### 3.2 Simulations

We considered five models for the population covariance matrix $\Sigma$. For the first four settings, $\Sigma = \mathrm{diag}(\sigma)\, C \,\mathrm{diag}(\sigma)$, where $C$ is a correlation matrix with diagonal entries equal to 1 and $\sigma$ is a vector of standard deviations.

• Model 1. The standard deviations were independently generated at random and the correlation matrix followed Model 2 of Cai and Liu (2011):

$$C = \begin{pmatrix} A_1 & 0 \\ 0 & I_{p/2 \times p/2} \end{pmatrix},$$

where the entries of $A_1$ were generated as in that model. This setting modeled a sparse covariance matrix.

• Model 2. The first block of standard deviations equaled 1, the remaining ones equaled 2, and the correlation matrix was

$$C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix},$$

where $C_{11}$ and $C_{22}$ were compound symmetric matrices with correlation parameters 0.8 and 0.2, respectively, and $C_{12}$ and $C_{21}$ were matrices with entries equal to 0.4. This model was designed such that larger $\sigma_j$ and $\sigma_k$ tended to correspond to larger $r_{jk}$.

• Model 3. The standard deviations were generated independently at random and $C$ was a compound symmetric matrix with correlation parameter 0.7. This modeled a dense covariance matrix.

• Model 4. This setting was the same as Model 3 except with correlation parameter 0.9. This high level of dependence tested the robustness of the pairwise composite likelihood estimator (7).

• Model 5. With $Q$ a randomly generated orthogonal matrix, $\Sigma = Q \,\mathrm{diag}(\lambda)\, Q^\top$, where $\lambda$ was a vector of eigenvalues in which the first block of components equaled 1 and the remaining components equaled 4. This followed simulation settings from Lam and others (2016) and Ledoit and Wolf (2019) used to test the rotation-invariant estimators QIS and NERCOME.

In each scenario, we generated $n$ samples from a $p$-variate $N(0, \Sigma)$ distribution. We generated 200 replicates and reported average errors under the following three norms, where $\hat\Sigma$ is the estimated matrix with entries $\hat\sigma_{jk}$ and $\Sigma$ is the true matrix with entries $\sigma_{jk}$:

• Frobenius: $p^{-2} \sum_{j,k} (\hat\sigma_{jk} - \sigma_{jk})^2$, a version of (1),

• Spectral: $\|\hat\Sigma - \Sigma\|_2$, the largest singular value of $\hat\Sigma - \Sigma$, and

• Matrix $\ell_1$: $\max_k \sum_j |\hat\sigma_{jk} - \sigma_{jk}|$.

Figure 1: Average Frobenius norm errors over 200 replications. The Sparse, Block, Dense, Dense2, and Orth panels correspond to Models 1 through 5, respectively.

Figure 2: Average spectral norm errors over 200 replications. The Sparse, Block, Dense, Dense2, and Orth panels correspond to Models 1 through 5, respectively.

Figure 3: Average matrix $\ell_1$ norm errors over 200 replications. The Sparse, Block, Dense, Dense2, and Orth panels correspond to Models 1 through 5, respectively.
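For reference, the three error metrics above can be computed as follows; the helper name and toy matrices are our own illustration:

```python
import numpy as np

def matrix_errors(est, truth):
    """The three error metrics used in the simulations (hypothetical helper)."""
    Dm = est - truth
    p = truth.shape[0]
    frob = np.sum(Dm ** 2) / p ** 2          # scaled squared Frobenius, as in (1)
    spec = np.linalg.norm(Dm, 2)             # spectral: largest singular value
    l1 = np.abs(Dm).sum(axis=0).max()        # matrix l1: maximum column sum
    return frob, spec, l1

est = np.diag([2.0, 0.0])
truth = np.diag([1.0, 2.0])
frob, spec, l1 = matrix_errors(est, truth)   # (1.25, 2.0, 2.0)
```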

Figures 1, 2, and 3 present the simulation results. Our MSG methods with $d = 20$ had the lowest or near-lowest errors across all settings and all error metrics. In some cases, for example in Models 1 and 2, the improvement was substantial. Model 2 was especially interesting because the standard deviations and correlations were related. Our proposed empirical Bayes estimator was able to capture this dependence in its estimate of the prior (6) and leverage it to provide much more accurate estimates. The nonlinear shrinkage estimators very slightly outperformed MSG in Model 5. In every setting, correcting MSG for positive-definiteness never increased the risk, and it decreased the risk in some cases. Though our estimator was motivated in terms of the Frobenius norm error, it performed extremely well in terms of the other two norms as well.

We were surprised to find that our proposed MSG methods with $d = 30$ did not perform as well as with $d = 20$, and in some cases, especially Model 4, were much worse. Intuitively, more grid points should lead to more accurate results. Indeed, we ran additional simulations using MSG with smaller values of $d$ and found that those results were consistent with this intuition. We believe the counterintuitive performance of $d = 30$ stems from the specific locations of the points comprising the grid (9) in these simulations. In a set of unreported experiments, we let the correlation component of $\hat G_p$ be supported on equally spaced points between the smallest and largest observed sample correlations, instead of between $-1$ and $1$ as in (9). In these results, $d = 20$ and $d = 30$ had similar performance, though neither performed as well as the positive-definite-corrected MSG with $d = 20$ reported here. It was computationally infeasible to implement $d$ much higher than 30. This highlights the importance of developing better computational strategies for multivariate g-modeling problems (Tao, 2014).

Finally, the simulations show that the class of separable estimators (5) proposed in this paper is fundamentally different from the class of rotation-invariant estimators, as the oracle optimal estimators in these two classes behave very differently. For example, the oracle separable estimator had vanishing risk in Model 2, while the oracle rotation-invariant estimator did not. Separable estimators seemed better for Models 1 and 2, while rotation-invariant estimators were superior in Models 3 and 4. The two classes seemed comparable in Model 5.

### 3.3 Data analysis

Covariance matrix estimation is often used to reconstruct gene networks (Markowetz and Spang, 2007). We applied MSG and the other covariance matrix estimators described in Section 3.1 to gene network estimation using data from a small round blue-cell tumor microarray experiment, which was also studied by Cai and Liu (2011). Osareh and Shadgar (2009) report the expression of 2,308 genes from 63 samples from four groups: 12 neuroblastoma, 20 rhabdomyosarcoma, 8 Burkitt lymphoma, and 23 Ewing’s sarcoma patients. We followed the same data preprocessing as Cai and Liu (2011) and sorted the genes in decreasing order according to their $F$-statistic

$$F = \frac{\frac{1}{k-1} \sum_{m=1}^{k} n_m (\bar x_m - \bar x)^2}{\frac{1}{n-k} \sum_{m=1}^{k} (n_m - 1) \hat\sigma_m^2}, \tag{11}$$

where $k$ is the number of patient categories, $n_m$, $\bar x_m$, and $\hat\sigma_m^2$ represent the sample size, sample mean, and sample variance of the gene’s expression in the $m$th category, respectively, and $\bar x$ is the global mean. We proceeded with the top 40 genes and bottom 160 genes.
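The statistic (11) is the usual one-way ANOVA $F$ statistic; a short sketch with hypothetical toy groups:

```python
import numpy as np

def f_stat(groups):
    """One-way ANOVA F statistic as in (11): ratio of between-group to
    within-group variance; `groups` is a list of 1-D arrays, one per class."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = np.concatenate(groups).mean()
    between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups) / (k - 1)
    within = sum((len(g) - 1) * g.var(ddof=1) for g in groups) / (n - k)
    return between / within

# Toy check with two classes of two samples each: means 1 and 2, grand mean 1.5
F = f_stat([np.array([0.0, 2.0]), np.array([1.0, 3.0])])   # F = 1.0 / 2.0 = 0.5
```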

We applied the various methods to estimate the covariance matrix of these 200 genes. To measure the accuracy of the estimators, we split the 63 samples into two subsets, ensuring that each subset contained the same number of subjects from each of the four disease groups. After centering the variables to have zero mean, we used the first subset to calculate covariance matrix estimates and compared these to the sample covariance matrix of the second subset, which served as a proxy for the unknown true covariance matrix. We measured the errors using the Frobenius, spectral, and matrix $\ell_1$ norms. We repeated this process 200 times.

Table 1 reports the average errors across the replications. Our MSG methods had the lowest average error, and in this case $d = 20$ and $d = 30$ performed comparably. The positive-definiteness correction slightly reduced the risk as well. The linear shrinkage estimate was almost as accurate, but the other methods were much less accurate. These results suggest that our estimator can perform well in realistic settings, where the mean-zero multivariate normal distributional assumption on the data may not be met.

In addition to comparing the numerical accuracies, we also investigated whether our estimator gave qualitatively different gene networks compared to the other approaches. First, Figure 4 illustrates the covariance matrices in network form, where each node represents a gene and each edge represents a non-zero covariance between the genes it connects. To avoid completely connected graphs, we sparsified the matrix estimates by thresholding the smaller entries of each matrix to zero. Since the adaptive thresholding method of Cai and Liu (2011) naturally produced a sparse estimated matrix, we thresholded the other matrix estimates to match the sparsity level of the Cai and Liu (2011) estimate.

Figure 4: Gene networks recovered by the different covariance matrix estimation methods.

The results show several interesting features. First, there appear to be two major clusters, which are disconnected in every estimated network except for the one produced by the adaptive thresholding approach. Second, the larger cluster appears to contain two sub-clusters, and this finer structure was only recovered by MSG and QIS, and to a lesser extent the linear shrinkage estimator and NERCOME. Finally, the nodes in the networks estimated by QIS and NERCOME appear to be clustered more tightly together compared to in the other networks. These observations suggest that MSG produces qualitatively different networks, in addition to lower estimation errors.

Finally, we also compared the estimated degrees of the genes in the different networks. For each estimated network, we ordered the 200 genes by degree and then selected the top 20%, denoting this set by $V_i$ for the $i$th network. For each pair of networks $i$ and $j$, we calculated the similarity between their most connected genes using the Jaccard index $|V_i \cap V_j| / |V_i \cup V_j|$. Figure 5 visualizes these similarities. Not surprisingly, the MSG networks were most similar to each other. Interestingly, however, among all estimators, they were also the most similar to the unbiased sample covariance matrix. Together with the above results, this indicates that MSG may simultaneously give the lowest error and, at least in terms of degree estimation, the most unbiased results.

Figure 5: Similarities of gene degrees between the estimated networks. Each number reports the Jaccard index between the top 20% most connected genes of each pair of networks.
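The degree-based comparison can be sketched as follows, with hypothetical degree maps; the helper is our own illustration:

```python
def jaccard_top_degree(degrees_a, degrees_b, frac=0.2):
    """Jaccard index between the top-`frac` highest-degree nodes of two networks.
    Each `degrees_*` maps node name -> degree (hypothetical helper)."""
    def top(degrees):
        m = max(1, int(round(frac * len(degrees))))
        return set(sorted(degrees, key=degrees.get, reverse=True)[:m])
    A, B = top(degrees_a), top(degrees_b)
    return len(A & B) / len(A | B)

deg1 = {"g%d" % i: i for i in range(10)}          # hubs: g8, g9
deg2 = {"g%d" % i: (10 - i) for i in range(10)}   # hubs: g0, g1
sim = jaccard_top_degree(deg1, deg2)              # disjoint top sets -> 0.0
```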

## 4 Discussion

The class of separable covariance matrix estimators (5) that we proposed in this paper appears to be very promising. Many existing procedures already explicitly or implicitly target this class, and our proposed estimate (8) of the optimal separable estimator outperforms a number of existing covariance matrix estimators. This is surprising because our approach vectorizes the matrix and therefore cannot take matrix structure, such as positive-definiteness, into account. This suggests that a vectorized approach combined with a positive-definiteness constraint may have improved performance. The resulting estimator would necessarily not be separable, because the estimate of the $(j,k)$th entry would depend on more than just the $j$th and $k$th observed features, so our g-modeling estimation strategy is insufficient. More work is needed.

Though our estimator performs well in simulations and in real data, providing theoretical guarantees is difficult. In the standard mean vector estimation problem with $Y_i \sim N(\theta_i, 1)$, Jiang and Zhang (2009) showed that an empirical Bayes estimator based on a nonparametric maximum likelihood estimate of the prior on the $\theta_i$ can indeed asymptotically achieve the same risk as the oracle optimal separable estimator. However, this was in a simple model with a univariate prior distribution. Saha and Guntuboyina (2017) extended these results to multivariate $Y_i$ with a multivariate prior on the $\theta_i$, but assumed that the $Y_i$ were independent. In contrast, our covariance matrix estimator is built from arbitrarily dependent $A_{jk}$. This imposes significant theoretical difficulties that will require substantial work to address; we leave this for future research.

Finally, we have so far assumed that our data are multivariate normal. To extend our procedure to non-normal data belonging to a parametric family, we can simply modify the density function $f$ in the nonparametric maximum composite likelihood problem (7) and in our proposed estimator (8). If $f$ is unknown or difficult to specify, alternative procedures may be necessary to approximate the optimal separable rule.

## 5 Acknowledgments

We thank Dr. Roger Koenker for some very valuable comments.

## References

• Y. Abramovich, N. Spencer, and A. Gorokhov (2001) Locally optimal covariance matrix estimation techniques for array signal processing applications. In Conference Record of Thirty-Fifth Asilomar Conference on Signals, Systems and Computers (Cat. No. 01CH37256), Vol. 2, pp. 1127–1133. Cited by: §1.
• W. D. Biscarri (2019) Statistical methods for binomial and Gaussian sequences. Ph.D. Thesis, University of Illinois at Urbana-Champaign. Cited by: §2.1.
• L. D. Brown and E. Greenshtein (2009) Nonparametric empirical Bayes and compound decision approaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics 37 (4), pp. 1685–1704. Cited by: §2.1, §2.2.
• T. Cai and W. Liu (2011) Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association 106 (494), pp. 672–684. Cited by: §1, §2.1, §2.2, 5th item, 2nd item, 1st item, §3.3, §3.3.
• D. L. Donoho and I. M. Johnstone (1995) Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association 90 (432), pp. 1200–1224. Cited by: §2.1.
• B. Efron (2014) Two modeling strategies for empirical Bayes estimation. Statistical Science 29 (2), pp. 285–301. Cited by: §2.2, §2.2.
• B. Efron (2019) Bayes, Oracle Bayes and Empirical Bayes. Statistical Science 34 (2), pp. 177–201. Cited by: §2.2, §2.3.
• J. Fan, Y. Fan, and J. Lv (2008) High dimensional covariance matrix estimation using a factor model. Journal of Econometrics 147 (1), pp. 186–197. Cited by: §1.
• L. Feng and L. H. Dicker (2018) Approximate nonparametric maximum likelihood for mixture models: a convex optimization approach to fitting arbitrary multivariate mixing distributions. Computational Statistics & Data Analysis 122, pp. 80–91. Cited by: §2.3.
• D. Fourdrinier, W. E. Strawderman, and M. T. Wells (2018) Shrinkage estimation. Springer. Cited by: §2.1.
• N. J. Higham (1988) Computing a nearest symmetric positive semidefinite matrix. Linear Algebra and its Applications 103, pp. 103–118. Cited by: §2.4.
• C. Huang, D. Farewell, and J. Pan (2017) A calibration method for non-positive definite covariance matrix in multivariate data analysis. Journal of Multivariate Analysis 157, pp. 45–52. Cited by: §2.4, §2.4.
• W. James and C. M. Stein (1961) Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 367–379. Cited by: §2.1.
• W. Jiang and C. Zhang (2009) General maximum likelihood empirical Bayes estimation of normal means. The Annals of Statistics 37 (4), pp. 1647–1684. Cited by: §1, §2.1, §2.2, §2.2, §2.3, §4.
• I. M. Johnstone (2017) Gaussian estimation: sequence and wavelet models. Technical report, Department of Statistics, Stanford University. Cited by: §2.1, §2.1.
• J. Kiefer and J. Wolfowitz (1956) Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, pp. 887–906. Cited by: §2.2.
• R. Koenker, J. Gu, et al. (2019) Comment: minimalist g-modeling. Statistical Science 34 (2), pp. 209–213. Cited by: §2.3.
• R. Koenker and I. Mizera (2014) Convex optimization, shape constraints, compound decisions, and empirical Bayes rules. Journal of the American Statistical Association 109 (506), pp. 674–685. Cited by: §1, §2.3.
• N. Laird (1978) Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association 73 (364), pp. 805–811. Cited by: §2.3.
• C. Lam et al. (2016) Nonparametric eigenvalue-regularized precision or covariance matrix estimator. The Annals of Statistics 44 (3), pp. 928–953. Cited by: §1, 4th item, 5th item.
• O. Ledoit, M. Wolf, et al. (2004) A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88 (2), pp. 365–411. Cited by: §1, §2.1, §2.2, §2.2, 2nd item, 1st item.
• O. Ledoit, M. Wolf, et al. (2012) Nonlinear shrinkage estimation of large-dimensional covariance matrices. The Annals of Statistics 40 (2), pp. 1024–1060. Cited by: §1.
• O. Ledoit and M. Wolf (2003) Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance 10 (5), pp. 603–621. Cited by: §1.
• O. Ledoit and M. Wolf (2018) Analytical nonlinear shrinkage of large-dimensional covariance matrices. Technical Report 264, Department of Economics, University of Zurich. Cited by: §1.
• O. Ledoit and M. Wolf (2019) Quadratic shrinkage for large covariance matrices. Technical Report 335, Department of Economics, University of Zurich. Cited by: §1, 3rd item, 1st item, 5th item.
• J. Li, J. Zhou, B. Zhang, and X. R. Li (2017) Estimation of high dimensional covariance matrices by shrinkage algorithms. In 2017 20th International Conference on Information Fusion (Fusion), pp. 1–8. Cited by: §1.
• D. V. Lindley (1962) Discussion on Professor Stein's paper. Journal of the Royal Statistical Society: Series B (Methodological) 24, pp. 265–296. Cited by: §2.1.
• B. G. Lindsay (1983) The geometry of mixture likelihoods: a general theory. The Annals of Statistics, pp. 86–94. Cited by: §2.3.
• Y. Liu, X. Sun, and S. Zhao (2017) A covariance matrix shrinkage method with Toeplitz rectified target for DOA estimation under the uniform linear array. AEU – International Journal of Electronics and Communications 81, pp. 50–55. Cited by: §1.
• F. Markowetz and R. Spang (2007) Inferring cellular networks – a review. BMC Bioinformatics 8 (6), pp. S5. Cited by: §3.3.
• X. Mestre (2008) On the asymptotic behavior of the sample estimates of eigenvalues and eigenvectors of covariance matrices. IEEE Transactions on Signal Processing 56 (11), pp. 5353–5368. Cited by: §1.
• A. Osareh and B. Shadgar (2009) Classification and diagnostic prediction of cancers using gene microarray data analysis. Journal of Applied Sciences 9 (3), pp. 459–468. Cited by: §3.3.
• H. Robbins (1951) Asymptotically subminimax solutions of compound statistical decision problems. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. Cited by: §2.1, §2.2, §2.2.
• H. Robbins (1955) An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, pp. 157–164. Cited by: §2.2.
• A. J. Rothman, E. Levina, and J. Zhu (2009) Generalized thresholding of large covariance matrices. Journal of the American Statistical Association 104 (485), pp. 177–186. Cited by: §1.
• S. Saha and A. Guntuboyina (2017) On the nonparametric maximum likelihood estimator for Gaussian location mixture densities with application to Gaussian denoising. arXiv preprint arXiv:1712.02009. Cited by: §4.
• J. Schäfer and K. Strimmer (2005) A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology 4 (1). Cited by: §1.
• S. Tao (2014) Convex duality in nonparametric empirical Bayes estimation and prediction. Ph.D. Thesis, University of Alberta. Cited by: §2.3, §3.2.
• C. Varin, N. Reid, and D. Firth (2011) An overview of composite likelihood methods. Statistica Sinica, pp. 5–42. Cited by: §2.2.
• L. Xue, S. Ma, and H. Zou (2012) Positive-definite $\ell_1$-penalized estimation of large covariance matrices. Journal of the American Statistical Association 107 (500), pp. 1480–1491. Cited by: §1.
• C. Zhang (2003) Compound decision theory and empirical Bayes methods. The Annals of Statistics 31 (2), pp. 379–390. Cited by: §2.2, §2.2.

## Appendix A Proof of Proposition 1

###### Proof.

We first rewrite the risk estimate $\hat{R}(\beta_S, \beta_I)$. Define $u_{jk} = \mathbb{1}\{j = k\}$, the $p^2 \times 2$ matrix $Z = (v_S \; v_I)$, and the vectorized covariance matrices $v_S = \mathrm{vec}(S)$, $v_I = \mathrm{vec}(I)$, and $v_\Sigma = \mathrm{vec}(\Sigma)$. Then the unbiased risk estimator can be re-written as

$$\hat{R}(\beta_S, \beta_I) = \beta^\top (Z^\top Z) \beta - 2 (Z^\top v_S - M)^\top \beta - \mathbf{1}^\top M,$$

where $\beta = (\beta_S, \beta_I)^\top$ and $M = \big(\sum_{j,k=1}^p \hat{\Delta}_{jk}^2, \; 0\big)^\top$. Therefore

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta} \hat{R}(\beta_S, \beta_I) = (Z^\top Z)^{-1} (Z^\top v_S - M), \qquad \hat{v}_\Sigma = Z \hat{\beta} = v_S - Z (Z^\top Z)^{-1} M.$$
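As a sanity check, the closed-form minimizer and the resulting fitted values can be verified numerically. The NumPy sketch below uses arbitrary placeholder inputs for $S$ and the entrywise variance estimates $\hat{\Delta}_{jk}^2$ (illustrative stand-ins, not the estimates from the paper), and checks that the closed form minimizes the quadratic risk estimate and that, because $v_S$ is a column of $Z$, the fitted vector equals $v_S - Z(Z^\top Z)^{-1}M$.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5

# Placeholder inputs: a symmetric "sample covariance" S and
# entrywise variance estimates Delta2 (illustrative only).
A = rng.standard_normal((p, 2 * p))
S = A @ A.T / (2 * p)
Delta2 = np.abs(rng.standard_normal((p, p)))
Delta2 = (Delta2 + Delta2.T) / 2

# Vectorize: the columns of Z are vec(S) and vec(I).
vS = S.ravel()
vI = np.eye(p).ravel()
Z = np.column_stack([vS, vI])
M = np.array([Delta2.sum(), 0.0])
ZtZ = Z.T @ Z

# Quadratic risk estimate, up to an additive constant in beta.
def R(beta):
    return beta @ ZtZ @ beta - 2 * (Z.T @ vS - M) @ beta

# Closed-form minimizer (Z'Z)^{-1}(Z'vS - M).
beta_hat = np.linalg.solve(ZtZ, Z.T @ vS - M)

# The closed form beats random perturbations of itself...
for _ in range(200):
    assert R(beta_hat) <= R(beta_hat + 0.1 * rng.standard_normal(2))

# ...and since vS lies in col(Z), projecting vS onto col(Z)
# returns vS, so Z beta_hat = vS - Z (Z'Z)^{-1} M.
assert np.allclose(Z @ beta_hat, vS - Z @ np.linalg.solve(ZtZ, M))
```

The perturbation check exploits convexity of the quadratic: any deviation from the minimizer can only increase the objective.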

We will need to show that $\hat{\beta}_I = \hat{\mu} b_n^2 / d_n^2$ and $\hat{\beta}_S + \hat{\beta}_I / \hat{\mu} = 1$. Since

$$Z^\top Z = \begin{pmatrix} s_{11} & \cdots & s_{pp} \\ u_{11} & \cdots & u_{pp} \end{pmatrix} \begin{pmatrix} s_{11} & u_{11} \\ \vdots & \vdots \\ s_{pp} & u_{pp} \end{pmatrix} = \begin{pmatrix} \sum_{j,k=1}^p s_{jk}^2 & \sum_{j=1}^p s_{jj} \\ \sum_{j=1}^p s_{jj} & p \end{pmatrix}$$

and $\det(Z^\top Z) = p \sum_{j,k=1}^p s_{jk}^2 - \big(\sum_{j=1}^p s_{jj}\big)^2 = p^3 d_n^2$, it follows that

$$Z^\top v_S = \begin{pmatrix} \sum_{j,k=1}^p s_{jk}^2 \\ \sum_{j=1}^p s_{jj} \end{pmatrix}, \qquad Z^\top v_S - M = \begin{pmatrix} \sum_{j,k=1}^p \big(s_{jk}^2 - \hat{\Delta}_{jk}^2\big) \\ \sum_{j=1}^p s_{jj} \end{pmatrix}.$$

Therefore

$$\begin{aligned} \hat{\beta} &= (Z^\top Z)^{-1} (Z^\top v_S - M) \\ &= \frac{1}{p^3 d_n^2} \begin{pmatrix} p & -\sum_{j=1}^p s_{jj} \\ -\sum_{j=1}^p s_{jj} & \sum_{j,k=1}^p s_{jk}^2 \end{pmatrix} \begin{pmatrix} \sum_{j,k=1}^p \big(s_{jk}^2 - \hat{\Delta}_{jk}^2\big) \\ \sum_{j=1}^p s_{jj} \end{pmatrix} \\ &= \frac{1}{p^3 d_n^2} \begin{pmatrix} p \sum_{j,k=1}^p s_{jk}^2 - p \sum_{j,k=1}^p \hat{\Delta}_{jk}^2 - \big(\sum_{j=1}^p s_{jj}\big)^2 \\ \big(\sum_{j=1}^p s_{jj}\big) \big(\sum_{j,k=1}^p \hat{\Delta}_{jk}^2\big) \end{pmatrix}. \end{aligned}$$

The second component of $\hat{\beta}$ equals $\hat{\beta}_I$, so

$$\hat{\beta}_I = \frac{1}{p^3 d_n^2} \Big(\sum_{j=1}^p s_{jj}\Big) \Big(\sum_{j,k=1}^p \hat{\Delta}_{jk}^2\Big) = \bigg\{ \Big(\sum_{j=1}^p s_{jj}\Big) \Big/ p \bigg\} \bigg\{ \Big(\sum_{j,k=1}^p \hat{\Delta}_{jk}^2\Big) \Big/ p^2 \bigg\} \Big/ d_n^2 = \hat{\mu} \, b_n^2 / d_n^2.$$

Furthermore,

$$\hat{\beta}_I / \hat{\mu} + \hat{\beta}_S = \frac{1}{p^3 d_n^2} \bigg\{ p \sum_{j,k=1}^p s_{jk}^2 - p \sum_{j,k=1}^p \hat{\Delta}_{jk}^2 - \Big(\sum_{j=1}^p s_{jj}\Big)^2 + p \sum_{j,k=1}^p \hat{\Delta}_{jk}^2 \bigg\} = \frac{1}{p^3 d_n^2} \bigg\{ p \sum_{j,k=1}^p s_{jk}^2 - \Big(\sum_{j=1}^p s_{jj}\Big)^2 \bigg\} = 1. \qquad \blacksquare$$
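The two identities at the heart of the proof can also be confirmed numerically. The sketch below again uses placeholder inputs for $S$ and $\hat{\Delta}_{jk}^2$ (not the paper's estimates), with $\hat{\mu} = p^{-1}\sum_j s_{jj}$, $b_n^2 = p^{-2}\sum_{j,k}\hat{\Delta}_{jk}^2$, and $d_n^2$ scaled so that $p^3 d_n^2 = p\sum_{j,k} s_{jk}^2 - \big(\sum_j s_{jj}\big)^2$, the scalings implied by the derivation.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 6

# Placeholder inputs: symmetric S and entrywise variance
# estimates Delta2 (illustrative only).
A = rng.standard_normal((p, 3 * p))
S = A @ A.T / (3 * p)
Delta2 = np.abs(rng.standard_normal((p, p)))
Delta2 = (Delta2 + Delta2.T) / 2

# Vectorized problem and the closed-form coefficients.
vS, vI = S.ravel(), np.eye(p).ravel()
Z = np.column_stack([vS, vI])
M = np.array([Delta2.sum(), 0.0])
beta_S, beta_I = np.linalg.solve(Z.T @ Z, Z.T @ vS - M)

# Scalar quantities, scaled as implied by the derivation.
mu = np.trace(S) / p                   # mu-hat
b2 = Delta2.sum() / p**2               # b_n^2
d2 = (S**2).sum() / p**2 - mu**2 / p   # so det(Z'Z) = p^3 d_n^2

assert np.isclose(np.linalg.det(Z.T @ Z), p**3 * d2)
assert np.isclose(beta_I, mu * b2 / d2)          # beta_I = mu b^2 / d^2
assert np.isclose(beta_S + beta_I / mu, 1.0)     # beta_S + beta_I/mu = 1
```

Both identities hold exactly (up to floating-point error) for any symmetric $S$ and any entrywise variance estimates, since the proof only uses the algebra of the closed form.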