An Empirical Bayes Approach to Shrinkage Estimation on the Manifold of Symmetric Positive-Definite Matrices

07/04/2020 ∙ by Chun-Hao Yang, et al.

The James-Stein estimator is an estimator of the multivariate normal mean and dominates the maximum likelihood estimator (MLE) under squared error loss. The original work inspired great interest in developing shrinkage estimators for a variety of problems. Nonetheless, research on shrinkage estimation for manifold-valued data is scarce. In this paper, we propose shrinkage estimators for the parameters of the Log-Normal distribution defined on the manifold of N × N symmetric positive-definite matrices. For this manifold, we choose the Log-Euclidean metric as its Riemannian metric since it is easy to compute and is widely used in applications. By using the Log-Euclidean distance in the loss function, we derive a shrinkage estimator in an analytic form and show that it is asymptotically optimal within a large class of estimators including the MLE, which is the sample Fréchet mean of the data. We demonstrate the performance of the proposed shrinkage estimator via several simulated data experiments. Furthermore, we apply the shrinkage estimator to perform statistical inference in diffusion magnetic resonance imaging problems.


1 Introduction

Symmetric positive-definite (SPD) matrices are ubiquitous in science and engineering applications. In computer vision they are encountered in the form of covariance matrices; in diffusion magnetic resonance imaging they manifest themselves as diffusion tensors, which model the local diffusion of water molecules; and in morphometry they appear as Cauchy deformation tensors, which model local deformations. In such applications, the statistical analysis of the data must be geometry-aware, i.e. it must employ methods that take into account the nonlinear geometry of the data space. In most data analysis applications, it is useful to describe the entire dataset with a few summary statistics. For data residing in Euclidean space, this may simply be the sample mean; for data residing in non-Euclidean spaces, e.g. Riemannian manifolds, the corresponding statistic is the sample Fréchet mean (FM) (Fréchet, 1948).

The sample FM also plays an important role in different statistical inference methods, e.g. principal geodesic analysis (Fletcher et al., 2003), clustering algorithms, etc. If $(\mathcal{M}, d)$ is a metric space and $x_1, \dots, x_n \in \mathcal{M}$, the sample FM is defined by $\bar{x} = \arg\min_{m \in \mathcal{M}} \sum_{i=1}^{n} d^2(m, x_i)$. For Riemannian manifolds, the distance is usually chosen to be the intrinsic distance induced by the Riemannian metric. The above optimization problem can then be solved by Riemannian gradient descent algorithms (Groisser, 2004; Afsari, 2011; Moakher, 2005). However, Riemannian gradient descent is usually computationally expensive, and efficient recursive algorithms for computing the sample FM on various Riemannian manifolds have been presented by Sturm (2003), Ho et al. (2013), Salehian et al. (2015), Chakraborty and Vemuri (2015), Lim and Pálfia (2014) and Chakraborty and Vemuri (2019).

In $\mathbb{R}^n$ with the Euclidean metric, the sample FM is just the ordinary sample mean. Consider a set of normally distributed random variables. The sample mean is the maximum likelihood estimator (MLE) of the mean of the underlying normal distribution, and the James-Stein (shrinkage) estimator (James and Stein, 1962) was shown to be better (under squared error loss) than the MLE when the dimension is at least three and the covariance matrix of the underlying normal distribution is assumed to be known. Inspired by this result, the goal of this paper is to develop shrinkage estimators for data residing in $P_N$, the space of $N \times N$ SPD matrices.

For the model $X \sim N_p(\theta, \sigma^2 I_p)$, where $\theta$ is unknown and $\sigma^2$ is known, the MLE of $\theta$ is $X$ itself, and it is natural to ask whether the MLE is admissible. Stein (1956) gave a negative answer to this question and provided a class of estimators for $\theta$ that dominate the MLE. Subsequently, James and Stein (1962) proposed the estimator

$$\hat{\theta}_{JS} = \left(1 - \frac{(p-2)\,\sigma^2}{\|X\|^2}\right) X, \qquad (1)$$

which was later referred to as the James-Stein (shrinkage) estimator.
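As a concrete numerical illustration of (1) (not part of the original derivation), the following sketch compares the James-Stein estimator with the MLE on simulated data; the dimension, the prior spread of the means, and $\sigma^2 = 1$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

p, sigma2 = 50, 1.0                                   # dimension and (known) noise variance
theta = rng.normal(0.0, 2.0, size=p)                  # true means
X = theta + rng.normal(0.0, np.sqrt(sigma2), size=p)  # one observation per coordinate

# James-Stein estimator (1): shrink the MLE X toward the origin.
shrinkage = 1.0 - (p - 2) * sigma2 / np.sum(X ** 2)
theta_js = shrinkage * X

mle_risk = np.mean((X - theta) ** 2)
js_risk = np.mean((theta_js - theta) ** 2)
print(f"squared-error loss  MLE: {mle_risk:.3f}  James-Stein: {js_risk:.3f}")
```

For dimensions above two, the James-Stein loss is typically noticeably smaller than that of the MLE, which is the domination result referred to above.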

Ever since the work reported in James and Stein (1962), generalizations or variants of this shrinkage estimator have been developed in a variety of settings. A few directions for generalization are as follows. First, instead of estimating the means of many normal distributions, carry out the simultaneous estimation of parameters (e.g. the mean or the scale parameter) of other distributions, e.g. Poisson, Gamma, etc.; see Johnson (1971), Fienberg and Holland (1973), Clevenson and Zidek (1975), Tsui (1981), Tsui and Press (1982), Brandwein and Strawderman (1990) and Brandwein and Strawderman (1991). More recent works include Xie et al. (2012), Xie et al. (2016), Jing et al. (2016) and Kong et al. (2017). Second, since the MLE of the mean of a normal distribution is also the best translation-equivariant estimator but is inadmissible, an interesting question to ask is, when is the best translation-equivariant estimator admissible? This problem was studied extensively in Stein (1959) and Brown (1966).

Besides estimation of the mean of different distributions, estimation of the covariance matrix (or the precision matrix) of a multivariate normal distribution is an important problem in statistics, finance, engineering and many other fields. The usual estimator, namely the sample covariance matrix, performs poorly in high-dimensional problems, and many researchers have endeavored to improve covariance estimation by applying the concept of shrinkage in this context (Stein, 1975; Haff, 1991; Daniels and Kass, 2001; Ledoit and Wolf, 2003; Ledoit and Wolf, 2012; Donoho et al., 2018). Note that there is a vast literature on covariance estimation and we have cited only a few references here. For a thorough literature review, we refer the interested reader to Donoho et al. (2018).

In order to understand shrinkage estimation fully, one must understand why the process of shrinkage improves estimation. In this context, Efron and Morris presented a series of works to provide an empirical Bayes interpretation by modifying the original James-Stein estimator to suit different problems (Efron and Morris, 1971, 1972b, 1972a, 1973b, 1973a). The empirical Bayes approach to designing a shrinkage estimator can be described as follows. First, reformulate the model as a Bayesian model, i.e. place a prior on the parameters. Then, the hyper-parameters of the prior are estimated from the data. Efron and Morris (1973b) presented several examples of different shrinkage estimators developed within this empirical Bayes framework. An alternative, non-Bayesian, geometric interpretation to Stein shrinkage estimation was presented by Brown and Zhao (2012).

In all the works cited above, the domain of the data has invariably been a vector space and, as mentioned earlier, many applications naturally encounter data residing in non-Euclidean spaces. Hence, generalizing shrinkage estimation to non-Euclidean spaces is a worthwhile pursuit. In this paper, we focus on shrinkage estimation for the Riemannian manifold $P_N$. We assume that the observed SPD matrices are drawn from a Log-Normal distribution defined on $P_N$ (Schwartzman, 2016) and we are interested in estimating the mean and the covariance matrix of this distribution. We derive shrinkage estimators for the parameters of the Log-Normal distribution using an empirical Bayes framework (Xie et al., 2012), which is described in detail subsequently, and show that the proposed estimator is asymptotically optimal within a class of estimators that includes the MLE. We describe simulated data experiments which demonstrate that the proposed shrinkage estimator of the mean of the Log-Normal distribution is better (in terms of risk) than the widely used Riemannian gradient descent-based estimator (Bhattacharya and Patrangenaru, 2003, 2005) and the recently developed inductive/recursive FM estimator presented in Ho et al. (2013). Further, we apply the shrinkage estimator to find group differences between patients with Parkinson's disease and controls (normal subjects) from their respective brain scans acquired using diffusion magnetic resonance imaging (dMRI).

The rest of this paper is organized as follows. In section 2, we present relevant material on the Riemannian geometry of and shrinkage estimation. The main theoretical results are stated in section 3 with the proofs of the theorems relegated to the supplement. In section 4, we demonstrate how the proposed shrinkage estimators perform via several synthetic data examples and present applications to (real data) diffusion tensor imaging (DTI), a clinically popular version of dMRI. Specifically, we apply the proposed shrinkage estimator to the estimation of the brain atlases (templates) of patients with Parkinson’s disease and a control group and identify the regions of the brain where the two groups differ significantly. Finally, in section 5 we discuss our contributions and present some future research directions.

2 Preliminaries

In this section, we first present commonly used notation from Riemannian geometry that will be used throughout the paper. For a detailed exposition of this topic, we refer the reader to the excellent text by Boothby (1986). Then, we briefly review the two most commonly used Riemannian geometries for $P_N$ and refer the reader to Moakher (2005) and Arsigny et al. (2007) for more details. Finally, we review the concept of Stein's unbiased risk estimate, which will form the framework for deriving the shrinkage estimators.

2.1 Definitions and Notations

Let $\mathcal{M}$ be an $n$-dimensional Riemannian manifold. For $p \in \mathcal{M}$, the tangent space of $\mathcal{M}$ at $p$ is denoted by $T_p\mathcal{M}$; this is an $n$-dimensional vector space. The geodesic connecting two points $p$ and $q$ is denoted by $\gamma_{pq}(t)$. It has the property that $\gamma_{pq}(0) = p$ and $\gamma_{pq}(1) = q$. Alternatively, a geodesic can be specified uniquely by its starting point $p$ and its initial direction $v \in T_p\mathcal{M}$; in this case, the geodesic is denoted by $\gamma_v(t)$ and has the properties that $\gamma_v(0) = p$ and $\dot{\gamma}_v(0) = v$. Let $\mathcal{D}_p \subseteq T_p\mathcal{M}$ be the set of tangent vectors $v$ such that $\gamma_v(1)$ exists. The Riemannian Exponential map $\mathrm{Exp}_p : \mathcal{D}_p \to \mathcal{M}$ is defined by $\mathrm{Exp}_p(v) = \gamma_v(1)$ for $v \in \mathcal{D}_p$. Since, for the manifolds considered in this paper, the exponential map is a diffeomorphism (see page 33 in Helgason (2001)), its inverse exists and will be denoted by $\mathrm{Log}_p$. These two maps will be of fundamental importance for the derivation of our shrinkage estimators in the next section. The Riemannian distance induced by the Riemannian metric is denoted by $d(\cdot, \cdot)$. Let $x_1, \dots, x_n \in \mathcal{M}$. The FM of $x_1, \dots, x_n$ is $\bar{x} = \arg\min_{m \in \mathcal{M}} \sum_{i=1}^{n} d^2(m, x_i)$.

Under mild conditions, the FM exists and is unique; for a detailed discussion of these conditions, see Afsari (2011) and Groisser (2004). In most practical situations, these conditions are satisfied.

2.2 Riemannian Geometry of $P_N$

The Riemannian geometry of $P_N$ is discussed in detail by Helgason (2001) and Terras (2012). There are two commonly used Riemannian geometries for $P_N$, the GL-invariant geometry and the Log-Euclidean geometry, and each has its own advantages and disadvantages. We briefly review both for the sake of completeness, and in the rest of the paper we focus on the Log-Euclidean geometry.

The Riemannian manifold $P_N$ can be identified with the quotient space $GL(N)/O(N)$ (Terras, 2012), where $GL(N)$ denotes the general linear group (the group of $N \times N$ non-singular matrices) and $O(N)$ is the orthogonal group (the group of $N \times N$ orthogonal matrices). Hence $P_N$ is a homogeneous space with $GL(N)$ as the group that acts on it, the group action being defined for any $g \in GL(N)$ and $X \in P_N$ by $g \cdot X = gXg^T$. One can now define GL-invariant quantities, such as the GL-invariant Riemannian metric, based on the group action defined above. We begin with the inner product on the tangent space of $P_N$ at a point $X$. The GL-invariant Riemannian metric is given by

$$\langle U, V \rangle_X = \mathrm{tr}(X^{-1} U X^{-1} V) \qquad (2)$$

for $U, V \in T_X P_N$ and $X \in P_N$. The tangent space of $P_N$ at the identity matrix $I$ is the space of $N \times N$ symmetric matrices, denoted by $\mathrm{Sym}(N)$, which is a vector space of dimension $N(N+1)/2$. Hence $P_N$ is an $N(N+1)/2$-dimensional Riemannian manifold. The GL-invariance of this metric is easy to check. With the GL-invariant metric, the induced geodesic distance between $X, Y \in P_N$ is given by

$$d_{GL}(X, Y) = \big\|\mathrm{Log}\big(X^{-1/2} Y X^{-1/2}\big)\big\|_F \qquad (3)$$

where $\mathrm{Log}$ is the matrix logarithm (see Terras (2012)). Since this distance is induced from the GL-invariant metric in (2), it is naturally GL-invariant, i.e. $d_{GL}(gXg^T, gYg^T) = d_{GL}(X, Y)$ for all $g \in GL(N)$. The exponential and log maps on $P_N$ are given by

$$\mathrm{Exp}_X(V) = X^{1/2} \exp\big(X^{-1/2} V X^{-1/2}\big) X^{1/2}, \qquad \mathrm{Log}_X(Y) = X^{1/2} \,\mathrm{Log}\big(X^{-1/2} Y X^{-1/2}\big)\, X^{1/2}, \qquad (4)$$

where $X, Y \in P_N$ and $V \in T_X P_N$.
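To make these formulas concrete, here is a small sketch (not from the paper) implementing the GL-invariant distance (3), the exponential and log maps (4), and a Riemannian gradient-descent computation of the sample FM discussed in the introduction; the step size, tolerance, and eigendecomposition-based matrix functions are implementation choices.

```python
import numpy as np

def _sym_apply(X, f):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, V = np.linalg.eigh(X)
    return (V * f(w)) @ V.T

def gl_distance(X, Y):
    """GL-invariant geodesic distance (3): ||Log(X^{-1/2} Y X^{-1/2})||_F."""
    Xis = _sym_apply(X, lambda w: 1.0 / np.sqrt(w))
    return np.linalg.norm(_sym_apply(Xis @ Y @ Xis, np.log), 'fro')

def gl_log_map(X, Y):
    """Riemannian log map in (4): Log_X(Y) = X^{1/2} Log(X^{-1/2} Y X^{-1/2}) X^{1/2}."""
    Xs = _sym_apply(X, np.sqrt)
    Xis = _sym_apply(X, lambda w: 1.0 / np.sqrt(w))
    return Xs @ _sym_apply(Xis @ Y @ Xis, np.log) @ Xs

def gl_exp_map(X, V):
    """Riemannian exponential map in (4): Exp_X(V) = X^{1/2} exp(X^{-1/2} V X^{-1/2}) X^{1/2}."""
    Xs = _sym_apply(X, np.sqrt)
    Xis = _sym_apply(X, lambda w: 1.0 / np.sqrt(w))
    S = Xis @ V @ Xis
    S = (S + S.T) / 2.0            # symmetrize against round-off before the eigendecomposition
    return Xs @ _sym_apply(S, np.exp) @ Xs

def frechet_mean_gl(mats, step=1.0, tol=1e-10, max_iter=100):
    """Sample Frechet mean under the GL-invariant metric via Riemannian gradient descent."""
    M = np.mean(mats, axis=0)      # initialize at the Euclidean average
    for _ in range(max_iter):
        # descent direction: mean of the Riemannian log maps (negative gradient up to a constant)
        grad = np.mean([gl_log_map(M, X) for X in mats], axis=0)
        if np.linalg.norm(grad, 'fro') < tol:
            break
        M = gl_exp_map(M, step * grad)
    return M
```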

Arsigny et al. (2007) proposed the Log-Euclidean metric on the manifold $P_N$ as an alternative to the GL-invariant metric. The Log-Euclidean metric is a bi-invariant Riemannian metric on the Lie group $(P_N, \odot)$, where $X \odot Y = \exp(\mathrm{Log}\, X + \mathrm{Log}\, Y)$. The intrinsic distance induced by the Log-Euclidean metric has a very simple form, namely

$$d_{LE}(X, Y) = \|\mathrm{Log}\, X - \mathrm{Log}\, Y\|_F,$$

where $\|\cdot\|_F$ is the Frobenius norm. Consider the map $\mathrm{vecd} : \mathrm{Sym}(N) \to \mathbb{R}^{N(N+1)/2}$ that stacks the diagonal entries of a symmetric matrix together with its off-diagonal entries multiplied by $\sqrt{2}$ (Schwartzman, 2016). This map is an isomorphism between $\mathrm{Sym}(N)$, equipped with the Frobenius inner product, and $\mathbb{R}^{N(N+1)/2}$, equipped with the Euclidean inner product. To make the notation more concise, for $X \in P_N$ we write $\mathbf{x} = \mathrm{vecd}(\mathrm{Log}\, X)$. From the definition of $\mathrm{vecd}$, we see that $d_{LE}(X, Y) = \|\mathbf{x} - \mathbf{y}\|$.
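The following short sketch illustrates the Log-Euclidean distance and the vecd identification just described, using the $\sqrt{2}$-scaling of off-diagonal entries so that the map is norm-preserving; the function names are ours, not necessarily the paper's.

```python
import numpy as np

def sym_log(X):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def vecd(S):
    """Map a symmetric N x N matrix to R^{N(N+1)/2}, scaling off-diagonals by sqrt(2)
    so that the Euclidean norm of vecd(S) equals the Frobenius norm of S."""
    iu = np.triu_indices(S.shape[0], k=1)
    return np.concatenate([np.diag(S), np.sqrt(2.0) * S[iu]])

def log_euclidean_distance(X, Y):
    """d_LE(X, Y) = ||Log X - Log Y||_F, i.e. the Euclidean distance of the vecd images."""
    return np.linalg.norm(vecd(sym_log(X)) - vecd(sym_log(Y)))
```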

Given $X_1, \dots, X_n \in P_N$, we denote the sample FMs with respect to the two intrinsic distances given above by

$$\bar{X}_{GL} = \arg\min_{M \in P_N} \sum_{i=1}^{n} d_{GL}^2(M, X_i), \qquad (5)$$

$$\bar{X}_{LE} = \arg\min_{M \in P_N} \sum_{i=1}^{n} d_{LE}^2(M, X_i) = \exp\!\Big(\frac{1}{n}\sum_{i=1}^{n} \mathrm{Log}\, X_i\Big). \qquad (6)$$

2.3 The Log-Normal Distribution on $P_N$

In this work, we assume that the observed SPD matrices follow the Log-Normal distribution introduced by Schwartzman (2006), which can be viewed as a generalization of the Log-Normal distribution on the positive reals to $P_N$. The definition of this Log-Normal distribution is as follows.

Definition 1

Let $X$ be a $P_N$-valued random variable. We say $X$ follows a Log-Normal distribution with mean $M \in P_N$ and covariance matrix $\Sigma$, written $X \sim LN(M, \Sigma)$, if $\mathrm{vecd}(\mathrm{Log}\, X) \sim N\big(\mathrm{vecd}(\mathrm{Log}\, M), \Sigma\big)$.

From the definition, it is easy to see that $E[\mathrm{vecd}(\mathrm{Log}\, X)] = \mathrm{vecd}(\mathrm{Log}\, M)$ and $\mathrm{Cov}[\mathrm{vecd}(\mathrm{Log}\, X)] = \Sigma$. Some important results regarding this distribution were obtained in Schwartzman (2016). The following proposition from Schwartzman (2016), which gives the MLEs of the parameters, will be useful subsequently in this work.

Proposition 1

Let $X_1, \dots, X_n \overset{iid}{\sim} LN(M, \Sigma)$. Then the MLEs of $M$ and $\Sigma$ are $\hat{M} = \exp\!\big(\tfrac{1}{n}\sum_{i=1}^{n} \mathrm{Log}\, X_i\big)$ and $\hat{\Sigma} = \tfrac{1}{n}\sum_{i=1}^{n} (\mathbf{x}_i - \hat{\mathbf{m}})(\mathbf{x}_i - \hat{\mathbf{m}})^T$, where $\mathbf{x}_i = \mathrm{vecd}(\mathrm{Log}\, X_i)$ and $\hat{\mathbf{m}} = \mathrm{vecd}(\mathrm{Log}\, \hat{M})$. The MLE of $M$ is the sample FM under the Log-Euclidean metric, i.e. $\hat{M} = \bar{X}_{LE}$.
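A minimal sketch of the MLEs in Proposition 1, under the reading given above (log-domain sample mean for the mean parameter and the sample covariance of the vecd coordinates, normalized by $n$, for the covariance parameter); the normalization convention is an assumption.

```python
import numpy as np

def sym_log(X):
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def sym_exp(S):
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T

def vecd(S):
    iu = np.triu_indices(S.shape[0], k=1)
    return np.concatenate([np.diag(S), np.sqrt(2.0) * S[iu]])

def lognormal_mle(mats):
    """MLEs of (M, Sigma) for an iid sample from the Log-Normal distribution on P_N:
    M-hat is the Log-Euclidean sample Frechet mean, Sigma-hat is the sample covariance
    of the vecd-transformed matrix logarithms (normalized by n)."""
    logs = np.array([sym_log(X) for X in mats])
    M_hat = sym_exp(logs.mean(axis=0))
    coords = np.array([vecd(L) for L in logs])
    centered = coords - coords.mean(axis=0)
    Sigma_hat = centered.T @ centered / len(mats)
    return M_hat, Sigma_hat
```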

2.4 Bayesian Formulation of Shrinkage Estimation in $\mathbb{R}^n$

As discussed earlier, the James-Stein estimator originated from the problem of simultaneous estimation of multiple means of (univariate) normal distributions. The derivation relied heavily on the properties of the univariate normal distribution. Later on, Efron and Morris (1973b) gave an empirical Bayes interpretation of the James-Stein estimator, which is obtained by considering the hierarchical model

$$X_i \mid \theta_i \sim N(\theta_i, \sigma^2), \qquad \theta_i \overset{iid}{\sim} N(\mu, \tau^2), \qquad i = 1, \dots, p,$$

where $\sigma^2$ is known and $\mu$ and $\tau^2$ are unknown. The posterior mean for $\theta_i$ is

$$E[\theta_i \mid X_i] = \frac{\sigma^2}{\sigma^2 + \tau^2}\,\mu + \frac{\tau^2}{\sigma^2 + \tau^2}\,X_i. \qquad (7)$$

The parametric empirical Bayes method for estimating the $\theta_i$'s consists of first estimating the prior parameters $\mu$ and $\tau^2$ and then substituting them into (7). The prior parameters $\mu$ and $\tau^2$ can be estimated by the MLE. For the special case of $\mu = 0$, this method produces an estimator similar to the James-Stein estimator (1). Although this estimator is derived in an (empirical) Bayesian framework, it is of interest to determine whether it has good frequentist properties. For example, if we specify a loss function and consider the induced risk function, one would like to determine whether the estimator has uniformly smallest risk within a reasonable class of estimators. In (7), the optimal (oracle) choice of $\mu$ and $\tau^2$ minimizes the risk, but it depends on the unknown $\theta_i$'s. Instead of minimizing the risk function directly, we will minimize Stein's unbiased risk estimate (SURE) (Stein, 1981), denoted by $\mathrm{SURE}(\mu, \tau^2)$, whose expectation equals the risk. Thus, we will use the hyperparameter values that minimize SURE, i.e. $(\hat{\mu}, \hat{\tau}^2) = \arg\min_{\mu, \tau^2} \mathrm{SURE}(\mu, \tau^2)$.

The challenging part of this endeavor is to derive SURE, which depends heavily on the risk function and the underlying distribution of the data. This approach has been used to derive estimators for many models. For example, Xie et al. (2012) derived the (asymptotically) optimal shrinkage estimator for a heteroscedastic hierarchical model, and their result was further generalized in Jing et al. (2016) and Kong et al. (2017).
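For reference, the following sketch illustrates the SURE approach in the Euclidean setting of this subsection, in the spirit of Xie et al. (2012): the estimator (7) with hyperparameters chosen by minimizing an unbiased risk estimate, written here for the heteroscedastic normal means model with known variances $A_i$. The SURE expression below, the optimizer, and the toy data are illustrative; they are not the paper's derivation.

```python
import numpy as np
from scipy.optimize import minimize

def sure(params, X, A):
    """Unbiased risk estimate for the estimator
    theta_hat_i = (tau2/(A_i+tau2)) X_i + (A_i/(A_i+tau2)) mu, with X_i ~ N(theta_i, A_i)."""
    mu, log_tau2 = params
    tau2 = np.exp(log_tau2)                 # keep tau^2 positive
    b = A / (A + tau2)                      # shrinkage factors toward mu
    return np.mean(b ** 2 * (X - mu) ** 2 + A - 2.0 * A * b)

def eb_sure_estimate(X, A):
    """Empirical Bayes estimate (7) with (mu, tau^2) chosen by minimizing SURE."""
    x0 = np.array([X.mean(), np.log(max(X.var() - A.mean(), 1e-6))])
    res = minimize(sure, x0, args=(X, A), method="Nelder-Mead")
    mu, tau2 = res.x[0], np.exp(res.x[1])
    b = A / (A + tau2)
    return (1.0 - b) * X + b * mu

# toy example
rng = np.random.default_rng(1)
p = 200
A = rng.uniform(0.5, 2.0, size=p)           # known sampling variances
theta = rng.normal(1.0, 1.0, size=p)        # true means
X = rng.normal(theta, np.sqrt(A))
theta_hat = eb_sure_estimate(X, A)
print("MLE risk:", np.mean((X - theta) ** 2),
      " SURE-shrinkage risk:", np.mean((theta_hat - theta) ** 2))
```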

3 An Empirical Bayes Shrinkage Estimator for Log-Normal Distributions

In this section, we consider the model

and develop shrinkage estimators for the means $M_i$ and the covariance matrices $\Sigma_i$. These observations are $P_N$-valued random matrices. For completeness, we first briefly review the shrinkage estimator proposed by Yang and Vemuri (2019) for the means, obtained by assuming $\Sigma_i = \sigma_i^2 I$, where the $\sigma_i^2$'s are known positive numbers and $I$ is the identity matrix. The assumption of known covariance matrices is useful when the sample size is small, since for small sample sizes the MLE of a covariance matrix is very unstable. Next, we present estimators for both the means and the covariance matrices. Besides presenting these estimators, we establish asymptotic optimality results for the proposed estimators. To be more precise, we show that the proposed estimators are asymptotically optimal within a large class of estimators containing the MLE.

Another related interesting problem often encountered in practice involves group testing and estimating the “difference” between the two given groups. Consider the model

where the ’s and ’s are independent. We want to estimate the differences between the two group means for each index and select the indices for which the differences are large. However, the estimates for the selected indices tend to overestimate the corresponding true differences. The bias introduced by the selection process is termed selection bias (Dawid, 1994). The selection bias originates from the fact that there are two possible reasons for a selected difference to be large: (i) the true difference is large, and (ii) the random error contained in the estimate is large. Tweedie’s formula (Efron, 2011), which we briefly review in section 3.3, deals with precisely this selection bias in the context of the normal means problem. In this work, we apply an analogue of Tweedie’s formula designed for the context of SPD matrices.

3.1 An Estimator of $M_i$ When $\Sigma_i$ Is Known

For completeness, we briefly review the work of Yang and Vemuri (2019), in which the authors presented an estimator of the means under the assumption that $\Sigma_i = \sigma_i^2 I$, where the $\sigma_i^2$'s are known positive numbers. Under this assumption, they considered the class of estimators given by

(8)

where and . Using the Log-Euclidean distance as the loss function, they showed that the SURE for the corresponding risk function is given by

Hence, and can be estimated by

(9)

Their shrinkage estimator for is given by

(10)
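Since equations (8)–(10) are not reproduced above, the following sketch only illustrates one natural reading of such an estimator: each matrix logarithm is pulled toward a common symmetric matrix $\mu$, with a weight determined by a hyperparameter $\lambda$ and the known variance $\sigma_i^2$ exactly as in the posterior mean (7), and the result is mapped back to $P_N$ with the matrix exponential. The function and parameter names are hypothetical, and the exact weighting in Yang and Vemuri (2019) may differ.

```python
import numpy as np

def sym_log(X):
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def sym_exp(S):
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T

def shrink_spd(mats, sigma2, lam, mu_sym):
    """Hypothetical log-domain shrinkage: each Log X_i is pulled toward the common
    symmetric matrix mu_sym, more strongly when its known variance sigma2_i is large."""
    out = []
    for X, s2 in zip(mats, sigma2):
        w = lam / (lam + s2)                              # weight on the observation
        out.append(sym_exp(w * sym_log(X) + (1.0 - w) * mu_sym))
    return out
```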

They also presented the following two theorems showing the asymptotic optimality of the shrinkage estimator.

Theorem 1

Assume the following conditions:

  1. ,

  2. ,

  3. for some .

Then,

Theorem 2

If assumptions (i), (ii) and (iii) in Theorem 1 hold, then

3.2 Estimators for $M_i$ and $\Sigma_i$

In Yang and Vemuri (2019), the covariance matrices of the underlying distributions were assumed to be known to simplify the derivation. In real applications, however, the covariance matrices are rarely known and must be estimated in practice. In this paper, we consider the general case of unknown covariance matrices, which is more challenging and more pertinent to real applications. Let

(11)

for and . The prior for is called the Log-Normal-Inverse-Wishart (LNIW) prior; it is motivated by the normal-inverse-Wishart prior in the Euclidean setting. The main reason for choosing the LNIW prior over others is its conjugacy, which leads to closed-form expressions for our estimators. Let

(12)

Then the posterior distributions of and are given by

and the posterior means for and are given by

(13)
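Since the posterior means in (13) are not reproduced above, the following sketch shows, for reference, the standard normal-inverse-Wishart conjugate update applied to the vecd coordinates of the matrix logarithms; this is the Euclidean analogue that motivates the LNIW prior, and the hyperparameter names (mu0, kappa0, nu0, Psi0) are generic NIW notation rather than the paper's.

```python
import numpy as np

def niw_posterior_means(xs, mu0, kappa0, nu0, Psi0):
    """Posterior means of (mean, covariance) under a normal-inverse-Wishart prior,
    given iid observations xs (rows are vecd'd matrix logarithms in our setting).
    Requires nu0 + n > d + 1 for the inverse-Wishart posterior mean to exist."""
    xs = np.asarray(xs)
    n, d = xs.shape
    xbar = xs.mean(axis=0)
    S = (xs - xbar).T @ (xs - xbar)                   # scatter matrix

    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n        # posterior mean of the mean
    diff = (xbar - mu0).reshape(-1, 1)
    Psi_n = Psi0 + S + (kappa0 * n / kappa_n) * (diff @ diff.T)

    Sigma_post_mean = Psi_n / (nu_n - d - 1)          # mean of the inverse-Wishart posterior
    return mu_n, Sigma_post_mean
```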

Consider the loss function

Its induced risk function is given by

with the detailed derivation given in the supplemental material. The SURE for this risk function is

with the detailed derivation given in the supplemental material.

The hyperparameter vector

is estimated by minimizing SURE, and the resulting shrinkage estimators of the means and covariance matrices are obtained by plugging the minimizing vector into (13). Note that this is a non-convex optimization problem, and for such problems convergence relies heavily on the choice of initialization. We suggest the following initialization, which is discussed in the supplemental material:

In all our experiments, the algorithm converged in fewer than 20 iterations with the suggested initialization. This concludes the description of our estimators of the unknown means and covariance matrices. Theorem 3 below states that SURE approximates the true loss well, in the sense that the difference between the two random variables converges to 0 in probability as the number of units grows. Additionally, Theorem 4 below shows that the estimators of the means and covariance matrices obtained by minimizing SURE are asymptotically optimal in the class of estimators of the form (13).

Theorem 3

Assume the following conditions:

  1. ,

  2. ,

  3. for some .

Then

Theorem 4

If assumptions (i), (ii), and (iii) in Theorem 3 hold, then

Remark: The proofs of Theorems 1 and 2 in Yang and Vemuri (2019) use arguments similar to those that already exist in the literature, and in that sense they are not very difficult. In contrast, the proofs of our Theorems 3 and 4 above do not proceed along familiar lines. Indeed, they are rather complicated, the difficulty being that bounding the moments of Wishart matrices, or of traces of Wishart matrices, is nontrivial when the orders of the required moments are higher than two. We present these proofs in the supplementary document.

3.3 Tweedie’s Formula for $T^2$ Statistics

One of the motivations for the development of our approach for estimating and is a problem in neuroimaging involving detection of differences between a patient group and a control group. The problem can be stated as follows. There are patients in a disease group and normal subjects in a control group. We consider a region of the brain image consisting of voxels. As explained in section 4.2, the local diffusional property of water molecules in the human brain is of clinical importance and it is common to capture this diffusional property at each voxel in the diffusion magnetic resonance image (dMRI) via a zero-mean Gaussian with a covariance matrix. Using any of the existing state-of-the-art dMRI analysis techniques, it is possible to estimate, from each patient image, the diffusion tensor corresponding to voxel , for . Let and denote the diffusion tensors corresponding to voxel for the disease and control groups respectively. The goal is to identify the indices for which the difference between and is large. The model we consider is

In this work, we compute the Hotelling $T^2$ statistic for each voxel as a measure of the difference between the two group means. The Hotelling $T^2$ statistic is given by

(14)

where and are the FMs of the two groups and is the pooled estimate of the common covariance matrix, with the group estimates computed using (12). Since the ’s and ’s are Log-normally distributed, one can easily verify that the sampling distribution of the statistic is given by

where is a degrees of freedom parameter (Johnson and Wichern, 2007) (recall that ). Note that we make the assumption that the covariance matrices for the two groups are the same, i.e. . Similar results can be obtained for the unequal covariance case, but with more complicated expressions for the statistics and the degrees of freedom parameters. The ’s are the non-centrality parameters for the non-central distribution and are given by

These non-centrality parameters can be interpreted as the (squared) differences between and , and they are the parameters we would like to estimate using the statistics (14) computed from the data. Then, based on the estimates , we select those ’s with large estimates, say the largest of all . However, the process of selection from the computed estimates introduces a selection bias (Dawid, 1994). The selection bias comes from the fact that it is possible to select some indices ’s for which the actual ’s are not large but the random errors are large, so that the estimates are pushed away from the true parameters . There are several ways to correct this selection bias, and Efron (2011) proposed to use Tweedie’s formula for such a purpose.
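As a concrete illustration of the statistic in (14), the following sketch computes a two-sample Hotelling $T^2$ value at a single voxel, assuming each subject's tensor has already been mapped to its vecd'd matrix logarithm; the pooled-covariance convention (division by $n + m - 2$) is the usual one and may differ in detail from (12).

```python
import numpy as np

def hotelling_t2(x, y):
    """Two-sample Hotelling T^2 statistic for the rows of x (disease group) and
    y (control group); here each row is vecd(Log D) for one subject's tensor at a voxel."""
    n, m = len(x), len(y)
    xbar, ybar = x.mean(axis=0), y.mean(axis=0)
    Sx = (x - xbar).T @ (x - xbar)
    Sy = (y - ybar).T @ (y - ybar)
    S_pooled = (Sx + Sy) / (n + m - 2)               # pooled covariance estimate
    diff = xbar - ybar
    return (n * m) / (n + m) * diff @ np.linalg.solve(S_pooled, diff)
```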

Tweedie’s formula was first proposed by Robbins (1956), and we review it here in the context of the classical normal means problem, which is stated as follows. We observe $X_i \sim N(\theta_i, \sigma^2)$, $i = 1, \dots, p$, where the $\theta_i$'s are unknown and $\sigma^2$ is known, and the goal is to estimate the $\theta_i$'s. In the empirical Bayes approach to this problem we assume that the $\theta_i$'s are iid according to some distribution $G$. The marginal density of the $X_i$'s is then $m(x) = \int \varphi_\sigma(x - \theta)\, dG(\theta)$, where $\varphi_\sigma$ is the density of the $N(0, \sigma^2)$ distribution. With this notation, if $G$ is known (so that $m$ is known), the best estimator of $\theta_i$ (under squared error loss) is the so-called Tweedie estimator given by

$$E[\theta_i \mid X_i = x] = x + \sigma^2 \frac{d}{dx} \log m(x).$$

A feature of this estimator is that it depends on $G$ only through the marginal density $m$, and this is desirable because it is fairly easy to estimate $m$ from the $X_i$'s (so we do not need to specify $G$). Another interesting observation is that this estimator shrinks the MLE $X_i$ and can be viewed as a generalization of the James-Stein estimator, which corresponds to taking $G$ to be a normal distribution with mean 0 and unknown variance. The Tweedie estimator can be generalized to exponential families. Suppose that $X_i$ has density $f(x \mid \eta_i) = \exp(\eta_i x - \psi(\eta_i))\, f_0(x)$, and the prior for $\eta_i$ is $G$. Then the Tweedie estimator for $\eta_i$ is

$$E[\eta_i \mid X_i = x] = l'(x) - l_0'(x),$$

where $l(x)$ is the log of the marginal likelihood of the $X_i$'s and $l_0(x) = \log f_0(x)$.
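A small sketch of the Tweedie estimator for the normal means problem described above, with the marginal density $m$ estimated by a Gaussian kernel density estimate and the derivative of $\log m$ approximated by a central finite difference; the bandwidth and step size are illustrative choices (section 3.3 instead uses Lindsey's method for the density estimate).

```python
import numpy as np
from scipy.stats import gaussian_kde

def tweedie_normal(X, sigma2, eps=1e-3):
    """Tweedie estimator E[theta_i | X_i] = X_i + sigma^2 * d/dx log m(X_i), with the
    marginal density m estimated by a Gaussian KDE and d/dx log m by a finite difference."""
    kde = gaussian_kde(X)
    score = (np.log(kde(X + eps)) - np.log(kde(X - eps))) / (2 * eps)
    return X + sigma2 * score

# toy example: theta_i ~ N(0, 2), X_i ~ N(theta_i, 1)
rng = np.random.default_rng(2)
theta = rng.normal(0.0, np.sqrt(2.0), size=2000)
X = rng.normal(theta, 1.0)
theta_hat = tweedie_normal(X, sigma2=1.0)
print("MLE risk:", np.mean((X - theta) ** 2),
      " Tweedie risk:", np.mean((theta_hat - theta) ** 2))
```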

Although this formula is elegant and useful, it applies only to exponential families. Recently, Du and Hu (2019) derived a Tweedie-type formula for non-central statistics, for situations where one is interested in estimating the non-centrality parameters. Suppose and . Then,

(15)

where is the marginal log-likelihood of the ’s (see Theorem 1 in Du and Hu (2019)).

For our situation, if we define , then

(recall that ). Assume that the ’s are iid according to some distribution . We would now like to address the problem of how to obtain an empirical Bayes estimate of . Let be the cumulative distribution function (cdf) of the non-central distribution, and let be the cdf of the non-central distribution. Then the transformed variable follows a non-central distribution with degrees of freedom parameter and non-centrality parameter , and we note that when the degrees of freedom are large the two cdfs are nearly equal, so that this quantile transformation is nearly the identity. However, the transformation depends on , which is the parameter to be estimated, so we propose the following iterative algorithm for estimating it. Let be the estimate at the -th iteration. Then, our iterative update is given by

where , , and . Now the marginal log-likelihood is not available, since the prior is unknown. There are several ways to estimate the marginal density of the ’s. One of these is kernel density estimation. However, the iterative formula involves the first and second derivatives of the marginal log-likelihood, and estimates of the derivatives of a density produced through kernel methods are notoriously unstable (see Chapter 3 of Silverman (1986)). There exist different approaches to deal with this problem (see Sasaki et al. (2016) and Shen and Ghosal (2017)). Here we follow Efron (2011) and postulate that the marginal log-density is well approximated by a polynomial of degree , written with unknown coefficients. The coefficients can be estimated via Lindsey’s method (Efron and Tibshirani, 1996), which is a Poisson regression technique for (parametric) density estimation; the constant coefficient is determined by the requirement that the density integrates to 1. The advantage of Lindsey’s method over methods that use kernel density estimation is that it does not require us to estimate the derivatives separately, since the derivatives of the fitted polynomial are available in closed form. In our experience, with the derivatives estimated in this way, if we initialize the scheme by setting the starting value to the estimate given by the Du and Hu (2019) procedure, then the algorithm converges in fewer than 10 iterations.
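A sketch of Lindsey's method as used here: bin the statistics, regress the bin counts on a polynomial basis of the bin centers with a Poisson GLM, and read off the estimated log-density (up to an additive constant) together with its first and second derivatives from the fitted polynomial. The polynomial degree, the number of bins, and the use of statsmodels are illustrative choices.

```python
import numpy as np
import statsmodels.api as sm

def lindsey_log_density(z, degree=5, n_bins=60):
    """Estimate the log marginal density of the observations z (up to an additive constant)
    as a degree-`degree` polynomial via Lindsey's method: Poisson regression of histogram
    counts on powers of the bin centers. Returns the log density and its first two derivatives."""
    counts, edges = np.histogram(z, bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2
    basis = np.vander(centers, degree + 1, increasing=True)     # columns 1, x, x^2, ..., x^degree
    fit = sm.GLM(counts, basis, family=sm.families.Poisson()).fit()
    beta = np.asarray(fit.params)                               # beta[0] absorbs bin width / sample size

    poly = np.polynomial.polynomial
    def l(x):   # estimated log density, up to an additive constant
        return poly.polyval(x, beta)
    def l1(x):  # first derivative of the log density
        return poly.polyval(x, poly.polyder(beta))
    def l2(x):  # second derivative of the log density
        return poly.polyval(x, poly.polyder(beta, 2))
    return l, l1, l2
```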

4 Experimental Results

In this section, we describe the performance of our methods on two synthetic data sets and on two sets of real data acquired using diffusion magnetic resonance brain scans of normal (control) subjects and patients with Parkinson's disease (PD). For the synthetic data experiments, we show that (i) the proposed shrinkage estimator for the FM (SURE.Full-FM; with simultaneous estimation of the covariance matrices) outperforms the sample FMs based on the Log-Euclidean metric (FM.LE) and the GL-invariant metric (FM.GL), as well as the shrinkage estimator proposed by Yang and Vemuri (2019) (SURE-FM; with fixed covariance matrices) (section 4.1.1), and (ii) the shrinkage estimates of the differences capture the regions that are significantly different between two groups of SPD matrix-valued images (section 4.1.2). For the real data experiments, we demonstrate that (iii) SURE.Full-FM provides an improvement over the three competing estimators (FM.LE, FM.GL, and SURE-FM) for computing atlases (templates) of diffusion tensor images acquired from human brains (section 4.2.1), and (iv) the proposed shrinkage estimator for detecting group differences is able to identify the regions that differ between patients with PD and control subjects (section 4.2.2). Details of these experiments are presented subsequently.

4.1 Synthetic Data Experiments

We present two synthetic data experiments here to show that the proposed shrinkage estimator, SURE.Full-FM, outperforms the sample FMs and SURE-FM and that the shrinkage estimates of the group differences can accurately localize the regions that are significantly different between the two groups.

4.1.1 Comparison Between SURE.Full-FM and Competing Estimators

Using synthesized noisy SPD fields () as data, here we present performance comparisons of four estimators of : (i) SURE.Full-FM, which is the proposed shrinkage estimator, (ii) SURE-FM proposed by Yang and Vemuri (2019) which assumes known covariance matrices, (iii) the MLE, which is denoted by FM.LE, since by proposition 1 it is the FM based on the Log-Euclidean metric, and (iv) the sample FM computed using the GL-invariant metric (using the recursive algorithm in Ho et al. (2013)), denoted by FM.GL. The synthetic data are generated according to (11). Specifically, we set , , and

, and we vary the variance