# Ensemble minimaxity of James-Stein estimators

This article discusses estimation of a multivariate normal mean based on heteroscedastic observations. Under heteroscedasticity, estimators that shrink more on the coordinates with larger variances seem desirable. Although they are not necessarily minimax in the ordinary sense, we show that such James-Stein type estimators can be ensemble minimax, i.e., minimax with respect to the ensemble risk, a notion related to the empirical Bayes perspective of Efron and Morris.

03/19/2020


## 1 Introduction

Let $X \sim N_p(\theta, \Sigma)$, where $\theta \in \mathbb{R}^p$ and $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_p^2)$. Let us assume

$$\sigma_1^2 > \sigma_2^2 > \cdots > \sigma_p^2. \tag{1.1}$$

We are interested in the estimation of $\theta$ with respect to the ordinary squared error loss function

$$L(\delta, \theta) = \lVert \delta - \theta \rVert^2, \tag{1.2}$$

where the risk of an estimator is $R(\delta, \theta) = E\bigl[L(\delta(X), \theta)\bigr]$. The MLE $X$, with constant risk $\sum_{i=1}^p \sigma_i^2$, is known to be extended Bayes and hence minimax for any $p$ and any $\Sigma$.

In the homoscedastic case $\Sigma = \sigma^2 I$, James and Stein (1961) showed that the shrinkage estimator

$$\left(1 - \frac{c\,\sigma^2}{\lVert X \rVert^2}\right)X \tag{1.3}$$

dominates the MLE for $p \ge 3$ and $c \in (0, 2(p-2))$. There is some literature discussing the minimax properties of shrinkage estimators under heteroscedasticity. Brown (1975) showed that the James-Stein estimator (1.3) is not necessarily minimax when the variances are not equal. Specifically, it is not minimax for any $c > 0$ when $\sum_i \sigma_i^2 \le 2\sigma_1^2$. Berger (1976) showed that

$$\left(I - \frac{c\,\Sigma^{-1}}{X^{\mathsf T}\Sigma^{-2}X}\right)X \quad \text{for } c \in (0, 2(p-2)) \tag{1.4}$$

is minimax for $p \ge 3$ and any $\Sigma$. However, Casella (1980) argued that the estimator (1.4) may not be desirable even if it is minimax. Ordinary minimax estimators, as in (1.4), typically shrink most on the coordinates with smaller variances. From Casella's (1980) viewpoint, one of the most natural James-Stein variants is

$$\left(I - \frac{c\,\Sigma}{\lVert X \rVert^2}\right)X \quad \text{for } c > 0, \tag{1.5}$$

which we are going to rescue by establishing some minimax properties related to a Bayesian viewpoint.

In many applications, $\theta_1, \dots, \theta_p$ are thought to follow some exchangeable prior distribution $\pi$. It is then natural to consider the compound risk function, which is the Bayes risk with respect to the prior $\pi$:

$$\bar R(\delta, \pi) = \int_{\mathbb{R}^p} R(\delta, \theta)\, \pi(d\theta). \tag{1.6}$$

Efron and Morris (1971, 1972a, 1972b, 1973) addressed this problem from both the Bayes and empirical Bayes perspectives. In particular, they considered the prior distribution $N_p(0, \tau I_p)$ with unknown $\tau$, and used the term "ensemble risk" for the compound risk. By introducing a set of ensemble risks

$$\bar R(\delta, \tau) = \int_{\mathbb{R}^p} R(\delta, \theta)\, \frac{1}{(2\pi\tau)^{p/2}} \exp\!\left(-\frac{\lVert\theta\rVert^2}{2\tau}\right) d\theta, \tag{1.7}$$

we can define ensemble minimaxity with respect to the set of priors

$$\mathcal{P}^\star = \{N_p(0, \tau I_p) : \tau \in (0, \infty)\}, \tag{1.8}$$

that is, an estimator $\delta$ is said to be ensemble minimax with respect to $\mathcal{P}^\star$ if

$$\sup_{\tau \in (0,\infty)} \bar R(\delta, \tau) = \inf_{\delta'} \sup_{\tau \in (0,\infty)} \bar R(\delta', \tau). \tag{1.9}$$

As a matter of fact, the second author, in the unpublished manuscript Brown, Nie and Xie (2011), had already introduced the concept of ensemble minimaxity. In this article, we follow their spirit but propose a simpler and clearer approach for establishing the ensemble minimaxity of estimators.

Our article is organized as follows. In Section 2, we elaborate the definition of ensemble minimaxity and explain Casella's (1980) viewpoint on the contradiction between minimaxity and well-conditioning. In Section 3, we show the ensemble minimaxity of various shrinkage estimators, including a variant of the James-Stein estimator

$$\left(I - \frac{(p-2)\,\Sigma}{(p-2)\sigma_1^2 + \lVert X \rVert^2}\right)X \tag{1.10}$$

as well as the generalized Bayes estimator with respect to the hierarchical prior

$$\theta \mid \lambda \sim N_p\bigl(0, (\sigma_1^2/\lambda) I - \Sigma\bigr), \qquad \pi(\lambda) \propto \lambda^{-2} I_{(0,1)}(\lambda), \tag{1.11}$$

which is a generalization of the harmonic prior to the heteroscedastic case.
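As a quick numerical illustration of the variant (1.10), the following Monte Carlo sketch compares its ordinary risk with that of the MLE at $\theta = 0$; the dimension, variances, and sample size are arbitrary illustrative choices.

```python
import math
import random

def risk_mc(delta, theta, sigma2, n=20000, seed=0):
    """Monte Carlo estimate of E||delta(X) - theta||^2 with X ~ N(theta, diag(sigma2))."""
    rng = random.Random(seed)
    p = len(theta)
    total = 0.0
    for _ in range(n):
        x = [theta[i] + math.sqrt(sigma2[i]) * rng.gauss(0, 1) for i in range(p)]
        d = delta(x)
        total += sum((d[i] - theta[i]) ** 2 for i in range(p))
    return total / n

sigma2 = [4.0, 2.0, 1.0, 0.5, 0.25]   # descending variances, as in (1.1)
p = len(sigma2)

def mle(x):
    return x

def js_variant(x):
    # the James-Stein variant (1.10): shrinks more on high-variance coordinates
    denom = (p - 2) * sigma2[0] + sum(xi * xi for xi in x)
    return [(1 - (p - 2) * s / denom) * xi for s, xi in zip(sigma2, x)]

theta0 = [0.0] * p
r_mle = risk_mc(mle, theta0, sigma2)  # close to sum(sigma2) = 7.75
r_js = risk_mc(js_variant, theta0, sigma2)
print(r_mle, r_js)
```

Since the shrinkage factors of (1.10) lie in $(0, 1)$, every coordinate moves toward the origin, so the improvement at $\theta = 0$ holds draw by draw in this paired simulation.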

## 2 Minimaxity, Ensemble Minimaxity and Casella’s viewpoint

If the prior were known, the resulting posterior mean would be the optimal estimator under the sum of squared error loss. However, it is typically not feasible to specify the prior exactly. One approach to avoid excessive dependence on the choice of prior is to consider a set $\mathcal{P}$ of priors and study the properties of estimators based on the corresponding set of ensemble risks. As in classical decision theory, there rarely exists an estimator that achieves the minimum ensemble risk uniformly for all priors in $\mathcal{P}$. A more realistic goal, as pursued in this paper, is to study the ensemble minimaxity of James-Stein type estimators.

Recall that, with ordinary risk $R(\delta, \theta)$, an estimator $\delta$ is said to be minimax if

$$\sup_{\theta \in \Theta} R(\delta, \theta) = \inf_{\delta'} \sup_{\theta \in \Theta} R(\delta', \theta). \tag{2.1}$$

Similarly, for the case of ensemble risk we have the following definition. Note that the Bayes risk of $\delta$ under the prior $\pi$ is given by (1.6). The estimator $\delta$ is said to be ensemble minimax with respect to $\mathcal{P}$ if

$$\sup_{\pi \in \mathcal{P}} \bar R(\delta, \pi) = \inf_{\delta'} \sup_{\pi \in \mathcal{P}} \bar R(\delta', \pi). \tag{2.2}$$

The motivation for the above definitions comes from the use of the empirical Bayes method in simultaneous inference.

Efron and Morris (1972b) derived the James-Stein estimator through the parametric empirical Bayes model $\theta \sim N_p(0, \tau I_p)$ with unknown $\tau$. Note that in such an empirical Bayes model, $\tau$ is an unknown non-random parameter. Given the family $\mathcal{P}^\star$, the Bayes risk is a function of $\tau$ as follows,

$$\bar R(\delta, \tau) = \int_{\mathbb{R}^p} R(\delta, \theta)\, \frac{1}{(2\pi\tau)^{p/2}} \exp\!\left(-\frac{\lVert\theta\rVert^2}{2\tau}\right) d\theta. \tag{2.3}$$

Hence, with $\mathcal{P}^\star$ given by (1.8), the estimator $\delta$ is said to be ensemble minimax with respect to $\mathcal{P}^\star$ if

$$\sup_{\tau \in (0,\infty)} \bar R(\delta, \tau) = \inf_{\delta'} \sup_{\tau \in (0,\infty)} \bar R(\delta', \tau), \tag{2.4}$$

which may be seen as the counterpart of ordinary minimaxity in the empirical Bayes model.

Clearly the usual estimator $X$ has constant risk $\sum_{i=1}^p \sigma_i^2$, hence constant Bayes risk, and is therefore ensemble minimax. Then the ensemble minimaxity of an estimator $\delta$ follows if

$$\bar R(\delta, \tau) \le \sum_{i=1}^p \sigma_i^2, \quad \forall \tau \in (0, \infty).$$
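The sufficient condition above can be explored by simulation. A sketch, assuming the variant (1.10) and a few arbitrary values of $\tau$ and $\Sigma$, estimating the ensemble risk (1.7) by Monte Carlo:

```python
import math
import random

def ensemble_risk(delta, sigma2, tau, n=20000, seed=0):
    """Monte Carlo estimate of (1.7): theta ~ N(0, tau I), x | theta ~ N(theta, diag(sigma2))."""
    rng = random.Random(seed)
    p = len(sigma2)
    total = 0.0
    for _ in range(n):
        theta = [math.sqrt(tau) * rng.gauss(0, 1) for _ in range(p)]
        x = [theta[i] + math.sqrt(sigma2[i]) * rng.gauss(0, 1) for i in range(p)]
        d = delta(x)
        total += sum((d[i] - theta[i]) ** 2 for i in range(p))
    return total / n

sigma2 = [4.0, 2.0, 1.0, 0.5, 0.25]
p = len(sigma2)

def delta_js(x):
    # the James-Stein variant (1.10)
    denom = (p - 2) * sigma2[0] + sum(xi * xi for xi in x)
    return [(1 - (p - 2) * s / denom) * xi for s, xi in zip(sigma2, x)]

for tau in [0.5, 2.0, 8.0]:
    # the estimated ensemble risk should stay below sum(sigma2) = 7.75
    print(tau, ensemble_risk(delta_js, sigma2, tau))
```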
###### Remark 2.1.

Note that ensemble minimaxity can also be interpreted as a particular case of Gamma minimaxity, studied in the context of robust Bayes analysis by Good (1952) and Berger (1979). However, in such studies, a "large" set consisting of many diffuse priors is usually included in the analysis. Since this is quite different from our formulation of the problem, we use the term ensemble minimaxity throughout our paper, following the Efron and Morris papers cited above.

The class of shrinkage estimators which we consider in this paper is given by

$$\delta_\phi = \left(I - \frac{G\,\phi(z)}{z}\right)x, \quad \text{for } z = x^{\mathsf T} G \Sigma^{-1} x = \sum_{i=1}^p \frac{g_i x_i^2}{\sigma_i^2}, \tag{2.5}$$

where $G = \mathrm{diag}(g_1, \dots, g_p)$ with $g_i > 0$ for all $i$.

Berger and Srinivasan (1978) showed, in their Corollary 2.7, that, given a positive-definite $C$ and a non-singular $B$, a necessary condition for an estimator of the form

$$\left(I - \frac{B\,\phi(x^{\mathsf T} C x)}{x^{\mathsf T} C x}\right)x$$

to be admissible is $B = c\,\Sigma C$ for some constant $c$, which is satisfied by the estimators in the class (2.5).

A version of Baranchik's (1964) sufficient condition for ordinary minimaxity is given in Appendix A: for given $\Sigma$ and $G$ which satisfy

$$h(\Sigma, G) = 2\left(\frac{\sum_i g_i \sigma_i^2}{\max_i g_i \sigma_i^2} - 2\right) > 0,$$

the estimator $\delta_\phi$ given by (2.5) is ordinary minimax if

$$\phi(\cdot) \text{ is non-decreasing and } 0 \le \phi \le h(\Sigma, G). \tag{2.6}$$

Berger (1976) showed that, for any given $\Sigma$,

$$\max_G h(\Sigma, G) = 2(p-2), \qquad \operatorname*{arg\,max}_G h(\Sigma, G) = \sigma_p^2 \Sigma^{-1} = \mathrm{diag}\!\left(\frac{\sigma_p^2}{\sigma_1^2}, \dots, \frac{\sigma_p^2}{\sigma_{p-1}^2}, 1\right),$$

which seems the right choice of $G$. However, from the "conditioning" viewpoint of Casella (1980), which advocates more shrinkage on higher-variance estimates, the descending order

$$g_1 > \cdots > g_p \tag{2.7}$$

is desirable, whereas $\sigma_p^2 \Sigma^{-1}$ corresponds to the ascending order $g_1 < \cdots < g_p$ under $\Sigma$ given by (1.1). As Casella (1980) pointed out, ordinary minimaxity cannot be enjoyed together with the well-conditioning given by (2.7) when

$$h(\Sigma, cI) \le 0, \ \text{ or equivalently } \ \sum_i \sigma_i^2 \le 2\sigma_1^2,$$

for some $c > 0$. In fact, when $\sum_i \sigma_i^2 \le 2\sigma_1^2$ and $g_1 > \cdots > g_p$, taking $c = g_1$ we have

$$c\,\sigma_1^2 = g_1 \sigma_1^2, \quad c\,\sigma_2^2 > g_2 \sigma_2^2, \quad \dots, \quad c\,\sigma_p^2 > g_p \sigma_p^2,$$

and hence $h(\Sigma, G) \le h(\Sigma, cI) \le 0$ follows. The motivation of Casella (1980, 1985) seems to be to provide a better treatment for the $\sum_i \sigma_i^2 \le 2\sigma_1^2$ case. Actually, Brown (1975) pointed out essentially the same phenomenon from a slightly different viewpoint.
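Casella's incompatibility is easy to see numerically. In the sketch below, with an arbitrary $\Sigma$ satisfying $\sum_i \sigma_i^2 \le 2\sigma_1^2$, the Baranchik-type bound $h(\Sigma, G)$ attains its maximum $2(p-2)$ at Berger's ascending choice $G = \sigma_p^2 \Sigma^{-1}$, yet is negative for every well-conditioned $G$ proportional to the identity:

```python
def h(sigma2, g):
    # h(Sigma, G) = 2( sum_i g_i s_i^2 / max_i g_i s_i^2 - 2 )
    prods = [gi * si for gi, si in zip(g, sigma2)]
    return 2 * (sum(prods) / max(prods) - 2)

sigma2 = [8.0, 2.0, 1.0, 0.5]                 # sum = 11.5 <= 2*sigma_1^2 = 16: the problematic case
p = len(sigma2)
g_berger = [sigma2[-1] / s for s in sigma2]   # sigma_p^2 Sigma^{-1}: ascending g_i, maximizes h
print(h(sigma2, g_berger))                    # 2(p-2) = 4.0
print(h(sigma2, [1.0] * p))                   # h(Sigma, cI) = 2(sum sigma_i^2 / sigma_1^2 - 2) = -1.125
```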

Ensemble minimaxity, based on the ensemble risk given by (1.7), provides a way of rescuing well-conditioned shrinkage estimators, estimators which are not necessarily ordinary minimax.

## 3 Ensemble minimaxity

### 3.1 A general theorem

We have the following theorem on the ensemble minimaxity of $\delta_\phi$ with general $G$, though we will eventually focus on $G = \sigma_1^{-2}\Sigma$, whose diagonal entries $g_i = \sigma_i^2/\sigma_1^2$ are in the descending order (2.7).

###### Theorem 3.1.

Assume that $\phi$ is non-negative, non-decreasing and concave. Also, $\phi(z)/z$ is assumed non-increasing. Then

$$\delta_\phi = \left(I - \frac{G\,\phi(z)}{z}\right)x, \quad \text{for } z = x^{\mathsf T} G \Sigma^{-1} x = \sum_{i=1}^p \frac{g_i x_i^2}{\sigma_i^2},$$

is ensemble minimax if

$$\phi\Bigl(p \min_i g_i(1 + \tau/\sigma_i^2)\Bigr) \le 2(p-2)\, \frac{\min_i g_i(1 + \tau/\sigma_i^2)}{\max_i g_i(1 + \tau/\sigma_i^2)}, \quad \forall \tau \in (0, \infty). \tag{3.1}$$
###### Proof.

Recall that, for $i = 1, \dots, p$,

$$x_i \mid \theta_i \sim N(\theta_i, \sigma_i^2), \quad \text{and} \quad \theta_i \sim N(0, \tau).$$

Then the posterior and marginal distributions are given by

$$\theta_i \mid x_i \sim N\!\left(\frac{\tau}{\tau + \sigma_i^2}\, x_i,\ \frac{\tau \sigma_i^2}{\tau + \sigma_i^2}\right) \quad \text{and} \quad x_i \sim N(0, \tau + \sigma_i^2),$$

respectively, where $\theta_1 \mid x_1, \dots, \theta_p \mid x_p$ are mutually independent and $x_1, \dots, x_p$ are mutually independent. Then the Bayes risk is given by

$$\begin{aligned}
\bar R(\delta_\phi, \tau)
&= \sum_{i=1}^p E_\theta E_{x\mid\theta}\!\left[\Bigl\{\Bigl(1 - \frac{g_i \phi(z)}{z}\Bigr)x_i - \theta_i\Bigr\}^2\right] \\
&= \sum_{i=1}^p E_x E_{\theta\mid x}\!\left[\Bigl\{\Bigl(1 - \frac{g_i \phi(z)}{z}\Bigr)x_i - \theta_i\Bigr\}^2\right] \\
&= \sum_{i=1}^p E_x E_{\theta\mid x}\!\left[\Bigl\{\Bigl(1 - \frac{g_i \phi(z)}{z}\Bigr)x_i - E[\theta_i \mid x_i] + E[\theta_i \mid x_i] - \theta_i\Bigr\}^2\right] \\
&= \sum_{i=1}^p E_x\!\left[\Bigl\{\Bigl(1 - \frac{g_i \phi(z)}{z}\Bigr)x_i - E[\theta_i \mid x_i]\Bigr\}^2\right] + \sum_{i=1}^p \operatorname{Var}(\theta_i \mid x_i) \\
&= \sum_{i=1}^p E_x\!\left[\Bigl(\frac{\sigma_i^2}{\tau + \sigma_i^2}\, x_i - \frac{g_i \phi(z)}{z}\, x_i\Bigr)^2\right] + \sum_{i=1}^p \frac{\tau \sigma_i^2}{\tau + \sigma_i^2}.
\end{aligned}$$

Since the first term on the r.h.s. of the above equality is rewritten as

$$\begin{aligned}
\sum_{i=1}^p E_x\!\left[\Bigl(\frac{\sigma_i^2}{\tau + \sigma_i^2}\, x_i - \frac{g_i \phi(z)}{z}\, x_i\Bigr)^2\right]
&= \sum_{i=1}^p \Bigl(\frac{\sigma_i^2}{\tau + \sigma_i^2}\Bigr)^2 E_x[x_i^2] - 2 E_x\!\left[\sum_{i=1}^p \frac{\sigma_i^2 g_i x_i^2}{\tau + \sigma_i^2}\, \frac{\phi(z)}{z}\right] + E_x\!\left[\sum_{i=1}^p g_i^2 x_i^2\, \frac{\phi^2(z)}{z^2}\right] \\
&= \sum_{i=1}^p \frac{\sigma_i^4}{\tau + \sigma_i^2} - 2 E_x\!\left[\sum_{i=1}^p \frac{\sigma_i^2 g_i x_i^2}{\tau + \sigma_i^2}\, \frac{\phi(z)}{z}\right] + E_x\!\left[\sum_{i=1}^p g_i^2 x_i^2\, \frac{\phi^2(z)}{z^2}\right],
\end{aligned}$$

we have

$$\bar R(\delta_\phi, \tau) - \sum_i \sigma_i^2 = -2 E_x\!\left[\sum_{i=1}^p \frac{\sigma_i^2 g_i x_i^2}{\tau + \sigma_i^2}\, \frac{\phi(z)}{z}\right] + E_x\!\left[\sum_{i=1}^p g_i^2 x_i^2\, \frac{\phi^2(z)}{z^2}\right]. \tag{3.2}$$

Let

$$w_i = \frac{x_i^2}{\sigma_i^2 + \tau}, \qquad w = \sum_{i=1}^p w_i, \qquad t_i = \frac{w_i}{w} \quad \text{for } i = 1, \dots, p.$$

Then

$$w \sim \chi_p^2, \qquad t = (t_1, \dots, t_p)^{\mathsf T} \sim \mathrm{Dirichlet}(1/2, \dots, 1/2),$$

and $w$ and $t$ are mutually independent. With this notation, we have

$$x_i^2 = w\, t_i (\sigma_i^2 + \tau) \quad \text{and} \quad z = x^{\mathsf T} G \Sigma^{-1} x = \sum_{i=1}^p \frac{g_i x_i^2}{\sigma_i^2} = w \sum_{i=1}^p t_i g_i\Bigl(1 + \frac{\tau}{\sigma_i^2}\Bigr),$$

and hence

$$\begin{aligned}
E_x\!\left[\sum_i g_i^2 x_i^2\, \frac{\phi^2(z)}{z^2}\right]
&= E_{w,t}\!\left[\frac{\sum_i t_i g_i^2(\sigma_i^2 + \tau)}{\sum_i t_i g_i(1 + \tau/\sigma_i^2)}\, \frac{\phi^2\bigl(w \sum_i t_i g_i(1 + \tau/\sigma_i^2)\bigr)}{w \sum_i t_i g_i(1 + \tau/\sigma_i^2)}\right] \\
&= E_t\!\left[\frac{\sum_i t_i g_i^2(\sigma_i^2 + \tau)}{\sum_i t_i g_i(1 + \tau/\sigma_i^2)}\, E_{w\mid t}\!\left[\frac{\phi^2\bigl(w \sum_i t_i g_i(1 + \tau/\sigma_i^2)\bigr)}{w \sum_i t_i g_i(1 + \tau/\sigma_i^2)}\right]\right].
\end{aligned}$$

Since $\phi(z)/z$ is non-increasing and $\phi$ is non-decreasing, by the correlation inequality we have

$$\begin{aligned}
E_{w\mid t}\!\left[\frac{\phi^2\bigl(w \sum_i t_i g_i(1 + \tau/\sigma_i^2)\bigr)}{w \sum_i t_i g_i(1 + \tau/\sigma_i^2)}\right]
&\le E_{w\mid t}\!\left[\phi\Bigl(w \sum_i t_i g_i(1 + \tau/\sigma_i^2)\Bigr)\right] E_{w\mid t}\!\left[\frac{\phi\bigl(w \sum_i t_i g_i(1 + \tau/\sigma_i^2)\bigr)}{w \sum_i t_i g_i(1 + \tau/\sigma_i^2)}\right] \\
&\le E_{w\mid t}\!\left[\phi\Bigl(w \sum_i t_i g_i(1 + \tau/\sigma_i^2)\Bigr)\right] E_w\!\left[\frac{\phi\bigl(w \min_i g_i(1 + \tau/\sigma_i^2)\bigr)}{w \min_i g_i(1 + \tau/\sigma_i^2)}\right],
\end{aligned}$$

and hence

$$E_x\!\left[\sum_i g_i^2 x_i^2\, \frac{\phi^2(z)}{z^2}\right] \le E_w\!\left[\frac{\phi\bigl(w \min_i g_i(1 + \tau/\sigma_i^2)\bigr)}{w \min_i g_i(1 + \tau/\sigma_i^2)}\right] \times E_t\!\left[\frac{\sum_i t_i g_i^2(\sigma_i^2 + \tau)}{\sum_i t_i g_i(1 + \tau/\sigma_i^2)}\, E_{w\mid t}\!\left[\phi\Bigl(w \sum_i t_i g_i(1 + \tau/\sigma_i^2)\Bigr)\right]\right]. \tag{3.3}$$

For the first factor on the r.h.s. of the inequality (3.3), we have

$$E_w\!\left[\frac{\phi\bigl(w \min_i g_i(1 + \tau/\sigma_i^2)\bigr)}{w}\right] \le E_w[1/w]\, E_w\!\left[\phi\Bigl(w \min_i g_i(1 + \tau/\sigma_i^2)\Bigr)\right] \le \frac{\phi\bigl(E_w[w] \min_i g_i(1 + \tau/\sigma_i^2)\bigr)}{p-2} = \frac{\phi\bigl(p \min_i g_i(1 + \tau/\sigma_i^2)\bigr)}{p-2}, \tag{3.4}$$

where the first and second inequalities follow from the correlation inequality and Jensen's inequality, respectively (note that $E_w[1/w] = 1/(p-2)$ and $E_w[w] = p$ for $w \sim \chi_p^2$). For the second factor on the r.h.s. of the inequality (3.3), by the inequality

$$\sum_i t_i g_i^2 (\sigma_i^2 + \tau) \le \max_i g_i(1 + \tau/\sigma_i^2) \sum_i t_i g_i \sigma_i^2,$$

we have

$$\begin{aligned}
E_t\!\left[\frac{\sum_i t_i g_i^2(\sigma_i^2 + \tau)}{\sum_i t_i g_i(1 + \tau/\sigma_i^2)}\, E_{w\mid t}\!\left[\phi\Bigl(w \textstyle\sum_i t_i g_i(1 + \tau/\sigma_i^2)\Bigr)\right]\right]
&= E_{w,t}\!\left[\frac{\sum_i t_i g_i^2(\sigma_i^2 + \tau)}{\sum_i t_i g_i(1 + \tau/\sigma_i^2)}\, \phi\Bigl(w \sum_i t_i g_i(1 + \tau/\sigma_i^2)\Bigr)\right] \\
&\le \max_i g_i(1 + \tau/\sigma_i^2)\, E_{w,t}\!\left[\frac{\sum_i t_i g_i \sigma_i^2}{\sum_i t_i g_i(1 + \tau/\sigma_i^2)}\, \phi\Bigl(w \sum_i t_i g_i(1 + \tau/\sigma_i^2)\Bigr)\right] \\
&= \max_i g_i(1 + \tau/\sigma_i^2)\, E_x\!\left[\sum_i \frac{\sigma_i^2 g_i x_i^2}{\tau + \sigma_i^2}\, \frac{\phi(z)}{z}\right]. \tag{3.5}
\end{aligned}$$

By (3.3), (3.4) and (3.5), we have

$$E_x\!\left[\sum_i g_i^2 x_i^2\, \frac{\phi^2(z)}{z^2}\right] \le \frac{\phi\bigl(p \min_i g_i(1 + \tau/\sigma_i^2)\bigr)}{p-2}\, \frac{\max_i g_i(1 + \tau/\sigma_i^2)}{\min_i g_i(1 + \tau/\sigma_i^2)}\, E_x\!\left[\sum_i \frac{\sigma_i^2 g_i x_i^2}{\tau + \sigma_i^2}\, \frac{\phi(z)}{z}\right], \tag{3.6}$$

and, by (3.2) and (3.6),

$$\bar R(\delta_\phi, \tau) - \sum_i \sigma_i^2 \le \left(\frac{\phi\bigl(p \min_i g_i(1 + \tau/\sigma_i^2)\bigr)}{p-2}\, \frac{\max_i g_i(1 + \tau/\sigma_i^2)}{\min_i g_i(1 + \tau/\sigma_i^2)} - 2\right) E_x\!\left[\sum_i \frac{\sigma_i^2 g_i x_i^2}{\tau + \sigma_i^2}\, \frac{\phi(z)}{z}\right],$$

which guarantees $\bar R(\delta_\phi, \tau) \le \sum_i \sigma_i^2$ for all $\tau \in (0, \infty)$ under the condition (3.1). ∎
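The chi-square/Dirichlet change of variables used in the proof is easy to sanity-check by simulation; in this sketch (the variances, $\tau$, and sample size are arbitrary choices), the statistic $w = \sum_i x_i^2/(\sigma_i^2 + \tau)$ should have mean $p$ and variance $2p$, matching $w \sim \chi_p^2$.

```python
import math
import random

rng = random.Random(1)
sigma2 = [4.0, 2.0, 1.0, 0.5]
tau = 3.0
p = len(sigma2)
n = 50000

ws = []
for _ in range(n):
    # marginally x_i ~ N(0, sigma_i^2 + tau), independent across i
    x = [math.sqrt(s + tau) * rng.gauss(0, 1) for s in sigma2]
    ws.append(sum(xi * xi / (s + tau) for xi, s in zip(x, sigma2)))

m = sum(ws) / n
v = sum((w - m) ** 2 for w in ws) / n
print(m, v)  # approximately p = 4 and 2p = 8
```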

Given $\Sigma$, the choice $G = \sigma_1^{-2}\Sigma$, with the descending order $g_1 = 1 > g_2 > \cdots > g_p$, is one of the most natural choices of $G$ from Casella's (1980) viewpoint. In this case, we have

$$\frac{\min_i g_i(1 + \tau/\sigma_i^2)}{\max_i g_i(1 + \tau/\sigma_i^2)} = \frac{\min_i \{\sigma_i^2 + \tau\}}{\max_i \{\sigma_i^2 + \tau\}} = \frac{\sigma_p^2 + \tau}{\sigma_1^2 + \tau}, \qquad p \min_i g_i(1 + \tau/\sigma_i^2) = \frac{p}{\sigma_1^2} \min_i (\sigma_i^2 + \tau) = p\, \frac{\sigma_p^2 + \tau}{\sigma_1^2},$$

and hence the following corollary.

###### Corollary 3.1.

Assume that $\phi$ is non-negative, non-decreasing and concave. Also, $\phi(z)/z$ is assumed non-increasing. Then

$$\delta_\phi = \left(I - \frac{\Sigma\, \phi(\lVert x \rVert^2/\sigma_1^2)}{\lVert x \rVert^2}\right)x$$

is ensemble minimax if

$$\phi\bigl(p(\sigma_p^2 + \tau)/\sigma_1^2\bigr) \le 2(p-2)\, \frac{\sigma_p^2 + \tau}{\sigma_1^2 + \tau}, \quad \forall \tau \in (0, \infty). \tag{3.7}$$

### 3.2 An ensemble minimax James-Stein variant

As an example of Corollary 3.1, we consider

$$\phi(z) = \frac{c_1 z}{c_2 + z} \tag{3.8}$$

for $c_1 > 0$ and $c_2 \ge 0$, which is motivated by Stein (1956) and James and Stein (1961). Under $\Sigma = \sigma^2 I$, Stein (1956) suggested that there exist estimators dominating the usual estimator among the class of estimators with $\phi$ given by (3.8) for sufficiently small $c_1$ and sufficiently large $c_2$. Following Stein (1956), James and Stein (1961) showed that $\delta_\phi$ with $c_1 = p-2$ and $c_2 = 0$ is ordinary minimax. The choice $c_2 = 0$ is, however, not good under heteroscedasticity since then, by Corollary 3.1, $c_1$ cannot be larger than $2(p-2)\sigma_p^2/\sigma_1^2$. With positive $c_2$, we can see that $c_1$ can be much larger, as follows.

Note that $\phi$ given by (3.8) is non-negative, increasing and concave, and that $\phi(z)/z = c_1/(c_2 + z)$ is decreasing. The sufficient condition (3.7) then reads

$$\frac{c_1\, p(\sigma_p^2 + \tau)/\sigma_1^2}{c_2 + p(\sigma_p^2 + \tau)/\sigma_1^2} \le 2(p-2)\, \frac{\sigma_p^2 + \tau}{\sigma_1^2 + \tau}, \quad \forall \tau \in (0, \infty),$$

which is equivalent to

$$2(p-2)\bigl\{\sigma_1^2 c_2 + p(\sigma_p^2 + \tau)\bigr\} - c_1 p(\sigma_1^2 + \tau) \ge 0, \quad \forall \tau \in (0, \infty),$$

or

$$p\tau\bigl\{2(p-2) - c_1\bigr\} + 2(p-2)\sigma_1^2\left\{c_2 - p\left(\frac{c_1}{2(p-2)} - \frac{\sigma_p^2}{\sigma_1^2}\right)\right\} \ge 0, \quad \forall \tau \in (0, \infty).$$

Hence we have the following result.

###### Theorem 3.2.
1. When

$$0 < c_1 \le 2(p-2) \ \text{ and } \ c_2 \ge p\left(\frac{c_1}{2(p-2)} - \frac{\sigma_p^2}{\sigma_1^2}\right), \tag{3.9}$$

the shrinkage estimator

$$\left(I - \frac{c_1 \Sigma}{c_2 \sigma_1^2 + \lVert x \rVert^2}\right)x$$

is ensemble minimax.

2. It is ordinary minimax if

$$2\left(\sum_i \sigma_i^4/\sigma_1^4 - 2\right) \ge c_1.$$

Part 2 above follows from Theorem A.1.

It seems to us that one of the most interesting estimators with ensemble minimaxity from Part 1 is

$$\left(I - \frac{(p-2)\Sigma}{(p-2)\sigma_1^2 + \lVert x \rVert^2}\right)x \tag{3.10}$$

with the choice $c_1 = c_2 = p-2$ satisfying (3.9). It is clear that the $i$-th shrinkage factor

$$1 - \frac{(p-2)\sigma_i^2}{(p-2)\sigma_1^2 + \lVert x \rVert^2}$$

is nonnegative for any $x$ and any $i$, which is a nice property.

### 3.3 A generalized Bayes ensemble minimax estimator

In this subsection, we provide a generalized Bayes ensemble minimax estimator. Following Strawderman (1971), Berger (1976) and Maruyama and Strawderman (2005), we consider the generalized harmonic prior

$$\theta \mid \lambda \sim N_p\bigl(0, \lambda^{-1}\Sigma G^{-1} - \Sigma\bigr), \qquad \pi(\lambda) \propto \lambda^{-2} I_{(0,1)}(\lambda), \tag{3.11}$$

where $G$ satisfies $0 < g_i \le 1$ for all $i$. Note that for $\Sigma = I$ and $G = I$, the density of $\theta$ is exactly proportional to $\lVert\theta\rVert^{2-p}$, since $\theta \mid \lambda \sim N_p\bigl(0, ((1-\lambda)/\lambda) I\bigr)$ and

$$\begin{aligned}
\frac{1}{(2\pi)^{p/2}} \int_0^1 \left(\frac{\lambda}{1-\lambda}\right)^{p/2} \exp\!\left(-\frac{\lambda \lVert\theta\rVert^2}{2(1-\lambda)}\right) \lambda^{-2}\, d\lambda
&= \frac{1}{(2\pi)^{p/2}} \int_0^\infty g^{p/2-2} \exp\bigl(-g \lVert\theta\rVert^2/2\bigr)\, dg \\
&= \frac{\Gamma(p/2-1)\, 2^{p/2-1}}{(2\pi)^{p/2}}\, \lVert\theta\rVert^{2-p},
\end{aligned}$$

where $g = \lambda/(1-\lambda)$. The prior $\lVert\theta\rVert^{2-p}$ is called the harmonic prior and was originally investigated by Baranchik (1964) and Stein (1974). Berger (1980) and Berger and Strawderman (1996) recommended the use of the prior (3.11) mainly because it is on the boundary of admissibility.
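The integral identity above can be verified numerically; a sketch using midpoint quadrature, with $p$ and $\lVert\theta\rVert^2$ as arbitrary choices:

```python
import math

p = 5

def lhs(theta_norm2, n=200000):
    # (2*pi)^{-p/2} * integral over (0,1) of (lam/(1-lam))^{p/2} exp(-lam*||theta||^2/(2(1-lam))) lam^{-2}
    total = 0.0
    for k in range(n):
        lam = (k + 0.5) / n
        total += ((lam / (1 - lam)) ** (p / 2)
                  * math.exp(-lam * theta_norm2 / (2 * (1 - lam)))
                  * lam ** -2) / n
    return total / (2 * math.pi) ** (p / 2)

def rhs(theta_norm2):
    # Gamma(p/2 - 1) 2^{p/2 - 1} (2*pi)^{-p/2} ||theta||^{2-p}
    return (math.gamma(p / 2 - 1) * 2 ** (p / 2 - 1)
            / (2 * math.pi) ** (p / 2) * theta_norm2 ** (1 - p / 2))

print(lhs(4.0), rhs(4.0))  # the two agree
```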

Following the approach of Strawderman (1971), the generalized Bayes estimator with respect to the prior (3.11) is given by

$$\delta^* = \left(I - \frac{G\,\phi^*(z)}{z}\right)x, \quad \text{for } z = x^{\mathsf T} G \Sigma^{-1} x, \tag{3.12}$$

with

$$\phi^*(z) = z\, \frac{\int_0^1 \lambda^{p/2-1} \exp(-z\lambda/2)\, d\lambda}{\int_0^1 \lambda^{p/2-2} \exp(-z\lambda/2)\, d\lambda},$$

where $\phi^*$ satisfies the following properties:

1. $\phi^*(z)$ is increasing in $z$.

2. $\phi^*(z)$ is concave.

3. $\lim_{z\to\infty} \phi^*(z) = p-2$.

4. $\phi^*(z)/z$ is decreasing in $z$.

5. The derivative of $\phi^*$ at $0$ is $(p-2)/p$.
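These properties of $\phi^*$ can be spot-checked numerically; a sketch computing $\phi^*$ by midpoint quadrature ($p$, the $z$ values, and the grid size are arbitrary choices):

```python
import math

def phi_star(z, p, n=20000):
    # phi*(z) = z * int_0^1 lam^{p/2-1} e^{-z lam/2} dlam / int_0^1 lam^{p/2-2} e^{-z lam/2} dlam
    num = den = 0.0
    for k in range(n):
        lam = (k + 0.5) / n
        w = math.exp(-z * lam / 2) / n
        num += lam ** (p / 2 - 1) * w
        den += lam ** (p / 2 - 2) * w
    return z * num / den

p = 6
zs = [1.0, 5.0, 20.0]
vals = [phi_star(z, p) for z in zs]
print(vals)                      # increasing in z, below p - 2 = 4
print(phi_star(0.01, p) / 0.01)  # near the slope at zero, (p - 2)/p = 2/3
```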

Under the choice $G = \sigma_1^{-2}\Sigma$, so that $\delta^*$ falls within the setting of Corollary 3.1, we have the following result.

###### Theorem 3.3.
1. The estimator $\delta^*$ is ensemble minimax.

2. The estimator $\delta^*$ is ordinary minimax when

$$2\left(\sum_i \sigma_i^4/\sigma_1^4 - 2\right) \ge p-2.$$
3. The estimator $\delta^*$ is admissible in the ordinary sense.

###### Proof.

[Part 1] Recall that the sufficient condition for ensemble minimaxity is given by Corollary 3.1. By the properties 1, 2 and 4, $\phi^*$ satisfies the assumptions of Corollary 3.1, so we have only to check (3.7).

For $\tau \ge \sigma_1^2 - 2\sigma_p^2$, we have

$$2(p-2)\, \frac{\sigma_p^2 + \tau}{\sigma_1^2 + \tau} \ge p-2.$$

By the properties 1 and 3,

$$\phi^*\bigl(p(\sigma_p^2 + \tau)/\sigma_1^2\bigr) \le p-2$$

for all $\tau \in (0, \infty)$. Hence, for $\tau \ge \sigma_1^2 - 2\sigma_p^2$, it follows that

$$\phi^*\bigl(p(\sigma_p^2 + \tau)/\sigma_1^2\bigr) \le 2(p-2)\, \frac{\sigma_p^2 + \tau}{\sigma_1^2 + \tau}.$$

So it suffices to show

$$\phi^*\bigl(p(\sigma_p^2 + \tau)/\sigma_1^2\bigr) \le 2(p-2)\, \frac{\sigma_p^2 + \tau}{\sigma_1^2 + \tau}$$

when $\sigma_1^2 > 2\sigma_p^2$ and $0 < \tau < \sigma_1^2 - 2\sigma_p^2$. By the properties 2 and 5, we have $\phi^*(z) \le (p-2)z/p$ for all $z > 0$. Then