On Optimal Solutions to Compound Statistical Decision Problems

In a compound decision problem, consisting of n statistically independent copies of the same problem to be solved under the sum of the individual losses, any reasonable compound decision rule δ satisfies a natural symmetry property, entailing that δ(σ(y)) = σ(δ(y)) for any permutation σ. We derive the greatest lower bound on the risk of any such decision rule. The classical problem of estimating the mean of a homoscedastic normal vector is used to demonstrate the theory, but important extensions are presented as well in the context of Robbins's original ideas.

03/16/2019

1 Introduction

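Let

$$\boldsymbol{Y} \sim N_n(\boldsymbol{\theta}, I), \tag{1}$$

where $\boldsymbol{\theta} = (\theta_1, \dots, \theta_n) \in \mathbb{R}^n$ is unknown and nonrandom, and consider the problem of estimating $\boldsymbol{\theta}$ under sum-of-squares loss. The task is, therefore, to design a rule $\boldsymbol{\delta}$ such that the risk,

$$R_n(\boldsymbol{\delta}, \boldsymbol{\theta}) := E_{\boldsymbol{\theta}} \|\boldsymbol{\delta}(\boldsymbol{Y}) - \boldsymbol{\theta}\|^2 = E_{\boldsymbol{\theta}} \sum_{i=1}^n (\delta_i(\boldsymbol{Y}) - \theta_i)^2, \tag{2}$$

is in some sense small for all $\boldsymbol{\theta}$. Throughout, we use boldface to denote vectors. A subscript under operations such as expectation or variance always means that the integral is taken over $\boldsymbol{Y}$ when $\boldsymbol{Y}$ is distributed under the nonrandom value of the parameter in the subscript; this will usually, but not always, be $\boldsymbol{\theta}$, the true value of the parameter.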

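The maximum-likelihood estimator (MLE) of $\boldsymbol{\theta}$ is given by

$$\delta^{\mathrm{ML}}_i(\boldsymbol{Y}) = Y_i, \tag{3}$$

and has constant risk $n$ for all $\boldsymbol{\theta}$, a fact that can be used to show that

$$R_n(\boldsymbol{\delta}^{\mathrm{ML}}, \boldsymbol{\theta}) = \inf_{\boldsymbol{\delta}} \sup_{\boldsymbol{\theta}} R_n(\boldsymbol{\delta}, \boldsymbol{\theta}),$$

that is, $\boldsymbol{\delta}^{\mathrm{ML}}$ is a minimax estimator for all $n$.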

The most celebrated fact about the normal means problem presented above is that the MLE is admissible for $n \le 2$ but inadmissible for $n \ge 3$. This result, first shown in Stein (1956), is equivalent to saying that when the dimensionality is three or more, the MLE is not the only minimax estimator. What is known today also as the Stein phenomenon became a canonical example for the bias-variance tradeoff and the importance of regularization, and has inspired countless papers on shrinkage estimation in the normal model and far beyond it.

The fact that the MLE is inadmissible for $n \ge 3$ is by all means a remarkable discovery. Meanwhile, the notion that for large values of $n$ "better" solutions than the MLE exist is less surprising if we adopt the ideas of Herbert Robbins from a paper that preceded the work of Stein. Applied to the problem at hand, the heuristic argument in Robbins (1951) proceeds as follows. Consider first a simple rule, that is, an estimator given by $\delta_i(\boldsymbol{Y}) = u(Y_i)$ for some function $u$, and denote by $\mathcal{C}_1$ the collection of all simple rules. Then, dividing (2) by the constant $n$, for any $\boldsymbol{\delta} \in \mathcal{C}_1$ we can write

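$$\frac{1}{n} R_n(\boldsymbol{\delta}, \boldsymbol{\theta}) = \frac{1}{n} E_{\boldsymbol{\theta}} \sum_{i=1}^n (u(Y_i) - \theta_i)^2 = \sum_{i=1}^n \frac{1}{n} E_{\boldsymbol{\theta}} (u(Y_i) - \theta_i)^2 = \sum_{i=1}^n \frac{1}{n} E_{\theta_i} (u(Y_i) - \theta_i)^2 = E (u(Z) - \xi)^2, \tag{4}$$

where the expectation in the last term is taken over the random triple $(I, Z, \xi)$ with the joint distribution

$$P(I = i) = 1/n \quad \text{if } i \in \{1, \dots, n\}, \qquad (Z, \xi) \mid I = i \ \sim\ (Y_i, \theta_i). \tag{5}$$

This immediately reveals that the optimal rule in $\mathcal{C}_1$ is given by the Bayes solution,

$$\delta^*_i(\boldsymbol{Y}) = u^*(Y_i) = E(\xi \mid Z = Y_i) = \frac{\sum_{j=1}^n \theta_j \exp\{-(Y_i - \theta_j)^2/2\}}{\sum_{j'=1}^n \exp\{-(Y_i - \theta_{j'})^2/2\}}. \tag{6}$$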

Because $\boldsymbol{\delta}^{\mathrm{ML}}$ is a simple rule, its risk must be larger than that of (6); in other words, we found a (biased) rule that improves over the MLE for any $\boldsymbol{\theta}$. The problem is that (6) depends on $\boldsymbol{\theta}$, so it is not a legitimate estimator. However, unlike the trivial (nonsimple) rule that estimates $\boldsymbol{\theta}$ by $\boldsymbol{\theta}$, the optimal simple rule (6) depends on $\boldsymbol{\theta}$ only through its unordered elements, namely, the distribution of the random variable $\xi$, which can in turn be estimated nonparametrically using all of the observations in what is known nowadays as the general maximum-likelihood (Jiang and Zhang, 2009) or the deconvolution (Zhang, 1997) problem. As a matter of fact, in the normal case, (6) can be written as a functional of only the marginal density of $Z$ in (5), utilizing an elegant formula due to Tweedie (Efron, 2011). The advantage is that the marginal distribution of $Z$ is relatively easy to estimate nonparametrically from the readily available observations $Y_1, \dots, Y_n$. In any case, whether Tweedie's formula is taken advantage of or not, the virtue of (4) is that it gives rise to a marginal empirical Bayes problem, the meaningful problem of estimating the optimal solution in a bona fide Bayesian model where the prior is unknown. One can therefore hope that the resulting estimator, call it $\boldsymbol{\delta}$, has the property that

$$\frac{1}{n} \Big\{ R_n(\boldsymbol{\delta}, \boldsymbol{\theta}) - \inf_{\boldsymbol{\delta}' \in \mathcal{C}_1} R_n(\boldsymbol{\delta}', \boldsymbol{\theta}) \Big\} \longrightarrow 0 \tag{7}$$

for all vectors $\boldsymbol{\theta}$ (as long as the sequence $\theta_1, \theta_2, \dots$ is "well enough behaved"). An estimator satisfying (7) is called a competitor of $\boldsymbol{\delta}^*$. Notice that, while holding only asymptotically, the property (7) is much stronger than minimaxity, because it implies that the performance of the best simple rule for the underlying instance $\boldsymbol{\theta}$ is essentially attainable for large enough $n$.
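As an illustration of the oracle rule (6) and of the remark about Tweedie's formula, the following is a minimal sketch, assuming unit-variance normal observations as in (1). The function names `oracle_simple_rule` and `tweedie_form`, and the numerical values of `theta` and `y`, are hypothetical choices made for this example only. The second function evaluates $z + \frac{d}{dz} \log f(z)$, where $f$ is the marginal density of $Z$ in (5), which should agree with (6).

```python
import numpy as np

def oracle_simple_rule(y, theta):
    """Evaluate (6): E(xi | Z = y_i) under the uniform prior on theta_1..theta_n."""
    y = np.asarray(y, dtype=float)
    theta = np.asarray(theta, dtype=float)
    # log-weights -(y_i - theta_j)^2 / 2, arranged as a (len(y), n) array
    logw = -0.5 * (y[:, None] - theta[None, :]) ** 2
    logw -= logw.max(axis=1, keepdims=True)  # stabilise before exponentiating
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)
    return w @ theta

def tweedie_form(y, theta):
    """Same rule written as z + f'(z)/f(z), with f the marginal density of Z in (5)."""
    y = np.asarray(y, dtype=float)
    theta = np.asarray(theta, dtype=float)
    diff = y[:, None] - theta[None, :]
    expo = -0.5 * diff ** 2
    expo -= expo.max(axis=1, keepdims=True)  # common factor cancels in f'/f
    phi = np.exp(expo)
    f = phi.sum(axis=1)
    f_prime = (-diff * phi).sum(axis=1)
    return y + f_prime / f

# Hypothetical parameter vector and observations, for illustration only.
theta = np.array([-2.0, 0.0, 0.0, 3.0])
y = np.array([-1.4, 0.2, 1.1, 2.6])
print(oracle_simple_rule(y, theta))
print(tweedie_form(y, theta))
```

Both functions take the true `theta` as input, so neither is a legitimate estimator; an empirical Bayes procedure would replace the prior, or the marginal density, by an estimate.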

The manuscript of Robbins (1951) abounds with insightful and original observations, and the account above does not do it justice. Still, we could conclude the previous paragraph as follows. The quantity

$$R_n(\boldsymbol{\delta}^*, \boldsymbol{\theta}) = n E[\mathrm{Var}(\xi \mid Z)], \tag{8}$$

which we can write explicitly in terms of $\boldsymbol{\theta}$, lower bounds the risk of any simple estimator. Furthermore, by definition, for any $\boldsymbol{\theta}$ there is a simple estimator attaining this bound; in other words, (8) is the greatest lower bound for the class $\mathcal{C}_1$. Lastly, this bound is, under certain weak conditions, asymptotically attainable uniformly in $\boldsymbol{\theta}$ by a nonsimple rule, that is, by a procedure that does not belong to $\mathcal{C}_1$.
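The bound (8) can be evaluated numerically for a given $\boldsymbol{\theta}$. The sketch below approximates $\sum_i E_{\theta_i}(u^*(Y_i) - \theta_i)^2$ by a Riemann sum on a grid; the function name `simple_bound`, the grid parameters, and the example vectors are assumptions made for illustration, and the code presumes that the $\theta_i$ are moderately spread (a very wide range could make the densities underflow on the grid).

```python
import numpy as np

def simple_bound(theta, half_width=10.0, num=20001):
    """Approximate (8), the risk of the oracle simple rule (6), by a Riemann sum."""
    theta = np.asarray(theta, dtype=float)
    z = np.linspace(theta.min() - half_width, theta.max() + half_width, num)
    dz = z[1] - z[0]
    # phi[k, j] is the N(theta_j, 1) density evaluated at grid point z_k
    phi = np.exp(-0.5 * (z[:, None] - theta[None, :]) ** 2) / np.sqrt(2 * np.pi)
    u_star = (phi @ theta) / phi.sum(axis=1)        # rule (6) on the grid
    sq_err = (u_star[:, None] - theta[None, :]) ** 2
    # sum over i of the integral of (u*(z) - theta_i)^2 against N(theta_i, 1)
    return float((sq_err * phi).sum() * dz)

# Hypothetical parameter vectors, for illustration only.
spread = simple_bound([-2.0, 0.0, 0.0, 3.0])   # compare with the MLE risk n = 4
constant = simple_bound([1.0, 1.0, 1.0])       # all theta_i equal: u* is constant
print(spread, constant)
```

When all $\theta_i$ coincide, the oracle simple rule is the constant function and the bound is zero, which is one way to see how far below the minimax risk $n$ the benchmark (8) can fall.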

A few questions arise naturally. From an asymptotic point of view, is there a more ambitious attainable benchmark? That is, is there a rule, not necessarily simple, that has a limiting risk smaller than (8) uniformly in $\boldsymbol{\theta}$? From a non-asymptotic perspective, (8) is only partially satisfactory, because many estimators that have been proposed over the years are not simple, for example the James-Stein estimator (James and Stein, 1961). Can we obtain a lower bound that applies to a larger class of rules?

In the next section we derive a tight lower bound on a class of estimators required only to satisfy a natural symmetry property, which we claim is the minimum requirement from any rule in a compound decision problem. Unlike (8), this (nonasymptotic) bound applies, for example, to any empirical Bayes estimator, including such estimators for which we have no closed form.

2 Symmetric rules

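Robbins (1951) targeted the performance of the best simple rule, but, at least when discussing competitors, he also considered a larger class of rules. Thus, Robbins calls a rule symmetric if its risk is invariant under permutations of $\boldsymbol{\theta}$. Formally, denote by $M_n$ the set of all permutations of $(1, \dots, n)$, that is, all rearrangements of the vector $(1, \dots, n)$. Also, for any $g \in M_n$, denote by $\sigma_g$ the corresponding operator that rearranges the elements of its input according to $g$, that is, $\sigma_g(\boldsymbol{x}) = (x_{g(1)}, \dots, x_{g(n)})$ for any $\boldsymbol{x} \in \mathbb{R}^n$. With these definitions, a rule $\boldsymbol{\delta}$ is symmetric if for all $\boldsymbol{\theta}$,

$$R_n(\boldsymbol{\delta}, \sigma_g(\boldsymbol{\theta})) = R_n(\boldsymbol{\delta}, \boldsymbol{\theta}) \qquad \text{for all } g \in M_n. \tag{9}$$

Denote by $\mathcal{C}_2$ the set of all symmetric rules.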

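We begin by obtaining an equivalent characterization of a symmetric rule. With any rule $\boldsymbol{\delta}$ and an element $g \in M_n$, we can associate two other rules given by

$$(\boldsymbol{\delta} \circ \sigma_g)(\boldsymbol{Y}) := \boldsymbol{\delta}(\sigma_g(\boldsymbol{Y})), \qquad (\sigma_g \circ \boldsymbol{\delta})(\boldsymbol{Y}) := \sigma_g(\boldsymbol{\delta}(\boldsymbol{Y})).$$

Identifying any two rules whenever their risks are equal for every $\boldsymbol{\theta}$, we have

Proposition 1.

A rule $\boldsymbol{\delta}$ is symmetric if and only if $\boldsymbol{\delta} \circ \sigma_g = \sigma_g \circ \boldsymbol{\delta}$ for any $g \in M_n$.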

Proof.

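We need to show that

$$E_{\sigma_g(\boldsymbol{\theta})} \|\boldsymbol{\delta}(\boldsymbol{Y}) - \sigma_g(\boldsymbol{\theta})\|^2 = E_{\boldsymbol{\theta}} \|\boldsymbol{\delta}(\boldsymbol{Y}) - \boldsymbol{\theta}\|^2 \qquad \text{for all } g \in M_n \tag{10}$$

if and only if

$$E_{\boldsymbol{\theta}} \|\boldsymbol{\delta}(\sigma_g(\boldsymbol{Y})) - \boldsymbol{\theta}\|^2 = E_{\boldsymbol{\theta}} \|\sigma_g(\boldsymbol{\delta}(\boldsymbol{Y})) - \boldsymbol{\theta}\|^2 \qquad \text{for all } g \in M_n, \tag{11}$$

where both (10) and (11) are understood to hold for every $\boldsymbol{\theta}$.

Note first that

$$\boldsymbol{Y} \mid \sigma_g(\boldsymbol{\theta}) \ \overset{d}{=}\ \sigma_g(\boldsymbol{Y}) \mid \boldsymbol{\theta}, \tag{12}$$

meaning that the distribution of $\boldsymbol{Y}$ under $\sigma_g(\boldsymbol{\theta})$ is the same as that of $\sigma_g(\boldsymbol{Y})$ under $\boldsymbol{\theta}$. Additionally, we have trivially that for any $g \in M_n$,

$$\|\sigma_g(\boldsymbol{\delta}(\boldsymbol{y})) - \sigma_g(\boldsymbol{\theta})\|^2 = \|\boldsymbol{\delta}(\boldsymbol{y}) - \boldsymbol{\theta}\|^2 \tag{13}$$

for all $\boldsymbol{y}$ and $\boldsymbol{\theta}$.

Assume (10). Then, for any $g \in M_n$,

$$E_{\boldsymbol{\theta}} \|\boldsymbol{\delta}(\sigma_g(\boldsymbol{Y})) - \boldsymbol{\theta}\|^2 = E_{\sigma_g(\boldsymbol{\theta})} \|\boldsymbol{\delta}(\boldsymbol{Y}) - \boldsymbol{\theta}\|^2 = E_{\sigma_g(\boldsymbol{\theta})} \|\sigma_g(\boldsymbol{\delta}(\boldsymbol{Y})) - \sigma_g(\boldsymbol{\theta})\|^2 = E_{\boldsymbol{\theta}} \|\sigma_g(\boldsymbol{\delta}(\boldsymbol{Y})) - \boldsymbol{\theta}\|^2,$$

where the first equality uses (12); the second equality uses (13); and the third equality uses (10) for the rule $\sigma_g \circ \boldsymbol{\delta}$.

In the other direction, assume (11). Then, for any $g \in M_n$,

$$E_{\sigma_g(\boldsymbol{\theta})} \|\boldsymbol{\delta}(\boldsymbol{Y}) - \sigma_g(\boldsymbol{\theta})\|^2 = E_{\boldsymbol{\theta}} \|\boldsymbol{\delta}(\sigma_g(\boldsymbol{Y})) - \sigma_g(\boldsymbol{\theta})\|^2 = E_{\boldsymbol{\theta}} \|\sigma_g(\boldsymbol{\delta}(\boldsymbol{Y})) - \sigma_g(\boldsymbol{\theta})\|^2 = E_{\boldsymbol{\theta}} \|\boldsymbol{\delta}(\boldsymbol{Y}) - \boldsymbol{\theta}\|^2,$$

where the first equality uses (12); the second equality uses (11); and the third equality uses (13). ∎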

Proposition 1 says that $\mathcal{C}_2$ is essentially the collection of all estimators satisfying

$$\boldsymbol{\delta}(\sigma_g(\boldsymbol{y})) = \sigma_g(\boldsymbol{\delta}(\boldsymbol{y})) \qquad \text{for all } g \in M_n.$$

Any class of estimators that is larger than $\mathcal{C}_2$ must include an estimator that, absurdly, treats the indices unequally, even though the statement of the problem is entirely symmetric with respect to the indices. Restricting attention to symmetric rules is, therefore, innocuous on the one hand. On the other hand, it does eliminate the trivial rule that estimates $\boldsymbol{\theta}$ by $\boldsymbol{\theta}$, because the latter is not symmetric except when all $\theta_i$ are identical. Put differently, the greatest lower bound on the risk of an estimator in $\mathcal{C}_2$ is generally nonzero, so the problem of bounding from below the risk of a symmetric rule is a meaningful problem. The main contribution of the current article, stated in the theorem below, implies an explicit expression for this lower bound.
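The characterization $\boldsymbol{\delta}(\sigma_g(\boldsymbol{y})) = \sigma_g(\boldsymbol{\delta}(\boldsymbol{y}))$ is easy to check numerically for small $n$. The sketch below, with hypothetical helper names (`james_stein`, `trivial_rule`, `is_equivariant`) and made-up numbers, verifies it for the James-Stein estimator in its standard form $(1 - (n-2)/\|\boldsymbol{y}\|^2)\,\boldsymbol{y}$, and shows that the trivial rule that always returns a fixed $\boldsymbol{\theta}$ with distinct components fails the check.

```python
import numpy as np
from itertools import permutations

def james_stein(y):
    """James-Stein estimator shrinking towards the origin (unit variance)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    return (1.0 - (n - 2) / np.dot(y, y)) * y

def is_equivariant(rule, y):
    """Check rule(sigma_g(y)) == sigma_g(rule(y)) for every permutation g."""
    y = np.asarray(y, dtype=float)
    base = rule(y)
    for g in permutations(range(y.size)):
        idx = list(g)  # sigma_g(x) = (x_{g(1)}, ..., x_{g(n)})
        if not np.allclose(rule(y[idx]), base[idx]):
            return False
    return True

# Hypothetical values, for illustration only.
theta = np.array([-1.0, 0.5, 2.0, 4.0])
y = np.array([0.3, -1.2, 2.5, 4.1])

def trivial_rule(y):
    """Ignores the data and returns the fixed vector theta."""
    return theta.copy()

print(is_equivariant(james_stein, y))
print(is_equivariant(trivial_rule, y))
```

The check enumerates all $n!$ permutations, so it is only meant for very small $n$.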

Theorem 1.

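Define a ($\boldsymbol{\theta}$-dependent) rule by

$$\boldsymbol{\delta}^{**}(\boldsymbol{Y}) = \frac{\sum_{g \in M_n} \sigma_g(\boldsymbol{\theta}) \exp(-\|\boldsymbol{Y} - \sigma_g(\boldsymbol{\theta})\|^2/2)}{\sum_{g' \in M_n} \exp(-\|\boldsymbol{Y} - \sigma_{g'}(\boldsymbol{\theta})\|^2/2)}. \tag{14}$$

Then $\boldsymbol{\delta}^{**}$ is symmetric, and it attains the minimum risk among all rules in $\mathcal{C}_2$:

$$R_n(\boldsymbol{\delta}^{**}, \boldsymbol{\theta}) := E_{\boldsymbol{\theta}} \|\boldsymbol{\delta}^{**}(\boldsymbol{Y}) - \boldsymbol{\theta}\|^2 = \inf_{\boldsymbol{\delta} \in \mathcal{C}_2} R_n(\boldsymbol{\delta}, \boldsymbol{\theta}).$$

Proof.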

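By definition of a symmetric rule, for any $\boldsymbol{\delta} \in \mathcal{C}_2$ we have

$$E_{\boldsymbol{\theta}} \|\boldsymbol{\delta}(\boldsymbol{Y}) - \boldsymbol{\theta}\|^2 = E_{\sigma_g(\boldsymbol{\theta})} \|\boldsymbol{\delta}(\boldsymbol{Y}) - \sigma_g(\boldsymbol{\theta})\|^2 \qquad \text{for any } g \in M_n,$$

implying that

$$E_{\boldsymbol{\theta}} \|\boldsymbol{\delta}(\boldsymbol{Y}) - \boldsymbol{\theta}\|^2 = \sum_{g \in M_n} \frac{1}{n!} E_{\sigma_g(\boldsymbol{\theta})} \|\boldsymbol{\delta}(\boldsymbol{Y}) - \sigma_g(\boldsymbol{\theta})\|^2. \tag{15}$$

Now, the right hand side is precisely equal to

$$E \|\boldsymbol{\delta}(\boldsymbol{Z}) - \boldsymbol{\xi}\|^2, \tag{16}$$

the expectation taken over $(G, \boldsymbol{\xi}, \boldsymbol{Z})$ distributed as

$$P(G = g) = 1/n! \quad \text{if } g \in M_n, \qquad (\boldsymbol{\xi}, \boldsymbol{Z}) \mid G = g \ \sim\ (\sigma_g(\boldsymbol{\theta}), \sigma_g(\boldsymbol{Y})). \tag{17}$$

Without any restriction at all on $\boldsymbol{\delta}$, the rule minimizing (16) is the Bayes solution,

$$\boldsymbol{\delta}(\boldsymbol{Y}) = E(\boldsymbol{\xi} \mid \boldsymbol{Z} = \boldsymbol{Y}),$$

written explicitly in (14). But this rule is also symmetric (as a function of $\boldsymbol{Y}$), because the distribution of $\boldsymbol{\xi}$ in (17) is an exchangeable distribution, being uniform over all permutations of $\boldsymbol{\theta}$, and the Bayes solution with respect to any exchangeable prior on $\boldsymbol{\theta}$ is symmetric. It must therefore be that (14) minimizes the risk among all symmetric rules. ∎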

Theorem 1 says that the best symmetric rule with respect to $\boldsymbol{\theta}$ is a weighted average of all permutations of $\boldsymbol{\theta}$, where the weights are proportional to the likelihood that the permutation gives to the vector $\boldsymbol{Y}$. Compare to (6), where the $i$-th component of $\boldsymbol{\delta}^*$ is a weighted average of the $\theta_j$ with weights that are proportional to the likelihood that $\theta_j$ gives to $Y_i$. In the proof, notice that the sum in (15) could be taken over any subset of $M_n$; however, the only subset that ensures that the resulting Bayes solution is symmetric is $M_n$ itself, because otherwise the distribution of $\boldsymbol{\xi}$ may not be exchangeable.
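For very small $n$, the weighted average over permutations in (14) can be computed by brute-force enumeration. The following is a minimal sketch under that assumption; the function name `oracle_symmetric_rule` and the example values are hypothetical, and the cost grows like $n!$, so this is an illustration of the formula and not a practical algorithm.

```python
import numpy as np
from itertools import permutations

def oracle_symmetric_rule(y, theta):
    """Evaluate (14) by enumerating all n! permutations of theta."""
    y = np.asarray(y, dtype=float)
    theta = np.asarray(theta, dtype=float)
    # One row per permutation sigma_g(theta); shape (n!, n)
    perms = np.array([theta[list(g)] for g in permutations(range(theta.size))])
    # Log-likelihood that each permuted vector gives to y
    logw = -0.5 * ((y[None, :] - perms) ** 2).sum(axis=1)
    logw -= logw.max()  # stabilise before exponentiating
    w = np.exp(logw)
    w /= w.sum()
    return w @ perms

# Hypothetical values, for illustration only.
theta = np.array([-2.0, 0.0, 0.0, 3.0])
y = np.array([0.4, -1.7, 2.8, 0.1])
est = oracle_symmetric_rule(y, theta)
print(est)
```

Because every permutation of $\boldsymbol{\theta}$ has the same component sum, the output of (14) always sums to $\sum_i \theta_i$, a constraint that the simple rule (6) does not impose.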

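We can restate Theorem 1 as follows. The risk of any symmetric rule is bounded from below by

$$R_n(\boldsymbol{\delta}^{**}, \boldsymbol{\theta}) = \mathrm{tr}\{E[\mathrm{Cov}(\boldsymbol{\xi} \mid \boldsymbol{Z})]\}, \tag{18}$$

where $(G, \boldsymbol{\xi}, \boldsymbol{Z})$ is a random triple jointly distributed according to (17). This bound is precisely the envelope of the risks of the symmetric rules (14) calculated with respect to every $\boldsymbol{\theta}$.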

Because any simple rule is also symmetric, the bound (18) lies beneath the bound given in (8) for any $\boldsymbol{\theta}$. For small $n$, the difference could be substantial. For large $n$, we suspect that the bounds are "usually" close; we discuss this point in Section 4.
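To get a feeling for the gap between (8) and (18) when $n$ is small, consider the special case $n = 2$ and $\boldsymbol{\theta} = (-a, a)$, chosen here purely for illustration. Substituting into (6) gives $u^*(z) = a \tanh(a z)$, and substituting into (14) gives $\boldsymbol{\delta}^{**}(\boldsymbol{y}) = (-a \tanh(a d),\ a \tanh(a d))$ with $d = y_2 - y_1$. The sketch below estimates the three risks by Monte Carlo; the function name `risks_n2`, the number of replications, and the seed are assumptions of this example.

```python
import numpy as np

def risks_n2(a, reps=200_000, seed=0):
    """Monte Carlo risks of the MLE, rule (6) and rule (14) for theta = (-a, a)."""
    rng = np.random.default_rng(seed)
    theta = np.array([-a, a])
    y = theta + rng.standard_normal((reps, 2))
    # Oracle simple rule (6), applied coordinate by coordinate
    simple = a * np.tanh(a * y)
    # Oracle symmetric rule (14), which depends on y only through y_2 - y_1
    d = y[:, 1] - y[:, 0]
    second = a * np.tanh(a * d)
    symmetric = np.column_stack([-second, second])

    def risk(est):
        return float(((est - theta) ** 2).sum(axis=1).mean())

    return risk(y), risk(simple), risk(symmetric)

mle_risk, simple_risk, symmetric_risk = risks_n2(2.0)
print(mle_risk, simple_risk, symmetric_risk)
```

In this example the symmetric oracle exploits the fact that the two coordinates must carry opposite signs, information that no simple rule can use.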

3 Generalizations

We have concentrated on the normal case and estimation under squared loss for two reasons: the first is the specific interest in this classical estimation problem, and the second is clarity of exposition. Similarly to Robbins’s argument from Section 1, the ideas of Section 2 apply more generally in every compound decision problem. Thus, suppose that

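$$Y_i \sim p(y; \theta_i), \qquad i = 1, \dots, n,$$

independently, where $\boldsymbol{Y} = (Y_1, \dots, Y_n)$ is the data and where $\boldsymbol{\theta} = (\theta_1, \dots, \theta_n)$ is unknown and nonrandom. On observing the data, we must choose an action $a_i \in \mathcal{A}$ regarding $\theta_i$, for each of $i = 1, \dots, n$. The loss incurred for $\boldsymbol{a} = (a_1, \dots, a_n)$ is the "compound" loss

$$L_n(\boldsymbol{a}, \boldsymbol{\theta}) = \sum_{i=1}^n l(a_i, \theta_i), \tag{19}$$

where $l$ is some common "marginal" loss function. A protocol $\boldsymbol{\delta}$ mapping the data to a vector of decisions is called a decision rule. Every decision rule is associated with its risk, the expectation under $\boldsymbol{\theta}$ of the loss,

$$R_n(\boldsymbol{\delta}, \boldsymbol{\theta}) := E_{\boldsymbol{\theta}}[L_n(\boldsymbol{\delta}(\boldsymbol{Y}), \boldsymbol{\theta})],$$

and the goal is to design decision rules that in some sense have small risk for any $\boldsymbol{\theta}$.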

Consistent with the definitions in the previous sections, call a rule simple if $\delta_i(\boldsymbol{Y}) = u(Y_i)$ for some function $u$, and call a rule symmetric if its risk is invariant under permutations of $\boldsymbol{\theta}$. Then similar reasoning to that in (4) gives that for any compound decision problem, the best simple rule is the Bayes solution

$$\delta^*_i(\boldsymbol{Y}) = u^*(Y_i) = \arg\min_{a \in \mathcal{A}} \{E[l(a, \xi) \mid Z = Y_i]\},$$

where the triple $(I, Z, \xi)$ has the joint distribution specified by (5). Likewise, the arguments from Section 2 can be extended to imply that, in any compound decision problem, the optimal symmetric rule is

$$\boldsymbol{\delta}^{**}(\boldsymbol{Y}) = \arg\min_{\boldsymbol{a} \in \mathcal{A}^n} \{E[L_n(\boldsymbol{a}, \boldsymbol{\xi}) \mid \boldsymbol{Z} = \boldsymbol{Y}]\}, \tag{20}$$

where the triple $(G, \boldsymbol{\xi}, \boldsymbol{Z})$ has the joint distribution specified by (17). The risk of $\boldsymbol{\delta}^*$ is therefore the ultimate (greatest) lower bound on the risk of a simple rule, and the risk of $\boldsymbol{\delta}^{**}$ is the ultimate lower bound on the risk of a symmetric rule.

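A very interesting point is that, while the derivation of $\boldsymbol{\delta}^*$ depends crucially on the compound structure (19) of the loss, for the derivation of $\boldsymbol{\delta}^{**}$ it is enough to require only that $L_n$ itself be symmetric, meaning that

$$L_n(\sigma_g(\boldsymbol{a}), \sigma_g(\boldsymbol{\theta})) = L_n(\boldsymbol{a}, \boldsymbol{\theta}) \qquad \text{for all } g \in M_n. \tag{21}$$

This condition is obviously weaker than (19), and opens up many more possibilities for specifying $L_n$. For example, in the normal means problem we can take

$$L_n(\boldsymbol{a}, \boldsymbol{\theta}) = \max_{1 \le i \le n} \{(a_i - \theta_i)^2\},$$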

which violates (19), and the best symmetric rule is still given by (20) for the latter choice of $L_n$. As another example, consider the problem of testing the null hypotheses that $\theta_i = 0$ in (1) under the loss

$$\frac{V}{\max(R, 1)} + \lambda \frac{S}{n - n_0},$$

where $R$ is the total number of rejections; $V$ is the number of true nulls rejected; $S$ is the number of false nulls rejected; and $n_0$ is the number of true nulls among the $n$ hypotheses. This loss is not compound, but it is clearly symmetric as no index is given any preference. Our results would therefore apply to give the benchmark for the risk of any symmetric multiple testing rule, for example the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995), although we might not have an explicit form for the Bayes solution. Notice that the expectation of the first term in the loss defined above is the false discovery rate (FDR). What we want to emphasize here is that, while multiple testing problems have been studied before in a decision theoretic framework (e.g., Sun and Cai, 2007), these works usually use a compound loss, by means of which a "marginal" problem arises, and optimal procedures that control mFDR can be identified. If we insist on the difference between controlling FDR and controlling its modified version, mFDR, then the theory presented here can be useful.

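Lastly, we remark that independence of the $Y_i$ in this section can be relaxed to requiring that $\sigma_g(\boldsymbol{Y}) \sim P_{\sigma_g(\boldsymbol{\theta})}$ whenever $\boldsymbol{Y} \sim P_{\boldsymbol{\theta}}$, where $P_{\boldsymbol{\theta}}$ is the joint distribution of $\boldsymbol{Y}$. To summarize informally, all we need for (20) to hold is complete symmetry in the problem with respect to the indices $i = 1, \dots, n$. Note, however, that independence of the $Y_i$ might be important when trying to construct a competitor for $\boldsymbol{\delta}^{**}$, e.g. by an empirical Bayes procedure.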

4 Conclusion

In any compound decision problem, the risk of any simple rule can be written as the Bayes risk with respect to a marginal instance of the problem, giving rise to the ultimate lower bound on the performance of a simple rule. In the current paper we have obtained an analogous result for the class of all symmetric rules. Requiring that a rule be symmetric is far less restrictive than requiring that a rule be simple, hence the new lower bound applies much more generally (for example, it bounds from below the risk of the James-Stein estimator).

The heuristic argument in Robbins (1951) suggests that the greatest lower bound on the risk of a simple rule is asymptotically attainable uniformly in $\boldsymbol{\theta}$, as long as the sequence $\theta_1, \theta_2, \dots$ is not too erratic, by a (nonsimple) empirical Bayes rule. For the normal means problem many nonparametric empirical Bayes procedures have been suggested over the years, for example those of Jiang and Zhang (2009); Brown and Greenshtein (2009); Efron (2011, 2016); Koenker and Mizera (2014), each method with its own merits. Although an asymptotic analysis is not always included, using one of these methods should, qualitatively, work well in pursuing the benchmark given by the risk of the best simple rule.

When considering the class of symmetric rules, we can always choose some arbitrary value $\boldsymbol{\theta}_0$ and use the rule (20) calculated at $\boldsymbol{\theta}_0$. Of course, this is usually not a good idea: if it so happens that $\boldsymbol{\theta}$ is a permutation of $\boldsymbol{\theta}_0$, then no symmetric rule can perform better; but otherwise, such a decision rule might perform poorly. Ultimately, we want a decision rule possessing the stronger adaptivity property pursued by Robbins, that is, a decision rule whose risk asymptotically attains (18) uniformly in $\boldsymbol{\theta}$. This sounds at first like a more ambitious target, but we suspect that when $n$ is very large, it might not be. The intuition is as follows. First, note that the components of $\boldsymbol{\xi}$ in (17) are identically distributed as $\xi$ in (5). Moreover, for any fixed $k$, consider the distribution of any sub-vector of $\boldsymbol{\xi}$ in (17) of size $k$. Then the components of that sub-vector are almost independent for large enough $n$, at least if we impose the condition that the empirical distribution of the sequence $\theta_1, \dots, \theta_n$ converges to some probability distribution that does not depend on $n$. Now, if the components of $\boldsymbol{\xi}$ itself were exactly i.i.d., then the problem would decompose and the Bayes rules in, e.g., (14) and (6), would coincide, implying that the lower bounds in (18) and (8) coincide. But (8) should, in turn, be asymptotically attainable under appropriate conditions on $\boldsymbol{\theta}$, as argued heuristically in the previous paragraph. If this can be proved, it would have great implications for the "regimes of learning" in the normal means problem: it would say that nonparametric empirical Bayes estimators are essentially asymptotically optimal in the strong sense that they come close to attaining the greatest lower bound (18) on symmetric rules uniformly in $\boldsymbol{\theta}$. This point is left for future research.
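The decomposition invoked above for the exactly i.i.d. case can be spelled out as follows. This is a sketch under the assumption, which does not hold for (17) where the components are only exchangeable, that $\xi_1, \dots, \xi_n$ are i.i.d. from some distribution, denoted here by the hypothetical symbol $\Pi$, and that $Z_i \mid \boldsymbol{\xi} \sim N(\xi_i, 1)$ independently.

```latex
\begin{align*}
\pi(d\boldsymbol{\xi} \mid \boldsymbol{Z} = \boldsymbol{y})
  &\propto \prod_{i=1}^n \exp\{-(y_i - \xi_i)^2/2\}\, d\Pi(\xi_i), \\
E(\xi_i \mid \boldsymbol{Z} = \boldsymbol{y})
  &= \frac{\int t \exp\{-(y_i - t)^2/2\}\, d\Pi(t)}
          {\int \exp\{-(y_i - t)^2/2\}\, d\Pi(t)}
   = E(\xi_i \mid Z_i = y_i).
\end{align*}
```

The posterior factorizes over coordinates, so the $i$-th component of the Bayes rule depends on $\boldsymbol{y}$ only through $y_i$; when $\Pi$ is the empirical distribution of $\theta_1, \dots, \theta_n$, the last expression is exactly the form of (6).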

References

• Benjamini and Hochberg (1995) Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289–300, 1995.
• Brown and Greenshtein (2009) L. D. Brown and E. Greenshtein. Nonparametric empirical Bayes and compound decision approaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics, pages 1685–1704, 2009.
• Efron (2011) B. Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.
• Efron (2016) B. Efron. Empirical Bayes deconvolution estimates. Biometrika, 103(1):1–20, 2016.
• James and Stein (1961) W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, volume 1, pages 361–379, 1961.
• Jiang and Zhang (2009) W. Jiang and C.-H. Zhang. General maximum likelihood empirical Bayes estimation of normal means. The Annals of Statistics, 37(4):1647–1684, 2009.
• Koenker and Mizera (2014) R. Koenker and I. Mizera. Convex optimization, shape constraints, compound decisions, and empirical Bayes rules. Journal of the American Statistical Association, 109(506):674–685, 2014.
• Robbins (1951) H. Robbins. Asymptotically subminimax solutions of compound statistical decision problems. In Proceedings of the second Berkeley symposium on mathematical statistics and probability. The Regents of the University of California, 1951.
• Stein (1956) C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley symposium on mathematical statistics and probability, volume 1, pages 197–206, 1956.
• Sun and Cai (2007) W. Sun and T. T. Cai. Oracle and adaptive compound decision rules for false discovery rate control. Journal of the American Statistical Association, 102(479):901–912, 2007.
• Zhang (1997) C.-H. Zhang. Empirical bayes and compound estimation of normal means. Statistica Sinica, 7(1):181–193, 1997.