1 Introduction
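Let
$$Y_i \sim N(\theta_i, 1), \quad i = 1, \dots, n, \quad \text{independently}, \tag{1}$$
where $\boldsymbol{\theta} = (\theta_1, \dots, \theta_n)$ is unknown and nonrandom, and consider the problem of estimating $\boldsymbol{\theta}$ under sum-of-squares loss. The task is, therefore, to design a rule $\boldsymbol{\delta}(\boldsymbol{Y}) = (\delta_1(\boldsymbol{Y}), \dots, \delta_n(\boldsymbol{Y}))$ such that the risk,
$$R_n(\boldsymbol{\theta}, \boldsymbol{\delta}) = E_{\boldsymbol{\theta}} \|\boldsymbol{\delta}(\boldsymbol{Y}) - \boldsymbol{\theta}\|^2, \tag{2}$$
is in some sense small for all $\boldsymbol{\theta}$. Throughout, we use boldface to denote vectors. A subscript under operations such as expectation or variance always means that the integral is taken over $\boldsymbol{Y}$ when $\boldsymbol{Y}$ is distributed under the nonrandom value of the parameter in the subscript; this will usually, but not always, be $\boldsymbol{\theta}$, the true value of the parameter.
The maximum-likelihood estimator (MLE) of $\boldsymbol{\theta}$ is given by
$$\boldsymbol{\delta}^{\mathrm{MLE}}(\boldsymbol{Y}) = \boldsymbol{Y}, \tag{3}$$
and has constant risk $R_n(\boldsymbol{\theta}, \boldsymbol{\delta}^{\mathrm{MLE}}) = n$ for all $\boldsymbol{\theta}$, a fact that can be used to show that
$$\inf_{\boldsymbol{\delta}} \sup_{\boldsymbol{\theta}} R_n(\boldsymbol{\theta}, \boldsymbol{\delta}) = \sup_{\boldsymbol{\theta}} R_n(\boldsymbol{\theta}, \boldsymbol{\delta}^{\mathrm{MLE}}) = n,$$
that is, $\boldsymbol{\delta}^{\mathrm{MLE}}$ is a minimax estimator for all $n$.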
The most celebrated fact about the normal means problem presented above is that the MLE is admissible for $n \le 2$ but inadmissible for $n \ge 3$. This result, first shown in Stein (1956), is equivalent to saying that when the dimensionality is three or more, the MLE is not the only minimax estimator. What is known today also as the Stein phenomenon became a canonical example for the bias-variance tradeoff and the importance of regularization, and has inspired countless papers on shrinkage estimation in the normal model and far beyond it.
The fact that the MLE is inadmissible for $n \ge 3$ is by all means a remarkable discovery. Meanwhile, the notion that for large values of $n$ "better" solutions than the MLE exist is less surprising if we adopt the ideas of Herbert Robbins from a paper that preceded the work of Stein. Applied to the problem at hand, the heuristic argument in
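Robbins (1951) proceeds as follows. Consider first a simple rule, that is, an estimator given by $\delta_i(\boldsymbol{Y}) = t(Y_i)$, $i = 1, \dots, n$, for some function $t: \mathbb{R} \to \mathbb{R}$, and denote by $\mathcal{D}^{S}$ the collection of all simple rules. Then, dividing (2) by the constant $n$, for any $\boldsymbol{\delta} \in \mathcal{D}^{S}$ we can write
$$\frac{1}{n} R_n(\boldsymbol{\theta}, \boldsymbol{\delta}) = \frac{1}{n} \sum_{i=1}^n E_{\boldsymbol{\theta}} [t(Y_i) - \theta_i]^2 = E [t(Z) - \Theta]^2, \tag{4}$$
where the expectation in the last term is taken over the random triple $(J, \Theta, Z)$ with the joint distribution
$$J \sim \mathrm{Unif}\{1, \dots, n\}, \qquad \Theta = \theta_J, \qquad Z \mid (J, \Theta) \sim N(\Theta, 1). \tag{5}$$
This immediately reveals that the optimal rule in $\mathcal{D}^{S}$ is given by the Bayes solution,
$$\delta^{*}_{S,i}(\boldsymbol{Y}) = E[\Theta \mid Z = Y_i] = \frac{\sum_{j=1}^n \theta_j \, \varphi(Y_i - \theta_j)}{\sum_{j=1}^n \varphi(Y_i - \theta_j)}, \qquad i = 1, \dots, n, \tag{6}$$
where $\varphi$ denotes the standard normal density.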
Because $\boldsymbol{\delta}^{\mathrm{MLE}}$ is a simple rule, its risk must be at least as large as that of (6); in other words, we found a (biased) rule that improves over the MLE for any $n$. The problem is that (6) depends on $\boldsymbol{\theta}$, so it is not a legitimate estimator. However, unlike the trivial (nonsimple) rule that estimates $\boldsymbol{\theta}$ by $\boldsymbol{\theta}$, the optimal simple rule (6) depends on $\boldsymbol{\theta}$ only through its unordered elements, namely, the distribution of the random variable $\Theta$, which can in turn be estimated nonparametrically using all of the observations $Y_1, \dots, Y_n$ in what is known nowadays as the general maximum-likelihood (Jiang and Zhang, 2009) or the deconvolution (Zhang, 1997) problem. As a matter of fact, in the normal case, (6) can be written as a functional of only the marginal density of $Z$ in (5), utilizing an elegant formula due to Tweedie (Efron, 2011). The advantage is that the marginal distribution of $Z$ is relatively easy to estimate nonparametrically from the readily available observations $Y_i$. In any case, whether Tweedie's formula is taken advantage of or not, the virtue of (4) is that it gives rise to a marginal empirical Bayes problem, the meaningful problem of estimating the optimal solution in a bona fide Bayesian model where the prior is unknown. One can therefore hope that the resulting estimator, call it $\hat{\boldsymbol{\delta}}$, has the property that
$$\frac{1}{n} \left[ R_n(\boldsymbol{\theta}, \hat{\boldsymbol{\delta}}) - R_n(\boldsymbol{\theta}, \boldsymbol{\delta}^{*}_{S}) \right] \to 0 \quad \text{as } n \to \infty \tag{7}$$
for all vectors $\boldsymbol{\theta}$ (as long as the sequence $\theta_1, \theta_2, \dots$ is "well enough behaved"). An estimator satisfying (7) is called a competitor of $\boldsymbol{\delta}^{*}_{S}$. Notice that, while holding only asymptotically, the property (7) is much stronger than minimaxity, because it implies that the performance of the best simple rule for the underlying instance $\boldsymbol{\theta}$ is essentially attainable for large enough $n$.
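The two ways of writing the optimal simple rule mentioned above can be compared numerically. The sketch below is only an illustration under stated assumptions: the parameter vector `theta` is hypothetical, the function names are ours, and Tweedie's formula is taken in its unit-variance form $E[\Theta \mid Z = z] = z + f'(z)/f(z)$, where $f$ is the marginal density of $Z$ (here a uniform mixture of $N(\theta_j, 1)$ densities).

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def oracle_simple_rule(z, theta):
    """Posterior mean E[Theta | Z = z]: a likelihood-weighted average of the theta_j."""
    weights = [phi(z - t) for t in theta]
    return sum(w * t for w, t in zip(weights, theta)) / sum(weights)

def tweedie(z, theta):
    """The same quantity via Tweedie's formula, z + f'(z) / f(z)."""
    n = len(theta)
    f = sum(phi(z - t) for t in theta) / n                    # marginal density of Z
    f_prime = sum(-(z - t) * phi(z - t) for t in theta) / n   # derivative of f
    return z + f_prime / f

theta = [0.0, 0.0, 0.0, 3.0, -2.0]   # hypothetical parameter vector
for z in (-2.5, 0.3, 1.7, 4.0):
    print(z, oracle_simple_rule(z, theta), tweedie(z, theta))
```

Both functions use the true `theta`, so neither is a legitimate estimator; an empirical Bayes procedure would replace `f` (or the distribution of $\Theta$) by an estimate computed from the observations.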
The manuscript of Robbins (1951) abounds with insightful and original observations, and the account above does not do it justice. Still, we could conclude the previous paragraph as follows. The quantity
$$\inf_{\boldsymbol{\delta} \in \mathcal{D}^{S}} R_n(\boldsymbol{\theta}, \boldsymbol{\delta}) = R_n(\boldsymbol{\theta}, \boldsymbol{\delta}^{*}_{S}), \tag{8}$$
which we can write explicitly in terms of $\boldsymbol{\theta}$, lower bounds the risk of any simple estimator. Furthermore, by definition, for any $\boldsymbol{\theta}$ there is a simple estimator attaining this bound; in other words, (8) is the greatest lower bound for the class $\mathcal{D}^{S}$. Lastly, this bound is, under certain weak conditions, asymptotically attainable uniformly in $\boldsymbol{\theta}$ by a nonsimple rule, that is, by a procedure that does not belong to $\mathcal{D}^{S}$.
A few questions arise naturally. From an asymptotic point of view, is there a more ambitious attainable benchmark? That is, is there a rule, not necessarily simple, that has a limiting risk smaller than (8) uniformly in $\boldsymbol{\theta}$? From a nonasymptotic perspective, (8) is only partially satisfactory, because many estimators that have been proposed over the years are not simple, for example the James-Stein estimator (James and Stein, 1961). Can we obtain a lower bound that applies to a larger class of rules?
In the next section we derive a tight lower bound on a class of estimators required only to satisfy a natural symmetry property, which we claim is the minimum requirement from any rule in a compound decision problem. Unlike (8), this (nonasymptotic) bound applies, for example, to any empirical Bayes estimator, including such estimators for which we have no closed form.
2 Symmetric rules
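Robbins (1951) targeted the performance of the best simple rule, but, at least when discussing competitors, he also considers a larger class of rules. Thus, Robbins calls a rule symmetric if its risk is invariant under permutations of $\boldsymbol{\theta}$. Formally, denote by $S_n$ the set of all permutations of $(1, \dots, n)$, that is, all rearrangements of the vector $(1, \dots, n)$. Also, for any $\tau \in S_n$, denote by $\tau(\cdot)$ the corresponding operator that rearranges the elements of its input according to $\tau$, that is, $\tau(\boldsymbol{x}) = (x_{\tau(1)}, \dots, x_{\tau(n)})$ for any $\boldsymbol{x} \in \mathbb{R}^n$. With these definitions, a rule $\boldsymbol{\delta}$ is symmetric if for all $\tau \in S_n$ and all $\boldsymbol{\theta}$,
$$R_n(\tau(\boldsymbol{\theta}), \boldsymbol{\delta}) = R_n(\boldsymbol{\theta}, \boldsymbol{\delta}). \tag{9}$$
Denote by $\mathcal{D}^{\mathrm{sym}}$ the set of all symmetric rules.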
We begin by obtaining an equivalent characterization of a symmetric rule. With any rule and an element , we can associate two other rules given by
Identifying any two rules whose risks are equal for every $\boldsymbol{\theta}$, we have
Proposition 1.
A rule is symmetric if and only if for any .
Proof.
We need to show that
(10) 
if and only if
(11) 
Note first that
(12) 
meaning that the distribution of under is the same as that of under . Additionally, we have trivially that for any ,
(13) 
for all .
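Proposition 1 says that $\mathcal{D}^{\mathrm{sym}}$ is essentially the collection of all estimators satisfying
$$\boldsymbol{\delta}(\tau(\boldsymbol{y})) = \tau(\boldsymbol{\delta}(\boldsymbol{y})) \quad \text{for all } \tau \in S_n \text{ and all } \boldsymbol{y}.$$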
Any class of estimators that is larger than $\mathcal{D}^{\mathrm{sym}}$ must include an estimator that, absurdly, treats the indices unequally, even though the statement of the problem is entirely symmetric with respect to the indices. Restricting attention to symmetric rules is, therefore, innocuous on the one hand. On the other hand, it does eliminate the trivial rule that estimates $\boldsymbol{\theta}$ by $\boldsymbol{\theta}$, because the latter is not symmetric except when all $\theta_i$ are identical. Put differently, the greatest lower bound on the risk of an estimator in $\mathcal{D}^{\mathrm{sym}}$ is generally nonzero, so the problem of bounding from below the risk of a symmetric rule is a meaningful problem. The main contribution of the current article, stated in the theorem below, implies an explicit expression for this lower bound.
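A convenient sufficient condition for symmetry is permutation equivariance: if rearranging the data rearranges the estimate in the same way, then the risk is invariant under permutations of the parameter, because the distribution of the data under a permuted parameter is that of the permuted data under the original parameter, and the squared Euclidean norm does not change under a common rearrangement. As a minimal sketch, the check below confirms this property for the James-Stein estimator in its usual form shrinking toward the origin; the observation vector and the helper names are hypothetical.

```python
import itertools
import math

def james_stein(y):
    """James-Stein estimator shrinking toward the origin (unit variance, n >= 3)."""
    n = len(y)
    shrink = 1.0 - (n - 2) / sum(v * v for v in y)
    return [shrink * v for v in y]

def permute(x, tau):
    """Rearrange x according to tau: position i of the result holds x[tau[i]]."""
    return [x[j] for j in tau]

y = [1.2, -0.7, 3.1, 0.4]   # hypothetical observations

# Compare "permute, then estimate" with "estimate, then permute" for every tau.
is_equivariant = all(
    all(math.isclose(a, b)
        for a, b in zip(james_stein(permute(y, tau)), permute(james_stein(y), tau)))
    for tau in itertools.permutations(range(len(y)))
)
print(is_equivariant)
```

The comparison uses `math.isclose` because the sum of squares is accumulated in a different order after a permutation, which can change the last floating-point digits.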
Theorem 1.
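Define a ($\boldsymbol{\theta}$-dependent) rule by
$$\boldsymbol{\delta}^{*}_{\mathrm{sym}}(\boldsymbol{Y}) = \frac{\sum_{\tau \in S_n} \tau(\boldsymbol{\theta}) \, \varphi_n(\boldsymbol{Y} - \tau(\boldsymbol{\theta}))}{\sum_{\tau \in S_n} \varphi_n(\boldsymbol{Y} - \tau(\boldsymbol{\theta}))}, \tag{14}$$
where $\varphi_n$ denotes the density of the $N_n(\boldsymbol{0}, \boldsymbol{I})$ distribution.
Then $\boldsymbol{\delta}^{*}_{\mathrm{sym}}$ is symmetric, and it attains the minimum risk among all rules in $\mathcal{D}^{\mathrm{sym}}$:
$$R_n(\boldsymbol{\theta}, \boldsymbol{\delta}^{*}_{\mathrm{sym}}) = \min_{\boldsymbol{\delta} \in \mathcal{D}^{\mathrm{sym}}} R_n(\boldsymbol{\theta}, \boldsymbol{\delta}).$$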
Proof.
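By definition of a symmetric rule, for any $\tau \in S_n$ we have
$$R_n(\boldsymbol{\theta}, \boldsymbol{\delta}) = R_n(\tau(\boldsymbol{\theta}), \boldsymbol{\delta}),$$
implying that
$$R_n(\boldsymbol{\theta}, \boldsymbol{\delta}) = \frac{1}{n!} \sum_{\tau \in S_n} R_n(\tau(\boldsymbol{\theta}), \boldsymbol{\delta}). \tag{15}$$
Now, the right hand side is precisely equal to
$$E \|\boldsymbol{\delta}(\boldsymbol{Z}) - \boldsymbol{\Theta}\|^2, \tag{16}$$
the expectation taken over $(T, \boldsymbol{\Theta}, \boldsymbol{Z})$ distributed as
$$T \sim \mathrm{Unif}(S_n), \qquad \boldsymbol{\Theta} = T(\boldsymbol{\theta}), \qquad \boldsymbol{Z} \mid (T, \boldsymbol{\Theta}) \sim N_n(\boldsymbol{\Theta}, \boldsymbol{I}). \tag{17}$$
Without any restriction at all on $\boldsymbol{\delta}$, the rule minimizing (16) is the Bayes solution,
$$\boldsymbol{\delta}^{*}_{\mathrm{sym}}(\boldsymbol{Z}) = E[\boldsymbol{\Theta} \mid \boldsymbol{Z}],$$
written explicitly in (14). But this rule is also symmetric (as a function of $\boldsymbol{Z}$), because the distribution of $\boldsymbol{\Theta}$ in (17) is an exchangeable distribution, being uniform over all permutations of $\boldsymbol{\theta}$, and the Bayes solution with respect to any exchangeable prior on $\boldsymbol{\Theta}$ is symmetric. It must therefore be that (14) minimizes the risk among all symmetric rules.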
∎
Theorem 1 says that the best symmetric rule with respect to $\boldsymbol{\theta}$ is a weighted average of all permutations of $\boldsymbol{\theta}$, where the weights are proportional to the likelihood that the permutation gives to the vector $\boldsymbol{Y}$. Compare to (6), where the $i$th component of the rule is a weighted average of the $\theta_j$ with weights that are proportional to the likelihood that $\theta_j$ gives to $Y_i$. In the proof, notice that the sum in (15) could be taken over any subset of $S_n$; however, the only subset that ensures that the resulting Bayes solution is symmetric is $S_n$ itself, because otherwise the distribution of $\boldsymbol{\Theta}$ may not be exchangeable.
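The verbal description above translates directly into a brute-force computation for very small $n$. The following is a sketch only, not an efficient or recommended implementation: it enumerates all $n!$ permutations, the vectors `theta` and `y` are hypothetical, and the function names are ours. The optimal simple rule is included for comparison.

```python
import itertools
import math

def log_lik(y, mean):
    """Log-likelihood of y under N_n(mean, I), up to an additive constant."""
    return -0.5 * sum((a - b) ** 2 for a, b in zip(y, mean))

def optimal_symmetric_rule(y, theta):
    """Likelihood-weighted average of all permutations of theta.

    Enumerates n! permutations, so it is usable only for very small n.
    """
    perms = list(itertools.permutations(theta))
    logw = [log_lik(y, p) for p in perms]
    m = max(logw)                        # subtract the max for numerical stability
    w = [math.exp(v - m) for v in logw]
    total = sum(w)
    return [sum(wk * p[i] for wk, p in zip(w, perms)) / total
            for i in range(len(theta))]

def optimal_simple_rule(y, theta):
    """Componentwise likelihood-weighted average of the theta_j."""
    out = []
    for yi in y:
        w = [math.exp(-0.5 * (yi - t) ** 2) for t in theta]
        out.append(sum(wk * t for wk, t in zip(w, theta)) / sum(w))
    return out

theta = [0.0, 0.0, 2.0, 5.0]     # hypothetical parameter vector
y = [0.3, 4.6, -0.8, 2.4]        # hypothetical observations
print(optimal_symmetric_rule(y, theta))
print(optimal_simple_rule(y, theta))
```

One visible difference between the two rules is that the components of the symmetric rule always sum to the sum of the $\theta_j$, since every permutation of $\boldsymbol{\theta}$ has the same total; the simple rule carries no such constraint.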
3 Generalizations
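We have concentrated on the normal case and estimation under squared loss for two reasons: the first is the specific interest in this classical estimation problem, and the second is clarity of exposition. Similarly to Robbins's argument from Section 1, the ideas of Section 2 apply more generally in every compound decision problem. Thus, suppose that
$$Y_i \sim f(y_i; \theta_i), \quad i = 1, \dots, n,$$
independently, where $\boldsymbol{Y} = (Y_1, \dots, Y_n)$ is the data and where $\boldsymbol{\theta} = (\theta_1, \dots, \theta_n)$ is unknown and nonrandom. On observing the data, we must choose an action $a_i$ regarding $\theta_i$, for each of $i = 1, \dots, n$. The loss incurred for $\boldsymbol{a} = (a_1, \dots, a_n)$ is the "compound" loss
$$L_n(\boldsymbol{\theta}, \boldsymbol{a}) = \sum_{i=1}^n L(\theta_i, a_i), \tag{19}$$
where $L$ is some common "marginal" loss function. A protocol $\boldsymbol{\delta}$ mapping the data $\boldsymbol{Y}$ to a vector of decisions $\boldsymbol{a}$ is called a decision rule. Every decision rule is associated with its risk, the expectation under $\boldsymbol{\theta}$ of the loss,
$$R_n(\boldsymbol{\theta}, \boldsymbol{\delta}) = E_{\boldsymbol{\theta}} L_n(\boldsymbol{\theta}, \boldsymbol{\delta}(\boldsymbol{Y})),$$
and the goal is to design decision rules that in some sense have small risk for any $\boldsymbol{\theta}$.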
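Consistent with the definitions in the previous sections, call a rule simple if $\delta_i(\boldsymbol{Y}) = t(Y_i)$, for some function $t$, and call a rule symmetric if its risk is invariant under permutations of $\boldsymbol{\theta}$. Then similar reasoning to that in (4) gives that for any compound decision problem, the best simple rule is the Bayes solution
$$\delta^{*}_{S,i}(\boldsymbol{Y}) = \arg\min_{a} E[L(\Theta, a) \mid Z = Y_i],$$
where the triple $(J, \Theta, Z)$ has the joint distribution specified by (5). Likewise, the arguments from Section 2 can be extended to imply that, in any compound decision problem, the optimal symmetric rule is
$$\boldsymbol{\delta}^{*}_{\mathrm{sym}}(\boldsymbol{Y}) = \arg\min_{\boldsymbol{a}} E[L_n(\boldsymbol{\Theta}, \boldsymbol{a}) \mid \boldsymbol{Z} = \boldsymbol{Y}], \tag{20}$$
where the triple $(T, \boldsymbol{\Theta}, \boldsymbol{Z})$ has the joint distribution specified by (17). The risk of $\boldsymbol{\delta}^{*}_{S}$ is therefore the ultimate (greatest) lower bound on the risk of a simple rule, and the risk of $\boldsymbol{\delta}^{*}_{\mathrm{sym}}$ is the ultimate lower bound on the risk of a symmetric rule.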
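A very interesting point is that, while the derivation of $\boldsymbol{\delta}^{*}_{S}$ depends crucially on the compound structure (19) of the loss, for the derivation of $\boldsymbol{\delta}^{*}_{\mathrm{sym}}$ it is enough to require only that $L_n$ itself be symmetric, meaning that
$$L_n(\tau(\boldsymbol{\theta}), \tau(\boldsymbol{a})) = L_n(\boldsymbol{\theta}, \boldsymbol{a}) \quad \text{for all } \tau \in S_n. \tag{21}$$
This condition is obviously weaker than (19), and opens up many more possibilities for specifying $L_n$. For example, in the normal means problem we can take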
which violates (19), and the best symmetric rule is still given by (20) for the latter choice of $L_n$. As another example, consider the problem of testing the null hypotheses that $\theta_i = 0$, $i = 1, \dots, n$, in (1) under the loss
where $R$ is the total number of rejections; $V$ is the number of true nulls rejected; $S$ is the number of false nulls rejected; and $n_0$ is the number of true nulls among the $n$ hypotheses. This loss is not compound, but it is clearly symmetric as no index is given any preference. Our results would therefore apply to give the benchmark for the risk of any symmetric multiple testing rule, for example the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995), although we might not have an explicit form for the Bayes solution. Notice that the expectation of the first term in the loss defined above is the false discovery rate (FDR). What we want to emphasize here is that, while multiple testing problems have been studied before in a decision theoretic framework (e.g., Sun and Cai, 2007), these works usually use a compound loss, by means of which a "marginal" problem arises, and optimal procedures that control mFDR can be identified. If we insist on the difference between controlling FDR and controlling its modified version, mFDR, then the theory presented here can be useful.
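To make the multiple testing example concrete, the sketch below runs the Benjamini-Hochberg step-up procedure on two-sided p-values from model (1) and evaluates the false discovery proportion, that is, the number of true nulls rejected divided by the total number of rejections (taken as zero when nothing is rejected), whose expectation is the FDR. Only this first term of the loss is illustrated; the level `q`, the data, and the function names are hypothetical choices made for the example.

```python
import math

def two_sided_p(y):
    """Two-sided p-value for H0: theta_i = 0 when Y_i ~ N(theta_i, 1)."""
    return math.erfc(abs(y) / math.sqrt(2.0))

def benjamini_hochberg(pvals, q):
    """Return a list of booleans, True where the BH step-up procedure at level q rejects."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / n:
            k = rank                     # largest rank that passes the step-up test
    reject = [False] * n
    for i in order[:k]:
        reject[i] = True
    return reject

def false_discovery_proportion(reject, theta):
    """True nulls rejected divided by max(total rejections, 1)."""
    r = sum(reject)
    v = sum(1 for rej, t in zip(reject, theta) if rej and t == 0.0)
    return v / max(r, 1)

theta = [0.0, 0.0, 0.0, 0.0, 3.5, 4.0]   # hypothetical: four true nulls
y = [0.4, -2.9, 1.1, -0.2, 3.8, 4.4]     # hypothetical observations
reject = benjamini_hochberg([two_sided_p(v) for v in y], q=0.1)
print(reject, false_discovery_proportion(reject, theta))
```

The procedure depends on the observations only through the p-values and their ranks, so rearranging the data rearranges the decisions in the same way, which is why it falls in the class of symmetric rules.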
Lastly, we remark that independence of the in this section can be relaxed to requiring that , where is the joint distribution of . To summarize informally, all we need for (20) to hold is complete symmetry in the problem with respect to the indices . Note, however, that independence of the might be important when trying to construct a competitor for , e.g. by an empirical Bayes procedure.
4 Conclusion
In any compound decision problem, the risk of any simple rule can be written as the Bayes risk with respect to a marginal instance of the problem, giving rise to the ultimate lower bound on the performance of a simple rule. In the current paper we have obtained an analogous result for the class of all symmetric rules. Requiring that a rule be symmetric is far less restrictive than requiring that a rule be simple, hence the new lower bound applies much more generally (for example, it bounds from below the risk of the James-Stein estimator).
The heuristic argument in Robbins (1951) suggests that the greatest lower bound on the risk of a simple rule is asymptotically attainable uniformly in $\boldsymbol{\theta}$, as long as the sequence $\theta_1, \theta_2, \dots$ is not too erratic, by a (nonsimple) empirical Bayes rule. For the normal means problem many nonparametric empirical Bayes procedures have been suggested over the years, for example those of Jiang and Zhang (2009), Brown and Greenshtein (2009), Efron (2011, 2016) and Koenker and Mizera (2014), each method with its own merits. Although an asymptotic analysis is not always included, using one of these methods should, qualitatively, work well in pursuing the benchmark given by the risk of the best simple rule.
When considering the class of symmetric rules, we can always choose some arbitrary value $\boldsymbol{\theta}_0$ and use the rule (20) calculated at $\boldsymbol{\theta}_0$. Of course, this is usually not a good idea: if it so happens that $\boldsymbol{\theta}$ is a permutation of $\boldsymbol{\theta}_0$, then no symmetric rule can perform better; but otherwise, such a decision rule might perform poorly. Ultimately, we want a decision rule possessing the stronger adaptivity property pursued by Robbins, that is, a decision rule whose risk asymptotically attains (18) uniformly in $\boldsymbol{\theta}$. This sounds at first like a more ambitious target, but we suspect that when $n$ is very large, it might not be. The intuition is as follows. First, note that the components of $\boldsymbol{\Theta}$ in (17) are identically distributed as $\Theta$ in (5). Moreover, for any fixed $k$, consider the distribution of any subvector of $\boldsymbol{\Theta}$ in (17) of size $k$. Then the components of that subvector are almost independent for large enough $n$, at least if we impose the condition that the empirical distribution of the sequence $\theta_1, \dots, \theta_n$ converges to some probability distribution, say $G$, that does not depend on $n$. Now, if the components of $\boldsymbol{\Theta}$ itself were exactly i.i.d., then the problem would decompose and the Bayes rules in, e.g., (14) and (6), would coincide, implying that the lower bounds in (18) and (8) coincide. But (8) should, in turn, be asymptotically attainable under appropriate conditions on $\boldsymbol{\theta}$, as argued heuristically in the previous paragraph. If this can be proved, it would have great implications for the "regimes of learning" in the normal means problem: it would say that nonparametric empirical Bayes estimators are essentially asymptotically optimal in the strong sense that they come close to attaining the greatest lower bound (18) on symmetric rules uniformly in $\boldsymbol{\theta}$. This point is left for future research.
References
Benjamini and Hochberg (1995) Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289–300, 1995.
Brown and Greenshtein (2009) L. D. Brown and E. Greenshtein. Nonparametric empirical Bayes and compound decision approaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics, pages 1685–1704, 2009.
Efron (2011) B. Efron. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.
Efron (2016) B. Efron. Empirical Bayes deconvolution estimates. Biometrika, 103(1):1–20, 2016.
James and Stein (1961) W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379, 1961.
Jiang and Zhang (2009) W. Jiang and C.-H. Zhang. General maximum likelihood empirical Bayes estimation of normal means. The Annals of Statistics, 37(4):1647–1684, 2009.
Koenker and Mizera (2014) R. Koenker and I. Mizera. Convex optimization, shape constraints, compound decisions, and empirical Bayes rules. Journal of the American Statistical Association, 109(506):674–685, 2014.
Robbins (1951) H. Robbins. Asymptotically subminimax solutions of compound statistical decision problems. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. The Regents of the University of California, 1951.
Stein (1956) C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 197–206, 1956.
Sun and Cai (2007) W. Sun and T. T. Cai. Oracle and adaptive compound decision rules for false discovery rate control. Journal of the American Statistical Association, 102(479):901–912, 2007.
Zhang (1997) C.-H. Zhang. Empirical Bayes and compound estimation of normal means. Statistica Sinica, 7(1):181–193, 1997.