On Permutation Invariant Problems in Large-Scale Inference

10/12/2021
by   Asaf Weinstein, et al.

Simultaneous statistical inference problems are at the basis of almost any scientific discovery process. We consider a class of simultaneous inference problems that are invariant under permutations, meaning that all components of the problem are oblivious to the labelling of the multiple instances under consideration. For any such problem we identify the optimal solution which is itself permutation invariant, the most natural condition one could impose on the set of candidate solutions. Interpreted differently, for any possible value of the parameter we find a tight (non-asymptotic) lower bound on the statistical performance of any procedure that obeys the aforementioned condition. By generalizing the standard decision theoretic notions of permutation invariance, we show that the results apply to a myriad of popular problems in simultaneous inference, so that the ultimate benchmark for each of these problems is identified. The connection to the nonparametric empirical Bayes approach of Robbins is discussed in the context of asymptotic attainability of the bound uniformly in the parameter value.


1 Introduction

In a large scale inference problem we simultaneously observe a large number $n$ of random variables $Y_1,\dots,Y_n$, each associated with an unknown quantity $\theta_i$, and based on the $Y_i$ we would like to make decisions on some or all of the $\theta_i$. Denoting $Y=(Y_1,\dots,Y_n)$ and $\theta=(\theta_1,\dots,\theta_n)$, we will write

$Y \sim f(y;\theta)$, (1)

where the function $f$ itself is known, and the unknown vector $\theta \in \Theta$ is assumed throughout to be identifiable and nonrandom.

Many canonical examples of large scale problems exhibit symmetry with respect to the coordinates $i=1,\dots,n$, in the sense that the following two conditions are satisfied:

  1. If $Y'$ is obtained from $Y$ by a permutation, and $\theta'$ is obtained from $\theta$ by the same permutation, then the distribution of $Y'$ under $\theta$ is the same as that of $Y$ under $\theta'$.

  2. The function (or functions) of $\theta$ of interest reflects no a priori preference for any of the individual $\theta_i$.

Deferring the formal definitions to later sections, when the two conditions above are satisfied we will say that the problem is permutation invariant (PI, hereafter). The first condition requires that the model (1) itself is PI. The most popular case is when $Y_i \sim p(y_i;\theta_i)$, independently, for some family $p(\cdot\,;\theta_i)$ of marginal distributions, for example $N(\theta_i,\sigma^2)$ for known $\sigma^2$, or $\mathrm{Bin}(m,\theta_i)$ for known $m$, or $\mathrm{Pois}(\theta_i)$. The second condition says that before seeing the data, there is full symmetry with respect to the $\theta_i$ also in the target of inference. For example, deciding in advance to construct a confidence interval for only $\theta_1$ is precluded. The condition does allow problems of the kind treated in, e.g., Efron (2012), that call for simultaneous inference on each of the components of $\theta$, or only a subset thereof chosen after seeing the data; but the characterization above is meant to be even broader in order to also accommodate problems outside the scope of simultaneous or selective inference. An example that we emphasize throughout is testing the intersection hypothesis that all of the $\theta_i$ are zero, which is indeed perfectly PI in the above sense.

To demonstrate the comprehensiveness of large scale PI problems, suppose, for simplicity, that $Y_i \sim N(\theta_i,1)$, independently, for $i=1,\dots,n$, where $\theta_i$ are unknown parameters. In a clinical trial $Y_i$ could measure the difference in efficacy of the $i$-th candidate drug compared to a common control, or in a genome-wide association study $Y_i$ might represent a (normalized) log odds-ratio for the $i$-th SNP. In typical applications the analyst would be interested in parameters that are meaningfully different from zero, and the process of searching for such parameters might include, for example, the following problems.

Global null testing. Is there at least one $\theta_i \neq 0$, or is it the case that $\theta_1=\dots=\theta_n=0$? Testing of the global null is a classical problem in statistics, and has been studied extensively also beyond the normal means model. Under normality, the usual test rejects for large values of $\sum_{i=1}^n Y_i^2$ and is known to be optimal against “spherically invariant” alternatives, meaning that if the alternative includes some value $\theta$, then it also includes all other points $\theta'$ for which $\|\theta'\|=\|\theta\|$; see, e.g., Seifert (1979) and Lehmann and Romano (2006, Ch. 7.1). Other procedures may be preferable, however, if the ordinary invariance assumptions do not apply, e.g. when $\theta$ is sparse (Arias-Castro et al., 2011).

Multiple hypothesis testing. Among the parameters $\theta_1,\dots,\theta_n$, which (if any) is nonzero, $\theta_i \neq 0$? Multiple testing problems, coming into their own with Tukey’s work from the 1950s, are a prototypical example of simultaneous inference and the manifestation of the so-called file-drawer effect. Most of the early work centered on controlling the familywise error rate (FWER), which is the probability of making at least one false rejection. Modern applications have driven a shift in focus in the last two decades toward controlling the more lenient false discovery rate (FDR), the expected ratio of the number of false rejections to the total number of rejections. For the problem at hand, applying the Benjamini-Hochberg (BH) procedure to the two-sided p-values is standard, but many empirical Bayes “improvements” have been proposed (Efron et al., 2001; Efron, 2008; Sun and Cai, 2007, among many others).
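The standard approach just mentioned can be sketched in a few lines. The following is a minimal illustration, assuming the unit-variance normal model of this section and point nulls $\theta_i = 0$; the function names `two_sided_pvalues` and `benjamini_hochberg`, the level `alpha`, and the toy data are ours and do not come from the paper.

```python
import math

def two_sided_pvalues(y):
    """Two-sided p-values for theta_i = 0, assuming Y_i ~ N(theta_i, 1)."""
    # For a standard normal Z, P(|Z| >= |y_i|) equals erfc(|y_i| / sqrt(2)).
    return [math.erfc(abs(v) / math.sqrt(2.0)) for v in y]

def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up BH procedure; returns one reject/accept flag per hypothesis."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    num_rejected = 0
    # Find the largest rank r with p_(r) <= alpha * r / n.
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / n:
            num_rejected = rank
    rejected = [False] * n
    for i in order[:num_rejected]:
        rejected[i] = True
    return rejected

# Toy data (hypothetical): two clearly nonzero effects among five.
y = [4.0, 0.1, -3.5, 0.5, 0.2]
decisions = benjamini_hochberg(two_sided_pvalues(y), alpha=0.05)
print(decisions)
```

Note that the procedure depends on the observations only through their values and not through their labels, so relabelling the coordinates relabels the decisions in the same way.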

Multiple sign-classification. For which of the parameters $\theta_i$ can we confidently classify the sign of $\theta_i$ as positive or negative? Some would argue that estimating the sign of the parameters $\theta_i$ is more meaningful because in practice all $\theta_i$ are different from zero to some decimal precision. In that case control of a Type I error rate does not automatically imply control of the corresponding Type S error rate (Gelman and Tuerlinckx, 2000), the latter referring to incorrect decisions on the sign. Even if the adequacy of testing point nulls were not a concern, determining the sign of $\theta_i$ with confidence for all $i$ such that the hypothesis that $\theta_i=0$ could be rejected is a natural follow-up question (Tukey, 1991), and so will often be of interest anyway. As in multiple testing, the early literature studies the problem of controlling a FWER (e.g., Bohrer, 1979), whereas most of the modern literature focuses on directional FDR (e.g., Benjamini et al., 1993; Stephens, 2016).

Effect-size estimation for selected parameters. In the implied process of searching for interesting $\theta_i$, if the primary goal is to detect nonzero $\theta_i$ and the secondary goal is to classify the sign of $\theta_i$ for rejected nulls, then the third question would typically concern the size of $\theta_i$ for parameters whose sign we were able to classify (Tukey, 1991). More generally, the scientists might be interested in obtaining estimates for the $\theta_i$ corresponding to the $k$ largest $Y_i$, where $k$ is fixed in advance or data-dependent. The special case with $k=n$ and sum-of-squares loss recovers the classical estimation problem of Stein (Stein, 1956), which has alone cultivated a long and famous line of work. The other extreme case $k=1$ has also attracted considerable interest over the years, e.g. Sarkadi (1967); Cohen and Sackrowitz (1989). For $1<k<n$, different approaches that have been proposed include methods that condition on selection (e.g., Zhong and Prentice, 2008; Reid et al., 2017), and empirical Bayes methods (Greenshtein et al., 2008; Efron, 2011). A general framework for constructing selective interval estimates (confidence intervals) was proposed by Benjamini and Yekutieli (2005), who allowed arbitrary selection rules and focused on guaranteeing coverage on-the-average over the selected. Other criteria have also been considered, for example coverage conditional on selection (Zhong and Prentice, 2008), and coverage simultaneously over the selected (Benjamini et al., 2019). An empirical Bayes approach can produce estimated posterior intervals by substituting the postulated prior with an estimate.

Each of the aforementioned problems has drawn much attention in its own right, and any attempt to summarize the rich body of existing literature would be unjust. Of course, many other questions or variants of the ones mentioned above would be pertinent, but the few problems discussed above are already quite different from each other: some are conventionally treated in a decision-theoretic framework, and others in a Neyman–Pearson type of paradigm; some of them involve decisions on individual components $\theta_i$, while others entail only a single (global) decision on the vector $\theta$; in some of the problems the target of inference is determined by the observations, whereas in others the target parameter is pre-specified.

Despite the differences, the observation we make here is that all of these problems can, in a sense, be solved optimally by invoking nothing more than the permutation invariance of the problem. More specifically, for every $\theta$ we find the best decision rule among all decision rules which are themselves PI, i.e., they are correspondingly oblivious to the ordering of the components of $Y$. Viewed from another angle, for each of the aforementioned problems (and many more) we find an explicit form of the greatest lower bound on the risk of any PI decision rule. On the one hand, restricting attention to PI rules is innocuous, because this condition should be respected by every conceivable decision rule due to the inherent structure of the problem; in other words, this is arguably the largest class of procedures one could consider for the problem. At the same time, and perhaps surprisingly at first, this condition gives rise to a meaningful lower bound: whereas the best bound when not imposing any condition at all is often trivial (in many cases it is simply zero), the bound we propose lies significantly above zero. As we further explain in Section 6, our results are an extension of the insightful work of Hannan and Robbins (1955), who obtained the best PI rule in a compound decision problem (Robbins, 1951; Zhang, 2003), a very special case of a PI problem entailing $n$ problems of identical structure to be solved under the sum of the individual losses. To appreciate the importance of the extension proposed here, notice that none of the problems that we listed before (except for the effect-size estimation problem in the special case of $k=n$) is a compound decision problem.

Also from the algorithmic viewpoint (as opposed to the lower bound viewpoint) the situation seems favorable. The account in the previous paragraph implies that, computational considerations aside, the bound is attainable at every point $\theta$ separately, that is, by a possibly different PI procedure which depends on $\theta$. In practice, this alone may not be very useful because $\theta$ is unknown. However, in Section 6 we argue that, under mild conditions, this bound may be asymptotically attainable uniformly in $\theta$, meaning that there would exist a single procedure, not depending on $\theta$, that still asymptotically attains the bound at each value of $\theta$. In fact, it is none other than the usual nonparametric empirical Bayes (EB, hereafter) approach, postulating $\theta_i \sim G$, i.i.d., for a completely unknown $G$, that yields adequate candidate procedures for the job. To establish this prospect would require generalizing results from Hannan and Robbins (1955) and the important followup work of Greenshtein and Ritov (2009). While a formal proof of the claim is beyond the scope of the current paper, we explain why, in the spirit of the aforementioned works, there is good reason to suspect that such a generalization would be possible. If it does hold, this result would have far-reaching implications for large-scale PI problems. Indeed, it would lend meaning to Robbins’s nonparametric EB approach as yielding asymptotically instance-optimal solutions far beyond the original, compound decision context: besides confirming that nonparametric EB is the right approach for the problems above where it has already been proposed (Greenshtein et al., 2008; Efron, 2011, for example), this suggests taking a similar approach in PI problems where EB methods have not been explored before, global testing being a notable example.

The rest of the paper is organized as follows. Section 2 introduces a unified decision theoretic framework that can accommodate, for example, all of the problems mentioned above, but applies to almost any statistical task. In Section 3, using the proposed framework, we define permutation invariant problems in a broader sense than usual. Section 4, containing the main technical contributions, gives explicit characterizations for the oracle permutation invariant rule in any problem that can be presented in the proposed general framework; precise results follow after an informal preamble. In Section 5 we specialize the theory to the examples mentioned above, deriving explicit forms for the oracle permutation invariant rule. The connection with the nonparametric empirical Bayes approach is discussed in Section 6 with reference to Robbins’s work. Section 7 considers the computational aspects, and we conclude with a discussion in Section 8.

2 A Unified Decision Theoretic Framework

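Assuming (1), we begin by recalling the standard decision-theoretic framework (e.g., Berger, 1985), where the statistician is to choose, upon observing $Y=y$, an element $a$ from an action space $\mathcal{A}$. Every action incurs loss $L(a,\theta)$, where $L:\mathcal{A}\times\Theta\to[0,\infty)$ is a pre-specified loss function. A decision rule is a mapping $\delta:\mathcal{Y}\to\mathcal{A}$ from the sample space $\mathcal{Y}$, associating every possible realization $y$ with an action $\delta(y)$. Finally, the risk of a decision rule $\delta$ is the function associating with each $\theta$ the number $R(\theta,\delta)=E_\theta[L(\delta(Y),\theta)]$.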

The decision-theoretic framework is convenient because the abstraction allows us to treat different problems at the same time instead of developing a separate theory for each. Some additions to the classical framework are required, however, in order to accommodate the various problems mentioned in the previous section. First, a loss function by itself is not enough if we also want to handle hypothesis testing or confidence interval problems, for example, because in these problems not all decision rules are acceptable, but only those that satisfy some validity condition (e.g., a confidence interval must have some kind of coverage). Second, the loss function in the standard framework does not allow the statistician to choose the target of inference adaptively (as a function of the data), so that problems such as estimating the mean of the largest observation are ruled out.

To facilitate a unified treatment of large-scale problems, we now extend the standard framework mentioned above by introducing two independent modifications:

  1. We allow specifying in advance a subset $\mathcal{D}$ of the set of all (appropriately measurable) decision rules $\delta:\mathcal{Y}\to\mathcal{A}$, such that only rules in $\mathcal{D}$ can be used by the statistician.

  2. We allow the loss to be a function also of the data, that is, $L$ may now take $y$ as input in addition to $(a,\theta)$, and returns a number $L(a,\theta,y)\ge 0$.

With these two additions, we can reformulate almost any imaginable large-scale inference problem (not necessarily PI) in a common framework, while retaining exactly the same criteria as in the traditional, problem-specific setup. Specifically, with appropriate choices for the elements $(\mathcal{A}, L, \mathcal{D})$, any such problem can be stated equivalently in the new decision theoretic framework, so that a given decision rule in $\mathcal{D}$ is better than another at a particular $\theta$ in the ordinary sense if and only if its risk at $\theta$, defined henceforth as

$R(\theta,\delta) = E_\theta[L(\delta(Y),\theta,Y)]$,

is smaller. We now demonstrate this for the large-scale problems mentioned in the opening section. As in the Introduction, suppose for concreteness that $\Theta=\mathbb{R}^n$ and $Y_i\sim N(\theta_i,1)$ independently for $i=1,\dots,n$.

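Global null testing (continued). Allowing the global null to be in general a composite hypothesis, we are interested in testing $H_0:\theta\in\Theta_0$, where $\Theta_0$ is a prespecified subset of $\Theta$ (the standard problem in Section 1 has $\Theta_0=\{0\}$). The action space will be $\mathcal{A}=\{0,1\}$, where $a=1$ signifies rejection of the null, and the loss will be

$L(a,\theta,y) = \mathbb{1}\{a=0,\ \theta\notin\Theta_0\}$.

To enforce validity of a test, we take $\mathcal{D}$ to be the set of all rules $\delta$ such that

$P_{\theta'}(\delta(Y)=1)\le\alpha$ for all $\theta'\in\Theta_0$; (2)

in other words, only rules that control the Type I error at a prespecified level $\alpha$ are admitted. Notice that for $\theta\notin\Theta_0$ the risk is the probability of making a Type II error, $R(\theta,\delta)=P_\theta(\delta(Y)=0)$, and the problem indeed coincides with the usual Neyman–Pearson formulation.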

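Multiple hypothesis testing (continued). Here we want to test each of the composite (in general) hypotheses $H_{0i}:\theta_i\in\Theta_0$, $i=1,\dots,n$, where $\Theta_0\subseteq\mathbb{R}$ is prespecified. The action space is $\mathcal{A}=\{0,1\}^n$, where, for $a=(a_1,\dots,a_n)\in\mathcal{A}$, the coordinate $a_i=1$ if the $i$-th null is rejected and $a_i=0$ otherwise. Also, we denote by $V$ the number of incorrect rejections; by $S$ the number of correct rejections; by $U$ the number of correct nonrejections; and by $T$ the number of incorrect nonrejections. The choices of $L$ and $\mathcal{D}$ depend on the error criteria under consideration. In the FDR problem, define the false discovery proportion for any $a\in\mathcal{A}$ as

$\mathrm{FDP}(a,\theta) = \dfrac{V}{V+S}$,

where we use the convention $0/0=0$. We take $\mathcal{D}$ to be the set of all rules $\delta$ such that the false discovery rate,

$\mathrm{FDR}_{\theta'}(\delta) = E_{\theta'}[\mathrm{FDP}(\delta(Y),\theta')] \le \alpha$ for all $\theta'\in\Theta$. (3)

To match the optimality criterion considered in, e.g., Sun and Cai (2007), we take the loss function to be the proportion of false non-discoveries among all non-discoveries,

$L(a,\theta,y) = \dfrac{T}{U+T}$,

so that minimizing $R(\theta,\delta)$ subject to $\delta\in\mathcal{D}$ corresponds to minimizing the false non-discovery rate,

$\mathrm{FNR}_\theta(\delta) = E_\theta\left[\dfrac{T}{U+T}\right]$,

under the constraint that $\mathrm{FDR}_{\theta'}(\delta)\le\alpha$ for all $\theta'$.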
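As a concrete illustration of the two proportions just described, here is a minimal sketch that computes the false discovery proportion and the false non-discovery proportion for a given action and parameter vector. It assumes point nulls (the $i$-th null is true exactly when $\theta_i = 0$); the function name `fdp_and_fnp` and the toy vectors are ours.

```python
def fdp_and_fnp(a, theta):
    """Return (FDP, FNP) for action a in {0,1}^n, with the convention 0/0 = 0.

    Assumption for illustration: the i-th null is true iff theta[i] == 0.
    """
    pairs = list(zip(a, theta))
    v = sum(1 for ai, ti in pairs if ai == 1 and ti == 0)  # incorrect rejections
    s = sum(1 for ai, ti in pairs if ai == 1 and ti != 0)  # correct rejections
    u = sum(1 for ai, ti in pairs if ai == 0 and ti == 0)  # correct nonrejections
    t = sum(1 for ai, ti in pairs if ai == 0 and ti != 0)  # incorrect nonrejections
    fdp = v / (v + s) if v + s > 0 else 0.0
    fnp = t / (u + t) if u + t > 0 else 0.0
    return fdp, fnp

theta = [0, 0, 2, -1, 0, 3]
a = [1, 0, 1, 0, 0, 1]
fdp, fnp = fdp_and_fnp(a, theta)
print(fdp, fnp)
```

Averaging the first quantity over the distribution of the data gives the FDR constrained in (3), and averaging the second gives the false non-discovery rate that plays the role of the risk.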

Multiple sign-classification (continued). For simplicity, we consider a version of the problem where the task is to classify the sign weakly as nonnegative or nonpositive. The action space will be $\mathcal{A}=\{-1,0,1\}^n$, where, for $a\in\mathcal{A}$, the coordinate $a_i=1$ when deciding to classify $\theta_i\ge 0$, $a_i=-1$ when deciding to classify $\theta_i\le 0$, and $a_i=0$ symbolizes “insufficient information” for deciding on the sign. Denote by $V^+$ the number of negative parameters classified as nonnegative; by $V^-$ the number of positive parameters classified as nonpositive; and by $T^+$ and $T^-$ the number of positive and the number of negative parameters, respectively, for which no decision on the sign is made. As in the multiple testing problem, the choices of $L$ and $\mathcal{D}$ depend on the error criterion. For a directional-error analogue of the FDR, we define the directional false discovery proportion for any action $a\in\mathcal{A}$ as

$\mathrm{dFDP}(a,\theta) = \dfrac{V^+ + V^-}{R^+ + R^-}$,

where $R^+$ and $R^-$ are, respectively, the number of parameters classified as nonnegative and the number of parameters classified as nonpositive, and again we use the convention $0/0=0$. We then take $\mathcal{D}$ to be the set of all rules $\delta$ for which the directional false discovery rate (Benjamini et al., 1993),

$\mathrm{dFDR}_{\theta'}(\delta) = E_{\theta'}[\mathrm{dFDP}(\delta(Y),\theta')] \le \alpha$ for all $\theta'\in\Theta$. (4)

The loss is taken to be the proportion of nonzero $\theta_i$ among the parameters which we failed to classify for their sign,

$L(a,\theta,y) = \dfrac{T^+ + T^-}{n-R}$,

where $R=R^+ + R^-$. Hence, minimizing $R(\theta,\delta)$ subject to $\delta\in\mathcal{D}$ corresponds to minimizing the directional false non-discovery rate,

$E_\theta\left[\dfrac{T^+ + T^-}{n-R}\right]$,

under the constraint that $\mathrm{dFDR}_{\theta'}(\delta)\le\alpha$ for all possible values of the parameter.
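The directional quantities can be computed in the same spirit. The sketch below follows the verbal definitions in this paragraph (weak sign classification, with 0 meaning no decision); the function name `directional_fdp_and_loss` and the example vectors are hypothetical, chosen only for illustration.

```python
def directional_fdp_and_loss(a, theta):
    """Return (directional FDP, loss) for a in {-1,0,1}^n, with 0/0 = 0.

    a[i] = 1 classifies theta[i] >= 0, a[i] = -1 classifies theta[i] <= 0,
    and a[i] = 0 means no decision on the sign.
    """
    pairs = list(zip(a, theta))
    # Sign errors: negative called nonnegative, or positive called nonpositive.
    v_plus = sum(1 for ai, ti in pairs if ai == 1 and ti < 0)
    v_minus = sum(1 for ai, ti in pairs if ai == -1 and ti > 0)
    # Number of parameters for which a sign decision was made.
    r = sum(1 for ai in a if ai != 0)
    # Nonzero parameters left unclassified.
    t = sum(1 for ai, ti in pairs if ai == 0 and ti != 0)
    n = len(a)
    dfdp = (v_plus + v_minus) / r if r > 0 else 0.0
    loss = t / (n - r) if n - r > 0 else 0.0
    return dfdp, loss

theta = [2, -1, 0, 3, -2]
a = [1, 1, 0, 0, -1]
dfdp, loss = directional_fdp_and_loss(a, theta)
print(dfdp, loss)
```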

Effect-size estimation for selected parameters (continued). To fit the problem into our framework, we will take $\mathcal{A}=\mathbb{R}^n$, with the interpretation that for $a\in\mathcal{A}$, the coordinate $a_i$ is the estimate of $\theta_i$ if $\theta_i$ is selected for estimation, and for the remaining coordinates the choice of $a_i$ will not matter anyway because $\theta_i$ is not estimated. The set of “feasible” decision rules is unconstrained in this problem, that is, $\mathcal{D}$ consists of all mappings $\delta:\mathcal{Y}\to\mathcal{A}$. To define an appropriate loss, first let $s:\mathcal{Y}\to\{0,1\}^n$ be the function representing selection, so that $s_i(y)=1$ if $\theta_i$ will be considered for estimation, and $s_i(y)=0$ otherwise. For estimation under squared loss, for example, we take

$L(a,\theta,y) = \sum_{i=1}^n s_i(y)\,(a_i-\theta_i)^2$, (5)

so that, as required, the risk is the mean squared error in estimating only the selected coordinates,

$R(\theta,\delta) = E_\theta\left[\sum_{i=1}^n s_i(Y)\,(\delta_i(Y)-\theta_i)^2\right]$.

Note that this accommodates selection of the indices corresponding to the $k$ largest observations, even when $k$ is itself data-dependent (so, for example, letting $k$ be the number of parameters selected by the Benjamini-Hochberg procedure at some fixed level $\alpha$ is suitable).
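To make the data-dependent loss concrete, the following sketch evaluates a squared-error loss restricted to the coordinates of the $k$ largest observations. It is only an illustration under our own choices: `k` is fixed rather than data-dependent, ties are broken by index, the loss is an unnormalized sum over the selected coordinates, and the names `select_top_k` and `selective_squared_loss` are hypothetical.

```python
def select_top_k(y, k):
    """Selection indicator s(y): 1 for the k largest observations, else 0."""
    # sorted() is stable, so ties are broken in favor of the smaller index.
    order = sorted(range(len(y)), key=lambda i: y[i], reverse=True)
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(y))]

def selective_squared_loss(a, theta, y, k):
    """Squared error summed over the selected coordinates only."""
    s = select_top_k(y, k)
    return sum(si * (ai - ti) ** 2 for si, ai, ti in zip(s, a, theta))

y = [0.3, 2.5, -1.0, 1.7]
theta = [0.0, 2.0, 0.0, 1.0]
# Using the raw observations as estimates; only coordinates 1 and 3 are scored.
loss_raw = selective_squared_loss(y, theta, y, k=2)
print(loss_raw)
```

Because the loss takes $y$ as an argument, the estimates reported for unselected coordinates have no effect on its value, which is exactly the feature the standard framework cannot express.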

3 Permutation Invariant Problems

Now that we have a common formal framework for large-scale inference problems, the notions of permutation invariance in the first section can be made precise. Throughout the paper, we denote by $S_n$ the group of permutations on $\{1,\dots,n\}$, i.e., $S_n$ consists of all bijective functions from $\{1,\dots,n\}$ to itself. With some abuse of notation, for any $\tau\in S_n$ and any vector $x=(x_1,\dots,x_n)$, we write $\tau(x)$ for the vector obtained by rearranging the components of $x$ according to $\tau$.

From now on, we will consider decision problems in the framework of Section 2 where the set $\mathcal{D}$ of feasible rules is of the form

$\mathcal{D} = \{\delta : E_{\theta'}[C(\delta(Y),\theta',Y)] \le \alpha \ \text{for all}\ \theta'\in\Theta_0(\theta)\}$ (6)

for some set $\Theta_0(\theta)\subseteq\Theta$ that may depend on the true value $\theta$, and for some prespecified function $C$ that, similarly to the loss, is defined on $\mathcal{A}\times\Theta\times\mathcal{Y}$ and returns a nonnegative number. It will hopefully become clearer to the reader along the way why this technicality is needed; for now, notice that in all of the examples of the previous section, $\mathcal{D}$ can indeed be written in this form. The following definition formalizes the conditions 1 and 2 appearing in the Introduction.

Definition 1 (Permutation invariant decision problem).

Under the model (1), a decision problem is said to be permutation invariant if the following conditions hold:

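  A. The model is permutation invariant: for any $\theta\in\Theta$ and any $\tau\in S_n$, on defining $Y'=\tau(Y)$ with $Y\sim f(y;\theta)$, we have that $Y'\sim f(y;\tau(\theta))$.

  B. The loss function is permutation invariant: for any $a\in\mathcal{A}$ and any $\tau\in S_n$, there exists $a'\in\mathcal{A}$ such that $L(a',\tau(\theta),\tau(y))=L(a,\theta,y)$ for all $\theta, y$. In this case, if $L(a',\theta,y)=L(a'',\theta,y)$ for all $\theta, y$ implies $a'=a''$, then $a'$ will be unique and we denote it by $\tilde\tau(a)$.

  C. The set of feasible rules is permutation invariant, in the sense that the function $C$ in (6) has exactly the same property as $L$ in the previous item B.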

For problems that would also fit the conventional decision theoretic framework, i.e., in the special case where $\mathcal{D}$ is the set of all functions $\delta:\mathcal{Y}\to\mathcal{A}$ and the loss does not depend on $y$, Definition 1 is consistent with the standard definition of invariance under a (general) group of transformations, see, e.g., Berger (1985, Ch. 6.2).

PI decision problems naturally call for decision rules that are themselves PI: as it would apply to invariance under any other group of transformations, a decision rule that is not correspondingly invariant is self-conflicting, because it is inconsistent with the symmetry in the very setup of the problem. Thus, if a given decision problem is PI, we say that a decision rule $\delta$ is PI if for all $\tau\in S_n$ and all $y$,

$\delta(\tau(y)) = \tilde\tau(\delta(y))$, (7)

and recall the definition of $\tilde\tau$ in Definition 1. For example, in Stein’s problem of estimating all $\theta_i$ from independent observations $Y_i\sim N(\theta_i,1)$ under sum-of-squares loss, $\delta$ is PI if $\delta(\tau(y))=\tau(\delta(y))$, which is the condition imposed in Greenshtein and Ritov (2009); we need the more general formulation (7), using $\tilde\tau$ instead of $\tau$, for the principle to apply beyond compound decision problems, that is, beyond problems in which a separate decision is made for each of the parameters $\theta_i$. To demonstrate just how basic the PI property is, notice for example that Stein’s problem is invariant also under translations (additions of a constant), and yet the famous James-Stein estimator is not translation invariant (because shrinkage is toward zero); it is, however, PI.
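The remark about the James-Stein estimator is easy to check numerically. The sketch below is an illustration under stated assumptions: unit-variance observations, the plain (not positive-part) James-Stein estimator shrinking toward zero, and hypothetical helper names `james_stein` and `permute`; the data vector and permutation are arbitrary.

```python
import math

def james_stein(y):
    """Plain James-Stein estimator shrinking toward zero (unit variance assumed)."""
    n = len(y)
    # fsum gives a correctly rounded sum regardless of the order of the terms.
    norm_sq = math.fsum(v * v for v in y)
    shrink = 1.0 - (n - 2) / norm_sq
    return [shrink * v for v in y]

def permute(x, perm):
    """Rearrange the components of x according to the index list perm."""
    return [x[j] for j in perm]

y = [1.5, -0.3, 2.2, 0.8, -1.1]
perm = [3, 0, 4, 2, 1]

# Permutation invariance: estimating after permuting equals permuting the estimate.
lhs = james_stein(permute(y, perm))
rhs = permute(james_stein(y), perm)
is_pi = all(math.isclose(u, v) for u, v in zip(lhs, rhs))

# Translation: shifting the data by c does not simply shift the estimate by c.
c = 1.0
shifted = james_stein([v + c for v in y])
is_translation_invariant = all(
    math.isclose(u, v + c) for u, v in zip(shifted, james_stein(y))
)
print(is_pi, is_translation_invariant)
```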

For any given PI decision problem, we denote by $\mathcal{D}^{PI}$ the set of all PI rules, and by $\mathcal{D}\cap\mathcal{D}^{PI}$ the set of all PI rules in $\mathcal{D}$, remembering that $\mathcal{D}$ is defined as the set of rules that meet the minimum requirements in the problem statement. Because adding the PI requirement on $\delta$ is natural when the problem is PI, we consider $\mathcal{D}\cap\mathcal{D}^{PI}$ to be really the largest class of decision rules one could consider. The following simple property of PI rules will be instrumental in the next section.

Lemma 1.

Consider a decision problem in the framework of Section 2. If the problem is PI, then for any PI rule $\delta$ and for any $\tau\in S_n$,

$R(\tau(\theta),\delta) = R(\theta,\delta)$ for all $\theta\in\Theta$.

Proof.

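Recall that the risk of any rule $\delta$ is defined as $R(\theta,\delta)=E_\theta[L(\delta(Y),\theta,Y)]$. Then, we have

$R(\tau(\theta),\delta) = E_{\tau(\theta)}[L(\delta(Y),\tau(\theta),Y)] = E_\theta[L(\delta(\tau(Y)),\tau(\theta),\tau(Y))] = E_\theta[L(\tilde\tau(\delta(Y)),\tau(\theta),\tau(Y))] = E_\theta[L(\delta(Y),\theta,Y)] = R(\theta,\delta)$.

Above, the second equality holds because the model is assumed to be PI, and so, by definition, $\tau(Y)\sim f(y;\tau(\theta))$ when $Y\sim f(y;\theta)$. The third equality uses permutation invariance of the decision rule $\delta$; and the fourth equality is true because the loss is assumed to be PI. ∎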

Lemma 1 says that the risk of any PI rule is constant on the sets $\{\tau(\theta):\tau\in S_n\}$ consisting of all permutations of (a fixed) $\theta$, and usually called orbits of $\Theta$ in the context of invariant statistical problems. We comment that Lemma 1 is consistent with known results on invariance in general, for example Berger (1985, Ch. 6.2); but notice again that, unconventionally, we allow the loss to take also $y$ as input (so, formally, Lemma 1 is not implied by such existing results).

4 The Oracle Permutation Invariant Rule

We now turn to the main contribution of this paper: for a given PI problem in the general framework of Section 2, we find the optimal PI decision rule. Some discussion of the meaning of “optimal” here is in order. Ideally, for a given PI problem under consideration, there would be a PI decision rule which minimizes the risk uniformly in $\theta$. If such a rule exists we call it best PI, in agreement with standard terminology in the study of invariance (for a general group of transformations the term is best invariant). Unfortunately, a best PI rule rarely exists. For example, none of the PI problems considered in the previous sections admit a best PI rule. Another way to say this is that, for any fixed $\theta$, the rule

$\delta^* = \arg\min_{\delta\in\mathcal{D}\cap\mathcal{D}^{PI}} R(\theta,\delta)$ (8)

does indeed depend on $\theta$ (to emphasize this we could have written $\delta^*_\theta$ instead of $\delta^*$, but we opted for simpler notation). Thus, if we insisted on a notion of optimality that is independent of $\theta$, we would have to resort to such criteria as minimaxity. Instead, our goal will be to identify for each $\theta$ the rule in (8), to which we refer from now on as the oracle PI rule. To point out an important subtlety, notice that, because the search in (8) is over $\mathcal{D}\cap\mathcal{D}^{PI}$, for each $\theta$ the rule $\delta^*$ belongs to $\mathcal{D}\cap\mathcal{D}^{PI}$, i.e., it is a function that takes only the data $y$ as input; the term “oracle” is suitable (only) when thinking of $\delta^*$ as a function of both $y$ and $\theta$.

We would also like to remark that, whereas without the permutation invariance requirement the oracle is often trivial, the oracle PI rule is generally not. For example, consider Stein’s compound estimation problem, which also fits the conventional decision theoretic framework. If there are no restrictions whatsoever on $\delta$, the risk at $\theta$ is minimized here by the trivial rule $\delta(y)\equiv\theta$. However, this oracle rule is not PI (unless all $\theta_i$ happen to be equal). This demonstrates that finding the best PI rule is not a trivial task, certainly not in the generality we aim for, e.g., in problems involving selective inference.

Lastly, instead of the “algorithmic” point of view which seeks for each $\theta$ the oracle PI rule, we can, by definition, interpret the risk of $\delta^*$ at each $\theta$ as a tight lower bound on the risk of any PI rule at all (including rules that depend on the true $\theta$). Formally, for each $\theta$ we have

$R(\theta,\delta^*) \le R(\theta,\delta)$ (9)

for all $\delta\in\mathcal{D}\cap\mathcal{D}^{PI}$, so that the left hand side is indeed the greatest possible lower bound. Because we do not regard permutation invariance of $\delta$ as a real restriction when the problem itself is PI, the risk $R(\theta,\delta^*)$ in a sense defines the ultimate lower bound for the problem. This claim is strengthened in Section 6 when we discuss how, at least in some PI problems, the lower bound is attainable uniformly in $\theta$ by a single (independent of $\theta$) PI rule.

4.1 Motivation

To find the oracle PI rule for a given problem, the basic idea is to rewrite the risk at each fixed $\theta$ as an average risk over all permutations of $\theta$, capitalizing on Lemma 1. Before presenting the precise results, we would like to give an informal description of the main idea. Thus, consider a PI problem as defined earlier, and fix $\theta$. Then for any PI rule $\delta$, we have from Lemma 1 that

$R(\theta,\delta) = R(\tau(\theta),\delta)$

for all $\tau\in S_n$. Then it is obviously also true that

$R(\theta,\delta) = \dfrac{1}{n!}\sum_{\tau\in S_n} R(\tau(\theta),\delta)$, (10)

the sum on the right hand side taken over all permutations in $S_n$. The expression on the right hand side of (10) can be interpreted as the Bayes risk with respect to a “prior” putting equal mass on each of the permutations of $\theta$. The quotes in “prior” are a reminder that it is not actually specified by the data generating model, but rather is an artifact of the permutation invariance property of $\delta$. When there are no constraints on $\delta$ other than being PI, this Bayes risk is minimized among all possible decision rules (not necessarily PI) by the corresponding Bayes rule, call it $\delta^B$, the minimizer of the posterior expected loss. To finish the argument, note that $\delta^B$ is in fact PI, which is easy to see from the symmetry of the prior and the likelihood with respect to the coordinates. It follows that $\delta^B$ is the oracle PI decision rule, i.e., $\delta^B=\delta^*$.
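For very small $n$ the informal argument can be carried out by brute force. The sketch below computes the Bayes rule under the uniform "prior" over permutations of a known $\theta$ in the simplest setting: independent $Y_i \sim N(\theta_i, 1)$, sum-of-squares loss on all coordinates, and no constraint other than PI, so that the Bayes rule is the posterior mean. These modelling choices, the function name `oracle_pi_estimate`, and the toy vectors are our assumptions for illustration; enumerating all $n!$ permutations is feasible only for tiny $n$.

```python
import itertools
import math

def oracle_pi_estimate(y, theta):
    """Posterior mean of the permuted parameter vector given Y = y.

    The "prior" is uniform over all n! permutations of theta (repeated
    values are counted separately); the likelihood is N(theta_i, 1),
    independent across coordinates.
    """
    n = len(y)
    candidates = []
    log_liks = []
    for perm in itertools.permutations(range(n)):
        theta_perm = [theta[j] for j in perm]
        log_lik = -0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, theta_perm))
        candidates.append(theta_perm)
        log_liks.append(log_lik)
    # Subtract the maximum before exponentiating for numerical stability.
    max_log = max(log_liks)
    weights = [math.exp(v - max_log) for v in log_liks]
    total = sum(weights)
    return [
        sum(w * cand[i] for w, cand in zip(weights, candidates)) / total
        for i in range(n)
    ]

theta = [0.0, 0.0, 3.0, -2.0]
y = [0.4, -0.2, 2.5, -1.7]
estimate = oracle_pi_estimate(y, theta)
print(estimate)
```

The resulting rule uses $\theta$ only through its unordered values, and permuting the input $y$ permutes the output in the same way, in line with the claim that the Bayes rule is itself PI.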

In the special case where the loss function takes the form

$L(a,\theta) = \sum_{i=1}^n \ell(a_i,\theta_i)$, (11)

which does not allow the target of inference to be specified adaptively and, in fact, limits the scope to problems requiring exactly $n$ decisions (one for each $\theta_i$), Hannan and Robbins (1955) used the simple argument above to derive the best PI rule. In the current paper we make the observation that the basic argument above continues to hold if instead of the restrictive condition (11) we only require the loss to be PI, and, furthermore, that we might as well allow the loss function to depend also on $y$, because the Bayes rule anyway conditions on $Y=y$.

For all its simplicity, the informal argument above is the impetus of this paper. In fact, for problems in the framework of Section 2 where $\mathcal{D}\cap\mathcal{D}^{PI}$ is just the set $\mathcal{D}^{PI}$ of all PI rules, this argument makes a valid proof. In the more general case where $\mathcal{D}\cap\mathcal{D}^{PI}$ is a subset of $\mathcal{D}^{PI}$, however, we will need to be more careful in how we proceed from (10). Indeed, as we will see, the situation is different when $\mathcal{D}\cap\mathcal{D}^{PI}$ is only a subset of all PI rules, because in this case it is not always possible to translate the problem to one of minimizing a posterior expected loss. Still, identifying the optimal solution relies on the same fundamental identity (10), and the Bayesian notions will be apparent also in the general case.

4.2 Formal results

Returning to the general setup, in this subsection we adopt a more formal style and work toward a precise statement. Thus, consider a PI decision problem in the framework of Section 2 with loss $L$ and an associated set $\mathcal{D}$ of feasible rules; in what follows we will refer to this frequentist setup as the “original” problem. Also, recall that $\delta^*$ denotes the oracle PI rule in the original problem.

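Now fix $\theta\in\Theta$, and let $(T,\vartheta,Y)$ be a random triple, where $T\in S_n$, and $\vartheta, Y$ have a joint distribution given by

$T\sim\mathrm{Unif}(S_n)$, $\vartheta=T(\theta)$, $Y\mid\vartheta\sim f(y;\vartheta)$. (12)

If we accept that for two distinct $\tau,\tau'\in S_n$ the elements $\tau(\theta)$ and $\tau'(\theta)$ are counted separately even when $\tau(\theta)=\tau'(\theta)$, then (12) equivalently says that $\vartheta$ has the uniform distribution over all permutations of $\theta$, and $Y\mid\vartheta\sim f(y;\vartheta)$, where $f$ is the likelihood function from (1).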

With these definitions in place, denote by the solution to the optimization problem given by

(13)
s.t.

where the subscript in reminds us that the distribution (12), with respect to which the expectation is taken, depends on the nonrandom . The following result says that is the oracle PI rule in the original problem.

Theorem 1.

In the framework of Section 2, consider a PI decision problem with loss and an associated set of feasible rules given by (6). Then for any fixed , we have

where .

Proof.

We need to show that . By definition, is the solution to

(14)
s.t.

Now, for any PI rule ,

where the first equality is a consequence of Lemma 1, because if each term in the sum is equal to , then so is their average. Above, when needed we use a superscript to emphasize which random variable the expectation is taken over.

Since the function has the same properties as , we can use exactly the same reasoning to conclude that for any PI rule and any ,

Therefore, (13) and (14) are equivalent, except that in (13) there is no restriction that is PI. But from the complete symmetry with respect to the coordinates , it follows that the solution to (13) is PI in any case. This finishes the proof. ∎

In our endeavor to find an explicit form for , the reformulation (13) has a crucial advantage over (14) because the nonrandom vector is essentially replaced with the random vector , which allows for an optimal solution. In fact, in the case where the feasible set in the original problem is the set of all mappings , Theorem 1 simplifies further. To state the result in that case, for every let

the expectation taken with respect to the conditional distribution under (12) of given . It might be instructive to refer to as the selective posterior expected loss, making the connection to elements of standard Bayesian theory. However, it is important to keep in mind that, as mentioned before, with respect to the original problem is not a genuine random vector, and it is meaningless to speak of a posterior distribution because everything is deterministic conditional on .

Theorem 2.

In the setting of Theorem 1, suppose that is the set of all mappings . Then the oracle PI rule is given by

(15)

Proof.

Because there are no requirements now on the decision rule other than being PI, Theorem 1 says that the problem is equivalent to minimizing over all PI rules . Also, for any PI rule we have

(16)

where is the marginal distribution of under (12). But, from the definition, this is minimized by the rule that associates with any the action (15). ∎
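To make the minimization of the posterior expected loss concrete, here is a brute-force sketch for small n. It assumes the unit-variance normal likelihood and uses squared error loss purely for illustration, in which case the minimizing action is the posterior mean under the uniform distribution over permutations. The function names are hypothetical, and full enumeration of all n! permutations is feasible only for very small n.

```python
import itertools
import numpy as np

def posterior_over_permutations(y, theta):
    """Posterior weights of every rearrangement of theta given y (assumed normal model)."""
    y = np.asarray(y, dtype=float)
    theta = np.asarray(theta, dtype=float)
    perms = np.array(list(itertools.permutations(range(theta.size))))
    candidates = theta[perms]  # shape (n!, n): each row is one rearrangement
    # Log-likelihood of y under each rearrangement, up to an additive constant.
    log_w = -0.5 * np.sum((y - candidates) ** 2, axis=1)
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    return candidates, w / w.sum()

def oracle_pi_estimate(y, theta):
    """Posterior mean, i.e. the minimizer of posterior expected squared error."""
    candidates, weights = posterior_over_permutations(y, theta)
    return weights @ candidates
```

Under these assumptions the resulting rule is permutation invariant: rearranging the entries of `y` rearranges the estimate in the same way.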

Theorem 2 will not apply immediately if in (6) does not include all possible mappings . However, for the special case where in (6), that is, when the condition defining is imposed only at the true (unknown) value , it is actually possible to convert the problem into an equivalent problem with an unrestricted , which extends considerably the applicability of Theorem 2. Fortunately, for almost all of the examples given in the manuscript, this reduction will be possible after a slight modification of the formulation of the problem allowing us to replace the corresponding set with . This slight modification will be presented in Section 5 right before we apply the general results here to specific problems; for now, we state the Lemma allowing that reduction.

Lemma 2.

Let be a pair of jointly distributed random variables defined on . Let be some action space, and let be two loss functions in the extended sense of Section 2. Consider the optimization problem given by

(17)
s.t.

where the expectation is taken over the pair , and the minimization is over all measurable mappings . Then for any constant such that the solution to (17) exists and is unique, there exists a constant for which the solution to the weighted optimization problem given by

(18)

is also the solution to (17).

Remark.

The lemma above applies to any pair of random variables. We used and to make the connection to the statistical problems considered, where has the role of an unobserved parameter and is the observed data, while avoiding additional notation. But is not to be confused with the pair in the rest of the manuscript, in which is deterministic, and has the distribution (1) indexed by .

Proof.

This is an adaptation of the proof of Theorem 5 in Courcoubetis and Weber (2003, Appendix A.2). For any denote by the solution to (18), assuming that it exists and is unique, and notice that this is also the solution to

because the objectives differ by the constant . Furthermore, denote by the value of for which , again assuming that such a value exists and is unique. Then we have

where the last equality is because by the definition of . On the other hand, since ,

It follows that

i.e., the solution to (18) is also the solution to (17). ∎
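The chain of inequalities behind this argument can be written out as follows. This is a sketch in notation assumed here rather than taken from the paper: R_1 and R_2 denote the expected values of the two losses, α is the constraint level in (17), δ_λ is the minimizer of the weighted objective in (18), and λ* is a nonnegative constant at which the constraint holds with equality.

```latex
% Assumed notation: R_1(\delta) = E[L_1], R_2(\delta) = E[L_2],
% \delta_\lambda = \arg\min_\delta \{ R_1(\delta) + \lambda R_2(\delta) \},
% and \lambda^* \ge 0 satisfies R_2(\delta_{\lambda^*}) = \alpha.
\begin{align*}
R_1(\delta_{\lambda^*})
  &= R_1(\delta_{\lambda^*}) + \lambda^*\bigl(R_2(\delta_{\lambda^*}) - \alpha\bigr) \\
  &\le R_1(\delta) + \lambda^*\bigl(R_2(\delta) - \alpha\bigr) \\
  &\le R_1(\delta)
  \qquad \text{for every } \delta \text{ with } R_2(\delta) \le \alpha .
\end{align*}
```

The first line uses the equality constraint at λ*, the second uses the optimality of δ_{λ*} for the weighted problem, and the third uses feasibility of δ together with λ* ≥ 0.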

Now consider a decision problem in the general framework of Section 2. If in the definition (6) of , then Lemma 2 will apply. Indeed, in this case the functions and can be thought of as operating on the same triple , since it is the same value under which expectations are taken on and on . In turn, Theorem 2 applies with the modified loss function , assuming that there is some for which is satisfied with equality instead of inequality (see footnote 1). This is demonstrated in Section 5 for specific problems.

Footnote 1. This is usually the case when are continuous; in the general case, it should be possible to show that still solves (18), but with being the largest constant for which

When no reduction to the case of an unrestricted is possible (so Lemma 2 is inapplicable), we may resort to Theorem 1, which of course would still apply. In that situation, each problem should generally be inspected individually for existence of an explicit solution deriving from Theorem 1. As we will demonstrate in the next section, for some testing problems Theorem 1 will actually give the explicit solution directly when interpreted properly.

5 Application

We proceed to specialize the results above to the example large-scale problems considered earlier. As was remarked before, for each of the examples the formulation presented in Section 2 within the new decision theoretic framework is exactly equivalent to the conventional formulation of the problem. However, before deriving the oracle PI rule in an explicit form for each of the specific problems, we propose a subtle but important modification to the formulation of some of these problems.

5.1 A Modification for

Recall from (6) that, in general, is the set of all mappings such that

(19)

where is problem-specific. Whenever it makes sense, the modification we propose is to change this condition to

(20)

in other words, to impose the condition only at the true (unknown to the statistician) value of the parameter. Of course, as long as includes , by replacing with we only make the oracle stronger: the minimum risk achievable subject to the constraint (20) can only be smaller than that achievable under the constraint (19). Therefore, if we identify the oracle PI rule under the former, we only solve a more ambitious problem, so there should be no objection to this modification. From a practical viewpoint this does not really make a difference, because chosen by the statistician anyway cannot depend on the unknown value . At the same time, from the viewpoint of the oracle this modification makes a crucial difference, because it says that satisfying the condition in (20) may be chosen for each separately, that is, as a function of . This gives the oracle much more flexibility, which can result in significantly smaller attainable risk. In fact, if we did not in turn require to be PI, then the proposed modification would typically make the problem uninteresting: taking the FDR problem as an example, the best overall oracle rule (without restrictions whatsoever) would now simply set whenever , and otherwise, and incur zero loss; notice that this choice trivially “controls” the FDR for the true , but not for all simultaneously. However, because we require to be PI, the decision rule just suggested is illegal (it is independent of the data), and the best PI rule under (20) is still not trivial.

More generally, replacing (19) with (20) is reasonable when decisions on individual components —all of them or only a (data-dependent) subset—are required; in the examples of the Introduction, this applies to all but the global testing problem. The latter is an example where the proposed modification is not suitable, and we have to keep the original condition (19)—this is why the qualification “whenever it makes sense” appears above. Indeed, for the global testing problem as formulated in Section 2, it would be meaningless to replace in (2) with , because then the oracle that decides if not all , and otherwise flips a coin so that is rejected with probability (independently of the data), is optimal overall and also PI, so it is the oracle PI rule. The reason we have to keep the condition (19) is that in this case, a global decision (as opposed to decisions on individual components) is made on , allowing the trivial rule to maintain the permutation invariance property. In that case, as proposed in the previous section, we will resort to Theorem 1 to directly derive the oracle PI rule under the restriction (19).

5.2 The Oracle PI Rule in Some Examples

We are ready to return to the set of examples considered in the Introduction, and obtain the oracle rule in explicit form by applying Theorem 1 or Theorem 2, depending on the case. Whenever it is suitable per the discussion above, we apply the modification of Subsection 5.1. Recall that in all of the examples we assume, for simplicity, that independently for .

Global null testing (continued). With the setup of Section 2, we consider now the classical case of a point null hypothesis, , captured by taking in the general formulation. As we explained in the previous subsection, in this case we appeal directly to Theorem 1. Specializing (13) to the functions and from Section 2, the optimization problem now reads

s.t.

Recall that is a random vector obtained by mixing the likelihood of according to the uniform distribution on all permutations of , so that the problem posed above is a testing problem in the usual Neyman–Pearson setup for a simple null versus a simple alternative. Denoting by the density function of a vector, the likelihood ratio statistic (for ) is

and the most powerful test is given by

(21)

where the constant makes a valid level- test, i.e., for . Theorem 1 guarantees that (21) is the best PI rule.
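For small n the likelihood ratio statistic can be computed by direct enumeration. The sketch below assumes that the point null is the zero vector and that the observations are independent unit-variance normals, so that the log-ratio of the alternative density at a given rearrangement to the null density is the inner product of the data with that rearrangement minus half its squared norm. The function names and the Monte Carlo calibration of the critical value are illustrative assumptions, not the paper's procedure.

```python
import itertools
import numpy as np

def log_likelihood_ratio(y, theta):
    """Log of the permutation-mixture density over the N(0, I) density at y."""
    y = np.asarray(y, dtype=float)
    theta = np.asarray(theta, dtype=float)
    perms = np.array(list(itertools.permutations(range(theta.size))))
    candidates = theta[perms]  # all n! rearrangements of theta
    # log phi(y - v) - log phi(y) = <y, v> - ||v||^2 / 2, and ||v|| = ||theta||.
    terms = candidates @ y - 0.5 * np.sum(theta ** 2)
    m = terms.max()
    return m + np.log(np.mean(np.exp(terms - m)))  # stable log of the average

def critical_value(theta, alpha, rng, n_sim=2000):
    """Approximate level-alpha cutoff by simulating the statistic under the null."""
    null_draws = rng.standard_normal((n_sim, len(theta)))
    stats = np.array([log_likelihood_ratio(y, theta) for y in null_draws])
    return np.quantile(stats, 1 - alpha)
```

A test in the spirit of (21) would reject when `log_likelihood_ratio(y, theta)` exceeds the cutoff. The simulated cutoff is only approximate, and the enumeration over n! permutations limits this sketch to very small n.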

Multiple hypothesis testing (continued). The formal setup is identical to that presented in Section 2, except that will now be the set of all rules such that

(22)

replacing the more restrictive condition (3). As explained in the short discussion after Lemma 2, in this case we can appeal again to Theorem 2 on replacing the original loss function with

Therefore, we first need to find for a fixed the Bayes rule, i.e., the minimizer of

where we used to denote the total number of rejections, and to denote the total number of non-rejections. To characterize more explicitly the minimizer in of , first consider the minimizer among all actions , where is the set of all actions making exactly rejections, . For , let be the posterior probability that (notice that the conditioning is on the entire vector , and that is a function of the realization of ). Then the posterior expected loss becomes