Confidence Intervals for Selected Parameters

06/02/2019 ∙ by Yoav Benjamini, et al.

Practical or scientific considerations often lead to selecting a subset of parameters as "important." Inferences about those parameters often are based on the same data used to select them in the first place. That can make the reported uncertainties deceptively optimistic: confidence intervals that ignore selection generally have less than their nominal coverage probability. Controlling the "simultaneous over the selected" (SoS) error rate, the probability that one or more intervals for selected parameters do not cover, is crucial in many scientific problems. Intervals that control the SoS error rate can be constructed in ways that take advantage of knowledge of the selection rule. We construct SoS-controlling confidence intervals for parameters deemed the most "important" k of m shift parameters because they are estimated (by independent estimators) to be the largest. The new intervals improve substantially over Šidák intervals when k is small compared to m, and approach the standard Bonferroni-corrected intervals when k ≈ m. Standard, unadjusted confidence intervals for location parameters have the correct coverage probability for k=1, m=2 if, when the true parameters are zero, the estimators are exchangeable and symmetric.

Code Repositories

selectiveCI

Implement selective confidence intervals


1 Introduction

Modern statistical applications, for instance, studies of high-throughput experiments or high-dimensional databases, rarely involve only one parameter. Rather, investigators generally rely on model selection or parameter selection of some kind, e.g., algorithmically, by trial and error, or merely by examining the data before deciding what analysis to perform. Practical or scientific reasons may limit interest or reporting to a subset of parameters, even when the analysis involved many more. These may be the only parameters reported in the analysis, or simply emphasized above the rest by inclusion in the abstract, discussion, tables, or figures. Inferences about such selected parameters often are based on the same data used to decide they are “important.”

Because selection alters the distribution of $p$-values, estimators, and test statistics, selection complicates inference. The ASA board issued a warning about $p$-values for selected parameters [Wasserstein and Lazar, 2016], but did not suggest any remedy. Instead, they recommended using alternatives such as confidence intervals. Unfortunately, selection also alters the coverage probability of standard confidence intervals. Here, we provide new methods for constructing simultaneous confidence intervals for parameters selected because their estimates are largest. The methods illustrate that using information about the specific way the parameters are selected can improve inferences.
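To make the loss of coverage concrete, the following sketch computes the exact coverage of the unadjusted interval for the parameter whose estimate is largest. It rests on illustrative assumptions that are ours rather than the paper's: the $m$ estimators are independent standard normals and every true parameter is zero. The function name and the default level `alpha = 0.05` are hypothetical choices.

```python
from statistics import NormalDist

def coverage_of_unadjusted_interval_for_largest(m, alpha=0.05):
    """Exact coverage of max(X) +/- z_{1-alpha/2} for the selected parameter.

    Assumes X_1..X_m are independent N(0, 1) and every true parameter is 0,
    so the interval covers exactly when |max(X)| <= z_{1-alpha/2}.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_max_below_upper = NormalDist().cdf(z) ** m    # P(max <= z)
    p_max_below_lower = NormalDist().cdf(-z) ** m   # P(max < -z)
    return p_max_below_upper - p_max_below_lower

# Coverage drops quickly as the number of candidates m grows.
for m in (1, 2, 20, 100):
    print(m, round(coverage_of_unadjusted_interval_for_largest(m), 4))
```

Under these assumptions the coverage for $m = 2$ equals $(1-\alpha/2)^2 - (\alpha/2)^2 = 1-\alpha$, which agrees with the result for the larger of two in Section 3.1, while for larger $m$ the coverage falls well below the nominal level.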

The issue of inference after selecting hypotheses, parameters, or models after observing the data, known as selective inference, was recognized 70 years ago in the problem of making all pairwise comparisons between independent groups. The field that copes with selective inference came to be known as ‘multiple comparisons.’ Inference about the difference selected as interesting because it was estimated to be largest led Tukey to introduce the studentized range [Tukey, 1953, Braun, 1994], and selection has remained a motivating theme, as exemplified by the introduction to the definitive work by Hochberg and Tamhane [1987]. That motivation notwithstanding, the book’s solutions involved simultaneous coverage over all parameters of potential interest, thereby controlling the error probability for whatever subset was selected. Similarly, Berk et al. [2013], who develop methods for selective inference in the context of model selection in linear regression, rely on simultaneous protection.

Addressing selective inference by simultaneous coverage of all possible subsets of parameters is conservative in “modern” statistical problems with very many potential inferences, often more than observations (“$p \gg n$”). Benjamini and Yekutieli [2005] distinguished between simultaneous inference and selective inference. The latter requires that the inferential property (e.g., the Type I error rate or coverage probability) hold on average over the selected parameters. Selective inference is often less stringent than simultaneous inference. For instance, if a collection of intervals has simultaneous coverage probability $1-\alpha$, its selective coverage is at least $1-\alpha$, but the converse is not true. Methods for selective inference can be more powerful than methods for simultaneous inference, as illustrated by methods that control the False Coverage Rate (FCR) [Benjamini and Yekutieli, 2005].

FCR control is a property of the entire selected set of intervals. FCR is not controlled for subsets of that set. This is not a hypothetical problem. For instance, we might evaluate a large number of potential drug molecules for efficacy, then decide to look more closely at the most promising ten. One or two of the ten (not necessarily the largest) are then used in additional experiments. Regions on the genome may be identified to be associated with some disease, but only one or two locations in each region used for replication. Identifying a few peaks of activity in fMRI studies is another example where initial screening is followed by study of a subset of items that pass the initial screening. In all these examples, controlling FCR in the initial screening guarantees nothing about subsets of the set that pass.

This problem is attracting attention, but proposed solutions require specifying how the sub-selection is performed. See, e.g., Katsevich and Ramdas [2018] and references therein. On the other hand, starting with simultaneous confidence intervals could be too conservative or exaggerate the potential of treatments by including larger effects than the data support.

In this work, we explore a middle way: Given a specific selection rule, a set of confidence intervals has simultaneous coverage over the selected if the probability of not covering any selected parameter is controlled at a desired level. Simultaneity is guaranteed only on the selected set, not all possible sets. This allows us to make sharper inferences than omnibus protection against all selection rules allows. For example, an interval for the parameter estimated to be the larger of two does not need to be longer than the standard univariate interval; see Section 3.

We demonstrate the advantage of this approach for a simple yet practical rule: “select the parameters estimated to be largest.” This rule is used—sometimes silently—in applications in which a single risk factor is of primary interest, but there is a collection of possible confounders. We also give a new result for the rule “select the parameter estimated to be largest in absolute value” in the bivariate normal case.

2 Simultaneous over selected parameters

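We observe $X = (X_1, \ldots, X_m) \sim \mathbb{P}_\theta$, where $X_i$, $i = 1, \ldots, m$, are real-valued random variables. A selection rule $S$ is a mapping from $\mathbb{R}^m$ into $2^{\{1, \ldots, m\}}$, the power set of $\{1, \ldots, m\}$, that is $\mathbb{P}_\theta$-measurable for all $\theta$. The set $S = S(X)$ consists of the indices of the selected parameters: If $i \in S$, we seek finite-length confidence intervals $CI_i = CI_i(X)$ for $\theta_i$. We make no confidence statement about the other parameters.

The Simultaneous over all Possible selections (SoP) error rate is

$$\mathbb{P}_\theta\left(\exists\, i \in \{1, \ldots, m\} : \theta_i \notin CI_i\right). \tag{1}$$

The False Coverage-statement Rate (FCR) controls non-coverage on average over the selected parameters:

$$\mathbb{E}_\theta\left[\frac{|\{i \in S : \theta_i \notin CI_i\}|}{|S|}\right], \tag{2}$$

where $0/0 \equiv 0$ (if no interval is constructed, no interval fails to cover). Another error rate that explicitly involves $S$ is the Conditional over Selected (CoS):

$$\mathbb{P}_\theta\left(\theta_i \notin CI_i \mid i \in S\right). \tag{3}$$

(For a recent review of this criterion, see Taylor and Tibshirani [2015].)

This paper focuses on the Simultaneous over Selected parameters (SoS) error rate:

$$\mathbb{P}_\theta\left(\exists\, i \in S : \theta_i \notin CI_i\right). \tag{4}$$

SoS is the probability that any interval for a selected parameter fails to cover. Controlling SoS allows further selection from $S$ without requiring the intervals to be modified.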
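The difference between SoS and FCR can be illustrated with a small helper that estimates both from repeated experiments. This is only a sketch: the function name `empirical_error_rates` and its input layout (one set of selected indices and one list of non-coverage flags per replication) are hypothetical conventions chosen for the example.

```python
def empirical_error_rates(selected_sets, miss_flags):
    """Estimate SoS and FCR from repeated experiments.

    selected_sets: list of sets of selected indices, one per replication.
    miss_flags:    list of sequences; miss_flags[r][i] is True when the
                   interval for index i missed its parameter in replication r.
    """
    n_reps = len(selected_sets)
    sos_hits = 0
    fcr_sum = 0.0
    for selected, miss in zip(selected_sets, miss_flags):
        n_miss = sum(1 for i in selected if miss[i])
        # SoS counts a replication once if any selected interval misses.
        sos_hits += 1 if n_miss > 0 else 0
        # FCR averages the fraction of selected intervals that miss (0/0 = 0).
        fcr_sum += n_miss / len(selected) if selected else 0.0
    return {"SoS": sos_hits / n_reps, "FCR": fcr_sum / n_reps}

selected = [{0, 1}, {2}, set(), {0, 3}]
misses = [
    [True, False, False, False],
    [False, False, False, False],
    [True, True, True, True],      # nothing selected, so nothing counts
    [True, False, False, True],
]
print(empirical_error_rates(selected, misses))
```

In this toy input the SoS estimate is at least as large as the FCR estimate, as it must be: a replication with any miss contributes 1 to SoS but only the missed fraction to FCR.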

Despite how often practitioners use the same data to select a set of parameters and then make inferences about the selected parameters, there are few results regarding controlling SoS. Venter [1988] constructed a confidence interval for the mean estimated to be the largest among independent Gaussian estimators and Fuentes et al. [2018] recently constructed SoS intervals for the means estimated to be largest among independent normal estimators. “Multiple comparisons with sample best” [Hsu, 1981, 1996] also involves testing a set of hypotheses that depends on the data through the identity of the parameter estimated to be largest, to which all other parameter estimates are compared: this might be the first example of concern about SoS. But the general idea of simultaneous coverage over a selected subset does not seem to have been recognized, aside from the work of Hechtlinger [2014], incorporated here.

Controlling SoP obviously controls both SoS and FCR. Controlling CoS assures conditional coverage for each selected parameter, so it controls coverage on average for the selected parameters. Therefore, CoS controls the conditional FCR, given that at least one parameter was selected. Since conditional FCR is larger than FCR, controlling CoS controls FCR.

For some selection rules, intervals that control CoS might not control SoP; for others, CoS yields longer intervals than SoP. Intervals with control SoP at level .

When $|S| = 1$ with probability 1, SoS and CoS are equivalent: CoS intervals control SoS and SoS intervals control CoS. If $|S| = k$ with probability 1, CoS intervals at level $\alpha$ control SoS at level $k\alpha$.
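The relationship between CoS and SoS when a fixed number of parameters is selected can be sketched with a union bound. The notation below ($CI_i$ for the interval reported for $\theta_i$, $S$ for the selected index set, $k$ for its fixed size, $\alpha$ for the CoS level) is our shorthand, and the derivation is an illustration rather than a result quoted from the paper. Terms with $\Pr(i \in S) = 0$ are taken to be zero.

```latex
\begin{align*}
\Pr(\exists\, i \in S : \theta_i \notin CI_i)
  &\le \sum_{i=1}^{m} \Pr(i \in S,\ \theta_i \notin CI_i) \\
  &= \sum_{i=1}^{m} \Pr(\theta_i \notin CI_i \mid i \in S)\, \Pr(i \in S) \\
  &\le \alpha \sum_{i=1}^{m} \Pr(i \in S)
   = \alpha\, \mathbb{E}|S| = k\alpha .
\end{align*}
```

Read the other way, constructing each CoS interval at level $\alpha/k$ would keep the SoS error rate at or below $\alpha$ under this bound.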

3 Controlling SoS by inverting non-equivariant unconditional tests

One way to construct intervals that control SoS is to make simultaneous intervals for all parameters. If we construct the simultaneous confidence intervals by inverting suitable non-equivariant hypothesis tests, then projecting the joint confidence set onto the selected components, we can get “selective” behavior by designing the tests so that the confidence intervals for the parameters that are not selected are $(-\infty, \infty)$, in effect not making any inference about those parameters.

For instance, consider the acceptance region for testing the hypothesis :

(5)

where

(6)

Inverting this family of tests for yields

(7)

which satisfies

(8)

Whenever , is between and , for all . Therefore, the intervals

(9)

are simultaneous confidence intervals for . Because these intervals have simultaneous coverage for all , they have simultaneous coverage for .

This choice of gives these confidence intervals a selective structure: Suppose . Consider with components

(10)

Then . Hence, , and so for all . Since this holds for all , for : the procedure does not try to constrain non-selected parameters.

In general, depends on the selection rule , but bounds on can make it easy to invert these families of tests conservatively. For instance, if , , then , and for . If one can find a simple function such that for all , then conservative confidence intervals can be found by inverting acceptance regions based on rather than .

3.1 The larger of two exchangeable, symmetric estimates

We construct SoS intervals for the parameter estimated to be the larger of two: , where denotes the index of the larger of and . (Ties can be broken lexicographically.) We specialize to the case

are exchangeable and symmetric; they need not be independent or continuous. This generalizes a result for correlated bivariate normals with unit variance given by Hechtlinger [2014].

Define the set transformation

(11)

Symmetry and exchangeability imply that if

is measurable with respect to the joint distribution of

, so is , and . By definition, .

For this , the acceptance region (5) is the union of two disjoint semi-infinite trapezoids:

(12)

where

Proposition 1.

Suppose , are exchangeable with a symmetric marginal distribution, and define . For all ,

Proof.

The proof hinges on the fact that

(13)

The LHS is a subset of the RHS: If , . If , there exists with and for which . Thus , and , so .
The RHS is a subset of the LHS: If and , . If and , . Then , and , so . Calculation verifies . Thus (13) holds.

Now

The conclusion follows by substituting . ∎

Figure 1 (a) illustrates the key idea in the proof.

Figure 1: The semi-infinite trapezoids that comprise the acceptance region . The blue hashed region is and the dark red hashed region is . Exchangeability and symmetry of imply that the probability of the lighter red region is equal to that of the dark red region if , for every .

The proposition shows that is the acceptance region for a level test of the hypothesis . As mentioned above, inverting tests of this form yields the confidence interval

(14)

This is the standard two-sided univariate symmetric confidence interval, based on the larger of the two observations. The standard two-sided univariate confidence interval has the right coverage probability despite the selection and the dependence.
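The claim can be checked numerically in one special case. The sketch below assumes, purely for illustration, that $X_i = \theta_i + Z_i$ with $Z_1, Z_2$ independent standard normals (one instance of exchangeable, symmetric errors) and integrates the non-coverage probability of the unadjusted interval for the larger estimate with a trapezoid rule. The function name, the grid size, and the truncation point `z_max` are hypothetical choices.

```python
from statistics import NormalDist

STD = NormalDist()

def noncoverage_larger_of_two(delta, alpha=0.05, steps=4000, z_max=10.0):
    """P(the unadjusted interval for the larger estimate misses its parameter).

    Assumes X_i = theta_i + Z_i with Z_1, Z_2 independent N(0, 1) and
    delta = theta_1 - theta_2. Trapezoid rule over |z| in [c, z_max].
    """
    c = STD.inv_cdf(1 - alpha / 2)
    h = (z_max - c) / steps
    total = 0.0
    for j in range(steps + 1):
        z = c + j * h
        weight = 0.5 if j in (0, steps) else 1.0
        for s in (z, -z):  # upper tail and mirrored lower tail
            # Index 1 is selected when Z_2 < s + delta (with Z_1 = s);
            # index 2 is selected when Z_1 <= s - delta (with Z_2 = s).
            total += weight * STD.pdf(s) * (STD.cdf(s + delta) + STD.cdf(s - delta))
    return total * h

for delta in (0.0, 0.7, 3.0):
    print(delta, round(noncoverage_larger_of_two(delta), 5))
```

In this normal model the integrand at $z$ and $-z$ sums to $2\varphi(z)$ regardless of `delta`, so the computed non-coverage should stay at $\alpha$ for every difference between the two parameters, up to quadrature error.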

3.2 Larger absolute value of two normal estimators

This section addresses constructing a confidence interval for the component selected by the rule , where with

. (Since the two estimators have normal distributions, the probability of a tie is zero.) As above, we find the intervals by inverting a family of hypothesis tests. The acceptance regions for the tests are again a union of two pieces, but they are more complicated than those in section 3.1.

3.2.1 Acceptance regions

The acceptance region for testing the hypothesis is

(15)

where

(16)

and is chosen so that the test has level . Figure 2 plots acceptance regions for six values of .

Figure 2: for various values of (the black dot in the figures). The blue area is and the cayenne area is .

3.2.2 Finding

The constant is the smallest for which . At , is a square; so, is Šidák’s constant . We will bound for .

Proposition 2.

Let . For for all ,

(17)

The proof is in appendix A. Define

(18)

By Proposition 2, . Explicit calculation gives

(19)

where . The value of is the smallest for which (19) is at least .

Hence, tests using instead of have level no larger than , and inverting them will give confidence intervals with coverage probability at least . Figure 3 plots as a function of .

Figure 3: The upper bound as a function of .

As grows, decreases monotonically; , the half-width of a standard univariate normal confidence interval.

3.2.3 The confidence interval

For , the confidence interval for is:

where

Since , both endpoints are between and : the interval is shorter than Šidák’s simultaneous interval. For , the interval is widest when ; there, the acceptance region for just includes . The maximum width is about of the width of the Šidák intervals. As , the length converges to that of the standard unadjusted confidence interval, about 88% of the length of the Šidák interval.
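The comparison with the Šidák interval can be reproduced with the standard library. The sketch assumes two independent unit-variance normal estimators and a level of $\alpha = 0.05$; the level is our assumption for the example, and the function names are hypothetical.

```python
from statistics import NormalDist

def sidak_halfwidth_for_two(alpha=0.05):
    """Half-width of two-sided Sidak intervals for two independent N(theta_i, 1) estimators."""
    per_interval_coverage = (1 - alpha) ** 0.5
    return NormalDist().inv_cdf((1 + per_interval_coverage) / 2)

def unadjusted_halfwidth(alpha=0.05):
    """Half-width of the standard two-sided univariate normal interval."""
    return NormalDist().inv_cdf(1 - alpha / 2)

ratio = unadjusted_halfwidth() / sidak_halfwidth_for_two()
print(round(ratio, 3))
```

With these assumptions the ratio comes out close to the "about 88%" quoted for the limiting length.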

Unlike the SoS interval for the larger of two, this SoS interval for the larger absolute value of two does not automatically work when are dependent, because can be lower than it is when they are independent (see Hechtlinger [2014]). Acceptance regions for parameters with dependent estimators could be calibrated by brute force computation, then inverted computationally to construct confidence intervals.

4 Largest $k$ of $m$

Rather than construct a confidence interval for the single parameter estimated to be the larger of two, as in section 3.1, in this section we construct SoS-controlling confidence intervals for the $k$ parameters estimated to be the largest of $m$ parameters.

Consider independent random variables . Let be the CDF of , where . Let be the order statistics of . Let and let be the observed order statistics.

Consider the selection rule that keeps the components corresponding to the largest components of , with ties broken lexicographically, and let . Let denote the cardinality of the set , so . Let .

Since $S$ contains the components for which $X_i$ is largest, one might expect the conditional distribution of $X_i$ given $i \in S$ to be stochastically larger than its unconditional distribution. It is, which allows us to control the chance that the upper endpoint of any interval is below its parameter using a Bonferroni adjustment with multiplicity $k$ rather than $m$.

Consider the intervals

(20)

where and , , with , , and

Proposition 3.
(21)
Corollary 4.

For all , if and , then the intervals defined in equation 20 have SoS coverage.

Corollary 4 defines a family of intervals that control SoS at level . The intervals are in general asymmetric, and the chance that the intervals miss the parameter from below is not in general equal to the chance that they miss from above. For , neither the expected rate at which the upper endpoint is below its parameter nor the expected rate at which the lower endpoint is above its parameter exceeds . When are all symmetric, setting gives symmetric intervals corresponding to the two sided Bonferroni correction.

When are all equal, finding the that yields the shortest intervals is a 1-dimensional optimization problem. Figure 4 plots the length of the intervals for different values of and when . Because selects large components, the shortest intervals extend below by more than they extend above .
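A minimal sketch of intervals of this kind follows, under assumptions that are ours: the estimators are independent $N(\theta_i, 1)$, the error budget $\alpha$ is split between the two sides by a hypothetical parameter `upper_share` (strictly between 0 and 1), the upper endpoints use a Bonferroni adjustment over $k$, and the lower endpoints use one over $m$. It follows the verbal description in this section and in the Discussion and should not be read as the paper's exact equation (20).

```python
from statistics import NormalDist

def sos_intervals_largest_k(x, k, alpha=0.05, upper_share=0.5):
    """Intervals for the k largest of m estimates (illustrative sketch).

    Assumes independent X_i ~ N(theta_i, 1). The error budget alpha is split
    into alpha_u = upper_share * alpha (upper endpoints, multiplicity k) and
    alpha_l = alpha - alpha_u (lower endpoints, multiplicity m).
    """
    m = len(x)
    alpha_u = upper_share * alpha
    alpha_l = alpha - alpha_u
    up = NormalDist().inv_cdf(1 - alpha_u / k)     # Bonferroni over k
    down = NormalDist().inv_cdf(1 - alpha_l / m)   # Bonferroni over m
    # Indices of the k largest estimates; ties broken by lower index first.
    selected = sorted(range(m), key=lambda i: (-x[i], i))[:k]
    return {i: (x[i] - down, x[i] + up) for i in selected}

estimates = [0.3, 2.9, -1.0, 3.4, 0.1, 2.9]
for index, (lo, hi) in sorted(sos_intervals_largest_k(estimates, k=2).items()):
    print(index, round(lo, 3), round(hi, 3))
```

Setting `upper_share=0.5` gives the simplest version; shifting more of the budget to one side changes the asymmetry, and the shortest choice can be searched over that single parameter.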

Figure 4: Lengths of the confidence intervals as a function of , , and , for . The dots are plotted at the minimizing values of ; lengths are nearly minimal when .
Proof of Proposition 3.

Define

(22)

and

(23)

Now if either (then the lower endpoint of the interval is greater than , and ) or (then the upper endpoint of the interval is below , and ).

Since and are mutually exclusive, the event is the event . Let and . The event that at least one interval does not cover its parameter is the event . Hence,

(24)

The first term on the right hand side is

(25)

The second term on the right of (24) is

Let . Then for all ,

By Theorem 4.12.3 of Grimmett and Stirzaker [2001], there exists a random variable independent of such that and

Hence,

where the first step uses the independence of . Thus

(26)

Combining (24), (25), and (26) yields the desired result. ∎

5 Comparisons

Bonferroni confidence intervals are of the form , with

(27)

Bonferroni confidence intervals control SoP (and consequently SoS) even when are dependent. When are independent, Šidák confidence intervals, which are of the same form but with

(28)

offer a small (but optimal) improvement on Bonferroni intervals.
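The size of the Šidák improvement can be seen by computing both constants. The sketch assumes standard normal errors and two-sided intervals at $\alpha = 0.05$; the function names are hypothetical, and the formulas are the textbook Bonferroni and Šidák adjustments rather than a transcription of equations (27) and (28).

```python
from statistics import NormalDist

def bonferroni_halfwidth(m, alpha=0.05):
    """Two-sided Bonferroni half-width for m unit-variance normal estimators."""
    return NormalDist().inv_cdf(1 - alpha / (2 * m))

def sidak_halfwidth(m, alpha=0.05):
    """Two-sided Sidak half-width; relies on independence of the m estimators."""
    return NormalDist().inv_cdf((1 + (1 - alpha) ** (1 / m)) / 2)

for m in (2, 10, 100, 1000):
    b, s = bonferroni_halfwidth(m), sidak_halfwidth(m)
    print(m, round(b, 4), round(s, 4), round(s / b, 4))
```

The printed ratios stay very close to 1, which illustrates how small the gain from Šidák over Bonferroni is in this setting.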

Fuentes et al. [2018] construct SoS intervals for this out of problem with are IID . Their intervals, which we call FCW, are of the form , where and are constants such that

(29)

Figure 5 compares the Bonferroni intervals, the Šidák intervals, FCW intervals (with and with and chosen to make the intervals as short as possible), and the SoS intervals (with chosen to make the intervals symmetric or as short as possible).

Figure 5: Lengths of confidence intervals with simultaneous coverage of the of normal means estimated to be largest, for .

Because SoS intervals make inferences about only $k$ parameters, one would expect them to be narrower than SoP intervals, but not as narrow as intervals for a pre-specified set of parameters: there should be some penalty for the selection. (The case $k=1$, $m=2$ is a remarkable exception.) Letting the intervals be asymmetric can reduce their length further. Suppose all the parameters are equal. Because $S$ selects the largest $X_i$, $S$ will preferentially select components for which $X_i > \theta_i$. Using intervals that extend below $X_i$ more than they extend above $X_i$ allows the intervals to be shorter. Figure 5 shows that the advantage of allowing asymmetry is substantial when . The SoS intervals are slightly wider than the FCW intervals, but the FCW intervals require $X_i$ to have normal distributions, while the SoS intervals do not.

Weinstein et al. [2013], Fithian et al. [2014] and Reid et al. [2017] develop CoS intervals for the largest $k$ of $m$. Controlling CoS controls FCR. These intervals also control SoS if each interval is constructed at level $\alpha/k$, rather than $\alpha$.

CoS intervals use conditional distributions, which depend on the conditioning event and the underlying parameters of the model. As a result, CoS intervals are sometimes useful and sometimes not. For instance, CoS intervals for the largest of or largest absolute value of can be too wide to be useful when the largest and second-largest estimators are close [Fithian et al., 2015]. Table 1 shows the uniformly most powerful unbiased (UMPU) CoS intervals from example 4 in Fithian et al. [2014] when is bivariate uncorrelated normal centered at , for the observed value and the greater is selected. In section 3, we showed that the unadjusted univariate interval has the right coverage. The UMPU saturated CoS intervals use . For , the conditional distribution is a renormalized tail of the distribution, which yields long intervals. The UMPU selected CoS intervals use . Then, the underlying distribution is a function of both and . Setting produces “nicer” intervals than because the selection event has less effect on the conditional distribution. But while setting the coefficient of an unselected variable to zero in a model might make sense, for location parameters, setting when it is estimated to be does not. This shows the importance of the unselected parameters to the coverage of selected parameters. Furthermore, if a coverage were needed for all , the interval would have an infinite lower tail, since for all there exist such that is centered at .

Because the lengths of CoS intervals depend on the observed value of (not only on and ), we do not include CoS intervals in our figures or numerical comparisons.

Length
Marginal / SoS
- UMPU saturated
- UMPU selected
- UMPU selected
- UMPU selected
Table 1: Intervals for for the bivariate normal after selecting the greater of the two estimators when . The depicted UMPU saturated and selected CoS intervals are from Fithian et al. [2014].

6 Discussion

Katsevich and Ramdas [2018] address “simultaneous selective inference” in testing. They consider making selective inference statements for many selection rules, guaranteeing that these statements hold simultaneously with high probability. By restricting the possible selections to those determined by a particular algorithm, for instance, those obtained by varying the level at which FDR is controlled, their method can improve on full simultaneity.

We have shown that knowing the selection rule can improve inferences, for instance by allowing one to control SoS or CoS without necessarily controlling SoP. More information is better. Lee et al. [2016] and Tibshirani et al. [2016] constructed confidence intervals for parameters selected by the Lasso and by forward stepwise selection, respectively.

For SoP and FCR, it is not necessary to know , but if is known, improvements are possible. Berk et al. [2013] addressed the problem of inference when the model is selected because a pre-specified explanatory variable had the highest statistical significance, which restricts the family over which simultaneous coverage is required. The restriction transforms the problem into that of making a confidence interval for the coefficient whose estimate is most significantly nonzero among many correlated estimates. This amounts to selecting on the basis of the largest 1 of correlated statistics, which can be solved computationally for problems of modest dimension. If the design matrix is orthogonal, their confidence interval amounts to the Bonferroni interval adjusted for parameters—the dimension of the relevant family. Weinstein et al. [2013] constructed CoS intervals tailored to avoid containing zero, at the expense of being somewhat longer where it does not matter and Weinstein and Yekutieli [2014] designed FCR intervals that try to avoid covering 0. Barber and Candès [2015] took advantage of the fact that their knockoff method used forward stepwise search (their method does not yet offer confidence intervals).

To see that FCR intervals can also take advantage of knowledge of , consider the rule that selects the parameters estimated to be the largest of . This rule always selects parameters. The proof of proposition 3 bounds the probability of making one or more errors via the expected number of upper endpoints that are smaller than their corresponding parameter plus the expected number of lower endpoints that are larger than their corresponding parameters. These expectations are bounded by a multiple of . FCR control requires the expected number of non-covering intervals, divided by , to be at most . Therefore, if selects the largest of , FCR intervals can have lower endpoints based on the quantile (which works for any simple selection rule [Benjamini and Yekutieli, 2005]), and upper endpoints based on the quantile instead of the quantile.

In conclusion, specifying a selection rule allows us to explore the inferential ground between simultaneous and false coverage-statement rate controlling confidence intervals, by defining the in-between goal of simultaneous coverage over the selected. We explored the implications of this error rate by studying the largest k-out-of-m rule. This resulted in new confidence intervals. In their simplest version, the lower endpoints are Bonferroni adjusted for m, while the upper endpoints are Bonferroni adjusted only for k, offering uniform improvement over regular Bonferroni. Furthermore, using such a selection rule allows improvement over the general FCR intervals.

It seems clear that the confidence intervals retain their coverage for positive-regression-dependent estimators, because the conditional distribution of the variables selected is stochastically greater than the same variables without selection, but we do not have a formal proof. Appendix B gives numerical evidence that this is indeed the case. The intervals may also be valid when is random, for instance when selection depends on crossing a fixed threshold or some other testing procedure.

We think the results for the largest in absolute value of extend to and , but again, we offer no formal proof. We have decided to publish our current results without those generalizations in the hope that they will spark interest and engage others in pursuing this approach.

Appendix A Proof of Proposition 2

Since are IID standard normals, they are rotationally invariant and symmetric. Of course, . These three properties let us find or bound .

Assume without loss of generality that . The acceptance regions have three forms, depending on whether (1) , (2) , or (3) .

Case (1) . Because , is a trapezoid and consists of two congruent triangles. Rotating the triangles and reflecting them about yields the hexagon in Figure 6. Because , consists of two trapezoids. Clockwise rotation and reflection of the trapezoid yields a parallelogram. See Figure 7. The transformation translates to , allowing us to compare their acceptance regions. See Figure 8.

(a)
(b)
Figure 6: (a): . (b): after a -preserving transformation.
(a)
(b)
Figure 7: (a):  for . (b):  after a -preserving transformation.
(a)
(b)
Figure 8: (a):  and superposed after -preserving transformations when . The black hexagon is and the shaded parallelogram is after being re-centered at . (b): Detail of the top of panel (a). The bottom half of the trapezoid is the top half rotated by , which does not change its probability.

The line in Figure 8 (b) is . It intersects at the point . If , then . Otherwise, is above the point , and to the right of