On agnostic post hoc approaches to false positive control

10/25/2019 ∙ by Gilles Blanchard, et al. ∙ 0

This document is a book chapter which gives a partial survey on post hoc approaches to false positive control.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 10

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Setting and basic assumptions

Let us observe a random variable

with distribution belonging to some model . Consider null hypotheses , , for . We denote the set of true null hypotheses and its complement. We assume that a -value

is available for each null hypothesis

, for each .

We introduce the following assumptions on the distribution , that will be useful in the sequel:

(Superunif)
(Indep)

2 From confidence bounds …

Consider some fixed deterministic . A -confidence bound for , the number of false positives in , is such that

A first example is given by the -Bonferroni bound for some fixed such that

(otherwise the bound is trivial). The coverage probability is ensured under (

Superunif) by the Markov inequality:

However, in practice, is often chosen by the user and possibly depends on the same data set, then denoted to emphasize this dependence; it typically corresponds to items of potential strong interest. The most archetypal example is when consists of the smallest -values , for some fixed value of . In that case, it is easy to check that the above bound does not have the correct coverage: for instance, when the -values are i.i.d. and , we have for (that is, when the bound is informative),

where

denotes the usual beta distribution with parameters

and . For instance, taking , , and , the latter is approximately equal to , while the intended target is .

This phenomenon is often referred to as the selection effect: after some data driven selection, the probabilities change and thus the usual statistical inferences are not valid.

3 … to post hoc bounds

To circumvent the selection effect, one way is to aim for a function (denoted by for short) satisfying

(1)

that is, a confidence bound that is valid uniformly over all subsets . As a result, for any particular algorithm , inequality (1) entails , and thus does not suffer from the selection effect. Such a bound will be referred to as a -post hoc confidence bound throughout this chapter, ”post hoc” meaning that the set can be chosen after having seen the data, and possibly using the data several times.

As a first example, the -Bonferroni post hoc bound is

(2)

Following the same reasoning as above, it has a coverage at least under (Superunif):

Remark 1.1

Compared to the -Bonferroni confidence bound of Section 2, has been replaced by , so that the post hoc bound is much more conservative than a (standard, non uniform, fixed) confidence bound when gets small, which is well expected. This scaling factor is the price paid here to make the inference post hoc. We will see in Sections 4 and 7 that it can be diminished when considering bounds of a different nature.

Exercise 1.1

For , when the -values are i.i.d. and , prove that the coverage probability of the -Bonferroni post hoc bound is at most . Does the Bonferroni post hoc bound provide a sharp coverage in that case?

The Bonferroni post hoc bound, while it is valid under no assumption on the dependence structure of the -value family, may be conservative, in the sense that will be large for many subsets . For instance, one has (trivial bound) for all the sets such that .

The Bonferroni bound can be further improved under some dependence restriction, with the Simes post hoc bound:

(3)

Its coverage can be computed as follows (using arguments similar as above):

(4)

Under (Superunif) and (Indep), this is lower than or equal to by using the Simes inequality. More generally, the Simes post-hoc bound is valid in any setting where the Simes inequality holds. This is the case under a specific positive dependence assumption called Positive Regression Dependency on a Subset of hypotheses (PRDS), which is also the assumption under which the Benjamini-Hochberg (BH) procedure has been shown to control the false discovery rate (FDR).

While it uses more stringent assumptions, can be much less conservative than . For instance, if , we have and , which can lead to a substantial improvement.

From Exercise 1.2 below, the Simes bound has a nice graphical interpretation: can be interpreted as the smallest integer for which the shifted line is strictly below the ordered -value curve, see Figure 1.

Figure 1: Illustration of the Simes post hoc bound (3) according to the expression (5), for two subsets of (left display/right display), both of cardinal and for . The level is (taken large only for illustration purposes). Black dots: sorted -values in the respective subsets. Lines: thresholds (in red for , in light gray otherwise). The post hoc bound corresponds the length of the bold line on the -axis.
Exercise 1.2

Prove that for all , is equal to

(5)

where denote the ordered -values of . [Hint: start from for some , and find equivalent expressions.]

Exercise 1.3

In Figure 1, check that (resp. ) in the left (resp. right) situation. Compare to for (, ).

The Simes post hoc bound (3) has, however, several limitations: first, the coverage is only valid when the Simes inequality holds. This imposes restrictive conditions on the model used, which are rarely met or provable in practice. Second, even in that case, the bound does not incorporate the dependence structure, which may yield conservativeness (see Exercise 1.4 below). Finally, this bound intrinsically compares the ordered -values to the threshold (possibly shifted). We can legitimately ask whether taking a different threshold (called template below) does not provide a better bound.

Exercise 1.4

Consider the case , for which is even, and denote the upper-tail distribution function of a standard variable. Consider the one-sided testing situation where , and , , for a

-dimensional Gaussian vector

that is centered, with covariance matrix having as diagonal elements and as off-diagonal elements. Show that the coverage probability of the Simes post hoc bound is equal to

(6)

The above quantity is displayed in Figure 2 for , as a function of .

Figure 2: Coverage of the Simes post hoc bound (6) in the setting of Exercise 1.4 as a function of and for .

4 Threshold-based post hoc bounds

More generally, let us consider bounds of the form

(7)

where , , , is a family of functions, called a template. A template can be seen as a spectrum of curves, parametrized by . We focus here on the two following examples:

  • Linear template: , ;

  • Beta template:

    -quantile of

    , .

An illustration for the above templates is provided in Figure 3.

Figure 3: Curves for a wide range of values. Left: linear template. Right: beta template.

For a fixed template, the idea is now to choose one of these curves, that is, one value of the parameter , so that the overall coverage is larger than . Following exactly the same reasoning as the one leading to (4), we obtain

(8)
(9)

by letting the generalized inverse of (in general, this is valid provided that for all , and is non-decreasing and left-continuous on , as in the case of the two above examples). What remains to be done is thus to calibrate such that the quantity (9) is below .

Several approaches can be used for this. It is possible that for the model under consideration, the joint distribution of

is equal to the restriction of some known, fixed distribution on to the coordinates of (this is a version of the so-called subset-pivotality condition). It is met under condition (Indep), but it is also possible that the dependence structure of the -values is known (for example, in genome-wide association studies, the structure and strength of linkage disequilibrium can be tabulated from previous studies and give rise to a precise dependence model). In such a situation, the calibration of can be obtained either by exact computation, numerical approximation or Monte-Carlo approximation under the full null.

Another situation of interest, on which we focus for the remainder of this section, is when the null corresponds to an invariant distribution with respect to a certain group of data transformations, which is the setting for (generalized) permutation tests, allowing for the use of an exact randomization technique. More precisely, assume the existence of a finite transformation group acting onto the observation set . By denoting the null -value vector for , we assume that the joint distribution of the transformed null -values is invariant under the action of any , that is,

(Rand)

where denotes that has been transformed by .

Let us consider a (random) tuple of (for some ), where is the identity element of and

have been drawn (independently of the other variables) as i.i.d. variables, each being uniformly distributed on

. Now, let for all , and consider where denote the ordered sample . The following result holds.

Theorem 1.1

Under (Rand), for any deterministic template, the bound is a post hoc bound of coverage . This level is to be understood as a joint probability with respect to the data and the draw of the group elements .

As a case in point, let us consider a two-sample framework where

is composed of independent -dimensional real random vectors with , , i.i.d. (case) and , , i.i.d. (control). Then we aim at testing the null hypotheses ”, simultaneously for , without knowing the covariance matrix . Consider any family of -values such that only depends on the -th coordinate of the observations (e.g., based on difference of the coordinate means of the two groups). Note that is thus a measurable function of . Now, the group of permutations of is naturally acting on via the permutation of the individuals: for all ,

Exercise 1.5

Show that are i.i.d. and prove (Rand).

An illustration of the above -calibration method is provided in Figure 4 in the case where ,

for and using a beta template. In the left panel (full null), we have , so that . In the right panel (half of true nulls), we have for and , for , for some , so that . Following expression (8), is the highest beta curve such that at most orange curves have a point situated below it. This also shows that the above

-calibration is slightly more severe when part of the data follows the alternative distribution. This is a commonly observed phenomenon: although the permutation approach is valid even when part of the null hypotheses are false, their inclusion in the permutation procedure tends to yield test statistics that exhibit more variation under permutation, thus inducing more conservativeness in the calibration.

Figure 4: Illustration of the calibration method on one realization of the data . Black curves: beta template for some range of values. Orange curves: ordered -values (after permutation) for . Red curve: . Left panel : full null, right panel : half of true nulls (see text). (Parameters , , , , .)

5 Reference families

We cast the previous bounds in a more general setting, where –post hoc bounds are explicitly based on a reference family with some joint error rate (JER in short) controlling property. This general point of view offers more flexibility and allows us to consider post hoc bounds of a different nature, as for instance those incorporating a spatial structure, see Section 7.

In general, a reference family is defined by a collection , where the ’s are data-dependent subsets of and the ’s are data dependent integer numbers (we will often omit the dependence in to ease notation). The reference family is said to control the JER at level if

(10)

Markedly, (10) is similar to (1), but restricted to some subsets , . The rationale behind this approach is that, while the choice of is let completely free in (1) (to accommodate any choice of the practitioner), the choice of the ’s and ’s in (10) is done by the statistician and is part of the procedure. Once we obtain a reference family satisfying (10), we obtain a post hoc bound by interpolation:

(11)

We call the optimal post hoc bound (built upon the reference family ). Computing the bound can be time-consuming, it actually has NP-hard complexity in a general configuration. We can introduce the following computable relaxations: for ,

(12)
(13)
Exercise 1.6

Show that and for all . Moreover, provided that (10) holds, show that , and are all valid –post hoc bounds.

In addition, the following result shows that the relaxed versions coincide with the optimal bound if the reference sets have some special structure:

Lemma 1.1
  • In the nested case, that is, , for , we have ;

  • In the disjoint case, that is, for , we have .

We can briefly revisit the post-hoc bounds of the previous sections in this general framework. The -Bonferroni post hoc bound (2) derives from the one-element reference family . The Simes post hoc bound (3) derives from the reference family comprising the latter reference sets for all . More generally, the threshold-based post hoc bounds of the form (7) are equal to the optimal bound with and , (indeed, these reference sets are nested, so that ).

How to choose a suitable reference family in general? A general rule of thumb is to choose the reference sets of the same qualitative form as the sets for which the bound is expected to be accurate. For instance, the Simes post hoc bound will be more accurate for sets with the smallest -values. In Section 7, we will choose reference sets with a spatial structure, which will produce a post hoc bound more tailored for spatially structured subsets .

6 Case of a fixed single reference set

It is useful to focus first on the case of a single fixed (non-random) reference set , with (random) satisfying (10), that is,

(In contrast with the -Bonferroni bound (2) where was fixed and variable, here is fixed and is variable.) In other words, is a –confidence bound of . Several example of such can be build, under various assumptions.

Exercise 1.7

For fixed, show that the following bounds are –confidence bounds for :

  • under (Superunif), for some fixed ,

    (14)

    where denotes the largest integer smaller than or equal to . [Hint: use the Markov inequality.]

  • under (Superunif) and (Indep),

    (15)

    where . [Hint: use the DKW inequality, that is, for any integer , for i.i.d. , we have for all with probability at least .]

In addition to the two above bounds (14) and (15), we can elaborate another bound in the generalized permutation testing framework  (Rand), as described in Section 4. Applying the result of that section, the following bound is also valid:

(16)

where denotes the -quantile of a distribution and , where denote the ordered sample for which (see the -calibration method of Section 4).

Once a proper choice of has been done, the optimal post hoc bound can be computed as follows: for any , When is large and does not contain very small -values, this bound can be sharper than the Simes bound.

Exercise 1.8

Let us consider a post bound based on the single reference family and as in (15) (choosing ). For such that , show that and .

Finally, while the case of a single reference set can be considered as an elementary example, the bounds developed in this section will be useful in the next section, for which several fixed reference sets are considered, and thus several (random) should be designed.

7 Case of spatially structured reference sets

We consider here the case where the null hypotheses , , have a spatial structure, and we are interested in obtaining accurate bounds on for subsets of the form , for some .

In that case, it is natural to choose formed of contiguous indices. To be concrete, consider reference sets consisting of disjoint intervals of the same size : assume for some integers and and let

(17)

When each of these regions is considered in isolation, Section 6 suggested several approaches (in the appropriate settings (Superunif), (Indep) or (Rand)) of a specific form , to underline the dependence of in and . By using a simple union bound, it is then straightforward to show that the JER control (10) is satisfied for

(18)

When the reference regions are disjoint as in the example (17) above, we can use the proxy (see (13)) which is known to coincide with the optimal bound . This gives rise to a post hoc bound that accounts for the spatial structure of the data.

Exercise 1.9

Compute in the case where is given by (14) () and (15). In each case, for a given , discuss how fluctuates when the size of the family increases.

When considering the reference regions defined by segments (17), we have to prescribe a scale ( here, the size of the segments). It is possible to extend this to a multi-scale approach, choosing overlapping reference intervals at different resolutions arranged in a tree structure, where parent sets are formed by taking union of (disjoint) children sets taken at a finer resolution. Furthermore, the proxy (13) has to be replaced by a more elaborate one, minimizing over all possible multi-scale partitions made of such reference regions. This can still be computed efficiently by exploiting the the tree structure. Doing so, the post hoc bound will be more scale adaptive to sets with possibly various sizes. The price to pay lies in the cardinality of the family, which gets larger. However, this does not necessarily make the corresponding bound much larger, as Exercise 1.9 shows when using the bound (15), since the level only enters it logarithmically.

8 Applications

Differential gene expression studies in cancerology aim at identifying genes whose mean expression level differs significantly between two (or more) cancer populations, based on a sample of gene expression measurements from individuals from these populations. We consider here a microarray data set111Taken from Chiaretti et. al., Clinical cancer research, 11(20):7209–7219, 2005. consisting of expression measurements for more than genes for biological samples from individuals with B-cell acute lymphoblastic leukemia (ALL). A subset of cardinal of these individuals harbor a specific mutation called BCR/ABL, while the remaining don’t. One of the goals of this study is to identify those genes for which there is a difference in the mean expression level between the mutated and non-mutated population. This question can be addressed, after relevant data preprocessing, by performing a statistical test of equality in means for each gene. A classical approach is then to derive a list of “differentially expressed” genes (DEG) as those passing a FDR correction by the Benjamini-Hochberg (BH) procedure at a user-defined level. For example, 163 genes are selected by the BH procedure at level . We note that although the usage of the BH procedure is standard for multiple two-sample tests and widely accepted in the biomedical literature, we have no formal guarantee that it is mathematically justified – in particular, genes are not independent, and there is no proof that the PRDS assumption holds in this setting.

In this section, we illustrate how the post hoc inference framework introduced in the preceding sections can be applied to this case to build confidence envelopes for the proportion of false positives (Section 8.1), and to obtain bounds on data-driven sets of hypotheses (Section 8.2), and on sets of hypotheses defined by an a priori structure (Section 8.3). These numerical results were obtained using the R package sansSouci, version 0.8.1222Available from https://github.com/pneuvial/sanssouci..

8.1 Confidence envelopes

In absence of specific prior information on relevant subsets of hypotheses to consider, it is natural to focus on subsets consisting of the most significant hypotheses. Specifically, we define the th -value level set as the set of the most significant hypotheses, corresponding to the -values , and consider post hoc bounds associated to for . Figure 5 provides post hoc confidence envelopes for the ALL data set, for . While -lower confidence bounds on the number of true positives of the form are displayed in the left panel, -upper confidence bounds on the proportion of false positives are shown in the right panel.

The confidence envelopes are built from the Simes bound (3) (long-dashed purple curve), and from two bounds obtained from Theorem 1.1 by -calibration using permutation of the sample labels, based on the two templates introduced in Section 4: the dashed red curve corresponds to the linear template with , and the solid green curve to the beta template with . Note that Assumption (Rand) holds because we are in the two-sample framework described after Theorem 1.1.

Figure 5: Confidence bounds on the number of true positives (left) and on the proportion of false positives (right) for several reference families: Simes reference family (long-dashed purple curve), linear template after -calibration (dashed red curve), and beta template after -calibration (solid green curve).

The vertical line in Figure 5 corresponds to the 163 genes selected by the BH procedure at level . The Simes bound ensures that the FDP of this subset is not larger than 0.48. As noted above concerning the BH procedure, we have a priori no guarantee that this bound is valid, because such multiple two-sample testing situations have not been shown to satisfy the PRDS assumption under which the Simes inequality is valid333In this particular case, -calibration with the linear template yields , which a posteriori implies that the Simes inequality was indeed valid.. In contrast, the -calibrated bounds built by permutation are by construction valid here. Moreover, both are much sharper than the Simes bound while the -calibrated bound using the linear template is twice smaller, ensuring FDP, and even smaller for the beta template with . The bound obtained by -calibration of the linear template is uniformly sharper that the original Simes bound (3), which corresponds to . This illustrates the adaptivity to dependence achieved by -calibration. The bound obtained from the beta template is less sharp for -value level sets of cardinal less than , and then sharper. This is consistent with the shape of the threshold functions displayed in Figure 3.

8.2 Data-driven sets

A common practice in the biomedical literature is to only retain, among the genes called significant after multiple testing correction, those whose “fold change” exceeds a prescribed level. The fold change is the ratio between the mean expression levels of the two groups. With the notation of Section 4, the fold-change of gene is given by , where and .

This is illustrated by Figure 6, where each gene is represented as a point in the ((fold change), ) plan. This representation is called a “volcano plot” in the biomedical literature. Among the 163 genes selected by the BH procedure at level 0.05, 151 have an absolute log fold change larger than 0.3. As FDR is not preserved by selection, FDR controlling procedures provide no statistical guarantee on such data-driven lists of hypotheses.

Figure 6: Post-hoc inference for volcano plots

In contrast, the post hoc bounds proposed in this chapter are valid for such data-driven sets. The two shaded boxes in Figure 6 correspond to the data-driven subsets and , where is the set of 163 genes selected by the BH procedure at level , and . The post hoc bounds on the number of true positives in and obtained by the Simes bound and by the -calibrated linear and beta templates are given in Table 1. Both -calibrated bounds are more informative than the Simes bound, in the sense that they provide a higher bound on the number of true confidence. Moreover, they have proven -coverage, whereas the coverage of the Simes bound is a priori unknown for multiple two-sample tests. None of the two -calibrated bounds dominates the other one, which is in line with the fact that the linear template is well-adapted to situations with smaller -value level sets than the beta template.

Finally, we also note that the bound obtained for is systematically larger than the sum of the two individual bounds, which, again, is in accordance with the theory.

n Simes Linear Beta(K=50)
124 62 88 100
27 1 7 5
151 79 114 127
Table 1: Post hoc bounds on the number of true positives in and obtained by the post hoc bounds displayed in Figure 5.

8.3 Structured reference sets

In this section we give an example of application of the bounds mentioned in Section 7. Our biological motivation is the fact that gene expression activity can be clustered along the genome.

The individual hypotheses are naturally partitioned into subsets, each corresponding to a given chromosome. Within each chromosome, we consider sets of successive genes as in (17). Hence, we focus on a reference family with the following elements

where, in chromosome , denotes the number of genes, the number of corresponding regions. In addition, for each we use coming from the union bound (18) in combination with the device (15) and . This choice accounts for a union bound over all the chromosomes. As shown in Exercise 1.7, is a valid upper confidence bound for under (Superunif) and (Indep). In this genomic example, (Indep) may not hold, so we have in fact no formal guarantee that this bound is valid. Therefore, the results obtained below are merely illustrative of the approach and may not have biological relevance.

We report the results for chromosome , which contains genes. In this particular case, we obtain trivial bounds for all . Therefore, the proxy defined in (13) for disjoint sets does not identify any signal for this chromosome. However, non-trivial bounds can be obtained via the multi-scale approach briefly mentioned in Section 7. The idea is to enrich the reference family by recursive binary aggregation of the neighboring . The total number of elements in this family is less than . In our example, it turns out that (15) yields 6 true discoveries in the interval and 1 true discovery in the interval , where we have denoted

This is illustrated by Figure 7 where the individual -values are displayed (on the scale) as a function of their order on chromosome 19. The sets and are highlighted in orange, with the corresponding number of true discoveries marked in each region.

Figure 7: Evidence of locally-structured signal on chromosome 19 detected by the bound (15).

We obtain a non-trivial bound not because of the large effect of any individual gene, but because of the presence of sufficiently many moderate effects. In particular, in the rightmost orange region in Figure 7, the distribution of is shifted away from when compared to the rest of chromosome 19. In comparison, we obtain trivial bounds and from (12) both for the linear or the beta template. These numerical results illustrate the interest of the bounds introduced in Section 7 in situations where one expects the signal to be spatially structured.

9 Bibliographical notes

The material exposed in this chapter is mainly a digested account of the article [2]. The seminal work [9] introduced the idea of false positive bounds for arbitrary rejection sets. It started from the idea of building a confidence set on the set of null hypotheses , and introduced the concepts of augmentation procedure and inversion procedure. The latter consists in building a confidence set based on the inversion of tests for for all . The former starts from a set with controlled -familywise error rate, and the proposed associated post hoc bound is (10) (for the one-element reference family ). The name augmentation refers to a similar idea found in [6]. The relaxation (