Multiple competition based FDR control

07/02/2019
by Kristen Emery, et al.

Competition-based FDR control has been commonly used for over a decade in the computational mass spectrometry community [7]. The approach has gained significant popularity in other fields after Barber and Candès recently laid its theoretical foundation in a more general setting that, importantly, included the feature selection problem [1]. Here we consider competition-based FDR control where we can generate multiple, rather than a single, competing null score. We offer several methods that can take advantage of these multiple null scores, all of which are based on a novel procedure that rigorously controls the FDR in the finite sample setting, provided its two tuning parameters are set without looking at the data. Because none of our methods clearly dominates all the others in terms of power, we also develop a data-driven approach, which is based on a novel resampling procedure and which tries to select the most appropriate procedure for the problem at hand. Using extensive simulations, as well as real data, we show that all our procedures seem to largely control the FDR and that our data-driven approach offers an arguably overall optimal choice. Moreover, we show using real data that in the peptide detection problem our novel approach can increase the number of discovered peptides by up to 50%.



1 Introduction

In the problem of multiple testing we simultaneously test $m$ (null) hypotheses $H_1, \dots, H_m$, looking to reject as many as possible subject to some statistical control of our error rate. The rejected hypotheses are typically referred to as “discoveries”, and $H_i$ corresponds to a false discovery if it is rejected but is in fact a true null hypothesis.

Pioneered by Benjamini and Hochberg, the common approach to controlling the error rate is through bounding the expected proportion of false discoveries at any desired level $\alpha$. More precisely, assume we have a selection procedure that produces a list of $D$ discoveries of which, unbeknownst to us, $F$ are false. Let $\mathrm{FDP} = F / \max(D, 1)$ be the unobserved false discovery proportion (FDP). Benjamini and Hochberg showed that applying their selection procedure (BH) at level $\alpha$ controls the expected value of the FDP at that level: $E(\mathrm{FDP}) \le \alpha$. They referred to $E(\mathrm{FDP})$ as the false discovery rate (FDR), and their precise statement is that, provided the true null p-values are distributed as independent uniform random variables, the bound holds for any p-values the true alternative / false null hypotheses assume [2]. Other, more powerful selection procedures that rely on estimating $\pi_0$, the fraction of true null hypotheses, are available. Generally referred to as “adaptive BH” procedures, with one particularly popular variant by Storey (e.g. [3, 4, 26, 27]), these procedures are also predicated on our ability to assign a p-value to each of our tested hypotheses.

The problem we consider here is controlling the FDR when we cannot assign p-values to the hypotheses. Specifically, we assume that we can compute a test statistic $Z_i$ for each hypothesis $H_i$, so that the larger $Z_i$ is, the less likely is the null. However, departing from the standard setup, we further assume that we cannot compute p-values for the observed scores. Instead, we can only generate a small sample of independent competing null scores for each hypothesis $H_i$: $\tilde Z_i^1, \dots, \tilde Z_i^d$. As elaborated below, we refer to these as decoys, or knockoff scores, and we would like to use the competition between them and the originally observed scores to control the FDR.

This is not just an interesting theoretical problem: competition-based FDR control, albeit with $d = 1$, is already widely used. Indeed, in tandem mass spectrometry many spectra (possibly up to tens of thousands) are generated in each experiment, and the main computational problems one faces are: associating with each observed spectrum the peptide that was most likely responsible for generating it, and compiling the list of peptides, or the list of proteins, that are likely to be present in the sample. We refer to these as the spectrum identification (ID) and the peptide / protein detection problems, respectively.

For over a decade now target-decoy competition (TDC) has been used to control the FDR in the reported list of discoveries in all three problems [7, 5, 13, 8, 9, 29]. The basic idea behind TDC is best explained in the context of the spectrum-ID problem: a search engine scans each input spectrum against a target peptide database for its best matching peptide and reports the optimal peptide-spectrum match (PSM) along with its score $Z_i$. In order to control the FDR among the reported PSMs, the same search engine is used to assign each input spectrum a decoy PSM score, $\tilde Z_i$, by searching for its best match in a decoy database of peptides obtained from the original database by randomly shuffling (or reversing) each peptide in the database.

Each decoy score directly competes with its corresponding target score for determining the reported list of discoveries. Specifically, for each score threshold $T$ we would only report the target PSMs that won their competition: $\{i : Z_i > \tilde Z_i,\ Z_i \ge T\}$. Additionally, the number of decoy wins, $\#\{i : \tilde Z_i > Z_i,\ \tilde Z_i \ge T\}$, is used to estimate the number of false discoveries in the list of target wins. Thus, the ratio between that estimate and the number of target wins yields an estimate of the FDR among the target wins. To control the FDR at level $\alpha$ we choose the smallest threshold $T$ for which the estimated FDR is still $\le \alpha$. It was recently shown that, assuming that incorrect PSMs are independently equally likely to come from a target or a decoy match, and provided we add 1 to the number of decoy wins before dividing by the number of target wins, this procedure rigorously controls the FDR [11, 21].
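To make the competition concrete, here is a minimal sketch of the TDC selection step just described, written in Python with NumPy. The function and variable names are ours (not the authors'); it assumes one target and one decoy score per spectrum and ignores ties.

```python
# Minimal TDC sketch (our illustration, not the authors' code).
import numpy as np

def tdc_discoveries(target, decoy, alpha=0.01):
    """Indices of target wins reported at FDR level alpha (d = 1, no ties)."""
    wins = target > decoy                  # target (original) wins
    w = np.maximum(target, decoy)          # winning score of each competition
    order = np.argsort(-w)                 # scan from the best score downward
    n_target = n_decoy = best_k = 0
    for k, i in enumerate(order, start=1):
        if wins[i]:
            n_target += 1
        else:
            n_decoy += 1
        # estimated FDR among the target wins so far, with the +1 correction
        if (n_decoy + 1) / max(n_target, 1) <= alpha:
            best_k = k
    return [i for i in order[:best_k] if wins[i]]
```

Scanning the winning scores from best to worst and keeping the largest cutoff whose estimate is still at most $\alpha$ is equivalent to choosing the smallest score threshold $T$ described above.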

The subject of competition-based FDR control gained a lot of interest in the statistical and machine learning communities following the work of Barber and Candès, who demonstrated how it can be used to control the FDR in feature selection in a classical linear regression model [1]. A significant part of Barber and Candès' work is the sophisticated construction of their knockoff scores (despite the different terminology, both knockoffs and decoys serve the same purpose in competition-based FDR control – for the ideas presented in this paper the two are interchangeable). Controlling the FDR then follows exactly the same competition that TDC uses. Indeed, Barber and Candès' Selective SeqStep+ (SSS+) procedure rigorously formalizes, in a much more general setting, the same procedure described above in the context of TDC.

Moving on to more than a single decoy ($d > 1$), we note that Barber and Candès pointed out that using multiple knockoffs could potentially improve the power of their method; however, it is not clear how to generalize their construction to $d > 1$ – a point we will return to in the discussion. In contrast, in the mass spectrometry context it is conceptually trivial to generate additional decoys by searching each spectrum against random shuffles of the peptide database. In Section 6 we show how the multi-decoy methods we develop here can significantly improve our power in the peptide detection problem.

Other examples of our general problem might include analyzing a large number of motifs reported by a motif finder (e.g., [10]), where creating competing null scores can require the time consuming task of running the finder on randomized versions of the input sets (e.g., [23]), as well as controlling the FDR in selecting differentially expressed genes in microarray experiments where a small number of permutations is used to generate competing null scores [30].

A key feature of our problem is that due to computational costs the number of decoys, $d$, is small. Indeed, if we are able to generate a large number of independent decoys for each hypothesis, then we can simply apply the above canonical FDR controlling procedures to the empirical p-values. The latter p-values are estimated from the empirical null distributions, which are constructed for each hypothesis using its corresponding decoys. However, with a small $d$ as we assume here, this approach is not viable as the empirical p-values will be too coarse, resulting in a significant loss of power.

Alternatively, one might consider pooling all the decoys regardless of which hypothesis generated them. While this approach can work if the product $m \cdot d$ is sufficiently large, it also crucially hinges on the uniformity, or calibration, of our scores. In Section 1.1 we show that there exist arbitrarily large examples where pooling all the decoy scores can significantly bias the empirical p-values. As a result, Storey's procedure significantly fails to control the FDR in some of those examples, and BH is essentially powerless in others. These failures illustrate the need for the alternative methods we develop here that extend the original TDC.

The main result of this paper is the introduction of several procedures that control the FDR in our competition-based setup. Our procedures roughly fall into three categories, with the first category including procedures that depend on two tuning parameters that are selected without looking at the data, thus guaranteeing rigorous control of the FDR in the finite sample setting. The proof of this finite sample FDR control heavily relies on the original work of Barber and Candès [1] and subsequent work by Lei and Fithian [20]. The second category includes methods that are in some sense discrete versions of the methods proposed by Storey et al. [27], and as we show, in the limit of $d \to \infty$ (when you can actually compute p-values) our procedures indeed converge to those p-value based methods. Finally, the third category introduces a novel data-driven bootstrap approach that tries to select the optimal procedure for a given data set and FDR threshold.

1.1 Failings of the canonical procedures

Applying either the BH or an adaptive BH procedure (we use Storey's variant here) to control the FDR requires having p-values. Such p-values can be obtained by noting the rank of the originally observed score $Z_i$ (“original score” for short) when combined with the decoy scores. This can be done either by pooling all the decoy scores $\{\tilde Z_j^k : j = 1, \dots, m,\ k = 1, \dots, d\}$ together, or by keeping them separate for each hypothesis (non-pooled).

1.1.1 Non-pooled decoys

In the non-pooled version the empirical p-values take values of the form $\tilde p_i = \left(1 + \#\{j : \tilde Z_i^j \ge Z_i\}\right)/(d+1)$, so that $\tilde p_i \in \{1/(d+1), \dots, 1\}$ is determined by the rank of the original score in the combined list of scores $(Z_i, \tilde Z_i^1, \dots, \tilde Z_i^d)$. Using these non-pooled p-values the BH procedure rigorously controls the FDR, and Storey's method will asymptotically control the FDR as $m \to \infty$. However, because the p-values are rather coarse, both methods can be extremely weak, especially when $d$ is small. For example, if $d = 1$, each empirical p-value is either 1/2 or 1, and therefore for many practical examples both methods will not be able to make any discoveries at usable FDR thresholds.
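For concreteness, the following short sketch (ours, not the paper's) computes the non-pooled empirical p-values and applies BH to them; with $d = 1$ every p-value is 1/2 or 1, which is what makes this route so weak.

```python
# Non-pooled empirical p-values and BH: an illustrative sketch only.
import numpy as np

def nonpooled_pvalues(Z, D):
    """Z: (m,) original scores; D: (m, d) decoy scores."""
    d = D.shape[1]
    # (1 + number of decoys that are at least as large as the original) / (d + 1)
    return (1 + (D >= Z[:, None]).sum(axis=1)) / (d + 1)

def bh_rejections(pvals, alpha=0.05):
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, m + 1) / m
    k = int(np.max(np.nonzero(below)[0]) + 1) if below.any() else 0
    return order[:k]
```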

1.1.2 Pooled decoys

When pooling all the decoys together our empirical p-values attain values of the form $k/(md+1)$ for $k \in \{1, \dots, md+1\}$; hence, particularly when $m$ is large, the p-values generally no longer suffer from being too coarse. However, other significant issues arise when pooling the decoys. First, the empirical p-values computed using the pooled decoys do not satisfy the assumption that the p-values of the true null hypotheses are independent: because all p-values are computed using the same batch of pooled decoy scores, it is clear that they are dependent to some extent. While this dependency diminishes as $m \to \infty$, there is a more serious problem that in general cannot be alleviated by considering a large enough $m$.

In pooling the decoys we make the implicit assumption that the score is calibrated, i.e., that all true null scores are generated according to the same distribution. If this assumption is violated, as is typically the case in the spectrum identification problem for one [15], then the p-values of the true null hypotheses are not identically distributed, and in particular they are also not (discrete) uniform in general. This means that even the more conservative BH procedure is no longer guaranteed to control the FDR, and the problem is much worse with Storey's procedure. Indeed, Supplementary Section 8.1 shows that there are arbitrarily large examples wherein Storey's procedure significantly fails to control the FDR, and similar ones where BH is essentially powerless. Those examples, which we will further comment on below, demonstrate that, in general, applying BH or Storey's procedure to p-values that are estimated by pooling the competing null scores can be problematic both in terms of power and control of the FDR (these issues were discussed in the context of the spectrum-ID problem, where the effects of pooling on power and on FDR control were demonstrated using simulated and real data [15, 14]).

2 Controlling the FDR using multiple decoys

We formulated the multiple competition problem assuming the decoys can be generated independently of anything else (independent decoys assumption). However, the procedures we offer here control the FDR in the following, more general setting, which is a straightforward generalization of the conditional exchangeability property of  [1].

Definition 1.

Let $\mathbf{Z}_i = \left(Z_i^0, Z_i^1, \dots, Z_i^d\right)$, where $Z_i^0 = Z_i$ is the $i$th original score and $Z_i^1, \dots, Z_i^d$ are the corresponding decoy scores, and let $\Pi_{d+1}$ denote the set of all permutations on $\{0, 1, \dots, d\}$. With $\pi \in \Pi_{d+1}$ we define $\pi(\mathbf{Z}_i) = \left(Z_i^{\pi(0)}, Z_i^{\pi(1)}, \dots, Z_i^{\pi(d)}\right)$, i.e., the permutation $\pi$ is applied to the indices of the vector $\mathbf{Z}_i$, rearranging the order of its entries. Let $N \subset \{1, \dots, m\}$ be the indices of the true null hypotheses and call a sequence of permutations $\pi_1, \dots, \pi_m$ with $\pi_i \in \Pi_{d+1}$ a null-only sequence if $\pi_i = \mathrm{id}$ (the identity permutation) for all $i \notin N$. We say the data satisfies the conditional null exchangeability property if for any null-only sequence of permutations $\pi_1, \dots, \pi_m$, the joint distribution of $\left(\pi_1(\mathbf{Z}_1), \dots, \pi_m(\mathbf{Z}_m)\right)$ is invariant of the choice of the $\pi_i$.

Going back to TDC, or equivalently, to the FDR controlling part of knockoff+, we note it can be summarized as follows. Assign to each hypothesis $H_i$ a score-label pair $(W_i, L_i)$ based on the observed original (target), $Z_i$, and decoy, $\tilde Z_i$, scores as follows:

$$W_i = \max\left(Z_i, \tilde Z_i\right), \qquad L_i = \begin{cases} 1 & Z_i > \tilde Z_i \\ -1 & Z_i < \tilde Z_i. \end{cases} \tag{1}$$

Reorder the hypotheses so that their scores $W_i$ are decreasing and find the maximal $K$ such that

$$\frac{1 + \#\{i \le K : L_i = -1\}}{\#\{i \le K : L_i = 1\} \vee 1} \le \alpha. \tag{2}$$

Barber and Candès prove that reporting the list of discoveries as all $H_i$ with $i \le K$ and $L_i = 1$ controls the FDR if the data satisfies the conditional null exchangeability property.

Remark 1.
  • In their setup Barber and Candès combine $W_i$ and $L_i$ by considering their product, but here we find it conceptually clearer to separate them: $L_i$ refers to either an original (target) or a decoy win, and $W_i$ is the selected score among the two.

  • Consistently with [1], we effectively ignore ties between $Z_i$ and $\tilde Z_i$; however, randomly assigning $L_i \in \{\pm 1\}$ in case of a tie is both statistically valid and occasionally practiced in TDC.

  • Ties between the selected scores $W_i$ can also be randomly broken.

2.1 The mirror method

Our first FDR controlling procedure is the one closest to the above formulation of FDR control in TDC or knockoff+. Let $r_i$ denote the rank of $Z_i$ in $\mathbf{Z}_i = (Z_i, \tilde Z_i^1, \dots, \tilde Z_i^d)$ (with higher ranks corresponding to better scores). To simplify our notation, we assume first that $d$ is odd (so $d + 1$ is even), and that there are no ties. With $Z_i^{(k)}$ ($k = 1, \dots, d+1$) denoting the $k$th order statistic of the $d+1$ observations $Z_i, \tilde Z_i^1, \dots, \tilde Z_i^d$, we define

$$\left(W_i, L_i\right) = \begin{cases} \left(Z_i,\ 1\right) & r_i > (d+1)/2 \\ \left(Z_i^{(d+2-r_i)},\ -1\right) & r_i \le (d+1)/2. \end{cases} \tag{3}$$

Clearly, for $d = 1$, (3) coincides with (1) in the absence of ties, and for example, for $d = 3$, we have $L_i = 1$ and $W_i = Z_i$ if $r_i \in \{3, 4\}$, but $L_i = -1$ with $W_i = Z_i^{(3)}$ if $r_i = 2$ and $W_i = Z_i^{(4)}$ if $r_i = 1$. In words, $L_i = 1$ (and $W_i = Z_i$, or this is an original observation win, or original win for short) if $Z_i$ is larger than more than half of its decoys; otherwise this is a decoy win, or $L_i = -1$, and in this case $W_i$ is the decoy that is ranked symmetrically across the median to $Z_i$ – hence we refer to this procedure as the mirror method.

The remainder of the mirror procedure is the same as described above: the selected scores $W_i$ are sorted, and assuming they are in decreasing order, our discovery list is $\{i \le K : L_i = 1\}$, where $K$ is determined by (2).
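The following sketch implements our reading of (3) in Python/NumPy (it is not the authors' reference code); it assumes an odd $d$ and no ties, and its output $(W_i, L_i)$ can be fed to the same largest-$K$ scan of (2) used in the TDC sketch above.

```python
# Mirror method label/score assignment: a sketch of our reading of (3).
import numpy as np

def mirror_labels_scores(Z, D):
    """Z: (m,) original scores; D: (m, d) decoy scores, d odd, no ties."""
    m, d = D.shape
    scores = np.concatenate([Z[:, None], D], axis=1)   # all d + 1 scores per hypothesis
    order_stats = np.sort(scores, axis=1)              # order statistics, ascending
    r = 1 + (scores < Z[:, None]).sum(axis=1)          # rank of Z_i (1 = smallest)
    win = r > (d + 1) / 2                               # original beats more than half its decoys
    rho = np.where(win, r, d + 2 - r)                   # decoy win: mirror the rank across the median
    W = order_stats[np.arange(m), rho - 1]              # selected score = rho-th order statistic
    L = np.where(win, 1, -1)
    return W, L
```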

Claim 1.

If the data satisfies the conditional null exchangeability property then the mirror method controls the FDR in the reported list of target discoveries, that is,

$$E\left[\frac{\#\{i \le K : L_i = 1,\ i \in N\}}{\#\{i \le K : L_i = 1\} \vee 1}\right] \le \alpha.$$

The expectation is taken with respect to all true null scores (competing and observed), i.e., it holds regardless of the values of the competing and observed test scores of the false null hypotheses.

A direct proof is given in Supplementary Section 8.2. Alternatively, the result follows as a special case of Claim 2 below.

Remark 2.

Note that there is a certain symmetry to our definition of $W_i$, but this symmetry is not required for controlling the FDR. Other definitions of $W_i$, such as setting $W_i = Z_i^{(r_i + (d+1)/2)}$ if $L_i = -1$, would also yield procedures that control the FDR; however, such choices would typically offer less power than our mirror procedure. For example, in the above proposed variant, middling target scores are replaced by high decoy scores, whereas for the mirror method, it is the low target scores that are replaced by the same high decoy scores. Of course, the high decoy scores are the ones more likely to appear in the numerator of (2), and generally we expect the density of the target scores to monotonically decrease with the quality of the score. Taken together, it follows that the estimated FDR will generally be higher using the last variant when compared with the mirror, and hence that variant will be weaker than the mirror.

Remark 3.

As explained in Supplementary Section 8.3, or alternatively in Section 2.5 below, the mirror method can be extended to handle an even number of decoys, as well as to handle ties between an observed score and its decoys. Similarly to TDC (Remark 1), ties between the selected scores $W_i$ are randomly broken.

As noted above, after defining the scores $W_i$ and the signs/labels $L_i$, the mirror method essentially applies Barber and Candès' SSS+ [1] with $c = 1/2$. This naturally brings up the question of whether we can generalize the definitions of $W_i$ and $L_i$ so that we can use SSS+ with a different value of $c$. The next sections first provide a positive answer to this question and then generalize this approach even further.

2.2 The max method

Our second method is not one we generally recommend on its own, but it is useful as a conceptual stepping stone to a class of methods that will follow in the next section. For our “max method”, again assuming no ties, we define

$$\left(W_i, L_i\right) = \begin{cases} \left(Z_i,\ 1\right) & Z_i > \max_j \tilde Z_i^j \\ \left(\max_j \tilde Z_i^j,\ -1\right) & \text{otherwise.} \end{cases} \tag{4}$$

Assuming again the $W_i$ are decreasing, we modify our rejection threshold to be the maximal $K$ such that

$$\frac{1 + \#\{i \le K : L_i = -1\}}{d \cdot \left(\#\{i \le K : L_i = 1\} \vee 1\right)} \le \alpha. \tag{5}$$

The intuition is obvious: among the true null hypotheses, for every $d$ decoy wins there is, on average, one target win. Hence, the number of decoy wins divided by $d$ yields an estimate of the number of true nulls among the target wins. Therefore, ignoring the $+1$ correction, the ratio of the latter estimate to the number of target wins provides an estimate of the FDR and, as above, we seek the largest number of discoveries for which the estimated FDR is still $\le \alpha$. Hence, the max method discovery list is $\{i \le K : L_i = 1\}$, where $K$ is determined by (5). In Supplementary Section 8.4 we have an analog of Claim 1.

The max method generally does better than the mirror for smaller values of $\alpha$. Indeed, in those cases the “penalty” $+1$ in the numerator of (2) can have a substantial impact. For example, if $\alpha = 1/20$ then the first 20 hypotheses (ordered by decreasing $W_i$) can all be discoveries under (2) only if they all correspond to original observation wins: the $+1$ alone forces $\#\{i \le 20 : L_i = 1\} \ge 1/\alpha = 20$. However, with the max method, where the decoy wins in the numerator of (5) are effectively weighted by $1/d$, we can afford to have decoy wins among the first 20 hypotheses. At the same time, the max method has its own cases where it is far from optimal, particularly when $\alpha$ and/or $d$ are not very small.
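A correspondingly short sketch of the max method, under our reconstruction of (4)-(5): an original win requires $Z_i$ to beat all $d$ of its decoys, and decoy wins are down-weighted by $1/d$ in the estimated FDR.

```python
# Max method sketch (our reconstruction of (4)-(5), not reference code).
import numpy as np

def max_method_discoveries(Z, D, alpha=0.05):
    """Z: (m,) original scores; D: (m, d) decoy scores, no ties."""
    m, d = D.shape
    best_decoy = D.max(axis=1)
    win = Z > best_decoy                     # original win only if Z beats every decoy
    W = np.maximum(Z, best_decoy)            # selected score is the overall maximum
    order = np.argsort(-W)
    n_t = n_d = best_k = 0
    for k, i in enumerate(order, start=1):
        n_t += int(win[i])
        n_d += int(not win[i])
        if (1 + n_d) / (d * max(n_t, 1)) <= alpha:   # estimated FDR, cf. (5)
            best_k = k
    return [i for i in order[:best_k] if win[i]]
```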

2.3 Mirandom (I) – a general value of $c$

Both the max and the mirror methods follow the same general scheme (assuming no ties, or that ties are broken randomly):

  1. First, the original/decoy win label, $L_i$, that is associated with each hypothesis is determined based on $r_i$, the rank of the target score $Z_i$ among the $d+1$ scores $Z_i, \tilde Z_i^1, \dots, \tilde Z_i^d$:

    $$L_i = \begin{cases} 1 & r_i > (d+1)(1-c) \\ -1 & r_i \le (d+1)(1-c), \end{cases} \tag{6}$$

    where $c = 1/(d+1)$ in the max case, and $c = 1/2$ in the mirror case (with no ties, and odd $d$).

  2. Each hypothesis is assigned a score $W_i = Z_i^{(\rho_i)}$, where the “selected rank”, $\rho_i$, is always $d+1$ for the max method, and in the mirror case $\rho_i = r_i$ for an original win ($L_i = 1$) and $\rho_i = d + 2 - r_i$ is the mirror image of $r_i$ for a decoy win ($L_i = -1$).

  3. The hypotheses are reordered so that $W_1 \ge W_2 \ge \dots \ge W_m$ are decreasing (ties are broken randomly) and the list of discoveries is defined as $\{i \le K : L_i = 1\}$, where $K$ is the maximal $k$ for which

    $$\frac{c}{1-c} \cdot \frac{1 + \#\{i \le k : L_i = -1\}}{\#\{i \le k : L_i = 1\} \vee 1} \le \alpha, \tag{7}$$

    and again $c = 1/2$ in the mirror case and $c = 1/(d+1)$ in the max case.

To allow this scheme to utilize an arbitrary $c$, or more precisely any $c = j/(d+1)$ with $j \in \{1, \dots, d\}$, we note that the definition of $L_i$ in (6) is already expressed as a function of a general $c$. What remains is to define the selected ranks $\rho_i$ so that the labels $L_i$ will still be independent, conditional on the $W_i$s. Particularly, the issue is how to define $\rho_i$ when $r_i \le (d+1)(1-c)$ ($L_i = -1$, or a decoy win), because when $r_i > (d+1)(1-c)$ ($L_i = 1$, or an original win) we naturally define $\rho_i = r_i$, or $W_i = Z_i$. The next claim guides us how to define $\rho_i$.

Claim 2.

If the data satisfies the conditional null exchangeability property, and if for any true null index $i \in N$ and any winning rank $k > (d+1)(1-c)$

$$P\left(L_i = -1,\ \rho_i = k\right) = \frac{1-c}{c} \cdot \frac{1}{d+1}, \tag{8}$$

where $\rho_i$, like $L_i$, is also determined only from the order of the scores $\mathbf{Z}_i$, then

$$E\left[\frac{\#\{i \le K : L_i = 1,\ i \in N\}}{\#\{i \le K : L_i = 1\} \vee 1}\right] \le \alpha.$$

Proof.

By (8), for any true null $i \in N$ and any winning rank $k$,

$$P\left(L_i = 1,\ \rho_i = k\right) = \frac{1}{d+1} = \frac{c}{1-c} \cdot P\left(L_i = -1,\ \rho_i = k\right).$$

Moreover, the conditional exchangeability and the fact that $L_i$ and $\rho_i$ are determined from the order of the scores $\mathbf{Z}_i$ imply that the above equalities hold even when conditioning on the other hypotheses' scores $\{\mathbf{Z}_j\}_{j \ne i}$. Hence, in this case the p-values, based again on $\rho_i$, satisfy

$$P\left(p_i \le c \,\middle|\, \{\mathbf{Z}_j\}_{j \ne i}\right) = c \qquad \text{for every true null } i. \tag{9}$$

The proof is again completed by invoking Theorem 3 of [1] with the threshold $c$. ∎

Note that if, in the event of a decoy win ($L_i = -1$), we define $\rho_i$ by randomly and uniformly drawing one of the $(d+1)c$ “winning ranks” then (8) is clearly satisfied:

$$P\left(L_i = -1,\ \rho_i = k\right) = (1-c) \cdot \frac{1}{(d+1)c} = \frac{1-c}{c} \cdot \frac{1}{d+1}. \tag{10}$$

Thus, based on the last claim, the above procedure again controls the FDR. However, choosing $\rho_i$ at random in this way introduces an additional level of variability, and it can also yield reduced power for the same reason discussed in Remark 2: you would typically be better off if $\rho_i$ is anti-correlated with $r_i$ rather than non-correlated as above. Hence the motivation for introducing our “mirandom” (“mirror” + “random”) procedure.

The mirandom procedure defines $\rho_i$, or equivalently $W_i$, so that (8) and therefore Claim 2 holds, but it does so while mapping $r_i$ to $\rho_i$ in an essentially monotonically decreasing fashion. It is best illustrated by examples. First, when the number of losing ranks, $(d+1)(1-c)$, is a multiple of the number of winning ranks, $(d+1)c$, mirandom maps the losing original observation ranks $\{1, \dots, (d+1)(1-c)\}$ to the winning ranks $\{(d+1)(1-c)+1, \dots, d+1\}$ by mirroring with respect to the mid-point between those two sets; for $c = 1/2$ this is simply $\rho_i = d + 2 - r_i$.

This deterministic scheme works well as long as $(1-c)/c$ is an integer, and it coincides with the mirror method ($c = 1/2$) when $d$ is odd and there are no ties. But when $(1-c)/c$ is not an integer we need to introduce a degree of randomness into our mapping, while still maintaining the monotonicity in some sense. Again, this is best explained by examples:

  • for $d = 5$ and $c = 2/3$, say, there are two losing ranks and four winning ranks, so mirandom randomly and uniformly maps the losing rank 1 to $\{5, 6\}$, and it similarly maps losing rank 2 to $\{3, 4\}$.

  • for $d = 7$ and $c = 3/8$ there are five losing ranks and three winning ranks, so mirandom maps losing rank 1 to 8, whereas 2 is mapped to 8 with probability 2/3 and to 7 with probability 1/3; similarly, 3 is mapped to 7, and 4 is mapped to 7 with probability 1/3 and to 6 with probability 2/3, and finally losing rank 5 is mapped to 6. Notably, each rank in the winning range $\{6, 7, 8\}$ has the same “coverage”, that is, for each winning rank $k$ the sum of the probabilities with which the losing ranks are mapped to $k$ is $5/3$. In particular, when choosing a losing rank at random the probability that the corresponding $\rho_i = k$ is the same for all winning ranks $k$.

More generally, the mirandom procedure consists of two steps. In the first step it defines a sequence of distributions $F_1, \dots, F_{(d+1)(1-c)}$, one for each losing rank, on the range of winning ranks $\{(d+1)(1-c)+1, \dots, d+1\}$ so that:

  • each $F_r$ is defined on a contiguous sequence of numbers, and

  • if $r < r'$ then $F_r$ stochastically dominates $F_{r'}$, and the supports of $F_r$ and $F_{r'}$ overlap in at most one point.

In practice, it is straightforward to construct this sequence of distributions and to see that combined they necessarily have the equal coverage property:

$$\sum_{r=1}^{(d+1)(1-c)} F_r\left(\{k\}\right) = \frac{1-c}{c} \qquad \text{for every winning rank } k.$$

In the second step mirandom defines $\rho_i$, for any $i$ with $L_i = -1$, by randomly drawing a number from $F_{r_i}$ (independently of everything else). As usual, in case of an original win mirandom sets $\rho_i = r_i$. It follows from the equal coverage property that for any true null $i$ and any winning rank $k$, (10) holds. Therefore, by Claim 2, using mirandom to define $\rho_i$ (and hence $W_i$) and proceeding as outlined at the start of this section controls the FDR with the given $c$.
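One concrete way to build such distributions, consistent with the properties listed above (this is our sketch of the construction, not the authors' implementation), is to lay the $(d+1)(1-c)$ losing ranks over an interval, give each of the $(d+1)c$ winning ranks a slot of width $(1-c)/c$ in reverse order, and let $F_r$ place on each winning rank the overlap of that slot with the $r$th unit interval; this automatically yields contiguous supports, the anti-monotone (mirror-like) ordering, and equal coverage.

```python
# A sketch of a mirandom-style construction with equal coverage of the winning ranks.
import numpy as np

def mirandom_distributions(d, c):
    """Return, for each losing rank r = 1..n_lose, a dict {winning rank: probability}."""
    n_win = round((d + 1) * c)            # winning ranks: d + 2 - n_win, ..., d + 1
    n_lose = (d + 1) - n_win              # losing ranks: 1, ..., n_lose
    width = n_lose / n_win                # coverage of every winning rank
    dists = []
    for r in range(1, n_lose + 1):        # losing rank r occupies the interval (r - 1, r]
        probs = {}
        for j in range(n_win):            # j = 0 corresponds to the top winning rank d + 1
            lo, hi = j * width, (j + 1) * width
            overlap = max(0.0, min(r, hi) - max(r - 1, lo))
            if overlap > 0:
                probs[d + 1 - j] = overlap
        dists.append(probs)
    return dists

def draw_selected_rank(r, dists, rng=np.random.default_rng()):
    """Draw the selected rank rho for a decoy win with losing rank r."""
    ranks, probs = zip(*sorted(dists[r - 1].items()))
    return int(rng.choice(ranks, p=np.asarray(probs)))
```

For $d = 7$ and $c = 3/8$ this reproduces the mapping of the second example above, and for $c = 1/2$ (odd $d$) it reduces to the deterministic mirror $\rho_i = d + 2 - r_i$.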

Remark 4.

With the risk of stating the obvious, we note that one cannot simply use Barber and Candès' SSS+ by selecting a threshold $c$ and using the empirical p-value $\tilde p_i = \left(1 + \#\{j : \tilde Z_i^j \ge Z_i\}\right)/(d+1)$ as the corresponding p-value, because in this case the order of the hypotheses (according to $Z_i$) is not independent of the true null p-values.

2.4 Mirandom (II) – a general value of $\lambda$

The last section raises the question of how to set $c$, but before trying to address this question we introduce another tuning parameter to the problem. Recall that at their core, all the procedures we presented so far rely on Barber and Candès' Selective SeqStep+ (SSS+) procedure to control the FDR in sequential hypotheses testing, where the hypotheses are ordered in advance according to some knowledge which is independent of the null hypotheses' p-values [1]. Indeed, the parameter $c$ we keep referring to is the same as SSS+'s parameter $c$.

Lei and Fithian introduced Adaptive SeqStep (AS), which relies on an additional tuning parameter $\lambda$ to improve on SSS+ in much the same way as adaptive BH procedures, including Storey's, improve on the Benjamini-Hochberg procedure [20]. Adopting this improvement in our setup, we offer the following revision of our general scheme by adding the parameter $\lambda$ to $c$, where $c \le \lambda$:

  1. First, $L_i$, the original/decoy win label associated with each hypothesis, is determined based on $r_i$, the rank of the target score $Z_i$ among the $d+1$ scores $Z_i, \tilde Z_i^1, \dots, \tilde Z_i^d$:

    $$L_i = \begin{cases} 1 & r_i > (d+1)(1-c) \\ 0 & (d+1)(1-\lambda) < r_i \le (d+1)(1-c) \\ -1 & r_i \le (d+1)(1-\lambda). \end{cases} \tag{11}$$

    Note that the tuning parameter $c$ determines the original win ($L_i = 1$) threshold, and that $\lambda$ determines the decoy win ($L_i = -1$) threshold.

  2. Each hypothesis is assigned a score $W_i = Z_i^{(\rho_i)}$, where the selected rank $\rho_i$ satisfies: $\rho_i = r_i$ for an original win, $\rho_i$ is randomly and uniformly selected from the remaining (neutral) ranks for a “neutral outcome” ($L_i = 0$), and $\rho_i$ is specified in a procedure-specific manner for a decoy win.

  3. The hypotheses are reordered so that $W_1 \ge W_2 \ge \dots \ge W_m$ are decreasing (ties are broken randomly) and the list of discoveries is defined as $\{i \le K : L_i = 1\}$, where $K$ is the maximal $k$ for which

    $$\frac{c}{1-\lambda} \cdot \frac{1 + \#\{i \le k : L_i = -1\}}{\#\{i \le k : L_i = 1\} \vee 1} \le \alpha. \tag{12}$$

The next claim generalizes Claim 2 for general $c$ and $\lambda$, where $c \le \lambda$:

Claim 3.

If the data satisfies the conditional null exchangeability property, and if for any true null index $i \in N$ and any winning rank $k > (d+1)(1-c)$

$$P\left(L_i = -1,\ \rho_i = k\right) = \frac{1-\lambda}{c} \cdot \frac{1}{d+1}, \tag{13}$$

where $\rho_i$, like $L_i$, is also determined only from the order of the scores $\mathbf{Z}_i$, then

$$E\left[\frac{\#\{i \le K : L_i = 1,\ i \in N\}}{\#\{i \le K : L_i = 1\} \vee 1}\right] \le \alpha.$$

The proof is provided in Supplementary Section 8.5.

We can readily adjust our mirandom procedure to this more general setting by constructing the distributions $F_r$ as above on the range of winning ranks, one for each of the $(d+1)(1-\lambda)$ decoy-losing ranks, so the coverage of each winning rank is now $(1-\lambda)/c$ and (13) holds. For example, for $d = 7$ with $c = 3/8$ and $\lambda = 1/2$ (so there are four decoy-losing ranks and three winning ranks), mirandom maps losing rank 1 to 8, it maps 2 to 8 with probability 1/3 and to 7 with probability 2/3, 3 is mapped to 7 with probability 2/3 and to 6 with probability 1/3, and finally losing rank 4 is mapped to 6. Thus, the coverage of each winning rank is the same, $(1-\lambda)/c = 4/3$. Note that in practice we do not really need to decide on $\rho_i$ when the outcome is neutral ($L_i = 0$) because those scores are in fact ignored when determining $K$ in (12).
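A sketch of the resulting selection step (our reconstruction of (11)-(12), not reference code): neutral outcomes are simply skipped by the counts, and the estimate is scaled by $c/(1-\lambda)$.

```python
# Selection with the (c, lambda) pair: a sketch of our reading of (12).
import numpy as np

def cl_discoveries(W, L, c, lam, alpha=0.05):
    """W: selected scores; L: labels in {1, 0, -1} assigned per (11)."""
    W = np.asarray(W)
    order = np.argsort(-W)
    n_t = n_d = best_k = 0
    for k, i in enumerate(order, start=1):
        n_t += int(L[i] == 1)
        n_d += int(L[i] == -1)         # neutral outcomes (L == 0) do not count
        est = (c / (1.0 - lam)) * (1 + n_d) / max(n_t, 1)
        if est <= alpha:
            best_k = k
    return [i for i in order[:best_k] if L[i] == 1]
```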

Remark 5.

In determining the ranks $r_i$, ties are broken randomly, and the same applies when ordering the selected scores $W_i$.

2.5 Determining the tuning parameters

How shall we set our pair of tuning parameters $(c, \lambda)$? We already saw a couple of examples: the mirror method corresponds to using mirandom with $c = \lambda = 1/2$ when $d$ is odd, and when $d$ is even (Supplementary Section 8.3) it corresponds to using $c = \frac{d/2}{d+1}$ and $\lambda = \frac{d/2+1}{d+1}$, so that the middle rank is a neutral outcome. The max method is equivalent to using mirandom with $c = \lambda = 1/(d+1)$.

In their paper, Lei and Fithian suggest using $c = \alpha$ and $\lambda = 1/2$, implicitly assuming that $\alpha \le 1/2$ [20]. According to Claim 3 and its ensuing discussion, applying mirandom with this last setting of $(c, \lambda)$ – a method we refer to as LF – we rigorously control the FDR. However, as we will see below, there are many cases where this selection of values for $(c, \lambda)$ is far from optimal.

We next introduce a couple of methods to select $(c, \lambda)$ in a way that involves peeking at the data. As such, they do not strictly fall within the guarantees of Claim 3; however, one of them can be considered as a finite-$d$ version of Storey's method (and indeed we show that it converges to a variant of Storey's procedure as $d \to \infty$), while the other is in the same vein. Moreover, extensive numerical simulations give empirical evidence that these methods control the FDR in our finite-$d$ setting.

Lei and Fithian pointed out the connection between the $(c, \lambda)$ parameters of their AS procedure (they denote these parameters differently) and the corresponding parameters in Storey's procedure. Specifically, AS's $\lambda$ is analogous to the parameter $\lambda$ of Storey, Taylor and Siegmund that determines the interval from which $\pi_0$, the fraction of true null hypotheses, is estimated. Specifically, in the finite sample case, $\pi_0$ is estimated as [27]:

$$\hat\pi_0(\lambda) = \frac{\#\{i : p_i > \lambda\} + 1}{m\,(1-\lambda)}, \tag{14}$$

where $m$ is again the number of hypotheses, and $\#\{i : p_i \le t\}$ is the number of discoveries at threshold $t$ (the number of hypotheses whose p-value is $\le t$). The parameter $c$ is analogous to the threshold

$$t_\alpha\!\left(\widehat{\mathrm{FDR}}_\lambda\right) = \sup\left\{0 \le t \le \lambda \,:\, \widehat{\mathrm{FDR}}_\lambda(t) \le \alpha\right\} \tag{15}$$

of [27], where

$$\widehat{\mathrm{FDR}}_\lambda(t) = \frac{\hat\pi_0(\lambda)\, m\, t}{\#\{i : p_i \le t\} \vee 1}. \tag{16}$$

We take this analogy one step further and essentially use the above procedure of Storey, Taylor and Siegmund to determine $c$ by applying it, as detailed below, to the empirical p-values. Those p-values are computed from the competing decoys:

$$\tilde p_i = \frac{1 + \#\{j : \tilde Z_i^j \ge Z_i\}}{d + 1}. \tag{17}$$

In order to do that, however, we first need to determine $\lambda$.

2.5.1 Determining $\lambda$ from the empirical p-values

In principle we could have used Storey's bootstrap approach [27] to determine $\lambda$. However, in practice we found that using the bootstrap option of the qvalue package [31] in our setup can result in a significant failure to control the FDR. Therefore, we instead devised the following method that seems more appropriate for this empirical p-value setting.

Our approach is inspired by the spline-based method for estimating $\pi_0$ [28], in that it also looks for the flattening of the tail of the p-values histogram as we approach 1. Because our p-values $\tilde p_i$ of (17) lie on the lattice $k/(d+1)$ for $k = 1, \dots, d+1$, instead of threading a spline as in [28], we ask whether the number of p-values in the first half of the considered tail interval is larger than their number in the second half of this interval.

Specifically, starting with $j = 1$ we repeatedly apply the binomial test to see if the number of p-values in the first half of the tail interval $[j/(d+1), 1]$ is significantly larger than their number in the second half (when the number of lattice points in the interval is odd, the interval is split about the middle lattice point; when it is even, it is split evenly).

If the test is significant at the chosen level, then we are yet to see the flattening of the tail of the p-values histogram, so we increase $j$ by 1. Otherwise, the first half of the interval does not have a statistically significantly larger number of p-values than the second half, so we set $\lambda$ so that the interval from which $\pi_0$ is estimated in (14) coincides with the considered tail interval.

Note that if $d$ is large then we apply the same sequence of binomial tests but, for computational efficiency, we now apply them to a coarser lattice defined by a fixed number of equally spaced points, rather than to the original lattice of $k/(d+1)$.
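The following sketch illustrates the progressive tail-splitting idea; the split convention and the significance level are our assumptions (the paper fixes its own), and the one-sided binomial test comes from SciPy.

```python
# Choosing lambda by testing whether the tail of the p-value histogram is flat.
import numpy as np
from scipy.stats import binomtest

def choose_lambda(pvals, d, level=0.05):
    """pvals: empirical p-values on the lattice k/(d+1); returns lambda."""
    pvals = np.asarray(pvals)
    j = 1
    while j < d:
        tail = pvals >= j / (d + 1)               # considered tail interval [j/(d+1), 1]
        n = int(tail.sum())
        if n == 0:
            break
        mid = (j / (d + 1) + 1.0) / 2.0           # split the tail into two halves
        first = int((tail & (pvals < mid)).sum())
        # is the first half significantly over-represented? if so, the tail is not flat yet
        if binomtest(first, n, p=0.5, alternative="greater").pvalue < level:
            j += 1
        else:
            break
    return j / (d + 1)
```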

2.5.2 Finite-decoy Storey (FDS and a variant)

The procedure we call finite-decoy Storey (FDS) starts with determining $\lambda$ as above. Then, given the FDR threshold $\alpha$, FDS proceeds along (14)-(16), using the empirical p-values $\tilde p_i$ of (17), to determine

$$\tilde t_\alpha = \sup\left\{0 \le t \le \lambda \,:\, \widehat{\mathrm{FDR}}_\lambda(t) \le \alpha\right\}. \tag{18}$$

This in principle is our threshold $c$, except that, especially when $d$ is small, $\tilde t_\alpha$ can often be zero, which is not a valid value for $c$ in our setup. Hence FDS defines

$$c = \max\left\{\tilde t_\alpha,\ \tfrac{1}{d+1}\right\}. \tag{19}$$

With $(c, \lambda)$ determined, FDS continues by applying the mirandom procedure with the chosen parameter values.
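Putting the pieces together, here is a sketch of how FDS, under our reading of (14)-(19), turns Storey's threshold on the empirical p-values into the parameter $c$ (the floor of $1/(d+1)$ reflects the smallest valid value of $c$ on the lattice):

```python
# FDS parameter selection: a sketch under our reconstruction of (14)-(19).
import numpy as np

def fds_choose_c(pvals, d, lam, alpha=0.05):
    """pvals: empirical p-values (17); lam: lambda from Section 2.5.1."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    pi0 = ((pvals > lam).sum() + 1) / (m * (1 - lam))       # finite-sample estimate (14)
    best_t = 0.0
    for t in np.arange(1, d + 2) / (d + 1):                 # candidate lattice thresholds
        if t > lam:
            break
        fdr_hat = pi0 * m * t / max((pvals <= t).sum(), 1)  # estimated FDR (16)
        if fdr_hat <= alpha:
            best_t = t          # keep the largest valid t, cf. (15)/(18);
                                # keep scanning since the estimate need not be monotone in t
    return max(best_t, 1 / (d + 1))                         # the floor of (19)
```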

FDS was defined to be as close as possible to Storey, Taylor and Siegmund's recommended procedure for guaranteed FDR control in the finite setting. We found that a variant of FDS, which we refer to below simply as the FDS variant, often yields better power. The variant differs from FDS as follows:

  • When computing (18) we use Storey's large sample formulation, which does not include the $+1$ in the estimator (14), and maximizes over all $t \in [0, 1]$ rather than only over $t \le \lambda$.

  • Instead of defining $c$ as in (19), the variant defines $c = \max\left\{\min\left\{\tilde t_\alpha,\ c_{\max}\right\},\ \tfrac{1}{d+1}\right\}$, where $c_{\max}$ is some hard bound on $c$.

  • If the resulting $c$ exceeds $\lambda$ then, when it comes to invoking mirandom, the variant sets $\lambda = c$ rather than continue using the same $\lambda$ that it used until this point (defined in Section 2.5.1).

3 The limiting behavior of our FDR controlling methods

In its selection of the parameter $c$, FDS essentially applies Storey's procedure to the empirical p-values; however, there is a more intimate connection between FDS, and more generally between some of the methods described above, and Storey's procedure that becomes clearer once we let $d \to \infty$. To elucidate that connection we further need to assume here that the score is calibrated, that is, that the distribution of the decoy scores is the same for all null hypotheses. In this case, we might as well assume our observed scores are already expressed as p-values, $Z_i \in [0, 1]$ (keeping in mind that this implies that small scores are better, not worse as they are elsewhere in this paper).

It is not difficult to see that for a given $(c, \lambda)$, mirandom assigns, in the limit as $d \to \infty$, $W_i = Z_i$ if $Z_i \le c$ ($L_i = 1$, or original win), and $W_i = c\,\frac{1 - Z_i}{1 - \lambda}$ if $Z_i > \lambda$ ($L_i = -1$, or decoy win). Sorting the selected scores in increasing order (smaller scores are better here), we note that for a cutoff value $t \le c$ the denominator term in (12) is the number of original scores, or p-values, that are $\le t$. At the same time, for the same $t$ and a decoy win $i$, $W_i \le t$ if and only if $Z_i \ge 1 - \frac{(1-\lambda)\,t}{c}$, and so we have for the numerator term $\#\left\{i : Z_i \ge 1 - \frac{(1-\lambda)\,t}{c}\right\}$.

Considering that the maximal $K$ in (12) must be attained at an $i$ for which $L_i = 1$ (original win), we can essentially rewrite (12) as

$$\max\left\{ t \le c \,:\, \frac{c}{1-\lambda} \cdot \frac{1 + \#\left\{i : Z_i \ge 1 - \frac{(1-\lambda)\,t}{c}\right\}}{\#\{i : Z_i \le t\} \vee 1} \le \alpha \right\}. \tag{20}$$

Consider now Storey's selection of the rejection threshold, which, when using the more rigorous estimate (14), essentially amounts to

$$\max\left\{ t \,:\, \frac{t}{1-\lambda'} \cdot \frac{1 + \#\{i : Z_i > \lambda'\}}{\#\{i : Z_i \le t\} \vee 1} \le \alpha \right\},$$

where $\lambda'$ is a tuning parameter. Considering the cases where $t \le c$ and setting $\lambda' = 1 - \frac{(1-\lambda)\,t}{c}$, Storey's threshold becomes

$$\max\left\{ t \le c \,:\, \frac{c}{1-\lambda} \cdot \frac{1 + \#\left\{i : Z_i > 1 - \frac{(1-\lambda)\,t}{c}\right\}}{\#\{i : Z_i \le t\} \vee 1} \le \alpha \right\}.$$

Since in practice $t$ can be taken as equal to one of the observed p-values, the equivalence with (20) becomes obvious by identifying $\lambda'$ above with $1 - \frac{(1-\lambda)\,t}{c}$ in (20).

Thus, for example, as $d \to \infty$ the mirror method ($c = \lambda = 1/2$) converges, in the context of a calibrated score, to Storey's procedure using $\lambda' = 1 - t$, which coincides with the “mirroring method” of [32]. It is worth noting that this general setting, in which $\lambda'$ varies with the rejection threshold $t$, is not obviously supported by the finite sample theory of [27]; however, it can be justified by noting the above equivalence and our results here.

An even more direct connection with Storey's procedure is established by letting $d \to \infty$ in the context of FDS. Indeed, using the same $\lambda$ determined by the progressive interval splitting procedure described in Section 2.5.1, Storey's finite-sample procedure (15) would amount to setting the rejection threshold to

$$t_\alpha = \sup\left\{0 \le t \le \lambda \,:\, \widehat{\mathrm{FDR}}_\lambda(t) \le \alpha\right\}.$$

Recalling (18) and (19), we note that the latter coincides with the value FDS assigns to $c$ in the limit. Now consider the mirandom part of FDS, rewritten as in (20) with $c = t_\alpha$ (recall we assume no ties here). At $t = c = t_\alpha$ the numerator count in (20) is $\#\{i : Z_i \ge \lambda\}$, so the estimate in (20) coincides with Storey's $\widehat{\mathrm{FDR}}_\lambda(t_\alpha) \le \alpha$.

Hence $t = t_\alpha$ satisfies the required inequality in (20) and, since (20) only considers $t \le c = t_\alpha$, the mirandom part of FDS will select exactly $t_\alpha$. It follows that the rejection lists of FDS and the above variant of Storey's procedure with the same $\lambda$ coincide in the limit.

4 Bootstrap-based selection of $c$ and $\lambda$

As shown in Section 5, each of the methods above can be relatively weak in some cases. Moreover, which method is optimal generally varies in a non-obvious way with the parameters of the input set, as well as with the selected FDR threshold. To address this problem we next propose a resampling-based procedure that attempts to select the optimal method for the given observed and decoy scores, as well as for the FDR threshold.

4.1 Segmented resampling

Our underlying assumptions that computing additional null scores is prohibitively expensive, and that consequently the number of decoy scores $d$ is fairly small, imply that when we resample an observed score $Z_i$, we also need to include its corresponding decoys $\tilde Z_i^1, \dots, \tilde Z_i^d$ in our resample (rather than resampling those scores directly from the set of all decoys). We can create such a bootstrap sample, for example, by independently sampling indices $i_1, \dots, i_m$ uniformly from $\{1, \dots, m\}$ and then defining the resampled set as $\left\{\mathbf{Z}_{i_1}, \dots, \mathbf{Z}_{i_m}\right\}$.

Having a number of such bootstrap samples at hand, it is tempting to apply our above methods and select the one that yields, say, the largest number of discoveries. However, such an approach will occasionally fail to control the FDR. Indeed, when using an analogous bootstrap approach to select the tuning parameter $\lambda$, Storey et al. did not seek a $\lambda$ that optimizes the number of discoveries, choosing instead to optimize the bias-variance tradeoff in estimating $\pi_0$ [27].

Our approach here is based on the observation that, while selecting a method based on maximizing the number of discoveries is not generally valid in terms of FDR control, in many examples it works just fine. Hence, if we have a reason to believe such a greedy approach could work for the particular data at hand, then we can go ahead and optimize with respect to the number of target discoveries. If, on the other hand, we suspect this approach would fail to control the FDR, then we can instead fall back on our default favorite method.

In order to gauge the rate of false discoveries we need labeled samples, which of course we do not have in practice. As a substitute for those, we devised a resampling procedure that makes informed guesses as to which of the hypotheses are false nulls before resampling the indices.

For each conjectured true null index that is resampled we would have ideally sampled, using its null distribution, both the observed and the decoy scores. This would have generated bootstrap samples that are more reflective of the variability we expect to see in our samples. However, as noted above, we are constrained by the cost of generating null scores, so instead we try to capture some of this variability by permuting the scores. Specifically, consistently with our assumption that for each true null hypothesis the variables $Z_i, \tilde Z_i^1, \dots, \tilde Z_i^d$ are exchangeable, we randomly permute the entries of $\mathbf{Z}_i$ for each resampled conjectured true null index $i$ before applying our FDR controlling procedures (cf. Definition 1). In doing so we typically swap an original score with one of its corresponding decoy scores, and in fact for the procedures considered here we can just as well randomly sample $j \in \{1, \dots, d\}$ and swap $Z_i$ with $\tilde Z_i^j$.
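As an illustration of this resampling step (our sketch, not the authors' code): we resample whole (original, decoys) tuples and then, for each resampled index that is conjectured to be a true null, swap the original score with a uniformly chosen decoy.

```python
# Segmented-resampling building block: resample tuples, permute conjectured nulls.
import numpy as np

def bootstrap_sample(Z, D, conjectured_null, rng=np.random.default_rng()):
    """Z: (m,) originals; D: (m, d) decoys; conjectured_null: boolean mask of length m."""
    m, d = D.shape
    idx = rng.integers(0, m, size=m)          # resample hypothesis indices with replacement
    Zb, Db = Z[idx].copy(), D[idx].copy()
    for j in range(m):
        if conjectured_null[idx[j]]:          # conjectured true null: shuffle original vs. decoys
            k = int(rng.integers(0, d))       # pick one of the d decoys uniformly
            Zb[j], Db[j, k] = Db[j, k], Zb[j]
    return Zb, Db
```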

The effectiveness of our resampling scheme hinges on how informed our guesses are. We can use our estimate of $\pi_0$ described in Sections 2.5.1 and 2.5.2 to estimate the number of true null hypotheses. However, randomly picking $(1 - \hat\pi_0)\,m$ of the hypotheses as our conjectured false nulls will produce little overlap with the correct false nulls in cases where $\pi_0$ is large. That in turn will again impair our ability to control the FDR based on our resamples. To address this issue we refined our procedure for drawing the conjectured false nulls in two major ways.

First, rather than sampling conjectured false nulls from the set of all hypotheses, we consider increasing sets of hypotheses and verify that the number of conjectured false nulls we draw from each such set agrees with our estimate of the number of false nulls in