Aggregated False Discovery Rate Control

07/08/2019
by Fang Xie, et al., Ruhr University Bochum

We propose an aggregation scheme for methods that control the false discovery rate (FDR). Our scheme retains the underlying methods' FDR guarantees in theory and can decrease FDR and increase power in practice.

1 Introduction

The false discovery rate (FDR) is the expected proportion of false discoveries among all discoveries. It can be viewed as an extension of the Type I error to multiple testing; in particular, the FDR equals the Type I error if the number of hypotheses is one. Nevertheless, FDR control is also useful beyond multiple testing, such as for variable selection in high-dimensional linear regression.

There is a variety of methods for FDR control, such as the Benjamini-Hochberg (BHq) procedure [3] and the fixed rejection region method of [7] (see also [6]) for independent tests, the Benjamini-Yekutieli procedure for dependent tests [9], and the knockoff filter for linear regression [1]. However, accurate hypothesis testing (and similarly, variable selection) means more than just small FDR: it also means that the power, the expected proportion of correctly selected hypotheses among all true alternatives, is large. An important question is, therefore, how power can be maximized while guaranteeing FDR control.

In this paper, we propose a simple aggregation scheme for FDR control methods. It consists of two steps: First, the FDR method is applied $K$ times with specific FDR target levels; and second, the resulting selections are combined by taking the union. We show that this aggregation scheme retains the original methods' theoretical FDR guarantees while having the potential to improve FDR and power in practice.

The remainder of the paper is organized as follows. In Section 2, we introduce our aggregation scheme and establish its theory. In Section 3, we apply our scheme to the knockoff filter (including knockoff and knockoff+) in high-dimensional linear regression.

2 The aggregation scheme and its theory

In this section, we introduce and study our general aggregation scheme. Consider data $Z$ and null hypotheses of the form $H_1, \dots, H_m$.

The true active set (or the index set of false null hypotheses) is denoted by $S \subseteq \{1, \dots, m\}$. The set $\hat{S}_q$ denotes an estimate of $S$ with FDR control at level $q \in [0, 1]$:

$$\mathrm{FDR}(\hat{S}_q) := \mathbb{E}\left[\frac{|\hat{S}_q \setminus S|}{\max\{|\hat{S}_q|,\, 1\}}\right] \le q, \tag{2.1}$$

where $|\cdot|$ denotes the cardinality of a set.

Our aggregation scheme applies an FDR control method $K$ times and combines the results:

Step 1: Given a target FDR level $q$, choose a sequence $q_1, \dots, q_K$ such that $\sum_{k=1}^{K} q_k \le q$. Apply the FDR control method $K$ times with respective FDR levels $q_1, \dots, q_K$ and denote the corresponding estimated active sets by $\hat{S}_{q_1}, \dots, \hat{S}_{q_K}$.

Step 2: Combine the estimated active sets by taking the union:

$$\hat{S} := \bigcup_{k=1}^{K} \hat{S}_{q_k}.$$
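To make the scheme concrete, the following minimal sketch (our own illustration; `fdr_method` is a hypothetical stand-in for any procedure satisfying (2.1)) implements the two steps in Python:

```python
from typing import Callable, Sequence, Set

def aggregate_fdr(fdr_method: Callable[[float], Set[int]],
                  levels: Sequence[float]) -> Set[int]:
    """Step 1: run the FDR method once per level q_k.
    Step 2: combine the K estimated active sets by taking the union."""
    selected: Set[int] = set()
    for q_k in levels:
        selected |= fdr_method(q_k)
    return selected

# Example: split a target level q = 0.1 uniformly over K = 10 runs.
q, K = 0.1, 10
levels = [q / K] * K   # sums to q, as required by Step 1
```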

The following theorem shows that our method achieves FDR control at the target level $q$.

Theorem 1.

Given a target FDR level $q$ and an FDR control method that satisfies inequality (2.1) for the $q_k$'s of Step 1, the set $\hat{S}$ of the aggregation scheme provides FDR control at level $q$:

$$\mathrm{FDR}(\hat{S}) = \mathbb{E}\left[\frac{|\hat{S} \setminus S|}{\max\{|\hat{S}|,\, 1\}}\right] \le q.$$

This result demonstrates that our aggregation scheme with suitable $q_k$'s has the same guarantees as the underpinning FDR control method with target level $q$. The theorem is general in two ways: First, a wide range of sequences works in Step 1, such as the uniform choice $q_k = q/K$ or the geometric choice $q_k = q \cdot 2^{-k}$ (whose sum is bounded by $q$). Second, a wide range of FDR methods can satisfy inequality (2.1), such as the BHq procedure [3] and the knockoff filter [1].

For $K = 1$, our method equals the original FDR method. In practice, we recommend moderate values of $K$ as a trade-off between computational effort and benefit.

Proof.

The proof is a short and straightforward calculation. By assumption, we have for all $k \in \{1, \dots, K\}$,

$$\mathbb{E}\left[\frac{|\hat{S}_{q_k} \setminus S|}{\max\{|\hat{S}_{q_k}|,\, 1\}}\right] \le q_k.$$

Hence, since $|\hat{S} \setminus S| \le \sum_{k=1}^{K} |\hat{S}_{q_k} \setminus S|$ and $\max\{|\hat{S}|, 1\} \ge \max\{|\hat{S}_{q_k}|, 1\}$ for every $k$,

$$\mathrm{FDR}(\hat{S}) = \mathbb{E}\left[\frac{|\hat{S} \setminus S|}{\max\{|\hat{S}|,\, 1\}}\right] \le \sum_{k=1}^{K} \mathbb{E}\left[\frac{|\hat{S}_{q_k} \setminus S|}{\max\{|\hat{S}_{q_k}|,\, 1\}}\right] \le \sum_{k=1}^{K} q_k \le q,$$

as desired. ∎
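The pathwise inequality behind the second step, $|\hat{S} \setminus S| / \max\{|\hat{S}|, 1\} \le \sum_k |\hat{S}_{q_k} \setminus S| / \max\{|\hat{S}_{q_k}|, 1\}$, holds for arbitrary finite sets; a quick randomized check (our own illustration, not part of the paper):

```python
import random

def fdp(selected: set, active: set) -> float:
    """False discovery proportion of a selected set."""
    return len(selected - active) / max(len(selected), 1)

random.seed(1)
for _ in range(10_000):
    active = set(random.sample(range(50), 10))            # true active set S
    parts = [set(random.sample(range(50), random.randint(0, 15)))
             for _ in range(5)]                           # K = 5 estimated sets
    union = set().union(*parts)
    assert fdp(union, active) <= sum(fdp(p, active) for p in parts) + 1e-12
print("pathwise FDP inequality verified on 10,000 random instances")
```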

3 An example: FDR control in high-dimensional linear regression

In this section, we apply our method to the knockoff filter [1] in high-dimensional linear regression. The corresponding model is

$$y = X\beta + \varepsilon,$$

where $X \in \mathbb{R}^{n \times p}$ is a design matrix, $y \in \mathbb{R}^{n}$ a vector of responses, $\beta \in \mathbb{R}^{p}$ an unknown vector of coefficients, and $\varepsilon \in \mathbb{R}^{n}$ a noise vector. Our data corresponding to Section 2 has the form $Z = (y, X)$.

3.1 A brief introduction to the knockoff filter

The knockoff filter is a method for controlling the FDR in linear regression [1]. A key step of the knockoff filter is to generate a knockoff matrix $\tilde{X}$ for the design matrix $X$. The goal of the knockoffs $\tilde{X}$ is to imitate the correlation structure of the original variables, so that FDR control can be performed on statistics computed from both $X$ and $\tilde{X}$.

Denote the rows of $X$ by $x_1, \dots, x_n$ and the corresponding rows of $\tilde{X}$ by $\tilde{x}_1, \dots, \tilde{x}_n$. In this paper, we generate the knockoffs from a Gaussian distribution obeying

$$\tilde{x}_i \mid x_i \sim \mathcal{N}(\mu_i, V), \tag{3.1}$$

where we assume $x_i \sim \mathcal{N}(0, \Sigma)$ with $\Sigma$ being a positive definite matrix, and $\mu_i$ and $V$ satisfy

$$\mu_i = x_i - x_i \Sigma^{-1} \mathrm{diag}(s) \qquad \text{and} \qquad V = 2\,\mathrm{diag}(s) - \mathrm{diag}(s)\, \Sigma^{-1}\, \mathrm{diag}(s),$$

with $s \in [0, \infty)^p$ making the joint covariance matrix of $(x_i, \tilde{x}_i)$ positive definite. This way of generating knockoffs was also used in [2, 5].
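A minimal sampler for (3.1), assuming the reconstruction above and an equicorrelated choice of $s$ (the variable names and the AR(1) example covariance are our own):

```python
import numpy as np

def gaussian_knockoffs(X: np.ndarray, Sigma: np.ndarray, s: np.ndarray,
                       rng: np.random.Generator) -> np.ndarray:
    """Sample knockoff rows from the conditional Gaussian in (3.1)."""
    Sigma_inv_S = np.linalg.solve(Sigma, np.diag(s))       # Sigma^{-1} diag(s)
    mu = X - X @ Sigma_inv_S                               # conditional means mu_i
    V = 2 * np.diag(s) - np.diag(s) @ Sigma_inv_S          # conditional covariance V
    L = np.linalg.cholesky(V)                              # requires V positive definite
    return mu + rng.standard_normal(X.shape) @ L.T

rng = np.random.default_rng(0)
p = 20
Sigma = 0.3 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # AR(1) covariance
s = np.full(p, 0.999 * min(1.0, 2 * np.linalg.eigvalsh(Sigma).min()))  # equicorrelated s
X = rng.multivariate_normal(np.zeros(p), Sigma, size=100)
X_knock = gaussian_knockoffs(X, Sigma, s, rng)
```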

In the following, we introduce the knockoff filter, which produces an estimated active set of $S$ that achieves FDR control. To obtain this estimated active set, we need a vector of statistics and thresholds. We consider the following penalized estimator for the linear regression:

$$\hat{\beta}(\lambda) \in \operatorname*{arg\,min}_{b \in \mathbb{R}^{2p}} \left\{ \frac{1}{2} \|y - [X, \tilde{X}]\, b\|_2^2 + \sum_{j=1}^{2p} p_{\lambda}(|b_j|) \right\}, \tag{3.2}$$

where $[X, \tilde{X}] \in \mathbb{R}^{n \times 2p}$ is an augmented matrix and $p_{\lambda}$ is a penalty function with tuning parameter $\lambda \ge 0$. When $p_{\lambda}(t) = \lambda t$, the estimator in (3.2) is a Lasso estimator [8]. When the derivative of $p_{\lambda}$ has the form

$$p'_{\lambda}(t) = \lambda \left\{ 1\{t \le \lambda\} + \frac{(a\lambda - t)_{+}}{(a - 1)\lambda}\, 1\{t > \lambda\} \right\}$$

with $a > 2$ and $t > 0$, the estimator in (3.2) is a SCAD estimator [4].

Denote the largest penalty level at which each variable enters the model by $Z_j$, that is,

$$Z_j := \sup\{\lambda : \hat{\beta}_j(\lambda) \ne 0\}$$

for $j \in \{1, \dots, 2p\}$. For simplicity, we sometimes omit the argument $\lambda$, such as using $\hat{\beta}$ and $\hat{\beta}_j$ instead of $\hat{\beta}(\lambda)$ and $\hat{\beta}_j(\lambda)$. The same omissions will occur below.
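For the Lasso penalty, the entry levels $Z_j$ can be read off a full regularization path. A minimal sketch using scikit-learn's `lasso_path` (the choice of solver and grid size are our assumptions, not the paper's):

```python
import numpy as np
from sklearn.linear_model import lasso_path

def entry_lambdas(X_aug: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Z_j = largest penalty level at which variable j enters the Lasso path."""
    lambdas, coefs, _ = lasso_path(X_aug, y, n_alphas=200)   # lambdas are decreasing
    Z = np.zeros(X_aug.shape[1])
    for j in range(X_aug.shape[1]):
        nonzero = np.nonzero(coefs[j] != 0)[0]               # coefs has shape (2p, n_lambdas)
        Z[j] = lambdas[nonzero[0]] if nonzero.size else 0.0  # first grid point where j is active
    return Z
```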

The vector of statistics $W = (W_1, \dots, W_p)$ can be defined by

$$W_j := \max\{Z_j, Z_{j+p}\} \cdot \mathrm{sign}(Z_j - Z_{j+p}), \qquad j \in \{1, \dots, p\}.$$

Then, the thresholds of the knockoff and knockoff+ procedures (two versions of the knockoff filter) for a given FDR level $q$ are defined by

$$T := \min\left\{ t \in \mathcal{W} : \frac{\#\{j : W_j \le -t\}}{\max\{\#\{j : W_j \ge t\},\, 1\}} \le q \right\} \qquad \text{and} \qquad T_{+} := \min\left\{ t \in \mathcal{W} : \frac{1 + \#\{j : W_j \le -t\}}{\max\{\#\{j : W_j \ge t\},\, 1\}} \le q \right\},$$

where $\mathcal{W} := \{|W_j| : j \in \{1, \dots, p\}\} \setminus \{0\}$.

Thus, the corresponding estimated active sets are defined as

$$\hat{S}_q := \{j : W_j \ge T\} \qquad \text{and} \qquad \hat{S}_q^{+} := \{j : W_j \ge T_{+}\}.$$

From Theorem 2 of [1], we know that the estimated active sets obtained by the knockoff+ procedure satisfy inequality (2.1). For the knockoff procedure, we cannot obtain this theoretical bound on the FDR directly, since Theorem 1 in [1] only gives the same bound for a modified FDR, which is less than or equal to the FDR.
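The statistics and both thresholds translate directly into code; a sketch of the definitions above (our own illustration):

```python
import numpy as np

def knockoff_threshold(W: np.ndarray, q: float, plus: bool = False) -> float:
    """Knockoff (plus=False) or knockoff+ (plus=True) threshold at level q."""
    offset = 1 if plus else 0
    for t in np.sort(np.unique(np.abs(W[W != 0]))):   # candidate set: the W's in the text
        if (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= q:
            return t
    return np.inf                                     # no feasible threshold: select nothing

def knockoff_select(Z: np.ndarray, p: int, q: float, plus: bool = False) -> np.ndarray:
    """W_j = max(Z_j, Z_{j+p}) * sign(Z_j - Z_{j+p}); select {j : W_j >= T}."""
    W = np.maximum(Z[:p], Z[p:]) * np.sign(Z[:p] - Z[p:])
    return np.nonzero(W >= knockoff_threshold(W, q, plus))[0]
```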

3.2 Application of the aggregation scheme

After introducing the knockoff filter, we now plug it into our aggregation scheme and present simulation results for this application.

Given a target FDR level $q$, we choose a sequence $q_1, \dots, q_K$ satisfying $\sum_{k=1}^{K} q_k \le q$, for instance $q_k = q/K$. For completeness, we state our method again with the knockoff filter and this specific sequence plugged in:

Step 1: Apply the knockoff (or knockoff+) procedure above $K$ times with the respective target FDR levels $q_1, \dots, q_K$ and denote the corresponding estimated active sets by $\hat{S}_{q_1}, \dots, \hat{S}_{q_K}$ (or $\hat{S}_{q_1}^{+}, \dots, \hat{S}_{q_K}^{+}$).

Step 2: Combine these estimated active sets by taking the union:

$$\hat{S} := \bigcup_{k=1}^{K} \hat{S}_{q_k} \qquad \left(\text{or } \hat{S}^{+} := \bigcup_{k=1}^{K} \hat{S}_{q_k}^{+}\right).$$
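Putting the pieces together, the aggregated knockoff filter reads as follows (using the hypothetical helpers sketched earlier; note that fresh knockoffs are drawn in each of the $K$ runs, which is what makes the union non-trivial):

```python
import numpy as np

def aggregated_knockoff(X, y, Sigma, s, q, K=10, plus=False, seed=0):
    """Union of K knockoff selections at levels q_k = q/K (FDR <= q by Theorem 1
    whenever each run satisfies inequality (2.1), e.g. for knockoff+)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    selected = set()
    for _ in range(K):
        X_knock = gaussian_knockoffs(X, Sigma, s, rng)     # fresh randomness per run
        Z = entry_lambdas(np.hstack([X, X_knock]), y)      # entry levels for [X, X~]
        selected |= set(knockoff_select(Z, p, q / K, plus).tolist())
    return sorted(selected)
```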

In the following, we show simulation results that support our theoretical findings. In addition, we also examine the selection accuracy.

The dimensions of the data are $n$ and $p$. The design matrix $X$ is generated with rows drawn from $\mathcal{N}(0, \Sigma)$. The noise $\varepsilon$ is drawn from a standard Gaussian distribution. The true parameter $\beta$ has a fixed number of nonzero coefficients, whose positions and signs are chosen at random. We simulate independent knockoffs according to (3.1). For the penalized methods used to solve the linear regression, we choose the Lasso [8] and the SCAD [4] with the default $a = 3.7$.

We repeat the simulation $N$ times and calculate the empirical FDR and power,

$$\widehat{\mathrm{FDR}} := \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{S}^{(i)} \setminus S|}{\max\{|\hat{S}^{(i)}|,\, 1\}} \qquad \text{and} \qquad \widehat{\mathrm{power}} := \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{S}^{(i)} \cap S|}{|S|},$$

for our scheme with the knockoff procedure; for the scheme with the knockoff+ procedure, $\hat{S}^{(i)}$ is replaced by $\hat{S}^{+(i)}$. The empirical selection accuracy $\widehat{\mathrm{acc}}$ is defined analogously as the average agreement between the estimated and the true active set; $\widehat{\mathrm{acc}} = 1$ is the ideal case.

We vary the target FDR level $q$ over a grid of values. Figure 1 and Figure 2 show the relationship between the empirical selection accuracy and the target FDR (first row) and between the actual FDR (empirical $\widehat{\mathrm{FDR}}$) and the target FDR (second row) for the two simulation settings. The panels titled Lasso and SCAD are for the knockoff procedure; the panels titled Lasso+ and SCAD+ are for the knockoff+ procedure. The yellow lines correspond to our method, and the purple lines to the standard knockoff filter. The two figures show similar behavior. The plots of the actual FDR against the target FDR show that our actual FDR always stays below the diagonal, which verifies our theoretical result, and also below the FDR of the standard knockoff and knockoff+ procedures. For the knockoff procedure, the selection accuracy of our method is always larger than that of the standard knockoff procedure. For the knockoff+ procedure, the selection accuracy of our method is larger than that of the standard knockoff+ procedure when the target FDR is large.


Figure 1: First simulation setting. The yellow lines are for our aggregation scheme and the purple lines are for the standard knockoff filter. Our actual FDR is almost always smaller than or equal to the standard knockoff filter's, and our selection accuracy improves substantially when the target FDR is large.

Figure 2: Second simulation setting. The yellow lines are for our aggregation scheme and the purple lines are for the standard knockoff filter. Our actual FDR is almost always smaller than or equal to the standard knockoff filter's, and our selection accuracy improves substantially when the target FDR is large.

References

  • Barber and Candès [2015] Rina F. Barber and Emmanuel J. Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.
  • Barber et al. [2019] Rina F. Barber, Emmanuel J. Candès, and Richard J. Samworth. Robust inference with knockoffs. arXiv:1801.03896, 2019.
  • Benjamini and Hochberg [1995] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.
  • Fan and Li [2001] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
  • Holden and Hellton [2019] Lars Holden and Kristoffer Hellton. Multiple model-free knockoffs. arXiv:0902.0885, 2019.
  • Langaas et al. [2005] Mette Langaas, Bo H. Lindqvist, and Egil Ferkingstad. Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(4):555–572, 2005.
  • Storey [2002] John D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):479–498, 2002.
  • Tibshirani [1996] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • Yekutieli and Benjamini [1999] Daniel Yekutieli and Yoav Benjamini. Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. Journal of Statistical Planning and Inference, 82(1):171–196, 1999.