1 Introduction
The false discovery rate (FDR) is the expected proportion of false discoveries among the total discoveries. It can be viewed as an extension of the Type I error to multiple testing; in particular, the FDR equals the Type I error if the number of hypotheses is one. Nevertheless, FDR control is also useful beyond multiple testing, such as for variable selection in high-dimensional linear regression.
There is a variety of methods for FDR control, such as the Benjamini–Hochberg (BHq) procedure [3] and the fixed rejection region method of [7] (see also [6]) for independent tests, the Benjamini–Yekutieli procedure for dependent tests [9], and the knockoff filter for linear regression [1]. However, accurate hypothesis testing (and similarly, variable selection) means more than just small FDR: it also means that the proportion of correctly selected hypotheses among the truly active ones, the power, is large. An important question is, therefore, how power can be maximized while guaranteeing FDR control.
In this paper, we propose a simple aggregation scheme for FDR control methods. It consists of two steps: First, the FDR method is applied \(K\) times with specific FDR target levels; and second, the resulting selections are combined by taking the union. We show that this aggregation scheme retains the original methods’ theoretical FDR guarantees while having the potential to improve FDR and power in practice.
2 The aggregation scheme and its theory
In this section, we introduce and study our general aggregation scheme. Consider data and null hypotheses \(H_{0,1}, \dots, H_{0,p}\). The true active set (or the index set of false null hypotheses) is denoted by \(\mathcal{S} \subseteq \{1, \dots, p\}\). The set \(\hat{\mathcal{S}}\) denotes an estimate of \(\mathcal{S}\) with FDR control at level \(q \in (0, 1)\):
(2.1) \[\operatorname{FDR}[\hat{\mathcal{S}}] := \mathbb{E}\biggl[\frac{|\hat{\mathcal{S}} \setminus \mathcal{S}|}{\max\{|\hat{\mathcal{S}}|, 1\}}\biggr] \le q\,,\]
where \(|\cdot|\) denotes the cardinality of a set.
Our aggregation scheme applies an FDR control method \(K\) times and combines the results:
Step 1: Given a target FDR level \(q\), choose a sequence \(q_1, \dots, q_K\) such that \(\sum_{k=1}^{K} q_k \le q\). Apply the FDR control method \(K\) times with respective FDR levels \(q_1, \dots, q_K\) and denote the corresponding estimated active sets by \(\hat{\mathcal{S}}_1, \dots, \hat{\mathcal{S}}_K\).
Step 2: Combine the estimated active sets by taking the union: \(\hat{\mathcal{S}} := \hat{\mathcal{S}}_1 \cup \dots \cup \hat{\mathcal{S}}_K\).
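In code, the two steps reduce to a few lines. The following Python sketch uses our own hypothetical names; `fdr_method` stands for any procedure satisfying (2.1), such as BHq or the knockoff filter, and we take the uniform levels \(q_k = q/K\):

```python
def aggregate_fdr(fdr_method, q, K=2):
    """Aggregation scheme: run `fdr_method` K times at levels q_k with
    sum(q_k) <= q (here the uniform choice q_k = q / K) and return the
    union of the K estimated active sets (Steps 1 and 2)."""
    levels = [q / K] * K                   # Step 1: target levels summing to q
    union = set()
    for q_k in levels:
        union |= set(fdr_method(q_k))      # Step 1: one run per level
    return sorted(union)                   # Step 2: union of the selections
```

Aggregation only changes the outcome when the underlying method is randomized (as the knockoff filter is, through the knockoff draw); for a deterministic method, all \(K\) runs at the same level coincide.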
The following theorem shows that our method achieves FDR control at the target level \(q\).
Theorem 1.
Given a target FDR level \(q \in (0, 1)\) and an FDR control method that satisfies inequality (2.1) for the \(q_k\)'s of Step 1, the set \(\hat{\mathcal{S}}\) of the aggregation scheme provides FDR control at level \(q\):
\[\operatorname{FDR}[\hat{\mathcal{S}}] = \mathbb{E}\biggl[\frac{|\hat{\mathcal{S}} \setminus \mathcal{S}|}{\max\{|\hat{\mathcal{S}}|, 1\}}\biggr] \le q\,.\]
This result demonstrates that our aggregation scheme with suitable \(q_k\)'s has the same guarantees as the underlying FDR control method with target level \(q\). The theorem is general in two ways: First, a wide range of sequences work in Step 1, for example the uniform choice \(q_k = q/K\) or the geometric choice \(q_k = 2^{-k}q\) (for which \(\sum_{k=1}^{K} q_k \le q\)). Second, a wide range of FDR methods can satisfy inequality (2.1), such as the BHq procedure [3] and the knockoff filter [1].
For \(K = 1\), our method equals the original FDR method. In practice, we recommend a small \(K\) as a tradeoff between computational effort and effect.
Proof.
The proof is a short, straightforward calculation. By assumption, we have for all \(k \in \{1, \dots, K\}\),
\[\mathbb{E}\biggl[\frac{|\hat{\mathcal{S}}_k \setminus \mathcal{S}|}{\max\{|\hat{\mathcal{S}}_k|, 1\}}\biggr] \le q_k\,.\]
Hence, since \(|\hat{\mathcal{S}} \setminus \mathcal{S}| \le \sum_{k=1}^{K} |\hat{\mathcal{S}}_k \setminus \mathcal{S}|\) and \(\max\{|\hat{\mathcal{S}}|, 1\} \ge \max\{|\hat{\mathcal{S}}_k|, 1\}\) for every \(k\),
\[\operatorname{FDR}[\hat{\mathcal{S}}] = \mathbb{E}\biggl[\frac{|\hat{\mathcal{S}} \setminus \mathcal{S}|}{\max\{|\hat{\mathcal{S}}|, 1\}}\biggr] \le \sum_{k=1}^{K} \mathbb{E}\biggl[\frac{|\hat{\mathcal{S}}_k \setminus \mathcal{S}|}{\max\{|\hat{\mathcal{S}}_k|, 1\}}\biggr] \le \sum_{k=1}^{K} q_k \le q\,,\]
as desired. ∎
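The pointwise inequality behind the proof (subadditive numerator, dominating denominator) can be checked on a toy example; the sets below are arbitrary illustrations of our own:

```python
def fdp(selected, true_set):
    """False discovery proportion |S_hat \\ S| / max(|S_hat|, 1)."""
    sel = set(selected)
    return len(sel - set(true_set)) / max(len(sel), 1)

# For any selections S_1, ..., S_K, fdp(union) <= sum_k fdp(S_k):
# the union's numerator is at most the sum of the numerators,
# while its denominator dominates each individual denominator.
true_set = {0, 1, 2}
parts = [{0, 3}, {1, 4, 5}]
union = set().union(*parts)
assert fdp(union, true_set) <= sum(fdp(part, true_set) for part in parts)
```

Taking expectations on both sides of this pointwise bound yields the chain of inequalities in the proof.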
3 An example: FDR control in high-dimensional linear regression
In this section, we apply our method to the knockoff filter [1] in high-dimensional linear regression. The corresponding model is
\[\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\,,\]
where \(\boldsymbol{X} \in \mathbb{R}^{n \times p}\) is a design matrix, \(\boldsymbol{y} \in \mathbb{R}^{n}\) a vector of responses, \(\boldsymbol{\beta} \in \mathbb{R}^{p}\) an unknown vector of coefficients, and \(\boldsymbol{\varepsilon} \in \mathbb{R}^{n}\) a noise vector. Our data corresponding to Section 2 has the form \((\boldsymbol{y}, \boldsymbol{X})\).
3.1 A brief introduction to the knockoff filter
The knockoff filter is a method for controlling the FDR in linear regression [1]. A key step of the knockoff filter is to generate knockoffs \(\tilde{\boldsymbol{X}} \in \mathbb{R}^{n \times p}\) for the design matrix \(\boldsymbol{X}\). The goal of the knockoffs is to imitate the correlation structure of the original variables, so that FDR control can be based on statistics computed from both \(\boldsymbol{X}\) and \(\tilde{\boldsymbol{X}}\).
Denote by \(\boldsymbol{s} = (s_1, \dots, s_p)^{\top}\) a vector with nonnegative entries and by \(\operatorname{diag}\{\boldsymbol{s}\}\) the corresponding diagonal matrix. In this paper, we generate knockoffs \(\tilde{\boldsymbol{X}}\) from a Gaussian distribution obeying
(3.1) \[\tilde{\boldsymbol{X}} \mid \boldsymbol{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{V})\,,\]
where we assume that the rows of \(\boldsymbol{X}\) are i.i.d. \(\mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})\) with \(\boldsymbol{\Sigma}\) a positive definite matrix, and \(\boldsymbol{\mu}\) and \(\boldsymbol{V}\) satisfy
\[\boldsymbol{\mu} = \boldsymbol{X} - \boldsymbol{X}\boldsymbol{\Sigma}^{-1}\operatorname{diag}\{\boldsymbol{s}\} \quad\text{and}\quad \boldsymbol{V} = 2\operatorname{diag}\{\boldsymbol{s}\} - \operatorname{diag}\{\boldsymbol{s}\}\boldsymbol{\Sigma}^{-1}\operatorname{diag}\{\boldsymbol{s}\}\,,\]
with \(\boldsymbol{s}\) making \(\boldsymbol{V}\) positive definite. This way to generate knockoffs was also used in [2, 5].
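Under the assumptions above, sampling from (3.1) amounts to one conditional Gaussian draw per row. A minimal NumPy sketch (the function name and interface are our own):

```python
import numpy as np

def gaussian_knockoffs(X, Sigma, s, seed=None):
    """Draw Gaussian knockoffs row by row from (3.1):
    mean  X - X Sigma^{-1} diag(s),
    cov   2 diag(s) - diag(s) Sigma^{-1} diag(s),
    assuming the rows of X are i.i.d. N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    D = np.diag(np.asarray(s, dtype=float))
    Sinv_D = np.linalg.solve(Sigma, D)      # Sigma^{-1} diag(s)
    mean = X - X @ Sinv_D                   # conditional mean, row-wise
    V = 2.0 * D - D @ Sinv_D                # conditional covariance (p x p)
    L = np.linalg.cholesky(V)               # requires V positive definite
    return mean + rng.standard_normal(X.shape) @ L.T
```

In practice, \(\boldsymbol{s}\) is chosen (e.g., via the equicorrelated or semidefinite-program constructions of [1]) so that \(\boldsymbol{V}\) remains positive definite.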
In the following, we introduce the knockoff filter, which produces an estimated active set of \(\mathcal{S}\) achieving FDR control. To obtain the estimated active set, we need a vector of statistics and thresholds. We consider the following penalized estimator for linear regression:
(3.2) \[\hat{\boldsymbol{b}}(\lambda) \in \operatorname*{arg\,min}_{\boldsymbol{b} \in \mathbb{R}^{2p}} \biggl\{\frac{1}{2}\bigl\|\boldsymbol{y} - [\boldsymbol{X}\;\tilde{\boldsymbol{X}}]\boldsymbol{b}\bigr\|_2^2 + \sum_{j=1}^{2p} p_{\lambda}(|b_j|)\biggr\}\,,\]
where \([\boldsymbol{X}\;\tilde{\boldsymbol{X}}] \in \mathbb{R}^{n \times 2p}\) is an augmented matrix and \(p_{\lambda}\) is a penalty function with tuning parameter \(\lambda\). When \(p_{\lambda}(t) = \lambda t\), the estimator in (3.2) is a Lasso estimator [8]. When the derivative of \(p_{\lambda}\) has the form \(p_{\lambda}'(t) = \lambda\bigl\{\mathbb{1}[t \le \lambda] + \tfrac{(a\lambda - t)_{+}}{(a - 1)\lambda}\,\mathbb{1}[t > \lambda]\bigr\}\) with \(a > 2\) and \((x)_{+} := \max\{x, 0\}\), the estimator in (3.2) is a SCAD estimator [4].
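For concreteness, the SCAD penalty derivative stated above can be written as a small function (a helper of our own, using the \(a = 3.7\) default suggested in [4]):

```python
def scad_derivative(t, lam, a=3.7):
    """SCAD penalty derivative p'_lam(t) for t >= 0 [4]:
    lam                      if t <= lam,
    (a*lam - t)_+ / (a - 1)  if t >  lam  (zero once t >= a*lam)."""
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1)
```

The derivative is constant (Lasso-like) for small coefficients and decays linearly to zero, which is what removes the Lasso's bias on large coefficients.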
Denote the maximal penalty parameter at which each variable enters the model by \(Z_j\), that is,
\[Z_j := \sup\{\lambda : \hat{b}_j(\lambda) \ne 0\}\]
for \(j \in \{1, \dots, 2p\}\). For simplicity, we sometimes omit arguments, such as writing \(\hat{b}_j\) instead of \(\hat{b}_j(\lambda)\). The same omissions occur below.
The statistic vector \(\boldsymbol{W} \in \mathbb{R}^{p}\) can be defined by
\[W_j := \max\{Z_j, \tilde{Z}_j\} \cdot \operatorname{sign}(Z_j - \tilde{Z}_j)\,, \qquad j \in \{1, \dots, p\}\,,\]
where \(Z_j\) and \(\tilde{Z}_j\) correspond to the original variable and its knockoff, respectively. Then, the thresholds of the knockoff and knockoff+ procedures (two types of knockoff filter methods) for a given FDR level \(q\) are defined by
\[T := \min\biggl\{t \in \mathcal{W} : \frac{\#\{j : W_j \le -t\}}{\max\{\#\{j : W_j \ge t\}, 1\}} \le q\biggr\} \quad\text{and}\quad T_{+} := \min\biggl\{t \in \mathcal{W} : \frac{1 + \#\{j : W_j \le -t\}}{\max\{\#\{j : W_j \ge t\}, 1\}} \le q\biggr\}\,,\]
where \(\mathcal{W} := \{|W_j| : j \in \{1, \dots, p\}\} \setminus \{0\}\) (with the convention \(\min \emptyset = +\infty\)).
Thus, the corresponding estimated active sets are defined as
\[\hat{\mathcal{S}} := \{j : W_j \ge T\} \quad\text{and}\quad \hat{\mathcal{S}}_{+} := \{j : W_j \ge T_{+}\}\,.\]
From Theorem 2 of [1], we know that the estimated active sets obtained by the knockoff+ procedure satisfy inequality (2.1). For the knockoff procedure, we cannot obtain this theoretical FDR bound directly, since Theorem 1 in [1] only gives the same bound for a modified FDR that is less than or equal to the FDR.
3.2 Application of the aggregation scheme
After introducing the knockoff filter, we plug it into our aggregation scheme and also present simulation results for this application.
Given a target FDR level \(q\), we choose a sequence \(q_1, \dots, q_K\) satisfying \(\sum_{k=1}^{K} q_k \le q\), such as \(q_k = q/K\). For completeness, we state our method again with the knockoff filter and this specific sequence plugged in:
Step 1: Apply the knockoff (or knockoff+) procedure above \(K\) times with the respective target FDR levels \(q_1, \dots, q_K\) and denote the corresponding estimated active sets by \(\hat{\mathcal{S}}^{1}, \dots, \hat{\mathcal{S}}^{K}\) (or \(\hat{\mathcal{S}}_{+}^{1}, \dots, \hat{\mathcal{S}}_{+}^{K}\)).
Step 2: Combine these estimated active sets by taking the union: \(\hat{\mathcal{S}} := \hat{\mathcal{S}}^{1} \cup \dots \cup \hat{\mathcal{S}}^{K}\) (and \(\hat{\mathcal{S}}_{+}\) accordingly).
In the following, we show simulation results that support our theoretical findings. In addition, we also report the selection accuracy.
The dimensions of the data are . The design matrix is generated by . The noise is drawn from . The true parameter has nonzero coefficients taking values and randomly from . We regenerate such that . We simulate independent knockoffs according to (3.1). For the penalized methods to solve the linear regression, we choose the Lasso [8] and the SCAD [4] with its default \(a = 3.7\).
We repeat the simulation and calculate the empirical FDR and power
\[\widehat{\mathrm{FDR}} := \text{average of}\ \frac{|\hat{\mathcal{S}} \setminus \mathcal{S}|}{\max\{|\hat{\mathcal{S}}|, 1\}} \quad\text{and}\quad \widehat{\mathrm{Power}} := \text{average of}\ \frac{|\hat{\mathcal{S}} \cap \mathcal{S}|}{|\mathcal{S}|}\]
for our scheme with the knockoff procedure; for the scheme with the knockoff+ procedure, \(\hat{\mathcal{S}}\) is replaced by \(\hat{\mathcal{S}}_{+}\). The empirical selection accuracy \(\widehat{\mathrm{ACC}}\) is defined analogously from the estimated and true active sets; \(\widehat{\mathrm{ACC}} = 1\) is the ideal case.
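Given the selections from repeated runs, the empirical FDR and power are simple averages; a small helper of our own naming:

```python
def empirical_metrics(selections, true_set):
    """Empirical FDR and power over repeated simulations:
    FDR   = average of |S_hat \\ S| / max(|S_hat|, 1),
    power = average of |S_hat & S| / |S|."""
    S = set(true_set)
    fdp = [len(set(sel) - S) / max(len(sel), 1) for sel in selections]
    tpp = [len(set(sel) & S) / len(S) for sel in selections]
    return sum(fdp) / len(fdp), sum(tpp) / len(tpp)
```

For the knockoff+ variant, one simply passes the \(\hat{\mathcal{S}}_{+}\) selections instead.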
We vary the target FDR level \(q\). Figure 1 and Figure 2 show the relationship between the empirical power and the target FDR (first row) and between the actual FDR (\(\widehat{\mathrm{FDR}}\) or \(\widehat{\mathrm{FDR}}_{+}\)) and the target FDR (second row) for the two simulation settings, respectively. The plots titled Lasso and SCAD are for the knockoff procedure; the plots titled Lasso+ and SCAD+ are for the knockoff+ procedure. The orange lines correspond to our method and the purple ones to the standard knockoff filter. The two figures show similar behavior. The plots of the actual FDR against the target FDR show that the actual FDR of our method is always below the diagonal line, which verifies our theoretical result, and also below the FDR of the standard knockoff and knockoff+ procedures. For the knockoff case, the power of our method is always larger than that of the standard knockoff procedure. For the knockoff+ case, the power of our method is larger than that of the standard knockoff+ procedure when the target FDR is large.
References
 Barber and Candès [2015] Rina F. Barber and Emmanuel J. Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.
 Barber et al. [2019] Rina F. Barber, Emmanuel J. Candès, and Richard J. Samworth. Robust inference with knockoffs. arXiv:1801.03896, 2019.
 Benjamini and Hochberg [1995] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.
 Fan and Li [2001] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
 Holden and Hellton [2019] Lars Holden and Kristoffer Hellton. Multiple model-free knockoffs. arXiv:0902.0885, 2019.
 Langaas et al. [2005] Mette Langaas, Bo H. Lindqvist, and Egil Ferkingstad. Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(4):555–572, 2005.
 Storey [2002] John D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):479–498, 2002.
 Tibshirani [1996] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

 Yekutieli and Benjamini [1999] Daniel Yekutieli and Yoav Benjamini. Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. Journal of Statistical Planning and Inference, 82(1):171–196, 1999.