# Deep Knockoffs

This paper introduces a machine for sampling approximate model-X knockoffs for arbitrary and unspecified data distributions using deep generative models. The main idea is to iteratively refine a knockoff sampling mechanism until a criterion measuring the validity of the produced knockoffs is optimized; this criterion is inspired by the popular maximum mean discrepancy in machine learning and can be thought of as measuring the distance to pairwise exchangeability between original and knockoff features. By building upon the existing model-X framework, we thus obtain a flexible and model-free statistical tool to perform controlled variable selection. Extensive numerical experiments and quantitative tests confirm the generality, effectiveness, and power of our deep knockoff machines. Finally, we apply this new method to a real study of mutations linked to changes in drug resistance in the human immunodeficiency virus.

## Authors

• 22 publications
• 10 publications
• 29 publications
• ### On Unifying Deep Generative Models

Deep generative models have achieved impressive success in recent years....
06/02/2017 ∙ by Zhiting Hu, et al. ∙ 0

• ### Auto-Encoding Knockoff Generator for FDR Controlled Variable Selection

A new statistical procedure (Model-X candes2018) has provided a way to i...
09/27/2018 ∙ by Ying Liu, et al. ∙ 0

• ### Knockoff Boosted Tree for Model-Free Variable Selection

In this article, we propose a novel strategy for conducting variable sel...
02/20/2020 ∙ by Tao Jiang, et al. ∙ 0

• ### Measuring global properties of neural generative model outputs via generating mathematical objects

We train deep generative models on datasets of reflexive polytopes. This...
05/28/2021 ∙ by Bernt Ivar Utstøl Nødland, et al. ∙ 0

• ### Deep Tempering

Restricted Boltzmann Machines (RBMs) are one of the fundamental building...
10/01/2014 ∙ by Guillaume Desjardins, et al. ∙ 0

• ### KernelNet: A Data-Dependent Kernel Parameterization for Deep Generative Modeling

Learning with kernels is an often resorted tool in modern machine learni...
12/02/2019 ∙ by Yufan Zhou, et al. ∙ 24

• ### Deep Unsupervised Learning using Nonequilibrium Thermodynamics

A central problem in machine learning involves modeling complex data-set...
03/12/2015 ∙ by Jascha Sohl-Dickstein, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

### 1.1 Motivation

Model-X knockoffs [1] is a new statistical tool that allows the scientist to investigate the relationship between a response of interest and hundreds or thousands of explanatory variables. In particular, model-X knockoffs can be used to identify a subset of important variables from a larger pool that could potentially explain a phenomenon under study while rigorously controlling the false discovery rate [2]

in very complex statistical models. While this methodology does not require any knowledge of how the response depends on the values of the features, the correctness of the inferences rests entirely on a precise description of the distribution of the explanatory variables, which are assumed to be random. This makes model-X knockoffs well adapted to situations in which good models are available to describe the joint distribution of the features, as in genome-wide association studies

[3]

where hidden Markov models are widely used to describe patterns of genetic variation. To apply the knockoffs approach in a broad set of applications, however, we would need flexible tools to construct knockoff variables from sampled data in settings where we do not have reliable prior knowledge about the distribution of the covariates but perhaps sufficiently many labeled or unlabeled samples to ‘learn’ this distribution to a suitable level of approximation. These conditions are realistic because the construction of model-X knockoffs only depends on the explanatory variables whose unsupervised observations may be abundant. For example, even though the genome-wide association analysis of a rare disease may contain a relatively small number of subjects, the genetic variants for other individuals belonging to the same population can be gathered from different studies. The goal of this paper is simply stated: to extend the applicability of the knockoffs framework as to make it practically model-free and, therefore, widely applicable. This is achieved by taking advantage of important recent progress in machine learning, which is repurposed to harness the information contained in large unsupervised datasets to sample approximate model-X knockoffs. The ultimate outcome is a set of sensible and flexible tools for model-free controlled variable selection that can help alleviate the crucial irreproducibility issues afflicting many areas of science and data analysis

[4, 5, 6, 7]. A preview of our contribution is sketched below, while the technical details are postponed to later sections.

### 1.2 Our contribution

Given independent copies of from some unknown distribution , we seek to construct a random generator of valid knockoffs such that the joint law of is invariant under the swapping of any and for each (see Section 2 for details). Concretely, the machine takes the data as input and generates through a mapping , where is random noise, and

is a deep neural network. The parameters of the network are fitted on multiple observations of

to optimize a loss function that quantifies the extent to which

is a good knockoff copy of . This goal is related to the classical problem of learning generative models; however, the challenge here is unusual since only is accessible while no sample from the target distribution is available. Fortunately, the existing methods of deep generative modeling reviewed in Section 3 can be suitably repurposed, as we shall see in Section 4. Furthermore, the lack of uniqueness of the target distribution raises an additional question. Intuitively, this ambiguity should be resolved by making as different as possible from , since a trivial copy—setting —would satisfy the required symmetry without being of any practical use for variable selection. Our approach generalizes the solution described in [1], which relies on the simplifying assumption that

can be well-described as a multivariate Gaussian vector. In the context of deep generative models, the analogous idea consists of training a machine that optimizes the compatibility of the first two moments of

while keeping the strength of the pairwise correlations between and for each under control. By including in the loss function an additional term that promotes the matching of higher moments, we will show that one can move beyond the second-order approximation towards a model-free knockoff generator. The effectiveness of deep knockoff machines can be quantitatively measured using the goodness-of-fit diagnostics presented in Section 5, as shown empirically by the results of our numerical experiments (Section 6) and data analysis (Section 7). The algorithms described in this paper have been implemented in Python and the corresponding software is available from https://web.stanford.edu/group/candes/deep-knockoffs/.

### 1.3 Related work

The main idea of using knockoffs as negative control variables was originally devised in the context of linear regression setting with a fixed design matrix

[8]. The generation of model-X knockoffs beyond the settings considered in [1] and [3] has also been tackled in [9]

, which extends the results for hidden Markov models to a broader class of Bayesian networks. More recent advances in the framework of knockoffs include the work of

[10, 11, 12], while some interesting applications can be found in [13, 14, 15]. Very recently, deep generative models have independently been suggested as a procedure for sampling knockoffs in [16]; there, the approach focuses on adversarial rather than moment matching networks. Even though the fundamental aims coincide and the solutions are related, our machine differs profoundly by design and it offers a more direct connection with existing work on second-order knockoffs. Also, it is well known that generative adversarial networks are difficult to train [17], while moment matching is a simpler task [18, 19]. Since the approach of [16] requires simultaneously training four different and interacting neural networks, we expect that a good configuration for our machine should be faster to learn and require less tuning. This may be a significant advantage since the ultimate goal is to make knockoffs easily accessible to researchers from different fields. A computationally lighter alternative is proposed in [20]

, which relies on the variational autoencoder

[21] to generate knockoff copies. Since our work was developed in parallel111The results of this paper were first discussed at the University of California, Los Angeles, during the Green Family Lectures on September 27, 2018. to that of [16, 20], we are not including these recent proposals in our simulation studies. Instead, we will compare our method to well-established alternatives.

## 2 Model-X knockoffs

Since the scope of our work depends on properties of model-X knockoffs, we begin by rehearsing some of the key features of the existing theory. For any sampled from a distribution , the random vector is said to be a knockoff copy of [1] if the joint law of obeys

 (X,~X)d=(X,~X)swap(j)for each j∈{1,…,p}; (1)

here, the symbol indicates equality in distribution and is defined as the operator swapping with ; if and ,

. Knockoffs play a key role in controlled variable selection, by serving as negative controls that allow one to estimate and limit the number of false positives in the variable selection problem defined below. Consider

observations , with each assumed to be drawn independently from a known , and the associated label drawn from an unknown conditional distribution . The goal is to identify a subset of important components of that affect . In order to state this objective more precisely, one refers to as unimportant if

 Y\rotatebox[origin=c]90.0$⊨$Xj∣X−j,

where indicates the remaining variables after is excluded. The true null hypotheses is the set of all variables that are unimportant; in words, is not important if it is conditionally independent of the response once we know the value of . Put differently, is not important if it does not provide any additional information about beyond what is already known. While searching for a subset that includes the largest possible number of important variables in , one wishes to ensure that the false discovery rate,

 FDR =E[|^S∩H0||^S∨1|],

remains below a nominal level , e.g. . The false discovery rate is thus defined as the expected fraction of selected variables that are false positives. The approach of [1] provably controls the false discovery rate without placing any restrictions on the conditional likelihood of , which can be arbitrary and completely unspecified. The first step in their method consists of generating a knockoff copy for each available sample of , before looking at , such that both (1) is satisfied and . Some measures of feature importance and are then evaluated for each and , respectively. For this purpose, almost any available method from statistics and machine learning can be applied to the vector of labels and the augmented data matrix , with the only fundamental rule that the original variables and the knockoffs should be treated equally; this is saying that the method should not use any information revealing which variable is a knockoff and which is not. Examples include sparse generalized linear models [1, 3]

[15]

, support vector machines and deep neural networks

[9, 10]. Each pair of and is then combined through an antisymmetric function into the statistics , e.g. . By construction, a large and positive value of suggests evidence against the

th null hypothesis, while unimportant variables are equally likely to be positive or negative. Under this choice of

, it can be shown that exact control of the false discovery rate below the nominal level can be obtained by selecting , where

 τq=min{t>0:1+|{j:Wj≤−t}||{j:Wj≥t}|≤q}.

The numerator in the expression above can be understood as a conservative estimate of the number of false positives above the fixed level . This adaptive significance threshold is that first proposed in the knockoff filter of [8]

, while the choice of the test statistics

may be different [1]. The validity of the false discovery rate control relies entirely on the exact knowledge of and our ability to generate satisfying (1). Even though procedures that can sample exact knockoff copies have been previously derived for a few special classes of

such as multivariate Gaussian distributions

[1] and hidden Markov models [3], the general case remains algorithmically challenging. This difficulty arises because (1) is much more stringent than a first look may suggest. For instance, obtaining new independent samples from or permuting the rows of the data matrix would only ensure that is equal in distribution to , while the analogous result would not hold between and

. At the same time, the latter property is crucial since a null variable and its knockoff must be able to explain on average the same fraction of the variance in the response. The practical approximate solution described in

[1] consists of relaxing the condition in (1) as to match only the first two moments of the distributions on either side. In this weaker sense, is thus said to be a second-order knockoff copy of if the two random vectors have the same expected value and their joint covariance matrix is equal to

 Cov[(X,~X)] =[ΣΣ−diag(s)Σ−diag(s)Σ], (2)

where is the covariance matrix of under and is any -dimensional vector selected in such a way that the matrix in the right-hand side is positive semidefinite. The role of is to make as uncorrelated with as possible, in order to increase statistical power during variable selection. Therefore, the value of is typically chosen to be as large as possible [1]. The weaker form of exchangeability in (2) is reminiscent of the notion of fixed-design knockoffs from [8] and it can be practically implemented by approximating the distribution of as multivariate Gaussian [1]. This approximation often works well in practice, even though it is in principle insufficient to guarantee control of the false discovery rate under the general conditions of the model-X framework [22]. In this paper, we build upon the work of [1] and [22] to obtain higher-order knockoffs that can achieve a better approximation of (1) using modern techniques from the field of deep generative models.

## 3 Deep generative models

Replicating the underlying distribution of a data source is an essential task of statistical machine learning that can be broadly described as follows. Given independent -dimensional samples from an unknown distribution , a generative model approximating the true is sought in order to synthesize new observations that could plausibly belong to the training set, while being sufficiently different to be non-trivial. Several well known techniques have been developed to tackle this problem, some of which are based on hidden Markov models [23]

[24][25]. In recent years, such traditional approaches have been largely replaced by neural networks, with two popular examples being variational autoencoders [21, 26, 27, 28] and generative adversarial networks [29, 30, 31, 32, 33, 34, 35]. These are based on a parametric non-linear function that maps an input noise vector to the sample domain of . The parameters in represent the collection of weights and biases defining the neural network and they need to be learned from the available data. The function thus defined is deterministic for any fixed realization of the noise and, with an appropriate choice of , it propagates and transforms the noise in

to obtain a random variable

approximately distributed as . Training deep generative models is computationally difficult, and considerable effort has been dedicated to the development of practical algorithms that can find good solutions. For instance, the popular variational method, which lies at the heart of the autoencoder in [21], proceeds by maximizing a traceable lower bound on the log-likelihood of the training data. In contrast, generative adversarial networks strive to minimize the inconsistencies of the generated samples with the original ones, by formulating the learning task as a two-player game [29]. As the generator

attempts to produce realistic samples, an antagonistic discriminator tries to recognize them. Since the discriminator is defined as a deep binary classifier with a differentiable loss function, the two networks can be simultaneously trained by gradient descent, until no further gain can be made on either side. Even though generative adversarial networks have enjoyed a great deal of success

[29, 30, 31, 32, 33, 34, 35], non-convex minimax optimization is notoriously complex [17]. More recent alternatives mitigate the issue by replacing the classifier with a simpler measure of the distance between the distributions of the original and the simulated samples [18, 19, 36, 37, 38, 39]. The remaining part of this section is dedicated to reviewing the basics of some of the latter approaches, upon which we will begin to develop a knockoff machine. The discriminator component of a deep generative model faces the following challenge. Given two sets of independent observations and , respectively drawn from some unknown distributions and , it must be verified whether . This multivariate two-sample problem has a long history in the statistics literature and many non-parametric tests have been proposed to address it [40, 41, 42, 43, 44, 45, 46]. In particular, the work of [45] introduced a test statistic, called the maximum mean discrepancy, whose desirable computational properties have inspired the development of generative moment matching networks [18, 19]. The relevant key idea is to quantify the discrepancy between the two distributions in terms of the largest difference in expectation between and , over functions mapping the random variables into the unit ball of a reproducing kernel Hilbert space [45]. Fortunately, this abstract characterization can be made explicit with the kernel trick [45], leading to the practical utilization described below. Let be independent samples drawn from and , respectively, and define the maximum mean discrepancy between and as

 DMMD(PX,PZ) =EX,X′[k(X,X′)]−2EX,Z[k(X,Z)]+EZ,Z′[k(Z,Z′)], (3)

where is a kernel function. If the characteristic kernel of a reproducing kernel Hilbert space [45] is used, it can be shown that the quantity in (3) is equal to zero if and only if . Concretely, valid choices of include the common Gaussian kernel, , with bandwidth parameter , and mixtures of such. Furthermore, the maximum mean discrepancy is always non-negative and it can be estimated from finite samples in an unbiased fashion via

 ˆDMMD(X,Z) =1n(n−1)n∑i=1∑j≠ik(Xi,Xj)−2n2n∑i=1n∑j=1k(Xi,Zj)+1n(n−1)n∑i=1∑j≠ik(Zi,Zj), (4)

see [45]. Since the expression in (4) is easily computable and differentiable, it can serve as the objective function of a deep generative model, effectively replacing the discriminator required by generative adversarial networks [18, 19]. The generator is then trained on to produce samples that minimize (4), by applying the standard techniques of gradient descent. This idea can also be repurposed to develop a knockoff machine, as discussed in the next section.

## 4 Deep knockoff machines

### 4.1 Overview

A knockoff machine is defined as a random mapping that takes as input a random , an independent noise vector and returns an approximate knockoff copy . The machine is characterized by a set of parameters and it should be designed such that the joint distribution of deviates from (1) as little as possible. If the original variables follow a multivariate Gaussian distribution, i.e. , a family of machines generating exact knockoffs is given by

 fθ(X,V) =X−XΣ−1diag{s}+(2diag{s}−% diag{s}Σ−1diag{s})1/2V, (5)

for any choice of the vector that keeps the matrix multiplying positive-definite [1]. In practice, the value of is typically determined by solving a semi-definite program [1], see Section 4.5. By contrast, the algorithm for sampling knockoff copies of hidden Markov models in [3] cannot be easily represented as an explicit function . This difficulty should be expected for various other choices of , and an analytic derivation of seems intractable in general. In order to develop a flexible machine that can sample knockoffs for arbitrary and unknown distributions , we assume to take the form of a deep neural network, as described in Section 4.5. The values of its parameters will be estimated on the available observations of by solving a stochastic optimization problem. An overview of our approach is visually presented in Figure 1 and it can be summarized as follows: the machine is provided with realizations of the random vector , independently sampled from an unknown underlying distribution . For any fixed configuration of , each is computed as a function of the corresponding input and the noise , for . The latter () is independently resampled for each observation and each time the machine is called. A scoring function examines the empirical distribution of and quantifies its compliance with the exchangeability property in (1). After each such iteration, the parameters are updated in the attempt to improve future scores. Ideally, upon successful completion of this process, the machine should be ready to generate approximate knockoff copies for new observations of drawn from the same . A specific scoring function that can generally lead to high-quality knockoffs will be defined below.

### 4.2 Second-order machines

We begin by describing the training of a special knockoff machine that is interesting for expository purposes. Suppose that instead of requiring the joint distribution of to satisfy (1), we would be satisfied with obtaining second-order knockoffs. In order to incentivize the machine to produce such that and the joint covariance matrix of satisfies (2), we consider a simple loss function that computes a differentiable measure of its compatibility with these requirements. For the sake of notation, we let indicate the empirical covariance matrix of , which takes the following block form:

 ^G =[^GXX^GX~X^GX~X^G~X~X]. (6)

Above, are the empirical covariance matrices of , respectively. Then we define

 Jsecond-order(X,~X) =λ1∥1n∑ni=1(Xi−~Xi)∥22p+λ2∥^GXX−^G~X~X∥2F∥^GXX∥2F+λ3∥M∘(^GXX−^GX~X)∥2F∥^GXX∥2F. (7)

Here, the symbol indicates element-wise multiplication, while , with being a matrix of ones and

the identity matrix. For simplicity, the weights

will be set equal to one throughout this paper. The first term in (7) penalizes differences in expectation, while the second and third terms encourage the matching of the second moments. Smaller values of this loss function intuitively suggest that is a better second-order approximate knockoff copy of . Since

is smooth, a second-order knockoff machine can be trained with standard techniques of stochastic gradient descent. As we mentioned earlier, knockoffs are not uniquely defined, and it is desirable to make

as different as possible from . There are various ways of encouraging a machine to seek this outcome, and a practical solution inspired by [1] consists of adding a regularization term to the loss function, penalizing large pairwise empirical correlations between and :

 Jdecorrelation(X,~X) =p∑j=1ˆcorr(Xj,~Xj). (8)

Each term above is defined as the empirical estimate of the Pearson correlation coefficient for the th columns of and . In summary, this describes a new general procedure for sampling approximate second-order knockoffs. Compared to the original method in [1], the additional computational burden of fitting a neural network is significant. However, the tools developed in this section are valuable because they can be generalized beyond the second-order setting, as discussed next.

### 4.3 Higher-order machines

In order to build a general knockoff machine, one must be able to precisely quantify and control the deviation from exchangeability: the difference in distribution between and for each . For this purpose, we deploy the maximum mean discrepancy metric from Section 3

. In order to obtain an unbiased estimate, we randomly split the data into a partition

and define the corresponding output of the machine as . Then, it is natural to seek a machine that targets

 p∑j=1ˆDMMD[(X′,~X′),(X′′,~X′′)swap(j)].

Above, stands for the empirical estimate in (4) of the maximum mean discrepancy, evaluated with a Gaussian kernel. Intuitively, the above quantity is minimized in expectation if the knockoffs are exchangeable according to (1). This idea will be made more precise below, for a slightly modified objective function. We refer to this solution as a higher-order knockoff machine because the expansion of the Gaussian kernel into a power series leads to a characterization of (3) in terms of the distance between vectors containing all higher-moments of the two distributions [47, 45]. Therefore, our approach can be interpreted as a natural generalization of the method in [1]. Since computing at each iteration may be expensive (there are swaps), in practice we will only consider two swaps and ask the machine to minimize

 JMMD(X,~X) =ˆDMMD[(X′,~X′),(~X′′,X′′)]+ˆDMMD[(X′,~X′),(X′′,~X′′)swap(S)], (9)

where indicates a uniformly chosen random subset of such that

with probability

. The following result confirms that the objective function in (9) provides a sensible guideline for training knockoff machines.

###### Theorem 1.

Let be a collection of independent observations drawn from , and define as the corresponding random output of a fixed machine . Then for defined as in (9),

 E[J\em MMD(X,~X)]≥0.

Moreover, equality holds if and only if the machine produces valid knockoffs for . Above, the expectation is taken over , the noise in the knockoff machine, and the random swaps in the loss function.

With a finite number of observations available, stochastic gradient descent aims to minimize the expectation of (9) conditional on the data. This involves solving a high-dimensional non-convex optimization problem that is difficult to study theoretically. Nonetheless, effective algorithms exist and a weak form of convergence of stochastic gradient descent is established in Section 4.4. Therefore, these results provide a solid basis for our proposed method. The full objective function of a knockoff machine may also include the quantities from (7) and (8), as a form of regularization, thus reading as

 J(X,~X) =γJMMD(X,~X)+λJsecond-order(X,~X)+δJ% decorrelation(X,~X). (10)

In the special case of , a second-order machine is recovered, while may lead to knockoffs with little power. The second-order penalty may appear redundant because already penalizes discrepancies in the covariance matrix, as well as in all other moments. However, we have observed that setting

often helps to decrease the amount of time required to train the machine. For optimal performance, the hyperparameters should be tuned to the specific data distribution at hand. For this purpose, we discuss practical tools to measure goodness of fit later in Section

5.1. Meanwhile, for any fixed choice of , the learning strategy is summarized in Algorithm 1.

Alternative types of knockoff machines could be based on different choices of kernel or other measures of the discrepancy between two distributions. An intuitive option would be the Kullback-Leibler divergence

[48], which appears at first sight to be a natural choice. In fact, a connection has been shown in [22] between this and the worst-case inflation of the false discovery rate that may occur if the variable selection relies on inexact knockoffs. In recent years, some empirical estimators of this divergence have been proposed in the literature on deep generative models [49, 48], which could also be employed for our purposes.

### 4.4 Analysis of the optimization algorithm

In this section, we study the behavior of Algorithm 1, establishing a weak form of convergence. For simplicity, we focus on the machine defined by the loss function in (10) with . The other cases can be treated similarly and they are omitted in the interest of space. In order to facilitate the exposition of our analysis, we begin by introducing some helpful notations. Let and denote a randomly chosen partition of the fixed training set . The state of the learning algorithm at time is fully described by , where is the current configuration of the machine parameters. Conditional on the noise realizations and the randomly chosen set of swapping indices , the objective

 JMMD(X′t,~X′t,X′′t,~X′′t,St),

is a deterministic function of since and . Above, is written with a slight, but clarifying, abuse of the notation in (9). At this point, we can also define

 Jθt=E[JMMD(X′t,~X′t,X′′t,~X′′t,St)∣∣ζt], (11)

with the expectation taken over the noise and the choice of . Let us also define as the gradient of (11) with respect to . In practice, this quantity is approximated by sampling one realization of and a set of swapping indices , then computing the following unbiased estimate:

 gt=∇JMMD(X′t,fθt(X′t,V′t),X′′t,fθt(X′′t,V′′t),St). (12)

Since the function is deterministic because all random variables in (12

) have been observed by the algorithm, backpropagation can be used to calculate the gradient on the right-hand-side. This gradient is then used to update the machine parameters in the next step, through

, where is the learning rate. Under standard regularity conditions, we can follow the strategy of [50] to show that the algorithm tends to approach a stationary regime. In particular, we assume the existence of a finite Lipschitz constant such that, for all and all possible values of the data batches ,

 ∥∥∇E[JMMD(X′,~X′,X′′,~X′′,S)∣∣X′,X′′,θ′]−∇E[JMMD(X′,~X′,X′′,~X′′,S)∣∣X′,X′′,θ′′]∥∥2 ≤L∥θ′−θ′′∥2,

and we define

 Δ=2Lsup(Jθ1−J∗).

Above, indicates the expected loss (11) at the first step, conditional on the data and the initialization of . The supremum is taken over all possible values of the data and the initial . The value of is defined as a uniform lower bound on . Following the result of [45] that bounds the empirical estimate of the maximum mean discrepancy from below,

 ˆDMMD(X,Z)≥−1n(n−1)n∑i=1[k(Xi,Xi)+k(Zi,Zi)−k(Xi,Zi)], ∀X,Z∈Rn×p,

we can conclude that a finite value of can be determined from the data.

###### Theorem 2.

Consider a fixed training set and adopt the notation above. Assume that the gradient estimates have uniformly bounded variance; that is,

 E[∥gt−∇Jθt∥22∣∣ζt]≤σ2,∀t≤T,

for some . Then for any initial state of the machine and a suitable value of the constant defined above,

 1TT∑t=1E[∥∇Jθt∥22∣∣ζ1] ≤1TLΔμ(2−Lμ)+Lσ2μ(2−Lμ).

In particular, choosing for some gives

 1TT∑t=1E[∥∇Jθt∥22∣∣ζ1] ≤L2ΔT+(μ0+Δμ0)Lσ√T.

In a nutshell, Theorem 2 states that the squared norm of the gradient of the loss function (11) decreases on average as when . This can be interpreted as a weak form of convergence that is not necessarily implying that will reach a fixed point. One could also follow the strategy of [51] instead of [50] to obtain a closely related result, guaranteeing that the norm of the gradient will be small at a sufficiently large and randomly chosen stopping time. It would of course be more desirable to establish the convergence in a stronger sense, perhaps to a local minimum; however, this is difficult and we are not aware of any similar results in the literature on deep moment matching networks. It should be noted that our assumption that the gradient estimates have uniformly bounded variance is not as strong as requiring the gradients to be uniformly bounded. The work of [52] provides explicit bounds in several special instances of single and multi-layer neural networks. However, we choose not to validate this assumption in our knockoff machines for two reasons. First, it is standard in the literature [51, 50]; second, a proof would need to rely heavily on a specific architecture and loss function. In practice, we observed that a learning rate in the typical range between and works well.

### 4.5 Implementation details

The construction of deep knockoff machines allows considerable freedom in the precise form of . In general, neural networks can be implemented following a multitude of different architectures, and the final choice is often guided by the experience of the practitioners. For the purpose of this paper, we describe a structure that works well across a wide range of scenarios. However, the options are virtually limitless and we expect that more effective designs will be found for more specific problems. The first layer of the neural network in our knockoff machine takes a vector of original variables and a -dimensional noise vector as input. Then a collection of

latent variables is produced by taking different linear combinations of the input and applying to each a nonlinear activation function. The connections in this layer are represented in the schematics of Figure

2, where and . The same pattern of linear and nonlinear units is repeatedly applied to the hidden variables, through layers of width , as shown in Figure 2(a). Finally, a similarly designed output layer returns a -dimensional vector, as depicted in Figure 2(b). Following the approach of generative moment matching networks [18], we replaced the unbiased maximum mean discrepancy loss in (9) with a slightly modified version that is always positive because it performs better in practice; see [18, Section 4.3] for technical details. In order to reduce the training time, the machines are fitted by applying stochastic gradient descent with momentum as it is customary in the field.

We have observed superior performance when a modified decorrelation penalty is adopted instead of the simpler expression in (8). For this purpose, we suggest using

 Jdecorrelation(X,~X) =∥diag(^GX~X)−1+s∗SDP(^GXX)∥22,

where is defined as in (6) and is the solution to the semi-definite program

 s∗SDP(Σ)=argmins∈[0,1]p p∑j=1∣∣1−sj∣∣, s.t. 2Σ⪰diag(s).

The above optimization problem is the same used in [1] to minimize the pairwise correlations between the knockoffs and the original variables, in order to boost the power, for the special case of . Under the Gaussian assumption, the constraint is necessary and sufficient to ensure that the joint covariance matrix of is positive semidefinite.

## 5 Robustness and diagnostics

### 5.1 Measuring goodness-of-fit

For any fixed data source , the goodness-of-fit of a conditional model producing approximate knockoff copies can be informally defined as the compatibility of the joint distribution of with the exchangeability property in (1). By defining and evaluating different measures of such discrepancy, the quality of our deep knockoff machines can be quantitatively compared to that of existing alternatives. This task is a special case of the two-sample problem mentioned in Section 3, with the additional complication that a large number of distributions are to be simultaneously analyzed. In fact, for any fixed and , one should verify whether all of the following null hypotheses are true:

 H(j)0:P(X,~X)=P(X,~X)% swap(j),j∈{1,…,p}.

In order to reduce the number of comparisons, we will instead consider the following two hypotheses:

 Hfull0:P(X,~X)=P(~X,X), Hpartial0:P(X,~X)=P(X,~X)swap(S), (13)

where is a random subset of , chosen uniformly and independently of , such that with probability . Either hypothesis can be separately investigated by applying a variety of existing two-sample tests, as described below. In order to study , we define and as two independent sets of observations, respectively drawn from the distribution of and . The analogous tests of can be performed by defining as the family of samples , and they are omitted in the interest of space. Covariance diagnostics. It is natural to begin with a comparison of the covariance matrices of and , namely . For this purpose, we compute the following statistic meant to test the hypothesis that :

 ˆφCOV =1n(n−1)n∑i=1n∑j≠i[(Z⊤1iZ1j)2+(Z⊤2iZ2j)2]−2n2n∑i=1n∑j=1(Z⊤1iZ2j)2. (14)

This quantity is an unbiased estimate of , if and have zero mean [53]. In practice, and will be centered if this assumption does not hold. The asymptotic distribution of (14) can be derived under mild conditions, thus yielding a non-parametric test of the null hypothesis that [53]. However, since our goal is to compare knockoffs generated by alternative algorithms, we will simply interpret larger values of (14) as evidence of a worse fit. MMD diagnostics. While being intuitive and easy to evaluate, the above diagnostic is limited as it does not capture the higher-order moments of . Therefore, different diagnostics should be used in order to have power against other alternatives. A natural choice is to rely on the maximum mean discrepancy, on which the construction of the deep knockoff machines in Section 4.3 is based. In particular, the first null hypothesis in (13) can be tested by computing

 ˆφMMD=ˆDMMD(Z1,Z2), (15)

where the function is defined as in (4). See [45] for details. Since this is an unbiased estimate of the maximum mean discrepancy between the two distributions, large values can again be interpreted as evidence against the null. On the other hand, exact knockoffs will lead to values equal to zero on average. KNN diagnostics. The -nearest neighbors test [42] can also be employed to obtain a non-parametric measure of goodness-of-fit. For simplicity, we consider here the special case of . For each sample , with , we denote the nearest neighbor in Euclidean distance of among as . Then, we define to be equal to one if and zero otherwise, and compute

 ˆφKNN =12nn∑i=1[I1(i)+I2(i)]. (16)

This quantity is the fraction of samples whose nearest neighbor happens to originate from the same distribution. In expectation, is equal to if the two distributions are identical, while larger values provide evidence against the null [42]. A rigorous test can be performed to determine whether to reject any given knockoff generator, by applying the asymptotic significance threshold derived in [42]. However, since exact knockoffs may be difficult to achieve in practice, we choose to use these statistics to grade the quality of different approximations. According to this criterion one should prefer knockoff constructions leading to values of this statistic that are closer to . Energy diagnostics. Finally, the hypotheses in (13) can also be tested in terms of the energy distance [46], defined as

 DEnergy(PZ1,PZ2) =2EZ1,Z2[∥Z1−Z2∥2]−EZ1,Z′1[∥Z1−Z′1∥2]−EZ2,Z′2[∥Z2−Z′2∥2], (17)

where are independent samples drawn from and , respectively. Assuming finite second moments, one can conclude that , with equality if and only if and are identically distributed [46]. Therefore, we follow the approach of [46] and define the empirical estimator

 ˆDEnergy(Z1,Z2) =2n2n∑i=1n∑j=1∥Z1i−Z2j∥2−1n2n∑i=1n∑j=1∥Z1i−Z1j∥2−1n2n∑i=1n∑j=1∥Z2i−Z2j∥2,

and the test statistic

 ˆφEnergy =n2ˆDEnergy(Z1,Z2). (18)

The quantity in (18) can be shown to be always positive, and it leads to a consistent test for the equality in distribution [46], under the assumption of finite second moments. More precisely, this statistic converges in probability to as the sample size grows, while diverging otherwise. Therefore, we can interpret larger values of as evidence of a poorer fit. The diagnostics defined above provide a systematic way of comparing different knockoff constructions. Sampling in compliance with (1) for any fixed data distribution is a difficult problem. Even though the effort is motivated by the ultimate goal of performing controlled variable selection, here the challenge is greater because even roughly approximated knockoffs may sometimes happen to allow control of the rate of false positives, while failing to pass the above tests. By contrast, respecting (1) guarantees that the inference will be valid. In the experiments of Section 6 we will show that deep machines can almost match the quality of knockoffs produced by the existing specialized algorithms when prior information on is known, while greatly surpassing them in other cases.

### 5.2 False discovery rate under model misspecification

The quality of knockoffs produced by our deep machines has been tested according to stringent measures of discrepancy with the original data. However, even when is far from respecting the exchangeability in (1), the false discovery rate may sometimes be controlled in practice. Since a scientist aiming to perform inference on real problems cannot blindly trust any statistical method, it is important to develop a richer set of validation tools. The strategy originally proposed in [1]

consists of making controlled numerical experiments that replicate the model misspecification present in the data of interest. The main idea is to sample artificial response variables

from some known conditional likelihood given the real explanatory variables . Meanwhile, approximate knockoff copies are generated using the best available algorithm. Since the true null hypotheses are known in this setting, the proportion of false discoveries can be computed after applying the knockoff filter. By repeating this experiment a sufficient number of times, it is possible to verify whether the false discovery rate is contained below the nominal level. These experiments help confirm whether the knockoffs can be applied because the distribution of is the same as in the real data.

## 6 Numerical experiments

### 6.1 Experimental setup

The deep knockoff machine presented in Section 4

has been implemented in Python using the PyTorch library, following the design outlined in Section

4.5. The activation units in each layer of the neural network sketched in Figure 2

are the parametric rectified linear unit functions

[54]

. Between the latter and the linear combinations, an additional batch normalization function

[55] is included. The width of the hidden layers should in general depend on the dimension of , and the guideline works well in practice. Six such layers are interposed between the input and the output of the machine, each parametrized by different weight matrices and biases. The maximum mean discrepancy loss function is evaluated using the Gaussian mixture kernel , with . The performance of the knockoff machines is analyzed in a variety of experiments for different choices of the data distribution , the results of which are separately presented in the subsections below. In each case the machines are trained on synthetic data sets containing realizations of , with . Stochastic gradient descent is applied with mini-batches of size and learning rate , for a total number of gradients steps . A few different values of the hyperparameters in Algorithm 1 are considered, in the proximity of . The performance of the deep knockoff machine is typically not very sensitive to this choice, although we will discuss how different ratios work better with certain distributions. Upon completion of training, the goodness-of-fit of the machines is quantified in terms of the metrics defined in Section 5.1, namely the matching of second moments (14), the maximum mean discrepancy score (15), the -nearest neighbors test (16) with and the energy test (18). These measures are evaluated on knockoff copies generated for 1000 previously unseen independent samples drawn from the same distribution . The diagnostics obtained with deep knockoff machines are compared against those corresponding to other existing algorithms. A natural benchmark in all scenarios exposed below is the original second-order method in [1], which is applied by relying on the empirical covariance matrix computed on the same data used to train the deep machine. Moreover, we also consider exact knockoff constructions with perfect oracle knowledge of as ideal competitors. Finally, numerical experiments are carried out by performing variable selection in a controlled setting, where the response is simulated from a known conditional likelihood. For each sample , the response variable is sampled according to , with containing randomly chosen non-zero elements equal to . The experiments are repeated 1000 times, for different values of the signal amplitude and the number of observations . The importance measures are defined by fitting the elastic-net [56] on the augmented data matrix and . More precisely, we compute as

 (^β,~β) (19)

with the value of tuned by 10-fold cross validation and some fixed . The knockoff filter is applied on the statistics , for all , at the nominal level . The power and the false discovery rate corresponding to knockoffs generated by different algorithms can be evaluated and contrasted, as a consequence of the exact knowledge of the ground truth. It is important to stress that these experiments and all the diagnostics described above only rely on new observations from , generated independently of those on which the machine is trained.

### 6.2 Multivariate Gaussian

The first example that we present concerns the multivariate Gaussian distribution, for which the exact construction of knockoffs in [1] provides the ideal benchmark. For simplicity we consider to be an autoregressive process of order one, with correlation parameter , such that and . A deep knockoff machine is trained with hyperparameters and the value of its loss (10) is plotted in Figure 4 as a function of the training time.

The controlled numerical experiments are carried out on synthetic datasets containing samples, and setting in (19). The results corresponding to the deep machine are shown in Figure 5 as a function of the signal amplitude. The performance is compared to that of the second-order method in [1] and an oracle that constructs exact knockoffs by applying the formula in (5) with the true covariance matrix . The value of in (5) is determined by solving the semi-definite program [1] from Section 4.5. The goodness-of-fit of these three alternative knockoff constructions is further investigated in terms of the diagnostics defined earlier, as shown in Figure 6. Unsurprisingly, the knockoffs generated by the oracle are perfectly exchangeable, while the deep machine and the second-order knockoffs are almost equivalent. Finally, Figure 7 suggests that the oracle has the potential to be slightly more powerful, as it can generate knockoffs with smaller pairwise correlations with their original counterparts.

The goodness-of-fit of the knockoff machine can also be measured against that of a misguided oracle that believes to be an autoregressive process of order one with correlation parameter equal to . The thus generated are clearly not valid knockoffs unless . This comparison may be helpful because the limitations of the imperfect oracle are simpler to understand. For example, as approaches zero, becomes independent of and the violation of (1) should be easily detectable by our tests. In the interest of space, we only compute the second-order diagnostics in (14) as a function of , and compare them to those in Figure 6. The results are shown in Figure 8. The misspecified oracle leads to a significantly poorer fit than the alternative methods, unless is very close to . This indicates that the second-order approximation and the deep machine are capturing the true rather accurately, despite having very little prior information. Moreover, the experiment confirms that our diagnostics are effective at detecting invalid knockoffs.

### 6.3 Hidden Markov model

We now consider discrete random variables

, for , distributed according to the same hidden Markov model used in [3]

to describe genotypes. In order to make the experiment more realistic, the parameters of this model are estimated from real data, by applying the imputation software fastPHASE

[57] on a reference panel of 1011 individuals from the third phase of the International HapMap project, which is freely available from https://mathgen.stats.ox.ac.uk/impute/data_download_hapmap3_r2.html. For simplicity, we restrict our attention to

features, corresponding to variants on chromosome one whose physical positions range between 0.758 Mb and 2.456 Mb, and whose minor allele frequency is larger than 0.1. The expectation-maximization algorithm of fastPHASE is run for 35 iterations, with the number of hidden states chosen equal to

, and the rest of the configuration set to the default values. New observations are then sampled from the estimated , so that the exact knockoff construction for hidden Markov models can be used as the oracle benchmark. The deep knockoff machine is trained with the hyperparameters equal to . The numerical experiments follow the approach outlined in Section 6.1, using samples in each instance and setting in (19). The power and the false discovery rate are reported in Figure 9, and are very similar across the three methods. However, the oracle is slightly more conservative. The goodness-of-fit diagnostics in Figures 10 and 11 indicate that the machine is almost equivalent to the second-order approximation. It may be possible to improve the performance of this deep knockoff machine by changing its architecture to account for the discrete values of this data or by tuning it more carefully.

### 6.4 Gaussian mixture model

The next example considers a multivariate Gaussian mixture model with equal proportions. In particular, we assume that each is independently sampled from

 X∼⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩N(0,Σ1), % with probability 13,N(0,Σ2), with probability 13,N(0,Σ3), with probability 13,

where the covariance matrices , and have the same autoregressive structure as in Section 6.2, with , and , respectively. Exact knockoffs can be constructed by applying the procedure described in [9] to the true model parameters. This oracle performs two simple steps. First, a latent mixture allocation