1 Introduction
Conditional independence (CI) is a fundamental concept in statistics and machine learning. Testing conditional independence is a key building block and plays a central role in a wide variety of statistical learning problems, for instance, causal inference (Pearl, 2009), graphical models (Koller and Friedman, 2009), dimension reduction (Li, 2018)
, among others. In this article, we aim at testing whether two random variables
and are conditionally independent given a set of confounding variables . That is, we test the hypotheses:(1) 
given the observed data of i.i.d. copies of . For our problem, and can all be multivariate. However, the main challenge arises when the confounding set of variables is highdimensional. As such, we primarily focus on the scenario with a univariate and , and a multivariate . Meanwhile, our proposed method is applicable to the multivariate and scenario as well. Another challenge is the limited sample size compared to the dimensionality of . As a result, many existing tests are ineffective, with either an inflated typeI error, or not having enough power to detect the alternatives. See Section 2 for a detailed review.
We propose a double generative adversarial networks (GANs Goodfellow et al., 2014)based inference procedure for the CI testing problem (1). Our proposal involves two key components, a double GANs framework to learn two generators that approximate the conditional distribution of given and given , and a maximum of generalized covariance measures of multiple combinations of the transformation functions of and . We first establish that our test statistic is doublyrobust, which offers additional protections against potential misspecification of the conditional distributions (see Theorems 1 and 2). Second, we show the resulting test achieves a valid control of the typeI error asymptotically, and more importantly, under the conditions that are much weaker and practically more feasible (see Theorem 3). Finally, we prove the power of our test approaches one asymptotically (see Theorem 4), and demonstrate it is more powerful than the competing tests empirically.
2 Related works
There has been a growing literature on conditional independence testing in recent years; see (Li and Fan, 2019) for a review. Broadly speaking, the existing testing methods can be cast into four main categories, the metricbased tests, e.g., (Su and White, 2007, 2014; Wang et al., 2015), the conditional randomizationbased tests (Candes et al., 2018; Bellot and van der Schaar, 2019), the kernelbased tests (Fukumizu et al., 2008; Zhang et al., 2011), and the regressionbased tests (Hoyer et al., 2009; Zhang et al., 2018; Shah and Peters, 2018). There are other types of tests, e.g., (Bergsma, 2004; Doran et al., 2014; Sen et al., 2017; Berrett et al., 2019), to mention a few.
The metricbased tests typically employ some kernel smoothers to estimate the conditional characteristic function or the distribution function of
given and. Kernel smoothers, however, are known to suffer from the curse of dimensionality, and as such, these tests are not suitable when the dimension of
is high. The conditional randomizationbased tests require the knowledge of the conditional distribution of (Candes et al., 2018). If unknown, the typeI error rates of these tests rely critically on the quality of the approximation of this conditional distribution. Kernelbased test is built upon the notion of maximum mean discrepancy (MMD, Gretton et al., 2012), and could have inflated typeI errors. The regressionbased tests have valid typeI error control, but may suffer from inadequate power. Next, we discuss in detail the conditional randomizationbased tests, in particular, the work of Bellot and van der Schaar (2019), the regressionbased and the MMDbased tests, since our proposal is closely related to them.2.1 Conditional randomizationbased tests
The family of conditional randomizationbased tests is built upon the following basis. If the conditional distribution of given is known, then one can independently draw for , and these samples are independent of the observed samples ’s and ’s. Write , , , and . Here we use boldface letters to denote data matrices that consist of
samples. The joint distributions of
and are the same under . Any large difference between the two distributions can be interpreted as the evidence against . Therefore, one can repeat the process times, and generate , , . Write . Then, for any given test statistic , its associated value is , where is the indicator function. Since the triplets are exchangeable under , the value is valid, and it satisfies that for any .In practice, however, is rarely known, and Bellot and van der Schaar (2019) proposed to approximate it using GANs. Specifically, they learned a generator from the observed data, then took and a noise variable as input to obtain a sample , which minimizes the divergence between the distributions of and . The value is then computed by replacing by . They called this test GCIT, short for generative conditional independence test. By Theorem 1 of Bellot and van der Schaar (2019), the excess typeI error of this test is upper bounded by
(2) 
where
is the total variation norm between two probability distributions, the supremum is taken over all measurable sets, and the expectations in (
2) are taken with respect to .By definition, the quantity on the righthandside of (2) measures the quality of the conditional distribution approximation. Bellot and van der Schaar (2019) argued that this error term is negligible due to the capacity of deep neural nets in estimating conditional distributions. To the contrary, we find this approximation error is usually not negligible, and consequently, it may inflate the typeI error and potentially invalidate the test. We next consider a simple example to further elaborate this.
Example 1. Suppose
is onedimensional, and follows a simple linear regression model,
, where the error is independent of and for some .Suppose we know a priori that the linear regression model holds. We thus estimate
by ordinary least squares, and denote the resulting estimator by
. For simplicity, suppose is known too. For this simple example, we have the following result regarding the approximation error term . We use to denote a quantity that converges to zero as the sample size diverges to infinity.Proposition 1
Suppose the linear regression model holds. The derived distribution is , where is the identity matrix. Then is not .
To facilitate the understanding of the convergence behavior of , we sketch a few lines of an outline of the proof of Proposition 1. A detailed proof is given in the appendix. Let denote the conditional distribution of given , which is in this example. If , then,
(3) 
In other words, the validity of GCIT requires the root mean squared total variation distance in (3) to converge at a faster rate than . However, this rate cannot be achieved in general. In our simple Example 1, we have for some universal constant . Consequently, in (2) is not . Proposition 1 shows that, even if we know a priori that the linear model holds, is not to decay to zero as grows to infinity. In practice, we do not have such prior model information. Then it would be even more difficult to estimate the conditional distribution . Therefore, using GANs to approximate does not guarantee a negligible approximation error, nor the validity of the test.
2.2 Regressionbased tests
The family of regressionbased tests is built upon a key quantity, the generalized covariance measure,
where and are the predicted condition mean and , respectively, by any supervised learner. When the prediction errors of and satisfy certain convergence rates, Shah and Peters (2018) proved that GCM is asymptotically normal. Under
, the asymptotic mean of GCM is zero, and its asymptotic standard deviation can be consistently estimated by some standard error estimator, denoted by
. Therefore, for a given significance level , we reject , if exceeds the upperth quantile of a standard normal distribution.
Such a test is valid. However, it may not have sufficient power to detect . This is because the asymptotic mean of GCM equals . The regressionbased tests require to be nonzero under to have power. However, there is no guarantee of this requirement. We again consider a simple example to elaborate.
Example 2. Suppose , and are independent random variables. Besides, has mean zero, and for some function .
For this example, we have , since both and are independent of , and so is . Besides, , since is independent of and . As such, for any function . On the other hand, and are conditionally dependent given , as long as is not a constant function. Therefore, for this example, the regressionbased tests would fail to discriminate between and .
2.3 MMDbased tests
The family of kernelbased tests often involves the notion of maximum mean discrepancy as a measure of independence. For any two probability measures , and a function space , define
Let , be function spaces of square integrable functions, i.e., for . Define , where
is the tensor product,
is the joint distribution of , and is the conditionally independent distribution with the same and margins as , Then after calculations given in Appendix LABEL:sec:moremmd, we have,(4) 
We see that measures the average conditional association between and given . Under , it equals zero, and hence an estimator of this measure can be used as a test statistic for .
3 A new double GANsbased testing procedure
We propose a double GANsbased testing procedure for the conditional independence testing problem (1). Conceptually, our test integrates GCIT, regressionbased and MMDbased tests. Meanwhile, our new test addresses the limitations of the existing ones. Unlike GCIT that only learned the conditional distribution of , we learn two generators and to approximate the conditional distributions of both and . We then integrate the two generators in an appropriate way to construct a doublyrobust test statistic, and we only require the root mean squared total variation norm to converge at a rate of for some . Such a requirement is much weaker and practically more feasible than the condition in (3).
Moreover, to improve the power of the test, we consider a set of the GCMs, , for multiple combinations of transformation functions and . We then take the maximum of all these GCMs as our test statistic. This essentially yields , which is connected with the notion of MMD. To see why the maximumtype statistic can enhance the power, we quickly revisit Example 2. When is not a constant function, there exists some nonlinear function such that is not a constant function of . Set . We have . This enables us to discriminate the null from the alternative hypothesis.
We next detail our test. An overview of our testing procedure is depicted in Figure 1.
3.1 Test statistic
We begin with two function spaces, and , indexed by some parameters and , respectively. We then randomly generate functions, , , where we independently generate i.i.d. multivariate normal variables , and . We then set , and , . Consider the following maximumtype test statistic,
(5) 
where . To compute (5), however, we need to estimate the conditional means , for
. Separately applying supervised learning algorithms
times to compute these means is computationally very expensive for a large value of . Instead, we propose to implement this step based on the generators and estimated using GANs, which is computationally much more efficient.Specifically, for , we randomly generate i.i.d. random noises , and output the pseudo samples , , for , to approximate the conditional distributions of and given . We then compute , and , for . Plugging those estimated means into (5) produces our test statistic, , where
To help reduce the typeI error of our test, we further employ a data splitting and crossfitting strategy, which is commonly used in statistical testing (Romano and DiCiccio, 2019). That is, we use different subsets of data samples to learn GANs and to construct the test statistic. We summarize our procedure of computing the test statistic in Algorithm 1.
3.2 Bootstrapping the value
Next, we propose a multiplier bootstrap method to approximate the distribution of under to compute the corresponding value. The key observation is that is asymptotically normal with zero mean under ; see the proof of Theorem 3 in the appendix for details. As such, is to converge to a maximum of normal variables in absolute values.
To approximate this limiting distribution, we first estimate the covariance matrix of a
dimensional vector formed by
using the sample covariance matrix . We then generate i.i.d. random vectors with the covariance matrix equal to , and compute the maximum elements of each of these vectors in absolute values. Finally, we use these maximum absolute values to approximate the distribution of under the null. We summarize this procedure in Algorithm 2.3.3 Approximating conditional distribution via GANs
We adopt the proposal in Genevay et al. (2017) to learn the conditional distributions and . Recall that is the distribution of pseudo outcome generated by the generator given . We consider estimating by optimizing , where
denotes the Sinkhorn loss function between two probability measures with respect to some cost function
and some regularization parameter . A detailed definition of is given in the appendix. Intuitively, the closer the two probability measures, the smaller the Sinkhorn loss. As such, maximizing the loss with respect to the cost function learns a discriminator that can better discriminate the samples generated between and . On the other hand, minimizing the maximum cost with respect to the generator makes it closer to the true distribution . This yields the minimax formulation that we target.In practice, we approximate the cost and the generator based on neural networks. Integrations in the objective function
are approximated by sample averages. A pseudocode detailing our learning procedure is given in the appendix. The conditional distribution is estimated similarly.4 Asymptotic theory
To derive the theoretical properties of the test statistic , we first introduce a concept of the “oracle" test statistic . If and were known a priori, then one can draw and from and directly, and can compute the test statistic by replacing and with and . We call the resulting an “oracle" test statistic.
We next establish the doublerobustness property of , which helps us better understand why our proposed test can relax the requirement in (3). Informally speaking, the doublerobustness means that is asymptotically equivalent to when either the conditional distribution of , or that of is well approximated by GANs.
Theorem 1 (Doublerobustness)
Suppose , are bounded function classes, is proportional to , and for some constant . Suppose for some constant . Then , when either , or .
A consequence of the doublyrobustness is that, when both total variation distances converge to zero, the test statistic converges at a faster rate than those total variation distances. Therefore, we can greatly relax the condition in (3), and replace it with, for any ,
(6) 
for some constant , where and denote the conditional distributions approximated via GANs trained on the th subset. The next theorem summarizes this discussion.
Since , the convergence rate of is faster than that in (6). To ensure , it suffices to require . In contrast to (3), this rate is achievable. We consider two examples to illustrate, while the condition holds in a wide range of settings.
Example 3 (Parametric setting). Suppose the parametric forms of and are correctly specified. Then the requirement holds if for some , where
is the dimension of the parameters defining the parametric model.
Example 4. (Nonparametric setting with binary data). Suppose ,
are binary variables. Then it suffices to estimate the conditional means of
and given . The requirement holds if the mean squared prediction errors of both nonparametric estimators are for some .More detailed discussion of the above two examples can be found in Section 5.1 of Berrett et al. (2019). Next, we establish the size (typeI error) and the power properties of our proposed test.
Theorem 3 (TypeI error)
Theorem 4 (Power)
Theorem 3 essentially shows that our proposed test can control the typeI error, whereas Theorem 4 shows that the power of our test approaches one as the sample size approaches infinity. In Theorem 4, we require
to diverge to infinity, at an arbitrary polynomial order with the sample size, to ensure the typeII error decays to zero with the sample size. Note that such a condition is not needed to establish its size property in Theorem
3. The dimensions of parameters in the function spaces and , i.e., and are fixed in Theorem 4 to simplify the analysis. However, we allow the dimension of to diverge with in Theorems 3 and 4. Whereas there is no explicit requirement on , implicitly, this requirement is embedded in (6), as the rate in (6) is expected to decay with .5 Numerical studies
The time complexity of our testing procedure is dominated by Step 2 of Algorithm 1, where we use GANs to estimate the conditional distributions and . The complexity of each SGD iteration is ; see the appendix. All experiments were run on 16 N1 CPUs on Google Cloud Computing platform. The wall clock time for computing a single test statistic was about 3 minutes.
5.1 Synthetic data example
We generate synthetic data following the post nonlinear noise model similarly as in Zhang et al. (2011); Doran et al. (2014); Bellot and van der Schaar (2019),
The entries of are randomly and uniformly sampled from , then normalized to unit norm. The noise variables
are independently sampled from a normal distribution with mean zero and variance
. In this model, the parameter determines the degree of conditional dependence. When , holds, and otherwise holds. The sample size is fixed at .We call our proposed test DGCIT, short for double GANsbased conditional independence test. We compare it with the GCIT test of Bellot and van der Schaar (2019), the regressionbased test (RCIT) of Shah and Peters (2018) and the kernel MMDbased test (KCIT) of Zhang et al. (2011).
TypeI error under . We vary the dimension of as , and consider two generation distributions. We first generate from a standard normal distribution, then from a Laplace distribution. We set the significance level at and . Figure 2 top panels report the empirical size of the tests aggregated over 500 data replications. We make the following observations. First, the typeI error rates of our test and RCIT are close to or below the nominal level in nearly all cases. Second, KCIT fails in that its typeI error is considerably larger than the nominal level in all cases. Third, GCIT has an inflated typeI error in some cases. For instance, when is normal, and , its empirical size is close to . This is consistent with our discussion in Section 2.1, as it requires a very strong condition to control the typeI error.
Powers under . We generate from a standard normal distribution, with , and vary the value of that controls the magnitude of the alternative. Figure 2 bottom panels report the empirical power of the tests aggregated over 500 data replications. We observe that our test is the most powerful, and the empirical power approaches 1 as increases to , demonstrating the consistency of the test. Meanwhile, both GCIT and RCIT have no power in all cases. We did not report the power of KCIT, because as we show earlier, it can not control the size, and thus its empirical power is meaningless.
BRAF.V600E  BRAF.MC  HIP1  FTL3  CDC42BPA  THBS3  DNMT1  PRKD1  PIP5K1A  MAP3K5  
EN  1  3  4  5  7  8  9  10  19  78 
RF  1  2  3  14  8  34  28  18  7  9 
GCIT  <0.001  <0.001  0.008  0.521  0.050  0.013  0.020  0.002  0.001  <0.001 
DGCIT  0  0  0  0  0  0  0  0  0  0.794 
The variable importance measures of the elastic net and random forest models, versus the
values of the GCIT and DGCIT tests for the anticancer drug example.5.2 Anticancer drug data example
We illustrate our proposed test with an anticancer drug dataset from the Cancer Cell Line Encyclopedia (Barretina et al., 2012). We concentrate on a subset, the CCLE data, that measures the treatment response of drug PLX4720. It is well known that the patient’s cancer treatment response to drug can be strongly influenced by alterations in the genome (Garnett et al., 2012) This data measures 1638 genetic mutations of cell lines, and the goal of our analysis is to determine which genetic mutation is significantly correlated with the drug response after conditioning on all other mutations. The same data was also analyzed in Tansey et al. (2018) and Bellot and van der Schaar (2019). We adopt the same screening procedure as theirs to screen out irrelevant mutations, which leaves a total of 466 potential mutations for our conditional independence testing.
The ground truth information is unknown for this data. Instead, we compare with the variable importance measures obtained from fitting an elastic net (EN) model and a random forest (RF) model as reported in Barretina et al. (2012). In addition, we compare with the GCIT test of Bellot and van der Schaar (2019). Table 1 reports the corresponding variable importance measures and the values, for 10 mutations that were also reported by Bellot and van der Schaar (2019). We see that, the values of the tests generally agree well with the variable important measures from the EN and RF models. Meanwhile, the two conditional independence tests agree relatively well, except for two genetic mutations, MAP3K5 and FTL3. GCIT concluded that MAP3K5 is significant () but FTL3 is not (), whereas our test leads to the opposite conclusion that MAP3K5 is insignificant () but FTL3 is (). Besides, both EN and RF place FTL3 as an important mutation. We then compare our findings with the cancer drug response literature. Actually, MAP3K5 has not been previously reported in the literature as being directly linked to the PLX4720 drug response. Meanwhile, there is strong evidence showing the connections of the FLT3 mutation with cancer response (Tsai et al., 2008; LarrosaGarcia and Baer, 2017). Combining the existing literature with our theoretical and synthetic results, we have more confidence about the findings of our proposed test.
6 Discussion
Our test statistic is constructed based on in (4). Meanwhile, we may consider another test based on , where is the joint distribution of , , and is the class of square integrable functions of . This type of test may be more powerful for certain alternative hypotheses, and we leave it as our future research.
References
 The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483 (7391), pp. 603–607. Cited by: §5.2, §5.2.
 Conditional independence testing using generative adversarial networks. In Advances in Neural Information Processing Systems, pp. 2199–2208. Cited by: §2.1, §2.1, §2, §2, §5.1, §5.1, §5.2, §5.2.

Testing conditional independence for continuous random variables
. Eurandom. Cited by: §2.  The conditional permutation test for independence while controlling for confounders. Journal of the Royal Statistical Society: Series B (Statistical Methodology) accepted. Cited by: §2, §4.
 Panning for gold:‘modelx’knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80 (3), pp. 551–577. Cited by: §2, §2.
 A permutationbased kernel conditional independence test.. In UAI, pp. 132–141. Cited by: §2, §5.1.
 Kernel measures of conditional dependence. In Advances in neural information processing systems, pp. 489–496. Cited by: §2.
 Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature 483, pp. 570–5. External Links: Document Cited by: §5.2.
 Learning generative models with sinkhorn divergences. arXiv preprint arXiv:1706.00292. Cited by: §3.3.
 Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
 A kernel twosample test. Journal of Machine Learning Research 13 (Mar), pp. 723–773. Cited by: §2.
 Nonlinear causal discovery with additive noise models. In Advances in neural information processing systems, pp. 689–696. Cited by: §2.
 Probabilistic graphical models: principles and techniques. Adaptive computation and machine learning, MIT Press. Cited by: §1.
 FLT3 inhibitors in acute myeloid leukemia: current status and future directions. Molecular Cancer Therapeutics 16 (6), pp. 991–1001. Cited by: §5.2.
 Sufficient dimension reduction: methods and applications with r. CRC Press. Cited by: §1.
 On nonparametric conditional independence tests for continuous variables. Wiley Interdisciplinary Reviews: Computational Statistics, pp. e1489. Cited by: §2.
 Causality: models, reasoning and inference. Cambridge: Cambridge University Press, 2nd Edition. Cited by: §1.
 Multiple data splitting for testing. Technical report Technical report. Cited by: §3.1.
 Modelpowered conditional independence test. In Advances in neural information processing systems, pp. 2951–2961. Cited by: §2.
 The hardness of conditional independence testing and the generalised covariance measure. arXiv preprint arXiv:1804.07203. Cited by: §2.2, §2, §5.1.
 A consistent characteristic functionbased test for conditional independence. Journal of Econometrics 141 (2), pp. 807–834. Cited by: §2.
 Testing conditional independence via empirical likelihood. Journal of Econometrics 182 (1), pp. 27–44. Cited by: §2.

The holdout randomization test: principled and easy black box feature selection
. arXiv preprint arXiv:1811.00645. Cited by: §5.2.  Discovery of a selective inhibitor of oncogenic braf kinase with potent antimelanoma activity. Proceedings of the National Academy of Sciences 105 (8), pp. 3041–3046. External Links: Document Cited by: §5.2.
 Conditional distance correlation. Journal of the American Statistical Association 110 (512), pp. 1726–1734. Cited by: §2.

Measuring conditional independence by independent residuals: theoretical results and application in causal discovery.
In
ThirtySecond AAAI Conference on Artificial Intelligence
, Cited by: §2.  Kernelbased conditional independence test and application in causal discovery. In 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), pp. 804–813. Cited by: §2, §5.1, §5.1.
Comments
There are no comments yet.