Conditional independence testing lies at the heart of causal discovery (Spirtes et al., 2000)
and at the same time is one of its most challenging tasks. For observed random variables, measuring that and are independent given , denoted as , implies that no causal link can exist between and under the relatively weak assumption of faithfulness (Spirtes et al., 2000). A finding of conditional independence is then more pertinent to causal discovery than a finding of (conditional) dependence from which a causal link only follows under stronger assumptions (Spirtes et al., 2000).
Here we focus on the difficult case of continuous variables (Bergsma, 2004). While various conditional independence (CI) tests exist if assumptions such as linearity or additivity (Daudin, 1980; Peters and Schölkopf, 2014) are justified (for a numerical comparison see Ramsey (2014)), here we focus on the general definition of CI implying that the conditional joint density factorizes:
Recent research has focused on the general case without assuming a functional form of the dependencies as well as the data distributions. One approach is to discretize the variable and make use of easier unconditional independence tests (Margaritis, 2005; Huang, 2010)
. However, this method suffers from the curse of dimensionality for high-dimensional conditioning sets.
On the other hand, kernel-based methods are known for their capability to deal with nonlinearity and high dimensions (Fukumizu et al., 2008). A popular test is the Kernel Conditional Independence Test (KCIT) (Zhang et al., 2012) which essentially tests for zero Hilbert-Schmidt norm of the partial cross-covariance operator, or the Permutation CI test (Doran et al., 2014) which solves an optimization problem to generate a permutation surrogate on which kernel-two sample testing can be applied. Kernel methods suffer from high computational complexity since large kernel matrices have to be computed. Strobl et al. (2017) present two types of orders of magnitude faster CI tests based on approximating kernel methods using random Fourier features, called Randomized Conditional Correlation Test (RCoT) and Randomized Conditional Independence Test (RCIT). RCoT can be related to kernelized two-step conditional independence testing (Zhang et al., 2017). Last, Wang et al. (2015) proposed a conditional distance correlation (CDC) test based on the correlation of distance matrices between which have been linked to kernel-based approaches (Sejdinovic et al., 2013).
Kernel methods in general require carefully adjusted bandwidth parameters that characterize the length scales between samples in the different subspaces of . These bandwidths are global in each subspace in the sense that they are applied on the whole range of values for , respectively.
derived asymptotic approximations of the theoretical null distributions, but such approximations only hold for larger sample sizes. The alternative are permutation-based approaches, where the null-distribution is generated by computing the test statistic from a permuted sample.
Our approach to testing CI is founded in an information-theoretic framework. The conditional mutual information (CMI) is zero if and only if . While some kernel-based measures can also be related to information-theoretic quantities (see, e.g., Fukumizu et al. (2008)), our approach is to directly estimate CMI by combining the well-established Kozachenko-Leonenko -nearest neighbor estimator (Kozachenko and Leonenko, 1987; Kraskov et al., 2004; Frenzel and Pompe, 2007; Vejmelka and Paluš, 2008; Póczos and Schneider, 2012; Gao et al., 2017) with a nearest-neighbor local permutation scheme. Their main advantage is that nearest-neighbor statistics are locally adaptive: The hypercubes around each sample point are smaller where more samples are available. Unfortunately, few theoretical results are available for the complex mutual information estimator. While the Kozachenko-Leonenko estimator is asymptotically unbiased and consistent (Kozachenko and Leonenko, 1987; Leonenko et al., 2008)
, the variance and finite sample convergence rates are unknown. Hence, our approach relies on a local permutation test that is also based on nearest neighbors and, hence, data-adaptive.
Our numerical experiments comparing the CMI test with KCIT, RCIT, RCoT, and CDC show that the test reliably simulates the null distribution even for small sample sizes and with high dimensional conditioning sets. The local permutation scheme yields better calibrated tests for sample sizes below 1000 than kernel-based tests relying on asymptotic approximations such as KCIT, RCIT or RCoT. We also tested RCIT and RCoT combined with our local permutation scheme which yields better calibration for smaller sample sizes. The CMI test reaches the same or higher power levels than the other compared approaches, especially for highly nonlinear dependencies. The computational time of both the CMI test and the kernel tests strongly depends on hyperparameters. We found that for smaller samples sizes CMI is faster than RCIT or RCoT by making use of KD-tree neighbor search methods(Maneewongvatana and Mount, 1999), but a major drawback is its computationally expensive permutation scheme making more theoretical research to analytically approximate the null distribution for larger sample sizes desirable. Code for the CMI test is freely available at https://github.com/jakobrunge/tigramite.
2 Conditional independence test
2.1 Conditional mutual information
CMI for continuous and possibly multivariate random variables is defined as
where denotes the Shannon entropy and where we have to assume that the densities exist. We wish to test the conditional independence hypothesis
versus the general alternative. From the definition of CMI it is immediately clear that if and only if , provided that the densities are well-defined. Shannon-type conditional mutual information is theoretically well-founded and its value is well interpretable as the shared information between and not contained in . While this does not immediately matter for a conditional independence test’s -value, causal discovery algorithms often make use of the test statistic’s value, for example to sort the order in which conditions are tested. CMI here readily allows for an interpretation in terms of the relative importance of one condition over another. Note that the test statistic values of kernel-based tests typically depend on the chosen kernel.
2.2 Nearest-neighbor CMI estimator
Inspired by Dobrushin (1958), Kozachenko and Leonenko (1987) introduced a class of differential entropy estimators that can be generalized to estimators of conditional mutual information. This class is based on nearest neighbor statistics as further discussed in Kozachenko and Leonenko (1987); Frenzel and Pompe (2007). For a -dimensional random variable the nearest neighbor entropy estimate is defined as
with the Digamma function as the logarithmic derivative of the Gamma function , sample length , volume element depending on the chosen metric, i.e., for the maximum metric, for Euclidean metric with Gamma function . For every sample with index , the integer is the number of points in the -dimensional ball with radius . Formula (4) holds for any and the corresponding , which will be used in the following definition of a CMI estimator. Based on this entropy estimator, Kraskov et al. (2004) derived an estimator for mutual information where the -balls with radius are hypercubes. This estimator was generalized to an estimator for CMI first by Frenzel and Pompe (2007) and independently by Vejmelka and Paluš (2008). The CMI estimator is obtained by inserting the entropy estimator Eq. (4) for the different entropies in the definition of CMI in Eq. (2). For all entropy terms in Eq. (2), we use the maximum norm and choose as the side length of the hypercube the distance to the -nearest neighbor in the joint space . The CMI estimate then is
The only free parameter is the number of nearest neighbors in the joint space of and , and are computed as follows for every sample point indexed by :
Determine (here in maximum norm) the distance to its -th nearest neighbor (excluding the reference point which is not a neighbor of itself) in the joint space .
Count the number of points with distance strictly smaller than (including the reference point at ) in the subspace to get , in the subspace to get , and in the subspace to get .
Similar estimators, but for the more general class of Rényi entropies and divergences, were developed in Wang et al. (2009); Póczos and Schneider (2012). Estimator (2.2) uses the approximation that the densities are constant within the epsilon environment. Therefore, the estimator’s bias will grow with since larger lead to larger -balls where the assumption of constant density is more likely violated. The variance, on the other hand, is the more important quantity in conditional independence testing and it becomes smaller for larger because fluctuations in the -balls average out. The decisive advantage of this estimator compared to fixed bandwidth approaches is its data-adaptiveness (Fig. 1B).
The Kozachenko-Leonenko estimator is asymptotically unbiased and consistent (Kozachenko and Leonenko, 1987; Leonenko et al., 2008). Unfortunately, at present there are no results, neither exact nor asymptotically, on the distribution of the estimator as needed to derive analytical significance bounds. In Goria and Leonenko (2005), some numerical experiments indicate that for many distributions of the asymptotic distribution of MI is Gaussian. But the important finite size dependence on the dimensions , the sample length and the parameter are unknown.
Some notes on the implementation: Before estimating CMI, we rank-transform the samples individually in each dimension: Firstly, to avoid points with equal distance, small amplitude random noise is added to break ties. Then, for all values , we replace with the transformed value , where is defined such that is the th largest among all values. The main computational cost comes from searching nearest neighbors in the high dimensional subspaces which we speed up using KD-tree neighbor search (Maneewongvatana and Mount, 1999). Hence, the computational complexity will typically scale less than quadratically with the sample size. Kernel methods, on the other hand, typically scale worse than quadratically in sample size if they are not based on Kernel approximations such as via random Fourier features (Strobl et al., 2017). Further, the CMI estimator scales roughly linearly in and , the total dimension of .
2.3 Nearest-neighbor permutation test
Since no theory on finite sample behavior of the CMI estimator is available, we resort to a permutation-based generation of the distribution under .
Typically in CMI-based independence testing, CMI-surrogates to simulate independence are generated by randomly permuting all values in . The problem is, that this approach not only destroys the dependence between and , as desired, but also destroys all dependence between and . Hence, this approach does not actually test . In order to preserve the dependence between and , we propose a local permutation test utilizing nearest-neighbor search. To avoid confusion, we denote the CMI-estimation parameter as and the permutation-parameter as .
As illustrated in Fig. 1, we first identify the -nearest neighbors around each sample point (here including the point itself) in the subspace of using the maximum norm. With Algorithm 1 we generate a permutation mapping which tries to achieve draws without replacement. Since this is not always possible, some values might occur more than once, i.e., they were drawn with replacement as in a bootstrap. In Paparoditis and Politis (2000) a bootstrap scheme that always draws with replacement is described which is used for the CDC independence test. The difference to our scheme is that we try to avoid tied samples as much as possible to preserve the conditional marginals.
The permutation test is then as follows:
Generate a random permutation with Algorithm 1
Compute CMI via Eq. (2.2)
Repeat steps (1) and (2) times, sort the estimates from the null and obtain -value by
where denotes the indicator function and is the CMI estimate of the original data.
The CMI estimator holds for arbitrary dimensions of all arguments and also the local permutation scheme can be used to jointly permute all of ’s dimensions. In the following numerical experiments, we focus on the case of univariate and and uni- or multivariate .
3.1 Choosing and
Our approach has two free parameters and . The following numerical experiments indicate that restricting to only very few nearest neighbors already suffices to reliably simulate the null distribution in most cases while for we derive a rule-of-thumb based on the sample size .
Figure 2 illustrates the effect of different . If is too large or even , i.e., a full non-local permutation, the permuted distribution under independence (red) is negatively biased. As illustrated by the red markers, this would lead to an increase of false positives (type-I error). On the other hand, for the dependent case, if , the permuted distribution is positively biased yielding lower power (type-II errors). For a range of optimal values of , the permuted distribution perfectly simulates the true null distribution.
have jointly independent standard Gaussian distributions, anddenote smooth functions uniformly chosen from . Thus, we have in any case. To simulate dependent and , we used , for and identical Gaussian noise and keep independent of and .
In Fig. 3, we show results for sample size . The null distribution was generated with surrogates in all experiments. The results demonstrate that a value yields well-calibrated tests while not affecting power much. This holds for a wide range of sample sizes as shown in Fig. 12.
Larger yield more power and even for the tests are still well calibrated. But power peaks at some value of and slowly decreases for too large values. Still, the dependency of power on is relatively robust and we suggest a rule-of-thumb of . Note that, as shown in Fig. 4, runtime increases linearly with while does not impact runtime much. Here we depict the runtime per CMI estimate assuming that the permutation scheme is (embarrassingly) parallelized.
3.2 Comparison with kernel measures
In Fig. 5 we show results comparing our CMI test (CMIknn) to KCIT and the two random-fourier-based approximations RCIT and RCoT introduced in Strobl et al. (2017). As a metric for type-I errors, as in Strobl et al. (2017) we evaluated the Kolmogorov-Smirnov (KS) statistic to quantify how uniform the distribution of -values is. For type-II errors we measure the area under the power curve (AUPC). All metrics were evaluated from
realizations and error bars give the boostrapped standard errors. We show results for CMIknn forwith permutation surrogates and using the rule-of-thumb as well as a fixed .
Figure 5 demonstrates that CMIknn is better calibrated with the lowest KS-values for almost all sample sizes tested. KCIT and RCIT are especially badly calibrated for smaller sample sizes or higher dimensions and RCoT better approximates the null distribution only for for and for for . Note that this is also expected (Strobl et al., 2017) since the analytical approximation of the null distribution for RCIT and RCoT requires large sample sizes. The power as measured by AUPC is, thus only comparable for for and CMIknn has the highest power throughout if is scaled with the sample size. Also for fixed the power of CMIknn is competitive. Also for and CMIknn has slightly higher power than RCoT and RCIT.
If the computationally expensive permutation scheme of CMIknn is (embarrassingly) parallelized, the CMIknn test is faster than RCIT or RCoT for not too large sample sizes due to efficient KD-tree nearest-neighbor search procedures (Maneewongvatana and Mount, 1999), especially for smaller
. KCI is not shown here because it is orders of magnitude slower. A major computational burden of RCIT and RCoT is the kernel bandwith computation via the median Euclidean distance heuristic. Inhttps://github.com/ericstrobl/RCIT the median is computed from the first samples only, leading to the “kink” in the runtime for RCIT and RCoT. The runtime of RCIT and RCoT depends quadratically on the number of random Fourier features used (here the default of for subspace and for subspaces and was used), for more results see Fig. 8. CMIknn’s runtime increases more sharply with sample size.
Our results indicate that the analytical approximations of the null distribution utilized in RCIT and RCoT do not work well for small sample sizes below . In Fig. 6 we explore the option to combine the kernel statistics with our nearest-neighbor permutation test. While then RCIT and RCoT loose their computational advantage, the tests are now well-calibrated. Their power is still mostly lower than that of CMIknn, especially for RCIT.
In Fig. 7 we explore more cardinalities of the conditioning set. KCI and RCIT are not well-calibrated for higher dimensions and also the permutation-version of RCIT quickly looses power for higher dimensions. The power of CMIknn and RCoT is rather insensitive to the dimensionality. Note, however, that in the numerical experiments of Strobl et al. (2017) the conditioning variables for evaluating power are independent of and . Other experimental setups might induce a dependence of power on . CMIknn’s runtime starts lower, but increases more sharply with than RCIT and RCoT.
Until now we considered rather smooth dependencies of and on the conditioning variables. In Figs. 9,10 we consider more nonlinear relationships. For an extremely oscillatory sinusoidal dependency like and ( added for the dependent case), shown in Fig. 9, needs to be set to a very small value in order to control false positives. Here the analytical versions of RCIT and RCoT do not work at all and the permutation-based versions have much lower power than CMIknn.
In Fig. 10 we consider a multiplicative noise case with the model , with all variables as before and another independent Gaussian noise term. For the dependent case we used , for and identical Gaussian noise and keep independent of and . Even though the density is highly localized in this case, CMIknn is still well calibrated for . On the other hand, RCoT cannot control false positives even if we vary the number of Fourier features to much higher values (which takes much longer). RCIT is slightly better suited here, but only combined with the local permutation test both become better calibrated. CMIknn has higher power than both permutation-based kernel tests in this example.
3.3 Comparison with conditional distance correlation
|Example 1||Example 2|
|Example 3||Example 4|
|Example 5||Example 6|
|Example 7||Example 8|
In Tab. 1 we repeat the results from Wang et al. (2015) proposing the CDC test together with results from RCoT and our CMI test. The experiments are described in Wang et al. (2015). Examples 1–4 correspond to conditional independence and Examples 5–8 to dependent cases. CMIknn has well-calibrated tests except for Example 4 (as well as Example 8) which is based on discrete Bernoulli random variables while the CMI test is designed for continuous variables. For Examples 5–8 CMIknn has competitive power compared to CDC and outperforms KCIT in all and RCoT in all but Example 5 where they reach the same performance. Note that the CDC test also is based on a computationally expensive local permutation scheme since the asymptotics break down for small sample sizes.
4 Real data application
We apply CMIknn in a time series version of the PC causal discovery algorithm (Runge et al., 2017) to investigate dependencies between hourly averaged concentrations for carbon monoxide (CO), benzene (C6H6), total nitrogen oxides (NOx), nitrogen dioxide (NO2), as well as temperature (T), relative humidity (RH) and absolute humidity (AH) taken from De Vito et al. (2008)111http://archive.ics.uci.edu/ml/datasets/Air+Quality. The time series were detrended using a Gaussian kernel smoother with bandwidth and we limited the analysis to the first three months of the dataset ( samples). After accounting for missing values we obtain an effective sample size of . As in our numerical experiments, we used the CMIknn parameters and with permutation surrogates. The causal discovery algorithm was run including lags from up to . The resulting graph at a 10% FDR-level shown in Fig. 11 indicates that temperature and relative humidity influence Benzene which in turn affects NO2 and CO concentrations.
We presented a novel fully non-parametric conditional independence test based on a nearest neighbor estimator of conditional mutual information. Its main advantage lies in the ability to adapt to highly localized densities due to nonlinear dependencies even in higher dimensions. This feature results in well-calibrated tests with reliable false positive rates. We tested setups for sample sizes to and dimensions of the conditional set of . The power of CMIknn is comparable or higher than advanced kernel based tests such as KCIT or its faster random Fourier feature versions RCIT and RCoT, which, however, are not well-calibrated in the smaller sample limit. Combining our local permutation scheme with kernel tests leads to better calibration, but power is still lower than CMIknn. CMIknn has a shorter runtime for not too large sample sizes since efficient nearest-neighbor search schemes can be utilized, but its runtime increases more sharply with sample size and dimensionality than the fourier-feature bases kernel tests. Here approximate nearest-neighbor techniques could speed up computations. The permutation scheme leads to a higher computational load which, however, can be easily parallelized. Nevertheless, more theoretical research is desirable to obtain approximate analytics for the null distribution in the large sample limit. For small sample sizes below we find that a permutation-approach is inevitable also for kernel-based approaches.
We thank Eric Strobl for kindly providing R-code for KCIT, RCIT and RCoT and Dino Sejdinovic for many helpful comments.
Bergsma, W. (2004).
Testing conditional independence for continuous random variables.Eurandom technical report, 48:1–19.
- Daudin (1980) Daudin, J. J. (1980). Partial association and an measures to qualitative application regression. Biometrika, 67(3):581–590.
- De Vito et al. (2008) De Vito, S., Massera, E., Piga, M., Martinotto, L., and Di Francia, G. (2008). On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sensors and Actuators, B: Chemical, 129(2):750–757.
Dobrushin, R. L. (1958).
A simplified method of experimentally evaluating the entropy of a
Theory of Probability & Its Applications, 3(4).
Doran et al. (2014)
Doran, G., Muandet, K., Zhang, K., and Schölkopf, B. (2014).
A Permutation-Based Kernel Conditional Independence Test.
Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 132–141.
- Frenzel and Pompe (2007) Frenzel, S. and Pompe, B. (2007). Partial Mutual Information for Coupling Analysis of Multivariate Time Series. Physical Review Letters, 99(20):204101.
- Fukumizu et al. (2008) Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2008). Kernel measures of conditional dependence. Advances in neural information processing systems, pages 489–496.
- Gao et al. (2017) Gao, W., Oh, S., and Viswanath, P. (2017). Demystifying fixed k-nearest neighbor information estimators. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 1267–1271. IEEE.
Goria and Leonenko (2005)
Goria, M. N. and Leonenko, N. N. (2005).
A new class of random vector entropy estimators and its applications in testing statistical hypotheses.Nonparametric Statistics, 17(3):277–297.
- Huang (2010) Huang, T. M. (2010). Testing conditional independence using maximal nonlinear conditional correlation. Annals of Statistics, 38(4):2047–2091.
- Kozachenko and Leonenko (1987) Kozachenko, L. F. and Leonenko, N. N. (1987). Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2):9–16.
- Kraskov et al. (2004) Kraskov, A., Stögbauer, H., and Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69(6):066138.
- Leonenko et al. (2008) Leonenko, N. N., Pronzato, L., and Savani, V. (2008). A class of Rényi information estimators for multidimensional densities. The Annals of Statistics, 36(5):2153–2182.
- Maneewongvatana and Mount (1999) Maneewongvatana, S. and Mount, D. (1999). It’s okay to be skinny, if your friends are fat. Center for Geometric Computing 4th Annual Workshop on Computational Geometry, pages 1–8.
Margaritis, D. (2005).
Distribution-free learning of Bayesian network structure in continuous domains.Proceedings of the National Conference on Artificial Intelligence, 20(2):825.
- Paparoditis and Politis (2000) Paparoditis, E. and Politis, D. N. (2000). The local bootstrap for kernel estimators under general dependence conditions. Annals of the Institute of Statistical Mathematics, 52(1):139–159.
Peters and Schölkopf (2014)
Peters, J. and Schölkopf, B. (2014).
Causal Discovery with Continuous Additive Noise Models.
Journal of Machine Learning Research, 15(June):2009–2053.
- Póczos and Schneider (2012) Póczos, B. and Schneider, J. (2012). Nonparametric estimation of conditional information and divergences. 15th International Conference on Artificial Intelligence and Statistics, XX:914–923.
- Ramsey (2014) Ramsey, J. D. (2014). A Scalable Conditional Independence Test for Nonlinear, Non-Gaussian Data. https://arxiv.org/abs/1401.5031.
- Runge et al. (2017) Runge, J., Sejdinovic, D., and Flaxman, S. (2017). Detecting causal associations in large nonlinear time series datasets. http://arxiv.org/abs/1702.07007.
- Sejdinovic et al. (2013) Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263–2291.
- Spirtes et al. (2000) Spirtes, P., Glymour, C., and Scheines, R. (2000). Causation, Prediction, and Search, volume 81. The MIT Press, Boston.
- Strobl et al. (2017) Strobl, E. V., Zhang, K., and Visweswaran, S. (2017). Approximate Kernel-based Conditional Independence Tests for Fast Non-Parametric Causal Discovery. http://arxiv.org/abs/1702.03877.
- Vejmelka and Paluš (2008) Vejmelka, M. and Paluš, M. (2008). Inferring the directionality of coupling with conditional mutual information. Physical Review E, 77(2):026214.
- Wang et al. (2009) Wang, Q., Kulkarni, S. R., and Verdú, S. (2009). Divergence Estimation for Multidimensional Densities Via k -Nearest-Neighbor Distances. IEEE Transactions on Information Theory, 55(5):2392–2405.
- Wang et al. (2015) Wang, X., Pan, W., Hu, W., Tian, Y., and Zhang, H. (2015). Conditional Distance Correlation. Journal of the American Statistical Association, 110(512):1726–1734.
- Zhang et al. (2012) Zhang, K., Peters, J., Janzing, D., and Schölkopf, B. (2012). Kernel-based conditional independence test and application in causal discovery. preprint arXiv:1202.3775.
- Zhang et al. (2017) Zhang, Q., Filippi, S., Flaxman, S., and Sejdinovic, D. (2017). Feature-to-Feature Regression for a Two-Step Conditional Independence Test. Proceedings of the Thirtythird Conference on Uncertainty in Artificial Intelligence.