1. Introduction
The problem of measuring the amount of dependence between two random variables is an old problem in statistics. Numerous methods have been proposed over the years. For recent surveys, see [12, 33]. The literature on measures of conditional dependence, on the other hand, is not so large, especially in the nonparametric setting. This is the focus of this paper.
The nonparametric conditional independence testing problem can be relatively easily solved for discrete data using the classical Cochran–Mantel–Haenszel test [14, 37]. This test can be adapted for continuous random variables by binning the data [31] or using kernels [27, 63, 52, 17, 48]. Besides these, there are methods based on estimating conditional cumulative distribution functions [36, 41], conditional characteristic functions [53], conditional probability density functions [54], empirical likelihood [55], mutual information and entropy [46, 32, 43], copulas [4, 51, 58], distance correlation [60, 23, 56], and other approaches [47]. A number of interesting ideas based on resampling and permutation tests have been proposed in recent years [10, 48, 5].

The main contribution of this paper is a new coefficient of conditional dependence between two random variables $Y$ and $Z$ given a set of other variables $X_1, \dots, X_p$, based on i.i.d. data. The coefficient is inspired by a similar measure of univariate dependence recently proposed in [12]. The main features of our coefficient are the following:

it has a simple expression,

it is fully nonparametric,

it has no tuning parameters,

there is no need for estimating conditional densities, conditional characteristic functions, or mutual information,

it can be estimated from data very quickly, in time $O(n \log n)$, where $n$ is the sample size,

asymptotically, it converges to a limit in $[0, 1]$, where the limit is $0$ if and only if $Y$ and $Z$ are conditionally independent given $X$, and is $1$ if and only if $Y$ is equal to a measurable function of $Z$ given $X$, and

all of the above hold under absolutely no assumptions on the laws of the random variables.
As an application of this measure of conditional dependence, we propose a new variable selection algorithm, called Feature Ordering by Conditional Independence (FOCI), which is model-free, has no tuning parameters, and is provably consistent under sparsity assumptions.
The paper is organized as follows. The definition and properties of our coefficient are presented in Section 2. This is followed by a result about hypothesis testing in Section 3. Our variable selection method is introduced in Section 4 and the theorem about its consistency is stated in Section 5. Applications to simulated and real data are presented in Section 6. The remaining sections are devoted to proofs.
2. The coefficient
Let $Y$ be a random variable and $X = (X_1, \dots, X_p)$ and $Z = (Z_1, \dots, Z_q)$ be random vectors, all defined on the same probability space. Here $q \ge 1$ and $p \ge 0$. The value $p = 0$ means that $X$ has no components at all. Let $\mu$ be the law of $Y$. We propose the following quantity as a measure of the degree of conditional dependence of $Y$ and $Z$ given $X$:
(2.1) $T = T(Y, Z \mid X) := \dfrac{\int \mathbb{E}(\mathrm{Var}(\mathbb{P}(Y \ge t \mid Z, X) \mid X))\, d\mu(t)}{\int \mathbb{E}(\mathrm{Var}(\mathbf{1}_{\{Y \ge t\}} \mid X))\, d\mu(t)}.$
In the denominator, $\mathbf{1}_{\{Y \ge t\}}$ is the indicator of the event $\{Y \ge t\}$. If the denominator equals zero, $T$ is undefined. (We will see below that this happens if and only if $Y$ is almost surely equal to a measurable function of $X$, which is a degenerate case that we will ignore.) If $p = 0$, then $X$ has no components, and the conditional expectations and variances given $X$ should be interpreted as unconditional expectations and variances. In this case we will write $T(Y, Z)$ instead of $T(Y, Z \mid X)$.

Note that $T$ is a nonrandom quantity that depends only on the joint law of $(Y, Z, X)$. Before stating our theorem about $T$, let us first see why $T$ is a reasonable measure of conditional dependence. Since taking conditional expectation decreases variance, we have that for any $t$,
$$\mathbb{E}(\mathrm{Var}(\mathbb{P}(Y \ge t \mid Z, X) \mid X)) \le \mathbb{E}(\mathrm{Var}(\mathbf{1}_{\{Y \ge t\}} \mid X)).$$
This shows that the numerator in (2.1) is less than or equal to the denominator, and so $T$ is always between $0$ and $1$. Now, if $Y$ and $Z$ are conditionally independent given $X$, then $\mathbb{P}(Y \ge t \mid Z, X)$ is a function of $X$ only, and hence $\mathrm{Var}(\mathbb{P}(Y \ge t \mid Z, X) \mid X) = 0$. Therefore in this situation, $T = 0$. We will show later that the converse is also true. On the other hand, if $Y$ is almost surely equal to a measurable function of $Z$ given $X$, then $\mathbb{P}(Y \ge t \mid Z, X) = \mathbf{1}_{\{Y \ge t\}}$ for any $t$. Therefore in this case, $T = 1$. Again, we will prove later that the converse is also true. The following theorem summarizes these properties of $T$.
Theorem 2.1.
Suppose that $Y$ is not almost surely equal to a measurable function of $X$ (when $p = 0$, this means that $Y$ is not almost surely a constant). Then $T$ is well-defined and $0 \le T \le 1$. Moreover, $T = 0$ if and only if $Y$ and $Z$ are conditionally independent given $X$, and $T = 1$ if and only if $Y$ is almost surely equal to a measurable function of $Z$ given $X$. When $p = 0$, conditional independence given $X$ simply means unconditional independence.
Having defined $T$, the main question is whether $T$ can be efficiently estimated from data. We will now present a consistent estimator of $T$, which is our conditional dependence coefficient. Our data consists of $n$ i.i.d. copies $(Y_1, Z_1, X_1), \dots, (Y_n, Z_n, X_n)$ of the triple $(Y, Z, X)$, where $n \ge 2$. For each $i$, let $N(i)$ be the index $j$ such that $X_j$ is the nearest neighbor of $X_i$ with respect to the Euclidean metric on $\mathbb{R}^p$, where ties are broken uniformly at random. Let $M(i)$ be the index $j$ such that $(X_j, Z_j)$ is the nearest neighbor of $(X_i, Z_i)$ in $\mathbb{R}^{p+q}$, again with ties broken uniformly at random. Let $R_i$ be the rank of $Y_i$, that is, the number of $j$ such that $Y_j \le Y_i$. If $p \ge 1$, our estimate of $T$ is
$$T_n := \frac{\sum_{i=1}^n \bigl(\min\{R_i, R_{M(i)}\} - \min\{R_i, R_{N(i)}\}\bigr)}{\sum_{i=1}^n \bigl(R_i - \min\{R_i, R_{N(i)}\}\bigr)}.$$
If $p = 0$, let $L_i$ be the number of $j$ such that $Y_j \ge Y_i$, let $M(i)$ denote the $j$ such that $Z_j$ is the nearest neighbor of $Z_i$ (ties broken uniformly at random), and let
$$T_n := \frac{\sum_{i=1}^n \bigl(n \min\{R_i, R_{M(i)}\} - L_i^2\bigr)}{\sum_{i=1}^n L_i (n - L_i)}.$$
In both cases, $T_n$ is undefined if the denominator is zero. The following theorem proves that $T_n$ is indeed a consistent estimator of $T$.
Theorem 2.2.
Suppose that $Y$ is not almost surely equal to a measurable function of $X$. Then as $n \to \infty$, $T_n \to T$ almost surely.
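For concreteness, the estimator can be sketched in a few lines of NumPy. This is a minimal illustration consistent with the description above, not the authors' reference implementation (foci.R): the function name `codec` is borrowed from Remark (5) below, nearest neighbors are found by brute force in $O(n^2)$ time rather than the faster method of Remark (1), and ties are broken arbitrarily rather than uniformly at random (with continuous data, ties occur with probability zero).

```python
import numpy as np

def codec(Y, Z, X=None):
    """Sketch of the conditional dependence statistic T_n.

    Y: (n,) response; Z: (n, q) predictors; X: (n, p) conditioning
    variables, or None for the p = 0 (unconditional) case.
    """
    Y = np.asarray(Y, dtype=float)
    n = len(Y)
    Z = np.asarray(Z, dtype=float).reshape(n, -1)
    # R_i = #{j : Y_j <= Y_i}
    R = np.array([(Y <= y).sum() for y in Y])

    def nn_index(W):
        # Index of each row's nearest neighbor among the other rows
        # (brute force; ties broken arbitrarily by argmin).
        D = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=2)
        np.fill_diagonal(D, np.inf)
        return D.argmin(axis=1)

    if X is None:
        # p = 0: L_i = #{j : Y_j >= Y_i}, M(i) from Z alone
        L = np.array([(Y >= y).sum() for y in Y])
        M = nn_index(Z)
        num = (n * np.minimum(R, R[M]) - L ** 2).sum()
        den = (L * (n - L)).sum()
    else:
        # p >= 1: N(i) from X, M(i) from the concatenation (X, Z)
        X = np.asarray(X, dtype=float).reshape(n, -1)
        N = nn_index(X)
        M = nn_index(np.hstack([X, Z]))
        num = (np.minimum(R, R[M]) - np.minimum(R, R[N])).sum()
        den = (R - np.minimum(R, R[N])).sum()
    return num / den
```

On simulated data, the statistic behaves as the theorem predicts: it is close to $1$ when $Y$ is a noiseless function of $Z$ given $X$, and close to $0$ under conditional independence.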
Remarks. (1) The statistic $T_n$ can be computed in time $O(n \log n)$, because nearest neighbors can be determined in $O(n \log n)$ time and ranks can also be calculated in $O(n \log n)$ time.
(2) No assumptions on the joint law of $(Y, Z, X)$ are needed other than the nondegeneracy condition that $Y$ is not almost surely equal to a measurable function of $X$. This condition is inevitable, because if it does not hold, then the question of conditional independence becomes ambiguous.
(3) Although the limit of $T_n$ is guaranteed to be in $[0, 1]$, the actual value of $T_n$ for finite $n$ may lie outside this interval.
(4) It is not easy to explain why $T_n$ is a consistent estimator of $T$ without going into the details of the proof, so we will not make that attempt here.
(5) We have not given a name to $T_n$, but if an acronym is desired for easy reference, one may call it CODEC, which is an acronym for Conditional Dependence Coefficient. In fact, this is the acronym that we use in the R code for computing $T_n$.
(6) An R package will soon be made available. For now, the code can be downloaded from https://statweb.stanford.edu/~souravc/foci.R. This R program contains the code for computing $T_n$ as well as the code for the variable selection algorithm FOCI presented in Section 4 below.
(7) Besides variable selection, another natural area of applications of our coefficient is graphical models. This is currently under investigation.
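The $O(n \log n)$ claim in Remark (1) is attainable in practice with k-d trees for the nearest-neighbor step (for fixed dimension). A minimal sketch using SciPy's `cKDTree` (an assumed dependency, not part of the authors' code; ties are resolved by the library rather than uniformly at random):

```python
import numpy as np
from scipy.spatial import cKDTree

def nn_indices(W):
    """Index of each row's nearest neighbor among the other rows,
    via a k-d tree (average O(n log n) for fixed dimension)."""
    tree = cKDTree(W)
    # Query the two closest points; the closest is the point itself.
    _, idx = tree.query(W, k=2)
    return idx[:, 1]
```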
3. Testing conditional independence
The theorems of the previous section raise the possibility of constructing a consistent test for conditional independence based on $T_n$. However, it is known that this is an impossible task, even for a single alternative hypothesis, if we demand that the level of the test be asymptotically uniformly bounded by some given $\alpha \in (0, 1)$ over the whole null hypothesis space [49]. For the statistic $T_n$, the main problem is that although we know that $T_n \to 0$ if and only if $Y$ and $Z$ are conditionally independent given $X$, the rate at which this convergence happens may depend on the joint law of $(Y, Z, X)$. In fact, we believe (without proof) that the convergence can be arbitrarily slow.

Still, it is possible to construct sequences of tests that achieve level $\alpha$ pointwise asymptotically [49] and are consistent against arbitrarily large classes of alternatives. Fix any $\alpha \in (0, 1)$. For each $n$ and $\delta > 0$, let $\psi_{n, \delta}$ be the test that rejects the hypothesis of conditional independence when $T_n \ge \delta$. Then the following result is an obvious corollary of Theorems 2.1 and 2.2.
Theorem 3.1.
Fix $\delta > 0$ and $p, q \ge 1$. Let $\mathcal{P}_0$ be the class of all joint laws of $(Y, Z, X)$ where $Y$ is a real-valued random variable, $Z$ is a $q$-dimensional random vector, $X$ is a $p$-dimensional random vector, $Y$ is not almost surely equal to a measurable function of $X$, and $Y$ and $Z$ are conditionally independent given $X$. Let $\psi_{n, \delta}$ be defined as above. For $P \in \mathcal{P}_0$, let $\mathbb{P}_P$ denote probabilities computed under $P$. Then for any $P \in \mathcal{P}_0$,
$$\lim_{n \to \infty} \mathbb{P}_P(\psi_{n, \delta} \text{ rejects}) = 0.$$
On the other hand, if $\mathcal{P}_{1, \delta}$ denotes the set of all joint laws of $(Y, Z, X)$ such that $T(Y, Z \mid X) > \delta$, then for any $P \in \mathcal{P}_{1, \delta}$,
$$\lim_{n \to \infty} \mathbb{P}_P(\psi_{n, \delta} \text{ rejects}) = 1.$$
Finally, the set $\mathcal{P}_{1, \delta}$ increases as $\delta$ decreases, and if $\mathcal{P}_1$ is the set of all joint laws of $(Y, Z, X)$ such that $Y$ and $Z$ are not conditionally independent given $X$, then
$$\bigcup_{\delta > 0} \mathcal{P}_{1, \delta} = \mathcal{P}_1,$$
showing that we can have consistent tests with pointwise asymptotic level against arbitrarily large classes of alternatives.
In the absence of quantitative bounds, the above theorem is not useful from a practical point of view because it does not tell us how to choose $\delta$ in a given problem. To obtain quantitative bounds, it is necessary to have more information about the joint law of $(Y, Z, X)$. Whether it will be possible to extract the required information from data is not clear. This is a future direction of research that is currently under investigation.
Incidentally, if we only want to test independence of $Y$ and $Z$, and not conditional independence given $X$, then it is easy to do it using $T_n$ and a permutation test.
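Such a permutation test can be sketched as follows. This is an illustrative implementation of the unconditional ($p = 0$) statistic with brute-force nearest neighbors; `tn` and `perm_test` are names chosen for this sketch, not functions from the authors' foci.R.

```python
import numpy as np

def tn(Y, Z):
    # Unconditional (p = 0) statistic described in Section 2:
    # R_i = #{j : Y_j <= Y_i}, L_i = #{j : Y_j >= Y_i},
    # M(i) = nearest neighbor of Z_i (brute force, arbitrary ties).
    Y = np.asarray(Y, dtype=float)
    n = len(Y)
    Z = np.asarray(Z, dtype=float).reshape(n, -1)
    R = np.array([(Y <= y).sum() for y in Y])
    L = np.array([(Y >= y).sum() for y in Y])
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    M = D.argmin(axis=1)
    return (n * np.minimum(R, R[M]) - L ** 2).sum() / (L * (n - L)).sum()

def perm_test(Y, Z, n_perm=100, seed=0):
    # p-value: proportion of permutations of Z whose statistic is at
    # least the observed one (with the usual +1 correction).
    rng = np.random.default_rng(seed)
    t_obs = tn(Y, Z)
    t_perm = [tn(Y, rng.permutation(Z)) for _ in range(n_perm)]
    return (1 + sum(t >= t_obs for t in t_perm)) / (1 + n_perm)
```

Permuting $Z$ breaks any dependence with $Y$ while preserving the marginals, so the permuted statistics approximate the null distribution of $T_n$.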
4. Feature Ordering by Conditional Independence (FOCI)
In this section we propose a new variable selection algorithm for multivariate regression using a forward stepwise algorithm based on our measure of conditional dependence. The commonly used variable selection methods in the statistics literature use linear or additive models. This is true of the classical methods [6, 28, 13, 57, 21, 25, 29, 40] as well as the more modern ones [11, 65, 66, 61, 22, 44]. These methods are powerful and widely used in practice. However, they sometimes run into problems when significant interaction effects or nonlinearities are present. We will later show an example where methods based on linear and additive models fail to select any of the relevant predictors, even in the complete absence of noise.
Such problems can sometimes be overcome by model-free methods [10, 30, 2, 7, 24, 29, 9, 3, 59, 8]. These, too, are powerful and widely used techniques, and they perform better than model-based methods if interactions are present. On the flip side, their theoretical foundations are usually weaker than those of model-based methods.
The method that we are going to propose below, called Feature Ordering by Conditional Independence (FOCI), attempts to combine the best of both worlds by being fully model-free, as well as having a proof of consistency under a set of assumptions.
The method is as follows. First, choose $j_1$ to be the index $j$ that maximizes $T_n(Y, X_j)$. Having obtained $j_1, \dots, j_k$, choose $j_{k+1}$ to be the index $j \notin \{j_1, \dots, j_k\}$ that maximizes $T_n(Y, X_j \mid X_{j_1}, \dots, X_{j_k})$. Continue like this until arriving at the first $k$ such that $T_n(Y, X_{j_{k+1}} \mid X_{j_1}, \dots, X_{j_k}) \le 0$, and then declare the chosen subset to be $\hat{S} := \{j_1, \dots, j_k\}$. If there is no such $k$, define $\hat{S}$ to be the whole set of variables. It may also happen that $T_n(Y, X_{j_1}) \le 0$. In that case declare $\hat{S}$ to be empty.
Although it is not required theoretically, we recommend that the predictor variables be standardized before running the algorithm. We will see later that FOCI performs well in examples, even if the true dependence of $Y$ on the predictors is nonlinear in a complicated way. In the next section we prove the consistency of FOCI under a set of assumptions on the law of $(Y, X_1, \dots, X_p)$.

If computational time is not an issue, one can try to add several variables at each step instead of just one. Although we do not explore this idea in this paper, it is possible that this gives improved results in certain situations. Similarly, one can try a forward-backward version of FOCI, analogous to the forward-backward version of ordinary stepwise selection.
5. Consistency of FOCI
Let $Y, X_1, \dots, X_p$ be as in the previous section. For any subset of indices $S \subseteq \{1, \dots, p\}$, let $X_S := (X_j)_{j \in S}$, and let $S^c := \{1, \dots, p\} \setminus S$. In the machine learning literature, a subset $S$ is sometimes called sufficient [59] if $Y$ and $X_{S^c}$ are conditionally independent given $X_S$. This includes the possibility that $S$ is the empty set, when it simply means that $Y$ and $X_{S^c}$ are independent. Sufficient subsets are known as Markov blankets in the literature on graphical models [42, Section 3.2.1], and are closely related to the concept of sufficient dimension reduction in classical statistics [15, 1, 35]. If we can find a small subset of predictors that is sufficient, then our job is done, because these predictors contain all the relevant predictive information about $Y$ among the given set of predictors, and the statistician can then fit a predictive model based on this small subset of predictors.

For any subset $S \subseteq \{1, \dots, p\}$, let
(5.1) $Q(S) := \int \mathrm{Var}(\mathbb{E}(\mathbf{1}_{\{Y \ge t\}} \mid X_S))\, d\mu(t),$
where $\mu$ is the law of $Y$. We will prove later (Lemma 9.2) that $Q(S) \le Q(S')$ whenever $S \subseteq S'$, with equality if and only if $Y$ and $X_{S'}$ are conditionally independent given $X_S$. Thus if $S \subseteq S'$, the difference $Q(S') - Q(S)$ is a measure of how much extra predictive power is added by appending $X_{S' \setminus S}$ to the set of predictors $X_S$.
Let $\delta$ be the largest number such that for any insufficient subset $S \subseteq \{1, \dots, p\}$, there is some $j \notin S$ such that $Q(S \cup \{j\}) \ge Q(S) + \delta$. In other words, if $S$ is insufficient, there exists some index $j$ such that appending $X_j$ to $X_S$ increases the predictive power by at least $\delta$. The main result of this section, stated below, says that if $\delta$ is not too close to zero, then under some regularity assumptions on the law of $(Y, X_1, \dots, X_p)$, the subset selected by FOCI is sufficient with high probability. Note that a sparsity assumption is hidden in the condition that $\delta$ is not very small, because the definition of $\delta$ ensures that there is at least one sufficient subset of size at most of the order $1/\delta$.
To prove our result, we need the following two technical assumptions on the joint distribution of $(Y, X_1, \dots, X_p)$. It is possible that the assumptions can be relaxed.
(A1) There is a number such that for any of size , any , and any ,

(A2) There is a number such that for any of size , the support of has diameter .
Assumption (A1) says that a small change in a small set of predictors does not change the conditional law of $Y$ by much. This is certainly not an unreasonable assumption. Assumption (A2) looks more restrictive, but since we can freely apply transformations to the predictors before carrying out the analysis, we can always make them bounded.
The following theorem shows that under the above assumptions, the subset chosen by FOCI is sufficient with high probability.
Theorem 5.1.
Suppose that , and that the assumptions (A1) and (A2) hold. Let be the subset selected by FOCI with a sample of size . There are positive real numbers , and depending only on , and such that .
The main implication of Theorem 5.1 is that if $\delta$ is not too close to zero and $n$ is sufficiently large, then with high probability, FOCI chooses a sufficient set of predictors. In particular, this theorem allows $p$ to be quite large compared to $n$, as long as $\delta$ is not too small.
6. Examples
In this section we present some applications of our methods to simulated examples and real datasets. In all examples, the covariates were standardized prior to the analysis.
Example 6.1 (Simulated example for CODEC).
Let and be independent Uniform random variables, and define
The relationship between and has three main features:

is a function of ,

unconditionally, is independent of , and

conditional on , is a function of .
Simulations showed that all three features are effectively captured by our coefficient $T_n$. We took , and computed , , in independent simulations. About percent of the time, took values between and , in agreement with the fact that is a function of . Similarly, about percent of the time, lay between and , as we would expect from the fact that is a function of conditional on . On the other hand, the value of was between and in percent of the simulations, again in agreement with the fact that and are unconditionally independent.
Existing methods from the literature are unable to capture the strong conditional dependency between $Y$ and $Z$ given $X$. For example, in a typical simulation, the partial correlation between $Y$ and $Z$ given $X$ turned out to be only , completely failing to detect that $Y$ is actually a function of $Z$ conditional on $X$. The recently proposed conditional distance correlation [60] also turned out to be quite small (approximately ), which, while statistically significantly different from zero, is far from detecting that $Y$ is a function of $Z$ given $X$.
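The exact construction used in this example did not survive extraction. As a hypothetical stand-in with the same three features, one can take $X$ and $Z$ i.i.d. Uniform$(0,1)$ and $Y := (X + Z) \bmod 1$; the snippet below checks the unconditional-independence feature empirically. This construction is illustrative and not necessarily the one used in the paper.

```python
import numpy as np

# Hypothetical construction (the original formula was lost in extraction):
#   (a) Y is a function of (X, Z) by construction,
#   (b) unconditionally Y is independent of Z, because for each fixed z
#       the law of (z + U) mod 1, with U ~ Uniform(0,1), is Uniform(0,1),
#   (c) conditional on X, Y is an exact function of Z.
rng = np.random.default_rng(1)
n = 40_000
X, Z = rng.uniform(size=n), rng.uniform(size=n)
Y = (X + Z) % 1.0

# Empirical check of (b): the joint CDF of (Y, Z) factorizes.
dev = max(
    abs(np.mean((Y <= s) & (Z <= t)) - s * t)
    for s, t in [(0.3, 0.5), (0.7, 0.2), (0.5, 0.5)]
)
```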
Example 6.2 (Simulated example for FOCI).
In this example we investigate the performance of FOCI on a simulated dataset and compare it to some popular methods of variable selection. We chose $p = 1000$ covariates $X_1, \dots, X_{1000}$, which were generated as i.i.d. standard normal random variables. The response $Y$ was defined as the following noiseless function of the first three variables $X_1$, $X_2$ and $X_3$:
(6.1) $Y = X_1 X_2 + \sin(X_1 X_3).$
There is nothing special about this particular function. It was chosen as an arbitrary example of a function with complicated nonlinearities and interactions. We also tried to avoid explicit monotone relationships between $Y$ and the $X_j$'s, because most methods are good at detecting monotone relationships.
Method  Selected variables 

FOCI  1, 2, 3. 
Forward stepwise  247 variables were selected, but 1, 2, and 3 were not in the list. 
Lasso  28, 43, 68, 95, 96, 189, 241, 262, 275, 292, 351, 362, 387, 403, 490, 514, 526, 537, 560, 578, 583, 623, 635, 675, 787, 814, 834, 914, 965, 968. 
Dantzig selector  No variables were selected. 
SCAD  28, 43, 68, 241, 262, 292, 351, 387, 403, 537, 583, 623, 675, 814, 834, 968. 
With a sample of size $n$, we compared the performance of FOCI with the following popular algorithms for variable selection: forward stepwise, Lasso [57], the Dantzig selector [11], and SCAD [22]. The tuning parameters for Lasso, the Dantzig selector and SCAD were chosen using cross-validation, and the AIC criterion was used for stopping in forward stepwise. Standard R packages were used for all computations. In this example we did not compare with methods that only give an ordering of variables or prediction rules (such as random forests [8]), or methods for which standard prescriptions for choosing tuning parameter values are not available (such as SPAM [44] or mutual information [3]).

Table 1 displays the results of our comparisons. The table shows that only FOCI was able to select the correct subset. Forward stepwise, Lasso and SCAD selected long lists of variables that did not include the relevant ones, while the Dantzig selector did not select any variables.
Example 6.3 (Spambase data).
In this example we apply FOCI to a widely used benchmark dataset, known as the spambase data [18], and compare its performance with other methods.
Method  Subset size  MSPE 

FOCI  26  0.036 
Forward stepwise  45  0.039 
Lasso  56  0.038 
Dantzig selector  51  0.038 
SCAD  33  0.039 
We used the version of the dataset that is available at the UCI Machine Learning Repository. The data consists of 4601 observations, each corresponding to one email, and 57 features for each observation. The response variable is binary, indicating whether the email is a spam email or not.
We compared FOCI with forward stepwise, Lasso, the Dantzig selector and SCAD, as in the previous example. As before, the tuning parameters for Lasso, the Dantzig selector and SCAD were chosen using cross-validation, and the AIC criterion was used for stopping in forward stepwise.
We chose 80% of the observations at random as the training set and kept the rest for testing. The variables were selected by running the competing algorithms on the training set. For each method, after selecting the variables, we fitted a predictive model using random forests. Random forests were used because they gave better prediction errors than any other technique (such as linear models).
The mean squared prediction errors were then estimated using the test set. The results are shown in Table 2. From this table, we see that FOCI gave a slightly better prediction error than the other methods (0.036 for FOCI versus 0.038 or 0.039 for every other method), even though the size of the subset selected by FOCI was far less than the sizes of the subsets selected by the other methods (26 for FOCI, 45 for forward stepwise, 56 for Lasso, 51 for the Dantzig selector, and 33 for SCAD). This is a recurring pattern that we saw in almost all the datasets on which we ran our tests. On rare occasions, FOCI stopped a little too soon, resulting in worse prediction errors.
Example 6.4 (Polish companies bankruptcy data).
We now consider another dataset from the UCI Machine Learning Repository, known as the Polish companies bankruptcy dataset [18, 64]. The dataset consists of samples with features. Each sample corresponds to a company in Poland. The response variable is binary, indicating whether or not the company went bankrupt after a period of time. We carried out the exact same comparison procedure for this data as we did for the spam data in the previous example. The results are shown in Table 3. Again we see that FOCI achieved a slightly better prediction error than the other methods, but with far fewer variables. FOCI selected 10 variables, whereas forward stepwise selected 24, Lasso selected 48, and the Dantzig selector selected 27. Only SCAD selected a smaller number of variables (3), but it did so at the cost of a significantly worse prediction error.
Method  Subset size  MSPE 

FOCI  10  0.015 
Forward stepwise  24  0.016 
Lasso  48  0.017 
Dantzig selector  27  0.017 
SCAD  3  0.021 
Example 6.5 (Broader comparison).
In this final example, we compare the performance of FOCI with a broader set of competing methods, including methods that provide good prediction tools but are difficult or impossible to use for subset selection, such as random forests [8], mutual information [3] and SPAM [44]. We took the Polish companies bankruptcy data from Example 6.4, and divided the data randomly into training and test sets as before. For each method and each $k$, we took the top $k$ variables selected by the method, fitted a predictive model using random forests, and estimated the mean squared prediction error using the test data. We took only $k$ up to 10 because we know, from Table 3, that FOCI selects the top 10 variables, and the other methods (at least four of them) do not beat that performance with a larger number of variables. The results are plotted in Figure 1. The figure shows that FOCI and random forests were by far the best methods for early detection of important variables. This is a pattern that we observed in various other examples that we analyzed while preparing this manuscript.
In our experiments, the performances of FOCI and random forests were generally similar, but FOCI has three distinct advantages over random forests: (1) FOCI selects a subset of variables, whereas random forests only give an ordering by importance and a prediction rule, (2) FOCI runs much faster than random forests in large datasets, and (3) arguably, FOCI has better theoretical support than random forests, due to the results of this paper.
7. Restatement of Theorems 2.1 and 2.2
Beginning with this section, the rest of the paper is devoted to proofs. Throughout the rest of the manuscript, whenever we say that a random variable $U$ is a function of another random variable $V$, we will mean that $U = f(V)$ almost surely for some measurable function $f$.
First, we focus on Theorems 2.1 and 2.2. To prove these theorems, it is convenient to break up the estimator $T_n$ into pieces. This gives certain 'elaborate' versions of Theorems 2.1 and 2.2, which are interesting in their own right. First, suppose that $p \ge 1$. Define
(7.1) $Q_n := \frac{1}{n^2} \sum_{i=1}^n \bigl(\min\{R_i, R_{M(i)}\} - \min\{R_i, R_{N(i)}\}\bigr)$
and
(7.2) $S_n := \frac{1}{n^2} \sum_{i=1}^n \bigl(R_i - \min\{R_i, R_{N(i)}\}\bigr),$
so that $T_n = Q_n / S_n$. Let $\mu$ denote the law of $Y$. We will see later that the following theorem implies both Theorem 2.1 and Theorem 2.2 in the case $p \ge 1$.
Theorem 7.1.
Suppose that $p \ge 1$. As $n \to \infty$, the statistics $Q_n$ and $S_n$ converge almost surely to deterministic limits. Call these limits $Q$ and $S$, respectively. Then

(i) $0 \le Q \le S$.

(ii) $Y$ is conditionally independent of $Z$ given $X$ if and only if $Q = 0$.

(iii) $Y$ is conditionally a function of $Z$ given $X$ if and only if $Q = S$.

(iv) $Y$ is not a function of $X$ if and only if $S \ne 0$.

Explicitly, the values of $Q$ and $S$ are given by
$$Q = \int \mathbb{E}(\mathrm{Var}(\mathbb{P}(Y \ge t \mid Z, X) \mid X))\, d\mu(t)$$
and
$$S = \int \mathbb{E}(\mathrm{Var}(\mathbf{1}_{\{Y \ge t\}} \mid X))\, d\mu(t).$$
Next, suppose that $p = 0$. Define
(7.3) $Q_n := \frac{1}{n^3} \sum_{i=1}^n \bigl(n \min\{R_i, R_{M(i)}\} - L_i^2\bigr)$
and
(7.4) $S_n := \frac{1}{n^3} \sum_{i=1}^n L_i (n - L_i),$
so that $T_n = Q_n / S_n$. We will prove later that the following theorem implies Theorems 2.1 and 2.2 when $p = 0$.
Theorem 7.2.
As $n \to \infty$, $Q_n$ and $S_n$ converge almost surely to deterministic limits $Q$ and $S$, satisfying the following properties:

(i) $0 \le Q \le S$.

(ii) $Y$ is independent of $Z$ if and only if $Q = 0$.

(iii) $Y$ is a function of $Z$ if and only if $Q = S$.

(iv) $S \ne 0$ if and only if $Y$ is not a constant.

Explicitly,
$$Q = \int \mathrm{Var}(\mathbb{P}(Y \ge t \mid Z))\, d\mu(t)$$
and
$$S = \int \mathrm{Var}(\mathbf{1}_{\{Y \ge t\}})\, d\mu(t).$$
It is not difficult to see that when $Y$ has a continuous distribution, $S$ is necessarily equal to $1/6$. In this case, there is no need for estimating $S$ using $S_n$. On the other hand, the value of $S$ may depend on the distribution of $Y$ when the distribution is not continuous. In such cases, $S$ needs to be estimated from the data using $S_n$.
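Assuming the limiting denominator has the reconstructed form $S = \int \mathrm{Var}(\mathbf{1}_{\{Y \ge t\}})\, d\mu(t)$, the claim about continuous distributions follows from a one-line computation, with $G(t) := \mathbb{P}(Y \ge t)$:

```latex
S = \int \operatorname{Var}\bigl(\mathbf{1}_{\{Y \ge t\}}\bigr)\, d\mu(t)
  = \int G(t)\,\bigl(1 - G(t)\bigr)\, d\mu(t)
  = \mathbb{E}\bigl[G(Y)\bigl(1 - G(Y)\bigr)\bigr].
```

When $Y$ is continuous, $G(Y)$ is Uniform$(0,1)$, so $S = \mathbb{E}[U(1-U)] = \tfrac{1}{2} - \tfrac{1}{3} = \tfrac{1}{6}$, regardless of the particular continuous law of $Y$.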
8. Proofs of Theorems 2.1 and 2.2 using Theorems 7.1 and 7.2
Suppose that $p \ge 1$. Recall the quantities $Q$ and $S$ from the statement of Theorem 7.1, and notice that $T = Q/S$. Suppose that $Y$ is not a function of $X$. Then by conclusion (iv) of Theorem 7.1, $S \ne 0$, and hence $T$ is well-defined. Moreover, conclusion (i) implies that $0 \le T \le 1$, conclusion (ii) implies that $T = 0$ if and only if $Y$ and $Z$ are conditionally independent given $X$, and conclusion (iii) implies that $Y$ is a function of $Z$ given $X$ if and only if $T = 1$. This proves Theorem 2.1 when $p \ge 1$. Next, note that $T_n = Q_n / S_n$, where $Q_n$ and $S_n$ are as defined in (7.1) and (7.2). By Theorem 7.1, $Q_n \to Q$ and $S_n \to S$ almost surely. Thus, $T_n \to T$ almost surely. This proves Theorem 2.2 when $p \ge 1$.
Next, suppose that $p = 0$. The proof proceeds exactly as before, but using Theorem 7.2. Here $T = Q/S$, where $Q$ and $S$ are the quantities from Theorem 7.2. Suppose that $Y$ is not a function of $X$, which in this case just means that $Y$ is not a constant. Then by conclusion (iv) of Theorem 7.2, $S \ne 0$, and hence $T$ is well-defined. Moreover, conclusion (i) implies that $0 \le T \le 1$, conclusion (ii) implies that $T = 0$ if and only if $Y$ and $Z$ are independent, and conclusion (iii) implies that $Y$ is a function of $Z$ if and only if $T = 1$. This proves Theorem 2.1 when $p = 0$. Next, note that $T_n = Q_n / S_n$, where $Q_n$ and $S_n$ are as defined in (7.3) and (7.4). By Theorem 7.2, $Q_n \to Q$ and $S_n \to S$ almost surely. Thus, $T_n \to T$ almost surely. This proves Theorem 2.2 when $p = 0$.
9. Preparation for the proofs of Theorems 7.1 and 7.2
In this section we prove some lemmas that are needed for the proofs of Theorems 7.1 and 7.2. Let be a random variable and be an valued random vector, defined on the same probability space. Define
By the existence of regular conditional probabilities on regular Borel spaces (see for example [19, Theorem 2.1.15 and Exercise 5.1.16]), for each Borel set there is a measurable map from into , such that

for any , is a version of , and

with probability one, is a probability measure on .
In the above sense, is the conditional law of given . For each , let
Define
(9.1) 
Lemma 9.1.
Let be as above. Then if and only if and are independent.
Proof.
If and are independent, then for any , almost surely. Thus, almost surely, and so . Consequently, .
Conversely, suppose that . Then there is a set such that and for every . Since , almost surely for each . We claim that .
To show this, take any . If , then clearly must be a member of and there is nothing more to prove. So assume that . This implies that is right-continuous at .
There are two possibilities. First, suppose that for all . Then for each , , and hence must intersect . This shows that there is a sequence in such that decreases to . Since almost surely for each , this implies that with probability one,
But . Thus, almost surely.
The second possibility is that there is some such that . Take the largest such , which exists because is left-continuous. If , then , and hence almost surely because . Suppose that . Then either , which implies that almost surely, or and for all , which again implies that almost surely, by the previous paragraph. Therefore in either case, with probability one,
Since , this implies that almost surely.
This completes the proof of our claim that for every . In particular, for each , almost surely. Therefore, for any and any Borel set ,
This proves that and are independent. ∎
Let be an valued random vector defined on the same probability space as and , and let be the concatenation of and .
Lemma 9.2.
Let be as above. Then , and equality holds if and only if and are conditionally independent given .
Proof.
Since , it follows that for each ,
Consequently, . If and are conditionally independent given , then for any ,
Therefore, . Conversely, suppose that . Notice that
Thus,
So, if , then there is a Borel set such that and almost surely for every . We claim that . Let us now prove this claim. The proof is similar to the proof of the analogous claim in Lemma 9.1, with a few additional complications.
Take any . If , then clearly must be a member of . So assume that . As before, this implies that is right-continuous at . Take any sequence decreasing to . Then . But
and is a nonnegative random variable. Thus, in probability, and therefore there is a subsequence such that converges to almost surely. But from the properties of the regular conditional probability we know that is a nonincreasing function almost surely. Thus, it follows that is right-continuous at almost surely.
Now, as before, there are two possibilities. First, suppose that for all . Then for each , , and hence must intersect . This shows that there is a sequence in such that decreases to . Since almost surely for each and is right-continuous at with probability one, this implies that with probability one,
But . Thus, almost surely.
The second possibility is that there is some such that . Take the largest such , which exists because is left-continuous. If , then , and hence almost surely because . Suppose that . Then either , which implies that almost surely (by the previous step), or and for all , which again implies that almost surely (also by the previous step). Therefore in either case, with probability one,
Now, , and hence almost surely. In other words, almost surely. Thus, almost surely. Since , this implies that almost surely. This completes the proof of our claim that .
Therefore, for any and any Borel set ,
This proves that and are conditionally independent given . ∎
Let be an infinite sequence of i.i.d. copies of . For each and each , let be the Euclidean nearest neighbor of among . Ties are broken at random.
Lemma 9.3.
With probability one, as .
Proof.
Let be the law of . Let be the support of . Recall that is the set of all such that any open ball containing has strictly positive measure. From this definition it follows easily that the complement of is a countable union of open balls of measure zero. Consequently, with probability one.
Take any . Let be the ball of radius centered at . Then
Since almost surely, it follows that almost surely. Thus,
almost surely, and hence
This proves that in probability. But is decreasing in . Therefore almost surely. ∎
Take any particular realization of . In this realization, for each , let be the number of such that is a nearest neighbor of (not necessarily the randomly chosen one) and . The following is a well-known geometric fact (see for example [62, page 102]).
Lemma 9.4.
There is a deterministic constant , depending only on the dimension , such that always.
Proof.
Consider a triangle with vertices , and in , where and . Suppose that the angle at is strictly less than and . Then
Consequently,
where the last inequality holds because . Thus, if is a cone at of aperture less than , and is a finite list of points in (not necessarily distinct), then there can be at most one such that the nearest neighbor of in is .
Now, it is not difficult to see that there is a deterministic constant depending only on such that the whole of can be covered by at most cones of apertures less than based at any given point. Take this point to be . Then within each cone, there can be at most one , which is not equal to , and whose nearest neighbor is . This shows that there can be at most points distinct from whose nearest neighbor is , completing the proof of the lemma. ∎
Lemma 9.5.
There is a constant depending only on , such that for any measurable and any , .
Proof.
Since is nonnegative,
Therefore by symmetry,