1 Introduction
There is no doubt that causality detection is a task of great practical interest. In a wide sense, attributing causes to effects guides all our efforts to understand our world and to solve any kind of real life problems. There is not, however, a simple and general definition of causality and the topic remains a staple in contemporary philosophy.
The development of analytical methods for detecting a cause-effect relationship in a set of ordered pairs of values also lacks a universal formal definition of causality. From a purely statistical point of view, any bivariate joint distribution can be expressed as the product of either of the two marginal distributions and the conditional distribution of the other variable given the first, $p(x,y) = p(x)\,p(y|x) = p(y)\,p(x|y)$. These two equivalent expressions can also be used to explain the generation process in both directions.
In order to attack the causality detection problem we need to introduce one or more assumptions about the generation process or the shape of the joint distribution. Most of those assumptions come from the succinctness principle of Occam’s razor. We expect to have a simpler model in the correct direction than in the opposite one, i.e., the algorithmic complexity or minimum description length of the generation process should be lower in the true causal direction than in the opposite direction. To be more precise, if the random variable $X$ is the cause of the random variable $Y$, we usually expect the conditional distribution $p(y|x)$ to be unimodal or at least to have a similar shape for different given values of $x$.
Several methods have been proposed in the literature as practical measures of the uncomputable Kolmogorov complexity of the generation model in the hypothetical causal direction. See Statnikov et al. (2012) for a review of the usual assumptions and generation models. In this paper we develop new causality measures based on the assumption that the shape of the conditional distribution $p(y|x)$ tends to be very similar for different values of $x$ if the random variable $X$ is the cause of $Y$. The main difference with respect to previous methods is that we do not impose a strict independence between the conditional distribution (or noise) and the cause. However, we still expect the conditional distribution to have a similar shape or similar statistical characteristics for different values of the cause.
The developed features are combined with standard statistical features following a machine learning approach: the selection of a good set of relevant features and of an adequate learning model.
2 Features
In this section we enumerate the features used by our model. All the measures are computed in both directions, i.e., exchanging the role of the two random variables X and Y, except if the measure is symmetric.
2.1 Preprocessing
Mean and Variance Normalization.
Numerical data is normalized to have zero mean and unit variance. All of our features are scale and mean invariant.
Discretization of numerical variables. Discrete measures such as the discrete entropy and the discrete mutual information are also used as features of numerical data after a discretization or quantization process. The quantization uses equally spaced segments of length and truncates all absolute values above . For almost all measures requiring a discretization of the input we selected and in our experiments, i.e., a quantization to 19 different values.
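The two preprocessing steps for numerical data can be sketched as follows. This is a minimal illustration, not the reference implementation: the function names are hypothetical, and the bin width (`step`) and truncation threshold (`max_abs`) are assumed values chosen only because they produce 19 quantization levels.

```python
import math


def normalize(values):
    """Shift and scale a sample to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def quantize(values, step=1.0 / 3.0, max_abs=3.0):
    """Map normalized values to integer levels using equally spaced bins.

    ASSUMPTION: step and max_abs are illustrative defaults, not values
    taken from the text. They give 2 * round(max_abs / step) + 1 = 19 levels.
    """
    top = round(max_abs / step)  # largest level index (9 with the defaults)
    levels = []
    for v in values:
        level = round(v / step)
        # Truncate values whose magnitude exceeds the threshold.
        levels.append(max(-top, min(top, level)))
    return levels
```

A numerical variable would be passed through `normalize` first and `quantize` second before any discrete measure is computed.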
Relabeling of categorical variables. The specific values assigned to categorical data are assumed to carry no information by themselves. However, in some cases we considered the calculation of numerical measures (such as skewness) for categorical variables. For these computations we assigned integer values to the categories as a function of their probability, so that after relabeling a variable with M different categories the integer labels follow the order of the category probabilities. This step lets us obtain numerical features of categorical variables that do not depend on the labels but on the sorted probabilities.
2.2 Information-theoretic measures
In the baseline system we include standard information-theoretic features such as entropy and mutual information. Both the discrete and the continuous versions of the entropy estimator are applied to numerical and categorical data after the preprocessing described above.
Discrete entropy and joint entropy.
The entropy of a random variable is an information-theoretic measure that quantifies the uncertainty in a random variable. In the case of a discrete random variable X, the entropy of X is defined as:
$$H(X) = -\sum_{x} p(x) \log p(x)$$
In our implementation of the discrete entropy estimator we added the simple Miller (1955) bias correction term to finally obtain
$$\hat{H}(X) = -\sum_{x} \frac{n_x}{n} \log \frac{n_x}{n} + \frac{M-1}{2n}$$
where $n_x$ is the number of times the value $x$ appears in the data set, $n$ is the total number of samples, and M is the number of different values of the random variable X in the data set. We also considered the normalized version $\hat{H}(X)/\log N$, where $\log N$ is the maximum entropy of a discrete random variable with N different values. The definition and estimation of the entropy can be extended to a pair of variables by replacing the counts $n_x$ with the counts $n_{x,y}$ of the number of times the pair $(x, y)$ appears in the sample set.
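As an illustration of the estimator described above, the following sketch computes the plug-in entropy from value counts and adds the Miller correction term, (number of distinct values minus one) divided by twice the number of samples. The function names and the `num_values` argument are hypothetical, and natural logarithms are an assumption.

```python
import math
from collections import Counter


def discrete_entropy(samples, miller=True):
    """Plug-in entropy estimate (in nats) with optional Miller correction."""
    counts = Counter(samples)
    n = len(samples)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    if miller:
        # Miller (1955) bias correction: (M - 1) / (2 n),
        # with M the number of distinct observed values.
        h += (len(counts) - 1) / (2.0 * n)
    return h


def normalized_entropy(samples, num_values):
    """Entropy divided by the maximum entropy of num_values outcomes."""
    return discrete_entropy(samples) / math.log(num_values)


def joint_entropy(xs, ys):
    """Same estimator applied to the counts of the observed (x, y) pairs."""
    return discrete_entropy(list(zip(xs, ys)))
```

For a quantized numerical variable, `num_values` would be the number of quantization levels; for a categorical variable, the number of categories is a natural choice.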
Discrete conditional entropy.
The conditional entropy quantifies the average amount of information needed to describe the outcome of a random variable Y given that the value of another random variable X is known. In our implementation, the discrete conditional entropy is computed as the difference between the discrete joint entropy and the marginal entropy:
$$H(Y|X) = H(X,Y) - H(X)$$
Discrete mutual information.
The Mutual Information is the information-theoretic measure of the dependence of two random variables. It can be computed from the entropy of each of the variables and their joint entropy as
$$I(X;Y) = H(X) + H(Y) - H(X,Y)$$
In addition to the above unnormalized version, we also included as features two normalized versions: the mutual information normalized by the joint entropy and the mutual information normalized by the minimum of the marginal entropies:
$$\frac{I(X;Y)}{H(X,Y)}, \qquad \frac{I(X;Y)}{\min(H(X), H(Y))}$$
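The relations between entropy, conditional entropy and mutual information can be sketched in a few lines. This is an illustrative example only: it uses the uncorrected plug-in entropy so that the identities hold exactly, and the function names and dictionary keys are hypothetical.

```python
import math
from collections import Counter


def entropy(samples):
    """Plug-in entropy (in nats) of a sequence of discrete values."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())


def information_features(xs, ys):
    """Conditional entropies and (normalized) mutual information of a pair."""
    hx, hy = entropy(xs), entropy(ys)
    hxy = entropy(list(zip(xs, ys)))  # joint entropy from pair counts
    mi = hx + hy - hxy
    hmin = min(hx, hy)
    return {
        "H(Y|X)": hxy - hx,
        "H(X|Y)": hxy - hy,
        "I(X;Y)": mi,
        "I/H(X,Y)": mi / hxy if hxy > 0 else 0.0,
        "I/min(H)": mi / hmin if hmin > 0 else 0.0,
    }
```

For two identical binary sequences both normalized versions equal 1, while for independent sequences the mutual information is 0.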
Adjusted mutual information.
The Adjusted Mutual Information score is an adjustment of the Mutual Information measure. It corrects the effect of agreement solely due to chance, Vinh et al. (2009). This feature is computed with the scikit-learn Python package, Pedregosa et al. (2011).
Gaussian and uniform divergence.
These features are an estimation of the Kullback-Leibler divergence, or distance, of the distribution of the data with respect to a normalized Gaussian distribution and a uniform distribution, respectively. After mean and variance normalization, the estimation of the Gaussian divergence is equivalent to the estimation of the differential entropy, up to a sign change and an additive constant.
An estimator of the differential entropy can also be used to compute the divergence with respect to a uniform distribution if the samples are first normalized to the range of that distribution.
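The link between these divergences and the differential entropy $h(X)$ can be made explicit with a short derivation. Here $\phi$ denotes the standard normal density, and the unit interval $[0,1]$ is an assumed choice of range for the uniform case, used only for illustration.

```latex
% Gaussian divergence for a zero-mean, unit-variance variable X with density p
\begin{align}
D\bigl(p \,\|\, \mathcal{N}(0,1)\bigr)
  &= \int p(x) \log \frac{p(x)}{\phi(x)} \, dx
   = -h(X) - \mathrm{E}\bigl[\log \phi(X)\bigr] \\
  &= -h(X) + \tfrac{1}{2}\log(2\pi) + \tfrac{1}{2}\,\mathrm{E}\bigl[X^2\bigr]
   = -h(X) + \tfrac{1}{2}\log(2\pi e)
\end{align}

% Uniform divergence for a variable X with support in [0, 1] (assumed range)
\begin{equation}
D\bigl(p \,\|\, \mathcal{U}(0,1)\bigr)
  = \int_0^1 p(x) \log \frac{p(x)}{1} \, dx
  = -h(X)
\end{equation}
```

In both cases a single differential entropy estimate is enough to obtain the divergence.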
2.3 Conditional distribution variability measures
In this section we define distribution variability measures that are used as tests of the spread of the conditional distribution $p(y|x)$ for different values of $x$. If the variable $X$ is numerical, we first apply the quantization process described in Section 2.1.
Standard deviation of the conditional distributions.
This is a direct measure of the spread of the conditional distributions $p(y|x)$ after normalization. If $Y$ is a numerical variable, the conditional distribution $p(y|x)$ is normalized for each value of $x$ to have zero mean and then quantized as in Section 2.1. If $Y$ is a categorical variable, the variability of the conditional distribution for different values of $x$ is calculated after sorting these probabilities for each $x$. The standard deviation of the preprocessed conditional distributions is then computed from the sample variance, taken over the values of $x$, of the normalized conditional probabilities.
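One plausible reading of this measure is sketched below: estimate one conditional distribution per value of the cause, take the variance over the cause values of each conditional probability, average over the effect values, and take the square root. The exact averaging is an assumption, the function name is hypothetical, and for a numerical effect the per-value centering and quantization are assumed to have been applied by the caller.

```python
import math
from collections import Counter, defaultdict


def conditional_distribution_std(xs, ys, sort_probs=False):
    """Spread of the conditional distributions p(y|x) across values of x.

    xs, ys: sequences of discrete (already quantized or categorical) values.
    sort_probs: sort each conditional distribution, as described for
    categorical effects, so that only the sorted probabilities matter.
    ASSUMPTION: the result is sqrt(mean over y of variance over x).
    """
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    y_values = sorted(set(ys))
    rows = []
    for x in sorted(groups):
        counts = Counter(groups[x])
        row = [counts[y] / len(groups[x]) for y in y_values]
        if sort_probs:
            row.sort()
        rows.append(row)
    variances = []
    for j in range(len(y_values)):
        column = [row[j] for row in rows]
        mean = sum(column) / len(column)
        variances.append(sum((p - mean) ** 2 for p in column) / len(column))
    return math.sqrt(sum(variances) / len(variances))
```

The measure is zero when every value of the cause yields the same conditional distribution and grows as the conditional distributions diverge.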
Standard deviation of the entropy, skewness and kurtosis
These additional features use the standard deviation to quantify the spread of the entropy, skewness and kurtosis of the conditional distributions $p(y|x)$ for different values of the hypothetical cause.
Bayesian error probability
This feature is an estimation of the average probability of error using the (discretized) conditional distributions $p(y|x)$. For each value of $x$, the probability of error is computed as one minus the probability of guessing $y$ given $x$ when we choose for the prediction the value of $y$ that maximizes $p(y|x)$:
$$P_e = \sum_{x} p(x)\, P_e(x), \qquad \text{where} \quad P_e(x) = 1 - \max_{y} p(y|x)$$
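The error probability feature can be illustrated with empirical counts. In this sketch (the function name is hypothetical), the best guess for each value of the cause is the most frequent effect value, so the average error is one minus the fraction of samples that such a predictor gets right.

```python
from collections import Counter, defaultdict


def bayes_error_probability(xs, ys):
    """Average error of predicting y from x with the most likely y per x.

    Equals sum_x p(x) * (1 - max_y p(y|x)) with empirical probabilities.
    xs, ys: sequences of discrete (quantized or categorical) values.
    """
    pair_counts = defaultdict(Counter)
    for x, y in zip(xs, ys):
        pair_counts[x][y] += 1
    n = len(xs)
    # For each x, the count of its most frequent y is the number of
    # samples predicted correctly by the maximum-probability rule.
    correct = sum(max(c.values()) for c in pair_counts.values())
    return 1.0 - correct / n
```

As with the other measures, the feature would be computed in both directions by exchanging the roles of the two variables.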
2.4 Other features
Number of samples and number of unique samples
Hilbert-Schmidt Independence Criterion (HSIC)
This standard independence measure is computed using a Python version of the MATLAB script provided by the organizers.
Slope-based Information Geometric Causal Inference (IGCI)
The IGCI approach for causality detection, Janzing et al. (2012), proposes measures based on the relative entropy and a slope-based measure that we also added to our set of features.
Moments and mixed moments
We included the skewness and kurtosis of each of the variables as features, as well as the mixed moments listed as Moment21 and Moment31 in Table 2.
Pearson correlation
The Pearson r correlation coefficient computed by the SciPy Python package, Jones et al. (2001–).
Polynomial fit
We propose two features based on a polynomial regression of order 2. The first feature is the absolute value of the second-order coefficient. We have observed that the causal direction usually requires a smaller coefficient. The second feature measures the regression mean squared error or residual.
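A minimal sketch of the two polynomial fit features follows, assuming NumPy is available. The function name is hypothetical and the fit uses ordinary least squares through `numpy.polyfit`, which may differ in detail from the original implementation.

```python
import numpy as np


def polyfit_features(x, y):
    """Features from a second-order polynomial regression of y on x.

    Returns (absolute value of the second-order coefficient,
             mean squared error of the fit).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coeffs = np.polyfit(x, y, 2)  # coefficients, highest degree first
    residual = y - np.polyval(coeffs, x)
    return float(abs(coeffs[0])), float(np.mean(residual ** 2))
```

Calling `polyfit_features(x, y)` and `polyfit_features(y, x)` gives the features for both hypothetical causal directions.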
3 Classification model selection
We tested different learning methods for classification and regression. Gradient Boosting, Hastie et al. (2001), performed significantly better than the rest of the algorithms in our 10-fold cross-validation experiments on the training set after manual hyperparameter tuning. We used the scikit-learn implementation (GradientBoostingClassifier) with 500 boosting stages and individual regression estimators with a large depth (9).
The classification task of the ChaLearn cause-effect pair challenge is in fact a three-class problem. For each pair of variables $X$ and $Y$, we have a ternary truth value indicating whether $X$ is a cause of $Y$ (+1), $Y$ is a cause of $X$ (-1), or neither (0). The participants have to provide a single real-valued prediction, with large positive values indicating that $X$ is a cause of $Y$ with certainty, large negative values indicating that $Y$ is a cause of $X$ with certainty, and middle-range scores (near zero) indicating that neither $X$ causes $Y$ nor $Y$ causes $X$. The official evaluation metric was the average of two Area Under the ROC Curve (AUC) scores. The first AUC is computed associating the truth values 0 and -1 to the same class (the class +1 versus the rest), while the second AUC is computed grouping together the +1 and 0 classes (the class -1 versus the rest).
Note that the symmetry of the task allows us to duplicate the training sample pairs. Exchanging $X$ with $Y$ in an example of class $c$ provides a new example of the class $-c$.
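The duplication of training pairs by symmetry can be sketched as follows. The function name and the data layout (a list of sample-sequence pairs with labels in {+1, 0, -1}) are assumptions made for illustration.

```python
def augment_with_swapped_pairs(pairs, labels):
    """Double a training set by exchanging the two variables of each pair.

    pairs: list of (x_samples, y_samples) tuples.
    labels: list of ternary truth values in {+1, 0, -1}.
    A swapped pair gets the negated label; class 0 stays class 0.
    """
    augmented_pairs = list(pairs)
    augmented_labels = list(labels)
    for (x, y), label in zip(pairs, labels):
        augmented_pairs.append((y, x))
        augmented_labels.append(-label)
    return augmented_pairs, augmented_labels
```

The augmented set has twice as many examples and is balanced between the +1 and -1 classes by construction.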
To deal with this ternary classification problem we tested 3 different schemes:
1. A single ternary classification or regression model. The predicted value is computed in this case as $p_{1} - p_{-1}$, where $p_{1}$ and $p_{-1}$ are the estimated probabilities assigned by the classifier to class 1 and class -1, respectively. Alternatively, we can use the output of any regression model. In the case of the selected Gradient Boosting model, the classifier version with the deviance loss function gave better results than the regressor loss functions in our experiments.
2. A binary model for estimating the direction (class 1 versus class -1) and a binary model for independence classification (class 0 versus the rest). The direction model is trained only with training sample pairs classified as 1 or -1, while the independence model is trained with all the data after grouping classes 1 and -1 in a single class. The predicted value is computed in this case as the product of the probabilities given by each of the models, namely the probability of class 1 given by the direction model and the independence probability provided by the second model.
3. A symmetric model based on two binary models. In this scheme we also have two binary models: a model for class 1 versus the rest and another model for class -1 versus the rest. In this sense, this configuration follows the same scheme as the evaluation metric. Both binary models are trained with all the training data after the corresponding relabeling of classes. The predicted value is then computed as the difference between the probability given by the first model to class 1 and the probability given by the second model to class -1.
Using the same set of selected features, the three schemes provide similar results, as shown in Table 1. The proposed final model uses an equally weighted linear combination of the outputs of the three models to obtain an additional significant gain with respect to the best performing scheme.
Table 1
Scheme                               Score
1. Single ternary model              0.81223
2. Direction / Independence models   0.81487
3. Symmetric models                  0.81476
System combination                   0.81960
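How the three schemes turn class probabilities into a single signed score, and how the scores are combined, can be sketched as below. All function names are hypothetical. The first and third functions follow the differences described in the text; the second is only an assumed reading of the product rule, in which the direction probability is mapped to a signed value and multiplied by the probability that the pair is causally related.

```python
def ternary_score(p_pos, p_neg):
    """Scheme 1: probability of class +1 minus probability of class -1."""
    return p_pos - p_neg


def direction_independence_score(p_direction, p_related):
    """Scheme 2 (ASSUMED form): signed direction times relatedness.

    p_direction: probability of class +1 given by the direction model.
    p_related: probability that the pair is not of class 0.
    The mapping 2 * p - 1 is an assumption made to obtain a signed score.
    """
    return (2.0 * p_direction - 1.0) * p_related


def symmetric_score(p_pos_vs_rest, p_neg_vs_rest):
    """Scheme 3: difference of the two one-versus-rest probabilities."""
    return p_pos_vs_rest - p_neg_vs_rest


def combined_score(scores):
    """Equally weighted linear combination of the scheme outputs."""
    return sum(scores) / len(scores)
```

Because the AUC only depends on the ranking of the scores, any positive rescaling of the combined score leaves the evaluation metric unchanged.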
4 Results
The main training database includes hundreds of pairs of real variables with known causal relationships from diverse domains. The organizers of the challenge also intermixed those pairs with controls (pairs of independent variables and pairs of variables that are dependent but not causally related) and semi-artificial cause-effect pairs (real variables mixed in various ways to produce a given outcome). In addition, they also provided artificially generated training datasets (http://www.causality.inf.ethz.ch/causeeffect.php?page=data).
The results presented in this section correspond to the score on the test data given by the web submission system of the cause-effect pair challenge hosted by Kaggle. Previous cross-validation experiments on the training set provided similar results. Table 2 summarizes the results for different subsets of the proposed complete set of features. The baseline system includes 21 features: number of samples (1), number of unique samples (2), discrete entropy (2), normalized discrete entropy (2), discrete conditional entropy (2), discrete mutual information and the two normalized versions (3), adjusted mutual information (1), Gaussian divergence (2), uniform divergence (2), IGCI (2), HSIC (1), and Pearson R (1).
Table 2
Features                                                                              Score
Baseline (21)                                                                         0.742
Baseline (21) + Moment31 (2)                                                          0.750
Baseline (21) + Moment21 (2)                                                          0.757
Baseline (21) + Error probability (2)                                                 0.749
Baseline (21) + Polyfit (2)                                                           0.757
Baseline (21) + Polyfit error (2)                                                     0.757
Baseline (21) + Skewness (2)                                                          0.754
Baseline (21) + Kurtosis (2)                                                          0.744
Baseline (21) + the above statistics set (14)                                         0.790
Baseline (21) + Standard deviation of conditional distributions (2)                   0.779
Baseline (21) + Standard deviation of the skewness of conditional distributions (2)   0.765
Baseline (21) + Standard deviation of the kurtosis of conditional distributions (2)   0.759
Baseline (21) + Standard deviation of the entropy of conditional distributions (2)    0.759
Baseline (21) + Measures of variability of the conditional distribution (8)           0.789
Full set (43 features)                                                                0.820
A more detailed analysis of the results of the proposed system and of other top ranking systems can be found in Guyon (2014).
5 Conclusions
We have proposed several measures of the variability of conditional distributions as features to infer causal relationships in a given pair of variables. In particular, the proposed standard deviation of the normalized conditional distributions stands out as one of the best features in our results. The combination of the developed measures with standard information-theoretic and statistical measures provides a robust set of features to address the causality problem in the framework of the ChaLearn cause-effect pair challenge. In a test set with categorical, numerical and mixed pairs from diverse domains, the proposed method achieves an AUC score of 0.82.
This work has been supported in part by the Spanish Ministerio de Economía y Competitividad, contract TEC201238939C0302, as well as by the European Regional Development Fund (ERDF/FEDER).
References
 Guyon (2014) Isabelle Guyon. Results and analysis of the 2013 ChaLearn cause-effect pair challenge. Journal of Machine Learning Research, In preparation, 2014.
 Hastie et al. (2001) Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
 Janzing et al. (2012) Dominik Janzing, Joris Mooij, Kun Zhang, Jan Lemeire, Jakob Zscheischler, Povilas Daniušis, Bastian Steudel, and Bernhard Schölkopf. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182–183(0):1–31, 2012. ISSN 00043702. doi: http://dx.doi.org/10.1016/j.artint.2012.01.002. URL http://www.sciencedirect.com/science/article/pii/S0004370212000045.
 Jones et al. (2001–) Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. URL http://www.scipy.org/.
 Miller (1955) G. A. Miller. Note on the bias of information estimates, 1955.
 Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 Statnikov et al. (2012) Alexander Statnikov, Mikael Henaff, Nikita Lytkin, and Constantin Aliferis. New methods for separating causes from effects in genomics data. BMC Genomics, 13(Suppl 8):S22, 2012. ISSN 14712164. doi: 10.1186/1471216413S8S22. URL http://www.biomedcentral.com/14712164/13/S8/S22.
 Vinh et al. (2009) Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 1073–1080, New York, NY, USA, 2009. ACM. ISBN 9781605585161. doi: 10.1145/1553374.1553511. URL http://doi.acm.org/10.1145/1553374.1553511.
Appendix A. ChaLearn cause-effect pair challenge. FACT SHEET.
Title: Conditional distribution variability measures for causality detection
Participant name, address, email and website: José A. R. Fonollosa, Universitat Politècnica de Catalunya, c/Jordi Girona 13, Edifici D5, Barcelona 08034, SPAIN. jose.fonollosa@upc.edu, www.talp.upc.edu
Task solved: cause-effect pairs
Reference: José A. R. Fonollosa: Conditional distribution variability measures for causality detection. NIPS 2013 Workshop on Causality
Method:

Preprocessing. Normalization of numerical variables. Relabeling of categorical variables

Causal discovery. Standard features plus new measures based on the variability of the conditional distributions for different values of the hypothetical cause

Feature selection. Greedy selection

Classification. Gradient Boosting. Combination of three different multiclass schemes

Model selection/hyperparameter selection. Manual hyperparameter selection
Results:
Dataset/Task   Official score   Post-deadline score
Final test     0.81052          0.81960

quantitative advantages: the developed model is simple and very fast compared to other top ranking models

qualitative advantages: it relaxes the noise independence assumption by introducing less strict similarity measures for the conditional probability $p(y|x)$.
The complete Python code for training the model and reproducing the presented results is available at https://github.com/jarfo/causeeffect. The training time is about 45 minutes on a 4-core server, and computing the predictions for the test set takes about 12 minutes.