1 Introduction
Estimating a causal directed acyclic graph (DAG) model (also known as causal Bayesian network) from observational data is a challenging problem and has applications in many research areas, including bioinformatics, economics and social science [Spirtes et al.1993, Pearl2000, Morgan and Winship2007]. Existing methods for causal discovery commonly assume the involved variables are either discrete, or continuous valued. In the discrete case, one of the principled approaches is the constrainedbased method, which relies on the results of conditional independence tests. While this approach dose not impose any functional assumptions on the dependencies, it can not identify the causal directions from two different DAG models that entail identical set of conditional independencies. To cope with this identifiability problem, [Peters et al.2011] extend the additive noise model to the discrete case and demonstrate the causal direction in their model can be identified in general. However, the model imposes linear dependence assumptions on the data and this assumption is not often satisfied in practice, especially for binary categorical data. For the multivariate count data, [Park and Raskutti2015]
proposed Poisson DAG model in which each node corresponds to a Poisson random variable with rate parameters depending only on its parent variables. Again, the model can be applied on only count data instead of categorical data.
For the continuousvalued data, the traditional methods for causal discovery are based on linear model with Gaussian noise [Geiger and Heckerman1994, Spirtes et al.1993]. However, the linear Gaussian approach usually outputs a set of possible models which belong to the Markov equivalence class of the true model. To avoid this limitation, shimizu06a shimizu06a proposed linear nonGaussian acyclic model (LiNGAM) and showed that the full causal structure is identifiable given sufficiently large number of data. To relax the assumption of all variables are nonGaussian, [Hoyer et al.2008] proposed the PClingam algorithm which combines the independence based PC algorithm [Spirtes et al.1993] and the ICAbased LiNGAM algorithm [Shimizu et al.2006]. The algorithm first use the PC algorithm to obtain a set of candidate DAG models, and then apply a scoring directions scheme for model selection.
While real data often contains a mixture of discrete and continuous variables, all approaches so far we described have assumed data are either discrete or continuous. One of the commonly employed approach for mixed data is to ignore the discrete variable and apply a linear causal network approach for only continuous data. The causal analysis losses sight of some important information due to ignorance of discrete data. Another one is to discretize the continuous variables and then apply the discrete Bayesian network to analysis causal relationships, since many efficient Bayesian network learning algorithm [Spirtes et al.1993, Yehezkel and Lerner2009, Yuan and Malone2013] has been proposed for discrete data. The choice of discretization policy has significant impact on the resulting model and the discretization may lead to wrong model if much information is lost due to discretization process. Recently, Chen2017 Chen2017 proposed a new discretization strategy to mitigate these problem. Nevertheless, these traditional methods learn a Markov equivalence class and therefore the causal directions of some edges can not be determined.
In this contribution, we propose a novel hybrid causal model which consists of both continuous and discrete variables. Our model is based on the LiNGAM model and logistic regression model. In our model, we assume:

The data generating process can be represented by a directed acyclic graph.

Each continuous variable is generated from a linear function of its parent variables plus a nonGaussian noise.

Each discrete variable is a logistic variable which depends on its parent variables.
An important features of this model is that the model can handle continuous and discrete variables simultaneously without using discretization. In addition, we derive the BIC scoring function for evaluating possible model and we also propose to use the BIC score for causal discovery. Most constraintbased discovery algorithms, e.g. the PC algorithm [Spirtes et al.1993], find a Markov equivalent class which is a set of DAG models. In contrast, our method leverage the identifiability of the LiNGAM model, are expected to be able to identify the full causal structure from observational data. Finally, we empirically demonstrate the power of our method through thorough simulations.
The remainder of this paper is structured as follows: Section 2 summaries the necessary notation and reviews the LiNGAM model and logistic model. Section 3 defines our hybrid causal model and derives the BIC scoring function for model selection. Section 4 empirically evaluates our methods. Section 5 concludes this paper.
2 Background
In this section, we first introduce some necessary notation and definitions for directed acyclic graph (DAG) models. Then we briefly review the two building blocks of our model: the linear nonGaussian acyclic model (LiNGAM) and logistic conditional probability distribution.
2.1 DAG Models
Let us consider a set of random variables with index set . Following the convention of previous studies, a causal graph over a set of variables is a DAG with node set , which represents the random variables , and edge set (or lack of them), which represents direct dependency relationships (or conditional independence relationships) between variables. A directed edge from node to node is denoted by or . A node is called a parent of if and the parent set of a node consists of all nodes such that . The joint probability distribution of variables can be factorized in terms of the conditional probability distributions as follows:
(1) 
where refers to the conditional probability distribution of given its parents variables . For the variables without parents (called root variables), stands for marginal distribution .
The set of all independence constraints imposed by the structure of a DAG model can be characterized by the Markov conditions, which are the constraints that each variable is independent of its nondescendants given its parents. Two DAG structures are Markov equivalent if the set of conditional independence constraints imposed by one DAG is identical to that of another DAG. A Markov equivalence class is a set of DAGs that encode the same set of conditional independencies. The constraint based causal discovery algorithm requires a faithfulness assumption: the conditional independencies in the data distribution exactly equal the ones encoded in the causal structure. Because the constraint based approach to causal inference considers only independence constraints, these methods find a Markov equivalent class of the true causal structure.
To identify more edge directions of estimated causal structure, we propose to combine the LiNGAM model and logistic model for causal discovery. Therefore, we review these two concepts in the next two subsections.
2.2 LiNGAM
To estimate a causal structure from continuous data, shimizu06a shimizu06a proposed a linear nonGaussian acyclic model (LiNGAM), which is a special case of structural equation models and continuousvalued Bayesian networks. The LiNGAM model assumes that the observed data are generated from a process which is represented graphically by a directed acyclic graph (DAG). Moreover, it assumes that the relations between the variables are linear. Let us denote a connection strength from a variable to another variable in the DAG by , then the model can be represented by
(2) 
where is called an noise variable. All noise variables
are continuous random variables having nonGaussian distributions with zero means and nonzero variances, and
are independent of each other so that there are no latent confounding variables [Spirtes et al.1993]. See Figure 1a for a concrete example of LiNGAM model, the data is generated by first drawing the independently from their respective nonGaussian distributions, and subsequently setting (in an topological order) , , , and . (Here, we have assumed for simplicity that all the are zero, but this may not be the case in general.) A remarkable result was shown in shimizu06a shimizu06a is that under the nonGaussian assumption about the noise distribution, the full causal structure and associated parameters are identifiable. In contrast, the constraintbased algorithms estimate the Markov equivalence class and thus the directions of some edges can not be estimated (see Figure 1b).2.3 Logistic Conditional Probability Distribution
In this section, we describe the logistic regression model as a local causal structure. Consider a discrete variable whose distribution depends on some set of causes
. In this study, we restrict our analysis on binary variables for
which takes two values . We assume that the conditional probability distribution of given its dependent variables is a logistic CPD, which is defined as follows.Let be a binaryvalued random variable defined over the domain , with parents that take on numerical values. The conditional probability distribution (CPD) of is a logistic CPD if there are weights such that:
(3) 
where the sigmoid function stands for:
(4) 
The logistic CPD is a natural model for many realworld applications, because it naturally aggregates the influence of different parents. Koller+Friedman:09 Koller+Friedman:09 also provide a variant of binary logistic CPD which can handle the multivalued variables, however, we have not implemented this feature in our software, yet.
3 The Hybrid Causal Model
In this section, we propose a novel hybrid causal model, i.e., a DAG model consisting of both continuous and discrete variables. Then we discuss two commonly used approaches to causal discovery. Finally, we derive the BIC scoring function to evaluate the fitness of a DAG model to the data.
3.1 Definition of Our Model
We partition the variables of our model in two types: continuous variables and discrete variables. In this paper, we assume that each discrete variable only take two values and assume that the observed data has been generated by the following process:

The data are generated from a process represented graphically by a directed acyclic graph, in which each variable is directly caused by its parent variables.

The value assigned to each continuous variables is a linear function of its parent variables plus a nonGaussian noise term , that is
(5) where the noise term are all continuous random variables with nonGaussian densities, and the noise variables are independent of each other.

For a discrete variable , the conditional probability distribution of variable is the logistic CPD, such that,
(6) (7)
3.2 Discovery Algorithms
Causal discovery consists in finding the causal model that best fits the sample data according to certain criterion. Since a causal model consists of a causal graph structure (a DAG) and associated parameters, the discovery algorithms often need to deal with two highly related tasks: search for a causal graph and estimation of the parameters. That is, in order to estimate the parameters, we must know the causal structure; in order to evaluate a candidate causal structure, we must estimate the parameters from the data and the causal graph. In this paper, we are mainly interested in algorithms for learning the causal structure, and view the parameter estimation part as a subroutine of the search algorithm.
The constraintbased algorithms typically apply statistical tests to identify conditional independence relations and attempt to find a causal graph that represents these relations as precise as possible. Since the accuracy of the statistical test is sensitive to the number of data and the complexity of the independence tests, the constraintbased algorithms may not work well when there are insufficient data. Another issue in the constraintbased algorithm is that independence test based approach can not distinguish two DAGs in the same Markov equivalence class. Since most of Markov equivalence classes contain more than one graph, conditional independence based methods leave some arrows undirected and cannot uniquely identify the true causal graph. Recently, Hoyer08 Hoyer08 use constraintbased methods to infer the Markov equivalence class of the true causal model and then score each DAG belonging to the equivalence class.
3.3 Scoring the Hybrid Causal Model
In order to evaluate the hybrid causal model, we derive the Bayesian information criterion (BIC) scoring function [Schwarz1978] of our model. The basic idea of the BIC is to select the causal structure that maximizes the loglikelihood and being penalized by the number of parameters which are necessary to specify the causal model.
Let us denote our model as a pair , where denotes the causal graph and represents all parameter for the graph structure . The BIC score of a structure can be defined as:
(8) 
where is the logarithm of the likelihood function, are the maximum likelihood estimated (MLE) parameters for , stands for the number of data points, and signifies the number of free parameters of the model.
Since in our model, each conditional probability distribution have the number of its parent variables plus one constant parameter, the total number of parameters is the number of edges plus the number of variables.
Next, we discuss the methods for obtain the MLE parameters. For the logistic variables, it is easy to estimate dependence coefficients by the maximum likelihood principle [Bishop2006]. however, for the continuous variables, as we do not assume the Gaussian noise, to obtain the exact MLE parameters, we have to estimate the noise distribution first. This would be very complicated and not easy for implementation. For simplicity of the estimation, we obtain the coefficients
estimated using ordinary leastsquare regression. Note this provide a consistent estimates. When the number of data is sufficiently large, the approximation error becomes zero.
Finaly, for the continuous variable , the local loglikelihood of the variable is given by Aapo2010 Aapo2010. For the discrete variable, we just use canonical likelihood function for logistic regression [Bishop2006].
For the discrete Bayesian network, it is well known that the BIC scoring function assigns the same score to structures in the same equivalence class. However, under assumptions described in subsection 3.1, using BIC scoring function we are expected to be able to find the unique true causal structure as well as the associated parameters.
4 Experiments
In this section, we evaluated our proposed algorithm with respect to accuracy rate of discovering causal structure through extensive experiments. We also showed that our algorithm has better performance compared to the state of the art algorithm.
4.1 Simulations
The simulation study was conducted on a set of random DAG models with both continuous and discrete variables. Each random graph was generated by adding an edge with probability 0.5 for each pair of possible nodes under constraints that the newly added edge will not introduce any directed cycle. In the data generating process, each continuous variable is caused by the function ; each discrete variable was sampled from the probability distribution . In all results presented, parameters were chosen uniformly at random in the range [1, 0.5] or [0.5, 1].
Using the BIC scoring function in equation (8), we searched for the structure with maximum score among all possible structures. Our search algorithm is implemented in Python 3. We compared our algorithm against the PC algorithm which is implemented in the latest versions of the pgmpy library. Since the discrete PC algorithm can not directly applied on the mixed continuous and discrete data, we discretized all the continuous data using mean value [Yuan and Malone2013]. Since the PC algorithm does not recover all directions of the DAG, we only measure how often the PC algorithm can correctly infer the skeleton of the true DAG, which is the undirected graph resulting from removing all arrowheads from the DAG.
Procedures used for the simulation experiments are described below.

For different number of continuous variables , we simulated 100 random DAGs with 4 variables. In total, we generated 300 random graphs.

Using BIC score as we described previously, the causal structures were estimated based on samples, respectively.

The accuracy rate of recovering causal structure was calculated based on 100 iterations.
Figure 2 provides a comparison of our proposed algorithm and the PC algorithm in terms of skeleton accuracy. First, we observed that our proposed method learns the correct causal structure as the number of data increases. The results empirically show that our method has the asymptotic consistency. In contrast, the discrete PC algorithm was not able to estimate the correct causal structure even for large numbers of samples. Second, as the number of continuous variables increases, the performance of the PC algorithm decreases. The reason is considered that some information might be lost due to discretization. The more number of continuous variables are discretized, the more information loses.
In another experiment, we assume that the skeleton of the DAG model can be obtained from some oracle procedure. Then we score each DAG which is consistent with the skeleton. Figure 3 reported the accuracy rate of this hybridoracle approach on the previous generated 300 DAG models. In comparison, we also reported the performance analysis of full search approach. A key observation is that when we know the undirected structure, the hybridoracle can learn full causal structure with accuracy, even for only 100 data samples.
This insight is quite important in practical case. Because in many analysis we might know some pair of variables are correlate but do not know the causal direction. Our BIC score and search approach can get efficiency if we know the undirected structure. On they other hand, if we start from undirected structure instead of scratch, the necessary number of data samples decreases, which is a huge advantage for practice.
5 Summary
Causal discovery from a mixture of continuous and discrete data is a important research problem and has practical value. Most existing causal discovery methods either ignore the discrete data or discretize all the continuous data. In this paper we proposed a hybrid causal model and derived the BIC scoring function for evaluating our model.
References
 [Bishop2006] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[Chen et al.2017]
YiChun Chen, Tim A. Wheeler, and Mykel J. Kochenderfer.
Learning discrete Bayesian networks from continuous data.
Journal of Artificial Intelligence Research
, 59:103–132, 2017.  [Geiger and Heckerman1994] Dan Geiger and David Heckerman. Learning gaussian networks. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (UAI 1994), pages 235–243. Morgan Kaufmann, 1994.
 [Hoyer et al.2008] Patrik O. Hoyer, Aapo Hyvärinen, Richard Scheines, Peter Spirtes, Joseph Ramsey, Gustavo Lacerda, and Shohei Shimizu. Causal discovery of linear acyclic models with arbitrary distributions. In Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence (UAI 2008), pages 282–289. AUAI Press, 2008.

[Hyvärinen et al.2010]
Aapo Hyvärinen, Kun Zhang, Shohei Shimizu, and Patrik O. Hoyer.
Estimation of a structural vector autoregression model using nongaussianity.
Journal of Machine Learning Research, 2010.  [Koller and Friedman2009] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
 [Morgan and Winship2007] Stephen L. Morgan and Christopher Winship. Counterfactuals and Causal Inference: Methods and Principles for Social Research. Cambridge University Press, 2007.
 [Park and Raskutti2015] Gunwoong Park and Garvesh Raskutti. Learning largescale poisson DAG models based on overdispersion scoring. In NIPS, pages 631–639, 2015.
 [Pearl2000] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
 [Peters et al.2011] J. Peters, D. Janzing, and B. Scholkopf. Causal inference on discrete data using additive noise models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2436–2450, 2011.
 [Schwarz1978] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
 [Shimizu et al.2006] Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear nongaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030, 2006.
 [Spirtes et al.1993] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. SpringerVerlag New York, 1993.
 [Yehezkel and Lerner2009] Raanan Yehezkel and Boaz Lerner. Bayesian network structure learning by recursive autonomy identification. Journal of Machine Learning Research, 10:1527–1570, 2009.
 [Yuan and Malone2013] Changhe Yuan and Brandon M. Malone. Learning optimal Bayesian networks: A shortest path perspective. Journal of Artificial Intelligence Research, 48:23–65, 2013.