Structure Learning from Related Data Sets with a Hierarchical Bayesian Score

08/04/2020 ∙ by Laura Azzimonti, et al. ∙ IDSIA 0

Score functions for learning the structure of Bayesian networks in the literature assume that data are a homogeneous set of observations; whereas it is often the case that they comprise different related, but not homogeneous, data sets collected in different ways. In this paper we propose a new Bayesian Dirichlet score, which we call Bayesian Hierarchical Dirichlet (BHD). The proposed score is based on a hierarchical model that pools information across data sets to learn a single encompassing network structure, while taking into account the differences in their probabilistic structures. We derive a closed-form expression for BHD using a variational approximation of the marginal likelihood and we study its performance using simulated data. We find that, when data comprise multiple related data sets, BHD outperforms the Bayesian Dirichlet equivalent uniform (BDeu) score in terms of reconstruction accuracy as measured by the Structural Hamming distance, and that it is as accurate as BDeu when data are homogeneous. Moreover, the estimated networks are sparser and therefore more interpretable than those obtained with BDeu, thanks to a lower number of false positive arcs.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Investigating challenging problems at the forefront of science increasingly requires large amounts of data that can only be gathered through collaborations between several institutions. This naturally leads to heterogeneous data sets that are in fact the collation of related, but not identical, subsets of data that will necessarily differ in the details of how they are collected. Examples can be found in multi-centre clinical trials, in which protocols are applied in slightly different ways to different patient populations; population genetics, which studies the architecture of phenotypic traits across populations and their evolution; ecology and environmental sciences, which produce different patterns of measurement errors and limitations in different environments. A common goal in analysing these complex data is to construct a mechanistic model that elucidates the interplay between different elements under investigation, either as a step towards building a causal model or to perform accurate prediction from a purely probabilistic perspective.

On the one hand, the task of efficiently modelling such related data sets can be tackled using hierarchical models (Gelman et al., 2014), which make it possible to pool the information common to the different subsets of the data while correctly encoding the information that is specific to each subset. On the other hand, Bayesian networks (BNs; Koller and Friedman, 2009) provide a rigorous approach for both causal and predictive modelling by representing variables as nodes and probabilistic dependencies as arcs in a graph. To the best of our knowledge, however, no method has been proposed in the literature to combine these two approaches to learn a single BN structure from a set of related data sets and get the best of both worlds. Available methods focus on learning an ensemble of BNs that have similar structures by penalising differences in their arc sets (Niculescu-Mizil and Caruana, 2007; Oates et al., 2016). Parameter learning from related data sets has been investigated in De Michelis et al. (2006) for Gaussian BNs and in Malovini et al. (2012)

for discrete BNs. However, they only consider a naive Bayes structure and they initialise their hyperprior with maximum likelihood point estimates.

In this paper, we show how to learn the structure of a BN from related data sets, containing the same variables, by building on our previous work on parameter learning in Azzimonti et al. (2019). In Section 2 we briefly introduce BNs and hierarchical models in the context of discrete data as well as prior work on parameter learning from related data sets. We then propose a score function for related data sets in Section 3, and we study its performance on simulated data in Section 4. Finally we discuss our results and possible future research directions in Section 5.

2 Background and Notation

Bayesian networks (BNs) are a class of graphical models that use a directed acyclic graph (DAG)

to model a set of random variables

: each node is associated with one

and arcs represent direct dependence relationships. Graphical separation of two nodes implies the conditional independence of the corresponding random variables. In principle, there are many possible choices for the joint distribution of

; literature has focused mostly on discrete BNs (Heckerman et al., 1995), in which both and the are categorical (multinomial) random variables. Other possibilities include Gaussian BNs and conditional linear Gaussian BNs (Lauritzen and Wermuth, 1989), which include both discrete and Gaussian BNs as particular cases.

The task of learning a BN from a data set of observations is performed in two steps in an inherently Bayesian fashion:


where are the parameters of . Structure learning consists in finding the DAG

that encodes the dependence structure of the data. In this paper we will focus on score-based algorithms, which are typically heuristic search algorithms that use a goodness-of-fit score such as BIC

(Schwarz, 1978) or the Bayesian Dirichlet equivalent uniform (BDeu) marginal likelihood (Heckerman et al., 1995) to find an optimal . Parameter learning involves the estimation of the parameters given the DAG learned in the first step. Thanks to the Markov property, this step is computationally efficient because if the data are complete the global distribution of decomposes into


and the local distribution associated with each node depends only on the configurations of its parents . Note that this decomposition does not uniquely identify a BN; different DAGs can encode the same global distribution, thus grouping BNs into equivalence classes (Chickering, 1995) characterised by the skeleton of (its underlying undirected graph) and its v-structures (patterns of arcs of the type , with no arc between and ).

2.1 Classic Multinomial-Dirichlet Parameterisation

In the case of discrete BNs, we assume that each follows a categorical distribution for each configuration of . Hence the parameters of

are the conditional probabilities

, whose th element corresponds to , for which we assume a conjugate Dirichlet prior:



is a hyperparameter vector defined over a simplex with sum

. The posterior estimator of is:


where represents the number of observations for which and . It is common to set with the same imaginary sample size for all .

In the context of structure learning, we have and we can use as a score function. (Implicitly, we are saying that by disregarding it.) Assuming positivity (), parameter independence (columns of associated to different parent configurations are independent), parameter modularity ( associated with different nodes are independent) and complete data, Heckerman et al. (1995) derived a closed form expression for known as the Bayesian Dirichlet (BD) family of scores:


Choosing again gives the Bayesian Dirichlet equivalent uniform (BDeu) score. A default value of has been recommended by Ueno (2010). Assuming a uniform prior for both and is common in the literature, even if they can have serious impact on the accuracy of the learned structures (Scutari, 2016), especially for sparse data which are likely to lead to violations of the positivity assumption (Scutari, 2018). These assumptions are taken to represent lack of prior knowledge, and they result in BDeu giving the same score to BNs in the same equivalence class (score-equivalence). BDeu is the only BD score with this property.

2.2 Hierarchical Multinomial-Dirichlet Parameterisation for Related Data Sets







Figure 1: Directed factor graphs representing hierarchical Multinomial-Dirichlet model for related data sets (top panel) and its variational approximation (bottom panel). Cat. and Dir. represent respectively Categorical and Dirichlet distributions.

The classic Multinomial-Dirichlet model in (3) can be extended to handle related data sets by treating it as a particular case of the hierarchical Multinomial-Dirichlet (hierarchical MD) model presented in Azzimonti et al. (2019). For this purpose, we introduce an auxiliary variable which identifies the related data sets. Assuming that the data sets contain the same variables and that is always observed, we can learn a BN with a common structure but with different parameter estimates for each related data set.

For simplicity, we apply the hierarchical model independently to each local distribution to estimate the joint distribution of conditional on , , by pooling information between different data sets. The resulting hierarchical model is shown in the top panel of Figure 1. Specifically, for each node we assume to be a latent random vector and we add a Dirichlet hyperprior to make a mixture of Dirichlet distributions:


where this time is a latent random vector defined over a simplex with sum . The new hyperparameters of this model are the imaginary sample size and the parameter vector , which in turn is defined over a simplex with sum ; in the following we will omit both for brevity.

The marginal posterior distribution for is not analytically tractable, as noted in Azzimonti et al. (2019). However, the posterior average can be compactly expressed as:


represents the posterior average of ; it cannot be written in closed form but can be approximated using variational inference (Jordan et al., 1999; Wainwright and Jordan, 2008). The resulting are data-set-specific but depend on all the available data via the partial pooling (Gelman et al., 2014) of the information present in the related data sets, thanks to the shared term. On the one hand, this produces more reliable estimates for sparse data and for related data sets with unbalanced sample sizes (Casella and Moreno, 2009). On the other hand, the prior in (6) violates the parameter independence assumption, leading to a marginal likelihood that does not decompose over parent configurations and that is not score-equivalent. The prior is specified on , as opposed to , which is only later computed from the joint distribution. As a result, because and are estimated by applying the hierarchical model separately to two different sets of variables, thus pooling the available information differently.

3 Structure Learning from Related Data Sets

In this section we derive the marginal likelihood score associated with the hierarchical model in (6) to implement structure learning from related data sets that contain the same variables. As the hierarchical model is not analytically tractable, we replace it with the approximate variational model shown in the bottom panel of Figure 1:


where , with ; ; and ; and for

. These parameters are estimated from the available data by minimising the Kullback-Leibler divergence between the exact posterior distribution

and its variational approximation , as described in Azzimonti et al. (2019). Since is assumed to be the parent of any node in the network and to be always observed, we treat it as an input variable in a conditional Bayesian network (Koller and Friedman, 2009, Section 5.6) and we do not explicitly assign it a distribution. Therefore, the auxiliary variable will not influence the score.

The variational model (8) is similar to the original hierarchical MD model (6), but it removes the dependence between and thus making it possible to derive an approximation of the marginal likelihood .

Given complete and related data sets , under the assumption that the related data sets have the same dependence structure and that each local distribution follows the hierarchical MD (6) with positive parameters, the marginal likelihood can be approximated with the quantity


where represents the posterior average of under the variational model (8).

Starting from the variational model (8), we can derive the conditional distribution of given and , with ; and , as a categorical distribution with parameters , whose th element is with . Thanks to the properties of Dirichlet distributions, and , whose th element is defined as , are still distributed as Dirichlet distributions respectively with parameters , whose th element is , and , whose th element is .

The approximated conditional distribution satisfies independence between related data sets. Moreover, given a data set , both parameter modularity and parameter independence are satisfied. Thus,

Thanks to the independence between and induced by the variational model and the fact that , we obtain

Since we derive


The parameter is not known a priori, but it is estimated a posteriori as , see Azzimonti et al. (2019) for details on the derivation. Hence a posteriori is distributed as a . Thanks to this relationship and the conjugacy between Categorical and Dirichlet distributions, we can replace with in (10), thus obtaining (9).

Note that the marginal likelihood (9) has the same form as the classic BD score (5), with replaced by , which represents the posterior average of under the hierarchical variational model. The posterior average is shared between different related data sets, thus inducing a pooling effect that makes and dependent once more.

From Lemma 9, we define the approximated Bayesian hierarchical Dirichlet score as

The proposed BHD score can be factorised over the nodes, i.e.,

and can be used to learn a common structure for all related data sets, taking into account potential differences in the probabilistic relationships between variables.

4 Simulation Study

We now compare the empirical performance of BHD to that of BDeu and BIC with a simulation study; for brevity, we will not discuss the results for BIC in detail since they are fundamentally the same as those for BDeu. In particular, we are interested in structure learning in the following two scenarios:

  1. the true underlying network is the same for all the related data sets;

  2. the true underlying network is the same for all the related data sets, apart from data sets in which randomly selected arcs have been removed.

For each scenario, we consider three different models for the local distributions of each node:

  1. the parameters associated with each of the related data sets are sampled from a hierarchical Dirichlet distribution with imaginary sample size equal to 10 and parameter that is in turn sampled from a Dirichlet distribution with all ;

  2. the parameters associated with each of the related data sets are independently sampled from the same Dirichlet distribution with imaginary sample size equal to 10 and all ;

  3. the parameters are identical for all data sets.

The first approach follows the distributional assumptions of the hierarchical model underlying BHD and may favour the proposed score. The last approach may favour methods not considering that data may comprise related data sets, which pool all the data and assume they are generated from the same distribution. The second approach is a middle ground between the first and the third, since parameters associated to the related data sets are different but they are not generated from the hierarchical model.

Figure 2: Boxplots of SHD difference between BHD and BDeu score for scenario (a) (left panel) and (b) (right panel). Positive values favour the hierarchical score.
Figure 3: Boxplots of SHD difference between BHD and BDeu score for scenario (a) (equal structures) with different values of number of variables , number of related groups , number of states and number of observations , with parameters sampled with the hierarchical (left panels), i.i.d (central panels) or identical distribution (right panel) approach. Positive values favour the hierarchical score.

To evaluate the performance of BHD, we first sample 3 network structures for each of three different levels of sparsity, such that the proportion of the number of arcs per node is equal to , and each combination of:

  • , where represents the number of nodes;

  • , where represents the number of related data sets;

  • , where represents the number of states for each variable.

Then, for both scenario (a) and (b), we replicate the same structure for all the related data sets. In scenario (b), for each of the data sets differing from the others, we randomly remove arcs from the network, with and . Thus, in scenario (b) we deal with structures that differ from one another and from the main structure by arcs.

Once the structures have been generated, we sample 10 different parameter sets for each of hier, iid and id; and, for each of these parameter sets, we sample 10 times related data sets, each of them composed of the same number of observations . We then perform structure learning on the resulting data by means of the hill-climbing implementation in bnlearn (Scutari, 2010) with BHD and BDeu scores. For both scores we use the imaginary sample . In the case of BDeu we pool all the available data from different related data sets.

We evaluate the accuracy in reconstructing the network with the Structural Hamming Distance (SHD) between the estimated and the true underlying structure, True Positive (TP), False Positive (FP) and False Negative (FN) arcs.

Figure 4: Boxplots of SHD difference between BHD and BDeu score for scenario (b) (different structures) with different values of number of related groups and number of observations , with parameters sampled with the hierarchical (left panels), i.i.d (central panels) or identical distribution (right panel) approach. Positive values favour the hierarchical score.

The difference between BDeu and BHD in terms of SHD for scenarios (a) and (b) is shown in Figure 2, left and right panel respectively. Positive values favour the proposed BHD score. When parameters are sampled from a hierarchical distribution (hier), BHD outperforms BDeu in both scenarios, with a larger improvement in scenario (b). In the iid case, BHD is competitive with BDeu when the underlying network structures are homogeneous, and it outperforms BDeu when the underlying network structures are different. On the other hand, in the id case BDeu has better accuracy than BHD because it correctly assumes that all the data are generated form the same distribution, while BHD has a large number of redundant parameters that would model the non-existing related data sets.

Figure 3 shows how the SHD difference between BHD and BDeu varies for different simulation parameters in scenario (a). Specifically, as the number of variables or the number of related data sets increase, the differences between BHD and BDeu (positive for hier and iid, negative for id), become increasingly large in magnitude. On the other hand, as the number of states increases the differences in accuracy between BHD and BDeu gradually decrease. For what concerns the sample size, BHD increasingly outperforms BDeu in both hier and iid as increases. In the id case we expect the two scores to be asymptotically equivalent; the values we consider for are not large enough to clearly show this empirically.

Figure 4 shows the relationship between the difference in SHD and some key simulation parameters in scenario (b). The effect of both the number of related data sets and the number of observations is more marked than in scenario (a). For the same and , BHD outperforms BDeu by a larger margin when some network structures are different (scenario (b)) compared to when they are all identical (scenario (a)).

Figure 5 compares the difference in TP (left), FP (center) and FN (right panel) between BDeu and BHD for scenario (b). Positive values favour the proposed BHD score. While the two methods perform similarly in terms of TP and FN, BHD outperforms BDeu in terms of FP in the hier case. The structures learned by BHD are thus sparser and more interpretable than those learned by BDeu.

We also perform some experiments with different values of the imaginary sample size. As increases, BHD achieves marginally lower SHDs. However, its average SHD is not significantly different from that of BDeu for the same (figures not shown for reasons of space).

Figure 5: Boxplots of TP (left panel), FP (central panel) and FN (right panel) difference between BHD and BDeu score for scenario (b) (different structures). Positive values always favour the hierarchical score.

5 Conclusions and future work

In this work we propose a new Bayesian score, BHD, to learn a common BN structure from related data sets. BHD assumes that the joint distribution in each node of the network follows a mixture of Dirichlet distributions, thus pooling information between the data sets. We find that BHD outperforms both BDeu and BIC when applied to data that are composed by related data sets; and that it has comparable performance to BDeu and BIC when data are a single, homogeneous data set.

Learning a common BN structure with BHD builds on and complements our previous work on parameter learning from related data sets, described in Azzimonti et al. (2019)

. We can use the latter to learn the parameters associated with a network structure learned using BHD, thus obtaining different BNs (one for each related data set) with the same structure and related parameters. Combining the two approaches may increase the performance of the BN models such as BN classifiers when dealing with related data sets.

The assumptions underlying BHD can be relaxed in several ways to extend its applicability to more complex scenarios. For instance, by relaxing the assumption that related data sets share the same dependence structure, it would be possible to detect independencies that hold only in certain contexts, similarly to Boutilier et al. (1996). In this case, the context-specific independence would be directly modelled by learning different but related network structures for each data set. Another interesting development would be to derive a conditional independence tests from BHD to learn BNs from related data sets with constraint-based algorithm similarly to, e.g., Tillman (2009).

We would like to acknowledge support for this project from the Swiss National Science Foundation (NSF, Grant No. IZKSZ2_162188).


  • L. Azzimonti, G. Corani, and M. Zaffalon (2019) Hierarchical Estimation of Parameters in Bayesian Networks. Computational Statistics & Data Analysis 137, pp. 67–91. Cited by: §1, §2.2, §2.2, §3, §3, §5.
  • C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller (1996) Context-Specific Independence in Bayesian Networks. In

    Proceedings of the 12th International Conference on Uncertainty in Artificial Intelligence

    UAI ’ 96, pp. 115– 123. Cited by: §5.
  • G. Casella and E. Moreno (2009)

    Assessing Robustness of Intrinsic Tests of Independence in Two-Way Contingency Tables

    Journal of the American Statistical Association 104 (487), pp. 1261–1271. Cited by: §2.2.
  • D. M. Chickering (1995) A Transformational Characterization of Equivalent Bayesian Network Structures. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, UAI ’ 95, pp. 87–98. Cited by: §2.
  • F. De Michelis, P. Magni, P. Piergiorgi, M. A. Rubin, and R. Bellazzi (2006) A Hierarchical Naive Bayes Model for Handling Sample Heterogeneity in Classification Problems: an Application to Tissue Microarrays. BMC bioinformatics 7 (1), pp. 514. Cited by: §1.
  • A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin (2014) Bayesian Data Analysis. 3rd edition, CRC press. Cited by: §1, §2.2.
  • D. Heckerman, D. Geiger, and D. M. Chickering (1995) Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning 20 (3), pp. 197–243. Note: Available as Technical Report MSR-TR-94-09 Cited by: §2.1, §2, §2.
  • M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul (1999) An Introduction to Variational Methods for Graphical Models. Machine Learning 37 (2), pp. 183–233. Cited by: §2.2.
  • D. Koller and N. Friedman (2009) Probabilistic Graphical Models: Principles and Techniques. MIT press. Cited by: §1, §3.
  • S. L. Lauritzen and N. Wermuth (1989) Graphical Models for Associations Between Variables, Some of Which are Qualitative and Some Quantitative. The Annals of Statistics 17 (1), pp. 31–57. Cited by: §2.
  • A. Malovini, N. Barbarini, R. Bellazzi, and F. De Michelis (2012) Hierarchical Naive Bayes for Genetic Association Studies. BMC bioinformatics 13 (14), pp. S6. Cited by: §1.
  • A. Niculescu-Mizil and R. Caruana (2007) Inductive Transfer for Bayesian Network Structure Learning. In Proceedings of Artificial Intelligence and Statistics, pp. 339–346. Cited by: §1.
  • C. J. Oates, J. Q. Smith, S. Mukherjee, and J. Cussens (2016) Exact Estimation of Multiple Directed Acyclic Graphs. Statistics and Computing 26 (4), pp. 797–811. Cited by: §1.
  • G. Schwarz (1978) Estimating the Dimension of a Model. The Annals of Statistics 6 (2), pp. 461–464. Cited by: §2.
  • M. Scutari (2010) Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software 35 (3), pp. 1–22. Cited by: §4.
  • M. Scutari (2016) An Empirical-Bayes Score for Discrete Bayesian Networks. Journal of Machine Learning Research (Proceedings Track, PGM 2016) 52, pp. 438–448. Cited by: §2.1.
  • M. Scutari (2018) Dirichlet Bayesian Network Scores and the Maximum Relative Entropy Principle. Behaviormetrika 45 (2), pp. 337–362. Cited by: §2.1.
  • R. E. Tillman (2009) Structure Learning with Independent Non-Identically Distributed Data. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pp. 1041–1048. Cited by: §5.
  • M. Ueno (2010) Learning Networks Determined by the Ratio of Prior and Data. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pp. 598–605. Cited by: §2.1.
  • M. J. Wainwright and M. I. Jordan (2008) Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning 1 (1-2), pp. 1–305. Cited by: §2.2.