Entropy-based Pruning for Learning Bayesian Networks using BIC

07/19/2017 ∙ by Cassio P. de Campos, et al.

For decomposable score-based structure learning of Bayesian networks, existing approaches first compute a collection of candidate parent sets for each variable and then optimize over this collection by choosing one parent set for each variable without creating directed cycles while maximizing the total score. We target the task of constructing the collection of candidate parent sets when the score of choice is the Bayesian Information Criterion (BIC). We provide new non-trivial results that can be used to prune the search space of candidate parent sets of each node. We analyze how these new results relate to previous ideas in the literature both theoretically and empirically. We show in experiments with UCI data sets that gains can be significant. Since the new pruning rules are easy to implement and have low computational costs, they can be promptly integrated into all state-of-the-art methods for structure learning of Bayesian networks.




1 Introduction

A Bayesian network (pearl1988, ) is a well-known probabilistic graphical model with applications in a variety of fields. It is composed of (i) a directed acyclic graph (DAG) where each node is associated with a random variable and arcs represent dependencies between the variables, entailing the Markov condition: every variable is conditionally independent of its non-descendant variables given its parents; and (ii) a set of conditional probability mass functions defined for each variable given its parents in the graph. Their graphical nature makes Bayesian networks excellent models for representing the complex probabilistic relationships existing in many real problems ranging from bioinformatics to law, from image processing to economic risk analysis.

Learning the structure (that is, the graph) of a Bayesian network from complete data is an NP-hard task (chickering2004, ). We are interested in score-based learning, namely finding the structure which maximizes a score that depends on the data (HGC95, ). A typical first step of methods for this purpose is to build a list of suitable candidate parent sets for each one of the variables of the domain. Later an optimization is run to find one element from each such list in a way that maximizes the total score and does not create directed cycles. This work concerns pruning ideas in order to build those lists. The problem is unlikely to admit a polynomial-time (in $n$, the number of variables) algorithm, since it is proven to be LOGSNP-hard (koivisto2006parent, ). Because of that, usually one forces a maximum in-degree $k$ (number of parents per node) and then simply computes the score of all parent sets that contain up to $k$ parents. A notable exception is the greedy search of the K2 algorithm (cooper1992bayesian, ).

A high in-degree implies a large search space for the optimization and thus increases the possibility of finding better structures. On the other hand, it requires higher computational time, since there are $\Theta(n^k)$ candidate parent sets per variable for an in-degree bound of $k$ if an exhaustive search is performed. Our contribution is to provide new rules for pruning sub-optimal parent sets when dealing with the Bayesian Information Criterion score (schwarz1978, ), one of the most used score functions in the literature. We devise new theoretical bounds that can be used in conjunction with currently published ones (decampos2011a, ). The new results provide tighter bounds on the maximum number of parents of each variable in the optimal graph, as well as new pruning techniques that can be used to skip large portions of the search space without any loss of optimality. Moreover, the bounds can be efficiently computed and are easy to implement, so they can be promptly integrated into existing software for learning Bayesian networks and yield immediate computational gains.

The paper is organized as follows. Section 2 presents the problem, some background and notation. Section 3 describes the existing results in the literature, and Section 4 contains the theoretical developments for the new bounds and pruning rules. Section 5 shows empirical results comparing the new results against previous ones, and finally some conclusions are given in Section 6.

2 Structure learning of Bayesian networks

Consider the problem of learning the structure of a Bayesian network from a complete data set of $N$ instances $\mathcal{D} = \{D_1, \ldots, D_N\}$. The set of $n$ categorical random variables is denoted by $\mathcal{X} = \{X_1, \ldots, X_n\}$ (each variable has at least two categories). The state space of $X_i$ is denoted $\Omega_{X_i}$ and a joint space for $\mathcal{X}_1 \subseteq \mathcal{X}$ is denoted by $\Omega_{\mathcal{X}_1}$ (and, with a slight abuse of notation, $\Omega_{\emptyset}$ contains a single null element). The goal is to find the best DAG $G = (V, E)$, where $V$ is the collection of nodes (associated one-to-one with the variables in $\mathcal{X}$) and $E$ is the collection of arcs. $E$ can be represented by the (possibly empty) sets of parents $\Pi_1, \ldots, \Pi_n$ of each node/variable.

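Different score functions can be used to assess the quality of a DAG. This paper regards the Bayesian Information Criterion (or simply BIC) (schwarz1978, ), which asymptotically approximates the posterior probability of the DAG. The BIC score is decomposable, that is, it can be written as a sum of the scores of each variable and its parent set:

$$\mathrm{BIC}(G) = \sum_{i=1}^{n} \mathrm{BIC}(X_i, \Pi_i) = \sum_{i=1}^{n} \left( \mathrm{LL}(X_i \mid \Pi_i) + \mathrm{Pen}(X_i, \Pi_i) \right),$$

where $\mathrm{LL}(X_i \mid \Pi_i)$ denotes the log-likelihood of $X_i$ and its parent set:

$$\mathrm{LL}(X_i \mid \Pi_i) = \sum_{x \in \Omega_{X_i}} \sum_{\pi \in \Omega_{\Pi_i}} N_{x,\pi} \log_b \hat{\theta}_{x \mid \pi},$$

where the base $b$ of the logarithm is usually taken as natural or 2. We will make it clear when a result depends on such base. Moreover, $\hat{\theta}_{x \mid \pi}$ is the maximum likelihood estimate of the conditional probability $P(X_i = x \mid \Pi_i = \pi)$, that is, $N_{x,\pi} / N_{\pi}$; $N_{\pi}$ represents the number of times $(\Pi_i = \pi)$ appears in the data set, and $N_{x,\pi}$ analogously for $(X_i = x, \Pi_i = \pi)$ (if $\pi$ is null, then $N_{\pi} = N$ and $N_{x,\pi} = N_x$). In the case with no parents, we use the notation $\mathrm{LL}(X_i) = \mathrm{LL}(X_i \mid \emptyset)$. $\mathrm{Pen}(X_i, \Pi_i)$ is the complexity penalization for $X_i$ and its parent set:

$$\mathrm{Pen}(X_i, \Pi_i) = -\frac{\log_b N}{2} \cdot (|\Omega_{X_i}| - 1) \cdot |\Omega_{\Pi_i}|,$$

again with the notation $\mathrm{Pen}(X_i) = \mathrm{Pen}(X_i, \emptyset)$.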
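The decomposition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bic_local`, the row-of-tuples data layout and the choice of reading each state-space size off the values observed in a column are all hypothetical, and the penalty is taken to be $-\frac{\log_b N}{2}(|\Omega_X|-1)|\Omega_\Pi|$, the standard BIC form.

```python
import math
from collections import Counter


def bic_local(data, child, parents, base=2.0):
    """Return (BIC, LL, Pen) for column `child` given the columns in `parents`.

    `data` is a list of equally long tuples of categorical values (complete data).
    Assumption: the state space of a column is the set of values observed in it.
    """
    n_rows = len(data)
    parents = list(parents)
    # Counts N_{x,pi} of joint configurations and N_pi of parent configurations.
    joint = Counter((row[child], tuple(row[p] for p in parents)) for row in data)
    par = Counter(tuple(row[p] for p in parents) for row in data)
    # LL = sum N_{x,pi} * log(N_{x,pi} / N_pi); unseen configurations add nothing.
    ll = sum(c * math.log(c / par[pi], base) for (_, pi), c in joint.items())
    # Cardinalities of the child and of each parent.
    cards = [len({row[j] for row in data}) for j in [child] + parents]
    # Pen = -(log N / 2) * (|Omega_X| - 1) * |Omega_Pi|; the empty product is 1.
    pen = -0.5 * math.log(n_rows, base) * (cards[0] - 1) * math.prod(cards[1:])
    return ll + pen, ll, pen


if __name__ == "__main__":
    toy = [(0, 0), (0, 0), (1, 1), (1, 1)]  # column 0 is a copy of column 1
    print(bic_local(toy, 0, [1]))  # parent explains the child perfectly
    print(bic_local(toy, 0, []))   # no parents
```

On the toy data the parent set {1} has zero log-likelihood loss but a larger penalty than the empty set, which is exactly the trade-off the pruning rules exploit.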

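For completeness, we present the definition of (conditional) mutual information. Let $\mathcal{X}_1$, $\mathcal{X}_2$, $\mathcal{X}_3$ be pairwise disjoint subsets of $\mathcal{X}$. Then

$$I(\mathcal{X}_1; \mathcal{X}_2 \mid \mathcal{X}_3) = H(\mathcal{X}_1 \mid \mathcal{X}_3) - H(\mathcal{X}_1 \mid \mathcal{X}_2 \cup \mathcal{X}_3)$$

(the unconditional version is obtained with $\mathcal{X}_3 = \emptyset$), and (the sample estimate of) entropy is defined as usual: $H(\mathcal{X}_1 \mid \mathcal{X}_2) = H(\mathcal{X}_1 \cup \mathcal{X}_2) - H(\mathcal{X}_2)$ and

$$H(\mathcal{X}_1) = -\sum_{x} \frac{N_x}{N} \log_b \frac{N_x}{N}$$

($x$ runs over the configurations of $\mathcal{X}_1$). Since $\hat{\theta}_x = N_x / N$, it is clear that $N \cdot H(\mathcal{X}_1 \mid \mathcal{X}_2) = -\mathrm{LL}(\mathcal{X}_1 \mid \mathcal{X}_2)$ for any disjoint subsets $\mathcal{X}_1, \mathcal{X}_2 \subseteq \mathcal{X}$.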
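As a minimal sketch of these sample estimates, the following Python helpers compute empirical entropy, conditional entropy and conditional mutual information from counts. The function names and the row-of-tuples data layout are illustrative assumptions, and conditional entropy is taken as $H(\mathcal{X}_1 \cup \mathcal{X}_2) - H(\mathcal{X}_2)$.

```python
import math
from collections import Counter


def entropy(data, cols, base=2.0):
    """Empirical joint entropy of the columns in `cols` (0 for an empty list)."""
    n_rows = len(data)
    counts = Counter(tuple(row[c] for c in cols) for row in data)
    return -sum((c / n_rows) * math.log(c / n_rows, base) for c in counts.values())


def cond_entropy(data, xs, ys, base=2.0):
    """H(xs | ys) = H(xs, ys) - H(ys)."""
    return entropy(data, list(xs) + list(ys), base) - entropy(data, list(ys), base)


def cond_mutual_info(data, xs, ys, zs, base=2.0):
    """I(xs; ys | zs) = H(xs | zs) - H(xs | ys, zs)."""
    return cond_entropy(data, xs, zs, base) - cond_entropy(
        data, xs, list(ys) + list(zs), base
    )


if __name__ == "__main__":
    independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
    copied = [(0, 0), (0, 0), (1, 1), (1, 1)]
    print(cond_mutual_info(independent, [0], [1], []))
    print(cond_mutual_info(copied, [0], [1], []))
```

Multiplying `cond_entropy(data, [x], parents)` by the number of rows gives the negated log-likelihood of `x` given `parents`, so a score cache that stores log-likelihoods already holds these conditional entropies.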

The ultimate goal is to find $G^{*} \in \arg\max_{G} \mathrm{BIC}(G)$ (we avoid equality because there might be multiple optima). We assume that if two DAGs $G_1$ and $G_2$ have the same score, then we prefer the graph with fewer arcs. The usual first step to achieve such goal is the task of finding the candidate parent sets for a given variable $X_i$ (obviously a candidate parent set cannot contain $X_i$ itself). This task regards constructing the list $L_i$ of parent sets $\Pi_i$ for $X_i$ alongside their scores $\mathrm{BIC}(X_i, \Pi_i)$. Without any restriction, there are $2^{n-1}$ possible parent sets, since every subset of $\mathcal{X} \setminus \{X_i\}$ is a candidate. Each score computation costs $\Theta(N \cdot (1 + |\Pi_i|))$, and the number of score computations becomes quickly prohibitive with the increase of $n$. In order to avoid losing global optimality, we must guarantee that $L_i$ contains candidate parent sets that cover those in an optimal DAG. For instance, if we apply a bound $k$ on the number of parents that a variable can have, then the size of $L_i$ is $\Theta(n^k)$, but we might lose global optimality (this is the case if any optimal DAG would have more than $k$ parents for $X_i$). Irrespective of that, this pruning is not enough if $n$ is large. Bounds greater than 2 can already become prohibitive. For instance, a bound of $k = 2$ is adopted in (Bartlett2015, ) when dealing with its largest data set (diabetes), which contains 413 variables. One way of circumventing the problem is to apply pruning rules which allow us to discard/ignore elements of $L_i$ in such a way that an optimal parent set is never discarded/ignored.
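To make the growth of the search space concrete, the short sketch below counts the candidate parent sets of one variable, with and without an in-degree bound. The function name is hypothetical, and the in-degree values in the usage lines are illustrative choices rather than figures from the experiments; only the 413-variable size comes from the text.

```python
from math import comb


def num_candidate_parent_sets(n_vars, max_in_degree=None):
    """Number of candidate parent sets for one variable among `n_vars` variables.

    Without a bound every subset of the other n_vars - 1 variables is a candidate;
    with a bound k only subsets of size at most k are counted.
    """
    others = n_vars - 1
    k = others if max_in_degree is None else min(max_in_degree, others)
    return sum(comb(others, size) for size in range(k + 1))


if __name__ == "__main__":
    print(num_candidate_parent_sets(8))        # unrestricted, 8 variables
    print(num_candidate_parent_sets(413, 2))   # 413 variables, at most 2 parents
    print(num_candidate_parent_sets(413, 3))   # one more parent allowed
```

The count is per variable, so the total number of scores to compute is this figure multiplied by the number of variables.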

3 Pruning rules

The simplest pruning rule one finds in the literature states that if a candidate subset has better score than a candidate set, then such candidate set can be safely ignored, since the candidate subset will never yield directed cycles if the candidate set itself does not yield cycles (Teyssier+Koller:UAI05, ; deCampos2009, ). By safely ignoring/discarding a candidate set we mean that we are still able to find an optimal DAG (so no accuracy is lost) even if such parent set is never used. This is formalized as follows.

Lemma 1.

(Theorem 1 in (decampos2011a, ), but also found elsewhere (Teyssier+Koller:UAI05, ).) Let $\Pi'$ be a candidate parent set for the node $X \in \mathcal{X}$. Suppose there exists a parent set $\Pi$ such that $\Pi \subset \Pi'$ and $\mathrm{BIC}(X, \Pi) \ge \mathrm{BIC}(X, \Pi')$. Then $\Pi'$ can be safely discarded from the list of candidate parent sets of $X$.

This result can also be written in terms of the list of candidate parent sets. In order to find an optimal DAG for the structure learning problem, it is sufficient to work with

$$L_i = \{ \Pi_i \subseteq \mathcal{X} \setminus \{X_i\} : \forall\, \Pi'_i \subset \Pi_i,\ \mathrm{BIC}(X_i, \Pi_i) > \mathrm{BIC}(X_i, \Pi'_i) \}.$$

Unfortunately there is no way of applying Lemma 1 without computing the scores of all candidate sets, and hence it provides no speed-up for building the list (it is nevertheless useful for later optimizations, but that is not the focus of this work).

There are however pruning rules that can reduce the computation time for finding $L_i$ and that are still safe.
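The filtering implied by Lemma 1 can be sketched as follows, assuming all scores have already been computed and stored in a dictionary keyed by `frozenset` parent sets. The function name and the data layout are illustrative; the quadratic scan over proper subsets is chosen for brevity, not efficiency.

```python
def prune_dominated(scores):
    """Keep only parent sets that score strictly better than all their proper subsets.

    `scores` maps frozenset(parents) -> BIC score of the variable with those parents.
    A parent set is dropped when some scored proper subset is at least as good.
    """
    kept = {}
    for parents, score in scores.items():
        # `sub < parents` is the proper-subset test on frozensets.
        if all(score > sub_score for sub, sub_score in scores.items() if sub < parents):
            kept[parents] = score
    return kept


if __name__ == "__main__":
    scores = {
        frozenset(): -5.0,
        frozenset({1}): -2.0,
        frozenset({2}): -6.0,
        frozenset({1, 2}): -3.0,
    }
    print(sorted(map(sorted, prune_dominated(scores))))
```

In the toy dictionary, {2} loses to the empty set and {1, 2} loses to {1}, so only the empty set and {1} survive. Every score still had to be computed first, which is why this rule alone does not speed up the construction of the list.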

Lemma 2.

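Let $\Pi \subseteq \Pi'$ be candidate parent sets for $X \in \mathcal{X}$. Then $\mathrm{LL}(X \mid \Pi) \le \mathrm{LL}(X \mid \Pi')$, $H(X \mid \Pi) \ge H(X \mid \Pi')$, and $\mathrm{Pen}(X, \Pi) \ge \mathrm{Pen}(X, \Pi')$.

Proof. The inequalities follow directly from the definitions of log-likelihood, entropy and penalization. ∎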

Lemma 3.

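(Theorem 4 in (decampos2011a, ). There is an imprecision in Theorem 4 of (decampos2011a, ), since a term as defined there does not account for the constant of BIC/AIC while in fact it should. In spite of that, their desired result is clear. We present a proof for completeness.) Let $X \in \mathcal{X}$ be a node with $\Pi \subset \Pi'$ two candidate parent sets, such that $\mathrm{BIC}(X, \Pi) \ge \mathrm{Pen}(X, \Pi')$. Then $\Pi'$ and all its supersets can be safely ignored when building the list of candidate parent sets for $X$.

Proof. Let $\Pi'' \supseteq \Pi'$. By Lemma 2, we have $\mathrm{Pen}(X, \Pi') \ge \mathrm{Pen}(X, \Pi'')$ (equality only if $\Pi' = \Pi''$). Then $\mathrm{BIC}(X, \Pi) \ge \mathrm{Pen}(X, \Pi') \ge \mathrm{Pen}(X, \Pi'')$, and we have $\mathrm{LL}(X \mid \Pi'') \le 0$, so $\mathrm{BIC}(X, \Pi) \ge \mathrm{Pen}(X, \Pi'') \ge \mathrm{BIC}(X, \Pi'')$ and Lemma 1 suffices to conclude the proof. ∎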

Note that $\mathrm{BIC}(X, \Pi) \ge \mathrm{Pen}(X, \Pi')$ can as well be written as $\mathrm{LL}(X \mid \Pi) \ge \mathrm{Pen}(X, \Pi') - \mathrm{Pen}(X, \Pi)$, and if $\Pi' = \Pi \cup \{Y\}$ for some $Y \notin \Pi$, then it can be written also as $\mathrm{LL}(X \mid \Pi) \ge (|\Omega_Y| - 1) \cdot \mathrm{Pen}(X, \Pi)$. The reasoning behind Lemma 3 is that the maximum improvement that we can have in BIC score by inserting new parents into $\Pi$ would be achieved if $\mathrm{LL}(X \mid \Pi)$, which is a non-positive value, grew all the way to zero, since the penalization only gets worse with more parents. If $\mathrm{LL}(X \mid \Pi)$ is already close enough to zero, then the loss in the penalty part cannot be compensated by the gain of likelihood. The result holds for every superset because both likelihood and penalty are monotone with respect to increasing the number of parents.
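The check behind Lemma 3 needs only the log-likelihood of the current parent set and a few cardinalities. The sketch below assumes the penalty $-\frac{\log_b N}{2}(|\Omega_X|-1)|\Omega_\Pi|$ and tests whether the BIC score of the current parent set is at least the penalty of the set extended by one variable; function and parameter names are hypothetical.

```python
import math


def penalty(n_rows, card_x, card_parents, base=2.0):
    """Pen(X, Pi) = -(log N / 2) * (|Omega_X| - 1) * |Omega_Pi|."""
    return -0.5 * math.log(n_rows, base) * (card_x - 1) * math.prod(card_parents)


def lemma3_prunes(ll, n_rows, card_x, card_parents, card_y, base=2.0):
    """True when Pi + {Y} (and its supersets) can be skipped for variable X.

    `ll` is LL(X | Pi); `card_parents` lists the cardinalities of the members of Pi
    and `card_y` is the cardinality of the variable Y about to be added.
    """
    bic_current = ll + penalty(n_rows, card_x, card_parents, base)
    pen_extended = penalty(n_rows, card_x, list(card_parents) + [card_y], base)
    return bic_current >= pen_extended


if __name__ == "__main__":
    # Binary X, one binary parent, binary Y, 100 rows.
    print(lemma3_prunes(-5.0, 100, 2, [2], 2))   # likelihood already close to zero
    print(lemma3_prunes(-50.0, 100, 2, [2], 2))  # still room for improvement
```

With 100 rows and binary variables the rule fires only once the log-likelihood is above roughly -6.64 (in base 2), which illustrates why it mostly helps for parent sets that already fit well.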

4 Novel pruning rules

In this section we devise novel pruning rules by exploiting the empirical entropy of variables. We later demonstrate that such rules are useful to ignore elements while building the list that cannot be ignored by Lemma 3, hence tightening the pruning results available in the literature. In order to achieve our main theorem, we need some intermediate results.

Lemma 4.

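Let $\Pi' = \Pi \cup \{Y\}$ for $Y \notin \Pi$, with $\Pi, \Pi'$ candidate parent sets for $X \in \mathcal{X}$. Then $\mathrm{LL}(X \mid \Pi') - \mathrm{LL}(X \mid \Pi) \le N \cdot \min\{H(X \mid \Pi), H(Y \mid \Pi)\}$.

Proof. This comes from simple manipulations and known bounds to the value of conditional mutual information: $\mathrm{LL}(X \mid \Pi') - \mathrm{LL}(X \mid \Pi) = N \cdot (H(X \mid \Pi) - H(X \mid \Pi')) = N \cdot I(X; Y \mid \Pi)$, and conditional mutual information never exceeds either of the two conditional entropies $H(X \mid \Pi)$ and $H(Y \mid \Pi)$. ∎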

Theorem 1.

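Let $X \in \mathcal{X}$, and $\Pi$ be a parent set for $X$. Let $Y \in \mathcal{X} \setminus (\Pi \cup \{X\})$ such that $N \cdot \min\{H(X \mid \Pi), H(Y \mid \Pi)\} \le (1 - |\Omega_Y|) \cdot \mathrm{Pen}(X, \Pi)$. Then the parent set $\Pi' = \Pi \cup \{Y\}$ and all its supersets can be safely ignored when building the list of candidate parent sets for $X$.

Proof. We have that

$$\begin{aligned}
\mathrm{BIC}(X, \Pi') &= \mathrm{LL}(X \mid \Pi') + \mathrm{Pen}(X, \Pi') \\
&\le \mathrm{LL}(X \mid \Pi) + N \cdot \min\{H(X \mid \Pi), H(Y \mid \Pi)\} + \mathrm{Pen}(X, \Pi') \\
&\le \mathrm{LL}(X \mid \Pi) + (1 - |\Omega_Y|) \cdot \mathrm{Pen}(X, \Pi) + \mathrm{Pen}(X, \Pi') \\
&= \mathrm{LL}(X \mid \Pi) + \mathrm{Pen}(X, \Pi) = \mathrm{BIC}(X, \Pi).
\end{aligned}$$

First step is the definition of BIC, second step uses Lemma 4 and third step uses the assumption of this theorem; the final equality holds because $\mathrm{Pen}(X, \Pi') = |\Omega_Y| \cdot \mathrm{Pen}(X, \Pi)$. Therefore, $\Pi'$ can be safely ignored (Lemma 1). Now take any $\Pi'' \supset \Pi'$. Let $\Pi''' = \Pi'' \setminus \{Y\}$. It is immediate that $N \cdot \min\{H(X \mid \Pi'''), H(Y \mid \Pi''')\} \le (1 - |\Omega_Y|) \cdot \mathrm{Pen}(X, \Pi''')$, since $\Pi \subset \Pi'''$ and hence the conditional entropies do not increase while $\mathrm{Pen}(X, \Pi''') \le \mathrm{Pen}(X, \Pi)$. The theorem follows by the same arguments as before, now applied to $\Pi'''$ and $\Pi''$. ∎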

The rationale behind Theorem 1 is that if the data do not have enough entropy to beat the penalty function, then there is no reason to continue expanding the parent set candidates. Theorem 1 can be used for pruning the search space of candidate parent sets without having to compute their BIC scores. However, we must have available the conditional entropies $H(X \mid \Pi)$ and $H(Y \mid \Pi)$. The former is usually available, since $H(X \mid \Pi) = -\mathrm{LL}(X \mid \Pi) / N$, which is used to compute $\mathrm{BIC}(X, \Pi)$ (and it is natural to assume that such score has already been computed at the moment Theorem 1 is checked). Actually, this bound amounts exactly to the previous result in the literature:

$$N \cdot H(X \mid \Pi) \le (1 - |\Omega_Y|) \cdot \mathrm{Pen}(X, \Pi) \iff -\mathrm{LL}(X \mid \Pi) \le \mathrm{Pen}(X, \Pi) - \mathrm{Pen}(X, \Pi \cup \{Y\}) \iff \mathrm{BIC}(X, \Pi) \ge \mathrm{Pen}(X, \Pi \cup \{Y\}).$$

By Theorem 1 we know that $\Pi \cup \{Y\}$ and any superset can be safely ignored, which is the very same condition as in Lemma 3. The novelty in Theorem 1 comes from the term $H(Y \mid \Pi)$. If such term is already computed (or if it will need to be computed irrespective of this bound computation, and thus we do not lose time computing it for this purpose only), then we get (almost) for free a new manner to prune parent sets. In case this computation of $H(Y \mid \Pi)$ is not considered worthwhile, or if we simply want a faster approach to prune parent sets, we can resort to a more general version of Theorem 1, as given by Theorem 2.
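A hedged sketch of the entropy-based test follows. It assumes the condition of Theorem 1 reads $N \cdot \min\{H(X \mid \Pi), H(Y \mid \Pi)\} \le (1 - |\Omega_Y|) \cdot \mathrm{Pen}(X, \Pi)$ with the penalty $-\frac{\log_b N}{2}(|\Omega_X|-1)|\Omega_\Pi|$; the function name and the numbers in the usage lines are made up for illustration.

```python
import math


def theorem1_prunes(n_rows, h_x_given_pi, h_y_given_pi, card_x, card_parents, card_y,
                    base=2.0):
    """True when Pi + {Y} and its supersets can be skipped for variable X.

    Entropies are empirical conditional entropies given the current parent set Pi,
    expressed in the same logarithm base as the penalty.
    """
    pen = -0.5 * math.log(n_rows, base) * (card_x - 1) * math.prod(card_parents)
    threshold = (1 - card_y) * pen  # non-negative, since pen <= 0 and card_y >= 2
    return n_rows * min(h_x_given_pi, h_y_given_pi) <= threshold


if __name__ == "__main__":
    # Binary X, one binary parent, binary Y, 100 rows (illustrative numbers).
    # H(X | Pi) is large, so the likelihood-only rule would not fire ...
    print(theorem1_prunes(100, 0.9, 0.9, 2, [2], 2))
    # ... but a nearly deterministic Y given Pi is enough to prune.
    print(theorem1_prunes(100, 0.9, 0.05, 2, [2], 2))
```

Passing the same value for both entropies reproduces the likelihood-only condition, so the second call shows a case where only the entropy of the candidate parent triggers the pruning.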

Theorem 2.

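Let $X \in \mathcal{X}$, and $\Pi, \Pi'$ be parent sets for $X$ with $\Pi' \subseteq \Pi$. Let $Y \in \mathcal{X} \setminus (\Pi \cup \{X\})$ such that $N \cdot \min\{H(X \mid \Pi'), H(Y \mid \Pi')\} \le (1 - |\Omega_Y|) \cdot \mathrm{Pen}(X, \Pi)$. Then the parent set $\Pi \cup \{Y\}$ and all its supersets can be safely ignored when building the list of candidate parent sets for $X$.

Proof. It is well-known (see Lemma 2) that $H(X \mid \Pi) \le H(X \mid \Pi')$ and $H(Y \mid \Pi) \le H(Y \mid \Pi')$ for any $X$, $Y$, $\Pi' \subseteq \Pi$ as defined in this theorem, so the result follows from Theorem 1. ∎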

An important property of Theorem 2 when compared to Theorem 1 is that all entropy values are conditioned on a subset $\Pi'$ of the current parent set, chosen at our discretion. For instance, we can choose $\Pi' = \emptyset$, so that they become entropies of single variables, which can be precomputed efficiently in total time $O(N \cdot n)$. Another option at this point, if we do not want to compute $H(Y \mid \Pi)$ and assuming the cache of $Y$ has already been created, would be to quickly inspect the cache of $Y$ to find the most suitable subset of $\Pi$ to plug into Theorem 2. Moreover, with Theorem 2, we can prune the search space of a variable $X$ without evaluating the likelihood of parent sets for $X$ (just by using the entropies), and so it could be used to guide the search even before any heavy computation is done. The main novelty in Theorems 1 and 2 is to make use of the (conditional) entropy of $Y$.
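The cheapest variant, with the empty set as conditioning subset, can be sketched as below. The first helper precomputes the marginal entropy of every column; the second solves the condition $N \cdot \min\{H(X), H(Y)\} \le (1 - |\Omega_Y|) \cdot \mathrm{Pen}(X, \Pi)$ for $|\Omega_\Pi|$, giving the joint parent state-space size from which adding $Y$ is always pruned. Both function names, the data layout and the penalty form $-\frac{\log_b N}{2}(|\Omega_X|-1)|\Omega_\Pi|$ are assumptions of this sketch.

```python
import math
from collections import Counter


def marginal_entropies(data, base=2.0):
    """Empirical entropy of every column; one counting pass per column."""
    n_rows = len(data)
    result = []
    for j in range(len(data[0])):
        counts = Counter(row[j] for row in data)
        result.append(-sum(c / n_rows * math.log(c / n_rows, base)
                           for c in counts.values()))
    return result


def min_parent_space_for_pruning(n_rows, h_x, h_y, card_x, card_y, base=2.0):
    """Smallest |Omega_Pi| at which Pi + {Y} is pruned for X using only H(X), H(Y).

    Rearranges N * min(H(X), H(Y)) <= (|Omega_Y| - 1) * (log N / 2)
    * (|Omega_X| - 1) * |Omega_Pi| for |Omega_Pi|.
    """
    bound = (2.0 * n_rows * min(h_x, h_y)
             / ((card_x - 1) * (card_y - 1) * math.log(n_rows, base)))
    return max(1, math.ceil(bound))  # the empty parent set has |Omega| = 1


if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (1, 0), (1, 0)]
    print(marginal_entropies(toy))
    print(min_parent_space_for_pruning(16, 1.0, 0.337, 2, 2))
```

Because the result depends only on quantities available before any score is computed, it can serve as a per-pair stopping criterion while enumerating parent sets.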

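This new pruning approach is not trivially achievable by previous existing bounds for BIC. It is worth noting the relation with previous work. The restriction of Theorem 2 can be rewritten as:

$$N \cdot \min\{H(X \mid \Pi'), H(Y \mid \Pi')\} + \mathrm{Pen}(X, \Pi \cup \{Y\}) \le \mathrm{Pen}(X, \Pi).$$

Note that the condition for Lemma 3 (known from literature) is exactly $-\mathrm{LL}(X \mid \Pi) + \mathrm{Pen}(X, \Pi \cup \{Y\}) \le \mathrm{Pen}(X, \Pi)$. Hence, Theorem 2 will be effective (while the previous rule in Lemma 3 will not) when $N \cdot \min\{H(X \mid \Pi'), H(Y \mid \Pi')\} < -\mathrm{LL}(X \mid \Pi) = N \cdot H(X \mid \Pi)$, and so when $H(Y \mid \Pi') < H(X \mid \Pi)$. Intuitively, the new bound of Theorem 2 might be more useful when the parent set being evaluated is poor (hence $\mathrm{LL}(X \mid \Pi)$ is low) while the result in Lemma 3 plays an important role when the parent set being evaluated is good (and so $\mathrm{LL}(X \mid \Pi)$ is high). We provide now a numerical example, detailing two real cases from the well-known UCI data set glass (Lichman:2013, ) where only one bound is activated.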

Target variable ()
2 7
2 2
Table 1: Pruning rules’ examples using UCI data set glass. and are two cases of interest. Variable numbers are indexed from 0 to 7 from left to right in the data table. Name convention follows Theorem 1 to facilitate the understanding.

Consider case in Table 1: We are constructing the list of candidate parent sets for , and have just computed the BIC score of . We are interested whether is a good parent set. We have that , and thus the old pruning rule (Lemma 3) is activated. On the other hand, , so the new bound that uses the (conditional) entropy of is not activated.

Now consider case in Table 1: We are building the list of candidate parent sets for and have just computed the BIC score of . We are interested whether is a good parent set. We have that , and thus the new bound is activated, while does not activate the old bound.

The result of Theorem 2 can also be used to bound the maximum number of parents in any given candidate parent set. While the asymptotic result is already implied by previous work (decampos2011a, ), we obtain the finer and interesting result of Theorem 3.

Theorem 3.

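There is an optimal structure such that variable $X \in \mathcal{X}$ has at most

$$\max_{Y \in \mathcal{X} \setminus \{X\}} \left\lceil 1 + \log_2\!\left( \frac{\min\{H(X), H(Y)\}}{(|\Omega_X| - 1)(|\Omega_Y| - 1)} \right) + \log_2 N - \log_2 \log_b N \right\rceil^{+}$$

parents, where $\lceil \cdot \rceil^{+}$ denotes the smallest natural number greater than or equal to its argument.

Proof. If $\Pi = \emptyset$ is the optimal parent set for $X$, then the result trivially follows since $|\Pi| = 0$. Now take $\Pi$ such that $Y \in \Pi$ and $\Pi' = \Pi \setminus \{Y\}$. Since $|\Pi| = |\Pi'| + 1$ and $|\Omega_{\Pi'}| \ge 2^{|\Pi'|}$, we have $|\Pi| \le 1 + \log_2 |\Omega_{\Pi'}|$. Now, if

$$\log_2 |\Omega_{\Pi'}| \ge 1 + \log_2\!\left( \frac{\min\{H(X), H(Y)\}}{(|\Omega_X| - 1)(|\Omega_Y| - 1)} \right) + \log_2 N - \log_2 \log_b N,$$

then by Theorem 2 (used with the empty set as the conditioning subset) every superset of $\Pi'$ containing $Y$ can be safely ignored, and so would be $\Pi$. Therefore,

$$|\Pi| \le 1 + \log_2 |\Omega_{\Pi'}| < 2 + \log_2\!\left( \frac{\min\{H(X), H(Y)\}}{(|\Omega_X| - 1)(|\Omega_Y| - 1)} \right) + \log_2 N - \log_2 \log_b N,$$

and since $|\Pi|$ is a natural number, the result follows by applying the same reasoning for every $Y \in \Pi$. ∎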

Corollary 1 is demonstrated for completeness, since it is implied by previous work (see for instance (decampos2011a, )). It is nevertheless presented here in more detailed terms and without an asymptotic function.

Corollary 1.

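There is an optimal structure such that each variable has at most $\lceil 1 + \log_2 N - \log_2 \log_b N \rceil$ parents.

Proof. By Theorem 3, we have that $\Pi$ can be a parent set of a node $X$ only if

$$|\Pi| \le \max_{Y \in \mathcal{X} \setminus \{X\}} \left\lceil 1 + \log_2\!\left( \frac{\min\{H(X), H(Y)\}}{(|\Omega_X| - 1)(|\Omega_Y| - 1)} \right) + \log_2 N - \log_2 \log_b N \right\rceil^{+} \le \left\lceil 1 + \log_2\!\left( \frac{\log_b |\Omega_X|}{|\Omega_X| - 1} \right) + \log_2 N - \log_2 \log_b N \right\rceil^{+} \le \lceil 1 + \log_2 N - \log_2 \log_b N \rceil,$$

since it is assumed that $|\Omega_X|, |\Omega_Y| \ge 2$ and $b \ge 2$. ∎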
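As a quick numerical check, the sketch below evaluates a global bound of the form $\lceil 1 + \log_2 N - \log_2 \log_b N \rceil$. Both the closed form and the choice $b = 2$ are assumptions of this sketch; with them, the values for the sample sizes listed in Table 2 coincide with its Corollary 1 column. The function name is hypothetical.

```python
import math


def corollary1_bound(n_rows, base=2.0):
    """Global bound on the number of parents, assumed to be
    ceil(1 + log2 N - log2 log_b N)."""
    return math.ceil(1 + math.log2(n_rows) - math.log2(math.log(n_rows, base)))


if __name__ == "__main__":
    # A few (data set, number of data points) pairs taken from Table 2.
    for name, n_rows in [("glass", 214), ("zoo", 101), ("cmc", 1473), ("segment", 2310)]:
        print(name, n_rows, corollary1_bound(n_rows))
```

The bound grows roughly like $\log_2 N$, which is why doubling the data set adds at most about one admissible parent.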

Theorem 3 can be used to bound the number of parents per variable, even before computing parent sets for them, with the low computational cost of computing the empirical entropy of each variable once (hence an overall cost of $O(n \cdot N)$ time). We point out that Theorem 3 can provide effective bounds (considerably smaller than the global bound of Corollary 1) on the number of parents for specific variables, particularly when the number of states is high and entropies are low, as we will see in the next section.
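A per-variable bound of this kind could be computed as sketched below. The closed form used here, $\max_Y \lceil 1 + \log_2(\min\{H(X), H(Y)\} / ((|\Omega_X|-1)(|\Omega_Y|-1))) + \log_2 N - \log_2 \log_b N \rceil^{+}$, is an assumption of this sketch, as are the function name, the argument layout and the example numbers.

```python
import math


def theorem3_bound(n_rows, h_x, card_x, others, base=2.0):
    """Assumed per-variable bound on the number of parents of X.

    `others` lists one (entropy, cardinality) pair per other variable Y;
    entropies are empirical and use the same logarithm base as `base`.
    """
    const = 1 + math.log2(n_rows) - math.log2(math.log(n_rows, base))
    best = 0
    for h_y, card_y in others:
        smallest = min(h_x, h_y)
        if smallest <= 0:
            continue  # zero entropy: this pair never justifies a parent
        value = const + math.log2(smallest / ((card_x - 1) * (card_y - 1)))
        best = max(best, math.ceil(value))
    return max(best, 0)  # clamp to a natural number


if __name__ == "__main__":
    # 214 rows, binary X with entropy 1 bit (illustrative numbers).
    print(theorem3_bound(214, 1.0, 2, [(1.0, 2)]))   # high-entropy binary Y
    print(theorem3_bound(214, 1.0, 2, [(0.1, 4)]))   # low-entropy Y with 4 states
```

The second call shows the effect described in the text: many states combined with low entropy push the bound well below the global one.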

5 Experiments

We run experiments using a collection of data sets from the UCI repository (Lichman:2013, ). Table 2 shows the data set names, number of variables $n$ and number of data points $N$. In the same table, we show the maximum number of parents that a node can have, according to the new result of Theorem 3, as well as the old result from the literature (which we present in Corollary 1). The old bound is global, so a single number is given in column 5, while the new result of Theorem 3 implies a different maximum number of parents per node. We use the notation bound (number of times), with the bound followed by the number of nodes for which the new bound reached that value, in parentheses (so all numbers in parentheses in a row should sum to $n$ of that row). In every data set of Table 2 the bound of Theorem 3 is strictly below the global bound for at least one node, and in several data sets for all nodes, so the new bounds can prune great parts of the search space further than previous results.

| Dataset | $n$ | $N$ | Bound on number of parents: Theorem 3 | Bound on number of parents: Corollary 1 |
|---|---|---|---|---|
| glass | 8 | 214 | 6 (7), 3 (1) | 6 |
| diabetes | 9 | 768 | 7 (9) | 8 |
| tic-tac-toe | 10 | 958 | 6 (10) | 8 |
| cmc | 10 | 1473 | 8 (3), 7 (7) | 9 |
| breast-cancer | 10 | 286 | 6 (4), 5 (2), 4 (1), 3 (3) | 7 |
| solar-flare | 12 | 1066 | 7 (4), 6 (1), 5 (5), 3 (1), 2 (1) | 8 |
| heart-h | 12 | 294 | 6 (6), 5 (3), 4 (2), 3 (1) | 7 |
| vowel | 14 | 990 | 8 (12), 4 (2) | 8 |
| zoo | 17 | 101 | 5 (10), 4 (6), 2 (1) | 5 |
| vote | 17 | 435 | 7 (15), 6 (2) | 7 |
| segment | 17 | 2310 | 9 (16), 6 (1) | 9 |
| lymph | 18 | 148 | 5 (8), 4 (8), 3 (2) | 6 |
| primary-tumor | 18 | 339 | 6 (9), 5 (7), 4 (1), 2 (1) | 7 |
| vehicle | 19 | 846 | 7 (18), 6 (1) | 8 |
| hepatitis | 20 | 155 | 5 (18), 4 (2) | 6 |
| colic | 23 | 368 | 6 (8), 5 (12), 4 (3) | 7 |
| autos | 26 | 205 | 6 (16), 5 (3), 4 (1), 3 (5), 1 (1) | 6 |
| flags | 29 | 194 | 6 (5), 5 (7), 4 (7), 3 (7), 2 (3) | 6 |

Table 2: Maximum number of parents that nodes have using new (column 4) and previous bounds (column 5). In column 4, we list the bound on number of parents followed by how many nodes have that bound in parentheses (the new theoretical results obtain a specific bound per node, while previous results obtain a single global bound).

Our second set of experiments compares the activation of Theorems 1, 2, and 3 in pruning the search space for the construction of the list of candidate parent sets. Tables 3 and 4 (at the end of this document) present the results as follows. Columns one to four contain, respectively, the data set name, number of variables, number of data points and maximum in-degree (in-d) that we impose (a maximum in-degree is imposed so that we can compare the obtained results among different approaches). The fifth column presents the total number of parent sets that need to be evaluated by the brute-force procedure (taking into consideration the imposed maximum in-degree). Columns 6 to 12 present the number of times that different pruning results are activated when exploring the whole search space. Larger numbers mean that more parent sets are ignored (even without being evaluated). The naming convention for the pruning algorithms as used in those columns is:

  (Alg1) Application of Theorem 1 using $H(X \mid \Pi)$ in the expression of the rule (instead of the minimization), where $X$ is the variable for which we are building the list and $\Pi$ is the current parent set being explored. This is equivalent to the previous rule in the literature, as presented in this paper in Lemma 3.

  (Alg2) Application of Theorem 1 using $H(Y \mid \Pi)$ in the expression of the rule (instead of the minimization), where $X$ is the variable for which we are building the list and $Y$ is the variable just to be inserted in the parent set $\Pi$ that is being explored. This is the new pruning rule which makes most use of entropy, but it may be slower than the others (since conditional entropies might need to be evaluated, if they were not yet).

  (Alg3) Application of Theorem 2 using $H(X)$ in the formula, that is, with $\Pi' = \emptyset$ (and $H(X)$ instead of the minimization). This is a slight improvement to the known rule in the literature regarding the maximum number of parents of a variable and is very fast, since it does not depend on evaluating any parent sets.

  (Alg4) Application of Theorem 2 using $H(Y)$ in the formula, that is, with $\Pi' = \emptyset$ (and $H(Y)$ instead of the minimization). This is a different improvement to the known rule in the literature regarding the maximum number of parents of a variable and is very fast, since it does not depend on evaluating any parent sets.
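The four variants differ only in which entropy is compared against the same threshold, which the following sketch makes explicit. It assumes the shared condition $N \cdot h \le (1 - |\Omega_Y|) \cdot \mathrm{Pen}(X, \Pi)$ with $h$ being the conditional entropy of the target given the current parent set for (Alg1), of the candidate parent given the current parent set for (Alg2), and the two marginal entropies for (Alg3) and (Alg4); the function name and the example numbers are hypothetical.

```python
import math


def activated_rules(n_rows, card_x, card_y, card_parents,
                    h_x_given_pi, h_y_given_pi, h_x, h_y, base=2.0):
    """Report which of the four pruning variants would discard Pi + {Y} for X."""
    pen = -0.5 * math.log(n_rows, base) * (card_x - 1) * math.prod(card_parents)
    threshold = (1 - card_y) * pen
    entropies = {
        "Alg1": h_x_given_pi,  # target given current parents (likelihood-based rule)
        "Alg2": h_y_given_pi,  # candidate parent given current parents
        "Alg3": h_x,           # marginal entropy of the target
        "Alg4": h_y,           # marginal entropy of the candidate parent
    }
    return {name: n_rows * h <= threshold for name, h in entropies.items()}


if __name__ == "__main__":
    # Binary variables, one binary parent, 100 rows (illustrative numbers).
    print(activated_rules(100, 2, 2, [2], 0.9, 0.05, 1.0, 0.06))
```

Since conditioning never increases empirical entropy, whenever (Alg3) fires so does (Alg1), and whenever (Alg4) fires so does (Alg2); the marginal variants trade some pruning power for not needing any conditional counts.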

We also present the combined amount of pruning obtained by some of these ideas when they are applied together. Of particular interest is column 8 with (Alg1)+(Alg2), as it shows the largest amount of pruning that is possible, albeit at a higher computational cost because of the (possibly required) computations for (Alg2). This is also presented graphically in the boxplot of Figure 1, where the values for the 18 data sets are summarized and the amount of pruning is divided by the pruning of (Alg1), so a ratio above one shows (proportional) gain with respect to the previous literature pruning rule.

[Figure 1: boxplots, one panel each for maximum in-degree (in-d) 3, 4 and 5.]
Figure 1: Ratio between the number of candidates pruned using (Alg1) (which theoretically subsumes (Alg3)) together with (Alg2) and the number of candidates pruned using (Alg1) alone, for different values of maximum in-degree. Values greater than one mean better than (Alg1). Results over 18 data sets. Averages are marked with a diamond.

Column 12 of Tables 3 and 4 has the pruning results (number of ignored candidates) for (Alg1) and (Alg4) together, since this represents the pruning obtained by the old rule plus the new rule given by Theorem 2 such that no extra computational cost takes place (and moreover it subsumes approach (Alg3), since (Alg1) is theoretically superior to (Alg3)). Again, this is summarized in the boxplot of Figure 2 over the 18 data sets and the values are divided by the amount of pruning of (Alg1) alone, so values above one show the (proportional) gain with respect to the previous literature rule.

As we can see in more detail in Tables 3 and 4, the gains with the new pruning ideas are significant in many circumstances. Moreover, there is no extra computational cost for applying (Alg3) and (Alg4), so one should always apply those rules while deciding selectively whether to employ pruning rule (Alg2) or not (we recall that one can tune that rule by exploiting the flexibility of Theorem 2 and searching for a subset that is already available in the computed lists, so a more sophisticated pruning scheme is also possible).

[Figure 2: boxplots, one panel each for maximum in-degree (in-d) 3, 4 and 5.]
Figure 2: Ratio between the number of candidates pruned using (Alg1) (which theoretically subsumes (Alg3)) together with (Alg4) and the number of candidates pruned using (Alg1) alone, for different values of maximum in-degree. Values greater than one mean better than (Alg1). Results over 18 data sets. Averages are marked with a diamond.

6 Conclusions

This paper presents new non-trivial pruning rules to be used with the Bayesian Information Criterion (BIC) score for learning the structure of Bayesian networks. The derived theoretical bounds extend previous results in the literature and can be promptly integrated into existing solvers with minimal effort and computational costs. They imply faster computations without losing optimality. The very computationally efficient version of the new rules implies gains of around 20% with respect to previous work, according to our experiments, while the most computationally demanding pruning achieves around 50% more pruning than before. We conjecture that further bounds for the BIC score are unlikely to exist except for some particular cases and situations.


Work partially supported by the Swiss NSF grants n. 200021_146606/1 and n. IZKSZ2_162188.


  • (1) J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers Inc., 1988.
  • (2) D. M. Chickering, D. Heckerman, C. Meek, Large-sample learning of Bayesian networks is NP-hard, Journal of Machine Learning Research 5 (2004) 1287–1330.
  • (3) D. Heckerman, D. Geiger, D. Chickering, Learning Bayesian networks: The combination of knowledge and statistical data, Machine Learning 20 (1995) 197–243.
  • (4) M. Koivisto, Parent assignment is hard for the MDL, AIC, and NML costs, in: Proceedings of the 19th Annual Conference on Learning Theory, Springer-Verlag, 2006, pp. 289–303.
  • (5) G. F. Cooper, E. Herskovits, A Bayesian method for the induction of probabilistic networks from data, Machine Learning 9 (4) (1992) 309–347.
  • (6) G. Schwarz, Estimating the dimension of a model, The Annals of Statistics 6 (1978) 461–464.
  • (7) C. P. de Campos, Q. Ji, Efficient structure learning of Bayesian networks using constraints, Journal of Machine Learning Research 12 (2011) 663–689.
  • (8) M. Bartlett, J. Cussens, Integer linear programming for the Bayesian network structure learning problem, Artificial Intelligence 244 (2017) 258–271.
  • (9) M. Teyssier, D. Koller, Ordering-based search: A simple and effective algorithm for learning Bayesian networks, in: Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, 2005, pp. 584–590.
  • (10) C. P. de Campos, Z. Zeng, Q. Ji, Structure learning of Bayesian networks using constraints, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 113–120.
  • (11) M. Lichman, UCI machine learning repository (2013). URL http://archive.ics.uci.edu/ml