1 Introduction
Bayesian networks are graphical models widely used to represent joint probability distributions on complex multivariate domains
[31]. A Bayesian network comprises two parts: a directed acyclic graph (the structure) describing the relationships among variables in the model, and a collection of conditional probability tables from which the joint distribution can be reconstructed. As the number of variables in the model increases, specifying the underlying structure becomes a tedious and difficult task, and practitioners often resort to learning Bayesian networks directly from data. Here, learning a Bayesian network refers to inferring the underlying graphical structure from data, a task wellknown to be NPhard
[13].Learned Bayesian networks are commonly used for drawing inferences such as querying the posterior probability of some variable after evidence is entered (a task known as belief updating), finding the mode of the joint distribution (known as most probable explanation or MAP inference), or selecting a configuration of a subset of the variables that maximizes their conditional probability (known as marginal MAP inference). All those inferences are NPhard to compute even approximately
[18, 38, 1, 19, 21], and all known (exact and provably good) algorithms have worstcase time complexity that is exponential in the treewidth [31, 19, 34, 24], which is a measure of connectedness of the graph. Polynomialtime algorithms for such inferences do exist, but they provide no guarantees on the quality of the solution they deliver, which raises doubts as to whether occasional bad results are a consequence of suboptimal structure learning or of approximate inference. In fact, under widely believed assumptions from complexity theory, exponential time complexity in the treewidth is inevitable for any algorithm that provides provably good inferences [11, 33]. Thus, learning network structures of small treewidth is essential if one wishes to perform reliable and efficient inference. This is particularly important in the presence of missing data, as learning methods usually resort to some kind of ExpectationMaximization procedure that requires performing belief updating in the network at every iteration
[25]. In those cases inefficient inference leads to great computational cost of learning; unreliable inference leads to learning underfitted/overfitted structures.Since estimating a network’s treewidth is itself an NPhard task
[2], extending current methods for learning Bayesian networks to the case of bounded treewidth while maintaining their relative efficiency and accuracy is not trivial. In comparison to unconstrained Bayesian network learning, few algorithms have been designed for the bounded treewidth case. Korhonen and Parviainen [32] showed that learning bounded treewidth Bayesian networks is NPhard, and developed an exact algorithm based on dynamic programming that learns optimal node structures of treewidth at most in time , which is above the time required by the best worstcase algorithms for learning optimal Bayesian networks with no constraint on treewidth [40]. Elidan and Gould [23]combined several heuristics to treewidth computation and network structure learning in order to design approximate methods. Others have addressed the similar (but not equivalent) problem of learning undirected models of bounded treewidth
[3, 42, 12]. Very recently, there seems to be an increase of interest in the topic. Berg et al. [6] showed that the problem of learning bounded treewidth Bayesian networks can be reduced to a weighted maximum satisfiability problem, and subsequently solved by weighted MAXSAT solvers. They report experimental results showing that their approach outperforms Korhonen and Parviainen’s dynamic programming approach. In the same year, Parviainen et al. [36] showed that the problem can be reduced to a mixedinteger linear program (MILP), and then solved by offtheshelf MILP optimizers (e.g. CPLEX). Their reduced MILP problem however has exponentially many constraints in the number of variables. Following the work of Cussens [16], the authors avoid creating such large programs by a cutting plane generation mechanism, which iteratively includes a new constraint while the optimum is not found. The generation of each new constraint (cutting plane) requires solving another MILP problem. The works of [6] and [36] have been developed independently and simultaneously with our work presented here; for this reason, we do not compare our methods with theirs. We intend to do so in the near future.In this paper, we present two novel ideas for scorebased Bayesian network structure learning with a hard constraint on treewidth. We first introduce a mixed integer linear programming formulation of the problem (Section 3) that builds on existing MILP formulations for unconstrained structure learning of Bayesian networks [16, 17] and for computing the treewidth of a graph [27]. The designed formulation is able to find a scoremaximizer Bayesian network of treewidth smaller than a given constant for models containing many more variables than Korhonen and Parviainen’s method, as we empirically demonstrate in Section 5. Unlike the MILP formulation of Parviainen et al. [36], the MILP problem we generate is of polynomial size in the number of variables, and does not require the use of cutting planes techniques. This makes for a clean and succinct formulation that can be solved with a single call of a MILP optimizer. A better understanding of cases where one approach is preferred to the other is yet to be achieved.
Since linear programming relaxations are used for solving the MILP problem, any MILP formulation can be used to provide approximate solutions and error estimates in an anytime fashion (i.e., the method can be stopped at any time during the computation with a feasible solution). However, the MILP formulations (both ours and the one proposed by Parviainen et al. [36]) cannot cope with very large domains, even if we agreed on obtaining only approximate solutions. This is because the minimum size of the MILP problems is cubic in the number of variables (hence it is difficult even to start the MILP solver for large domains), and there is probably little we can do to considerably improve this situation (a further discussion on that is given in Section 3). This limitation is observed in the experiments reported in Section 5, where our MILP formulation requires a much larger amount of time to obtain much poorer solutions for networks with over 50 variables.
In order to deal with large domains, we devise (in Section 4) an approximate method based on a uniform sampling of trees (maximal triangulated graphs of treewidth ), which is achieved by using a fast computable bijection between trees and Dandelion codes [10]. For each sampled tree, we either run an exact algorithm similar to the one proposed in [32] (when computationally appealing) to learn the scoremaximizing network whose moral graph is a subgraph of that tree, or we resort to a much more efficient method that takes partial variable orderings uniformly at random from a (relatively small) space of orderings that are compatible with the tree. We discuss the time and sample complexity of both variants, and compare it to those of similar schemes for learning unconstrained networks. We show empirically (in Section 5) that the double sampling scheme (of trees and partial variable orderings) is very effective in learning close to optimal structures in a selected set of data sets. We conclude in Section 6 by noting that the methods we propose can be considered as stateoftheart, and by suggesting possible improvements. To start, Section 2 presents some background knowledge on learning Bayesian networks.
2 Preliminaries
A Bayesian network is a concise graphical representation of a multivariate domain, where each random variable is associated with a node of its underlying directed acyclic graph (DAG) and local conditional probability distributions are specified for the variable given its parents in the graph (we often refer to variables and nodes in the graph interchangeably).
Let be and consider a finite set of categorical random variables taking values in finite sets . Formally, a Bayesian network is a triple , where is a DAG whose nodes are in onetoone correspondence with variables in , and is a set of numerical parameters specifying (conditional) probability values , for every node in , value of and assignment to the parents of , according to . The structure (that is, the DAG of the network) represents a set of stochastic independence assessments among variables in . In particular, represents a set of graphical Markov conditions: every variable is conditionally independent of its nondescendant nonparents given its parents. As a consequence, a Bayesian network uniquely defines a joint probability distribution over as the product of its parameters [31, Chapter 3.2.3]:
(1) 
Learning the structure from data is a challenging problem. One approach is to identify, for each variable, the minimal set of variables that makes that variable conditionally independent of others (Markov blanket), which is usually done by means of statistical tests of stochastic independence or information theoretic measures [41]
. Alternatively, structural learning can be posed as a combinatorial optimization problem in which one seeks the structure that maximizes a score function that relates to the data likelihood, while avoiding some excessive model complexity. Commonly used score functions include the Minimum Description Length (which is equivalent to the Bayesian Information Criterion)
[39], and Bayesian Dirichlet (likelihood) equivalent uniform score [9, 15, 28]. These functions follow different rationale but they all satisfy two properties: (i) they can be written as a sum of local score functions that depend only on the parent set of each node and on the data, and (ii) the local score functions can be efficiently computed and stored. Scorebased structure learning is a difficult task, and research on this topic has been very active [30, 29, 44, 17, 4, 45, 32].In scorebased Bayesian network learning we seek a DAG structure such that
(2) 
where is the class of all DAGs with nodes, are local score functions that depend only on the parent set as given by (i.e., the computation of each depends only on the values that and take in the data set). We assume (unless otherwise stated) that local scores have been previously computed and can be retrieved at constant time. Despite the decomposability of the score functions, the optimization cannot be performed locally lest it almost certainly introduce directed cycles in the graph.
We say that a cycle in an undirected graph has a chord if there are two nodes in the cycle which are connected by an edge outside the cycle. A chordal graph is an undirected graph in which all cycles of length four or more have a chord. Any graph can be made chordal by inserting edges, a process called chordalization [2, 8]. The treewidth of a chordal graph is the size of its largest clique minus one. The treewidth of an arbitrary undirected graph is the minimum treewidth over all chordalizations of it. The moral graph of a DAG is the undirected graph obtained by connecting any two nodes with a common child and dropping arc directions. The treewidth of a DAG is the treewidth of its corresponding moral graph. The treewidth of a Bayesian network is the treewidth of the DAG .
An elimination order is a linear ordering of the nodes in a graph. We say that an elimination order is perfect if for every node in the order its higherordered neighbors form a clique (i.e., are pairwise connected). A graph admits a perfect elimination order if and only if it is chordal. Perfect elimination orders can be computed in linear time if they exist. The elimination of a node according to an elimination order is the process of pairwise connecting all of its higherordered neighbors. Thus, the elimination of all nodes produces a chordal graph for which the elimination order used is perfect. The edges inserted by the elimination process are called fillin edges. Given a perfect elimination order, the treewidth of the graph can be computed as the maximum number of higher ordered neighbors in the graph.
The reason why most score functions penalize model complexity (as given by the number of free numerical parameters) is that data likelihood always increases by augmenting the number of parents of a variable (and hence the number of free parameters in the model), which leads to overfitting and poor generalization. The way scores penalize model complexity generally leads to structures of bounded indegree and helps in preventing overfitting, but even bounded indegree graphs can have large treewidth (for instance, directed square grids have treewidth equal to the square root of the number of nodes, yet have maximum indegree equal to two), which yields a great problem to subsequent probabilistic inferences with the model.
There are at least two direct reasons to aim at learning Bayesian networks of bounded treewidth: (i) As discussed previously, all known exact algorithms for probabilistic inference have exponential time complexity in the treewidth, and networks with very high treewidth are usually the most challenging for approximate methods; (ii) Previous empirical results [37, 23] suggest that bounding the treewidth might improve model performance on heldout data. There is also evidence that bounding the treewidth does not impose a great burden on the expressivity of the model for real data sets [7].
The goal of learning Bayesian networks of bounded treewidth is to search for such that
(3) 
where is the class of all DAGs of treewidth not (strictly) greater than . From a theoretical point of view, this is no easy task. Korhonen and Parviainen [32] adapted Srebro’s complexity result for Markov networks [42] to show that learning the structure of Bayesian networks of bounded treewidth strictly greater than one is NPhard. Dasgupta’s results also prove this hardness if the score maximizes data likelihood [20] (in the case of networks of treewidth one, that is, directed trees with at most one parent per node, learning can be performed efficiently by the Chow and Liu’s algorithm [14]).
3 Mixed integer linear programming
The first contribution of this work is the mixed integer linear programming (MILP) formulation that we design to exactly solve the problem of structure learning with bounded treewidth. MILP formulations have shown to be very effective to learning Bayesian networks without the treewidth bound [16, 4], surpassing other attempts in a range of data sets. Moreover, the great language power of a MILP problem allows us to encode the treewidth constraint in a natural manner, which might not be easy with other structure learning approaches [45, 44, 29, 22, 35]. We note that computing the treewidth of a graph is an NPhard problem itself [2], even if there are linear algorithms that are only exponential in the treewidth [8] (these algorithms might be seen mostly as theoretical results, since their practical use is shadowed by very large hidden constants). Hence, one should not hope to enforce a bound on the treewidth (which should work for any chosen bound) without a machinery that is not at least as powerful as NP.
The novel formulation is based on combining the MILP formulation for structure learning in [17] with the MILP formulation presented in [27] for computing the treewidth of an undirected graph. There are although crucial differences, which we highlight later on. We have avoided the use of sophisticated techniques for MILP in the context of structure learning, such as constraint generation [16, 4], because we are interested in providing a clean and succinct MILP formulation, which can be ran using offtheshelf solvers without additional coding.
Since our formulation is a combination of two previous MILP formulations of distinct problems, we will present each formulation separately, and then describe how to combine them into a concise MILP problem.
3.1 A MILP formulation for bounding the treewidth
Consider a graph . We begin with the MILP formulation of the class of all supergraphs of a graph that have treewidth less than or equal to a given value :
(4a)  
(4b)  
(4c)  
(4d)  
(4e)  
(4f) 
The formulation above is based on encoding all possible elimination orders of the nodes of . A chordalization of of treewidth at most can be obtained from a feasible solution (if it exists) of the program by setting . Constraint (4a) ensures has treewidth at most by bounding the number of higherordered neighbors of every node (which is an alternative way of defining the treewidth of chordal graphs). The variables , , take (real) values in (Constraint (4e)) and partially define an elimination order of the nodes: a node is eliminated before node if (the specification is partial since its allows for two nodes and with ). This order does not need to be linear because there are cases where multiple linearizations of the partial order are equally good in building a chordalization of (i.e., in minimizing the maximum clique size of ). In such cases, two nodes and might be assigned the same value indicating that eliminating before and the converse results in chordal graphs with the same treewidth. The variables , , are valued (Constraint (4f)) and indicate whether node precedes in the order (i.e., whether ) and an edge exists among them in the resulting chordal graph (recall that an elimination process always produces a chordal graph). Although the values are not forced to be integers in our formulation, in practice they will most likely be so. Constraint (4b) allows to be only if appears after in the order (it in fact requires that to allow to be one). Constraint (4c) ensures is a supergraph of . Constraint (4d) guarantees that the elimination ordering induced by , , is perfect for : if and are higher ordered neighbors of in , then and are also neighbors in , that is, either or must be . The practical difference of this formulation with respect to the one in [27] lies in the fact that we allow partial elimination orders, and we do not need integer variables to enforce such orders. A bottleneck is the specification of Constraint (4d), as there are such constraints. The following result is an immediate conclusion of the above reasoning.
Proposition 1
The graph has treewidth at most if and only if the set defined by Constraints (4) is non empty.
3.2 A MILP formulation for structure learning
We now turn our attention to the MILP formulation of the structure learning part. Consider a chordal (undirected) graph , a perfect elimination order for , and let , , be valued variables such that if and only if contains and is eliminated before . For each node in let be the collection of all allowed parent sets for that node (these sets can be specified manually by the user or simply defined as the subsets of with cardinality less than a given bound). We denote an element of as , with (hence ). The following MILP formulation specifies the class of all DAGs over that are consistent with the parent sets and whose moral graph is a subgraph of :
(5a)  
(5b)  
(5c)  
(5d)  
(5e)  
(5f) 
where the scope of the in each constraint is . A DAG can be obtained from a solution to the above program by setting . The variables , , take values in (Constraint (5e)) and partially specify a topological order of the nodes in : if then is not an ancestor of . The variables , , , are valued (Constraint (5f)) and indicate whether the th parent set in was chosen for node . Constraint (5a) enforces that exactly one parent set is chosen for each node. Constraint (5b) forces those choices to be acyclic, that is, to respect the topological order induced by the variables (with ties broken arbitrarily for nodes with ). Here too the order does not need to be linear. In fact, only the relative ordering of nodes that are connected in is relevant because Constraints (5c) and (5d) ensure that arcs appear in only if the corresponding edges in the moral graph of exist in (Constraint (5d) is responsible for having the moralization of the graph falling inside ).
Proposition 3
Let , , , be variables satisfying Constraints (5). Then the directed graph , where is acyclic and consistent with every set . Moreover the moral graph of is a subgraph of .
A corollary of the above result is that the treewidth of is at most the treewidth of [8].
3.3 Combining the MILP formulations
We can now put together the two previous MILP formulations to reach the following MILP formulation for the problem of learning DAGs of treewidth bounded by a constant :
maximize:  
(6b)  
subject to:  
(6c)  
(6d)  
(6e)  
(6f)  
(6g)  
(6h)  
(6i)  
(6j)  
(6k)  
(6l) 
As the following result shows, the MILP formulation above specifies DAGs of bounded treewidth:
Theorem 1
Corollary 1
The MILP formulation (6) can be directly fed into any offtheshelf MILP optimizer. According to Corollary (1), the outcome will always be an optimum structure if enough resources (memory and time) are given. Standard MILP optimizers (e.g. CPLEX) often employ branchandbound (or branchandcut) procedures, which are able to be halted prematurely at any time and still provide a valid solution and an outer bound for the maximum score. Hence, the MILP formulation also provides an anytime algorithm for learning Bayesian networks of bounded treewidth: the procedure can be stopped at time and still provide an approximate solutions and error bound. Moreover, the quality of the approximation solution returned increases with time, while the error bounds monotonically decrease and eventually converge to zero.
3.4 Comparison with the dynamic programming approach
To validate the practical feasibility of our MILP formulation, we compare it against the the dynamic programming method proposed previously for this problem [32], which we call K&P from now on.^{5}^{5}5We used the freely available code provided by the authors at http://www.cs.helsinki.fi/u/jazkorho/aistats2013/. Table 1 show the time performance of our MILP formulation and that of K&P on a collection of reasonably small data sets from the UCI repository^{6}^{6}6Obtained from http://archive.ics.uci.edu/ml/. (discretized over the median value, when needed) and small values of the treewidth bound. More details about these data are presented in Section 5. The experiments have been run with a limit of 64GB in memory usage and maximum number of parents per node equal to three (the latter restriction facilitates the experiments and does not impose a constraint in the possible treewidths that can be found). While one shall be careful when directly comparing the times between methods, as the implementations use different languages (we are running CPLEX 12.4, K&P uses a Cython^{7}^{7}7http://www.cython.org. compiled Python code), we note that our MILP formulation is orders of magnitude faster than K&P, and able to solve many problems which the latter could not (in Section 5 we show the results of experiments with much larger domains). A time limit of 3h was given to the MILP, in which case its own estimation of the error is reported (in fact, it found the optimal structure in all instances, but was not able to certify it to be optimal within 3h).
METHOD  TW  NURSERY  BREAST  HOUSING  ADULT 

MILP  2  1s  31s  3h [2.4%]  3h [0.39%] 
3  1s  19s  25m  3h [0.04%]  
4  1s  8s  80s  40m  
5  1s  8s  56s  37s  
K&P  2  7s  26s  128m  137m 
3  72s  5m  –  –  
4  12m  103m  –  –  
5  131m  –  –  – 
The results in the table show that our MILP formulations largely outperforms K&P, being able to handle much larger problems. Yet we see from these experiments that both algorithms scale poorly in the number of variables. In particular, K&P cannot cope with data sets containing more than a dozen of variables. The results suggest that the MILP problems become easier as the treewidth bound increases. This is likely a consequence of the increase of the space of feasible solutions, which makes the linear relaxations used for solving the MILP problem tighter, thus reducing the computational load. This is probably aggravated by the small number of variables in these data sets (hence, by increasing the treewidth we effectively approximate an unbounded learning situation).
We shall demonstrate empirically in Section 5 that the quality of solutions found by the MILP approach in a reasonable amount of time degrades quickly as the number of variables reaches several dozens. Indeed, the MILP formulation is unable to find reasonable solutions for data sets containing 100 variables, which is not surprising given that number of Constraints (6e) and (6i) is cubic in the number of variables; thus, as increases even the linear relaxations of the MILP problem become hard to solve. In the next section, we present a clever sampling algorithm over the space of trees to overcome such limitations and handle large domains. The MILP formulation just described will set a baseline for the performance of such approximate approach.
4 Sampling trees using Dandelion codes
In this section we develop an approximate method for learning bounded treewidth Bayesian networks that is based on sampling graphs of bounded treewidth and subsequently finding DAGs whose moral graph is a subgraph of that graph. The approach is designed aiming at data sets with large domains, which cannot be handled by the MILP formulation.
A naive approach to designing an approximate method would be to extend one of the sampling methods for unconstrained Bayesian network learning. For instance, we could envision a rejection sampling approach, which would sample structures using some available procedure (for instance, by sampling topological orderings and then greedily finding a DAG structure consistent with that order, as in [43]), and verify their treewidth, discarding the structure when the test fails. There are two great issues with this approach: (i) the computation of treewidth is a hard problem, and even if there are lineartime algorithms (but exponential on the treewidth), they perform poorly in practice; (ii) virtually all structures would be discarded due to the fact that complex structures tend to have larger scores than simple ones, at least for the most used score functions (their penalizations reduce the local complexity of the model, but are not able to constrain a global property such as treewidth). We empirically verified these facts, but will not report further on them here.
Another natural approach to the problem is to consider both an elimination order for the variables (from which the treewidth can be computed) and a topological order (from which one can greedily search for parent sets without creating cycles in the graph). It is straightforward to uniformly sample from the space of orderings, but the combined overall number of such orderings is quite high: (from the Stirling approximation). We propose an interesting way that is more efficient in terms of the size of the sampling space, and yet can be sampled uniformly (uniform sampling is a desirable property, as it ensures a good coverage of the space and is superior to other options if one has no prior information about the search space). This approach is based on the set of trees.
Definition 1
A tree is defined in the following recursive way:
(1) A clique is a tree.
(2) If is a tree with nodes and edges , is a clique
and , then is a tree.
We denote by the set of all trees over nodes. In fact, a Bayesian network with treewidth bounded by is closely related to a tree. Because trees are exactly the maximal graphs with treewidth (graphs to which no more edges can be added without increasing their treewidth), we know that the moral graph of the optimal structure has to be a subgraph of a tree [32].
The idea is to sample trees and then search for the best structure whose moral graph is one of the subgraphs of the tree. While directly sampling a tree might not be trivial, Caminiti et al. [10] proposed a linear time method for coding and decoding trees into what is called Dandelion codes (the set of such codes is denoted by ). Moreover, they established a bijective mapping between codes in and trees in . The code is a pair where with and is a list of pairs of integers drawn from , where is an arbitrary number not in . For example, and is a Dandelion code of a (single) tree over nodes (that is , , ). Dandelion codes can be sampled uniformly at random by a trivial lineartime algorithm that uniformly chooses elements out of to build , and then uniformly samples pairs of integers in .
Theorem 2
[10] There is a bijection mapping elements of and that is computable in time linear in and .
Given , we can use the dynamic programming algorithm proposed in [32] to find the optimal structure whose moral graph is a subgraph of . Our implementation follows the ideas in [32], but can also be seen as extending the divideandconquer method of [26] to account for all possible divisions of nodes. This results in the following theorem.
Theorem 3
[32] For any fixed , given (a tree) and the scoring function for each node , we can find a DAG whose moralized graph is a subgraph of maximizing the score in time and space .
We can combine the lineartime sampling of trees described in Theorem 2 with the lineartime learning of bounded structures consistent with a graph in the above theorem to obtain an algorithm for learning bounded treewidth Bayesian networks. The algorithm is described in Algorithm 1 [Version 1].
Theorem 4
The sampling space of Algorithm 1 [Version 1] is less than . Each of its iterations runs in linear time in (but exponential in ).
Proof. The follow equality holds [5].
(7) 
It is not hard to see that the maximum happens for (because of the symmetry of and of around , while decreases with the increase of ). By manipulating this number and applying Stirling’s approximation for the factorials, we obtain:
which is less than . The decoding algorithm has complexity linear in (Theorem 2), as well as the method to uniformly sample a Dandelion code, and the method to find the best DAG consistent with a tree (Theorem 3).
While the running time of Algorithm 1 [Version 1] is linear in , the computational complexity of step 2.c, which uses the method in [32], is exponential in the treewidth (more precisely, it is ). Hence, one cannot hope to use it with moderately high treewidth bounds (say, larger than 8). Regarding the sample space, according to the above theorem it is slightly higher than that of orderbased learning of unconstrained Bayesian networks (e.g. [43]), especially if . However, each iteration of step 2.c needs considerable more effort than the corresponding iteration in the unbounded case (yet, as it is a method theoretically linear in , more efficient implementations of the algorithm that searches within a given tree might bring an additional boost to this approach in the future).
As just explained, the main practical drawback of Algorithm 1 [Version 1] is step 2.c, which process each sampled tree. In the sequel we propose a new approach ([Version 2]) that is much faster (per iteration), at the price of a slight increase in the sampling space. We will empirically compare these approaches in the next section.
Let define a partial order of the nodes. We say that a DAG is consistent with if, (as defined by ), there is no directed path from to in . In other words, constrains the valid topological orderings for the nodes in . We do not force to be a linear order, because we are only interested in orderings that specify, for each edge in a tree , which of the two ending points precedes the other (in other words, we are only interested in possible ways of orienting the edges of the ktree). There are multiple linear orderings that achieve the very same result for , and our goal is to sample from the smallest possible space of orderings (if we used a linear order, then the sampling space would be ).
A partial order can be represented as a DAG : is smaller than in if and only if node is an ancestor of node in . Given a ktree , we will sample by following the same recursive process as in Definition 1. This is described in Algorithm 2. The procedure produces partial orders (i.e., DAGs) whose underlying graph (obtained by ignoring arc directions) is exactly the graph . Note that the treewidth of the DAG corresponding to might exceed the treewidth of . This does not affect the correctness of Algorithm 1, as is only used to specify which node preceeds which node in the order, and hence which are the possible parents; the actual parents are chosen so that the treewidth bound is respect. This can be done efficiently using .
Theorem 5
Algorithm 2 samples DAGs on a sample space of size and runs in linear time in and .
Proof. The sampling of the nodes in the root clique takes time by sampling one of the ways to choose the arcs without creating cycles. We assume that an appropriate structure representing is known (e.g., a treedecomposition with nodes), so Steps 1 and 3 can be done in time. For each iteration of Step 4, we spend time because there are only ways to direct the edges, as this is equivalent to placing in its relative order with respect to the already ordered neighbors. Hence the total running time is and the sampling space is .
The following result shows that the sampling space of this version of the sampling algorithm remains reasonably small, especially for (it would be also small if is close to , then decreases drastically, so the total sampling space would also decrease).
Theorem 6
The sampling space of Algorithm 1 [Version 2] is less than . Each of its iterations runs in linear time in and .
Proof. As before, the decoding algorithm (Theorem 2) and the method to uniformly sample a Dandelion code run in linear time in both and . Algorithm 2 samples the ordering in linear time too. Finally, finding the best DAG consistent with a tree and is a greedy procedure over all nodes (choosing the parent set of a node each time): the treewidth cannot exceed because we take a subgraph of , and no cycles can be formed if we respect .
Although the sampling space of Version 2 is larger than the one of Version 1, Version 2 is much faster per iteration. This allows us to explore a much larger region of the space of tress than Version 1 can within a fixed amount of time. Moreover, one can run Version 2 without precomputing the score function: when scores are needed, they are computed and stored into a hash table for further accesses, thus closely matching another desirable characteristic of orderbased learning methods for unbounded treewidth (namely, to avoid computing all scores a priori).
5 Experiments
DATA SET  VAR.  SAMPLES 

nursery  9  12960 
breast  10  699 
housing  14  506 
adult  15  32561 
zoo  17  101 
letter  17  20000 
mushroom  22  8124 
wdbc  31  569 
audio  62  200 
hill  100  606 
community  100  1994 
We empirically analyze the accuracy of Algorithm 1
by comparing its two versions with each other and with the values obtained by the MILP method. As before, we use a collection of data sets from the UCI repository of varying dimensionality, with variables discretized over the median value when needed. The number of (binary) variables and samples in each data set are described in Table
2. Some columns of the original data sets audio and community were discarded: 7 variables of audio had always a constant value, 5 variables of community have almost a different value per sample (such as personal data), and 22 variables have missing data (Table 2 shows dimensions after this preprocessing). In all experiments, we maximize the Bayesian Dirichlet likelihood equivalent uniform (BDeu) score with equivalent sample size equal to one [28].We use treewidth bounds of 4 and 10, and maximum parent set size of 3 (for hill and community, it was set as 2; nevertheless, the MILP formulation is the one with a strong dependency on the maximum parent set size, as scores need to be precomputed). To be fair among runs, we have precomputed all scores, and have considered them as input of the problem. The MILP has been optimized by CPLEX 12.4 with a memory limit of 64GB. We have allowed it to run up to three hours, and have also collected the incumbent solution after 10 minutes. Algorithm 1 has been given only 10 minutes (in either version).
To account for the variability of the performance of the sampling methods with respect to the sampling seed, we ran each version of Algorithm 1 ten times on each data set with different seeds. We report the minimum, median and maximum obtained values over those runs for each dataset. We show the relative scores (in percentage) of the approximate methods (Versions 1 and 2 of Algorithm 1 and the best score found by the MILP formulation within 10 minutes and 3 hours) with respect to Version 2’s median score, for treewidth bounds of four (Figure 1) and ten (Figure 2). The relative score is computed as the ratio of the obtained value and the median score of Version 2, so higher values are better. Moreover, a value higher than 100% shows that the method outperformed Version 2, whereas a value smaller than 100% shows the converse. The raw data used in the figures appear in Tables 3 (for Figure 1) and 4 (for Figure 2). The exponential dependence on treewidth of Version 1 made it intractable to run with treewidth bound greater than 8. We see from the plot on top that Version 2 is largely superior to Version 1, even if the former might only find suboptimal networks for a given tree. This is probably a consequence of the much lower running times per iteration, which allows Version 2 to explore a much larger set of trees. It also suggests that spending time finding good trees is more worthy than optimizing network structures for a given tree. We also see that the MILP formulation scales poorly with the number of variables, being unable to obtain satisfactory solutions for data sets with more than 50 variables. On the hill data set with treewidth , CPLEX running the MILP formulation was not able to output any solution within 10 minutes, and the solution obtained within 3 hours is far left of the zoomed area of the graph in Figure 1; on the community data set with treewidth , CPLEX did not find any solution within 3 hours. Regarding the treewidth bound of ten (Figure 2), we observe that Version 2 is very accurate and outperforms the MILP formulation in the larger data sets.
It is worth noting that both versions of Algorithm 1 were implemented in Matlab; hence, the comparison with the approximate solution of running the MILP formulation with the same amount of time (10 minutes) might be unfair, as we expect to produce better results by an appropriate recoding of our sampling methods in a more efficient language (one could also try to improve the MILP formulation, although it will eventually suffer from the problems discussed in Section 3). Nevertheless, the results show that Version 2 is very competitive even in this scenario.
6 Conclusions
We have created new exact and approximate procedures to learn Bayesian networks of bounded treewidth. They perform well and are of immediate practical use. The designed mixedinteger linear programming (MILP) formulation improves on MILP formulations for related tasks, especially regarding the specification of treewidthrelated constraints. It solves the problem exactly and surpasses a stateoftheart method both in size of networks and treewidth that it can handle. Even if results indicate it is better than the state of the art, MILP is not so accurate and might fail in large domains. For that purpose, we have proposed a double sampling idea that provides means to learn Bayesian networks in large domains and high treewidth limits, and is empirically shown to perform very well in a collection of public data sets. It scales well, because its complexity is linear both in the domain size and in the treewidth bound. There are certainly other search methods that can be integrated with our sampling approach, for instance a local search after every iteration of sampling, local permutations of orderings that are compatible with the trees, etc. We leave the study of these and other avenues for future work.
During the making of this work, two closely related works appeared in the literature. [6] developed an exact learning procedure based on maximum satisfiability. [36] developed an alternative MILP formulation of the problem with exponentially many constraints, and used cutting plane generation techniques to improve on performance. These works have been developed independently and simultaneously with our work presented here; future work should compare their performance empirically against the methods proposed here.
7 Acknowledgments
This work was partly supported by the grant N000141210868 from the US Office of Navy Research, the Swiss NSF grant n. 200021_146606/1, and the FAPESP grant n. 2013/231974.
References
 Abdelbar and Hedetniemi [1998] A. M. Abdelbar and S. M. Hedetniemi. Approximating MAPs for belief networks is NPhard and other theorems. Artif. Intell., 102(1):21–38, 1998.
 Arnborg et al. [1987] S. Arnborg, D. Corneil, and A. Proskurowski. Complexity of finding embeddings in a ktree. SIAM J. on Matrix Analysis and Applications, 8(2):277–284, 1987.
 Bach and Jordan [2001] F. R. Bach and M. I. Jordan. Thin junction trees. In Advances in Neural Inf. Proc. Systems 14, pages 569–576, 2001.
 Barlett and Cussens [2013] M. Barlett and J. Cussens. Advances in Bayesian Network Learning using Integer Programming. In Proc. 29th Conf. on Uncertainty in AI, pages 182–191, 2013.
 Beineke and Pippert [1969] L. W. Beineke and R. E. Pippert. On the number of kdimensional trees. J. of Comb. Theory, 6:200–205, 1969.
 Berg et al. [2014] J. Berg, M. J ”arvisalo, and B. Malone. Learning optimal bounded treewidth Bayesian networks via maximum satisfiability. In Proc. 17th Int. Conf. on AI and Stat., pages 86–95, 2014. JMLR W&CP 33.
 Beygelzimer and Rish [2003] A. Beygelzimer and I. Rish. Approximability of probability distributions. In Advances in Neural Inf. Proc. Systems 16, pages 377–384, 2003.
 Bodlaender [1996] H. L. Bodlaender. A linear time algorithm for finding treedecompositions of small treewidth. SIAM J. on Computing, 25(6):1305–1317, 1996.
 Buntine [1991] W. Buntine. Theory refinement on Bayesian networks. In Proc. 7th Conf. on Uncertainty in AI, pages 52–60, 1991.
 Caminiti et al. [2010] S. Caminiti, E. G. Fusco, and R. Petreschi. Bijective linear time coding and decoding for ktrees. Theory of Comp. Systems, 46(2):284–300, 2010.
 Chandrasekaran et al. [2008] V. Chandrasekaran, N. Srebro, and P. Harsha. Complexity of inference in graphical models. In Proc. 24th Conf. on Uncertainty in AI, pages 70–78, 2008.
 Chechetka and Guestrin [2007] A. Chechetka and C. Guestrin. Efficient principled learning of thin junction trees. In Advances in Neural Inf. Proc. Systems, pages 273–280, 2007.
 Chickering [1996] D. M. Chickering. Learning Bayesian networks is NPcomplete. In Learning from Data: AI and Stat. V, pages 121–130. SpringerVerlag, 1996.
 Chow and Liu [1968] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. Inf. Theory, IEEE Trans. on, 14(3):462–467, 1968.
 Cooper and Herskovits [1992] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Mach. Learning, 9(4):309–347, 1992.
 Cussens [2011] J. Cussens. Bayesian network learning with cutting planes. In Proc. 27th Conf. on Uncertainty in AI, pages 153–160, 2011.
 Cussens et al. [2013] J. Cussens, M. Bartlett, E. M. Jones, and N. A. Sheehan. Maximum Likelihood Pedigree Reconstruction using Integer Linear Programming. Genetic Epidemiology, 37(1):69–83, 2013.
 Dagum and Luby [1993] P. Dagum and M. Luby. Approximating probabilistic inference in Bayesian belief networks is NPhard. Artif. Intell., 60(1):141–153, 1993.
 Darwiche [2009] A. Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.
 Dasgupta [1999] S. Dasgupta. Learning polytrees. In Proc. 15th Conf. on Uncertainty in AI, pages 134–141, 1999.
 de Campos [2011] C. P. de Campos. New Complexity Results for MAP in Bayesian Networks. In Proc. Int. Joint Conf. on AI, pages 2100–2106, 2011.
 de Campos et al. [2009] C. P. de Campos, Z. Zeng, and Q. Ji. Structure learning of Bayesian networks using constraints. In Proc. 26th Int. Conf. on Mach. Learning, pages 113–120, 2009.
 Elidan and Gould [2008] G. Elidan and S. Gould. Learning Bounded Treewidth Bayesian Networks. J. of Mach. Learning Res., 9:2699–2731, 2008.

Ermon et al. [2013]
S. Ermon, C. P. Gomes, A. Sabharwal, and B. Selman.
Taming the curse of dimensionality: Discrete integration by hashing and optimization.
In Proc. 30th Int. Conf. on Mach. Learning, pages 334–342, 2013.  Friedman [1998] N. Friedman. The Bayesian structural EM algorithm. In Proc. 14th Conf. on Uncertainty in AI, pages 129–138, 1998.
 Friedman et al. [1999] N. Friedman, I. Nachman, and D. Pe’er. Learning Bayesian network structure from massive datasets: The ”sparse candidate” algorithm. In Proc. 15th Conf. on Uncertainty in AI, pages 206–215, 1999.
 Grigoriev et al. [2011] A. Grigoriev, H. Ensinck, and N. Usotskaya. Integer linear programming formulations for treewidth. Technical report, Maastricht Res. School of Economics of Tech. and Organization, 2011.
 Heckerman et al. [1995] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Mach. Learning, 20(3):197–243, 1995.
 Hemmecke et al. [2012] R. Hemmecke, S. Lindner, and M. Studený. Characteristic imsets for learning Bayesian network structure. Int. J. of Approx. Reasoning, 53(9):1336–1349, 2012.
 Jaakkola et al. [2010] T. Jaakkola, D. Sontag, A. Globerson, and M. Meila. Learning bayesian network structure using LP relaxations. In Proc. 13th Int. Conf. on AI and Stat., pages 358–365, 2010. JMLR W&CP 9.
 Koller and Friedman [2009] D. Koller and N. Friedman. Probabilistic Graphical Models. MIT press, 2009.
 Korhonen and Parviainen [2013] J. H. Korhonen and P. Parviainen. Exact learning of bounded treewidth Bayesian networks. In Proc. 16th Int. Conf. on AI and Stat., pages 370–378, 2013. JMLR W&CP 31.
 Kwisthout et al. [2010] J. H. P. Kwisthout, H. L. Bodlaender, and L. C. van der Gaag. The Necessity of Bounded Treewidth for Efficient Inference in Bayesian Networks. In Proc. 19th European Conf. on AI, pages 237–242, 2010.
 Mauá and de Campos [2012] D. D. Mauá and C. P. de Campos. Anytime marginal MAP inference. In Proc. 28th Int. Conf. on Mach. Learning, pages 1471–1478, 2012.
 Parviainen and Koivisto [2009] P. Parviainen and M. Koivisto. Exact structure discovery in Bayesian networks with less space. In Proc. 25th Conf. on Uncertainty in AI, pages 436–443, 2009.
 Parviainen et al. [2014] P. Parviainen, H. S. Farahani, and J. Lagergren. Learning bounded treewidth Bayesian networks using integer linear programming. In Proc. 17th Int. Conf. on AI and Stat., pages 751–759, 2014. JMLR W&CP 33.
 Perrier et al. [2008] E. Perrier, S. Imoto, and S. Miyano. Finding optimal Bayesian network given a superstructure. J. of Mach. Learning Res., 9(2):2251–2286, 2008.
 Roth [1996] D. Roth. On the hardness of approximate reasoning. Artif. Intell., 82(1–2):273–302, 1996.
 Schwarz [1978] G. Schwarz. Estimating the dimension of a model. Annals of Stat., 6(2):461–464, 1978.
 Silander and Myllymaki [2006] T. Silander and P. Myllymaki. A simple approach for finding the globally optimal Bayesian network structure. In Proc. 22nd Conf. on Uncertainty in AI, pages 445–452, 2006.
 Spirtes and Meek [1995] P. Spirtes and C. Meek. Learning Bayesian networks with discrete variables from data. In Proc. 1st Int. Conf. on Knowledge Discovery and Data Mining, pages 294–299, 1995.
 Srebro [2003] N. Srebro. Maximum likelihood bounded treewidth Markov networks. Artif. Intell., 143(1):123–138, 2003.
 Teyssier and Koller [2005] M. Teyssier and D. Koller. Orderingbased search: A simple and effective algorithm for learning Bayesian networks. In Proc. 21st Conf. on Uncertainty in AI, pages 584–590, 2005.
 Yuan and Malone [2012] C. Yuan and B. Malone. An Improved Admissible Heuristic for Learning Optimal Bayesian Networks. In Proc. 28th Conf. on Uncertainty in AI, pages 924–933, 2012.
 Yuan and Malone [2013] C. Yuan and B. Malone. Learning optimal Bayesian networks: A shortest path perspective. J. of Artif. Intell. Res., 48:23–65, 2013.