1 Introduction
The size of an explicit representation of the joint distribution of $n$ categorical random variables is exponential in $n$. Bayesian networks [1] compactly represent joint distributions by exploiting independence relations and encoding them into a directed acyclic graph (DAG), also referred to as the structure. Yet, algorithms able to perform structure learning from thousands of variables have been devised only very recently, both for Bayesian networks [2, 3] and for chordal log-linear graphical models (which can be exactly mapped onto Bayesian networks) [4, 5].

Given a Bayesian network, the task of computing the marginal distribution of a set of variables, possibly given evidence on another set of variables, is called inference. The complexity of exact inference grows exponentially in the treewidth [1, Chap. 7] of the DAG, under the exponential time hypothesis [6]. In order to allow tractable inference we thus need to learn Bayesian networks with a bounded-treewidth structure; this problem is NP-hard [7].
Most research on learning bounded-treewidth Bayesian networks adopts a score-based approach. The score measures the fit of the DAG to the data; the goal is hence to find the highest-scoring DAG that respects the treewidth bound. Exact methods [7, 8, 9] exist, but their applicability is restricted to small domains. Approximate approaches that scale up to some hundreds of variables [10, 11] have been more recently proposed. A recent breakthrough has been achieved by the k-greedy algorithm [3]: it consistently yields higher-scoring DAGs than its competitors and it scales to several thousands of variables.
In this paper we present a new algorithm called k-MAX, which improves over k-greedy. Both k-MAX and k-greedy are anytime algorithms: they can be stopped at any moment, yielding the current best solution. k-MAX adopts a set of more sophisticated heuristics compared to k-greedy; as a result it consistently yields higher-scoring DAGs than both k-greedy and other competitors, as demonstrated by our extensive experiments on complete data sets.
Structure learning algorithms commonly assume data sets to be complete; yet real data sets are often incomplete. Structure learning on incomplete data sets can be accomplished via the structural expectation-maximization (SEM) algorithm [12], which alternates between an estimation of the sufficient statistics given the current model (expectation step) and the search of a new model given the expected sufficient statistics (maximization step). Yet SEM is computationally demanding: in particular, the expectation step requires computing several inferences, which might become prohibitive if the model has unbounded treewidth and/or there are many missing data whose actual values have to be inferred. We adopt k-MAX as the structure learning algorithm within SEM; in this way we obtain a fast implementation of SEM, since the bounded-treewidth structures learned in the different iterations allow efficient inference. To the best of our knowledge, this is the first implementation of SEM able to scale to thousands of variables.
To test our method, we use the Bayesian networks learned by SEM in order to perform data imputation. We consider as a competitor a recent method for data imputation based on random forests [13] and we compare the two approaches on data sets with different degrees of missingness. The two approaches achieve the same imputation accuracy, but our approach is faster by almost one order of magnitude. Furthermore we show that the complexity of our method scales linearly in the input size (Subsec. 7.3), and that it is easily parallelizable (Subsec. 7.5). To the best of our knowledge, it is the first approach in the literature able to do so.
In Section 2 we present the technical background of the paper. In Section 3 we detail our approach for bounded-treewidth structure learning, k-MAX. In Sections 4 and 5 we evaluate its performance against existing state-of-the-art approaches. In Section 6 we present how k-MAX can be used within the SEM algorithm, obtaining the SEM-kMAX algorithm. It is evaluated in Section 7 on the task of data imputation against the state-of-the-art approach. Section 8 concludes the paper.
The software of this paper is available from http://ipg.idsia.ch/software/blip, together with supplementary material containing the detailed results of our experiments.
2 Treewidth and k-trees
Intuitively, the treewidth quantifies the extent to which a graph resembles a tree. Following the terminology of [14] we now provide a formal definition. Let us recall that a clique of an undirected graph is a subset of its nodes such that every two distinct nodes are linked by an edge. Moreover, a clique is maximal if it is not a subset of a larger clique.
Treewidth of an undirected graph. We denote an undirected graph by $H = (V, E)$, where $V$ is the vertex set and $E$ is the edge set. An undirected graph is triangulated when every cycle of length greater than or equal to 4 has a chord, that is, an edge connecting two non-consecutive nodes in the cycle [1, Def. 9.16]. Triangulated graphs are also called chordal graphs. The triangulation of a graph is the operation of adding chords until the graph is triangulated. The treewidth of a triangulated graph is the size of its largest clique minus one. The treewidth of $H$ is the minimum treewidth among all the possible triangulations of $H$.
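To make the definition concrete, the following sketch (our own illustrative code, not part of the paper's software) computes the treewidth of a triangulated graph: maximum cardinality search yields a perfect elimination order for chordal graphs, and the treewidth is the largest number of later-eliminated neighbors of any vertex, i.e. the size of the largest clique minus one.

```python
def treewidth_triangulated(adj):
    """Treewidth of a triangulated (chordal) graph.

    adj: dict mapping each vertex to the set of its neighbors.
    Reversing the maximum cardinality search (MCS) visit order
    yields a perfect elimination order for chordal graphs.
    """
    weight = {v: 0 for v in adj}
    visit = []
    while weight:
        # visit the unvisited vertex with the most visited neighbors
        v = max(weight, key=weight.get)
        visit.append(v)
        del weight[v]
        for u in adj[v]:
            if u in weight:
                weight[u] += 1
    elim = list(reversed(visit))          # perfect elimination order
    pos = {v: i for i, v in enumerate(elim)}
    # when v is eliminated, its not-yet-eliminated neighbors form a clique
    return max(sum(1 for u in adj[v] if pos[u] > pos[v]) for v in elim)
```

For instance, a triangle has treewidth 2 and a path has treewidth 1.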
Treewidth of a Bayesian network. The moral graph of the DAG associated to a Bayesian network is an undirected graph that includes an edge ($u$ – $v$) for every arc ($u \to v$) in the DAG, and an edge ($u$ – $v$) for every pair of arcs ($u \to w$), ($v \to w$) in the DAG, that is, for every pair of nodes sharing a child. The treewidth of the DAG is the treewidth of its moral graph.
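Moralization follows directly from this definition, as in the sketch below (our own illustration, not code from the paper): keep one undirected edge per arc and "marry" every pair of co-parents.

```python
from itertools import combinations

def moral_graph(parents):
    """Moral graph of a DAG given as {node: set of parents}.

    Returns a set of frozenset edges: one undirected edge per arc,
    plus one edge per pair of nodes sharing a child.
    """
    edges = set()
    for child, pa in parents.items():
        for p in pa:
            edges.add(frozenset((p, child)))   # arc p -> child
        for u, v in combinations(pa, 2):       # co-parents of child
            edges.add(frozenset((u, v)))
    return edges
```

For the v-structure $A \to C \leftarrow B$, the moral graph contains the edges A–C, B–C and A–B.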
2.1 k-trees
A k-tree is an undirected edge-maximal graph of treewidth $k$, that is, the addition of any edge to the k-tree increases its treewidth. It is defined inductively as follows [15]. Base case: a clique with $k+1$ nodes is a k-tree. Inductive step: given a k-tree $T_n$ on $n$ nodes, a k-tree on $n+1$ nodes is obtained by connecting the $(n+1)$-th node to a k-clique of $T_n$ (a k-clique is a clique over $k$ nodes). See Figure 1 for an example. As a final remark, a subgraph of a k-tree is called a partial k-tree; its treewidth is at most $k$.
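The inductive definition can be turned into a small generator (our own illustrative sketch, with hypothetical names): start from a $(k+1)$-clique, then attach each new node to a randomly chosen k-clique.

```python
import random
from itertools import combinations

def random_ktree(n, k, seed=0):
    """Build a random k-tree over nodes 0..n-1 (n >= k+1).

    Returns (edges, cliques): the undirected edges and the list of
    (k+1)-cliques created during the construction.
    """
    rng = random.Random(seed)
    nodes = list(range(k + 1))
    edges = {frozenset(e) for e in combinations(nodes, 2)}  # base clique
    cliques = [frozenset(nodes)]
    for v in range(k + 1, n):
        # pick a k-clique inside an existing (k+1)-clique
        host = rng.choice(cliques)
        kclique = frozenset(rng.sample(sorted(host), k))
        edges |= {frozenset((v, u)) for u in kclique}
        cliques.append(kclique | {v})   # the new (k+1)-clique
    return edges, cliques
```

Each inductive step adds exactly $k$ edges, so a k-tree over $n$ nodes has $\binom{k+1}{2} + (n-k-1)k$ edges.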
3 Structure learning of Bayesian networks
We consider the problem of learning the structure of a Bayesian network from a complete data set of $N$ instances. The set of $n$ categorical random variables is $\mathcal{X} = \{X_1, \dots, X_n\}$. The goal is to find the highest-scoring bounded-treewidth DAG $G = (V, E)$, where $V$ is the collection of nodes and $E$ is the collection of arcs. $E$ can be represented by the set of parents $\Pi_1, \dots, \Pi_n$ of all variables.
Structure learning is usually accomplished in two steps. First, parent set identification is the identification of a list (cache) $C_i$ of candidate parent sets, independently for each variable $X_i$. Second, structure optimization is the assignment of a parent set to each node so as to maximize the score of the resulting DAG.
The problem of bounded-treewidth structure learning can be cast as follows:

$$\max_{\Pi_1, \dots, \Pi_n} \sum_{i=1}^{n} \mathrm{score}(X_i, \Pi_i) \quad \text{s.t. } G \text{ is a DAG and } \mathrm{tw}(G) \leq k,$$

where $k$ is the given treewidth bound.
Different scores can be used to assess the fit of a DAG. We adopt the Bayesian information criterion (BIC), which is asymptotically proportional to the posterior probability of the DAG. The BIC score is defined as follows:

$$\mathrm{BIC}(G) = \sum_{i=1}^{n} \mathrm{BIC}(X_i, \Pi_i) = \sum_{i=1}^{n} \left( \mathrm{LL}(X_i \mid \Pi_i) + \mathrm{Pen}(X_i, \Pi_i) \right), \tag{1}$$

where $\mathrm{LL}(X_i \mid \Pi_i)$ denotes the log-likelihood of $X_i$ and its parent set $\Pi_i$:

$$\mathrm{LL}(X_i \mid \Pi_i) = \sum_{\pi, x} N_{x,\pi} \log \hat{\theta}_{x \mid \pi}, \tag{2}$$

while $\mathrm{Pen}(X_i, \Pi_i)$ is the complexity penalization:

$$\mathrm{Pen}(X_i, \Pi_i) = -\frac{\log N}{2} \, (|X_i| - 1) \, |\Pi_i|. \tag{3}$$
We denote by $\hat{\theta}_{x \mid \pi}$ the maximum likelihood estimate of the conditional probability $P(X_i = x \mid \Pi_i = \pi)$; by $N_{x,\pi}$ the number of times the joint state $(x, \pi)$ appears in the data set; $|\cdot|$ indicates the size of the Cartesian product of the state spaces of the variables given as argument. Thus $|X_i|$ is the number of states of $X_i$ and $|\Pi_i|$ is the product of the number of states of the parents of $X_i$.
The BIC score is decomposable, namely it is the sum of the scores of the individual variables given their parents. The k-MAX algorithm, which we present later, can be applied to any decomposable scoring function; see [16] for a discussion of decomposable scoring functions.
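As an illustration of Eqs. (1)–(3), the sketch below (our own code, with hypothetical names, not the paper's software) scores a single variable against a candidate parent set; by decomposability, summing such terms over all variables gives the BIC of a DAG.

```python
from collections import Counter
from math import log

def bic_family(data, child, parents, card):
    """BIC(X, Pi) = LL + Pen for one variable of a discrete data set.

    data:    list of dicts {variable name: state}
    child:   name of the variable X being scored
    parents: tuple of parent names Pi
    card:    dict {variable name: number of states}
    """
    # joint counts N_{x,pi} and parent-configuration counts N_{pi}
    n_xp = Counter((row[child], tuple(row[p] for p in parents)) for row in data)
    n_p = Counter(tuple(row[p] for p in parents) for row in data)
    # log-likelihood with ML parameters theta_{x|pi} = N_{x,pi} / N_{pi}
    ll = sum(n * log(n / n_p[pi]) for (x, pi), n in n_xp.items())
    pi_card = 1
    for p in parents:
        pi_card *= card[p]
    pen = -0.5 * log(len(data)) * (card[child] - 1) * pi_card
    return ll + pen
```

On a toy data set where B copies A perfectly, the parent set {A} scores higher than the empty parent set, as expected.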
3.1 Parent set identification
In order to efficiently prepare the cache of candidate parent sets of each variable we adopt the approach of [2]. The main idea of [2] is to quickly identify the most promising parent sets through an approximate scoring function that does not require scanning the data set. The approximate scoring function is called BIC*. The BIC* of a parent set constituted by the union of two non-empty and disjoint parent sets $\Pi_1$ and $\Pi_2$ is:
$$\mathrm{BIC}^*(X, \Pi_1 \cup \Pi_2) = \mathrm{BIC}(X, \Pi_1) + \mathrm{BIC}(X, \Pi_2) + \mathrm{inter}(X, \Pi_1, \Pi_2), \tag{4}$$

that is, the sum of the scores of the two parent sets and of an interaction term, which ensures that the penalty term of $\mathrm{BIC}^*(X, \Pi_1 \cup \Pi_2)$ matches the penalty term of $\mathrm{BIC}(X, \Pi_1 \cup \Pi_2)$. If $\mathrm{BIC}(X, \Pi_1)$ and $\mathrm{BIC}(X, \Pi_2)$ are known, then $\mathrm{BIC}^*(X, \Pi_1 \cup \Pi_2)$ is computed in constant time (with respect to data accesses). The independence selection algorithm [2] exploits BIC* to quickly and approximately score a large number of parent sets without limiting the in-degree, which is the maximum number of parents allowed for every node. Eventually, it computes the actual BIC score of the most promising parent sets. Additionally, we adopt pruning rules [17] in order to avoid computing the score of sub-optimal parent sets.
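The interaction term is fully determined by the requirement that the penalty of BIC* match the penalty of BIC for the merged parent set. The sketch below (our own derivation from Eq. (3), with hypothetical function names) computes it in constant time for disjoint parent sets, whose merged cardinality is the product of the two cardinalities.

```python
from math import log

def penalty(n_data, card_x, card_pi):
    """Pen(X, Pi) of Eq. (3): card_pi is the product of parent cardinalities."""
    return -0.5 * log(n_data) * (card_x - 1) * card_pi

def bic_star(bic1, bic2, n_data, card_x, card_pi1, card_pi2):
    """BIC*(X, Pi1 u Pi2) for disjoint parent sets Pi1, Pi2 (Eq. 4).

    The interaction term is exactly the correction that makes the
    penalty of the sum equal to the penalty of the merged parent set.
    """
    inter = (penalty(n_data, card_x, card_pi1 * card_pi2)
             - penalty(n_data, card_x, card_pi1)
             - penalty(n_data, card_x, card_pi2))
    return bic1 + bic2 + inter
```

Note that the log-likelihood part of BIC* is approximated by $\mathrm{LL}(\Pi_1) + \mathrm{LL}(\Pi_2)$, which is why the actual BIC is eventually recomputed for the most promising parent sets only.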
3.2 Learning Bayesian networks with bounded treewidth
Exact methods for bounded-treewidth structure learning of Bayesian networks [7, 8, 9] scale to at most a few dozens of variables. Approximate approaches are therefore needed to scale to larger domains.
The S2 algorithm [10] uniformly samples the space of k-trees; it then assesses the sampled k-trees through a heuristic scoring function (the informative score). The DAG is then obtained by constraining its moral graph to be a subgraph of the k-tree with the highest informative score. The S2+ algorithm [11] further refines this idea, obtaining via A* the k-tree guaranteed to maximize the informative score. In general S2+ recovers higher-scoring DAGs than S2, but its scalability is limited: for instance, it cannot be used with thousands of variables.
3.2.1 k-greedy
To the best of our knowledge, the state-of-the-art algorithm for bounded-treewidth learning is k-greedy [3], which consistently yields higher-scoring DAGs than its competitors. The k-greedy algorithm samples the space of orderings of the variables and, given an ordering, builds the bounded-treewidth DAG inductively as follows.
Initialization. Given an order over the variables and the value of the treewidth bound $k$, k-greedy initializes the structure with a DAG over the first $k+1$ variables in the order. The DAG is learned using either the exact method of [18] or the approximate method of [19], depending on the value of $k$. The treewidth of the learned DAG is at most $k$ (its moral graph is a subgraph of the clique over the same $k+1$ variables, which has treewidth $k$).
Addition of the following nodes. Given a DAG $G_i$ over $i$ variables, k-greedy chooses the highest-scoring feasible parent set for the $(i+1)$-th variable in the order. A parent set is feasible if it is a clique (or a subset of a clique) of the moral graph of $G_i$. Once the parent set of the new variable is chosen, we obtain the DAG $G_{i+1}$ over $i+1$ variables. At each iteration the moral graph of the DAG is a partial k-tree: thus the treewidth of the DAG is bounded by $k$.
Then k-greedy samples a new ordering and repeats the above procedure. When a maximum number of iterations or a maximum execution time is reached, it returns the highest-scoring DAG found.
3.3 k-MAX
k-MAX shares a fundamental idea with k-greedy: it incrementally grows a DAG, guaranteeing that at each step its moral graph is a subgraph of a k-tree. Differently from k-greedy, k-MAX does not adopt a predefined ordering over the variables. Instead, it ranks the variables that can be inserted in the graph through a heuristic score. Let $C_X$ denote the cache of candidate parent sets of variable $X$, and let $C_X^f \subseteq C_X$ be the subset of parent sets that are feasible. Recall that the feasible parent sets are constituted by the cliques of the moral graph of the current DAG, and by subsets of such cliques.

The heuristic compares, for each variable $X$, the score $\max_{\Pi \in C_X^f} \mathrm{BIC}(X, \Pi)$ of the highest-scoring feasible parent set with the scores $\min_{\Pi \in C_X} \mathrm{BIC}(X, \Pi)$ and $\max_{\Pi \in C_X} \mathrm{BIC}(X, \Pi)$ of the lowest-scoring and the highest-scoring parent sets available in the cache (notice that most parent sets in $C_X$ are not feasible when the k-tree contains a small number of variables). The rationale is to defer the addition of variables whose score is currently low, as it will increase in subsequent iterations thanks to the availability of a larger number of feasible parent sets. The minimum and maximum cache scores can be found in time linear in $|C_X|$ and cached. As for the best feasible score, it needs to be updated each time a new variable is added to the DAG.
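The feasibility test and the ranking can be sketched as follows (our own simplified surrogate, with hypothetical names: it ranks variables by the gap between their best feasible and best overall cached parent set, deferring variables whose feasible options are still poor; the exact form of the k-MAX score may differ).

```python
def feasible(parent_set, cliques):
    """A parent set is feasible if it is contained in some clique
    of the moral graph of the current DAG."""
    return any(parent_set <= c for c in cliques)

def rank_candidates(caches, cliques):
    """Rank variables not yet in the DAG, best candidate first.

    caches: {variable: list of (parent frozenset, BIC score)}
    A gap of 0 means the best cached parent set is already feasible.
    """
    ranked = []
    for x, cache in caches.items():
        best = max(s for _, s in cache)
        feas = [s for ps, s in cache if feasible(ps, cliques)]
        best_feasible = max(feas) if feas else float('-inf')
        ranked.append((best_feasible - best, x))
    ranked.sort(reverse=True)
    return [x for _, x in ranked]
```

In the toy example below, Y is ranked before X because its best cached parent set is already feasible, while the best parent set of X is not.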
We present an outline of k-MAX in Alg. 1. The procedure is repeated until a termination condition is met (for example, a maximum execution time). The highest-scoring DAG found is then returned.
Initialization
We start by building an initial k-tree over $k+1$ variables as follows. We initialize the list $S$ of chosen variables as an empty set, and we create a set $P$, which stores the candidate parents: namely, every variable appearing in at least one parent set in the cache of a variable of $S$. We first choose randomly the first variable $X$, we add it to $S$, and we add its candidate parents to $P$. Then, until $S$ contains $k+1$ variables, we: (1) add to $S$ a new variable $X'$ chosen randomly from $P$; (2) add all the candidate parents of $X'$ to $P$.
Finally, we learn the initial DAG over the $k+1$ variables contained in $S$, either with exact or approximate methods, as in the initialization of k-greedy. The moral graph of this DAG is a subgraph of the $(k+1)$-clique over the same variables and thus has treewidth at most $k$. For each new clique added to the k-tree, we update the heuristic score of each variable not yet processed.
Addition of the following nodes
Let us denote by $G_i$ and $K_i$ the current DAG and the k-tree over $i$ nodes. The $(i+1)$-th variable to be added is the variable $X^*$ maximizing the heuristic score among those not yet in $G_i$. We connect $X^*$ to the parent set

$$\Pi^* = \arg\max_{\Pi \in C_{X^*}^f} \mathrm{BIC}(X^*, \Pi).$$

This yields the updated DAG $G_{i+1}$. We then update the k-tree, connecting $X^*$ to a clique of $K_i$ that is a superset of $\Pi^*$. In the event of several cliques sharing this property, a random one among them is chosen. This yields the k-tree $K_{i+1}$; it contains an additional $(k+1)$-clique compared to $K_i$. By construction, $K_{i+1}$ is also a k-tree. For each new clique added to the k-tree, we update the heuristic score of each variable not yet processed.
Space of learnable DAGs
A reverse topological order is an order over the vertices of a DAG in which each vertex appears before its parents. The search space of k-MAX contains only DAGs whose reverse topological order, when used as variable elimination order, induces treewidth at most $k$. The proof is identical to that provided in [3] for k-greedy. The extensive experiments of [3] show that this limitation does not hurt the empirical performance of k-greedy, and thus we do not further discuss this point.
3.4 Advantages over k-greedy
Initialization
Recall that k-MAX builds iteratively the initial set of $k+1$ variables by ensuring that any added variable is a candidate parent of at least one variable already in the clique.
We prune the available parent sets for each variable by Lemma 1 of [20]. It states that, given two parent sets $\Pi_1$ and $\Pi_2$ for the same variable $X$ such that $\Pi_1 \subset \Pi_2$ and $\mathrm{BIC}(X, \Pi_1) \geq \mathrm{BIC}(X, \Pi_2)$, then $\Pi_2$ can be discarded from $C_X$, as it yields sub-optimal structures. This happens when the additional parents have low mutual information with $X$: their contribution to the score is negative, since the small increase in log-likelihood (due to low mutual information) is outweighed by the complexity penalization of the scoring function. Thus a variable is a candidate parent of $X$ only if it appears in at least one non-pruned parent set of $X$ (and thus has significant mutual information with $X$).
Instead, the initialization of k-greedy randomly samples $k+1$ variables, ignoring their mutual information. By virtue of its selection process, the initial DAG found by k-MAX is in general higher-scoring than the one found by k-greedy.
Addition of the following nodes
The addition step of k-greedy follows the sampled order. Again, this overlooks whether, given the current k-tree, there exists a feasible parent set with a good score for the variable being added. k-MAX instead optimizes the variable to be added at each iteration, ranking the candidates according to the heuristic score.
4 Experiments: k-MAX against k-greedy
Name  n  N  Name  n  N  Name  n  N

Kdd  64  11490  Retail  135  4408  EachMovie  500  591 
Plants  69  3482  Pumsbstar  163  2452  WebKB  839  838 
Audio  100  3000  DNA  180  1186  Reuters52  889  1540 
Jester  100  4116  Kosarek  190  6675  C20NG  910  3764 
Netflix  100  2634  MSWeb  294  5000  BBC  1058  330 
Accidents  111  2551  Book  500  1739  Ad  1556  491 
Name  n  Name  n  Name  n

andes  223  r2  2000  r9  4000 
diabetes  413  r3  2000  r10  10000 
pigs  441  r4  2000  r11  10000 
link  724  r5  4000  r12  10000 
munin  1041  r6  4000  r13  10000 
r0  2000  r7  4000  r14  10000 
r1  2000  r8  4000 
We consider the 18 data sets listed in Table 1 (for each data set, $n$ denotes the number of variables and $N$ the number of instances). They have been previously used in [21] and in other works referenced therein. The data sets are available, for instance, from https://github.com/arranger1044/awesome-spn#dataset. Each data set is split in three subsets; we thus perform 18×3 = 54 structure learning experiments. The number of instances in each data set ranges from 226 to 180,000.
Additionally, we compared k-MAX and k-greedy on 20 synthetic data sets sampled from known networks (Table 2). Five of these are taken from the literature (http://www.bnlearn.com/bnrepository/): andes, diabetes, pigs, link, munin. The other fifteen (r0–r14) have been generated by us using the BNGenerator package (http://sites.poli.usp.br/pmr/ltd/Software/BNGenerator/). They contain up to 10,000 variables. From each known network we sample a training data set of 5,000 instances.
We hence consider a total of 74 data sets (54 real ones and 20 synthetic ones), on which we compare k-greedy and k-MAX. We provide both algorithms with the same cache of parent sets for each variable, precomputed using independence selection [2]; we then let each algorithm run for one hour on the same machine.
In each experiment we measure the difference ΔBIC between the BIC scores of the DAGs returned by k-MAX and by k-greedy. There is an exact mapping between the values of ΔBIC and the Bayes factor (BF) [22, Sec. 4.3]. The BIC score of a graph $G$ is an approximation of the logarithm of its marginal likelihood $P(D \mid G)$. Given two graphs $G_1$ and $G_2$, the Bayes factor is the ratio of their marginal likelihoods; its logarithm can thus be approximated by the difference of the BIC scores:

$$\log \mathrm{BF}(G_1, G_2) = \log \frac{P(D \mid G_1)}{P(D \mid G_2)} \approx \mathrm{BIC}(G_1) - \mathrm{BIC}(G_2) = \Delta\mathrm{BIC}.$$

A positive ΔBIC provides evidence in favor of $G_1$ and a negative ΔBIC provides evidence in favor of $G_2$. Assuming equal prior probabilities for the two graphs, the posterior probability of $G_1$ is $\mathrm{BF}/(1 + \mathrm{BF})$.

For instance, a ΔBIC > 10 implies a Bayes factor > 150 and a posterior probability > 0.99, conveying very strong evidence in favor of $G_1$. Following the same logic, the values of ΔBIC can be interpreted [22, Sec. 4.3] according to this scale:

ΔBIC > 10: extremely positive evidence;
6 < ΔBIC < 10: strongly positive evidence;
2 < ΔBIC < 6: positive evidence;
ΔBIC < 2: neutral evidence.
We report only the case of positive ΔBIC; negative values of ΔBIC are interpreted according to the same scale, but as evidence against $G_1$.
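The mapping from ΔBIC to Bayes factor and posterior can be sketched as follows (a minimal illustration of our own, assuming equal prior probabilities for the two graphs):

```python
from math import exp

def bf_and_posterior(delta_bic):
    """Approximate Bayes factor and posterior probability of G1, given
    delta_bic = BIC(G1) - BIC(G2) ~ log BF(G1, G2).

    Assumes equal prior probabilities P(G1) = P(G2).
    """
    bf = exp(delta_bic)
    return bf, bf / (1.0 + bf)
```

For example, ΔBIC = 10 yields a Bayes factor well above 150 and a posterior probability above 0.99, while ΔBIC = 0 leaves the two graphs equally probable.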
treewidth  2  5  8

k-MAX vs k-greedy
ΔBIC
extremely positive  71  67  69
strongly positive  0  0  0
positive  0  0  1
neutral  0  0  0
negative  0  0  0
strongly negative  0  0  0
very negative  3  7  5
We perform independent experiments with the treewidths $k \in \{2, 5, 8\}$. We summarize the results in Table 3. In most cases there is extremely positive evidence for the model learned by k-MAX over the model learned by k-greedy (ΔBIC > 10). We further analyze the results through a sign test, considering one method as winning over the other when there is a ΔBIC of at least 2 in its favor, and treating as ties the cases in which |ΔBIC| < 2. The number of wins obtained by k-MAX over k-greedy is significant for every tested treewidth.
Iteration statistics
We further compare k-MAX and k-greedy by analyzing their iterations. We consider as an example the data set tmovie.test (500 variables) with treewidth $k = 5$. As shown in Table 4, k-MAX performs far fewer iterations (two orders of magnitude fewer, in this example) than k-greedy; this is due to the overhead of updating the heuristic scores of all the variables not yet added to the structure. However, this strategy pays off, as the median score of the DAG retrieved at each iteration is much higher for k-MAX. This is the advantage of the more sophisticated heuristics of k-MAX.
  k-MAX  k-greedy

Number of iterations  1,111  96,226
Median BIC score  −36,937  −37,489
Comparison against S2
In [3] it was shown that k-greedy consistently outperforms S2. k-MAX further increases the gap over S2: in particular, it achieves ΔBIC > 10 compared to S2 for every treewidth and data set considered.
Inference times
We now compare inference times, using Iterative Join Graph Propagation [23] as the inference engine. We focus first on the networks containing 1,000 or more variables provided in Table 2, for which the ground-truth networks are known. Using such large ground-truth networks results in slow inference even when computing marginals. In several cases the inference did not converge even after 30 minutes of computation. In these cases, even if we could learn the actual DAG with a perfect learner, the model would be hardly usable due to the slowness of the inference. Conversely, the bounded-treewidth models learned by k-MAX compute marginals consistently in less than 0.1 seconds, even with treewidth 8. By bounding the treewidth we thus guarantee the efficiency of the inferences.
Similar considerations hold also for the smaller ground-truth networks, such as andes, diabetes, etc. In these cases marginals can be efficiently computed using the ground-truth networks, but slowness problems appear when we compute the probability of the joint evidence of five variables. This requires (averaging over data sets) about 60 seconds when using the ground-truth networks and less than 5 seconds when using bounded-treewidth models with treewidth 8 (the slowest bounded-treewidth model).
5 Comparison with Chordalysis
Chordalysis [24] performs structure learning for log-linear models; such undirected graphical models are also known as Markov networks. In particular, Chordalysis learns chordal models, which lie at the intersection of Bayesian networks and Markov networks: given a chordal Markov network, it is possible to obtain a Bayesian network which encodes exactly the same independences [25, Sec. 4.5.3].
Chordalysis starts from the empty graph, which contains no edges. It then decides which edges to add based on a series of statistical tests of independence. While classical approaches to log-linear analysis hardly scale beyond ten variables, Chordalysis scales to high-dimensional data thanks to sophisticated algorithms that efficiently compute the statistics of the tests and avoid unnecessary tests. We learn the undirected chordal graph using the variant of Chordalysis described in [5] (code available from https://github.com/fpetitjean), and we subsequently obtain the equivalent Bayesian network using the algorithm of [25, Sec. 4.5.3].

While score-based structure learning aims at maximizing the predictivity of the model, structure learning based on hypothesis tests aims at building an explanatory model, by controlling the rate of false-positive edges among those which constitute the graph. The two approaches thus have different goals. Yet it does make sense to compare k-MAX and Chordalysis: they are among the very few methods able to learn probabilistic graphical models from thousands of variables. Moreover, even if Chordalysis does not formally bound the treewidth, it generally yields quite sparse graphs that are likely to have low treewidth (see the results shown later). This happens because the statistical test (corrected for multiple comparisons) does not allow adding an edge unless there is strong evidence against the null hypothesis of independence.
5.1 Results
In this comparison we avoid using the BIC score of the resulting networks as a performance indicator, as Chordalysis does not aim at maximizing it. Instead, we measure how well the models fit the distribution by computing the log-likelihood of the instances of the test set, $\mathrm{LL} = \sum_j \log P(x_j \mid M)$, where the $x_j$ denote the instances of the test set and $M$ is the model being tested. We then compute the difference in test-set log-likelihood (ΔLL) between the models learned by k-MAX and by Chordalysis. A value greater than 0 indicates that the model learned by k-MAX yields a higher likelihood (thus a better fit) than the model learned by Chordalysis, and vice versa.
We first consider the data sets listed in Table 1. Recall that each data set is split into three parts; for each data set, we use one part for learning the models and the union of the other two parts as a test set. We thus perform 3 experiments for each data set, for a total of 54 experiments. In Fig. 3 (upper plot) we show the distribution of ΔLL across the data sets for the different tested treewidths. The models learned by Chordalysis provide a fit that is comparable to the models learned by k-MAX using treewidth 2. When k-MAX adopts higher treewidths, such as 5 or 8, it fits the distribution better (or even much better) than Chordalysis.
We then consider the true networks of Table 2. In Fig. 3 (lower plot) we see that the same pattern appears for ΔLL. In this case both the training set and the test set are sampled from the known networks.
Additionally, we study the running time and the accuracy of the inference. As inference we consider the computation of the probability of the evidence constituted by five randomly selected variables, which we set to random states. For each data set we run 100 queries and we measure the mean absolute error (mae) of each model:

$$\mathrm{mae} = \frac{1}{Q} \sum_{i=1}^{Q} \left| P_i - \hat{P}_i \right|,$$

where $Q$ denotes the total number of queries, and $P_i$ and $\hat{P}_i$ the probability of evidence computed respectively by the ground-truth model and by the bounded-treewidth model on the $i$-th query. As ground truth we take the probability of evidence computed on the original network, using Iterative Join Graph Propagation [23] and running it until convergence.
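The error measure itself is straightforward to compute, as in this sketch (our own illustration, with hypothetical variable names):

```python
def mean_absolute_error(p_true, p_model):
    """mae between ground-truth and model probabilities of evidence,
    one entry per query."""
    if len(p_true) != len(p_model):
        raise ValueError("mismatched number of queries")
    return sum(abs(t - m) for t, m in zip(p_true, p_model)) / len(p_true)
```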
We show in Fig. 4 the mae of the model learned by Chordalysis and of the models learned by k-MAX with the treewidth bound in $\{2, 5, 8\}$. The mae values of the learned models are comparable, with a slight improvement for higher values of the treewidth (not visible in the graph). In Fig. 5 we show the time required to compute the probability of evidence. The inference times of the models learned by k-MAX increase with the treewidth. Yet even the models learned by k-MAX with treewidth 8 are on average slightly faster than the model learned by Chordalysis. Generally, the k-MAX models with treewidth 5 or 8 yield both a better fit and quicker inference than Chordalysis. This shows the soundness of k-MAX. We recall, however, that Chordalysis aims at building explanatory models rather than predictive ones.
6 Structural EM
Most works in structure learning assume the data set to be complete. Yet this is rarely the case in real-world domains. This poses a serious problem, since learning a structure from incomplete data is computationally challenging.
The most common approach for structure learning from incomplete data sets is the structural EM (SEM), originally presented in [26, 27]; a discussion of this algorithm is also given by [25, Chapter 19].
The key idea of structural EM is to use the current best estimate of the distribution to complete the data, and then analyze such complete data. In a nutshell, structural EM performs a search in the joint space of structures and parameters. It alternates between finding better parameters for the current structure and selecting a new structure. The former is a parametric EM step, while the latter is a structural EM step. For penalized likelihood scoring functions, such as the BIC, this procedure is proven to converge to a local maximum [26].
In more detail, structural EM proceeds by alternating two steps. In the expectation step, it computes the expected sufficient statistics given a candidate structure (the sufficient statistics are the counts of the occurrences of each value of a given variable jointly with each possible assignment of its parents). Given the expected sufficient statistics, the maximization step learns an updated structure and estimates its parameters. The two steps are alternated until the search converges to a structure. We present SEM in Alg. 2, adopting the description of [25, Chapter 19].
The most demanding part of SEM is the expectation step, which requires computing several queries, whose complexity is (in the worst case) exponential in the treewidth of the model. Thus SEM becomes prohibitive if the model being learned has unbounded treewidth and there are many missing data. This prevents the application of SEM to large data sets. We adopt k-MAX in order to perform structure learning within SEM: by learning bounded-treewidth models we obtain an efficient computation of the expected sufficient statistics. We call the resulting approach SEM-kMAX. As far as we know, this is the first implementation of structural expectation-maximization that scales up to thousands of variables.
The StructureLearn step of Algorithm 2 is constituted by two parts. Given the (expected) sufficient statistics computed in the previous iteration, we first compute the cache of best parent sets for each variable using independence selection. Then we find the highest-scoring DAG using k-MAX with a treewidth bound of $k$. We set an execution time of $n$ seconds (one second per variable) for the first step, and a limit of the same order for the second step. Such time limits are shorter than those adopted on complete data sets, as we need to perform structure learning at each maximization step of SEM. In Section 7.4 we evaluate the sensitivity of the results to the allowed execution time of both steps.
The algorithm reaches convergence when the structure remains unchanged between subsequent SEM iterations, i.e., no improvement is found on the structure of the previous iteration. As for the choice of the initial network, we adopt a random chain that connects all the variables, as in [12].
6.1 Further implementation details
A peculiar aspect of our implementation is that we adopt the hard EM [25, Chap. 19.2.2.6] for the computation of the expected sufficient statistics. While the standard soft EM produces a probability distribution over the missing data, hard EM fills in the missing data with their most probable completion. The relative merits of hard and soft EM are discussed, for instance, by [28] and [25, Chap. 19.2.2.6]. We adopt the hard EM in order to limit the memory usage of the algorithm. In fact, we cannot foresee which sufficient statistics will be required by the maximization step when it looks for a new structure. Keeping the soft-completed data set in memory is however not feasible: the memory requirement is, for each instance, exponential in the number of missing values. Some workarounds exist [25, Chap. 19.4.3.4], based on severe restrictions on the type of learnable structures. By adopting the hard EM we avoid the memory issues altogether, as the hard-completed data set requires the same memory as the original data set. This allows us to perform structure learning without restricting the search space of the structures, apart from the bounded-treewidth constraint.

Given an instance of the data set containing missing values, we can fill the missing values jointly or independently. The joint approach requires running a single MPE (most probable explanation) query [25, Chap. 2.1.5.2] for each incomplete instance. The independent approach requires running a marginal query for each missing value, marginalizing out the other missing variables in the same instance. In the field of multi-label classification (where one has to predict a set of related labels given the observed features) it has been pointed out [29] that MPE inference maximizes the probability of correctly predicting the whole set of labels, while marginal inference maximizes the probability of correctly predicting each label independently. In practical terms, extensive experiments in multi-label classification do not show major differences between the results yielded by the two approaches [30]. There are however applications, such as message decoding over a noisy channel, where the joint approach is clearly preferable. In the following, we report results obtained using the joint approach.
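The difference between the two completion strategies can be seen on a toy explicit joint distribution (our own sketch, with hypothetical names; a real implementation would query the learned Bayesian network instead of enumerating an explicit table):

```python
from collections import defaultdict

def complete_instance(joint, instance):
    """Fill the None entries of `instance` (a tuple) in two ways,
    given an explicit joint distribution {assignment tuple: probability}:
    jointly (MPE) and independently (per-variable marginals)."""
    idx = [i for i, v in enumerate(instance) if v is None]
    # restrict to assignments compatible with the observed values
    compat = {a: p for a, p in joint.items()
              if all(a[i] == v for i, v in enumerate(instance) if v is not None)}
    # joint completion: most probable compatible assignment
    mpe = max(compat, key=compat.get)
    # independent completion: argmax of each missing variable's marginal
    indep = list(instance)
    for i in idx:
        marg = defaultdict(float)
        for a, p in compat.items():
            marg[a[i]] += p
        indep[i] = max(marg, key=marg.get)
    return mpe, tuple(indep)
```

On the distribution $P(0,0)=0.4$, $P(1,1)=0.35$, $P(1,0)=0.25$ with both values missing, the joint completion is $(0,0)$ while the marginals favor $(1,0)$: the two criteria optimize different objectives, as discussed above.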
7 Application to data imputation
We benchmark the performance of SEM-kMAX on the task of data imputation, which is the process of replacing missing data with predictions of their values. Once SEM has run to convergence, we have both a trained model and an imputed data set.
A pioneering Bayesian network approach to data imputation is that of [31]; it however requires ordering the variables according to their reliability before performing structure learning, which is hardly applicable in data mining applications, where the number of variables can be in the order of thousands. As a state-of-the-art competitor for data imputation we thus consider an approach based on random forests.
The missForest algorithm [13] recasts missing data imputation as a prediction problem. The initial guess is made by mode imputation, namely by substituting each missing value with the most common value of the variable. The variables are then sorted according to their amount of missing values, and the data are imputed by regressing (using a random forest) each variable in turn against all the other variables, starting from the variable with the smallest missingness. The predictions for the dependent variable are used as imputations. The empirical study of [32] compares different imputation algorithms based on random forests, concluding that missForest is indeed the most accurate, but also the slowest: at each iteration it requires fitting one random forest per variable, which becomes slow when dealing with a large number of variables. Thus [32] propose the mRF algorithm as a scalable alternative to missForest. It randomly divides the variables into mutually exclusive groups of approximately equal size. Each group in turn acts as the multivariate response to be regressed on the remaining variables at each iteration; thus at each iteration mRF trains only one random forest model per group. We used the implementation of mRF available in the randomForestSRC R package, setting the group-size parameter to the value that yields good results in the experiments of [32].
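The iteration scheme of missForest can be sketched as follows. To keep the sketch self-contained, the random forest regressor of [13] is replaced by a simple conditional-mode predictor, and only a single pass over the variables is shown (the real algorithm iterates until the imputations stabilize).

```python
from collections import Counter

def mode(values):
    # most common value; ties broken by first occurrence
    return Counter(values).most_common(1)[0][0]

def impute(data):
    """One pass of a missForest-style scheme on categorical data.
    `data` is a list of rows; None marks a missing value. The random
    forest of [13] is replaced here by a conditional-mode predictor."""
    n_vars = len(data[0])
    missing = [(i, j) for i, row in enumerate(data)
               for j, v in enumerate(row) if v is None]
    # 1) initial guess: mode imputation, variable by variable
    for j in range(n_vars):
        m = mode([row[j] for row in data if row[j] is not None])
        for row in data:
            if row[j] is None:
                row[j] = m
    # 2) visit variables from smallest to largest missingness and
    #    re-predict each missing cell from the remaining variables
    order = sorted(range(n_vars),
                   key=lambda j: sum(1 for _, jj in missing if jj == j))
    for j in order:
        for i, jj in missing:
            if jj != j:
                continue
            context = [v for c, v in enumerate(data[i]) if c != j]
            matches = [row[j] for r, row in enumerate(data) if r != i
                       and [v for c, v in enumerate(row) if c != j] == context]
            if matches:
                data[i][j] = mode(matches)
    return data
```

For instance, on `[[0, 0], [0, 0], [0, None], [1, 1], [1, 1], [1, None]]` the second pass corrects the initial mode guess for the last row, since rows matching its observed value agree on 1.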
7.1 Experimental setup
We ran imputation experiments on the 18 data sets of Table 1. On each data set we induced missing-completely-at-random (MCAR) missingness: each observation was made missing with a fixed probability, regardless of the values of the other variables. For each data set we considered several missingness percentages, making each value missing with the corresponding probability, and we repeated each experiment 5 times. In each repetition we measured the proportion of missing values that were correctly imputed (imputation accuracy):
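Inducing MCAR missingness amounts to masking each cell independently with the same probability; a minimal sketch follows (the function name and the use of `None` as the missing marker are our own choices).

```python
import random

def make_mcar(data, p, seed=0):
    """Make each value missing (None) with fixed probability p,
    independently of the values of the other variables (MCAR)."""
    rng = random.Random(seed)
    return [[None if rng.random() < p else v for v in row] for row in data]

full = [[1, 2, 3], [4, 5, 6]]
print(make_mcar(full, 0.0))  # no value is masked
print(make_mcar(full, 1.0))  # every value is masked
```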
\[
\text{accuracy} \;=\; \frac{\sum_{i=1}^{N} \sum_{j \in M_i} \mathbb{1}\left[\hat{x}_{ij} = x_{ij}\right]}{\sum_{i=1}^{N} m_i},
\]
where $x_{ij}$ denotes the value of the $j$-th variable in instance $i$ in the original dataset, $\hat{x}_{ij}$ its imputed value, $M_i$ (of size $m_i$) the set of values missing in instance $i$, and $N$ is the total number of instances.
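In code, the imputation accuracy is simply the fraction of masked cells whose imputed value matches the original one (a sketch; the boolean-mask representation is our own choice):

```python
def imputation_accuracy(original, imputed, mask):
    """Proportion of missing values correctly imputed. `mask[i][j]` is
    True where the value of variable j in instance i was made missing."""
    correct = total = 0
    for i, row in enumerate(mask):
        for j, was_missing in enumerate(row):
            if was_missing:
                total += 1
                correct += original[i][j] == imputed[i][j]
    return correct / total

orig = [[0, 1], [1, 1]]
imp  = [[0, 0], [1, 1]]
mask = [[False, True], [True, False]]
print(imputation_accuracy(orig, imp, mask))  # 0.5: one of two masked cells matches
```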
7.2 Comparison with mRF
In the comparison with mRF we considered all the data sets of Table 1. In two cases (MSWeb and C20NG) mRF failed to provide a solution within 24 hours, and these data sets were removed from the comparison. We conjecture that this is due to the joint presence of a high number of variables and a high number of data points.
Figure 6 shows the scatter plots of imputation accuracy and execution times, comparing SEMkMAX and mRF at different missingness levels. The two approaches achieve practically the same imputation accuracy (left plots), but SEMkMAX is substantially faster than mRF (right plots).
We further report the ratio of their imputation accuracies in Fig. 7 and the ratio of their execution times in Fig. 8, aggregating the experiments performed on different data sets with the same level of missingness. Again we can see that SEMkMAX achieves an imputation accuracy comparable to mRF, while on average reducing the computational time by one order of magnitude.
7.3 Computational complexity
We now study the complexity of SEMkMAX, showing that it is linear in the input size. Let us focus on a single iteration of the procedure. Each iteration is composed of three different steps: parent set exploration, kMAX, and data imputation.
Let $n$, $k$ and $N$ denote respectively the number of variables, the bound on the treewidth and the number of data points in the data set. As for the first step, the size of the search space of possible parent sets for each variable is $O(n^k)$ (without loss of generality, the search can be restricted to parent sets of size up to the chosen treewidth); for the evaluation of each parent set, we can assume that a scan of the entire data set is required. This would lead to an overall complexity of $O(N \cdot n^{k+1})$ for the first step. However, such a complete exploration of the search space of parent sets is usually unfeasible; the algorithm BIC* [2] was designed exactly to overcome this blowup by guiding the exploration towards the most promising parent sets in the allowed time (a fixed number of seconds in our experiments). This is equivalent to allowing the exploration of at most $m$ parent sets per variable, which yields a complexity of $O(N \cdot n \cdot m)$. Since $m$ is a fixed bound in practice, the actual cost of the first step is $O(N \cdot n)$.
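A back-of-envelope calculation with hypothetical values for the number of variables, the treewidth bound, the number of data points and the cap on explored parent sets shows why the exhaustive exploration is unfeasible while the capped one is not:

```python
# Hypothetical sizes: n variables, treewidth bound k, N data points,
# and a cap m on the parent sets explored per variable.
n, k, N, m = 1000, 5, 100_000, 10_000

exhaustive = N * n ** (k + 1)  # one data scan per parent set, per variable
capped     = N * n * m         # exploration limited to m parent sets each

print(exhaustive // capped)    # 100_000_000_000: an 11-order-of-magnitude gap
```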
The second step is the execution of kMAX, an iterative algorithm. In every iteration, we first build a core structure of $k+1$ nodes; since $k$ is constant by choice, the related computation takes constant time too. The subsequent part of the algorithm adds one node at a time; for each node we explore the precomputed parent sets, whose maximum number $m$ is, as mentioned, a constant. Therefore the overall complexity of one iteration is $O(n \cdot m)$; note that this part does not depend on the data, which have been processed in the first step. The number of iterations is bounded by the allowed time (again a fixed number of seconds), and for this reason it is a constant as well. It follows that the complexity of the second step is dominated by that of the first step.
The last step is data imputation, which amounts to performing exact inference on the bounded-treewidth BN for every record containing missing values. In general the number of queries is then $O(N)$. The cost of performing each query is $O(n \cdot s^{k+1})$, where $s$ is the maximum number of states of a variable in the network. In our application $k$ is a constant by definition, and $s$ can be regarded as a constant too, given that it usually does not exceed a few tens in applications. The cost of a query is then $O(n)$, and the final cost of performing data imputation is $O(N \cdot n)$.
So far we have considered a single iteration of SEMkMAX. The number of iterations is usually not very large and can in any case be capped at a chosen maximum. Taking this into account, the overall complexity of SEMkMAX is $O(N \cdot n)$; in other words, SEMkMAX has worst-case complexity linear in the size of the input data.
7.4 Parameter tuning
As mentioned, one can tune SEMkMAX by choosing the maximum execution time allowed for its steps. We now experimentally evaluate different choices for this time budget.
We allow a fixed number of seconds both for the parent set identification step and for the structure optimization step, comparing three choices of this time budget. We denote the resulting variants of structural EM by the budget adopted; in the previous experiments we used the intermediate budget, which is a good compromise between imputation accuracy and required time.
In Figure 9 we plot the total execution time on all the data sets for the three configurations. For the smallest budget the required time is almost linear in the size of the data set, showing that learning from incomplete data is feasible even for massive data sets with thousands of variables. For the two larger budgets we see that on the largest data sets the total execution times become closer to each other; the reason is the cost of executing the expectation step, which remains almost the same in all cases.
From the overall behaviour we can see that the increase in the required execution time is at most linear in the number of variables. In Figure 10 we compare the imputation accuracy of the intermediate budget against the smallest one, and of the largest budget against the intermediate one. The advantage of the intermediate budget over the smallest is clear, while the advantage of the largest budget over the intermediate can be called into question.
By choosing the value of the time budget, the user can tune the tradeoff between imputation accuracy and total execution time. We recommend the intermediate choice, which strikes a good balance between the two objectives, at least in our experiments.
7.5 Parallelization
The whole learning procedure can greatly benefit from parallelization. In the parent set identification step, each variable can be considered independently of the others. In the structure optimization step, kMAX can be run simultaneously on multiple cores, keeping only the best structure found among them. Finally, in the expectation phase the queries required for the estimation of the missing values can be executed independently for each data point.
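The per-variable independence of the parent set identification step maps directly onto a parallel map. The sketch below uses threads and a placeholder scoring function; the real implementation would distribute the actual score computations across cores.

```python
from concurrent.futures import ThreadPoolExecutor

def identify_parent_sets(variable):
    # Placeholder for the per-variable parent set exploration;
    # returns a fake "best score" just to show the parallel structure.
    return variable, sum(range(variable + 1))

variables = list(range(8))
with ThreadPoolExecutor(max_workers=4) as pool:
    # each variable is processed independently of the others
    best = dict(pool.map(identify_parent_sets, variables))

print(best[7])  # 28
```

The same pattern applies to the expectation phase, mapping the MPE queries over the incomplete records instead of over the variables.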
8 Conclusions
We presented a new anytime algorithm (kMAX) for learning bounded-treewidth Bayesian networks. Experiments on complete data sets show that kMAX finds structures with significantly higher fit to the data than its competitors, especially on high-dimensional data sets. Moreover, kMAX can be plugged into structural EM in order to perform structure learning from incomplete data sets; in this case the bounded treewidth allows the expectation phase to be computed efficiently. Structural EM with kMAX achieves accuracy comparable to state-of-the-art imputation approaches based on random forests, while allowing for a speedup of about one order of magnitude. To the best of our knowledge, our approach is the first implementation of structural EM able to efficiently scale to thousands of variables.
9 Acknowledgments
The research in this paper has been partially supported by the Swiss NSF grant n. IZKSZ2_162188. This work was also supported by the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (NRF2015K1A3A1A14021055). We thank Cassio Polpo de Campos for critical discussions on the topics of this paper.
References
 [1] A. Darwiche, Modeling and Reasoning with Bayesian Networks, Cambridge University Press, 2009.
 [2] M. Scanagatta, C. P. de Campos, G. Corani, M. Zaffalon, Learning Bayesian networks with thousands of variables, in: NIPS-15: Advances in Neural Information Processing Systems 28, 2015, pp. 1855–1863.
 [3] M. Scanagatta, G. Corani, C. P. de Campos, M. Zaffalon, Learning treewidth-bounded Bayesian networks with thousands of variables, in: NIPS-16: Advances in Neural Information Processing Systems 29, 2016, pp. 1462–1470.
 [4] F. Petitjean, G. I. Webb, Scaling log-linear analysis to datasets with thousands of variables, in: SIAM International Conference on Data Mining, 2015, pp. 469–477.
 [5] G. I. Webb, F. Petitjean, A multiple test correction for streams and cascades of statistical hypothesis tests, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1225–1264.
 [6] J. H. P. Kwisthout, H. L. Bodlaender, L. C. van der Gaag, The necessity of bounded treewidth for efficient inference in Bayesian networks, in: ECAI-10: Proceedings of the 19th European Conference on Artificial Intelligence, 2010, pp. 237–242.
 [7] J. H. Korhonen, P. Parviainen, Exact learning of bounded tree-width Bayesian networks, in: AISTATS-13: Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, JMLR W&CP 31, 2013, pp. 370–378.
 [8] P. Parviainen, H. S. Farahani, J. Lagergren, Learning bounded tree-width Bayesian networks using integer linear programming, in: AISTATS-14: Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, 2014, pp. 751–759.
 [9] J. Berg, M. Järvisalo, B. Malone, Learning optimal bounded treewidth Bayesian networks via maximum satisfiability, in: AISTATS-14: Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, 2014, pp. 86–95.
 [10] S. Nie, C. P. de Campos, Q. Ji, Learning bounded treewidth Bayesian networks via sampling, in: ECSQARU-15: Proceedings of the 13th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty, 2015, pp. 387–396.
 [11] S. Nie, C. P. de Campos, Q. Ji, Learning Bayesian networks with bounded treewidth via guided search, in: AAAI-16: Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016, pp. 3294–3300.
 [12] N. Friedman, Learning belief networks in the presence of missing values and hidden variables, in: Proceedings of the 14th International Conference on Machine Learning, 1997, pp. 125–133.
 [13] D. J. Stekhoven, P. Bühlmann, MissForest: non-parametric missing value imputation for mixed-type data, Bioinformatics 28 (1) (2012) 112–118.
 [14] G. Elidan, S. Gould, Learning bounded treewidth Bayesian networks, in: Advances in Neural Information Processing Systems 21, Curran Associates, Inc., 2009, pp. 417–424.
 [15] H. P. Patil, On the structure of k-trees, Journal of Combinatorics, Information and System Sciences (1986) 57–64.
 [16] Z. Liu, B. Malone, C. Yuan, Empirical evaluation of scoring functions for Bayesian network model selection, BMC Bioinformatics 13 (15) (2012) S14.
 [17] C. P. de Campos, Z. Zeng, Q. Ji, Structure learning of Bayesian networks using constraints, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 113–120.
 [18] J. Cussens, Bayesian network learning with cutting planes, in: UAI-11: Proceedings of the 27th Annual Conference on Uncertainty in Artificial Intelligence, AUAI Press, 2011, pp. 153–160.
 [19] M. Scanagatta, G. Corani, M. Zaffalon, Improved local search in Bayesian networks structure learning, in: Proceedings of the Third Workshop on Advanced Methodologies for Bayesian Networks, 2017, pp. 45–56.
 [20] C. P. de Campos, Q. Ji, Efficient structure learning of Bayesian networks using constraints, Journal of Machine Learning Research 12 (2011) 663–689.
 [21] A. Rooshenas, D. Lowd, Learning sum-product networks with direct and indirect variable interactions, in: ICML, 2014, pp. 710–718.
 [22] A. E. Raftery, Bayesian model selection in social research, Sociological Methodology 25 (1995) 111–164.
 [23] R. Mateescu, K. Kask, V. Gogate, R. Dechter, Join-graph propagation algorithms, Journal of Artificial Intelligence Research 37 (1) (2010) 279–328.
 [24] F. Petitjean, G. I. Webb, A. E. Nicholson, Scaling log-linear analysis to high-dimensional data, in: IEEE International Conference on Data Mining, 2013, pp. 597–606.
 [25] D. Koller, N. Friedman, Probabilistic Graphical Models: Principles and Techniques, The MIT Press, 2009.
 [26] N. Friedman, M. Goldszmidt, Learning Bayesian networks with local structure, in: Proceedings of the 12th International Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., 1996, pp. 252–262.
 [27] N. Friedman, The Bayesian structural EM algorithm, in: UAI-98: Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., 1998, pp. 129–138.
 [28] R. Samdani, M.-W. Chang, D. Roth, Unified expectation maximization, in: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, 2012, pp. 688–698.
 [29] W. Cheng, E. Hüllermeier, K. J. Dembczynski, Bayes optimal multilabel classification via probabilistic classifier chains, in: ICML-10: Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 279–286.
 [30] A. Antonucci, G. Corani, D. Mauá, S. Gabaglio, An ensemble of Bayesian networks for multi-label classification, in: IJCAI-13: Proceedings of the 23rd International Joint Conference on Artificial Intelligence, AAAI Press, 2013, pp. 1220–1225.
 [31] M. Di Zio, M. Scanu, L. Coppola, O. Luzi, A. Ponti, Bayesian networks for imputation, Journal of the Royal Statistical Society: Series A (Statistics in Society) 167 (2) (2004) 309–322.
 [32] F. Tang, H. Ishwaran, Random forest missing data algorithms, Statistical Analysis and Data Mining: The ASA Data Science Journal.