Hypertree Decompositions Revisited for PGMs

04/05/2018 · Aarthy Shivram Arun et al. · University at Buffalo

We revisit the classical problem of exact inference on probabilistic graphical models (PGMs). Our algorithm is based on recent worst-case optimal database join algorithms, which can be asymptotically faster than traditional data processing methods. We present the first empirical evaluation of these new algorithms via JoinInfer, a new exact inference engine. We empirically explore the properties of the data for which our engine can be expected to outperform traditional inference engines, refining current theoretical notions. Further, JoinInfer outperforms existing state-of-the-art inference engines (ACE, IJGP and libDAI) on some standard benchmark datasets by up to a factor of 630x. Finally, we propose a promising data-driven heuristic that extends JoinInfer to automatically tailor its parameters and/or switch to traditional inference algorithms.


1 Introduction

Efficient inference on probabilistic graphical models (PGMs) is a core topic in artificial intelligence (AI), and standard inference techniques are based on tree decomposition [21, 13, 23, 28]. The runtime of such algorithms is exponential in the treewidth (tw) of the underlying graph, which, in the worst case, is unavoidable. Over the years, efforts in the logic, database and AI communities to refine tw into a finer-grained measure of complexity have culminated in generalized hypertree decompositions (GHDs) [18, 15]. Recently, FAQ/AJAR [4, 22] theoretically reconnected such GHD-based algorithms with probabilistic inference and achieved tighter bounds based on a finer-grained notion of width called fractional hypertree width (fhtw).

However, (1) the practical significance of such GHD-based PGM inference has met with some skepticism: Dechter et al. [14], via experimental evaluation, conclude that classical treewidth-based algorithms run faster on PGM benchmarks than those optimizing GHD-based measures. Their experiments suggest that the advantages of the latter manifest only in instances with substantial factor sparsity (i.e., a large number of factor entries have zero probabilities) and high factor arity. (2) Translating the superior asymptotic bounds of GHDs into practice is a non-trivial challenge. In theory, these algorithms assume that one can exhaustively search over all potential GHDs, which is often untenable due to the combinatorial explosion of possible GHDs with thousands of variables and factors. Indeed, the theoretical runtimes of these algorithms completely ignore the dependence on the number of variables and factors; in practice, their asymptotic advantages may be negated by large constants.

In the current work, we revisit the conclusions in (1) and overcome the challenges of (2) using a proof-of-concept inference engine — JoinInfer — that leverages recently introduced worst-case optimal join algorithms [31] in conjunction with improved data structures. In particular, we make the following contributions:

[Figure 1: a 2×3 grid of six bands. The rows correspond to N_D (high/low) and the columns to r_fhtw (small/medium/large); both measures are defined below. JoinInfer is the winner in Bands 1-4, and libDAI is the winner in Bands 5 and 6, where JoinInfer is slower by 5x and 10x on average, respectively.]

Figure 1: Datasets are divided into six bands depending on the sizes of N_D and r_fhtw. The red shade in each box denotes the speedup of a GHD-based system over a treewidth-based system and the blue shade shows the speedup of JoinInfer with respect to libDAI. The "winner" is stated explicitly for each band.
GHDs revisited.
  • The experimental evaluation in [14] used a ratio of theoretical bounds based on the GHD measure of hypertree width (htw) and the treewidth measure (we call the analogous version of this ratio r_fhtw, where we replace htw with fhtw). However, we suggest that the predictions made by r_fhtw are, in practice, contingent upon the total number of entries processed across all bags of the GHD (we call this N_D). Engines such as libDAI that use truth-table indices do not scale to higher levels of N_D, an insight that is not captured in [14]’s experimental paradigm. For instance, in the top row of Figure 1 (corresponding to high N_D), libDAI fails on all datasets while JoinInfer successfully completes on all of them.

  • The introduction of N_D shows that GHD-based algorithms like JoinInfer may have a wider scope than previously predicted: while [14] predicted that JoinInfer will work well only when r_fhtw is small, we expect it to perform well even when r_fhtw is medium-to-large (provided N_D is high). We expect libDAI to perform well when r_fhtw is medium-to-large and N_D is low (Bands 5 and 6 in Figure 1).

  • We use a finer-grained theoretical measure, r (defined in Section 3.2.3), that better predicts JoinInfer’s speedup, thus refining the insights of [14]. For instance, our measure can differentiate between the rows of the ‘r_fhtw small’ column in Figure 1, while r_fhtw cannot.

  • We show that JoinInfer can be up to 630x faster on networks with small r_fhtw and high N_D (Band 1 in Figure 1). When r_fhtw is large, at the higher threshold of N_D (Band 3 in Figure 1), JoinInfer actually outperforms at non-trivial levels (by up to 2.7x), while the corresponding prediction of [14] expects significant under-performance. IJGP and libDAI fail in this space; ACE is the only other engine that completes on a subset of the space, since it is designed to accommodate higher levels of factor sparsity.

Hybrid Architecture.

Given the relative advantages of JoinInfer and libDAI in different spaces, we explore the feasibility of exploiting their strengths in a ‘best-of-all-worlds’ architecture. The hybrid outperforms libDAI, IJGP and ACE on 75% of the networks (ref. Table 1), illustrating its promise.

Technical Contributions.

Our primary technical contribution in JoinInfer lies in improved data structures. We introduce two data representations for use in different passes of the algorithm: (i) a level-order trie, which collapses a conventional trie into a single array; (ii) (two variants of) an index-based compressed list. We find that the resulting gains more than compensate for the overheads involved in maintaining both data representations.

2 Related Work

Several streams of inquiry have emerged in the exact inference setting. One such stream involves conditioning algorithms [32, 11] that adopt a case-based reasoning approach. Cutset conditioning [32] attempts to reduce a network into a tree structure so as to make inference tractable, while another approach, recursive conditioning [11], recursively decomposes a network into smaller subnetworks that are solved independently.

Another class of algorithms seeks to exploit local structure [26, 33], where [10, 20] exploit factor sparsity by going beyond the listing representation (i.e., storing only tuples with non-zero probability). In particular, the goal in these works is to represent the factors compactly via algebraic structures. Among related representations, arithmetic circuits (ACs) [9] are still a very active research area [34]. At a very high level, these circuits work best when the PGM variables themselves are Boolean; the ‘compilation’ process (to an AC representation) becomes more expensive for larger domain sizes. Although our algorithm also seeks to exploit factor sparsity, it works directly with the much simpler listing representation, making it much more efficient.

An emerging area, lifted probabilistic inference [12, 29, 24], exploits symmetric structures within graphs to speed up inference. Thus far, this has been undertaken primarily in the relational learning paradigm, whereas our current work is propositional.

Yet another stream runs along the lines of variable elimination [13, 35], which undertakes a sequential marginalization of variables to compute posterior probabilities. A related stream involves tree decomposition-based routines [21, 23, 28], where the original network is decomposed into a hypertree and inference is performed using a two-phase message-passing routine on this decomposed tree. The runtime complexity of most of these algorithms is dictated by the treewidth of the underlying graph. JoinInfer improves on this class of algorithms, with its complexity bounded by the finer notion of fractional hypertree width.

Finally, past work in PGMs has also focused on approximate inference [25]; we believe that the advancements introduced in JoinInfer could enhance existing approximate engines by, for instance, applying them to models in the emerging tractable learning paradigms [7] such as thin junction trees [6].

3 JoinInfer: An Overview

Below we (a) give a brief overview of the background concepts, (b) outline the theoretical algorithm and (c) identify the implementation challenges and present our solutions.

3.1 Background

Definition 1

A (discrete) probabilistic graphical model can be defined by a triple (H, X, F), where the hypergraph H = (V, E) represents the underlying graphical structure (note |V| = n and |E| = m). There are n discrete random variables X = {X_v : v ∈ V} on finite domains D_v and m factors F = {f_e : e ∈ E}, where each factor f_e is a mapping from ∏_{v∈e} D_v to the non-negative reals.

For instance, Figure 2 shows a hypergraph representing a PGM, with the variables as vertices and one hyperedge (with an associated factor) per factor scope.

Definition 2

For any factor f_e, the size of f_e is its support size N_e, i.e., the number of entries with non-zero probabilities. Storing only the non-zero entries (as well as their values) is called the listing representation of f_e. Factor sparsity is defined as N_e / ∏_{v∈e} |D_v|, where N_e is the size of factor f_e.
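To make the listing representation concrete, here is a minimal sketch in Python (our own illustration; JoinInfer itself is not written this way): a factor is stored as a map from non-zero tuples to probabilities, and sparsity is the ratio of stored entries to the full truth-table size.

# A sketch of the listing representation from Definition 2: a factor
# over two variables with domain size 3 each, keeping only the
# entries with non-zero probability.
factor = {
    (0, 0): 0.5,   # (value of first var, value of second var) -> prob
    (1, 2): 0.3,
    (2, 1): 0.2,
}

domain_sizes = [3, 3]
total_entries = 1
for d in domain_sizes:
    total_entries *= d              # 9 tuples in the full truth table

support_size = len(factor)          # 3 non-zero entries
sparsity = support_size / total_entries
print(f"support = {support_size}, sparsity = {sparsity:.0%}")  # 33%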

Figure 2: The PGM query hypergraph (left), factor graph (middle) and a GHD for it (right). We use hypergraphs instead of factor graphs for notational ease.

A typical inference task in PGMs is to compute the marginal estimates given by:

P(y) = (1/Z) · ∑_{x : x_S = y} ∏_{e∈E} f_e(x_e),  for each y ∈ ∏_{v∈S} D_v,

(1)

where x_S denotes the projection of x onto the variables in S ⊆ V and Z is a normalization constant. Variable/factor marginals are a special case of (1): S = {v} for variable marginals and S = e ∈ E for factor marginals.

Exact inference in PGMs is usually performed by propagating messages on a generalized hypertree decomposition (GHD) of the underlying hypergraph H.

Definition 3

A GHD of H = (V, E) is defined by a triple (T, χ, λ), where T is a tree, χ is a function associating a set of vertices χ(t) ⊆ V to each node t of T, and λ is a function associating a set of hyperedges λ(t) ⊆ E to each node t of T, such that the following properties hold: (i) for each e ∈ E, there is a node t of T such that e ⊆ χ(t) and e ∈ λ(t); and (ii) for each v ∈ V, the set {t : v ∈ χ(t)} is connected in T.

3.1.1 Existing GHD-based message passing algorithms

As illustrated in Figure 2, a GHD can be thought of as a labeled (hyper)tree T, where the sets χ(t) assigned to each node t of T are called the bags of the hypertree. Inference propagation on T involves a two-pass ‘message-passing’ algorithm [21]. In the first pass (message-up), messages are propagated ‘up’ from the leaves (children) to the root (parents). Subsequently, in the second pass (message-down), they are propagated ‘down’ from the root (parents) to the leaves (children).

A message m_{i→j} from node i to node j can be viewed as a marginal estimate where the variables χ(i) ∖ χ(j) are summed out of the factor product of node i (F_i denotes the factors assigned to node i), which is given by:

m_{i→j} = ∑_{χ(i)∖χ(j)} ( ∏_{f ∈ F_i} f · ∏_{k ∈ N(i)∖{j}} m_{k→i} ),

(2)

where N(i) represents the children or the parent of i depending on the propagation direction. Upon completion of both propagation passes, variable marginals for all v ∈ V can be retrieved using the label χ, and factor marginals for all e ∈ E can be retrieved from each node using the label λ.

3.2 Join-based theoretical inference algorithm

The above GHD-based message passing framework (also known as the junction tree algorithm) forms the structure of JoinInfer, as outlined in Algorithm 1 (in standard algorithms the map R used in Line 4 always selects the pairwise product, and in FAQ/AJAR it always selects the multiway product with 01-projections). To compute the GHD in Line 3 we build a junction tree using the MinFill heuristic, root it arbitrarily and determine the parent-child relationship for every node. We assume that each input factor in F is assigned to a unique bag in the GHD. In particular, for every node t, F_t denotes the factors in F assigned to it.

1:Input: A PGM (H, X, F).
2:Output: Variable and factor marginals.
3:Create a GHD T for H using the MinFill variable ordering.
4:R ← HYJAR(T, N_D).
5:(Φ, m) ← JoinInferUp(T, R)
6:Φ ← JoinInferDown(T, Φ, m)
7:Compute variable and factor marginals from Φ
Algorithm 1 JoinInfer

3.2.1 Message-Up Phase

Computation Within a Single Bag.

The upward pass propagates messages, i.e., marginalized factor products, from leaf nodes (child bags) to root nodes (parent bags) along the hypertree. This involves computing factor products at every node (bag) of the hypertree. The runtime complexity of junction tree algorithms is dominated by these bag-wise factor product computations, and it is here that JoinInfer makes its core contribution: it uses a novel algorithm, different from previously proposed exact inference algorithms, to compute the factor products inside each bag. By exploiting the correspondence between computing database joins and computing factor marginals (if the probability values in the factor entries are Boolean, i.e., just 0 or 1, the factor product reduces to a join), it uses worst-case optimal join algorithms (WCOJA) to compute the factor products within each bag. The key idea behind these algorithms is that, unlike the traditional approach of computing factor products via a sequence of pairwise products, WCOJA undertake a multi-way product, i.e., the product of all relevant factors is computed simultaneously. Moreover, they work on a listing representation of the factors, thereby exploiting the inherent factor sparsity of a PGM. The multi-way product algorithm used by JoinInfer is outlined in Algorithm 2, which is essentially from [31].

The inputs to Algorithm 2 are: a product flag p that decides between the multiway product algorithm of [31] and Algorithm 3, the set of variables V = (v_1, …, v_k) whose product needs to be computed, the corresponding factors F, their factor tables and, finally, the set of free variables S. When the product flag selects the pairwise strategy, we run the traditional pairwise product algorithm. Otherwise, we run our algorithm, whose description we provide here. We assume that V, F and S are all sorted in the input order. We start with v_1 and find a value that is present in all the factor tables of factors which contain v_1. We then fix this value as a potential candidate for v_1 in the factor product and continue the process for v_2 and so on. If we get a potential candidate for every variable in V, we have a vector that we add to the final factor product. On the other hand, if we don’t get a potential candidate for a given variable, say v_i, we backtrack and find a new potential candidate for v_{i−1} (i.e., a value that is present in all factor tables of factors containing v_{i−1}) and so on (until v_1). We obtain two outputs from this algorithm: the factor product Φ and its marginal on S, denoted by Φ_S.

1:Input: Product flag p, variables V = (v_1, …, v_k), factors F, factor tables (as tries) and free variables S. All these are sorted in the input order.
2:Output: Factor product Φ and its marginal on S, denoted by Φ_S.
3:if p selects the pairwise strategy then
4:     (Φ, Φ_S) ← Algorithm 3.
5:     return (Φ, Φ_S)
6:Initialize a vector c of size k to all 0s. The entries of Φ will be added one-by-one as a (vector c, probability) pair.
7:i ← 1
8:while i ≥ 1 do
9:     t ← smallest untried candidate value for v_i
10:     while (true) do
11:          for all f ∈ F with v_i ∈ scope(f) do  All factors that contain the variable v_i.
12:               t_f ← smallest value ≥ t for v_i in T_f  Here, T_f is the factor table corresponding to f, restricted to the prefix already fixed in c; the seek is done using galloping. If no such value exists, v_i has no candidate and the inner loop ends.
13:          t′ ← max_f t_f
14:          if t′ = t then  Check if the value t is present in all factor tables of factors that contain v_i.
15:               c[i] ← t
16:               break
17:          else
18:               t ← t′
19:     if a candidate was found for v_i then
20:          if i < k then
21:               i ← i + 1  The value for variable v_i is fixed as t and we move on to the next variable.
22:          else
23:               add (c, ∏_{f∈F} T_f(c)) to Φ  A complete vector; multiply the matching probabilities.
24:     else
25:          if i > 1 then
26:               i ← i − 1  We don’t have any potential candidates for v_i and we backtrack.
27:if S ≠ V then
28:     Φ_S ← ∑_{V∖S} Φ  We marginalize out all the variables in V∖S, resulting in a set of (vector, probability) pairs.
29:return (Φ, Φ_S)
Algorithm 2 MultFacProd
1:Input: Variables V, factors F, factor tables T_f and free variables S. All these are sorted in the input order.
2:Output: Factor product Φ and its marginal on S, denoted by Φ_S.
3:Φ ← ∅
4:for all f ∈ F do
5:     if Φ = ∅ then
6:          Φ ← T_f
7:     else
8:          Φ ← Φ × T_f  This step is computed using libDAI’s API for pairwise product.
9:Φ_S ← ∑_{V∖S} Φ  This step is computed using libDAI’s API for marginalizing out variables from a factor/cluster product.
10:return (Φ, Φ_S)
Algorithm 3 PairwiseProd
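To illustrate the backtracking multiway product described above, the following self-contained Python sketch computes the product of a set of factors in listing representation by intersecting, per variable, the candidate values consistent with the bindings chosen so far. It is a deliberate simplification of Algorithm 2: the real implementation works over sorted tries with galloping search, while this sketch recomputes candidate sets with plain scans; the function and variable names are ours.

from functools import reduce
from math import prod

def mult_fac_prod(variables, factors, free_vars):
    """Multiway factor product with backtracking search.
    `factors` is a list of (scope, table) pairs, where scope is a
    tuple of variable names (a subset of `variables`) and table maps
    non-zero tuples over that scope to probabilities."""

    def candidates(var, binding):
        # Values of `var` consistent with `binding` in every factor
        # containing `var` -- the per-variable intersection that the
        # worst-case optimal join computes with galloping search.
        sets = []
        for scope, table in factors:
            if var not in scope:
                continue
            ok = set()
            for tup in table:
                vals = dict(zip(scope, tup))
                if all(binding.get(v, vals[v]) == vals[v] for v in scope):
                    ok.add(vals[var])
            sets.append(ok)
        return reduce(set.intersection, sets) if sets else set()

    product, binding = {}, {}

    def search(i):
        if i == len(variables):     # all variables bound: emit a tuple
            key = tuple(binding[v] for v in variables)
            product[key] = prod(
                table[tuple(binding[v] for v in scope)]
                for scope, table in factors)
            return
        for val in sorted(candidates(variables[i], binding)):
            binding[variables[i]] = val     # fix a candidate ...
            search(i + 1)                   # ... and move on
            del binding[variables[i]]       # backtrack

    search(0)
    # Marginalize the product onto the free variables.
    idx = [variables.index(v) for v in free_vars]
    marginal = {}
    for key, p in product.items():
        mk = tuple(key[j] for j in idx)
        marginal[mk] = marginal.get(mk, 0.0) + p
    return product, marginal

# Triangle query of Example 1 (below): factors on (A,B), (B,C), (A,C).
fab = (("A", "B"), {(0, 0): 0.5, (0, 1): 0.5})
fbc = (("B", "C"), {(0, 1): 1.0})
fac = (("A", "C"), {(0, 1): 0.2})
print(mult_fac_prod(("A", "B", "C"), [fab, fbc, fac], ("A",)))
# -> ({(0, 0, 1): 0.1}, {(0,): 0.1})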
Example 1

Let us look at the triangle query f_{AB} × f_{BC} × f_{AC} for a bag {A, B, C}. For simplicity, let |D_v| = d for all variables v and |f_e| = N for all factors f_e. Standard algorithms compute this query in time O(d³). On the other hand, we obtain a runtime bound of O(N^{3/2}), a bound that becomes especially useful in the presence of factor sparsity, i.e., N ≪ d² (in the worst case N = d², and we recover the bound O(d³)).
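The N^{3/2} bound above can be derived from the fractional-cover LP of Section 3.2.3; here is a worked version for the triangle under the uniform-size assumption of Example 1:

% Each of A, B, C must be covered by total edge weight >= 1:
%   x_{AB} + x_{AC} >= 1,  x_{AB} + x_{BC} >= 1,  x_{BC} + x_{AC} >= 1.
% The optimum is x_{AB} = x_{BC} = x_{AC} = 1/2, so
\[
  |f_{AB} \times f_{BC} \times f_{AC}|
    \;\le\; N^{x_{AB}} \, N^{x_{BC}} \, N^{x_{AC}}
    \;=\; N^{3/2},
\]
% compared with the d^3 truth-table bound of pairwise algorithms.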

Computation Across Multiple Bags.

Another significant difference stems from the ‘factors’ that are included in the bag-wise product computation [14]. Standard algorithms process (a) the input factors mapped to the bag and (b) the messages received by the bag from its children. In addition to the above two, JoinInfer also includes factors called ‘01-projections’ that are not originally mapped to the bag, but have non-trivial intersections with the variables in the bag. In other words, such factors are computed across multiple bags in the hypertree. In the presence of sparsity, this ‘look-ahead’ maneuver helps prune unproductive entries from factor products early on. Further, using 01-projections is central to realizing the asymptotically better bounds in FAQ/AJAR (see Section 3.2.3).

Example 2

Consider the bag {A, B, C} of the GHD in Figure 2 (where we assume |f_e| = N for all factors) representing a product over two factors, f_{AB} × f_{BC}. The ‘up’ pass would propagate a marginal on {A, C} to the root bag. The worst-case output size of this factor product is N² and consequently the runtime bound is O(N²). However, we are able to obtain a much better bound by also utilizing the factor f_{AC} in the product computation (as both A and C participate in this factor), through its 01-projection: the function that maps every (a, c) ∈ D_A × D_C with f_{AC}(a, c) > 0 to 1. By employing this 01-projection in our factor product computation, we can get a bound of N^{3/2} on the output size (recall Example 1) and consequently a runtime bound of O(N^{3/2}) for the above query (a detailed illustration and the formal definition of 01-projections are provided in Appendix A.1).

The upward message propagation in JoinInfer is outlined in Algorithm 4. In FAQ/AJAR, the condition in Line 7 is always satisfied while in traditional algorithms, the condition is never satisfied. Further, traditional algorithms in lines 13 and 14 use a pairwise product algorithm, which is asymptotically slower than WCOJA.

1:Input: GHD and the map .
2:Output: Factor Products and Up Messages both as tries.
3:for all nodes  do: This is done in a level-order traversal from leaves-to-root.
4:      Initialize the PGM Query corresponding to ’s Factor Product.
5:     for all  do We add all the messages sent to from its children.
6:                
7:     if  then We include the projections while computing the Factor Product for .
8:          for all  do
9:               if  then
10:                                                   
11:     if  is not a root then
12:          Let , be and .
13:          
14:     else      
15:return
Algorithm 4 JoinInferUp

3.2.2 Message-Down phase

Since the Message-Down phase (Algorithm 5) involves updating factor products for each bag (except the root) using down messages, we perform two in-place products for each such bag: first, between the up and down messages sent/received by the bag, and then, between the result of the previous step and the bag’s factor product. A detailed description of how to compute a product in-place follows.

Consider any two factors f and g, whose corresponding factor tables are denoted by T_f and T_g respectively. (Note that cluster products/messages can be treated as factor tables as well.) Recall that the factor tables are stored as a list of (index, probability) pairs. We start by assuming that one of T_f or T_g is stored as a hash table with the index as key and the probability as value. (The up/down messages are stored as hash tables.) Without loss of generality, let’s assume that T_g is stored as a hash table. Our goal is to compute f × g in-place. To this end, we iterate through each entry in T_f and probe the corresponding index in T_g’s hash table. If it is present, then we multiply the probabilities and store the result in the corresponding entry in T_f. Otherwise, we discard the entry from T_f. It follows that by the end of this procedure, we would have computed f × g and the result is stored in T_f. Note that probing the hash table is an amortized constant-time operation and entry removal from T_f can be done in constant time as well, so the time complexity of our algorithm is O(|T_f|).
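The following Python sketch captures the in-place product just described (an illustration under our own naming; the actual engine operates on its compressed-list structures):

def in_place_hash_product(table_f, table_g):
    """Compute f x g in place. `table_f` is a list of [index, prob]
    pairs; `table_g` is a dict from index to prob (e.g., a message
    stored as a hash table). Runs in O(|table_f|) time."""
    write = 0
    for idx, p in table_f:
        q = table_g.get(idx)                # amortized O(1) hash probe
        if q is not None:
            table_f[write] = [idx, p * q]   # keep and update in place
            write += 1
        # else: idx is absent from g, so the entry vanishes from f x g
    del table_f[write:]                     # truncate discarded entries
    return table_f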

1:Input: GHD T. Factor products Φ_t and up messages m converted to listing representation from tries.
2:Output: Final factor products Φ′_t.
3:Set Φ′_t = Φ_t for every t ∈ V(T).
4:for all nodes t ∈ V(T) do:  This is done as a level-order traversal from root to leaf.
5:     for all children c of t do
6:          m_{t→c} ← ∑_{χ(t)∖χ(c)} Φ′_t  Compute the down message from t to c by summing out the variables not in χ(c).
7:          m_{t→c} ← m_{t→c} / m_{c→t}  We divide the down message by its corresponding up message.
8:     for all children c of t do
9:          Φ′_c ← Φ′_c × m_{t→c}  We multiply c’s cluster product with m_{t→c}.
10:          Both steps use the in-place hash-based product described above.
11:return Φ′
Algorithm 5 JoinInferDown

Summing up, our inference algorithm is a standard tree propagation algorithm with two modifications: (1) We adapt WCOJA to compute factor-products in the bags, and (2) We modify the factor products in bags using 01-projections.

3.2.3 Runtime complexity

With the above steps we obtain a two-pass inference routine whose runtime bound follows from a recent improved bound (the AGM bound) on the size of a factor product [31]. For a hypergraph H = (V, E), let B be any subset of vertices and let x = (x_e)_{e∈E} be a vector indexed by edges that is the optimal solution to the linear program

min  ∑_{e∈E} x_e · log₂ N_e (3)
s.t.  ∑_{e : v∈e} x_e ≥ 1 for all v ∈ B, and x_e ≥ 0 for all e ∈ E, (4)

where N_e is the listing size of the factor on edge e. Then, the quantity

AGM_E(B) = ∏_{e∈E} N_e^{x_e}

is called the AGM-bound of B using edges in E. The WCOJA used in JoinInfer meets the AGM-bound, thus giving our GHD (Definition 3) based algorithm a runtime complexity of

Õ( ∑_{t∈V(T)} AGM_E(χ(t)) ). (5)

Upper-bounding each AGM_E(χ(t)) by N^{fhtw} (where N is the maximum listing size of a factor) in the above equation and maximizing (instead of summing) over all bags in the GHD gives us an asymptotic bound of Õ(N^{fhtw}). fhtw is guaranteed to be smaller than htw or tw (hypertree width or treewidth) for the same GHD, giving us the best known theoretical bounds for exact inference in PGMs. (Formal definitions for these widths can be found in Appendix A.2.)

Example 3

In Example 2, the original hypergraph structure consisted of only two edges, {A, B} and {B, C}. The AGM-bound from the optimal solution (x_{AB} = x_{BC} = 1) translates to a bound of N². However, by employing 01-projections the induced hypergraph has three edges, {A, B}, {B, C} and {A, C}, and the optimal solution (x_{AB} = x_{BC} = x_{AC} = 1/2) gives us an asymptotically better bound of N^{3/2}. For the GHD T, the induced fhtw is thus 3/2 rather than 2.

Note that (5) gives a finer-grained measure for our runtime: ∑_{t∈V(T)} AGM_E(χ(t)). Recall that the asymptotic bound for tw-based algorithms is Õ(d^{tw}), where d = max_v |D_v|. However, a more realistic measure here is the total number of truth-table entries across bags, N_D = ∑_{t∈V(T)} ∏_{v∈χ(t)} |D_v|. This gives us a finer-grained ratio (as compared to [14]) to evaluate JoinInfer against classical engines:

r = ( ∑_{t∈V(T)} AGM_E(χ(t)) ) / ( ∑_{t∈V(T)} ∏_{v∈χ(t)} |D_v| ). (6)

Replacing the numerator with the upper bound |V(T)| · N^{fhtw} and the denominator with the upper bound |V(T)| · d^{tw} in (6) gives us

r_fhtw = N^{fhtw} / d^{tw}. (7)

This ratio is analogous to the one in [14], which was based on hypertree width:

r_htw = N^{htw} / d^{tw}. (8)

Since computing htw is NP-hard, [14] used an approximation for it. In our measure r_fhtw, we overcome this issue by using fhtw over htw. Using fhtw offers two significant advantages – one, it is a more fine-grained measure than (8) and two, it is polynomially computable for a given GHD (basically, solve the LP from (3)). We compute the numerator in r as described in Appendix A.2. We show in the Experiments Section that r is a better predictor than r_fhtw on most bands, since it exploits the fact that the factor tables and variables in a bag can have different sizes and domain sizes respectively.
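Since the numerator of (6) is obtained by solving the LP (3)-(4) per bag, it is straightforward to compute with an off-the-shelf LP solver. The sketch below uses scipy for illustration (the numbers in Table 1 were produced with Google OR-Tools [17]); the interface and names are ours.

import math
from scipy.optimize import linprog

def agm_bound(bag, edges, sizes):
    """AGM-bound of `bag` (a set of variables) given `edges` (a list
    of variable sets) and `sizes[i]`, the listing size of the factor
    on edges[i]; solves the LP (3)-(4)."""
    c = [math.log2(n) for n in sizes]          # objective coefficients
    A_ub, b_ub = [], []
    for v in bag:                              # cover each variable
        A_ub.append([-1.0 if v in e else 0.0 for e in edges])
        b_ub.append(-1.0)                      # -sum_{e: v in e} x_e <= -1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * len(edges), method="highs")
    return 2 ** res.fun                        # prod_e N_e^{x_e}

# Triangle bag with all factor sizes N = 100: the optimal cover is
# (1/2, 1/2, 1/2), so the bound is 100^{3/2} = 1000.
print(agm_bound({"A", "B", "C"},
                [{"A", "B"}, {"B", "C"}, {"A", "C"}], [100, 100, 100]))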

3.3 Challenges and Our Solutions

3.3.1 Factor Representations

Consistent with WCOJA, we use a listing representation [19, 26] to store data, i.e., we only store factor entries with non-zero probabilities. It has been shown that tries are sufficient for the theoretical bounds. (A trie is a multi-level data structure where each factor tuple corresponds to a unique path from root to leaf, and the probability value associated with each tuple is stored in the leaf.) While computing the factor product of a bag, Algorithm 4 adopts a backtracking-based search routine over multiple tries (MultFacProd). However, ‘multi-level’ tries impose considerable random-access costs during the backtracking search; they are also unsuitable for message propagation. We resolve these two problems via two novel factor representations.

Within Bag Computation.

First, we introduce ‘level-order’ tries for the backtracking search in Algorithm 4. Essentially, we flatten the trie into a single-level contiguous block and redesign the search to mimic the original multi-level traversal. This redesign enables us to simultaneously (i) exploit the compact storage of tries for sparse factors and (ii) reap the caching advantages of contiguous memory blocks. Initial experimental runs showed a gain of 10x in runtime using level-order tries.

Message Propagation Between Bags.

Second, in addition to storing each factor as a trie, we also store it as a list of (index, probability) pairs, where each factor tuple is converted into a single number: its index value. We use two variants: (i) one that stores only ‘reverse’ indices, i.e., indices computed in reverse variable order, and (ii) one that stores both forward and reverse indices (for representing intermediary messages). These multiple representations enable us to optimize Algorithms 4 and 5: the reverse index enables efficient construction of tries in the up-phase and decoding of message entries over all children in a single pass in the down-phase. Moreover, the reverse indices of the up-messages act as placeholders for down-messages, enabling the re-use of data structures. Finally, the forward indices are used while merging down-messages with cluster products, thus reducing redundant decoding/encoding steps.
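The index encoding itself is a mixed-radix number computed over the factor’s domain sizes; here is a small sketch (our own naming) of the forward and reverse variants:

def forward_index(tup, domains):
    """Encode a factor tuple as one integer in forward variable order
    (a mixed-radix number over the domain sizes)."""
    idx = 0
    for value, dom in zip(tup, domains):
        idx = idx * dom + value
    return idx

def reverse_index(tup, domains):
    """The same encoding computed in reverse variable order; tries
    built from sorted reverse indices can be filled level by level."""
    return forward_index(tup[::-1], domains[::-1])

tup, doms = (2, 1, 0), (4, 3, 2)
print(forward_index(tup, doms))   # 2*6 + 1*2 + 0*1 = 14
print(reverse_index(tup, doms))   # 0*12 + 1*4 + 2*1 = 6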

01-projections.

To obtain the asymptotically better theoretical bounds, FAQ/AJAR uses 01-projections for all bag-wise computations. However, in practice they impose significant computational costs, since interconnected factors generate a large number of projections per bag; building and maintaining these is costly. Firstly, since the utility of a 01-projection lies in its sparseness, we filter out dense projections at each bag-wise computation. Secondly, instead of pre-computing all projections, we compute projections on the fly and amortize the cost (cache-and-reuse) over subsequent computations.

1:Input: GHD T, sum of products of domains N_D.
2:Output: A map R that maps every node to MULTIWAY, MULTIWAY-0/1 or PAIRWISE.
3:Initialize R(t) to unassigned for every t ∈ V(T).
4:if N_D is below the truth-table memory threshold then
5:     for all nodes t with F_t ≠ ∅ do:  This traversal is done in the order of their product of domain sizes, highest to lowest.
6:          time_1 ← runtime of MultFacProd on F_t without 0/1-projections.
7:          time_2 ← runtime of MultFacProd on F_t with all 0/1-projections.
8:          time_3 ← runtime of PairwiseProd on F_t.
9:          R(t) ← the strategy achieving min(time_1, time_2, time_3).
10:          for all unassigned t′ in the subtree of t do  Propagate the decision obtained for t to all unassigned nodes in its subtree.
11:               R(t′) ← R(t)
12:else
13:     for all nodes t ∈ V(T) do
14:          R(t) ← a random choice between MULTIWAY and MULTIWAY-0/1.
15:return R
Algorithm 6 HYJAR

3.3.2 Hybrid Architecture

In Bands 5 and 6 (Figure 1), libDAI’s pairwise product implementation demonstrates distinct advantages over JoinInfer’s multi-way product. We explore the feasibility of leveraging the respective advantages of both these strategies in a new {HY}brid {J}oin {AR}chitecture (HYJAR). To build such a system, we use the native structure of JoinInfer and import only the pairwise-product functionality from libDAI (we do not integrate the entire engine). Given the high costs of switching between the data structures required for JoinInfer and libDAI, the main challenge here was to devise a system that not only optimally chooses between the strategies per bag, but at the same time minimizes the switches between bags. We overcome this challenge by introducing a deterministic heuristic (Algorithm 6) that decides the optimal strategy (JoinInfer with or without 01-projections, or PairwiseProd) for each bag in the GHD that has at least one input factor assigned to it (i.e., F_t ≠ ∅). (Recall our earlier assumption that each factor table is assigned to a unique bag; as a result, not many bags are chosen in this process.) Further, we ignore the incoming messages for a bag when deciding on its strategy, making this decision faster. We then propagate this decision along the subtree of t, until it reaches a bag that was already assigned a decision. To decide the order of preference, we consider bags in decreasing order of their product of domain sizes, with the intuition being that the larger bags dominate the runtime of libDAI. A detailed description of Algorithm 6 follows.

The inputs to our algorithm are a GHD T and the sum of products of domains N_D. Our goal is to output a map R that maps every node to MULTIWAY, MULTIWAY-0/1 or PAIRWISE. In particular, MULTIWAY implies that we would be running JoinInfer without any projections for that node in the message-up phase. Similarly, MULTIWAY-0/1 and PAIRWISE imply that we would be running JoinInfer with projections and the pairwise product (Algorithm 3) without any projections, respectively. We first consider the case when N_D is below the threshold – we iterate through all nodes t with F_t ≠ ∅ (i.e., at least one input factor table assigned) in decreasing order of their corresponding product of domain sizes. In particular, we run JoinInfer and the pairwise product without 0/1-projections (i.e., the first and third strategies) with only the pre-assigned factor tables determined by F_t. For the second strategy, since we include 0/1-projections, we include all factor tables whose variables have a non-empty intersection with the union of variables in the pre-assigned factors in the current bag. Once these three runs are done, we assign the fastest strategy to node t. We then propagate this decision to all nodes in its subtree for which no decision has been made so far. However, even after this procedure, there could be some nodes which don’t have an R value assigned. In order to address this issue, we repeat the same procedure for the root, i.e., we make an arbitrary choice of strategy for it and then propagate this choice along the remaining GHD. Finally, for the case when N_D exceeds the threshold, running JoinInfer without 0/1-projections could turn out to be very expensive (since we don’t consider the messages). Thus, for each cluster, we randomly choose between JoinInfer without 0/1-projections and with 0/1-projections. Note that we don’t run the pairwise product in this case since libDAI crashes when N_D is large.
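A compact Python sketch of this heuristic follows (the interfaces `ghd.nodes()`, `ghd.subtree(t)`, `ghd.assigned_factors(t)`, `ghd.prod_domains(t)` and the trial-runner `time_strategy(t, s)` are hypothetical names of ours, and the root fallback is simplified):

import random

def hyjar(ghd, sum_prod_domains, threshold, time_strategy):
    """Choose a per-bag strategy in the spirit of Algorithm 6."""
    R = {t: None for t in ghd.nodes()}
    if sum_prod_domains <= threshold:
        anchors = [t for t in ghd.nodes() if ghd.assigned_factors(t)]
        anchors.sort(key=ghd.prod_domains, reverse=True)  # largest first
        for t in anchors:
            if R[t] is not None:
                continue
            # Trial-run the three strategies on the bag's pre-assigned
            # factors only (incoming messages are ignored for speed).
            best = min(("multiway", "multiway_01", "pairwise"),
                       key=lambda s: time_strategy(t, s))
            for u in ghd.subtree(t):   # propagate to unassigned nodes
                if R[u] is None:
                    R[u] = best
    else:
        # Truth tables are too large for pairwise products here, so
        # choose randomly between multiway with/without 0/1-projections.
        for t in ghd.nodes():
            R[t] = random.choice(("multiway", "multiway_01"))
    # Nodes not under any anchor inherit an arbitrary root choice.
    for t in ghd.nodes():
        if R[t] is None:
            R[t] = "multiway"
    return R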

4 Experimental Evaluation

In this section, we empirically validate JoinInfer and outline features that influence JoinInfer’s performance. Specifically, we describe our empirical setup and on standard benchmarks, we (i) demonstrate the scope of JoinInfer vis-a-vis state-of-the-art systems, (ii) document performance gains of the hybrid setting and (iii) evaluate our technical contributions.

4.1 Experimental Setup

Datasets.

To create a testbed that spans the full range of cases illustrated in Figure 1, we sample from three publicly available benchmark datasets: UAI’06 [8], PIC 2011 [16] and the BN Learn dataset [1] (which subsumes the IJCAI’05 networks [9]). Our testbed contains 52 networks. For Band 1, we selected the CELAR subset from PIC 2011. From the UAI’06 benchmarks, we selected grids (BN_30-41) for Band 2, iscas89 (BN_47-68) for Band 3, the speech recognition DBNs (BN_20-25) for Band 4 and iscas85 (BN_42-46) for Band 6. We left out the networks CELAR-SUB4 and BNs 65, 66 and 68 in Bands 1 to 3 because JoinInfer does not (yet) support the precision needed to compute very large indices. Finally, we selected networks from BN Learn with a sufficiently large number of variables (i.e., we filtered out the smaller ones); these populated Bands 4, 5 and 6.

In order to improve the tractability of some of the larger networks for exact inference (the high-N_D cases), we randomly induce factor table sparsity. Further, to ensure that all final cluster tables are non-empty, we force at least one entry in the corresponding factor table to remain non-zero for every factor. These sparsity levels are consistent with the ranges found in other networks in the benchmark. In particular, for the CELAR subsets, the induced median factor sparsity is at 20% and 40% (the original median sparsity was 100%). However, for the networks in Bands 2 and 3, we ensure that the induced median factor table sparsity remains close to the original value of 50%. (Note that inducing sparsity to improve model tractability is a well-accepted procedure in many practical settings [27, 26].)

Comparison Engines.

We compare JoinInfer against three state-of-the-art systems: ACE [9], an engine that explicitly exploits determinism, and libDAI [30] and IJGP [28], two award-winning systems from the UAI 2010 inference challenge. Since JoinInfer performs exact inference, we compare against the exact inference ‘settings’ in these engines: the ‘Junction Tree’ algorithm in libDAI and the ‘Join-Tree’ propagation in IJGP (since it computes exact marginals only when the join-graph is a tree). In particular, IJGP requires the ‘degree’ parameter to be set high, which we set to the number of variables in the network (see Table 1’s description for more details). For ACE, we follow the recommended settings outlined for the IJCAI networks and the standard settings for the others – the commands ‘compile’ and ‘evaluate’ to compile the AC and run inference respectively. The compile/run commands for these engines can be found in Appendix B.1.

Inference Queries.

The inference query we evaluate is the computation of (all) variable marginals. We observed that while JoinInfer, IJGP and ACE process evidence (requiring it to be input separately), libDAI does not. Further, JoinInfer, IJGP and ACE perform SAT-based singleton consistency and treat the resulting variables as evidence, which again libDAI doesn’t. Hence, in order to ensure a fair comparison, we incorporate the evidence and singleton consistency directly into the input given to the engines. In particular, we remove both these types of variables from the input network (note that this could remove some factors from the query as well). We then provide the updated network resulting from this procedure to all four engines in their respective formats: libDAI (.fg), IJGP (.uai), ACE (.uai) and JoinInfer. We compare our marginal outputs with those of these engines, within a small error tolerance.

Evaluation Metrics and Settings.

We evaluate the systems on the time taken to compute variable marginals. We repeat each experimental run several times and report the average of the runs. Additionally, we set a timeout of 60 minutes for our experimental runs. We would like to note here that ACE requires separate compilation of the arithmetic circuit representing the input network (a non-standard design). For a fair comparison with the other engines’ end-to-end computations, we report both (i) the total of compilation and inference times and (ii) only the inference time for ACE. We ran all our experiments on a Linux server (Ubuntu 14.04 LTS) with an Intel Xeon E5-2640 v3 CPU @ 2.60GHz and 64 GB RAM.

Error Codes in Table 1.

We describe the error codes in Table 1 here. ‘T’ denotes an engine time-out (60 mins). libDAI and IJGP crash on all benchmarks where N_D is high, due to huge up-front memory allocation; this is denoted by ‘F’. For ACE, we observed that for benchmarks BN_30-39 in Band 2, it compiles successfully but throws a runtime exception due to precision issues. We believe that this is due to the large treewidth of these datasets. We denote this by ‘E’. For IJGP, we observed that it does approximate inference on the benchmarks Munin1 and BN_43-46. In particular, we recorded its final treewidth using the MinFill ordering on all benchmarks and compared it with JoinInfer’s and libDAI’s treewidth (both using the MinFill ordering). We noticed that the final treewidth reported by IJGP was much smaller than the treewidth reported by JoinInfer and libDAI. Recall that we preprocess evidence and SAT-based singleton consistency on these benchmarks; thus, we concluded that IJGP does approximate inference on these datasets (which we denote by ‘A’).

4.2 Experimental Results

Band Dataset Var/Factors r r_fhtw JoinInfer (w/o 0/1 | 0/1) HYJAR libDAI IJGP ACE (TTime | ITime) Sparsity (in %) fhtw tw D/N
Band 1 CELAR6-SUB0_20 16/57 2.00E-03 1.00E-03 0.17 0.16 0.19 F F 97.24 0.36 20/20 4 8 44/387
CELAR6-SUB1_20 14/75 3.00E-04 3.00E-04 2.57 2.58 2.59 F F 444.38 0.99 20/20 5 10 44/387
CELAR6-SUB2_20 16/89 1.00E-04 1.00E-04 1.05 1.04 1.07 F F 653.33 0.74 20/20 5.5 11 44/387
CELAR6-SUB3_20 18/106 1.00E-04 1.00E-04 3.72 3.69 3.67 F F 1219.08 0.78 20/20 5.5 11 44/387
CELAR6-SUB0_40 16/57 3.00E-02 2.50E-02 4.55 4.59 4.17 F F 855.12 1.2 40/40 4 8 44/774
CELAR6-SUB1_40 14/75 1.00E-02 1.00E-02 392.7 388.76 388.42 F F T T 40/40 5 10 44/774
CELAR6-SUB2_40 16/89 6.50E-03 6.40E-03 449.02 448.56 441.41 F F T T 40/40 5.5 11 44/774
CELAR6-SUB3_40 18/106 6.00E-03 7.00E-03 796.76 794.98 780.93 F F T T 40/40 5.5 11 44/774
Band 2 BN_30 1036/1153 15.96 5.10E+02 1.29 1.34 1.03 F F E E 50/44.5 25 41 2/4
BN_31 1036/1153 47.47 2.00E+03 1.22 1.31 1.05 F F E E 50/44.5 25 39 2/4
BN_32 1294/1441 1.88 1.20E+02 2 2.01 1.59 F F E E 50/44.3 28 49 2/4
BN_33 1294/1441 6.20E+02 1.00E+03 1.93 1.96 1.58 F F E E 50/44.3 26 42 2/4
BN_34 1294/1443 2.60E+04 3.20E+04 2 2.07 1.56 F F E E 50/44.3 28 41 2/4
BN_35 1294/1443 3.10E+01 5.10E+02 1.91 2.06 1.58 F F E E 50/44.3 26 43 2/4
BN_36 1294/1444 1.70E+03 2.00E+03 1.98 2.15 1.61 F F E E 50/43.7 30 49 2/4
BN_37 1294/1444 7.10E+02 1.00E+03 2.01 2.02 1.62 F F E E 50/43.7 30 50 2/4
BN_38 1294/1442 4.10E+02 2.50E+02 1.96 2.03 1.59 F F E E 50/43.9 27 46 2/4
BN_39 1294/1442 10.7 2.50E+02 2.03 2.09 1.60 F F E E 50/43.9 26 44 2/4
BN_62 657/667 7.70E+09 8.38E+09 0.74 0.7 0.68 F F 1.45 0.24 25/34.1 21 47 2/14
Band 3 BN_60 530/539 1.20E+08 7.20E+16 0.7 0.75 0.73 F F 1.65 0.24 50/44.03 29 60 2/16
BN_61 657/667 3.00E+10 1.70E+10 0.74 0.74 0.69 F F 1.69 0.24 25/34.1 21 46 2/14
BN_63 530/540 1.40E+10 9.20E+18 2.43 0.77 0.67 F F 1.84 0.27 50/43.5 30 57 2/16
BN_64 530/540 1.60E+09 5.76E+17 0.79 0.68 0.63 F F 1.84 0.25 50/43.5 28.5 55 2/16
BN_67 430/437 4.60E+10 2.95E+20 2.64 1.65 2.05 F F 1.71 0.26 50/50.57 32.5 62 2/16
Band 4 BN_20 2433/2840 119 1.00E-04 4.53 4.47 14.94 22.73 T T T 50/49.3 4 7 91/208
BN_21 2433/2840 109 1.00E-04 4.49 4.42 14.86 23.37 T T T 50/49.3 4 7 91/208
BN_22 2119/2423 0.97 1.00E-05 2.13 2.18 2.98 3.77 T 7.83 1.71 50/47 4 7 91/208
BN_23 2119/2423 0.97 1.00E-05 2.14 2.21 3 3.74 T 7.74 1.79 50/47 2 5 91/208
BN_24 1514/1818 2.08 1.00E-05 1.33 1.4 1.74 2.11 T 6.24 1.58 53.8/53 2 5 91/208
BN_25 1514/1818 2.01 1.00E-05 1.31 1.39 1.76 2.12 T 6.34 1.62 53.8/53 2 5 91/208
Pathfinder 109/109 54.94 1.00E-05 0.15 0.29 0.29 0.11 0.34 0.81 0.31 52.4/61.4 2 7 63/6437
Band 5 Alarm 37/37 3.58 11.39 0.03 0.03 0.05 0.02 0.06 0.46 0.21 100/99.4 2 5 4/108
Hepar2 70/70 5.95 9 0.04 0.05 0.06 0.03 0.19 0.42 0.19 100/100 2 7 4/384
Mildew 35/35 2.00E+03 327 0.78 0.71 0.24 0.27 2.81 2.91 1.89 75/61.7 3 5 100/14849
Munin 1041/1041 35.77 557 13.27 13.82 1.98 3.14 11.35 T T 42.1/46.6 6 9 21/276
Munin1 186/186 43.23 16.6 598.86 629.75 20.6 39.01 A T T 46.2/48.6 7 12 21/276
Munin4 1038/1038 57.9 557 16.86 17.21 2.15 2.06 10.67 3.77 2.01 44/46.6 6 9 21/276
Diabetes 413/413 524.51 4.24E+06 3.28 3.31 0.72 0.89 32.98 6.99 4.69 33.3/45.6 4 5 21/2040
Munin2 1003/1003 2.90E+05 8.90E+08 2.71 2.28 0.68 0.79 4.27 2.81 1.63 46.4/48 8 8 21/276
Munin3 1041/1041 5.00E+05 1.20E+04 2.71 2.52 0.70 0.95 5.81 2.38 1.28 45.8/37 6 8 21/276
Pigs 441/441 144 1.50E+04 0.98 0.92 0.36 0.24 1.05 1.37 0.7 55.6/70.2 8 11 3/15
Link 724/724 2.00E+07 2.70E+08 18.02 19.91 14.27 3.43 29.73 E E 50/65.1 12 16 4/31
Barley 48/48 4.00E+07 6.50E+03 26.69 27.17 1.13 1.45 15.32 17.53 10.94 100/100 4 8 67/40320
Hailfinder 56/56 13.02 1.00E+04 0.03 0.05 0.06 0.01 0.05 0.53 0.23 94.2/83.9 3 5 11/1181
Water 32/32 1.00E+05 1.00E+06 0.17 0.14 0.27 0.31 0.15 0.81 0.34 50/58.23 4 11 4/1454
Win95pts 76/76 8.66 3.10E+04 0.05 0.05 0.05 0.03 0.03 0.53 0.2 100/90 3 9 2/252
Band 6 Andes 223/223 3.10E+04 1.50E+20 0.57 0.59 0.19 0.14 0.59 1.12 0.67 100/95.7 12 17 2/128
BN_42 870/879 1.23 4.72E+21 32.63 32.11 35.15 2.66 19.18 1216.39 19.6 50/54.4 24 24 2/16
BN_43 870/880 1.14 3.78E+22 65.47 64.03 4.37 4.43 A 1132.1 22.24 50/54.4 25 25 2/16
BN_44 870/880 1.03 2.40E+24 227.87 216.98 133.5 12.82 A 1341.72 17.02 50/54.3 27 27 2/16
BN_45 870/880 1.1 3.78E+22 67.05 68.21 8.07 6.95 A 778.91 19.05 50/54.2 25 25 2/16
BN_46 489/497 1.04 7.92E+28 45.98 46.09 20.16 5.85 A 150.25 5.79 50/55.9 24 24 2/16
Table 1: Benchmark Comparisons: The first column denotes the band of the datasets (see Figure 1; the bands encode the range of N_D), followed by the dataset name and the number of variables/factors, followed by our measures r (6) and r_fhtw (7). We report three runtimes for JoinInfer: without 0/1-projections, with all 0/1-projections, and HYJAR, followed by our comparison engines – libDAI, IJGP and ACE (total time and inference time). (All runtimes are in seconds.) Further, we report the median and mean sparsity for every dataset, followed by the fractional hypertree width (fhtw) and treewidth (tw) (computed for the same GHD). The fhtw numbers were generated by solving the linear program (3) using Google OR-Tools [17]. Finally, we report the maximum domain value (D) and maximum factor table (non-zero) entry size (N). ‘T’ denotes an engine time-out (60 mins). Engine crash due to huge pre-memory allocation is denoted by ‘F’. Runtime exceptions due to precision issues are denoted by ‘E’. When an engine does approximate inference due to large treewidth, we denote it by ‘A’. See Section 4.1 for a detailed description of these error codes.

4.2.1 Benchmark experiments

Benchmark Comparisons.

The results in Table 1 are laid out along the lines of Figure 1. These networks span a wide range of sparsity (20%-100%), domain sizes (2-100) and factor arity levels (1-10).

N_D high.

The measure from [14] predicts superior performance for JoinInfer only in Band 1 (CELAR). However, in this region of high N_D, JoinInfer performs consistently better than the predictions in [14]. In Band 1 it is up to 630x faster on the subsets where ACE completes. In Bands 2 and 3, where the corresponding prediction of [14] is under-performance, it is up to 2.7x faster than ACE (libDAI and IJGP fail in this space). libDAI fails in these bands since it relies primarily on truth-table indices for its speed, and these do not scale to such levels of N_D. On the other hand, ACE, which takes advantage of factor sparsity using arithmetic circuits, is the only other engine that completes; that said, compiling these structures is costly. We surmise that JoinInfer’s performance advantages are rooted in the use of multi-way products. Further, applying multi-way products directly on a listing representation accentuates the gains.

The networks in these bands (1, 2 and 3) cover a sparsity range of 20%-50% and have high factor arity. Further, in Band 1, given that ACE requires the 20% sparsity levels to complete, we present results at two levels of sparsity for CELAR (20% and 40%).

N_D low.

r_fhtw predicts superior performance for JoinInfer in Band 4, which it achieves. It is up to 5.29x faster than libDAI (its closest competitor) and up to 5.4x faster than ACE (on the subsets that ACE completes on). IJGP times out on almost all of the networks. Finally, in Bands 5/6, the two unfavorable settings, JoinInfer is on average faster than ACE and IJGP. It is on average slower than libDAI by 5x/10x respectively: libDAI’s truth-table indexing advantages clearly manifest in these two bands. We would like to note, however, that the corresponding predictions of [14] for JoinInfer are under-performance by several orders of magnitude for Bands 5 and 6.

Secondly, our fine-grained measure r (Column 4) is a better predictor for JoinInfer’s performance than r_fhtw (Column 5) on most networks. Consider Bands 3, 4 and 5 in Table 1. While r_fhtw overestimates our performance in Band 4, r provides a more realistic measure that correlates with our performance. On the other hand, in Bands 3 and 5, while r_fhtw severely underestimates our performance (sometimes by orders of magnitude), r’s predictions are generally tighter. On Bands 1 and 2, its predictions are comparable to r_fhtw. The only place where it fails to make good estimates is Band 6.

01-Projections.

In theory, 01-projections are central to realizing the asymptotic bounds of FAQ/AJAR. However, we find that in practice they offer mixed results: they help in 20/52 networks (average gain of 20%), make no difference in 3 networks and marginally hurt in 29 networks (average loss of 7%).

Memory Usage.

To examine memory usage, we selected three large networks (each with roughly a thousand or more variables) from Bands 4 and 5 (results in Table 2). We found that on the former set JoinInfer consumed the least memory: on average libDAI consumed 2.35x and ACE consumed 7x more. On the latter set, libDAI consumed the least memory: on average JoinInfer consumed 2.82x, IJGP 1.98x and ACE 7.29x more. In summary, this underscores that JoinInfer is comparable to classical inference frameworks on the memory consumption metric.

Band Datasets JoinInfer IJGP libDAI ACE
Band 4 BN_23 305 T 2.35x 7.41x
BN_24 293 T 1.67x 6.83x
BN_25 294 T 1.71x 6.95x
Band 5 Munin2 395 0.56x 0.33x 3.01x
Munin3 394 0.61x 0.31x 3.05x
Munin4 1038 1.03x 0.45x 1.37x
Table 2: Real-World Experiments: Memory Consumed (in MB)
Hybrid Architecture.

Since JoinInfer is the only engine that completes on all networks when N_D is high, we now focus on the low-N_D conditions. As is evident from Table 1, HYJAR helps exploit the relative strengths of each strategy (multiway or pairwise products) in a single architecture, yielding consistent performance across a majority of networks (26/28). Of these, in 9 cases (e.g., Munin1, Munin, Barley, Mildew) HYJAR’s completion times are faster than its nearest standalone competitors (JoinInfer or libDAI), in 10 cases it is less than 2.5x slower and in 7 cases it is between 2.5x and 4.5x slower. BN_42 and BN_44 are the only two networks where HYJAR’s strategy does not lead to a notable improvement. Further, follow-up analyses indicate that HYJAR consistently switches between strategies at the bag level. We present a detailed analysis of our results in Appendix B.2.

Band Dataset JoinInfer (w/o 0/1 | 0/1) HYJAR Pairwise libDAI Clusters Multiway w/o 0/1 (%) Multiway 0/1 (%) Pairwise (%)
Band 4 BN_20 4.52 4.53 14.94 14.47 22.73 1825 0 27 73
BN_21 4.52 4.58 15.47 14.86 23.37 1825 0 9 91
BN_22 2.15 2.22 2.98 2.59 3.77 1513 9 32 59
BN_23 2.16 2.24 3 2.56 3.74 1513 8 26 66
BN_24 1.31 1.43 1.74 1.43 2.11 908 0 33 67
BN_25 1.34 1.47 1.76 1.36 2.12 908 0 33 67
Pathfinder 0.16 0.27 0.29 0.1 0.11 91 0 0 100
Band 5 Alarm 0.03 0.04 0.05 0.05 0.02 27 4 3 93
Hepar2 0.05 0.04 0.06 0.04 0.02 58 4 5 91
Mildew 0.81 0.76 0.24 0.21 0.27 29 6 6 88
Munin 13.34 13.61 1.98 1.74 3.14 872 0 6 94
Munin1 600.86 630.22 20.6 19.44 39.01 158 0 9 91
Munin4 16.91 17.18 2.06 1.89 2.93 869 0 0 100
Diabetes 3.31 3.36 0.72 0.6 0.89 337 0 7 93
Munin2 2.74 2.31 0.68 0.6 0.79 866 0 1 99
Munin3 2.72 2.27 0.7 0.64 0.95 901 1 0 99
Pigs 1.01 0.93 0.36 0.22 0.23 368 1 28 71
Link 18.02 19.91 14.27 3.28 3.44 591 22 12 66
Barley 26.8 27.12 1.13 0.8 1.45 36 0 0 100
Hailfinder 0.03 0.05 0.06 0.04 0.02 43 0 1 99
Water 0.18 0.15 0.27 0.28 0.31 19 0 6 94
Win95pts 0.04 0.03 0.05 0.06 0.03 50 19 13 68
Band 6 Andes 0.59 0.59 0.19 0.12 0.14 178 1 27 73
BN_42 32.63 33.01 35.15 2.48 2.66 789 13 42 45
BN_43 65.47 64.03 4.37 4.36 4.43 789 12 22 66
BN_44 227.87 216.98 133.5 16.29 12.82 789 18 38 44
BN_45 67.05 68.21 8.07 8 6.95 788 14 25 61
BN_46 45.98 46.01 20.16 5.64 5.85 446 18 17 65
Table 3: We report JoinInfer w/o 0/1, with all 0/1, HYJAR, Pairwise and libDAI runtimes (in seconds), along with the total number of clusters, followed by the average % of clusters running multiway w/o 0/1, multiway with all 0/1 and pairwise (for the Pairwise runtime column, we set R(t) = PAIRWISE for all t in Algorithm 1).
Takeaways.

(i) In our investigation, we identified a threshold for N_D reflecting the current memory limits for truth tables (on our machine). While the absolute value of the threshold may change depending on machine configurations, such a threshold will always exist. (ii) In this work, we show that r is a better predictor of JoinInfer’s performance than r_fhtw. (iii) Of the 52 networks in the test-bed, HYJAR outperforms libDAI, IJGP and ACE on 39 (i.e., on 75% of networks). libDAI is the next largest winner with wins on 11 datasets. HYJAR thus offers promise as a practically relevant architecture for building a robust, broadly applicable inference engine.

4.2.2 Factor Representations

As described in Section 3.3.1, a secondary representation for our factors is a list of (index, probability) pairs. Specifically, we use two variants of the list that store indices computed in two variable orders: forward and reverse. We perform three descriptive experiments on the UAI speech recognition datasets (BN_20-25) with these new data structures/algorithms, comparing them with their corresponding alternatives. Note that since the data is stored as compressed indices, the memory blowup is only 1.3x (and not 2x) with respect to tries.

Forward vs Reverse Index.

We compare our new method of constructing tries using a reverse index (based on reverse input order) against the standard way of constructing tries with a forward index. In particular, the new method removes significant computational/storage overhead and is on average 3x faster than the older method (see Table 4).

Dataset Num Var/D Old Tries New Tries
BN_20 2843/91 0.1 0.27x
BN_21 2843/91 0.1 0.33x
BN_22 2425/91 0.09 0.37x
BN_23 2425/91 0.09 0.38x
BN_24 2425/91 0.09 0.36x
BN_25 2425/91 0.09 0.36x
Table 4: Descriptive Experiment: Input Processing. We implement and compare two methods for constructing tries: first using a forward index (Old Tries) and then, using a reverse index (based on reverse input order, New Tries). All runtimes are in seconds.
Upward Pass.

During the upward pass (Algorithm 4), we store the up message as a hash table of (forward index, reverse index, probability) tuples. Further, we store the cluster tables as lists of (reverse index, probability) pairs. Overall, this design gives us on average a 3x speed gain in Algorithm 4 (see Table 5) compared to a version that stores both the up messages and cluster tables as lists of (forward index, probability) pairs.

Dataset Num Var/D New Data Structures Slowdown of Older Data Structures
BN_20 2843/91 1.17 7.26x
BN_21 2843/91 1.11 6.22x
BN_22 2425/91 0.27 1.33x
BN_23 2425/91 0.29 1.21x
BN_24 2425/91 0.27 1.44x
BN_25 2425/91 0.28 1.46x
Table 5: Descriptive Experiment: Upward Pass. We implement and compare two data structures for storing the up messages/cluster tables: one storing the up messages as hash tables of (forward index, reverse index, probability) tuples and the cluster tables as lists of (reverse index, probability) pairs, and the other storing both up messages and cluster tables as lists of (forward index, probability) pairs. All runtimes are in seconds.
Downward Pass.

Finally, for the downward pass (Algorithm 5), we use the in-place hash-based product and compare it with a version implementing Sort-MergeProduct. In particular, we observe that using in-place hash-based products instead of Sort-MergeProduct gives us an average speedup of 1.6x (see Table 6).

Dataset Num Var/D In-Place Product Slowdown of Sort-MergeProduct
BN_20 2843/91 0.53 3.02x
BN_21 2843/91 0.52 2.58x
BN_22 2843/91 0.11 1.18x
BN_23 2843/91 0.12 1.17x
BN_24 2843/91 0.17 1.06x
BN_25 2843/91 0.18 1.06x
Table 6: Descriptive Experiment: Downward Pass. We implement and compare two algorithms for computing products in the downward pass: the in-place hash-based product and Sort-MergeProduct. All runtimes are in seconds.

5 Conclusion and Future Directions

This paper demonstrates that on a wide range of PGM benchmarks, GHD-based inference algorithms offer much promise in terms of performance, and prior conclusions on their practical (ir)relevance [14] might have to be revisited. Further, the HYJAR architecture shows great promise in integrating the benefits of traditional inference engines with JoinInfer. Improving the data structures of JoinInfer to facilitate this integration is one direction for future work. The following other research directions also seem promising: (1) incorporate Single Instruction Multiple Data (SIMD) instructions to speed up the computation in a bag (this has been successful in database joins [3]); and (2) utilize JoinInfer to speed up approximate inference engines (there are currently no known approximate extensions of the multi-way product algorithm).

Acknowledgments

We thank Andrew Lamb, Chris Aberger, Hung Ngo, Jimmy Dobler and Ravishankar Krishnaswamy for helpful discussions.

SVMJ’s research is supported in part by NSF grant CCF-1717134 and thanks Microsoft Research for hospitality where a part of this work was done. CR gratefully acknowledges the support of DARPA No. N6600115C4043 (SIMPLEX), No. FA87501720095 (D3M), No. FA87501220335 (XDATA), No. FA87501320039 (DEFT), DOE under No. 108845 (Integrated Compiler and Runtime Autotuning Infrastructure for Power, Energy, and Resilience), NSF under No.1505728 (Intel/NSF CPS Security grant), NIH under No. U54EB020405 (Mobilize), ONR under No. N000141712266 (Unifying Weak Supervision), No. N00014-14-1-0102 (Data-Driven Systems: Join Algorithms and Random Network Theory), AFOSR under No. 580K753 (Mathematical Foundations of Secure Computing Clouds), Moore Foundation, Okawa Research Grant, American Family Insurance, Accenture, Toshiba, Secure Internet of Things Project, Google, VMware, Qualcomm, Ericsson, Analog Devices, and members of the Stanford DAWN project: Intel, Microsoft, Teradata, and VMware. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA, DOE, NSF, NIH, ONR, or the U.S. Government. AR’s research is supported in part by NSF grants CCF-1763481 and CCF-1717134.

References

  • [1] Bayesian network repository. http://bnlearn.com/bnrepository/.
  • [2] Ace: Ucla. http://reasoning.cs.ucla.edu/ace/moreInformation.html, 2005.
  • [3] C. R. Aberger, S. Tu, K. Olukotun, and C. Ré. Emptyheaded: A relational engine for graph processing. In Proc. SIGMOD 2016, pages 431–446, 2016.
  • [4] M. Abo Khamis, H. Q. Ngo, and A. Rudra. FAQ: questions asked frequently. In Proc. 35th PODS, pages 13–28. ACM, 2016.
  • [5] A. Atserias, M. Grohe, and D. Marx. Size bounds and query plans for relational joins. SIAM J. Comput., 42(4):1737–1767, 2013.
  • [6] F. R. Bach and M. I. Jordan. Thin junction trees. In Advances in Neural Information Processing Systems, pages 569–576, 2001.
  • [7] J. Bekker, J. Davis, A. Choi, A. Darwiche, and G. Van den Broeck. Tractable learning for complex probability queries. In Advances in Neural Information Processing Systems, pages 2242–2250, 2015.
  • [8] J. Bilmes and R. Dechter. Evaluation of probabilistic inference systems of UAI’06, 2006.
  • [9] M. Chavira and A. Darwiche. Compiling bayesian networks with local structure. In IJCAI, volume 5, pages 1306–1312, 2005.
  • [10] M. Chavira and A. Darwiche. Compiling bayesian networks using variable elimination. In Proc. 20th IJCAI, pages 2443–2449, 2007.
  • [11] A. Darwiche. Recursive conditioning. Artif. Intell., 126(1-2):5–41, 2001.
  • [12] R. de Salvo Braz, E. Amir, and D. Roth. Lifted first-order probabilistic inference. In Proc. 19th IJCAI, pages 1319–1325, 2005.
  • [13] R. Dechter. Bucket elimination: A unifying framework for probabilistic inference. In Proc. 12th UAI, pages 211–219, 1996.
  • [14] R. Dechter, L. Otten, and R. Marinescu. On the practical significance of hypertree vs. treewidth. In ECAI, volume 178, pages 913–914, 2008.
  • [15] W. Fischl, G. Gottlob, and R. Pichler. General and fractional hypertree decompositions: Hard and easy cases. arXiv preprint arXiv:1611.01090, 2016.
  • [16] G. Elidan and A. Globerson. Probabilistic Inference Challenge 2011. http://www.cs.huji.ac.il/project/PASCAL/.
  • [17] Google. OR Solver. https://developers.google.com/optimization/.
  • [18] G. Gottlob, M. Grohe, N. Musliu, M. Samer, and F. Scarcello. Hypertree decompositions: Structure, algorithms, and applications. In WG (5) 2005, pages 1–15. Springer, 2005.
  • [19] C. Huang and A. Darwiche. Inference in belief networks: A procedural guide. Int. J. Approx. Reasoning, 15(3):225–263, 1996.
  • [20] J. Huang, M. Chavira, and A. Darwiche. Solving MAP exactly by searching on compiled arithmetic circuits. In AAAI, volume 6, pages 1143–1148, 2006.
  • [21] F. Jensen, S. Lauritzen, and K. Olesen. Bayesian updating in causal probabilistic networks by local computations. Computational Statistics Quarterly, 4:269–282, 1990.
  • [22] M. R. Joglekar, R. Puttagunta, and C. Ré. AJAR: aggregations and joins over annotated relations. In Proc. 35th PODS, pages 91–106, 2016.
  • [23] K. Kask, R. Dechter, J. Larrosa, and A. Dechter. Unifying tree decompositions for reasoning in graphical models. Artificial Intelligence, 166(1-2):165–193, 2005.
  • [24] K. Kersting. Lifted probabilistic inference. In Proc. 20th ECAI, pages 33–38, 2012.
  • [25] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • [26] D. Larkin and R. Dechter. Bayesian inference in the presence of determinism. In Proc. 9th AISTATS, 2003.
  • [27] X. Liu, J. Pool, S. Han, and W. J. Dally. Efficient sparse-winograd convolutional neural networks. CoRR, abs/1802.06367, 2018.
  • [28] R. Mateescu, K. Kask, V. Gogate, and R. Dechter. Join-graph propagation algorithms. J. Artif. Intell. Res. (JAIR), 37:279–328, 2010.
  • [29] B. Milch, L. S. Zettlemoyer, K. Kersting, M. Haimes, and L. P. Kaelbling. Lifted probabilistic inference with counting formulas. In Proc. 23rd AAAI, pages 1062–1068, 2008.
  • [30] J. M. Mooij. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. Journal of Machine Learning Research, 11:2169–2173, Aug. 2010.
  • [31] H. Q. Ngo, C. Ré, and A. Rudra. Skew strikes back: New developments in the theory of join algorithms. ACM SIGMOD Record, 42(4):5–16, 2014.
  • [32] J. Pearl. Probabilistic reasoning in intelligent systems - networks of plausible inference. Morgan Kaufmann series in representation and reasoning. Morgan Kaufmann, 1989.
  • [33] D. L. Poole and N. L. Zhang. Exploiting contextual independence in probabilistic inference. J. Artif. Intell. Res. (JAIR), 18:263–313, 2003.
  • [34] Y. Shen, A. Choi, and A. Darwiche. Tractable operations for arithmetic circuits of probabilistic models. In Advances in Neural Information Processing Systems, pages 3936–3944. 2016.
  • [35] N. L. Zhang and D. L. Poole. Exploiting causal independence in bayesian network inference. J. Artif. Intell. Res. (JAIR), 5:301–328, 1996.

Appendix A Missing Details in Section 3

A.1 01-Projections

As stated earlier, using 01-Projections is central to realizing the asymptotically better bounds in FAQ/AJAR. Moreover, this gain is accentuated when we deal with sparsity: the key idea is to exploit the sparsity of all the input factors and not just those encompassed by the query corresponding to the current bag. We highlight this aspect in an example where the gain is more dramatic than Example 2.

Example 4

For instance, say we are processing the bag of a GHD at node t that contains k factors and the variables they span, and that the query for the bag involves marginalizing out some variables to create an intermediary factor/message. The optimal fractional cover (recall the LP in (3)) for the query hypergraph would assign weight 1 to all k edges, making the runtime complexity of processing this query O(N^k), a bound that could be significant depending on the value of N.

Now suppose the original PGM hypergraph contains a factor whose scope covers all the variables of this bag and that would be encountered somewhere along the tree in subsequent computations. We could employ this factor upfront while computing the query at t to avoid redundant computations at a later stage. With the addition of this factor’s support, the optimal fractional cover of the induced query hypergraph would be 1 for the new hyperedge and 0 for the other edges incident on the bag. This makes the AGM bound go to O(N), a dramatic improvement in the runtime bound, especially if N and k are large.

However, employing this factor as-is could lead to double counting of probabilities while processing the additional factor. Hence, we need to find a way of incorporating such support factors without changing the computation.

We accomplish this goal using 01-projections defined by:

Definition 4

For each factor f_e and any set S ⊆ V such that S ∩ e ≠ ∅, define the 01-projection of f_e onto S as the function

π_S^{01}(f_e) : ∏_{v ∈ S∩e} D_v → {0, 1},

where π_S^{01}(f_e)(x) = 1 if and only if there exists an extension y of x to e with f_e(y) > 0.

Two key improvements afforded by 01-projections can be summarized as follows:

  • In terms of computation: there could be potential wastage in computing entries that would eventually be annihilated. We avoid this redundancy by computing only those entries that we know will not be eliminated. When dealing with very large factors, exploiting factor sparsity via 01-projections can lend substantial reductions in computation.

  • The above also implies better theoretical bounds: by incorporating support factors using 01-projections while processing the query corresponding to each bag, we are now bounded by the fractional cover of the induced hypergraph, formed by using the projections of edges incident on the bag in addition to the edges in it. This gain, as illustrated in Example 4, can be substantial.
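A minimal Python sketch of Definition 4 (a hypothetical helper of ours, with factors in the listing representation of Section 3.1):

def zero_one_projection(scope, table, S):
    """01-projection of a factor onto the variable set S: every tuple
    over S ∩ scope that has at least one non-zero extension in the
    factor is mapped to 1."""
    keep = [i for i, v in enumerate(scope) if v in S]
    proj_scope = tuple(scope[i] for i in keep)
    proj_table = {}
    for tup, p in table.items():        # only non-zero entries stored
        if p > 0:
            proj_table[tuple(tup[i] for i in keep)] = 1.0
    return proj_scope, proj_table

# Projecting a factor on (A, C) onto S = {C, D} keeps only the C column:
scope, table = ("A", "C"), {(0, 1): 0.9, (2, 1): 0.1}
print(zero_one_projection(scope, table, {"C", "D"}))
# -> (('C',), {(1,): 1.0})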

A.2 GHD: Notions of Width

We describe three notions of width here – treewidth (tw), hypertree width (htw) and fractional hypertree width (fhtw) – for a GHD (recall Definition 3). The treewidth of a GHD is given by tw = max_{t∈V(T)} |χ(t)|. (The standard definition of treewidth is max_{t∈V(T)} |χ(t)| − 1, but we use our modified definition throughout the paper.) The notions of htw and fhtw are defined analogously, replacing the bag size |χ(t)| by the minimum integral (respectively, fractional) edge cover of χ(t) by the edges of H and taking the maximum over all bags; the fractional cover is the optimal value of the LP (3)-(4) with all objective coefficients set to 1.