Sum-Product Networks (SPNs) have recently been proposed as tractable deep models (Poon & Domingos, 2011) for probabilistic inference. They distinguish themselves from other types of probabilistic graphical models (PGMs), including Bayesian Networks (BNs) and Markov Networks (MNs), by the fact that inference can be done exactly in linear time with respect to the size of the network. This has generated a lot of interest since inference is often a core task for parameter estimation and structure learning, and it typically needs to be approximated to ensure tractability since probabilistic inference in BNs and MNs is #P-complete (Roth, 1996).
The relationship between SPNs and BNs, and more broadly with PGMs, is not clear. Since the introduction of SPNs in the seminal paper of Poon & Domingos (2011), it is well understood that SPNs and BNs are equally expressive in the sense that they can represent any joint distribution over discrete variables (joint distributions over continuous variables are also possible, but we restrict ourselves to discrete variables in this paper), but it is not clear how to convert SPNs into BNs, nor whether a blow up may occur in the conversion process. The common belief is that there exists a distribution such that the smallest BN that encodes this distribution is exponentially larger than the smallest SPN that encodes this same distribution. The key behind this belief lies in SPNs' ability to exploit context-specific independence (CSI) (Boutilier et al., 1996).
While the above belief is correct for classic BNs with tabular conditional probability distributions (CPDs) that ignore CSI, and for BNs with tree-based CPDs due to the replication problem (Pagallo, 1989), it is not clear whether it is correct for BNs with more compact representations of the CPDs. The other direction is clear for classic BNs with tabular representation: given a BN with tabular representation of its CPDs, we can build an SPN that represents the same joint probability distribution in time and space complexity that may be exponential in the tree-width of the BN. Briefly, this is done by first constructing a junction tree and then translating it into an SPN (see http://spn.cs.washington.edu/faq.shtml). However, to the best of our knowledge, it is still unknown how to convert an SPN into a BN and whether the conversion will lead to a blow up when more compact representations than tables and trees are used for the CPDs.
We prove in this paper that by adopting Algebraic Decision Diagrams (ADDs) (Bahar et al., 1997) to represent the CPDs at each node in a BN, every SPN can be converted into a BN in linear time and space complexity in the size of the SPN. The generated BN has a simple bipartite structure, which facilitates the analysis of the structure of an SPN in terms of the structure of the generated BN. Furthermore, we show that by applying the Variable Elimination (VE) algorithm (Zhang & Poole, 1996) to the generated BN with ADD representation of its CPDs, we can recover the original SPN in linear time and space with respect to the size of the SPN.
Our contributions can be summarized as follows. First, we present a constructive algorithm and a proof for the conversion of SPNs into BNs using ADDs to represent the local CPDs. The conversion process is bounded by a linear function of the size of the SPN in both time and space. This gives a new perspective for understanding the probabilistic semantics implied by the structure of an SPN through the generated BN. Second, we show that by executing VE on the generated BN, we can recover the original SPN in linear time and space complexity in the size of the SPN. Combined with the first point, this establishes a clear relationship between SPNs and BNs. Third, we introduce the subclass of normal SPNs and show that every SPN can be transformed into a normal SPN in quadratic time and space. Compared with general SPNs, the structure of normal SPNs exhibits more intuitive probabilistic semantics, and hence normal SPNs are used as a bridge in the conversion of general SPNs to BNs. Fourth, our construction and analysis provide a new direction for learning the parameters and structure of BNs, since the SPNs produced by the algorithms that learn SPNs (Dennis & Ventura, 2012; Gens & Domingos, 2013; Peharz et al., 2013; Rooshenas & Lowd, 2014) can be converted into BNs.
2 Related Work
Exact probabilistic reasoning has a close connection with propositional logic and weighted model counting (Roth, 1996; Gomes et al., 2008; Bacchus et al., 2003; Sang et al., 2005). The model counting problem, #SAT, is the problem of computing the number of models for a given propositional formula, i.e., the number of distinct truth assignments of the variables for which the formula evaluates to TRUE. In its weighted version, each Boolean variable has a weight when set to TRUE and a weight when set to FALSE. The weight of a truth assignment is the product of the weights of its literals. The weighted model counting problem then asks for the sum of the weights of all satisfying truth assignments. There are two important streams of research for exact weighted model counting and exact probabilistic reasoning that relate to SPNs: DPLL-style exhaustive search (Birnbaum & Lozinskii, 2011) and methods based on knowledge compilation, e.g., Binary Decision Diagrams (BDDs), Decomposable Negation Normal Forms (DNNFs) and Arithmetic Circuits (ACs) (Bryant, 1986; Darwiche, 2001, 2000).
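As a concrete reading of the definition above, the following minimal sketch computes a weighted model count by brute-force enumeration. It is exponential in the number of variables and is meant only to illustrate the quantity being computed, not the DPLL-style or compilation-based methods cited above. The function name `weighted_model_count`, the formula and the weights are all hypothetical.

```python
from itertools import product

def weighted_model_count(variables, formula, weight):
    """Sum, over satisfying assignments, the product of literal weights.

    variables: list of variable names
    formula:   function mapping an assignment dict {name: bool} to bool
    weight:    dict {(name, bool): float}, the weight of each literal
    """
    total = 0.0
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if formula(assignment):
            w = 1.0
            for name, value in assignment.items():
                w *= weight[(name, value)]
            total += w
    return total

# Hypothetical example: the formula (a OR b) with illustrative weights.
variables = ["a", "b"]
formula = lambda s: s["a"] or s["b"]
weight = {("a", True): 0.3, ("a", False): 0.7,
          ("b", True): 0.6, ("b", False): 0.4}
wmc = weighted_model_count(variables, formula, weight)

# With unit weights the same routine counts models, i.e., it solves #SAT.
unit = {key: 1.0 for key in weight}
count = weighted_model_count(variables, formula, unit)
```

When the two weights of each variable sum to 1, as here, the weighted count equals the probability that the formula holds under independent variables.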
The SPN, as an inference machine, has a close connection with the broader field of knowledge representation and knowledge compilation. In knowledge compilation, the reasoning process is divided into two phases: an offline compilation phase and an online query-answering phase. In the offline phase, the knowledge base, either propositional theory or belief network, is compiled into some tractable target language. In the online phase, the compiled target model is used to answer a large number of queries efficiently. The key motivation of knowledge compilation is to shift the computation that is common to many queries from the online phase into the offline phase. As an example, ACs have been studied and used extensively in both knowledge representation and probabilistic inference (Darwiche, 2000; Huang et al., 2006; Chavira et al., 2006). Rooshenas & Lowd (2014) recently showed that ACs and SPNs can be converted mutually without an exponential blow-up in both time and space. As a direct result, ACs and SPNs share the same expressiveness for probabilistic reasoning.
Another representation closely related to SPNs in propositional logic and knowledge representation is the deterministic-Decomposable Negation Normal Form (d-DNNF) (Darwiche & Marquis, 2001). Propositional formulas in d-DNNF are represented by a directed acyclic graph (DAG) structure to enable the re-usability of sub-formulas. The terminal nodes of the DAG are literals and the internal nodes are AND or OR operators. Like SPNs, d-DNNF formulas can be queried to answer satisfiability and model counting problems. We refer interested readers to Darwiche & Marquis (2001) and Darwiche (2001) for more detailed discussions.
Since their introduction by Poon & Domingos (2011), SPNs have generated a lot of interest as a tractable class of models for probabilistic inference in machine learning. Discriminative learning techniques for SPNs have been proposed and applied to image classification (Gens & Domingos, 2012). Later, automatic structure learning algorithms were developed to build tree-structured SPNs directly from data (Dennis & Ventura, 2012; Peharz et al., 2013; Gens & Domingos, 2013; Rooshenas & Lowd, 2014). SPNs have also been applied to various fields and have generated promising results, including activity modeling (Amer & Todorovic, 2012), speech modeling (Peharz et al., 2014) and language modeling (Cheng et al., 2014). Theoretical work investigating the influence of the depth of SPNs on expressiveness exists (Delalleau & Bengio, 2011), but is quite limited. As discussed later, our results reinforce previous theoretical results about the depth of SPNs and provide further insights about the structure of SPNs by examining the structure of equivalent BNs.
We start by introducing the notation used in this paper. We use $1:N$ to abbreviate the notation $\{1, 2, \ldots, N\}$. We use a capital letter $X$ to denote a random variable and a bold capital letter $\mathbf{X}_{1:N}$ to denote a set of random variables $\mathbf{X}_{1:N} = \{X_1, \ldots, X_N\}$. Similarly, a lowercase letter $x$ is used to denote a value taken by $X$ and a bold lowercase letter $\mathbf{x}_{1:N}$ denotes a joint value taken by the corresponding vector $\mathbf{X}_{1:N}$ of random variables. We may omit the subscript $1:N$ from $\mathbf{X}_{1:N}$ and $\mathbf{x}_{1:N}$ if it is clear from the context. For a random variable $X_i$, we use $x_i^j$, $j \in 1:J$, to enumerate all the values taken by $X_i$. For simplicity, we use $\Pr(x)$ to mean $\Pr(X = x)$ and $\Pr(\mathbf{x})$ to mean $\Pr(\mathbf{X} = \mathbf{x})$. We use calligraphic letters to denote graphs (e.g., $\mathcal{G}$). In particular, BNs, SPNs and ADDs are denoted respectively by $\mathcal{B}$, $\mathcal{S}$ and $\mathcal{A}$. For a DAG $\mathcal{G}$ and a node $v$ in $\mathcal{G}$, we use $\mathcal{G}_v$ to denote the subgraph of $\mathcal{G}$ induced by $v$ and all its descendants. Let $\mathbf{V}$ be a subset of the nodes of $\mathcal{G}$, then $\mathcal{G}|_{\mathbf{V}}$ is a subgraph of $\mathcal{G}$ induced by the node set $\mathbf{V}$. Similarly, we use $\mathbf{X}|_A$ or $\mathbf{x}|_A$ to denote the restriction of a vector to a subset $A$. We use node and vertex, arc and edge interchangeably when we refer to a graph. Other notation will be introduced when needed.
To ensure that the paper is self contained, we briefly review some background material about Bayesian Networks, Algebraic Decision Diagrams and Sum-Product Networks. Readers who are already familiar with those models can skip the following subsections.
3.1 Bayesian Network
Consider a problem whose domain is characterized by a set of random variables $\mathbf{X}_{1:N}$ with finite support. The joint probability distribution over $\mathbf{X}_{1:N}$ can be characterized by a Bayesian Network, which is a DAG where nodes represent the random variables and edges represent probabilistic dependencies among the variables. In a BN, we also use the terms "node" and "variable" interchangeably. For each variable in a BN, there is a local conditional probability distribution (CPD) over the variable given its parents in the BN.
The structure of a BN encodes conditional independencies among the variables in it. Let $X_1, X_2, \ldots, X_N$ be a topological ordering of all the nodes in a BN (a topological ordering of nodes in a DAG is a linear ordering of its nodes such that each node appears after all its parents in this ordering), and let $\pi_{X_i}$ be the set of parents of node $X_i$ in the BN. Each variable in a BN is conditionally independent of all its non-descendants given its parents. Hence, the joint probability distribution over $\mathbf{X}_{1:N}$ admits the factorization in Eq. 1.
$$\Pr(\mathbf{X}_{1:N}) = \prod_{i=1}^{N} \Pr(X_i \mid \mathbf{X}_{1:i-1}) = \prod_{i=1}^{N} \Pr(X_i \mid \pi_{X_i}) \tag{1}$$
Given the factorization, one can use various inference algorithms to do probabilistic reasoning in BNs. See Wainwright & Jordan (2008) for a comprehensive survey.
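To make the factorization concrete, here is a minimal sketch that evaluates the joint probability of a two-node BN as a product of local CPD entries. The network $X_1 \to X_2$, the dictionaries `parents` and `cpd`, and all numbers are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical two-node BN X1 -> X2 over Boolean variables.
parents = {"X1": [], "X2": ["X1"]}
# cpd[var][parent_values] = Pr(var = True | parents = parent_values)
cpd = {
    "X1": {(): 0.3},
    "X2": {(True,): 0.9, (False,): 0.2},
}

def joint(assignment):
    """Pr(x) as the product of one local CPD entry per variable."""
    p = 1.0
    for var, pa in parents.items():
        p_true = cpd[var][tuple(assignment[q] for q in pa)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

# Summing the joint over all assignments should give 1.
total = sum(joint(dict(zip(parents, vals)))
            for vals in product([False, True], repeat=len(parents)))
```

The tabular `cpd` used here stores one entry per parent configuration, which is exactly the representation that cannot exploit context-specific independence.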
3.2 Algebraic Decision Diagram
We first give a formal definition of Algebraic Decision Diagrams (ADDs) for variables with Boolean domains and then extend the definition to domains corresponding to arbitrary finite sets.
Definition 1 (Algebraic Decision Diagram (Bahar et al., 1997)).
An Algebraic Decision Diagram (ADD) is a graphical representation of a real function with Boolean input variables: $f : \{0, 1\}^N \mapsto \mathbb{R}$, where the graph is a rooted DAG. There are two kinds of nodes in an ADD. Terminal nodes, whose out-degree is 0, are associated with real values. Internal nodes, whose out-degree is 2, are associated with Boolean variables $X_n$, $n \in 1:N$. For each internal node $X_n$, the left out-edge is labeled with $X_n = \text{FALSE}$ and the right out-edge is labeled with $X_n = \text{TRUE}$.
We extend the original definition of an ADD by allowing it to represent not only functions of Boolean variables, but also any function of discrete variables with a finite set as domain. This can be done by allowing each internal node $X_n$ to have $|\mathcal{X}_n|$ out-edges and labeling each edge with $x_n^j$, $j \in 1:|\mathcal{X}_n|$, where $\mathcal{X}_n$ is the domain of variable $X_n$ and $|\mathcal{X}_n|$ is the number of values $X_n$ takes. Such an ADD represents a function $f : \mathcal{X}_1 \times \cdots \times \mathcal{X}_N \mapsto \mathbb{R}$, where $\times$ means the Cartesian product between two sets. Henceforth, we will use our extended definition of ADDs throughout the paper.
For our purpose, we will use an ADD as a compact graphical representation of local CPDs associated with each node in a BN. This is a key insight of our constructive proof presented later. Compared with a tabular representation or a decision tree representation of local CPDs, CPDs represented by ADDs can fully exploit CSI (Boutilier et al., 1996) and effectively avoid the replication problem (Pagallo, 1989) of the decision tree representation.
We give an example in Fig. 1 where the tabular representation, decision-tree representation and ADD representation of a function of 4 Boolean variables are presented.
Another advantage of ADDs to represent local CPDs is that arithmetic operations such as multiplying ADDs and summing-out a variable from an ADD can be implemented efficiently in polynomial time. This will allow us to use ADDs in the Variable Elimination (VE) algorithm to recover the original SPN after its conversion to a BN with CPDs represented by ADDs. Readers are referred to Bahar et al. (1997) for more detailed and thorough discussions about ADDs.
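The sharing that distinguishes an ADD from a decision tree can be sketched in a few lines. The classes `Leaf` and `Decision` and the example function below are hypothetical; the sketch covers only Boolean variables and evaluation, not the multiplication and summing-out operations of Bahar et al. (1997).

```python
class Leaf:
    def __init__(self, value):
        self.value = value

class Decision:
    def __init__(self, var, low, high):
        self.var = var      # variable tested at this node
        self.low = low      # child followed when var is False
        self.high = high    # child followed when var is True

def evaluate(node, assignment):
    """Follow the single root-to-leaf path selected by the assignment."""
    while isinstance(node, Decision):
        node = node.high if assignment[node.var] else node.low
    return node.value

def count_nodes(root):
    """Count distinct nodes, so a shared sub-diagram is counted once."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        if isinstance(node, Decision):
            stack.extend([node.low, node.high])
    return len(seen)

# Hypothetical function of X1..X3: the sub-diagram testing X3 is shared by
# two branches, whereas a decision tree would have to replicate it.
on_x3 = Decision("X3", Leaf(0.1), Leaf(0.9))
half = Leaf(0.5)
root = Decision("X1", Decision("X2", half, on_x3), on_x3)
```

In this example the diagram has 6 distinct nodes, while the corresponding table has $2^3 = 8$ rows and the equivalent decision tree would contain two copies of the test on $X_3$.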
3.3 Sum-Product Network
Before introducing SPNs, we first define the notion of network polynomial, which plays an important role in our proof. We use $\mathbb{I}[X = x]$ to denote an indicator that returns 1 when $X = x$ and 0 otherwise. To simplify the notation, we will use $\mathbb{I}_x$ to represent $\mathbb{I}[X = x]$.
Definition 2 (Network Polynomial (Poon & Domingos, 2011)).
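Let $f(\cdot) \geq 0$ be an unnormalized probability distribution over a Boolean random vector $\mathbf{X}_{1:N}$. The network polynomial of $f(\cdot)$ is a multilinear function $\sum_{\mathbf{x}} f(\mathbf{x}) \prod_{n=1}^{N} \mathbb{I}_{x_n}$ of indicator variables, where the summation is over all possible instantiations of the Boolean random vector $\mathbf{X}_{1:N}$.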
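Intuitively, the network polynomial is a Boolean expansion (Boole, 1847) of the unnormalized probability distribution $f(\cdot)$. For example, the network polynomial of a BN $X_1 \to X_2$ is $\Pr(x_1)\Pr(x_2 \mid x_1)\mathbb{I}_{x_1}\mathbb{I}_{x_2} + \Pr(x_1)\Pr(\bar{x}_2 \mid x_1)\mathbb{I}_{x_1}\mathbb{I}_{\bar{x}_2} + \Pr(\bar{x}_1)\Pr(x_2 \mid \bar{x}_1)\mathbb{I}_{\bar{x}_1}\mathbb{I}_{x_2} + \Pr(\bar{x}_1)\Pr(\bar{x}_2 \mid \bar{x}_1)\mathbb{I}_{\bar{x}_1}\mathbb{I}_{\bar{x}_2}$.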
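The role of the indicators can be seen by evaluating a network polynomial directly from its definition. The sketch below enumerates all instantiations, so it is exponential in the number of variables; the function name `network_polynomial` and the table of values are hypothetical.

```python
from itertools import product

def network_polynomial(f, n, indicators):
    """Evaluate sum_x f(x) * prod_i I[x_i] for a function f over n Booleans.

    indicators[i] = (value of the indicator for X_i = False,
                     value of the indicator for X_i = True)
    """
    total = 0.0
    for x in product([False, True], repeat=n):
        term = f(x)
        for i, xi in enumerate(x):
            term *= indicators[i][int(xi)]
        total += term
    return total

# Hypothetical unnormalized distribution over two Boolean variables.
table = {(False, False): 1.0, (False, True): 2.0,
         (True, False): 3.0, (True, True): 4.0}
f = lambda x: table[x]

# Indicators consistent with X1 = True, X2 = False recover f at that point.
point = network_polynomial(f, 2, [(0, 1), (1, 0)])
# Setting both indicators of X2 to 1 sums X2 out: f(T, F) + f(T, T).
marginal = network_polynomial(f, 2, [(0, 1), (1, 1)])
# Setting every indicator to 1 gives the normalization constant.
z = network_polynomial(f, 2, [(1, 1), (1, 1)])
```

Each query is a single evaluation of the same polynomial under a different setting of the indicators, which is the property an SPN exploits with a compact circuit in place of the explicit sum.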
Definition 3 (Sum-Product Network (Poon & Domingos, 2011)).
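A Sum-Product Network (SPN) over Boolean variables $\mathbf{X}_{1:N}$ is a rooted DAG whose leaves are the indicators $\mathbb{I}_{x_1}, \ldots, \mathbb{I}_{x_N}$ and $\mathbb{I}_{\bar{x}_1}, \ldots, \mathbb{I}_{\bar{x}_N}$ and whose internal nodes are sums and products. Each edge $(v_i, v_j)$ emanating from a sum node $v_i$ has a non-negative weight $w_{ij}$. The value of a product node is the product of the values of its children. The value of a sum node $v_i$ is $\sum_{v_j \in Ch(v_i)} w_{ij}\, val(v_j)$ where $Ch(v_i)$ are the children of $v_i$ and $val(v_j)$ is the value of node $v_j$. The value of an SPN is the value of its root.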
The scope of a node in an SPN is defined as the set of variables that have indicators among the node's descendants: for any node $v$ in an SPN, if $v$ is a terminal node, say, an indicator variable over $X$, then $scope(v) = \{X\}$, else $scope(v) = \bigcup_{\tilde{v} \in Ch(v)} scope(\tilde{v})$. Poon & Domingos (2011) further define the following properties of an SPN:
Definition 4 (Complete).
An SPN is complete iff each sum node has children with the same scope.
Definition 5 (Consistent).
An SPN is consistent iff no variable appears negated in one child of a product node and non-negated in another.
Definition 6 (Decomposable).
An SPN is decomposable iff for every product node $v$, $scope(v_i) \cap scope(v_j) = \varnothing$ where $v_i, v_j \in Ch(v)$, $i \neq j$.
Clearly, decomposability implies consistency in SPNs. An SPN is said to be valid iff it defines an (unnormalized) probability distribution. Poon & Domingos (2011) proved that if an SPN is complete and consistent, then it is valid. Note that this is a sufficient, but not necessary condition. In this paper, we focus only on complete and consistent SPNs as we are interested in their associated probabilistic semantics. For a complete and consistent SPN $\mathcal{S}$, each node $v$ in $\mathcal{S}$ defines a network polynomial $f_v(\cdot)$ which corresponds to the sub-SPN rooted at $v$. The network polynomial defined by the root of the SPN can then be computed recursively by taking a weighted sum of the network polynomials defined by the sub-SPNs rooted at the children of each sum node and a product of the network polynomials defined by the sub-SPNs rooted at the children of each product node. The probability distribution induced by an SPN $\mathcal{S}$ is defined as $\Pr_{\mathcal{S}}(\mathbf{x}) \triangleq f_{\mathcal{S}}(\mathbf{x}) / \sum_{\mathbf{x}} f_{\mathcal{S}}(\mathbf{x})$, where $f_{\mathcal{S}}(\cdot)$ is the network polynomial defined by the root of the SPN $\mathcal{S}$. An example of a complete and consistent SPN is given in Fig. 2.
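The definitions of value, scope, completeness and decomposability translate almost directly into code. The sketch below is a hypothetical illustration: the class names, the example network and its weights are not taken from the paper, and the plain recursion does no memoization, so on a DAG with heavy sharing it does not achieve the linear-time evaluation that a single bottom-up pass would.

```python
class Indicator:
    def __init__(self, var, value):
        self.var, self.value, self.children = var, value, []

class Sum:
    def __init__(self, children, weights):
        self.children, self.weights = children, weights

class Product:
    def __init__(self, children):
        self.children = children

def value(node, evidence):
    """Value of a node; a variable missing from evidence is summed out."""
    if isinstance(node, Indicator):
        return 1.0 if evidence.get(node.var, node.value) == node.value else 0.0
    vals = [value(c, evidence) for c in node.children]
    if isinstance(node, Sum):
        return sum(w * v for w, v in zip(node.weights, vals))
    result = 1.0
    for v in vals:
        result *= v
    return result

def scope(node):
    if isinstance(node, Indicator):
        return frozenset([node.var])
    return frozenset().union(*(scope(c) for c in node.children))

def is_complete(node):
    """Every sum node has children with identical scopes."""
    ok = all(is_complete(c) for c in node.children)
    if isinstance(node, Sum):
        ok = ok and len({scope(c) for c in node.children}) == 1
    return ok

def is_decomposable(node):
    """Every product node has children with pairwise disjoint scopes."""
    ok = all(is_decomposable(c) for c in node.children)
    if isinstance(node, Product):
        scopes = [scope(c) for c in node.children]
        ok = ok and sum(len(s) for s in scopes) == len(scope(node))
    return ok

# Hypothetical complete and decomposable SPN over X1 and X2.
x1, nx1 = Indicator("X1", True), Indicator("X1", False)
x2, nx2 = Indicator("X2", True), Indicator("X2", False)
root = Sum(
    [Product([Sum([x1, nx1], [0.6, 0.4]), Sum([x2, nx2], [0.3, 0.7])]),
     Product([Sum([x1, nx1], [0.9, 0.1]), Sum([x2, nx2], [0.2, 0.8])])],
    [0.5, 0.5],
)
```

Calling `value(root, {"X1": True})` sets both indicators of $X_2$ to 1 and therefore returns the marginal of $X_1 = \text{TRUE}$, while `value(root, {})` returns the normalization constant.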
4 Main Results
In this section, we first state the main results obtained in this paper and then provide detailed proofs with some discussion of the results. To keep the presentation simple, we assume without loss of generality that all the random variables are Boolean unless explicitly stated. It is straightforward to extend our analysis to discrete random variables with finite support. For an SPN $\mathcal{S}$, let $|\mathcal{S}|$ be the size of the SPN, i.e., the number of nodes plus the number of edges in the graph. For a BN $\mathcal{B}$, the size of $\mathcal{B}$, $|\mathcal{B}|$, is defined by the size of the graph plus the size of all the CPDs in $\mathcal{B}$ (the size of a CPD depends on its representation, which will be clear from the context). The main theorems are:
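Theorem 1. There exists an algorithm that converts any complete and decomposable SPN $\mathcal{S}$ over Boolean variables $\mathbf{X}_{1:N}$ into a BN $\mathcal{B}$ with CPDs represented by ADDs in time $O(N|\mathcal{S}|)$. Furthermore, $\mathcal{S}$ and $\mathcal{B}$ represent the same distribution and $|\mathcal{B}| = O(N|\mathcal{S}|)$.
As will become clear later, Thm. 1 immediately leads to the following corollary:
Corollary 2. There exists an algorithm that converts any complete and consistent SPN $\mathcal{S}$ over Boolean variables $\mathbf{X}_{1:N}$ into a BN $\mathcal{B}$ with CPDs represented by ADDs in time $O(N|\mathcal{S}|^2)$. Furthermore, $\mathcal{S}$ and $\mathcal{B}$ represent the same distribution and $|\mathcal{B}| = O(N|\mathcal{S}|^2)$.
Remark 1. Assuming sum nodes alternate with product nodes in SPN $\mathcal{S}$, the depth of $\mathcal{S}$ is proportional to the maximum in-degree of the nodes in $\mathcal{B}$, which, as a result, is proportional to a lower bound of the tree-width of $\mathcal{B}$.
Theorem 3. Given the BN $\mathcal{B}$ with ADD representation of CPDs generated from a complete and decomposable SPN $\mathcal{S}$ over Boolean variables $\mathbf{X}_{1:N}$, the original SPN $\mathcal{S}$ can be recovered by applying the Variable Elimination algorithm to $\mathcal{B}$ in $O(N|\mathcal{S}|)$.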
To make the upcoming proofs concise, we first define a normal form for SPNs and show that every complete and consistent SPN can be transformed into a normal SPN in quadratic time and space without changing the network polynomial. We then derive the proofs with normal SPNs. Note that we only focus on SPNs that are complete and consistent. Hence, when we refer to an SPN, we assume that it is complete and consistent without explicitly stating this.
4.1 Normal Form
For an SPN $\mathcal{S}$, let $f_{\mathcal{S}}(\cdot)$ be the network polynomial defined at the root of $\mathcal{S}$. Define the height of an SPN to be the length of the longest path from the root to a terminal node.
Definition 7 (Normal SPN).
An SPN is said to be normal if
1. It is complete and decomposable.
2. For each sum node in the SPN, the weights of the edges emanating from the sum node are nonnegative and sum to 1.
3. Every terminal node in the SPN is a univariate distribution over a Boolean variable and the size of the scope of a sum node is at least 2 (sum nodes whose scope is of size 1 are reduced into terminal nodes).
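Theorem 4. For any complete and consistent SPN $\mathcal{S}$, there exists a normal SPN $\mathcal{S}'$ such that $\Pr_{\mathcal{S}}(\cdot) = \Pr_{\mathcal{S}'}(\cdot)$ and $|\mathcal{S}'| = O(|\mathcal{S}|^2)$.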
To show this, we first prove the following lemmas.
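Lemma 5. For any complete and consistent SPN $\mathcal{S}$ over $\mathbf{X}_{1:N}$, there exists a complete and decomposable SPN $\mathcal{S}'$ over $\mathbf{X}_{1:N}$ such that $f_{\mathcal{S}}(\mathbf{x}) = f_{\mathcal{S}'}(\mathbf{x})$ for all $\mathbf{x}$ and $|\mathcal{S}'| = O(|\mathcal{S}|^2)$.
Proof. Let $\mathcal{S}$ be a complete and consistent SPN. If it is also decomposable, then simply set $\mathcal{S}' = \mathcal{S}$ and we are done. Otherwise, let $v_1, \ldots, v_M$ be an inverse topological ordering of all the nodes in $\mathcal{S}$, including both terminal nodes and internal nodes, such that for any $v_m$, $m \in 1:M$, all the ancestors of $v_m$ in the graph appear after $v_m$ in the ordering. Let $v_m$ be the first product node in the ordering that violates decomposability. Let $v_{m_1}, v_{m_2}, \ldots, v_{m_l}$ be the children of $v_m$ where $m_1 < m_2 < \cdots < m_l < m$ (due to the inverse topological ordering). Let $(v_{m_i}, v_{m_j})$, $i < j$, be the first ordered pair of nodes such that $scope(v_{m_i}) \cap scope(v_{m_j}) \neq \varnothing$. Hence, let $X \in scope(v_{m_i}) \cap scope(v_{m_j})$. Consider $f_{v_{m_i}}$ and $f_{v_{m_j}}$, which are the network polynomials defined by the sub-SPNs rooted at $v_{m_i}$ and $v_{m_j}$.
Expand network polynomials $f_{v_{m_i}}$ and $f_{v_{m_j}}$ into a sum-of-product form by applying the distributive law between products and sums. For example, if $f = (\mathbb{I}_{x_1} + 9\mathbb{I}_{\bar{x}_1})(4\mathbb{I}_{x_2} + 6\mathbb{I}_{\bar{x}_2})$, then the expansion of $f$ is $4\mathbb{I}_{x_1}\mathbb{I}_{x_2} + 6\mathbb{I}_{x_1}\mathbb{I}_{\bar{x}_2} + 36\mathbb{I}_{\bar{x}_1}\mathbb{I}_{x_2} + 54\mathbb{I}_{\bar{x}_1}\mathbb{I}_{\bar{x}_2}$. Since $\mathcal{S}$ is complete, the sub-SPNs rooted at $v_{m_i}$ and $v_{m_j}$ are also complete, which means that each monomial in the expansion of $f_{v_{m_i}}$ must share the same scope. The same applies to $f_{v_{m_j}}$. Since $X \in scope(v_{m_i}) \cap scope(v_{m_j})$, every monomial in the expansion of $f_{v_{m_i}}$ and $f_{v_{m_j}}$ must contain an indicator variable over $X$, either $\mathbb{I}_x$ or $\mathbb{I}_{\bar{x}}$. Furthermore, since $\mathcal{S}$ is consistent, the sub-SPN rooted at $v_m$ is also consistent. Consider $f_{v_m} = \prod_{k=1}^{l} f_{v_{m_k}} = f_{v_{m_i}} f_{v_{m_j}} \prod_{k \neq i,j} f_{v_{m_k}}$. Because $v_m$ is consistent, we know that each monomial in the expansions of $f_{v_{m_i}}$ and $f_{v_{m_j}}$ must contain the same indicator variable of $X$, either $\mathbb{I}_x$ or $\mathbb{I}_{\bar{x}}$, otherwise there will be a term $\mathbb{I}_x \mathbb{I}_{\bar{x}}$ in $f_{v_m}$ which violates the consistency assumption. Without loss of generality, assume each monomial in the expansions of $f_{v_{m_i}}$ and $f_{v_{m_j}}$ contains $\mathbb{I}_x$. Then we can re-factorize $f_{v_m}$ in the following way:
$$f_{v_m} = \prod_{k=1}^{l} f_{v_{m_k}} = \mathbb{I}_x^2 \, \frac{f_{v_{m_i}}}{\mathbb{I}_x} \, \frac{f_{v_{m_j}}}{\mathbb{I}_x} \prod_{k \neq i,j} f_{v_{m_k}} = \mathbb{I}_x \, f_{v_{m_i}/\mathbb{I}_x} \, f_{v_{m_j}/\mathbb{I}_x} \prod_{k \neq i,j} f_{v_{m_k}} \tag{2}$$
where we use the fact that indicator variables are idempotent, i.e., $\mathbb{I}_x^2 = \mathbb{I}_x$, and $f_{v/\mathbb{I}_x}$ is defined as the function obtained by factorizing out $\mathbb{I}_x$ from $f_v$. Eq. 2 means that in order to make $v_m$ decomposable, we can simply remove all the indicator variables $\mathbb{I}_x$ from the sub-SPNs rooted at $v_{m_i}$ and $v_{m_j}$ and later link $\mathbb{I}_x$ to $v_m$ directly. Such a transformation will not change the network polynomial $f_{v_m}$ as shown by Eq. 2, but it will remove $X$ from $scope(v_{m_i}) \cap scope(v_{m_j})$. In principle, we can apply this transformation to all ordered pairs $(v_{m_i}, v_{m_j})$ with nonempty intersections of scope. However, this is not algorithmically efficient and, more importantly, for local components containing $\mathbb{I}_x$ in $f_{v_m}$ which are reused by other nodes $v_n$ outside of $\mathcal{S}_{v_m}$, we cannot remove $\mathbb{I}_x$ from them, otherwise the network polynomials for each such $v_n$ will be changed due to the removal. In such a case, we need to duplicate the local components to ensure that local transformations with respect to $f_{v_m}$ do not affect the network polynomials $f_{v_n}$. We present the transformation in Alg. 1.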
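Alg. 1 transforms a complete and consistent SPN $\mathcal{S}$ into a complete and decomposable SPN $\mathcal{S}'$. Informally, it works using the following identity:
$$f_{v_m} = \prod_{i=1}^{l} f_{v_{m_i}} = \Big( \prod_{X_k \in \Omega} \mathbb{I}_{x_k} \Big) \prod_{i=1}^{l} f_{v_{m_i}/\Omega} \tag{3}$$
where $\Omega \triangleq \bigcup_{i \neq j} scope(v_{m_i}) \cap scope(v_{m_j})$, i.e., $\Omega$ is the union of all the shared variables between pairs of children of $v_m$, $\mathbb{I}_{x_k}$ is the indicator variable of $X_k \in \Omega$ appearing in $\mathcal{S}_{v_m}$, and $f_{v/\Omega}$ is obtained from $f_v$ by factorizing out the indicators of the variables in $\Omega$. Based on the analysis above, we know that for each $X_k \in \Omega$ there will be only one kind of indicator variable $\mathbb{I}_{x_k}$ that appears inside $\mathcal{S}_{v_m}$, otherwise $v_m$ is not consistent. In Line 6, $\mathcal{S}_{\mathbf{V}}$ is defined as the sub-SPN of $\mathcal{S}_{v_m}$ induced by the node set $\mathbf{V}$, i.e., a subgraph of $\mathcal{S}_{v_m}$ where the node set is restricted to $\mathbf{V}$. In Lines 5-6, we first extract the induced sub-SPN $\mathcal{S}_{\mathbf{V}}$ from $\mathcal{S}$ rooted at $v_m$ using the node set in which nodes have nonempty intersections with $\Omega$. We disconnect the nodes in $\mathcal{S}_{\mathbf{V}}$ from their children if their children are indicator variables of a subset of $\Omega$ (Lines 15-17). At Line 18, we build a new product node by multiplying all the indicator variables in $\Omega$ and link it to $v_m$ directly. To keep unchanged the network polynomials of nodes outside $\mathcal{S}_{\mathbf{V}}$ that use nodes in $\mathcal{S}_{\mathbf{V}}$, we create a duplicate node $p$ for each such node $v$ and link $p$ to all the parents of $v$ outside of $\mathcal{S}_{\mathbf{V}}$ and at the same time delete the original link (Lines 9-13).
In summary, Lines 15-17 ensure that $v_m$ is decomposable by removing all the shared indicator variables in $\Omega$. Line 18 together with Eq. 3 guarantees that $f_{v_m}$ is unchanged after the transformation. Lines 9-13 create the necessary duplicates to ensure that other network polynomials $f_{v_n}$, $n \neq m$, are not affected. Lines 21-23 simplify the transformed SPN to make it more compact. An example is depicted in Fig. 3 to illustrate the transformation process.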
We now analyze the size of the SPN constructed by Alg. 1. For a graph $\mathcal{G}$, let $\mathcal{V}(\mathcal{G})$ be the number of nodes in $\mathcal{G}$ and let $\mathcal{E}(\mathcal{G})$ be the number of edges in $\mathcal{G}$. Note that in Lines 8-17 we only focus on nodes that appear in the induced SPN $\mathcal{S}_{\mathbf{V}}$, which clearly has $|\mathcal{S}_{\mathbf{V}}| \leq |\mathcal{S}|$. Furthermore, we create a new product node $p$ at Line 10 iff $v$ is reused by other nodes which do not appear in $\mathcal{S}_{\mathbf{V}}$. This means that the number of nodes created during each iteration between Lines 2 and 20 is bounded by $\mathcal{V}(\mathcal{S}_{\mathbf{V}})$. Line 10 also creates 2 new edges to connect $p$ to $v$ and the indicator variables. Lines 11 and 12 first connect edges to $p$ and then delete edges from $v$, hence these two steps do not yield increases in the number of edges. So the increase in the number of edges is bounded by $2\mathcal{V}(\mathcal{S}_{\mathbf{V}})$. Combining increases in both nodes and edges, during each outer iteration the increase in size is bounded by $3\mathcal{V}(\mathcal{S}_{\mathbf{V}}) \leq 3|\mathcal{S}|$. There will be at most $\mathcal{V}(\mathcal{S}) \leq |\mathcal{S}|$ outer iterations, hence the total increase in size will be bounded by $O(|\mathcal{S}|^2)$. ∎
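Lemma 6. For any complete and decomposable SPN $\mathcal{S}$ over $\mathbf{X}_{1:N}$ that satisfies condition 2 of Def. 7, $\sum_{\mathbf{x}} f_{\mathcal{S}}(\mathbf{x}) = 1$.
Proof. We give a proof by induction on the height of $\mathcal{S}$. Let $R$ be the root of $\mathcal{S}$.
Base case. SPNs of height 0 are indicator variables over some Boolean variable whose network polynomials immediately satisfy Lemma 6.
Induction step. Assume Lemma 6 holds for any SPN with height $\leq k$. Consider an SPN $\mathcal{S}$ with height $k + 1$, and let $R_1, \ldots, R_l$ be the children of $R$. We consider the following two cases. If $R$ is a product node, then by decomposability the scopes of its children are pairwise disjoint, so the sum over $\mathbf{x}$ factorizes: $\sum_{\mathbf{x}} f_{\mathcal{S}}(\mathbf{x}) = \sum_{\mathbf{x}} \prod_{i=1}^{l} f_{R_i}(\mathbf{x}|_{scope(R_i)}) = \prod_{i=1}^{l} \sum_{\mathbf{x}|_{scope(R_i)}} f_{R_i}(\mathbf{x}|_{scope(R_i)}) = 1$ by the induction hypothesis. If $R$ is a sum node with weights $w_1, \ldots, w_l$, then by completeness all children share the scope of $R$, so $\sum_{\mathbf{x}} f_{\mathcal{S}}(\mathbf{x}) = \sum_{\mathbf{x}} \sum_{i=1}^{l} w_i f_{R_i}(\mathbf{x}) = \sum_{i=1}^{l} w_i \sum_{\mathbf{x}} f_{R_i}(\mathbf{x}) = \sum_{i=1}^{l} w_i = 1$. ∎
Corollary 7. For any complete and decomposable SPN $\mathcal{S}$ over $\mathbf{X}_{1:N}$ that satisfies condition 2 of Def. 7, $\Pr_{\mathcal{S}}(\cdot) = f_{\mathcal{S}}(\cdot)$.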
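Lemma 8. For any complete and decomposable SPN $\mathcal{S}$, there exists an SPN $\mathcal{S}'$ where the weights of the edges emanating from every sum node are nonnegative and sum to 1, and $\Pr_{\mathcal{S}'}(\cdot) = \Pr_{\mathcal{S}}(\cdot)$, $|\mathcal{S}'| = |\mathcal{S}|$.
Proof. Alg. 2 runs in one pass of $\mathcal{S}$ to construct the required SPN $\mathcal{S}'$.
We proceed to prove that the SPN $\mathcal{S}'$ returned by Alg. 2 satisfies $\Pr_{\mathcal{S}'}(\cdot) = \Pr_{\mathcal{S}}(\cdot)$, $|\mathcal{S}'| = |\mathcal{S}|$ and that $\mathcal{S}'$ satisfies condition 2 of Def. 7. It is clear that $|\mathcal{S}'| = |\mathcal{S}|$ because we only modify the weights of $\mathcal{S}$ to construct $\mathcal{S}'$ at Line 7. Based on Lines 6 and 7, it is also straightforward to verify that for each sum node $v$ in $\mathcal{S}'$, the weights of the edges emanating from $v$ are nonnegative and sum to 1. We now show that $\Pr_{\mathcal{S}'}(\cdot) = \Pr_{\mathcal{S}}(\cdot)$. Using Corollary 7, $\Pr_{\mathcal{S}'}(\cdot) = f_{\mathcal{S}'}(\cdot)$. Hence it is sufficient to show that $f_{\mathcal{S}'}(\cdot) = \Pr_{\mathcal{S}}(\cdot)$. Before deriving a proof, it is helpful to note that for each node $v \in \mathcal{S}$, $val(v) = \sum_{\mathbf{x}|_{scope(v)}} f_v(\mathbf{x}|_{scope(v)})$. We give a proof by induction on the height of $\mathcal{S}$.
Base case. SPNs with height 0 are indicator variables which automatically satisfy Lemma 8.
Induction step. Assume Lemma 8 holds for any SPN of height $\leq k$. Consider an SPN $\mathcal{S}$ of height $k + 1$. Let $R$ be the root node of $\mathcal{S}$ with out-degree $l$. We discuss the following two cases.
[Eqs. 10-14: case analysis for a product root and for a sum root.]
This completes the proof since $\Pr_{\mathcal{S}}(\mathbf{x}) = f_{\mathcal{S}}(\mathbf{x}) / \sum_{\mathbf{x}} f_{\mathcal{S}}(\mathbf{x})$. ∎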
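The effect of weight normalization can be checked numerically. Alg. 2 itself is not reproduced in this section, so the sketch below is an assumption rather than the paper's algorithm: it uses the standard local renormalization $w'_i = w_i \, val(R_i) / \sum_j w_j \, val(R_j)$, where $val(\cdot)$ is the value of a node with all indicators set to 1. The `Node` class, the function names and the example weights are hypothetical, and every sum node is assumed to have a positive value.

```python
class Node:
    def __init__(self, kind, children=(), weights=(), var=None, value=None):
        self.kind = kind              # "sum", "product" or "leaf"
        self.children = list(children)
        self.weights = list(weights)  # used by sum nodes only
        self.var, self.value = var, value

def evaluate(node, evidence):
    """Value of a node; a variable missing from evidence is summed out."""
    if node.kind == "leaf":
        return 1.0 if evidence.get(node.var, node.value) == node.value else 0.0
    vals = [evaluate(c, evidence) for c in node.children]
    if node.kind == "sum":
        return sum(w * v for w, v in zip(node.weights, vals))
    out = 1.0
    for v in vals:
        out *= v
    return out

def normalize(node, cache=None):
    """Return (locally normalized copy of node, val(node)).

    Assumed update, not taken from Alg. 2: each sum weight is rescaled by
    the value of its child and divided by the value of the sum node.
    """
    cache = {} if cache is None else cache
    if id(node) in cache:             # keep shared sub-SPNs shared
        return cache[id(node)]
    if node.kind == "leaf":
        result = (node, 1.0)
    else:
        pairs = [normalize(c, cache) for c in node.children]
        kids = [p[0] for p in pairs]
        vals = [p[1] for p in pairs]
        if node.kind == "sum":
            z = sum(w * v for w, v in zip(node.weights, vals))
            new_w = [w * v / z for w, v in zip(node.weights, vals)]
            result = (Node("sum", kids, new_w), z)
        else:
            z = 1.0
            for v in vals:
                z *= v
            result = (Node("product", kids), z)
    cache[id(node)] = result
    return result

# Hypothetical unnormalized SPN over X and Y; node a is shared.
x, nx = Node("leaf", var="X", value=True), Node("leaf", var="X", value=False)
y, ny = Node("leaf", var="Y", value=True), Node("leaf", var="Y", value=False)
a = Node("sum", [x, nx], [2.0, 6.0])
b = Node("sum", [y, ny], [1.0, 3.0])
c = Node("sum", [y, ny], [5.0, 5.0])
root = Node("sum", [Node("product", [a, b]), Node("product", [a, c])],
            [1.0, 2.0])
normalized, z = normalize(root)
```

Under this assumed update, `evaluate(normalized, e)` equals `evaluate(root, e) / z` for every complete assignment `e`, which is the equality that the lemma asserts.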
Given a complete and decomposable SPN $\mathcal{S}$, we now construct $\mathcal{S}'$ and show that the last condition in Def. 7 can be satisfied in time and space $O(|\mathcal{S}|)$.
Lemma 9. Given a complete and decomposable SPN $\mathcal{S}$, there exists an SPN $\mathcal{S}'$ satisfying condition 3 in Def. 7 such that $\Pr_{\mathcal{S}'}(\cdot) = \Pr_{\mathcal{S}}(\cdot)$ and $|\mathcal{S}'| = O(|\mathcal{S}|)$.
Proof. We give a proof by construction. First, if $\mathcal{S}$ is not weight normalized, apply Alg. 2 to normalize the weights (i.e., the weights of the edges emanating from each sum node sum to 1).
Now check each sum node $v$ in $\mathcal{S}$ in a bottom-up order. If $|scope(v)| = 1$, by Corollary 7 we know the network polynomial $f_v$ is a probability distribution over its scope, say, $\{X\}$. Reduce $v$ into a terminal node which is a distribution over $X$ induced by its network polynomial and disconnect $v$ from all its children. The last step is to remove all the unreachable nodes from $\mathcal{S}$ to obtain $\mathcal{S}'$. Note that in this step we will only decrease the size of $\mathcal{S}$, hence $|\mathcal{S}'| = O(|\mathcal{S}|)$. ∎
4.2 SPN to BN
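Let $\mathcal{S}$ be a normal SPN over $\mathbf{X}_{1:N}$. Before showing how to construct a corresponding BN, we first give some intuitions. One useful view is to associate each sum node in an SPN with a hidden variable. For example, consider a sum node $v \in \mathcal{S}$ with out-degree $l$. Since $\mathcal{S}$ is normal, we have $\sum_{i=1}^{l} w_i = 1$ and $w_i \geq 0$, $\forall i \in 1:l$. This naturally suggests that we can associate a hidden discrete random variable $H_v$ with multinomial distribution $\Pr(H_v = i) = w_i$, $i \in 1:l$, for each sum node $v \in \mathcal{S}$. Therefore, $\mathcal{S}$ can be thought of as defining a joint probability distribution over $\mathbf{X}_{1:N}$ and $\mathbf{H} = \{H_v \mid v \in \mathcal{S}, v \text{ is a sum node}\}$, where $\mathbf{X}_{1:N}$ are the observable variables and $\mathbf{H}$ are the hidden variables. When doing inference with an SPN, we implicitly sum out all the hidden variables $\mathbf{H}$ and compute $\Pr_{\mathcal{S}}(\mathbf{x}) = \sum_{\mathbf{h}} \Pr_{\mathcal{S}}(\mathbf{x}, \mathbf{h})$. Associating each sum node in an SPN with a hidden variable not only gives us a conceptual understanding of the probability distribution defined by an SPN, but also helps to elucidate one of the key properties implied by the structure of an SPN as summarized below:
Proposition 1. Given a normal SPN $\mathcal{S}$, let $p$ be a product node in $\mathcal{S}$ with $l$ children. Let $v_1, \ldots, v_k$ be sum nodes which lie on a path from the root of $\mathcal{S}$ to $p$. Then
$$\Pr_{\mathcal{S}}\big(\mathbf{x}|_{scope(p)} \mid H_{v_1} = v_1^*, \ldots, H_{v_k} = v_k^*\big) = \prod_{i=1}^{l} \Pr_{\mathcal{S}}\big(\mathbf{x}|_{scope(p_i)} \mid H_{v_1} = v_1^*, \ldots, H_{v_k} = v_k^*\big) \tag{21}$$
where $H_v = v^*$ means the sum node $v$ selects its $v^*$th branch, $\mathbf{x}|_A$ denotes restricting $\mathbf{x}$ by set $A$, and $p_i$ is the $i$th child of product node $p$.
Proof. Consider the sub-SPN $\mathcal{S}_p$ rooted at $p$. $\mathcal{S}_p$ can be obtained by restricting $H_{v_1} = v_1^*, \ldots, H_{v_k} = v_k^*$, i.e., going from the root of $\mathcal{S}$ along the path $H_{v_1} = v_1^*, \ldots, H_{v_k} = v_k^*$. Since $p$ is a decomposable product node, $\mathcal{S}_p$ admits the above factorization by the definition of a product node and Corollary 7. ∎
Note that there may exist multiple paths from the root to $p$ in $\mathcal{S}$. Each such path admits the factorization stated in Eq. 21. Eq. 21 explains two key insights implied by the structure of an SPN that will allow us to construct an equivalent BN with ADDs. First, CSI is efficiently encoded by the structure of an SPN using Proposition 1. Second, the DAG structure of an SPN allows multiple assignments of hidden variables to share the same factorization, which effectively avoids the replication problem present in decision trees.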
Based on the observations above and with the help of the normal form for SPNs, we now proceed to prove the first main result in this paper: Thm. 1. First, we present the algorithm to construct the structure of a BN from in Alg. 3.
In a nutshell, Alg. 3 creates an observable variable $X$ in $\mathcal{B}$ for each terminal node over $X$ in $\mathcal{S}$ (Lines 2-4). For each internal sum node $v$ in $\mathcal{S}$, Alg. 3 creates a hidden variable $H_v$ associated with $v$ and builds directed edges from $H_v$ to all observable variables $X$ appearing in the sub-SPN rooted at $v$ (Lines 11-17). The BN $\mathcal{B}$ created by Alg. 3 has a directed bipartite structure with a layer of hidden variables pointing to a layer of observable variables. A hidden variable $H$ points to an observable variable $X$ in $\mathcal{B}$ iff $X$ appears in the sub-SPN rooted at $H$ in $\mathcal{S}$.
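The structure construction described above can be sketched as a single traversal of the SPN. This is an illustration based only on the prose description, not a transcription of Alg. 3; the `Node` class, the hidden-variable labels and the example network are hypothetical.

```python
class Node:
    def __init__(self, kind, children=(), var=None, name=None):
        self.kind = kind            # "sum", "product" or "leaf"
        self.children = list(children)
        self.var = var              # variable of a leaf distribution
        self.name = name            # label of the hidden variable of a sum

def scope(node, memo):
    """Scope of a node, memoized so shared sub-SPNs are visited once."""
    if id(node) not in memo:
        if node.kind == "leaf":
            memo[id(node)] = frozenset([node.var])
        else:
            memo[id(node)] = frozenset().union(
                *(scope(c, memo) for c in node.children))
    return memo[id(node)]

def bn_structure(root):
    """Return (observable variables, hidden variables, directed edges)."""
    memo, seen, stack = {}, set(), [root]
    observable, hidden, edges = set(), set(), set()
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        if node.kind == "leaf":
            # One observable variable per variable with a terminal node.
            observable.add(node.var)
        elif node.kind == "sum":
            # One hidden variable per sum node, pointing to every
            # observable variable in the sub-SPN rooted at that node.
            hidden.add(node.name)
            edges.update((node.name, x) for x in scope(node, memo))
        stack.extend(node.children)
    return observable, hidden, edges

# Hypothetical normal SPN over X1, X2, X3 with two sum nodes.
l1, l2, l3 = (Node("leaf", var=v) for v in ("X1", "X2", "X3"))
m1, m2 = Node("leaf", var="X1"), Node("leaf", var="X2")
h2 = Node("sum", [Node("product", [l1, l2]), Node("product", [m1, m2])],
          name="H2")
root = Node("sum", [Node("product", [h2, l3]),
                    Node("product", [m1, m2, l3])], name="H1")
observable, hidden, edges = bn_structure(root)
```

Every edge in the result goes from a hidden variable to an observable one, which is the bipartite structure noted above; a sum node closer to the root has a larger scope and therefore more children in the BN.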
We now present Alg. 4 and 5 to build ADDs for each observable variable and hidden variable in $\mathcal{B}$. For each hidden variable $H$, Alg. 5 builds $\mathcal{A}_H$ as a decision stump (a decision tree with one variable) obtained by finding $H$ and its associated weights in $\mathcal{S}$. Consider the ADDs built by Alg. 4 for the observable variables. Let $X$ be the current observable variable we are considering. Basically, Alg. 4 is a recursive algorithm applied to each node in $\mathcal{S}$ whose scope intersects with $\{X\}$. There are three cases. If the current node is a terminal node, then it must be a probability distribution over $X$. In this case we simply return the decision stump at the current node. If the current node $v$ is a sum node, then due to the completeness of $\mathcal{S}$, we know that all the children of $v$ share the same scope with $v$. We first create a node $H_v$ corresponding to the hidden variable associated with $v$ in $\mathcal{A}_X$ (Line 8) and recursively apply Alg. 4 to all the children of $v$ and link them to $H_v$ respectively. If the current node $v$ is a product node, then due to the decomposability of $\mathcal{S}$, we know that there will be a unique child of $v$ whose scope intersects with $\{X\}$. We recursively apply Alg. 4 to this child and return the resulting ADD (Lines 12-15).
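The three cases can be sketched as a short recursion. As with the previous sketch, this follows the prose description rather than the listing of Alg. 4; the `Node` class, the tuple encoding of ADD nodes (`"stump"` and `"test"`) and the example are hypothetical. The call is assumed to start at a node whose scope contains the variable of interest.

```python
class Node:
    def __init__(self, kind, children=(), var=None, dist=None, name=None):
        self.kind, self.children = kind, list(children)
        self.var, self.dist, self.name = var, dist, name

def scope(node):
    if node.kind == "leaf":
        return {node.var}
    return set().union(*(scope(c) for c in node.children))

def build_add(node, x, memo=None):
    """ADD for the CPD of x given its hidden parents."""
    memo = {} if memo is None else memo
    if id(node) not in memo:
        if node.kind == "leaf":
            # Case 1: terminal node, a decision stump on x itself.
            memo[id(node)] = ("stump", node.dist)
        elif node.kind == "sum":
            # Case 2: test the hidden variable; branch j follows child j.
            memo[id(node)] = ("test", node.name,
                              [build_add(c, x, memo) for c in node.children])
        else:
            # Case 3: by decomposability exactly one child mentions x,
            # so the product node itself is contracted away.
            (child,) = [c for c in node.children if x in scope(c)]
            memo[id(node)] = build_add(child, x, memo)
    return memo[id(node)]

def cpd(add, x_value, hidden):
    """Read Pr(x = x_value | hidden) by following one path in the ADD."""
    while add[0] == "test":
        add = add[2][hidden[add[1]]]
    return add[1][x_value]

# Hypothetical normal SPN: one sum node H over two product nodes.
a1 = Node("leaf", var="X1", dist={True: 0.6, False: 0.4})
a2 = Node("leaf", var="X1", dist={True: 0.9, False: 0.1})
b1 = Node("leaf", var="X2", dist={True: 0.3, False: 0.7})
b2 = Node("leaf", var="X2", dist={True: 0.2, False: 0.8})
root = Node("sum", [Node("product", [a1, b1]), Node("product", [a2, b2])],
            name="H")
add_x1 = build_add(root, "X1")
```

The memo table keeps shared sub-SPNs shared in the resulting diagram, which is how the construction avoids the replication that a decision tree would incur.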
Equivalently, Alg. 4 can be understood in the following way: we extract the sub-SPN induced by $\{X\}$ and contract (in graph theory, the contraction of a node $v$ in a DAG is the operation that connects each parent of $v$ to each child of $v$ and then deletes $v$ from the graph) all the product nodes in it to obtain