# On the Latent Variable Interpretation in Sum-Product Networks

One of the central themes in Sum-Product networks (SPNs) is the interpretation of sum nodes as marginalized latent variables (LVs). This interpretation yields an increased syntactic or semantic structure, allows the application of the EM algorithm, and enables efficient MPE inference. In the literature, the LV interpretation was justified by explicitly introducing the indicator variables corresponding to the LVs' states. However, as pointed out in this paper, this approach is in conflict with the completeness condition in SPNs and does not fully specify the probabilistic model. We propose a remedy for this problem by modifying the original approach for introducing the LVs, which we call SPN augmentation. We discuss conditional independencies in augmented SPNs, formally establish the probabilistic interpretation of the sum-weights and give an interpretation of augmented SPNs as Bayesian networks. Based on these results, we find a sound derivation of the EM algorithm for SPNs. Furthermore, the Viterbi-style algorithm for MPE proposed in the literature was never proven to be correct. We show that this is indeed a correct algorithm when applied to selective SPNs, and in particular when applied to augmented SPNs. Our theoretical results are confirmed in experiments on synthetic data and 103 real-world datasets.

## 1 Introduction

Sum-Product Networks (SPNs) are a promising type of probabilistic model, combining the domains of deep learning and graphical models [1, 2]. One of their main advantages is that many interesting inference scenarios are expressed as single forward and/or backward passes, i.e. these inference scenarios have a computational cost linear in the SPN's representation size. SPNs have shown convincing performance in applications such as image completion [1, 3, 4, 5], classification [6] and speech and language modeling [7, 8, 9]. Since their proposition [1], one of the central themes in SPNs has been their interpretation as hierarchically structured latent variable (LV) models. This is essentially the same approach as the LV interpretation in mixture models. Consider for example a Gaussian mixture model (GMM) with $K$ components over a set of random variables (RVs) $\mathbf{X}$:

$$p(\mathbf{X}) = \sum_{k=1}^{K} w_k \, \mathcal{N}(\mathbf{X} \mid \mu_k, \Sigma_k), \tag{1}$$

where $\mathcal{N}$ is the Gaussian PDF, $\mu_k$ and $\Sigma_k$ are the mean and covariance of the $k$-th component, and $w_k$ are the mixture weights with $w_k \geq 0$, $\sum_{k=1}^{K} w_k = 1$. The GMM can be interpreted in two ways: i) It is a convex combination of PDFs and thus itself a PDF, or ii) it is a marginal distribution of a distribution $p(\mathbf{X}, Z)$ over $\mathbf{X}$ and a latent, marginalized variable $Z$, where $p(Z = k) = w_k$ and $p(\mathbf{X} \mid Z = k) = \mathcal{N}(\mathbf{X} \mid \mu_k, \Sigma_k)$. The second interpretation, the LV interpretation, yields a syntactically well-structured model. For example, following the LV interpretation, it is clear how to draw samples from $p(\mathbf{X})$ by using ancestral sampling. This structure can also be of semantic nature, for instance when $Z$ represents a clustering of $\mathbf{X}$ or when $Z$ is a class variable. Furthermore, the LV interpretation allows the application of the EM algorithm (which is essentially maximum-likelihood learning under missing data [10, 11]) and enables advanced Bayesian techniques [12, 13].
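The ancestral sampling implied by the LV interpretation can be sketched in a few lines. This is only a minimal illustration, assuming a univariate GMM; the function name `sample_gmm` and all parameter values are hypothetical and not taken from the paper.

```python
import random

def sample_gmm(weights, means, stds, rng):
    """Ancestral sampling from a univariate GMM (illustrative sketch).

    First draw the latent component Z with P(Z = k) = weights[k],
    then draw X from the Gaussian selected by Z.
    """
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    x = rng.gauss(means[k], stds[k])
    return k, x

# Hypothetical parameters: two components with weights 0.3 and 0.7.
rng = random.Random(0)
weights, means, stds = [0.3, 0.7], [-2.0, 3.0], [0.5, 1.0]
samples = [sample_gmm(weights, means, stds, rng) for _ in range(10000)]

# Discarding Z leaves samples of X from the marginal p(X) in Eq. (1).
xs = [x for _, x in samples]
frac_second = sum(1 for k, _ in samples if k == 1) / len(samples)
```

Keeping the sampled component index alongside each value is what makes the semantic reading possible, e.g. treating the index as a cluster label.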

Mixture models can be seen as a special case of SPNs with a single sum node, which corresponds to a single LV. More generally, SPNs can have arbitrarily many sum nodes, each corresponding to its own LV, leading to a hierarchically structured model. In [1], the LV interpretation in SPNs was justified by explicitly introducing the LVs in the SPN model, using the so-called indicator variables (IVs) corresponding to the LVs' states. However, as shown in this paper, this justification is actually too simplistic, since it is potentially in conflict with the completeness condition [1], leading to an incompletely specified model. As a remedy we propose the augmentation of an SPN, which in addition to the IVs also introduces the so-called twin sum nodes, in order to completely specify the LV model. We further investigate the independency structure of the LV model resulting from augmentation and find a parallel to the local independence assertions in Bayesian networks (BNs) [14, 15]. This allows us to define a BN representation of the augmented SPN. Using our BN interpretation and the differential approach [16, 17] in augmented SPNs, we give a sound derivation of the (soft) EM algorithm for SPNs.

Closely related to the LV interpretation is the inference scenario of finding the most-probable-explanation (MPE), i.e. finding a probability maximizing assignment for all RVs. Using results from [18, 19], we first point out that this problem is generally NP-hard for SPNs. In [1] it was proposed that an MPE solution can be found efficiently when maximizing over both model RVs (i.e. non-latent RVs) and LVs. The proposed algorithm replaces sum nodes by max nodes and recovers the solution by using Viterbi-style backtracking. However, it was not shown that this algorithm delivers a correct MPE solution. In this paper, we show that this algorithm is indeed correct when applied to selective SPNs [20]. In particular, since augmented SPNs are selective, this algorithm obtains an MPE solution in augmented SPNs. However, when applied to non-augmented SPNs, the algorithm still returns an MPE solution of the augmented SPN, but implicitly assumes that the weights for all twin sums are deterministic, i.e. they are all 0 except a single 1. This leads to a phenomenon in MPE inference which we call low-depth bias, i.e. more shallow parts of the SPN are preferred during backtracking.

The main contribution in this paper is to provide a sound theoretical foundation for the LV interpretation in SPNs and related concepts, i.e. the EM algorithm and MPE inference. Our theoretical findings are confirmed in experiments on synthetic data and 103 real-world datasets.

The paper is organized as follows: In the remainder of this section we introduce notation, review SPNs and discuss related work. In Section 2 we propose the augmentation of SPNs, show its soundness as hierarchical LV model and give an interpretation as BN. Furthermore, we discuss independency properties in augmented SPNs and the interpretation of sum-weights as conditional probabilities. The EM algorithm for SPNs is derived in Section 3. In Section 4 we discuss MPE inference for SPNs. Experiments are presented in Section 5 and Section 6 concludes the paper. Proofs for our theoretical findings are deferred to the Appendix.

### 1.1 Background and Notation

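RVs are denoted by upper-case letters $W$, $X$, $Y$ and $Z$. The set of values of an RV $X$ is denoted by $\mathrm{val}(X)$, where corresponding lower-case letters denote elements of $\mathrm{val}(X)$, e.g. $x$ is an element of $\mathrm{val}(X)$. Sets of RVs are denoted by boldface letters $\mathbf{W}$, $\mathbf{X}$, $\mathbf{Y}$ and $\mathbf{Z}$. For an RV set $\mathbf{X} = \{X_1, \dots, X_N\}$, we define $\mathrm{val}(\mathbf{X}) = \mathrm{val}(X_1) \times \dots \times \mathrm{val}(X_N)$ and use corresponding lower-case boldface letters for elements of $\mathrm{val}(\mathbf{X})$, e.g. $\mathbf{x}$ is an element of $\mathrm{val}(\mathbf{X})$. For a sub-set $\mathbf{Y} \subseteq \mathbf{X}$, $\mathbf{x}[\mathbf{Y}]$ denotes the projection of $\mathbf{x}$ onto $\mathbf{Y}$.

The elements of $\mathrm{val}(\mathbf{X})$ can be interpreted as complete evidence, assigning each RV in $\mathbf{X}$ a fixed value. Partial evidence about $X$ is represented as a subset $\mathcal{X} \subseteq \mathrm{val}(X)$, which is an element of the sigma-algebra $\mathcal{A}_X$ induced by RV $X$. For all RVs we use $\mathcal{A}_X = \{\mathcal{X} \in \mathcal{B} \mid \mathcal{X} \subseteq \mathrm{val}(X)\}$, $\mathcal{B}$ being the Borel-sets over $\mathbb{R}$. For discrete RVs, this choice yields the power-set $\mathcal{A}_X = 2^{\mathrm{val}(X)}$. For example, partial evidence $\mathcal{X} = \{x_1, x_2, x_3\}$ for a discrete RV $X$ with $\{x_1, x_2, x_3\} \subseteq \mathrm{val}(X)$ represents evidence that $X$ takes one of the states $x_1$, $x_2$ or $x_3$, and $\mathcal{Y} = (-\infty, a)$ for a real-valued RV $Y$ represents evidence that $Y$ takes a value smaller than $a$. Formally speaking, partial evidence is used to express the domain of marginalization or maximization for a particular RV.

For sets of RVs $\mathbf{X} = \{X_1, \dots, X_N\}$, we use the product sets $\mathcal{H}_{\mathbf{X}} := \{\mathcal{X}_1 \times \dots \times \mathcal{X}_N \mid \mathcal{X}_n \in \mathcal{A}_{X_n}\}$ to represent partial evidence about $\mathbf{X}$. Elements of $\mathcal{H}_{\mathbf{X}}$ are denoted using boldface notation, e.g. $\boldsymbol{\mathcal{X}}$. When $\mathbf{Y} \subseteq \mathbf{X}$ and $\boldsymbol{\mathcal{X}} \in \mathcal{H}_{\mathbf{X}}$, we define $\boldsymbol{\mathcal{X}}[\mathbf{Y}] := \{\mathbf{x}[\mathbf{Y}] \mid \mathbf{x} \in \boldsymbol{\mathcal{X}}\}$. Furthermore, we use $\mathbf{e}$ to symbolize any combination of complete and partial evidence, i.e. for RVs $\mathbf{X}$ we have some complete evidence for a sub-set $\mathbf{X}' \subseteq \mathbf{X}$ and some partial evidence for the remaining RVs $\mathbf{X} \setminus \mathbf{X}'$.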

Given a node $N$ in some directed graph $\mathcal{G}$, let $\mathrm{ch}(N)$ and $\mathrm{pa}(N)$ be the set of children and parents of $N$, respectively. Furthermore, let $\mathrm{desc}(N)$ be the set of descendants of $N$, recursively defined as the set containing $N$ itself and any child of a descendant. Similarly, we define $\mathrm{anc}(N)$ as the ancestors of $N$, recursively defined as the set containing $N$ itself and any parent of an ancestor. SPNs are defined as follows.

###### Definition 1 (Sum-Product Network).

A Sum-Product network (SPN) $\mathcal{S}$ over a set of RVs $\mathbf{X}$ is a tuple $(\mathcal{G}, \mathbf{w})$ where $\mathcal{G}$ is a connected, rooted and acyclic directed graph, and $\mathbf{w}$ is a set of non-negative parameters. The graph $\mathcal{G}$ contains three types of nodes: distributions, sums and products. All leaves of $\mathcal{G}$ are distributions and all internal nodes are either sums or products. A distribution node (also called input distribution or simply distribution) $D_{\mathbf{Y}}$ is a distribution function over a subset of RVs $\mathbf{Y} \subseteq \mathbf{X}$, i.e. either a PMF (discrete RVs), a PDF (continuous RVs), or a mixed distribution function (discrete and continuous RVs mixed). A sum node $S$ computes a weighted sum of its children, i.e. $S = \sum_{C \in \mathrm{ch}(S)} w_{S,C} \, C$, where $w_{S,C}$ is a non-negative weight associated with edge $(S, C)$, and $\mathbf{w}$ contains the weights for all outgoing sum-edges. A product node $P$ computes the product over its children, i.e. $P = \prod_{C \in \mathrm{ch}(P)} C$. The sets $\mathbf{S}(\mathcal{S})$ and $\mathbf{P}(\mathcal{S})$ contain all sum nodes and all product nodes in $\mathcal{S}$, respectively.

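The size $|\mathcal{S}|$ of the SPN is defined as the number of nodes and edges in $\mathcal{G}$. For any node $N$ in $\mathcal{G}$, the scope of $N$ is defined as

$$\mathrm{sc}(N) = \begin{cases} \mathbf{Y} & \text{if } N \text{ is a distribution } D_{\mathbf{Y}} \\ \bigcup_{C \in \mathrm{ch}(N)} \mathrm{sc}(C) & \text{otherwise.} \end{cases} \tag{2}$$

The function computed by $\mathcal{S}$ is the function computed by its root and denoted as $\mathcal{S}(\mathbf{x})$, where without loss of generality we assume that the scope of the root is $\mathbf{X}$.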

We use symbols $D$, $S$, $P$, $N$, $C$ and $F$ for nodes in SPNs, where $D$ denotes a distribution, $S$ denotes a sum, and $P$ denotes a product. Symbols $N$, $C$ and $F$ denote generic nodes, where $C$ and $F$ indicate a child or parent relationship to another node, respectively. The distribution of an SPN is defined as the normalized output of $\mathcal{S}$, i.e. $P_{\mathcal{S}}(\mathbf{x}) \propto \mathcal{S}(\mathbf{x})$. For each node $N$, we define the sub-SPN $\mathcal{S}_N$ rooted at $N$ as the SPN defined by the graph induced by the descendants of $N$ and the corresponding parameters.

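Inference in unconstrained SPNs is generally intractable. However, efficient inference in SPNs is enabled by two structural constraints, completeness and decomposability [1]. An SPN is complete if for all sums $S$ it holds that

$$\forall C', C'' \in \mathrm{ch}(S): \ \mathrm{sc}(C') = \mathrm{sc}(C''). \tag{3}$$

An SPN is decomposable if for all products $P$ it holds that

$$\forall C', C'' \in \mathrm{ch}(P),\ C' \neq C'': \ \mathrm{sc}(C') \cap \mathrm{sc}(C'') = \emptyset. \tag{4}$$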

Furthermore, a sum node $S$ is called selective [20] if for all choices of sum-weights and all possible inputs it holds that at most one child of $S$ is non-zero. An SPN is called selective if all its sum nodes are selective.

As shown in [17, 19], integrating the SPN function $\mathcal{S}(\mathbf{x})$ over arbitrary sets of partial evidence, i.e. marginalization over $\mathbf{x}$, reduces to the corresponding integrals at the input distributions and evaluating sums and products in the usual way. This property is known as validity of the SPN [1], and is key for efficient inference. In this paper we only consider complete and decomposable SPNs. Without loss of generality [17, 21], we assume locally normalized sum-weights, i.e. for each sum node $S$ the weights of its outgoing edges satisfy $\sum_{C \in \mathrm{ch}(S)} w_{S,C} = 1$, and thus the SPN function is already a normalized distribution, i.e. the SPN's normalization constant is $1$.
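The structural conditions (3) and (4), and marginalization by a single bottom-up pass, can be illustrated on a toy SPN. The following is a rough sketch under the assumption of categorical leaves over single RVs; the classes `Leaf`, `Sum`, `Product` and the example weights are hypothetical names and values chosen for this illustration.

```python
class Leaf:
    """Categorical input distribution over a single finite-state RV."""
    def __init__(self, var, probs):
        self.var, self.probs, self.children = var, probs, []

class Sum:
    def __init__(self, children, weights):
        self.children, self.weights = children, weights

class Product:
    def __init__(self, children):
        self.children = children

def scope(node):
    """Eq. (2): a leaf's scope is its RV; otherwise the union over children."""
    if isinstance(node, Leaf):
        return frozenset([node.var])
    return frozenset().union(*(scope(c) for c in node.children))

def is_complete(node):
    """Eq. (3): all children of every sum node have identical scopes."""
    if isinstance(node, Sum) and len({scope(c) for c in node.children}) != 1:
        return False
    return all(is_complete(c) for c in node.children)

def is_decomposable(node):
    """Eq. (4): children of every product have pairwise disjoint scopes."""
    if isinstance(node, Product):
        seen = set()
        for c in node.children:
            if seen & scope(c):
                return False
            seen |= scope(c)
    return all(is_decomposable(c) for c in node.children)

def evaluate(node, evidence):
    """Evaluate the SPN; `evidence` maps an RV to its set of allowed states.

    RVs missing from `evidence` are marginalized at the leaves.
    """
    if isinstance(node, Leaf):
        allowed = evidence.get(node.var, range(len(node.probs)))
        return sum(node.probs[s] for s in allowed)
    values = [evaluate(c, evidence) for c in node.children]
    if isinstance(node, Sum):
        return sum(w * v for w, v in zip(node.weights, values))
    result = 1.0
    for v in values:
        result *= v
    return result

# Hypothetical example: a mixture of two product distributions over X1, X2.
root = Sum(
    [Product([Leaf("X1", [0.9, 0.1]), Leaf("X2", [0.2, 0.8])]),
     Product([Leaf("X1", [0.3, 0.7]), Leaf("X2", [0.6, 0.4])])],
    [0.4, 0.6],
)
total_mass = evaluate(root, {})              # all RVs marginalized
marginal_x1 = evaluate(root, {"X1": {0}})    # P(X1 = 0), X2 marginalized
```

With locally normalized weights and normalized leaves, evaluating with empty evidence returns the normalization constant, which is 1 here.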

For RVs with finitely many states, we will use so-called indicator variables (IVs) as input distributions [1]. For a finite-state RV $X$ and state $x$, we introduce the IV $\lambda_{X=x}$, assigning all probability mass to $x$. A complete and decomposable SPN represents the (extended) network polynomial of its distribution, which can be used in the differential approach to inference [16, 1, 17]. Assume any evidence $\mathbf{e}$ which is evaluated in the SPN. The derivatives of the SPN function with respect to the IVs (by interpreting the IVs as real-valued variables, see [16, 17] for details) yield

$$\frac{\partial \mathcal{S}(\mathbf{e})}{\partial \lambda_{X=x}} = \mathcal{S}(X = x, \mathbf{e} \setminus X), \tag{5}$$

representing the inference scenario of modified evidence, i.e. evidence $\mathbf{e}$ is modified such that $X$ is set to $x$. The computationally attractive feature of the differential approach is that (5) can be evaluated for all $X$ and all $x$ simultaneously using a single back-propagation pass in the SPN, after evidence $\mathbf{e}$ has been evaluated. Similarly, for the second (and higher) derivatives, we get

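$$\frac{\partial^2 \mathcal{S}(\mathbf{e})}{\partial \lambda_{X=x} \, \partial \lambda_{Y=y}} = \begin{cases} \mathcal{S}(X = x, Y = y, \mathbf{e} \setminus \{X, Y\}) & \text{if } X \neq Y \\ 0 & \text{otherwise.} \end{cases} \tag{6}$$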

Furthermore, the differential approach can be generalized to SPNs with arbitrary input distributions, i.e. SPNs over RVs with countably infinite or uncountably many states (cf. [17] for details).
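To illustrate Eq. (5), the following sketch runs one forward and one backward pass on a small hand-built SPN over two binary RVs and reads off the derivatives with respect to the IVs. The node encoding, the function names `forward` and `backward`, and all weights are assumptions made for this illustration; they are not from the paper.

```python
from math import prod

# Illustrative encoding (an assumption of this sketch): nodes are listed
# children-first, the last node is the root, and each node is one of
#   ("iv", var, state) | ("sum", child_ids, weights) | ("prod", child_ids)
NODES = [
    ("iv", "X1", 0), ("iv", "X1", 1),          # 0, 1
    ("iv", "X2", 0), ("iv", "X2", 1),          # 2, 3
    ("sum", [0, 1], [0.9, 0.1]),               # 4: distribution over X1
    ("sum", [2, 3], [0.2, 0.8]),               # 5: distribution over X2
    ("sum", [0, 1], [0.3, 0.7]),               # 6: distribution over X1
    ("sum", [2, 3], [0.6, 0.4]),               # 7: distribution over X2
    ("prod", [4, 5]), ("prod", [6, 7]),        # 8, 9
    ("sum", [8, 9], [0.4, 0.6]),               # 10: root
]

def forward(nodes, evidence):
    """Bottom-up pass. `evidence` maps an RV to its set of allowed states;
    IVs of RVs without evidence are all set to 1 (marginalization)."""
    val = []
    for n in nodes:
        if n[0] == "iv":
            _, var, state = n
            on = var not in evidence or state in evidence[var]
            val.append(1.0 if on else 0.0)
        elif n[0] == "sum":
            val.append(sum(w * val[c] for c, w in zip(n[1], n[2])))
        else:
            val.append(prod(val[c] for c in n[1]))
    return val

def backward(nodes, val):
    """Top-down pass computing the derivative of the root w.r.t. each node."""
    grad = [0.0] * len(nodes)
    grad[-1] = 1.0
    for i in reversed(range(len(nodes))):
        n = nodes[i]
        if n[0] == "sum":
            for c, w in zip(n[1], n[2]):
                grad[c] += w * grad[i]
        elif n[0] == "prod":
            for c in n[1]:
                # Children of a product are distinct under decomposability.
                grad[c] += grad[i] * prod(val[o] for o in n[1] if o != c)
    return grad

evidence = {"X2": {1}}                 # X1 is marginalized
val = forward(NODES, evidence)
grad = backward(NODES, val)

# grad[0] is the derivative w.r.t. the IV of X1 = 0. By Eq. (5) it should
# equal the SPN evaluated at the modified evidence X1 = 0, X2 = 1.
modified = forward(NODES, {"X1": {0}, "X2": {1}})[-1]
```

Both passes visit every edge once, which matches the linear-cost claim made for the differential approach.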

### 1.2 Related Work

SPNs are related to negation normal forms (NNFs), a potential deep network representation of propositional theories [22, 23, 24]. Like in SPNs, structural constraints in NNFs enable certain polynomial-time queries in the represented theory. In particular, the notions of smoothness, decomposability and determinism in NNFs translate to the notions of completeness, decomposability and selectivity in SPNs, respectively. The work on NNFs led to the concept of network polynomials as a multilinear representation of BNs over finitely many states [16], [25]. BNs were cast into an intermediate d-DNNF (deterministic decomposable NNF) representation in order to generate an arithmetic circuit (AC), representing the BN's network polynomial. ACs, when restricted to sums and products, are equivalent to SPNs but have a slightly different syntax. In [26], ACs were learned by optimizing an objective trading off the log-likelihood on the training set and the inference cost of the AC, measured as the worst-case number of arithmetic operations required for inference (i.e. the number of edges in the AC). The learned models still represent BNs with context-specific independencies [27]. A similar approach learning Markov networks represented by ACs is followed in [28]. SPNs were first proposed in [1], where the represented distribution was no longer defined via a background graphical model, but directly as the normalized output of the network. In this work, SPNs were applied to image data, where a generic architecture reminiscent of convolutional neural networks was proposed. Structure learning algorithms not restricted to the image domain were proposed in [3, 2, 4, 29, 30, 31]. Discriminative learning of SPNs, optimizing conditional likelihood, was proposed in [6].

Furthermore, there is a growing body of literature on theoretical aspects of SPNs and their relationship to other types of probabilistic models. In [32] two families of functions were identified which are efficiently representable by deep, but not by shallow SPNs, where an SPN is considered shallow if it has no more than three layers. In [17] it was shown that SPNs can w.l.o.g. be assumed to be locally normalized and that the notion of consistency does not allow exponentially more compact models than decomposability. These results were independently found in [21]. Furthermore, in [17], a sound derivation of inference mechanisms for generalized SPNs was given, i.e. SPNs over RVs with (uncountably) infinitely many states. In [21], a BN representation of SPNs was found, where LVs associated with sum nodes and the model RVs are organized in a two layer bipartite structure. The actual SPN structure is captured in structured conditional probability tables (CPTs) using algebraic decision diagrams. Recently, the notion of SPNs was generalized to sum-product functions over arbitrary semirings [33]. This yields a general unifying framework for learning and inference, subsuming, among others, SPNs for probabilistic modeling, NNFs for logical propositions and function representations for integration and optimization.

## 2 Latent Variable Interpretation

As pointed out in [1], each sum node in an SPN can be interpreted as a marginalized LV, similarly to the GMM example in Section 1. For each sum node $S$, one postulates a discrete LV $Z$ whose states correspond to the children of $S$. For each state, an IV and a product is introduced, such that the children are switched on/off by the corresponding IVs, as illustrated in Fig. 1. When all IVs in Fig. 1(b) are set to 1, $S$ still computes the same value as in Fig. 1(a). Since setting all IVs of $Z$ to 1 corresponds to marginalizing $Z$, the sum $S$ should be interpreted as a latent, marginalized RV.

Footnote 1: In graphical representations of SPNs, IVs are depicted as nodes containing a small circle, general distributions as nodes containing a Gaussian-like PDF, and sums and products as nodes with $+$ and $\times$ symbols. Empty nodes are of arbitrary type.

However, when we regard a larger structural context in Fig. 1(b), we recognize that this justification is actually too simplistic. Explicitly introducing the IVs renders an ancestor sum $S'$ incomplete when $S$ is not a descendant of some child $N$ of $S'$, so that $Z$ is not in the scope of $N$. Note that setting all IVs to 1 in an incomplete SPN generally does not correspond to marginalization. Furthermore, note that $S'$ also corresponds to an LV, say $Z'$. While we know the probability distribution of $Z$ if $Z'$ is in a state corresponding to a child of $S'$ that has $S$ as a descendant, namely the weights of $S$, we do not know this distribution when $Z'$ is in the state corresponding to $N$. Intuitively, we recognize that the state of $Z$ is "irrelevant" in this case, since it does not influence the resulting distribution over the model RVs $\mathbf{X}$. Nevertheless, the probabilistic model is not completely specified, which is unsatisfying.

A remedy for these problems is shown in Fig. 1(c). We introduce the twin sum node $\bar{S}$ whose children are the IVs corresponding to $Z$. The twin is connected as child of an additional product node, which is interconnected between $S'$ and $N$. Since this new product node has scope $\mathrm{sc}(N) \cup \{Z\}$, $S'$ is rendered complete now. Furthermore, if $Z'$ takes the state corresponding to $N$ (or actually the state corresponding to the new product node), we now have a specified conditional distribution for $Z$, namely the weights of the twin sum node. Clearly, given that all IVs of $Z$ are set to 1, the network depicted in Fig. 1(c) still computes the same function as the network in Fig. 1(a) (or Fig. 1(b)), since $\bar{S}$ constantly outputs 1, as long as we use normalized weights for it. Which weights should be used for the twin sum node $\bar{S}$? Basically, we can assume arbitrary normalized weights, which will cause $\bar{S}$ to constantly output 1. A natural choice, however, would be to use uniform weights for $\bar{S}$ (maximizing the entropy of the resulting LV model). Although the choice of weights is not crucial for evaluating evidence in the SPN, it plays a role in MPE inference, see Section 4. For now, let us formalize the explicit introduction of LVs, denoted as augmentation.

### 2.1 Augmentation of SPNs

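Let $\mathcal{S}$ be an SPN over $\mathbf{X}$. For each sum node $S \in \mathbf{S}(\mathcal{S})$ we assume an arbitrary but fixed ordering of its children $\mathrm{ch}(S) = \{C_S^1, \dots, C_S^{K_S}\}$, where $K_S = |\mathrm{ch}(S)|$. Let $Z_S$ be an RV on the same probability space as $\mathbf{X}$, with states $\{1, \dots, K_S\}$, where state $k$ corresponds to child $C_S^k$. We call $Z_S$ the LV associated with $S$. For sets of sum nodes $\mathbf{S}$ we define $\mathbf{Z}_{\mathbf{S}} = \{Z_S \mid S \in \mathbf{S}\}$. To distinguish $\mathbf{X}$ from the LVs, we will refer to the former as model RVs. For node $N$, we define the sum ancestors/descendants as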

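\begin{align}
\mathrm{anc}_{\mathbf{S}}(N) &:= \mathrm{anc}(N) \cap \mathbf{S}(\mathcal{S}), \tag{7} \\
\mathrm{desc}_{\mathbf{S}}(N) &:= \mathrm{desc}(N) \cap \mathbf{S}(\mathcal{S}). \tag{8}
\end{align}

For each sum node $S$ we define the conditioning sums as

$$\mathbf{S}^c(S) := \{S^c \in \mathrm{anc}_{\mathbf{S}}(S) \setminus \{S\} \mid \exists C \in \mathrm{ch}(S^c): S \notin \mathrm{desc}(C)\}. \tag{9}$$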

Furthermore, we assume a set of locally normalized twin-weights $\bar{\mathbf{w}}$, containing a twin-weight $\bar{w}_{S,C}$ for each weight $w_{S,C}$ in the SPN. We are now ready to define the augmentation of an SPN.

###### Definition 2 (Augmentation of SPN).

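Let $\mathcal{S}$ be an SPN over $\mathbf{X}$, $\bar{\mathbf{w}}$ be a set of twin-weights and $\mathcal{S}'$ be the result of algorithm AugmentSPN, shown in Fig. 2. $\mathcal{S}'$ is called the augmented SPN of $\mathcal{S}$, denoted as $\mathcal{S}' = \mathrm{aug}(\mathcal{S})$. Within the context of $\mathcal{S}'$, $C_S^k$ is called the former child of $S$. The introduced product node $P_S^k$ is called the link of $S$, $C_S^k$ and $\lambda_{Z_S = k}$, respectively. The sum node $\bar{S}$, if introduced, is called the twin sum node of $S$. With respect to $\mathcal{S}'$, we denote $\mathcal{S}$ as the original SPN.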

In steps 5-12 of AugmentSPN we introduce the links which are interconnected between sum node $S$ and its children. Each link has a single parent, namely $S$, and simply copies its former child. In steps 14-16, we introduce IVs corresponding to the associated LV $Z_S$, as proposed in [1]. As we saw in Fig. 1 and the discussion above, this can render other sum nodes incomplete. These sums are clearly the conditioning sums $\mathbf{S}^c(S)$. Thus, when necessary, we introduce a twin sum node in steps 18-25, to treat this problem. The following proposition states the soundness of augmentation.

###### Proposition 1.

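Let $\mathcal{S}$ be an SPN over $\mathbf{X}$, let $\mathcal{S}'$ be its augmented SPN and let $\mathbf{Z}$ be the set of LVs associated with all sum nodes of $\mathcal{S}$. Then $\mathcal{S}'$ is a complete and decomposable SPN over $\mathbf{X} \cup \mathbf{Z}$ with $\mathcal{S}'(\mathbf{X}) \equiv \mathcal{S}(\mathbf{X})$.

Proposition 1 states that the marginal distribution over $\mathbf{X}$ in the augmented SPN is the same distribution as represented by the original SPN, while being a completely specified probabilistic model over $\mathbf{X}$ and $\mathbf{Z}$. Thus, augmentation provides a sound way to generalize the LV interpretation from mixture models to more general SPNs. An example of augmentation is shown in Fig. 3.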

Note that we understand the augmentation mainly as a theoretical tool to establish and work with the LV interpretation in SPNs. In most cases, it will be neither necessary nor advisable to explicitly construct the augmented SPN.

An interesting question is how the sizes of the original SPN $\mathcal{S}$ and the augmented SPN $\mathcal{S}'$ relate to each other. A lower bound is $|\mathcal{S}'| \in \Omega(|\mathcal{S}|)$, holding e.g. for SPNs with a single sum node. An asymptotic upper bound is $|\mathcal{S}'| \in O(|\mathcal{S}|^2)$. To see this, note that the introduction of links, IVs and twin sums causes at most a linear increase of the SPN's size. The number of edges introduced when connecting twins to the links of conditioning sums is bounded by $|\mathcal{S}|^2$, since the number of twins and the number of links are both bounded by $|\mathcal{S}|$. Therefore, we have $|\mathcal{S}'| \in O(|\mathcal{S}|^2)$. This asymptotic upper bound is indeed achieved by certain types of SPNs: Consider e.g. a chain consisting of $K$ sum nodes and $K+1$ distribution nodes. For $k < K$ the $k$-th sum is the parent of the $(k+1)$-th sum and the $k$-th distribution, and the $K$-th sum is the parent of the last two distributions. For the $k$-th sum, all preceding sums are conditioning sums, yielding $k-1$ introduced edges. In total this gives $\sum_{k=1}^{K} (k-1) = K(K-1)/2$ edges, i.e. in this case $|\mathcal{S}'|$ indeed grows quadratically in $|\mathcal{S}|$.
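The quadratic growth in the chain example can be checked numerically. The sketch below assumes the chain is laid out as described (sum $k$ has sum $k+1$ and distribution $k$ as children, and the last sum has the last two distributions as children) and counts the conditioning sums of Eq. (9) for every sum node. The node labels `S1..SK`, `D1..D(K+1)` and the helper names are hypothetical.

```python
def chain_children(K):
    """Child map of a chain with K sum nodes and K + 1 distribution nodes."""
    ch = {}
    for k in range(1, K):
        ch[f"S{k}"] = [f"S{k + 1}", f"D{k}"]
    ch[f"S{K}"] = [f"D{K}", f"D{K + 1}"]
    return ch

def descendants(ch, node):
    """desc(N): the node itself plus everything reachable via children."""
    out = {node}
    for c in ch.get(node, []):
        out |= descendants(ch, c)
    return out

def conditioning_sums(ch, s):
    """Eq. (9): proper sum ancestors of s that have a child not reaching s."""
    result = set()
    for anc in ch:                      # every key of `ch` is a sum node here
        if anc != s and s in descendants(ch, anc):
            if any(s not in descendants(ch, c) for c in ch[anc]):
                result.add(anc)
    return result

K = 6
ch = chain_children(K)
# counts[k - 1] is the number of conditioning sums of the k-th sum node.
counts = [len(conditioning_sums(ch, f"S{k}")) for k in range(1, K + 1)]
total_twin_edges = sum(counts)
```

In this chain each conditioning sum has exactly one child that does not reach the sum in question, so the number of conditioning sums equals the number of twin-to-link edges.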

### 2.2 Conditional Independencies in Augmented SPNs and Probabilistic Interpretation of Sum-Weights

It is helpful to introduce the notion of configured SPNs, which takes a similar role as conditioning in the literature on DNNFs [22, 23, 24].

###### Definition 3 (Configured SPN).

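Let $\mathcal{S}$ be an SPN over $\mathbf{X}$, let $\mathbf{Y}$ be a subset of the LVs associated with the sum nodes of $\mathcal{S}$, and let $\mathbf{y}$ be an assignment of $\mathbf{Y}$. The configured SPN $\mathcal{S}^{\mathbf{y}}$ is obtained by deleting the IVs $\lambda_{Y=y}$ and their corresponding links, for each $Y \in \mathbf{Y}$ and each state $y$ that differs from the state assigned to $Y$ by $\mathbf{y}$, from the augmented SPN of $\mathcal{S}$, and further deleting all nodes which are rendered unreachable from the root.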

Intuitively, the configured SPN isolates the computational structure selected by $\mathbf{y}$. All sum edges which "survive" in the configured SPN are equipped with the same weights as in the augmented SPN. Therefore, a configured SPN is in general not locally normalized. We note the following properties of configured SPNs.

###### Proposition 2.

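Let $\mathcal{S}$ be an SPN over $\mathbf{X}$, let $\mathbf{Y}$ be a subset of the LVs associated with the sum nodes of $\mathcal{S}$ and let $\mathbf{Z}$ be the remaining LVs. Let $\mathbf{y}$ be an assignment of $\mathbf{Y}$, let $\mathcal{S}'$ be the augmented SPN of $\mathcal{S}$ and let $\mathcal{S}^{\mathbf{y}}$ be the configured SPN. It holds that

1. Each node in $\mathcal{S}^{\mathbf{y}}$ has the same scope as its corresponding node in $\mathcal{S}'$.

2. $\mathcal{S}^{\mathbf{y}}$ is a complete and decomposable SPN over $\mathbf{X} \cup \mathbf{Y} \cup \mathbf{Z}$.

3. For any node $N$ in $\mathcal{S}^{\mathbf{y}}$ with $\mathrm{sc}(N) \cap \mathbf{Y} = \emptyset$, we have that the sub-SPN rooted at $N$ is the same in $\mathcal{S}^{\mathbf{y}}$ and in $\mathcal{S}'$.

4. For any assignment $\mathbf{y}'$ of $\mathbf{Y}$ it holds that

$$\mathcal{S}^{\mathbf{y}}(\mathbf{X}, \mathbf{Z}, \mathbf{y}') = \begin{cases} \mathcal{S}'(\mathbf{X}, \mathbf{Z}, \mathbf{y}') & \text{if } \mathbf{y}' = \mathbf{y} \\ 0 & \text{otherwise.} \end{cases} \tag{10}$$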

The next theorem shows certain conditional independencies in the augmented SPN. For ease of discussion, we make the following definitions.

###### Definition 4.

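Let $S$ be a sum node in an SPN and $Z_S$ its associated LV. All other RVs (model RVs and LVs) are divided into three sets:

• Parents $\mathbf{Z}_p$, which are all LVs "above" $S$, i.e. the LVs associated with the sum ancestors $\mathrm{anc}_{\mathbf{S}}(S) \setminus \{S\}$.

• Children $\mathbf{Y}_c$, which are all model RVs and LVs "below" $S$, i.e. the model RVs in $\mathrm{sc}(S)$ and the LVs associated with the sum descendants $\mathrm{desc}_{\mathbf{S}}(S) \setminus \{S\}$.

• Non-descendants $\mathbf{Y}_n$, which are the remaining RVs, i.e. all model RVs and LVs not contained in $\mathbf{Z}_p \cup \mathbf{Y}_c \cup \{Z_S\}$.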

We will show that the parents, children and non-descendants play a role analogous to the one they play for independencies in BNs [14, 15], i.e. $Z_S$ is independent of $\mathbf{Y}_n$ given $\mathbf{Z}_p$. We will further show that the sum-weights of $S$ are the conditional distribution of $Z_S$, conditioned on the event that "$\mathbf{Z}_p$ select a path to $S$". One problem in the original LV interpretation [1] was that no conditional distribution of $Z_S$ was specified for the complementary event. Here, we will show that the twin-weights are precisely this conditional distribution. This requires that the event "$\mathbf{Z}_p$ select a path to the twin $\bar{S}$" is indeed the complementary event to "$\mathbf{Z}_p$ select a path to $S$". This is shown in the following lemma.

###### Lemma 1.

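Let $\mathcal{S}$ be an SPN over $\mathbf{X}$, let $S$ be a sum node in $\mathcal{S}$ and $\mathbf{Z}_p$ be the parents of $Z_S$. For any assignment $\mathbf{z}$ of $\mathbf{Z}_p$, the configured SPN $\mathcal{S}^{\mathbf{z}}$ contains either $S$ or its twin $\bar{S}$, but not both.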

We are now ready to state our theorem concerning conditional independencies in augmented SPNs.

###### Theorem 1.

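Let $\mathcal{S}$ be an SPN over $\mathbf{X}$ and $\mathcal{S}'$ its augmented SPN. Let $S$ be an arbitrary sum in $\mathcal{S}$ and let $w_k$ and $\bar{w}_k$ denote the weight and the twin-weight associated with its $k$-th child, $k = 1, \dots, K_S$. With respect to $S$, let $\mathbf{Z}_p$ be the parents, $\mathbf{Y}_c$ be the children and $\mathbf{Y}_n$ be the non-descendants, respectively. Then there exists a two-partition of the set of joint states of $\mathbf{Z}_p$ into $\mathcal{Z}$ and $\bar{\mathcal{Z}}$, i.e. $\mathcal{Z} \cup \bar{\mathcal{Z}}$ contains all joint states of $\mathbf{Z}_p$ and $\mathcal{Z} \cap \bar{\mathcal{Z}} = \emptyset$, such that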

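\begin{align}
\forall \mathbf{z} \in \mathcal{Z}: \quad \mathcal{S}'(Z_S = k, \mathbf{Y}_n, \mathbf{z}) &= w_k \, \mathcal{S}'(\mathbf{Y}_n, \mathbf{z}), \quad \text{and} \tag{11} \\
\forall \mathbf{z} \in \bar{\mathcal{Z}}: \quad \mathcal{S}'(Z_S = k, \mathbf{Y}_n, \mathbf{z}) &= \bar{w}_k \, \mathcal{S}'(\mathbf{Y}_n, \mathbf{z}). \tag{12}
\end{align}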

From Theorem 1 it follows that the weights and twin-weights of a sum node $S$ can be interpreted as conditional probability tables (CPTs) of $Z_S$, conditioned on $\mathbf{Z}_p$, and that $Z_S$ is conditionally independent of $\mathbf{Y}_n$ given $\mathbf{Z}_p$, i.e.

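$$\mathcal{S}'(Z_S = k \mid \mathbf{Y}_n, \mathbf{z}) = \mathcal{S}'(Z_S = k \mid \mathbf{z}) = \begin{cases} w_k & \text{if } \mathbf{z} \in \mathcal{Z} \\ \bar{w}_k & \text{if } \mathbf{z} \in \bar{\mathcal{Z}}. \end{cases} \tag{13}$$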

Using this result, we can define a BN representing the augmented SPN as follows: For each sum node $S$, connect $\mathbf{Z}_p$ as parents of $Z_S$, and all RVs $\mathbf{Y}_c$ as children of $Z_S$. By doing this for each LV, we obtain our BN representation of the augmented SPN, serving as a useful tool to understand SPNs in the context of probabilistic graphical models. An example of the BN interpretation is shown in Fig. 4.

Note that the BN representation by Zhao et al. [21] can be recovered from the BN representation of augmented SPNs. They proposed a BN representation of SPNs using a bipartite structure, where an LV is a parent of a model RV if it is contained in the scope of the corresponding sum node. The model RVs and LVs are unconnected among each other, respectively. When we constrain the twin-weights to be equal to the sum-weights, we can see in (13) that $Z_S$ becomes independent of $\mathbf{Z}_p$. This special choice of twin-weights effectively removes all edges between LVs, recovering the BN structure in [21]. In the next section, we use the augmented SPN and the BN interpretation to derive the EM algorithm for SPNs.

## 3 EM Algorithm

The EM algorithm is a general scheme for maximum likelihood learning, when for some RVs complete evidence is missing [10, 11]. Thus, augmented SPNs are amenable to EM due to the LVs associated with sum nodes. Moreover, the twin-weights can be kept fixed, so that EM applied to augmented SPNs actually optimizes the weights of the original SPN. This approach was already pointed out in [1], where it was suggested that for evidence $\mathbf{e}$ and for any LV $Z_S$, the marginal posteriors should be given as $p(Z_S = k \mid \mathbf{e}) \propto w_k \, \partial \mathcal{S}(\mathbf{e}) / \partial S(\mathbf{e})$, where $w_k$ is the weight of the $k$-th child of $S$, and that these should be used for EM updates. These updates, however, cannot be the correct ones, as they actually leave the weights unchanged: the derivative does not depend on $k$, so normalizing over $k$ returns $w_k$. Here, using augmented SPNs, we formally derive the standard EM updates for sum-weights and the input distributions, when they are chosen from an exponential family.

Assume a dataset $\mathcal{D} = \{\mathbf{e}^{(1)}, \dots, \mathbf{e}^{(L)}\}$ of $L$ i.i.d. samples, where each $\mathbf{e}^{(l)}$ is any combination of complete and partial evidence for the model RVs $\mathbf{X}$, cf. Section 1.1. Let $\mathbf{Z}$ be the set of all LVs and consider an arbitrary sum node $S$. Eq. (13) shows that the weights can be interpreted as conditional probabilities in our BN interpretation, where

$$w_k = \mathcal{S}'(Z_S = k \mid \mathbf{Z}_p \in \mathcal{Z}). \tag{14}$$

As mentioned above, the twin-weights are kept fixed. Using the well-known EM-updates in BNs over discrete RVs [15, 10], the updates for the sum-weights are given by summing over the expected statistics

$$\mathcal{S}'(Z_S = k, \mathbf{Z}_p \in \mathcal{Z} \mid \mathbf{e}^{(l)}), \tag{15}$$

followed by renormalization. We make the event $\mathbf{Z}_p \in \mathcal{Z}$ explicit by introducing a switching parent $Y_S$ of $Z_S$: When the twin sum of $S$ exists, $Y_S$ assumes the two states $y_S$ and $\bar{y}_S$, where $Y_S = y_S \Leftrightarrow \mathbf{Z}_p \in \mathcal{Z}$ and $Y_S = \bar{y}_S \Leftrightarrow \mathbf{Z}_p \in \bar{\mathcal{Z}}$. When the twin sum does not exist, $Y_S$ just takes the single value $y_S$. Clearly, when observed, $Y_S$ renders $Z_S$ independent from $\mathbf{Z}_p$. The switching parent can be explicitly introduced in the augmented SPN, as depicted in Fig. 5.

Here we simply introduce two new IVs $\lambda_{Y_S = y_S}$ and $\lambda_{Y_S = \bar{y}_S}$, which switch on/off the output of $S$ and $\bar{S}$, respectively. It is easy to see that when these IVs are constantly set to 1, i.e. when $Y_S$ is marginalized, the augmented SPN performs exactly the same computations as before. It is furthermore easy to see that completeness and decomposability of the augmented SPN are maintained when the switching parent is introduced. Using the switching parent, the required expected statistics (15) translate to

$$\mathcal{S}'(Z_S = k, Y_S = y_S \mid \mathbf{e}^{(l)}). \tag{16}$$

To compute (16), we use the differential approach, [16, 17, 19], cf. also Section 1.1. First note that

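$$\mathcal{S}'(Z_S = k, Y_S = y_S, \mathbf{e}^{(l)}) = \frac{\partial^2 \mathcal{S}'(\mathbf{e}^{(l)})}{\partial \lambda_{Y_S = y_S} \, \partial \lambda_{Z_S = k}}. \tag{17}$$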

The first derivative is given as

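\begin{align}
\frac{\partial \mathcal{S}'(\mathbf{e}^{(l)})}{\partial \lambda_{Y_S = y_S}} &= \frac{\partial \mathcal{S}'(\mathbf{e}^{(l)})}{\partial P} \, S(\mathbf{e}^{(l)}) \tag{18} \\
&= \frac{\partial \mathcal{S}'(\mathbf{e}^{(l)})}{\partial P} \sum_{k=1}^{K_S} \lambda_{Z_S = k} \, w_k \, C_S^k(\mathbf{e}^{(l)}), \tag{19}
\end{align}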

where $P$ is the common product parent of $S$ and $\lambda_{Y_S = y_S}$ in the augmented SPN (see Fig. 5(b)). Differentiating (19) with respect to $\lambda_{Z_S = k}$ yields the second derivative

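$$\frac{\partial^2 \mathcal{S}'(\mathbf{e}^{(l)})}{\partial \lambda_{Y_S = y_S} \, \partial \lambda_{Z_S = k}} = \frac{\partial \mathcal{S}'(\mathbf{e}^{(l)})}{\partial P} \, w_k \, C_S^k(\mathbf{e}^{(l)}), \tag{20}$$

delivering the required posteriors

$$\mathcal{S}'(Z_S = k, Y_S = y_S \mid \mathbf{e}^{(l)}) = \frac{1}{\mathcal{S}'(\mathbf{e}^{(l)})} \frac{\partial \mathcal{S}'(\mathbf{e}^{(l)})}{\partial P} \, w_k \, C_S^k(\mathbf{e}^{(l)}). \tag{21}$$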

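We do not want to construct the augmented SPN explicitly, so we express (21) in terms of the original SPN. Since all LVs are marginalized, it holds that $\mathcal{S}'(\mathbf{e}^{(l)}) = \mathcal{S}(\mathbf{e}^{(l)})$ and $\partial \mathcal{S}'(\mathbf{e}^{(l)}) / \partial P = \partial \mathcal{S}(\mathbf{e}^{(l)}) / \partial S$, yielding

$$\mathcal{S}'(Z_S = k, Y_S = y_S \mid \mathbf{e}^{(l)}) = \frac{1}{\mathcal{S}(\mathbf{e}^{(l)})} \frac{\partial \mathcal{S}(\mathbf{e}^{(l)})}{\partial S} \, w_k \, C_S^k(\mathbf{e}^{(l)}), \tag{22}$$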

delivering the required statistics for updating the sum-weights. We now turn to the updates of the input distributions.
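The sum-weight update can be sketched on a toy SPN: for every sample, one forward and one backward pass deliver the derivative at each sum node, the per-child statistics (derivative times weight times child value, divided by the root value) are accumulated, and the weights are renormalized. This is a minimal illustration assuming fixed categorical leaves and evidence given as complete values or missing RVs; the node encoding, the function names and all numbers are hypothetical.

```python
from math import log, prod

# Illustrative encoding (an assumption of this sketch): nodes are listed
# children-first, the last node is the root, and each node is one of
#   ("leaf", var, probs) | ("prod", child_ids) | ("sum", child_ids, weights)

def forward(nodes, x):
    """Bottom-up pass; RVs missing from the sample `x` are marginalized."""
    val = []
    for n in nodes:
        if n[0] == "leaf":
            val.append(n[2][x[n[1]]] if n[1] in x else 1.0)
        elif n[0] == "prod":
            val.append(prod(val[c] for c in n[1]))
        else:
            val.append(sum(w * val[c] for c, w in zip(n[1], n[2])))
    return val

def backward(nodes, val):
    """Derivative of the root value with respect to every node value."""
    grad = [0.0] * len(nodes)
    grad[-1] = 1.0
    for i in reversed(range(len(nodes))):
        n = nodes[i]
        if n[0] == "sum":
            for c, w in zip(n[1], n[2]):
                grad[c] += w * grad[i]
        elif n[0] == "prod":
            for c in n[1]:
                grad[c] += grad[i] * prod(val[o] for o in n[1] if o != c)
    return grad

def em_step(nodes, data):
    """One EM update of all sum-weights; leaves are kept fixed."""
    stats = {i: [0.0] * len(n[2]) for i, n in enumerate(nodes) if n[0] == "sum"}
    for x in data:
        val = forward(nodes, x)
        grad = backward(nodes, val)
        for i, acc in stats.items():
            for k, (c, w) in enumerate(zip(nodes[i][1], nodes[i][2])):
                acc[k] += grad[i] * w * val[c] / val[-1]
    updated = list(nodes)
    for i, acc in stats.items():
        total = sum(acc)
        if total > 0:                       # renormalize the statistics
            updated[i] = ("sum", nodes[i][1], [a / total for a in acc])
    return updated

def log_likelihood(nodes, data):
    return sum(log(forward(nodes, x)[-1]) for x in data)

nodes = [
    ("leaf", "X1", [0.9, 0.1]),      # 0
    ("leaf", "X1", [0.2, 0.8]),      # 1
    ("sum", [0, 1], [0.5, 0.5]),     # 2
    ("leaf", "X2", [0.3, 0.7]),      # 3
    ("leaf", "X2", [0.8, 0.2]),      # 4
    ("prod", [2, 3]),                # 5
    ("prod", [1, 4]),                # 6
    ("sum", [5, 6], [0.5, 0.5]),     # 7: root
]
data = [{"X1": 0, "X2": 1}, {"X1": 0, "X2": 1}, {"X1": 1, "X2": 0},
        {"X1": 0, "X2": 0}, {"X1": 1, "X2": 0}, {"X2": 1}]

lls = [log_likelihood(nodes, data)]
for _ in range(20):
    nodes = em_step(nodes, data)
    lls.append(log_likelihood(nodes, data))
```

The recorded log-likelihoods in `lls` should not decrease from one iteration to the next, which is a quick sanity check for any implementation of these updates.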

### 3.2 Updates for Input Distributions

For simplicity, we derive updates for univariate input distributions, i.e. for all distributions $D_{\mathbf{Y}}$ we have $|\mathbf{Y}| = 1$. Similar updates can be derived rather easily for multivariate input distributions as well. In [17], the so-called distribution selectors (DSs) were introduced to derive the differential approach for generalized SPNs. Similarly to the switching parents for (twin) sum nodes, the DSs are RVs which render the respective model RVs independent from the remaining RVs. More formally, for each $X \in \mathbf{X}$, let $\mathbf{D}_X$ be the set of all input distributions which have scope $\{X\}$. Assume an arbitrary but fixed ordering of $\mathbf{D}_X$ and let $[D_X]$ be the index of $D_X$ in this ordering. Let the DS $W_X$ be a discrete RV with $|\mathbf{D}_X|$ states. The so-called gated SPN $\mathcal{S}^g$ is obtained by replacing each distribution $D_X$ by the product node

$$D_X \rightarrow D_X \times \lambda_{W_X = [D_X]}. \tag{23}$$

The introduced product is denoted as gate. As shown in [17], $X$ is rendered independent from all other RVs in the SPN when conditioned on $W_X$. Moreover, $D_X$ is the conditional distribution of $X$ given $W_X = [D_X]$. Therefore, each $X$ and its DS $W_X$ can be incorporated as a two RV family in our BN interpretation. When each input distribution $D_X$ is chosen from an exponential family with natural parameters $\theta_{D_X}$, the M-step is given by the expected sufficient statistics

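$$\theta_{D_X} \leftarrow \frac{\sum_{l} \mathcal{S}^g(W_X = k \mid \mathbf{e}^{(l)}) \int D_X(x \mid \mathbf{e}^{(l)}) \, \theta_{D_X}(x) \, dx}{\sum_{l} \mathcal{S}^g(W_X = k \mid \mathbf{e}^{(l)})}, \tag{24}$$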

where $k = [D_X]$. When $\mathbf{e}^{(l)}$ contains complete evidence $x$ for $X$, then the integral reduces to $\theta_{D_X}(x)$. When $\mathbf{e}^{(l)}$ contains partial evidence $\mathcal{X}$ for $X$, then

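$$\int D_X(x \mid \mathbf{e}^{(l)}) \, \theta_{D_X}(x) \, dx = \frac{\int_{\mathcal{X}} D_X(x) \, \theta_{D_X}(x) \, dx}{\int_{\mathcal{X}} D_X(x) \, dx}. \tag{25}$$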

Depending on $\mathcal{X}$ and the type of $D_X$, evaluating (25) can be more or less demanding. A simple but practical case is when $D_X$ is Gaussian and $\mathcal{X}$ is some interval, permitting a closed form solution for integrating the Gaussian's sufficient statistics $x$ and $x^2$, using truncated Gaussians [34].

To obtain the posteriors required in (24), we again use the differential approach. Note that

$$S^g(W_X = k, e^{(l)}) = \frac{\partial S^g(e^{(l)})}{\partial \lambda_{W_X = k}} = \frac{\partial S^g(e^{(l)})}{\partial P}\, D_X(e^{(l)}), \tag{26}$$

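where $k = [D_X]$ and $P$ is the gate of $D_X$, cf. (23). If we do not want to construct the gated SPN explicitly, we can use the identity $\frac{\partial S^g(e^{(l)})}{\partial P} = \frac{\partial S(e^{(l)})}{\partial D_X}$. Thus the required posteriors are given as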

$$S^g(W_X = k \mid e^{(l)}) = \frac{1}{S(e^{(l)})}\, \frac{\partial S(e^{(l)})}{\partial D_X}\, D_X(e^{(l)}). \tag{27}$$
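As a small sanity check of (27), the following sketch evaluates the DS posterior on a hypothetical two-component SPN $S = w_1 A_1 B_1 + w_2 A_2 B_2$, where $A_1, A_2$ are Gaussian leaves over $X$ and $B_1, B_2$ are leaves over a second RV $Y$. The network, the names, and the use of finite differences are assumptions made for illustration; an actual implementation would obtain the derivatives from a backpropagation pass.

```python
import math


def gauss(x, mu, sigma):
    """Gaussian density, used as an input distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def spn_value(leaves, w):
    """Evaluate the toy SPN S = w[0]*A1*B1 + w[1]*A2*B2 from leaf values."""
    return w[0] * leaves["A1"] * leaves["B1"] + w[1] * leaves["A2"] * leaves["B2"]


def leaf_derivative(leaves, w, name, eps=1e-6):
    """Central finite difference of S with respect to one leaf value.

    S is multilinear in its leaves, so this is exact up to rounding.
    """
    up = dict(leaves)
    up[name] += eps
    down = dict(leaves)
    down[name] -= eps
    return (spn_value(up, w) - spn_value(down, w)) / (2.0 * eps)


def ds_posterior(leaves, w, name):
    """Posterior of the distribution selector state for leaf `name`, as in (27)."""
    return leaf_derivative(leaves, w, name) * leaves[name] / spn_value(leaves, w)


# Evidence: X = 0.5 is observed, Y is marginalized (its leaves evaluate to 1).
w = (0.3, 0.7)
leaves = {"A1": gauss(0.5, 0.0, 1.0), "A2": gauss(0.5, 2.0, 1.0), "B1": 1.0, "B2": 1.0}
posterior_a1 = ds_posterior(leaves, w, "A1")
posterior_a2 = ds_posterior(leaves, w, "A2")
```

For this toy network the two posteriors coincide with the usual mixture responsibilities and sum to one, since every term of $S$ contains exactly one leaf over $X$.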

The EM algorithm for SPNs, both for sum-weights and input distributions, is summarized in Fig. 6. In Section 5.1 we empirically verify our derivation of EM and show that standard EM successfully trains SPNs when a suitable structure is at hand.

Note that Zhao and Poupart [35] recently derived a concave-convex procedure (CCCP) which yields the same sum-weight updates as the EM algorithm presented here and in [19]. This result is surprising, as EM and CCCP are rather different approaches in general.

## 4 Most Probable Explanation

In [1, 4, 7], SPNs were applied for reconstructing data using MPE inference. Given some distribution over and evidence , MPE can be formalized as finding where we assume that actually has a maximum in . MPE is a special case of MAP, defined as finding for some two-partition of , i.e. . Both MPE and MAP are generally NP-hard in BNs [36, 37, 38], and MAP is inherently harder than MPE [37, 38]. Using the result in [18], it follows that MAP inference is NP-hard also in SPNs. In particular, Theorem 5 in [18] shows that the decision version of MAP is NP-complete for a Naive Bayes model, when the class variable is marginalized. Naive Bayes is represented by the augmentation of an SPN with a single sum node, the LV representing the class variable. Therefore, MAP in SPNs is generally NP-hard. Since MAP in the augmented SPN representing the Naive Bayes model corresponds to MPE inference in the original SPN, i.e. a mixture model, it follows that also MPE inference is generally NP-hard in SPNs. A proof tailored to SPNs can be found in [19].

However, when considering the sub-class of selective SPNs (cf. Section 1.1 and [20]), an MPE solution can be obtained using a Viterbi-style backtracking algorithm in max-product networks.

###### Definition 5 (Max-Product Network).

Let be an SPN over . We define the max-product network (MPN) , by replacing each distribution node by a maximizing distribution node

$$\hat{D}: \mathcal{H}_{\mathrm{sc}(D)} \mapsto [0, \infty], \qquad \hat{D}(\mathcal{Y}) := \max_{y \in \mathcal{Y}} D(y), \tag{28}$$

and each sum node by a max node

$$\hat{S} := \max_{\hat{C} \in \mathrm{ch}(\hat{S})} w_{\hat{S}, \hat{C}}\, \hat{C}. \tag{29}$$

A product node in corresponds to a product node in .

###### Theorem 2.

Let be a selective SPN over and let the corresponding MPN. Let be some node in and its corresponding node in . Then, for every we have .

Theorem 2 shows that the MPN maximizes the probability in its corresponding selective SPN. The proof (see appendix) also shows how to actually find a maximizing assignment. For a product, a maximizing assignment is given by combining the maximizing assignments of its children. For a sum, a maximizing assignment is given by the maximizing assignment of a single child, whose weighted maximum is maximal among all children. Here the children’s maxima are readily given by the upwards pass in the MPN. Thus, finding a maximizing assignment of any node in a selective SPN recursively reduces to finding maximizing assignments for the children of this node; this can be accomplished by a Viterbi-like backtracking procedure. This algorithm, denoted as MPESelective, is shown in Fig. 7. Here denotes a queue of nodes, where and denote the en-queue and de-queue operations, respectively. Note that Theorem 2 has already been derived for a special case, namely for arithmetic circuits representing network polynomials of BNs over discrete RVs [39].
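The upward max pass and the queue-based backtracking described above can be sketched as follows. This is an illustrative reading of the procedure, not a transcription of Fig. 7: the node classes, the dictionary-based evidence format, and the restriction to discrete leaves are all assumptions, and shared sub-networks are re-evaluated rather than cached.

```python
from collections import deque


class Leaf:
    """Discrete input distribution over one RV; `probs` maps state -> probability."""

    def __init__(self, var, probs):
        self.var, self.probs = var, probs

    def max_value(self, evidence):
        # Maximize over the states allowed by the evidence, cf. (28).
        states = evidence.get(self.var, self.probs.keys())
        self.best = max(states, key=lambda s: self.probs.get(s, 0.0))
        self.value = self.probs.get(self.best, 0.0)
        return self.value


class Product:
    def __init__(self, children):
        self.children = children

    def max_value(self, evidence):
        self.value = 1.0
        for child in self.children:
            self.value *= child.max_value(evidence)
        return self.value


class Sum:
    """Sum node of the SPN, evaluated as a max node of the MPN, cf. (29)."""

    def __init__(self, weighted_children):
        self.weighted_children = weighted_children  # list of (weight, child)

    def max_value(self, evidence):
        scored = [(w * c.max_value(evidence), c) for w, c in self.weighted_children]
        self.value, self.best = max(scored, key=lambda t: t[0])
        return self.value


def mpe_selective(root, evidence):
    """Upward max pass followed by breadth-first backtracking."""
    root.max_value(evidence)
    assignment = {}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if isinstance(node, Sum):
            queue.append(node.best)  # follow only the maximizing child
        elif isinstance(node, Product):
            queue.extend(node.children)  # follow all children
        else:
            assignment[node.var] = node.best
    return assignment, root.value


# A small selective SPN: the indicator leaves over Z make the two sum children
# mutually exclusive for any complete input.
root = Sum([
    (0.4, Product([Leaf("Z", {0: 1.0}), Leaf("X", {0: 0.9, 1: 0.1})])),
    (0.6, Product([Leaf("Z", {1: 1.0}), Leaf("X", {0: 0.5, 1: 0.5})])),
])
free_assignment, free_value = mpe_selective(root, {})
clamped_assignment, clamped_value = mpe_selective(root, {"X": [1]})
```

Evidence is passed as a dictionary from RV name to the allowed states; RVs that are absent range over all states of the respective leaf. With no evidence the first branch wins ($0.4 \cdot 0.9$ against $0.6 \cdot 0.5$), while clamping $X = 1$ switches the maximizing branch to the second child.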

A direct corollary of Theorem 2 is that MPE inference is tractable in augmented SPNs, since augmented SPNs are selective SPNs over and . This can easily be seen in AugmentSPN, as for any and any sum , exactly one IV of is set to , causing that at most one child of (or ) can be non-zero. Therefore, we can use MPESelective in augmented SPNs, in order to find an MPE solution over both model RVs and LVs. Note that an MPE solution for the augmented SPN does in general not correspond to an MPE solution for the original SPN, when discarding the states of the LVs. However, this procedure is a frequently used approximation for models where MPE is tractable for both model RVs and LVs, but not for model RVs alone.
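The gap between the two MPE problems can be seen on a hypothetical three-component mixture over a single binary RV $X$, i.e. an SPN with one sum node whose LV $Z$ selects the component. The numbers below are chosen purely for illustration.

```python
# Sum-weights and component distributions of a toy mixture over binary X.
weights = [0.4, 0.3, 0.3]
components = [{0: 1.0, 1: 0.0}, {0: 0.0, 1: 1.0}, {0: 0.0, 1: 1.0}]


def joint(z, x):
    """Augmented SPN: joint probability of LV state z and model RV state x."""
    return weights[z] * components[z][x]


def marginal(x):
    """Original SPN: the LV is summed out."""
    return sum(joint(z, x) for z in range(len(weights)))


# MPE over model RV and LV jointly, then discard the LV state.
z_star, x_from_augmented = max(
    ((z, x) for z in range(len(weights)) for x in (0, 1)),
    key=lambda zx: joint(*zx),
)

# MPE in the original SPN, over the model RV alone.
x_from_original = max((0, 1), key=marginal)
```

Here the joint maximum is attained at $Z = 0, X = 0$ with probability $0.4$, whereas the marginal favours $X = 1$ with probability $0.6$, because two components of weight $0.3$ each agree on it.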