1 Introduction
Sum-product networks (SPNs) Poon2011 are a prominent type of deep probabilistic model: they provide a flexible representation for high-dimensional distributions while still allowing fast and exact inference. Learning SPNs can be naturally organised into structure learning and parameter learning, following the same dichotomy as in probabilistic graphical models (PGMs) Koller2009. As in PGMs, state-of-the-art SPN parameter learning covers a wide range of well-developed techniques. In particular, various maximum likelihood approaches have been proposed, using either gradient-based optimisation Sharir2016; Peharz2018; Butz2019; Molina2019 or expectation-maximisation (and related) schemes Peharz2017; Poon2011; Zhao2016b. Furthermore, several discriminative criteria, e.g. Gens2012; Kang2016; Trapp2017; Rashwan2018, as well as Bayesian approaches to parameter learning, e.g. Zhao2016; Rashwan2016; Vergari2019, have been developed.
Concerning structure learning, however, the situation is remarkably different. Although there is a plethora of structure learning approaches for SPNs, most of them can be described as heuristic. For example, the most prominent structure learning scheme, LearnSPN Gens2013, derives an SPN structure by recursively clustering the data instances (yielding sum nodes) and partitioning the data dimensions (yielding product nodes). Each of these steps can be understood as a local structure improvement, and as an attempt to optimise some local criterion. While LearnSPN is an intuitive scheme and elegantly maps the structural SPN semantics onto an algorithmic procedure, it is unsatisfying that the global goal of structure learning is never declared. This principal shortcoming of LearnSPN is shared by its many variants, such as online LearnSPN Lee2013, ID-SPN Rooshenas2014, LearnSPN-b Vergari2015, mixed SPNs Molina2018, and automatic Bayesian density analysis (ABDA) Vergari2019. Other approaches also lack a sound learning principle, such as Dennis2012; Adel2015, which derive SPN structures from k-means and SVD clustering, respectively, Peharz2013, which grows SPNs bottom-up using a heuristic based on the information bottleneck, Dennis2015, which uses a heuristic structure exploration, or Kalra2018, which uses a variant of hard EM to decide when to enlarge or shrink an SPN structure. All of the above-mentioned approaches fall short of posing some fundamental questions: What is a good SPN structure? What is a good principle to derive an SPN structure?
This situation is somewhat surprising, since the literature on PGMs actually offers a rich set of structure learning principles. In PGMs, the main strategy is to optimise a structure score such as minimum description length (MDL) Suzuki1993, the Bayesian information criterion (BIC) Koller2009 or the Bayes-Dirichlet (BD) score Cooper1992; Heckerman1995. Moreover, in Friedman2003 an approximate (but asymptotically correct) MCMC sampler was proposed for full Bayesian structure learning.
In this paper, we propose a well-principled Bayesian approach to SPN learning, simultaneously over both structure and parameters. We first decompose the structure learning problem into two steps, namely i) proposing a computational graph, laying out the arrangement of sums, products and leaf distributions, and ii) learning the so-called scope function, which assigns to each node its scope. (The scope of a node is the subset of random variables the node is responsible for; it needs to fulfil the so-called completeness and decomposability conditions, see Section 2.) The first step is straightforward and akin to validating various structures in neural network training.
The second step, learning the scope function, is more involved in full generality. However, we propose a parametrisation of the scope function for a widely used special case of SPNs, namely those following a so-called tree-shaped region graph Dennis2012; Peharz2018. In this case, the scope function is elegantly encoded via categorical variables representing a dimension clustering at each partition node in the underlying region graph. While clustering dimensions is often encountered in SPN structure learning Dennis2012; Gens2013, making this step explicit via latent variables is novel.
Having encoded the scope function via latent variables, Bayesian learning becomes conceptually simple: we equip all categorical latent variables and the SPN's leaves with appropriate priors, and perform monolithic Bayesian inference, implemented via simple collapsed Gibbs updates. After learning, the predictive distribution of Bayesian SPNs can be approximated by averaging over a set of posterior samples, which again can be conveniently represented as a single standard SPN.
In summary, our main contributions in this paper are:
We propose a novel and well-principled approach to SPN structure learning, by decomposing the problem into finding a computational graph and learning a scope function.

To learn the scope function, we propose a natural parametrisation for an important subtype of SPN, which allows us to formulate a joint Bayesian framework simultaneously over structure and parameters.

Due to the Bayesian nature of our approach, we introduce several benefits to the SPN toolset: Bayesian SPNs are not prone to overfitting, waiving the necessity of a separate validation set, which is beneficial for low data regimes. Furthermore, they naturally deal with missing data and are the first – to the best of our knowledge – which consistently and robustly learn SPN structures under missing data. Bayesian SPNs can easily be extended to nonparametric formulations, supporting growing data domains.
2 Background and Related Work
Let $\mathbf{X} = \{X_1, \dots, X_D\}$ be a set of $D$ random variables (RVs), for which $N$ i.i.d. samples are available. Let $x_{n,d}$ be the $n$-th observation for the $d$-th dimension and $\mathbf{x}_n := (x_{n,1}, \dots, x_{n,D})$. Our goal is to estimate the distribution of $\mathbf{X}$ using a sum-product network (SPN). In the following we review SPNs, but use a more general definition than usual, in order to facilitate our discussion below. In this paper, we define an SPN as a 4-tuple $\mathcal{S} = (\mathcal{G}, \psi, w, \theta)$, where $\mathcal{G}$ is a computational graph, $\psi$ is a scope function, $w$ is a set of sum-weights, and $\theta$ is a set of leaf parameters. In the following, we explain these terms in more detail.
Definition 1 (Computational graph).
The computational graph $\mathcal{G}$ is a connected acyclic directed graph, containing three types of nodes: sums ($\mathsf{S}$), products ($\mathsf{P}$) and leaves ($\mathsf{L}$). A node in $\mathcal{G}$ has no children if and only if it is of type $\mathsf{L}$. When we do not discriminate between node types, we use $\mathsf{N}$ for a generic node. $\mathbf{S}$, $\mathbf{P}$, $\mathbf{L}$, and $\mathbf{N}$ denote the collections of all $\mathsf{S}$, all $\mathsf{P}$, all $\mathsf{L}$, and all $\mathsf{N}$ in $\mathcal{G}$, respectively. The set of children of node $\mathsf{N}$ is denoted as $\mathrm{ch}(\mathsf{N})$. In this paper, we require that $\mathcal{G}$ has only a single root (node without parent).
Definition 2 (Scope function).
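The scope function is a function $\psi \colon \mathbf{N} \to 2^{\mathbf{X}}$, assigning each node in the computational graph $\mathcal{G}$ a subset of the random variables $\mathbf{X}$ ($2^{\mathbf{X}}$ denotes the power set of $\mathbf{X}$). Writing $\mathrm{ch}(\mathsf{N})$ for the children of a node $\mathsf{N}$, it has the following properties:

If $\mathsf{N}$ is the root node, then $\psi(\mathsf{N}) = \mathbf{X}$.

If $\mathsf{N}$ is a sum or product, then $\psi(\mathsf{N}) = \bigcup_{\mathsf{N}' \in \mathrm{ch}(\mathsf{N})} \psi(\mathsf{N}')$.

For each sum $\mathsf{S}$ we have $\forall\, \mathsf{N}, \mathsf{N}' \in \mathrm{ch}(\mathsf{S}) \colon \psi(\mathsf{N}) = \psi(\mathsf{N}')$ (completeness).

For each product $\mathsf{P}$ we have $\forall\, \mathsf{N} \neq \mathsf{N}' \in \mathrm{ch}(\mathsf{P}) \colon \psi(\mathsf{N}) \cap \psi(\mathsf{N}') = \emptyset$ (decomposability).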
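The four properties of Definition 2 can be checked mechanically. The following is a minimal sketch, assuming a hypothetical dictionary-based encoding of the computational graph (the names `children`, `kinds` and `scope` are illustrative and not part of the paper):

```python
def is_valid_scope(children, kinds, scope, root, variables):
    """Check the four scope-function properties on a small computational graph.

    children: dict node -> list of child nodes (leaves may be absent)
    kinds:    dict node -> "sum", "product" or "leaf"
    scope:    dict node -> frozenset of variable names
    """
    # Property 1: the root covers all random variables.
    if scope[root] != frozenset(variables):
        return False
    for node, kind in kinds.items():
        if kind == "leaf":
            continue
        kids = children[node]
        # Property 2: an internal node's scope is the union of its children's scopes.
        union = frozenset().union(*(scope[c] for c in kids))
        if scope[node] != union:
            return False
        # Property 3 (completeness): all children of a sum share one scope.
        if kind == "sum" and any(scope[c] != scope[kids[0]] for c in kids):
            return False
        # Property 4 (decomposability): children of a product are pairwise disjoint,
        # which holds exactly when the scope sizes add up to the size of the union.
        if kind == "product" and sum(len(scope[c]) for c in kids) != len(union):
            return False
    return True


children = {"S": ["P1", "P2"], "P1": ["L1", "L2"], "P2": ["L3", "L4"]}
kinds = {"S": "sum", "P1": "product", "P2": "product",
         "L1": "leaf", "L2": "leaf", "L3": "leaf", "L4": "leaf"}
X = {"X1", "X2"}
good = {"S": frozenset(X), "P1": frozenset(X), "P2": frozenset(X),
        "L1": frozenset({"X1"}), "L2": frozenset({"X2"}),
        "L3": frozenset({"X1"}), "L4": frozenset({"X2"})}
bad = dict(good, L3=frozenset(X))  # L3 now overlaps with L4 under product P2
print(is_valid_scope(children, kinds, good, "S", X))
print(is_valid_scope(children, kinds, bad, "S", X))
```

The second call fails only on decomposability: the union at P2 is still correct, but its two children overlap.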
Each node $\mathsf{N}$ in $\mathcal{G}$ represents a distribution over the random variables $\psi(\mathsf{N})$, described in the following. Each leaf $\mathsf{L}$ computes a distribution over its scope $\psi(\mathsf{L})$ (for $\psi(\mathsf{L}) = \emptyset$, we set $\mathsf{L} \equiv 1$). We assume that $\mathsf{L}$ is parametrised by $\theta_{\mathsf{L}}$, and that $\theta_{\mathsf{L}}$ represents a distribution for any possible choice of $\psi(\mathsf{L})$. In the most naive setting, we would maintain a separate parameter set for each of the $2^D$ possible choices for $\psi(\mathsf{L})$, but this would quickly become intractable. In this paper, we simply assume that $\theta_{\mathsf{L}}$ contains $D$ parameters $\theta_{\mathsf{L},1}, \dots, \theta_{\mathsf{L},D}$ over single-dimensional distributions (e.g. Gaussian, Bernoulli, etc.), and that for a given $\psi(\mathsf{L})$, the represented distribution factorises: $\mathsf{L} = \prod_{X_i \in \psi(\mathsf{L})} p(X_i \mid \theta_{\mathsf{L},i})$. However, more elaborate schemes are possible. Note that our definition of leaves is quite distinct from prior art: previously, leaves were defined to be distributions over a fixed scope; our leaves define at all times distributions over all possible scopes. The set $\theta = \{\theta_{\mathsf{L}}\}$ denotes the collection of parameters for all leaf nodes. A sum node $\mathsf{S}$ computes a weighted sum $\mathsf{S} = \sum_{\mathsf{N} \in \mathrm{ch}(\mathsf{S})} w_{\mathsf{S},\mathsf{N}}\, \mathsf{N}$. Each weight $w_{\mathsf{S},\mathsf{N}}$ is non-negative, and can w.l.o.g. Peharz2015; Zhao2015 be assumed to be normalised: $w_{\mathsf{S},\mathsf{N}} \geq 0$, $\sum_{\mathsf{N} \in \mathrm{ch}(\mathsf{S})} w_{\mathsf{S},\mathsf{N}} = 1$. The set of all sum-weights for $\mathsf{S}$ is denoted as $w_{\mathsf{S}}$, and $w$ denotes the set of all sum-weights in the SPN. A product node $\mathsf{P}$ simply computes $\mathsf{P} = \prod_{\mathsf{N} \in \mathrm{ch}(\mathsf{P})} \mathsf{N}$.
The two conditions we require from $\psi$ – completeness and decomposability – ensure that each node $\mathsf{N}$ is a well-defined distribution over $\psi(\mathsf{N})$. The distribution represented by $\mathcal{S}$ is defined to be the distribution of the root node in $\mathcal{G}$, and denoted as $\mathcal{S}(\mathbf{x})$. Furthermore, completeness and decomposability are key to render many inference scenarios tractable in SPNs. In particular, arbitrary marginalisation tasks reduce to marginalisation tasks at the leaves, i.e. simplify to several marginalisation tasks over (small) subsets of $\mathbf{X}$, while the evaluation of the internal part (sums and products) amounts to a simple feedforward pass Peharz2015. Thus, exact marginalisation can be computed in time linear in the size of the SPN (assuming constant-time marginalisation at the leaves). Conditioning can be tackled similarly. Note that marginalisation and conditioning are key inference routines in probabilistic reasoning, so that SPNs are generally referred to as tractable probabilistic models.
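To make the marginalisation argument concrete, here is a small sketch in Python. It assumes a hypothetical tuple encoding of a tree-shaped SPN with Bernoulli leaves (none of these names come from the paper); a marginalised variable is marked with `None`, and the corresponding leaf simply evaluates to 1.

```python
def evaluate(node, x):
    """Evaluate an SPN node bottom-up; x[d] is None if variable d is marginalised."""
    kind = node[0]
    if kind == "leaf":
        _, d, p = node              # Bernoulli leaf over variable d with P(X_d = 1) = p
        if x[d] is None:
            return 1.0              # integrating a normalised leaf over its scope gives 1
        return p if x[d] == 1 else 1.0 - p
    if kind == "product":
        result = 1.0
        for child in node[1]:
            result *= evaluate(child, x)
        return result
    # Sum node: node[1] is a list of (weight, child) pairs.
    return sum(w * evaluate(child, x) for w, child in node[1])


spn = ("sum", [
    (0.3, ("product", [("leaf", 0, 0.9), ("leaf", 1, 0.2)])),
    (0.7, ("product", [("leaf", 0, 0.1), ("leaf", 1, 0.6)])),
])

# Marginal p(X_0 = 1) from one pass, versus explicit summation over X_1.
marginal = evaluate(spn, [1, None])
brute_force = evaluate(spn, [1, 0]) + evaluate(spn, [1, 1])
print(marginal, brute_force)
```

The recursion visits each node once here because the example is a tree; for a general DAG one would cache node values to keep the pass linear in the number of edges.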
3 Bayesian Sum-Product Networks
Note that all previous works defined the computational graph $\mathcal{G}$ and the scope function $\psi$ in an entangled way, i.e. the scope was seen as an inherent property of the nodes in $\mathcal{G}$. In this paper, we propose to decouple these two aspects of SPN structure: searching over $\mathcal{G}$ and nested learning of $\psi$. Note that $\mathcal{G}$ has quite few structural requirements, and can simply be validated like a neural network structure. Consequently, we fix $\mathcal{G}$ in the following discussion, and cross-validate it in our experiments. Learning $\psi$ is challenging, as $\psi$ has non-trivial structure due to the completeness and decomposability conditions. In the following, we develop a parametrisation of $\psi$ and incorporate it into a Bayesian framework. We first revisit Bayesian parameter learning using a fixed $\psi$.
3.1 Learning Parameters $w$, $\theta$ – Fixing Scope Function $\psi$
The key insight for Bayesian parameter learning Zhao2016; Rashwan2016; Vergari2019 is that sum nodes can be interpreted as latent variables, clustering data instances Poon2011; Zhao2015; Peharz2017. Formally, consider any sum node $\mathsf{S}$ and assume that it has $K_{\mathsf{S}}$ children. For each data instance $\mathbf{x}_n$ and each $\mathsf{S}$, we introduce a latent variable $Z_{\mathsf{S},n}$ with $K_{\mathsf{S}}$ states and categorical distribution given by the weights $w_{\mathsf{S}}$ of $\mathsf{S}$. Intuitively, the sum node $\mathsf{S}$ represents a latent clustering of data instances over its children. Let $\mathbf{Z}_n$ be the collection of all $Z_{\mathsf{S},n}$. In order to establish the interpretation of sum nodes as latent variables, we introduce the notion of induced tree Zhao2016. We omit subscript $n$ when a distinction between data instances is not necessary.
Definition 3 (Induced tree Zhao2016 ).
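Let an SPN $\mathcal{S} = (\mathcal{G}, \psi, w, \theta)$ be given. Consider a subgraph $\mathcal{T}$ of $\mathcal{G}$ obtained as follows: i) for each sum $\mathsf{S} \in \mathcal{G}$, delete all but one outgoing edge and ii) delete all nodes and edges which are now unreachable from the root. Any such $\mathcal{T}$ is called an induced tree of $\mathcal{G}$ (sometimes also denoted as induced tree of $\mathcal{S}$). The SPN distribution can always be written as the mixture
$$\mathcal{S}(\mathbf{x}) = \sum_{\mathcal{T} \sim \mathcal{S}} \; \prod_{(\mathsf{S},\mathsf{N}) \in \mathcal{T}} w_{\mathsf{S},\mathsf{N}} \prod_{\mathsf{L} \in \mathcal{T}} \mathsf{L}(\mathbf{x}_{\mathsf{L}}), \tag{1}$$
where the sum runs over all possible induced trees in $\mathcal{S}$, and $\mathsf{L}(\mathbf{x}_{\mathsf{L}})$ denotes the evaluation of $\mathsf{L}$ on the restriction of $\mathbf{x}$ to $\psi(\mathsf{L})$.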
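The mixture-over-induced-trees view can be verified numerically on a toy example. The sketch below assumes a hypothetical tuple encoding of a small SPN with Bernoulli leaves (illustrative only); it enumerates every induced tree, multiplies the kept sum-weights with the leaf values in that tree, and compares the total with the ordinary bottom-up evaluation.

```python
from itertools import product as cartesian


def evaluate(node, x):
    """Ordinary bottom-up evaluation of the SPN at a complete input x."""
    kind = node[0]
    if kind == "leaf":
        _, d, p = node
        return p if x[d] == 1 else 1.0 - p
    if kind == "sum":
        return sum(w * evaluate(c, x) for w, c in node[1])
    result = 1.0
    for c in node[1]:
        result *= evaluate(c, x)
    return result


def tree_terms(node, x):
    """Yield one value per induced tree below `node`:
    (product of kept sum-weights) * (product of leaf values in the tree)."""
    kind = node[0]
    if kind == "leaf":
        _, d, p = node
        yield p if x[d] == 1 else 1.0 - p
    elif kind == "sum":
        for w, child in node[1]:              # keep exactly one outgoing edge
            for term in tree_terms(child, x):
                yield w * term
    else:                                     # product: keep all children
        child_terms = [list(tree_terms(c, x)) for c in node[1]]
        for combo in cartesian(*child_terms):
            term = 1.0
            for t in combo:
                term *= t
            yield term


inner = ("sum", [(0.5, ("leaf", 1, 0.2)), (0.5, ("leaf", 1, 0.7))])
spn = ("sum", [
    (0.4, ("product", [("leaf", 0, 0.9), inner])),
    (0.6, ("product", [("leaf", 0, 0.3), ("leaf", 1, 0.5)])),
])

x = [1, 0]
terms = list(tree_terms(spn, x))
print(len(terms), sum(terms), evaluate(spn, x))
```

The number of induced trees grows exponentially with depth in general, so this enumeration is only meant as a check of the identity, not as an inference routine.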
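We define a function $T(\mathbf{z})$ which assigns to each value $\mathbf{z}$ of the sum latent variables $\mathbf{Z}$ the induced tree determined by $\mathbf{z}$, i.e. where $\mathbf{z}$ indicates the kept sum edges in Definition 3. Note that the function $T(\mathbf{z})$ is surjective, but not injective, and thus, $T(\mathbf{z})$ is not invertible. However, it is “partially” invertible, in the following sense: Note that any induced tree $\mathcal{T}$ splits the set of all sum nodes $\mathbf{S}$ into two sets, namely the set of sum nodes $\mathbf{S}_{\mathcal{T}}$ which are contained in $\mathcal{T}$, and the set of sum nodes $\bar{\mathbf{S}}_{\mathcal{T}}$ which are not. For any $\mathcal{T}$, we can identify (invert) the state $z_{\mathsf{S}}$ for any $\mathsf{S} \in \mathbf{S}_{\mathcal{T}}$, as it corresponds to the unique child of $\mathsf{S}$ in $\mathcal{T}$. On the other hand, the state of any $\mathsf{S} \in \bar{\mathbf{S}}_{\mathcal{T}}$ is arbitrary. In short, given an induced tree $\mathcal{T}$, we can perfectly retrieve the states of the (latent variables of) sum nodes in $\mathcal{T}$, while the states of the other latent variables are arbitrary.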
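Now, define the conditional distribution $p(\mathbf{x} \mid \mathbf{z}) = \prod_{\mathsf{L} \in T(\mathbf{z})} \mathsf{L}(\mathbf{x}_{\mathsf{L}})$ and prior $p(\mathbf{z}) = \prod_{\mathsf{S} \in \mathbf{S}} w_{\mathsf{S}, z_{\mathsf{S}}}$, where $w_{\mathsf{S}, z_{\mathsf{S}}}$ is the sum-weight indicated by $z_{\mathsf{S}}$. When marginalising $\mathbf{Z}$ from the joint $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})$, we obtain
$$\sum_{\mathbf{z}} \prod_{\mathsf{S} \in \mathbf{S}} w_{\mathsf{S}, z_{\mathsf{S}}} \prod_{\mathsf{L} \in T(\mathbf{z})} \mathsf{L}(\mathbf{x}_{\mathsf{L}}) = \sum_{\mathcal{T}} \sum_{\mathbf{z} \in T^{-1}(\mathcal{T})} \prod_{\mathsf{S} \in \mathbf{S}} w_{\mathsf{S}, z_{\mathsf{S}}} \prod_{\mathsf{L} \in \mathcal{T}} \mathsf{L}(\mathbf{x}_{\mathsf{L}}) \tag{2}$$
$$= \sum_{\mathcal{T}} \prod_{(\mathsf{S},\mathsf{N}) \in \mathcal{T}} w_{\mathsf{S},\mathsf{N}} \prod_{\mathsf{L} \in \mathcal{T}} \mathsf{L}(\mathbf{x}_{\mathsf{L}}) \underbrace{\Big( \sum_{\bar{\mathbf{z}}} \prod_{\mathsf{S} \in \bar{\mathbf{S}}_{\mathcal{T}}} w_{\mathsf{S}, \bar{z}_{\mathsf{S}}} \Big)}_{=1} = \mathcal{S}(\mathbf{x}), \tag{3}$$
establishing the SPN distribution (1) as a latent variable model, with $\mathbf{Z}$ marginalised out. In (2), we split the sum over all $\mathbf{z}$ into the double sum over all induced trees $\mathcal{T}$, and all $\mathbf{z} \in T^{-1}(\mathcal{T})$, where $T^{-1}(\mathcal{T})$ is the pre-image of $\mathcal{T}$ under $T$, i.e. the set of all $\mathbf{z}$ for which $T(\mathbf{z}) = \mathcal{T}$. As discussed above, the set $T^{-1}(\mathcal{T})$ is made up of a unique $\mathbf{z}$-assignment for each $\mathsf{S} \in \mathbf{S}_{\mathcal{T}}$, corresponding to the unique sum-edge $(\mathsf{S}, \mathsf{N}) \in \mathcal{T}$, and all possible assignments $\bar{\mathbf{z}}$ for $\mathsf{S} \in \bar{\mathbf{S}}_{\mathcal{T}}$, leading to (3).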
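It is now conceptually straightforward to extend the model to a Bayesian setting, by equipping the sum-weights $w$ and leaf parameters $\theta$ with suitable priors. In this paper, we assume Dirichlet priors for sum-weights and some parametric form $\mathsf{L}(\cdot \mid \theta_{\mathsf{L}})$ for each leaf, with conjugate prior over $\theta_{\mathsf{L}}$, leading to the following generative model:
$$\begin{aligned} w_{\mathsf{S}} \mid \alpha &\sim \mathrm{Dir}(w_{\mathsf{S}} \mid \alpha) \;\; \forall \mathsf{S}, & z_{\mathsf{S},n} \mid w_{\mathsf{S}} &\sim \mathrm{Cat}(z_{\mathsf{S},n} \mid w_{\mathsf{S}}) \;\; \forall \mathsf{S}\, \forall n, \\ \theta_{\mathsf{L}} \mid \gamma &\sim p(\theta_{\mathsf{L}} \mid \gamma) \;\; \forall \mathsf{L}, & \mathbf{x}_n \mid \mathbf{z}_n, \theta &\sim \prod_{\mathsf{L} \in T(\mathbf{z}_n)} \mathsf{L}(\mathbf{x}_{\mathsf{L},n} \mid \theta_{\mathsf{L}}). \end{aligned} \tag{4}$$
We now extend the model to also comprise the SPN’s “effective” structure, the scope function $\psi$.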
3.2 Jointly Learning $\psi$, $w$ and $\theta$
Given a computational graph $\mathcal{G}$, we wish to learn the scope function $\psi$, in addition to the SPN’s parameters $w$ and $\theta$, and to include it in our generative model (4). In fully general graphs $\mathcal{G}$, representing $\psi$ in an amenable form is rather involved. Therefore, in this paper, we restrict ourselves to a subclass of SPNs which facilitates a natural encoding of $\psi$. In particular, we consider the class of SPNs whose computational graph $\mathcal{G}$ follows a tree-shaped region graph. Region graphs can be understood as a “vectorised” representation of SPNs, and have been used in several SPN learners, e.g. Dennis2012; Peharz2013; Peharz2018.
Definition 4 (Region graph).
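Given a set of random variables $\mathbf{X}$, a region graph is a tuple $(\mathcal{R}, \psi)$ where $\mathcal{R}$ is a connected acyclic directed graph containing two types of nodes: regions ($R$) and partitions ($P$). $\mathcal{R}$ is bipartite w.r.t. these two types of nodes, i.e. children of $R$ are only of type $P$ and vice versa. $\mathcal{R}$ has a single root (node with no parents) of type $R$, and all leaves are also of type $R$. Let $\boldsymbol{R}$ be the set of all $R$ and $\boldsymbol{P}$ be the set of all $P$. The scope function is a function $\psi \colon \boldsymbol{R} \cup \boldsymbol{P} \to 2^{\mathbf{X}}$, with the following properties: 1) If $R \in \boldsymbol{R}$ is the root, then $\psi(R) = \mathbf{X}$. 2) If $Q$ is either a region with children or a partition, then $\psi(Q) = \bigcup_{Q' \in \mathrm{ch}(Q)} \psi(Q')$. 3) For all $P \in \boldsymbol{P}$ we have $\forall\, R \neq R' \in \mathrm{ch}(P) \colon \psi(R) \cap \psi(R') = \emptyset$. 4) For all $R \in \boldsymbol{R}$ we have $\forall\, P \in \mathrm{ch}(R) \colon \psi(R) = \psi(P)$.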
Note that we generalised previous notions of region graph Dennis2012; Peharz2013; Peharz2018, also decoupling its graphical structure $\mathcal{R}$ and the scope function $\psi$ (we are deliberately overloading symbol $\psi$). Given a region graph $(\mathcal{R}, \psi)$, we can easily construct an SPN structure $(\mathcal{G}, \psi)$ as follows. In order to construct the SPN graph $\mathcal{G}$, introduce a single sum node for the root region in $\mathcal{R}$; this sum node will be the output of the SPN. For each leaf region $R$, we introduce $I$ SPN leaves. For each other region $R$, which is neither root nor leaf, we introduce $J$ sum nodes. Both $I$ and $J$ are hyperparameters of the model. For each partition $P$ we introduce all possible cross-products of nodes from $P$’s child regions. More precisely, let $\mathrm{ch}(P) = \{R_1, \dots, R_K\}$. Let $\mathbf{N}_k$ be the assigned set of nodes in each child region $R_k$. Now, we construct all possible cross-products $\mathsf{P} = \mathsf{N}_1 \times \dots \times \mathsf{N}_K$, where $\mathsf{N}_k \in \mathbf{N}_k$, for $1 \leq k \leq K$. Each of these cross-products is connected as a child of each sum node in each parent region of $P$.
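The cross-product step of this construction is easy to illustrate. The sketch below is an assumption-laden toy (node names are hypothetical strings, not the authors' data structures): each product node is represented by the tuple of children it multiplies, one child per child region of the partition.

```python
from itertools import product


def cross_products(child_region_nodes):
    """Return every product node of a partition as a tuple of children,
    taking exactly one node from each child region."""
    return list(product(*child_region_nodes))


left = ["S1", "S2"]          # sum nodes assigned to the first child region
right = ["L1", "L2", "L3"]   # leaves assigned to the second child region
products = cross_products([left, right])
print(len(products), products[0])
```

With two child regions holding 2 and 3 nodes, the partition contributes 2 × 3 = 6 product nodes, so the number of products grows multiplicatively in the number of nodes per region.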
The scope function of the SPN is simply inherited from the $\psi$ of the region graph: any SPN node introduced for a region (partition) gets the same scope as the region (partition) itself. It is easy to check that, if the SPN’s $\mathcal{G}$ follows the region graph $\mathcal{R}$ using the above construction, any proper scope function according to Definition 4 corresponds to a proper scope function according to Definition 2.
In this paper, we consider SPN structures following a tree-shaped region graph $\mathcal{R}$, i.e. each node in $\mathcal{R}$ has at most one parent. Note that $\mathcal{G}$ is in general not tree-shaped in this case, unless only a single node is introduced per region ($I = J = 1$). Further note that this subclass of SPNs is still very expressive, and that many SPN learners, e.g. Gens2013; Peharz2018, also restrict themselves to it.
When the SPN follows a tree-shaped region graph, the scope function can be encoded as follows. Let $P$ be any partition and $R_1, \dots, R_{|\mathrm{ch}(P)|}$ be its children. For each data dimension $d$, we introduce a discrete latent variable $Y_{P,d}$ with $|\mathrm{ch}(P)|$ different states. Intuitively, the latent variable $Y_{P,d}$ represents a decision to assign dimension $d$ to a particular child, given that all partitions “above” have decided to assign $d$ onto the path leading to $P$ (this path is unique, since $\mathcal{R}$ is a tree). More formally we define:
Definition 5 (Induced scope function).
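Let $\mathcal{R}$ be a tree-shaped region graph structure, let $Y_{P,d}$ be defined as above, let $\mathbf{Y} = \{Y_{P,d}\}_{P \in \mathcal{R},\, d \in \{1, \dots, D\}}$, and let $\mathbf{y}$ be any assignment for $\mathbf{Y}$. Let $Q$ denote any node in $\mathcal{R}$, let $\Pi$ be the unique path from the root to $Q$ (exclusive $Q$), and let $R_{y_{P,d}}$ denote the child region of partition $P$ selected by $y_{P,d}$. The scope function induced by $\mathbf{y}$ is defined as:
$$\psi_{\mathbf{y}}(Q) := \big\{ X_d \;\big|\; \forall P \in \Pi \colon R_{y_{P,d}} \in \Pi \cup \{Q\} \big\}, \tag{5}$$
i.e. $\psi_{\mathbf{y}}(Q)$ contains $X_d$ if for each partition $P$ in $\Pi$ also the child indicated by $y_{P,d}$ lies on the path to $Q$ (it is in $\Pi$ or is $Q$ itself).
It is easy to check that for any tree-shaped $\mathcal{R}$ and any $\mathbf{y}$, the induced scope function $\psi_{\mathbf{y}}$ is a proper scope function according to Definition 4.
Conversely, for any proper scope function $\psi$ according to Definition 4, there exists a $\mathbf{y}$ such that $\psi_{\mathbf{y}} = \psi$.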
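The induced scope function can be computed with one top-down pass over the region graph. The following sketch assumes a hypothetical dictionary encoding (region and partition names, and the layout of `y`, are illustrative): a partition inherits the scope of its parent region, and each child region receives exactly those dimensions of the partition whose latent variable selects it.

```python
def induced_scopes(root, region_children, partition_children, y, num_dims):
    """Compute the scope of every region and partition induced by assignments y.

    region_children:    dict region -> list of child partitions
    partition_children: dict partition -> list of child regions
    y:                  dict (partition, dimension) -> index of the selected child
    """
    scopes = {root: frozenset(range(num_dims))}
    stack = [root]
    while stack:
        region = stack.pop()
        for part in region_children.get(region, []):
            scopes[part] = scopes[region]          # a partition shares its parent's scope
            for k, child in enumerate(partition_children[part]):
                # Only dimensions that reached this partition can be passed on.
                scopes[child] = frozenset(
                    d for d in scopes[part] if y[(part, d)] == k)
                stack.append(child)
    return scopes


region_children = {"R0": ["P1", "P2"], "R1": ["P3"]}
partition_children = {"P1": ["R1", "R2"], "P2": ["R3", "R4"], "P3": ["R5", "R6"]}
choices = {"P1": [0, 0, 1], "P2": [1, 0, 0], "P3": [0, 1, 0]}
y = {(p, d): k for p, ks in choices.items() for d, k in enumerate(ks)}

scopes = induced_scopes("R0", region_children, partition_children, y, num_dims=3)
print({name: sorted(s) for name, s in scopes.items()})
```

Note that the assignment of dimension 2 at partition P3 has no effect, since dimension 2 was already routed to R2 by P1. By construction the children of each partition receive disjoint scopes whose union is the partition's scope.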
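We can now incorporate $\mathbf{Y}$ in our model. To this end, we assume Dirichlet priors for each $Y_{P,d}$ and extend the generative model (4) as follows:
$$\begin{aligned} w_{\mathsf{S}} \mid \alpha &\sim \mathrm{Dir}(w_{\mathsf{S}} \mid \alpha) \;\; \forall \mathsf{S}, & z_{\mathsf{S},n} \mid w_{\mathsf{S}} &\sim \mathrm{Cat}(z_{\mathsf{S},n} \mid w_{\mathsf{S}}) \;\; \forall \mathsf{S}\, \forall n, \\ v_{P} \mid \beta &\sim \mathrm{Dir}(v_{P} \mid \beta) \;\; \forall P, & y_{P,d} \mid v_{P} &\sim \mathrm{Cat}(y_{P,d} \mid v_{P}) \;\; \forall P\, \forall d, \\ \theta_{\mathsf{L}} \mid \gamma &\sim p(\theta_{\mathsf{L}} \mid \gamma) \;\; \forall \mathsf{L}, & \mathbf{x}_n \mid \mathbf{z}_n, \mathbf{y}, \theta &\sim \prod_{\mathsf{L} \in T(\mathbf{z}_n)} \mathsf{L}(\mathbf{x}_{\mathbf{y},n} \mid \theta_{\mathsf{L}}). \end{aligned} \tag{6}$$
Here, the notation $\mathsf{L}(\mathbf{x}_{\mathbf{y},n} \mid \theta_{\mathsf{L}})$ denotes evaluation of $\mathsf{L}$ on the scope induced by $\mathbf{y}$.
Furthermore, our Bayesian formulation naturally allows for various nonparametric formulations of SPNs. In particular, one can use the stick-breaking construction sethuraman1994 of a Dirichlet process mixture model with SPNs as mixture components. We illustrate this approach in the experiments.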
4 Sampling-based Inference
Given a set of instances $\mathcal{X} = \{\mathbf{x}_n\}_{n=1}^{N}$, we now aim to draw posterior samples from (6). For this purpose, we perform Gibbs sampling alternating between i) updating the parameters $w$, $\theta$ (fixed $\mathbf{y}$), and ii) updating $\mathbf{y}$ (fixed $w$, $\theta$).
Updating Parameters $w$, $\theta$ (fixed $\mathbf{y}$)
We follow the same procedure as in Vergari2019, i.e. in order to sample $w$ and $\theta$, we first sample assignments $\mathbf{z}_n$ for all the sum latent variables in the SPN, and subsequently sample new $w$ and $\theta$. For given $w$ and $\theta$, each $\mathbf{z}_n$ can be drawn independently for each $n$, and follows standard SPN ancestral sampling. The latent variables which are not visited during ancestral sampling are drawn from the prior. After sampling all $\mathbf{z}_n$, the sum-weights $w_{\mathsf{S}}$ are sampled from a Dirichlet with parameters $\alpha + c_{\mathsf{S}}$, where $c_{\mathsf{S},k} = \sum_{n=1}^{N} \mathbb{1}[z_{\mathsf{S},n} = k]$ counts how often child $k$ was selected. The parameters at leaf nodes can be updated similarly; please see Vergari2019 for further details.
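The sum-weight update is a standard Dirichlet-categorical posterior draw. The sketch below uses only the Python standard library and assumes a symmetric concentration parameter `alpha`; the function name and arguments are hypothetical and not taken from the authors' code. It samples from the Dirichlet by normalising independent Gamma draws.

```python
import random


def sample_sum_weights(assignments, num_children, alpha, rng):
    """Draw new weights for one sum node from Dir(alpha + counts).

    assignments: list with the selected child index z_{S,n} for every instance n
    """
    counts = [0] * num_children
    for k in assignments:
        counts[k] += 1
    # A Dirichlet sample is a vector of independent Gamma(a_k, 1) draws, normalised.
    gammas = [rng.gammavariate(alpha + c, 1.0) for c in counts]
    total = sum(gammas)
    return [g / total for g in gammas]


rng = random.Random(0)
assignments = [0] * 90 + [1] * 10       # child 2 was never selected
weights = sample_sum_weights(assignments, num_children=3, alpha=1.0, rng=rng)
print(weights)
```

Children that attracted many instances receive large weights on average, while unused children keep a small but non-zero weight coming from the prior.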
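Updating the Structure $\mathbf{y}$ (fixed $w$, $\theta$)
We use a simple collapsed Gibbs sampler to sample all assignments $y_{P,d}$ at the partitions. For this purpose, we marginalise out the partition weights $v$ and sample $y_{P,d}$ from the following conditional:
$$p(y_{P,d} = k \mid \mathbf{y}_{P,\setminus d}, \mathbf{y}_{\setminus P}, \mathcal{X}, \mathbf{z}, \beta, \gamma) \propto p(y_{P,d} = k \mid \mathbf{y}_{P,\setminus d}, \beta)\; p(\mathcal{X} \mid y_{P,d} = k, \mathbf{y}_{P,\setminus d}, \mathbf{y}_{\setminus P}, \mathbf{z}, \gamma), \tag{7}$$
where $\mathbf{y}_{\setminus P}$ is the set of assignments at all partitions excluding $P$ and $\mathbf{y}_{P,\setminus d}$ denotes the exclusion of $d$ from $\mathbf{y}_{P}$. The conditional prior follows standard derivations, i.e.
$$p(y_{P,d} = k \mid \mathbf{y}_{P,\setminus d}, \beta) = \frac{\beta_k + m_{P,k}}{\sum_{j=1}^{|\mathrm{ch}(P)|} (\beta_j + m_{P,j})}, \tag{8}$$
where $m_{P,k} = \sum_{d' \neq d \,:\, X_{d'} \in \psi_{\mathbf{y}}(P)} \mathbb{1}[y_{P,d'} = k]$ are component counts. The second term in Equation 7 is simply the product of marginal likelihood terms for each product node in $P$. Intuitively, values for $y_{P,d}$ are more likely if other dimensions have selected the same region (the rich-get-richer property), and if the distributions of the leaves under that region have low variance. Note that the marginal likelihood term in Equation 7 has been found empirically to work better than the likelihood.
Given a set of $M$ samples from the posterior, we can compute predictions for an unseen data point $\mathbf{x}^{*}$ using an approximation of the posterior predictive, i.e.
$$p(\mathbf{x}^{*} \mid \mathcal{X}) \approx \frac{1}{M} \sum_{m=1}^{M} \mathcal{S}(\mathbf{x}^{*} \mid \mathcal{G}, \psi_{\mathbf{y}^{(m)}}, w^{(m)}, \theta^{(m)}),$$
where $\mathcal{S}(\mathbf{x} \mid \mathcal{G}, \psi_{\mathbf{y}}, w, \theta)$ denotes the pdf for an SPN with scope function encoded by $\mathbf{y}$ and parametrised by $w$ and $\theta$. Note that the approximate posterior predictive is again a single SPN: a sum node with uniform weights over the $M$ sampled SPNs.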
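Two ingredients of this section are simple enough to sketch directly: the rich-get-richer conditional prior of a collapsed Dirichlet-categorical model, and the averaging of per-sample densities for the posterior predictive. The code below is an illustrative sketch with hypothetical function names; the marginal likelihood term of the conditional and the SPN evaluation itself are not implemented here and are assumed to be supplied elsewhere as log-densities.

```python
import math


def conditional_prior(counts, beta):
    """Collapsed Dirichlet-categorical conditional for one dimension.

    counts[k]: number of other dimensions currently assigned to child k
    beta[k]:   Dirichlet concentration parameter of child k
    """
    unnormalised = [b + m for b, m in zip(beta, counts)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def log_posterior_predictive(log_densities):
    """Log of the average density over posterior samples (log-mean-exp).

    log_densities[m] is the log-density of the test point under the m-th
    sampled SPN; subtracting the maximum avoids numerical underflow.
    """
    top = max(log_densities)
    mean_scaled = sum(math.exp(v - top) for v in log_densities) / len(log_densities)
    return top + math.log(mean_scaled)


prior = conditional_prior(counts=[3, 1], beta=[1.0, 1.0])
log_pred = log_posterior_predictive([math.log(0.2), math.log(0.4)])
print(prior, math.exp(log_pred))
```

In a full sampler, the prior vector would be multiplied element-wise by the marginal likelihood of each candidate assignment and renormalised before drawing the new value.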
5 Experiments
Dataset | LearnSPN | RAT-SPN | CCCP | ID-SPN | ours | ours (infinite mixture) |

NLTCS  
MSNBC  
KDDCup2k  
Plants  
Audio  
Jester  
Netflix  
Accidents  
Retail  
Pumsbstar  
DNA  
Kosarak  
MSWeb  
Book  
EachMovie  
WebKB  
Reuters52  
20 Newsgrps  
BBC  
AD 
We assessed the performance of our approach on discrete data Gens2013 and on heterogeneous datasets with missing values Vergari2019. (Source code, datasets and predictions are available at https://github.com/trappmartin/BayesianSumProductNetworks.) For this purpose, we constructed a computational graph in the form of a region graph, see Appendix A for details. Since the Bayesian framework is protected against overfitting, we combined training and validation sets and followed classical Bayesian model selection Rasmussen2001. We used a grid search over the parameters of the computational graph, see Appendix B, and used burn-in steps after which we estimated the predictions using samples from the posterior. The best computational graph was selected according to the Bayesian model evidence. For posterior inference in infinite mixtures of SPNs we used the distributed slice sampler Ge2015.
Dataset | MSPN | ABDA | ours | ours (infinite mixture) |

Abalone  
Adult  
Australian  
Autism  
Breast  
Chess  
Crx  
Dermatology  
Diabetes  
German  
Student  
Wine 
Table 1 lists the test log-likelihood scores of state-of-the-art (SOTA) structure learning algorithms, i.e. LearnSPN Gens2013, RAT-SPN Peharz2018, LearnSPN with parameter optimisation (CCCP) Zhao2016b and ID-SPN Rooshenas2014, and the results obtained using Bayesian SPNs (ours) and infinite mixtures of Bayesian SPNs (ours, infinite mixture) on discrete datasets. Significant differences to the best SOTA approach under the Mann-Whitney U test Mann1947 are underlined. Further details on the significance test results are found in Appendix B.4. We can see that Bayesian SPNs and infinite mixtures generally improve over LearnSPN and RAT-SPN. Further, in many cases we observe an improvement over LearnSPN with additional parameter learning, and our approach often obtains results comparable to ID-SPN. Additionally, we conducted experiments on heterogeneous data which have recently been used in the context of mixed SPNs (MSPN) Molina2018 and for ABDA Vergari2019. To handle heterogeneous data, we used mixtures over likelihood functions as leaf distributions, similar to Vergari et al. Vergari2019. Further details on this can be found in Appendix B.2.
Table 2 lists the test log-likelihood scores of all approaches. Note that we did not apply a significance test, as predictions for the existing approaches are not available. In general, our approaches, i.e. Bayesian SPNs and infinite mixtures of Bayesian SPNs, perform comparably to structure learners tailored for modelling heterogeneous datasets and sometimes set a new SOTA. Interestingly, we obtain, by a large margin, better test scores for Autism, which indicates that existing approaches overfit in this case while our Bayesian formulation naturally penalises complex models.
We further evaluated LearnSPN, ID-SPN and Bayesian SPNs on three discrete datasets with artificially introduced missing values in the training and validation set. We compared the test log-likelihood with an increasing number of observations having 50% of their values missing completely at random Polit2008. LearnSPN and ID-SPN have been evaluated by i) removing all observations with missing values and ii) using K-nearest neighbour imputation Beretta2016 (denoted with an asterisk). All methods have been trained using the full training set, i.e. training and validation set combined, and were evaluated using default parameters to ensure a fair comparison across methods and levels of missing values. See Appendix B.3 for further details.
Figure 2 shows the resulting test log-likelihoods obtained in the experiment. In all three cases, our Bayesian model is consistently robust against missing values, while SOTA approaches often suffer from missing values, sometimes even when additional imputation is used.
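The missing-data protocol can be mimicked with a short masking routine. The sketch below is an assumption about how such a corruption could be implemented, not the authors' script: it selects a fraction of the observations uniformly at random and, in each selected observation, replaces a fixed fraction of the values by `None`. Whether the original experiments masked exactly half of the values per observation or masked each value independently with probability 0.5 is not stated in the text.

```python
import random


def mask_mcar(data, obs_fraction, value_fraction, rng):
    """Return a copy of `data` where a random subset of rows has missing values.

    obs_fraction:   fraction of observations (rows) that receive missing values
    value_fraction: fraction of values masked within each selected observation
    """
    n = len(data)
    chosen = set(rng.sample(range(n), round(obs_fraction * n)))
    masked = []
    for i, row in enumerate(data):
        row = list(row)                       # copy, so the input stays untouched
        if i in chosen:
            num_missing = round(value_fraction * len(row))
            for d in rng.sample(range(len(row)), num_missing):
                row[d] = None                 # None marks a missing value
        masked.append(row)
    return masked


rng = random.Random(1)
data = [[0, 1, 0, 1] for _ in range(10)]
masked = mask_mcar(data, obs_fraction=0.4, value_fraction=0.5, rng=rng)
rows_with_missing = [row for row in masked if None in row]
print(len(rows_with_missing))
```

Because the selection depends neither on the observed nor on the unobserved values, the mechanism is missing completely at random.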
6 Conclusion
Structure learning is an important topic in SPNs, and many promising directions have been proposed in recent years. However, most of these approaches are based on intuition and refrain from declaring an explicit and global principle to structure learning. In this paper, our main motivation is to change this practice. To this end, we phrase structure (and joint parameter) learning as Bayesian inference in a latent variable model. Our experiments show that this principled approach competes well with prior art, and that we gain several benefits, such as automatic protection against overfitting and robustness under missing data.
A key insight for our approach is to decompose structure learning into two steps, namely determining a computational graph and separately learning the SPN’s scope function, which determines the “effective” structure of the SPN. We believe that this novel approach will be stimulating for future work. For example, while we used Bayesian inference over the scope function, it could well also be optimised, e.g. with gradient-based techniques.
The Bayesian framework presented in this paper allows several natural extensions, such as parameterisations of the scope function using hierarchical priors, variational inference for large-scale approximate Bayesian inference, and relaxing the necessity of a given computational graph, by incorporating nonparametric priors in all stages of the model formalism.
References
 [1] T. Adel, D. Balduzzi, and A. Ghodsi. Learning the structure of sum-product networks via an SVD-based algorithm. In Proceedings of UAI, 2015.
 [2] L. Beretta and A. Santaniello. Nearest neighbor imputation algorithms: a critical evaluation. BMC Medical Informatics and Decision Making, 16(3):74, 2016.
 [3] C. J. Butz, J. S. Oliveira, A. E. dos Santos, and A. L. Teixeira. Deep convolutional sum-product networks. In Proceedings of AAAI, 2019.
 [4] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309–347, 1992.
 [5] A. W. Dennis and D. Ventura. Learning the architecture of sum-product networks using clustering on variables. In Proceedings of NIPS, pages 2042–2050, 2012.
 [6] A. W. Dennis and D. Ventura. Greedy structure search for sum-product networks. In Proceedings of IJCAI, pages 932–938, 2015.
 [7] N. Friedman and D. Koller. Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50(1–2):95–125, 2003.
 [8] H. Ge, Y. Chen, M. Wan, and Z. Ghahramani. Distributed inference for Dirichlet process mixture models. In Proceedings of ICML, pages 2276–2284, 2015.
 [9] R. Gens and P. Domingos. Discriminative learning of sum-product networks. In Proceedings of NIPS, pages 3248–3256, 2012.
 [10] R. Gens and P. Domingos. Learning the structure of sum-product networks. In Proceedings of ICML, pages 873–880, 2013.
 [11] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.
 [12] A. Kalra, A. Rashwan, W.-S. Hsu, P. Poupart, P. Doshi, and G. Trimponias. Online structure learning for feed-forward and recurrent sum-product networks. In Proceedings of NIPS, pages 6944–6954, 2018.
 [13] H. Kang, C. D. Yoo, and Y. Na. Maximum margin learning of t-SPNs for cell classification with filtered input. IEEE Journal of Selected Topics in Signal Processing, 10(1):130–139, 2016.
 [14] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
 [15] S.-W. Lee, M.-O. Heo, and B.-T. Zhang. Online incremental structure learning of sum-product networks. In Proceedings of NIPS, pages 220–227, 2013.
 [16] H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pages 50–60, 1947.
 [17] A. Molina, A. Vergari, N. Di Mauro, S. Natarajan, F. Esposito, and K. Kersting. Mixed sum-product networks: A deep architecture for hybrid domains. In Proceedings of AAAI, 2018.
 [18] A. Molina, A. Vergari, K. Stelzner, R. Peharz, P. Subramani, N. Di Mauro, P. Poupart, and K. Kersting. SPFlow: An easy and extensible library for deep probabilistic learning using sum-product networks. arXiv preprint arXiv:1901.03704, 2019.
 [19] R. Peharz, B. C. Geiger, and F. Pernkopf. Greedy part-wise learning of sum-product networks. In Proceedings of ECML/PKDD, pages 612–627, 2013.
 [20] R. Peharz, R. Gens, F. Pernkopf, and P. Domingos. On the latent variable interpretation in sum-product networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(10):2030–2044, 2017.
 [21] R. Peharz, S. Tschiatschek, F. Pernkopf, and P. Domingos. On theoretical properties of sum-product networks. In Proceedings of AISTATS, 2015.
 [22] R. Peharz, A. Vergari, K. Stelzner, A. Molina, M. Trapp, K. Kersting, and Z. Ghahramani. Probabilistic deep learning using random sum-product networks. arXiv preprint arXiv:1806.01910, 2018.
 [23] D. F. Polit and C. T. Beck. Nursing Research: Generating and Assessing Evidence for Nursing Practice. Lippincott Williams & Wilkins, 2008.
 [24] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Proceedings of UAI, pages 337–346, 2011.
 [25] A. Rashwan, P. Poupart, and C. Zhitang. Discriminative training of sum-product networks by extended Baum-Welch. In Proceedings of PGM, pages 356–367, 2018.
 [26] A. Rashwan, H. Zhao, and P. Poupart. Online and distributed Bayesian moment matching for parameter learning in sum-product networks. In Proceedings of AISTATS, pages 1469–1477, 2016.
 [27] C. E. Rasmussen and Z. Ghahramani. Occam’s Razor. In Proceedings of NIPS, pages 294–300, 2001.
 [28] A. Rooshenas and D. Lowd. Learning sum-product networks with direct and indirect variable interactions. In Proceedings of ICML, pages 710–718, 2014.
 [29] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, pages 639–650, 1994.
 [30] O. Sharir, R. Tamari, N. Cohen, and A. Shashua. Tractable generative convolutional arithmetic circuits. arXiv preprint arXiv:1610.04167, 2016.
 [31] J. Suzuki. A construction of Bayesian networks from databases based on an MDL principle. In Proceedings of UAI, pages 266–273, 1993.
 [32] M. Trapp, T. Madl, R. Peharz, F. Pernkopf, and R. Trappl. Safe semi-supervised learning of sum-product networks. In Proceedings of UAI, 2017.
 [33] A. Vergari, N. Di Mauro, and F. Esposito. Simplifying, regularizing and strengthening sum-product network structure learning. In Proceedings of ECML/PKDD, pages 343–358, 2015.
 [34] A. Vergari, A. Molina, R. Peharz, Z. Ghahramani, K. Kersting, and I. Valera. Automatic Bayesian density analysis. In Proceedings of AAAI, 2019.
 [35] H. Zhao, T. Adel, G. J. Gordon, and B. Amos. Collapsed variational inference for sum-product networks. In Proceedings of ICML, pages 1310–1318, 2016.
 [36] H. Zhao, M. Melibari, and P. Poupart. On the relationship between sum-product networks and Bayesian networks. In Proceedings of ICML, pages 116–124, 2015.
 [37] H. Zhao, P. Poupart, and G. J. Gordon. A unified approach for learning the parameters of sum-product networks. In Proceedings of NIPS, pages 433–441, 2016.
Appendix A Computational Graph Generation
This section describes the algorithm to generate computational graphs used in the paper. Note that we only consider partitions into two disjoint subregions. Our algorithm can, however, easily be extended for the more general case.
Appendix B Experiments
This section gives further details on the experiments conducted in the course of the paper.
B.1 Setup
As described in the main manuscript, we 1) combined the training and validation set into a single training set, 2) used burn-in steps and estimated the training and testing performance from samples of a chain of samples, and 3) used a grid search over the number of nodes per region, the number of nodes per atomic region, the number of partitions under a region, and the depth, i.e. the number of consecutive region-partition layers, and selected the best configuration according to the model evidence. In the experiments we used the following hyperparameters for the Dirichlet priors: as concentration parameter for all sum nodes, as concentration parameter for all product nodes to enforce partitions into equally sized parts.
We ran all experiments on a high performance cluster using multithreaded computations. The SLURM script and the necessary code and datasets to run the experiments with the respective number of threads can be found on https://github.com/trappmartin/BayesianSumProductNetworks.
B.2 Heterogeneous Experiments
To conduct the heterogeneous data experiments, we introduce mixtures over likelihood functions for each leaf node. In particular, we used the following likelihood and prior constructions in the experiment.
Datatype  Likelihood  Prior 

Continuous  Gaussian i.e.,  
Continuous  Exponential i.e.,  
Discrete  Poisson i.e.,  
Discrete  Categorical i.e.,  
Discrete  Bernoulli i.e., 
Note that we used mixtures of parametric forms as leaf nodes in the model. Therefore, the distribution of each leaf factorises as:
(9) 
where we place a Dirichlet prior with concentration parameter on the weights of the mixture to ensure that only a few component weights are large. This approach is similar to the model in [34].
B.3 Missing Data Experiments
We evaluated the robustness of LearnSPN, ID-SPN and Bayesian SPNs against missing values in the training data. For this purpose, we artificially introduced missing values completely at random in the training and validation set of EachMovie, WebKB and BBC. We evaluated their performance in the cases of 20%, 40%, 60% or 80% of all observations having 50% missing values. All methods have been trained using the full training set, i.e. training and validation set combined, and were evaluated using the following default parameters:
B.4 Statistical Significance Tests
To assess the statistical significance of the reported results we computed the p-value of the Mann-Whitney U test [16]. The Mann-Whitney U test is a nonparametric equivalent of the two-sample t-test which does not require the assumption of normal distributions. The respective p-values obtained from the Mann-Whitney U test for Bayesian SPNs are listed in Table 4 and the p-values for infinite mixtures of Bayesian SPNs are listed in Table 5.
Dataset | LearnSPN | RAT-SPN | ID-SPN |

NLTCS  
MSNBC  
KDDCup2k  
Plants  
Audio  
Jester  
Netflix  
Accidents  
Retail  
Pumsbstar  
DNA  
Kosarak  
MSWeb  
Book  
EachMovie  
WebKB  
Reuters52  
20 Newsgroups  
BBC  
AD 
Dataset | LearnSPN | RAT-SPN | ID-SPN |

NLTCS  
MSNBC  
KDDCup2k  
Plants  
Audio  
Jester  
Netflix  
Accidents  
Retail  
Pumsbstar  
DNA  
Kosarak  
MSWeb  
Book  
EachMovie  
WebKB  
Reuters52  
20 Newsgroups  
BBC  
AD 