Bayesian Learning of Sum-Product Networks

05/26/2019 ∙ Martin Trapp et al. ∙ TU Graz ∙ University of Cambridge

Sum-product networks (SPNs) are flexible density estimators and have received significant attention due to their attractive inference properties. While parameter learning in SPNs is well developed, structure learning leaves something to be desired: even though there is a plethora of SPN structure learners, most of them are somewhat ad-hoc and based on intuition rather than a clear learning principle. In this paper, we introduce a well-principled Bayesian framework for SPN structure learning. First, we decompose the problem into i) laying out a basic computational graph, and ii) learning the so-called scope function over the graph. The first is rather unproblematic and akin to neural network architecture validation. The second characterises the effective structure of the SPN and needs to respect the usual structural constraints in SPNs, i.e. completeness and decomposability. While representing and learning the scope function is rather involved in general, in this paper we propose a natural parametrisation for an important and widely used special case of SPNs. These structural parameters are incorporated into a Bayesian model, such that simultaneous structure and parameter learning is cast into monolithic Bayesian posterior inference. In various experiments, our Bayesian SPNs often improve test likelihoods over greedy SPN learners. Further, since the Bayesian framework protects against overfitting, we can evaluate hyper-parameters directly on the Bayesian model score, waiving the need for a separate validation set, which is especially beneficial in low-data regimes. Bayesian SPNs can be applied to heterogeneous domains and can easily be extended to nonparametric formulations. Moreover, our Bayesian approach is the first that consistently and robustly learns SPN structures under missing data.


1 Introduction

Sum-product networks (SPNs) Poon2011 are a prominent type of deep probabilistic model, as they are a flexible representation for high-dimensional distributions while still allowing fast and exact inference. Learning SPNs can be naturally organised into structure learning and parameter learning, following the same dichotomy as in probabilistic graphical models (PGMs) Koller2009 . As in PGMs, state-of-the-art SPN parameter learning covers a wide range of well-developed techniques. In particular, various maximum likelihood approaches have been proposed, using either gradient-based optimisation Sharir2016 ; Peharz2018 ; Butz2019 ; Molina2019 or expectation-maximisation (and related) schemes Peharz2017 ; Poon2011 ; Zhao2016b . Furthermore, several discriminative criteria, e.g. Gens2012 ; Kang2016 ; Trapp2017 ; Rashwan2018 , as well as Bayesian approaches to parameter learning, e.g. Zhao2016 ; Rashwan2016 ; Vergari2019 , have been developed.

Concerning structure learning, however, the situation is remarkably different. Although there is a plethora of structure learning approaches for SPNs, most of them can be described as heuristic. For example, the most prominent structure learning scheme, LearnSPN Gens2013 , derives an SPN structure by recursively clustering the data instances (yielding sum nodes) and partitioning the data dimensions (yielding product nodes). Each of these steps can be understood as a local structure improvement, an attempt to optimise some local criterion. While LearnSPN is an intuitive scheme and elegantly maps the structural SPN semantics onto an algorithmic procedure, the fact that the global goal of structure learning is never declared is unsatisfying. This principal shortcoming of LearnSPN is shared by its many variants, such as online LearnSPN Lee2013 , ID-SPN Rooshenas2014 , LearnSPN-b Vergari2015 , mixed SPNs Molina2018 , and automatic Bayesian density analysis (ABDA) Vergari2019 . Other approaches also lack a sound learning principle, such as Dennis2012 and Adel2015 , which derive SPN structures from k-means and SVD clustering, respectively, Peharz2013 , which grows SPNs bottom-up using a heuristic based on the information bottleneck, Dennis2015 , which uses a heuristic structure exploration, or Kalra2018 , which uses a variant of hard EM to decide when to enlarge or shrink an SPN structure. All of the above mentioned approaches fall short of posing some fundamental questions: What is a good SPN structure? What is a good principle to derive an SPN structure?

This situation is somewhat surprising, since the literature on PGMs actually offers a rich set of structure learning principles: In PGMs, the main strategy is to optimise a structure score such as minimum description length (MDL) Suzuki1993 , the Bayesian information criterion (BIC) Koller2009 or the Bayesian-Dirichlet (BD) score Cooper1992 ; Heckerman1995 . Moreover, in Friedman2003 an approximate (but asymptotically correct) MCMC sampler was proposed for full Bayesian structure learning.

In this paper, we propose a well-principled Bayesian approach to SPN learning, simultaneously over both structure and parameters. We first decompose the structure learning problem into two steps, namely i) proposing a computational graph, laying out the arrangement of sums, products and leaf distributions, and ii) learning the so-called scope function, which assigns to each node its scope.¹ The first step is straightforward and akin to validating various structures in neural network training.

¹ The scope of a node is the subset of random variables the node is responsible for; it needs to fulfil the so-called completeness and decomposability conditions, see Section 2.

The second step, learning the scope function, is more involved in full generality. However, we propose a parametrisation of the scope function for a widely used special case of SPNs, namely those following a so-called tree-shaped region graph Dennis2012 ; Peharz2018 . In this case, the scope function is elegantly encoded via categorical variables representing a dimension clustering at each partition node in the underlying region graph. While clustering dimensions is often encountered in SPN structure learning Dennis2012 ; Gens2013 , the technique of making this step explicit via latent variables is novel.

Having encoded the scope function via latent variables, Bayesian learning becomes conceptually simple: we equip all categorical latent variables and the SPN's leaves with appropriate priors, and perform monolithic Bayesian inference, implemented via simple collapsed Gibbs updates. After learning, the predictive distribution of Bayesian SPNs can be approximated by averaging over a set of posterior samples, which again can be conveniently represented as a single standard SPN. In summary, our main contributions in this paper are:

  • We propose a novel and well-principled approach to SPN structure learning, by decomposing the problem into finding a computational graph and learning a scope-function.

  • To learn the scope function, we propose a natural parametrisation for an important sub-type of SPNs, which allows us to formulate a joint Bayesian framework simultaneously over structure and parameters.

  • Due to the Bayesian nature of our approach, we introduce several benefits to the SPN toolset: Bayesian SPNs are not prone to overfitting, waiving the necessity of a separate validation set, which is beneficial in low-data regimes. Furthermore, they naturally deal with missing data and are – to the best of our knowledge – the first approach that consistently and robustly learns SPN structures under missing data. Bayesian SPNs can easily be extended to nonparametric formulations, supporting growing data domains.

The paper is organised as follows. Section 2 introduces the required background and Section 3 our Bayesian model for SPN learning. Section 4 discusses sampling-based inference in our model. Section 5 presents experimental results, and Section 6 concludes the paper.

2 Background and Related Work

Let $\mathbf{X} = \{X_1, \dots, X_D\}$ be a set of random variables (RVs), for which $N$ i.i.d. samples are available. Let $x_{n,d}$ be the $n$th observation of the $d$th dimension and $\mathbf{x}_n = (x_{n,1}, \dots, x_{n,D})$. Our goal is to estimate the distribution of $\mathbf{X}$ using a sum-product network (SPN). In the following we review SPNs, but use a more general definition than usual, in order to facilitate our discussion below. In this paper, we define an SPN as a 4-tuple $\mathcal{S} = (\mathcal{G}, \psi, \mathbf{w}, \boldsymbol{\theta})$, where $\mathcal{G}$ is a computational graph, $\psi$ is a scope function, $\mathbf{w}$ is a set of sum-weights, and $\boldsymbol{\theta}$ is a set of leaf parameters. In the following, we explain these terms in more detail.

Definition 1 (Computational graph).

The computational graph $\mathcal{G}$ is a connected acyclic directed graph, containing three types of nodes: sums ($\mathsf{S}$), products ($\mathsf{P}$) and leaves ($\mathsf{L}$). A node in $\mathcal{G}$ has no children if and only if it is of type $\mathsf{L}$. When we do not discriminate between node types, we use $\mathsf{N}$ for a generic node. $\mathbf{S}$, $\mathbf{P}$, $\mathbf{L}$, and $\mathbf{N}$ denote the collections of all $\mathsf{S}$, all $\mathsf{P}$, all $\mathsf{L}$, and all $\mathsf{N}$ in $\mathcal{G}$, respectively. The set of children of node $\mathsf{N}$ is denoted as $\mathbf{ch}(\mathsf{N})$. In this paper, we require that $\mathcal{G}$ has only a single root (node without parents).

Definition 2 (Scope function).

The scope function is a function $\psi \colon \mathbf{N} \to 2^{\mathbf{X}}$, assigning to each node in $\mathcal{G}$ a subset of $\mathbf{X}$ ($2^{\mathbf{X}}$ denotes the power set of $\mathbf{X}$). It has the following properties:

  1. If $\mathsf{N}$ is the root node, then $\psi(\mathsf{N}) = \mathbf{X}$.

  2. If $\mathsf{N}$ is a sum or product, then $\psi(\mathsf{N}) = \bigcup_{\mathsf{N}' \in \mathbf{ch}(\mathsf{N})} \psi(\mathsf{N}')$.

  3. For each $\mathsf{S} \in \mathbf{S}$ we have $\psi(\mathsf{N}') = \psi(\mathsf{N}'')$ for all $\mathsf{N}', \mathsf{N}'' \in \mathbf{ch}(\mathsf{S})$ (completeness).

  4. For each $\mathsf{P} \in \mathbf{P}$ we have $\psi(\mathsf{N}') \cap \psi(\mathsf{N}'') = \emptyset$ for all $\mathsf{N}' \neq \mathsf{N}'' \in \mathbf{ch}(\mathsf{P})$ (decomposability).

Each node $\mathsf{N}$ in $\mathcal{G}$ represents a distribution over the random variables $\psi(\mathsf{N})$, described in the following. Each leaf $\mathsf{L}$ computes a distribution over its scope $\psi(\mathsf{L})$ (for $\psi(\mathsf{L}) = \emptyset$, we set $\mathsf{L} \equiv 1$). We assume that $\mathsf{L}$ is parametrised by $\boldsymbol{\theta}_\mathsf{L}$, and that $\mathsf{L}$ represents a distribution for any possible choice of $\psi(\mathsf{L})$. In the most naive setting, we would maintain a separate parameter set for each of the $2^D$ possible choices of $\psi(\mathsf{L})$, but this would quickly become intractable. In this paper, we simply assume that $\boldsymbol{\theta}_\mathsf{L}$ contains parameters of single-dimensional distributions (e.g. Gaussian, Bernoulli, etc.), and that for a given $\psi(\mathsf{L})$, the represented distribution factorises: $\mathsf{L}(\mathbf{x}) = \prod_{X_d \in \psi(\mathsf{L})} p(x_d \,|\, \theta_{\mathsf{L},d})$. However, more elaborate schemes are possible. Note that our definition of leaves is quite distinct from prior art: previously, leaves were defined to be distributions over a fixed scope; our leaves define at all times distributions over all possible scopes. The set $\boldsymbol{\theta}$ denotes the collection of parameters of all leaf nodes. A sum node $\mathsf{S}$ computes a weighted sum $\mathsf{S}(\mathbf{x}) = \sum_{\mathsf{N} \in \mathbf{ch}(\mathsf{S})} w_{\mathsf{S},\mathsf{N}} \mathsf{N}(\mathbf{x})$. Each weight is non-negative and can w.l.o.g. Peharz2015 ; Zhao2015 be assumed to be normalised: $w_{\mathsf{S},\mathsf{N}} \geq 0$, $\sum_{\mathsf{N} \in \mathbf{ch}(\mathsf{S})} w_{\mathsf{S},\mathsf{N}} = 1$. The set of all sum-weights of $\mathsf{S}$ is denoted as $\mathbf{w}_\mathsf{S}$, and $\mathbf{w}$ denotes the set of all sum-weights in the SPN. A product node simply computes $\mathsf{P}(\mathbf{x}) = \prod_{\mathsf{N} \in \mathbf{ch}(\mathsf{P})} \mathsf{N}(\mathbf{x})$.

The two conditions we require from $\psi$ – completeness and decomposability – ensure that each node is a well-defined distribution over $\psi(\mathsf{N})$. The distribution represented by $\mathcal{S}$ is defined to be the distribution of the root node in $\mathcal{G}$, and denoted as $\mathcal{S}(\mathbf{x})$. Furthermore, completeness and decomposability are key to rendering many inference scenarios tractable in SPNs. In particular, arbitrary marginalisation tasks reduce to marginalisation tasks at the leaves, i.e. they simplify to several marginalisation tasks over (small) subsets of $\mathbf{X}$, while the evaluation of the internal part (sums and products) amounts to a simple feed-forward pass Peharz2015 . Thus, exact marginalisation can be computed in time linear in the size of the SPN (assuming constant-time marginalisation at the leaves). Conditioning can be tackled similarly. Note that marginalisation and conditioning are key inference routines in probabilistic reasoning, so that SPNs are generally referred to as tractable probabilistic models.
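To make the preceding definitions concrete, here is a minimal Python sketch (not the authors' implementation; all class and function names are illustrative) that evaluates a small SPN bottom-up and checks completeness and decomposability for a fixed scope assignment, with Gaussian leaves that factorise over their scope.

```python
import numpy as np
from scipy.stats import norm

class Leaf:
    def __init__(self, scope, means, stds):
        # scope: set of dimension indices; Gaussian parameters per dimension
        self.scope, self.means, self.stds = set(scope), means, stds

    def value(self, x):
        # factorised univariate Gaussians over the scope (1.0 if the scope is empty)
        return float(np.prod([norm.pdf(x[d], self.means[d], self.stds[d]) for d in self.scope]))

class Product:
    def __init__(self, children):
        self.children = children
        self.scope = set().union(*(c.scope for c in children))

    def value(self, x):
        return float(np.prod([c.value(x) for c in self.children]))

class Sum:
    def __init__(self, children, weights):
        assert np.isclose(sum(weights), 1.0), "sum-weights must be normalised"
        self.children, self.weights = children, weights
        self.scope = set().union(*(c.scope for c in children))

    def value(self, x):
        return sum(w * c.value(x) for w, c in zip(self.weights, self.children))

def is_valid(node):
    """Completeness: children of a sum node share one scope.
    Decomposability: children of a product node have pairwise disjoint scopes."""
    if isinstance(node, Leaf):
        return True
    if isinstance(node, Sum) and len({frozenset(c.scope) for c in node.children}) != 1:
        return False
    if isinstance(node, Product):
        seen = set()
        for c in node.children:
            if seen & c.scope:
                return False
            seen |= c.scope
    return all(is_valid(c) for c in node.children)

# A tiny SPN over X = {X0, X1}: a mixture of two factorised Gaussians.
l0, l1 = Leaf({0}, {0: -1.0}, {0: 1.0}), Leaf({1}, {1: 0.0}, {1: 1.0})
l2, l3 = Leaf({0}, {0: 2.0}, {0: 0.5}), Leaf({1}, {1: 1.0}, {1: 0.5})
root = Sum([Product([l0, l1]), Product([l2, l3])], [0.3, 0.7])
assert is_valid(root)
print(root.value(np.array([0.0, 0.5])))
```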

3 Bayesian Sum-Product Networks

Note that all previous works defined $\mathcal{G}$ and $\psi$ in an entangled way, i.e. the scope was seen as an inherent property of the nodes in $\mathcal{G}$. In this paper, we propose to decouple these two aspects of SPN structure: searching over $\mathcal{G}$ and nested learning of $\psi$. Note that $\mathcal{G}$ has quite few structural requirements, and can simply be validated like a neural network structure. Consequently, we fix $\mathcal{G}$ in the following discussion, and cross-validate it in our experiments. Learning $\psi$ is challenging, as $\psi$ has non-trivial structure due to the completeness and decomposability conditions. In the following, we develop a parametrisation of $\psi$ and incorporate it into a Bayesian framework. We first revisit Bayesian parameter learning using a fixed $\psi$.

3.1 Learning Parameters $\mathbf{w}$, $\boldsymbol{\theta}$ – Fixing Scope Function $\psi$

The key insight for Bayesian parameter learning Zhao2016 ; Rashwan2016 ; Vergari2019 is that sum nodes can be interpreted as latent variables, clustering data instances Poon2011 ; Zhao2015 ; Peharz2017 . Formally, consider any sum node $\mathsf{S}$ and assume that it has $K_\mathsf{S}$ children. For each data instance $\mathbf{x}_n$ and each $\mathsf{S}$, we introduce a latent variable $z_{n,\mathsf{S}}$ with $K_\mathsf{S}$ states and categorical distribution given by the weights of $\mathsf{S}$. Intuitively, the sum node represents a latent clustering of data instances over its children. Let $\mathbf{z}_n$ be the collection of all $z_{n,\mathsf{S}}$. In order to establish the interpretation of sum nodes as latent variables, we introduce the notion of induced tree Zhao2016 . We omit the sub-script $n$ when a distinction between data instances is not necessary.

Definition 3 (Induced tree Zhao2016 ).

Let an SPN $\mathcal{S} = (\mathcal{G}, \psi, \mathbf{w}, \boldsymbol{\theta})$ be given. Consider a sub-graph $\mathcal{T}$ of $\mathcal{G}$ obtained as follows: i) for each sum node $\mathsf{S}$, delete all but one outgoing edge and ii) delete all nodes and edges which are now unreachable from the root. Any such $\mathcal{T}$ is called an induced tree of $\mathcal{G}$ (sometimes also denoted as induced tree of $\mathcal{S}$). The SPN distribution can always be written as the mixture

$$\mathcal{S}(\mathbf{x}) = \sum_{\mathcal{T}} \prod_{(\mathsf{S},\mathsf{N}) \in \mathcal{T}} w_{\mathsf{S},\mathsf{N}} \prod_{\mathsf{L} \in \mathcal{T}} \mathsf{L}(\mathbf{x}_{\psi(\mathsf{L})}), \quad (1)$$

where the sum runs over all possible induced trees $\mathcal{T}$ in $\mathcal{G}$, $(\mathsf{S},\mathsf{N})$ ranges over the sum-edges contained in $\mathcal{T}$, and $\mathsf{L}(\mathbf{x}_{\psi(\mathsf{L})})$ denotes the evaluation of $\mathsf{L}$ on the restriction of $\mathbf{x}$ to $\psi(\mathsf{L})$.

We define a function $\mathcal{T}(\mathbf{z})$ which assigns to each value of $\mathbf{z}$ the induced tree determined by $\mathbf{z}$, i.e. the state $z_\mathsf{S}$ indicates the kept sum edge of $\mathsf{S}$ in Definition 3. Note that the function $\mathcal{T}(\cdot)$ is surjective, but not injective, and thus, is not invertible. However, it is “partially” invertible, in the following sense: Note that any $\mathcal{T}$ splits the set of all sum nodes into two sets, namely the set of sum nodes which are contained in $\mathcal{T}$, and the set of sum nodes which are not. For any $\mathsf{S} \in \mathcal{T}$, we can identify (invert) the state of $z_\mathsf{S}$, as it corresponds to the unique child of $\mathsf{S}$ in $\mathcal{T}$. On the other hand, the state of any $\mathsf{S} \notin \mathcal{T}$ is arbitrary. In short, given an induced tree $\mathcal{T}$, we can perfectly retrieve the states of the (latent variables of) sum nodes in $\mathcal{T}$, while the states of the other latent variables are arbitrary.
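The map from z to induced trees and its partial invertibility can be illustrated with a short sketch (reusing the illustrative Sum, Product and Leaf classes from the snippet above): following z from the root visits exactly the sum nodes contained in the induced tree, whose states are therefore recoverable, while unvisited sum nodes have arbitrary states.

```python
def induced_tree(root, z):
    """Follow an assignment z through the SPN: at every reached sum node keep
    only the child indicated by z[id(node)]; product nodes keep all children.
    Returns the reached sum nodes with their (recoverable) states and the
    reached leaves; unreached sum nodes are exactly those whose latent state
    is arbitrary."""
    reached_sums, reached_leaves, stack = {}, [], [root]
    while stack:
        node = stack.pop()
        if isinstance(node, Leaf):
            reached_leaves.append(node)
        elif isinstance(node, Sum):
            k = z[id(node)]                 # the single kept sum-edge
            reached_sums[node] = k
            stack.append(node.children[k])
        else:                               # Product: all children stay in the tree
            stack.extend(node.children)
    return reached_sums, reached_leaves

# e.g. z = {id(root): 1} selects the second mixture component of the SPN above
```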

Now, define the conditional distribution $p(\mathbf{x} \,|\, \mathbf{z}) = \prod_{\mathsf{L} \in \mathcal{T}(\mathbf{z})} \mathsf{L}(\mathbf{x}_{\psi(\mathsf{L})})$ and prior $p(\mathbf{z}) = \prod_{\mathsf{S}} w_{\mathsf{S}, z_\mathsf{S}}$, where $w_{\mathsf{S}, z_\mathsf{S}}$ is the sum-weight indicated by $z_\mathsf{S}$. When marginalising $\mathbf{z}$ from the joint $p(\mathbf{x}, \mathbf{z})$, we yield

$$p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{x} \,|\, \mathbf{z}) \, p(\mathbf{z}) = \sum_{\mathcal{T}} \sum_{\mathbf{z} \in \mathcal{T}^{-1}(\mathcal{T})} \prod_{\mathsf{L} \in \mathcal{T}} \mathsf{L}(\mathbf{x}_{\psi(\mathsf{L})}) \prod_{\mathsf{S}} w_{\mathsf{S}, z_\mathsf{S}} \quad (2)$$

$$= \sum_{\mathcal{T}} \prod_{(\mathsf{S},\mathsf{N}) \in \mathcal{T}} w_{\mathsf{S},\mathsf{N}} \prod_{\mathsf{L} \in \mathcal{T}} \mathsf{L}(\mathbf{x}_{\psi(\mathsf{L})}), \quad (3)$$

establishing the SPN distribution (1) as a latent variable model, with $\mathbf{z}$ marginalised out. In (2), we split the sum over all $\mathbf{z}$ into the double sum over all induced trees $\mathcal{T}$, and all $\mathbf{z} \in \mathcal{T}^{-1}(\mathcal{T})$, where $\mathcal{T}^{-1}(\mathcal{T})$ is the pre-image of $\mathcal{T}$ under $\mathcal{T}(\cdot)$, i.e. the set of all $\mathbf{z}$ for which $\mathcal{T}(\mathbf{z}) = \mathcal{T}$. As discussed above, this set is made up of a unique z-assignment for each $\mathsf{S} \in \mathcal{T}$, corresponding to the unique sum-edge $(\mathsf{S}, \mathsf{N}) \in \mathcal{T}$, and all possible assignments for $\mathsf{S} \notin \mathcal{T}$; summing out the latter (their weights sum to one) leads to (3).

It is now conceptually straightforward to extend the model to a Bayesian setting, by equipping the sum-weights and leaf parameters with suitable priors. In this paper, we assume Dirichlet priors for the sum-weights and some parametric form $p(\mathbf{x} \,|\, \boldsymbol{\theta}_\mathsf{L})$ for each leaf, with conjugate prior over $\boldsymbol{\theta}_\mathsf{L}$, leading to the following generative model:

$$\mathbf{w}_\mathsf{S} \sim \mathrm{Dir}(\alpha), \qquad \boldsymbol{\theta}_\mathsf{L} \sim p(\boldsymbol{\theta}_\mathsf{L} \,|\, \gamma), \qquad z_{n,\mathsf{S}} \,|\, \mathbf{w}_\mathsf{S} \sim \mathrm{Cat}(\mathbf{w}_\mathsf{S}), \qquad \mathbf{x}_n \,|\, \mathbf{z}_n, \boldsymbol{\theta} \sim \prod_{\mathsf{L} \in \mathcal{T}(\mathbf{z}_n)} \mathsf{L}(\mathbf{x}_n \,|\, \boldsymbol{\theta}_\mathsf{L}). \quad (4)$$

We now extend the model to also comprise the SPN's “effective” structure, the scope function $\psi$.

Figure 1: Plate notation of our latent variable model for Bayesian structure and parameter learning.

3.2 Jointly Learning $\psi$, $\mathbf{w}$ and $\boldsymbol{\theta}$

Given a computational graph $\mathcal{G}$, we wish to learn $\psi$, in addition to the SPN's parameters $\mathbf{w}$ and $\boldsymbol{\theta}$, and adopt it in our generative model (4). For fully general graphs $\mathcal{G}$, representing $\psi$ in an amenable form is rather involved. Therefore, in this paper, we restrict ourselves to a sub-class of SPNs which facilitates a natural encoding of $\psi$. In particular, we consider the class of SPNs whose computational graph follows a tree-shaped region graph. Region graphs can be understood as a “vectorised” representation of SPNs, and have been used in several SPN learners, e.g. Dennis2012 ; Peharz2013 ; Peharz2018 .

Definition 4 (Region graph).

Given a set of random variables $\mathbf{X}$, a region graph is a tuple $(\mathcal{R}, \psi)$, where $\mathcal{R}$ is a connected acyclic directed graph containing two types of nodes: regions ($\mathsf{R}$) and partitions ($\mathsf{P}$). $\mathcal{R}$ is bipartite w.r.t. these two types of nodes, i.e. children of regions are only of type $\mathsf{P}$ and vice versa. $\mathcal{R}$ has a single root (node with no parents) of type $\mathsf{R}$, and all leaves are also of type $\mathsf{R}$. Let $\mathbf{R}$ be the set of all $\mathsf{R}$ and $\mathbf{P}$ be the set of all $\mathsf{P}$. The scope function is a function $\psi \colon \mathbf{R} \cup \mathbf{P} \to 2^{\mathbf{X}}$, with the following properties: 1) If $\mathsf{R}$ is the root, then $\psi(\mathsf{R}) = \mathbf{X}$. 2) If $\mathsf{N}$ is either a region with children or a partition, then $\psi(\mathsf{N}) = \bigcup_{\mathsf{N}' \in \mathbf{ch}(\mathsf{N})} \psi(\mathsf{N}')$. 3) For all $\mathsf{R} \in \mathbf{R}$ we have $\psi(\mathsf{P}') = \psi(\mathsf{P}'')$ for all $\mathsf{P}', \mathsf{P}'' \in \mathbf{ch}(\mathsf{R})$. 4) For all $\mathsf{P} \in \mathbf{P}$ we have $\psi(\mathsf{R}') \cap \psi(\mathsf{R}'') = \emptyset$ for all $\mathsf{R}' \neq \mathsf{R}'' \in \mathbf{ch}(\mathsf{P})$.

Note that we generalised previous notions of region graphs Dennis2012 ; Peharz2013 ; Peharz2018 , also decoupling the graphical structure and the scope function (we are deliberately overloading the symbol $\psi$). Given a region graph $(\mathcal{R}, \psi)$, we can easily construct an SPN structure as follows. In order to construct the SPN graph $\mathcal{G}$, introduce a single sum node for the root region in $\mathcal{R}$; this sum node will be the output of the SPN. For each leaf region, we introduce $K_\mathsf{L}$ SPN leaves. For each other region, which is neither root nor leaf, we introduce $K_\mathsf{S}$ sum nodes. Both $K_\mathsf{S}$ and $K_\mathsf{L}$ are hyper-parameters of the model. For each partition $\mathsf{P}$ we introduce all possible cross-products of nodes from $\mathsf{P}$'s child regions. More precisely, let $\mathbf{ch}(\mathsf{P}) = \{\mathsf{R}_1, \dots, \mathsf{R}_K\}$ and let $\mathbf{N}_1, \dots, \mathbf{N}_K$ be the sets of nodes assigned to the child regions $\mathsf{R}_1, \dots, \mathsf{R}_K$. Now, we construct all possible cross-products $\mathsf{N}_1 \times \dots \times \mathsf{N}_K$, where $\mathsf{N}_k \in \mathbf{N}_k$, for $k = 1, \dots, K$. Each of these cross-products is connected as a child of each sum node in each parent region of $\mathsf{P}$.
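As an illustration of this construction, the following sketch (reusing the illustrative classes from Section 2; the nested-dictionary encoding of the region graph is ours, with leaf scopes taken from the region graph's ψ) assembles an SPN with K_leaf leaves per leaf region, K_sum sum nodes per internal region, a single sum node at the root, and all cross-products at each partition.

```python
import itertools

def build_spn(region, K_sum=2, K_leaf=2, is_root=True):
    """Construct the SPN nodes for one region of a tree-shaped region graph.
    `region` is a nested dictionary {'scope': set, 'partitions': [[child, child], ...]}.
    A leaf region (no partitions) yields K_leaf leaves, an internal region yields
    K_sum sum nodes (a single one at the root), and every partition contributes
    all cross-products of the nodes of its child regions."""
    if not region['partitions']:
        # leaf region: K_leaf leaves with placeholder Gaussian parameters
        return [Leaf(region['scope'],
                     {d: 0.0 for d in region['scope']},
                     {d: 1.0 for d in region['scope']}) for _ in range(K_leaf)]
    products = []
    for partition in region['partitions']:
        child_nodes = [build_spn(child, K_sum, K_leaf, is_root=False) for child in partition]
        products += [Product(list(combo)) for combo in itertools.product(*child_nodes)]
    n_sums = 1 if is_root else K_sum
    weights = [1.0 / len(products)] * len(products)
    return [Sum(products, weights) for _ in range(n_sums)]
```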

The scope function of the SPN is simply inherited from the $\psi$ of the region graph: any SPN node introduced for a region (partition) gets the same scope as the region (partition) itself. It is easy to check that, if the SPN's $\mathcal{G}$ follows $\mathcal{R}$ using the above construction, any proper scope function according to Definition 4 corresponds to a proper scope function according to Definition 2.

In this paper, we consider SPN structures following a tree-shaped region graph $\mathcal{R}$, i.e. each node in $\mathcal{R}$ has at most one parent. Note that $\mathcal{G}$ is in general not tree-shaped in this case, unless $K_\mathsf{S} = K_\mathsf{L} = 1$. Further note that this sub-class of SPNs is still very expressive, and that many SPN learners, e.g. Gens2013 ; Peharz2018 , also restrict themselves to it.

When the SPN follows a tree-shaped region graph, the scope function can be encoded as follows. Let $\mathsf{P}$ be any partition and $\mathsf{R}_1, \dots, \mathsf{R}_{K_\mathsf{P}}$ be its children. For each data dimension $d \in \{1, \dots, D\}$, we introduce a discrete latent variable $v_{d,\mathsf{P}}$ with $K_\mathsf{P}$ different states. Intuitively, the latent variable represents a decision to assign dimension $d$ to a particular child of $\mathsf{P}$, given that all partitions “above” $\mathsf{P}$ have decided to assign $d$ onto the path leading to $\mathsf{P}$ (this path is unique, since $\mathcal{R}$ is a tree). More formally we define:

Definition 5 (Induced scope function).

Let $\mathcal{R}$ be a tree-shaped region graph structure, let the variables $v_{d,\mathsf{P}}$ be defined as above, let $\mathbf{v} = \{v_{d,\mathsf{P}}\}_{d,\mathsf{P}}$, and consider any assignment of $\mathbf{v}$. Let $\mathsf{N}$ denote any node in $\mathcal{R}$, and let $\mathrm{path}(\mathsf{N})$ be the unique path from the root to $\mathsf{N}$ (exclusive $\mathsf{N}$). The scope function induced by $\mathbf{v}$ is defined as

$$\psi_{\mathbf{v}}(\mathsf{N}) = \big\{ X_d \in \mathbf{X} \,:\, \text{for every partition } \mathsf{P} \in \mathrm{path}(\mathsf{N}), \text{ the child of } \mathsf{P} \text{ indicated by } v_{d,\mathsf{P}} \text{ lies in } \mathrm{path}(\mathsf{N}) \cup \{\mathsf{N}\} \big\}, \quad (5)$$

i.e. $\psi_{\mathbf{v}}(\mathsf{N})$ contains $X_d$ if, for each partition $\mathsf{P}$ on the path, the child indicated by $v_{d,\mathsf{P}}$ is also on the path.

It is easy to check that for any tree-shaped $\mathcal{R}$ and any assignment of $\mathbf{v}$, the induced scope function $\psi_{\mathbf{v}}$ is a proper scope function according to Definition 4. Conversely, for any proper scope function $\psi$ according to Definition 4, there exists a $\mathbf{v}$ such that $\psi = \psi_{\mathbf{v}}$.
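A compact sketch of the induced scope function under the same illustrative region-graph encoding: each dimension d is routed down the tree by the assignments v, and the scope of a node is exactly the set of dimensions whose route passes through it.

```python
def induced_scopes(region, dims, v):
    """Compute the induced scope psi_v for every region and partition of a
    tree-shaped region graph (same nested-dictionary encoding as above).
    v[(id(partition), d)] gives the index of the child region to which
    dimension d is routed at that partition (illustrative encoding)."""
    scopes = {id(region): set(dims)}
    for partition in region['partitions']:
        scopes[id(partition)] = set(dims)       # a partition keeps its region's scope
        for k, child in enumerate(partition):
            child_dims = {d for d in dims if v[(id(partition), d)] == k}
            scopes.update(induced_scopes(child, child_dims, v))
    return scopes

# the root region receives all dimensions, e.g. induced_scopes(graph, range(D), v)
```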

We can now incorporate $\mathbf{v}$ into our model. To this end, we assume Dirichlet priors for (the categorical distributions of) the $v_{d,\mathsf{P}}$ and extend the generative model (4) as follows:

$$\boldsymbol{\nu}_\mathsf{P} \sim \mathrm{Dir}(\beta), \qquad v_{d,\mathsf{P}} \,|\, \boldsymbol{\nu}_\mathsf{P} \sim \mathrm{Cat}(\boldsymbol{\nu}_\mathsf{P}), \qquad \mathbf{w}_\mathsf{S} \sim \mathrm{Dir}(\alpha), \qquad \boldsymbol{\theta}_\mathsf{L} \sim p(\boldsymbol{\theta}_\mathsf{L} \,|\, \gamma), \qquad z_{n,\mathsf{S}} \,|\, \mathbf{w}_\mathsf{S} \sim \mathrm{Cat}(\mathbf{w}_\mathsf{S}), \qquad \mathbf{x}_n \,|\, \mathbf{z}_n, \mathbf{v}, \boldsymbol{\theta} \sim \prod_{\mathsf{L} \in \mathcal{T}(\mathbf{z}_n)} \mathsf{L}(\mathbf{x}_n \,|\, \boldsymbol{\theta}_\mathsf{L}, \psi_{\mathbf{v}}). \quad (6)$$

Here, the notation $\mathsf{L}(\mathbf{x}_n \,|\, \boldsymbol{\theta}_\mathsf{L}, \psi_{\mathbf{v}})$ denotes evaluation of $\mathsf{L}$ on the scope induced by $\mathbf{v}$.

Furthermore, our Bayesian formulation naturally allows for various nonparametric formulations of SPNs. In particular, one can use the stick-breaking construction sethuraman1994 of a Dirichlet process mixture model with SPNs as mixture components. We illustrate this approach in the experiments.
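As a hint of how such a nonparametric extension might look, the following sketch draws truncated stick-breaking weights; the component SPNs themselves are left abstract, and all names are illustrative.

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng=np.random.default_rng(0)):
    """Truncated stick-breaking construction (Sethuraman, 1994):
    beta_k ~ Beta(1, alpha), pi_k = beta_k * prod_{j<k} (1 - beta_j)."""
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining

pi = stick_breaking_weights(alpha=1.0, truncation=20)
# mixture density: p(x) = sum_k pi_k * S_k(x), with one Bayesian SPN S_k per component
```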

4 Sampling-based Inference

Given a set of $N$ instances $\{\mathbf{x}_n\}_{n=1}^{N}$, we now aim to draw posterior samples from (6). For this purpose, we perform Gibbs sampling, alternating between i) updating the parameters $\mathbf{w}$, $\boldsymbol{\theta}$ (for fixed $\mathbf{v}$), and ii) updating $\mathbf{v}$ (for fixed $\mathbf{w}$, $\boldsymbol{\theta}$).

Updating the Parameters $\mathbf{w}$, $\boldsymbol{\theta}$ (fixed $\mathbf{v}$)

We follow the same procedure as in Vergari2019 , i.e. in order to sample $\mathbf{w}$ and $\boldsymbol{\theta}$, we first sample assignments for all the sum latent variables $\mathbf{z}_n$ in the SPN, and subsequently sample new $\mathbf{w}$ and $\boldsymbol{\theta}$. For given $\mathbf{w}$ and $\boldsymbol{\theta}$, each $\mathbf{z}_n$ can be drawn independently for each instance $\mathbf{x}_n$, following standard SPN ancestral sampling. The latent variables which are not visited during ancestral sampling are drawn from the prior. After sampling all $\mathbf{z}_n$, the sum-weights of each sum node $\mathsf{S}$ are sampled from a Dirichlet with parameters $\alpha_k + N_{\mathsf{S},k}$, where $N_{\mathsf{S},k}$ is the number of instances assigned to the $k$th child of $\mathsf{S}$. The parameters at the leaf nodes can be updated similarly; please see Vergari2019 for further details.
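For a single sum node, the conjugate Dirichlet update described above could look as follows (illustrative sketch; variable names are ours).

```python
import numpy as np

def resample_sum_weights(z_assignments, K, alpha, rng=np.random.default_rng(1)):
    """Gibbs step for one sum node: given the sampled assignments z of the data
    instances routed through this node, draw new weights from the conjugate
    Dirichlet posterior Dir(alpha + counts)."""
    counts = np.bincount(z_assignments, minlength=K)
    return rng.dirichlet(alpha + counts)

# e.g. a sum node with K = 3 children and a symmetric prior concentration of 1:
new_w = resample_sum_weights(np.array([0, 0, 2, 1, 0]), K=3, alpha=np.ones(3))
```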

Updating the Structure $\mathbf{v}$ (fixed $\mathbf{w}$, $\boldsymbol{\theta}$)

We use a simple collapsed Gibbs sampler to sample all assignments $\mathbf{v}$ at the partitions. For this purpose, we marginalise out the Dirichlet-distributed probabilities $\boldsymbol{\nu}_\mathsf{P}$ and sample each $v_{d,\mathsf{P}}$ from the following conditional:

$$p(v_{d,\mathsf{P}} = k \,|\, \mathbf{v}_{\setminus (d,\mathsf{P})}, \mathbf{X}, \boldsymbol{\theta}) \;\propto\; p(v_{d,\mathsf{P}} = k \,|\, \mathbf{v}_{\setminus (d,\mathsf{P})}) \; p(\mathbf{X} \,|\, v_{d,\mathsf{P}} = k, \mathbf{v}_{\setminus (d,\mathsf{P})}), \quad (7)$$

where $\mathbf{v}_{\setminus (d,\mathsf{P})}$ denotes the exclusion of $v_{d,\mathsf{P}}$ from $\mathbf{v}$. The conditional prior follows standard derivations, i.e.

$$p(v_{d,\mathsf{P}} = k \,|\, \mathbf{v}_{\setminus (d,\mathsf{P})}) = \frac{N_{\mathsf{P},k}^{\setminus d} + \beta}{\sum_{k'} \big( N_{\mathsf{P},k'}^{\setminus d} + \beta \big)}, \quad (8)$$

where the $N_{\mathsf{P},k}^{\setminus d}$ are component counts, i.e. the number of other dimensions currently assigned to child $k$ of $\mathsf{P}$. The second term in Equation 7 is simply the product of marginal likelihood terms for each product node in $\mathsf{P}$. Intuitively, values for $v_{d,\mathsf{P}}$ are more likely if other dimensions have selected the same region (a rich-get-richer property) and if the distributions of the leaves under the resulting scope have low variance. Note that the marginal likelihood term in Equation 7 has empirically been shown to work better than using the likelihood.
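The collapsed Gibbs step for a single assignment can be sketched as follows; the marginal-likelihood term is left as a caller-supplied placeholder, and all names are illustrative.

```python
import numpy as np

def gibbs_update_v(v_counts, beta, log_marglik, rng=np.random.default_rng(2)):
    """Collapsed Gibbs step for the assignment of one dimension d at one partition.
    v_counts[k]: how many *other* dimensions currently choose child region k
    (the rich-get-richer conditional prior of Eq. (8), concentration beta);
    log_marglik(k): log marginal likelihood of the affected product nodes if
    dimension d were assigned to child k (placeholder supplied by the caller)."""
    K = len(v_counts)
    log_post = np.log(v_counts + beta) + np.array([log_marglik(k) for k in range(K)])
    log_post -= log_post.max()                      # for numerical stability
    probs = np.exp(log_post)
    return rng.choice(K, p=probs / probs.sum())

# e.g. 2 child regions, 5 vs. 2 other dimensions assigned, and a dummy marginal likelihood:
new_assignment = gibbs_update_v(np.array([5.0, 2.0]), beta=0.5, log_marglik=lambda k: -10.0 * k)
```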

Given a set of samples $\{(\mathbf{v}^{(i)}, \mathbf{w}^{(i)}, \boldsymbol{\theta}^{(i)})\}_{i=1}^{I}$ from the posterior, we can compute predictions for an unseen data point $\mathbf{x}^*$ using an approximation of the posterior predictive, i.e.

$$p(\mathbf{x}^* \,|\, \mathbf{X}) \approx \frac{1}{I} \sum_{i=1}^{I} \mathcal{S}\big(\mathbf{x}^* \,|\, \mathbf{v}^{(i)}, \mathbf{w}^{(i)}, \boldsymbol{\theta}^{(i)}\big),$$

where $\mathcal{S}(\cdot \,|\, \mathbf{v}^{(i)}, \mathbf{w}^{(i)}, \boldsymbol{\theta}^{(i)})$ denotes the pdf of an SPN with scope function encoded by $\mathbf{v}^{(i)}$ and parametrised by $\mathbf{w}^{(i)}$ and $\boldsymbol{\theta}^{(i)}$. Note that the approximate posterior predictive is again a single SPN.
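A sketch of this approximation, assuming SPN objects with a value method as in the earlier snippets (illustrative, not the released implementation):

```python
import numpy as np

def log_posterior_predictive(x, posterior_spns):
    """Approximate log p(x* | data) by averaging the densities of the SPNs
    assembled from posterior samples (each sample fixes v, w and theta).
    The average is itself a single SPN: a new root sum node with uniform
    weights over the per-sample SPNs."""
    densities = np.array([spn.value(x) for spn in posterior_spns])
    return np.log(densities.mean())
```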

5 Experiments

Dataset LearnSPN RAT-SPN CCCP ID-SPN ours ours
NLTCS
MSNBC
KDDCup2k
Plants
Audio
Jester
Netflix
Accidents
Retail
Pumsb-star
DNA
Kosarak
MSWeb
Book
EachMovie
WebKB
Reuters-52
20 Newsgrps
BBC
AD
Table 1: Average test log-likelihoods on discrete datasets using SOTA methods, Bayesian SPNs (ours) and infinite mixtures of SPNs (ours). Significant differences are underlined. The overall best result is in bold, and arrows indicate whether Bayesian SPNs perform better or worse than the respective method.

We assessed the performance of our approach on discrete datasets Gens2013 and on heterogeneous datasets with missing values Vergari2019 .² For this purpose, we constructed a computational graph in the form of a region graph, see Appendix A for details. Since the Bayesian framework is protected against overfitting, we combined training and validation sets and followed classical Bayesian model selection Rasmussen2001 . We used a grid search over the parameters of the computational graph, see Appendix B, and used burn-in steps after which we estimated the predictions using samples from the posterior. The best computational graph was selected according to the Bayesian model evidence. For posterior inference in infinite mixtures of SPNs we used the distributed slice sampler Ge2015 .

² Source code, datasets and predictions are available under: https://github.com/trappmartin/BayesianSumProductNetworks

Dataset MSPN ABDA ours ours
Abalone
Adult
Australian
Autism
Breast
Chess
Crx
Dermatology
Diabetes
German
Student
Wine
Table 2: Average test log-likelihoods on heterogeneous datasets using SOTA methods, Bayesian SPNs (ours) and infinite mixtures of SPNs (ours). The overall best result is in bold, and arrows indicate whether Bayesian SPNs perform better or worse than the respective method.

Table 1 lists the test log-likelihood scores of state-of-the-art (SOTA) structure learning algorithms, i.e. LearnSPN Gens2013 , RAT-SPN Peharz2018 , LearnSPN with parameter optimisation (CCCP) Zhao2016b and ID-SPN Rooshenas2014 , and the results obtained using Bayesian SPNs (ours) and infinite mixtures of Bayesian SPNs (ours) on discrete datasets. Significant differences to the best SOTA approach under the Mann-Whitney U test Mann1947 are underlined. Further details on the significance test results are found in Appendix B.4. We can see that Bayesian SPNs and infinite mixtures generally improve over LearnSPN and RAT-SPN. Further, in many cases we observe an improvement over LearnSPN with additional parameter learning, and our approach often obtains results comparable to ID-SPN. Additionally, we conducted experiments on heterogeneous data which have recently been used in the context of mixed SPNs (MSPN) Molina2018 and for ABDA Vergari2019 . To handle heterogeneous data, we used mixtures over likelihood functions as leaf distributions, similar to Vergari et al. Vergari2019 . Further details on this can be found in Appendix B.2.

Table 2 lists the test log-likelihood scores of all approaches. Note that we did not apply a significance test, as predictions for the existing approaches are not available. In general, our approaches, i.e. Bayesian SPNs and infinite mixtures of Bayesian SPNs, perform comparably to structure learners tailored to modelling heterogeneous datasets and sometimes set a new SOTA. Interestingly, we obtain better test scores for Autism by a large margin, which indicates that existing approaches overfit in this case while our Bayesian formulation naturally penalises complex models.

We further evaluated LearnSPN, ID-SPN and Bayesian SPNs on three discrete datasets with artificially introduced missing values in the training and validation set. We compared the test log-likelihood with an increasing number of observations having 50% of their values missing completely at random Polit2008 . LearnSPN and ID-SPN have been evaluated by i) removing all observations with missing values and ii) using k-nearest neighbour imputation Beretta2016 (denoted with an asterisk). All methods have been trained using the full training set, i.e. training and validation set combined, and were evaluated using default parameters to ensure a fair comparison across methods and levels of missing values. See Appendix B.3 for further details.

Figure 2 shows the resulting test log-likelihoods obtained in the experiment. In all three cases, our Bayesian model is consistently robust against missing values, while SOTA approaches often suffer from missing values, sometimes even when additional imputation is used.

(a) EachMovie (D: 500, N: 5526)
(b) WebKB (D: 839, N: 3361)
(c) BBC (D: 1058, N: 1895)
Figure 2: Performance under missing values for discrete datasets with increasing dimensionality (D) (a-c). Results for LearnSPN are shown using dashed lines and results for ID-SPN using dotted lines. Our approach is indicated using solid lines. An asterisk indicates the use of additional k-NN imputation.

6 Conclusion

Structure learning is an important topic in SPNs, and many promising directions have been proposed in recent years. However, most of these approaches are based on intuition and refrain from declaring an explicit and global principle to structure learning. In this paper, our main motivation is to change this practice. To this end, we phrase structure (and joint parameter) learning as Bayesian inference in a latent variable model. Our experiments show that this principled approach competes well with prior art, and that we gain several benefits, such as automatic protection against overfitting and robustness under missing data.

A key insight for our approach is to decompose structure learning into two steps, namely determining a computational graph and separately learning the SPN’s scope function – determining the “effective” structure of the SPN. We believe that this novel approach will be stimulating for future work. For example, while we used Bayesian inference over the scope function, it could well also be optimised, e.g. with gradient-based techniques.

The Bayesian framework presented in this paper allows several natural extensions, such as parameterisations of the scope-function using hierarchical priors, variational inference for large-scale approximate Bayesian inference, and relaxing the necessity of a given computational graph, by incorporating nonparametric priors in all stages of the model formalism.

References

  • [1] T. Adel, D. Balduzzi, and A. Ghodsi. Learning the structure of sum-product networks via an SVD-based algorithm. In Proceedings of UAI, 2015.
  • [2] L. Beretta and A. Santaniello. Nearest neighbor imputation algorithms: a critical evaluation. BMC medical informatics and decision making, 16(3):74, 2016.
  • [3] C. J. Butz, J. S. Oliveira, A. E. dos Santos, and A. L. Teixeira. Deep convolutional sum-product networks. In Proceedings of AAAI, 2019.
  • [4] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine learning, 9(4):309–347, 1992.
  • [5] A. W. Dennis and D. Ventura. Learning the architecture of sum-product networks using clustering on variables. In Proceedings of NIPS, pages 2042–2050, 2012.
  • [6] A. W. Dennis and D. Ventura. Greedy structure search for sum-product networks. In Proceedings of IJCAI, pages 932–938, 2015.
  • [7] N. Friedman and D. Koller. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50(1-2):95–125, 2003.
  • [8] H. Ge, Y. Chen, M. Wan, and Z. Ghahramani. Distributed inference for Dirichlet process mixture models. In Proceedings of ICML, pages 2276–2284, 2015.
  • [9] R. Gens and P. Domingos. Discriminative learning of sum-product networks. In Proceedings of NIPS, pages 3248–3256, 2012.
  • [10] R. Gens and P. Domingos. Learning the structure of sum-product networks. Proceedings of ICML, pages 873–880, 2013.
  • [11] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine learning, 20(3):197–243, 1995.
  • [12] A. Kalra, A. Rashwan, W.-S. Hsu, P. Poupart, P. Doshi, and G. Trimponias. Online structure learning for feed-forward and recurrent sum-product networks. In Proceedings of NIPS, pages 6944–6954, 2018.
  • [13] H. Kang, C. D. Yoo, and Y. Na. Maximum margin learning of t-SPNs for cell classification with filtered input. IEEE Journal of Selected Topics in Signal Processing, 10(1):130–139, 2016.
  • [14] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
  • [15] S.-W. Lee, M.-O. Heo, and B.-T. Zhang. Online incremental structure learning of sum-product networks. In Proceedings of NIPS, pages 220–227, 2013.
  • [16] H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics, pages 50–60, 1947.
  • [17] A. Molina, A. Vergari, N. Di Mauro, S. Natarajan, F. Esposito, and K. Kersting. Mixed sum-product networks: A deep architecture for hybrid domains. In Proceedings of AAAI, 2018.
  • [18] A. Molina, A. Vergari, K. Stelzner, R. Peharz, P. Subramani, N. Di Mauro, P. Poupart, and K. Kersting. SPFlow: An easy and extensible library for deep probabilistic learning using sum-product networks. arXiv preprint arXiv:1901.03704, 2019.
  • [19] R. Peharz, B. C. Geiger, and F. Pernkopf. Greedy part-wise learning of sum-product networks. In Proceedings of ECML/PKDD, pages 612–627, 2013.
  • [20] R. Peharz, R. Gens, F. Pernkopf, and P. Domingos. On the latent variable interpretation in sum-product networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(10):2030–2044, 2017.
  • [21] R. Peharz, S. Tschiatschek, F. Pernkopf, and P. Domingos. On theoretical properties of sum-product networks. In Proceedings of AISTATS, 2015.
  • [22] R. Peharz, A. Vergari, K. Stelzner, A. Molina, M. Trapp, K. Kersting, and Z. Ghahramani. Probabilistic deep learning using random sum-product networks. arXiv preprint arXiv:1806.01910, 2018.
  • [23] D. F. Polit and C. T. Beck. Nursing research: Generating and assessing evidence for nursing practice. Lippincott Williams & Wilkins, 2008.
  • [24] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Proceedings of UAI, pages 337–346, 2011.
  • [25] A. Rashwan, P. Poupart, and C. Zhitang. Discriminative training of sum-product networks by extended Baum-Welch. In Proceedings of PGM, pages 356–367, 2018.
  • [26] A. Rashwan, H. Zhao, and P. Poupart. Online and distributed Bayesian moment matching for parameter learning in sum-product networks. In Proceedings of AISTATS, pages 1469–1477, 2016.
  • [27] C. E. Rasmussen and Z. Ghahramani. Occam’s Razor. In Proceedings of NIPS, pages 294–300, 2001.
  • [28] A. Rooshenas and D. Lowd. Learning sum-product networks with direct and indirect variable interactions. In Proceedings of ICML, pages 710–718, 2014.
  • [29] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica sinica, pages 639–650, 1994.
  • [30] O. Sharir, R. Tamari, N. Cohen, and A. Shashua. Tractable generative convolutional arithmetic circuits. arXiv preprint arXiv:1610.04167, 2016.
  • [31] J. Suzuki. A construction of Bayesian networks from databases based on an MDL principle. In Proceedings of UAI, pages 266–273, 1993.
  • [32] M. Trapp, T. Madl, R. Peharz, F. Pernkopf, and R. Trappl. Safe semi-supervised learning of sum-product networks. In Proceedings of UAI, 2017.
  • [33] A. Vergari, N. Di Mauro, and F. Esposito. Simplifying, regularizing and strengthening sum-product network structure learning. In Proceedings of ECML/PKDD, pages 343–358, 2015.
  • [34] A. Vergari, A. Molina, R. Peharz, Z. Ghahramani, K. Kersting, and I. Valera. Automatic Bayesian density analysis. In Proceedings of AAAI, 2019.
  • [35] H. Zhao, T. Adel, G. J. Gordon, and B. Amos. Collapsed variational inference for sum-product networks. In Proceedings of ICML, pages 1310–1318, 2016.
  • [36] H. Zhao, M. Melibari, and P. Poupart. On the relationship between sum-product networks and Bayesian networks. In Proceedings of ICML, pages 116–124, 2015.
  • [37] H. Zhao, P. Poupart, and G. J. Gordon. A unified approach for learning the parameters of sum-product networks. In Proceedings of NIPS, pages 433–441, 2016.

Appendix A Computational Graph Generation

This section describes the algorithm used to generate the computational graphs in the paper. Note that we only consider partitions into two disjoint sub-regions. Our algorithm can, however, easily be extended to the more general case.

Input: dimensionality $D$ of the dataset, number of sum nodes per region $K_{\mathrm{region}}$, number of leaf nodes per atomic region $K_{\mathrm{atomic}}$, number of partitions per region $K_{\mathrm{partition}}$, and depth $M$.
function buildAtomicRegion($K_{\mathrm{atomic}}$)
      $R \gets$ empty atomic region.
      for $k = 1, \dots, K_{\mathrm{atomic}}$ do
          Equip $R$ with a distribution node (factorising over the dimensions later assigned by the scope function).
      end for
      return $R$
end function
function buildRegion($K_{\mathrm{region}}$, $K_{\mathrm{atomic}}$, $K_{\mathrm{partition}}$, $M$)
      $R \gets$ empty region.
      for $j = 1, \dots, K_{\mathrm{partition}}$ do
          $P_j \gets$ buildPartition($K_{\mathrm{region}}$, $K_{\mathrm{atomic}}$, $K_{\mathrm{partition}}$, $M$)
          Make $P_j$ a child of $R$.
      end for
      Let $\mathbf{N}$ be the set of all product nodes of all $P_j$.
      for $k = 1, \dots, K_{\mathrm{region}}$ do
          Equip $R$ with a sum node over $\mathbf{N}$.
      end for
      return $R$
end function
function buildPartition($K_{\mathrm{region}}$, $K_{\mathrm{atomic}}$, $K_{\mathrm{partition}}$, $M$)
      $P \gets$ empty partition.
      if $M = 1$ then
          $R_1 \gets$ buildAtomicRegion($K_{\mathrm{atomic}}$)
          $R_2 \gets$ buildAtomicRegion($K_{\mathrm{atomic}}$)
      else
          $R_1 \gets$ buildRegion($K_{\mathrm{region}}$, $K_{\mathrm{atomic}}$, $K_{\mathrm{partition}}$, $M - 1$)
          $R_2 \gets$ buildRegion($K_{\mathrm{region}}$, $K_{\mathrm{atomic}}$, $K_{\mathrm{partition}}$, $M - 1$)
      end if
      Make $R_1$ and $R_2$ children of $P$.
      Let $\mathbf{N}_1$, $\mathbf{N}_2$ be the nodes of $R_1$ and $R_2$.
      for $\mathsf{N} \in \mathbf{N}_1$, $\mathsf{N}' \in \mathbf{N}_2$ do
          Equip $P$ with the product node $\mathsf{N} \times \mathsf{N}'$.
      end for
      return $P$
end function
return buildRegion($K_{\mathrm{region}}$, $K_{\mathrm{atomic}}$, $K_{\mathrm{partition}}$, $M$)
Algorithm 1: Generation of a Computational Graph
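For concreteness, the recursion of Algorithm 1 can be mirrored in Python as follows (an illustrative sketch, not the released implementation); it only generates the region/partition skeleton, since scopes are induced by the assignments v and the node counts per region enter when the SPN is assembled (cf. the build_spn sketch in Section 3.2).

```python
def build_region(n_partitions, depth):
    """Recursively generate a tree-shaped region-graph skeleton, mirroring
    Algorithm 1: every internal region receives `n_partitions` partitions,
    each splitting into two sub-regions; at depth 0 the region is atomic.
    Scopes are left unset here; they are later induced by the assignments v."""
    if depth == 0:
        return {'scope': None, 'partitions': []}          # atomic region
    return {'scope': None,
            'partitions': [[build_region(n_partitions, depth - 1),
                            build_region(n_partitions, depth - 1)]
                           for _ in range(n_partitions)]}

graph = build_region(n_partitions=2, depth=2)
```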

Appendix B Experiments

This section gives further details on the experiments conducted in the course of the paper.

B.1 Setup

As described in the main manuscript: 1) we combined the training and validation set into a single training set; 2) we used burn-in steps and estimated the training and testing performance from posterior samples of a single chain; 3) we used a grid search over the number of nodes per region, the number of nodes per atomic region, the number of partitions under a region, and the depth (i.e. the number of consecutive region-partition layers), and selected the best configuration according to the model evidence. In the experiments we used the following hyper-parameters for the Dirichlet priors: one concentration parameter for all sum nodes and one concentration parameter for all product nodes, the latter chosen to enforce partitions into equally sized parts.

We ran all experiments on a high performance cluster using multi-threaded computations. The SLURM script and the necessary code and datasets to run the experiments with the respective number of threads can be found on https://github.com/trappmartin/BayesianSumProductNetworks.

B.2 Heterogeneous Experiments

To conduct the heterogeneous data experiments, we introduce mixtures over likelihood functions for each leaf node. In particular, we used the following likelihood and prior constructions in the experiment.

Datatype	Likelihood	Prior
Continuous	Gaussian	conjugate prior on mean and variance
Continuous	Exponential	conjugate (Gamma) prior on the rate
Discrete	Poisson	conjugate (Gamma) prior on the rate
Discrete	Categorical	conjugate (Dirichlet) prior on the event probabilities
Discrete	Bernoulli	conjugate (Beta) prior on the success probability
Table 3: Likelihood functions and priors used for the heterogeneous data experiments.

Note that we used mixtures of parametric forms as leaf nodes in the model. Therefore, the distribution of each leaf factorises as

$$\mathsf{L}(\mathbf{x} \,|\, \boldsymbol{\theta}_\mathsf{L}) = \prod_{X_d \in \psi(\mathsf{L})} \sum_{k} w_{d,k} \, p_k(x_d \,|\, \theta_{d,k}), \quad (9)$$

where we place a Dirichlet prior with a small concentration parameter on the mixture weights $w_{d,k}$, to ensure that only a few component weights are large. This approach is similar to the model in [34].
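A minimal sketch of such a leaf for one continuous dimension, assuming Gaussian and Exponential candidate likelihoods as in Table 3 (illustrative; the concentration value and all names are ours):

```python
import numpy as np
from scipy import stats

class HeterogeneousLeafDim:
    """One dimension of a leaf for heterogeneous data: a mixture over candidate
    likelihood functions (here Gaussian and Exponential, as for a continuous
    dimension in Table 3), with mixture weights drawn from a sparse Dirichlet
    prior so that only a few components carry large weight."""
    def __init__(self, concentration=0.1, rng=np.random.default_rng(4)):
        self.weights = rng.dirichlet([concentration, concentration])
        self.components = [stats.norm(loc=0.0, scale=1.0), stats.expon(scale=1.0)]

    def pdf(self, x):
        return sum(w * c.pdf(x) for w, c in zip(self.weights, self.components))

print(HeterogeneousLeafDim().pdf(0.7))
```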

B.3 Missing Data Experiments

We evaluated the robustness of LearnSPN, ID-SPN and Bayesian SPNs against missing values in the training data. For this purpose, we artificially introduced missing values completely at random in the training and validation set of EachMovie, WebKB and BBC. We evaluated their performance in the cases of 20%, 40%, 60% or 80% of all observations having 50% missing values. All methods have been trained using the full training set, i.e. training and validation set combined, and were evaluated using the following default parameters:

  • LearnSPN: cluster penalty = , significance threshold = as described in [10].

  • ID-SPN: See [28] for the default settings.

  • Bayesian SPN: nodes per region , nodes per atomic region , partitions under a region , depth

B.4 Statistical Significance Tests

To assess the statistical significance of the reported results we computed the p-value of the Mann-Whitney U test [16]. The Mann-Whitney U test is a nonparametric equivalent of the two-sample t-test which does not require the assumption of normally distributed data. The respective p-values obtained from the Mann-Whitney U test for Bayesian SPNs are listed in Table 4 and the p-values for infinite mixtures of Bayesian SPNs are listed in Table 5.

Dataset LearnSPN RAT-SPN ID-SPN
NLTCS
MSNBC
KDDCup2k
Plants
Audio
Jester
Netflix
Accidents
Retail
Pumsb-star
DNA
Kosarak
MSWeb
Book
EachMovie
WebKB
Reuters-52
20 Newsgroups
BBC
AD
Table 4: Mann-Whitney U test p-values of Bayesian SPNs (BSPN) compared with LearnSPN, RAT-SPN and ID-SPN. Values below the threshold are underlined.
Dataset LearnSPN RAT-SPN ID-SPN
NLTCS
MSNBC
KDDCup2k
Plants
Audio
Jester
Netflix
Accidents
Retail
Pumsb-star
DNA
Kosarak
MSWeb
Book
EachMovie
WebKB
Reuters-52
20 Newsgroups
BBC
AD
Table 5: Mann-Whitney U test p-values of infinite mixtures of Bayesian SPNs (ISPN) compared with LearnSPN, RAT-SPN and ID-SPN.