Conditional Sum-Product Networks: Imposing Structure on Deep Probabilistic Architectures

by Xiaoting Shao, et al.

Bayesian networks are a central tool in machine learning and artificial intelligence, and make use of conditional independencies to impose structure on joint distributions. However, they are generally not as expressive as deep learning models and inference is hard and slow. In contrast, deep probabilistic models such as sum-product networks (SPNs) capture joint distributions in a tractable fashion, but use little interpretable structure. Here, we extend the notion of SPNs towards conditional distributions, which combine simple conditional models into high-dimensional ones. As shown in our experiments, the resulting conditional SPNs can be naturally used to impose structure on deep probabilistic models and allow for mixed data types, while maintaining fast and efficient inference.



1 Introduction

Probabilistic models Koller and Friedman (2009) are a fundamental approach in machine learning and artificial intelligence to distill meaningful representations from data with inherent structure. In practice, however, it has been challenging to come up with probabilistic models that are expressive enough to capture the complexity of real-world distributions, while still allowing for tractable inference. Meanwhile, advances in probabilistic deep learning have shown that tractable models like arithmetic circuits Darwiche (2003); Choi and Darwiche (2017) can be used to capture complex distributions, while using little interpretable structure.

Here we explore the intersection of structured probabilistic models and probabilistic deep learning. Prior work on deep generative neural methods such as variational autoencoders (VAEs) Kingma and Welling (2014) and generative adversarial networks (GANs) Goodfellow et al. (2014) has been mostly unstructured, and has therefore yielded models that, despite producing impressive samples, have lacked interpretable meaning. Furthermore, these models have generally limited capabilities when it comes to probabilistic inference. Sum-product networks (SPNs) Darwiche (2003); Poon and Domingos (2011) are a rich family of hierarchical latent variable models Zhao et al. (2015); Peharz et al. (2017) allowing for tractable inference.

Their structure, however, is mainly a modeling trick, and also lacks interpretable meaning. On the other hand, classical structured probabilistic models are not as expressive as deep learning models, and inference is generally hard and slow. Consequently, in this paper, we aim to combine the advantages of these approaches and extend the notion of sum-product networks to conditional probability distributions.

Specifically, we introduce conditional sum-product networks (CSPNs), which recursively construct a high-dimensional conditional probability model via a combination of smaller conditional models. Thereby, they maintain a broad set of exact and tractable inference routines for queries of higher complexity than those which can be answered by the independent smaller models. Moreover, since CSPNs can be naturally combined with SPNs, one can easily impose a rich structure on high-dimensional joint distributions.

Learning CSPNs from data, however, requires different decomposition and conditioning steps than for SPNs. Here we present a learning algorithm tailored towards nonparametric conditional distributions and we make the following contributions:

(1) We introduce a deep model for computing multivariate, conditional probabilities where the different variables might even belong to different distribution families.

(2) We present a structure learning algorithm for the deep conditional distributions based on randomized conditional correlation tests (RCoT) Strobl et al. (2017)—the first application of them to learning deep probabilistic models.

(3) We define a novel type of mixture nodes with functional weights that increase the capacity of the CSPNs while maintaining tractability.

On several real-world data sets, we demonstrate the effectiveness of CSPNs and compare against the state of the art. To illustrate how to impose structure on deep probabilistic models, we devise Autoregressive Block-wise Conditional Sum-Product Networks (ABCSPNs), the first autoregressive model for image generation based on CSPNs.

We proceed as follows. We start by introducing CSPNs. Then we show how to learn their structure from data using RCoT, introduce autoregressive CSPNs, and discuss further related work. Before concluding, we present our experiments.

2 Conditional Sum-Product Networks

We denote random variables (RVs) as upper-case letters, e.g., X; their values as lower-case letters, e.g., x; and sets of RVs in bold, e.g., X. In the following, we employ Y to denote the target RVs, also called labels, while denoting the disjoint set of observed RVs, also called features, as X.

Sum-Product Networks (SPNs) Darwiche (2003); Poon and Domingos (2011) are deep tractable probabilistic models decomposing a joint probability distribution via a directed acyclic graph (DAG) comprising sum, product and leaf nodes. Under some restrictions, SPNs can model high-treewidth distributions while preserving exact inference for a range of queries in time polynomial in the network size. For a detailed overview of SPNs, please refer to Peharz et al. (2017); Vergari et al. (2019). In this paper, we explore Conditional Sum-Product Networks (CSPNs), a formulation of SPNs for modeling conditional distributions P(Y | X). Intuitively, we rewrite P(Y | X) as an SPN over Y whose parameters are a function of the input: θ = f(x). That is, we learn an SPN with functional parameters. One can then account for the functional dependencies in the structure of the CSPN. This motivates the following definition.

Definition of CSPNs. A CSPN is a rooted DAG of sum, gating, product and leaf nodes, encoding the conditional probability distribution P(Y | X). Each leaf encodes a normalized univariate conditional distribution P(Y | X) over a target RV Y, denoting its conditional scope. A sum node defines the mixture model S(Y | X) = Σ_i w_i P_i(Y | X), where P_i(Y | X) is the conditional probability modeled by its i-th child node and the nonnegative weights w_i sum to one. A product node factorizes a conditional probability distribution over its children, i.e., P(Y | X) = Π_i P_i(Y_i | X), where Y = ∪_i Y_i. A gating node computes G(Y | X) = Σ_i g_i(X) P_i(Y | X), where g_i(X) is the output of a nonnegative function w.r.t. the i-th child node, such that Σ_i g_i(X) = 1. The conditional scope of a non-leaf node is the union of the scopes of its children. Fig. 1 provides an example of a CSPN.
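To make the node semantics concrete, here is a minimal, self-contained Python sketch of the four node types. All class names are illustrative, and the conditional-Gaussian leaf (whose mean depends linearly on x) merely stands in for any normalized univariate conditional model.

```python
import numpy as np

class Leaf:
    """Univariate conditional leaf: a conditional Gaussian whose mean
    is a linear function of x (a stand-in for any normalized model)."""
    def __init__(self, idx, w, sigma=1.0):
        self.idx, self.w, self.sigma = idx, np.asarray(w), sigma
    def pdf(self, y, x):
        mu = self.w @ x                      # functional parameter theta(x)
        z = (y[self.idx] - mu) / self.sigma
        return np.exp(-0.5 * z * z) / (self.sigma * np.sqrt(2 * np.pi))

class Sum:
    """Mixture with constant, normalized weights."""
    def __init__(self, weights, children):
        self.weights, self.children = np.asarray(weights), children
    def pdf(self, y, x):
        return sum(w * c.pdf(y, x) for w, c in zip(self.weights, self.children))

class Product:
    """Factorization over children with disjoint conditional scopes."""
    def __init__(self, children):
        self.children = children
    def pdf(self, y, x):
        return float(np.prod([c.pdf(y, x) for c in self.children]))

class Gating:
    """Functional sum node: weights g_i(x) >= 0 with sum_i g_i(x) = 1."""
    def __init__(self, gate, children):
        self.gate, self.children = gate, children  # gate: x -> simplex vector
    def pdf(self, y, x):
        g = self.gate(x)
        return sum(gi * c.pdf(y, x) for gi, c in zip(g, self.children))
```

A small CSPN is then assembled by composing these nodes bottom-up, e.g., a gating node over products of leaves.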

First, one might note that a CSPN is still an SPN over the labels Y, in which the observed RVs X are always equally "accessible" to all nodes. Moreover, gating nodes can be interpreted as functional sum nodes, i.e., mixture models whose mixing weights are not constants. This is akin to the gates in mixtures of experts Shazeer et al. (2017), hence the name.

From this interpretation, we can extend the notions of completeness and decomposability of SPNs Poon and Domingos (2011) to CSPNs, in order to reuse efficient SPN inference routines, while guaranteeing any conditional marginal distribution to be computed exactly Rooshenas and Lowd (2016).

Tractable Inference in CSPNs. A CSPN is conditionally complete iff the conditional scope of each sum and gating node equals the conditional scopes of its children. A CSPN is conditionally decomposable iff the children of each product node do not have overlapping conditional scopes.

A conditionally complete and decomposable CSPN effectively models a tractable conditional distribution, i.e., one can compute P(y | x) for arbitrary x and y. Indeed, after observing x, a CSPN turns into an SPN comprising only leaf, product and sum nodes with constant weights. Analogously, to perform most-probable-explanation (MPE) inference over Y, one can leverage approximate Viterbi-like inference, effectively evaluating the CSPN only twice. See Poon and Domingos (2011); Peharz et al. (2017) for a discussion.




Figure 1: An example of a valid CSPN. Here, the Y_i are the target variables in Y, and X is the set of conditioning variables. The structure represents the conditional distribution P(Y | X).

CSPNs are more expressive than SPNs. As an intuitive argument for why CSPNs are more expressively efficient than SPNs, we leverage the framework of Sharir and Shashua (2017). Consider the simple case of modeling a stochastic process under the Markov assumption, i.e., P(X_t | X_{t-1}). Here we are interested in modeling the transitions of a d-dimensional state from time t-1 to time t.

Using SPNs, to answer a conditional query we would still be required to learn the joint distribution P(X_t, X_{t-1}) and then to marginalize over the observed variables X.

However, any potential SPN structure representing such a joint distribution that includes at least one product node destroys the flow of information from time t-1 to time t in an unrecoverable way. To see why, consider any product node: we can always find a variable in the scope of one child node that, by the decomposability property, was separated from a variable now in the scope of another child node. This implies that a label is rendered independent of a feature, which may disrupt the capability of the SPN to make good predictions. The SPN can still represent the joint distribution by adding children to its sum nodes encoding different independence assumptions, but this can increase the size of the network significantly.

CSPNs behave differently, since each node, including the leaf nodes, can have access to all information in X. Not only can CSPNs encode this type of problem; inference can also be faster than in SPNs, as they do not have to marginalize and only need to traverse the graph once. Moreover, CSPNs extend, e.g., GLMs from a single response variable to multiple ones via their graphical structure. In this sense, CSPNs can also be viewed as multi-output regression or classification models that unify different architectures into one framework while maintaining tractability.

3 Learning Conditional SPNs

While it is possible to craft a conditionally complete and decomposable CSPN structure by hand, doing so would require domain knowledge and weight learning afterwards Poon and Domingos (2011). Here, we introduce a simple structure learning strategy extending the established LearnSPN algorithm Gens and Domingos (2013) which has been instantiated several times for learning SPNs under different distributional assumptions Vergari et al. (2015); Molina et al. (2018).

Our LearnCSPN routine builds a CSPN top-down by introducing nodes while partitioning a data matrix, whose rows represent samples and whose columns represent RVs, in a recursive and greedy manner. LearnCSPN is sketched in Algorithm 1. In a nutshell, it comprises four steps, one per node type: 1) leaves, 2) products, 3) sums and 4) gating nodes. If only one target RV is present, a conditional probability distribution can be fit as a leaf. For product nodes, conditional independencies are found by means of a statistical test to partition the set of target RVs Y. If no such partitioning is found, then the training samples are partitioned into clusters (conditioning) to induce a sum or a gating node. We now review the four steps of LearnCSPN in more detail.

1:  Input: samples D = {(x_n, y_n)} drawn from P(X, Y); η: minimum number of instances to split; α: threshold of significance
2:  Output: a CSPN encoding P(Y | X) learned from D
3:  if |Y| = 1 then
4:     fit a univariate conditional leaf P(Y | X) using any approach of choice (e.g. GLMs)
5:  else if |D| < η then
6:     decompose Y into univariate conditional leaves combined by a product node
7:  else
8:     {Y_i}_i ← SplitLabels(D, α)   // compare Alg. 2
9:     if more than one label set Y_i was found then
10:       create a product node with children LearnCSPN(D restricted to Y_i, η, α)
11:    else
12:       partition D into clusters {D_c}_c, e.g. using random splits or k-Means with an appropriate metric, and create a sum (or gating) node with children LearnCSPN(D_c, η, α)
13: return the learned node
Algorithm 1 LearnCSPN (D, η, α)
1:  Input: samples D where the label RVs are Y; α: threshold of significance
2:  Output: a label partitioning {Y_i}_i
3:  create a graph G with one node per RV in Y
4:  for each pair Y_i, Y_j ∈ Y do
5:     if the null hypothesis Y_i ⊥ Y_j | X is rejected by RCoT at level α then
6:        add the edge (Y_i, Y_j) to G
7:  return the connected components of G
Algorithm 2 SplitLabels (D, α)

(1) Learning Leaves. In order to allow for tractable inference, we require the conditional models at the leaves to be normalized. Apart from this requirement, any univariate tractable conditional model can be plugged into a CSPN effortlessly. While one could adopt an expressive neural architecture to model P(Y | X), we strive for simplicity: we adopt simple univariate models and let the CSPN structure above compose a deeper dependency structure. In particular, we use Generalized Linear Models (GLMs) McCullagh (1984). We compute P(Y | X) by regressing the univariate distribution parameters from the features X, for a given set of distributions in the exponential family.
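As an illustration, a Poisson GLM leaf with the canonical log link can be sketched as follows; the weight vector `w` is a hypothetical fitted parameter, not taken from the paper.

```python
import numpy as np
from math import lgamma, exp

def poisson_leaf_logpdf(y, x, w):
    """log P(Y = y | X = x) for a Poisson GLM leaf.

    The rate is regressed from the features via the canonical log link,
    lambda(x) = exp(w . x), so the density stays normalized for every x.
    """
    eta = float(np.dot(w, x))          # linear predictor
    lam = exp(eta)                     # inverse link: rate is always positive
    return y * eta - lam - lgamma(y + 1)   # log Poisson pmf
```

Because the leaf is a proper distribution for every x, it can serve directly as a CSPN leaf.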

(2) Learning Product Nodes. We are interested in decomposing the labels Y into subsets via conditional independence (CI). In terms of density functions, testing that a RV Y_i is independent of Y_j given X, for any value of X, i.e., Y_i ⊥ Y_j | X, can equivalently be characterized as P(Y_i, Y_j | X) = P(Y_i | X) P(Y_j | X). As CI testing is generally a hard problem Shah and Peters (2018), we approximate it by pair-wise CI testing.

Since CSPNs aim to accommodate any leaf conditional distribution, regardless of its parametric likelihood model, we adopt a non-parametric pairwise CI test procedure to decompose the labels Y. Kernel-based methods like KCIT Zhang et al. (2012) and PCIT Doran et al. (2014), however, scale quadratically with sample size. To speed up structure learning, we employ a randomized approximation of KCIT, the Randomized conditional Correlation Test (RCoT) Strobl et al. (2017), which has proven very effective in practice and scales linearly w.r.t. sample size.

Briefly, RCoT computes the same statistic as KCIT, i.e., the squared Hilbert-Schmidt norm of the partial cross-covariance operator, but uses the Lindsay-Pilla-Basak method Lindsay et al. (2000) to approximate its asymptotic null distribution by matching the first moments of a finite mixture of Gamma distributions. To this end, RCoT characterizes conditional independence via characteristic kernels (e.g., RBF, Laplacian) on the variable domains and their corresponding RKHSs. The cross-covariance operator Σ_{YX} between the RKHSs of Y and X is the operator satisfying ⟨f, Σ_{YX} g⟩ = Cov(f(Y), g(X)) for all f and g in the respective RKHSs, and the partial cross-covariance operator of (Y, X) given Z is Σ_{YX·Z} = Σ_{YX} − Σ_{YZ} Σ_{ZZ}^{-1} Σ_{ZX}. Under mild assumptions, Y ⊥ X | Z implies that the partial cross-covariance operator (of suitably augmented variables) vanishes, and in turn that the test statistic tends to zero.[1] Lastly, we create a graph whose nodes are the RVs in Y and add an edge between two nodes Y_i, Y_j if the null hypothesis Y_i ⊥ Y_j | X is rejected at a given threshold α (see Alg. 2); the connected components of this graph yield the label partitioning.

[1] Indeed, there are some special cases where the operator vanishes yet conditional independence does not hold, i.e., this is not an equivalence. However, these cases are rarely encountered in practice.
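The label-splitting step can be sketched as follows; `ci_pvalue` is a hypothetical stand-in for RCoT, returning the p-value of the pairwise null hypothesis Y_i ⊥ Y_j | X, and connected components are found with a small union-find.

```python
import itertools

def split_labels(n_labels, ci_pvalue, alpha=0.05):
    """Partition label indices into groups of mutually dependent RVs.

    An edge is added whenever the CI null hypothesis is rejected
    (p-value below alpha, i.e. the pair is flagged as dependent);
    the connected components of the resulting graph are returned.
    """
    parent = list(range(n_labels))          # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i, j in itertools.combinations(range(n_labels), 2):
        if ci_pvalue(i, j) < alpha:         # independence rejected -> edge
            parent[find(i)] = find(j)       # merge components

    groups = {}
    for i in range(n_labels):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

If a single component covers all labels, no product node is introduced and LearnCSPN falls back to conditioning.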

(3) Learning Sum Nodes. As we want to learn a mixture of conditional distributions, we are interested in clustering samples (data matrix rows) together. We approximate conditional clustering by grouping samples based only on the labels Y, a heuristic that worked well in our experiments. To this end, one can exploit any flexibly parameterized clustering scheme conditioned on any knowledge of the data distribution (e.g., k-Means for Gaussians).

We can also leverage random splits, as in random projection trees Dasgupta and Freund (2008). Here we sample a random hyper-plane with normal vector r drawn from a d-dimensional uniform distribution. We then split the data, centered around its mean μ, into the points above the hyper-plane, D+ = {y ∈ D : ⟨y − μ, r⟩ > 0}, and those below it, D− = D \ D+. These two sets represent our sample partition.
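A minimal sketch of this random-split step, assuming the labels are collected row-wise in a NumPy matrix:

```python
import numpy as np

def random_split(Y, rng):
    """Split rows of Y by a random hyper-plane through the data mean.

    A random normal vector is drawn from a d-dimensional uniform
    distribution; rows are partitioned by the side of the hyper-plane
    their centered representation falls on.
    """
    r = rng.uniform(-1.0, 1.0, size=Y.shape[1])  # random normal vector
    centered = Y - Y.mean(axis=0)                # center around the mean
    above = centered @ r > 0                     # side of the hyper-plane
    return Y[above], Y[~above]
```

Each of the two returned sets is then recursed on by LearnCSPN to form the children of the new sum (or gating) node.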

(4) Learning Gating Nodes. Gating nodes provide an additional mechanism in CSPNs to condition on X while enhancing flexibility. Learning a mixture of experts requires a double optimization: learning the gating function g as well as the conditional experts. For CSPNs we approximate mixture-of-experts learning by performing clustering over the features X once, then building the functional weight mapping g_i(x) as the clustering assignment score, i.e., the membership of sample x to any of the induced clusters. Additionally, one might restrict g to act as a hard gating function, i.e., allowing one sample to be assigned to a single cluster (a single non-zero child branch). In our experiments we use random splits and k-Means with appropriate distance functions.

End-to-End Parameter Optimization. The CSPNs as described here contain three sets of parameters: the weights of the sum nodes, the parameters of the gating functions at the gating nodes, and the parameters of the GLMs at the leaf nodes. LearnCSPN in Alg. 1 automatically sets the sum weights to the proportion of instances sent to the respective children in the recursive call. The parameters of the GLMs are obtained by an Iteratively Reweighted Least Squares (IRWLS) algorithm as described in Green (1984), on the instances available at the leaf node. However, those parameters are optimized locally and are usually not optimal for the global distribution. Fortunately, CSPNs are differentiable as long as the leaf and gating models are differentiable. Hence, one can apply gradient-based learning to the CSPN as a whole in an end-to-end fashion, maximizing the conditional log-likelihood of the model over all parameters. Extra care must be taken with the sum weights: they must remain normalized throughout the optimization. To that end, it is recommended to re-parameterize the weights under a softmax transformation, which guarantees normalization.
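The softmax re-parameterization mentioned above can be sketched as follows: the optimizer updates unconstrained logits, and the mixture weights obtained through the softmax remain normalized after every gradient step.

```python
import numpy as np

def softmax(theta):
    """Map unconstrained logits to a point on the probability simplex."""
    z = np.exp(theta - theta.max())   # shift for numerical stability
    return z / z.sum()

# Unconstrained sum-node parameters; any gradient step on theta is legal.
theta = np.array([0.2, -1.0, 0.5])
w = softmax(theta)                    # valid, normalized mixture weights
```

Gradients w.r.t. theta are obtained by backpropagating through the softmax, so no projection step is needed during optimization.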

4 Autoregressive Block-wise CSPNs

We now illustrate how to impose structure on generative models by employing CSPNs as building blocks, in the same way as Bayesian networks represent a joint distribution as a factorization of conditional models. Indeed, by applying the chain rule of probability, we can decompose a joint distribution as the product P(X, Y) = P(Y | X) P(X). Then, one could learn an SPN to model P(X) and a CSPN for P(Y | X). By combining both models with a single product node, one obtains the flexibility to represent the whole joint as a computational graph.

Now, if one applies the same operation several times, partitioning the variables into a series of disjoint sets, one obtains an autoregressive model representation. Inspired by image autoregressive models like PixelCNN van den Oord et al. (2016a) and PixelRNN van den Oord et al. (2016b), we propose Autoregressive Block-wise CSPNs (ABCSPNs) for conditional image generation. For an ABCSPN, we divide images into pixel blocks, hence factorizing the joint distribution block-wise instead of pixel-wise as in PixelCNN/RNN. Each factor, accounting for a block of pixels, is then a CSPN representing the distribution of those pixels conditioned on all previous blocks and on the class labels.[2]

[2] Note that here the image labels play the role of the observed RVs X.

We factorize blocks in raster scan order, row by row and left to right; however, arbitrary orderings are possible. The complete generative model over an image encodes P(X | C) = Π_i P(B_i | B_1, …, B_{i−1}, C), where B_i denotes the pixel RVs of the i-th block and C the image class RVs.[3] Learning each conditional block as a CSPN can be done by the structure learning routines just introduced.

[3] We assume image classes to be one-hot encoded.
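The block-wise chain-rule factorization can be sketched as follows; the per-block models are hypothetical stand-ins for trained CSPNs, each returning a conditional log-likelihood.

```python
def abcspn_loglik(blocks, label, block_models):
    """log P(X | C) = sum_i log P(B_i | B_1..B_{i-1}, C).

    blocks: pixel blocks in raster order; label: the class conditioning C;
    block_models: one callable per block, a stand-in for a trained CSPN.
    """
    total, history = 0.0, []
    for block, model in zip(blocks, block_models):
        total += model(block, history, label)  # log P(B_i | history, C)
        history.append(block)                  # B_i joins the conditioning set
    return total
```

Sampling proceeds the same way: each block is drawn from its CSPN given the already-generated blocks and the class, then appended to the history.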

5 Related work

Conditional probabilistic modeling has been tackled in many flavours in the past, starting from probabilistic classifiers, which are generally limited to representing univariate distributions, i.e., P(Y | X) with a single target Y. While one could learn a univariate predictor per label independently, this independence assumption can be very restrictive in real-world scenarios.

Gaussian Processes (GPs) Rasmussen (2004) and Conditional Random Fields (CRFs) Lafferty et al. (2001) are staples for structured output prediction (SOP) regression and classification. However, they have serious shortcomings when inference has to scale to high-dimensional data, and they do not generally allow for exact marginalization. Deep mixtures of GPs have been introduced in Trapp et al. (2018); while partially alleviating the scalability issues of GP inference, they are limited to continuous domains. With CSPNs we directly tackle P(Y | X), i.e., the scenario where Y is multivariate. In a nutshell, CSPNs might be seen as an efficient way to aggregate any univariate (leaf) predictor to tackle unrestricted SOP in a principled probabilistic way.

Sharir and Shashua (2017) introduced Sum-Product-Quotient Networks (SPQNs), SPNs that include quotient nodes. This enables representing P(Y | X) as the ratio P(Y, X) / P(X), where the two terms are modeled by two SPNs. While more expressive than SPNs, SPQNs lose efficient marginalization. Determining the expressive efficiency of CSPNs w.r.t. SPQNs is an interesting venue for future research.

Concerning tractable models, logistic circuits (LCs) Liang and Van den Broeck (2019) have recently been introduced as discriminative models showing classification accuracy competitive with neural nets on a series of benchmarks. However, LCs are limited to single (binary) output prediction and hence not suited for SOP. Structured Bayesian Networks (SBNs) leverage Conditional Probabilistic Sentential Decision Diagrams Shen et al. (2018) to decompose a joint distribution into conditional models. As of now, both models are restricted to discrete RVs, and conditioning requires explicitly representing the states of X.

Closer in spirit to CSPNs, discriminative arithmetic circuits (DACs) Rooshenas and Lowd (2016) directly tackle modeling a conditional distribution. They are learned via compilation of CRFs, requiring sophisticated structure learning routines which, even when employing elaborate heuristics to approximate the CRFs' partition function, are very slow in practice.

Figure 2: (Best viewed in color) Comparing traffic flow predictions (RMSE, the lower the better) of CSPNs versus SPNs for shallow (left) and deep (center) models. CSPNs are consistently more accurate than corresponding SPNs and, as expected, deeper CSPNs outperform shallow ones (right).

6 Experimental Evaluation

Here we investigate CSPNs in experiments on real-world data. Specifically, we aim to answer the following questions: (Q1) Can CSPNs perform better than regular SPNs? (Q2) How accurate are CSPNs for SOP? (Q3) How do ABCSPNs perform w.r.t. state-of-the-art generative models? (Q4) Can we employ neural networks within functional CSPNs to model complex distributions?

To this end, we implemented CSPNs in Python, calling TensorFlow and R.[4]

[4] We will release the code upon acceptance.

(Q1, Q2) Multivariate Traffic Data Prediction. We employ CSPNs for multivariate traffic data prediction, comparing them against SPNs with Poisson leaf distributions Molina et al. (2017), an appropriate model since the traffic data represent counts of vehicles. We considered temporal vehicular traffic flows in the German city of Cologne Ide et al. (2015). The data comprise 39 RVs whose values come from stationary detectors located along the 50km-long Cologne orbital freeway, each counting the number of vehicles within a fixed time interval. The dataset contains 1440 samples, each a snapshot of the traffic flow. The task is to predict the next snapshot (the labels Y) given a historical one (the features X).

We trained both CSPNs and SPNs while controlling the depth of the models. The CSPNs use GLMs with an exponential link function to parameterize Poisson univariate conditional leaves. Results are summarized in Fig. 2. We can see that CSPNs are always the most accurate model, as their root mean squared error (RMSE) is always the lowest. As expected, deeper CSPNs have lower predictive error than shallow CSPNs. Moreover, even smaller CSPNs perform equally well or better than SPNs, empirically confirming what we hypothesised in Section 1. This answers (Q1, Q2) affirmatively and also provides evidence for the convenience of directly modeling a conditional distribution.

(Q2) Conditional density estimation.

We now focus on conditional density estimation. Due to space constraints, we present results on a subset of the standard binary benchmark datasets,[5] when different percentages of evidence X are available. We compare to DACL Rooshenas and Lowd (2016), as it currently provides state-of-the-art conditional log-likelihoods (CLLs) on such data. To this end, we first perform structure learning on the train split (stopping learning when no more than 10% of samples are available), followed by end-to-end parameter learning on the train and validation data. Note that the sophisticated structure learning in DACL directly optimizes for the CLL at each iteration.

[5] We adopt the classic train/valid/test splits as in Rooshenas and Lowd (2016).

Tab. 1 reports the results, with statistical significance assessed via paired t-tests. We can see that in the 80% evidence scenario CSPNs are comparable with DACL on most benchmarks. On the other hand, when only 50% of X is observable, DACL tends to perform better than CSPNs, though generally by a slight margin. We note that CSPNs are faster to learn than DACL and that, in practice, no real hyperparameter tuning was necessary to achieve these scores, while the DACL numbers are the result of a fine-grained grid search (see Rooshenas and Lowd (2016)). This answers (Q2) affirmatively and shows that CSPNs are comparable to the state of the art.

Dataset   | 50% Evidence       | 80% Evidence
          | DACL      CSPN     | DACL      CSPN
Nltcs     | -2.770    -2.787   | -1.255    -1.254
Msnbc     | -2.918    -3.165   | -1.557    -1.654
KDD       | -0.998    -1.048   | -0.386    -0.396
Plants    | -4.655    -4.720   | -1.812    -1.804
Audio     | -18.958   -18.759  | -7.337    -7.223
Jester    | -24.830   -24.544  | -9.998    -9.768
Netflix   | -26.245   -25.914  | -10.482   -10.352
Accidents | -9.718    -11.587  | -3.493    -4.045
Retail    | -4.825    -5.600   | -1.687    -1.653
Pumsb.    | -6.363    -7.383   | -2.594    -2.618
Dna       | -34.737   -30.289  | -12.116   -7.994
W/T/L     |           2/4/5    |           2/7/2
Table 1: Average test conditional log-likelihood (CLL) on standard density estimation benchmarks for DACL and CSPNs.

(Q3) Auto-Regressive Image Generation. We investigate ABCSPNs on a subset (20000 random samples) of grayscale MNIST and on the Olivetti faces, splitting each image into 16 resp. 64 blocks of equal size; for MNIST we normalized the grayscale values. We then trained a CSPN with Gaussian leaves for each block, conditioned on all blocks above and to the left of it and on the image class, and formulated the distribution of the images as the product of all the CSPNs.

As a baseline for generative image modeling, we compare with the state-of-the-art PixelCNN++ model Salimans et al. (2017). Training PixelCNN++ on a machine with 4 NVIDIA GeForce GTX 1080 GPUs took approximately a week to converge to 1.3 bits per dimension on MNIST (1.32 bits per dimension on the test set). We also started training classic SPNs on the Olivetti faces, but after more than 10 days we terminated the process.

Table 2 reports the bits per dimension (bpd) of all models (the lower the better), i.e., the (binary) negative log-likelihood normalized per dimension. While ABCSPNs score higher bpd than PixelCNN++, they employ one order of magnitude fewer parameters. Additionally, PixelCNN++ took about a week to train, SPNs more than a week, and ABCSPNs only half an hour. More interestingly, samples from ABCSPNs (see Figs. 3 and 4) look as plausible as PixelCNN++ ones, confirming that log-likelihood can be a misleading metric Theis et al. (2015). All in all, ABCSPNs achieve this by imposing a stronger dependency bias via their "scaffold" structure, while accommodating flexible conditional models provided by CSPNs. Doing so reduces the number of independence tests among pixels required by CSPNs: from quadratic in the number of pixels of an image down to quadratic in the block size.

An even more suggestive experimental result is reported in Figure 5, where Olivetti faces are sampled from an ABCSPN after conditioning on a mixture of two original classes. That is, conditioning on multiple classes generates samples resembling individuals from both classes, even though the ABCSPN never saw that class combination during training. This demonstrates that ABCSPNs learn a meaningful and accurate model of the image manifold, providing an affirmative answer to (Q3).

Figure 3: Samples generated by an ABCSPN (top), SPN (mid) and PixelCNN++ (bottom) trained on MNIST.
Figure 4: Samples generated by a ABCSPN (top) and PixelCNN++ (bottom) trained on the Olivetti faces dataset.
Figure 5: Conditional image generation with ABCSPNs: bottom row images are sampled while conditioning on the two classes to which individuals from the two upper rows belong to.
      | MNIST                          | Olivetti
      | SPN     ABCSPN   PixelCNN++   | ABCSPN   PixelCNN++
      | (4.5M)  (0.5M)   (16M)        | (8.5M)   (54.5M)
train | 1.73    -5.47    1.2990       | 1.753    0.48
test  | 1.69    -6.56    1.3294       | 1.330    1.03
Table 2: Bits per dimension (bpd) on MNIST and Olivetti faces for SPNs, ABCSPNs and PixelCNN++. Number of parameters per model in parentheses.

(Q4) Neural conditional SPNs with random structure. In high-dimensional domains such as images, the structure learning procedure introduced above may become intractable. In this case, functional CSPNs can still be applied by starting from a random SPN structure Peharz et al. (2018), resulting in a flexible distribution P(Y | X). When deep neural networks are used to represent the functional parameters, this architecture, which we call a neural CSPN, resembles a deep version of Bishop's classic mixture density networks.

To illustrate the usefulness of this novel link to deep neural learning, we trained a neural CSPN for left-completion on the Olivetti faces dataset. To this end, we generated a random SPN structure featuring 6 layers and roughly 32k parameters, which are determined by the output of a deep neural network. The neural network first processes the input using a convolutional layer to obtain a latent representation; sum weights and leaf parameters for the SPN are then output by a fully connected layer and two transposed convolutional layers, respectively. Fig. 6 demonstrates that neural CSPNs work well in practice. Exploring and evaluating them further is an interesting avenue for future work.

Figure 6: Left completions on Olivetti faces obtained by taking the MPE from a neural CSPN.

7 Conclusions

We have extended the stack of sum-product networks (SPNs) towards conditional distributions by introducing conditional SPNs (CSPNs). Conceptually, they combine simpler models in a hierarchical fashion in order to create a deep representation that can model multivariate and mixed conditional distributions while maintaining tractability. They can be used to impose structure on deep probabilistic models and, in turn, significantly boost their power as demonstrated by our experimental results.

Much remains to be explored, including other learning methods for CSPNs, design principles for CSPN+SPN architectures, combining the (C)SPN stack with the deep neural learning stack, more work on extensions to sequential and autoregressive domains, and further applications.

8 Acknowledgements

We acknowledge the support of the German Science Foundation (DFG) project "Argumentative Machine Learning" (CAML, KE 1686/3-1) of the SPP 1999 "Robust Argumentation Machines" (RATIO). Kristian Kersting also acknowledges the support of the Rhine-Main Universities Network for "Deep Continuous-Discrete Machine Learning" (DeCoDeML). Thomas Liebig was supported by the German Science Foundation under project B4 "Analysis and Communication for Dynamic Traffic Prognosis" of the Collaborative Research Centre SFB 876.