1 Introduction
Probabilistic models Koller and Friedman (2009) are a fundamental approach in machine learning and artificial intelligence to distill meaningful representations from data with inherent structure. In practice, however, it has been challenging to come up with probabilistic models that are expressive enough to capture the complexity of real-world distributions, while still allowing for tractable inference. Meanwhile, advances in probabilistic deep learning have shown that tractable models like arithmetic circuits Darwiche (2003); Choi and Darwiche (2017) can be used to capture complex distributions, while using little interpretable structure.
Here we explore the intersection of structured probabilistic models and probabilistic deep learning. Prior work on deep generative neural methods such as variational autoencoders (VAEs) Kingma and Welling (2014) and generative adversarial networks (GANs) Goodfellow et al. (2014) has been mostly unstructured, and has therefore yielded models that, despite producing impressive samples, have lacked interpretable meaning. Furthermore, these models generally have limited capabilities when it comes to probabilistic inference. Sum-product networks (SPNs) Darwiche (2003); Poon and Domingos (2011) are a rich family of hierarchical latent variable models Zhao et al. (2015); Peharz et al. (2017) allowing for tractable inference. Their structure, however, is mainly a modeling trick, and also lacks interpretable meaning. On the other hand, classical structured probabilistic models are not as expressive as deep learning models, and inference is generally hard and slow. Consequently, in this paper, we aim to combine the advantages of these approaches and extend the notion of sum-product networks to conditional probability distributions.
Specifically, we introduce conditional sum-product networks (CSPNs), which recursively construct a high-dimensional conditional probability model via a combination of smaller conditional models. Thereby, they maintain a broad set of exact and tractable inference routines for queries of higher complexity than those which can be answered by the independent smaller models. Moreover, since CSPNs can be naturally combined with SPNs, one can easily impose a rich structure on high-dimensional joint distributions.
Learning CSPNs from data, however, requires different decomposition and conditioning steps than for SPNs. Here we present a learning algorithm tailored towards nonparametric conditional distributions and we make the following contributions:
(1) We introduce a deep model for computing multivariate, conditional probabilities where the different variables might even belong to different distribution families.
(2) We present a structure learning algorithm for the deep conditional distributions based on randomized conditional correlation tests (RCoT) Strobl et al. (2017)—the first application of them to learning deep probabilistic models.
(3) We define a novel type of mixture nodes with functional weights that increase the capacity of the CSPNs while maintaining tractability.
On several real-world data sets, we demonstrate the effectiveness of CSPNs and compare against the state of the art. To illustrate how to impose structure on deep probabilistic models, we devise Autoregressive Blockwise Conditional Sum-Product Networks (ABCSPNs), the first autoregressive model for image generation based on CSPNs.
We proceed as follows. We start by introducing CSPNs. Then we show how to learn their structure from data using RCoT, introduce autoregressive CSPNs, and discuss further related work. Before concluding, we present our experiments.
2 Conditional Sum-Product Networks
We denote random variables (RVs) as uppercase letters, e.g., $X$, their values as lowercase letters, e.g., $x$; and sets of RVs in bold, e.g., $\mathbf{X}$. In the following, we employ $\mathbf{Y}$ to denote the target RVs, also called labels, while denoting the disjoint set of observed RVs, also called features, as $\mathbf{X}$.
Sum-Product Networks (SPNs) Darwiche (2003); Poon and Domingos (2011) are deep tractable probabilistic models decomposing a joint probability distribution via a directed acyclic graph (DAG) comprising sum, product and leaf nodes. Under some restrictions, SPNs can model high-treewidth distributions while preserving exact inference for a range of queries in time polynomial in the network size. For a detailed overview of SPNs, please refer to Peharz et al. (2017); Vergari et al. (2019). In this paper, we explore Conditional Sum-Product Networks (CSPNs), a formulation of SPNs for modeling conditional distributions $P(\mathbf{Y} \mid \mathbf{X})$. Intuitively, we rewrite $P(\mathbf{Y} \mid \mathbf{X})$ as $P(\mathbf{Y}; \theta)$ where the parameters of the conditional distributions are a function of the input $\mathbf{X}$: $\theta = f(\mathbf{X})$. That is, we learn an SPN with functional parameters. One can now account for the functional dependencies in the structure of the CSPN. This motivates the following definition.
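Definition of CSPNs. A CSPN is a rooted DAG of sum, gating, product and leaf nodes, encoding the probability distribution $P(\mathbf{Y} \mid \mathbf{X})$. Each leaf encodes a normalized univariate conditional distribution $P(Y \mid \mathbf{X})$ over a target RV $Y \in \mathbf{Y}$, denoting its conditional scope. A sum node defines the mixture model $\sum_k w_k P_k(\mathbf{Y} \mid \mathbf{X})$ where $P_k(\mathbf{Y} \mid \mathbf{X})$ is the conditional probability modeled by its $k$-th child node. A product node factorizes a conditional probability distribution over its children, i.e., $\prod_k P_k(\mathbf{Y}_k \mid \mathbf{X})$ where $\mathbf{Y}_k \subseteq \mathbf{Y}$. A gating node computes $\sum_k g_k(\mathbf{X}) P_k(\mathbf{Y} \mid \mathbf{X})$ where $g_k(\mathbf{X})$ is the output of a nonnegative function $g$ w.r.t. the $k$-th child node, such that $\sum_k g_k(\mathbf{X}) = 1$. The conditional scope of a non-leaf node is the union of the scopes of its children. Fig. 1 provides an example of a CSPN.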
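The four node types of the definition can be made concrete with a small evaluation sketch. It is only an illustration under stated assumptions: the class names `Leaf`, `Product`, `Sum` and `Gating` are hypothetical, leaves are taken to be Gaussians whose mean is linear in the features, and the gating function is assumed to be a softmax over linear scores. None of these choices is prescribed by the definition.

```python
import numpy as np

class Leaf:
    """Univariate conditional Gaussian P(Y_j | x); the mean is linear in x (assumption)."""
    def __init__(self, target, weights, bias, std=1.0):
        self.target = target
        self.weights = np.asarray(weights, dtype=float)
        self.bias = bias
        self.std = std

    def prob(self, y, x):
        mean = self.weights @ x + self.bias
        z = (y[self.target] - mean) / self.std
        return float(np.exp(-0.5 * z * z) / (self.std * np.sqrt(2.0 * np.pi)))

class Product:
    """Factorizes over children with disjoint conditional scopes."""
    def __init__(self, children):
        self.children = children

    def prob(self, y, x):
        return float(np.prod([c.prob(y, x) for c in self.children]))

class Sum:
    """Mixture with constant, normalized weights."""
    def __init__(self, weights, children):
        self.weights = np.asarray(weights, dtype=float)
        self.children = children

    def prob(self, y, x):
        return float(sum(w * c.prob(y, x) for w, c in zip(self.weights, self.children)))

class Gating:
    """Mixture whose weights g_k(x) depend on x; here a softmax of linear scores (assumption)."""
    def __init__(self, score_matrix, children):
        self.score_matrix = np.asarray(score_matrix, dtype=float)
        self.children = children

    def gate(self, x):
        scores = self.score_matrix @ x
        e = np.exp(scores - scores.max())
        return e / e.sum()  # nonnegative and sums to one

    def prob(self, y, x):
        return float(sum(g * c.prob(y, x) for g, c in zip(self.gate(x), self.children)))

def expert(shift):
    # One product node over two leaves with disjoint scopes {Y_0} and {Y_1}.
    return Product([Leaf(0, [1.0, 0.0], shift), Leaf(1, [0.0, 1.0], -shift)])

root = Gating([[2.0, 0.0], [-2.0, 0.0]], [expert(0.0), expert(3.0)])
x = np.array([0.5, -1.0])
y = np.array([0.4, -0.9])
density = root.prob(y, x)  # value of P(y | x) under this toy CSPN
```

Once `x` is fixed, `root.gate(x)` is just a vector of constant mixture weights, so the gating node behaves like an ordinary sum node for that input.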
First, one might note that a CSPN is still an SPN over labels $\mathbf{Y}$ where the RVs $\mathbf{X}$ are always equally “accessible” to all nodes. Moreover, gating nodes can be interpreted as functional sum nodes, i.e., mixture models whose mixing weights are not constants. This is akin to gates in mixtures of experts Shazeer et al. (2017), hence the name.
From this interpretation, we can extend the notions of completeness and decomposability of SPNs Poon and Domingos (2011) to CSPNs, in order to reuse efficient SPN inference routines, while guaranteeing that any conditional marginal distribution can be computed exactly Rooshenas and Lowd (2016).
Tractable Inference in CSPNs. A CSPN is conditionally complete iff the conditional scope of each sum and gating node is equal to the conditional scope of each of its children. A CSPN is conditionally decomposable iff the children of each product node do not have overlapping conditional scopes.
A conditionally complete and decomposable CSPN effectively models a tractable conditional distribution, i.e., one can compute $P(\mathbf{Y}' \mid \mathbf{x})$ for any arbitrary subset $\mathbf{Y}' \subseteq \mathbf{Y}$. Indeed, after observing $\mathbf{x}$, a CSPN turns into an SPN comprising only leaf, product and sum nodes with constant weights. Analogously, to perform most-probable-explanation (MPE) inference over $\mathbf{Y}$, one can leverage an approximate Viterbi-like inference, effectively evaluating the CSPN only twice. See Poon and Domingos (2011); Peharz et al. (2017) for a discussion.
CSPNs are more expressive than SPNs. As an intuitive argument for why CSPNs are more expressively efficient than SPNs, we leverage the framework of Sharir and Shashua (2017). Consider the simple case of modeling a stochastic process under the Markov assumption, i.e., the next state depends only on the current one. Here we are interested in modeling the transitions of a multi-dimensional state from one time step to the next, with the current state as features $\mathbf{X}$ and the next state as labels $\mathbf{Y}$.
Using SPNs, to answer a conditional query $P(\mathbf{Y} \mid \mathbf{X})$ we would still need to learn the joint distribution $P(\mathbf{Y}, \mathbf{X})$ and then marginalize over the observed variables $\mathbf{X}$.
However, any SPN structure that can represent such a joint distribution and includes at least one product node destroys the flow of information from one time step to the next in an unrecoverable way. To see why, consider any product node, where we can always find a variable $Y_i$ in the scope of one child node that, by the decomposability property, was separated from a variable $X_j$ that is now in the scope of another child node. This implies that the label is treated as independent of the feature, i.e., $Y_i \perp X_j$, and this might disrupt the capability of the SPN to make good predictions. The SPN can still represent the joint distribution by adding children to the sum nodes using different independence assumptions, but this can increase the size of the network significantly.
CSPNs behave differently, since each node, including the leaf nodes, can have access to all information in $\mathbf{X}$. Not only can CSPNs encode this type of problem, but inference can also be faster than in SPNs, as they do not have to marginalize and only need to traverse the graph once. Moreover, CSPNs extend, e.g., GLMs from a single response variable to multiple ones via their graphical structure. In this sense, CSPNs can also be viewed as multi-output regression or classification models that can unify different architectures into one framework while maintaining tractability.
3 Learning Conditional SPNs
While it is possible to craft a conditionally complete and decomposable CSPN structure by hand, doing so would require domain knowledge and weight learning afterwards Poon and Domingos (2011). Here, we introduce a simple structure learning strategy extending the established LearnSPN algorithm Gens and Domingos (2013) which has been instantiated several times for learning SPNs under different distributional assumptions Vergari et al. (2015); Molina et al. (2018).
Our LearnCSPN routine builds a CSPN top-down by introducing nodes while partitioning a data matrix, whose rows represent samples and whose columns represent RVs, in a recursive and greedy manner. LearnCSPN is sketched in Algorithm 1. In a nutshell, it comprises four steps, one per node type: 1) leaves, 2) products, 3) sums and 4) gating nodes. If only one target RV is present, a conditional probability distribution can be fit as a leaf. For product nodes, conditional independencies are found by means of a statistical test to partition the set of target RVs $\mathbf{Y}$. If no such partitioning is found, then training samples are partitioned into clusters (conditioning) to induce a sum or a gating node. We now review the four steps of LearnCSPN in more detail.
(1) Learning Leaves. In order to allow for tractable inference, we require conditional models at the leaves to be normalized. Apart from this requirement, any univariate tractable conditional model can be plugged into a CSPN effortlessly. While one could adopt an expressive neural architecture to model $P(Y \mid \mathbf{X})$, we strive for simplicity and adopt simple univariate models, letting the CSPN structure above compose a deeper dependency structure. In particular, we use Generalized Linear Models (GLMs) McCullagh (1984). We compute $P(Y \mid \mathbf{X})$ by regressing univariate parameters from features $\mathbf{X}$, for a given set of distributions in the exponential family.
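As an illustration of such a leaf, the following sketch fits a Poisson GLM with a log link by iteratively reweighted least squares and evaluates the resulting normalized conditional distribution. It is a minimal sketch, not the implementation used here: the function names, the synthetic data and the fixed iteration count are all assumptions made for the example.

```python
from math import lgamma

import numpy as np

def fit_poisson_glm(X, y, n_iter=25):
    """Fit log E[y | x] = beta_0 + x @ beta_1: by iteratively reweighted least squares."""
    A = np.column_stack([np.ones(len(X)), X])   # design matrix with intercept column
    beta = np.zeros(A.shape[1])
    beta[0] = np.log(y.mean() + 1e-8)           # start at the marginal rate
    for _ in range(n_iter):
        eta = A @ beta                          # linear predictor
        mu = np.exp(eta)                        # Poisson rate under the log link
        z = eta + (y - mu) / mu                 # working response
        weights = mu                            # IRLS weights for the Poisson log link
        beta = np.linalg.solve(A.T @ (weights[:, None] * A), A.T @ (weights * z))
    return beta

def poisson_leaf_log_prob(count, x, beta):
    """log P(Y = count | x) under the fitted leaf, for a single sample."""
    rate = np.exp(beta[0] + x @ beta[1:])
    return count * np.log(rate) - rate - lgamma(count + 1)

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
true_beta = np.array([0.5, 0.3, -0.2])
y = rng.poisson(np.exp(true_beta[0] + X @ true_beta[1:]))

beta = fit_poisson_glm(X, y)
log_p = poisson_leaf_log_prob(3, X[0], beta)    # log P(Y = 3 | x_0)
```

Because the Poisson mass function sums to one for any rate, the leaf is normalized for every input `x`, which is the only requirement stated above.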
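(2) Learning Product Nodes. We are interested in decomposing the labels $\mathbf{Y}$ into subsets via conditional independence (CI). In terms of density functions, testing that RVs $\mathbf{Y}_i$ are independent of $\mathbf{Y}_j$ given $\mathbf{X}$, for any value of $\mathbf{X}$, i.e., $\mathbf{Y}_i \perp \mathbf{Y}_j \mid \mathbf{X}$, can equivalently be characterized as $P(\mathbf{Y}_i, \mathbf{Y}_j \mid \mathbf{X}) = P(\mathbf{Y}_i \mid \mathbf{X})\, P(\mathbf{Y}_j \mid \mathbf{X})$. As CI testing is generally a hard problem Shah and Peters (2018), we approximate it by pairwise CI testing.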
Since CSPNs aim to accommodate any leaf conditional distribution, regardless of its parametric likelihood model, we adopt a nonparametric pairwise CI test procedure to decompose the labels $\mathbf{Y}$. Kernel-based methods like KCIT Zhang et al. (2012) and PCIT Doran et al. (2014), however, scale quadratically with sample size. To speed up structure learning, we employ a randomized approximation of KCIT, the Randomized conditional Correlation Test (RCoT) Strobl et al. (2017), which has been shown to be very effective in practice and scales linearly w.r.t. sample size.
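To make the role of the pairwise test concrete, the sketch below turns pairwise CI decisions into a partition of the labels by linking two labels whenever a dependence is detected and then taking connected components. Note the swapped-in technique: the test used here is a simple linear partial-correlation test with a Fisher z-transform, chosen only because it fits in a few lines, and it is not RCoT. The function names and the default threshold `alpha` are likewise assumptions.

```python
import math

import numpy as np

def partial_corr_pvalue(a, b, X):
    """p-value for H0: a independent of b given X (linear-Gaussian stand-in, not RCoT)."""
    A = np.column_stack([np.ones(len(X)), X])
    res_a = a - A @ np.linalg.lstsq(A, a, rcond=None)[0]   # residual of a after regressing on X
    res_b = b - A @ np.linalg.lstsq(A, b, rcond=None)[0]   # residual of b after regressing on X
    r = float(np.clip(np.corrcoef(res_a, res_b)[0, 1], -0.999999, 0.999999))
    dof = len(a) - X.shape[1] - 3
    z = math.sqrt(dof) * math.atanh(r)                     # Fisher z-transform
    return math.erfc(abs(z) / math.sqrt(2.0))              # two-sided normal tail

def partition_targets(Y, X, alpha=0.01):
    """Connected components of the graph linking labels whose independence is rejected."""
    d = Y.shape[1]
    parent = list(range(d))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]                  # path compression
            i = parent[i]
        return i

    for i in range(d):
        for j in range(i + 1, d):
            if partial_corr_pvalue(Y[:, i], Y[:, j], X) < alpha:
                parent[find(i)] = find(j)                  # dependent pair: merge groups
    groups = {}
    for i in range(d):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Synthetic data: Y_0 and Y_1 share noise given X, Y_2 does not.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
shared = rng.normal(size=n)
Y = np.column_stack([
    X[:, 0] + shared,
    X[:, 0] - shared + 0.1 * rng.normal(size=n),
    X[:, 1] + rng.normal(size=n),
])
groups = partition_targets(Y, X)
```

Each returned group would become the conditional scope of one child of a product node; a single group means no decomposition was found.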
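Briefly, RCoT computes the same statistic as KCIT, i.e., the squared Hilbert-Schmidt norm of the partial cross-covariance operator, but uses the Lindsay-Pilla-Basak method to approximate the asymptotic distribution. To this end, RCoT specifies conditional independence using characteristic kernels (e.g., RBF, Laplacian) for the variables $Y_i, Y_j, \mathbf{X}$ with domains $\mathcal{Y}_i, \mathcal{Y}_j, \mathcal{X}$, and denotes their corresponding RKHSs by $\mathcal{H}_{Y_i}, \mathcal{H}_{Y_j}, \mathcal{H}_{\mathbf{X}}$. It then employs the cross-covariance operator $\Sigma_{Y_i Y_j}$ on the RKHSs, from $\mathcal{H}_{Y_j}$ to $\mathcal{H}_{Y_i}$, which is defined by $\langle f, \Sigma_{Y_i Y_j}\, g \rangle = \mathbb{E}[f(Y_i)\, g(Y_j)] - \mathbb{E}[f(Y_i)]\, \mathbb{E}[g(Y_j)]$ for all $f \in \mathcal{H}_{Y_i}$ and $g \in \mathcal{H}_{Y_j}$.
The partial cross-covariance operator of $(Y_i, Y_j)$ given $\mathbf{X}$ can then be written as
$$\Sigma_{Y_i Y_j \cdot \mathbf{X}} = \Sigma_{Y_i Y_j} - \Sigma_{Y_i \mathbf{X}}\, \Sigma_{\mathbf{X}\mathbf{X}}^{-1}\, \Sigma_{\mathbf{X} Y_j}.$$
Under mild assumptions, it then holds: if $Y_i \perp Y_j \mid \mathbf{X}$ then $\Sigma_{Y_i Y_j \cdot \mathbf{X}} = 0$ and in turn $\|\Sigma_{Y_i Y_j \cdot \mathbf{X}}\|_{HS}^2 = 0$. (Footnote 1: Indeed, there are some special cases where $\Sigma_{Y_i Y_j \cdot \mathbf{X}} = 0$ yet $Y_i \not\perp Y_j \mid \mathbf{X}$, i.e., this is not an equivalence relation. However, these cases are rarely encountered in practice.)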
Finally, the test statistic is estimated from the samples, and its asymptotic distribution under the null hypothesis is approximated by the Lindsay-Pilla-Basak method Lindsay et al. (2000), which matches the first moments to a finite mixture of Gamma distributions. Lastly, we create a graph where the nodes are the RVs in $\mathbf{Y}$, and we create an edge between two nodes $Y_i$ and $Y_j$ if we can reject the null hypothesis that $Y_i \perp Y_j \mid \mathbf{X}$ for a given threshold (see Alg. 2). The connected components of this graph then yield the partition of the labels.
(3) Learning Sum Nodes. As we want to learn a mixture of conditional distributions, we are interested in clustering samples (data matrix rows) together. We approximate conditional clustering by grouping samples while looking only at the labels $\mathbf{Y}$, a heuristic that worked well in practice in our experiments. To this end, one can exploit any flexibly parameterized clustering scheme conditioned on any knowledge of the data distribution (e.g., k-Means for Gaussians).
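We can also leverage random splits, as in random projection trees Dasgupta and Freund (2008). Here we sample a random hyperplane whose normal vector $\mathbf{w}$ is drawn from a uniform distribution of the same dimension as the data being split. We then split the data, centered around its mean $\bar{\mathbf{v}}$, into the points that lie above the hyperplane, $\mathbf{w}^\top(\mathbf{v} - \bar{\mathbf{v}}) > 0$, and those that lie below it, $\mathbf{w}^\top(\mathbf{v} - \bar{\mathbf{v}}) \leq 0$. These two sets represent our sample partition.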
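The random split just described can be sketched in a few lines. The function name `random_hyperplane_split` and the sampling range of the normal vector, here $[-1, 1]$ per coordinate, are assumptions made for the example.

```python
import numpy as np

def random_hyperplane_split(V, rng):
    """Split the rows of V by the side of a random hyperplane through their mean."""
    centered = V - V.mean(axis=0)                        # center the data around its mean
    normal = rng.uniform(-1.0, 1.0, size=V.shape[1])     # assumed sampling range
    above = centered @ normal > 0                        # sign of the projection on the normal
    return np.flatnonzero(above), np.flatnonzero(~above)

rng = np.random.default_rng(0)
V = rng.normal(size=(200, 5))
above_idx, below_idx = random_hyperplane_split(V, rng)
```

Since the projections of centered data sum to zero, both sides are non-empty whenever the projections are not all zero, so the split always yields a usable two-way partition of the rows.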
(4) Learning Gating Nodes. Gating nodes provide an additional mechanism in CSPNs to condition on $\mathbf{X}$ while enhancing flexibility. Learning a mixture of experts requires a double optimization: learning the gating function $g$ as well as the conditional experts $P_k$. For CSPNs we approximate mixture of experts learning by performing clustering over the features $\mathbf{X}$ once, then building the functional weight mapping as the clustering assignment score, i.e., the membership of a sample to any of the induced clusters. Additionally, one might restrict $g$ to act as a hard gating function, i.e., allowing one sample to be assigned to a single cluster (a single nonzero child branch). In our experiments we use random splits and k-Means with appropriate distance functions.
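A minimal sketch of this approximation clusters the features once and reuses the cluster membership as the gate. The plain Lloyd-style `kmeans` helper, the squared Euclidean distance, and the softmax-of-negative-distance soft gate are all assumptions of the example; only the hard one-hot assignment corresponds directly to the hard gating described above.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=20):
    """Plain Lloyd iterations; returns the k centroids."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):                    # keep the old centroid if a cluster empties
                centroids[c] = X[labels == c].mean(axis=0)
    return centroids

def gate(x, centroids, hard=True, temperature=1.0):
    """Gating weights g(x): one-hot on the nearest centroid, or a soft assignment score."""
    dist = ((centroids - x) ** 2).sum(axis=1)
    if hard:
        g = np.zeros(len(centroids))
        g[dist.argmin()] = 1.0                         # a single nonzero child branch
        return g
    scores = -dist / temperature                       # assumed soft score
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
centroids = kmeans(X, 2, rng)
hard_weights = gate(X[0], centroids, hard=True)
soft_weights = gate(X[0], centroids, hard=False)
```

Either variant returns nonnegative weights that sum to one, as required of a gating node.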
End-to-End Parameter Optimization. The CSPNs as described here contain three sets of parameters: one for the weights of the sum nodes, one for the parameters of the indicator at the gating nodes, and another for the parameters of the GLMs at the leaf nodes. LearnCSPN in Alg. 1 automatically sets the sum weights as the proportion of instances sent to the respective children in the recursive call. The parameters of the GLMs are obtained by an Iteratively Reweighted Least Squares (IRWLS) algorithm as described in Green (1984), on the instances available at the leaf node. However, those parameters are locally optimized and usually not optimal for the global distribution. Fortunately, CSPNs are differentiable as long as the leaf models and conditioning models are differentiable. Hence, one can apply gradient-based learning to the CSPN as a whole in an end-to-end fashion, jointly updating the collection of parameter sets of the model. Extra care must be taken with the sum weights, namely, they must remain normalized throughout the optimization. To that end, it is recommended to reparameterize the weights under a Softmax transformation that guarantees normalization.
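The Softmax reparameterization can be illustrated on a single sum node. The sketch below assumes the children's probabilities are held fixed, stores the weights as unconstrained logits, and performs plain gradient ascent on the log-likelihood; the function names, learning rate and synthetic child probabilities are assumptions for the example rather than the training setup used here.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mixture_log_likelihood(logits, child_probs):
    """Sum over samples of log sum_k w_k p_k, with w = softmax(logits)."""
    return float(np.log(child_probs @ softmax(logits)).sum())

def log_likelihood_gradient(logits, child_probs):
    """Gradient w.r.t. the logits: per sample, responsibilities minus weights."""
    w = softmax(logits)
    mix = child_probs @ w                          # mixture value per sample
    resp = child_probs * w / mix[:, None]          # posterior responsibility of each child
    return (resp - w).sum(axis=0)

# Synthetic child probabilities: child 0 explains the data best.
rng = np.random.default_rng(0)
child_probs = rng.uniform(0.05, 1.0, size=(200, 3))
child_probs[:, 0] *= 3.0

logits = np.zeros(3)                               # uniform weights to start
initial_ll = mixture_log_likelihood(logits, child_probs)
for _ in range(200):
    grad = log_likelihood_gradient(logits, child_probs)
    logits = logits + 0.1 * grad / len(child_probs)
final_ll = mixture_log_likelihood(logits, child_probs)
weights = softmax(logits)                          # normalized by construction
```

The logits are free to take any real value, yet `softmax(logits)` is always a valid weight vector, so no projection step is needed after an update.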
4 Autoregressive Blockwise CSPNs
We now illustrate how to impose structure on generative models by employing CSPNs as building blocks, in the same way as Bayesian networks represent a joint distribution as a factorization of conditional models. Indeed, by applying the chain rule of probabilities, we can decompose a joint distribution as the product $P(\mathbf{Y}, \mathbf{X}) = P(\mathbf{Y} \mid \mathbf{X})\, P(\mathbf{X})$. Then, one could learn an SPN to model $P(\mathbf{X})$ and a CSPN for $P(\mathbf{Y} \mid \mathbf{X})$. By combining both models using a single product node, one would have the flexibility to represent the whole joint as a computational graph.
Now, if one applies the same operation several times by repeatedly partitioning $\mathbf{Y}$ into a series of disjoint sets, we can obtain an autoregressive model representation. Inspired by image autoregressive models like PixelCNN van den Oord et al. (2016a) and PixelRNN van den Oord et al. (2016b), we propose an Autoregressive Blockwise CSPN (ABCSPN) for conditional image generation. For one ABCSPN, we divide images into pixel blocks, hence factorizing the joint distribution blockwise instead of pixelwise as in PixelC/RNN. Each factor accounting for a block of pixels is then a CSPN representing the distribution of those pixels as conditioned on all previous blocks and on the class labels. (Footnote 2: Note that here image labels play the role of the observed RVs $\mathbf{X}$.) We factorize blocks in raster scan order: row by row and left to right; however, arbitrary orderings are possible. The complete generative model over an image $\mathbf{I}$ encodes:
$$P(\mathbf{I} \mid \mathbf{C}) = \prod_{i=1}^{n} P(\mathbf{B}_i \mid \mathbf{B}_1, \dots, \mathbf{B}_{i-1}, \mathbf{C})$$
where $\mathbf{B}_i$ denotes the pixel RVs of the $i$-th block and $\mathbf{C}$ the image class RVs. (Footnote 3: We assume image classes to be one hot encoded.) Learning each conditional block as a CSPN can be done by the structure learning routines just introduced.
5 Related work
Conditional probabilistic modeling has been tackled in many flavours in the past, starting from probabilistic classifiers, which are generally limited to representing univariate distributions, i.e., $P(Y \mid \mathbf{X})$. While one could learn a univariate predictor per label independently, this assumption might be very restrictive in real-world scenarios.
Gaussian Processes (GPs) Rasmussen (2004) and Conditional Random Fields (CRFs) Lafferty et al. (2001) are staples for structured output prediction (SOP) regression and classification. However, they have serious shortcomings when inference has to scale to high dimensional data. Moreover, they do not generally allow for exact marginalization. Deep mixtures of GPs have been introduced in Trapp et al. (2018). However, while partially alleviating GP inference scalability issues, they are limited to continuous domains, whereas with CSPNs we directly tackle $P(\mathbf{Y} \mid \mathbf{X})$, i.e., the scenario where $\mathbf{Y}$ is multivariate. In a nutshell, CSPNs might be seen as an efficient way to aggregate any univariate (leaf) predictor to tackle unrestricted SOP in a principled probabilistic way.
Sharir and Shashua (2017) introduced Sum-Product-Quotient Networks (SPQNs), i.e., SPNs including quotient nodes. This enables representing $P(\mathbf{Y} \mid \mathbf{X})$ as the ratio $P(\mathbf{Y}, \mathbf{X}) / P(\mathbf{X})$, where the two terms are modeled by two SPNs. While being more expressive than SPNs, SPQNs lose efficient marginalization. Determining the expressive efficiency of CSPNs w.r.t. SPQNs is an interesting avenue for future research.
Concerning tractable models, logistic circuits (LCs) Liang and Van den Broeck (2019) have been recently introduced as discriminative models showing competitive classification accuracy w.r.t. neural nets on a series of benchmarks. However, LCs are limited to single (binary) output prediction and hence not suited for SOP. Structured Bayesian Networks (SBNs) leverage Conditional Probabilistic Sentential Decision Diagrams Shen et al. (2018) to decompose a joint distribution into conditional models. As of now, both models are restricted to discrete RVs, and conditioning requires explicitly representing the states of $\mathbf{X}$.
Closer in spirit to CSPNs, discriminative arithmetic circuits (DACs) Rooshenas and Lowd (2016) directly tackle modeling a conditional distribution. They are learned via compilation of CRFs, requiring sophisticated structure learning routines which, even when employing elaborate heuristics to approximate the CRFs’ partition function, are very slow in practice.
6 Experimental Evaluation
Here we investigate CSPNs in experiments on real-world data. Specifically, we aim to answer the following questions: (Q1) Can CSPNs perform better than regular SPNs? (Q2) How accurate are CSPNs for SOP? (Q3) How do ABCSPNs perform w.r.t. state-of-the-art generative models? (Q4) Can we employ neural networks within functional CSPNs to model complex distributions?
To this end, we implemented CSPNs in Python calling TensorFlow and R. (Footnote 4: We will release code upon acceptance.)
(Q1, Q2) Multivariate Traffic Data Prediction. We employ CSPNs for multivariate traffic data prediction, comparing them against SPNs with Poisson leaf distributions Molina et al. (2017). This is an appropriate model as the traffic data represents counts of vehicles. We considered temporal vehicular traffic flows in the German city of Cologne Ide et al. (2015). The data comprises 39 RVs whose values are from stationary detectors located along the 50 km long Cologne orbital freeway in Germany, each one counting the number of vehicles within a fixed time interval. It contains 1440 samples, each of which is a snapshot of the traffic flow. The task of the experiments is to predict the next snapshot ($\mathbf{Y}$) given a historical one ($\mathbf{X}$).
We trained both CSPNs and SPNs while controlling the depth of the models. The CSPNs use GLMs with an exponential link function to parameterize a Poisson univariate conditional leaf. Results are summarized in Fig. 2. We can see that CSPNs are always the most accurate model, as their root mean squared error (RMSE) is always the lowest. As expected, deeper CSPNs have lower predictive error compared to shallow CSPNs. Moreover, smaller CSPNs perform equally well or even better than SPNs, empirically confirming what we hypothesised in Section 1. This answers (Q1, Q2) affirmatively and also provides evidence for the convenience of directly modeling a conditional distribution.
(Q2) Conditional density estimation.
We now focus on conditional density estimation. Due to space constraints, we present results on a subset of the standard binary benchmark datasets (Footnote 5: We adopt the classic train/valid/test splits as in Rooshenas and Lowd (2016).), when different percentages of evidence are available. We compare to DACL Rooshenas and Lowd (2016) as it currently provides state-of-the-art conditional log-likelihoods (CLLs) on such data. To this end, we first perform structure learning on the train data split (stopping learning when no more than 10% of samples are available), followed by end-to-end parameter learning on the train and validation data.
Note that the sophisticated structure learning in DACL directly optimizes for the CLL at each iteration.
Tab. 1 reports statistically significant results (best in bold), after a paired t-test has been run. We can see that in the 80% evidence scenario CSPNs are comparable with DACL on most benchmarks. On the other hand, when only 50% of the evidence is observable, DACL tends to perform better than CSPNs, even though by a slight margin in general.
We note that CSPNs are faster to learn than DACL and that, in practice, no real hyperparameter tuning was necessary to achieve these scores, while the DACL ones are the result of a fine-grained grid search (see Rooshenas and Lowd (2016)). This answers (Q2) affirmatively and shows that CSPNs are comparable to the state of the art.
Table 1: CLLs of DACL and CSPN with 50% and 80% of evidence.
             50% Evidence        80% Evidence
Dataset      DACL      CSPN      DACL      CSPN
Nltcs        2.770     2.787     1.255     1.254
Msnbc        2.918     3.165     1.557     1.654
KDD          0.998     1.048     0.386     0.396
Plants       4.655     4.720     1.812     1.804
Audio       18.958    18.759     7.337     7.223
Jester      24.830    24.544     9.998     9.768
Netflix     26.245    25.914    10.482    10.352
Accidents    9.718    11.587     3.493     4.045
Retail       4.825     5.600     1.687     1.653
Pumsb.       6.363     7.383     2.594     2.618
Dna         34.737    30.289    12.116     7.994
W/T/L             2/4/5               2/7/2
(Q3) Autoregressive Image Generation.
We investigate ABCSPNs on a subset (20000 random samples) of grayscale MNIST and on Olivetti faces by splitting each image into 16 and 64 blocks of equal size, respectively, where we normalized the grayscale values for MNIST. Then we trained a CSPN on a Gaussian domain for each block, conditioned on all the blocks above and to the left of it and on the image class, and formulated the distribution of the images as the product of all the CSPNs.
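The blockwise factorization can be made concrete by showing how one image is turned into per-block training pairs. This is only a sketch under assumptions: the helper names are hypothetical, images are taken to be square with a side divisible by the number of blocks per side, and the context of a block is taken to be all previous blocks in raster-scan order plus the one-hot class vector.

```python
import numpy as np

def raster_blocks(image, blocks_per_side):
    """Return the blocks of a square image in raster-scan order, each flattened."""
    height, width = image.shape
    bh, bw = height // blocks_per_side, width // blocks_per_side
    return [image[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw].ravel()
            for r in range(blocks_per_side)
            for c in range(blocks_per_side)]

def autoregressive_pairs(image, class_one_hot, blocks_per_side):
    """For block i: features = previous blocks and class vector, targets = block i."""
    blocks = raster_blocks(image, blocks_per_side)
    pairs = []
    for i, target in enumerate(blocks):
        context = np.concatenate(blocks[:i] + [class_one_hot])
        pairs.append((context, target))
    return pairs

# A 28x28 image split into 16 blocks of 7x7 pixels, with 10 classes.
image = np.arange(28 * 28, dtype=float).reshape(28, 28)
class_one_hot = np.eye(10)[3]
pairs = autoregressive_pairs(image, class_one_hot, blocks_per_side=4)
```

Each pair would be the data for one CSPN, with the first block conditioned on the class alone and later blocks on a growing context.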
As baseline for generative image modeling we compare with the stateoftheart PixelCNN++ model Salimans et al. (2017). Training PixelCNN++ on a machine with 4 NVIDIA GeForce GTX 1080 GPUs took approximately a week to converge to 1.3 bits per dimension (1.32 bits per dimension on test dataset) for MNIST. We also started to train classic SPNs on Olivetti faces, but it took more than 10 days so we terminated the process.
Table 2 reports the bits per dimension (bpd) of all models (the lower the better), which stands for the (binary) negative log-likelihood normalized per dimension. While ABCSPNs score higher bpd than PixelCNNs, they remarkably employ one order of magnitude fewer parameters. Additionally, PixelCNN++ took about a week to train, SPNs more than a week, and ABCSPNs only half an hour. More interestingly, samples from ABCSPNs (see Figs. 3 and 4) look as plausible as PixelCNN ones, confirming that log-likelihood might be a misleading metric to look at Theis et al. (2015). All in all, ABCSPNs achieve this by imposing a stronger dependency bias via their “scaffold” structure, while accommodating flexible conditional models provided by CSPNs. By doing so, they reduce the number of independence tests among pixels required by CSPNs: from quadratic over all pixels in an image down to only quadratic in the block size.
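For reference, the conversion from a negative log-likelihood in nats to bits per dimension is a one-liner; the function name below is an assumption, and the sanity check uses a uniform distribution over 256 intensity levels per pixel, which must cost exactly 8 bits per dimension.

```python
import math

def bits_per_dimension(nll_nats, num_dims):
    """Negative log-likelihood in nats, converted to bits and averaged over dimensions."""
    return nll_nats / (num_dims * math.log(2.0))

# Uniform model over 256 levels for each of 784 pixels.
uniform_nll = 784 * math.log(256.0)
uniform_bpd = bits_per_dimension(uniform_nll, 784)
```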
An even more suggestive experimental result is reported in Figure 5. There, Olivetti faces are sampled from an ABCSPN after conditioning on a set of class images that is the mixing of two original classes. That is, conditioning on multiple classes generates samples that resemble both individuals belonging to those classes, even though the ABCSPN never saw that class combination during training. This demonstrates how ABCSPNs are able to learn meaningful and accurate models over the image manifold, providing an affirmative answer to (Q3).
Table 2: Bits per dimension (bpd) of all models; parameter counts in parentheses.
         MNIST                             Olivetti
         SPN       ABCSPN    PixelCNN     ABCSPN    PixelCNN
         (4.5M)    (0.5M)    (16M)        (8.5M)    (54.5M)
train    1.73      5.47      1.2990       1.753     0.48
test     1.69      6.56      1.3294       1.330     1.03
(Q4) Neural conditional SPNs with random structure. In high-dimensional domains, such as images, the structure learning procedure introduced above may be intractable. In this case, functional CSPNs may still be applied by starting from a random SPN structure Peharz et al. (2018), resulting in a flexible distribution $P(\mathbf{Y} \mid \mathbf{X})$. When deep networks are used to represent the function that maps $\mathbf{X}$ to the SPN parameters, this architecture, which we call neural CSPN, resembles a deep version of Bishop’s classic mixture density networks.
To illustrate the usefulness of this novel link to deep neural learning, we trained a neural CSPN for left-completion on the Olivetti faces dataset. To this end, we generated a random SPN structure featuring 6 layers and roughly 32k parameters, which are determined by the output of a deep neural network. The neural network first processes the input using a convolutional layer to obtain a latent representation. Sum weights and leaf parameters for the SPN are then output using a fully connected layer and two transposed convolutional layers, respectively. Fig. 6 demonstrates that neural CSPNs actually work well. Exploring and evaluating them further is an interesting avenue for future work.
7 Conclusions
We have extended the stack of sum-product networks (SPNs) towards conditional distributions by introducing conditional SPNs (CSPNs). Conceptually, they combine simpler models in a hierarchical fashion in order to create a deep representation that can model multivariate and mixed conditional distributions while maintaining tractability. They can be used to impose structure on deep probabilistic models and, in turn, significantly boost their power as demonstrated by our experimental results.
Much remains to be explored, including other learning methods for CSPNs, design principles for CSPN+SPN architectures, combining the (C)SPN stack with the deep neural learning stack, more work on extensions to sequential and autoregressive domains, and further applications.
8 Acknowledgements
We acknowledge the support of the German Science Foundation (DFG) project “Argumentative Machine Learning” (CAML, KE 1686/31) of the SPP 1999 “Robust Argumentation Machines” (RATIO). Kristian Kersting also acknowledges the support of the RhineMain Universities Network for “Deep ContinuousDiscrete Machine Learning” (DeCoDeML). Thomas Liebig was supported by the German Science Foundation under project B4 ‘Analysis and Communication for Dynamic Traffic Prognosis’ of the Collaborative Research Centre SFB 876.
References
 Choi and Darwiche [2017] A. Choi and A. Darwiche. On relaxing determinism in arithmetic circuits. In Proc. of ICML, 2017.
 Darwiche [2003] A. Darwiche. A differential approach to inference in Bayesian networks. Jour. of ACM, 2003.

 Dasgupta and Freund [2008] S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. In Proc. of Symp. Theory of computing, 2008.
 Doran et al. [2014] G. Doran, K. Muandet, K. Zhang, and B. Schölkopf. A permutation-based kernel conditional independence test. In UAI, 2014.
 Gens and Domingos [2013] R. Gens and P. Domingos. Learning the Structure of SumProduct Networks. In Proc. of ICML, 2013.
 Goodfellow et al. [2014] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, A. Courville, and Y. Bengio. Generative adversarial nets. In Proc. of NIPS, 2014.
 Green [1984] Peter J Green. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. JSTOR, 1984.
 Ide et al. [2015] C. Ide, F. Hadiji, L. Habel, A. Molina, T. Zaksek, M. Schreckenberg, K. Kersting, and C. Wietfeld. Lte connectivity and vehicular traffic prediction based on machine learning approaches. In VTC. IEEE, 2015.
 Kingma and Welling [2014] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proc. of ICLR, 2014.
 Koller and Friedman [2009] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
 Lafferty et al. [2001] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001.
 Liang and Van den Broeck [2019] Y. Liang and G. Van den Broeck. Learning logistic circuits. In Proc. of AAAI, 2019.
 Lindsay et al. [2000] B. Lindsay, R. Pilla, and P. Basak. Momentbased approximations of distributions using mixtures: Theory and applications. Annals of ISM, 2000.
 McCullagh [1984] Peter McCullagh. Generalized linear models. EJOR, 16(3), 1984.
 Molina et al. [2017] A. Molina, S. Natarajan, and K. Kersting. Poisson sumproduct networks: A deep architecture for tractable multivariate poissons. In Proc. of AAAI, 2017.
 Molina et al. [2018] A. Molina, A. Vergari, N. Di Mauro, S. Natarajan, F. Esposito, and K. Kersting. Mixed sumproduct networks: A deep architecture for hybrid domains. In Proc. of AAAI, 2018.
 Peharz et al. [2017] R. Peharz, R. Gens, F. Pernkopf, and P. Domingos. On the latent variable interpretation in sumproduct networks. IEEE TPAMI, 39(10), 2017.
 Peharz et al. [2018] R. Peharz, A. Vergari, K. Stelzner, A. Molina, M. Trapp, K. Kersting, and Z. Ghahramani. Probabilistic deep learning using random sumproduct networks. In preprint arXiv:1806.01910, 2018.
 Poon and Domingos [2011] H. Poon and P. Domingos. SumProduct Networks: a New Deep Architecture. UAI, 2011.
 Rasmussen [2004] C. E. Rasmussen. Gaussian processes in machine learning. In Adv. lect. on machine learning. 2004.
 Rooshenas and Lowd [2016] A. Rooshenas and D. Lowd. Discriminative structure learning of arithmetic circuits. In Proc. of AISTATS, 2016.
 Salimans et al. [2017] T. Salimans, A. Karpathy, X. Chen, D. Kingma, and Y. Bulatov. Pixelcnn++. In ICLR, 2017.
 Shah and Peters [2018] R. Shah and J. Peters. The hardness of conditional independence testing and the generalised covariance measure. preprint arXiv:1804.07203, 2018.
 Sharir and Shashua [2017] O. Sharir and A. Shashua. Sumproductquotient networks. arXiv:1710.04404, 2017.
 Shazeer et al. [2017] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Quoc Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparselygated mixtureofexperts layer. preprint arXiv:1701.06538, 2017.
 Shen et al. [2018] Y. Shen, A. Choi, and A. Darwiche. Conditional PSDDs: Modeling and learning with modular knowledge. In Proc. of AAAI, 2018.
 Strobl et al. [2017] E. Strobl, K. Zhang, and S. Visweswaran. Approximate kernelbased conditional independence tests for fast nonparametric causal discovery. preprint arXiv:1702.03877, 2017.
 Theis et al. [2015] L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. preprint arXiv:1511.01844, 2015.
 Trapp et al. [2018] M. Trapp, R. Peharz, C. Rasmussen, and F. Pernkopf. Learning deep mixtures of gaussian process experts using sumproduct networks. preprint arXiv:1809.04400, 2018.
 van den Oord et al. [2016a] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, et al. Conditional image generation with pixelcnn decoders. In Proc. of NIPS, 2016.

 van den Oord et al. [2016b] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In Proc. of ICML, 2016.
 Vergari et al. [2015] A. Vergari, N. Di Mauro, and F. Esposito. Simplifying, regularizing and strengthening spn structure learning. In Proc. of ECML/PKDD, 2015.
 Vergari et al. [2019] A. Vergari, N. Di Mauro, and F. Esposito. Visualizing and understanding sumproduct networks. MLJ, 2019.
 Zhang et al. [2012] K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernelbased conditional independence test and application in causal discovery. preprint arXiv:1202.3775, 2012.
 Zhao et al. [2015] Han Zhao, Mazen Melibari, and Pascal Poupart. On the Relationship between SumProduct Networks and Bayesian Networks. In Proc. of ICML, 2015.