1 Introduction
Uncertainty matters. An intelligent system applied in the real world should be able both to deal with uncertain inputs and to express its uncertainty over outputs. The latter is especially crucial in automatic decision-making processes, such as medical diagnosis and planning systems for autonomous agents. Therefore, it is no surprise that probabilistic approaches have recently gained great momentum in deep learning, having led to a variety of innovative probabilistic models such as variational autoencoders (VAEs) [14], deep generative models [27], generative adversarial nets (GANs) [11], neural autoregressive density estimators (NADEs) [15], and PixelCNNs/RNNs [33]. However, most of these probabilistic deep learning systems have limited capabilities when it comes to inference. Implicit probabilistic models like GANs, even when successful in capturing the data distribution, do not allow evaluating the likelihood of a test sample. Similar problems arise in deep generative models and VAEs, which typically employ a jointly trained inference network to infer the posterior over a latent variable space. While these techniques mark a milestone in variational learning, inference in these models and, ironically, also in their inference networks is limited to drawing samples, forcing users to retreat to Monte Carlo estimates. Autoregressive density estimators like NADEs and PixelCNNs/RNNs somewhat alleviate these limitations, as they permit exact and efficient evaluation of model likelihoods. Moreover, they permit certain marginalization and conditioning tasks, as long as the task is consistent with the variable ordering assumed in the model. Inference tasks not consistent with this variable ordering, however, remain intractable. Uria et al. [32] address this problem by training an ensemble of NADEs with shared network structure. This approach, however, introduces the delicate problem of approximately training a super-exponential ensemble of NADEs. Thus, autoregressive models still fall short when fully fledged inference is required.
To this end, sum-product networks (SPNs) [25] are a promising remedy, as they are a class of probabilistic models which permit exact and efficient inference. More precisely, SPNs are able to compute any marginalization and conditioning query in time linear in the model’s representation size. Nevertheless, although SPNs can be described in a nutshell as “deep mixture models” [21], they have received rather limited attention in the deep learning community, despite their attractive inference properties. We conjecture that there are three reasons why SPNs have been underused in deep learning so far.
First, the structure of SPNs needs to obey certain constraints, requiring either careful structure design by hand or learning the structure from data [6, 9, 20, 28, 34, 1, 31]. These structural requirements are somewhat opposed to the usual homogeneous model structures employed in deep learning, i.e. combining modules like matrix multiplication and element-wise nonlinearities in an almost unconstrained way. Second, the parameter learning schemes proposed so far are usually inspired by graphical models [25, 36, 21] or tailored to SPNs [8]. This peculiar learning style has probably hindered a wide application of SPNs in the connectionist approach so far, which typically relies on automatic differentiation and SGD. Third, there seems to be a folklore that SPNs are “somewhat weak function approximators”, i.e. it is widely believed that SPNs cannot solve prediction tasks to the extent we expect from deep neural networks. Indeed, in [17], a class of distributions was identified which can be tractably represented by a neural net, but not by an SPN. However, as also mentioned in [17], this example is a somewhat academic one, and we should not jump to conclusions concerning SPNs’ fitness in practical problems. Furthermore, the notion of SPNs used in [17] was a restricted one, i.e. using univariate leaves, and the example could actually be circumvented by extending SPNs to multivariate leaves [24]. In that way, the tractable representation of the negative example in [17] could (trivially) be incorporated in the SPN framework. In general, SPNs inherit universal approximation properties from mixture models, as a mixture model is simply a “shallow” SPN with a single sum node. Consequently, SPNs are able to approximate any prediction function via probabilistic inference in an asymptotic sense.
In this paper, we investigate the fitness of SPNs as deep learning models from a practical point of view. To this end, we introduce a particularly simple way to construct SPNs, waiving the necessity for structure learning and simplifying their use as a connectionist model. These SPNs are obtained by first constructing a random region graph [6, 20] laying out the overall network structure, and subsequently populating the region graph with tensors of SPN nodes. This architecture, which we call Random Tensorized SPNs (RAT-SPNs), is naturally implemented in deep learning frameworks like TensorFlow [7] and easily optimized end-to-end using automatic differentiation, SGD, and automatic GPU-parallelization. To avoid overfitting, we adopt the well-known dropout heuristic [29], which yields an elegant probabilistic interpretation in our models as marginalization of missing features (dropout at inputs) and as injection of discrete noise (dropout at sum nodes). We trained RAT-SPNs on several real-world classification data sets, showing that their prediction performances are comparable to traditional deep neural networks. At the same time, RAT-SPNs maintain a complete joint distribution over both inputs and outputs, which allows us to treat uncertainty in a consistent manner. We show that RAT-SPNs are dramatically more robust in the presence of missing features than neural networks. Furthermore, we demonstrate that RAT-SPNs also provide well-calibrated uncertainty estimates over their inputs, i.e., the model “knows what it does not know”. This property can be naturally exploited for anomaly and out-of-domain detection.
2 Related Work
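We denote random variables (RVs) by uppercase letters, e.g. $X$, $Y$, and their values by corresponding lowercase letters, e.g. $x$, $y$. Similarly, we denote sets of RVs by uppercase boldface letters, e.g. $\mathbf{X}$, $\mathbf{Y}$, and their combined values by corresponding lowercase letters, e.g. $\mathbf{x}$, $\mathbf{y}$. An SPN $\mathcal{S}$ over $\mathbf{X}$ is a probabilistic model defined via a directed acyclic graph (DAG) containing three types of nodes: input distributions, sums and products. All leaves of the SPN are distribution functions over some subset $\mathbf{X}' \subseteq \mathbf{X}$. Inner nodes are either weighted sums or products, denoted by $\mathsf{S}$ and $\mathsf{P}$, respectively, i.e., $\mathsf{S} = \sum_{\mathsf{N} \in \mathrm{ch}(\mathsf{S})} w_{\mathsf{S},\mathsf{N}} \, \mathsf{N}$ and $\mathsf{P} = \prod_{\mathsf{N} \in \mathrm{ch}(\mathsf{P})} \mathsf{N}$, where $\mathrm{ch}(\mathsf{N})$ denotes the children of $\mathsf{N}$. The sum weights $w_{\mathsf{S},\mathsf{N}}$ are assumed to be non-negative and normalized, i.e., $w_{\mathsf{S},\mathsf{N}} \geq 0$, $\sum_{\mathsf{N} \in \mathrm{ch}(\mathsf{S})} w_{\mathsf{S},\mathsf{N}} = 1$.
The scope of an input distribution $\mathsf{L}$ is defined as the set of RVs for which $\mathsf{L}$ is a distribution function, i.e. $\mathrm{sc}(\mathsf{L}) := \mathbf{X}'$. The scope of an inner (sum or product) node $\mathsf{N}$ is recursively defined as $\mathrm{sc}(\mathsf{N}) = \bigcup_{\mathsf{N}' \in \mathrm{ch}(\mathsf{N})} \mathrm{sc}(\mathsf{N}')$. To allow efficient inference, SPNs should satisfy two structure constraints [5, 25], namely completeness and decomposability. An SPN is complete if for each sum $\mathsf{S}$ it holds that $\mathrm{sc}(\mathsf{N}') = \mathrm{sc}(\mathsf{N}'')$, for all $\mathsf{N}', \mathsf{N}'' \in \mathrm{ch}(\mathsf{S})$. An SPN is decomposable if it holds for each product $\mathsf{P}$ that $\mathrm{sc}(\mathsf{N}') \cap \mathrm{sc}(\mathsf{N}'') = \emptyset$, for all $\mathsf{N}' \neq \mathsf{N}'' \in \mathrm{ch}(\mathsf{P})$. In that way, all nodes in an SPN recursively define a distribution over their respective scopes: the leaves are distributions by definition, sum nodes are mixtures of their child distributions, and products are factorized distributions, i.e., assuming (conditional) independence among the scopes of their children.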
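The two structural constraints can be checked mechanically. The following is a minimal sketch, assuming a hypothetical dictionary-based node representation (the helpers `leaf`, `sum_node`, `product_node`, `scope` and `is_valid` are illustrative names, not part of the paper's implementation):

```python
# Hypothetical minimal node representation: each node is a plain dict.
def leaf(scope_vars):
    return {"type": "leaf", "scope": frozenset(scope_vars), "children": []}

def sum_node(children, weights):
    return {"type": "sum", "children": children, "weights": weights}

def product_node(children):
    return {"type": "product", "children": children}

def scope(node):
    """Scope of a leaf is given; scope of an inner node is the union of child scopes."""
    if node["type"] == "leaf":
        return node["scope"]
    result = frozenset()
    for child in node["children"]:
        result |= scope(child)
    return result

def is_valid(node, tol=1e-9):
    """Recursively check normalized sum weights, completeness and decomposability."""
    if node["type"] == "leaf":
        return True
    child_scopes = [scope(c) for c in node["children"]]
    if node["type"] == "sum":
        w = node["weights"]
        weights_ok = all(x >= 0 for x in w) and abs(sum(w) - 1.0) <= tol
        # Completeness: all children of a sum share the same scope.
        ok = weights_ok and all(s == child_scopes[0] for s in child_scopes)
    else:
        # Decomposability: child scopes of a product are pairwise disjoint,
        # i.e. the sizes add up to the size of their union.
        union = frozenset().union(*child_scopes)
        ok = sum(len(s) for s in child_scopes) == len(union)
    return ok and all(is_valid(c, tol) for c in node["children"])

good = sum_node(
    [product_node([leaf({"X1"}), leaf({"X2"})]),
     product_node([leaf({"X1"}), leaf({"X2"})])],
    [0.3, 0.7],
)
bad_product = product_node([leaf({"X1"}), leaf({"X1", "X2"})])  # overlapping scopes
bad_sum = sum_node([leaf({"X1"}), leaf({"X2"})], [0.5, 0.5])    # differing scopes
```

Here `good` passes the check, while `bad_product` violates decomposability and `bad_sum` violates completeness.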
Besides representing probability distributions, the crucial advantage of SPNs is that they allow efficient inference: in particular, any marginalization task reduces to the corresponding marginalizations at the leaves (each leaf marginalizing only over its scope), followed by evaluating the internal nodes in a bottom-up pass [24]. Thus, marginalization in SPNs follows essentially the same procedure as evaluating the likelihood of a sample, and both scale linearly in the SPN’s representation size. Conditioning is tackled in a similar manner.
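As a small worked illustration of this bottom-up procedure, the sketch below evaluates a toy SPN over two binary variables with Bernoulli leaves. The tuple-based node encoding, the function names and all numeric parameters are assumptions made for this example only. A variable that is absent from the evidence is marginalized by letting its leaves evaluate to 1, and a conditional is obtained as a ratio of two such evaluations.

```python
def bernoulli_leaf(var, p):
    # ("leaf", variable name, probability that the variable equals 1)
    return ("leaf", var, p)

def evaluate(node, evidence):
    """Bottom-up evaluation. `evidence` maps variable -> 0/1; variables missing
    from `evidence` are marginalized, i.e. their leaves evaluate to 1."""
    kind = node[0]
    if kind == "leaf":
        _, var, p = node
        if var not in evidence:
            return 1.0
        return p if evidence[var] == 1 else 1.0 - p
    if kind == "product":
        result = 1.0
        for child in node[1]:
            result *= evaluate(child, evidence)
        return result
    # Sum node: weighted sum of child values.
    _, children, weights = node
    return sum(w * evaluate(c, evidence) for w, c in zip(weights, children))

# A complete and decomposable toy SPN: a mixture of two factorized distributions.
spn = ("sum",
       [("product", [bernoulli_leaf("A", 0.9), bernoulli_leaf("B", 0.2)]),
        ("product", [bernoulli_leaf("A", 0.1), bernoulli_leaf("B", 0.7)])],
       [0.4, 0.6])

joint = evaluate(spn, {"A": 1, "B": 1})   # P(A=1, B=1)
marginal_a = evaluate(spn, {"A": 1})      # P(A=1), B marginalized at the leaves
conditional = joint / marginal_a          # P(B=1 | A=1)
```

The recursion above visits a tree; in a DAG with shared nodes one would cache node values so that each node is evaluated once, which is what gives the linear cost mentioned in the text.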
Learning the parameters of SPNs, i.e. the sum weights and the parameters of input distributions, can be addressed in various ways. By interpreting the sum nodes as discrete latent variables [25, 37, 21], SPNs can be trained using the classical expectation-maximization (EM) algorithm [21]. Zhao et al. [38] derived a concave-convex procedure, which interestingly coincides with the EM updates for sum-weights. Moreover, SPN parameters can be treated in the Bayesian framework, as proposed in [26, 36, 31]. Trapp et al. [30] introduced a safe semi-supervised learning scheme for discriminative and generative parameter learning, providing guarantees for the performance in the semi-supervised case. Vergari et al. [35] employed SPNs as probabilistic autoencoders and unsupervised representation learners. The structure of SPNs can be crafted by hand [25, 22] or learned from data. Most structure learners [28, 34, 1, 18] are variations of the divide-and-conquer scheme known as LearnSPN [9]. This scheme recursively splits the data via clustering (determining sum nodes) and independence tests (determining product nodes).
In general, most approaches to learning SPNs are motivated by techniques borrowed from the graphical model literature. However, from their definition it is evident that SPNs can also be interpreted as a special kind of neural network. In this paper, we aim to follow through on this interpretation and investigate how SPNs perform when treated as a connectionist model. To this end, we make a drastic simplification by simply picking a scalable random structure and optimizing its parameters in a deep learning manner.
3 Random Tensorized Sum-Product Networks
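In order to construct random-and-tensorized SPNs (RAT-SPNs) we use a region graph [25, 6, 20] as an abstract representation of the network structure. Given a set of RVs $\mathbf{X}$, a region $\mathbf{R}$ is defined as any non-empty subset of $\mathbf{X}$. Given any region $\mathbf{R}$, a $K$-partition $\mathcal{P}$ of $\mathbf{R}$ is a collection of $K$ non-empty, non-overlapping subsets $\mathbf{R}_1, \dots, \mathbf{R}_K$ of $\mathbf{R}$, whose union is again $\mathbf{R}$, i.e., $\mathcal{P} = \{\mathbf{R}_1, \dots, \mathbf{R}_K\}$, $\mathbf{R}_k \neq \emptyset$, $\mathbf{R}_k \cap \mathbf{R}_l = \emptyset$ for $k \neq l$, $\bigcup_k \mathbf{R}_k = \mathbf{R}$. In this paper, we consider only 2-partitions, which causes all product nodes in our SPNs to have exactly two children. This assumption, frequently made in the SPN literature, simplifies SPN design and seems not to impair performance.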
A region graph $\mathcal{R}$ over $\mathbf{X}$ is a DAG whose nodes are regions and partitions such that i) $\mathbf{X}$ is a region in $\mathcal{R}$ and has no parents (root region), ii) all other regions have at least one parent, iii) all children of regions are partitions and all children of partitions are regions (i.e., $\mathcal{R}$ is bipartite), iv) if $\mathcal{P}$ is a child of $\mathbf{R}$, then $\bigcup_{\mathbf{R}' \in \mathcal{P}} \mathbf{R}' = \mathbf{R}$ and v) if $\mathbf{R}$ is a child of $\mathcal{P}$, then $\mathbf{R} \in \mathcal{P}$. From this definition it follows that a region graph dictates a hierarchical partition of the overall scope $\mathbf{X}$. We denote regions which have no child partitions as leaf regions.
Given a region graph, we can construct a corresponding SPN, as illustrated in Algorithm 1. In this paper we assume a classification problem with classes (for density estimation we simply set ), where each class-conditional distribution corresponds to a root of the RAT-SPN, i.e. the root represents . By multiplying the SPN roots with a prior , we get a full joint distribution . Further, is the number of input distributions per leaf region, and is the number of sum nodes in regions which are neither leaf nor root regions. It is easy to verify that Algorithm 1 always leads to a complete and decomposable SPN.
In this paper we construct random region graphs with the simple procedure depicted in Algorithm 2. We randomly divide the root region into two sub-regions of equal size and proceed recursively down to depth , resulting in an SPN of depth . This recursive splitting mechanism is repeated times. Figure 1 shows an example SPN with , , and , following Algorithm 2 and subsequently Algorithm 1. Note that this construction scheme yields SPNs where input distributions, sums, and products can be naturally organized in alternating layers. Similar to classical multi-layer perceptrons (MLPs), each layer takes inputs from its directly preceding layer only. Unlike MLPs, however, layers in RAT-SPNs are connected block-wise sparsely in a random fashion. Thus, layers in MLPs and RAT-SPNs are hardly comparable; however, we suggest understanding each pair of sum and product layers as roughly corresponding to one layer in an MLP: sum layers play the role of (block-wise sparse) matrix multiplication and product layers that of nonlinearities (or, more precisely, bilinearities of their inputs). The input layer, containing the SPN’s leaf distributions, can be interpreted as a set of nonlinear feature extractors.
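The random splitting procedure described above can be sketched in a few lines. This is an illustrative reading of the text, not a transcription of Algorithm 2: the function names, the representation of regions as `frozenset`s, the `seed` argument, and the handling of odd-sized regions (sizes differing by one) are all assumptions.

```python
import random

def random_region_graph(variables, depth, repetitions, seed=0):
    """Return (root_region, partitions). Each partition is a pair
    (parent_region, (left_region, right_region)); regions are frozensets."""
    rng = random.Random(seed)
    root = frozenset(variables)
    partitions = []

    def split(region, remaining_depth):
        # Stop at the requested depth or when a region cannot be split further.
        if remaining_depth == 0 or len(region) < 2:
            return
        items = sorted(region)
        rng.shuffle(items)
        half = len(items) // 2
        left, right = frozenset(items[:half]), frozenset(items[half:])
        partitions.append((region, (left, right)))
        split(left, remaining_depth - 1)
        split(right, remaining_depth - 1)

    # Repeat the recursive random split of the root region several times.
    for _ in range(repetitions):
        split(root, depth)
    return root, partitions

def leaf_regions(partitions):
    """Regions that appear as children but never as the parent of a partition."""
    parents = {parent for parent, _ in partitions}
    children = {r for _, pair in partitions for r in pair}
    return children - parents

root, parts = random_region_graph(range(8), depth=2, repetitions=3, seed=1)
```

With 8 variables, depth 2 and 3 repetitions, each repetition contributes one split of the root and two splits of its halves, so `parts` holds 9 two-way partitions and every leaf region contains 2 variables.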
3.1 Training and Implementation
Let be a training set of inputs and class labels . Furthermore, let be the output of the RAT-SPN and all SPN parameters. We train RAT-SPNs by minimizing the objective
(1) 
where is the cross-entropy
(2) 
and is the normalized negative log-likelihood
(3) 
When setting , we purely optimize cross-entropy (discriminative setting), while for we perform maximum likelihood training (generative setting). For , we have a continuum of hybrid objectives, trading off the generative and discriminative character of the model.
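To make the trade-off concrete, here is a minimal sketch of such a hybrid objective. Several details are assumptions rather than statements from the text: the trade-off weight is called `lam` and enters as a convex combination, the class prior is taken as uniform (so it only contributes a constant that is omitted), and the negative log-likelihood is normalized by the number of samples times the number of input features.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def hybrid_objective(log_outputs, labels, lam, num_features):
    """log_outputs[n][c] is the log-output of the c-th root for sample n.
    lam = 1 gives pure cross-entropy, lam = 0 pure normalized negative log-likelihood."""
    n = len(labels)
    cross_entropy = 0.0
    neg_log_lik = 0.0
    for row, y in zip(log_outputs, labels):
        log_norm = log_sum_exp(row)         # log of the summed root outputs
        cross_entropy -= row[y] - log_norm  # -log posterior of the true class
        neg_log_lik -= log_norm             # -log likelihood of the input
    cross_entropy /= n
    neg_log_lik /= n * num_features         # assumed normalization
    return lam * cross_entropy + (1.0 - lam) * neg_log_lik

# Two samples, two classes; values are made up for illustration.
log_outputs = [[math.log(0.2), math.log(0.6)], [math.log(0.5), math.log(0.1)]]
labels = [1, 0]
discriminative = hybrid_objective(log_outputs, labels, lam=1.0, num_features=2)
generative = hybrid_objective(log_outputs, labels, lam=0.0, num_features=2)
hybrid = hybrid_objective(log_outputs, labels, lam=0.5, num_features=2)
```

Intermediate values of `lam` interpolate linearly between the two extremes, which is one simple way to obtain the continuum of hybrid objectives described above.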
We implemented RAT-SPNs in Python/TensorFlow, where the nodes of a region are represented by a matrix with rows corresponding to samples in a mini-batch (containing samples throughout our experiments) and columns corresponding to the number of distributions in the region (either , or ). All computations are performed in the log-domain, using the well-known log-sum-exp trick, readily provided in TensorFlow. Sum-weights, which we require to be non-negative and normalized, are re-parameterized via log-softmax layers. Product tensors are implemented by taking outer products (actually sums in the log-domain) of the two matrices below, realized by broadcasting. Throughout our experiments, we used Adam [13] in its default settings. As input distributions, we used Gaussian distributions with isotropic covariances, i.e. each input distribution further decomposes into a product of single-dimensional Gaussians. We tried to optimize the variances jointly with the means which, however, delivered worse results than merely setting all variances uniformly to . While RAT-SPNs are implemented and trained in a seamless way, they still yield hundreds of tensors. This, together with performing computations in the log-domain, makes RAT-SPNs an order of magnitude slower than ReLU-MLPs of similar sizes. This disadvantage is mainly an effect of the simplicity of our implementation, which just employs native TensorFlow operations and a few dozen lines of Python code. The advantage of this implementation, however, is that RAT-SPNs can be combined with other deep learning methods in a simple plug-and-play manner. Moreover, we are confident that with sufficient engineering effort RAT-SPNs could be trained and run at a speed comparable to MLPs.
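The log-domain sum and product layers described above can be sketched as follows. To stay self-contained the sketch uses plain Python lists instead of TensorFlow tensors, so the outer product is written as a nested comprehension where a tensor framework would use broadcasting; all function names and shapes are illustrative assumptions.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_softmax(params):
    """Map unconstrained parameters to log-weights that are normalized."""
    norm = log_sum_exp(params)
    return [p - norm for p in params]

def product_layer(log_left, log_right):
    """Per-sample outer product of two child regions; in the log-domain
    the products become sums. Shapes: batch x K_left and batch x K_right
    in, batch x (K_left * K_right) out."""
    return [[a + b for a in left for b in right]
            for left, right in zip(log_left, log_right)]

def sum_layer(log_inputs, weight_params):
    """Weighted sums in the log-domain. log_inputs is batch x K,
    weight_params is S x K (unconstrained); the result is batch x S."""
    log_weights = [log_softmax(row) for row in weight_params]
    return [[log_sum_exp([lw + x for lw, x in zip(lw_row, sample)])
             for lw_row in log_weights]
            for sample in log_inputs]

# One sample, two distributions in each of the two child regions.
left = [[math.log(0.2), math.log(0.5)]]
right = [[math.log(0.3), math.log(0.4)]]
products = product_layer(left, right)
# One sum node with equal unconstrained parameters, i.e. uniform weights 1/4.
sums = sum_layer(products, [[0.0, 0.0, 0.0, 0.0]])
```

Parameterizing the weights through `log_softmax` keeps them non-negative and normalized for any real-valued parameters, so an unconstrained optimizer such as Adam can be applied directly.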
3.2 Probabilistic Dropout
The size of RAT-SPNs can be easily controlled via the structural parameters , , and . RAT-SPNs with many parameters, however, tend to overfit just like regular neural networks, which requires regularization. One of the classical techniques that boosted deep learning models is the well-known dropout heuristic [29], setting inputs and/or hidden units to zero with a certain probability , and rescaling the remaining units by . In the following we modify the dropout heuristic for RAT-SPNs, exploiting their probabilistic nature.
3.2.1 Dropout at Inputs: Marginalizing out Inputs
Dropout at inputs essentially marks input features as missing at random. In the probabilistic paradigm, we would simply marginalize over these missing features. Fortunately, this is an easy exercise in SPNs, as we only need to set the distributions corresponding to the dropped-out features to $1$. As we operate in the log-domain, this means setting the corresponding log-distribution nodes to $0$. This is in fact quite similar to standard dropout, except that we do not compensate by rescaling the remaining units, and that blocks of units are dropped out (i.e., all log-distributions whose scope corresponds to a missing input feature are jointly set to $0$).
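A minimal sketch of this marginalization-style input dropout for a single factorized Gaussian leaf is given below. The unit variance, the function names and the `keep_prob` parameter are assumptions for illustration; in a full model the same mask would be shared by every leaf whose scope contains a dropped feature.

```python
import math
import random

def gaussian_log_pdf(x, mean, variance=1.0):
    """Log-density of a univariate Gaussian (variance 1.0 is an assumed default)."""
    return -0.5 * (math.log(2.0 * math.pi * variance) + (x - mean) ** 2 / variance)

def sample_keep_mask(num_features, keep_prob, rng):
    """True means the feature is kept, False means it is treated as missing."""
    return [rng.random() < keep_prob for _ in range(num_features)]

def leaf_log_density(x, means, keep_mask):
    """Log-density of a factorized Gaussian leaf. A dropped feature is
    marginalized: its factor is 1, i.e. it contributes 0 in the log-domain."""
    return sum(gaussian_log_pdf(xi, mi)
               for xi, mi, keep in zip(x, means, keep_mask) if keep)

x = [0.5, -1.0]
means = [0.0, 0.0]
full = leaf_log_density(x, means, [True, True])
partial = leaf_log_density(x, means, [True, False])    # second feature missing
nothing = leaf_log_density(x, means, [False, False])   # all features missing
mask = sample_keep_mask(1000, 0.8, random.Random(0))
```

Note that no rescaling is applied to the remaining factors, in line with the description above.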
3.2.2 Dropout at Sums: Injection of Discrete Noise
As discussed in [25, 37, 21], sum nodes in SPNs can be interpreted as marginalized latent variables, akin to the latent variable interpretation in mixture models. In particular, [21] introduced so-called augmented SPNs which explicitly incorporate these latent variables in the SPN structure. The augmentation introduces indicator nodes representing the states of the latent variables, which can switch the children of sum nodes on or off by connecting them via an additional product. This mechanism establishes the explicit interpretation of sum children as conditional distributions.
In RAT-SPNs, we can equally well interpret a whole region as a single latent variable, and the weights of each sum node in this region as the conditional distribution of this variable. Indeed, the argument in [21] also holds when introducing a set of indicators for a single latent variable which is shared by all sum nodes in one region. While the latent variables are not observed, we can employ a simple probabilistic version of dropout, by introducing artificial observations for them. For example, if the sum nodes in a particular region have children (i.e. the corresponding variable has states), then we could introduce artificial information that assumes a state in some subset of . By doing this for each latent variable in the network, we essentially select a small sub-structure of the whole SPN to explain the data; this argument is very similar to the original dropout proposal [29]. Implementing dropout at sum-layers is again straightforward: we select a subset of all product nodes which are connected to the sums in one region and set them to $0$ (actually $-\infty$ in the log-domain).
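The following sketch shows one way to realize this for a single sum node in the log-domain. The function names and the example numbers are hypothetical; the sketch also assumes that the weights are not renormalized after masking, which the text does not specify, and that the same mask would be reused for all sum nodes of a region.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))); returns -inf if every entry is -inf."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def dropout_sum(log_children, log_weights, keep_mask):
    """Log-domain sum node whose dropped children are set to -inf,
    i.e. to 0 in the linear domain."""
    masked = [lw + x if keep else float("-inf")
              for lw, x, keep in zip(log_weights, log_children, keep_mask)]
    return log_sum_exp(masked)

log_children = [math.log(0.1), math.log(0.3), math.log(0.6)]
log_weights = [math.log(0.5), math.log(0.25), math.log(0.25)]

no_dropout = dropout_sum(log_children, log_weights, [True, True, True])
with_dropout = dropout_sum(log_children, log_weights, [True, False, True])
all_dropped = dropout_sum(log_children, log_weights, [False, False, False])
```

A mask that drops every child yields `-inf`, so a practical implementation would presumably make sure that at least one child per region is kept.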
4 Experiments
4.1 Exploring the Capacity of RAT-SPNs
In our first experiment, we aim to empirically investigate the capacity of RAT-SPNs, by simply trying to overfit data with various model sizes. To this end, we fit RAT-SPNs on MNIST train data, using every combination of split depth , number of split repetitions and number of distributions per region . In this paper, we follow a data-agnostic setting, i.e. we deliberately do not exploit the neighborhood correlations present in images. Consequently, our models will perform the same for any permutation of pixels. The natural baselines in the data-agnostic setting are MLPs, where we take ReLU activations for the hidden units and linear activations for the output layer. We ran MLPs with every combination of number of layers in and number of hidden units in . For both RAT-SPNs and MLPs, we used Adam with its default parameters to optimize cross-entropy.
Figure 2 summarizes the training accuracy of both models after 200 epochs as a function of the number of parameters in the respective model. As one can see, RAT-SPNs can scale to millions of parameters, and furthermore, they are easily able to overfit the MNIST training set to the same extent as MLPs. For numbers of layers it seems that RAT-SPNs are suited slightly better to fit the data. This is in fact an artifact of SGD optimization: MLPs still jitter around during the last epochs, while the accuracy of RAT-SPNs remains stable.
These overfitting results give evidence that RAT-SPNs are capacity-wise at least as powerful as ReLU-MLPs. In the next experiment, we investigate whether RAT-SPNs are also on par with MLPs concerning generalization on classification tasks. Subsequently, we show that RAT-SPNs exhibit superior performance when dealing with missing features and are able to identify outliers reliably.
4.2 Generalization of RAT-SPNs
When trained without regularization, RAT-SPNs achieve less than on the test set of MNIST, which is rather inferior even for data-agnostic models. Therefore, we trained them with our probabilistic dropout variant as introduced in section 3.2. We cross-validated , and number of distributions per region , dropout rates for inputs in and dropout rates for sum-layers in . A dropout rate of means that a fraction of features is kept on average.
For comparison, we trained ReLU-MLPs with number of hidden layers in , number of hidden units in , input dropout rates in and dropout rates for hidden layers in . No dropout was applied to the output layer. We trained MLPs in two variants, namely ’vanilla’ (vMLPs), meaning that besides dropout no additional optimization tricks were applied, and a variant (MLP) also employing Xavier-initialization [10] and batch normalization [12]. While the latter should be considered the default variant to train MLPs, note that helpful heuristics like Xavier-initialization and batch normalization have evolved over decades, while similar techniques for RAT-SPNs are not available. Thus, vMLPs might serve as a fairer comparison.

Table 1: Classification accuracy and cross-entropy on the test set; number of model parameters in parentheses.

                 RAT-SPN           MLP               vMLP
Accuracy
  MNIST          98.19 (8.5M)      98.32 (2.64M)     98.09 (5.28M)
  FMNIST         89.52 (0.65M)     90.81 (9.28M)     89.81 (1.07M)
  20NG           47.8 (0.37M)      49.05 (0.31M)     48.81 (0.16M)
Cross-Entropy
  MNIST          0.0852 (17M)      0.0874 (0.82M)    0.0974 (0.22M)
  FMNIST         0.3525 (0.65M)    0.2965 (0.82M)    0.325 (0.29M)
  20NG           1.6954 (1.63M)    1.6180 (0.22M)    1.6263 (0.22M)
We trained on MNIST, fashion-MNIST (a dataset in the same format as MNIST, but with the task of classifying fashion items rather than digits; github.com/zalandoresearch/fashion-mnist) and 20 News Groups (20NG). The 20NG dataset is a text corpus of 18846 news documents that belong to 20 different news groups or classes. We first split the news documents into 13568 instances for training, 1508 for validation, and 3770 for testing. The text was pre-processed into a bag-of-words representation by keeping the top 1000 most relevant words according to their TF-IDF. Then, 50 topics were extracted using LDA [2] and employed as the new feature representation for classification.
Table 1 summarizes the classification accuracy and cross-entropy on the test set, as well as the size of the models in terms of number of parameters. As one can see, RAT-SPNs are on par with MLPs, being only slightly outperformed on these traditional classification tasks. However, as shown in the following sections, the real potential of probabilistic deep learning models actually lies beyond classical benchmark results.
4.3 Hybrid Post-Training
Recall that SPNs define a full joint distribution over both inputs and class variable , and that our objective (1) with trade-off parameter allows us to trade off between cross-entropy () and log-likelihood (). When , we cannot hope that the distribution over is faithful to the underlying data. By setting , however, we can obtain interesting hybrid models, yielding both a discriminative and generative behavior. To this end, we use the RAT-SPN with highest validation accuracy from the previous experiment, and post-train it for another 20 epochs, for various values of . This yields a natural trade-off between the log-likelihood over inputs and predictive performance regarding classification accuracy/cross-entropy. Figure 3 shows this trade-off. As one can see, by sacrificing little predictive performance, we can drastically improve the generative character of SPNs. The benefit of this is shown in the following.
4.4 SPNs Are Robust Under Missing Features
When input features in are missing at random, the probabilistic paradigm dictates marginalizing over them [16]. As SPNs allow marginalization simply and efficiently, we expect that RAT-SPNs should be able to robustly treat missing features, especially the “more generative” they are (corresponding to smaller ). To this end, we randomly discard a fraction of pixels in the MNIST test data, independently for each sample, and classify the data using RAT-SPNs trained with various values of , marginalizing missing features. This is the same procedure we used for probabilistic dropout during training, cf. section 3.2. Similarly, we might expect MLPs to perform robustly under missing features during test time, by applying (classical) dropout.
Figure 4 summarizes the classification results when varying between and . As one can see, RAT-SPNs with smaller are more stable under even large fractions of missing features. A particularly interesting choice is : here the corresponding RAT-SPN starts with an accuracy for no missing features and degrades very gracefully: for a large fraction of missing features () the advantage over MLPs is dramatic. Note that this result is consistent with other hybrid learning schemes applied in graphical models [23]. Purely discriminative RAT-SPNs and MLPs are roughly on par concerning robustness against missing features.
4.5 SPNs Know What They Don’t Know
Besides being robust against missing features, an important feature of (hybrid) generative models is that they are naturally able to detect outliers and peculiarities by monitoring the likelihood over inputs . To this end, we evaluated the likelihoods on the test set for both MNIST and fashion-MNIST, using the respective RAT-SPN post-trained with . We selected two thresholds of and by visual inspection of the likelihood histograms. These two values determine roughly the percentiles of most likely/unlikely samples. In both these sets, we selected, following the original order in MNIST, the first 10 samples which are correctly and incorrectly classified, respectively. This yields 4 groups of 10 samples each: outlier/correct, outlier/incorrect, inlier/correct, inlier/incorrect.
These samples are shown in Figure 5. Albeit qualitative, these results are interesting: one can visually confirm that the outlier MNIST digits are indeed peculiar, both the correctly and the incorrectly classified ones. Among the outlier/incorrect group are 2 digits (top row, right, 3rd and 8th), which are not recognizable to the authors either. The inlier/incorrect digits can be interpreted, to a certain extent, as the ambiguous ones, e.g. two ’2’s (bottom row, right, 5th and 6th) are similar to ’7’ (and indeed classified as such), or a digit (bottom row, right, 8th) which could either be ’6’ or ’0’. For fashion-MNIST, one can clearly see that the outliers are all low in contrast and fill the whole image. In one image (top row, right, 9th) the background has not been removed.
For a more objective analysis, we use a variant of transfer testing recently proposed by Bradshaw et al. [3]. This technique is quite simple: we feed a classifier trained on one domain (e.g. MNIST) with examples from a related but different domain, e.g. street view house numbers (SVHN) [19] or the handwritten digits of SEMEION [4], converted to MNIST format ( pixels, grey scale). While we would expect most classifiers to perform poorly in such a setting, an important property of an AI system would be to be aware that it is confronted with out-of-domain data and to be able to communicate this either to other parts of the system or to a human user. While Bradshaw et al. applied transfer testing to conditional models in order to assess output uncertainties, we follow an arguably more natural approach and assess input uncertainties in RAT-SPNs, i.e. their likelihoods over .
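For a model whose roots represent class-conditional distributions, the likelihood over inputs is obtained by weighting the root outputs with the class prior and summing over classes. A minimal log-domain sketch is shown below; the function names, the uniform default prior and the threshold value are hypothetical choices for illustration only.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def input_log_likelihood(log_class_conditionals, log_prior=None):
    """log p(x) = log sum_c p(x | c) p(c); the prior defaults to uniform."""
    num_classes = len(log_class_conditionals)
    if log_prior is None:
        log_prior = [-math.log(num_classes)] * num_classes
    return log_sum_exp([lc + lp
                        for lc, lp in zip(log_class_conditionals, log_prior)])

def flag_outliers(log_likelihoods, threshold):
    """Mark samples whose input log-likelihood falls below a chosen threshold."""
    return [ll < threshold for ll in log_likelihoods]

# Toy example with two classes and a uniform prior.
ll = input_log_likelihood([math.log(0.2), math.log(0.6)])
flags = flag_outliers([-1.0, -50.0], threshold=-10.0)
```

In practice the threshold would be chosen from held-out in-domain data, for instance by inspecting the histogram of log-likelihoods as described above.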
Figure 6, top, shows histograms of the log-likelihoods of the RAT-SPN post-trained with , when fed with MNIST test data (in-domain), SVHN test data (out-of-domain) and SEMEION (out-of-domain). The result is striking: the histogram shows that the likelihood over inputs provides a strong signal (note the y-axis log-scale) whether a sample comes from in-domain or out-of-domain. That is, RAT-SPNs have an additional communication channel to inform us whether we ought to trust their predictions. An MLP does not have such a means, as it does not represent a full joint distribution. However, a potential objection could be that this positive result for SPNs (or more generally, joint models) might actually stem in some way from the discriminative character of the model, rather than from its generative nature. After all, in order to compute SPN likelihoods in Figure 6, we simply had to sum over the SPN outputs, re-weighted by a class prior (assumed uniform here). Perhaps the strong outlier-detection signal merely stems from averaging predictive outputs? Thus, as a sanity check we perform the same computation in our trained MLPs. One might suspect that the result, although not interpretable as a log-probability, still yields a decent signal to detect outliers. In need of a name for this rather odd quantity, we name it mock-likelihood. Figure 6, bottom, shows histograms of this mock-likelihood: although histograms are more spread for out-of-domain data, they are highly overlapping, yielding no clear signal for out-of-domain vs. in-domain.
5 Conclusion
We introduced a particularly simple but effective way to train SPNs: simply pick a random structure and train it in end-to-end fashion like a neural network. This makes the application of SPNs within the deep learning framework seamless and allows the application of common deep learning tools such as automatic differentiation and easy use of GPUs. As a modest technical contribution, we adapted the well-known dropout heuristic and equipped it with a sound probabilistic interpretation within RAT-SPNs. RAT-SPNs show classification performance on par with traditional neural networks on several classification tasks. Moreover, RAT-SPNs demonstrate their full power when used as a generative model, showing remarkable robustness against missing features through exact and efficient inference and compelling results in anomaly/out-of-domain detection. In future work, the hybrid properties of RAT-SPNs could allow promising directions like new variants of semi-supervised or active learning. While this paper stays within the data-agnostic regime, in future work we will investigate SPNs tailored to structured data sources.
References
[1] T. Adel, D. Balduzzi, and A. Ghodsi. Learning the structure of sum-product networks via an SVD-based algorithm. In UAI, 2015.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3, 2003.
[3] J. Bradshaw, A. Matthews, and Z. Ghahramani. Adversarial examples, uncertainty, and transfer testing robustness in Gaussian process hybrid deep networks. preprint arXiv, 2017. arxiv.org/abs/1707.02476.
[4] M. Buscema. MetaNet*: The Theory of Independent Judges, volume 33. 02 1998.
[5] A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280–305, 2003.
[6] A. Dennis and D. Ventura. Learning the architecture of sum-product networks using clustering on variables. In Proceedings of NIPS, 2012.
[7] M. Abadi et al. (40 authors). TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
[8] R. Gens and P. Domingos. Discriminative learning of sum-product networks. In Proceedings of NIPS, pages 3248–3256, 2012.
[9] R. Gens and P. Domingos. Learning the structure of sum-product networks. In Proceedings of ICML, pages 873–880, 2013.
[10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS, pages 249–256, 2010.
[11] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of NIPS, pages 2672–2680, 2014.
[12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of ICML, 2015.
[13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of ICLR, 2015.
[14] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014. arXiv:1312.6114.
[15] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In Proceedings of AISTATS, pages 29–37, 2011.
[16] R. J. A. Little and D. B. Rubin. Statistical analysis with missing data, volume 333. John Wiley & Sons, 2014.
[17] J. Martens and V. Medabalimi. On the expressive efficiency of sum product networks. online, http://arxiv.org/abs/1411.7717, 2015.
[18] A. Molina, A. Vergari, N. Di Mauro, S. Natarajan, F. Esposito, and K. Kersting. Mixed sum-product networks: A deep architecture for hybrid domains. In Proceedings of AAAI, 2018.
[19] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[20] R. Peharz, B. Geiger, and F. Pernkopf. Greedy part-wise learning of sum-product networks. In Proceedings of ECML/PKDD, pages 612–627. Springer Berlin, 2013.
[21] R. Peharz, R. Gens, F. Pernkopf, and P. Domingos. On the latent variable interpretation in sum-product networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[22] R. Peharz, G. Kapeller, P. Mowlaee, and F. Pernkopf. Modeling speech with sum-product networks: Application to bandwidth extension. In Proceedings of ICASSP, pages 3699–3703, 2014.
[23] R. Peharz, S. Tschiatschek, and F. Pernkopf. The most generative maximum margin Bayesian networks. In Proceedings of ICML, pages 235–243, 2013.
[24] R. Peharz, S. Tschiatschek, F. Pernkopf, and P. Domingos. On theoretical properties of sum-product networks. In Proceedings of AISTATS, pages 744–752, 2015.
[25] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Proceedings of UAI, pages 337–346, 2011.
[26] A. Rashwan, H. Zhao, and P. Poupart. Online and distributed Bayesian moment matching for parameter learning in sum-product networks. In AISTATS, pages 1469–1477, 2016.
[27] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of ICML, pages 1278–1286, 2014.
[28] A. Rooshenas and D. Lowd. Learning Sum-Product Networks with Direct and Indirect Variable Interactions. ICML – JMLR W&CP, 32:710–718, 2014.
[29] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.
[30] M. Trapp, T. Madl, R. Peharz, F. Pernkopf, and R. Trappl. Safe semi-supervised learning of sum-product networks. In Proceedings of UAI, 2017.
[31] M. Trapp, R. Peharz, M. Skowron, T. Madl, F. Pernkopf, and R. Trappl. Structure inference in sum-product networks using infinite sum-product trees. In NIPS Workshop on Practical Bayesian Nonparametrics, 2016.
[32] B. Uria, I. Murray, and H. Larochelle. A deep and tractable density estimator. In Proceedings of ICML, pages 467–475, 2014.
[33] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of ICML, 2016.
[34] A. Vergari, N. Di Mauro, and F. Esposito. Simplifying, regularizing and strengthening sum-product network structure learning. In Proceedings of ECML/PKDD, pages 343–358. Springer, 2015.
[35] A. Vergari, R. Peharz, N. Di Mauro, A. Molina, K. Kersting, and F. Esposito. Sum-product autoencoding: Encoding and decoding representations using sum-product networks. In AAAI, 2018.
[36] H. Zhao, T. Adel, G. Gordon, and B. Amos. Collapsed variational inference for sum-product networks. In Proceedings of ICML, 2016.
[37] H. Zhao, M. Melibari, and P. Poupart. On the relationship between sum-product networks and Bayesian networks. In Proceedings of ICML, 2015.
[38] H. Zhao, P. Poupart, and G. J. Gordon. A unified approach for learning the parameters of sum-product networks. In Proceedings of NIPS, 2016.