1 Introduction
The central goal of probabilistic modeling is to approximate the data-generating distribution, in order to answer statistical queries by means of inference. In recent years many novel probabilistic models based on deep neural networks have been proposed, such as Variational Autoencoders (VAEs) (Kingma & Welling, 2014) (formerly known as Density Networks (MacKay, 1995)), Normalizing Flows (Flows) (Rezende & Mohamed, 2015; Papamakarios et al., 2019), Autoregressive Models (ARMs) (Larochelle & Murray, 2011; Uria et al., 2016), and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). While all these models have achieved impressive results on large-scale datasets, i.e. they have been successful in terms of representational power and learning, they unfortunately fall short in terms of inference, a main aspect of probabilistic modeling and reasoning (Pearl, 1988; Koller & Friedman, 2009). All of the mentioned models allow drawing unbiased samples, enabling inference via Monte Carlo estimation. This strategy, however, quickly becomes unreliable and computationally expensive for all but the simplest inference queries. Other approximate inference techniques, e.g. variational inference, are often biased and their inference quality might be hard to analyse. Besides sampling, only ARMs and Flows support efficient evaluation of the probability density for a given sample, which can be used, e.g., for model comparison and outlier detection.
However, even for ARMs and Flows the following inference task is computationally hard: Consider a density $p(\mathbf{X})$ over random variables $\mathbf{X} = \{X_1, \ldots, X_D\}$, where $D$ might be just in the order of a few dozen. For example, $\mathbf{X}$ might represent medical measurements of a particular person drawn from the general population modeled by $p$. Now assume that we split the variables into three disjoint sets $\mathbf{Q}$, $\mathbf{M}$, and $\mathbf{E}$ of roughly the same size, and that we wish to compute
$$p(\mathbf{q} \mid \mathbf{e}) = \int p(\mathbf{q}, \mathbf{m} \mid \mathbf{e}) \, d\mathbf{m} \qquad (1)$$
for arbitrary values $\mathbf{q}$ and $\mathbf{e}$. In words, we wish to predict the query variables $\mathbf{Q}$, based on the evidence $\mathbf{E} = \mathbf{e}$, while accounting for (marginalizing) all possible values of the missing variables $\mathbf{M}$. Conditional densities like (1) highlight the role of generative models as "multi-purpose predictors", since the choice of $\mathbf{Q}$, $\mathbf{E}$, and $\mathbf{M}$ can be arbitrary. However, evaluating conditional densities of this form is notoriously hard for any of the models above, which represents a principal drawback of these methods.
These shortcomings in terms of inference have motivated a growing stream of research on tractable probabilistic modeling, i.e. on constructing a constrained class of probabilistic models in which inference queries like (1) can be computed exactly and efficiently (in polynomial time). One of the most prominent families of tractable models are probabilistic circuits (PCs),¹ which represent probability densities via a computational graph of i) mixtures (convex sum nodes), ii) factorizations (product nodes), and iii) tractable distributions (leaves, input nodes). A key structural property of PCs is decomposability (Darwiche, 2003), which ensures that any integral, like those appearing in (1), can be computed in time linear in the circuit size. PCs can be equipped with further structural constraints, which unlock a wider range of tractable inference routines – see Section 2 for an overview. These structural constraints, however, make it hard to work with PCs in practice, as they lead to highly sparse and cluttered computational graphs, which are inapt for current machine learning frameworks. Furthermore, PCs are typically implemented in the log-domain, which slows down learning and inference even further.

¹ We adopt the name probabilistic circuits, as suggested by Van den Broeck et al. (2019), which serves as an umbrella term for many structurally related models, like arithmetic circuits, sum-product networks, cutset networks, etc. See Section 2 for details.
In this paper, we propose a novel implementation design for PCs, which ameliorates these practical difficulties and allows evaluating and training PCs up to two orders of magnitude faster than previous implementations (Pronobis et al., 2017; Peharz et al., 2019). The central idea is to compute all product and sum operations on the same topological layer using a single monolithic einsum operation.²

² The einsum operation implements the Einstein notation of tensor-product contraction, and unifies standard linear algebra operations like dot product, outer product, and matrix multiplication.

In that way, the main computational work is lifted by a parallel operation for which efficient implementations, both for CPU and GPU, are readily available in most numerical frameworks. In order to ensure numerical stability, we extend the well-known log-sum-exp trick to our setting, leading to the "log-einsum-exp" trick. Since our model implements PCs via a hierarchy of large einsum layers, we call our model Einsum Network (EiNet).

As further contributions, we present two algorithmic improvements for training PCs. First, we show that Expectation-Maximization (EM), the canonical maximum-likelihood learning algorithm for PCs (Peharz et al., 2017), can be easily implemented using the gradient of the model's log-probability. Thus, EM can be implemented using automatic differentiation, readily provided by most machine learning frameworks. Second, we leverage stochastic online EM (Sato, 1999) in order to further improve the learning speed of EiNets, or to enable EM on large datasets in the first place. In experiments we demonstrate that EiNets can rapidly learn generative models for street view house numbers (SVHN) and CelebA. To the best of our knowledge, this is the first time that PCs have been successfully trained on datasets of this size. EiNets are capable of producing high-quality image samples, while maintaining tractable inference, e.g. for conditional densities (1). This can be exploited for arbitrary image inpainting tasks and other advanced inference tasks.
Concerning notation, we use $X$, possibly with subscripts, to denote random variables and $x$ to denote a value of $X$. Sets (or tuples) of random variables are denoted as $\mathbf{X}$, also possibly with subscript, and their values as $\mathbf{x}$. We denote the set of possible values for (sets of) random variables with $\mathrm{val}(\mathbf{X})$. For directed graphs, we use $\mathrm{pa}(N)$ and $\mathrm{ch}(N)$ to denote the parents and children of node $N$, respectively.
2 Probabilistic Circuits
Probabilistic circuits (PCs) are a family of probabilistic models which allow a wide range of exact and efficient inference routines. The earliest representatives of PCs are arithmetic circuits (ACs) (Darwiche, 2002, 2003), which are based on decomposable negation normal forms (DNNFs) (Darwiche, 1999, 2001), a tractable representation for propositional logic formulas. Further members of the PC family are sum-product networks (SPNs) (Poon & Domingos, 2011), cutset networks (CNs) (Rahman et al., 2014), and probabilistic sentential decision diagrams (PSDDs) (Kisa et al., 2014). The main differences between the various types of PCs are i) the set of assumed structural constraints and tractable inference routines, ii) syntax and representation, and iii) application scenarios. In order to treat these models in a unified fashion, we adopt the umbrella term probabilistic circuits suggested by Van den Broeck et al. (2019), and discriminate the various instantiations of PCs mainly via their structural properties.
Definition 1 (Probabilistic Circuit).
Given a set of random variables $\mathbf{X}$, a probabilistic circuit (PC) is a tuple $(\mathcal{G}, \phi)$, where $\mathcal{G}$, denoted as computational graph, is a directed acyclic graph (DAG) and $\phi$, denoted as scope function, is a function assigning a scope to each node in $\mathcal{G}$, i.e. a subset of $\mathbf{X}$. For internal nodes of $\mathcal{G}$, i.e. any node $N$ which has children, the scope function satisfies $\phi(N) = \bigcup_{N' \in \mathrm{ch}(N)} \phi(N')$. A leaf $L$ of $\mathcal{G}$ computes a probability density over its scope $\phi(L)$. All internal nodes of $\mathcal{G}$ are either sum nodes ($S$) or product nodes ($P$). A sum node computes a convex combination of its children, i.e. $S = \sum_{N \in \mathrm{ch}(S)} w_{S,N}\, N$, where $w_{S,N} \geq 0$ and $\sum_{N \in \mathrm{ch}(S)} w_{S,N} = 1$. A product node computes a product of its children, i.e. $P = \prod_{N \in \mathrm{ch}(P)} N$.
PCs can be seen as a special kind of neural network, where the first layer computes non-linear functions (probability densities) over subsets of $\mathbf{X}$, and all internal nodes compute either weighted sums (linear functions) or products (multiplicative interactions) of their inputs. Each node $N$ in a PC sees only a subset of the inputs, namely the variables in its scope $\phi(N)$. The output of a PC is the value of one or more selected nodes in the computational graph $\mathcal{G}$. Typically, one constructs PCs with a single root $R$ (a node without parents), for which we assume full scope $\phi(R) = \mathbf{X}$, and defines a density over $\mathbf{X}$ proportional to the root's output, i.e. $p(\mathbf{x}) = R(\mathbf{x}) / \int R(\mathbf{x}')\, d\mathbf{x}'$. PCs as defined above define a density, but do not permit tractable inference yet. In particular, the normalization constant $\int R(\mathbf{x}')\, d\mathbf{x}'$ is hard to compute. Various tractable inference routines "can be bought" by imposing structural constraints on the PC, which we review next.
Decomposability. A PC is decomposable if for each product node $P$ it holds that $\phi(N') \cap \phi(N'') = \emptyset$, for $N', N'' \in \mathrm{ch}(P)$, $N' \neq N''$, i.e. a PC is decomposable if the children of product nodes have pairwise non-overlapping scopes. The most important consequence of decomposability is that integrals which can be written as nested single-dimensional integrals—in particular the normalization constant—can be computed in time linear in the circuit size (Peharz et al., 2015). This follows from the fact that any single-dimensional integral i) commutes with a sum node and ii) affects only a single child of a product node. Consequently, all integrals can be distributed to the PC's leaves, i.e. we simply need to perform integration at the leaves (which we assume to be tractable) and evaluate the internal nodes of the PC as usual, in a single feedforward pass.
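As a concrete illustration of this "integrate at the leaves" principle, the following minimal sketch (a hypothetical two-variable circuit with made-up parameters, not from the paper) marginalizes one variable by replacing its leaves with 1 and evaluating the circuit as usual:

```python
import numpy as np

def gauss(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

# Tiny smooth, decomposable circuit over X1, X2:
# root = 0.3 * (f1(x1) * g1(x2)) + 0.7 * (f2(x1) * g2(x2)).
def circuit(x1, x2=None):
    # Passing x2=None marginalizes X2: each X2-leaf integrates to 1,
    # so we simply replace its value by 1 and evaluate as usual.
    g1 = 1.0 if x2 is None else gauss(x2, -1.0, 1.0)
    g2 = 1.0 if x2 is None else gauss(x2, 2.0, 0.5)
    return 0.3 * gauss(x1, 0.0, 1.0) * g1 + 0.7 * gauss(x1, 1.0, 2.0) * g2

# Sanity check: brute-force numerical integration over X2 matches the
# single feedforward pass with the X2-leaves set to 1.
xs = np.linspace(-10.0, 10.0, 20001)
numeric = np.sum(circuit(0.5, xs)) * (xs[1] - xs[0])
assert abs(numeric - circuit(0.5)) < 1e-5
```

The same mechanism handles any subset of marginalized variables, which is exactly what makes queries like (1) linear-time in decomposable circuits.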
Smoothness. A PC is smooth if for each sum node $S$ it holds that $\phi(N') = \phi(N'')$, for $N', N'' \in \mathrm{ch}(S)$, i.e. a PC is smooth if the children of sum nodes have identical scopes. Smoothness does not have particular computational advantages, but leads to a well-defined probabilistic interpretation of PCs. In particular, in smooth (and decomposable) PCs any node is already a properly normalized density, since i) leaves are densities by assumption, ii) sum nodes are simply mixture densities, with their children being the mixture components, and iii) product nodes are factorized densities. Smooth and decomposable PCs admit a natural interpretation as latent variable models (Zhao et al., 2015; Peharz et al., 2017), which yields a natural sampling procedure via ancestral sampling and maximum-likelihood learning via EM. Smooth and decomposable PCs are often referred to as sum-product networks (SPNs) (Poon & Domingos, 2011).
In this paper, we consider only smooth and decomposable PCs (aka SPNs), which facilitate efficient marginalization and conditioning. A further structural property, not considered in this paper, is determinism (Darwiche, 2003), which allows for exact probability maximization (most-probable-explanation); this task is NP-hard in non-deterministic PCs (de Campos, 2011; Peharz et al., 2017). Moreover, PCs can be equipped with structured decomposability (Kisa et al., 2014), a stronger notion of decomposability which allows circuit multiplication and computing certain expectations (Shen et al., 2016; Khosravi et al., 2019).
3 Einsum Networks
3.1 Vectorizing Probabilistic Circuits
An immediate way to obtain a denser and thus more efficient layout for PCs is to vectorize them. To this end, we redefine a leaf to be a vector of $K$ densities over its scope, rather than a single density. For example, a leaf computing a Gaussian density is replaced by a vector of $K$ Gaussians over the same scope, each equipped with private parameters. A product node is redefined to be an outer product $N' \otimes N''$, containing the products of all possible combinations of densities coming from the child vectors. Since the number of elements in outer products grows exponentially, we restrict the number of children to two (this constraint is frequently imposed on PCs for simplicity). Finally, we redefine sum nodes to be a vector of $K$ weighted sums, where each individual sum operation has its private weights and computes a convex combination over all the densities computed by its children. The vectorized version of PCs is frequently called a region graph (Dennis & Ventura, 2012; Trapp et al., 2019a) and has been used in previous GPU-supporting implementations (Pronobis et al., 2017; Peharz et al., 2019). It is easily verified that our desired structural properties—smoothness and decomposability—carry over to vectorized PCs.

For the remainder of the paper, we use $S$, $P$, $L$ for the vectorized versions, and refer to them as sum nodes, product nodes, and leaf nodes, respectively, or also simply as sums, products and leaves. To any single entry in these vectors we explicitly refer to as entry, or operation. In principle, the number of entries $K$ could be different for each leaf or sum, which would, however, lead to a less homogeneous PC design. Therefore, in this paper, we assume for simplicity the same $K$ for all leaves and sums. Furthermore, we make some simplifying assumptions about the structure of $\mathcal{G}$. First, we assume a structure alternating between sums/leaves and products, i.e. children of sums can only be products, and children of products can only be sums or leaves. Second, we also assume that the root of the PC is a sum node. These assumptions are commonly made in the PC literature and are no restriction of generality. Furthermore, we also assume that each product node has at most one parent (which must be a sum due to the alternating structure). This is also not a severe structural assumption: if a product has two or more sum nodes as parents, then, by smoothness, these sum nodes all have the same scope. Consequently, they could simply be concatenated into a single sum vector. Of course, since in this paper we assume a constant length $K$ for all sum and leaf vectors, this assumption requires a large enough $K$.
3.2 The Basic Einsum Operation
The core computational unit in EiNets is the vectorized PC excerpt in Fig. 1, showing a sum node $S$ with a single product child $P$, which itself has two children $N'$ and $N''$ (shown here as sum nodes, but they could also be leaves). Nodes $N'$ and $N''$ each compute a vector of $K$ densities, the product node computes the outer product $N' \otimes N''$, and the sum node computes a matrix-vector product $W \cdot (N' \otimes N'')$. Here, $W$ is an elementwise non-negative $K \times K^2$ matrix whose rows sum to one, and $N' \otimes N''$ unrolls to a vector of $K^2$ elements.
Previous PC implementations (Pronobis et al., 2017; Peharz et al., 2019) are also based on this core computational unit. However, for numerical stability, they use a computational workaround in the log-domain: the outer product is transformed into an "outer sum" of log-densities (realized with broadcasting); the matrix multiplication is implemented using a broadcasted sum of $\log W$ and $\log(N' \otimes N'')$, to which a log-sum-exp operation is then applied, yielding $\log S$. This workaround introduces significant overhead and needs to allocate the $K^2$ products explicitly. Mathematically, however, the PC excerpt in Fig. 1 is a simple multilinear form, naturally expressed in Einstein notation:
$$S_k = \sum_{i,j} W_{kij}\, N'_i\, N''_j \qquad (2)$$
Here we have reshaped $W$ into an elementwise non-negative $K \times K \times K$ tensor, normalized over its last two dimensions, i.e. $W_{kij} \geq 0$ and $\sum_{i,j} W_{kij} = 1$. The signature in (2) mentions three indices $k$, $i$ and $j$, labeling the axes of $W$, $N'$, and $N''$. Axes with the same index get multiplied. Furthermore, any indices not mentioned on the left-hand side get summed out. General-purpose Einstein summations are readily implemented in most numerical frameworks, and usually denoted as einsum operation.
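In a framework like NumPy, Eq. (2) is literally one einsum call. A linear-domain sketch with hypothetical sizes and random parameters (the numerically stable log-domain variant is the subject of the derivation below):

```python
import numpy as np

K = 4
rng = np.random.default_rng(0)

# Child density vectors N', N'' (linear domain) and a K x K x K weight
# tensor W, element-wise non-negative and normalized over its last two axes.
n1 = rng.random(K)
n2 = rng.random(K)
W = rng.random((K, K, K))
W /= W.sum(axis=(1, 2), keepdims=True)

# Eq. (2): S_k = sum_{i,j} W_{kij} N'_i N''_j, as a single einsum call.
S = np.einsum('kij,i,j->k', W, n1, n2)

# Reference: materialize the K x K product node explicitly.
outer = np.outer(n1, n2)
S_ref = W.reshape(K, -1) @ outer.reshape(-1)
assert S.shape == (K,)
assert np.allclose(S, S_ref)
```

Note that the einsum version never stores the $K \times K$ outer product, which is exactly the memory saving discussed later in this section.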
However, applying (2) in a naive way would quickly lead to numerical underflow and unstable training. In order to ensure numerical stability, we develop a technique similar to the classical "log-sum-exp" trick. We keep all probabilistic values in the log-domain, but the weight-tensor $W$ is kept in the linear domain. Consequently, we need a numerically stable computation for
$$\log S_k = \log \sum_{i,j} W_{kij} \exp(\log N'_i) \exp(\log N''_j) \qquad (3)$$
Let us define $a = \max_i \log N'_i$ and $b = \max_j \log N''_j$. Then, we can show that $\log S_k$ can be computed as
$$\log S_k = a + b + \log \sum_{i,j} W_{kij} \exp(\log N'_i - a) \exp(\log N''_j - b) \qquad (4)$$
To see that (4) is correct, note that for the last term we have
$$\log \sum_{i,j} W_{kij} \exp(\log N'_i - a) \exp(\log N''_j - b) = \log \Big( e^{-a-b} \sum_{i,j} W_{kij}\, N'_i\, N''_j \Big) = \log S_k - a - b.$$
Substituting the last line into (4) yields exactly (3), so our log-einsum-exp trick delivers the correct result. A sufficient condition for numerical stability of (4) is that all sum-weights are larger than $0$, since in this case the maximal values in the vectors $\exp(\log N' - a)$ and $\exp(\log N'' - b)$ are guaranteed to be $1$, leading to a positive argument for the $\log$. This is not a severe requirement, as positive sum-weights are commonly enforced in PCs, e.g. by using Laplace smoothing or imposing a positive lower bound on the weights.
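A minimal NumPy sketch of the log-einsum-exp trick (hypothetical sizes and random parameters): it stays finite where the naive evaluation of (3) underflows, and agrees with an exactly shifted reference computation.

```python
import numpy as np

def log_einsum_exp(W, log_n1, log_n2):
    """Numerically stable evaluation in the spirit of Eq. (4):
    probabilistic values stay in the log-domain, W stays linear."""
    a = log_n1.max()
    b = log_n2.max()
    s = np.einsum('kij,i,j->k', W, np.exp(log_n1 - a), np.exp(log_n2 - b))
    return a + b + np.log(s)

K = 4
rng = np.random.default_rng(0)
W = rng.random((K, K, K)) + 1e-3              # strictly positive sum-weights
W /= W.sum(axis=(1, 2), keepdims=True)
# Log-densities so small that exp() underflows to 0 in float64.
log_n1 = rng.normal(size=K) - 1000.0
log_n2 = rng.normal(size=K) - 1000.0

stable = log_einsum_exp(W, log_n1, log_n2)
naive = np.log(np.einsum('kij,i,j->k', W, np.exp(log_n1), np.exp(log_n2)))

# Reference: shift both vectors into a safe range, compute exactly, shift back.
exact = np.log(np.einsum('kij,i,j->k', W,
                         np.exp(log_n1 + 1000.0),
                         np.exp(log_n2 + 1000.0))) - 2000.0

assert np.all(np.isfinite(stable))
assert np.all(np.isinf(naive))                # naive computation underflows
assert np.allclose(stable, exact)
```

The positive-weight condition from the text is reflected in the `+ 1e-3` offset: with strictly positive weights, the argument of the final `log` is guaranteed to be positive.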
Given two $K$-dimensional vectors $\log N'$, $\log N''$ and the weight-tensor $W$, our basic einsum operation (4) requires $2K$ exp-operations, $K$ log-operations, and $O(K^3)$ multiplications and additions. We need to store the $3K$ values for $\log N'$, $\log N''$, and $\log S$, while the $K^2$ product operations are not stored explicitly. In contrast, the indirect implementations of the same operation in (Pronobis et al., 2017; Peharz et al., 2019) need $O(K^3)$ additions, $O(K^3)$ exp-operations and $K$ log-operations. These implementations also store the $3K$ values for $\log N'$, $\log N''$, and $\log S$, plus an additional $K^2$ values for the explicitly computed products. While our implementation is cubic in the number of multiplications, the previous implementations need a cubic number of exp-operations. This partially explains the speedup of EiNets in our experiments in Section 4. However, the main reasons for the observed speedup are i) an optimized implementation of the einsum operation, ii) avoiding the overhead of allocating product nodes, and iii) a higher degree of parallelism, as discussed in the next section.
3.3 The Einsum Layer
Rather than computing single vectorized sums, we can do better by computing whole layers in parallel. To this end, we first organize the PC in a layer-wise structure: we traverse the PC top-down and construct a topologically sorted list of layers of nodes, alternating between sums and products, and starting with the leaf nodes. Nodes are only inserted in a layer if all their parents have already been inserted in some layer above, i.e. all nodes in the $l$-th layer depend only on nodes in layers with index strictly smaller than $l$. Furthermore, note that since we can assume that each product has exactly one parent (see Section 3.1), a consecutive pair of product and sum layers will always be such that the product layer contains exactly the inputs to the sum layer. Pseudocode for organizing PCs in topological layers is provided in the supplementary. Our strategy for EiNets is to perform efficient parallel computations for i) the whole leaf layer, and ii) the whole sum layer in each pair of consecutive product and sum layers. In both cases, the result is a matrix of log-densities, with as many rows as there are leaves (resp. sums) in the layer, and $K$ columns.
Computing the input layer is discussed in Section 3.4. For now, consider a layer of $L$ sum nodes, as illustrated in Fig. 2. Here, we have assumed that each sum node has only a single child. We discuss this simpler case first and handle sum nodes with multiple children below. In order to compute the sums, we collect all the vectors of the "left" product children in an $L \times K$ matrix $N'$ and similarly all the "right" product children in an $L \times K$ matrix $N''$, where "left" and "right" are arbitrary but fixed assignments. We further extend the 3-dimensional weight-tensor $W$ from (2) to a 4-dimensional $L \times K \times K \times K$ tensor, where the $l$-th slice contains the weights for the $l$-th vectorized sum node. The result of all sums—i.e. in total $L \times K$ sum operations—can be computed with a single einsum operation:
$$S_{lk} = \sum_{i,j} W_{lkij}\, N'_{li}\, N''_{lj} \qquad (5)$$
Note the similarity to (2), and that we parallelized over the whole sum layer by simply introducing an additional index $l$. Consequently, it is straightforward to extend the log-einsum-exp trick (see Section 3.2) to (5). Constructing the two matrices $N'$ and $N''$ requires some bookkeeping and introduces some computational overhead, stemming from extracting and concatenating slices from the log-probability tensors below. This overhead is essentially a symptom of the sparse and cluttered layout of PCs. The main computational work, however, is then performed by the highly parallel einsum operation (5).
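The layer-wise operation of Eq. (5) can be sketched as follows (NumPy, hypothetical sizes; `A` and `B` stand in for the stacked "left"/"right" child matrices):

```python
import numpy as np

L, K = 3, 4                                   # hypothetical layer sizes
rng = np.random.default_rng(0)

# Stacked "left"/"right" child matrices (L x K) and an L x K x K x K
# weight tensor: one K x K x K slice per vectorized sum node.
A = rng.random((L, K))
B = rng.random((L, K))
W = rng.random((L, K, K, K))
W /= W.sum(axis=(2, 3), keepdims=True)

# Eq. (5): the whole sum layer in a single einsum call.
S = np.einsum('lkij,li,lj->lk', W, A, B)

# Reference: the per-node einsum of Eq. (2), applied node by node.
S_ref = np.stack([np.einsum('kij,i,j->k', W[l], A[l], B[l]) for l in range(L)])
assert S.shape == (L, K)
assert np.allclose(S, S_ref)
```

The single batched call replaces $L$ separate kernel launches, which is where the parallelism gain over node-by-node evaluation comes from.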
Eq. (5) computes whole sum-product layers when the sum nodes are restricted to single children. In order to compute general layers, we express any layer containing sums with multiple children as two consecutive sum layers. The first layer consists of sum nodes with single children, one for each product child of the original sum layer. The second layer takes elementwise mixtures of the sum nodes in the first layer, over those sum nodes which correspond to children of the original layer. Note that this structure is simply an overparameterization (Trapp et al., 2019b) of the original sum nodes, decomposing them into chains of two sum nodes. The first layer can now be efficiently computed with (5). The second layer, which we denote as mixing layer, can also be computed in parallel with an einsum operation. For details, see the supplementary paper.
3.4 Exponential Families as Input Layer
The leaves of EiNets compute log-densities of an exponential family (EF), which has the form $p(x \mid \theta) = h(x) \exp(\theta^\top T(x) - A(\theta))$, where $\theta$ are the natural parameters, $h(x)$ is the so-called base measure, $T(x)$ the sufficient statistic, and $A(\theta)$ the log-normalizer. Many parametric distributions can be expressed as EFs, e.g. Gaussian, Binomial, Categorical, Beta, etc. In order to facilitate learning using EM, we keep the parameters in their expectation form $\mu$ (Sato, 1999). The natural parameters $\theta$ and expectation parameters $\mu$ are one-to-one, and connected via $\mu = \mathbb{E}_{p(x \mid \theta)}[T(x)]$ and the gradient of the negative entropy $-H$ with respect to $\mu$. This dual parameterization allows us to implement EM on the abstract level of EFs, while particular instances of EFs are easily implemented by providing $h$, $T$, $A$, and the conversion $\theta(\mu)$.
In order to compute all leaves in parallel, we first compute a $D \times R \times K$ tensor of EF log-probabilities, where $D$ is the number of RVs and $R$ is the number of so-called replica. Each entry is a log-density of an EF over a single RV. The number of replica $R$ must be large enough to ensure that each leaf has a set of private EF distributions, i.e. each leaf has an index $r$ to a particular replica, such that leaves with the same $r$ have disjoint scope. The number of required replica and an assignment of leaves to replica can be inferred from the PC's structure. The EF tensor is parameterized with a $D \times R \times K \times S$ tensor, where $S$ is the dimensionality of the sufficient statistic $T(x)$. Note that the EF tensor can be computed in parallel with a handful of parallel operations, e.g. the inner product $\theta^\top T(x)$, evaluating $A(\theta)$, etc. It is then used to compute the leaves, which, in this paper, are simply factorizations over the log-densities it contains.
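As an illustration, the following sketch builds such a tensor of EF log-densities for a unit-variance Gaussian EF (hypothetical sizes; symbols $D$, $R$, $K$ as used here denote the number of RVs, replica, and the vector length):

```python
import numpy as np

D, R, K = 5, 2, 3          # number of RVs, replica, vector length (hypothetical)
rng = np.random.default_rng(0)

# Unit-variance Gaussian as EF: T(x) = x, A(theta) = theta^2 / 2,
# h(x) = exp(-x^2 / 2) / sqrt(2 pi); the natural parameter is the mean.
theta = rng.normal(size=(D, R, K))
x = rng.normal(size=D)      # one sample, one value per RV

log_h = -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)                   # shape (D,)
log_prob = log_h[:, None, None] + theta * x[:, None, None] - 0.5 * theta ** 2

# Each entry is the log-density log N(x_d | theta_{d,r,k}, 1).
ref = -0.5 * (x[:, None, None] - theta) ** 2 - 0.5 * np.log(2 * np.pi)
assert log_prob.shape == (D, R, K)
assert np.allclose(log_prob, ref)
```

All $D \cdot R \cdot K$ log-densities are produced by a few broadcasted operations, with no per-leaf loop.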
3.5 ExpectationMaximization (EM)
A natural way to learn PCs is the EM algorithm, which is known to rapidly increase the likelihood, especially in early iterations (Salakhutdinov et al., 2003). EM for PCs was derived in (Peharz et al., 2017), leading to the following update rules for sum-weights and leaves:
$$n_{S,N} = \sum_{\mathbf{x}} \frac{w_{S,N}\, N(\mathbf{x})}{p(\mathbf{x})}\, \frac{\partial p(\mathbf{x})}{\partial S(\mathbf{x})}, \qquad w_{S,N} \leftarrow \frac{n_{S,N}}{\sum_{N' \in \mathrm{ch}(S)} n_{S,N'}} \qquad (6)$$

$$\mu_{L} \leftarrow \frac{\sum_{\mathbf{x}} c_{L}(\mathbf{x})\, T(\mathbf{x})}{\sum_{\mathbf{x}} c_{L}(\mathbf{x})}, \qquad \text{with } c_{L}(\mathbf{x}) = \frac{L(\mathbf{x})}{p(\mathbf{x})}\, \frac{\partial p(\mathbf{x})}{\partial L(\mathbf{x})} \qquad (7)$$
where the sums in (6) and (7) range over all training examples $\mathbf{x}$. In (Peharz et al., 2017) and (Peharz et al., 2019), the derivatives $\partial p(\mathbf{x}) / \partial S(\mathbf{x})$ and $\partial p(\mathbf{x}) / \partial L(\mathbf{x})$ required for the expected statistics were computed with an explicitly implemented backwards pass, performed in the log-domain for robustness. Here we show that this implementation overhead can be avoided by leveraging automatic differentiation. Recall that EiNets represent all probability values in the log-domain, thus the PC output is actually $\log p(\mathbf{x})$, rather than $p(\mathbf{x})$. Calling autodiff on $\log p(\mathbf{x})$ yields the following derivative for each sum-weight (omitting argument $\mathbf{x}$): $\frac{\partial \log p}{\partial w_{S,N}} = \frac{1}{p}\, \frac{\partial p}{\partial S}\, N$, which, multiplied by $w_{S,N}$, is exactly the summand of $n_{S,N}$ in (6), i.e. autodiff readily provides the required expected statistics for sum nodes. Many frameworks accumulate the gradient over repeated backward calls by default, as required by the sums in (6) and (7), leading to an embarrassingly simple implementation of the E-step. For the M-step, we simply multiply the result of the accumulator with the current weights and renormalize. Furthermore, recall that each single-dimensional leaf is implemented as the log-density of an EF. Taking the gradient yields $\frac{\partial \log p}{\partial \log L} = \frac{L}{p}\, \frac{\partial p}{\partial L}$, which is the coefficient $c_L$ in (7). Thus, autodiff also implements most of the EM update for the leaves. We simply need to accumulate both $c_L(\mathbf{x})$ and $c_L(\mathbf{x})\, T(\mathbf{x})$ over the whole dataset, and use (7) to update the expectation parameters $\mu_L$. Note that both sum nodes and leaves can be updated using the same calls to autodiff.
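The equivalence between the autodiff gradient and the expected statistics can be checked on the smallest possible case, a single sum node over Gaussian leaves (a hypothetical toy model; the gradient is taken by finite differences here, mimicking what an autodiff framework would return):

```python
import numpy as np

def gauss(x, mu):
    # Unit-variance Gaussian density.
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

# A single sum node over three Gaussian leaves: p(x) = sum_k w_k N_k(x).
w = np.array([0.2, 0.5, 0.3])
mus = np.array([-1.0, 0.0, 2.0])
x = 0.7

comps = gauss(x, mus)
p = w @ comps
responsibility = w * comps / p            # the classic E-step quantity

# Finite-difference d log p / d w_k.
eps = 1e-7
grad = np.array([(np.log((w + eps * np.eye(3)[k]) @ comps) - np.log(p)) / eps
                 for k in range(3)])

# Multiplying the gradient of log p by the current weights recovers
# the expected statistics (the EM responsibilities).
assert np.allclose(w * grad, responsibility, atol=1e-5)
assert np.isclose(responsibility.sum(), 1.0)
```

In an EiNet, the same identity holds for every sum node simultaneously, so one backward pass over $\log p$ delivers all E-step statistics at once.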
The classical EM algorithm uses a whole pass over the training set for a single update, which is computationally wasteful due to redundancies in the training data (Bottou, 1998). Similar to stochastic gradient descent (SGD), it is possible to define a stochastic version of EM (Sato, 1999). To this end, one replaces the sums over the entire dataset in (6) and (7) with sums over minibatches, yielding minibatch statistics $n^{\mathrm{batch}}_{S,N}$ and $\mu^{\mathrm{batch}}_{L}$. The full EM update is then replaced with the gliding averages

$$w_{S,N} \leftarrow (1-\lambda)\, w_{S,N} + \lambda\, \frac{n^{\mathrm{batch}}_{S,N}}{\sum_{N' \in \mathrm{ch}(S)} n^{\mathrm{batch}}_{S,N'}} \qquad (8)$$

$$\mu_{L} \leftarrow (1-\lambda)\, \mu_{L} + \lambda\, \mu^{\mathrm{batch}}_{L} \qquad (9)$$
where $\lambda$ is a step-size parameter. This stochastic version of EM introduces two hyperparameters, the step-size and the batch size, which need to be set appropriately. Furthermore, unlike full-batch EM, stochastic EM does not guarantee that the training likelihood increases in each iteration. However, stochastic EM updates the parameters after each minibatch and typically leads to faster learning.
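A sketch of one such gliding-average update for the weights of a single sum node (hypothetical step-size and E-step statistics):

```python
import numpy as np

def stochastic_em_step(w, batch_stats, lam):
    """One gliding-average update for the weights of a single sum node:
    move toward the minibatch M-step estimate with step-size lam."""
    target = batch_stats / batch_stats.sum()     # minibatch M-step estimate
    w_new = (1.0 - lam) * w + lam * target
    return w_new / w_new.sum()                   # keep the weights normalized

w = np.array([0.2, 0.5, 0.3])
stats = np.array([4.0, 1.0, 5.0])                # hypothetical E-step counts
w1 = stochastic_em_step(w, stats, lam=0.1)

assert np.isclose(w1.sum(), 1.0)
assert np.isclose(w1[0], 0.9 * 0.2 + 0.1 * 0.4)  # convex combination per weight
assert w1[2] > w[2]                              # pulled toward the batch estimate
```

Since both the old weights and the minibatch target are normalized, the convex combination stays on the probability simplex by construction.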
Sato (1999) shows an interesting connection between stochastic EM and natural gradients (Amari, 1998). In particular, for any EF model $p(\mathbf{x}, \mathbf{z} \mid \theta)$, where $\mathbf{x}$ and $\mathbf{z}$ are observed and latent variables, performing (8) and (9) is equivalent to SGD, where the gradient is pre-multiplied by the inverse Fisher information matrix of the complete model $p(\mathbf{x}, \mathbf{z} \mid \theta)$. The difference to the standard natural gradient is that the natural gradient is defined via the inverse Fisher of the marginal $p(\mathbf{x} \mid \theta)$ (Amari, 1998). Sato's analysis also applies to EiNets, since smooth and decomposable PCs are EFs of the form $p(\mathbf{x}, \mathbf{z} \mid \theta)$ (Peharz et al., 2017), where the latent variables $\mathbf{z}$ are associated with the sum nodes. Thus, (8) and (9) are in fact performing SGD with (a variant of) the natural gradient.
4 Experiments
dataset       RAT-SPN    EiNet
nltcs          -6.011    -6.015
msnbc          -6.039    -6.119
kdd-2k         -2.128    -2.183
plants        -13.439   -13.676
jester        -52.970   -52.563
audio         -39.958   -39.879
netflix       -56.850   -56.544
accidents     -35.487   -35.594
retail        -10.911   -10.916
pumsb-star    -32.530   -31.954
dna           -97.232   -96.086
kosarek       -10.888   -11.029
msweb         -10.116   -10.026
book          -34.684   -34.739
each-movie    -53.632   -51.705
web-kb       -157.530  -157.282
reuters-52    -87.367   -87.368
20ng         -152.062  -153.938
bbc          -252.138  -248.332
ad            -48.472   -26.273

Table 1: Test log-likelihoods of RAT-SPNs and EiNets when trained on 20 binary datasets, using the same structures. On most datasets, the results of RAT-SPNs and EiNets are not significantly different (one-sided t-test).

We implemented EiNets fully in PyTorch
(Paszke et al., 2019). The core implementation is provided in the supplementary, and we will release the code in the near future.

4.1 Efficiency Comparison
In order to demonstrate the efficiency of EiNets, we compare with the two most prominent "deep-learning" implementations of PCs, namely LibSPN (Pronobis et al., 2017) and SPFlow (Molina et al., 2019). LibSPN is natively based on Tensorflow (Abadi M. et al., 2015), while SPFlow supports multiple backends. For our experiment, we used SPFlow with the Tensorflow backend. We used randomized binary trees (RAT-SPNs) (Peharz et al., 2019) as a common benchmark: these PC structures are governed by two structural parameters, the split-depth $D$ and the number of replica $R$. Starting from the root sum node, they split the whole scope into two randomized balanced parts, recursively until depth $D$, yielding a binary tree shape with $2^D$ leaves. This construction is repeated $R$ times, yielding $R$ random binary trees, which are mixed at the root. As a sanity check, we first reproduced the density estimation results on 20 binary datasets in (Peharz et al., 2019); see Table 1. The difference in test log-likelihood is not statistically significant for most of the datasets, using a one-sided t-test. EiNets even outperformed RAT-SPNs on two datasets, once with a large margin, while RAT-SPNs outperformed EiNets on one dataset, with a small margin.

We aim to compare EiNets, LibSPN and SPFlow in terms of i) training time, ii) memory consumption, and iii) inference time. To this end, we trained and tested them on synthetic data (Gaussian noise). As leaves we used single-dimensional Gaussians. We varied each of the structural hyperparameters—the depth $D$, the number of replica $R$, and the vector length $K$ of sums/leaves—in turn, leaving the others at their default values. We ran this set of experiments on a GeForce RTX Ti.
In Figure 3 we see a comparison of the three implementations in terms of training time and GPU-memory consumption, where the circle radii represent the magnitude of the varied hyperparameter. We see that EiNets tend to be one or two orders of magnitude faster (note the log-scale) than the competitors, especially for large models. Also in terms of memory consumption, EiNets scale gracefully. In particular, for large $K$, memory consumption is an order of magnitude lower than for LibSPN or SPFlow. This is easily explained by the fact that our einsum operations do not generate product nodes explicitly in memory, while the other frameworks do. Moreover, EiNets are also superior in terms of inference time. In particular for large models, they again run one or two orders of magnitude faster than the other implementations. For space reasons, these results are deferred to the supplementary.
4.2 EiNets as Generative Image Models
While PCs are an actively researched topic, their scalability issues have so far restricted them to rather small datasets like MNIST (LeCun et al., 1998). In this paper, we use PCs as generative models for street-view house numbers (SVHN) (Netzer et al., 2011), containing RGB images of digits, and center-cropped CelebA (Liu et al., 2015), containing RGB face images. For SVHN, we concatenated the first part of the core train set with the extra set, yielding our train set, and reserved the rest of the core train set as validation set; the standard test set was used for evaluation. For CelebA, we used the standard train, validation, and test splits. The data was normalized before training (division by the maximal pixel value), but otherwise no preprocessing was applied. To the best of our knowledge, PCs have not been successfully trained on datasets of this size before.
We trained EiNets with the image-tailored structure proposed in (Poon & Domingos, 2011), to which we refer as PD structure. The PD structure recursively decomposes the image into sub-rectangles using axis-aligned splits, displaced by a certain step-size $\Delta$. Here, $\Delta$ serves as a structural hyperparameter which governs the number of sum nodes (Peharz, 2015). The recursive splitting process stops when a rectangle cannot be split any further. We first clustered both datasets using the sklearn implementation of k-means, and learned an EiNet on each of these clusters. We then used these EiNets as components of a mixture model, using the cluster proportions as mixture coefficients. Note that i) a mixture of PCs yields again a PC, and ii) this step is essentially the first step of LearnSPN (Gens & Domingos, 2013), one of the most prominent PC structure learners. For the PD structure of each EiNet component, we used dataset-specific step-sizes for SVHN and CelebA, and we only applied vertical splits, in order to reduce artifacts stemming from the PD structure. As leaves, we used factorized Gaussians, i.e. Gaussians with diagonal covariance matrix, which were further factorized over the RGB channels. After each EM update, we projected the Gaussian variances to a fixed interval, corresponding to a cap on the standard deviation. Each mixture component was trained with stochastic minibatch EM on an NVidia P100.

Original samples and samples from the EiNet mixture are shown in Figure 4, (a, b) for SVHN and (d, e) for CelebA, respectively. The SVHN samples are rather compelling, and some could be mistaken for real data samples. The CelebA samples are somewhat over-smoothed, but capture the main qualities of faces well. In both cases, some "stripy" artifacts, typical for the PD architecture (Poon & Domingos, 2011), can be seen. Although the image quality is not comparable to e.g. GAN models (Goodfellow et al., 2014), we stress that EiNets permit tractable inference, while GANs are restricted to sampling. In particular, tractable inference can be immediately used for image inpainting, as demonstrated in Figure 4, (c) for SVHN and (f) for CelebA. Here, we marked parts of test samples as missing (the top or left image half) and reconstructed them by drawing a sample from the conditional distribution, conditioned on the visible image part. We see that the reconstructions are plausible, in particular for SVHN.
5 Conclusion
Probabilistic models form a spectrum of machine learning techniques. Most research focuses on representing and learning flexible and expressive models, but ignores the downstream impact on the set of inference tasks which can be provably solved within the model. The philosophy of tractable modeling also aims to push the expressivity boundaries, but under the constraint of maintaining a defined set of exact inference routines. Probabilistic circuits are certainly a central and prominent approach within this philosophy. In this paper, we addressed a major obstacle for PCs, namely their scalability in comparison to unconstrained models. Our improvements in training speed and memory use, both in the range of one to two orders of magnitude, are compelling, and we hope that our results stimulate further developments in the area of tractable models.
Appendix A Organizing EiNets in Topological Layers
In this section, we elaborate on the layer-wise organization of EiNets, as discussed in Section 3.3 in the main paper. The layer-wise organization is obtained by Algorithm 1, which takes the computational graph of some EiNet as input, and which performs a type of breadth-first search over .
We first divide the node set into the node sets , , and , containing all leaves, all sums, and all products, respectively. We initialize an empty set , which will contain all “visited” nodes during the execution of the algorithm. We also initialize an empty list , which will be a topologically ordered list of pure node-sets (i.e. containing exclusively nodes from , , or ), and which will contain the result of the algorithm.
The while-loop in Algorithm 1 is executed until all sum and product nodes have been visited and stored in some layer. In line 6, we construct the set of sum nodes which have not been visited yet, but whose parents have all been visited. Note that in the first iteration of the while-loop will simply contain the root of the EiNet. The set is then inserted in the front of the list in line 7, and all nodes in are marked as visited in line 8. A similar procedure, but for product nodes, is performed in lines 9–11, yielding node set . Note that since we assume that each product node has only one parent (cf. Section 3.1 in the main paper), will be exactly the set of children of sum nodes contained in (constructed in line 6 within the same while-iteration). Thus, if additionally all sum nodes have exactly one child, and will form a consecutive pair of product and sum nodes, as illustrated in Fig. 2 in the main paper, which can be compactly computed using a single call to an einsum operation. If sum nodes have multiple children, we decompose them into chains of consecutive sum nodes, as discussed in the next section.
In line 13 of Algorithm 1, all leaf nodes are inserted as the bottom layer. It is easy to check that the returned list will be topologically sorted, meaning that the nodes in the layer will only have inputs from layers with index strictly smaller than .
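The layering procedure can be sketched in plain Python. The graph representation (a `parents` map over hashable node ids) and all names below are our own assumptions for illustration, not the exact pseudocode of Algorithm 1:

```python
def topological_layers(leaves, sums, products, parents):
    """Group an EiNet computational graph into a topologically ordered
    list of pure node-sets (all-sum or all-product layers), with the
    leaves inserted as the bottom layer.  `parents` maps each node id
    to the list of its parent node ids (the root has an empty list)."""
    sums, products = set(sums), set(products)
    visited = set()
    layers = []
    while len(visited) < len(sums) + len(products):
        # Sum nodes not yet visited whose parents are all visited;
        # in the first iteration this is just the root.
        S = {n for n in sums
             if n not in visited and all(p in visited for p in parents[n])}
        layers.insert(0, S)
        visited |= S
        # Product nodes whose (single) parents have now been visited.
        P = {n for n in products
             if n not in visited and all(p in visited for p in parents[n])}
        layers.insert(0, P)
        visited |= P
    layers.insert(0, set(leaves))  # bottom layer
    return layers
```

Since every new layer is inserted at the front of the list, nodes in each layer only receive inputs from layers with strictly smaller index, i.e. the returned list is topologically sorted.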
Appendix B The Mixing Layer
The einsum layer computes a large number of sum-product operations with a single call to an einsum operation, but requires that each sum node has exactly one child. In order to compute sum nodes with multiple product children, we express any sum node with multiple children as a cascade of two vectorized sum operations. In particular, for a sum node with children, we introduce new sum nodes , each having one of the children as its single child. This forms a layer of sum nodes with single children, which can be computed with a single einsum operation depicted in Eq. (5) in the main paper. Subsequently, the results of , , get mixed in an element-wise manner, i.e. . We call the former the simple sums and the latter the aggregated sum.
This structure is simply an overparameterization (Trapp et al., 2019b) of the original sum nodes, and represents the same linear functions. This principle is illustrated in Fig. 5, where sum node with children and sum node with children are expressed with a layer of simple sum nodes, followed by aggregated sums (sums in boxes).
The element-wise mixtures can also be implemented with a single einsum operation, which we denote as the mixing layer. To this end, within a sum layer, let be the number of sum nodes having more than one child, and the maximal number of children. We collect the dimensional probability vectors computed by the first (simple) sum layer in a tensor, where the axis is zero-padded for sum nodes with fewer than children. The mixing layer then computes a convex combination over the first dimension of this tensor. Constructing this tensor involves some copy overhead, and the mixing layer also wastes some computation due to zero padding. However, using sum nodes with multiple children allows a much wider range of PC structures than e.g. random binary trees, which were the only structure considered in (Peharz et al., 2019).
For sum layers which originally contain only simple sums, the construction of the mixing layer is skipped.
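The mixing layer reduces to a single einsum over the zero-padded child axis. The following numpy sketch uses illustrative shapes and names of our own choosing (`C_max` children, `K` aggregated sums, vector length `k`):

```python
import numpy as np

def mixing_layer(simple_outputs, weights):
    """Element-wise convex combination over the child axis.
    simple_outputs: (C_max, K, k) tensor of probability vectors from the
        simple-sum layer, zero-padded along axis 0 for sum nodes with
        fewer than C_max children.
    weights: (K, C_max) mixing weights; each row sums to 1, with zeros
        at padded positions.  Returns a (K, k) tensor of mixed vectors."""
    return np.einsum('cik,ic->ik', simple_outputs, weights)

# Two aggregated sums (K=2) with at most 3 children, vector length k=2.
outs = np.zeros((3, 2, 2))
outs[:, 0, :] = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # node 0: 3 children
outs[:2, 1, :] = [[0.2, 0.8], [0.6, 0.4]]             # node 1: 2 children, padded
w = np.array([[0.5, 0.25, 0.25],   # node 0 mixes all 3 children
              [0.5, 0.5, 0.0]])    # node 1: zero weight on the padding
mixed = mixing_layer(outs, w)      # shape (2, 2)
```

Because the padded positions carry zero weight, the zero-padding wastes some multiplications but does not affect the result.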
Appendix C Inference Time Comparison
Section 4.1 in the main paper compared training time and memory consumption for EiNets, LibSPN (Pronobis et al., 2017), and SPFlow (Molina et al., 2019), showing that EiNets scale much more gracefully than these competitors. The same holds true for inference time. Fig. 6 shows the results corresponding to Fig. 3 in the main paper, but for inference time per sample rather than training time per epoch. Inference was done for a batch of test samples for each model; the displayed per-sample inference time is the evaluation time for the whole batch divided by the batch size. Again, we see significant speedups for EiNets, of up to three orders of magnitude (for maximal depth and EiNet vs. SPFlow).
References
 Abadi et al. (2015) Abadi, M. et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
 Amari (1998) Amari, S.-I. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
 Bottou (1998) Bottou, L. Online algorithms and stochastic approximations. In Saad, D. (ed.), Online Learning and Neural Networks. Cambridge University Press, 1998.
 Darwiche (1999) Darwiche, A. Compiling knowledge into decomposable negation normal form. In IJCAI, volume 99, pp. 284–289, 1999.
 Darwiche (2001) Darwiche, A. Decomposable negation normal form. Journal of the ACM (JACM), 48(4):608–647, 2001.
 Darwiche (2002) Darwiche, A. A logical approach to factoring belief networks. KR, 2:409–420, 2002.
 Darwiche (2003) Darwiche, A. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280–305, 2003.
 de Campos (2011) de Campos, C. New complexity results for MAP in Bayesian networks. In Proceedings of IJCAI, pp. 2100–2106, 2011.
 Dennis & Ventura (2012) Dennis, A. and Ventura, D. Learning the architecture of sum-product networks using clustering on variables. In Proceedings of NIPS, pp. 2042–2050, 2012.
 Gens & Domingos (2013) Gens, R. and Domingos, P. Learning the structure of sum-product networks. In Proceedings of ICML, pp. 873–880, 2013.
 Goodfellow et al. (2014) Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Proceedings of NIPS, pp. 2672–2680, 2014.
 Khosravi et al. (2019) Khosravi, P., Liang, Y., Choi, Y., and Van den Broeck, G. What to expect of classifiers? Reasoning about logistic regression with missing features. In Proceedings of IJCAI, pp. 2716–2724, 2019.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014. arXiv:1312.6114.
 Kisa et al. (2014) Kisa, D., den Broeck, G. V., Choi, A., and Darwiche, A. Probabilistic Sentential Decision Diagrams. In KR, 2014.
 Koller & Friedman (2009) Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009. ISBN 0262013193.
 Larochelle & Murray (2011) Larochelle, H. and Murray, I. The neural autoregressive distribution estimator. In Proceedings of AISTATS, pp. 29–37, 2011.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.
 MacKay (1995) MacKay, D. J. C. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 354(1):73–80, 1995.
 Molina et al. (2019) Molina, A., Vergari, A., Stelzner, K., Peharz, R., Subramani, P., Di Mauro, N., Poupart, P., and Kersting, K. SPFlow: An easy and extensible library for deep probabilistic learning using sum-product networks, 2019.
 Netzer et al. (2011) Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
 Papamakarios et al. (2019) Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.
 Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035, 2019. URL https://pytorch.org/.
 Pearl (1988) Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988. ISBN 1558604790.
 Peharz (2015) Peharz, R. Foundations of Sum-Product Networks for Probabilistic Modeling. PhD thesis, Graz University of Technology, 2015.
 Peharz et al. (2015) Peharz, R., Tschiatschek, S., Pernkopf, F., and Domingos, P. On theoretical properties of sum-product networks. In Proceedings of AISTATS, pp. 744–752, 2015.
 Peharz et al. (2017) Peharz, R., Gens, R., Pernkopf, F., and Domingos, P. On the latent variable interpretation in sum-product networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(10):2030–2044, 2017.
 Peharz et al. (2019) Peharz, R., Vergari, A., Stelzner, K., Molina, A., Trapp, M., Shao, X., Kersting, K., and Ghahramani, Z. Random sum-product networks: A simple and effective approach to probabilistic deep learning. In Proceedings of UAI, 2019.
 Poon & Domingos (2011) Poon, H. and Domingos, P. Sum-product networks: A new deep architecture. In Proceedings of UAI, pp. 337–346, 2011.
 Pronobis et al. (2017) Pronobis, A., Ranganath, A., and Rao, R. LibSPN: A library for learning and inference with sum-product networks and TensorFlow. In Principled Approaches to Deep Learning Workshop, 2017.
 Rahman et al. (2014) Rahman, T., Kothalkar, P., and Gogate, V. Cutset networks: A simple, tractable, and scalable approach for improving the accuracy of Chow-Liu trees. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 630–645, 2014.
 Rezende & Mohamed (2015) Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. In Proceedings of ICML, pp. 1530–1538, 2015.
 Salakhutdinov et al. (2003) Salakhutdinov, R., Roweis, S. T., and Ghahramani, Z. Optimization with EM and expectation-conjugate-gradient. In Proceedings of ICML, pp. 672–679, 2003.
 Sato (1999) Sato, M. Fast learning of on-line EM algorithm. Rapport Technique, ATR Human Information Processing Research Laboratories, 1999.
 Shen et al. (2016) Shen, Y., Choi, A., and Darwiche, A. Tractable operations for arithmetic circuits of probabilistic models. In Advances in Neural Information Processing Systems 29, pp. 3936–3944. 2016.
 Trapp et al. (2019a) Trapp, M., Peharz, R., Ge, H., Pernkopf, F., and Ghahramani, Z. Bayesian learning of sumproduct networks. Proceedings of NeurIPS, 2019a.
 Trapp et al. (2019b) Trapp, M., Peharz, R., and Pernkopf, F. Optimisation of overparametrized sum-product networks. arXiv preprint arXiv:1905.08196, 2019b.
 Uria et al. (2016) Uria, B., Côté, M.A., Gregor, K., Murray, I., and Larochelle, H. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–7220, 2016.
 Van den Broeck et al. (2019) Van den Broeck, G., Di Mauro, N., and Vergari, A. Tractable probabilistic models: Representations, algorithms, learning, and applications. http://web.cs.ucla.edu/~guyvdb/slides/TPMTutorialUAI19.pdf, 2019. Tutorial at UAI 2019.
 Zhao et al. (2015) Zhao, H., Melibari, M., and Poupart, P. On the relationship between sum-product networks and Bayesian networks. In Proceedings of ICML, pp. 116–124, 2015.