Einsum Networks: Fast and Scalable Learning of Tractable Probabilistic Circuits

04/13/2020, by Robert Peharz et al.

Probabilistic circuits (PCs) are a promising avenue for probabilistic modeling, as they permit a wide range of exact and efficient inference routines. Recent “deep-learning-style” implementations of PCs strive for better scalability, but are still difficult to train on real-world data, due to their sparsely connected computational graphs. In this paper, we propose Einsum Networks (EiNets), a novel implementation design for PCs, improving prior art in several regards. At their core, EiNets combine a large number of arithmetic operations in a single monolithic einsum operation, leading to speedups and memory savings of up to two orders of magnitude, in comparison to previous implementations. As an algorithmic contribution, we show that the implementation of Expectation-Maximization (EM) can be simplified for PCs, by leveraging automatic differentiation. Furthermore, we demonstrate that EiNets scale well to datasets which were previously out of reach, such as SVHN and CelebA, and that they can be used as faithful generative image models.


1 Introduction

The central goal of probabilistic modeling is to approximate the data-generating distribution, in order to answer statistical queries by means of inference. In recent years many novel probabilistic models based on deep neural networks have been proposed, such as Variational Autoencoders (VAEs) (Kingma & Welling, 2014) (formerly known as Density Networks (MacKay, 1995)), Normalizing Flows (Flows) (Rezende & Mohamed, 2015; Papamakarios et al., 2019), Autoregressive Models (ARMs) (Larochelle & Murray, 2011; Uria et al., 2016), and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). While all these models have achieved impressive results on large-scale datasets, i.e. they have been successful in terms of representational power and learning, they unfortunately fall short in terms of inference, a main aspect of probabilistic modeling and reasoning (Pearl, 1988; Koller & Friedman, 2009).

All of the mentioned models allow us to draw unbiased samples, enabling inference via Monte Carlo estimation. This strategy, however, quickly becomes unreliable and computationally expensive for all but the simplest inference queries. Other approximate inference techniques, e.g. variational inference, are often biased and their inference quality might be hard to analyse. Besides sampling, only ARMs and Flows support efficient evaluation of the probability density for a given sample, which can be used, e.g., for model comparison and outlier detection.

However, even for ARMs and Flows the following inference task is computationally hard: Consider a density p(X) over a set of random variables X, whose number might be just in the order of a few dozen. For example, X might represent medical measurements of a particular person drawn from the general population modeled by p(X). Now assume that we split the variables into three disjoint sets X_q, X_m, and X_e of roughly the same size, and that we wish to compute

p(x_q | x_e) = \frac{\int p(x_q, x_m, x_e) \, dx_m}{\iint p(x_q, x_m, x_e) \, dx_m \, dx_q}    (1)

for arbitrary values x_q and x_e. In words, we wish to predict the query variables X_q, based on the evidence x_e, while accounting for (marginalizing) all possible values x_m of the missing variables X_m. Conditional densities like (1) highlight the role of generative models as “multi-purpose predictors”, since the choice of X_q, X_m, and X_e can be arbitrary. However, evaluating conditional densities of this form is notoriously hard for any of the models above, which represents a principal drawback of these methods.

These shortcomings in terms of inference have motivated a growing stream of research on tractable probabilistic modeling, i.e. constructing a constrained class of probabilistic models in which inference queries like (1) can be computed exactly and efficiently (in polynomial time). One of the most prominent families of tractable models are probabilistic circuits (PCs), which represent probability densities via a computational graph of i) mixtures (convex sum nodes), ii) factorizations (product nodes), and iii) tractable distributions (leaves, input nodes). We adopt the name probabilistic circuits, as suggested by (Van den Broeck et al., 2019), as an umbrella term for many structurally related models, like arithmetic circuits, sum-product networks, cutset networks, etc.; see Section 2 for details. A key structural property of PCs is decomposability (Darwiche, 2003), which ensures that any integral, like those appearing in (1), can be computed in time linear in the circuit size. PCs can be equipped with further structural constraints, which unlock a wider range of tractable inference routines; see Section 2 for an overview. These structural constraints, however, make it hard to work with PCs in practice, as they lead to highly sparse and cluttered computational graphs, which are ill-suited for current machine learning frameworks. Furthermore, PCs are typically implemented in the log-domain, which slows down learning and inference even further.

In this paper, we propose a novel implementation design for PCs, which ameliorates these practical difficulties and allows PCs to be evaluated and trained up to two orders of magnitude faster than with previous implementations (Pronobis et al., 2017; Peharz et al., 2019). The central idea is to compute all product and sum operations on the same topological layer using a single monolithic einsum operation. (The einsum operation implements the Einstein notation of tensor-product contraction, and unifies standard linear algebra operations like dot product, outer product, and matrix multiplication.) In that way, the main computational work is lifted by a parallel operation for which efficient implementations, both for CPU and GPU, are readily available in most numerical frameworks. In order to ensure numerical stability, we extend the well-known log-sum-exp trick to our setting, leading to a “log-einsum-exp” trick. Since our model implements PCs via a hierarchy of large einsum layers, we call it Einsum Network (EiNet).

As further contributions, we present two algorithmic improvements for training PCs. First, we show that Expectation-Maximization (EM), the canonical maximum-likelihood learning algorithm for PCs (Peharz et al., 2017), can be easily implemented using the gradient of the model’s log-probability. Thus, EM can be implemented using automatic differentiation, readily provided by most machine learning frameworks. Second, we leverage stochastic online EM (Sato, 1999) in order to further improve the learning speed of EiNets, or to enable EM on large datasets in the first place. In experiments we demonstrate that EiNets can rapidly learn generative models for street-view house numbers (SVHN) and CelebA. To the best of our knowledge, this is the first time that PCs have been successfully trained on datasets of this size. EiNets are capable of producing high-quality image samples, while maintaining tractable inference, e.g. for conditional densities (1). This can be exploited for arbitrary image inpainting tasks and other advanced inference tasks.

Concerning notation, we use upper-case letters, possibly with sub-scripts, to denote random variables and sets (or tuples) of random variables, e.g. X or X_q, and the corresponding lower-case letters to denote their values, e.g. x or x_q. We denote the set of possible values of a (set of) random variable(s) X with val(X). For directed graphs, we use pa(N) and ch(N) to denote the parents and children of a node N, respectively.

2 Probabilistic Circuits

Probabilistic circuits (PCs) are a family of probabilistic models which allow a wide range of exact and efficient inference routines. The earliest representatives of PCs are arithmetic circuits (ACs) (Darwiche, 2002, 2003), which are based on decomposable negation normal forms (DNNFs) (Darwiche, 1999, 2001), a tractable representation for propositional logic formulas. Further members of the PC family are sum-product networks (SPNs) (Poon & Domingos, 2011), cutset networks (CNs) (Rahman et al., 2014), and probabilistic sentential decision diagrams (PSDDs) (Kisa et al., 2014). The main differences between the various types of PCs are i) the set of assumed structural constraints and tractable inference routines, ii) syntax and representation, and iii) application scenarios. In order to treat these models in a unified fashion, we adopt the umbrella term probabilistic circuits suggested by (Van den Broeck et al., 2019), and discriminate the various instantiations of PCs mainly via their structural properties.

Definition 1 (Probabilistic Circuit).

Given a set of random variables X, a probabilistic circuit (PC) is a tuple (G, φ), where G, denoted as computational graph, is a directed acyclic graph (DAG) and φ, denoted as scope function, is a function assigning a scope to each node N in G, i.e. a sub-set of X. For internal nodes of G, i.e. any node N which has children, the scope function satisfies φ(N) = ∪_{N' ∈ ch(N)} φ(N'). A leaf L of G computes a probability density over its scope φ(L). All internal nodes of G are either sum nodes (S) or product nodes (P). A sum node S computes a convex combination of its children, i.e. S = Σ_{N ∈ ch(S)} w_{S,N} N, where w_{S,N} ≥ 0 and Σ_{N ∈ ch(S)} w_{S,N} = 1. A product node P computes a product of its children, i.e. P = Π_{N ∈ ch(P)} N.

PCs can be seen as a special kind of neural network, where the first layer computes non-linear functions (probability densities) over sub-sets of X, and all internal nodes compute either weighted sums (linear functions) or products (multiplicative interactions) of their inputs. Each node N in a PC sees only a sub-set of the inputs, namely the variables in its scope φ(N). The output of a PC is the value of one or more selected nodes in the computational graph G. Typically, one constructs PCs such that they have a single root R (node without parents), for which we assume full scope φ(R) = X, and defines a density over X proportional to the root’s output, i.e. p(x) ∝ R(x). PCs as defined above define a density, but do not permit tractable inference yet. In particular, the normalization constant ∫ R(x) dx is hard to compute. Various tractable inference routines “can be bought” by imposing structural constraints on the PC, which we review next.

Decomposability. A PC is decomposable, if for each product node P it holds that φ(N') ∩ φ(N'') = ∅ for any two distinct children N', N'' ∈ ch(P), i.e. a PC is decomposable if the children of product nodes have pairwise non-overlapping scopes. The most important consequence of decomposability is that integrals which can be written as nested single-dimensional integrals—in particular the normalization constant—can be computed in time linear in the circuit size (Peharz et al., 2015). This follows from the fact that any single-dimensional integral i) commutes with a sum node and ii) affects only a single child of a product node. Consequently, all integrals can be distributed to the PC’s leaves, i.e. we simply need to perform integration at the leaves (which we assume to be tractable), and evaluate the internal nodes of the PC as usual, in a single feedforward pass.

Smoothness. A PC is smooth, if for each sum node S it holds that φ(N') = φ(N'') for any two children N', N'' ∈ ch(S), i.e. a PC is smooth if the children of sum nodes have identical scopes. Smoothness does not have particular computational advantages, but leads to a well-defined probabilistic interpretation of PCs. In particular, in smooth (and decomposable) PCs any node is already a properly normalized density, since i) leaves are densities by assumption, ii) sum nodes are simply mixture densities, with their children being the mixture components, and iii) product nodes are factorized densities. Smooth and decomposable PCs admit a natural interpretation as latent variable models (Zhao et al., 2015; Peharz et al., 2017), which yields a natural sampling procedure via ancestral sampling and maximum-likelihood learning via EM. Smooth and decomposable PCs are often referred to as sum-product networks (SPNs) (Poon & Domingos, 2011).
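To make the effect of these two properties concrete, the following minimal sketch (our own illustration, not the authors' implementation; the node classes and the Gaussian leaf are assumptions) evaluates a smooth and decomposable PC bottom-up. Marginalizing a variable amounts to letting its leaves return log 1 = 0 and evaluating the circuit as usual:

import math

class Leaf:
    """Gaussian leaf over a single variable (scope = {var})."""
    def __init__(self, var, mean, std):
        self.var, self.mean, self.std = var, mean, std

    def log_prob(self, x, marginalized=frozenset()):
        if self.var in marginalized:          # integrate the leaf: log of 1 is 0
            return 0.0
        v = x[self.var]
        return -0.5 * ((v - self.mean) / self.std) ** 2 \
               - math.log(self.std) - 0.5 * math.log(2 * math.pi)

class Product:
    """Decomposable product: children have disjoint scopes."""
    def __init__(self, children):
        self.children = children

    def log_prob(self, x, marginalized=frozenset()):
        return sum(c.log_prob(x, marginalized) for c in self.children)

class Sum:
    """Smooth sum: convex combination of children with identical scope."""
    def __init__(self, children, weights):
        self.children, self.weights = children, weights

    def log_prob(self, x, marginalized=frozenset()):
        logs = [math.log(w) + c.log_prob(x, marginalized)
                for w, c in zip(self.weights, self.children)]
        m = max(logs)                          # log-sum-exp for numerical stability
        return m + math.log(sum(math.exp(l - m) for l in logs))

# a small smooth, decomposable PC over two variables
pc = Sum([Product([Leaf(0, -1.0, 1.0), Leaf(1, -1.0, 1.0)]),
          Product([Leaf(0,  2.0, 0.5), Leaf(1,  2.0, 0.5)])],
         weights=[0.3, 0.7])

x = {0: 0.1, 1: 1.5}
print("log p(x0, x1) =", pc.log_prob(x))
print("log p(x0)     =", pc.log_prob(x, marginalized={1}))  # single feedforward pass

The same feedforward pass therefore serves full evaluation and marginalization, and conditional queries like (1) follow by taking ratios of such marginals.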

In this paper, we consider only smooth and decomposable PCs (aka SPNs), which facilitate efficient marginalization and conditioning. A further structural property, not considered in this paper, is determinism (Darwiche, 2003), which allows for exact probability maximization (most-probable-explanation), a task that is NP-hard in non-deterministic PCs (de Campos, 2011; Peharz et al., 2017). Moreover, PCs can be equipped with structured decomposability (Kisa et al., 2014), a stronger notion of decomposability which allows circuit multiplication and computing certain expectations (Shen et al., 2016; Khosravi et al., 2019).

3 Einsum Networks

3.1 Vectorizing Probabilistic Circuits

An immediate way to obtain a denser and thus more efficient layout for PCs is to vectorize them. To this end, we re-define a leaf L to be a vector of K densities over φ(L), rather than a single density. For example, a leaf computing a Gaussian density is replaced by a vector of K Gaussians over φ(L), each equipped with private parameters (mean and variance). A product node is re-defined to be an outer product ⊗ of its child vectors, containing the products of all possible combinations of densities coming from the child vectors. Since the number of elements in outer products grows exponentially, we restrict the number of children to two (this constraint is frequently imposed on PCs for simplicity). Finally, we re-define sum nodes to be a vector of K weighted sums, where each individual sum operation has its private weights and computes a convex combination over all the densities computed by its children. The vectorized version of PCs is frequently called a region graph (Dennis & Ventura, 2012; Trapp et al., 2019a) and has been used in previous GPU-supporting implementations (Pronobis et al., 2017; Peharz et al., 2019). It is easily verified that our desired structural properties—smoothness and decomposability—carry over to vectorized PCs.

For the remainder of the paper, we use the symbols S, P, L for the vectorized versions, and refer to them as sum nodes, product nodes, and leaf nodes, respectively, or simply as sums, products and leaves. To any single entry in these vectors we explicitly refer to as entry, or operation. In principle, the number of entries K could be different for each leaf or sum, which would, however, lead to a less homogeneous PC design. Therefore, in this paper, we assume for simplicity the same K for all leaves and sums. Furthermore, we make some simplifying assumptions about the structure of the computational graph. First, we assume a structure alternating between sums/leaves and products, i.e. children of sums can only be products, and children of products can only be sums or leaves. Second, we also assume that the root of the PC is a sum node. These assumptions are commonly made in the PC literature and are no restriction of generality. Furthermore, we also assume that each product node has at most one parent (which must be a sum due to the alternating structure). This is also not a severe structural assumption: If a product has two or more sum nodes as parents, then, by smoothness, these sum nodes all have the same scope. Consequently, they could simply be concatenated into a single sum vector. Of course, since in this paper we assume a constant length K for all sum and leaf vectors, this assumption requires a large enough K.

3.2 The Basic Einsum Operation

Figure 1: Basic einsum operation in EiNets: A sum node S, with a single child P, which itself has 2 children. All nodes are vectorized, as described in Section 3.1, and illustrated here for a small vector length K.

The core computational unit in EiNets is the vectorized PC excerpt in Fig. 1, showing a sum node S with a single product child P, which itself has two children N_1 and N_2 (shown here as sum nodes, but they could also be leaves). Nodes N_1 and N_2 each compute a vector of K densities, the product node computes the outer product N_1 ⊗ N_2, and the sum node computes a matrix-vector product W (N_1 ⊗ N_2). Here, W is an element-wise non-negative K × K² matrix, whose rows sum to one, and N_1 ⊗ N_2 unrolls to a vector of K² elements.

Previous PC implementations (Pronobis et al., 2017; Peharz et al., 2019) are also based on this core computational unit. However, for numerical stability, they use a computational workaround in the log-domain: The outer product is transformed into an “outer sum” of log-densities (realized with broadcasting), and the matrix multiplication is implemented using a broadcasted sum of the log-weights and the log-products, to which then a log-sum-exp operation is applied, yielding the log of the sum node. This workaround introduces significant overhead and needs to allocate the K² products explicitly. Mathematically, however, the PC excerpt in Fig. 1 is a simple multi-linear form, naturally expressed in Einstein notation:

S_k = \sum_{i,j} W_{kij} \, N_{1,i} \, N_{2,j}    (2)

Here we have re-shaped W into an element-wise non-negative K × K × K tensor, normalized over its last two dimensions, i.e. W_{kij} ≥ 0 and Σ_{i,j} W_{kij} = 1. The signature in (2) mentions three indices k, i and j, labeling the axes of W, N_1, and N_2. Axes with the same index get multiplied. Furthermore, any indices not mentioned on the left hand side get summed out. General-purpose Einstein summations are readily implemented in most numerical frameworks, and usually denoted as einsum operation.
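In PyTorch, for instance, Eq. (2) amounts to a single einsum call; a minimal sketch in the linear domain (the variable names are ours):

import torch

K = 8
W  = torch.rand(K, K, K)
W  = W / W.sum(dim=(1, 2), keepdim=True)   # normalize over the last two dimensions
n1 = torch.rand(K)                          # densities of the first child
n2 = torch.rand(K)                          # densities of the second child

# S_k = sum_{i,j} W_{kij} * n1_i * n2_j -- the outer product is never materialized
S = torch.einsum('kij,i,j->k', W, n1, n2)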

However, applying (2) in a naive way would quickly lead to numerical underflow and unstable training. In order to ensure numerical stability, we develop a technique similar to the classical “log-sum-exp” trick. We keep all probabilistic values in the log-domain, but the weight-tensor W is kept in the linear domain. Consequently, we need a numerically stable computation of

\log S_k = \log \sum_{i,j} W_{kij} \, \exp(\log N_{1,i}) \, \exp(\log N_{2,j}).    (3)

Let us define a = \max_i \log N_{1,i} and b = \max_j \log N_{2,j}. Then, we can show that \log S_k can be computed as

\log S_k = a + b + \log \sum_{i,j} W_{kij} \, \exp(\log N_{1,i} - a) \, \exp(\log N_{2,j} - b).    (4)

To see that (4) is correct, note that for the last term we have

\log \sum_{i,j} W_{kij} \, \exp(\log N_{1,i} - a) \, \exp(\log N_{2,j} - b) = -a - b + \log \sum_{i,j} W_{kij} \, N_{1,i} \, N_{2,j}.

Substituting the last line into (4) yields (3), so our log-einsum-exp trick delivers the correct result. A sufficient condition for numerical stability of (4) is that all sum-weights are larger than 0, since in this case the maximal values in the vectors exp(log N_1 - a) and exp(log N_2 - b) are guaranteed to be 1, leading to a positive argument for the log. This is not a severe requirement, as positive sum-weights are commonly enforced in PCs, e.g. by using Laplace smoothing or imposing a positive lower bound on the weights.
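A minimal PyTorch sketch of the log-einsum-exp trick in Eq. (4) (our own illustration, not the released EiNet code):

import torch

def log_einsum_exp(W, log_n1, log_n2):
    """Numerically stable log of Eq. (2): W is a (K, K, K) linear-domain weight tensor,
    log_n1 and log_n2 are (K,) log-densities of the two children."""
    a = log_n1.max()
    b = log_n2.max()
    n1 = torch.exp(log_n1 - a)          # maximal entry is exactly 1
    n2 = torch.exp(log_n2 - b)
    s = torch.einsum('kij,i,j->k', W, n1, n2)
    return a + b + torch.log(s)         # Eq. (4)

# sanity check with values that would underflow if exponentiated naively
K = 8
W = torch.rand(K, K, K)
W = W / W.sum(dim=(1, 2), keepdim=True)
log_n1 = torch.randn(K) - 500.0
log_n2 = torch.randn(K) - 500.0
print(log_einsum_exp(W, log_n1, log_n2))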

Given two K-dimensional child vectors and the K × K × K weight-tensor W, our basic einsum operation (4) requires only O(K) exp- and log-operations, but O(K³) multiplications and additions. We only need to store the K-dimensional vectors log N_1, log N_2, and log S, while the product operations are not stored explicitly. In contrast, the indirect implementations of the same operation in (Pronobis et al., 2017; Peharz et al., 2019) need O(K³) additions and exp-operations. These implementations also store the vectors log N_1, log N_2, log S, and additionally the K² explicitly computed products. While our implementation is cubic in the number of multiplications, the previous implementations need a cubic number of exp-operations. This partially explains the speedup of EiNets in our experiments in Section 4. However, the main reasons for the observed speedup are i) an optimized implementation of the einsum operation, ii) avoiding the overhead of allocating product nodes, and iii) a higher degree of parallelism, as discussed in the next section.

3.3 The Einsum Layer

Rather than computing single vectorized sums, we can do better by computing whole layers in parallel. To this end, we first organize the PC in a layer-wise structure: We traverse the PC top-down and construct a topologically sorted list of layers of nodes, alternating between sums and products, and starting with the leaf nodes. Nodes are only inserted in a layer if all their parents have already been inserted in some layer above, i.e. all nodes in the l-th layer depend only on nodes in layers with index strictly smaller than l. Furthermore, note that since we can assume that each product has exactly one parent, see Section 3.1, a consecutive pair of product and sum layers will always be such that the product layer contains exactly the inputs to the sum layer. Pseudo-code for organizing PCs in topological layers is provided in the supplementary. Our strategy for EiNets is to perform efficient parallel computations for i) the whole leaf layer, and ii) the whole sum layer in each pair of consecutive product and sum layers. In both cases, the result is a matrix of log-densities, with as many rows as there are leaves (resp. sums) in the layer, and K columns.

Figure 2: Example of an einsum layer, parallelizing the basic einsum operation.

Computing the input layer is discussed in Section 3.4. For now, consider a layer of sum nodes, as illustrated in Fig. 2. Here, we have assumed that each sum node has only a single child. We discuss this simpler case first, and handle sum nodes with multiple children below. In order to compute the sums, we collect all the vectors of the “left” product children in a matrix N_1 and similarly all the “right” product children in a matrix N_2, where “left” and “right” are arbitrary but fixed assignments. We further extend the 3-dimensional weight-tensor from (2) to a 4-dimensional tensor W, where the slice indexed by s contains the weights of the s-th vectorized sum node. The result of all sums—i.e. K sum operations for each sum node in the layer—can be computed with a single einsum operation:

S_{s,k} = \sum_{i,j} W_{skij} \, N_{1,s,i} \, N_{2,s,j}    (5)

Note the similarity to (2), and that we parallelized over the whole sum layer by simply introducing an additional index s. Consequently, it is straightforward to extend the log-einsum-exp trick (see Section 3.2) to (5). Constructing the two matrices N_1 and N_2 requires some book-keeping and introduces some computational overhead, stemming from extracting and concatenating slices from the log-probability tensors below. This overhead is essentially a symptom of the sparse and cluttered layout of PCs. The main computational work, however, is then performed by the highly parallel einsum operation (5).
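A sketch of such a layer in PyTorch, combining Eq. (5) with the log-einsum-exp trick (again our own illustration; shapes and names are assumptions):

import torch

def einsum_layer(W, log_left, log_right):
    """W: (S, K, K, K) linear-domain weights, one slice per vectorized sum node.
    log_left, log_right: (S, K) log-densities of the 'left'/'right' product children."""
    a = log_left.max(dim=1, keepdim=True).values     # per-sum-node maxima
    b = log_right.max(dim=1, keepdim=True).values
    n1 = torch.exp(log_left - a)
    n2 = torch.exp(log_right - b)
    s = torch.einsum('skij,si,sj->sk', W, n1, n2)    # Eq. (5), whole layer at once
    return a + b + torch.log(s)                       # (S, K) log-densities

S_nodes, K = 16, 8
W = torch.rand(S_nodes, K, K, K)
W = W / W.sum(dim=(2, 3), keepdim=True)
log_left  = torch.randn(S_nodes, K) - 300.0
log_right = torch.randn(S_nodes, K) - 300.0
print(einsum_layer(W, log_left, log_right).shape)     # torch.Size([16, 8])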

Eq. (5) computes whole sum-product layers when the sum nodes are restricted to single children. In order to compute general layers, we express any layer containing sums with multiple children as two consecutive sum layers. The first layer consists of sum nodes with single children, one for each product child of the original sum layer. The second layer takes element-wise mixtures of the sum nodes in the first layer, over those sum nodes which correspond to the children of an original sum node. Note that this structure is simply an over-parameterization (Trapp et al., 2019b) of the original sum nodes, decomposing them into chains of two sum nodes. The first layer can now be efficiently computed with (5). The second layer, which we denote as the mixing layer, can also be computed in parallel with an einsum operation. For details, see the supplementary paper.

3.4 Exponential Families as Input Layer

The leaves of EiNets compute log-densities of an exponential family (EF), which has the form p(x | θ) = h(x) exp(θᵀ T(x) − A(θ)), where θ are the natural parameters, h(x) is the so-called base measure, T(x) the sufficient statistic and A(θ) the log-normalizer. Many parametric distributions can be expressed as EFs, e.g. Gaussian, Binomial, Categorical, Beta, etc. In order to facilitate learning using EM, we keep the parameters in their expectation form μ (Sato, 1999). The natural parameters θ and expectation parameters μ are one-to-one, connected via μ = E_θ[T(x)] and θ = ∇_μ(−H(μ)), where H denotes entropy. This dual parameterization allows us to implement EM on the abstract level of EFs, while particular instances of EFs are easily implemented by providing h(x), T(x), A(θ), and the conversion from μ to θ.

In order to compute all leaves in parallel, we first compute a D × K × R tensor of EF log-probabilities, where D is the number of RVs and R is the number of so-called replica. Each entry (d, k, r) is the log-density of an EF distribution over X_d. The number of replica must be large enough in order to ensure that each leaf has a set of private EF distributions, i.e. each leaf has an index r to a particular replica, such that leaves with the same r have disjoint scope. The number of required replica and an assignment of leaves to replica can be inferred from the PC’s structure. The EF tensor is parameterized with a parameter tensor of shape D × K × R × dim(T), where dim(T) is the dimensionality of the sufficient statistic T(x). Note that the EF tensor can be computed with a handful of parallel operations, e.g. the inner product θᵀ T(x), evaluating A(θ), etc. It is then used to compute the leaves, which, in this paper, are simply factorizations over the log-densities in this tensor.
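As an illustration of such a parallel input layer (a hedged sketch; the authors' released code may organize this differently), the following computes Gaussian leaf log-densities for all variables, vector entries, and replica in one shot:

import math
import torch

def gaussian_leaf_layer(x, mu, var):
    """x: (N, D) data batch; mu, var: (D, K, R) per-variable, per-entry, per-replica parameters.
    Returns (N, D, K, R) Gaussian log-densities."""
    x = x[:, :, None, None]                      # broadcast the data against (D, K, R)
    return -0.5 * ((x - mu) ** 2 / var + torch.log(2 * math.pi * var))

N, D, K, R = 32, 10, 8, 4
x = torch.randn(N, D)
mu = torch.zeros(D, K, R)
var = torch.ones(D, K, R)
print(gaussian_leaf_layer(x, mu, var).shape)     # torch.Size([32, 10, 8, 4])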

3.5 Expectation-Maximization (EM)

A natural way to learn PCs is the EM algorithm, which is known to rapidly increase the likelihood, especially in early iterations (Salakhutdinov et al., 2003). EM for PCs was derived in (Peharz et al., 2017), leading to the following update rules for sum-weights and leaves:

n_{S,N}(x) = \frac{1}{p(x)} \frac{\partial p(x)}{\partial S(x)} \, w_{S,N} \, N(x), \qquad n_{L}(x) = \frac{1}{p(x)} \frac{\partial p(x)}{\partial L(x)} \, L(x)    (6)

w_{S,N} \leftarrow \frac{\sum_{x} n_{S,N}(x)}{\sum_{N' \in ch(S)} \sum_{x} n_{S,N'}(x)}, \qquad \mu_{L} \leftarrow \frac{\sum_{x} n_{L}(x) \, T(x)}{\sum_{x} n_{L}(x)}    (7)

where the sums in (7) range over all training examples x. In (Peharz et al., 2017) and (Peharz et al., 2019), the derivatives ∂p/∂S and ∂p/∂L required for the expected statistics in (6) were computed with an explicitly implemented backwards-pass, performed in the log-domain for robustness. Here we show that this implementation overhead can be avoided by leveraging automatic differentiation. Recall that EiNets represent all probability values in the log-domain, thus the PC output is actually log p(x), rather than p(x). Calling auto-diff on log p(x) yields the following derivative for each sum-weight (omitting the argument x): ∂ log p / ∂ w_{S,N} = (1/p) (∂p/∂S) N, which is exactly n_{S,N} / w_{S,N} in (6), i.e. auto-diff readily provides the required expected statistics for sum nodes. In many frameworks, auto-diff accumulates the gradient by default, as required in (7), leading to an embarrassingly simple implementation of the E-step. For the M-step, we simply multiply the result of the accumulator with the current weights, and renormalize. Furthermore, recall that each single-dimensional leaf is implemented as the log-density of an EF. Taking the gradient with respect to this log-density yields ∂ log p / ∂ log L = (1/p) (∂p/∂L) L, which is n_L in (6). Thus, auto-diff also implements most of the EM update for the leaves. We simply need to accumulate both n_L(x) and n_L(x) T(x) over the whole dataset, and use (7) to update the expectation parameters μ_L. Note that both sum nodes and leaves can be updated using the same calls to auto-diff.
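The following sketch (our illustration of the idea, with hypothetical names; not the authors' released code) shows how the sum-weight E-step reduces to accumulating gradients of the log-likelihood:

import torch

# Hypothetical setup: `weights` holds the linear-domain sum-weights of one einsum layer
# (shape S x K x K x K), and `log_likelihood(weights, batch)` is assumed to return the
# EiNet's summed log-probability over a mini-batch.
weights = torch.rand(16, 8, 8, 8, requires_grad=True)

def em_epoch(batches, log_likelihood):
    if weights.grad is not None:
        weights.grad.zero_()
    for batch in batches:                         # E-step: accumulate d log p / d w over the data
        log_likelihood(weights, batch).backward()
    with torch.no_grad():                         # M-step, Eq. (7)
        n = weights.grad * weights                # accumulated gradient times current weights
        weights.copy_(n / n.sum(dim=(2, 3), keepdim=True))   # renormalize each sum operation

The leaf parameters are updated analogously, by accumulating the gradient with respect to the leaf log-densities together with the sufficient statistics, as described above.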

The classical EM algorithm uses a whole pass over the training set for a single update, which is computationally wasteful due to redundancies in the training data (Bottou, 1998). Similar to stochastic gradient descent (SGD), it is possible to define a stochastic version of EM (Sato, 1999). To this end, one replaces the sums over the entire dataset in (7) with sums over mini-batches, yielding mini-batch update targets \hat{w}_{S,N} and \hat{\mu}_{L}. The full EM update is then replaced with the gliding averages

w_{S,N} \leftarrow (1 - \lambda) \, w_{S,N} + \lambda \, \hat{w}_{S,N}    (8)
\mu_{L} \leftarrow (1 - \lambda) \, \mu_{L} + \lambda \, \hat{\mu}_{L}    (9)

where λ is a step-size parameter. This stochastic version of EM introduces two hyper-parameters, the step-size and the batch-size, which need to be set appropriately. Furthermore, unlike full-batch EM, stochastic EM does not guarantee that the training likelihood increases in each iteration. However, stochastic EM updates the parameters after each mini-batch and typically leads to faster learning.
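A hedged sketch of the gliding-average update in Eqs. (8)–(9), continuing the hypothetical setup from the EM sketch above:

import torch

def stochastic_em_step(weights, batch, log_likelihood, step_size=0.05):
    # E-step on a single mini-batch
    if weights.grad is not None:
        weights.grad.zero_()
    log_likelihood(weights, batch).backward()
    with torch.no_grad():
        n = weights.grad * weights                              # mini-batch expected statistics
        w_hat = n / n.sum(dim=(2, 3), keepdim=True)             # mini-batch EM target from Eq. (7)
        weights.mul_(1.0 - step_size).add_(step_size * w_hat)   # gliding average, Eq. (8)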

Sato (1999) shows an interesting connection between stochastic EM and the natural gradient (Amari, 1998). In particular, for any EF model p(X, Z | θ) over observed variables X and latent variables Z, performing (8) and (9) is equivalent to SGD where the gradient is pre-multiplied by the inverse Fisher information matrix of the complete model p(X, Z | θ). The difference to the standard natural gradient is that the latter is defined via the inverse Fisher of the marginal p(X | θ) (Amari, 1998). Sato’s analysis also applies to EiNets, since smooth and decomposable PCs are EFs of this complete-model form (Peharz et al., 2017), where the latent variables are associated with the sum nodes. Thus, (8) and (9) are in fact performing SGD with (a variant of) the natural gradient.

4 Experiments

dataset RAT-SPN EiNet
nltcs -6.011 -6.015
msnbc -6.039 -6.119
kdd-2k -2.128 -2.183
plants -13.439 -13.676
jester -52.970 -52.563
audio -39.958 -39.879
netflix -56.850 -56.544
accidents -35.487 -35.594
retail -10.911 -10.916
pumsb-star -32.530 -31.954
dna -97.232 -96.086
kosarek -10.888 -11.029
msweb -10.116 -10.026
book -34.684 -34.739
each-movie -53.632 -51.705
web-kb -157.530 -157.282
reuters-52 -87.367 -87.368
20ng -152.062 -153.938
bbc -252.138 -248.332
ad -48.472 -26.273
Table 1: Sanity check that EiNets reproduce the test log-likelihoods of RAT-SPNs (Peharz et al., 2019) when trained on 20 binary datasets, using the same structures. On most datasets, the results of RAT-SPNs and EiNets are not significantly different, as assessed with a one-sided t-test.
Figure 3: Training time and peak memory consumption of EiNets, SPFlow and LibSPN when training randomized binary PC trees, varying the hyper-parameters K (number of densities per sum/leaf), depth D, and number of replica R, respectively. The blob size directly corresponds to the respective hyper-parameter under change. For LibSPN, some settings exhausted GPU memory and are therefore missing.

We implemented EiNets fully in PyTorch (Paszke et al., 2019). The core implementation is provided in the supplementary, and we will release the code in the near future.

4.1 Efficiency Comparison

In order to demonstrate the efficiency of EiNets, we compare with the two most prominent “deep-learning” implementations of PCs, namely LibSPN (Pronobis et al., 2017) and SPFlow (Molina et al., 2019). LibSPN is natively based on Tensorflow (Abadi M. et al., 2015), while SPFlow supports multiple backends; for our experiments, we used SPFlow with the Tensorflow backend. We used randomized binary trees (RAT-SPNs) (Peharz et al., 2019) as a common benchmark: These PC structures are governed by two structural parameters, the split-depth D and the number of replica R. Starting from the root sum node, they split the whole scope into two randomized, balanced parts, recursively until depth D, yielding a binary tree shape. This construction is repeated R times, yielding R random binary trees, which are mixed at the root. As a sanity check, we first reproduced the density estimation results on 20 binary datasets in (Peharz et al., 2019), see Table 1. The difference in test log-likelihood is not statistically significant for most of the datasets, using a one-sided t-test. EiNets even outperformed RAT-SPNs on two datasets, once with a large margin, while RAT-SPNs outperformed EiNets on one dataset, with a small margin.
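For illustration, a random binary tree of scopes as used by RAT-SPNs can be generated roughly as follows (a sketch under our own assumptions, not the RAT-SPN reference implementation):

import random

def random_binary_tree(scope, depth):
    """Recursively split `scope` (a list of variable indices) into two balanced,
    randomly chosen halves, until `depth` splits have been made."""
    node = {"scope": sorted(scope)}
    if depth > 0 and len(scope) > 1:
        shuffled = random.sample(scope, len(scope))
        mid = len(shuffled) // 2
        node["children"] = [random_binary_tree(shuffled[:mid], depth - 1),
                            random_binary_tree(shuffled[mid:], depth - 1)]
    return node

# R replica of depth-D trees over the full scope, to be mixed at a common root
num_vars, D, R = 16, 3, 2
replica = [random_binary_tree(list(range(num_vars)), D) for _ in range(R)]
print(replica[0])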

We aim to compare EiNets, LibSPN and SPFlow in terms of i) training time, ii) memory consumption, and iii) inference time. To this end, we trained and tested them on synthetic data (Gaussian noise). As leaves we used single-dimensional Gaussians. We varied each of the structural hyper-parameters (depth D, number of replica R, and vector length K of sums/leaves) over a range of values; when varying one hyper-parameter, we left the others at their default values. We ran this set of experiments on a GeForce RTX Ti.

In Figure 3 we see a comparison of the three implementations in terms of training time and GPU-memory consumption, where the circle radii represent the magnitude of the varied hyper-parameter. We see that EiNets tend to be one or two orders of magnitude faster (note the log-scale) than the competitors, especially for large models. Also in terms of memory consumption, EiNets scale gracefully. In particular, for large K, memory consumption is an order of magnitude lower than for LibSPN or SPFlow. This can be easily explained by the fact that our einsum operations do not generate product nodes explicitly in memory, while the other frameworks do. Moreover, EiNets are also superior in terms of inference time. In particular for large models, they run again one or two orders of magnitude faster than the other implementations. For space reasons, these results are deferred to the supplementary.

(a) Real SVHN images.
(b) EiNet SVHN samples.
(c) Real images (top), covered images, and EiNet reconstructions
(d) Real CelebA samples.
(e) EiNet CelebA samples.
(f) Real images (top), covered images, and EiNet reconstructions
Figure 4: Qualitative results of EiNets trained on RGB image data, namely SVHN (top) and CelebA (bottom).

4.2 EiNets as Generative Image Models

While PCs are an actively researched topic, their scalability issues have so far restricted them to rather small datasets like MNIST (LeCun et al., 1998). In this paper, we use PCs as generative models for street-view house numbers (SVHN) (Netzer et al., 2011), containing RGB images of house-number digits, and center-cropped CelebA (Liu et al., 2015), containing RGB face images. For SVHN, we concatenated most of the core train set with the extra set as training data, reserved the remainder of the core train set as validation set, and evaluated on the provided test set. For CelebA, we used the standard train, validation and test splits. The data was normalized before training, but otherwise no preprocessing was applied. To the best of our knowledge, PCs have not been successfully trained on datasets of this size before.

We trained EiNets using the image-tailored structure proposed in (Poon & Domingos, 2011), to which we refer as the PD structure. The PD structure recursively decomposes the image into sub-rectangles using axis-aligned splits, displaced by a certain step-size Δ. Here, Δ serves as a structural hyper-parameter which governs the number of sum nodes (Peharz, 2015). The recursive splitting process stops when a rectangle cannot be split by any value in Δ. We first clustered both datasets using the sklearn implementation of k-means, and learned an EiNet on each of these clusters. We then used these EiNets as the components of a mixture model, using the cluster proportions as mixture coefficients. Note that i) a mixture of PCs yields again a PC, and ii) this step is essentially the first step of LearnSPN (Gens & Domingos, 2013), one of the most prominent PC structure learners. For the PD structure of each EiNet component, we used dataset-specific step-sizes Δ for SVHN and CelebA, and we only applied vertical splits, in order to reduce artifacts stemming from the PD structure. As leaves, we used factorized Gaussians, i.e. Gaussians with diagonal covariance matrix, which were further factorized over the RGB channels. The same vector length K was used for sums and leaves. After each EM update, we projected the Gaussian variances onto a fixed interval, bounding the standard deviations of the leaves. Each component was trained with mini-batch EM for a fixed number of epochs, batch size, and EM step-size. In total, training lasted on the order of hours for each of SVHN and CelebA, on an NVidia P100.
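The following sketch illustrates the kind of recursive rectangle decomposition the PD structure performs (our simplified rendering with vertical splits only; split positions, stopping rule, and names are assumptions):

def pd_decompose(x0, x1, delta, regions=None):
    """Recursively split the column range [x0, x1) with vertical cuts at multiples
    of `delta`, collecting every resulting sub-range (region) once."""
    if regions is None:
        regions = set()
    if (x0, x1) in regions:
        return regions
    regions.add((x0, x1))
    cuts = range(x0 + delta, x1, delta)
    if not cuts:                       # rectangle cannot be split any further
        return regions
    for c in cuts:                     # one decomposition per cut: [x0, c) x [c, x1)
        pd_decompose(x0, c, delta, regions)
        pd_decompose(c, x1, delta, regions)
    return regions

# e.g. a 32-pixel-wide image with step-size delta = 8
print(sorted(pd_decompose(0, 32, 8)))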

Original samples and samples from the EiNet mixture are shown in Figure 4, (a, b) for SVHN and (d, e) for CelebA, respectively. The SVHN samples are rather compelling, and some of them could be mistaken for real data samples. The CelebA samples are somewhat over-smoothed, but capture the main qualities of faces well. In both cases, some “stripy” artifacts, typical for the PD architecture (Poon & Domingos, 2011), can be seen. Although the image quality is not comparable to, e.g., GAN models (Goodfellow et al., 2014), we want to stress that EiNets permit tractable inference, while GANs are restricted to sampling. In particular, tractable inference can be immediately used for image inpainting, as demonstrated in Figure 4, (c) for SVHN and (f) for CelebA. Here, we marked parts of test samples as missing (top or left image half) and reconstructed them by drawing a sample from the conditional distribution, conditioned on the visible image part. We see that the reconstructions are plausible, in particular for SVHN.

5 Conclusion

Probabilistic models form a spectrum of machine learning techniques. Most research focuses on representing and learning flexible and expressive models, but ignores the down-stream impact on the set of inference tasks which can be provably solved within the model. The philosophy of tractable modeling also aims to push the expressivity boundaries, but under the constraint of maintaining a defined set of exact inference routines. Probabilistic circuits are certainly a central and prominent approach within this philosophy. In this paper, we addressed a major obstacle for PCs, namely their scalability in comparison to unconstrained models. Our improvements in training speed and memory use, both on the order of one to two orders of magnitude, are compelling, and we hope that our results stimulate further developments in the area of tractable models.

Appendix A Organizing EiNets in Topological Layers

In this section, we elaborate on the layer-wise organization of EiNets, as discussed in Section 3.3 in the main paper. The layer-wise organization is obtained by Algorithm 1, which takes the computational graph G of some EiNet as input, and which performs a type of breadth-first search over G.

We first divide the node set into the sets L, S, and P, containing all leaves, all sums, and all products, respectively. We initialize an empty set V, which will contain all “visited” nodes during the execution of the algorithm. We also initialize an empty list Λ, which will become a topologically ordered list of pure node-sets (i.e. containing exclusively nodes from L, S, or P), and which will contain the result of the algorithm.

The while-loop in Algorithm 1 is executed until all sum and product nodes have been visited and stored in some layer. In line 6, we construct the set S_new of sum nodes which have not been visited yet, but whose parents have all been visited. Note that in the first iteration of the while loop, S_new will simply contain the root of the EiNet. The set S_new is then inserted at the front of the list Λ in line 7, and all nodes in S_new are marked as visited in line 8. A similar procedure, but for product nodes, is performed in lines 9–11, yielding the node set P_new. Note that since we assume that each product node has only one parent (cf. Section 3.1 in the main paper), P_new will be exactly the set of children of the sum nodes contained in S_new (constructed in line 6 within the same while-iteration). Thus, if additionally all sum nodes have exactly one child, P_new and S_new will form a consecutive pair of product and sum layers, as illustrated in Fig. 2 in the main paper, which can be compactly computed using a single call to an einsum operation. If sum nodes have multiple children, we decompose them into chains of consecutive sum nodes, as discussed in the next section.

In line 13 of Algorithm 1, all leaf nodes are inserted as the bottom layer. It is easy to check that the returned list will be topologically sorted, meaning that the nodes in the l-th layer only receive inputs from layers with index strictly smaller than l.

1:  Input: PC graph G
2:  Let L, S, P be the sets of all leaves, sum nodes, product nodes in G, respectively
3:  V ← ∅
4:  Λ ← empty list
5:  while (S ∪ P) ⊄ V do
6:     S_new ← {S ∈ S \ V : pa(S) ⊆ V}
7:     insert S_new at the front of Λ
8:     V ← V ∪ S_new
9:     P_new ← {P ∈ P \ V : pa(P) ⊆ V}
10:    insert P_new at the front of Λ
11:    V ← V ∪ P_new
12:  end while
13:  insert L at the front of Λ
14:  Return Λ
Algorithm 1 Topological Layers

Appendix B The Mixing Layer

Figure 5: Decomposing a layer of sum nodes with multiple children (left) into two consecutive sum layers (right). The first sum layer computes a standard einsum layer, discussed in Section 3.3 in the main paper. The second layer, the so-called mixing layer, takes element-wise mixtures (depicted with sums in boxes).
Figure 6: Inference time and peak memory consumption of EiNets, SPFlow and LibSPN on randomized binary PC trees, varying the hyper-parameters K (number of densities per sum/leaf), depth D, and number of replica R, respectively. The blob size directly corresponds to the respective hyper-parameter under change; the unchanged hyper-parameters were fixed to their respective default values. For LibSPN, some settings exhausted GPU memory and are therefore missing.

The einsum layer computes a large number of sum-product operations with a single call to an einsum operation, but requires that each sum node has exactly one child. In order to compute sum nodes with multiple product children, we express any such sum node as a cascade of two vectorized sum operations. In particular, for a sum node S with C children, we introduce C new sum nodes S_1, ..., S_C, each having one of the children as its single child. This forms a layer of sum nodes with single children, which can be computed with the single einsum operation depicted in Eq. (5) in the main paper. Subsequently, the results of S_1, ..., S_C get mixed in an element-wise manner, i.e. S_k = Σ_c w_c S_{c,k}. We call S_1, ..., S_C the simple sums and S the aggregated sum.

This structure is simply an over-parameterization (Trapp et al., 2019b) of the original sum nodes, and represents the same linear functions. The principle is illustrated in Fig. 5, where two sum nodes with multiple children are expressed with a layer of simple sum nodes, followed by their aggregated sums (sums in boxes).

The element-wise mixtures can also be implemented with a single einsum operation, which we denote as the mixing layer. To this end, within a sum layer, let M be the number of sum nodes having more than one child, and C_max the maximal number of children. We collect the K-dimensional probability vectors computed by the first (simple) sum layer in a C_max × M × K tensor, where the C_max-axis is zero-padded for sum nodes with fewer than C_max children. The mixing layer then computes a convex combination over the first dimension of this tensor.
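A sketch of such a mixing layer (our illustration; here the padded tensor is kept in the linear domain for brevity, whereas EiNets keep probabilities in the log-domain, and the shapes follow the description above):

import torch

def mixing_layer(simple_probs, mix_weights):
    """simple_probs: (C_max, M, K) padded child probabilities (zeros where a sum node
    has fewer than C_max children); mix_weights: (M, C_max), rows summing to one with
    zero weight on padded slots. Returns the (M, K) aggregated sums."""
    return torch.einsum('cmk,mc->mk', simple_probs, mix_weights)

C_max, M, K = 3, 4, 8
simple_probs = torch.rand(C_max, M, K)
simple_probs[2, 0] = 0.0                      # e.g. the first sum node has only 2 children
mix_weights = torch.tensor([[0.5, 0.5, 0.0],  # padded slot gets zero weight
                            [0.2, 0.3, 0.5],
                            [1/3, 1/3, 1/3],
                            [0.1, 0.1, 0.8]])
print(mixing_layer(simple_probs, mix_weights).shape)   # torch.Size([4, 8])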

Constructing this tensor involves some copy overhead, and the mixing layer also wastes some computation due to zero padding. However, using sum nodes with multiple children allows a much wider range of PC structures than e.g. random binary trees, which were the only structure considered in (Peharz et al., 2019).

For sum layers which originally contain only simple sums, the construction of the mixing layer is skipped.

Appendix C Inference Time Comparison

Section 4.1 in the main paper compared training time and memory consumption for EiNets, LibSPN (Pronobis et al., 2017) and SPFlow (Molina et al., 2019), showing that EiNets scale much more gracefully than their competitors. The same holds true for inference time. Fig. 6 shows the results corresponding to Fig. 3 in the main paper, but for inference time per sample rather than training time per epoch. Inference was done on a batch of test samples for each model, i.e. the displayed inference time is the per-sample fraction of the evaluation time for the whole batch. Again, we see significant speedups for EiNets, of up to three orders of magnitude (for maximal depth D, EiNet vs. SPFlow).

References

  • Abadi M. et al. (2015) Abadi M. et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
  • Amari (1998) Amari, S.-I. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
  • Bottou (1998) Bottou, L. Online algorithms and stochastic approximations. In Saad, D. (ed.), Online Learning and Neural Networks. Cambridge University Press, 1998.
  • Darwiche (1999) Darwiche, A. Compiling knowledge into decomposable negation normal form. In IJCAI, volume 99, pp. 284–289, 1999.
  • Darwiche (2001) Darwiche, A. Decomposable negation normal form. Journal of the ACM (JACM), 48(4):608–647, 2001.
  • Darwiche (2002) Darwiche, A. A logical approach to factoring belief networks. KR, 2:409–420, 2002.
  • Darwiche (2003) Darwiche, A. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280–305, 2003.
  • de Campos (2011) de Campos, C. New complexity results for MAP in Bayesian networks. In Proceedings of IJCAI, pp. 2100–2106, 2011.
  • Dennis & Ventura (2012) Dennis, A. and Ventura, D. Learning the architecture of sum-product networks using clustering on variables. In Proceedings of NIPS, pp. 2042–2050, 2012.
  • Gens & Domingos (2013) Gens, R. and Domingos, P. Learning the structure of sum-product networks. Proceedings of ICML, pp. 873–880, 2013.
  • Goodfellow et al. (2014) Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Proceedings of NIPS, pp. 2672–2680, 2014.
  • Khosravi et al. (2019) Khosravi, P., Liang, Y., Choi, Y., and Van den Broeck, G. What to expect of classifiers? Reasoning about logistic regression with missing features. In Proceedings of IJCAI, pp. 2716–2724, 2019.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014. arXiv:1312.6114.
  • Kisa et al. (2014) Kisa, D., den Broeck, G. V., Choi, A., and Darwiche, A. Probabilistic Sentential Decision Diagrams. In KR, 2014.
  • Koller & Friedman (2009) Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009. ISBN 0-262-01319-3.
  • Larochelle & Murray (2011) Larochelle, H. and Murray, I. The neural autoregressive distribution estimator. In Proceedings of AISTATS, pp. 29–37, 2011.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.
  • MacKay (1995) MacKay, D. J. C. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 354(1):73–80, 1995.
  • Molina et al. (2019) Molina, A., Vergari, A., Stelzner, K., Peharz, R., Subramani, P., Mauro, N. D., Poupart, P., and Kersting, K. Spflow: An easy and extensible library for deep probabilistic learning using sum-product networks, 2019.
  • Netzer et al. (2011) Netzer, Y., Wang, T., Coates, A., A-Bissacco, Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
  • Papamakarios et al. (2019) Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. 2019. URL https://pytorch.org/.
  • Pearl (1988) Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988. ISBN 1558604790.
  • Peharz (2015) Peharz, R. Foundations of Sum-Product Networks for Probabilistic Modeling. PhD thesis, Graz University of Technology, 2015.
  • Peharz et al. (2015) Peharz, R., Tschiatschek, S., Pernkopf, F., and Domingos, P. On theoretical properties of sum-product networks. In Proceedings of AISTATS, pp. 744–752, 2015.
  • Peharz et al. (2017) Peharz, R., Gens, R., Pernkopf, F., and Domingos, P. On the latent variable interpretation in sum-product networks. IEEE transactions on pattern analysis and machine intelligence, 39(10):2030–2044, 2017.
  • Peharz et al. (2019) Peharz, R., Vergari, A., Stelzner, K., Molina, A., Trapp, M., Shao, X., Kersting, K., and Ghahramani, Z. Random sum-product networks: A simple and effective approach to probabilistic deep learning. In Proceedings of UAI, 2019.
  • Poon & Domingos (2011) Poon, H. and Domingos, P. Sum-product networks: A new deep architecture. In Proceedings of UAI, pp. 337–346, 2011.
  • Pronobis et al. (2017) Pronobis, A., Ranganath, A., and Rao, R. Libspn: A library for learning and inference with sum-product networks and tensorflow. In Principled Approaches to Deep Learning Workshop, 2017.
  • Rahman et al. (2014) Rahman, T., Kothalkar, P., and Gogate, V. Cutset networks: A simple, tractable, and scalable approach for improving the accuracy of chow-liu trees. In Joint European conference on machine learning and knowledge discovery in databases, pp. 630–645, 2014.
  • Rezende & Mohamed (2015) Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. In Proceedings of ICML, pp. 1530–1538, 2015.
  • Salakhutdinov et al. (2003) Salakhutdinov, R., Roweis, S. T., and Ghahramani, Z. Optimization with em and expectation-conjugate-gradient. In Proceedings of ICML, pp. 672–679, 2003.
  • Sato (1999) Sato, M.-a. Fast learning of on-line em algorithm. Rapport Technique, ATR Human Information Processing Research Laboratories, 1999.
  • Shen et al. (2016) Shen, Y., Choi, A., and Darwiche, A. Tractable operations for arithmetic circuits of probabilistic models. In Advances in Neural Information Processing Systems 29, pp. 3936–3944. 2016.
  • Trapp et al. (2019a) Trapp, M., Peharz, R., Ge, H., Pernkopf, F., and Ghahramani, Z. Bayesian learning of sum-product networks. Proceedings of NeurIPS, 2019a.
  • Trapp et al. (2019b) Trapp, M., Peharz, R., and Pernkopf, F. Optimisation of overparametrized sum-product networks. arXiv preprint arXiv:1905.08196, 2019b.
  • Uria et al. (2016) Uria, B., Côté, M.-A., Gregor, K., Murray, I., and Larochelle, H. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–7220, 2016.
  • Van den Broeck et al. (2019) Van den Broeck, G., Di Mauro, N., and Vergari, A. Tractable probabilistic models: Representations, algorithms, learning, and applications. http://web.cs.ucla.edu/~guyvdb/slides/TPMTutorialUAI19.pdf, 2019. Tutorial at UAI 2019.
  • Zhao et al. (2015) Zhao, H., Melibari, M., and Poupart, P. On the relationship between sum-product networks and Bayesian networks. In Proceedings of ICML, pp. 116–124, 2015.