# On the Sample Complexity of Learning Sum-Product Networks

Sum-Product Networks (SPNs) can be regarded as a form of deep graphical models that compactly represent deeply factored and mixed distributions. An SPN is a rooted directed acyclic graph (DAG) consisting of a set of leaves (corresponding to base distributions), a set of sum nodes (which represent mixtures of their children's distributions) and a set of product nodes (representing products of their children's distributions). In this work, we initiate the study of the sample complexity of PAC-learning the set of distributions that correspond to SPNs. We show that the sample complexity of learning tree structured SPNs with the usual type of leaves (i.e., Gaussian or discrete) grows at most linearly (up to logarithmic factors) with the number of parameters of the SPN. More specifically, we show that the class of distributions that corresponds to tree structured Gaussian SPNs with k mixing weights and e (d-dimensional Gaussian) leaves can be learned within Total Variation error ϵ using at most Õ((ed² + k)/ϵ²) samples. A similar result holds for tree structured SPNs with discrete leaves. We obtain the upper bounds based on the recently proposed notion of distribution compression schemes. More specifically, we show that if a (base) class of distributions F admits an "efficient" compression, then the class of tree structured SPNs with leaves from F also admits an efficient compression.


## 1 Introduction

A Sum-Product Network (SPN) [Poon2011, darwiche2003] is a type of deep probabilistic model that can represent complex probability distributions. An SPN can be specified by its graphical model, which takes the form of a rooted directed acyclic graph (DAG). The leaves of an SPN represent probability distributions from a fixed (simple and often parametric) class such as Bernoulli and Gaussian distributions. Higher-level nodes in the graph correspond to more complex distributions that are obtained by “combining” the lower-level distributions. More specifically, each node of an SPN is either a leaf, a sum node, or a product node. Each sum/product node represents the mixture/product distribution of its children respectively. The use of sum and product “operations” allows for representation of increasingly more complex distributions, all the way to the root. The distribution that an SPN represents is the one that corresponds to its root node. In this work, our focus is on a powerful subclass of SPNs that take the form of rooted trees instead of rooted DAGs. We clarify that our results hold for tree structured SPNs, and for the remainder of this paper we will refer to tree structured SPNs simply as SPNs. See Figure 1 for an example of a simple SPN.
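To make the recursive semantics concrete, here is a minimal sketch in Python; the dict-based node representation and all names are ours (not from the paper), and leaves are restricted to univariate Gaussians for simplicity:

```python
import math

def gaussian_pdf(x, mu, sigma):
    # density of a univariate Gaussian N(mu, sigma^2) at x
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def density(node, x):
    """Evaluate the SPN density at point x (a dict: dimension -> value)."""
    kind = node["type"]
    if kind == "leaf":
        dim = node["scope"][0]
        return gaussian_pdf(x[dim], node["mu"], node["sigma"])
    if kind == "sum":
        # sum node: mixture of its children's distributions
        return sum(w * density(c, x) for w, c in zip(node["weights"], node["children"]))
    if kind == "product":
        # product node: product of children defined over disjoint scopes
        p = 1.0
        for c in node["children"]:
            p *= density(c, x)
        return p
    raise ValueError(kind)

# A toy SPN over dimensions 0 and 1: a mixture of two product distributions.
spn = {
    "type": "sum",
    "weights": [0.3, 0.7],
    "children": [
        {"type": "product", "children": [
            {"type": "leaf", "scope": [0], "mu": 0.0, "sigma": 1.0},
            {"type": "leaf", "scope": [1], "mu": 0.0, "sigma": 1.0},
        ]},
        {"type": "product", "children": [
            {"type": "leaf", "scope": [0], "mu": 2.0, "sigma": 1.0},
            {"type": "leaf", "scope": [1], "mu": 2.0, "sigma": 1.0},
        ]},
    ],
}

print(density(spn, {0: 0.0, 1: 0.0}))  # density of the root distribution at the origin
```

The root's density is the weighted sum of the two product densities, exactly as the recursive definition prescribes.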

SPNs can be considered a generalization of mixture models. The alternating use of sum and product operations in SPNs results in representation of highly structured and complex distributions in a concise form. The appealing property of SPNs in encoding deeply factored and mixed distributions becomes more evident when we increase their depth – allowing representation of more complex distributions [Delalleau, Martens2014]. This property is a major incentive for the use of SPNs in practice [Delalleau, Martens2014, Poon2011, Rashwan2018b, Zhao2016, Vergari2015, Adel2015].

A fundamental open problem that we aim to address is characterizing the number of training instances needed to learn an SPN. More specifically, we want to establish the sample complexity of learning an SPN as a function of its depth and the number of its nodes (as well as the sample complexity of learning a single leaf). In this work, we initiate the study of the sample complexity of SPNs within the standard distribution learning framework (e.g., [devroye_book]), where we are given an i.i.d. sample from an unknown distribution and we wish to find a distribution that—with high probability—is close to it in Total Variation (TV) distance.

One important special case of SPNs are Gaussian Mixture Models (GMMs), which can be regarded as SPNs with only one sum node and a number of Gaussian leaves connected to it. Only recently, it has been shown that the number of samples required to learn mixtures of k d-dimensional Gaussians is Θ̃(kd²/ϵ²), where kd² is (up to constants) the number of parameters of the mixture model [gaussian_mixture_tr]. It is therefore an intriguing question whether this result can be extended to SPNs.

We will establish an upper bound on the sample complexity of learning tree structured SPNs, affirming that the sample complexity grows (almost) linearly with the number of parameters. As a concrete example, we show that the sample complexity of learning SPNs with fixed structures and Gaussian leaves is at most Õ(p/ϵ²), where p is basically the number of parameters (the number of edges/weights in the graph plus the number of Gaussian parameters). Similar results also hold for SPNs with other usual types of leaves, including discrete (categorical) leaves.

We prove our results using the recently proposed notion of distribution compression schemes [gaussian_mixture_tr]. We obtain our sample complexity upper bounds by showing that if a class of distributions, F, admits a certain form of efficient sample compression, then the set of distributions that correspond to SPNs with leaves from F is also efficiently compressible, as long as the number of edges in the SPN is bounded. A technical feature of this result is that the upper bound depends on the number of the edges, but has no extra dependence (e.g., no exponential dependence) on the depth of the SPN.

### 1.1 Notation

For a distribution f, the notation S ∼ fᵐ means S is an i.i.d. sample of size m generated from f. We write the set {1, 2, …, n} as [n], and write |A| to represent the cardinality of the set A. The empty set is denoted by ∅, and log denotes logarithm in the natural base.

## 2 The Distribution Learning Framework

In this short section we formally define the distribution learning framework. A distribution learning method is an algorithm that takes as input a sequence of i.i.d. samples generated from an unknown distribution f, and then outputs (a description of) a distribution f̂ as an estimate of f. We assume that f is in some class of distributions F (i.e., the realizable setting) and we require f̂ to be a member of this class as well (i.e., proper learning). Let f1 and f2 be two probability distributions defined over ℝⁿ, and let B be the Borel sigma algebra over ℝⁿ. The TV distance is defined by

 TV(f1, f2) := sup_{B ∈ B} ∫_B (f1(x) − f2(x)) dx = ½ ‖f1 − f2‖1

where ‖·‖1 is the L1 norm. We now define what it means for two distributions to be ϵ-close.
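For distributions on a finite domain, the supremum form and the half-L1 form of the TV distance coincide in an elementary way; the few lines below illustrate this on toy probability vectors of our own choosing:

```python
# Two discrete distributions over a 3-element domain (toy example).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# Half the L1 distance between the probability vectors.
l1_half = 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Supremum over events B: the maximizing event is B = {i : p_i > q_i}.
sup_form = sum(max(a - b, 0.0) for a, b in zip(p, q))

print(l1_half, sup_form)  # both equal 0.1 for these vectors
```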

###### Definition 1 (ϵ-close).

A distribution f1 is ϵ-close to f2 if TV(f1, f2) ≤ ϵ.

The following is a formal Probably Approximately Correct (PAC) learning definition for distribution learning with respect to the TV distance.

###### Definition 2 (PAC-learning of distributions).

A distribution learning method is called a PAC-learner for F with sample complexity m(ϵ, δ) if, for all distributions f ∈ F and all ϵ, δ ∈ (0, 1), given ϵ, δ, and an i.i.d. sample of size m(ϵ, δ) from f, with probability at least 1 − δ (over the samples) the method outputs a distribution f̂ that is ϵ-close to f.

## 3 Main Results

Here we state our main result regarding the sample complexity of learning SPNs with Gaussian leaves, which are the most common form of continuous SPNs.

###### Theorem 3 (Informal).

Let H be any class of distributions that corresponds to SPNs with the same structure—having k mixing weights and e (d-dimensional Gaussian) leaves. Then H can be PAC-learned using at most

 Õ((ed² + k)/ϵ²)

samples, where Õ hides logarithmic dependencies on e, d, k, and 1/ϵ.

The parameters of an SPN consist of the mixing weights of the sum nodes and the parameters of the leaves. The number of parameters of a d-dimensional Gaussian is O(d²), so our upper bound is nearly linear in the total number of parameters of the SPN (i.e., O(ed² + k)).
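As a quick arithmetic check of this count, a hypothetical helper (ours) that tallies the parameters of a Gaussian SPN, where each d-dimensional Gaussian leaf contributes d mean entries plus d(d+1)/2 entries of a symmetric covariance matrix:

```python
def num_parameters(e, d, k):
    """Total parameter count of an SPN with e Gaussian leaves in d dimensions
    and k mixing weights."""
    per_leaf = d + d * (d + 1) // 2  # mean entries + upper-triangular covariance
    return e * per_leaf + k

# e.g. 4 leaves over 3 dimensions with 6 mixing weights:
print(num_parameters(e=4, d=3, k=6))  # 4 * (3 + 6) + 6 = 42
```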

One of the technical aspects of our upper bound is that it depends on the structure of the SPN only through e and k. In other words, the upper bound will be the same for learning deep vs. shallow structured SPNs as long as they have the same number of mixing weights and leaves. This result motivates the use of deeper SPNs from the information-theoretic point of view—especially given the fact that deeper SPNs can potentially encode a distribution much more efficiently [Delalleau, Martens2014].

###### Remark 4 (Tightness of the upper-bound).

It is known [gaussian_mixture_tr] that the sample complexity of learning mixtures of k d-dimensional Gaussians is at least Ω̃(kd²/ϵ²). Given the fact that GMMs are special cases of SPNs, we can conclude that our upper bound cannot be improved in general (i.e., it nearly matches the lower bound, e.g., for the special case of GMMs). However, it might still be possible to refine the bound by considering additional parameters, which is the subject of future research.

The approach that we use in this paper is quite general and allows us to investigate SPNs with other types of leaves, including discrete leaves. In fact, studying SPNs with discrete leaves is a simpler problem than those with Gaussian leaves. Here we state the corresponding result for discrete SPNs.

###### Theorem 5 (Informal).

Let H be any class of distributions that corresponds to SPNs with the same structure—having k mixing weights and e discrete leaves of support size d. Then H can be PAC-learned using at most

 Õ((ed + k)/ϵ²)

samples, where Õ hides logarithmic dependencies on e, d, k, n, and 1/ϵ.

## 4 Sum Product Networks

We begin this section by defining mixture and product distributions, which are the fundamental building blocks of SPNs. As we are working with absolutely continuous probability measures, we sometimes use distributions and their density functions interchangeably. Let Δk := {(w1, …, wk) ∈ [0, 1]ᵏ : ∑ᵢ wᵢ = 1} denote the k-dimensional simplex.

###### Definition 6 (Mixture distribution).

Let f1, …, fk be densities over a domain Z. We call f a k-mixture of the fᵢ's if it can be written in the following form

 f := ∑_{i=1}^k wᵢ fᵢ

where (w1, …, wk) ∈ Δk are the mixing weights.

###### Definition 7 (Product distribution).

Let f1, …, fm be densities over domains Z1, …, Zm. Then the product density, f, over Z1 × ⋯ × Zm is defined by

 f(z1, …, zm) := ∏_{i=1}^m fᵢ(zᵢ)

### 4.1 SPN Signatures

In this subsection, we introduce the notion of SPN signatures, which helps define SPNs more formally. SPN signatures can be thought of as a recursive syntactic representation of SPNs. We find this syntactic representation useful in improving the clarity and precision of the statements that we will make in our main results and their proofs.

We now define base signatures, which will later allow us to recursively define SPN signatures. Let F denote a (base) class of distributions defined over ℝᵈ. In the following, we define the set of “base signatures”, which basically correspond to the leaves of SPNs.

###### Definition 8 (Base signatures).

The set of base signatures formed by F over ℝⁿ is defined by the following set of tuples

 T^n_F := {(f, b) : b ⊂ [n], |b| = d, f ∈ F}

The first element of each signature (tuple) is a symbol that represents a distribution in F. The second element of the tuple represents the subset of dimensions that the domain of f is defined over. For example, the tuple (f, {1, 3, 5}) represents a distribution f that is defined over the first, third and fifth dimensions of ℝⁿ (i.e., d = 3 is the dimensionality of the domain of f, and n is the dimensionality of the domain of the whole SPN). The set b is commonly referred to as the scope of the distribution. An example of a set of base signatures are the base signatures formed by the class of d-dimensional Gaussians, G_d, given by T^n_{G_d}.

We are now ready to define SPN signatures. In fact, SPN signatures are “generated” recursively from the base signatures, by either taking the product or mixtures of the existing signatures.

###### Definition 9 (SPN signatures).

Given a set of base signatures T^n_F, we (recursively) define the set of SPN signatures generated from T^n_F – denoted by S(T^n_F) – to consist of the following tuples:

1. If (f, b) ∈ T^n_F, then (f, b) ∈ S(T^n_F).

2. If (s1, b1), …, (sm, bm) ∈ S(T^n_F) and b1, …, bm are pairwise disjoint, then ((s1)×⋯×(sm), b1 ∪ ⋯ ∪ bm) ∈ S(T^n_F).

3. If (s1, b), …, (sm, b) ∈ S(T^n_F) and (w1, …, wm) ∈ Δm, then (w1(s1)+⋯+wm(sm), b) ∈ S(T^n_F).

The first rule of the above definition states that SPN signatures include the corresponding base signatures. The second rule defines new signatures based on the products of existing ones. The resulting signature, (s1)×⋯×(sm), is a string that is the concatenation of a number of substrings (i.e., each sᵢ) and a number of symbols (“(”, “)”, and “×”) in the given order. The second element of the tuple (i.e., b1 ∪ ⋯ ∪ bm) keeps track of the dimensions over which the signature is defined. Similarly, the third rule defines signatures based on mixtures of existing signatures (the wᵢ are the string representations of mixing weights).

One can take an SPN signature and create its corresponding visual (graph-based) representation based on the sum and product rules. See Figure 1 for an example of an SPN and its corresponding signature. We will often switch between referring to an SPN as a distribution or as a rooted tree, and it will be clear what we are referring to from the context.
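As an illustration, signatures can be represented as (string, scope) pairs built by the recursive rules; the constructor names and string conventions below are ours, not the paper's:

```python
def base_sig(symbol, scope):
    # rule 1: a base signature is a (symbol, scope) pair
    return (symbol, frozenset(scope))

def product_sig(*sigs):
    # rule 2: product of signatures with pairwise disjoint scopes
    scope = frozenset().union(*(b for _, b in sigs))
    assert sum(len(b) for _, b in sigs) == len(scope), "scopes must be disjoint"
    return ("x".join(f"({s})" for s, _ in sigs), scope)

def sum_sig(weights, sigs):
    # rule 3: mixture of signatures sharing the same scope
    scopes = {b for _, b in sigs}
    assert len(scopes) == 1, "children of a sum must share a scope"
    s = "+".join(f"{w}({c})" for w, (c, _) in zip(weights, sigs))
    return (s, scopes.pop())

g1 = base_sig("g1", {1})
g2 = base_sig("g2", {2})
root = sum_sig([0.3, 0.7],
               [product_sig(g1, g2),
                product_sig(base_sig("g3", {1}), base_sig("g4", {2}))])
print(root[0])  # "0.3((g1)x(g2))+0.7((g3)x(g4))"
```

The resulting string is exactly the concatenation of substrings and symbols described above, and the second tuple element carries the scope {1, 2}.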

### 4.2 (ϵ,α)-Similarity and SPN Structures

In this subsection we define (ϵ,α)-similarity between two signatures. This property is a useful tool that will allow us to simplify our proofs. (ϵ,α)-similarity can also be used to formally define what it means for two signatures to have the same structure.

Roughly speaking, two signatures are (ϵ,α)-similar if (i) their corresponding SPNs have the same structure, (ii) all their corresponding weights are within α of each other, and (iii) all their corresponding leaves are ϵ-close in TV distance. Here is the formal definition.

###### Definition 10 ((ϵ,α)-similar).

Given parameters ϵ and α, we say two signatures (s, b), (s′, b′) ∈ S(T^n_F) are (ϵ,α)-similar, denoted (s, b) ∼_{ϵ,α} (s′, b′), if they satisfy one of the following three properties:

• (s, b) and (s′, b′) are base signatures with b = b′ and TV(s, s′) ≤ ϵ.

• s = (s1)×⋯×(sm) and s′ = (s1′)×⋯×(sm′), where for every i ∈ [m]:

 1. (sᵢ, bᵢ) and (sᵢ′, bᵢ′) are signatures in S(T^n_F) such that

 • bᵢ = bᵢ′, and

 • (sᵢ, bᵢ) ∼_{ϵ,α} (sᵢ′, bᵢ′).

• s = w1(s1)+⋯+wm(sm) and s′ = w1′(s1′)+⋯+wm′(sm′), where for every i ∈ [m]:

 2. (sᵢ, b) and (sᵢ′, b′) are signatures in S(T^n_F) such that

 • b = b′,

 • |wᵢ − wᵢ′| ≤ α, and

 • (sᵢ, b) ∼_{ϵ,α} (sᵢ′, b′).

We slightly abuse notation by writing TV(s, s′) and |wᵢ − wᵢ′|, since the sᵢ and wᵢ are strings/symbols; here we just mean the distributions (and weights) that correspond to those symbols. (ϵ,α)-similarity is a useful property that we will later use in our proofs.

More immediately, we need (ϵ,α)-similarity to formally define what it means for two signatures to have the same structure.

###### Definition 11 (same structure).

We say two signatures s and s′ have the same structure if they are (1,1)-similar, and we denote this by s ≡ s′.

Note that the TV distance between two distributions is at most 1 and the absolute difference between two weights is at most 1. Therefore, the fact that two signatures are (1,1)-similar just means that their corresponding SPNs have the same structure, in the sense that the way their nodes are connected is the same and the scope of corresponding subtrees is the same (with no actual guarantee on the “closeness” of their leaves or weights).

Note that ≡ is an equivalence relation; therefore, we can talk about the equivalence classes of this relation. We use these equivalence classes to give the following definition.

###### Definition 12 (An SPN structure).

Given a signature s ∈ S(T^n_F), we define an SPN structure, [s]_F, as the equivalence class of s:

 [s]_F := {s′ ∈ S(T^n_F) : s ≡ s′}

Essentially, an SPN structure is a set of signatures that represent distributions that all have the same structure. This is a useful definition as it will allow us to state our results precisely. Finally, we denote by Dist([s]_F) the set of all distributions that correspond to the signatures in the equivalence class [s]_F. We are now ready to state our main results formally.

### 4.3 Main Results: Formal

The proofs of these results can be found in Section 7.

###### Theorem 3.

Let G_d be the class of d-dimensional Gaussians. For every SPN structure [s]_{G_d}, the class Dist([s]_{G_d}) can be learned using

 Õ((ed² + k)/ϵ²)

samples, where e and k are the number of leaves and the number of mixing weights of (every) s′ ∈ [s]_{G_d}, respectively.

###### Theorem 5.

Let D_d be the class of discrete distributions with support size d. For every SPN structure [s]_{D_d}, the class Dist([s]_{D_d}) can be learned using

 Õ((ed + k)/ϵ²)

samples, where e and k are the number of leaves and the number of mixing weights of (every) s′ ∈ [s]_{D_d}, respectively.

## 5 Distribution Compression Schemes

In this section we provide an overview of distribution compression schemes and their relation to PAC-learning of distributions. Distribution compression schemes were recently introduced in [gaussian_mixture_tr] as a tool to study the sample complexity of learning a class of distributions. Here, the high-level use-case of this approach is that if we show a class of distributions admits a certain notion of compression, then we can bound the sample complexity of PAC-learning with respect to that class of distributions.

Let us fix a class of distributions, F. A distribution compression scheme for F consists of an encoder and a decoder. The encoder, knowing the true data distribution f, receives an i.i.d. sample S of size m, and tries to “encode” f using a small subset of S and a few extra bits. (Technically, the encoder can use the same instance multiple times in the message, so we have a sequence rather than a set.) On the other hand, the decoder, unaware of f, aims to reconstruct (an approximation of) f using the given subset of samples and the bits. Roughly speaking, F is compressible if there exist a decoder and an encoder such that for any f ∈ F, the decoder can recover (a good approximation of) f based on the given information from the encoder.

More precisely, suppose that the encoder always uses “short” messages to encode any f ∈ F: it uses a sequence of at most τ instances from S and at most t extra bits. Also, suppose that for all f ∈ F, the decoder receives the encoder’s message and with high probability outputs an f̂ such that TV(f, f̂) ≤ ϵ. In this case, we say that F admits (τ, t, m) compression, where τ, t, and m can be functions of the accuracy parameter, ϵ.

The difference between this type of sample compression and the more usual notions of compression is that we not only use bits, but also the samples themselves, to encode a distribution. This extra flexibility is essential—e.g., the class of univariate Gaussian distributions (with unbounded mean) has infinite metric entropy and can be compressed only if, on top of the bits, we use samples to encode the distribution.
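The following toy simulation (ours, and only the flavour of the idea rather than the actual scheme of [gaussian_mixture_tr]) shows how a sample can stand in for an unbounded parameter: to describe N(mu, 1) with an arbitrarily large mean, the encoder, who knows mu, simply points to the drawn sample closest to it, and the decoder centres a Gaussian there:

```python
import random

random.seed(0)
mu = 1234.5                                    # arbitrary, possibly huge, mean
sample = [random.gauss(mu, 1.0) for _ in range(1000)]

# Encoder: pick the single instance closest to the true mean.
center = min(sample, key=lambda x: abs(x - mu))

# Decoder: output N(center, 1); its TV distance to N(mu, 1) shrinks with
# |center - mu|, which is typically O(1/m) for m samples near the mode.
print(abs(center - mu))
```

No fixed number of bits could locate mu over the whole real line, but one well-chosen sample can.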

### 5.1 Formal Definition of Compression Schemes

In this section, we provide a formal definition of distribution compression schemes.

###### Definition 13 (decoder [gaussian_mixture_tr]).

A decoder for F is a deterministic function J : ∪_{j∈ℕ} Zʲ × ∪_{j∈ℕ} {0,1}ʲ → F, which takes a finite sequence of elements of Z and a finite sequence of bits, and outputs a member of F.

###### Definition 14 (compression schemes).

Let τ, t, m : (0, 1) → ℤ≥0 be functions. We say F admits (τ, t, m) compression if there exists a decoder J for F such that for any distribution f ∈ F, the following holds:

• For any ϵ ∈ (0, 1), if a sample S of size m(ϵ) is drawn from f, then with probability at least 2/3, there exists a sequence L of at most τ(ϵ) elements of S, and a sequence B of at most t(ϵ) bits, such that TV(f, J(L, B)) ≤ ϵ.

Briefly put, the definition states that with probability 2/3, there is a (short) sequence of elements from S and a (short) sequence of additional bits, from which f can be approximately reconstructed. This probability can be increased to 1 − δ by generating a sample of size m(ϵ) log(1/δ). The following technical lemma gives an efficient compression scheme for the class of d-dimensional Gaussians.

###### Lemma 15 (Lemma 4.2 in [gaussian_mixture_tr]).

For any positive integer d, the class of d-dimensional Gaussians admits an

 ( O(d log(2d)), O(d² log(2d) log(d/ϵ)), O(d log(2d)) )

compression scheme.

### 5.2 From Compression to Learning

The following theorem draws a connection between compressibility and PAC-learnability. It states that the sample complexity of PAC-learning a class of distributions can be upper bounded if the class admits a distribution compression scheme.

###### Theorem 16 (Compression implies learning, Theorem 3.5 in  [gaussian_mixture_tr]).

Suppose F admits (τ, t, m) compression. Then F can be PAC-learned using

 Õ( m(ϵ/6) + (t(ϵ/6) + τ(ϵ/6))/ϵ² )

samples.

The idea behind the proof of this theorem is simple: if a class admits a compression scheme, then the learner can try to simulate all the messages that the encoder could have possibly sent, and use the decoder on them to find the corresponding outputs. The problem then reduces to learning from a finite class of candidate distributions (see [gaussian_mixture_tr] for details).

## 6 Compressing SPNs

In this section we show (roughly) that if a class of distributions, F, is compressible, then a class of SPNs with a fixed structure and leaves from F is also compressible. We use this result as a crucial step in proving our main results. We give a full proof of the following theorem in the supplement.

###### Theorem 17.

Let F be a class that admits (τ, t, m) compression. For every SPN structure [s]_F, the class Dist([s]_F) admits

 ( e·τ(ϵ/3n), e·t(ϵ/3n) + k·log₂(3k/2ϵ), 48·m(ϵ/3n)·e·log(6e)/ϵ )

compression, where k, e, and n are the number of weights, the number of leaves, and the dimension of the domain of the distributions in Dist([s]_F), respectively.

### 6.1 Overview of Our Techniques

In this subsection, we give a high-level overview of our technique. As stated in Theorem 17, given an SPN structure [s]_F, we want to derive a compression scheme for the class of distributions Dist([s]_F), as long as the class F is compressible. Our compression scheme utilizes an encoder and decoder for the class F. Our encoder can encode any f ∈ Dist([s]_F) in the following way: given samples from f, for each leaf fᵢ the encoder can likely choose a sequence of samples (from the samples of f) and bits such that a decoder for the class F outputs an f̂ᵢ that is an accurate approximation of fᵢ. Furthermore, for each sum node, we discretize its mixing weights with high accuracy. Our discretization, ŵ, can be encoded exactly using some bits.

Our decoder for the class Dist([s]_F) decodes the message from the encoder in the following way. The decoder is given the discretized mixing weights ŵ for each sum node directly in the form of bits, so nothing more needs to be done for the weights. The decoder is also given samples and bits for each leaf fᵢ. We can use the decoder for the class F to reconstruct an f̂ᵢ that is an accurate approximation of fᵢ, with high probability. Our decoder thus outputs the reconstructed SPN f̂ with leaves f̂ᵢ and discretized mixing weights ŵ for each sum node. Finally, we show that the decoder’s reconstruction, f̂, is ϵ-close to f with high probability.

### 6.2 Results

In this subsection, we work towards proving a less general version of Theorem 17 that has a simpler and more intuitive proof. With this in mind, we introduce a few definitions.

###### Definition 18 (ϵ-net).

Let T̂ ⊆ T. We say T̂ is an ϵ-net for T in metric ρ if for each x ∈ T there exists some x̂ ∈ T̂ such that ρ(x, x̂) ≤ ϵ.

###### Definition 19 (path weight).

Let e be the number of leaves in an SPN. For any index i ∈ [e], the path weight, pwᵢ, of the i-th leaf of an SPN is the product of all the mixing weights along the unique path from the root to the i-th leaf.

In other words, when sampling from an SPN distribution, the path weight of a leaf is the probability of getting a sample from that leaf. We say a leaf in an SPN is negligible if its path weight is less than ϵ/(6e), where e is the number of leaves in the SPN. When we say a class of SPNs has no negligible leaves, we mean that none of the distributions in the class has a negligible leaf. We now proceed to prove the following lemma.
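Path weights can be computed by a simple traversal; the sketch below (names and node representation are ours) multiplies mixing weights down from the root, with product nodes passing the accumulated weight through unchanged since they introduce no mixing:

```python
def path_weights(node, acc=1.0):
    """Return the list of path weights of the leaves, in left-to-right order."""
    if node["type"] == "leaf":
        return [acc]
    out = []
    if node["type"] == "sum":
        # each child is reached with probability proportional to its weight
        for w, c in zip(node["weights"], node["children"]):
            out += path_weights(c, acc * w)
    else:  # product node: no mixing weights on these edges
        for c in node["children"]:
            out += path_weights(c, acc)
    return out

# A toy SPN with three leaves.
spn = {"type": "sum", "weights": [0.9, 0.1], "children": [
    {"type": "leaf"},
    {"type": "sum", "weights": [0.5, 0.5],
     "children": [{"type": "leaf"}, {"type": "leaf"}]},
]}
print(path_weights(spn))  # [0.9, 0.05, 0.05]
```

With ϵ = 0.3 and e = 3, the negligibility threshold in the definition above would be 0.3/18 ≈ 0.017, so none of these leaves is negligible.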

###### Lemma 20.

Let F be a class that admits (τ, t, m) compression. For every SPN structure [s]_F, the class Dist([s]_F) (with no negligible leaves) admits

 ( e·τ(ϵ/2n), e·t(ϵ/2n) + k·log₂(k/ϵ), 48·m(ϵ/2n)·e·log(6e)/ϵ )

compression, where k, e, and n are the number of weights, the number of leaves, and the dimensionality of the domain of the distributions in Dist([s]_F), respectively.

To prove the above lemma, we first need to show that if two SPN signatures are (ϵ,α)-similar, then there is a direct relationship between the SPNs they represent. The proof can be found in the supplement.

###### Lemma 21.

Given that two signatures are (ϵ,α)-similar, their corresponding SPNs f̂ and f satisfy

 TV(f̂, f) ≤ nϵ + kα/2

where k and n are the number of weights in, and the dimensionality of the domain of, both f̂ and f, respectively.

We can use the above result to prove Lemma 20.

###### Proof of Lemma 20.

We want to show that given M := 48·m(ϵ/2n)·e·log(6e)/ϵ samples from f ∈ Dist([s]_F), we can construct an f̂ that is ϵ-close to f with probability at least 2/3. For any index i ∈ [e], the i-th leaf of f is denoted by fᵢ.

Encoding: We define Mᵢ to be the number of samples we have from fᵢ. Since we have M samples (and none of the path weights are negligible), using a standard Chernoff bound together with a union bound, there are no less than m(ϵ/2n)·log(6e) samples for every leaf of f, with probability at least 5/6. Given this many samples for each leaf, there exists a sequence of at most τ(ϵ/2n) samples and at most t(ϵ/2n) bits such that a decoder for the class F outputs an f̂ᵢ that satisfies

 TV(fᵢ, f̂ᵢ) ≤ ϵ/2n (1)

with probability no less than 1 − 1/(6e). Finally, using a union bound, the failure probability of our encoding is no more than 1/3. (Failure here means either some leaf not getting enough samples, or some leaf lacking a sequence of samples and bits from which a decoder can output a good approximation.)

Let q be the number of sum nodes. Let j ∈ [q] be an index; for the j-th sum node, we can construct an (ϵ/k)-net in Δ_{kⱼ} of size (k/ϵ)^{kⱼ} for its mixing weights, where kⱼ is the number of mixing weights of the j-th sum node. There exists, in each sum node’s net, an element (ŵ1, …, ŵ_{kⱼ}) such that

 ‖(ŵ1, …, ŵ_{kⱼ}) − (w1, …, w_{kⱼ})‖∞ ≤ ϵ/k (2)

For each sum node j, we can encode (ŵ1, …, ŵ_{kⱼ}) using no more than kⱼ·log₂(k/ϵ) bits; in total we need no more than k·log₂(k/ϵ) bits to encode all the weights in the SPN. Taking everything into account, we have at most e·τ(ϵ/2n) instances and e·t(ϵ/2n) + k·log₂(k/ϵ) total bits that we use to encode the SPN f.

Decoding: the decoder directly receives the bits that correspond to (ŵ1, …, ŵ_{kⱼ}), for each sum node j. We also receive, for each leaf fᵢ, at most τ(ϵ/2n) instances and t(ϵ/2n) bits such that the decoder for the class F outputs an f̂ᵢ that satisfies Equation (1). Our decoder will thus output an SPN f̂ where the i-th leaf is f̂ᵢ and each sum node j has mixing weights (ŵ1, …, ŵ_{kⱼ}).

To complete the proof, we need to show that our reconstruction, f̂, is ϵ-close to f with probability at least 2/3. Our encoding succeeds with probability no less than 2/3, so we simply need to show that TV(f̂, f) ≤ ϵ. Let ŝ and s represent the SPN signatures of f̂ and f respectively. By Equations (1) and (2) and Definition 10, we have that ŝ and s are (ϵ/2n, ϵ/k)-similar. Using Lemma 21 we have that TV(f̂, f) ≤ n·(ϵ/2n) + k·(ϵ/k)/2 = ϵ. This completes our proof. ∎

## 7 Proofs of the Main Results

In this section we prove some of the results stated in earlier sections.

### 7.1 Proof of Theorem 3

###### Proof of Theorem 3.

Let G_d be the class of d-dimensional Gaussians. Combining Theorem 17 and Lemma 15, we have the following: for any SPN structure [s]_{G_d}, the class Dist([s]_{G_d}) – where each distribution in it has e leaves and a domain with dimensionality n – admits a

 ( O(e·d·log(2d)), O(e·d²·log(2d)·log(3dn/ϵ)) + k·log₂(3k/ϵ), O(d·log(2d)·e·log(6e)/ϵ) )

compression scheme. Using Theorem 16 shows that this class can be learned using Õ((ed² + k)/ϵ²) samples, which completes the proof. ∎

### 7.2 Proof of Theorem 5

To prove Theorem 5 we need the following lemma.

###### Lemma 22.

The class of discrete distributions, D_d, with support size d admits (0, d·log₂(d/ϵ), 0) compression.

###### Proof.

The proof is quite simple: we only need to compress the parameters (p1, …, pd) directly. We cast an (ϵ/d)-net in Δ_d of size (d/ϵ)ᵈ for the parameters. There exists an element in our net, (p̂1, …, p̂d), that satisfies

 ‖(p̂1, …, p̂d) − (p1, …, pd)‖∞ ≤ ϵ/d

So we have TV((p̂1, …, p̂d), (p1, …, pd)) = ½ ∑ᵢ |p̂ᵢ − pᵢ| ≤ ϵ/2 ≤ ϵ. We can thus encode (p̂1, …, p̂d) using no more than d·log₂(d/ϵ) bits. The decoder receives these discretized weights (in bits) and directly outputs (p̂1, …, p̂d), which is ϵ-close to (p1, …, pd). This completes the proof. ∎
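A runnable sketch of this discretization idea (ours; for concreteness it adds an explicit renormalization step on the decoder side, which the counting argument in the proof does not need):

```python
def encode(p, eps):
    """Quantize each probability to a grid of step eps/d; returns d small ints."""
    d = len(p)
    step = eps / d
    return [round(pi / step) for pi in p]

def decode(codes, eps, d):
    """Rebuild the probability vector from the grid indices and renormalize."""
    step = eps / d
    q = [c * step for c in codes]
    total = sum(q)
    return [qi / total for qi in q]

p = [0.62, 0.25, 0.13]
eps = 0.1
codes = encode(p, eps)            # d integers, each describable in ~log2(d/eps) bits
q = decode(codes, eps, len(p))
tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
print(codes, tv <= eps)
```

Each of the d codes fits in roughly log₂(d/ϵ) bits, matching the bit budget in the lemma, and the decoded vector stays within TV distance ϵ of the original.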

###### Proof of Theorem 5.

Let D_d be the class of discrete distributions with support size d. Combining Theorem 17 and Lemma 22, we have the following: for any SPN structure [s]_{D_d}, the class Dist([s]_{D_d}) – where each distribution in it has e leaves and a domain with dimensionality n – admits a

 ( 0, e·d·log₂(3dn/ϵ) + k·log₂(3k/ϵ), 0 )

compression scheme. Using Theorem 16 we have that this class can be learned using Õ((ed + k)/ϵ²) samples, which completes the proof. ∎

## 8 Previous Work

Distribution learning is a broad topic of study and has been investigated by many scientific communities (e.g., [kearns, devroye_density_estimation_first, silverman]). There are many metrics to choose from when one wishes to measure the similarity of two distributions. In this work we use the TV distance, which has been applied to derive numerous bounds on the sample complexity of learning GMMs [gaussian_mixture_tr, ashtiani2017sample, onedimensional, DK14, suresh2014near] as well as other types of distributions (see [Diakonikolas2016, chan2013learning, diakonikolas2016efficient] and references therein). There are many other common metrics used for density estimation, such as the KL divergence and general ℓp distances with p > 1. Unfortunately, even for the simpler problem of learning GMMs, it can be shown that the sample complexity of learning with respect to the KL divergence and general ℓp distances must depend on structural properties of the distribution, while the same does not hold for the TV distance. For more details on this see [gaussian_mixture_tr].

Research on SPNs has primarily been focused on developing practical methods that learn appropriate structures for SPNs [Dennis2012, Gens2013, Peharz2013, lee2013online, Dennis2015, Vergari2015, Rahman2016, Trapp2016, hsu2017online, Dennis2017, jaini2018prometheus, Rashwan2018b, Bueff2018, trapp2019bayesian] as well as methods that learn the parameters of SPNs [Poon2011, gens2012discriminative, peharz2014learning, desana2016learning, Rashwan2016, Zhao2016, jaini2016online, Trapp2018, Rashwan2018a, peharz2019]. This line of research is not directly related to our work since they are interested in the development of efficient algorithms to be used in practice.

There has also been some theoretical work showing that increasing the depth of SPNs provably increases their representational power [Martens2014, Delalleau], as well as some work showing the relationship between SPNs and other types of probabilistic models [jaini2018deep]. Although these papers investigate theoretical properties of SPNs, they are not directly related to our work, as we are interested in answering different questions about SPNs.

## 9 Discussion

Loosely put, in this work we have derived upper bounds on the sample complexity of learning a class of tree structured SPNs with Gaussian leaves and a class of tree structured SPNs with discrete leaves. Our results hold for SPNs that are in the form of rooted trees; however, SPNs are more generally rooted DAGs. It is thus an interesting open problem, which we leave for future work, to characterize the sample complexity of SPNs in general. Our upper bounds hold for learning in the realizable setting, so another interesting open direction is to extend our results to the agnostic setting. Finally, although there exists a lower bound based on the fact that mixture models can be viewed as a special case of SPNs, an interesting open question is whether we can determine lower bounds for SPNs of arbitrary depth; we leave this for future work as well.

## Appendix A Omitted Proofs

### a.1 Proof of Lemma 21

To prove Lemma 21, we need the following proposition, which is standard and can be proved, e.g., using the coupling characterization of the TV distance.

###### Proposition 23.

For $i \in [m]$, let $p_i$ and $q_i$ be probability distributions over the same domain $Z_i$. Then $\mathrm{TV}\big(\prod_{i=1}^{m} p_i, \prod_{i=1}^{m} q_i\big) \le \sum_{i=1}^{m} \mathrm{TV}(p_i, q_i)$.
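Proposition 23 bounds the TV distance between product distributions by the sum of the component-wise TV distances. As a quick numerical sanity check (an illustration only, not part of the paper's construction), the snippet below verifies the inequality for a hand-picked pair of two-component discrete distributions:

```python
# Sanity check of Proposition 23 for discrete distributions (dicts mapping
# outcomes to probabilities): TV between product distributions is at most
# the sum of the coordinate-wise TV distances.
import itertools

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def product_dist(dists):
    """Product distribution over the Cartesian product of the supports."""
    out = {}
    for combo in itertools.product(*[d.items() for d in dists]):
        keys = tuple(k for k, _ in combo)
        prob = 1.0
        for _, v in combo:
            prob *= v
        out[keys] = prob
    return out

# Hand-picked component distributions (hypothetical values).
p = [{0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.1}]
q = [{0: 0.6, 1: 0.4}, {0: 0.7, 1: 0.3}]

lhs = tv(product_dist(p), product_dist(q))  # TV of the products
rhs = sum(tv(pi, qi) for pi, qi in zip(p, q))  # sum of component TVs
assert lhs <= rhs + 1e-12
```

Here the left-hand side evaluates to 0.2 and the right-hand side to 0.3, consistent with the proposition.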

###### Proof of Lemma 21.

Two signatures that are $(\epsilon, \alpha)$-similar have corresponding SPNs with the same structure. Since any SPN can be considered the composition of smaller trees (that are themselves SPNs), we prove this by structural induction. Note that $\mathrm{TV}(f, \hat f) = \frac{1}{2}\|\hat f - f\|_1$ by definition, so a bound on the $L_1$ distance implies the corresponding bound on the TV distance.

Base Case: Trees of height $0$ are simply leaves, which have no mixing weights ($k = 0$). By the definition of $(\epsilon, \alpha)$-similarity, two corresponding leaves are $\epsilon$-close in TV distance, so $\|\hat f - f\|_1 \le 2\epsilon \le 2n\epsilon$. Thus, the inductive hypothesis is satisfied.

Inductive step:

Suppose that the root nodes of both $f$ and $\hat f$ have $m$ children. For $i \in [m]$, let $f_i$ and $\hat f_i$ be the $i$th children (which are themselves smaller SPNs) of the root nodes of the SPNs $f$ and $\hat f$ respectively. Assume the inductive hypothesis holds between each $f_i$ and $\hat f_i$, i.e., $\|\hat f_i - f_i\|_1 \le 2 n_i \epsilon + k_i \alpha$. There are two possibilities for the SPNs $f$ and $\hat f$: they are rooted at a product node or they are rooted at a sum node.

Case 1: The root node is a product node with $m$ children. This gives us

$$\|\hat f - f\|_1 \le \sum_{i=1}^{m} \|\hat f_i - f_i\|_1 \le \sum_{i=1}^{m}\left(2 n_i \epsilon + k_i \alpha\right) = 2 n \epsilon + k \alpha$$

where $k_i$ and $n_i$ are the number of mixing weights in $f_i$ and the dimensionality of the domain of $f_i$ respectively, so that $\sum_{i} k_i = k$ and $\sum_{i} n_i = n$ (a product node has disjoint scopes and introduces no new mixing weights). The first inequality follows from Proposition 23.

Case 2: The root node is a sum node with $m$ children and mixing weights $w_1, \dots, w_m$ (with corresponding weights $\hat w_1, \dots, \hat w_m$ in $\hat f$). This gives us

$$\begin{aligned}
\|\hat f - f\|_1 &\le \Big\|\sum_{i=1}^{m} w_i (\hat f_i - f_i)\Big\|_1 + \Big\|\sum_{i=1}^{m} (\hat w_i - w_i)\hat f_i\Big\|_1 \\
&\le \sum_{i=1}^{m} w_i \|\hat f_i - f_i\|_1 + \sum_{i=1}^{m} |\hat w_i - w_i|\,\|\hat f_i\|_1 \\
&\le \sum_{i=1}^{m} w_i \left(2 n \epsilon + k_i \alpha\right) + \sum_{i=1}^{m} \alpha \cdot 1 \\
&\le 2 n \epsilon \sum_{i=1}^{m} w_i + \Big(\sum_{i=1}^{m} w_i\Big)\Big(\sum_{i=1}^{m} k_i \alpha\Big) + \sum_{i=1}^{m} \alpha \cdot 1 \\
&= 2 n \epsilon \cdot 1 + 1 \cdot \sum_{i=1}^{m} k_i \alpha + m \alpha = 2 n \epsilon + k \alpha
\end{aligned}$$

where the first two inequalities follow from the triangle inequality and the absolute homogeneity of the norm, and the final equality uses $\sum_{i} w_i = 1$ and $k = m + \sum_{i} k_i$ (the sum node contributes $m$ new mixing weights). This completes our proof. ∎
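To make the sum-node bound concrete, here is a small numerical check (all distributions, weights, and perturbations are hypothetical choices, with $n = 1$ and $k = 3$): each leaf is perturbed by at most $\epsilon$ in TV and each mixing weight by at most $\alpha$, and the $L_1$ distance between the resulting mixtures stays below $2n\epsilon + k\alpha$:

```python
# Hypothetical check of the bound ||f_hat - f||_1 <= 2*n*eps + k*alpha for one
# sum node over three discrete leaves on {0, 1, 2} (so n = 1, k = 3).
eps, alpha = 0.05, 0.02

leaves = [
    {0: 0.70, 1: 0.20, 2: 0.10},
    {0: 0.10, 1: 0.80, 2: 0.10},
    {0: 0.20, 1: 0.20, 2: 0.60},
]
# Perturbed leaves, each within TV distance eps of the corresponding original.
leaves_hat = [
    {0: 0.65, 1: 0.25, 2: 0.10},
    {0: 0.10, 1: 0.75, 2: 0.15},
    {0: 0.25, 1: 0.20, 2: 0.55},
]
w = [0.50, 0.30, 0.20]      # true mixing weights
w_hat = [0.48, 0.32, 0.20]  # perturbed weights, |w_hat_i - w_i| <= alpha

def mix(weights, dists):
    """Mixture of discrete leaves sharing the support {0, 1, 2}."""
    return {x: sum(wi * d[x] for wi, d in zip(weights, dists)) for x in (0, 1, 2)}

f = mix(w, leaves)
f_hat = mix(w_hat, leaves_hat)
l1 = sum(abs(f_hat[x] - f[x]) for x in (0, 1, 2))
bound = 2 * 1 * eps + 3 * alpha  # 2*n*eps + k*alpha = 0.16
assert l1 <= bound
```

With these numbers the actual $L_1$ distance is 0.052, comfortably below the bound of 0.16, since the triangle-inequality steps in the proof are loose for generic perturbations.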

### a.2 Proof of Theorem 17

The proof of Theorem 17 mirrors many aspects of the proof of Lemma 21.

###### Proof of Theorem 17.

We want to show that, given enough samples from $f$, we can construct an $\hat f$ that is $\epsilon$-close to $f$ in TV distance with high probability.

Encoding: We encode the mixing weights of the sum nodes in exactly the same way as in the proof of Lemma 21, but we use a finer net, of resolution $O(\epsilon/k)$, instead of an $\alpha$-net.

Let $i$ be a leaf index. Recall that a leaf is negligible if its path weight (the product of the mixing weights along the path from the root to that leaf) falls below the threshold fixed in the statement of Theorem 17. We can split leaves into two groups: leaves that are negligible and leaves that are non-negligible. For non-negligible leaves, using a standard Chernoff bound together with a union bound, we receive sufficiently many samples from all non-negligible leaves with high probability. Given this many samples for each non-negligible leaf $i$, there exists a sequence of samples and bits such that a decoder for the base class outputs an $\hat f_i$ that satisfies

$$\mathrm{TV}(f_i, \hat f_i) \le \frac{\epsilon}{3n} \qquad (3)$$

with high probability. Using a union bound, Equation (3) then holds for all non-negligible leaves simultaneously, again with high probability.

We cannot make any guarantees on the number of samples that come from negligible leaves. For each negligible leaf $i$, we pick an arbitrary sequence of samples and bits; the decoder for the base class will thus output an arbitrary $\hat f_i$. As we will show shortly, we do not need to do any better for negligible leaves. Finally, using a union bound, the failure probability of our encoding is suitably small. (Failure here means either that some non-negligible leaf does not receive enough samples, or that for some non-negligible leaf there is no sequence of samples and bits from which a decoder can output a good approximation.)
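The Chernoff step above can be illustrated numerically. The snippet below (with made-up values for the sample size and path weight, not the paper's constants) checks that a leaf with path weight $p$ receives fewer than $mp/2$ of $m$ i.i.d. samples far less often than the multiplicative Chernoff bound $\exp(-mp/8)$ would already guarantee:

```python
# Illustration of the Chernoff step in the encoding argument: a leaf with path
# weight p receives Binomial(m, p) of the m samples, and the multiplicative
# Chernoff bound gives P[Binomial(m, p) < m*p/2] <= exp(-m*p/8).
import math
import random

random.seed(0)

def fraction_underfilled(m, p, trials=500):
    """Empirical frequency of a Binomial(m, p) draw falling below m*p/2."""
    bad = 0
    for _ in range(trials):
        hits = sum(random.random() < p for _ in range(m))
        if hits < m * p / 2:
            bad += 1
    return bad / trials

m, p = 2000, 0.1  # sample size and path weight (hypothetical values)
empirical = fraction_underfilled(m, p)
chernoff_bound = math.exp(-m * p / 8)  # here exp(-25)
assert empirical <= 0.01  # underfilling is essentially never observed
```

Since $mp = 200$ here, falling below $mp/2 = 100$ samples is an extreme deviation, which is why the simulation virtually never observes it.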

Decoding: The mixing weights for our sum nodes are decoded in the same way as in the proof of Lemma 21. Our decoder receives a sequence of samples and bits for each leaf $i$. For non-negligible leaves, the decoder for the base class will output an $\hat f_i$ that satisfies Equation (3). For negligible leaves, the decoder also outputs an $\hat f_i$, but we have no guarantee of how accurately it approximates $f_i$. Our decoder thus outputs an SPN $\hat f$ where the $i$-th leaf is $\hat f_i$ and each sum node has the decoded mixing weights.

We need to show that $\mathrm{TV}(f, \hat f) \le \epsilon$ whenever the encoding succeeds. Since the encoding succeeds with the required probability, we only need to show that $\|\hat f - f\|_1 \le 2\epsilon$. Let $i'$ index the nodes of the SPN. For any node $i'$, we can prove that any sub tree of $f$ rooted at node $i'$ with at least one sum node in the sub tree, denoted $f_{i'}$, and the corresponding sub tree $\hat f_{i'}$ of $\hat f$ satisfy the following:

$$\|\hat f_{i'} - f_{i'}\|_1 \le 2\sum_{j' \in N_{i'}} W_{i'j'} + \frac{2\hat n \epsilon}{3n} + \frac{2\hat k \epsilon}{3k}$$

where $\hat n$ is the dimensionality of the domain of the sub trees, $\hat k$ is the number of mixing weights in the sub trees, $N_{i'}$ is the subset of the leaves of $f_{i'}$ that are negligible, and $W_{i'j'}$ represents the product of the mixing weights along the unique path from (negligible) leaf $j'$ up to node $i'$. We can prove this via induction over the height of the sub trees.

Base case: Our base case consists of two separate possibilities. Every sub tree that contains at least one sum node is built on top of either: 1) a sub tree of height 1 with a single sum node and some leaves, or 2) a sub tree of height 2 rooted at a sum node. The sub trees of type 2) have roots that are connected to product nodes, which are in turn connected to leaves.

Case 1: Let $f_{i'}$ represent a sub tree of height 1 rooted at a sum node. Let $m$ be the number of children of $f_{i'}$ and $\hat f_{i'}$. The children of the root nodes of $f_{i'}$ and $\hat f_{i'}$ are given by $f_{k'}$ and $\hat f_{k'}$ respectively, where $k' \in [m]$. Thus we have

$$\begin{aligned}
\|\hat f_{i'} - f_{i'}\|_1 &\le \sum_{k'=1}^{m} w_{k'} \|\hat f_{k'} - f_{k'}\|_1 + \sum_{k'=1}^{m} |\hat w_{k'} - w_{k'}|\,\|\hat f_{k'}\|_1 \\
&= \sum_{k' \in N_{i'}} w_{k'} \|\hat f_{k'} - f_{k'}\|_1 + \sum_{k' \notin N_{i'}} w_{k'} \|\hat f_{k'} - f_{k'}\|_1 + \sum_{k'=1}^{m} |\hat w_{k'} - w_{k'}|\,\|\hat f_{k'}\|_1 \\
&\le 2\sum_{k' \in N_{i'}} w_{k'} + \frac{2\epsilon}{3n} + \frac{2\hat k \epsilon}{3k} \le 2\sum_{k' \in N_{i'}} w_{k'} + \frac{2\hat n \epsilon}{3n} + \frac{2\hat k \epsilon}{3k} \\
&= 2\sum_{j' \in N_{i'}} W_{i'j'} + \frac{2\hat n \epsilon}{3n} + \frac{2\hat k \epsilon}{3k}
\end{aligned}$$

Case 2: Let $f_{i'}$ represent a sub tree of height 2 rooted at a sum node. Let $m$ be the number of children of $f_{i'}$ and $\hat f_{i'}$. The children of the root nodes of $f_{i'}$ and $\hat f_{i'}$ are given by $f_{k'}$ and $\hat f_{k'}$ respectively, where $k' \in [m]$; here each $f_{k'}$ is rooted at a product node. We define $C_{k'}$ as the set of nodes that are children of node $k'$. Thus we have

$$\begin{aligned}
\|\hat f_{i'} - f_{i'}\|_1 &\le \sum_{k'=1}^{m} w_{k'} \|\hat f_{k'} - f_{k'}\|_1 + \sum_{k'=1}^{m} |\hat w_{k'} - w_{k'}|\,\|\hat f_{k'}\|_1 \\
&= \sum_{k'=1}^{m} w_{k'} \Big\|\prod_{j \in C_{k'}} \hat f_j - \prod_{j \in C_{k'}} f_j\Big\|_1 + \sum_{k'=1}^{m} |\hat w_{k'} - w_{k'}|\,\|\hat f_{k'}\|_1 \\
&\le \sum_{k'=1}^{m} w_{k'} \Big(\sum_{j \in C_{k'} \cap N_{k'}} 2 + \sum_{j \in C_{k'} \cap \bar N_{k'}} \frac{2\epsilon}{3n}\Big) + \sum_{k'=1}^{m} |\hat w_{k'} - w_{k'}|\,\|\hat f_{k'}\|_1 \\
&\le 2\sum_{j' \in N_{i'}} W_{i'j'} + \frac{2\hat n \epsilon}{3n} + \frac{2\hat k \epsilon}{3k}
\end{aligned}$$