On the Expressive Efficiency of Sum Product Networks

11/27/2014 ∙ by James Martens, et al. ∙ University of Toronto

Sum Product Networks (SPNs) are a recently developed class of deep generative models which compute their associated unnormalized density functions using a special type of arithmetic circuit. When certain sufficient conditions, called the decomposability and completeness conditions (or "D&C" conditions), are imposed on the structure of these circuits, marginal densities and other useful quantities, which are typically intractable for other deep generative models, can be computed by what amounts to a single evaluation of the network (which is a property known as "validity"). However, the effect that the D&C conditions have on the capabilities of D&C SPNs is not well understood. In this work we analyze the D&C conditions, expose the various connections that D&C SPNs have with multilinear arithmetic circuits, and consider the question of how well they can capture various distributions as a function of their size and depth. Among our various contributions is a result which establishes the existence of a relatively simple distribution with fully tractable marginal densities which cannot be efficiently captured by D&C SPNs of any depth, but which can be efficiently captured by various other deep generative models. We also show that with each additional layer of depth permitted, the set of distributions which can be efficiently captured by D&C SPNs grows in size. This kind of "depth hierarchy" property has been widely conjectured to hold for various deep models, but has never been proven for any of them. Some of our other contributions include a new characterization of the D&C conditions as sufficient and necessary ones for a slightly strengthened notion of validity, and various state-machine characterizations of the types of computations that can be performed efficiently by D&C SPNs.


1 Introduction

Sum Product Networks (SPNs) (Poon and Domingos, 2011) are a recently developed class of deep generative models which compute their associated unnormalized density functions using a special type of arithmetic circuit. Like neural networks, arithmetic circuits (e.g. Shpilka and Yehudayoff, 2010) are feed-forward circuits whose gates/nodes compute real values, and whose connections have associated real-valued weights. Each node in an arithmetic circuit computes either a weighted sum or a product over its real-valued inputs.

For an important special class of SPNs called “valid SPNs”, computing the normalizing constant, along with any marginals, can be performed by what amounts to a single evaluation of the network. This is to be contrasted with other deep generative models like Deep Boltzmann Machines (Salakhutdinov and Hinton, 2009), where quantities crucial to learning and model evaluation (such as the normalizing constant) are provably intractable unless $P = \#P$ (Roth, 1996).

The tractability properties of valid SPNs are the primary reason they are interesting both from a theoretical and practical perspective. However, validity is typically enforced via the so-called “decomposability” and “completeness” conditions (which we will abbreviate as “D&C”). While easy to describe and verify, the D&C conditions impose stringent structural restrictions on SPNs which limit the kinds of architectures that are allowed. While some learning algorithms have been developed that can respect these conditions (e.g. Gens and Domingos, 2013; Peharz et al., 2013; Rooshenas and Lowd, 2014), the extent to which they limit the expressive efficiency [1] of SPNs versus various other deep generative models remains unclear.

[1] By this we mean the extent to which they can efficiently capture various distributions. A distribution is “efficiently captured” if it is contained in the closure of the set of distributions corresponding to different settings of the model’s parameters, for polynomial sized (in the dimension of the data/input) instances of the model, where size is measured by the number of “units” or parameters. Often these will be realistic low-order polynomials, although this depends on how exactly the constructions are done. Note that a distribution being “efficiently captured” says nothing about how easily the marginal densities or partition function of its associated density can be computed (except in the case of D&C SPNs of course). The concept of expressive efficiency is also sometimes called “expressive power” or “representational power”, although we will use the word “efficiency” instead of “power” to emphasize our focus on the question of whether or not certain distributions can be captured efficiently by the model, instead of the question of whether or not they can be captured at all (i.e. by super-polynomially sized instances of the model). This latter question is the topic of papers which present so-called “universality” results, which show how some models can capture any distribution if they are allowed to be exponentially large in the input dimension (by essentially simulating a giant look-up table). Such results are fairly straightforward, and indeed it is easy to show that D&C SPNs are universal in this sense.

Like most models, D&C SPNs are “universal” in the sense that they can capture any distribution if they are allowed to be of a size which is exponential in the dimension of the data/input. However, any distribution which can be efficiently captured by D&C SPNs, which is to say by one of polynomial size, must therefore be tractable (in the sense of having computable marginals, etc). And given complexity theoretic assumptions like $P \neq \#P$, it is easy to come up with density functions whose marginals/normalizers are intractable, but which nonetheless correspond to distributions which can be efficiently captured by various other deep generative models (e.g. using the simulation results from Martens (2014)). Thus we see that the tractability properties enjoyed by D&C SPNs indeed come with a price.

However, one could argue that the intractability of these kinds of “hard” distributions would make it difficult or even impossible to learn them in practice. Moreover, any model which can efficiently capture them must therefore lack an efficient general-case inference/learning algorithm. This is a valid point, and it suggests the obvious follow-up question: is there a fully tractable distribution (in the sense that its marginal densities and partition function can be computed efficiently) which can be efficiently captured by other deep models, but not by D&C SPNs?

In this work we answer this question in the affirmative, without assuming any complexity theoretic conjectures. This result thus establishes that D&C SPNs are in some sense less expressively efficient than many other deep models (since, by the results of Martens (2014), such models can efficiently simulate D&C SPNs), even if we restrict our attention only to tractable distributions. Moreover, it suggests the existence of a hypothetical model which could share the tractability properties of D&C SPNs, while being more expressively efficient.

In addition to this result, we also analyze the effect of depth, and other structural characteristics, on the expressive efficiency of D&C SPNs. Perhaps most notably, we use existing results from arithmetic circuit theory to establish that D&C SPNs gain expressive efficiency with each additional layer of depth. In particular, we show that the set of distributions which can be efficiently captured by D&C SPNs grows with each layer of depth permitted. This kind of “depth hierarchy” property has never before been shown to hold for any other well-known deep model, despite the widespread belief that it does hold for most of them (e.g. Bengio and Delalleau, 2011).

Along with these two results, we also make numerous other contributions to the theoretical understanding of SPNs which are summarized below.

In Section 2, we first propose a generalized definition of SPNs that captures all previous definitions. We then illuminate the various connections between SPNs and multilinear arithmetic circuits, allowing us to exploit the many powerful results which have already been proved for the latter.

In Section 3 we provide new insights regarding the D&C conditions and their relationship to validity, and introduce a slightly strengthened version of validity which we show to be equivalent to the D&C conditions (whereas standard validity is merely implied by them). We also show that for a slightly generalized definition of SPNs, testing for standard validity is a co-NP hard problem.

In Section 5 we give examples of various state-based models of computation which can be efficiently simulated by D&C SPNs, and show how these can be used to give constructive proofs that various simple density functions can be efficiently computed by D&C SPNs.

In Section 6 we address the prior work on the expressive efficiency of D&C SPNs due to Delalleau and Bengio (2011), and give a much shorter proof of their results using powerful techniques borrowed from circuit theory. We go on to show how these techniques allow us to significantly strengthen and extend the results of Delalleau and Bengio (2011), answering an open question which they posed.

In Section 7 we leverage prior work done on multilinear arithmetic circuits to prove several very powerful results regarding the relationship between depth and expressive efficiency of D&C SPNs. First, we show that with each extra layer of depth added, there is an expansion of the set of functions efficiently computable by D&C SPNs (thus giving a strict “hierarchy of depth”). Next we show that if depth is allowed to grow with the input dimension $n$, its effect on expressive efficiency greatly diminishes after it reaches $O(\log(n)^2)$.

In Section 8 we show that when D&C SPNs are constrained to have a recursive “formula” structure, as they are when learned using the approach of (Gens and Domingos, 2013), they lose expressive efficiency. In particular we use prior work on multilinear arithmetic circuits to produce example functions which can be efficiently computed by general D&C SPNs, but not by ones constrained to have a formula structure.

Finally, in Section 9 we give what is perhaps our most significant and difficult result, which is the existence of a simple density function whose marginals and normalizer are computable by an $O(n^{1.19})$ time algorithm, and whose corresponding distribution can be efficiently captured by various other deep models (in terms of their size), but which cannot be efficiently computed, or even efficiently approximated, by a D&C SPN of any depth.

2 Definitions and Notation

2.1 Arithmetic circuits

Arithmetic circuits (e.g. Shpilka and Yehudayoff, 2010) are a type of circuit, similar to Boolean logic circuits, or neural networks. But instead of having gates/nodes which compute basic logical operations like AND, or sigmoidal non-linearities, they have nodes which perform one of the two fundamental operations of arithmetic: addition and multiplication. Their formal definition follows.

An arithmetic circuit $\Phi$ over a set/tuple [2] of real-valued variables $y = (y_1, y_2, \ldots, y_\ell)$ will be defined as a special type of directed acyclic graph with the following properties.

[2] By “set/tuple” we mean a tuple like $t = (a, b, c)$ which we will occasionally treat like a standard set, so that expressions such as $t \cap s$ are well defined and have the natural interpretation.

Each node of the graph with in-degree 0 is labeled by either a variable from $y$ or an element from $\mathbb{R}$. Every other node is labeled by either $\times$ or $+$, and these are known as product nodes or sum nodes respectively. All the incoming edges to a sum node are labeled with weights from $\mathbb{R}$. Nodes with no outgoing edges are referred to as output nodes. We will assume that arithmetic circuits only have one output node, which we will refer to as the root. Nodes with edges going into a node $u$ in $\Phi$ are referred to as $u$’s children. The set of such children is denoted by $C(u)$.

Given values of the elements of $y$, a node $u$ of an arithmetic circuit computes a real-valued output, which we denote by $q_u(y)$, according to the following rules. When $u$ is labeled with an element of $y$ or $\mathbb{R}$, the node simply computes its label. The latter type of nodes are referred to as constant nodes, since they compute constants that don’t depend on $y$. Product nodes compute the product of the outputs of their children, i.e. $q_u(y) = \prod_{k \in C(u)} q_k(y)$, while sum nodes compute a weighted sum of the outputs of their children, i.e. $q_u(y) = \sum_{k \in C(u)} w_{k,u} \, q_k(y)$, where $w_{k,u}$ denotes the weight labeling the edge from $k$ to $u$. Given these definitions, it is not hard to see that for each node $u$ of an arithmetic circuit, $q_u(y)$ is a multivariate polynomial function in the elements of $y$. The output of $\Phi$, denoted by $q_\Phi(y)$, is defined as the output of its singular root/output node (i.e. $q_r(y)$, where $r$ is the root/output node of $\Phi$).
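
The evaluation rules above can be sketched in a few lines of Python. This is only an illustration under assumptions of our own: the dict-based node encoding, the function name `evaluate`, and the example circuit are hypothetical and do not come from the paper.

```python
from math import prod

# Hypothetical encoding (an assumption, not the paper's): a circuit is a dict
# mapping a node id to one of
#   ("var", name) | ("const", value) | ("prod", [children])
#   | ("sum", [(weight, child), ...])
def evaluate(circuit, root, y):
    """Return the output of node `root` given variable values `y`."""
    cache = {}  # memoize so that shared nodes of the DAG are computed once

    def q(u):
        if u not in cache:
            kind, arg = circuit[u]
            if kind == "var":
                cache[u] = y[arg]
            elif kind == "const":
                cache[u] = arg
            elif kind == "prod":
                cache[u] = prod(q(k) for k in arg)
            else:  # "sum": weighted sum of the children's outputs
                cache[u] = sum(w * q(k) for w, k in arg)
        return cache[u]

    return q(root)

# Example circuit computing (2*y1 + 3*y2) * y1.
circuit = {
    "a": ("var", "y1"),
    "b": ("var", "y2"),
    "s": ("sum", [(2.0, "a"), (3.0, "b")]),
    "r": ("prod", ["s", "a"]),
}
value = evaluate(circuit, "r", {"y1": 2.0, "y2": 1.0})
```

Note that node `a` is reused by both `s` and `r`, which is why the sketch memoizes node outputs rather than recomputing them.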

For a node $u$ in $\Phi$, $\Phi_u$ denotes the subcircuit of $\Phi$ rooted at $u$. This subcircuit is formed by taking only the nodes in $\Phi$ that are on a path to $u$.

An arithmetic circuit is said to be monotone if all of its weights and constants are non-negative elements of $\mathbb{R}$.

The scope of a node $u$, denoted by $y_u$, is defined as the subset of the elements of $y$ which appear as labels in the sub-circuit rooted at $u$. These are the variables which $u$’s output essentially “depends on”.

The size of $\Phi$, denoted by $|\Phi|$, is defined as the number of nodes in $\Phi$, and its depth is defined as the length of the longest directed path in $\Phi$. An alternative notion of depth, called product depth (Raz and Yehudayoff, 2009), is defined as the largest number of product nodes which appear in a directed path in $\Phi$.

Note that in general, nodes in an arithmetic circuit can have out-degree greater than 1, thus allowing the quantities they compute to be used in multiple subsequent computations by other nodes. When this is not the case and the nodes of $\Phi$ each have out-degree at most 1, $\Phi$ is said to be an arithmetic formula, because it can be written out compactly as a formula.
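
As a rough illustration of the structural notions just defined (scope, depth, product depth, and the formula property), the following sketch computes each of them for a small circuit. The dict encoding and all function names are hypothetical choices made for this example.

```python
from collections import Counter

# Hypothetical encoding (an assumption): node id -> ("var", name) |
# ("const", value) | ("prod", [children]) | ("sum", [(weight, child), ...]).
def children(circuit, u):
    kind, arg = circuit[u]
    if kind == "prod":
        return list(arg)
    if kind == "sum":
        return [k for _, k in arg]
    return []  # variable and constant nodes have in-degree 0

def scope(circuit, u):
    """Variables appearing as labels in the sub-circuit rooted at u."""
    kind, arg = circuit[u]
    if kind == "var":
        return {arg}
    return set().union(*(scope(circuit, k) for k in children(circuit, u)))

def depth(circuit, u, product_only=False):
    """Longest path below u; counts only product nodes if product_only."""
    kids = children(circuit, u)
    if not kids:
        return 0
    counts = circuit[u][0] == "prod" or not product_only
    return int(counts) + max(depth(circuit, k, product_only) for k in kids)

def is_formula(circuit):
    """True if every node has out-degree at most 1."""
    out_degree = Counter(k for u in circuit for k in children(circuit, u))
    return all(d <= 1 for d in out_degree.values())

# (2*y1 + 3*y2) * y1, where the node for y1 is shared by two parents.
circuit = {
    "a": ("var", "y1"),
    "b": ("var", "y2"),
    "s": ("sum", [(2.0, "a"), (3.0, "b")]),
    "r": ("prod", ["s", "a"]),
}
```

Here the node `a` has out-degree 2, so this particular circuit is not a formula, although an equivalent formula could be obtained by duplicating `a`.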

2.2 Sum Product Networks (SPNs)

In this section we will give our generalized definition of Sum Product Networks (SPNs).

Let $x = (x_1, x_2, \ldots, x_n)$ be a set/tuple of variables where $x_i$ can take values in a range set $R_i \subseteq \mathbb{R}$ and $M_1, M_2, \ldots, M_n$ are measures over the respective $R_i$’s. For each $i$ let $f_i = (f_{i,1}(x_i), f_{i,2}(x_i), \ldots, f_{i,m_i}(x_i))$ denote a set/tuple of non-negative real-valued univariate functions of $x_i$, each with a finite $M_i$-integral over $R_i$. $f$ will denote the set/tuple whose elements are given by appending all of these $f_i$’s together, and $M$ will denote the product measure $M_1 \times M_2 \times \cdots \times M_n$.

A Sum Product Network (SPN) $\Phi$ is defined as a monotone arithmetic circuit over $f$. It inherits all of the properties of monotone arithmetic circuits, and gains some additional ones, which are discussed below.

Because an SPN is an arithmetic circuit over $f$, any one of its nodes $u$ computes a polynomial function $q_u(f)$ in $f$. But because the elements of $f$ are functions of the elements of $x$, a node $u$ of an SPN can also be viewed as computing a function of $x$, as given by $q_u(f(x))$, where $f(x)$ denotes the set/tuple obtained by replacing each element $f_{i,j}$ in $f$ with the value of $f_{i,j}(x_i)$.

The dependency-scope of a node $u$ is defined as the set of elements of $x$ on which members of $u$’s scope $f_u$ depend. The dependency-scope is denoted by $x_u$.

SPNs are primarily used to model distributions over $x$. They do this by defining a density function given by $p_\Phi(x) = \frac{1}{Z} q_\Phi(f(x))$, where $Z = \int q_\Phi(f(x)) \, dM(x)$ is a normalizing constant known as the partition function. Because $q_\Phi(f(x))$ is non-negative this density is well defined, provided that $Z$ is non-zero and finite.

A formula SPN is defined as an SPN which is also an arithmetic formula. In other words, a formula SPN is one whose nodes each have an out-degree of at most 1.

It is important to remember that the domains (the $R_i$’s) and the measures (the $M_i$’s) can be defined however we want, so that SPNs can represent both continuous and discrete distributions. For example, to represent a discrete distribution, we can choose $M_i$ to be the counting measure with support given by a finite subset, such as $\{0, 1\}$. In such a case, integration of some function $g(x_i)$ w.r.t. such an $M_i$ amounts to the summation $\sum_{x_i \in \{0,1\}} g(x_i)$.

2.3 Validity, decomposability, and completeness

Treated as density models, general SPNs suffer from many of the same intractability issues that plague other deep density models, such as Deep Boltzmann Machines (Salakhutdinov and Hinton, 2009). In particular, there is no efficient general algorithm for computing their associated partition function or marginal densities.

However, it turns out that for a special class of SPNs, called valid SPNs, computing the partition function and marginal densities can be accomplished by what is essentially a single evaluation of the network. Moreover, the validity of a given SPN can be established using certain easy-to-test structural conditions called decomposability and completeness, which we will discuss later.

Definition 1 (Valid SPN) An SPN $\Phi$ is said to be valid if the following condition always holds. Let $I = (i_1, i_2, \ldots, i_\ell)$ where each $i_j$ is a distinct element of $[n] = \{1, 2, \ldots, n\}$, and let $S_{i_1}, S_{i_2}, \ldots, S_{i_\ell}$ be subsets of the ranges of the respective $x_{i_j}$’s. For any fixed value of $x_{[n] \setminus I}$ we have

$$\int_{S_I} q_\Phi(f(x)) \, dM_I(x_I) = q_\Phi(A_I(S_I, x_{[n] \setminus I}))$$

where $x_I = (x_{i_1}, \ldots, x_{i_\ell})$ (with $x_{[n] \setminus I}$ defined analogously), $S_I = S_{i_1} \times \cdots \times S_{i_\ell}$ (with $M_I$ defined analogously), and where $A_I(S_I, x_{[n] \setminus I})$ denotes the set/tuple obtained by taking $f$ and for each $i \in I$ and each $j$ replacing $f_{i,j}$ with its integral $\int_{S_i} f_{i,j}(x_i) \, dM_i(x_i)$ over $S_i$, and also replacing $f_{i,j}$ for each $i \in [n] \setminus I$ with $f_{i,j}(x_i)$.

Decoding the notation, this definition says that for a valid SPN $\Phi$ we can compute the integral of the output function $q_\Phi(f(x))$ with respect to a subset of the input variables (given by the index set $I$) over corresponding subsets of their respective domains (the $S_i$’s), simply by computing the corresponding integrals over the respective univariate functions (the $f_{i,j}$’s) and evaluating the circuit by having nodes labeled by these $f_{i,j}$’s compute said integrals.

Note that for subsets of the range of $x_I$ that do not have the form of a Cartesian product $S_{i_1} \times \cdots \times S_{i_\ell}$, validity doesn’t say anything. In general, the integral over such a set will be intractable for valid SPNs.

Validity is a very useful property for an SPN $\Phi$ to have, as it allows us to efficiently integrate with respect to any subset of variables/elements of $x$ by performing what amounts to a single evaluation of $\Phi$. Among other uses (Gens and Domingos, 2013), this allows us to efficiently compute the partition function [3] which normalizes $\Phi$’s associated density function $p_\Phi$ by taking $I = [n]$ and $S_i = R_i$ for each $i \in [n]$. It also allows us to efficiently compute any marginal density function by taking $I \subset [n]$ and $S_i = R_i$ for each $i \in I$.

[3] As an aside, note that validity also acts as a proof of the finiteness of the partition function, provided each integral $\int_{R_i} f_{i,j}(x_i) \, dM_i(x_i)$ is finite.
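
To make the "single evaluation" idea concrete, the sketch below compares a brute-force computation of the partition function and of one marginal against the corresponding single evaluation in which leaves are replaced by their integrals. Everything here is an assumption made for illustration: the variables are 0/1-valued with counting measures, the leaf tables and weights are arbitrary numbers, and the encoding and names (`spn`, `evaluate`, `at_point`) are hypothetical. The example network happens to be decomposable and complete, which is why the two computations agree.

```python
from itertools import product as cartesian
from math import prod

# Hypothetical encoding: ("leaf", (i, table)) is a univariate function of
# x_i with table[v] its value at x_i = v, for v in {0, 1}.
spn = {
    "f11": ("leaf", (0, [0.2, 0.8])),
    "f12": ("leaf", (0, [0.9, 0.1])),
    "f21": ("leaf", (1, [0.5, 1.5])),
    "f22": ("leaf", (1, [1.0, 3.0])),
    "p1": ("prod", ["f11", "f21"]),
    "p2": ("prod", ["f12", "f22"]),
    "r": ("sum", [(0.3, "p1"), (2.0, "p2")]),
}

def evaluate(spn, u, leaf_value):
    """Evaluate node u, with leaf_value(i, table) giving each leaf's output."""
    kind, arg = spn[u]
    if kind == "leaf":
        return leaf_value(*arg)
    if kind == "prod":
        return prod(evaluate(spn, k, leaf_value) for k in arg)
    return sum(w * evaluate(spn, k, leaf_value) for w, k in arg)

def at_point(x):
    # Leaves compute their ordinary values at the point x.
    return lambda i, table: table[x[i]]

# Partition function: brute-force sum over all x versus one evaluation in
# which every leaf computes its sum over {0, 1}.
z_brute = sum(evaluate(spn, "r", at_point(x)) for x in cartesian([0, 1], repeat=2))
z_single = evaluate(spn, "r", lambda i, table: sum(table))

# Unnormalized marginal of x_1 = 1 (index 0 here): only x_2 is summed out.
m_brute = sum(evaluate(spn, "r", at_point((1, x2))) for x2 in (0, 1))
m_single = evaluate(spn, "r", lambda i, table: table[1] if i == 0 else sum(table))
```

The brute-force sums grow exponentially with the number of variables, whereas the single evaluations cost one pass over the network.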

While validity may seem like a magical property for an SPN to have, as shown by Poon and Domingos (2011) there is a pair of easy to enforce (and verify) structural properties which, when they appear together, imply validity. These are known as “decomposability” and “completeness”, and are defined as follows.

Definition 2 (Decomposable) An SPN $\Phi$ is decomposable if for every product node $u$ in $\Phi$ the dependency-scopes of its children are pairwise disjoint.

Definition 3 (Completeness) An SPN $\Phi$ is complete if for every sum node $u$ in $\Phi$ the dependency-scopes of its children are all the same.
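
Both conditions are purely structural and can be checked mechanically. The following sketch is one possible way to do so, assuming a hypothetical encoding in which each leaf records only the index of the variable it depends on; the function names are likewise our own.

```python
# Hypothetical encoding: ("leaf", i) depends on x_i; ("prod", [children]);
# ("sum", [(weight, child), ...]).
def dep_scope(spn, u):
    """Dependency-scope of node u as a frozenset of variable indices."""
    kind, arg = spn[u]
    if kind == "leaf":
        return frozenset([arg])
    kids = arg if kind == "prod" else [k for _, k in arg]
    return frozenset().union(*(dep_scope(spn, k) for k in kids))

def is_decomposable(spn):
    for u, (kind, arg) in spn.items():
        if kind == "prod":
            scopes = [dep_scope(spn, k) for k in arg]
            # Pairwise disjoint iff the sizes add up to the size of the union.
            if sum(len(s) for s in scopes) != len(frozenset().union(*scopes)):
                return False
    return True

def is_complete(spn):
    for u, (kind, arg) in spn.items():
        if kind == "sum" and len({dep_scope(spn, k) for _, k in arg}) > 1:
            return False
    return True

dc_spn = {
    "a": ("leaf", 0), "b": ("leaf", 1), "c": ("leaf", 0), "d": ("leaf", 1),
    "p1": ("prod", ["a", "b"]),
    "p2": ("prod", ["c", "d"]),
    "r": ("sum", [(1.0, "p1"), (1.0, "p2")]),
}
bad_spn = {
    "a": ("leaf", 0), "b": ("leaf", 0), "c": ("leaf", 1),
    "p": ("prod", ["a", "b"]),             # both children depend on x_0
    "r": ("sum", [(1.0, "p"), (1.0, "c")]),  # children have different scopes
}
```

For clarity the sketch recomputes dependency-scopes at every node; caching them would make the check linear in the size of the network.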

As was the case in the work of Poon and Domingos (2011), decomposability and completeness turn out to be sufficient conditions, but not necessary ones, for ensuring validity according to our more general set of definitions. Moreover, we will show that for a natural strengthening of the concept of validity, decomposability and completeness become necessary conditions as well.

The tractability of the partition function and marginal densities is a virtually unheard of property for deep probabilistic models, and is the primary reason that decomposable and complete SPNs are so appealing.

For the sake of brevity we will call an SPN which satisfies the decomposability and completeness conditions a D&C SPN.

A notion related to decomposability which was discussed in Poon and Domingos (2011) is that of “consistency”, which is defined only for SPNs whose univariate functions are either the identity function $g(z) = z$ or the negation function $g(z) = \bar{z} = 1 - z$, and whose input variables are all 0/1-valued. Such an SPN is said to be consistent if each product node satisfies the property that if one of its children has the identity function of $x_i$ in its scope, then none of the other children can have the negation function of $x_i$ in their scopes. This is a weaker condition than decomposability, and is also known to imply validity (Poon and Domingos, 2011).

Note that for $\{0,1\}$-valued variables we have $x_i^2 = x_i$ and $\bar{x}_i^2 = \bar{x}_i$, and so it is possible to construct an equivalent decomposable SPN from a consistent SPN by modifying the children of each product node so as to remove the “redundant” factors of $x_i$ (or $\bar{x}_i$). Note that such a construction may require the introduction of polynomially many additional nodes, as in the proof of Proposition 10. In light of this, and the fact that consistency only applies to a narrowly defined sub-class of SPNs, we can conclude that consistency is not a particularly interesting property to study by itself, and so we will not discuss it any further.

2.4 Top-down view of D&C SPNs

For a D&C SPN $\Phi$ it is known (and is straightforward to show) that if the weights on the incoming edges to each sum node sum to 1, and the univariate functions have integrals of 1 (i.e. so they are normalized density functions), then the normalizing constant of $\Phi$’s associated density is 1, and each node can be interpreted as computing a normalized density over the variables in its dependency-scope. We will call such a $\Phi$ “weight-normalized”.

A weight-normalized D&C SPN $\Phi$ can be interpreted as a top-down directed generative model where each sum node corresponds to a mixture distribution over the distributions associated with its children (with mixture weights given by the corresponding edge weights), and where each product node corresponds to a factorized distribution, with factors given by the distributions of its children (Gens and Domingos, 2013). Given this interpretation it is not hard to see that sampling from $\Phi$ can be accomplished in a top-down fashion starting at the root, just like in a standard directed acyclic graphical model.
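
The top-down sampling procedure described above might be sketched as follows, assuming discrete variables and a hypothetical encoding in which each leaf stores a normalized probability table. The names and the example network are illustrative only.

```python
import random

# Hypothetical encoding: ("leaf", (i, probs)) is a normalized distribution
# over x_i in {0, ..., len(probs) - 1}; ("prod", [children]);
# ("sum", [(weight, child), ...]) with weights summing to 1.
def sample(spn, u, rng, x=None):
    """Draw one joint sample by descending from node u."""
    x = {} if x is None else x
    kind, arg = spn[u]
    if kind == "leaf":
        i, probs = arg
        x[i] = rng.choices(range(len(probs)), weights=probs)[0]
    elif kind == "prod":
        # Factorized distribution: sample every child independently.
        for k in arg:
            sample(spn, k, rng, x)
    else:
        # Mixture: pick one child according to the edge weights.
        weights, kids = zip(*arg)
        sample(spn, rng.choices(kids, weights=weights)[0], rng, x)
    return x

spn = {
    "a": ("leaf", (0, [0.9, 0.1])),
    "b": ("leaf", (1, [0.0, 1.0])),
    "c": ("leaf", (0, [0.2, 0.8])),
    "d": ("leaf", (1, [0.0, 1.0])),
    "p1": ("prod", ["a", "b"]),
    "p2": ("prod", ["c", "d"]),
    "r": ("sum", [(0.4, "p1"), (0.6, "p2")]),
}
draw = sample(spn, "r", random.Random(0))
```

Decomposability ensures that each variable is assigned at most once along the sampled sub-tree, and completeness ensures that every variable in the root's dependency-scope receives a value.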

One interesting observation we can make is that it is always possible to transform a general D&C SPN into an equivalent weight-normalized one, as is formalized in the following proposition:

Proposition 4.

Given a D&C SPN $\Phi$ there exists a weight-normalized D&C SPN $\Phi'$ with the same structure as $\Phi$ and with the same associated distribution.
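
One standard way to obtain such a weight-normalized network is to compute each node's own normalizer bottom-up and rescale the sum weights and leaves accordingly. The sketch below follows that idea; it is not necessarily the construction used in the paper's proof, and it assumes a D&C SPN over discrete variables with non-zero normalizers, in a hypothetical dict encoding.

```python
from math import prod

# Hypothetical encoding: ("leaf", (i, table)) with table[v] = leaf value at
# x_i = v; ("prod", [children]); ("sum", [(weight, child), ...]).
def normalizer(spn, u):
    """Integral of node u's output; a single evaluation suffices for D&C SPNs."""
    kind, arg = spn[u]
    if kind == "leaf":
        return sum(arg[1])
    if kind == "prod":
        return prod(normalizer(spn, k) for k in arg)
    return sum(w * normalizer(spn, k) for w, k in arg)

def weight_normalize(spn):
    """Return a network with the same structure and distribution."""
    out = {}
    for u, (kind, arg) in spn.items():
        if kind == "leaf":
            i, table = arg
            total = sum(table)  # assumed non-zero
            out[u] = ("leaf", (i, [v / total for v in table]))
        elif kind == "prod":
            out[u] = ("prod", list(arg))
        else:
            z_u = normalizer(spn, u)  # assumed non-zero
            out[u] = ("sum", [(w * normalizer(spn, k) / z_u, k) for w, k in arg])
    return out

spn = {
    "f11": ("leaf", (0, [0.2, 0.8])),
    "f12": ("leaf", (0, [0.9, 0.1])),
    "f21": ("leaf", (1, [0.5, 1.5])),
    "f22": ("leaf", (1, [1.0, 3.0])),
    "p1": ("prod", ["f11", "f21"]),
    "p2": ("prod", ["f12", "f22"]),
    "r": ("sum", [(0.3, "p1"), (2.0, "p2")]),
}
normalized = weight_normalize(spn)
```

Each new sum weight is the original weight times the child's normalizer, divided by the parent's normalizer, so the rescaled weights into every sum node add up to 1.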

2.5 Relationship to previous definitions

Our definitions of SPNs and related concepts subsume those given by Poon and Domingos (2011) and later by Gens and Domingos (2013). Thus the various results we prove in this paper will still be valid according to those older definitions.

The purpose of this subsection is to justify the above claim, with a brief discussion which assumes pre-existing familiarity with the definitions given in the above cited works.

First, to see that our definition of SPNs generalizes that of Poon and Domingos (2011), observe that we can take the univariate functions to be of the form $x_i$ or $\bar{x}_i$, and that we can choose the domains of measures so that the $x_i$’s are discrete $\{0,1\}$-valued variables, and choose the associated measures so that integration over values of $x_i$ becomes equivalent to summation.

Second, to see that our definition generalizes that of Gens and Domingos (2013), observe that we can take the univariate functions to be univariate density functions. And while Gens and Domingos (2013) formally defined SPNs as always being decomposable and complete, we will keep the concepts of SPNs and D&C SPNs separate in our discussions, as in the original paper by Poon and Domingos (2011).

2.6 Polynomials and multilinearity

In this section we will define some additional basic concepts which will be useful in our analysis of SPNs in the coming sections.

Given a set/tuple of formal variables $y = (y_1, y_2, \ldots, y_\ell)$, a monomial is defined as a product of elements of $y$ (allowing repeats). For example, $y_1 y_2^3$ is a monomial. In an arithmetic expression, a “monomial term” refers to a monomial times a coefficient from $\mathbb{R}$. For example $4 y_1 y_2^3$ is a monomial term in $2 y_1 + 4 y_1 y_2^3$.

In general, polynomials over $y$ are defined as a finite sum of monomial terms. Given a monomial $m$, its associated coefficient in a polynomial $q$ will refer to the coefficient of the monomial term whose associated monomial is $m$ (we will assume that like terms have been collected, so this is unique). As a short-hand, we will say that a monomial $m$ is “in $q$” if the associated coefficient of $m$ in $q$ is non-zero.

The zero polynomial is defined as a polynomial which has no monomials in it. While the zero polynomial clearly computes the zero function, non-zero polynomials can sometimes also compute the zero function over the domain of $y$, and thus these are related but distinct concepts [4].

[4] For example, $y_1 (1 - y_1)$ computes the zero function when the domain of $y_1$ is $\{0, 1\}$ but is clearly not the zero polynomial.

A polynomial is called non-negative if the coefficients of each of its monomials are non-negative. Non-negativity is related to the monotonicity of arithmetic circuits in the following way:

Fact 5.

If $\Phi$ is a monotone arithmetic circuit over $y$ (such as an SPN with $y = f$), then $q_\Phi$ is a non-negative polynomial.

We will define the scope of a polynomial $q$ in $y$, denoted by $y_q$, to be the set of variables which appear as factors in at least one of its monomials. Note that for a node $u$ in an arithmetic circuit, the scope $y_u$ of $u$ can easily be shown to be a superset of the scope of its output polynomial $q_u$ (i.e. $y_{q_u} \subseteq y_u$), but it will not be equal to $y_{q_u}$ in general.

A central concept in our analysis of D&C SPNs will be that of multilinearity, which is closely related to the decomposability condition.

Definition 6 (Multilinear Polynomial) A polynomial $q$ in $y$ is multilinear if the degree of each element of $y$ is at most one in each monomial in $q$. For example, $y_1 + y_2 y_3$ is a multilinear polynomial.

Some more interesting examples of multilinear polynomials include the permanent and determinant of a matrix (where we view the entries of the matrix as the variables).

Definition 7 (Multilinear Arithmetic Circuit) If every node of an arithmetic circuit $\Phi$ over $y$ computes a multilinear polynomial in $y$, $\Phi$ is said to be a (semantically) multilinear arithmetic circuit. And if for every product node in $\Phi$, the scopes of its child nodes are pair-wise disjoint, $\Phi$ is said to be a syntactically multilinear arithmetic circuit.

It is easy to show that a syntactically multilinear arithmetic circuit is also a semantically multilinear circuit. However, it is an open question as to whether one can convert a semantically multilinear arithmetic circuit into a syntactically multilinear one without increasing its size by a super-polynomial factor (Raz et al., 2008). In the case of formulas however, given a semantically multilinear formula of size $s$ one can transform it into an equivalent syntactically multilinear formula of size at most $s$ (Raz, 2004).

It should be obvious by this point that there is an important connection between syntactic multilinearity and decomposability. In particular, if our univariate functions of the $x_i$’s are all identity functions, then scope and dependency-scope become equivalent, and thus so do syntactic multilinearity and decomposability.

Given this observation we have that a monotone syntactically multilinear arithmetic circuit over $x$ can be viewed as a decomposable SPN.

A somewhat less obvious fact (which will be very useful later) is that any decomposable SPN over 0/1-valued $x_i$’s can be viewed as a syntactically multilinear arithmetic circuit over $x$, of a similar size and depth. To see this, note that any arbitrary univariate function $g$ of a 0/1-valued variable $z$ can always be written as an affine function of $z$, i.e. of the form $az + b$ with $a = g(1) - g(0)$ and $b = g(0)$. Thus we can replace each node computing a univariate function of some $x_i$ with a subcircuit computing this affine function, and this yields a (non-monotone) syntactically multilinear arithmetic circuit over $x$, with a single additional layer of sum nodes (of size $|f|$).
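
The affine rewriting of a univariate function of a 0/1-valued variable is easy to verify numerically. In the sketch below the helper name `affine_coefficients` and the choice of `math.exp` as the example function are our own.

```python
import math

def affine_coefficients(g):
    """Return (a, b) such that g(z) == a*z + b for z in {0, 1}."""
    return g(1) - g(0), g(0)

# Any function restricted to {0, 1} works; exp is an arbitrary example.
a, b = affine_coefficients(math.exp)
values = [a * z + b for z in (0, 1)]
```

Note that the slope can be negative when $g(1) < g(0)$, which is why the resulting circuit is not monotone in general.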

An extension of the concept of multilinearity is that of set-multilinearity (e.g. Shpilka and Yehudayoff, 2010). To define set-multilinearity we must first define some additional notation which we will carry through the rest of the paper.

Let $G_1, G_2, \ldots, G_k$ be a partitioning of the elements of $y$ into disjoint sets. The set-scope $G_q$ of a polynomial $q$ is the sub-collection of the collection $\{G_1, G_2, \ldots, G_k\}$ defined as consisting of those sets $G_i$ which have some element in common with the scope $y_q$ of $q$, i.e. $G_q = \{G_i : y_q \cap G_i \neq \emptyset\}$. Similarly, the set-scope $G_u$ of a node $u$ in an arithmetic circuit is the sub-collection of the collection $\{G_1, G_2, \ldots, G_k\}$ defined as consisting of those sets $G_i$ which have some element in common with the scope $y_u$ of $u$, i.e. $G_u = \{G_i : y_u \cap G_i \neq \emptyset\}$.

Definition 8 (Set-multilinear polynomial) A polynomial is set-multilinear if each of its monomials has exactly one factor from each of the $G_i$’s in its set-scope.

For example, $3 y_1 y_3 + 2 y_2 y_4$ is a set-multilinear polynomial when $G_1 = \{y_1, y_2\}$, $G_2 = \{y_3, y_4\}$, while $y_1 y_2 + 2 y_2 y_4$ is not. The permanent and determinant of a matrix also turn out to be non-trivial examples of set-multilinear polynomials, if we define the collection of sets so that $G_i$ consists of the entries in the $i$th row of the matrix.

Definition 9 (Set-multilinear arithmetic circuits) An arithmetic circuit $\Phi$ is called (semantically) set-multilinear if each of its nodes computes a set-multilinear polynomial. An arithmetic circuit $\Phi$ is called syntactically set-multilinear if it satisfies the following two properties:

  • for each product node $u$ in $\Phi$, the set-scopes of the children of $u$ are pairwise disjoint

  • for each sum node $u$ in $\Phi$, the set-scopes of the children of $u$ are all the same.

A crucial observation is that the concepts of set-multilinearity in arithmetic circuits and decomposability and completeness in SPNs are even more closely related than syntactic multilinearity is to decomposability. In particular, it is not hard to see that if we take $k = n$, $y = f$ and $G_i = f_i$, then set-scope (of nodes) and dependency-scope become analogous concepts, and D&C SPNs correspond precisely to monotone syntactically set-multilinear arithmetic circuits in $f$. Because of the usefulness of this connection, we will use the above identifications for the remainder of this paper whenever we discuss set-multilinearity in the specific context of SPNs.

This connection also motivates a natural definition for the dependency-scope for polynomials over $f$. In particular, the dependency-scope of the polynomial $q$ over $f$ will be defined as the set of variables on which the members of $q$’s scope $f_q$ depend. We will denote the dependency-scope by $x_q$.

3 Analysis of Validity, Decomposability and Completeness

In this section we give a novel analysis of the relationship between validity, decomposability and completeness, making use of many of the concepts from circuit theory reviewed in the previous section.

First, we will give a quick result which shows that an incomplete SPN can always be efficiently transformed into a complete one which computes the same function of $x$. Note that this is not a paradoxical result, as the new SPN will behave differently than the original one when used to evaluate integrals (in the sense of the definition of valid SPNs in Section 2).

Proposition 10.

Given an SPN $\Phi$ of size $s$ there exists a complete SPN $\Phi'$ of size $s + n + k \in O(s^2)$, and an expanded set/tuple of univariate functions $f'$ s.t. $q_{\Phi'}(f'(x)) = q_\Phi(f(x))$ for all values of $x$, where $k$ is the sum over the fan-in’s of the sum nodes of $\Phi$. Moreover, $\Phi'$ is decomposable if $\Phi$ is.

So in some sense we can always get completeness “for free”, and of the two properties, decomposability will be the one which actually constrains SPNs in a way that affects their expressive efficiency.

Unlike with decomposability and completeness, validity depends on the particular definitions of univariate functions making up $f$, and thus cannot be described in purely structural terms like set-multilinearity. This leads us to propose a slightly stronger condition which we call strong validity, which is independent of the particular choice of univariate functions making up $f$.

Definition 11 An SPN $\Phi$ is said to be strongly valid if it is valid for every possible choice of the univariate functions making up $f$. Note: only the values computed by each $f_{i,j}$ are allowed to vary here, not the identities of the dependent variables.

The following theorem establishes the fundamental connection between set-multilinearity and strong validity.

Theorem 12.

Suppose the elements of $x$ are all non-trivial variables (as defined below). Then an SPN $\Phi$ is strongly valid if and only if its output polynomial $q_\Phi$ is set-multilinear.

A variable $x_i$ is non-trivial if there are at least two disjoint subsets of $x_i$’s range $R_i$ which have finite and non-zero measure under $M_i$.

Non-triviality is a very mild condition. In the discrete case, it is equivalent to requiring that there is more than one element of the range set which has mass under the associated measure. Trivial variables can essentially be thought of as “constants in disguise”, and we can easily just replace them with constant nodes without affecting the input-output behavior of the circuit.

It is worth noting that the non-triviality hypothesis is a necessary one for the forward direction of Theorem 12 (although not the reverse direction). To see this, consider for example the SPN which computes in the obvious way, where and , and the are the standard counting measures. While ’s output polynomial is not set-multilinear by inspection, it is relatively easy to show that is indeed strongly valid, as it is basically equivalent to for a constant .

While Theorem 12 is interesting by itself as it provides a complete characterization of strong validity in terms of purely algebraic properties of an SPN’s output polynomial, its main application in this paper will be to help prove the equivalence of strong validity with the decomposability and completeness conditions.

Note that such an equivalence does not hold for standard validity, as was first demonstrated by Poon and Domingos (2011). To see this, consider the SPN which computes the expression in the obvious way, where the are 0/1-valued, the are the standard counting measures, and , , and . Clearly this SPN is neither decomposable nor complete, and yet an exhaustive case analysis shows that it is valid for these particular choices of the ’s.

Before we can prove the equivalence of strong validity with the decomposability and completeness conditions, we need to introduce another mild hypothesis which we call “non-degeneracy”.

Definition 13 A monotone arithmetic circuit (such as an SPN) is called non-degenerate if all of its weights and constants are non-zero (i.e. strictly positive).

Like non-triviality, non-degeneracy is a very mild condition to impose, since weights which are zero don’t actually “affect” the output. Moreover, there is a simple and size-preserving procedure which can transform a degenerate monotone arithmetic circuit to a non-degenerate one which computes precisely the same output polynomial, and also preserves structural properties like decomposability and completeness in SPNs. The procedure is as follows. First we remove all edges with weight 0. Then we repeatedly remove any nodes with fan-out 0 (except for the original output node) or fan-in 0 (except input nodes and constant nodes), making sure to remove any product node which is a parent of a node we remove. It is not hard to see that deletion of a node by this procedure is a proof that it computes the zero-polynomial and thus doesn’t affect the final output.
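The pruning procedure just described can be sketched in Python. This is only an illustration under assumed conventions: the dictionary-based circuit encoding, the function name `prune_degenerate`, and the node names in the example are hypothetical and do not come from the paper, and the sketch assumes that the root itself does not compute the zero polynomial.

```python
def prune_degenerate(nodes, root):
    """Return a pruned copy of a monotone arithmetic circuit.

    nodes maps a node name to {"type": "sum" | "prod" | "input" | "const",
    "edges": [(child_name, weight), ...]} (leaves may omit "edges").
    Assumes the root itself does not compute the zero polynomial.
    """
    # Step 1: copy the circuit, dropping every zero-weight edge.
    g = {name: {"type": d["type"],
                "edges": [(c, w) for c, w in d.get("edges", []) if w != 0]}
         for name, d in nodes.items()}

    def remove(name):
        # Delete a node; a product parent of a deleted node is deleted too,
        # while a sum parent simply loses the edge.
        if name not in g or name == root:
            return
        del g[name]
        for parent in list(g):
            if parent not in g:
                continue
            d = g[parent]
            if any(c == name for c, _ in d["edges"]):
                if d["type"] == "prod" and parent != root:
                    remove(parent)
                else:
                    d["edges"] = [(c, w) for c, w in d["edges"] if c != name]

    # Step 2: repeat until no node has fan-out 0 (other than the root)
    # and no sum/product node has fan-in 0.
    changed = True
    while changed:
        changed = False
        has_parent = {c for d in g.values() for c, _ in d["edges"]}
        for name in list(g):
            if name not in g or name == root:
                continue
            d = g[name]
            dead_internal = d["type"] in ("sum", "prod") and not d["edges"]
            if name not in has_parent or dead_internal:
                remove(name)
                changed = True
    return g


# A sum node that gives weight 0 to a product node multiplying two leaves.
circuit = {
    "root": {"type": "sum", "edges": [("bad", 0.0), ("good", 1.0)]},
    "bad": {"type": "prod", "edges": [("x1", 1.0), ("x1_copy", 1.0)]},
    "good": {"type": "prod", "edges": [("x1", 1.0), ("x2", 1.0)]},
    "x1": {"type": "input"},
    "x1_copy": {"type": "input"},
    "x2": {"type": "input"},
}
pruned = prune_degenerate(circuit, "root")
print(sorted(pruned))
```

In the example, the zero-weighted product node and the leaf that only it used are removed, while the rest of the circuit is left unchanged.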

Without non-degeneracy, the equivalence between strong validity and the decomposability and completeness conditions does not hold for SPNs, as can be seen by considering the SPN which computes the expression in the obvious way, where the are 0/1-valued and the are the standard counting measures, and the ’s are the identity function (i.e. ). Because the output polynomial of is equivalent to , it is indeed valid. However, the product node within which computes violates the decomposability condition, even though this computation is never actually “used” in the final output (due to how it is weighted by 0).

Non-degeneracy allows us to prove many convenient properties, which are given in the lemma below.

Lemma 14.

Suppose is a non-degenerate monotone arithmetic circuit. Denote by the root of , and its child nodes.

We have the following facts:

  1. Each of the ’s is a non-degenerate monotone arithmetic circuit.

  2. If is a product node, the set of monomials in is equal to the set consisting of every possible product formed by taking one monomial from each of the ’s. NOTE: This is true even for degenerate circuits.

  3. If is a sum node, the set of monomials in is equal to the union over the sets of monomials in the ’s.

  4. is not the zero polynomial.

  5. The set-scope of is equal to the set-scope of .

We are now in a position to prove the following theorem.

Theorem 15.

A non-degenerate monotone arithmetic circuit has a set-multilinear output polynomial if and only if it is syntactically set-multilinear.

Given this theorem, and utilizing the previously discussed connection between syntactic set-multilinearity and the decomposability and completeness conditions, the following corollary is immediate:

Corollary 16.

A non-degenerate SPN has a set-multilinear output polynomial if and only if it is decomposable and complete.

And from this and Theorem 12, we have a 3-way equivalence between strong validity, the decomposability and completeness conditions, and the set-multilinearity of the output polynomial. This is stated as the following theorem.

Theorem 17.

Suppose is a non-degenerate SPN whose input variables (the elements of ) are all non-trivial. Then the following 3 conditions are equivalent:

  1. is strongly valid

  2. is decomposable and complete

  3. ’s output polynomial is set-multilinear

Because SPNs can always be efficiently transformed so that the non-degeneracy and non-triviality hypotheses are both satisfied (as discussed above), this equivalence between strong validity and the D&C conditions makes the former easy to verify (since the D&C conditions themselves are).
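To illustrate why the D&C conditions are easy to verify, here is a minimal Python sketch of a structural check. It uses the standard formulation of the two conditions (children of a sum node share the same dependency-scope; children of a product node have pairwise disjoint dependency-scopes). The tuple-based encoding and the name `check_dc` are assumptions made for this example, and constant nodes are omitted for brevity.

```python
def check_dc(nodes, root):
    """Return (decomposable, complete) for an SPN given as a DAG.

    nodes maps a name to ("leaf", variable), ("sum", [children]) or
    ("prod", [children]). Each node is visited once thanks to the cache.
    """
    scope = {}
    flags = {"decomposable": True, "complete": True}

    def visit(name):
        if name in scope:
            return scope[name]
        kind, arg = nodes[name]
        if kind == "leaf":
            s = frozenset([arg])
        else:
            child_scopes = [visit(c) for c in arg]
            s = frozenset().union(*child_scopes)
            # Completeness: all children of a sum node have equal scopes.
            if kind == "sum" and any(cs != child_scopes[0] for cs in child_scopes):
                flags["complete"] = False
            # Decomposability: scopes of a product node's children are
            # pairwise disjoint, i.e. their sizes add up to the union's size.
            if kind == "prod" and sum(len(cs) for cs in child_scopes) != len(s):
                flags["decomposable"] = False
        scope[name] = s
        return s

    visit(root)
    return flags["decomposable"], flags["complete"]


mixture = {
    "root": ("sum", ["p1", "p2"]),
    "p1": ("prod", ["a1", "b1"]),
    "p2": ("prod", ["a2", "b2"]),
    "a1": ("leaf", "x1"), "a2": ("leaf", "x1"),
    "b1": ("leaf", "x2"), "b2": ("leaf", "x2"),
}
square = {"root": ("prod", ["a", "a"]), "a": ("leaf", "x1")}
incomplete = {"root": ("sum", ["a", "b"]), "a": ("leaf", "x1"), "b": ("leaf", "x2")}

print(check_dc(mixture, "root"), check_dc(square, "root"), check_dc(incomplete, "root"))
```

The `square` example multiplies a node by itself, which is the simplest kind of computation ruled out by decomposability.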

However, as we will see in later sections, decomposability and completeness are restrictive conditions that limit the expressive power of SPNs in a fundamental way. And so a worthwhile question to ask is whether a set of efficiently testable criteria exist for verifying standard/weak validity.

We will shed some light on this question by proving a result which shows that a criterion cannot be both efficiently testable and capture all valid SPNs, provided that P ≠ NP. A caveat to this result is that we can only prove it for a slightly extended definition of SPNs where negative weights and constants are permitted.

Theorem 18.

Define an extended SPN as one which is allowed to have negative weights and constants. The problem of deciding whether a given extended SPN is valid is co-NP-hard.

We leave it as an open question as to whether a similar co-NP-hardness property holds for validity checking of standard SPNs.

4 Focusing on D&C SPNs

One of the main goals of this paper is to advance the understanding of the expressive efficiency of SPNs. In this section we explore possible directions we can take towards this goal, and ultimately propose to focus exclusively on D&C SPNs.

It is well known that standard arithmetic circuits can efficiently simulate Boolean logic circuits with only a constant factor overhead. Thus they are as efficient at computing a given function as any standard model of computation, up to a polynomial factor. However, we cannot easily exploit this fact to study SPNs, as this simulation requires negative weights, and the weights of an SPN are constrained to be non-negative (i.e. they are monotone arithmetic circuits). And while SPNs have access to non-negative valued univariate functions of the input which standard monotone arithmetic circuits do not, this fact cannot obviously be used to construct a simulation of Boolean logic circuits.

Another possible way to gain insight into general SPNs would be to apply existing results for monotone arithmetic circuits. However, a direct application of such results is impossible, as SPNs are monotone arithmetic circuits over and not , and indeed their univariate functions can compute various non-negative functions of (such as for values of in ) which a monotone circuit could not.

But while it seems that the existing circuit theory literature doesn’t offer much insight into general SPNs, there are many interesting results available for multilinear and set-multilinear arithmetic circuits. And as we saw in Section 3, these are closely related to D&C SPNs.

Moreover, it makes sense to study D&C SPNs, as they are arguably the most interesting class of SPNs, both from a theoretical and practical perspective. Indeed, the main reason why SPNs are interesting and useful in the first place is that valid SPNs avoid the intractability problems that plague conventional deep models like Deep Boltzmann Machines. Meanwhile the D&C conditions are the only efficiently testable conditions for ensuring validity that we are aware of, and as we showed in Section 3, they are also necessary conditions for a slightly strengthened notion of validity.

Thus, D&C SPNs will be our focus for the rest of the paper.

5 Capabilities of D&C SPNs

Intuitively, D&C SPNs seem very limited compared to general arithmetic circuits. In addition to being restricted to use non-negative weights and constants like general SPNs, decomposability heavily restricts the kinds of structure the networks can have, and hence the kinds of computations they can perform. For example, something as simple as squaring the number computed by some node becomes impossible.

In order to address the theoretical question of what kinds of functions D&C SPNs can compute efficiently, despite their apparent limitations, we will construct explicit D&C SPNs that efficiently compute various example functions.

This is difficult to do directly because the decomposability condition prevents us from using the basic computational operations we are accustomed to working with when designing algorithms or writing down formulae. To overcome this difficulty we will provide a couple of related examples of computational systems which we will show can be efficiently simulated by SPNs. These systems will be closer to more traditional models of computation like state-space machines, so that our existing intuitions about algorithm design will be more directly applicable to them.

The first such system we will call a Fixed-Permutation Linear Model (FPLM), which works as follows. We start by initializing a “working vector” with a value , and then we process the input (the ’s) in sequence, according to a fixed order given by a permutation of . At each stage we multiply by a matrix which is determined by the value of the current . After seeing the whole input, we then take the inner product of with another vector , which gives us our real-valued output.

More formally, we can define FPLMs as follows.

Definition 19 A Fixed-Permutation Linear Model (FPLM) will be defined by a fixed permutation of , a ‘dimension’ constant (which in some sense measures the size of the FPLM), vectors and for each , a matrix-valued function from to . The output of a FPLM is defined as .
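As a concrete reading of this definition, the following Python sketch evaluates an FPLM. All names here (`fplm_output`, `perm`, `a`, `b`, `T`) are notation assumed for the example; the matrices are plain lists of rows with non-negative entries, and the permutation is given as 0-based indices.

```python
def fplm_output(x, perm, a, b, T):
    """Evaluate an FPLM on input x.

    perm: processing order as 0-based indices into x.
    a, b: non-negative vectors of equal length (the dimension).
    T[i](x_i): returns a square non-negative matrix as a list of rows.
    """
    v = list(a)                      # initialise the working vector
    for i in perm:                   # process inputs in the fixed order
        M = T[i](x[i])
        v = [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]
    return sum(bi * vi for bi, vi in zip(b, v))   # inner product with b


# Example: count the ones in a binary input using a dimension of only 2.
# The working vector is (1, count so far); each step adds x_i to the count.
n = 4
step = lambda xi: [[1, 0], [xi, 1]]
count = fplm_output([1, 0, 1, 1], perm=range(n), a=[1, 0], b=[0, 1], T=[step] * n)
print(count)
```

The example uses a two-dimensional working vector because a linear update can carry a running count in a single coordinate.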

An FPLM can be viewed as a computational system which must process its input in a fixed order and maintains its memory/state as a vector whose dimension is the FPLM’s ‘dimension’ constant. Crucially, an FPLM cannot revisit inputs that it has already processed, which is a similar limitation to the one faced by read-once Turing Machines. The state vector can be transformed at each stage by a linear transformation which is a function of the current input. While this state vector allows an FPLM to use powerful distributed representations which clearly possess enough information capacity to memorize the input seen so far, the fundamental limitation of FPLMs lies in their limited tools for manipulating this representation. In particular, they can only use linear transformations (given by matrices with positive entries). If they had access to arbitrary transformations of their state then it is not hard to see that any function could be efficiently computed by them.

The following result establishes that D&C SPNs can efficiently simulate FPLMs.

Proposition 20.

Given a FPLM of dimension there exists a D&C SPN of size which computes the same function.
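One way to see why such a simulation is plausible is to unroll the FPLM computation stage by stage. The recursion below is a sketch in assumed notation ($\pi$ for the permutation, $k$ for the dimension, $a$ and $b$ for the two vectors, $T_i$ for the matrix-valued functions); it is not necessarily the construction used in the paper's proof.

```latex
% Assumed notation: permutation \pi of \{1,\dots,n\}, dimension k,
% non-negative vectors a, b of length k, and non-negative k-by-k
% matrices T_i(x_i) whose entries are univariate functions of x_i.
\begin{align*}
v^{(0)} &= a, \\
v^{(t)}_j &= \sum_{l=1}^{k} \bigl[T_{\pi(t)}(x_{\pi(t)})\bigr]_{jl}\; v^{(t-1)}_l
  \qquad (t = 1,\dots,n,\; j = 1,\dots,k), \\
\text{output} &= \sum_{j=1}^{k} b_j\, v^{(n)}_j .
\end{align*}
```

Each $v^{(t)}_j$ can be read as a sum node over $k$ product nodes. Every product multiplies a univariate function of $x_{\pi(t)}$ by a node depending only on $x_{\pi(1)},\dots,x_{\pi(t-1)}$, so the product is decomposable, and all products feeding a given sum share the same dependency-scope, so the sum is complete. In this sketch each stage contributes on the order of $k^2$ nodes, for roughly $n k^2$ nodes overall.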

Thus D&C SPNs are at least as expressively efficient as FPLMs. This suggests the following question: are they strictly more expressively efficient than FPLMs, or are they equivalent? It turns out that they are more expressively efficient. We sketch a proof of this fact below.

Suppose that takes values in . As observed in Section 2.6, this allows us to assume without loss of generality that any univariate function of one of the ’s is affine in . In particular, we can assume that the matrix-valued functions used in FPLMs are affine functions of the respective ’s. In this setting it turns out that FPLMs can be viewed as a special case of a computational system called “ordered syntactically multilinear branching programs”, as they are defined by Jansen (2008). Jansen (2008) showed that there exists a polynomial function in whose computation by such a system requires exponential size (corresponding to an FPLM with an exponentially large dimension ). Moreover, this function is computable by a polynomially sized monotone syntactically multilinear arithmetic circuit. As observed in Section 2.6, such a circuit can be viewed as a decomposable SPN whose univariate functions are just identity functions. Then using Proposition 10 we can convert such a decomposable SPN to a D&C SPN while only squaring its size. So the polynomial provided by Jansen (2008) is indeed computed by a D&C SPN of polynomial size, while requiring exponential size to be computed by a FPLM, thus proving that D&C SPNs are indeed more expressively efficient.

Given this result, we see that FPLMs do not fully characterize the capabilities of D&C SPNs. Nevertheless, if we can construct an FPLM which computes some function efficiently, this constitutes proof of existence of a similarly efficient D&C SPN for computing said function.

To make the construction of such FPLMs simpler, we will define a third computational system which we call a Fixed-Permutation State-Space Model (FPSSM) which is even easier to understand than FPLMs, and then show that FPLMs (and hence also D&C SPNs) can efficiently simulate FPSSMs.

An FPSSM works as follows. We initialize our “working state” as , and then we process the input (the ’s) in sequence, according to a fixed order given by the permutation of . At each stage we transform by computing , where the transition function can be defined arbitrarily. After seeing the whole input, we then decode the state as the non-negative real number .

More formally we have the following definition.

Definition 21 A Fixed-Permutation State-Space Model (FPSSM) will be defined by a fixed permutation of , a ‘state-size’ constant (which in some sense measures the size of the FPSSM), an initial state , a decoding function from to , and for each an arbitrary function which maps values of and elements of to elements of . The output of an FPSSM will be defined as for an arbitrary function mapping elements of to .

FPSSMs can be seen as general state-space machines (of state size ), which like FPLMs, are subject to the restriction that they must process their inputs in a fixed order which is determined ahead of time, and are not allowed to revisit past inputs. If the state-space is large enough to be able to memorize every input seen so far, it is clear that FPSSMs can compute any function, given that their state-transition function can be arbitrary. But this would require their state-size constant to grow exponentially in , as one needs a state size of in order to memorize input bits. FPSSMs of realistic sizes can only memorize a number of bits which is logarithmic in . And this, combined with their inability to revisit past inputs, clearly limits their ability to compute certain functions efficiently. This is to be contrasted with FPLMs, whose combinatorial/distributed state has a high information capacity even for small FPLMs, but which are limited instead in how they can manipulate this state.

The following result establishes that FPLMs can efficiently simulate FPSSMs.

Proposition 22.

Given a FPSSM of state-size there exists a FPLM of dimension which computes the same function.
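One natural way to obtain such a simulation is to encode each state as a one-hot vector and each transition as a 0/1 matrix. The Python sketch below illustrates this idea; it is offered as an assumption about how the simulation can work rather than as the paper's own proof, and every name in it (`fpssm_to_fplm`, `run_fpssm`, `run_fplm`, `c0`, `h`, `g`) is hypothetical.

```python
from itertools import product


def run_fpssm(x, perm, c0, h, g):
    """States are integers 0..k-1; h[i](x_i, c) gives the next state."""
    c = c0
    for i in perm:
        c = h[i](x[i], c)
    return g(c)


def fpssm_to_fplm(k, c0, h, g):
    """Build (a, b, T) so that the FPLM mirrors the FPSSM on one-hot states."""
    a = [1.0 if c == c0 else 0.0 for c in range(k)]   # one-hot initial state
    b = [float(g(c)) for c in range(k)]               # decoder values per state

    def make_T(h_i):
        # Column c holds a single 1, in the row of the successor state of c.
        return lambda xi: [[1.0 if h_i(xi, c) == r else 0.0 for c in range(k)]
                           for r in range(k)]

    return a, b, [make_T(h_i) for h_i in h]


def run_fplm(x, perm, a, b, T):
    v = list(a)
    for i in perm:
        M = T[i](x[i])
        v = [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]
    return sum(bi * vi for bi, vi in zip(b, v))


# An arbitrary 3-state machine whose transitions differ from stage to stage.
n, k, c0 = 4, 3, 0
h = [lambda xi, c, i=i: (2 * c + xi + i) % k for i in range(n)]
g = lambda c: c + 0.5
perm = [2, 0, 3, 1]
a, b, T = fpssm_to_fplm(k, c0, h, g)
all_match = all(run_fpssm(x, perm, c0, h, g) == run_fplm(x, perm, a, b, T)
                for x in product([0, 1], repeat=n))
print(all_match)
```

In this sketch the dimension of the resulting FPLM equals the state size of the FPSSM, and all matrix entries are non-negative, as the FPLM definition requires.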

Note that this result also implies that FPSSMs are no more expressively efficient than FPLMs, and are thus strictly less expressively efficient than D&C SPNs.

The following Corollary follows directly from Propositions 20 and 22:

Corollary 23.

Given a FPSSM of state-size there exists a D&C SPN of size which computes the same function.

Unlike with D&C SPNs, our intuitions about algorithm design readily apply to FPSSMs, making it easy to directly construct FPSSMs which implement algorithmic solutions to particular problems. For example, suppose we wish to compute the number of inputs whose value is equal to 1. We can solve this with an FPSSM with a state size of by taking to be the identity permutation, and the state to be the number of 1’s seen so far, which we increment whenever the current has value 1. We can similarly compute the parity of the number of ones (which is a well known and theoretically important function often referred to simply as “PARITY”) by storing the current number of them modulo 2, which only requires the state size to be . We can also decide if the majority of the ’s are 1 (which is a well known and theoretically important function often referred to as “MAJORITY” or “MAJ”) by storing a count of the number of ones (which requires a state size of ), and then outputting if and otherwise.
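The three examples above can be written out directly. The following Python sketch uses a hypothetical `run_fpssm` helper and assumes that MAJORITY outputs 1 exactly when strictly more than half of the inputs are 1; that threshold convention is an assumption made for the example.

```python
def run_fpssm(x, perm, c0, h, g):
    """Process x in the order perm, updating the state with h, then decode with g."""
    c = c0
    for i in perm:
        c = h(x[i], c)
    return g(c)


x = [1, 0, 1, 1, 0]
n = len(x)
identity_order = range(n)

# Count of ones: the state is the number of ones seen so far (n + 1 states).
count = run_fpssm(x, identity_order, 0, lambda xi, c: c + xi, lambda c: c)

# PARITY: the state is the count modulo 2 (2 states).
parity = run_fpssm(x, identity_order, 0, lambda xi, c: (c + xi) % 2, lambda c: c)

# MAJORITY: keep the full count, then threshold it in the decoder.
majority = run_fpssm(x, identity_order, 0, lambda xi, c: c + xi,
                     lambda c: 1 if c > n / 2 else 0)

print(count, parity, majority)
```

Only the transition function and the decoder change between the three machines; the processing loop is the same in each case.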

It is noteworthy that the simulations of various models given in this section each require a D&C SPN of depth . However, as we will see in Section 7.2, the depth of any D&C SPN can be reduced to , while only increasing its size polynomially.

6 Separating Depth 3 From Higher Depths

The only prior work on the expressive efficiency of SPNs which we are aware of is that of Delalleau and Bengio (2011). In that work, the authors give a pair of results which demonstrate a difference in expressive efficiency between D&C SPNs of depth 3, and those of higher depths.

Their first result establishes the existence of an -dimensional function (for each ) which can be computed by a D&C SPN of size and depth , but which requires size to be computed by a D&C SPN of depth 3. (Note that in our presentation the input layer counts as the first layer and contributes to the total depth.)

In their second result they show that for each there is an -dimensional function which can be computed by a D&C SPN of size and depth , but which requires size to be computed by a D&C SPN of depth 3.

It is important to note that these results do not establish a separation in expressive efficiency between D&C SPNs of any two depths both larger than 3 (e.g. between depths 4 and 5). So in particular, despite how the size lower bound increases with in their second result, this does not imply that the set of efficiently computable functions is larger for D&C SPNs of depth than for those of depth , for any , except when . (As shown by our Theorem 29, there is a much stronger separation between depths 3 and 4 than is proved by Delalleau and Bengio (2011) to exist between depths and for any , and thus this apparent increase in the size of their lower bound isn’t due to the increasing power of D&C SPNs with depth so much as it is an artifact of their particular proof techniques.) This is to be contrasted with our much stronger “depth hierarchy” result (Theorem 29 of Section 7.1) which shows that D&C SPNs do in fact have this property (even with ) for all choices of , where depth is measured in terms of product-depth.

In the next subsection we will show how basic circuit theoretic techniques can be used to give a short proof of a result which is stronger than both of the separation results of Delalleau and Bengio (2011), using example functions which are natural and simple to understand. Beyond providing a simplified proof of existing results, this will also serve as a demonstration of some of the techniques underlying the more advanced results from circuit theory which we will later make use of in Sections 7.1 and 8.

Moreover, by employing these more general and powerful proof techniques, we are able to prove a stronger result which separates functions that can be efficiently approximated by D&C SPNs of depth 3 from those which can be computed by D&C SPNs of depth 4 and higher. This addresses the open question posed by Delalleau and Bengio (2011).

6.1 Basic separation results

We begin by defining some basic concepts and notation which are standard in circuit theory.

For an arbitrary function of , and a partition of the set of indices of the elements of , define to be the by matrix of values that takes for different values of , where the rows of are indexed by possible values of , and the columns of are indexed by possible values of .

This matrix is called a “communication matrix” in the context of communication complexity, and appears frequently as a tool to prove lower bounds. Its usefulness in lower bounding the size of D&C SPNs of depth 3 is established by the following theorem.

Theorem 24.

Suppose is a D&C SPN of depth 3 with nodes in its second layer. For any partition of we have .
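The intuition behind a bound of this kind can be sketched with the standard rank argument for sums of products. The derivation below uses assumed notation ($f$ for the function computed by the SPN, $(A, B)$ for the partition, $k$ for the number of second-layer nodes, $M^f_{A,B}$ for the matrix defined above) and covers the case where the root is a sum over $k$ product nodes; it is a sketch and not necessarily the proof given in the paper.

```latex
% Assumed notation: w_j are the root's weights and f_{j,i} is the univariate
% function of x_i used by the j-th product node, with f_{j,i} \equiv 1 when
% x_i does not appear in that product.
\begin{align*}
f(x) &= \sum_{j=1}^{k} w_j \prod_{i=1}^{n} f_{j,i}(x_i)
      = \sum_{j=1}^{k} w_j\, g_j(x_A)\, h_j(x_B), \\
g_j(x_A) &= \prod_{i \in A} f_{j,i}(x_i), \qquad
h_j(x_B) = \prod_{i \in B} f_{j,i}(x_i), \\
M^f_{A,B} &= \sum_{j=1}^{k} w_j\, g_j h_j^{\top}
\quad\Longrightarrow\quad
\operatorname{rank}\bigl(M^f_{A,B}\bigr) \le k .
\end{align*}
```

Here $g_j$ and $h_j$ are viewed as vectors indexed by the possible values of $x_A$ and $x_B$ respectively, so each term is a matrix of rank at most 1. In the remaining case, where the root is a product of sum nodes, completeness forces each sum node to depend on a single variable, so $f$ is a product of univariate functions and the matrix has rank at most 1. Neither step relies on the weights being non-negative.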

Note that the proof of this theorem doesn’t use the non-negativity of the weights of the SPN, and thus applies to the “extended” version of SPNs discussed in Section 3.

We will now define the separating function which we will use to separate the expressive efficiency of depth 3 and 4 D&C SPNs.

Define and . We will define the function for -valued to be when (i.e. the first half of the input is equal to the second half) and otherwise.

Observe that and so this matrix has rank . This gives the following simple corollary of the above theorem:
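This observation is easy to check numerically for small inputs. The Python sketch below builds the matrix for an equality function on binary inputs, with the first half of the indices labelling rows and the second half labelling columns, and computes its rank exactly using rational arithmetic. The function names and the choice of 6 inputs are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product


def equal(x):
    """1 if the first half of x equals the second half, else 0."""
    half = len(x) // 2
    return 1 if x[:half] == x[half:] else 0


def communication_matrix(f, n, A, B):
    """Rows are indexed by values of x_A, columns by values of x_B (binary inputs)."""
    M = []
    for row_vals in product([0, 1], repeat=len(A)):
        row = []
        for col_vals in product([0, 1], repeat=len(B)):
            x = [0] * n
            for idx, val in zip(A, row_vals):
                x[idx] = val
            for idx, val in zip(B, col_vals):
                x[idx] = val
            row.append(f(tuple(x)))
        M.append(row)
    return M


def rank(M):
    """Exact rank by Gauss-Jordan elimination over the rationals."""
    M = [[Fraction(v) for v in row] for row in M]
    rk = 0
    for col in range(len(M[0])):
        pivot = next((r for r in range(rk, len(M)) if M[r][col] != 0), None)
        if pivot is None:
            continue
        M[rk], M[pivot] = M[pivot], M[rk]
        for r in range(len(M)):
            if r != rk and M[r][col] != 0:
                factor = M[r][col] / M[rk][col]
                M[r] = [u - factor * v for u, v in zip(M[r], M[rk])]
        rk += 1
    return rk


n = 6
M = communication_matrix(equal, n, A=[0, 1, 2], B=[3, 4, 5])
print(rank(M), 2 ** (n // 2))
```

For this partition the matrix is the identity matrix, so its rank equals the number of possible values of one half of the input, which grows exponentially with the input size.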

Corollary 25.

Any D&C SPN of depth 3 that computes must have at least nodes in its second layer.

Meanwhile, is easily shown to be efficiently computed by a D&C SPN of depth 4. This is stated as the following proposition.

Proposition 26.

can be computed by a D&C SPN of size and depth 4.

Note that the combination of Corollary 25 and Proposition 26 gives a stronger separation result than both of the aforementioned results of Delalleau and Bengio (2011). Our result also has the advantage of using an example function which is easy to interpret, and can be easily extended to prove separation results for other functions which have a high rank communication matrix.

6.2 Separations for approximate computation

An open question posed by Delalleau and Bengio (2011) asked whether a separation in expressive efficiency exists between D&C SPNs of depth 3 and 4 if the former are only required to compute an approximation to the desired function. In this section we answer this question in the affirmative by making use of Theorem 24 and an additional technical result which lower bounds the rank of perturbed versions of the identity matrix.

Theorem 27.

Suppose is a D&C SPN of depth 3 whose associated distribution is such that each value of with has an associated probability between and for some (so that all such values of have roughly equal probability), and that the total probability of all of the values of satisfying obeys (so that the probability of drawing a sample with is ). Then must have at least nodes in its second layer.

To prove this result using Theorem 24 we will make use of the following lemma which lower bounds the rank of matrices of the form for some “perturbation matrix” , in terms of a measure of the total size of the entries of .

Lemma 28.

Suppose is a real-valued matrix such that for some . Then .

7 Depth Analysis

7.1 A depth hierarchy for D&C SPNs

In this section we show that D&C SPNs become more expressively efficient as their product-depth increases, in the sense that the set of efficiently computable density functions expands as grows (product-depth is defined in Section 2.1; it can be shown to be equivalent to standard depth up to a factor of 2, e.g. by ‘merging’ sum nodes that are connected as parent/child). This is stated formally as follows:

Theorem 29.

For every integer and input size there exists a real-valued function of such that:

  1. There is a D&C SPN of product-depth and size which computes for all values of in , where the SPN’s univariate functions consist only of identity functions.

  2. For any choice of the univariate functions , a D&C SPN of product-depth that computes for all values of in must be of size (which is super-polynomial in ).

Previously, the only available result on the relationship of depth and expressive efficiency of D&C SPNs has been that of Delalleau and Bengio (2011), who showed that D&C SPNs of depth 3 are less expressively efficient than D&C SPNs of depth 4.

An analogous result separating very shallow networks from deeper ones also exists for neural networks. In particular, it is known that under various realistic constraints on their weights, threshold-based neural networks with one hidden layer (not counting the output layer) are less expressively efficient than those with 2 or more hidden layers (Hajnal et al., 1993; Forster, 2002).

More recently, Martens et al. (2013) showed that Restricted Boltzmann Machines are incapable of efficiently capturing certain simple distributions, which by the results of Martens (2014), can be efficiently captured by Deep Boltzmann Machines.

A “depth-hierarchy” property analogous to Theorem 29 is believed to hold for various other deep models like neural networks, Deep Boltzmann Machines (Salakhutdinov and Hinton, 2009), and Sigmoid Belief Networks (Neal, 1992), but has never been proven to hold for any of them. Thus, to the best of our knowledge, Theorem 29 represents the first time that a practical and non-trivial deep model has been rigorously shown to gain expressive efficiency with each additional layer of depth added.

To prove Theorem 29, we will make use of the following analogous result which is a slight modification of one proved by Raz and Yehudayoff (2009) in the context of multilinear circuits.

Theorem 30.

(Adapted from Theorem 1.2 of Raz and Yehudayoff (2009)) For every integer and input size there exists a real-valued function of such that:

  1. There is a monotone syntactically multilinear arithmetic circuit over of product-depth , size which computes for all values of in .

  2. Any syntactically multilinear arithmetic circuit over of product-depth that computes for all values of in must be of size .

Note that the original Theorem 1.2 from Raz and Yehudayoff (2009) uses a slightly different definition of arithmetic circuits from ours (they do not permit weighted connections), and the constructed circuits are not stated to be monotone. However we have confirmed with the authors that their result still holds even with our definition, and the circuits constructed in their proof are indeed monotone (Raz, 2014).

There are several issues which must be overcome before we can use Theorem 30 to prove Theorem 29. The most serious one is that syntactically multilinear arithmetic circuits are not equivalent to D&C SPNs as either type of circuit has capabilities that the other does not. Thus the ability or inability of syntactically multilinear arithmetic circuits to compute certain functions does not immediately imply the same thing for D&C SPNs.

To address this issue, we will consider the case where is binary-valued (i.e. takes values in ) so that we may exploit the close relationship which exists between syntactically multilinear arithmetic circuits and decomposable SPNs over binary-valued inputs (as discussed in Section 2.6).

Another issue is that Theorem 30 deals only with the hardness of computing certain functions over all of instead of just (which could be easier in principle). However, it turns out that for circuits with multilinear output polynomials, computing a function over is equivalent to computing it over , as is established by the following lemma.

Lemma 31.

If and are two multilinear polynomials over with , then .
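Reading the lemma as a statement about agreement on binary inputs (as the surrounding discussion of binary-valued inputs suggests), one way to see why it should hold is that the coefficients of a multilinear polynomial can be recovered from its values on 0/1 inputs alone by inclusion-exclusion. The Python sketch below illustrates this; the function name and the example polynomial are assumptions made for the illustration.

```python
from itertools import combinations


def multilinear_coefficients(f, n):
    """Recover the coefficients of the multilinear polynomial matching f on {0,1}^n.

    The coefficient of the monomial prod_{i in S} x_i is
    sum over subsets T of S of (-1)^(|S| - |T|) * f(indicator vector of T).
    Only non-zero coefficients are returned, keyed by the tuple S.
    """
    coeffs = {}
    for size in range(n + 1):
        for S in combinations(range(n), size):
            total = 0
            for t_size in range(size + 1):
                for T in combinations(S, t_size):
                    point = tuple(1 if i in T else 0 for i in range(n))
                    total += (-1) ** (size - t_size) * f(point)
            if total != 0:
                coeffs[S] = total
    return coeffs


# Example polynomial: 3*x0*x1 + 2*x2 + 1, queried only at 0/1 points.
poly = lambda x: 3 * x[0] * x[1] + 2 * x[2] + 1
coeffs = multilinear_coefficients(poly, 3)
print(coeffs)
```

Since every coefficient is determined by the values on 0/1 inputs, two multilinear polynomials that agree on all such inputs must have identical coefficients.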

With these observations in place the proof of Theorem 29 from Theorem 30 becomes straightforward (and is given in the appendix).

7.2 The limits of depth

Next, we give a result which shows that the depth of any polynomially sized D&C SPN can be essentially compressed down to , at the cost of only a polynomial increase in its total size. Thus, beyond this sublinear threshold, adding depth to a D&C SPN does not increase the set of functions which can be computed efficiently (where we use this term liberally to mean “with polynomial size”). Note that this does not contradict Theorem 29 from the previous subsection as that dealt with the case where the depth is a fixed constant and not allowed to grow with .

To prove this result, we will make use of a similar result proved by Raz and Yehudayoff (2008) in the context of multilinear circuits. In particular, Theorem 3.1 from Raz and Yehudayoff (2008) states, roughly speaking, that for any syntactically multilinear arithmetic circuit over of size (of arbitrary depth) there exists a syntactically multilinear circuit of size and depth which computes the same function.

Because this depth-reducing transformation doesn’t explicitly preserve monotonicity, and deals with multilinear circuits instead of set-multilinear circuits, while using a slightly different definition of arithmetic circuits, it cannot be directly applied to prove an analogous statement for D&C SPNs. However, it turns out that the proof contained in Raz and Yehudayoff (2008) does in fact support a result which doesn’t have these issues (Raz, 2014). We state this as the following theorem.

Theorem 32.

(Adapted from Theorem 3.1 of Raz and Yehudayoff (2008)) Given a monotone syntactically set-multilinear arithmetic circuit (over with sets given by ,…,) of size and arbitrary depth, there exists a monotone syntactically set-multilinear arithmetic circuit of size and depth which computes the same function.

Note that the size of the constructed circuit is smaller here than in Theorem 3.1 of Raz and Yehudayoff (2008) because we can avoid the “homogenization” step required in the original proof, as syntactically set-multilinear arithmetic circuits automatically have this property.

Given this theorem and the equivalence between monotone syntactically set-multilinear arithmetic circuits and D&C SPNs which was discussed near the end of Section 2.6, the following corollary is immediate.

Corollary 33.

Given a D&C SPN of size and arbitrary depth there exists a D&C SPN of size and depth which computes the same function.

Note that when the size is a polynomial function of , this depth bound is stated more simply as .

8 Circuits vs Formulas

In Gens and Domingos (2013) the authors gave a learning algorithm for SPNs which produced D&C SPN formulas. Recall that formulas are distinguished from more general circuits in that each node has fan-out at most 1. They are called “formulas” because they can be written down directly as formula expressions without the need to define temporary variables.

It is worthwhile asking whether this kind of structural restriction limits the expressive efficiency of D&C SPNs.

As we show in this section, the answer turns out to be yes, and indeed D&C SPN formulas are strictly less expressively efficient than more general D&C SPNs. This is stated formally as the following theorem:

Theorem 34.

For every input size there exists a real-valued function of such that:

  1. There is a D&C SPN of size which computes , where the SPN’s univariate functions consist only of identity functions.

  2. For any choice of the univariate functions , a D&C SPN formula that computes must be of size (which is super-polynomial in ).

As in Section 7.1, to prove Theorem 34, we will make use of an analogous result which is a slight modification of one proved by Raz and Yehudayoff (2008) in the context of multilinear circuits. This is stated below.

Theorem 35.

(Adapted from Theorem 4.4 of Raz and Yehudayoff (2008)) For every input size there exists a real-valued function of such that:

  1. There is a monotone syntactically multilinear arithmetic circuit over of size with nodes of maximum in-degree which computes for all values of in .

  2. Any syntactically multilinear arithmetic formula over that computes for all values of in must be of size .

As in Section 7.1, the original Theorem 4.4 from Raz and Yehudayoff (2008) uses a slightly different definition of arithmetic circuits from ours (they do not permit weighted connections), and the constructed circuits are not stated to be monotone. However we have confirmed with the authors that their result still holds even with our definition, and the circuits constructed in their proof are indeed monotone (Raz, 2014).

When trying to use Theorem 35 to prove Theorem 34, we encounter similar obstacles to those encountered in Section 7.1. Fortunately, the transformation between decomposable SPNs and multilinear arithmetic circuits (for the case of binary-valued inputs) happens to preserve formula structure. Thus the ideas discussed in Section 7.1 for overcoming these obstacles also apply here.

9 A Tractable Distribution Separating D&C SPNs and Other Deep Models

The existence of a D&C SPN of size for computing some density function (possibly unnormalized) implies that the corresponding marginal densities can be computed by an time algorithm. Thus, it follows that D&C SPNs cannot efficiently compute densities whose marginals are known to be intractable. And if we assume the widely believed complexity theoretic conjecture that , such examples are plentiful.

However, it is debatable whether this should be considered a major drawback of D&C SPNs, since distributions with intractable marginals are unlikely to be learnable using any model. Thus we are left with an important question: can D&C SPNs efficiently compute any density with tractable marginals?

In Poon and Domingos (2011) it was observed that essentially every known model with tractable marginal densities can be viewed as a D&C SPN, and it was speculated that the answer to this question is yes.

In this section we refute this speculation by giving a counterexample. In particular, we construct a simple distribution whose density function and corresponding marginals can be evaluated by efficient algorithms, but which provably cannot be computed by a sub-exponentially sized D&C SPN of any depth. Moreover, this density function can be computed by a Boolean circuit of modest depth and size, and so by the simulation results of Martens (2014) the distribution can in fact be captured efficiently by various other deep probabilistic models like Deep Boltzmann Machines (DBMs).

Notably, our proof of the lower bound on the size of D&C SPNs computing this density function will not use any unproven complexity theoretic conjectures, such as .

It is worthwhile considering whether there might be distributions which can be efficiently modeled by D&C SPNs but not by other deep generative models like DBMs or Contrastive Backprop Networks (Hinton et al., 2004). The answer to this question turns out to be no.

To see this, note that arithmetic circuits can be efficiently approximated by Boolean circuits, and even more efficiently approximated by linear threshold networks (which are a simple type of neural network). Thus, by the simulation results of Martens (2014) the aforementioned deep models can efficiently simulate D&C SPNs of similar depths (up to a reasonable approximation factor). Here “efficiently” means “with a polynomial increase in size”, although in practice this polynomial can be of low order, depending on how exactly one decides to simulate the required arithmetic. For linear threshold networks (and hence also Contrastive Backprop Networks), very efficient simulations of arithmetic circuits can be performed using the results of Reif and Tate (1992), for example.

9.1 Constructing the distribution

To construct the promised distribution over values of we will view each as an indicator variable for the presence or absence of a particular labeled edge in a subgraph of , where denotes the complete graph on vertices. In particular, will take the value if the edge labeled by is present in and otherwise. Note that there are total edges in and so the total input size is .

The distribution will then be defined simply as the uniform distribution over values of satisfying the property that is a spanning tree of . We will denote its density function by .

Computing up to a normalizing constant (which in this case is given by Cayley’s Formula) amounts to deciding if the graph represented by is indeed a spanning tree of , and outputting if it is, and otherwise. And to decide if the graph is a spanning tree amounts to checking that it is connected, and that it has exactly edges.

The first problem can be efficiently solved by a Boolean circuit with gates and depth using the well-known trick of repeatedly squaring the adjacency matrix. The second can be solved by adding all of the entries of together, which can also be done with a Boolean circuit with gates and depth (Paterson et al., 1990). Due to how neural networks with simple linear threshold nonlinearities can simulate Boolean circuits in a 1-1 manner (e.g. Parberry, 1994), it follows that such networks of a similar size and depth can compute the same function. And since linear threshold gates are easily simulated by a few sigmoid or rectified linear units, it follows that neural networks of the same dimensions equipped with such nonlinearities can also compute it, or at least approximate it arbitrarily well (see Martens (2014) for a review of these basic simulation results).
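The two checks described above (connectivity by repeated squaring of the adjacency matrix, plus an edge count) can be sketched in ordinary Python. This is only an illustrative sequential sketch, not a circuit construction; the function names and the encoding of a graph as a list of vertex pairs are assumptions made for the example.

```python
def bool_matmul(a, b):
    """Boolean matrix product: entry (i, j) is True if some k links i to j."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def is_spanning_tree(m, edges):
    """Return True if `edges` (pairs of vertices in 0..m-1) form a spanning
    tree of the complete graph on m vertices."""
    edges = {frozenset(e) for e in edges}
    # Check 1: a spanning tree on m vertices has exactly m - 1 edges.
    if len(edges) != m - 1:
        return False
    # Reflexive adjacency matrix: reach[i][j] means "j is within 1 step of i".
    reach = [[i == j for j in range(m)] for i in range(m)]
    for e in edges:
        u, v = tuple(e)
        reach[u][v] = reach[v][u] = True
    # Check 2: each squaring doubles the path length covered, so about
    # log2(m) squarings suffice to decide connectivity.
    steps = 1
    while steps < m - 1:
        reach = bool_matmul(reach, reach)
        steps *= 2
    return all(all(row) for row in reach)
```

A connected graph with exactly one fewer edge than it has vertices is necessarily acyclic, which is why no separate cycle check is needed here.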

Moreover, by the results of Martens (2014) we know that any distribution whose density is computable up to a normalization constant by Boolean circuits can be captured, to an arbitrarily small KL divergence, by various deep probabilistic models of similar dimensions. In particular, these results imply that Deep Boltzmann Machines of size and Contrastive Backprop Networks of size and depth can model the distribution to an arbitrary degree of accuracy. And since we can sample -length Prüfer sequences (Prüfer, 1918) and implement the algorithm for converting these sequences to trees using a threshold network of size and depth it follows from Martens (2014) that we can also approximate using Sigmoid Belief Networks (Neal, 1992) of this size and depth.
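As a concrete illustration of the Prüfer-sequence route to sampling, the following sketch draws a uniformly random sequence and decodes it into a labeled tree using the standard decoding procedure. It is plain sequential Python rather than a threshold network, and the function names and the 0-based vertex labels are assumptions made for the example.

```python
import heapq
import random


def prufer_to_tree(seq, m):
    """Decode a Prüfer sequence of length m - 2 over {0, ..., m-1} into the
    edge list of the corresponding labeled tree (requires m >= 2)."""
    degree = [1] * m
    for v in seq:
        degree[v] += 1
    # Min-heap of current leaves (vertices of remaining degree 1).
    leaves = [v for v in range(m) if degree[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for v in seq:
        leaf = heapq.heappop(leaves)  # smallest remaining leaf
        edges.append((min(leaf, v), max(leaf, v)))
        degree[v] -= 1
        if degree[v] == 1:
            heapq.heappush(leaves, v)
    # Exactly two vertices remain; join them with the final edge.
    u, w = heapq.heappop(leaves), heapq.heappop(leaves)
    edges.append((min(u, w), max(u, w)))
    return edges


def sample_uniform_spanning_tree(m, rng=random):
    """Sample a spanning tree of the complete graph on m vertices uniformly,
    using the bijection between Prüfer sequences and labeled trees."""
    seq = [rng.randrange(m) for _ in range(m - 2)]
    return prufer_to_tree(seq, m)
```

Because the decoding is a bijection between sequences and labeled trees, a uniform sequence yields a uniform tree.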

While the existence of small circuits for computing isn’t too surprising, it is a somewhat remarkable fact that it is possible to evaluate any marginal of using an -time algorithm. That is, given a subset of , and associated fixed values of the corresponding variables (i.e. ), we can compute the sum of over all possible values of the remaining variables (i.e. ) using an algorithm which runs in time .

To construct this algorithm we first observe that the problem of computing these marginal densities reduces to the problem of counting the number of spanning trees consistent with a given setting of (for a given ). And it turns out that this is a problem we can attack directly by first reducing it to the problem of counting the total number of spanning trees of a certain auxiliary graph derived from , and then reducing it to the problem of computing determinants of the Laplacian matrix of this auxiliary graph via an application of a generalized version of Kirchhoff’s famous Matrix Tree Theorem (Tutte, 2001). This argument is formalized in the proof of the following theorem.

Theorem 36.

There exists a -time algorithm, which given as input a set and corresponding fixed values of , outputs the number of edge-labeled spanning trees of which are consistent with those values.
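One way to realize such a counting algorithm is sketched below. It is an assumption-laden illustration rather than the construction used in the proof: edges fixed as present are contracted, edges fixed as absent are deleted, and the spanning trees of the resulting multigraph are counted with the Matrix Tree Theorem using exact rational arithmetic. The function name and the `present`/`absent` arguments are hypothetical, and no attempt is made to match the running time stated in the theorem.

```python
from fractions import Fraction


def count_consistent_spanning_trees(m, present=(), absent=()):
    """Count spanning trees of the complete graph on vertices 0..m-1 that
    contain every edge in `present` and no edge in `absent`."""
    required = {frozenset(e) for e in present}
    banned = {frozenset(e) for e in absent}
    if required & banned:
        return 0  # contradictory assignment
    parent = list(range(m))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Contract required edges; a cycle among them excludes every tree.
    for e in required:
        u, v = tuple(e)
        ru, rv = find(u), find(v)
        if ru == rv:
            return 0
        parent[ru] = rv
    roots = sorted({find(v) for v in range(m)})
    index = {r: i for i, r in enumerate(roots)}
    k = len(roots)
    # Laplacian of the contracted multigraph (parallel edges are counted).
    lap = [[Fraction(0)] * k for _ in range(k)]
    for u in range(m):
        for v in range(u + 1, m):
            if frozenset((u, v)) in banned:
                continue
            a, b = index[find(u)], index[find(v)]
            if a == b:
                continue  # self-loop after contraction
            lap[a][a] += 1
            lap[b][b] += 1
            lap[a][b] -= 1
            lap[b][a] -= 1
    # Matrix Tree Theorem: determinant of the Laplacian with one row and
    # column removed, computed by exact Gaussian elimination.
    size = k - 1
    mat = [row[:size] for row in lap[:size]]
    det = Fraction(1)
    for c in range(size):
        pivot = next((r for r in range(c, size) if mat[r][c] != 0), None)
        if pivot is None:
            return 0
        if pivot != c:
            mat[c], mat[pivot] = mat[pivot], mat[c]
            det = -det
        det *= mat[c][c]
        for r in range(c + 1, size):
            factor = mat[r][c] / mat[c][c]
            for j in range(c, size):
                mat[r][j] -= factor * mat[c][j]
    return int(det)
```

Dividing such a count by the total number of spanning trees gives the corresponding marginal probability under the uniform distribution.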

9.2 Main lower bound result

The main result of this section is stated as follows:

Theorem 37.

Suppose that can be approximated arbitrarily well by D&C SPNs of size and . Then .

By “approximated arbitrarily well by D&C SPNs of size ” we mean that there is a sequence of D&C SPNs of size whose output approaches , where the univariate functions are allowed to be different for each SPN in the sequence. Observe that being computed exactly by a D&C SPN of size trivially implies that it can be approximated arbitrarily well by D&C SPNs of size .

Note that the large constant in the denominator of the exponent can likely be lowered substantially with a tighter analysis than the one we will present. However, for our purposes, we will be content simply to show that the lower bound on is exponential in (and hence also in ).

Our strategy for proving Theorem 37 involves two major steps. In the first we will show that the output polynomial of any D&C SPN of size can be “decomposed” into the sum of “weak” functions. We will then extend this result to show that the same is true of any function which can be computed as the limit of the outputs of an infinite sequence of D&C SPNs of size . This will be Theorem 39.

In the second step of the proof we will show that in order to express as the sum of such “weak” functions, the size of the sum must be exponentially large in , and thus so must . This will follow from the fact (which we will show) that each “weak” function can only be non-zero on a very small fraction of all the spanning trees of (to avoid being non-zero for a non-spanning tree graph), and so if a sum of them has the property of being non-zero for all of the spanning trees, then that sum must be very large. This will be Theorem 40.

Theorem 37 will then follow directly from Theorems 39 and 40.

9.3 Decomposing D&C SPNs

The following theorem shows how the output polynomial of a D&C SPN of size can be “decomposed” into a sum of non-negative functions which are “weak” in the sense that they factorize over two relatively equal-sized partitions of the set of input variables.

Theorem 38.

Suppose is a D&C SPN over of size . Then we have:

where , and where the ’s and ’s are non-negative polynomials in satisfying the conditions:

(1)

It should be noted that this result is similar to an existing one proved by Raz and Yehudayoff (2011) for monotone multilinear circuits, although we arrived at it independently.

While Theorem 38 provides a useful characterization of the form of functions which can be computed exactly by a D&C SPN of size , it doesn’t say anything about functions which can only be computed approximately. To address this, we will strengthen this result by showing that any function which can be approximated arbitrarily well by D&C SPNs of size also has a decomposition which is analogous to the one in Theorem 38. This is stated as follows.

Theorem 39.

Suppose is a sequence of D&C SPNs of size at most (where the definitions of the univariate functions are allowed to be different for each), such that the sequence of corresponding output polynomials converges pointwise (considered as functions of ) to some function of . And further suppose that the size of the range of possible values of is given by some finite . Then we have that can be written as

(2)

where and , and are real-valued non-negative functions of and (resp.) where and are sub-sets/tuples of the variables in satisfying , , .

9.4 A lower bound on k

In this section we will show that if is of the same form as from eqn. 2, then the size of the sum must grow exponentially with (and hence ). In particular, we will prove the following theorem.

Theorem 40.

Suppose is of the form from eqn. 2, and . Then we must have that .

Our strategy to prove this result will be to show that each term in the sum can only be non-zero on an exponentially small fraction of all the spanning trees of (and is thus “weak”). And since the sum must be non-zero on all the spanning trees in order to give , it will follow that will have to be exponentially large.

We will start with the simple observation that, due to the non-negativity of the ’s and ’s, each factored term in the sum must agree with wherever is 0 (i.e. because we have for each ). And in particular, for each value of with , either or must be 0.

Intuitively, this is a very stringent requirement. As an analogy, we can think of each factor ( or ) as “seeing” roughly half the input edges, and voting “yes, I think this is a spanning tree”, or “no, I don’t think this is a spanning tree” by outputting either a value for “yes” or for “no”, with tie votes always going “no”. The requirement can thus be stated as: “each pair of factors is never allowed to reach an incorrect ‘yes’ decision”.

Despite both factors in each pair being arbitrary functions of their respective inputs (at least in principle), each only “sees” the status of roughly half the edges in the input graph, and so cannot say much about whether the entire graph actually is a spanning tree. While some potential cycles might be entirely visible from the part of the graph visible to one of the factors, this will not be true of most potential cycles. Thus, to avoid ever producing an incorrect “yes” decision, the factors are forced to vote using a very conservative strategy which will favor “no”.

The remainder of this section is devoted to formalizing this argument by essentially characterizing this conservative voting strategy and showing that it leads to a situation where only a very small fraction of all of the possible spanning trees of can receive two “yes” votes.

Lemma 41.

Suppose and are real-valued non-negative functions of the same form as those described in eqn. 2, and that for any value of , implies or . Define and . Then for we have

It is not hard to see that this lemma will immediately imply Theorem 40. In particular, provided that implies that each term in the sum is , we have that each term can be non-zero on at most a proportion of the values of for which , and thus the entire sum can be non-zero on a proportion of at most . Thus we must have that , i.e. .

The rest of this section will be devoted to the proof of Lemma 41, which begins as follows.

Suppose we are given such a and . We will color all of the edges of the complete graph as red or blue according to whether they correspond to input variables from or (resp.).

We define a “triangle” of a graph to be any complete subgraph on 3 vertices. has triangles total since it is fully connected. After coloring , each triangle is either monochromatic (having edges with only one color), or dichromatic, having 2 edges of one color and 1 edge of the other color. We will refer to these dichromatic triangles as “constraint triangles”, for reasons which will soon become clear.
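To make the monochromatic/dichromatic distinction concrete, the sketch below classifies every triangle of a two-colored complete graph by brute force. The function name and the convention that a coloring is given by listing the red edges (all others being blue) are assumptions made for the example.

```python
from itertools import combinations


def count_triangles_by_color(m, red_edges):
    """Classify the triangles of the complete graph on vertices 0..m-1.

    Edges listed in `red_edges` are red and all other edges are blue.
    Returns (monochromatic, dichromatic) triangle counts."""
    red = {frozenset(e) for e in red_edges}
    mono = di = 0
    for tri in combinations(range(m), 3):
        # Count how many of the triangle's three edges are red.
        n_red = sum(frozenset(pair) in red for pair in combinations(tri, 2))
        if n_red in (0, 3):
            mono += 1
        else:
            di += 1  # 2 edges of one color and 1 of the other
    return mono, di
```

For instance, coloring a single edge red makes dichromatic exactly those triangles that contain it, one for each remaining vertex.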

Clearly any graph which is a spanning tree of can’t contain any triangles, as these are simple examples of cycles. And determining whether contains all 3 edges of a given constraint triangle is impossible for or by themselves, since neither of them gets to see the status of all 3 edges. Because of this, and must jointly employ one of several very conservative strategies with regards to each constraint triangle in order to avoid deciding “yes” for some graph containing said triangle. In particular, we can show that either must always vote “no” whenever all of the red edges of the triangle are present in the input graph , or must vote “no” whenever all of the blue edges of the triangle are present in .

This is formalized in the following proposition.

Proposition 42.

Let and be edges that form a constraint triangle in . Suppose that and are both of a different color from .

Then one of the following two properties holds:

  • for all values of such that contains both and

  • for all values of such that contains

Thus we can see that each constraint triangle over edges , , and in gives rise to a distinct constraint which must be obeyed by any graph for which . These are each one of two basic forms:

  1. doesn’t contain both and

  2. doesn’t contain

We now give a lower bound on the number of constraint triangles (i.e. the number of dichromatic triangles) in as a function of the number of edges of each color.

Lemma 43.

Given any coloring of the complete graph with which has red edges and blue edges (recall is the total number of edges), for , the total number of dichromatic triangles is lower bounded by .

Our proof of the above lemma makes use of a known upper bound on the number of triangles in an arbitrary graph due to Fisher (1989).

As the choice of and implies the hypothesis we can apply this lemma to conclude that there are at least constraint triangles, and thus any graph for which must obey distinct constraints of the forms given above.

It remains to show that the requirement of obeying such constraints limits the number of graphs for which to be an exponentially small proportion of the total. Our strategy for doing this will be as follows. We will consider a randomized procedure (due to Aldous, 1990) that samples uniformly from the set of all spanning trees of by performing a type of random walk on , adding an edge from the previous vertex whenever it visits a previously unvisited vertex. We will then show that the sequence of vertices produced by this random walk will, with very high probability, contain a length-3 subsequence which implies that the sampled tree violates at least one of the constraints.
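The random-walk sampler referred to above can be sketched as follows for the complete graph. This is a minimal illustration of the procedure as described in the text (walk to uniformly random neighbours, keeping the edge used to enter each vertex for the first time); the function name and the 0-based vertex labels are assumptions made for the example.

```python
import random


def random_walk_spanning_tree(m, rng=random):
    """Sample a spanning tree of the complete graph on vertices 0..m-1 by a
    random walk, keeping each edge that first reaches a new vertex."""
    current = rng.randrange(m)
    visited = {current}
    edges = []
    while len(visited) < m:
        # Step to a uniformly random neighbour (any vertex except `current`).
        nxt = rng.randrange(m - 1)
        if nxt >= current:
            nxt += 1
        if nxt not in visited:
            # First visit: keep the edge from the previous vertex.
            visited.add(nxt)
            edges.append((min(current, nxt), max(current, nxt)))
        current = nxt
    return edges
```

Each kept edge attaches a new vertex to the already-visited set, so the output always has one fewer edge than there are vertices and contains no cycle.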

This argument is formalized in the proof of the following lemma.

Lemma 44.

Suppose we are given distinct constraints which are each one of the two forms discussed above. Then, of all the spanning trees of , a proportion of at most

of them obey all of the constraints.

As we have constraints, this lemma tells us that the proportion of spanning trees for which is upper bounded by

This finally proves Lemma 41, and thus Theorem 40.

10 Discussion and future directions

We have shown that there are tractable distributions which D&C SPNs cannot efficiently capture, but other deep models can. However, our separating distribution , which is the uniform distribution over adjacency matrices of spanning trees of the complete graph, is a somewhat “complicated” one, and seems to require depth to be efficiently captured by other deep models. Some questions worth considering are:

  • Is a distribution like learnable by other deep models in practice?

  • Is there a simpler example than of a tractable separating distribution?

  • Can we extend D&C SPNs in some natural way that would allow them to capture distributions like ?

  • Should we care that D&C SPNs have this limitation, or are most “natural” distributions that we might want to model with D&C SPNs of a fundamentally different character than ?

Far from showing that D&C SPNs are uninteresting, we feel that this paper has established that they are very attractive objects for theoretical analysis. While the D&C conditions limit SPNs, they also make it possible for us to prove much stronger statements about them than we otherwise could.

Indeed, it is worth underlining the point that the results we have proved about the expressive efficiency of D&C SPNs are much stronger and more thorough than results available for other deep models. This is likely owed to the intrinsically tractable nature of D&C SPNs, which makes them amenable to analysis using known mathematical methods, avoiding the various proof barriers that exist for more general circuits.

One aspect of SPNs which we have not touched on in this work is their learnability. It is strongly believed that for conditional models like neural networks, which are capable of efficiently simulating Boolean circuits, learning is hard in general (Daniely et al., 2014). However, D&C SPNs don’t seem to fall into this category, and to the best of our knowledge, it is still an open question as to whether there is a provably effective and efficient learning algorithm for them. It seems likely that the “tractable” nature of D&C SPNs, which has allowed us to prove so many strong statements about their expressive efficiency, might also make it possible to prove strong statements about their learnability.

Acknowledgments

The authors would like to thank Ran Raz for his helpful discussions regarding multilinear circuits. James Martens was supported by a Google Fellowship.

References

  • Aldous [1990] David J Aldous. The random walk construction of uniform spanning trees and uniform labelled trees. SIAM Journal on Discrete Mathematics, 3(4):450–465, 1990.
  • Bengio and Delalleau [2011] Yoshua Bengio and Olivier Delalleau. On the expressive power of deep architectures. In Algorithmic Learning Theory, pages 18–36. Springer, 2011.
  • Bunch and Hopcroft [1974] J. Bunch and J. Hopcroft. Triangular factorization and inversion by fast matrix multiplication. Mathematics of Computation, 28:231–236, 1974.
  • Coppersmith and Winograd [1987] D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. In Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing (STOC), pages 1–6, 1987.
  • Daniely et al. [2014] Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. From average case complexity to improper learning complexity. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing (STOC), pages 441–448, 2014.
  • Delalleau and Bengio [2011] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems 24, pages 666–674, 2011.
  • Fisher [1989] David C Fisher. Lower bounds on the number of triangles in a graph. Journal of graph theory, 13(4):505–512, 1989.
  • Forster [2002] J. Forster. A linear lower bound on the unbounded error probabilistic communication complexity. J. Comput. Syst. Sci., 65(4):612–625, 2002.
  • Gens and Domingos [2013] Robert Gens and Pedro Domingos. Learning the structure of sum-product networks. In International Conference on Machine Learning (ICML), 2013.
  • Hajnal et al. [1993] A. Hajnal, W. Maass, P. Pudlák, M. Szegedy, and G. Turán. Threshold circuits of bounded depth. J. Comput. System. Sci., 46:129–154, 1993.
  • Hinton et al. [2004] G. E. Hinton, S. Osindero, M. Welling, and Y. W. Teh. Unsupervised discovery of non-linear structure using contrastive backpropagation, 2004.
  • Jansen [2008] Maurice J. Jansen. Lower bounds for syntactically multilinear algebraic branching programs. In MFCS, pages 407–418, 2008.
  • Martens [2014] James Martens. Beyond universality: On the expressive efficiency of deep models. In preparation, 2014.
  • Martens et al. [2013] James Martens, Arkadev Chattopadhyay, Toniann Pitassi, and Richard Zemel. On the representational efficiency of restricted boltzmann machines. In Advances in Neural Information Processing Systems (NIPS), 2013.
  • Neal [1992] Radford M. Neal. Connectionist learning of belief networks. Artif. Intell., 56(1):71–113, July 1992.
  • Parberry [1994] Ian Parberry. Circuit Complexity and Neural Networks. MIT Press, Cambridge, MA, USA, 1994.
  • Paterson et al. [1990] M. S. Paterson, N. Pippenger, and U. Zwick. Optimal carry save networks. Technical report, Coventry, UK, UK, 1990.
  • Peharz et al. [2013] Robert Peharz, Bernhard Geiger, and Franz Pernkopf. Greedy part-wise learning of sum-product networks. volume 8189, pages 612–627. Springer Berlin Heidelberg, 2013.
  • Poon and Domingos [2011] Hoifung Poon and Pedro Domingos. Sum-product networks: A new deep architecture. In UAI, pages 337–346, 2011.
  • Prüfer [1918] H. Prüfer. Neuer beweis eines satzes über permutationen. Arch. Math. Phys., 27:742–744, 1918.
  • Raz [2004] Ran Raz. Multi-linear formulas for permanent and determinant are of super-polynomial size. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 633–641. ACM, 2004.
  • Raz [2014] Ran Raz. Personal Communication, 2014.
  • Raz and Yehudayoff [2008] Ran Raz and Amir Yehudayoff. Balancing syntactically multilinear arithmetic circuits. Computational Complexity, 17(4):515–535, December 2008.
  • Raz and Yehudayoff [2009] Ran Raz and Amir Yehudayoff. Lower bounds and separations for constant depth multilinear circuits. Computational Complexity, 18(2):171–207, 2009.
  • Raz and Yehudayoff [2011] Ran Raz and Amir Yehudayoff. Multilinear formulas, maximal-partition discrepancy and mixed-sources extractors. Journal of Computer and System Sciences, 77(1):167–190, 2011.
  • Raz et al. [2008] Ran Raz, Amir Shpilka, and Amir Yehudayoff. A lower bound for the size of syntactically multilinear arithmetic circuits. SIAM Journal on Computing, 38(4):1624–1647, 2008.
  • Reif and Tate [1992] J. Reif and S. Tate. On threshold circuits and polynomial computation. SIAM Journal on Computing, 21(5):896–908, 1992.
  • Rooshenas and Lowd [2014] Amirmohammad Rooshenas and Daniel Lowd. Learning sum-product networks with direct and indirect variable interactions. In International Conference on Machine Learning (ICML), pages 710–718, 2014.
  • Roth [1996] Dan Roth. On the hardness of approximate reasoning. Artif. Intell., 82(1-2):273–302, 1996.
  • Salakhutdinov and Hinton [2009] Ruslan Salakhutdinov and Geoffrey E. Hinton. Deep boltzmann machines. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, pages 448–455, 2009.
  • Shpilka and Yehudayoff [2010] Amir Shpilka and Amir Yehudayoff. Arithmetic circuits: A survey of recent results and open questions. Foundations and Trends in Theoretical Computer Science, 5(3–4):207–388, 2010.
  • Tutte [2001] W.T. Tutte. Graph Theory. Cambridge Mathematical Library. Cambridge University Press, 2001.

Appendix A Proofs for Section 2

Proof of Proposition 4.

We will sketch a proof of this result by describing a simple procedure to transform into .

This procedure starts with the leaf nodes and then proceeds up the network, processing a node only once all of its children have been processed. After being processed, a node will have the property that it computes a normalized density, as will all of its descendant nodes.

To process a node , we first compute the normalizing constant of its associated density. If it’s a sum node, we divide its incoming weights by , and if it’s a leaf node computing some univariate function of an , we transform this function by dividing it by . Clearly this results in computing a normalized density. Note that processing a product node is trivial since it will be automatically normalized as soon as its children are (which follows from decomposability).

After performing this normalization, the effect on subsequent computations performed by ancestor nodes of is described as follows. For every sum node which is a parent of , the effect is equivalent to dividing the weight on edge by . And for every product node which is a parent of , the effect is to divide its output by , which affects subsequent ancestor nodes of by applying this analysis recursively. The recursive analysis fails only once we reach the root node , in which case the effect is to divide the output of the SPN by the constant , which won’t change the SPN’s normalized density function (or distribution).

Thus we can compensate for the normalization and leave the distribution associated with the SPN unchanged by multiplying certain incoming edge weights of certain ancestor sum nodes of by (as specified by the above analysis). This is the second step of processing a node .
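The net effect of this two-step procedure can be sketched for a small SPN over binary variables. The sketch below is a simplification under stated assumptions: it folds the normalization and the compensating reweighting into a single bottom-up pass, it uses a hypothetical tuple encoding of nodes, and it treats the network as a tree (a shared node would simply be processed once per parent). It relies on decomposability and completeness in the way the comments indicate.

```python
# Hypothetical node encoding used only for this sketch:
#   ("leaf", var_index, [f(0), f(1)])  univariate function of a binary input
#   ("prod", [child, ...])             product node
#   ("sum",  [(weight, child), ...])   weighted sum node


def evaluate(node, x):
    """Evaluate the SPN rooted at `node` on the binary input tuple `x`."""
    kind = node[0]
    if kind == "leaf":
        return node[2][x[node[1]]]
    if kind == "prod":
        out = 1.0
        for child in node[1]:
            out *= evaluate(child, x)
        return out
    return sum(w * evaluate(child, x) for w, child in node[1])


def normalize(node):
    """Return (normalized_node, z), where z is the normalizing constant of
    the original node taken over the variables it depends on."""
    kind = node[0]
    if kind == "leaf":
        z = sum(node[2])
        return ("leaf", node[1], [v / z for v in node[2]]), z
    if kind == "prod":
        # Decomposability: children depend on disjoint variables, so the
        # normalizing constant is the product of the children's constants.
        parts = [normalize(c) for c in node[1]]
        z = 1.0
        for _, zc in parts:
            z *= zc
        return ("prod", [c for c, _ in parts]), z
    # Completeness: children depend on the same variables, so the constant
    # is the weighted sum of the children's constants.
    parts = [(w, normalize(c)) for w, c in node[1]]
    z = sum(w * zc for w, (_, zc) in parts)
    # Multiplying by zc compensates for normalizing the child; dividing by z
    # normalizes this node.
    return ("sum", [(w * zc / z, c) for w, (c, zc) in parts]), z
```

The normalized network computes the original output divided by the root's constant, so the associated distribution is unchanged.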