An Advance on Variable Elimination with Applications to Tensor-Based Computation

02/21/2020 · by Adnan Darwiche

We present new results on the classical algorithm of variable elimination, which underlies many algorithms including ones for probabilistic inference. The results relate to exploiting functional dependencies, allowing one to perform inference and learning efficiently on models that have very large treewidth. The highlight of the advance is that it works with standard (dense) factors, without the need for sparse factors or techniques based on knowledge compilation that are commonly utilized. This is significant as it permits a direct implementation of the improved variable elimination algorithm using tensors and their operations, leading to extremely efficient implementations, especially when learning model parameters. Moreover, the proposed technique does not require knowledge of the specific functional dependencies, only that they exist, so it can be used when learning these dependencies. We illustrate the efficacy of our proposed algorithm by compiling Bayesian network queries into tensor graphs and then learning their parameters from labeled data using a standard tool for tensor computation.


1 Introduction

The work reported in this paper is motivated by an interest in model-based supervised learning, in contrast to model-free supervised learning that currently underlies most applications of neural networks. We briefly discuss this subject first to put the proposed work in context.

Supervised learning has become very influential recently and stands behind most real-world applications of AI. In supervised learning, one learns a function from labeled data, a practice that is now dominated by the use of neural networks to represent such functions; see [16, 17, 1, 26]. Supervised learning can be applied in other contexts as well, such as causal models in the form of Bayesian networks [23, 24, 25]. In particular, for each query on the causal model, one can compile an Arithmetic Circuit (AC) that maps evidence (inputs) to the posterior probability of interest (output) [9, 10]. AC parameters, which correspond to Bayesian network parameters, can then be learned from labeled data using gradient descent. Hence, like a neural network, the AC is a circuit that computes a function whose parameters can be learned from labeled data.

The use of ACs in this fashion can be viewed as model-based supervised learning, in contrast to model-free supervised learning using neural networks. Model-based supervised learning is attractive since the AC can integrate the background knowledge embedded in its underlying causal model. This has a number of advantages, which include a reduced reliance on data, improved robustness and the ability to provide data-independent guarantees on the learned function. One important type of background knowledge is functional dependencies between variables and their direct causes in a model (a special case of what is known as determinism). Not only can this type of knowledge significantly reduce the reliance on data, but it can also significantly improve the complexity of inference. In fact, substantial efforts have been dedicated to exploiting determinism in probabilistic inference, particularly the compilation of Bayesian networks into ACs [9, 3, 2], which is necessary for efficient inference on dense models.

There are two main approaches for exploiting functional dependencies. The first is based on the classical algorithm of variable elimination (VE), which underlies algorithms for probabilistic inference including the jointree algorithm [29, 11, 18]. VE represents a model using factors, which are tables or multi-dimensional arrays. It then performs inference using a few and simple factor operations. Exploiting functional dependencies within VE requires sparse factors; see, e.g., [19, 21]. The second approach for exploiting functional dependencies reduces probabilistic inference to weighted model counting on a propositional formula that encodes the model, including its functional dependencies. It then compiles the formula into a circuit that is tractable for model counting; see, e.g., [8, 2]. This approach is in common use today given the efficacy of knowledge compilers.

Our main contribution is a new approach for exploiting functional dependencies in VE that works with standard (dense) factors. This is significant for the following reason. We wish to map probabilistic inference, particularly the learning of parameters, into a tensor computation to exploit the vast progress on tensor-based technology and be on par with approaches that aggressively exploit this technology. Tensors are multi-dimensional arrays whose operations are heavily optimized and can be extremely fast, even on CPU-based platforms like modern laptops (let alone GPUs). A tensor computation takes the form of a tensor graph with nodes representing tensor operations. Factors map directly to tensors and sparse factors to sparse tensors. However, sparse tensors have limited support in state-of-the-art tools, which prohibits an implementation of VE using sparse tensors (for example, in TensorFlow, a sparse tensor can only be multiplied by a dense tensor, which rules out the operation of (sparse) factor multiplication that is essential for sparse VE; see https://www.tensorflow.org/api_docs/python/tf/sparse/SparseTensor). Knowledge compilation approaches produce circuits that can be cast into scalar tensor graphs, which are less effective than general tensor graphs as they are less amenable to parallelization. Moreover, while our approach needs to know that there is a functional dependency between variables, it does not require the specific dependency (the specific numbers). Hence, it can be used to speed up inference even when the model parameters are unknown, which can be critical when learning model parameters from data. Neither of the previous approaches can exploit this kind of abstract information.

VE is based on two theorems that license factor operations. We add two new theorems that license more operations in the presence of functional dependencies. This leads to a VE algorithm that retains its standard form, except with significantly improved complexity and a computation that maps directly to a tensor graph. We present experimental results for inference and learning that show the promise of the proposed algorithm.

We start in Section 2 by discussing factors, their operations and the VE algorithm including its underlying theorems. We also present our new VE theorems in this section. We then propose a new VE algorithm in Section 3 that exploits functional dependencies. We show how the proposed algorithm maps to tensor graphs and why this matters in Section 4. We follow with case studies in Section 5 that illustrate the algorithm's performance in the context of model-based supervised learning. We finally close with some remarks in Section 6.

2 The Fundamentals: Factors & Operations

The VE algorithm is based on applying operations to factors.

A factor $f(\mathbf{X})$ over discrete variables $\mathbf{X}$ is a function that maps each instantiation $\mathbf{x}$ of the variables $\mathbf{X}$ into a number $f(\mathbf{x})$. For example, one factor may be over two binary variables and another over a binary variable and a ternary variable (their tables have four and six entries, respectively).

Factors can be represented as multi-dimensional arrays and are now commonly referred to as tensors (factor variables correspond to array/tensor dimensions). One needs three factor operations to implement the VE algorithm: multiplication, sum-out and normalization.

The product of factors $f_1(\mathbf{X})$ and $f_2(\mathbf{Y})$ is another factor $f(\mathbf{Z})$, where $\mathbf{Z} = \mathbf{X} \cup \mathbf{Y}$ and $f(\mathbf{z}) = f_1(\mathbf{x}) f_2(\mathbf{y})$ for the unique instantiations $\mathbf{x}$ and $\mathbf{y}$ that are compatible with instantiation $\mathbf{z}$. Summing out variables $\mathbf{Y} \subseteq \mathbf{X}$ from factor $f(\mathbf{X})$ yields another factor $g(\mathbf{Z})$, where $\mathbf{Z} = \mathbf{X} \setminus \mathbf{Y}$ and $g(\mathbf{z}) = \sum_{\mathbf{y}} f(\mathbf{y}\mathbf{z})$. We use $\sum_{\mathbf{Y}} f$ to denote the resulting factor $g$. We also use $\mathrm{proj}(f, \mathbf{Z})$, which reads: sum out all variables from factor $f$ except for variables $\mathbf{Z}$. Normalizing factor $f(\mathbf{X})$ yields another factor $g(\mathbf{X})$ where $g(\mathbf{x}) = f(\mathbf{x}) / \sum_{\mathbf{x}'} f(\mathbf{x}')$. We use $\mathrm{norm}(f)$ to denote the normalization of factor $f$.
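To make these operations concrete, here is a minimal NumPy sketch of dense factors and the three operations; the Factor class and helper names are illustrative and not part of the paper's PyTAC system.

import numpy as np

class Factor:
    """A dense factor: variable names (one per tensor axis) and a table of values."""
    def __init__(self, variables, table):
        self.vars = tuple(variables)
        self.table = np.asarray(table, dtype=float)

def multiply(f1, f2):
    """Factor product: the result is over the union of the two sets of variables."""
    all_vars = f1.vars + tuple(v for v in f2.vars if v not in f1.vars)
    letters = {v: chr(ord('a') + k) for k, v in enumerate(all_vars)}
    spec = (''.join(letters[v] for v in f1.vars) + ',' +
            ''.join(letters[v] for v in f2.vars) + '->' +
            ''.join(letters[v] for v in all_vars))
    return Factor(all_vars, np.einsum(spec, f1.table, f2.table))

def sum_out(f, sum_vars):
    """Sum out the given variables from factor f."""
    axes = tuple(f.vars.index(v) for v in sum_vars)
    keep = tuple(v for v in f.vars if v not in sum_vars)
    return Factor(keep, f.table.sum(axis=axes))

def project(f, keep_vars):
    """proj(f, Z): sum out all variables from f except those in keep_vars."""
    return sum_out(f, [v for v in f.vars if v not in keep_vars])

def normalize(f):
    """norm(f): divide all entries by their sum."""
    return Factor(f.vars, f.table / f.table.sum())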

A Bayesian Network (BN) is specified by a directed acyclic graph (DAG) and a set of factors. In particular, for each node $X$ and its parents $\mathbf{P}$, we need a factor $f(X\mathbf{P})$ over variables $X\mathbf{P}$. The value $f(x\mathbf{p})$ represents the conditional probability $Pr(x \mid \mathbf{p})$ and the factor is called a Conditional Probability Table (CPT). The joint distribution specified by a Bayesian network is simply the product of its CPTs.

The Bayesian network in Figure 2 has five CPTs, one for each of its variables. The network joint distribution is the product of these five factors.

Evidence on variable $X$ is captured by a factor $\lambda(X)$ called an evidence indicator. Hard evidence fixes a value $x^\star$, giving $\lambda(x^\star) = 1$ and $\lambda(x) = 0$ for $x \neq x^\star$. For soft evidence, $\lambda(x)$ is the likelihood of value $x$ [23]. The posterior distribution of a Bayesian network is the normalized product of its CPTs and evidence indicators.
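Continuing the NumPy sketch above, evidence indicators are just factors over a single variable; the variable name and values below are illustrative.

# Hard evidence fixing a binary variable E to its second value.
hard_evidence = Factor(['E'], [0.0, 1.0])

# Soft evidence on E: likelihoods for its two values.
soft_evidence = Factor(['E'], [0.8, 0.2])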

An expression constructed by applying operations to factors will be called an f-expression. Suppose we have evidence on two of the variables in Figure 2. The posterior on a query variable $Q$ is then obtained by evaluating an f-expression of the form $\mathrm{norm}(\mathrm{proj}(\prod_i f_i \prod_j \lambda_j,\ Q))$, where the $f_i$ are the network CPTs and the $\lambda_j$ are the evidence indicators.

The VE algorithm factors f-expressions so they are evaluated more efficiently [29, 11] and is based on two theorems; see, e.g., [10, Chapter 6]. The first theorem says that the order in which variables are summed out does not matter.

Theorem 1.

$\sum_{\mathbf{X}} \sum_{\mathbf{Y}} f = \sum_{\mathbf{Y}} \sum_{\mathbf{X}} f$.

The second theorem allows us to reduce the size of factors involved in a multiplication operation.

Theorem 2.

If variables $\mathbf{X}$ appear in factor $f_1$ but not in factor $f_2$, then $\sum_{\mathbf{X}} f_1 f_2 = (\sum_{\mathbf{X}} f_1)\, f_2$.

Factor $(\sum_{\mathbf{X}} f_1)\, f_2$ is exponentially smaller (in the number of variables $\mathbf{X}$) than factor $f_1 f_2$, so Theorem 2 allows us to evaluate the f-expression much more efficiently.

Consider the f-expression $\sum_{\mathbf{X}} \sum_{\mathbf{Y}} f_1(\mathbf{X}\mathbf{Z}) f_2(\mathbf{Y}\mathbf{Z})$. A direct evaluation multiplies the two factors to yield a factor over $\mathbf{X}\mathbf{Y}\mathbf{Z}$ and then sums out variables $\mathbf{X}\mathbf{Y}$. Using Theorem 1, we can arrange the expression into $\sum_{\mathbf{Y}} \sum_{\mathbf{X}} f_1(\mathbf{X}\mathbf{Z}) f_2(\mathbf{Y}\mathbf{Z})$ and using Theorem 2 into $(\sum_{\mathbf{X}} f_1(\mathbf{X}\mathbf{Z}))(\sum_{\mathbf{Y}} f_2(\mathbf{Y}\mathbf{Z}))$, which is more efficient to evaluate since it never constructs a factor over all of $\mathbf{X}\mathbf{Y}\mathbf{Z}$.
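The rearrangement licensed by Theorems 1 and 2 can be checked numerically with the sketch above; the factor names and shapes below are illustrative.

rng = np.random.default_rng(0)
f1 = Factor(['X', 'Z'], rng.random((2, 3)))
f2 = Factor(['Y', 'Z'], rng.random((4, 3)))

# Direct evaluation: multiply first, then sum out X and Y.
direct = sum_out(multiply(f1, f2), ['X', 'Y'])

# Rearranged evaluation: sum out X and Y before multiplying (Theorems 1 and 2).
rearranged = multiply(sum_out(f1, ['X']), sum_out(f2, ['Y']))

assert np.allclose(direct.table, rearranged.table)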

Using an appropriate order for summing out (eliminating) variables, Theorems 1 and 2 allow one to compute the posterior on any variable in a Bayesian network in $O(n \exp(w))$ time and space. Here, $n$ is the number of network variables and $w$ is the network treewidth (a graph-theoretic measure of the network connectivity).

This works well for sparse networks that have a small treewidth, but is problematic for dense networks like the ones we will look at in Section 5. We present next two new results that allow us to sometimes significantly improve this computational complexity, by exploiting functional relationships between variables and their direct causes.

While we will focus on exploiting functional dependencies in Bayesian networks, our results are more broadly applicable since the VE algorithm can be utilized in many other domains, including symbolic reasoning and constraint processing [12]. VE can also be used to contract tensor networks, which have been receiving increased attention. A tensor network is a set of factors in which a variable appears in at most two factors. Contracting a tensor network is the problem of summing out all variables that appear in two factors; see, e.g., [14, 15]. The VE algorithm can also be used to evaluate Einstein summations, which are in common use today and implemented in many tools including NumPy (https://numpy.org/).

2.1 Functional CPTs

Consider a variable $X$ that has parents $\mathbf{P}$ in a Bayesian network and let factor $f(X\mathbf{P})$ be its conditional probability table (CPT); being a CPT, it satisfies $\sum_x f(x\mathbf{p}) = 1$ for every $\mathbf{p}$. If $f(x\mathbf{p}) \in \{0, 1\}$ for all instantiations $x$ and $\mathbf{p}$, the CPT is said to be functional as it specifies a function that maps each parent instantiation $\mathbf{p}$ into the unique value $x$ satisfying $f(x\mathbf{p}) = 1$. For example, the CPT of a variable that is the logical AND of its parents is functional. Functional dependencies encode a common type of background knowledge (examples in Section 5). They are a special type of determinism, which generally refers to the presence of zero parameters in a CPT. A CPT that has zero parameters is not necessarily a functional CPT.
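As a concrete illustration in the NumPy sketch above, the CPT of a variable Y that is the logical AND of its binary parents P1 and P2 is functional; the variable names are hypothetical.

# CPT f(Y | P1, P2) with axis order (P1, P2, Y), encoding Y = P1 AND P2.
and_cpt = Factor(['P1', 'P2', 'Y'], np.zeros((2, 2, 2)))
for p1 in range(2):
    for p2 in range(2):
        and_cpt.table[p1, p2, p1 & p2] = 1.0

# Every entry is 0 or 1, and each parent instantiation has a unique value of Y.
assert set(np.unique(and_cpt.table)) <= {0.0, 1.0}
assert np.allclose(and_cpt.table.sum(axis=-1), 1.0)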

We will next present two results that empower the VE algorithm in the presence of functional CPTs. The results allow us to factor f-expressions beyond what is permitted by Theorems 1 and 2, leading to significant reduction in complexity. The results do not depend on the identity of a functional CPT, only that it is functional. This is significant when learning model parameters from data.

To state these results, we will use $\mathcal{S}$, $\mathcal{S}_1$ and $\mathcal{S}_2$ to denote sets of factors. Depending on the context, a set of factors $\mathcal{S}$ may be treated as one factor obtained by multiplying the members of $\mathcal{S}$.

The first result says the following. If a functional CPT for variable $X$ appears in both parts of a product, then variable $X$ can be summed out from one part without changing the value of the product.

Theorem 3.

Consider a functional CPT $f$ for variable $X$. If $f \in \mathcal{S}_1$ and $f \in \mathcal{S}_2$, then $\mathcal{S}_1 \cdot \mathcal{S}_2 = (\sum_X \mathcal{S}_1) \cdot \mathcal{S}_2$.

Proof.

Suppose CPT $f$ is over variables $X\mathbf{P}$. Let $f_1$ and $f_2$ be the factors corresponding to $\mathcal{S}_1$ and $\mathcal{S}_2$, and let $\mathbf{X}_1$ and $\mathbf{X}_2$ be their variables. Then variable $X$ and parents $\mathbf{P}$ must belong to both $\mathbf{X}_1$ and $\mathbf{X}_2$ since $f \in \mathcal{S}_1$ and $f \in \mathcal{S}_2$. Let $g_1 = \sum_X f_1$, which is over variables $\mathbf{Y}_1 = \mathbf{X}_1 \setminus \{X\}$, and let $\mathbf{Z} = \mathbf{X}_1 \cup \mathbf{X}_2$. We want to show $f_1(\mathbf{x}_1) f_2(\mathbf{x}_2) = g_1(\mathbf{y}_1) f_2(\mathbf{x}_2)$ for every instantiation $\mathbf{z}$ of $\mathbf{Z}$.

Consider an instantiation $\mathbf{z}$ and let $\mathbf{x}_1$, $\mathbf{x}_2$, $\mathbf{y}_1$, $x$ and $\mathbf{p}$ be the instantiations of $\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{Y}_1$, $X$ and $\mathbf{P}$ in $\mathbf{z}$. Then $f(x\mathbf{p})$ is a multiplicand of both $f_1(\mathbf{x}_1)$ and $f_2(\mathbf{x}_2)$. Since CPT $f$ is functional, for any $\mathbf{p}$ there is a unique $x$, call it $x^\star$, such that $f(x^\star\mathbf{p}) = 1$.

If $x \neq x^\star$, then $f(x\mathbf{p}) = 0$, leading to $f_1(\mathbf{x}_1) = 0$ and $f_2(\mathbf{x}_2) = 0$, so both sides of the equality are zero. If $x = x^\star$, then every term of the sum $g_1(\mathbf{y}_1) = \sum_{x'} f_1(x'\mathbf{y}_1)$ with $x' \neq x^\star$ contains the multiplicand $f(x'\mathbf{p}) = 0$ and hence vanishes, which leads to $g_1(\mathbf{y}_1) = f_1(x^\star\mathbf{y}_1) = f_1(\mathbf{x}_1)$. Hence, for every instantiation $\mathbf{z}$ we have $f_1(\mathbf{x}_1) f_2(\mathbf{x}_2) = g_1(\mathbf{y}_1) f_2(\mathbf{x}_2)$. ∎

Theorem 3 has a key corollary. If a functional CPT for variable $X$ appears in both parts of a product, we can sum out variable $X$ from the product by independently summing it out from each part.

Corollary 1.

Consider a functional CPT $f$ for variable $X$. If $f \in \mathcal{S}_1$ and $f \in \mathcal{S}_2$, then $\sum_X \mathcal{S}_1 \mathcal{S}_2 = (\sum_X \mathcal{S}_1)(\sum_X \mathcal{S}_2)$.

Proof.

$\sum_X \mathcal{S}_1 \mathcal{S}_2 = \sum_X (\sum_X \mathcal{S}_1)\, \mathcal{S}_2$ by Theorem 3, which equals $(\sum_X \mathcal{S}_1)(\sum_X \mathcal{S}_2)$ by Theorem 2 since variable $X$ no longer appears in $\sum_X \mathcal{S}_1$. ∎

Theorem 3 and Corollary 1 may appear unusable as they require multiple occurrences of a functional CPT whereas the factors of a Bayesian network contain a single (functional) CPT for each variable. This is where the second result comes in: duplicating a functional CPT in a product of factors does not change the product value.

Theorem 4.

For a functional CPT $f$, if $f \in \mathcal{S}$, then $\mathcal{S} = f \cdot \mathcal{S}$.

Proof.

Let $g$ be the product of the factors in $\mathcal{S}$ and let $h = f \cdot g$. Suppose factor $f$ is the CPT of variable $X$ with parents $\mathbf{P}$. Consider an instantiation $\mathbf{z}$ of the variables of $g$ and suppose it includes instantiation $x\mathbf{p}$. If $f(x\mathbf{p}) = 0$, then $g(\mathbf{z}) = 0$ since $f \in \mathcal{S}$. Moreover, $h(\mathbf{z}) = f(x\mathbf{p}) g(\mathbf{z}) = 0 = g(\mathbf{z})$. If $f(x\mathbf{p}) = 1$, then $h(\mathbf{z}) = f(x\mathbf{p}) g(\mathbf{z}) = g(\mathbf{z})$. Note that $f(x\mathbf{p})$ is either $0$ or $1$ since $f$ is functional. Hence, for all instantiations $\mathbf{z}$ we have $h(\mathbf{z}) = g(\mathbf{z})$. ∎

Theorem 4 also holds if $f$ embeds any functional dependency that is implied by the factors $\mathcal{S}$, instead of being a functional CPT in $\mathcal{S}$, but we do not pursue the applications of this generalization in this paper.

To see how Theorems 3 and 4 interplay, consider the f-expression $\sum_X f_1(XA) f_2(XB) f_3(XC)$. In the standard VE algorithm, one must multiply all three factors before summing out variable $X$, leading to a factor over the four variables $XABC$. If factor $f_1$ is a functional CPT for variable $X$, we can duplicate it by Theorem 4: $\sum_X f_1 f_2 f_3 = \sum_X (f_1 f_2)(f_1 f_3)$. Moreover, Corollary 1 gives $\sum_X (f_1 f_2)(f_1 f_3) = (\sum_X f_1 f_2)(\sum_X f_1 f_3)$, which avoids constructing a factor over four variables. We show in Section 3 how these theorems enable efficient inference on models with very large treewidth.
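Continuing the NumPy sketch, this identity can be verified numerically with a functional CPT f1 for X (here X simply copies its binary parent A) and two arbitrary factors mentioning X; the names and shapes are illustrative.

# f1(A, X) is a functional CPT: X is a copy of its parent A.
f1 = Factor(['A', 'X'], np.eye(2))
f2 = Factor(['X', 'B'], rng.random((2, 3)))
f3 = Factor(['X', 'C'], rng.random((2, 4)))

# Standard VE: build a factor over the four variables X, A, B, C, then sum out X.
standard = sum_out(multiply(multiply(f1, f2), f3), ['X'])

# Theorem 4 duplicates f1; Corollary 1 then splits the sum over X into two parts.
split = multiply(sum_out(multiply(f1, f2), ['X']),
                 sum_out(multiply(f1, f3), ['X']))

assert np.allclose(standard.table, split.table)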

3 Variable Elimination with Functional CPTs

Figure 1: An arithmetic circuit (AC) compiled from a small Bayesian network. The AC computes a factor that gives the posterior on a query variable given evidence on two other variables.

We now present our proposed VE algorithm. We first present a standard VE algorithm based on jointrees [18] and then extend it to exploit functional CPTs. Our algorithm will not compute probabilities, but will compile symbolic f-expressions whose factors contain symbolic parameters. A symbolic f-expression is compiled once and used thereafter to answer multiple queries. Moreover, its parameters can be learned from labeled data using gradient descent. We will show how to map symbolic f-expressions into tensor graphs in Section 4 and use these graphs for supervised learning in Section 5.

Once the factors of a symbolic f-expression are unfolded, the result is an Arithmetic Circuit (AC) [9, 4] as shown in Figure 1. In fact, the standard VE algorithm we present next is a refinement of the one proposed in [9] for extracting ACs from jointrees.

The next section introduces jointrees and some key concepts that we need for the standard and extended VE algorithms.

3.1 Jointrees

Figure 2: A Bayesian network with a jointree (two views).

Consider the Bayesian network in the middle of Figure 2 and its jointree on the left of the figure. The jointree is simply a tree with factors attached to some of its nodes (the circles in Figure 2 are the jointree nodes). We use binary jointrees [28], in which each node has either one or three neighbors and where nodes with a single neighbor are called leaves. The two jointrees in Figure 2 are identical but arranged differently: each arrangement places a different leaf node at the top.

Our use of jointrees deviates from the norm for reasons that will become apparent later. First, we use a binary jointree whose leaves are in one-to-one correspondence with model variables. Second, we only attach factors to leaf nodes: the CPT and evidence indicator for each variable $X$ are assigned to the leaf node corresponding to variable $X$. That leaf jointree node is called the host of variable $X$ (for similar uses and a method for constructing such binary jointrees, see [7] and [10, Chapter 8]). Contraction trees, which were later adopted for contracting tensor networks [15], correspond to binary jointrees.

The Bayesian network in Figure 2 has five variables. Its jointree also has five leaves, each of which hosts a network variable. For example, the jointree node at the top left of Figure 2 hosts one of the network variables: the CPT and evidence indicator for that variable are assigned to this jointree node.

A key notion underlying jointrees is that of edge separators, which determine the space complexity of inference (the rectangles in Figure 2 are separators). The separator for an edge $(i, j)$, denoted $\mathrm{sep}(i, j)$, contains the model variables that appear in leaf nodes on both sides of the edge. A related notion is the cluster of a jointree node $i$. If $i$ is a leaf, its cluster contains the variables appearing at node $i$. Otherwise, it is the union of the separators for the edges incident to $i$. Every factor constructed by VE is over the variables of some separator or cluster. The time complexity of VE is exponential in the size of clusters and linear in the number of nodes in a jointree.

The size of the largest cluster is called the jointree width and cannot be lower than the Bayesian network treewidth; see [10, Chapter 9] for a detailed treatment of this subject. When the network contains variables with different cardinalities, the size of a cluster is better measured by the number of instantiations of its variables. We therefore define the binary rank of a cluster as $\log_2$ of its instantiation count. The binary rank coincides with the number of variables in a cluster when all variables are binary.
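In equation form, with an illustrative cluster $\mathbf{C}$ containing three variables of cardinalities 2, 3 and 4:

$$\text{binary rank}(\mathbf{C}) \;=\; \log_2 \prod_{X \in \mathbf{C}} |X| \;=\; \log_2 (2 \cdot 3 \cdot 4) \;=\; \log_2 24 \;\approx\; 4.58 .$$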

Our technique for exploiting functional dependencies will use Theorems 3 and 4 to shrink the size of clusters and separators significantly below jointree width, allowing us to handle networks with very large treewidth. The algorithm will basically reduce the maximum binary rank of clusters and separators, which can exponentially reduce the size of factors constructed by VE during inference.

3.2 Compiling Symbolic f-expressions using VE

Suppose we wish to compile an f-expression that computes the posterior on a variable $Q$. We first identify the leaf jointree node $h$ that hosts variable $Q$. We then arrange the jointree so host $h$ is at the top, as in Figure 2. Host $h$ will then have a single child $r$, which we call the jointree root. The tree rooted at node $r$ is now a binary tree, with each internal node $i$ having two children $c_1$ and $c_2$ and a parent $p$. On the left of Figure 2, the root is the node directly below the top leaf. We refer to such a jointree arrangement as a jointree view.

Jointree views simplify notation. For example, we can now write $\mathrm{sep}(i)$ to denote the separator between node $i$ and its parent, instead of $\mathrm{sep}(i, p)$. We will adopt this simpler notation from now on.

We now compile an f-expression using the following equations:

$\mathrm{posterior}(Q) = \mathrm{norm}\big(\mathrm{proj}(\Phi_h \cdot m_r,\ Q)\big)$   (1)

$m_i = \mathrm{proj}(\Phi_i,\ \mathrm{sep}(i))$ if $i$ is a leaf node, and $m_i = \mathrm{proj}(m_{c_1} \cdot m_{c_2},\ \mathrm{sep}(i))$ if $i$ has children $c_1$ and $c_2$   (2)

Here, $\Phi_i$ denotes the product of factors assigned to leaf node $i$ (the CPT and evidence indicator for the model variable assigned to node $i$) and $m_i$ denotes the factor that Equation 2 computes for node $i$.

For the jointree view in Figure 2 (left), applying these equations to the query variable $Q$, host $h$ and root $r$ yields an f-expression over the network's CPTs and evidence indicators. This expression results from applying Equation 1 to the host followed by applying Equation 2 to each edge in the jointree. Each sum in the expression corresponds to a separator and every product constructed by the expression will be over the variables of a cluster.

Our compiled AC is simply the above f-expression. The value of the expression represents the circuit output. The evidence indicators in the expression represent the circuit inputs. Finally, the CPTs of the expression contain the circuit parameters (see the AC in Figure 1).

We will now introduce new notation to explain Equations 1 and 2, as we need this understanding in the following section; see also [10, Chapter 7]. For a node $i$ in a jointree view, we use $\mathcal{S}(i)$ to denote the set of factors at or below node $i$. We also use $\bar{\mathcal{S}}(i)$ to denote the set of factors above node $i$. Consider an internal node $i$ on the left of Figure 2. Then $\mathcal{S}(i)$ contains the factors assigned to the leaf nodes below $i$ and $\bar{\mathcal{S}}(i)$ contains the factors assigned to the remaining leaf nodes.

For a jointree view with host $h$ and root $r$, $\mathcal{S}(r)$ contains all factors in the jointree except those assigned to the host, and $\bar{\mathcal{S}}(r)$ contains the host's factors. Equation 1 computes the posterior on $Q$, while delegating the computation of the product of $\mathcal{S}(r)$ to Equation 2, which actually computes $m_i = \mathrm{proj}(\mathcal{S}(i),\ \mathrm{sep}(i))$ by summing out all variables but the ones in $\mathrm{sep}(i)$. The equation uses the decomposition $\mathcal{S}(i) = \mathcal{S}(c_1) \cup \mathcal{S}(c_2)$ at internal nodes to sum out variables more aggressively.

The rule employed by Equation 2 is simple: sum out from the product of $\mathcal{S}(c)$ all variables except the ones appearing in the product of $\bar{\mathcal{S}}(c)$ (Theorem 2). The only variables shared between the factors $\mathcal{S}(c)$ and $\bar{\mathcal{S}}(c)$ are the ones in $\mathrm{sep}(c)$, so Equation 2 is exploiting Theorem 2 to the maximum. The earlier that variables are summed out, the smaller the factors we need to multiply and the smaller the f-expressions that VE compiles.
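The following is a minimal sketch of Equations 1 and 2 over a jointree view, built on the Factor helpers above; the Node class and its fields are illustrative and not PyTAC's actual API.

class Node:
    """A node in a jointree view."""
    def __init__(self, sep, factors=None, children=()):
        self.sep = set(sep)            # separator with the node's parent
        self.factors = factors or []   # CPT and evidence indicator (leaves only)
        self.children = children       # () for leaves, (c1, c2) otherwise

def product(factors):
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

def message(i):
    """Equation 2: the factor that node i contributes to its parent."""
    if not i.children:
        prod = product(i.factors)
    else:
        c1, c2 = i.children
        prod = multiply(message(c1), message(c2))
    return project(prod, i.sep)        # sum out everything not in sep(i)

def posterior(host, root, query_var):
    """Equation 1: normalize the projection onto the query variable."""
    prod = multiply(product(host.factors), message(root))
    return normalize(project(prod, {query_var}))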

3.3 Exploiting Functional Dependencies

1:  procedure shrink_sep(r, h)
2:      X ← variable assigned to host h
3:      if X ∈ F(r) then
4:          sep(r) -= {X}
5:      end if
6:      sum(r)
7:  end procedure
8:  procedure sum(i)
9:      if i is a leaf node then
10:         return
11:     end if
12:     c1, c2 ← children of node i
13:     V ← F(c1) ∩ F(c2)
14:     c ← either c1 or c2
15:     sep(c) -= V
16:     sep(c1) &= sep(i) ∪ sep(c2)
17:     sep(c2) &= sep(i) ∪ sep(c1)
18:     sum(c1)
19:     sum(c2)
20: end procedure
Figure 3: Left: Algorithm for shrinking separators based on functional CPTs. Right: An application of the algorithm where dropped variables are colored red. Two of the variables have functional CPTs.

We now present an algorithm that uses Theorems 3 and 4 to sum out variables earlier than is licensed by Theorems 1 and 2. Here, 'earlier' means lower in the jointree view, which leads to smaller factors.

Our algorithm uses the notation $F(i)$ to denote the set of variables that have a functional CPT at or below node $i$ in the jointree view. For example, in Figure 3, $F(i)$ at a leaf node contains the hosted variable if its CPT is functional, and $F(i)$ at an internal node is the union of $F(c_1)$ and $F(c_2)$ for its children $c_1$ and $c_2$.

The algorithm is depicted in Figure 3 and is a direct application of Theorem 3 with a few subtleties. The algorithm traverses the jointree view top-down, removing variables from the separators of visited nodes. It is called on the root $r$ and host $h$ of the view, shrink_sep(r, h). It first shrinks the separator of root $r$, which decomposes the set of factors into $\{\Phi_h\}$ and $\mathcal{S}(r)$. The only functional CPT that can be shared between the factors $\{\Phi_h\}$ and $\mathcal{S}(r)$ is the one for the variable $X$ assigned to host $h$. If variable $X$ is functional and its CPT is shared (i.e., a replica of it appears below the root), Theorem 3 immediately allows summing it out from $\mathcal{S}(r)$. Variable $X$ can then be summed out at root $r$ by dropping it from $\mathrm{sep}(r)$, as done on line 4.

The algorithm then recurses on the children of root $r$. The algorithm processes both children $c_1$ and $c_2$ of a node before it recurses on these children. This is critical, as we explain later. The set $V$ computed on line 13 contains variables that have functional CPTs in both the factors $\mathcal{S}(c_1)$ and the factors $\mathcal{S}(c_2)$ (recall the decomposition used by Equation 2 in Section 3.2). Theorem 3 allows us to sum out these variables from either $\mathcal{S}(c_1)$ or $\mathcal{S}(c_2)$ but not both, a choice that is made on line 14. A variable that has a functional CPT in both $\mathcal{S}(c_1)$ and $\mathcal{S}(c_2)$ is summed out from one of them by dropping it from either $\mathrm{sep}(c_1)$ or $\mathrm{sep}(c_2)$ on line 15. In our implementation, we heuristically choose a child based on the size of the separators below it: we add the sizes of these separators (number of instantiations) and choose the child with the largest size, breaking ties arbitrarily.

If a variable is summed out at node $i$ and at its child $c_2$, we can sum it out earlier at child $c_1$ by Theorem 2 (classical VE): it can be dropped from $\mathrm{sep}(c_1)$ since it no longer appears in $\mathrm{sep}(i)$ or $\mathrm{sep}(c_2)$. A symmetric situation arises for child $c_2$. This is handled on lines 16-17. Applying Theorem 2 in this context demands that we process nodes $c_1$ and $c_2$ before we process their children. Otherwise, the reduction of separators $\mathrm{sep}(c_1)$ and $\mathrm{sep}(c_2)$ will not propagate downwards early enough, missing opportunities for applying Theorem 2 further.

Figure 3 depicts an example of applying algorithm shrink_sep to a jointree view for the Bayesian network in Figure 2. Variables colored red are dropped by shrink_sep. The algorithm starts by processing root $r$, dropping the host's variable from $\mathrm{sep}(r)$ on line 4. It then processes the children $c_1$ and $c_2$ of the root simultaneously. Since both children contain a functional CPT for one of the variables, that variable can be dropped from either $\mathrm{sep}(c_1)$ or $\mathrm{sep}(c_2)$. One child is chosen in this case and the variable is dropped from its separator. Lines 16-17 then shrink these separators further.

Our proposed technique for shrinking separators will have an effect only when functional CPTs have multiple occurrences in a jointree (otherwise, the set $V$ on line 13 is always empty). While this deviates from the standard use of jointrees, replicating functional CPTs is licensed by Theorem 4. The (heuristic) approach we adopted for replicating functional CPTs in a jointree is based on replicating them in the Bayesian network. Suppose variable $X$ has a functional CPT and children $Y_1, \ldots, Y_m$ in the network, where $m > 1$. We replace variable $X$ with replicas $X_1, \ldots, X_m$. Each replica $X_j$ has a single child $Y_j$ and the same parents as $X$. We then construct a jointree for the resulting network and finally replace each replica $X_j$ by $X$ in the jointree. This creates $m$ replicas of the functional CPT in the jointree. Replicating functional CPTs leads to jointrees with more nodes, but smaller separators and clusters, as we shall see in Section 5.
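A sketch of this replication step on a simple parents-dictionary representation of the DAG; the representation and the replica naming scheme are illustrative.

def replicate_functional_nodes(parents, functional):
    """parents maps each node to the list of its parents; functional is the set
    of nodes with functional CPTs. Each functional node X with m > 1 children
    is replaced by replicas X#1, ..., X#m, one per child, sharing X's parents."""
    new_parents = {x: list(ps) for x, ps in parents.items()}
    for x in functional:
        # Children of x in the current (possibly already rewired) network.
        kids = [c for c, ps in new_parents.items() if x in ps]
        if len(kids) <= 1:
            continue
        x_parents = new_parents.pop(x)
        for k, child in enumerate(kids, 1):
            replica = f"{x}#{k}"
            new_parents[replica] = list(x_parents)
            new_parents[child] = [replica if p == x else p
                                  for p in new_parents[child]]
    return new_parents

A jointree is then built for the rewired network, after which each replica is renamed back to the original variable, creating the desired copies of the functional CPT in the jointree.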

4 Mapping ACs into Tensor Graphs

We discuss next how we map ACs (symbolic f-expressions) into tensor graphs for efficient inference and learning. Our implementation is part of the PyTAC system under development by the author. PyTAC is built on top of TensorFlow and will be open sourced.

A tensor is a data structure for a multi-dimensional array. The shape of a tensor defines the array dimensions. A tensor with shape $(d_1, \ldots, d_n)$ has $d_1 \times \cdots \times d_n$ elements or entries. The dimensions of a tensor are numbered and called axes. The number of axes is the tensor rank. Tensor computations can be organized into a tensor graph: a data flow graph with nodes representing tensor operations. Tensors form the basis of many machine learning tools today.

A factor over variables $X_1, \ldots, X_n$ can be represented by a tensor with rank $n$ and shape $(|X_1|, \ldots, |X_n|)$, where $|X_i|$ is the cardinality of variable $X_i$ (i.e., its number of values). Factor operations can then be implemented using tensor operations, leading to a few advantages. First, tensor operations are heavily optimized to take advantage of special instruction sets and architectures (on CPUs and GPUs), so they can be orders of magnitude faster than standard implementations of factor operations (even on laptops). Next, the elements of a tensor can be variables, allowing one to represent symbolic f-expressions, which is essential for mapping ACs into tensor graphs that can be trained. Finally, tools such as TensorFlow and PyTorch provide support for computing the partial derivatives of a tensor graph with respect to tensor elements, and come with effective gradient descent algorithms for optimizing tensor graphs (and hence ACs). This is very useful for training ACs from labeled data, as we do in Section 5.

To map ACs (symbolic f-expressions) into tensor graphs, we need to implement factor multiplication, summation and normalization. Mapping factor summation and normalization into tensor operations is straightforward: summation has a corresponding tensor operation (tf.reduce_sum) and normalization can be implemented using tensor summation and division. Factor multiplication does not have a corresponding tensor operation and leads to some complications (tensor multiplication is pointwise while factors are normally over different sets of variables, so multiplying the tensors corresponding to two factors does not yield the expected result). The simplest option is to use tf.einsum, which can perform factor multiplication if we pass it a string such as "abc,bde->abcde" (https://www.tensorflow.org/api_docs/python/tf/einsum). We found this too inefficient for extensive use though, as it performs too many tensor transpositions. One can also use the technique of broadcasting by adding trivial dimensions to align tensors (https://www.tensorflow.org/xla/broadcasting), but broadcasting has limited support in TensorFlow, requiring tensors with small enough ranks.
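For concreteness, a single factor product with einsum looks as follows (NumPy's einsum uses the same string format as tf.einsum; the shapes are illustrative).

import numpy as np

f1 = np.random.rand(2, 3, 4)   # factor over variables a, b, c
f2 = np.random.rand(3, 5, 2)   # factor over variables b, d, e
prod = np.einsum('abc,bde->abcde', f1, f2)
print(prod.shape)              # (2, 3, 4, 5, 2)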

We bypassed these complications in the process of achieving something more ambitious. Consider Equation 2, which contains almost all multiplications performed by VE. The factors $m_{c_1}$, $m_{c_2}$ and the result $m_i$ are over the separators $\mathrm{sep}(c_1)$, $\mathrm{sep}(c_2)$ and $\mathrm{sep}(i)$. This equation multiplies factors $m_{c_1}$ and $m_{c_2}$ to yield a factor over the variables $\mathrm{sep}(c_1) \cup \mathrm{sep}(c_2)$ and then shrinks it by summation into a factor over the variables $\mathrm{sep}(i)$. We wanted to avoid constructing the larger factor before shrinking it. That is, we wanted to multiply-then-sum in one shot, as this can reduce the size of our tensor graphs significantly (see a discussion of this space issue in [10, Chapter 7]). A key observation allows this using standard tensor operations.

Figure 4: Left: A generative model for rectangles. Right: Examples of clean and noisy rectangle images.
Figure 5: Left: A generative model for seven-segment digits. Middle: Examples of noisy digit images. Right: Seven-segment digit.

The previous separators are all connected to jointree node $i$, so they satisfy the following property [10, Chapter 9]: if a variable appears in one separator, it also appears in at least one other separator (in a jointree, every separator connected to a node is a subset of the union of the other separators connected to that node). The variables can then be partitioned as follows:

  • $\mathbf{B}$: variables in $\mathrm{sep}(i)$, $\mathrm{sep}(c_1)$ and $\mathrm{sep}(c_2)$,

  • $\mathbf{X}$: variables in $\mathrm{sep}(i)$ and $\mathrm{sep}(c_1)$ but not $\mathrm{sep}(c_2)$,

  • $\mathbf{Y}$: variables in $\mathrm{sep}(i)$ and $\mathrm{sep}(c_2)$ but not $\mathrm{sep}(c_1)$,

  • $\mathbf{Z}$: variables in $\mathrm{sep}(c_1)$ and $\mathrm{sep}(c_2)$ but not $\mathrm{sep}(i)$,

where variables $\mathbf{Z}$ are the ones summed out by Equation 2. The variables in each factor can now be structured as follows: $m_{c_1}$ is over $(\mathbf{B}, \mathbf{X}, \mathbf{Z})$, $m_{c_2}$ is over $(\mathbf{B}, \mathbf{Z}, \mathbf{Y})$ and the result $m_i$ is over $(\mathbf{B}, \mathbf{X}, \mathbf{Y})$. We actually group each set of variables $\mathbf{B}$, $\mathbf{X}$, $\mathbf{Y}$ and $\mathbf{Z}$ into a single compound variable so that factors $m_{c_1}$ and $m_{c_2}$ can each be represented by a rank-3 tensor. We then use the tensor operation for matrix multiplication, tf.matmul, to compute $m_i$ in one shot, without having to construct a tensor for the product $m_{c_1} \cdot m_{c_2}$. Matrix multiplication is perhaps one of the most optimized tensor operations on both CPUs and GPUs.

Preparing tensors $m_{c_1}$ and $m_{c_2}$ for matrix multiplication requires two operations: tf.reshape, which aggregates variables into compound dimensions, and tf.transpose, which orders the resulting dimensions so tf.matmul can map $(\mathbf{B}, \mathbf{X}, \mathbf{Z})$ and $(\mathbf{B}, \mathbf{Z}, \mathbf{Y})$ into $(\mathbf{B}, \mathbf{X}, \mathbf{Y})$. The common dimension $\mathbf{B}$ must appear first in both tensors. Moreover, the last two dimensions must be ordered as $(\mathbf{X}, \mathbf{Z})$ and $(\mathbf{Z}, \mathbf{Y})$, but tf.matmul can transpose the last two dimensions of an input tensor on the fly if needed. Using matrix multiplication in this fashion had a significant impact on reducing the size of tensor graphs and the efficiency of evaluating them, despite the added expense of the tf.transpose and tf.reshape operations (the latter operation does not use space and is very efficient).
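A NumPy sketch of this multiply-then-sum trick under the partition above; tf.reshape, tf.transpose and tf.matmul behave analogously, and the instantiation counts of the compound variables B, X, Y, Z below are illustrative.

import numpy as np

B, X, Y, Z = 6, 4, 5, 3   # instantiation counts of the compound variables
rng = np.random.default_rng(0)

# The child factors, already reshaped/transposed into (B, X, Z) and (B, Z, Y).
m_c1 = rng.random((B, X, Z))
m_c2 = rng.random((B, Z, Y))

# One-shot multiply-then-sum: batched matrix multiplication over Z.
result = np.matmul(m_c1, m_c2)   # shape (B, X, Y), i.e., over sep(i)

# Memory-hungry reference: build the product over all variables, then sum out Z.
reference = (m_c1[:, :, :, None] * m_c2[:, None, :, :]).sum(axis=2)
assert np.allclose(result, reference)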

PyTAC represents ACs using an abstract tensor graph called an ops graph, which can be mapped into a particular tensor implementation depending on the machine learning tool used. PyTAC also has a dimension management utility, which associates each tensor with its structured dimensions while ensuring that all tensors are structured appropriately so operations can be applied to them efficiently. We currently map an ops graph into a tf.Graph object, using the tf.function utility introduced recently in TensorFlow 2.0. PyTAC also supports the recently introduced Testing Arithmetic Circuits (TACs), which augment ACs with testing units that turn them into universal function approximators like neural networks [6, 5, 27].

5 Case Studies

We next evaluate the proposed VE algorithm on two classes of models that have abundant functional dependencies. We also evaluate the algorithm on randomly generated Bayesian networks while varying the amount of functional dependencies. The binary jointrees constructed for these models are very large and prohibit inference using standard VE. We constructed these binary jointrees from variable elimination orders using the method proposed in [7]; see also [10, Chapter 9]. The elimination orders were obtained by the minfill heuristic; see, e.g., [20]. The minfill heuristic and similar ones aim for jointrees that minimize the size of the largest cluster (i.e., treewidth). It was observed recently that minimizing the size of the largest separator (called max rank) is more desirable when using tensors, since the memory requirements of Equation 2 can depend only on the size of separators, not clusters (see [14] for recent methods that optimize max rank). This observation holds even when using classical implementations of the jointree algorithm and was exploited earlier to reduce the memory requirements of jointree inference; see, e.g., [22, 13].

5.1 Rectangle Model

We first consider a generative model for rectangles, shown in Figure 4. In an image of size $n \times n$, a rectangle is defined by its upper-left corner (row, column), its height and its width. Each of these variables has $n$ values. The rectangle also has a binary label variable, which is either tall or wide. Each row has a binary variable indicating whether the rectangle will render in that row ($n$ variables total). Each column has a similar variable. We also have binary variables which correspond to image pixels ($n^2$ of them), indicating whether the pixel is on or off. This model can be used to predict rectangle attributes from noisy images such as those shown in Figure 4. We use the model to predict whether a rectangle is tall or wide by compiling an AC with the label variable as output and the pixel variables as input. The AC computes a distribution on the label given a noisy image as evidence and can be trained from labeled data using cross entropy as the loss function. (Arthur Choi suggested the use of rectangle models and Haiying Huang proposed this particular version of the model.)

Our focus is on the row and column variables, which are functionally determined by the corner position together with the height and width, respectively (for example, the variable for row $i$ is on iff $\mathit{row} \le i < \mathit{row} + \mathit{height}$). In particular, we will investigate the impact of these functional relationships on the efficiency of our VE compilation algorithm and their impact on learning AC parameters from labeled data. Our experiments were run on a MacBook Pro, 2.2 GHz Intel Core i7, with 32 GB RAM.

Table 1 depicts statistics on ACs that we compiled using our proposed VE algorithm. For each image size, we compiled an AC for predicting the rectangle label while exploiting functional CPTs to remove variables from separators during the compilation process. As shown in the table, exploiting functional CPTs has a dramatic impact on the complexity of VE. This is indicated by the binary rank of the largest jointree cluster in a classical jointree versus one whose separators and clusters were shrunk due to functional dependencies. (We applied standard node and value pruning to the Bayesian network before computing a jointree and shrinking it. This has more effect on the digits model in Section 5.2; for example, it can infer that some pixels will never be turned on as they will never be occupied by any digit.) Recall that a factor over a cluster will have a size exponential in the cluster's binary rank (the same holds for factors over separators). The table also shows the size of the compiled ACs, which is the sum of tensor sizes in the corresponding tensor graph (the tensor size is the number of elements/entries it has). For a baseline, the AC obtained by standard VE (without exploiting functional CPTs) for the largest image size is many times larger than the AC size reported in Table 1. What is particularly impressive is the time it takes to evaluate these ACs (compute their output from their input). On average, evaluating an AC of size around ten million for these models takes on the order of milliseconds, which shows the promise of tensor-based implementations (these experiments were run on a laptop).

We next investigate the impact of integrating background knowledge when learning AC parameters. For training, we generated labeled data for all clean images of rectangles and added a number of noisy images for each (with the same label). Noise is generated by randomly flipping background pixels, with the number of flipped pixels depending on the number of rectangle pixels and the number of background pixels. We used the same process for testing data, except that we increased the number of noisy pixels and doubled the number of noisy images. We trained the AC using cross entropy as the loss function. (Some of the CPTs contain zero parameters but are not functional. We fixed these zeros in the AC when learning with background knowledge. We also tied the parameters of some of the variables, therefore learning one CPT for all of them.)

Table 2 shows the accuracy of classifying rectangles (tall vs wide) in noisy images using ACs with and without background knowledge. ACs compiled from models with background knowledge have fewer parameters and therefore need less data to train. The training and testing examples were selected randomly from the datasets described above, with a fixed number of examples always used for testing regardless of the training data size. Each classification accuracy is the average over twenty-five runs. The table clearly shows that integrating background knowledge into the compiled AC yields higher classification accuracies given a fixed number of training examples.

Table 1: Size and compile/evaluation time for ACs that compute the posterior on the rectangle label. Reported times are in seconds; evaluation time is the average of evaluating an AC over a batch of examples. Columns: image size, use of functional CPTs, network nodes, max rank, binary rank of the largest cluster, AC size, evaluation time and compile time.

Table 2: Classification accuracy on noisy rectangle images, with a fixed number of testing examples in each case. Rows: functional CPTs fixed in the AC versus trainable. Columns: accuracy (mean and stdev) for varying numbers of training examples, and parameter count.

Table 3: Size and compile/evaluation time for ACs that compute a posterior over digits. Reported times are in seconds; evaluation time is the average of evaluating an AC over a batch of examples. Columns: image size, use of functional CPTs, network nodes, max rank, binary rank of the largest cluster, AC size, evaluation time and compile time.

Table 4: Classification accuracy on noisy digit images, with a fixed number of testing examples in each case. Rows: functional CPTs fixed in the AC versus trainable. Columns: accuracy (mean and stdev) for varying numbers of training examples, and parameter count.

5.2 Digits Model

We next consider a generative model for seven-segment digits, shown in Figure 5 (https://en.wikipedia.org/wiki/Seven-segment_display). The main goal of this model is to recognize digits in noisy images such as those shown in Figure 5. The model has four vertical and three horizontal segments. A digit is generated by activating some of the segments. For example, digit 8 is generated by activating all segments and digit 1 by activating two vertical segments. Segments are represented by rectangles as in the previous section, so this model integrates seven rectangle models. A digit has a location specified by the row and column of its upper-left corner (its height is seven pixels and its width is four pixels). Moreover, each segment has an activation node which is turned on or off depending on the digit. When this activation node is off, the segment pixels are also turned off. An image of size $n \times n$ has $n^2$ pixels whose values are determined by the pixels generated by the segments.

This is a much more complex and larger model than the rectangle model and it also has an abundance of functional dependencies. It is also much more challenging computationally. This can be seen by examining Table 3, which reports the size of the largest clusters in the jointrees for this model. For example, the model for the largest images has a cluster with a very large binary rank, which means that standard VE would have to construct a factor whose size is exponential in that rank, which is impossible. Our proposed technique for exploiting functional dependencies makes inference possible though, as it dramatically reduces the binary rank of the largest cluster. And even though the corresponding AC has a size of about one hundred million, it can be evaluated in a fraction of a second. The AC compilation times are also relatively modest.

We trained the compiled ACs as we did in the previous section. We generated all clean images and added noise as follows: for each clean image, we added noisy images for training and for testing by randomly flipping a number of background pixels that depends on the image size.

Table 4 parallels the accuracy table for the rectangle model. We trained two ACs, one that integrates background knowledge and one that does not. The former AC has fewer parameters and therefore requires less data to train. While this is expected, it is still interesting to see how little data one needs to get reasonable accuracies. In general, Tables 3 and 4 reveal the same patterns as the rectangle model: exploiting functional dependencies leads to a dramatic reduction in AC size, and integrating background knowledge into the compiled AC significantly improves learnability.

5.3 Random Bayesian Networks

Table 5: Reduction in maximum cluster size due to exploiting functional dependencies. The number of values a node has was chosen randomly. We averaged over random networks for each combination of network node count, maximal parent count and the percentage of nodes having functional CPTs. The parents of a node and their count were chosen randomly. Functional nodes were chosen randomly from non-root nodes. The binary rank of a cluster is log2 of the number of its instantiations. Columns: node count, maximal parent count, percentage of functional nodes, and the binary rank of the largest cluster in the original and shrunk jointrees (mean and stdev), together with the reduction.

A further table compares tensor graphs (TenG) with scalar graphs: its columns report the batch size, the limit on tensor graph size, the actual size, the max binary rank, the normalized TenG time in milliseconds, and slow-down factors (ScaG/TenG and ScaBaG/TenG time ratios).