# Query the model: precomputations for efficient inference with Bayesian Networks

We consider a setting where a Bayesian network has been built over a relational database to represent the joint distribution of data attributes and is used to perform approximate query processing or answer predictive queries. We explain how careful precomputation of certain probabilistic quantities can lead to large efficiency gains in response time and provide algorithms for optimal precomputation.


## 1 Introduction

Research in machine learning has led to powerful methods for building probabilistic models for general predictive tasks [bishop2013model]. As a result, machine learning is currently employed in a wide range of fields, enabling us to automate tasks that until recently were seen as particularly challenging. Examples include image and speech recognition [NIPS2018_imagerecognition, NIPS2018_speechrecognition], natural language processing [NIPS2018_NLP], and machine translation [NIPS2018_Translation].

Moreover, as several researchers have pointed out recently, machine-learning techniques can also be used within database management systems (dbms) [kraska2019sagedb]. For instance, a database system can employ machine-learning techniques to learn a probabilistic model from the data, such as a Bayesian network, which captures the dependencies among values of different attributes, and subsequently use the model either for approximate query processing, i.e., to produce approximate answers for queries about the data already stored in the database, or to answer predictive queries, i.e., to infer the attributes of future data entries. The two application scenarios below illustrate the case further.

Approximate query answering: We are interested in analyzing a survey database [scutari2014bayesian], a small extract of which is shown in Figure 1. The database consists of a single table with variables A (age): indicating the age bracket of a person as young, adult, or old; S (sex): which can be female or male; E (education level): indicating whether a person has finished high-school or university; O (occupation): indicating whether a person is employed or self-employed; R (size of residence city): which can be small or big; and T (means of transportation): which takes values train, car, or other. The database is managed by a dbms that enables its users to ask queries over the data. For example, consider the following two queries.

• what fraction of the population are young females who commute by car?

• among young females who commute by car, what fraction live in big cities?

To answer each query exactly, the dbms could simply scan the entire database and evaluate the requested quantity. However, a scan is often too expensive to perform, and in many application scenarios exact answers are unnecessary; approximate answers are satisfactory. To accomplish this goal, the approximate query-answering component of the dbms learns a Bayesian-network model over the joint distribution of the data attributes, which approximates the observed data distribution within some error range. When the user submits a query, it is translated into a probabilistic quantity that corresponds to the query, the value of which is calculated from the learned model and serves as an approximate answer to the query. For example, the aforementioned queries correspond to the following marginal and conditional probabilities, respectively.

$$\Pr(\mathtt{A}=\text{young},\ \mathtt{S}=\text{female},\ \mathtt{T}=\text{car})$$

$$\Pr(\mathtt{R}=\text{big} \mid \mathtt{A}=\text{young},\ \mathtt{S}=\text{female},\ \mathtt{T}=\text{car})$$

Note that answering such queries through the learned model can be vastly more efficient than scanning the entire database.

Predictive queries: The same system employs a similar approach for predictive queries such as the following.

• what is the probability that a person added to the database is a young female who commutes by car?

• what is the probability that a person added to the database lives in a big city, given that she is a young female who commutes by car?

Unlike the approximate query-answering scenario, in which queries requested quantities directly measurable from the available data, the two queries above request a prediction over unseen data. Nevertheless, the two predictive queries correspond to marginal and conditional probabilities that share the same expressions as the two queries of the approximate query-answering scenario, respectively. The only difference between the two scenarios is that, while for approximate query answering the probabilities are evaluated over a model that is optimized to approximate the existing data, for predictive queries the probabilities are evaluated over a model that is optimized for generalized performance over possibly unseen data. (Without going into specifics here, a standard method to optimize a model for generalized performance in the machine-learning literature is to partition the data randomly into one “training” and one “test” set, build the model based on the data from the training set, but choose its parameters so that its performance over the test set is optimized. For a detailed treatment of the topic see the classic textbook of Bishop [bishop2006pattern].)

As the two scenarios above suggest, machine learning finds natural application in the tasks of approximate query processing and predictive querying: learning a probabilistic model for the joint distribution of data attributes allows the system to answer a very general class of queries, the answers to which can be evaluated via probabilistic quantities from the model, with no need for potentially expensive access to the data. Probabilistic models such as Bayesian networks offer several advantages over traditional approximate query-processing approaches that use synopses [cormode2011synopses]. In particular, probabilistic models allow us to work with distributions rather than single values. This is useful in cases where we are interested in the variance or modality of numerical quantities. Moreover, these models typically extend gracefully to regions of the data space for which we have observed no data. For example, for predictive queries over relational databases with a large number of attributes, it is important to assign non-zero probability to all possible tuples, even for combinations of attribute values that do not yet exist in the database. And most importantly, since probabilistic models are learned from data (or, fit to the data), their complexity is adjusted to the complexity of the data at hand. This allows us to have very concise representations of the data distribution even for a large number of attributes. For example, one can typically learn from data a very sparse Bayesian network to represent dependencies among a large number of attributes, while, by comparison, synopses like multi-dimensional histograms would struggle to represent high-dimensional data.

In this work, we opt for Bayesian networks to model the joint distribution of database attributes, as they have an intuitive interpretation, adapt easily to settings of varying complexity, and have been studied extensively for many years [nielsen2009bayesian]. In what follows, we do not argue further about why this is a good choice for approximate query processing and predictive queries; the interested reader can find a very good treatment of the topic by Getoor et al. [getoor2001selectivity]. By contrast, we focus on the issue of efficiency, as inference in Bayesian networks is NP-hard [chickering1996learning], thus one cannot preclude the possibility that the evaluation of some queries proves expensive for practical settings.

How can we mitigate this risk? Our key observations are the following: first, the evaluation of probabilities over the Bayesian-network model involves intermediate results, in the form of relational tables, which are costly to compute every time a query requests them; second, the same intermediate tables can be used for the evaluation of many different queries. Based on these observations, we set out to precompute and materialize the intermediate relational tables that bring the largest computational benefit, i.e., those that are involved in the evaluation of many expensive queries. The problem formulation is general enough to accommodate arbitrary query workloads and takes as input a budget constraint on the number of intermediate relational tables one can afford to materialize (Section 3).

Our contributions are the following: (i) an exact polynomial-time algorithm to choose an optimal materialization (Section 4.1); (ii) a greedy algorithm with approximation guarantees (Section 4.3); (iii) a pseudo-polynomial algorithm to address the problem under a budget constraint on the space used for materialization (Section 5.1); (iv) a further-optimized computational scheme that avoids query-specific redundant computations (Section 5.2); (v) experiments over real data, showing that by materializing only a few intermediate tables, one can significantly improve the running time of queries, with gains averaged over a uniform workload of queries (Section 6).

To make the paper self-contained, we review the required background material in Section 3. All the proofs omitted due to lack of space can be found in an extended version of the paper [us-arxiv].

## 2 Related Work

Bayesian networks or “directed graphical models” are probabilistic models that capture global relationships among a set of variables, through local dependencies between small groups of variables [pearl2014probabilistic]. This combination of complexity and modularity makes them suitable for general settings where one wishes to represent the joint distribution of a large number of variables (e.g., the column attributes of a large relational table). Bayesian networks have an intuitive interpretation captured by the structure of a directed graph (each directed edge between two variables represents a dependency of the child on the parent variable), and probabilities based on the model can be expressed with compact sums-of-products. For exact inference, i.e., exact computation of marginal and conditional probabilities, the conceptually simplest algorithm is variable elimination [zhang1994simple, zhang1996exploiting], which follows directly the formula for joint probability under the model. The other main algorithm for exact inference is the junction-tree algorithm [jensen1996introduction, lauritzen1988local], which is based on “message passing” among variables and is considerably more elaborate than variable elimination. For Bayesian networks of tree structure, simpler message-passing algorithms exist, e.g., the sum-product algorithm [bishop2006pattern]. As this paper is the first work to address materialization for Bayesian networks, we opt to work with variable elimination [zhang1994simple] due to its conceptual simplicity. However, it is possible that a similar approach is applicable to message-passing algorithms.

Getoor et al. [getoor2001selectivity] explained in 2001 how Bayesian networks can be used for selectivity estimation that moves beyond simplistic independence assumptions. Our work is in the same spirit as that of Getoor et al., in the sense that it assumes a Bayesian network is used to model dependencies between data attributes, but it addresses a different technical problem, i.e., the problem of optimal materialization for efficient usage of the model.

## 3 Setting and problem statement

A Bayesian network is a directed acyclic graph (dag), where nodes represent variables and edges represent dependencies among variables. Each node is associated with a table quantifying the probability that the node takes a particular value conditionally on values of its parents. For instance, if a node $u$ associated with an $m$-ary variable has $p$ parents, all of which are $m$-ary variables, the associated probability distribution for $u$ is a table with $m^{p+1}$ entries.

One key property of Bayesian networks is that, conditional on the values of its parents, a variable is independent of its non-descendants. This property leads to simple formulas for the evaluation of marginal and conditional probabilities. For example, for the network of Figure 1, the joint probability of all variables is given by

$$
\Pr(\mathtt{A},\mathtt{S},\mathtt{E},\mathtt{O},\mathtt{R},\mathtt{T}) = \Pr(\mathtt{T}\mid\mathtt{O},\mathtt{R})\,\Pr(\mathtt{O}\mid\mathtt{E})\,\Pr(\mathtt{R}\mid\mathtt{E})\,\Pr(\mathtt{E}\mid\mathtt{A},\mathtt{S})\,\Pr(\mathtt{A})\,\Pr(\mathtt{S}). \tag{1}
$$

Each factor on the right-hand side of Equation (1) is part of the specification of the Bayesian network and represents the marginal and conditional probabilities of its variables.
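As a minimal sketch of Equation (1), the following Python snippet multiplies one table entry per variable to obtain a joint probability. The structure and attribute domains follow the survey example; the probability values are random placeholders (an assumption made only to keep the example short), and the names `DOMAINS`, `PARENTS`, `random_cpt`, and `joint` are hypothetical.

```python
import itertools
import random

# Attribute domains of the survey example.
DOMAINS = {
    "A": ["young", "adult", "old"],
    "S": ["female", "male"],
    "E": ["high-school", "university"],
    "O": ["employed", "self-employed"],
    "R": ["small", "big"],
    "T": ["train", "car", "other"],
}
# Parents of each variable, read off the factors of Equation (1).
PARENTS = {"A": [], "S": [], "E": ["A", "S"], "O": ["E"], "R": ["E"], "T": ["O", "R"]}


def random_cpt(var, rng):
    """Return {parent_values: {value: prob}}; numbers are placeholders, rows sum to 1."""
    cpt = {}
    for pa in itertools.product(*(DOMAINS[p] for p in PARENTS[var])):
        weights = [rng.random() + 0.1 for _ in DOMAINS[var]]
        cpt[pa] = {x: w / sum(weights) for x, w in zip(DOMAINS[var], weights)}
    return cpt


rng = random.Random(0)
CPTS = {var: random_cpt(var, rng) for var in DOMAINS}


def joint(assignment):
    """Equation (1): product of one conditional-probability entry per variable."""
    prob = 1.0
    for var in DOMAINS:
        pa = tuple(assignment[p] for p in PARENTS[var])
        prob *= CPTS[var][pa][assignment[var]]
    return prob


# Summing the joint over all 3*2*2*2*2*3 = 144 tuples should give 1.
names = list(DOMAINS)
total = sum(
    joint(dict(zip(names, values)))
    for values in itertools.product(*DOMAINS.values())
)
```

Because every row of every table is normalized, the joint sums to one over all tuples, which is a quick sanity check on the factorization.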

In what follows, we assume that a Bayesian network has been learned from a relational database, with each variable corresponding to one relational attribute. While in the example of Figure 1 the Bayesian network has only 6 variables, in many applications we have networks with hundreds or thousands of variables. Here we assume that all variables are categorical; numerical variables can be discretized in categorical intervals. In many cases, access to the probability tables of a Bayesian network can replace access to the original data; indeed, we can answer queries via the Bayesian-network model rather than through direct processing of potentially huge volumes of data. As discussed in the introduction, this approach is not only more efficient, but it can also lead to more accurate estimates as it avoids over-fitting.

Querying the Bayesian network. We consider the task of answering probabilistic queries over the model defined by a Bayesian network $\mathcal{N}$. For instance, for the model shown in Figure 1, example queries are: “what is the probability that a person is a university-graduate female, lives in a small city, and is self-employed?” or “for each possible means of transport, what is the probability that a person is young and uses the particular means of transport?” More precisely, we consider queries of the form

$$
q = \Pr(X_q,\ Y_q = y_q), \tag{2}
$$

where $X_q$ is a set of free variables and $Y_q$ is a set of bound variables with corresponding values $y_q$. Notice that free variables are the ones for which the query requests a probability for each of their possible values. For the examples above, the first query is answered by the probability $\Pr(\mathtt{S}=\text{female}, \mathtt{E}=\text{university}, \mathtt{R}=\text{small}, \mathtt{O}=\text{self-employed})$, and the second by the distribution $\Pr(\mathtt{T}, \mathtt{A}=\text{young})$.

We denote by $Z_q$ the set of variables that do not appear in the query $q$. The variables in the set $Z_q$ are those that need to be summed out in order to compute the query $q$. Specifically, query $q$ is computed via the summation

$$
\Pr(X_q,\ Y_q = y_q) = \sum_{Z_q} \Pr(X_q,\ Y_q = y_q,\ Z_q). \tag{3}
$$

The answer to the query is a table indexed by combinations of values of variables $X_q$. Note that conditional probabilities of the form $\Pr(X_q \mid Y_q = y_q)$ can be computed from the corresponding joint probabilities by

$$
\Pr(X_q \mid Y_q = y_q) = \frac{\Pr(X_q,\ Y_q = y_q)}{\Pr(Y_q = y_q)} = \frac{\Pr(X_q,\ Y_q = y_q)}{\sum_{X_q}\Pr(X_q,\ Y_q = y_q)},
$$

thus without loss of generality, we focus on queries of type (2).
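The reduction from conditional to joint queries can be illustrated on a toy joint table. The sketch below assumes three binary variables X (free), Y (bound), and Z (summed out) with made-up probabilities; `JOINT`, `query`, and `conditional` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical joint table Pr(X, Y, Z) over three binary variables.
JOINT = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.15, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.15, (1, 1, 1): 0.20,
}


def query(y):
    """Equation (3): Pr(X, Y=y), obtained by summing out Z."""
    table = defaultdict(float)
    for (x, y_value, _z), prob in JOINT.items():
        if y_value == y:
            table[x] += prob
    return dict(table)


def conditional(y):
    """Pr(X | Y=y): divide Pr(X, Y=y) by Pr(Y=y) = sum over X of Pr(X, Y=y)."""
    table = query(y)
    normalizer = sum(table.values())
    return {x: prob / normalizer for x, prob in table.items()}


joint_answer = query(1)              # table indexed by the values of X
conditional_answer = conditional(1)  # same table, renormalized
```

The conditional answer needs no additional pass over the model: it reuses the table of the joint query and only renormalizes it.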

Answering queries. The variable-elimination algorithm, proposed by Zhang et al. [zhang1994simple] to answer queries $q$, introduces the concept of elimination tree. The algorithm computes $q$ by summing out the variables $Z_q$ that do not appear in $q$, according to Equation (3). When we sum out a variable, we say that we eliminate it. The elimination tree represents the order in which variables are eliminated and the intermediate results that are passed along.

Variable elimination. We eliminate variables according to a total order $\sigma$ on variables that is given as input and considered fixed hereafter. For example, for the Bayesian network of Figure 1, one possible order $\sigma$ starts with $\mathtt{A}$, $\mathtt{S}$, $\mathtt{T}$ and continues with the remaining variables.

A query $q$ can be computed by brute-force elimination in two steps. In the first step, compute the joint table, i.e., a table with the joint probability for each combination of values of all variables. In the second step, process variables sequentially in the order of $\sigma$: for a bound variable $Y \in Y_q$, select those entries of the joint table that satisfy the corresponding equality condition in $q$; for a variable $Z \in Z_q$, compute a sum over each group of values of variables that have not been processed so far (thus “summing out” the variable); and finally, for a free variable $X \in X_q$, no computation is needed. The table that results from this process is the answer to query $q$.

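The variable-elimination algorithm by Zhang et al. improves upon brute-force elimination by observing that it is not necessary to compute the full joint table. To see why, let us consider again the query $q = \Pr(\mathtt{T}, \mathtt{A}=\text{young})$ and the order $\sigma$. In this example, we have $X_q = \{\mathtt{T}\}$, $Y_q = \{\mathtt{A}\}$, and $Z_q = \{\mathtt{S}, \mathtt{E}, \mathtt{O}, \mathtt{R}\}$. The first variable in $\sigma$ is $\mathtt{A}$. The brute-force algorithm would first compute a natural join over all factors in Equation (1) and then select those rows that match the condition $\mathtt{A}=\text{young}$. An equivalent but more efficient computation is to first consider only the tables of those factors that include variable A and select only the rows that satisfy the equality condition $\mathtt{A}=\text{young}$; then perform the natural join over the resulting tables. This computation corresponds to the following two equations.

$$
\begin{aligned}
\psi_{\mathtt{A}}(\mathtt{S},\mathtt{E};\mathtt{A}=\text{young}) &= \Pr(\mathtt{E}\mid\mathtt{A}=\text{young},\mathtt{S})\,\Pr(\mathtt{A}=\text{young}) \\
\Pr(\mathtt{A}=\text{young},\mathtt{S},\mathtt{E},\mathtt{O},\mathtt{R},\mathtt{T}) &= \psi_{\mathtt{A}}(\mathtt{S},\mathtt{E};\mathtt{A}=\text{young})\,\Pr(\mathtt{T}\mid\mathtt{O},\mathtt{R})\,\Pr(\mathtt{O}\mid\mathtt{E})\,\Pr(\mathtt{R}\mid\mathtt{E})\,\Pr(\mathtt{S}).
\end{aligned} \tag{4}
$$

The crucial observation is that the equality condition $\mathtt{A}=\text{young}$ concerns only two factors of the joint probability formula. After processing variable A, these two factors can be replaced by a factor $\psi_{\mathtt{A}}(\mathtt{S},\mathtt{E};\mathtt{A}=\text{young})$, a table indexed by S and E and containing only entries with $\mathtt{A}=\text{young}$.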

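We can continue in the same way for the remaining variables. Let us consider how to sum out $\mathtt{S}$, the second variable in $\sigma$. Instead of a brute-force approach, a more efficient computation is to compute a new factor by summing out S over only those factors that include S, and to use the new factor to perform a natural join with the remaining factors.

$$
\begin{aligned}
\psi_{\mathtt{S}}(\mathtt{E};\mathtt{A}=\text{young}) &= \sum_{\mathtt{S}} \psi_{\mathtt{A}}(\mathtt{S},\mathtt{E};\mathtt{A}=\text{young})\,\Pr(\mathtt{S}) \\
\Pr(\mathtt{A}=\text{young},\mathtt{E},\mathtt{O},\mathtt{R},\mathtt{T}) &= \sum_{\mathtt{S}} \Pr(\mathtt{A}=\text{young},\mathtt{S},\mathtt{E},\mathtt{O},\mathtt{R},\mathtt{T}) \\
&= \psi_{\mathtt{S}}(\mathtt{E};\mathtt{A}=\text{young})\,\Pr(\mathtt{T}\mid\mathtt{O},\mathtt{R})\,\Pr(\mathtt{O}\mid\mathtt{E})\,\Pr(\mathtt{R}\mid\mathtt{E}).
\end{aligned} \tag{5}
$$

Again, the crucial observation is that S appears in only two factors of Equation (4), which, after the summation over S, can be replaced by a factor $\psi_{\mathtt{S}}(\mathtt{E};\mathtt{A}=\text{young})$, a table indexed by E and containing only entries with $\mathtt{A}=\text{young}$.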

The third variable in $\sigma$ is the free variable $\mathtt{T}$. As with the previous two cases, the processing of a free variable corresponds to the computation of a new factor from the natural join over all factors that involve it. Unlike the previous cases, however, where the natural join was followed by a selection of a subset of entries or a summation, no such operation is applied on the natural join in this case. This computation corresponds to the two equations below.

$$
\begin{aligned}
\psi_{\mathtt{T}}(\mathtt{O},\mathtt{R};\mathtt{T}) &= \Pr(\mathtt{T}\mid\mathtt{O},\mathtt{R}) \\
\Pr(\mathtt{A}=\text{young},\mathtt{E},\mathtt{O},\mathtt{R},\mathtt{T}) &= \psi_{\mathtt{S}}(\mathtt{E};\mathtt{A}=\text{young})\,\psi_{\mathtt{T}}(\mathtt{O},\mathtt{R};\mathtt{T})\,\Pr(\mathtt{O}\mid\mathtt{E})\,\Pr(\mathtt{R}\mid\mathtt{E}).
\end{aligned} \tag{6}
$$

As in the previous cases, the factors that involve T (in this example it is only $\Pr(\mathtt{T}\mid\mathtt{O},\mathtt{R})$) are replaced with a factor $\psi_{\mathtt{T}}(\mathtt{O},\mathtt{R};\mathtt{T})$, the table of which is indexed by variables O and R, but also contains a column for the free variable T.

The procedure described above for the first three variables of $\sigma$ is repeated for the remaining variables, and constitutes the variable-elimination algorithm [zhang1994simple]. To summarize, the variable-elimination algorithm considers variables in the order of $\sigma$. For each variable $V$, the algorithm computes a natural join over the factors that involve $V$; if $V \in Y_q$ or $V \in Z_q$, it then performs a selection or group summation, respectively; and it uses the result to replace the factors that involve $V$.
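The summary above can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the authors' code: factors are `(scope, table)` pairs, a selection drops the bound column (so that the resulting table is indexed only by the remaining variables), the probability values are random placeholders, and the elimination order beyond the first three variables is an assumption. All function names are hypothetical.

```python
import itertools
import random

DOMAINS = {
    "A": ["young", "adult", "old"], "S": ["female", "male"],
    "E": ["high-school", "university"], "O": ["employed", "self-employed"],
    "R": ["small", "big"], "T": ["train", "car", "other"],
}
PARENTS = {"A": [], "S": [], "E": ["A", "S"], "O": ["E"], "R": ["E"], "T": ["O", "R"]}


def make_factor(var, rng):
    """Pr(var | parents) as (scope, table); the numbers are random placeholders."""
    scope = tuple(PARENTS[var]) + (var,)
    table = {}
    for pa in itertools.product(*(DOMAINS[p] for p in PARENTS[var])):
        weights = [rng.random() + 0.1 for _ in DOMAINS[var]]
        for value, w in zip(DOMAINS[var], weights):
            table[pa + (value,)] = w / sum(weights)
    return scope, table


def join(f, g):
    """Natural join: multiply the entries that agree on the shared variables."""
    (f_scope, f_table), (g_scope, g_table) = f, g
    scope = f_scope + tuple(v for v in g_scope if v not in f_scope)
    table = {}
    for values in itertools.product(*(DOMAINS[v] for v in scope)):
        row = dict(zip(scope, values))
        table[values] = (f_table[tuple(row[v] for v in f_scope)]
                         * g_table[tuple(row[v] for v in g_scope)])
    return scope, table


def select(f, var, value):
    """Keep the rows with var == value and drop the column of var."""
    scope, table = f
    i = scope.index(var)
    kept = {k[:i] + k[i + 1:]: p for k, p in table.items() if k[i] == value}
    return scope[:i] + scope[i + 1:], kept


def sum_out(f, var):
    """Group by all other variables and sum over the values of var."""
    scope, table = f
    i = scope.index(var)
    out = {}
    for k, p in table.items():
        key = k[:i] + k[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return scope[:i] + scope[i + 1:], out


def eliminate(factors, order, free, bound):
    """Process variables in `order`; `free` is X_q and `bound` maps Y_q to y_q."""
    factors = list(factors)
    for var in order:
        involved = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        psi = involved[0]
        for f in involved[1:]:
            psi = join(psi, f)
        if var in bound:                 # bound variable: row selection
            psi = select(psi, var, bound[var])
        elif var not in free:            # variable in Z_q: summation
            psi = sum_out(psi, var)
        factors = rest + [psi]           # psi replaces the factors it consumed
    result = factors[0]
    for f in factors[1:]:                # only needed for disconnected networks
        result = join(result, f)
    return result


rng = random.Random(0)
FACTORS = [make_factor(var, rng) for var in DOMAINS]
ORDER = ["A", "S", "T", "E", "O", "R"]   # order after A, S, T is an assumption
answer = eliminate(FACTORS, ORDER, free={"T"}, bound={"A": "young"})
```

Here `answer` is the table for the query Pr(T, A=young), indexed by the values of T. Joining all factors first and then selecting and summing gives the same table, but it materializes the full joint, which is what variable elimination avoids.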

Elimination tree. The variable-elimination algorithm gives rise to a graph, like the one shown in Figure 2 for the example we discussed. Each node is associated with a factor and there is a directed edge between two factors if one is used for the computation of the other. In particular, each leaf node corresponds to one of the factors that define the Bayesian network. In our running example, these are the factors that appear in Equation (1). Each internal node corresponds to a factor that is computed from its children (i.e., the factors that correspond to its incoming edges), and replaces them in the variable-elimination algorithm. Moreover, as we saw, each internal node corresponds to one variable. The last factor computed is the answer to the query.

Notice that the graph constructed in this manner is either a tree or a forest. It is not difficult to see that the elimination graph is a tree if and only if the corresponding Bayesian network is a weakly connected dag. To simplify our discussion, we will focus on connected Bayesian networks and deal with an elimination tree for each query. All our results can be directly extended to the case of forests.

Notice that the exact form of the factor for each internal node of $T$ depends on the query. For the elimination tree in Figure 2, we have factor $\psi_{\mathtt{A}}(\mathtt{S},\mathtt{E};\mathtt{A}=\text{young})$ on the node that corresponds to variable A. However, if the query contained variable A as a free variable rather than bound to value $\text{young}$, then the same node in $T$ would contain a factor $\psi_{\mathtt{A}}(\mathtt{S},\mathtt{E};\mathtt{A})$. And if the query did not contain variable A, then the same node in $T$ would contain a factor $\psi_{\mathtt{A}}(\mathtt{S},\mathtt{E})$. On the other hand, the structure of the tree, the factors that correspond to leaf nodes, and the variables that index the factors of internal nodes are query-independent.

Note. The elimination algorithm we use here differs slightly from the one presented by Zhang et al. [zhang1994simple]. Specifically, the variable-elimination algorithm of Zhang et al. computes the factors associated with the bound variables at a special initialization step, which leads to benefits in practice (even though the running time remains super-polynomial in the worst case). On the other hand, we compute factors in absolute accordance with the elimination order. This allows us to consider the variable-elimination order fixed for all variables independently of the query.

Materialization of factors. Materializing factor tables for internal nodes of the elimination tree can speed up the computation of queries that require those factors. As we saw in the previous example, factors are computed in a sequence of steps, one for each variable, and each step involves the natural join over other factor tables, followed by: (i) either variable summation (to sum out variables in $Z_q$); or (ii) row selection (for variables in $Y_q$); or (iii) no operation (for variables in $X_q$). In what follows, we focus on materializing factors that involve only variable summation, the first of these three types of operations. Materializing such factors is often useful for multiple queries and sufficient to make the case for the materialization of factors that lead to the highest performance gains over a given query workload. Dealing with the materialization of general factors is a rather straightforward albeit non-trivial extension which, due to space constraints, will be the topic of future work.

To formalize our discussion, let us introduce some notation. Given a node $u$ in an elimination tree $T$, we write $T_u$ to denote the subtree of $T$ that is rooted at node $u$. We also write $X_u$ to denote the subset of variables of $\mathcal{N}$ that are associated with the nodes of $T_u$. Finally, we write $A_u$ to denote the set of ancestors of $u$ in $T$, that is, all nodes between $u$ and the root of the tree $T$, excluding $u$.

Computing a factor for a query incurs a computational cost. We distinguish two notions of cost: first, if the children factors of a node $u$ in the elimination tree $T$ are given as input, computing $u$ incurs a partial cost of computing the factor of $u$ from its children; second, starting from the factors that define the Bayesian network, the total cost of computing a node $u$ includes the partial costs of computing all intermediate factors, from the leaf nodes to $u$. Formally, we have the following definitions.

###### Definition 1 (Partial-Cost)

The partial cost $c(u)$ of a node $u$ in the elimination tree $T$ is the computational effort required to compute the corresponding factor given the factors of its children nodes.

###### Definition 2 (Total-Cost)

The total cost $b(u)$ of a node $u$ in the elimination tree $T$ is the total cost of computing the factor at node $u$, i.e.,

$$
b(u) = \sum_{x \in T_u} c(x),
$$

where $c(x)$ is the partial cost of node $x$.
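As a small illustration of Definition 2, the sketch below computes total costs bottom-up on a hypothetical elimination tree. Only internal nodes are modeled, each named after the variable it eliminates; the tree shape and the partial costs are made-up values, and leaf factors are assumed to contribute zero cost because they are given.

```python
# Children of each internal node in a hypothetical elimination tree.
CHILDREN = {"R": ["O"], "O": ["T", "E"], "E": ["S"], "S": ["A"], "T": [], "A": []}

# Hypothetical partial costs c(u), e.g. the number of rows touched by the join.
PARTIAL_COST = {"A": 12, "S": 4, "T": 12, "E": 8, "O": 24, "R": 6}


def total_cost(u):
    """b(u): sum of the partial costs c(x) over all nodes x of the subtree T_u."""
    return PARTIAL_COST[u] + sum(total_cost(child) for child in CHILDREN[u])


TOTAL_COST = {u: total_cost(u) for u in CHILDREN}
```

With these numbers the total cost grows towards the root, from 12 at node A up to 66 at node R, which is why materializing a node high in the tree saves more work whenever it is usable.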

When we say that we materialize a node $u$, we mean that we materialize the factor that is the result of summing out all variables below it on $T$. When is a materialized factor useful for a query $q$? Intuitively, it is useful if it is one of the factors computed during the evaluation of $q$, in which case we save the total cost of computing it from scratch, provided that there is no other materialized factor that could be used in its place, with greater savings in cost. The following definition of usefulness formalizes this intuition.

###### Definition 3 (Usefulness)

Let $q$ be a query, and $R$ a set of nodes of the elimination tree $T$ that are materialized. We say that a node $u$ is useful for the query $q$ with respect to the set of nodes $R$, if (i) $u \in R$; (ii) $X_u \subseteq Z_q$; and (iii) there is no other node $v \in A_u$ for which conditions (i) and (ii) hold.

To indicate that a node $u$ is useful for the query $q$ with respect to a set of nodes $R$ with materialized factors, we use the indicator function $\delta_q(u;R)$. That is, $\delta_q(u;R) = 1$ if node $u$ is useful for the query $q$ with respect to the set of nodes $R$, and $\delta_q(u;R) = 0$ otherwise.

When a materialized node $u$ is useful for a query $q$, it saves us the total cost $b(u)$ of computing it from scratch. Considering a query workload, where different queries appear with different probabilities, we define the benefit of a set of materialized nodes as the total cost we save in expectation.

###### Definition 4 (Benefit)

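Consider an elimination tree $T$, a set of nodes $R \subseteq T$, and query probabilities $\Pr(q)$ for the set of all possible queries $Q$. The benefit $B(R)$ of the node set $R$ is defined as:

$$
B(R) = \sum_{q} \Pr(q) \sum_{u \in R} \delta_q(u;R)\, b(u) = \sum_{u \in R} \Pr(\delta_q(u;R) = 1)\, b(u) = \sum_{u \in R} \mathbb{E}[\delta_q(u;R)]\, b(u).
$$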
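Definitions 3 and 4 can be sketched together in a few lines of Python. The tree, the total costs, and the three-query workload below are hypothetical; each query is represented only by its probability and its set of summed-out variables, which is all the usefulness test needs.

```python
# PARENT[u] is the parent of internal node u in a hypothetical elimination
# tree (None for the root); nodes are named after the variable they eliminate.
PARENT = {"A": "S", "S": "E", "E": "O", "T": "O", "O": "R", "R": None}

# Hypothetical total costs b(u).
TOTAL_COST = {"A": 12, "S": 16, "T": 12, "E": 24, "O": 60, "R": 66}


def ancestors(u):
    """A_u: all nodes strictly above u, up to and including the root."""
    out = []
    while PARENT[u] is not None:
        u = PARENT[u]
        out.append(u)
    return out


def subtree_vars(u):
    """X_u: variables associated with the nodes of the subtree rooted at u."""
    return {v for v in PARENT if v == u or u in ancestors(v)}


def useful(u, materialized, summed_out):
    """delta_q(u; R) for a query whose set of summed-out variables is Z_q."""
    def qualifies(v):
        return v in materialized and subtree_vars(v) <= summed_out
    return qualifies(u) and not any(qualifies(v) for v in ancestors(u))


def benefit(materialized, workload):
    """B(R): expected saved cost; `workload` is a list of (Pr(q), Z_q) pairs."""
    return sum(
        prob * sum(TOTAL_COST[u] for u in materialized
                   if useful(u, materialized, z_q))
        for prob, z_q in workload
    )


WORKLOAD = [
    (0.5, {"A", "S", "E", "O", "T"}),  # e.g. Pr(R = big)
    (0.3, {"A", "S", "E", "R"}),       # e.g. Pr(T, O = employed)
    (0.2, {"S", "E", "O", "R"}),       # e.g. Pr(T, A = young)
]
```

In this toy workload, materializing node O alone is worth 0.5 · 60 = 30, node E alone is worth 0.8 · 24 = 19.2, and both together are worth 37.2 rather than 49.2: for the first query, node O shadows its descendant E, exactly as condition (iii) of Definition 3 prescribes.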

Problem definition. We can now define formally the problem we consider: for a space budget $K$, our goal is to select a set of factors to materialize so as to achieve optimal benefit.

###### Problem 1

Given a Bayesian network $\mathcal{N}$, an elimination tree $T$ for answering probability queries over $\mathcal{N}$, and budget $K$, select a set of nodes $R \subseteq T$ to materialize, whose total size is at most $K$, so as to maximize the benefit $B(R)$.

For simplicity of exposition, we also consider a version of the problem where we are given a total budget $k$ on the number of nodes that we can materialize. We first present algorithms for Problem 2 in Section 4, and discuss how to address the more general Problem 1 in Section 5.

###### Problem 2

Given a Bayesian network $\mathcal{N}$, an elimination tree $T$ for answering probability queries over $\mathcal{N}$, and an integer $k$, select at most $k$ nodes $R \subseteq T$ to materialize so as to maximize the benefit $B(R)$.

## 4 Algorithms

This section focuses on algorithms for Problem 2: Section 4.1 presents an exact polynomial-time dynamic-programming algorithm, and Section 4.3 discusses a greedy algorithm, which yields improved time complexity but provides only an approximate solution, albeit with a quality guarantee.

### 4.1 Dynamic programming

We discuss our dynamic-programming algorithm in three steps. First, we introduce the notion of partial benefit that allows us to explore partial solutions for the problem. Second, we demonstrate the optimal-substructure property of the problem, and third, we present the algorithm.

Partial benefit. In Definition 4 we defined the (total) benefit of a subset of nodes $R \subseteq T$ (i.e., a potential solution) for the whole elimination tree $T$. Here we define the partial benefit of a subset of nodes $R$ for the subtree $T_u$ of a given node $u$ of the elimination tree $T$.

###### Definition 5 (partial benefit)

Consider an elimination tree $T$, a subset of nodes $R \subseteq T$, and probabilities $\Pr(q)$ for the set of all possible queries $Q$. The partial benefit $B_u(R)$ of the node set $R$ at a given node $u \in T$ is

$$
B_u(R) = \sum_{v \in R \cap T_u} \mathbb{E}[\delta_q(v;R)]\, b(v).
$$

The following lemma states that, given a set of nodes $R$ and a node $u \in R$, the probability that $u$ is useful for a random query with respect to $R$ depends only on the lowest ancestor of $u$ in $R$.

###### Lemma 1

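Consider an elimination tree $T$ and a set $R$ of nodes. Let $u, v \in R$ be such that $v \in A_u$ and no node strictly between $u$ and $v$ on the path from $u$ to the root belongs to $R$. Then we have:

$$
\mathbb{E}[\delta_q(u;R)] = \mathbb{E}[\delta_q(u;v)],
$$

where the expectation is taken over a distribution of queries $Q$.

To prove the lemma, we will show that for any query $q$, it is $\delta_q(u;R) = 1$ if and only if $\delta_q(u;v) = 1$.

We first show that $\delta_q(u;R) = 1$ implies $\delta_q(u;v) = 1$. From Definition 3, we have that $\delta_q(u;R) = 1$ if $X_u \subseteq Z_q$ and there is no $w \in A_u \cap R$ such that $X_w \subseteq Z_q$. Given that $v \in A_u$ and $v \in R$, it follows that $X_v \not\subseteq Z_q$, hence, $\delta_q(u;v) = 1$. Conversely, we show that $\delta_q(u;v) = 1$ implies $\delta_q(u;R) = 1$. Notice that $\delta_q(u;v) = 1$ if $X_u \subseteq Z_q$ and $X_v \not\subseteq Z_q$. This means that, for all $w \in A_v$ we have $X_w \not\subseteq Z_q$, since $X_v \subseteq X_w$. Given also that no node of $R$ lies strictly between $u$ and $v$, we have that for all $w \in A_u \cap R$ it is $X_w \not\subseteq Z_q$, hence, $\delta_q(u;R) = 1$.

Given the one-to-one correspondence between the set of queries in which $\delta_q(u;R) = 1$ and the set of queries in which $\delta_q(u;v) = 1$, we have $\Pr(\delta_q(u;R) = 1) = \Pr(\delta_q(u;v) = 1)$, hence, the result follows.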

Building upon Lemma 1, we arrive at Lemma 2 below, which states that the partial benefit of a node set $R$ at a node $u$ depends only on (i) the nodes of $T_u$ that are included in $R$, and (ii) the lowest ancestor $v$ of $u$ in $R$, and therefore it does not depend on what other nodes “above” $v$ are included in $R$.

For the proof of Lemma 2, we introduce some additional notation. Let $T$ be an elimination tree, $u$ a node of $T$, and $T_u$ the subtree of $T$ rooted at $u$. Let $R$ be a set of nodes. For each node $w \in T_u$, we define $a_R(w)$ to be the lowest ancestor of $w$ that is included in $R$.

###### Lemma 2

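Consider an elimination tree $T$ and a node $u \in T$. Let $v \in A_u$ be an ancestor of $u$. Consider two sets of nodes $R$ and $R'$ for which

• $v \in R$ and $v \in R'$;

• $R \cap T_u = R' \cap T_u$; and

• no node strictly between $u$ and $v$ belongs to $R$ or to $R'$.

Then, we have: $B_u(R) = B_u(R')$.

From direct application of Lemma 1, we have $\mathbb{E}[\delta_q(w;R)] = \mathbb{E}[\delta_q(w;a_R(w))]$, for all $w \in R \cap T_u$, and similarly, we have $\mathbb{E}[\delta_q(w;R')] = \mathbb{E}[\delta_q(w;a_{R'}(w))]$ for all $w \in R' \cap T_u$. Now, given that $R \cap T_u = R' \cap T_u$ and that $v$ is the lowest ancestor of $u$ in both $R$ and $R'$, we have $a_R(w) = a_{R'}(w)$ for all $w \in R \cap T_u$. It then follows that for all $w \in R \cap T_u$, we have $\mathbb{E}[\delta_q(w;R)] = \mathbb{E}[\delta_q(w;R')]$. Putting everything together, we get

$$
B_u(R) = \sum_{w \in R \cap T_u} \mathbb{E}[\delta_q(w;R)]\, b(w) = \sum_{w \in R' \cap T_u} \mathbb{E}[\delta_q(w;R')]\, b(w) = B_u(R').
$$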

Let $\epsilon$ be a special node, which we will use to denote that no ancestor of a node $u$ is included in a solution $R$. We define $\bar{A}_u = A_u \cup \{\epsilon\}$ as the extended set of ancestors of $u$, which adds $\epsilon$ into $A_u$. Notice that $A_u$ corresponds to the set of ancestors of $u$ up to and including the root of $T$.

Optimal substructure. In Lemma 3 we present the optimal-substructure property for Problem 2. Lemma 3 builds upon Lemma 2 and states that among nodes of the optimal solution, the subset of nodes that fall within a given subtree depends only on the nodes of the subtree and the lowest ancestor of the subtree that is included in the optimal solution.

###### Lemma 3 (Optimal Substructure)

Given an elimination tree and an integer budget , let denote the optimal solution to Problem 2. Consider a node and let be the lowest ancestor of that is included in . Let denote the set of nodes in the optimal solution that reside in and let . Then,

 R∗u=argmaxRu⊆Tu|Ru|=κ∗u{Bu(Ru∪{v})}.

First, notice that the sets and satisfy the pre-conditions of Lemma 2, and thus,

 Bu(R∗)=Bu(R∗u∪{v}). (7)

Now, to achieve a contradiction, assume that there exists a set such that and

 Bu(R∗u∪{v})

Let denote the solution obtained by replacing the node set in by . Again and satisfy the preconditions of Lemma 2, and thus,

 Bu(R′)=Bu(R′u∪{v}). (9)

As before, for we define and to be the lowest ancestor of in and in , respectively. Given that , we have , hence, for all :

$$
E[\delta_q(w; R^*)] = E[\delta_q(w; R')]. \tag{10}
$$

Putting together Equations (7)–(10), we get

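$$
\begin{aligned}
B(R') &= \sum_{w \in R'} E[\delta_q(w; R')]\, b(w) \\
&= \sum_{w \in R'_u} E[\delta_q(w; R')]\, b(w) + \sum_{w \in R^* \setminus R^*_u} E[\delta_q(w; R')]\, b(w) \\
&= B_u(R'_u \cup \{v\}) + \sum_{w \in R^* \setminus R^*_u} E[\delta_q(w; R')]\, b(w) \\
&= B_u(R'_u \cup \{v\}) + \sum_{w \in R^* \setminus R^*_u} E[\delta_q(w; R^*)]\, b(w) \\
&> B_u(R^*_u \cup \{v\}) + \sum_{w \in R^* \setminus R^*_u} E[\delta_q(w; R^*)]\, b(w) \\
&= \sum_{w \in R^*_u} E[\delta_q(w; R^*)]\, b(w) + \sum_{w \in R^* \setminus R^*_u} E[\delta_q(w; R^*)]\, b(w) \\
&= B(R^*),
\end{aligned}
$$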

which is a contradiction since is the optimal solution of Problem 2, and thus, .

The following lemma provides a bottom-up approach for combining partial solutions computed on subtrees. In the rest of the section, we present our results on binary trees. This assumption is made without loss of generality, as any tree of higher arity can be converted into a binary tree by introducing dummy nodes; furthermore, by assigning an appropriate cost to the dummy nodes, we can ensure that they will not be selected by the algorithm.

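###### Lemma 4

Consider an elimination tree $T$, a node $u$, and a set $R_u$ of nodes in $T_u$. Let $r(u)$ and $\ell(u)$ be the right and left children of $u$, and let $R_{r(u)} = R_u \cap T_{r(u)}$ and $R_{\ell(u)} = R_u \cap T_{\ell(u)}$. Then, for any node $v \in \bar{A}_u$, we have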

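$$
B_u(R_u \cup \{v\}) =
\begin{cases}
B_u(\{u,v\}) + B_{r(u)}(R_{r(u)} \cup \{u\}) + B_{\ell(u)}(R_{\ell(u)} \cup \{u\}), & \text{if } u \in R_u, \\
B_{r(u)}(R_{r(u)} \cup \{v\}) + B_{\ell(u)}(R_{\ell(u)} \cup \{v\}), & \text{otherwise.}
\end{cases}
$$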

We show the result in the case of : notice that since the node cannot be the lowest solution ancestor of any node in . Given also that no node in can have an ancestor in and vice versa, following Lemma 1, we have:

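$$
\begin{aligned}
B_u(R_u \cup \{v\}) &= \sum_{w \in R_u} E[\delta_q(w; R_u \cup \{v\})]\, b(w) \\
&= E[\delta_q(u; \{u,v\})]\, b(u) + \sum_{w \in R_{r(u)}} E[\delta_q(w; R_{r(u)} \cup \{u\})]\, b(w) + \sum_{w \in R_{\ell(u)}} E[\delta_q(w; R_{\ell(u)} \cup \{u\})]\, b(w) \\
&= B_u(\{u,v\}) + B_{r(u)}(R_{r(u)} \cup \{u\}) + B_{\ell(u)}(R_{\ell(u)} \cup \{u\}).
\end{aligned}
$$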

The case $u \notin R_u$ is similar and we omit the details for brevity.
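The binary-tree assumption made before Lemma 4 can be realized in several ways. The following is a minimal sketch of one of them, not the authors' implementation: surplus children are folded into a chain of dummy nodes. The `Node` class, the `is_dummy` flag, and the function names are hypothetical and introduced only for illustration; a dynamic program run on the result would have to treat flagged nodes as non-selectable.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Iterator, List, Optional


@dataclass
class Node:
    # Hypothetical tree node; `is_dummy` marks nodes that must never be selected.
    name: str
    children: List["Node"] = field(default_factory=list)
    is_dummy: bool = False


def binarize(node: Node, ids: Optional[Iterator[int]] = None) -> Node:
    """Return a copy of the tree in which every node has at most two children."""
    ids = ids if ids is not None else count()
    kids = [binarize(child, ids) for child in node.children]
    # Repeatedly group the last two children under a fresh dummy node.
    while len(kids) > 2:
        dummy = Node(f"dummy{next(ids)}", kids[-2:], is_dummy=True)
        kids = kids[:-2] + [dummy]
    return Node(node.name, kids, node.is_dummy)


def walk(node: Node) -> Iterator[Node]:
    """Pre-order traversal, used here only to inspect the result."""
    yield node
    for child in node.children:
        yield from walk(child)


root = Node("r", [Node("a"), Node("b"), Node("c"), Node("d")])
binary_root = binarize(root)
print([(n.name, [c.name for c in n.children]) for n in walk(binary_root)])
```

The ancestor-descendant relations among the original nodes are preserved, since dummy nodes are only inserted between a node and its own children.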

Dynamic programming. Finally, we discuss how to use the structural properties shown above to devise the dynamic-programming algorithm. We first define the data structures that we use. Consider a node $u$ in the elimination tree, a node $v \in \bar{A}_u$, and an integer $\kappa$ between 1 and $k$. We define $F(u,\kappa,v)$ to be the optimal value of the partial benefit $B_u(R)$ over all sets of nodes $R$ that satisfy the following three conditions:

• ;

• ; and

• .

Condition (i) states that the node set $R$ has at most $\kappa$ nodes in the subtree $T_u$; condition (ii) states that node $v$ is contained in $R$; and condition (iii) states that no other node between $u$ and $v$ is contained in $R$, i.e., node $v$ is the lowest ancestor of $u$ in $R$.

For all $u$, $\kappa$, $v$, and sets $R$ that satisfy conditions (i)–(iii), we also define $F^+(u,\kappa,v)$ and $F^-(u,\kappa,v)$ to denote the optimal partial benefit for the cases when $u \in R$ and $u \notin R$, respectively. Hence, we have

$$
F(u,\kappa,v) = \max\{F^+(u,\kappa,v),\, F^-(u,\kappa,v)\}.
$$

We assume that the special node belongs in all solution sets but does not count towards the size of . Notice that is the optimal partial benefit for all sets that have at most nodes in and no ancestor of belongs to . We now show how to compute for all , , and by a bottom-up dynamic-programming algorithm:

• If $u$ is a leaf of the elimination tree then

$$
F^-(u,1,v) = 0, \quad \text{for all } v \in \bar{A}_u,
$$

and

$$
F^+(u,1,v) = -\infty, \quad \text{for all } v \in \bar{A}_u.
$$

This initialization enforces leaf nodes, i.e., the nodes that correspond to factors that define the Bayesian network, not to be selected into the solution, as they are considered part of the input.

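• If $u$ is not a leaf of the elimination tree then

$$
F^+(u,\kappa,v) = B_u(\{u,v\}) + \max_{\kappa_\ell + \kappa_r = \kappa - 1} \big\{ F(\ell(u),\kappa_\ell,u) + F(r(u),\kappa_r,u) \big\}
$$

and

$$
F^-(u,\kappa,v) = \max_{\kappa_\ell + \kappa_r = \kappa} \big\{ F(\ell(u),\kappa_\ell,v) + F(r(u),\kappa_r,v) \big\}.
$$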

The value of the optimal solution to the problem is returned by , where is the root of the elimination tree.

To compute the entries of the table , for all , , and , we proceed in a bottom-up fashion. For each node , once all entries for the nodes in the subtree of have been computed, we compute , for all , and all . Hence, for computing each entry , we only need entries that have already been computed.

Once all the entries of $F(u,\kappa,v)$ are computed, we construct the optimal solution by backtracking in a top-down fashion, specifically, by calling the subroutine ConstructSolution, whose pseudocode is given in Algorithm 1.
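To make the recurrences and the backtracking step concrete, here is a minimal Python sketch under several assumptions that are ours and not the paper's: the table is filled by memoized recursion rather than an explicit bottom-up loop, budgets start at $\kappa = 0$ so that leaves simply have value 0, `None` stands in for the special "no selected ancestor" node, and the pairwise benefit $B_u(\{u,v\})$ is supplied by a hypothetical callable `pair_benefit(u, v)`. It is not the authors' Algorithm 1.

```python
from collections import deque
from functools import lru_cache

NEG_INF = float("-inf")


def solve(children, root, k, pair_benefit):
    """children: dict mapping each node to a list of 0, 1, or 2 children.
    Returns (optimal benefit, set of selected nodes) for a budget of at most k."""

    def best_split(u, budget, anc):
        # Best division of `budget` among the children of u, which all see `anc`
        # as their lowest selected ancestor.
        kids = children[u]
        if len(kids) == 1:
            return F(kids[0], budget, anc)[0], (budget,)
        left, right = kids
        best = (NEG_INF, None)
        for k_left in range(budget + 1):
            val = F(left, k_left, anc)[0] + F(right, budget - k_left, anc)[0]
            if val > best[0]:
                best = (val, (k_left, budget - k_left))
        return best

    @lru_cache(maxsize=None)
    def F(u, kappa, v):
        # Returns (value, whether u is selected, budget split for the children).
        if not children[u]:  # leaf: an input factor, never selected
            return 0.0, False, ()
        minus_val, minus_split = best_split(u, kappa, v)       # F^-: u not selected
        plus_val, plus_split = NEG_INF, None
        if kappa >= 1:                                         # F^+: u selected
            sub_val, plus_split = best_split(u, kappa - 1, u)
            plus_val = pair_benefit(u, v) + sub_val
        if plus_val > minus_val:
            return plus_val, True, plus_split
        return minus_val, False, minus_split

    # Top-down (breadth-first) backtracking over the stored decisions.
    selected, queue = set(), deque([(root, k, None)])
    while queue:
        u, kappa, v = queue.popleft()
        _, take, split = F(u, kappa, v)
        if take:
            selected.add(u)
            v = u
        for child, budget in zip(children[u], split):
            queue.append((child, budget, v))
    return F(root, k, None)[0], selected


# Toy instance (all numbers are made up for illustration).
children = {"r": ["a", "b"], "a": ["f1", "f2"], "b": ["f3", "f4"],
            "f1": [], "f2": [], "f3": [], "f4": []}
p = {"r": 0.1, "a": 0.5, "b": 0.4}   # assumed E[delta_q(u; no ancestor)]
b = {"r": 10.0, "a": 4.0, "b": 3.0}  # assumed benefits b(u)


def pair_benefit(u, v):
    # Uses the relation of Lemma 5 for an ancestor v.
    return (p[u] - (p[v] if v is not None else 0.0)) * b[u]


for budget in (1, 2, 3):
    print(budget, solve(children, "r", budget, pair_benefit))
```

Because each call to `F` tries at most $\kappa + 1$ budget splits, the work per table entry is linear in the budget, which matches the per-entry cost discussed in the running-time analysis.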

###### Theorem 1

The dynamic-programming algorithm described above correctly computes the optimal solution $R^*$.

The correctness of the bottom-up computation of $F^+$ and $F^-$ follows from Lemmas 3 and 4. Once we fill the table, we have, for each node $u$, the optimal partial benefit for all possible combinations of partial solution size $\kappa$ and lowest solution ancestor $v$ of $u$:

$$
F(u,\kappa,v) = \max_{\substack{R_u \subseteq T_u \\ |R_u| = \kappa}} B_u(R_u \cup \{v\}).
$$

Moreover, each entry indicates whether would be included in any solution in which () nodes are selected from into and () and . Once we fill all the entries of the table, the optimal solution is constructed by Algorithm 1 that performs a BFS traversal of the tree: the decision to select each visited node into is given based on its inclusion state indicated by the entry , where is the lowest ancestor of in solution that is added to the solution before visiting and is the optimal partial budget allowance for , which are both determined by the decisions taken in previous layers before visiting node .

Notice that for each node the computation of the entries requires the computation of partial benefit values for pairs of nodes (, ), which in turn, require access to or computation of values . As Lemma 5 below shows, the latter quantity can be computed from and , for all and . In practice, it is reasonable to consider a setting where one has used historical query logs to learn empirical values for and thus for .

###### Lemma 5

Let $u$ be a given node in an elimination tree $T$ and let $v$ denote an ancestor of $u$. Then,

$$
E[\delta_q(u; v)] = E[\delta_q(u; \emptyset)] - E[\delta_q(v; \emptyset)].
$$

Notice that for any possible query , whenever , we also have , since . This suggests that given any query for which , we also have . On the other hand, when , there can be two cases: () , which implies , and () there exists a node such that , which implies . The latter suggests that the event occurs for a subset of queries for which the event occurs. The lemma follows.
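As a small illustration of how Lemma 5 could be used with a historical query log, the sketch below first estimates the marginal values $E[\delta_q(u;\emptyset)]$ as empirical frequencies and then derives $E[\delta_q(u;v)]$ by subtraction. The log format (one set of useful nodes per past query) and the function names are assumptions made for this example only.

```python
from collections import Counter


def estimate_marginals(log):
    """Empirical estimate of E[delta_q(u; empty set)] for every node seen in the log.

    `log` is assumed to be a list with one entry per past query, each entry being
    the set of nodes that would have been useful for that query on their own.
    """
    counts = Counter(node for useful in log for node in useful)
    return {node: c / len(log) for node, c in counts.items()}


def conditional_usefulness(marginals, u, v=None):
    """E[delta_q(u; v)] for an ancestor v of u, via Lemma 5; v=None means no ancestor."""
    if v is None:
        return marginals.get(u, 0.0)
    return marginals.get(u, 0.0) - marginals.get(v, 0.0)


# Toy log over a tree in which r is the parent of a and b: whenever r is
# useful, so are its descendants a and b.
log = [{"r", "a", "b"}, {"a"}, {"b"}, set()]
marginals = estimate_marginals(log)
print(marginals)
print(conditional_usefulness(marginals, "a", "r"))
```

The subtraction is only meaningful when `v` is an ancestor of `u`, which is the setting of the lemma; the sketch does not check this.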

Finally, the running time of the algorithm follows directly from the time needed to compute all entries of the dynamic-programming table. As with all lemmas and theorems, the proof can be found in the extended version [us-arxiv].

###### Theorem 2

The running time of the dynamic-programming algorithm is $O(nhk^2)$, where $n$ is the number of nodes in the elimination tree, $h$ is its height, and $k$ is the number of nodes that we ask to materialize.

Notice that we have subproblems, where each subproblem corresponds to an entry of the three-dimensional table. To fill each entry of the table, we need to compute the two distinct values of and that maximize (subject to ) and (subject to ), respectively. Thus, it takes time to fill each entry of the table in a bottom-up fashion, hence, the overall running time is