1 Introduction
Since calculating the exact probability of a query over probabilistic databases (PDBs) is #P-hard in general, probabilistic inference is a major bottleneck in several related applications, such as statistical relational learning [Raedt and Kersting 2017], and fast approximation methods are needed. Anytime approximation methods give approximate answers quickly, yet allow the user to refine the answer by spending additional time and resources. The current state of the art in anytime approximation for PDBs is based either on (i) sampling [Ré, Dalvi, and Suciu 2007] or on (ii) the model-based branch-and-bound approach by [Fink, Huang, and Olteanu 2013] implemented in the SPROUT system, where the latter outperforms the former. See [Van den Broeck and Suciu 2017] for a recent survey.
We propose here to improve the branch-and-bound approach by replacing model-based bounds with novel dissociation-based bounds, which were shown to dominate model-based bounds [Gatterbauer and Suciu 2014]. The technique of dissociation is related to the variable splitting framework by [Choi and Darwiche 2010], and has so far only been applied at the first-order (query) level for upper bounds [Gatterbauer and Suciu 2017]. One reason is that for conjunctive queries, these dissociation-based upper bounds are uniquely defined and proven to be better than any model-based bounds. In contrast, a whole spectrum of optimal oblivious lower bounds exists, which includes the model-based bounds as special cases. Important ingredients of our approach are strategies for quickly choosing good lower bounds, as well as a novel heuristic for the “branch” part of the algorithm based on influence or sensitivity [Kanagal, Li, and Deshpande 2011].
2 Background
The problem. We consider the evaluation of Boolean conjunctive queries (CQs) and illustrate our problem and approach with the following query over PDB :
a b a c 0.3 a d 0.4 b d 0.5 c 0.4 d 0.8
Grounding the query over the PDB leads to a propositional formula, called the “lineage” of the query over the PDB, in which each variable represents a tuple in the database. In our example, . Computing the probability of the lineage is equivalent to computing the probability that the query is true over the PDB. In general, calculating this probability is #P-hard in the number of variables in the lineage, and thus in the number of tuples in the database. We are interested in developing an approximation scheme that allows us to trade off available time against the required accuracy of the approximation.
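As a minimal sketch of what computing the probability of a lineage involves, the following Python snippet enumerates all possible worlds of a small monotone DNF lineage under tuple independence. The lineage (x ∧ y) ∨ (x ∧ z), its probabilities, and the name `lineage_probability` are hypothetical illustrations and are not the running example of this paper.

```python
from itertools import product

def lineage_probability(clauses, probs):
    """Exact probability of a monotone DNF lineage by enumerating possible worlds.

    clauses: list of tuples of variable names (each tuple is one conjunction).
    probs:   dict mapping each variable (tuple in the database) to its probability.
    """
    variables = sorted(probs)
    total = 0.0
    for world in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, world))
        # Weight of this world under tuple independence.
        weight = 1.0
        for v in variables:
            weight *= probs[v] if assignment[v] else 1.0 - probs[v]
        # The lineage holds if at least one conjunction has all its tuples present.
        if any(all(assignment[v] for v in clause) for clause in clauses):
            total += weight
    return total

# Hypothetical lineage (x AND y) OR (x AND z); assumed values, not from the paper.
clauses = [("x", "y"), ("x", "z")]
probs = {"x": 0.5, "y": 0.5, "z": 0.6}
p_exact = lineage_probability(clauses, probs)
print(p_exact)
```

The loop visits 2^n worlds for n variables, which illustrates why exact evaluation does not scale and why anytime approximation is of interest.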
State of the art. The algorithm underlying SPROUT [Fink, Huang, and Olteanu 2013] approximates the query probability by using lineage decompositions based on independence and determinism. When the smaller lineages obtained are read-once (i.e., have no multiple occurrences of the same variable), their probabilities are computed exactly in PTIME; otherwise they are approximated by model-based bounds: for each variable, all except one of its occurrences are set to true or 1 (resp. false or 0) while the remaining occurrence is assigned the original probability to get an upper (resp. lower) bound. SPROUT randomly selects these occurrences. The bounds are propagated back up, based on the decomposition, to obtain an approximation of the query probability. If this approximation is not accurate enough, Shannon expansion (SE) is applied on the decomposed lineages: a variable is selected, the lineage is split into the two cofactors obtained by setting that variable to true and to false, respectively, and the decomposition and approximation process continues on both cofactors until the desired accuracy is obtained. SPROUT selects the most frequent variable (the one with the highest number of occurrences, ties broken randomly) for SE. Clearly, the challenge is to limit the number of SEs, as these can double the size of the formulas, resulting in higher computation cost.
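The interplay of read-once evaluation and Shannon expansion can be sketched as follows. This is a simplified illustration under our own assumptions, not SPROUT's implementation: lineages are monotone DNFs given as lists of tuples, the read-once test only recognizes DNFs in which no variable repeats, no bounds are computed, and ties in the frequency heuristic are broken by first occurrence rather than randomly. The names `condition` and `probability` are hypothetical.

```python
from collections import Counter

def condition(clauses, var, value):
    """Cofactor of a monotone DNF: fix `var` to `value` and simplify.

    Returns True if the formula becomes true, otherwise the remaining clauses.
    """
    remaining = []
    for clause in clauses:
        if var not in clause:
            remaining.append(clause)
        elif value:
            rest = tuple(v for v in clause if v != var)
            if not rest:
                return True  # the clause is satisfied, so the whole DNF is true
            remaining.append(rest)
        # if value is False, a clause containing var can never hold and is dropped
    return remaining

def probability(clauses, probs):
    """Exact probability via Shannon expansion on the most frequent variable."""
    if clauses is True:
        return 1.0
    if not clauses:
        return 0.0
    counts = Counter(v for clause in clauses for v in clause)
    var, freq = counts.most_common(1)[0]
    if freq == 1:
        # Read-once case: the clauses share no variables and are independent.
        p_none = 1.0
        for clause in clauses:
            p_clause = 1.0
            for v in clause:
                p_clause *= probs[v]
            p_none *= 1.0 - p_clause
        return 1.0 - p_none
    # Shannon expansion: P = p * P(cofactor with var true) + (1 - p) * P(cofactor with var false).
    p = probs[var]
    return (p * probability(condition(clauses, var, True), probs)
            + (1.0 - p) * probability(condition(clauses, var, False), probs))

# Hypothetical lineage (x AND y) OR (x AND z); assumed values, not from the paper.
clauses = [("x", "y"), ("x", "z")]
probs = {"x": 0.5, "y": 0.5, "z": 0.6}
p_exact = probability(clauses, probs)
```

Each expansion spawns two subproblems, so in the worst case the recursion is exponential; an anytime scheme stops early and reports bounds instead of recursing to the end.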
3 Our approach
The branch-and-bound approach just described is generic: any technique for approximating the probability of lineages can be plugged in during decomposition, and similarly, any variable selection procedure for SE can be used.
We propose to use dissociation to obtain approximations. Dissociation replaces multiple occurrences of a variable with independent copies. Our example lineage becomes . By carefully assigning new probabilities to the copies of a variable, the probability of the dissociated lineage is guaranteed to be an upper bound, or a lower bound, for the probability of the original lineage. What makes dissociations particularly attractive for our framework is that dissociation bounds generalize model-based bounds, and that they can be calculated efficiently. As a consequence, we obtain better approximations in each step, thereby possibly reducing the number of SEs needed to obtain a desired accuracy. For SE, we propose a method based on influence. We next detail some aspects of our approach and highlight some of its advantages.
1. Better bounds. Following Theorem 4.8 in [Gatterbauer and Suciu 2014], the two green shaded areas in Fig. (a) show all possible assignments of probabilities to the two occurrences of the repeated variable in our lineage that are guaranteed to upper or lower bound the true probability. For upper bounds, each of the two model-based bounds provides one approximation. By contrast, assigning the original probability to both occurrences results in the unique and optimal upper dissociation bound.
For the lower bounds, the shaded area on the lower left shows all possible lower bound assignments, and all assignments on the curved border are possible optima. Here, the two model-based bounds are just two of the possible options. The symmetric lower dissociation bound assigns both occurrences an equal share. The best assignment lies slightly to the left of the symmetric one.
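The different bounds can be compared on a toy instance. The sketch below assumes the conditions for dissociating a variable across a disjunction as we read them from [Gatterbauer and Suciu 2014]: copies with probabilities p1 and p2 give an upper bound if both are at least the original probability p, and a lower bound if (1 − p1)(1 − p2) ≥ 1 − p, so that the symmetric lower bound uses p1 = p2 = 1 − sqrt(1 − p). The lineage x ∧ (y ∨ z), its dissociation (x1 ∧ y) ∨ (x2 ∧ z), and all numbers are hypothetical and are not the example of this paper.

```python
import math

def dissociated_probability(p1, p2, py, pz):
    """P[(x1 AND y) OR (x2 AND z)] for independent x1, x2, y, z (a read-once formula)."""
    return 1.0 - (1.0 - p1 * py) * (1.0 - p2 * pz)

# Hypothetical probabilities, not taken from the paper.
px, py, pz = 0.5, 0.5, 0.6

# Exact probability of the original lineage x AND (y OR z).
exact = px * (1.0 - (1.0 - py) * (1.0 - pz))

# Upper bounds: dissociation keeps the original probability on both copies;
# model-based bounds set one of the two occurrences to true (probability 1).
upper_dissociation = dissociated_probability(px, px, py, pz)
upper_model_based = [dissociated_probability(px, 1.0, py, pz),
                     dissociated_probability(1.0, px, py, pz)]

# Lower bounds: model-based bounds set one occurrence to false (probability 0);
# the symmetric dissociation gives both copies the same share.
lower_model_based = [dissociated_probability(px, 0.0, py, pz),
                     dissociated_probability(0.0, px, py, pz)]
p_sym = 1.0 - math.sqrt(1.0 - px)
lower_symmetric = dissociated_probability(p_sym, p_sym, py, pz)

print("exact:", exact)
print("upper:", upper_dissociation, upper_model_based)
print("lower:", lower_symmetric, lower_model_based)
```

In this toy instance the dissociation upper bound is tighter than both model-based upper bounds, while the symmetric lower bound (about 0.296) is slightly worse than one of the two model-based lower bounds (0.3), which shows why the choice of lower bound matters.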
Our approach includes gradient-descent methods that aim to find this optimal assignment. Computation of the gradient is closely related to the notion of influence, which can be computed in PTIME for read-once formulas [Kanagal, Li, and Deshpande 2011], as well as for dissociated formulas. While gradient-descent methods generally take longer to find the best lower bound than randomly assigning model-based bounds, the quality of the bound is often much better, thus justifying these additional computations.
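The link between gradients and influence can be illustrated with a one-dimensional search along the lower-bound frontier. This is a simplified sketch under our own assumptions and not the PGD method evaluated in this paper: it handles a single variable with two copies in the hypothetical dissociated lineage (x1 ∧ y) ∨ (x2 ∧ z), assumes the frontier (1 − p1)(1 − p2) = 1 − p, parameterizes it by a split t in [0, 1], and uses a fixed step size. The influence of a copy is taken to be the partial derivative of the formula's probability with respect to that copy's probability.

```python
import math

def bound(p1, p2, py, pz):
    """P[(x1 AND y) OR (x2 AND z)] for independent x1, x2, y, z."""
    return 1.0 - (1.0 - p1 * py) * (1.0 - p2 * pz)

def copies(t, px):
    """Point on the frontier (1 - p1)(1 - p2) = 1 - px, with share t going to copy 1."""
    budget = -math.log(1.0 - px)
    p1 = 1.0 - math.exp(-t * budget)
    p2 = 1.0 - math.exp(-(1.0 - t) * budget)
    return p1, p2

def best_lower_bound(px, py, pz, steps=200, rate=0.5):
    """Gradient ascent on t, clipped to [0, 1] after every step (the projection)."""
    budget = -math.log(1.0 - px)
    t = 0.5  # start from the symmetric assignment
    for _ in range(steps):
        p1, p2 = copies(t, px)
        # Influence of each copy: derivative of the bound w.r.t. its probability.
        influence1 = py * (1.0 - p2 * pz)
        influence2 = pz * (1.0 - p1 * py)
        # Chain rule: dp1/dt = budget * (1 - p1), dp2/dt = -budget * (1 - p2).
        gradient = budget * (influence1 * (1.0 - p1) - influence2 * (1.0 - p2))
        t = min(1.0, max(0.0, t + rate * gradient))
    return copies(t, px)

# Hypothetical probabilities, not taken from the paper.
px, py, pz = 0.5, 0.5, 0.6
p1, p2 = best_lower_bound(px, py, pz)
lower = bound(p1, p2, py, pz)
exact = px * (1.0 - (1.0 - py) * (1.0 - pz))
print(p1, p2, lower, exact)
```

Because the influences are already needed for the gradient, they come at no extra cost when they are later reused for other decisions, such as choosing a variable for Shannon expansion.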
2. Better variable selection. The state-of-the-art heuristic for choosing a variable for SE in SPROUT is selecting the most frequent variable. We instead select the variable for SE that has the highest sum of influences of each of its occurrences, as if they were independent. This ensures PTIME computation and turns out to be a better choice. Moreover, we can reuse the computed influences from our gradient-based optimization methods, thus avoiding recomputation.
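A small sketch of the influence heuristic follows. It rests on assumptions of our own: the lineage is a monotone DNF, every occurrence is treated as an independent copy carrying the original probability, the influence of an occurrence is the partial derivative of the resulting read-once probability, and only variables that occur more than once are considered as candidates. The lineage, the probabilities, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import prod

def influence_scores(clauses, probs):
    """Sum, per variable, of the influences of its occurrences in the fully
    dissociated DNF, where P = 1 - prod_c (1 - prod_{v in c} p_v)."""
    clause_probs = [prod(probs[v] for v in clause) for clause in clauses]
    scores = defaultdict(float)
    for i, clause in enumerate(clauses):
        # Probability that every other clause is false.
        others_false = prod(1.0 - q for j, q in enumerate(clause_probs) if j != i)
        for v in clause:
            # Derivative of clause i's probability w.r.t. this occurrence of v.
            rest = prod(probs[u] for u in clause if u != v)
            scores[v] += others_false * rest
    return dict(scores)

def pick_variable(clauses, probs):
    """Repeated variable with the highest summed influence, or None if read-once."""
    counts = Counter(v for clause in clauses for v in clause)
    repeated = [v for v in counts if counts[v] > 1]
    if not repeated:
        return None
    scores = influence_scores(clauses, probs)
    return max(repeated, key=lambda v: scores[v])

# Hypothetical lineage with two repeated variables (r1 and t2), assumed values.
clauses = [("r1", "s1", "t1"), ("r1", "s2", "t2"), ("r2", "s3", "t2")]
probs = {"r1": 0.9, "r2": 0.5, "s1": 0.3, "s2": 0.4, "s3": 0.5, "t1": 0.4, "t2": 0.8}
scores = influence_scores(clauses, probs)
chosen = pick_variable(clauses, probs)
print(chosen, scores[chosen])
```

Here r1 and t2 both occur twice, so the frequency heuristic cannot distinguish them, whereas the summed influences separate the two candidates.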
3. Optimization trade-off. We have a “knob” to control how much time to spend on finding a good lower bound using gradient descent before moving on to the next Shannon expansion. Recall that model-based bounds are fast to compute: they are selected randomly, in a single step. In contrast, descent methods perform multiple steps, and this extra computation pays off only when considerably better lower bounds are obtained and further SEs can be avoided.
4 Experiments
We experimented with several approximation methods on the YAGO3 dataset [Mahdisoltani, Biega, and Suchanek 2015]. We obtained 380 lineages from 4 different queries by assigning different input probabilities to the data, using different join orders to factorize the lineages, and by injecting different constants into the queries. Figure (b) shows the average approximation error over time for four different instantiations of our approach.
MB is the existing model-based approach from SPROUT and performs slightly worse than SD, which uses symmetric dissociations for both upper and lower bounds. Both methods work best with the frequency heuristic for SE. Our new methods PGD (Projected Gradient Descent) and HB (Hybrid method) vastly outperform both approaches. Both methods use the optimal symmetric upper bounds and apply 10 gradient descent steps before using SE. But whereas PGD searches for the true best lower bounds, HB searches for the best possible model-based lower bound by moving in a gradient direction. This makes the optimization faster, but produces slightly worse bounds than PGD. These methods work best with the influence heuristic for SE, but outperform the others even with the frequency heuristic.
5 Conclusion
We introduced an anytime approximation framework for probabilistic query evaluation. Our framework leverages novel dissociation bounds that generalize and improve upon model-based bounds. Our experimental results show notable improvements over the current state of the art, and we believe that the approach also has applications beyond PDBs in the broader area of statistical relational learning (SRL).
Acknowledgements.
This work has been supported in part by NSF IIS-1762268 and FWO G042815N.
References

[Choi and Darwiche 2010] Choi, A., and Darwiche, A. 2010. Relax, compensate and then recover. In JSAI International Symposium on Artificial Intelligence, volume 6797 of Lecture Notes in Computer Science, 167–180. Springer.
[Fink, Huang, and Olteanu 2013] Fink, R.; Huang, J.; and Olteanu, D. 2013. Anytime approximation in probabilistic databases. VLDB J. 22(6):823–848.
[Gatterbauer and Suciu 2014] Gatterbauer, W., and Suciu, D. 2014. Oblivious bounds on the probability of Boolean functions. TODS 39(1):1–34.
[Gatterbauer and Suciu 2017] Gatterbauer, W., and Suciu, D. 2017. Dissociation and propagation for approximate lifted inference with standard relational database management systems. VLDB J. 26(1):5–30.
[Kanagal, Li, and Deshpande 2011] Kanagal, B.; Li, J.; and Deshpande, A. 2011. Sensitivity analysis and explanations for robust query evaluation in probabilistic databases. In SIGMOD, 841–852.
[Mahdisoltani, Biega, and Suchanek 2015] Mahdisoltani, F.; Biega, J.; and Suchanek, F. M. 2015. YAGO3: A knowledge base from multilingual wikipedias. In CIDR.
[Raedt and Kersting 2017] Raedt, L. D., and Kersting, K. 2017. Statistical relational learning. In Encyclopedia of Machine Learning and Data Mining. Springer. 1177–1187.
[Ré, Dalvi, and Suciu 2007] Ré, C.; Dalvi, N.; and Suciu, D. 2007. Efficient top-k query evaluation on probabilistic data. In ICDE, 886–895.
[Van den Broeck and Suciu 2017] Van den Broeck, G., and Suciu, D. 2017. Query processing on probabilistic data: A survey. Foundations and Trends in Databases 7(3–4):197–341.