At the heart of probabilistic AI lies the problem of performing probabilistic inference, which is a #P-hard problem. For discrete random variables, the reduction to weighted model counting (WMC) [darwiche2009modeling] has emerged as the go-to technique to manage the hardness of probabilistic inference by exploiting structure, such as determinism and context-specific independence [chavira2008probabilistic]. Weighted Model Integration (WMI) [belle2015probabilistic]
extends the WMC task from the discrete domain to the continuous domain by allowing for continuous random variables.
Consider the example of a WMI problem in Figure 1. The problem has two continuous random variables ($x$ and $y$) and three Boolean random variables, which together produce the different feasible regions (the red region and the two blue regions). The regions themselves are given by constraints on the continuous variables. Moreover, for each feasible region a weight function is given; outside of the regions the weight is zero. WMI tackles the problem of computing the integral of the weight functions over the feasible regions.
Recently, it has been shown [kolb2018efficient, kolb2019structure, zeng2019efficient] that, just as for WMC, exploiting structure in highly structured WMI problems is possible and can in practice lead to exponential-to-polynomial speed-ups in inference time. In contrast to WMC, however, WMI exhibits one major complication: integrating out continuous random variables when performing probabilistic inference. WMI solvers capable of exploiting structure in WMI problems, such as the F-XSDD family of algorithms [kolb2019structure], rely on symbolic probabilistic inference, i.e. performing integrals symbolically. Symbolic integration is the problem of performing anti-differentiation (finding the indefinite integral, also called the primitive).
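To make the notion concrete, symbolic anti-differentiation can be sketched with SymPy (used here purely as an illustrative computer-algebra system, not as the backend of the cited solvers):

```python
import sympy as sp

x, y = sp.symbols("x y")

# Symbolic anti-differentiation: find a primitive of the integrand with
# respect to x; the free variable y is carried along symbolically.
integrand = 2 * x * y + 3 * x**2
primitive = sp.integrate(integrand, x)          # x**2*y + x**3

# A definite integral then follows from evaluating the primitive at the bounds.
definite = primitive.subs(x, 1) - primitive.subs(x, 0)   # y + 1
```

The primitive is a function of the remaining free variables, which is precisely what structure-exploiting solvers need to reuse partial results.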
Computing integrals (definite as well as indefinite) is a computationally hard problem; in the case of integrating polynomials, for example, it is #P-hard [valiant1979complexity], i.e. the number of function calls necessary to compute an integral grows exponentially as the dimensionality increases. This dependency of the complexity on the dimensionality of a problem is also referred to as the curse of dimensionality (the term was originally coined in [bellman1957dynamic] in the context of dynamic programming but has since also been applied to the problem of integration). A popular technique, capable of circumventing the curse of dimensionality for definite integrals, is Monte Carlo (MC) integration.
This brings us to the crux of this paper. On the one hand, we would like to exploit structure present in WMI problems, for which we need to be able to calculate indefinite integrals. On the other hand, we want to circumvent the curse of dimensionality of integration by using MC techniques.
Unfortunately, vanilla MC integration is not capable of calculating indefinite integrals. Therefore, we introduce the concept of Monte Carlo anti-differentiation (MCAD), which approximately computes the anti-derivative of an integrand with a Monte Carlo estimate.
For the empirical evaluation we integrated MCAD into the existing F-XSDD family of WMI solvers where we use MCAD as a drop-in replacement for the exact symbolic integration backend. Experimentally, we show that approximating the anti-derivative with a Monte Carlo estimate, instead of computing symbolic indefinite integrals, allows for efficiently solving highly-structured WMI problems in high-dimensional domains — yielding a practical WMI solver.
2.1 Weighted Model Integration
WMI is the extension of WMC from propositional logic formulas to so-called satisfiability modulo theory (SMT) formulas. An SMT formula is a first-order formula with respect to a decidable background theory. Following [morettin2017efficient], we define SMT($\mathcal{LRA}$) formulas, i.e. SMT formulas that use linear real arithmetic ($\mathcal{LRA}$) as background theory:
(SMT($\mathcal{LRA}$)) Let $\mathbf{b}$ be a set of Boolean and $\mathbf{x}$ a set of real variables. An atomic formula is an expression of the form $\sum_i c_i x_i \bowtie c$, where the $x_i \in \mathbf{x}$, the $c_i$ and $c$ are rational constants, and $\bowtie \in \{=, \neq, \geq, \leq, >, <\}$. We then define SMT($\mathcal{LRA}$) theories as Boolean combinations (by means of the standard Boolean operators $\neg, \wedge, \vee$) of Boolean variables in $\mathbf{b}$ and of atomic formulas over $\mathbf{x}$.
Following [kolb2019structure], we define weighted model integration as a function of an indicator function over an SMT($\mathcal{LRA}$) formula.
(WMI) Given a set $\mathbf{b}$ of Boolean variables, a set $\mathbf{x}$ of real variables, a weight function $w(\mathbf{b}, \mathbf{x})$, and a support $\phi$, in the form of an SMT formula over $\mathbf{b}$ and $\mathbf{x}$, the weighted model integral is given by:
$$\mathrm{WMI}(\phi, w \mid \mathbf{b}, \mathbf{x}) = \sum_{\mathbf{b}} \int [\phi(\mathbf{b}, \mathbf{x})]\, w(\mathbf{b}, \mathbf{x})\, d\mathbf{x},$$
where we use the Iverson bracket notation in $[\phi(\mathbf{b}, \mathbf{x})]$ to denote the indicator function of $\phi$.
In [kolb2019structure, Equation 8], the authors further manipulate the expression for computing the weighted model integral into a summation of weighted model integrals whose weight functions no longer depend on the Boolean variables $\mathbf{b}$:
$$\mathrm{WMI}(\phi, w \mid \mathbf{b}, \mathbf{x}) = \sum_i \int [\theta_i(\mathbf{x})]\, w_i(\mathbf{x})\, d\mathbf{x}.$$
Solving a WMI problem over an SMT($\mathcal{LRA}$) formula can hence be reduced to a two-step procedure: 1) rewriting the problem into a sum over tuples of disjoint convex polytopes $\theta_i$ and weight functions $w_i$, and 2) integrating every weight function $w_i$ over the corresponding support $\theta_i$. The WMI is obtained by summing up the results of the second step. The first step of this procedure was formalized in [kolb2019structure, Definition 6] and lies at the heart of all WMI solvers. Efficient WMI solvers are characterized by solving this first step efficiently by exploiting redundancies in the SMT($\mathcal{LRA}$) formula, thereby avoiding the computation of superfluous integrals. Solving this rewriting problem and the integration step are the two #P-hard problems that characterize probabilistic inference in the hybrid domain.
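As a toy illustration of the two-step procedure (a minimal sketch, not the authors' solver), consider a support with one Boolean variable whose two branches yield the disjoint intervals $[0,1]$ and $[0,2]$, each paired with the weight $w(x) = x$; step 1 is assumed done, and step 2 integrates each piece and sums:

```python
# Step 1 output (assumed precomputed): disjoint convex polytopes, here
# simply intervals, each paired with a weight function.
# Toy problem: support (b and 0 <= x <= 1) or (not b and 0 <= x <= 2), w(x) = x.
pieces = [((0.0, 1.0), lambda x: x),   # b = True branch
          ((0.0, 2.0), lambda x: x)]   # b = False branch

def integrate_piece(bounds, w, n=1000):
    # Step 2: integrate the weight over its polytope (a midpoint rule keeps
    # the sketch generic; exact or MC integration would be used in practice).
    lo, hi = bounds
    h = (hi - lo) / n
    return sum(w(lo + (i + 0.5) * h) for i in range(n)) * h

wmi = sum(integrate_piece(bounds, w) for bounds, w in pieces)  # 0.5 + 2.0
```

The weighted model integral of this toy problem is $\int_0^1 x\,dx + \int_0^2 x\,dx = 2.5$.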
2.2 Exact Integration Techniques for WMI
Techniques to compute integrals can be split into two categories: 1) symbolic integration and 2) numerical integration. Symbolic integration is the problem of finding the anti-derivative, or indefinite integral, which can be used to compute exact definite integrals, whereas numerical integration is a family of algorithms for calculating definite integrals. Numerical integration algorithms are either exact, e.g. [bueler2000exact, de2004effective], or approximate, e.g. [metropolis1953equation, duane1987hybrid]. In the remainder of this subsection we give an overview of algorithms used so far in the WMI literature and discuss their advantages and shortcomings (we limit the discussion to approaches that handle general WMI problems over SMT($\mathcal{LRA}$) formulas, in contrast to [belle2016component] and [zeng2019efficient]).
Fourier-Motzkin Elimination What Gaussian elimination is to systems of linear equalities, Fourier-Motzkin elimination [imbert1990redundant] is to systems of linear inequalities. First and foremost, Fourier-Motzkin elimination is a method to solve a system of linear inequalities, but it also underlies a number of symbolic integration algorithms [sanner2011symbolic, gehr2016psi, kolb2018efficient] (even though this is not mentioned explicitly). As an underlying component of symbolic integration methods, Fourier-Motzkin elimination has also found its way into WMI solvers [kolb2018efficient, zuidbergdosmartires2018exactcompilation, kolb2019structure].
Decomposition into Simplices Another strategy for exact integration of polynomials over convex polytopes is decomposing a polytope into (signed) simplices and adding up the (signed) volumes of the simplices. VINCI [bueler2000exact] and LattE Integrale [de2004effective, de2013software] are realizations of such decomposition algorithms. Even though LattE Integrale is technically a symbolic algorithm, we consider it a numerical integration algorithm as it can only produce definite integrals. LattE Integrale has been used in the WMI solvers of [belle2015probabilistic, belle2015hashing, morettin2017efficient, morettin2019advanced].
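The volume part of such a decomposition can be sketched in a few lines (an illustrative sketch only; VINCI and LattE Integrale additionally integrate polynomials over each simplex): the volume of a $d$-simplex is $|\det|/d!$ of its edge matrix, and the polytope volume is the sum over the simplices of a decomposition.

```python
import math
import numpy as np

def simplex_volume(vertices):
    # Volume of a d-simplex with d+1 vertices:
    # |det(v1 - v0, ..., vd - v0)| / d!
    v0 = np.asarray(vertices[0], dtype=float)
    mat = np.array([np.asarray(v, dtype=float) - v0 for v in vertices[1:]])
    d = mat.shape[0]
    return abs(np.linalg.det(mat)) / math.factorial(d)

# Decompose the unit square into two triangles and add up their volumes.
triangles = [[(0, 0), (1, 0), (1, 1)],
             [(0, 0), (1, 1), (0, 1)]]
total = sum(simplex_volume(t) for t in triangles)   # 1.0
```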
2.3 MCMC Volume Estimation of Convex Bodies
Monte Carlo integration is a popular technique to escape the curse of dimensionality for computing integrals or to approximate integrals for which no analytic solution is available and which hence cannot be computed symbolically. The simplest MC strategy is naive rejection sampling. Even though rejection sampling is straightforward to implement, it hardly circumvents the curse of dimensionality: for high-dimensional spaces the rejection rate is prohibitively high.
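The collapse of the acceptance rate can be observed by rejection-sampling the unit ball inscribed in the cube $[-1,1]^d$ (an illustrative experiment; the quoted rates are sample estimates):

```python
import numpy as np

rng = np.random.default_rng(0)

def acceptance_rate(d, n=100_000):
    # Fraction of uniform samples from the cube [-1, 1]^d that land in the
    # inscribed unit ball -- the acceptance rate of naive rejection sampling.
    pts = rng.uniform(-1.0, 1.0, size=(n, d))
    return np.mean(np.sum(pts**2, axis=1) <= 1.0)

rates = {d: acceptance_rate(d) for d in (2, 5, 10)}
# the rate collapses as d grows: roughly 0.79 (d=2), 0.16 (d=5), 0.0025 (d=10)
```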
To repair this problem, a plethora of sophisticated sampling algorithms has been developed over the years, e.g. Metropolis-Hastings [metropolis1953equation] or Hamiltonian Monte Carlo [duane1987hybrid]. However, most of these advanced sampling techniques are not designed to handle constrained spaces, and applying them to constrained spaces is not straightforward, although exceptions exist [betancourt2011nested, afshar2016closed]. A different road is to specifically design MC algorithms for constrained spaces. One such algorithm is VolEsti, which is presented in [emiris2014efficient, cousins2016practical] (an efficient C++ implementation is available at https://github.com/GeomScale/volume_approximation). VolEsti is a Markov chain Monte Carlo algorithm, in the family of hit-and-run samplers [simonovits2003compute], which is capable of sampling from convex regions in high-dimensional spaces. Unlike [betancourt2011nested, afshar2016closed], VolEsti is additionally capable of estimating the volume of the convex region from which samples are drawn, up to a user-defined threshold, by recursively constructing concentric balls of diminishing radii, a technique introduced in [dyer1991random]. The complexity of VolEsti is stated in the following theorem (cf. [kannan1997random, Theorem 2.1]):
Given an $n$-dimensional convex body $K$, a precision parameter $\epsilon > 0$, and an upper bound $\delta$ on the probability of error, there is a (randomized) algorithm that returns a real number $\zeta$ such that $(1 - \epsilon)\,\mathrm{vol}(K) \le \zeta \le (1 + \epsilon)\,\mathrm{vol}(K)$ with probability at least $1 - \delta$. The algorithm takes $O^*(n^5)$ oracle calls.
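A minimal hit-and-run sampler for a polytope $\{x : Ax \le b\}$ can be sketched as follows (an illustrative implementation, not the VolEsti code; the volume-estimation machinery of [dyer1991random] is omitted):

```python
import numpy as np

def hit_and_run(A, b, x0, n_samples, rng):
    """Hit-and-run MCMC sampling from the polytope {x : A @ x <= b}.

    From the current interior point, pick a uniformly random direction,
    intersect the resulting chord with the polytope, and jump to a
    uniformly random point on that chord.
    """
    x = np.asarray(x0, dtype=float)
    out = []
    for _ in range(n_samples):
        d = rng.normal(size=x.size)
        d /= np.linalg.norm(d)
        ad = A @ d                    # how fast each constraint tightens
        slack = b - A @ x             # positive for an interior point
        with np.errstate(divide="ignore", invalid="ignore"):
            t = slack / ad            # chord parameter at each facet
        t_hi = t[ad > 1e-12].min()    # nearest facet in direction +d
        t_lo = t[ad < -1e-12].max()   # nearest facet in direction -d
        x = x + rng.uniform(t_lo, t_hi) * d
        out.append(x.copy())
    return np.array(out)

# Usage: sample the unit square 0 <= x1, x2 <= 1, starting from its center.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])
rng = np.random.default_rng(0)
samples = hit_and_run(A, b, x0=[0.5, 0.5], n_samples=5000, rng=rng)
```

Unlike rejection sampling, every iteration produces a feasible sample, so the rejection rate is zero by construction.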
3 Monte Carlo Anti-Differentiation
3.1 Monte Carlo WMI
The first step towards Monte Carlo anti-differentiation is to formally write down an MC approximation of the volume computation in Equation 2, i.e. an MC estimate of the definite integral. Such an approximation was first given for WMI in [zuidbergdosmartires2018exactcompilation], where the authors assumed that the weight function is a probability density function defined over the support and proposed an algorithm that samples values directly from the probability density. However, while sampling, constraints were not taken into account, which boils down to performing rejection sampling.
We now give the expression for the MC estimate of the weighted model integral when samples are drawn uniformly from points within convex polytopes. This might look like a step in the wrong direction at first. However, formulating the MC approximation of the weighted model integral in terms of uniform samples allows us to deploy the hit-and-run sampler of VolEsti (cf. Section 2.3) to perform MC integration in high-dimensional spaces, thereby avoiding prohibitively high sample rejection rates.
(MC approximation of WMI) Let $\theta$ be an SMT($\mathcal{LRA}$) theory and $w$ a weight function over the continuous variables $\mathbf{x}$. Then the volume computation can be approximated by the following Monte Carlo estimate:
$$\int [\theta(\mathbf{x})]\, w(\mathbf{x})\, d\mathbf{x} \approx \frac{\mathrm{vol}(\theta)}{N} \sum_{j=1}^{N} w(\mathbf{x}^{(j)}),$$
where $N$ is the number of samples $\mathbf{x}^{(j)}$ drawn uniformly from the convex polytope induced by the linear constraints in $\theta$, and $\mathrm{vol}(\theta)$ denotes its volume.
Proof. The proof is trivial and follows directly from the definition of Monte Carlo integration (cf. [weinzierl2000introduction]).∎
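A sketch of such an estimate on the unit square (volume 1, so uniform samples are directly available; for a general polytope a hit-and-run sampler would provide them):

```python
import numpy as np

rng = np.random.default_rng(0)

# MC estimate of a definite integral over a convex polytope:
#   integral ≈ vol(P) * mean_j w(sample_j), with samples uniform in P.
samples = rng.uniform(0.0, 1.0, size=(200_000, 2))
w = samples[:, 0] + samples[:, 1]      # weight w(x, y) = x + y
vol = 1.0                              # volume of the unit square
estimate = vol * w.mean()              # true value: ∫∫ (x + y) dx dy = 1.0
```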
(Error on MC estimate) The MC estimate of the weighted model integral lies with high probability within an interval whose half-width is proportional to $\sigma / \sqrt{N}$, where $\sigma$ denotes the standard deviation of the integrand under the uniform samples and $N$ is the number of samples (cf. [weinzierl2000introduction]).
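The $1/\sqrt{N}$ behaviour of the MC error can be checked empirically (a small experiment on $\int_0^1 x^2\,dx = 1/3$, chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_error(n, trials=200):
    # Empirical RMSE of the MC estimate of ∫_0^1 x^2 dx = 1/3 with n samples.
    ests = np.array([(rng.uniform(0, 1, n) ** 2).mean() for _ in range(trials)])
    return np.sqrt(np.mean((ests - 1.0 / 3.0) ** 2))

e1, e2 = mc_error(100), mc_error(400)
# quadrupling the sample count roughly halves the error (1/sqrt(N) scaling)
```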
3.2 An Example
Before formally introducing Monte Carlo anti-differentiation we first discuss MCAD in an example.
Consider the convex polytope given in Figure 2 on the left and the weight function depicted in Figure 2 on the right. We would like to compute the anti-derivative of the weight function with respect to $y$, which means that we need to integrate out $y$ while taking into consideration the bounds imposed on $y$ (in blue). The problem is that these bounds alone do not induce a convex polytope but only a region in $y$, and we are not able to deploy the sampling algorithm introduced in Section 2.3. To fix this, we also take into consideration the domain of the WMI problem on the variable $x$ (in dashed red). These four inequalities combined do now induce a convex polytope (shaded in blue).
With the convex polytope at hand, we can now uniformly sample points from it. We then weight each sample with the weight produced by the weight function. The Monte Carlo approximation of the anti-derivative is then obtained by estimating the density of the weighted samples. In Figure 3 we performed the density estimation using a histogram (bottom right).
Note that while we used a 1D histogram, i.e. a piecewise constant function, to estimate the density for the anti-derivative in Figure 3, it can be replaced with any density estimator, which we will discuss in the next subsection.
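This procedure can be mimicked in code (a sketch with assumed, simplified ingredients: weight $w = 1$ on the triangle $0 \le y \le x \le 1$, whose anti-derivative with respect to $y$ is $F(x) = x$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Anti-derivative F(x) = ∫ w(x, y) [constraints] dy, here with w = 1 over
# the triangle 0 <= y <= x <= 1, so the true primitive is F(x) = x.
n = 200_000
u, v = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
xs, ys = np.maximum(u, v), np.minimum(u, v)   # uniform samples on the triangle
weights = np.ones(n)                          # sample weights w(x, y) = 1

vol = 0.5                                     # area of the triangle
counts, edges = np.histogram(xs, bins=10, range=(0, 1), weights=weights)
width = edges[1] - edges[0]
F_hat = vol * counts / (n * width)            # histogram MCAD estimate of F
mids = 0.5 * (edges[:-1] + edges[1:])         # F_hat[k] should approximate mids[k]
```

Each histogram bin holds the (scaled) total weight of the samples falling into it, yielding a piecewise constant approximation of the anti-derivative.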
We are now going to generalize Example 2 such that it fits the general setting of WMI over SMT($\mathcal{LRA}$) formulas.
Given is the $n$-dimensional convex polytope $P$, which is defined by inequalities over the variables $\mathbf{x}$. The Monte Carlo estimate of the anti-derivative of the weight function $w$ with respect to the set of integration variables $\mathbf{x}_I \subseteq \mathbf{x}$, constrained by the convex polytope $P$, is given by:
The free variables $\mathbf{x}_F$, with $\mathbf{x}_F \subseteq \mathbf{x}$, are variables that do not appear together with variables in $\mathbf{x}_I$ in any of the SMT($\mathcal{LRA}$) atoms. The coupled variables $\mathbf{x}_C$, with $\mathbf{x}_C \subseteq \mathbf{x}$, are variables that are coupled to the variables in $\mathbf{x}_I$ by appearing together in at least one SMT($\mathcal{LRA}$) atom. $\hat{f}$ is a density estimator that estimates the density of the samples weighted by $w$, where the samples are drawn from the sub-polytope over the variables $\mathbf{x}_I \cup \mathbf{x}_C$ induced by the polytope $P$.
Proof. We start by writing out the expression of the anti-derivative and expressing the polynomials as a function of the SMT($\mathcal{LRA}$) atoms.
Now we separate the SMT($\mathcal{LRA}$) atoms by grouping together the atoms depending on the integration variables and the atoms not depending on them.
The Iverson brackets we would like to integrate over do not necessarily form a convex polytope, which would allow us to use the MCMC sampler introduced in Section 2.3. In order to obtain a convex polytope, we first explicitly write out the lower and upper bounds on the integration variables and push them inside the integral.
Performing the integral results in a density dependent on the remaining variables:
We estimate this density using the samples drawn from the sub-polytope, weighted according to the weight function:
where $\mathrm{vol}(P)$ is the volume of the polytope and $\hat{f}$ a density estimator. The weights for the samples are obtained from the Monte Carlo estimate (cf. Equation 5), and $\mathbf{x}^{(j)}$ denotes the $j$-th sample. ∎
Note that in the proof above we assumed that the upper and lower bounds on the integration variables were readily available. If this is not the case, they can be obtained by means of linear programming. Note also that pushing the bounds on the variables inside the integral in Equation 13 is reflected in Example 2 by adding the bounds on $x$ (in dashed red in Figure 3) to the bounds on $y$.
3.4 Histograms as Density Estimator
The perhaps simplest approach to estimating the density that produced a set of data points is by means of (multidimensional) histograms, i.e. by piecewise constant functions. In order to estimate the density in Equation 15, we first need to obtain the domain on which we would like to estimate it. For each of the relevant variables we take its lower and upper bound. The so-obtained hyperrectangle defines the bounds of the multivariate histogram. The hyperrectangle is then partitioned into a predefined number of bins of volume $v$ (where $v$ depends on the number of bins and the volume of the hyperrectangle). The estimate of the density for the $k$-th bin is given by:
where $N$ is the number of samples which we use to estimate the density.
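Such a weighted multidimensional histogram can be computed with `numpy.histogramdd` (a sketch with stand-in samples and weights over the unit square):

```python
import numpy as np

rng = np.random.default_rng(0)

# Multidimensional histogram density estimate over the hyperrectangle spanned
# by the per-variable bounds, partitioned into bins of equal volume v_bin.
samples = rng.uniform(0, 1, size=(100_000, 2))   # stand-in weighted samples
weights = samples[:, 0]                          # stand-in weights w(x_j)

counts, edges = np.histogramdd(samples, bins=(8, 8),
                               range=[(0, 1), (0, 1)], weights=weights)
v_bin = (1 / 8) * (1 / 8)
density = counts / (len(samples) * v_bin)        # per-bin density estimate

# Summing density * v_bin recovers the mean weight, here E[x] = 0.5.
total = density.sum() * v_bin
```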
Even though using histograms as density estimators is straightforward, they are not well suited for estimating high-dimensional data. If we want to preserve the bin resolution, the number of bins we need to partition our space into grows exponentially with the number of variables. A possible solution would be to represent the estimated density by a lower-dimensional representation instead of a piecewise constant function encoded through exponentially many bins, such as kernel density methods [rosenblatt1956remarks, parzen1962estimation], density estimation trees [ram2011density], or hybrid sum-product networks [molina2018mixed].
We are now interested in the error that originates from approximating a function by a piecewise constant function. Therefore, we assume that the true density is the (midpoint) interpolant of the histogram (we are not interested in the error introduced by the Monte Carlo approximation; this error is described in Theorem 3). We first state a proposition, which we need in order to carry out the proof for the bounds on the error.
Let $U \subseteq \mathbb{R}^n$ be open, $f : U \to \mathbb{R}$ be differentiable, and the segment joining $\mathbf{a}$ to $\mathbf{b}$ be contained in $U$. Then there exists a point $\mathbf{c}$ on the segment such that
$$|f(\mathbf{b}) - f(\mathbf{a})| \le \lVert \nabla f(\mathbf{c}) \rVert \, \lVert \mathbf{b} - \mathbf{a} \rVert,$$
where $\lVert \cdot \rVert$ denotes the Euclidean norm.
Proof. Follows directly from the mean value theorem for multivariate functions, see for example [hubbard2015vector, Theorem 1.9.1]. ∎
Let $B_k$ be the set of points in the $k$-th bin of the histogram $\hat{f}$, let $f$ be a polynomial function, and let $\delta$ be the distance of the midpoint of a bin to one of its vertices. The error of the approximation of $f$ by $\hat{f}$ on $B_k$ is bounded by:
$$\max_{\mathbf{p} \in B_k} |f(\mathbf{p}) - \hat{f}(\mathbf{p})| \le \max_{\mathbf{q} \in B_k} \lVert \nabla f(\mathbf{q}) \rVert \, \delta.$$
Proof. The error of the histogram with regard to the actual value of the polynomial for a given point is:
We notice now that $\hat{f}(\mathbf{p}) = f(\mathbf{m}_k)$ for every $\mathbf{p} \in B_k$, where $\mathbf{m}_k$ is the midpoint of the $k$-th bin. This gives us:
Using Proposition 2, we obtain (21).
The upper bound for the error on the bin is the maximum value of this expression over the bin. We then obtain:
where we also used the fact that for polynomials we can replace the supremum with a maximum.
Realizing that the maximal distance between the midpoint and any other point in the bin is the distance between the midpoint and a corner point of the hypercube finishes the proof:
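The bound can be checked numerically for a concrete polynomial (an illustrative check, with the whole unit square playing the role of a single bin):

```python
import numpy as np

# Numerical check of the bin-wise bound |f(p) - f(m)| <= max||∇f|| * δ,
# with m the bin midpoint and δ the midpoint-to-corner distance.
f = lambda p: p[0] ** 2 + p[1] ** 2
grad = lambda p: np.array([2 * p[0], 2 * p[1]])

# bin [0, 1]^2: midpoint m = (0.5, 0.5), corner distance δ = sqrt(2)/2
m = np.array([0.5, 0.5])
delta = np.sqrt(0.5)

grid = np.stack(np.meshgrid(np.linspace(0, 1, 50),
                            np.linspace(0, 1, 50)), axis=-1).reshape(-1, 2)
err = max(abs(f(p) - f(m)) for p in grid)                    # 1.5 at (1, 1)
bound = max(np.linalg.norm(grad(p)) for p in grid) * delta   # 2*sqrt(2)*δ = 2
```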
Replacing the exact symbolic integration backend with a Monte Carlo estimate gives the F-XSDD(Mcad) algorithm. Structurally, F-XSDD(Mcad) follows the F-XSDD algorithm proposed in [kolb2019structure], shown in Algorithm 1. The sole difference to the original F-XSDD algorithm is that in F-XSDD(Mcad) the exact symbolic anti-differentiation is substituted by a Monte Carlo approximation, whose computation is given in Equation 4.
Crucial hyperparameters of F-XSDD(Mcad) are the bin resolution of the histograms, which we use for the density estimation, and the number of samples used for the Monte Carlo integration. A further choice to be made is the method to be used to perform the volume computation of convex polytopes, i.e. whether to perform this volume computation exactly or approximately.
In future work, we would also like to perform a theoretical analysis of the F-XSDD(Mcad) algorithm in which we analyze the error propagation when evaluating an underlying arithmetic circuit. Such an analysis would follow ideas presented in [shah2019problp].
5 Experimental Evaluation
The experimental study of F-XSDD(Mcad) presented in this section answers the following questions. Q1 Can F-XSDD(Mcad) exploit structure in highly-structured WMI problems and does F-XSDD(Mcad) produce reliable MC approximations? Q2 How does F-XSDD(Mcad) compare to naive rejection sampling and XSDD(Sampling)? Q3 How does F-XSDD(Mcad) handle integrations in higher dimensional spaces?
In order to answer Q1, we compare F-XSDD to the BR algorithm of [kolb2018efficient] on the XOR($N$) benchmark (Figure 4), where $N$ denotes the variable problem size. The XOR($N$) benchmark is a highly structured synthetic problem with which most state-of-the-art solvers struggle [kolb2019structure]; unless the BR algorithm is deployed, they exhibit an exponential dependency of the run time on the problem size. Besides the comparison on the XOR($N$) benchmark, we also compare the F-XSDD and BR algorithms on a variation of the XOR($N$) benchmark: instead of the constant weight function of one in the XOR($N$) benchmark, we now use a non-constant weight function over the continuous variables present in the benchmark. We refer to this variation as XOR($N$, $w$).
In Figure 4 we see that the run time of F-XSDD does not exhibit exponential growth with increasing problem size. This shows that F-XSDD is, just as BR, capable of exploiting the structure present, by performing MC approximations of indefinite integrals. Furthermore, we observe that the run time increases more rapidly for the BR algorithm than for F-XSDD in both plots (being more prominent in Figure 4). This is because the symbolic integration of the symbolic inference engine in the BR algorithm starts to slow down the run time for larger problem sizes. Analyzing the root mean squared error (RMSE) and the relative RMSE (RRMSE), we see that the solutions produced by F-XSDD are meaningful approximations.
To answer Q2, we compare, on the one hand, F-XSDD to naive rejection sampling, which means that we sample uniformly from the problem domain (which is a hyperrectangle) and reject the samples that do not satisfy the SMT($\mathcal{LRA}$) constraints of the problem. On the other hand, we compare to XSDD(Sampling), an XSDD-based algorithm, which collects all convex polytopes separately, finds bounding hypercubes for these, and performs rejection sampling in these hypercubes. The comparisons are made for the XOR($N$) benchmark and the Mutex($N$) benchmark, cf. [kolb2018efficient].
In the plots (Figure 5), we observe the drawback of collecting convex polytopes: the number of integrations becomes prohibitively high and hurts the run time of the XSDD(Sampling) algorithm. For the XOR($N$) benchmark, an additional factor plays into the run time: the number of convex polytopes to be collected grows exponentially with the problem size. We also see that for the naive rejection sampling approach the relative RMSE grows drastically for large problem sizes, which means that too many samples are being rejected. F-XSDD suffers neither from the drawbacks in run time of the XSDD(Sampling) algorithm nor from the sharp drop in accuracy of the naive rejection sampler.
We tackle Q3 by investigating a variation of the Dual benchmark [kolb2019structure], which aims to show the benefit of factorized solving. We dub this variation M-ual, which is specified as follows:
with the corresponding domains for the continuous variables. It is easy to see that by increasing both benchmark parameters simultaneously a high-dimensional problem is created; for instance, one of the M-ual instances we consider is 100-dimensional.
As for the experiment itself, we compared variations of the F-XSDD(Mcad) algorithm in which different algorithms are used to compute the volume of the polytopes over which integrations are performed (cf. Equation 4). For completeness, we also included the naive rejection sampling algorithm in the comparison.
In Figure 6 we see that, for comparable run times, the rejection sampling algorithm's accuracy breaks down for higher-dimensional spaces: the space is 30-40 dimensional, but the F-XSDD algorithms perform integrations merely in 2 dimensions due to the factorizability of the problem. In the run times, we see that for such low-dimensional spaces of integration the exact methods (F-XSDD(MCAD(PSI)) and F-XSDD(MCAD(LattE))) outperform the approximate method (F-XSDD(MCAD(VolEsti))). (Here, exact and approximate refer to the computation of the volume of the polytopes, and not to the integration itself.) We observe the opposite effect when tackling higher-dimensional problems (Figure 6), where the exact algorithms F-XSDD(MCAD(PSI)) and F-XSDD(MCAD(LattE)) time out. The naive rejection sampling algorithm timed out and exhibited a considerably higher relative standard deviation when compared to the F-XSDD algorithms.
6 Conclusions and Future Work
We developed the concept of Monte Carlo anti-differentiation and proposed a method for performing an MC approximation of an indefinite integral. Based on Monte Carlo anti-differentiation, we enriched the F-XSDD family of algorithms with F-XSDD(Mcad). F-XSDD(Mcad) is based on a hit-and-run sampler, which is used to approximate the anti-derivative. As such, F-XSDD(Mcad) is the first inference algorithm for WMI that performs Monte Carlo integration while exploiting structure present in WMI problems and avoiding prohibitively high sample rejection rates. Even though the sample rejection rate is zero, the integration algorithm could still be improved by drawing samples from the integrand directly instead of drawing them uniformly. Deploying more sophisticated MC algorithms might lead to lower variance in the approximation of the integrals. Similarly, it would also be interesting to investigate approximate integration methods for WMI other than MC integration that circumvent the curse of dimensionality, such as sparse grids [bungartz2004sparse] and Bayesian quadrature [briol2015frank].
This work has received funding from ERC AdG SYNTH(694980). Samuel Kolb is supported by the Research Foundation-Flanders (FWO). Pedro Zuidberg Dos Martires is supported by Research Foundation-Flanders (FWO) and Special Research Fund of the KU Leuven (BOF). The authors would like to thank Vissarion Fisikopoulos for his help with the VolEsti library and giving valuable feedback on the paper draft, and Luc De Raedt for commenting on early iterations of the paper.