1. Introduction
Constrained optimization is central to decision making over a broad range of domains, including finance (jorion2007financial; hong2014monte), transportation (clare2012air), healthcare (geng2019addressing), the travel industry (DeChoudhury2010), robotics (du2011probabilistic), and engineering (bienstock2014chance). Consider, for example, the following very common investment problem.
Example 0 (Financial Portfolio).
Given uncertain predictions for future stock prices based on financial models derived from historical data, an investor wants to invest in a set of trades (decisions on which stocks to buy and when to sell them) that will maximize the expected future gain, while ensuring that the loss (if any) will be lower than
with probability at least
.Suppose each row in a table contains a possible stock trade an investor can make: whether to buy one share of a certain stock, and when to sell it back, as shown in the lefthand side of Figure 1. The investor wants a “package” of trades—a subset of the input table, with possible repetitions (i.e., multiple shares)—that is feasible, in that it satisfies the given constraints (total price at most $1,000 and loss lower than $10 with probability at least ), and optimal, in that it maximizes an objective (expected future gain). Although the current price of a stock is known—i.e., price is a deterministic attribute—its future price, and thus the gain obtained after reselling the stock, is unknown. In the input table, Gain is a stochastic attribute. If the future gains were known, Example 1 would be a “package query” (Brucato2018; brucato2014packagebuilder), directly solvable as an
Integer Linear Program
(ILP) using offtheshelf linear solvers such as IBM CPLEX (cplex), and declaratively expressible in the Package Query Language (PaQL). Because Gain is stochastic, the investor is solving a stochastic ILP instead. In this paper, we introduce stochastic package queries (SPQs), a generalization of package queries that allows uncertainty in the data, thereby allowing specification and solution of stochastic ILP problems.We first introduce a simple language extension to PaQL, called sPaQL, that allows easy specification of package queries with stochastic constraints and objectives. We show the sPaQL query for Example 1 in Figure 1. The result of the query, on the righthand side of the figure, is a package that informs the investor about how many trades to buy for each individual stock, and when to plan reselling them to the stock market.
Probabilistic databases (dalvi2007efficient; suciu2011probabilistic)
enable the representation of random variables in a database. The
Financial Portfolio, like many other realworld applications, typically uses complex distributions to model uncertainty. For instance, future stock prices are sometimes forecast using lognormal variates based on “geometric Brownian motion” (ross2014introduction)using historical stock price data; alternatively, forecasts can incorporate complex stochastic predictive simulation or machine learning models. For this reason, we base SPQs on the Monte Carlo probabilistic data model
(jampani2008mcdb; JampaniXWPJH11), which offers support for arbitrary distributions via userdefined variable generation (VG) functions. To generate a sample realization of the random variables in a database, the system calls the appropriate VG functions. Whereas existing probabilistic databases excel at supporting SQLlike queries under uncertainty, they do not support packagelevel optimization, and therefore cannot answer SPQs. PackageBuilder (Brucato2018; brucato2014packagebuilder), on the other hand, only supports deterministic package queries and their translation into deterministic ILPs.The state of the art in solving stochastic ILPs (SILPs) has been developed outside of the database setting, in the field of stochastic programming (SP) (ahmed2008solving; CAMPI2009149; HOMEMDEMELLO201456). SP techniques approximate the given SILP by a large deterministic ILP (DILP) that simultaneously incorporates multiple scenarios. In a Monte Carlo database, a scenario is obtained by generating a realization of every random variable in the table, via a call to each associated VG function; this procedure may be repeated multiple times, generating a set of scenarios that are mutually independent and identically distributed (i.i.d.). Figure 2 shows an example of three possible scenarios for the input investment table for Example 1. Roughly speaking, expectations in the SILP are approximated by averages over the scenarios and probabilities by relative frequencies to form the DILP, which is then fed to a standard solver (e.g., CPLEX). The obtained solution approximates the true optimal solution for the SILP; the more scenarios, the better the approximation.
The solution of the DILP, however, may not be feasible with respect to the original SILP, especially if the approximation is based on only a small number of scenarios that do not well represent the true uncertainty distribution. For example, a financial package obtained by using too few scenarios might guarantee a loss less than with a probability of only , rather then , incurring more risk than desired.
There is no practical way to know how many scenarios will be needed a priori; existing theoretical apriori bounds—see, e.g., (luedtke2008sample)—are usually too conservative to be usable when table sizes are large. For example, if the Stock_Investments table contains rows, then to guarantee that the DILP solution is feasible for the SILP with merely probability (which is really no guarantee at all), one would need scenarios, resulting in a DILP with billion coefficients! SP solutions must therefore be “validated” a posteriori, using a much larger, and outofsample, set of scenarios. In Example 1, for instance, we would generate, say, scenarios and verify that the loss is less than in at least of them; such validation is much faster than solving a DILP with scenarios.
The stateoftheart algorithm thus works in a loop: the optimization phase creates scenarios, combines them into a DILP, and computes a solution; the validation phase validates the solution against the outofsample scenarios. If the solution is feasible on the validation scenarios (validationfeasible), the algorithm terminates, otherwise it creates more scenarios and repeats. A solution that is validationfeasible is highly likely to be truly feasible for the original SILP. Typically the ultimate number of scenarios used to compute the optimal solution to the DILP is astronomically smaller than the number prescribed by the conservative theoretical bounds (though it is still large enough to be extremely computationally challenging).
Unfortunately, this process often breaks down in practice. Uncertainty increases with increasing table size, and large tables typically need a huge number of scenarios to achieve feasibility. Thus the validation phase repeatedly fails, and the scenario set—and hence the DILP—grows larger and larger until the solver is overwhelmed. Even if the solver can ultimately handle the problem, many everslower iterations may be required until validationfeasible solutions are found, resulting in poor performance.
In this paper, we present an endtoend system for SPQs, seamlessly connecting SILP optimization with data management and stochastic predictive modeling. Thus tasks related to efficiently storing data, maintaining consistency, controlling access, and efficiently retrieving and preparing the data for analysis can leverage the full power of a DBMS, while avoiding the usual slow, cumbersome, and errorprone analytics workflow where we read a dataset off of a database into main memory, feed it to stochasticprediction and optimization packages, and store the results back into the database.
We first introduce a Naïve query evaluation algorithm, which embodies the stateoftheart optimization/validation technique outlined above, and thoroughly discuss its drawbacks. (Although the Naïve technique is mentioned in the SP literature, to our knowledge this is the first systematic implementation of the approach.) We then introduce our new algorithm, SummarySearch, that is typically faster than Naïve by orders of magnitude and can handle problems that cause Naïve to fail.
Our key observation is that the randomly selected set of scenarios used to form the DILP during an iteration of Naïve tend to be overly “optimistic”, leading the solver towards a seemingly good solution that “in reality”—i.e., when tested against the validation scenarios—turns out to be infeasible. This problem is also known as the “optimizer’s curse” (smith2006optimizer).
To overcome the optimizer’s curse, SummarySearch replaces the large set of scenarios used to form the Naïve DILP by a very small synopsis of the scenario set, called a “summary”, which results in a “reduced” DILP that is much smaller than the Naïve DILP. A summary is carefully crafted to be “conservative” in that the constraints in the reduced DILP are harder to satisfy than the constraints in the Naïve DILP. Because the reduced DILP is much smaller than the Naïve DILP, it can be solved much faster; moreover, the resulting solution is much more likely to be validationfeasible, so that the required number of optimization/validation iterations is typically reduced. Of course, if a summary is overly conservative, the resulting solution will be feasible, but highly suboptimal. Therefore, during each optimization phase, SummarySearch implements a sophisticated search procedure aimed at finding a “minimally” conservative summary; this search requires solution of a sequence of reduced DILPs, but each can be solved quickly.
Our experiments (Section 6) show that, since its iterations are much faster than those of Naïve, SummarySearch exhibits a large net performance gain even when the number of iterations is comparable; typically, the number of iterations is actually much lower for SummarySearch than for Naïve, further augmenting the performance gain.
In summary, the contributions of our paper are as follows.

[itemsep=5pt,wide,labelwidth=!,labelindent=0pt,leftmargin=]

We extend the PaQL language for deterministic package queries (itself an extension of SQL); the resulting language, sPaQL (Appendix A), allows specification of package queries with stochastic constraints and objectives.

We provide a precise and concrete embodiment, the Naïve algorithm, of the optimization/validation procedure suggested by the SP literature (Section 3).

We provide a novel algorithm, SummarySearch, that is ordersofmagnitude faster than Naïve, and that can solve SPQs that require too many scenarios for Naïve to handle. This is a significant contribution and fundamental extension to the known stateoftheart in stochastic programming (Section 4).

We present techniques that allow SummarySearch to optimize its parameters automatically, and we provide theoretical approximation guarantees on the solution of SummarySearch relative to Naïve (Section 5).

We provide a comprehensive experimental study, which indicates that SummarySearch always finds validationfeasible solutions of high quality, even when Naïve cannot, with dramatic speedups relative to Naïve (Section 6).
2. Preliminaries
Our work lies at the intersection of package queries, probabilistic databases, and stochastic programming. In this section, we introduce some basic definitions from these areas that we will use throughout the paper.
2.1. Deterministic Package Queries
A package of a relation is a relation obtained from by inserting copies of into for each ; here is the multiplicity function of . The goal of a package query is to specify , and hence the tuples of the corresponding package relation. A package query may include a WHERE clause (tuplelevel constraints), a SUCH THAT clause (packagelevel constraints), a packagelevel objective predicate and, possibly, a REPEAT limit, i.e., an upper bound on the number of duplicates of each tuple in the package.
A deterministic package query can be translated into an equivalent integer program (Brucato2018). For each tuple , the translation assigns a nonnegative integer decision variable corresponding to the multiplicity of in , i.e., . If the objective function and all constraints are linear in the ’s, the resulting integer program is an ILP. A cardinality constraint is translated into the ILP constraint . A summation constraint is translated into ; this translation works similarly for other linear constraints and objectives. A REPEAT constraint is translated into bound constraints .
2.2. Monte Carlo Relations
We use the Monte Carlo database model to represent uncertainty in a probabilistic database. Uncertain values are modeled as random variables, and a scenario (a deterministic realization of the relation) is generated by invoking all of the associated VG functions for the relation. In the simplest case, where all random variables are statistically independent, each random variable has its own VG function; in general, multiple random variables can share the same VG function, allowing specification of various kinds of statistical correlations. A Monte Carlo database system such as MCDB (jampani2008mcdb) (or its successor, SimSQL (cai2013simulation)) facilitates specification of VG functions as userdefined functions. We assume that there exists a deterministic key column that is the same in each scenario, so that each scenario contains exactly tuples for some and the notion of the “th tuple " is well defined across scenarios. For simplicity, we focus henceforth on the case where a database comprises a single relation. Our results extend to Monte Carlo databases containing multiple (stochastic) base relations in which the SPQ is defined in terms of a relation obtained via a query over the base relations.
2.3. Stochastic ILPs
The field of stochastic programming (SP) (shapiro2009lectures; kall1994stochastic) studies optimization problems—selecting values of decision variables, subject to constraints, to optimize an objective value—having uncertainty in the data. We focus on SILPs with linear constraints and linear objectives that are deterministic, expressed as expectations, or expressed as probabilities. Probabilistic constraints are also called “chance” constraints in the SP literature.
Linear constraints. Given random variables , decision variables , a real number , and a relation , a linear expectation constraint takes the form , and a linear probabilistic constraint takes the form , where . We refer to as the inner constraint of the probabilistic constraint, and to as its inner function. Constraints of the form can be rewritten in the aforementioned form by flipping the inequality sign of the inner constraint and using instead. If for constants we have for , then we obtain the deterministic constraint as a special case of an expectation constraint.
Objective. Without loss of generality, we assume throughout that the objective has the canonical form for deterministic constants . Indeed, observe that an objective in the form of an expectation of a linear function can be written in canonical form: , and thus we take . (This assumes that each expectation is known or can be accurately approximated.) We call the inner function of the expectation. Similarly, an objective in the form of a probability can be written in canonical form using epigraphic rewriting (campi2018wait). For example, we can rewrite an objective of the form in canonical form as and add a new probabilistic constraint . Here and is an artificial decision variable added to the problem with objective coefficient . Throughout the rest of the paper, we will primarily focus on techniques for minimization problems with a nonnegative objective function; the various other possibilities can be handled with suitable modifications and are presented in Appendix B.
In our database setting, we assume for ease of exposition that, in a given constraint or objective, each random variable corresponds to a random attribute value for some realvalued attribute A; a different attribute can be used for each constraint, and need not be the same as the attribute that appears in the objective. Our methods can actually support more general formulations: e.g., an expectation objective of the form , where is an arbitrary realvalued function of tuple attributes; constraints can similarly be generalized. Note that this general form allows categorical attributes to be used in addition to realvalued attributes.
3. Naïve SILP Approximation
Recall that Naïve is the first systematic implementation of the optimization/validation approach mentioned in the SP literature. The pseudocode is given as Algorithm 1. As discussed previously, the algorithm generates scenarios (line 1), combines them into an approximating DILP (line 3), solves the DILP to obtain a solution (line 4), and then validates the feasibility of against a large number of outofsample validation scenarios (line 5). The process is iterated, adding additional scenarios at each iteration (line 10) until the validation phase succeeds. We now describe these steps in more detail.
As discussed in the Introduction, the optimization phase for the DILP can be very slow, and often the convergence to feasibility requires so many optimize/validate iterations that the DILP becomes too large for the solver to handle, so that Naïve fails. Our novel SummarySearch algorithm in Section 4 uses “summaries” to speed up the optimization phase and reduce the number of required iterations.
3.1. Sampleaverage approximation
As mentioned previously, we can generate a scenario by invoking all of the VG functions for a table to obtain a realization of each random variable, and can repeat this process times to obtain a Monte Carlo sample of i.i.d. scenarios. In our implementation, Naïve generates scenarios by seeding the random number generator once for the entire execution, and accumulates scenarios in main memory.
We then obtain the DILP from the original SILP by replacing the distributions of the random variables with the empirical distributions corresponding to the sample. That is, the probability of an event is approximated by its relative frequency in the sample, and the expectation of a random variable by its sample average. In the SP literature, this approach is known as Sample Average Approximation (SAA) (ahmed2008solving; luedtke2008sample), and we therefore refer to the DILP for the stochastic package query as .
More formally, suppose that we have scenarios , each with tuples. Recall that denotes the random variable corresponding to attribute A in tuple , and denote by the realized value of in scenario . Then each expected sum is approximated by , where .
To approximate a probabilistic constraint of the form
(1) 
we add to the problem a new indicator variable, for each scenario , along with an associated indicator constraint: , where the indicator function equals 1 if the inner constraint is satisfied and equals 0 otherwise. We say that solution “satisfies scenario ” (with respect to the constraint) if and only if . (Solvers like CPLEX can handle indicator constraints.) Finally, we add the following linear constraint over the indicator variables: , where is the smallest integer greater than or equal to . That is, we require that the solution satisfies at least a fraction of the scenarios. The function applies these approximations to create the DILP .
Size complexity. With constraints, the size of , measured with respect to the number of coefficients, is : we have coefficients for each expectation constraint and, for each probabilistic constraint, coefficients (for ) for each scenario.
3.2. Outofsample validation
After using scenarios to create and solve the DILP , we check to see if the solution is validationfeasible in that it is a feasible solution for the DILP that is constructed using outofsample scenarios. When is sufficiently large, validation feasibility is a proxy for true feasibility, i.e., feasibility for the original SILP; commonly, or . This definition of validationfeasibility is simple, but widely accepted (luedtke2008sample)
. Although there are other, more sophisticated ways to use validation scenarios to obtain confidence intervals on degree of constraint violation—see, e.g.,
(campi2018wait)—these are orthogonal to the scope of this paper. Henceforth, we use the term “feasibility” to refer to “validation feasibility”, unless otherwise noted.In our implementation, during a precomputation phase, we actually average
scenarios—the same number as the number of validation scenarios—to estimate each
; we then append these estimates, denoted , to the table. We do this because such averaging is typically very fast to execute, and is spaceefficient in that we simply maintain running averages. Thus a solution returned by a solver is always feasible for every expectation constraint, and hence is feasible overall if and only if, for every probabilistic constraint of the form (1), satisfies at least a fraction of the validation scenarios. We can therefore focus attention on the probabilistic constraints, which are the most challenging.The procedure checks the feasibility of , the solution to ; we describe its operation on a single probabilistic constraint , but the same steps are taken independently for each probabilistic constraint. It first seeds the system random number generator with a different seed than the one used to generate the optimization scenarios. For each , it generates a realization for each such that (i.e., for each tuple that appears in the solution package), and computes the “score” . It then sets . After all scenarios have been processed, it computes and declares to be feasible if . The algorithm purges all realizations from main memory after each scenario has been processed, and only stores the running count of the ’s, allowing it to scale to an arbitrary number of validation scenarios. Moreover, a package typically contains a realtively small number of tuples, so only a small number of realizations need be generated.
4. SummaryBased Approximation
The Naïve algorithm has three major drawbacks. (1) The overall time to derive a feasible solution to can be unacceptably long, since the size of sharply increases as increases. (2) It often fails to obtain a feasible solution altogether—in our experiments, the solver (CPLEX) started failing with just a few hundred optimization scenarios. (3) Naïve does not offer any guarantees on how close the objective value of the solution to is to the true objective value of the solution to the DILP that is based on the validation scenarios. (Recall that we use as a proxy for the actual SILP.) A feasible solution that Naïve provides can be far from optimal.
Our improved algorithm, SummarySearch, which we present in this section, addresses these challenges by ensuring the efficient generation of feasible results through much smaller “reduced” DILPs that each replace a large collection of scenarios with a very small number of scenario “summaries”; in many cases it suffices to take . We call such a reduced DILP a Conservative Summary Approximation (CSA), in contrast to the much larger sample average approximation (SAA) used by Naïve. The summaries are carefully designed to be more “conservative” than the original scenario sets that they replace: the constraints are harder to satisfy, and thus the solver is induced to produce feasible solutions faster. SummarySearch also guarantees that, for any userspecified approximation error (where is defined in Section 5.4), if the algorithm returns a solution , then the corresponding objective value satisfies ; in this case we say that is a approximate solution. (Recall that we focus on minimization problems with nonnegative objective functions; the other cases are discussed in Appendix B.)
4.1. Conservative Summary Approximation
We first define the concept of an summary, and then describe how summaries are used to construct a CSA.
Summaries. Recall that a solution to satisfies a scenario with respect to a probabilistic constraint of the form of Equation (1) if , where is the realized value of in .
Definition 0 (Summary).
Let . An summary of a scenario set with respect to a probabilistic constraint of the form (1) is a collection of deterministic values of attribute A such that if a solution satisfies in that , then satisfies at least of the scenarios in with respect to .
Constructing an summary, for , is simple: Suppose that the inner constraint of probabilistic constraint has the form . Given any subset of scenarios of size exactly , we define as the tuplewise minimum over :
Scenario 1  Scenario 3  0.66Summary  
id  gain  id  gain  id  gain  
1  0.1  1  0.01  1  0.01  
2  0.05  2  0.02  2  0.02  
3  0.2  3  0.1  3  0.2  
4  0.2  4  0.3  4  0.3  
5  0.1  5  0.2  5  0.1  
6  0.7  6  0.3  6  0.7 
Proposition 0 ().
is an summary of with respect to .
Proof.
Suppose satisfies , i.e., . Then, for every scenario , . Since , the result follows. ∎
Figure 3 illustrates an summary for the three scenarios in Figure 2, where and comprises scenarios 1 and 3. The summary is conservative in that, for any choice of trades, the gain under the summary values will be less than the gain under either of the two scenarios. Thus if we can find a solution that satisfies the summary, it will automatically satisfy at least scenarios 1 and 3. It might also satisfy scenario 2, and possibly many more scenarios, including unseen scenarios in the validation set. Indeed, if we are lucky, and in fact our solution satisfies at least of the scenarios in the validation set, then will be feasible with respect to the constraint on Gain.
Clearly, for an inner constraint with , the tuplewise maximum of yields an summary. While there may be other ways to construct summaries, in this paper we only consider minimum and maximum summaries, and defer the study of other, more sophisticated summarization methods to future work. Importantly, a summary need not coincide with any of the scenarios in ; we are exploiting the fact that optimization and validation are decoupled.
CSA formulation. A CSA is basically an SAA in which all probabilistic constraints are approximated using summaries instead of scenarios.^{1}^{1}1As with the SAA formulation, expectations are approximated as averages over a huge number of independent scenarios. The foregoing development implicitly assumed a single summary (with respect to a given probabilistic constraint ) for all of the scenarios in . In general, we use summaries, where . These are obtained by dividing randomly into disjoint partitions , of approximately scenarios each. Then the summary for partition is obtained by taking a tuplewise minimum or maximum over scenarios in a subset , where .
For each probabilistic constraint of form (1), we add to the DILP a new indicator variable, , and an associated indicator constraint . We say that solution “satisfies summary ” iff . We also add the linear constraint , requiring at least of the summaries to be satisfied. We denote the resulting reduced DILP by .
Size complexity. Assuming probabilistic constraints, the number of coefficients in is , which is independent of . Usually, takes on only small values, so that the effective size complexity is only .
Our results (Section 6) show that in most cases SummarySearch finds good solutions with only one summary, i.e., . Because is small, the solution to can be rapidly computed by a solver. The CSA formulation is also more robust to random fluctuations in the sampled data values, and less prone to “overfit” to an unrepresentative set of scenarios obtained by luck of the draw.
An important observation is that as increases, approaches the formulation: at each partition will contain exactly one scenario, which will also coincide with the summary for the partition. Since encompasses , we can always do at least as well as Naïve with respect to the feasibility and optimality properties of our solution, given scenarios. We address the issue of how to choose , , and each below and in Section 5, and also discuss how to generate summaries efficiently.
4.2. Query Evaluation with CSA
Algorithm 2 shows query evaluation with SummarySearch. The goal is to find a feasible solution whose objective value is as close as possible to , the objective value of the SAA based on the validation scenarios. In the algorithm, denotes the SPQ obtained from by removing all of the probabilistic constraints. At the first step, SummarySearch computes , the solution to the DILP ; the only constraints are deterministic constraints and expectation constraints, with the latter estimated from scenarios in the usual way. This corresponds to the “least conservative” solution possible, and is effectively equivalent to solving a CSA using summaries constructed with , because (i.e., none) of the scenarios are required to be satisfied. For some problems, might have an infinite objective value, in which case we simply ignore this solution and incrementally increase until we find a finite solution.
Like Naïve, the SummarySearch algorithm starts with an initial number of optimization scenarios, , and iteratively increments it while solutions are infeasible. In the optimization phase, the algorithm uses a CSA formulation, which replaces the real scenarios with conservative summaries. Initially, the algorithm uses , replacing the set of scenarios with a single summary. After feasibility is achieved for a solution with objective value , the algorithm tries to check whether the ratio is less than or equal to the userdefined error bound ; although , and hence , is unknown, we can conservatively check whether , where is an upper bound on that we develop in Section 5.4, If the solution is unsatisfactory, SummarySearch increases , and iterates again. The algorithm stops if and when a feasible and approximate solution is found. In practice, because of the conservative nature of summaries, SummarySearch typically finds feasible solutions in drastically fewer iterations than Naïve.
5. Optimal Summary Selection
The key component of SummarySearch is CSASolve, described in this section. With and fixed, CSASolve finds the best CSA formulation, i.e., the one having, for each constraint, the optimal value of and the best set of scenarios for each summary. CSASolve thus determines the best solution achievable with scenarios and summaries, and also computes metadata used by SummarySearch for checking feasibility and optimality.
5.1. CSASolve Overview
Algorithm 3 depicts the iterative process of CSASolve: at each iteration it produces a solution to a problem based on an summary for each constraint . Initially, for all , and thus the solution to is simply , which has already been computed by SummarySearch prior to calling CSASolve. Then CSASolve stops in two cases: (1) if it finds a feasible approximate solution; (2) if it enters a cycle, producing the same solution twice with the same values. In case (2), it returns the “best” solution found so far: if one or more feasible solutions have been found, it returns the one with the best objective value, otherwise it returns an infeasible solution, and SummarySearch will increase in its next iteration.
5.2. Choosing
Larger leads to more conservative summaries, as we take the tuplewise minimum (or maximum) over more and more scenarios. Thus a high value of increases the chances of finding a feasible solution. On the other hand, if the constraints are more restrictive than necessary, then the solution can have a seriously suboptimal objective value because we are considering fewer candidate solutions, possibly missing the best ones. Thus, CSASolve seeks the minimally conservative value of that will suffice.
How can we measure the true conservativeness of with respect to a constraint ? As discussed previously, the solution to a formulation based on summaries is guaranteed to satisfy at least of the optimization scenarios, but the actual true probability of satisfying the constraint—or more pragmatically, the fraction of the validation scenarios satisfied by —will usually differ from . Thus, we look at the difference between the fraction of validation scenarios satisfied by and the target value . We call this difference the surplus, and define it as:
We expect the function to be increasing in with high probability.
Observe that essentially satisfies the constraint . Clearly, if , then is infeasible for constraint , whereas if , then satisfies the inner constraint with a probability that exceeds , and so is conservative and therefore likely suboptimal. Thus the optimal value satisfies . Solutions that achieve zero surplus may be impossible to find, and therefore CSASolve tries to choose to minimize the surplus for each of the K constraints, while keeping it nonnegative. The search space is finite (hence the possibility of cycles) since for .
At each iteration , CSASolve updates to , creates the corresponding CSA problem, and produces a new solution . For simplicity and ease of computation, our initial implementation updates each individually by fitting a smooth curve to the historical points and then solving the equation . In our experiments, we observed that (1) fitting an arctangent function provides the most accurate predictions and (2) this artificial decoupling with respect to the constraints yields effective summaries; we plan to investigate other methods for jointly updating .
5.3. Choosing
So far, we have assumed that the subset used to build the summary is any set containing scenarios. SummarySearch
employs a simple greedy heuristic to determine
: it chooses the scenarios that produce the summary most likely to keep the previous solution feasible in the current iteration, so that the new solution will likely have a higher objective value. For an inner () constraint, this is achieved by sorting the scenarios in according to their “scenario score” and taking the first in descending (ascending) order.5.4. Approximation Guarantees
If is feasible, SummarySearch can terminate if it can determine that is approximate relative to the optimal feasible solution based on the validation scenarios, i.e., that , where and are the objective values for and , respectively, and is an accuracy parameter specified by the user. Without loss of generality, we assume below that the objective function is an expectation; should the objective be deterministic, nesting it within an expectation does not change its value.
This termination check proceeds as follows. During the th iteration of SummarySearch, the function Validate() computes surplus values , one for each probabilistic constraint in the query. Further, it computes (as defined below). We show below that if and , then is a feasible approximate solution, and SummarySearch can immediately return and terminate. As usual, we focus on minimization problems with nonnegative objective values, and take the optimal solution and objective value of as proxies for those of the original SILP. We start with the following simple but important result.
Proposition 0 (General Approximation Guarantee).
Let and let be a positive constant such that . Set . If , then .
Proof.
Suppose that . Since , we have
and the result follows. ∎
We obtain a specific formula for by choosing a specific bound . Clearly, we would like to choose as large as possible, since this maximizes the likelihood that . One simple choice that always works is to set , where is the objective value of the SAA problem corresponding to the original SILP but with all probabilistic constraints removed—see line 2 of Algorithm 2. If all random variables are lowerbounded by a constant and the size of any feasible package is lowerbounded by a constant , then , , so that
which yields an alternative lower bound. Yet another bound can be sometimes obtained by exploiting the relation of the constraints to the objective.
Definition 0 (ObjectiveConstraint Interaction).
Let the objective be , for random variables . The objective is said to be supported by a constraint of the form and counteracted by a constraint of the form . All other forms of constraint are said to be independent of the objective.
Intuitively, a supporting probabilistic constraint “supports” the objective function in the same “direction” of the optimization ( for minimization, for maximization), whereas a counteracting constraint goes against the optimization. If there exists a counteracting constraint with , it can be shown (Appendix B) that .
Finally, we take to be the maximum of all applicable lower bounds. Similar formulas can be derived for other possible cases—maximization problems, negative objective values, and so on; see Appendix B.
Note that if , then , , so that SummarySearch cannot terminate with a feasible approximate solution. To avoid this problem, we require that , where . Here is any upper bound on . It can be shown, for example, that if (1) all random variables are upperbounded by a constant , (2) the size of any feasible package is upperbounded by a constant , and (3) there exists a supporting constraint with , then ; see Appendix B. If we have available a feasible solution with objective value , then we can take . We choose to be the minimum of all applicable bounds.
5.5. Implementation Considerations
We now discuss several implementation optimizations.
Efficient summary generation. Recall that summarization has two steps: (1) computing the scenario scores to sort scenarios by the previous solution, and (2) computing the tuplewise minimum (or maximum) of the first of the scenarios in sorted order. The fastest way to generate an summary is if all scenarios are generated and kept in main memory at all times. In this case, computing the tuplewise minimum (or maximum) is trivial. However, the memory requirement for this may exceed the memory limits if is large. We devise two possible strategies for memoryefficient summary generation with optimal space complexity: tuplewise summarization and scenariowise summarization. Tuplewise summarization uses a unique random number seed for each tuple () and it generates all realizations, one tuple at a time. Scenariowise summarization uses a unique seed for each scenario (), and it generates one realization for all tuples, one scenario at a time.
With tuplewise summarization, sorting the scenario only requires time, where is the size of the current package; usually, . However, generating the summaries is more costly, as it requires time, as all realizations must be constructed for all tuples. The total time is . With scenariowise summarization, generating summaries has lower time complexity of , as it only generates scenarios in , but sorting has higher complexity , with total time .
It follows that if , tuplewise summarization is generally faster than scenariowise summarization. However, other factors may affect the runtime, e.g., some random number generators, such as Numpy, generate large quantities of random numbers faster if generated in bulk using a single seed. In this case, tuplewise summarization may suffer considerably in the summary generation phase, as it needs to reseed the random number generator for each tuple. In our experiments, we observed that tuplewise summarization is better when the input table is relatively small, but worse than scenariowise for larger tables. In general, a system should implement both methods and test the two in situ.
Convergence acceleration. When is obtained by decreasing , the solution typically is feasible, and our goal is for to strictly improve in objective value. CSASolve achieves this by slightly modifying the generation of summaries in order to ensure that the previous solution is still feasible for the next CSA problem. This is done by using the tuplewise maximum (instead of minimum) in the summary generation for all tuples such that (tuples in the previous solution). For all other tuples, we set the summary as usual. We have found that ensuring monotonicity of the objective values promotes faster convergence.
6. Experimental Evaluation
In this section, we present an experimental evaluation of our techniques for stochastic package queries on three different domains where uncertainty naturally arises: noise in sensor data, uncertainty in future predictions, uncertainty due to data integration (dong2009data). Our results show that: (1) SummarySearch is always able to find feasible solutions, while Naïve cannot in most cases—when both SummarySearch and Naïve can find feasible solutions, SummarySearch is often faster by orders of magnitude; (2) The packages produced by SummarySearch are of high quality (low empirical approximation ratio), sometimes even better than Naïve when they both produce feasible solutions; (3) Increasing , the number of optimization scenarios, helps SummarySearch find feasible solutions, and the value of required by SummarySearch to start producing feasible solutions is much smaller than Naïve, explaining the orders of magnitude improvement in running time; (4) Increasing , the number of summaries, helps SummarySearch find higherquality solutions; (5) Increasing , the number of input tuples, impacts the running time of both algorithms, but SummarySearch is still orders of magnitude faster than Naïve, and finds feasible solutions with better empirical approximation ratios than Naïve.
6.1. Experimental Setup
We now describe the software and runtime environment, and the three workloads we used in the experiments.
Environment. We implemented our methods in Python 2.7, used Postgres 9.3.9 as the underlining DBMS, and IBM CPLEX 12.6 as the ILP solver. We ran our experiments on servers equipped with two 24 2.66GHz cores, 15GB or RAM, and a 7200 RPM 500GB hard drive.
Datasets and queries. We constructed three workloads:
Noisy sensor measurements: The Galaxy datasets vary between and tuples, extracted from the Sloan Digital Sky Survey (SDSS) (sdss_dr12). Each tuple contains the color components of a small portion of the sky as read by a telescope. We model the uncertainty in the telescope readings as Gaussian or Pareto noise.
Financial predictions: The Portfolio dataset contains 6,895 stocks downloaded from Yahoo Finance (EmptyId10). The initial price of each stock is set according to its actual value on January 2, 2018, and future prices are generated according to a geometric Brownian motion. We consider selling stocks in one day or in one week, as in Figure 1; the dataset for the shortterm (resp., longterm) trades contains 14,000 (resp., 48,000) tuples. For each prediction type, we also extracted a subset corresponding to the most volatile stocks to construct some of the hardest queries. Tuples referring to the same stock are correlated to one another. For example, in Figure 1, tuples 1 and 2 are correlated to each other but are independent of the other tuples.
Data integration: The TPCH dataset consists of about 117,600 tuples extracted from the TPCH benchmark (tpch)
. We simulate the result of hypothetically integrating several data sources to form this data set: we model uncertainty in each attribute’s value with discrete probability distributions. For each original (deterministic) value in the TPCH dataset, we generate
possible variations thereof, where is the number of data sources that have been integrated into one. The mean of these values is anchored around the original value; each source value is sampled from an exponential, Poisson, uniform or Student’s tdistribution.For each of the three datasets, we constructed a workload of eight sPaQL queries; all 24 queries, except one in TPCH, are feasible. The workloads span seven different distributions for the uncertain data attributes, including a complex VG function to predict future stock prices. The objective functions are supported by the constraints for the Portfolio queries, independent for the TPCH queries and either supported or counteracted for the Galaxy queries (see Definition 2 for supported/counteracted/independent objectives). The Portfolio workload tests high and lowrisk, high and lowVaR (Value at Risk)—i.e., and in Equation 1—as well as short and longterm trade predictions. The TPCH workload is split into queries with and (number of integrated sources). For all queries there are two constraints, one of which is probabilistic with . Examples include: (1) for Galaxy, we seek a set of five to ten sky regions that minimizes total expected radiation flux while avoiding total flux levels higher than 40 with high probability, and (2) for TPCH, we seek a set of between one and ten transactions having maximum expected total revenue, while containing less than 15 items total with high probability. A detailed description of the workloads can be found in Appendix C.
Evaluation metrics. We measure response time (in seconds and logarithmic scale) across i.i.d runs using different seeds for generating the optimization scenarios, and evaluate feasibility and the objective value on an outofsample validation set with scenarios ( for the Portfolio workload). We plot the average across the runs, and its confidence interval in a shaded area. For each run of an algorithm, we set a time limit of four hours. When the time limit expires, we interrupt CPLEX and get the best solution found by the solver until then. We measure feasibility rate as the fraction, out of the runs, in which a method produces a feasible solution (including, for all methods, when the time limit expired). Because the true optimal solution for any of the queries is unknown, we measure accuracy by , where and is the objective value of the best feasible solution found by any of the methods.
6.2. Results and Discussion
We evaluate four fundamental aspects of our algorithms: (1) query response time to reach feasibility rate; (2) scalability with increasing number of scenarios (); (3) scalability of SummarySearch with increasing number of summaries (); (4) scalability with increasing dataset size ().
6.2.1. Response time to reach feasibility rate
Both Naïve and SummarySearch increase (the number of scenarios) up to when solutions start to be feasible. We report the cumulative time for all iterations the algorithm took to reach a certain feasibility rate, from , up to (when the algorithm produces feasible solutions for all runs). For SummarySearch, is fixed ( for Galaxy and Portfolio, for TPCH). We set to the lowest value (per workload) such that SummarySearch could reach 100% feasibility rate. Figure 4 shows the results of the experiment. For all (23) feasible queries across all workloads, SummarySearch is always able to reach feasibility rate, while Naïve can only reach feasibility for only queries. Even then, SummarySearch is usually orders of magnitude faster than Naïve (e.g., Galaxy Q6, TPCH Q2, Q6, and Q7). Moreover, in 15 out of the 23 feasible queries, SummarySearch reached 100% feasibility while Naïve was still at . The conservative nature of summaries allows higher feasibility rates for SummarySearch even with fewer scenarios. As the number of scenarios increases, SummarySearch solves a much smaller problem than Naïve, leading to ordersofmagnitude faster response time.
The only case where SummarySearch is slower than Naïve at reaching feasibility rate is Galaxy Q7, which was an easy query for both methods: both solved it with only 10 scenarios. This query has a supported objective function over data with minimal uncertainty described by a Pareto distribution with “scale” and “shape” both equal to . For this query, the summarization process and solving a probabilisticallyunconstrained problem are overheads for SummarySearch. TPCH Q8 is an infeasible query. Both methods increase up to before declaring infeasibility, but again SummarySearch is faster than Naïve in doing so.
6.2.2. Effect of increasing the number of optimization scenarios
We evaluate the scalability of our methods when the number of optimization scenarios increases; is fixed as described above. For each algorithm, we group feasibility rates into 5 groups: 0%, 25%, 50%, 75% and 100%, and use different shadings to distinguish each case.
Figure 5 gives scalability results for the three workloads. Generally, with low , Naïve executes very quickly to produce infeasible solutions with low objective values (optimizer’s curse); as Naïve increases , the running time increases exponentially—note the logarithmic scale—up to a point where it fails altogether (missing Naïve points in the plots). On the other hand, SummarySearch finds feasible solutions even with as little as 10 scenarios.
SummarySearch produces high quality solutions as demonstrated by the low approximation ratio (), close to for most queries. However, with the hardest Portfolio queries (Q5 and Q6), the worst approximation ratio for SummarySearch is quite high for feasible solutions: this is an indicator that the number of summaries, is too low and should be increased.
6.2.3. Effect of increasing the number of summaries
In this experiment, we show how increasing the number of summaries () helps improve the approximation ratio in the Portfolio queries. We increase from up to (number of scenarios), where is set to where the feasibility rate of SummarySearch was in the previous experiment, and we show the running time and approximation ratio compared to Naïve with scenarios. Figure 6 shows the results of this experiment. First, the response time with increasing is in most cases independent of . In fact, while increasing adds more scenarios to the CSA formulation, each summary becomes less and less conservative, making the problem a bit larger but always easier; in the limit (), each summary is identical to an original scenario, and thus SummarySearch only pays the extra overhead, compared to Naïve, of solving the probabilisticallyunconstrained problem first. On the other hand, Naïve is always faster, but its solutions are infeasible. For most queries, the approximation ratio closely approaches , while still maintaining a high feasibility rate. Increasing too far eventually causes feasibility to drop, reaching that of Naïve in the limit ().
Finally, even though infeasible solutions tend to have better objective values than feasible ones, we find that Naïve’s infeasible solutions to Q7 and Q8 have worse objective values. These queries proved quite challenging for Naïve as they involved stock price predictions for a week in the future.
6.2.4. Effect of increasing the dataset size
In this experiment, we increase the Galaxy dataset up to five times from 55,000 tuples to 274,000 tuples. For all queries except Q8 we fix (for both algorithms) and . In general, SummarySearch scales well with increasing data set size: it finds feasible solutions with good approximation ratios. Naïve, however, times out for several queries (Q1, Q2, Q5, Q6, & Q8) and its response time sharply increases as dataset size increases (Q1, Q2, Q6, Q8). Except for three queries (Q3, Q4, Q7), most of Naïve’s solutions are infeasible; even then, SummarySearch produces feasible solutions in orders of magnitude less time with better approximation ratios.
In Q8, we set to enable SummarySearch to still produce feasible solutions (75% feasibility at 274K tuples), without causing Naïve to fail. Q8 is a challenging query as each data value is sampled from a Pareto distribution with different parameters leading to high variability across scenarios.
To further increase the data size scalability of SummarySearch, we hope to combine it with partitioning and divideandconquer approaches similar to SketchRefine (Brucato2018).
7. Related Work
Probabilistic databases and package queries. Probabilistic databases (dalvi2007efficient; suciu2011probabilistic) have focused mainly on modeling discrete data uncertainty; the Monte Carlo Database (MCDB) (jampani2008mcdb) supports arbitrary uncertainty, via VG functions. Probabilistic databases support SQL queries, but lack support for optimization. Package query engines (Brucato2018; vsikvsnys2016solvedb) offer support only for deterministic optimization.
Stochastic optimization. Stochastic optimization (HOMEMDEMELLO201456) studies approximations for stochastic constraints and objectives. Probabilistic constraints are very hard to handle in general, because the feasible region of the inner constraint may be nonconvex (HOMEMDEMELLO201456; ahmed2008solving; calafiore2006probabilistic; CAMPI2009149; nemirovski2006scenario; dentcheva2006optimization; Luedtke2010). In this work, we study stochastic optimization problems with objective functions and constraints defined in terms of linear functions of the tuple attributes.
Our Naïve method is derived from the numerous “scenario approximations” from the SP literature (HOMEMDEMELLO201456; kall1994stochastic; calafiore2006probabilistic; CAMPI2009149; nemirovski2006scenario; luedtke2008sample; campi2011sampling; nemirovski2006convex). Choosing the number of scenarios () a priori is one of the most studied problems. Campi et al. (CAMPI2009149) show that the optimal solution of a Monte Carlo formulation that satisfies exactly i.i.d. scenarios is feasible with probability at least if . Apriori bounds quickly become impractical in a database setting, where is also the number of tuples, and thus typically large. For example, with a table of size ,