A central step in stochastic control problems concerns estimating expected costs-to-go that are used to approximate the optimal feedback control. In simulation approaches to this question, costs-to-go are sampled by generating trajectories of the stochastic system and then regressed against current system state. The resulting Q-values are finally ranked to find the action that minimizes expected costs.
When simulation is expensive, computational efficiency and experimental design become important. Sequential strategies rephrase learning the costs-to-go as another dynamic program, with actions corresponding to the sampling decisions. In this article, we explore a Bayesian formulation of this sequential design problem. The ranking objective imposes a novel loss function which mixes classification and regression criteria. Moreover, the presence of multiple stochastic samplers (one for each possible action) and a continuous input space necessitates development of targeted response surface methodologies. In particular, a major innovation is modeling in parallel the spatial correlation within each Q-value, while utilizing a multi-armed bandit perspective for picking which sampler to call next.
To obtain a tractable approximation of the Q-values, we advocate the use of Gaussian process metamodels, viewing the latent response surfaces as realizations of a Gaussian random field. Consequently, the ranking criterion is formulated in terms of the posterior uncertainty about each Q-value. Thus, we connect metamodel uncertainty to the sampling decisions, akin to the discrete-state frameworks of ranking-and-selection and multi-armed bandits. Our work brings forth a new link between emulation of stochastic simulators and stochastic control, offering a new class of approximate dynamic programming algorithms.
1.1 Abstract Ranking Problem
Let , be smooth functions over a subset of . We are interested in the problem of learning the resulting ranking of over the input space
, namely finding the classifier
The functions are a priori unknown but can be noisily sampled. That is for any we have access to a simulator which generates estimates of :
where. Intuitively speaking, we have smooth hyper-surfaces on that can be sampled via Monte Carlo. In the dynamic programming context, is the system state, indexes the various actions available to the controller, represents the expected costs-to-go and captures the simulation noise arising from pathwise simulation of the underlying stochastic system and corresponding costs.
Our goal is to identify the minimal surface globally over the entire input space. More precisely, we seek to assign at each a label , while optimizing the loss metric
where is a specified weight function on determining the relative importance of ranking different regions. Thus, the loss is zero if the ranking is correct , and otherwise is proportional to the (positive) difference between the selected response and the true minimum . The above criterion aims to identify the optimal action to take in state ; if the wrong action is chosen instead, then (1.3
) captures the resulting integrated loss to the controller assuming a probability distributionof potential states .
The loss function in (1.3) blends regression and classification objectives. In regression, one seeks to estimate the response marginally with the loss function tied to a single surface . Instead, (1.3) is only about correctly identifying the index of the minimal response. As a result, small estimation errors are tolerated as long as the minimal response does not change, leading to a thresholding behavior in the loss function. In classification the loss function is discrete (typically with fixed mis-classification penalties), whereas (1.3) takes losses proportional to the mis-classification distance . A further key distinction is that in classification the sampling space is just (returning a noisy label ), whereas in our context a sampling query consists of the location-index pair , sampling one response at a time. The question of which surface to sample requires separate analysis over .
We analyze the design problem of constructing efficient sampling strategies that can well-estimate while optimizing the number of Monte Carlo samples needed. Because ’s are unknown, we frame (1.3) as a Bayesian sequential learning problem of adaptively growing a design that quickly learns . Classical static, i.e. response-independent, designs are inadequate for ranking since the whole essence of optimizing computational efforts is predicated on learning the structure of the unknown ’s. Intuitively, learning manifests itself in focusing the sampling through discriminating both in the input space (focus on regions where identifying is difficult) and in the sampling indices (focus on the surfaces where is likely to be the smallest response).
Due to the joint design space , our problem allows for a dual interpretation. Fixing , (1.1) is about reconstructing an unknown response surface through noisy samples. Aggregating the different response surfaces, sequential design over reduces to identifying the partitions of into the sets
Because in interiors of the partitions the ranking is easier to identify, the main problem is to find the partition boundaries . As a result, (1.1) is related to contour-finding, for which sequential design was studied in [GL13, Picheny10, RanjanBingham08]. Standard contour-finding attempts to identify the level set of the response surface, which corresponds to with known in (1.1
). Hence, the analysis herein can be viewed as a multi-variate extension of contour finding. In turn, contour-finding generalizes the classical objective of minimizing a noisy response, connecting to the expected improvement/information gain trade-off in simulation optimization. In particular, we re-purpose the active learning rules of[cohn:1996, Mackay92].
Conversely, fixing , the aim of determining the smallest response corresponds to the setting of multi-armed bandits (MAB). The bandit has arms and corresponding payoffs , with the decision-theoretic objective (1.1) known as the pure exploration problem [BubeckMunos11, BubeckMunos11X]. Decision policies for which arm to pull are usually expressed in terms of posterior mean and confidence about the respective payoff; this point of view motivates our use of Gap-Upper Confidence Bound (UCB) design strategies [Auer02, SRKRKASE:12]. Compared to this literature, (1.3) contains two key differences. First, the loss function is a weighted pure-regret criterion which to our knowledge has never been used in MAB context. Second, instead of a single bandit with independent arms, we treat the radical extension to a continuum of bandits indexed by . Recently, [GrunewalderAudibert10, GabillonBubeck11] considered multiple bandits which can be viewed as (1.1) with a discrete, non-metrized . We generalize their setting to a continuous , with a spatial correlation structure of the arms.
1.2 Summary of Approach
To handle continuous state spaces that appear in stochastic control, we adopt the framework of kriging or Gaussian process (GP) regression for modeling the Q-values. In both contexts of Design of Experiments (DoE) and continuous MAB’s, kriging models have emerged as perhaps the most popular framework [WilliamsRasmussenBook]. In particular, kriging has been used extensively for sequential regression designs as it allows an intuitive approach to borrow information spatially across samples to build global estimates of the entire response surface
. Two further advantages are the analytic structure of Gaussian processes that allows for analytic evaluation of many Expected Improvement criteria, and the ability to naturally transition between modeling of deterministic (noise-free) experiments where data needs to be interpolated, and stochastic simulators where data smoothing is additionally required.
More generally, we suggest a Bayesian perspective to global ranking, viewing the response surfaces as realizations of a random variable taking values in a given function space. This offers a tractable quantification of posterior metamodel uncertainty and related sequential metrics for determining the minimum surface. Thus, we stress that kriging is not essential to implementation of our algorithms; for example competitive alternatives are available among tree-based models, such as dynamic trees [GTP-trees11] and Bayesian trees [chip:geor:mccu:2010]. Moreover, while classical kriging may not be flexible enough for some challenging problems, there are now several well-developed generalizations, including treed GPs [tgpPackage], local GPs [gramacy:apley:2013], and particle-based GPs [GramacyPolson11], all offering off-the-shelf use through public R packages.
Following the Efficient Global Optimization approach [JonesSchonlauWelch98], we define expected improvement scores that blend together the local complexity of the ranking problem and the posterior variance of our estimates. In particular, we rely on the expected reduction in posterior variance and borrow from the Stepwise Uncertainty Reduction criteria based on GP regression from [PichenyGinsbourger13, ChevalierPicheny13]. We also investigate UCB-type heuristics [Auer02] to trade-off exploration and exploitation objectives. Based on the above ideas, we obtain a number of fully sequential procedures that specifically target efficient learning of over the entire design space . Extensive numerical experiments are conducted to compare these proposals and identify the most promising solutions.
As explained, our algorithms are driven by the exploration-exploitation paradigm quantified in terms of (empirically estimated) local ranking complexity for and confidence in the estimated . To quantify the local ranking complexity, we use the gaps [GabillonBubeck11, CarpentierLazaric11, HoffmanDeFreitas13]. For any , denote by the ranked responses at and by
the gap between the best (smallest) and second-best response. measures the difficulty in ascertaining : for locations where is big, we do not need high fidelity, since the respective minimal response surface is easy to identify; conversely for locations where is small we need more precision. Accordingly, we wish to preferentially sample where is small. This is operationalized by basing the experimental design decisions on the estimated gaps .
In terms of design over , exploration suggests to spend the budget on learning the responses offering the biggest information gain. Namely, substantial benefits are available by discriminating over the sampling indices through locally concentrating on the (two) most promising surfaces . This strategy is much more efficient than the naive equal sampling of each . In addition, since the noise level in may vary with this must also be taken into account. Summarizing, our expected improvement metrics blend empirical gaps and empirical posterior uncertainty based on kriging variance , jointly discriminating across .
Our contributions can be traced along three directions. First, we introduce and analyze a novel sequential design problem targeting the loss function (1.3). This setting is motivated by dynamic programming algorithms where statistical response models have been widely applied since the late 1990s [Egloff05, Longstaff]. Here we contribute to this literature by proposing a Bayesian sequential design framework that generates substantial computational savings. This aspect becomes especially crucial in complex models where simulation is expensive and forms the main computational bottleneck. Second, we generalize the existing literature on Bayesian optimization and contour-finding to the multi-surface setting, which necessitates constructing new EI measures that address joint design in space and index dimensions. We demonstrate that this allows for a double efficiency gain: both in and in . Third, we extend the multiple bandits problem of [GabillonBubeck11] to the case of a continuum of bandits, which requires building a full meta-model for the respective arm payoffs. Our construction offers an alternative to the recent work [BubeckMunos11X] on -armed bandits and opens new vistas regarding links between MAB and DoE.
Our approach also generalizes Gramacy and Ludkovski [GL13]. The latter work proposed sequential design for the contour-finding case where the design is solely over the input space . In that context [GL13] introduced several EI heuristics and suggested the use of dynamic trees for the response modeling. The framework herein however requires a rather different approach, in particular we emphasize the bandit-inspired tools (such as UCB) that arise with simultaneous modeling of multiple response surfaces.
The rest of the paper is organized as follows. Section 2 describes the kriging response surface methodology that we employ, as well as some analytic formulas helpful in the context of ranking. Section 3 then develops the expected improvement heuristics for (1.1). Sections 4 and 5 illustrate the designed algorithms using synthetic data (where ground truth is known), and a case-study from epidemic management, respectively. Finally, Section 6 concludes.
1.3 Connection to Stochastic Control
Consider the objective of minimizing total costs associated with a controlled state process ,
on the horizon . Above encodes the stagewise running costs, is the control strategy taking values in the finite action space , and is a stochastic discrete-time Markov state process with state space . The dynamics of are of the form
for some map , where is a random independent centered noise source. The performance criterion optimizes expected rewards, captured in the value function ,
over all admissible closed-loop Markov strategies . Thus, at time , the action is specified in feedback form as a function of current state . The policy map translates system states into actions and is related to the value function via the dynamic programming equation (DPE):
The notation is meant to emphasize averaging of the stochastic future at based on time- information summarized by the system state . The term is the Q-value, providing the expected cost-to-go if one applies action at .
Solving the DPE is equivalent to computing the Q-values since by (1.6), . The ranking problem in (1.1) is then known as the policy map that partitions the state space into action sets . Given for all and all (initialized via ), we observe that
where is a strategy that uses action at and thereafter, . Indeed, the sum in (1.8) is precisely the random variable for the pathwise costs-to-go . The loss (1.3) is then the difference between acting optimally as at , vis-a-vis taking action (and then acting optimally for the rest of the future, ), weighted by the distribution of .
The formulation (1.8) allows pursuit of policy search methods by tying the accuracy in (1.7) not to the immediate fidelity of (estimated) Q-values , but to the quality of the policy map . Namely, one iteratively computes approximate policy maps for , using (1.8) to construct based on . Note that the original objective of finding requires solving ranking problems of the form (1.1).
This approach to dynamic programming is especially attractive when the action space is very small. A canonical example are optimal stopping problems where , i.e. . For a single stopping decision, the immediate reward is typically given, leading to the case of estimating a single Q-value , see [GL13]. Multiple stopping problems where both and need to be estimated arise in the pricing of swing options [MeinshausenHambly04], valuation of real options [AidLangrene12], and optimizing of entry-exit trading strategies [ZervosJohnson12]. The case was considered for valuation of energy assets, especially gas storage [Secomandi11], that lead to optimal switching problems. For example, storage decisions are usually modeled in terms of the triple alternative of . Small action spaces also arise in many engineering settings, such as target tracking [AndersonMilutinovic11, HLQ15], and sensor management [VeeravalliFuemmeler08].
2 Statistical Model
2.1 Sequential Design
Fix a configuration and corresponding classifier . A design of size is a collection ,
, with superscripts denoting vectors. Fixing, and conditioning on the corresponding samples , let be an estimate of . We aim to minimize the expected loss over all designs of size , i.e.
where the expectation is over the sampled responses . To tackle (2.1) we utilize sequential algorithms that iteratively augment the designs as -samples are collected. The interim designs are accordingly indexed by their size , where . At each step, a new location is added and the estimate is recomputed based on the newly obtained information. The overall procedure is summarized by the following pseudo-code:
Select a new location and sample corresponding
Augment the design
Update the classifier by assimilating the new observation
The basic greedy sampling algorithm adds locations with the aim of minimizing the myopic expected estimation error. More precisely, at step , given design (and corresponding ), the next pair is chosen by
where the expectation is over the next sample . This leads to a simpler one-step-ahead optimization compared to the -dimensional (and typically we are looking at ) formulation in (2.1). Unfortunately, the optimization in (2.2) is still generally intractable because it requires
re-computing the full loss function at each step;
finding the expected change in given ;
integrating over the (usually unknown) distribution of ;
optimizing over the full -dimensional design space .
We accordingly propose efficient numerical approximations to (2.2), relying on the twin ideas of (i) sequential statistical modeling (i.e. computing and updating as grows), and (ii) stochastic optimization (i.e. identifying promising new design sites ).
2.2 Response Surface Modeling
A key aspect of sequential design is adaptive assessment of approximation quality in order to maximize information gain from new samples. Consequently, measuring predictive uncertainty is central to picking . For that purpose, we use a Bayesian paradigm, treating as random objects. Hence, we work with a function space and assume that with some prior distribution . Thus, for each , is a random variable whose posterior distribution is updated based on the collected information from samples . Given the information generated by the -step design , , we define the posterior . The random variable is the belief about conditional on
; its first two moments are referred to as the kriging mean and variance respectively,
We will use as a point estimate of , and as a basic measure of respective uncertainty. The overall global map is called the kriging surface. Note that while there is a spatial correlation structure over , we assume that observations are independent across (so sample noise ), so that the posteriors , are independent.
The order statistics describe the sorted posterior means at a fixed . A natural definition is to announce the minimum estimated surface
i.e. the estimated classifier corresponds to the smallest posterior mean, so that . On the other hand, the uncertainty about can be summarized through the expected minimum of the posteriors ,
Observe that , and we accordingly define the M-gap (“M” for minimum)
The M-gap measures the difference between expectation of the minimum and the minimum expected response, which precisely corresponds to the Bayesian expected loss at in (1.3). This fact offers an empirical analogue of the original loss function in (1.3),
The above formula translates the local accuracy of the kriging surface into a global measure of fidelity of the resulting classifier and will be the main performance measure for our algorithms.
The response surfaces are assumed to be smooth in . As a result, information about is also revealing about for , coupling observations at different sites. To enforce such conditions without a parametric representation, we view each as a sample from a Gaussian process (GP). A GP is specified by its trend or mean function and a covariance structure , with . By specifying the correlation behavior, the kernel encodes the smoothness of the response surface.
Fix the response surface index and let denote the observed samples at locations . These realizations are modeled as in (1.2) with the response represented as
where is a fixed trend term and is a realization of a Gaussian process. Given the samples , the posterior of again forms a GP; in other words any collection is multivariate Gaussian with mean , covariance , and variance , specified by [WilliamsRasmussenBook, Sec. 2.7] (see also [NelsonStaum10]):
and is the positive definite matrix , . By independence across , the vector of posteriors at a fixed satisfies
A common choice is the Matern-5/2 kernel
The length-scale parameter vector controls the smoothness of members of , the smaller the rougher. The variance scalar parameter determines the amplitude of fluctuations in the response.
A major advantage of kriging for sequential design are updating formulas that allow to efficiently assimilate new data points into an existing fit. Namely, if a new sample is added to an existing design , the mean and kriging variance at location are updated via
where is a weight function specifying the influence of the new sample at on (conditioned on existing design locations
). In particular, the local reduction in posterior standard deviation atis proportional to the current [GinsbourgerEmery14]:
Note that the updated posterior variance is a deterministic function of which is independent of .
For our examples below, we have used the DiceKriging R package [kmPackage-R] to compute (2.9). The software takes as input the location-index pairs , the corresponding samples and the noise levels , as well as the kernel family (Matern-5/2 (2.11) by default) and trend basis functions and runs an EM MLE algorithm to estimate the hyper-parameters describing the kriging kernel .
2.4 Summary Statistics for Ranking
Given a fitted kriging surface (for notational convenience in this section we omit the indexing by the design size ), the respective classifier is obtained as in (2.5). Note that
is not necessarily the MAP (maximum a posteriori probability) estimator, since the ordering of the posterior probabilities and posterior means need not match for. Two further quantities are of importance for studying the accuracy of : gaps and posterior probabilities. First, the gaps quantify the differences between the posterior means, namely
where are the ordered posterior means. Note that under , we have due to symmetry. Second, define the posterior probabilities for the minimal rank
We refer to as the decreasing ordered values of the vector , so that the index of is the MAP estimate of the minimal response surface. The following proposition provides a semi-analytic recursive formula to evaluate in terms of the kriging means and variances .
Proposition (Azimi et al. [AzFeFe:11])
If , then for any ,
where is standard normal cdf, and , with a matrix defined via
For , we have , and .
The next proposition provides another semi-analytic formula to evaluate defined in (2.6).
Suppose that and let , be two independent Gaussians. Define
Then the first two moments of are given by:
Equation (2.19) provides a closed-form expression to evaluate for . In the case , one may evaluate recursively using a Gaussian approximation. For instance, for , approximate by a Gaussian random variable with mean/variance specified by (2.19)-(2.20) respectively (i.e. using and ) and then apply Proposition 2.4 once more to .
3 Expected Improvement
The Bayesian approach to sequential design is based on greedily optimizing an acquisition function. The optimization is quantified through Expected Improvement (EI) scores that identify pairs that are most promising in terms of lowering the global empirical loss function according to (2.2). In our context the EI scores are based on the posterior distributions which summarize information learned so far about .
Our two main heuristics are dubbed Gap-UCB and Gap-SUR:
The Gap-UCB score is motivated by the exploration-exploitation trade-off in MAB’s and favors locations with small gaps in posterior means and high kriging variance. Indeed, the local empirical gap measure [GabillonBubeck11] identifies the most promising arm, while the kriging variance promotes exploration to reduce uncertainty about arm payoffs. The two are connected via the UCB (upper confidence bound [SRKRKASE:12]) tuning parameter that balances exploration (regions with high ) and exploitation (regions with small gap). Another interpretation of Gap-UCB is to mimic a complexity-sampling scheme that selects design sites based on the complexity of the underlying ranking problem. Indeed, the gap measures the hardness of testing whether ; the smaller the tougher. At the same time, the kriging variance can be related to information gain from sampling at
(being akin to the standard error of a point estimator).
The Gap-SUR strategy is coming from the perspective of simulation optimization. Recall that we strive to lower the empirical loss in (2.8) which is related to the M-gap in (3.2), . Accordingly, the Gap-SUR criterion uses to guide the adaptive design, by aiming to maximizing its expected local reduction if we add to the design. Such Stepwise uncertainty reduction (SUR) strategies were introduced in [Picheny12, ChevalierPicheny13]. The evaluation of (3.2) requires computing the expected mean and variance of and . The updating formula (2.12) implies that (keeping fixed) , while (2.14) yields . The rest of the computation becomes straightforward in view of Proposition 2.4.
Gap-SUR is also connected to the Active Learning Cohn (ALC) [cohn:1996] approach to DoE. In ALC, minimization of posterior variance is achieved by greedily maximizing reduction in . In Gap-SUR, minimization of is achieved by maximizing reduction in . The ALC paradigm suggests an alternative to (3.1), namely , that blends expected decline in kriging variance with the estimated gap.
Asymptotic Behavior. The Gap-SUR method aims to drive the M-gaps to zero, which is equivalent to learning all the responses: , see (3.2). For GP models, vanishing posterior variance at corresponds to the design being dense in the neighborhood of . Thus, asymptotically, the Gap-SUR heuristic will generate designs that are dense across . Finally, previous results about consistency of GP models (see for example [ChoiSchervish07]) can be invoked to establish that .
On the other hand, proper selection of the UCB schedule is crucial for the performance of Gap-UCB. If then convergence is not guaranteed. Indeed, consider such that , but the estimated gaps based on interim satisfy due to estimation error at . Then at stage the algorithm will prefer site over (since it has smaller gap ) and will then possibly get trapped indefinitely, never realizing that the estimated ordering between and is wrong. Hence without UCB the algorithm is prone to get trapped at local minima of . At the same time, any increasing unbounded guarantees that . Toward this end, Srinivas et al. [SRKRKASE:12] proved that in a cumulative regret setting should grow logarithmically in sample size . Further rules on how to choose (for the case of a finite state space ) can be found in [GabillonBubeck11]. Another alternative is a randomized version. For example, in -greedy sampling, with probability at any step instead of using an EI metric, are selected uniformly in . This ensures that the designs are dense in as and is a feature that we resort to in our experiments. Still, fine-tuning the schedule of is highly non-trivial in black-box settings. For this reason, usage of the Gap-UCB approach is sensitive to implementation choices and further guidance on selecting is left for future research.
3.1 Selecting the Next Sample Location
To grow the designs over we use the EI scores via the greedy sampling strategy
Because the above introduces a whole new optimization sub-problem, in cases where this is computationally undesirable we instead replace with where is a finite candidate set. Optimization over is then done by direct inspection. The justification for this procedure is that (i) we expect to be smooth in and moreover relatively flat around ; (ii) is already an approximation so that it is not required to optimize it precisely; (iii) performance of optimal design should be insensitive to small perturbations of the sampling locations. To construct such candidate sets in , we employ Latin hypercube sampling (LHS) [McBeCo:79]. LHS candidates ensure that new locations are representative, and well spaced out over . See [gra:lee:2009, Sec 3.4] for some discussion on how should be designed. In addition, we refresh our candidate set at each iteration, to enable “jittering”. Algorithm 1 below presents the resulting method in pseudo-code.