Solving Bayesian Risk Optimization via Nested Stochastic Gradient Estimation

07/11/2020 ∙ by Sait Cakmak, et al. ∙ 0

In this paper, we aim to solve Bayesian Risk Optimization (BRO), which is a recently proposed framework that formulates simulation optimization under input uncertainty. In order to efficiently solve the BRO problem, we derive nested stochastic gradient estimators and propose corresponding stochastic approximation algorithms. We show that our gradient estimators are asymptotically unbiased and consistent, and that the algorithms converge asymptotically. We demonstrate the empirical performance of the algorithms on a two-sided market model. Our estimators are of independent interest in extending the literature of stochastic gradient estimation to the case of nested risk functions.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

We consider the following optimization problem:


where is the solution space,

is a random vector representing the randomness in simulation, and

is a function that is evaluated through simulation. The expectation is taken with respect to (w.r.t.) , the correct distribution of . In a typical simulation optimization setting, the true distribution is unknown and estimated from a finite set of input data, and the following approximate problem is solved, where the estimated distribution is denoted by .


Due to the use of a finite dataset, even when the approximate problem (2) is solved to optimality, the optimal solution can perform poorly under true distribution. This issue is referred to as input uncertainty in simulation optimization.

Recently, zhou2015BRO and wu2018BRO proposed the Bayesian Risk Optimization (BRO) framework which formulates the simulation optimization problem under input uncertainty. In BRO, assuming that belongs to a known parameterized family of distributions with unknown parameter , instead of solving (1) we solve the following:



is a risk function mapping the random variable

(induced by ) to a real number, and is the Bayesian posterior distribution of given a chosen prior and input data . The risk function can be chosen according to the risk preferences of the practitioner, and includes the risk neutral expectation and the minimax formulation of distributionally robust optimization (DRO, see e.g. rahimian2019dro-review) as extreme cases under certain conditions. In this paper, we consider the following four cases of

: Expectation, Mean-Variance, Value-at-Risk (VaR), and Conditional Value-at-risk (CVaR). A formal introduction and a thorough review of BRO, along with a discussion on alternative approaches, is provided in Section


We aim to solve the BRO problem (3). To do so, we will use a Stochastic Approximation (SA, see Kushner&Yin2003SA) approach, which requires gradient information. Historically, most work on stochastic gradient estimation focused on finding the gradient of expectation (see FU2006Gradient; FU2008SA-Summary). Some more recent research studies the Monte-Carlo estimation of gradients of VaR and CVaR; e.g. Hong2009VaR for VaR, and Hong2009CVaR for CVaR, where each derives a closed form expression of the corresponding gradient, and provides an asymptotically unbiased and consistent infinitesimal perturbation analysis (IPA) based estimator. Other work in this field includes Liu&Hong2009Quantile-Kernel, Fu2009Quantile-Sens, Jiang2015Quantile, Tamar2015CVaR and Peng2017Quantile, to name a few. A review of Monte-Carlo methods for estimation of VaR, CVaR and their gradients can be found in Hong2014VaR-CVaR-Rev.

Other related works include the literature on nested simulation, e.g. Lan2010CVaR, Gordy2010Nested, Broadie2015Regression, Zhu2017input-uncertainty; Jaiswal2019, which studies the data-driven risk averse optimization problem under a parameterized Bayesian setting using the log-exponential risk measure; and Wang2020BOBRO, which uses Bayesian Optimization (see frazier2018tutorial) methods to optimize the expectation case of BRO with black-box expensive-to-evaluate objective functions.

Our work differs from the aforementioned works in the sense that the literature on nested simulation does not consider gradients or optimization, and the literature on gradient estimation does not consider nested risk functions. The literature on gradient estimation requires access to (and its gradients), while we only have access to (and its gradients), where , and (and its gradients) has to be estimated via sampling. The need to estimate the function adds another level of uncertainty to gradient estimation. To the best of our knowledge, this work is the first to study stochastic gradient estimation of nested risk functions.

The contributions of this paper can be summarized as follows: (i) We propose sample path gradient estimators for the four risk functions

mentioned earlier, extending the literature in stochastic gradient estimation to the case of nested risk functions; (ii) We propose stochastic approximation algorithms with local convergence guarantees for optimization of the BRO framework; (iii) We provide a numerical study on a two-sided market model that demonstrates the value of risk averse solution approaches in the presence of input uncertainty. Although the exposition in this paper is focused on the BRO framework, it is worth noting that our estimators can be applied more broadly, e.g., for estimating the sensitivities of quantiles of financial portfolios.

2 An Overview of BRO Framework

As mentioned in the introduction, in a typical simulation optimization framework, one aims to solve the following problem:


where the solution space is a non-empty, compact subset of , is a random vector representing the stochastic noise in the system, and is a function mapping to . The expectation is taken w.r.t , the true distribution of . In practice, is not known and is typically replaced with an estimate which is obtained from a finite set of input data. The estimation error of due to the use of finite data is often referred to as the input model uncertainty, or simply as input uncertainty. There is a large literature dedicated to studying the impact of input uncertainty in estimating system performance; see Barton2012Input-uncertainty and Song2014Input-uncertainty for a review.

Due to input uncertainty, even when the estimated problem is solved to optimality, the optimal solution can perform poorly under the true distribution. Hence, a natural question is, “how do we make decisions that account for input uncertainty?”. We are interested in finding good solutions that hedge against input uncertainty. One common approach is to construct an ambiguity set that includes

with high probability, and optimize w.r.t. the worst-case outcome within this set. This approach is referred to as Distributionally Robust Optimization (DRO) framework and has a large literature dedicated to it; see

rahimian2019dro-review for a review. In DRO, constructing the ambiguity set is a non-trivial task and has a large impact on the solution performance and tractability of the resulting problem. A large ambiguity set can lead to overly conservative solutions, whereas, a small uncertainty set might fail to include the true distribution. An alternative approach is to optimize with respect to a risk neutral expectation of the objective function over the set of all possible input distributions. As argued in zhou2015BRO, these two approaches can be seen as two extreme cases. The risk neutral expectation might fail to put enough weight over extreme (tail) scenarios, whereas, the DRO approach might be overly conservative due to hedging against worst-case scenarios.

Moreover, to the best of our knowledge, in a simulation optimization setting, the existing literature lacks tractable reformulations of the DRO problems. Although one can use the minimax formulation to formulate the distributionally robust simulation optimization problem, efficient optimization of this problem remains an open question due to the lack of structure in the function. In the robust optimization literature, Bertsimas2010RobustSim study a simulation optimization problem that is jointly robust to both implementation errors and parameter uncertainty, however, their method does not work when one is only concerned about the parameter uncertainty. Besides these popular approaches, the robust simulation optimization problem found interest in the kriging literature, e.g. Dellino2015Metamodel; KLEIJNEN2017Kriging, where a response surface is fitted over

, and robustness is typically facilitated by optimizing the mean performance subject to constraints on the standard deviation.

In this paper, we focus on the Bayesian Risk Optimization (BRO) framework, which was proposed by zhou2015BRO and wu2018BRO as an alternative approach to simulation optimization under input parameter uncertainty. Suppose that the true distribution belongs to a parameterized family of distributions such that for some , where is the unknown true parameter and is the parameter space. Assuming that the form of is known, we take a Bayesian approach and calculate the posterior likelihood of for a given dataset of independent and identically distributed (i.i.d.) input data drawn from the true distribution. Let denote the prior distribution of . Then, using Bayesian updating we can calculate the posterior distribution

where () is the likelihood of obtaining () given parameter , and denotes equivalence up to a normalization constant.

Define as the objective value under parameter , where denotes the expectation w.r.t. . If we view as a random variable with distribution , we can treat as a random variable induced by . Define as a risk function over which maps the random variable to . Instead of solving (4), we solve


which is referred to as the BRO problem. The risk function can be chosen to reflect the risk preference of the practitioner. In this paper, we focus on the following four cases of :

  1. Expectation: ;

  2. Mean - Variance: ;

  3. Value-at-Risk: ;

  4. Conditional Value-at-Risk: ;

where () denote that the expectation (variance) is taken w.r.t. , and denote the level Value-at-Risk and Conditional Value-at-Risk respectively. We will define VaR and CVaR formally in corresponding subsections.

For the four cases of considered here, wu2018BRO study the asymptotic properties of the objective functions and optimal solutions. We briefly summarize their results here. As the intuition would suggest, they show that as the data size , the posterior distribution converges in distribution to a degenerate distribution on . Furthermore, under mild regularity conditions, it is shown that for every fixed as , almost surely (a.s.), and a.s. for all four cases of considered here. Similarly, for the consistency of optimal solutions, it is shown that a.s. as where and are the sets of optimal solutions to (5) and (4) respectively, and is the distance between two sets with and being an arbitrary norm.

Moreover, the analysis of wu2018BRO

reveals the following asymptotic normality results which can be used to construct confidence intervals for the true objective value. Let

denote the normal distribution, and let


denote the probability density function (PDF) and cumulative density function (CDF) of

respectively. Then, for every , as ,

  • for Expectation and Mean-Variance objectives,

  • for the Value-at-Risk objective,

  • and for the Conditional Value-at-Risk objective,

where denotes convergence in distribution. The variance is defined as where is the Fisher information matrix, denotes the transpose, and is the gradient w.r.t. . The point-wise convergence results presented here can be extended to convergence results in the function space of . Similar normality results also hold for the optimal values.

To summarize, wu2018BRO establish the consistency and asymptotic normality of objective functions and optimal solutions for the four cases of considered here. They also show that the objectives of BRO can be approximated as a weighted sum of posterior mean objective and half-width of the true-objective’s confidence interval. In this paper, our aim is to optimize the BRO problem (5) for a given choice of and a given posterior distribution of . We refer the interested reader to zhou2015BRO, Zhou2017book-chapter, and wu2018BRO for further discussion on BRO formulation.

3 Solving the BRO Problem

In this section, we introduce our approach to solving the BRO problem. We take an SA approach, develop the stochastic gradient estimators needed, and conclude with convergence results for the algorithms. Throughout the paper, we use and to denote the gradients and respectively. We use as shorthand for , the expectation over the posterior distribution of , and as a shorthand for . The nested expectation is also shortened as . The proofs are provided in the online supplement.

3.1 Stochastic Approximation Algorithm

The BRO problem in its essence is a typical simulation optimization problem where the objective function is costly to estimate. Due to the nested structure of the objective function, one needs many more samples to estimate the BRO objective compared to a typical expectation or CVaR objective. If one were to use samples to estimate the inner expectation and samples to estimate the outer risk function, it would take a total of samples to estimate the BRO objective. This high cost of estimation motivates us to concentrate on algorithms that take advantage of the structure of the problem and require fewer function evaluations per iteration. With this motivation, the class of gradient based methods known as Stochastic Approximation emerges as an obvious candidate. To solve the BRO problem (5), we propose to use the SA algorithm of the following form (see Kushner&Yin2003SA):


where is a non-empty, compact solution space, is the step size sequence, is the descent direction, is the projection operator that projects the iterate back to the feasible set . A typical candidate for

is an estimate of the negative gradient of the objective function, which leads to the well known Stochastic Gradient Descent algorithm. Given a good estimator of the gradient, the stochastic approximation algorithm has nice convergence properties. We proceed in next subsection with the derivation of stochastic gradient estimators of the BRO problem (

5) for the four cases of mentioned above.

3.2 Derivation of Stochastic Gradient Estimators

In this section, we derive the stochastic gradient estimators for the BRO problem. The results are derived only for one-dimensional . Multidimensional case can be handled by treating each dimension as a one-dimensional parameter while fixing the rest. We start by providing the estimators for Expectation and Mean - Variance cases without going into details, then derive the estimators for the more technically challenging cases of VaR and CVaR. The following lemma from Broadie1996IPA is key to the consistency of IPA estimators and is used without mention throughout the paper.

Lemma 3.1.

Proposition 1, Broadie1996IPA - Let denote a Lipschitz continuous function that is differentiable on a set of points . Suppose that there exists a random variable with such that for all and exists w.p. (with probability) 1 for all , with an open set. If for all , then for all .

3.2.1 Expectation and Mean-Variance Cases

Suppose that the interchange of gradient and expectation is justified (see Assumption 3.3). Then, we have the following for the gradients of expectation and variance respectively:




The equations (7) and (8) can be used to provide gradient estimators for Expectation and Mean-Variance cases. For the Expectation case, it is seen that is a single run unbiased gradient estimator and the sample average (where are independent with distribution with ) is a strongly consistent estimator of the gradient. Similarly, for Mean-Variance case, we have


as an unbiased gradient estimator with independent and i.i.d. samples. One could use the same sample for and the same sample for at the expense of increased variance, and reduce the number of simulation runs to . However, using any fewer simulation runs would make the estimation of second and third terms biased. We do not study the trade off here since our main focus is on the estimation of VaR and CVaR gradients. This subsection is concluded by noting that sample averaging yields a strongly consistent estimator for the Mean-Variance case.

3.2.2 Value-at-Risk Case

In this subsection, we introduce the nested estimator of VaR gradients, and establish the asymptotical properties of the proposed estimator. Value-at-Risk, defined as , is the

quantile of the loss function. We are interested in estimating the gradient

using samples of and corresponding sample path gradients. Throughout the paper, and are used as shorthand notations for VaR and respectively.

If one has access to samples of , then can be estimated by the sample quantile (see Serfling1980) where is the ceiling function, denotes order statistic corresponding to the ordering , and is treated as a random variable induced by . However, in our case, we only have access to samples from . Let denote the Monte-Carlo estimator of generated using samples. Note that the ordering of based on does not necessarily correspond to the ordering of , i.e. does not hold in general. Therefore, we define a new ordering, denoted by , such that . This ordering is not uniquely defined by and depends on the realization of s. Under a mild set of assumptions, Zhu2017input-uncertainty shows that is a strongly consistent estimator of . Motivated by the consistency of , we propose


as the nested estimator of VaR gradients where is the IPA gradient estimator corresponding to . In the remainder of this subsection, we proceed to show that the estimator is asymptotically unbiased, and the batch-mean estimator , where is the number of batches of equal size and is the estimator corresponding to batch , is consistent and asymptotically normally distributed.

Notice that has the same form as , the single-layer estimator of quantile gradients of Hong2009VaR. Both estimators stem from the observation that, under a mild set of assumptions, the quantile gradients can be expressed as .

We now introduce the technical conditions that lead to the consistency of these estimators. The assumptions we introduce here can be viewed in three categories. First, we have a set of assumptions due to Zhu2017input-uncertainty that are needed to justify consistency of by providing the necessary smoothness of . A second set of assumptions are needed to justify the interchange of gradient and expectation, and thus the validity of IPA gradient estimators. An additional assumption by Hong2009VaR is needed to validate the interchange for the case of VaR. A final set of assumptions are needed to mitigate the difficulties arising from conditioning on measure zero sets, and ensure that the pathwise gradient estimator is sufficiently smooth.

Let denote the estimation error and denote the normalized error. Then, . Under following set of assumptions, Zhu2017input-uncertainty prove that is a strongly consistent estimator of .

Assumption 3.2.


  1. For all , the response

    has finite conditional second moment, i.e.,

    w.p. 1 and .

  2. The joint density of and , and its partial gradients and exist for each , all pairs of and for all .

  3. For all , there exists non-negative functions and such that , , for all . Furthermore, for , and .

The first part of the assumption ensures the validity of Central Limit Theorem (CLT). Second and third parts first appear in

Gordy2010Nested and provide sufficient smoothness to ensure that the PDF of convergences to the PDF of sufficiently fast. For our purposes, they provide uniform bounds on the moments of the estimation error. See Gordy2010Nested for further discussion on these assumptions.

Assumption 3.3.

There exists a random variable such that , and the following holds in a probability 1 subset of .

  1. w.p.1 for all .

  2. The sample path gradient exists w.p.1 .

Assumption 3.3 ensures that exists (w.p.1), , and that these relations can be extended to . Let denote the distribution function of and define . We have the following assumption due to Hong2009VaR.

Assumption 3.4.

(Hong2009VaR) For any , has a continuous density in a neighborhood of , and exists and is continuous w.r.t. both and at .

This assumption ensures that

is a continuous random variable in a neighborhood of

, and that its gradient exists and is continuous in the same neighborhood. It is shown in Jiang2015Quantile that Assumptions 3.3 & 3.4 are sufficient to justify the expression , and that the continuity (in ) of follows from these two assumptions. Under these assumptions, Hong2009VaR and Jiang2015Quantile show that as , and the batch-mean estimator (with as the number of batches) is consistent.

We would like to show that as . Let us introduce some more notations that will come in handy in proving this convergence. Given that , we define

  • ,

  • ,

as the probability measures corresponding to the given conditional distributions. These measures will be useful for characterizing and respectively where . In what follows, we let denote a ball centered at with radius .

Assumption 3.5.

Assume that there exists a family of measures and a number such that for all and for all ,

Assumption 3.6.


Assumptions 3.5 and 3.6 impose technical conditions to ensure that the estimation errors for both the function value and its gradients are well behaved. Assumption 3.5 is seemingly abstract and deserves further explanation. It essentially requires that the distribution of the gradient estimate conditioned on the function value converges to its true counterpart. One would notice that conditioned on the value of the function estimate, the gradient estimator is no longer unbiased and in general, as the observations ( and ) rely on the same set of ’s. Intuition suggests that as and the estimation error , the corresponding errors in gradient estimation should also cancel out. Assumption 3.5 is a technical condition that we impose to mitigate the difficulties arising from conditioning on measure zero sets in proving this behavior. In fact, if the condition (and its noisy counterpart) is relaxed from a point to a neighborhood, i.e. , it can be shown that Assumption 3.5 follows from Assumptions 3.2 & 3.6. We provide a detailed discussion on the assumptions in the online supplement, where it is also shown that Assumption 3.5 is satisfied in a general class of problems.

Now that we have established the necessary regularity conditions, we have the following proposition on the asymptotic bias of . In the following means , means , and means and .

Proposition 3.7.

Suppose that Assumptions 3.2 - 3.6 hold. Then as .

Moreover, if in addition the integral in Assumption 3.5 is , is differentiable w.r.t. at , and the budget sequence is such that , then the bias is .

Even though is asymptotically unbiased, it is not consistent in general when is multidimensional, particularly when the set is not a singleton. See Hong2009VaR for a discussion on consistency of , and Jiang2015Quantile for an additional assumption under which is consistent along with some examples. The same argument carries on to our case. A common approach is to use batching to address this difficulty. We have the following theorem which provides the consistency of the batch-mean estimator , where are i.i.d. copies and k is the number of batches.

Theorem 3.8.

Suppose that Assumptions 3.2 - 3.6 hold, then

where denotes convergence in probability.

In addition to the asymptotic unbiasedness and consistency, we have the following result that characterizes the asymptotic distribution of . The proof is a direct application of Lyapunov’s Central Limit Theorem combined with Proposition 3.7. It is identical to the proof of Theorem 5 of Hong2009VaR, and is omitted here.

Theorem 3.9.

Suppose that the (stronger) assumptions of Proposition 3.7 hold, , and for some . Then,

where is the asymptotic variance of .

3.2.3 Conditional Value-at-Risk Case

Conditional Value-at-Risk, defined as , is the expectation of large losses. We are interested in estimating the gradient using samples of (and ). We use and as shorthand notations for CVaR and respectively.

Under a mild set of assumptions, Hong2009CVaR show that CVaR gradients can be written in the form of a conditional expectation as


We propose the following estimator of CVaR gradients that mimics a Monte-Carlo estimator of (11) with the available information.


In the remainder of this subsection, we show that

is a strongly consistent and asymptotically unbiased estimator of


The analysis of relies on a weaker set of assumptions than that of . Assumption 3.5 is no longer needed, and Assumption 3.4 is replaced with the following weaker assumption due to Hong2009CVaR. Assumption 3.10, along with Assumption 3.3, is needed to ensure validity of (11).

Assumption 3.10.


  1. The VaR function is differentiable for any .

  2. For any , .

We note that Assumption 3.10 and is implied by 3.4 and the differentiability of . It is presented separately here, as it replaces Assumption 3.4 with a weaker set of conditions. The following proposition is needed in proving the consistency of .

Proposition 3.11.

Suppose Assumption 3.2 holds and . Then

Note that (11) can be rewritten as , which admits as a Monte-Carlo estimator. Proposition 3.11 shows that the bias introduced by replacing with its noisy version, , disappears in the limit. The following proposition extends this result to show that is asymptotically unbiased and the bias is of the order .

Proposition 3.12.

Under Assumptions 3.2, 3.3, 3.6, 3.10, the bias as . Moreover, if in addition , then the bias is of the order .

We conclude this subsection with the following theorem that provides strong consistency of .

Theorem 3.13.

Under Assumptions 3.2, 3.3, 3.6, 3.10, is a strongly consistent estimator of .

Remark 1.

Even though the results here are derived using the IPA estimator for the inner expectation, one should notice that our proofs only require a consistent estimator of the inner expectation. Therefore, where IPA is not applicable or it is not preferred for any other reason, one could replace with any consistent estimator such as the generalized likelihood ratio estimator of Peng2018GLR, support independent unified likelihood ratio and infinitesimal perturbation analysis estimator of Wang2012SLRIPA etc. as long as the corresponding regularity conditions hold.

3.3 Convergence Analysis of the Algorithms

In this subsection, we start with a brief discussion on implementation and computational cost of the SA algorithm, and show that the use of and results in consistent algorithms for solving the corresponding BRO problems.

The SA algorithm is briefly summarized in Algorithm 1

. When a closed form of the posterior distribution is not available, one may use numerical methods, such as Markov Chain Monte Carlo (MCMC)

(Lange2010), variational Bayes (Fox2012VariationalBayes) etc., to approximate the posterior distribution. We emphasize that it is only necessary to draw an empirical approximation to the posterior before the optimization starts (see Section 4.2 for more details). This avoids a repeated use of e.g. MCMC, which is not necessary since the posterior distribution does not change, and facilitates cost effective sampling of from the generated empirical distribution. Thus, the computational cost of Algorithm 1 is dominated by the simulations of , which has a total computational cost of .

  Input: , the input data, prior distribution and the choice of risk measure.
  Calculate the posterior distribution . If a closed form or is not available, use a computational method (e.g. MCMC or variational Bayes) to draw an empirical approximation.
  for  to  do
      Draw and .
     Simulate , and calculate the estimator if is chosen as VaR, or if is chosen as CVaR.
      Set , where is either of if is chosen as VaR, or if is chosen as CVaR.
  end for
  Return: as the decision.
Algorithm 1 Optimization of BRO via SA

The remainder of the subsection is dedicated to proving the convergence of the algorithms. The main result is summarized in the following theorem.

Theorem 3.14.

Suppose that Assumptions 3.2 - 3.6 hold, and is continuously differentiable w.r.t , are monotonically increasing sequences, , . Then, the SA algorithm (6) with converges w.p.1 to a unique solution set of the ODE


Similarly, under Assumptions 3.2, 3.3, 3.6, 3.10, and assuming is continuously differentiable w.r.t , are monotonically increasing sequences, ,