Tight Bayesian Ambiguity Sets for Robust MDPs

11/15/2018 ∙ by Reazul Hasan Russel, et al.

Robustness is important for sequential decision making in a stochastic dynamic environment with uncertain probabilistic parameters. We address the problem of using robust MDPs (RMDPs) to compute policies with provable worst-case guarantees in reinforcement learning. The quality and robustness of an RMDP solution are determined by its ambiguity set. Existing methods construct ambiguity sets that lead to impractically conservative solutions. In this paper, we propose RSVF, which achieves less conservative solutions with the same worst-case guarantees by 1) leveraging a Bayesian prior, 2) optimizing the size and location of the ambiguity set, and, most importantly, 3) relaxing the requirement that the set is a confidence interval. Our theoretical analysis shows the safety of RSVF, and the empirical results demonstrate its practical promise.




1 Introduction

Markov decision processes (MDPs) provide a versatile methodology for modeling dynamic decision problems under uncertainty [Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998; Puterman, 2005]. MDPs assume that transition probabilities are known precisely, but this is rarely the case in reinforcement learning. Errors in transition probabilities often result in policies that are brittle and fail in real-world deployments. A promising framework for robust reinforcement learning is the robust MDP (RMDP), which assumes that the transition probabilities and/or rewards are not known precisely. Instead, they can take on any value from a so-called ambiguity set, which represents a set of plausible values [Xu and Mannor, 2006, 2009; Mannor et al., 2012; Petrik, 2012; Hanasusanto and Kuhn, 2013; Tamar et al., 2014; Delgado et al., 2016; Petrik et al., 2016]. The choice of an ambiguity set determines the trade-off between robustness and average performance of an RMDP.

The main contribution of this paper is RSVF, a new data-driven Bayesian approach to constructing ambiguity sets for RMDPs. The method computes policies with tighter safe estimates (Definition 2.1) by introducing two new ideas. First, it is based on Bayesian posterior distributions rather than distribution-free bounds. Second, RSVF does not construct ambiguity sets as simple confidence intervals. Using a confidence interval as the ambiguity set is sufficient, but not necessary, for the estimate to be safe. RSVF uses the structure of the value function to optimize the location and shape of the ambiguity set so that it guarantees lower bounds directly, without requiring the set to be a confidence interval.

2 Problem Statement: Data-driven RMDPs

We propose to use Robust Markov Decision Processes (RMDPs) with states $\mathcal{S} = \{1, \ldots, S\}$ and actions $\mathcal{A} = \{1, \ldots, A\}$ to compute a policy with the maximal safe estimate of return.

Definition 2.1 (Safe Estimate of Return).

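We say that an estimate $\tilde{\rho}(\pi)$ of policy return is safe with probability $1-\delta$ for a given dataset $D$ if it satisfies: $\mathbb{P}_{P^\star}\left[\tilde{\rho}(\pi) \le \rho(\pi, P^\star) \mid D\right] \ge 1-\delta$ for each stationary deterministic policy $\pi$. Here $P^\star$ denotes the true, but unknown, transition probabilities, and $\rho(\pi, P^\star)$ is the return for a policy $\pi$.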

In the standard batch RL setting, $D$ can be used to estimate the transition probabilities, but $P^\star$ is assumed not to be known precisely for the RMDP and is constrained to be in the ambiguity set $\mathcal{P}_{s,a}$, defined for each state $s$ and action $a$ (s,a-rectangular). The most common method for defining ambiguity sets is to use a norm-bounded distance from a nominal probability distribution $\bar{p}$: $\mathcal{P}_{s,a} = \{p \in \Delta^S : \|p - \bar{p}_{s,a}\|_1 \le \psi_{s,a}\}$ for a given $\psi_{s,a} \ge 0$ and a nominal point $\bar{p}_{s,a}$. We assume that the rewards $r_{s,a}$ are known. The objective is to maximize the $\gamma$-discounted infinite horizon return.

Figure 1: Comparison of 90% confidence intervals, Left: Hoeffding (green) and Bayesian (blue), Right: RSVF (green) and BCI (blue).

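RMDPs satisfy properties similar to those of regular MDPs [Iyengar, 2005; Tamar et al., 2014]. The robust Bellman operator $\hat{T}$ is defined for a state $s$ as: $(\hat{T} v)(s) = \max_{a \in \mathcal{A}} \min_{p \in \mathcal{P}_{s,a}} \left(r_{s,a} + \gamma\, p^\top v\right)$. The robust return $\hat{\rho}$ is defined as [Iyengar, 2005]: $\hat{\rho} = \max_{\pi} \min_{P \in \mathcal{P}} \rho(\pi, P) = p_0^\top \hat{v}^\star$, where $p_0$ is the initial distribution and $\hat{v}^\star$ is the fixed point of $\hat{T}$. In general, we use a hat to denote quantities in the RMDP.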
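To make the robust Bellman operator concrete, the following is a minimal sketch for s,a-rectangular $L_1$ ambiguity sets. It assumes rewards that depend only on the state and action, and it uses the well-known greedy solution of the inner minimization (move up to half the radius of probability mass onto the lowest-valued state, taking it from the highest-valued ones). The function names and the data layout are illustrative assumptions, not part of the paper.

```python
def worst_case_l1(v, p_bar, psi):
    """Solve min p^T v over the simplex subject to ||p - p_bar||_1 <= psi.

    Returns the worst-case expected value and the minimizing distribution.
    """
    p = list(p_bar)
    i_min = min(range(len(v)), key=lambda i: v[i])
    # At most psi / 2 of probability mass can be shifted onto the worst state.
    eps = min(psi / 2.0, 1.0 - p[i_min])
    p[i_min] += eps
    # Take the same amount of mass away from the highest-valued states first.
    for i in sorted(range(len(v)), key=lambda i: v[i], reverse=True):
        if eps <= 0.0:
            break
        if i == i_min:
            continue
        take = min(eps, p[i])
        p[i] -= take
        eps -= take
    return sum(pi * vi for pi, vi in zip(p, v)), p


def robust_bellman(v, rewards, p_bar, psi, gamma):
    """One application of the robust Bellman operator.

    rewards[s][a] is a scalar, p_bar[s][a] is a nominal distribution over
    next states, and psi[s][a] is the L1 radius (all assumed layouts).
    """
    new_v = []
    for s in range(len(v)):
        q_values = [
            rewards[s][a] + gamma * worst_case_l1(v, p_bar[s][a], psi[s][a])[0]
            for a in range(len(rewards[s]))
        ]
        new_v.append(max(q_values))
    return new_v


def robust_value_iteration(rewards, p_bar, psi, gamma, iterations=200):
    """Iterate the robust Bellman operator from the zero value function."""
    v = [0.0] * len(rewards)
    for _ in range(iterations):
        v = robust_bellman(v, rewards, p_bar, psi, gamma)
    return v
```

With a value function `v` and an initial distribution `p0`, the robust return estimate would then be the inner product of `p0` and the converged `v`.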

3 Ambiguity Sets as Confidence Intervals

In this section, we describe the standard approach to constructing ambiguity sets as distribution-free confidence intervals and propose its extension to the Bayesian setting.

Distribution-free Confidence Interval. The use of distribution-free error bounds on the $L_1$ norm is common in reinforcement learning [Petrik et al., 2016; Taleghan et al., 2015; Strehl and Littman, 2004]. The confidence interval is constructed around the mean transition probability by combining the Hoeffding inequality with the union bound [Weissman et al., 2003; Petrik et al., 2016]. The Hoeffding ambiguity set is defined as: $\mathcal{P}^H_{s,a} = \left\{p \in \Delta^S : \|p - \bar{p}_{s,a}\|_1 \le \sqrt{\frac{2}{n_{s,a}} \log \frac{S A 2^S}{\delta}}\right\}$, where $\bar{p}_{s,a}$ is the mean transition probability computed from $D$ and $n_{s,a}$ is the number of observed transitions originating from state $s$ and action $a$. An important limitation of $\mathcal{P}^H$ is that the size of the ambiguity set grows with the number of states $S$.
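As a rough illustration of how conservative the distribution-free bound can be, the sketch below computes an $L_1$ radius of the form $\sqrt{(2/n_{s,a}) \log(S A 2^S / \delta)}$. This specific form is an assumption here, following the Weissman-style bound combined with a union bound over state-action pairs; the function name and the cap at 2 (the largest possible $L_1$ distance between two distributions) are illustrative choices.

```python
import math


def hoeffding_l1_radius(n_sa, num_states, num_actions, delta):
    """L1 radius of a distribution-free ambiguity set for one state-action pair.

    Assumed form: sqrt(2 / n * log(S * A * 2^S / delta)).
    """
    if n_sa <= 0:
        return 2.0  # no data: the set covers the whole simplex
    # log(S * A * 2^S / delta), expanded to avoid overflow for large S
    log_term = math.log(num_states * num_actions / delta) + num_states * math.log(2.0)
    return min(2.0, math.sqrt(2.0 / n_sa * log_term))
```

For example, with 3 states, 2 actions, and `delta = 0.05`, the radius with 100 samples is about 0.37, and it only shrinks at the rate of one over the square root of the sample count.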

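Bayesian Confidence Interval (BCI). Here we assume that data $D$ is available and that a hierarchical Bayesian model can be used to infer a probability distribution over $P^\star$, either analytically or using MCMC tools such as Stan [Gelman et al., 2014]. To construct the ambiguity set $\mathcal{P}^B$, we optimize for the smallest ambiguity set around the mean transition probability, under the assumption that a smaller ambiguity set will lead to a tighter lower bound estimate. Formally, the optimization problem to compute $\psi^B_{s,a}$ for each state $s$ and action $a$ is: $\min_{\psi \ge 0} \left\{\psi : \mathbb{P}\left[\|p^\star_{s,a} - \bar{p}_{s,a}\|_1 > \psi \mid D\right] < \frac{\delta}{SA}\right\}$, where the nominal point is $\bar{p}_{s,a} = \mathbb{E}_{P^\star}[p^\star_{s,a} \mid D]$.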
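When the posterior is available only through samples, the smallest radius can be approximated empirically. The sketch below assumes a list of posterior samples of the transition distribution for a single state-action pair, centers the set at the sample mean, and picks the smallest sampled $L_1$ distance such that the fraction of samples outside the ball is strictly below $\delta/(SA)$. The function name and the sample-based approximation are assumptions made for illustration.

```python
import math


def bci_radius(samples, delta, num_states, num_actions):
    """Approximate Bayesian confidence interval for one state-action pair.

    samples: posterior samples of the transition distribution, each a list
    of probabilities. Returns the nominal point and the L1 radius.
    """
    n = len(samples)
    dim = len(samples[0])
    # Nominal point: posterior mean estimated from the samples.
    p_bar = [sum(p[j] for p in samples) / n for j in range(dim)]
    dists = sorted(
        sum(abs(p[j] - p_bar[j]) for j in range(dim)) for p in samples
    )
    level = delta / (num_states * num_actions)
    # The fraction of samples strictly outside the ball must be < level.
    allowed_outside = max(math.ceil(level * n) - 1, 0)
    allowed_outside = min(allowed_outside, n - 1)
    psi = dists[n - 1 - allowed_outside]
    return p_bar, psi
```

The quality of this approximation depends on the number of posterior samples; with few samples and a small `delta / (S * A)`, the radius is simply the largest sampled distance.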

Figure 2: Single state with Dirichlet prior, return error with 95% Confidence & violations.

4 RSVF: Robustification With Sensible Value Functions

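RSVF uses samples from a posterior distribution, similar to a Bayesian confidence interval, but it relaxes the safety requirement, since it is sufficient to guarantee for each state $s$ and action $a$ that:

$$\min_{v \in \mathcal{V}} \mathbb{P}_{P^\star}\left[\min_{p \in \mathcal{P}_{s,a}} (p - p^\star_{s,a})^\top v \le 0 \,\middle|\, D\right] \ge 1 - \frac{\delta}{SA} \qquad (1)$$

with $\mathcal{V} = \{\hat{v}^\star_{\mathcal{P}}\}$. When constructing the set $\mathcal{P}$ here, the set $\mathcal{V}$ is not fixed but depends on the robust solution, which in turn depends on $\mathcal{P}$. RSVF starts with a guess of a small set for $\mathcal{V}$ and then grows it, each time with the current value function, until it contains $\hat{v}^\star_{\mathcal{P}}$, which is always recomputed after constructing the ambiguity set $\mathcal{P}$.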

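Input: Desired confidence level $1-\delta$ and posterior distribution $\mathbb{P}_{P^\star}[\,\cdot \mid D]$
Output: Policy with a maximized safe return estimate
1 Initialize current policy $\pi_0 \leftarrow \arg\max_{\pi} \rho(\pi, \mathbb{E}_{P^\star}[P^\star \mid D])$;
2 Initialize current value $\hat{v}_0 \leftarrow v^{\pi_0}$, evaluated under $\mathbb{E}_{P^\star}[P^\star \mid D]$;
3 Initialize value robustness set $\mathcal{V}_0 \leftarrow \{\hat{v}_0\}$;
4 Construct $\mathcal{P}_0$ optimal for $\mathcal{V}_0$;
5 Initialize counter $k \leftarrow 0$;
6 while Eq. 1 is violated with $\mathcal{V} = \{\hat{v}_k\}$ do
7       Include $\hat{v}_k$ that violates Eq. 1: $\mathcal{V}_{k+1} \leftarrow \mathcal{V}_k \cup \{\hat{v}_k\}$;
8       Construct $\mathcal{P}_{k+1}$ optimized for $\mathcal{V}_{k+1}$;
9       Compute robust value function $\hat{v}_{k+1}$ and policy $\pi_{k+1}$ for $\mathcal{P}_{k+1}$;
10       $k \leftarrow k + 1$;
11 end
12 return $(\pi_k,\ p_0^\top \hat{v}_k)$;
Algorithm 1 RSVF: Robustification with Sensible Value Functions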

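In lines 4 and 8 of Algorithm 1, the ambiguity set is computed for each state-action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$. The center $\bar{p}_{s,a}$ and the set size $\psi_{s,a}$ are computed from Eq. 3 using the set $\mathcal{V}$ and the optimal $g_v$ computed by solving Eq. 2. When the set $\mathcal{V}$ is a singleton, it is easy to compute a form of an optimal ambiguity set:

$$g = \max\left\{c : \mathbb{P}_{P^\star}\left[c \le v^\top p^\star_{s,a} \mid D\right] \ge 1 - \frac{\delta}{SA}\right\} \qquad (2)$$

When $\mathcal{V} = \{v\}$ is a singleton, it is sufficient for the ambiguity set to be a subset of the hyperplane $\{p \in \Delta^S : v^\top p = g\}$ for the estimate to be safe. When $\mathcal{V}$ is not a singleton, we only consider the setting in which it is discrete, finite, and relatively small. We propose to construct a set defined in terms of an $L_1$ ball with the minimum radius such that it is safe for every $v \in \mathcal{V}$. Assuming that $\mathcal{V} = \{v_1, v_2, \ldots, v_m\}$, we solve the following linear program:

$$\psi_{s,a} = \min_{p \in \Delta^S} \left\{\max_{i = 1, \ldots, m} \|q_i - p\|_1 : v_i^\top q_i = g_i,\ q_i \in \Delta^S,\ i = 1, \ldots, m\right\} \qquad (3)$$

where $g_i$ is the solution of Eq. 2 for $v_i$ and the minimizing $p$ is the center $\bar{p}_{s,a}$. In other words, we construct the set to minimize its radius while still intersecting the hyperplane for each $v$ in $\mathcal{V}$. Algorithm 1, as described, is not guaranteed to converge in finite time. It can be readily shown that the value functions in the individual iterations are non-increasing. A simple remedy is therefore to stop once the value function becomes smaller, and thus more conservative, than that of BCI.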
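For the singleton case, the threshold that defines the safe hyperplane can be approximated from posterior samples as an empirical quantile of $v^\top p$. The sketch below assumes posterior samples of the transition distribution for one state-action pair and returns the largest sampled value $g$ such that at least a $1 - \delta/(SA)$ fraction of samples satisfy $g \le v^\top p$. The function name and the sample-based quantile are illustrative assumptions; the linear program for the non-singleton case is not shown.

```python
import math


def safe_value_threshold(samples, v, delta, num_states, num_actions):
    """Empirical version of the singleton-case threshold g.

    samples: posterior samples of the transition distribution (lists of
    probabilities); v: value function over next states.
    """
    returns = sorted(sum(pj * vj for pj, vj in zip(p, v)) for p in samples)
    n = len(returns)
    level = delta / (num_states * num_actions)
    # Choosing g = returns[k] leaves at least n - k samples with value >= g,
    # so the requirement (n - k) / n >= 1 - level holds iff k <= level * n.
    k = min(math.floor(level * n), n - 1)
    return returns[k]
```

Any distribution $p$ on the simplex with $v^\top p$ equal to the returned threshold would then yield a lower bound that holds for the required fraction of posterior samples.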

Figure 3: Return error with a Gaussian prior with 95% confidence, Left: Single state, Right: Full MDP, X-axis is the number of samples per state-action.

5 Empirical Evaluation

In this section, we evaluate the safe estimates computed by BCI and RSVF empirically. We assume a true model of each problem and generate a number of simulated datasets from the known distribution. We compute the largest safe estimate for the optimal return and compare it with the optimal return for the true model. We compare our results with the "Hoeffding Inequality" based distance and with "Mean Transition", which simply solves the expected model and provides no safety guarantees. The reported value is the predicted regret, defined as the absolute difference between the true optimal return and the robust estimate; a smaller regret is better. All of our experiments use a 95% confidence for safety unless otherwise specified.
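The two quantities reported in the experiments can be computed as in the sketch below, which assumes that the true optimal return is known (as it is in simulation) and that one estimate is available per simulated dataset. The function names are illustrative; a violation is counted whenever an estimate exceeds the true return and therefore fails to be a lower bound.

```python
def predicted_regret(true_return, estimate):
    """Absolute difference between the true optimal return and an estimate."""
    return abs(true_return - estimate)


def average_regret(true_return, estimates):
    """Mean predicted regret over estimates from several datasets."""
    return sum(predicted_regret(true_return, e) for e in estimates) / len(estimates)


def violation_rate(true_return, estimates):
    """Fraction of estimates that overshoot the true return (unsafe)."""
    return sum(1 for e in estimates if e > true_return) / len(estimates)
```

An estimate that is safe at the 95% level should show a violation rate of at most roughly 0.05 across datasets; a rate far below that suggests the method is more conservative than required.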

Single-state Bellman Update. We initially consider simple problems in which a transition from a single non-terminal state, following a single action, leads to one of multiple terminal states. The values of the terminal states are fixed and assumed to be provided. We evaluate different priors over the transition probabilities: i) an uninformative Dirichlet prior and ii) an informative Gaussian prior. Note that RSVF is optimal in this simplistic setting, as Fig. 2 (left) and Fig. 3 (left) show. As expected, the mean estimate provides the tightest bound, but Fig. 2 (right) illustrates that it does not provide any meaningful safety guarantees.

Full MDP with Informative Prior. Next, we evaluate RSVF on a full MDP problem. Standard RL benchmarks, like cart-pole or arcade games, lack meaningful Bayesian priors. We instead use a simple exponential population model, based on the management of an invasive species [Taleghan et al., 2015]. The population $N_t$ of the invasive species at time $t$ evolves according to the exponential dynamics $N_{t+1} = \min(\lambda_t N_t, K)$. Here, $\lambda_t$ is the growth rate and $K$ is the carrying capacity of the environment. A land manager needs to decide, at each time $t$, whether to take a control action $z_t \in \{0, 1\}$, which influences the growth rate $\lambda_t$. If $z_t$ is the indicator of whether the control action was taken, the growth rate is defined as: $\lambda_t = \bar{\lambda} - z_t N_t \beta_1 - z_t \max(0, N_t - \bar{N})^2 \beta_2$, where $\beta_1$ and $\beta_2$ are the coefficients of control effectiveness. We also assume that we only observe $y_t$, a noisy estimate of the population $N_t$: $y_t \sim \mathcal{N}(N_t, \sigma_y^2)$. In the MDP model, the population observation defines the state. There are two actions: to apply or not to apply the control measure. Transition probabilities are given by the population evolution function. The reward for the MDP captures the costs of a high invasive population and of applying the treatment.
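A population model of this kind can be simulated in a few lines. The sketch below is only an illustration: the functional forms (capped exponential growth, a growth rate reduced by control in proportion to the population and to the squared excess over a threshold, Gaussian observation noise) and all parameter values are assumptions chosen for the example, not values taken from the paper.

```python
import random


def growth_rate(population, control, lam_bar, beta1, beta2, n_bar):
    """Growth rate under an assumed control-effectiveness model.

    control is 0 or 1; beta1 and beta2 scale the effect of the control.
    """
    excess = max(0.0, population - n_bar)
    return lam_bar - control * population * beta1 - control * excess ** 2 * beta2


def step(population, control, rng, lam_bar=1.2, beta1=0.001, beta2=0.0001,
         n_bar=50.0, capacity=100.0, sigma_y=2.0):
    """Advance the population one period and return (population, observation).

    All default parameter values are hypothetical.
    """
    rate = growth_rate(population, control, lam_bar, beta1, beta2, n_bar)
    next_population = min(rate * population, capacity)  # carrying capacity cap
    observation = rng.gauss(next_population, sigma_y)   # noisy measurement
    return next_population, observation


def simulate(initial_population, controls, seed=0):
    """Roll the model forward for a fixed sequence of control decisions."""
    rng = random.Random(seed)
    population, trajectory = initial_population, []
    for z in controls:
        population, y = step(population, z, rng)
        trajectory.append((population, y))
    return trajectory
```

Transition counts for a discretized version of the observed population could then be collected from such trajectories to form the dataset used by the robust methods.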

Fig. 3 (right) depicts the average predicted regret over the different datasets. The distribution-free methods are very conservative; BCI improves on this behavior somewhat, and RSVF provides bounds that are tighter than those of BCI by almost a factor of 2. The rate of violations is 0 for all robust methods. This indicates that RSVF, too, is overly conservative in this case, since its rate of violations is also close to 0. This is due to its reliance on the union bound across multiple states, the approximate construction of the individual ambiguity sets, and the inherent rectangularity assumption.

6 Conclusion

We propose, in this paper, a new Bayesian approach to the construction of ambiguity sets in robust reinforcement learning. This approach has several important advantages over standard distribution-free methods used in the past. Our experimental results and theoretical analysis indicate that the Bayesian ambiguity sets can lead to much tighter safe return estimates.


  • Bertsekas and Tsitsiklis [1996] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming. 1996.
  • Delgado et al. [2016] Karina V. Delgado, Leliane N. De Barros, Daniel B. Dias, and Scott Sanner. Real-time dynamic programming for Markov decision processes with imprecise probabilities. Artificial Intelligence, 230:192–223, 2016.
  • Gelman et al. [2014] Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian Data Analysis. 3rd edition, 2014.
  • Hanasusanto and Kuhn [2013] GA Hanasusanto and Daniel Kuhn. Robust Data-Driven Dynamic Programming. In Advances in Neural Information Processing Systems (NIPS), 2013.
  • Iyengar [2005] Garud N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, may 2005.
  • Mannor et al. [2012] Shie Mannor, O Mebel, and H Xu. Lightning does not strike twice: Robust MDPs with coupled uncertainty. In International Conference on Machine Learning, 2012.
  • Petrik et al. [2016] Marek Petrik, Mohammad Ghavamzadeh, and Yinlam Chow. Safe Policy Improvement by Minimizing Robust Baseline Regret. In Advances in Neural Information Processing Systems, 2016.
  • Petrik [2012] Marek Petrik. Approximate dynamic programming by minimizing distributionally robust bounds. In International Conference of Machine Learning, 2012.
  • Puterman [2005] Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc., 2005.
  • Strehl and Littman [2004] A. L. Strehl and M. L. Littman. An empirical evaluation of interval estimation for Markov decision processes. (April 2007):128–135, 2004.
  • Sutton and Barto [1998] Richard S Sutton and Andrew Barto. Reinforcement learning. 1998.
  • Taleghan et al. [2015] Majid Alkaee Taleghan, Thomas G. Dietterich, Mark Crowley, Kim Hall, and H. Jo Albers. PAC Optimal MDP Planning with Application to Invasive Species Management. Journal of Machine Learning Research, 16(1):3877–3903, 2015.
  • Tamar et al. [2014] Aviv Tamar, Shie Mannor, and Huan Xu. Scaling Up Robust MDPs using Function Approximation. In International Conference of Machine Learning (ICML), 2014.
  • Weissman et al. [2003] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the L_1 deviation of the empirical distribution. jun 2003.
  • Xu and Mannor [2006] Huan Xu and Shie Mannor. The robustness-performance tradeoff in Markov decision processes. Advances in Neural Information Processing Systems, 2006.
  • Xu and Mannor [2009] Huan Xu and Shie Mannor. Parametric regret in uncertain Markov decision processes. Proceedings of the IEEE Conference on Decision and Control, pages 3606–3613, 2009.