1 Introduction
Markov decision processes (MDPs) provide a versatile methodology for modeling dynamic decision problems under uncertainty [Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998; Puterman, 2005]. MDPs assume that transition probabilities are known precisely, but this is rarely the case in reinforcement learning. Errors in transition probabilities often result in policies that are brittle and fail in real-world deployments. A promising framework for robust reinforcement learning is the robust MDP (RMDP), which assumes that the transition probabilities and/or rewards are not known precisely. Instead, they can take on any value from a so-called ambiguity set, which represents a set of plausible values [Xu and Mannor, 2006, 2009; Mannor et al., 2012; Petrik, 2012; Hanasusanto and Kuhn, 2013; Tamar et al., 2014; Delgado et al., 2016; Petrik et al., 2016]. The choice of an ambiguity set determines the tradeoff between robustness and average performance of an RMDP.
The main contribution of this paper is RSVF, a new data-driven Bayesian approach to constructing ambiguity sets for RMDPs. The method computes policies with tighter safe estimates (Definition 2.1) by introducing two new ideas. First, it is based on Bayesian posterior distributions rather than distribution-free bounds. Second, RSVF does not construct ambiguity sets as simple confidence intervals. Using a confidence interval as the ambiguity set is a sufficient but not a necessary condition for safety. RSVF uses the structure of the value function to optimize the location and shape of the ambiguity set to guarantee lower bounds directly, without requiring the set to be a confidence interval.
2 Problem Statement: Data-driven RMDPs
We propose to use Robust Markov Decision Processes (RMDPs) with states and actions to compute a policy with the maximal safe estimate of return.
Definition 2.1 (Safe Estimate of Return).
We say that an estimate of policy return is safe with probability for a given dataset if it satisfies: for each stationary deterministic policy . Here is the true, but unknown transition probabilities, and is the return for a policy .
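In symbols, Definition 2.1 can be read roughly as follows. The notation is an assumption made for this illustration: \tilde{\rho}(\pi) stands for the estimate of the return of policy \pi, \rho(\pi, P^\star) for the return under the true transition probabilities P^\star, D for the dataset, and 1 - \delta for the required confidence.

```latex
\mathbb{P}_{P^\star}\!\left[\, \tilde{\rho}(\pi) \le \rho(\pi, P^\star) \;\middle|\; D \,\right] \;\ge\; 1 - \delta
\qquad \text{for each stationary deterministic policy } \pi .
```

Under this reading, the estimate is a lower bound on the true return that may fail with probability at most \delta.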
In the standard batch RL setting, the dataset can be used to estimate the transition probabilities. In the RMDP, however, the transition probabilities are assumed not to be known precisely and are constrained to lie in the ambiguity set, which is defined for each state and action (s,a-rectangular). The most common method for defining ambiguity sets is to bound the norm of the distance from a nominal probability distribution, for a given set size and a nominal point. We assume that the rewards are known. The objective is to maximize the discounted infinite-horizon return.
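To make the role of a norm-bounded ambiguity set concrete, the sketch below computes the worst-case expected value over an L1 ball around a nominal distribution, which is the inner step of a robust Bellman update. The function name `worst_case_l1` and its arguments are hypothetical, and the choice of the L1 norm is an assumption of this example.

```python
def worst_case_l1(values, nominal, radius):
    """Minimize sum_i p[i] * values[i] over distributions p with
    ||p - nominal||_1 <= radius. Returns (p, worst_case_value).

    Hypothetical helper for illustration; the L1 norm is assumed.
    """
    p = list(nominal)
    lowest = min(range(len(values)), key=lambda i: values[i])
    # Moving mass m changes the L1 distance by 2 * m, so at most
    # radius / 2 can be shifted to the lowest-value state.
    budget = min(radius / 2.0, 1.0 - p[lowest])
    p[lowest] += budget
    # Remove the same amount of mass from the highest-value states first.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    for i in order:
        if budget <= 0.0:
            break
        if i == lowest:
            continue
        moved = min(p[i], budget)
        p[i] -= moved
        budget -= moved
    return p, sum(pi * vi for pi, vi in zip(p, values))
```

With `values = [1.0, 2.0, 3.0]`, `nominal = [0.2, 0.3, 0.5]`, and `radius = 0.4`, the sketch shifts 0.2 of the probability mass from the highest-value state to the lowest-value state.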
3 Ambiguity Sets as Confidence Intervals
In this section, we describe the standard approach to constructing ambiguity sets as distribution-free confidence intervals and propose its extension to the Bayesian setting.
Distribution-free Confidence Interval The use of distribution-free error bounds on the L1 norm is common in reinforcement learning [Petrik et al., 2016; Taleghan et al., 2015; Strehl and Littman, 2004]. The confidence interval is constructed around the mean transition probability by combining the Hoeffding inequality with the union bound [Weissman et al., 2003; Petrik et al., 2016]. The Hoeffding ambiguity set is an L1 ball centered at the mean transition probability computed from the dataset, with a size that depends on the number of transitions observed from the given state and action. An important limitation of this set is that its size grows linearly with the number of states.
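As an illustration of how such a distribution-free set size can be computed, the sketch below combines the L1 deviation bound of Weissman et al. with a union bound over all state-action pairs. The exact constants are an assumption of this example and may differ from the bound used in the experiments; the function name `hoeffding_l1_radius` is hypothetical.

```python
import math

def hoeffding_l1_radius(n_samples, n_states, n_actions, delta):
    """L1 radius around the empirical transition distribution.

    Assumed form: sqrt((2 / n) * log(S * A * 2**S / delta)), i.e. the
    L1 deviation bound with a union bound over S * A state-action pairs.
    """
    if n_samples <= 0:
        return 2.0  # no data: the ball covers the whole simplex
    # log(S * A * 2**S / delta), expanded to avoid overflow for large S
    log_term = (math.log(n_states * n_actions)
                + n_states * math.log(2.0)
                - math.log(delta))
    # The L1 distance between two distributions never exceeds 2.
    return min(2.0, math.sqrt(2.0 * log_term / n_samples))
```

The radius shrinks at rate 1/sqrt(n) in the number of observed transitions, while the number of states enters through the `n_states * log(2)` term.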
Bayesian Confidence Interval (BCI) Here we assume that data is available and that a hierarchical Bayesian model can be used to infer a probability distribution over the transition probabilities, either analytically or using MCMC tools such as Stan [Gelman et al., 2014]. To construct the ambiguity set, we optimize for the smallest ambiguity set around the mean transition probability, under the assumption that a smaller ambiguity set leads to a tighter lower-bound estimate. This optimization problem is solved for each state and action, with the posterior mean transition probability as the nominal point.
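A minimal sample-based sketch of this construction is shown below, assuming posterior samples of the transition distribution for one state-action pair are already available. It centers an L1 ball at the sample mean and picks the smallest radius that covers a 1 - delta fraction of the samples. The helper names `sample_dirichlet` and `bayesian_confidence_interval` are hypothetical, and the Dirichlet posterior is only a stand-in for whatever model is actually used.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def bayesian_confidence_interval(samples, delta):
    """Return (nominal, radius) of the smallest L1 ball centered at the
    posterior sample mean that contains a 1 - delta fraction of samples."""
    n, dim = len(samples), len(samples[0])
    nominal = [sum(p[i] for p in samples) / n for i in range(dim)]
    distances = sorted(
        sum(abs(pi - mi) for pi, mi in zip(p, nominal)) for p in samples
    )
    # Index of the smallest distance that covers at least (1 - delta) * n samples.
    k = min(n, max(1, math.ceil((1.0 - delta) * n)))
    return nominal, distances[k - 1]

# Example: posterior for 3 next states after observing counts [3, 5, 2]
# with a uniform Dirichlet prior (assumed for illustration).
rng = random.Random(0)
posterior_samples = [sample_dirichlet([4.0, 6.0, 3.0], rng) for _ in range(1000)]
nominal, radius = bayesian_confidence_interval(posterior_samples, delta=0.05)
```

When the guarantee must hold jointly for all state-action pairs, `delta` would be the per-pair level obtained after a union bound.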
4 RSVF: Robustification With Sensible Value Functions
RSVF uses samples from a posterior distribution, similar to a Bayesian confidence interval, but it relaxes the safety requirement as it is sufficient to guarantee for each state and action that:
(1) 
with . Here, the set of value functions is not fixed but depends on the robust solution, which in turn depends on the ambiguity set. RSVF starts with a small initial guess for this set and then grows it, each time adding the current value function, until it contains the robust value function, which is always recomputed after constructing the ambiguity set.
In lines 4 and 8 of Algorithm 1, the ambiguity set is computed for each state-action pair. The center and set size are computed from Eq. 3 using the set of value functions and the optimal solution obtained by solving Eq. 2. When this set is a singleton, it is easy to compute a form of an optimal ambiguity set.
(2) 
When the set of value functions is a singleton, it is sufficient for the ambiguity set to be a subset of the corresponding hyperplane for the estimate to be safe. When the set is not a singleton, we only consider the setting in which it is discrete, finite, and relatively small. We propose to construct a set defined in terms of an L1 ball with the minimum radius such that it is safe for every value function in the set. Under this assumption, we solve the following linear program:
(3) 
In other words, we construct the set to minimize its radius while still intersecting the hyperplane for each value function in the set. Algorithm 1, as described, is not guaranteed to converge in finite time. It can be readily shown that the value functions in the individual iterations are non-increasing. In practice, one can simply stop once the value function becomes smaller (that is, more conservative) than the one computed by BCI.
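For the singleton case, one sample-based reading of the hyperplane construction is to find a threshold such that the expected value under the posterior is at least that threshold with probability 1 - delta. The sketch below estimates this threshold from posterior samples for a single fixed value function. It is an assumed illustration, not necessarily the exact form of Eq. 2, and the name `safe_value_threshold` is hypothetical.

```python
import math

def safe_value_threshold(samples, values, delta):
    """Estimate g such that at least a 1 - delta fraction of the posterior
    samples p satisfy sum_i p[i] * values[i] >= g.

    samples: posterior samples of one transition distribution
    values:  the fixed value function (one entry per next state)
    """
    returns = sorted(
        sum(pi * vi for pi, vi in zip(p, values)) for p in samples
    )
    # Allow at most floor(delta * n) samples to fall below the threshold.
    k = math.floor(delta * len(returns))
    return returns[min(k, len(returns) - 1)]
```

An ambiguity set whose worst-case expected value equals this threshold would then give a lower bound that holds for a 1 - delta fraction of the posterior samples, without having to contain those samples itself.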
5 Empirical Evaluation
In this section, we evaluate the safe estimates computed by BCI and RSVF empirically. We assume a true model of each problem and generate a number of simulated datasets from the known distribution. We compute the largest safe estimate for the optimal return and compare it with the optimal return for the true model. We compare our results with the "Hoeffding Inequality" based distance and with "Mean Transition", which simply solves the expected model and provides no safety guarantees. The reported value is the predicted regret, which is the absolute difference between the true optimal value and the robust estimate; a smaller regret is better. All of our experiments use 95% confidence for safety unless otherwise specified.
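The two quantities reported in this section can be computed as in the sketch below. It treats an estimate as a violation when it exceeds the true optimal return, which is an assumption of this example; the function names are hypothetical.

```python
def predicted_regret(true_optimal_return, robust_estimate):
    """Absolute difference between the true optimal value and the estimate."""
    return abs(true_optimal_return - robust_estimate)

def summarize(true_optimal_return, robust_estimates):
    """Return (mean predicted regret, violation rate) over datasets.

    robust_estimates holds one estimate per simulated dataset. A violation
    is counted when the estimate exceeds the true optimal return (assumed).
    """
    n = len(robust_estimates)
    mean_regret = sum(
        predicted_regret(true_optimal_return, e) for e in robust_estimates
    ) / n
    violation_rate = sum(e > true_optimal_return for e in robust_estimates) / n
    return mean_regret, violation_rate
```

With 95% confidence, a method would be expected to keep the violation rate at or below 0.05 while keeping the mean predicted regret as small as possible.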
Single-state Bellman Update We initially consider simple problems in which a transition from a single non-terminal state following a single action leads to multiple terminal states. The values of the terminal states are fixed and assumed to be provided. We evaluate different priors over the transition probabilities: i) an uninformative Dirichlet prior and ii) an informative Gaussian prior. Note that RSVF is optimal in this simplistic setting, as Fig. 2 (left) and Fig. 3 (left) show. As expected, the mean estimate provides the tightest bound, but Fig. 2 (right) illustrates that it does not provide any meaningful safety guarantees.
Full MDP with Informative Prior Next, we evaluate RSVF on a full MDP problem. Standard RL benchmarks, like cartpole or arcade games, lack meaningful Bayesian priors. We instead use a simple exponential population model, based on the management of an invasive species [Taleghan et al., 2015]. The population of the invasive species at time evolves according to the exponential dynamics . Here, is the growth rate and is the carrying capacity of the environment. A land manager needs to decide, at each time , whether to take a control action which influences the growth rate . If is the indicator of whether the control action was taken, the growth rate is defined as: , where and are the coefficients of control effectiveness. We also assume that we only observe , a noisy estimate of population : . In the MDP model, the population observation defines the state. There are two actions: to apply or not to apply the control measure. Transition probabilities are given by the population evolution function. The reward for the MDP captures the costs of high invasive population and the application of the treatment.
Fig. 3 (right) depicts the average predicted regret over the different datasets. The distribution-free methods are very conservative. BCI improves on this behavior somewhat, but RSVF provides bounds that are tighter than those of BCI by almost a factor of 2. The rate of violations is 0 for all robust methods. This indicates that RSVF is also overly conservative in this case, since its rate of violations is close to 0. This is due to its reliance on the union bound across multiple states, the approximate construction of the individual ambiguity sets, and the inherent rectangularity assumption.
6 Conclusion
We propose, in this paper, a new Bayesian approach to the construction of ambiguity sets in robust reinforcement learning. This approach has several important advantages over standard distributionfree methods used in the past. Our experimental results and theoretical analysis indicate that the Bayesian ambiguity sets can lead to much tighter safe return estimates.
References
 Bertsekas and Tsitsiklis [1996] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming. 1996.
 Delgado et al. [2016] Karina V. Delgado, Leliane N. De Barros, Daniel B. Dias, and Scott Sanner. Real-time dynamic programming for Markov decision processes with imprecise probabilities. Artificial Intelligence, 230:192–223, 2016.
 Gelman et al. [2014] Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian Data Analysis. 3rd edition, 2014.
 Hanasusanto and Kuhn [2013] GA Hanasusanto and Daniel Kuhn. Robust Data-Driven Dynamic Programming. In Advances in Neural Information Processing Systems (NIPS), 2013.
 Iyengar [2005] Garud N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.

 Mannor et al. [2012] Shie Mannor, O Mebel, and H Xu. Lightning does not strike twice: Robust MDPs with coupled uncertainty. In International Conference on Machine Learning, 2012.
 Petrik et al. [2016] Marek Petrik, Mohammad Ghavamzadeh, and Yinlam Chow. Safe Policy Improvement by Minimizing Robust Baseline Regret. In Advances in Neural Information Processing Systems, 2016.
 Petrik [2012] Marek Petrik. Approximate dynamic programming by minimizing distributionally robust bounds. In International Conference on Machine Learning, 2012.
 Puterman [2005] Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc., 2005.
 Strehl and Littman [2004] A. L. Strehl and M. L. Littman. An empirical evaluation of interval estimation for Markov decision processes. Pages 128–135, 2004.
 Sutton and Barto [1998] Richard S Sutton and Andrew Barto. Reinforcement learning. 1998.
 Taleghan et al. [2015] Majid Alkaee Taleghan, Thomas G. Dietterich, Mark Crowley, Kim Hall, and H. Jo Albers. PAC Optimal MDP Planning with Application to Invasive Species Management. Journal of Machine Learning Research, 16(1):3877–3903, 2015.
 Tamar et al. [2014] Aviv Tamar, Shie Mannor, and Huan Xu. Scaling Up Robust MDPs using Function Approximation. In International Conference on Machine Learning (ICML), 2014.
 Weissman et al. [2003] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the L_1 deviation of the empirical distribution. 2003.
 Xu and Mannor [2006] Huan Xu and Shie Mannor. The robustness-performance tradeoff in Markov decision processes. Advances in Neural Information Processing Systems, 2006.
 Xu and Mannor [2009] Huan Xu and Shie Mannor. Parametric regret in uncertain Markov decision processes. Proceedings of the IEEE Conference on Decision and Control, pages 3606–3613, 2009.