Safe Policy Improvement by Minimizing Robust Baseline Regret

07/13/2016 · Marek Petrik, et al.

An important problem in sequential decision-making under uncertainty is to use limited data to compute a safe policy, i.e., a policy that is guaranteed to perform at least as well as a given baseline strategy. In this paper, we develop and analyze a new model-based approach to compute a safe policy when we have access to an inaccurate dynamics model of the system with known accuracy guarantees. Our proposed robust method uses this (inaccurate) model to directly minimize the (negative) regret w.r.t. the baseline policy. Contrary to the existing approaches, minimizing the regret allows one to improve the baseline policy in states with accurate dynamics and seamlessly fall back to the baseline policy, otherwise. We show that our formulation is NP-hard and propose an approximate algorithm. Our empirical results on several domains show that even this relatively simple approximate algorithm can significantly outperform standard approaches.


1 Introduction

Many problems in science and engineering can be formulated as sequential decision-making under uncertainty. A common scenario in such problems, arising in fields as diverse as online marketing, inventory control, health informatics, and computational finance, is to find a good or an optimal strategy/policy given a batch of data generated by the current strategy of the company (hospital, investor). Although there are many techniques for finding a good policy given a batch of data, only a few of them guarantee that the obtained policy will perform well when it is deployed. Since deploying an untested policy can be risky for the business, the product (hospital, investment) manager typically does not allow it unless we can provide performance guarantees for the obtained strategy relative to a baseline policy (e.g., the policy that is currently in use).

In this paper, we focus on the model-based approach to this fundamental problem in the context of infinite-horizon discounted Markov decision processes (MDPs). In this approach, we use the batch of data to build a model or a simulator that approximates the true behavior of the dynamical system, together with an error function that captures the accuracy of the model at each state of the system. Our goal is to compute a safe policy, i.e., a policy that is guaranteed to perform at least as well as the baseline strategy, using the simulator and the error function. Most of the work on this topic has been in the model-free setting, where safe policies are computed directly from the batch of data, without building an explicit model of the system [12, 13]. Another class of model-free algorithms uses a batch of data generated by the current policy and returns a policy that is guaranteed to perform better; this process is repeated until convergence [6, 11].

A major limitation of the existing methods for computing safe policies is that they either adopt a newly learned policy with provable improvements or do not make any improvement at all by returning the baseline policy. These approaches may be quite limiting when model uncertainties are not uniform across the state space. In such cases, it is desirable to guarantee an improvement over the baseline policy by combining it with a learned policy on a state-by-state basis. In other words, we want to use the learned policy at the states in which either the improvement is significant or the model uncertainty (error function) is small, and to use the baseline policy everywhere else. However, computing a learned policy that can be effectively combined with a baseline policy is non-trivial due to the complex effects of policy changes in an MDP. Our key insight is that this goal can be achieved by minimizing the (negative) robust regret w.r.t. the baseline policy. This unifies the sources of uncertainty in the learned and baseline policies and allows a more systematic performance comparison. Note that our approach differs significantly from the standard one, which compares a pessimistic performance estimate of the learned policy with an optimistic estimate of the baseline strategy. That may result in rejecting a learned policy whose performance is (slightly) better than the baseline's, simply due to the discrepancy between the pessimistic and optimistic evaluations.

The model-based approach of this paper builds on robust Markov decision processes [5, 15, 1]. The main difference is the availability of the baseline policy, which creates unique challenges for sequential optimization. To the best of our knowledge, such challenges have not yet been fully investigated in the literature. A possible solution is to solve the robust formulation of the problem and then accept the resulting policy only if its conservative performance estimate is better than the baseline. While such an idea has been investigated in the model-free setting (e.g., [13]), we show in this paper that this approach is overly conservative.

As the main contribution of the paper, we propose and analyze a new robust optimization formulation that captures the above intuition of minimizing robust regret w.r.t. the baseline policy. After a preliminary discussion in Section 2, we formally describe our model and analyze its main properties in Section 3. We show that in solving this optimization problem, we may have to go beyond the standard space of deterministic policies and search in the space of randomized policies; we derive a bound on the performance loss of its solutions; and we prove that solving this problem is NP-hard. We also propose a simple and practical approximate algorithm. Then, in Section 4, we show that the standard model-based approach is really a tractable approximation of robust baseline regret minimization. Finally, our experimental results in Section 5 indicate that even the simple approximate algorithm significantly outperforms the standard model-based approach when the model is uncertain.

2 Preliminaries

We consider problems in which the agent’s interaction with the environment is modeled as an infinite-horizon $\gamma$-discounted MDP. A $\gamma$-discounted MDP is a tuple $\langle \mathcal{X}, \mathcal{A}, r, P, p_0, \gamma \rangle$, where $\mathcal{X}$ and $\mathcal{A}$ are the state and action spaces, $r(x,a)$ is the bounded reward function (with $\|r\|_\infty \le R_{\max}$), $P(\cdot \mid x,a)$ is the transition probability function, $p_0$ is the initial state distribution, and $\gamma \in [0,1)$ is a discount factor. We use $\Pi_R$ and $\Pi_D$ to denote the sets of randomized and deterministic stationary Markovian policies, respectively, where a randomized policy maps each state to an element of $\Delta^{\mathcal{A}}$, the set of probability distributions over the action space $\mathcal{A}$.

Throughout the paper, we assume that the true reward function $r$ of the MDP is known, but the true transition probability $P^\star$ is not given. The generalization to include reward estimation is straightforward and is omitted for the sake of brevity. We use historical data to build an MDP model with transition probability $\hat P$. Due to the limited number of samples and other modeling issues, it is unlikely that $\hat P$ matches the true transition probability $P^\star$ of the system. We therefore require that the deviation of the estimated model from the true transition probability be bounded, as stated in the following assumption:

For each state-action pair $(x,a)$, the error function $e(x,a)$ bounds the $\ell_1$-difference between the estimated transition probability and the true transition probability, i.e.,

$\|\hat P(\cdot \mid x,a) - P^\star(\cdot \mid x,a)\|_1 \le e(x,a).$   (1)

The error function can be derived either directly from samples using high probability concentration bounds, as we briefly outline in Appendix A, or based on specific domain properties.
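As a concrete illustration, the following sketch derives such an error function from transition counts using the $\ell_1$ concentration inequality of Weissman et al. [14] with a union bound over state-action pairs; the function name, array shapes, and the exact constants are our own illustrative assumptions, not the paper's lemma.

```python
import numpy as np

def l1_error_bound(counts, delta):
    """Per state-action error e(x, a) such that, with probability at least
    1 - delta, ||P_hat(.|x,a) - P_star(.|x,a)||_1 <= e(x, a) for all (x, a).

    counts: array of shape (S, A, S), counts[x, a, y] = observed transitions (x,a) -> y.
    delta:  overall failure probability, split uniformly across the S*A pairs.
    """
    S, A, _ = counts.shape
    n = np.maximum(counts.sum(axis=-1), 1)          # samples per (x, a)
    delta_xa = delta / (S * A)                      # union bound over pairs
    # Weissman et al.: P(||p_hat - p||_1 >= t) <= (2^S - 2) exp(-n t^2 / 2);
    # invert for t, upper-bounding log(2^S - 2) by S log 2 to avoid overflow.
    e = np.sqrt(2.0 * (S * np.log(2.0) - np.log(delta_xa)) / n)
    # The L1 distance between two distributions is at most 2, so cap the bound;
    # (x, a) pairs with no samples get the vacuous budget 2.
    e = np.where(counts.sum(axis=-1) == 0, 2.0, np.minimum(e, 2.0))
    return e                                        # shape (S, A)
```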

To model the uncertainty in the transition probability, we adopt the notion of robust MDP (RMDP) [5, 8, 15], i.e., an extension of MDP in which nature adversarially chooses the transitions from a given uncertainty set

$\Xi(\hat P, e) = \big\{ \xi(\cdot \mid x,a) \in \Delta^{\mathcal{X}} \;:\; \|\xi(\cdot \mid x,a) - \hat P(\cdot \mid x,a)\|_1 \le e(x,a),\ \forall (x,a) \in \mathcal{X} \times \mathcal{A} \big\}.$

From Assumption 2, we notice that the true transition probability is in the set of uncertain transition probabilities, i.e., $P^\star \in \Xi(\hat P, e)$. The above constraint is common in the RMDP literature (e.g., [5, 9, 15]). The uncertainty set $\Xi$ is $(x,a)$-rectangular and randomized [7, 15]. One of the motivations for considering $(x,a)$-rectangular sets in RMDPs is that they lead to tractable solutions in the conventional reward maximization setting. However, in the robust regret minimization problem that we propose in this paper, $(x,a)$-rectangularity of the uncertainty set does not guarantee tractability. While it is of great interest to investigate the structure of uncertainty sets that lead to tractable algorithms for robust regret minimization, this is beyond the main scope of this paper and we leave it for future work.
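For a single state-action pair, nature's worst-case choice within this $\ell_1$ ball has a simple greedy solution, which is also the inner step of the robust Bellman backups sketched later. A minimal sketch, with our own function name and signature:

```python
import numpy as np

def worst_case_l1(p_hat, v, budget):
    """Adversarial successor distribution within an L1 ball around p_hat.

    Returns xi minimizing xi @ v subject to
        xi >= 0, xi.sum() == 1, ||xi - p_hat||_1 <= budget.
    Greedy solution: shift probability mass from the highest-value
    successor states onto the lowest-value successor state.
    """
    xi = np.asarray(p_hat, dtype=float).copy()
    lo = int(np.argmin(v))
    add = min(budget / 2.0, 1.0 - xi[lo])    # mass added to the cheapest successor
    xi[lo] += add
    remaining = add                          # same amount removed from expensive ones
    for i in np.argsort(v)[::-1]:
        if i == lo or remaining <= 0.0:
            continue
        take = min(remaining, xi[i])
        xi[i] -= take
        remaining -= take
    return xi
```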

For each policy $\pi \in \Pi_R$ and nature’s choice $\xi \in \Xi$, the discounted return is defined as

$\rho(\pi, \xi) = \mathbb{E}^{\pi,\xi}\Big[\sum_{t=0}^{\infty} \gamma^t\, r(x_t, a_t) \;\Big|\; x_0 \sim p_0\Big],$

where $x_t$ and $a_t$ are the state and action random variables at time $t$, and $v^{\pi}_{\xi}$ is the corresponding value function. An optimal policy for a given $\xi$ is defined as $\pi^\star_\xi \in \arg\max_{\pi \in \Pi_R} \rho(\pi, \xi)$. Similarly, under the true transition probability $P^\star$, the true return of a policy $\pi$ is $\rho(\pi, P^\star)$ and a truly optimal policy is $\pi^\star \in \arg\max_{\pi \in \Pi_R} \rho(\pi, P^\star)$. Although we define the optimal policy as a maximizer over $\Pi_R$, it is known that every reward maximization problem in MDPs has at least one optimal policy in $\Pi_D$.
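As a quick sanity check of the return just defined, here is a short sketch that evaluates $\rho(\pi, P)$ for a stationary randomized policy on a finite MDP by solving the linear system behind the value function; array names and shapes are illustrative assumptions.

```python
import numpy as np

def policy_return(P, r, pi, p0, gamma):
    """rho(pi, P) = p0^T (I - gamma * P_pi)^{-1} r_pi for a finite MDP.

    P:  transitions, shape (S, A, S);  r: rewards, shape (S, A)
    pi: randomized policy, shape (S, A), rows sum to 1;  p0: shape (S,)
    """
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    r_pi = np.einsum("sa,sa->s", pi, r)     # expected one-step reward under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return float(p0 @ v)
```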

Finally, given a deterministic baseline policy $\pi_B$, we call a policy $\pi$ safe if its "true" performance is guaranteed to be no worse than that of the baseline policy, i.e., $\rho(\pi, P^\star) \ge \rho(\pi_B, P^\star)$.

3 Robust Policy Improvement Model

In this section, we introduce and analyze an optimization procedure that robustly improves over a given baseline policy $\pi_B$. As described above, the main idea is to find a policy that is guaranteed to be an improvement for any realization of the uncertain model parameters. The following definition formalizes this intuition.

[The Robust Policy Improvement Problem] Given a model uncertainty set $\Xi(\hat P, e)$ and a baseline policy $\pi_B$, find a maximal $\zeta \ge 0$ such that there exists a policy $\pi \in \Pi_R$ for which $\rho(\pi, \xi) \ge \rho(\pi_B, \xi) + \zeta$ for every $\xi \in \Xi(\hat P, e)$. (From now on, for brevity, we omit the parameters $\hat P$ and $e$, and use $\Xi$ to denote the model uncertainty set.)

The problem posed in Definition 3 readily translates to the following optimization problem:

$\pi_S \in \arg\max_{\pi \in \Pi_R} \min_{\xi \in \Xi} \big(\rho(\pi, \xi) - \rho(\pi_B, \xi)\big).$   (2)

Note that since the baseline policy $\pi_B$ achieves value $0$ in (2), $\zeta$ in Definition 3 is always non-negative. Therefore, any solution $\pi_S$ of (2) is safe, because under the true transition probability $P^\star \in \Xi$, we have the guarantee that $\rho(\pi_S, P^\star) \ge \rho(\pi_B, P^\star)$. It is important to highlight how Definition 3 differs from the standard approach (e.g., [13]) to determining whether a policy $\pi$ is an improvement over the baseline policy $\pi_B$. The standard approach considers a statistical error bound that translates to the test $\min_{\xi \in \Xi} \rho(\pi, \xi) \ge \max_{\xi' \in \Xi} \rho(\pi_B, \xi')$. Note that the uncertainty parameters $\xi$ and $\xi'$ on the two sides of this inequality are not necessarily the same. Therefore, any optimization procedure derived from this test is more conservative than the problem in (2). Indeed, when the error function $e$ is large, even the baseline policy itself ($\pi = \pi_B$) may not pass this test. In Section 5.1, we show the conditions under which this approach fails.
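The gap between the coupled objective in (2) and the decoupled test is easy to see numerically. The sketch below (reusing policy_return from the earlier sketch, with `models` standing for a finite set of candidate realizations drawn from $\Xi$) evaluates both criteria; the first is never smaller than the second, so a policy can pass the regret criterion while failing the decoupled test.

```python
def robust_regret(pi, pi_b, models, r, p0, gamma):
    """min over candidate models of rho(pi, xi) - rho(pi_b, xi); same xi in both terms."""
    return min(policy_return(P, r, pi, p0, gamma) - policy_return(P, r, pi_b, p0, gamma)
               for P in models)

def decoupled_test(pi, pi_b, models, r, p0, gamma):
    """min_xi rho(pi, xi) - max_xi' rho(pi_b, xi'); pessimistic vs. optimistic estimates."""
    worst_new = min(policy_return(P, r, pi, p0, gamma) for P in models)
    best_base = max(policy_return(P, r, pi_b, p0, gamma) for P in models)
    return worst_new - best_base

# For any set of candidate models, robust_regret(...) >= decoupled_test(...), so the
# decoupled test rejects every policy that the regret criterion rejects, and more.
```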

In the remainder of this section, we highlight some major properties of the optimization problem (2). Specifically, we show that its solution policy may need to be (strictly) randomized, we compute a bound on the performance loss of its solution policy w.r.t. an optimal policy $\pi^\star$, and we finally prove that solving it is an NP-hard problem.

3.1 Policy Class

The following theorem shows that we should search for the solutions of the optimization problem (2) in the space of randomized policies $\Pi_R$.

The solution to the optimization problem (2) may not be attained by a deterministic policy. Moreover, the loss due to considering only deterministic policies cannot be bounded, i.e., there exists no constant $c \ge 0$ such that

$\max_{\pi \in \Pi_D} \min_{\xi \in \Xi} \big(\rho(\pi, \xi) - \rho(\pi_B, \xi)\big) + c \;\ge\; \max_{\pi \in \Pi_R} \min_{\xi \in \Xi} \big(\rho(\pi, \xi) - \rho(\pi_B, \xi)\big).$

Proof.

The proof follows directly from Example 3.1. The optimal policy in this example is randomized and achieves a strictly positive guaranteed improvement. There is no deterministic policy that guarantees a positive improvement over the baseline policy, which proves the second part of the theorem. ∎

Figure 1: (left) A robust/uncertain MDP used in Example 3.1 that illustrates the sub-optimality of deterministic policies in solving the optimization problem (2). (right) A Markov decision process with significant uncertainty in baseline policy.

Consider the robust/uncertain MDP on the left panel of Figure 1 with states , actions , and discount factor . Actions and are shown as solid black nodes. A number with no state represents a terminal state with the corresponding reward. The robust outcomes correspond to the uncertainty set of transition probabilities . The baseline policy is deterministic and is denoted by double edges. It can be readily seen from the monotonicity of the Bellman operator that any improved policy will satisfy . Therefore, we will only focus on the policy at state . The robust improvement as a function of and the uncertainties is given as follows:

This shows that no deterministic policy can achieve a positive improvement in this problem. However, a randomized policy returns the maximum improvement .

Randomized policies can do better than their deterministic counterparts because they allow for hedging among various realizations of the MDP parameters. Example 3.1 shows a problem in which, for each deterministic policy, there exists a realization of the parameters under which that policy improves over the baseline. However, in this example there is no single realization of the parameters that provides an improvement for all deterministic policies simultaneously. Therefore, randomizing the policy guarantees an improvement independent of the parameters' realization.
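The hedging effect can be reproduced on a tiny instance in the spirit of Example 3.1, where only the action at a single state matters, so the improvement of a randomized policy is the corresponding mixture of the deterministic improvements. The payoff numbers below are hypothetical, chosen only to exhibit the phenomenon:

```python
import numpy as np

# Hypothetical worst-case improvements rho(pi, xi) - rho(pi_B, xi):
# rows index the two deterministic choices at the single decision state,
# columns index the two uncertainty realizations xi_1 and xi_2.
improvement = np.array([[ 2.0, -1.0],
                        [-1.0,  2.0]])

for name, weights in [("deterministic a1", [1.0, 0.0]),
                      ("deterministic a2", [0.0, 1.0]),
                      ("50/50 mixture   ", [0.5, 0.5])]:
    per_realization = np.asarray(weights) @ improvement
    print(f"{name}: guaranteed improvement = {per_realization.min():+.2f}")

# Both deterministic choices guarantee only -1.00 in the worst case, while the
# mixture guarantees +0.50 against every realization.
```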

3.2 Performance Bound

Generally, one cannot compute the truly optimal policy using an imprecise model. Nevertheless, it is still crucial to understand how errors in the model translate to a performance loss w.r.t. an optimal policy. The following theorem provides a bound on the performance loss of any solution to the optimization problem (2).

A solution $\pi_S$ to the optimization problem (2) is safe and its performance loss $\rho(\pi^\star, P^\star) - \rho(\pi_S, P^\star)$ is bounded by the following inequality:

where $u^{\pi^\star}$ and $u^{\pi_B}$ are the state occupancy distributions of the optimal and baseline policies in the true MDP $P^\star$. Furthermore, the above bound is tight.

The proof of Theorem 3.2 is available in Appendix C.

3.3 Computational Complexity

In this section, we analyze the computational complexity of solving the optimization problem (2) and prove that the problem is NP-hard. In particular, we proceed by showing that the following sub-problem of (2), for a fixed policy $\pi$, is NP-hard:

$\min_{\xi \in \Xi} \big(\rho(\pi, \xi) - \rho(\pi_B, \xi)\big).$   (3)

The optimization problem (3) can be interpreted as computing nature's choice $\xi$ that simultaneously minimizes the returns of two MDPs whose transitions are induced by the policies $\pi$ and $\pi_B$. The proof of Theorem 3.3 is given in Appendix D.

Both optimization problems (2) and (3) are NP-hard.

Although the optimization problem (2) is NP-hard in general, it can be tractable under certain conditions. The following proposition shows that this is the case, for example, when the Markov chain induced by the baseline policy is known precisely.

Assume that for each state $x$, the error function for the action taken by the baseline policy is zero, i.e., $e(x, \pi_B(x)) = 0$. (Note that this is equivalent to precisely knowing the Markov chain induced by the baseline policy $\pi_B$.) Then, the optimization problem (2) is equivalent to the following problem and can be solved in polynomial time:

$\max_{\pi \in \Pi_R} \min_{\xi \in \Xi} \rho(\pi, \xi).$   (4)
Proof.

The hypothesis of the proposition implies that for any $\xi \in \Xi$ and any state $x$, we have $\xi(\cdot \mid x, \pi_B(x)) = \hat P(\cdot \mid x, \pi_B(x))$. This further indicates that $\rho(\pi_B, \xi)$ is a constant (independent of $\xi$) for all $\xi \in \Xi$. Thus, when the Markov chain induced by the baseline policy is known, the optimization problem (2) reduces to the optimization problem (4), which is a robust MDP (RMDP) problem with an $\ell_1$-constrained uncertainty set. It is known that this class of RMDP problems can be solved in (strongly) polynomial time [4] and has also been solved efficiently in practice [9]. ∎
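One standard way to solve an RMDP of the form (4) in practice is robust value iteration, with the inner minimization handled by the greedy $\ell_1$ response sketched in Section 2 (worst_case_l1). The following is a sketch of that approach, not the strongly polynomial algorithm of [4]:

```python
import numpy as np

def robust_value_iteration(P_hat, r, e, gamma, iters=1000, tol=1e-8):
    """Approximately solve max_pi min_xi rho(pi, xi) for an (x,a)-rectangular
    L1 uncertainty set; returns the robust value function and a greedy policy.

    P_hat: nominal transitions (S, A, S);  r: rewards (S, A);  e: L1 budgets (S, A).
    """
    S, A, _ = P_hat.shape
    v = np.zeros(S)
    for _ in range(iters):
        q = np.empty((S, A))
        for x in range(S):
            for a in range(A):
                xi = worst_case_l1(P_hat[x, a], v, e[x, a])  # adversarial successors
                q[x, a] = r[x, a] + gamma * xi @ v
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    return v, q.argmax(axis=1)   # a deterministic greedy policy suffices for (4)
```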

3.4 Approximate Algorithm

Solving for the optimal solution of (2) may not be possible in practice, since the problem is NP-hard. In this section, we propose a simple and practical approximate algorithm. The empirical results of Section 5 indicate that this algorithm holds promise and suggest that it may be a good starting point for building better approximate algorithms in the future.

input : Empirical transition probabilities $\hat P$, baseline policy $\pi_B$, and the error function $e$
output : Policy $\tilde\pi$
1 foreach $(x,a) \in \mathcal{X} \times \mathcal{A}$ do
2       $\tilde e(x,a) \leftarrow 0$ if $a = \pi_B(x)$, and $\tilde e(x,a) \leftarrow e(x,a)$ otherwise;
3 end foreach
4 $\tilde\pi \leftarrow \arg\max_{\pi \in \Pi_R} \min_{\xi \in \Xi(\hat P, \tilde e)} \big(\rho(\pi, \xi) - \rho(\pi_B, \xi)\big)$;
return $\tilde\pi$
Algorithm 1 Approximate Robust Baseline Regret Minimization Algorithm

Algorithm 1 contains the pseudocode of the proposed approximate method. The main idea is to use a modified uncertainty model that assumes no error in the transition probabilities of the baseline policy. The robust baseline regret can then be minimized in polynomial time, as Section 3.3 shows. Assuming no error in the baseline transition probabilities is reasonable for two main reasons. First, in practice the data is often generated by executing the baseline policy, so we may have enough samples to approximate its transition probabilities well. Second, errors in the transition probabilities often affect the baseline and improved policies similarly and therefore have little effect on the difference between their returns (i.e., the regret). See Section 5.1 for an example of such behavior.
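A sketch of Algorithm 1 under these assumptions, built on the robust value iteration above: the error budget is zeroed out on the baseline's own actions, which makes $\rho(\pi_B, \xi)$ constant over the modified uncertainty set, so maximizing the robust return is equivalent to maximizing the robust baseline regret (all names below are ours):

```python
import numpy as np

def approx_robust_baseline_regret(P_hat, r, e, pi_b, gamma):
    """Approximate Algorithm 1: trust the model on the baseline's actions.

    pi_b: deterministic baseline policy, shape (S,), one action index per state.
    """
    S = P_hat.shape[0]
    e_tilde = e.copy()
    e_tilde[np.arange(S), pi_b] = 0.0   # no error on actions taken by the baseline
    # With rho(pi_b, xi) constant over Xi(P_hat, e_tilde), the regret-maximizing
    # policy is simply the robust-optimal policy for the modified uncertainty set.
    _, pi = robust_value_iteration(P_hat, r, e_tilde, gamma)
    return pi
```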

4 Standard Policy Improvement Methods

In Section 3, we showed that finding an exact solution to the optimization problem (2) is computationally expensive and proposed an approximate algorithm. In this section, we describe and analyze two standard methods for computing safe policies and show how they can be interpreted as an approximation of our proposed baseline regret minimization. Due to space limitations, we describe another method, called reward-adjusted MDP, in Appendix G, but report its performance in Section 5.

4.1 Solving the Simulator

The most straightforward solution to (2) is to simply assume that our simulator is accurate and to solve the reward maximization problem of an MDP with the transition probability $\hat P$, i.e., $\arg\max_{\pi \in \Pi_R} \rho(\pi, \hat P)$. Theorem 4.1 quantifies the performance loss of the resulting policy.

Let a policy be optimal for the reward maximization problem of an MDP with transition probability $\hat P$. Then, under Assumption 2, its performance loss is bounded by

The proof is available in Appendix E. Note that there is no guarantee that this policy is safe, and thus, deploying it may lead to undesirable outcomes due to model uncertainties. Moreover, its performance guarantee, reported in Theorem 4.1, is weaker than the guarantee in Theorem 3.2 for the solution to our proposed optimization problem (2).

Appendix F indicates that the policy returned by Algorithm 2 is safe and has a tighter bound on its performance loss than the simulator-based policy. This is because Theorem F depends on a weighted $\ell_1$-norm of the errors under the optimal policy, instead of a norm of the errors over all policies as in Theorem 4.1.

4.2 Solving Robust MDP

Another standard solution to the problem in (2) is based on solving the RMDP problem (4). We prove that the policy returned by this algorithm is safe and has better (sharper) worst-case guarantees than the simulator-based policy of Section 4.1. Details of this algorithm are summarized in Algorithm 2. The algorithm first constructs and solves an RMDP. It then returns the solution policy $\pi_R$ if its worst-case performance over the uncertainty set is better than an optimistic (best-case over the uncertainty set) estimate of the baseline's performance, and it returns the baseline policy $\pi_B$ otherwise.

input : Simulated MDP $\hat P$, baseline policy $\pi_B$, and the error function $e$
output : Policy
1 $\pi_R \leftarrow \arg\max_{\pi \in \Pi_R} \min_{\xi \in \Xi(\hat P, e)} \rho(\pi, \xi)$;
2 if $\min_{\xi \in \Xi} \rho(\pi_R, \xi) \ge \max_{\xi \in \Xi} \rho(\pi_B, \xi)$ then return $\pi_R$ else return $\pi_B$;
Algorithm 2 RMDP-based Algorithm

Algorithm 2 makes use of the following approximation to the solution of (2):

$\min_{\xi \in \Xi} \big(\rho(\pi, \xi) - \rho(\pi_B, \xi)\big) \;\ge\; \min_{\xi \in \Xi} \rho(\pi, \xi) - \max_{\xi' \in \Xi} \rho(\pi_B, \xi'),$

and guarantees safety by returning a policy other than $\pi_B$ only when the RHS of this inequality is non-negative.

The performance bound of the policy returned by Algorithm 2 is identical to that in Theorem 3.2, and is stated and proved in Appendix F. However, even though the worst-case bounds are the same, we show in Section 5.1 that the performance loss of the RMDP-based policy may be worse than that of the solution to (2) by an arbitrarily large margin.

It is important to discuss the difference between Algorithms 2 and 1. Although both solve an RMDP, they use different uncertainty sets. The uncertainty set used in Algorithm 2 is defined by the true error function of the simulator, while the uncertainty set used in Algorithm 1 assumes that the error function is zero for all the actions suggested by the baseline policy. As a result, both algorithms approximately solve (2), but they approximate the problem in different ways.
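For comparison, here is a sketch of the RMDP-based Algorithm 2 with the conservative acceptance test discussed above, reusing the helpers from the earlier sketches; the optimistic baseline estimate is obtained by maximizing rather than minimizing over the uncertainty set.

```python
import numpy as np

def robust_policy_eval(P_hat, r, e, pi, p0, gamma, optimistic=False, iters=1000):
    """Worst-case (or, if optimistic=True, best-case) return of a fixed
    deterministic policy pi over the L1 uncertainty set."""
    S = P_hat.shape[0]
    v = np.zeros(S)
    sign = -1.0 if optimistic else 1.0       # flip the objective of the adversary
    for _ in range(iters):
        v_new = np.empty(S)
        for x in range(S):
            a = pi[x]
            xi = worst_case_l1(P_hat[x, a], sign * v, e[x, a])
            v_new[x] = r[x, a] + gamma * xi @ v
        v = v_new
    return float(p0 @ v)

def rmdp_based_algorithm(P_hat, r, e, pi_b, p0, gamma):
    """Algorithm 2 sketch: accept the robust policy only if its pessimistic
    return beats an optimistic estimate of the baseline's return."""
    _, pi_r = robust_value_iteration(P_hat, r, e, gamma)
    worst_new = robust_policy_eval(P_hat, r, e, pi_r, p0, gamma)
    best_base = robust_policy_eval(P_hat, r, e, pi_b, p0, gamma, optimistic=True)
    return pi_r if worst_new >= best_base else pi_b
```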

5 Experimental Evaluation

In this section, we experimentally evaluate the benefits of minimizing the robust baseline regret. First, we demonstrate that solving the problem in (2) may outperform the regular robust formulation by an arbitrarily large margin. Then, in the remainder of the section, we compare the solution quality of Algorithm 1 with simpler methods in more complex and realistic experimental domains. The purpose of our experiments is to show how solution quality depends on the degree of model uncertainties.

5.1 An Illustrative Example

Consider the example depicted on the right panel of Figure 1. White nodes represent states and black nodes represent state-action pairs. Labels on the edges originating from states indicate the policy according to which the action is taken; labels on the edges originating from actions denote the rewards and, if necessary, the name of the uncertainty realization. The baseline policy is , the optimal policy is , and the discount factor is .

This example represents a setting in which the level of uncertainty varies significantly across the individual states: the transition model is precise in one state and uncertain in the other. The baseline policy takes a suboptimal action in the precisely modeled state and the optimal action in the uncertain state. To avoid being overly conservative when computing a safe policy, one needs to account for the fact that the realization of the uncertainty influences both the baseline and improved policies.

Using the plain robust optimization formulation in Algorithm 2, even the optimal policy is not considered safe in this example: the robust return of the optimal policy is smaller than the optimistic return of the baseline. On the other hand, solving (2) returns the optimal policy, since its worst-case improvement over the baseline is positive. Even the heuristic method of Section 3.4 returns the optimal policy. Note that since the reward-adjusted formulation (see its description in Appendix G) is even more conservative than the robust formulation, it also fails to improve on the baseline policy.

5.2 Example Grid Problem

In this section, we use a simple grid problem to compare the solution quality of Algorithm 1 with that of simpler methods. The grid problem is motivated by modeling customer interactions with an online system. States in the problem represent a two-dimensional grid: columns capture the state of interaction with the website, and rows capture customer states such as overall satisfaction. Actions can move customers along either dimension with some probability of failure. A more detailed description of this domain is provided in Appendix H.

Our goal is to evaluate how the solution quality of the various methods depends on the magnitude of the model error. The model is constructed from samples, so the magnitude of the error depends on the number of samples used to build it. We use a uniform random policy to gather samples. The model error function is then constructed from this simulated data using the bounds in Appendix A. The baseline policy is constructed to be optimal when the row component of the state is ignored; see Appendix H for more details.

All methods are compared in terms of the percentage improvement in total return over the baseline policy. Fig. 2 depicts the results as a function of the number of transition samples used in constructing the uncertain model; each point represents the mean over repeated runs. The methods used in the comparison are as follows: 1) EXP represents solving the nominal model as described in Section 4.1, 2) RWA represents the reward-adjusted formulation in Algorithm 3, 3) ROB represents the robust method in Algorithm 2, and 4) RBC represents the approximate algorithm in Algorithm 1.

Figure 2: Improvement in return over the baseline policy for the proposed methods. The dashed line shows the return of the optimal policy.

Fig. 2 shows that Algorithm 1 not only reliably computes policies that are safe, but also significantly improves on the quality of the baseline policy when the model error is large. When the number of samples is small, Algorithm 1 is significantly better than the other methods: it relies on the baseline policy in states with a large model error and only takes improving actions where the model error is small. Note that EXP can be significantly worse than the baseline policy, especially when the number of samples is small.

5.3 Energy Arbitrage

In this section, we compare the model-based policy improvement methods on a more complex domain. The problem is to determine an energy arbitrage policy given limited energy storage (a battery) and stochastic prices. At each time period, the decision maker observes the available battery charge and a Markov state of the energy price, and decides on the amount of energy to purchase or to sell.

The set of states in the energy arbitrage problem consists of three components: current state of charge, current capacity, and a Markov state representing price; the actions represent the amount of energy purchased or sold; the rewards indicate profit/loss in the transactions. We discretize the state of charge and action sets to 10 separate levels. The problem is based on the domain from [10] whose description is detailed in Appendix H.2.

Energy arbitrage is a good fit for model-based approaches because it combines known and unknown dynamics. The physics of battery charging and discharging can be modeled with high confidence, while the evolution of energy prices is uncertain. As a result, with an explicit battery model the only uncertainty is in the transition probabilities among the 10 states of the price process, rather than across the entire set of 1000 state-action pairs. This significantly reduces the number of samples needed to compute a good solution.
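A small sketch of how this factored structure can be exploited when building the error function: the concentration bound is applied only to the K-state price chain, and the resulting budgets are broadcast to the full product state space, since the battery dynamics are treated as exact. The state indexing and shapes here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def factored_error_function(price_counts, n_charge_levels, n_actions, delta):
    """Per state-action L1 error budgets for a (charge, price) product state space.

    price_counts: shape (K, K), observed counts of price-state transitions.
    The charge dynamics are assumed exactly known, so all model error comes
    from the K-state price chain and is shared by every charge level and action
    with the same current price state.
    """
    K = price_counts.shape[0]
    n = np.maximum(price_counts.sum(axis=1), 1)
    # Same style of L1 concentration bound as in the earlier sketch, per price state.
    e_price = np.minimum(np.sqrt(2.0 * (K * np.log(2.0) - np.log(delta / K)) / n), 2.0)
    # States indexed as charge * K + price: the budget depends only on the price part.
    e_states = np.tile(e_price, n_charge_levels)            # shape (n_charge_levels * K,)
    return np.repeat(e_states[:, None], n_actions, axis=1)  # shape (S, A)
```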

A realistic baseline policy is constructed by solving a high-precision version of the discretized problem in which the price process is aggregated to 3 levels from 10. This baseline policy represents a realistic but simplified solution. Because low energy prices are more commonly sampled than high energy prices, the degree of uncertainty varies significantly over the state space.

As in the previous application, we estimate the uncertainty model in a data-driven manner. Notice that the inherent uncertainty is only in the price transitions and is independent of the policy used (which controls the storage dynamics). Here, the uncertainty set of transition probabilities is estimated by the method in Appendix A, but the set is a non-singleton only with respect to the price states. Figure 3 shows the percentage improvement over the baseline policy averaged over 5 runs; the policy labels follow the definitions of Figure 2. We clearly observe that the heuristic RBC method, described in Section 3.4, effectively interleaves the baseline policy (in states with a high level of uncertainty) and an improved policy (in states with a low level of uncertainty), and achieves the best performance in most cases.

Figure 3: (left) Frequency of observed price indexes; each index corresponds to a discretized price level. (right) Improvement over baseline policy as a function of the number of samples.

6 Conclusion

In this paper, we studied the model-based approach to the fundamental problem of learning safe policies from a batch of data. A policy is considered safe if it is guaranteed to perform at least as well as the baseline policy. Solving the safety problem in sequential decision-making can greatly increase the applicability of the existing technology to real-world problems. We showed that the standard robust formulation may be overly conservative and formulated a better approach that interleaves an improved policy with the baseline policy based on the model error at each state. We proposed and analyzed an optimization problem based on this idea (see Eq. 2). We showed that the resulting problem may only have randomized solution policies, derived a performance bound for its solutions, and proved that solving it is NP-hard. Furthermore, we proposed several approximate solutions and experimentally evaluated their performance. Since solving the optimization problem (2) is NP-hard, future work includes 1) deriving approximate algorithms with tighter performance guarantees, and 2) identifying specific structures of the uncertainty set of transition probabilities that lead to tractable solution algorithms.

References

  • Ahmed and Varakantham [2013] A. Ahmed and P. Varakantham. Regret based Robust Solutions for Uncertain Markov Decision Processes. Proceedings of Advances in Neural Information Processing Systems 26, pages 1–9, 2013.
  • Ghavamzadeh and Lazaric [2012] M. Ghavamzadeh and A. Lazaric. Conservative and greedy approaches to classification-based policy iteration. In Proceedings of the Twenty-Sixth Conference on Artificial Intelligence, pages 914–920, 2012.
  • Hallak et al. [2015] A. Hallak, F. Schnitzler, T. Mann, and S. Mannor. Off-policy model-based learning under unknown factored dynamics. In Proceedings of the 32nd International Conference on Machine Learning, pages 711–719, 2015.
  • Hansen et al. [2013] T. Hansen, P. Miltersen, and U. Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM, 60(1):1–16, 2013.
  • Iyengar [2005] G. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
  • Kakade and Langford [2002] S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning, pages 267–274, 2002.
  • Le Tallec [2007] Y. Le Tallec. Robust, Risk-Sensitive, and Data-driven Control of Markov Decision Processes. PhD thesis, MIT, 2007.
  • Nilim and El Ghaoui [2005] A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
  • Petrik and Subramanian [2014] M. Petrik and D. Subramanian. RAAM: The benefits of robustness in approximating aggregated MDPs in reinforcement learning. In Proceedings of Advances in Neural Information Processing Systems 27, 2014.
  • Petrik and Wu [2015] M. Petrik and X. Wu. Optimal Threshold Control for Energy Arbitrage with Degradable Battery Storage. In Uncertainty in Artificial Intelligence (UAI), pages 692–701, 2015.
  • Pirotta et al. [2013] M. Pirotta, M. Restelli, and D. Calandriello. Safe Policy Iteration. In Proceedings of the 30th International Conference on Machine Learning, pages 307–315, 2013.
  • Thomas et al. [2015a] P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence off-policy evaluation. In Proceedings of the Twenty-Ninth Conference on Artificial Intelligence, 2015.
  • Thomas et al. [2015b] P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence policy improvement. In Proceedings of the Thirty-Second International Conference on Machine Learning, pages 2380–2388, 2015.
  • Weissman et al. [2003] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdu, and M. Weinberger. Inequalities for the deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.
  • Wiesemann et al. [2013] W. Wiesemann, D. Kuhn, and B. Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.

Appendix A Error Bound

Our goal here is to construct the error function $e$, when $\hat P$ is estimated from samples drawn from $P^\star$, such that we can guarantee that $P^\star \in \Xi(\hat P, e)$ with probability at least $1 - \delta$. Let us assume that at each state-action pair $(x,a)$, we draw $n(x,a)$ samples from $P^\star(\cdot \mid x,a)$.

If at each state-action pair $(x,a)$ we define $e(x,a) = \sqrt{\tfrac{2}{n(x,a)} \log\tfrac{|\mathcal{X}|\,|\mathcal{A}|\,(2^{|\mathcal{X}|} - 2)}{\delta}}$, then $P^\star \in \Xi(\hat P, e)$ with probability at least $1 - \delta$.

Proof.

From Theorem 2.1 in Weissman et al. [14], for each state-action pair , we may write

(5)

Setting , we may rewrite (5) as

(6)

From the definition of the uncertainty set $\Xi(\hat P, e)$ and by summing the error probabilities in (A), we obtain that $P^\star \in \Xi(\hat P, e)$ with probability at least $1 - \delta$. ∎

Appendix B Proof of Lemma B

Before proving Lemma B, we first prove the following technical lemma, which is used in the analysis.

For any policy $\pi$, consider two transition probability matrices $P_1$ and $P_2$ and two reward functions $r_1$ and $r_2$ corresponding to $\pi$. Let $v_1$ and $v_2$ be the value functions of the policy $\pi$ given $(P_1, r_1)$ and $(P_2, r_2)$, respectively. Under the assumption that for any state , we have

where  is the vector of  's.

Proof.

The difference between the two value functions can be written as

Now using Hölder's inequality, for any , we have

The proof follows by uniformly bounding from the above inequality and from the monotonicity of . ∎

The difference between the returns of a policy in two MDPs parameterized by is bounded as

where and are the transition probability matrix and error function (between and , see Eq. 1) of policy .

Proof.

Lemma B is the direct consequence of Lemma B with the fact that for any and any , from Assumption 2 and the construction of , we have

Appendix C Proof of Theorem 3.2

To prove the safety of $\pi_S$, note that the objective in (2) is always non-negative, since the baseline policy $\pi_B$ is a feasible solution. Thus, we obtain the safety condition by simple algebraic manipulation as follows:

(7)

Now we prove the performance bound. From Appendix B, for any policy , we may write

(8)

where $u^\pi$ is the state occupancy distribution of policy $\pi$ in the true MDP $P^\star$, defined as

We are now ready to show a bound on the performance loss of through the following set of inequalities:

(9)

where (a) is by applying (8) to the two terms on the RHS of the inequality.

The final bound is obtained by combining (C) and the fact that , and as a result, .

To prove the tightness of the bound, we use the example depicted in Fig. 4. The initial state is , actions are , the transitions are deterministic, and the leaves represent absorbing states with the given return. We denote by  the transitions of the true MDP, and by  the worst-case transitions in the uncertainty set . Finally, the baseline policy takes action  in state  and is shown by double edges in Figure 4. It is clear that the optimal policy is the one that takes action  in state . The return of this policy is . It is also straightforward to derive that the policy that takes action  in state  (as shown in Figure 4) is a solution to (2). The return of this policy is  and its performance loss is .

Figure 4: Example showing the tightness of the bound in Theorem 3.2.

Now let us set  in the leaves of Figure 4 to . Note that this is the value given by (8) for . This gives us the tightness proof assuming that  is such that  and  have similar values, and  is a valid return value, i.e., .

Appendix D Proof of Theorem 3.3

Figure 5: MDP in Section 3.3 that represents the optimization of over .


Figure 6: MDP in Section 3.3 representing the optimization of over .

Assume a given fixed policy $\pi$. We start by showing the NP-hardness of solving (3) by a reduction from the Boolean satisfiability (SAT) problem. To simplify the exposition, we also illustrate the reduction on the following simple example SAT problem in conjunctive normal form (CNF):

(10)

where the four symbols in (10) are the variables, and $\ell_{ij}$ represents the $j$-th literal in the $i$-th disjunction.

As noted above, $\min_{\xi \in \Xi} \rho(\pi, \xi)$ represents the return of a robust MDP. Recall that computing this value for a fixed $\pi$ is equivalent to computing a policy in a regular MDP whose actions represent realizations of the transition uncertainty. Therefore, optimizing over $\xi$ in (3) translates to finding a single policy for two MDPs (defined by $\pi$ and $\pi_B$) that maximizes the difference between their returns.

We reduce the SAT problem to the optimization over in (3). As described above, the value for a fixed can be represented as a return of some MDP for a policy given by . Similarly, the value for a fixed can be represented as a return of another MDP . We describe the general reduction in detail below. Figures 5 and 6 illustrate the MDPs and respectively for the example in (10).

The two MDPs share the same state and action sets. The actions represent the realization of the uncertainty and are denoted by the edge labels. They are discrete and stand for the extreme points of the feasible uncertainty sets. For ease of notation, we assume , and states with double circles are terminal with the rewards inscribed therein. All non-terminal transitions have zero rewards.

The identical state set of both and are constructed as follows. There is one state for each variable , and two states for every literal . Informally, actions for a variable state capture the value of that variable. Actions for a literal state or represent the value of the variable referenced by the literal. This is regardless of whether the literal is positive or negative. For example, when the variable in is true, the action in is and when the variable in is false, the action in is . Two states per each literal are necessary in order to model the negation operation.

The transitions in MDPs and are constructed to guarantee that their returns are and , respectively (and as a result the objective in (3) is ), only if the assignment to the literals satisfies the SAT problem. Note that the transitions for the negated literals, such as are different from the positive literals, such as . This construction easily generalizes to any SAT problem in the CNF. Consider the example in (10) and let (other variables can take any values). It can then be seen readily that the objective in (3) would be .

Let be the optimal value of (3). Then, to show the correctness of our reduction, we argue that , if and only if the SAT problem is satisfiable. To show the reverse implication, assume that the SAT is satisfied for some assignment to variables and construct a policy as follows:

where represents the value of the variable referenced by the corresponding literal , e.g.,  in (10). It can be readily seen that and , and thus, the implication that holds.

To show the forward implication, assume that for an optimal deterministic realization , we have and , and thus, . We assign values to variables v as follows:

We have that only if for every disjunction either 1) there exists a positive literal such that and , or 2) there exists a negative literal such that and