Markov Decision Processes (MDPs) are a powerful tool for sequential decision making in stochastic domains. However, the parameters of an MDP are often estimated from limited data, and therefore cannot be specified exactly lacerda2019probabilistic; moldovan2012risk. By disregarding model uncertainty and planning on the estimated MDP, performance can be much worse than anticipated mannor2004bias.
For example, consider an MDP model for medical decision making, where transitions correspond to the stochastic health outcomes for a patient as a result of different treatment options. An estimated MDP model for this problem can be generated from observed data schaefer2005modeling. However, such a model does not capture the variation in transition probabilities due to patient heterogeneity: any particular patient may respond differently to treatments than the average due to unknown underlying factors. Additionally, such a model does not consider uncertainty in the model parameters due to limited data. As a result, Uncertain MDPs (UMDPs) have been proposed as a more suitable model for domains such as medical decision making zhang2017robust where the model cannot be specified exactly.
UMDPs capture model ambiguity by defining an uncertainty set in which the true MDP cost and transition functions lie. In this work, we address offline planning for UMDPs. Most research in this setting has focused on optimising the expected value for the worst-case MDP parameters using robust dynamic programming iyengar2005robust; nilim2005robust. However, this can result in overly conservative policies which perform poorly in the majority of possible scenarios delage2010percentile. Minimax regret has been proposed as an alternative metric for robust planning which is less conservative regan2009regret; xu2009parametric. The aim is to find the policy with the minimum gap between its expected value and the optimal value over all possible instantiations of model uncertainty. However, optimising minimax regret is challenging and existing methods do not scale well.
In this work, we introduce a Bellman equation to decompose the computation of the regret for a policy into a dynamic programming recursion. We show that if uncertainties are independent, we can perform minimax value iteration using the regret Bellman equation to efficiently optimise minimax regret exactly. To our knowledge, this is the first scalable exact algorithm for minimax regret planning in UMDPs with both uncertain cost and transition functions. To address problems with dependent uncertainties, we introduce the use of options sutton1999between to capture dependence over sequences of steps. By varying the option duration n, the user may trade off computation time against solution quality.
Previous works have addressed regret-based planning in finite horizon problems ahmed2013regret; ahmed2017sampling, or problems where there is only uncertainty in the cost function regan2009regret; regan2010robust; regan2011robust; xu2009parametric. We focus on the more general problem of Stochastic Shortest Path (SSP) UMDPs with uncertain cost and transition functions. The main contributions of this work are:
Introducing a Bellman equation to compute the regret for a policy using dynamic programming.
An efficient algorithm to optimise minimax regret exactly in models with independent uncertainties by performing minimax value iteration using our novel Bellman equation.
Proposing using options to capture dependencies between uncertainties to trade off solution quality against computation time for models with dependent uncertainties.
Experiments in both synthetic and real-world domains demonstrate that our approach considerably outperforms existing baselines.
The worst-case expected value for a UMDP can be optimised efficiently with robust dynamic programming provided that the uncertainty set is convex, and the uncertainties are independent between states iyengar2005robust; nilim2005robust; wiesemann2013robust. However, optimising for the worst-case expected value often results in overly conservative policies delage2010percentile. This problem is exacerbated by the independence assumption which allows all parameters to be realised as their worst-case values simultaneously. Sample-based UMDPs represent model uncertainty with a finite set of possible MDPs, capturing dependencies between uncertainties adulyasak2015solving; ahmed2013regret; ahmed2017sampling; chen2012tractable; cubuktepe2020scenario; steimle2018multi. For sample-based UMDPs, dependent uncertainties can also be represented by augmenting the state space mannor2016robust; however, this greatly enlarges the state space even for a modest number of samples.
To compute less conservative policies, alternative planning objectives to worst-case expected value have been proposed. Possibilities include forgoing robustness and optimising average performance adulyasak2015solving; steimle2018multi, performing chance-constrained optimisation under a known distribution of model parameters delage2010percentile, and computing a Pareto-front for multiple objectives scheftelowitsch2017multi. Minimax regret has been proposed as an intuitive objective which is less conservative than optimising for the worst-case expected value xu2009parametric, but can be considered robust as it optimises worst-case sub-optimality. Minimax regret in UMDPs where only the cost function is uncertain is addressed in regan2009regret; regan2010robust; regan2011robust; xu2009parametric.
Limited research has addressed minimax regret in UMDP planning with both
uncertain cost and transition functions. For sample-based UMDPs, the best stationary policy can be found by solving a Mixed Integer Linear Program (MILP); however, this approach does not scale well ahmed2013regret. A policy-iteration algorithm is proposed by ahmed2017sampling to find a policy with locally optimal minimax regret. However, this approach is only suitable for finite-horizon planning in which states are indexed by time step and the graph is acyclic. An approximation proposed by ahmed2013regret optimises minimax Cumulative Expected Myopic Regret (CEMR). CEMR myopically approximates regret by comparing local actions, rather than evaluating overall performance. Our experiments show that policies optimising CEMR often perform poorly for minimax regret. Unlike CEMR, our approach optimises minimax regret exactly for problems with independent uncertainties.
Regret is used to measure performance in reinforcement learning (RL) (e.g. jaksch2010near; cohen2020near; tarbouriech2020no). In the RL setting, the goal is to minimise the total regret, which is the total loss incurred throughout training over many episodes. In contrast, in our UMDP setting we plan offline to optimise the worst-case regret for a policy. This is the regret for a fixed policy evaluated over a single episode, assuming the MDP parameters are chosen adversarially. In RL, options sutton1999between have been utilised for learning robust policies with temporally extended actions mankowitz2018learning. In this work, we use options to capture dependencies between model uncertainties throughout the execution of each option.
Another approach to address MDPs which are not known exactly is Bayesian RL ghavamzadeh2015bayesian which adapts the policy online throughout execution. In contrast to our setting, Bayesian RL typically does not address worst-case performance and requires access to a distribution over MDPs rather than a set. The offline minimax regret setting we consider is more appropriate for safety-critical domains such as medical decision making, where the policy must be scrutinised by regulators prior to deployment, and robustness to worst-case suboptimality is important.
An SSP MDP is defined as a tuple (S, s0, A, C, P, G), where S is the set of states, s0 ∈ S is the initial state, A is the set of actions, C is the cost function, and P is the transition function. G ⊆ S is the set of goal states. Each goal state is absorbing and incurs zero cost.
The expected cost of applying action a in state s is C(s, a). The minimum expected cost at state s is denoted V*(s). A finite path is a finite sequence of states visited in the MDP. A history-dependent policy maps finite paths to a distribution over action choices. A stationary policy only considers the current state. A policy is deterministic if it chooses a single action at each step. The set of all policies is denoted Π. A policy is proper at s if it reaches G from s with probability 1. A policy is proper if it is proper at all states. In an SSP MDP, the following assumptions are made kolobov2012planning: a) there exists a proper policy, and b) every improper policy incurs infinite cost at all states where it is improper.
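To ground these definitions, the following is a minimal value iteration sketch for a toy SSP MDP. The states, actions, costs, and transition probabilities are illustrative assumptions, not taken from the paper.

```python
# Toy SSP MDP: states 0 (initial), 1, and absorbing goal "g" with zero cost.
COSTS = {  # C(s, a): expected cost of applying action a in state s
    (0, "fast"): 2.0, (0, "safe"): 1.0,
    (1, "fast"): 1.0, (1, "safe"): 1.0,
}
TRANS = {  # P(s' | s, a): successor distributions
    (0, "fast"): {"g": 0.5, 1: 0.5},
    (0, "safe"): {1: 1.0},
    (1, "fast"): {"g": 1.0},
    (1, "safe"): {"g": 1.0},
}
STATES = [0, 1, "g"]
GOALS = {"g"}
ACTIONS = ["fast", "safe"]

def value_iteration(eps=1e-9):
    """Compute the minimum expected cost-to-goal V*(s) at every state."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in GOALS:
                continue  # goal states are absorbing with zero cost
            best = min(
                COSTS[(s, a)] + sum(p * V[t] for t, p in TRANS[(s, a)].items())
                for a in ACTIONS
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            return V
```

Here the "safe" action is optimal at state 0 (expected cost 2.0 rather than 2.5), which the iteration recovers.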
In this work, we aim to minimise the regret for a fixed policy over a single episode which is defined as follows.
The regret for a policy π, denoted reg(π), is defined as reg(π) = V_π(s0) − V*(s0),
where V_π(s) is the value of policy π in state s according to the following Bellman equation, V_π(s) = Σ_a π(a|s)[C(s, a) + Σ_{s′} P(s′|s, a) V_π(s′)],
and π* is the policy with minimal expected value. Intuitively, the regret for a policy is the expected suboptimality over a single episode. ahmed2013regret; ahmed2017sampling proposed Cumulative Expected Myopic Regret (CEMR) as a regret approximation.
The CEMR of policy π at state s, denoted cemr_π(s), is defined as cemr_π(s) = Σ_a π(a|s)[C(s, a) − min_{a′} C(s, a′) + Σ_{s′} P(s′|s, a) cemr_π(s′)].
The term C(s, a) − min_{a′} C(s, a′) is the gap between the expected cost of π and the best expected cost for any action at s. CEMR is myopic, accumulating the local regret relative to the actions available at each state.
We use the commonly employed sample-based UMDP definition adulyasak2015solving; ahmed2013regret; ahmed2017sampling; chen2012tractable; steimle2018multi. This representation captures dependencies between uncertainties because each sample represents an entire MDP. As we are interested in worst-case regret, we only require samples which provide adequate coverage over possible MDPs, rather than a distribution over MDPs.
An SSP UMDP is defined by the tuple (S, s0, A, 𝒞, 𝒫, G). S, s0, A, and G are defined as for SSP MDPs. 𝒫 denotes a finite set of possible transition functions and 𝒞 denotes the associated set of possible cost functions. A sample of model uncertainty, ξ = (C, P), is defined by a cost function C ∈ 𝒞 and a transition function P ∈ 𝒫. The set of samples is denoted Ξ.
We provide a definition for independent uncertainty sets, equivalent to the state-action rectangularity property introduced in iyengar2005robust. Intuitively this means that uncertainties are decoupled between subsequent action choices.
A set of transition functions is independent over state-action pairs if it equals the Cartesian product, over all state-action pairs (s, a), of the sets of possible distributions over successor states after applying a in s.
The definition of independence for cost functions is analogous. In this work we wish to find the policy which minimises the maximum regret over the uncertainty set.
Find the minimax regret policy, defined as the policy which minimises the maximum regret over all samples: π* = argmin_{π∈Π} max_{ξ∈Ξ} reg(π, ξ),
where reg(π, ξ) is the regret of π in the MDP corresponding to sample ξ. In general, stochastic policies are required to hedge against alternate possibilities xu2009parametric, and history-dependent policies are required if uncertainties are dependent steimle2018multi; wiesemann2013robust. If only stationary deterministic policies are considered, a minimax regret policy can be computed exactly by solving a MILP ahmed2013regret. An approximation for minimax regret is to find the policy with minimax CEMR ahmed2013regret; ahmed2017sampling:
where cemr(π, ξ) is the CEMR of π corresponding to sample ξ.
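To make the minimax regret objective of Problem 1 concrete, the following sketch brute-forces it over deterministic stationary policies for a toy two-sample UMDP. All model parameters are hypothetical, and the shared transition function is a simplification for brevity.

```python
import itertools

STATES, GOALS, ACTIONS = [0, 1, "g"], {"g"}, ["fast", "safe"]
TRANS = {
    (0, "fast"): {"g": 0.5, 1: 0.5}, (0, "safe"): {1: 1.0},
    (1, "fast"): {"g": 1.0}, (1, "safe"): {"g": 1.0},
}
SAMPLES = [  # each sample is a cost function; transitions are shared here
    {(0, "fast"): 2.0, (0, "safe"): 1.0, (1, "fast"): 1.0, (1, "safe"): 1.0},
    {(0, "fast"): 3.0, (0, "safe"): 1.0, (1, "fast"): 1.0, (1, "safe"): 1.0},
]

def evaluate(policy, costs, sweeps=200):
    """Iterative policy evaluation; converges for proper policies."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            if s in GOALS:
                continue
            a = policy[s]
            V[s] = costs[(s, a)] + sum(p * V[t] for t, p in TRANS[(s, a)].items())
    return V[0]  # value at the initial state s0 = 0

def optimal_value(costs, sweeps=200):
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            if s in GOALS:
                continue
            V[s] = min(costs[(s, a)] + sum(p * V[t] for t, p in TRANS[(s, a)].items())
                       for a in ACTIONS)
    return V[0]

def minimax_regret_policy():
    """Enumerate deterministic stationary policies; keep the minimax-regret one."""
    opt = [optimal_value(c) for c in SAMPLES]
    best, best_reg = None, float("inf")
    for choice in itertools.product(ACTIONS, repeat=2):
        policy = {0: choice[0], 1: choice[1]}
        reg = max(evaluate(policy, c) - v for c, v in zip(SAMPLES, opt))
        if reg < best_reg:
            best, best_reg = policy, reg
    return best, best_reg
```

Enumeration is exponential in the number of states, which illustrates why the MILP formulation, and the scalable methods in this paper, are needed.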
Our work is closely connected to the UMDP solution which finds the best expected value for the worst-case parameters iyengar2005robust; nilim2005robust; wiesemann2013robust. We refer to the resulting policy as the robust policy. Assuming independent uncertainties per Def. 5, finding the robust policy can be posed as a Stochastic Game (SG) between the agent and an adversary which responds to the action of the agent by applying the worst-case parameters at each step:
The meaning of the superscript 1 on the adversary will become clear later.
For this problem, the optimal value function may be found via minimax Value Iteration (VI) and corresponds to a deterministic stationary policy for both players wiesemann2013robust. For SSPs, convergence is guaranteed if: a) there exists a policy for the agent which is proper for all possible policies of the adversary, and b) for any states where and are improper, the expected cost for the agent is infinite patek1999stochastic.
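A minimal sketch of this robust (worst-case expected value) solution over a finite sample set, assuming independent uncertainties so the adversary may pick the worst sample separately at each state-action pair. The toy model is an assumption for illustration.

```python
STATES, GOALS, ACTIONS = [0, 1, "g"], {"g"}, ["fast", "safe"]
TRANS = {
    (0, "fast"): {"g": 0.5, 1: 0.5}, (0, "safe"): {1: 1.0},
    (1, "fast"): {"g": 1.0}, (1, "safe"): {"g": 1.0},
}
COST_SAMPLES = [  # two cost-function samples; transitions shared for brevity
    {(0, "fast"): 2.0, (0, "safe"): 1.0, (1, "fast"): 1.0, (1, "safe"): 1.0},
    {(0, "fast"): 3.0, (0, "safe"): 1.0, (1, "fast"): 1.0, (1, "safe"): 1.0},
]

def robust_value_iteration(eps=1e-9):
    """Minimax VI: agent minimises over actions, adversary maximises over samples."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in GOALS:
                continue
            best = min(
                max(C[(s, a)] + sum(p * V[t] for t, p in TRANS[(s, a)].items())
                    for C in COST_SAMPLES)
                for a in ACTIONS
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            return V
```

Because the adversary re-chooses the sample at every state-action pair, this is exactly the independence assumption that can make the robust policy conservative.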
Regret Bellman Equation
Our first contribution is Proposition 1 which introduces a Bellman equation to compute the regret for a policy via dynamic programming. Full proofs of all propositions are in the appendices.
(Regret Bellman Equation) The regret for a proper policy π can be computed via the following recursion: reg_π(s) = Σ_a π(a|s)[Δ(s, a) + Σ_{s′} P(s′|s, a) reg_π(s′)], where Δ(s, a) = C(s, a) + Σ_{s′} P(s′|s, a) V*(s′) − V*(s).
This term represents the suboptimality attributed to a state-action pair.
Proof sketch: Unrolling the recursion from s0 and taking the number of steps to infinity, the residual terms vanish under the definition of a proper policy. Simplifying, we obtain reg_π(s0) = V_π(s0) − V*(s0), which is the original definition for the regret of a policy. ∎
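The regret recursion of Proposition 1 can be sketched for a fixed policy on a toy acyclic SSP MDP as follows. The model, the fixed (deliberately suboptimal) policy, and the precomputed optimal values are illustrative assumptions.

```python
V_STAR = {0: 2.0, 1: 1.0, "g": 0.0}   # optimal values V*, assumed precomputed
POLICY = {0: "fast", 1: "fast"}       # fixed, deliberately suboptimal policy
COSTS = {(0, "fast"): 2.0, (1, "fast"): 1.0}
TRANS = {(0, "fast"): {1: 0.5, "g": 0.5}, (1, "fast"): {"g": 1.0}}

def regret(s, memo=None):
    """reg(s) = [C(s,a) + E[V*(s')] - V*(s)] + E[reg(s')] for a = POLICY[s].

    Plain recursion is valid here because the toy MDP is acyclic; a cyclic
    MDP would require the iterative dynamic programming of the paper.
    """
    if s == "g":
        return 0.0
    memo = {} if memo is None else memo
    if s in memo:
        return memo[s]
    a = POLICY[s]
    succ = TRANS[(s, a)]
    advantage = COSTS[(s, a)] + sum(p * V_STAR[t] for t, p in succ.items()) - V_STAR[s]
    memo[s] = advantage + sum(p * regret(t, memo) for t, p in succ.items())
    return memo[s]
```

The recursion attributes 0.5 of regret to the suboptimal "fast" choice at state 0, matching the direct definition V_π(s0) − V*(s0) = 2.5 − 2.0.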
Minimax Regret Optimisation
In this section, we describe how Proposition 1 can be used to optimise minimax regret in UMDPs. We separately address UMDPs with independent and dependent uncertainties.
Exact Solution for Independent Uncertainties
To address minimax regret optimisation in UMDPs with independent uncertainties per Def. 5, we start by considering the following SG in Problem 2. At each step, the agent chooses an action, and the adversary reacts to this choice by choosing the MDP sample to be applied for that step to maximise the regret of the policy.
Find the minimax regret policy in the stochastic game defined by
Proof sketch: For independent uncertainty sets, an adversary which chooses one set of parameters to be applied for the entire game is equivalent to an adversary which may change the parameters each step according to a stationary policy. ∎
Intuitively, this is because for any independent uncertainty set, fixing the parameters applied at one state-action pair does not restrict the set of parameter choices available at other state-action pairs. For problems of this form, deterministic stationary policies suffice iyengar2005robust.
Problem 2 can be solved by applying minimax VI to the regret Bellman equation in Proposition 1. In the next section, we present Alg. 1, which solves a generalisation of Problem 2. The generalisation optimises minimax regret against an adversary that may change the parameters every n steps. To solve Problem 2, we apply Alg. 1 with n = 1. Proposition 2 shows that this optimises minimax regret exactly for UMDPs with independent uncertainty sets.
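A sketch of minimax regret VI with n = 1 on a toy two-sample UMDP: at each state the agent minimises over actions while the adversary maximises the one-step regret contribution over samples, using per-sample optimal values. All numbers, including the precomputed per-sample V*, are illustrative assumptions.

```python
STATES, GOALS, ACTIONS = [0, 1, "g"], {"g"}, ["fast", "safe"]
SAMPLES = [
    {   # sample 1: cost function, transition function, optimal values V*
        "C": {(0, "fast"): 2.0, (0, "safe"): 1.0, (1, "fast"): 1.0, (1, "safe"): 1.0},
        "P": {(0, "fast"): {"g": 0.5, 1: 0.5}, (0, "safe"): {1: 1.0},
              (1, "fast"): {"g": 1.0}, (1, "safe"): {"g": 1.0}},
        "V": {0: 2.0, 1: 1.0, "g": 0.0},
    },
    {   # sample 2: dearer "fast" action at state 0
        "C": {(0, "fast"): 3.0, (0, "safe"): 1.0, (1, "fast"): 1.0, (1, "safe"): 1.0},
        "P": {(0, "fast"): {"g": 0.5, 1: 0.5}, (0, "safe"): {1: 1.0},
              (1, "fast"): {"g": 1.0}, (1, "safe"): {"g": 1.0}},
        "V": {0: 2.0, 1: 1.0, "g": 0.0},
    },
]

def minimax_regret_vi(eps=1e-9):
    """Backup: reg(s) = min_a max_k [C_k + E_k[V*_k] - V*_k(s) + E_k[reg]]."""
    reg = {s: 0.0 for s in STATES}
    policy = {}
    while True:
        delta = 0.0
        for s in STATES:
            if s in GOALS:
                continue
            backups = {}
            for a in ACTIONS:
                backups[a] = max(  # adversary: worst sample for this action
                    k["C"][(s, a)]
                    + sum(p * k["V"][t] for t, p in k["P"][(s, a)].items())
                    - k["V"][s]
                    + sum(p * reg[t] for t, p in k["P"][(s, a)].items())
                    for k in SAMPLES
                )
            a_best = min(backups, key=backups.get)
            delta = max(delta, abs(backups[a_best] - reg[s]))
            reg[s], policy[s] = backups[a_best], a_best
        if delta < eps:
            return reg, policy
```

In this toy instance the "safe" action achieves zero worst-case regret at state 0, whereas "fast" would incur regret 1.5 under the adversarial sample.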
Approx. Solutions for Dependent Uncertainties
For UMDPs with dependent uncertainties, optimising minimax regret exactly is intractable ahmed2017sampling. A possible approach is to over-approximate the uncertainty by assuming independent uncertainties, and solve Problem 2. However, this gives too much power to the adversary, allowing parameters from different samples to be realised within the same game. Thus, the minimax regret computed under this assumption is an over-approximation, and the resulting policy may be overly conservative. In this section, we propose a generalisation of Problem 2 as a way to alleviate this issue. We start by bounding the maximum possible error of the over-approximation associated with solving Problem 2 for UMDPs with dependent uncertainties.
If the expected number of steps for the policy to reach the goal is bounded above for any adversary:
Prop. 3 shows that decoupling the uncertainties at every step over-approximates the maximum regret. We now introduce an approximation which is more accurate, but requires increased computation. Our approach is to approximate dependent uncertainties by decoupling the uncertainty only every n steps. This results in Problem 3, a generalisation of Problem 2 where the agent chooses a policy to execute for n steps, and the adversary reacts by choosing the MDP sample to be applied for those n steps to maximise the regret. After executing n steps, the game transitions to a new state and the process repeats. Increasing n weakens the adversary by capturing dependence over each n-step sequence. As n tends to infinity, we recover the original minimax regret definition (Problem 1).
Find the minimax regret policy in the stochastic game defined by
In the remainder of this section, we present our approach to solving Problem 3. We start by defining n-step options, an adaptation of options sutton1999between.
An n-step option is a tuple consisting of an initiation state where the option may be selected, a policy, a set of goal states, and the number of steps n.
If an n-step option is executed, the option policy is executed from the initiation state until one of two conditions is met: either n steps pass, or a goal state is reached. Hereafter, we assume that the goal states for all n-step options coincide with the goal states of the UMDP. The probability of reaching each successor state after executing an option is given by the induced n-step transition distribution.
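The n-step transition distribution induced by an option can be sketched by rolling the option policy forward for n steps with goal states treated as absorbing. The toy policy and dynamics below are assumptions.

```python
def option_transition(s0, policy, trans, goals, n):
    """Distribution over states after executing an n-step option from s0."""
    dist = {s0: 1.0}
    for _ in range(n):
        nxt = {}
        for s, p in dist.items():
            if s in goals:                      # goal reached early: stay put
                nxt[s] = nxt.get(s, 0.0) + p
            else:                               # follow the option policy
                for t, q in trans[(s, policy[s])].items():
                    nxt[t] = nxt.get(t, 0.0) + p * q
        dist = nxt
    return dist

# Toy dynamics: from state 0, "fast" reaches the goal or state 1 with
# probability 0.5 each; from state 1, "fast" reaches the goal surely.
TRANS = {(0, "fast"): {"g": 0.5, 1: 0.5}, (1, "fast"): {"g": 1.0}}
```

For n = 2 starting from state 0, all probability mass ends at the goal, illustrating how options compress n primitive steps into a single transition of the n-MDP.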
We are now ready to define the n-step option MDP (n-MDP). The n-MDP is equivalent to the original MDP, except that we reason over options which represent policies executed for n steps in the original MDP. Additionally, the cost for applying an option at a state in the n-MDP is equal to the regret attributed to applying the option policy at that state in the original MDP for n steps, according to the regret Bellman equation in Proposition 1.
An n-step option MDP (n-MDP), derived from an original SSP MDP, is defined by a tuple in which the states, initial state, and goal states are the same as in the original MDP. The action set is replaced by the set of possible n-step options, and the transition function gives the distribution over states after applying an option. The cost function is defined as:
where the expected value term is the value obtained by applying the option policy for n steps starting in the given state.
The policy that selects options for the n-MDP is called the option policy. We can convert a UMDP to a corresponding n-UMDP by converting each MDP sample in the UMDP to an n-MDP.
Proof sketch: The regret Bellman equation in Proposition 1 can equivalently be written for an n-MDP as
This is an MDP Bellman equation using the cost and transition functions for the n-MDP. Therefore, finding the minimax regret according to Problem 3 is equivalent to finding the minimax expected cost for the n-UMDP. This is optimised by the robust policy for the n-UMDP.
The results for minimax VI apply iyengar2005robust; nilim2005robust, and therefore the optimal policy is stationary and chooses options deterministically. To guarantee convergence, we apply a perturbation by adding a small scalar to the cost in Eq. 12. It can be shown that in the limit as the perturbation tends to zero, the resulting value function approaches the exact solution bertsekas2018abstract.
Algorithm 1 presents pseudocode for the minimax VI algorithm. In Line 1, we start by computing the optimal value in each of the MDP samples using standard VI. This is necessary to compute the contribution to the regret of any action according to Proposition 1. Line 2 initialises the values for the minimax regret of the policy to zero at all states. In Lines 5-10 we sweep through each state until convergence. At each state we update both the minimax regret value and the option chosen by the policy, according to the Bellman backup defined by Eq. 14. Solving Equation 14 is non-trivial, and we formulate a means to solve it in the following subsection.
Eq. 15 of Prop. 5 establishes that for dependent uncertainties, the value computed by Alg. 1 is an upper bound on the maximum regret for the policy for any n. Eq. 16 shows that if we increase n by a factor and optimise the policy using Algorithm 1, we are guaranteed to equal or decrease this upper bound. Our experiments demonstrate that in practice increasing n improves performance substantially.
For dependent uncertainty sets,
Optimising the Option Policies
To perform minimax VI in Algorithm 1, we repeatedly solve the Bellman equation defined by Eq. 14. Eq. 14 corresponds to finding an option policy by solving a finite-horizon minimax regret problem with dependent uncertainties. Because of the dependence over the n steps, the optimal option policy may be history-dependent steimle2018multi; wiesemann2013robust. Intuitively, this is because for dependent uncertainty sets, the history may provide information about the uncertainty set at future stages. To maintain scalability, whilst still incorporating some memory into the option policy, we opt to consider option policies which depend only on the state and time step. Therefore, the optimisation problem in Eq. 14 can be written as Table 1.
In Table 1, we optimise the updated value of the minimax regret at the state being backed up. The remaining optimisation variables are the option policy variables and the auxiliary variables described below. The optimal value in each sample is precomputed in Line 1 of Alg. 1. The current estimate of the minimax regret at each state is initialised to zero in Line 2 of Alg. 1.
The constraints in Table 1 represent the following. For each sample and time step, a set contains all the states reachable in exactly that many steps from the state being backed up. Eq. 17 corresponds to the regret Bellman equation for options in Eq. 12-13, where the inequality over all samples enforces minimising the maximum regret. One family of variables represents the expected cumulative part of the minimax regret in Eq. 12-13 resulting from the expected state distribution after applying the option policy in each sample over the horizon of n steps; constraint equations 18-20 propagate these values over the n-step horizon. A second family of variables represents the expected value of the option policy at each time step of the n-step horizon in each sample, and the computation of the expected value is enforced by the constraints in Eq. 21-23.
In the supplementary material, we provide linearisations for the nonlinear constraints in Eq. 20 and 23. We consider both deterministic and stochastic option policies, as due to the dependent uncertainties the optimal option policy may be stochastic wiesemann2013robust. The solution is exact for deterministic policies, and for stochastic policies a piecewise linear approximation is required.
For SSP MDPs with strictly positive costs, the number of iterations required for VI to converge within a given residual is bounded by a quantity depending on the minimum cost, the optimal value, and the norm of the initial value function error bonet2007speed. In our problem, the minimum cost is the small perturbation added to the cost in Eq. 12. During each VI sweep, we solve Eq. 14 once per state by optimising the model in Table 1, so the total number of times Table 1 must be solved scales with the number of states and the number of sweeps, which depends on the optimal minimax regret for Problem 3. To assess the complexity of optimising the model in Table 1, assume the MDP branching factor is b. Then the number of reachable states in n steps is at most exponential in n, and the size of the model grows accordingly. MILP solution time is exponential in the model size. Crucially, the size of the state space is not in the exponent.
This approach requires a finite set of samples, yet for some problems there are infinite possible MDP instantiations. To optimise worst-case regret, we need samples which provide adequate coverage over possible MDP instantiations so that the resulting worst-case solution generalises to all possible MDPs. Where necessary, we use the sampling approach proposed for this purpose in ahmed2013regret.
To reduce the size of the problem in Table 1, we can prune out actions that are unlikely to be used by the minimax regret policy. We propose the following pruning method, analogous to the approach proposed by lacerda2017multi. The policy with optimal expected cost is computed for each sample to create a set of optimal policies. We build the pruned UMDP by removing all state-action pairs whose action is not chosen at that state by any policy in this set. The intuition is that actions which are directed towards the goal are likely to be included in at least one of the optimal policies. Therefore, this pruning process removes actions which are not directed towards the goal, and are unlikely to be included in a policy with low maximum regret.
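The pruning heuristic can be sketched as follows: compute a per-sample optimal value function and keep, at each state, only actions that are optimal in at least one sample. The toy model is an assumption.

```python
STATES, GOALS, ACTIONS = [0, 1, "g"], {"g"}, ["fast", "safe"]
TRANS = {
    (0, "fast"): {"g": 0.5, 1: 0.5}, (0, "safe"): {1: 1.0},
    (1, "fast"): {"g": 1.0}, (1, "safe"): {"g": 1.0},
}
COST_SAMPLES = [
    {(0, "fast"): 2.0, (0, "safe"): 1.0, (1, "fast"): 1.0, (1, "safe"): 1.0},
    {(0, "fast"): 3.0, (0, "safe"): 1.0, (1, "fast"): 1.0, (1, "safe"): 1.0},
]

def optimal_values(costs, sweeps=200):
    """Standard VI for a single sample's cost function."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            if s in GOALS:
                continue
            V[s] = min(costs[(s, a)] + sum(p * V[t] for t, p in TRANS[(s, a)].items())
                       for a in ACTIONS)
    return V

def prune_actions(tol=1e-9):
    """Keep an action at s iff it is greedy w.r.t. V* in at least one sample."""
    keep = {s: set() for s in STATES if s not in GOALS}
    for costs in COST_SAMPLES:
        V = optimal_values(costs)
        for s in keep:
            q = {a: costs[(s, a)] + sum(p * V[t] for t, p in TRANS[(s, a)].items())
                 for a in ACTIONS}
            best = min(q.values())
            keep[s].update(a for a, v in q.items() if v <= best + tol)
    return keep
```

In this toy instance "fast" is suboptimal at state 0 in both samples, so it is pruned there, while both actions survive at state 1 where they tie.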
We evaluate the following approaches on three domains with dependent uncertainties:
reg: the approach presented in this paper.
cemr: the state-of-the-art approach from ahmed2013regret; ahmed2017sampling, which we have extended to n-step options.
robust: the standard robust dynamic programming solution in Eq. 6 iyengar2005robust; nilim2005robust.
MILP: the optimal stationary deterministic minimax regret policy computed using the MILP in ahmed2013regret. We do not compare against the stochastic version as we found it did not scale beyond very small problems.
Averaged MDP: this benchmark averages the cost and transition parameters across the MDP samples and computes the optimal policy in the resulting MDP adulyasak2015solving; chen2012tractable; delage2010percentile.
Best MDP Policy: computes the set of optimal policies, one per sample. For each policy, the maximum regret is evaluated over all samples. The policy with the lowest maximum regret is selected.
We found that action pruning reduced computation time for reg, cemr, and MILP without significantly harming performance. Therefore, we only present results for these approaches with pruning. We report results for both stochastic and deterministic option policies. MILPs are solved using Gurobi, and all other processing is performed in Python. Computation times are reported for a 3.2 GHz Intel i7 processor. For further details on the experimental domains see the appendices.
Medical decision making domain
We test on the medical domain from sharma2019robust. The state comprises two factors: the health of the patient and the day. At any state, one of three actions can be applied, each representing a different treatment. In each MDP sample the transition probabilities for each treatment differ, corresponding to different responses by patients with different underlying conditions. The health of the patient on the final day determines the cost received.
Disaster rescue domain
We adapt this domain from the UMDP literature adulyasak2015solving; ahmed2013regret; ahmed2017sampling to SSP MDPs. An agent navigates an 8-connected grid containing swamps and obstacles by choosing from 8 actions. Nominally, for each action the agent transitions to the corresponding target square with high probability, and to each of the two adjacent squares with a small probability. If the target or an adjacent square is an obstacle, the agent transitions to that square with probability 0.05. If the square is a swamp, the cost for entering it is sampled uniformly at random; the cost for entering any other state is 0.5. The agent does not know the exact locations of swamps and obstacles, and instead knows regions where they may be located. To construct a sample, a swamp and an obstacle are sampled uniformly from each swamp and obstacle region respectively. Fig. 5 (left) illustrates swamp and obstacle regions for a particular UMDP. Fig. 5 (right) illustrates a possible sample corresponding to the same UMDP.
Underwater glider domain
An underwater glider navigates to a goal location subject to uncertain ocean currents from real-world forecasts. For each UMDP, we sample a region of the Norwegian sea, and sample the start and goal location. The mission will be executed between 6am and 6pm, but the exact time is unknown. As such, the navigation policy must perform well during all ocean conditions throughout the day. We construct 12 MDP samples, corresponding to the ocean current forecast at each hourly interval. Each state is a grid cell with 500m side length. There are 12 actions corresponding to heading directions. The cost for entering each state is sampled in [0.8, 1]. An additional cost of 3 is added for entering a state where the water depth is < 260m and current is > 0.12m/s, penalising operation in shallow water with strong currents. The forecast used was for May 1st 2020 and is available online at https://marine.copernicus.eu/.
For the medical domain, each method was evaluated on 250 randomly generated UMDPs. For the other two domains, each method was evaluated over a range of problem sizes, and each problem size was repeated for 25 randomly generated UMDPs. For each disaster rescue and medical decision making UMDP, the sample set consisted of 15 samples selected using the method from ahmed2013regret; ahmed2017sampling. In underwater glider, the sample set consisted of the 12 samples corresponding to each hourly weather forecast. For each method, we include results where the average computation time was under 600s. For the resulting policies, the maximum regret over all samples in the set was computed. Each maximum regret value was normalised to [0, 1] by dividing by the worst maximum regret for that UMDP across the methods. The normalised values were then averaged over all 25/250 runs, and are displayed in Fig. 5 and the top row of Table 2.
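The normalisation step can be sketched as follows; the method names and regret values are hypothetical.

```python
def normalise_max_regret(max_regret_by_method):
    """Scale each method's maximum regret by the worst across methods for one UMDP."""
    worst = max(max_regret_by_method.values())
    if worst == 0:
        return {m: 0.0 for m in max_regret_by_method}
    return {m: r / worst for m, r in max_regret_by_method.items()}
```

This makes results comparable across UMDPs of different difficulty before averaging over runs.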
For sample-based UMDPs, the policy is computed from a finite set of samples. To assess generalisation to any possible MDP instantiation, we evaluated the maximum regret on a larger set of 100 random samples. For disaster rescue and medical decision making, the samples were generated using the procedure outlined. For underwater glider, we generated more samples by linearly interpolating between the 12 forecasts and adding Gaussian noise to the ocean current values (2% of the ocean current velocity). The average maximum regret for this experiment is shown in Fig. 5 and the bottom row of Table 2.
A table of p-values is included in the supplementary material which shows that most differences in performance between methods are statistically significant, with the exception of those lines in the figures which overlap. Fig. 5 shows that in general, our approach significantly outperforms the baselines with the exception of MILP, which also performs well. However, MILP scales poorly as indicated by Fig. 5, and failed to find solutions in the medical domain within the 600s time limit. MILP finds the optimal deterministic stationary policy considering full dependence between uncertainties. However, the performance of our approach improves significantly with increasing n, and outperforms MILP for larger n. This indicates that the limited memory of the non-stationary option policies is crucial for strong performance. For the same n, stochastic option policies improve performance in both domains. However, the poor scalability of stochastic policies indicates that deterministic options with larger n are preferable. Across all domains the current state-of-the-art method, CEMR with n = 1, performed poorly. Performance of CEMR improved somewhat when extended to use our options framework with larger n. The poor performance of CEMR can be attributed to the fact that it myopically approximates the maximum regret by calculating the performance loss relative to local actions, which may be a poor estimate of the suboptimality over the entire policy. In contrast, our approach optimises the maximum regret using the recursion given in Prop. 1, which computes the contribution of each action to the regret for the policy exactly by comparing against the optimal value function in each sample.
The generalisation results in Fig. 5 and Table 2 show strong performance of our approach for disaster rescue and the medical domain on the larger test set. In the glider domain, there is more overlap between methods; however, our approach with larger n still tends to perform best. This verifies that in domains with a very large set of possible MDPs, a viable approach is to use our method to find a policy with a smaller set of MDPs, and this policy will generalise well to the larger set.
We have presented an approach for minimax regret optimisation in offline UMDP planning. Our algorithm solves this problem efficiently and exactly in problems with independent uncertainties. To address dependent uncertainties we have proposed using options to capture dependence over sequences of steps and trade off computation time against solution quality. Our results demonstrate that our approach offers state-of-the-art performance. In future work, we wish to improve the scalability of our approach by extending it to use approximate dynamic programming with function approximation.
This work was supported by UK Research and Innovation and EPSRC through the Robotics and Artificial Intelligence for Nuclear (RAIN) research hub [EP/R026084/1] and the Clarendon Fund at the University of Oxford.
Appendix A Proof of Proposition 1
(Regret Bellman Equation) The regret for a proper policy π can be computed via the following recursion: reg_π(s) = Σ_a π(a|s)[Δ(s, a) + Σ_{s′} P(s′|s, a) reg_π(s′)], where Δ(s, a) = C(s, a) + Σ_{s′} P(s′|s, a) V*(s′) − V*(s).
To prove the proposition we show that if we apply the equations in Proposition 1 starting from the initial state s0 we recover the definition of the regret for a policy given by Definition 2. We start by combining the equations stated in the proposition for the initial state
We can move V*(s0) outside of the sum as it does not depend on the action. We start unrolling the definition by substituting the recursion for the regret at the successor states
Again, we can move the optimal value outside of the inner sum as it does not depend on the action. Thus, Equation 25 can be rewritten as
Cancelling terms, we have
After repeating the above process of unrolling the expression and cancelling terms for t steps we arrive at the following expression
Taking t to infinity, the goal is reached with probability 1 under the definition of a proper policy. Thus, the cost terms at goal states vanish by the definition of goal states in an SSP MDP (Definition 1), and the residual regret terms vanish by the definition of the regret decomposition in Proposition 1. This allows us to further simplify the expression to the following
The nested sum is simply the expected cost of the policy, V_π(s0). Thus, the expression can be further simplified to give the final result reg(π) = V_π(s0) − V*(s0), which is the original definition for the regret of a policy. ∎
Appendix B Proof of Proposition 2
In Problem 1, the agent first chooses a policy. The adversary observes the policy of the agent and reacts by choosing the uncertainty sample to be applied to maximise the regret for the policy of the agent. In Problem 2, the adversary reacts to the policy of the agent by choosing the mapping from state-action pairs to uncertainty samples which maximises the regret for the policy of the agent. To prove the proposition, we show that in the case of independent uncertainty sets these adversaries are equivalent.
Let Ξ be an independent uncertainty set. Then each sample of model uncertainty is ξ = (C, P). By Definition 5, the set of transition functions is the Cartesian product, over all state-action pairs (s, a), of the sets of possible distributions over successor states after applying a in s, and the set of cost functions is the Cartesian product of the sets of possible expected costs of applying a in s.
The adversary in Problem 2 maps each state and action chosen by the agent to an MDP sample such that the regret for the policy of the agent is maximised. This means that at a state-action pair the adversary may apply any sample from the sample set, and at another state-action pair the adversary may again apply any sample, and so on. Thus, over all state-action pairs, the adversary may choose any combination of different samples at each state-action pair. The sets of all possible combinations of transition and cost functions chosen this way equal the original transition and cost uncertainty sets respectively, by the definition of independence. Thus, over all state-action pairs, the combination of samples chosen by the adversary in Problem 2 is equivalent to choosing a single sample from the uncertainty set such that the regret for the policy of the agent is maximised. The adversary in Problem 1 also chooses any such sample to maximise the regret for the agent. Thus, we observe that for independent uncertainties, the two adversaries maximising the regret are equivalent. ∎
Appendix C Proof of Proposition 3
If the expected number of steps for the policy to reach the goal is bounded above for any adversary: