## 1 Introduction

Reinforcement learning (RL) is a learning framework that addresses sequential decision-making problems, wherein an ‘agent’ or decision maker learns a policy to optimize a long-term reward by interacting with the (unknown or partially known) environment. At each step, the RL agent obtains evaluative feedback (called reward or cost) about the performance of its action, allowing it to improve the performance of subsequent actions (sutton1998). With the advent of deep learning, RL has witnessed huge successes in recent times (silver2017mastering). However, since most of these methods rely on model-free RL, there are several unsolved challenges, which restrict the use of these algorithms for many safety critical physical systems (vamtutorial; Benosman2018). For example, it is very difficult for most model-free RL algorithms to ensure basic properties like stability of solutions, robustness with respect to model uncertainties, etc. This has led to several research directions which study incorporating robustness, constraint satisfaction, and safe exploration during learning for safety critical applications. While safe exploration and robust stability guarantees are highly desirable, they are also very challenging to incorporate in RL algorithms. The main goal of our work is to formulate this incorporation into robust constrained-MDPs (RCMDPs), and derive the corresponding equations necessary to solve problems on RCMDPs.

Constrained Markov decision processes (CMDPs) can be seen as an extension of MDPs with expected cumulative cost constraints, e.g., (Altman2004). For such CMDPs, several solution methods have been proposed, e.g., linear programming-based solutions (Altman2004), surrogate-based methods (CYA16; Dalal2018), and Lagrangian methods (Geibel2005; Altman2004). We refer to these CMDPs as non-robust since they do not take uncertainties in the state transition probability into account, which is an important factor in real-life applications.

On the other hand, robustification of MDPs w.r.t. model uncertainties can be found in the context of robust MDPs (RMDPs), which generalize MDPs to the case where transition probabilities and/or rewards are not perfectly known, e.g., (Nilim2004; Wiesemann2013). These RMDPs can be formulated and solved using so-called ambiguity or uncertainty sets, e.g., (petrik2019; petrik2012; petrik2016). However, one noticeable point in all the RMDP-based RL algorithms is the fact that they do not consider any safety constraints, i.e., expected cumulative cost constraints.

These safety constraints are important in real-life applications, where one cannot afford to risk violating some given constraints, e.g., in autonomous cars, there are hard safety constraints on the robot velocities and steering angles. Besides, in real applications, to mitigate the sample inefficiency of model-free RL algorithms, training often occurs in a simulated environment. The result is then transferred to the real world, typically followed by fine-tuning, a process referred to as Sim2Real. The simulation model is by definition uncertain with respect to the real world, due to approximations and lack of system identification. Domain randomization (vanBaar2019may) and meta-learning (pmlr-v78-finn17a), aimed at addressing model uncertainty in transfer, offer no guarantees. Furthermore, for safety critical applications, a policy trained in simulation should offer certain guarantees on safety when transferred to the real world.

In light of these practical motivations, we propose to merge the two concepts of CMDPs and RMDPs to ensure both safety and robustness. In this RCMDP concept, we propose to robustify both the performance cost minimization and the safety constraints (via cumulative constraint costs). Indeed, robustness is equally (if not more) important in estimating the cumulative constraint costs along the whole trajectory, in order to certify that there will be no unexpected violations if the system is deployed in reality. That is, if deployed, the worst-case cumulative constraint cost will not exceed a pre-determined safety budget.

The contribution of this paper is four-fold: 1) building on the concepts of CMDP and RMDP, we formulate the concept of RCMDP, bridging the gap between constraints and robustness w.r.t. state transition probability uncertainties; 2) we propose a robust soft-constrained Lagrange-based solution of the RCMDP problem; 3) we derive the associated gradient update rules and present a policy gradient algorithm; 4) we illustrate the performance of the proposed algorithm on the inventory management problem under model uncertainties.

The paper is organized as follows: Section 2 describes the formulation of our Robust-CMDP problem and the objective we seek to optimize. A Lagrange based approach is presented in Section 3 along with required gradient update rules and a robust constrained policy gradient algorithm. We evaluate our algorithm in Section 4 and draw the concluding remarks in Section 5.

## 2 Problem Formulation

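We consider a robust-MDP model with a finite number of states $\mathcal{S} = \{1, \ldots, S\}$ and a finite number of actions $\mathcal{A} = \{1, \ldots, A\}$. Every action $a \in \mathcal{A}$ is available for the decision maker to take in every state $s \in \mathcal{S}$. After taking an action $a \in \mathcal{A}$ in state $s \in \mathcal{S}$, the decision maker receives a cost $c(s,a) \in \mathbb{R}$ and transitions to a next state $s'$ according to the *true* but *unknown* transition probability $p^\star_{s,a} \in \Delta^S$, where $\Delta^S$ denotes the probability simplex over $\mathcal{S}$. The initial state is distributed according to $p_0$. We further incorporate a constrained-MDP (Altman2004) setup into this robust-MDP model by introducing a constraint cost $d(s,a) \in \mathbb{R}$ and an associated constraint budget $d_0$.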

An ambiguity set $\mathcal{P}_{s,a}$, defined for each state $s \in \mathcal{S}$ and action $a \in \mathcal{A}$, is a set of feasible transition probabilities quantifying the uncertainty in the transition model. In this paper, we restrict our attention to rectangular ambiguity sets, which simply assume independence between different state-action pairs (LeTallec2007; Wiesemann2013).

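In this paper, we use $L_1$-norm bounded ambiguity sets around the nominal transition probability $\bar{p}_{s,a}$, estimated on some dataset $\mathcal{D}$, as:

$$\mathcal{P}_{s,a} = \left\{ p \in \Delta^S : \lVert p - \bar{p}_{s,a} \rVert_1 \le \psi_{s,a} \right\},$$

where $\Delta^S$ is the probability simplex over the $S$ states and $\psi_{s,a} \ge 0$ is the budget of allowed deviations. This budget can be computed using the Hoeffding bound as (petrik2019beyond): $\psi_{s,a} = \sqrt{\frac{2}{n_{s,a}} \log \frac{S A 2^{S}}{\delta}}$, where $n_{s,a}$ is the number of transitions in dataset $\mathcal{D}$ originating from state $s$ and action $a$, and $\delta$ is the confidence level. Note that this is just one specific choice for the ambiguity set. Our method can be extended to any other type of ambiguity set (e.g., sets based on other norms, Bayesian sets, weighted sets, etc.). We use $\mathcal{P}$ to refer cumulatively to $\mathcal{P}_{s,a}$ for all states $s \in \mathcal{S}$ and actions $a \in \mathcal{A}$.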
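The Hoeffding-style budget can be computed directly from transition counts. The following is a minimal sketch, assuming the $L_1$ form $\psi_{s,a} = \sqrt{(2/n_{s,a}) \log(S A 2^S / \delta)}$; the function name `l1_budget` and the cap at 2 are illustrative choices rather than part of the original method. The cap does not change the ambiguity set, because two probability distributions are never more than 2 apart in $L_1$ norm.

```python
import math


def l1_budget(n_sa: int, num_states: int, num_actions: int, delta: float) -> float:
    """Hoeffding-style L1 budget psi_{s,a} for one state-action pair.

    Assumed form: sqrt((2 / n_sa) * log(S * A * 2^S / delta)).
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if n_sa <= 0:
        # No data for this pair: the L1 ball covers the whole simplex.
        return 2.0
    # log(S * A * 2^S / delta), written so that 2^S is never formed explicitly.
    log_term = math.log(num_states * num_actions / delta) + num_states * math.log(2.0)
    # Cap at 2, the largest possible L1 distance between two distributions.
    return min(2.0, math.sqrt(2.0 / n_sa * log_term))
```

The budget shrinks at rate $1/\sqrt{n_{s,a}}$, so state-action pairs with few observed transitions receive large ambiguity sets and well-sampled pairs receive small ones.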

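A stationary randomized policy $\pi(\cdot \mid s)$ for state $s \in \mathcal{S}$ defines a probability distribution over actions $a \in \mathcal{A}$, and $\Pi$ represents the set of all stationary randomized policies. We parameterize the randomized policy for state $s$ as $\pi_\theta(\cdot \mid s)$, where $\theta$ is a $k$-dimensional parameter vector. Let $\xi = \{s_0, a_0, \ldots, s_{T-1}, a_{T-1}, s_T\}$ be a sampled trajectory generated by executing a policy $\pi_\theta$ from a starting state $s_0 \sim p_0$. Then the probability of sampling $\xi$ is: $p^\theta(\xi) = p_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$. The total cost for trajectory $\xi$ is: $g(\xi) = \sum_{t=0}^{T-1} \gamma^t c(s_t, a_t)$, where $\gamma$ is the discount factor (Puterman2005). The value function is defined as the expected return: $v^{\pi_\theta}(s) = \mathbb{E}_{\xi \sim p^\theta}\left[ g(\xi) \mid s_0 = s \right]$. We define the robust value function as the expected return in the worst-case realization of the transition probability within $\mathcal{P}$ as: $\hat{v}^{\pi_\theta}(s) = \sup_{p \in \mathcal{P}} \mathbb{E}_{\xi \sim p^\theta}\left[ g(\xi) \mid s_0 = s \right]$. Similarly, the total constraint cost for trajectory $\xi$ is: $h(\xi) = \sum_{t=0}^{T-1} \gamma^t d(s_t, a_t)$. And the robust constraint value function is defined as: $\hat{u}^{\pi_\theta}(s) = \sup_{p \in \mathcal{P}} \mathbb{E}_{\xi \sim p^\theta}\left[ h(\xi) \mid s_0 = s \right]$.

The robust Bellman operator $\hat{T}_{\mathcal{P}}$ for a state $s$ and an ambiguity set $\mathcal{P}$ computes the best action with respect to the worst-case realization of the transition probabilities in $\mathcal{P}$ as:

$$(\hat{T}_{\mathcal{P}} v)(s) = \min_{a \in \mathcal{A}} \max_{p \in \mathcal{P}_{s,a}} \left( c(s,a) + \gamma\, p^\top v \right).$$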

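The optimal robust value function $\hat{v}^\star$ and the robust value function $\hat{v}^{\pi}$ for a policy $\pi$ are unique and satisfy $\hat{v}^\star = \hat{T}_{\mathcal{P}} \hat{v}^\star$ and $\hat{v}^{\pi} = \hat{T}^{\pi}_{\mathcal{P}} \hat{v}^{\pi}$ (Iyengar2005), where $\hat{T}^{\pi}_{\mathcal{P}}$ is the robust Bellman operator evaluated for the fixed policy $\pi$ instead of the minimizing action. Similarly, all these properties hold for the robust constraint value function as well.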
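For an $L_1$-norm ball, the inner worst-case problem in the robust Bellman operator has a well-known greedy solution: move up to half of the budget of probability mass onto the highest-value next state, taking it from the lowest-value next states first. The sketch below assumes a cost-based formulation (worst case means highest expected cost) and uses hypothetical names (`worst_case_l1`, `robust_bellman_backup`) and nested Python lists for the model; it is an illustration under these assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence


def worst_case_l1(v: Sequence[float], p_bar: Sequence[float], psi: float) -> List[float]:
    """Return p maximizing p.v subject to ||p - p_bar||_1 <= psi, p in the simplex."""
    p = list(p_bar)
    i_max = max(range(len(v)), key=lambda i: v[i])
    # At most psi/2 can be added to one entry; the other psi/2 is removed elsewhere.
    eps = min(psi / 2.0, 1.0 - p[i_max])
    p[i_max] += eps
    # Remove the same amount of mass, starting from the lowest-value states.
    for i in sorted(range(len(v)), key=lambda i: v[i]):
        if eps <= 0.0:
            break
        if i == i_max:
            continue
        take = min(eps, p[i])
        p[i] -= take
        eps -= take
    return p


def robust_bellman_backup(s, v, cost, p_bar, psi, gamma):
    """(T v)(s) = min_a max_{p in P_{s,a}} ( c(s,a) + gamma * p.v ).

    cost[s][a] and psi[s][a] are floats; p_bar[s][a] is a list over next states.
    """
    best = math.inf
    for a in range(len(cost[s])):
        p = worst_case_l1(v, p_bar[s][a], psi[s][a])
        q = cost[s][a] + gamma * sum(pi * vi for pi, vi in zip(p, v))
        best = min(best, q)
    return best
```

With `psi` set to zero everywhere, the backup reduces to the ordinary (non-robust) Bellman backup on the nominal model.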

#### Objective

Our objective is then to solve the RCMDP optimization problem below:

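$$\min_{\pi_\theta \in \Pi} \; \hat{v}^{\pi_\theta}(p_0) \quad \text{s.t.} \quad \hat{u}^{\pi_\theta}(p_0) \le d_0 \tag{1}$$

Here $\hat{v}^{\pi_\theta}(p_0)$ and $\hat{u}^{\pi_\theta}(p_0)$ denote the robust value and the robust constraint value when the initial state is drawn from $p_0$. This objective resembles the objective of a CMDP, but with additional robustness integrated by the quantification of the uncertainty in the model.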

## 3 Robust Constrained Optimization

A general approach for solving Equation 1 is to apply the Lagrange relaxation procedure (Chapter 3 of Bertsekas2003), which turns it into an unconstrained optimization problem:

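$$\min_{\theta} \max_{\lambda \ge 0} \; L(\theta, \lambda) = \hat{v}^{\pi_\theta}(p_0) + \lambda \left( \hat{u}^{\pi_\theta}(p_0) - d_0 \right) \tag{2}$$

where $\lambda$ is known as the *Lagrange multiplier*. The goal is then to find a saddle point $(\theta^\star, \lambda^\star)$ that satisfies $L(\theta, \lambda^\star) \ge L(\theta^\star, \lambda^\star) \ge L(\theta^\star, \lambda)$, $\forall \theta, \forall \lambda \ge 0$. This is achieved by descending in $\theta$ and ascending in $\lambda$ using the gradients of $L(\theta, \lambda)$.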
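The descent-ascent idea can be seen on a one-dimensional toy problem that is unrelated to MDPs: minimize $\theta^2$ subject to $1 - \theta \le 0$. Its Lagrangian is $\theta^2 + \lambda(1 - \theta)$, with saddle point $\theta^\star = 1$, $\lambda^\star = 2$. The sketch below is purely illustrative; the function name and step sizes are assumptions chosen so that the multiplier moves more slowly than the primal variable.

```python
def primal_dual_toy(steps: int = 5000, lr_theta: float = 0.1, lr_lambda: float = 0.05):
    """Gradient descent in theta and projected gradient ascent in lambda.

    Toy problem: min theta^2  s.t.  1 - theta <= 0,
    Lagrangian:  L(theta, lam) = theta^2 + lam * (1 - theta).
    """
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        grad_theta = 2.0 * theta - lam   # dL/dtheta
        grad_lam = 1.0 - theta           # dL/dlambda (constraint violation)
        theta -= lr_theta * grad_theta   # descend in theta
        lam = max(0.0, lam + lr_lambda * grad_lam)  # ascend in lambda, keep lambda >= 0
    return theta, lam
```

The multiplier grows while the constraint is violated and stops growing once the constraint holds with equality, which is the same mechanism the Lagrangian relaxation relies on here.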

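Without loss of generality, we rewrite Equation 2 for a fixed starting state $s_0$ and perform some algebraic manipulation:

$$L(\theta, \lambda) = \hat{v}^{\pi_\theta}(s_0) + \lambda \left( \hat{u}^{\pi_\theta}(s_0) - d_0 \right) = \sum_{\xi} \hat{p}^\theta(\xi)\, g(\xi) + \lambda \left( \sum_{\xi} \tilde{p}^\theta(\xi)\, h(\xi) - d_0 \right).$$

Here the last equality follows with $\hat{p}^\theta = \arg\sup_{p \in \mathcal{P}} \mathbb{E}_{\xi \sim p^\theta}\left[ g(\xi) \right]$ and $\tilde{p}^\theta = \arg\sup_{p \in \mathcal{P}} \mathbb{E}_{\xi \sim p^\theta}\left[ h(\xi) \right]$, the worst-case trajectory distributions for the cost and the constraint cost, respectively.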

### 3.1 Gradient Update Rules

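We now derive the gradient update rules with respect to $\theta$ as below:

$$\nabla_\theta L(\theta, \lambda) = \sum_{\xi} \nabla_\theta \hat{p}^\theta(\xi)\, g(\xi) + \lambda \sum_{\xi} \nabla_\theta \tilde{p}^\theta(\xi)\, h(\xi) = \mathbb{E}_{\xi \sim \hat{p}^\theta}\left[ g(\xi)\, \nabla_\theta \log \hat{p}^\theta(\xi) \right] + \lambda\, \mathbb{E}_{\xi \sim \tilde{p}^\theta}\left[ h(\xi)\, \nabla_\theta \log \tilde{p}^\theta(\xi) \right],$$

where, for a fixed transition model, only the policy terms depend on $\theta$, so that $\nabla_\theta \log p^\theta(\xi) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.

Notice that the constraint budget $d_0$ does not play any role in the policy optimization. Also, the expectations over the cost and constraint cost are with respect to $\hat{p}^\theta$ and $\tilde{p}^\theta$, the worst-case trajectory distributions for the cost and the constraint cost, respectively. However, the costs and constraint costs are coupled together in reality, meaning that the two trajectories would not diverge. So one of $\hat{p}^\theta$ or $\tilde{p}^\theta$ can be chosen depending on the priorities toward robustness of cost or constraint cost, and both of the expectations can be evaluated with that common probability measure $\check{p}^\theta$. The gradient update rule then becomes:

$$\nabla_\theta L(\theta, \lambda) = \mathbb{E}_{\xi \sim \check{p}^\theta}\left[ \left( g(\xi) + \lambda\, h(\xi) \right) \nabla_\theta \log \check{p}^\theta(\xi) \right].$$

The gradient update rule with respect to $\lambda$ is:

$$\nabla_\lambda L(\theta, \lambda) = \sum_{\xi} \check{p}^\theta(\xi)\, h(\xi) - d_0 = \mathbb{E}_{\xi \sim \check{p}^\theta}\left[ h(\xi) \right] - d_0.$$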
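Both gradients can be estimated from sampled trajectories with the usual likelihood-ratio (score-function) estimator. The sketch below assumes a tabular softmax policy with preferences `theta[s][a]`, and assumes the trajectories were already sampled under the chosen common worst-case model; each trajectory is a list of `(state, action, cost, constraint_cost)` tuples. The names `softmax` and `lagrangian_gradients` are hypothetical.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(x - m) for x in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def lagrangian_gradients(trajectories, theta, lam, d0, gamma):
    """Monte-Carlo estimates of grad_theta L and grad_lambda L.

    grad_theta L ~ mean over trajectories of (g + lam * h) * sum_t grad log pi(a_t | s_t)
    grad_lambda L ~ mean over trajectories of h, minus the budget d0
    """
    n_states, n_actions = len(theta), len(theta[0])
    grad_theta = [[0.0] * n_actions for _ in range(n_states)]
    mean_h = 0.0
    n = len(trajectories)
    for traj in trajectories:
        g = sum(gamma ** t * c for t, (_, _, c, _) in enumerate(traj))  # total cost
        h = sum(gamma ** t * d for t, (_, _, _, d) in enumerate(traj))  # total constraint cost
        weight = (g + lam * h) / n
        for s, a, _, _ in traj:
            pi_s = softmax(theta[s])
            for b in range(n_actions):
                # Softmax score: d/dtheta[s][b] log pi(a|s) = 1[a == b] - pi(b|s)
                grad_theta[s][b] += weight * ((1.0 if a == b else 0.0) - pi_s[b])
        mean_h += h / n
    return grad_theta, mean_h - d0
```

As in the derivation, the budget `d0` enters only the multiplier gradient and never the policy gradient.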

### 3.2 Policy Gradient Algorithm

Algorithm 1 presents a robust constrained policy gradient algorithm based on the gradient update rules derived above in Section 3.1. This algorithm proceeds in an episodic way and updates the parameters based on the Monte-Carlo estimates of $\nabla_\theta L$ and $\nabla_\lambda L$. Line 9 of the algorithm requires the nominal transition probability $\bar{p}$ and the ambiguity set $\mathcal{P}$, which can be some parameterized estimates. The step size schedules satisfy the standard conditions for stochastic approximation algorithms, ensuring that the $\theta$ update is on the fastest time-scale and the $\lambda$ update is on a slower time-scale. This results in a two time-scale stochastic approximation algorithm, and its convergence to a (local) saddle point can be shown following standard proof techniques (Borkar2009).
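One common way to obtain two time-scales is to use polynomially decaying step sizes with different exponents. The sketch below is an assumption about how such schedules could look, not the schedules used in the paper: both sequences sum to infinity and are square-summable, and the ratio of the $\lambda$ step to the $\theta$ step tends to zero, so $\lambda$ changes more slowly. The names `theta_step`, `lambda_step`, and `rcpg_update` are hypothetical.

```python
def theta_step(k: int, power: float = 0.6) -> float:
    """Faster time-scale: decays like k^(-0.6) by default."""
    return 1.0 / (k + 1) ** power


def lambda_step(k: int) -> float:
    """Slower time-scale: decays like 1/k, so lambda_step / theta_step -> 0."""
    return 1.0 / (k + 1)


def rcpg_update(theta, lam, grad_theta, grad_lam, k):
    """One descent step in theta and one projected ascent step in lambda."""
    new_theta = [
        [th - theta_step(k) * g for th, g in zip(row, grad_row)]
        for row, grad_row in zip(theta, grad_theta)
    ]
    # Projection onto lambda >= 0 keeps the multiplier feasible.
    new_lam = max(0.0, lam + lambda_step(k) * grad_lam)
    return new_theta, new_lam
```

In an episodic loop, `grad_theta` and `grad_lam` would come from Monte-Carlo estimates over the trajectories collected in episode `k`.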

## 4 Empirical Study

In this section, we empirically study the performance of our policy gradient algorithm on several problem domains. All the experiments are run with confidence parameter $\delta$, discount factor $\gamma$, and number of samples $n_{s,a}$ drawn for each state-action pair from the underlying true transition distribution $p^\star_{s,a}$. For domains emitting reward signals instead of costs, we simply consider their negative magnitude, keeping our cost-based formulation intact. We implement several combinations of different settings: i) non-robust unconstrained, ii) robust-unconstrained and iii) robust-constrained. We note that the non-robust unconstrained setting is the general policy gradient algorithm (sutton1998). The robust-unconstrained version deals only with robustness, without any constraint cost or associated budget. The robust-constrained version deals with both constraints and robustness, as described in Algorithm 1.

#### Inventory Management

We evaluate the policy gradient method on the classic inventory management problem (Behzadian2019; Puterman2005; Zipkin200). The state space is discrete and is represented by the level of inventory. The goal is to order products from a supplier in order to meet demands. Demand for a product is random and comes from a normal distribution.
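To make the domain concrete, the sketch below simulates one step of a small inventory problem. Every number and name in it is a hypothetical placeholder (capacity, prices, demand mean and standard deviation), and the constraint cost, taken here to be unmet demand, is an assumption made only for illustration; the text does not specify these details.

```python
import random


def inventory_step(level, order, rng, capacity=10, demand_mean=4.0, demand_std=1.5,
                   purchase_cost=1.0, holding_cost=0.1, sale_price=2.0):
    """Simulate one period; returns (next_level, cost, constraint_cost).

    All parameters are illustrative placeholders, not values from the paper.
    """
    order = max(0, min(order, capacity - level))          # cannot exceed storage capacity
    demand = max(0, round(rng.gauss(demand_mean, demand_std)))  # discretized normal demand
    stocked = level + order
    sold = min(stocked, demand)
    next_level = stocked - sold
    # Cost is negative profit: purchases and holding minus sales revenue.
    cost = purchase_cost * order + holding_cost * next_level - sale_price * sold
    # Hypothetical constraint cost: demand that could not be served.
    constraint_cost = float(demand - sold)
    return next_level, cost, constraint_cost
```

A call such as `inventory_step(2, 5, random.Random(0))` returns the next inventory level together with the cost and constraint cost incurred in that period.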

Figure 1 shows the return distributions for the inventory management problem. Here the non-robust unconstrained variant has slightly higher expected return compared to the robust counterparts. This is expected because the robust version particularly deals with the worst-case situation. However, the unconstrained variant is not guaranteed to enforce any constraint. The plot also shows that the robust-constrained version outperforms the robust-unconstrained variant with a higher expected return.

## 5 Conclusion

In this paper, we studied the problem of MDPs under constraints and model uncertainties. We proposed to merge the concepts of constrained MDPs and robust MDPs, leading to the concept of robust constrained MDPs (RCMDPs). By doing so, one can take advantage of the safety guarantees given by the CMDP formulation, as well as the robustness guarantees w.r.t. model uncertainties given by the RMDP formulation. We then proposed a robust soft-constrained Lagrange-based solution to the RCMDP problem, and a corresponding policy gradient algorithm. Future work will focus on extending the proposed approach to continuous domains, and on validating the performance of this RCMDP formulation on more safety critical examples, e.g., robotics test-beds.
