Robust Reinforcement Learning for Continuous Control with Model Misspecification

By Daniel J. Mankowitz, et al. (06/18/2019)

We provide a framework for incorporating robustness -- to perturbations in the transition dynamics, which we refer to as model misspecification -- into continuous control Reinforcement Learning (RL) algorithms. We specifically focus on incorporating robustness into a state-of-the-art continuous control RL algorithm called Maximum a-posteriori Policy Optimization (MPO). We achieve this by learning a policy that optimizes for a worst-case, entropy-regularized, expected return objective and derive a corresponding robust entropy-regularized Bellman contraction operator. In addition, we introduce a less conservative, soft-robust, entropy-regularized objective with a corresponding Bellman operator. We show that both robust and soft-robust policies outperform their non-robust counterparts in nine Mujoco domains with environment perturbations. Finally, we present multiple investigative experiments that provide deeper insight into the robustness framework, including an adaptation to another continuous control RL algorithm as well as a comparison of this approach to domain randomization. Performance videos can be found online at https://sites.google.com/view/robust-rl.


1 Introduction

Reinforcement Learning (RL) algorithms typically learn a policy that optimizes for the expected return (Sutton and Barto, 1998). That is, the policy aims to maximize the sum of future expected rewards that an agent accumulates in a particular task. This approach has yielded impressive results in recent years, including playing computer games with super-human performance (Mnih et al., 2015; Tessler et al., 2016), multi-task RL (Rusu et al., 2016; Devin et al., 2017; Teh et al., 2017; Mankowitz et al., 2018b; Riedmiller et al., 2018) as well as solving complex continuous control robotic tasks (Duan et al., 2016; Abdolmaleki et al., 2018b; Kalashnikov et al., 2018; Haarnoja et al., 2018).

The current crop of RL agents is typically trained in a single environment (usually a simulator). As a consequence, an issue faced by many of these agents is the sensitivity of the agent's policy to environment perturbations. Perturbing the dynamics of the environment at test time, which may include executing the policy in a real-world setting, can have a significant negative impact on the performance of the agent (Andrychowicz et al., 2018; Peng et al., 2018; Derman et al., 2018; Di Castro et al., 2012; Mankowitz et al., 2018a). This is because the training environment is not necessarily a good model of the perturbations that an agent may actually face, leading to potentially unwanted, sub-optimal behaviour. There are many types of environment perturbations, including changing lighting/weather conditions, sensor noise, actuator noise and action delays.

It is desirable to train agents that are agnostic to environment perturbations. This is especially crucial in the Sim2Real setting (Andrychowicz et al., 2018; Peng et al., 2018; Wulfmeier et al., 2017; Rastogi et al., 2018; Christiano et al., 2016) where a policy is trained in a simulator and then executed on a real-world domain. As an example, consider a robotic arm that executes a control policy to perform a specific task in a factory. If, for some reason, the arm needs to be replaced and the specifications do not exactly match, then the control policy still needs to be able to perform the task with the 'perturbed' robotic arm dynamics. In addition, a robust policy can help cope with noise-induced perturbations such as sensor noise from malfunctioning sensors and actuator noise.

Model misspecification: For the purpose of this paper, we use the term model misspecification to refer to the scenario in which an agent is trained in one environment but performs poorly in a different, perturbed version of the environment (as in the above examples). By incorporating robustness into our agents, we correct for this misspecification, yielding improved performance in the perturbed environment(s).

In this paper, we propose a framework for incorporating robustness into continuous control RL algorithms. We specifically focus on robustness to model misspecification in the transition dynamics. For the remainder of the paper, when we mention robustness, we refer to this particular form of robustness. Our main contributions are as follows:

(1) We provide a generalized framework for incorporating robustness to model misspecification into continuous control RL algorithms, specifically algorithms that learn a value function (e.g., a critic) or perform policy evaluation. As a proof-of-concept, we incorporate robustness into a state-of-the-art continuous control RL algorithm called Maximum a-posteriori Policy Optimization (MPO) (Abdolmaleki et al., 2018b) to yield Robust MPO (R-MPO). We also carry out an experiment in which we incorporate robustness into another continuous control RL algorithm, Stochastic Value Gradients (SVG) (Heess et al., 2015a).

(2) Entropy regularization encourages exploration and helps prevent early convergence to sub-optimal policies (Nachum et al., 2017). To incorporate these advantages, we extend MPO to optimize for an entropy-regularized return objective (E-MPO).

(3) We extend the Robust Bellman operator (Iyengar, 2005) to robust and soft-robust entropy-regularized versions respectively and show that these operators are contraction mappings and yield a well-known value-iteration bound with respect to the max norm.

(4) We use these results to extend E-MPO to Robust Entropy-regularized MPO (RE-MPO) and Soft RE-MPO (SRE-MPO) and show that they perform at least as well as R-MPO and in some cases significantly better.

(5) We present experimental results in nine Mujoco domains showing that R-MPO, RE-MPO and SRE-MPO outperform both MPO and E-MPO respectively.

(6) We present multiple investigative experiments to better understand the robustness framework, including results indicating that robustness outperforms domain randomization.

2 Background

A Markov Decision Process (MDP) is defined as the tuple $\langle S, A, r, \gamma, P \rangle$ where $S$ is the state space, $A$ the action space, $r: S \times A \to [r_{\min}, r_{\max}]$ is a bounded reward function; $\gamma \in [0, 1)$ is the discount factor and $P: S \times A \to \Delta_S$ maps state-action pairs to a probability distribution over next states. We use $\Delta_S$ to denote the simplex over next states. The goal of a Reinforcement Learning agent for the purpose of control is to learn a policy $\pi: S \times A \to [0,1]$, which maps a state and action to a probability of executing the action from the given state, so as to maximize the expected return $J(\pi) = \mathbb{E}^{\pi}\big[\sum_{t=0}^{\infty}\gamma^{t} r_t\big]$, where $r_t$ is a random variable representing the reward received at time $t$ (Sutton and Barto, 2018). The value function is defined as $V^{\pi}(s) = \mathbb{E}^{\pi}\big[\sum_{t=0}^{\infty}\gamma^{t} r_t \mid s_0 = s\big]$ and the action value function as $Q^{\pi}(s,a) = \mathbb{E}^{\pi}\big[\sum_{t=0}^{\infty}\gamma^{t} r_t \mid s_0 = s, a_0 = a\big]$.

Entropy-regularized Reinforcement Learning: Entropy regularization encourages exploration and helps prevent early convergence to sub-optimal policies (Nachum et al., 2017). We make use of the relative entropy-regularized RL objective defined as $J_{\bar\pi}(\pi) = \mathbb{E}^{\pi}\big[\sum_{t=0}^{\infty}\gamma^{t}\big(r_t - \tau\,\mathrm{KL}\big(\pi(\cdot\mid s_t)\,\|\,\bar\pi(\cdot\mid s_t)\big)\big)\big]$, where $\tau$ is a temperature parameter and $\mathrm{KL}\big(\pi(\cdot\mid s)\,\|\,\bar\pi(\cdot\mid s)\big)$ is the Kullback-Leibler (KL) divergence between the current policy $\pi$ and a reference policy $\bar\pi$ given a state $s$ (Schulman et al., 2017). The entropy-regularized value function is defined as $V^{\pi}_{\bar\pi}(s) = \mathbb{E}^{\pi}\big[\sum_{t=0}^{\infty}\gamma^{t}\big(r_t - \tau\,\mathrm{KL}\big(\pi(\cdot\mid s_t)\,\|\,\bar\pi(\cdot\mid s_t)\big)\big) \mid s_0 = s\big]$. Intuitively, augmenting the rewards with the KL term regularizes the policy by forcing it to be 'close' in some sense to the reference policy.
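To make the relative-entropy penalty concrete, the following minimal Python sketch (illustrative only, not taken from the paper; the names pi_s, pi_ref_s and tau are ours) computes the KL-penalized reward for a discrete action space.

```python
# Minimal sketch: the per-step KL-regularized reward
# r(s,a) - tau * KL(pi(.|s) || pi_ref(.|s)) for a discrete action space.
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as 1-D arrays."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def regularized_reward(reward, pi_s, pi_ref_s, tau):
    """Reward augmented with the relative-entropy penalty at a single state."""
    return reward - tau * kl_divergence(pi_s, pi_ref_s)

# Example: a policy that has drifted away from the reference policy pays a penalty.
pi_s     = np.array([0.7, 0.2, 0.1])   # current policy at state s
pi_ref_s = np.array([0.4, 0.3, 0.3])   # reference policy at state s
print(regularized_reward(reward=1.0, pi_s=pi_s, pi_ref_s=pi_ref_s, tau=0.1))
```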

A Robust MDP (R-MDP) is defined as a tuple $\langle S, A, r, \gamma, \mathcal{P} \rangle$ where $S$, $A$, $r$ and $\gamma$ are defined as above; $\mathcal{P}: S \times A \to 2^{\Delta_S}$ is an uncertainty set, where $\Delta_S$ is the set of probability measures over next states $s' \in S$. This is interpreted as an agent selecting a state and action pair, and the next state being determined by a conditional measure $p(\cdot\mid s,a) \in \mathcal{P}(s,a)$ (Iyengar, 2005). A robust policy optimizes for the worst-case expected return objective: $J_{\text{robust}}(\pi) = \inf_{p\in\mathcal{P}}\mathbb{E}^{\pi}_{p}\big[\sum_{t=0}^{\infty}\gamma^{t} r_t\big]$.

The robust value function is defined as $V^{\pi}_{\text{robust}}(s) = \inf_{p\in\mathcal{P}}\mathbb{E}^{\pi}_{p}\big[\sum_{t=0}^{\infty}\gamma^{t} r_t \mid s_0 = s\big]$ and the robust action value function as $Q^{\pi}_{\text{robust}}(s,a) = \inf_{p\in\mathcal{P}}\mathbb{E}^{\pi}_{p}\big[\sum_{t=0}^{\infty}\gamma^{t} r_t \mid s_0 = s, a_0 = a\big]$. Both the robust Bellman operator for a fixed policy and the optimal robust Bellman operator have previously been shown to be contractions (Iyengar, 2005). A rectangularity assumption on the uncertainty set (Iyengar, 2005) ensures that "nature" can choose a worst-case transition function independently for every state $s$ and action $a$.
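The following tabular sketch (an illustration under our own assumptions, not the paper's implementation) applies one robust Bellman backup for a fixed policy over a finite uncertainty set; rectangularity is reflected in the fact that the worst transition model is chosen independently for every state-action pair.

```python
# Illustrative sketch: a tabular robust Bellman backup over a finite uncertainty set.
import numpy as np

def robust_backup(V, models, rewards, pi, gamma=0.9):
    """One application of the robust Bellman operator for a fixed policy.

    V:       (S,) current value estimates
    models:  list of (S, A, S) transition tensors, the uncertainty set P(s, a)
    rewards: (S, A) reward table
    pi:      (S, A) policy probabilities
    """
    S, A = rewards.shape
    V_new = np.zeros(S)
    for s in range(S):
        q = np.zeros(A)
        for a in range(A):
            # Worst-case expected next value over the uncertainty set for this (s, a).
            worst_next = min(P[s, a] @ V for P in models)
            q[a] = rewards[s, a] + gamma * worst_next
        V_new[s] = pi[s] @ q
    return V_new

# Tiny example with 2 states, 2 actions and two candidate transition models.
P1 = np.array([[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]])
P2 = np.array([[[0.6, 0.4], [0.7, 0.3]], [[0.3, 0.7], [0.4, 0.6]]])
R  = np.array([[1.0, 0.0], [0.0, 1.0]])
pi = np.full((2, 2), 0.5)
print(robust_backup(np.zeros(2), [P1, P2], R, pi))
```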

Maximum A-Posteriori Policy Optimization (MPO) (Abdolmaleki et al., 2018a, b) is a continuous control RL algorithm that performs an expectation-maximization form of policy iteration, alternating between two steps: policy evaluation and policy improvement. The policy evaluation step receives as input a policy $\pi_k$ and evaluates an action-value function $Q_{\theta}$ by minimizing the squared TD error:

$$\min_{\theta}\Big(r_t + \gamma\,\mathbb{E}_{s_{t+1}\sim p(\cdot\mid s_t,a_t)}\,\mathbb{E}_{a'\sim\pi_k(\cdot\mid s_{t+1})}\big[Q_{\theta'}(s_{t+1},a')\big] - Q_{\theta}(s_t,a_t)\Big)^2,$$

where $\theta'$ denotes the parameters of a target network (Mnih et al., 2015) that are periodically updated from $\theta$. In practice we use a replay buffer of samples in order to perform the policy evaluation step. The second step is policy improvement, which consists of optimizing the objective $J(s,\pi) = \mathbb{E}_{a\sim\pi(\cdot\mid s)}\big[Q_{\theta}(s,a)\big]$ for states $s$ drawn from a state distribution $\mu(s)$. In practice the state distribution samples are drawn from an experience replay. By improving $J(s,\pi)$ in all states $s$, we improve our objective. To do so, a two-step procedure is performed.

First, we construct a non-parametric estimate $q(a\mid s)$ such that $\mathbb{E}_{a\sim q(\cdot\mid s)}\big[Q_{\theta}(s,a)\big] \ge \mathbb{E}_{a\sim\pi_k(\cdot\mid s)}\big[Q_{\theta}(s,a)\big]$. This is done by maximizing $\mathbb{E}_{a\sim q(\cdot\mid s)}\big[Q_{\theta}(s,a)\big]$ while ensuring that the solution, locally, stays close to the current policy $\pi_k$; i.e. $\mathbb{E}_{s\sim\mu}\big[\mathrm{KL}\big(q(\cdot\mid s)\,\|\,\pi_k(\cdot\mid s)\big)\big] \le \epsilon$. This optimization has a closed-form solution given as $q(a\mid s) \propto \pi_k(a\mid s)\exp\big(Q_{\theta}(s,a)/\eta\big)$, where $\eta$ is a temperature parameter that can be computed by minimizing a convex dual function (Abdolmaleki et al., 2018b). Second, we project this non-parametric representation back onto a parameterized policy by solving the optimization problem $\pi_{k+1} = \arg\min_{\pi_{\phi}} \mathbb{E}_{s\sim\mu}\big[\mathrm{KL}\big(q(\cdot\mid s)\,\|\,\pi_{\phi}(\cdot\mid s)\big)\big]$, where $\pi_{k+1}$ is the new and improved policy and where one typically employs additional regularization (Abdolmaleki et al., 2018a). Note that this amounts to supervised learning with samples drawn from $q(a\mid s)$; see Abdolmaleki et al. (2018a) for details.
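The two-step procedure can be summarized, for a single state with discrete actions, by the following illustrative sketch (the actual algorithm operates on continuous actions with a parametric Gaussian policy and optimizes the temperature via its convex dual; all names here are ours).

```python
# Sketch of MPO's two-step improvement for a single state with a discrete action set.
import numpy as np

def mpo_e_step(q_values, pi_k, eta):
    """Non-parametric proposal q(a|s) proportional to pi_k(a|s) * exp(Q(s,a) / eta)."""
    logits = np.log(pi_k) + q_values / eta
    logits -= logits.max()                      # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def mpo_m_step(proposal, pi_theta):
    """M-step objective: weighted negative log-likelihood of the parametric policy,
    equivalent (up to a constant) to KL(q(.|s) || pi_theta(.|s))."""
    return -float(np.sum(proposal * np.log(pi_theta)))

q_values = np.array([1.0, 0.5, -0.2])
pi_k     = np.array([0.4, 0.4, 0.2])
q_prop   = mpo_e_step(q_values, pi_k, eta=0.5)
print(q_prop, mpo_m_step(q_prop, pi_k))
```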

3 Robust Entropy-Regularized Bellman Operator

(Relative-)Entropy regularization has been shown to encourage exploration and prevent early convergence to sub-optimal policies (Nachum et al., 2017). To take advantage of this idea when developing a robust RL algorithm, we extend the robust Bellman operator to a robust entropy-regularized Bellman operator and prove that it is a contraction. (Note that while MPO already bounds the per-step relative entropy, we, in addition, want to regularize the action-value function to obtain a robust regularized algorithm.) We also show that well-known value iteration bounds can be attained using this operator. We first define the robust entropy-regularized value function as $V^{\pi}(s) = \inf_{p\in\mathcal{P}}\mathbb{E}^{\pi}_{p}\big[\sum_{t=0}^{\infty}\gamma^{t}\big(r_t - \tau\,\mathrm{KL}\big(\pi(\cdot\mid s_t)\,\|\,\bar\pi(\cdot\mid s_t)\big)\big) \mid s_0 = s\big]$. For the remainder of this section, we drop the sub- and superscripts, as well as the reference-policy conditioning, from the value function and simply write $V(s)$ for brevity. We define the robust entropy-regularized Bellman operator for a fixed policy in Equation 1, and show that it is a max-norm contraction (Theorem 1).

$$\big(T^{\pi}V\big)(s) = \mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[r(s,a) + \gamma\,\inf_{p\in\mathcal{P}(s,a)}\mathbb{E}_{s'\sim p(\cdot\mid s,a)}\big[V(s')\big]\Big] - \tau\,\mathrm{KL}\big(\pi(\cdot\mid s)\,\|\,\bar\pi(\cdot\mid s)\big) \quad (1)$$

Theorem 1.

The robust entropy-regularized Bellman operator $T^{\pi}$ for a fixed policy $\pi$ is a $\gamma$-contraction operator. Specifically, for all value functions $V_1, V_2$ and every state $s$, we have $\big|(T^{\pi}V_1)(s) - (T^{\pi}V_2)(s)\big| \le \gamma\|V_1 - V_2\|_{\infty}$.

The proof can be found in Appendix A (Theorem 1). Using the optimal robust entropy-regularized Bellman operator $T^{*}$, which is shown to also be a contraction operator in Appendix A (Theorem 2), a standard value-iteration error bound can be derived (Appendix A, Corollary 1).
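As an informal numerical illustration of the contraction property (not from the paper; all names are ours), the following self-contained sketch iterates the fixed-policy robust entropy-regularized operator on a random tabular problem; the max-norm distance between successive iterates decays at least geometrically with rate $\gamma$.

```python
# Numerical illustration: iterating the robust entropy-regularized operator
# for a fixed policy on a random tabular problem converges to a fixed point.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, tau = 4, 3, 0.9, 0.1
R = rng.uniform(size=(S, A))
models = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(3)]  # uncertainty set
pi = rng.dirichlet(np.ones(A), size=S)                               # fixed policy
pi_ref = rng.dirichlet(np.ones(A), size=S)                           # reference policy
kl = np.sum(pi * (np.log(pi) - np.log(pi_ref)), axis=1)              # KL(pi || pi_ref) per state

def T(V):
    worst = np.min([P @ V for P in models], axis=0)   # (S, A) worst-case next value
    return np.sum(pi * (R + gamma * worst), axis=1) - tau * kl

V = np.zeros(S)
for k in range(15):
    V_next = T(V)
    print(k, np.max(np.abs(V_next - V)))   # shrinks by at least a factor of gamma per step
    V = V_next
```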

4 Soft-Robust Entropy-Regularized Bellman Operator

In this section, we derive a soft-robust entropy-regularized Bellman operator and show that it is also a $\gamma$-contraction in the max norm. First, we define the average transition model as $\bar p(s'\mid s,a) = \mathbb{E}_{p\sim\omega}\big[p(s'\mid s,a)\big]$, which corresponds to the average transition model under some distribution $\omega$ over the uncertainty set $\mathcal{P}$. This average transition model induces an average stationary distribution (see Derman et al. (2018)). The soft-robust entropy-regularized value function is defined as $V^{\pi}(s) = \mathbb{E}^{\pi}_{\bar p}\big[\sum_{t=0}^{\infty}\gamma^{t}\big(r_t - \tau\,\mathrm{KL}\big(\pi(\cdot\mid s_t)\,\|\,\bar\pi(\cdot\mid s_t)\big)\big) \mid s_0 = s\big]$. Again, for ease of notation, we simply write $V(s)$ for the remainder of the section. The soft-robust entropy-regularized Bellman operator for a fixed policy is defined as:

$$\big(T^{\pi}_{\text{soft}}V\big)(s) = \mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[r(s,a) + \gamma\,\mathbb{E}_{s'\sim\bar p(\cdot\mid s,a)}\big[V(s')\big]\Big] - \tau\,\mathrm{KL}\big(\pi(\cdot\mid s)\,\|\,\bar\pi(\cdot\mid s)\big) \quad (2)$$

which is also a $\gamma$-contraction mapping (see Appendix B, Theorem 3) and yields the same bound as Corollary 1 for the optimal soft-robust Bellman operator derived in Appendix B, Theorem 4.
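A minimal sketch of the corresponding soft-robust backup (again illustrative, with our own variable names): the transition models in the uncertainty set are first averaged under $\omega$, and a standard entropy-regularized backup is then performed under the averaged model.

```python
# Sketch of one soft-robust entropy-regularized backup for a fixed policy.
import numpy as np

def soft_robust_backup(V, models, omega, rewards, pi, kl, gamma=0.9, tau=0.1):
    """models: list of (S, A, S) tensors; omega: weights over models;
    kl: (S,) array of KL(pi(.|s) || pi_ref(.|s)) values."""
    p_bar = sum(w * P for w, P in zip(omega, models))   # average transition model
    next_vals = p_bar @ V                               # (S, A) expected next value
    return np.sum(pi * (rewards + gamma * next_vals), axis=1) - tau * kl
```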

5 Robust Policy Evaluation

To incorporate robustness into MPO, we focus on learning a worst-case value function in the policy evaluation step. Note that this policy evaluation step can be incorporated into any actor-critic algorithm. In particular, instead of optimizing the squared TD error, we optimize the worst-case squared TD error, which is defined as:

$$\min_{\theta}\Big(r_t + \gamma\,\inf_{p\in\mathcal{P}(s_t,a_t)}\mathbb{E}_{s'\sim p(\cdot\mid s_t,a_t)}\,\mathbb{E}_{a'\sim\pi_k(\cdot\mid s')}\big[Q_{\theta'}(s',a')\big] - Q_{\theta}(s_t,a_t)\Big)^2 \quad (3)$$

where $\mathcal{P}(s_t,a_t)$ is the uncertainty set for the current state $s_t$ and action $a_t$; $\pi_k$ is the current network's policy, and $\theta'$ denotes the target network parameters.

Relation to MPO: In MPO, this replaces the current policy evaluation step. The robust Bellman operator (Iyengar, 2005) ensures that this process converges to a unique fixed point for the policy $\pi_k$. Since the proposal policy $q(a\mid s)$ (see Section 2) is proportional to the exponentiated robust action value estimate $Q_{\theta}$, it intuitively yields a robust policy, as the policy is being generated from a worst-case value function. Fitting the policy network to the proposal policy then yields a robust network policy $\pi_{k+1}$.
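A minimal sketch of this worst-case TD error, assuming the uncertainty set is represented by a finite collection of models that each propose a next-state realization for $(s_t, a_t)$ (the function names and handles are ours, not the paper's):

```python
# Sketch of the worst-case squared TD error (Eq. 3) over a finite uncertainty set.
import numpy as np

def worst_case_td_error(r_t, next_states, policy_samples, q_online, q_target,
                        s_t, a_t, gamma=0.99):
    """Squared TD error against the worst-case (minimum) bootstrapped target.

    next_states:    list of next-state realizations, one per model in the uncertainty set
    policy_samples: callable state -> list of actions sampled from the current policy
    """
    targets = []
    for s_next in next_states:
        actions = policy_samples(s_next)
        # Monte-Carlo estimate of E_{a'~pi}[Q_target(s', a')] under this model.
        targets.append(np.mean([q_target(s_next, a) for a in actions]))
    worst_target = r_t + gamma * min(targets)       # infimum over the uncertainty set
    return (worst_target - q_online(s_t, a_t)) ** 2

# Toy usage with scalar states/actions and hypothetical function handles.
q = lambda s, a: -abs(s - a)
print(worst_case_td_error(1.0, [0.2, 0.5, 0.9], lambda s: [s, s + 0.1],
                          q_online=q, q_target=q, s_t=0.0, a_t=0.3))
```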

6 Robust Entropy-Regularized Policy Evaluation

To extend robust policy evaluation to robust entropy-regularized policy evaluation, two key steps need to be performed: (1) optimize for the entropy-regularized expected return, as opposed to the regular expected return, and modify the TD update accordingly; (2) incorporate robustness into the entropy-regularized expected return and modify the entropy-regularized TD update. To achieve (1), we define the entropy-regularized expected return as $J_{\bar\pi}(\pi) = \mathbb{E}^{\pi}\big[\sum_{t=0}^{\infty}\gamma^{t}\big(r_t - \tau\,\mathrm{KL}\big(\pi(\cdot\mid s_t)\,\|\,\bar\pi(\cdot\mid s_t)\big)\big)\big]$, and show in Appendix C that performing policy evaluation with the entropy-regularized value function is equivalent to optimizing the entropy-regularized squared TD error (the same as Eq. (4), only omitting the infimum operator). To achieve (2), we optimize for the robust entropy-regularized expected return objective defined as $J_{\text{robust},\bar\pi}(\pi) = \inf_{p\in\mathcal{P}}\mathbb{E}^{\pi}_{p}\big[\sum_{t=0}^{\infty}\gamma^{t}\big(r_t - \tau\,\mathrm{KL}\big(\pi(\cdot\mid s_t)\,\|\,\bar\pi(\cdot\mid s_t)\big)\big)\big]$, yielding the robust entropy-regularized squared TD error:

$$\min_{\theta}\Big(r_t + \gamma\,\inf_{p\in\mathcal{P}(s_t,a_t)}\mathbb{E}_{s'\sim p(\cdot\mid s_t,a_t)}\Big[\mathbb{E}_{a'\sim\pi_k(\cdot\mid s')}\big[Q_{\theta'}(s',a')\big] - \tau\,\mathrm{KL}\big(\pi_k(\cdot\mid s')\,\|\,\bar\pi(\cdot\mid s')\big)\Big] - Q_{\theta}(s_t,a_t)\Big)^2 \quad (4)$$

where $\bar\pi$ is the reference policy and $\theta'$ again denotes the target network parameters. For the soft-robust setting, we remove the infimum from the TD update and replace the next-state transition function $p$ with the average next-state transition function $\bar p$.
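The entropy-regularized robust and soft-robust targets differ from the sketch above only in the bootstrapped term; a compact illustration (with hypothetical helper callables of our own) follows.

```python
# Sketch of the entropy-regularized robust and soft-robust bootstrapped targets (Eq. 4).
import numpy as np

def entropy_regularized_targets(r_t, next_states, exp_q_next, kl_next, tau,
                                gamma=0.99, weights=None):
    """Return (robust, soft-robust) bootstrapped targets.

    exp_q_next: callable s' -> estimate of E_{a'~pi}[Q_target(s', a')]
    kl_next:    callable s' -> KL(pi(.|s') || pi_ref(.|s'))
    weights:    distribution over the uncertainty set for the soft-robust case
    """
    per_model = np.array([exp_q_next(s) - tau * kl_next(s) for s in next_states])
    if weights is None:
        weights = np.full(len(next_states), 1.0 / len(next_states))  # uniform, as in the experiments
    robust_target = r_t + gamma * per_model.min()
    soft_robust_target = r_t + gamma * float(weights @ per_model)
    return robust_target, soft_robust_target
```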

Relation to MPO: As in the previous section, this step replaces the policy evaluation step of MPO. Our robust entropy-regularized Bellman operator and soft-robust entropy-regularized Bellman operator ensure that this process converges to a unique fixed point for the policy $\pi_k$ in the robust and soft-robust cases respectively. We use the previous policy iterate as the reference policy $\bar\pi$. The pseudo code for the R-MPO, RE-MPO and Soft-Robust Entropy-regularized MPO (SRE-MPO) algorithms can be found in Appendix E (Algorithms 1, 2 and 3 respectively).

7 Experiments

We now present experiments on nine different continuous control domains from the DeepMind control suite (Tassa et al., 2018). In this section our agent optimizes for the entropy-regularized objective (non-robust, robust and soft-robust versions). This corresponds to (a) the non-robust E-MPO baseline, (b) Robust E-MPO (RE-MPO) and (c) Soft-Robust E-MPO (SRE-MPO). From here on, it is assumed that the algorithms optimize for the entropy-regularized objective unless otherwise stated.

Appendix: In Appendix D.4, we present results of our agent optimizing for the expected return objective without entropy regularization (for the non-robust, robust and soft-robust versions). This corresponds to (a’) non-robust MPO baseline, (b’) R-MPO and (c’) SR-MPO.

The experiments are divided into three sections. The first section details the setup for robust and soft-robust training. The next section compares robust and soft-robust performance to the non-robust MPO baseline in each of the nine domains. The final section is a set of investigative experiments to gain additional insights into the performance of the robust and soft-robust agents. This includes incorporating robustness into the Stochastic Value Gradients (SVG) algorithm (Heess et al., 2015a).

7.1 Setup

For each domain, the robust agent is trained using a pre-defined uncertainty set consisting of three task perturbations (we did experiments on a larger set with similar results, but settled on three for computational efficiency). Each of the three perturbations corresponds to a particular perturbation of the Mujoco domain. For example, in Cartpole, the uncertainty set consists of three different pole lengths. Both the robust and non-robust agents are evaluated on a test set of three unseen task perturbations. In the Cartpole example, this would correspond to pole lengths that the agent has not seen during training. The chosen values of the uncertainty set and evaluation set for each domain can be found in Appendix D.3. Note that it is common practice to manually select the pre-defined uncertainty set and the unseen test environments. Practitioners often have significant domain knowledge and can utilize this when choosing the uncertainty set (Derman and Mannor, 2019; Derman et al., 2018; Di Castro et al., 2012; Mankowitz et al., 2018a; Tamar et al., 2014).

During training, the robust, soft-robust and non-robust agents act in an unperturbed environment which we refer to as the nominal environment. During the TD learning update, the robust agent calculates an infimum over Q-values from each next-state realization for each of the uncertainty set task perturbations (the soft-robust agent computes an average, which corresponds to a uniform distribution over the uncertainty set, instead of an infimum). Each transition model is a different instantiation of the Mujoco task. The robust and soft-robust agents are exposed to more state realizations than the non-robust agent. However, as we show in our ablation studies, significantly increasing the number of samples and the diversity of the samples for the non-robust agent still results in poor performance compared to the robust and soft-robust agents.

7.2 Main Experiments

In this section, we compare the performance of non-robust MPO to the robust and soft-robust variants. Each training run consists of a fixed number of episodes, and each experiment is repeated across multiple seeds. In the bar plots, the y-axis indicates the average reward (with standard deviation) and the x-axis indicates different unseen evaluation environment perturbations, starting from the first perturbation (Env0) onwards. Increasing environment indices correspond to increasingly large perturbations. For example, in Figure 1 (top left), Env0, Env1 and Env2 for the Cartpole Balance task correspond to increasingly long pole lengths. Figure 1 shows the performance on six Mujoco domains (the remaining three domains are in Appendix D.4). The bar plots indicate the performance of E-MPO (red), RE-MPO (blue) and SRE-MPO (green) on the held-out test perturbations. This color scheme is consistent throughout the experiments unless otherwise stated. As can be seen in each of the figures, RE-MPO attains improved performance over E-MPO. This same trend holds true for all nine domains. SRE-MPO outperforms the non-robust baseline in all but the Cheetah domain, but is not able to outperform RE-MPO.

Appendix: The appendix contains additional experiments with the non-entropy-regularized versions of the algorithms, where again the robust (R-MPO) and soft-robust (SR-MPO) versions of MPO outperform the non-robust version (MPO).

Figure 1: Six domains showing RE-MPO (blue), SRE-MPO (green) and E-MPO (red). Additional domains can be found in the appendix. In addition, the results for R-MPO, SR-MPO and MPO can be found in Appendix D.4 with similar results.
Figure 2: A larger test set: the figures show the performance of RE-MPO (blue), SRE-MPO (green) and E-MPO (red) for a test set that extends from the nominal environment to significant perturbations outside the training set for Cartpole Balance (left) and Pendulum Swingup (right).

7.3 Investigative Experiments

This section aims to investigate and try to answer various questions that may help explain the performance of the robust and non-robust agents respectively. Each investigative experiment is conducted on the Cartpole Balance and Pendulum Swingup domains.

What if we increase the number of training samples? One argument is that the robust agent effectively has access to more samples, since it calculates the Bellman update using the infimum of three different environment realizations. To balance this effect, the non-robust agent was trained for three times more episodes than the robust agents. Training with significantly more samples does not increase the performance of the non-robust agent and can even decrease it, as a result of overfitting to the nominal domain. The results on Cartpole Balance and Pendulum Swingup can be found in Appendix D.5, Figure 11.

What about Domain Randomization? A subsequent point would be that the robust agent sees more diverse examples, compared to the non-robust agent, from each of the perturbed environments. We therefore trained the non-robust agent in a domain randomization setting (Andrychowicz et al., 2018; Peng et al., 2018) where three actors each operate in a perturbed environment (the same environments as the robust agent's uncertainty set). The TD errors are batch-averaged as in domain randomization. As seen in the two left figures of Figure 4, for Cartpole and Pendulum respectively, the robust and soft-robust variants significantly outperform the domain randomization agent. A discussion of why this is to be expected can be found in Section 8.
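To make the distinction concrete, the following toy sketch (our own illustration, not the paper's code) contrasts the domain-randomization loss, which averages per-model TD errors, with the robust loss, which penalizes a single TD error computed against the worst-case target.

```python
# Contrast between domain-randomization and robust TD losses for one transition,
# given per-model bootstrapped targets target_m = r_t + gamma * E_{a'~pi}[Q_target(s'_m, a')].
import numpy as np

def dr_loss(per_model_targets, q_sa):
    """Domain randomization: average the per-model TD errors."""
    errors = (np.asarray(per_model_targets) - q_sa) ** 2
    return float(errors.mean())

def robust_loss(per_model_targets, q_sa):
    """Robust objective: a single TD error against the worst-case target."""
    return float((min(per_model_targets) - q_sa) ** 2)

targets = [10.0, 7.0, 4.0]   # hypothetical bootstrapped targets under three perturbations
print(dr_loss(targets, q_sa=8.0), robust_loss(targets, q_sa=8.0))
```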

A larger test set: It is also useful to view the performance of the agent from the nominal environment to increasingly large perturbations in the unseen test set (see Appendix D.3 for values). These graphs can be seen in Figure 2 for Cartpole Balance and Pendulum Swingup respectively. As seen in the figures, the robust agent maintains a higher level of performance compared to the non-robust agent. The soft-robust agent outperforms the non-robust agent, but its performance degrades as the perturbations increase which is consistent with the results of Derman et al. (2018). In addition, the robust and soft-robust agents are competitive with the non-robust agent in the nominal environment.

Modifying the uncertainty set: It is also interesting to evaluate the performance of the agent for different uncertainty sets. For Pendulum Swingup, we modified the final (largest) perturbation of the pendulum-arm length in the training uncertainty set to a sequence of increasingly large values, while keeping the unseen evaluation lengths fixed. An increase in performance can be seen in Figure 3 as the third perturbation approaches that of the unseen evaluation environments. Thus it appears that if the agent is able to approximately capture the dynamics of the unseen test environments within the training set, then the robust agent is able to adapt to the unseen test environments. The results for Cartpole Balance can be seen in Appendix D.5, Figure 12.

Figure 3: Modifying the uncertainty set: Pendulum Swingup when modifying the third perturbation of the uncertainty set to increasingly large arm lengths (left, middle and right respectively).

What about incorporating Robustness into other algorithms? To show the generality of this robustness approach, we incorporate it into the critic of the Stochastic Value Gradients (SVG) continuous control RL algorithm (see Appendix D.1). As seen in Figure 4, Robust Entropy-regularized SVG (RE-SVG) and Soft RE-SVG (SRE-SVG) significantly outperform the non-robust Entropy-regularized SVG (E-SVG) baseline in both Cartpole and Pendulum.

Robust entropy-regularized return vs. robust expected return: When comparing the robust entropy-regularized return performance to the robust expected return, we found that the entropy-regularized return appears to do no worse than the expected return. And in some cases, e.g., Cheetah, the entropy-regularized objective performs significantly better than the expected return (see Appendix D.5, Figure 10 for these results).

Figure 4: (1) Domain Randomization (DR): The two left figures show the domain randomization performance for the Cartpole Balance and Pendulum Swingup tasks respectively. (2) Stochastic Value Gradients (SVG): The two right figures show the performance of Robust Entropy-regularized SVG (RE-SVG) and SRE-SVG compared to E-SVG for Cartpole and Pendulum respectively.

Different Nominal Models: In this paper the nominal model was always chosen as the smallest perturbation from the uncertainty set. This was done to highlight the strong performance of robust policies to increasingly large environment perturbations. However, what if we set the nominal model as the median or largest perturbation with respect to the chosen uncertainty set for each agent? As seen in Appendix D.5, Figure 13, the closer (further) the nominal model is to (from) the holdout set, the better (worse) the performance of the non-robust agent. However, in all cases, the robust agent still performs at least as well as (and sometimes better than) the non-robust agent.

8 Related Work

In previous work, a robust Bellman operator has been defined and used to develop a performance bound for robust value iteration (Iyengar, 2005). In this work, we extend that operator to the entropy-regularized setting, for both the robust and soft-robust cases. Tamar et al. (2014) incorporate function approximation into the robust formulation to solve large-scale MDPs. They do so by introducing a robust dynamic programming technique based on a projected fixed point equation. Mankowitz et al. (2018a) learn robust options, also known as temporally extended actions (Sutton et al., 1999), using policy gradient, and prove convergence to a locally optimal solution. Morimoto and Doya (2005) learn a disturbance value function by solving a min-max game and augmenting the reward that the agent receives with a disturbance variable. Robust solutions tend to be overly conservative. To combat this, Derman et al. (2018) extend the actor-critic two-timescale stochastic approximation algorithm to a 'soft-robust' formulation to yield a less conservative solution. Di Castro et al. (2012) develop theory for a robust Kalman filter and introduce implementations for a deep robust Kalman filter as well as a robust implementation of Deep Q Networks (Mnih et al., 2015).

Domain Randomization (DR) (Andrychowicz et al., 2018; Peng et al., 2018) is a technique whereby an agent trains on different perturbations of the environment. The agent then batch-averages the learning errors from these trajectories to yield an agent that is more robust to environment perturbations.

The intuitive difference between DR and robustness to model misspecification: DR performs a different type of generalization to that of learning a worst-case or soft-robust value function. Domain randomization can be seen as a form of data augmentation, as additional, more diverse data is added to the learning setup. On the other hand, the robust objective can be viewed as an adversarial training setup whereby the agent is constantly trying to learn a policy that can perform well under the perturbations of a worst-case adversary with respect to the next state and the given uncertainty set. The soft-robust agent is also adversarial, albeit less conservative, in that it attempts to perform well with respect to the average next-state perturbation. It is this difference that enables the improved performance seen in the 'Domain Randomization' investigative experiment in Section 7.3.

9 Conclusion

We have presented a framework for incorporating robustness (to perturbations in the transition dynamics, which we refer to as model misspecification) into continuous control RL algorithms. We specifically focused on incorporating robustness into MPO as well as our entropy-regularized version of MPO (E-MPO). In addition, we presented an experiment which incorporates robustness into the SVG algorithm. In each case, the robust and soft-robust variants outperformed the non-robust baselines. This framework is suited to continuous control algorithms that learn a value function, such as an actor-critic setup. From a theoretical standpoint, we adapted MPO to an entropy-regularized version (E-MPO); we then incorporated robustness into the policy evaluation step of both algorithms to yield Robust MPO (R-MPO) and Robust E-MPO (RE-MPO). This was achieved by deriving the corresponding robust entropy-regularized Bellman operator to ensure that the policy evaluation step converges in each case. As seen in prior work (Derman et al., 2018; Di Castro et al., 2012; Mankowitz et al., 2018a), the robust agent can be overly conservative. We therefore also provide a less conservative soft-robust Bellman operator, show that it is a contraction, and use it to define an entropy-regularized soft-robust variant of MPO (SRE-MPO). We show that the robust versions outperform their non-robust counterparts on nine Mujoco domains. We provide investigative experiments to understand the robust and soft-robust policies in more detail, including an experiment indicating that robust agents outperform agents trained using domain randomization.


Appendix A Proofs

Theorem 1. The robust entropy-regularized Bellman operator $T^{\pi}$ for a fixed policy $\pi$ (Equation 1) is a $\gamma$-contraction in the max norm.
Proof.

We follow the proofs from [Tamar et al., 2014, Iyengar, 2005], and adapt them to account for the additional entropy regularization for a fixed policy $\pi$. Let $V_1, V_2$ be arbitrary value functions, and let $s$ be an arbitrary state. Assume $(T^{\pi}V_1)(s) \ge (T^{\pi}V_2)(s)$. Let $\epsilon$ be an arbitrary positive number.

By the definition of the infimum in the operator, for every action $a$ there exists $\bar p_a \in \mathcal{P}(s,a)$ such that,

$$\mathbb{E}_{s'\sim\bar p_a}\big[V_2(s')\big] \le \inf_{p\in\mathcal{P}(s,a)}\mathbb{E}_{s'\sim p}\big[V_2(s')\big] + \epsilon. \quad (5)$$

In addition, we have by definition that:

$$\inf_{p\in\mathcal{P}(s,a)}\mathbb{E}_{s'\sim p}\big[V_1(s')\big] \le \mathbb{E}_{s'\sim\bar p_a}\big[V_1(s')\big]. \quad (6)$$

Thus, since the reward and entropy-regularization terms are identical for $V_1$ and $V_2$ and cancel, we have,

$$0 \le (T^{\pi}V_1)(s) - (T^{\pi}V_2)(s) \le \gamma\,\mathbb{E}_{a\sim\pi(\cdot\mid s)}\mathbb{E}_{s'\sim\bar p_a}\big[V_1(s') - V_2(s')\big] + \gamma\epsilon \le \gamma\|V_1 - V_2\|_{\infty} + \gamma\epsilon. \quad (7)$$

Applying a similar argument for the case $(T^{\pi}V_1)(s) \le (T^{\pi}V_2)(s)$ results in

$$(T^{\pi}V_2)(s) - (T^{\pi}V_1)(s) \le \gamma\|V_1 - V_2\|_{\infty} + \gamma\epsilon. \quad (8)$$

Since $\epsilon$ is an arbitrary positive number, we establish the result, i.e.,

$$\|T^{\pi}V_1 - T^{\pi}V_2\|_{\infty} \le \gamma\|V_1 - V_2\|_{\infty}. \quad (9)$$
∎

Theorem 2. The optimal robust entropy-regularized Bellman operator $T^{*}$, defined as $(T^{*}V)(s) = \max_{\pi}(T^{\pi}V)(s)$, is a $\gamma$-contraction in the max norm.
Proof.

We follow a similar argument to the proof of Theorem 1. Let $V_1, V_2$ be arbitrary value functions, and let $s$ be an arbitrary state. Assume $(T^{*}V_1)(s) \ge (T^{*}V_2)(s)$. Let $\epsilon$ be an arbitrary positive number. By definition of the max operator, there exists a policy $\pi_1$ such that,

$$(T^{*}V_1)(s) \le (T^{\pi_1}V_1)(s) + \epsilon. \quad (10)$$

In addition, by the definition of the infimum in the operator, for every action $a$ there exists $\bar p_a \in \mathcal{P}(s,a)$ such that,

$$\mathbb{E}_{s'\sim\bar p_a}\big[V_2(s')\big] \le \inf_{p\in\mathcal{P}(s,a)}\mathbb{E}_{s'\sim p}\big[V_2(s')\big] + \epsilon. \quad (11)$$

Thus, using $(T^{*}V_2)(s) \ge (T^{\pi_1}V_2)(s)$ and cancelling the reward and entropy-regularization terms as before, we have,

$$(T^{*}V_1)(s) - (T^{*}V_2)(s) \le (T^{\pi_1}V_1)(s) - (T^{\pi_1}V_2)(s) + \epsilon \le \gamma\|V_1 - V_2\|_{\infty} + (1+\gamma)\epsilon. \quad (12)$$

Applying a similar argument for the case $(T^{*}V_1)(s) \le (T^{*}V_2)(s)$ results in

$$(T^{*}V_2)(s) - (T^{*}V_1)(s) \le \gamma\|V_1 - V_2\|_{\infty} + (1+\gamma)\epsilon. \quad (13)$$

Since $\epsilon$ is an arbitrary positive number, we establish the result, i.e.,

$$\|T^{*}V_1 - T^{*}V_2\|_{\infty} \le \gamma\|V_1 - V_2\|_{\infty}. \quad (14)$$
∎

Corollary 1.

Let $\pi_K$ be the greedy policy after applying $K$ value iteration steps with the optimal robust entropy-regularized Bellman operator $T^{*}$. The bound between the optimal value function $V^{*}$ and $V^{\pi_K}$, the value function that is induced by $\pi_K$, is given by,
$$\|V^{*} - V^{\pi_K}\|_{\infty} \le \frac{2\gamma}{1-\gamma}\Big(\frac{\epsilon}{1-\gamma} + \gamma^{K}\|V^{*} - V_0\|_{\infty}\Big),$$
where $\epsilon$ is the function approximation error, and $V_0$ is the initial value function.

Proof.

From Bertsekas (1996), we have the following proposition:

Lemma 1.

Let $V^{*}$ be the optimal value function, $V$ some arbitrary value function, $\pi$ the greedy policy with respect to $V$, and $V^{\pi}$ the value function that is induced by $\pi$. Thus,

$$\|V^{*} - V^{\pi}\|_{\infty} \le \frac{2\gamma}{1-\gamma}\|V^{*} - V\|_{\infty}. \quad (15)$$

Next, define the maximum projected loss to be:

$$\epsilon = \max_{k < K}\big\|T^{*}V_k - V_{k+1}\big\|_{\infty}, \quad (16)$$

i.e., the worst-case error incurred when representing the result of each Bellman backup with the function approximator. We can now derive a bound on the loss between the optimal value function and the value function obtained after $K$ updates of value iteration (denoted by $V_K$) as follows:

$$\|V^{*} - V_K\|_{\infty} \le \|T^{*}V^{*} - T^{*}V_{K-1}\|_{\infty} + \|T^{*}V_{K-1} - V_K\|_{\infty} \le \gamma\|V^{*} - V_{K-1}\|_{\infty} + \epsilon \le \dots \le \gamma^{K}\|V^{*} - V_0\|_{\infty} + \frac{\epsilon}{1-\gamma}. \quad (17)$$

Then, using Lemma 1, we get:

$$\|V^{*} - V^{\pi_K}\|_{\infty} \le \frac{2\gamma}{1-\gamma}\|V^{*} - V_K\|_{\infty} \le \frac{2\gamma}{1-\gamma}\Big(\frac{\epsilon}{1-\gamma} + \gamma^{K}\|V^{*} - V_0\|_{\infty}\Big), \quad (18)$$

which establishes the result. ∎

Appendix B Soft-Robust Entropy-Regularized Bellman Operator

Theorem 3. The soft-robust entropy-regularized Bellman operator $T^{\pi}_{\text{soft}}$ for a fixed policy $\pi$ (Equation 2) is a $\gamma$-contraction in the max norm.
Proof.

For arbitrary value functions $V_1, V_2$, an arbitrary state $s$, and a fixed policy $\pi$:

$$\big|(T^{\pi}_{\text{soft}}V_1)(s) - (T^{\pi}_{\text{soft}}V_2)(s)\big| = \gamma\,\Big|\mathbb{E}_{a\sim\pi(\cdot\mid s)}\mathbb{E}_{s'\sim\bar p(\cdot\mid s,a)}\big[V_1(s') - V_2(s')\big]\Big| \le \gamma\|V_1 - V_2\|_{\infty},$$

since the reward and entropy-regularization terms cancel and $\bar p$ is a fixed (averaged) transition model. Taking the maximum over states establishes the result. ∎

Theorem 4. The optimal soft-robust entropy-regularized Bellman operator $T^{*}_{\text{soft}}$, defined as $(T^{*}_{\text{soft}}V)(s) = \max_{\pi}(T^{\pi}_{\text{soft}}V)(s)$, is a $\gamma$-contraction in the max norm.
Proof.

Let $V_1, V_2$ be arbitrary value functions, and let $s$ be an arbitrary state. Assume $(T^{*}_{\text{soft}}V_1)(s) \ge (T^{*}_{\text{soft}}V_2)(s)$. Let $\epsilon$ be an arbitrary positive number. By definition of the max operator, there exists a policy $\pi_1$ such that,

$$(T^{*}_{\text{soft}}V_1)(s) \le (T^{\pi_1}_{\text{soft}}V_1)(s) + \epsilon. \quad (19)$$

Thus, using $(T^{*}_{\text{soft}}V_2)(s) \ge (T^{\pi_1}_{\text{soft}}V_2)(s)$ and Theorem 3, we have,

$$(T^{*}_{\text{soft}}V_1)(s) - (T^{*}_{\text{soft}}V_2)(s) \le (T^{\pi_1}_{\text{soft}}V_1)(s) - (T^{\pi_1}_{\text{soft}}V_2)(s) + \epsilon \le \gamma\|V_1 - V_2\|_{\infty} + \epsilon. \quad (20)$$

Applying a similar argument for the case $(T^{*}_{\text{soft}}V_1)(s) \le (T^{*}_{\text{soft}}V_2)(s)$ results in

$$(T^{*}_{\text{soft}}V_2)(s) - (T^{*}_{\text{soft}}V_1)(s) \le \gamma\|V_1 - V_2\|_{\infty} + \epsilon. \quad (21)$$

Since $\epsilon$ is an arbitrary positive number, we establish the result, i.e.,

$$\|T^{*}_{\text{soft}}V_1 - T^{*}_{\text{soft}}V_2\|_{\infty} \le \gamma\|V_1 - V_2\|_{\infty}. \quad (22)$$
∎

Appendix C Entropy-regularized Policy Evaluation

This section describes: (1) the modification to the TD update for the expected return so that it optimizes for the entropy-regularized expected return, and (2) the additional modification required to account for robustness.

We start with (1).

The entropy-regularized value function is defined as:

$$V^{\pi}(s) = \mathbb{E}^{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\big(r_t - \tau\,\mathrm{KL}\big(\pi(\cdot\mid s_t)\,\|\,\bar\pi(\cdot\mid s_t)\big)\big)\,\Big|\,s_0 = s\Big] \quad (23)$$

and the corresponding entropy-regularized action value function is given by:

$$Q^{\pi}(s,a) = \mathbb{E}^{\pi}\Big[r_0 + \sum_{t=1}^{\infty}\gamma^{t}\big(r_t - \tau\,\mathrm{KL}\big(\pi(\cdot\mid s_t)\,\|\,\bar\pi(\cdot\mid s_t)\big)\big)\,\Big|\,s_0 = s, a_0 = a\Big] \quad (24)$$
$$= r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(\cdot\mid s,a)}\big[V^{\pi}(s')\big]. \quad (25)$$

Next, we define the shorthand:

$$\mathrm{KL}(s) := \mathrm{KL}\big(\pi(\cdot\mid s)\,\|\,\bar\pi(\cdot\mid s)\big), \quad (26)$$

thus, expanding the definition of $V^{\pi}$ by conditioning on the first action,

$$V^{\pi}(s) = \mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(\cdot\mid s,a)}\big[V^{\pi}(s')\big]\Big] - \tau\,\mathrm{KL}(s). \quad (27)$$

Therefore, combining (25) and (27), we have the following relationship:

$$V^{\pi}(s) = \mathbb{E}_{a\sim\pi(\cdot\mid s)}\big[Q^{\pi}(s,a)\big] - \tau\,\mathrm{KL}(s). \quad (28)$$

We now retrieve the TD update for the entropy-regularized action value function by substituting (28) into (25):

$$Q^{\pi}(s_t,a_t) = r_t + \gamma\,\mathbb{E}_{s_{t+1}\sim p(\cdot\mid s_t,a_t)}\Big[\mathbb{E}_{a'\sim\pi(\cdot\mid s_{t+1})}\big[Q^{\pi}(s_{t+1},a')\big] - \tau\,\mathrm{KL}(s_{t+1})\Big]. \quad (29)$$

Minimizing the squared difference between the two sides of (29), with the right-hand side evaluated using the target network parameters $\theta'$, yields the entropy-regularized squared TD error used in Section 6. For (2), robustness is incorporated by taking the infimum of the right-hand side of (29) over the uncertainty set $\mathcal{P}(s_t,a_t)$, which yields Eq. (4).