1 Introduction
Reinforcement Learning (RL) algorithms typically learn a policy that optimizes for the expected return (Sutton and Barto, 1998). That is, the policy aims to maximize the expected sum of future rewards that an agent accumulates in a particular task. This approach has yielded impressive results in recent years, including playing computer games with superhuman performance (Mnih et al., 2015; Tessler et al., 2016), multi-task RL (Rusu et al., 2016; Devin et al., 2017; Teh et al., 2017; Mankowitz et al., 2018b; Riedmiller et al., 2018) as well as solving complex continuous control robotic tasks (Duan et al., 2016; Abdolmaleki et al., 2018b; Kalashnikov et al., 2018; Haarnoja et al., 2018).
The current crop of RL agents is typically trained in a single environment (usually a simulator). As a consequence, many of these agents learn policies that are sensitive to environment perturbations. Perturbing the dynamics of the environment at test time, which may include executing the policy in a real-world setting, can have a significant negative impact on the performance of the agent (Andrychowicz et al., 2018; Peng et al., 2018; Derman et al., 2018; Di Castro et al., 2012; Mankowitz et al., 2018a). This is because the training environment is not necessarily a good model of the perturbations that an agent may actually face, leading to potentially unwanted, suboptimal behaviour. There are many types of environment perturbations, including changing lighting or weather conditions, sensor noise, actuator noise and action delays.
It is desirable to train agents that are agnostic to environment perturbations. This is especially crucial in the Sim2Real setting (Andrychowicz et al., 2018; Peng et al., 2018; Wulfmeier et al., 2017; Rastogi et al., 2018; Christiano et al., 2016) where a policy is trained in a simulator and then executed on a real-world domain. As an example, consider a robotic arm that executes a control policy to perform a specific task in a factory. If, for some reason, the arm needs to be replaced and the specifications do not exactly match, then the control policy still needs to be able to perform the task with the ‘perturbed’ robotic arm dynamics. In addition, a robust policy helps the agent cope with noise-induced perturbations, such as sensor noise due to malfunctioning sensors and actuator noise.
Model misspecification: For the purpose of this paper, model misspecification refers to the situation in which an agent is trained in one environment and performs poorly in a different, perturbed version of that environment (as in the above examples). By incorporating robustness into our agents, we correct for this misspecification, yielding improved performance in the perturbed environment(s).
In this paper, we propose a framework for incorporating robustness into continuous control RL algorithms. We specifically focus on robustness to model misspecification in the transition dynamics. For the remainder of the paper, when we mention robustness, we refer to this particular form of robustness. Our main contributions are as follows:
(1) We provide a generalized framework for incorporating robustness to model misspecification into continuous control RL algorithms. Specifically, the framework applies to algorithms that learn a value function (e.g., a critic) or perform policy evaluation. As a proof of concept, we incorporate robustness into a state-of-the-art continuous control RL algorithm called Maximum a-posteriori Policy Optimization (MPO) (Abdolmaleki et al., 2018b) to yield Robust MPO (RMPO). We also carry out an additional experiment, where we incorporate robustness into another continuous control RL algorithm called Stochastic Value Gradients (SVG) (Heess et al., 2015a).
(2) Entropy regularization encourages exploration and helps prevent early convergence to suboptimal policies (Nachum et al., 2017). To incorporate these advantages, we extend MPO to optimize for an entropy-regularized return objective (EMPO).
(3) We extend the Robust Bellman operator (Iyengar, 2005) to robust and soft-robust entropy-regularized versions respectively and show that these operators are contraction mappings and yield a well-known value-iteration bound with respect to the max norm.
(4) We use these results to extend EMPO to Robust Entropy-regularized MPO (REMPO) and Soft REMPO (SREMPO) and show that they perform at least as well as RMPO and in some cases significantly better.
(5) We present experimental results in nine Mujoco domains showing that the robust variants (RMPO, REMPO and SREMPO) outperform their non-robust counterparts (MPO and EMPO).
(6) We carry out multiple investigative experiments to better understand the robustness framework, including results indicating that robustness outperforms domain randomization.
2 Background
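A Markov Decision Process (MDP) is defined as the tuple $\langle S, A, r, \gamma, P \rangle$ where $S$ is the state space, $A$ the action space, $r: S \times A \to \mathbb{R}$ is a bounded reward function; $\gamma \in [0, 1)$ is the discount factor and $P: S \times A \to \Delta^{S}$ maps state-action pairs to a probability distribution over next states. We use $\Delta^{S}$ to denote the $|S| - 1$ simplex. The goal of a Reinforcement Learning agent for the purpose of control is to learn a policy $\pi: S \to \Delta^{A}$ which maps a state and action to a probability of executing the action from the given state so as to maximize the expected return $J(\pi) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$ where $r_{t}$ is a random variable representing the reward received at time $t$ (Sutton and Barto, 2018). The value function is defined as $V^{\pi}(s) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_{0} = s\right]$ and the action value function as $Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[V^{\pi}(s')\right]$.

Entropy-regularized Reinforcement Learning: Entropy regularization encourages exploration and helps prevent early convergence to suboptimal policies (Nachum et al., 2017). We make use of the relative entropy-regularized RL objective defined as $J_{KL}(\pi; \bar{\pi}) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} \left(r_{t} - \tau \, KL(\pi(\cdot \mid s_{t}) \,\|\, \bar{\pi}(\cdot \mid s_{t}))\right)\right]$ where $\tau$ is a temperature parameter and $KL(\pi(\cdot \mid s_{t}) \,\|\, \bar{\pi}(\cdot \mid s_{t}))$ is the Kullback-Leibler (KL) divergence between the current policy $\pi$ and a reference policy $\bar{\pi}$ given a state $s_{t}$ (Schulman et al., 2017). The entropy-regularized value function is defined as $V^{\pi}_{KL}(s; \bar{\pi}) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} \left(r_{t} - \tau \, KL(\pi(\cdot \mid s_{t}) \,\|\, \bar{\pi}(\cdot \mid s_{t}))\right) \mid s_{0} = s\right]$. Intuitively, augmenting the rewards with the KL term regularizes the policy by forcing it to be ‘close’ in some sense to the base policy.

A Robust MDP (RMDP) is defined as a tuple $\langle S, A, r, \gamma, \mathcal{P} \rangle$ where $S$, $A$, $r$ and $\gamma$ are defined as above; $\mathcal{P}(s, a) \subseteq \mathcal{M}(S)$ is an uncertainty set where $\mathcal{M}(S)$ is the set of probability measures over next states $s' \in S$. This is interpreted as an agent selecting a state and action pair, and the next state $s'$ is determined by a conditional measure $p(s' \mid s, a) \in \mathcal{P}(s, a)$ (Iyengar, 2005). A robust policy optimizes for the worst-case expected return objective: $J_{R}(\pi) = \inf_{p \in \mathcal{P}} \mathbb{E}^{p, \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$.

The robust value function is defined as $V^{\pi}_{R}(s) = \inf_{p \in \mathcal{P}} \mathbb{E}^{p, \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_{0} = s\right]$ and the robust action value function as $Q^{\pi}_{R}(s, a) = r(s, a) + \gamma \inf_{p \in \mathcal{P}} \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\left[V^{\pi}_{R}(s')\right]$. Both the robust Bellman operator for a fixed policy and the optimal robust Bellman operator have previously been shown to be contractions (Iyengar, 2005). A rectangularity assumption on the uncertainty set (Iyengar, 2005) ensures that “nature” can choose a worst-case transition function independently for every state $s$ and action $a$.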
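To make the worst-case objective concrete, the following is a minimal tabular sketch of robust policy evaluation. It assumes a finite uncertainty set of K candidate transition models and rectangularity over state-action pairs; the function name, array layout and random instance are illustrative assumptions and not the paper's implementation.

```python
import numpy as np


def robust_policy_evaluation(p_set, r, pi, gamma, iters=500):
    """Approximate the robust value of a fixed policy by fixed-point iteration.

    p_set: (K, S, A, S) array of K candidate transition models
           (a hypothetical finite uncertainty set).
    r:     (S, A) bounded rewards.
    pi:    (S, A) action probabilities of the policy being evaluated.
    """
    v = np.zeros(r.shape[0])
    for _ in range(iters):
        # Expected next-state value under every candidate model: (K, S, A).
        next_v = p_set @ v
        # Rectangularity: the worst model is chosen independently per (s, a).
        worst_next_v = next_v.min(axis=0)
        q = r + gamma * worst_next_v
        v = (pi * q).sum(axis=1)
    return v


rng = np.random.default_rng(0)
K, S, A = 3, 4, 2
p_set = rng.dirichlet(np.ones(S), size=(K, S, A))
r = rng.uniform(size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)

v_robust = robust_policy_evaluation(p_set, r, pi, gamma=0.9)
# Value of the same policy under each single model, for comparison.
v_single = [robust_policy_evaluation(p_set[k:k + 1], r, pi, gamma=0.9)
            for k in range(K)]
```

In this sketch the robust value is never larger than the value of the same policy under any single model in the set, which is the sense in which the objective is worst-case.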
Maximum A-Posteriori Policy Optimization (MPO) (Abdolmaleki et al., 2018a, b) is a continuous control RL algorithm that performs an expectation maximization form of policy iteration. It comprises two steps: policy evaluation and policy improvement. The policy evaluation step receives as input a policy $\pi_{k}$ and evaluates an action-value function $Q^{\pi_{k}}_{\theta}(s, a)$ by minimizing the squared TD error: $\min_{\theta} \left(r_{t} + \gamma Q^{\pi_{k}}_{\hat{\theta}}(s_{t+1}, a_{t+1} \sim \pi_{k}(\cdot \mid s_{t+1})) - Q^{\pi_{k}}_{\theta}(s_{t}, a_{t})\right)^{2}$, where $\hat{\theta}$ denotes the parameters of a target network (Mnih et al., 2015) that are periodically updated from $\theta$. In practice we use a replay buffer of samples in order to perform the policy evaluation step. The second step is policy improvement, which consists of optimizing the objective $J(s, \pi) = \mathbb{E}_{\pi}\left[Q^{\pi_{k}}_{\theta}(s, a)\right]$ for states $s$ drawn from a state distribution $\mu(s)$. In practice the state distribution samples are drawn from an experience replay. By improving $J$ in all states $s$, we improve our objective. To do so, a two-step procedure is performed.

First, we construct a non-parametric estimate $q$ such that $\mathbb{E}_{q(a \mid s)}\left[Q^{\pi_{k}}_{\theta}(s, a)\right] \geq \mathbb{E}_{\pi_{k}(a \mid s)}\left[Q^{\pi_{k}}_{\theta}(s, a)\right]$. This is done by maximizing $\mathbb{E}_{q(a \mid s)}\left[Q^{\pi_{k}}_{\theta}(s, a)\right]$ while ensuring that the solution, locally, stays close to the current policy $\pi_{k}$; i.e. $\mathbb{E}_{\mu(s)}\left[KL(q(\cdot \mid s) \,\|\, \pi_{k}(\cdot \mid s))\right] < \epsilon$. This optimization has a closed-form solution given as $q(a \mid s) \propto \pi_{k}(a \mid s) \exp\left(Q^{\pi_{k}}_{\theta}(s, a) / \eta\right)$ where $\eta$ is a temperature parameter that can be computed by minimizing a convex dual function (Abdolmaleki et al., 2018b). Second, we project this non-parametric representation back onto a parameterized policy by solving the optimization problem $\pi_{k+1} = \arg\min_{\pi} \mathbb{E}_{\mu(s)}\left[KL(q(a \mid s) \,\|\, \pi(a \mid s))\right]$, where $\pi_{k+1}$ is the new and improved policy and where one typically employs additional regularization (Abdolmaleki et al., 2018a). Note that this amounts to supervised learning with samples drawn from $q(a \mid s)$; see Abdolmaleki et al. (2018a) for details.

3 Robust Entropy-Regularized Bellman Operator
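(Relative-)Entropy regularization has been shown to encourage exploration and prevent early convergence to suboptimal policies (Nachum et al., 2017). To take advantage of this idea when developing a robust RL algorithm we extend the robust Bellman operator to a robust entropy-regularized Bellman operator and prove that it is a contraction. (Note that while MPO already bounds the per-step relative entropy, we additionally want to regularize the action-value function to obtain a robust regularized algorithm.) We also show that well-known value-iteration bounds can be attained using this operator. We first define the robust entropy-regularized value function as $V^{\pi}_{R\text{-}KL}(s; \bar{\pi}) = \inf_{p \in \mathcal{P}} \mathbb{E}^{p, \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} \left(r_{t} - \tau \, KL(\pi(\cdot \mid s_{t}) \,\|\, \bar{\pi}(\cdot \mid s_{t}))\right) \mid s_{0} = s\right]$. For the remainder of this section, we drop the sub- and superscripts, as well as the reference policy conditioning, from the value function $V^{\pi}_{R\text{-}KL}(s; \bar{\pi})$, and simply represent it as $V(s)$ for brevity. We define the robust entropy-regularized Bellman operator for a fixed policy $\pi$ in Equation 1, and show it is a max norm contraction (Theorem 1).

$$T^{\pi}_{R\text{-}KL} V(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ r(s, a) - \tau \log \frac{\pi(a \mid s)}{\bar{\pi}(a \mid s)} + \gamma \inf_{p \in \mathcal{P}(s, a)} \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\left[V(s')\right] \right] \qquad (1)$$

Theorem 1.
The robust entropy-regularized Bellman operator $T^{\pi}_{R\text{-}KL}$ for a fixed policy $\pi$ is a contraction operator. Specifically: $\forall U, V \in \mathbb{R}^{|S|}$ and $\gamma \in (0, 1)$, we have, $\| T^{\pi}_{R\text{-}KL} U - T^{\pi}_{R\text{-}KL} V \| \leq \gamma \| U - V \|$.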
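The contraction property in Theorem 1 can be checked numerically on a small tabular example. The sketch below assumes the operator has the form reward minus a temperature-weighted log-ratio penalty plus a discounted worst-case next-state value, with a finite uncertainty set; the function name, the uniform reference policy and the random instance are hypothetical choices made only for illustration.

```python
import numpy as np


def robust_kl_bellman(v, p_set, r, pi, pi_ref, gamma, tau):
    """One application of an assumed robust KL-regularized operator for fixed pi.

    v: (S,) value estimate; p_set: (K, S, A, S) candidate models;
    r, pi, pi_ref: (S, A) arrays. Both policies need full support.
    """
    log_ratio = np.log(pi) - np.log(pi_ref)
    # Worst-case expected next-state value per (s, a) over the K models.
    worst_next_v = (p_set @ v).min(axis=0)
    return (pi * (r - tau * log_ratio + gamma * worst_next_v)).sum(axis=1)


rng = np.random.default_rng(1)
K, S, A, gamma, tau = 3, 5, 2, 0.9, 0.1
p_set = rng.dirichlet(np.ones(S), size=(K, S, A))
r = rng.uniform(size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)
pi_ref = np.full((S, A), 1.0 / A)  # uniform reference policy (assumption)

u, v = rng.normal(size=S), rng.normal(size=S)
gap_before = np.abs(u - v).max()
gap_after = np.abs(
    robust_kl_bellman(u, p_set, r, pi, pi_ref, gamma, tau)
    - robust_kl_bellman(v, p_set, r, pi, pi_ref, gamma, tau)
).max()
```

The reward and KL terms do not depend on the value estimate, so they cancel in the difference and only the discounted worst-case term contributes to `gap_after`.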
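4 Soft-Robust Entropy-Regularized Bellman Operator
In this section, we derive a soft-robust entropy-regularized Bellman operator and show that it is also a contraction in the max norm. First, we define the average transition model as $\bar{p} = \mathbb{E}_{p \sim w}[p]$, which corresponds to the average transition model distributed according to some distribution $w$ over the uncertainty set $\mathcal{P}$. This average transition model induces an average stationary distribution (see Derman et al. (2018)). The soft-robust entropy-regularized value function is defined as $V^{\pi}_{SR\text{-}KL}(s; \bar{\pi}) = \mathbb{E}^{\bar{p}, \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} \left(r_{t} - \tau \, KL(\pi(\cdot \mid s_{t}) \,\|\, \bar{\pi}(\cdot \mid s_{t}))\right) \mid s_{0} = s\right]$. Again, for ease of notation, we denote $V(s) = V^{\pi}_{SR\text{-}KL}(s; \bar{\pi})$ for the remainder of the section. The soft-robust entropy-regularized Bellman operator for a fixed policy $\pi$ is defined as:

$$T^{\pi}_{SR\text{-}KL} V(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ r(s, a) - \tau \log \frac{\pi(a \mid s)}{\bar{\pi}(a \mid s)} + \gamma \, \mathbb{E}_{s' \sim \bar{p}(\cdot \mid s, a)}\left[V(s')\right] \right] \qquad (2)$$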
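As a small illustration of the soft-robust backup, the sketch below replaces the worst-case model with a weighted average of a finite set of candidate models. The operator form, the function name and the random instance are assumptions made for this example only.

```python
import numpy as np


def soft_robust_kl_bellman(v, p_set, weights, r, pi, pi_ref, gamma, tau):
    """One application of an assumed soft-robust KL-regularized operator.

    weights: (K,) distribution over the K candidate models in p_set.
    """
    # Average transition model: weighted mixture of the candidates, (S, A, S).
    p_bar = np.tensordot(weights, p_set, axes=1)
    log_ratio = np.log(pi) - np.log(pi_ref)
    return (pi * (r - tau * log_ratio + gamma * (p_bar @ v))).sum(axis=1)


rng = np.random.default_rng(2)
K, S, A, gamma, tau = 3, 5, 2, 0.9, 0.1
p_set = rng.dirichlet(np.ones(S), size=(K, S, A))
r = rng.uniform(size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)
pi_ref = np.full((S, A), 1.0 / A)
v = rng.normal(size=S)

uniform = np.full(K, 1.0 / K)
soft = soft_robust_kl_bellman(v, p_set, uniform, r, pi, pi_ref, gamma, tau)
# Backups under each single model, obtained with one-hot weights.
per_model = np.stack([
    soft_robust_kl_bellman(v, p_set, np.eye(K)[k], r, pi, pi_ref, gamma, tau)
    for k in range(K)
])
```

Because the backup is linear in the transition model, the soft-robust result equals the weighted average of the single-model backups and is therefore never below the worst of them.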
5 Robust Policy Evaluation
To incorporate robustness into MPO, we focus on learning a worst-case value function in the policy evaluation step. Note that this policy evaluation step can be incorporated into any actor-critic algorithm. In particular, instead of optimizing the squared TD error, we optimize the worst-case squared TD error, which is defined as:

$$\min_{\theta} \left( r_{t} + \gamma \inf_{p \in \mathcal{P}(s_{t}, a_{t})} \mathbb{E}_{s_{t+1} \sim p}\left[ Q^{\pi_{k}}_{\hat{\theta}}(s_{t+1}, a_{t+1} \sim \pi_{k}(\cdot \mid s_{t+1})) \right] - Q^{\pi_{k}}_{\theta}(s_{t}, a_{t}) \right)^{2} \qquad (3)$$

where $\mathcal{P}(s_{t}, a_{t})$ is an uncertainty set for the current state $s_{t}$ and action $a_{t}$; $\pi_{k}$ is the current network’s policy, and $\hat{\theta}$ denotes the target network parameters.
Relation to MPO: In MPO, this replaces the current policy evaluation step. The robust Bellman operator (Iyengar, 2005) ensures that this process converges to a unique fixed point for the policy $\pi_{k}$. Since the proposal policy $q(a \mid s)$ (see Section 2) is proportional to the robust action value estimate $Q^{\pi_{k}}_{\theta}(s, a)$, it intuitively yields a robust policy as the policy is being generated from a worst-case value function. The fitting of the policy network to the proposal policy yields a robust network policy $\pi_{k+1}$.
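The worst-case TD error can be sketched for a batch of transitions as follows, assuming the infimum is approximated by a minimum over a finite set of perturbed models and that target-network values for each model's next state are already available. All names and numbers here are hypothetical.

```python
import numpy as np


def robust_td_target(reward, next_q_per_model, gamma):
    """Worst-case TD target.

    reward:           (B,) rewards observed in the nominal environment.
    next_q_per_model: (K, B) target-network values, one row per perturbed model.
    """
    return reward + gamma * next_q_per_model.min(axis=0)


def robust_td_loss(q_pred, reward, next_q_per_model, gamma):
    """Mean squared worst-case TD error over the batch."""
    target = robust_td_target(reward, next_q_per_model, gamma)
    return np.mean((target - q_pred) ** 2)


reward = np.array([1.0, 0.0])
next_q_per_model = np.array([[2.0, 5.0],
                             [3.0, 1.0],
                             [4.0, 2.0]])
q_pred = np.array([1.0, 0.5])

target = robust_td_target(reward, next_q_per_model, gamma=0.5)
loss = robust_td_loss(q_pred, reward, next_q_per_model, gamma=0.5)
```

Only the target changes relative to a standard critic update, which is why the same step can in principle be dropped into other actor-critic algorithms.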
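6 Robust Entropy-Regularized Policy Evaluation
To extend robust policy evaluation to robust entropy-regularized policy evaluation, two key steps need to be performed: (1) optimize for the entropy-regularized expected return as opposed to the regular expected return and modify the TD update accordingly; (2) incorporate robustness into the entropy-regularized expected return and modify the entropy-regularized TD update. To achieve (1), we define the entropy-regularized expected return as $Q^{\pi_{k}}_{KL}(s_{t}, a_{t}) = r_{t} - \tau \, KL(\pi_{k}(\cdot \mid s_{t}) \,\|\, \bar{\pi}(\cdot \mid s_{t})) + \gamma \, \mathbb{E}_{s_{t+1}}\left[V^{\pi_{k}}_{KL}(s_{t+1}; \bar{\pi})\right]$, and show in Appendix C that performing policy evaluation with the entropy-regularized value function is equivalent to optimizing the entropy-regularized squared TD error (same as Eq. (4), only omitting the $\inf$ operator). To achieve (2), we optimize for the robust entropy-regularized expected return objective defined as $Q^{\pi_{k}}_{R\text{-}KL}(s_{t}, a_{t}) = r_{t} - \tau \, KL(\pi_{k}(\cdot \mid s_{t}) \,\|\, \bar{\pi}(\cdot \mid s_{t})) + \gamma \inf_{p \in \mathcal{P}} \mathbb{E}_{s_{t+1} \sim p}\left[V^{\pi_{k}}_{R\text{-}KL}(s_{t+1}; \bar{\pi})\right]$, yielding the robust entropy-regularized squared TD error:

$$\min_{\theta} \left( r_{t} + \gamma \inf_{p \in \mathcal{P}(s_{t}, a_{t})} \mathbb{E}_{s_{t+1} \sim p}\left[ \tilde{Q}^{\pi_{k}}_{\hat{\theta}}(s_{t+1}, a_{t+1}) - \tau \, KL(\pi_{k}(\cdot \mid s_{t+1}) \,\|\, \bar{\pi}(\cdot \mid s_{t+1})) \right] - \tilde{Q}^{\pi_{k}}_{\theta}(s_{t}, a_{t}) \right)^{2} \qquad (4)$$

where $a_{t+1} \sim \pi_{k}(\cdot \mid s_{t+1})$. For the soft-robust setting, we remove the infimum from the TD update and replace the next state transition function with the average next state transition function $\bar{p}$.
Relation to MPO: As in the previous section, this step replaces the policy evaluation step of MPO. Our robust entropy-regularized Bellman operator and soft-robust entropy-regularized Bellman operator ensure that this process converges to a unique fixed point for the policy $\pi_{k}$ for the robust and soft-robust cases respectively. We use as the reference policy . The pseudo code for the RMPO, REMPO and Soft-Robust Entropy-regularized MPO (SREMPO) algorithms can be found in Appendix E (Algorithms 1, 2 and 3 respectively).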
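The robust and soft-robust entropy-regularized targets differ only in how the per-model values are aggregated. The sketch below assumes a finite set of perturbed models, a minimum for the robust case and a uniform average for the soft-robust case; the function name and inputs are hypothetical, and the KL terms are assumed to be precomputed at each next state.

```python
import numpy as np


def kl_regularized_td_target(reward, next_q, next_kl, gamma, tau, mode="robust"):
    """TD target for an entropy-regularized critic.

    reward:  (B,) rewards from the nominal environment.
    next_q:  (K, B) target-network values, one row per perturbed model.
    next_kl: (K, B) KL(pi || pi_ref) evaluated at each next state.
    """
    regularized = next_q - tau * next_kl
    if mode == "robust":
        aggregated = regularized.min(axis=0)    # worst case over models
    elif mode == "soft_robust":
        aggregated = regularized.mean(axis=0)   # uniform average over models
    else:
        raise ValueError(f"unknown mode: {mode}")
    return reward + gamma * aggregated


reward = np.array([1.0, 0.0])
next_q = np.array([[2.0, 5.0], [3.0, 1.0], [4.0, 3.0]])
next_kl = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])

robust_target = kl_regularized_td_target(
    reward, next_q, next_kl, gamma=0.5, tau=0.5, mode="robust")
soft_target = kl_regularized_td_target(
    reward, next_q, next_kl, gamma=0.5, tau=0.5, mode="soft_robust")
```

With these inputs the soft-robust target is at least as large as the robust one in every batch entry, reflecting the less conservative aggregation.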
7 Experiments
We now present experiments on nine different continuous control domains from the DeepMind control suite (Tassa et al., 2018). In this section our agent optimizes for the entropy-regularized objective (non-robust, robust and soft-robust versions). This corresponds to (a) the non-robust EMPO baseline, (b) Robust EMPO (REMPO) and (c) Soft-Robust EMPO (SREMPO). From here on, it is assumed that the algorithms optimize for the entropy-regularized objective unless otherwise stated.
Appendix: In Appendix D.4, we present results of our agent optimizing for the expected return objective without entropy regularization (for the non-robust, robust and soft-robust versions). This corresponds to (a’) the non-robust MPO baseline, (b’) RMPO and (c’) SRMPO.
The experiments are divided into three sections. The first section details the setup for robust and soft-robust training. The next section compares robust and soft-robust performance to the non-robust MPO baseline in each of the nine domains. The final section is a set of investigative experiments to gain additional insights into the performance of the robust and soft-robust agents. This includes incorporating robustness into the Stochastic Value Gradients (SVG) algorithm (Heess et al., 2015a).
7.1 Setup
For each domain, the robust agent is trained using a pre-defined uncertainty set consisting of three task perturbations (we ran experiments on a larger set with similar results, but settled on three for computational efficiency). Each of the three perturbations corresponds to a particular perturbation of the Mujoco domain. For example, in Cartpole, the uncertainty set consists of three different pole lengths. Both the robust and non-robust agents are evaluated on a test set of three unseen task perturbations. In the Cartpole example, this would correspond to pole lengths that the agent has not seen during training. The chosen values of the uncertainty set and evaluation set for each domain can be found in Appendix D.3. Note that it is common practice to manually select the pre-defined uncertainty set and the unseen test environments. Practitioners often have significant domain knowledge and can utilize this when choosing the uncertainty set (Derman and Mannor, 2019; Derman et al., 2018; Di Castro et al., 2012; Mankowitz et al., 2018a; Tamar et al., 2014).
During training, the robust, soft-robust and non-robust agents act in an unperturbed environment which we refer to as the nominal environment. During the TD learning update, the robust agent calculates an infimum between Q values from each next state realization for each of the uncertainty set task perturbations (the soft-robust agent computes an average, which corresponds to a uniform distribution over $\mathcal{P}$, instead of an infimum). Each transition model is a different instantiation of the Mujoco task. The robust and soft-robust agents are exposed to more state realizations than the non-robust agent. However, as we show in our ablation studies, significantly increasing the number of samples and the diversity of the samples for the non-robust agent still results in poor performance compared to the robust and soft-robust agents.

7.2 Main Experiments
In this section, we compare the performance of non-robust MPO to the robust and soft-robust variants. Each training run consists of episodes and the experiments are repeated times. In the bar plots, the y-axis indicates the average reward (with standard deviation) and the x-axis indicates different unseen evaluation environment perturbations starting from the first perturbation (Env0) onwards. Increasing environment indices correspond to increasingly large perturbations. For example, in Figure 1 (top left), Env0, Env1 and Env2 for the Cartpole Balance task represent the pole perturbed to lengths of and meters respectively. Figure 1 shows the performance on six Mujoco domains (the remaining three domains are in Appendix D.4). The bar plots indicate the performance of EMPO (red), REMPO (blue) and SREMPO (green) on the held-out test perturbations. This color scheme is consistent throughout the experiments unless otherwise stated. As can be seen in each of the figures, REMPO attains improved performance over EMPO. This same trend holds true for all nine domains. SREMPO outperforms the non-robust baseline in all but the Cheetah domain, but is not able to outperform REMPO.

Appendix: The appendix contains additional experiments with the non-entropy-regularized versions of the algorithms where again the robust (RMPO) and soft-robust (SRMPO) versions of MPO outperform the non-robust version (MPO).
7.3 Investigative Experiments
This section investigates several questions that may help explain the performance of the robust and non-robust agents. Each investigative experiment is conducted on the Cartpole Balance and Pendulum Swingup domains.
What if we increase the number of training samples? One argument is that the robust agent effectively has access to more samples since it calculates the Bellman update using the infimum of three different environment realizations. To balance this effect, the non-robust agent was trained for three times more episodes than the robust agents. Training with significantly more samples does not increase the performance of the non-robust agent and can even decrease it, as a result of overfitting to the nominal domain. The results on Cartpole Balance and Pendulum Swingup can be found in Appendix D.5, Figure 11.
What about Domain Randomization? A related argument is that the robust agent sees more diverse examples than the non-robust agent, because it draws on each of the perturbed environments. We therefore trained the non-robust agent in a domain randomization setting (Andrychowicz et al., 2018; Peng et al., 2018) where three actors each operate in a perturbed environment (the same as the robust agents' uncertainty set). The TD errors are batch averaged as in domain randomization. As seen in the two left figures in Figure 4 for Cartpole and Pendulum respectively, the robust and soft-robust variants significantly outperform the domain randomization agent. A discussion as to why this is to be expected is in Section 8.
A larger test set: It is also useful to view the performance of the agent from the nominal environment to increasingly large perturbations in the unseen test set (see Appendix D.3 for values). These graphs can be seen in Figure 2 for Cartpole Balance and Pendulum Swingup respectively. As seen in the figures, the robust agent maintains a higher level of performance compared to the non-robust agent. The soft-robust agent outperforms the non-robust agent, but its performance degrades as the perturbations increase, which is consistent with the results of Derman et al. (2018). In addition, the robust and soft-robust agents are competitive with the non-robust agent in the nominal environment.
Modifying the uncertainty set: It is also interesting to evaluate the performance of the agent for different uncertainty sets. For Pendulum Swingup, the original uncertainty set values of the pendulum arm are and meters. We modified the final perturbation to values of and meters respectively. The agent is evaluated on unseen lengths of and meters. An increase in performance can be seen in Figure 3 as the third perturbation approaches that of the unseen evaluation environments. Thus it appears that if the agent is able to approximately capture the dynamics of the unseen test environments within the training set, then the robust agent is able to adapt to the unseen test environments. The results for Cartpole Balance can be seen in Appendix D.5, Figure 12.
What about incorporating Robustness into other algorithms? To show the generality of this robustness approach, we incorporate it into the critic of the Stochastic Value Gradient (SVG) continuous control RL algorithm (see Appendix D.1). As seen in Figure 4, Robust Entropy-regularized SVG (RESVG) and Soft RESVG (SRESVG) significantly outperform the non-robust Entropy-regularized SVG (ESVG) baseline in both Cartpole and Pendulum.
Robust entropy-regularized return vs. robust expected return: When comparing the robust entropy-regularized return performance to the robust expected return, we found that the entropy-regularized return appears to do no worse than the expected return, and in some cases, e.g., Cheetah, the entropy-regularized objective performs significantly better than the expected return (see Appendix D.5, Figure 10 for these results).
Different Nominal Models: In this paper the nominal model was always chosen as the smallest perturbation from the uncertainty set. This was done to highlight the strong performance of robust policies under increasingly large environment perturbations. However, what if we set the nominal model as the median or largest perturbation with respect to the chosen uncertainty set for each agent? As seen in Appendix D.5, Figure 13, the closer (further) the nominal model is to (from) the holdout set, the better (worse) the performance of the non-robust agent. However, in all cases, the robust agent still performs at least as well as (and sometimes better than) the non-robust agent.
8 Related Work
In previous work, a robust Bellman operator has been defined and used to develop a performance bound for robust value iteration (Iyengar, 2005). In this work, we extend this operator to the entropy-regularized setting for both the robust and soft-robust cases. Tamar et al. (2014) incorporate function approximation into the robust formulation to solve large-scale MDPs. They do so by introducing a robust dynamic programming technique based on a projected fixed point equation. Mankowitz et al. (2018a) learn robust options, also known as temporally extended actions (Sutton et al., 1999), using policy gradient, and prove convergence to a locally optimal solution. Morimoto and Doya (2005) learn a disturbance value function by solving a min-max game and augmenting the reward that the agent receives with a disturbance variable. Robust solutions tend to be overly conservative. To combat this, Derman et al. (2018) extend the actor-critic two-timescale stochastic approximation algorithm to a ‘soft-robust’ formulation to yield a less conservative solution. Di Castro et al. (2012) develop theory for a robust Kalman filter and introduce implementations for a deep robust Kalman filter as well as a robust implementation of Deep Q Networks (Mnih et al., 2015).

Domain Randomization (DR) (Andrychowicz et al., 2018; Peng et al., 2018) is a technique whereby an agent trains on different perturbations of the environment. The agent then batch averages the learning error of these trajectories together to yield an agent that is more robust to environment perturbations.
The intuitive difference between DR and robustness to model misspecification: DR performs a different type of generalization to that of learning a worst-case or soft-robust value function. Domain randomization can be seen as a form of data augmentation, as additional, more diverse data is added to the learning setup. On the other hand, the robust objective can be viewed as an adversarial training setup whereby the agent is constantly trying to learn a policy that can perform well under the perturbations of a worst-case adversary with respect to the next state and the given uncertainty set. The soft-robust agent is also adversarial, albeit less conservative, in that it attempts to perform well with respect to the average next state perturbation. It is this difference that enables the improved performance seen in the ‘Domain Randomization’ investigative experiment in Section 7.3.
9 Conclusion
We have presented a framework for incorporating robustness (to perturbations in the transition dynamics, which we refer to as model misspecification) into continuous control RL algorithms. We specifically focused on incorporating robustness into MPO as well as our entropy-regularized version of MPO (EMPO). In addition, we presented an experiment which incorporates robustness into the SVG algorithm. In each case, the robust and soft-robust variants outperformed the non-robust baselines. This framework is suited to continuous control algorithms that learn a value function, such as an actor-critic setup. From a theoretical standpoint, we adapted MPO to an entropy-regularized version (EMPO); we then incorporated robustness into the policy evaluation step of both algorithms to yield Robust MPO (RMPO) and Robust EMPO (REMPO). This was achieved by deriving the corresponding robust entropy-regularized Bellman operator to ensure that the policy evaluation step converges in each case. As seen in prior work (Derman et al., 2018; Di Castro et al., 2012; Mankowitz et al., 2018a), the robust agent can be overly conservative. We therefore also provide a less conservative soft-robust Bellman operator, show that it is a contraction, and use it to define an entropy-regularized soft-robust variant of MPO (SREMPO). We show that the robust versions outperform the non-robust counterparts on nine Mujoco domains. We provide investigative experiments to understand the robust and soft-robust policies in more detail, including an experiment indicating that robust agents outperform agents trained using domain randomization.
References
 Abdolmaleki et al. [2018a] A. Abdolmaleki, J. T. Springenberg, J. Degrave, S. Bohez, Y. Tassa, D. Belov, N. Heess, and M. A. Riedmiller. Relative entropy regularized policy iteration. CoRR, abs/1812.02256, 2018a. URL http://arxiv.org/abs/1812.02256.
 Abdolmaleki et al. [2018b] A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018b.
 Andrychowicz et al. [2018] M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.
 Christiano et al. [2016] P. F. Christiano, Z. Shah, I. Mordatch, J. Schneider, T. Blackwell, J. Tobin, P. Abbeel, and W. Zaremba. Transfer from simulation to real world through learning deep inverse dynamics model. CoRR, abs/1610.03518, 2016. URL http://arxiv.org/abs/1610.03518.
 Derman et al. [2018] E. Derman, D. J. Mankowitz, T. A. Mann, and S. Mannor. Soft-robust actor-critic policy-gradient. arXiv preprint arXiv:1803.04848, 2018.
 Derman and Mannor [2019] E. Derman, D. J. Mankowitz, T. A. Mann, and S. Mannor. A Bayesian approach to robust reinforcement learning. In Association for Uncertainty in Artificial Intelligence, 2019.
 Devin et al. [2017] C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine. Learning modular neural network policies for multi-task and multi-robot transfer. 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2169–2176, 2017.
 Di Castro et al. [2012] D. Di Castro, A. Tamar, and S. Mannor. Policy gradients with variance related risk criteria. arXiv preprint arXiv:1206.6404, 2012.
 Duan et al. [2016] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In ICML, 2016.
 Haarnoja et al. [2018] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine. Soft actor-critic algorithms and applications. CoRR, abs/1812.05905, 2018.
 Heess et al. [2015a] N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015a.
 Heess et al. [2015b] N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems 28 (NIPS). 2015b.
 Iyengar [2005] G. N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
 Kalashnikov et al. [2018] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine. Scalable deep reinforcement learning for vision-based robotic manipulation. In CoRL, 2018.
 Kingma and Welling [2013] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Mankowitz et al. [2018a] D. J. Mankowitz, T. A. Mann, P.L. Bacon, D. Precup, and S. Mannor. Learning robust options. In ThirtySecond AAAI Conference on Artificial Intelligence, 2018a.
 Mankowitz et al. [2018b] D. J. Mankowitz, A. Zídek, A. Barreto, D. Horgan, M. Hessel, J. Quan, J. Oh, H. van Hasselt, D. Silver, and T. Schaul. Unicorn: Continual learning with a universal, offpolicy agent. CoRR, abs/1802.08294, 2018b. URL http://arxiv.org/abs/1802.08294.
 Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Morimoto and Doya [2005] J. Morimoto and K. Doya. Robust reinforcement learning. Neural computation, 17(2):335–359, 2005.
 Nachum et al. [2017] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pages 2775–2785, 2017.
 Peng et al. [2018] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.
 Rastogi et al. [2018] D. Rastogi, I. Koryakovskiy, and J. Kober. Sample-efficient reinforcement learning via difference models. In Machine Learning in Planning and Control of Robot Motion Workshop at ICRA, 2018.
 Rezende et al. [2014] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.
 Riedmiller et al. [2018] M. A. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. V. de Wiele, V. Mnih, N. Heess, and J. T. Springenberg. Learning by playing - solving sparse reward tasks from scratch. In ICML, 2018.
 Rusu et al. [2016] A. A. Rusu, S. G. Colmenarejo, Çaglar Gülçehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell. Policy distillation. CoRR, abs/1511.06295, 2016.
 Schulman et al. [2017] J. Schulman, X. Chen, and P. Abbeel. Equivalence between policy gradients and soft qlearning. arXiv preprint arXiv:1704.06440, 2017.
 Sutton and Barto [1998] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. URL http://wwwanw.cs.umass.edu/~rich/book/thebook.html.
 Sutton and Barto [2018] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
 Sutton et al. [1999] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
 Tamar et al. [2014] A. Tamar, S. Mannor, and H. Xu. Scaling up robust mdps using function approximation. In International Conference on Machine Learning, pages 181–189, 2014.
 Tassa et al. [2018] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. P. Lillicrap, and M. A. Riedmiller. Deepmind control suite. CoRR, abs/1801.00690, 2018. URL http://arxiv.org/abs/1801.00690.
 Teh et al. [2017] Y. W. Teh, V. Bapst, W. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu. Distral: Robust multitask reinforcement learning. In NIPS, 2017.
 Tessler et al. [2016] C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor. A deep hierarchical approach to lifelong learning in minecraft. CoRR, abs/1604.07255, 2016. URL http://arxiv.org/abs/1604.07255.
 Wulfmeier et al. [2017] M. Wulfmeier, I. Posner, and P. Abbeel. Mutual alignment transfer learning. arXiv preprint arXiv:1707.07907, 2017.
Appendix A Proofs
Theorem 1.
Proof.
We follow the proofs from [Tamar et al., 2014, Iyengar, 2005], and adapt them to account for the additional entropy regularization for a fixed policy . Let , and an arbitrary state. Assume . Let be an arbitrary positive number.
By the definition of the operator, there exists such that,
(5) 
In addition, we have by definition that:
(6) 
Thus, we have,
(7) 
Applying a similar argument for the case results in
(8) 
Since is an arbitrary positive number, we establish the result, i.e.,
(9) 
∎
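The steps of the argument above can be written out as follows. This is a sketch under the assumption that the operator takes the form $T V(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}[r(s,a) - \tau \log(\pi(a \mid s)/\bar{\pi}(a \mid s)) + \gamma \inf_{p \in \mathcal{P}(s,a)} \mathbb{E}_{s' \sim p}[V(s')]]$; the symbols $p_{s,a}$ and $\epsilon$ are notation introduced here and may differ from the authors' equations (5) to (9).

```latex
% Assume T U(s) >= T V(s) and fix eps > 0.
% For every action a, near-optimality of some p_{s,a} in P(s,a) for V gives
\begin{align}
\mathbb{E}_{s' \sim p_{s,a}}[V(s')]
  &< \inf_{p \in \mathcal{P}(s,a)} \mathbb{E}_{s' \sim p}[V(s')] + \epsilon, \\
% while the infimum for U is bounded above by any particular model:
\inf_{p \in \mathcal{P}(s,a)} \mathbb{E}_{s' \sim p}[U(s')]
  &\le \mathbb{E}_{s' \sim p_{s,a}}[U(s')].
\end{align}
% The reward and KL terms do not depend on the value function and cancel, so
\begin{align}
0 \le T U(s) - T V(s)
  &= \gamma \, \mathbb{E}_{a \sim \pi(\cdot \mid s)}\Big[
       \inf_{p} \mathbb{E}_{s' \sim p}[U(s')]
     - \inf_{p} \mathbb{E}_{s' \sim p}[V(s')] \Big] \nonumber \\
  &< \gamma \, \mathbb{E}_{a \sim \pi(\cdot \mid s)}\Big[
       \mathbb{E}_{s' \sim p_{s,a}}[U(s') - V(s')] \Big] + \gamma \epsilon \nonumber \\
  &\le \gamma \, \lVert U - V \rVert_{\infty} + \gamma \epsilon .
\end{align}
% The symmetric case T U(s) <= T V(s) gives the same bound on |T U(s) - T V(s)|.
% Letting eps -> 0 and taking the maximum over s yields
\begin{equation}
\lVert T U - T V \rVert_{\infty} \le \gamma \, \lVert U - V \rVert_{\infty}.
\end{equation}
```

The entropy regularization plays no role in the bound itself, since it enters the operator as an additive term that is identical for both value functions.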
Theorem 2.
Proof.
We follow a similar argument to the proof of Theorem 1. Let , and an arbitrary state. Assume . Let be an arbitrary positive number. By definition of the operator, there exists such that,
(10) 
In addition, by the definition of the operator, there exists such that,
(11) 
Thus, we have,
(12) 
Applying a similar argument for the case results in
(13) 
Since is an arbitrary positive number, we establish the result, i.e.,
(14) 
∎
Corollary 1.
Let be the greedy policy after applying value iteration steps. The bound between the optimal value function and , the value function that is induced by , is given by, , where is the function approximation error, and is the initial value function.
Proof.
From Bertsekas (1996), we have the following proposition:
Lemma 1.
Let be the optimal value function, some arbitrary value function, the greedy policy with respect to , and the value function that is induced by . Thus,
(15) 
Next, define the maximum projected loss to be:
(16) 
We can now derive a bound on the loss between the optimal value function and the value function obtained after updates of value iteration (denoted by ) as follows:
(17) 
Then, using Lemma 1, we get:
(18) 
which establishes the result. ∎
Appendix B Soft-Robust Entropy-Regularized Bellman Operator
Theorem 3.
Proof.
For an arbitrary and for a fixed policy :
∎
Theorem 4.
Proof.
Let , and an arbitrary state. Assume . Let be an arbitrary positive number. By definition of the operator, there exists such that,
(19) 
Thus, we have,
(20) 
Applying a similar argument for the case results in
(21) 
Since is an arbitrary positive number, we establish the result, i.e.,
(22) 
∎
Appendix C Entropy-regularized Policy Evaluation
This section describes: (1) the modification of the TD update for the expected return to optimize for the entropy-regularized expected return, and (2) the additional modification to account for robustness.
We start with (1).
The entropy-regularized value function is defined as:
(23) 
and the corresponding entropy-regularized action value function is given by:
(24)  
(25) 
Next, we define:
(26) 
thus,
(27) 
Therefore, we have the following relationship:
(28) 
We now retrieve the TD update for the entropy-regularized action value function:
(29)  