1 Introduction
Robustness to environmental dynamics is an important topic in safe reinforcement learning, and it is crucial for making RL usable in the real world rather than only in games. For example, in autonomous driving, the driving agents have to adapt themselves to complex real-world scenarios, which usually cannot be fully covered during training. Typically, the driving agents are built in a simulated environment before deployment. However, due to the gap between the simulated environment and the real world, agents trained in simulation are inevitably suboptimal in real-world scenarios (Mannor et al., 2004, 2007), or may even fail. Therefore, learning policies that are robust to environmental dynamics is a challenging and urgent problem for safe reinforcement learning.
For robust reinforcement learning, existing methods fall into two branches. The first, borrowed from game theory, introduces an extra agent that disturbs the simulated environmental parameters during training (Atkeson & Morimoto, 2003; Morimoto & Doya, 2005; Pinto et al., 2017; Rajeswaran et al., 2016). This approach has to rely on a characterization of the environment. The second disturbs the current state through adversarial examples (Huang et al., 2017; Kos & Song, 2017; Lin et al., 2017; Mandlekar et al., 2017; Pattanaik et al., 2018), which is more heuristic. Unfortunately, both approaches lack a theoretical guarantee on the extent of robustness to the transition dynamics.
To address these issues, we design a Wasserstein constraint, which restricts the admissible transition probabilities to a Wasserstein ball centered at some reference transition dynamics. By applying the strong duality of the Wasserstein distance (Santambrogio, 2015; Blanchet & Murthy, 2019), we are able to connect the disturbance on transition dynamics with the disturbance on the current state. As a result, the original infinite-dimensional robust optimization problem is reduced to a finite-dimensional ordinary risk-aware RL problem. Through the modified optimal Bellman equation, we prove the existence of robust optimal policies, provide a theoretical analysis of the performance of optimal policies, and design a corresponding Wasserstein Robust Advantage Actor-Critic algorithm (WRAAC), which does not depend on a characterization of the environment. In the experiments, we verify the robustness and effectiveness of the proposed algorithm in the CartPole environment.
The remainder of this paper is organized as follows. In Section 2, we briefly introduce some related work on Markov Decision Processes. In Section 3, we describe the framework of Wasserstein robust reinforcement learning. In Section 4, we propose robust Advantage Actor-Critic algorithms according to the modified robust Bellman equation. In Section 5, we perform experiments on the CartPole environment to verify the effectiveness of our method. Finally, Section 6 concludes our study and discusses possible future work.
2 Related Work
In this section, we introduce some related work in the field of MDPs. In a robust MDP, the set of all possible transition kernels is called the uncertainty set, which can be defined in various ways. One choice is likelihood regions or entropy bounds on the environment parameters (White III & Eldeib, 1994; Nilim & El Ghaoui, 2005; Iyengar, 2005; Wiesemann et al., 2013); another is to constrain the deviation from a reference environment through some statistical distance. For example, Osogami (2012) discussed a robust problem where the uncertainty set is defined via the Kullback-Leibler divergence, and also uncovered the relations between robust MDPs using divergence constraints and risk-aware MDPs.
Indeed, it was observed that since the robust MDP framework ignores probabilistic information about the uncertainty set, it can provide conservative solutions (Delage & Mannor, 2010; Xu & Mannor, 2007). Some papers bring prior knowledge of the dynamics into robust MDPs, and name such problems distributionally robust MDPs. Xu & Mannor (2010) discuss robust MDPs in which prior information is used to estimate confidence regions of the parameters, which amounts to a moment-based constraint, and they also show that such distributionally robust problems can be reduced to standard robust MDP problems.
Yang (2017, 2018) uses the Wasserstein distance to measure the difference among prior distributions of transition probabilities. However, Yang's algorithms are not appropriate for complex situations, because they need to estimate enough transition kernels to approximate the prior distribution at each step.
3 Wasserstein Robust Reinforcement Learning
In this section, we specify the problem of interest, which is a minimax problem constrained by a Wasserstein-based uncertainty set. We start by introducing a general theoretical framework, i.e., the robust Markov Decision Process, and then briefly recall the definition of the Wasserstein distance between probability measures. Using the strong duality induced by the Wasserstein-based uncertainty set, the robust MDP is reformulated as a risk-aware MDP, which makes the connection between robustness to dynamics and robustness to states explicit.
3.1 Robust Markov Decision Process
Unlike ordinary Markov Decision Processes (MDPs), in robust MDP, environmental dynamics, including transition probabilities and rewards, might change over time (Nilim & El Ghaoui, 2004, 2005). Theoretically, such dynamics can be treated as stochastic changes within an uncertainty set. The objective of robust MDP is to find the optimal policy under the worst dynamics.
Given discrete-time robust MDPs with continuous state and action spaces, without loss of generality, we only consider the robustness to transition probabilities. Basic elements of robust MDPs include $(\mathcal{S}, \mathcal{A}, \mathcal{P}, c)$, where

$\mathcal{S}$: state space, which is a Borel measurable metric space.

$\mathcal{A}$: action space, which is a Borel measurable space. Let $\mathcal{A}(s)$ represent all the admissible actions at state $s$, and $\mathcal{K}$ denote all the possible state-action pairs, i.e., $\mathcal{K} = \{(s,a) : s \in \mathcal{S},\ a \in \mathcal{A}(s)\}$.

$\mathcal{P}$: the uncertainty set that contains all possible transition kernels.

$c$: $\mathcal{K} \to \mathbb{R}$, the immediate cost function. Generally we assume it is continuous and $0 \le c \le \bar{c}$ for some nonnegative constant $\bar{c}$.
The robust system evolves in the following way. Let $t$ denote the current time and $s_t \in \mathcal{S}$ the current state. The agent chooses an action $a_t \in \mathcal{A}(s_t)$ and the environment selects a transition kernel $P_t$ from the uncertainty set $\mathcal{P}$. Then at the next time $t+1$, the agent observes an immediate cost $c(s_t, a_t)$ and a new state $s_{t+1}$, which follows the distribution $P_t(\cdot \mid s_t, a_t)$. The process repeats at each stage and produces trajectories of the form $\omega = (s_0, a_0, P_0, s_1, a_1, P_1, \ldots)$. Let $\Omega$ denote the set of all trajectories. Let $H_t$ denote all trajectories $h_t$ up to time $t$, and $H_t^a$ denote all trajectories $h_t^a = (h_t, a_t)$ up to time $t$ with action $a_t$.
Correspondingly, a randomized policy is a series of stochastic kernels $\pi = (\pi_0, \pi_1, \ldots)$, where $\pi_t(\cdot \mid h_t)$ is a probability measure over $\mathcal{A}(s_t)$. We call $\pi$ the primal policy and use $\Pi$ to represent all such randomized policies. If $\pi_t(\cdot \mid h_t) = \pi_t(\cdot \mid s_t)$ for all $t$, we say the policy is Markov. If $\pi_t = \pi_{t'}$ for any $t, t'$, the policy is stationary. If there exist measurable functions $f_t : H_t \to \mathcal{A}$ such that $\pi_t(\{f_t(h_t)\} \mid h_t) = 1$, the policy is called deterministic. We denote the set of all deterministic, stationary, Markov policies by $\Pi^{D}$.
The selection of transition kernels can be treated as a deterministic policy deployed by a secondary adversarial agent. Let $\mu = (\mu_0, \mu_1, \ldots)$ with $\mu_t : H_t^a \to \mathcal{P}$ denote the adversarial policy. We use $\mathcal{M}$ to represent all such deterministic policies. Similarly, if $\mu_t(h_t^a) = \mu_t(s_t, a_t)$ for all $t$, the policy is Markov, and if $\mu_t = \mu_{t'}$ for any $t, t'$, the policy is stationary.
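Given the initial state $s$, primal policy $\pi$ and adversarial policy $\mu$, by the Ionescu-Tulcea theorem (Hernández-Lerma & Lasserre, 2012a; Bertsekas & Shreve, 2004), there exists a probability measure $\mathbb{P}_s^{\pi,\mu}$ on the trajectory space $(\Omega, \mathcal{F})$, where $\mathcal{F}$ is the $\sigma$-algebra of $\Omega$, that satisfies, for all measurable sets $B \subseteq \mathcal{A}$ and $C \subseteq \mathcal{S}$,

$\mathbb{P}_s^{\pi,\mu}(s_0 = s) = 1$,

$\mathbb{P}_s^{\pi,\mu}(a_t \in B \mid h_t) = \pi_t(B \mid h_t)$,

$\mathbb{P}_s^{\pi,\mu}(P_t = \mu_t(h_t^a) \mid h_t^a) = 1$,

$\mathbb{P}_s^{\pi,\mu}(s_{t+1} \in C \mid h_t^a, P_t) = P_t(C \mid s_t, a_t)$.
Let $\mathbb{E}_s^{\pi,\mu}$ denote the corresponding expectation operator. As the performance criterion, we consider the infinite-horizon discounted cost. Let $\gamma \in (0,1)$ be the discount factor. The discounted cost contributed by a trajectory $\omega$ is $\sum_{t=0}^{\infty} \gamma^t c(s_t, a_t)$. Given the initial state $s$ and policies $\pi$ and $\mu$, the expected infinite-horizon discounted cost is
(1) $V^{\pi,\mu}(s) := \mathbb{E}_s^{\pi,\mu}\Big[\sum_{t=0}^{\infty} \gamma^t c(s_t, a_t)\Big]$
Robust MDPs aim to find the optimal policy $\pi^*$ for the agent under the worst realization of $\mu$, which means that $\pi^*$ attains
(2) $V^*(s) := \inf_{\pi \in \Pi} \sup_{\mu \in \mathcal{M}} V^{\pi,\mu}(s)$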
This minimax problem can be seen as a zerosum game of two agents.
3.2 Wasserstein Distance
The popular Wasserstein distance is a special case of optimal transport costs, which measure the discrepancy between two probability measures in terms of the minimum total cost associated with some transport function. For any two probability measures $Q$ and $P$ over the measurable space $(\mathcal{S}, \mathcal{B}(\mathcal{S}))$, let $\Xi(Q, P)$ denote the set of all joint distributions on $\mathcal{S} \times \mathcal{S}$ with $Q$ and $P$ as respective marginals. Each element in $\Xi(Q, P)$ is called a coupling between $Q$ and $P$. Let $\kappa : \mathcal{S} \times \mathcal{S} \to [0, \infty)$ be the transport cost function between two positions, which is nonnegative, lower semicontinuous and satisfies $\kappa(x, y) = 0$ if and only if $x = y$. Intuitively, the quantity $\kappa(x, y)$ specifies the cost of transporting unit mass from $x$ in $\mathcal{S}$ to another element $y$ of $\mathcal{S}$. Then the optimal transport total cost associated with $\kappa$ is defined as follows:
$D_\kappa(Q, P) := \inf_{\xi \in \Xi(Q, P)} \int_{\mathcal{S} \times \mathcal{S}} \kappa(x, y)\, d\xi(x, y)$
Therefore, the optimal transport cost $D_\kappa(Q, P)$ corresponds to the lowest transport cost that can be obtained among all couplings between $Q$ and $P$. If the transport cost function $\kappa$ is some distance metric $d$ on $\mathcal{S}$, then $D_d$ is the Wasserstein distance of first order. The Wasserstein distance of order $p \ge 1$ is defined as:
$W_p(Q, P) := \Big( \inf_{\xi \in \Xi(Q, P)} \int_{\mathcal{S} \times \mathcal{S}} d(x, y)^p\, d\xi(x, y) \Big)^{1/p}$
Unlike the Kullback-Leibler divergence or other likelihood-based divergence measures, the Wasserstein distance is a proper metric on the space of probability measures. More importantly, the Wasserstein distance does not restrict probability measures to share the same support (Villani, 2008; Santambrogio, 2015). Let $\kappa = d^p$ and $\delta = \theta^p$; then the Wasserstein ball of order $p$ and the optimal-transport ball are identical:
$\{Q : W_p(Q, P) \le \theta\} = \{Q : D_{d^p}(Q, P) \le \theta^p\}$
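To make the definition concrete, the following minimal sketch computes $W_p$ between two empirical measures on the real line. It relies on a standard fact that is an assumption relative to this paper: for equally sized, uniformly weighted samples and cost $|x-y|^p$ with $p \ge 1$, the optimal coupling pairs sorted points. The function name `wasserstein_1d` and the sample values are hypothetical.

```python
def wasserstein_1d(xs, ys, p=1):
    """W_p between two empirical measures on the real line.

    Assumes both samples have the same size and uniform weights, in which
    case the optimal coupling matches the i-th smallest point of xs with the
    i-th smallest point of ys (valid for the cost |x - y|**p with p >= 1).
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    if p < 1:
        raise ValueError("order p must be at least 1")
    total = sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys)))
    return (total / len(xs)) ** (1.0 / p)


# Two samples with disjoint supports: a KL divergence would be infinite here,
# while the Wasserstein distance is simply the size of the shift.
reference = [0.0, 1.0, 2.0, 3.0]
shifted = [x + 0.5 for x in reference]
w1 = wasserstein_1d(reference, shifted, p=1)
w2 = wasserstein_1d(reference, shifted, p=2)
```

Both `w1` and `w2` equal the shift of 0.5, which illustrates why a Wasserstein ball can contain kernels whose support differs from that of the reference kernel.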
3.3 Optimal Solutions
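Let the uncertainty set $\mathcal{P}$ be a Wasserstein ball of order $p$ centered at some reference transition kernel $\bar{P}$:
(3) $\mathcal{P} := \{P : W_p(P(\cdot \mid s,a), \bar{P}(\cdot \mid s,a)) \le \theta,\ \forall (s,a) \in \mathcal{K}\}$
(4) $\phantom{\mathcal{P} :}= \{P : D_{d^p}(P(\cdot \mid s,a), \bar{P}(\cdot \mid s,a)) \le \theta^p,\ \forall (s,a) \in \mathcal{K}\}$
The radius $\theta$ (or, equivalently, $\theta^p$) reflects the extent of adversarial perturbation to the reference transition kernel $\bar{P}$. The difference between our theoretical framework and Yang (2017, 2018) is that our framework seeks the optimal solution for the worst transition kernel within the Wasserstein ball, while theirs seeks the optimal solution for the worst distribution over transition kernels.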
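Recalling the state value function (1) at state $s$ given primal policy $\pi$ and adversarial policy $\mu$, we can rewrite it recursively as follows,
$V^{\pi,\mu}(s) = \int_{\mathcal{A}(s)} \pi_0(da \mid s) \Big[ c(s,a) + \gamma \int_{\mathcal{S}} V^{\pi',\mu'}(s')\, P_0(ds' \mid s,a) \Big], \quad P_0 = \mu_0(s,a),$
where $\pi'$ and $\mu'$ are the shift policies. Since $c$ is continuous and bounded, the value function is continuous in $s$ and belongs to the space of bounded continuous functions on $\mathcal{S}$. Let $v : \mathcal{S} \to \mathbb{R}$ be a measurable, upper semicontinuous function with $\|v\|_\infty := \sup_{s} |v(s)| < \infty$, and let $\mathcal{U}$ denote the set of all such functions.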
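For state $s \in \mathcal{S}$ and action $a \in \mathcal{A}(s)$, consider the following operator defined on $\mathcal{U}$:
(5) $H_{s,a} v := c(s,a) + \gamma \sup_{P \in \mathcal{P}} \int_{\mathcal{S}} v(s')\, P(ds' \mid s,a)$
Applying the Lagrangian method and the strong duality property of the Wasserstein distance (Blanchet & Murthy, 2019), with $\bar{P}$ the reference kernel, $\theta$ the radius, $p$ the order and $d$ the metric on $\mathcal{S}$, we reformulate (5) into the following form:
(6) $H_{s,a} v = c(s,a) + \gamma \inf_{\lambda \ge 0} \Big\{ \lambda \theta^p + \int_{\mathcal{S}} \sup_{s' \in \mathcal{S}} \big[ v(s') - \lambda d^p(s', \bar{s}) \big]\, \bar{P}(d\bar{s} \mid s,a) \Big\}$
The significance of this strong dual representation lies in the fact that the supremum over transition kernels, $\sup_{P \in \mathcal{P}}$, in eq. (5) is replaced in eq. (6) by a pointwise supremum over states, $\sup_{s'}$, together with a scalar minimization over $\lambda$, which leads to a much easier optimization algorithm. The right-hand side of eq. (6) is a normal iterated-risk function. That is, the infinite-dimensional probability-searching problem (5) is reduced to an ordinary finite-dimensional optimization problem (6).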
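The dual form can be evaluated numerically on a small finite state space. The sketch below is illustrative only: states are the integers $0, \ldots, n-1$ with $d(i,j) = |i-j|$, the infimum over $\lambda$ is approximated by a grid search, and `dual_worst_case`, `lam_max`, the value vector and the reference distribution are all hypothetical choices rather than quantities from the paper.

```python
def dual_worst_case(v, ref_probs, theta, p=1, lam_max=50.0, steps=2000):
    """Grid approximation of
        inf_{lam >= 0} { lam * theta**p
                         + sum_j ref_probs[j] * max_i (v[i] - lam * |i - j|**p) }.

    States are the integers 0..n-1 with distance d(i, j) = |i - j|.
    The objective is convex in lam, so a fine grid on [0, lam_max] is enough
    for illustration; lam_max is a hypothetical cap, not a value from the paper.
    """
    n = len(v)
    best = float("inf")
    for k in range(steps + 1):
        lam = lam_max * k / steps
        inner = sum(
            ref_probs[j] * max(v[i] - lam * abs(i - j) ** p for i in range(n))
            for j in range(n)
        )
        best = min(best, lam * theta ** p + inner)
    return best


v = [0.0, 1.0, 4.0, 9.0, 16.0]      # hypothetical value function on 5 states
ref = [0.1, 0.6, 0.3, 0.0, 0.0]     # hypothetical reference next-state law
nominal = sum(q * x for q, x in zip(ref, v))
values = [dual_worst_case(v, ref, theta) for theta in (0.0, 0.5, 1.0, 4.0)]
```

With $\theta = 0$ the result coincides with the nominal expectation, and once $\theta$ reaches the diameter of the state space it equals $\max_i v_i$. For $\theta = 1$ the grid search returns 7.4, which matches the primal value obtained by moving all mass from state 2 and a fraction of the mass from state 1 to state 4 within the transport budget.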
It is easy to verify that maps to . Thus, given a state $s$ and an agent policy $\pi$, we have the following expected Bellman-form operator:
Similarly, maps to as well. Under the following Assumption 1, we are able to define the optimal iteration operator and show its contraction property.
Assumption 1.
is a compact metric space. For any , is compact and is lower semicontinuous on .
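Then, given an initial state $s \in \mathcal{S}$, the following optimal operator $T$ over $\mathcal{U}$ is well-defined.
(7) $(Tv)(s) := \inf_{a \in \mathcal{A}(s)} \Big\{ c(s,a) + \gamma \sup_{P \in \mathcal{P}} \int_{\mathcal{S}} v(s')\, P(ds' \mid s,a) \Big\}$
(8) $\phantom{(Tv)(s) :}= \inf_{a \in \mathcal{A}(s)} \Big\{ c(s,a) + \gamma \inf_{\lambda \ge 0} \Big\{ \lambda \theta^p + \int_{\mathcal{S}} \sup_{s' \in \mathcal{S}} \big[ v(s') - \lambda d^p(s', \bar{s}) \big]\, \bar{P}(d\bar{s} \mid s,a) \Big\} \Big\}$
It is simple to verify that $T$ maps $\mathcal{U}$ to $\mathcal{U}$. The contraction property of $T$ is shown in Lemma 1.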
Lemma 1.
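The optimal operator $T$ defined in (7)-(8) is a contraction operator in $\mathcal{U}$ under the supremum norm $\|\cdot\|_\infty$. There exists a unique element in $\mathcal{U}$, denoted by $v^*$, satisfying $T v^* = v^*$.
Proof.
(1) First, for $v_1, v_2 \in \mathcal{U}$, if $v_1 \le v_2$, it is easy to see that $T v_1 \le T v_2$, i.e., the operator $T$ is monotone.
(2) For any real constant $k$ and $v \in \mathcal{U}$, we can verify that $T(v + k) = T v + \gamma k$.
(3) For any $v_1, v_2 \in \mathcal{U}$, we have $v_1 \le v_2 + \|v_1 - v_2\|_\infty$. Combining (1) and (2), we have $T v_1 \le T v_2 + \gamma \|v_1 - v_2\|_\infty$, i.e., $T v_1 - T v_2 \le \gamma \|v_1 - v_2\|_\infty$. Exchanging the roles of $v_1$ and $v_2$ gives the reverse inequality. Thus $\|T v_1 - T v_2\|_\infty \le \gamma \|v_1 - v_2\|_\infty$. Furthermore, since $\gamma \in (0,1)$, the operator $T$ has the contraction property in $\mathcal{U}$ under the supremum norm.
(4) By the Banach fixed-point theorem, there exists a unique $v^* \in \mathcal{U}$ satisfying $T v^* = v^*$.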
∎
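For any $v \in \mathcal{U}$ and $n \ge 1$, let $T^n v := T(T^{n-1} v)$ with $T^0 v := v$. Due to the contraction, we have
(9) $\|T^n v - v^*\|_\infty \le \gamma^n \|v - v^*\|_\infty \to 0 \quad \text{as } n \to \infty,$
which indicates an iterative procedure for finding the optimal value function. Based on this optimal value function, we can demonstrate the existence of optimal policies, and single out an optimal policy that is deterministic, Markov and stationary, as shown in Theorem 1.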
Theorem 1.
There exists a deterministic Markov stationary policy that satisfies
Proof.
We have now obtained the existence of a unique robust optimal value function, as well as of a robust optimal primal policy that is deterministic, Markov and stationary. The theorem above also indicates that the choice of an 'optimal' adversarial policy can be fixed to some disturbed transition kernel during the process. Through an iterative procedure such as (9), we can design corresponding algorithms for robust reinforcement learning.
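The iterative procedure in (9) can be illustrated on a toy problem. The sketch below runs robust value iteration with the dual form of the optimal operator on a hypothetical five-state chain; the MDP (`cost`, `ref_kernel`), the $\lambda$ grid `LAMBDAS`, and the radius 0.3 are all assumptions made for illustration and are not taken from the paper.

```python
GAMMA, N = 0.9, 5
ACTIONS = (-1, +1)
LAMBDAS = [0.5 * k for k in range(41)]   # hypothetical grid on [0, 20]


def cost(s, a):
    """Immediate cost, bounded in [0, 1]; larger to the right of the chain."""
    return s / (N - 1)


def ref_kernel(s, a):
    """Reference next-state law: intended move w.p. 0.8, stay w.p. 0.2."""
    probs = [0.0] * N
    probs[min(max(s + a, 0), N - 1)] += 0.8
    probs[s] += 0.2
    return probs


def robust_backup(v, theta, p=1):
    """One application of the robust optimal operator in its dual form."""
    new_v = []
    for s in range(N):
        q_values = []
        for a in ACTIONS:
            probs = ref_kernel(s, a)
            worst = min(
                lam * theta ** p + sum(
                    probs[j] * max(v[i] - lam * abs(i - j) ** p for i in range(N))
                    for j in range(N) if probs[j] > 0.0)
                for lam in LAMBDAS)
            q_values.append(cost(s, a) + GAMMA * worst)
        new_v.append(min(q_values))       # the agent minimises cost
    return new_v


def value_iteration(theta, tol=1e-8, max_iter=10_000):
    """Iterate v <- T v from v = 0 until the sup-norm change is below tol."""
    v = [0.0] * N
    for _ in range(max_iter):
        new_v = robust_backup(v, theta)
        gap = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if gap < tol:
            break
    return v


v_nominal = value_iteration(theta=0.0)   # radius 0 recovers the ordinary MDP
v_robust = value_iteration(theta=0.3)
```

Because the adversary can only increase the expected cost-to-go, `v_robust` dominates `v_nominal` state by state, and each backup shrinks sup-norm differences by at least the factor `GAMMA`, in line with Lemma 1.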
3.4 Sensitivity Analysis and Penalty Term
Before going to the algorithm design, we present a sensitivity analysis of the optimal value function w.r.t. the radius $\theta$ and the Wasserstein order $p$. Let $a^*$ and $\lambda^*$ be a solution of equation (8); the penalty term $\lambda^*$ is nonnegative.
If $\lambda^* = 0$, which means the worst transition kernel lies within our fixed Wasserstein ball, equation (8) reduces to an ordinary problem: $v^*(s) = \inf_{a \in \mathcal{A}(s)} \{ c(s,a) + \gamma \sup_{s' \in \mathcal{S}} v^*(s') \}$.
Thus $v^*$ has nothing to do with $\theta$ or $p$.
If $\lambda^* > 0$, via the envelope theorem, the gradient of the optimal value function w.r.t. $\theta$ can be calculated as follows.
(10) 
This gradient remains positive. That is, the optimal value function increases as the volume of the Wasserstein ball increases (remember that $\lambda^* > 0$ and the value function represents the discounted cost). Similarly, via the envelope theorem, the gradient w.r.t. $p$ can be calculated as follows.
(11) 
Since , the worst transition kernel satisfies , i.e. . (Footnote 1: derived from the fact that if , there must be .) Notice that calculating for is actually trying to find an optimal transport map , which substantially perturbs to . Recall that is upper semicontinuous and its domain is compact, so we can regard as the Kantorovich potential (Villani, 2008) for a transport cost function in the transport from to . For , is strictly convex. Through Theorem 1.17 in Santambrogio (2015), we can represent the optimal transport map in an explicit way, as well as the gradient over when .
(12) 
If , the gradient over is nonpositive. Remember that actually reflects the extent of robustness, i.e., smaller coincides with larger radius and larger coincides with smaller radius . Intuitively, when the volume of the Wasserstein ball is very large, larger is preferred.
If is large enough and , the gradient over is nonnegative. Intuitively, when the volume of the Wasserstein ball is very small, the extent of perturbation at each point is small with high probability, making the gradient (11) positive. Thus, in such a situation, smaller is preferred.
Furthermore, the condition that actually indicates smoothness of the function . According to the key convex analysis in Sinha et al. (2017), when , calculating is a strongly concave optimization problem, which can be solved efficiently. If the extent of robustness in the experiment is completely unknown to us, we recommend choosing small enough, i.e., larger according to the specific experimental scenario.
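The strongly concave inner problem mentioned above can be sketched with plain gradient ascent. The example assumes order $p = 2$ with the squared Euclidean cost and a differentiable critic $v$; when the gradient of $v$ is $L$-Lipschitz and $2\lambda > L$, the objective $v(x) - \lambda \|x - \bar{s}\|^2$ is strongly concave. The function `worst_case_state`, the step size, and the linear critic are hypothetical choices, and the linear case is used only because its maximiser is known in closed form.

```python
def worst_case_state(grad_v, s_bar, lam, lr=0.05, iters=500):
    """Gradient ascent on  phi(x) = v(x) - lam * ||x - s_bar||**2  (order p = 2).

    grad_v(x) returns the gradient of a differentiable critic v at x.
    The ascent starts from the reference next state s_bar.
    """
    x = list(s_bar)
    for _ in range(iters):
        g = grad_v(x)
        x = [xi + lr * (gi - 2.0 * lam * (xi - si))
             for xi, gi, si in zip(x, g, s_bar)]
    return x


# Hypothetical linear critic v(x) = w . x, whose maximiser is known in closed
# form: x* = s_bar + w / (2 * lam).
w = [1.0, -2.0]
s_bar = [0.5, 0.5]
lam = 4.0
x_star = worst_case_state(lambda x: w, s_bar, lam)
closed_form = [si + wi / (2.0 * lam) for si, wi in zip(s_bar, w)]
```

A larger penalty `lam` keeps the perturbed state closer to `s_bar`, which is the mechanism by which the penalty term controls the extent of the state perturbation.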
4 Wasserstein Robust Advantage Actor-Critic Algorithms
In reinforcement learning, the agent does not know the precise environment dynamics, i.e., the transition kernel and the immediate cost function are unknown. Some researchers leverage an adversarial agent to inject perturbations into environmental parameters during training (Pinto et al., 2017). However, such methods have to work with predefined environmental parameters, and they lack a quantified notion of robustness to transition kernels. Other researchers borrow the idea of adversarial examples and disturb observed states in a heuristic way (Nguyen et al., 2015). These methods also fail to explain robustness to system dynamics.
Following the analysis in Section 3, we develop a robust Advantage Actor-Critic algorithm: a critic neural network with parameters , denoted by , is employed to estimate the value function; and an actor neural network with parameters , denoted by , is designed as the primal policy. Rewrite equation (8):
Update for and : Let where , . Given and , denote
Initially, can be treated as the maximum perturbation to state , given the penalty . The gradient of over is:
Specially, when , we have . However, if , we can directly get according to eq. (3.4), and when .
Let , and we get
Combining the envelope theorem, we can obtain the gradient of w.r.t. :
The expectation in the gradient can be approximated by Monte Carlo: take action at state for times; under the reference transition kernel , observe the next states and quadruples , ; and then we can approximate
As for the initial value of , we recommend some value larger than , i.e., larger than . According to Theorem 3 in Blanchet & Murthy (2019), the worst-case probabilities have the following form: , where is some closed set, is the optimal penalty term for value function , and means the lowest transport cost possible in transporting unit mass from to some in the set . We can characterize as . According to this characterization, with a large probability, the extent of perturbation at a single point does not exceed , and we can set the initial value of larger than (i.e., ).
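The penalty update can be sketched as follows. By the envelope theorem, the derivative of the dual objective $\lambda \theta^p + \mathbb{E}[\sup_{x}(v(x) - \lambda d^p(x, \bar{s}))]$ with respect to $\lambda$ is $\theta^p - \mathbb{E}[d^p(x^*(\bar{s}), \bar{s})]$, and the expectation can be replaced by a Monte Carlo average over sampled next states. Everything concrete below is an assumption made for illustration: the one-dimensional linear critic with slope `W`, the Gaussian reference samples, the batch size, and the step size 0.5; the paper's own update rule may differ in form.

```python
import random


def lambda_gradient(next_states, perturb, lam, theta, p=2):
    """Monte Carlo estimate of d/d(lam) of the dual objective.

    By the envelope theorem the derivative is
        theta**p - E[ d(x*(s_bar), s_bar)**p ],
    where x*(s_bar) maximises v(x) - lam * d(x, s_bar)**p.
    """
    moved = [abs(perturb(s, lam) - s) ** p for s in next_states]
    return theta ** p - sum(moved) / len(moved)


# Hypothetical 1-D linear critic v(x) = W * x with p = 2, for which the inner
# maximiser is x* = s_bar + W / (2 * lam).
W, THETA = 2.0, 0.5


def perturb(s_bar, lam):
    return s_bar + W / (2.0 * lam)


rng = random.Random(0)
lam = 5.0                                   # start above the optimum
for step in range(2000):
    # Stand-in for next states observed under the reference kernel.
    batch = [rng.gauss(0.0, 1.0) for _ in range(16)]
    grad = lambda_gradient(batch, perturb, lam, THETA)
    lam = max(lam - 0.5 * grad, 1e-6)       # projected step keeps lam positive

lam_closed_form = abs(W) / (2.0 * THETA)    # minimiser of lam*THETA**2 + W**2/(4*lam)
```

Starting from a large initial penalty and decreasing it is consistent with the recommendation above, since a large penalty keeps early perturbations small.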
Critic Update Rule: Given state and policy , let
To calculate , similarly, we leverage Monte Carlo: take actions at the same state for times, observe "state-action" pairs , and then approximate
Let , and denote the difference between the observed cost and the critic network:
Through the envelope theorem, we can obtain the following gradient of w.r.t. :
Notice that we should actually update the critic network via minimizing , and the gradient is
In practice, we usually can let to obtain faster convergence.
Actor Update Rule: In classical AC algorithms, directly minimizing the "state-action" value function may cause large variance and slow convergence, and optimizing the advantage function is a better choice instead. The advantage function is
Thus we can find the optimal via minimizing the expected advantage function. Similarly, we can approximate the gradient of w.r.t. as follows:
(13) 
Finally, we obtain a corresponding robust Advantage Actor-Critic algorithm. We name it the Wasserstein Robust Advantage Actor-Critic algorithm with order , described in Algorithm 1 and Algorithm 2. Algorithm 1 is an inner loop that certifies the extent of perturbations, while Algorithm 2 finds the optimal policy in the usual way. Let the learning rates satisfy the Robbins-Monro condition (Robbins & Monro, 1951), and , , , and via the multi-timescale theory (Borkar, 2008), the convergence to a local minimum can be guaranteed.
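As a small illustration of the Robbins-Monro condition, the sketch below uses step sizes $\alpha_t = 1/(t+1)^{e}$, which satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ for $0.5 < e \le 1$. The exponents and the assignment of schedules to components are hypothetical; the text above does not specify which update runs on which timescale.

```python
def step_size(t, exponent=1.0):
    """alpha_t = 1 / (t + 1)**exponent; Robbins-Monro holds for 0.5 < exponent <= 1."""
    return 1.0 / (t + 1) ** exponent


def robbins_monro_mean(samples):
    """Stochastic approximation of E[X]: x <- x + alpha_t * (sample - x)."""
    x = 0.0
    for t, sample in enumerate(samples):
        x += step_size(t) * (sample - x)
    return x


data = [2.0, 4.0, 9.0, 1.0]
estimate = robbins_monro_mean(data)   # with alpha_t = 1/(t+1) this is the running average

# Two timescales: the ratio of the slower to the faster step size tends to
# zero, so the faster iterate sees the slower one as quasi-static.
ratios = [step_size(t, 1.0) / step_size(t, 0.6) for t in (1, 10, 100, 1000)]
```

In a multi-timescale scheme each component (perturbation, penalty, critic, actor) would get its own exponent so that the step-size ratios between consecutive components vanish, as `ratios` does here.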
5 Experiments
In this section, we verify the WRAAC algorithm in the CartPole environment (https://gym.openai.com/envs/CartPolev0/). The state space has four dimensions: cart position, cart velocity, pole angle and pole velocity at tip. There are only two admissible actions: left or right. The goal is to prevent the pole from falling over.
Our baseline is the ordinary Advantage Actor-Critic algorithm. Policies are learnt under the default environment for both WRAAC and the ordinary A2C (baseline). Then, we test the performance of these two learned policies under different environmental dynamics for times. We change simulated environmental parameters such as the magnitude of force (default value ) or the pole mass (default value ) to emulate different test dynamics. Note that a unit change in different environmental parameters results in different extents of change in the dynamics.
We apply the WRAAC algorithm of order , and fix the degree of dynamical robustness at . For each quadruple , if is not the last state of the trajectory, we set the initial to be and the initial to be (designed according to the simulated dynamics of CartPole). If is the last state, we set and . The baseline policy and WRAAC are tested in environments with different magnitudes of force or different pole masses, as shown in Figure 1 and Figure 2. The solid line is the mean number of survived steps and the shaded area around it reflects the standard deviation.
Remember that different parameters in the CartPole environment have different effects on the dynamics. Figure 1 indicates that our robust algorithm performs better than the baseline. When the perturbation of the parameter reaches some level, such as , our robust policy keeps the pole from falling over for a longer time, which indicates that our algorithm does learn some level of robustness compared with the baseline. If the perturbation of the parameter is small, our algorithm performs as well as the baseline. Furthermore, the relatively smooth curve of the baseline indicates that the magnitude of force influences the environmental dynamics in a smooth way, which makes this parameter a good choice for verifying policy robustness.
In Figure 2, our robust algorithm and the baseline both plunge when the pole mass exceeds , which indicates a sudden change of the environmental dynamics. The pole mass parameter is probably not a good choice to reflect dynamical changes.
Our experiments reflect the fact that environmental parameters probably have different influences on environmental dynamics, and we need to choose them properly in order to reflect a smooth change of environmental dynamics. The results also show that WRAAC does learn some robustness to environmental dynamics.
6 Conclusions
In this paper, we investigate robust reinforcement learning with a Wasserstein constraint. The derived theoretical framework can be reformulated into a tractable iterated-risk-aware problem, and the theoretical guarantee is then obtained by building a connection between robustness to transition probabilities and robustness to states. Subsequently, we demonstrate the existence of optimal policies, provide a sensitivity analysis to reveal the effects of the uncertainty set, and design a two-stage learning algorithm, WRAAC. The experimental results on the CartPole environment verify the effectiveness and robustness of our proposed approach.
Future work may include a complete study of the effects of the radius of the Wasserstein ball in our WRAAC algorithm. We are also interested in studying robust policy improvement in a data-driven situation where we only have access to a set of collected trajectories.
References
 Atkeson & Morimoto (2003) Atkeson, C. G. and Morimoto, J. Nonparametric representation of policies and value functions: A trajectory-based approach. In Advances in neural information processing systems, pp. 1643–1650, 2003.
 Bertsekas & Shreve (2004) Bertsekas, D. P. and Shreve, S. Stochastic optimal control: the discrete-time case. 2004.
 Blanchet & Murthy (2019) Blanchet, J. and Murthy, K. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 2019.
 Borkar (2008) Borkar, V. S. Stochastic approximation: a dynamical systems viewpoint. Cambridge University Press, 2008.
 Delage & Mannor (2010) Delage, E. and Mannor, S. Percentile optimization for markov decision processes with parameter uncertainty. Operations research, 58(1):203–213, 2010.
 Esfahani & Kuhn (2018) Esfahani, P. M. and Kuhn, D. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115–166, 2018.
 Gao & Kleywegt (2016) Gao, R. and Kleywegt, A. J. Distributionally robust stochastic optimization with wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.
 Hernández-Lerma & Lasserre (2012a) Hernández-Lerma, O. and Lasserre, J. B. Discrete-time Markov control processes: basic optimality criteria, volume 30. Springer Science & Business Media, 2012a.
 Hernández-Lerma & Lasserre (2012b) Hernández-Lerma, O. and Lasserre, J. B. Further topics on discrete-time Markov control processes, volume 42. Springer Science & Business Media, 2012b.
 Huang et al. (2017) Huang, S., Papernot, N., Goodfellow, I., Duan, Y., and Abbeel, P. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
 Iyengar (2005) Iyengar, G. N. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
 Kos & Song (2017) Kos, J. and Song, D. Delving into adversarial attacks on deep policies. arXiv preprint arXiv:1705.06452, 2017.
 Lin et al. (2017) Lin, Y.-C., Hong, Z.-W., Liao, Y.-H., Shih, M.-L., Liu, M.-Y., and Sun, M. Tactics of adversarial attack on deep reinforcement learning agents. arXiv preprint arXiv:1703.06748, 2017.
 Mandlekar et al. (2017) Mandlekar, A., Zhu, Y., Garg, A., Fei-Fei, L., and Savarese, S. Adversarially robust policy learning: Active construction of physically-plausible perturbations. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3932–3939. IEEE, 2017.
 Mannor et al. (2004) Mannor, S., Simester, D., Sun, P., and Tsitsiklis, J. N. Bias and variance in value function estimation. In Proceedings of the twenty-first international conference on Machine learning, pp. 72. ACM, 2004.
 Mannor et al. (2007) Mannor, S., Simester, D., Sun, P., and Tsitsiklis, J. N. Bias and variance approximation in value function estimates. Management Science, 53(2):308–322, 2007.
 Morimoto & Doya (2005) Morimoto, J. and Doya, K. Robust reinforcement learning. Neural computation, 17(2):335–359, 2005.
 Nguyen et al. (2015) Nguyen, A., Yosinski, J., and Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 427–436, 2015.
 Nilim & El Ghaoui (2004) Nilim, A. and El Ghaoui, L. Robustness in markov decision problems with uncertain transition matrices. In Advances in Neural Information Processing Systems, pp. 839–846, 2004.
 Nilim & El Ghaoui (2005) Nilim, A. and El Ghaoui, L. Robust control of markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
 Osogami (2012) Osogami, T. Robustness and risk-sensitivity in markov decision processes. In Advances in Neural Information Processing Systems, pp. 233–241, 2012.
 Pattanaik et al. (2018) Pattanaik, A., Tang, Z., Liu, S., Bommannan, G., and Chowdhary, G. Robust deep reinforcement learning with adversarial attacks. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2040–2042. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
 Pinto et al. (2017) Pinto, L., Davidson, J., Sukthankar, R., and Gupta, A. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2817–2826. JMLR.org, 2017.
 Rajeswaran et al. (2016) Rajeswaran, A., Ghotra, S., Ravindran, B., and Levine, S. Epopt: Learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283, 2016.
 Robbins & Monro (1951) Robbins, H. and Monro, S. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.
 Santambrogio (2015) Santambrogio, F. Optimal transport for applied mathematicians. Birkhäuser, NY, pp. 99–102, 2015.
 Sinha et al. (2017) Sinha, A., Namkoong, H., and Duchi, J. Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.
 Villani (2008) Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
 White III & Eldeib (1994) White III, C. C. and Eldeib, H. K. Markov decision processes with imprecise transition probabilities. Operations Research, 42(4):739–749, 1994.
 Wiesemann et al. (2013) Wiesemann, W., Kuhn, D., and Rustem, B. Robust markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.
 Xu & Mannor (2007) Xu, H. and Mannor, S. The robustness-performance tradeoff in markov decision processes. In Advances in Neural Information Processing Systems, pp. 1537–1544, 2007.
 Xu & Mannor (2010) Xu, H. and Mannor, S. Distributionally robust markov decision processes. In Advances in Neural Information Processing Systems, pp. 2505–2513, 2010.
 Yang (2017) Yang, I. A convex optimization approach to distributionally robust markov decision processes with wasserstein distance. IEEE control systems letters, 1(1):164–169, 2017.
 Yang (2018) Yang, I. Wasserstein distributionally robust stochastic control: A data-driven approach. arXiv preprint arXiv:1812.09808, 2018.