I Introduction
Numerous deep reinforcement learning (RL) algorithms have appeared over the last decade [1, 2, 3, 4, 5, 6, 7], and their applications have demonstrated strong performance in a range of challenging domains such as games [8], robotic control [9] and autonomous driving [10, 11]. Mainstream RL algorithms focus on optimizing policies based on their performance in the training environment, without considering how well they transfer to situations never encountered during training. Studies have shown that this can reduce the generalization ability of the learned policy [12, 13]. Intelligent agents such as autonomous vehicles usually need to cope with many situations, including unknown scenarios.
A straightforward technique to improve the generalization ability of RL is to train on a set of randomized environments. By randomizing the dynamics of the simulation environment during training, the developed policies are capable of adapting to the different dynamics encountered during training [14]. Furthermore, the works of [15] and [16] directly add noise to state observations to provide adversarial perturbations [17]. However, these approaches can only handle common disturbances; some catastrophic events are highly unlikely to be encountered when the environment is left unchanged or when its parameters or dynamics are only randomly perturbed.
Alternative techniques to improve generalization include risk-sensitive policy learning. Generally, risk is related to the stochasticity of the environment and to the fact that even an optimal policy (in terms of expected return) may perform poorly in some cases. That is, optimizing an expected long-run objective such as the infinite-horizon discounted or average reward alone is not sufficient to avoid rare occurrences of large negative outcomes in many practical applications. Instead, risk-sensitive policy learning includes a risk measure in the optimization process, either as the objective or as a constraint [18, 19]. This formulation seeks not only to maximize the expected reward but also to reduce the risk criterion, such that the trained policy can still work in a varying environment. Concretely, the risk is usually modeled as the variance of the return, and the most representative algorithms include the mean-variance trade-off method [20] and percentile optimization methods [21]. However, the existing methods can only model the risk by discretely sampling some trajectories from randomized environments, rather than learning the exact return distribution.
To improve the generalization ability of RL across different kinds of disturbances, the minimax formulation, or worst-case formulation, has been widely adopted. Morimoto et al. (2005) first combined H-infinity control with RL to learn a robust policy, which is the prototype of most existing worst-case formulations of RL algorithms [22]. They formulated a differential game in which a disturbing agent tries to cause the worst possible disturbance while a control agent tries to produce the best control input. This problem was reduced to finding a minimax solution of a value function that took into account the norm of the output deviation and the norm of the disturbance. Subsequently, Pinto et al. (2017) extended this work with deep neural networks and proposed the Robust Adversarial Reinforcement Learning (RARL) algorithm, in which two policies are trained simultaneously: the protagonist policy to optimize performance and the adversarial policy to provide disruption. The protagonist and adversary policies are trained alternately, with one fixed while the other adapts [23]. Pan et al. (2019) introduced a risk-aware framework into RARL to prevent rare, catastrophic events such as automotive accidents [24]. To that end, the risk was modeled as the variance of value functions: the protagonist policy should not only maximize the expected reward but also select actions with low variance. Conversely, the adversary policy aims to minimize the long-term expected reward and select actions with high variance, so that it actively seeks catastrophic outcomes. To obtain the risk of the value function, they use an ensemble of Q-value networks, trained in parallel, to estimate the variance. Risk-aware RARL is effective and sometimes even crucial, especially in safety-critical systems like autonomous driving. However, the existing methods can only handle discrete and low-dimensional action spaces and, even worse, the value function must be divided into multiple discrete intervals in advance. This is inconvenient because different tasks usually require different numbers of intervals.
In this paper, we propose a new RL algorithm to improve the generalization ability of the learned policy. In particular, the learned policy should not only succeed in the training environment but also cope with situations never encountered before. To that end, we adopt the minimax formulation, which augments standard RL with an adversarial policy, to develop a minimax variant of the Distributional Soft Actor-Critic (DSAC) algorithm [25], called Minimax DSAC. We choose DSAC as the basis of our algorithm not only because it is a state-of-the-art RL algorithm, but also because it directly learns a continuous distribution of returns, which enables us to model return variance as risk explicitly. By modeling risk, we can train stronger adversaries, and through competition the protagonist policy gains a greater ability to cope with environmental changes. Additionally, we apply Minimax DSAC to autonomous decision-making at intersections and evaluate the trained policy in environments different from the training environment. The results show that our algorithm maintains good performance even when the environment changes drastically.
The rest of the paper is organized as follows. Section II states the preliminaries of standard RL, two-player zero-sum RL and the maximum entropy RL framework. Section III introduces the combination of the minimax formulation and the distributional method, and Section IV presents the implementation of the proposed method, Minimax DSAC. Section V introduces the simulation scenarios, presents the results and evaluates the trained model. Section VI summarizes the major contributions and concludes this paper.
II Preliminaries
Before delving into the details of our algorithm, we first introduce notation and summarize the standard RL setting, the two-player RL setting and the maximum entropy RL setting.
II-A Standard reinforcement learning
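Reinforcement Learning (RL) is designed to solve sequential decision-making tasks wherein the agent interacts with the environment. Formally, we consider an infinite-horizon discounted Markov Decision Process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ is a continuous set of states and $\mathcal{A}$ is a continuous set of actions, $p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the transition probability distribution, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor. In each time step $t$, the agent receives a state $s_t \in \mathcal{S}$ and selects an action $a_t \in \mathcal{A}$, and the environment returns the next state $s_{t+1} \in \mathcal{S}$ with probability $p(s_{t+1}|s_t, a_t)$ and a scalar reward $r_t = r(s_t, a_t)$. We use $\rho_{\pi}(s)$ and $\rho_{\pi}(s, a)$ to denote the state and state-action distributions induced by policy $\pi$ in the environment. For the sake of simplicity, the current and next state-action pairs are also denoted as $(s, a)$ and $(s', a')$, respectively.
The goal in standard RL is to learn a policy which maximizes the expected future accumulated return. Let $\pi(a|s)$ denote a stochastic policy $\pi: \mathcal{S} \to \mathcal{P}(\mathcal{A})$, where $\mathcal{P}(\mathcal{A})$ is the set of probability distributions over $\mathcal{A}$. The action value function $Q^{\pi}(s, a)$ is the expected sum of discounted rewards from a state-action pair $(s, a)$ when following policy $\pi$:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\Big[\sum_{l=0}^{\infty} \gamma^{l} r_{t+l} \,\Big|\, s_t = s, a_t = a\Big] \qquad (1)$$
where $r_{t+l} = r(s_{t+l}, a_{t+l})$ and the expectation $\mathbb{E}_{\pi}$ is taken over trajectories generated by following $\pi$. Intuitively, the optimal policy can be defined as:
$$\pi^{*} = \arg\max_{\pi} Q^{\pi}(s, a), \quad \forall (s, a) \in \mathcal{S} \times \mathcal{A}$$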
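The discounted sum inside the action value function can be made concrete with a small sketch. It assumes finite sampled reward sequences that all start from the same state-action pair; the function names `discounted_return` and `monte_carlo_q` are illustrative and do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_l gamma**l * rewards[l] for one finite sampled trajectory."""
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def monte_carlo_q(trajectories, gamma):
    """Average the discounted returns of trajectories sharing the same (s, a).

    This is a plain Monte Carlo estimate of the action value, used here only
    as an illustration of the definition.
    """
    returns = [discounted_return(rewards, gamma) for rewards in trajectories]
    return sum(returns) / len(returns)
```

With `gamma = 0.5`, the reward sequence `[1.0, 1.0, 1.0]` gives 1 + 0.5 + 0.25 = 1.75.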
II-B Two-player zero-sum RL
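Two-player zero-sum RL can be expressed as a two-player zero-sum MDP, defined by the tuple $(\mathcal{S}, \mathcal{A}, \bar{\mathcal{A}}, p, r, \gamma)$, where $\mathcal{A}$ and $\bar{\mathcal{A}}$ are the continuous action sets available to the two players. In this setting, the action value function is constructed on two policies, called the protagonist policy $\pi$ and the adversary policy $\bar{\pi}$. Given the current state $s$, the protagonist policy takes action $a$, the adversary policy takes action $\bar{a}$, and the environment then reaches the next state $s'$. The two policies obtain opposite rewards: the protagonist gets a reward $r(s, a, \bar{a})$ while the adversary gets a reward $-r(s, a, \bar{a})$ at each time step.
$$Q^{\pi,\bar{\pi}}(s, a, \bar{a}) = \mathbb{E}_{\pi,\bar{\pi}}\Big[\sum_{l=0}^{\infty} \gamma^{l} r_{t+l} \,\Big|\, s_t = s, a_t = a, \bar{a}_t = \bar{a}\Big] \qquad (2)$$
where $r_{t+l}$ denotes $r(s_{t+l}, a_{t+l}, \bar{a}_{t+l})$ and $\mathbb{E}_{\pi,\bar{\pi}}$ denotes the expectation over trajectories generated by $\pi$ and $\bar{\pi}$, for short. Similarly, the optimal action value function should satisfy the Minimax Bellman Optimality Equation:
$$Q^{*}(s, a, \bar{a}) = r(s, a, \bar{a}) + \gamma\, \mathbb{E}_{s' \sim p}\Big[\max_{a'} \min_{\bar{a}'} Q^{*}(s', a', \bar{a}')\Big] \qquad (3)$$
Intuitively, the protagonist policy seeks to maximize the long-term expected reward while the adversarial policy seeks to minimize it. The optimal policy can be defined as the Nash Equilibrium of the two policies:
$$Q^{*}(s, a^{*}, \bar{a}^{*}) = \max_{a} \min_{\bar{a}} Q^{*}(s, a, \bar{a}) = \min_{\bar{a}} \max_{a} Q^{*}(s, a, \bar{a}) \qquad (4)$$
where the optimal actions $a^{*}$ and $\bar{a}^{*}$ come from the optimal policies $\pi^{*}$ and $\bar{\pi}^{*}$, respectively.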
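As an illustration of the worst-case reasoning behind the Minimax Bellman Optimality Equation, the sketch below computes a maximin value on a small table. It makes two simplifying assumptions that the paper does not: both players have finite action sets, and both play pure strategies. The names `maximin_action` and `minimax_target` are hypothetical.

```python
def maximin_action(q_table):
    """Pick the protagonist action with the best worst-case value.

    q_table[i][j] holds Q(s, a_i, abar_j) for one fixed state s.
    Returns (index of the chosen protagonist action, its worst-case value).
    """
    # For each protagonist action, the adversary answers with the column
    # that minimizes the value.
    worst_case = [min(row) for row in q_table]
    best_row = max(range(len(q_table)), key=lambda i: worst_case[i])
    return best_row, worst_case[best_row]


def minimax_target(reward, gamma, next_q_table):
    """One-step target r + gamma * max_a' min_abar' Q(s', a', abar')."""
    _, value = maximin_action(next_q_table)
    return reward + gamma * value
```

For the table `[[3.0, -1.0], [1.0, 0.5]]`, the first action has the higher best case but a worst case of -1.0, so the second action (worst case 0.5) is selected.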
II-C Maximum entropy RL
Maximum entropy RL aims to maximize both the expected accumulated reward and the policy entropy, by augmenting the standard maximum reward RL objective with an entropy maximization term:
$$J(\pi) = \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^{t} \big(r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))\big)\Big] \qquad (5)$$
where $\mathcal{H}(\pi(\cdot|s)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a|s)]$ is the policy entropy and $\alpha$ is the temperature parameter, which determines the relative importance of the entropy term against the reward and thus controls the stochasticity of the optimal policy.
The maximum entropy objective differs from the standard maximum expected reward objective used in standard RL, though the conventional objective in (1) can be recovered as $\alpha \to 0$. Prior works [26, 27] have demonstrated that the maximum entropy objective has a number of conceptual and practical advantages. First, the policy is incentivized to explore more widely, while giving up on clearly unpromising avenues. Second, the policy can capture multiple modes of near-optimal behavior. In problem settings where multiple actions seem equally attractive, the policy will act as randomly as possible among those actions.
The optimal maximum entropy policy is learned by a maximum entropy variant of the policy iteration method, called soft policy iteration, which alternates between policy evaluation and policy improvement. In the policy evaluation process, given policy $\pi$, the Q-value can be learned by repeatedly applying a modified Bellman operator $\mathcal{T}^{\pi}$ under policy $\pi$ given by
$$\mathcal{T}^{\pi} Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p,\, a' \sim \pi}\big[Q(s', a') - \alpha \log \pi(a'|s')\big]$$
The goal of the policy improvement process is to find a new policy $\pi_{new}$ that is better than the current policy $\pi_{old}$, such that $Q^{\pi_{new}}(s, a) \ge Q^{\pi_{old}}(s, a)$ for all state-action pairs $(s, a)$. Hence, we can update the policy directly by maximizing the entropy-augmented objective (5), i.e.,
$$\pi_{new} = \arg\max_{\pi} \mathbb{E}_{s \sim \rho_{\pi},\, a \sim \pi}\big[Q^{\pi_{old}}(s, a) - \alpha \log \pi(a|s)\big] \qquad (6)$$
It has been shown that the policy evaluation and policy improvement steps can be alternated so that the policy gradually converges to the optimal policy [7].
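The policy evaluation step can be illustrated with a single-sample version of the entropy-augmented backup used in soft actor-critic methods [7]. This is a minimal sketch, assuming a univariate Gaussian policy and one sampled transition; `soft_bellman_target` and `gaussian_log_prob` are illustrative names rather than functions from the paper.

```python
import math


def gaussian_log_prob(action, mean, std):
    """Log-density of a univariate Gaussian policy at the given action."""
    var = std * std
    return -0.5 * math.log(2.0 * math.pi * var) - (action - mean) ** 2 / (2.0 * var)


def soft_bellman_target(reward, gamma, alpha, next_q, next_log_prob):
    """Single-sample soft backup: r + gamma * (Q(s', a') - alpha * log pi(a'|s')).

    The entropy bonus enters through -alpha * log pi; with alpha = 0 this
    reduces to the standard one-step Bellman target.
    """
    return reward + gamma * (next_q - alpha * next_log_prob)
```

Since the log-probability of an unlikely action is strongly negative, the term `-alpha * next_log_prob` raises the target for such actions, which is how the temperature trades off reward against stochasticity.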
III Our methods
This section focuses on the combination of the minimax formulation and the distributional RL framework, in which we state our algorithm based on the continuous return distribution.
III-A Distributional RL
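In distributional RL, we view the return $Z^{\pi}(s, a)$ as a random variable, and choose to directly learn its distribution instead of just its expected value, i.e., the Q-value:
$$Q^{\pi}(s, a) = \mathbb{E}\big[Z^{\pi}(s, a)\big] \qquad (7)$$
An analogous distributional Bellman operator of the form
$$\mathcal{T}^{\pi}_{\mathcal{D}} Z(s, a) \overset{D}{=} r(s, a) + \gamma \big(Z(s', a') - \alpha \log \pi(a'|s')\big) \qquad (8)$$
can be derived, where $A \overset{D}{=} B$ denotes that two random variables $A$ and $B$ have equal probability laws, and the next state $s'$ and action $a'$ are distributed according to $p(\cdot|s, a)$ and $\pi(\cdot|s')$, respectively. Supposing $\mathcal{T}^{\pi}_{\mathcal{D}} Z(s, a) \sim \mathcal{T}^{\pi}_{\mathcal{D}} \mathcal{Z}(\cdot|s, a)$, where $\mathcal{T}^{\pi}_{\mathcal{D}} \mathcal{Z}(\cdot|s, a)$ denotes the return distribution of $\mathcal{T}^{\pi}_{\mathcal{D}} Z(s, a)$, the return distribution can be optimized by minimizing the distribution distance between the Bellman-updated and the current return distributions:
$$\mathcal{Z}_{new} = \arg\min_{\mathcal{Z}} \mathbb{E}_{(s, a) \sim \rho_{\pi}}\Big[d\big(\mathcal{T}^{\pi}_{\mathcal{D}} \mathcal{Z}_{old}(\cdot|s, a), \mathcal{Z}(\cdot|s, a)\big)\Big] \qquad (9)$$
where $d$ is some metric to measure the distance between two distributions. For example, we can adopt $d$ as the Kullback-Leibler (KL) divergence or the Wasserstein metric.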
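Both distances mentioned above have closed forms when the two return distributions are univariate Gaussians. The sketch below assumes that Gaussian form purely for illustration; the helper names are hypothetical.

```python
import math


def kl_gaussian(mean_p, std_p, mean_q, std_q):
    """KL(p || q) between univariate Gaussians p and q (closed form)."""
    return (
        math.log(std_q / std_p)
        + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * std_q ** 2)
        - 0.5
    )


def wasserstein2_gaussian(mean_p, std_p, mean_q, std_q):
    """2-Wasserstein distance between univariate Gaussians (closed form)."""
    return math.sqrt((mean_p - mean_q) ** 2 + (std_p - std_q) ** 2)
```

The KL divergence is asymmetric in its arguments, whereas the Wasserstein distance is symmetric, which is one practical difference between the two choices of distance.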
The distributional framework has attracted much attention because distributional RL algorithms show improved sample complexity and final performance, as well as increased robustness to hyperparameter variation. However, many prior works used a discrete distribution to build the return distribution, which requires dividing the value function into different intervals in advance. Recently, Duan et al. [25] proposed the Distributional Soft Actor-Critic (DSAC) algorithm to directly learn the continuous distribution of returns by truncating the difference between the target and current return distributions, outperforming the existing algorithms on a suite of continuous control tasks. Therefore, we draw on the continuous return distribution to develop our algorithm.
III-B Minimax distributional RL
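Similarly, the return can also be modeled as a continuous distribution in the minimax formulation, and its expectation is the action value function $Q^{\pi,\bar{\pi}}(s, a, \bar{a})$:
$$Q^{\pi,\bar{\pi}}(s, a, \bar{a}) = \mathbb{E}\big[Z^{\pi,\bar{\pi}}(s, a, \bar{a})\big] \qquad (10)$$
We will denote $Z^{\pi,\bar{\pi}}(s, a, \bar{a})$ as $Z(s, a, \bar{a})$ for the sake of brevity and suppose $Z(s, a, \bar{a}) \sim \mathcal{Z}(\cdot|s, a, \bar{a})$. The corresponding minimax distributional Bellman equation can be derived as:
$$\mathcal{T}^{\pi,\bar{\pi}}_{\mathcal{D}} Z(s, a, \bar{a}) \overset{D}{=} r(s, a, \bar{a}) + \gamma \big(Z(s', a', \bar{a}') - \alpha \log \pi(a'|s')\big) \qquad (11)$$
where $s' \sim p(\cdot|s, a, \bar{a})$, $a' \sim \pi(\cdot|s')$ and $\bar{a}' \sim \bar{\pi}(\cdot|s')$.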
In the policy improvement step, both the protagonist policy and the adversary policy optimize themselves based on the current return distribution, sharing a common objective function:
The protagonist policy aims to maximize the expected return under the learned distribution, while the adversary aims to minimize it:
(12) 
To learn risk-sensitive policies, we model risk as the variance of the learned continuous return distribution. The protagonist policy is optimized to mitigate risk, so as to avoid events that may lead to a bad return:
The adversary policy seeks to increase risk to disrupt the learning process:
where and are the constants corresponding to the variance which describe different risk level.
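One possible reading of these risk-sensitive objectives is sketched below. It assumes that both players score a candidate action by the mean return minus a weighted return variance, with the protagonist maximizing that score and the adversary minimizing it; the weights `lam_p` and `lam_a` and all function names are hypothetical, and the discrete candidate set is only for illustration.

```python
def risk_adjusted_value(mean_return, return_variance, risk_weight):
    """Mean return penalized by a weighted variance term (assumed form)."""
    return mean_return - risk_weight * return_variance


def pick_protagonist_action(candidates, lam_p):
    """candidates maps an action name to (mean_return, return_variance).

    The protagonist prefers a high mean and a low variance.
    """
    return max(candidates, key=lambda k: risk_adjusted_value(*candidates[k], lam_p))


def pick_adversary_action(candidates, lam_a):
    """The adversary prefers a low mean and a high variance."""
    return min(candidates, key=lambda k: risk_adjusted_value(*candidates[k], lam_a))
```

With candidates `{"steady": (10.0, 1.0), "risky": (10.5, 20.0)}` and a risk weight of 0.1, the protagonist selects `"steady"` while the adversary selects `"risky"`; with a weight of 0 the variance is ignored and the choices follow the mean alone.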
IV Implementation of Algorithm
In this section, we combine the minimax formulation with the existing DSAC algorithm to present our Minimax DSAC algorithm, which is exposed to severe variations of the environment during training and thereby improves its generalization ability to unknown environments. To handle problems with large continuous domains, we use function approximators for the return distribution function and the two policies, each of which is modeled as a Gaussian with mean and variance given by neural networks (NNs). We consider a parameterized state-action return distribution function $\mathcal{Z}_{\theta}(\cdot|s, a, \bar{a})$, a stochastic protagonist policy $\pi_{\phi}(\cdot|s)$ and a stochastic adversarial policy $\bar{\pi}_{\bar{\phi}}(\cdot|s)$, where $\theta$, $\phi$ and $\bar{\phi}$ are parameters. Next, we derive update rules for these parameter vectors and show the details of Minimax DSAC.
In the policy evaluation step, the current protagonist policy and adversary policy are given, and the return distribution is updated by minimizing the difference between the target return distribution and the current return distribution. The formulation is similar to that of the DSAC algorithm [25], except that we consider two policies:
where $c$ is a constant. The gradient with respect to the parameters $\theta$ can be written as:
To prevent gradient explosion, we also adopt the clipping technique to keep the target return close to the expected value of the current distribution:
where $b$ is a hyperparameter representing the clipping boundary.
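The clipping step can be sketched as follows, assuming it is a symmetric clamp of a sampled target return around the mean of the current return distribution. The function name and argument names are illustrative.

```python
def clip_target_return(target_return, current_mean, boundary):
    """Clamp a sampled target return to [current_mean - boundary, current_mean + boundary].

    Keeping the target near the current mean bounds the size of the update
    that a single outlying sample can cause.
    """
    low = current_mean - boundary
    high = current_mean + boundary
    return max(low, min(high, target_return))
```

With a clipping boundary of 20 (the value listed in Table I) and a current mean of 10, a target of 100 is clipped to 30, while a target of 15 is left unchanged.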
To stabilize the learning process, a target return distribution with parameters $\theta'$ and two target policy functions with separate parameters $\phi'$ and $\bar{\phi}'$ are used to evaluate the target function. The target networks use a slow-moving update rate, parameterized by $\tau$, such as
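A common form of such a slow-moving update is an exponential moving average of the online parameters (often called Polyak averaging). The sketch below assumes that form and represents parameters as flat lists of floats; `soft_update` and the argument name `tau` are illustrative.

```python
def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward its online counterpart.

    new_target = tau * online + (1 - tau) * target
    """
    return [
        tau * w_online + (1.0 - tau) * w_target
        for w_target, w_online in zip(target_params, online_params)
    ]
```

With an update rate of 0.001 (the target update rate listed in Table I), a target weight of 0.0 tracking an online weight of 1.0 moves to 0.001 after one update.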
In the policy improvement step, as discussed above, the protagonist policy aims to maximize the expected return with entropy and to select actions with low variance:
The adversarial policy aims to minimize the expected return and select actions with high variance:
Suppose the mean and variance of the return distribution can be explicitly parameterized by the parameters $\theta$. We can derive the policy gradients of the protagonist and adversary policies using the reparameterization trick:
where $\xi_{a}$ and $\xi_{\bar{a}}$ are auxiliary variables sampled from some fixed distribution. Then the protagonist policy gradient can be derived as
The adversarial policy gradient can be approximated with
Finally, the temperature $\alpha$ is updated by minimizing the following objective
where $\bar{\mathcal{H}}$ is the expected entropy. The details of our algorithm are shown in Algorithm 1.
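The reparameterization trick used for the policy gradients can be illustrated for a univariate Gaussian policy. This is a minimal sketch, assuming standard normal noise and no action squashing; the function names are hypothetical.

```python
import random


def reparameterized_action(mean, std, noise):
    """Express the action as a deterministic function of (mean, std) and noise.

    Because the randomness lives in `noise`, the action is differentiable with
    respect to the policy outputs: d(action)/d(mean) = 1 and
    d(action)/d(std) = noise.
    """
    return mean + std * noise


def sample_action(mean, std, rng):
    """Draw auxiliary noise from a fixed N(0, 1) and map it to an action."""
    noise = rng.gauss(0.0, 1.0)
    return reparameterized_action(mean, std, noise), noise
```

A call such as `sample_action(1.0, 2.0, random.Random(0))` returns both the action and the noise that produced it, so the same noise can be reused when differentiating through the policy outputs.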
V Experiments
In this section, we evaluate our algorithm in an autonomous driving environment, in which we choose an intersection as the driving scenario. Our experiment aims to study two primary questions: (1) How well does Minimax DSAC perform on this task compared with the standard DSAC algorithm? (2) Can our algorithm still work, or even behave better, if there are disturbances from the environment?
V-A Simulation Environment
We focus on a typical 4-way intersection shown in Fig. 1. Each direction is denoted by its location in the figure, i.e., up (U), down (D), left (L) and right (R). The intersection is unsignalized and each direction has one lane. The protagonist vehicle (black car in Fig. 1) attempts to travel from down to up, while two adversarial vehicles (green cars in Fig. 1) travel from right to left and from left to right, respectively. The trajectories of all three vehicles are given in advance; as a result, there are two traffic conflict points between the paths of the protagonist vehicle and the adversarial vehicles, shown as red circles in Fig. 1. In our experimental setting, the protagonist vehicle attempts to pass the intersection safely and quickly, while the two adversarial vehicles try to disrupt it by hitting the protagonist vehicle.
Suppose that all vehicles are equipped with positioning and velocity devices, so that we can choose the position and velocity of each vehicle as states, i.e., ($d$, $v$), where $d$ is the distance between the vehicle and the center of the intersection. Note that $d$ is positive when the vehicle is heading for the center and negative when it is leaving. For the action space, we choose the acceleration of each vehicle and suppose that vehicles can strictly follow the desired acceleration.
The reward function is designed to consider both safety and time efficiency. The task is constructed in an episodic manner with two termination conditions: collision or passing. First, if the protagonist vehicle passes the intersection safely, a large positive reward of 110 is given. Second, if a collision happens anywhere, a large negative reward of -110 is given to the protagonist vehicle. Besides, a minor negative step reward of -1 is given at every time step to encourage the protagonist vehicle to pass as quickly as possible. The adversarial vehicles receive the negative of the protagonist's reward in each of these cases.
Overall, the protagonist vehicle learns to control its acceleration to pass successfully, including avoiding potential collisions caused by the adversarial vehicles, while the two adversarial vehicles learn to control their acceleration to make a collision happen.
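The reward design described above can be written as a short function. This is a sketch under two assumptions that the text does not settle: the terminal reward replaces the step penalty on the final step rather than being added to it, and a collision takes precedence if both flags are set. The function name `step_rewards` is illustrative.

```python
PASS_REWARD = 110.0        # protagonist passes the intersection safely
COLLISION_REWARD = -110.0  # any collision
STEP_REWARD = -1.0         # per-step penalty encouraging fast passing


def step_rewards(passed, collided):
    """Return (protagonist_reward, adversary_reward, episode_done) for one step."""
    if collided:
        protagonist = COLLISION_REWARD
    elif passed:
        protagonist = PASS_REWARD
    else:
        protagonist = STEP_REWARD
    # Zero-sum setting: the adversarial vehicles receive the negated reward.
    return protagonist, -protagonist, passed or collided
```

For an ordinary step with neither event, the protagonist receives -1 and the adversary +1, and the episode continues.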
V-B Algorithm Details and Results
All the value functions and the two policies are approximated by multilayer perceptrons (MLPs) with two hidden layers and 256 units per layer. The policy of the protagonist vehicle aims to maximize the future expected return, while the policy of the adversarial vehicles aims to minimize it. The baseline of our algorithm is the standard DSAC [25] without the adversarial policy, in which the protagonist vehicle learns to pass through the intersection in the presence of two random surrounding vehicles. We also adopt the asynchronous parallel architecture of DSAC, in which 4 learners and 3 actors are used to accelerate learning. The hyperparameters used in training are listed in Table I and the training result is shown in Fig. 2, where the solid lines and shaded regions correspond to the mean and 95% confidence interval over 10 runs.
Model type  MLP 
Hidden units  256 
Hidden layers  2 
Max buffer size  500 
Sample batch size  256 
Hidden layers activation  gelu 
Optimizer type  Adam 
Adam parameter  
Actor learning rate  
Critic learning rate  
learning rate  
Discount factor  0.99 
Temperature  Auto 
Target update rate  0.001 
Max train  500000 
Clipping boundary  20 
Actor number  4 
Learner number  3 
Buffer number  1 
Seed  Current time 
Max steps per episode  100 
0.1  
0.1  

Results show that Minimax DSAC obtained a smaller mean and a larger variance of the average return, which can be explained by the fact that, in the minimax formulation, the adversary policy provides a strong disturbance to the learning of the protagonist policy. Besides, Minimax DSAC clearly fluctuates more than standard DSAC at convergence. That is because the protagonist vehicle has learned to avoid potential collisions by decelerating, or even stopping and waiting, when facing aggressive adversarial vehicles, which incurs the step penalty at each step and finally results in a lower return.
V-C Evaluation
Compared with the performance during training, we are more concerned with the performance in situations never encountered before, i.e., the generalization ability. Here, we employ a training/test split to test the generalization of our algorithm, aiming to explore whether Minimax DSAC, after training with the minimax formulation, behaves better in a varying environment than standard DSAC. As the adversarial vehicles can be regarded as part of the environment, we can design different driving modes for the adversarial agents to adjust the environment difficulty and evaluate the generalization ability of the protagonist policy. Concretely, we design three driving modes for the adversarial agents: aggressive, conservative and random. In aggressive mode, the two adversarial vehicles sample their acceleration from a positive interval, while in conservative mode they sample their acceleration from a negative interval. In random mode, one adversarial vehicle samples its acceleration from and the other vehicle samples its acceleration from .
The comparison of the two methods under the three modes is shown in Fig. 3, in which the corresponding $p$-values are also marked. Results show that Minimax DSAC greatly improves the performance under different modes of the adversarial vehicles, especially in the aggressive and random modes. In conservative mode, the two algorithms show only a minor difference because both adversarial vehicles drive at the lowest speed within the limit, so fewer potential collisions with the protagonist occur. However, Minimax DSAC still obtained a higher return because it adopted a larger acceleration to improve passing efficiency. The t-test results in Fig. 3 show that the average reward of DSAC is significantly smaller than that of Minimax DSAC. To sum up, Minimax DSAC maintains better performance when encountering different kinds of variations in the environment.
VI Conclusion
In this paper, we combine the minimax formulation with the distributional framework to improve the generalization ability of RL algorithms, in which the protagonist agent must compete with the adversarial agent to learn how to behave well. Based on the DSAC algorithm, we propose the Minimax DSAC algorithm and apply it to autonomous driving tasks at intersections. Results show that our algorithm greatly improves the robustness of the protagonist agent to variations in the environment. This study provides a promising approach to accelerate the application of RL algorithms in the real world, such as autonomous driving, where algorithms are usually developed in a simulator and then deployed in the real environment.
References
 [1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [2] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, 2016.
 [3] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning, 2016, pp. 1928–1937.
 [4] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [5] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning, 2015, pp. 1889–1897.
 [6] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
 [7] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.
 [8] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
 [9] T. Haarnoja, V. Pong, A. Zhou, M. Dalal, P. Abbeel, and S. Levine, “Composable deep reinforcement learning for robotic manipulation,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 6244–6251.
 [10] Y. Guan, Y. Ren, S. E. Li, Q. Sun, L. Luo, K. Taguchi, and K. Li, “Centralized conflict-free cooperation for connected and automated vehicles at intersections by proximal policy optimization,” 2019.
 [11] J. Duan, S. E. Li, Y. Guan, Q. Sun, and B. Cheng, “Hierarchical reinforcement learning for self-driving decision-making without reliance on labeled driving data,” IET Intelligent Transport Systems, 2020. doi:10.1049/iet-its.2019.0317.
 [12] C. Packer, K. Gao, J. Kos, P. Krähenbühl, V. Koltun, and D. Song, “Assessing generalization in deep reinforcement learning,” arXiv preprint arXiv:1810.12282, 2018.
 [13] C. Zhao, O. Sigaud, F. Stulp, and T. M. Hospedales, “Investigating generalisation in continuous deep reinforcement learning,” arXiv preprint arXiv:1902.07015, 2019.
 [14] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” in 2018 IEEE international conference on robotics and automation (ICRA). IEEE, 2018, pp. 1–8.
 [15] A. Mandlekar, Y. Zhu, A. Garg, L. Fei-Fei, and S. Savarese, “Adversarially robust policy learning: Active construction of physically-plausible perturbations,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 3932–3939.
 [16] A. Pattanaik, Z. Tang, S. Liu, G. Bommannan, and G. Chowdhary, “Robust deep reinforcement learning with adversarial attacks,” in Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2018, pp. 2040–2042.
 [17] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in 3rd International Conference on Learning Representations, ICLR 2015, 2015.
 [18] A. Prashanth L and M. Fu, “Risk-sensitive reinforcement learning: A constrained optimization viewpoint,” arXiv preprint arXiv:1810.09126, 2018.
 [19] A. Tamar, Y. Glassner, and S. Mannor, “Optimizing the CVaR via sampling,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
 [20] J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015.
 [21] A. Rajeswaran, S. Ghotra, B. Ravindran, and S. Levine, “Epopt: Learning robust neural network policies using model ensembles,” arXiv preprint arXiv:1610.01283, 2016.
 [22] J. Morimoto and K. Doya, “Robust reinforcement learning,” Neural computation, vol. 17, no. 2, pp. 335–359, 2005.
 [23] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta, “Robust adversarial reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 2817–2826.
 [24] X. Pan, D. Seita, Y. Gao, and J. Canny, “Risk averse robust adversarial reinforcement learning,” arXiv preprint arXiv:1904.00511, 2019.
 [25] J. Duan, Y. Guan, Y. Ren, S. E. Li, and B. Cheng, “Addressing value estimation errors in reinforcement learning with a state-action return distribution function,” 2020.
 [26] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, “Dueling network architectures for deep reinforcement learning,” in Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, 2016, pp. 1995–2003.
 [27] C. Lyle, M. G. Bellemare, and P. S. Castro, “A comparative analysis of expected and distributional reinforcement learning,” in The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, 2019, pp. 4504–4511.