Improving Generalization of Reinforcement Learning with Minimax Distributional Soft Actor-Critic

02/13/2020 ∙ by Yangang Ren, et al. ∙ Tsinghua University

Reinforcement learning (RL) has achieved remarkable performance in a variety of sequential decision-making and control tasks. However, a common problem is that a learned, nearly optimal policy tends to overfit to the training environment and may not extend to situations never encountered during training. In practical applications, the randomness of the environment usually leads to rare but devastating events, which should be the focus of safety-critical systems such as autonomous driving. In this paper, we introduce the minimax formulation and the distributional framework to improve the generalization ability of RL algorithms and develop the Minimax Distributional Soft Actor-Critic (Minimax DSAC) algorithm. The minimax formulation seeks an optimal policy under the most severe disturbances from the environment: the protagonist policy maximizes the action-value function while the adversary policy tries to minimize it. The distributional framework learns a state-action return distribution, from which we can model the risk of different returns explicitly and thereby formulate a risk-averse protagonist policy and a risk-seeking adversarial policy. We implement our method on the decision-making tasks of autonomous vehicles at intersections and test the trained policy in environments distinct from the training environment. Results demonstrate that our method can greatly improve the generalization ability of the protagonist agent to different environmental variations.




I Introduction

Numerous deep reinforcement learning (RL) algorithms have appeared over the last decade [1, 2, 3, 4, 5, 6, 7], and their applications have demonstrated great performance in a range of challenging domains such as games [8], robotic control [9] and autonomous driving [10, 11]. Mainstream RL algorithms optimize policies based on their performance in the training environment, without considering how well they transfer to situations never encountered during training. Studies have shown that this can reduce the generalization ability of the learned policy [12][13]. Intelligent agents such as autonomous vehicles usually need to cope with multiple situations, including unknown scenarios.

A straightforward technique to improve the generalization ability of RL is to train on a set of randomized environments. By randomizing the dynamics of the simulation environment during training, the learned policies become capable of adapting to the different dynamics encountered during training [14]. Furthermore, the works of [15] and [16] directly add noise to state observations to provide adversarial perturbations [17]. However, these approaches can only handle common disturbances: catastrophic events are highly unlikely to be encountered if the environment is left unchanged or if its parameters or dynamics are only perturbed at random.

Alternative techniques to improve generalization include risk-sensitive policy learning. Generally, risk is related to the stochasticity of the environment and to the fact that even an optimal policy (in terms of expected return) may perform poorly in some cases. That is, optimizing an expected long-run objective, such as the infinite-horizon discounted or average reward, is not sufficient on its own to avoid rare occurrences of large negative outcomes in many practical applications. Instead, risk-sensitive policy learning includes a risk measure in the optimization process, either as the objective or as a constraint [18, 19]. This formulation seeks not only to maximize the expected reward but also to reduce the risk criterion, such that the trained policy can still work in a varying environment. Concretely, risk is usually modeled as the variance of the return, and the most representative algorithms include the mean-variance trade-off method [20] and percentile optimization methods [21]. However, the existing methods can only model risk by sampling a discrete set of trajectories from the randomized environment, rather than learning the exact return distribution.

To improve the generalization ability of RL across different kinds of disturbances, the minimax formulation, or worst-case formulation, has been widely adopted. Morimoto et al. (2005) first combined H-infinity control with RL to learn a robust policy, which is the prototype of most existing worst-case formulations of RL algorithms [22]. They formulated a differential game in which a disturbing agent tries to cause the worst possible disturbance while a control agent tries to produce the best control input. This problem was reduced to finding a minimax solution of a value function that takes into account the norm of the output deviation and the norm of the disturbance. Later, Pinto et al. (2017) extended this work with deep neural networks and proposed the Robust Adversarial Reinforcement Learning (RARL) algorithm, in which two policies are trained simultaneously: the protagonist policy to optimize performance and the adversarial policy to provide disruption. The protagonist and adversary policies are trained alternately, with one being fixed whilst the other adapts [23]. Pan et al. (2019) introduced a risk-aware framework into RARL to prevent rare, catastrophic events such as automotive accidents [24]. To that end, risk was modeled as the variance of value functions: the protagonist policy should not only maximize the expected reward but also select actions with low variance. Conversely, the adversary policy aims to minimize the long-term expected reward and select actions with high variance, so that the adversary actively seeks catastrophic outcomes. To obtain the risk of the value function, they use an ensemble of Q-value networks, trained in parallel, to estimate the variance. Risk-aware RARL is effective and sometimes even crucial, especially in safety-critical systems like autonomous driving. However, the existing methods can only handle discrete and low-dimensional action spaces and, even worse, the value function must be divided into multiple discrete intervals in advance. This is inconvenient because different tasks usually require different numbers of intervals.

In this paper, we propose a new RL algorithm to improve the generalization ability of the learned policy. In particular, the learned policy should not only succeed in the training environment but also cope with situations never encountered before. To that end, we adopt the minimax formulation, which augments standard RL with an adversarial policy, to develop a minimax variant of the Distributional Soft Actor-Critic (DSAC) algorithm [25], called Minimax DSAC. We choose DSAC as the basis of our algorithm not only because it is a state-of-the-art RL algorithm, but also because it directly learns a continuous distribution of returns, which enables us to model return variance as risk explicitly. By modeling risk, we can train stronger adversaries and, through competition, the protagonist policy gains a greater ability to cope with environmental changes. Additionally, we apply our Minimax DSAC algorithm to autonomous decision-making at intersections and evaluate the trained policy in environments different from the training environment. The results show that our algorithm maintains good performance even when the environment changes drastically.

The rest of the paper is organized as follows: Section II states the preliminaries of standard RL, two-player zero-sum RL and maximum entropy RL framework mathematically. Section III introduces the combination of minimax formulation and distributional method and Section IV introduces the implementation of the proposed method Minimax DSAC. Section V introduces the simulation scenarios, presents the results and evaluates the trained model. Section VI summarizes the major contributions and concludes this paper.

II Preliminaries

Before delving into the details of our algorithm, we first introduce notation and summarize the standard RL setting, two-player RL setting and maximum entropy RL setting mathematically.

II-A Standard Reinforcement Learning

Reinforcement learning (RL) is designed to solve sequential decision-making tasks in which an agent interacts with an environment. Formally, we consider an infinite-horizon discounted Markov Decision Process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ is a continuous set of states, $\mathcal{A}$ is a continuous set of actions, $p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ is the transition probability distribution, $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor. At each time step $t$, the agent receives a state $s_t \in \mathcal{S}$ and selects an action $a_t \in \mathcal{A}$, and the environment returns the next state $s_{t+1} \in \mathcal{S}$ with probability $p(s_{t+1}|s_t, a_t)$ and a scalar reward $r_t = r(s_t, a_t)$. We use $\rho_{\pi}(s)$ and $\rho_{\pi}(s, a)$ to denote the state and state-action distributions induced by policy $\pi$ in the environment. For simplicity, the current and next state-action pairs are also denoted as $(s, a)$ and $(s', a')$, respectively.

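The goal in standard RL is to learn a policy that maximizes the expected future accumulated return. Let $\pi(a|s)$ denote a stochastic policy that maps a state to a distribution over actions. The action-value function is the expected sum of discounted rewards from a state-action pair when following policy $\pi$:

$$Q^{\pi}(s, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\Big|\, s_0 = s,\ a_0 = a\Big],$$

where $s_{t+1} \sim p(\cdot|s_t, a_t)$ and $a_t \sim \pi(\cdot|s_t)$. Intuitively, the optimal policy can be defined as:

$$\pi^{*} = \arg\max_{\pi} Q^{\pi}(s, a) \quad \text{for all } (s, a).$$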
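As a small illustration of the discounted sum inside the action-value function, the return of a single sampled reward sequence can be computed as below. This is only a sketch: the function name `discounted_return` and the example rewards are hypothetical, and averaging such returns over many rollouts that start from the same state-action pair would give a Monte Carlo estimate of the action value.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over one finite reward sequence."""
    total = 0.0
    # Accumulate backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical three-step episode: two step penalties, then a terminal bonus.
rewards = [-1.0, -1.0, 110.0]
g = discounted_return(rewards, gamma=0.5)
```

With a discount factor closer to 1, as is typical in practice, the terminal reward dominates the sum much more strongly than in this toy setting.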

II-B Two-Player Zero-Sum RL

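Two-player zero-sum RL can be expressed as a two-player zero-sum MDP, defined by the tuple $(\mathcal{S}, \mathcal{A}, \bar{\mathcal{A}}, p, r, \gamma)$, where $\mathcal{A}$ and $\bar{\mathcal{A}}$ are the continuous action sets available to the two players. In this setting, the action-value function is defined over two policies, called the protagonist policy $\pi$ and the adversary policy $\bar{\pi}$. Given the current state $s$, the protagonist policy takes action $a \in \mathcal{A}$, the adversary policy takes action $\bar{a} \in \bar{\mathcal{A}}$, and the environment then moves to the next state $s'$. The two policies obtain opposite rewards: at each time step the protagonist receives a reward $r(s, a, \bar{a})$ while the adversary receives $-r(s, a, \bar{a})$. The action-value function under the two policies is

$$Q^{\pi,\bar{\pi}}(s, a, \bar{a}) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\Big|\, s_0 = s,\ a_0 = a,\ \bar{a}_0 = \bar{a}\Big],$$

where $r_t = r(s_t, a_t, \bar{a}_t)$ for short, $a_t \sim \pi(\cdot|s_t)$ and $\bar{a}_t \sim \bar{\pi}(\cdot|s_t)$. Similarly, the optimal action-value function should satisfy the Minimax Bellman Optimality Equation:

$$Q^{*}(s, a, \bar{a}) = r(s, a, \bar{a}) + \gamma\, \mathbb{E}_{s' \sim p}\Big[\max_{a'} \min_{\bar{a}'} Q^{*}(s', a', \bar{a}')\Big].$$

Intuitively, the protagonist policy seeks to maximize the long-term expected reward while the adversarial policy seeks to minimize it. The optimal policy can be defined as the Nash equilibrium of the two policies:

$$Q^{*}(s, a^{*}, \bar{a}^{*}) = \max_{a} \min_{\bar{a}} Q^{*}(s, a, \bar{a}) = \min_{\bar{a}} \max_{a} Q^{*}(s, a, \bar{a}),$$

where the optimal actions $a^{*}$ and $\bar{a}^{*}$ come from the optimal policies $\pi^{*}$ and $\bar{\pi}^{*}$, respectively.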
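To make the saddle-point idea concrete, the sketch below evaluates the max-min and min-max values of a small, hypothetical action-value table at one fixed state. The discrete action sets, the table entries, and the helper names `maximin` and `minimax` are assumptions for illustration only; the paper itself works with continuous actions and learned policies.

```python
def maximin(q):
    """Protagonist picks the row whose worst-case (min over columns) value is largest."""
    row_worst = [min(row) for row in q]
    a = max(range(len(q)), key=lambda i: row_worst[i])
    return a, row_worst[a]


def minimax(q):
    """Adversary picks the column whose best-case (max over rows) value is smallest."""
    n_cols = len(q[0])
    col_best = [max(row[j] for row in q) for j in range(n_cols)]
    a_bar = min(range(n_cols), key=lambda j: col_best[j])
    return a_bar, col_best[a_bar]


# Hypothetical Q(s, a, a_bar) at one state:
# rows index protagonist actions, columns index adversary actions.
q_table = [
    [3.0, 1.0, 4.0],
    [5.0, 2.0, 6.0],
    [0.0, -1.0, 7.0],
]
a_star, lower_value = maximin(q_table)
a_bar_star, upper_value = minimax(q_table)
```

In this table the two values coincide, so the pair of selected actions is a saddle point: neither player can gain by deviating alone. For a general table with pure actions the max-min value can be strictly smaller than the min-max value.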

II-C Maximum Entropy RL

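Maximum entropy RL aims to maximize both the expected accumulated reward and the policy entropy, by augmenting the standard maximum-reward RL objective with an entropy maximization term:

$$J_{\pi} = \mathbb{E}_{(s_{i \ge t},\, a_{i \ge t}) \sim \rho_{\pi}}\Big[\sum_{i=t}^{\infty} \gamma^{i-t}\big[r_i + \alpha \mathcal{H}(\pi(\cdot|s_i))\big]\Big],$$

where $\mathcal{H}(\pi(\cdot|s)) = -\mathbb{E}_{a \sim \pi(\cdot|s)}[\log \pi(a|s)]$ is the policy entropy and $\alpha$ is the temperature parameter, which determines the relative importance of the entropy term against the reward and thus controls the stochasticity of the optimal policy.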

The maximum entropy objective differs from the maximum expected reward objective used in standard RL, though the conventional objective in (1) can be recovered as $\alpha \rightarrow 0$. Prior works [26, 27] have demonstrated that the maximum entropy objective has a number of conceptual and practical advantages. First, the policy is incentivized to explore more widely, while giving up on clearly unpromising avenues. Second, the policy can capture multiple modes of near-optimal behavior. In problem settings where multiple actions seem equally attractive, the policy will act as randomly as possible among those actions.

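The optimal maximum entropy policy is learned by a maximum entropy variant of the policy iteration method, called soft policy iteration, which alternates between policy evaluation and policy improvement. In the policy evaluation process, given a policy $\pi$, the Q-value can be learned by repeatedly applying a modified Bellman operator $\mathcal{T}^{\pi}$ under policy $\pi$, given by

$$\mathcal{T}^{\pi} Q^{\pi}(s, a) = r + \gamma\, \mathbb{E}_{s' \sim p,\ a' \sim \pi}\big[Q^{\pi}(s', a') - \alpha \log \pi(a'|s')\big].$$

The goal of the policy improvement process is to find a new policy $\pi_{\text{new}}$ that is better than the current policy $\pi_{\text{old}}$, such that $Q^{\pi_{\text{new}}}(s, a) \ge Q^{\pi_{\text{old}}}(s, a)$ for all state-action pairs $(s, a)$. Hence, we can update the policy directly by maximizing the entropy-augmented objective (5), i.e.,

$$\pi_{\text{new}} = \arg\max_{\pi} \mathbb{E}_{s \sim \rho_{\pi},\ a \sim \pi}\big[Q^{\pi_{\text{old}}}(s, a) - \alpha \log \pi(a|s)\big].$$

It has been shown that alternating the policy evaluation step and the policy improvement step gradually converges to the optimal policy [7].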
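For intuition, the policy improvement step has a closed form when the action set is finite: the policy that maximizes the entropy-augmented objective at a state is the Boltzmann distribution over Q-values. The sketch below assumes a discrete action set and hypothetical Q-values, which is a simplification of the continuous-action setting treated in this paper; the function names are illustrative.

```python
import math


def soft_policy_improvement(q_values, alpha):
    """Boltzmann policy pi(a) proportional to exp(Q(a) / alpha) for one state."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]


def entropy_augmented_objective(q_values, policy, alpha):
    """E_{a ~ pi}[Q(a) - alpha * log pi(a)] for one state."""
    return sum(p * (q - alpha * math.log(p))
               for p, q in zip(policy, q_values) if p > 0.0)


q = [1.0, 2.0, 0.5]   # hypothetical Q-values for three actions
alpha = 0.5           # temperature
pi_new = soft_policy_improvement(q, alpha)
uniform = [1.0 / 3.0] * 3

obj_new = entropy_augmented_objective(q, pi_new, alpha)
obj_uniform = entropy_augmented_objective(q, uniform, alpha)
# The optimum equals the "soft maximum" alpha * log(sum(exp(Q / alpha))).
soft_value = alpha * math.log(sum(math.exp(x / alpha) for x in q))
```

As the temperature shrinks, the Boltzmann policy concentrates on the greedy action, which matches the statement that the conventional objective is recovered in the limit of zero temperature.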

III Our Methods

This section mainly focuses on the combination of minimax formulation and distributional RL framework, in which we state our algorithm based on the continuous distributional return.

III-A Distributional RL

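In distributional RL, we view the return

$$Z^{\pi}(s_t, a_t) = r_t + \gamma \sum_{i=t+1}^{\infty} \gamma^{i-t-1}\big[r_i - \alpha \log \pi(a_i|s_i)\big]$$

as a random variable, and choose to directly learn its distribution instead of just its expected value, i.e., the Q-value:

$$Q^{\pi}(s, a) = \mathbb{E}\big[Z^{\pi}(s, a)\big].$$

An analogous distributional Bellman operator of the form

$$\mathcal{T}^{\pi}_{\mathcal{D}} Z^{\pi}(s, a) \overset{D}{=} r + \gamma\big(Z^{\pi}(s', a') - \alpha \log \pi(a'|s')\big)$$

can be derived, where $A \overset{D}{=} B$ denotes that two random variables $A$ and $B$ have equal probability laws, and the next state $s'$ and action $a'$ are distributed according to $p(\cdot|s, a)$ and $\pi(\cdot|s')$, respectively. Supposing $\mathcal{T}^{\pi}_{\mathcal{D}} Z^{\pi}(s, a) \sim \mathcal{T}^{\pi}_{\mathcal{D}} \mathcal{Z}(\cdot|s, a)$, where $\mathcal{T}^{\pi}_{\mathcal{D}} \mathcal{Z}(\cdot|s, a)$ denotes the return distribution of $\mathcal{T}^{\pi}_{\mathcal{D}} Z^{\pi}(s, a)$, the return distribution can be optimized by minimizing the distance between the Bellman-updated and the current return distributions:

$$\mathcal{Z}_{\text{new}} = \arg\min_{\mathcal{Z}} \mathbb{E}_{(s, a) \sim \rho_{\pi}}\Big[d\big(\mathcal{T}^{\pi}_{\mathcal{D}} \mathcal{Z}_{\text{old}}(\cdot|s, a),\ \mathcal{Z}(\cdot|s, a)\big)\Big],$$

where $d$ is some metric that measures the distance between two distributions. For example, $d$ can be the Kullback-Leibler (KL) divergence or the Wasserstein metric.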
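When both the target and the current return distributions are modeled as univariate Gaussians, the KL divergence between them has a closed form. The sketch below is a generic implementation of that textbook formula; the function name `kl_gaussian` and the example numbers are hypothetical and not taken from the paper.

```python
import math


def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) ) for univariate Gaussians."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)


# Hypothetical target return distribution vs. current estimate at one (s, a).
kl_same = kl_gaussian(0.0, 1.0, 0.0, 1.0)
kl_shifted = kl_gaussian(1.0, 1.0, 0.0, 1.0)
kl_forward = kl_gaussian(0.0, 1.0, 0.0, 2.0)
kl_reverse = kl_gaussian(0.0, 2.0, 0.0, 1.0)
```

The divergence is zero only when the two distributions coincide, and it is not symmetric in its arguments, so the order of target and current distribution matters when it is used as a training loss.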

The distributional framework has attracted much attention because distributional RL algorithms show improved sample complexity and final performance, as well as increased robustness to hyperparameter variation. However, many prior works used a discrete distribution to build the return distribution, which requires dividing the value function into intervals in advance. Recently, Duan et al. [25] proposed the Distributional Soft Actor-Critic (DSAC) algorithm, which directly learns a continuous distribution of returns by truncating the difference between the target and current return distributions, and which outperforms existing algorithms on a suite of continuous control tasks. Therefore, we build our algorithm on the continuous return distribution.

III-B Minimax Distributional RL

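Similarly, the return can also be modeled as a continuous distribution in the minimax formulation, and its expectation is the action-value function $Q^{\pi,\bar{\pi}}(s, a, \bar{a})$:

$$Q^{\pi,\bar{\pi}}(s, a, \bar{a}) = \mathbb{E}\big[Z^{\pi,\bar{\pi}}(s, a, \bar{a})\big].$$

We denote $Z^{\pi,\bar{\pi}}(s, a, \bar{a})$ as $Z(s, a, \bar{a})$ for brevity and suppose $Z(s, a, \bar{a}) \sim \mathcal{Z}(\cdot|s, a, \bar{a})$. The corresponding minimax distributional Bellman equation can be derived as:

$$\mathcal{T}_{\mathcal{D}} Z(s, a, \bar{a}) \overset{D}{=} r(s, a, \bar{a}) + \gamma\big(Z(s', a', \bar{a}') - \alpha \log \pi(a'|s')\big),$$

where $s' \sim p(\cdot|s, a, \bar{a})$, $a' \sim \pi(\cdot|s')$ and $\bar{a}' \sim \bar{\pi}(\cdot|s')$.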

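In the policy improvement step, both the protagonist policy and the adversary policy optimize themselves based on the current return distribution, sharing a common objective function:

$$J(\pi, \bar{\pi}) = \mathbb{E}_{s \sim \rho,\ a \sim \pi,\ \bar{a} \sim \bar{\pi}}\Big[\mathbb{E}_{Z \sim \mathcal{Z}(\cdot|s, a, \bar{a})}\big[Z(s, a, \bar{a})\big]\Big].$$

The protagonist policy aims to maximize the distributional expected return while the adversary aims to minimize it:

$$\pi_{\text{new}} = \arg\max_{\pi} J(\pi, \bar{\pi}), \qquad \bar{\pi}_{\text{new}} = \arg\min_{\bar{\pi}} J(\pi, \bar{\pi}).$$

To learn risk-sensitive policies, we model risk as the variance $\sigma^{2}(s, a, \bar{a})$ of the learned continuous return distribution. The protagonist policy is optimized to mitigate risk, avoiding events that may lead to bad returns:

$$\pi_{\text{new}} = \arg\max_{\pi} \mathbb{E}\big[\,\mathbb{E}[Z(s, a, \bar{a})] - \lambda_{p}\, \sigma^{2}(s, a, \bar{a})\big],$$

while the adversary policy seeks to increase risk to disrupt the learning process:

$$\bar{\pi}_{\text{new}} = \arg\min_{\bar{\pi}} \mathbb{E}\big[\,\mathbb{E}[Z(s, a, \bar{a})] - \lambda_{a}\, \sigma^{2}(s, a, \bar{a})\big],$$

where $\lambda_{p}$ and $\lambda_{a}$ are constants that weight the variance and describe different risk levels.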
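The effect of the variance terms can be illustrated with a few hypothetical numbers. In the sketch below, `lambda_p` and `lambda_a` stand for the two risk constants, and each candidate outcome is summarized by the mean and variance of its return distribution; the names, the scoring functions, and all values are assumptions for illustration rather than the paper's implementation.

```python
def protagonist_score(mean_return, variance, lambda_p):
    """Risk-averse score: high expected return, penalized by return variance."""
    return mean_return - lambda_p * variance


def adversary_score(mean_return, variance, lambda_a):
    """Risk-seeking score: low expected return, rewarded for return variance."""
    return -mean_return + lambda_a * variance


# Hypothetical (mean, variance) of the return for two candidate protagonist actions.
safe, risky = (10.0, 1.0), (11.0, 25.0)
lambda_p = 0.1
prefers_safe = protagonist_score(*safe, lambda_p) > protagonist_score(*risky, lambda_p)

# Hypothetical (mean, variance) for two candidate adversary actions.
calm, chaotic = (8.0, 1.0), (9.0, 30.0)
lambda_a = 0.1
prefers_chaotic = adversary_score(*chaotic, lambda_a) > adversary_score(*calm, lambda_a)
```

Here the protagonist gives up one unit of expected return to avoid a much wider return distribution, while the adversary accepts a slightly higher expected return for the protagonist in exchange for a much larger spread of outcomes.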

IV Implementation of Algorithm

In this section, we combine the minimax formulation with the existing DSAC algorithm to present our Minimax DSAC algorithm, which is exposed to severe environmental variation during training and thereby improves its generalization ability to unknown environments. To handle problems with large continuous domains, we use function approximators for the return distribution function and both policies, each of which is modeled as a Gaussian with mean and variance given by neural networks (NNs). We consider a parameterized state-action return distribution function $\mathcal{Z}_{\theta}(\cdot|s, a, \bar{a})$, a stochastic protagonist policy $\pi_{\phi}(a|s)$ and a stochastic adversarial policy $\bar{\pi}_{\bar{\phi}}(\bar{a}|s)$, where $\theta$, $\phi$ and $\bar{\phi}$ are parameters. Next, we derive update rules for these parameter vectors and present the details of Minimax DSAC.

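In the policy evaluation step, the current protagonist policy $\pi_{\phi}$ and adversary policy $\bar{\pi}_{\bar{\phi}}$ are given, and the return distribution is updated by minimizing the difference between the target return distribution and the current return distribution. The formulation is similar to that of the DSAC algorithm [25], except that we consider two policies:

$$J_{\mathcal{Z}}(\theta) = \mathbb{E}_{(s, a, \bar{a}, r, s') \sim \mathcal{B}}\Big[D_{\mathrm{KL}}\big(\mathcal{T}_{\mathcal{D}} \mathcal{Z}_{\theta'}(\cdot|s, a, \bar{a})\,\big\|\,\mathcal{Z}_{\theta}(\cdot|s, a, \bar{a})\big)\Big] = c - \mathbb{E}\Big[\log \mathcal{P}\big(\mathcal{T}_{\mathcal{D}} Z(s, a, \bar{a})\,\big|\,\mathcal{Z}_{\theta}(\cdot|s, a, \bar{a})\big)\Big],$$

where $c$ is a constant that does not depend on $\theta$ and $\mathcal{B}$ is the replay buffer. The gradient with respect to the parameter $\theta$ can be written as:

$$\nabla_{\theta} J_{\mathcal{Z}}(\theta) = -\mathbb{E}\Big[\nabla_{\theta} \log \mathcal{P}\big(\mathcal{T}_{\mathcal{D}} Z(s, a, \bar{a})\,\big|\,\mathcal{Z}_{\theta}(\cdot|s, a, \bar{a})\big)\Big].$$

To prevent the gradient from exploding, we also adopt the clipping technique to keep the target close to the expected value $Q_{\theta}(s, a, \bar{a})$ of the current distribution $\mathcal{Z}_{\theta}(\cdot|s, a, \bar{a})$:

$$\mathcal{T}_{\mathcal{D}} Z(s, a, \bar{a}) \leftarrow \mathrm{clip}\big(\mathcal{T}_{\mathcal{D}} Z(s, a, \bar{a}),\ Q_{\theta}(s, a, \bar{a}) - b,\ Q_{\theta}(s, a, \bar{a}) + b\big),$$

where $b$ is a hyperparameter representing the clipping boundary.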
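A minimal sketch of such a clipping step is given below, assuming the target return is clipped to a symmetric window around the current mean estimate. The function name `clip_target` and the example values are hypothetical; the boundary of 20 matches the clipping boundary listed in Table I.

```python
def clip_target(target_return, q_current, b):
    """Clip a target return into [q_current - b, q_current + b]."""
    lower, upper = q_current - b, q_current + b
    return max(lower, min(target_return, upper))


# Hypothetical current mean estimate of 50 with clipping boundary b = 20.
too_high = clip_target(100.0, 50.0, 20.0)   # pulled down to the upper bound
inside = clip_target(45.0, 50.0, 20.0)      # already within the window
too_low = clip_target(0.0, 50.0, 20.0)      # pulled up to the lower bound
```

Bounding the distance between target and current mean bounds the size of the resulting gradient, which is the motivation stated in the text for using the clipping boundary.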

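To stabilize the learning process, a target return distribution with parameters $\theta'$ and two target policy functions with separate parameters $\phi'$ and $\bar{\phi}'$ are used to evaluate the target function. The target networks use a slow-moving update rate, parameterized by $\tau$, such as

$$\theta' \leftarrow \tau\theta + (1-\tau)\theta', \qquad \phi' \leftarrow \tau\phi + (1-\tau)\phi', \qquad \bar{\phi}' \leftarrow \tau\bar{\phi} + (1-\tau)\bar{\phi}'.$$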
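A slow-moving (Polyak) target update of this kind can be sketched as follows, with parameters represented as plain lists of floats. The function name `soft_update` and the example values are assumptions for illustration; the default rate of 0.001 is the target update rate listed in Table I.

```python
def soft_update(target_params, online_params, tau=0.001):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, online_params)]


# Hypothetical two-parameter network.
online = [1.0, 1.0]
target = [0.0, 1.0]
target_after_one = soft_update(target, online, tau=0.1)

# Repeated updates make the target track the online parameters slowly.
tracked = target
for _ in range(1000):
    tracked = soft_update(tracked, online)
```

With a rate of 0.001 the target has closed only part of the gap after a thousand updates, which is what keeps the training target stable while the online networks change quickly.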

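In the policy improvement step, as discussed above, the protagonist policy aims to maximize the expected return with entropy and select actions with low variance:

$$J_{\pi}(\phi) = \mathbb{E}_{s \sim \mathcal{B},\ a \sim \pi_{\phi},\ \bar{a} \sim \bar{\pi}_{\bar{\phi}}}\big[Q_{\theta}(s, a, \bar{a}) - \lambda_{p}\, \sigma^{2}_{\theta}(s, a, \bar{a}) - \alpha \log \pi_{\phi}(a|s)\big].$$

The adversarial policy aims to minimize the expected return and select actions with high variance:

$$J_{\bar{\pi}}(\bar{\phi}) = \mathbb{E}_{s \sim \mathcal{B},\ a \sim \pi_{\phi},\ \bar{a} \sim \bar{\pi}_{\bar{\phi}}}\big[-Q_{\theta}(s, a, \bar{a}) + \lambda_{a}\, \sigma^{2}_{\theta}(s, a, \bar{a})\big].$$

Suppose the mean $Q_{\theta}$ and variance $\sigma^{2}_{\theta}$ of the return distribution can be explicitly parameterized by parameters $\theta$. We can derive the policy gradients of the protagonist and adversary policies using the reparameterization trick:

$$a = f_{\phi}(\xi_{a}; s), \qquad \bar{a} = \bar{f}_{\bar{\phi}}(\xi_{\bar{a}}; s),$$

where $\xi_{a}$ and $\xi_{\bar{a}}$ are auxiliary variables sampled from some fixed distribution. Then the protagonist policy gradient can be derived as

$$\nabla_{\phi} J_{\pi}(\phi) = \mathbb{E}_{s \sim \mathcal{B},\ \xi_{a},\ \xi_{\bar{a}}}\Big[-\alpha \nabla_{\phi} \log \pi_{\phi}(a|s) + \big(\nabla_{a} Q_{\theta}(s, a, \bar{a}) - \lambda_{p} \nabla_{a} \sigma^{2}_{\theta}(s, a, \bar{a}) - \alpha \nabla_{a} \log \pi_{\phi}(a|s)\big) \nabla_{\phi} f_{\phi}(\xi_{a}; s)\Big],$$

and the adversarial policy gradient can be approximated with

$$\nabla_{\bar{\phi}} J_{\bar{\pi}}(\bar{\phi}) = \mathbb{E}_{s \sim \mathcal{B},\ \xi_{a},\ \xi_{\bar{a}}}\Big[\big(-\nabla_{\bar{a}} Q_{\theta}(s, a, \bar{a}) + \lambda_{a} \nabla_{\bar{a}} \sigma^{2}_{\theta}(s, a, \bar{a})\big) \nabla_{\bar{\phi}} \bar{f}_{\bar{\phi}}(\xi_{\bar{a}}; s)\Big].$$

Finally, the temperature $\alpha$ is updated by minimizing the following objective

$$J(\alpha) = \mathbb{E}_{(s, a) \sim \mathcal{B}}\big[\alpha\big(-\log \pi_{\phi}(a|s) - \overline{\mathcal{H}}\big)\big],$$

where $\overline{\mathcal{H}}$ is the expected entropy. The details of our algorithm are given in Algorithm 1.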

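  Initialize parameters $\theta$, $\phi$, $\bar{\phi}$ and $\alpha$
  Initialize target parameters $\theta' \leftarrow \theta$, $\phi' \leftarrow \phi$, $\bar{\phi}' \leftarrow \bar{\phi}$
  Initialize learning rates $\beta_{\mathcal{Z}}$, $\beta_{\pi}$, $\beta_{\alpha}$, and $\tau$
  Initialize iteration index $k = 0$
  repeat
     Select actions $a \sim \pi_{\phi}(\cdot|s)$, $\bar{a} \sim \bar{\pi}_{\bar{\phi}}(\cdot|s)$
     Observe reward $r$ and new state $s'$
     Store transition tuple $(s, a, \bar{a}, r, s')$ in buffer $\mathcal{B}$
     Sample $N$ transitions $(s, a, \bar{a}, r, s')$ from $\mathcal{B}$
     Update return distribution parameters $\theta$
     Update protagonist policy parameters $\phi$
     Update adversarial policy parameters $\bar{\phi}$
     Adjust temperature $\alpha$
     Update target networks: $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$, $\phi' \leftarrow \tau\phi + (1-\tau)\phi'$, $\bar{\phi}' \leftarrow \tau\bar{\phi} + (1-\tau)\bar{\phi}'$
  until Convergence
Algorithm 1 Minimax DSAC Algorithm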

V Experiments

In this section, we evaluate our algorithm in an autonomous driving environment, with an intersection as the driving scenario. Our experiments aim to answer two primary questions: (1) How well does Minimax DSAC perform on this task compared with the standard DSAC algorithm? (2) Does our algorithm still work, or even behave better, when there are disturbances from the environment?

V-A Simulation Environment

We focus on a typical four-way intersection, shown in Fig. 1. Each direction is denoted by its location in the figure, i.e., up (U), down (D), left (L) and right (R). The intersection is unsignalized and each direction has one lane. The protagonist vehicle (black car in Fig. 1) attempts to travel from down to up, while two adversarial vehicles (green cars in Fig. 1) travel from right to left and from left to right, respectively. The trajectories of all three vehicles are given in advance; as a result, there are two traffic conflict points between the paths of the protagonist vehicle and the adversarial vehicles, marked by the red circles in Fig. 1. In our experimental setting, the protagonist vehicle attempts to pass the intersection safely and quickly, while the two adversarial vehicles try to prevent this by hitting the protagonist vehicle.

Fig. 1: Intersection Scenario

Suppose that all vehicles are equipped with positioning and velocity devices, so that we can choose the position and velocity of each vehicle as states, i.e., $(d, v)$, where $d$ is the distance between the vehicle and the center of the intersection and $v$ is its velocity. Note that $d$ is positive when the vehicle is heading toward the center and negative when it is leaving. For the action space, we choose the acceleration of each vehicle and suppose that vehicles can strictly follow the desired acceleration.

The reward function is designed to consider both safety and time efficiency. The task is episodic, with two termination conditions: collision or passing. First, if the protagonist vehicle passes the intersection safely, a large positive reward of 110 is given. Second, if a collision happens anywhere, a large negative reward of -110 is given to the protagonist vehicle. In addition, a minor negative step reward of -1 is given at every time step to encourage the protagonist vehicle to pass as quickly as possible. The adversarial vehicles receive the negative of the protagonist's reward in each of these cases.

Overall, the protagonist vehicle learns to control its acceleration to pass the intersection successfully while avoiding potential collisions with the adversarial vehicles, and the two adversarial vehicles learn to control their accelerations to cause a collision.
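The reward scheme described in this subsection can be sketched as a small function. The reward values (110, -110, -1) come from the text; the function name `step_rewards`, its boolean arguments, and the choice that a terminal step yields only the terminal reward (rather than the terminal reward plus the step penalty) are assumptions, since the text does not specify how the two combine.

```python
PASS_REWARD = 110.0
COLLISION_REWARD = -110.0
STEP_REWARD = -1.0


def step_rewards(passed, collided):
    """Return (protagonist_reward, adversary_reward, done) for one time step.

    Assumption: on a terminal step only the terminal reward is given.
    """
    if collided:
        r = COLLISION_REWARD
    elif passed:
        r = PASS_REWARD
    else:
        r = STEP_REWARD
    # Zero-sum setting: the adversary receives the negative of the protagonist's reward.
    return r, -r, (collided or passed)


# Hypothetical episode: five ordinary steps, then a safe pass.
episode = [step_rewards(False, False)] * 5 + [step_rewards(True, False)]
protagonist_return = sum(r for r, _, _ in episode)
adversary_return = sum(r_bar for _, r_bar, _ in episode)
```

Because every step before termination costs the protagonist one unit, waiting for the adversarial vehicles lowers the protagonist's episode return even when it eventually passes safely.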

V-B Algorithm Details and Results

The value function and the two policies are each approximated by a multi-layer perceptron (MLP) with two hidden layers and 256 units per layer. The policy of the protagonist vehicle aims to maximize the future expected return, while the policy of the adversarial vehicles aims to minimize it. The baseline for our algorithm is the standard DSAC [25] without the adversarial policy, in which the protagonist vehicle learns to pass through the intersection in the presence of two surrounding vehicles that act randomly. We also adopt the asynchronous parallel architecture of DSAC, in which 4 learners and 3 actors are used to accelerate learning. The hyperparameters used in training are listed in Table I and the training results are shown in Fig. 2, where the solid lines and shaded regions correspond to the mean and 95% confidence interval over 10 runs.

Hyperparameter              Value
Model type
Hidden units                256
Hidden layers               2
Max buffer size             500
Sample batch size           256
Hidden layers activation    GELU
Optimizer type              Adam
Adam parameter
Actor learning rate
Critic learning rate
$\alpha$ learning rate
Discount factor             0.99
Temperature                 Auto
Target update rate          0.001
Max train                   500000
Clipping boundary           20
Actor number                4
Learner number              3
Buffer number               1
Seed                        Current time
Max steps per episode       100

TABLE I: Training hyperparameters
Fig. 2: Average return

Results show that Minimax DSAC obtains a smaller mean and a larger variance of the average return than standard DSAC, which can be explained by the strong disturbance that the adversary policy imposes on the learning of the protagonist policy in the minimax formulation. It is also clear that Minimax DSAC fluctuates more than standard DSAC at convergence. This is because the protagonist vehicle has learned to avoid potential collisions by decelerating, or even stopping and waiting, when facing aggressive adversarial vehicles, which incurs the step penalty at each time step and ultimately results in a lower return.

V-C Evaluation

Beyond the performance during training, we are more concerned with performance in situations never encountered before, i.e., the generalization ability. Here, we employ a training/test split to test the generalization of our algorithm, aiming to explore whether Minimax DSAC, after training with the minimax formulation, behaves better in a varying environment than standard DSAC. As the adversarial vehicles can be regarded as part of the environment, we can design different driving modes for the adversarial agents to adjust the environment difficulty and evaluate the generalization ability of the protagonist policy. Concretely, we design three driving modes for the adversarial agents: aggressive, conservative and random. In aggressive mode, the two adversarial vehicles sample their accelerations from a positive interval, while in conservative mode they sample them from a negative interval. In random mode, each adversarial vehicle samples its acceleration from its own interval.

The comparison of the two methods under the three modes is shown in Fig. 3, in which the corresponding $p$-values are also marked. Results show that Minimax DSAC greatly improves performance under different modes of the adversarial vehicles, especially in the aggressive and random modes. In conservative mode, the two algorithms differ only slightly because both adversarial vehicles drive at the lowest speed allowed, so fewer potential collisions with the protagonist occur. However, Minimax DSAC still obtains a higher return because it adopts a large acceleration to improve passing efficiency. The t-test results in Fig. 3 show that the average reward of DSAC is significantly smaller than that of Minimax DSAC. To sum up, our Minimax DSAC algorithm maintains better performance when encountering different kinds of variations from the environment.
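For readers who want to reproduce this kind of comparison on their own evaluation returns, the sketch below computes a two-sample t statistic. It assumes Welch's unequal-variance form, which the text does not specify, and the sample values are hypothetical placeholders rather than results from the paper; converting the statistic to a p-value additionally requires the Student t distribution, which is not in the Python standard library.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means (sample variance / n).
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, dof


# Hypothetical per-episode returns for two methods (placeholders only).
returns_baseline = [1.0, 2.0, 3.0, 4.0, 5.0]
returns_minimax = [2.0, 3.0, 4.0, 5.0, 6.0]
t_stat, dof = welch_t(returns_baseline, returns_minimax)
```

A negative statistic here means the first sample has the lower mean; whether the difference is significant depends on the p-value obtained from the t distribution with the computed degrees of freedom.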

Fig. 3: Evaluation results

VI Conclusion

In this paper, we combine the minimax formulation with the distributional framework to improve the generalization ability of RL algorithms, in which the protagonist agent must compete with the adversarial agent to learn how to behave well. Based on the DSAC algorithm, we propose the Minimax DSAC algorithm and implement it on autonomous driving tasks at intersections. Results show that our algorithm greatly improves the robustness of the protagonist agent to variations of the environment. This study provides a promising approach to accelerating the application of RL algorithms in real-world settings such as autonomous driving, where algorithms are usually developed in a simulator and then deployed in the real environment.