Off-policy Maximum Entropy Reinforcement Learning: Soft Actor-Critic with Advantage Weighted Mixture Policy (SAC-AWMP)

The optimal policy of a reinforcement learning problem is often discontinuous and non-smooth. That is, for two states with similar representations, the optimal policies can be significantly different. In this case, representing the entire policy with a function approximator (FA) whose parameters are shared across all states may not be desirable, as the generalization induced by parameter sharing makes it difficult to represent discontinuous, non-smooth policies. A common way to solve this problem, known as Mixture-of-Experts, is to represent the policy as the weighted sum of multiple components, where different components perform well on different parts of the state space. Following this idea and inspired by a recent work called advantage-weighted information maximization, we propose to learn, for each state, the weights of these components so that they capture information about the state itself and about the preferred action learned so far for that state. The action preference is characterized via the advantage function. In this case, the weight of each component is large only for groups of states whose representations are similar and whose preferred action representations are also similar. Therefore each component is easy to represent. We call a policy parameterized in this way an Advantage Weighted Mixture Policy (AWMP) and apply this idea to improve soft-actor-critic (SAC), one of the most competitive continuous control algorithms. Experimental results demonstrate that SAC with AWMP clearly outperforms SAC in four commonly used continuous control tasks and achieves stable performance across different random seeds.


1 Introduction

Many interesting reinforcement learning (RL) problems have large state spaces, such as Go [Silver et al.2017], Atari games [Mnih et al.2015] and robotic manipulation control [Levine et al.2016, Xu et al.2018]. For these problems, it is impossible for the agent to visit all states, and therefore tabular approaches from classic reinforcement learning are not helpful [Sutton and Barto2018, Wan et al.2019]. The agent must instead use function approximation (FA) to represent its value, policy or model, sharing parameters across different inputs. Sharing parameters gives the FA generalization ability, which, loosely speaking, means that the FA's outputs are similar for similar inputs. Generalization makes it possible to generate reliable outputs for inputs that were not used to train the FA. At the same time, it makes it hard to approximate functions that are discontinuous or non-smooth.

A method called Mixture-of-Experts (MoE) [Xu et al.1995] is commonly used to deal with this issue. It approximates a function with a group of experts, each of which approximates the function only locally. The rationale is that although the function to be approximated may be discontinuous and non-smooth over all possible inputs, it can still be smooth and continuous over some subset of inputs and therefore easy to represent on that subset. As long as the ensemble of experts covers the entire input space, the approximated function can be the weighted sum of these experts [Shazeer et al.2017]. The remaining problem is how to determine the weights of the experts for different inputs. A reasonable idea is that each expert's weight should be high only on the part of the input space where the function is easy to represent.
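The weighted-sum idea can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the two linear experts, the hand-written `gate` function and its `sharpness` parameter are assumptions, not the gating used in this paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate(x, sharpness=50.0):
    """Hypothetical gating function: weights for two experts, split at x = 0."""
    return softmax([-sharpness * x, sharpness * x])

def expert_left(x):
    """Hypothetical local expert, intended for x < 0."""
    return -1.0 + 0.1 * x

def expert_right(x):
    """Hypothetical local expert, intended for x > 0."""
    return 1.0 + 0.1 * x

def mixture(x):
    """Weighted sum of the expert outputs; approximates a function with a jump at 0."""
    weights = gate(x)
    outputs = [expert_left(x), expert_right(x)]
    return sum(w * y for w, y in zip(weights, outputs))
```

Each expert is a smooth function, yet the mixture changes sharply near zero because the gate hands control from one expert to the other.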

In this paper, we use MoE to represent the agent's policy. For concreteness, we call each expert a policy component and the entire policy the mixture policy. The weights of the policy components are learned by a method called advantage-weighted information maximization, which assigns weights so that each policy component is simple to represent. This method was originally proposed in [Osa et al.2019] as a way to learn a policy-over-options for temporal abstraction [Bacon et al.2017]. However, we note that the method is by itself a way to generate a probability distribution for each state and is not limited to temporal abstraction. In this paper, we apply it to learn the weights of the mixture policy. We call a mixture policy whose weights are learned via advantage-weighted information maximization an advantage weighted mixture policy (AWMP).

In AWMP, given a state, the weights of the policy components are generated by a neural network, called the prior network, that takes the state as input. The parameters of the prior network are learned by maximizing the mutual information (MI) [Chen et al.2016, Houthooft et al.2016] between the state-action pair, drawn under the policy induced by the advantage function learned so far, and the policy component, sampled from the probability distribution induced by the component weights. Because of the generalization ability of the prior network, similar state-action inputs are likely to yield similar weights. Meanwhile, in order to maximize the MI, state-action pairs whose representations differ substantially are likely to produce different weights. In this way, each policy component has a high weight only for a group of states whose representations are similar and whose preferred actions, as learned so far, are also similar. For this reason, the policy component for these states is much simpler and requires less capacity to represent.

Figure 1: Schematic of AWMP

AWMP can in general be combined with any policy-based RL algorithm, including the state-of-the-art on-policy method PPO [Schulman et al.2017] and the most popular off-policy method TD3 [Fujimoto et al.2018]. In this paper, we combine it with one of the state-of-the-art off-policy continuous control algorithms, soft actor-critic (SAC) [Haarnoja et al.2018a], which has achieved state-of-the-art efficiency and stability on several continuous control MuJoCo tasks and challenging real-world robotic tasks [Haarnoja et al.2018b]. SAC aims to learn a stochastic Gaussian policy based on the maximum entropy objective by maximizing both the given reward and an augmented entropy term [Ziebart et al.2008]. Therefore, as shown in Figure 1, we propose a Gaussian mixture policy with several stochastic policy components as the mixture policy, where each policy component estimates an independent Gaussian policy. The resulting algorithm is called soft actor-critic with advantage-weighted mixture policy (SAC-AWMP). We show empirically that the resulting algorithm clearly outperforms the standard SAC and TD3 in four commonly used continuous control domains, in terms of both learning efficiency and stability.

The rest of the paper is arranged as follows: Section 2 introduces the necessary background, including maximum entropy RL and mutual information; Section 3 introduces the prior network and the SAC-AWMP algorithm; and Section 4 empirically compares SAC-AWMP with the standard SAC, showing that the proposed SAC-AWMP can improve the performance of SAC.

2 Background

In this section, we define the notation and derive the soft policy iteration of maximum entropy RL.

2.1 Preliminaries

RL tasks with continuous state and action spaces are generally formulated as a Markov decision process (MDP) consisting of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a state transition function and a reward function. At each environment step $t$, based on the state of the environment $s_t$, the agent selects an action $a_t$ generated by the policy $\pi$; the agent then receives a reward and the environment transitions to the next state $s_{t+1}$. Given an initial state distribution, a trajectory starts from an initial state and follows the actions chosen under the policy $\pi$. The standard RL objective is to learn a policy that maximizes the expected return, where the return is the cumulative reward of one episode up to the terminal time step.

2.2 Maximum entropy RL

Compared to the standard RL objective, the maximum entropy RL objective is augmented with the entropy of the stochastic policy and is formulated as:


Here the action value function approximates the expected return; each state-action pair is weighted by its probability along the trajectory induced by the policy, and each state by its visitation frequency. The temperature balances the importance of the stochasticity of the optimal policy against the cumulative reward.

2.3 Soft policy iteration

To learn the optimal maximum entropy policy with a convergence guarantee, soft policy iteration is derived in a way similar to [Haarnoja et al.2018a]; it alternates between soft policy evaluation and soft policy improvement. In the soft policy evaluation step, given a fixed policy, the soft action value is computed iteratively via a soft Bellman backup operator as:


where denotes the entropy reward; denotes a practical discount factor.

Lemma 1.

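(Soft Policy Evaluation). Consider the soft Bellman backup operator $\mathcal{T}^{\pi}$ and the initial soft action value $Q^{0}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ with $|\mathcal{A}| < \infty$. Define $Q^{k+1} = \mathcal{T}^{\pi} Q^{k}$; as $k \rightarrow \infty$, $Q^{k}$ will converge to the soft action value of $\pi$.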


Provided the stated assumption holds, so that the entropy-augmented reward is bounded, the convergence of soft policy evaluation, updated as in Equation (2), can be proved in the same way as for standard policy evaluation [Sutton and Barto2018]. ∎
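As an illustration of Lemma 1, the following is a minimal tabular sketch in Python. It assumes the standard SAC form of the soft Bellman backup, Q(s, a) ← r(s, a) + γ E[V(s')] with V(s) = E[Q(s, a) − α log π(a|s)], which may differ in detail from Equation (2). The function names and the nested-list layout of `P` and `R` are hypothetical.

```python
import math

def soft_backup(Q, policy, P, R, alpha, gamma):
    """Apply the assumed soft Bellman backup once on a tabular MDP.

    Q[s][a] and R[s][a] are values and rewards, policy[s][a] is pi(a|s),
    and P[s][a][s2] is the transition probability to state s2.
    """
    n_states = len(Q)
    # Soft state value: expected Q plus the entropy bonus under the policy.
    V = [sum(p * (Q[s][a] - alpha * math.log(p))
             for a, p in enumerate(policy[s]) if p > 0.0)
         for s in range(n_states)]
    return [[R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
             for a in range(len(Q[s]))]
            for s in range(n_states)]

def soft_policy_evaluation(policy, P, R, alpha=0.2, gamma=0.9, tol=1e-10):
    """Iterate the backup from Q = 0 until the largest change is below tol."""
    Q = [[0.0 for _ in row] for row in R]
    while True:
        Q_new = soft_backup(Q, policy, P, R, alpha, gamma)
        gap = max(abs(new - old)
                  for row_new, row_old in zip(Q_new, Q)
                  for new, old in zip(row_new, row_old))
        Q = Q_new
        if gap < tol:
            return Q
```

With a discount factor below one the backup is a contraction, so the loop terminates; for a one-state, two-action MDP the fixed point can be checked by hand.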

In the soft policy improvement step, the policy is updated to minimize the Kullback-Leibler (KL) divergence between the next policy and the target distribution induced by the soft action value function:

Lemma 2.

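(Soft Policy Improvement). Consider $\pi_{\mathrm{old}} \in \Pi$ and let $\pi_{\mathrm{new}}$ be the optimizer of the minimization objective in Equation (3). Then $Q^{\pi_{\mathrm{new}}}(s_t, a_t) \geq Q^{\pi_{\mathrm{old}}}(s_t, a_t)$ for all $(s_t, a_t) \in \mathcal{S} \times \mathcal{A}$ under the assumption $|\mathcal{A}| < \infty$.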


See Supplementary Material. A.1. ∎
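For intuition, the improvement step can be sketched for a single state with a finite action set. The sketch assumes the target distribution is the Boltzmann distribution exp(Q/α)/Z, as in SAC, and an unrestricted policy class, in which case the KL minimizer is the target itself. The function names are hypothetical.

```python
import math

def boltzmann_policy(q_values, alpha):
    """Target distribution proportional to exp(Q / alpha) over a finite action set."""
    m = max(q_values)
    exps = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with matching support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def soft_value(policy, q_values, alpha):
    """Entropy-regularized value E[Q(s, a) - alpha * log pi(a|s)] for one state."""
    return sum(p * (q - alpha * math.log(p))
               for p, q in zip(policy, q_values) if p > 0.0)
```

In this unrestricted setting, the improved policy attains a soft value of α log Σ exp(Q/α), which is at least the soft value of any other policy evaluated on the same Q.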

The full soft policy iteration (Theorem 1) is derived in a way similar to SAC [Haarnoja et al.2018a] by repeating soft policy evaluation and soft policy improvement. To perform continuous control tasks, the soft action value and the policy of SAC need to be estimated by function approximation.

Theorem 1.

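(Soft Policy Iteration). Repeating soft policy evaluation and soft policy improvement alternately, starting from any initial policy $\pi \in \Pi$, converges to an optimal policy $\pi^{*}$ with $Q^{\pi^{*}}(s_t, a_t) \geq Q^{\pi}(s_t, a_t)$ for all $\pi \in \Pi$ and all $(s_t, a_t) \in \mathcal{S} \times \mathcal{A}$ under the assumption $|\mathcal{A}| < \infty$.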


See Supplementary Material. A.2. ∎

2.4 Mutual information

MI maximization has been studied as a way to learn interpretable representations [Chen et al.2016] and to achieve effective exploration in continuous control RL [Houthooft et al.2016]. The MI $I(X; Y)$ denotes the amount of information shared between the random variables $X$ and $Y$, calculated as:

$$I(X; Y) = H(X) - H(X \mid Y) \qquad (4)$$

where $H(X)$ and $H(X \mid Y)$ represent the entropy and conditional entropy, respectively.
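As a small worked example of this definition, the sketch below computes the MI of two discrete random variables from a joint probability table. The table layout `joint[x][y]` and the function names are assumptions made for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def mutual_information(joint):
    """I(X; Y) = H(X) - H(X | Y) for a joint table with joint[x][y] = p(x, y)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    h_x = entropy(p_x)
    # H(X | Y) is the average over y of the entropy of p(x | y).
    h_x_given_y = 0.0
    for y, py in enumerate(p_y):
        if py > 0.0:
            h_x_given_y += py * entropy([row[y] / py for row in joint])
    return h_x - h_x_given_y
```

For independent variables the result is zero, and for two perfectly correlated binary variables it equals log 2.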

3 SAC-AWMP

Some previous work learns a hierarchical policy with latent variables by dividing the state space. In this paper, the state-action space is instead divided according to the modes of the advantage function. First, the prior network is introduced to learn the weights of the AWMP based on an advantage-weighted information maximization objective. Second, we describe how to learn the AWMP on off-policy data with the designed maximum entropy objective.

3.1 Prior network

The prior network has a softmax output layer that produces the weights of the policy components, one weight for each policy component in the AWMP. To maximize the MI between state-action pairs and weights, the parameters of the prior network are updated by minimizing a regularized objective [Krause et al.2010]:


where the regularization term penalizes instability of the output under input perturbations, and a coefficient balances the MI term against the regularization term. The benefit of this regularization has been verified in many representation learning tasks [Osa et al.2019].

The MI in Equation (5) is expressed as:


where the entropy is estimated by:


where denotes the probability of weights derived from the probability density , as:


Here the probability density of a state-action pair is induced by a policy based on the modes of the advantage function. We consider a formulation that meets the requirement that a state-action pair with a larger advantage value is given a higher probability, normalized by a partition function.

Likewise, the conditional entropy is estimated by:


In practice, this probability density cannot be estimated directly from past experience. To solve this issue, an advantage-weighted importance sampling approach is introduced, in which a behavior policy generates the experience. We assume that the change in the state distribution is sufficiently small that the state distributions under the two policies are approximately equal. The advantage-weighted importance sampling weights are calculated as:


To improve the training stability, the advantage-weighted importance sampling weights are normalized as:


Here the normalization is taken over the samples in the replay buffer. Based on the normalized advantage-weighted importance sampling weights in Equation (11), the probability of the weights in Equation (8) can be estimated by:


Therefore, the parameterized entropy is estimated by:


Likewise, the parameterized conditional entropy is estimated by:


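The estimators in this subsection can be sketched in Python as follows. This is only an illustration under stated assumptions: the advantage is mapped to an unnormalized density with an exponential, the importance weight divides by the behavior-policy probability, and the weights are normalized to sum to one over the batch. These choices, and the names `normalized_weights` and `mi_estimate`, are hypothetical and may not match Equations (10) to (14) exactly.

```python
import math

def normalized_weights(advantages, behavior_probs):
    """Importance weights proportional to exp(A_i) / beta_i, normalized to sum to one."""
    m = max(advantages)  # subtract the maximum for numerical stability
    raw = [math.exp(a - m) / b for a, b in zip(advantages, behavior_probs)]
    total = sum(raw)
    return [w / total for w in raw]

def mi_estimate(component_probs, weights):
    """Estimate H(k) - H(k | s, a), where component_probs[i][k] is the prior
    network output for sample i and component k."""
    n_components = len(component_probs[0])
    # Marginal probability of each component as a weighted average over samples.
    p_k = [sum(w * probs[k] for w, probs in zip(weights, component_probs))
           for k in range(n_components)]
    h_marginal = -sum(p * math.log(p) for p in p_k if p > 0.0)
    # Conditional entropy as a weighted average of per-sample entropies.
    h_conditional = -sum(w * sum(p * math.log(p) for p in probs if p > 0.0)
                         for w, probs in zip(weights, component_probs))
    return h_marginal - h_conditional
```

The estimate is largest when each sample is assigned confidently to one component while the components are used evenly across the batch.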
3.2 AWMP

The AWMP network comprises several policy components, each with its own parameters. The maximum entropy objective for learning the AWMP is:


Here the soft action value function is the one induced by the AWMP; the component index ranges over its set of possible values; the gating policy selects a policy component; each single policy component, given its index, is a stochastic Gaussian policy; and the entropy temperature controls the stochasticity of the AWMP.

At each environment step, the action is sampled from the AWMP and executed to interact with the environment. Given the state, the softmax gating policy is calculated by:


where the soft option value [Bacon et al.2017] represents the conditional expectation of the return when following a given policy component:


The soft state value induced by AWMP is derived as:


where denotes the entropy temperature of the gating policy. The advantage value is calculated as:


Function approximators are used to estimate both the soft state value and the soft action value. Based on the soft policy iteration theorem, the soft state value network and the soft action value network, each with its own parameters, can be learned alternately through stochastic gradient descent (SGD).
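The action selection described earlier in this subsection can be sketched as a two-stage sampling procedure. The sketch assumes the gating policy is a softmax of the soft option values divided by the gating temperature, and that each component is a one-dimensional Gaussian; both are simplifying assumptions, and the function names are hypothetical.

```python
import math
import random

def gating_policy(soft_option_values, temperature):
    """Softmax over soft option values, scaled by the gating temperature."""
    m = max(soft_option_values)
    exps = [math.exp((q - m) / temperature) for q in soft_option_values]
    z = sum(exps)
    return [e / z for e in exps]

def sample_action(soft_option_values, means, stds, temperature, rng):
    """Pick a policy component from the gating policy, then sample its Gaussian."""
    probs = gating_policy(soft_option_values, temperature)
    k = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return k, rng.gauss(means[k], stds[k])
```

A small gating temperature makes the choice of component nearly greedy with respect to the soft option values, while a large one spreads probability across components.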

The soft state value network is trained through minimizing the following error:


where state is sampled from the replay buffer and the target value is calculated by:


where the corresponding action is sampled from the current AWMP. In particular, we implement two independent soft action value networks and select the smaller of the two values to calculate the target value, which has been shown to reduce the effect of positive bias and to improve sample efficiency in previous value-based work [Fujimoto et al.2018, Zhang et al.2019]. The parameters of the soft action value function are updated by minimizing the soft Bellman residual error:


where is sampled from replay buffer ; the target value is calculated by:


where the next-state value is calculated from the target soft state value network, which has its own parameters.
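The two regression targets described above can be sketched for a single transition. The sketch follows the usual SAC construction (minimum of two critics minus the temperature-scaled log-probability, and a one-step bootstrapped target); the `done` flag and all function names are assumptions added for illustration.

```python
def value_target(q1, q2, log_prob, alpha):
    """Target for the soft state value: the smaller critic value plus the entropy bonus."""
    return min(q1, q2) - alpha * log_prob

def q_target(reward, next_value_target, gamma, done=False):
    """Target for the soft action value, bootstrapped from the target value network."""
    return reward + (0.0 if done else gamma * next_value_target)

def squared_error(prediction, target):
    """Half squared error minimized by each network for one sample."""
    return 0.5 * (prediction - target) ** 2
```

Taking the minimum of the two critic values is what counteracts the positive bias mentioned in the text.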

Instead of minimizing the objective in Equation (15) directly through gradient backpropagation, we apply the likelihood ratio gradient estimator to learn the AWMP from the off-policy data in the replay buffer [Williams1992, Haarnoja et al.2018a]. The objective is rewritten as:


This objective minimizes the expected KL divergence from the target density induced by the soft action value. The state is sampled from the replay buffer, the action is derived from the current AWMP, the weights of the AWMP are given by the prior network, and the target density is normalized by a partition function. The soft action value network can be differentiated in a way similar to the deterministic policy gradient (DPG) theorem [Silver et al.2014]. Therefore, a transformation trick is used to represent the AWMP with the weights from the prior network:


where each action term is derived from one policy component and is multiplied by the corresponding element of the weight vector, and the input noise is sampled from a fixed Gaussian distribution. The objective function in Equation (24) is rewritten as:


The gradient of the partition function can be omitted because it is independent of the policy parameters. The gradient with respect to the parameters is calculated as:


3.3 Practical training

Each policy component of the AWMP network outputs an action from an unbounded Gaussian distribution. In practice, an invertible squashing function is applied elementwise to bound the Gaussian samples, so the action is restricted to a bounded range. The density of the squashed action follows from the Jacobian of the transformation:


Therefore, the action sampled from the AWMP, weighted by the weights from the prior network, is still restricted to the same bounded range.
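A minimal sketch of the squashing step is given below. It assumes the squashing function is tanh, as in SAC, and that the mixture action is the convex combination of the squashed component actions; the small constant inside the logarithm and the function names are implementation assumptions.

```python
import math

def gaussian_log_prob(u, mean, std):
    """Log-density of a diagonal Gaussian evaluated at u."""
    return sum(-0.5 * ((x - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
               for x, m, s in zip(u, mean, std))

def squashed_sample(mean, std, noise):
    """Reparameterized sample u = mean + std * noise, squashed elementwise with tanh."""
    u = [m + s * e for m, s, e in zip(mean, std, noise)]
    action = [math.tanh(x) for x in u]
    # Change of variables: subtract sum_i log(1 - tanh(u_i)^2); 1e-6 avoids log(0).
    log_prob = gaussian_log_prob(u, mean, std) - sum(
        math.log(1.0 - a * a + 1e-6) for a in action)
    return action, log_prob

def mixture_action(weights, component_actions):
    """Convex combination of squashed component actions; stays within [-1, 1]."""
    dim = len(component_actions[0])
    return [sum(w * act[i] for w, act in zip(weights, component_actions))
            for i in range(dim)]
```

Because the weights come from a softmax, they are non-negative and sum to one, which is why the combined action cannot leave the range of the individual squashed actions.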

Our proposed algorithm is summarised in Algorithm 1. At each gradient step, the soft action-value network, the soft state-value network and the AWMP network are trained on mini-batches of off-policy samples from the replay buffer. To improve training stability, the prior network is trained on semi off-policy data: its mini-batch is sampled from the most recent samples, generated by the most recent behavior policies. Additionally, so that the AWMP can be updated sufficiently, a target soft state-action value network is used to obtain a gating policy that changes less frequently. Exponential moving averages, each with its own smoothing constant, are used to update the two target networks.
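The exponential moving average update of a target network can be sketched as follows, with parameters stored as a flat list of floats for simplicity. The function name `soft_update` and the argument name `tau` are assumptions.

```python
def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward the online parameter."""
    return [tau * p + (1.0 - tau) * t
            for t, p in zip(target_params, online_params)]
```

With a small smoothing constant, the target parameters track the online parameters slowly, which is the stabilizing effect the text relies on.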

1:  Input: Number of policy components , size of replay buffer , size of mini-batch and
2:  Initialize: , ; , ; ,
3:  for each iteration do
4:     for each environment step do
6:        ,
8:     end for
9:     for each prior network update step do
10:        Semi off-policy sample from
12:     end for
13:     for each gradient step do
14:        Off-policy samples from
20:        Target network update
23:     end for
24:  end for
Algorithm 1 SAC-AWMP

4 Experiments

We evaluate the proposed SAC-AWMP to understand its sample complexity and stability compared with previous state-of-the-art RL algorithms on four commonly used continuous control tasks from OpenAI Gym (see Figure 2) [Todorov et al.2012, Brockman et al.2016].

Figure 2: MuJoCo tasks. (a) Walker2d-v2, (b) Ant-v2, (c) Hopper-v2, (d) HalfCheetah-v2.

4.1 Settings

The sample complexity of off-policy RL methods such as TD3 and SAC has already been compared with that of the state-of-the-art on-policy algorithm PPO [Schulman et al.2017] in the TD3 and SAC papers [Fujimoto et al.2018, Haarnoja et al.2018a]. Therefore, in this paper, the proposed SAC-AWMP is compared only with TD3 and SAC, both of which are run with the code provided by their authors. More specifically, each policy component of SAC-AWMP has the same network architecture as SAC. Each algorithm is trained on every task for five trials with different seeds (0, 1, 2, 3, 4); each trial runs for 1 million steps, and the expected return is estimated from ten evaluation episodes every 1000 steps. All hyperparameters used in our experiments follow the original papers of SAC and TD3 [Fujimoto et al.2018, Haarnoja et al.2018a] and are listed in Supplementary Material B. The source code is available on GitHub.

4.2 Results and Comparisons

As shown in Figure 3, the solid curves represent the mean of the average evaluation return and the shaded regions correspond to half a standard deviation over five seeds. SAC-AWMP with four separate policy components outperforms SAC in terms of learning efficiency and stability on Ant-v2, Walker2d-v2 and Hopper-v2, while HalfCheetah-v2 can be solved easily by all of the RL methods. On the harder task Ant-v2, SAC-AWMP outperforms SAC and TD3 by a large margin. SAC-AWMP and SAC also achieve better stability than TD3 on all four tasks without any hyperparameter tuning.
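The curve and band described above can be computed per evaluation point as in the following sketch. It assumes the population standard deviation is used, which the text does not specify; the function name is hypothetical and the input is expected to be the evaluation returns of the individual seeds.

```python
import statistics

def mean_and_band(returns_per_seed, scale=0.5):
    """Return the mean and the bounds of a band of +/- scale standard deviations."""
    mean = statistics.mean(returns_per_seed)
    half_width = scale * statistics.pstdev(returns_per_seed)
    return mean, mean - half_width, mean + half_width
```

Applying this at every evaluation step gives the solid curve (the mean) and the shaded region (the lower and upper bounds).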

Figure 3: Learning curves for the Mujoco continuous control tasks. Entropy temperature term and . Number of policy components .
Figure 4: Learning curves for SAC-AWMP with different number of policy components. Entropy temperature term and .

The number of policy components in SAC-AWMP needs to be specified in advance, similar to the number of options and latent variables in hierarchical RL (HRL) [Bacon et al.2017, Osa et al.2019, Zhang and Whiteson2019]. How to discover meaningful policy components and option policies corresponding to each latent variable is a long-standing open question. The number of policy components should be task-dependent, which has not been investigated thoroughly. In previous HRL work, two or four option policies were tested on all continuous control tasks [Osa et al.2019, Zhang and Whiteson2019]. In this paper, SAC-AWMP is implemented with four different numbers of policy components. When only a single policy component is used, SAC-AWMP degenerates to SAC [Haarnoja et al.2018a]. As shown in Figure 4, SAC-AWMP outperforms SAC for the three other numbers of policy components, and SAC-AWMP with 8 policy components shows small variance during training.

Figure 5: Learning curves for SAC-AWMP with different smoothing coefficient.

The performance of maximum entropy RL largely depends on the entropy temperature, which is replaced by the reward scale in [Haarnoja et al.2018a]. Automatic entropy adjustment, which varies the temperature across tasks and learning stages, was proposed in [Haarnoja et al.2018b]; however, it has not achieved much improvement over the fixed entropy temperature used in [Haarnoja et al.2018a]. In this paper, the entropy temperature for each task is fixed to the same value as in [Haarnoja et al.2018a] (see Figure 3). A target network is a commonly used trick that slowly tracks the changing value function through a smoothing coefficient [Lillicrap et al.2015], and it has largely improved the learning stability of RL algorithms. As depicted in Algorithm 1, the AWMP can be learned on off-policy data, whereas the prior network needs to be learned on semi off-policy data (generated by the most recent policies). In addition to the target network for the soft state value function, we implement an independent target network for the soft action value to derive the gating policy. The same smoothing coefficient is used for all of the above experiments; additionally, as shown in Figure 5, we test two other smoothing coefficients. A large smoothing coefficient may result in instability and divergence, while a small value leads to slower learning.

5 Conclusion and discussion

In this article, we proposed soft actor-critic with advantage weighted mixture policy (SAC-AWMP), an off-policy maximum entropy RL algorithm. Without any specific hyperparameter tuning, we empirically demonstrate that the proposed SAC-AWMP, with weights learned via advantage-weighted information maximization, achieves smoother policy approximation and more stable learning than TD3, and improves the sample efficiency of the typical SAC on three MuJoCo tasks.

Compared to the typical SAC with a single stochastic Gaussian policy, the AWMP holds promise for solving complex tasks with high-dimensional continuous state and action spaces, or real-world tasks with hierarchical structure. Our proposed AWMP can in fact be combined with any policy-gradient method, such as PPO and TD3. In addition, in this paper the prior network could only be learned from 'semi off-policy' data. Further investigation in these directions could yield better sample efficiency and applicability.


This work was supported by Agency for Science, Technology and Research, Singapore, under the National Robotics Program, with A*star SERC Grant No.: 192 25 00054.


Appendix A Proofs

A.1 Lemma 2


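(Soft Policy Improvement). Consider $\pi_{\mathrm{old}} \in \Pi$ and let $\pi_{\mathrm{new}}$ be the optimizer of the minimization problem in Equation (3). Then $Q^{\pi_{\mathrm{new}}}(s_t, a_t) \geq Q^{\pi_{\mathrm{old}}}(s_t, a_t)$ for all $(s_t, a_t) \in \mathcal{S} \times \mathcal{A}$ with $|\mathcal{A}| < \infty$.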


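Let $\pi_{\mathrm{old}}$, $Q^{\pi_{\mathrm{old}}}$ and $V^{\pi_{\mathrm{old}}}$ denote the old policy, its soft state-action value and its soft state value, respectively. The new policy $\pi_{\mathrm{new}}$ is defined as in Equation (3), which can be rewritten as: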


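Since the old policy is itself a candidate in the policy set, the objective value attained by the new policy cannot exceed that attained by the old policy. Hence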


Since the partition function only depends on the state, the inequality reduces to:


Then, consider the soft Bellman equation:


where we expand the soft action value on the right-hand side repeatedly by applying the soft Bellman equation and the bound in Equation (31). Finally, convergence to the soft action value of the new policy follows from Lemma 1. ∎

A.2 Theorem 1


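(Soft Policy Iteration). Repeating soft policy evaluation and soft policy improvement alternately, starting from any initial policy $\pi \in \Pi$, converges to an optimal policy $\pi^{*}$ with $Q^{\pi^{*}}(s_t, a_t) \geq Q^{\pi}(s_t, a_t)$ for all $\pi \in \Pi$ and all $(s_t, a_t) \in \mathcal{S} \times \mathcal{A}$ with $|\mathcal{A}| < \infty$.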


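Let $\pi_i$ denote the policy at iteration $i$. Based on Lemma 2, the sequence $Q^{\pi_i}$ is monotonically increasing. Since $Q^{\pi}$ is bounded above for $\pi \in \Pi$, the sequence converges to some $\pi^{*}$. At convergence, no other policy in $\Pi$ can attain a lower value of the improvement objective than $\pi^{*}$. Using the same argument as in the proof of Lemma 2, the soft action value of $\pi^{*}$ is at least that of any other policy in $\Pi$ for all $(s_t, a_t) \in \mathcal{S} \times \mathcal{A}$. Hence $\pi^{*}$ is optimal in $\Pi$. ∎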

Appendix B Details of Experiments

Environment       Action Dimensions   Entropy
Ant-v2            8                   0.2
HalfCheetah-v2    6                   0.2
Walker2d-v2       6                   0.2
Hopper-v2         3                   0.2
Table 1: MuJoCo Environment Settings
Description                     Value
Batch size for critic           100
Number of hidden layers         (400, 400)
Activation function             ReLU, ReLU, tanh
Target smoothing coefficient    0.005
Learning rate                   3e-4
Gradient steps                  1
Replay buffer size              1e6
Entropy term                    0.2
Optimizer                       Adam
Discount factor                 0.99
Table 2: Hyper-parameters of SAC
Description                     Value
Batch size for critic           100
Batch size for policy           200()
Batch size for prior network    50
Target smoothing coefficient    0.001
Prior network update steps      5000
Learning rate                   3e-4
Noise for MI regularization     0.04
Coefficient for MI              0.1
Entropy term                    0.001
Table 3: Additional hyper-parameters of SAC-AWMP