1 Introduction
Many interesting reinforcement learning (RL) problems have large state spaces, such as Go [Silver et al.2017], Atari games [Mnih et al.2015] and robotic manipulation control [Levine et al.2016, Xu et al.2018]. For these problems, it is impossible for the agent to visit all states, and therefore the tabular approaches of classic reinforcement learning are not helpful [Sutton and Barto2018, Wan et al.2019]. The agent must instead use function approximation (FA) to represent its value function, policy or model, sharing parameters across different inputs. Sharing parameters gives the FA generalization ability, which, loosely speaking, means that the FA's outputs are similar for similar inputs. Generalization makes it possible to produce reliable outputs for inputs that were not used to train the FA. However, it also makes it hard to approximate functions that are discontinuous or not smooth.
A method called Mixture-of-Experts (MoE) [Xu et al.1995] is commonly used to deal with this issue. It approximates a function with a group of experts, each of which approximates the function only locally. The rationale is that although the function to be approximated may be discontinuous and non-smooth over all possible inputs, it can still be smooth and continuous over a subset of inputs and therefore easy to represent on that subset. As long as the ensemble of experts covers the entire input space, the function can be approximated by the weighted sum of these experts [Shazeer et al.2017]. The remaining problem is how to determine the weights of the experts for different inputs. A reasonable idea is that each expert's weight should be high only on the part of the input space where the function is easy to represent.
In this paper, we use MoE to represent the agent's policy. For concreteness, we call each expert a policy component and the entire policy the mixture policy. The weights of the policy components are learned by a method called advantage-weighted information maximization, which assigns weights so that each policy component is simple to represent. This method was originally proposed in [Osa et al.2019] as a way to learn a policy-over-options for temporal abstraction [Bacon et al.2017]. However, we note that the method by itself is a way to generate a probability distribution for each state and is not limited to temporal abstraction. In this paper, we apply it to learn the weights of the mixture policy. We call a mixture policy whose weights are learned via advantage-weighted information maximization an advantage weighted mixture policy (AWMP).
In AWMP, given a state, the weights of the policy components are generated by a neural network, called the prior network, that takes the given state as input. The parameters of the prior network are learned by maximizing the mutual information (MI) [Chen et al.2016, Houthooft et al.2016] between the state-action pair, under the policy induced by the advantage function learned so far, and the policy component sampled from the probability distribution induced by the weights of the policy components. Due to the generalization ability of the prior network, similar state-action pair inputs are likely to yield similar weights. Meanwhile, in order to maximize the MI, state-action pairs whose representations differ by a large degree are likely to produce different weights. In this way, each policy component has a high weight only for a group of states whose representations are similar and for which the representations of the preferred actions learned so far are also similar. For this reason, the policy component for these states is much simpler and requires less capacity to represent.
AWMP can in general be combined with any policy-based RL algorithm, including the state-of-the-art on-policy method PPO [Schulman et al.2017] and the popular off-policy method TD3 [Fujimoto et al.2018]. In this paper, we combine it with one of the state-of-the-art off-policy continuous control algorithms, soft actor-critic (SAC) [Haarnoja et al.2018a], which has achieved state-of-the-art efficiency and stability on several continuous control MuJoCo tasks and challenging real-world robotic tasks [Haarnoja et al.2018b]. SAC aims to learn a stochastic Gaussian policy under the maximum entropy objective by maximizing both the given reward and an augmented entropy term [Ziebart et al.2008]. Therefore, as shown in Figure 1, we propose a Gaussian mixture policy with several stochastic policy components as the mixture policy, where each policy component estimates an independent Gaussian policy. The resulting algorithm is called soft actor-critic with advantage-weighted mixture policy (SAC-AWMP). We show empirically that the resulting algorithm clearly outperforms standard SAC and TD3 in four commonly used continuous control domains, in terms of both learning efficiency and stability.
The rest of the paper is arranged as follows: Section 2 introduces the necessary background, including maximum entropy RL and mutual information; Section 3 introduces the prior network and the SAC-AWMP algorithm; Section 4 empirically compares SAC-AWMP with standard SAC and shows that the proposed SAC-AWMP improves on the performance of SAC.
2 Background
In this section, we define the notation and derive the soft policy iteration of maximum entropy RL.
2.1 Preliminaries
RL tasks with continuous state and action spaces are generally formulated as a Markov decision process (MDP) consisting of a state space, an action space, a state transition function and a reward function. At each environment step, based on the current state of the environment, the agent selects an action generated by its policy; the agent then receives a reward and the environment transitions to the next state. A trajectory starts from an initial state drawn from the initial state distribution and follows the actions selected by the policy. The standard RL objective is to learn a policy that maximizes the expected return, where the return is the cumulative reward of one episode up to the terminal time step.
2.2 Maximum entropy RL
Compared with the standard RL objective, the maximum entropy RL objective is augmented with the entropy of the stochastic policy and is formulated as:
(1) 
where denotes the action value function to approximate the expected return. denotes the probability of stateaction pair tracking the trajectory induced by policy . denotes the visitation frequency of . denotes the temperature to balance the importance of the stochasticity of the optimal policy against the cumulative reward.
2.3 Soft policy iteration
To learn the optimal maximum entropy policy with a convergence guarantee, soft policy iteration is derived similarly to [Haarnoja et al.2018a]; it alternates between soft policy evaluation and soft policy improvement. In the soft policy evaluation step, given a fixed policy, the soft action value is computed iteratively via a soft Bellman backup operator:
(2)  
where denotes the entropy reward; denotes a practical discount factor.
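To make the evaluation step concrete, the following is a minimal tabular sketch. It assumes the standard SAC-style backup, in which the soft state value is the policy-weighted average of the soft action value minus the temperature times the log-probability; the exact operator of this paper is not recoverable from the text above, and all names (`soft_state_value`, `soft_policy_evaluation`, the nested-list layout of `rewards`, `transitions` and `policy`) are illustrative assumptions.

```python
import math


def soft_state_value(q_row, pi_row, alpha):
    """Soft value of one state: sum_a pi(a|s) * (Q(s,a) - alpha * log pi(a|s))."""
    return sum(p * (q - alpha * math.log(p)) for q, p in zip(q_row, pi_row) if p > 0.0)


def soft_policy_evaluation(rewards, transitions, policy, alpha=0.2, gamma=0.99, sweeps=2000):
    """Repeatedly apply an assumed soft Bellman backup for a fixed policy.

    rewards[s][a]          : immediate reward
    transitions[s][a][s2]  : probability of moving from s to s2 under a
    policy[s][a]           : probability of action a in state s
    """
    n_states = len(rewards)
    n_actions = len(rewards[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        # Soft state values under the current estimate of Q.
        v = [soft_state_value(q[s], policy[s], alpha) for s in range(n_states)]
        # One synchronous backup: Q(s,a) <- r(s,a) + gamma * E_{s'}[V(s')].
        q = [
            [
                rewards[s][a]
                + gamma * sum(transitions[s][a][s2] * v[s2] for s2 in range(n_states))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
    return q
```

Because the backup is a contraction for a discount factor below one, the iterates approach a fixed point regardless of the initial table, which is the behavior Lemma 1 describes.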
Lemma 1.
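(Soft Policy Evaluation). Consider the soft Bellman backup operator $\mathcal{T}^{\pi}$ and an initial soft action value $Q^{0}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ with $|\mathcal{A}| < \infty$. Define $Q^{k+1} = \mathcal{T}^{\pi} Q^{k}$. As $k \rightarrow \infty$, $Q^{k}$ will converge to the soft action value of $\pi$.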
Proof.
Under the assumption $|\mathcal{A}| < \infty$, which guarantees that the entropy-augmented reward is bounded, the convergence of soft policy evaluation as updated in Equation (2) can be proved as for standard policy evaluation [Sutton and Barto2018]. ∎
In the soft policy improvement step, the policy is updated to minimize the Kullback-Leibler (KL) divergence between the new policy and the target distribution induced by the soft action value function:
(3) 
Lemma 2.
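(Soft Policy Improvement). Consider $\pi_{\mathrm{old}} \in \Pi$ and let $\pi_{\mathrm{new}}$ be the optimizer of the minimization objective in Equation (3). Then $Q^{\pi_{\mathrm{new}}}(s_t, a_t) \geq Q^{\pi_{\mathrm{old}}}(s_t, a_t)$ for all $(s_t, a_t) \in \mathcal{S} \times \mathcal{A}$ under the assumption $|\mathcal{A}| < \infty$.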
Proof.
See Supplementary Material. A.1. ∎
The full soft policy iteration (Theorem 1) is derived similarly to SAC [Haarnoja et al.2018a] by repeating soft policy evaluation and soft policy improvement. To perform continuous control tasks, the soft action value and the policy of SAC need to be estimated by function approximators.
Theorem 1.
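(Soft Policy Iteration). Repeating soft policy evaluation and soft policy improvement alternately, starting from any initial policy $\pi \in \Pi$, converges to an optimal policy $\pi^{*}$ with $Q^{\pi^{*}}(s_t, a_t) \geq Q^{\pi}(s_t, a_t)$ for all $\pi \in \Pi$ and all $(s_t, a_t) \in \mathcal{S} \times \mathcal{A}$ under the assumption $|\mathcal{A}| < \infty$.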
Proof.
See Supplementary Material. A.2. ∎
2.4 Mutual information
MI maximization has been studied as a way to learn interpretable representations [Chen et al.2016] and to achieve effective exploration in continuous control RL [Houthooft et al.2016]. The MI $I(X; Y)$ measures the amount of information shared between the random variables $X$ and $Y$, and is calculated as:
(4) $I(X; Y) = H(X) - H(X \mid Y)$
where $H(X)$ and $H(X \mid Y)$ represent the entropy and the conditional entropy, respectively.
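As a small worked illustration of the definition, the sketch below computes the MI of two discrete random variables, here called X and Y, from their joint probability table as the entropy of X minus the conditional entropy of X given Y. The function names and the nested-list table layout are assumptions made for this example only.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mutual_information(joint):
    """MI of X and Y, where joint[i][j] = p(X = i, Y = j)."""
    n_x = len(joint)
    n_y = len(joint[0])
    p_x = [sum(row) for row in joint]
    p_y = [sum(joint[i][j] for i in range(n_x)) for j in range(n_y)]
    h_x = entropy(p_x)
    # H(X | Y) = sum_j p(y_j) * H(X | Y = y_j)
    h_x_given_y = 0.0
    for j, py in enumerate(p_y):
        if py > 0.0:
            h_x_given_y += py * entropy([joint[i][j] / py for i in range(n_x)])
    return h_x - h_x_given_y
```

For an independent pair the result is zero, and for two perfectly coupled fair binary variables it equals log 2, the full entropy of either variable.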
3 SAC-AWMP
Some previous work aims to learn a hierarchical policy with a latent variable by dividing the state space. In this paper, the state-action space is instead divided according to the modes of the advantage function. First, the prior network is introduced to learn the weights of the AWMP based on an advantage-weighted information maximization objective. Second, we describe how to learn the AWMP on off-policy data with the designed maximum entropy objective.
3.1 Prior network
The prior network has a softmax output layer and outputs the weights of the AWMP, one per policy component. To maximize the MI between state-action pairs and weights, the parameters of the prior network are updated by minimizing a regularized objective [Krause et al.2010]:
(5) 
where the regularization term penalizes the instability of the prior network's output under perturbation of its input, and a coefficient balances the regularization term against the MI term. The benefit of this regularization has been verified in many representation learning tasks [Osa et al.2019].
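One common way to realize such a perturbation penalty is a KL divergence between the network's outputs on a clean input and on a noisy copy of it. The sketch below illustrates that idea under explicit assumptions: a toy linear-softmax model stands in for the prior network, Gaussian noise is added to the input (Supplementary Material B lists a noise value of 0.04 for MI regularization), and the direction of the KL divergence is a choice made here, not one confirmed by the text. All function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prior_weights(weight_matrix, features):
    """Toy linear-softmax stand-in for the prior network (an assumption)."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in weight_matrix]
    return softmax(logits)


def perturbation_penalty(weight_matrix, features, noise_std, rng):
    """KL(output on noisy input || output on clean input) for one sample."""
    noisy = [x + rng.gauss(0.0, noise_std) for x in features]
    clean_p = prior_weights(weight_matrix, features)
    noisy_p = prior_weights(weight_matrix, noisy)
    return sum(p * math.log(p / q) for p, q in zip(noisy_p, clean_p) if p > 0.0)
```

The penalty is zero when the noise is zero and grows as the output distribution becomes more sensitive to small input changes, which is the instability the regularizer is meant to discourage.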
The MI in Equation (5) is given by:
(6) 
where the entropy is estimated by:
(7) 
where denotes the probability of weights derived from the probability density , as:
(8) 
where denotes the probability density of stateaction pair induced by a policy based on the mode of advantage function as . We consider a formulation as , which can meet the requirement that the stateaction pair with larger advantage value given a higher probability. denotes the partition function.
Likewise, the conditional entropy is estimated by:
(9) 
In practice, this probability density cannot be estimated directly from past experience. To solve this issue, an advantage-weighted importance sampling approach is introduced, in which the experience is generated by a behavior policy. We assume that the change in the state distribution is sufficiently small. The advantage-weighted importance sampling weights are then calculated as:
(10) 
To improve the training stability, the advantageweighted importance sampling weights are normalized as:
(11) 
where the normalization is taken over the samples in the replay buffer. Based on the normalized advantage-weighted importance sampling weights in Equation (11), the probability of the weights in Equation (8) can be estimated by:
(12) 
Therefore, the parameterized entropy is estimated by:
(13) 
Likewise, the parameterized conditional entropy is estimated by:
(14) 
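The estimators above can be sketched as follows. This is a hedged illustration, not the paper's exact formulas: it assumes the importance weight of each sample is the exponential of its advantage, that the weights are normalized to sum to one over the batch, and that the marginal and conditional entropies are the corresponding weighted averages. The function names and the list-based data layout are hypothetical.

```python
import math


def normalized_advantage_weights(advantages):
    """Self-normalized weights proportional to exp(advantage) (assumed form)."""
    m = max(advantages)  # subtract the maximum for numerical stability
    raw = [math.exp(a - m) for a in advantages]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_mi_estimate(component_probs, weights):
    """Weighted estimate of H(o) - H(o | s, a).

    component_probs[i][o] : prior-network output p(o | s_i, a_i) for sample i
    weights[i]            : normalized importance weight of sample i
    """
    n_components = len(component_probs[0])
    # Marginal over components: weighted average of per-sample outputs.
    marginal = [
        sum(w * p[o] for w, p in zip(weights, component_probs))
        for o in range(n_components)
    ]
    h_marginal = -sum(p * math.log(p) for p in marginal if p > 0.0)
    # Conditional entropy: weighted average of per-sample entropies.
    h_conditional = -sum(
        w * sum(p_o * math.log(p_o) for p_o in p if p_o > 0.0)
        for w, p in zip(weights, component_probs)
    )
    return h_marginal - h_conditional
```

The estimate is largest when each sample is assigned confidently to one component while the components are used evenly across the weighted batch, and it is zero when every sample receives the same output distribution.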
3.2 AWMP
The AWMP network comprises several policy components, each with its own parameters. The maximum entropy objective for learning the AWMP is:
(15) 
where denotes the soft action value function induced by AWMP ; , here is set possible value of ; denotes a gating policy. denotes a single policy component given , a stochastic Gaussian policy; denotes the entropy temperature to control the stochasticity of the AWMP.
At each environment step, the action is sampled from the AWMP and executed to interact with the environment. Given the state, the softmax gating policy is calculated by:
(16) 
where the soft option value [Bacon et al.2017] represents the conditional expectation of the return when following a given policy component:
(17)  
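A softmax gating policy over per-component values can be sketched as below. The temperature argument and the function names are assumptions for illustration; whether Equation (16) divides by a temperature is not recoverable from the text.

```python
import math
import random


def gating_probabilities(option_values, temperature=1.0):
    """Softmax over soft option values (temperature is an assumed parameter)."""
    m = max(option_values)  # subtract the maximum for numerical stability
    exps = [math.exp((q - m) / temperature) for q in option_values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_component(probabilities, rng):
    """Draw the index of one policy component from the gating distribution."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    probs = gating_probabilities([1.0, 0.5, 0.2, 0.1])
    print(probs, sample_component(probs, rng))
```

Components with a higher soft option value are selected more often, and a lower temperature concentrates the distribution on the best component.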
The soft state value induced by AWMP is derived as:
(18)  
where denotes the entropy temperature of the gating policy. The advantage value is calculated as:
(19) 
Function approximators are used to estimate both the soft state value and the soft action value. Based on the soft policy iteration theorem, the soft state value network and the soft action value network, each with its own parameters, can be learned alternately through stochastic gradient descent (SGD).
The soft state value network is trained through minimizing the following error:
(20) 
where state is sampled from the replay buffer and the target value is calculated by:
(21) 
where the corresponding action is sampled from the current AWMP. In particular, we implement two independent soft action value networks and select the smaller of the two outputs to calculate the target value, which has been shown to reduce positive bias and improve sample efficiency in previous value-based work [Fujimoto et al.2018, Zhang et al.2019]. The parameters of the soft action value function are updated by minimizing the soft Bellman residual error:
(22) 
where is sampled from replay buffer ; the target value is calculated by:
(23) 
where is calculated from the target soft state value network parameterized by .
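The two regression targets described above can be sketched for a single transition as follows. The forms used are the standard SAC ones (minimum of two critics minus temperature times log-probability for the state value, and reward plus discounted target state value for the action value); they are assumptions here, since the paper's equations may include additional gating terms. The `done` flag and all function names are also assumptions.

```python
def soft_value_target(q1, q2, log_prob, alpha):
    """Target for the soft state value: min of two critics minus alpha * log pi."""
    return min(q1, q2) - alpha * log_prob


def soft_q_target(reward, next_value_from_target_net, gamma, done):
    """Target for the soft action value: r + gamma * V_target(s')."""
    if done:
        # No bootstrap at episode termination (an assumption of this sketch).
        return reward
    return reward + gamma * next_value_from_target_net


def squared_error(prediction, target):
    """Half squared error, the usual per-sample regression loss."""
    return 0.5 * (prediction - target) ** 2
```

Taking the minimum of the two critics makes the state value target pessimistic, which is the mechanism credited with reducing positive bias.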
Instead of minimizing the objective in Equation (15) by backpropagating the gradient directly, we apply the likelihood ratio gradient estimator to learn the AWMP from the off-policy data in the replay buffer [Williams1992, Haarnoja et al.2018a]. The objective is rewritten as:
(24) 
which minimizes the expected KL divergence from the target density induced by the soft action value and normalized by a partition function; the state is sampled from the replay buffer, the action is derived from the current AWMP, and the weights of the AWMP are derived from the prior network. The soft action value network can be differentiated, similarly to the deterministic policy gradient (DPG) theorem [Silver et al.2014]. Therefore, a transformation trick is implemented to represent the AWMP with the weights from the prior network:
(25) 
where each action is derived from one policy component and is weighted by the corresponding element of the weight vector. The input noise is sampled from a fixed Gaussian distribution. The objective function in Equation (24) is rewritten as:
(26) 
The gradient of the partition function can be omitted because it is independent of the policy parameters; the gradient with respect to the parameters is calculated as:
(27)  
3.3 Practical training
Each policy component of the AWMP network outputs an action from an unbounded Gaussian distribution. In practice, an invertible squashing function is applied elementwise to bound the Gaussian samples, so the action is restricted to a bounded range. The density of the action, which follows from the Jacobian of the transformation, is:
(28) 
Therefore, the action sampled from the AWMP, weighted by the weights from the prior network, is still restricted to the same bounded range.
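The squashing step and its density correction can be sketched as below. The sketch assumes tanh as the squashing function (the choice used by SAC), a diagonal Gaussian per component, and a convex combination of component actions using prior-network weights that sum to one; the small constant inside the logarithm is a numerical safeguard added here. Function names are hypothetical.

```python
import math
import random


def squashed_log_prob(pre_tanh, mean, std):
    """Log-density of a = tanh(u) with u ~ N(mean, diag(std^2)).

    Change of variables: log pi(a) = log N(u) - sum_i log(1 - tanh(u_i)^2).
    """
    log_prob = 0.0
    for u, m, s in zip(pre_tanh, mean, std):
        log_prob += -0.5 * ((u - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        log_prob -= math.log(1.0 - math.tanh(u) ** 2 + 1e-6)
    return log_prob


def sample_squashed_gaussian(mean, std, rng):
    """Reparameterized sample: u = mean + std * noise, then a = tanh(u)."""
    pre_tanh = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
    action = [math.tanh(u) for u in pre_tanh]
    return action, squashed_log_prob(pre_tanh, mean, std)


def mix_actions(weights, component_actions):
    """Convex combination of component actions (weights assumed to sum to one)."""
    dim = len(component_actions[0])
    return [
        sum(w * act[d] for w, act in zip(weights, component_actions))
        for d in range(dim)
    ]
```

Because every component action lies in the range of tanh and the weights form a convex combination, the mixed action stays in the same range, which matches the statement above.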
Our proposed algorithm is summarised in Algorithm 1. At each gradient step, the soft action value network, the soft state value network and the AWMP network are trained on a minibatch of off-policy samples from the replay buffer. To improve training stability, the prior network is trained on semi off-policy data: we sample its minibatch from the most recent samples, generated by the most recent behavior policies. Additionally, to update the AWMP sufficiently, a target soft state-action value network is implemented so that the gating policy changes less frequently. Exponential moving averages with separate smoothing constants are applied to update the target networks.
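The exponential moving average used for target networks can be sketched in a few lines. Parameters are represented as flat lists of floats for simplicity, and the function name and the smoothing constant name `tau` are assumptions; Supplementary Material B lists smoothing values of 0.005 and 0.001.

```python
def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    target = [0.0, 2.0]
    online = [1.0, 2.0]
    for _ in range(3):
        target = soft_update(target, online, tau=0.005)
    print(target)
```

A small smoothing constant makes the target track the online network slowly, which stabilizes the regression targets at the cost of slower propagation of new value estimates.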
4 Experiments
Our proposed SAC-AWMP is evaluated in terms of sample complexity and stability against previous state-of-the-art RL algorithms on four commonly used continuous control tasks from OpenAI Gym (see Fig. 2) [Todorov et al.2012, Brockman et al.2016].
4.1 Settings
The sample complexity of off-policy RL algorithms such as TD3 and SAC has already been compared with that of the state-of-the-art on-policy algorithm PPO [Schulman et al.2017] in the TD3 and SAC papers [Fujimoto et al.2018, Haarnoja et al.2018a]. Therefore, in this paper, our proposed SAC-AWMP is compared only with TD3 and SAC, which are run with the code provided by their authors. More specifically, each policy component of SAC-AWMP has the same network architecture as SAC. Each algorithm is trained on every task for five trials with different seeds (0, 1, 2, 3, 4), each trial lasting 1 million steps, and the expected return is estimated via ten evaluation episodes every 1000 steps. All hyperparameters used in our experiments follow the original papers of SAC and TD3 [Fujimoto et al.2018, Haarnoja et al.2018a] and are listed in Supplementary Material B. The source code is available on GitHub (https://github.com/hzm2016/SACAWMP.git).
4.2 Results and Comparisons
As shown in Figure 3, the solid curves represent the mean of the average evaluation return and the shaded regions correspond to half a standard deviation over the five seeds. SAC-AWMP with four separate policy components outperforms SAC in terms of learning efficiency and stability on Ant-v2, Walker2d-v2 and Hopper-v2, while HalfCheetah-v2 is solved easily by all the RL methods. On the harder task Ant-v2, SAC-AWMP outperforms both SAC and TD3 by a large margin. SAC-AWMP and SAC also achieve better stability than TD3 on all four tasks without any hyperparameter tuning.
The number of policy components in SAC-AWMP needs to be specified in advance, similarly to the number of options and latent variables in hierarchical RL (HRL) [Bacon et al.2017, Osa et al.2019, Zhang and Whiteson2019]. How to discover meaningful policy components and the option policies corresponding to each latent variable is a long-standing open question. The number of policy components should be task-dependent, which has not been investigated thoroughly. In previous HRL work, two or four option policies were tested on all the continuous control tasks [Osa et al.2019, Zhang and Whiteson2019]. In this paper, our proposed SAC-AWMP is implemented with four different numbers of policy components. When only a single policy component is used, the proposed SAC-AWMP degenerates to SAC [Haarnoja et al.2018a]. As shown in Figure 4, SAC-AWMP outperforms SAC with the three other numbers of policy components, and SAC-AWMP with 8 policy components results in small variance during training.
The performance of maximum entropy RL depends largely on the entropy temperature, which is replaced by the reward scale in [Haarnoja et al.2018a]. Automatic entropy adjustment, which varies the temperature across tasks and learning stages, was proposed in [Haarnoja et al.2018b]; however, it has not achieved much improvement over the fixed entropy temperature used in [Haarnoja et al.2018a]. In this paper, the entropy temperature for each task is fixed as in [Haarnoja et al.2018a] (see Figure 3). A target network is a commonly used trick that slowly tracks the changing values through a smoothing coefficient [Lillicrap et al.2015], and it has largely improved the learning stability of RL algorithms. As depicted in Algorithm 1, the AWMP can be learned on off-policy data, whereas the prior network needs to be learned on semi off-policy data (generated by the most recent policies). In addition to the target network for the soft state value function, we implement an independent target network for the soft action value to derive the gating policy. One smoothing coefficient is used for the above experiments; additionally, as shown in Figure 5, we test two other smoothing coefficients. A large smoothing coefficient may result in instability and divergence, whereas a small value leads to slower learning.
5 Conclusion and discussion
In this article, we proposed soft actor-critic with advantage weighted mixture policy (SAC-AWMP), an off-policy maximum entropy RL algorithm. Without any specific hyperparameter tuning, we empirically demonstrated that the proposed SAC-AWMP, with weights learned via advantage-weighted information maximization, achieves smoother policy approximation and more stable learning than TD3, and improves the sample efficiency of standard SAC on three MuJoCo tasks.
Compared with standard SAC, which uses a single stochastic Gaussian policy, the AWMP holds promise for solving complex tasks with high-dimensional continuous state and action spaces and real-world tasks with hierarchical structure. Our proposed AWMP can also be combined with other policy-gradient methods, such as PPO and TD3. Additionally, in this paper the prior network could only be learned from semi off-policy data. Further investigation in these directions could improve sample efficiency and applicability.
Acknowledgments
This work was supported by Agency for Science, Technology and Research, Singapore, under the National Robotics Program, with A*star SERC Grant No.: 192 25 00054.
References
 [Bacon et al.2017] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 [Brockman et al.2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
 [Chen et al.2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

 [Fujimoto et al.2018] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1582–1591, 2018.
 [Haarnoja et al.2018a] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
 [Haarnoja et al.2018b] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
 [Houthooft et al.2016] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.
 [Krause et al.2010] Andreas Krause, Pietro Perona, and Ryan G Gomes. Discriminative clustering by regularized information maximization. In Advances in neural information processing systems, pages 775–783, 2010.
 [Levine et al.2016] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
 [Lillicrap et al.2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 [Mnih et al.2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 [Osa et al.2019] Takayuki Osa, Voot Tangkaratt, and Masashi Sugiyama. Hierarchical reinforcement learning via advantage-weighted information maximization. International Conference on Learning Representations, 2019.
 [Schulman et al.2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 [Shazeer et al.2017] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
 [Silver et al.2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning (ICML), 2014.
 [Silver et al.2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
 [Sutton and Barto2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 [Todorov et al.2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
 [Wan et al.2019] Yi Wan, Muhammad Zaheer, Adam White, Martha White, and Richard S. Sutton. Planning with expectation models. CoRR, abs/1904.01191, 2019.
 [Williams1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
 [Xu et al.1995] Lei Xu, Michael I Jordan, and Geoffrey E Hinton. An alternative model for mixtures of experts. In Advances in neural information processing systems, pages 633–640, 1995.
 [Xu et al.2018] Jing Xu, Zhimin Hou, Wei Wang, Bohao Xu, Kuangen Zhang, and Ken Chen. Feedback deep deterministic policy gradient with fuzzy reward for robotic multiple peg-in-hole assembly tasks. IEEE Transactions on Industrial Informatics, 15(3):1658–1667, 2018.
 [Zhang and Whiteson2019] Shangtong Zhang and Shimon Whiteson. DAC: The double actor-critic architecture for learning options. arXiv preprint arXiv:1904.12691, 2019.
 [Zhang et al.2019] Kuangen Zhang, Zhimin Hou, Clarence W de Silva, Haoyong Yu, and Chenglong Fu. Teach biped robots to walk via gait principles and reinforcement learning with adversarial critics. arXiv preprint arXiv:1910.10194, 2019.
 [Ziebart et al.2008] Brian D Ziebart, Andrew Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. 2008.
Appendix A Proofs
A.1 Lemma 2
Lemma.
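(Soft Policy Improvement). Consider $\pi_{\mathrm{old}} \in \Pi$ and let $\pi_{\mathrm{new}}$ be the optimizer of the minimization problem in Equation (3). Then $Q^{\pi_{\mathrm{new}}}(s_t, a_t) \geq Q^{\pi_{\mathrm{old}}}(s_t, a_t)$ for all $(s_t, a_t) \in \mathcal{S} \times \mathcal{A}$ with $|\mathcal{A}| < \infty$.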
Proof.
Let $\pi_{\mathrm{old}}$, $Q^{\pi_{\mathrm{old}}}$ and $V^{\pi_{\mathrm{old}}}$ denote the old policy, its soft state-action value and its soft state value. The new policy $\pi_{\mathrm{new}}$ is then defined as in Equation (3), rewritten as:
(29) 
With , it must be satisfied that . Hence
(30)  
Since partition function only depends on the state, the inequality reduces to:
(31) 
Then, consider the soft Bellman equation:
(32)  
where we repeatedly expand $Q^{\pi_{\mathrm{old}}}$ on the right-hand side by applying the soft Bellman equation and the bound in Equation (31). Convergence to $Q^{\pi_{\mathrm{new}}}$ then follows from Lemma 1. ∎
A.2 Theorem 1
Theorem.
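(Soft Policy Iteration). Repeating soft policy evaluation and soft policy improvement alternately, starting from any initial policy $\pi \in \Pi$, converges to an optimal policy $\pi^{*}$ with $Q^{\pi^{*}}(s_t, a_t) \geq Q^{\pi}(s_t, a_t)$ for all $\pi \in \Pi$ and all $(s_t, a_t) \in \mathcal{S} \times \mathcal{A}$ with $|\mathcal{A}| < \infty$.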
Proof.
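Let $\pi_i$ denote the policy at iteration $i$. Based on Lemma 2, the sequence $Q^{\pi_i}$ is monotonically increasing. The sequence converges to some $\pi^{*}$ because $Q^{\pi}$ is bounded above for $\pi \in \Pi$. At convergence, it must be the case that $\pi^{*}$ attains a lower value of the KL objective in Equation (3) than any other $\pi \in \Pi$. Based on the proof of Lemma 2, $Q^{\pi^{*}}(s_t, a_t) \geq Q^{\pi}(s_t, a_t)$ for all $(s_t, a_t) \in \mathcal{S} \times \mathcal{A}$. Hence, it must be the case that $\pi^{*}$ is optimal in $\Pi$. ∎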
Appendix B Details of Experiments
Description  Action Dimensions  Entropy 

Ant-v2  8  0.2 
HalfCheetah-v2  6  0.2 
Walker2d-v2  6  0.2 
Hopper-v2  3  0.2 
Description  Symbol  Value 

Batch size for critic  100  
Number of hidden layers  (400, 400)  
Activation function  ReLU, ReLU, tanh  
Target smoothing coefficient  0.005  
Learning rate  3e-4  
Gradient Steps  1  
Replay buffer size  1e6  
Entropy term  0.2  
Optimizer  Adam  
Discount factor  0.99 
Description  Symbol  Value 

Batch size for critic  100  
Batch size for policy  200()  
400()  
Batch size for prior network  50  
Target smoothing coefficient  0.001  
Prior network update steps  5000  
Learning rate  3e-4  
Noise for MI regularization  0.04  
Coefficient for MI  0.1  
Entropy term  0.001 