1. Introduction
Reinforcement learning (RL) studies the control problem where an agent tries to navigate through an unknown environment (Sutton and Barto, 2018). The agent attempts to maximize its cumulative rewards through an iterative trial-and-error learning process (Arulkumaran et al., 2017). Recently, we have seen many successes of applying RL to challenging simulated (Mnih et al., 2015; Liang et al., 2016) and real-world (Silver et al., 2017; Leibo et al., 2017; Wang and Zhang, 2017) problems. Inherently, RL consists of two distinct but closely related objectives: learning the best possible policy from the gathered samples (i.e. exploitation) and collecting new samples effectively (i.e. exploration). While the exploitation step shares certain similarities with tasks such as supervised learning, exploration is unique, essential, and often viewed as the backbone of many successful RL algorithms (Mnih et al., 2013; Haarnoja et al., 2018a).
In order to explore novel states that are potentially rewarding, it is crucial to incorporate randomness when interacting with the environment. Thanks to its simplicity, injecting noise into the action (Lillicrap et al., 2015; Fujimoto et al., 2018a) or parameter space (Fortunato et al., 2017; Plappert et al., 2018) is widely used to implicitly construct behavior policies from target policies. In most prior work, the injected noise has a mean of zero, so that the updates to the target policy are unbiased (Fujimoto et al., 2018b; Gu et al., 2016). The stability of noise-based exploration, which stems from its unbiased nature, makes it a safe exploration strategy. However, noise-based approaches are generally less effective, since they are neither aware of potentially rewarding actions nor guided by exploration-oriented targets.
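As a concrete illustration of action-space noise injection, the sketch below perturbs actions with zero-mean, temporally correlated Ornstein-Uhlenbeck noise of the kind commonly paired with deterministic policies. The class, its parameter values, and the usage are illustrative assumptions, not any cited implementation.

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Zero-mean, temporally correlated noise for action-space exploration."""
    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = mu
        self.rng = np.random.default_rng(seed)

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal())
        self.x += dx
        return self.x

noise = OrnsteinUhlenbeckNoise()
samples = np.array([noise.sample() for _ in range(10_000)])
# A behavior action would then be e.g. clip(pi(s) + samples[t], -1, 1).
```

Because the process mean-reverts to zero, the perturbation stays unbiased on average while remaining smooth over consecutive time steps.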
To tackle the above problem, two orthogonal lines of approaches have been proposed. The first extracts more information from the current knowledge (i.e. the gathered samples). For example, energy-based RL algorithms learn to capture potentially rewarding actions through their energy objectives (Haarnoja et al., 2018a; Sutton and Barto, 2018). A second line of work leverages external guidance to aid exploration. In a nutshell, these methods formulate intuitive exploration tendencies as an additional reward function called the intrinsic reward (Bellemare et al., 2016; Houthooft et al., 2016). Guided by these auxiliary tasks, RL algorithms tend to act curiously, substantially improving exploration of the state space.
Despite their promising exploration efficiency, both lines of work fail to fully exploit the collected samples and turn them into the highest-performing policy, as their learned policy often executes suboptimal actions. To avoid this undesirable exploration-exploitation trade-off, several attempts have been made to separately design two policies (i.e. disentangle them), of which one aims to gather the most informative samples (and hence is commonly referred to as the behavior policy) while the other attempts to best utilize the current knowledge from the gathered samples (and hence is usually referred to as the target policy) (Colas et al., 2018; Beyer et al., 2019). To help fulfill their respective goals, disentangled objective functions and learning paradigms are further designed and separately applied to the two policies.
However, naively disentangling the behavior policy from the target policy would render their update process unstable. For example, when disentangled naively, the two policies tend to differ substantially due to their contrasting objectives, which is known to potentially result in catastrophic learning failure (Nachum et al., 2018). To mitigate this problem, we propose Analogous Disentangled Actor-Critic (ADAC), where being analogous is reflected by the constraints imposed on the disentangled actor-critic (Mnih et al., 2016) pairs. ADAC consists of two main algorithmic contributions. First, policy co-training guides the behavior policy's update by the target policy, making the gathered samples more helpful for the target policy's learning process while keeping the expressiveness of the behavior policy for extensive exploration (Section 4.2). Second, critic bounding allows an additional explorative critic to be trained with the aid of intrinsic rewards (Section 4.3). Under certain constraints from the target policy, the resultant critic maintains the curiosity incentivized by intrinsic rewards while guaranteeing training stability of the target policy.
The rest of the paper is organized as follows. Section 2 reviews and summarizes related work. Key background concepts and notation are introduced in Section 3. Our method is elaborated in Section 4, and experimental details and results for ADAC are presented in Section 5. Finally, conclusions are given in Section 6.
2. Related Work
Learning to be aware of potentially rewarding actions is a promising strategy for exploration, as it automatically prunes less rewarding actions and concentrates exploration efforts on those with high potential. To capture these actions, expressive learning models/objectives are widely used. The most notable recent work in this direction, such as Soft Actor-Critic (Haarnoja et al., 2018a), EntRL (Schulman et al., 2017a), and Soft Q-Learning (Haarnoja et al., 2017), learns an expressive energy-based target policy according to the maximum entropy RL objective (Ziebart, 2010). However, the expressiveness of their policies in turn becomes a burden for their optimality, and in practice, trade-offs such as temperature control (Haarnoja et al., 2018b) and reward scaling (Haarnoja et al., 2017) have to be made for better overall performance. As we shall show later, ADAC makes use of a similar but extended energy-based target, and alleviates the compromise on optimality using the analogous disentangled framework.
Ad-hoc exploration-oriented learning targets designed to better explore the state space are also promising. Recent research efforts along this line include count-based exploration (Xu et al., 2017; Bellemare et al., 2016) and intrinsic motivation (Houthooft et al., 2016; Fu et al., 2017; Kulkarni et al., 2016) approaches. The outcome of these methods is usually an auxiliary reward termed the intrinsic reward, which is extremely useful when the environment-defined reward is sparse. However, as we shall illustrate in Section 5.3, intrinsic rewards can bias the task-defined learning objective, leading to catastrophic failure in some tasks. Again, owing to the disentangled nature of ADAC, we give a principled solution to this problem with theoretical guarantees (Section 4.3).
Explicitly disentangling exploration from exploitation addresses a problem common to the above approaches: sacrificing the target policy's optimality for better exploration. By separately designing the exploration and exploitation components, both objectives can be better pursued simultaneously. Specifically, GEP-PG (Colas et al., 2018) uses a Goal Exploration Process (GEP) (Forestier et al., 2017) to generate samples and feed them to the replay buffer of DDPG (Lillicrap et al., 2015) or its variants. Multiple Losses for Exploration (MULEX) (Beyer et al., 2019) proposes to use a series of intrinsic rewards to optimize different policies in parallel, which in turn generates abundant samples to train the target policy. Despite their intriguing conceptual ideas, these methods overlook the training instability caused by the mismatch between the distribution of collected samples (under the behavior policy) and the distribution induced by the target policy, which is formalized as extrapolation error in Fujimoto et al. (2018b). ADAC aims to mitigate this instability while maintaining the effective exploration-exploitation trade-off promised by expressive behavior policies (Section 4.2) as well as intrinsic rewards (Section 4.3), using its analogous disentangled actor-critic pairs.
3. Preliminaries
In this section, we introduce the reinforcement learning (RL) setting we address in this paper, as well as some background concepts that we utilize to build our method.
3.1. RL with Continuous Control
In a standard reinforcement learning (RL) setup, an agent interacts with an unknown environment at discrete time steps and aims to maximize the reward signal (Sutton and Barto, 2018). The environment is often formalized as a Markov Decision Process (MDP), which can be succinctly defined as a 5-tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$. At time step $t$, the agent in state $s_t$ takes action $a_t$ according to policy $\pi(a_t|s_t)$, a conditional distribution of $a_t$ given $s_t$, leading to the next state $s_{t+1}$ according to the transition probability $p(s_{t+1}|s_t, a_t)$. Meanwhile, the agent observes the reward $r(s_t, a_t)$ emitted from the environment.^1 The agent strives to learn the optimal policy $\pi^*$ that maximizes the expected return $J(\pi) = \mathbb{E}_{s_0 \sim p_0, \pi}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\big]$, where $p_0$ is the initial state distribution and $\gamma \in [0, 1)$ is the discount factor balancing the priority of short- and long-term rewards. For continuous control, the policy $\pi_\theta$ (also known as the actor in the actor-critic framework) parameterized by $\theta$ can be updated by taking the gradient $\nabla_\theta J(\pi_\theta)$. According to the deterministic policy gradient theorem (Silver et al., 2014), $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^\pi}\big[\nabla_\theta \pi_\theta(s)\, \nabla_a Q_r^\pi(s, a)\big|_{a=\pi_\theta(s)}\big]$, where $\rho^\pi$ denotes the state-action marginals of the trajectory distribution induced by $\pi_\theta$, and $Q_r^\pi$ denotes the state-action value function (also known as the critic in the actor-critic framework), which represents the expected return under the reward function $r$ when performing action $a$ at state $s$ and following policy $\pi$ afterwards. Intuitively, it measures how preferable executing action $a$ is at state $s$ with respect to the policy $\pi$ and reward function $r$. Following Bellman (1966), we additionally introduce the Bellman operator, which is commonly used to update the $Q$ function. The Bellman operator $\mathcal{T}_r^\pi$ uses $\pi$ and $r$ to update an arbitrary value function $Q$, which is not necessarily defined with respect to the same $\pi$ or $r$. For example, the outcome of $\mathcal{T}_r^\pi Q$ is defined as $(\mathcal{T}_r^\pi Q)(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot|s,a),\, a' \sim \pi(\cdot|s')}[Q(s', a')]$. By slightly abusing notation, we further define the outcome of $\mathcal{T}_r Q$ as $(\mathcal{T}_r Q)(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot|s,a)}[\max_{a'} Q(s', a')]$. Some also call $\mathcal{T}_r$ the Bellman optimality operator.

^1 In all the environments considered in this paper, actions are assumed to be continuous.
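The Bellman backups above can be made concrete in the tabular case. The sketch below applies one function to realize both $\mathcal{T}_r^\pi$ (expectation under a policy) and the optimality operator $\mathcal{T}_r$ (max over actions) on a tiny, hand-made two-state MDP; all numbers are hypothetical. Repeated application of the optimality operator converges to $Q^*$ because the operator is a $\gamma$-contraction.

```python
import numpy as np

def bellman_operator(Q, P, R, gamma, policy=None):
    """One Bellman backup of an arbitrary Q-table.

    Q:      (S, A) state-action values to back up
    P:      (S, A, S) transition probabilities
    R:      (S, A) expected rewards
    policy: (S, A) action distribution; if None, apply the Bellman
            *optimality* operator (max over next actions) instead.
    """
    if policy is None:
        next_v = Q.max(axis=1)             # greedy backup: T_r
    else:
        next_v = (policy * Q).sum(axis=1)  # expectation under policy: T_r^pi
    return R + gamma * P @ next_v

# Deterministic 2-state, 2-action toy MDP: action 1 pays 1, action 0 pays 0.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 1.0]])
gamma = 0.9

Q = np.zeros((2, 2))                       # converges to Q*
for _ in range(500):
    Q = bellman_operator(Q, P, R, gamma)

policy = np.full((2, 2), 0.5)              # uniform policy evaluation
Q_unif = np.zeros((2, 2))
for _ in range(500):
    Q_unif = bellman_operator(Q_unif, P, R, gamma, policy)
```

Here the optimal values satisfy $Q^*(s, 1) = 1/(1-\gamma) = 10$ and $Q^*(s, 0) = \gamma \cdot 10 = 9$, while the uniform-policy values are lower, reflecting the gap between the two operators.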
3.2. Offpolicy Learning and Behavior Policy
To aid exploration, it is a common practice to construct/store more than one policy for the agent (either implicitly or explicitly). Off-policy actor-critic methods (Watkins and Dayan, 1992) allow us to make a clear separation between the target policy, which refers to the best policy currently learned by the agent, and the behavior policy, which the agent follows to interact with the environment. Note that the discussion in Section 3.1 is largely about the target policy. Thus, from this point on, to avoid confusion, $\pi$ is reserved to denote only the target policy, and the notation $\beta$ is introduced to denote the behavior policy. Due to the policy separation, the target policy $\pi$ instead resorts to estimates calculated from samples collected by the behavior policy $\beta$; that is, the deterministic policy gradient mentioned above is approximated as

$\nabla_\theta J(\pi_\theta) \approx \mathbb{E}_{s \sim \rho^\beta}\big[\nabla_\theta \pi_\theta(s)\, \nabla_a Q_r^\pi(s, a)\big|_{a=\pi_\theta(s)}\big]$   (1)
where $r$ is the environment-defined reward. One of the most notable off-policy learning algorithms that capitalizes on this idea is deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015). To mitigate function approximation errors in DDPG, Fujimoto et al. (2018a) propose TD3. Given that DDPG and TD3 have demonstrated themselves to be competitive on many continuous-control benchmarks, we choose to implement our Analogous Disentangled Actor-Critic (ADAC) on top of their target policies. Yet, it is worth reiterating that ADAC is compatible with any existing off-policy learning algorithm. We defer a more detailed discussion of ADAC's compatibility until we formally introduce our method in Section 4.1.
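To make the approximated gradient of Eq (1) concrete, the sketch below performs deterministic policy gradient ascent on a one-parameter linear policy $\pi_\theta(s) = \theta s$ against a hand-made quadratic critic whose known maximizer is $a^*(s) = 2s$. The fixed buffer of states stands in for samples collected by a behavior policy; every function and constant here is an illustrative assumption, not part of DDPG itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical critic Q(s, a) = -(a - 2s)^2, so dQ/da = -2(a - 2s).
def dQ_da(s, a):
    return -2.0 * (a - 2.0 * s)

theta = 0.0                                    # policy: pi(s) = theta * s
buffer_states = rng.uniform(-1.0, 1.0, 1000)   # states from a behavior policy

for _ in range(2000):
    s = rng.choice(buffer_states, size=32)     # off-policy mini-batch
    a = theta * s
    # Eq (1): E_s[ grad_theta pi(s) * dQ/da at a = pi(s) ], grad_theta pi(s) = s
    grad = np.mean(s * dQ_da(s, a))
    theta += 0.1 * grad
```

Although the gradient is estimated under the buffer's state distribution rather than $\rho^\pi$, the policy parameter still converges to the critic's maximizer ($\theta = 2$) in this well-behaved toy case; Section 4.2 discusses when such off-policy estimates become biased.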
3.3. Expressive Behavior Policies through Energy-Based Representation
One promising way to design an exploration-oriented behavior policy without external guidance (which usually takes the form of an intrinsic reward) is to increase the expressiveness of $\beta$ so that it captures information about potentially rewarding actions. Energy-based representations have recently been increasingly chosen as the target form for constructing an expressive behavior policy. Since their introduction by Ziebart (2010) to achieve maximum-entropy reinforcement learning, several lines of work have kept improving upon this idea; the most notable include Soft Q-Learning (SQL) (Haarnoja et al., 2017), EntRL (Schulman et al., 2017a), and Soft Actor-Critic (SAC) (Haarnoja et al., 2018b). Collectively, they have achieved competitive results on many benchmark tasks. Formally, the energy-based behavior policy is defined as

$\beta(a|s) \propto \exp\big(Q(s, a)\big)$   (2)

where $Q$ is commonly selected to be the target critic in prior work (Haarnoja et al., 2018b; Haarnoja et al., 2018a). Various efficient samplers have been proposed to approximate the distribution specified in Eq (2). Among them, the sampler of Haarnoja et al. (2017) based on Stein variational gradient descent (SVGD) (Liu and Wang, 2016; Wang et al., 2018) is especially worth noting, as it has the potential to approximate complex and multimodal behavior policies. Given this, we also choose it to sample from the behavior policy in our proposed ADAC.
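To illustrate Eq (2) without a learned sampler, the sketch below discretizes a one-dimensional action space and normalizes $\exp(Q/\alpha)$ over the grid, with a temperature-like scalar $\alpha$ playing the role of the entropy weight that appears later in Eq (3). The bimodal critic and the $\alpha$ values are hypothetical.

```python
import numpy as np

def energy_based_policy(q_values, alpha):
    """Distribution proportional to exp(Q / alpha) over a discretized
    action grid; a crude stand-in for an SVGD-style sampler."""
    logits = q_values / alpha
    logits = logits - logits.max()   # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Hypothetical bimodal critic over a 1-D action space [-1, 1]:
# a strong mode near a = 0.6 and a slightly weaker one near a = -0.6.
actions = np.linspace(-1.0, 1.0, 201)
Q = np.exp(-(actions - 0.6) ** 2 / 0.02) + 0.8 * np.exp(-(actions + 0.6) ** 2 / 0.02)

sharp = energy_based_policy(Q, alpha=0.05)   # low temperature: near-greedy
broad = energy_based_policy(Q, alpha=1.0)    # high temperature: keeps both modes
```

A low temperature collapses the policy onto the best mode (low entropy), whereas a higher temperature retains probability mass on both rewarding regions, which is exactly the expressiveness that makes such policies attractive for exploration.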
Additionally, we want to highlight an intriguing property of SVGD that is critical for understanding why we can perform analogous disentangled exploration effectively. Intuitively, SVGD transforms a set of particles to match a target distribution. In the context of RL, following Amortized SVGD (Feng et al., 2017), we use a neural network sampler $f(s; \xi)$ (with $\xi \sim \mathcal{N}(0, I)$) to approximate Eq (2), which is done by minimizing the KL divergence between the two distributions. According to Feng et al. (2017), $f$ is updated according to the following gradient:

$\Delta f(s; \xi) = \mathbb{E}_{\xi'}\Big[ k\big(f(s; \xi'), f(s; \xi)\big)\, \nabla_a Q(s, a)\big|_{a=f(s; \xi')} + \alpha\, \nabla_{a'} k\big(a', f(s; \xi)\big)\big|_{a'=f(s; \xi')} \Big]$   (3)
where $k$ is a positive definite kernel^2, and $\alpha$ is an additional hyperparameter introduced to trade off optimality against expressiveness. The intrinsic connection between Eq (3) and the deterministic policy gradient (i.e. Eq (1)) is discussed in Haarnoja et al. (2017) and Feng et al. (2017): the first term of the gradient represents a combination of deterministic policy gradients weighted by the kernel $k$, while the second term represents an entropy maximization objective.

^2 Formally, in ADAC, the kernel is defined with respect to the number of dimensions of the action space.
To aid a better understanding of this relation, we illustrate the distribution approximated by SVGD under different values of $\alpha$ in a toy example, shown in Figure 1. The dashed line is the approximation target. When $\alpha$ is small, the entropy of the learned distribution is restricted and the overall policy leans towards the highest-probability region. On the other hand, a larger $\alpha$ leads to a more expressive approximation.
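The two terms of Eq (3) can be seen at work in a bare-bones, non-amortized SVGD sketch: particles are pulled by kernel-weighted score gradients (the exploitation term) and pushed apart by the gradient of the kernel itself (the entropy term). The 1-D Gaussian target, bandwidth, and step size below are hypothetical choices for illustration only, not the amortized neural sampler used in ADAC.

```python
import numpy as np

def svgd_step(particles, score, h=1.0, lr=0.1):
    """One SVGD update on 1-D particles. The first term (kernel-weighted
    score) pulls particles toward high-density regions; the second term
    (gradient of the kernel) repels nearby particles, preserving entropy."""
    grads = score(particles)
    updated = particles.copy()
    for i in range(len(particles)):
        k = np.exp(-(particles - particles[i]) ** 2 / h)
        dk = -2.0 * (particles - particles[i]) / h * k   # grad of k w.r.t. x_j
        updated[i] += lr * np.mean(k * grads + dk)
    return updated

# Hypothetical target: a standard normal, whose score is d log p / dx = -x.
# Particles start far off-target, clustered in [2, 3].
particles = np.linspace(2.0, 3.0, 50)
for _ in range(500):
    particles = svgd_step(particles, lambda x: -x)
```

Without the repulsion term all particles would collapse onto the mode at 0; with it, they settle into a spread-out approximation of the target, mirroring how a larger $\alpha$ in Eq (3) yields a more expressive behavior policy.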
4. Analogous Disentangled Actor Critic
This section introduces our proposed method, Analogous Disentangled Actor-Critic (ADAC). We start with an overview (Section 4.1), followed by the specific choices we make in designing our actors and critics (Sections 4.2 and 4.3).
4.1. Algorithm Overview
Figure 2 provides a diagram overview of ADAC, which consists of two actor-critic pairs, $(\pi, Q_\pi)$ and $(\beta, Q_\beta)$ (see the blue and pink boxes), to achieve disentanglement. As with prior off-policy algorithms (e.g., DDPG), during training ADAC alternates between two main procedures, namely sample collection (dotted green box), where we use $\beta$ to interact with the environment to collect training samples, and model update (dashed gray box), which consists of two phases: (i) batches of the collected samples are used to update both critics (the pink box); (ii) $\pi$ and $\beta$ (the blue box) are updated according to their respective critics using different objectives. During evaluation, $\pi$ is used to interact with the environment.
Both steps in the model update phase manifest the analogous property of our method. First, although optimized with respect to different objectives, both policies ($\pi$ and $\beta$) are represented by the same neural network $f(s, \xi)$, where $\pi(s) = f(s, 0)$ and $\beta(\cdot|s) = f(s, \xi)$ with $\xi \sim \mathcal{N}(0, I)$.^3 That is, $\pi$ is a deterministic policy since the input $\xi$ to $f$ is fixed, while $\beta$ can be regarded as an action sampler that uses the randomly sampled $\xi$ to generate actions. As we shall demonstrate in Section 4.2, this specific setup effectively restricts the deviation between the two policies ($\pi$ and $\beta$) (i.e. the update bias), which stabilizes the training process while maintaining sufficient expressiveness in the behavior policy (also see Section 5.1 for an intuitive illustration).

^3 $f$ takes the two components $s$ and $\xi$ as input, and $\theta$ is the parameter set of $f$.
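A toy rendering of this shared-network construction, with a linear map standing in for the neural network $f$ (the weights and the state are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear map stands in for the shared sampler network f(s, xi).
W_s, W_xi = 1.5, 0.3

def f(s, xi):
    return W_s * s + W_xi * xi

def pi(s):
    """Target policy: deterministic, obtained by fixing xi = 0."""
    return f(s, 0.0)

def beta(s):
    """Behavior policy: the same network, with xi ~ N(0, 1) injected."""
    return f(s, rng.standard_normal())

actions = np.array([beta(1.0) for _ in range(2000)])
```

Because both policies share the same parameters, the behavior samples are centered exactly on the target action $\pi(s)$, with a spread controlled by how strongly $f$ responds to $\xi$; this is the mechanism that keeps the two policies analogous.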
The second exhibit of our method's analogous nature lies in our designed critics $Q_\pi$ and $Q_\beta$, which are based on the environment-defined reward $r$ and the augmented reward $r + r^i$ ($r^i$ is the intrinsic reward), respectively, yet are both computed with regard to the target policy $\pi$. As is standard, $Q_\pi$ approximates the task-defined objective that the algorithm aims to maximize. On the other hand, $Q_\beta$ is a behavior critic that can be shown to be both explorative and stable, theoretically (Section 4.3) and empirically (Section 5.3). Note that when no intrinsic reward is used, the two critics degrade to be identical to one another (i.e. $Q_\pi = Q_\beta$), and in practice, when that happens, we only store one of them.
To better appreciate our method, it is not enough to view our actors and critics in isolation. We therefore formalize the connections between the actors and the critics, as well as the objectives optimized during the model update phase (Figure 2). As defined above, $\pi$ is the exploitation policy that aims to maintain optimality throughout the learning process, which is best optimized using the deterministic policy gradient (Eq (1)) with $Q_\pi$ as the referred critic (① in Figure 2). On the other hand, for the sake of expressiveness, the energy-based objective (Eq (2)) is a good fit for $\beta$. To further encourage exploration, we use the behavior critic $Q_\beta$ in this objective (② in Figure 2). Since both policies share the same network $f$, the actor optimization process (③ in Figure 2) is done by maximizing

$J(\theta) = J_\pi(\theta) + J_\beta(\theta)$   (4)

where the gradients of the two terms are given by Eqs (1) and (3), respectively. In particular, we set $Q = Q_\pi$ in Eq (1) and $Q = Q_\beta$ in Eq (3). As illustrated in Algorithm 1 (line 5), we update $Q_\pi$ and $Q_\beta$ with the targets $\mathcal{T}_r^\pi Q_\pi$ and $\mathcal{T}_{r+r^i}^\pi Q_\beta$ on the collected samples using the mean squared error loss, respectively.
In the sample collection phase, $\beta$ interacts with the environment and the gathered samples are stored in a replay buffer (Mnih et al., 2013) for later use in the model update phase. Given state $s$, actions are sampled from $\beta$ with a three-step procedure: (i) sample $\xi \sim \mathcal{N}(0, I)$, (ii) plug the sampled $\xi$ into $f$ to get its output $f(s, \xi)$, and (iii) regard $f(s, \xi)$ as the center of the kernel $k$ and sample an action from it.
On the implementation side, ADAC is compatible with any existing off-policy actor-critic model for continuous control: it directly builds upon such a model by inheriting its actor $\pi$ (which is also its target policy) and critic $Q_\pi$. To be more specific, ADAC merely adds a new actor $\beta$ to interact with the environment and a new critic $Q_\beta$ that guides $\beta$'s updates on top of the base model, along with the constraints/connections enforced between the inherited and the new actor and between the inherited and the new critic (i.e. policy co-training and critic bounding). In other words, the modifications made by ADAC do not conflict with the originally proposed improvements of the base model. In our experiments, two base models (i.e. DDPG (Lillicrap et al., 2015) and TD3 (Fujimoto et al., 2018a)) are adopted.^4

^4 See Appendix B for the pseudocode and a detailed description of ADAC.
4.2. Stabilizing Policy Updates by Policy Co-training
Although the energy-based behavior policy defined by Eq (2) is sufficiently expressive to capture potentially rewarding actions, it may still not be helpful for learning a better $\pi$: being expressive also means that $\beta$ is often significantly different from $\pi$, leading $\beta$ to collect samples that can substantially bias $\pi$'s updates (recall the discussion around Eq (1)), in turn rendering the learning process of $\pi$ unstable and vulnerable to catastrophic failure (Sutton et al., 2008; Schlegel et al., 2019; Xie et al., 2019; Fujimoto et al., 2018b). To be more specific, since the difference between $\pi$ and an expressive $\beta$ is more than some zero-mean random noise, the state marginal distribution $\rho^\beta$ can diverge greatly from $\rho^\pi$. Since $\rho^\pi$ is not directly accessible, as shown in Eq (1), the gradients of $\pi$ are approximated using samples from $\rho^\beta$. When the approximated gradients constantly deviate significantly from the true values (i.e. the approximated gradients are biased), the updates to $\pi$ essentially become inaccurate and hence ineffective. This suggests that naively disentangling the behavior policy from the target policy alone is no guarantee of improved training efficiency or final performance.
Therefore, to mitigate the aforementioned problem, we would like to reduce the distance between $\pi$ and $\beta$, which naturally reduces the KL divergence between the distributions $\rho^\pi$ and $\rho^\beta$. One straightforward approach is to restrict the randomness of $\beta$, for example by lowering the entropy of the behavior policy through a smaller $\alpha$ (Eq (3)). However, this inevitably sacrifices $\beta$'s expressiveness, which in turn would also harm ADAC's competitiveness. Alternatively, we propose policy co-training, motivated by the intrinsic connection between Eqs (1) and (3) (see the last paragraph of Section 3.3), to best maintain the expressiveness of $\beta$ while stabilizing it by restricting it with regard to $\pi$. As described in Section 4.1, we reiterate that, in a nutshell, both policies are modeled by the same network $f$ and are distinguished only by their different inputs to $f$. During training, $f$ is updated to maximize Eq (4). The method for sampling actions from $\beta$ is described in the last paragraph of Section 4.1.
We further justify the above choice by demonstrating that the imposed restrictions on $\pi$ and $\beta$ have only a minor influence on $\pi$'s optimality and $\beta$'s expressiveness. To argue this point, we revisit Eq (3) once more: $\pi$ can be viewed as being updated with $\xi = 0$, whereas $\beta$ is updated with $\xi \sim \mathcal{N}(0, I)$. Intuitively, this keeps $\pi$ optimal, since its action is not affected by the entropy maximization term (i.e. the second term). Meanwhile, $\beta$ remains expressive, since it is significantly restricted by $\pi$ only when the input random variable $\xi$ is close to the zero vector. In Section 5.1, we will empirically demonstrate that policy co-training indeed reduces the distance between $\pi$ and $\beta$ during training, fulfilling its mission.

Additionally, policy co-training enforces the underlying relation between $\pi$ and $\beta$. Specifically, it forces $\pi$ to be contained in $\beta$, since $\pi(s)$ is the highest-density point of $\beta(\cdot|s)$, and sampling from $\beta$ is likely to generate actions close to those from $\pi$. This matches the intuition that $\pi$ and $\beta$ should share similarities: actions proposed by $\pi$ are rewarding (with respect to $Q_\pi$) and thus should be frequently executed by $\beta$.
4.3. Incorporating Intrinsic Reward in Behavior Critic via Critic Bounding
With the help of disentanglement as well as policy co-training, which makes $\pi$ and $\beta$ analogous, we manage to design an expressive behavior policy that not only explores effectively but also helps stabilize $\pi$'s learning process. In this subsection, we aim to achieve the same goals – stability and expressiveness – for a different subject, the behavior critic $Q_\beta$.
As introduced in Section 4.1, $r$ is the environment-defined reward function, while $r + r^i$ additionally contains an exploration-oriented intrinsic reward $r^i$. As hinted by the notation, ADAC's target critic $Q_\pi$ and behavior critic $Q_\beta$ are defined with regard to the same policy $\pi$ but are updated differently, according to

$Q_\pi \leftarrow \mathcal{T}_r^\pi Q_\pi, \qquad Q_\beta \leftarrow \mathcal{T}_{r+r^i}^\pi Q_\beta$   (5)

where the updates are performed through mini-batches in practice. Note that when no intrinsic reward is used, Eq (5) becomes trivial and the two critics ($Q_\pi$ and $Q_\beta$) are identical. Therefore, we only consider the case where an intrinsic reward exists in the following discussion.
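The two updates in Eq (5) differ only in the reward fed into the Bellman target; both bootstrap through the next action chosen by the target policy. A minimal tabular sketch, with hypothetical critic values and transition throughout:

```python
import numpy as np

def critic_targets(r, r_int, s_next, a_next, Q_pi, Q_beta, gamma=0.9):
    """Bootstrapped targets for the two critics: both evaluate the next
    action a_next chosen by the *target* policy (critic bounding), but
    only the behavior critic receives the intrinsic reward r_int."""
    y_pi = r + gamma * Q_pi[s_next, a_next]
    y_beta = (r + r_int) + gamma * Q_beta[s_next, a_next]
    return y_pi, y_beta

# Toy tabular critics and one observed transition (all values invented).
Q_pi = np.array([[1.0, 2.0], [3.0, 4.0]])
Q_beta = np.array([[1.5, 2.5], [3.5, 4.5]])
y_pi, y_beta = critic_targets(r=0.5, r_int=0.1, s_next=1, a_next=0,
                              Q_pi=Q_pi, Q_beta=Q_beta)
```

In practice each critic is then regressed toward its target with a mean squared error loss over mini-batches; the shared bootstrap action is what ties $Q_\beta$ back to $\pi$.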
While it is natural that the target critic is updated using the target policy, it may seem counterintuitive that the behavior critic is also updated using the target policy. Given that $\beta$ is updated following the guidance (i.e. through the energy-based objective) of $Q_\beta$, we do so to prevent $\beta$ from diverging disastrously from $\pi$. Let $\pi'$ be a greedy policy w.r.t. $Q_\pi$ and $\beta'$ be a greedy policy w.r.t. $Q_\beta$. Assume $\pi$ is optimal w.r.t. $Q_\beta$, and $r^i > 0$. We have the following results.

First, $\mathbb{E}_{\rho^\pi}[\mathcal{T}_r Q_\pi - Q_\pi]$, a proxy of training stability, is lower bounded by

(6)

Second, $\mathbb{E}_{\rho^\beta}[\mathcal{T}_{r+r^i} Q_\beta - Q_\beta]$, a proxy of training effectiveness, is lower bounded by

(7)
The full proof is deferred to Appendix A. Here, we only focus on the insights conveyed by Theorem 4.3. Intuitively, the first result (i.e. (6)) guarantees training stability by providing a lower bound on our ultimate learning goal – the expected improvement of $\pi$ w.r.t. the task-defined reward $r$; the second result (i.e. (7)) provides a lower bound on the expected improvement of $\beta$, which measures the effectiveness of ADAC in the sense that the better $\beta$'s performance, the higher the quality of the collected samples (since $\beta$ is used to interact with the environment).
Before formalizing the above intuition, we first examine the assumptions made by the theorem. While the other assumptions are generally satisfiable and commonly made in the RL literature (Munos, 2007), the assumption on the rewards ($r^i > 0$) seems restrictive. However, since most intrinsic rewards are strictly greater than zero (e.g., Houthooft et al. (2016); Fu et al. (2017)), it can be easily satisfied in practice.
To better understand the theorem, we first provide interpretations of its key components. According to the definition of the Bellman optimality operator (Section 3.1), $\mathcal{T}_r Q - Q$ quantifies the improvement on $Q$ after performing one value iteration (Bellman, 1966) step (w.r.t. $r$, where all states receive a hard update), which is a proxy of the policy improvement in the near future. Therefore, $\mathbb{E}_{\rho}[\mathcal{T}_r Q - Q]$ is the expected policy improvement under the state-action distribution $\rho$ in the near future.
We formalize training stability as the lower bound of the expected policy improvement under $\rho^\pi$ (i.e. $\mathbb{E}_{\rho^\pi}[\mathcal{T}_r Q_\pi - Q_\pi]$). Its lower bound (Eq (6)) consists of two parts. The second part is greater than zero since $\pi$ is optimized to maximize the cumulative reward of $Q_\beta$ while $\pi'$ is not. On the other hand, the first term can be viewed as the improvement of $Q_\beta$ during training, since $\rho^\beta$ is the training sample distribution. Therefore, the improvement of $Q_\beta$ during training lower bounds the expected policy improvement under $\rho^\pi$, which represents stability.
Conversely, the lower bound on $\mathbb{E}_{\rho^\beta}[\mathcal{T}_{r+r^i} Q_\beta - Q_\beta]$ reflects the effectiveness of the training procedure. Note that most intrinsic rewards are designed to be small in states that are frequently visited. Therefore, when the state-action pairs drawn from $\rho^\beta$ visit states that are frequently visited by $\pi$, which is promised by the policy co-training approach (Section 4.2), $r^i$ will be small. Therefore, even if $\rho^\beta$ and $\rho^\pi$ are not identical, as long as $\beta$ allows substantial visitation of high-probability states in $\rho^\pi$ to make $r^i$ sufficiently small, the improvement when trained on the $\rho^\beta$ samples will be almost as large as the training improvement on the target distribution $\rho^\pi$, which indicates effectiveness.
Table 1: Action ($a$) and reward ($r$) specification of the modified CartPole environment.
5. Experiments
In this section, we take gradual steps to analyze and illustrate our proposed method, ADAC. Specifically, we first investigate the behavior of our analogous disentangled behavior policy (Section 5.1). Next, we perform an empirical evaluation of ADAC without intrinsic rewards on 14 standard continuous-control benchmarks (Section 5.2). Finally, encouraged by its promising performance, and to further justify the critic bounding method, we examine ADAC with intrinsic rewards in 4 sparse-reward and hence exploration-heavy environments (Section 5.3). Throughout this paper, we highlight two benefits of the analogous disentangled nature of ADAC: (i) avoiding unnecessary trade-offs between current optimality and exploration (i.e. a more expressive and effective behavior policy); (ii) natural compatibility with intrinsic rewards without altering environment-defined optimality. In this context, the first two subsections demonstrate the first benefit and the last subsection is dedicated to the second.
Environment  ADAC (TD3)  ADAC (DDPG)  TD3  DDPG  SAC  PPO
RoboschoolAnt  2219±373  838.1±97.1*  2903±666  450.0±27.9  2726±652  1280±71
RoboschoolHopper  2299±333  766.5±10*  2302±537  543.8±307  2089±657  1229±345
RoboschoolHalfCheetah  1578±166  1711±95*  607.2±246.2  441.6±120.4  807.0±252.6  1225±184.2
RoboschoolAtlasForwardWalk  234.6±55.7  186.7±37.9*  190.6±50.1  52.63±26.2  126.0±47.1  107.6±29.4
RoboschoolWalker2d  1769±452  1564±651*  995.1±146.3  208.7±137.1  1021±263  578.9±231.3
Ant  3353±847  1226±18*  4034±517  370.5±223  4291±1498  1401±168
Hopper  3598±374  374.5±36.5*  2845±609  38.9±30.88  3307±825  1555±458
HalfCheetah  9392±199  2238±40*  10526±2367  1009±49  11541±2989  881.7±10.1
Walker2d  5122±1314  1291±42*  4630±778  186.2±33.3  4067±1211  1146±368
InvertedPendulum  1000±0  1000±0*  1000±0  1000±0*  1000±0  98.90±2.08
InvertedDoublePendulum  9359±0.17  9334±1.39*  7665±566  27.20±2.61  9353±2896  98.90±5.88
BipedalWalker  309.81±5.6  52.77±1.94*  288.45±1.25  123.90±11.17  307.25±7.92  266.92±8.52
BipedalWalkerHardcore  10.76±27.70  98.52±3.21  57.97±21.08  50.05±10.27*  127.44±5.2  105.32±2.2
LunarLanderContinuous  290.05±0.9  85.67±23.42*  289.75±4.1  65.89±96.48  283.36±9.29  59.32±68.44
* indicates the better performance between ADAC (DDPG) and its base model DDPG. In all three cases, values that are statistically insignificantly different (p > 0.05 in a t-test) from the respective should-be-indicated ones are marked as well.
5.1. Analysis of Analogous Disentangled Behavior Policy
Since we are largely motivated by the potential luxury of designing an expressive exploration strategy offered by the disentangled nature of our framework, it is natural that we are first interested in investigating how well our behavior policy lives up to our expectations. Yet, as discussed in Section 4.2, in order to aid stable policy updates, we deliberately put some restraints on our behavior policy, making it analogous to the target policy, which means our behavior policy may not be as expressive as it would otherwise be. Given this, we start this set of empirical experiments by investigating whether our behavior policy is still expressive enough, as measured by its coverage (i.e. does it explore a wide enough action/policy space outside the current target policy?). To further examine the influence of the added restraints, we examine the policy network's stability (i.e. does policy co-training lower the bias between the two policies and stabilize $\pi$'s learning process?). Finally, we focus on the effectiveness of our behavior policy by measuring the overall performance of ADAC (i.e. does ADAC's exploration strategy efficiently lead the target policy to iteratively converge to a more desirable local optimum?).
Setup For ease of illustration, we choose a straightforward environment, namely CartPole (Brockman et al., 2016), as our demonstration bed. The goal in this environment is to balance the pole that is attached to a cart by applying left/right force. For compatibility with continuous control and a better modeling of real-world implications, we modified CartPole's original discrete action space and added an effort penalty to the rewards, as specified in Table 1. To demonstrate the advantages of our behavior policy, we choose DDPG with two commonly used existing exploration strategies as the main baselines, i.e., Gaussian noise and Ornstein-Uhlenbeck process noise (Uhlenbeck and Ornstein, 1930), each with two variance levels. For fair comparison, we only present DDPG-based ADAC here (or simply ADAC later in this subsection). To further demonstrate the benefits of disentanglement, we choose SAC as another baseline. As discussed earlier in the related work, SAC similarly utilizes energy-based policies, yet, in contrast to our approach, its exploration is embedded into its target policy.

Empirical Insights To minimize distraction, our discussion starts by closely examining ADAC's behavior and target policies alone. First, see the cells at the bottom of Figure 3, which are snapshots of the behavior and target policy at different training stages. As suggested by the wide bell shape of the solid blue curve ($\beta$) in the first cell, our behavior policy acts curiously when ignorant about the environment, extensively exploring all possible actions, including those that are far away from the target policy (represented by the red dots). Yet having such broad coverage alone is still not sufficient to overcome the early trap of getting stuck in a deceiving local optimum. As suggested by the bimodal shape of the solid blue curve ($\beta$) in the second cell, after acquiring a preliminary understanding of the environment, the agent starts to form preferences for some actions when exploring. Almost at the same time, the target policy no longer stays close to 0.0 (represented by the intersection of the two axes), suggesting that the behavior policy is effective in leading the target policy towards a more desirable place. This is further corroborated by the third and fourth cells. In the late stage, besides being able to balance the pole, our agent even manages to learn to exert actions with small absolute value from time to time to avoid the effort penalty.
Other than its expressiveness, stability critically influences ADAC's overall performance, which is by design controlled by the proposed policy co-training approach. To examine its effect, we perform an ablation study: we compare ADAC with policy co-training against ADAC without it (when policy co-training is not used, two separate neural networks are used to store the two critics). The effect of critic bounding is measured by the bias between the critic's estimates and the true returns, which is shown in the middle of Figure 3. We can see that ADAC has a much lower bias than its variant without policy co-training. Additionally, policy co-training does not affect the expressiveness of the behavior policy, as suggested by the behavior policies rendered at the bottom of Figure 3.
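One common way to measure the kind of critic bias examined in this ablation is to compare the critic's value estimates at visited state-action pairs against the empirical discounted returns of the same trajectory. The sketch below is an illustrative assumption, not the paper's exact measurement procedure.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # Empirical return G_t = sum_k gamma^k * r_{t+k}, computed backwards over one episode.
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return np.array(out[::-1])

def critic_bias(q_estimates, rewards, gamma=0.99):
    # Mean signed gap between the critic's predictions and the Monte-Carlo returns;
    # positive values indicate overestimation, negative values underestimation.
    return float(np.mean(np.asarray(q_estimates) - discounted_returns(rewards, gamma)))
```

Averaging this quantity over many evaluation episodes gives a curve comparable to the bias plot described for Figure 3.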
Finally, we turn to the learning curves in Figure 3: ADAC exceeds the baselines in both learning efficiency (i.e., being the first to consistently accumulate positive rewards) and final performance. Unlike our behavior policy, exploration through random noise is unguided, resulting in either wasted exploration of unpromising regions or insufficient exploration of rewarding areas. This largely explains the noticeable performance gap between DDPG with random noise and ADAC. On the other hand, SAC bears an expressive policy similar to our behavior policy. However, lacking a separate behavior policy, SAC has to consistently take suboptimal actions into account to aid exploration, adversely affecting its policy improvement process. In other words, unlike ADAC, SAC cannot fully exploit its learned knowledge of the environment (i.e., its value functions) to construct its target policy, leading to performance inferior to ADAC's.
5.2. Comparison with the State of the Art
Though well suited for illustration, CartPole alone is neither challenging nor general enough to fully manifest ADAC's competitiveness. In this subsection, we show that ADAC achieves state-of-the-art performance on standard benchmarks.
Setup To demonstrate the generality of our method, we construct a 14-task testbed suite composed of qualitatively diverse continuous-control environments from the OpenAI Gym toolkit Brockman et al. (2016). On top of the two baselines adopted earlier (i.e., DDPG and SAC), we further include TD3 Fujimoto et al. (2018a), which improves upon DDPG by addressing some of its function approximation errors; PPO Schulman et al. (2017b), regarded as one of the most stable and efficient on-policy policy gradient algorithms; and GEP-PG Colas et al. (2018), which combines the Goal Exploration Process Péré et al. (2018) with policy gradient to perform curious exploration as well as stable learning. Though not exhaustive, this baseline suite embodies many of the latest advancements and can be deemed representative of the existing state of the art. However, we compare with GEP-PG only on tasks adopted in its original experiments: since the GEP part of the algorithm needs hand-crafted exploration goals, it is non-trivial to generalize it to other tasks. To best reproduce the remaining baselines' performance, we use their original open-source implementations if released; otherwise, we build our own versions after the most-starred third-party implementations on GitHub. Furthermore, we fine-tune their hyperparameters around the values reported in the respective literature and only coarsely tune the hyperparameters introduced by ADAC. All experiments are run for 1 million timesteps, or until reaching performance convergence, whichever happens earlier.
For fairness, we do not use intrinsic rewards throughout this section, since most baseline approaches cannot naturally incorporate them during learning. See Appendix C for additional environment details; hyperparameter settings of ADAC and the baselines are provided in Appendix E; full benchmark results are given in Appendix F.
Empirical Insights Table 2 corroborates ADAC's competitiveness, stemming from its disentangled nature, over existing methods. More importantly, these results reveal two desirable properties of ADAC's full compatibility with existing off-policy methods. First, ADAC consistently outperforms the method it is based on. As indicated by the symbols in Table 2, compared to its base model, DDPG-based ADAC achieves statistically better or comparable performance on most of the benchmarks and obtains identical performance on one of the remaining two. Though not as remarkable as DDPG-based ADAC, TD3-based ADAC also manages to achieve statistically better or comparable performance over its base model on most of the tasks. Second, ADAC retains the benefits of the improvements developed by the base models themselves. This is best illustrated by TD3-based ADAC's performance superiority over DDPG-based ADAC.
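"Statistically better or comparable" presumes a significance test over final returns across random seeds. The paper does not name the test here, so the following Welch's t-test sketch (a standard choice when the two algorithms' return variances may differ) is an assumption about the protocol, not a description of it.

```python
import math

def welch_t(xs, ys):
    """Welch's t statistic and degrees of freedom for two independent samples,
    e.g., final returns of two algorithms across random seeds."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    # Unbiased sample variances of the two groups.
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    se2 = vx / nx + vy / ny
    t = (mx - my) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, dof
```

Comparing the resulting t value against the critical value for the estimated degrees of freedom then yields the "better / comparable" verdict per benchmark (in practice one would use `scipy.stats.ttest_ind(xs, ys, equal_var=False)`).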
We would like to specially call readers' attention to our comparison of ADAC against SAC, since both use energy-based policies. This comparison also reveals the benefit brought by the disentangled structure and the analogous actors and critics. ADAC (TD3) achieves better average performance than SAC on 71% (10/14) of the benchmarks, indicating the effectiveness of our proposed analogous disentangled structure.
Despite also building on the disentanglement idea, GEP-PG Colas et al. (2018) is not compared across all 14 benchmarks and hence not included in Table 2: the Goal Exploration Process (GEP) in GEP-PG requires manually defining a goal space to explore, which is task-dependent and critically influences the algorithm's performance. Therefore, we only compare on the two experiments they have run, of which only one, HalfCheetah, overlaps with our task suite. On HalfCheetah, GEP-PG achieves a cumulative reward of 6118, while ADAC (TD3) achieves 9392, showing superiority over GEP-PG. Furthermore, as also acknowledged in its paper, GEP-PG lags behind SAC in performance, which suggests that naively disentangling the behavior policy from the target policy does not guarantee competitive performance. Rather, to design an effective disentangled actor-critic, we should also pay attention to how best to restrict its components.
When considering all reported methods together, TD3-based ADAC obtains the largest number of state-of-the-art results; as indicated in bold, it is the best performer (or statistically comparable with the best) on most of the benchmarks.
5.3. Evaluation in Sparse-Reward Environments
Encouraged by the promising results on the benchmarks, in this subsection we evaluate ADAC in more challenging environments, in which rewards are barely provided. This set of experiments aims to test ADAC's exploration capacity under extreme settings. Furthermore, these environments also serve as demonstration beds for ADAC's natural compatibility with intrinsic-reward methods. In this regard, we are particularly interested in investigating whether the disentangled nature of ADAC helps mitigate the undesirable bias that intrinsic rewards impose on the environment-defined optimality.
Setup Perhaps surprisingly, sparse-reward environments turn out to be relatively scarce in commonly used RL toolkits. Besides including the classic MountainCarContinuous and Acrobot (after converting its action space to be continuous), to construct a decently sized testing suite we further hand-craft two new tasks, PendulumSparse and CartPoleSwingUpSparse, by sparsifying the rewards of existing environments. Sparsifying is achieved mainly by suppressing the original rewards until some predefined threshold is reached (more details about the sparse-reward environments can be found in Appendix D). Due to their dependency on environment-provided rewards as feedback signals, most model-free RL algorithms suffer significant performance degradation on these sparse-reward tasks. In this situation, resorting to intrinsic-reward methods (IM) for additional signals is widely considered the go-to solution. Among the wide variety of IM methods, we adopt Variational Information Maximization Exploration (VIME) Houthooft et al. (2016) as our intrinsic reward generator for its consistently good performance on a wide variety of exploration-challenging tasks. Considering TD3-based ADAC's superiority over DDPG-based ADAC, we only combine VIME with TD3 and TD3-based ADAC. Note that, when paired with ADAC, intrinsic rewards are only visible to the behavior policy.
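The sparsification rule described above (suppress rewards until a cumulative threshold is crossed) can be sketched as a small environment wrapper. This is one plausible reading of the rule under assumed semantics; the exact thresholds and per-environment details live in Appendix D. The wrapper only assumes Gym-style `reset()`/`step()` methods.

```python
class SparsifyReward:
    """Hide the environment's reward until the accumulated (hidden) original
    reward crosses a predefined threshold, producing a sparse-reward task."""

    def __init__(self, env, threshold):
        self.env, self.threshold = env, threshold
        self._accum = 0.0  # hidden cumulative original reward

    def reset(self, *args, **kwargs):
        self._accum = 0.0
        return self.env.reset(*args, **kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._accum += reward
        # Emit reward only once the hidden cumulative reward passes the threshold.
        sparse = reward if self._accum >= self.threshold else 0.0
        return obs, sparse, done, info
```

Wrapping, e.g., Pendulum with a suitable threshold would yield a PendulumSparse-style task in which the agent receives no signal until it is already close to solving the dense version.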
Empirical Insights Among the four environments, PendulumSparse has the most vulnerable environment-defined optimality. The goal here is to swing the inverted pendulum up so that it stays upright. As suggested by Figure 4, not knowing how to distinguish between intrinsic and environment rewards, VIME-augmented TD3 is completely fooled into chasing the intrinsic rewards. In other words, VIME-augmented TD3's understanding of what is optimal diverges completely from the true environment-defined optimality. Note that, as demonstrated in the bottom-left part of Figure 5, VIME-augmented TD3's performance even trails behind TD3's, an indisputable sign that the bias introduced by IM can be detrimental and should be addressed whenever possible. In contrast, thanks to its disentangled nature, VIME-augmented ADAC only perceives intrinsic rewards in its behavior policy, which means its target policy always remains optimal with regard to the current knowledge of environment rewards. Because of this, VIME-augmented ADAC manages to consistently solve this exploration-challenging task. ADAC's natural compatibility with VIME is further corroborated by the results on the remaining three tasks. As suggested by the complete Figure 5, VIME-augmented ADAC consistently surpasses all reported alternatives by a large margin in terms of both convergence speed and final performance.
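The asymmetric reward routing described above (intrinsic rewards visible only to the behavior policy) can be sketched by building two separate bootstrapping targets from the same transition. The two-critic layout and the scaling coefficient `beta` are illustrative assumptions, not the paper's exact update equations.

```python
def td_targets(r_ext, r_int, q_next_target, q_next_behavior,
               done, gamma=0.99, beta=0.1):
    """Separate TD targets for the target critic and the behavior critic.

    The target critic sees only the environment reward, so the target policy
    stays optimal w.r.t. environment-defined optimality; the behavior critic
    additionally receives the scaled intrinsic bonus to drive exploration.
    """
    cont = 0.0 if done else 1.0
    target_critic_y = r_ext + gamma * cont * q_next_target
    behavior_critic_y = (r_ext + beta * r_int) + gamma * cont * q_next_behavior
    return target_critic_y, behavior_critic_y
```

Under this routing, any bias an intrinsic-reward generator such as VIME introduces is confined to the behavior side and never propagates into the target policy's value estimates.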
6. Conclusion
We present Analogous Disentangled Actor-Critic (ADAC), an off-policy reinforcement learning framework that explicitly disentangles the behavior and target policies. Compared to prior work, to stabilize model updates, we restrain our behavior policy and its corresponding critic to be analogous to their target counterparts. Thanks to its disentangled and analogous nature, ADAC achieves state-of-the-art results on 10 out of 14 continuous-control benchmarks. Moreover, ADAC is naturally compatible with intrinsic rewards, outperforming alternatives on exploration-challenging tasks.
Acknowledgements This work is partially supported by NSF grants #IIS-1943641, #IIS-1633857, #CCF-1837129, DARPA XAI grant #N66001-17-2-4032, a UCLA Samueli Fellowship, and gifts from Intel and Facebook Research.
References
 Arulkumaran et al. (2017) Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. 2017. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine 34, 6 (2017), 26–38.
 Bellemare et al. (2016) Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. 2016. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems. 1471–1479.
 Bellman (1966) Richard Bellman. 1966. Dynamic programming. Science 153, 3731 (1966), 34–37.
 Beyer et al. (2019) Lucas Beyer, Damien Vincent, Olivier Teboul, Sylvain Gelly, Matthieu Geist, and Olivier Pietquin. 2019. MULEX: Disentangling Exploitation from Exploration in Deep RL. arXiv preprint arXiv:1907.00868 (2019).
 Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym. (2016). arXiv:arXiv:1606.01540
 Colas et al. (2018) Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. 2018. GEP-PG: Decoupling exploration and exploitation in deep reinforcement learning algorithms. arXiv preprint arXiv:1802.05054 (2018).
 Feng et al. (2017) Yihao Feng, Dilin Wang, and Qiang Liu. 2017. Learning to draw samples with amortized Stein variational gradient descent. arXiv preprint arXiv:1707.06626 (2017).
 Forestier et al. (2017) Sébastien Forestier, Yoan Mollard, and Pierre-Yves Oudeyer. 2017. Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190 (2017).
 Fortunato et al. (2017) Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. 2017. Noisy networks for exploration. International Conference on Learning Representations (ICLR) (2017).
 Fu et al. (2017) Justin Fu, John CoReyes, and Sergey Levine. 2017. Ex2: Exploration with exemplar models for deep reinforcement learning. In Advances in Neural Information Processing Systems. 2577–2587.

 Fujimoto et al. (2018a) Scott Fujimoto, Herke van Hoof, and David Meger. 2018a. Addressing Function Approximation Error in Actor-Critic Methods. In International Conference on Machine Learning.
 Fujimoto et al. (2018b) Scott Fujimoto, David Meger, and Doina Precup. 2018b. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900 (2018).
 Gu et al. (2016) Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, and Sergey Levine. 2016. Q-Prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247 (2016).
 Haarnoja et al. (2017) Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. 2017. Reinforcement learning with deep energy-based policies. In Proceedings of the International Conference on Machine Learning, Volume 70.
 Haarnoja et al. (2018a) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018a. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning.
 Haarnoja et al. (2018b) Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. 2018b. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905 (2018).
 Houthooft et al. (2016) Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. 2016. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems. 1109–1117.
 Kulkarni et al. (2016) Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. 2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems. 3675–3683.
 Leibo et al. (2017) Joel Z. Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. 2017. Multi-agent Reinforcement Learning in Sequential Social Dilemmas. In Proc. of AAMAS.
 Liang et al. (2016) Yitao Liang, Marlos C. Machado, Erik Talvitie, and Michael Bowling. 2016. State of the Art Control of Atari Games Using Shallow Reinforcement Learning. In Proc. of AAMAS.
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations.

 Liu and Wang (2016) Qiang Liu and Dilin Wang. 2016. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems. 2378–2386.
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning. 1928–1937.
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level Control through Deep Reinforcement Learning. Nature 518, 7540 (2015), 529–533.
 Munos (2007) Rémi Munos. 2007. Performance bounds in Lp-norm for approximate value iteration. SIAM Journal on Control and Optimization 46, 2 (2007), 541–561.
 Nachum et al. (2018) Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. 2018. Trust-PCL: An Off-Policy Trust Region Method for Continuous Control. In International Conference on Learning Representations. https://openreview.net/forum?id=HyrCWeWCb
 Péré et al. (2018) Alexandre Péré, Sébastien Forestier, Olivier Sigaud, and Pierre-Yves Oudeyer. 2018. Unsupervised learning of goal spaces for intrinsically motivated goal exploration. arXiv preprint arXiv:1803.00781 (2018).
 Plappert et al. (2018) Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. 2018. Parameter space noise for exploration. In Proceedings of the International Conference on Learning Representations.
 Roboschool (2019) Roboschool. https://openai.com/blog/roboschool/. (2019). Accessed: 2019-08-27.
 Schlegel et al. (2019) Matthew Schlegel, Wesley Chung, Daniel Graves, Jian Qian, and Martha White. 2019. Importance Resampling for Off-policy Prediction. arXiv preprint arXiv:1906.04328 (2019).
 Schulman et al. (2017a) John Schulman, Xi Chen, and Pieter Abbeel. 2017a. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440 (2017).
 Schulman et al. (2017b) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017b. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
 Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic Policy Gradient Algorithms. In International Conference on International Conference on Machine Learning.
 Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. 2017. Mastering the game of Go without human knowledge. Nature 550 (2017), 354–359.
 Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
 Sutton et al. (2008) Richard S Sutton, Csaba Szepesvári, and Hamid Reza Maei. 2008. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Advances in Neural Information Processing Systems 21 (2008), 1609–1616.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 5026–5033.
 Uhlenbeck and Ornstein (1930) G. E. Uhlenbeck and L. S. Ornstein. 1930. On the Theory of the Brownian Motion. Phys. Rev. 36 (1930), 823–841. Issue 5.
 Wang et al. (2018) Dilin Wang, Zhe Zeng, and Qiang Liu. 2018. Stein Variational Message Passing for Continuous Graphical Models. In ICML. 5206–5214. http://proceedings.mlr.press/v80/wang18l.html
 Wang and Zhang (2017) Yue Wang and Fumin Zhang. 2017. Trends in Control and Decision-Making for Human-Robot Collaboration Systems (1st ed.). Springer Publishing Company, Incorporated.
 Watkins and Dayan (1992) Christopher J. C. H. Watkins and Peter Dayan. 1992. Technical Note: Q-Learning. Machine Learning 8, 3–4 (May 1992).
 Xie et al. (2019) Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. 2019. Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling. arXiv preprint arXiv:1906.03393 (2019).
 Xu et al. (2017) Zhi-Xiong Xu, Xi-Liang Chen, Lei Cao, and Chen-Xi Li. 2017. A study of count-based exploration and bonus for reinforcement learning. In Proceedings of the IEEE International Conference on Cloud Computing and Big Data Analysis (ICCCBDA). IEEE, 425–429.
 Ziebart (2010) Brian D Ziebart. 2010. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Ph.D. Dissertation. figshare.
Supplementary Material
Appendix A Theoretical Results
This section provides the full proof of Theorem 4.3, which guarantees both the training stability and the training effectiveness of the critic bounding approach (Section 4.3).
Proof of Theorem 4.3
We define as the optimal value function with respect to policy and reward , i.e., . We further define as the optimal value function with respect to and (i.e., ). Our proof is built upon the foundational result stated in the following lemma; for the sake of a smoother presentation, we defer its proof until after the theorem is proved.
Lemma
Under the definitions and assumptions made in Theorem 4.3 and the above paragraph, we have the following result
(8) 
Recall that and are the optimal value function with respect to and , respectively. By definition, we have and (since ).
Result on training effectiveness (i.e. Eq (7)) We are now ready to prove the second result stated in the theorem. Since , we have . Plug in Eq (8) and use the equality
we have
which is equivalent to the second result stated in the theorem (Eq (7)).
Result on training stability (i.e. Eq (6)) To prove the first result stated in the theorem, we start from rearranging Eq (8):
(9) 
where uses the inequality , and follows from . Rewriting Eq (9) gives us the first result stated in the theorem (Eq (6)):
Proof of Lemma A.
Before delving into the detailed derivation, we make the following clarifications. First, although is a greedy policy w.r.t. , is not the optimal value function w.r.t. and . In other words, is guaranteed to hold, yet we might have . Second, in both the theorem and the proof, we omit the state-action notation (e.g., ) for the sake of simplicity.
We begin from the difference between the respective optimal value function with regard to and :
(10) 
where is the state probability transition operator with respect to the environment dynamics and policy ; uses the equality ; adopts the fact that . Combining the terms and gives us
(11) 
where is the identity operator, i.e., . We define . By definition, given the initial state-action distribution , is the state-action marginal distribution with respect to and policy . We can easily verify that and , since by definition .
Appendix B Algorithmic Details of ADAC
(12)  
(13) 