
Learning Self-Imitating Diverse Policies

Deep reinforcement learning algorithms, including policy gradient methods and Q-learning, have been widely applied to a variety of decision-making problems. Their success has relied heavily on well-designed dense reward signals, and they therefore often perform badly in sparse or episodic reward settings. Trajectory-based policy optimization methods, such as the cross-entropy method and evolution strategies, do not take into consideration the temporal nature of the problem and often suffer from high sample complexity. Scaling up the efficiency of RL algorithms to real-world problems with sparse or episodic rewards is therefore a pressing need. In this work, we present a new perspective on policy optimization and introduce a self-imitation learning algorithm that exploits and explores well in sparse and episodic reward settings. First, we view each policy as a state-action visitation distribution and formulate policy optimization as a divergence minimization problem. Then, we show that, with the Jensen-Shannon divergence, this divergence minimization problem can be reduced to a policy-gradient algorithm with dense rewards learned from experience replays. Experimental results indicate that our algorithm performs comparably to existing algorithms in the dense reward setting, and significantly better in the sparse and episodic settings. To encourage exploration, we further apply Stein variational policy gradient descent with the Jensen-Shannon kernel to learn multiple diverse policies and demonstrate its effectiveness on a number of challenging tasks.


1 Introduction

Deep reinforcement learning (RL) has demonstrated significant applicability and superior performance in many problems outside the reach of traditional algorithms, such as computer and board games (Mnih et al., 2015; Silver et al., 2016), continuous control (Lillicrap et al., 2015), and robotics (Levine et al., 2016). Using deep neural networks as function approximators, many classical RL algorithms have been shown to be very effective in solving sequential decision problems. For example, a policy that selects actions given a state observation can be parameterized by a deep neural network that takes the current state observation as input and outputs an action or a distribution over actions. Value functions that take both state observation and action as inputs and predict expected future reward can also be parameterized as neural networks. In order to optimize such neural networks, policy gradient methods (Mnih et al., 2016; Schulman et al., 2015, 2017a) and Q-learning algorithms (Mnih et al., 2015) capture the temporal structure of the sequential decision problem and decompose it into a supervised learning problem, guided by the immediate and discounted future reward from rollout data.

Unfortunately, when the reward signal becomes sparse or delayed, these RL algorithms may suffer from inferior performance and poor sample efficiency, mainly due to the scarcity of immediate supervision when training happens in a single-timestep manner. This is known as the temporal credit assignment problem (Sutton, 1984). For instance, consider the Atari game Montezuma's Revenge: a reward is received after collecting certain items or arriving at the final destination in the lowest level, while no reward is received while the agent is trying to reach these goals. The sparsity of the reward makes the neural network training very inefficient and also poses challenges in exploration. Many real-world problems have this form: rewards are either only sparsely available during an episode, or they are episodic, meaning that a non-zero reward is only provided at the end of the trajectory or episode.

In addition to policy-gradient and Q-learning, alternative algorithms, such as those for global or stochastic optimization, have recently been studied for policy search. These algorithms do not decompose trajectories into individual timesteps, but instead apply zeroth-order finite-difference gradient or gradient-free methods to learn policies based on the cumulative rewards of the entire trajectory. Usually, trajectory samples are first generated by running the current policy, and then the distribution of policy parameters is updated according to the trajectory returns. The cross-entropy method (CEM, Rubinstein & Kroese (2016)) and evolution strategies (Salimans et al., 2017) are two representative examples. Although their sample efficiency is often not comparable to that of policy gradient methods when dense rewards are available from the environment, they are more widely applicable in the sparse or episodic reward settings, as they are agnostic to task horizon and need only the trajectory-based cumulative reward.

Our contribution is the introduction of a new algorithm based on policy gradients, with the objective of achieving better performance than existing RL algorithms in sparse and episodic reward settings. Using the equivalence between the policy function and its state-action visitation distribution, we formulate policy optimization as a divergence minimization problem between the current policy's visitation and the distribution induced by a set of experience replay trajectories with high returns. We show that with the Jensen-Shannon divergence ($D_{JS}$), this divergence minimization problem can be reduced to a policy-gradient algorithm with shaped, dense rewards learned from these experience replays. This algorithm can be seen as self-imitation learning, in which the expert trajectories in the experience replays are self-generated by the agent during the course of learning, rather than taken from external demonstrations. We combine the divergence minimization objective with the standard RL objective, and empirically show that the shaped, dense rewards significantly help in sparse and episodic settings by improving credit assignment. Following that, we qualitatively analyze the shortcomings of the self-imitation algorithm. Our second contribution is the application of Stein variational policy gradient (SVPG) with the Jensen-Shannon kernel to simultaneously learn multiple diverse policies. We demonstrate the benefits of this addition to the self-imitation framework by considering difficult exploration tasks with sparse and deceptive rewards.

Related Works. Divergence minimization has been used in various policy learning algorithms. Relative Entropy Policy Search (REPS) (Peters et al., 2010) restricts the loss of information between policy updates by constraining the KL-divergence between the state-action distribution of old and new policy. Policy search can also be formulated as an EM problem, leading to several interesting algorithms, such as RWR (Peters & Schaal, 2007) and PoWER (Kober & Peters, 2009). Here the M-step minimizes a KL-divergence between trajectory distributions, leading to an update rule which resembles return-weighted imitation learning. Please refer to Deisenroth et al. (2013) for a comprehensive exposition. MATL (Wulfmeier et al., 2017) uses adversarial training to bring state occupancy from a real and simulated agent close to each other for efficient transfer learning. In Guided Policy Search (GPS, Levine & Koltun (2013)), a parameterized policy is trained by constraining the divergence between the current policy and a controller learnt via trajectory optimization.

Learning from Demonstrations (LfD). The objective in LfD, or imitation learning, is to train a control policy to produce a trajectory distribution similar to the demonstrator. Approaches for self-driving cars (Bojarski et al., 2016) and drone manipulation (Ross et al., 2013) have used human-expert data, along with the Behavioral Cloning algorithm, to learn good control policies. Deep Q-learning has been combined with human demonstrations to achieve performance gains in Atari (Hester et al., 2017) and robotics tasks (Večerík et al., 2017; Nair et al., 2017). Human data has also been used in the maximum entropy IRL framework to learn cost functions under which the demonstrations are optimal (Finn et al., 2016). Ho & Ermon (2016) use the same framework to derive an imitation-learning algorithm (GAIL) which is motivated by minimizing the divergence between the agent's rollouts and external expert demonstrations. Besides humans, other sources of expert supervision include planning-based approaches such as iLQR (Levine et al., 2016) and MCTS (Silver et al., 2016). Our algorithm departs from prior work in forgoing external supervision, and instead using the past experiences of the learner itself as demonstration data.

Exploration and Diversity in RL. Count-based exploration methods utilize state-action visitation counts $N(s,a)$, and award a bonus to rarely visited states (Strehl & Littman, 2008). In large state-spaces, approximation techniques (Tang et al., 2017) and estimation of pseudo-counts by learning density models (Bellemare et al., 2016; Fu et al., 2017) have been studied. Intrinsic motivation has been shown to aid exploration, for instance by using information gain (Houthooft et al., 2016) or prediction error (Stadie et al., 2015) as a bonus. Hindsight Experience Replay (Andrychowicz et al., 2017) adds additional goals (and corresponding rewards) to a Q-learning algorithm. We also obtain additional rewards, but from a discriminator trained on past agent experiences, to accelerate a policy-gradient algorithm. Prior work has looked at training a diverse ensemble of agents with good exploratory skills (Liu et al., 2017; Conti et al., 2017; Florensa et al., 2017). To enjoy the benefits of diversity, we incorporate a modification of SVPG (Liu et al., 2017) in our final algorithm.

In very recent work, Oh et al. (2018) propose exploiting past good trajectories to drive exploration. Their algorithm buffers $(s, a)$ and the corresponding return for each transition in rolled-out trajectories, and reuses them for training if the stored return value is higher than the current state-value estimate. Our approach presents a different objective for self-imitation based on divergence minimization. With this view, we learn shaped, dense rewards which are then used for policy optimization. We further improve the algorithm with SVPG. Reusing high-reward trajectories has also been explored for program synthesis and semantic parsing tasks (Liang et al., 2016, 2018; Abolafia et al., 2018).

2 Main Methods

We start with a brief introduction to RL in Section 2.1, and then introduce our main algorithm of self-imitation learning in Section 2.2. Section 2.3 further extends our main method to learn multiple diverse policies using Stein variational policy gradient with the Jensen-Shannon kernel.

2.1 Reinforcement Learning Background

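A typical RL setting involves an environment modeled as a Markov Decision Process with an unknown system dynamics model $p(s_{t+1}|s_t,a_t)$ and an initial state distribution $p_0(s_0)$. An agent interacts sequentially with the environment in discrete time-steps using a policy $\pi(a_t|s_t)$ which maps an observation $s_t \in \mathcal{S}$ to either a single action $a_t$ (deterministic policy), or a distribution over the action space $\mathcal{A}$ (stochastic policy). We consider the scenario of stochastic policies over high-dimensional, continuous state and action spaces. The agent receives a per-step reward $r_t(s_t,a_t) \in \mathbb{R}$, and the RL objective involves maximization of the expected discounted sum of rewards, $\eta(\pi_\theta) = \mathbb{E}_{p_0,p,\pi}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t,a_t)\big]$, where $\gamma \in (0,1]$ is the discount factor. The action-value function is $Q^\pi(s_t,a_t) = \mathbb{E}_{p_0,p,\pi}\big[\sum_{t'=t}^{\infty} \gamma^{t'-t} r(s_{t'},a_{t'})\big]$. We define the unnormalized $\gamma$-discounted state-visitation distribution for a policy $\pi$ by $\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \,|\, \pi)$, where $P(s_t = s \,|\, \pi)$ is the probability of being in state $s$ at time $t$, when following policy $\pi$ and starting state $s_0 \sim p_0$. The expected policy return $\eta(\pi_\theta)$ can then be written as $\mathbb{E}_{\rho_\pi(s,a)}[r(s,a)]$, where $\rho_\pi(s,a) = \rho_\pi(s)\pi(a|s)$ is the state-action visitation distribution. Using the policy gradient theorem (Sutton et al., 2000), we can get the direction of ascent $\nabla_\theta \eta(\pi_\theta) = \mathbb{E}_{\rho_\pi(s,a)}\big[\nabla_\theta \log \pi_\theta(a|s)\, Q^\pi(s,a)\big]$.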
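To make the action-value definition concrete, the following minimal sketch computes single-trajectory Monte Carlo estimates of the discounted return from each timestep. The function name `discounted_returns` and the toy reward lists are illustrative assumptions, not part of the original method.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return-to-go from each timestep of one trajectory.

    Entry t is sum_{t' >= t} gamma^(t' - t) * rewards[t'], a single-sample
    estimate of the action-value at step t.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    # Accumulate backwards so each step reuses the return of the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical toy trajectories with the same total reward.
dense = discounted_returns([1.0, 1.0, 1.0, 1.0], gamma=0.9)
episodic = discounted_returns([0.0, 0.0, 0.0, 4.0], gamma=0.9)
```

With the episodic reward list, every per-step estimate is driven by the single terminal reward, which is one way to see why per-timestep supervision becomes scarce in that setting.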

2.2 Policy Optimization as Divergence Minimization with Self-Imitation

Although the policy $\pi(a|s)$ is given as a conditional distribution, its behavior is better characterized by the corresponding state-action visitation distribution $\rho_\pi(s,a)$, which wraps the MDP dynamics and fully determines the expected return via $\eta(\pi) = \mathbb{E}_{\rho_\pi}[r(s,a)]$. Therefore, distance metrics on a policy $\pi$ should be defined with respect to the visitation distribution $\rho_\pi$, and policy search should be viewed as finding policies with good visitation distributions $\rho_\pi(s,a)$ that yield high reward. Suppose we have access to a good policy $\pi'$; then it is natural to consider finding a $\pi$ such that its visitation distribution $\rho_\pi$ matches $\rho_{\pi'}$. To do so, we can define a divergence measure $D(\rho_\pi, \rho_{\pi'})$ that captures the similarity between two distributions, and minimize this divergence for policy improvement.

Assume there exists an expert policy $\pi_E$, such that policy optimization can be framed as minimizing the divergence $\min_\pi D(\rho_\pi, \rho_{\pi_E})$, that is, finding a policy $\pi$ to imitate $\pi_E$. In practice, however, we do not have access to any real guiding expert policy. Instead, we can maintain a selected subset $M_E$ of highly-rewarded trajectories from the previous rollouts of policy $\pi$, and optimize the policy $\pi$ to minimize the divergence between $\rho_\pi$ and the empirical state-action pair distribution $\{(s_i,a_i)\}_{M_E}$:

$$\min_\pi \; D\big(\rho_\pi, \{(s_i,a_i)\}_{M_E}\big) \qquad (1)$$

Since it is not always possible to explicitly formulate $\rho_\pi$ even with the exact functional form of $\pi$, we generate rollouts from $\pi$ in the environment and obtain an empirical distribution of $\rho_\pi$. To measure the divergence between two empirical distributions, we use the Jensen-Shannon divergence, with the following variational form (up to a constant shift) as exploited in GANs (Goodfellow et al., 2014):

$$D_{JS}(\rho_\pi, \rho_{\pi_E}) = \max_{d_\pi,\, d_{\pi_E}} \; \mathbb{E}_{\rho_\pi}\left[\log \frac{d_\pi(s,a)}{d_\pi(s,a) + d_{\pi_E}(s,a)}\right] + \mathbb{E}_{\rho_{\pi_E}}\left[\log \frac{d_{\pi_E}(s,a)}{d_\pi(s,a) + d_{\pi_E}(s,a)}\right] \qquad (2)$$

where $d_\pi(s,a)$ and $d_{\pi_E}(s,a)$ are empirical density estimators of $\rho_\pi$ and $\rho_{\pi_E}$, respectively. Under certain assumptions, we can obtain an approximate gradient of $D_{JS}$ w.r.t. the policy parameters, thus enabling us to optimize the policy.
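The variational form can be checked numerically on a toy problem. The sketch below assumes normalized distributions over a small discrete support (the paper works with unnormalized visitation densities over continuous spaces), and the function names are hypothetical. It evaluates the GAN-style objective for given density estimators and compares its value at the exact densities against the closed-form Jensen-Shannon divergence, where the two differ by the constant shift $\log 4$ after scaling by 2.

```python
import numpy as np

def js_divergence(p, q):
    """Closed-form Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with zero mass contribute nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def variational_objective(p, q, d_p, d_q):
    """GAN-style objective for strictly positive density estimators d_p, d_q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    d_p, d_q = np.asarray(d_p, dtype=float), np.asarray(d_q, dtype=float)
    total = d_p + d_q
    return float(np.sum(p * np.log(d_p / total)) + np.sum(q * np.log(d_q / total)))

# Hypothetical toy distributions standing in for learner and expert visitations.
learner = [0.7, 0.2, 0.1]
expert = [0.1, 0.3, 0.6]
at_optimum = variational_objective(learner, expert, learner, expert)
shifted_js = 2.0 * js_divergence(learner, expert) - np.log(4.0)
```

Any other choice of estimators, for example uniform ones, gives a value no larger than `at_optimum`, which is why the inner maximization recovers the divergence.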

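Gradient Approximation: Let $\rho_\pi(s,a)$ and $\rho_{\pi'}(s,a)$ be the state-action visitation distributions induced by two policies $\pi$ and $\pi'$ respectively. Let $d_\pi$ and $d_{\pi'}$ be the surrogates to $\rho_\pi$ and $\rho_{\pi'}$, respectively, obtained by solving Equation 2. Then, if the policy $\pi$ is parameterized by $\theta$, the gradient of $D_{JS}(\rho_\pi, \rho_{\pi'})$ with respect to policy parameters ($\theta$) can be approximated as:

$$\nabla_\theta D_{JS}(\rho_\pi, \rho_{\pi'}) \approx \mathbb{E}_{\rho_\pi(s,a)}\big[\nabla_\theta \log \pi_\theta(a|s)\, \tilde{Q}^\pi(s,a)\big], \quad \tilde{Q}^\pi(s_t,a_t) = \mathbb{E}_{\rho_\pi(s,a)}\left[\sum_{t'=t}^{\infty} \gamma^{t'-t} \log \frac{d_\pi(s_{t'},a_{t'})}{d_\pi(s_{t'},a_{t'}) + d_{\pi'}(s_{t'},a_{t'})}\right] \qquad (3)$$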
The derivation of the approximation and the underlying assumptions are in Appendix 5.1. Next, we introduce a simple and inexpensive approach to construct the replay memory $M_E$ using high-return past experiences during training. In this way, $\pi_E$ can be seen as a mixture of deterministic policies, each representing a delta point mass distribution in the trajectory space or a finite discrete visitation distribution of state-action pairs. At each iteration, we apply the current policy $\pi_\theta$ to sample a batch of trajectories $\{\tau\}$. We aim to include in $M_E$ the top-$k$ trajectories (or trajectories with returns above a threshold) generated thus far during the training process. For this, we use a priority-queue list for $M_E$ which keeps the trajectories sorted according to the total trajectory reward. The reward for each newly sampled trajectory in $\{\tau\}$ is compared with the current threshold of the priority-queue, updating $M_E$ accordingly. The frequency of updates is impacted by the exploration capabilities of the agent and the stochasticity in the environment. We find that simply sampling noisy actions from Gaussian policies is sufficient for several locomotion tasks (Section 3). To handle more challenging environments, in the next sub-section, we augment our policy optimization procedure to explicitly enhance exploration and produce an ensemble of diverse policies.
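The priority-queue replay memory described above can be sketched with Python's standard `heapq` module. The class name `TopKReplayMemory` and its methods are assumptions made for illustration; the original implementation may differ in how it stores trajectories and breaks ties.

```python
import heapq
import itertools

class TopKReplayMemory:
    """Keeps the k highest-return trajectories seen so far (hypothetical sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap of (total_return, insertion_id, trajectory)
        self._ids = itertools.count()  # tie-breaker so trajectories are never compared

    def threshold(self):
        """Return a new trajectory must exceed to enter a full memory."""
        if len(self._heap) < self.capacity:
            return float("-inf")
        return self._heap[0][0]

    def add(self, trajectory, total_return):
        """Insert the trajectory if it beats the threshold; report whether it was kept."""
        entry = (total_return, next(self._ids), trajectory)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        if total_return > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)  # evict the current worst
            return True
        return False

    def trajectories(self):
        """Stored trajectories, best return first."""
        return [traj for _, _, traj in sorted(self._heap, reverse=True)]
```

A min-heap keeps the lowest stored return at the root, so the threshold comparison for each new rollout is constant time and an accepted insertion costs O(log k).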

In the usual imitation learning framework, expert demonstrations of trajectories from external sources are available as the empirical distribution of $\rho_{\pi_E}$ of an expert policy $\pi_E$. In our approach, since the agent learns by treating its own good past experiences as the expert, we can view the algorithm as self-imitation learning from experience replay. As noted in Equation 3, the gradient estimator of $D_{JS}$ has a form similar to policy gradients, except that the true reward function is replaced with a per-timestep reward defined as $\log\big(d_\pi(s,a) / (d_\pi(s,a) + d_{\pi_E}(s,a))\big)$. Therefore, it is possible to interpolate the gradient of $D_{JS}$ and the standard policy gradient. We highlight the benefit of this interpolation shortly. The net gradient on the policy parameters is:

$$\nabla_\theta \eta(\pi_\theta) = (1-\nu)\, \mathbb{E}_{\rho_\pi(s,a)}\big[\nabla_\theta \log \pi_\theta(a|s)\, Q^r(s,a)\big] - \nu\, \nabla_\theta D_{JS}(\rho_\pi, \rho_{\pi_E}) \qquad (4)$$

where $Q^r$ is the $Q$ function with true rewards, and $\pi_E$ is the mixture policy represented by the samples in $M_E$. Let $r^\phi(s,a) = \frac{d_\pi(s,a)}{d_\pi(s,a) + d_{\pi_E}(s,a)}$. $r^\phi(s,a)$ can be computed using parameterized networks for densities $d_\pi$ and $d_{\pi_E}$, which are trained by solving the optimization (Equation 2) using the current policy rollouts and $M_E$, where $\phi$ includes the parameters for $d_\pi$ and $d_{\pi_E}$. Using Equation 3, the interpolated gradient can be further simplified to:

$$\nabla_\theta \eta(\pi_\theta) = \mathbb{E}_{\rho_\pi(s,a)}\Big[\nabla_\theta \log \pi_\theta(a|s)\, \big((1-\nu)\, Q^r(s,a) + \nu\, Q^{r^\phi}(s,a)\big)\Big] \qquad (5)$$

where $Q^{r^\phi}(s_t,a_t) = -\mathbb{E}_{\rho_\pi(s,a)}\big[\sum_{t'=t}^{\infty} \gamma^{t'-t} \log r^\phi(s_{t'},a_{t'})\big]$ is the $Q$ function calculated using $-\log r^\phi(s,a)$ as the reward. This reward is high in the regions of the $\mathcal{S} \times \mathcal{A}$ space frequented more by the expert than the learner, and low in regions visited more by the learner than the expert. The effective $Q$ in Equation 5 is therefore an interpolation between $Q^r$, obtained with true environment rewards, and $Q^{r^\phi}$, obtained with rewards which are implicitly shaped to guide the learner towards expert behavior. In environments with sparse or deceptive rewards, where the signal from $Q^r$ is weak or sub-optimal, a higher weight on $Q^{r^\phi}$ enables successful learning by imitation. We show this empirically in our experiments. We further find that even in cases with dense environment rewards, the two gradient components can be successfully combined for policy optimization. The complete algorithm for self-imitation is outlined in Appendix 5.2 (Algorithm 1).
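The interpolation between environment and shaped returns can be sketched for a single trajectory as follows. This is an illustration under stated assumptions: the shaped reward is taken to be the negative log of the ratio of the learner's density estimate to the sum of learner and expert estimates (so it is high where the expert visits more), the density values are supplied as plain arrays rather than produced by trained networks, and all function names are hypothetical.

```python
import numpy as np

def shaped_rewards(d_learner, d_expert):
    """Per-step shaped reward: -log( d_learner / (d_learner + d_expert) )."""
    d_learner = np.asarray(d_learner, dtype=float)
    d_expert = np.asarray(d_expert, dtype=float)
    return -np.log(d_learner / (d_learner + d_expert))

def discounted_sum(rewards, gamma):
    """Return-to-go at every step of one trajectory."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def interpolated_q(env_rewards, d_learner, d_expert, nu, gamma):
    """(1 - nu) * Q from environment rewards + nu * Q from shaped rewards."""
    q_env = discounted_sum(np.asarray(env_rewards, dtype=float), gamma)
    q_shaped = discounted_sum(shaped_rewards(d_learner, d_expert), gamma)
    return (1.0 - nu) * q_env + nu * q_shaped

# Hypothetical episodic trajectory: reward only at the last step, while the
# density estimates say the expert favours the later state-action pairs.
q = interpolated_q([0.0, 0.0, 3.0], [0.9, 0.5, 0.1], [0.1, 0.5, 0.9], nu=0.8, gamma=1.0)
```

Setting `nu=0` recovers the plain environment return, while larger values give every timestep a non-zero learning signal even when the environment reward arrives only at the end.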

Limitations of self-imitation. We now elucidate some shortcomings of the self-imitation approach. Since the replay memory $M_E$ is only constructed from the past training rollouts, the quality of the trajectories in $M_E$ hinges on good exploration by the agent. Consider a maze environment where the robot is only rewarded when it arrives at a goal $G$ placed in a far-off corner. Unless the robot reaches $G$ at least once, the trajectories in $M_E$ always have a total reward of zero, and the learning signal from $Q^{r^\phi}$ is not useful. Secondly, self-imitation can lead to sub-optimal policies when there are local minima in the policy optimization landscape; for example, assume the maze has a second goal $G'$ in the opposite direction of $G$, but with a much smaller reward. With simple exploration, the agent may fill $M_E$ with below-par trajectories leading to $G'$, and the reinforcement from $Q^{r^\phi}$ would drive it further towards $G'$. Thirdly, stochasticity in the environment may make it difficult to recover the optimal policy just by imitating the past top-$k$ rollouts. For instance, in a 2-armed bandit problem with reward distributions Bernoulli($p$) and Bernoulli($p+\epsilon$), rollouts from both arms get conflated in $M_E$ during training with high probability, making it hard to imitate the action of picking the arm with the higher expected reward.

We propose to overcome these pitfalls by training an ensemble of self-imitating agents, which are explicitly encouraged to visit different, non-overlapping regions of the state-space. This helps to discover useful rewards in sparse settings, avoids deceptive reward traps, and in environments with reward-stochasticity like the 2-armed bandit, increases the probability of the optimal policy being present in the final trained ensemble. We detail the enhancements next.

2.3 Improving Exploration with Stein Variational Gradient

One approach to achieve better exploration in challenging cases like the above is to simultaneously learn multiple diverse policies and force them to explore different parts of the high-dimensional space. This can be achieved based on the recent work by Liu et al. (2017) on Stein variational policy gradient (SVPG). The idea of SVPG is to find an optimal distribution $q(\theta)$ over the policy parameters which maximizes the expected policy returns, along with an entropy regularization that enforces diversity on the parameter space, i.e.

$$\max_q \; \mathbb{E}_{\theta \sim q}[\eta(\pi_\theta)] + \alpha H(q)$$

Without a parametric assumption on $q$, this poses a challenging functional optimization problem. Stein variational gradient descent (SVGD, Liu & Wang (2016)) provides an efficient solution to this problem by approximating $q$ with a delta measure $q = \sum_{i=1}^{n} \delta_{\theta_i} / n$, where $\{\theta_i\}_{i=1}^{n}$ is an ensemble of policies, and iteratively updating $\{\theta_i\}$ with $\theta_i \leftarrow \theta_i + \epsilon\, \Delta\theta_i$, where

$$\Delta\theta_i = \frac{1}{n} \sum_{j=1}^{n} \Big[\nabla_{\theta_j} \eta(\pi_{\theta_j})\, k(\theta_j, \theta_i) + \alpha\, \nabla_{\theta_j} k(\theta_j, \theta_i)\Big] \qquad (6)$$

where $k(\theta_j, \theta_i)$ is a positive definite kernel function. The first term in $\Delta\theta_i$ moves the policy to regions with high expected return (exploitation), while the second term creates a repulsion pressure between policies in the ensemble and encourages diversity (exploration). The choice of kernel is critical. Liu et al. (2017) used a simple Gaussian RBF kernel $k(\theta_j, \theta_i) = \exp(-\|\theta_j - \theta_i\|_2^2 / h)$, with the bandwidth $h$ dynamically adapted. This, however, assumes a flat Euclidean distance between $\theta_j$ and $\theta_i$, ignoring the structure of the entities defined by them, which are probability distributions. A statistical distance, such as $D_{JS}$, serves as a better metric for comparing policies (Amari, 1998; Kakade, 2002). Motivated by this, we propose to improve SVPG using the JS kernel $k(\theta_j, \theta_i) = \exp\big(-D_{JS}(\rho_{\pi_{\theta_j}}, \rho_{\pi_{\theta_i}}) / T\big)$, where $\rho_{\pi_\theta}(s,a)$ is the state-action visitation distribution obtained by running policy $\pi_\theta$, and $T$ is the temperature. The second exploration term in SVPG involves the gradient of the kernel w.r.t. policy parameters. With the JS kernel, this requires estimating the gradient of $D_{JS}$, which, as shown in Equation 3, can be obtained using policy gradients with an appropriately trained reward function.

Our full algorithm is summarized in Appendix 5.3 (Algorithm 2). In each iteration, we apply the SVPG gradient to each of the policies, where the $\nabla_\theta \eta(\pi_\theta)$ in Equation 6 is the interpolated gradient from self-imitation (Equation 5). We also utilize state-value function networks as baselines to reduce the variance in sampled policy-gradients.
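The structure of the SVGD-style ensemble update can be sketched in a few lines of NumPy. The update function below takes the kernel matrix and its gradients as inputs, so it is agnostic to the kernel choice. For a runnable demonstration it is paired with the Gaussian RBF kernel on parameters, whose gradient is analytic; this is a stand-in, not the JS kernel, whose gradient would have to be estimated from rollouts. Function names, array layouts, and the toy values are assumptions.

```python
import numpy as np

def svpg_update(policy_grads, kernel, kernel_grads, alpha):
    """Ensemble update direction for n policies with d parameters each.

    policy_grads: (n, d) array, row j is the return gradient of policy j.
    kernel:       (n, n) array, kernel[j, i] = k(theta_j, theta_i).
    kernel_grads: (n, n, d) array, kernel_grads[j, i] = d k(theta_j, theta_i) / d theta_j.
    alpha:        weight on the repulsion (exploration) term.
    """
    n = policy_grads.shape[0]
    exploit = kernel.T @ policy_grads   # row i: sum_j k(theta_j, theta_i) * grad_j
    explore = kernel_grads.sum(axis=0)  # row i: sum_j d k(theta_j, theta_i) / d theta_j
    return (exploit + alpha * explore) / n

def rbf_kernel(thetas, bandwidth):
    """Gaussian RBF kernel on parameters and its gradient w.r.t. the first argument."""
    diffs = thetas[:, None, :] - thetas[None, :, :]  # diffs[j, i] = theta_j - theta_i
    sq_dist = np.sum(diffs ** 2, axis=-1)
    k = np.exp(-sq_dist / bandwidth)
    grads = (-2.0 / bandwidth) * diffs * k[:, :, None]
    return k, grads

# Two hypothetical one-parameter policies with no return signal at all:
# the update is pure repulsion and pushes them apart.
thetas = np.array([[0.0], [1.0]])
k, grad_k = rbf_kernel(thetas, bandwidth=1.0)
delta = svpg_update(np.zeros((2, 1)), k, grad_k, alpha=1.0)
```

In this toy case `delta` is negative for the first policy and positive for the second, which is the repulsion effect that the second term of the update is meant to provide.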

3 Experiments

Our goal in this section is to answer the following questions: 1) How does self-imitation fare against standard policy gradients under various reward distributions from the environment, namely episodic, noisy and dense? 2) How far does the SVPG exploration go in overcoming the limitations of self-imitation, such as susceptibility to local minima?

We benchmark high-dimensional, continuous-control locomotion tasks based on the MuJoCo physics simulator by extending the OpenAI Baselines (Dhariwal et al., 2017) framework. Our control policies ($\pi_\theta$) are modeled as unimodal Gaussians. All feed-forward networks have two layers of 64 hidden units each with tanh non-linearity. For policy-gradient, we use the clipped-surrogate based PPO algorithm (Schulman et al., 2017b). Further implementation details are in the Appendix.

Figure 1: Learning curves for PPO and Self-Imitation on tasks with episodic rewards. Mean and standard-deviation over 5 random seeds are plotted.

| Task | Episodic: SI | Episodic: PPO | Episodic: CEM | Episodic: ES | Noisy ($p_m = 0.9$): SI | Noisy ($p_m = 0.9$): PPO | Noisy ($p_m = 0.5$): SI | Noisy ($p_m = 0.5$): PPO | Dense (Gym default): SI | Dense (Gym default): PPO |
|---|---|---|---|---|---|---|---|---|---|---|
| Walker | 2996 | 252 | 205 | 1200 | 2276 | 2047 | 3049 | 3364 | 3263 | 3401 |
| Humanoid | 3602 | 532 | 426 | - | 4136 | 1159 | 4296 | 3145 | 3339 | 4149 |
| H-Standup ( 10) | 18.1 | 4.4 | 9.6 | - | 14.3 | 11.4 | 16.3 | 9.8 | 17.2 | 10 |
| Hopper | 2618 | 354 | 97 | 1900 | 2381 | 2264 | 2137 | 2132 | 2700 | 2252 |
| Swimmer | 173 | 21 | 17 | - | 52 | 37 | 127 | 56 | 106 | 68 |
| Invd.Pendulum | 8668 | 344 | 86 | 9000 | 8744 | 8826 | 8926 | 8968 | 8989 | 8694 |

Table 1: Performance of PPO and Self-Imitation (SI) on tasks with episodic rewards, noisy rewards with masking probability $p_m$ (each reward suppressed with 90% or 50% probability), and dense rewards. All runs use 5M timesteps of interaction with the environment. ES performance at 5M timesteps is taken from Salimans et al. (2017). A missing entry (-) denotes that we were unable to obtain the 5M timestep performance from the paper.

3.1 Self-Imitation with Different Reward Distributions

We evaluate the performance of self-imitation with a single agent in this sub-section; combination with SVPG exploration for multiple agents is discussed in the next. We consider the locomotion tasks in OpenAI Gym under 3 separate reward distributions. Dense refers to the default reward function in Gym, which provides a reward for each simulation timestep. In the episodic reward setting, rather than providing $r(s_t,a_t)$ at each timestep of an episode, we provide $\sum_t r(s_t,a_t)$ at the last timestep of the episode, and zero reward at other timesteps. This is the case for many practical settings where the reward function is hard to design, but scoring each trajectory, possibly by a human (Christiano et al., 2017), is feasible. In the noisy reward setting, we probabilistically mask out each per-timestep reward $r(s_t,a_t)$ in an episode. Reward masking is done independently for every new episode, and therefore the agent receives non-zero feedback at different, albeit only a few, timesteps in different episodes. The probability of masking out or suppressing the rewards is denoted by $p_m$.
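The episodic and noisy reward settings can be expressed as simple transformations of a per-step reward sequence. The sketch below applies them to a finished episode for clarity; an actual experiment would more likely wrap the environment so that rewards are transformed as they are emitted. The function names are assumptions.

```python
import numpy as np

def to_episodic(rewards):
    """Move the whole episode's reward to the final timestep; zero elsewhere."""
    rewards = np.asarray(rewards, dtype=float)
    out = np.zeros_like(rewards)
    out[-1] = rewards.sum()
    return out

def to_noisy(rewards, mask_prob, rng):
    """Suppress each per-step reward independently with probability mask_prob."""
    rewards = np.asarray(rewards, dtype=float)
    keep = rng.random(rewards.shape) >= mask_prob
    return rewards * keep

# Hypothetical dense episode; a fresh mask is drawn for every call, so
# different episodes keep their rewards at different timesteps.
rng = np.random.default_rng(0)
dense = [1.0, 2.0, 3.0]
episodic = to_episodic(dense)
noisy = to_noisy(dense, mask_prob=0.9, rng=rng)
```

Both transformations leave the ranking of whole trajectories largely intact while removing most of the per-step signal, which is the regime the shaped rewards are meant to address.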

In Figure 1, we plot the learning curves on three tasks with episodic rewards. Recall that $\nu$ is the hyper-parameter controlling the weight distribution between gradients with environment rewards and the gradients with shaped reward from $r^\phi$ (Equation 5). The baseline PPO agents use $\nu = 0$, meaning that the entire learning signal comes from the environment. We compare them with self-imitating (SI) agents using a constant, non-zero value of $\nu$. The capacity of $M_E$ is fixed at 10 trajectories. We did not observe our method to be particularly sensitive to the choice of $\nu$ and the capacity value; other settings work equally well. Further ablation on these two hyper-parameters can be found in the Appendix.

In Figure 1, we see that the PPO agents are unable to make any tangible progress on these tasks with episodic rewards, possibly due to difficulty in credit assignment: the lumped rewards at the end of the episode cannot be properly attributed to the individual state-action pairs during the episode. In the case of Self-Imitation, the algorithm has access to the shaped rewards for each timestep, derived from the high-return trajectories in $M_E$. This makes credit assignment easier, leading to successful learning even for very high-dimensional control tasks such as Humanoid.

Table 1 summarizes the final performance, averaged over 5 runs with random seeds, under the various reward settings. For the noisy rewards, we compare performance with two different reward masking values: suppressing each reward with 90% probability ($p_m = 0.9$), and with 50% probability ($p_m = 0.5$). The density of rewards increases across the reward settings from left to right in Table 1. We find that SI agents ($\nu > 0$) achieve a higher average score than the baseline PPO agents ($\nu = 0$) in the majority of the tasks for all the settings. This indicates that not only does self-imitation vastly help when the environment rewards are scant, it can also readily be combined with standard policy gradients via interpolation, for successful learning across reward settings. For completeness, we include the performance of CEM and ES, since these algorithms depend only on the total trajectory rewards and do not exploit the temporal structure. CEM performs poorly in most cases. ES, while able to solve the tasks, is sample-inefficient. We include ES performance from Salimans et al. (2017) after 5M timesteps of training for a fair comparison with our algorithm.

(a) SI-independent state-density
(b) SI-interact-JS state-density
(c) SI-independent kernel matrix
(d) SI-interact-JS kernel matrix
Figure 2: SI-independent and SI-interact-JS agents on Maze environment.

3.2 Characterizing Ensemble of Diverse Self-Imitating Policies

We now conduct experiments to show how self-imitation can lead to sub-optimal policies in certain cases, and how the SVPG objective, which trains an ensemble with an explicit repulsion between policies, can improve performance.

2D-Navigation. Consider a simple Maze environment where the start location of the agent (blue particle) is shown in the figure on the right, along with two regions: the red region is closer to the agent's starting location but has a per-timestep reward of only 1 point if the agent hovers over it; the green region is on the other side of the wall but has a per-timestep reward of 10 points. We run 8 independent, non-interacting, self-imitating agents (each using a fixed $\nu$) on this task. This ensemble is denoted as SI-independent. Figure 2(a) plots the state-visitation density for SI-independent after training, from which it is evident that the agents get trapped in the local minimum. The red region is relatively easily explored, and trajectories leading to it fill $M_E$, causing sub-optimal imitation. We contrast this with an instantiation of our full algorithm, which is referred to as SI-interact-JS. It is composed of 8 self-imitating agents which share information for gradient calculation with the SVPG objective (Equation 6). The temperature $T$ is held constant, and the weight on the exploration-facilitating repulsion term ($\alpha$) is linearly decayed over time. Figure 2(b) depicts the state-visitation density for this ensemble. SI-interact-JS explores wider portions of the maze, with multiple agents reaching the green zone of high reward.

Figures 2(c) and 2(d) show the kernel matrices for the two ensembles after training. Cell $(i,j)$ in the matrix corresponds to the kernel value $k(\theta_i, \theta_j) = \exp\big(-D_{JS}(\rho_i, \rho_j)/T\big)$. For SI-independent, many darker cells indicate that policies are closer (low JS). For SI-interact-JS, which explicitly tries to decrease $k(\theta_i, \theta_j)$, the cells are noticeably lighter, indicating dissimilar policies (high JS). Behavior of PPO-independent ($\nu = 0$) is similar to SI-independent for the Maze task.
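A kernel matrix of this kind can be sketched for an ensemble whose visitation distributions are approximated by normalized histograms over a discretized state space. The discretization, the function names, and the toy histograms are assumptions for illustration; the original work estimates the divergence from rollouts with learned density networks.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # bins with zero mass contribute nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_kernel_matrix(visitations, temperature):
    """Symmetric matrix with entries exp(-JS(rho_i, rho_j) / temperature)."""
    n = len(visitations)
    kernel = np.ones((n, n))  # JS(rho, rho) = 0, so the diagonal is exactly 1
    for i in range(n):
        for j in range(i + 1, n):
            value = np.exp(-js_divergence(visitations[i], visitations[j]) / temperature)
            kernel[i, j] = kernel[j, i] = value
    return kernel

# Hypothetical ensemble: two agents stuck in the same region, one elsewhere.
histograms = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
K = js_kernel_matrix(histograms, temperature=1.0)
```

Entries close to 1 mark pairs of policies with nearly identical visitation, and entries near the lower bound $\exp(-\log 2 / T)$ mark pairs that visit disjoint regions.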

Figure 3: Learning curves for various ensembles on sparse locomotion tasks. Mean and standard-deviation over 3 random seeds are plotted.

Locomotion. To explore the limitations of self-imitation in harder exploration problems in high-dimensional, continuous state-action spaces, we modify 3 MuJoCo tasks as follows – SparseHalfCheetah, SparseHopper and SparseAnt yield a forward velocity reward only when the center-of-mass of the corresponding bot is beyond a certain threshold distance. At all timesteps, there is an energy penalty to move the joints, and a survival bonus for bots that can fall over causing premature episode termination (Hopper, Ant). Figure 3 plots the performance of PPO-independent, SI-independent, SI-interact-JS and SI-interact-RBF (which uses RBF-kernel from  Liu et al. (2017) instead of the JS-kernel) on the tasks. Each of these 4 algorithms is an ensemble of 8 agents using the same amount of simulation timesteps. The results are averaged over 3 separate runs, where for each run, the best agent from the ensemble after training is selected.

The SI-independent agents rely solely on action-space noise from the Gaussian policy parameterization to find high-return trajectories, which are added to $M_E$ as demonstrations. This is mostly inadequate or slow for sparse environments. Indeed, we find that all demonstrations in $M_E$ for SparseHopper show the bot standing upright (or tilted) and gathering only the survival bonus, as action-space noise alone cannot discover hopping behavior. Similarly, for SparseHalfCheetah, $M_E$ has trajectories with the bot haphazardly moving back and forth. On the other hand, in SI-interact-JS, the repulsion term encourages the agents to be diverse and explore the state-space much more effectively. This leads to faster discovery of quality trajectories, which then provide good reinforcement through self-imitation, leading to a higher overall score. SI-interact-RBF does not perform as well, suggesting that the JS-kernel is more effective for exploration. PPO-independent gets stuck in a local optimum for SparseHopper and SparseHalfCheetah: the bots stand still after training, avoiding the energy penalty. For SparseAnt, the bot can cross our preset distance threshold using only action-space noise, but learning is slow due to naïve exploration.

4 Conclusion and Future Work

We approached policy optimization for deep RL from the perspective of JS-divergence minimization between state-action distributions of a policy and its own past good rollouts. This leads to a self-imitation algorithm which improves upon standard policy-gradient methods via the addition of a simple gradient term obtained from implicitly shaped dense rewards. We observe substantial performance gains over the baseline for high-dimensional, continuous-control tasks with episodic and noisy rewards. Further, we discuss the potential limitations of the self-imitation approach, and propose ensemble training with the SVPG objective and JS-kernel as a solution. Through experimentation, we demonstrate the benefits of a self-imitating, diverse ensemble for efficient exploration and avoidance of local minima.

An interesting direction for future work is to improve our algorithm using the rich literature on exploration in RL. Since ours is a population-based exploration method, techniques for efficient single-agent exploration can be readily combined with it. For instance, parameter-space noise or curiosity-driven exploration can be applied to each agent in the SI-interact-JS ensemble. Secondly, our algorithm for training diverse agents could be used more generally. In Appendix 5.6, we show preliminary results for two cases: a) hierarchical RL, where a diverse group of Swimmer bots is trained for downstream use in a complex Swimming+Gathering task; b) RL without environment rewards, relying solely on diversity as the optimization objective. Further investigation is left for future work.


5 Appendix

5.1 Derivation of Gradient Approximation

Let and be the exact state-action densities for the current policy () and the expert, respectively. Therefore, by definition, we have (up to a constant shift):

Now, is a local surrogate to . By approximating it to be constant in an ball neighborhood around , we get the following after taking gradient of the above equation w.r.t :

The last step follows directly from the policy gradient theorem (Sutton et al., 2000). Since we do not have the exact densities and , we substitute them with the optimized density estimators and from the maximization in Equation 2 for computing . This gives us the gradient approximation mentioned in Section 2.2. A similar approximation is also used by Ho & Ermon (2016) for Generative Adversarial Imitation Learning (GAIL).
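The steps of this derivation can be sketched as follows. The notation is an assumption made for illustration: $\rho_{\pi_\theta}$ and $\rho_E$ denote the state-action visitation distributions of the policy and the expert, $d^*_\pi$ and $d^*_E$ the exact densities, and $r^*$ the implied per-step reward. Constant shifts and positive scale factors are dropped.

```latex
\begin{align*}
D_{JS}(\rho_{\pi_\theta}, \rho_E)
  &\simeq \mathbb{E}_{(s,a)\sim\rho_{\pi_\theta}}\!\left[\log\frac{d^*_\pi(s,a)}{d^*_\pi(s,a)+d^*_E(s,a)}\right]
   + \mathbb{E}_{(s,a)\sim\rho_E}\!\left[\log\frac{d^*_E(s,a)}{d^*_\pi(s,a)+d^*_E(s,a)}\right] \\
% Hold d^*_\pi fixed near \theta; the second expectation then has no dependence on \theta.
\nabla_\theta D_{JS}(\rho_{\pi_\theta}, \rho_E)
  &\approx \nabla_\theta\, \mathbb{E}_{(s,a)\sim\rho_{\pi_\theta}}\!\left[r^*(s,a)\right],
  \qquad r^*(s,a) = \log\frac{d^*_\pi(s,a)}{d^*_\pi(s,a)+d^*_E(s,a)} \\
  &= \mathbb{E}_{(s,a)\sim\rho_{\pi_\theta}}\!\left[\nabla_\theta \log \pi_\theta(a\,|\,s)\; Q^{r^*}(s,a)\right]
\end{align*}
```

Here $Q^{r^*}$ is the expected discounted sum of $r^*$ along trajectories of $\pi_\theta$. Minimizing the divergence therefore corresponds to a policy-gradient step with reward $-r^*$.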

5.2 Algorithm for Self-Imitation

Policy parameters
Discriminator parameters
Environment reward

1 initial parameters
2 empty replay memory
3 for each iteration do
4       Generate batch of trajectories with two rewards for each transition: and
5       Update using priority queue threshold
       /* Update policy */
6       for each minibatch do
7             Calculate with PPO objective using reward
8             Calculate with PPO objective using reward
9             Update with using ADAM
11       end for
      /* Update self-imitation discriminator */
12       for each epoch do
13             Sample mini-batch of (s,a) from
14             Sample mini-batch of (s,a) from
15             Update with log-loss objective using
16       end for
18 end for
Algorithm 1
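The replay-memory update with a priority-queue threshold can be sketched as below. This is a minimal sketch under stated assumptions: trajectories are ranked by their episode return, and the class and method names (`TrajectoryReplayMemory`, `threshold`, `add`) are hypothetical rather than taken from the authors' implementation.

```python
import heapq
import itertools


class TrajectoryReplayMemory:
    """Keeps the `capacity` highest-return trajectories seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []                    # min-heap of (return, tiebreak, trajectory)
        self._counter = itertools.count()  # tiebreak so trajectories are never compared

    def threshold(self):
        """Return a new trajectory must exceed once the memory is full."""
        if len(self._heap) < self.capacity:
            return float("-inf")
        return self._heap[0][0]

    def add(self, trajectory, episode_return):
        """Store the trajectory if it beats the threshold; report whether it was stored."""
        if episode_return <= self.threshold():
            return False
        entry = (episode_return, next(self._counter), trajectory)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        else:
            heapq.heapreplace(self._heap, entry)  # evict the current worst trajectory
        return True

    def returns(self):
        """Sorted returns of the stored trajectories."""
        return sorted(r for r, _, _ in self._heap)
```

A min-heap keeps the worst stored return at the root, so both the threshold check and the eviction are cheap.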

5.3 Algorithm for Self-Imitating Diverse Policies

Policy parameters for rank
Self-imitation discriminator parameters for rank
Empirical density network parameters for rank

/* This is run for every rank */
1 some initial distributions
2 empty replay memory local to rank
3 for each iteration do
4       Generate batch of trajectories
5       Update using priority queue threshold
       /* Update policy */
6       for each minibatch do
7             Calculate using self-imitation (as in Algorithm 1)
8             MPI send: to other ranks
9             MPI recv: from other ranks
10             Calculate using
11               Use and lines 8, 10, 11 in SVPG to get
12               Update with using ADAM
14        end for
       /* Update self-imitation discriminator */
15        for each epoch do
16               Sample mini-batch of (s,a) from
17               Sample mini-batch of (s,a) from
18               Update with log-loss objective using
19        end for
       /* Update state-action visitation network */
20        MPI send: to other ranks
21        MPI send: to other ranks
22        MPI recv: from other ranks
23        MPI recv: from other ranks
24        Update with log-loss objective using , ,
25        Update
27 end for
Algorithm 2

5.4 Ablation Studies

We show the sensitivity of self-imitation to the weight given to the self-imitation gradient and to the capacity of the replay memory. The experiments in this subsection are done on Humanoid and Hopper tasks with episodic rewards. The tables show the average performance over 5 random seeds. For the ablation on the weight, the capacity is fixed at 10; for the ablation on the capacity, the weight is fixed at 0.8. With episodic rewards, a higher weight helps boost performance since the RL signal from the environment is weak. With the weight at 0.8, there isn’t a single best choice for the capacity, though all capacity values give better results than baseline PPO (weight of zero).

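Ablation on the self-imitation weight (capacity fixed at 10), one row per weight value:
Humanoid Hopper
532 354
395 481
810 645
3602 2618
3891 2633

Ablation on the replay-memory capacity (weight fixed at 0.8), one row per capacity value:
Humanoid Hopper
2861 1736
2946 2415
3602 2618
2667 1624
4159 2301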
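The role of the self-imitation weight can be pictured as a convex combination of two gradient estimates. The linear form below is an assumption for illustration, and the function and argument names are hypothetical; a weight of zero leaves only the environment-reward gradient, which corresponds to the PPO baseline.

```python
def interpolate_gradients(env_grad, si_grad, weight):
    """Blend environment-reward and self-imitation gradients (illustrative sketch).

    `weight` = 0 uses only the environment gradient; `weight` = 1 uses only
    the self-imitation gradient.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [(1.0 - weight) * g_env + weight * g_si
            for g_env, g_si in zip(env_grad, si_grad)]
```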

5.5 Hyperparameters

  • Horizon (T) = 1000 (locomotion), 250 (Maze), 5000 (Swimming+Gathering)

  • Discount (γ) = 0.99

  • GAE parameter (λ) = 0.95

  • PPO internal epochs = 5

  • PPO learning rate = 1e-4

  • PPO mini-batch = 64
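For reference, the discount and GAE parameter listed above enter the advantage computation as in the following sketch of generalized advantage estimation. The function name and argument layout are assumptions; only the two default values come from the list above.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory segment.

    rewards[t] and values[t] are the reward and state-value estimate at step t;
    last_value is the value estimate of the state after the final step
    (0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running              # exponentially weighted sum
        advantages[t] = running
        next_value = values[t]
    return advantages
```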

5.6 Leveraging Diverse Policies

The diversity-promoting repulsion can be used for various other purposes apart from aiding exploration in the sparse environments considered thus far. First, we consider the paradigm of hierarchical reinforcement learning wherein multiple sub-policies (or skills) are managed by a high-level policy, which chooses the most apt sub-policy to execute at any given time. In Figure 4, we use the Swimmer environment from Gym and show that diverse skills (movements) can be acquired in a pre-training phase when repulsion is used. The skills can then be used in a difficult downstream task. During pre-training with SVPG, exploitation is done with policy-gradients calculated using the norm of the velocity as dense rewards, while the exploration term uses the JS-kernel. As before, we compare an ensemble of 8 interacting agents with 8 independent agents. Figures 4(a) and 4(b) depict the paths taken by the Swimmer after training with independent and interacting agents, respectively. The latter exhibit variety. Figure 4(c) is the downstream task of Swimming+Gathering (Duan et al., 2016) where the bot has to swim and collect the green dots, whilst avoiding the red ones. The utility of pre-training a diverse ensemble is shown in Figure 4(d), which plots the performance on this task while training a higher-level categorical manager policy.

Diversity can sometimes also help in learning a skill without any rewards from the environment, as observed by Eysenbach et al. (2018) in recent work. We consider a Hopper task with no rewards, but we do require weak supervision in the form of the length of each trajectory. Using policy-gradient with the trajectory length as reward, together with repulsion, we see the emergence of hopping behavior within an ensemble of 8 interacting agents. Videos of the skills acquired can be found online.

Figure 4: Using diverse agents for hierarchical reinforcement learning. (a) Independent agents paths. (b) Interacting agents paths. (c) Swimming+Gathering task. (d) Performance of manager policy with two different pre-trained ensembles as sub-policies.

5.7 Performance on more MuJoCo tasks

Columns (two values per reward setting): Episodic rewards | Noisy rewards, each suppressed w/ 90% prob. | Noisy rewards, each suppressed w/ 50% prob. | Dense rewards (Gym default)
Half-Cheetah 3686 -1572 3378 1670 4574 2374 4878 2422
Reacher -12 -12 -12 -10 -6 -6 -5 -5
Inv. Pendulum 977 53 993 999 978 988 969 992
Table 2: Extension of Table 1 from Section 3. All runs use 5M timesteps of interaction with the environment.

5.8 Additional details on SVPG exploration with JS-kernel

5.8.1 SVPG formulation

Let the policy parameters be parameterized by . To achieve diverse, high-return policies, we seek to obtain the distribution which is the solution of the optimization problem: , where is the entropy of . Solving the above equation by setting the derivative to zero yields an energy-based formulation for the optimal policy-parameter distribution: . Drawing samples from this posterior using traditional methods such as MCMC is computationally intractable. Stein variational gradient descent (SVGD; Liu & Wang (2016)) is an efficient method for generating samples and also converges to the posterior of the energy-based model. Let be the particles that constitute the policy ensemble. SVGD provides an appropriate direction for perturbing each particle such that the induced KL-divergence between the particles and the target distribution is reduced. The perturbation (gradient) for particle is given by (please see Liu & Wang (2016) for derivation):

where is a positive definite kernel function. Using as target distribution, and as the JS-kernel, we get the gradient direction for ascent:

where is the state-action visitation distribution for policy , and is the temperature. Also, for our case, is the interpolated gradient from self-imitation (Equation 5).
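The formulation in this subsection can be written out as follows. The symbols are assumptions chosen for illustration: $\theta$ for policy parameters, $q$ for the distribution over them, $\eta(\theta)$ for the expected return, $\alpha$ for the entropy weight, $k$ for the kernel, $\rho_{\pi_\theta}$ for the state-action visitation distribution, and $T$ for the kernel temperature. The first two lines follow the standard SVPG and SVGD forms; the last line follows by substituting the JS-kernel and rescaling by $\alpha$.

```latex
\begin{align*}
q^* &= \arg\max_{q}\; \mathbb{E}_{\theta\sim q}\left[\eta(\theta)\right] + \alpha\,\mathcal{H}(q),
  \qquad q^*(\theta) \propto \exp\!\left(\eta(\theta)/\alpha\right) \\
\Delta\theta_i &= \frac{1}{n}\sum_{j=1}^{n}\left[\nabla_{\theta_j}\log q^*(\theta_j)\,k(\theta_j,\theta_i)
  + \nabla_{\theta_j} k(\theta_j,\theta_i)\right] \\
k(\theta_j,\theta_i) &= \exp\!\left(-D_{JS}(\rho_{\pi_{\theta_i}},\rho_{\pi_{\theta_j}})/T\right) \\
\Delta\theta_i &\propto \frac{1}{n}\sum_{j=1}^{n} k(\theta_j,\theta_i)
  \left[\nabla_{\theta_j}\eta(\theta_j) - \frac{\alpha}{T}\,\nabla_{\theta_j} D_{JS}(\rho_{\pi_{\theta_i}},\rho_{\pi_{\theta_j}})\right]
\end{align*}
```

The first term in the bracket shares return gradients between nearby particles, and the second term is the repulsion that keeps the visitation distributions apart.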

5.8.2 Implementation details

The gradient in the above equation is the repulsion factor that pushes away from . Similar repulsion can be achieved by using the gradient ; note that this gradient is w.r.t instead of and the sign is reversed. Empirically, we find that the latter results in slightly better performance.
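To make the attraction and repulsion terms concrete, the sketch below performs one SVGD step on scalar particles. It is a simplified stand-in, not the method used here: an RBF kernel replaces the JS-kernel, and the names `svgd_step`, `grad_log_target`, `bandwidth` and `step_size` are hypothetical.

```python
import math


def svgd_step(particles, grad_log_target, bandwidth=1.0, step_size=0.1):
    """One SVGD update for scalar particles with an RBF kernel (illustrative).

    For each particle theta_i, the direction averages over all theta_j:
      grad_log_target(theta_j) * k(theta_j, theta_i)   (attraction to high density)
      + d k(theta_j, theta_i) / d theta_j               (repulsion between particles)
    """
    n = len(particles)
    updated = []
    for theta_i in particles:
        direction = 0.0
        for theta_j in particles:
            diff = theta_j - theta_i
            k = math.exp(-(diff ** 2) / bandwidth)
            grad_k = -2.0 * diff / bandwidth * k  # kernel gradient w.r.t. theta_j
            direction += grad_log_target(theta_j) * k + grad_k
        updated.append(theta_i + step_size * direction / n)
    return updated
```

With a flat target (zero gradient), only the kernel-gradient term remains and the particles move apart, which is the diversity-promoting effect described above.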

Estimation of : This can be done in two ways - using implicit and explicit distributions. In the implicit method, we could train a parameterized discriminator network () using state-actions pairs from and to implicitly approximate the ratio . We could then use the policy gradient theorem to obtain the gradient of as explained in Section 2.2. This, however, requires us to learn discriminator networks for a population of size , one for each policy pair . To reduce the computational and memory resource burden to , we opt for explicit modeling of . Specifically, we train a network to approximate the state-action visitation density for each policy . The networks are learned using the optimization (Equation 2), and we can easily obtain the ratio . The agent then uses as the SVPG exploration rewards in the policy gradient theorem.
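The benefit of explicit density models is that one model per policy is enough to fill in every pairwise kernel entry. The sketch below illustrates this with discrete histograms standing in for the learned state-action visitation networks; that substitution, and the function names, are assumptions made for illustration.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0.0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def js_kernel_matrix(distributions, temperature=1.0):
    """Pairwise kernel exp(-JS / temperature) from one density per policy."""
    return [[math.exp(-js_divergence(p, q) / temperature) for q in distributions]
            for p in distributions]
```

Identical visitation distributions give a kernel value of 1, and the value decays as the policies visit different parts of the state-action space.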

State-value baselines: We use state-value function networks as baselines to reduce the variance in sampled policy-gradients. Each agent in the population trains three state-value networks, one each for the real environment rewards, the self-imitation rewards, and the SVPG exploration rewards.

5.9 Comparison to Oh et al. (2018)

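In this section, we provide evaluation for a recently proposed method for self-imitation learning (SIL; Oh et al. (2018)). The SIL loss function takes the form:

\[ \mathcal{L}^{sil} = \mathbb{E}_{s,a,R\in\mathcal{D}}\left[\mathcal{L}^{sil}_{policy} + \beta^{sil}\,\mathcal{L}^{sil}_{value}\right], \quad \mathcal{L}^{sil}_{policy} = -\log\pi_\theta(a|s)\,\big(R - V_\theta(s)\big)_+, \quad \mathcal{L}^{sil}_{value} = \tfrac{1}{2}\big\|\big(R - V_\theta(s)\big)_+\big\|^2 \]

In words, the algorithm buffers each state-action pair and the corresponding return for each transition in rolled trajectories, and reuses them for training if the stored return value is higher than the current state-value estimate.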
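The clipping behavior described in words above can be sketched for a single stored transition. The function name and the default `value_coef` are hypothetical; in a real implementation the clipped advantage is treated as a constant when differentiating the policy term.

```python
def sil_loss(log_prob, value, stored_return, value_coef=0.01):
    """Self-imitation loss for one buffered transition (illustrative sketch).

    The transition contributes only when the stored return exceeds the
    current state-value estimate; otherwise both terms are zero.
    """
    advantage = max(stored_return - value, 0.0)  # the (R - V)_+ clipping
    policy_loss = -log_prob * advantage
    value_loss = 0.5 * advantage ** 2
    return policy_loss + value_coef * value_loss
```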

We use the code provided by the authors. As per our understanding, PPO+SIL does not use a single set of hyperparameters for all the MuJoCo tasks (Appendix A; Oh et al. (2018)). We follow their methodology and report numbers for the best configuration for each task. This is different from our experiments since we run all tasks on a single fixed hyperparameter set (Appendix 5.5), and therefore a direct comparison of the average scores between the two approaches is tricky.

Columns: SIL Dense rewards (Oh et al. (2018)) | SIL Episodic | SIL Noisy rewards, each suppressed w/ 90% prob. | SIL Noisy rewards, each suppressed w/ 50% prob.
Walker 3973 257 565 3911
Humanoid 3610 530 1126 3460
Humanoid-Standup ( 10) 18.9 4.9 14.9 18.8
Hopper 1983 563 1387 1723
Swimmer 120 17 50 100
InvertedDoublePendulum 6250 405 6563 6530
Table 3: Performance of PPO+SIL (Oh et al., 2018) on tasks with episodic rewards, noisy rewards with masking probabilities of 0.9 and 0.5, and dense rewards. All runs use 5M timesteps of interaction with the environment.

Table 3 shows the performance of PPO+SIL on MuJoCo tasks under the various reward distributions explained in Section 3.1: dense, episodic and noisy. We observe that, compared to the dense rewards setting (default Gym rewards), the performance suffers under the episodic case and when the rewards are masked out with probability 0.9. Our intuition is as follows. PPO+SIL makes use of the cumulative return from each transition of a past good rollout for the update. When rewards are provided only at the end of the episode, for instance, the cumulative return does not help with the temporal credit assignment problem and hence is not a strong learning signal. Our approach, on the other hand, derives dense, per-timestep rewards using an objective based on divergence-minimization. This is useful for credit assignment and, as indicated in Table 1 (Section 3.1), leads to learning good policies even under the episodic and noisy settings.
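The episodic and noisy reward settings compared here can be sketched as transformations of a dense per-step reward sequence. The function names are hypothetical, and paying the summed reward at the final timestep is an assumption about how the episodic setting is constructed.

```python
import random


def episodic_rewards(rewards):
    """Withhold all reward until the final timestep (assumed: the sum is paid there)."""
    if not rewards:
        return []
    return [0.0] * (len(rewards) - 1) + [sum(rewards)]


def noisy_rewards(rewards, mask_prob, rng=None):
    """Suppress each per-step reward independently with probability `mask_prob`."""
    rng = rng or random.Random()
    return [0.0 if rng.random() < mask_prob else r for r in rewards]
```

Under the episodic transformation every transition in a trajectory shares the same return-to-go signal, which is the credit assignment difficulty discussed above.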

5.10 Comparison to off-policy RL (Q-learning)

Our approach makes use of replay memory to store the past good rollouts of the agent. Off-policy RL methods such as DQN (Mnih et al., 2015) also accumulate agent experience in a replay buffer and reuse them for learning (e.g. by reducing TD-error). In this section, we evaluate the performance of one such recent algorithm - Twin Delayed Deep Deterministic policy gradient (TD3; Fujimoto et al. (2018)) on tasks with episodic and noisy rewards. TD3 builds on DDPG (Lillicrap et al., 2015) and surpasses its performance on all the MuJoCo tasks evaluated by the authors.

Columns: TD3 Dense rewards (Fujimoto et al. (2018)) | TD3 Episodic | TD3 Noisy rewards, each suppressed w/ 90% prob. | TD3 Noisy rewards, each suppressed w/ 50% prob.

Walker 4352 189 395 2417
Hopper 3636 402 385 1825
InvertedDoublePendulum 9350 363 948 4711
Swimmer - - - -
Humanoid-Standup - - - -
Humanoid - - - -
Table 4: Performance of TD3 (Fujimoto et al., 2018) on tasks with episodic rewards, noisy rewards with masking probabilities of 0.9 and 0.5, and dense rewards. All runs use 5M timesteps of interaction with the environment.

Table 4 shows that the performance of TD3 suffers appreciably with the episodic and noisy reward settings, indicating that popular off-policy algorithms (DDPG, TD3) do not exploit the past experience in a manner that accelerates learning when rewards are scarce during an episode.

* For 3 tasks used in our paper (Swimmer and the high-dimensional Humanoid and Humanoid-Standup), the TD3 code from the authors is unable to learn a good policy even in the presence of dense rewards (default Gym rewards). These tasks are also not included in the evaluation by Fujimoto et al. (2018).

5.11 Comparing SVPG exploration to a novelty-based baseline

We run a new exploration baseline, EX2 (Fu et al., 2017), and compare its performance to SI-interact-JS on the hard exploration MuJoCo tasks considered in Section 3.2. The EX2 algorithm does implicit state-density estimation using discriminative modeling, and uses it for novelty-based exploration by adding a density-based bonus to the reward. We used the author-provided code and hyperparameter settings. TRPO is used as the policy gradient algorithm.

EX2 SI-interact-JS
SparseHalfCheetah -286 769
SparseHopper 1477 1949
SparseAnt -3.9 208
Table 5: Performance of EX2 (Fu et al., 2017) and SI-interact-JS on the hard exploration MuJoCo tasks from Section 3.2. SparseHalfCheetah, SparseHopper, SparseAnt use 1M, 1M and 2M timesteps of interaction with the environment, respectively. Results are averaged over 3 separate runs.