1 Introduction
Deep reinforcement learning (RL) has demonstrated significant applicability and superior performance in many problems outside the reach of traditional algorithms, such as computer and board games (Mnih et al., 2015; Silver et al., 2016), continuous control (Lillicrap et al., 2015), and robotics (Levine et al., 2016). Using deep neural networks as function approximators, many classical RL algorithms have been shown to be very effective in solving sequential decision problems. For example, a policy that selects actions given a state observation can be parameterized by a deep neural network that takes the current state observation as input and outputs an action or a distribution over actions. Value functions that take both state observation and action as inputs and predict expected future reward can also be parameterized as neural networks. To optimize such neural networks, policy gradient methods (Mnih et al., 2016; Schulman et al., 2015, 2017a) and Q-learning algorithms (Mnih et al., 2015) capture the temporal structure of the sequential decision problem and decompose it into a supervised learning problem, guided by the immediate and discounted future reward from rollout data.
Unfortunately, when the reward signal becomes sparse or delayed, these RL algorithms may suffer from inferior performance and poor sample efficiency, mainly due to the scarcity of immediate supervision when training happens in a single-timestep manner. This is known as the temporal credit assignment problem (Sutton, 1984). For instance, consider the Atari game Montezuma’s Revenge: a reward is received after collecting certain items or arriving at the final destination in the lowest level, while no reward is received as the agent is trying to reach these goals. The sparsity of the reward makes the neural network training very inefficient and also poses challenges in exploration. Many real-world problems have this form: rewards are either only sparsely available during an episode, or they are episodic, meaning that a non-zero reward is only provided at the end of the trajectory or episode.
In addition to policy-gradient and Q-learning methods, alternative algorithms, such as those for global or stochastic optimization, have recently been studied for policy search. These algorithms do not decompose trajectories into individual timesteps, but instead apply zeroth-order finite-difference gradient or gradient-free methods to learn policies based on the cumulative rewards of the entire trajectory. Usually, trajectory samples are first generated by running the current policy, and then the distribution of policy parameters is updated according to the trajectory returns. The cross-entropy method (CEM, Rubinstein & Kroese (2016)) and evolution strategies (Salimans et al., 2017) are two representative examples. Although their sample efficiency is often not comparable to that of policy gradient methods when dense rewards are available from the environment, they are more widely applicable in sparse or episodic reward settings, as they are agnostic to task horizon and need only the trajectory-based cumulative reward.
Our contribution is the introduction of a new algorithm based on policy gradients, with the objective of achieving better performance than existing RL algorithms in sparse and episodic reward settings. Using the equivalence between the policy function and its state-action visitation distribution, we formulate policy optimization as a divergence minimization problem between the current policy’s visitation and the distribution induced by a set of experience replay trajectories with high returns. We show that with the Jensen-Shannon divergence ($D_{JS}$), this divergence minimization problem can be reduced to a policy-gradient algorithm with shaped, dense rewards learned from these experience replays. This algorithm can be seen as self-imitation learning, in which the expert trajectories in the experience replays are self-generated by the agent during the course of learning, rather than taken from external demonstrations. We combine the divergence minimization objective with the standard RL objective, and empirically show that the shaped, dense rewards significantly help in sparse and episodic settings by improving credit assignment. Following that, we qualitatively analyze the shortcomings of the self-imitation algorithm. Our second contribution is the application of Stein variational policy gradient (SVPG) with the Jensen-Shannon kernel to simultaneously learn multiple diverse policies. We demonstrate the benefits of this addition to the self-imitation framework by considering difficult exploration tasks with sparse and deceptive rewards.
Related Works. Divergence minimization has been used in various policy learning algorithms. Relative Entropy Policy Search (REPS) (Peters et al., 2010) restricts the loss of information between policy updates by constraining the KL-divergence between the state-action distributions of the old and new policy. Policy search can also be formulated as an EM problem, leading to several interesting algorithms, such as RWR (Peters & Schaal, 2007) and PoWER (Kober & Peters, 2009). Here the M-step minimizes a KL-divergence between trajectory distributions, leading to an update rule which resembles return-weighted imitation learning. Please refer to Deisenroth et al. (2013) for a comprehensive exposition. MATL (Wulfmeier et al., 2017) uses adversarial training to bring the state occupancies of a real and a simulated agent close to each other for efficient transfer learning. In Guided Policy Search (GPS, Levine & Koltun (2013)), a parameterized policy is trained by constraining the divergence between the current policy and a controller learnt via trajectory optimization.
Learning from Demonstrations (LfD). The objective in LfD, or imitation learning, is to train a control policy to produce a trajectory distribution similar to that of the demonstrator. Approaches for self-driving cars (Bojarski et al., 2016) and drone manipulation (Ross et al., 2013) have used human-expert data, along with the Behavioral Cloning algorithm, to learn good control policies. Deep Q-learning has been combined with human demonstrations to achieve performance gains in Atari (Hester et al., 2017) and robotics tasks (Večerík et al., 2017; Nair et al., 2017). Human data has also been used in the maximum entropy IRL framework to learn cost functions under which the demonstrations are optimal (Finn et al., 2016). Ho & Ermon (2016) use the same framework to derive an imitation-learning algorithm (GAIL) which is motivated by minimizing the divergence between the agent’s rollouts and external expert demonstrations. Besides humans, other sources of expert supervision include planning-based approaches such as iLQR (Levine et al., 2016) and MCTS (Silver et al., 2016). Our algorithm departs from prior work in forgoing external supervision, and instead using the past experiences of the learner itself as demonstration data.
Exploration and Diversity in RL. Count-based exploration methods utilize state-action visitation counts and award a bonus to rarely visited states (Strehl & Littman, 2008). In large state-spaces, approximation techniques (Tang et al., 2017) and estimation of pseudo-counts by learning density models (Bellemare et al., 2016; Fu et al., 2017) have been researched. Intrinsic motivation has been shown to aid exploration, for instance by using information gain (Houthooft et al., 2016) or prediction error (Stadie et al., 2015) as a bonus. Hindsight Experience Replay (Andrychowicz et al., 2017) adds additional goals (and corresponding rewards) to a Q-learning algorithm. We also obtain additional rewards, but from a discriminator trained on past agent experiences, to accelerate a policy-gradient algorithm. Prior work has looked at training a diverse ensemble of agents with good exploratory skills (Liu et al., 2017; Conti et al., 2017; Florensa et al., 2017). To enjoy the benefits of diversity, we incorporate a modification of SVPG (Liu et al., 2017) in our final algorithm.
In very recent work, Oh et al. (2018) propose exploiting past good trajectories to drive exploration. Their algorithm buffers each state-action pair together with the corresponding return for each transition in rolled-out trajectories, and reuses them for training if the stored return value is higher than the current state-value estimate. Our approach presents a different objective for self-imitation based on divergence minimization. With this view, we learn shaped, dense rewards which are then used for policy optimization. We further improve the algorithm with SVPG. Reusing high-reward trajectories has also been explored for program synthesis and semantic parsing tasks (Liang et al., 2016, 2018; Abolafia et al., 2018).
2 Main Methods
We start with a brief introduction to RL in Section 2.1, and then introduce our main algorithm of self-imitation learning in Section 2.2. Section 2.3 further extends our main method to learn multiple diverse policies using Stein variational policy gradient with the Jensen-Shannon kernel.
2.1 Reinforcement Learning Background
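A typical RL setting involves an environment modeled as a Markov Decision Process with an unknown system dynamics model $p(s_{t+1} \mid s_t, a_t)$ and an initial state distribution $p_0(s_0)$. An agent interacts sequentially with the environment in discrete timesteps using a policy $\pi$ which maps an observation $s_t$ to either a single action $a_t$ (deterministic policy), or a distribution over the action space (stochastic policy). We consider the scenario of stochastic policies over high-dimensional, continuous state and action spaces. The agent receives a per-step reward $r_t(s_t, a_t)$, and the RL objective involves maximization of the expected discounted sum of rewards, $\eta(\pi_\theta) = \mathbb{E}_{p_0, p, \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]$, where $\gamma$ is the discount factor. The action-value function is $Q^{\pi}(s_t, a_t) = \mathbb{E}_{p, \pi}\left[\sum_{t'=t}^{\infty} \gamma^{t'-t} r(s_{t'}, a_{t'})\right]$. We define the unnormalized $\gamma$-discounted state-visitation distribution for a policy $\pi$ by $\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^{t} P(s_t = s \mid \pi)$, where $P(s_t = s \mid \pi)$ is the probability of being in state $s$ at time $t$, when following policy $\pi$ and starting state $s_0 \sim p_0$. The expected policy return $\eta(\pi_\theta)$ can then be written as $\mathbb{E}_{\rho_\pi(s,a)}[r(s,a)]$, where $\rho_\pi(s,a) = \rho_\pi(s)\,\pi(a \mid s)$ is the state-action visitation distribution. Using the policy gradient theorem (Sutton et al., 2000), we can get the direction of ascent $\nabla_\theta \eta(\pi_\theta) = \mathbb{E}_{\rho_\pi(s,a)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s,a)\right]$.
2.2 Policy Optimization as Divergence Minimization with Self-Imitation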
Although the policy $\pi(a \mid s)$ is given as a conditional distribution, its behavior is better characterized by the corresponding state-action visitation distribution $\rho_\pi(s,a)$, which wraps the MDP dynamics and fully decides the expected return via $\eta(\pi) = \mathbb{E}_{\rho_\pi}[r(s,a)]$. Therefore, distance metrics on a policy $\pi$ should be defined with respect to the visitation distribution $\rho_\pi$, and the policy search should be viewed as finding policies with good visitation distributions $\rho_\pi(s,a)$ that yield high reward. Suppose we have access to a good policy $\pi'$; then it is natural to consider finding a $\pi$ such that its visitation distribution $\rho_\pi$ matches $\rho_{\pi'}$. To do so, we can define a divergence measure $D(\rho_\pi, \rho_{\pi'})$ that captures the similarity between two distributions, and minimize this divergence for policy improvement.
Assume there exists an expert policy $\pi_E$, such that policy optimization can be framed as minimizing the divergence $\min_\pi D(\rho_\pi, \rho_{\pi_E})$, that is, finding a policy $\pi$ to imitate $\pi_E$. In practice, however, we do not have access to any real guiding expert policy. Instead, we can maintain a selected subset $\mathcal{M}_E$ of highly-rewarded trajectories from the previous rollouts of policy $\pi$, and optimize the policy $\pi$ to minimize the divergence between $\rho_\pi$ and the empirical state-action pair distribution $\{(s_i, a_i)\}_{\mathcal{M}_E}$:
$$\min_{\pi} \; D\left(\rho_\pi, \{(s_i, a_i)\}_{\mathcal{M}_E}\right) \qquad (1)$$
Since it is not always possible to explicitly formulate $\rho_\pi$ even with the exact functional form of $\pi$, we generate rollouts from $\pi$ in the environment and obtain an empirical distribution of $\rho_\pi$. To measure the divergence between two empirical distributions, we use the Jensen-Shannon divergence, with the following variational form (up to a constant shift) as exploited in GANs (Goodfellow et al., 2014):
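$$D_{JS}(\rho_\pi, \rho_E) = \max_{d_\pi, d_E} \; \mathbb{E}_{\rho_\pi}\left[\log \frac{d_\pi(s,a)}{d_\pi(s,a) + d_E(s,a)}\right] + \mathbb{E}_{\rho_E}\left[\log \frac{d_E(s,a)}{d_\pi(s,a) + d_E(s,a)}\right] \qquad (2)$$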
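where $d_\pi(s,a)$ and $d_E(s,a)$ are empirical density estimators of $\rho_\pi$ and $\rho_E$ (the empirical distribution of the state-action pairs in $\mathcal{M}_E$), respectively. Under certain assumptions, we can obtain an approximate gradient of $D_{JS}$ w.r.t. the policy parameters, thus enabling us to optimize the policy.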
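As a sanity check on the variational form, the sketch below evaluates it on two small discrete distributions. The distributions `p` and `q`, the function names, and the use of normalized histograms in place of visitation distributions are illustrative assumptions, not part of the method. When the estimators equal the true densities, the objective equals $2 D_{JS} - \log 4$ under the standard definition of $D_{JS}$, and any other choice of estimators gives a value that is no larger.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural logarithm) between p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def variational_objective(p, q, d_p, d_q):
    """E_p[log d_p/(d_p+d_q)] + E_q[log d_q/(d_p+d_q)] for candidate estimators."""
    total = 0.0
    for pi, qi, a, b in zip(p, q, d_p, d_q):
        if pi > 0:
            total += pi * math.log(a / (a + b))
        if qi > 0:
            total += qi * math.log(b / (a + b))
    return total

# Hypothetical learner (p) and replay-memory (q) histograms over three bins.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

js = js_divergence(p, q)
best = variational_objective(p, q, p, q)                  # estimators = true densities
other = variational_objective(p, q, [1 / 3] * 3, [1 / 3] * 3)  # uninformative estimators
```

Here `other` equals $-\log 4$, the value obtained when the estimators cannot tell the two distributions apart.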
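Gradient Approximation: Let $\rho_\pi(s,a)$ and $\rho_{\pi'}(s,a)$ be the state-action visitation distributions induced by two policies $\pi$ and $\pi'$ respectively. Let $d_\pi(s,a)$ and $d_{\pi'}(s,a)$ be the surrogates to $\rho_\pi(s,a)$ and $\rho_{\pi'}(s,a)$, respectively, obtained by solving Equation 2. Then, if the policy $\pi$ is parameterized by $\theta$, the gradient of $D_{JS}(\rho_\pi, \rho_{\pi'})$ with respect to policy parameters ($\theta$) can be approximated as: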
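$$\nabla_\theta D_{JS}(\rho_\pi, \rho_{\pi'}) \approx \mathbb{E}_{(s,a)\sim\rho_\pi(s,a)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \tilde{Q}^{\pi}(s,a)\right], \quad \tilde{Q}^{\pi}(s_t,a_t) = \mathbb{E}_{p,\pi}\left[\sum_{t'=t}^{\infty}\gamma^{t'-t}\log\frac{d_\pi(s_{t'},a_{t'})}{d_\pi(s_{t'},a_{t'}) + d_{\pi'}(s_{t'},a_{t'})}\right] \qquad (3)$$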
The derivation of the approximation and the underlying assumptions are in Appendix 5.1. Next, we introduce a simple and inexpensive approach to construct the replay memory $\mathcal{M}_E$ using high-return past experiences during training. In this way, $\pi_E$ can be seen as a mixture of deterministic policies, each representing a delta point mass distribution in the trajectory space or a finite discrete visitation distribution of state-action pairs. At each iteration, we apply the current policy $\pi_\theta$ to sample a batch of trajectories. We hope to include in $\mathcal{M}_E$ the top-$k$ trajectories (or trajectories with returns above a threshold) generated thus far during the training process. For this, we use a priority queue for $\mathcal{M}_E$, which keeps the trajectories sorted according to the total trajectory reward. The reward of each newly sampled trajectory is compared with the current threshold of the priority queue, and $\mathcal{M}_E$ is updated accordingly. The frequency of updates is impacted by the exploration capabilities of the agent and the stochasticity in the environment. We find that simply sampling noisy actions from Gaussian policies is sufficient for several locomotion tasks (Section 3). To handle more challenging environments, in the next subsection, we augment our policy optimization procedure to explicitly enhance exploration and produce an ensemble of diverse policies.
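The replay-memory update described above can be sketched with a min-heap keyed on trajectory return. This is a minimal illustration under stated assumptions: the class name `TopKTrajectoryMemory`, its methods, and the representation of a trajectory as a list of `(state, action, reward)` tuples are hypothetical and are not taken from the paper's implementation.

```python
import heapq
import itertools

class TopKTrajectoryMemory:
    """Keeps the k highest-return trajectories seen so far (min-heap on return)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []                    # entries: (return, tie_breaker, trajectory)
        self._counter = itertools.count()  # avoids comparing trajectories on ties

    def threshold(self):
        """Smallest stored return, or -inf while the memory is not yet full."""
        if len(self._heap) < self.capacity:
            return float("-inf")
        return self._heap[0][0]

    def add(self, trajectory):
        """trajectory: list of (state, action, reward). Returns True if stored."""
        total = sum(r for _, _, r in trajectory)
        if total <= self.threshold():
            return False
        entry = (total, next(self._counter), trajectory)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        else:
            heapq.heapreplace(self._heap, entry)  # evict the lowest-return trajectory
        return True

    def state_action_pairs(self):
        """Flattened (state, action) samples forming the empirical distribution."""
        return [(s, a) for _, _, traj in self._heap for s, a, _ in traj]

    def returns(self):
        """Stored returns, highest first."""
        return sorted((entry[0] for entry in self._heap), reverse=True)

# Toy usage: states are timestep indices, actions are all 0.
memory = TopKTrajectoryMemory(capacity=2)
for rewards in ([0.0, 1.0], [0.0, 0.0], [2.0, 3.0], [0.5, 0.0]):
    memory.add([(t, 0, r) for t, r in enumerate(rewards)])
```

A min-heap keeps the current threshold at the root, so comparing a new trajectory against it takes constant time and replacing the worst stored trajectory takes logarithmic time in the capacity.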
In the usual imitation learning framework, expert demonstrations of trajectories from external sources are available as the empirical distribution $\rho_E$ of an expert policy $\pi_E$. In our approach, since the agent learns by treating its own good past experiences as the expert, we can view the algorithm as self-imitation learning from experience replay. As noted in Equation 3, the gradient estimator of $D_{JS}$ has a form similar to policy gradients, except that the true reward function is replaced by the per-timestep reward $\log \frac{d_\pi(s,a)}{d_\pi(s,a) + d_{\pi'}(s,a)}$. Therefore, it is possible to interpolate the gradient of $D_{JS}$ and the standard policy gradient. We highlight the benefit of this interpolation shortly. The net gradient on the policy parameters is:
$$\nabla_\theta \eta(\pi_\theta) = (1-\nu)\, \mathbb{E}_{\rho_\pi(s,a)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{r}(s,a)\right] - \nu\, \nabla_\theta D_{JS}(\rho_\pi, \rho_E) \qquad (4)$$
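where $Q^{r}$ is the $Q$ function with true rewards, $\nu$ is the interpolation weight, and $\pi_E$ is the mixture policy represented by the samples in $\mathcal{M}_E$. Let $r^{\phi}(s,a) = \frac{d_\pi(s,a)}{d_\pi(s,a) + d_E(s,a)}$. $r^{\phi}(s,a)$ can be computed using parameterized networks for the densities $d_\pi$ and $d_E$, which are trained by solving the optimization (Equation 2) using the current policy rollouts and $\mathcal{M}_E$, where $\phi$ includes the parameters for $d_\pi$ and $d_E$. Using Equation 3, the interpolated gradient can be further simplified to: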
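$$\nabla_\theta \eta(\pi_\theta) = \mathbb{E}_{\rho_\pi(s,a)}\left[\nabla_\theta \log \pi_\theta(a \mid s) \left[(1-\nu)\, Q^{r}(s,a) + \nu\, Q^{r^{\phi}}(s,a)\right]\right] \qquad (5)$$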
where $Q^{r^{\phi}}(s_t,a_t) = -\mathbb{E}_{p,\pi}\left[\sum_{t'=t}^{\infty}\gamma^{t'-t}\log r^{\phi}(s_{t'},a_{t'})\right]$ is the $Q$ function calculated using $-\log r^{\phi}(s,a)$ as the reward. This reward is high in the regions of the $\mathcal{S} \times \mathcal{A}$ space frequented more by the expert than the learner, and low in regions visited more by the learner than the expert. The effective $Q$ in Equation 5 is therefore an interpolation between $Q^{r}$, obtained with true environment rewards, and $Q^{r^{\phi}}$, obtained with rewards which are implicitly shaped to guide the learner towards expert behavior. In environments with sparse or deceptive rewards, where the signal from $Q^{r}$ is weak or sub-optimal, a higher weight on $Q^{r^{\phi}}$ enables successful learning by imitation. We show this empirically in our experiments. We further find that even in cases with dense environment rewards, the two gradient components can be successfully combined for policy optimization. The complete algorithm for self-imitation is outlined in Appendix 5.2 (Algorithm 1).
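The interpolation between environment and shaped rewards can be illustrated on a single rollout. The sketch below is an assumption-laden toy: the density values are hand-picked numbers standing in for the outputs of trained density networks, reward-to-go from one rollout stands in for the $Q$ functions, and the function names and the weight `nu` are hypothetical.

```python
import math

def shaped_reward(d_pi, d_e):
    """-log(d_pi / (d_pi + d_e)): large where the replay memory is denser than the learner."""
    return -math.log(d_pi / (d_pi + d_e))

def discounted_returns(rewards, gamma):
    """Reward-to-go sum_{t' >= t} gamma^(t'-t) * r_t' for every timestep t."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

def interpolated_returns(env_rewards, d_pi_vals, d_e_vals, nu, gamma):
    """(1 - nu) * Q_env + nu * Q_shaped, using single-rollout reward-to-go estimates."""
    shaped = [shaped_reward(a, b) for a, b in zip(d_pi_vals, d_e_vals)]
    q_env = discounted_returns(env_rewards, gamma)
    q_shaped = discounted_returns(shaped, gamma)
    return [(1 - nu) * qe + nu * qs for qe, qs in zip(q_env, q_shaped)]

# Episodic environment reward: nothing until the last step.
env_rewards = [0.0, 0.0, 1.0]
d_pi_vals = [0.5, 0.2, 0.1]  # hypothetical learner density estimates
d_e_vals = [0.5, 0.8, 0.9]   # hypothetical replay-memory density estimates
mixed = interpolated_returns(env_rewards, d_pi_vals, d_e_vals, nu=0.5, gamma=1.0)
```

With `nu = 0` the result reduces to the environment reward-to-go, and with `nu = 1` only the shaped reward contributes, so every timestep receives a learning signal even though the environment reward arrives only at the end.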
Limitations of self-imitation. We now elucidate some shortcomings of the self-imitation approach. Since the replay memory $\mathcal{M}_E$ is only constructed from the past training rollouts, the quality of the trajectories in $\mathcal{M}_E$ hinges on good exploration by the agent. Consider a maze environment where the robot is only rewarded when it arrives at a goal $G$ placed in a far-off corner. Unless the robot reaches $G$ once, the trajectories in $\mathcal{M}_E$ always have a total reward of zero, and the learning signal from $Q^{r^{\phi}}$ is not useful. Secondly, self-imitation can lead to sub-optimal policies when there are local minima in the policy optimization landscape; for example, assume the maze has a second goal $G'$ in the opposite direction of $G$, but with a much smaller reward. With simple exploration, the agent may fill $\mathcal{M}_E$ with below-par trajectories leading to $G'$, and the reinforcement from $Q^{r^{\phi}}$ would drive it further to $G'$. Thirdly, stochasticity in the environment may make it difficult to recover the optimal policy just by imitating the past top-$k$ rollouts. For instance, in a 2-armed bandit problem with reward distributions Bernoulli($p$) and Bernoulli($p+\epsilon$), rollouts from both arms get conflated in $\mathcal{M}_E$ during training with high probability, making it hard to imitate the action of picking the arm with the higher expected reward.
We propose to overcome these pitfalls by training an ensemble of self-imitating agents, which are explicitly encouraged to visit different, non-overlapping regions of the state-space. This helps to discover useful rewards in sparse settings, avoids deceptive reward traps, and, in environments with reward-stochasticity like the 2-armed bandit, increases the probability of the optimal policy being present in the final trained ensemble. We detail the enhancements next.
2.3 Improving Exploration with Stein Variational Gradient
One approach to achieve better exploration in challenging cases like the above is to simultaneously learn multiple diverse policies and force them to explore different parts of the high-dimensional space. This can be achieved based on the recent work by Liu et al. (2017) on Stein variational policy gradient (SVPG). The idea of SVPG is to find an optimal distribution $q(\theta)$ over the policy parameters which maximizes the expected policy returns, along with an entropy regularization that enforces diversity on the parameter space, i.e.
$$\max_{q} \; \mathbb{E}_{\theta \sim q}\left[\eta(\pi_\theta)\right] + \alpha\, \mathcal{H}(q)$$
Without a parametric assumption on $q$, this is a challenging functional optimization problem. Stein variational gradient descent (SVGD, Liu & Wang (2016)) provides an efficient solution by approximating $q$ with a delta measure $q = \sum_{i=1}^{n} \delta_{\theta_i} / n$, where $\{\theta_i\}_{i=1}^{n}$ is an ensemble of policies, and iteratively updating $\{\theta_i\}$ with $\theta_i \leftarrow \theta_i + \epsilon\, \Delta\theta_i$, where
$$\Delta\theta_i = \frac{1}{n}\sum_{j=1}^{n}\left[\nabla_{\theta_j}\eta(\pi_{\theta_j})\, k(\theta_j,\theta_i) + \alpha\, \nabla_{\theta_j} k(\theta_j,\theta_i)\right] \qquad (6)$$
where $k(\theta_j,\theta_i)$ is a positive definite kernel function. The first term in $\Delta\theta_i$ moves the policy to regions with high expected return (exploitation), while the second term creates a repulsion pressure between policies in the ensemble and encourages diversity (exploration). The choice of kernel is critical. Liu et al. (2017) used a simple Gaussian RBF kernel $k(\theta_j,\theta_i) = \exp(-\|\theta_j-\theta_i\|_2^2 / h)$, with the bandwidth $h$ dynamically adapted. This, however, assumes a flat Euclidean distance between $\theta_j$ and $\theta_i$, ignoring the structure of the entities defined by them, which are probability distributions. A statistical distance, such as $D_{JS}$, serves as a better metric for comparing policies (Amari, 1998; Kakade, 2002). Motivated by this, we propose to improve SVPG using the JS kernel $k(\theta_j,\theta_i) = \exp(-D_{JS}(\rho_{\pi_{\theta_j}}, \rho_{\pi_{\theta_i}})/T)$, where $\rho_{\pi_\theta}(s,a)$ is the state-action visitation distribution obtained by running policy $\pi_\theta$, and $T$ is the temperature. The second exploration term in SVPG involves the gradient of the kernel w.r.t. policy parameters. With the JS kernel, this requires estimating the gradient of $D_{JS}$, which, as shown in Equation 3, can be obtained using policy gradients with an appropriately trained reward function.
Our full algorithm is summarized in Appendix 5.3 (Algorithm 2). In each iteration, we apply the SVPG gradient to each of the policies, where the $\nabla_\theta \eta(\pi_\theta)$ in Equation 6 is the interpolated gradient from self-imitation (Equation 5). We also utilize state-value function networks as baselines to reduce the variance in sampled policy gradients.
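To make the exploitation and repulsion terms concrete, the following sketch computes SVGD-style update directions for a small ensemble. It deliberately uses the Gaussian RBF kernel, whose gradient is available in closed form, rather than the JS kernel, which would need the learned gradient estimator. The function names, the explicit weight `alpha` on the repulsion term, the fixed `bandwidth`, and the toy parameter values are assumptions made for illustration.

```python
import math

def rbf_kernel(theta_j, theta_i, bandwidth):
    """k(theta_j, theta_i) = exp(-||theta_j - theta_i||^2 / h)."""
    sq = sum((a - b) ** 2 for a, b in zip(theta_j, theta_i))
    return math.exp(-sq / bandwidth)

def rbf_kernel_grad(theta_j, theta_i, bandwidth):
    """Gradient of the RBF kernel with respect to theta_j."""
    k = rbf_kernel(theta_j, theta_i, bandwidth)
    return [-2.0 * (a - b) / bandwidth * k for a, b in zip(theta_j, theta_i)]

def svpg_directions(thetas, return_grads, alpha, bandwidth):
    """Per-policy direction: kernel-weighted return gradients plus weighted repulsion."""
    n = len(thetas)
    directions = []
    for theta_i in thetas:
        delta = [0.0] * len(theta_i)
        for theta_j, grad_j in zip(thetas, return_grads):
            k = rbf_kernel(theta_j, theta_i, bandwidth)
            dk = rbf_kernel_grad(theta_j, theta_i, bandwidth)
            for d in range(len(delta)):
                delta[d] += (grad_j[d] * k + alpha * dk[d]) / n
        directions.append(delta)
    return directions

# Two nearby 1-D policies with zero return gradient: only the repulsion term acts.
thetas = [[-0.1], [0.1]]
zero_grads = [[0.0], [0.0]]
dirs = svpg_directions(thetas, zero_grads, alpha=1.0, bandwidth=1.0)
```

In this toy case the two directions have equal magnitude and opposite sign, pushing the policies apart; with non-zero return gradients, each policy additionally follows a kernel-weighted average of the ensemble's gradients.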
3 Experiments
Our goal in this section is to answer the following questions: 1) How does self-imitation fare against standard policy gradients under various reward distributions from the environment, namely episodic, noisy and dense? 2) How far does the SVPG exploration go in overcoming the limitations of self-imitation, such as susceptibility to local minima?
We benchmark on high-dimensional, continuous-control locomotion tasks based on the MuJoCo physics simulator by extending the OpenAI Baselines (Dhariwal et al., 2017) framework. Our control policies ($\pi_\theta$) are modeled as unimodal Gaussians. All feed-forward networks have two layers of 64 hidden units each with tanh non-linearity. For policy-gradient, we use the clipped-surrogate based PPO algorithm (Schulman et al., 2017b). Further implementation details are in the Appendix.
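Reward settings, left to right: episodic rewards; noisy rewards with each reward suppressed with 90% probability ($p_m = 0.9$); noisy rewards with each reward suppressed with 50% probability ($p_m = 0.5$); dense rewards (Gym default).
| Task | Episodic: SI | Episodic: PPO | Episodic: CEM | Episodic: ES | Noisy ($p_m = 0.9$): SI | Noisy ($p_m = 0.9$): PPO | Noisy ($p_m = 0.5$): SI | Noisy ($p_m = 0.5$): PPO | Dense: SI | Dense: PPO |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Walker | 2996 | 252 | 205 | 1200 | 2276 | 2047 | 3049 | 3364 | 3263 | 3401 |
| Humanoid | 3602 | 532 | 426 | | 4136 | 1159 | 4296 | 3145 | 3339 | 4149 |
| HStandup ( 10) | 18.1 | 4.4 | 9.6 | | 14.3 | 11.4 | 16.3 | 9.8 | 17.2 | 10 |
| Hopper | 2618 | 354 | 97 | 1900 | 2381 | 2264 | 2137 | 2132 | 2700 | 2252 |
| Swimmer | 173 | 21 | 17 | | 52 | 37 | 127 | 56 | 106 | 68 |
| Invd.Pendulum | 8668 | 344 | 86 | 9000 | 8744 | 8826 | 8926 | 8968 | 8989 | 8694 |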
3.1 Self-Imitation with Different Reward Distributions
We evaluate the performance of self-imitation with a single agent in this subsection; combination with SVPG exploration for multiple agents is discussed in the next. We consider the locomotion tasks in OpenAI Gym under 3 separate reward distributions. Dense refers to the default reward function in Gym, which provides a reward for each simulation timestep. In the episodic reward setting, rather than providing $r(s_t, a_t)$ at each timestep of an episode, we provide $\sum_t r(s_t, a_t)$ at the last timestep of the episode, and zero reward at other timesteps. This is the case for many practical settings where the reward function is hard to design, but scoring each trajectory, possibly by a human (Christiano et al., 2017), is feasible. In the noisy reward setting, we probabilistically mask out each per-timestep reward in an episode. Reward masking is done independently for every new episode, and therefore the agent receives non-zero feedback at different (albeit only a few) timesteps in different episodes. The probability of masking out or suppressing the rewards is denoted by $p_m$.
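The episodic and noisy reward settings can be expressed as simple transformations of a dense reward sequence. The sketch below is illustrative only: the function names, the parameter `p_mask`, and the toy reward values are assumptions, and the actual experiments apply these transformations inside the environment rather than on a stored list.

```python
import random

def to_episodic(rewards):
    """Move the whole return to the final timestep; all earlier rewards become zero."""
    if not rewards:
        return []
    return [0.0] * (len(rewards) - 1) + [sum(rewards)]

def mask_rewards(rewards, p_mask, rng):
    """Zero out each per-timestep reward independently with probability p_mask."""
    return [0.0 if rng.random() < p_mask else r for r in rewards]

rng = random.Random(0)         # seed chosen arbitrarily for reproducibility
dense = [1.0, 0.5, 2.0, 1.5]   # hypothetical dense per-timestep rewards
episodic = to_episodic(dense)  # [0.0, 0.0, 0.0, 5.0]
noisy = mask_rewards(dense, p_mask=0.9, rng=rng)
```

Calling `mask_rewards` afresh for every episode gives the independent per-episode masking described above, so the surviving non-zero rewards fall on different timesteps in different episodes.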
In Figure 1, we plot the learning curves on three tasks with episodic rewards. Recall that $\nu$ is the hyperparameter controlling the weight distribution between gradients with environment rewards and the gradients with shaped reward from $r^{\phi}$ (Equation 5). The baseline PPO agents use $\nu = 0$, meaning that the entire learning signal comes from the environment. We compare them with self-imitating (SI) agents using a constant, non-zero value of $\nu$. The capacity of $\mathcal{M}_E$ is fixed at 10 trajectories. We did not observe our method to be particularly sensitive to the choice of $\nu$ and the capacity value. Further ablation on these two hyperparameters can be found in the Appendix.
In Figure 1, we see that the PPO agents are unable to make any tangible progress on these tasks with episodic rewards, possibly due to difficulty in credit assignment: the lumped rewards at the end of the episode cannot be properly attributed to the individual state-action pairs during the episode. In the case of self-imitation, the algorithm has access to the shaped rewards for each timestep, derived from the high-return trajectories in $\mathcal{M}_E$. This makes credit assignment easier, leading to successful learning even for very high-dimensional control tasks such as Humanoid.
Table 1 summarizes the final performance, averaged over 5 runs with random seeds, under the various reward settings. For the noisy rewards, we compare performance with two different reward masking values: suppressing each reward with 90% probability ($p_m = 0.9$), and with 50% probability ($p_m = 0.5$). The density of rewards increases across the reward settings from left to right in Table 1. We find that SI agents achieve a higher average score than the baseline PPO agents ($\nu = 0$) in the majority of the tasks for all the settings. This indicates that not only does self-imitation vastly help when the environment rewards are scant, it can also readily be combined with the standard policy gradients via interpolation, for successful learning across reward settings. For completeness, we include the performance of CEM and ES, since these algorithms depend only on the total trajectory rewards and do not exploit the temporal structure. CEM performs poorly in most of the cases. ES, while being able to solve the tasks, is sample-inefficient. We include ES performance from Salimans et al. (2017) after 5M timesteps of training for a fair comparison with our algorithm.
3.2 Characterizing Ensemble of Diverse Self-Imitating Policies
We now conduct experiments to show how self-imitation can lead to sub-optimal policies in certain cases, and how the SVPG objective, which trains an ensemble with an explicit repulsion between policies, can improve performance.
2D-Navigation. Consider a simple Maze environment where the start location of the agent (blue particle) is shown in the figure on the right, along with two regions: the red region is closer to the agent’s starting location but has a per-timestep reward of only 1 point if the agent hovers over it; the green region is on the other side of the wall but has a per-timestep reward of 10 points. We run 8 independent, non-interacting, self-imitating agents on this task. This ensemble is denoted as SI-independent. Figure 1(a) plots the state-visitation density for SI-independent after training, from which it is evident that the agents get trapped in the local minimum. The red region is relatively easily explored, and trajectories leading to it fill $\mathcal{M}_E$, causing sub-optimal imitation. We contrast this with an instantiation of our full algorithm, which is referred to as SI-interact-JS. It is composed of 8 self-imitating agents which share information for gradient calculation with the SVPG objective (Equation 6). The temperature $T$ is held constant, and the weight on the exploration-facilitating repulsion term ($\alpha$) is linearly decayed over time. Figure 1(b) depicts the state-visitation density for this ensemble. SI-interact-JS explores wider portions of the maze, with multiple agents reaching the green zone of high reward.
Figures 1(c) and 1(d) show the kernel matrices for the two ensembles after training. Cell $(i,j)$ in the matrix corresponds to the kernel value $k(\theta_i,\theta_j) = \exp(-D_{JS}(\rho_i,\rho_j)/T)$. For SI-independent, many darker cells indicate that policies are closer (low JS). For SI-interact-JS, which explicitly tries to decrease $k(\theta_i,\theta_j)$, the cells are noticeably lighter, indicating dissimilar policies (high JS). The behavior of PPO-independent ($\nu = 0$) is similar to that of SI-independent for the Maze task.
Locomotion. To explore the limitations of self-imitation in harder exploration problems in high-dimensional, continuous state-action spaces, we modify 3 MuJoCo tasks as follows: SparseHalfCheetah, SparseHopper and SparseAnt yield a forward velocity reward only when the center-of-mass of the corresponding bot is beyond a certain threshold distance. At all timesteps, there is an energy penalty to move the joints, and a survival bonus for bots that can fall over causing premature episode termination (Hopper, Ant). Figure 3 plots the performance of PPO-independent, SI-independent, SI-interact-JS and SI-interact-RBF (which uses the RBF kernel from Liu et al. (2017) instead of the JS kernel) on the tasks. Each of these 4 algorithms is an ensemble of 8 agents using the same amount of simulation timesteps. The results are averaged over 3 separate runs, where for each run, the best agent from the ensemble after training is selected.
The SI-independent agents rely solely on action-space noise from the Gaussian policy parameterization to find high-return trajectories, which are added to $\mathcal{M}_E$ as demonstrations. This is mostly inadequate or slow for sparse environments. Indeed, we find that all demonstrations in $\mathcal{M}_E$ for SparseHopper are with the bot standing upright (or tilted) and gathering only the survival bonus, as action-space noise alone cannot discover hopping behavior. Similarly, for SparseHalfCheetah, $\mathcal{M}_E$ has trajectories with the bot haphazardly moving back and forth. On the other hand, in SI-interact-JS, the repulsion term encourages the agents to be diverse and explore the state-space much more effectively. This leads to faster discovery of quality trajectories, which then provide good reinforcement through self-imitation, leading to a higher overall score. SI-interact-RBF does not perform as well, suggesting that the JS kernel is more effective for exploration. PPO-independent gets stuck in the local optimum for SparseHopper and SparseHalfCheetah: the bots stand still after training, avoiding the energy penalty. For SparseAnt, the bot can cross our preset distance threshold using only action-space noise, but learning is slow due to naïve exploration.
4 Conclusion and Future Work
We approached policy optimization for deep RL from the perspective of JS-divergence minimization between state-action distributions of a policy and its own past good rollouts. This leads to a self-imitation algorithm which improves upon standard policy-gradient methods via the addition of a simple gradient term obtained from implicitly shaped dense rewards. We observe substantial performance gains over the baseline for high-dimensional, continuous-control tasks with episodic and noisy rewards. Further, we discuss the potential limitations of the self-imitation approach, and propose ensemble training with the SVPG objective and JS kernel as a solution. Through experimentation, we demonstrate the benefits of a self-imitating, diverse ensemble for efficient exploration and avoidance of local minima.
An interesting direction for future work is improving our algorithm using the rich literature on exploration in RL. Since ours is a population-based exploration method, techniques for efficient single-agent exploration can be readily combined with it. For instance, parameter-space noise or curiosity-driven exploration can be applied to each agent in the SI-interact-JS ensemble. Secondly, our algorithm for training diverse agents could be used more generally. In Appendix 5.6, we show preliminary results for two cases: a) hierarchical RL, where a diverse group of Swimmer bots is trained for downstream use in a complex Swimming+Gathering task; b) RL without environment rewards, relying solely on diversity as the optimization objective. Further investigation is left for future work.
References
 Abolafia et al. (2018) Daniel A Abolafia, Mohammad Norouzi, and Quoc V Le. Neural program synthesis with priority queue training. arXiv preprint arXiv:1801.03526, 2018.
 Amari (1998) Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
 Andrychowicz et al. (2017) Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058, 2017.
 Bellemare et al. (2016) Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.
 Bojarski et al. (2016) Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
 Christiano et al. (2017) Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pp. 4302–4310, 2017.
 Conti et al. (2017) Edoardo Conti, Vashisht Madhavan, Felipe Petroski Such, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. arXiv preprint arXiv:1712.06560, 2017.
 Deisenroth et al. (2013) Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends® in Robotics, 2(1–2):1–142, 2013.
 Dhariwal et al. (2017) Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI baselines. https://github.com/openai/baselines, 2017.
 Duan et al. (2016) Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.
 Eysenbach et al. (2018) Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
 Finn et al. (2016) Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58, 2016.
 Florensa et al. (2017) Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012, 2017.
 Fu et al. (2017) Justin Fu, John Co-Reyes, and Sergey Levine. Ex2: Exploration with exemplar models for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2574–2584, 2017.
 Fujimoto et al. (2018) Scott Fujimoto, Herke van Hoof, and Dave Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
 Hester et al. (2017) Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, et al. Deep Q-learning from demonstrations. arXiv preprint arXiv:1704.03732, 2017.
 Ho & Ermon (2016) Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.
 Houthooft et al. (2016) Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pp. 1109–1117, 2016.
 Kakade (2002) Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.
 Kober & Peters (2009) Jens Kober and Jan R Peters. Policy search for motor primitives in robotics. In Advances in neural information processing systems, pp. 849–856, 2009.
 Levine & Koltun (2013) Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pp. 1–9, 2013.
 Levine et al. (2016) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
 Liang et al. (2016) Chen Liang, Jonathan Berant, Quoc Le, Kenneth D Forbus, and Ni Lao. Neural symbolic machines: Learning semantic parsers on freebase with weak supervision. arXiv preprint arXiv:1611.00020, 2016.
 Liang et al. (2018) Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc Le, and Ni Lao. Memory augmented policy optimization for program synthesis with generalization. arXiv preprint arXiv:1807.02322, 2018.
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Liu & Wang (2016) Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances In Neural Information Processing Systems, pp. 2378–2386, 2016.
 Liu et al. (2017) Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
 Nair et al. (2017) Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.
 Oh et al. (2018) Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning. arXiv preprint arXiv:1806.05635, 2018.
 Peters & Schaal (2007) Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning, pp. 745–750. ACM, 2007.
 Peters et al. (2010) Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In AAAI, pp. 1607–1612. Atlanta, 2010.
 Ross et al. (2013) Stéphane Ross, Narek Melik-Barkhudarov, Kumar Shaurya Shankar, Andreas Wendel, Debadeepta Dey, J Andrew Bagnell, and Martial Hebert. Learning monocular reactive UAV control in cluttered natural environments. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pp. 1765–1772. IEEE, 2013.
 Rubinstein & Kroese (2016) Reuven Y Rubinstein and Dirk P Kroese. Simulation and the Monte Carlo method, volume 10. John Wiley & Sons, 2016.
 Salimans et al. (2017) Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
 Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.
 Schulman et al. (2017a) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017a.
 Schulman et al. (2017b) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Stadie et al. (2015) Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
 Strehl & Littman (2008) Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
 Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063, 2000.
 Sutton (1984) Richard Stuart Sutton. Temporal credit assignment in reinforcement learning. 1984.
 Tang et al. (2017) Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2750–2759, 2017.
 Večerík et al. (2017) Matej Večerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
 Wulfmeier et al. (2017) Markus Wulfmeier, Ingmar Posner, and Pieter Abbeel. Mutual alignment transfer learning. arXiv preprint arXiv:1707.07907, 2017.
5 Appendix
5.1 Derivation of Gradient Approximation
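Let $d^*_\pi(s,a)$ and $d^*_E(s,a)$ be the exact state-action densities for the current policy ($\pi_\theta$) and the expert, respectively. Therefore, by definition, we have (up to a constant shift):
$$D_{JS}(\rho_{\pi_\theta}, \rho_{\pi_E}) = \mathbb{E}_{(s,a)\sim\rho_{\pi_\theta}}\left[\log\frac{d^*_\pi(s,a)}{d^*_\pi(s,a)+d^*_E(s,a)}\right] + \mathbb{E}_{(s,a)\sim\rho_{\pi_E}}\left[\log\frac{d^*_E(s,a)}{d^*_\pi(s,a)+d^*_E(s,a)}\right]$$
Now, $d^*_\pi(s,a)$ is a local surrogate to $\rho_{\pi_\theta}(s,a)$. By approximating it to be constant in an $\epsilon$-ball neighborhood around $\theta$, we get the following after taking the gradient of the above equation w.r.t. $\theta$:
$$\nabla_\theta D_{JS}(\rho_{\pi_\theta}, \rho_{\pi_E}) \approx \nabla_\theta\,\mathbb{E}_{(s,a)\sim\rho_{\pi_\theta}}\left[\log\frac{d^*_\pi(s,a)}{d^*_\pi(s,a)+d^*_E(s,a)}\right] = \mathbb{E}_{(s,a)\sim\rho_{\pi_\theta}}\left[\nabla_\theta\log\pi_\theta(a|s)\,\tilde{Q}^{\pi}(s,a)\right],$$
$$\text{where}\quad \tilde{Q}^{\pi}(s_t,a_t) = \mathbb{E}\left[\sum_{t'\ge t}\gamma^{t'-t}\log\frac{d^*_\pi(s_{t'},a_{t'})}{d^*_\pi(s_{t'},a_{t'})+d^*_E(s_{t'},a_{t'})}\right]$$
The last step follows directly from the policy gradient theorem (Sutton et al., 2000). Since we do not have the exact densities $d^*_\pi$ and $d^*_E$, we substitute them with the optimized density estimators $d_\pi$ and $d_E$ from the maximization in Equation 2 for computing $D_{JS}$. This gives us the gradient approximation mentioned in Section 2.2. A similar approximation is also used by Ho & Ermon (2016) for Generative Adversarial Imitation Learning (GAIL).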
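The role of the density ratio in this derivation can be illustrated with a minimal Python sketch (standard library only). It assumes we already have density estimates for the current policy and the expert along one trajectory; the function names and the toy numbers are hypothetical and are not taken from the authors' code.

```python
import math


def js_term(d_pi, d_e):
    """Per-timestep term log(d_pi / (d_pi + d_E)) for one state-action pair."""
    return math.log(d_pi / (d_pi + d_e))


def discounted_to_go(values, gamma=0.99):
    """Discounted sum of `values` from each timestep to the end of the trajectory."""
    out = [0.0] * len(values)
    running = 0.0
    for t in reversed(range(len(values))):
        running = values[t] + gamma * running
        out[t] = running
    return out


# Toy density estimates along a 3-step trajectory (illustrative numbers only).
d_pi = [0.5, 0.2, 0.1]
d_e = [0.5, 0.6, 0.9]

terms = [js_term(p, e) for p, e in zip(d_pi, d_e)]
# Plays the role of the Q-function in the policy gradient theorem.
q_tilde = discounted_to_go(terms, gamma=0.99)
```

Since the divergence is minimized, descending this gradient amounts to a policy-gradient step whose per-timestep reward is the negative of `js_term`, which is largest where the expert density dominates.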
5.2 Algorithm for Self-Imitation
Notation:
$\theta$ = Policy parameters
$\phi$ = Discriminator parameters
$r$ = Environment reward
5.3 Algorithm for Self-Imitating Diverse Policies
Notation:
$\theta_i$ = Policy parameters for rank $i$
$\phi_i$ = Self-imitation discriminator parameters for rank $i$
$\psi_i$ = Empirical density network parameters for rank $i$
5.4 Ablation Studies
We show the sensitivity of self-imitation to $\nu$ (the weight on the self-imitation signal) and to the capacity of the replay memory $\mathcal{M}_E$, denoted by $C$. The experiments in this subsection are done on Humanoid and Hopper tasks with episodic rewards. The tables show the average performance over 5 random seeds. For the ablation on $\nu$, $C$ is fixed at 10; for the ablation on $C$, $\nu$ is fixed at 0.8. With episodic rewards, a higher value of $\nu$ helps boost performance since the RL signal from the environment is weak. With $\nu = 0.8$, there isn't a single best choice for $C$, though all values of $C$ give better results than baseline PPO ($\nu = 0$).

| $\nu$ | Humanoid | Hopper |
|---|---|---|
| | 532 | 354 |
| | 395 | 481 |
| | 810 | 645 |
| 0.8 | 3602 | 2618 |
| | 3891 | 2633 |

| $C$ | Humanoid | Hopper |
|---|---|---|
| | 2861 | 1736 |
| | 2946 | 2415 |
| 10 | 3602 | 2618 |
| | 2667 | 1624 |
| | 4159 | 2301 |
5.5 Hyperparameters

Horizon (T) = 1000 (locomotion), 250 (Maze), 5000 (Swimming+Gathering)

Discount ($\gamma$) = 0.99

GAE parameter ($\lambda$) = 0.95

PPO internal epochs = 5

PPO learning rate = 1e-4

PPO minibatch = 64
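To show where the discount and GAE parameters enter, here is a minimal Python sketch of generalized advantage estimation using the two values listed above as defaults. The function name and the toy inputs are illustrative; this is not the authors' implementation.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), i.e. one extra bootstrap value at the end
    """
    assert len(values) == len(rewards) + 1
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# A reward that arrives only at the last step still produces a signal at
# earlier steps, decayed by gamma * lam per step.
adv = gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0])
```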
5.6 Leveraging Diverse Policies
The diversity-promoting repulsion can be used for purposes other than aiding exploration in the sparse environments considered thus far. First, we consider the paradigm of hierarchical reinforcement learning, wherein multiple sub-policies (or skills) are managed by a high-level policy that chooses the most apt sub-policy to execute at any given time. In Figure 4, we use the Swimmer environment from Gym and show that diverse skills (movements) can be acquired in a pretraining phase when repulsion is used. The skills can then be used in a difficult downstream task. During pretraining with SVPG, exploitation is done with policy gradients calculated using the norm of the velocity as dense rewards, while the exploration term uses the JS-kernel. As before, we compare an ensemble of 8 interacting agents with 8 independent agents. Figures 3(a) and 3(b) depict the paths taken by the Swimmer after training with independent and interacting agents, respectively. The latter exhibit variety. Figure 3(c) is the downstream task of Swimming+Gathering (Duan et al., 2016), where the bot has to swim and collect the green dots while avoiding the red ones. The utility of pretraining a diverse ensemble is shown in Figure 3(d), which plots the performance on this task while training a higher-level categorical manager policy.
Diversity can sometimes also help in learning a skill without any rewards from the environment, as observed by Eysenbach et al. (2018) in recent work. We consider a Hopper task with no rewards, but we do require weak supervision in the form of the length of each trajectory. Using policy gradients with the trajectory length as the reward, together with repulsion, we see the emergence of hopping behavior within an ensemble of 8 interacting agents. Videos of the skills acquired can be found at https://sites.google.com/site/tesr4t223424.
5.7 Performance on more MuJoCo tasks
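| Task | Episodic rewards (SI) | Episodic rewards (PPO) | Noisy rewards, each suppressed w/ 90% prob. (SI) | Noisy rewards, each suppressed w/ 90% prob. (PPO) | Noisy rewards, each suppressed w/ 50% prob. (SI) | Noisy rewards, each suppressed w/ 50% prob. (PPO) | Dense rewards, Gym default (SI) | Dense rewards, Gym default (PPO) |
|---|---|---|---|---|---|---|---|---|
| HalfCheetah | 3686 | 1572 | 3378 | 1670 | 4574 | 2374 | 4878 | 2422 |
| Reacher | 12 | 12 | 12 | 10 | 6 | 6 | 5 | 5 |
| Inv. Pendulum | 977 | 53 | 993 | 999 | 978 | 988 | 969 | 992 |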
5.8 Additional details on SVPG exploration with JS-kernel
5.8.1 SVPG formulation
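Let the policy be parameterized by $\theta$. To achieve diverse, high-return policies, we seek to obtain the distribution $q^*(\theta)$ which is the solution of the optimization problem: $\max_q \mathbb{E}_{\theta\sim q}[\eta(\pi_\theta)] + \alpha\mathcal{H}(q)$, where $\mathcal{H}(q) = \mathbb{E}_{\theta\sim q}[-\log q(\theta)]$ is the entropy of $q$. Solving the above equation by setting the derivative to zero yields an energy-based formulation for the optimal policy-parameter distribution: $q^*(\theta) \propto \exp(\eta(\pi_\theta)/\alpha)$. Drawing samples from this posterior using traditional methods such as MCMC is computationally intractable. Stein variational gradient descent (SVGD; Liu & Wang (2016)) is an efficient method for generating samples and also converges to the posterior of the energy-based model. Let $\{\theta_i\}_{i=1}^{n}$ be the particles that constitute the policy ensemble. SVGD provides an appropriate direction for perturbing each particle such that the induced KL-divergence between the particles and the target distribution is reduced. The perturbation (gradient) for particle $\theta_i$ is given by (please see Liu & Wang (2016) for the derivation):
$$\Delta\theta_i = \frac{1}{n}\sum_{j=1}^{n}\left[\nabla_{\theta_j}\log p(\theta_j)\,k(\theta_j,\theta_i) + \nabla_{\theta_j}k(\theta_j,\theta_i)\right]$$
where $p$ is the target distribution and $k(\cdot,\cdot)$ is a positive definite kernel function. Using $q^*(\theta) \propto \exp(\eta(\pi_\theta)/\alpha)$ as the target distribution, and $k(\theta_j,\theta_i) = \exp(-D_{JS}(\rho_j,\rho_i)/T)$ as the JS-kernel, we get the gradient direction for ascent:
$$\Delta\theta_i = \frac{1}{n}\sum_{j=1}^{n}\exp\left(-D_{JS}(\rho_j,\rho_i)/T\right)\left[\frac{1}{\alpha}\nabla_{\theta_j}\eta(\pi_{\theta_j}) - \frac{1}{T}\nabla_{\theta_j}D_{JS}(\rho_j,\rho_i)\right]$$
where $\rho_i$ is the state-action visitation distribution for policy $\pi_{\theta_i}$, and $T$ is the temperature. Also, for our case, $\nabla_{\theta_j}\eta(\pi_{\theta_j})$ is the interpolated gradient from self-imitation (Equation 5).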
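The structure of this update can be sketched in a few lines of Python (standard library only). The sketch assumes the standard SVGD direction with a kernel of the form exp(-D_JS / temperature), and it takes the per-policy return gradients, the pairwise JS divergences, and their gradients as given inputs. The function and argument names are hypothetical, and in practice each of these inputs has to be estimated from rollouts.

```python
import math


def svpg_directions(policy_grads, js_div, js_grads, alpha=1.0, temperature=1.0):
    """Ascent direction for every particle under a JS-kernel.

    policy_grads[j] : gradient of policy j's return w.r.t. theta_j (list of floats)
    js_div[j][i]    : D_JS(rho_j, rho_i), with js_div[i][i] == 0
    js_grads[j][i]  : gradient of D_JS(rho_j, rho_i) w.r.t. theta_j (list of floats)
    """
    n = len(policy_grads)
    dim = len(policy_grads[0])
    directions = []
    for i in range(n):
        direction = [0.0] * dim
        for j in range(n):
            # Kernel weight: nearby policies (small divergence) interact strongly.
            k = math.exp(-js_div[j][i] / temperature)
            for d in range(dim):
                exploit = policy_grads[j][d] / alpha          # weighted return gradient
                repulse = -js_grads[j][i][d] / temperature    # kernel-gradient term
                direction[d] += k * (exploit + repulse) / n
        directions.append(direction)
    return directions


# Two 1-D particles with made-up gradients.
grads = [[1.0], [1.0]]
div = [[0.0, 0.5], [0.5, 0.0]]
dgrads = [[[0.0], [-2.0]], [[2.0], [0.0]]]
dirs = svpg_directions(grads, div, dgrads)
```

When all pairwise divergences are very large, the kernel weights vanish and each particle follows only its own return gradient (scaled by 1/n), so the ensemble behaves like independent agents.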
5.8.2 Implementation details
The gradient $-\nabla_{\theta_j}D_{JS}(\rho_j,\rho_i)$ in the above equation is the repulsion factor that pushes $\pi_{\theta_i}$ away from $\pi_{\theta_j}$. Similar repulsion can be achieved by using the gradient $+\nabla_{\theta_i}D_{JS}(\rho_j,\rho_i)$; note that this gradient is w.r.t. $\theta_i$ instead of $\theta_j$ and the sign is reversed. Empirically, we find that the latter results in slightly better performance.
Estimation of $\nabla_{\theta_i}D_{JS}(\rho_j,\rho_i)$: This can be done in two ways, using implicit or explicit distributions. In the implicit method, we could train a parameterized discriminator network ($\phi_{ij}$) using state-action pairs from $\pi_{\theta_i}$ and $\pi_{\theta_j}$ to implicitly approximate the ratio $r_{ij} = \rho_i/(\rho_i+\rho_j)$. We could then use the policy gradient theorem to obtain the gradient of $D_{JS}(\rho_j,\rho_i)$ as explained in Section 2.2. This, however, requires us to learn $O(n^2)$ discriminator networks for a population of size $n$, one for each policy pair $(i,j)$. To reduce the computational and memory resource burden to $O(n)$, we opt for explicit modeling of $\rho_i$. Specifically, we train a network $\rho_{\psi_i}$ to approximate the state-action visitation density for each policy $\pi_{\theta_i}$. The networks $\rho_{\psi_1},\dots,\rho_{\psi_n}$ are learned using the optimization (Equation 2), and we can easily obtain the ratio $r_{ij}(s,a) = \rho_{\psi_i}(s,a)/(\rho_{\psi_i}(s,a)+\rho_{\psi_j}(s,a))$. The agent then uses $\log r_{ij}(s,a)$ as the SVPG exploration rewards in the policy gradient theorem.
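The trade-off between the two estimation strategies can be made concrete with a short Python sketch. It assumes the pairwise ratio has the form rho_i / (rho_i + rho_j), by analogy with the discriminator ratio used for self-imitation; the helper names and density values are hypothetical.

```python
import math


def pairwise_log_ratio(densities, i, j):
    """log( rho_i / (rho_i + rho_j) ) at a single state-action pair.

    densities[k] is the estimated visitation density of policy k at that pair.
    """
    return math.log(densities[i] / (densities[i] + densities[j]))


def num_networks(n, explicit=True):
    """Models to train: one density network per policy (explicit),
    or one discriminator per unordered policy pair (implicit)."""
    return n if explicit else n * (n - 1) // 2


# Density estimates of three policies at one state-action pair (made-up values).
densities = [0.2, 0.6, 0.2]
log_r_01 = pairwise_log_ratio(densities, 0, 1)
log_r_10 = pairwise_log_ratio(densities, 1, 0)
```

With explicit density models, every pairwise ratio is available from the same `n` networks, whereas the implicit route needs a separate discriminator for each pair.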
State-value baselines: We use state-value function networks as baselines to reduce the variance in sampled policy gradients. Each agent in the population trains separate state-value networks corresponding to the real environment rewards, the self-imitation rewards, and the SVPG exploration rewards.
5.9 Comparison to Oh et al. (2018)
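In this section, we provide an evaluation of a recently proposed method for self-imitation learning (SIL; Oh et al. (2018)). The SIL loss function takes the form:
$$\mathcal{L}^{sil} = \mathbb{E}_{s,a,R\in\mathcal{D}}\left[\mathcal{L}^{sil}_{policy} + \beta^{sil}\mathcal{L}^{sil}_{value}\right],\qquad \mathcal{L}^{sil}_{policy} = -\log\pi_\theta(a|s)\,(R - V_\theta(s))_+,\qquad \mathcal{L}^{sil}_{value} = \frac{1}{2}\left\|(R - V_\theta(s))_+\right\|^2$$
where $(\cdot)_+ = \max(\cdot, 0)$. In words, the algorithm buffers $(s,a)$ and the corresponding return $R$ for each transition in rolled trajectories, and reuses them for training if the stored return value is higher than the current state-value estimate $V_\theta(s)$.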
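As a minimal sketch of the commonly stated form of the SIL objective, the Python function below computes the loss for a batch of stored transitions: a policy term weighted by the clipped advantage, plus a value term on the same clipped quantity. The function name and the default value of `beta` are assumptions for illustration, not the authors' settings.

```python
import math


def sil_loss(log_probs, returns, values, beta=0.01):
    """Average SIL loss over a batch of buffered transitions.

    log_probs[k] : log pi(a_k | s_k) under the current policy
    returns[k]   : stored return R_k for that transition
    values[k]    : current state-value estimate V(s_k)
    """
    total = 0.0
    for logp, ret, val in zip(log_probs, returns, values):
        clipped = max(ret - val, 0.0)      # only returns above the estimate count
        policy_term = -logp * clipped
        value_term = 0.5 * clipped ** 2
        total += policy_term + beta * value_term
    return total / len(log_probs)


# The second transition has R < V, so it contributes nothing to the loss.
loss = sil_loss([math.log(0.5), math.log(0.5)], [2.0, 0.0], [1.0, 1.0], beta=0.5)
```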
We use the code provided by the authors (https://github.com/junhyukoh/self-imitation-learning). As per our understanding, PPO+SIL does not use a single set of hyperparameters for all the MuJoCo tasks (Appendix A; Oh et al. (2018)). We follow their methodology and report numbers for the best configuration for each task. This is different from our experiments, since we run all tasks on a single fixed hyperparameter set (Appendix 5.5), and therefore a direct comparison of the average scores between the two approaches is tricky.
| Task | SIL, dense rewards (Oh et al., 2018) | SIL, episodic rewards | SIL, noisy rewards (each suppressed w/ 90% prob.) | SIL, noisy rewards (each suppressed w/ 50% prob.) |
|---|---|---|---|---|
| Walker | 3973 | 257 | 565 | 3911 |
| Humanoid | 3610 | 530 | 1126 | 3460 |
| HumanoidStandup ( 10) | 18.9 | 4.9 | 14.9 | 18.8 |
| Hopper | 1983 | 563 | 1387 | 1723 |
| Swimmer | 120 | 17 | 50 | 100 |
| InvertedDoublePendulum | 6250 | 405 | 6563 | 6530 |
Table 3 shows the performance of PPO+SIL on MuJoCo tasks under the various reward distributions explained in Section 3.1: dense, episodic, and noisy. We observe that, compared to the dense rewards setting (default Gym rewards), the performance suffers in the episodic case and when the rewards are masked out with 90% probability. Our intuition is as follows. PPO+SIL makes use of the cumulative return from each transition of a past good rollout for the update. When rewards are provided only at the end of the episode, for instance, the cumulative return does not help with the temporal credit assignment problem and hence is not a strong learning signal. Our approach, on the other hand, derives dense, per-timestep rewards using an objective based on divergence minimization. This is useful for credit assignment and, as indicated in Table 1 (Section 3.1), leads to learning good policies even under the episodic and noisy settings.
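The intuition about cumulative returns can be checked on a toy reward sequence. The Python sketch below assumes that "episodic" means the summed reward is delivered at the final step and that "noisy" means each reward is independently zeroed with a fixed probability; Section 3.1 gives the actual definitions, and the helper names here are hypothetical.

```python
import random


def to_episodic(rewards):
    """Withhold all rewards and deliver their sum at the final step."""
    if not rewards:
        return []
    return [0.0] * (len(rewards) - 1) + [float(sum(rewards))]


def to_noisy(rewards, mask_prob, rng):
    """Independently zero out each reward with probability mask_prob."""
    return [0.0 if rng.random() < mask_prob else r for r in rewards]


def returns_to_go(rewards, gamma=1.0):
    """Cumulative (optionally discounted) return from each timestep onward."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


dense = [1.0, 2.0, 3.0]
episodic = to_episodic(dense)
noisy = to_noisy(dense, mask_prob=0.9, rng=random.Random(0))

dense_returns = returns_to_go(dense)        # differs across timesteps
episodic_returns = returns_to_go(episodic)  # identical at every timestep
```

Under the episodic transformation, the undiscounted return-to-go is the same at every timestep, so it cannot distinguish which actions in the rollout were responsible for the outcome.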
5.10 Comparison to off-policy RL (Q-learning)
Our approach makes use of replay memory to store the past good rollouts of the agent. Off-policy RL methods such as DQN (Mnih et al., 2015) also accumulate agent experience in a replay buffer and reuse it for learning (e.g. by reducing TD-error). In this section, we evaluate the performance of one such recent algorithm, Twin Delayed Deep Deterministic policy gradient (TD3; Fujimoto et al. (2018)), on tasks with episodic and noisy rewards. TD3 builds on DDPG (Lillicrap et al., 2015) and surpasses its performance on all the MuJoCo tasks evaluated by the authors.
| Task | TD3, dense rewards (Fujimoto et al., 2018) | TD3, episodic rewards | TD3, noisy rewards (each suppressed w/ 90% prob.) | TD3, noisy rewards (each suppressed w/ 50% prob.) |
|---|---|---|---|---|
| Walker | 4352 | 189 | 395 | 2417 |
| Hopper | 3636 | 402 | 385 | 1825 |
| InvertedDoublePendulum | 9350 | 363 | 948 | 4711 |
| Swimmer* | | | | |
| HumanoidStandup* | | | | |
| Humanoid* | | | | |
Table 4 shows that the performance of TD3 suffers appreciably with the episodic and noisy reward settings, indicating that popular off-policy algorithms (DDPG, TD3) do not exploit past experience in a manner that accelerates learning when rewards are scarce during an episode.
* For 3 tasks used in our paper (Swimmer and the high-dimensional Humanoid and HumanoidStandup), the TD3 code from the authors (https://github.com/sfujim/TD3) is unable to learn a good policy even in the presence of dense rewards (default Gym rewards). These tasks are also not included in the evaluation by Fujimoto et al. (2018).
5.11 Comparing SVPG exploration to a novelty-based baseline
We run a new exploration baseline, EX2 (Fu et al., 2017), and compare its performance to SI-interact-JS on the hard exploration MuJoCo tasks considered in Section 3.2. The EX2 algorithm does implicit state-density estimation using discriminative modeling, and uses it for novelty-based exploration by adding a density-based bonus to the reward. We used the author-provided code (https://github.com/jcoreyes/ex2) and hyperparameter settings. TRPO is used as the policy gradient algorithm.
| Task | EX2 | SI-interact-JS |
|---|---|---|
| SparseHalfCheetah | 286 | 769 |
| SparseHopper | 1477 | 1949 |
| SparseAnt | 3.9 | 208 |