In reinforcement learning (RL), the goal is to train a policy to interact with an environment such that the policy yields the maximal expected return. While typical RL methods train a single parameterized policy, ensemble methods that share experiences amongst several function approximators (osband2017deep; osband2016deep) have achieved superior performance. Unlike typical RL methods, osband2017deep train an ensemble of neural network (NN) policies with distinct initial weights (i.e., parameters of NNs) simultaneously, sharing experiences amongst the policies. These shared experiences are collected by randomly selecting a policy from the ensemble to perform an episode. This episode of experiences is added to a shared experience replay buffer (mnih2015human) that is used to train all members of the ensemble. Learning from shared experience allows for more efficient policy learning, since randomly initialized policies result in extensive exploration of the environment. Though RL from shared experiences has shown considerable improvement over single-policy RL methods, other lines of work (hester2018dqnfd) show that directly imitating an expert’s experiences in a supervised manner can accelerate RL.
Motivated by these results, which demonstrate that direct imitation can accelerate RL, we propose Periodic Intra-Ensemble Knowledge Distillation (PIEKD), a framework that not only trains an ensemble of policies via common experience but also shares the knowledge of the best-performing policy amongst the ensemble. Previous works on ensemble RL have shown that randomly initialized policies can result in adequate behavioral diversity (osband2016deep). Thus, PIEKD begins by initializing each policy in the ensemble with different weights to perform extensive exploration in the environment. As the behaviors of these policies are diverse, at any given time during the course of training, one policy is naturally superior to the others. This policy is then used to improve the quality of the other policies in the ensemble, so that they do not have to improve solely through experience. To use the best policy to improve other policies, PIEKD employs knowledge distillation (hinton2015distilling), which is effective at transferring knowledge between neural networks. By using knowledge distillation, we can encourage policies in the ensemble to act in a manner similar to the best policy, enabling them to rapidly improve and continue optimizing for the optimal policy from better starting points. Prior work (rusu2015policy) has shown that several specialized policies can be distilled into a single multitask policy, demonstrating that distillation can add behaviors to a policy without destroying existing knowledge. These results suggest that in PIEKD, despite the use of distillation between policies, their inherent knowledge is still preserved, improving individual policies without destroying the diversity amongst policies. An abstract overview of PIEKD is depicted in Figure 1.
This paper’s primary contribution is Periodic Intra-Ensemble Knowledge Distillation (PIEKD), a simple yet effective framework for off-policy RL that jointly trains an ensemble of policies while periodically performing knowledge sharing. We demonstrate empirically that PIEKD can improve upon the state-of-the-art soft actor-critic (SAC) (haarnoja2018soft) on a suite of challenging MuJoCo tasks, exhibiting superior sample efficiency. We further validate the effectiveness of distillation for knowledge sharing by comparing against other forms of sharing knowledge.
The remainder of this paper is organized as follows. Section 2 discusses the related work in ensemble RL and knowledge distillation. Section 3 provides a brief overview of the reinforcement learning formulation. Section 4 describes PIEKD. Section 5 presents our experimental findings. Lastly, Section 6 summarizes our contributions and outlines potential avenues for future work.
2 Related work
The works that are most related to PIEKD (osband2016deep; osband2017deep) train multiple policies via shared experience for the same task through RL, where the shared experiences are collected by all policies in the ensemble and stored in a common buffer, as in our method. Differing from those works (osband2016deep; osband2017deep), we additionally perform periodic knowledge distillation between policies of the ensemble. Other related methods aggregate multiple policies to select actions (gimelfarb2018reinforcement; tham1995reinforcement). abel2016exploratory sequentially train a series of policies, boosting the learning performance by using the errors of a prior policy. However, rather than perform decision aggregation or sequentially boosted training, we focus on improving the performance of each individual policy via knowledge sharing amongst jointly trained policies.
rusu2015policy train a single neural network to perform multiple tasks by transferring multiple pretrained policies to a single network through distillation. hester2018dqnfd and nair2018overcoming accelerate RL agents’ training progress through human experts’ guidance. Rather than experts’ policies, nagabandi2018mpc, levine2013gps and zhang2016mpc leverage model-based controllers’ behaviors to facilitate training for RL agents. Additionally, oh2018self train RL agents to imitate past successful self-experiences or policies. Orthogonal to the aforementioned works, PIEKD periodically exploits the current best policy within the ensemble and shares its knowledge amongst the ensemble.
In other machine learning areas, zhang2018deep train multiple models that mutually imitate each other’s outputs on classification tasks. Our distillation procedure is not mutual, but flows in a single direction, from a superior teacher policy to other student policies in the ensemble. Subsequent work by lan2018knowledge trains an ensemble of models to imitate a stronger teacher model that aggregates all of the ensemble models’ predictions. Our method differs from the above methods by periodically electing the teacher for distillation to other ensemble members. We maintain the distinction between ensemble members rather than aggregate them into a single policy.
teh2017distral and ghosh2017divide distill multiple task-specific policies to a central multi-task policy and constrain the mutual divergence between each task-specific policy and the central one. galashov2019information learn a task-specific policy while bounding the divergence between this task-specific policy and some generic policy that can perform basic task-agnostic behaviors. czarnecki2018mix gradually transfer the knowledge of a simple policy to a complex policy during the course of joint training. Our work differs from the aforementioned works in several aspects. First, our method periodically elects a teacher policy for sharing knowledge rather than constraining the mutual policy divergence (teh2017distral; ghosh2017divide; galashov2019information). Second, our method does not rely on training heterogeneous policies (e.g., a simple policy and a complex policy (czarnecki2018mix)), which makes our method more generally applicable. Finally, as opposed to teh2017distral and ghosh2017divide, we consider single-task settings rather than multi-task settings.
Population-based methods similarly employ multiple policies in separate copies of environments to find the optimal policy. Evolutionary Algorithms (EA) (salimans2017evolution; gangwani2017policy; khadka2018evolution) randomly perturb the parameters of policies in the population, eliminate underperforming policies by evaluating the policies’ performances in the environment, and produce new generations of policies from the remaining policies. Unlike EA, our method relies neither on separate copies of environments nor on eliminating existing policies from the population. Instead, our method focuses on continuously improving the existing policies. In addition to EA, other work (Jung2020Population-Guided) done concurrently with ours adds a regularization term that forces each agent to imitate the best agent’s policy when performing policy updates at each step. Differing from PIEKD, they train multiple agents in separate copies of the environment in parallel. Because it does not rely on multiple copies of the environment, our method is more applicable when interaction with the environment is expensive or when setting up multiple environments is costly (e.g., robot learning in the real world).
In this section we describe the general framework of RL. RL formalizes a sequential decision-making task as a Markov decision process (MDP) (sutton1998introduction). An MDP consists of a state space $\mathcal{S}$, a set of actions $\mathcal{A}$, a (potentially stochastic) transition function $\mathcal{T}$, a reward function $R$, and a discount factor $\gamma$. An RL agent performs episodes of a task where the agent starts in a random initial state $s_0$, sampled from the initial state distribution $\rho_0$, and performs actions, which transition the agent to new states and for which the agent receives rewards. More generally, at timestep $t$, an agent in state $s_t$ performs an action $a_t$, receives a reward $r_t$, and transitions to a new state $s_{t+1}$, according to the transition function $\mathcal{T}$. The discount factor $\gamma$ is used to indicate the agent’s preference for short-term rewards over long-term rewards.
An RL agent performs actions according to its policy, a conditional probability distribution $\pi_\theta(a \mid s)$, where $\theta$ denotes the parameters of the policy, which may be the parameters of a neural network. RL methods iteratively update $\theta$ via rollouts of experience $\tau = \{(s_t, a_t, r_t, s_{t+1})\}_{t=0}^{H}$, seeking within the parameter space $\Theta$ the optimal $\theta^*$ that maximizes the expected return $\mathbb{E}\left[\sum_{t'=t}^{H} \gamma^{t'-t} r_{t'}\right]$ at each $t$ within an episode.
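As a small illustration of the discounted return that the agent seeks to maximize, the following sketch computes the return from every timestep of a single finished episode. The function name `discounted_returns` and the toy reward sequence are ours and purely illustrative; `gamma` stands for the discount factor.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted sum of future rewards from every timestep."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each return reuses the one after it:
    # G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 0.0, 2.0]
print(discounted_returns(rewards, gamma=0.5))  # [1.5, 1.0, 2.0]
```

A smaller `gamma` shrinks the contribution of later rewards, which is the sense in which the discount factor expresses a preference for short-term rewards.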
In this section, we formally present the technical details of our method, Periodic Intra-Ensemble Knowledge Distillation (PIEKD). We start by providing an overview of PIEKD and then describe its components in detail.
PIEKD maintains an ensemble of policies that collect different experiences on the same task, and periodically shares knowledge amongst the policies in the ensemble. PIEKD is separated into three phases: ensemble initialization, joint training, and intra-ensemble knowledge distillation. First, the ensemble initialization phase randomly initializes an ensemble of policies with different parameters to achieve behavioral diversity. In the joint training phase, a policy randomly selected from the ensemble is used to execute an episode in the environment, and its experience is then stored in a shared experience replay buffer that is used to train each policy. In the last phase, we perform intra-ensemble knowledge distillation, where we elect a teacher policy from the ensemble to guide the other policies towards better behaviors. To this end, we distill (hinton2015distilling) the best-performing policy to the others. Algorithm 1 and Figure 2 summarize our method. In this paper, we apply PIEKD to the state-of-the-art off-policy RL algorithm, soft actor-critic (SAC) (haarnoja2018learning).
4.2 Ensemble initialization
In the ensemble initialization phase, we randomly initialize $K$ policies in the ensemble. Each policy $\pi_{\theta_k}$ is instantiated with a model parameterized by $\theta_k$, where $k$ stands for the policy’s index in the ensemble. $\theta_k$ is initialized by sampling from the uniform distribution over the parameter space $\Theta$, which contains all possible values of $\theta$: $\theta_k \sim \mathcal{U}(\Theta)$. Despite the simplicity of uniform distributions used for initialization, osband2016deep show that uniformly random initialization can provide adequate behavioral diversity. In this paper, we represent each $\pi_{\theta_k}$ as a neural network (NN), though other parametric models can be used.
Since SAC learns both a policy and a critic function that values states or state-action pairs from past experiences stored in a replay buffer (mnih2015human), we create a shared replay buffer $\mathcal{D}$ for all policies in the ensemble and randomly initialize a NN critic function $Q_{\phi_k}$ for each policy $\pi_{\theta_k}$, where $\phi_k$ stands for the NN’s weights for the critic $Q_{\phi_k}$.
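The initialization phase can be sketched as follows, under strong simplifying assumptions: each policy and critic is reduced to a flat list of weights, the sampling bounds `low` and `high` are arbitrary, and the name `init_ensemble` and the buffer capacity are hypothetical choices of ours rather than values from the paper.

```python
import random
from collections import deque


def init_ensemble(ensemble_size, num_policy_params, num_critic_params,
                  low=-0.1, high=0.1, buffer_capacity=1_000_000, seed=0):
    """Create per-member policy/critic weights and one shared replay buffer."""
    rng = random.Random(seed)

    def sample_params(n):
        # Every call draws fresh uniform values, so each ensemble member
        # starts from different weights.
        return [rng.uniform(low, high) for _ in range(n)]

    policies = [sample_params(num_policy_params) for _ in range(ensemble_size)]
    critics = [sample_params(num_critic_params) for _ in range(ensemble_size)]
    replay_buffer = deque(maxlen=buffer_capacity)  # shared by all members
    return policies, critics, replay_buffer


policies, critics, replay_buffer = init_ensemble(
    ensemble_size=3, num_policy_params=4, num_critic_params=4)
```

In practice the weights would belong to neural networks initialized by a deep learning library; the point here is only that every member has its own policy and critic while the buffer is created once.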
4.3 Joint training
Each joint training phase consists of $T$ timesteps. For each episode, we select a policy $\pi_{\theta_k}$ in the ensemble to act in the environment (hereinafter, we refer to this process as “policy selection”). The policy selection strategy is a way of selecting a policy from the ensemble to perform an episode in the environment. This episode $\tau_k$ is stored in a shared experience replay buffer $\mathcal{D}$, and the policy’s recent episodic performance statistic $\bar{R}_k$ is updated according to the return achieved in $\tau_k$, where $\bar{R}_k$ is the average episodic return in the most recent $M$ episodes. The episodic performance statistics will later be used in the intra-ensemble distillation phase (Section 4.4). In this paper, we adopt a simple uniform random policy selection strategy: $k \sim \mathcal{U}(\{1, \dots, K\})$.
After selecting a policy $\pi_{\theta_k}$, which performs an episode $\tau_k$, we store $\tau_k$ in $\mathcal{D}$ (line 11 of Algorithm 1). Then, we sample data from $\mathcal{D}$ and update all policies and critics using SAC (lines 12-13). Since off-policy RL methods like SAC do not require that the training data be generated by the policy that is being updated, they enable our policies to learn from the trajectories generated by other policies of the ensemble. The details of the update routine for the policy and the critic are taken from the original SAC paper (haarnoja2018soft).
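The control flow of one joint training phase can be sketched as below. This is an illustration rather than the authors’ implementation: `run_episode` and `update` are hypothetical callbacks standing in for the environment rollout and the SAC update, and performing one update per member after each episode is our simplifying assumption.

```python
import random
from collections import deque


def joint_training_phase(policies, replay_buffer, recent_returns,
                         run_episode, update, phase_timesteps, rng):
    """Run episodes until the phase has consumed `phase_timesteps` steps."""
    steps = 0
    while steps < phase_timesteps:
        # Uniform random policy selection.
        k = rng.randrange(len(policies))
        transitions, episode_return = run_episode(policies[k])
        # All members share one buffer, whoever collected the data.
        replay_buffer.extend(transitions)
        # Bounded deque: only the most recent returns of member k are kept.
        recent_returns[k].append(episode_return)
        # Off-policy updates let every member learn from the shared data.
        for j in range(len(policies)):
            update(j, replay_buffer)
        steps += len(transitions)
    return steps


# Toy usage with stand-in callbacks (no real environment or SAC update).
rng = random.Random(0)
policies = ["pi_0", "pi_1", "pi_2"]
buffer = deque(maxlen=1000)
recent = [deque(maxlen=5) for _ in policies]
update_log = []


def toy_episode(policy):
    return [(policy, t) for t in range(10)], 1.0


def toy_update(j, replay_buffer):
    update_log.append(j)


steps = joint_training_phase(policies, buffer, recent, toy_episode,
                             toy_update, phase_timesteps=30, rng=rng)
```

Keeping `recent_returns` as one bounded deque per member means the average over the most recent episodes is available to the later teacher election without extra bookkeeping.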
4.4 Intra-ensemble knowledge distillation
The intra-ensemble knowledge distillation phase consists of two stages: teacher election and knowledge distillation. The teacher election stage (line 18 of Algorithm 1) selects a policy from the ensemble to serve as the teacher for the other policies. In our experiments, we use the natural selection criterion of choosing the best-performing policy as the teacher. Specifically, we select the policy that has the highest average recent episodic performance recorded in the joint training phase (Section 4.3), namely $k^* = \arg\max_{k \in \{1, \dots, K\}} \bar{R}_k$, where $k^*$ is the index of the teacher and $\bar{R}_k$ is the average recent episodic return of the $k$-th policy. Rather than use a policy’s most recent episodic performance, we use its average return over its previous $M$ episodes, to reduce the noise in our estimate of the policy’s performance.
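Teacher election amounts to an argmax over the per-policy averages of recent episodic returns. A minimal sketch follows; the name `elect_teacher` is ours, and treating a policy with no recorded episodes as ineligible is an assumption that the text does not address.

```python
from collections import deque


def elect_teacher(recent_returns):
    """Return the index of the policy with the highest average recent return."""
    averages = [
        sum(returns) / len(returns) if returns else float("-inf")
        for returns in recent_returns
    ]
    # max() keeps the first index on ties, so election is deterministic.
    return max(range(len(averages)), key=averages.__getitem__)


history = [deque([100.0, 300.0]), deque([250.0]), deque()]
teacher_index = elect_teacher(history)
print(teacher_index)  # 1
```

Averaging over a window rather than using only the latest episode keeps a single lucky or unlucky rollout from deciding the election.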
Next, the elected teacher guides the other policies in the ensemble towards better policies (lines 19-20 of Algorithm 1). This is done through knowledge distillation (hinton2015distilling), which has been shown to be effective at guiding a neural network to behave similarly to another. To distill from the teacher to the students (i.e., the other policies in the ensemble), the teacher samples experiences from the buffer $\mathcal{D}$ and instructs each student to match the teacher’s outputs on these samples. After distillation, the students acquire the teacher’s knowledge, enabling them to correct their low-rewarding behaviors and reinforce their high-rewarding behaviors, without forgetting their previously learned behaviors (rusu2015policy; teh2017distral). Specifically, writing $k^*$ for the teacher’s index, the policy distillation process is formalized as updating each $\theta_k$ in the direction of

$$-\nabla_{\theta_k} \mathbb{E}_{s \sim \mathcal{D}}\left[ D_{\mathrm{KL}}\left( \pi_{\theta_{k^*}}(\cdot \mid s) \,\|\, \pi_{\theta_k}(\cdot \mid s) \right) \right],$$

where the Kullback–Leibler divergence ($D_{\mathrm{KL}}$) is a principled way to measure the dissimilarity between two probability distributions (i.e., policies). Note that when applying PIEKD to SAC, we must additionally distill the critic function from the teacher to the students, where each critic function is updated in the direction of

$$-\nabla_{\phi_k} \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[ \left( Q_{\phi_{k^*}}(s, a) - Q_{\phi_k}(s, a) \right)^2 \right],$$

where $\phi_{k^*}$ and $\phi_k$ denote the parameters of the critic functions, and $Q_{\phi_{k^*}}$ and $Q_{\phi_k}$ denote the critic functions corresponding to the teacher’s policy and the student’s policy, respectively.
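To make the two distillation objectives concrete, the sketch below evaluates them on a batch of sampled states. It rests on assumptions that we state explicitly: policies are taken to output the mean and standard deviation of a diagonal Gaussian over actions (ignoring any action squashing), the divergence is taken from the teacher to the student, and the critic objective is a mean squared difference. All function names are illustrative, and the gradient step itself is left to an automatic differentiation library.

```python
import math


def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """Closed-form KL(p || q) between diagonal Gaussians, summed over dims."""
    return sum(
        math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
        for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q)
    )


def policy_distillation_loss(teacher_outputs, student_outputs):
    """Mean KL(teacher || student) over a batch of sampled states.

    Each element is a (mean, std) pair of per-dimension lists.
    """
    kls = [
        gaussian_kl(t_mu, t_std, s_mu, s_std)
        for (t_mu, t_std), (s_mu, s_std) in zip(teacher_outputs, student_outputs)
    ]
    return sum(kls) / len(kls)


def critic_distillation_loss(teacher_q, student_q):
    """Mean squared difference between teacher and student Q-values."""
    return sum((t - s) ** 2 for t, s in zip(teacher_q, student_q)) / len(teacher_q)


teacher = [([0.0], [1.0]), ([0.0], [1.0])]
student = [([1.0], [1.0]), ([0.0], [1.0])]
print(policy_distillation_loss(teacher, student))       # 0.25
print(critic_distillation_loss([1.0, 2.0], [1.0, 4.0]))  # 2.0
```

Both losses are zero exactly when the student reproduces the teacher on the sampled batch, so minimizing them pulls the student towards the teacher only on states that appear in the shared buffer.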
The experiments are designed to answer the following questions: (1) Can PIEKD improve upon the data efficiency of state-of-the-art RL? (2) Is knowledge distillation effective at sharing knowledge? (3) Is it necessary to choose the best-performing agent to be the teacher? Next, we show our experimental findings for each of the aforementioned questions, and discuss their implications.
5.1 Experimental setup
Our goal is to demonstrate how PIEKD improves the sample efficiency of an RL algorithm. Since soft actor-critic (SAC) (haarnoja2018soft) exhibits state-of-the-art performance across several continuous control tasks, we build on top of SAC. We directly use the hyperparameters for SAC from the original paper (haarnoja2018soft) in all of our experiments (code: https://github.com/pfnet-research/piekd). Unless stated otherwise, the hyperparameters used for PIEKD (Algorithm 1) are , and . The value of is tuned via grid search over . We tried different ensemble size configurations () and decided on . For the remainder of our experiments, we refer to PIEKD applied to SAC as SAC-PIEKD.
We use the MuJoCo benchmark tasks from OpenAI Gym (openaigym), as used in the original SAC paper (haarnoja2018soft). We choose most of the tasks selected in the original paper (haarnoja2018soft) to evaluate the performance of our method. The description of each task can be found in the source code for OpenAI Gym (openaigym).
We adapt the evaluation approach from the original SAC paper (haarnoja2018soft). We train each agent for 1 million timesteps (i.e., interactions with the environment), and run 20 evaluation episodes after every 10,000 timesteps, where the performance is the mean return of these 20 evaluation episodes. We repeat this entire process across 5 different runs, each with a different random seed. We plot the mean value and confidence interval of the mean episodic return at each stage of training. The mean value and confidence interval are depicted by the solid line and shaded area, respectively. The confidence interval is estimated by the bootstrap method. At each evaluation point, we report the highest mean episodic return amongst the agents in the ensemble. In some curves, we additionally report the lowest mean episodic return amongst the agents in the ensemble.
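For readers unfamiliar with bootstrapped confidence intervals, the sketch below shows one common variant, the percentile bootstrap, applied to the per-run scores at a single evaluation point. The 95% level, the number of resamples, and the example scores are our assumptions for illustration; the text does not specify which bootstrap variant or level was used.

```python
import random


def bootstrap_ci(values, num_resamples=1000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(num_resamples)
    )
    lower_index = int((1 - confidence) / 2 * num_resamples)
    upper_index = int((1 + confidence) / 2 * num_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical mean episodic returns from 5 runs at one evaluation point.
run_scores = [3200.0, 3400.0, 3100.0, 3600.0, 3300.0]
lower, upper = bootstrap_ci(run_scores)
```

With only 5 runs the interval is necessarily coarse, since every resampled mean is built from the same five numbers.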
5.2 Effectiveness of PIEKD
In order to evaluate the effectiveness of intra-ensemble knowledge distillation, we compare SAC-PIEKD against two baselines: Vanilla-SAC and Ensemble-SAC. Vanilla-SAC denotes the original SAC; Ensemble-SAC is the analogous variant of osband2016deep’s method for ensemble Q-learning, applied to SAC. At its core, Osband’s method involves an ensemble of policies that interact with the environment and generate experiences. These experiences are then used to train the entire ensemble using an off-policy RL algorithm, such as Q-learning or off-policy actor-critic methods. Thus, our Ensemble-SAC baseline denotes the training of an ensemble of policies through SAC while sharing experiences amongst the ensemble members via a shared replay buffer. Effectively, Ensemble-SAC is SAC-PIEKD without the intra-ensemble knowledge distillation phase. For both Ensemble-SAC and SAC-PIEKD, we set the ensemble size to 3.
Our results are shown in Figure 3. Note that we also plot the worst evaluation in the ensemble at each evaluation phase to provide insight into the general performance of the ensemble. In all tasks, we outperform all baselines, including Vanilla-SAC and Ensemble-SAC, in terms of sample efficiency. Visually we can see that throughout training, we have consistently better performance at similar amounts of experience, indicating that our method can achieve higher performance with the same number of experiences relative to our baselines.
SAC-PIEKD usually reaches the best baseline’s convergent performance in half of the environment interactions. We even find that in the majority of tasks, the worst evaluation in our ensemble outperforms the baseline methods. This demonstrates that all policies of the ensemble are significantly improving, and our method’s superior performance is not simply a consequence of selecting the best agent in the ensemble. In particular, SAC-PIEKD’s superiority over Ensemble-SAC highlights the effectiveness of supplementing shared experiences (Ensemble-SAC) with knowledge distillation. In summary, Figure 3 demonstrates the effectiveness of PIEKD in enhancing the data efficiency of RL algorithms.
5.3 Effectiveness of knowledge distillation for knowledge sharing
In this section, we investigate the advantage of using knowledge distillation for knowledge sharing. We consider two alternative approaches towards sharing knowledge, other than distillation. First, we consider sharing knowledge by simply providing agents with additional policy updates (in lieu of distillation updates) using the shared experiences. We also consider directly copying the neural network as opposed to performing distillation. Below, we compare these two approaches against knowledge distillation.
Section 5.2 has shown that Ensemble-SAC, which updates all agents’ policies through shared experiences, fails to learn as efficiently as SAC-PIEKD. However, SAC-PIEKD uses additional gradient updates during the knowledge distillation phase, whereas Ensemble-SAC only performs joint training and lacks an additional knowledge distillation phase. It is unclear whether additional policy updates in lieu of knowledge distillation can achieve the same effects. To investigate this, we compare SAC-PIEKD with Vanilla-SAC (extra) and Ensemble-SAC (extra), which respectively correspond to Vanilla-SAC and Ensemble-SAC (see Section 5.2) trained with extra policy update steps, using the same number of updates and minibatch sizes that SAC-PIEKD uses for distillation. A policy update here refers to a training step that updates the policy and, if required by the RL algorithm, the value function (haarnoja2018soft). Figure (a) compares the performance of our baselines to SAC-PIEKD. We see that SAC-PIEKD reaches higher performance more rapidly than the baselines. This observation shows that knowledge distillation is more effective than additional policy updates for knowledge sharing.
We additionally study whether the naive method of directly copying parameters from the best-performing agent can also be an effective way to share knowledge between neural networks. We compare a variant of our method, which we denote as SAC-PIEKD (hardcopy), against SAC-PIEKD. In SAC-PIEKD (hardcopy), rather than perform intra-ensemble knowledge distillation, we simply copy the parameters of the teacher policy and critic into the student policies and critics. Figure (b) depicts the performance of this variant. We see that SAC-PIEKD (hardcopy) performs worse than both Ensemble-SAC and SAC-PIEKD. Thus, knowledge distillation is superior to naively copying the best agent’s parameters. In fact, explicitly copying parameters can be counterproductive, as Ensemble-SAC, which shares only experiences, outperforms the hardcopy variant. This is likely due to the loss of policy diversity caused by hardcopying, which perhaps reduces the ensemble to training a single policy as in Vanilla-SAC.
5.4 Effectiveness of selecting the best-performing agent as the teacher
During teacher election, we opted for the natural strategy of selecting the best-performing agent. However, in order to investigate its importance, we compared the performance of SAC-PIEKD when we select the best policy to be the teacher as opposed to selecting a random policy to be the teacher. This is depicted in Figure (c), where SAC-PIEKD (random teacher) denotes the selection of a random policy to be the teacher and the standard SAC-PIEKD refers to the selection of the highest-performing policy to be the teacher. We see that using the highest-performing teacher for distillation appears to be slightly better than selecting a random teacher, though not significantly. Interestingly, we see that using a random teacher performs better than Ensemble-SAC. This result suggests that selecting the best teacher is not necessarily of high importance, as a random teacher yields benefits. While this warrants further investigation, it may be that diverse knowledge is shared through distillation, which could explain the success we see in SAC-PIEKD (random teacher). Another possibility is that by bringing policies closer together, the off-policy error (fujimoto2018off) stemming from RL updates on a shared replay buffer is reduced, improving performance. We conclude that selecting the highest-performing teacher, while somewhat beneficial, is nonessential, and we leave the investigation of these open questions for future work.
In this paper, we introduce Periodic Intra-Ensemble Knowledge Distillation (PIEKD), a method that jointly trains an ensemble of RL policies while periodically sharing information via knowledge distillation. Our experimental results demonstrate that PIEKD improves the data efficiency of a state-of-the-art RL method on several standard MuJoCo tasks. Also, we show that knowledge distillation is more effective than the other approaches for knowledge sharing. We found that electing the best-performing policy is beneficial, but nonessential for improving the sample efficiency of PIEKD.
PIEKD opens several avenues for future work. While we used a simple uniform policy selection strategy, a more efficient policy selection strategy may further accelerate learning. Moreover, while our ensemble members used identical architectures, PIEKD may benefit from using heterogeneous ensembles, consisting of different architectures that may be conducive to learning different skills, which can then be distilled within the ensemble. Lastly, additional investigations into teacher election may lead to informative insights.