Currently, reinforcement learning algorithms are sample inefficient and learn from scratch through trial and error over millions of rollouts. This sample inefficiency is not a problem when the goal is to maximize performance on a single task in a simulated environment, where data is cheap and can be collected quickly. However, it is not viable in real-world use cases, where the goal is for an agent to accomplish many tasks that may change over time. In addition to developing more efficient algorithms, one solution to this problem is to share knowledge between multiple tasks and develop flexible representations that can easily transfer to new tasks. In this work, we take a prerequisite step in this direction by evaluating the performance of state-of-the-art multi-task learning methods on continuous action spaces using an extended version of the MuJoCo environment (Henderson et al., 2017). Analyzing the success of various multi-task methods on continuous control tasks is an important contribution because most research in this area has been on discrete action spaces in Atari environments.
2 Related Work
In recent years, a number of works, most consistently from DeepMind, have proposed methods for transfer learning and multi-task learning. These works primarily focus on two approaches: knowledge distillation and feature reuse. Knowledge distillation, originally proposed in (Bucila et al., 2006), serves as the foundation for the methods put forward in (Rusu et al., 2016a), (Parisotto et al., 2016), and (Teh et al., 2017). (Rusu et al., 2016a) extends the distillation method, formulated in (Hinton et al., 2014), to Deep Q Networks trained on Atari environments and demonstrates that policy distillation can act as a form of regularization for Deep Q Networks. In (Parisotto et al., 2016), the authors propose a novel loss function that includes both a policy regression term and a feature regression term. The policy regression objective is traditionally used in distillation, while the added feature regression objective encourages feature representations in the intermediate layers of the student network to match those of the expert network. (Teh et al., 2017) applies distillation to the multi-task setting by learning a common (distilled) policy across a number of 3D environments. In addition, (Teh et al., 2017) adds an entropy penalty as well as entropy regularization coefficients to the objective in order to trade off between exploration and exploitation. (Rusu et al., 2016b) attacks the problem of catastrophic forgetting, where a policy loses the ability to perform a pre-transfer task after being transferred to a target task. Concretely, the authors prevent catastrophic forgetting by maintaining task-specific representations within the policy network. The environments we use in our experiments were introduced in (Henderson et al., 2017). These environments extend the MuJoCo continuous control tasks available in OpenAI Gym and are designed to be a test bed for transfer learning and multi-task learning. For a given simulated agent, these environments provide minor structural variations such as the length of an agent’s body parts.
3.1.1 Markov Decision Process
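To provide context we give a brief review of the reinforcement learning problem. Reinforcement learning is the process of training an agent to maximize reward in an environment. More technically, the aim is to learn the optimal policy for selecting actions to take in a Markov Decision Process (MDP). An MDP is defined as $M = (S, A, P, r, \gamma, T)$, where $S$ is a set of states, $A$ is a set of actions, $P$ is a transition function, $r$ is a reward function, $\gamma$ is a discount factor, and $T$ is a time horizon. The policy $\pi_\theta$ is trained to maximize the expected discounted return $\eta(\pi_\theta) = E_\tau\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$, where $\tau = (s_0, a_0, s_1, a_1, \dots)$ denotes a trajectory sampled according to $\pi_\theta$ with $a_t \sim \pi_\theta(a_t \mid s_t)$, and $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$. Furthermore, we define the optimal approximate policy as $\pi_{\theta^*}$, where $\theta^* = \arg\max_\theta \eta(\pi_\theta)$.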
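The expected discounted return described above can be estimated by averaging discounted reward sums over sampled trajectories. The following is a minimal Python sketch of that estimate; the function names and the representation of a trajectory as a plain list of rewards are illustrative assumptions, not part of the original implementation.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a single trajectory."""
    total = 0.0
    # Accumulate from the last step backwards so each step costs
    # one multiplication and one addition.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def estimate_expected_return(reward_sequences, gamma):
    """Monte Carlo estimate of the expected discounted return.

    reward_sequences is a list of per-trajectory reward lists
    (an assumed data layout for this sketch).
    """
    returns = [discounted_return(rewards, gamma) for rewards in reward_sequences]
    return sum(returns) / len(returns)
```

For example, a three-step trajectory with reward 1 at every step and a discount factor of 0.5 has a discounted return of 1 + 0.5 + 0.25 = 1.75.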
3.1.2 Actor Critic Algorithm
To learn this optimal approximate policy we use the advantage actor-critic (A2C) algorithm, which is a synchronous implementation of the A3C algorithm introduced in (Mnih et al., 2016). A2C is an on-policy algorithm that operates in the forward view by sampling rollouts of the current policy to calculate n-step returns. The policy and the value function are updated after every $t_{max}$ forward steps or when a terminal state is reached. This algorithm maintains a policy $\pi(a_t \mid s_t; \theta)$ and a value function estimate $V(s_t; \theta_v)$ and performs updates of the form $\nabla_{\theta} \log \pi(a_t \mid s_t; \theta) A(s_t, a_t; \theta, \theta_v)$, where the advantage is defined as

$$A(s_t, a_t; \theta, \theta_v) = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v),$$

with $k$ upper-bounded by $t_{max}$. We discourage convergence to sub-optimal deterministic policies by adding a policy entropy term to the objective function, as originally proposed in (Williams & Peng, 1991). To approximate $\pi$ and $V$ we use feed-forward neural networks, and we model the policy as a multivariate Gaussian with a mean vector and diagonal covariance matrix. Thus, the output layer of our policy network consists of a real-valued mean and the log variance for each dimension of the action space.
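The n-step returns and advantages used in the actor-critic update can be computed with a single backward pass over a rollout segment. The sketch below is a plain-Python illustration under stated assumptions: the function name, the list-based inputs, and the convention of passing a bootstrap value of 0.0 for terminal states are ours and are not taken from the original code.

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma):
    """Compute n-step returns and advantages for one rollout segment.

    rewards[t] is the reward received at step t and values[t] is the
    critic's estimate for the state at step t. bootstrap_value is the
    critic's estimate for the state reached after the last step
    (assumed to be 0.0 when that state is terminal).
    """
    returns = []
    running = bootstrap_value
    # Walk backwards: each return is the reward plus the discounted
    # return of the following step.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    # Advantage = n-step return minus the critic's baseline.
    advantages = [ret - value for ret, value in zip(returns, values)]
    return returns, advantages
```

In this form, the step at the start of the segment uses the longest lookahead and the final step uses a one-step lookahead, which matches the forward-view description of the algorithm.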
3.1.3 Knowledge Distillation
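The goal of knowledge distillation is to transfer knowledge from a teacher model $T$ to a student model $S$. In our experiments, $T$ is a policy trained from scratch on a single environment using A2C, and $S$ is a feed forward network which has not been trained. $S$ is trained on the dataset $D = \{(x_i, y_i)\}$, where $x_i$ denotes features and $y_i$ denotes targets. The features $x_i$ are state-action pairs taken from trajectories of length $L$, which are sampled according to the student’s policy $\pi_S$ with probability $p$ and according to the teacher’s policy $\pi_T$ with probability $1 - p$. The targets $y_i$ are the values that parameterize the teacher’s policy for the given state-action pair. We train $S$ on $D$ using the KL divergence between the teacher policy and the student policy as the objective function. Specifically, we use the closed-form KL divergence between two Gaussians, since the actions taken by $T$ and $S$ are drawn from multivariate Gaussians. For a teacher policy $\mathcal{N}(\mu_T, \Sigma_T)$ and a student policy $\mathcal{N}(\mu_S, \Sigma_S)$ over a $d$-dimensional action space, it is defined as follows:

$$D_{KL}(\pi_T \,\|\, \pi_S) = \frac{1}{2}\left[\operatorname{tr}\left(\Sigma_S^{-1}\Sigma_T\right) + (\mu_S - \mu_T)^\top \Sigma_S^{-1} (\mu_S - \mu_T) - d + \ln\frac{\det \Sigma_S}{\det \Sigma_T}\right]$$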
We choose to use the KL divergence as our loss function because it was shown to perform well on discrete action spaces in (Rusu et al., 2016a).
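When both policies are Gaussians with diagonal covariance, the KL divergence between teacher and student has a closed form that reduces to a sum over action dimensions. The sketch below illustrates that computation in plain Python, assuming each policy is given as a list of means and a list of log variances (the parameterization described for the policy network's output layer). The function name and argument layout are hypothetical.

```python
import math


def diag_gaussian_kl(teacher_mean, teacher_log_var, student_mean, student_log_var):
    """KL(teacher || student) for Gaussians with diagonal covariance.

    Each argument is a list with one entry per action dimension.
    """
    kl = 0.0
    for mu_t, lv_t, mu_s, lv_s in zip(
        teacher_mean, teacher_log_var, student_mean, student_log_var
    ):
        var_t = math.exp(lv_t)
        var_s = math.exp(lv_s)
        # Per-dimension term: log variance ratio, variance ratio plus scaled
        # squared mean difference, minus one.
        kl += 0.5 * (lv_s - lv_t + (var_t + (mu_t - mu_s) ** 2) / var_s - 1.0)
    return kl
```

The divergence is zero when the student's mean and log variance match the teacher's in every dimension, and it grows as either the means or the variances drift apart.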
3.1.4 Multi-Task Learning
The goal of multi-task learning is to train a policy network that behaves optimally in $N$ different environments $E_1, \dots, E_N$. In our multi-task experiments, we approximate the optimal policy using an actor network, a feed forward network with two hidden layers shared across all environments and $N$ output layers (heads), where head $i$ produces the mean and covariance that parameterize the Gaussian policy for environment $E_i$. We experiment with two methods for training on multiple tasks:
Vanilla multi-task learning: Each head $i$ is trained using A2C to maximize the expected discounted return for environment $E_i$. The value network in this case consists of one shared hidden layer and $N$ output heads. During training, we sample an equal number of rollouts from each environment by cycling from environment $E_1$ through environment $E_N$. We provide an illustration of vanilla multi-task learning in part (a) of Figure 1.
Multi-task distillation: Each head $i$ is trained using knowledge distillation to match the output of a teacher network $T_i$. The multi-task distillation training process is identical to the knowledge distillation process except that a dataset $D_i$ is collected for each (teacher, head) pair, with the rollouts sampled from the student network. We provide an illustration of multi-task distillation in part (b) of Figure 1.
The above 2 methods for multi-task learning are illustrated in figure 1.
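The shared-trunk, multi-head actor described above can be sketched without any deep learning library. The example below is a dependency-free illustration only: the class name, the tanh activation, the uniform weight initialization, and the example dimensions are assumptions made for this sketch, and the original models were implemented in PyTorch.

```python
import math
import random


def init_layer(rng, n_in, n_out):
    """Create a (weights, bias) pair with small uniform random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, x):
    """Apply an affine layer to the input vector x."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class MultiHeadActor:
    """Two hidden layers shared by all environments, one output head each."""

    def __init__(self, obs_dim, act_dim, num_envs, hidden=64, seed=0):
        rng = random.Random(seed)
        self.act_dim = act_dim
        self.trunk = [init_layer(rng, obs_dim, hidden), init_layer(rng, hidden, hidden)]
        # Each head outputs a mean and a log variance per action dimension.
        self.heads = [init_layer(rng, hidden, 2 * act_dim) for _ in range(num_envs)]

    def features(self, obs):
        """Shared representation, identical regardless of the environment."""
        h = obs
        for layer in self.trunk:
            h = [math.tanh(v) for v in linear(layer, h)]  # tanh is an assumption
        return h

    def forward(self, obs, env_index):
        """Return (mean, log_var) from the head belonging to env_index."""
        out = linear(self.heads[env_index], self.features(obs))
        return out[: self.act_dim], out[self.act_dim :]
```

As a usage example, `MultiHeadActor(obs_dim=17, act_dim=6, num_envs=6).forward(obs, 2)` would return the Gaussian parameters for the third environment; the observation and action sizes here are assumed values for a half-cheetah style agent.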
We conducted our experiments using the half-cheetah agent on 6 morphologically modified variants from the OpenAI Gym extensions described in (Henderson et al., 2017): HalfCheetahSmallFoot-v0, HalfCheetahSmallLeg-v0, HalfCheetahSmallTorso-v0, and HalfCheetahSmallThigh-v0, which reduce the size of the agent’s respective body part, and HalfCheetahBigFoot-v0 and HalfCheetahBigTorso-v0, which increase the size of the agent’s respective body part. We evaluate the performance of a trained policy by reporting the mean and standard deviation of the cumulative reward across 20 sample rollouts on each target environment, as done in (Henderson et al., 2017). In addition, we plot the learning curves for each method, provided in the appendix, in order to determine the sample efficiency of these approaches.
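The evaluation protocol, mean and standard deviation of cumulative reward over 20 rollouts, can be sketched as follows. Here `run_episode` is a hypothetical callable that plays one rollout of the trained policy and returns its cumulative reward. The use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def evaluate_policy(run_episode, num_rollouts=20):
    """Return (mean, standard deviation) of cumulative reward over rollouts.

    run_episode is a hypothetical zero-argument callable that runs one
    rollout on the target environment and returns its cumulative reward.
    """
    totals = [run_episode() for _ in range(num_rollouts)]
    # Population standard deviation; swap in statistics.stdev for the
    # sample estimator if that is the convention being followed.
    return statistics.mean(totals), statistics.pstdev(totals)
```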
We use PyTorch to implement all our models. Our actor and critic networks consist of 2 and 3 fully-connected layers respectively, each of which has 64 hidden units. Because the MuJoCo environments we use are for continuous control, each action taken by an agent is sampled from a Gaussian distribution parameterized by the mean and variance given by the actor network. We use RMSprop with an initial learning rate of 0.0007 to train our models. We use the same A2C hyperparameter setting for all of our experiments. In addition, we use an entropy penalty coefficient of 0.01.
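To make the Gaussian policy concrete, the sketch below shows how an action could be sampled from a mean and log variance per action dimension, along with the log-probability and entropy terms that an actor-critic objective with an entropy penalty would need. It is a plain-Python illustration with hypothetical function names, not the PyTorch code used in the experiments.

```python
import math
import random


def sample_action(mean, log_var, rng=random):
    """Draw one action from a diagonal Gaussian policy."""
    # Standard deviation is exp(0.5 * log variance).
    return [rng.gauss(mu, math.exp(0.5 * lv)) for mu, lv in zip(mean, log_var)]


def log_prob(action, mean, log_var):
    """Log density of the action under the diagonal Gaussian policy."""
    total = 0.0
    for a, mu, lv in zip(action, mean, log_var):
        total += -0.5 * (math.log(2.0 * math.pi) + lv + (a - mu) ** 2 / math.exp(lv))
    return total


def entropy(log_var):
    """Differential entropy of the diagonal Gaussian policy."""
    return sum(0.5 * (math.log(2.0 * math.pi * math.e) + lv) for lv in log_var)
```

Predicting the log variance rather than the variance keeps the network output unconstrained while guaranteeing a positive variance after exponentiation.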
| Environment | Scratch (3M) | Distillation Multi-task (1M) | Vanilla Multi-task (1M) |
One of the simplest methods for multi-task learning is fine-tuning. For this, we first trained a policy with randomly initialized weights for 5M frames on each environment separately. We then transferred this policy to another environment by initializing the weights of a new network to those of the trained policy and then fine-tuning the last layer of the new network’s actor and critic for another 5M frames. We conducted our fine-tuning experiments on HalfCheetahSmallFoot-v0 and HalfCheetahSmallLeg-v0 and evaluated the results at both the 1M and 5M marks. The learning curves for all tasks are shown in Figure 2. The mean and standard deviation of the accumulated rewards, calculated on 20 rollouts of the policy, are tabulated in Table 1 for each combination of original and target environment.
From the table it is clear that although fine-tuning the weights of a pre-trained network on a new environment performs better than training from scratch, its performance degrades on the original environment. Thus it suffers from catastrophic forgetting, which makes it a poor choice for multi-task learning. For this reason, we did not explore all combinations of original and target environments and instead focus on other types of multi-task learning, which we discuss below.
4.3 Multi-task learning
4.3.1 Vanilla multi-task learning
In the vanilla multi-task experiment, we train a six-head actor network and a six-head critic network across the six environments. Initial hidden layers are shared across environments, while output layers are unique to each environment’s head, as shown in Fig. 1. The training procedure is as follows: first, sample rollouts from the first environment are collected and the head corresponding to that environment is trained; the same is then done for the second environment, and so on. The environments are continuously cycled in this manner until each head has been trained for 1M frames.
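The cycling procedure described above amounts to a round-robin schedule over environments with a per-head frame budget. The generator below is a minimal sketch of that schedule; the function name and the `frames_per_batch` parameter (the number of frames collected per visit to an environment) are assumptions introduced for illustration.

```python
def round_robin_schedule(num_envs, frames_per_head, frames_per_batch):
    """Yield environment indices in a fixed cycle until every head has
    been trained on at least frames_per_head frames."""
    frames_seen = [0] * num_envs
    # All heads advance in lockstep, so checking the last one suffices.
    while frames_seen[-1] < frames_per_head:
        for env_index in range(num_envs):
            frames_seen[env_index] += frames_per_batch
            yield env_index
```

A training loop would iterate over this schedule and, for each yielded index, collect rollouts from that environment and apply an A2C update through the matching head. With 3 environments, a budget of 10 frames per head, and 5 frames per visit, the schedule visits the environments in the order 0, 1, 2, 0, 1, 2.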
All results are tabulated in Table 2. Vanilla multi-task learning outperforms not only multi-task distillation but also networks trained from scratch on a single environment. This shows that sharing knowledge across multiple tasks helps the network perform better on each individual task as well, re-affirming our original motivation for this work. In addition, it helps the network train faster, achieving comparable performance in just 1M frames as compared to the 3M frames used when training from scratch.
4.3.2 Multi-task distillation
For the multi-task distillation experiment, we first trained teacher networks on all 6 tasks separately for 3M frames each. Our multi-task network then consisted only of an actor network with shared hidden layers and 6 output heads, one unique to each environment (Fig. 1). The training procedure was similar to vanilla multi-task learning, except that we sampled rollouts from the student policy and used the knowledge distillation loss to train our network.
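Collecting a distillation dataset for one teacher and head pair can be sketched as rolling out the student and labeling every visited state with the teacher's policy parameters. The code below is a hedged illustration: all four callables are hypothetical stand-ins for the environment, the student head, and the teacher network, and their signatures are assumptions rather than the interfaces used in the original code.

```python
def collect_distillation_pairs(env_reset, env_step, student_act, teacher_params, num_steps):
    """Roll out the student and label visited states with teacher outputs.

    Assumed (hypothetical) interfaces:
      env_reset() -> state
      env_step(action) -> (next_state, done)
      student_act(state) -> action sampled from the student head
      teacher_params(state) -> (mean, log_var) of the teacher policy
    """
    pairs = []
    state = env_reset()
    for _ in range(num_steps):
        # The target is what the teacher would do in the student's state.
        pairs.append((state, teacher_params(state)))
        next_state, done = env_step(student_act(state))
        state = env_reset() if done else next_state
    return pairs
```

Because the states come from the student's own rollouts, the teacher's targets cover the part of the state space the student actually visits.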
The motivation for exploring the use of distillation to train each head is twofold. Firstly, we hoped that distillation would decrease training time by providing more stable targets for the actor and critic networks. Secondly, we thought that distillation had the potential to stabilize the training process by mimicking the behavior of the expert teacher network. The multi-task learning graphs provided in section B of our appendix show that our first hypothesis was correct: the reward of the multi-task distillation agent reaches 1000 more quickly than that of the vanilla multi-task agent. However, the variance in the reward of the trained distillation agent is not notably lower than that of the vanilla multi-task agent. This is evidenced by the fact that the standard deviation of the reward of the distillation agent on two of the environments is considerably larger than the standard deviation of the vanilla agent, as shown in Table 2.
In this paper, we experiment with two different methods for multi-task learning for continuous control. We show that an agent trained simultaneously on multiple tasks not only generalizes better on each individual task but also requires fewer training steps to achieve comparable performance. This is a significant advantage, since most real-world environments are continuous and sampling large numbers of episodes from them can be difficult. More sophisticated techniques could be developed that use a reinforcement learning policy to select which environments to sample episodes from, which is likely to help the network focus on more difficult tasks and train faster. We hope our methods can serve as a benchmark for future work in this field.
We have provided all the trained model weights and code at https://github.com/jasonkrone/elen6885-final-project
The authors would like to thank Prof. Chong Li for sharing his knowledge and the Teaching Assistants, Lingyu Zhang, Chen-Yu Yen and Xing Yuan, for their constant support throughout the course.
- Bucila et al. (2006) Bucila, Cristian, Caruana, Rich, and Niculescu-Mizil, Alexandru. Model compression. ACM, 2006.
- Henderson et al. (2017) Henderson, Peter, Chang, Wei-Di, Shkurti, Florian, Hansen, Johanna, Meger, David, and Dudek, Gregory. Benchmark environments for multitask learning in continuous domains. arXiv preprint arXiv:1708.04352v1, 2017.
- Hinton et al. (2014) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. Deep Learning and Representation Learning Workshop, NIPS, 2014.
- Mnih et al. (2016) Mnih, Volodymyr, Badia, Adrià Puigdomènech, Mirza, Mehdi, Graves, Alex, Harley, Tim, Lillicrap, Timothy P., Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. JMLR, 2016.
- Parisotto et al. (2016) Parisotto, Emilio, Ba, Jimmy, and Salakhutdinov, Ruslan. Actor-mimic deep multitask and transfer reinforcement learning. ICLR 2016, 2016.
- Rusu et al. (2016a) Rusu, Andrei A., Colmenarejo, Sergio Gomez, Gulcehre, Caglar, Desjardins, Guillaume, Kirkpatrick, James, Pascanu, Razvan, Mnih, Volodymyr, Kavukcuoglu, Koray, and Hadsell, Raia. Policy distillation. arXiv preprint arXiv:1511.06295v2, 2016a.
- Rusu et al. (2016b) Rusu, Andrei A., Rabinowitz, Neil C., Desjardins, Guillaume, Soyer, Hubert, Kirkpatrick, James, Kavukcuoglu, Koray, Pascanu, Razvan, and Hadsell, Raia. Progressive neural networks. arXiv preprint arXiv:1606.04671v3, 2016b.
- Teh et al. (2017) Teh, Yee Whye, Bapst, Victor, Czarnecki, Wojciech Marian, Quan, John, Kirkpatrick, James, Hadsell, Raia, Heess, Nicolas, and Pascanu, Razvan. Distral: Robust multitask reinforcement learning. arXiv preprint arXiv:1707.04175v1, 2017.
- Williams & Peng (1991) Williams, Ronald J and Peng, Jing. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.