We address the problem of imitation learning, where an agent must perform a task using only expert demonstrations and cannot iteratively query the expert for more information. One approach is behaviour cloning, which formulates learning from demonstrations as a supervised learning task. Because it is plain supervised learning, behaviour cloning converges in a few update steps and needs no environment interactions during training. However, supervised learning assumes the data to be i.i.d. and ignores the temporal dependence between states and actions (i.e. the action taken at a state influences the future states the expert might encounter and the actions it might take there). Ignoring this temporal structure leads to compounding errors in behaviour cloning, so much more data is required to compensate for the missing prior.
An alternative to behavior cloning is to use generative adversarial imitation learning (GAIL). GAIL tackles the problem of compounding errors by accounting for the temporal dependencies using a state-action distribution matching between the policy and the expert. However, GAIL takes a lot of environment interactions to stabilize, since it tries to optimize for a non-stationary minimax RL objective to train the policy and discriminator. Adversarial training is also known to be very unstable in general, and GAIL is no exception.
Behavior cloning doesn’t account for the temporal dependencies between states and actions, but it converges in a few iterations. GAIL accounts for these temporal dependencies, but is brittle and slow due to its training framework. Since the two methods are complementary, it makes sense to combine them to get the best of both worlds in terms of sample efficiency, stability and performance. A naive approach is to pretrain a network using behavior cloning to get a “greedy” policy, and refine it further using GAIL. However, pretraining behaves differently in the supervised setting and the RL setting, as explained in Section 3. Other works () anecdotally report that pretraining a policy with BC and fine-tuning with GAIL doesn’t converge to an optimal policy, but do not provide insights as to why that is the case. We investigate a few possible reasons and experimentally show that different design choices and pretraining can affect the training of GAIL adversely. To that end, we propose an algorithm to blend BC and GAIL, and experiments show that our method is stable in both low and high dimensional tasks and returns optimal policies while requiring far fewer interactions with the environment than GAIL.
2 Related Work
To mitigate the compounding errors in behavior cloning,  train an iterative algorithm where at each time step t, the policy learns the expert behavior on the trajectories induced by .  also introduce a stochastic mixing algorithm based on . The initial policy starts off as the expert policy, and at each iteration, a new policy is obtained by training on the trajectory induced by the previous policy . The policy at timestep t is obtained by a geometric stochastic mixing of the expert and the previous policies.  train a policy using the expert demonstrations, generate new trajectories and use the expert to correct the behavior in these new trajectories iteratively. Although this method performs much better in a variety of scenarios, it requires online access to the expert, which might be very expensive.
Another approach to tackling the problem is to use Inverse Reinforcement Learning. Inverse reinforcement learning attempts to find a reward function which best explains the behavior of an expert. Once the reward function is extracted, an agent can learn an optimal policy from the reward using reinforcement learning. Inverse reinforcement learning has shown successes in a variety of tasks, including gridworld environments, car driving simulations (), route inference based on partial trajectories (), and path planning (). The reward function is modeled as a linear function of the state features, and the weights are learned to match the feature expectations of the expert and the learnt policy by , . 
use the principle of maximum entropy to disambiguate the under-defined problem of multiple possible rewards. Other methods like use priors and evidence from expert’s actions to derive probabilistic distributions over rewards.
Recently, adversarial imitation learning methods have shown successes in a variety of imitation tasks, from low dimensional continuous control to high dimensional tasks like autonomous driving from raw pixels as input.  propose a framework for directly extracting a policy from trajectories without performing reinforcement learning inside a loop. This approach utilizes a discriminator to distinguish between the state-action pairs induced by the expert and the policy, and the policy uses the output of the discriminator as the reward. Different approaches build on top of this method, with  proposing an algorithm that can infer the latent structure of the expert trajectories without explicit supervision. This approach maximizes a mutual information term between the trajectory and the latent space to capture the variations in the trajectories. GAIL was further extended by  to produce a scalable inverse reinforcement learning algorithm based on adversarial reward learning. This approach returns a policy as well as a learnt reward function. These approaches have led to optimal imitation learning in both low and high dimensional tasks.
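As a rough illustration of how a discriminator output can serve as a reward, the sketch below converts a discriminator logit into a surrogate reward. The function names and the sign convention are assumptions made for illustration: the discriminator is taken to output the probability that a state-action pair came from the expert, and the reward is taken to be -log(1 - D). Other conventions are in common use, and the text does not say which one is used here.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def surrogate_reward(logit: float) -> float:
    """Reward -log(1 - D(s, a)) with D = sigmoid(logit) (assumed convention).

    Since 1 - sigmoid(x) = 1 / (1 + exp(x)), the reward equals softplus(x),
    which is evaluated here in a form that does not overflow for large |x|.
    """
    return max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
```

Under this convention, pairs that the discriminator judges more expert-like (larger logit) receive a larger reward, and the reward is always positive.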
To find an approach that combines the best of both worlds from behaviour cloning and GAIL-based methods, we first hypothesize why pretraining with behaviour cloning may not be optimal. Empirically, we found that pretraining with behaviour cloning did not help: the agent learns a suboptimal policy compared to GAIL trained from scratch. This issue is not uniquely raised by us; prior work also shows that GAIL pretrained with behaviour cloning fails to reach the performance of the standard GAIL baseline. We investigate a few probable reasons and present an algorithm that trades off behavior cloning and GAIL for sample-efficient and optimal training.
3.1 Catastrophic forgetting due to discriminator
Catastrophic forgetting is a phenomenon that occurs in incremental training of neural networks. After a trained model is deployed, training it on new data makes the neural network ‘forget’ the old data it was trained on. Catastrophic forgetting occurs because the network’s weights change in a way that disrupts the representation learned for the previous task. This leads to forgetting of the old task when a new task is learned. In the case of pretraining with behaviour cloning, the learned representation may be forgotten when the discriminator provides random rewards at the beginning of GAIL training. The policy gradient loss with random rewards can make the network ‘forget’ the state representation it learnt during the pretraining stage. Eventually, the discriminator becomes better and the rewards become more meaningful, but the initial gradients may disrupt the learnt representations and hence lead to catastrophic forgetting. One way to tackle this problem is to train the discriminator alongside the policy network in the pretraining stage to distinguish between the expert and the policy. The discriminator then gives meaningful rewards at the start of GAIL training and is less disruptive to the weights of the policy network. Section 5.3 analyses the effect of training the discriminator on the trajectories induced by the policy pretrained with behavior cloning as an alternative to random initialization for GAIL.
3.2 Catastrophic interference from Value network
Catastrophic interference occurs when a neural network drastically forgets previously learned information when it learns new information. Most deep reinforcement learning algorithms follow an Actor-Critic architecture , . The policy and value networks can be trained separately, or can use shared state features. In high dimensional tasks, it is preferable to train a single network for actor and critic  with shared features. The value network cannot be trained in the pretraining stage because there is no TD error to minimize and the policy network is trained in a supervised manner. When this network is trained on the rewards from the discriminator, the gradients of the value loss interfere with the previously learnt shared representation. This leads to catastrophic interference with the pretrained representation when a shared network is involved. The problem shouldn’t occur when separate policy and value networks are trained, as there is no shared representation between the networks. Section 5.2 analyses the effect of using shared versus separate networks for the policy and value networks and how it affects the BC+GAIL framework. Experiments also show that our method effectively uses the shared representation to learn better policies when the network parameters are shared.
3.3 Suboptimal performance of warm-started neural networks
Pretraining networks has shown a number of successes in deep learning, from image classification to natural language inference among others. The success of pretraining lies in the fact that a pretrained network can be finetuned to domain specific tasks with little data, since a general feature extractor has already been learnt. However, show that random initialization is very robust and performs no worse than pretrained networks. The randomly initialized networks take longer to train than pretrained networks, but their generalization errors are always better than those of pretrained networks.  takes this a step further and shows that warm starting a network might result in poorer generalization although the training losses converge to the same value. In the context of imitation learning, behaviour cloning doesn’t train with all the expert trajectories because some data is required for validation. Since the network is warm-started with a fraction of the expert data, it may lead to an overall poor generalization error when trained on the entire set of trajectories during the GAIL training stage.
4 Annealing the loss function
Simulated annealing is an optimization technique for approximating the global optimum of a function. In deep learning, a common use of annealing is in the learning-rate schedule used to train neural networks.  optimize a sequence of gradually improving mosaic functions that approximate the original non-convex objective using an annealing scheme.  use an exponentially moving average of the parameters of the target Q function at each time step. [11] use an annealing scheme to stabilize training in the context of semantic segmentation in medical images.
Inspired by these works, we propose a loss function that trades off the supervised loss against the policy gradient loss by minimizing a weighted sum of the two terms. The tradeoff parameter is annealed so that, as the number of iterations increases, the optimization becomes identical to GAIL, which provides better asymptotic performance.
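Let the expert trajectories be denoted by a dataset $\mathcal{D}$, where $\mathcal{D} = \{(s_i, a_i)\}_{i=1}^{N}$, containing tuples of states and actions taken by the expert $\pi_E$. The policy network is denoted by $\pi_\theta$, parameterized by $\theta$, and the value network is denoted by $V_\phi$, parameterized by $\phi$. The parameters of the policy and value networks may or may not be shared. The behavior cloning loss is given by

$$\mathcal{L}_{BC} = -\mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\log \pi_\theta(a \mid s)\right].$$

In adversarial imitation learning, we also train a discriminator $D_\omega$ parameterized by $\omega$. The discriminator is trained by minimizing the loss

$$\mathcal{L}_{D} = -\mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\log D_\omega(s,a)\right] - \mathbb{E}_{(s,a)\sim\pi_\theta}\left[\log\left(1 - D_\omega(s,a)\right)\right].$$

And the policy is trained using a policy gradient algorithm:

$$\mathcal{L}_{GAIL} = -\mathbb{E}_{(s,a)\sim\pi_\theta}\left[\log \pi_\theta(a \mid s)\,\hat{A}(s,a)\right],$$

where the advantage is estimated using the value network and the discriminator:

$$\hat{A}(s_t,a_t) = r_\omega(s_t,a_t) + \gamma V_\phi(s_{t+1}) - V_\phi(s_t), \qquad r_\omega(s,a) = -\log\left(1 - D_\omega(s,a)\right).$$

The value network is trained using the following loss:

$$\mathcal{L}_{V} = \mathbb{E}_{s\sim\pi_\theta}\left[\left(V_\phi(s) - \hat{V}(s)\right)^2\right],$$

where $\hat{V}(s)$ is the estimated value from the trajectory rewards. When the parameters of the value net and policy net are shared, it is beneficial to take gradients with respect to a combined loss $\mathcal{L}_{GAIL} + \lambda \mathcal{L}_{V}$ instead, where $\lambda$ is a weighing factor.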
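To make the advantage estimate concrete, here is a minimal sketch that computes advantages from per-step surrogate rewards (as produced by a discriminator) and value predictions. It uses generalized advantage estimation, which appears in the reference list; whether the original experiments use exactly this estimator is an assumption. The function name `gae_advantages` and the parameters `gamma` and `gae_lambda` are illustrative. With `gae_lambda = 0` the estimate reduces to the one-step temporal-difference residual.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 0.99, gae_lambda: float = 0.95) -> List[float]:
    """Generalized advantage estimates for a single trajectory segment.

    rewards[t] is the surrogate reward at step t (hypothetically from the
    discriminator); values has one extra entry, the bootstrap value of the
    state reached after the last step.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses the discounted sum of later residuals.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
    return advantages
```

The value predictions enter only through the residuals `delta`, so a poorly fitted value network biases every advantage computed from it, which is one way a shared representation can be disturbed early in adversarial training.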
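At every training step, we minimize a convex combination of the two losses. Specifically, at iteration $t$, we train the policy using the following loss:

$$\mathcal{L}_t = \alpha_t\,\mathcal{L}_{BC} + (1 - \alpha_t)\,\mathcal{L}_{GAIL},$$

where $\alpha_t \in [0, 1]$, $\mathcal{L}_{BC}$ is the behavior cloning loss and $\mathcal{L}_{GAIL}$ is the GAIL policy loss. Note that $\alpha_t = 0$ corresponds to training with GAIL, and $\alpha_t = 1$ corresponds to behaviour cloning. In our case, we anneal $\alpha_t$ from 1 to 0, which transitions the gradients from a greedy action-matching behaviour to gradually accounting for long term reward. The tradeoff parameter is annealed using an exponential decay $\alpha_t = \alpha_0^{t}$, with $\alpha_0 \in (0, 1)$. The pseudo-code for the framework is given in Algorithm 1.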
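The annealed objective described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `bc_weight`, `blended_loss` and `alpha0` are assumptions, and the losses are plain floats here, whereas in practice they would be differentiable tensors computed on a batch.

```python
def bc_weight(step: int, alpha0: float) -> float:
    """Weight on the behavior cloning term after `step` iterations.

    Exponential decay alpha0 ** step: 1 at step 0, tending to 0 as step grows.
    """
    if not 0.0 < alpha0 < 1.0:
        raise ValueError("alpha0 must lie strictly between 0 and 1")
    if step < 0:
        raise ValueError("step must be non-negative")
    return alpha0 ** step


def blended_loss(bc_loss: float, gail_loss: float, step: int, alpha0: float) -> float:
    """Convex combination of the BC loss and the GAIL policy loss."""
    weight = bc_weight(step, alpha0)
    return weight * bc_loss + (1.0 - weight) * gail_loss
```

At step 0 the result is the behavior cloning loss alone, and after many steps it is indistinguishable from the GAIL loss, so no explicit switch between two training stages is needed.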
5.1 Low dimensional control tasks
We evaluate the proposed algorithm on a variety of low dimensional continuous control environments in OpenAI gym, namely Acrobot, Ant, HalfCheetah, Reacher, Swimmer, Walker. We compare our algorithm with 3 baselines for this experiment:
Behavior cloning Behavior cloning trains a policy to ‘ape’ the actions of the expert using supervised learning. Although behaviour cloning is very quick since it doesn’t require environment interactions, its asymptotic performance is not optimal unless a lot of data is provided. Since our experiments do not use iterative data collection, we do not use other behavior cloning baselines which use iterative feedback from experts like  and .
GAIL Adversarial imitation learning has been successful in a lot of environments. GAIL accounts for the temporal nature of sequential decision making by providing a surrogate reward value from the discriminator. However, adversarial methods are shown to be unstable and take a lot of environment interactions to converge.
| Environment | BC+GAIL (separate) | BC+GAIL (shared) | Ours (separate) | Ours (shared) |
For these experiments, we use a shared value and policy network, which is an MLP with 2 hidden layers of 32 units each and tanh nonlinearities. All the weights except the last layer are shared between the policy and value networks. For our algorithm, we choose the decay factor $\alpha_0$ according to the number of iterations $T_{1/2}$ required to reach a weighing factor of $0.5$, which we call the ‘half-life’. The half-life is related to the value of $\alpha_0$ as $\alpha_0 = 0.5^{1/T_{1/2}}$. We choose a half-life of 10 iterations across all experiments, which corresponds to $\alpha_0 \approx 0.933$. All algorithms (except the BC+GAIL baseline) are trained from scratch.
The experiments in  use a large number of expert trajectories, which amounts to up to  state-action pairs. However, a more realistic scenario is one where expert trajectories are limited. Behavior cloning should fail to generalize without much data, and GAIL is expected to take a lot of iterations to converge. We use a small number of expert state-action pairs in each experiment, and we do not subsample from the expert trajectory. Dense trajectories are expected to contain more redundant information, making imitation even more difficult.
To obtain expert trajectories, we train an expert with the same architecture as the imitators on the task reward using Proximal Policy Optimization . We use only 100 state-action pairs for Acrobot, since this is a relatively simple environment. For all other experiments, we use 500 state-action pairs as the expert trajectories to train the imitation agents. In all environments, this number of state-action pairs does not constitute even a single trajectory. Each algorithm is run across 3 random seeds, as done in . The behavior cloning algorithm is trained on only 70% of the data, and 30% is used for validation. For all other experiments, all of the expert data is used. Note that although our loss contains a behavior cloning term, it doesn’t require any validation data. The final performance of each method is evaluated by averaging over 60 episodes, where 20 episodes are obtained from each seed.
Our method performs consistently across all the environments, whereas GAIL is unstable when few expert trajectories are available and behaviour cloning never reaches peak performance. GAIL takes more iterations to catch up than both the BC+GAIL baseline and our method. Acrobot is a simple environment, yet the adversarial methods fail and are unstable on it. The BC+GAIL baseline converges to a suboptimal policy, and GAIL fails to improve within the few iterations available. Given the small number of trajectories, our method achieves the best performance, as shown in Table 1.
5.2 Effect of shared value network
In the previous section, we compared all the baselines with a shared network architecture. Although a shared value network may train faster because of fewer parameters, it may be unstable due to the trade-offs between the representations needed by the value loss and the policy gradient loss. To analyse whether a shared value network affects the performance of the pretrained baseline, we run the same experiments as in Section 5.1, but train the policy using separate value and policy networks to minimize the effect of possible interference. Both networks are 2-layered MLPs with 32 hidden units and tanh nonlinearities. We also train the networks with our method to compare the performance. The results are shown in Table 3, which shows that the BC+GAIL baseline performs better when the value and policy networks do not share weights. However, the performance is still suboptimal compared to our method, which achieves a higher score with shared architectures.
5.3 Effect of pretraining discriminator
To test the effect of pretraining the discriminator along with behavior cloning, we train two variants of the BC+GAIL baseline. In the first variant, we train the policy using behavior cloning, and train the discriminator only during GAIL training. In the second variant, we train the discriminator on the trajectories induced by the policy after every policy update during the pretraining stage. During GAIL training, we initialize the discriminator with the parameters learnt during pretraining. The final performance for both variants is shown in Table 4. Pretraining the discriminator does not provide any significant advantage in terms of final performance. Hence, the discriminator may not contribute significantly to catastrophic forgetting.
5.4 High dimensional GridWorld tasks
To test the stability of the algorithms in a high dimensional setting, we use a gridworld environment as used in . We evaluate on the “Key-Door” task, where the grid is divided into two rooms. The agent has to pick up a key, open the locked door and move to the goal location in the other room. The wall, key, door, goal location and agent are initialized at a different location every time, making the task harder. The agent receives a reward of  for reaching the goal location, and  otherwise. Reinforcement learning algorithms are known to struggle with sparse rewards over a long horizon, and imitation learning methods can use the expert data to learn faster.
We use the code provided by  and extend it for training the imitation learning algorithms. The input to the policy network is an $H \times W \times V$ binary tensor, where $H$ and $W$ are the grid dimensions, and $V$ is the vocabulary size. The vocabulary consists of the objects that can be present in the grid, i.e. ‘none’, ‘player’, ‘wall’, ‘key’, ‘door’ and ‘goal’. Each grid cell is represented by a one-hot vector corresponding to the object present in that cell. To prevent the problems of mode collapse and vanishing gradients in this high-dimensional setting, we follow the work of  and opt to use a Wasserstein GAN () framework. In addition, we use the REINFORCE algorithm as another baseline which uses the sparse reward. The policy network is an MLP with 2 hidden layers, and each layer consists of 64 hidden units with tanh nonlinearities. The value network uses the same weights as the policy network except the last layer. To compare the effects of the interference due to the value network and due to the discriminator, we also compare 2 variants of pretraining GAIL with BC, with and without discriminator training. The expert is an A* search agent which first navigates to the key from its location, picks it up, and goes to the goal via the door. We test with grid sizes of 8, 10, and 12 to analyse the effect of progressively tougher environments. We use a total of 200, 350, and 500 expert trajectories for grid sizes 8, 10, 12 respectively. Since expert trajectories are very short (15 state-action pairs per trajectory), there is a lot of missing data corresponding to unvisited states, and behavior cloning is expected to perform very suboptimally.
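The one-hot grid encoding described above can be sketched as follows. This is an illustrative reconstruction, not the code used in the experiments: the function name `encode_grid`, the nested-list representation and the index order of the vocabulary are assumptions (only the six object names come from the text).

```python
from typing import List

# Object vocabulary as listed in the text; the index order is an assumption.
VOCAB = ["none", "player", "wall", "key", "door", "goal"]


def encode_grid(grid: List[List[str]]) -> List[List[List[int]]]:
    """Encode a grid of object names as a binary height x width x vocab tensor.

    Each cell becomes a one-hot vector marking the object it contains.
    Raises KeyError for an object name outside the vocabulary.
    """
    index = {name: i for i, name in enumerate(VOCAB)}
    return [
        [[1 if index[cell] == k else 0 for k in range(len(VOCAB))] for cell in row]
        for row in grid
    ]
```

For a grid of size 12 this yields 12 x 12 x 6 = 864 binary inputs once flattened for the MLP, which gives a sense of why a few hundred short trajectories cover only a small part of the input space.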
In Figure 2, we observe the effects of all baselines. Behavior cloning performs suboptimally for all grid sizes, and its performance worsens with increasing grid size. This is because the linearly increasing trajectory count cannot compensate for the exponentially increasing state space. REINFORCE converges but takes a lot of iterations. In the case of grid size 12, REINFORCE only reaches about 70% of the performance of GAIL and Loss-annealed GAIL after 30M steps. Both the pretrained baselines drop their performance to 0, and never recover from this catastrophic interference. Pretraining the discriminator on the trajectories induced by the behavior cloning policy doesn’t help either. However, our method reaches peak performance much faster than GAIL.
5.5 Choice of decay factor
One design choice is the choice of $\alpha_0$, the decay factor involved in the loss function for our method. Setting it too low may lead to the policy not benefitting at all from the behavior cloning term. Setting it too high may lead the weights to overfit to the behavior cloning term, and the performance of the model may be suboptimal. In our experiments, we use a heuristic to choose this parameter, which works well for all experiments. The heuristic is to calculate the number of iterations it takes behavior cloning to converge, and set half of that number as the half-life of the decay parameter. For example, if behavior cloning takes 50 iterations to converge, then we set the half-life to 25 iterations, and subsequently, $\alpha_0 = 0.5^{1/25} \approx 0.973$. Intuitively, this makes sense: if behavior cloning takes a long time to converge, then the task is probably difficult, and the decay should happen more slowly so that the behavior cloning term keeps guiding the weights for longer. If behavior cloning converges quickly, the task is easy or the data can be overfit quickly, and setting $\alpha_0$ to be low might be a better option. In both the low and high dimensional experiments, we set a half-life of 10 epochs based on the number of epochs it took the behavior cloning baseline to converge. Doing this test is not very expensive, as behavior cloning always converged in the least amount of time. Future work in this direction might include sensitivity analysis of this parameter, or employing other annealing schedules altogether.
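The heuristic above amounts to a short calculation, sketched below. The function names are hypothetical, and the only assumption beyond the text is that the half-life is the number of iterations after which the exponentially decayed weight equals 0.5.

```python
def decay_from_half_life(half_life: float) -> float:
    """Decay factor whose exponential schedule halves every `half_life` steps."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return 0.5 ** (1.0 / half_life)


def decay_from_bc_convergence(bc_iterations: int) -> float:
    """Heuristic from the text: half-life is half the iterations BC needs to converge."""
    return decay_from_half_life(bc_iterations / 2.0)
```

For example, `decay_from_bc_convergence(50)` gives a half-life of 25 iterations and a decay factor of roughly 0.973, while `decay_from_half_life(10)` gives roughly 0.933.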
We show that our method provides stability in adversarial imitation learning, especially in low dimensional tasks with little expert data and in high dimensional tasks where adversarial learning methods are slow and unstable. Our method improves substantially on GAIL by compensating for one of its weaknesses, i.e. its sample efficiency, without compromising on stability or performance. We show that policies pretrained with behavior cloning suffer from catastrophic interference by the value network if parameters are shared between the policy and value networks. We also show that policies warm-started with behavior cloning get stuck in a local optimum, and do not reach peak performance in practice. Experiments show that our method is a better alternative to two-stage training, being sample efficient and reaching peak performance. Our method is easy to implement, and we show successes with two different policy update algorithms that GAIL can use, namely REINFORCE and PPO. We provide a heuristic for estimating the decay factor used in our algorithm, and show that a single value works well across all experiments.
- Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1. Cited by: §2.
-  (2005) Memory retention–the synaptic stability versus plasticity dilemma. Trends in neurosciences 28 (2), pp. 73–78. Cited by: §3.1.
-  (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §5.4.
-  (2019) On the difficulty of warm-starting neural network training. arXiv preprint arXiv:1910.08475. Cited by: §3.3.
-  (2009) Search-based structured prediction. Machine learning 75 (3), pp. 297–325. Cited by: §2.
-  (2016) Guided cost learning: deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58. Cited by: §2.
-  (2017) Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248. Cited by: §2.
-  (2018) Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477. Cited by: §4.
- Rethinking imagenet pre-training. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4918–4927. Cited by: §3.3.
-  (2016) Generative adversarial imitation learning. In Advances in neural information processing systems, pp. 4565–4573. Cited by: §2, 3rd item, §5.1.
-  (2019) A bayesian neural net to segment images with uncertainty estimates and good calibration. In Information Processing in Medical Imaging, Cham, pp. 3–15. Cited by: §4.
-  (2017) InfoGAIL: interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3812–3822. Cited by: §2, §5.4.
-  (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §3.2.
-  (2000) Algorithms for inverse reinforcement learning.. In Icml, Vol. 1, pp. 2. Cited by: §2.
-  (2019) Annealed gradient descent for deep learning. Neurocomputing. Cited by: §4.
-  (2007) Bayesian inverse reinforcement learning.. In IJCAI, Vol. 7, pp. 2586–2591. Cited by: §2.
-  (2006) Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pp. 729–736. Cited by: §2.
- Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 661–668. Cited by: §2, 1st item.
-  (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. Cited by: §2, 1st item.
-  (2019) Sample efficient imitation learning for continuous control. In International Conference on Learning Representations. Cited by: §1, §3, 3rd item.
-  (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. Cited by: §4.
-  (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §5.1.
-  (2018) Learning goal embeddings via self-play for hierarchical reinforcement learning. arXiv preprint arXiv:1811.09083. Cited by: §5.4, §5.4.
-  (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §3.2.
-  (2008) Maximum entropy inverse reinforcement learning. Cited by: §2.