Pre-training Neural Networks with Human Demonstrations for Deep Reinforcement Learning

09/12/2017 ∙ by Gabriel V. de la Cruz Jr, et al. ∙ 0

Deep reinforcement learning (deep RL) has achieved superior performance in complex sequential tasks by using a deep neural network as its function approximator and by learning directly from raw images. A drawback of using raw images is that deep RL must learn the state feature representation from the raw images in addition to learning a policy. As a result, deep RL can require a prohibitively large amount of training time and data to reach reasonable performance, making it difficult to use deep RL in real-world applications, especially when data is expensive. In this work, we speed up training by addressing half of what deep RL is trying to solve: learning features. Our approach is to learn some of the important features by pre-training the deep RL network's hidden layers via supervised learning using a small set of human demonstrations. We empirically evaluate our approach using deep Q-network (DQN) and asynchronous advantage actor-critic (A3C) algorithms on the Atari 2600 games of Pong, Freeway, and Beamrider. Our results show that: 1) pre-training with human demonstrations in a supervised learning manner is better at discovering features relative to pre-training naively in DQN, and 2) initializing a deep RL network with a pre-trained model provides a significant improvement in training time even when pre-training from a small number of human demonstrations.





The recent resurgence of neural networks in reinforcement learning can be attributed to the widespread success of Deep Reinforcement Learning (deep RL), which uses deep neural networks for function approximation [Mnih et al.2015, Mnih et al.2016]. Besides deep RL’s state-of-the-art results, one of its most impressive accomplishments is its ability to learn directly from raw images. However, in order to bring the success of deep RL in virtual environments into real-world applications, we must address the lengthy training time that is required to learn a policy.

Deep RL suffers from poor initial performance like classic RL algorithms since it learns tabula rasa [Sutton and Barto1998]. In addition, deep RL inherently takes longer to learn because besides learning a policy it also learns directly from raw images — instead of using hand-engineered features, deep RL needs to learn to construct relevant high-level features from raw images. These problems are consequential in real-world applications with expensive data, like in robotics, finance, or medicine.

In order to use deep RL for solving real-world problems, there is a need to speed up its learning. One method is to use humans to provide demonstrations. Using human demonstrations in RL is not new [Argall et al.2009]. However, only recently has this area gained traction as a possible way of speeding up deep RL [Kurin et al.2017, Vinyals et al.2017, Hester et al.2017].

Speeding up deep reinforcement learning can be achieved by addressing the two problems it is trying to solve: 1) feature learning and 2) policy learning. In this work, we focus only on the problem of feature learning in order to speed up learning in deep RL. We do so by pre-training to learn the underlying features in the hidden layers of the network. We bring to deep RL a common technique that is widely used to speed up training in deep learning: pre-training a network [Erhan et al.2009, Erhan et al.2010, Yosinski et al.2014]. However, the success of this technique in supervised deep learning is attributed to the large datasets that are available and used to pre-train networks.

In this work, we propose an approach for speeding up deep reinforcement learning algorithms using only a relatively small amount of non-expert human demonstrations. This approach starts by pre-training a deep neural network using human demonstrations through supervised learning. Similar work has shown that this step would learn to imitate the human demonstrator [Argall et al.2009]. However, what is most interesting to us are the underlying features learned even with a small amount of data.

We test our approach in both Deep Q-network (DQN) and Asynchronous Advantage Actor-Critic (A3C) and evaluate its performance using Pong, Freeway, and Beamrider in the Atari 2600 domain. Our results show speed-ups in five of the six cases. The improvements in Pong and Freeway are quite large in DQN, and A3C’s improvement on Pong is especially large. The generality of this approach means that it can be easily incorporated into multiple deep RL algorithms.

Related Work

Our work does not fall directly under transfer learning, but it is similar to one of the transfer learning methods in deep learning. In training deep neural networks for image classification, Yosinski et al. [Yosinski et al.2014] have shown how transferring the features learned from existing models allows new models to learn faster, particularly when the datasets are similar. In this work, we use a deep learning classifier as the source network, and the classifier’s network is used to initialize the RL agent’s network.

Existing work on pre-training in RL [Abtahi and Fasel2011, Anderson, Lee, and Elliott2015] has shown that there is an improvement when pre-training the network. However, their networks have a much smaller number of parameters and they use the state dynamics of their domains as network input. In our approach, we use the raw images of the domain as network input — our RL agent also needs to learn the latent features while learning its policy.

Our approach of using supervised learning for pre-training is also similar in spirit to that of Anderson, Lee, and Elliott [Anderson, Lee, and Elliott2015]. They pre-train by learning to predict the state dynamics. We instead pre-train our network by using the game’s image frames from the human demo as training data, which are individually labeled by the action taken by the human demonstrator. This setup is similar to how one could derive a policy when learning from demonstration (LfD) [Argall et al.2009].

Another approach to pre-training is to learn the latent features by using unsupervised learning through Deep Belief Networks [Abtahi and Fasel2011]. Although this pre-training approach falls under a different machine learning paradigm, its goals are similar to those of our approach in that pre-trained networks learn better than randomly initialized ones.

There are more recent works that leverage humans in deep RL. Christiano et al. [Christiano et al.2017] use human feedback to learn a reward function. Another recent work similarly pre-trains the network with human demonstrations in DQN [Hester et al.2017]. However, their pre-training combines the large margin supervised loss and the temporal difference loss, which tries to closely imitate the demonstrator. In our work, we only use the cross-entropy loss while we focus on the learned features.

Silver et al. [Silver et al.2016] trained a supervised learner on human demonstrations and used the supervised learner’s network to initialize RL’s policy network. They tested this approach in a single domain, used a huge amount of data to train the supervised learner, and their data is from human experts. Our work is the first to provide a comparative analysis of how such an approach impacts deep reinforcement learning algorithms and how well it can complement existing deep RL algorithms when human demonstrations are available. Their work also focuses more on optimizing the policy learned from humans, while our paper focuses on learning the underlying features. Our work shows that a huge amount of data is not required to gain some improvements: a supervised learner can still learn important latent features even when the demonstrated human data is from non-experts and the dataset is small.

Deep Reinforcement Learning

A reinforcement learning (RL) problem is typically modeled using a Markov Decision Process, represented by a 5-tuple $\langle S, A, P, R, \gamma \rangle$. An RL agent explores an unknown environment by taking an action $a \in A$. Each action leads the agent to a certain state $s \in S$, and a reward $r$ is given based on the action taken and the next state the agent lands in. The goal of an RL agent is to learn to maximize the expected return value $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ for each state at time $t$. The discount factor $\gamma \in [0, 1]$ determines the relative importance of future and immediate rewards.
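To make the role of the discount factor concrete, the return of a finite reward sequence can be computed as in the minimal sketch below. The function name `discounted_return` is hypothetical and the episode is assumed to be finite; the sketch only illustrates the standard discounted sum.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Work backwards through the episode: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With a discount factor close to 0 the agent mostly values the immediate reward, while a value close to 1 weights distant rewards almost as much as immediate ones.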

Deep Q-network

The recent development of deep RL has gained great attention due to its ability to generalize and solve problems in different domains. The first such method, deep Q-network (DQN) [Mnih et al.2015], learns to solve 49 Atari games directly from screen pixels by combining Q-learning [Watkins and Dayan1992] with a deep convolutional neural network. Instead of learning the value of states, classic Q-learning learns the value of state-action pairs $Q(s, a)$, which is the expected discounted reward obtained by performing action $a$ in state $s$ and acting optimally thereafter. The optimal policy can then be deduced by following the actions that have the maximum Q value, $\pi^*(s) = \arg\max_a Q^*(s, a)$.

Directly computing the Q value is not feasible when the state space is large or continuous (e.g., in Atari games). The DQN algorithm uses a convolutional neural network as a function approximator to estimate the Q function $Q(s, a; \theta) \approx Q^*(s, a)$, where $\theta$ denotes the network’s weight parameters. For each iteration $i$, DQN is trained to minimize the mean-squared error (MSE) between the Q-network and its target $y = r + \gamma \max_{a'} Q(s', a'; \theta_i^-)$, where $\theta_i^-$ denotes the weight parameters of the target network, generated from previous iterations. The reward $r$ uses reward clipping: all positive rewards are clipped to 1, all negative rewards to -1, and rewards of 0 are left unchanged. The loss function at iteration $i$ can be expressed as:

$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s')}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\right)^2\right]$$

where $(s, a, r, s')$ are state-action samples drawn from experience replay memory with a minibatch of size 32. The use of experience replay memory, along with a target network and reward clipping, helps to stabilize learning. During training, the agent also follows an $\epsilon$-greedy policy to obtain sufficient exploration of the state space.
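The pieces of the DQN loss (reward clipping, the target computed from the target network, and the squared error averaged over a minibatch) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: `q_online` and `q_target` are hypothetical callables that map a state to a list of Q values, the discount of 0.99 is assumed, and dropping the bootstrap term at terminal states is the usual convention rather than something stated in the text.

```python
def clip_reward(r):
    """Clip a raw game reward to -1, 0, or +1."""
    return (r > 0) - (r < 0)


def td_target(reward, next_q_values, terminal, gamma=0.99):
    """Target y = r + gamma * max_a' Q_target(s', a'), with clipped reward."""
    r = clip_reward(reward)
    if terminal:
        # Assumed convention: no bootstrapping past the end of an episode.
        return float(r)
    return r + gamma * max(next_q_values)


def dqn_loss(batch, q_online, q_target, gamma=0.99):
    """Mean squared TD error over a batch of (s, a, r, s_next, terminal)."""
    errors = []
    for s, a, r, s_next, terminal in batch:
        y = td_target(r, q_target(s_next), terminal, gamma)
        errors.append((y - q_online(s)[a]) ** 2)
    return sum(errors) / len(errors)
```

In a real agent the two callables would be the online and target networks, and the batch would hold 32 samples drawn from the replay memory.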

Asynchronous Advantage Actor-critic

There are a few drawbacks of using experience replay memory in the DQN algorithm. First, having to store all experiences is space-consuming and could slow down learning. Second, using replay memory limits DQN to off-policy algorithms such as Q-learning. The asynchronous advantage actor-critic (A3C) algorithm [Mnih et al.2016] was proposed to overcome these problems. A3C has set a new benchmark for deep RL since it not only surpasses DQN’s performance in playing Atari games but can also be applied to continuous control problems.

A3C combines the actor-critic algorithm [Sutton and Barto1998] with deep learning. Unlike value-based algorithms (e.g., Q-learning), where only a value function is learned, actor-critic is policy-based: both a policy function and a value function are maintained. The policy function is called the actor, which takes actions based on the current policy $\pi(a_t \mid s_t; \theta)$. The value function is called the critic, which serves as a baseline to evaluate the quality of the action by returning the state value $V(s_t; \theta_v)$ for the current state under policy $\pi$. The policy is directly parameterized and improved via policy gradient. To reduce the variance in the policy gradient, an advantage function is used and calculated as $A(s_t, a_t) = R_t - V(s_t; \theta_v)$ at time step $t$ for action $a_t$ at state $s_t$, where $R_t$ is the expected return at time $t$. The loss function for A3C combines a policy-gradient term weighted by this advantage with a value-function regression term [Mnih et al.2016].

In A3C, actor-learners run in parallel, each with its own copy of the environment and of the parameters for the policy and value function. This enables the algorithm to explore different parts of the environment so that observations are not correlated. This mimics the function of experience replay memory in DQN, while being more efficient in space and training time. Each actor-learner pair performs an update on parameters every $t_{max}$ actions, or when a terminal state is reached; this is similar to using minibatches, as is done in DQN. Updates are synchronized to a master learner that maintains a central policy and value function, which will be the final policy upon the completion of training.
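The per-rollout quantities used by an actor-learner can be sketched as below: returns are accumulated backwards over the actions taken since the last update, and the advantage is the return minus the critic's value estimate. The function name `n_step_advantages`, the default discount of 0.99, and the use of a bootstrap value of 0 after a terminal state are assumptions made for illustration.

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Compute returns R_t and advantages R_t - V(s_t) for one rollout.

    rewards, values: lists aligned by time step within the rollout.
    bootstrap_value: V of the state after the last step (0.0 if terminal).
    """
    returns = []
    running = bootstrap_value
    # Accumulate from the last step backwards: R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    advantages = [ret - v for ret, v in zip(returns, values)]
    return returns, advantages
```

A positive advantage indicates the action led to a better return than the critic expected, so the policy-gradient term increases its probability.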

Pre-Training Networks for Deep RL

Deep reinforcement learning can be divided into two sub-tasks: 1) feature learning, and 2) policy learning. Deep RL in itself is already quite successful in learning both tasks in parallel. However, learning both tasks at once also makes learning in deep RL very slow. We believe that addressing the feature learning task allows deep RL agents to focus more on learning the policy. We learn the features by pre-training deep RL’s network using human demonstrations from non-experts. We will refer to our approach as the pre-trained model.

The pre-trained model method is similar to the technique of transfer in deep learning [Yosinski et al.2014], where the parameters of an existing or previously trained model are used to initialize a new model to solve a different problem. In this case, we pre-train the network on a multiclass classification problem using deep learning with human demonstrations as our training data. We assume here that humans provide correct labels through the actions demonstrated while playing the game.

We first apply the pre-trained model approach in DQN and refer to it as pre-trained model for DQN (PMfDQN). In PMfDQN, we train a multiclass-classification deep neural network with a softmax cross-entropy loss function. The loss is minimized using Adam [Kingma and Ba2014] with a step size $\alpha$, a stability constant $\epsilon$, and TensorFlow’s default exponential decay rates ($\beta_1 = 0.9$, $\beta_2 = 0.999$). The network architecture for the classification follows exactly the structure of the hidden layers of DQN [Mnih et al.2015], with three convolutional layers (conv1, conv2, conv3) and one fully connected layer (fc1). The network’s output layer also has a single output for each valid action, but it uses the cross-entropy loss instead of the TD loss. The learned weights and biases from the classification model’s hidden layers are used to initialize DQN’s network in place of random initialization.

We have also tested transferring all layers, including the output layer, in some experiments. When transferring the output layer, normalizing its parameters was necessary to achieve a positive transfer. To normalize the output layer, we keep track of the maximum value of the output layer during training and use it as a divisor for all of the layer’s weights and biases during the initial transfer. Without normalization, the values of the output layer tend to explode. We also loaded the human demonstrations into the replay memory, thus removing the need for DQN to take uniform random actions for 50,000 frames to initially populate the replay memory [Mnih et al.2015].
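The transfer step can be sketched in a few lines. This is a minimal illustration, not the authors' TensorFlow code: parameters are represented as plain Python lists stored in a dictionary keyed by layer name, and the function names `normalize_output_layer` and `init_from_pretrained` are hypothetical.

```python
def normalize_output_layer(weights, biases, max_output):
    """Divide all output-layer weights and biases by the largest output
    value observed during pre-training."""
    if max_output <= 0:
        raise ValueError("max_output must be positive")
    scaled_weights = [[w / max_output for w in row] for row in weights]
    scaled_biases = [b / max_output for b in biases]
    return scaled_weights, scaled_biases


def init_from_pretrained(dqn_params, pretrained_params,
                         layers=("conv1", "conv2", "conv3", "fc1")):
    """Copy the named layers from the classifier into the DQN parameters.

    Layers not listed (e.g., the output layer) keep their existing,
    randomly initialized values.
    """
    initialized = dict(dqn_params)
    for name in layers:
        initialized[name] = pretrained_params[name]
    return initialized
```

Transferring only the hidden layers corresponds to the default `layers` argument; transferring all layers would add the output layer to that tuple after normalizing it.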

The pre-trained model method can also be applied in A3C, which we will refer to as pre-trained model for A3C (PMfA3C). In PMfA3C, we pre-trained the multiclass classifier using the same hyperparameters and optimization method as in PMfDQN, and we experimented with two different network structures. The first network uses the same hidden layers as in [Sharma, Lakshminarayanan, and Ravindran2017], with three convolutional layers (conv1, conv2, conv3) and one fully connected layer (fc1), but without the LSTM cells. Its output layer is exactly as described for PMfDQN. The second network is inspired by one-vs-all multiclass classification and multitask learning [Caruana1998]. It differs from the first network in that it uses multiple output heads, where each class or action has its own output layer. Each individual output layer becomes a one-vs-all classifier. During each training iteration, a uniform probability distribution is used to select which output layer to train, and gradients are backpropagated to the shared hidden layers of the network. In both multiclass networks, only the hidden layers are used to initialize A3C’s network.
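The labeling and head-selection logic of the one-vs-all network can be sketched as follows. Only the bookkeeping is shown; the network itself and the gradient step are omitted, and the function names are hypothetical.

```python
import random


def one_vs_all_targets(action, num_actions):
    """Binary target per head: 1 for the demonstrated action, 0 otherwise."""
    return [1 if k == action else 0 for k in range(num_actions)]


def pick_head(num_actions, rng=random):
    """Choose which output head to train this iteration, uniformly at random."""
    return rng.randrange(num_actions)


def training_example(action, num_actions, rng=random):
    """Return (head index, binary label) for one training iteration."""
    head = pick_head(num_actions, rng)
    label = one_vs_all_targets(action, num_actions)[head]
    return head, label
```

Because every head shares the same hidden layers, each sampled head contributes gradients to the features that are later transferred to A3C.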

Since DQN uses experience replay memory, it is also possible to pre-train its network just by loading the human demonstrations in the replay memory. We refer to this experiment as pre-training in DQN (PiDQN). While being a naive way to incorporate human demonstrations in DQN, this is an interesting method to pre-train as it allows the DQN agent to learn both the features and policy without any interaction with the actual Atari environment. However, this pre-training method does not generalize to A3C and/or other deep RL algorithms that do not use a replay memory. We would like to address this in future work by applying this naive approach to an alternative version of A3C that uses a replay memory [Wang et al.2016].
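Pre-filling the replay memory with demonstrations, as PiDQN does, might look like the sketch below. The `ReplayMemory` class and `load_demonstrations` function are hypothetical stand-ins for the agent's actual replay memory; each stored item is assumed to be a (frame, action, reward, terminal) tuple.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity memory that discards the oldest experience when full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, frame, action, reward, terminal):
        self.buffer.append((frame, action, reward, terminal))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement, as in a standard minibatch draw.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def load_demonstrations(memory, demonstrations):
    """Insert every demonstration step into the memory; return its new size."""
    for frame, action, reward, terminal in demonstrations:
        memory.add(frame, action, reward, terminal)
    return len(memory)
```

Once loaded, pre-training would repeatedly sample minibatches from this memory and apply the usual DQN update without stepping the environment.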

Lastly, we conducted additional experiments in DQN that combine PMfDQN and PiDQN, with the goal of exploring whether a combined approach would achieve even greater performance in DQN.

Experimental Design

We conducted experiments in both DQN and A3C that are both implemented using Tensorflow r1.0 [Abadi et al.2015]. Due to limited computational resources, we tested our approach in only three Atari games: Pong, Freeway, and Beamrider, as shown in Figure 1. The games have 6, 3, and 9 actions, respectively. We use OpenAI Gym’s deterministic version of the Atari 2600 environment [Brockman et al.2016] with an action repeat of 4.

We use the same network architecture and hyperparameters for DQN as was done previously in [Mnih et al.2015] and we follow [Sharma, Lakshminarayanan, and Ravindran2017] for the LSTM-variant of A3C as their work closely replicates the results of the original A3C paper [Mnih et al.2016]. However, note that there are two key differences from the original A3C work. First, while using the same network architecture as in [Mnih et al.2015] for the three convolutional layers (conv1, conv2, and conv3), the fully connected layer (fc1) was modified to have 256 units (instead of 512) to connect with the 256 LSTM cells that followed. Second, we use instead of . We use 16 actor-learner threads for all A3C experiments.

In both DQN and A3C, we use the four most recent game frames as input to the network, where each frame is pre-processed as done in [Mnih et al.2015]. We also use the same evaluation technique for both DQN and A3C: the average reward over 125,000 steps is taken. In addition, DQN is evaluated as a deterministic policy where the agent uses an $\epsilon$-greedy action selection method. A3C is evaluated as a stochastic policy, where the output policy is used as action probabilities.
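The two evaluation-time action selection rules can be contrasted with a short sketch. The function names are hypothetical, and the evaluation value of epsilon is left as a parameter rather than assumed.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def sample_from_policy(probs, rng=random):
    """Sample an action index using the policy output as probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

The first rule is the one described for DQN evaluation, and the second corresponds to evaluating A3C as a stochastic policy.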

Figure 1: Atari 2600 game screenshot of Pong, Freeway, and Beamrider from left to right.

Collection of Human Demonstration

We used OpenAI Gym’s [Brockman et al.2016] keyboard interface to allow a human demonstrator to interact with the Atari environment. The demonstrator is provided with the game rules and a set of valid actions with their corresponding keyboard keys for each game. The action repeat was set to one (instead of four) to provide smoother transitions during game play. During the demonstration, we collect every fourth frame of the game play, saving the game state as the game’s image, the action taken, the reward received, and whether the game’s current state is a terminal state. The format of the stored data follows the exact structure of the experience replay memory used in DQN.

We use a non-expert human demonstrator who plays five games. Each game play has a maximum of five minutes of playing time. The demonstration ends when the game play reaches the time limit or when the game terminates, whichever comes first. Table 1 provides a breakdown of the human demonstration size for each game and the human performance level.

Game Worst Score Best Score # of Frames
Beamrider 2,160 3,406 11,205
Freeway 28 31 10,241
Pong -10 5 11,265
Table 1: Human demonstration over five plays per game.


This section presents and discusses results from pre-training deep RL’s network for DQN and A3C.


Using PMfDQN, we trained a multiclass-classification network for each of the three Atari games with the human demonstration dataset. Each network was trained with a batch size of 32 for 150,000 training iterations. The number of training iterations was chosen as the smallest number at which the training loss for all games converges approximately to zero. The trained classification networks provide us with the pre-trained models, whose weights and biases are used to initialize DQN’s network. Results in Figure 2 show that PMfDQN speeds up training in all three Atari games. We also tested PiDQN with the same number of pre-training iterations as in PMfDQN. Figure 2 shows that PiDQN provides varying results across the three games but is worse than PMfDQN. Additionally, Figure 2(a) shows that although pre-training for 1 million iterations provides a better result than the shorter pre-training of 150,000 iterations, it still performs much worse than PMfDQN. This indicates that relevant features are better learned via supervised learning.

Figure 2: Performance evaluation of baseline and pre-training using DQN. The x-axis is the training epoch where an epoch corresponds to two million steps. The y-axis is the average testing score over three trials where the shaded regions correspond to the standard deviation.

In addition, to see if we can further improve DQN through pre-training, we use PMfDQN followed by PiDQN with 150,000 pre-training iterations each. The results of this experiment were surprising since, with more pre-training, the results should have been better. However, Figure 2 shows less improvement than PMfDQN alone for Pong and Freeway, while Beamrider had similar results to PMfDQN. We believe that this is due to the high initial exploration rate that DQN has during training. Under this setting, the agent takes almost entirely random actions until the value of $\epsilon$ has decayed to a much lower exploration rate. In [Mnih et al.2015], $\epsilon$ is decayed over one million steps, resulting in the replay memory being filled with experiences of random actions; we believe this has an adverse effect on what has already been learned from the pre-training steps. Therefore, we initialize $\epsilon$ to a low value when using PMfDQN followed by PiDQN. Results for all games in Figure 2 reveal that combining PMfDQN with PiDQN with a low initial exploration rate is as good as PMfDQN by itself and even better for Freeway. This result is beneficial, especially when applying DQN to real-world applications, since we can now remove the high exploration rate at the initial part of the training.
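The effect of the initial exploration rate can be seen in a small sketch of a linear decay schedule. The function name `epsilon_at` is hypothetical, and the default start and end values (1.0 and 0.1) are assumptions based on the commonly cited DQN settings; only the one-million-step decay horizon is taken from the text.

```python
def epsilon_at(step, start=1.0, end=0.1, decay_steps=1_000_000):
    """Linearly anneal epsilon from start to end over decay_steps steps."""
    clamped = min(max(step, 0), decay_steps)
    fraction = clamped / decay_steps
    return start + fraction * (end - start)
```

Calling `epsilon_at(step, start=0.1)` keeps the rate low from the first step, which is the kind of setting used when PMfDQN is followed by PiDQN; with the default start the agent acts almost entirely at random early in training.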

Ablation Studies

We consider two modifications to PMfDQN to further analyze its performance. In our first ablation study, we replaced human demonstrations with random demonstrations. We are interested in knowing how important it is to use human demonstrations in comparison with using a random agent. We conducted this experiment in Pong, and the results in Figure 3 show that pre-training with random demonstrations is worse than the DQN baseline. This experiment indicates that some level of competency is needed from the demonstrator in order to extract useful features during pre-training.

In our second ablation study, we excluded the second fully connected layer (fc2) (i.e., the output layer) when initializing the DQN network with the pre-trained model. This will allow us to know if supervised learning does learn important features, particularly in the hidden layers. We ran experiments without transferring the output layer from the pre-trained model. Empirically, results in Figure 3 show that besides losing the initial jumpstart at the beginning, the training time to reach convergence is not different from the time when using all layers. This indicates that it is actually the features in the hidden layers that provide most of the improvement in the training speed. This is not surprising since the output layer of a classifier is trying to learn to predict what action to take given a state without any consideration for maximizing the reward. Additionally, when learning from only a small amount of data where human performance was relatively poor (Table 1), the classifier’s policy would be far from optimal.


Using PMfA3C, we also pre-trained multiclass-classification networks for each Atari game with human demonstrations, as in PMfDQN, with a batch size of 32 for 150,000 training iterations. Since the network for the LSTM-variant of A3C uses LSTM cells with two output layers, we only initialize A3C’s network with the pre-trained model’s hidden layers. In Figure 4, results show improvements in the training time in both Pong and Beamrider, with a much higher improvement in Pong.

Figure 3: Performance evaluation on the ablation studies for Pong using DQN. The results are the average testing score over three trials where the shaded regions correspond to the standard deviation.
Figure 4: Performance of baseline and pre-training using A3C. The x-axis is the number of training steps, which is also the number of visited game frames among all parallel workers. The y-axis is the average testing score over three trials where the shaded regions correspond to the standard deviation.

However, there is no improvement in Freeway. We attribute this to the poor baseline performance of Freeway in the original A3C work [Mnih et al.2016] (shown in the Figure 4 baseline). Our approach focuses on learning features without addressing improvements in policy, so no improvement in Freeway was expected. Freeway in A3C needs a better way of exploring states in order to learn a near-optimal policy for the game. This is something we will try to address in future work.

With strong improvements observed in A3C, can we still gain further improvements if we pre-train our classification network longer? We tried longer training using the one-vs-all multiclass-classification network with shared hidden layers. Since each class or action is trained independently, we can now observe the different convergence of the training loss for each class. This allowed us to use the same technique of training until the training loss for all classes is approximately zero. Using the one-vs-all classification, we pre-train for 450,000 iterations in Pong and 650,000 iterations in Beamrider. Training longer results in a slight improvement for Beamrider, but Pong shows a very large improvement, as shown in Figure 4.

Figure 5: Visualization of the normalized weights on Pong’s first convolutional layer using PMfA3C. The weights (filters) are from a pre-trained classification network trained for 150,000 iterations (left image), and from the final weights after 50 million training steps in A3C (right image). To better illustrate the similarity of the weights, we provided two zoomed-in images of a particular filter from pre-trained conv1 (green box) and final conv1 (blue box).

The last experiment we conducted was to test whether important features could still be learned even with a much smaller number of demonstrations, in this case a single game play of only five minutes. We use the one-vs-all classification network to pre-train for Pong with only 2,253 game frames and 250,000 training iterations, and similarly for Beamrider with 2,232 game frames and 300,000 training iterations. In Figure 4, results for both Pong and Beamrider show that a large improvement is still achievable with only a small amount of demonstration. It is even more remarkable that in Beamrider the results are as good as pre-training with the full set of human demonstrations.

Additional Analysis

In order to understand what is accomplished with pre-training, we look closer at the filters (i.e., weights of the network layers) to determine how much the pre-trained features contribute to the final features learned. Thus, we further investigate how similar the initial weights of a deep RL network are to its final weights for each layer after learning a near-optimal policy. We quantify the similarity by computing the mean squared error (MSE) between the two sets of weights. A smaller MSE for a layer means higher similarity. Table 2 shows that the final weights are much more similar to the initial weights when using pre-training than when using random weight initialization. Furthermore, we looked at the visualization of each hidden layer and observed that the weights learned from classification and used as initial values in deep RL’s network provided features that were retained even after training in deep RL. Figure 5 shows a visualization of the first convolutional layer. The high similarity of the weights observed in all layers suggests that pre-training in classification was able to learn important features that were useful in deep RL.

Layer MSE (Pong) Baseline MSE (Pong) Pre-train MSE (Beamrider) Baseline MSE (Beamrider) Pre-train
conv1
conv2
conv3
fc1
Table 2: Evaluation of the similarity of features for each hidden layer. For the baseline, the mean squared error (MSE) is computed between the weights of a randomly initialized A3C network and its final weights. For pre-training, it is computed between the pre-trained model used as the initial weights and the final weights.
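The per-layer comparison can be sketched as follows, assuming each layer's weights have been flattened into a list of floats and stored in a dictionary keyed by layer name. The function names `layer_mse` and `compare_layers` are hypothetical.

```python
def layer_mse(initial, final):
    """Mean squared error between two equally sized flat weight lists."""
    if len(initial) != len(final):
        raise ValueError("weight lists must have the same length")
    squared = [(a - b) ** 2 for a, b in zip(initial, final)]
    return sum(squared) / len(squared)


def compare_layers(initial_params, final_params):
    """Return {layer name: MSE between initial and final weights}."""
    return {
        name: layer_mse(initial_params[name], final_params[name])
        for name in initial_params
    }
```

Running `compare_layers` once with random initial weights and once with pre-trained initial weights gives the two columns compared for each game.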

Discussion and Conclusion

The pre-training approach worked very well in Pong. This success can be explained by the human demonstration data the classifier was pre-trained with and by the simplicity of the game. Pong’s states are highly repetitive when compared to the other game environments, which are more dynamic. Beamrider has the most complex environment among the three games because it has different levels with varying difficulty. Although Freeway’s game state is also repetitive, the base agent’s inability to learn a good policy is a problem that leans more towards policy learning, which is not addressed in our approach.

Human demonstrations are a big part of the success of our approach. In future work, it is important to understand how the demonstrator’s performance and the amount of demonstration data affect the benefits of pre-training the network. We would also look into using recently released human demonstration datasets for Atari [Kurin et al.2017] and Starcraft II [Vinyals et al.2017].

Another issue that needs to be addressed with regard to the human demonstrations is that they suffer from highly imbalanced classes (actions). This is attributed to: 1) sparsity of some actions, like the torpedo action in Beamrider that is limited to three uses at each level, 2) actions that are closely related, like in Beamrider where there are left and right actions plus the combined actions of left-fire and right-fire, so that a demonstrator would usually just use the left and right actions alone and use the fire action by itself, and 3) games having a default no-operation action.

As noted in [He and Garcia2009], when the imbalance problem is not addressed, the classifier will learn a policy that is biased towards the majority classes. It is interesting that the classifier is still able to learn important features without handling this issue. However, we see handling the class imbalance as interesting future work, so that we can determine whether it leads to better features and further improvements in our approach.
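As an illustration of how the imbalance could be measured, the sketch below counts how often each action appears in a demonstration and derives inverse-frequency class weights. The weighting scheme is one common option and is an assumption here; it is not something evaluated in this work, and the function names are hypothetical.

```python
from collections import Counter


def action_frequencies(actions):
    """Fraction of demonstration frames labeled with each action."""
    counts = Counter(actions)
    total = len(actions)
    return {a: c / total for a, c in counts.items()}


def inverse_frequency_weights(actions):
    """Weight n / (k * count) per action, so rare actions get larger weights."""
    counts = Counter(actions)
    n, k = len(actions), len(counts)
    return {a: n / (k * c) for a, c in counts.items()}
```

Such weights could be used to scale the per-class loss during pre-training, although whether this yields better features remains an open question.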

As we investigate further ways to improve our approach, we know there is a limit to how much improvement pre-training can provide without addressing policy learning. In this approach, we have already trained a model with a policy that tries to imitate the human demonstrator [Argall et al.2009]. We can extend this work by simply using the pre-trained model’s policy to provide advice to the agent (e.g., [Wang and Taylor2017]).

Overall, learning directly from raw images through deep neural networks is a major reason why learning is slow in deep RL. We have demonstrated that our method of initializing deep RL’s network with a pre-trained model can significantly speed up learning in deep RL.


The A3C implementation was a modification of an existing implementation. The authors would like to thank Sahil Sharma and Kory Matthewson for providing very useful insights on the actor-critic method. We thank WSU Kamiak for the computing resources we used for running the experiments.