The recent resurgence of neural networks in reinforcement learning can be attributed to the widespread success of Deep Reinforcement Learning (deep RL), which uses deep neural networks for function approximation [Mnih et al.2015, Mnih et al.2016]. Besides deep RL’s state-of-the-art results, one of its most impressive accomplishments is its ability to learn directly from raw images. However, in order to bring the success of deep RL in virtual environments into real-world applications, we must address the lengthy training time that is required to learn a policy.
Deep RL suffers from poor initial performance like classic RL algorithms since it learns tabula rasa [Sutton and Barto1998]. In addition, deep RL inherently takes longer to learn because besides learning a policy it also learns directly from raw images — instead of using hand-engineered features, deep RL needs to learn to construct relevant high-level features from raw images. These problems are consequential in real-world applications with expensive data, like in robotics, finance, or medicine.
In order to use deep RL for solving real-world problems, there is a need to speed up its learning. One method is to use humans to provide demonstrations. Using human demonstrations in RL is not new [Argall et al.2009]. However, only recently has this area gained traction as a possible way of speeding up deep RL [Kurin et al.2017, Vinyals et al.2017, Hester et al.2017].
Speeding up deep reinforcement learning can be achieved by addressing the two problems it is trying to solve: 1) feature learning and 2) policy learning. In this work, we focus only on the problem of feature learning in order to speed up learning in deep RL. We do so by pre-training the network to learn the underlying features in its hidden layers. We bring to deep RL a common technique that is widely used to speed up training in deep learning: pre-training a network [Erhan et al.2009, Erhan et al.2010, Yosinski et al.2014]. However, the success of this technique in supervised deep learning is attributed to the large datasets that are available and used to pre-train networks.
In this work, we propose an approach to speeding up deep reinforcement learning algorithms using only a relatively small amount of non-expert human demonstrations. This approach starts by pre-training a deep neural network on human demonstrations through supervised learning. Similar work has shown that this step learns to imitate the human demonstrator [Argall et al.2009]. However, what is most interesting to us are the underlying features learned even with a small amount of data.
We test our approach in both Deep Q-network (DQN) and Asynchronous Advantage Actor-Critic (A3C) and evaluate its performance using Pong, Freeway, and Beamrider in the Atari 2600 domain. Our results show speed-ups in five of the six cases. The improvements in Pong and Freeway are quite large in DQN, and A3C’s improvement on Pong is especially large. The generality of this approach means that it can be easily incorporated into multiple deep RL algorithms.
Our work does not fall directly under transfer learning, but it is similar to one of the transfer learning methods in deep learning. In training deep neural networks for image classification, Yosinski et al. [Yosinski et al.2014] have shown how transferring the features learned from existing models allows new models to learn faster, particularly when the datasets are similar. In this work, we use a deep learning classifier as the source network, and the classifier’s network is used to initialize the RL agent’s network.
Existing work on pre-training in RL [Abtahi and Fasel2011, Anderson, Lee, and Elliott2015] has shown that there is an improvement when pre-training the network. However, their networks have a much smaller number of parameters and they use the state dynamics of their domains as network input. In our approach, we use the raw images of the domain as network input — our RL agent also needs to learn the latent features while learning its policy.
Our approach of using supervised learning for pre-training is also similar in spirit to that of Anderson, Lee, and Elliott [Anderson, Lee, and Elliott2015]. They pre-train by learning to predict the state dynamics. We instead pre-train our network by using the game’s image frames from the human demonstration as training data, each labeled with the action taken by the human demonstrator. This setup is similar to how one could derive a policy when learning from demonstration (LfD) [Argall et al.2009].
Although this pre-training approach differs, in that it falls under a different machine learning paradigm, its goals are similar to those of our approach in that pre-trained networks learn better than randomly initialized ones.
There are more recent works that leverage humans in deep RL. Christiano et al. [Christiano et al.2017] use human feedback to learn a reward function. Another recent work similarly pre-trains the network with human demonstrations in DQN [Hester et al.2017]. However, their pre-training combines the large margin supervised loss and the temporal difference loss, which tries to closely imitate the demonstrator. In our work, we use only the cross-entropy loss and focus on the learned features.
Silver et al. [Silver et al.2016] trained a supervised learner on human demonstrations and used the supervised learner’s network to initialize RL’s policy network. They tested this approach in a single domain, used a huge amount of data to train the supervised learner, and their data came from human experts. Our work is the first to provide a comparative analysis of how such an approach impacts deep reinforcement learning algorithms and how well it can complement existing deep RL algorithms when human demonstrations are available. Their work also focuses more on optimizing the policy learned from humans, while our paper focuses on learning the underlying features. Our work shows that a huge amount of data is not required to gain some improvement: a supervised learner can still learn important latent features even when the human demonstration data comes from non-experts and the dataset is small.
Deep Reinforcement Learning
A reinforcement learning (RL) problem is typically modeled using a Markov Decision Process, represented by a 5-tuple $\langle S, A, P, R, \gamma \rangle$. An RL agent explores an unknown environment by taking an action $a \in A$. Each action leads the agent to a certain state $s \in S$, and a reward $r$ is given based on the action taken and the next state it lands in. The goal of an RL agent is to learn to maximize the expected return value $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ for each state at time $t$. The discount factor $\gamma \in [0, 1]$ determines the relative importance of future and immediate rewards.
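As a small illustration of the discounted return described above, the following sketch computes it for a finite reward sequence. The function name and the example rewards are assumptions made for illustration only; they do not come from the experiments in this paper.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over a finite reward sequence."""
    total = 0.0
    # Work backwards so each earlier step discounts everything after it once.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: a single reward of 1 arrives two steps in the future.
rewards = [0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # the late reward is discounted twice
print(discounted_return(rewards, gamma=1.0))  # no discounting
```

With a discount factor close to 0 the agent values only immediate rewards, while a value close to 1 weighs future rewards almost as much as immediate ones.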
The recent development of deep RL has gained great attention due to its ability to generalize and solve problems in different domains. The first such method, deep Q-network (DQN) [Mnih et al.2015], learns to solve 49 Atari games directly from screen pixels by combining Q-learning [Watkins and Dayan1992] with a deep convolutional neural network. Instead of learning the value of states, classic Q-learning learns the value of state-action pairs $Q(s, a)$, which is the expected discounted reward obtained by performing action $a$ in state $s$ and thereafter performing optimally. The optimal policy $\pi^*$ can then be deduced by following the actions that have the maximum Q value, $\pi^*(s) = \arg\max_a Q^*(s, a)$.
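The tabular form of Q-learning can be sketched as follows. This is a minimal illustration, not the implementation used in this work; the learning rate `alpha`, the state and action names, and the dictionary-based table are all assumptions.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, terminal=False):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_b Q(s_next, b)."""
    best_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def greedy_action(Q, s, actions):
    """Deduce the policy by picking the action with the largest Q value."""
    return max(actions, key=lambda a: Q[(s, a)])


# Hypothetical two-action problem with an initially empty (all-zero) table.
Q = defaultdict(float)
actions = ["left", "right"]
q_update(Q, s=0, a="right", r=1.0, s_next=1, actions=actions, terminal=True)
print(greedy_action(Q, 0, actions))
```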
Directly computing the Q value is not feasible when the state space is large or continuous (e.g., in Atari games). The DQN algorithm uses a convolutional neural network as a function approximator to estimate the Q function $Q(s, a; \theta)$, where $\theta$ denotes the network’s weight parameters. For each iteration $i$, DQN is trained to minimize the mean-squared error (MSE) between the Q-network and its target $y = r + \gamma \max_{a'} Q(s', a'; \theta_i^-)$, where $\theta_i^-$ denotes the weight parameters of the target network, generated from previous iterations. The reward $r$ uses reward clipping, which scales the scores by clipping all positive rewards at 1 and all negative rewards at -1, leaving rewards of 0 unchanged. The loss function at iteration $i$ can be expressed as:
$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s')} \left[ \left( y - Q(s, a; \theta_i) \right)^2 \right]$$
where $(s, a, r, s')$ are state-action samples drawn from experience replay memory with a minibatch of size 32. The use of experience replay memory, along with a target network and reward clipping, helps to stabilize learning. During training, the agent also follows an $\epsilon$-greedy policy to obtain sufficient exploration of the state space.
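The individual ingredients of DQN training mentioned above (reward clipping, the target computed from the target network, the MSE loss over a minibatch, and epsilon-greedy action selection) can be sketched in plain Python. The function names are hypothetical and the Q values are passed in as plain lists rather than produced by a network, so this is only an illustration of the arithmetic.

```python
import random


def clip_reward(r):
    """Clip a raw game reward to -1, 0, or +1."""
    return (r > 0) - (r < 0)


def td_target(r, next_q_values, gamma=0.99, terminal=False):
    """Target y = r + gamma * max over the target network's Q values for s'."""
    if terminal:
        return float(r)
    return r + gamma * max(next_q_values)


def mse_loss(predictions, targets):
    """Mean squared error between Q-network outputs and their targets."""
    return sum((y - q) ** 2 for q, y in zip(predictions, targets)) / len(targets)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


# Hypothetical transition: raw reward 25, target-network Q values for s'.
y = td_target(clip_reward(25), next_q_values=[0.5, 2.0], gamma=0.5)
print(y, mse_loss([1.0], [y]), epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))
```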
Asynchronous Advantage Actor-critic
There are a few drawbacks to using experience replay memory in the DQN algorithm. First, having to store all experiences is space-consuming and could slow down learning. Second, using replay memory limits DQN to off-policy algorithms such as Q-learning. The asynchronous advantage actor-critic (A3C) algorithm [Mnih et al.2016] was proposed to overcome these problems. A3C has set a new benchmark for deep RL since it not only surpasses DQN’s performance in playing Atari games but can also be applied to continuous control problems.
A3C combines the actor-critic algorithm [Sutton and Barto1998] with deep learning. Unlike value-based algorithms (e.g., Q-learning), where only a value function is learned, actor-critic is policy-based: both a policy function and a value function are maintained. The policy function is called the actor, which takes actions based on the current policy $\pi(a_t \mid s_t; \theta)$. The value function is called the critic, which serves as a baseline to evaluate the quality of the action by returning the state value $V(s_t; \theta_v)$ for the current state under policy $\pi$. The policy is directly parameterized and improved via policy gradient. To reduce the variance of the policy gradient, an advantage function is used, calculated as $A(s_t, a_t) = R_t - V(s_t; \theta_v)$ at time step $t$ for action $a_t$ at state $s_t$, where $R_t$ is the expected return at time $t$. The loss function for the A3C policy is:
$$L(\theta) = -\log \pi(a_t \mid s_t; \theta) \left( R_t - V(s_t; \theta_v) \right)$$
In A3C, actor-learners run in parallel, each with its own copy of the environment and of the parameters for the policy and value function. This enables the algorithm to explore different parts of the environment, so that observations are not correlated. This mimics the function of experience replay memory in DQN, while being more efficient in space and training time. Each actor-learner performs an update on the parameters every $t_{max}$ actions, or when a terminal state is reached, which is similar to using minibatches as is done in DQN. Updates are synchronized to a master learner that maintains a central policy and value function, which will be the final policy upon the completion of training.
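The return and advantage computed by one actor-learner over a short rollout can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, the critic's value estimates are supplied as plain numbers, and `bootstrap_value` stands for the critic's estimate of the state after the last action (0 when the rollout ends in a terminal state).

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Returns R_t for each step of a rollout, bootstrapped from the critic."""
    returns = []
    R = bootstrap_value
    # Accumulate from the last step backwards: R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage A_t = R_t - V(s_t) for each step of the rollout."""
    return [R - v for R, v in zip(returns, values)]


# Hypothetical three-step rollout that ends in a terminal state.
returns = n_step_returns([0.0, 0.0, 1.0], bootstrap_value=0.0, gamma=0.5)
print(returns)
print(advantages(returns, values=[0.25, 0.0, 0.5]))
```

A positive advantage means the action led to a better return than the critic expected from that state, so the policy gradient increases the probability of that action.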
Pre-Training Networks for Deep RL
Deep reinforcement learning can be divided into two sub-tasks: 1) feature learning, and 2) policy learning. Deep RL in itself is already quite successful in learning both tasks in parallel. However, learning both tasks also makes learning in deep RL very slow. We believe that addressing the feature learning task would allow deep RL agents to focus more on learning the policy. We learn the features by pre-training deep RL’s network using human demonstrations from non-experts. We will refer to our approach as the pre-trained model.
The pre-trained model method is similar to the technique of transfer in deep learning [Yosinski et al.2014], where the parameters of an existing or previously trained model are used to initialize a new model to solve a different problem. In this case, we pre-train the network on a multiclass classification problem using deep learning with human demonstrations as our training data. We assume here that humans provide correct labels through the actions demonstrated while playing the game.
We first apply the pre-trained model approach in DQN and refer to it as pre-trained model for DQN (PMfDQN). In PMfDQN, we train a multiclass-classification deep neural network with a softmax cross-entropy loss function. The loss is minimized using Adam [Kingma and Ba2014] for optimization with the following hyperparameters: step size, stability constant, and Tensorflow’s default exponential decay rates. The network architecture for the classification follows exactly the structure of the hidden layers of DQN [Mnih et al.2015], with three convolutional layers (conv1, conv2, conv3) and one fully connected layer (fc1). The network’s output layer also has a single output for each valid action, but it uses the cross-entropy loss instead of the temporal difference (TD) loss. The learned weights and biases from the classification model’s hidden layers are used to initialize DQN’s network in place of random initialization. We have also tested transferring all layers, including the output layer, in some experiments. When transferring the output layer, normalization of its parameters was necessary to achieve a positive transfer. To normalize the output layer, we keep track of the maximum value of the output layer during training and use it as a divisor for all of the layer’s weights and biases during the initial transfer. Without normalization, the values of the output layer tend to explode. We also loaded the human demonstrations into the replay memory, thus removing the need for DQN to take uniform random actions for 50,000 frames to initially populate the replay memory [Mnih et al.2015].
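The transfer step described above can be sketched as follows. This is a simplified illustration rather than the actual Tensorflow code: networks are represented as hypothetical dictionaries mapping layer names to flat lists of weights and biases, the output layer is assumed to be named fc2, and `max_output` stands for the tracked maximum output value of the classifier.

```python
import copy

HIDDEN_LAYERS = ("conv1", "conv2", "conv3", "fc1")


def transfer_parameters(pretrained, target, include_output=False, max_output=None):
    """Return a copy of `target` initialized from the pre-trained classifier.

    Hidden layers are copied as-is. If the output layer is transferred too,
    its weights and biases are divided by `max_output` to normalize them.
    """
    result = copy.deepcopy(target)
    for name in HIDDEN_LAYERS:
        result[name] = copy.deepcopy(pretrained[name])
    if include_output:
        if not max_output or max_output <= 0:
            raise ValueError("max_output must be a positive number")
        result["fc2"] = {
            key: [v / max_output for v in values]
            for key, values in pretrained["fc2"].items()
        }
    return result


# Tiny hypothetical networks with two weights and one bias per layer.
names = HIDDEN_LAYERS + ("fc2",)
classifier = {n: {"w": [1.0, 2.0], "b": [0.5]} for n in names}
classifier["fc2"] = {"w": [4.0, -8.0], "b": [2.0]}
dqn = {n: {"w": [0.0, 0.0], "b": [0.0]} for n in names}

hidden_only = transfer_parameters(classifier, dqn)
all_layers = transfer_parameters(classifier, dqn, include_output=True, max_output=4.0)
print(hidden_only["fc2"], all_layers["fc2"])
```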
The pre-trained model method can also be applied in A3C, which we will refer to as pre-trained model for A3C (PMfA3C). In PMfA3C, we pre-trained the multiclass classifier using the same hyperparameters and optimization method as in PMfDQN, while experimenting with two different network structures. The first network uses the same hidden layers as in [Sharma, Lakshminarayanan, and Ravindran2017], with three convolutional layers (conv1, conv2, conv3) and one fully connected layer (fc1), but without the LSTM cells. Its output layer is exactly as described for PMfDQN. The second network is inspired by one-vs-all multiclass classification and multitask learning [Caruana1998]. It differs from the first network in that it uses multiple output heads, where each class or action has its own output layer. Each individual output layer becomes a one-vs-all classifier. During each training iteration, the output layer to train is selected from a uniform probability distribution, and gradients are backpropagated to the shared hidden layers of the network. In both multiclass networks, only the hidden layers are used to initialize A3C’s network.
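The per-iteration head selection of the one-vs-all network can be sketched as below. Only the label construction is shown; the function name is hypothetical, and actions are assumed to be encoded as integers from 0 to `num_actions - 1`.

```python
import random


def one_vs_all_labels(actions, num_actions, rng=random):
    """Pick one output head uniformly at random and build its binary labels.

    A frame is a positive example (1) for the chosen head if the demonstrator
    took that head's action, and a negative example (0) otherwise.
    """
    head = rng.randrange(num_actions)
    labels = [1 if a == head else 0 for a in actions]
    return head, labels


# Hypothetical minibatch of demonstrated actions in a three-action game.
head, labels = one_vs_all_labels([0, 2, 2, 1], num_actions=3, rng=random.Random(0))
print(head, labels)
```

Because every head shares the same hidden layers, training any one head still updates the features that are later transferred to A3C.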
Since DQN uses experience replay memory, it is also possible to pre-train its network just by loading the human demonstrations in the replay memory. We refer to this experiment as pre-training in DQN (PiDQN). While being a naive way to incorporate human demonstrations in DQN, this is an interesting method to pre-train as it allows the DQN agent to learn both the features and policy without any interaction with the actual Atari environment. However, this pre-training method does not generalize to A3C and/or other deep RL algorithms that do not use a replay memory. We would like to address this in future work by applying this naive approach to an alternative version of A3C that uses a replay memory [Wang et al.2016].
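A replay memory that is filled with demonstrations before any environment interaction can be sketched as follows. The class and method names are hypothetical, and the transition layout (frame, action, reward, terminal flag) is an assumption made for illustration.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer of (frame, action, reward, terminal) tuples."""

    def __init__(self, capacity):
        # The oldest transitions are dropped once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def add(self, frame, action, reward, terminal):
        self.buffer.append((frame, action, reward, terminal))

    def load_demonstrations(self, demonstrations):
        """Populate the memory with stored human transitions."""
        for transition in demonstrations:
            self.add(*transition)

    def sample(self, batch_size, rng=random):
        """Draw a minibatch uniformly at random without replacement."""
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Hypothetical demonstration of three transitions; frames are placeholders.
demos = [("frame0", 1, 0.0, False), ("frame1", 2, 1.0, False), ("frame2", 0, 0.0, True)]
memory = ReplayMemory(capacity=5)
memory.load_demonstrations(demos)
print(len(memory), memory.sample(2, rng=random.Random(0)))
```

Pre-training then amounts to running the usual DQN update on minibatches sampled from this memory before the agent takes any action in the environment.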
Lastly, we conducted additional experiments in DQN that combine PMfDQN and PiDQN, with the goal of exploring whether a combined approach would achieve much greater performance in DQN.
We conducted experiments in both DQN and A3C that are both implemented using Tensorflow r1.0 [Abadi et al.2015]. Due to limited computational resources, we tested our approach in only three Atari games: Pong, Freeway, and Beamrider, as shown in Figure 1. The games have 6, 3, and 9 actions, respectively. We use OpenAI Gym’s deterministic version of the Atari 2600 environment [Brockman et al.2016] with an action repeat of 4.
We use the same network architecture and hyperparameters for DQN as was done previously in [Mnih et al.2015] and we follow [Sharma, Lakshminarayanan, and Ravindran2017] for the LSTM-variant of A3C as their work closely replicates the results of the original A3C paper [Mnih et al.2016]. However, note that there are two key differences from the original A3C work. First, while using the same network architecture as in [Mnih et al.2015] for the three convolutional layers (conv1, conv2, and conv3), the fully connected layer (fc1) was modified to have 256 units (instead of 512) to connect with the 256 LSTM cells that followed. Second, we use instead of . We use 16 actor-learner threads for all A3C experiments.
In both DQN and A3C, we use the four most recent game frames as input to the network, where each frame is pre-processed as done in [Mnih et al.2015]. We also use the same evaluation technique for both DQN and A3C, taking the average reward over 125,000 steps. In addition, DQN is evaluated as a deterministic policy in which the agent uses an $\epsilon$-greedy action selection method. A3C is evaluated as a stochastic policy that uses the output policy as action probabilities.
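Evaluating a stochastic policy means sampling each action from the network's output distribution instead of always taking the most probable one. A minimal sketch, with a hypothetical function name and made-up probabilities:

```python
import random


def sample_action(policy_probs, rng=random):
    """Draw an action index according to the policy's output probabilities."""
    return rng.choices(range(len(policy_probs)), weights=policy_probs, k=1)[0]


# Hypothetical policy output for a three-action game.
rng = random.Random(0)
draws = [sample_action([0.2, 0.7, 0.1], rng=rng) for _ in range(1000)]
print({a: draws.count(a) for a in range(3)})
```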
Collection of Human Demonstration
We used OpenAI Gym’s [Brockman et al.2016] keyboard interface to allow a human demonstrator to interact with the Atari environment. The demonstrator is provided with game rules and a set of valid actions with their corresponding keyboard keys for each game. The action repeat was set to one (instead of four) to provide smoother transitions of the game play. During the demonstration, we collect every fourth frame of the game play, saving the game state using the game’s image, action taken, reward received, and if the game’s current state is a terminal state. The format of the stored data follows the exact structure of the experience replay memory used in DQN.
We use a non-expert human demonstrator who plays five games. Each game play has a maximum of five minutes of playing time. The demonstration ends when the game play reaches the time limit or when the game terminates, whichever comes first. Table 1 provides a breakdown of human demonstration size for each game and the human performance level.
| Game | Worst Score | Best Score | # of Frames |
This section presents and discusses results from pre-training deep RL’s network for DQN and A3C.
Using PMfDQN, we trained three multiclass-classification networks, one for each Atari game, with the human demonstration dataset. Each network was trained using a batch size of 32 for 150,000 training iterations. The number of training iterations was chosen as the smallest number at which the training loss for all games converges approximately to zero. The trained classification networks provide us with the pre-trained models. The weights and biases of each pre-trained model are used to initialize DQN’s network. Results in Figure 2 show that PMfDQN speeds up training in all three Atari games. We also tested PiDQN with the same number of pre-training iterations as in PMfDQN. Figure 2 shows that PiDQN provides varying results across the three games but is worse than PMfDQN. Additionally, Figure 1(a) shows that although pre-training with PiDQN for 1 million iterations provides a better result than the shorter pre-training of 150,000 iterations, it still performed much worse than PMfDQN. This indicates that relevant features are better learned via supervised learning.
Performance evaluation of baseline and pre-training using DQN. The x-axis is the training epoch where an epoch corresponds to two million steps. The y-axis is the average testing score over three trials where the shaded regions correspond to the standard deviation.
In addition, to see if we can further improve DQN through pre-training, we use PMfDQN followed by PiDQN with 150,000 pre-training iterations each. The results of this experiment were surprising since, with more pre-training, the results should have been better. However, Figure 2 shows less improvement than PMfDQN alone for Pong and Freeway, while Beamrider had similar results to PMfDQN. We believe that this is due to the high initial exploration rate that DQN has during training. Under this setting, the agent takes almost entirely random actions until the value of $\epsilon$ has decayed to a much lower exploration rate. In [Mnih et al.2015], $\epsilon$ is decayed over one million steps, resulting in the replay memory being filled with experiences of random actions, which we believe has an adverse effect on what has already been learned from the pre-training steps. Therefore, we initialize $\epsilon$ to a low value when using PMfDQN followed by PiDQN. Results for all games, shown in Figure 2, reveal that combining PMfDQN with PiDQN under a low initial exploration rate is as good as PMfDQN by itself and is even better for Freeway. This result is beneficial, especially when applying DQN to real-world applications, since we can now remove the high exploration rate at the initial part of the training.
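The effect of the initial exploration rate can be illustrated with a linear annealing schedule. The defaults below (annealing from 1.0 to 0.1 over one million steps) follow the schedule of [Mnih et al.2015]; the function name and the low initial value of 0.1 in the second call are assumptions used purely for illustration.

```python
def epsilon_at(step, initial=1.0, final=0.1, decay_steps=1_000_000):
    """Linearly anneal epsilon from `initial` to `final` over `decay_steps`."""
    fraction = min(max(step, 0), decay_steps) / decay_steps
    return initial + fraction * (final - initial)


for step in (0, 250_000, 1_000_000):
    # Default schedule versus a schedule that starts at a low exploration rate.
    print(step, epsilon_at(step), epsilon_at(step, initial=0.1))
```

Under the default schedule the agent acts mostly at random for a large part of the first million steps, whereas a low initial value lets the pre-trained network choose almost every action from the start.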
We consider two modifications to PMfDQN to further analyze its performance. In our first ablation study, we replaced human demonstrations with random demonstrations. We are interested in knowing how important it is to use human demonstrations in comparison with using a random agent. We conducted this experiment in Pong, and the results in Figure 3 show that pre-training with random demonstrations is worse than the DQN baseline. This experiment indicates that the demonstrator needs some level of competency in order for useful features to be extracted during pre-training.
In our second ablation study, we excluded the second fully connected layer (fc2) (i.e., the output layer) when initializing the DQN network with the pre-trained model. This will allow us to know if supervised learning does learn important features, particularly in the hidden layers. We ran experiments without transferring the output layer from the pre-trained model. Empirically, results in Figure 3 show that besides losing the initial jumpstart at the beginning, the training time to reach convergence is not different from the time when using all layers. This indicates that it is actually the features in the hidden layers that provide most of the improvement in the training speed. This is not surprising since the output layer of a classifier is trying to learn to predict what action to take given a state without any consideration for maximizing the reward. Additionally, when learning from only a small amount of data where human performance was relatively poor (Table 1), the classifier’s policy would be far from optimal.
Using PMfA3C, we also pre-trained multiclass-classification networks for each Atari game with human demonstrations, as in PMfDQN, with a batch size of 32 for 150,000 training iterations. Since the network for the LSTM-variant of A3C uses LSTM cells with two output layers, we only initialize A3C’s network with the pre-trained model’s hidden layers. In Figure 4, results show improvements in the training time in both Pong and Beamrider, with a much higher improvement in Pong.
However, there is no improvement in Freeway. We attribute this to the poor baseline performance of Freeway in the original A3C work [Mnih et al.2016] (shown in Figure 4 baseline). Our approach focuses on learning features without addressing improvements in policy — no improvements in Freeway with our approach were expected. Freeway in A3C needs a better way of exploring states in order to learn a near-optimal policy for the game. This is something we will try to address in future work.
With strong improvements observed in A3C, can we still gain further improvements if we pre-train our classification network longer? We tried longer training using the one-vs-all multiclass-classification network with shared hidden layers. Since each class or action is trained independently, we can now observe the convergence of the training loss separately for each class. This allowed us to use the same technique of training until the training loss for all classes is approximately zero. Using the one-vs-all classification, we pre-train for 450,000 iterations in Pong and 650,000 iterations in Beamrider. Training longer results in a slight improvement for Beamrider, but Pong shows a very large improvement, as shown in Figure 4.
The last experiment we conducted was to test whether important features could still be learned even with a much smaller number of demonstrations: in this case, a single game play of only five minutes of demonstration. We use a one-vs-all classification network to pre-train for Pong with only 2,253 game frames and 250,000 training iterations, and similarly for Beamrider with 2,232 game frames and 300,000 training iterations. In Figure 4, results for both Pong and Beamrider show that a high improvement is still achievable with only a small amount of demonstration. It is even more remarkable that in Beamrider the results are as good as pre-training with the full set of human demonstrations.
In order to understand what is accomplished with pre-training, we look closer at the filters (i.e., weights of the network layers) to determine how much the pre-trained features contribute to the final features learned. Thus, we further investigate how similar the initial weights of a deep RL network are to its final weights for each layer after learning a near-optimal policy. We quantify the similarity by finding the difference between the weights using the mean squared error $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (w_i^{\mathrm{init}} - w_i^{\mathrm{final}})^2$, where $n$ is the number of weights in the layer. A smaller MSE for a layer means higher similarity. Table 2 shows that there is a high similarity in the pre-training approach compared to random weight initialization. Furthermore, we looked at the visualization of each hidden layer and observed that the weights learned from classification and used as initial values in deep RL’s network provided features that were retained even after training in deep RL. Figure 5 shows a visualization of the first convolutional layer. The high similarity of the weights observed in all layers suggests that pre-training in classification was able to learn important features that were useful in deep RL.
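The per-layer similarity measure can be sketched as below. The function name is hypothetical and each layer is assumed to be flattened into a plain list of weights; the example values are made up.

```python
def layer_mse(initial_weights, final_weights):
    """Mean squared error between a layer's initial and final weights."""
    if len(initial_weights) != len(final_weights):
        raise ValueError("both weight lists must have the same length")
    squared = [(a - b) ** 2 for a, b in zip(initial_weights, final_weights)]
    return sum(squared) / len(squared)


# Hypothetical flattened weights of one layer before and after RL training.
pretrained_init = [0.5, -1.0, 2.0, 0.0]
after_training = [0.5, -1.0, 1.0, 1.0]
print(layer_mse(pretrained_init, after_training))
```

Computing this value once per layer for a pre-trained initialization and once for a random initialization gives the kind of comparison reported in Table 2.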
Discussion and Conclusion
The pre-training approach worked very well in Pong. This success can be explained by the human demonstration data the classifier was pre-trained with and by the simplicity of Pong. Pong’s states are highly repetitive compared to those of the other game environments, which are more dynamic. Beamrider has the most complex environment among the three games because it has different levels with varying difficulty. Although Freeway’s game state is also repetitive, the base agent’s inability to learn a good policy is a problem that leans more towards policy learning, which is not addressed in our approach.
Human demonstrations are a big part of the success of our approach. In future work, it is important to understand how the demonstrator’s performance and the amount of demonstration data affect the benefits of pre-training the network. We would also like to look into using recently released human demonstration datasets for Atari [Kurin et al.2017] and Starcraft II [Vinyals et al.2017].
Another issue that needs to be addressed with regard to the human demonstrations is that they suffer from highly imbalanced classes (actions). This is attributed to: 1) the sparsity of some actions, such as the torpedo action in Beamrider, which is limited to three uses at each level; 2) actions that are closely related, as in Beamrider, where there are left and right actions plus the combined actions of left-fire and right-fire, and a demonstrator would usually just use the plain left and right actions and the fire action by itself; and 3) games having a default no-operation action.
As discussed in [He and Garcia2009], when the imbalance problem is not addressed, the classifier will learn a policy that tends to be biased towards the majority classes. It is interesting that the classifier is still able to learn important features without handling this issue. However, we see handling the class imbalance as interesting future work, which would tell us whether the classifier ends up learning better features and whether further improvements can be observed in our approach.
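The degree of class imbalance in a demonstration can be inspected by counting how often each action was taken. The sketch below uses a hypothetical function name and a made-up action sequence; it only measures the imbalance and does not correct it.

```python
from collections import Counter


def action_frequencies(actions):
    """Fraction of demonstration frames labeled with each action."""
    counts = Counter(actions)
    total = len(actions)
    return {action: count / total for action, count in counts.items()}


# Hypothetical demonstration dominated by the no-operation action.
demo_actions = ["noop", "noop", "noop", "left", "noop", "right", "noop", "noop"]
print(action_frequencies(demo_actions))
```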
As we investigate further ways to improve our approach, we know there is a limit to how much improvement pre-training can provide without addressing policy learning. In this approach, we have already trained a model with a policy that tries to imitate the human demonstrator [Argall et al.2009]. We can extend this work by simply using the pre-trained model’s policy to provide advice to the agent (e.g., [Wang and Taylor2017]).
Overall, learning directly from raw images through deep neural networks is a major reason why learning is slow in deep RL. We have demonstrated that our method of initializing deep RL’s network with a pre-trained model can significantly speed up learning in deep RL.
The A3C implementation was a modification of https://github.com/miyosuda/async_deep_reinforce. The authors would like to thank Sahil Sharma and Kory Matthewson for providing very useful insights on the actor-critic method. We thank WSU Kamiak for the computing resources we used for running the experiments.
- [Abadi et al.2015] Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Goodfellow, I.; Harp, A.; Irving, G.; Isard, M.; Jia, Y.; Jozefowicz, R.; Kaiser, L.; Kudlur, M.; Levenberg, J.; Mané, D.; Monga, R.; Moore, S.; Murray, D.; Olah, C.; Schuster, M.; Shlens, J.; Steiner, B.; Sutskever, I.; Talwar, K.; Tucker, P.; Vanhoucke, V.; Vasudevan, V.; Viégas, F.; Vinyals, O.; Warden, P.; Wattenberg, M.; Wicke, M.; Yu, Y.; and Zheng, X. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
- [Abtahi and Fasel2011] Abtahi, F., and Fasel, I. 2011. Deep belief nets as function approximators for reinforcement learning. RBM 2:h3.
- [Anderson, Lee, and Elliott2015] Anderson, C. W.; Lee, M.; and Elliott, D. L. 2015. Faster reinforcement learning after pretraining deep networks to predict state dynamics. In Neural Networks (IJCNN), 2015 International Joint Conference on, 1–7. IEEE.
- [Argall et al.2009] Argall, B. D.; Chernova, S.; Veloso, M.; and Browning, B. 2009. A survey of robot learning from demonstration. Robotics and autonomous systems 57(5):469–483.
- [Brockman et al.2016] Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. Openai gym.
- [Caruana1998] Caruana, R. 1998. Multitask learning. In Learning to learn. Springer. 95–133.
- [Christiano et al.2017] Christiano, P.; Leike, J.; Brown, T. B.; Martic, M.; Legg, S.; and Amodei, D. 2017. Deep reinforcement learning from human preferences. arXiv preprint arXiv:1706.03741.
- [Erhan et al.2009] Erhan, D.; Manzagol, P.-A.; Bengio, Y.; Bengio, S.; and Vincent, P. 2009. The difficulty of training deep architectures and the effect of unsupervised pre-training. In Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), 153–160.
- [Erhan et al.2010] Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S. 2010. Why does unsupervised pre-training help deep learning? J. Mach. Learn. Res. 11:625–660.
- [He and Garcia2009] He, H., and Garcia, E. A. 2009. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering 21(9):1263–1284.
- [Hester et al.2017] Hester, T.; Vecerik, M.; Pietquin, O.; Lanctot, M.; Schaul, T.; Piot, B.; Sendonaris, A.; Dulac-Arnold, G.; Osband, I.; Agapiou, J.; et al. 2017. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732.
- [Kingma and Ba2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [Kurin et al.2017] Kurin, V.; Nowozin, S.; Hofmann, K.; Beyer, L.; and Leibe, B. 2017. The atari grand challenge dataset. arXiv preprint arXiv:1705.10998.
- [Mnih et al.2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
- [Mnih et al.2016] Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937.
- [Sharma, Lakshminarayanan, and Ravindran2017] Sharma, S.; Lakshminarayanan, A. S.; and Ravindran, B. 2017. Learning to repeat: Fine grained action repetition for deep reinforcement learning. arXiv preprint arXiv:1702.06054.
- [Silver et al.2016] Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of go with deep neural networks and tree search. Nature 529(7587):484–489.
- [Sutton and Barto1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement learning: An introduction, volume 1. MIT press Cambridge.
- [Vinyals et al.2017] Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A. S.; Yeo, M.; Makhzani, A.; Küttler, H.; Agapiou, J.; Schrittwieser, J.; et al. 2017. Starcraft ii: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782.
- [Wang and Taylor2017] Wang, Z., and Taylor, M. E. 2017. Improving reinforcement learning with confidence-based demonstrations. In Proceedings of the 26th International Conference on Artificial Intelligence (IJCAI).
- [Wang et al.2016] Wang, Z.; Bapst, V.; Heess, N.; Mnih, V.; Munos, R.; Kavukcuoglu, K.; and de Freitas, N. 2016. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224.
- [Watkins and Dayan1992] Watkins, C. J., and Dayan, P. 1992. Q-learning. Machine Learning 8(3-4):279–292.
- [Yosinski et al.2014] Yosinski, J.; Clune, J.; Bengio, Y.; and Lipson, H. 2014. How transferable are features in deep neural networks? In Advances in neural information processing systems, 3320–3328.