1 Introduction
In recent years, significant advances in the field of deep Reinforcement Learning (RL) have led to artificial agents that are able to reach human-level control on a wide array of tasks such as some Atari 2600 games Bellemare et al. (2015). In many of the Atari games, these agents learn control policies that far exceed the capabilities of an average human player Gruslys et al. (2018); Hessel et al. (2018); Horgan et al. (2018). However, learning human-level policies consistently across the entire set of games remains an open problem.
We argue that an algorithm needs to overcome three key challenges in order to perform well on all Atari games. The first challenge is processing diverse reward distributions. An algorithm must learn stably regardless of reward density and scale. Mnih et al. (2015) showed that clipping rewards to the canonical interval $[-1, 1]$ is one way to achieve stability. However, this clipping operation may change the set of optimal policies. For example, the agent no longer differentiates between striking a single pin or all ten pins in Bowling. Hence, optimizing the unaltered reward signal in a stable manner is crucial to achieving consistent performance across games. The second challenge is reasoning over long time horizons, which means the algorithm should be able to choose actions in anticipation of rewards that might be far away. For example, in Montezuma’s Revenge, individual rewards might be separated by several hundred time steps. In the standard $\gamma$-discounted RL setting, this means the algorithm should be able to handle discount factors close to 1. The third and final challenge is efficient exploration of the MDP. An algorithm that explores efficiently is able to discover long trajectories with a high cumulative reward in a reasonable amount of time even if individual rewards are very sparse. While each problem has been partially addressed in the literature, none of the existing deep RL algorithms have been able to address these three challenges at once.
In this paper, we propose a new Deep Q-Network (DQN) Mnih et al. (2015) style algorithm that specifically addresses these three challenges. In order to learn stably independent of the reward distribution, we use a transformed Bellman operator that reduces the variance of the action-value function. Learning with the transformed operator allows us to process the unaltered environment rewards regardless of scale and density. We prove that the optimal policy does not change in deterministic MDPs and show that under certain assumptions the operator is a contraction in stochastic MDPs (i.e., the algorithm converges to a fixed point) (see Sec. 3.2). Our algorithm learns stably even at high discount factors due to an auxiliary temporal consistency (TC) loss. This loss prevents the network from prematurely generalizing to unseen states (Sec. 3.3), allowing us to use a discount factor as high as $\gamma = 0.999$ in practice. This extends the effective planning horizon of our algorithm by one order of magnitude when compared to other deep RL approaches on Atari. Finally, we improve the efficiency of DQN’s default exploration scheme by combining the distributed experience replay approach of Horgan et al. (2018) with the Deep Q-learning from Demonstrations (DQfD) algorithm of Hester et al. (2018). The resulting architecture is a distributed actor-learner system that combines offline expert demonstrations with online agent experiences (Sec. 3.4).
We experimentally evaluate our algorithm on a set of 42 games for which we have demonstrations from an expert human player (see Table 5). Using the same hyperparameters on all games, our algorithm exceeds the performance of an average human player on 40 games, the expert player on 34 games, and state-of-the-art agents on at least 28 games. Furthermore, we significantly advance the state-of-the-art on sparse reward games. Our algorithm is the first to complete the first level of Montezuma’s Revenge and it achieves a new top score of 3997 points on Pitfall! without compromising performance on dense reward games and while only using 5 demonstration trajectories.
2 Related work
Reinforcement Learning with Expert Demonstrations (RLED): RLED seeks to use expert demonstrations to guide the exploration process in difficult RL problems. Some early works in this area Atkeson and Schaal (1997); Schaal (1997) used expert demonstrations to find a good initial policy before fine-tuning it with RL. More recent approaches have explicitly combined expert demonstrations with RL data during the learning of the policy or action-value function Chemali and Lazaric (2015); Kim et al. (2013); Piot et al. (2014). In these works, expert demonstrations were used to build an imitation loss function (classification-based loss) or max-margin constraints. While these algorithms worked reasonably well in small problems, they relied on handcrafted features to describe states and were not applied to large MDPs. In contrast, approaches using deep neural networks allow RLED to be explored in more challenging RL tasks such as Atari or robotics. In particular, our work builds upon DQfD Hester et al. (2018), which used a separate replay buffer for expert demonstrations, and minimized the sum of a temporal difference loss and a supervised classification loss. Another similar approach is Replay Buffer Spiking (RBS) Lipton et al. (2016), wherein the experience replay buffer is initialized with demonstration data, but this data is not kept for the full duration of the training. In robotics tasks, similar techniques have been combined with other improvements to successfully solve difficult exploration problems Nair et al. (2017); Večerík et al. (2017).
Deep Q-Networks (DQN): DQN Mnih et al. (2015) used deep neural networks as function approximators to apply RL to Atari games. Since that work, many extensions that significantly improve the algorithm’s performance have been developed. For example, DQN uses a replay buffer to store off-policy experiences and the algorithm learns by sampling batches uniformly from the replay buffer; instead of using uniform samples, Schaul et al. (2015) proposed prioritized sampling where transitions are weighted by their absolute temporal difference error. This concept was further improved by Ape-X DQN Horgan et al. (2018) which decoupled the data collection and the learning processes by having many actors feed data to a central prioritized replay buffer that an independent learner can sample from.
Durugkar and Stone (2017) observed that due to overgeneralization in DQN, updates to the value of the current state also have an adverse effect on the values of the next state. This can lead to unstable learning when the discount factor is high. To counteract this effect, they constrained the TD update to be orthogonal to the direction of maximum change of the next state. However, their approach only worked on toy domains such as CartPole. Finally, van Hasselt et al. (2016a) successfully extended DQN to process unclipped rewards with an algorithm called PopArt, which adaptively rescales the targets for the value network to have zero mean and unit variance.
3 Algorithm
In this section, we describe our algorithm, which consists of three components: (1) The transformed Bellman operator; (2) The temporal consistency (TC) loss; (3) Combining Ape-X DQN and DQfD.
3.1 DQN Background
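Let $\langle \mathcal{X}, \mathcal{A}, R, P, \gamma \rangle$ be a finite, discrete-time MDP where $\mathcal{X}$ is the state space, $\mathcal{A}$ the action space, $R$ the reward function which represents the one-step reward distribution $R(x, a)$ of doing action $a$ in state $x$, $\gamma \in [0, 1]$ the discount factor and $P$ a stochastic kernel modelling the one-step Markovian dynamics ($P(x' \mid x, a)$ is the probability of transitioning to state $x'$ by choosing action $a$ in state $x$). The quality of a policy $\pi$ is determined by the action-value function
$$Q^\pi(x, a) := \mathbb{E}^\pi\Big[\sum_{t \ge 0} \gamma^t R(x_t, a_t) \,\Big|\, x_0 = x, a_0 = a\Big],$$
where $\mathbb{E}^\pi$ is the expectation over the distribution of the admissible trajectories $(x_0, a_0, x_1, a_1, \dots)$ obtained by executing the policy $\pi$ starting from state $x$ and taking action $a$. The goal is to find a policy $\pi^*$ that maximizes the state-value $V^\pi(x) := \max_{a \in \mathcal{A}} Q^\pi(x, a)$ for all states $x$, i.e., find $\pi^*$ such that $V^{\pi^*}(x) \ge \sup_\pi V^\pi(x)$ for all $x \in \mathcal{X}$. While there may be several optimal policies, they all share a common optimal action-value function $Q^*$ Puterman (1994). Furthermore, acting greedily with respect to the optimal action-value function $Q^*$ yields an optimal policy. In addition, $Q^*$ is the unique fixed point of the Bellman optimality operator $\mathcal{T}$ defined as
$$(\mathcal{T} Q)(x, a) := \mathbb{E}_{x' \sim P(\cdot \mid x, a)}\Big[R(x, a) + \gamma \max_{a' \in \mathcal{A}} Q(x', a')\Big], \quad \forall (x, a) \in \mathcal{X} \times \mathcal{A},$$
for any $Q: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$. Because $\mathcal{T}$ is a contraction, we can learn $Q^*$ using a fixed point iteration. Starting with an arbitrary function $Q^{(0)}$ and then iterating $Q^{(k)} := \mathcal{T} Q^{(k-1)}$ for $k \in \mathbb{N}$ generates a sequence of functions that converges to $Q^*$.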
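The fixed point iteration can be illustrated on a small tabular problem. The following is a minimal sketch, not part of the original method description: the two-state MDP, the array layout (`R[x, a]`, `P[x, a, x']`) and the function names are assumptions chosen for illustration, and `R` holds expected rewards.

```python
import numpy as np

def bellman_optimality(Q, R, P, gamma):
    """One application of T: (TQ)(x, a) = R(x, a) + gamma * E_{x'}[max_a' Q(x', a')]."""
    # P has shape (states, actions, states); P @ v averages v over next states.
    return R + gamma * (P @ Q.max(axis=1))

def fixed_point_iteration(R, P, gamma, tol=1e-10, max_iters=100_000):
    """Iterate Q <- TQ from Q = 0 until successive iterates differ by less than tol."""
    Q = np.zeros_like(R, dtype=float)
    for _ in range(max_iters):
        Q_next = bellman_optimality(Q, R, P, gamma)
        if np.max(np.abs(Q_next - Q)) < tol:
            return Q_next
        Q = Q_next
    return Q

# Hypothetical two-state, two-action MDP with deterministic transitions.
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0  # state 0, action 0: stay in state 0
P[0, 1, 1] = 1.0  # state 0, action 1: move to state 1
P[1, 0, 0] = 1.0  # state 1, action 0: move to state 0
P[1, 1, 1] = 1.0  # state 1, action 1: stay in state 1

Q_star = fixed_point_iteration(R, P, gamma=0.9)
greedy_policy = Q_star.argmax(axis=1)  # acting greedily w.r.t. Q* is optimal
```

In this toy problem the greedy policy moves to state 1 and stays there, collecting the reward of 2 at every step.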
DQN Mnih et al. (2015) is an online RL algorithm using a deep neural network $f_\theta$ with parameters $\theta$ as a function approximator of the optimal action-value function $Q^*$. The algorithm starts with a random initialization of the network weights $\theta^{(0)}$ and then iterates
$$\theta^{(k)} := \arg\min_\theta \mathbb{E}_{x, a}\Big[\mathcal{L}\big(f_\theta(x, a) - (\mathcal{T} f_{\theta^{(k-1)}})(x, a)\big)\Big], \qquad (1)$$
where the expectation is taken with respect to a random sample of states and actions and $\mathcal{L}$ is the Huber loss Huber (1964) defined as
$$\mathcal{L}(x) := \begin{cases} \tfrac{1}{2} x^2 & \text{if } |x| \le 1, \\ |x| - \tfrac{1}{2} & \text{otherwise.} \end{cases}$$
In practice, the minimization problem in (1) is only approximately solved by performing a finite and fixed number of stochastic gradient descent (SGD) steps (Mnih et al. (2015) refer to the number of SGD iterations as target update period) and all expectations are approximated by sample averages.
3.2 Transformed Bellman Operator
Mnih et al. (2015) have empirically observed that the errors induced by the limited network capacity, the approximate finite-time solution to (1), and the stochasticity of the optimization problem can cause the algorithm to diverge if the variance of the optimization target is too high. In order to reduce the variance, they clip the reward distribution to the interval $[-1, 1]$. While this achieves the desired goal of stabilizing the algorithm, it significantly changes the set of optimal policies. For example, consider a simplified version of Bowling where an episode only consists of a single throw. If the original reward is the number of hit pins and the rewards are clipped, any policy that hits at least a single pin would be optimal under the clipped reward function. Rather than reducing the magnitude of the rewards, we propose to focus on the action-value function. We use a function $h: \mathbb{R} \to \mathbb{R}$ that reduces the scale of the action-value function. Our new operator $\mathcal{T}_h$ is defined as
$$(\mathcal{T}_h Q)(x, a) := \mathbb{E}_{x' \sim P(\cdot \mid x, a)}\Big[h\Big(R(x, a) + \gamma \max_{a' \in \mathcal{A}} h^{-1}\big(Q(x', a')\big)\Big)\Big], \quad \forall (x, a) \in \mathcal{X} \times \mathcal{A}.$$
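Proposition 3.1.
Let $Q^*$ be the fixed point of $\mathcal{T}$ and $Q: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, then
(i) If $h(z) = \alpha z$ for $\alpha > 0$, then $\mathcal{T}_h^k Q \to h \circ Q^* = \alpha Q^*$ as $k \to \infty$.
(ii) If $h$ is strictly monotonically increasing and the MDP is deterministic (i.e., $P(\cdot \mid x, a)$ and $R(x, a)$ are point measures for all $(x, a) \in \mathcal{X} \times \mathcal{A}$), then $\mathcal{T}_h^k Q \to h \circ Q^*$ as $k \to \infty$.
Proof.
(i) is equivalent to linearly scaling the reward $R$ by a constant $\alpha > 0$, which implies the proposition. For (ii) let $Q^*$ be the fixed point of $\mathcal{T}$ and note that $h \circ Q^* = h \circ \mathcal{T} Q^* = h \circ \mathcal{T}(h^{-1} \circ h \circ Q^*) = \mathcal{T}_h(h \circ Q^*)$, where the last equality only holds if the MDP is deterministic. ∎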
Proposition 3.1 shows that in the basic cases when either $h$ is linear or the MDP is deterministic, $\mathcal{T}_h$ has the unique fixed point $h \circ Q^*$. Hence, if $h$ is an invertible contraction and we use $\mathcal{T}_h$ instead of $\mathcal{T}$ in the DQN algorithm, the variance of our optimization target decreases while still learning an optimal policy. In our algorithm, we use $h: z \mapsto \operatorname{sign}(z)\big(\sqrt{|z| + 1} - 1\big) + \varepsilon z$ with $\varepsilon = 10^{-2}$ where the additive regularization term $\varepsilon z$ ensures that $h^{-1}$ is Lipschitz continuous (see Proposition A.1). We chose this function because it has the desired effect of reducing the scale of the targets while being Lipschitz continuous and admitting a closed form inverse.
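The deterministic case of Proposition 3.1 can be checked numerically. The sketch below assumes the squashing function $h(z) = \operatorname{sign}(z)(\sqrt{|z| + 1} - 1) + \varepsilon z$ with $\varepsilon = 10^{-2}$ and its closed-form inverse; the tiny deterministic MDP and all helper names are hypothetical. It iterates both the standard and the transformed backup and compares $h^{-1}$ of the transformed fixed point with $Q^*$.

```python
import numpy as np

EPS = 1e-2  # regularization weight (assumed value)

def h(z, eps=EPS):
    """Squashing function h(z) = sign(z) * (sqrt(|z| + 1) - 1) + eps * z."""
    return np.sign(z) * (np.sqrt(np.abs(z) + 1.0) - 1.0) + eps * z

def h_inv(x, eps=EPS):
    """Closed-form inverse of h."""
    root = np.sqrt(1.0 + 4.0 * eps * (np.abs(x) + 1.0 + eps))
    return np.sign(x) * (((root - 1.0) / (2.0 * eps)) ** 2 - 1.0)

# Hypothetical deterministic MDP: next_state[x, a] is the successor of (x, a).
rewards = np.array([[0.0, 1.0], [0.0, 200.0]])
next_state = np.array([[0, 1], [0, 1]])
gamma = 0.9

def standard_backup(Q):
    # (TQ)(x, a) = r(x, a) + gamma * max_a' Q(x', a')
    return rewards + gamma * Q.max(axis=1)[next_state]

def transformed_backup(Q):
    # (T_h Q)(x, a) = h(r(x, a) + gamma * max_a' h_inv(Q(x', a')))
    return h(rewards + gamma * h_inv(Q).max(axis=1)[next_state])

def iterate(backup, steps=2000):
    Q = np.zeros_like(rewards)
    for _ in range(steps):
        Q = backup(Q)
    return Q

Q_star = iterate(standard_backup)   # approximate fixed point of T
Q_h = iterate(transformed_backup)   # approximate fixed point of T_h
max_gap = np.max(np.abs(h_inv(Q_h) - Q_star))
```

The values in `Q_h` live on a much smaller scale than those in `Q_star`, yet both induce the same greedy policy because $h$ is strictly increasing.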
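In practice, DQN minimizes the problem in (1) by sampling transitions of the form $t = (x, a, r', x')$ from a replay buffer where $x' \sim P(\cdot \mid x, a)$, $a \sim \pi(\cdot \mid x)$, and $r' \sim R(x, a)$. Let $t_1, \dots, t_N$ be transitions from the buffer with normalized priorities $p_1, \dots, p_N$, then for $k \in \mathbb{N}$ the loss function in (1) using the operator $\mathcal{T}_h$ is approximated as
$$\mathcal{L}_{TD}\big(\theta; (t_i)_{i=1}^N, (p_i)_{i=1}^N, \theta^{(k-1)}\big) := \sum_{i=1}^N p_i \, \mathcal{L}\Big(f_\theta(x_i, a_i) - h\big(r'_i + \gamma h^{-1}(f_{\theta^{(k-1)}}(x'_i, a'_i))\big)\Big),$$
where $a'_i = \arg\max_{a \in \mathcal{A}} f_{\theta^{(k-1)}}(x'_i, a)$ for DQN and $a'_i = \arg\max_{a \in \mathcal{A}} f_\theta(x'_i, a)$ for Double DQN van Hasselt et al. (2016b).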
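A sample-based version of this loss can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: network outputs are passed in as precomputed arrays, the form of $h$ and $\varepsilon = 10^{-2}$ are assumed, terminal transitions are not handled, and all function and argument names are hypothetical.

```python
import numpy as np

EPS = 1e-2  # assumed regularization weight of h

def h(z, eps=EPS):
    return np.sign(z) * (np.sqrt(np.abs(z) + 1.0) - 1.0) + eps * z

def h_inv(x, eps=EPS):
    root = np.sqrt(1.0 + 4.0 * eps * (np.abs(x) + 1.0 + eps))
    return np.sign(x) * (((root - 1.0) / (2.0 * eps)) ** 2 - 1.0)

def huber(x):
    """Huber loss with threshold 1."""
    a = np.abs(x)
    return np.where(a <= 1.0, 0.5 * x ** 2, a - 0.5)

def transformed_td_loss(q_online, q_online_next, q_target_next,
                        actions, rewards, priorities, gamma, double_dqn=True):
    """Priority-weighted Huber loss against the transformed target.

    q_online[i]      : online network values at the current state x_i
    q_online_next[i] : online network values at the next state x'_i
    q_target_next[i] : target network values at the next state x'_i
    """
    rows = np.arange(len(actions))
    # Double DQN selects the next action with the online network.
    selector = q_online_next if double_dqn else q_target_next
    next_actions = selector.argmax(axis=1)
    target = h(rewards + gamma * h_inv(q_target_next[rows, next_actions]))
    return np.sum(priorities * huber(q_online[rows, actions] - target))

# Two hypothetical transitions whose online estimates are 0.5 above the target.
q_target_next = h(np.array([[2.0, 0.0], [0.0, 4.0]]))
q_online = np.array([[h(2.0) + 0.5, 0.0], [0.0, h(2.0) + 0.5]])
loss = transformed_td_loss(
    q_online, q_online_next=q_target_next, q_target_next=q_target_next,
    actions=np.array([0, 1]), rewards=np.array([1.0, 0.0]),
    priorities=np.array([0.5, 0.5]), gamma=0.5)
```

Note that the bootstrap value is mapped back to the original reward scale with `h_inv` before the reward is added, and only the final target is squashed with `h`.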
3.3 Temporal consistency (TC) loss
The stability of DQN, which minimizes the TD-loss $\mathcal{L}_{TD}$, is primarily determined by the target $\mathcal{T}_h f_{\theta^{(k-1)}}$. While the transformed Bellman operator provides an atemporal reduction of the target’s scale and variance, instability can still occur as the discount factor $\gamma$ approaches 1. Increasing the discount factor decreases the temporal difference in value between non-rewarding states. In particular, unwanted generalization of the neural network $f_\theta$ to the next state $x'$ (due to the similarity of temporally adjacent target values) can result in catastrophic TD backups. We resolve the problem by adding an auxiliary temporal consistency (TC) loss of the form
$$\mathcal{L}_{TC}\big(\theta; (t_i)_{i=1}^N, (p_i)_{i=1}^N, \theta^{(k-1)}\big) := \sum_{i=1}^N p_i \, \mathcal{L}\big(f_\theta(x'_i, a'_i) - f_{\theta^{(k-1)}}(x'_i, a'_i)\big),$$
where $k \in \mathbb{N}$ is the current iteration. The TC-loss penalizes weight updates that change the next action-value estimate $f_\theta(x', a')$. This makes sure that the updated estimates adhere to the operator $\mathcal{T}_h$ and thus are consistent over time.
3.4 Ape-X DQfD
Figure 1: Architectures of (a) Ape-X DQN and (b) Ape-X DQfD (ours).
In this section, we describe how we combine the transformed Bellman operator and the TC loss with the DQfD algorithm Hester et al. (2018) and distributed prioritized experience replay Horgan et al. (2018). The resulting algorithm, which we call Ape-X DQfD following Horgan et al. (2018), is a distributed DQN algorithm with expert demonstrations that is robust to the reward distribution and can learn at discount factors an order of magnitude higher than what was possible before (i.e., $\gamma = 0.999$ instead of $\gamma = 0.99$). Our algorithm consists of three components: (1) replay buffers; (2) actor processes; and (3) a learner process. Fig. 1 shows how our architecture compares to the one used by Horgan et al. (2018).
Replay buffers. Following Hester et al. (2018), we maintain two replay buffers: an actor replay buffer and an expert replay buffer. Both buffers store 1-step and 10-step transitions and are prioritized Schaul et al. (2015). The transitions in the actor replay buffer come from actor processes that interact with the MDP. In order to limit the memory consumption of the actor replay buffer, we regularly remove transitions in a FIFO manner. The expert replay buffer is filled once offline before training commences.
Actor processes. Horgan et al. (2018) showed that we can significantly improve the performance of DQN with prioritized replay buffers by having many actor processes. We follow their approach and use $m = 128$ actor processes. Each actor $i$ follows an $\epsilon_i$-greedy policy based on the current estimate of the action-value function. The noise levels $\epsilon_i$ are chosen as $\epsilon_i = 0.1^{\alpha_i + 3(1 - \alpha_i)}$ where $\alpha_i = \frac{i - 1}{m - 1}$. Notably, this exploration is closer to the one used by Hester et al. (2018) and is much lower (i.e., less random exploration) than the schedule used by Horgan et al. (2018).
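The per-actor exploration rates can be sketched as follows. The schedule $\epsilon_i = 0.1^{\alpha_i + 3(1 - \alpha_i)}$ with $\alpha_i = (i - 1)/(m - 1)$ is treated here as an assumption about the exact form of the noise levels, and the function names are hypothetical. Under this assumption the rates are spread geometrically between $10^{-3}$ and $10^{-1}$.

```python
import numpy as np

def actor_epsilons(m=128):
    """Per-actor exploration rates under the assumed schedule
    eps_i = 0.1 ** (alpha_i + 3 * (1 - alpha_i)), alpha_i = (i - 1) / (m - 1)."""
    i = np.arange(1, m + 1)
    alpha = (i - 1) / (m - 1)
    return 0.1 ** (alpha + 3.0 * (1.0 - alpha))

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

epsilons = actor_epsilons(m=128)
rng = np.random.default_rng(0)
# Action chosen by the least exploratory actor for some hypothetical Q-values.
action = epsilon_greedy(np.array([0.1, 0.7, 0.2]), epsilons[0], rng)
```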
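Learner process. The learner process samples experiences from the two replay buffers and minimizes a loss in order to approximate the optimal action-value function. Following Hester et al. (2018), we combine the TD-loss $\mathcal{L}_{TD}$ with a supervised imitation loss. Let $t_1, \dots, t_N$ be transitions of the form $t_i = (x_i, a_i, r'_i, x'_i, e_i)$ with normalized priorities $p_1, \dots, p_N$ where $e_i$ is 1 if the transition is part of the best (i.e., highest episode return) expert episode and 0 otherwise. The imitation loss is a max-margin loss of the form
$$\mathcal{L}_{IM}\big(\theta; (t_i)_{i=1}^N, (p_i)_{i=1}^N, \theta^{(k-1)}\big) := \sum_{i=1}^N p_i e_i \Big(\max_{a \in \mathcal{A}} \big[f_\theta(x_i, a) + \lambda \delta_{a \neq a_i}\big] - f_\theta(x_i, a_i)\Big), \qquad (2)$$
where $\lambda \in \mathbb{R}$ is the margin and $\delta_{a \neq a_i}$ is 1 if $a \neq a_i$ and 0 otherwise. Combining the imitation loss with the TD loss and the TC loss yields the total loss formulation
$$\mathcal{L}\big(\theta; (t_i)_{i=1}^N, (p_i)_{i=1}^N, \theta^{(k-1)}\big) := \big(\mathcal{L}_{TD} + \mathcal{L}_{TC} + \mathcal{L}_{IM}\big)\big(\theta; (t_i)_{i=1}^N, (p_i)_{i=1}^N, \theta^{(k-1)}\big).$$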
Algo. 1, provided in the appendix, shows the entire learner procedure. Note that while we only apply the imitation loss on the best expert trajectory, we still use all expert trajectories for the other two losses.
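The structure of the combined loss can be sketched in NumPy. This is a simplified illustration, not the learner procedure of Algo. 1: network outputs are given as arrays, the TD term is passed in as a precomputed scalar, the margin value is a placeholder, and all names are hypothetical.

```python
import numpy as np

def huber(x):
    """Huber loss with threshold 1."""
    a = np.abs(x)
    return np.where(a <= 1.0, 0.5 * x ** 2, a - 0.5)

def tc_loss(q_online_next, q_target_next, next_actions, priorities):
    """Penalize changes of the next-state estimate relative to the target network."""
    rows = np.arange(len(next_actions))
    diff = q_online_next[rows, next_actions] - q_target_next[rows, next_actions]
    return np.sum(priorities * huber(diff))

def imitation_loss(q_online, actions, is_best_expert, priorities, margin):
    """Max-margin loss, applied only where is_best_expert equals 1."""
    rows = np.arange(len(actions))
    penalty = np.full_like(q_online, margin)
    penalty[rows, actions] = 0.0  # no margin for the demonstrated action
    best_with_margin = (q_online + penalty).max(axis=1)
    gap = best_with_margin - q_online[rows, actions]
    return np.sum(priorities * is_best_expert * gap)

# Two hypothetical transitions; only the first belongs to the best expert episode.
priorities = np.array([0.5, 0.5])
im = imitation_loss(
    q_online=np.array([[1.0, 2.0], [3.0, 0.0]]),
    actions=np.array([0, 0]),
    is_best_expert=np.array([1.0, 0.0]),
    priorities=priorities,
    margin=0.8,  # placeholder margin
)
tc = tc_loss(
    q_online_next=np.array([[1.0, 0.0], [0.0, 2.0]]),
    q_target_next=np.array([[0.5, 0.0], [0.0, 4.0]]),
    next_actions=np.array([0, 1]),
    priorities=priorities,
)
td = 0.25  # stand-in for a precomputed TD loss value
total = td + tc + im
```

The imitation term is zero exactly when the demonstrated action beats every other action by at least the margin, which is what pushes the greedy policy towards the expert on demonstration states.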
Our learning algorithm differs from the one used by Hester et al. (2018) in three important ways. First, we do not have a pre-training phase where we minimize the loss $\mathcal{L}$ only using expert transitions. We learn with a mix of actor and expert transitions from the beginning. Second, we maintain a fixed ratio of actor and expert transitions. For each SGD step, our training batch consists of 75% agent transitions and 25% expert transitions. The ratio is constant throughout the entire learning process. Finally, we only apply the imitation loss to the best expert episode instead of all episodes.
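The fixed batch composition can be sketched as below. This is a simplified illustration: it draws indices in proportion to raw priorities and ignores details of prioritized replay such as priority exponents and importance weights; the buffer sizes, the batch size and the function name are assumptions.

```python
import numpy as np

def sample_mixed_batch(actor_priorities, expert_priorities, batch_size, rng,
                       expert_fraction=0.25):
    """Draw a batch with a fixed share of expert transitions.

    Returns index arrays into the actor buffer and the expert buffer.
    """
    n_expert = int(round(batch_size * expert_fraction))
    n_actor = batch_size - n_expert

    def draw(priorities, n):
        p = np.asarray(priorities, dtype=float)
        return rng.choice(len(p), size=n, p=p / p.sum())

    return draw(actor_priorities, n_actor), draw(expert_priorities, n_expert)

rng = np.random.default_rng(0)
actor_idx, expert_idx = sample_mixed_batch(
    actor_priorities=rng.random(10_000) + 1e-3,   # hypothetical actor buffer
    expert_priorities=rng.random(500) + 1e-3,     # hypothetical expert buffer
    batch_size=64,
    rng=rng,
)
```

Keeping the ratio fixed means the expert data keeps contributing to every SGD step even when the actor buffer is many times larger.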
Table 1: Number of games (out of 42) on which the algorithm in each row outperforms the baseline in each column.
Algorithm  Rainbow  DQfD  Ape-X DQN  Ape-X DQfD  Ape-X DQfD (deeper)  Random  Avg. Human  Best Expert Trajectory
Rainbow DQN  –  31 / 42  9 / 42  10 / 42  7 / 42  41 / 42  32 / 42  24 / 42
DQfD  11 / 42  –  7 / 42  11 / 42  2 / 42  40 / 42  25 / 42  13 / 42
Ape-X DQN  34 / 42  35 / 42  –  28 / 42  15 / 42  40 / 42  31 / 42  31 / 42
Ape-X DQfD  32 / 42  39 / 42  15 / 42  –  9 / 42  40 / 42  39 / 42  32 / 42
Ape-X DQfD (deeper)  36 / 42  40 / 42  28 / 42  33 / 42  –  42 / 42  40 / 42  34 / 42
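Table 2: Mean and median human-normalized scores under the no-op starts and human starts regimes.
Algorithm  No-op mean (42 games)  No-op mean (57 games)  No-op median (42 games)  No-op median (57 games)  Human-starts mean (42 games)  Human-starts mean (57 games)  Human-starts median (42 games)  Human-starts median (57 games)
Rainbow DQN  1022%  874%  231%  231%  897%  776%  159%  153%
DQfD  364%  –  113%  –  –  –  –  –
Ape-X DQN  1770%  1695%  421%  434%  1651%  1591%  354%  358%
Ape-X DQfD  1536%  –  339%  –  1461%  –  302%  –
Ape-X DQfD (deeper)  2346%  –  702%  –  2028%  –  547%  –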
4 Experimental evaluation
We evaluate our approach on the same subset of 42 games from the Arcade Learning Environment (ALE) Bellemare et al. (2015) used by Hester et al. (2018). We report the performance using the no-op starts and the human starts test regimes Mnih et al. (2015). The full evaluation procedure is detailed in Sec. C.
4.1 Benchmark results
We compare our approach to Ape-X DQN Horgan et al. (2018), on which our actor-learner architecture is based, DQfD Hester et al. (2018), which introduced the expert replay buffer and the imitation loss, and Rainbow DQN Hessel et al. (2018), which combines all major DQN extensions from the literature into a single algorithm. Note that the scores reported in Horgan et al. (2018) were obtained by running 360 actors. Due to resource constraints, we limit the number of actors to 128 for all Ape-X DQfD experiments. Besides comparing our performance to other RL agents, we are also interested in comparing our scores to a human player. Because our demonstrations were gathered from an expert player, the expert scores are mostly better than the level of human performance reported in the literature Mnih et al. (2015); Wang et al. (2016). Hence, we treat the historical human scores as the performance of an average human and the scores of our expert as expert performance.
We first analyse the performance of the standard dueling DQN architecture Wang et al. (2016) that is also used by the baselines. We report the scores as Ape-X DQfD in Tables 1 and 2. We designed the algorithm to achieve higher consistency over a broad range of games and the scores shown in Table 1 reflect that goal. Whereas previous approaches outperformed an average human on at most 32 out of 42 games, Ape-X DQfD with the standard dueling architecture achieves a new state-of-the-art result of 39 out of 42 games. This means we significantly improve the performance on the tails of the distribution of scores over the games. When looking at this performance in the context of the median human-normalized scores reported in Table 2, we see that we significantly increase the set of games where we learn good policies at the expense of achieving lower peak scores on some games.
One of the significant changes in our experimental setup is moving from a discount factor of $\gamma = 0.99$ to $\gamma = 0.999$. Jiang et al. (2015) argue that this increases the complexity of the learning problem and, thus, requires a bigger hypothesis space. Hence, in addition to the standard architecture, we also evaluated a slightly wider (i.e., double the number of convolutional kernels) and deeper (one extra fully connected layer) network architecture (see Fig. 8). With the deeper architecture, our algorithm outperforms an average human on 40 out of 42 games. Furthermore, it is the first deep RL algorithm to learn non-trivial policies on all games including sparse reward games such as Montezuma’s Revenge, Private Eye, and Pitfall!. For example, we achieve 3997 points in Pitfall!, which is below the 6464 points of an average human but far above any baseline. Finally, with a median human-normalized score of 702% and exceeding every baseline on at least 28 of the 42 games, we demonstrate strong peak performance and consistency over the entire benchmark.
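For reference, human-normalized scores are commonly computed as the agent's improvement over a random policy relative to the average human's improvement over a random policy. The sketch below assumes this usual definition and applies it to three rows of the no-op results table in Appendix D for the deeper Ape-X DQfD agent; the function name is hypothetical, and the median over three games is only an illustration, not the 42-game figure reported in Table 2.

```python
from statistics import median

def human_normalized(agent, random, human):
    """Score in percent: 0% matches the random policy, 100% the average human."""
    return 100.0 * (agent - random) / (human - random)

# (agent, random, average human) scores taken from the Appendix D table.
scores = {
    "Alien": (50113.6, 128.3, 7128.0),
    "Amidar": (12291.7, 11.8, 1720.0),
    "Assault": (35046.9, 166.9, 742.0),
}
normalized = {game: human_normalized(*s) for game, s in scores.items()}
median_score = median(normalized.values())
```

The median is far less sensitive than the mean to single games with very large normalized scores, which is why both statistics are reported.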
4.2 Imitation vs. inspiration
Although we use demonstration data, the goal of RLED algorithms is still to learn an optimal policy that maximizes the expected discounted return. While Table 1 shows that we exceed the best expert episode on 34 games using the deeper architecture, it is hard to grasp the qualitative differences between the expert policies and our algorithm’s policies. In order to qualitatively compare the agent and the expert, we provide videos on YouTube (see Sec. F) and we plot the cumulative episode return of the best expert and agent episodes in Fig. 3.
We see that our algorithm (
4.3 Ablation study
We evaluate the performance contributions of the three key ingredients of Ape-X DQfD (transformed Bellman operator, the TC loss, and demonstration data) by performing an ablation study on a subset of 6 games. We chose sparse-reward games (Montezuma’s Revenge, Private Eye), dense-reward games (Ms. Pacman, Seaquest), and games where DQfD performs well (Hero, Kangaroo) (see Fig. 3).
Transformed Bellman operator (
TC loss (
Expert demonstrations (
4.4 Comparison to related work
The problems of handling diverse reward distributions and network overgeneralization in deep RL have been partially addressed in the literature (see Sec. 2). Specifically, van Hasselt et al. (2016a) proposed PopArt and Durugkar and Stone (2017) used constrained TD updates. We evaluate the performance of our algorithm when using alternative solutions and report the results in Fig. 4.
PopArt (
) to have zero mean and unit variance. While the modified algorithm manages to learn in some games, the overall performance is significantly worse than Ape-X DQfD. One possible limiting factor that makes PopArt a bad choice for our framework is that training batches contain highly rewarding states from the very beginning of training. SGD updates performed before the moving statistics have adequately adapted the moments of the target distribution might result in catastrophic changes to the network’s weights.
Constrained TD updates (
5 Conclusion
In this paper, we presented a deep Reinforcement Learning (RL) algorithm that achieves human-level performance on a wide variety of MDPs on the Atari 2600 benchmark. It does so by addressing three challenges: handling diverse reward distributions, acting over longer time horizons, and efficiently exploring on sparse reward tasks. We introduce novel approaches for each of these challenges: a transformed Bellman operator, a temporal consistency loss, and a distributed RLED framework for learning from human demonstrations and task reward. Our algorithm exceeds the performance of an average human on 40 out of 42 Atari 2600 games and it is the first deep RL algorithm to complete the first level of Montezuma’s Revenge.
References
 Atkeson and Schaal [1997] Christopher Atkeson and Stefan Schaal. Robot learning from demonstration. In Proc. of ICML, 1997.
 Bellemare et al. [2015] Marc Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. In Proc. of IJCAI, 2015.
 Chemali and Lazaric [2015] Jessica Chemali and Alessandro Lazaric. Direct policy iteration with demonstrations. In Proc. of IJCAI, 2015.
 Durugkar and Stone [2017] Ishan Durugkar and Peter Stone. TD learning with constrained gradients. In Deep Reinforcement Learning Symposium, NIPS, 2017.
 Gruslys et al. [2018] Audrunas Gruslys, Will Dabney, Mohammad Gheshlaghi Azar, Bilal Piot, Marc Bellemare, and Remi Munos. The reactor: A fast and sample-efficient actor-critic agent for reinforcement learning. In Proc. of ICLR, 2018.
 Hessel et al. [2018] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad G. Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proc. of AAAI, 2018.
 Hester et al. [2018] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, and John Agapiou. Deep Q-learning from demonstrations. In Proc. of AAAI, 2018.
 Horgan et al. [2018] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver. Distributed prioritized experience replay. In Proc. of ICLR, 2018.
 Huber [1964] Peter J. Huber. Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73–101, 1964.
 Jiang et al. [2015] Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1181–1189. International Foundation for Autonomous Agents and Multiagent Systems, 2015.
 Kim et al. [2013] Beomjoon Kim, Amirmassoud Farahmand, Joelle Pineau, and Doina Precup. Learning from limited demonstrations. In Proc. of NIPS, 2013.
 Lipton et al. [2016] Zachary Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, and Li Deng. BBQ-networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems. In Proc. of AAAI, 2016.
 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Nair et al. [2017] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.
 Piot et al. [2014] Bilal Piot, Matthieu Geist, and Olivier Pietquin. Boosted Bellman residual minimization handling expert demonstrations. In Proc. of ECML/PKDD, 2014.
 Puterman [1994] Martin L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 1994.
 Schaal [1997] Stefan Schaal. Learning from demonstration. In Proc. of NIPS, 1997.
 Schaul et al. [2015] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In Proc. of ICLR, 2015.
 van Hasselt et al. [2016a] Hado van Hasselt, Arthur Guez, Matteo Hessel, Volodymyr Mnih, and David Silver. Learning values across many orders of magnitude. In Proc. of NIPS, 2016a.
 van Hasselt et al. [2016b] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proc. of AAAI, 2016b.
 Večerík et al. [2017] Matej Večerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
 Wang et al. [2016] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In Proc. of ICML, 2016.
Appendix A Transformed Bellman Operator in Stochastic MDPs
The following proposition shows that the transformed Bellman operator is still a contraction for small $\gamma$ if we assume a stochastic MDP and a more generic choice of $h$. However, the fixed point might not be $h \circ Q^*$.
Proposition A.1.
Let $h$ be strictly monotonically increasing, Lipschitz continuous with Lipschitz constant $L_h$, and have a Lipschitz continuous inverse $h^{-1}$ with Lipschitz constant $L_{h^{-1}}$. For $\gamma < \frac{1}{L_h L_{h^{-1}}}$, $\mathcal{T}_h$ is a contraction.
Proof.
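Let $Q, U: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ be arbitrary. It holds
$$\begin{aligned}
\|\mathcal{T}_h Q - \mathcal{T}_h U\|_\infty
&= \max_{x, a} \Big| \mathbb{E}_{x' \sim P(\cdot \mid x, a)}\Big[ h\big(R(x, a) + \gamma \max_{a'} h^{-1}(Q(x', a'))\big) - h\big(R(x, a) + \gamma \max_{a'} h^{-1}(U(x', a'))\big) \Big] \Big| \\
&\overset{(1)}{\le} L_h \gamma \max_{x, a} \mathbb{E}_{x' \sim P(\cdot \mid x, a)}\Big[ \big| \max_{a'} h^{-1}(Q(x', a')) - \max_{a'} h^{-1}(U(x', a')) \big| \Big] \\
&\le L_h \gamma \max_{x, a} \mathbb{E}_{x' \sim P(\cdot \mid x, a)}\Big[ \max_{a'} \big| h^{-1}(Q(x', a')) - h^{-1}(U(x', a')) \big| \Big] \\
&\overset{(2)}{\le} L_h L_{h^{-1}} \gamma \max_{x, a} \mathbb{E}_{x' \sim P(\cdot \mid x, a)}\Big[ \max_{a'} \big| Q(x', a') - U(x', a') \big| \Big] \\
&\le L_h L_{h^{-1}} \gamma \, \|Q - U\|_\infty,
\end{aligned}$$
where $L_h L_{h^{-1}} \gamma < 1$ by assumption and we used Jensen’s inequality in (1) and the Lipschitz properties of $h$ and $h^{-1}$ in (1) and (2). ∎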
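For our algorithm, we use $h(z) = \operatorname{sign}(z)\big(\sqrt{|z| + 1} - 1\big) + \varepsilon z$ with $\varepsilon = 10^{-2}$. While Proposition A.2 shows that the transformed operator is a contraction, the discount factor we use in practice is higher than $\frac{1}{L_h L_{h^{-1}}}$. We leave a deeper investigation of the contraction properties of $\mathcal{T}_h$ in stochastic MDPs for future work.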
Proposition A.2.
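Let $\varepsilon > 0$ and $h(x) = \operatorname{sign}(x)\big(\sqrt{|x| + 1} - 1\big) + \varepsilon x$. It holds
(i) $h$ is strictly monotonically increasing.
(ii) $h$ is Lipschitz continuous with Lipschitz constant $L_h = \frac{1}{2} + \varepsilon$.
(iii) $h$ is invertible with $h^{-1}(x) = \operatorname{sign}(x)\left(\left(\frac{\sqrt{1 + 4\varepsilon(|x| + 1 + \varepsilon)} - 1}{2\varepsilon}\right)^2 - 1\right)$.
(iv) $h^{-1}$ is strictly monotonically increasing.
(v) $h^{-1}$ is Lipschitz continuous with Lipschitz constant $L_{h^{-1}} = \frac{1}{\varepsilon}$.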
We use the following Lemmas in order to prove Proposition A.2.
Lemma A.1.
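$h$ is differentiable everywhere with derivative $\frac{d}{dx} h(x) = \frac{1}{2\sqrt{|x| + 1}} + \varepsilon$ for all $x \in \mathbb{R}$.
Proof of Lemma A.1.
For $x > 0$, $h$ is differentiable as a composition of differentiable functions with $\frac{d}{dx} h(x) = \frac{1}{2\sqrt{x + 1}} + \varepsilon$. Analogously, $h$ is differentiable for $x < 0$ with $\frac{d}{dx} h(x) = \frac{1}{2\sqrt{-x + 1}} + \varepsilon$. For $x = 0$, we find
$$\lim_{z \to 0^+} \frac{h(z) - h(0)}{z} = \lim_{z \to 0^+} \frac{\sqrt{z + 1} - 1 + \varepsilon z}{z} = \frac{1}{2} + \varepsilon$$
and similarly $\lim_{z \to 0^-} \frac{h(z) - h(0)}{z} = \frac{1}{2} + \varepsilon$. Hence, $\frac{d}{dx} h(x) = \frac{1}{2\sqrt{|x| + 1}} + \varepsilon$ for all $x \in \mathbb{R}$. ∎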
Lemma A.2.
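$h^{-1}$ is differentiable everywhere with derivative $\frac{d}{dx} h^{-1}(x) = \frac{1}{\varepsilon}\left(1 - \frac{1}{\sqrt{1 + 4\varepsilon(|x| + 1 + \varepsilon)}}\right)$ for all $x \in \mathbb{R}$.
Proof of Lemma A.2.
For $x \neq 0$, $h^{-1}$ is differentiable as a composition of differentiable functions. For $x = 0$, it holds
$$\lim_{z \to 0^+} \frac{h^{-1}(z) - h^{-1}(0)}{z} = \frac{1}{\varepsilon}\left(1 - \frac{1}{\sqrt{1 + 4\varepsilon(1 + \varepsilon)}}\right)$$
and similarly $\lim_{z \to 0^-} \frac{h^{-1}(z) - h^{-1}(0)}{z} = \frac{1}{\varepsilon}\left(1 - \frac{1}{\sqrt{1 + 4\varepsilon(1 + \varepsilon)}}\right)$, which concludes the proof. ∎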
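Proof of Proposition A.2.
We prove all statements individually.
(i) $\frac{d}{dx} h(x) = \frac{1}{2\sqrt{|x| + 1}} + \varepsilon > 0$ for all $x \in \mathbb{R}$, which implies the proposition.
(ii) Let $x, y \in \mathbb{R}$ with $x < y$, using the mean value theorem, we find
$$|h(x) - h(y)| \le \sup_{\xi \in (x, y)} \left|\frac{d}{dx} h(\xi)\right| |x - y| \le \left(\frac{1}{2} + \varepsilon\right) |x - y|.$$
(iii) (i) implies that $h$ is invertible and simple substitution shows $h \circ h^{-1} = h^{-1} \circ h = \mathrm{id}$.
(iv) $\frac{d}{dx} h^{-1}(x) = \frac{1}{\varepsilon}\left(1 - \frac{1}{\sqrt{1 + 4\varepsilon(|x| + 1 + \varepsilon)}}\right) > 0$ for all $x \in \mathbb{R}$, which implies the proposition.
(v) Let $x, y \in \mathbb{R}$ with $x < y$, using the mean value theorem, we find
$$|h^{-1}(x) - h^{-1}(y)| \le \sup_{\xi \in (x, y)} \left|\frac{d}{dx} h^{-1}(\xi)\right| |x - y| \le \frac{1}{\varepsilon} |x - y|.$$
∎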
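The properties above can be sanity-checked numerically. The sketch below assumes $h(x) = \operatorname{sign}(x)(\sqrt{|x| + 1} - 1) + \varepsilon x$ with $\varepsilon = 10^{-2}$ and the derivative formulas $\frac{1}{2\sqrt{|x| + 1}} + \varepsilon$ for $h$ and $\frac{1}{\varepsilon}\big(1 - 1/\sqrt{1 + 4\varepsilon(|x| + 1 + \varepsilon)}\big)$ for $h^{-1}$; the grid and variable names are arbitrary. It also evaluates the bound $\gamma < 1/(L_h L_{h^{-1}})$ from Proposition A.1 under the Lipschitz constants $L_h = \frac{1}{2} + \varepsilon$ and $L_{h^{-1}} = \frac{1}{\varepsilon}$.

```python
import numpy as np

EPS = 1e-2  # assumed regularization weight

def h(x, eps=EPS):
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + eps * x

def dh(x, eps=EPS):
    """Derivative of h."""
    return 1.0 / (2.0 * np.sqrt(np.abs(x) + 1.0)) + eps

def dh_inv(x, eps=EPS):
    """Derivative of the inverse of h."""
    return (1.0 - 1.0 / np.sqrt(1.0 + 4.0 * eps * (np.abs(x) + 1.0 + eps))) / eps

xs = np.linspace(-1e6, 1e6, 200_001)
is_increasing = bool(np.all(np.diff(h(xs)) > 0))
lipschitz_h = dh(xs).max()          # largest slope of h on the grid
lipschitz_h_inv = dh_inv(xs).max()  # largest slope of h^{-1} on the grid

# Contraction condition of Proposition A.1 with L_h = 1/2 + eps, L_{h^-1} = 1/eps.
gamma_bound = 1.0 / ((0.5 + EPS) * (1.0 / EPS))
```

With $\varepsilon = 10^{-2}$ the bound evaluates to $1/51 \approx 0.0196$, which illustrates how far below the practically used discount factor the sufficient condition lies.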
Appendix B Learner algorithm
Appendix C Experimental setup
We evaluate our algorithm on the Arcade Learning Environment (ALE) by Bellemare et al. [2015]. While we follow many of the practices commonly applied when training on the ALE, our experimental setup differs in a few key aspects from the defaults Mnih et al. [2015], Hessel et al. [2018].
End episode on life loss.
Most authors who train agents on the ALE choose to end a training episode when the agent loses a life. This naturally makes the agent risk averse as an action that leads to the termination of an episode has a value of 0. However, because our expert player was allowed to continue playing an episode after losing a life, we follow Hester et al. [2018] and only terminate a training episode either when the game is over or when the agent has performed 50,000 steps which is the episode length used by Horgan et al. [2018].
Reward Preprocessing.
As explained in Sec. 3.2, we do not clip the rewards to the interval $[-1, 1]$. Instead, we use the raw and unprocessed rewards provided by each game.
Discount factor.
The majority of approaches use a discount factor of $\gamma = 0.99$. Empirically, this used to be the highest discount factor that allows stable learning on all games. However, the TC loss allows us to use a much higher discount factor of $\gamma = 0.999$, giving the algorithm an effective planning horizon of 1000 instead of 100 steps.
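The horizon figures follow from the common rule of thumb that a discount factor $\gamma$ corresponds to an effective planning horizon of roughly $1 / (1 - \gamma)$ steps. A small sketch, assuming this rule of thumb and using a hypothetical function name, also shows how strongly a reward 500 steps in the future is discounted in each setting.

```python
def effective_horizon(gamma):
    """Effective planning horizon 1 / (1 - gamma) of a discount factor."""
    return 1.0 / (1.0 - gamma)

for gamma in (0.99, 0.999):
    horizon = effective_horizon(gamma)
    weight = gamma ** 500  # discount applied to a reward 500 steps ahead
    print(f"gamma={gamma}: horizon ~ {horizon:.0f} steps, weight at t=500: {weight:.3f}")
```

Rewards separated by several hundred steps, as in Montezuma’s Revenge, are almost invisible under the lower discount factor but retain most of their weight under the higher one.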
Expert data.
Instead of relying purely on an $\epsilon$-greedy exploration strategy, our algorithm uses expert demonstrations. By using these demonstrations in the TD loss $\mathcal{L}_{TD}$, the algorithm gets to experience rewarding transitions without having to discover them itself.
Appendix D Full experimental results
Game  Rainbow  DQfD  Ape-X DQN  Ape-X DQfD  Ape-X DQfD (deeper)  Random  Avg. Human  Expert

Alien  9491.7  4737.5  40804.9  11313.6  50113.6  128.3  7128.0  29160.0 
Amidar  5131.2  2325.0  8659.2  8463.8  12291.7  11.8  1720.0  2341.0 
Assault  14198.5  1755.7  24559.4  22855.0  35046.9  166.9  742.0  2274.0 
Asterix  428200.3  5493.6  313305.0  399888.0  418433.5  164.5  8503.0  18100.0 
Asteroids  2712.8  3796.4  155495.1  116846.4  112573.6  871.3  47389.0  18100.0 
Atlantis  826659.5  920213.9  944497.5  911025.0  1057521.5  13463.0  29028.0  22400.0 
Bank Heist  1358.0  1280.2  1716.4  2061.9  2578.9  21.7  753.0  7465.0 
Battle Zone  62010.0  41708.2  98895.0  60540.0  128925.0  3560.0  37188.0  60000.0 
Beam Rider  16850.2  5173.3  63305.2  47129.4  87257.4  254.6  16926.0  19844.0 
Bowling  30.0  97.0  17.6  216.3  210.9  35.2  161.0  149.0 
Boxing  99.6  99.1  100.0  100.0  98.5  0.1  12.0  15.0 
Breakout  417.5  308.1  800.9  419.7  641.9  1.6  30.0  79.0 
Chopper Command  16654.0  6993.1  721851.0  96653.0  840023.5  644.0  7388.0  11300.0 
Crazy Climber  168788.5  151909.5  320426.0  176598.5  247651.0  9337.0  35829.0  61600.0 
Defender  55105.0  27951.5  411943.5  51442.0  218006.3  1965.5  18689.0  18700.0 
Demon Attack  111185.2  3848.8  133086.4  100200.9  141444.6  208.3  1971.0  6190.0 
Double Dunk  0.3  20.4  23.5  23.0  23.2  16.0  16.0  14.0 
Enduro  2125.9  1929.8  2177.4  1663.1  1910.1  81.8  860.0  803.0 
Fishing Derby  31.3  38.4  44.4  66.1  68.0  77.1  39.0  20.0 
Freeway  34.0  31.4  33.7  32.0  31.7  0.1  30.0  32.0 
Gopher  70354.6  7810.3  120500.9  114702.6  114168.9  250.0  2412.0  22520.0 
Gravitar  1419.3  1685.1  1598.5  4214.3  3920.5  245.5  3351.0  13400.0 
Hero  55887.4  105929.4  31655.9  112042.4  114248.2  1580.3  30826.0  99320.0 
Ice Hockey  1.1  9.6  33.0  3.4  32.9  9.7  1.0  1.0 
James Bond  19809.0  2095.0  21322.5  12889.0  16956.3  33.5  303.0  650.0 
Kangaroo  14637.5  14681.5  1416.0  47676.5  48599.0  100.0  3035.0  36300.0 
Krull  8741.5  9825.3  11741.4  104160.3  140670.6  1151.9  2666.0  13730.0 
Kung Fu Master  52181.0  29132.0  97829.5  67957.5  137804.5  304.0  22736.0  25920.0 
Montezuma’s Revenge  384.0  4638.4  2500.0  29384.0  27926.5  25.0  4753.0  34900.0 
Ms. Pacman  5380.4  4695.7  11255.2  12857.1  20872.7  197.8  6952.0  55021.0 
Name This Game  13136.0  5188.3  25783.3  24465.8  31569.4  1747.8  8049.0  19380.0 
Pitfall!  0.0  57.3  0.6  3996.7  3997.5  348.8  6464.0  47821.0 
Pong  20.9  10.7  20.9  21.0  20.9  18.0  15.0  0.0 
Private Eye  4234.0  42457.2  49.8  100747.4  100724.9  662.8  69571.0  72800.0 
Q*bert  33817.5  21792.7  302391.3  71224.4  91603.5  183.0  13455.0  99450.0 
Riverraid  22920.8  18735.4  63864.4  24147.7  47609.9  588.3  17118.0  39710.0 
Road Runner  62041.0  50199.6  222234.5  507213.0  578806.5  200.0  7845.0  20200.0 
Seaquest  15898.9  12361.6  392952.3  13603.8  318418.0  215.5  42055.0  101120.0 
Solaris  3560.3  2616.8  2892.9  2529.8  3428.9  2047.2  12327.0  17840.0 
Up’n’Down  125754.6  82555.0  401884.3  324505.2  469548.3  707.2  11693.0  16080.0 
Video Pinball  533936.5  19123.1  565163.2  243320.1  922518.0  20452.0  17668.0  32420.0 
Yars’ Revenge  102557.0  61575.7  148594.8  109980.9  498947.1  1476.9  54577.0  83523.0 
Game  Rainbow  DQfD  ApeX DQN  ApeX DQfD  ApeX DQfD (deeper)  Random  Avg. Human  Expert 

Alien  6022.9  –  17731.5  1025.5  6983.4  –  6371.3  – 
Amidar  202.8  –  1047.3  310.5  1177.5  –  1540.4  – 
Assault  14491.7  –  24404.6  23384.3  34716.5  –  628.9  – 
Asterix  280114.0  –  283179.5  327929.0  297533.8  –  7536.0  – 
Asteroids  2249.4  –  117303.4  95066.6  95170.9  –  36517.3  – 
Atlantis  814684.0  –  918714.5  912443.0  1020311.0  –  26575.0  – 
Bank Heist  826.0  –  1200.8  1695.9  2020.5  –  644.5  – 
Battle Zone  52040.0  –  92275.0  42150.0  74410.0  –  33030.0  – 
Beam Rider  21768.5  –  72233.7  46454.5  82997.1  –  14961.0  – 
Bowling  39.4  –  30.2  178.3  174.4  –  146.5  – 
Boxing  54.9  –  80.9  64.5  69.7  –  9.6  – 
Breakout  379.5  –  756.5  145.1  365.5  –  27.9  – 
Chopper Command  10916.0  –  576601.5  90152.5  681202.5  –  8930.0  – 
Crazy Climber  143962.0  –  263953.5  141468.0  196633.5  –  32667.0  – 
Defender  47671.3  –  399865.3  37771.8  123734.8  –  14296.0  – 
Demon Attack  109670.7  –  133002.1  97458.8  142189.0  –  3442.8  – 
Double Dunk  0.6  –  22.3  20.5  21.8  –  14.4  – 
Enduro  2061.1  –  2042.4  1538.3  1754.9  –  740.2  – 
Fishing Derby  22.6  –  22.4  26.3  24.0  –  5.1  – 
Freeway  29.1  –  29.0  23.8  26.8  –  25.6  – 
Gopher  72595.7  –  121168.2  115654.7  115392.1  –  2311.0  – 
Gravitar  567.5  –  662.0  972.0  1021.8  –  3116.0  – 
Hero  50496.8  –  26345.3  104942.1  107144.0  –  25839.4  – 
Ice Hockey  0.7  –  24.0  3.3  18.4  –  0.5  – 
James Bond  18142.3  –  18992.3  12041.0  15010.0  –  368.5  – 
Kangaroo  10841.0  –  577.5  25953.5  28616.0  –  2739.0  – 
Krull  6715.5  –  8592.0  111496.1  122870.1  –  2109.1  – 
Kung Fu Master  28999.8  –  72068.0  50421.5  102258.0  –  20786.8  – 
Montezuma’s Revenge  154.0  –  1079.0  22781.0  22730.5  –  4182.0  – 
Ms. Pacman  2570.2  –  6135.4  1880.8  4007.4  –  15375.0  – 
Name This Game  11686.5  –  23829.9  22874.6  29416.0  –  6796.0  – 
Pitfall!  37.6  –  273.3  3367.5  3208.7  –  5998.9  – 
Pong  19.0  –  18.7  14.0  18.6  –  15.5  – 
Private Eye  1704.4  –  864.7  61895.1  54976.0  –  64169.1  – 
Q*bert  18397.6  –  380152.1  41419.6  51159.3  –  12085.0  – 
Riverraid  15608.1  –  49982.8  18720.1  42288.9  –  14382.2  – 
Road Runner  54261.0  –  127111.5  486082.0  507490.0  –  6878.0  – 
Seaquest  19176.0  –  377179.8  15526.1  269480.0  –  40425.8  – 
Solaris  2860.7  –  3115.9  2235.6  1835.8  –  11032.6  – 
Up’n’Down  92640.6  –  347912.2  200709.3  298361.8  –  9896.1  – 
Video Pinball  506817.2  –  873988.5  194845.0  832691.1  –  15641.1  – 
Yars’ Revenge  93007.9  –  131701.1  82521.8  466181.8  –  47135.2  – 
Appendix E Experimental setup & hyperparameters
Game  Min score  Max score  Number of transitions  Number of episodes 

Alien  9690  29160  19133  5 
Amidar  1353  2341  16790  5 
Assault  1168  2274  13224  5 
Asterix  4500  18100  9525  5 
Asteroids  14170  18100  22801  5 
Atlantis  10300  22400  17516  12 
Bank Heist  900  7465  32389  7 
Battle Zone  35000  60000  9075  5 
Beam Rider  12594  19844  38665  4 
Bowling  89  149  9991  5 
Boxing  0  15  8438  5 
Breakout  17  79  10475  9 
Chopper Command  4700  11300  7710  5 
Crazy Climber  30600  61600  18937  5 
Defender  5150  18700  6421  5 
Demon Attack  1800  6190  17409  5 
Double Dunk  22  14  11855  5 
Enduro  383  803  42058  5 
Fishing Derby  10  20  6388  4 
Freeway  30  32  10239  5 
Gopher  2500  22520  38632  5 
Gravitar  2950  13400  15377  5 
Hero  35155  99320  32907  5 
Ice Hockey  4  1  17585  5 
James Bond  400  650  9050  5 
Kangaroo  12400  36300  20984  5 
Krull  8040  13730  32581  5 
Kung Fu Master  8300  25920  12989  5 
Montezuma’s Revenge  32300  34900  17949  5 
Ms. Pacman  31781  55021  21896  3 
Name This Game  11350  19380  43571  5 
Pitfall!  3662  47821  35347  5 
Pong  12  0  17719  3 
Private Eye  70375  74456  10899  5 
Q*bert  80700  99450  75472  5 
Riverraid  17240  39710  46233  5 
Road Runner  8400  20200  5574  5 
Seaquest  56510  101120  57453  7 
Solaris  2840  17840  28552  6 
Up’n’Down  6580  16080  10421  4 
Video Pinball  8409  32420  10051  5 
Yars’ Revenge  48361  83523  21334  4 
Parameter  Comment  Value 
Learner configuration  
Batch size  256  
Agent transitions per batch  192 