1 Introduction
Deep neural networks have become a critical component of modern reinforcement learning (RL)
(Sutton and Barto, 2018). The seminal work of Mnih et al. (2013, 2015) on deep Q-networks (DQN) demonstrated that it is possible to train convolutional networks (LeCun et al., 1998) using Q-learning (Watkins and Dayan, 1992) and achieve human-level performance in playing Atari 2600 games (Bellemare et al., 2013) directly from raw pixels. Recent progress in mastering the games of Go (Silver et al., 2016) and Poker (Moravčík et al., 2017; Brown and Sandholm, 2017) and advances in robotic control (Levine et al., 2016; OpenAI et al., 2018; Kalashnikov et al., 2018) present additional supporting evidence for the enormous potential of deep RL. Nevertheless, the use of neural networks within RL algorithms introduces unique and complex challenges, especially in the context of off-policy learning. For instance, it is well known that Q-learning can be unstable or even divergent when neural networks are used to parameterize optimal action-value functions (Baird, 1995; Boyan and Moore, 1995; Tsitsiklis and Van Roy, 1997). When discussing off-policy methods with function approximation, Sutton and Barto (2018) conclude that: “The potential for off-policy learning remains tantalizing, the best way to achieve it still a mystery.”

Off-policy RL algorithms are attractive as they disentangle data collection from policy optimization. This enables reusing the experience collected by any policy to improve the value estimates under the optimal policy, as utilized by value iteration (Howard, 1960; Bellman, 1966) and Q-learning (Watkins and Dayan, 1992). By contrast, on-policy policy gradient methods (Williams, 1992; Sutton et al., 2000; Kakade, 2002; Schulman et al., 2015; Mnih et al., 2016) require access to samples drawn from the current policy to directly estimate the gradient of the objective function w.r.t. the current parameters. On-policy methods are typically convergent and stable, but require many more interactions with the environment than off-policy techniques to achieve a similar level of performance (Hessel et al., 2018). More importantly, on-policy methods are not applicable to many practical RL problems (Strehl et al., 2010; Bottou et al., 2013; Swaminathan and Joachims, 2015) for which offline logged data already exists before policy learning proceeds. By contrast, off-policy techniques are capable of leveraging the vast amount of existing offline data to tackle real-world problems.

In the pursuit of developing efficient and stable off-policy deep RL algorithms, recent papers have presented various conceptual and engineering ideas (e.g., Mnih et al., 2015; Schaul et al., 2016; Van Hasselt et al., 2016; Wang et al., 2016; Osband et al., 2016; Fortunato et al., 2018; Anschel et al., 2017; Bellemare et al., 2017; Dabney et al., 2018b; Hessel et al., 2018). In the absence of theoretical guarantees for off-policy learning with function approximation, most advances in this area are largely governed by empirical results on a popular benchmark suite of Atari 2600 games (Bellemare et al., 2013). Reflecting on the advances of deep RL algorithms since the early version of DQN in 2013 (Mnih et al., 2013), it is important to ask: are the complexities of recent off-policy methods really necessary?
Given the empirical nature of the field, we believe it is crucial to study the relative importance of the individual components of our RL algorithms and to strive to find successful RL algorithms that are as simple as possible (the principle of Occam’s razor).
In an attempt to isolate various factors of variation in off-policy deep RL and help develop simpler algorithms, this paper investigates a set of interrelated questions:


Separating the contribution of offline versus online data helps isolate an RL algorithm’s ability to exploit experience and generalize versus its ability to explore effectively. Is it possible to train successful Atari agents completely in isolation solely based on offline data?

How much of the gain of recent distributional RL algorithms such as C51 (Bellemare et al., 2017) and QR-DQN (Dabney et al., 2018b) can be attributed to improvements in their exploration versus exploitation behavior? In other words, are distributional RL algorithms much better than DQN if trained on the same offline data?

Can simpler off-policy algorithms that avoid learning explicit distributions over returns capture the benefits of distributional RL? Ideally, one prefers algorithms that have better theoretical characteristics than distributional Q-learning algorithms.
By investigating the questions above, we make several contributions to off-policy deep RL research:


We propose a deceptively simple experimental setup for evaluating off-policy deep RL algorithms based on the logged experiences of a DQN agent. This helps reduce the computational cost of the experiments significantly and enables comparing the exploitation behavior of off-policy agents in isolation. We will release the offline data used in our experiments to help reproduce our results and enable the research community to evaluate off-policy methods on common ground.

Unexpectedly, we find that the full experience replay of a DQN agent, comprising about 50 million tuples of (observation, action, reward, next observation), is enough to learn strong Atari agents completely in isolation, without any interaction with the environment during training.

We confirm that batch QR-DQN significantly outperforms batch DQN when using identical offline data. We also confirm that QR-DQN outperforms an Ensemble-DQN baseline, which optimizes the Q-heads of a multi-headed Q-network based on separate TD errors for each head.

We present Random Ensemble Mixture (REM), a novel and simple extension of Ensemble-DQN, which enforces optimal Bellman consistency on random convex combinations of the Q-heads of a multi-headed Q-network. Surprisingly, as Figure 1 shows, the batch REM agent trained offline on DQN data outperforms the batch QR-DQN (Dabney et al., 2018b) and the online C51 (Bellemare et al., 2017) agents, both in terms of the median normalized scores and the number of games on which it is superior to DQN.
2 Background
In reinforcement learning (RL), an agent interacts with an environment, which is typically expressed as a Markov decision process (MDP) (Puterman, 1994), with a state space $\mathcal{S}$, an action space $\mathcal{A}$, a stochastic reward function $R(s, a)$, transition dynamics $P(s' \mid s, a)$, and a discount factor $\gamma \in [0, 1)$. A stochastic policy $\pi(\cdot \mid s)$ maps each state to a distribution over actions.

For an agent following the policy $\pi$, the action-value function, denoted $Q^\pi(s, a)$, is defined as the expectation of cumulative discounted future rewards, i.e.,
$$Q^\pi(s, a) := \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\Big|\, s_0 = s,\ a_0 = a,\ s_t \sim P(\cdot \mid s_{t-1}, a_{t-1}),\ a_t \sim \pi(\cdot \mid s_t) \Big]. \qquad (1)$$
The action-value function can be recursively defined in terms of the Bellman equation (Bellman, 1966) as
$$Q^\pi(s, a) = \mathbb{E}\big[ R(s, a) \big] + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a),\ a' \sim \pi(\cdot \mid s')}\big[ Q^\pi(s', a') \big]. \qquad (2)$$
Our goal is to find an optimal policy $\pi^*$ that attains the maximum expected return, i.e., $Q^{\pi^*}(s, a) \ge Q^\pi(s, a)$ for all policies $\pi$ and all $(s, a)$. One way to characterize the optimal policy is via the Bellman optimality equations, which recursively define the optimal action-value function, denoted $Q^*$, via:
$$Q^*(s, a) = \mathbb{E}\big[ R(s, a) \big] + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\Big[ \max_{a'} Q^*(s', a') \Big]. \qquad (3)$$
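In the tabular setting, the Bellman optimality operator in (3) is a $\gamma$-contraction, so applying it repeatedly converges to the optimal action-value function. A minimal numpy sketch on a small hypothetical MDP (the transition and reward numbers below are invented purely for illustration):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Q-value iteration: Q <- R + gamma * E_{s'}[max_a' Q(s', a')]
Q = np.zeros((2, 2))
for _ in range(1000):
    Q = R + gamma * (P @ Q.max(axis=1))
```

At convergence, `Q` satisfies the fixed-point equation (3) up to numerical precision; Q-learning can be viewed as a stochastic, sample-based version of this update.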
To this end, Q-learning (Watkins and Dayan, 1992) iteratively improves an approximate estimate of $Q^*$, denoted $Q_\theta$, by repeatedly regressing the LHS of (3) to samples from the RHS of (3). For large and complex state spaces, approximate Q-values are obtained using a neural network as the function approximator. To further stabilize learning, a target network with frozen parameters $\theta'$ is used for computing the learning target. The target network parameters $\theta'$ are updated to the current Q-network parameters $\theta$ after a fixed number of time steps.

DQN (Mnih et al., 2015) parameterizes $Q_\theta$ with a convolutional neural network and uses Q-learning with a target network while following an $\epsilon$-greedy policy over $Q_\theta$ for data collection. DQN minimizes the TD error loss in (4) over mini-batches of tuples $(s, a, r, s')$ of the agent’s past data, sampled from an experience replay buffer $\mathcal{D}$ (Lin, 1992) collected during training. This approach achieves human-level play on the Atari 2600 games (Bellemare et al., 2013).
$$\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\Big[ \ell_\lambda\big( r + \gamma \max_{a'} Q_{\theta'}(s', a') - Q_\theta(s, a) \big) \Big], \qquad (4)$$
where $\ell_\lambda$ is the Huber loss given by
$$\ell_\lambda(u) = \begin{cases} \frac{1}{2} u^2, & \text{if } |u| \le \lambda, \\ \lambda\big( |u| - \frac{\lambda}{2} \big), & \text{otherwise.} \end{cases}$$
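As a concrete sketch, the Huber loss and the TD-error objective of (4) can be computed for a mini-batch as follows (a numpy stand-in for the network outputs; function and array names are ours, not from the paper, and terminal-transition handling is omitted for brevity):

```python
import numpy as np

def huber(u, lam=1.0):
    """Huber loss l_lam(u): quadratic near zero, linear in the tails."""
    return np.where(np.abs(u) <= lam, 0.5 * u**2, lam * (np.abs(u) - 0.5 * lam))

def dqn_loss(q_sa, q_next_target, rewards, gamma=0.99):
    """TD-error loss of (4): q_sa holds Q_theta(s, a) for the taken actions;
    q_next_target holds Q_theta'(s', .) from the frozen target network."""
    target = rewards + gamma * q_next_target.max(axis=1)  # Bellman target
    return huber(target - q_sa).mean()
```

When the predicted Q-values already match the Bellman targets, the TD error is zero and the loss vanishes, reflecting the fixed point of (3).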
Q-learning is an off-policy algorithm (Sutton and Barto, 2018), meaning the target can be computed without consideration of how the experience was generated. In principle, off-policy RL algorithms can learn from data collected by any behavioral policy. In batch RL (Lange et al., 2012), we assume access to a fixed offline dataset of experiences, without further interactions with the environment.
Distributional RL (Bellemare et al., 2017) makes use of a distribution over returns, denoted $Z^\pi(s, a)$, instead of the scalar action-value function $Q^\pi(s, a)$, which is the expectation of $Z^\pi$. Accordingly, similar to the scalar setting, a distributional Bellman optimality operator is defined as
$$Z^*(s, a) \overset{D}{=} R(s, a) + \gamma\, Z^*\big(s', \arg\max_{a'} \mathbb{E}\big[ Z^*(s', a') \big]\big), \qquad (5)$$
where $\overset{D}{=}$ denotes distributional equality and $s'$ is distributed according to $P(\cdot \mid s, a)$.
The C51 algorithm by Bellemare et al. (2017) was the first distributional RL algorithm, and it achieved state-of-the-art performance on the Atari 2600 games. C51 models $Z(s, a)$ using a categorical distribution supported on a set of fixed locations $z_1 \le \cdots \le z_K$ uniformly spaced over a predetermined interval. The parameters of that distribution are the probabilities $p_i(s, a)$ associated with each location $z_i$. Given a current value distribution, the C51 algorithm applies a projection step to map the target onto its finite element support, followed by a Kullback-Leibler minimization step.

Dabney et al. (2018b) proposed to use quantile regression for distributional RL and developed QR-DQN, which outperformed C51. QR-DQN minimizes the Wasserstein distance to the distributional Bellman target and approximates the random return by a uniform mixture of $K$ Dirac deltas, i.e.,
$$Z_\theta(s, a) := \frac{1}{K} \sum_{i=1}^{K} \delta_{\theta_i(s, a)}, \qquad (6)$$
with each $\theta_i$ being trained to match the quantile $\hat{\tau}_i$ of the target return distribution using the Huber (Huber, 1964) quantile regression loss, where $\hat{\tau}_i = \frac{2i - 1}{2K}$ for $1 \le i \le K$.
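A minimal numpy sketch of the quantile Huber regression idea, assuming `theta` holds the $K$ quantile estimates for one state-action pair and `targets` holds samples from the target return distribution (names and shapes are ours, for illustration only):

```python
import numpy as np

def quantile_huber_loss(theta, targets, kappa=1.0):
    """QR-DQN-style loss sketch: each theta_i is pulled toward the
    quantile tau_hat_i of the empirical target distribution."""
    K = theta.shape[-1]
    tau_hat = (2 * np.arange(K) + 1) / (2 * K)           # quantile midpoints
    u = targets[:, None] - theta[None, :]                # pairwise TD errors
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u**2, kappa * (np.abs(u) - 0.5 * kappa))
    # Asymmetric weight |tau - 1{u < 0}| makes theta_i track quantile tau_i
    # rather than the mean.
    return (np.abs(tau_hat[None, :] - (u < 0)) * huber).mean()
```

The asymmetric weighting is what distinguishes quantile regression from ordinary TD regression: over- and under-estimation errors are penalized differently for each head.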
3 Can Pure Off-policy Learning with no Environment Interactions Succeed?
Off-policy learning with function approximation has no known convergence guarantees (Sutton and Barto, 2018), and there are well-known counterexamples (Baird, 1995) where Q-learning diverges even with linear function approximation. Despite the lack of theoretical guarantees, modern “off-policy” deep RL algorithms perform quite well on commonly used benchmarks such as the Arcade Learning Environment (ALE) (Bellemare et al., 2013) and MuJoCo locomotion tasks (Todorov et al., 2012). Most of these algorithms fall into the category of growing batch learning (Lange et al., 2012), in which data is collected and stored in an experience replay buffer (Lin, 1992) that is used to train the agent before further data collection.
Fujimoto et al. (2019) note that these “off-policy” algorithms tend to use near on-policy exploratory policies, and present results suggesting that such algorithms are ineffective when learning truly off-policy, without interacting with the environment. Previously, Zhang and Sutton (2017) argued that a large replay buffer can significantly hurt the performance of Q-learning algorithms. These results raise the question: if we had access to all of the data collected during training by a deep off-policy RL agent, could pure off-policy learning without any environment interactions succeed?
To answer this question, we train a standard off-policy agent, namely Nature DQN (Mnih et al., 2015), on 60 Atari 2600 games for 200 million frames (the standard protocol) and save all of the experience tuples of (observation, action, reward, next observation) encountered during training. We then use these logged experiences for training off-policy agents in an offline setting (i.e., batch RL (Lange et al., 2012)), without any new interaction with the environment during training. For most of the experiments, we train all of the offline agents for the same number of gradient updates as the online agents, but we also consider using four times as many gradient updates for the offline agents, given that the sample efficiency of an offline agent does not change with the number of training epochs.
The proposed offline setting for evaluating off-policy RL algorithms is much closer to supervised learning and simpler than the typical online setting. For example, in the offline setting, we optimize a training objective over a fixed dataset, as compared to the non-stationary objective over a changing experience replay buffer for an online off-policy RL algorithm. This simplicity allows us to segregate the problems of exploration and exploitation in off-policy RL and study them independently.
3.1 Experimental Setup & Results
We use DQN (Mnih et al., 2015) as our baseline agent to collect the full experience datasets. For each game, we repeat this dataset collection process 5 times for statistical significance and report the results averaged over 5 runs, unless stated otherwise. We used the hyperparameters provided in the Dopamine (Castro et al., 2018) baselines for a standardized comparison (refer to the supplementary material for more details). Note that we use the stochastic version of ALE with sticky actions (Machado et al., 2018) to ward against trajectory overfitting. We compare the following agents trained offline on the fixed dataset:

Batch DQN: Can we recover the online DQN agent’s performance in the offline setting? To answer this question, we train a DQN agent offline on the data collected from online DQN.
Batch QR-DQN: One may ask whether it is possible to exploit the data collected from online DQN more effectively using a better off-policy algorithm than DQN. To answer this question, we train agents offline on the same data using the distributional RL algorithm QR-DQN (Dabney et al., 2018b).
We measure the online performance of the batch agents on a score scale normalized with respect to the random and online DQN agents. The normalized score (11) for each game equals one for the better performing of the online DQN and random agents, and zero for the worse of the two.
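The normalization just described can be sketched as follows (the function and argument names are ours, not from the paper):

```python
def normalized_score(agent, random_score, dqn_score):
    """Rescale a game score so that the worse of the random and online DQN
    agents maps to 0 and the better one maps to 1."""
    lo = min(random_score, dqn_score)
    hi = max(random_score, dqn_score)
    return (agent - lo) / (hi - lo)
```

A normalized score above 1 therefore means the batch agent beats both reference agents on that game.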
Results. Batch DQN achieves at least 74% of online DQN’s score improvement over a random agent on half of the games, and outperforms online DQN on only 10 Atari games (Figure 2(a)) when trained for the same number of gradient steps as the online DQN agent. Quite surprisingly, batch QR-DQN outperforms online DQN and batch DQN on 44 and 51 games, respectively (Figure 2(b)). We also ran batch C51 on this data and observed that it also significantly outperforms batch DQN (Figure 7). Additionally, we performed similar experiments on data collected from an online QR-DQN agent and observe that batch QR-DQN again significantly outperforms batch DQN (Figure 6). The learning curves of the online DQN agent and of batch DQN and batch QR-DQN trained on all 60 games can be visualized in Figure 9 in the appendix.
The suboptimal performance of batch DQN suggests that the DQN agent is not very effective at exploiting off-policy data. More importantly, the impressive performance of batch QR-DQN indicates that learning good policies completely offline using logged DQN data is possible on most of the Atari 2600 games. Interestingly, a significant fraction of the gains from distributional RL agents (C51, QR-DQN) can be attributed to their exploitation capacity rather than their exploration behavior. The diversity of states in the experience replay buffer is an important factor for performance (De Bruin et al., 2015), and our results indicate that 50 million experience tuples from a DQN agent are diverse enough to obtain good performance on most of the Atari 2600 games.
4 Distilling the Success of Distributional RL into Simpler Algorithms
The source of the gains from combining nonlinear function approximation with QR-DQN, and from extending RL to distributional RL, remains unclear (Lyle et al., 2019). Is modelling the full distribution over expected returns crucial to achieve the gains of distributional RL? Can we distill the success of distributional RL into simpler off-policy methods that avoid learning explicit distributions over returns? Our results in the previous section confirm that the benefits of distributional RL mainly stem from better exploitation behavior. Using this new insight, we try to shed light on these questions.
In order to approximate return distributions, QR-DQN modifies the DQN architecture to output $K$ heads for each action, where each head is trained separately to represent a specific quantile value of the return distribution, and the average of the heads is used as the Q-value (6). As shown in Figure 3, the individual heads share all of the neural network layers except the final fully connected layer. We propose two simple variants of deep Q-learning using the same architecture changes over DQN employed by QR-DQN, but with objective functions for minimizing the TD error of the scalar approximation to the Q-function rather than return distributions ((7), (8)).
4.1 Ensemble-DQN
Ensemble-DQN is a simple extension of DQN which approximates the Q-values via an ensemble of Q-functions (Faußer and Schwenker, 2015; Osband et al., 2016; Anschel et al., 2017). We implement this by modifying the DQN network to output $K$ heads, which are used to estimate $K$ Q-functions. Each one of these Q-value heads, denoted $Q_\theta^k$, is trained against its own target head, similar to Bootstrapped-DQN (Osband et al., 2016). Note that our Ensemble-DQN is different from the ensembling proposed in (Anschel et al., 2017), which uses the average of the target Q-values for training the individual Q-functions. The Q-heads, with different random initializations, are optimized with complete data sharing between heads using the loss given by
$$\mathcal{L}(\theta) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\Big[ \ell_\lambda\big( r + \gamma \max_{a'} Q_{\theta'}^k(s', a') - Q_\theta^k(s, a) \big) \Big], \qquad (7)$$
where $\ell_\lambda$ is the Huber loss.
Bootstrapped-DQN uses the Q-value estimates to improve temporally-extended exploration, while Ensemble-DQN simply uses the mean of the estimates as the Q-value. In the offline learning setup, we are only concerned with the exploitation ability of Ensemble-DQN.
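The per-head training in loss (7) can be sketched in a few lines of numpy, assuming the network outputs are given as arrays (shapes and names are our own illustrative choices, not the paper's implementation):

```python
import numpy as np

def ensemble_dqn_loss(q_heads, q_next_target_heads, rewards, gamma=0.99):
    """Ensemble-DQN loss (7) sketch: each head k regresses on a Bellman
    target built from its OWN target head.
    q_heads: (batch, K) Q-values of the taken actions, one per head;
    q_next_target_heads: (batch, K, num_actions) target-network values."""
    targets = rewards[:, None] + gamma * q_next_target_heads.max(axis=2)
    u = targets - q_heads                       # per-head TD errors
    huber = np.where(np.abs(u) <= 1.0, 0.5 * u**2, np.abs(u) - 0.5)
    return huber.mean()                         # average over batch and heads
```

Contrast this with REM below, where the heads are first mixed into a single Q-estimate before the TD error is computed.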
4.2 Random Ensemble Mixture (REM)
Increasing the number of models used for ensembling, and their diversity (Kuncheva and Whitaker, 2003), tends to improve performance in supervised learning. We noticed a similar trend in our Ensemble-DQN experiments, i.e., ensembling a larger number of Q-value estimates tends to improve performance. This observation begs the question of whether we can use an ensemble over an exponential number of estimates in a computationally efficient manner.
Inspired by Dropout (Srivastava et al., 2014), we propose Random Ensemble Mixture (REM) for RL, which uses multiple Q-function heads to estimate the Q-value, similar to Ensemble-DQN. The key insight in REM is that since each head represents a valid estimate of the Q-values, a convex combination of the heads is also a valid approximator of the Q-function. This is especially true at the fixed point, where all of the heads must converge to identical Q-value estimates. Using this insight, we train a family of Q-functions defined over a simplex. Specifically, for each mini-batch, we randomly draw a categorical distribution $\alpha$ that defines a convex combination of the heads, yielding a Q-function estimate. This estimate is trained against its corresponding target to define the TD error. The loss takes the form
$$\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\, \mathbb{E}_{\alpha \sim P_\Delta}\Big[ \ell_\lambda\Big( \sum_k \alpha_k Q_\theta^k(s, a) - r - \gamma \max_{a'} \sum_k \alpha_k Q_{\theta'}^k(s', a') \Big) \Big], \qquad (8)$$
where $P_\Delta$ represents a probability distribution over the standard $(K-1)$-simplex $\Delta^{K-1} = \{\alpha \in \mathbb{R}^K : \alpha_k \ge 0,\ \sum_k \alpha_k = 1\}$. For evaluation, we use the average of the $K$ heads as the Q-function, i.e., $Q(s, a) := \frac{1}{K} \sum_k Q_\theta^k(s, a)$.

Proposition 1. Consider the assumptions: (a) The distribution $P_\Delta$ has full support over the entire simplex. (b) Only a finite number of distinct Q-functions globally minimize the loss in (4). (c) $Q^*$ is defined in terms of the MDP induced by the data distribution. (d) $Q^*$ lies in the family of our function approximation. Then, at the global minimum of (8):


(i) Under assumptions (a) and (b), all the heads represent identical Q-functions.

(ii) Under assumptions (a)–(d), the common convergence point is $Q^*$.
The proof of (ii) follows from (i) and the fact that (8) is lower bounded by zero, while $Q^*$ by definition attains a TD error of zero. The proof of part (i) can be found in the supplementary material. In our experiments, we use a very simple distribution $P_\Delta$ without any tuning: we first draw a set of $K$ values i.i.d. from Uniform(0, 1) and normalize them to get a valid categorical distribution, i.e., $\alpha'_k \sim U(0, 1)$ followed by $\alpha_k = \alpha'_k / \sum_{k'} \alpha'_{k'}$.
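Putting the two pieces together, the REM loss (8) with the simple normalized-uniform choice of $P_\Delta$ can be sketched in numpy as follows (array shapes and names are illustrative assumptions of ours, not the paper's actual TensorFlow implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def rem_loss(q_heads, q_next_target_heads, rewards, gamma=0.99):
    """REM loss (8) sketch: draw a fresh random convex combination of the
    heads per mini-batch and compute a single TD error on the mixture.
    q_heads: (batch, K) Q-values of the taken actions, one per head;
    q_next_target_heads: (batch, K, num_actions) target-network values."""
    K = q_heads.shape[1]
    alpha = rng.uniform(size=K)
    alpha /= alpha.sum()                     # point on the (K-1)-simplex
    q = q_heads @ alpha                      # combined online Q-estimate
    q_next = (q_next_target_heads * alpha[None, :, None]).sum(axis=1)
    targets = rewards + gamma * q_next.max(axis=1)
    u = targets - q
    huber = np.where(np.abs(u) <= 1.0, 0.5 * u**2, np.abs(u) - 0.5)
    return huber.mean()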
It can be inferred from Proposition 1 that, despite using an architecture with multiple heads, REM does not change the representation capacity over DQN. Also, one can view (8) as enforcing an infinite set of Bellman optimality constraints, where each constraint provides an auxiliary objective for training (Jaderberg et al., 2017; Bellemare et al., 2019a). We hypothesize that gains from over-parametrization in REM over DQN, if any, can be attributed to its training procedure.
4.3 Experiments & Results
We train batch Ensemble-DQN and batch REM offline using the experience replay data of a DQN agent. All of the offline agents, including QR-DQN, Ensemble-DQN, and REM, use an identical architecture with $K$ heads and the same number of parameters. We use the same number of SGD updates as the online DQN agent for the algorithms compared in this subsection.
Batch Ensemble-DQN significantly outperforms batch DQN and achieves a higher score than online DQN on 26 games (Figure 4(a)). This indicates that a noticeable proportion of the gains achieved by QR-DQN is due to training multiple heads to estimate the Q-values. Batch REM outperforms online DQN and batch Ensemble-DQN on 35 and 47 games, respectively (Figure 4(b)).
4.4 Asymptotic Performance of Batch Agents
In the offline setting, since the training dataset is fixed, the sample efficiency of the batch agents remains unchanged if they are trained for a longer duration. This leads to the question of whether there would be any significant performance gain at the expense of increased computation for training the batch agents. To answer this, we train all the batch agents on the dataset collected from online DQN for four times as many gradient steps (indicated by the suffix (4X) after the name of a batch agent) as the online DQN (or C51) agent.
Figure 1 reveals that the batch agents are able to exploit the data much more effectively when trained for more gradient updates. Interestingly, batch REM (4X) achieves a higher score than online DQN on more games, as well as a higher median normalized score than online C51. It is worthwhile to note that batch REM (4X) is able to outperform batch QR-DQN (4X). Also, batch REM (4X) and batch QR-DQN (4X) outperform C51 on approximately half of the games (see Figure 8 in the appendix). These results indicate that the success of distributional RL algorithms such as QR-DQN and C51 can be distilled into the simpler REM algorithm, at least in the batch setting.
Figure 5 shows that the batch agents are able to obtain much higher scores than their training data on some of the games. Analogous to supervised learning, all of the batch agents exhibit overfitting, i.e., after a certain number of gradient updates, the TD error on the dataset (not plotted here) continues to decrease while the online performance deteriorates significantly. The learning curves comparing the DQN, QR-DQN, Ensemble-DQN and REM agents trained offline on all 60 games can be visualized in Figure 10 in the appendix.
5 Related work
In an attempt to develop efficient and stable off-policy deep RL algorithms, recent papers have presented various conceptual and engineering contributions. DQN (Mnih et al., 2015) proposes the use of experience replay and target networks. Bootstrapped DQN (Osband et al., 2016) and NoisyNet (Fortunato et al., 2018) suggest using randomized value functions to select more exploratory actions. Averaged DQN (Anschel et al., 2017) suggests using a running average of several target networks as the target value function. Distributional RL (Bellemare et al., 2017; Dabney et al., 2018b, a) proposes using a distribution to represent optimal action-values rather than a point estimate. Rainbow (Hessel et al., 2018) suggests combining improvements from different techniques, such as distributional RL, prioritized experience replay (Schaul et al., 2016), double Q-learning (Van Hasselt et al., 2016; Hasselt, 2010), the dueling architecture (Wang et al., 2016), etc., into a single off-policy agent. We investigate whether the complexity of some of these prior approaches is necessary. Our work is similar in spirit to Rajeswaran et al. (2017), which attempts to demystify the complexity of on-policy deep RL for continuous control. Our work is mainly related to two subareas of RL known as batch RL and distributional RL.
Batch Reinforcement Learning. RL on a fixed dataset without any online interactions with the environment has long been studied (Ernst et al., 2005; Kalyanakrishnan and Stone, 2007; Strehl et al., 2010; Lange et al., 2012; Swaminathan and Joachims, 2015). While there has been increasing interest in batch off-policy RL over the last few years (Jiang and Li, 2016; Thomas and Brunskill, 2016; Farajtabar et al., 2018; Irpan et al., 2019; Jaques et al., 2019), much of this work focused on off-policy policy evaluation, where the goal is to estimate the performance of a given target policy. Our work tackles the problem of batch off-policy optimization, which requires learning a good policy given a fixed dataset. Recent work from Fujimoto et al. (2019), as well as concurrent work from Kumar et al. (2019), documents that standard off-policy RL methods with static datasets fail on continuous control tasks. In (Fujimoto et al., 2019), this failure is attributed to extrapolation error, i.e., the mismatch between the batch dataset and the true state-action visitation of the current policy. We assert that the failure on off-policy expert data reported in (Kumar et al., 2019) is analogous to the difficulty of learning a binary classifier using only positive training examples. Additionally, our positive results with standard off-policy agents suggest that failure to learn from diverse offline data is not necessarily an indication of issues with batch RL (e.g., see (Fujimoto et al., 2019)), but may be linked to extrapolation error for weaker exploitation agents, e.g., DQN.

Distributional Reinforcement Learning. Recently, distributional RL algorithms (Bellemare et al., 2017; Dabney et al., 2018b, a), especially C51 (Bellemare et al., 2017), gained enormous popularity due to their impressive performance on the ALE benchmark (Bellemare et al., 2013). Dabney et al. (2018b) proposed QR-DQN, allowing for a more flexible parameterization for modeling the return distribution, leading it to outperform C51. Despite these algorithmic advances, our understanding of the performance gains due to practical distributional RL methods remains limited (Lyle et al., 2019; Bellemare et al., 2019b). In this work, we investigate the QR-DQN algorithm, as it only modifies the architecture (Figure 3) and the regression loss objective used in DQN, and is therefore easier to study than C51. We demonstrate that the gains from QR-DQN mainly stem from its exploitation ability, and distill the success of QR-DQN into REM (Section 4.2), a much simpler off-policy expected RL algorithm.
6 Conclusion
This paper investigates off-policy deep RL for learning to play Atari 2600 games from offline replay data. We conclude that it is possible to learn successful policies that significantly outperform the policy used to collect the replay data. We demonstrate the superior exploitation capability of distributional RL algorithms (e.g., batch QR-DQN) over batch DQN and batch Ensemble-DQN. We propose REM, a simple and effective variant of Ensemble-DQN that outperforms batch QR-DQN and online C51 when trained only using the replay data of DQN. The proposed offline experimental setup can serve as a testbed for evaluating off-policy algorithms and enable the research community to evaluate off-policy methods on common ground.
For future work, analyzing the effect of the distribution over the simplex in REM and potentially learning the distribution using an adversarial objective are promising directions. We believe that REM holds significant potential in the online setting and combining its exploitation ability with ensemble exploration strategies can potentially lead to a substantial improvement. That said, this paper focuses on the exploitation ability of REM, and we leave online REM to future work.
7 Acknowledgments
We thank Pablo Samuel Castro for help in understanding and debugging certain issues with the Dopamine codebase and for reviewing an early draft of the paper. We thank Robert Dadashi, Carles Gelada and Liam Fedus for helpful discussions. We also acknowledge William Chan and Aviral Kumar for their review of an initial draft of the paper.
References

Anschel et al. (2017) Oron Anschel, Nir Baram, and Nahum Shimkin. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. ICML, 2017.
 Baird (1995) Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. Machine Learning, 1995.

Bellemare et al. (2013) Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 2013.
 Bellemare et al. (2017) Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. ICML, 2017.
 Bellemare et al. (2019a) Marc G Bellemare, Will Dabney, Robert Dadashi, Adrien Ali Taiga, Pablo Samuel Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, and Clare Lyle. A geometric perspective on optimal representations for reinforcement learning. arXiv preprint arXiv:1901.11530, 2019a.
 Bellemare et al. (2019b) Marc G Bellemare, Nicolas Le Roux, Pablo Samuel Castro, and Subhodeep Moitra. Distributional reinforcement learning with linear function approximation. arXiv preprint arXiv:1902.03149, 2019b.
 Bellman (1966) Richard Bellman. Dynamic programming. Science, 1966.
 Bottou et al. (2013) Léon Bottou, Jonas Peters, Joaquin QuiñoneroCandela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. JMLR, 2013.
 Boyan and Moore (1995) Justin A Boyan and Andrew W Moore. Generalization in reinforcement learning: Safely approximating the value function. NIPS, 1995.
 Brown and Sandholm (2017) Noam Brown and Tuomas Sandholm. Libratus: The superhuman ai for nolimit poker. IJCAI, 2017.
 Castro et al. (2018) Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110, 2018.
 Dabney et al. (2018a) Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. ICML, 2018a.
 Dabney et al. (2018b) Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. AAAI, 2018b.
 De Bruin et al. (2015) Tim De Bruin, Jens Kober, Karl Tuyls, and Robert Babuška. The importance of experience replay database composition in deep reinforcement learning. Deep reinforcement learning workshop, NIPS, 2015.
 Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. ImageNet: A LargeScale Hierarchical Image Database. CVPR, 2009.
 Ernst et al. (2005) Damien Ernst, Pierre Geurts, and Louis Wehenkel. Treebased batch mode reinforcement learning. JMLR, 2005.
 Farajtabar et al. (2018) Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust offpolicy evaluation. ICML, 2018.
 Faußer and Schwenker (2015) Stefan Faußer and Friedhelm Schwenker. Neural network ensembles in reinforcement learning. Neural Processing Letters, 2015.
 Fortunato et al. (2018) Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. ICLR, 2018.
 Fujimoto et al. (2019) Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. ICML, 2019.
 Hasselt (2010) Hado V Hasselt. Double Q-learning. NIPS, 2010.
 Hessel et al. (2018) Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. AAAI, 2018.
 Howard (1960) Ronald A Howard. Dynamic programming and Markov processes. MIT Press, 1960.
 Huber (1964) PJ Huber. Robust estimation of a location parameter. Ann. Math. Stat., 1964.
 Irpan et al. (2019) Alex Irpan, Kanishka Rao, Konstantinos Bousmalis, Chris Harris, Julian Ibarz, and Sergey Levine. Off-policy evaluation via off-policy classification. arXiv preprint arXiv:1906.01624, 2019.
 Jaderberg et al. (2017) Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. ICLR, 2017.
 Jaques et al. (2019) Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
 Jiang and Li (2016) Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. ICML, 2016.
 Kakade (2002) Sham M Kakade. A natural policy gradient. NIPS, 2002.
 Kalashnikov et al. (2018) Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. CoRL, 2018.
 Kalyanakrishnan and Stone (2007) Shivaram Kalyanakrishnan and Peter Stone. Batch reinforcement learning in a complex domain. AAMAS, 2007.
 Kumar et al. (2019) Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949, 2019.
 Kuncheva and Whitaker (2003) Ludmila I Kuncheva and Christopher J Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 2003.
 Lange et al. (2012) Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. Reinforcement learning, 2012.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
 Levine et al. (2016) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. JMLR, 2016.
 Lin (1992) Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 1992.
 Lyle et al. (2019) Clare Lyle, Pablo Samuel Castro, and Marc G Bellemare. A comparative analysis of expected and distributional reinforcement learning. AAAI, 2019.
 Machado et al. (2018) Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 2018.
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv:1312.5602, 2013.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. ICML, 2016.
 Moravčík et al. (2017) Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisỳ, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 2017.
 OpenAI et al. (2018) OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafał Józefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. Learning dexterous in-hand manipulation. CoRR, 2018.
 Osband et al. (2016) Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. NIPS, 2016.
 Puterman (1994) Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
 Rajeswaran et al. (2017) Aravind Rajeswaran, Kendall Lowrey, Emanuel V Todorov, and Sham M Kakade. Towards generalization and simplicity in continuous control. NIPS, 2017.
 Schaul et al. (2016) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. ICLR, 2016.
 Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. ICML, 2015.
 Silver et al. (2016) David Silver, Aja Huang, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
 Strehl et al. (2010) Alexander L. Strehl, John Langford, Lihong Li, and Sham Kakade. Learning from logged implicit exploration data. NIPS, 2010.
 Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 Sutton et al. (2000) Richard S Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. NIPS, 2000.
 Swaminathan and Joachims (2015) Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. JMLR, 2015.
 Thomas and Brunskill (2016) Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. ICML, 2016.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. IROS, 2012.
 Tsitsiklis and Van Roy (1997) John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. NIPS, 1997.
 Van Hasselt et al. (2016) Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. AAAI, 2016.
 Wang et al. (2016) Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. ICLR, 2016.
 Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 1992.
 Williams (1992) Ronald J. Williams. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine Learning, 1992.
 Zhang and Sutton (2017) Shangtong Zhang and Richard S. Sutton. A deeper look at experience replay. arXiv preprint arXiv:1712.01275, 2017.
8 Supplement
8.1 Proofs
Proposition 1. Consider the assumptions: (a) The distribution $P_\Delta$ has full support over the entire $(K-1)$-simplex $\Delta^{K-1}$. (b) Only a finite number of distinct Q-functions globally minimize the loss in (4). (c) $Q^*$ is defined in terms of the MDP induced by the data distribution $\mathcal{D}$. (d) $Q^*$ lies in the family of our function approximation. Then at the global minimum of (8):

(i) Under assumptions (a) and (b), all the heads represent identical Q-functions.

(ii) Under assumptions (a)–(d), the common convergence point is $Q^*$.
Proof. Part (i): Under assumptions (a) and (b), we prove by contradiction that each head must be identical at any global minimum of the REM loss (8). Note that we consider two Q-functions to be distinct only if they differ on some state in $\mathcal{D}$.
The REM loss $\mathcal{L}(\theta) = \mathbb{E}_{\alpha \sim P_\Delta}\left[\mathcal{L}(\theta, \alpha)\right]$, where $\mathcal{L}(\theta, \alpha)$ denotes the loss in (4) for the combined Q-function $Q_\theta^\alpha = \sum_k \alpha_k Q_\theta^k$, is given by

$$\mathcal{L}(\theta) = \mathbb{E}_{\alpha \sim P_\Delta}\, \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \ell_\lambda\Big( \textstyle\sum_k \alpha_k Q_\theta^k(s,a) - r - \gamma \max_{a'} \sum_k \alpha_k Q_{\theta'}^k(s',a') \Big) \right] \tag{9}$$
If the heads $Q_\theta^i$ and $Q_\theta^j$ do not converge to identical Q-values at the global minimum of $\mathcal{L}(\theta)$, it can be deduced using Lemma 1 that every Q-function given by a convex combination $Q_\theta^\alpha = \sum_k \alpha_k Q_\theta^k$ with $\alpha \in \Delta^{K-1}$ minimizes the loss in (4); since at least two heads differ, this yields infinitely many distinct global minimizers. This contradicts the assumption that only a finite number of distinct Q-functions globally minimize the loss in (4). Hence, all Q-heads represent an identical Q-function at the global minimum of $\mathcal{L}(\theta)$.
Lemma 1. Assuming that the distribution $P_\Delta$ has full support over the entire simplex $\Delta^{K-1}$, at any global minimum of $\mathcal{L}(\theta)$, the Q-function heads $Q_\theta^k$ for $k = 1, \dots, K$ minimize $\mathcal{L}(\theta, \alpha)$ for any $\alpha \in \Delta^{K-1}$.
Proof. Let $Q_{\theta^*}^{\alpha^*}$, corresponding to the convex combination $\alpha^*$, represent one of the global minima of (9), i.e., $(\theta^*, \alpha^*) \in \arg\min_{\theta, \alpha} \mathcal{L}(\theta, \alpha)$ where $\alpha^* \in \Delta^{K-1}$. Any global minimum of $\mathcal{L}(\theta)$ attains a value of $\mathcal{L}(\theta^*, \alpha^*)$ or higher, since

$$\mathcal{L}(\theta) = \mathbb{E}_{\alpha \sim P_\Delta}\left[\mathcal{L}(\theta, \alpha)\right] \geq \mathbb{E}_{\alpha \sim P_\Delta}\left[\min_{\theta', \alpha'} \mathcal{L}(\theta', \alpha')\right] = \mathcal{L}(\theta^*, \alpha^*). \tag{10}$$
Let $Q_\theta^k(s, a) = w_k^\top f_\theta(s, a)$, where $f_\theta(s, a)$ represents the features shared among the heads and $w_k$ represents the weight vector in the final layer corresponding to the $k$-th head. Note that $Q_{\theta^*}^{\alpha^*}$ can also be represented by each of the individual heads using a weight vector given by the convex combination of the weight vectors $w_1, \dots, w_K$, i.e., $w_{\alpha^*} = \sum_k \alpha^*_k w_k$. Let $\hat{\theta}$ be identical to $\theta^*$ except that $w_k = w_{\alpha^*}$ for all heads. By definition of $\hat{\theta}$, $\mathcal{L}(\hat{\theta}, \alpha) = \mathcal{L}(\theta^*, \alpha^*)$ for all $\alpha \in \Delta^{K-1}$, which implies that $\mathcal{L}(\hat{\theta}) = \mathcal{L}(\theta^*, \alpha^*)$. Hence, $\hat{\theta}$ corresponds to one of the global minima of $\mathcal{L}(\theta)$, and any global minimum of $\mathcal{L}(\theta)$ attains a value of $\mathcal{L}(\theta^*, \alpha^*)$.
Since $\mathcal{L}(\theta, \alpha) \geq \mathcal{L}(\theta^*, \alpha^*)$ for any $\alpha \in \Delta^{K-1}$, any $\theta_{\min}$ with $\mathcal{L}(\theta_{\min}) = \mathcal{L}(\theta^*, \alpha^*)$ must satisfy $\mathcal{L}(\theta_{\min}, \alpha) = \mathcal{L}(\theta^*, \alpha^*)$ for every $\alpha$ in the support of $P_\Delta$, i.e., for every $\alpha \in \Delta^{K-1}$. Therefore, at any global minimum of $\mathcal{L}(\theta)$, the Q-function heads minimize $\mathcal{L}(\theta, \alpha)$ for any $\alpha \in \Delta^{K-1}$.
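For concreteness, the REM loss in (9) can be sketched numerically. The NumPy code below is our illustration (array names, shapes, and the `rem_loss` function are ours, not the paper's implementation); sampling $\alpha$ from a uniform Dirichlet gives the full-support distribution over the simplex required by assumption (a).

```python
import numpy as np

rng = np.random.default_rng(0)

def huber(x, delta=1.0):
    """Huber loss (Huber, 1964), the standard DQN error function."""
    return np.where(np.abs(x) <= delta,
                    0.5 * x ** 2,
                    delta * (np.abs(x) - 0.5 * delta))

def rem_loss(q_heads, target_q_heads, rewards, gamma, rng):
    """Single-sample estimate of the REM loss.

    q_heads:        (K, B) Q-values of the taken actions from the K online heads
    target_q_heads: (K, B, A) target-network Q-values at the next states
    rewards:        (B,) rewards of the sampled transitions
    """
    num_heads = q_heads.shape[0]
    # alpha ~ P_Delta: Dirichlet(1, ..., 1) is uniform over the simplex,
    # hence has full support (assumption (a)).
    alpha = rng.dirichlet(np.ones(num_heads))
    q_mix = np.tensordot(alpha, q_heads, axes=1)            # sum_k alpha_k Q^k(s, a)
    next_q_mix = np.tensordot(alpha, target_q_heads, axes=1)
    td_target = rewards + gamma * next_q_mix.max(axis=-1)   # r + gamma max_a' Q^alpha(s', a')
    return huber(q_mix - td_target).mean()
```

In practice the expectation over $\alpha$ is estimated by drawing a fresh combination per mini-batch, exactly as the single draw above.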
8.2 Score Normalization
The improvement in normalized performance of a batch agent over a DQN agent, expressed as a percentage, is calculated as:
$$\text{Improvement}\,(\%) = 100 \times \frac{\text{Score}_{\text{Agent}} - \text{Score}_{\text{DQN}}}{\max(\text{Score}_{\text{DQN}}, \text{Score}_{\text{Random}}) - \min(\text{Score}_{\text{DQN}}, \text{Score}_{\text{Random}})} \tag{11}$$
We chose not to measure performance in terms of percentage of DQN performance alone, because a tiny difference relative to the random agent on some games can translate into a difference of hundreds of percent in DQN performance. Additionally, the max is needed since DQN performs worse than a random agent on the games Solaris and Skiing.
Since a C51 agent always outperforms the random agent, the normalized performance over C51 (in %) simplifies to:
$$\text{Improvement}\,(\%) = 100 \times \frac{\text{Score}_{\text{Agent}} - \text{Score}_{\text{C51}}}{\text{Score}_{\text{C51}} - \text{Score}_{\text{Random}}} \tag{12}$$
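The two normalizations can be written as small helper functions. This is a sketch of the gap-based scheme described above (function names are ours); the guard in the denominator handles games where DQN scores below the random agent.

```python
def improvement_over_dqn(score_agent, score_dqn, score_random):
    """% improvement over DQN, normalized by the gap between the DQN and
    random scores. The max/min guard handles games such as Solaris and
    Skiing, where DQN performs worse than the random agent."""
    gap = max(score_dqn, score_random) - min(score_dqn, score_random)
    return 100.0 * (score_agent - score_dqn) / gap

def improvement_over_c51(score_agent, score_c51, score_random):
    """% improvement over C51; since C51 always beats the random agent,
    no guard on the denominator is needed."""
    return 100.0 * (score_agent - score_c51) / (score_c51 - score_random)
```

For example, an agent scoring 200 on a game where DQN scores 100 and the random agent 0 shows a 100% improvement over DQN.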
8.3 Batch Experiments on QR-DQN Data
Similar to our experiments using DQN data, we collected data comprising 200 million frames from an online QR-DQN agent. The learning curves for different batch agents trained using this data are shown in Figure 11.
8.4 Batch C51 vs Batch DQN
8.5 Hyperparameters & Experiment Details
Our results compare agents that share the same hyperparameters: the target network update frequency, the frequency at which exploratory actions are selected ($\epsilon$), the length of the schedule over which $\epsilon$ is annealed, and the number of agent steps before training occurs. As mentioned in Dopamine's (Castro et al., 2018) GitHub repository, changing these parameters can significantly affect performance, without necessarily being indicative of an algorithmic difference. Note that the batch agents trained offline do not collect any new data during training, so the training value of $\epsilon$ and its decay schedule are irrelevant to these agents.
The performance of the agents is compared using the best evaluation score achieved during training, where evaluation is done online using an $\epsilon$-greedy policy with $\epsilon = 0.001$. Step sizes and optimizers were taken as published. Note that the multi-headed Q-networks (Figure 3) use similar parameters to the QR-DQN agent in the Dopamine baselines. Table 1 summarizes our choices. All numbers are in ALE frames. For all experiments, we report results averaged over 5 runs with the same hyperparameters and different random seeds. Each individual run used a single Tesla P100 GPU. All online agents were trained for 200 iterations, where each iteration corresponds to experiencing 1 million ALE frames.
Table 1: Hyperparameters used by the agents.

Hyperparameter                        Setting
Sticky actions                        Yes
Episode termination                   Game Over
Training $\epsilon$                   0.01
Evaluation $\epsilon$                 0.001
$\epsilon$ decay schedule (frames)    1,000,000
Min. history to learn (frames)        80,000
Target net. update freq. (frames)     32,000
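For reference, the training $\epsilon$ schedule implied by Table 1 can be sketched as a linear anneal, mirroring Dopamine's default schedule (the function below is our illustration, not the exact library code; the defaults are taken from Table 1):

```python
def linearly_decaying_epsilon(step, warmup_steps=80_000,
                              decay_period=1_000_000, final_epsilon=0.01):
    """Keep epsilon at 1.0 during the warmup frames, anneal it linearly to
    `final_epsilon` over `decay_period` frames, then hold it constant.
    Defaults follow Table 1 (training epsilon = 0.01)."""
    steps_left = decay_period + warmup_steps - step
    bonus = (1.0 - final_epsilon) * steps_left / decay_period
    bonus = min(max(bonus, 0.0), 1.0 - final_epsilon)  # clip to [0, 1 - final]
    return final_epsilon + bonus
```

For example, halfway through the decay period (580,000 frames in) the schedule returns $\epsilon \approx 0.505$.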
8.6 Additional Plots & Tables
Figure 8 compares the batch REM (4X) and batch QR-DQN (4X) agents trained using DQN data against the online C51 agent. Table 2 summarizes the results related to the asymptotic performance of the batch agents presented in Figure 1.
Table 2: Asymptotic performance of batch agents: median normalized scores across 60 Atari games, and the number of games where each agent outperforms the fully-trained online DQN (>DQN).

Agent                      Median    >DQN
Batch DQN                  74%       10
Batch Ensemble-DQN         92%       26
Batch REM                  103%      35
Batch QR-DQN               115%      44
Batch DQN (4X)             83%       17
Batch Ensemble-DQN (4X)    111%      38
Batch REM (4X)             123%      48
Batch QR-DQN (4X)          119%      45
Online C51                 119%      44
Online QR-DQN              135%      53
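Summary statistics of this kind can be recomputed from per-game normalized scores as follows (a sketch; the input arrays and the `summarize_scores` helper are ours, not the paper's evaluation code):

```python
import numpy as np

def summarize_scores(agent_scores, dqn_scores):
    """agent_scores, dqn_scores: per-game normalized scores (fractions of
    DQN-level performance) over the game suite. Returns the median score
    as a percentage and the number of games where the agent beats DQN."""
    agent_scores = np.asarray(agent_scores, dtype=float)
    dqn_scores = np.asarray(dqn_scores, dtype=float)
    median_pct = 100.0 * float(np.median(agent_scores))
    games_above_dqn = int(np.sum(agent_scores > dqn_scores))
    return median_pct, games_above_dqn
```

With 60 per-game scores per agent, this yields the two columns of the table above.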
8.7 Learning Curves
All the learning curves in this section show the evaluation performance across 60 Atari games averaged over 5 runs, smoothed over a sliding window of 3 iterations, and error bands show 95% confidence interval. For each game, each individual run for a batch agent use the logged dataset collected from a DQN (or QRDQN) agent run using a different random seed.