The Difficulty of Passive Learning in Deep Reinforcement Learning

by Georg Ostrovski, et al.

Learning to act from observational data without active environmental interaction is a well-known challenge in Reinforcement Learning (RL). Recent approaches involve constraints on the learned policy or conservative updates, preventing strong deviations from the state-action distribution of the dataset. Although these methods are evaluated using non-linear function approximation, theoretical justifications are mostly limited to the tabular or linear cases. Given the impressive results of deep reinforcement learning, we argue for a need to more clearly understand the challenges in this setting. In the vein of Held and Hein's classic 1963 experiment, we propose the "tandem learning" experimental paradigm which facilitates our empirical analysis of the difficulties in offline reinforcement learning. We identify function approximation in conjunction with fixed data distributions as the strongest factors, thereby extending but also challenging hypotheses stated in past work. Our results provide relevant insights for offline deep reinforcement learning, while also shedding new light on phenomena observed in the online case of learning control.






1 Introduction

Learning to act in an environment purely from observational data (i.e. with no environment interaction), usually referred to as offline reinforcement learning, has great practical as well as theoretical importance (see (Levine et al., 2020) for a recent survey). In real-world settings like robotics and healthcare, it is motivated by the ambition to learn from existing datasets and the high cost of environment interaction. Its theoretical appeal is that stationarity of the data distribution allows for more straightforward convergence analysis of learning algorithms. Moreover, decoupling learning from data generation alleviates one of the major difficulties in the empirical analysis of common reinforcement learning agents, allowing the targeted study of learning dynamics in isolation from their effects on behavior.

Recent work has identified extrapolation error as a major challenge for offline (deep) reinforcement learning (Achiam et al., 2019; Buckman et al., 2021; Fujimoto et al., 2019b; Fakoor et al., 2021; Liu et al., 2020; Nair et al., 2020), with bootstrapping often highlighted as either the cause or an amplifier of the effect: the values of state-action pairs that are missing or under-represented in the dataset can be over-estimated, either transiently (due to insufficient training or data) or even asymptotically (due to modelling or dataset bias), resulting in a potentially severely under-performing acquired policy. The corrective feedback loop (Kumar et al., 2020b), whereby value over-estimation is self-correcting via exploitation during interaction with the environment (while under-estimation is corrected by exploration), is critically missing in the offline setting.

To mitigate this, typically one of a few related strategies is proposed: policy or learning update constraints preventing deviations from states and actions well-covered by the dataset or satisfying certain uncertainty bounds (Fujimoto et al., 2019a, b; Kumar et al., 2019, 2020c; Achiam et al., 2019; Wang et al., 2020b; Wu et al., 2021; Nair et al., 2020; Wu et al., 2019; Yu et al., 2020); pessimism bias to battle value over-estimation (Buckman et al., 2021; Kidambi et al., 2020); large and diverse datasets to improve state space coverage (Agarwal et al., 2020); or learned models to fill in gaps with synthesized data (Schrittwieser et al., 2021; Matsushima et al., 2020). While many of these enjoy theoretical justification in the tabular or linear cases (Thomas et al., 2015), guarantees for the practically relevant non-linear case are mostly lacking.

In this paper we draw inspiration from the experimental paradigm introduced in the classic Held and Hein (1963) experiment in psychology. The experiment involved coupling two young animal subjects’ movements and visual perceptions to ensure that both receive the same stream of visual inputs, while only one can actively shape that stream by directing the pair’s movements (Fig. 1, top-left). By showing that, despite identical visual experiences, only the actively moving subject acquired adequate visual acuity, the experiment established the importance of active locomotion in learning vision. Analogously, we introduce the ‘Tandem RL’ setup, pairing an ‘active’ and a ‘passive’ agent in a training loop where only the active agent drives data generation, while both perform identical learning updates from the generated data (the ‘tandem’ analogy is of course that of two riders, both of whom experience the same route, while only the front rider gets to decide on direction). By decoupling learning dynamics from its impact on data generation, while preserving the non-stationarity of the online learning setting, this experimental paradigm promises to be a valuable analytic tool for the precise empirical study of RL algorithms.

Holding architectures, losses, and crucially data distribution equal across the active and passive agents, or varying them in a controlled manner, we perform a detailed empirical analysis of the failure modes of passive (i.e. non-interactive, offline) learning, and pinpoint the contributing factors in properties of the data distribution, function approximation and learning algorithm. Our study confirms some past intuitions for the failure modes of offline learning, while refining and extending the findings in the deep RL case. In particular, our results indicate an empirically less critical role for bootstrapping than previously hypothesized, while foregrounding erroneous extrapolation or over-generalization by a function approximator trained on an inadequate data distribution as the crucial challenge. Among other things, our experiments draw a sharp boundary between the mostly well-behaved (and analytically well-studied) case of linear function approximation, and the non-linear case for which theoretical guarantees are lacking. Moreover, we delineate different, more and less effective, ways of enhancing the training data distribution to support successful offline learning, e.g. by analysing the impact of dataset size and diversity, the stochasticity of the data generating policy, or small amounts of self-generated data. Our results provide hints towards a hypothesis relevant in both offline and online RL: robust learning of control with function approximation may require interactivity not merely as a data gathering mechanism, but as a counterbalance to a (sufficiently expressive) approximator’s tendency to ‘exploit gaps’ in an arbitrary fixed data distribution by excessive extrapolation.

Figure 1: (top-left) Held and Hein (1963) experiment setup. (top-right) Illustrations of the Tandem and Forked Tandem experiment setups. (bottom) Tandem (active and passive) performance on 4 selected Atari domains. In all figures, active agent performance is shown in gray.

2 The Experimental Paradigm of Tandem Reinforcement Learning

The Tandem RL setting, extending a similar analytic setup in (Fujimoto et al., 2019b), consists of two learning agents, one of which (the ‘active agent’) performs the usual online training loop of interacting with an environment and learning from the generated data, while the other (the ‘passive agent’) learns solely from data generated by the active agent, interacting with the environment only for evaluation. We distinguish two experimental paradigms (see Fig. 1, top-right):

Tandem: Active and passive agents start with independently initialized networks, and train on an identical sequence of training batches in the exact same order.

Forked Tandem: An agent is trained for a fraction of its total training budget. It is then ‘forked’ into active and passive agents, which start out with identical network weights. The active agent is ‘frozen’, i.e. receives no further training, but continues to generate data from its policy. The passive agent is trained on this generated data for the remainder of the training budget.
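The two paradigms can be sketched with a toy training loop. All names here (`ToyAgent`, `run_tandem`, `fork`) are illustrative stand-ins, not the paper's released implementation; the agent's learning step is reduced to a counter to highlight the data flow.

```python
import copy
import random

class ToyAgent:
    """Illustrative stand-in for a DQN learner (not the paper's code)."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.updates = 0
    def act(self, state):
        return self.rng.randrange(4)        # placeholder policy
    def update(self, batch):
        self.updates += 1                   # placeholder learning step

def run_tandem(active, passive, steps=100, batch_size=32):
    """Tandem paradigm: the active agent alone generates data; both
    agents are updated with identical batches in the same order."""
    replay = []
    for t in range(steps):
        replay.append((t, active.act(t)))   # only the active agent acts
        if len(replay) >= batch_size:
            batch = replay[-batch_size:]
            active.update(batch)
            passive.update(batch)           # same data, same order

def fork(agent):
    """Forked Tandem paradigm: clone the network weights; the active
    copy is then frozen and only continues to generate data."""
    return agent, copy.deepcopy(agent)      # identical initialization

active, passive = ToyAgent(seed=1), ToyAgent(seed=2)  # independent inits
run_tandem(active, passive)
assert active.updates == passive.updates    # identical update sequences
```

The only difference between the paradigms, in this sketch, is whether the passive agent starts from an independent initialization (Tandem) or from a weight-identical copy of a partially trained agent (Forked Tandem).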

2.1 Implementation

Our basic experimental agent is ‘Tandem DQN’, an active/passive pair of Double-DQN agents (van Hasselt et al., 2016). (Our choice of Double-DQN as a baseline is motivated by its relatively strong performance and robustness compared to vanilla DQN (Mnih et al., 2015), paired with its simplicity compared to later variants like Rainbow (Hessel et al., 2018), which allows for more easily controlled experiments with fewer moving parts.) Following the usual training protocol (Mnih et al., 2015), the total training budget is 200 iterations, each of which consists of 1M steps taken on the environment by the active agent, interspersed with regular learning updates (on one, or concurrently on both agents, depending on the paradigm) on batches of transitions sampled from the active agent’s replay buffer. Both agents are independently evaluated on the environment for 500K steps after each training iteration.

Most of our experiments are performed on the Atari domain (Bellemare et al., 2013), using the exact algorithm and hyperparameters from (van Hasselt et al., 2016). We use a fixed set of four representative games to demonstrate most of our empirical results, two of which (Breakout, Pong) can be thought of as easy and largely solved by baseline agents, while the others (Seaquest, Space Invaders) have non-trivial learning curves and remain challenging. Unless stated otherwise, all results show averages over at least 5 seeds, with confidence intervals indicating variation over seeds. In comparative plots, boldface entries indicate the default Tandem DQN configuration, and gray lines always correspond to the active agent’s performance.

2.2 The Tandem Effect

We begin by reproducing the striking observation in (Fujimoto et al., 2019b) that the passive learner generally fails to adequately learn from the very data stream that is demonstrably sufficient for its architecturally identical active counterpart; we refer to this phenomenon as the ‘tandem effect’ (Fig. 1, bottom). We ascertain the generality of this finding by replicating it across a broad suite of environments and agent architectures: Double-DQN on 57 Atari environments (Appendix Figs. 10 & 11), adapted agent variants on four Classic Control domains from the OpenAI Gym library (Brockman et al., 2016) and the MinAtar domain (Young and Tian, 2019) (Appendix Figs. 12 & 15), and the distributed R2D2 agent (Kapturowski et al., 2019) (Appendix Fig. 14). Details on agents and environments are provided in the Appendix. (We provide two Tandem RL implementations: one based on the DQN Zoo (Quan and Ostrovski, 2020), and one at dopamine/tree/master/dopamine/labs/tandem_dqn, based on the Dopamine library (Castro et al., 2018).)

Empirically, we make the informal observation that while active and passive Q-networks tend to produce similar values for typical state-action pairs under the active policy (where the action is the active Q-value function’s argmax for a given state), their values are less correlated for other (non-argmax) actions; in fact, the active and passive greedy policies of a Tandem DQN tend to disagree in a large fraction of states under the behavior distribution, measured after 100M steps of training across 57 Atari games (see Appendix Fig. 13). Moreover, in a fraction of Atari games, we observe the passive agent’s network to strongly over-estimate a fraction of state-action values, with the over-estimation growing as training progresses.
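The greedy-policy disagreement statistic is straightforward to compute from the two networks' Q-values; a minimal numpy sketch (array shapes and values are illustrative):

```python
import numpy as np

def greedy_disagreement(q_active, q_passive):
    """Fraction of states in which the active and passive greedy policies
    pick different actions. Inputs: Q-value arrays of shape [states, actions]."""
    a_active = q_active.argmax(axis=1)
    a_passive = q_passive.argmax(axis=1)
    return float((a_active != a_passive).mean())

# Toy values: the networks roughly agree on the active argmax action,
# but the passive network over-estimates a rarely-seen action in state 0.
q_a = np.array([[1.0, 0.2, 0.1],
                [0.3, 2.0, 0.0]])
q_p = np.array([[1.0, 0.9, 1.1],
                [0.1, 2.0, 1.9]])
assert greedy_disagreement(q_a, q_p) == 0.5  # policies differ in 1 of 2 states
```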

3 Analysis of the Tandem Effect

In line with existing explanations (Levine et al., 2020), we propose that the tandem effect is primarily caused by extrapolation error when certain state-action pairs are under-represented in the active agent’s behavior data. Specifically, with ε-greedy policies, even small over-estimation of the values of rarely seen actions can lead to behavior deviations sufficient to cause catastrophic under-performance.

We further extend this hypothesis: in the context of deep reinforcement learning (i.e. with non-linear function approximation), an inadequate data distribution can drive over-generalization (Bengio et al., 2020), making such erroneous extrapolation likely. While the tandem effect can show up as learning inefficiency even in the tabular case (Xiao et al., 2021), it proves especially pernicious in the case of non-linear function approximation, where erroneous extrapolation can lead to errors not just on completely unseen, but also rarely seen data, and can persist in the infinite-sample limit.

Coalescing this view and past analyses of challenges in offline RL (e.g. (Levine et al., 2020; Fujimoto et al., 2019b; Liu et al., 2020)) into the following three potential contributing factors in the tandem effect provides a natural structure to our analysis:

Bootstrapping (B)

The passive agent’s bootstrapping from poorly estimated (in particular, over-estimated) values causes any initially small mis-estimation to get amplified.

Data Distribution (D)

Insufficient coverage of sub-optimal actions under the active agent’s policy may lead to their mis-estimation by the passive agent. In the case of over-estimation, this may lead to the passive agent’s under-performance.

Function Approximation (F)

A non-linear function approximator used as a Q-value function may tend to wrongly extrapolate the values of state-action pairs underrepresented in the active agent’s behavior distribution. This tendency can be inherent and persistent, in the sense of being independent of initialization and not being reduced with increased training on the same data distribution.

These proposed contributing factors are not at all mutually exclusive; they may interact in causing or exacerbating the tandem effect. Insufficient coverage of sub-optimal actions under the active agent’s behavior distribution (D) may lead to insufficient constraint on the respective values, which allows for effects of erroneous extrapolation by a function approximator (F). Where this results in over-estimation, the use of bootstrapping (B) carries the potential to ‘pollute’ even well-covered state-action pairs by propagating over-estimated values (especially via the max operator in the case of Q-learning). In the next sections we empirically study these three factors in isolation, to establish their actual roles and relative contributions to the overall difficulty of passive learning.

3.1 The Role of Bootstrapping

One distinguishing feature of reinforcement learning as opposed to supervised learning is its frequent use of learned quantities as preliminary optimization targets, most prominently in what is referred to as bootstrapping in the widely used TD algorithms (Sutton, 1988), where preliminary estimates of the value function are used as update targets. In the Double-DQN algorithm these updates regress Q(s, a) towards the target r + γ Q̄(s′, argmax_{a′} Q(s′, a′)), where Q denotes the parametric Q-value function, and Q̄ is the target network Q-value function, i.e. a time-delayed copy of Q.
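The Double-DQN bootstrap target described here, in which the online network selects the next action and the time-delayed target network evaluates it, can be sketched in a few lines of numpy (shapes and values are illustrative):

```python
import numpy as np

def double_dqn_targets(r, gamma, q_online_next, q_target_next, done):
    """Double-DQN targets for a batch: the online network picks the
    argmax action at s', the target network provides its value."""
    a_star = q_online_next.argmax(axis=1)                       # action selection
    bootstrap = q_target_next[np.arange(len(a_star)), a_star]   # action evaluation
    return r + gamma * (1.0 - done) * bootstrap

r = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])                   # second transition is terminal
q_online = np.array([[0.5, 2.0], [1.0, 0.0]])
q_target = np.array([[0.4, 1.5], [0.9, 0.1]])
t = double_dqn_targets(r, 0.99, q_online, q_target, done)
assert np.allclose(t, [1.0 + 0.99 * 1.5, 0.0])
```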

Figure 2: Active vs. passive performance when using the active agent’s target policy and/or value function for constructing passive bootstrapping targets.

Four value functions are involved in the active and passive updates of Tandem DQN: Q_a, Q̄_a and Q_p, Q̄_p, where the subscripts a and p refer to the Q-value functions of the active and passive agents, respectively, and Q̄ denotes the corresponding target network. The use of its own target network by the passive agent makes bootstrapping a plausible strong contributor to the tandem effect. To test this, we replace the target values and/or the target policy in the passive agent’s update with those provided by the active agent’s value functions, i.e. substituting Q̄_a for Q̄_p (target values) and/or the argmax of Q_a for that of Q_p (target policy).
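The substitutions can be sketched as variants of one generic target computation (numpy; the arrays `oa`/`op` stand for active/passive online values at the next state and `qa`/`qp` for the corresponding target-network values — all names and numbers are illustrative):

```python
import numpy as np

def target(r, gamma, q_select, q_eval):
    """Generic Double-DQN-style target: `q_select` chooses the argmax
    action, `q_eval` supplies the bootstrap value for that action."""
    a = q_select.argmax(axis=1)
    return r + gamma * q_eval[np.arange(len(a)), a]

r, gamma = np.array([0.0]), 0.99
oa = np.array([[1.0, 2.0]])   # active online values at s'
op = np.array([[2.0, 1.0]])   # passive online values at s'
qa = np.array([[0.5, 1.5]])   # active target-network values at s'
qp = np.array([[1.5, 0.5]])   # passive target-network values at s'

vanilla       = target(r, gamma, op, qp)  # passive policy, passive values
active_values = target(r, gamma, op, qa)  # passive policy, active values
active_policy = target(r, gamma, oa, qp)  # active policy, passive values
fully_active  = target(r, gamma, oa, qa)  # identical to the active target
```

In the `fully_active` variant both networks regress towards the same target sequence, which is the condition discussed next.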

As shown in Fig. 2, the use of the active value functions as targets reduces the active-passive gap by only a small amount. Note that when both active target values and policy are used, both networks receive an identical sequence of targets for their update computations, a sequence that suffices for the active agent to learn a successful policy. Strikingly, despite this the tandem effect appears largely preserved: in all but the easiest games the passive agent fails to learn effectively. (Pong is an outlier in that it only has 3 actions, and in a large fraction of states actions have no (irreversible) consequences, making greedy policies somewhat robust to errors in the underlying value function.)

To more precisely understand the effect of bootstrapping with respect to a potential value over-estimation by the passive agent, in Appendix Fig. 16 we also show the values of the passive networks in the above experiment compared to those of the respective active networks. As hypothesized, we observe that the vanilla tandem setting leads to significant value over-estimation, and that indeed bootstrapping plays a substantial role in amplifying the effect: passive networks trained using the active network’s bootstrap targets do not over-estimate compared to the active network at all.

These findings indicate that a simple notion of value over-estimation itself is not the fundamental cause of the tandem effect, and that (B) plays an amplifying, rather than causal role. Additional evidence for this is provided below, where the tandem effect occurs in a purely supervised setting without bootstrapping.

3.2 The Role of the Data Distribution

The critical role of the data distribution for offline learning is well established (Fujimoto et al., 2019b; Jacq et al., 2019; Liu et al., 2020; Wang et al., 2021). In particular, Wang et al. (2020a) showed that simpler notions of state-space coverage may not suffice for efficient offline learning with function approximation (even in the linear case and under a strong realizability assumption); much stronger assumptions on the data distribution, typically not satisfied in practical scenarios, may actually be required. Here we extend past analysis empirically, by investigating how properties of the data distribution (e.g. stochasticity, stationarity, the size and diversity of the dataset, and its proximity to the passive agent’s own behavior distribution) affect its suitability for passive learning.

The exploration parameter ε

A simple way to affect the data distribution’s state-action coverage (albeit in a blunt and uniform way) is by varying the exploration parameter ε of the active agent’s ε-greedy behavior policy (for training, not for evaluation). Note that a higher ε affects the active agent’s own training performance, as its ability to navigate environments requiring precise control is reduced. In Fig. 3 (top) we therefore report the relative passive performance (i.e. as a fraction of the active agent’s performance, which itself also varies across ε values), with absolute performance plots included in the Appendix for completeness (Fig. 17). We observe that the relative passive performance is indeed substantially improved when the active behavior policy’s stochasticity (and as a consequence its coverage of non-argmax actions along trajectories) is increased, and conversely it is reduced with a greedier behavior policy, providing evidence for the role of (D).
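For reference, the ε-greedy action selection whose parameter is varied here is standard; a minimal sketch:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, otherwise
    the greedy (argmax) action. Higher epsilon improves coverage of
    non-argmax actions in the generated behavior data."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

assert epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0) == 1  # fully greedy
```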

Figure 3: Passive as fraction of active performance for varying active ε-greedy behavior policies (top); regular Atari vs sticky-actions Atari (bottom). We report relative passive performance, as active performance varies across configurations. See Appendix Figs. 17 & 18 for absolute performance.

Sticky actions

An alternative potential source of stochasticity is the environment itself, e.g. the use of ‘sticky actions’ in Atari (Machado et al., 2018): with fixed probability, an agent action is ignored (and the previous action repeated instead). This type of environment-side stochasticity should not be expected to cause new actions to appear in the behavior data, and indeed Fig. 3 (bottom) shows no substantial impact on the tandem effect.
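The sticky-actions mechanism can be sketched as a thin filter on the agent's chosen actions (an illustrative simplification of the ALE implementation described by Machado et al., 2018):

```python
import random

class StickyActions:
    """With probability p, ignore the agent's chosen action and repeat
    the previous one. Note this injects environment-side stochasticity
    without introducing new actions into the behavior data."""
    def __init__(self, p=0.25, rng=None):
        self.p = p
        self.rng = rng or random.Random()
        self.prev = None
    def filter(self, action):
        if self.prev is not None and self.rng.random() < self.p:
            action = self.prev      # repeat the previous action
        self.prev = action
        return action

sticky = StickyActions(p=1.0, rng=random.Random(0))
assert sticky.filter(3) == 3        # no previous action yet
assert sticky.filter(1) == 3        # p=1.0: always repeats
```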

Replay size

Our results contrast with the strong offline RL results in (Agarwal et al., 2020). We hypothesize that the difference is due to the vastly different dataset size (full training of 200M transitions vs. replay buffer of 1M). Interpolating between the tandem setting and the offline RL setting of (Agarwal et al., 2020), we increase the replay buffer size, thereby giving the passive agent access to somewhat larger data diversity and state-action coverage (this does not affect the active agent’s training, as the active agent is constrained to sample only from the most recent 1M replay samples, as in the baseline variant). A larger replay buffer somewhat mitigates the passive agent’s under-performance (Fig. 4), though it appears to mostly slow down rather than prevent the passive agent from eventually under-performing its active counterpart substantially. As we suspect that a sufficient replay buffer size may depend on the effective state-space size of an environment, we also perform analogous experiments on the (much smaller) classic control domains; results (Appendix Fig. 22) remain qualitatively the same.
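This asymmetric-replay setup can be sketched as a single FIFO buffer from which the active agent samples only a recent window while the passive agent samples from the full (larger) capacity; buffer sizes are scaled down here for illustration, and the class is a hypothetical sketch rather than the paper's implementation:

```python
from collections import deque
import random

class Replay:
    """FIFO replay buffer; sampling can optionally be restricted to the
    most recent `window` transitions (as for the active agent)."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)
    def add(self, transition):
        self.buf.append(transition)
    def sample(self, batch_size, rng, window=None):
        pool = list(self.buf)
        if window is not None:
            pool = pool[-window:]   # active agent: most recent data only
        return rng.sample(pool, batch_size)

rng = random.Random(0)
replay = Replay(capacity=1000)      # passive agent sees the larger buffer
for t in range(1000):
    replay.add(t)
recent = replay.sample(8, rng, window=100)
assert all(x >= 900 for x in recent)  # active agent's sampling window
```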

Figure 4: Active vs. passive performance for different replay sizes (for passive agent).

Note that for a fixed dataset size, sample diversity can take different forms. Many samples from a single policy may provide better coverage of states on, or near, policy-typical trajectories. Meanwhile, a larger collection of policies, with fewer samples per policy, provides better coverage of many trajectories at the expense of lesser coverage of small deviations from each. To disentangle the impact of these modalities, while also shedding light on the role of stationarity of the distribution, we next switch to the ‘Forked Tandem’ variation of the experimental paradigm.

Figure 5: Performance of a Forked Tandem DQN, training passively after freezing its data generating policy (top) or its replay buffer (bottom). Vertical lines indicate forking time points.

Fixed policy

Upon forking, the frozen active policy is executed to produce training data for the passive agent, which begins its training initialized with the active network’s weights. Note that this constitutes a stronger variant of the tandem experiment. At the time of forking, the agents do not merely share analogous architectures and equal ‘data history’, but also identical network weights (whereas in the simple tandem setting, the agents were distinguished by independently initialized networks). Moreover, the data used for passive training can be thought of as a ‘best-case scenario’: generated by a single fixed policy, identical to the passive policy at the beginning of passive training. Strikingly, the tandem effect is not only preserved but even exacerbated in this setting (Fig. 5, top): after forking, passive performance decays rapidly in all but the easiest games, despite starting from a near-optimally performing initialization. This re-frames the tandem effect as not merely the difficulty of passively learning to act, but even to passively maintain performance. Instability appears to be inherent in the very learning process itself, providing strong support to the hypothesis that an interplay between (D) and (F) is critical to the tandem effect.

In Appendix Fig. 23 we additionally show that similarly to the regular tandem setting, stochasticity of the active policy after forking influences the passive agent’s ability to maintain performance.

Fixed replay

A variation on the above experiments is to freeze the replay buffer while continuing to train the passive policy from this fixed dataset. Instead of a stream of samples from a single policy, this fixed data distribution now contains a fixed number of samples from a training process of the length of the replay buffer, i.e. from a number of different policies. The collapse of passive performance here (Fig. 5, bottom) is less rapid, yet qualitatively similar. In Appendix Fig. 24 we present yet another variant of this experiment with similar results, showing that the effect is robust to minor variations in the exact way of fixing the data distribution of a learning agent.

These experiments provide strong evidence for the importance of (D): a larger replay buffer, containing samples from more diverse policies, can be expected to provide an improved coverage of (currently) non-greedy actions, reducing the tandem effect. While the forked tandem begins passive learning with the seemingly advantageous high-performing initialization, state-action coverage is critically limited in this case. In the frozen-policy case, a large number of samples from the very same ε-greedy policy can be expected to provide very little coverage of non-greedy actions, while in the frozen-replay case, a smaller number of samples from multiple policies can be expected to only do somewhat better in this regard. In both cases the tandem effect is highly pronounced.

On-policy evaluation

The strength of the last two experiments lies in the observation that, since active and passive networks have identical parameter values at the beginning of passive training, their divergence cannot be attributed to small initial differences getting amplified by training on an inadequate data distribution. With so many factors held fixed, the collapse of passive performance when trained on the very data distribution produced by its own initial policy raises the question of whether off-policy Q-learning itself is to blame for this failure mode, e.g. via the statistical over-estimation bias introduced by the max operator (van Hasselt, 2010). Here we provide a negative answer, by performing on-policy evaluation with SARSA (Rummery and Niranjan, 1994) (Fig. 6), and even purely supervised regression on the Monte-Carlo returns (Appendix Fig. 25), in the forked tandem setup. While evaluation succeeds, in the sense of minimizing evaluation error on the given behavior distribution, atypical action values under the behavior policy suffer substantial estimation error, resulting in occasional over-estimation. The resulting ε-greedy control policy under-performs the initial policy at forking time as catastrophically as in the other forked tandem experiments (more details in Appendix A.3). Strengthening the roles of (D) and (F) while further weakening that of (B), these observations point to an inherent instability of offline learning, different from that of Baird’s famous example (Baird, 1995) or the ‘Deadly Triad’ (Sutton and Barto, 2018; van Hasselt et al., 2018); an instability that results purely from erroneous extrapolation by the function approximator, when the utilized data distribution does not provide adequate coverage of relevant state-action pairs.
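The contrast between the on-policy SARSA evaluation target used here and the Q-learning control target used elsewhere in the paper is easy to state in code (numpy sketch; values illustrative):

```python
import numpy as np

def q_learning_target(r, gamma, q_next):
    """Off-policy control: bootstrap from the max next-state action value."""
    return r + gamma * q_next.max(axis=1)

def sarsa_target(r, gamma, q_next, a_next):
    """On-policy evaluation: bootstrap from the value of the action the
    behavior policy actually took at the next state."""
    return r + gamma * q_next[np.arange(len(a_next)), a_next]

r = np.array([0.0])
q_next = np.array([[1.0, 3.0]])
# Q-learning bootstraps from the max (3.0); SARSA from the taken action (1.0).
assert abs(q_learning_target(r, 0.9, q_next)[0] - 2.7) < 1e-9
assert abs(sarsa_target(r, 0.9, q_next, np.array([0]))[0] - 0.9) < 1e-9
```

The finding that even the max-free SARSA target (and pure Monte-Carlo regression) exhibits the failure is what rules out the max operator's statistical bias as the root cause.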

Figure 6: Passive performance in Forked Tandem DQN after policy evaluation with SARSA.
Figure 7: Passive performance for different amounts of self-generated data in the passive agent’s replay batches.

Self-generated data

Our final empirical question in this section is ‘How much data generated by the passive agent is needed to correct for the tandem effect?’. While a full investigation of this question exceeds the scope of this paper and is left for future work, the tandem setup lends itself to a simple experiment: both agents interact with the environment and fill individual replay buffers, one of them (for simplicity still referred to as ‘passive’) however learns from data stochastically mixed from both replay buffers. Fig. 7 shows that even a moderate amount (10%-20%) of ‘own’ data yields a substantial reduction of the tandem effect, while a 50/50 mixture completely eliminates it.
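The mixing experiment amounts to constructing each replay batch by drawing every sample from the passive agent's own buffer with some probability; a hypothetical sketch (function and variable names are illustrative):

```python
import random

def mixed_batch(own_data, active_data, batch_size, own_fraction, rng):
    """Stochastically mix replay data: each element of the batch comes
    from the passive agent's own buffer with probability `own_fraction`,
    otherwise from the active agent's buffer."""
    return [
        rng.choice(own_data if rng.random() < own_fraction else active_data)
        for _ in range(batch_size)
    ]

rng = random.Random(0)
own_data = ["own"] * 10
active_data = ["active"] * 10
batch = mixed_batch(own_data, active_data, 1000, own_fraction=0.2, rng=rng)
share = batch.count("own") / len(batch)
assert 0.15 < share < 0.25   # roughly 20% self-generated data, as in Fig. 7
```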

3.3 The Role of Function Approximation

We structure our investigation of the role of function approximation in the tandem effect into two categories: the optimization process and the function class used.


Optimization

Agarwal et al. (2020) and Obando-Ceron and Castro (2021) demonstrated that the Adam optimization algorithm (Kingma and Ba, 2015) outperforms the RMSProp optimizer (Tieleman and Hinton, 2012) used in our experiments. In Appendix Fig. 20 we show that while both active and passive agents perform better with Adam, the tandem effect itself is unaffected by the choice of optimizer.

Another plausible hypothesis is that the passive network suffers from under-fitting and requires more updates on the same data to attain comparable performance to the active learner. Varying the number of passive agent updates per active agent update step, we find that more updates worsen the performance of the passive agent (Appendix Fig. 21). This rejects insufficient training as a possible cause, and further supports the role of (D). We also note that, together with the forked tandem experiments in the previous section, this finding distinguishes the tandem effect from the issue of estimation error in the offline learning setting (Xiao et al., 2021): while in the tabular setting estimation error dominates the learning challenge and a sufficient training duration (assuming full state-space coverage) guarantees convergence to a good solution, this is not necessarily the case with function approximation trained on a given data distribution.

Function class

Given that the active and passive agents share an identical network architecture, the passive agent’s under-performance cannot be explained by an insufficiently expressive function approximator. Performing the tandem experiment with pure regression of the passive network’s outputs towards the active network’s (a variant of network distillation (Hinton et al., 2015)), instead of TD based training, we observe that the performance gap is indeed vastly reduced and in some games closed entirely (see Appendix Fig. 19); however, strikingly, it remains in some.

Figure 8: Passive performance as a fraction of active performance in CartPole: varying number of hidden layers and units.

Next, we vary the function class of both networks by varying the depth and width of the utilized Q-networks on a set of Classic Control tasks. As can be seen in Fig. 8 (and Appendix Fig. 28), the magnitude of the active-passive performance gap appears negatively correlated with network width, which is in line with (F): an increase in network capacity results in less pressure towards over-generalizing to infrequently seen action values and an ultimately smaller tandem effect. On the other hand, the gap seems to correlate positively with network depth. We speculate that this may relate to the finding that deeper networks tend to be biased towards simpler (e.g. lower rank) solutions, which may suffer from increased over-generalization (Huh et al., 2021; Kumar et al., 2020a).

Figure 9: Active vs. passive performance, with the first k of 5 layers of active/passive networks shared.

Finally, we investigate varying the function class of only the passive network, by sharing the weights of a varying number of bottom layers (out of 5 in total) between active and passive networks, constraining the passive network to update only the remaining top layers, and thus using a ‘representation’ acquired through active learning alone. This reduces the ‘degrees of freedom’ of the passive agent, which we hypothesize reduces its potential for divergence. Indeed, Fig. 9 illustrates that passive performance correlates strongly with the number of tied layers: the variant in which only the linear output layer is trained passively performs on par with the active agent. A similar result is obtained in the forked tandem setting (see Appendix Fig. 27). This finding provides a strong indirect hint towards (F): with only part of the network’s layers trained passively, much of its ‘generalization capacity’ is shared between active and passive agents. States that are not aggregated by the shared bottom layers (affected only by active training) would have to be ‘erroneously’ aggregated by the remaining top layers for over-generalization to occur. A more thorough investigation, exceeding the scope of this paper, might measure (over-)generalization more directly, e.g. via gradient interference (Bengio et al., 2020).
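The layer-sharing setup can be sketched as a toy illustration (shapes, weights, and the squared-error objective are our own assumptions): only the passive top layer receives gradient updates, while the shared bottom layer stays fixed from the passive agent's point of view.

```python
import numpy as np

# Toy deterministic initialization for illustration: the shared
# 'representation' layer W1 is only ever trained by the active agent,
# so the passive update touches W2_passive alone.
W1 = np.ones((4, 8)) * 0.1            # shared bottom layer (frozen here)
W2_passive = np.ones((8, 3)) * 0.5    # passive agent's own top layer

def features(s):
    """ReLU features from the shared (actively trained) bottom layer."""
    return np.maximum(s @ W1, 0.0)

def passive_update(s, q_target, lr=0.1):
    """Squared-error gradient step on the passive top layer only."""
    global W2_passive
    phi = features(s)                  # treated as fixed: no gradient into W1
    q = phi @ W2_passive
    W2_passive -= lr * np.outer(phi, q - q_target)
```

The extreme case sketched here, a single trainable linear head on top of frozen features, corresponds to the variant that the paper finds to perform on par with the active agent.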

4 Applications of the Tandem Setting

In addition to being valuable for studying the challenges in offline RL, we propose that the Tandem RL setting provides analytic capabilities that make it a useful tool in the empirical analysis of general (online) reinforcement learning algorithms. At its core, the tandem setting aims to decouple the learning dynamics from their impact on behavior and the data distribution, which are inseparably intertwined in the online setting. While classic offline RL achieves a similar effect, as an analytic tool it has the potential downside of typically using a stationary data distribution. Tandem RL, on the other hand, presents the passive agent with a data distribution that realistically represents the type of non-stationarity encountered in an online learning process, while still holding that distribution independent of the learning dynamics being studied. This allows Tandem RL to be used to study, e.g., the impact of variations in the learning algorithm on the quality of a learned representation, without having to control for the confounding effect of a different behavior inducing a different data distribution.

While extensive examples of this exceed the scope of this paper, Appendix A.5 contains one such experiment, testing QR-DQN (Dabney et al., 2018) as a passive learning algorithm (the active agent being a Double-DQN). This is motivated by the observation of Agarwal et al. (2020) that QR-DQN outperforms DQN in the offline setting. QR-DQN indeed behaves as a nontrivially different passive learning algorithm, performing significantly better in some games while, curiously, worse in others (Fig. 29).

5 Discussion and Conclusion

At a high level, our work can be viewed as investigating the issue of (in)compatibility between the data distribution used to train a function approximator and the data distribution relevant in its evaluation. While in supervised learning, generalization can be viewed as the problem of transfer from a training to a (given) test distribution, the fundamental challenge for control in reinforcement learning is that the test distribution is created by the very outcome of learning itself, the learned policy. The various difficulties of learning to act from offline data alone throw into focus the role of interactivity in the learning process: only by continuously interacting with the environment does an agent gradually ‘unroll’ the very data on which its performance will be evaluated.

This need not be an obstacle in the case of exact (i.e. tabular) functions: with sufficient data, extrapolation error can be avoided entirely. In the case of function approximation however, as small errors compound rapidly into a difference in the underlying state distribution, significant divergence and, as this and past work demonstrates, ultimately catastrophic under-performance can occur. Function approximation plays a two-fold role here: (1) being an approximation, it allows deviations in the outputs; (2) as the learned quantity, it is (especially in the non-linear case) highly sensitive to variations in the input distribution. When evaluated for control after offline training, these two roles combine in a way that is ‘unexplored’ by the training process: minor output errors cause a drift in behavior, and thereby a drift in the test distribution.

While related, this challenge is subtly different from the well-known divergence issues of off-policy learning with function approximation, demonstrated by Baird’s famous counterexample (Baird, 1995) (see also (Tsitsiklis and Van Roy, 1996)) and conceptualized as the Deadly Triad (Sutton and Barto, 2018; van Hasselt et al., 2018). While these depend on bootstrapping as a mechanism to cause a feedback-loop resulting in value divergence, our results show that the offline learning challenge persists even without bootstrapping, as small differences in behavior cause a drift in the ‘test distribution’ itself. Instead of a training-time output drift caused by bootstrapping, the central role is taken by a test-time drift of the state distribution caused by the interplay of function approximation and a fixed data distribution (as opposed to dynamically self-generated data).

Our empirical work highlights the importance of interactivity and ‘learning from your own mistakes’ in learning control. Starting out as an investigation of the challenges in offline reinforcement learning, it also provides a particular viewpoint on the classical online reinforcement learning case. Heuristic explanations for highly successful deep RL algorithms like DQN, based on intuitions from (e.g.) approximate policy iteration, need to be viewed with caution in light of the apparent hardness of a policy improvement step based on approximate policy evaluation with a function approximator.

Finally, the forked tandem experiments show that even high-performing initializations are not robust to a collapse of control performance, when trained under their own (but fixed!) behavior distribution. Not just learning to act, but even maintaining performance appears hard in this setting. This provides an intuition that we distill into the following working conjecture: The dynamics of deep reinforcement learning for control are unstable on (almost) any fixed data distribution.

Expanding on the classical on- vs. off-policy dichotomy, we propose that indefinitely training on any fixed data distribution, without strong explicit regularization or additional inductive bias, gives rise to ‘exploitation of gaps in the data’ by a function approximator, akin to the over-fitting occurring when over-training on a fixed dataset in supervised learning. Interaction, i.e. generating at least moderate amounts of one’s own experience, appears to be a powerful, and for the most part necessary, regularizer and stabilizer for learning to act, by creating a dynamic equilibrium between optimization of a function approximator and its own data-generation process.

Broader impact statement

This work lies in the realm of foundational RL, contributing to the fundamental understanding and development of RL algorithms, and as such is far removed from ethical issues and direct societal consequences. On the other hand, it highlights the empirical difficulty and limitations of offline deep RL for control, which are increasingly important for practical applications (e.g. robotics) where interactive data is costly and learning from offline datasets is desirable. In this way it complements existing theoretical hardness results in this area and provides additional context to existing empirical techniques which aim to overcome or circumvent those limitations. We believe that an improved understanding of these challenges can play an important role in creating robust and stable offline learning algorithms whose outputs can be more safely deployed in the real world.


We would like to thank Hado van Hasselt and Joshua Greaves for feedback on an early draft of this paper, and Zhongwen Xu for an unpublished related piece of work at DeepMind that inspired some of our experiments. We also thank Clare Lyle, David Abel, Diana Borsa, Doina Precup, John Quan, Marc G. Bellemare, Mark Rowland, Michal Valko, Remi Munos, Rishabh Agarwal, Tom Schaul and Yaroslav Ganin, and many other colleagues at DeepMind and Google Brain for the numerous discussions that helped shape this research.


  • J. Achiam, E. Knight, and P. Abbeel (2019) Towards characterizing divergence in deep Q-learning. arXiv preprint arXiv:1903.08894. Cited by: §1, §1.
  • R. Agarwal, D. Schuurmans, and M. Norouzi (2020) An optimistic perspective on offline reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, Vol. 119, pp. 104–114. Cited by: §A.5, §1, §3.2, §3.3, §4.
  • L. Baird (1995) Residual algorithms: reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pp. 30–37. Cited by: §3.2, §5.
  • M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling (2013) The Arcade Learning Environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279. Cited by: §A.1, §2.1.
  • E. Bengio, J. Pineau, and D. Precup (2020) Interference and generalization in temporal difference learning. In International Conference on Machine Learning, pp. 767–777. Cited by: §3.3, §3.
  • J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang (2018) JAX: composable transformations of Python+NumPy programs. External Links: Link Cited by: §A.1.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540. Cited by: §A.1, §A.1, §2.2.
  • J. Buckman, C. Gelada, and M. G. Bellemare (2021) The importance of pessimism in fixed-dataset policy optimization. In International Conference on Learning Representations, Cited by: §1, §1.
  • D. Budden, M. Hessel, J. Quan, S. Kapturowski, K. Baumli, S. Bhupatiraju, A. Guy, and M. King (2020) RLax: Reinforcement Learning in JAX. External Links: Link Cited by: §A.1.
  • P. S. Castro, S. Moitra, C. Gelada, S. Kumar, and M. G. Bellemare (2018) Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110. Cited by: §A.1, footnote 3.
  • W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos (2018) Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §A.5, §4.
  • R. Fakoor, J. Mueller, P. Chaudhari, and A. J. Smola (2021) Continuous doubly constrained batch reinforcement learning. arXiv preprint arXiv:2102.09225. Cited by: §1.
  • S. Fujimoto, E. Conti, M. Ghavamzadeh, and J. Pineau (2019a) Benchmarking batch deep reinforcement learning algorithms. arXiv preprint arXiv:1910.01708. Cited by: §1.
  • S. Fujimoto, D. Meger, and D. Precup (2019b) Off-policy deep reinforcement learning without exploration. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, pp. 2052–2062. Cited by: §1, §1, §2.2, §2, §3.2, §3.
  • R. Held and A. Hein (1963) Movement-produced stimulation in the development of visually guided behavior. Journal of Comparative and Physiological Psychology 56 (5), pp. 872–876. Cited by: Figure 1, §1.
  • T. Hennigan, T. Cai, T. Norman, and I. Babuschkin (2020) Haiku: Sonnet for JAX. External Links: Link Cited by: §A.1.
  • M. Hessel, D. Budden, F. Viola, M. Rosca, E. Sezener, and T. Hennigan (2020) Optax: composable gradient transformation and optimisation, in JAX!. External Links: Link Cited by: §A.1.
  • M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2018) Rainbow: combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: footnote 2.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. NIPS Deep Learning and Representation Learning Workshop. Cited by: §3.3.
  • M. Huh, H. Mobahi, R. Zhang, B. Cheung, P. Agrawal, and P. Isola (2021) The low-rank simplicity bias in deep networks. arXiv preprint arXiv:2103.10427. Cited by: §3.3.
  • A. Jacq, M. Geist, A. Paiva, and O. Pietquin (2019) Learning from a learner. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, pp. 2990–2999. Cited by: §3.2.
  • S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney (2019) Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, Cited by: Figure 14, §A.1, §2.2.
  • R. Kidambi, A. Rajeswaran, P. Netrapalli, and T. Joachims (2020) MOReL: model-based offline reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 33, pp. 21810–21823. Cited by: §1.
  • D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. Proceedings of the International Conference on Learning Representations. Cited by: §3.3.
  • A. Kumar, R. Agarwal, D. Ghosh, and S. Levine (2020a) Implicit under-parameterization inhibits data-efficient deep reinforcement learning. arXiv preprint arXiv:2010.14498. Cited by: §3.3.
  • A. Kumar, J. Fu, M. Soh, G. Tucker, and S. Levine (2019) Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: §1.
  • A. Kumar, A. Gupta, and S. Levine (2020b) DisCor: corrective feedback in reinforcement learning via distribution correction. In Advances in Neural Information Processing Systems, Vol. 33, pp. 18560–18572. Cited by: §1.
  • A. Kumar, A. Zhou, G. Tucker, and S. Levine (2020c) Conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 33, pp. 1179–1191. Cited by: §1.
  • S. Levine, A. Kumar, G. Tucker, and J. Fu (2020) Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643. Cited by: §1, §3, §3.
  • Y. Liu, A. Swaminathan, A. Agarwal, and E. Brunskill (2020) Provably good batch reinforcement learning without great exploration. arXiv preprint arXiv:2007.08202. Cited by: §1, §3.2, §3.
  • M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling (2018) Revisiting the Arcade Learning Environment: evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research 61, pp. 523–562. Cited by: Figure 18, §A.1, §3.2.
  • T. Matsushima, H. Furuta, Y. Matsuo, O. Nachum, and S. Gu (2020) Deployment-efficient reinforcement learning via model-based offline optimization. arXiv preprint arXiv:2006.03647. Cited by: §1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §A.1, §A.3, §2.1, footnote 2.
  • A. Nair, M. Dalal, A. Gupta, and S. Levine (2020) Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359. Cited by: §1, §1.
  • J. S. Obando-Ceron and P. S. Castro (2021) Revisiting Rainbow: promoting more insightful and inclusive deep reinforcement learning research. In Proceedings of the 38th International Conference on Machine Learning, Cited by: 2nd item, §3.3.
  • J. Quan and G. Ostrovski (2020) DQN Zoo: reference implementations of DQN-based agents. External Links: Link Cited by: §A.1, footnote 3.
  • G. A. Rummery and M. Niranjan (1994) On-line Q-learning using connectionist systems. Technical report University of Cambridge, Department of Engineering Cambridge, UK. Cited by: §A.3, §3.2.
  • J. Schrittwieser, T. Hubert, A. Mandhane, M. Barekatain, I. Antonoglou, and D. Silver (2021) Online and offline reinforcement learning by planning with a learned model. arXiv preprint arXiv:2104.06294. Cited by: §1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §A.3, §3.2, §5.
  • R. S. Sutton (1988) Learning to predict by the methods of temporal differences. Machine learning 3 (1), pp. 9–44. Cited by: §3.1.
  • P. Thomas, G. Theocharous, and M. Ghavamzadeh (2015) High confidence policy improvement. In International Conference on Machine Learning, pp. 2380–2388. Cited by: §1.
  • T. Tieleman and G. Hinton (2012) Lecture 6.5: RMSProp. COURSERA: Neural Networks for Machine Learning. Cited by: §3.3.
  • J. N. Tsitsiklis and B. Van Roy (1996) Analysis of temporal-difference learning with function approximation. In Proceedings of the 9th International Conference on Neural Information Processing Systems, Cambridge, MA, USA, pp. 1075–1081. Cited by: §5.
  • H. van Hasselt, Y. Doron, F. Strub, M. Hessel, N. Sonnerat, and J. Modayil (2018) Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648. Cited by: §A.3, §3.2, §5.
  • H. van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: 1st item, §A.1, §2.1, §2.1.
  • H. van Hasselt (2010) Double Q-learning. In Advances in Neural Information Processing Systems, Vol. 23. Cited by: §A.3, §3.2.
  • R. Wang, D. P. Foster, and S. M. Kakade (2020a) What are the statistical limits of offline RL with linear function approximation?. arXiv preprint arXiv:2010.11895. Cited by: §3.2.
  • R. Wang, Y. Wu, R. Salakhutdinov, and S. M. Kakade (2021) Instabilities of offline RL with pre-trained neural representation. arXiv preprint arXiv:2103.04947. Cited by: §3.2.
  • Z. Wang, A. Novikov, K. Żołna, J. T. Springenberg, S. Reed, B. Shahriari, N. Siegel, J. Merel, C. Gulcehre, N. Heess, et al. (2020b) Critic regularized regression. arXiv preprint arXiv:2006.15134. Cited by: §1.
  • Y. Wu, G. Tucker, and O. Nachum (2019) Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361. Cited by: §1.
  • Y. Wu, S. Zhou, N. Srivastava, J. Susskind, J. Zhang, R. Salakhutdinov, and H. Goh (2021) Uncertainty weighted actor-critic for offline reinforcement learning. In Proceedings of the 38th International Conference on Machine Learning, Cited by: §1.
  • C. Xiao, I. Lee, B. Dai, D. Schuurmans, and C. Szepesvari (2021) On the sample complexity of batch reinforcement learning with policy-induced data. arXiv preprint arXiv:2106.09973. Cited by: §3.3, §3.
  • K. Young and T. Tian (2019) MinAtar: an Atari-inspired testbed for thorough and reproducible reinforcement learning experiments. arXiv preprint arXiv:1903.03176. Cited by: §A.1, §2.2.
  • T. Yu, G. Thomas, L. Yu, S. Ermon, J. Zou, S. Levine, C. Finn, and T. Ma (2020) MOPO: model-based offline policy optimization. arXiv preprint arXiv:2005.13239. Cited by: §1.

Appendix A Appendix

Figure 10: Tandem DQN: active vs. passive performance across Atari 57 (3 seeds per game).
Figure 11: Tandem DQN: Median human-normalized scores over 57 Atari games (3 seeds per game).
Figure 12: Tandem DQN: Active vs. passive performance on four selected Classic Control domains.
Figure 13: Fraction of states (uniformly sampled from replay buffer) on which active and passive policies disagree, i.e. where the greedy actions of the two networks differ; mean and median across 57 Atari games (3 seeds per game).
Figure 14: Tandem R2D2: active vs. passive performance on three Atari domains (3 seeds per game). Note: because of the use of an untuned implementation of R2D2, active results are not directly comparable to those of the published agent [Kapturowski et al., 2019].
Figure 15: Tandem DQN evaluated on five MinAtar domains.
Figure 16: Q-value over-estimation by the passive network compared to the active one in Tandem DQN with varying bootstrap targets. The passive network in the vanilla Tandem DQN setting tends to over-estimate values (compared to the active one), which is almost perfectly mitigated by using the same bootstrap target values as the active network, and in fact reversed when using the same target policy (but not the same target values). Note that in all four configurations the passive agents substantially under-perform their active counterparts (Fig. 2), showing that bootstrapping-amplified over-estimation is only a contributing factor, not the main cause, of the tandem effect.
Figure 17: Active vs. passive performance for varying active ε-greedy behavior policies. Note that here active performance varies across settings.
Figure 18: Active vs. passive performance on Atari with sticky actions [Machado et al., 2018].
Figure 19: Active vs. passive performance with regular (TD-based) and regression-based Tandem DQN. The latter regresses all the passive agent’s action-values towards the respective outputs of the active agent’s network, which can be viewed as network distillation.
Figure 20: Tandem DQN with the Adam optimizer (instead of RMSProp) for both active and passive network optimization. (top) Adam: Active vs. passive performance. (bottom) Passive as fraction of active performance Adam vs. RMSProp. While the Adam optimizer improves both the active and passive performance of the Tandem DQN, the relative active-passive gap is not affected strongly.
Figure 21: Active vs. passive performance for varying number of passive updates per active update.
Figure 22: Passive performance as a fraction of active performance when varying the size of the replay buffer used by the passive agent on Classic Control domains.

A.1 Implementation, Hyperparameters and Evaluation Details

The implementation of our main agent, Tandem DQN, is based on the Double-DQN [van Hasselt et al., 2016] agent provided in the DQN Zoo open-source agent collection [Quan and Ostrovski, 2020]. The code uses JAX [Bradbury et al., 2018], and the RLax, Haiku and Optax libraries [Budden et al., 2020, Hennigan et al., 2020, Hessel et al., 2020] for RL losses, neural networks and optimization algorithms, respectively. All algorithmic hyperparameters correspond to those in the DQN Zoo implementation of Double-DQN.

In the ‘Tandem’ setting, active and passive agents’ network weights are initialized independently. Agents in this setting are trained in lockstep, i.e. active and passive agents are updated simultaneously from the same batch of sampled replay transitions, with the exception of one experiment in Section 3.3, where we study the effect of the number of passive updates relative to active agent updates.
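The lockstep regime described above can be sketched as a minimal loop; the agent and environment interfaces below are illustrative assumptions, not the DQN Zoo API:

```python
import random

replay = []  # shared replay buffer, filled only by the active agent

def tandem_step(env_step, active, passive, batch_size=32):
    """One lockstep tandem step: the active agent acts and stores a
    transition; both agents then update on the very same sampled batch."""
    replay.append(env_step(active.act()))  # only the active agent generates data
    if len(replay) >= batch_size:
        batch = random.sample(replay, batch_size)
        active.update(batch)   # identical data stream...
        passive.update(batch)  # ...but independent network weights
```

The only asymmetry between the two agents in this sketch is who acts; everything else (data, update rule, architecture) is identical.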

In the ‘Forked Tandem’ setting, only one of the agents is trained at any one time. The active agent trains (as a regular Double-DQN) up to the time of forking, at which point the passive agent is created as a ‘fork’ (i.e., with identical network weights) of the active agent. After forking, only the passive agent is trained. The active agent is used for data generation, either by executing its policy and continuing to fill the replay buffer (‘Fixed Policy’ experiment), or by sampling batches from its ‘frozen’ last replay buffer obtained in the active phase of training (‘Fixed Replay’ experiment).

The total training budget is kept fixed at 200 iterations in both settings, split across ‘active’ and ‘passive’ training phases in the forked tandem setting. In all cases, both active and passive agents are evaluated after each training iteration for 500K environment steps. Executed with an NVidia P100 GPU accelerator, each Atari training run takes approximately 4.5 days of wall-clock time.

The majority of our Atari experiments use the regular ALE Atari environment [Bellemare et al., 2013], with DQN’s default preprocessing, random no-op starts and action repeats [Mnih et al., 2015], and the reduced action set (i.e. each game exposing only the subset of Atari’s 18 actions relevant to that game). For the ‘Sticky actions’ experiment, we use the OpenAI Gym variant of Atari [Brockman et al., 2016] enhanced with sticky actions [Machado et al., 2018].

Unless stated explicitly, all our results are reported as mean episode returns averaged across 5 seeds, with light and dark shading indicating standard deviation confidence bounds and min/max bounds (across seeds), respectively. Gray curves always indicate active performance.

The ‘relative passive performance’ (or ‘passive performance as fraction of active performance’) curves are meant to illustrate the relative (under-)performance of the passive agent compared to its active counterpart in cases where the active agent’s performance varies strongly across configurations. Denoting by a_t and p_t the active and passive (undiscounted) episodic returns at iteration t, the relative performance is computed as p_t / a_t, with the value clipped to lie in [0, 1] and set to 0 whenever a_t ≤ 0.
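As a hedged sketch, one plausible implementation of this relative-performance metric is the following; the exact treatment of non-positive active returns is our assumption:

```python
def relative_performance(passive_return, active_return):
    """Passive performance as a fraction of active performance,
    clipped to [0, 1]; defined as 0 when the active return is not positive."""
    if active_return <= 0.0:
        return 0.0
    return min(max(passive_return / active_return, 0.0), 1.0)
```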

For the classic control [Brockman et al., 2016] and MinAtar [Young and Tian, 2019] experiments we used a modified version of the DQN agent from the Dopamine library [Castro et al., 2018]. The modifications made were:

  • Double-DQN [van Hasselt et al., 2016] learning updates instead of vanilla DQN

  • MSE loss instead of Huber loss (as suggested in [Obando-Ceron and Castro, 2021])

  • Networks and wrappers for running MinAtar with the Dopamine agents

  • Tandem training regime (regular and/or forked) instead of regular single-agent training.

Unless stated explicitly, all hyperparameters follow the respective default configurations in the Dopamine library. Our network architecture for the classic control environments is two fully connected layers of 512 units each (with ReLU activations), followed by a final fully connected layer that yields the Q-values for each action. In Figs. 8 and 28 we varied the number of hidden layers and units, where the variation in the number of units is applied uniformly across all layers.

The default network architecture for the MinAtar environments is a single convolutional layer with 16 feature maps, followed by a ReLU activation, whose output is mapped via a fully connected layer to the network’s Q-value outputs.

The classic control environments were all run on CPUs; each run took between 20 minutes (CartPole) and 2 hours (MountainCar). The MinAtar environments were all run on NVidia P100 GPUs, each run taking approximately 12 hours to complete. Results for all classic control and MinAtar environments are reported as mean episode returns averaged across 10 seeds, with light and dark shading indicating standard deviation confidence bounds and min/max bounds (across seeds), respectively.

For the R2D2 experiment (Fig. 14), an untuned variant of the distributed R2D2 algorithm [Kapturowski et al., 2019] was used. Each run used 4 TPUv3 chips for learning and inference, together with a fleet of approximately 500 CPU-based actor threads for distributed environment interaction, completing a training run of approximately 150K batch updates in about 7 hours wall-clock time.

A.2 Forked Tandem: Variants

Here we present additional experimental variants performed within the Forked Tandem setup.

Varying the exploration parameter ε with a fixed policy:

This experiment is an extension of the ‘Fixed Policy’ (Fig. 5) and ‘The exploration parameter ε’ (Fig. 3 (top)) experiments. After freezing the active agent’s policy for further data generation, its ε parameter is set to a different value, to explore the impact of the resulting policy stochasticity on the ability of the passive learning process to maintain the initial performance level. We note that because the active policy is fixed, active training performance does not depend on the chosen configuration, and so absolute passive performance curves are more directly comparable.

Similar to the results in the regular tandem setup, we observe (in Fig. 17) that the ability of the passive agent to maintain the initial performance level is substantially aided by the stochasticity resulting from a higher value of ε, providing further support for the importance of (D).

Training process samples (‘Groundhog day’):

The forked tandem experiments in Section 3.2 indicate that the data distributions represented by a fixed replay buffer, or by a stream of data generated by a single fixed policy, both lack the diversity needed to prevent a catastrophic collapse of an (initially high-performing) agent trained passively on them. The naive expectation that the (unbounded) stream of data generated by a fixed policy should provide better state-action coverage than the fixed-size dataset of a single replay buffer (1M transitions) is invalidated by the observation that fixed-replay training leads to a somewhat slower degradation of passive performance. Unsurprisingly in hindsight, the diversity of samples stemming from the many different policies along a learning trajectory of 1M training steps appears to be significantly higher than that generated by a single ε-greedy policy.

To probe this further, we devise an experiment attempting to combine both: instead of freezing the active policy after forking, we continue training it (and filling the replay buffer), but after each iteration of 1M steps the active network is reset to its parameter values at forking time. Effectively this produces a data stream that can be viewed as repeatedly sampling the training process of a single iteration, a variant that we refer to as the ‘Groundhog day’ experiment. This setting combines the multiplicity of data-generating policies with the property of an unbounded dataset being presented to the passive agent. The results are shown in Fig. 24: indeed, the groundhog day setting improves passive performance over the fixed-policy setting, while not clearly changing the outcome in comparison to the fixed-replay setting.
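The 'Groundhog day' regime can be sketched as follows; the agent interface is an illustrative assumption. Within each iteration the active agent keeps learning and generating data, but it is rewound to its fork-time parameters at every iteration boundary:

```python
def groundhog_day(active, passive, num_iterations, steps_per_iter):
    """Repeatedly replay the first post-fork training iteration: the data
    stream samples many policies along one learning trajectory, yet the
    active agent never progresses past a single iteration of training."""
    fork_params = active.get_params()
    for _ in range(num_iterations):
        for _ in range(steps_per_iter):
            batch = active.act_and_store()  # fills the replay buffer
            active.update(batch)            # continues training...
            passive.update(batch)
        active.set_params(fork_params)      # ...then rewinds to the fork
```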

Overall we observe a general robustness of the tandem effect with respect to minor experimental variations in the forked tandem settings.

Figure 23: Forked Tandem DQN: After 50 iterations of regular active training, the active value function is frozen and used to continuously generate data for the passive agents’ training by executing an ε-greedy policy with a given value of ε.
Figure 24: Forked Tandem DQN: ‘groundhog day’ variation, active vs. passive performance. After forking, the active agent trains for a full iteration (1M environment steps), and is then reset to its initial network parameters at the time of forking, repeatedly for the remaining number of iterations.

A.3 Forked Tandem: Policy Evaluation

Figure 25: Forked Tandem DQN: Policy evaluation with Monte-Carlo return regression. (top) Active vs. passive control performance. (bottom) Average absolute Monte-Carlo error. The Monte-Carlo error is minimized effectively by the passive training, while the control performance of the resulting ε-greedy policy collapses completely.

Among the most striking findings in our work are the forked tandem results in Sections 3.2 and A.2, demonstrating a catastrophic collapse of performance when passively training from a fixed data distribution, even when the starting point for the passive policy is the very same high-performing policy generating the data. This raises the question of whether the process of Q-learning itself is to blame for this failure mode, e.g. via the well-known statistical over-estimation bias introduced by the max operator [van Hasselt, 2010]. To test this, we perform two variants of the forked tandem experiment, with SARSA [Rummery and Niranjan, 1994] and (purely supervised) Monte-Carlo return regression-based policy evaluation, instead of Q-learning, as the passive learning algorithm. (We note that while SARSA evaluation of an ε-greedy policy can still exhibit over-estimation bias, this is not the case for Monte-Carlo return regression.)

As can be seen in Figs. 6 and 25(top), even in this on-policy policy evaluation setting, the resulting control performance catastrophically collapses after a short period of training. We also observe that this is not due to a failure of the evaluation itself: Fig. 25(bottom) shows effective minimization of the Monte-Carlo error, indicating that the control failure is due to extrapolation error (and in particular, over-estimation) on state-action pairs that are infrequent under the active policy.

These findings provide strong support for the role of (D) and (F) while weakening that of (B): In contrast to the well-known ‘Deadly Triad’ phenomenon [Sutton and Barto, 2018, van Hasselt et al., 2018], the tandem effect occurs without the amplifying mechanism of bootstrapping or the statistical over-estimation caused by the max operator, solely due to erroneous extrapolation by the function approximator to state-action pairs which are under-represented in the given training data distribution.

Figure 26: Forked Tandem DQN: Policy evaluation with Monte-Carlo return regression, passive network initialized independently of the active network at forking time. Top: Active vs. passive control performance. Bottom: Average absolute Monte-Carlo error. While the Monte-Carlo error is minimized effectively by the passive training, this does not result in above-random control performance of the resulting ε-greedy policy.

We have so far presented the equal network weights of active and passive networks at forking time as a strength of the forked tandem setting, following the intuition that an initialization by the high-performing policy should be advantageous for maintaining performance in these experiments. A plausible counter-argument is that the representation (i.e., the network weights) learned by the active agent in service of control could be a poor, over-specialized starting point for policy evaluation. To verify that this is not a major factor, we also perform the above Monte-Carlo evaluation experiment with the passive network freshly re-initialized at the beginning of passive training. As shown in Fig. 26, while a fresh initialization of the passive network allows it to minimize the Monte-Carlo error just as effectively, its control performance here never exceeds random performance levels, further connecting the tandem effect to (D) and (F).

We remark that the demonstrated control performance failure of approximate policy evaluation casts a shadow over the concept of approximate policy iteration, or the application of this concept in heuristic explanations of the function of empirically successful algorithms like DQN [Mnih et al., 2015]. A successful greedy improvement step on an approximately evaluated policy appears implausible given the brittleness of approximate policy evaluation even in the nominally best-case scenario of an on-policy data distribution.

Another viewpoint emerging from these results is that the classic category of ‘on-policy data’ appears less relevant in this context: robust approximate evaluation targeting control seems to require a data distribution sufficiently overlapping with the (hypothetical, in practice unavailable) behavior distribution of the resulting policy, rather than that of the original evaluated policy.

a.4 Additional Experimental Results on the Role of Function Approximation

Here we present several additional experiments, complementing the results from the end of Section 3.3.

The first set of results, shown in Fig. 27, concerns the passive performance in the forked tandem setup, when the first (bottom) neural network layers are shared between active and passive agents. These bottom layers are only trained during the active training phase before forking, and are frozen after that. Similar to the corresponding experiment in the regular tandem setting (Fig. 9), we observe that the passive agent’s ability to maintain its initially high performance strongly correlates with the number of shared, i.e. not passively trained, layers. The difference between the passive agent training a ‘deep’ vs. a ‘linear’ network (the latter corresponding to all but one of the network layers being frozen) appears stark: the tandem effect is almost equally catastrophic in all configurations except for the linear one, where it is strongly reduced. While a more thorough investigation is left to future work, we remark that overall this finding appears to support the importance of (F), in that over-extrapolation can intuitively be expected to become more problematic when passive training is applied to a larger function class (a deeper part of the network).
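The ‘linear’ configuration can be made concrete with a small sketch: the passive agent trains only a linear head on top of the active agent's frozen representation. This is a toy illustration with assumed shapes and a single frozen layer, not the experimental architecture:

```python
import numpy as np

def frozen_features(shared_layers, x):
    """Forward pass through the bottom layers trained only by the active
    agent; they receive no gradient during the passive training phase."""
    h = x
    for W in shared_layers:
        h = np.maximum(0.0, h @ W)  # ReLU features from the active agent
    return h

def passive_linear_step(w, shared_layers, x, target):
    """The 'linear' configuration: gradients flow only into the linear
    head w, while the shared representation stays fixed."""
    phi = frozen_features(shared_layers, x)
    pred = phi @ w
    lr = 0.5 / (phi @ phi + 1e-8)   # normalized step size for stability
    return w - lr * (pred - target) * phi

rng = np.random.default_rng(0)
shared = [np.abs(rng.normal(size=(4, 8)))]  # one frozen bottom layer
w = np.zeros(8)
x = np.ones(4)
for _ in range(100):
    w = passive_linear_step(w, shared, x, target=1.0)
```

The function class available to the passive learner is then only the span of the frozen features, which is one intuition for why over-extrapolation is milder in this configuration.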

Figure 27: Forked Tandem DQN: passive performance after forking, with the first k (out of N) layers of active and passive agent networks shared, i.e. only trained by the active agent, and fixed during the passive training phase.

The next experiment, shown in Fig. 28 and extending the results from Fig. 8, investigates the impact of network architecture more generally, by varying the width and depth of both active and passive agents’ networks. Since changes in Atari-DQN network architecture tend to require expensive re-tuning of various hyperparameters, we chose to perform these experiments on the smaller Classic Control domains, where such changes tend to be more straightforward. Nevertheless, active performance in these domains does depend on the chosen network configuration, so we report relative performance as the more informative quantity. The findings across four domains mostly echo those on CartPole described in Section 3.3, showing a positive correlation of (relative) passive performance with network depth, and a negative correlation with its width. Again, a more detailed investigation of the causes of this exceeds the scope of this paper and is left to future work.

Figure 28: Tandem DQN: Passive performance as a fraction of active performance when varying network architecture in classic control games. Here network architecture varies for both active and passive agent, so active performance is also affected by the configuration.

a.5 Applications of Tandem RL: Passive QR-DQN

Figure 29: QR-DQN as a passive learning algorithm, in tandem with a Double-DQN active agent: active vs. passive performance.

Here we provide an example application of the Tandem RL setting as an analytic tool for the study of learning algorithms in isolation from the confounding factor of behavior. As observed in [Agarwal et al., 2020], the QR-DQN algorithm [Dabney et al., 2018] can be preferable to DQN in the offline setting, motivating our attempt to use it as a passive agent, coupled with a regular Double-DQN active agent. As shown in Fig. 29, QR-DQN indeed provides a somewhat different passive performance profile when compared to the regular Double-DQN tandem, albeit not a clearly better one. While perfectly matching active performance in one game (Space Invaders) and even out-performing the active agent in another (Breakout), it also shows exacerbated under-performance or instability in the other two domains. A fine-grained diagnosis of the causes of this is left to future work.

We note that any difference in performance between the DQN and QR-DQN algorithms as passive agents reflects directly on their properties as learning algorithms, i.e. their respective abilities to extract information about an appropriate control policy from observational data, while separating out any influence their learning dynamics may have on (transient) behavior and data generation. We believe that Tandem RL can become a valuable analytic tool for targeted empirical studies of such properties of learning algorithms.
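The tandem setup used throughout can be summarized schematically: the active agent both acts and learns, while the passive agent trains on exactly the same transition stream without ever influencing behavior. The sketch below uses placeholder update functions, not the paper's implementation:

```python
def tandem_loop(env_step, active_update, passive_update, theta_a, theta_p, steps):
    """Schematic Tandem RL loop.  Because only the active parameters drive
    behavior, any active/passive performance gap isolates the learning
    algorithm from data generation."""
    for _ in range(steps):
        transition = env_step(theta_a)                 # behavior driven by active only
        theta_a = active_update(theta_a, transition)   # e.g. a Double-DQN update
        theta_p = passive_update(theta_p, transition)  # e.g. a QR-DQN update
    return theta_a, theta_p

# Toy stand-ins to show the data flow: both agents see the same stream,
# but only the active agent's parameters affect what data is generated.
theta_a, theta_p = tandem_loop(
    env_step=lambda th: 1.0,
    active_update=lambda th, t: th + t,
    passive_update=lambda th, t: th + 2 * t,
    theta_a=0.0, theta_p=0.0, steps=3)
```

Swapping in a different `passive_update` (as with QR-DQN above) changes nothing about the data stream, which is what makes the comparison between passive learners clean.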