Diagnosing Bottlenecks in Deep Q-learning Algorithms

by Justin Fu, et al.

Q-learning methods represent a commonly used class of algorithms in reinforcement learning: they are generally efficient and simple, and can be combined readily with function approximators for deep reinforcement learning (RL). However, the behavior of Q-learning methods with function approximation is poorly understood, both theoretically and empirically. In this work, we aim to experimentally investigate potential issues in Q-learning, by means of a "unit testing" framework where we can utilize oracles to disentangle sources of error. Specifically, we investigate questions related to function approximation, sampling error and nonstationarity, and where available, verify if trends found in oracle settings hold true with modern deep RL methods. We find that large neural network architectures have many benefits with regard to learning stability; we offer several practical compensations for overfitting; and we develop a novel sampling method based on explicitly compensating for function approximation error that yields fair improvement on high-dimensional continuous control domains.


1 Introduction

Q-learning algorithms, which are based on approximating state-action value functions, are an efficient and commonly used class of RL methods. In recent years, such methods have been applied to great effect in domains such as playing video games from raw pixels (Mnih et al., 2015) and continuous control in robotics (Kalashnikov et al., 2018). Methods based on approximate dynamic programming and Q-function estimation have several very appealing properties: they are generally moderately sample-efficient when compared to policy gradient methods, they are simple to use, and they allow for off-policy learning. This makes them an appealing choice for a wide range of tasks, from robotic control (Kalashnikov et al., 2018) to off-policy learning from historical data for recommender systems (Shani et al., 2005) and other applications. However, although the basic tabular Q-learning algorithm is convergent and admits theoretical analysis (Sutton & Barto, 2018), its non-linear counterpart with function approximation (such as with deep neural networks) is poorly understood theoretically. In this paper, we aim to investigate the degree to which the theoretical issues with Q-learning actually manifest in practice. Thus, we empirically analyze aspects of the Q-learning method in a unit testing framework, where we can employ oracle solvers to obtain ground truth Q-functions and distributions for exact analysis. We investigate the following questions:

1) What is the effect of function approximation on convergence? Most practical reinforcement learning problems, such as robotic control, require function approximation to handle large or continuous state spaces. However, the behavior of Q-learning methods under function approximation is not well understood. There are known counterexamples where the method diverges (Baird, 1995), and there are no known convergence guarantees (Sutton & Barto, 2018). To investigate these problems, we study the convergence behavior of Q-learning methods with function approximation, parametrically varying the function approximator power and analyzing the quality of the solution as compared to the optimal Q-function and the optimal projected Q-function under that function approximator. We find, somewhat surprisingly, that function approximation error is not a major problem in Q-learning algorithms, but only when the representational capacity of the function approximator is high. This makes sense in light of the theory: a high-capacity function approximator can perform a nearly perfect projection of the backed up Q-function, thus mitigating potential convergence issues due to an imperfect norm projection. We also find that divergence rarely occurs: for example, we observed divergence in only 0.9% of our experiments. We discuss this further in Section 4.

2) What is the effect of sampling error and overfitting? Q-learning is used to solve problems where we do not have access to the transition function of the MDP. Thus, Q-learning methods need to learn by collecting samples in the environment, and training on these samples incurs sampling error, potentially leading to overfitting. This causes errors in the computation of the Bellman backup, which degrades the quality of the solution. We experimentally show that overfitting exists in practice by performing ablation studies on the number of gradient steps, and by demonstrating that oracle-based early stopping techniques can be used to improve performance of Q-learning algorithms (Section 5). Thus, in our experiments we quantify the amount of overfitting that happens in practice, incorporating a variety of metrics and performing a number of ablations, and investigate methods to mitigate its effects.

3) What is the effect of distribution shift and a moving target? The standard formulation of Q-learning prescribes an update rule, with no corresponding objective function (Sutton et al., 2009a). This results in a process which optimizes an objective that is non-stationary in two ways: the target values are updated during training, and the distribution under which the Bellman error is optimized changes, as samples are drawn from different policies. We refer to these problems as the moving target and distribution shift problems, respectively. These properties can make convergence behavior difficult to understand, and prior works have hypothesized that nonstationarity is a source of instability (Mnih et al., 2015; Lillicrap et al., 2015). In our experiments, we develop metrics to quantify the amount of distribution shift and performance change due to non-stationary targets. Surprisingly, we find that in a controlled experiment, distributional shift and non-stationary targets do not in fact correlate with reduction in performance. In fact, sampling strategies with large distributional shift often perform very well.

4) What is the best sampling or weighting distribution? Deeply tied to the distribution shift problem is the choice of which distribution to sample from. Do moving distributions cause instability, as Q-values trained on one distribution are evaluated under another in subsequent iterations? Researchers have often noted that on-policy samples are typically superior to off-policy samples (Sutton & Barto, 2018), and there are several theoretical results that highlight favorable convergence properties under on-policy samples. However, there is little theoretical guidance on how to pick distributions so as to maximize learning rate. To this end, we investigate several choices for the sampling distribution. Surprisingly, we find that on-policy training distributions are not always preferable, and that a clear pattern in performance with respect to training distribution is that broader, higher-entropy distributions perform better, regardless of distributional shift. Motivated by our findings, we propose a novel weighting distribution, adversarial feature matching (AFM), which explicitly compensates for function approximator error, while still producing high-entropy sampling distributions.

Our contributions are as follows: We introduce a unit testing framework for Q-learning to disentangle issues related to function approximation, sampling, and distributional shift where approximate components are replaced by oracles. This allows for controlled analysis of different sources of error. We perform a detailed experimental analysis of many hypothesized sources of instability, error, and slow training in Q-learning algorithms on tabular domains, and show that many of these trends hold true in high dimensional domains. We propose novel choices of sampling distributions which lead to improved performance even on high-dimensional tasks. Our overall aim is to offer practical guidance for designing RL algorithms, as well as to identify important issues to solve in future research.

2 Preliminaries

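Q-learning algorithms aim to solve a Markov decision process (MDP) by learning the optimal state-action value function, or Q-function. We define an MDP as a tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$. $\mathcal{S}$ and $\mathcal{A}$ represent the state and action spaces, respectively. $T(s' | s, a)$ and $R(s, a)$ represent the dynamics (transition distribution) and reward function, and $\gamma \in (0, 1)$ represents the discount factor. The goal in RL is to find a policy $\pi(a | s)$ that maximizes the expected cumulative discounted rewards, known as the returns:

$$\pi^* = \arg\max_\pi \mathbb{E}_{s_{t+1} \sim T(\cdot | s_t, a_t),\, a_t \sim \pi(\cdot | s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$$

The quantities of interest in Q-learning methods are state-action value functions, which give the expected future return starting from a particular state-action tuple, denoted $Q^\pi(s, a)$. The state value function can also be denoted as $V^\pi(s)$. Q-learning algorithms are based on iterating the Bellman backup operator $\mathcal{T}$, defined as

$$(\mathcal{T} Q)(s, a) = R(s, a) + \gamma \, \mathbb{E}_{s' \sim T(\cdot | s, a)}[V(s')], \qquad V(s) = \max_{a'} Q(s, a')$$

The (tabular) Q-iteration algorithm is a dynamic programming algorithm that iterates the Bellman backup, $Q^{t+1} \leftarrow \mathcal{T} Q^t$. Because the Bellman backup is a $\gamma$-contraction in the $L_\infty$ norm, and $Q^*$ (the Q-values of $\pi^*$) is its fixed point, Q-iteration can be shown to converge to $Q^*$ (Sutton & Barto, 2018). A deterministic optimal policy can then be obtained as $\pi^*(s) = \arg\max_a Q^*(s, a)$.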
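As a rough illustration of tabular Q-iteration, the sketch below repeatedly applies the Bellman backup on a tiny MDP. The two-state MDP, the array layout, and the function names are assumptions made for this example and do not come from the paper's experiments.

```python
import numpy as np

def bellman_backup(Q, T, R, gamma):
    """Bellman optimality backup for a tabular Q of shape (S, A).

    T has shape (S, A, S) with T[s, a, s2] = P(s2 | s, a); R has shape (S, A).
    """
    V = Q.max(axis=1)            # V(s) = max_a Q(s, a)
    return R + gamma * (T @ V)   # R(s, a) + gamma * E_{s'}[V(s')]

def q_iteration(T, R, gamma, num_iters=500):
    """Iterate the backup from Q = 0; converges because the backup is a gamma-contraction."""
    Q = np.zeros_like(R, dtype=float)
    for _ in range(num_iters):
        Q = bellman_backup(Q, T, R, gamma)
    return Q

# Hypothetical 2-state, 2-action MDP (illustrative only).
# Action 0 stays in place, action 1 switches state; only (state 1, stay) pays reward 1.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[1, 0, 1] = 1.0
T[0, 1, 1] = T[1, 1, 0] = 1.0
R = np.array([[0.0, 0.0], [1.0, 0.0]])
gamma = 0.9

Q_star = q_iteration(T, R, gamma)
greedy_policy = Q_star.argmax(axis=1)  # deterministic greedy policy
```

On this toy problem the greedy policy moves from state 0 to state 1 and then stays there, which is the behavior one would expect from the reward structure.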

When state spaces cannot be enumerated in a tabular format, function approximators can be used to represent the Q-values. An important class of such Q-learning methods are fitted Q-iteration (FQI) (Ernst et al., 2005), or approximate dynamic programming (ADP) methods, which form the basis of modern deep RL methods such as DQN (Mnih et al., 2015). FQI projects the values of the Bellman backup onto a family of Q-function approximators $\mathcal{Q}$:

$$Q^{t+1} \leftarrow \Pi_\mu(\mathcal{T} Q^t)$$

Here, $\Pi_\mu$ denotes a $\mu$-weighted L2 projection, which minimizes the Bellman error via supervised learning:

$$\Pi_\mu(Q) = \arg\min_{Q' \in \mathcal{Q}} \mathbb{E}_{(s, a) \sim \mu}\left[ (Q'(s, a) - Q(s, a))^2 \right] \qquad (1)$$

The values produced by the Bellman backup, $(\mathcal{T} Q^t)(s, a)$, are commonly referred to as target values, and when neural networks are used for function approximation, the previous Q-function $Q^t(s, a)$ is referred to as the target network. In this work, we distinguish between the cases when the Bellman error is estimated with Monte-Carlo sampling or computed exactly (see Section 3.1). The sampled variant corresponds to FQI as described in the literature (Ernst et al., 2005; Riedmiller, 2005), while the exact variant is analogous to conventional ADP (Bertsekas & Tsitsiklis, 1996).

Convergence guarantees for Q-iteration do not cleanly translate to FQI. $\Pi_\mu$ is an $\ell_2$ projection, but $\mathcal{T}$ is a contraction in the $\ell_\infty$ norm. This norm mismatch means the composition of the backup and projection is no longer guaranteed to be a contraction under any norm (Bertsekas & Tsitsiklis, 1996), and hence convergence is not guaranteed.
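The backup-then-project structure of FQI can be sketched with a linear function class, where the weighted L2 projection reduces to weighted least squares. This is a minimal sketch under stated assumptions: the linear features, the toy MDP, and all function names are hypothetical, and the paper itself uses neural networks rather than linear models.

```python
import numpy as np

def weighted_projection(features, targets, mu):
    """mu-weighted L2 projection of target values onto span(features).

    features: (S*A, d); targets: (S*A,); mu: (S*A,) nonnegative weights.
    Solves min_w sum_i mu_i * (features_i @ w - targets_i)^2.
    """
    sqrt_mu = np.sqrt(mu)
    w, *_ = np.linalg.lstsq(sqrt_mu[:, None] * features, sqrt_mu * targets, rcond=None)
    return features @ w

def exact_fqi(features, T, R, gamma, mu, num_iters=300):
    """Alternate an exact Bellman backup with a weighted projection."""
    num_states, num_actions = R.shape
    q = np.zeros(num_states * num_actions)
    for _ in range(num_iters):
        V = q.reshape(num_states, num_actions).max(axis=1)
        targets = (R + gamma * (T @ V)).reshape(-1)      # exact backup
        q = weighted_projection(features, targets, mu)   # projection step
    return q.reshape(num_states, num_actions)

# Hypothetical 2-state, 2-action MDP (illustrative only).
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[1, 0, 1] = 1.0
T[0, 1, 1] = T[1, 1, 0] = 1.0
R = np.array([[0.0, 0.0], [1.0, 0.0]])
mu_unif = np.full(4, 0.25)

# One-hot features make the projection exact, so FQI reduces to tabular Q-iteration.
Q_tabular = exact_fqi(np.eye(4), T, R, gamma=0.9, mu=mu_unif)
```

With a lower-dimensional feature matrix the projection is no longer exact, and the iteration is not guaranteed to converge, which is the regime the norm-mismatch argument is about.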

A related branch of Q-learning methods are online Q-learning methods, in which Q-values are updated while samples are being collected in the MDP. This includes classic algorithms such as Watkins' Q-learning (Watkins & Dayan, 1992). Online Q-learning methods can be viewed as a form of stochastic approximation (such as Robbins-Monro) applied to Q-iteration and FQI (Bertsekas & Tsitsiklis, 1996), and share many of its theoretical properties (Szepesvári, 1998). Modern deep RL algorithms such as DQN (Mnih et al., 2015) have characteristics of both online Q-learning and FQI: using replay buffers means the sampling distribution changes very little between target updates (see Section 6.3), and target networks are justified from the viewpoint of FQI. Because FQI corresponds to the case when the sampling distribution is static between target updates, the behavior of modern deep RL methods more closely resembles FQI than a true online method without target networks.

3 Experimental Setup

Our experimental setup is centered around unit-testing. We first introduce a spectrum of Q-learning algorithms, starting with exact approximate dynamic programming and gradually replacing oracle components, such as knowledge of dynamics, until the algorithm resembles modern deep Q-learning methods. We then introduce a suite of tabular environments where oracle solutions can be computed and compared against, to aid in diagnosis, as well as testing in high-dimensional environments to verify our hypotheses.

In order to provide consistent metrics across domains, we normalize returns and errors involving Q-functions (such as Bellman error) by the returns of the expert policy on each environment.

3.1 Algorithms

In the analysis presented in Sections 4, 5, 6 and 7, we will use three different Q-learning variants, each of which removes some of the approximations in the standard Q-learning method used in the literature: Exact-FQI, Sampled-FQI, and Replay-FQI. Although FQI is not exactly identical to commonly used deep RL methods, such as DQN (Mnih et al., 2015), DDPG (Lillicrap et al., 2015), and SAC (Haarnoja et al., 2017), it is structurally similar and, when the replay buffer for the commonly used methods becomes large, the difference becomes negligible, since the sampling distribution changes very little between target network updates. However, FQI methods are much more amenable to controlled analysis, since we can separately isolate target values, update rates, and the number of samples used for each iteration. We therefore use variants of FQI as the basis for our analysis, but we also confirm that similar trends hold with more commonly used algorithms on standard benchmark problems.

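Algorithm 1 Exact-FQI
1:  Initialize Q-value approximator $Q_\theta(s, a)$.
2:  for step $t$ in {1, …, N} do
3:     Evaluate $Q_{\theta^t}(s, a)$ at all states.
4:     Compute exact target values $y(s, a) = (\mathcal{T} Q_{\theta^t})(s, a)$ at all states.
5:     Minimize projection loss with respect to $\mu$: $\theta^{t+1} \leftarrow \arg\min_\theta \mathbb{E}_{(s, a) \sim \mu}[(Q_\theta(s, a) - y(s, a))^2]$
6:  end for

Algorithm 2 Sampled-FQI
1:  Initialize Q-value approximator $Q_\theta(s, a)$.
2:  for step $t$ in {1, …, N} do
3:     Collect $K$ samples $\{(s_i, a_i, r_i, s'_i)\}_{i=1}^{K}$ from $\mu$.
4:     Evaluate $Q_{\theta^t}(s_i, a_i)$ on samples.
5:     Compute sampled target values $y_i = r_i + \gamma \max_{a'} Q_{\theta^t}(s'_i, a')$ on samples.
6:     Minimize projection loss with respect to samples: $\theta^{t+1} \leftarrow \arg\min_\theta \frac{1}{K} \sum_{i=1}^{K} (Q_\theta(s_i, a_i) - y_i)^2$
7:  end for

Algorithm 3 Replay-FQI
1:  Initialize Q-value approximator $Q_\theta(s, a)$, replay buffer $\mathcal{B}$.
2:  for step $t$ in {1, …, N} do
3:     Collect $K$ online samples $\{(s_i, a_i, r_i, s'_i)\}_{i=1}^{K}$ from $\mu$.
4:     Append online samples to buffer $\mathcal{B}$.
5:     Collect $P$ samples $\{(s_j, a_j, r_j, s'_j)\}_{j=1}^{P}$ from $\mathcal{B}$.
6:     Evaluate $Q_{\theta^t}(s_j, a_j)$ on samples.
7:     Compute sampled target values $y_j = r_j + \gamma \max_{a'} Q_{\theta^t}(s'_j, a')$ on samples.
8:     Minimize projection loss with respect to samples: $\theta^{t+1} \leftarrow \arg\min_\theta \frac{1}{P} \sum_{j=1}^{P} (Q_\theta(s_j, a_j) - y_j)^2$
9:  end for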

Exact-FQI (Algorithm 1): Exact-FQI computes the backup and projection on all state-action tuples without any sampling error. It also assumes knowledge of dynamics and reward function to compute Bellman backups exactly. We use Exact-FQI to study convergence, distribution shift (by varying weighting distributions on transitions), and function approximation in the absence of sampling error. Exact-FQI eliminates errors due to sampling states, and computing inexact, sampled backups.

Sampled-FQI (Algorithm 2): Sampled-FQI is a special case of Exact-FQI, where the Bellman error is approximated with Monte-Carlo estimates from a sampling distribution $\mu$, and the Bellman backup is approximated with samples from the dynamics as $(\hat{\mathcal{T}} Q)(s, a) = r + \gamma \max_{a'} Q(s', a')$. We use Sampled-FQI to study effects of overfitting. Sampled-FQI incorporates all sources of error, arising from function approximation, sampling, and distribution shift.

Replay-FQI (Algorithm 3): Replay-FQI is a special case of Sampled-FQI that uses a replay buffer (Lin, 1992), which saves past transition samples $(s, a, r, s')$ that are used for computing the Bellman error. Replay-FQI strongly resembles DQN (Mnih et al., 2015), lacking the online updates that allow $\mu$ to change within an FQI iteration. With large replay buffers, we expect the difference between Replay-FQI and DQN to be minimal as $\mu$ changes slowly.
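The difference between an exact backup and a sampled one can be seen on a single state-action pair. The sketch below is illustrative only: the Q-table, the next-state probabilities, and the reward are made-up values, and `sampled_targets` is a hypothetical helper name.

```python
import numpy as np

def sampled_targets(Q, rewards, next_states, gamma):
    """Sampled target y_i = r_i + gamma * max_a' Q(s'_i, a') for a batch of transitions."""
    return rewards + gamma * Q[next_states].max(axis=1)

rng = np.random.default_rng(0)
gamma = 0.9
Q = np.array([[1.0, 0.0], [2.0, 3.0]])    # tabular Q, so V = [1.0, 3.0]
next_state_probs = np.array([0.3, 0.7])   # hypothetical T(. | s, a) for one fixed (s, a)
reward = 0.5

# Exact backup: uses the true transition distribution (available to Exact-FQI).
exact_target = reward + gamma * next_state_probs @ Q.max(axis=1)

# Sampled backup: uses next states drawn from the dynamics (Sampled-FQI, Replay-FQI).
K = 20000
next_states = rng.choice(2, size=K, p=next_state_probs)
y = sampled_targets(Q, np.full(K, reward), next_states, gamma)

small_batch_estimate = y[:8].mean()   # noisy when few samples are available
large_batch_estimate = y.mean()       # approaches the exact target as K grows
```

Each individual sampled target is either 1.4 or 3.2 here, so a small batch can be far from the exact value, which is the sampling error that Exact-FQI removes.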

We additionally investigate the following choices of weighting distributions ($\mu$) for the Bellman error. When sampling the Bellman error, these can be implemented by sampling directly from the distribution, or via importance sampling.

Unif: Uniform weights over state-action space. This is the weighting distribution typically used by dynamic programming algorithms, such as FQI.

$\pi(s, a)$: The on-policy state-action marginal induced by $\pi$.

$\pi^*(s, a)$: The state-action marginal induced by $\pi^*$.

Random: State-action marginal induced by executing uniformly random actions.

Prioritized(s,a): Weights Bellman errors proportional to $|Q(s, a) - \mathcal{T} Q(s, a)|$. This is similar to prioritized replay (Schaul et al., 2015) without importance sampling.

Replay and Replay10: Averaged state-action marginal of all policies (or the previous 10) produced during training. This simulates sampling uniformly from a replay buffer where infinite samples are collected from each policy.
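Two of these weighting schemes are simple enough to sketch directly on a tabular Q-function. The code below is a hedged illustration: the function names, the small `eps` constant, and the example values are assumptions, and the priority is taken to be the absolute Bellman error.

```python
import numpy as np

def uniform_weights(num_states, num_actions):
    """Unif: equal weight on every state-action pair."""
    return np.full((num_states, num_actions), 1.0 / (num_states * num_actions))

def prioritized_weights(Q, TQ, eps=1e-8):
    """Prioritized: weights proportional to the absolute Bellman error |Q - TQ|.

    eps is an assumption of this sketch; it keeps the weights defined when all errors are zero.
    """
    abs_err = np.abs(Q - TQ) + eps
    return abs_err / abs_err.sum()

def weighted_bellman_error(Q, TQ, mu):
    """Squared Bellman error averaged under the weighting distribution mu."""
    return float(np.sum(mu * (Q - TQ) ** 2))

# Made-up values: only the pair (state 1, action 0) has a nonzero Bellman error.
Q = np.zeros((2, 2))
TQ = np.array([[0.0, 0.0], [1.0, 0.0]])

mu_unif = uniform_weights(2, 2)
mu_prio = prioritized_weights(Q, TQ)
err_unif = weighted_bellman_error(Q, TQ, mu_unif)
err_prio = weighted_bellman_error(Q, TQ, mu_prio)
```

The prioritized weighting concentrates almost all of its mass on the single high-error pair, so the same Q-function scores a much larger weighted error than under uniform weights.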

3.2 Domains

We evaluate our methods on a suite of tabular environments where we can compute oracle values. This will help us compare, analyze and fix various sources of error by means of comparing the learned Q-functions to the true, oracle-computed Q-functions. We selected 8 tabular domains, each with different qualitative attributes, including: gridworlds of varying sizes and observations, blind Cliffwalk (Schaul et al., 2015), discretized Pendulum and Mountain Car based on implementations in OpenAI Gym (Plappert et al., 2018), and a random sparsely connected graph. We give full details of these environments in Appendix A, as well as their motivation for inclusion.

3.3 Function Approximators

Throughout our experiments, we use 2-layer ReLU networks, denoted by a tuple $(N, N)$, where N represents the number of units in a layer. The “Tabular” architecture refers to the case when no function approximation is used.
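For concreteness, a network of this kind could look like the NumPy sketch below. It assumes that "2-layer" means two hidden ReLU layers followed by a linear output with one Q-value per action; the initialization scheme and the function names are likewise assumptions of this example.

```python
import numpy as np

def init_q_network(state_dim, num_actions, hidden=(64, 64), seed=0):
    """Create weights for an MLP with len(hidden) ReLU layers and a linear output layer."""
    rng = np.random.default_rng(seed)
    sizes = [state_dim, *hidden, num_actions]
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def q_values(params, states):
    """Forward pass: states of shape (batch, state_dim) -> Q-values of shape (batch, num_actions)."""
    h = states
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)   # ReLU hidden layer
    W, b = params[-1]
    return h @ W + b                     # linear output, one value per action

params = init_q_network(state_dim=4, num_actions=3, hidden=(16, 16))
batch_q = q_values(params, np.ones((5, 4)))
```

Varying the `hidden` tuple is then the knob that changes the capacity of the function approximator.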

3.4 High-Dimensional Testing

In addition to diagnostic experiments on tabular domains, we also wish to see if the observed trends hold true on high-dimensional environments. To this end, we include experiments on continuous control tasks in the OpenAI Gym benchmark (Plappert et al., 2018) (HalfCheetah-v2, Hopper-v2, Ant-v2, Walker2d-v2). In continuous domains, computing the maximum over actions of the Q-value ($\max_a Q(s, a)$) is difficult. A common choice in this case is to use a second “actor” neural network to approximate $\arg\max_a Q(s, a)$ (Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018). This approach most closely resembles Replay-FQI, but using the actor network in place of the max.

4 Function Approximation and Convergence

The first issue we investigate is the connection between function approximation and convergence properties.

4.1 Technical Background

As discussed in Section 2, when function approximation is introduced to Q-learning, convergence guarantees are lost. This interaction between approximation and convergence has been a long-studied topic in reinforcement learning. In the control literature, it is closely related to the problems of state-aliasing or interference (Farrell & Berger, 1995). Baird (1995) introduces a simple counterexample in which Watkins' Q-learning with linear approximators can cause unbounded divergence. In the policy evaluation scenario, Tsitsiklis & Van Roy (1997) prove that on-policy TD-learning with linear function approximators converges, and methods such as GTD (Sutton et al., 2009a) and ETD (Sutton et al., 2016) have extended results to off-policy cases. In the control scenario, convergent algorithms such as SBEED (Dai et al., 2018) and Greedy-GQ (Maei et al., 2010) have been developed. However, several works have noted that divergence need not occur. Munos (2005) theoretically addresses the norm-mismatch problem, showing that unbounded divergence is impossible provided $\mu$ has adequate support and projections are non-expansive in p-norms. Concurrently to us, Van Hasselt et al. (2018) experimentally find that unbounded divergence rarely occurs with DQN variants on Atari games.

4.2 How does function approximation affect convergence properties and suboptimality of solutions?

The crucial quantities we wish to measure are a trend between function approximation and performance, and a measure for the bias in the learning procedure introduced by function approximation. Thus, using Exact-FQI with uniform weighting (to remove sampling error), we measure the returns of the learned policy, and the error between $Q^*$ and either the solution found by Exact-FQI (the FQI error) or the projection of the optimal solution, $\Pi_\mu Q^*$ (the projection error). $\Pi_\mu Q^*$ represents the best solution inside the model class, in the absence of error from the bootstrapping process of FQI. Thus, the difference between FQI error and projection error represents the bias introduced by the bootstrapping procedure, while controlling for bias that is simply due to function approximation; this quantity is roughly the inherent Bellman error of the function class (Munos & Szepesvári, 2008). This is the gap which can possibly be improved upon via better Q-learning algorithm design. We plot our results in Fig. 1.

Figure 1: Normalized returns and normalized Q-function error with function approximation, averaged across domains and seeds. We see that for small architectures, there is a significant gap between the solution found by FQI (FQI Error) and the best solution within the model class (Project Error).

We first note the obvious trend that smaller architectures produce lower returns, and converge to more suboptimal solutions. However, we also find that smaller architectures introduce significant bias in the learning process, and there is often a significant gap between the solution found by Exact-FQI and the best solution within the model class. This gap may be due to the fact that when the target is bootstrapped, we must be able to represent all Q-functions along the path to the solution, and not just the final result (Bertsekas & Tsitsiklis, 1996). This observation implies that using large architectures is crucial not only because they have capacity to represent a better solution, but also because they are significantly easier to train using bootstrapping, and suffer less from nonconvergence issues. We also note that divergence rarely happens in practice. We observed divergence in 0.9% of our experiments using function approximation, measured by the largest Q-value growing larger than 10 times that of $Q^*$.
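A divergence check of this kind could be implemented roughly as follows. Comparing largest absolute values is an interpretation made for this sketch, and the helper names and example arrays are hypothetical.

```python
import numpy as np

def has_diverged(Q, Q_star, factor=10.0):
    """Flag a run whose largest Q-value magnitude exceeds factor times that of Q_star.

    Using absolute values is an assumption of this sketch.
    """
    return bool(np.abs(Q).max() > factor * np.abs(Q_star).max())

def divergence_rate(final_Qs, Q_star, factor=10.0):
    """Fraction of runs flagged as diverged."""
    flags = [has_diverged(Q, Q_star, factor) for Q in final_Qs]
    return sum(flags) / len(flags)

# Made-up example: four runs, two of which blow up relative to Q_star.
Q_star = np.array([[8.1, 9.0], [10.0, 8.1]])
runs = [Q_star, 2.0 * Q_star, 11.0 * Q_star, 50.0 * Q_star]
rate = divergence_rate(runs, Q_star)
```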

For high-dimensional problems, we present experiments on varying the architecture of the Q-network in SAC (Haarnoja et al., 2018) in Appendix Fig. 13. We still observe that large networks have the best performance, and that divergence rarely happens even in high-dimensional continuous spaces. In Appendix B, we briefly discuss theoretical intuitions on the apparent discrepancy between the lack of unbounded divergence and the known counterexamples.

5 Sampling Error and Overfitting

A second source of error in minimizing the Bellman error, orthogonal to function approximation, is that of sampling or generalization error. The next issue we investigate is the effect of sampling error on Q-learning methods.

5.1 Technical Background

Figure 2: Samples plotted with returns for a 256x256 network. More samples yield better performance.

Approximate dynamic programming assumes that the projection of the Bellman backup (Eqn. 1) is computed exactly, but in reinforcement learning we can normally only compute the empirical Bellman error over a finite set of samples. In the PAC framework, overfitting can be quantified by a bound, holding with high probability, on the error between the empirical and expected loss, which decays with sample size (Shalev-Shwartz & Ben-David, 2014). Munos & Szepesvári (2008); Maillard et al. (2010); Tosatto et al. (2017) provide such PAC-bounds which account for sampling error in the context of Q-learning and value-based methods, and quantify the quality of the final solution in terms of sample complexity.

We analyze several key points that relate to sampling error. First, we show that Q-learning is prone to overfitting, and that this overfitting has a real impact on performance, in both tabular and high-dimensional settings. We also show that the replay buffer is in fact a very effective technique in addressing this issue, and discuss several methods to mitigate the effects of overfitting in practice.

Figure 3: On-policy validation losses for varying amounts of on-policy data (or replay buffer), averaged across environments and seeds. Note that sampling from the replay buffer has lower on-policy validation loss, despite bias from distribution shift.

5.2 Quantifying Overfitting

We first quantify the amount of overfitting that happens during training, by varying the number of samples. In order to provide comparable validation errors across different experiments, we fix a reference sequence of Q-functions, $Q^1, \ldots, Q^N$, obtained during a normal training run. We then retrace the training sequence, and minimize the projection error for each training iteration, using varying amounts of on-policy data or sampling from a replay buffer. We measure the exact validation error (the expected Bellman error) at each iteration under the on-policy distribution, plotted in Fig. 3. We note the obvious trend that more samples lead to lower validation loss, confirming that overfitting can in fact occur. A more interesting observation is that sampling from the replay buffer results in the lowest on-policy validation loss, despite bias due to distribution mismatch from sampling off-policy data. As we discuss in Section 6, we believe that replay buffers are mainly effective because they greatly reduce the effect of overfitting and create relatively good coverage over the state space, not necessarily due to reducing the effects of distribution shift.

Next, Fig. 2 shows the relationship between number of samples and returns. We see a clear trend that higher sample count leads to improved learning speed and a better final solution, confirming our hypothesis that overfitting has a significant effect on the performance of Q-learning. A full sweep including architectures is presented in Appendix Fig. 14. We observe that despite overfitting being an issue, larger architectures still perform better because the bias introduced by smaller architectures dominates.

Figure 4: Normalized returns plotted over training iterations (32 samples are taken per iteration), for different ratios of gradient steps taken per sample during projection using Replay-FQI. We observe that intermediate values of gradient steps work best, and too many gradient steps hinder performance.
Figure 5: Normalized returns plotted over training iterations (32 samples are taken per iteration), for different early stopping methods using Replay-FQI. We observe that using proper early stopping can result in a modest performance increase.

5.3 What methods can be used to compensate for overfitting?

Finally, we discuss methods to compensate for overfitting. One common method for reducing overfitting is to regularize the function approximator to reduce its capacity. However, as we have seen, weaker architectures can give rise to suboptimal convergence, so we instead study early stopping methods to mitigate overfitting without reducing model size. First, we observe that the number of gradient steps taken per sample in the projection step has an important effect on performance: with too few steps the algorithm learns slowly, but with too many steps the algorithm may initially learn quickly but overfit. To show this, we run a hyperparameter sweep over the number of gradient steps taken per environment step in Replay-FQI and TD3 (TD3 uses 1 by default). Results for FQI are shown in Fig. 4, and for TD3 in Appendix Fig. 15.

In order to understand whether better early stopping criteria can possibly help with overfitting, we employ oracle early stopping rules. While neither of these rules can be used to solve overfitting in practice, these experiments can provide guidance for future methods and an “upper bound” on the best improvement that can be obtained from optimal stopping. We investigate two oracle early stopping criteria for setting the number of gradient steps: using the expected Bellman error and the expected returns of the greedy policy w.r.t. the current Q-function (oracle returns). We implement both methods by running the projection step of Replay-FQI to convergence using gradient descent, and afterwards selecting the intermediate Q-function which is judged best by the evaluation metric (lowest Bellman error or highest returns). Using such oracle stopping metrics results in a modest boost in performance in tabular domains (Fig. 5). Thus, we believe that there is promise in further improving such early-stopping methods for reducing overfitting in deep RL algorithms.
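The oracle stopping procedure can be sketched on a stand-in regression problem: run gradient descent on the sampled loss, score every intermediate iterate with an oracle metric, and keep the best one. This is only a sketch under stated assumptions; the linear model, the synthetic data, and the names `fit_with_oracle_stopping` and `exact_error` are hypothetical and are not the paper's implementation.

```python
import numpy as np

def fit_with_oracle_stopping(X_train, y_train, oracle_metric, num_steps=200, lr=0.05):
    """Gradient descent on the sampled squared loss, keeping the iterate with the
    lowest oracle_metric (lower is better). Returns (best_w, best_score, final_w)."""
    w = np.zeros(X_train.shape[1])
    best_w, best_score = w.copy(), oracle_metric(w)
    for _ in range(num_steps):
        grad = 2.0 * X_train.T @ (X_train @ w - y_train) / len(y_train)
        w = w - lr * grad
        score = oracle_metric(w)
        if score < best_score:
            best_w, best_score = w.copy(), score
    return best_w, best_score, w

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X_train = rng.normal(size=(6, 3))                           # a few sampled inputs
y_train = X_train @ w_true + rng.normal(scale=1.0, size=6)  # noisy sampled targets
X_all = rng.normal(size=(500, 3))                           # stands in for the full state space

def exact_error(w):
    """Oracle metric: error against noise-free targets over the full input set."""
    return float(np.mean((X_all @ w - X_all @ w_true) ** 2))

w_best, best_score, w_final = fit_with_oracle_stopping(X_train, y_train, exact_error)
```

By construction the selected iterate is never worse, under the oracle metric, than the fully trained one, which is why such rules act as an upper bound on what a practical stopping rule could achieve.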

We might draw a few actionable conclusions from these experiments. First, overfitting is indeed a serious issue with Q-learning, and too many gradient steps or too few samples can lead to poor performance. Second, replay buffers and early stopping can be used to mitigate the effects of overfitting. Third, although overfitting is a problem, large architectures are still preferred, because the harm from function approximation bias outweighs the harm from increased overfitting with large models.

6 Non-Stationarity

In this section, we discuss issues related to the non-stationarity of the Q-learning process (relating to the Bellman backup and Bellman error minimization).

6.1 Technical Background

Instability in Q-learning methods is often attributed to the nonstationarity of the regression objective (Lillicrap et al., 2015; Mnih et al., 2015). Nonstationarity occurs in two places: in the changing target values $\mathcal{T} Q$, and in a changing weighting distribution $\mu$ (“distribution shift”) (i.e., due to samples being taken from different policies). Note that a non-stationary objective, by itself, is not indicative of instability. For example, gradient descent can be viewed as successively minimizing linear approximations to a function: for gradient descent on $f$ with parameter $x$ and learning rate $\alpha$, we have the “moving” objective $x^{t+1} = \arg\min_x \{ x^\top \nabla_x f(x^t) + \frac{1}{2\alpha} \|x - x^t\|_2^2 \}$. However, the fact that the Q-learning algorithm prescribes an update rule and not a stationary objective complicates analysis. Indeed, the motivation behind algorithms such as GTD (Sutton et al., 2009b, a) and residual methods (Baird, 1995; Scherrer, 2010) can be seen as introducing a stationary objective that can be optimized with standard procedures such as gradient descent for increased stability. Therefore, a key question to investigate is whether these non-stationarities are detrimental to the learning process.
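To make the gradient descent example concrete, the short derivation below checks that minimizing a linearization of the function plus a quadratic penalty recovers the usual gradient step. The symbols $f$, $x$, and $\alpha$ are notation assumed here for the function, its parameter, and the learning rate.

```latex
\begin{align*}
x^{t+1} &= \arg\min_x \left\{ x^\top \nabla_x f(x^t) + \frac{1}{2\alpha} \lVert x - x^t \rVert_2^2 \right\} \\
0 &= \nabla_x f(x^t) + \frac{1}{\alpha} \left( x^{t+1} - x^t \right)
  && \text{(first-order optimality condition)} \\
x^{t+1} &= x^t - \alpha \nabla_x f(x^t)
\end{align*}
```

The objective being minimized changes at every step because it depends on $x^t$, yet the resulting procedure is ordinary gradient descent.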

6.2 Does a moving target cause instability in the absence of a moving distribution?

To study the moving target problem, we must first isolate the effects of a moving target, and study how the rate at which the target changes impacts performance. To control the rate at which the target changes, we introduce an additional smoothing parameter $\alpha$ to Q-iteration, where the target values are now computed as an $\alpha$-moving average over previous targets. We define the $\alpha$-smoothed Bellman backup, $\mathcal{T}^\alpha$, as follows:

$$\mathcal{T}^\alpha Q = \alpha \mathcal{T} Q + (1 - \alpha) Q$$

This scheme is inspired by the soft target update used in algorithms such as DDPG (Lillicrap et al., 2015) and SAC (Haarnoja et al., 2017) to improve the stability of learning. Standard Q-iteration uses a “hard” update where $\alpha = 1$. A soft target update weakens the contraction of Q-iteration from $\gamma$ to $1 - \alpha + \alpha\gamma$ (see Appendix C), so we expect slower convergence, but perhaps it is more stable under heavy function approximation error. We performed experiments with this modified backup using Exact-FQI under the Unif weighting distribution.

Our results are presented in Appendix Fig. 12. We find that in most cases, the hard update with $\alpha = 1$ results in the fastest convergence and highest asymptotic performance. However, for the two smallest architectures we used, lower values of $\alpha$ (such as 0.1) achieve slightly higher asymptotic performance. Thus, while more expressive architectures are still stable under fast-changing targets, we believe that a slowly moving target may have benefits under heavy approximation error. This evidence points to either using large function approximators, in line with the conclusions drawn in the previous sections, or adaptively slowing the target updates when the architecture is weak (relative to the problem difficulty) and the projected Bellman error is therefore high.

6.3 Does distribution shift impact performance?

Figure 6: Distribution shift and loss shift plotted against time. Prioritized and on-policy distributions induce the greatest shift, whereas replay buffers greatly reduce the amount of shift.

To study the distribution shift problem, we exactly compute the amount of distribution shift between iterations in total-variation distance, and the “loss shift”:

The loss shift quantifies the Bellman error objective when evaluated under a new distribution. If the distribution shifts to previously unseen or low-support states, we would expect a highly inaccurate Q-value in such states, and a correspondingly high loss shift.
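A minimal sketch of how these two quantities could be computed in a tabular setting is given below. The total-variation distance is standard; the `loss_shift` function encodes one plausible reading of the description above (the change in weighted Bellman error when the same Q-function is re-evaluated under the new distribution) and should be treated as an assumption rather than the exact definition used in the experiments. All names and the toy numbers are hypothetical.

```python
import numpy as np

def total_variation(mu_new, mu_old):
    """Total-variation distance between two distributions over (s, a) pairs."""
    return 0.5 * np.abs(np.asarray(mu_new) - np.asarray(mu_old)).sum()

def weighted_bellman_error(Q, target, mu):
    """Bellman error objective: squared error to the target, weighted by mu."""
    return float(np.sum(mu * (Q - target) ** 2))

def loss_shift(Q, target, mu_new, mu_old):
    """ASSUMED definition: weighted Bellman error under the new distribution
    minus the weighted Bellman error under the old one, for a fixed Q."""
    return (weighted_bellman_error(Q, target, mu_new)
            - weighted_bellman_error(Q, target, mu_old))

# Toy example with 2 states x 2 actions: Q is accurate where mu_old has mass
# and inaccurate on the pair that only mu_new visits.
Q      = np.array([[1.0, 0.0], [0.0, 5.0]])
target = np.array([[1.0, 0.0], [0.0, 1.0]])
mu_old = np.array([[0.5, 0.5], [0.0, 0.0]])
mu_new = np.array([[0.0, 0.5], [0.0, 0.5]])

print(total_variation(mu_new, mu_old))        # 0.5
print(loss_shift(Q, target, mu_new, mu_old))  # 8.0
```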

Figure 7: Average distribution shift across time for different weighting distributions, plotted against returns for a 256x256 model. We find that distribution shift does not have strong correlation with returns.

We run our experiments using Exact-FQI with a 256x256 layer architecture, and plot the distribution discrepancy and the loss discrepancy in Fig. 6. We find that Prioritized has the greatest shift, followed by on-policy variants. Replay buffers greatly reduce distribution shift compared to on-policy learning, which is similar to the de-correlation argument cited for its use by Mnih et al. (2015). However, we find that this metric correlates very little with the actual performance of the algorithm (Fig. 7). For example, prioritized weighting performs well yet has high distribution shift.

Overall, our experiments indicate that nonstationarities in both distributions and target values, when isolated, do not cause significant stability issues. Instead, other factors such as sampling error and function approximation appear to have more significant effects on performance. In light of these findings, we might therefore ask: can we design a better sampling distribution, one that disregards distribution shift and favors high entropy, that results in better final performance and is realizable in practice? We investigate this in the following section.

7 Sampling Distributions

As alluded to in Section 6, the choice of sampling distribution is an important design decision that can have a large impact on performance. Indeed, it is not immediately clear which distribution is ideal for Q-learning. In this section, we hope to shed some light on this issue.

7.1 Technical Background

Off-policy data has been cited as one component of the “deadly triad” for Q-learning (Sutton & Barto, 2018), which has the potential to cause instabilities in learning. On-policy distributions (Tsitsiklis & Van Roy, 1997) and fixed behavior distributions (Sutton et al., 2009b; Maei et al., 2010) have often been targeted for theoretical convergence analysis, and many works use importance sampling to correct for off-policyness (Precup et al., 2001; Munos et al., 2016). However, to our knowledge, there is relatively little guidance on how different weighting distributions compare in terms of convergence rate and final solutions.

Nevertheless, several works give hypotheses on good choices for weighting distributions. Munos (2005) provides an error bound which suggests that “more uniform” weighting distributions can guarantee better worst-case performance. Geist et al. (2017) suggest that when the state distribution is fixed, the action distribution should be weighted by the optimal policy for residual Bellman errors. In deep RL, several methods have been developed to prevent instabilities in Q-learning, such as prioritized replay (Schaul et al., 2015), and mixing replay buffer data with on-policy data (Hausknecht & Stone, 2016; Zhang & Sutton, 2017) has been found to be beneficial. In our experiments, we aim to empirically analyze multiple choices for weighting distributions to determine which are the most effective.

7.2 What Are the Best Weighting Distributions in Absence of Sampling Error?

Figure 8: Weighting distribution versus architecture in Exact-FQI. Replay(s, a) consistently provides the highest performance. Note that Adversarial Feature Matching is comparable to Replay(s, a), but surprisingly better for small networks.
Figure 9: Normalized returns plotted against normalized entropy for different weighting distributions. All experiments use Exact-FQI with a 256x256 network. We see a general trend that high-entropy distributions lead to greater performance.

We begin by studying the effect of weighting distributions when disentangled from sampling error. We run Exact-FQI with varying choices of architectures and weighting distributions and report our results in Fig. 8. , and consistently result in the highest returns across all architectures. We believe that these results are in favor of the uniformity hypothesis: the top performing distributions spread weight across larger support of the state-action space. For example, a replay buffer contains state-action tuples from many policies, and therefore would be expected to have wider support than the state-action distribution of a single policy. We can see this general trend in Fig. 9. These distributions generally result in the tightest contraction rates, and allow the Q-function to focus on locations where the error is high. In the sampled setting, this observation motivates exploration algorithms that maximize state coverage (for example,  Hazan et al. (2018) solve an exploration objective which maximizes state-space entropy). However, note that in this particular experiment, there is no sampling. All states are observed, just with different weights, thus isolating the issue of distributions from the issue of sampling.
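The uniformity hypothesis can be probed with a simple entropy measure on the weighting distribution. The sketch below computes a normalized entropy in $[0, 1]$; normalizing by the log of the number of state-action pairs is an assumption made here for illustration, since the exact normalization behind Fig. 9 is not spelled out in the text, and the three example distributions are hypothetical.

```python
import numpy as np

def normalized_entropy(mu):
    """Shannon entropy of a weighting distribution divided by log(number of entries),
    so 1.0 means uniform and 0.0 means all mass on a single state-action pair."""
    mu = np.asarray(mu, dtype=float).ravel()
    mu = mu / mu.sum()
    nonzero = mu[mu > 0]                       # convention: 0 * log 0 = 0
    entropy = -np.sum(nonzero * np.log(nonzero))
    return float(entropy / np.log(mu.size))

# Hypothetical weightings over 4 states x 2 actions.
uniform = np.full((4, 2), 1.0 / 8)
replay_like = np.array([[0.2, 0.1], [0.2, 0.1], [0.1, 0.1], [0.1, 0.1]])
single_policy = np.array([[0.7, 0.0], [0.3, 0.0], [0.0, 0.0], [0.0, 0.0]])

for name, mu in [("uniform", uniform), ("replay-like", replay_like),
                 ("single policy", single_policy)]:
    print(f"{name}: {normalized_entropy(mu):.3f}")
```

A distribution that mixes many policies, as a replay buffer does, covers more state-action pairs and therefore scores closer to 1 than the distribution of a single near-deterministic policy.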

7.3 Designing a Better Off-Policy Distribution: Adversarial Feature Matching

In our final study, we attempt to design a better weighting distribution using insights from previous sections that can be easily integrated into deep RL methods. We refer to this method as adversarial feature matching (AFM). We draw upon three specific insights outlined in the previous analysis. First, the function approximator should be incentivized to maximize its ability to distinguish states to minimize function approximation bias (Section 4). Second, the weighting distribution should emphasize areas where the Q-function incurs high Bellman error, in order to minimize the discrepancy between $\ell_2$ norm error and $\ell_\infty$ norm error. Third, more-uniform weighting distributions tend to perform better. The first insight was also demonstrated by Liu et al. (2018), where enforcing sparsity in the Q-function was found to provide locality in the Q-function, which prevented catastrophic interference and provided better values for bootstrapping.

We propose to model our problem as a minimax game, where the weighting distribution is a parameterized adversary $p_\phi$ which tries to maximize the Bellman error, while the Q-function ($Q_\theta$) tries to minimize it. Note that in the unconstrained setting, this game is equivalent to minimizing the $\ell_\infty$ norm error in its dual-norm representation. However, in practical settings where minimizing stochastic approximations of the $\ell_\infty$ norm can be difficult for neural networks (also noticed when using PER (Van Hasselt et al., 2018)), it is crucial to introduce constraints to limit the power of the adversary. These constraints also make the adversary closer to the uniform distribution while still allowing it to be sufficiently different at specific state-action pairs.

We elect to use a feature matching constraint which enforces the expected feature vectors, $\mathbb{E}[\Phi(s)]$, under the adversarial distribution $p_\phi$ to roughly match the expected feature vector under uniform sampling from the replay buffer. We can express the output of a neural network Q-function as $Q(s, a) = w_a^\top \Phi(s)$ or, in the continuous case, as $Q(s, a) = w^\top \Phi(s, a)$, where the feature vector $\Phi$ represents the output of all but the final layer. Intuitively, this constraint restricts the adversarial sampler to distributing probability mass among states (or state-action pairs) that are perceptually similar to the Q-function, which in turn forces the Q-function to reduce state aliasing by learning features that are more separable. Note that, in our case, $\Phi = \nabla_w Q$. This also provides a natural extension of our method by performing expected gradient matching over all parameters ($\nabla_\theta Q$), instead of matching only $\Phi$ (we leave it to future work to explore this direction). Formally, this objective is given as follows:

$$\min_\theta \max_\phi \; \mathbb{E}_{p_\phi(s, a)}\left[ \left( Q_\theta(s, a) - \mathcal{T} Q_\theta(s, a) \right)^2 \right] \quad \text{s.t.} \quad \left\| \mathbb{E}_{p_\phi(s, a)}[\Phi(s)] - \frac{1}{N} \sum_{i=1}^{N} \Phi(s_i) \right\| \le \varepsilon$$
Note that $\Phi(s)$ is a function of $\theta$ but, while solving the maximization, is assumed to be a constant. This is equivalent to solving only the inner maximization with a constraint, and empirically provides better stability. Implementation details for AFM are provided in Appendix D. The $\frac{1}{N}\sum_i$ denotes an estimator for the true expectation under some sampling distribution, such as a uniform distribution over all states and actions (in exact FQI) or the replay buffer distribution. So, holds when using a replay buffer.
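The inner maximization can be illustrated with a deliberately simplified stand-in. The sketch below is not the implementation described in Appendix D: it swaps in an entropy-regularized adversary over a fixed batch of samples, treats feature matching as an equality-style constraint with a vector dual variable, and alternates a closed-form best response for the weights with a gradient step on the dual variable. The temperature, the feature rescaling, the random features and errors, and all function names are assumptions introduced for this example.

```python
import numpy as np

def _softmax(x):
    x = x - x.max()
    w = np.exp(x)
    return w / w.sum()

def _scale(features):
    """Rescale so every feature vector has norm <= 1 (keeps the dual step size safe)."""
    phi = np.asarray(features, dtype=float)
    return phi / max(np.linalg.norm(phi, axis=1).max(), 1e-12)

def feature_mismatch(weights, features):
    """|| E_p[Phi] - (1/N) sum_i Phi_i ||_2 for weights p over the N samples."""
    phi = _scale(features)
    return float(np.linalg.norm(phi.T @ weights - phi.mean(axis=0)))

def adversarial_weights(sq_bellman_errors, features, temperature=0.1, num_steps=500):
    """Simplified adversary: upweight high-error samples while pulling the
    expected feature vector back toward its value under uniform sampling."""
    err = np.asarray(sq_bellman_errors, dtype=float)
    phi = _scale(features)
    phi_bar = phi.mean(axis=0)                 # expected features under uniform sampling
    lam = np.zeros(phi.shape[1])               # dual variable for the constraint
    for _ in range(num_steps):
        p = _softmax((err - phi @ lam) / temperature)   # adversary's best response
        lam += temperature * (phi.T @ p - phi_bar)      # dual step on the violation
    return _softmax((err - phi @ lam) / temperature)

rng = np.random.default_rng(0)
features = rng.normal(size=(64, 4))            # hypothetical penultimate-layer features
errors = rng.random(64)                        # hypothetical squared Bellman errors
p = adversarial_weights(errors, features)
unconstrained = _softmax(errors / 0.1)         # same adversary without the constraint
print(feature_mismatch(unconstrained, features), feature_mismatch(p, features))
```

The second printed number should be no larger than the first: the constrained adversary still favors high-error samples, but only to the extent that the reweighted features remain close to the buffer average. The resulting weights would then multiply the per-sample Bellman errors in the outer minimization over the Q-function.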

While both AFM and PER tend to upweight samples in the buffer with a high Bellman error, PER explicitly attempts to reduce distribution shift via importance sampling. As we observed in Section 6, distributional shift is not actually harmful in practice, and AFM dispenses with this goal, instead explicitly aiming to rebalance the buffer to attain better coverage via adversarial optimization. In our experiments, this results in substantially better performance, consistent with the hypothesis that coverage, rather than reduction of distributional shift, is the most important property in a sampling distribution.

In tabular domains with Exact-FQI, we find that AFM performs at par with the top performing weighting distributions, such as and better than (Fig. 8). This confirms that adaptive prioritization works better than Prioritized(s, a). Another benefit of AFM is its robustness to function approximation: the performance gains for small architectures are particularly noticeable (Fig. 8).

In tabular domains with Replay-FQI (Table 1), we also compare AFM to prioritized replay (PER) (Schaul et al., 2015), where AFM and PER perform similarly in terms of normalized returns. Note that AFM reweights samples drawn uniformly from the buffer, whereas PER changes which samples are actually drawn. We also evaluate a variant of AFM (AFM+Sampling in Table 1) which changes which samples are drawn instead of reweighting them. Essentially, in this version we sample from the replay buffer using probabilities determined by the AFM optimization, rather than using importance sampling while making Bellman updates. We note that, in Table 1, AFM+Sampling performs strictly better than AFM and PER.
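The difference between reweighting and resampling can be sketched as two estimators of the same weighted objective $\sum_i p_i \ell_i$. In the code below, the per-sample losses and the weights `p` are hypothetical placeholders (in AFM the weights would come from the adversarial optimization), and the function names are introduced only for this illustration.

```python
import numpy as np

def reweighted_loss(losses, p, batch_idx):
    """Reweighting: the batch is drawn uniformly and each loss is scaled by N * p_i."""
    n = len(losses)
    return float(np.mean(n * p[batch_idx] * losses[batch_idx]))

def resampled_loss(losses, p, batch_size, rng):
    """Resampling: the batch itself is drawn with probabilities p, losses are unweighted."""
    batch_idx = rng.choice(len(losses), size=batch_size, p=p)
    return float(np.mean(losses[batch_idx]))

rng = np.random.default_rng(0)
losses = rng.random(100)                 # hypothetical per-sample Bellman errors in the buffer
p = losses / losses.sum()                # hypothetical weights favoring high-error samples
exact = float(np.sum(p * losses))        # the weighted objective both estimators target

uniform_batch = rng.integers(0, 100, size=32)
print(exact)
print(reweighted_loss(losses, p, uniform_batch))
print(resampled_loss(losses, p, 32, rng))
```

Both estimators target the same quantity in expectation; they differ in which transitions actually enter the minibatch, which is the distinction between AFM and AFM+Sampling drawn above.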

We further evaluate AFM on MuJoCo tasks with the TD3 algorithm (Fujimoto et al., 2018) and the entropy-constrained SAC algorithm (Haarnoja et al., 2018). We find that in all 3 tested domains (Half-Cheetah, Hopper and Ant), AFM yields substantial empirical improvement in the case of TD3 (Fig. 10) and performs slightly better than entropy-constrained SAC (Fig. 11). Surprisingly, we found that PER does not work very well in these domains. In light of these results, we conclude that: (1) the choice of sampling distribution is very important for performance, and (2) incorporating knowledge about the function approximator (for example, through $\Phi$) into the choice of the sampling/weighting distribution can be very effective.

Figure 10: Average return for rollouts performed with a trained TD3 model with/without AFM (Ours) and with Prioritized Replay (PER), on (a) Ant-v2, (b) Hopper-v2, and (c) HalfCheetah-v2. Note that on average AFM performs better than the baseline and Prioritized Replay. Each iteration on the x-axis corresponds to 5000 environment steps.
Figure 11: Average return for rollouts performed with a trained SAC model with temperature auto-tuning (Tuomas Haarnoja & Levine, 2018) with/without AFM, on (a) Ant-v2, (b) Hopper-v2, and (c) HalfCheetah-v2. Note that on average AFM performs slightly better and is always at least at par with SAC. Each iteration on the x-axis corresponds to 1000 environment steps.
Sampling distribution        Norm. Returns (16, 16)   Norm. Returns (64, 64)
None                         0.18                     0.23
Uniform(s, a)                0.19                     0.25
                             0.45                     0.39
                             0.30                     0.21
Prioritized(s, a)            0.17                     0.33
PER (Schaul et al., 2015)    0.42                     0.49
AFM (Ours)                   0.41                     0.48
AFM + Sampling (Ours)        0.43                     0.51
Table 1: Average performance of various sampling distributions for (16, 16) and (64, 64) neural nets in the setting with replay buffers, where sampling error, function approximation error, and distribution shift all coexist, averaged across 5 random seeds. PER, our AFM, and on-policy sampling perform roughly at par on benchmark tasks in expectation when using (16, 16) architectures. However, note that is generally computationally intractable. Another point to note is that AFM performs as well as PER just by virtue of weighting and not sampling. AFM+Sampling, which is the sampling analogue of AFM, works better than PER and AFM on average on tabular domains.

8 Conclusions and Discussion

From our analysis, we have several broad takeaways for the design of deep Q-learning algorithms.

Potential convergence issues with Q-learning do not seem to be endemic empirically, but function approximation still has a strong impact on the solution to which these methods converge. This impact goes beyond just approximation error, suggesting that Q-learning methods do find suboptimal solutions (within the given function class) with smaller function approximators. However, expressive architectures largely mitigate this problem, suffer less from bootstrapping error, converge faster, and are more stable with moving targets.

Sampling error can cause substantial overfitting problems with Q-learning. However, replay buffers and early stopping can mitigate this problem, and the biases incurred from small function approximators outweigh any benefits they may have in terms of overfitting. We believe the best strategy is to keep large architectures but carefully select the number of gradient steps used per sample. We showed that employing oracle early stopping techniques can provide huge benefits in the performance of Q-learning. This motivates the future research direction of devising early stopping techniques to dynamically control the number of gradient steps in Q-learning, rather than setting it as a hyperparameter, since this choice can give rise to large differences in performance.

The choice of sampling or weighting distribution has a significant effect on solution quality, even in the absence of sampling error. Surprisingly, we do not find on-policy distributions to be the most performant; rather, distributions which have high state-entropy and spread mass uniformly among state-action pairs seem to be highly effective for training. Based on these insights, we propose a new weighting distribution, AFM, which balances high entropy and state aliasing and yields fair improvements in both tabular and continuous domains with state-of-the-art off-policy RL algorithms.

Finally, we note that there are many other topics in Q-learning that we did not investigate, such as overestimation bias and multi-step returns. We believe that these issues too could be studied in future work with our oracle-based analysis framework.


We thank Vitchyr Pong and Kristian Hartikainen for providing us with implementations of RL algorithms. We thank Chelsea Finn for comments on an earlier draft of this paper. SL thanks George Tucker for helpful discussion. We thank Google, NVIDIA, and Amazon for providing computational resources. This research was supported by Berkeley DeepDrive, NSF IIS-1651843 and IIS-1614653, the DARPA Assured Autonomy program, and ARL DCIST CRA W911NF-17-2-0181.


  • Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML), pp. 214–223, 2017.
  • Baird (1995) Baird, L. Residual Algorithms : Reinforcement Learning with Function Approximation. In International Conference on Machine Learning (ICML), 1995.
  • Bertsekas & Tsitsiklis (1996) Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming. Athena Scientific, 1996.
  • Dai et al. (2018) Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J., and Song, L. Sbeed: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, pp. 1133–1142, 2018.
  • Daskalakis et al. (2018) Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. Training GANs with optimism. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SJJySbbAZ.
  • Ernst et al. (2005) Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
  • Farrell & Berger (1995) Farrell, J. A. and Berger, T. On the effects of the training sample density in passive learning control. In American Control Conference, 1995.
  • Fujimoto et al. (2018) Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning (ICML), pp. 1587–1596, 2018.
  • Geist et al. (2017) Geist, M., Piot, B., and Pietquin, O. Is the bellman residual a bad proxy? In Advances in Neural Information Processing Systems (NeurIPS), pp. 3205–3214. 2017.
  • Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML), 2017.
  • Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018. URL http://arxiv.org/abs/1801.01290.
  • Hausknecht & Stone (2016) Hausknecht, M. and Stone, P. On-policy vs. off-policy updates for deep reinforcement learning. In Deep Reinforcement Learning: Frontiers and Challenges, IJCAI, 2016.
  • Hazan et al. (2018) Hazan, E., Kakade, S. M., Singh, K., and Van Soest, A. Provably efficient maximum entropy exploration. arXiv preprint arXiv:1812.02690, 2018.
  • Kalashnikov et al. (2018) Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., and Levine, S. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In CoRL, volume 87 of Proceedings of Machine Learning Research, pp. 651–673. PMLR, 2018.
  • Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR), 2015.
  • Lin (1992) Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
  • Liu et al. (2018) Liu, V., Kumaraswamy, R., Le, L., and White, M. The utility of sparse representations for control in reinforcement learning. CoRR, abs/1811.06626, 2018. URL http://arxiv.org/abs/1811.06626.
  • Maei et al. (2010) Maei, H. R., Szepesvári, C., Bhatnagar, S., and Sutton, R. S. Toward off-policy learning control with function approximation. In International Conference on Machine Learning (ICML), 2010.
  • Maillard et al. (2010) Maillard, O.-A., Munos, R., Lazaric, A., and Ghavamzadeh, M. Finite-sample analysis of bellman residual minimization. In Asian Conference on Machine Learning (ACML), pp. 299–314, 2010.
  • Metelli et al. (2018) Metelli, A. M., Papini, M., Faccio, F., and Restelli, M. Policy optimization via importance sampling. CoRR, abs/1809.06098, 2018. URL http://arxiv.org/abs/1809.06098.
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, feb 2015. ISSN 0028-0836.
  • Munos (2005) Munos, R. Error bounds for approximate value iteration. In AAAI Conference on Artificial Intelligence (AAAI), pp. 1006–1011. AAAI Press, 2005.
  • Munos & Szepesvári (2008) Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.
  • Munos et al. (2016) Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1054–1062, 2016.
  • Plappert et al. (2018) Plappert, M., Andrychowicz, M., Ray, A., McGrew, B., Baker, B., Powell, G., Schneider, J., Tobin, J., Chociej, M., Welinder, P., Kumar, V., and Zaremba, W. Multi-goal reinforcement learning: Challenging robotics environments and request for research, 2018.
  • Precup et al. (2001) Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal difference learning with function approximation. In International Conference on Machine Learning (ICML), pp. 417–424, 2001.
  • Riedmiller (2005) Riedmiller, M. Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317–328. Springer, 2005.
  • Schaul et al. (2015) Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. International Conference on Learning Representations (ICLR), 2015.
  • Scherrer (2010) Scherrer, B. Should one compute the temporal difference fix point or minimize the bellman residual? the unified oblique projection view. In International Conference on Machine Learning (ICML), pp. 959–966, 2010.
  • Shalev-Shwartz & Ben-David (2014) Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • Shani et al. (2005) Shani, G., Heckerman, D., and Brafman, R. I. An mdp-based recommender system. Journal of Machine Learning Research, 6(Sep):1265–1295, 2005.
  • Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. Second edition, 2018.
  • Sutton et al. (2009a) Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning (ICML), 2009a.
  • Sutton et al. (2009b) Sutton, R. S., Maei, H. R., and Szepesvári, C. A convergent o(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems (NeurIPS), 2009b.
  • Sutton et al. (2016) Sutton, R. S., Mahmood, A. R., and White, M. An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17(1):2603–2631, 2016.
  • Szepesvári (1998) Szepesvári, C. The asymptotic convergence-rate of q-learning. In Advances in Neural Information Processing Systems, pp. 1064–1070, 1998.
  • Tosatto et al. (2017) Tosatto, S., Pirotta, M., D’Eramo, C., and Restelli, M. Boosted fitted q-iteration. In International Conference on Machine Learning (ICML), pp. 3434–3443. JMLR. org, 2017.
  • Tsitsiklis & Van Roy (1997) Tsitsiklis, J. N. and Van Roy, B. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1075–1081, 1997.
  • Tuomas Haarnoja & Levine (2018) Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., and Levine, S. Soft actor-critic algorithms and applications. Technical report, 2018.
  • Van Hasselt et al. (2018) Van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., and Modayil, J. Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648, 2018.
  • Watkins & Dayan (1992) Watkins, C. J. and Dayan, P. Q-learning. Machine learning, 8(3-4):279–292, 1992.
  • Yazıcı et al. (2019) Yazıcı, Y., Foo, C.-S., Winkler, S., Yap, K.-H., Piliouras, G., and Chandrasekhar, V. The unusual effectiveness of averaging in GAN training. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SJgw_sRqFQ.
  • Zhang & Sutton (2017) Zhang, S. and Sutton, R. S. A deeper look at experience replay. CoRR, abs/1712.01275, 2017. URL http://arxiv.org/abs/1712.01275.


Appendix A Benchmark Tabular Domains

We evaluate on a benchmark of 8 tabular domains, selected for qualitative differences.

4 Gridworlds. The Gridworld environment is an NxN grid with randomly placed walls. The reward is proportional to Manhattan distance to a goal state (1 at the goal, 0 at the initial position), and there is a 5% chance the agent travels in a different direction than commanded. We vary two parameters: the size ( and ), and the state representations. We use a “one-hot” representation, an (X, Y) coordinate tuple (represented as two one-hot vectors), and a “random” representation, a vector drawn from , where N is the width or height of the Gridworld. The random observation significantly increases the challenge of function approximation, as significant state aliasing occurs.

Cliffwalk: Cliffwalk is a toy example from Schaul et al. (2015). It consists of a sequence of states, where each state has two allowed actions: advance to the next state or return to the initial state. A reward of 1.0 is obtained when the agent reaches the final state. Observations consist of vectors drawn from .

InvertedPendulum and MountainCar: InvertedPendulum and MountainCar are discretized versions of continuous control tasks found in OpenAI gym (Plappert et al., 2018), and are based on problems from classical RL literature. In the InvertedPendulum task, an agent must swing up a pendulum and hold it in its upright position. The state consists of the angle and angular velocity of the pendulum. Maximum reward is given when the pendulum is upright. The observation consists of the sine and cosine of the pendulum angle, and the angular velocity. In the MountainCar task, the agent must push a vehicle up a hill, but the hill is steep enough that the agent must gather momentum by swinging back and forth within a valley in order to reach the top. The state consists of the position and velocity of the vehicle.

SparseGraph: The SparseGraph environment is a 256-state graph with randomly drawn edges. Each state has two edges, each corresponding to an action. One state is chosen as the goal state, where the agent receives a reward of one.
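As a rough illustration of how two of these domains could be written down as tabular MDPs, the sketch below builds transition and reward arrays for SparseGraph and Cliffwalk. Details not specified above are assumptions and are marked as such in the comments: where exactly the reward is paid, what happens in the final Cliffwalk state, and the chain length. Observations are omitted. The function names are hypothetical.

```python
import numpy as np

def sparse_graph(num_states=256, seed=0):
    """256-state graph; each state has two random outgoing edges, one per action.
    ASSUMPTION: a reward of 1 is paid for any action taken in the goal state."""
    rng = np.random.default_rng(seed)
    next_state = rng.integers(0, num_states, size=(num_states, 2))
    goal = int(rng.integers(0, num_states))
    P = np.zeros((num_states, 2, num_states))
    P[np.arange(num_states)[:, None], np.arange(2)[None, :], next_state] = 1.0
    R = np.zeros((num_states, 2))
    R[goal, :] = 1.0
    return P, R, goal

def cliffwalk(num_states=8):
    """Chain of states; action 0 advances, action 1 returns to the initial state.
    ASSUMPTIONS: the reward of 1.0 is paid on the transition into the final
    state, the final state is absorbing, and the chain has 8 states."""
    P = np.zeros((num_states, 2, num_states))
    R = np.zeros((num_states, 2))
    last = num_states - 1
    for s in range(last):
        P[s, 0, s + 1] = 1.0      # advance to the next state
        P[s, 1, 0] = 1.0          # return to the initial state
    P[last, :, last] = 1.0        # absorbing final state
    R[last - 1, 0] = 1.0          # reward for reaching the final state
    return P, R

P_graph, R_graph, goal = sparse_graph()
P_cliff, R_cliff = cliffwalk()
print(P_graph.shape, R_graph.sum(), P_cliff.shape, R_cliff.sum())
```

Arrays in this `(S, A, S)` / `(S, A)` layout are enough to run exact Q-iteration and obtain an oracle solution for comparison.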

Appendix B Fitted Q-iteration with Bounded Projection Error

When function approximation is introduced to Q-iteration, we lose guarantees that our solution will converge to the optimal solution $Q^*$, because the composition of projection and backup is no longer guaranteed to be a contraction under any norm. However, this does not imply divergence, and in most cases it merely degrades the quality of solution found.

This can be seen by recalling the following result from Bertsekas & Tsitsiklis (1996), which describes the quality of the solution obtained by fitted Q-iteration (FQI) when the projection error at each step is bounded. The conclusion is that FQI converges to an $\ell_\infty$ ball around the optimal solution whose radius scales proportionally with the projection error. While this statement does not claim that divergence cannot occur in general (this theorem can only be applied in retrospect, since we cannot always uniformly bound the projection error at each iteration), it nevertheless offers important intuitions on the behavior of FQI under approximation error. For similar results concerning $\mu$-weighted norms, see Munos (2005).

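Theorem B.1 (Bounded error in fitted Q-iteration).

Let the projection or Bellman error at each iteration of FQI be uniformly bounded by $\delta$, i.e. $\|Q_{t+1} - \mathcal{T} Q_t\|_\infty \le \delta$ for all $t$. Then, the error in the final solution is bounded as

$$\limsup_{t \to \infty} \|Q_t - Q^*\|_\infty \le \frac{\delta}{1 - \gamma}$$

See Chapter 6 of Bertsekas & Tsitsiklis (1996). ∎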
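The flavor of this result can be checked numerically. The sketch below runs Q-iteration on a small random MDP while perturbing every backup by noise of sup-norm at most $\delta$, as a crude stand-in for projection error, and compares the final sup-norm distance to $Q^*$ against the standard bound $\delta / (1 - \gamma)$. The random MDP, the uniform noise model, and the constants are assumptions made for this example.

```python
import numpy as np

def bellman_backup(Q, P, R, gamma):
    """Bellman optimality backup for a tabular MDP with P of shape (S, A, S)."""
    return R + gamma * (P @ Q.max(axis=1))

rng = np.random.default_rng(0)
S, A, gamma, delta = 5, 2, 0.9, 0.05
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((S, A))

# Reference solution Q* from exact Q-iteration.
Q_star = np.zeros((S, A))
for _ in range(2000):
    Q_star = bellman_backup(Q_star, P, R, gamma)

# Approximate Q-iteration: each backup is perturbed by an error whose
# sup-norm is at most delta, standing in for the projection error.
Q = np.zeros((S, A))
num_iters = 500
for _ in range(num_iters):
    noise = rng.uniform(-delta, delta, size=(S, A))
    Q = bellman_backup(Q, P, R, gamma) + noise

final_error = np.abs(Q - Q_star).max()
bound = delta / (1.0 - gamma)
print(f"final error {final_error:.3f} vs. bound {bound:.3f}")
```

The iterates do not converge to $Q^*$, but they remain inside a ball whose radius is proportional to the per-iteration error, which is the behavior the theorem describes.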

We can use this statement to provide a bound on the performance of the final policy.

Corollary B.1.1.

Suppose we run fitted Q-iteration, and let the projection error at each iteration be uniformly bounded by $\delta$, i.e. $\|Q_{t+1} - \mathcal{T} Q_t\|_\infty \le \delta$ for all $t$. Letting denote the returns of a policy $\pi$, the performance of the final policy is bounded as:


This result is obtained by substituting Thm. B.1 into Proposition 6.1 of Bertsekas & Tsitsiklis (1996). ∎

B.1 Unbounded divergence in FQI

Because norms are bounded by the norm, Thm. B.1 implies that unbounded divergence is impossible when weighting distribution has positive support at all states and actions (i.e. ), and the projection is non-expansive in the norm (such as when using linear approximators).

We can bound the -weighted in terms of the as follows: . Thus, we can apply Thm. B.1 with to show that unbounded divergence is impossible. Note that because this bound scales with the size of the state and action spaces, it is fairly loose in many practical cases, and practitioners may nevertheless see Q-values grow to large values (tighter bounds concerning L2 norms can be found in (Munos, 2005), which depend on the transition distribution). It also suggests that distributions which are fairly uniform (so as to maximize the denominator) can perform well.

When the weighting distribution does not have support over all states and actions, divergence can still occur, as noted in counterexamples such as Section 11.2 of Sutton & Barto (2018). In this case, we consider two states (state 1 and 2) with feature vectors 1 and 2, respectively, and a linear approximator with parameter $w$. There exists a single action with a deterministic transition from state 1 to state 2, and we only sample the transition from state 1 to state 2 (i.e. $\mu$ is 1 for state 1 and 0 for state 2). All rewards are 0. In this case, the projected Bellman backup takes the form:

$$w_{t+1} = \arg\min_w \, (w - 2\gamma w_t)^2 = 2\gamma w_t$$

which will cause unbounded growth when iterated, provided $\gamma > 0.5$. However, if we add a transition from state 2 back to itself or to state 1, and place nonzero probability on sampling these transitions, divergence can be avoided.
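This counterexample is small enough to simulate directly. The sketch below implements the weighted least-squares projection for the two-state problem, with the added self-loop at state 2 so that different weightings can be compared. The specific weighting `[0.1, 0.9]` and the helper names are illustrative assumptions.

```python
import numpy as np

PHI = np.array([1.0, 2.0])        # scalar features of state 1 and state 2
NEXT = np.array([1, 1])           # state 1 -> state 2; state 2 -> state 2 (added self-loop)

def projected_backup(w, gamma, mu):
    """One step of fitted Q-iteration with values PHI[s] * w and zero rewards:
    weighted least-squares fit of PHI * w_new to the targets gamma * PHI[NEXT] * w."""
    targets = gamma * PHI[NEXT] * w
    return float(np.sum(mu * PHI * targets) / np.sum(mu * PHI ** 2))

def iterate(w0, gamma, mu, steps=50):
    w = w0
    for _ in range(steps):
        w = projected_backup(w, gamma, mu)
    return w

only_state1 = np.array([1.0, 0.0])     # only the 1 -> 2 transition is sampled
mostly_state2 = np.array([0.1, 0.9])   # illustrative weighting that also samples the self-loop

print(iterate(1.0, 0.9, only_state1))    # grows like (2 * 0.9) ** steps
print(iterate(1.0, 0.9, mostly_state2))  # shrinks toward 0
```

In this toy problem the per-step growth factor works out to $2\gamma(2 - p)/(4 - 3p)$, where $p$ is the weight on state 1. How much weight the added transition receives therefore matters: with $\gamma = 0.9$, an equal weighting gives a factor of 1.08 and still grows, whereas the weighting used above gives roughly 0.92.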

Appendix C $\alpha$-smoothed Q-iteration

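In this section we show that the $\alpha$-smoothed Bellman backup introduced in Section 6 is still a valid Q-iteration method, in that it is a contraction (for $0 < \alpha \le 1$) and thus converges to $Q^*$.

We define the $\alpha$-smoothed Bellman backup $\mathcal{T}^\alpha$ as:

$$\mathcal{T}^\alpha Q = \alpha \mathcal{T} Q + (1 - \alpha) Q$$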
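Theorem C.1 (Contraction rate of the $\alpha$-smoothed Bellman backup).

$\mathcal{T}^\alpha$ is a $(1 - \alpha + \alpha\gamma)$-contraction:

$$\|\mathcal{T}^\alpha Q_1 - \mathcal{T}^\alpha Q_2\|_\infty \le (1 - \alpha + \alpha\gamma) \, \|Q_1 - Q_2\|_\infty$$

This statement follows from a straightforward application of the triangle inequality and the fact that $\mathcal{T}$ is a $\gamma$-contraction:

$$\begin{aligned} \|\mathcal{T}^\alpha Q_1 - \mathcal{T}^\alpha Q_2\|_\infty &= \|\alpha (\mathcal{T} Q_1 - \mathcal{T} Q_2) + (1 - \alpha)(Q_1 - Q_2)\|_\infty \\ &\le \alpha \|\mathcal{T} Q_1 - \mathcal{T} Q_2\|_\infty + (1 - \alpha) \|Q_1 - Q_2\|_\infty \\ &\le (\alpha\gamma + 1 - \alpha) \, \|Q_1 - Q_2\|_\infty \end{aligned}$$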
Figure 12: Results for the $\alpha$-smoothed Bellman backup experiment. Normalized norm error to $Q^*$ and normalized returns plotted for different values of $\alpha$ and architectures. Values are averaged over all domains and 5 seeds. For large architectures, higher values of $\alpha$ result in faster convergence and higher asymptotic returns. However, for smaller architectures, low values of $\alpha$ slightly outperform higher values.

Appendix D Adversarial Feature Matching (AFM): Detailed Explanation and Practical Implementation

As described in Section 7.3, we devise a novel weighting scheme for the Bellman error objective based on an adversarial minimax game. The adversary computes weights $p_\phi(s, a)$ (representing the weighting distribution $\mu$) for the Bellman error objective $\mathbb{E}_{p_\phi(s,a)}[(Q_\theta(s,a) - y(s,a))^2]$. Recalling from Section 7.3, the optimization problem is given by:

$$\min_\theta \max_\phi \; \mathbb{E}_{p_\phi(s,a)}\left[\left(Q_\theta(s,a) - y(s,a)\right)^2\right] \quad \text{s.t.} \quad \left\| \mathbb{E}_{p_\phi(s,a)}[\Phi(s)] - \frac{1}{N}\sum_{i=1}^{N} \Phi(s_i) \right\| \le \epsilon,$$

where $y(s,a)$ denotes the target value, $s_1, \ldots, s_N$ are states sampled from the replay buffer, and $\Phi(s)$ are the state features learned by the Q-function approximator. $\Phi(s)$ is easy to extract from the multi-headed model typically used for discrete action control, as one choice is to let $\Phi(s)$ be the output of the penultimate layer of the Q-network. For continuous control tasks, however, we model $\Phi(s,a)$ (which is a function of the actions as well), as state-only features are unavailable unless separately modeled. This can also be interpreted as a feature matching constraint on the gradient of $Q_\theta$ with respect to the parameters $w$ of the last linear layer, since $Q_\theta(s,a) = w^\top \Phi(s,a)$ implies $\nabla_w Q_\theta(s,a) = \Phi(s,a)$. A possible extension is to take the entire gradient as the features in the feature matching constraint, that is, $\Phi(s,a) = \nabla_\theta Q_\theta(s,a)$.

This choice of constraint is suitable and can be interpreted in two ways. First, an adversary constrained in this manner has enough power to exploit the Q-network at states which get aliased under the chosen function class, thereby promoting more separable feature learning and reducing some negative aspects of function approximation that can arise in Q-learning. This is also similar in motivation to (Liu et al., 2018). Second, this feature constraint also bears a similarity to the Maximum Mean Discrepancy (MMD) distance between two distributions $p$ and $q$, which can be written as $\mathrm{MMD}(p, q) = \|\mathbb{E}_{x \sim p}[\varphi(x)] - \mathbb{E}_{y \sim q}[\varphi(y)]\|_{\mathcal{H}}$, where $\varphi: \mathcal{X} \to \mathcal{H}$ is the canonical feature map (from real space to the RKHS). In our context, this is analogous to optimizing a distance between the adversarial distribution $p_\phi$ and the replay buffer distribution $p_{rb}$ (as the average $\frac{1}{N}\sum_i \Phi(s_i)$ is a Monte Carlo estimator of the expected $\Phi$ under the replay buffer distribution $p_{rb}$). In light of these arguments, AFM and other associated methods that take the properties of the function approximator into account can reduce the bias incurred due to function approximation in the course of Q-learning/FQI, as depicted in 1.

Solving the optimization

We solve this saddle point problem using alternating dual gradient descent. We first solve the inner maximization problem, and then use its solution to solve the outer minimization problem. We first compute the Lagrangian for the maximization by introducing a dual variable $\lambda \ge 0$:

$$\mathcal{L}(\phi; \lambda) = -\mathbb{E}_{p_\phi(s,a)}\left[\left(Q_\theta(s,a) - y(s,a)\right)^2\right] + \lambda \left( \left\| \mathbb{E}_{p_\phi(s,a)}[\Phi(s)] - \frac{1}{N}\sum_{i=1}^{N} \Phi(s_i) \right\| - \epsilon \right),$$

where $\epsilon$ is the threshold of the feature matching constraint. (Note that this Lagrangian is flipped in sign because we first convert the maximization problem to standard minimization form.) We now solve the inner problem using dual gradient descent. We then plug the approximate solutions $(\phi^*, \lambda^*)$ obtained after gradient descent into the Lagrangian to solve the outer minimization over $\theta$. Note that while $\Phi$ depends on $\theta$ (as it is the feature layer of the Q-network), we do not backpropagate through $\Phi$ while solving the minimization. This improves the stability of Q-network training in practice and makes sure that the Q-function is only affected by FQI updates. In practice, we take up to 10 gradient steps for the inner problem for every 1 gradient step of the outer problem. The algorithm is summarized in Algorithm 4. Our results in the main paper and here do not rely on additional tricks such as Optimistic Gradient (Daskalakis et al., 2018) or an exponential moving average of the parameters (Yazıcı et al., 2019). Our tabular experiments seemed to benefit somewhat from these tricks.

1:  Initialize Q-value approximator $Q_\theta$, projection distribution $p_\phi$, threshold $\epsilon$
2:  for step in {1, …, N} do
3:     Initialize Q-value approximator $Q_\theta$.
4:     Evaluate $Q_\theta(s, a)$ at all states.
5:     Compute exact target values $y = \mathcal{T} Q_\theta$ at all states.
6:     Minimize the negative projection loss with respect to $p_\phi$, subject to the feature matching constraint, exactly over all states and actions; maximize the dual loss w.r.t. $\lambda$.
7:     Repeat Step 6 for $K$ steps.
8:     Minimize projection loss with respect to $\theta$: $\theta \leftarrow \arg\min_\theta \mathbb{E}_{p_\phi(s,a)}[(Q_\theta(s,a) - y(s,a))^2]$
9:  end for
Algorithm 4 AFM with Exact-FQI
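The inner loop of Algorithm 4 (Steps 6 and 7) can be sketched in a few lines for the tabular case. This is an illustrative sketch under several assumptions that are not taken from the original implementation: the adversary is a softmax over state-action pairs, the squared Bellman errors are held fixed during the inner loop, the reference feature mean is uniform over all pairs, and the constraint is written with the squared norm to keep the gradients simple. The function names, step sizes, and step count are hypothetical, and the outer minimization over $\theta$ (Step 8) is omitted.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def afm_inner_loop(errors, feats, eps, steps=200, lr=0.5, lr_dual=0.5):
    """Approximate the adversary p and dual variable lam by dual gradient descent.

    errors[i]: squared Bellman error of state-action pair i (held fixed here).
    feats[i]:  feature vector of pair i.
    Assumption: the constraint is ||E_p[feats] - mean(feats)||^2 <= eps^2.
    """
    n, d = len(errors), len(feats[0])
    target = [sum(f[k] for f in feats) / n for k in range(d)]  # uniform feature mean
    logits, lam = [0.0] * n, 0.0
    for _ in range(steps):
        p = softmax(logits)
        gap = [sum(p[i] * feats[i][k] for i in range(n)) - target[k] for k in range(d)]
        violation = sum(g * g for g in gap) - eps ** 2
        # dL/dp_i for L = -sum_i p_i * e_i + lam * (||gap||^2 - eps^2)
        dL_dp = [-errors[i] + lam * 2.0 * sum(feats[i][k] * gap[k] for k in range(d))
                 for i in range(n)]
        avg = sum(p[i] * dL_dp[i] for i in range(n))
        # Chain rule through the softmax, then a descent step on the logits.
        logits = [logits[i] - lr * p[i] * (dL_dp[i] - avg) for i in range(n)]
        # Projected ascent step on the dual variable (keeps lam >= 0).
        lam = max(0.0, lam + lr_dual * violation)
    return softmax(logits), lam
```

When `eps` is large the constraint is inactive, `lam` stays at zero, and the adversary shifts mass toward the pairs with the largest Bellman error; a small `eps` makes `lam` grow whenever the weighted feature mean drifts away from the reference mean.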

Practical implementation with replay buffers

We incorporate this weighting/sampling distribution into Q-learning in the setting with replay buffers and with state-action sampling. We evaluate the weighting version of our method, AFM, where we sample a large batch of state-action pairs from the usual replay buffer used in Q-learning, but use importance weights to match the adversarial distribution $p_\phi$ in expectation. Thus, we use a parametric function approximator to model $w_\psi(s,a)$, that is, the importance weights of the adversarial distribution $p_\phi$ with respect to the replay buffer distribution $p_{rb}$. Mathematically, we estimate $\mathbb{E}_{p_\phi(s,a)}[\Delta(s,a)] = \mathbb{E}_{p_{rb}(s,a)}[w_\psi(s,a)\,\Delta(s,a)]$, where $\Delta(s,a) = (Q_\theta(s,a) - y(s,a))^2$ and $w_\psi(s,a) = p_\phi(s,a) / p_{rb}(s,a)$. The latter expectation is then approximated using a finite set of samples. It has been noted in the literature that importance sampling (IS) suffers from high variance, especially if the number of samples is small. Hence, we use the self-normalized importance sampling estimator, which normalizes the importance weights by their sum over the sampled batch. That is, let $Z = \sum_j w_\psi(x_j)$; then instead of using $w_\psi(x_i)$ as the importance weights, we use $w_\psi(x_i) / Z$ (where $x_i$ and $x_j$ represent state-action tuples, abbreviated for visual clarity). We also regularize the second-order Renyi divergence between $p_\phi$ and $p_{rb}$ for stability. Mathematically, it can be shown that the regularized objective is a lower bound on the true expectation of $\Delta$ under $p_\phi$, which is being estimated using importance sampling. This result has also been shown in (Metelli et al., 2018) (Theorem 4.1), where the authors use this lower bound in policy optimization via importance sampling. We state the theorem below for completeness.

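Theorem D.1.

(Metelli et al., 2018) Let $P$ and $Q$ be two probability measures on the measurable space $(\mathcal{X}, \mathcal{F})$ such that $P \ll Q$ and $d_2(P \| Q) < +\infty$. Let $x_1, \ldots, x_N$ be i.i.d. random variables sampled from $Q$, and $f: \mathcal{X} \to \mathbb{R}$ be a bounded function. Then, for any $0 < \delta \le 1$ and $N > 0$, with probability at least $1 - \delta$ it holds that:

$$\mathbb{E}_{x \sim P}[f(x)] \ge \frac{1}{N}\sum_{i=1}^{N} w_{P/Q}(x_i)\, f(x_i) - \|f\|_\infty \sqrt{\frac{(1 - \delta)\, d_2(P \| Q)}{\delta N}},$$

where $w_{P/Q} = dP/dQ$ is the importance weight and $d_2(P \| Q)$ is the exponentiated second-order Renyi divergence between $P$ and $Q$. (Here $Q$ denotes a probability measure, not the Q-function.)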
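To make the self-normalized weights and the lower bound of Theorem D.1 concrete, the following minimal sketch computes both from a batch of importance weights. It rests on assumptions not stated in the original: the exponentiated second-order Renyi divergence is replaced by its plug-in sample estimate (the batch mean of the squared weights), and `f_sup` is a user-supplied stand-in for $\|f\|_\infty$. All function and argument names are hypothetical.

```python
import math


def self_normalized_weights(weights):
    """Divide each importance weight by the batch sum so the weights sum to 1."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("importance weights must have a positive sum")
    return [w / total for w in weights]


def renyi_d2_estimate(weights):
    """Plug-in estimate of the exponentiated 2-Renyi divergence: mean of squared weights."""
    return sum(w * w for w in weights) / len(weights)


def is_lower_bound(weights, values, f_sup, delta):
    """Importance sampling estimate minus the penalty term of Theorem D.1.

    Assumption: the true divergence d_2 is replaced by its sample estimate.
    """
    n = len(weights)
    is_estimate = sum(w * f for w, f in zip(weights, values)) / n
    penalty = f_sup * math.sqrt((1.0 - delta) * renyi_d2_estimate(weights) / (delta * n))
    return is_estimate - penalty
```

When all weights equal 1 (the two distributions coincide) the divergence estimate is 1 and the penalty reduces to `f_sup * sqrt((1 - delta) / (delta * n))`, so the bound tightens as the batch grows.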

Hence, our objective for the inner loop is now computed from samples, with an additional Renyi regularization term. Since we model the importance ratio directly through our parametric model, we can easily compute an estimator for the Renyi divergence term. The overall lower-bound inner maximization problem is:

$$\max_\psi \; \frac{1}{N}\sum_{i=1}^{N} w_\psi(x_i)\, \Delta(x_i) - \|\Delta\|_\infty \sqrt{\frac{(1 - \delta)\, \hat{d}_2}{\delta N}}, \qquad \hat{d}_2 = \frac{1}{N}\sum_{i=1}^{N} w_\psi(x_i)^2,$$

subject to the feature matching constraint, where $x_i$ are state-action tuples sampled from the replay buffer, $\Delta(x_i)$ is the squared Bellman error at $x_i$, and $\hat{d}_2$ is a sample estimate of the exponentiated Renyi divergence.

We found that this Renyi penalty helped stabilize training. In practice, we model the importance weights $w_\psi(s,a)$ as a parametric model with an identical architecture to the Q-network. We use parameter clipping for $w_\psi$, where the parameters are clipped to a fixed range, analogous to Wasserstein GANs (Arjovsky et al., 2017). We also found that self-normalization during importance sampling has a large practical benefit. Note that the true norm of the Bellman error, needed for $\|f\|_\infty$ in the Renyi divergence term, is not known; hence we either replace it by a constant or compute a stochastic approximation to the norm over the current batch. We found the former to be more stable and used it in all our experiments.

The coefficient of the Renyi divergence penalty was tuned. The learning rate for the adversary was chosen to be 1e-4 for the tabular environments, and 5e-4 for TD3. The batch size for our algorithm was chosen to be 128 for the tabular environments and 500 for TD3/SAC. Note that a larger batch size makes the min-max optimization problem smoother.

We also found that, instead of having a single Lagrange multiplier for the feature matching constraint, having separate Lagrange multipliers constraining each individual dimension of the features also helps considerably. This ensures that the hyperparameters remain the same across different architectures regardless of the dimension of the penultimate layer of the Q-network. The algorithm in this case is exactly the same as the algorithm before, with a vector-valued dual variable $\lambda$. We used TD3 and SAC implementations from rlkit (https://github.com/vitchyr/rlkit/tree/master/rlkit).
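The vector-valued dual variable described above can be sketched as one projected ascent step per feature dimension. This is a hedged illustration: the per-dimension constraint form $|g_k| \le \epsilon$ on each component $g_k$ of the feature gap, the step size, and the function name are assumptions made for the example, not details taken from the original implementation.

```python
def update_dual_vector(lams, feature_gap, eps, lr_dual):
    """One projected ascent step on a vector of dual variables.

    lams[k]:        current multiplier for feature dimension k.
    feature_gap[k]: k-th component of E_p[features] minus the replay buffer feature mean.
    Assumption: dimension k is constrained by |feature_gap[k]| <= eps.
    """
    return [max(0.0, lam + lr_dual * (abs(g) - eps))
            for lam, g in zip(lams, feature_gap)]
```

A multiplier grows only for dimensions whose gap exceeds `eps` and decays toward zero otherwise, so the same `eps` and `lr_dual` can be reused when the width of the feature layer changes.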

Appendix E Function approximation analysis on MuJoCo Tasks

As discussed in Section 4, we validate our findings on the effect of function approximation on 3 MuJoCo tasks from OpenAI Gym with the SAC algorithm, using the authors' implementation (Tuomas Haarnoja & Levine, 2018). We observe that, in general, bigger networks learn faster and reach higher returns.

Figure 13: Performance of different architecture sizes on 3 benchmark MuJoCo tasks from the OpenAI Gym suite with the SAC algorithm: (a) HalfCheetah-v2, (b) Hopper-v2, (c) Ant-v2. Values are averaged over 3 different seeds. A bigger network performs better in terms of both learning speed and returns. Each epoch on the x-axis corresponds to 1000 environment steps.

Appendix F Additional Plots

Figure 14: Normalized returns with Sampled-FQI, varying over architectures and number of on-policy samples.
Figure 15: Performance on Half Cheetah and Hopper trained via TD3 with a replay buffer, with an increasing number of gradient steps taken per environment step on the critic and the actor. Note the decay in the agent's performance as the number of gradient steps increases, which supports our claim of the presence of overfitting in Q-functions. Each iteration on the x-axis corresponds to taking 5000 steps in the environment.