1 Introduction
Q-learning algorithms, which are based on approximating state-action value functions, are an efficient and commonly used class of RL methods. In recent years, such methods have been applied to great effect in domains such as playing video games from raw pixels (Mnih et al., 2015) and continuous control in robotics (Kalashnikov et al., 2018). Methods based on approximate dynamic programming and Q-function estimation have several very appealing properties: they are generally moderately sample-efficient when compared to policy gradient methods, they are simple to use, and they allow for off-policy learning. This makes them an appealing choice for a wide range of tasks, from robotic control (Kalashnikov et al., 2018) to off-policy learning from historical data for recommender systems (Shani et al., 2005) and other applications. However, although the basic tabular Q-learning algorithm is convergent and admits theoretical analysis (Sutton & Barto, 2018), its nonlinear counterpart with function approximation (such as with deep neural networks) is poorly understood theoretically. In this paper, we aim to investigate the degree to which the theoretical issues with Q-learning actually manifest in practice. Thus, we empirically analyze aspects of the Q-learning method in a unit testing framework, where we can employ oracle solvers to obtain ground truth Q-functions and distributions for exact analysis. We investigate the following questions:
1) What is the effect of function approximation on convergence? Most practical reinforcement learning problems, such as robotic control, require function approximation to handle large or continuous state spaces. However, the behavior of Q-learning methods under function approximation is not well understood. There are known counterexamples where the method diverges (Baird, 1995), and there are no known convergence guarantees (Sutton & Barto, 2018). To investigate these problems, we study the convergence behavior of Q-learning methods with function approximation, parametrically varying the function approximator power and analyzing the quality of the solution as compared to the optimal Q-function and the optimal projected Q-function under that function approximator. We find, somewhat surprisingly, that function approximation error is not a major problem in Q-learning algorithms, but only when the representational capacity of the function approximator is high. This makes sense in light of the theory: a high-capacity function approximator can perform a nearly perfect projection of the backed-up Q-function, thus mitigating potential convergence issues due to an imperfect norm projection. We also find that divergence rarely occurs: for example, we observed divergence in only 0.9% of our experiments. We discuss this further in Section 4.
2) What is the effect of sampling error and overfitting? Q-learning is used to solve problems where we do not have access to the transition function of the MDP. Thus, Q-learning methods need to learn by collecting samples in the environment, and training on these samples incurs sampling error, potentially leading to overfitting. This causes errors in the computation of the Bellman backup, which degrades the quality of the solution. We experimentally show that overfitting exists in practice by performing ablation studies on the number of gradient steps, and by demonstrating that oracle-based early stopping techniques can be used to improve the performance of Q-learning algorithms (Section 5). Thus, in our experiments we quantify the amount of overfitting which happens in practice, incorporating a variety of metrics and performing a number of ablations, and investigate methods to mitigate its effects.
3) What is the effect of distribution shift and a moving target? The standard formulation of Q-learning prescribes an update rule, with no corresponding objective function (Sutton et al., 2009a). This results in a process which optimizes an objective that is non-stationary in two ways: the target values are updated during training, and the distribution under which the Bellman error is optimized changes, as samples are drawn from different policies. We refer to these problems as the moving target and distribution shift problems, respectively. These properties can make convergence behavior difficult to understand, and prior works have hypothesized that non-stationarity is a source of instability (Mnih et al., 2015; Lillicrap et al., 2015). In our experiments, we develop metrics to quantify the amount of distribution shift and performance change due to non-stationary targets. Surprisingly, we find that in a controlled experiment, distributional shift and non-stationary targets do not in fact correlate with a reduction in performance. In fact, sampling strategies with large distributional shift often perform very well.
4) What is the best sampling or weighting distribution? Deeply tied to the distribution shift problem is the choice of which distribution to sample from. Do moving distributions cause instability, as Q-values trained on one distribution are evaluated under another in subsequent iterations? Researchers have often noted that on-policy samples are typically superior to off-policy samples (Sutton & Barto, 2018), and there are several theoretical results that highlight favorable convergence properties under on-policy samples. However, there is little theoretical guidance on how to pick distributions so as to maximize learning rate. To this end, we investigate several choices for the sampling distribution. Surprisingly, we find that on-policy training distributions are not always preferable, and that a clear pattern in performance with respect to training distribution is that broader, higher-entropy distributions perform better, regardless of distributional shift. Motivated by our findings, we propose a novel weighting distribution, adversarial feature matching (AFM), which explicitly compensates for function approximator error, while still producing high-entropy sampling distributions.
Our contributions are as follows:
We introduce a unit testing framework for Q-learning to disentangle issues related to function approximation, sampling, and distributional shift, where approximate components are replaced by oracles. This allows for controlled analysis of different sources of error.
We perform a detailed experimental analysis of many hypothesized sources of instability, error, and slow training in Q-learning algorithms on tabular domains, and show that many of these trends hold true in high-dimensional domains.
We propose novel choices of sampling distributions which lead to improved performance even on high-dimensional tasks.
Our overall aim is to offer practical guidance for designing RL algorithms, as well as to identify important issues to solve in future research.
2 Preliminaries
Q-learning algorithms aim to solve a Markov decision process (MDP) by learning the optimal state-action value function, or Q-function. We define an MDP as a tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$. $\mathcal{S}$ and $\mathcal{A}$ represent the state and action spaces, respectively. $T(s' \mid s, a)$ and $R(s, a)$ represent the dynamics (transition distribution) and reward function, and $\gamma$ represents the discount factor. The goal in RL is to find a policy $\pi(a \mid s)$ that maximizes the expected cumulative discounted rewards, known as the returns:
$$\pi^* = \arg\max_\pi \, \mathbb{E}_{s_{t+1} \sim T,\, a_t \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$$
The quantities of interest in Q-learning methods are state-action value functions, which give the expected future return starting from a particular state-action tuple, denoted $Q^\pi(s, a)$. The state value function can also be denoted as $V^\pi(s)$. Q-learning algorithms are based on iterating the Bellman backup operator $\mathcal{T}$, defined as
$$(\mathcal{T} Q)(s, a) = R(s, a) + \gamma \, \mathbb{E}_{s' \sim T(\cdot \mid s, a)} \left[ \max_{a'} Q(s', a') \right]$$
The (tabular) Q-iteration algorithm is a dynamic programming algorithm that iterates the Bellman backup, $Q^{t+1} \leftarrow \mathcal{T} Q^t$. Because the Bellman backup is a $\gamma$-contraction in the $L_\infty$ norm, and $Q^*$ (the Q-values of $\pi^*$) is its fixed point, Q-iteration can be shown to converge to $Q^*$ (Sutton & Barto, 2018). A deterministic optimal policy can then be obtained as $\pi^*(s) = \arg\max_a Q^*(s, a)$.
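As an illustration of tabular Q-iteration, the following is a minimal sketch in NumPy. The array layout (`T[s, a, s']` for the dynamics, `R[s, a]` for the rewards), the stopping tolerance, and the two-state toy MDP are assumptions made for this example and are not taken from the experiments in this paper.

```python
import numpy as np

def bellman_backup(Q, T, R, gamma):
    """Apply the Bellman backup to a tabular Q-function.

    Q: (S, A) Q-values, T: (S, A, S) transition probabilities,
    R: (S, A) rewards, gamma: discount factor.
    """
    V = Q.max(axis=1)             # V(s') = max_a' Q(s', a')
    return R + gamma * (T @ V)    # expectation over s' for every (s, a)

def q_iteration(T, R, gamma, tol=1e-8, max_iters=10_000):
    """Iterate the backup until successive iterates differ by less than tol."""
    Q = np.zeros_like(R, dtype=float)
    for _ in range(max_iters):
        Q_next = bellman_backup(Q, T, R, gamma)
        if np.abs(Q_next - Q).max() < tol:
            return Q_next
        Q = Q_next
    return Q

# Hypothetical 2-state, 2-action deterministic MDP:
# action 0 always leads to state 0, action 1 always leads to state 1,
# and the only reward is for taking action 1 in state 1.
T = np.zeros((2, 2, 2))
T[:, 0, 0] = 1.0
T[:, 1, 1] = 1.0
R = np.array([[0.0, 0.0], [0.0, 1.0]])

Q_star = q_iteration(T, R, gamma=0.9)
pi_star = Q_star.argmax(axis=1)   # greedy (deterministic) policy
```

Because the backup is a contraction, the loop terminates for any discount factor below one; the tolerance only controls how close the returned table is to the fixed point.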
When state spaces cannot be enumerated in a tabular format, function approximators can be used to represent the Q-values. An important class of such Q-learning methods are fitted Q-iteration (FQI) (Ernst et al., 2005), or approximate dynamic programming (ADP) methods, which form the basis of modern deep RL methods such as DQN (Mnih et al., 2015). FQI projects the values of the Bellman backup onto a family of Q-function approximators $\mathcal{Q}$:
$$Q^{t+1} \leftarrow \Pi_\mu(\mathcal{T} Q^t)$$
Here, $\Pi_\mu$ denotes a $\mu$-weighted $L_2$ projection, which minimizes the Bellman error via supervised learning:
$$\Pi_\mu(Q) = \arg\min_{Q' \in \mathcal{Q}} \, \mathbb{E}_{(s, a) \sim \mu} \left[ \left( Q'(s, a) - Q(s, a) \right)^2 \right] \quad (1)$$
The values produced by the Bellman backup, $(\mathcal{T} Q^t)(s, a)$, are commonly referred to as target values, and when neural networks are used for function approximation, the previous Q-function $Q^t$ is referred to as the target network. In this work, we distinguish between the cases when the Bellman error is estimated with Monte-Carlo sampling or computed exactly (see Section 3.1). The sampled variant corresponds to FQI as described in the literature (Ernst et al., 2005; Riedmiller, 2005), while the exact variant is analogous to conventional ADP (Bertsekas & Tsitsiklis, 1996).
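To make the projection step concrete, the sketch below computes a weighted $L_2$ projection for the special case of a linear function class, where it reduces to weighted least squares. The linear feature matrix and the function name `weighted_projection` are assumptions for illustration; the experiments in this paper use neural networks trained by gradient descent instead.

```python
import numpy as np

def weighted_projection(features, targets, mu):
    """Project target values onto the span of `features` under mu-weighting.

    features: (N, d) one feature row per state-action pair.
    targets:  (N,)   target values, one per state-action pair.
    mu:       (N,)   nonnegative weights over state-action pairs.
    Returns the projected values, shape (N,).
    """
    sqrt_mu = np.sqrt(mu)
    # Weighted least squares equals ordinary least squares on rows
    # scaled by sqrt(mu).
    w, *_ = np.linalg.lstsq(features * sqrt_mu[:, None],
                            targets * sqrt_mu, rcond=None)
    return features @ w

targets = np.array([0.0, 1.0, 2.0])
mu = np.array([0.5, 0.25, 0.25])

# A single constant feature can only represent the mu-weighted mean.
proj_constant = weighted_projection(np.ones((3, 1)), targets, mu)

# One-hot (tabular) features reproduce the targets exactly.
proj_tabular = weighted_projection(np.eye(3), targets, mu)
```

The two calls show the extremes: a very weak function class collapses the targets to a weighted average, while a tabular class incurs no projection error at all.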
Convergence guarantees for Q-iteration do not cleanly translate to FQI. $\Pi_\mu$ is an $L_2$ projection, but $\mathcal{T}$ is a contraction in the $L_\infty$ norm. This norm mismatch means the composition of the backup and projection is no longer guaranteed to be a contraction under any norm (Bertsekas & Tsitsiklis, 1996), and hence convergence is not guaranteed.
A related branch of Q-learning methods are online Q-learning methods, in which Q-values are updated while samples are being collected in the MDP. This includes classic algorithms such as Watkins' Q-learning (Watkins & Dayan, 1992). Online Q-learning methods can be viewed as a form of stochastic approximation (such as Robbins-Monro) applied to Q-iteration and FQI (Bertsekas & Tsitsiklis, 1996), and share many of its theoretical properties (Szepesvári, 1998). Modern deep RL algorithms such as DQN (Mnih et al., 2015) have characteristics of both online Q-learning and FQI: using replay buffers means the sampling distribution changes very little between target updates (see Section 6.3), and target networks are justified from the viewpoint of FQI. Because FQI corresponds to the case when the sampling distribution is static between target updates, the behavior of modern deep RL methods more closely resembles FQI than a true online method without target networks.
3 Experimental Setup
Our experimental setup is centered around unit testing. We first introduce a spectrum of Q-learning algorithms, starting with exact approximate dynamic programming and gradually replacing oracle components, such as knowledge of dynamics, until the algorithm resembles modern deep Q-learning methods. We then introduce a suite of tabular environments where oracle solutions can be computed and compared against, to aid in diagnosis, as well as testing in high-dimensional environments to verify our hypotheses.
In order to provide consistent metrics across domains, we normalize returns and errors involving Q-functions (such as Bellman error) by the returns of the expert policy on each environment.
3.1 Algorithms
In the analysis presented in Sections 4, 5, 6 and 7, we will use three different Q-learning variants, each of which removes some of the approximations in the standard Q-learning method used in the literature: Exact-FQI, Sampled-FQI, and Replay-FQI. Although FQI is not exactly identical to commonly used deep RL methods, such as DQN (Mnih et al., 2015), DDPG (Lillicrap et al., 2015), and SAC (Haarnoja et al., 2017), it is structurally similar and, when the replay buffer for the commonly used methods becomes large, the difference becomes negligible, since the sampling distribution changes very little between target network updates. However, FQI methods are much more amenable to controlled analysis, since we can separately isolate target values, update rates, and the number of samples used for each iteration. We therefore use variants of FQI as the basis for our analysis, but we also confirm that similar trends hold with more commonly used algorithms on standard benchmark problems.
Exact-FQI (Algorithm 1): Exact-FQI computes the backup and projection on all state-action tuples without any sampling error. It also assumes knowledge of the dynamics and reward function to compute Bellman backups exactly. We use Exact-FQI to study convergence, distribution shift (by varying weighting distributions on transitions), and function approximation in the absence of sampling error. Exact-FQI eliminates errors due to sampling states and computing inexact, sampled backups.
Sampled-FQI (Algorithm 2): Sampled-FQI is a special case of Exact-FQI, where the Bellman error is approximated with Monte-Carlo estimates from a sampling distribution $\mu$, and the Bellman backup is approximated with samples from the dynamics as $\hat{\mathcal{T}} Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')$. We use Sampled-FQI to study the effects of overfitting. Sampled-FQI incorporates all sources of error: function approximation, sampling, and distribution shift.
Replay-FQI (Algorithm 3): Replay-FQI is a special case of Sampled-FQI that uses a replay buffer (Lin, 1992), which saves past transition samples that are used for computing the Bellman error. Replay-FQI strongly resembles DQN (Mnih et al., 2015), lacking the online updates that allow $\mu$ to change within an FQI iteration. With large replay buffers, we expect the difference between Replay-FQI and DQN to be minimal, as $\mu$ changes slowly.
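The sampled quantities used by Sampled-FQI and Replay-FQI can be sketched as follows for a tabular Q-function. This is only an illustrative sketch: the batch layout (parallel arrays of states, actions, rewards, and next states) and the function names are assumptions, not the implementation used for the algorithms above.

```python
import numpy as np

def sampled_targets(Q_old, rewards, next_states, gamma):
    """One-sample estimate of the Bellman backup for each transition.

    Q_old: (S, A) table holding the previous Q-function (the target).
    """
    return rewards + gamma * Q_old[next_states].max(axis=1)

def empirical_bellman_error(Q_new, Q_old, states, actions,
                            rewards, next_states, gamma):
    """Mean squared error between Q_new and the sampled targets."""
    targets = sampled_targets(Q_old, rewards, next_states, gamma)
    return np.mean((Q_new[states, actions] - targets) ** 2)

# Hypothetical batch of two transitions drawn from a replay buffer.
Q = np.array([[0.0, 1.0], [2.0, 3.0]])
states = np.array([0, 1])
actions = np.array([1, 0])
rewards = np.array([1.0, 0.0])
next_states = np.array([1, 0])

targets = sampled_targets(Q, rewards, next_states, gamma=0.5)
error = empirical_bellman_error(Q, Q, states, actions,
                                rewards, next_states, gamma=0.5)
```

In Exact-FQI the same error would instead be computed over every state-action pair, with the expectation over next states taken under the true dynamics.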
We additionally investigate the following choices of weighting distributions ($\mu$) for the Bellman error. When sampling the Bellman error, these can be implemented by sampling directly from the distribution, or via importance sampling.
Unif: Uniform weights over the state-action space. This is the weighting distribution typically used by dynamic programming algorithms, such as FQI.
$\pi$: The on-policy state-action marginal induced by the current policy $\pi^t$.
$\pi^*$: The state-action marginal induced by the optimal policy $\pi^*$.
Random: State-action marginal induced by executing uniformly random actions.
Prioritized(s,a): Weights Bellman errors proportional to $|Q(s, a) - \mathcal{T} Q(s, a)|$. This is similar to prioritized replay (Schaul et al., 2015) without importance sampling.
Replay and Replay10: Averaged state-action marginal of all policies (or the previous 10) produced during training. This simulates sampling uniformly from a replay buffer where infinite samples are collected from each policy.
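Two of the weighting distributions above can be sketched for a tabular Q-function as follows. The function names are hypothetical, and the fallback to uniform weights when every Bellman error is zero is an assumption added here to keep the weights well defined.

```python
import numpy as np

def uniform_weights(Q):
    """Unif: equal weight on every state-action pair."""
    return np.full(Q.shape, 1.0 / Q.size)

def prioritized_weights(Q, TQ, eps=1e-12):
    """Prioritized: weights proportional to the absolute Bellman error.

    Q:  (S, A) current Q-values.
    TQ: (S, A) backed-up values for the same state-action pairs.
    """
    abs_error = np.abs(Q - TQ)
    total = abs_error.sum()
    if total < eps:
        # Assumption: fall back to uniform when all errors vanish.
        return uniform_weights(Q)
    return abs_error / total

Q = np.array([[1.0, 2.0], [3.0, 4.0]])
TQ = np.array([[1.0, 3.0], [3.0, 7.0]])

mu_unif = uniform_weights(Q)
mu_prio = prioritized_weights(Q, TQ)
```

Either array can be used directly as the weights in a weighted Bellman error, or normalized sampling can be done from it when the error is estimated from samples.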
3.2 Domains
We evaluate our methods on a suite of tabular environments where we can compute oracle values. This helps us compare, analyze, and fix various sources of error by comparing the learned Q-functions to the true, oracle-computed Q-functions. We selected 8 tabular domains, each with different qualitative attributes, including: gridworlds of varying sizes and observations, blind Cliffwalk (Schaul et al., 2015), discretized Pendulum and Mountain Car based on implementations in OpenAI Gym (Plappert et al., 2018), and a random sparsely connected graph. We give full details of these environments, as well as the motivation for their inclusion, in Appendix A.
3.3 Function Approximators
Throughout our experiments, we use 2-layer ReLU networks, denoted by a tuple $(N, N)$, where $N$ represents the number of units in a layer. The “Tabular” architecture refers to the case when no function approximation is used.
3.4 High-Dimensional Testing
In addition to diagnostic experiments on tabular domains, we also wish to see if the observed trends hold true on high-dimensional environments. To this end, we include experiments on continuous control tasks in the OpenAI Gym benchmark (Plappert et al., 2018) (HalfCheetah-v2, Hopper-v2, Ant-v2, Walker2d-v2). In continuous domains, computing the maximum over actions of the Q-value ($\max_a Q(s, a)$) is difficult. A common choice in this case is to use a second “actor” neural network to approximate $\arg\max_a Q(s, a)$ (Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018). This approach most closely resembles Replay-FQI, but uses the actor network in place of the max.
4 Function Approximation and Convergence
The first issue we investigate is the connection between function approximation and convergence properties.
4.1 Technical Background
As discussed in Section 2, when function approximation is introduced to Q-learning, convergence guarantees are lost. This interaction between approximation and convergence has been a long-studied topic in reinforcement learning. In the control literature, it is closely related to the problems of state-aliasing or interference (Farrell & Berger, 1995). Baird (1995) introduces a simple counterexample in which Watkins' Q-learning with linear approximators can cause unbounded divergence. In the policy evaluation scenario, Tsitsiklis & Van Roy (1997) prove that on-policy TD-learning with linear function approximators converges, and methods such as GTD (Sutton et al., 2009a) and ETD (Sutton et al., 2016) have extended these results to off-policy cases. In the control scenario, convergent algorithms such as SBEED (Dai et al., 2018) and Greedy-GQ (Maei et al., 2010) have been developed. However, several works have noted that divergence need not occur. Munos (2005) theoretically addresses the norm-mismatch problem, showing that unbounded divergence is impossible provided $\mu$ has adequate support and projections are non-expansive in p-norms. Concurrently with our work, Van Hasselt et al. (2018) experimentally find that unbounded divergence rarely occurs with DQN variants on Atari games.
4.2 How does function approximation affect convergence properties and suboptimality of solutions?
The crucial quantities we wish to measure are a trend between function approximation and performance, and a measure for the bias in the learning procedure introduced by function approximation. Thus, using Exact-FQI with uniform weighting (to remove sampling error), we measure the returns of the learned policy, and the error between $Q^*$ and either the solution found by Exact-FQI or the projection of the optimal solution, $\Pi_\mu Q^*$. The latter represents the best solution inside the model class, in the absence of error from the bootstrapping process of FQI. Thus, the difference between FQI error and projection error represents the bias introduced by the bootstrapping procedure, while controlling for bias that is simply due to function approximation; this quantity is roughly the inherent Bellman error of the function class (Munos & Szepesvári, 2008). This is the gap which can possibly be improved upon via better Q-learning algorithm design. We plot our results in Fig. 1.
We first note the obvious trend that smaller architectures produce lower returns, and converge to more suboptimal solutions. However, we also find that smaller architectures introduce significant bias in the learning process, and there is often a significant gap between the solution found by Exact-FQI and the best solution within the model class. This gap may be due to the fact that when the target is bootstrapped, we must be able to represent all Q-functions along the path to the solution, and not just the final result (Bertsekas & Tsitsiklis, 1996). This observation implies that using large architectures is crucial not only because they have the capacity to represent a better solution, but also because they are significantly easier to train using bootstrapping, and suffer less from non-convergence issues. We also note that divergence rarely happens in practice. We observed divergence in 0.9% of our experiments using function approximation, measured by the largest Q-value growing larger than 10 times that of $Q^*$.
For high-dimensional problems, we present experiments on varying the architecture of the Q-network in SAC (Haarnoja et al., 2018) in Appendix Fig. 13. We still observe that large networks have the best performance, and that divergence rarely happens even in high-dimensional continuous spaces. In Appendix B, we briefly discuss theoretical intuitions on the apparent discrepancy between the lack of unbounded divergence and known counterexamples.
5 Sampling Error and Overfitting
A second source of error in minimizing the Bellman error, orthogonal to function approximation, is that of sampling or generalization error. The next issue we investigate is the effect of sampling error on Q-learning methods.
5.1 Technical Background
Approximate dynamic programming assumes that the projection of the Bellman backup (Eqn. 1) is computed exactly, but in reinforcement learning we can normally only compute the empirical Bellman error over a finite set of samples. In the PAC framework, overfitting can be quantified by a bounded error between the empirical and expected loss with high probability, which decays with sample size (Shalev-Shwartz & Ben-David, 2014). Munos & Szepesvári (2008); Maillard et al. (2010); Tosatto et al. (2017) provide such PAC-bounds which account for sampling error in the context of Q-learning and value-based methods, and quantify the quality of the final solution in terms of sample complexity.
We analyze several key points that relate to sampling error. First, we show that Q-learning is prone to overfitting, and that this overfitting has a real impact on performance, in both tabular and high-dimensional settings. We also show that the replay buffer is in fact a very effective technique in addressing this issue, and discuss several methods to mitigate the effects of overfitting in practice.
5.2 Quantifying Overfitting
We first quantify the amount of overfitting that happens during training, by varying the number of samples. In order to provide comparable validation errors across different experiments, we fix a reference sequence of Q-functions, $Q^1, \dots, Q^N$, obtained during a normal training run. We then retrace the training sequence, and minimize the projection error for each training iteration, using varying amounts of on-policy data or sampling from a replay buffer. We measure the exact validation error (the expected Bellman error) at each iteration under the on-policy distribution, plotted in Fig. 3. We note the obvious trend that more samples lead to lower validation loss, confirming that overfitting can in fact occur. A more interesting observation is that sampling from the replay buffer results in the lowest on-policy validation loss, despite bias due to distribution mismatch from sampling off-policy data. As we discuss in Section 6, we believe that replay buffers are mainly effective because they greatly reduce the effect of overfitting and create relatively good coverage over the state space, not necessarily due to reducing the effects of distribution shift.
Next, Fig. 2 shows the relationship between the number of samples and returns. We see a clear trend that a higher sample count leads to improved learning speed and a better final solution, confirming our hypothesis that overfitting has a significant effect on the performance of Q-learning. A full sweep including architectures is presented in Appendix Fig. 14. We observe that despite overfitting being an issue, larger architectures still perform better because the bias introduced by smaller architectures dominates.
5.3 What methods can be used to compensate for overfitting?
Finally, we discuss methods to compensate for overfitting. One common method for reducing overfitting is to regularize the function approximator to reduce its capacity. However, as we have seen before that weaker architectures can give rise to suboptimal convergence, we instead study early stopping methods to mitigate overfitting without reducing model size. First, we observe that the number of gradient steps taken per sample in the projection step has an important effect on performance: with too few steps the algorithm learns slowly, but with too many steps the algorithm may initially learn quickly but overfit. To show this, we run a hyperparameter sweep over the number of gradient steps taken per environment step in Replay-FQI and TD3 (TD3 uses 1 by default). Results for FQI are shown in Fig. 4, and for TD3 in Appendix Fig. 15.
In order to understand whether better early stopping criteria can possibly help with overfitting, we employ oracle early stopping rules. While neither of these rules can be used to solve overfitting in practice, these experiments can provide guidance for future methods and an “upper bound” on the best improvement that can be obtained from optimal stopping. We investigate two oracle early stopping criteria for setting the number of gradient steps: using the expected Bellman error and the expected returns of the greedy policy w.r.t. the current Q-function (oracle returns). We implement both methods by running the projection step of Replay-FQI to convergence using gradient descent, and afterwards selecting the intermediate Q-function which is judged best by the evaluation metric (lowest Bellman error or highest returns). Using such oracle stopping metrics results in a modest boost in performance in tabular domains (Fig. 5). Thus, we believe that there is promise in further improving such early-stopping methods for reducing overfitting in deep RL algorithms.
We might draw a few actionable conclusions from these experiments. First, overfitting is indeed a serious issue with Q-learning, and too many gradient steps or too few samples can lead to poor performance. Second, replay buffers and early stopping can be used to mitigate the effects of overfitting. Third, although overfitting is a problem, large architectures are still preferred, because the harm from function approximation bias outweighs the harm from increased overfitting with large models.
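The oracle early stopping procedure described in Section 5.3 can be sketched generically: run gradient descent on the training objective, score every intermediate iterate with an oracle metric, and keep the best one. The function below is a minimal sketch under assumed names (`grad_fn`, `oracle_metric`) and a hypothetical one-dimensional problem in which the training optimum and the oracle optimum differ; it is not the Replay-FQI implementation. To select by oracle returns instead of Bellman error, the metric can simply be the negated returns.

```python
import numpy as np

def oracle_early_stopping(w0, grad_fn, oracle_metric, lr=0.1, max_steps=200):
    """Run gradient descent and return the iterate with the lowest oracle metric.

    grad_fn(w):       gradient of the (sampled) training objective at w.
    oracle_metric(w): exact quantity to minimize, e.g. expected Bellman error.
    """
    w = np.array(w0, dtype=float)
    best_w, best_score, best_step = w.copy(), oracle_metric(w), 0
    for step in range(1, max_steps + 1):
        w = w - lr * grad_fn(w)
        score = oracle_metric(w)
        if score < best_score:
            best_w, best_score, best_step = w.copy(), score, step
    return best_w, best_score, best_step

# Hypothetical toy problem: the training loss is minimized at w = 2,
# while the oracle metric is minimized at w = 1, so training to
# convergence overshoots the best intermediate iterate.
train_grad = lambda w: w - 2.0
oracle = lambda w: float(np.sum((w - 1.0) ** 2))

best_w, best_score, best_step = oracle_early_stopping([0.0], train_grad, oracle)
converged_score = oracle(np.array([2.0]))
```

The gap between `best_score` and `converged_score` plays the role of the improvement that an ideal stopping rule could recover in this toy setting.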
6 Non-Stationarity
In this section, we discuss issues related to the non-stationarity of the Q-learning process (relating to the Bellman backup and Bellman error minimization).
6.1 Technical Background
Instability in Q-learning methods is often attributed to the non-stationarity of the regression objective (Lillicrap et al., 2015; Mnih et al., 2015). Non-stationarity occurs in two places: in the changing target values $\mathcal{T} Q$, and in a changing weighting distribution $\mu$ (“distribution shift”), i.e., due to samples being taken from different policies. Note that a non-stationary objective, by itself, is not indicative of instability. For example, gradient descent can be viewed as successively minimizing linear approximations to a function: for gradient descent on $f$ with parameter $\theta$ and learning rate $\alpha$, we have the “moving” objective $\theta^{t+1} = \arg\min_\theta \, \theta^\top \nabla_\theta f(\theta^t) + \frac{1}{2\alpha} \|\theta - \theta^t\|_2^2$. However, the fact that the Q-learning algorithm prescribes an update rule and not a stationary objective complicates analysis. Indeed, the motivation behind algorithms such as GTD (Sutton et al., 2009b, a) and residual methods (Baird, 1995; Scherrer, 2010) can be seen as introducing a stationary objective that can be optimized with standard procedures such as gradient descent for increased stability. Therefore, a key question to investigate is whether these non-stationarities are detrimental to the learning process.
6.2 Does a moving target cause instability in the absence of a moving distribution?
To study the moving target problem, we must first isolate the effects of a moving target, and study how the rate at which the target changes impacts performance. To control the rate at which the target changes, we introduce an additional smoothing parameter $\alpha$ to Q-iteration, where the target values are now computed as an $\alpha$-moving average over previous targets. We define the smoothed Bellman backup, $\mathcal{T}^\alpha$, as follows:
$$\mathcal{T}^\alpha Q = \alpha \, \mathcal{T} Q + (1 - \alpha) \, Q$$
This scheme is inspired by the soft target update used in algorithms such as DDPG (Lillicrap et al., 2015) and SAC (Haarnoja et al., 2017) to improve the stability of learning. Standard Q-iteration uses a “hard” update where $\alpha = 1$. A soft target update weakens the contraction of Q-iteration from $\gamma$ to $1 - \alpha + \alpha\gamma$ (see Appendix C), so we expect slower convergence, but it may be more stable under heavy function approximation error. We performed experiments with this modified backup using Exact-FQI under a fixed weighting distribution.
Our results are presented in Appendix Fig. 12. We find that in most cases, the hard update with $\alpha = 1$ results in the fastest convergence and highest asymptotic performance. However, for the smallest two architectures we used, lower values of $\alpha$ (such as 0.1) achieve slightly higher asymptotic performance. Thus, while more expressive architectures are still stable under fast-changing targets, we believe that a slowly moving target may have benefits under heavy approximation error. This evidence points to either using large function approximators, in line with the conclusions drawn in the previous sections, or adaptively slowing the target updates when the architecture is weak (relative to the problem difficulty) and the projected Bellman error is therefore high.
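The effect of a soft target update on the contraction rate can be checked numerically. The sketch below assumes the smoothed backup is the convex combination of the backed-up values and the current Q-values with weight `alpha`, and uses a randomly generated tabular MDP (a hypothetical stand-in, not one of the domains in this paper) to compare the observed shrinkage of the gap between two Q-functions with the bound `1 - alpha + alpha * gamma`.

```python
import numpy as np

def bellman_backup(Q, T, R, gamma):
    """Tabular Bellman backup: R + gamma * E_{s'}[max_a' Q(s', a')]."""
    return R + gamma * (T @ Q.max(axis=1))

def smoothed_backup(Q, T, R, gamma, alpha):
    """Soft update: mix the backed-up values with the current Q-values."""
    return alpha * bellman_backup(Q, T, R, gamma) + (1.0 - alpha) * Q

rng = np.random.default_rng(0)
S, A, gamma, alpha = 5, 3, 0.9, 0.1

# Random MDP (assumption for illustration only).
T = rng.random((S, A, S))
T /= T.sum(axis=-1, keepdims=True)   # normalize rows into distributions
R = rng.random((S, A))

Q1 = rng.normal(size=(S, A))
Q2 = rng.normal(size=(S, A))

gap_before = np.abs(Q1 - Q2).max()
gap_after = np.abs(smoothed_backup(Q1, T, R, gamma, alpha)
                   - smoothed_backup(Q2, T, R, gamma, alpha)).max()
bound = (1.0 - alpha + alpha * gamma) * gap_before
```

With `alpha = 1` the bound reduces to the usual factor `gamma`; smaller values of `alpha` move the factor toward one, which is the slower convergence discussed above.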
6.3 Does distribution shift impact performance?
To study the distribution shift problem, we exactly compute the amount of distribution shift between iterations in total-variation distance, together with the “loss shift”. The loss shift quantifies the Bellman error objective when evaluated under a new distribution: if the distribution shifts to previously unseen or low-support states, we would expect a highly inaccurate Q-value in such states, and a correspondingly high loss shift.
We run our experiments using Exact-FQI with a 256x256 layer architecture, and plot the distribution discrepancy and the loss discrepancy in Fig. 6. We find that Prioritized has the greatest shift, followed by on-policy variants. Replay buffers greatly reduce distribution shift compared to on-policy learning, which is similar to the decorrelation argument cited for their use by Mnih et al. (2015). However, we find that this metric correlates very little with the actual performance of the algorithm (Fig. 7). For example, prioritized weighting performs well yet has high distribution shift.
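Both shift metrics are straightforward to compute when the weighting distributions are available as tables. In the sketch below, the total-variation distance follows its standard definition, while `loss_shift` is one plausible reading of the quantity discussed above: the squared Bellman error of the current Q-function under the new distribution minus the same error under the old distribution. That exact form, the function names, and the example numbers are assumptions made for illustration.

```python
import numpy as np

def total_variation(mu_old, mu_new):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * np.abs(mu_new - mu_old).sum()

def loss_shift(sq_bellman_error, mu_old, mu_new):
    """Change in weighted squared Bellman error when moving from mu_old to mu_new.

    sq_bellman_error: per state-action squared Bellman error of the
    current Q-function, same shape as the distributions.
    """
    return np.sum((mu_new - mu_old) * sq_bellman_error)

# Hypothetical example: all mass moves to a state-action pair where
# the current Q-function is inaccurate.
mu_old = np.array([1.0, 0.0])
mu_new = np.array([0.0, 1.0])
sq_err = np.array([0.0, 4.0])

tv = total_variation(mu_old, mu_new)
shift = loss_shift(sq_err, mu_old, mu_new)
```

A distribution that moves onto poorly fit state-action pairs yields a large positive loss shift, while identical distributions give zero for both metrics.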
Overall, our experiments indicate that non-stationarities in both distributions and target values, when isolated, do not cause significant stability issues. Instead, other factors such as sampling error and function approximation appear to have more significant effects on performance. In light of these findings, we might therefore ask: can we design a better sampling distribution, without regard for distributional shift and with regard for high entropy, that results in better final performance and is realizable in practice? We investigate this in the following section.
7 Sampling Distributions
As alluded to in Section 6, the choice of sampling distribution is an important design decision that can have a large impact on performance. Indeed, it is not immediately clear which distribution is ideal for Q-learning. In this section, we hope to shed some light on this issue.
7.1 Technical Background
Off-policy data has been cited as one component of the “deadly triad” for Q-learning (Sutton & Barto, 2018), which has the potential to cause instabilities in learning. On-policy distributions (Tsitsiklis & Van Roy, 1997) and fixed behavior distributions (Sutton et al., 2009b; Maei et al., 2010) have often been targeted for theoretical convergence analysis, and many works use importance sampling to correct for off-policyness (Precup et al., 2001; Munos et al., 2016). However, to our knowledge, there is relatively little guidance on how different weighting distributions compare in terms of convergence rate and final solutions.
Nevertheless, several works give hypotheses on good choices for weighting distributions. Munos (2005) provides an error bound which suggests that “more uniform” weighting distributions can guarantee better worst-case performance. Geist et al. (2017) suggest that when the state distribution is fixed, the action distribution should be weighted by the optimal policy for residual Bellman errors. In deep RL, several methods have been developed to prevent instabilities in Q-learning, such as prioritized replay (Schaul et al., 2015), and mixing replay buffer data with on-policy data (Hausknecht & Stone, 2016; Zhang & Sutton, 2017) has been found to be beneficial. In our experiments, we aim to empirically analyze multiple choices for weighting distributions to determine which are the most effective.
7.2 What Are the Best Weighting Distributions in Absence of Sampling Error?
We begin by studying the effect of weighting distributions when disentangled from sampling error. We run Exact-FQI with varying choices of architectures and weighting distributions and report our results in Fig. 8. Unif, Replay, and Prioritized consistently result in the highest returns across all architectures. We believe that these results are in favor of the uniformity hypothesis: the top-performing distributions spread weight across a larger support of the state-action space. For example, a replay buffer contains state-action tuples from many policies, and therefore would be expected to have wider support than the state-action distribution of a single policy. We can see this general trend in Fig. 9. These distributions generally result in the tightest contraction rates, and allow the Q-function to focus on locations where the error is high. In the sampled setting, this observation motivates exploration algorithms that maximize state coverage (for example, Hazan et al. (2018) solve an exploration objective which maximizes state-space entropy). However, note that in this particular experiment, there is no sampling. All states are observed, just with different weights, thus isolating the issue of distributions from the issue of sampling.
7.3 Designing a Better Off-Policy Distribution: Adversarial Feature Matching
In our final study, we attempt to design a better weighting distribution that uses insights from the previous sections and can be easily integrated into deep RL methods. We refer to this method as adversarial feature matching (AFM). We draw upon three specific insights from the preceding analysis. First, the function approximator should be incentivized to maximize its ability to distinguish states, in order to minimize function approximation bias (Section 4). Second, the weighting distribution should emphasize areas where the Q-function incurs high Bellman error, in order to minimize the discrepancy between $\ell_2$ norm error and $\ell_\infty$ norm error. Third, more uniform weighting distributions tend to perform better. The first insight was also demonstrated by Liu et al. (2018), where enforcing sparsity in the Q-function was found to provide locality in the Q-function, which prevented catastrophic interference and provided better values for bootstrapping.
We propose to model our problem as a minimax game, where the weighting distribution is a parameterized adversary which tries to maximize the Bellman error, while the Q-function tries to minimize it. Note that in the unconstrained setting, this game is equivalent to minimizing the $\ell_\infty$ norm error in its dual-norm representation. However, in practical settings where minimizing stochastic approximations of the $\ell_\infty$ norm can be difficult for neural networks (as also noticed when using PER (Van Hasselt et al., 2018)), it is crucial to introduce constraints to limit the power of the adversary. These constraints also keep the adversary closer to the uniform distribution while still allowing it to differ sufficiently at specific state-action pairs.
We elect to use a feature matching constraint, which enforces that the expected feature vector under the adversarial distribution roughly matches the expected feature vector under uniform sampling from the replay buffer. We can express the output of a neural network Q-function as a final linear layer applied to a feature vector of the state or, in the continuous case, of the state-action pair, where the feature vector represents the output of all but the final layer. Intuitively, this constraint restricts the adversarial sampler to distributing probability mass among states (or state-action pairs) that are perceptually similar to the Q-function, which in turn forces the Q-function to reduce state aliasing by learning features that are more separable. Note that, in our case, the feature vector is the gradient of the Q-value with respect to the final-layer parameters. This also provides a natural extension of our method: performing expected gradient matching over all parameters, instead of matching only the final-layer gradient (we leave it to future work to explore this direction). Formally, this objective is given as follows:

Note that the feature vector is a function of the Q-function parameters but, while solving the maximization, it is assumed to be a constant. This is equivalent to solving only the inner maximization with a constraint, and empirically provides better stability. Implementation details for AFM are provided in Appendix D. The constraint target is an estimator for the true expected feature vector under some sampling distribution, such as a uniform distribution over all states and actions (in Exact-FQI) or the replay buffer distribution. When using a replay buffer, it is therefore the average feature vector over the sampled transitions.
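As a reading aid, the constrained minimax game described above can be written in roughly the following form. This is a sketch under assumed notation rather than a verbatim statement: $Q_\theta$ denotes the Q-function, $p_\phi$ the adversarial weighting distribution, $y(s,a)$ the Bellman target, $\Phi$ the feature vector, $\varepsilon$ a tolerance for the constraint, and $\hat{\mathbb{E}}$ the estimator of the expected features. All of these symbols are assumptions introduced for illustration.

```latex
\min_{\theta} \; \max_{\phi} \;\;
  \mathbb{E}_{(s,a) \sim p_\phi}\!\left[ \big( Q_\theta(s,a) - y(s,a) \big)^2 \right]
\quad \text{s.t.} \quad
\left\| \mathbb{E}_{(s,a) \sim p_\phi}\!\left[ \Phi(s) \right]
      - \hat{\mathbb{E}}\!\left[ \Phi(s) \right] \right\| \le \varepsilon
```

In the continuous-action case, $\Phi(s)$ would be replaced by state-action features $\Phi(s,a)$.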
While both AFM and PER tend to upweight samples in the buffer with a high Bellman error, PER explicitly attempts to reduce distribution shift via importance sampling. As we observed in Section 7, distributional shift is not actually harmful in practice, and AFM dispenses with this goal, instead explicitly aiming to rebalance the buffer to attain better coverage via adversarial optimization. In our experiments, this results in substantially better performance, consistent with the hypothesis that coverage, rather than reduction of distributional shift, is the most important property in a sampling distribution.
In tabular domains with Exact-FQI, we find that AFM performs on par with the top-performing weighting distributions and better than Prioritized(s, a) (Fig. 8). This confirms that adaptive prioritization works better than Prioritized(s, a). Another benefit of AFM is its robustness to function approximation: the performance gains for small architectures are particularly noticeable (Fig. 8).
In tabular domains with Replay-FQI (Table 1), we also compare AFM to prioritized replay (PER) (Schaul et al., 2015), and find that AFM and PER perform similarly in terms of normalized returns. Note that AFM reweights samples drawn uniformly from the buffer, whereas PER changes which samples are actually drawn. We also evaluate a variant of AFM (AFM+Sampling in Table 1) which changes which samples are drawn instead of reweighting them. Essentially, in this version we sample from the replay buffer using probabilities determined by the AFM optimization, rather than using importance sampling while making Bellman updates. We note that, in Table 1, AFM+Sampling performs strictly better than AFM and PER.
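The difference between reweighting (AFM) and resampling (AFM+Sampling) can be illustrated with a small sketch. It assumes that nonnegative adversarial scores are already available for each transition in the buffer. The array names and helper functions below are hypothetical, and the scores are random stand-ins rather than the output of the AFM optimization.

```python
import numpy as np

rng = np.random.default_rng(0)

buffer_size, batch_size = 1000, 64
bellman_err_sq = rng.random(buffer_size)   # stand-in squared Bellman errors in [0, 1)
scores = rng.random(buffer_size) + 1e-3    # stand-in adversarial weights (> 0)

def reweighted_loss(err_sq, scores, batch_size, rng):
    """AFM-style: draw uniformly from the buffer, then reweight the sampled errors."""
    idx = rng.integers(0, len(err_sq), size=batch_size)
    w = scores[idx] / scores[idx].sum()    # self-normalized weights within the batch
    return float(np.sum(w * err_sq[idx]))

def resampled_loss(err_sq, scores, batch_size, rng):
    """AFM+Sampling-style: draw indices with probability proportional to the scores."""
    p = scores / scores.sum()
    idx = rng.choice(len(err_sq), size=batch_size, p=p)
    return float(np.mean(err_sq[idx]))     # plain average, no reweighting

loss_reweighted = reweighted_loss(bellman_err_sq, scores, batch_size, rng)
loss_resampled = resampled_loss(bellman_err_sq, scores, batch_size, rng)
print(loss_reweighted, loss_resampled)
```

Both functions estimate the same score-weighted average of the errors; they differ only in whether the scores enter through the weights or through the choice of samples.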
We further evaluate AFM on MuJoCo tasks with the TD3 algorithm (Fujimoto et al., 2018) and the entropy-constrained SAC algorithm (Haarnoja et al., 2018). We find that in all 3 tested domains (HalfCheetah, Hopper and Ant), AFM yields a substantial empirical improvement when combined with TD3 (Fig. 10) and a slight improvement when combined with entropy-constrained SAC (Fig. 11). Surprisingly, we found that PER did not work very well in these domains. In light of these results, we conclude that: (1) the choice of sampling distribution is very important for performance, and (2) incorporating knowledge about the function approximator (for example, through its learned features) into the choice of the sampling/weighting distribution can be very effective.
Sampling distribution  Norm. Returns (16, 16)  Norm. Returns (64, 64)
None  0.18  0.23 
Uniform(s, a)  0.19  0.25 
0.45  0.39  
0.30  0.21  
Prioritized(s, a)  0.17  0.33 
PER (Schaul et al., 2015)  0.42  0.49 
AFM (Ours)  0.41  0.48 
AFM + Sampling (Ours)  0.43  0.51 
8 Conclusions and Discussion
From our analysis, we have several broad takeaways for the design of deep Qlearning algorithms.
Potential convergence issues with Q-learning do not seem to be endemic empirically, but function approximation still has a strong impact on the solution to which these methods converge. This impact goes beyond just approximation error, suggesting that Q-learning methods do find suboptimal solutions (within the given function class) with smaller function approximators. However, expressive architectures largely mitigate this problem, suffer less from bootstrapping error, converge faster, and are more stable with moving targets.
Sampling error can cause substantial overfitting problems with Q-learning. However, replay buffers and early stopping can mitigate this problem, and the biases incurred from small function approximators outweigh any benefits they may have in terms of overfitting. We believe the best strategy is to keep large architectures but carefully select the number of gradient steps used per sample. We showed that oracle early stopping can provide large performance benefits in Q-learning. This motivates the future research direction of devising early stopping techniques that dynamically control the number of gradient steps in Q-learning, rather than setting it as a hyperparameter, since this choice can lead to large differences in performance.
The choice of sampling or weighting distribution has a significant effect on solution quality, even in the absence of sampling error. Surprisingly, we do not find on-policy distributions to be the most performant; rather, distributions which have high state entropy and spread mass uniformly among state-action pairs seem to be highly effective for training. Based on these insights, we propose a new weighting distribution, AFM, which balances high entropy and state aliasing, and which yields fair improvements in both tabular and continuous domains with state-of-the-art off-policy RL algorithms.
Finally, we note that there are many other topics in Qlearning that we did not investigate, such as overestimation bias and multistep returns. We believe that these issues too could be studied in future work with our oraclebased analysis framework.
Acknowledgements
We thank Vitchyr Pong and Kristian Hartikainen for providing us with implementations of RL algorithms. We thank Chelsea Finn for comments on an earlier draft of this paper. SL thanks George Tucker for helpful discussion. We thank Google, NVIDIA, and Amazon for providing computational resources. This research was supported by Berkeley DeepDrive, NSF IIS-1651843 and IIS-1614653, the DARPA Assured Autonomy program, and ARL DCIST CRA W911NF-17-2-0181.
References

Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML), pp. 214–223, 2017.
 Baird (1995) Baird, L. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning (ICML), 1995.
 Bertsekas & Tsitsiklis (1996) Bertsekas, D. P. and Tsitsiklis, J. N. Neurodynamic programming. Athena Scientific, 1996.
 Dai et al. (2018) Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J., and Song, L. Sbeed: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, pp. 1133–1142, 2018.
 Daskalakis et al. (2018) Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. Training GANs with optimism. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SJJySbbAZ.
 Ernst et al. (2005) Ernst, D., Geurts, P., and Wehenkel, L. Treebased batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
 Farrell & Berger (1995) Farrell, J. A. and Berger, T. On the effects of the training sample density in passive learning control. In American Control Conference, 1995.
 Fujimoto et al. (2018) Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actorcritic methods. In International Conference on Machine Learning (ICML), pp. 1587–1596, 2018.
 Geist et al. (2017) Geist, M., Piot, B., and Pietquin, O. Is the bellman residual a bad proxy? In Advances in Neural Information Processing Systems (NeurIPS), pp. 3205–3214. 2017.
 Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energybased policies. In International Conference on Machine Learning (ICML), 2017.
 Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actorcritic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018. URL http://arxiv.org/abs/1801.01290.
 Hausknecht & Stone (2016) Hausknecht, M. and Stone, P. Onpolicy vs. offpolicy updates for deep reinforcement learning. In Deep Reinforcement Learning: Frontiers and Challenges, IJCAI, 2016.
 Hazan et al. (2018) Hazan, E., Kakade, S. M., Singh, K., and Van Soest, A. Provably efficient maximum entropy exploration. arXiv preprint arXiv:1812.02690, 2018.
 Kalashnikov et al. (2018) Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., and Levine, S. Qtopt: Scalable deep reinforcement learning for visionbased robotic manipulation. In CoRL, volume 87 of Proceedings of Machine Learning Research, pp. 651–673. PMLR, 2018.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR), 2015.
 Lin (1992) Lin, L.J. Selfimproving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(34):293–321, 1992.
 Liu et al. (2018) Liu, V., Kumaraswamy, R., Le, L., and White, M. The utility of sparse representations for control in reinforcement learning. CoRR, abs/1811.06626, 2018. URL http://arxiv.org/abs/1811.06626.
 Maei et al. (2010) Maei, H. R., Szepesvári, C., Bhatnagar, S., and Sutton, R. S. Toward offpolicy learning control with function approximation. In International Conference on Machine Learning (ICML), 2010.
 Maillard et al. (2010) Maillard, O.A., Munos, R., Lazaric, A., and Ghavamzadeh, M. Finitesample analysis of bellman residual minimization. In Asian Conference on Machine Learning (ACML), pp. 299–314, 2010.
 Metelli et al. (2018) Metelli, A. M., Papini, M., Faccio, F., and Restelli, M. Policy optimization via importance sampling. CoRR, abs/1809.06098, 2018. URL http://arxiv.org/abs/1809.06098.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529–533, feb 2015. ISSN 00280836.

Munos (2005) Munos, R. Error bounds for approximate value iteration. In AAAI Conference on Artificial Intelligence (AAAI), pp. 1006–1011. AAAI Press, 2005.
 Munos & Szepesvári (2008) Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.
 Munos et al. (2016) Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient offpolicy reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1054–1062, 2016.
 Plappert et al. (2018) Plappert, M., Andrychowicz, M., Ray, A., McGrew, B., Baker, B., Powell, G., Schneider, J., Tobin, J., Chociej, M., Welinder, P., Kumar, V., and Zaremba, W. Multigoal reinforcement learning: Challenging robotics environments and request for research, 2018.
 Precup et al. (2001) Precup, D., Sutton, R. S., and Dasgupta, S. Offpolicy temporal difference learning with function approximation. In International Conference on Machine Learning (ICML), pp. 417–424, 2001.
 Riedmiller (2005) Riedmiller, M. Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317–328. Springer, 2005.
 Schaul et al. (2015) Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. International Conference on Learning Representations (ICLR), 2015.
 Scherrer (2010) Scherrer, B. Should one compute the temporal difference fix point or minimize the bellman residual? the unified oblique projection view. In International Conference on Machine Learning (ICML), pp. 959–966, 2010.
 ShalevShwartz & BenDavid (2014) ShalevShwartz, S. and BenDavid, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
 Shani et al. (2005) Shani, G., Heckerman, D., and Brafman, R. I. An mdpbased recommender system. Journal of Machine Learning Research, 6(Sep):1265–1295, 2005.
 Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. Second edition, 2018.
 Sutton et al. (2009a) Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. Fast gradientdescent methods for temporaldifference learning with linear function approximation. In International Conference on Machine Learning (ICML), 2009a.
 Sutton et al. (2009b) Sutton, R. S., Maei, H. R., and Szepesvári, C. A convergent o(n) temporaldifference algorithm for offpolicy learning with linear function approximation. In Advances in Neural Information Processing Systems (NeurIPS), 2009b.
 Sutton et al. (2016) Sutton, R. S., Mahmood, A. R., and White, M. An emphatic approach to the problem of offpolicy temporaldifference learning. The Journal of Machine Learning Research, 17(1):2603–2631, 2016.
 Szepesvári (1998) Szepesvári, C. The asymptotic convergencerate of qlearning. In Advances in Neural Information Processing Systems, pp. 1064–1070, 1998.
 Tosatto et al. (2017) Tosatto, S., Pirotta, M., D’Eramo, C., and Restelli, M. Boosted fitted qiteration. In International Conference on Machine Learning (ICML), pp. 3434–3443. JMLR. org, 2017.
 Tsitsiklis & Van Roy (1997) Tsitsiklis, J. N. and Van Roy, B. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1075–1081, 1997.
 Tuomas Haarnoja & Levine (2018) Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., and Levine, S. Soft actor-critic algorithms and applications. Technical report, 2018.
 Van Hasselt et al. (2018) Van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., and Modayil, J. Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648, 2018.
 Watkins & Dayan (1992) Watkins, C. J. and Dayan, P. Qlearning. Machine learning, 8(34):279–292, 1992.
 Yazıcı et al. (2019) Yazıcı, Y., Foo, C.S., Winkler, S., Yap, K.H., Piliouras, G., and Chandrasekhar, V. The unusual effectiveness of averaging in GAN training. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SJgw_sRqFQ.
 Zhang & Sutton (2017) Zhang, S. and Sutton, R. S. A deeper look at experience replay. CoRR, abs/1712.01275, 2017. URL http://arxiv.org/abs/1712.01275.
Appendices
Appendix A Benchmark Tabular Domains
We evaluate on a benchmark of 8 tabular domains, selected for qualitative differences.
4 Gridworlds. The Gridworld environment is an NxN grid with randomly placed walls. The reward is proportional to Manhattan distance to a goal state (1 at the goal, 0 at the initial position), and there is a 5% chance the agent travels in a different direction than commanded. We vary two parameters: the size ( and ), and the state representations. We use a “onehot” representation, an (X, Y) coordinate tuple (represented as two onehot vectors), and a “random” representation, a vector drawn from , where N is the width or height of the Gridworld. The random observation significantly increases the challenge of function approximation, as significant state aliasing occurs.
Cliffwalk: Cliffwalk is a toy example from Schaul et al. (2015). It consists of a sequence of states, where each state has two allowed actions: advance to the next state or return to the initial state. A reward of 1.0 is obtained when the agent reaches the final state. Observations consist of vectors drawn from .
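The transition structure of Cliffwalk is simple enough to write down directly. The sketch below is a hypothetical tabular version based only on the description above. The chain length, the action encoding, and the choice to keep the agent at the final state once it is reached are assumptions, and the random observation vectors are omitted.

```python
ADVANCE, RESET = 0, 1  # assumed action encoding

def cliffwalk_step(state, action, num_states):
    """Return (next_state, reward) for a chain of `num_states` states."""
    last = num_states - 1
    if action == RESET:
        return 0, 0.0                        # return to the initial state
    next_state = min(state + 1, last)        # advance; stay put at the end (assumption)
    # Reward of 1.0 only on the step that first reaches the final state.
    reward = 1.0 if (next_state == last and state != last) else 0.0
    return next_state, reward

# Roll out the always-advance policy on a 5-state chain.
state, total_reward = 0, 0.0
for _ in range(4):
    state, reward = cliffwalk_step(state, ADVANCE, num_states=5)
    total_reward += reward
print(state, total_reward)
```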
InvertedPendulum and MountainCar: InvertedPendulum and MountainCar are discretized versions of continuous control tasks found in OpenAI Gym (Plappert et al., 2018), and are based on problems from the classical RL literature. In the InvertedPendulum task, an agent must swing up a pendulum and hold it in its upright position. The state consists of the angle and angular velocity of the pendulum. Maximum reward is given when the pendulum is upright. The observation consists of the sine and cosine of the pendulum angle, and the angular velocity. In the MountainCar task, the agent must push a vehicle up a hill, but the hill is steep enough that the agent must gather momentum by swinging back and forth within a valley in order to reach the top. The state consists of the position and velocity of the vehicle.
SparseGraph: The SparseGraph environment is a 256state graph with randomly drawn edges. Each state has two edges, each corresponding to an action. One state is chosen as the goal state, where the agent receives a reward of one.
Appendix B Fitted Qiteration with Bounded Projection Error
When function approximation is introduced to Q-iteration, we lose guarantees that our solution will converge to the optimal solution $Q^*$, because the composition of projection and backup is no longer guaranteed to be a contraction under any norm. However, this does not imply divergence, and in most cases it merely degrades the quality of the solution found.
This can be seen by recalling the following result from Bertsekas & Tsitsiklis (1996), which describes the quality of the solution obtained by fitted Q-iteration (FQI) when the projection error at each step is bounded. The conclusion is that FQI converges to an $\ell_\infty$ ball around the optimal solution whose radius scales proportionally with the projection error. While this statement does not claim that divergence cannot occur in general (this theorem can only be applied in retrospect, since we cannot always uniformly bound the projection error at each iteration), it nevertheless offers important intuitions on the behavior of FQI under approximation error. For similar results concerning weighted norms, see Munos (2005).
Theorem B.1 (Bounded error in fitted Qiteration).
Let the projection or Bellman error at each iteration of FQI be uniformly bounded by , i.e. . Then, the error in the final solution is bounded as
Proof.
See Chapter 6 of Bertsekas & Tsitsiklis (1996). ∎
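For intuition, the standard argument behind a bound of this kind is short. The sketch below assumes notation that is not fixed in the statement above: $Q_k$ for the iterates, $\mathcal{T}$ for the Bellman backup (a $\gamma$-contraction in $\ell_\infty$ with fixed point $Q^*$), and $\delta$ for a uniform bound on the per-iteration error $\|Q_{k+1} - \mathcal{T} Q_k\|_\infty$.

```latex
\begin{aligned}
\|Q_{k+1} - Q^*\|_\infty
  &\le \|Q_{k+1} - \mathcal{T} Q_k\|_\infty + \|\mathcal{T} Q_k - \mathcal{T} Q^*\|_\infty \\
  &\le \delta + \gamma \, \|Q_k - Q^*\|_\infty , \\
\text{so} \quad \limsup_{k \to \infty} \|Q_k - Q^*\|_\infty
  &\le \delta \sum_{i=0}^{\infty} \gamma^i = \frac{\delta}{1-\gamma} .
\end{aligned}
```

The second line uses $\mathcal{T} Q^* = Q^*$ and the contraction property; unrolling the recursion gives the geometric series in the last line.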
We can use this statement to provide a bound on the performance of the final policy.
Corollary B.1.1.
Suppose we run fitted Q-iteration, and let the projection error at each iteration be uniformly bounded by , i.e. . Letting denote the returns of a policy , the performance of the final policy is bounded as:
Proof.
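One standard route to a policy-performance bound of this type, given here only as a sketch, combines a bound on the Q-function error with the classical result that acting greedily with respect to an $\varepsilon$-accurate Q-function (in $\ell_\infty$) loses at most $2\varepsilon/(1-\gamma)$ in value. The symbols $V^*$, $V^{\pi_Q}$ (the value of the policy $\pi_Q$ that is greedy with respect to $Q$), $\varepsilon$, and $\delta$ are assumed notation, and the constants may differ from those in the corollary above.

```latex
\|Q - Q^*\|_\infty \le \varepsilon
\;\;\Longrightarrow\;\;
\|V^* - V^{\pi_Q}\|_\infty \le \frac{2\varepsilon}{1-\gamma},
\qquad \text{and with } \varepsilon = \frac{\delta}{1-\gamma}: \quad
\|V^* - V^{\pi_Q}\|_\infty \le \frac{2\delta}{(1-\gamma)^2} .
```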
B.1 Unbounded Divergence in FQI
Because norms are bounded by the norm, Thm. B.1 implies that unbounded divergence is impossible when weighting distribution has positive support at all states and actions (i.e. ), and the projection is nonexpansive in the norm (such as when using linear approximators).
We can bound the weighted in terms of the as follows: . Thus, we can apply Thm. B.1 with to show that unbounded divergence is impossible. Note that because this bound scales with the size of the state and action spaces, it is fairly loose in many practical cases, and practitioners may nevertheless see Qvalues grow to large values (tighter bounds concerning L2 norms can be found in (Munos, 2005), which depend on the transition distribution). It also suggests that distributions which are fairly uniform (so as to maximize the denominator) can perform well.
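The relationship between the two norms can be checked numerically. The sketch below assumes a weighting distribution `mu` with full support and uses the elementary inequalities $\|f\|_{\mu} \le \|f\|_\infty \le \|f\|_{\mu} / \sqrt{\min_x \mu(x)}$, where $\|f\|_{\mu}^2 = \sum_x \mu(x) f(x)^2$. The exact form of the bound used above may differ, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                    # number of state-action pairs (assumed)

f = rng.normal(size=n)                    # stand-in for an error vector
mu = rng.random(n) + 0.05                 # strictly positive weights
mu /= mu.sum()                            # normalize to a distribution

weighted_l2 = np.sqrt(np.sum(mu * f**2))
linf = np.max(np.abs(f))
upper = weighted_l2 / np.sqrt(mu.min())   # loosens as min(mu) shrinks

# For a uniform distribution, min(mu) = 1/n, the largest value it can take.
uniform = np.full(n, 1.0 / n)
upper_uniform = np.sqrt(np.sum(uniform * f**2)) / np.sqrt(uniform.min())

print(weighted_l2, linf, upper, upper_uniform)
```

Because the smallest weight appears in the denominator, the factor relating the two norms is smallest for the uniform distribution, where it equals the square root of the number of state-action pairs.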
When the weighting distribution does not have support over all states and actions, divergence can still occur, as noted in counterexamples such as the one in Section 11.2 of Sutton & Barto (2018). In this case, we consider two states (state 1 and state 2) with feature vectors 1 and 2, respectively, and a linear approximator with parameter $w$. There exists a single action with a deterministic transition from state 1 to state 2, and we only sample the transition from state 1 to state 2 (i.e. the weighting distribution places weight 1 on state 1 and 0 on state 2). All rewards are 0. In this case, the projected Bellman backup takes the form:
$$w_{k+1} = \arg\min_{w} \left( w - 2\gamma w_k \right)^2 = 2\gamma w_k,$$
where $\gamma$ is the discount factor. This will cause unbounded growth when iterated, provided $\gamma > 1/2$. However, if we add a transition from state 2 back to itself or to state 1, and place nonzero probability on sampling these transitions, divergence can be avoided.
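The two-state counterexample can be simulated in a few lines. The sketch assumes the setup described above (features 1 and 2, a single sampled transition from state 1 to state 2, zero reward), under which each projected backup multiplies the parameter by twice the discount factor. The function name and the starting value are illustrative.

```python
def iterate_backup(w0, gamma, num_iters):
    """Repeat the projected backup w <- argmin_w (1*w - gamma * 2*w_prev)**2."""
    w = w0
    history = [w]
    for _ in range(num_iters):
        target = gamma * 2.0 * w    # r + gamma * Q(state 2), with r = 0 and feature 2
        w = target / 1.0            # least-squares fit at state 1, whose feature is 1
        history.append(w)
    return history

diverging = iterate_backup(w0=1.0, gamma=0.9, num_iters=20)
shrinking = iterate_backup(w0=1.0, gamma=0.4, num_iters=20)
print(abs(diverging[-1]), abs(shrinking[-1]))
```

With a discount of 0.9 the parameter grows geometrically, whereas with a discount of 0.4 it decays toward zero, matching the threshold of one half.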
Appendix C Smoothed Q-iteration
In this section we show that the smoothed Bellman backup introduced in Section 6 is still a valid Qiteration method, in that it is a contraction (for ) and thus converges to .
We define the smoothed Bellman backup as:
Theorem C.1 (Contraction rate of the smoothed Bellman backup).
is a contraction:
Proof.
This statement follows from a straightforward application of the triangle inequality and the fact that the Bellman backup is a contraction:
∎
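To spell out the argument, suppose the smoothed backup takes the form $\mathcal{T}^{\alpha} Q = \alpha Q + (1-\alpha)\,\mathcal{T} Q$ with $\alpha \in [0, 1)$. This form, and the symbols $\alpha$ and $\mathcal{T}$, are assumptions made for illustration; the definition in Section 6 should be taken as authoritative. Under this assumption, the triangle inequality and the $\gamma$-contraction property of $\mathcal{T}$ give:

```latex
\begin{aligned}
\|\mathcal{T}^{\alpha} Q_1 - \mathcal{T}^{\alpha} Q_2\|_\infty
  &\le \alpha \, \|Q_1 - Q_2\|_\infty + (1-\alpha) \, \|\mathcal{T} Q_1 - \mathcal{T} Q_2\|_\infty \\
  &\le \big( \alpha + (1-\alpha)\gamma \big) \, \|Q_1 - Q_2\|_\infty .
\end{aligned}
```

Since $\alpha + (1-\alpha)\gamma < 1$ whenever $\alpha < 1$ and $\gamma < 1$, the map is a contraction. Its fixed point satisfies $Q = \alpha Q + (1-\alpha)\mathcal{T} Q$, that is, $Q = \mathcal{T} Q$, so it coincides with the fixed point of $\mathcal{T}$.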
Appendix D Adversarial Feature Matching (AFM): Detailed Explanation and Practical Implementation
As described in section 7.3, we devise a novel weighting scheme for the Bellman error objective based on an adversarial minimax game. The adversary computes weights (representing the weighting distribution ), for the Bellman error: . Recalling from Section 7.3, the optimization problem is given by:
where are the state features learned by the Qfunction approximator. is easy to extract out of the multiheaded () model typically used for discrete action control, as one choice is to let be the output of the penultimate layer of the Qnetwork. For continuous control tasks, however, we model (which is a function of the actions as well) as stateonly features are unavailable, unless separately modeled. This can also be interpreted as modelling a feature matching constraint on the gradient of with respect to the last linear parameters . A possible extension is to take into account the entire gradient as the features in the feature matching constraint, that is, .
This choice of constraint is suitable and can be interpreted in two ways. First, an adversary constrained in this manner has enough power to exploit the Q-network at states which get aliased under the chosen function class, thereby promoting more separable feature learning and reducing some negative aspects of function approximation that can arise in Q-learning. This is also similar in motivation to Liu et al. (2018). Second, this feature constraint bears a similarity to the Maximum Mean Discrepancy (MMD) distance between two distributions, which can be written as the norm of the difference between their mean embeddings under the canonical feature map (from real space to the RKHS). In our context, this is analogous to optimizing a distance between the adversarial distribution and the replay buffer distribution (as the average feature vector is a Monte-Carlo estimator of the expected feature vector under the replay buffer distribution). In light of these arguments, AFM, and other associated methods that take the properties of the function approximator into account (for example, its learned features here), can greatly reduce the bias incurred due to function approximation in the course of Q-learning/FQI, as depicted in 1.
Solving the optimization
We solve this saddle point problem using alternating dual gradient descent. We first solve the inner maximization problem, and then use its solution to solve the outer minimization problem. We first compute the Lagrangian for the maximization by introducing a dual variable:

(Note that this Lagrangian is flipped in sign because we first convert the maximization problem to standard minimization form.) We now solve the inner problem using dual gradient descent. We then plug the solutions (approximate solutions obtained after gradient descent) into the Lagrangian, and solve the outer minimization over the Q-function parameters. Note that while the feature vector depends on the Q-function parameters (as it is the feature layer of the Q-network), we do not backpropagate through it while solving the minimization. This improves the stability of Q-network training in practice and ensures that the Q-function is only affected by FQI updates. In practice, we take up to 10 gradient steps on the inner problem for every gradient step on the outer problem. The algorithm is summarized in Algorithm 4. The results provided in the main paper and here do not rely on additional tricks such as optimistic gradient (Daskalakis et al., 2018) or an exponential moving average of the parameters (Yazıcı et al., 2019), although our tabular experiments seemed to benefit somewhat from these tricks.
Practical implementation with replay buffers
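As an illustration of the alternating scheme from "Solving the optimization", the inner problem can be sketched for a small batch. This is a simplified sketch under several assumptions: the adversary is a softmax distribution over the batch rather than a neural network, the constraint is a squared-norm feature matching constraint with a hypothetical tolerance `eps`, and the step sizes are arbitrary. It is not the procedure of Algorithm 4.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def inner_adversary(err_sq, feats, eps=0.05, steps=10, lr=0.5, dual_lr=0.5):
    """Approximately solve max_p sum_i p_i * err_sq_i subject to
    ||sum_i p_i * feats_i - mean(feats)||^2 <= eps, via dual gradient descent."""
    logits = np.zeros(len(err_sq))       # p = softmax(logits), uniform at the start
    lam = 0.0                            # dual variable for the constraint
    target = feats.mean(axis=0)          # expected features under uniform sampling
    for _ in range(steps):
        p = softmax(logits)
        gap = feats.T @ p - target
        # Gradient of the sign-flipped Lagrangian with respect to p.
        grad_p = -err_sq + 2.0 * lam * (feats @ gap)
        # Chain rule through the softmax parameterization.
        grad_logits = p * (grad_p - p @ grad_p)
        logits -= lr * grad_logits       # primal descent step
        violation = gap @ gap - eps
        lam = max(0.0, lam + dual_lr * violation)  # dual ascent, kept nonnegative
    return softmax(logits), lam

rng = np.random.default_rng(0)
feats = rng.normal(size=(32, 4))         # stand-in penultimate-layer features
err_sq = rng.random(32)                  # stand-in squared Bellman errors
weights, lam = inner_adversary(err_sq, feats)

# The outer step would then minimize sum_i weights[i] * err_sq_i with respect to
# the Q-function parameters, treating `weights` and `feats` as constants.
print(weights.sum(), lam)
```

The default of 10 inner steps mirrors the ratio of inner to outer steps mentioned above; the remaining hyperparameters are placeholders.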
We incorporate this weighting/sampling distribution into Q-learning in the setting with replay buffers and with state-action sampling. We evaluate the weighting version of our method, AFM, in which we sample a large batch of state-action pairs from a standard replay buffer, as used in Q-learning, and then use importance weights to match the adversarial distribution in expectation. Thus, we use a parametric function approximator to model the importance weights of the adversarial distribution with respect to the replay buffer distribution. Mathematically, we estimate the expected Bellman error under the adversarial distribution as an expectation, under the replay buffer distribution, of the Bellman error multiplied by these importance weights. The latter expectation is then approximated using a finite set of samples. It has been noted in the literature that importance sampling (IS) suffers from high variance, especially if the number of samples is small. Hence, we use the self-normalized importance sampling estimator, which normalizes the importance weights by their sum over a batch of samples, and we use these normalized values as the importance weights. We also regularize the second-order Rényi divergence between the adversarial distribution and the replay buffer distribution for stability. Mathematically, it can be shown that this yields a lower bound on the true expectation under the adversarial distribution, which is being estimated using importance sampling. This result has also been shown by Metelli et al. (2018) (Theorem 4.1), where the authors use this lower bound in policy optimization via importance sampling. We state the theorem below for completeness.
Theorem D.1.
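(Metelli et al., 2018) Let $P$ and $Q$ be two probability measures on the measurable space $(\mathcal{X}, \mathcal{F})$ such that $P \ll Q$ and $d_2(P \| Q) < +\infty$. Let $x_1, \dots, x_N$ be i.i.d. random variables sampled from $Q$, and $f : \mathcal{X} \to \mathbb{R}$ be a bounded function. Then, for any $0 < \delta \le 1$ and $N > 0$, with probability at least $1 - \delta$ it holds that:
$$\mathbb{E}_{x \sim P}[f(x)] \ge \frac{1}{N} \sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i) - \|f\|_\infty \sqrt{\frac{(1-\delta)\, d_2(P \| Q)}{\delta N}},$$
where $w_{P/Q} = dP/dQ$ denotes the importance weights and $d_2(P \| Q)$ is the exponentiated second-order Rényi divergence between $P$ and $Q$.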
Hence, the objective for the inner loop is now computed using samples, with an additional Rényi regularization term. Since we model the importance ratio through our parametric model, we can easily compute an estimator for the Rényi divergence term. The overall lower-bound inner maximization problem is:
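A rough numerical illustration of such a penalized estimate is given below. It assumes the lower bound of Metelli et al. (2018, Theorem 4.1), in which the importance-weighted average is reduced by a term proportional to the square root of the exponentiated second-order Rényi divergence. The function name, the constant `f_max` standing in for the unknown norm, the confidence level `delta`, and the synthetic weights are illustrative assumptions.

```python
import numpy as np

def penalized_estimate(weights, values, f_max, delta=0.1, self_normalize=True):
    """Importance-weighted mean of `values` minus a Renyi-divergence penalty."""
    weights = np.asarray(weights, dtype=float)
    n = len(weights)
    # Exponentiated second-order Renyi divergence d_2(P || Q) = E_Q[(p/q)^2],
    # estimated by the batch mean of the squared importance weights.
    d2_hat = np.mean(weights**2)
    if self_normalize:
        w = weights / weights.sum()       # self-normalized weights sum to one
        estimate = np.sum(w * values)
    else:
        estimate = np.mean(weights * values)
    penalty = f_max * np.sqrt((1.0 - delta) * d2_hat / (delta * n))
    return estimate - penalty, d2_hat

rng = np.random.default_rng(0)
raw_weights = np.exp(0.3 * rng.normal(size=500))  # stand-in ratios p / q
bellman_err_sq = rng.random(500)                  # stand-in values in [0, 1)

lower, d2_hat = penalized_estimate(raw_weights, bellman_err_sq, f_max=1.0)
print(lower, d2_hat)
```

Because the penalty grows with the second moment of the weights, maximizing the penalized estimate discourages the adversary from concentrating mass on a few samples.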
We found that this Rényi penalty helped stabilize training. In practice, we model the importance weights as a parametric model with an identical architecture to the Q-network. We use parameter clipping for this model, where the parameters are clipped to a bounded range, analogous to Wasserstein GANs (Arjovsky et al., 2017). We also found that self-normalization during importance sampling has a huge practical benefit. Note that the true norm of the Bellman error, which is needed to compute the Rényi divergence term, is not known; hence we either replace it by a constant or compute a stochastic approximation to the norm over the current batch. We found the former to be more stable, and hence used it in all our experiments. The coefficient of the Rényi divergence penalty is tuned uniformly over a range of values. The learning rate for the adversary was chosen to be 1e-4 for the tabular environments and 5e-4 for TD3. The batch size for our algorithm was chosen to be 128 for the tabular environments and 500 for TD3/SAC. Note that a larger batch size ensures smoothness in the minimax optimization problem. We also found that, instead of having a single Lagrange multiplier for the feature matching constraint, having separate Lagrange multipliers constraining each individual dimension of the features also helps considerably. This is to ensure that the hyperparameters remain the same across different architectures regardless of the dimension of the penultimate layer of the Q-network. The algorithm in this case is exactly the same as the algorithm before, with a vector-valued dual variable. We used TD3 and SAC implementations from rlkit (https://github.com/vitchyr/rlkit/tree/master/rlkit).
Appendix E Function Approximation Analysis on MuJoCo Tasks
As discussed in Section 4, we validate our findings on the effect of function approximation on 3 MuJoCo tasks from OpenAI Gym with the SAC algorithm, using the authors' implementation (Tuomas Haarnoja & Levine, 2018). We observe that bigger networks generally learn faster and reach better performance.
Performance of architectures of different sizes on 3 benchmark MuJoCo tasks from the OpenAI Gym suite with the SAC algorithm. Values are averaged over 3 different seeds. A bigger network performs better in terms of both learning speed and performance, measured in terms of returns. Each epoch on the x-axis corresponds to 1000 environment steps.