Q-Value Weighted Regression: Reinforcement Learning with Limited Data

02/12/2021 ∙ by Piotr Kozakowski, et al. ∙ 41

Sample efficiency and performance in the offline setting have emerged as significant challenges of deep reinforcement learning. We introduce Q-Value Weighted Regression (QWR), a simple RL algorithm that excels in these aspects. QWR is an extension of Advantage Weighted Regression (AWR), an off-policy actor-critic algorithm that performs very well on continuous control tasks, also in the offline setting, but has low sample efficiency and struggles with high-dimensional observation spaces. We perform an analysis of AWR that explains its shortcomings and use these insights to motivate QWR. We show experimentally that QWR matches the state-of-the-art algorithms both on tasks with continuous and discrete actions. In particular, QWR yields results on par with SAC on the MuJoCo suite and - with the same set of hyperparameters - yields results on par with a highly tuned Rainbow implementation on a set of Atari games. We also verify that QWR performs well in the offline RL setting.




1 Introduction

Deep reinforcement learning has been applied to a large number of challenging tasks, from games (silver2017mastering; OpenAI_dota; vinyals2017starcraft) to robotic control (sadeghi2016cad2rl; openai2018dexterous; rusu2016sim2real). Since RL makes minimal assumptions about the underlying task, it holds the promise of automating a wide range of applications. However, its widespread adoption has been hampered by a number of challenges. Reinforcement learning algorithms can be substantially more complex to implement and tune than standard supervised learning methods, can have a fair number of hyperparameters while being brittle with respect to their choice, and may require a large number of interactions with the environment.

These issues are well-known and there has been significant progress in addressing them. The policy gradient algorithm REINFORCE (reinforce1) is simple to understand and implement, but is both brittle and requires on-policy data. Proximal Policy Optimization (PPO, schulman2017proximal) is a more stable on-policy algorithm that has seen a number of successful applications despite requiring a large number of interactions with the environment. Soft Actor-Critic (SAC, haarnoja2018soft) is a much more sample-efficient off-policy algorithm, but it is defined only for continuous action spaces and does not work well in the offline setting, also known as batch reinforcement learning, where all samples are provided from earlier interactions with the environment, and the agent cannot collect more samples. Advantage Weighted Regression (AWR, peng2019advantageweighted) is a recent off-policy actor-critic algorithm that works well in the offline setting and is built using only simple and convergent maximum likelihood loss functions, making it easier to tune and debug. It is competitive with SAC given enough time to train, but is less sample-efficient and has not been demonstrated to succeed in settings with discrete actions.

We replace the value function critic of AWR with a Q-value function. Next, we add action sampling to the actor training loop. Finally, we introduce a custom backup to the Q-value training. The resulting algorithm, which we call Q-Value Weighted Regression (QWR), inherits the advantages of AWR but is more sample-efficient and works well with discrete actions and in visual domains, e.g., on Atari games.

To better understand QWR, we perform a number of ablations, varying the number of action samples in actor training, the advantage estimator, and the aggregation function. These choices affect the performance of QWR only to a limited extent, and it remains stable with each of the choices across the tasks we experiment with.

We run experiments with QWR on the MuJoCo environments and on a subset of the Arcade Learning Environment. Since sample efficiency is our main concern, we focus on the difficult case when the number of interactions with the environment is limited: in most of our experiments we limit it to 100K interactions. The experiments demonstrate that QWR is indeed more sample-efficient than AWR. On MuJoCo, it performs on par with Soft Actor-Critic (SAC), the current state-of-the-art algorithm for continuous domains. On Atari, QWR performs on par with OTRainbow, a variant of Rainbow highly tuned for sample efficiency. Notably, we use the same set of hyperparameters (except for the network architecture) for both MuJoCo and Atari experiments. We verify that QWR also performs well in the regime where more data is available: with 1M interactions, QWR still outperforms SAC on all MuJoCo environments we tested except for HalfCheetah.

2 Q-Value Weighted Regression

2.1 Advantage Weighted Regression

peng2019advantageweighted recently proposed Advantage Weighted Regression (AWR), an off-policy, actor-critic algorithm notable for its simplicity and stability, achieving competitive results across a range of continuous control tasks. It can be expressed as interleaving data collection and two regression tasks performed on the replay buffer, as shown in Algorithm 1.

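1:  $\theta \leftarrow$ random actor parameters
2:  $\phi \leftarrow$ random critic parameters
3:  $\mathcal{D} \leftarrow \emptyset$
4:  for $k$ in $1..n\_iterations$ do
5:     add trajectories $\{\tau_i\}$ sampled by $\pi_\theta$ to $\mathcal{D}$
6:     for $i$ in $1..n\_critic\_steps$ do
7:        sample $(s, a) \sim \mathcal{D}$
8:        $\phi \leftarrow \phi - \alpha_V \nabla_\phi \left[ \| R^{\mathcal{D}}_{s,a} - V_\phi(s) \|^2 \right]$
9:     end for
10:     for $i$ in $1..n\_actor\_steps$ do
11:        sample $(s, a) \sim \mathcal{D}$
12:        $A \leftarrow R^{\mathcal{D}}_{s,a} - V_\phi(s)$
13:        $\theta \leftarrow \theta + \alpha_\pi \nabla_\theta \left[ \log \pi_\theta(a|s) \exp\left( \frac{1}{\beta} A \right) \right]$
14:     end for
15:  end for
Algorithm 1 Advantage Weighted Regression.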

AWR optimizes expected improvement of an actor policy $\pi(a|s)$ over a sampling policy $\mu(a|s)$ by regression towards the well-performing actions in the collected experience. Improvement is achieved by weighting the actor loss by the exponentiated advantage $A_\mu(s, a)$ of an action, skewing the regression towards the better-performing actions. The advantage is calculated based on the expected return $R^\mu_{s,a}$ achieved by performing action $a$ in state $s$ and then following the sampling policy $\mu$. To calculate the advantage, one first estimates the value, $V_\mu(s)$, using a learned critic and then computes $A_\mu(s, a) = R^\mu_{s,a} - V_\mu(s)$. This results in the following formula for the actor:

$$\arg\max_\pi \; \mathbb{E}_{s \sim d_\mu(s)} \mathbb{E}_{a \sim \mu(\cdot|s)} \left[ \log \pi(a|s) \exp\left( \frac{1}{\beta} \left( R^\mu_{s,a} - V_\mu(s) \right) \right) \right] \tag{1}$$

In this formula, $d_\mu(s)$ denotes the unnormalized, discounted state visitation distribution of the policy $\mu$, and $\beta$ is a temperature hyperparameter.

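The critic is trained to estimate the future returns of the sampling policy $\mu$:

$$V_\mu = \arg\min_V \; \mathbb{E}_{s \sim d_\mu(s)} \mathbb{E}_{a \sim \mu(\cdot|s)} \left[ \left\| R^\mu_{s,a} - V(s) \right\|^2 \right] \tag{2}$$

To achieve off-policy learning, the actor and the critic are trained on data collected from a mixture of policies from different training iterations, stored in the replay buffer $\mathcal{D}$.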
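The two regression losses described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the inputs are assumed to be per-sample lists already drawn from the replay buffer, and the weight clipping (`max_weight`) is an added assumption for numerical safety that the text above does not mention.

```python
import math


def awr_weights(returns, values, beta=1.0, max_weight=20.0):
    """Per-sample actor weights exp((R - V) / beta), clipped at max_weight.

    The clipping threshold is an assumption of this sketch.
    """
    log_cap = math.log(max_weight)
    weights = []
    for ret, val in zip(returns, values):
        advantage = ret - val  # A(s, a) = R_{s,a} - V(s)
        # Cap the exponent before calling exp to avoid overflow.
        weights.append(math.exp(min(advantage / beta, log_cap)))
    return weights


def awr_actor_loss(log_probs, weights):
    """Negative weighted log-likelihood; minimizing it maximizes the actor objective."""
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


def awr_critic_loss(returns, values):
    """Mean squared error between observed returns and critic predictions."""
    return sum((r - v) ** 2 for r, v in zip(returns, values)) / len(returns)
```

Actions whose return exceeds the critic's estimate receive weights above 1 and therefore pull the actor more strongly; a smaller `beta` makes this preference sharper.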

2.2 Analysis of AWR with Limited Data

While AWR achieves very good results after longer training, it is not very sample efficient, as noted in the future work section of peng2019advantageweighted. To understand this problem, we analyze a single loop of actor training in AWR under a special assumption.

The assumption we introduce, called state-determines-action, concerns the content of the replay buffer $\mathcal{D}$ of an off-policy RL algorithm. The replay buffer contains the state-action pairs that the algorithm has visited so far during its interactions with the environment. We say that a replay buffer $\mathcal{D}$ satisfies the state-determines-action assumption when for each state in the buffer, there is a unique action that was taken from it, formally:

$$\forall\, (s, a), (s', a') \in \mathcal{D}: \quad s = s' \implies a = a'$$

This assumption may seem limiting, and indeed it does not hold in many artificial RL experiments with discrete state and action spaces. In such settings, even a random policy starting from the same state could violate the assumption the second time it collects a trajectory. Note, however, that state-determines-action is almost always satisfied in continuous control, where even a slightly random policy is unlikely to ever perform the exact same action twice and transition to exactly the same state.

Note that our assumption applies well to real-world experiments with high-dimensional state spaces, as any amount of noise added to a high-dimensional space will make repeating the exact same state highly improbable. For example, consider a robot observing 32x32 pixel images. To repeat an observation, each of the 1024 pixels would have to have exactly the same value, which is close to impossible, even with a small amount of pixel noise coming from a camera. This assumption also holds in cases with limited data, even in discrete state and action spaces. When the number of collected trajectories is not enough to span the state space, it is unlikely a state will be repeated in the replay buffer. This makes our assumption particularly relevant to the study of sample efficiency.
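Whether a given replay buffer satisfies the state-determines-action assumption can be checked directly. The sketch below is a hypothetical helper, assuming the buffer is an iterable of `(state, action)` pairs with hashable entries; it relies on exact equality of states, which is how the assumption is defined.

```python
def satisfies_state_determines_action(buffer):
    """Return True if every state in the buffer appears with a single action.

    `buffer` is assumed to be an iterable of (state, action) pairs whose
    elements are hashable (e.g. tuples). States are compared exactly.
    """
    first_action = {}
    for state, action in buffer:
        # setdefault stores the first action seen for this state and
        # returns the stored one on every later visit.
        if first_action.setdefault(state, action) != action:
            return False
    return True
```

Repeating the same `(state, action)` pair does not violate the assumption; only two different actions recorded for an identical state do.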

We emphasize that the state-determines-action assumption, by design, considers exact equality of states. Two very similar, but not equal states that lead to different actions do not violate our assumption. This makes it irrelevant to reinforcement learning with linear functions, as linear functions cannot separate similar states. However, it is relevant in deep RL because deep neural networks can indeed distinguish even very similar inputs (adv-examples; margins; understanding-generalization).

How does AWR perform under the state-determines-action assumption? In Theorems 1 and 2 (see Appendix 6.2 for more details), we show that for popular choices of discrete and Gaussian distributions, the AWR update rule under this assumption will converge to a policy that assigns probability 1 to the actions already present in the replay buffer, thus cloning the previous behaviors. This is not the desired behavior, as an agent should consider various actions from each state, to ensure exploration.

Theorem 1.

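Let $\mathcal{A}$ be a discrete action space. Let a replay buffer $\mathcal{D}$ satisfy the state-determines-action assumption. Let $\tilde{\mu}$ be the probability function of a distribution that clones the behavior from $\mathcal{D}$, i.e., that assigns to each state $s$ from $\mathcal{D}$ the action $a$ such that $(s, a) \in \mathcal{D}$ with probability $1$. Then, under the AWR update, the actor converges to the cloning policy: $\pi = \tilde{\mu}$.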
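The intuition behind this result can be illustrated numerically for a single state. The sketch below is not part of the proof: it assumes a tabular softmax policy over four actions (a choice made only for this example) and repeatedly applies gradient ascent on the weighted log-likelihood of the one action stored for that state. Because the exponentiated-advantage weight is always positive, the probability of the stored action is pushed towards 1 regardless of how small the weight is.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_regression_step(logits, action, weight, lr=0.5):
    """One gradient-ascent step on weight * log softmax(logits)[action].

    The gradient of log softmax(z)[a] with respect to z[i] is 1[i == a] - p[i].
    """
    probs = softmax(logits)
    return [
        z + lr * weight * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]


def probability_of_buffer_action(n_actions, action, weight, steps):
    """Probability assigned to the buffer's action after `steps` updates."""
    logits = [0.0] * n_actions  # start from the uniform policy
    for _ in range(steps):
        logits = weighted_regression_step(logits, action, weight)
    return softmax(logits)[action]
```

With a unique action per state there is no competing term in the loss, so the weight only changes the speed of convergence, not its destination.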

The state-determines-action assumption is the main motivating point behind QWR, whose theoretical properties are proven in Theorem 3 in Appendix 6.3. We now illustrate the importance of this assumption by creating a simple environment in which it holds with high probability. We verify experimentally that AWR fails on this simple environment, while QWR is capable of solving it.

Figure 1: AWR and QWR on the BitFlip environment. The maximum possible return is 5.

The environment, which we call BitFlip, is parameterized by an integer $N$. The state of the environment consists of $N$ bits and a step counter. The action space consists of $N$ actions. When action $i$ is chosen, the $i$-th bit is flipped and the step counter is incremented. A game of BitFlip starts in a random state with the step counter set to 0, and proceeds for 5 steps. The initial state is randomized in such a way that at least 5 bits are always set to 0. At each step, the reward is $+1$ if a bit was flipped from $0$ to $1$ and $-1$ in the opposite case.

Since BitFlip starts in one random state out of roughly $2^N$, at large enough $N$ it is highly unlikely that the starting state will ever be repeated in the replay buffer. As the initial policy is random and BitFlip maintains a step counter to prevent returning to a state, the same holds for subsequent states.
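The description of BitFlip is concrete enough to sketch the environment in code. The version below is a minimal illustration under stated assumptions, not the authors' implementation: the class and method names are hypothetical, the rewards are taken to be +1 for a 0-to-1 flip and -1 otherwise (consistent with the maximum return of 5), and the initial state is generated by forcing 5 random positions to 0 and randomizing the rest, which is one of several ways to guarantee at least 5 zero bits.

```python
import random


class BitFlip:
    """Minimal BitFlip sketch: n_bits bits plus a step counter, 5-step episodes."""

    EPISODE_LENGTH = 5

    def __init__(self, n_bits, seed=None):
        if n_bits < self.EPISODE_LENGTH:
            raise ValueError("n_bits must be at least 5")
        self.n_bits = n_bits
        self.rng = random.Random(seed)
        self.bits = [0] * n_bits
        self.step_count = 0

    def reset(self):
        # Assumed initialization: 5 random positions are forced to 0,
        # all other bits are drawn uniformly at random.
        zeros = set(self.rng.sample(range(self.n_bits), self.EPISODE_LENGTH))
        self.bits = [
            0 if i in zeros else self.rng.randint(0, 1)
            for i in range(self.n_bits)
        ]
        self.step_count = 0
        return self.observation()

    def step(self, action):
        if self.step_count >= self.EPISODE_LENGTH:
            raise RuntimeError("episode is over, call reset()")
        # Assumed rewards: +1 for flipping 0 -> 1, -1 for flipping 1 -> 0.
        reward = 1 if self.bits[action] == 0 else -1
        self.bits[action] ^= 1
        self.step_count += 1
        done = self.step_count >= self.EPISODE_LENGTH
        return self.observation(), reward, done

    def observation(self):
        # The step counter is part of the state, so a state is never revisited
        # within an episode.
        return (tuple(self.bits), self.step_count)
```

The optimal strategy is to flip a zero bit at every step, which the initialization always makes possible.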

BitFlip is a simple game with a very simple strategy, but the initial replay buffer will satisfy the state-determines-action assumption with high probability. As we will see, this is enough to break AWR. We ran both AWR and QWR on BitFlip for different values of $N$, for 10 iterations per experiment. In each iteration we collected 1000 interactions with the environment and trained both the actor and the critic for 300 steps. All shared hyperparameters of AWR and QWR were set to the same values, and the backup operator in QWR was set to mean. We report the mean return over 10 episodes played by the trained agent. The results are shown in Figure 1.

As we can see, the performance of AWR starts deteriorating at a relatively small value of $N$, while QWR maintains high performance even at much larger values of $N$, where the state space is far bigger. Notice how the returns of AWR drop with $N$: at higher values, the agent struggles to flip even a single zero bit. This problem with AWR and large state spaces motivates us to introduce QWR next.

2.3 Q-Value Weighted Regression

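1:  $\theta \leftarrow$ random actor parameters
2:  $\phi \leftarrow$ random critic parameters
3:  $\mathcal{D} \leftarrow \emptyset$
4:  for $k$ in $1..n\_iterations$ do
5:     add trajectories $\{\tau_i\}$ sampled by $\pi_\theta$ to $\mathcal{D}$
7:     for $i$ in $1..n\_critic\_steps$ do
8:        if … then
10:        end if
11:        sample $(s, a, r, s') \sim \mathcal{D}$
12:        sample $a'_1, \ldots, a'_n \sim \mu(\cdot|s')$
13:        $Q^\star \leftarrow r + \gamma F(Q_\phi(s', a'_1), \ldots, Q_\phi(s', a'_n))$
14:        $\phi \leftarrow \phi - \alpha_Q \nabla_\phi \left[ \| Q^\star - Q_\phi(s, a) \|^2 \right]$
15:     end for
16:     for $i$ in $1..n\_actor\_steps$ do
17:        sample $s \sim \mathcal{D}$
18:        sample $a_1, \ldots, a_n \sim \mu(\cdot|s)$
19:        $\hat{V} \leftarrow \frac{1}{n} \sum_{j=1}^{n} Q_\phi(s, a_j)$
20:        $A_j \leftarrow Q_\phi(s, a_j) - \hat{V}$ for $j = 1, \ldots, n$
21:        $\theta \leftarrow \theta + \alpha_\pi \nabla_\theta \left[ \frac{1}{n} \sum_{j=1}^{n} \log \pi_\theta(a_j|s) \exp\left( \frac{1}{\beta} A_j \right) \right]$
22:     end for
23:  end for
Algorithm 2 Q-Value Weighted Regression.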

To remedy the issue indicated by Theorems 1 and 2, we introduce a mechanism to consider multiple different actions that can be taken from a single state. We calculate the advantage of the sampling policy $\mu$ based on a learned Q-function: $A_\mu(s, a) = Q_\mu(s, a) - \hat{V}_\mu(s)$, where $\hat{V}_\mu(s)$ is the expected return of the policy $\mu$, expressed using $Q_\mu$ by expectation over actions: $\hat{V}_\mu(s) = \mathbb{E}_{a \sim \mu(\cdot|s)} Q_\mu(s, a)$. We substitute our advantage estimator into the AWR actor formula (Equation 1) to obtain the QWR actor:

$$\arg\max_\pi \; \mathbb{E}_{s \sim d_\mu(s)} \mathbb{E}_{a \sim \mu(\cdot|s)} \left[ \log \pi(a|s) \exp\left( \frac{1}{\beta} \left( Q_\mu(s, a) - \hat{V}_\mu(s) \right) \right) \right] \tag{3}$$

Similar to AWR, we implement the expectation over states in Equation 3 by sampling from the replay buffer. However, to estimate the expectation over actions, we average over multiple actions sampled from $\mu$ during training. Because the replay buffer contains data from multiple different sampling policies, we store the parameters of the sampling policy $\mu$ conditioned on the current state in the replay buffer and restore it in each training step to compute the loss. This allows us to consider multiple different possible actions for a single state when training the actor, not only the one performed in the collected experience.
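The sampled advantage estimate used by the QWR actor can be sketched as follows. This is an illustrative sketch with hypothetical function names: it assumes the Q-values of several actions sampled from the stored sampling policy have already been computed for one state, estimates the state value as their mean, and turns the resulting advantages into regression weights.

```python
import math


def qwr_actor_weights(q_values, beta=1.0):
    """Weights exp((Q(s, a_j) - V_hat(s)) / beta) for sampled actions a_j.

    `q_values` holds Q(s, a_j) for actions a_j sampled from mu(.|s);
    their mean is a Monte Carlo estimate of the state value.
    """
    value_estimate = sum(q_values) / len(q_values)
    return [math.exp((q - value_estimate) / beta) for q in q_values]


def qwr_actor_loss(log_probs, q_values, beta=1.0):
    """Negative weighted log-likelihood over the sampled actions of one state.

    `log_probs[j]` is log pi(a_j | s) under the current actor.
    """
    weights = qwr_actor_weights(q_values, beta)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)
```

Unlike the AWR loss, several actions compete within the same state, so the actor is pushed towards the sampled actions with above-average Q-values rather than towards the single action recorded in the buffer.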

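The use of a Q-network as a critic provides us with an additional benefit. Instead of regressing it towards the returns of our sampling policy $\mu$, we can train it to estimate the returns of an improved policy $\mu^\star$, in a manner similar to Q-learning. This allows us to optimize expected improvement over $\mu^\star$, providing a better baseline: as long as $\mathbb{E}_{a \sim \mu^\star(\cdot|s)} Q_\mu(s, a) \geq \mathbb{E}_{a \sim \mu(\cdot|s)} Q_\mu(s, a)$ for all states $s$, the policy improvement theorem for stochastic policies (sutton_barto, Section 4.2) implies that the policy $\mu^\star$ achieves higher returns than the sampling policy $\mu$:

$$\mathbb{E}_{a \sim \mu^\star(\cdot|s)} Q_\mu(s, a) \geq V_\mu(s) \ \text{for all } s \implies V_{\mu^\star}(s) \geq V_\mu(s) \ \text{for all } s \tag{4}$$

$\mu^\star$ need not be parametric. In fact, it is not materialized in any way over the course of the algorithm. The only requirement is that we can estimate the Q backup $\mathbb{E}_{a \sim \mu^\star(\cdot|s)} Q(s, a)$. This allows great flexibility in choosing the form of $\mu^\star$. Since we want our method to work also in continuous action spaces, we cannot compute the backup exactly. Instead, we estimate it based on several samples from the sampling policy $\mu$. Our backup has the form $\mathbb{E}_{a \sim \mu^\star(\cdot|s)} Q(s, a) \approx F(Q(s, a_1), \ldots, Q(s, a_n))$ with $a_1, \ldots, a_n \sim \mu(\cdot|s)$. In this work, we extend the term Q-learning to mean training a Q-value using such a generalized backup. To make training of the Q-network more efficient, we use multi-step targets, described in detail in Appendix 6.4. The critic optimization objective using single-step targets is:

$$Q^\star = \arg\min_Q \; \mathbb{E}_{s \sim d_\mu(s)} \mathbb{E}_{a \sim \mu(\cdot|s)} \mathbb{E}_{s' \sim T(\cdot|s, a)} \mathbb{E}_{a'_1, \ldots, a'_n \sim \mu(\cdot|s')} \left[ \left\| Q^\star_{\mathrm{target}} - Q(s, a) \right\|^2 \right] \tag{5}$$

where $Q^\star_{\mathrm{target}} = r(s, a) + \gamma F(Q(s', a'_1), \ldots, Q(s', a'_n))$ and $T(s'|s, a)$ is the environment’s transition distribution.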

In this work, we investigate three choices of $F$: average, yielding $\mu^\star = \mu$; max, where $\mu^\star$ approximates the greedy policy; and log-sum-exp, $F(q_1, \ldots, q_n) = \tau \log \left[ \frac{1}{n} \sum_{i=1}^{n} \exp(q_i / \tau) \right]$, interpolating between average and max with the temperature parameter $\tau$. This leads to three versions of the QWR algorithm: QWR-AVG, QWR-MAX, and QWR-LSE. The last operator, log-sum-exp, is similar to the backup operator used in maximum-entropy reinforcement learning (see e.g. haarnoja2018soft) and can be thought of as a soft-greedy backup, rewarding both high returns and uncertainty of the policy. It is our default choice and the final algorithm is shown in Algorithm 2.
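The three backup operators and the single-step critic target can be sketched as below. The function names are hypothetical, and the log-sum-exp is written in its normalized form (with the mean inside the logarithm), which is an assumption of this sketch chosen because it recovers the average at high temperature and the maximum at low temperature.

```python
import math


def backup_avg(q_values):
    """Average backup: estimates the value of the sampling policy itself."""
    return sum(q_values) / len(q_values)


def backup_max(q_values):
    """Max backup: approximates the greedy policy over the sampled actions."""
    return max(q_values)


def backup_lse(q_values, temperature=1.0):
    """Normalized log-sum-exp: temperature * log(mean(exp(q / temperature))).

    The maximum is subtracted before exponentiating for numerical stability.
    """
    m = max(q_values)
    mean_exp = sum(math.exp((q - m) / temperature) for q in q_values) / len(q_values)
    return m + temperature * math.log(mean_exp)


def single_step_target(reward, next_q_values, backup, gamma=0.99):
    """Critic target r + gamma * F(Q(s', a'_1), ..., Q(s', a'_n))."""
    return reward + gamma * backup(next_q_values)
```

For any set of sampled Q-values, the log-sum-exp result lies between the average and the maximum, moving towards the maximum as the temperature decreases.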

3 Related work

Reinforcement learning algorithms.

Recent years have seen great advances in the field of reinforcement learning due to the use of deep neural networks as function approximators. mnih2013playing introduced DQN, an off-policy algorithm learning a parametrized Q-value function through updates based on the Bellman equation. The DQN algorithm only computes the Q-value function; it does not learn an explicit policy. In contrast, policy-based methods such as REINFORCE (reinforce1) learn a parameterized policy, typically by following the policy gradient (reinforce2) estimated through Monte Carlo approximation of future returns. Such methods suffer from high variance, causing low sample efficiency. Actor-critic algorithms, such as A2C and A3C (sutton1999a2c; mnih2016asynchronous), decrease the variance of the estimate by jointly learning policy and value functions, and using the latter as an action-independent baseline for calculation of the policy gradient. The PPO algorithm (schulman2017proximal) optimizes a clipped surrogate objective in order to allow multiple updates using the same sampled data.

Continuous control.


The authors of DDPG adapted Q-learning to continuous action spaces. In addition to a Q-value function, they learn a deterministic policy function optimized by backpropagating the gradient through the Q-value function.

haarnoja2018soft introduce Soft Actor-Critic (SAC): a method learning in a similar way, but with a stochastic policy optimizing the Maximum Entropy RL (levine2018reinforcement) objective. Similarly to our method, SAC also samples from the policy during training.

Advantage-weighted regression.

The QWR algorithm is a successor of AWR proposed by peng2019advantageweighted, which in turn is based on Reward-Weighted Regression (RWR, rwr) and AC-REPS proposed by wirth2016acreps. Mathematical and algorithmic foundations of advantage-weighted regression were developed by fqi. The algorithms share the same good theoretical properties: RWR, AC-REPS, AWR, and QWR losses can be mathematically reformulated in terms of KL-divergence with respect to the optimal policy (see formulas (7)-(10) in peng2019advantageweighted). QWR is different from AWR in the following key aspects: instead of using empirical returns in the advantage estimation, we train a $Q$ function (see formulas 1 and 3 for the precise definitions) and use sampling for the actor. QWR is different from AC-REPS as it uses deep learning for function approximation and Q-learning for fitting the critic.


Several recent works have developed algorithms similar to QWR. We provide a brief overview and ways of obtaining them from the QWR pseudocode (Algorithm 2). AWR can be recovered by learning a value function as a critic (line 14) and sampling actions from the replay buffer (lines 12 and 18 in Algorithm 2). AWAC (awac) modifies AWR by learning a Q-function for the critic. We get it from QWR by sampling actions from the replay buffer (lines 12 and 18). Note that compared to AWAC, by sampling multiple actions for each state, QWR is able to take advantage of Q-learning to improve the critic. CRR (crr) augments AWAC with training a distributional Q-function in line 14 and substituting different functions for computing advantage weights in line 21 (CRR treats the advantage weight function $f$ as a hyperparameter in line 21; in QWR, $f(A) = \exp(A / \beta)$). Again, compared to CRR, QWR samples multiple actions for each state, and so can take advantage of Q-learning. In a way similar to QWR, MPO (abdolmaleki2018mpo) samples actions during actor training to improve generalization. Compared to QWR, it introduces a dual function for dynamically tuning $\beta$ in line 21, adds a prior regularization for policy training and trains the critic using Retrace (retrace) targets in line 13. QWR can be thought of as a significant simplification of MPO, with the addition of Q-learning to provide a better baseline for the actor. Additionally, the classical DQN (dqn) algorithm for discrete action spaces can be recovered from QWR by removing the actor training loop (lines 16-22), computing a maximum over all actions in Q-network training (line 13) and using an epsilon-greedy policy w.r.t. the Q-network for data collection.

Offline reinforcement learning.

Offline RL is the main topic of the survey levine2020offline. The authors state that “offline reinforcement learning methods equipped with powerful function approximation may enable data to be turned into generalizable and powerful decision making engines”. We see this as one of the major challenges of modern RL, and this work contributes to addressing it. Many current algorithms work to some degree in offline RL. For example, the variants of DDPG and DQN developed by fujimoto2018offpolicy and agarwal2019striving, as well as the MPO algorithm by abdolmaleki2018mpo, are promising alternatives to AWR and QWR analyzed in this work.

ABM (abm) is a method of extending RL algorithms based on policy networks to offline settings. It first learns a prior policy network on the offline dataset using a loss similar to Equation 1, and then learns the final policy network using any algorithm, adding an auxiliary term penalizing KL-divergence from the prior policy. CQL (cql) is a method of extending RL algorithms based on Q-networks to offline settings by introducing an auxiliary loss. To compute the loss, CQL samples actions on-line during training of the Q-network, similar to line 14 in QWR. EMaQ (emaq) learns an ensemble of Q-functions using an Expected-Max backup operator and uses it during evaluation to pick the best action. The Q-network training part is similar to QWR with $F = \max$ in line 13 in Algorithm 2.

The imitation learning algorithm MARWIL by wang2018marwil confirms that advantage-weighted regression performs well in the context of complex games.

Algorithm Half-Cheetah Walker Hopper Humanoid
Table 1: Comparison of the variants of QWR with AWR (peng2019advantageweighted), SAC (haarnoja2018soft) and PPO (schulman2017proximal) on 4 MuJoCo environments at 100K samples. We report the median of 5 runs together with half of the interquartile range.
Algorithm Boxing Breakout Freeway Gopher Pong Seaquest
Table 2: Comparison of the variants of QWR with the sample-efficient variant of Rainbow (rainbow; otrainbow), MPR (mpr), SimPLe (simple) and random scores on 6 Atari games at 100K samples. We report results of the augmented and non-augmented versions of the MPR algorithm. Since MPR and SimPLe are based on learning a model of the environment, we do not consider them when choosing the best scores.
Algorithm Half-Cheetah Walker Hopper Humanoid
Table 3: Comparison of QWR-LSE with AWR, SAC and PPO on 4 MuJoCo environments at 1M samples.
(a) QWR-LSE, margin 1.
(b) QWR-LSE, margin 3.
(c) QWR-LSE, margin 7.
(d) QWR-MAX, margin 1.
(e) QWR-MAX, margin 3.
(f) QWR-MAX, margin 7.
Figure 2: Ablation of QWR with respect to the margin, the number of action samples and the method of training the critic. The results are shown on the Half-Cheetah environment. The plots show the median of 5 runs with the shaded area denoting the interquartile range.

4 Experiments

Neural architectures.

In all MuJoCo experiments, for both value and policy networks, we use multi-layer perceptrons with two layers of 256 neurons each, and ReLU activations. In all Atari experiments, for both value and policy networks, we use the same convolutional architectures as in dqn. To feed actions to the network, we embed them using one linear layer, and the embedded action is then combined with the processed observation in the rest of the network. This is followed by the value or policy head. For the policy, we parameterize either the log-probabilities of actions in case of discrete action spaces, or the mean of a Gaussian distribution in case of continuous action spaces, while keeping the standard deviation constant.


4.1 Sample efficiency

Since we are concerned with sample efficiency, we focus our first experiments on the case when the number of interactions with the environment is limited. To use a single number that allows comparisons with previous work both on MuJoCo and Atari, we decided to restrict the number of interactions to 100K. This number is high enough that state-of-the-art algorithms such as SAC reach good performance.

We run experiments on 4 MuJoCo environments and 6 Atari games, evaluating three versions of QWR with the 3 backup operators introduced in Section 2.3: QWR-LSE (using log-sum-exp), QWR-MAX (using maximum) and QWR-AVG (using average). For all experiments, we set the Q target truncation horizon to 3. In MuJoCo experiments, we set the number of action samples to 4. In Atari experiments, because of the discrete action space, we can compute the policy loss for each transition explicitly, without sampling. All other hyperparameters are kept the same between those domains. We discuss these choices and show ablations in Section 4.3, while more experimental details are given in Appendix 6.1.

In Tables 1 and 2 we present the final returns at 100K samples for the considered algorithms and environments. For context, we also provide results for SAC, PPO, OTRainbow (a variant of Rainbow tuned for sample efficiency), MPR and SimPLe.

On all considered MuJoCo tasks, QWR exceeds the performance of AWR and PPO. The better sample efficiency is particularly visible in the case of Walker, where each variant of QWR performs better than any baseline considered. On Hopper, QWR-LSE, the best variant, outpaces all baselines by a large margin. On Humanoid, it comes close to SAC, the state of the art on MuJoCo.

QWR surpasses PPO and Rainbow in 4 out of 6 Atari games. In Gopher and Pong, QWR outperforms even the augmented and non-augmented versions of the model-based MPR algorithm.

4.2 More samples

To verify that our algorithm makes good use of higher sample budgets, we also evaluate it on the 4 MuJoCo tasks at 1M samples. For the purpose of this experiment, we adapt several of the hyperparameters of QWR to the larger amount of data. The details are provided in Appendix 6.1. We present the results in Table 3.

On Walker, Hopper and Humanoid, QWR outperforms all baselines. Only on Half-Cheetah is it surpassed by SAC. In all tasks, QWR achieves significantly higher scores than AWR and PPO, which shows that the sample-efficiency improvements applied in QWR translate well to the higher budget of 1M samples.

4.3 Ablations

In Figure 2 we provide an ablation of QWR with respect to the backup method $F$, the multi-step target horizon ("margin") and the number of action samples to consider when training the actor and the critic. As we can see, the algorithm is fairly robust to the choice of these hyperparameters.

Overall, the log-sum-exp backup (LSE) achieves the best results – compare (b) and (e). Max backup performs well with margin 1, but is more sensitive to higher numbers of samples – compare (d) and (e). The log-sum-exp backup is less vulnerable to this effect – compare (a) and (d). Higher margins decrease performance – see (c) and (b). We conjecture this to be due to stale action sequences in the replay buffer biasing the multi-step targets. Again, the log-sum-exp backup is less prone to this issue – compare (c) to (f).

4.4 Offline RL

Both QWR and AWR are capable of handling expert data. AWR was shown to behave in a stable way when provided only with a number of expert trajectories (see Figure 7 in peng2019advantageweighted) without additional data collection. In this respect, the performance of AWR is much more robust than the performance of PPO and SAC. In Figure 3 we show the same result for QWR: in terms of re-using the expert trajectories, it matches or exceeds AWR. The QWR training runs based on offline data were remarkably stable and worked well across all environments we tried.

(a) HalfCheetah
(b) Hopper
(c) Walker2d
(d) HalfCheetah
(e) Hopper
(f) Walker2d
Figure 3: Panels (a), (b) and (c) show offline training runs based on 50 trajectories of length 1000 collected by diverse policies. The horizontal lines mark the average return of a policy from the dataset. The bars denote median returns out of 4 runs, and the vertical lines denote the interquartile range. Data for panels (d), (e) and (f) is borrowed from peng2019advantageweighted to cover a broader family of algorithms and show that offline training fails for many RL algorithms.

For the offline RL experiments, we trained each algorithm for 30 iterations, without additional data collection. The training trajectories contained only states, actions and rewards, without any algorithm-specific data. In QWR, we set the per-step sampling policies to be Gaussians with the mean at the performed action and the standard deviation chosen as in peng2019advantageweighted.

5 Discussion and Future Work

We present Q-value Weighted Regression (QWR), an off-policy actor-critic algorithm that extends Advantage Weighted Regression with action sampling and Q-learning. It is significantly more sample-efficient than AWR and works well with discrete actions and in visual domains, e.g., on Atari games. QWR consists of two interleaved steps of supervised training: the critic learning the Q function using a predefined backup operator, and the actor learning the policy with weighted regression based on multiple sampled actions. Thanks to this clear structure, QWR is simple to implement and debug. It is also stable in a wide range of hyperparameter choices and works well in the offline setting.

Importantly, we designed QWR thanks to a theoretical analysis that revealed why AWR may not work when there are limits on data collection in the environment. Our analysis for the limited data regime is based on the state-determines-action assumption, which allows us to fully solve AWR analytically while still being realistic and indicative of the performance of this algorithm with few samples. We believe that using the state-determines-action assumption can yield important insights into other RL algorithms as well.

QWR already achieves state-of-the-art results in settings with limited data and we believe that it can be further improved in the future. The critic training could benefit from the advances in Q-learning methods such as double Q-networks (hasselt2015deep) or Polyak averaging (Polyak1990NewMO), already used in SAC. Distributional Q-learning (bellemare2017distributional) and the use of ensembles like REM (agarwal2020optimistic) could yield further improvements.

Notably, the QWR results at 100K that we present are achieved with the same set of hyperparameters (except for the network architecture) both for MuJoCo environments and for Atari games. This is rare among deep reinforcement learning algorithms, especially among ones that strive for sample-efficiency. Combined with its stability and good performance in offline settings, this makes QWR a compelling choice for reinforcement learning in domains with limited data.


6 Appendix

6.1 Experimental Details

We run experiments on 4 MuJoCo environments: Half-Cheetah, Walker, Hopper and Humanoid and on 6 Atari games: Boxing, Breakout, Freeway, Gopher, Pong and Seaquest. For the MuJoCo environments, we limit the episode length to . For the Atari environments, we apply the following preprocessing:

  • Repeating each action for 4 consecutive steps, taking a maximum of 2 last frames as the observation.

  • Stacking 4 last frames obtained from the previous step in one observation.

  • Gray-scale observations, cropped and rescaled to size .

  • Maximum interactions per episode.

  • Random number of no-op actions from range at the beginning of each episode.

  • Rewards clipped to the range during training.

Our code with the exact configurations we use to reproduce the experiments is available as open source (URL removed to preserve anonymity). We use the same hyperparameters for the MuJoCo and Atari experiments, and almost the same hyperparameters for the 100K and 1M sample budgets. The hyperparameters and their tuning ranges are reported in Table 4.

Before calculating the actor loss, we normalize the advantages over the entire batch by subtracting their mean and dividing by their standard deviation, same as peng2019advantageweighted. We perform a similar procedure for the log-sum-exp backup operator used in critic training. Before applying the backup, we divide the Q-values by a computed measure of their scale. After applying the backup, we re-scale the target by the same measure. There is no need to subtract the mean, as log-sum-exp is translation-invariant.
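Both normalization steps can be sketched as follows. This is an illustrative sketch with hypothetical function names: the small `eps` guard against division by zero is an added assumption, and the scale is computed here from the Q-values passed to the backup, whereas the exact set of values used to measure the scale is not specified in the text above.

```python
import math


def population_std(values):
    """Standard deviation of a list of numbers (population form)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def mean_absolute_deviation(values):
    """Mean absolute deviation from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


def normalize_advantages(advantages, eps=1e-8):
    """Subtract the batch mean and divide by the batch standard deviation."""
    mean = sum(advantages) / len(advantages)
    std = population_std(advantages)
    return [(a - mean) / (std + eps) for a in advantages]


def scaled_lse_backup(q_values, temperature, scale_fn=mean_absolute_deviation, eps=1e-8):
    """Log-sum-exp backup applied to scale-normalized Q-values.

    The Q-values are divided by their scale, the normalized log-sum-exp
    is applied, and the result is multiplied back by the same scale.
    """
    scale = max(scale_fn(q_values), eps)
    scaled = [q / scale for q in q_values]
    m = max(scaled)
    mean_exp = sum(math.exp((x - m) / temperature) for x in scaled) / len(scaled)
    return scale * (m + temperature * math.log(mean_exp))
```

Dividing by the scale makes the effective temperature independent of the magnitude of the returns, and shifting all Q-values by a constant shifts the backup by the same constant, which is why no mean subtraction is needed.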


The parameters of this backup are the only ones different between the 100K and 1M experiments. For 100K, we use and mean absolute deviation as . For 1M, we use and standard deviation as .
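The two normalization steps described above can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: the backup is written as a log-mean-exp over sampled Q-values (log-sum-exp minus the log of the sample count), the scale is computed over the whole batch, and the names `normalize_advantages`, `scaled_backup` and `mean_abs_deviation` are hypothetical.

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """Subtract the batch mean and divide by the batch standard deviation."""
    adv = np.asarray(adv, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)

def log_mean_exp(q, axis=-1):
    """Numerically stable log(mean(exp(q))) along the given axis."""
    q = np.asarray(q, dtype=np.float64)
    m = q.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(q - m), axis=axis))

def mean_abs_deviation(x):
    """Mean absolute deviation from the mean."""
    x = np.asarray(x, dtype=np.float64)
    return np.mean(np.abs(x - x.mean()))

def scaled_backup(q_samples, scale_fn=np.std, eps=1e-8):
    """Divide Q-values by their scale, apply the backup, then re-scale the target."""
    q = np.asarray(q_samples, dtype=np.float64)
    scale = scale_fn(q) + eps
    return scale * log_mean_exp(q / scale, axis=-1)

# Hypothetical Q-values: 2 states, 3 sampled actions each.
q = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 4.0]])
targets = scaled_backup(q)
targets_mad = scaled_backup(q, scale_fn=mean_abs_deviation)
shifted = scaled_backup(q + 10.0)
normalized = normalize_advantages([1.0, 2.0, 3.0, 6.0])
```

Adding a constant to every Q-value leaves the scale unchanged and shifts the target by the same constant, which is why the mean does not need to be subtracted before the backup.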

Hyperparameter Value Considered range
- number of action samples
- multi-step target horizon ("margin")
- actor loss temperature
- critic backup operator log-sum-exp mean, log-sum-exp,
- discount factor for the returns
- discount factor in TD($\lambda$)
- actor learning rate
- critic learning rate
batch size (actor and critic)
replay buffer size interactions
n_iterations Until we reach the desired number of interactions. In all experiments, we collect interactions with the environment in each iteration of the algorithm.
Table 4: Hyperparameter values and considered ranges.

When training the networks, we use the Adam optimizer and standard network architectures. In the MuJoCo experiments we use a multi-layer perceptron with two layers of 256 neurons each and ReLU activations. In the Atari experiments we use the same convolutional architecture as dqn.

The 100K experiments took around 18 hours each, on a single TPU v2 chip. The 1M experiments took around 180 hours each, using the same hardware.

6.2 Formal Analysis of AWR with Limited Data

Since sample efficiency is one of the key challenges in deep reinforcement learning, it would be desirable to have better tools to understand why any RL algorithm – for instance AWR – is sample efficient or not. This is hard to achieve in the general setting, but we identify a key simplifying assumption that allows us to solve AWR analytically and identify the source of its problems.

The assumption we introduce, called state-determines-action, concerns the content of the replay buffer $\mathcal{D}$ of an off-policy RL algorithm. The replay buffer contains all state-action pairs that the algorithm has visited so far during its interactions with the environment. We say that a replay buffer $\mathcal{D}$ satisfies the state-determines-action assumption when for each state in the buffer, there is a unique action that was taken from it, formally:

$$\forall\, (s, a), (s', a') \in \mathcal{D}: \quad s = s' \implies a = a'.$$
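As an illustration, the assumption can be checked on a concrete buffer with a single pass over its state-action pairs. The sketch below assumes hashable states and represents the buffer as a list of `(state, action)` tuples; the function name is hypothetical.

```python
def satisfies_state_determines_action(buffer):
    """Return True if no state in the buffer appears with two different actions."""
    seen = {}
    for state, action in buffer:
        # setdefault stores the first action seen for a state and returns it afterwards.
        if seen.setdefault(state, action) != action:
            return False
    return True

# A repeated state is fine as long as the action is the same.
buffer_ok = [((0, 0), "left"), ((0, 1), "right"), ((0, 0), "left")]
# The same state with two different actions violates the assumption.
buffer_bad = [((0, 0), "left"), ((0, 0), "right")]
```

With high-dimensional or continuous observations, exact state repetitions are rare, so buffers collected under a limited interaction budget tend to satisfy the assumption.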

A simplifying assumption like state-determines-action is useful only if it indeed simplifies the analysis of RL algorithms. For AWR it does even more: it allows us to calculate analytically the final policy that the algorithm produces, and the resulting policy turns out to yield no improvement over the sampling policy.

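While AWR achieves very good results after longer training, it is not very sample efficient, as noted in the future work section of (peng2019advantageweighted). To understand this problem, let us analyze a single loop of actor training in AWR:

$$\pi_{k+1} = \arg\max_{\pi} \; \mathbb{E}_{(s, a) \sim \mathcal{D}} \left[ \log \pi(a|s) \, \exp\left( \frac{1}{\beta} \left( R^{\mathcal{D}}_{s,a} - V^{\mathcal{D}}(s) \right) \right) \right]$$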


How does this update act on a replay buffer that satisfies the state-determines-action assumption? It turns out that we can answer this question analytically using the following theorem.

Theorem 1.

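Let $\mathcal{A}$ be a discrete action space. Let a replay buffer $\mathcal{D}$ satisfy the state-determines-action assumption. Let $\tilde{\pi}$ be the probability function of a distribution that clones the behavior from $\mathcal{D}$, i.e., that assigns to each state $s$ from $\mathcal{D}$ the action $a$ such that $(s, a) \in \mathcal{D}$ with probability $1$. Then, under the AWR update, $\pi_{k+1} = \tilde{\pi}$.

Proof.

By definition of the AWR update rule, $\pi_{k+1} = \arg\max_{\pi} \mathbb{E}_{(s, a) \sim \mathcal{D}} \left[ \log \pi(a|s) \exp\left( \frac{1}{\beta} \left( R^{\mathcal{D}}_{s,a} - V^{\mathcal{D}}(s) \right) \right) \right]$. Recall that the number $e$ raised to any real power is always positive. Since $\mathcal{A}$ is a discrete action space, $\pi$ is a discrete policy and we have $\pi(a|s) \le 1$, so $\log \pi(a|s)$ in the considered equation is at most $0$ ($\log$ is a strictly increasing function and $\log 1 = 0$). Thus the value of the objective can be at most $0$, and it reaches this maximum for the policy that assigns probability $1$ to the action $a$ in state $s$ for each $(s, a) \in \mathcal{D}$. Such a policy exists because the state-determines-action assumption guarantees a single action per state. Therefore $\tilde{\pi}$ attains the $\arg\max$ as required. ∎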

As we can see from the above theorem, the AWR update rule will insist on cloning the action taken in the replay buffer as long as the buffer satisfies the state-determines-action assumption. In the extreme case of a deterministic environment, the new policy will not add any new data to the buffer; it will only replay a trajectory already in it. So the whole AWR loop will end with the policy that clones the replay buffer, which yields no improvement.
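The argument can be checked numerically on a toy problem. The sketch below assumes a tabular policy, a three-element buffer that satisfies state-determines-action, and random positive weights standing in for the exponentiated advantages; all names are hypothetical. The cloning policy reaches the upper bound of 0, while randomly drawn policies score strictly below it.

```python
import numpy as np

def awr_objective(policy, buffer, weights):
    """Sum over the buffer of weight * log pi(a|s); policy has shape [n_states, n_actions]."""
    states, actions = zip(*buffer)
    probs = policy[list(states), list(actions)]
    with np.errstate(divide="ignore"):
        return float(np.sum(weights * np.log(probs)))

def cloning_policy(buffer, n_states, n_actions):
    """Policy that puts probability 1 on the buffered action of each buffered state."""
    policy = np.full((n_states, n_actions), 1.0 / n_actions)
    for s, a in buffer:
        policy[s] = 0.0
        policy[s, a] = 1.0
    return policy

rng = np.random.default_rng(0)
buffer = [(0, 2), (1, 0), (2, 1)]               # each state appears with exactly one action
weights = np.exp(rng.normal(size=len(buffer)))  # stand-ins for exp(advantage / beta), always > 0

clone = cloning_policy(buffer, n_states=3, n_actions=3)
clone_value = awr_objective(clone, buffer, weights)
random_values = [
    awr_objective(rng.dirichlet(np.ones(3), size=3), buffer, weights)
    for _ in range(100)
]
```

The advantage values affect only the size of the weights, not the maximizer, which is why the critic has no influence on the result in this regime.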

In the next section, we prove an analogous theorem for continuous action spaces.

6.2.1 Continuous action spaces

The statement of Theorem 1 must be adjusted for the case of continuous actions. First of all, let us clarify the notation of the AWR update introduced in Equation 7. For discrete actions, the symbol $\pi(a|s)$ denotes the probability function of a discrete distribution. In the continuous setting, we use it to denote a probability density function.

Now let us define the policy that "clones the behavior from the replay buffer". Intuitively, that would be a distribution that concentrates most of its probability mass arbitrarily close to the action in the replay buffer.

Theorem 2.

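Let $\mathcal{A}$ be a continuous action space. Let a replay buffer $\mathcal{D}$ satisfy the state-determines-action assumption. For a given $\epsilon > 0$ let us consider the following family of parameterized Gaussian distributions

$$\pi(a|s) = \frac{1}{\sigma(s)\sqrt{2\pi}} \exp\left( -\frac{(a - \mu(s))^2}{2\sigma(s)^2} \right),$$

where $\sigma(s) \ge \epsilon$, and define $\tilde{\pi}$ such that $\mu(s) = a$ and $\sigma(s) = \epsilon$ for $(s, a) \in \mathcal{D}$. If we perform the optimization in the AWR update over such a family of distributions, we get $\pi_{k+1} = \tilde{\pi}$.

Proof.

The reasoning is similar to the proof of Theorem 1 but we cannot rely on $\log \pi(a|s) \le 0$, as probability density functions can take arbitrarily large values. Let $s$ be any state such that for some $a$ we have $(s, a) \in \mathcal{D}$. For the assumed family of distributions we have

$$\log \pi(a|s) = -\log\left( \sigma(s)\sqrt{2\pi} \right) - \frac{(a - \mu(s))^2}{2\sigma(s)^2}.$$

$\log \pi(a|s)$ is a quadratic function of $\mu(s)$, so it attains the maximum value at $\mu(s) = a$ for every $\sigma(s)$. Now let us look at

$$\left. \frac{\partial}{\partial \sigma(s)} \log \pi(a|s) \right|_{\mu(s) = a} = -\frac{1}{\sigma(s)}.$$

The derivative is negative regardless of $\sigma(s)$, so $\log \pi(a|s)$ is maximized for the lowest allowed $\sigma(s) = \epsilon$ and $\mu(s) = a$. This is true for an arbitrary state-action pair $(s, a)$ such that $(s, a) \in \mathcal{D}$. Since the exponential weights in the AWR objective are positive, maximizing each term maximizes the whole objective. So under the AWR update we get $\pi_{k+1} = \tilde{\pi}$. ∎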
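The continuous case can be illustrated with a small grid search. The sketch assumes a one-dimensional action, a hypothetical buffered action `a_buffer` and a hypothetical lower bound `eps` on the standard deviation. It evaluates the Gaussian log-density of the buffered action over a grid of means and standard deviations and locates the maximum.

```python
import numpy as np

def gaussian_log_density(a, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2) evaluated at a."""
    return -np.log(std * np.sqrt(2.0 * np.pi)) - (a - mean) ** 2 / (2.0 * std ** 2)

a_buffer = 0.3                        # hypothetical action stored in the replay buffer
eps = 0.1                             # hypothetical lower bound on the standard deviation
means = np.linspace(-1.0, 1.0, 21)    # candidate means, including 0.3
stds = np.linspace(eps, 1.0, 10)      # candidate standard deviations, all >= eps

# grid[i, j] is the log-density of a_buffer under N(means[i], stds[j]^2).
grid = gaussian_log_density(a_buffer, means[:, None], stds[None, :])
best_mean_idx, best_std_idx = np.unravel_index(np.argmax(grid), grid.shape)
best_mean = means[best_mean_idx]
best_std = stds[best_std_idx]
```

On this grid the maximum sits at the mean equal to the buffered action and at the smallest allowed standard deviation.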

This gives us the intuition that, under the AWR update, the probability distributions commonly used in RL (e.g. Gaussian) can be improved by increasing the density at the action stored in the replay buffer and decreasing it everywhere else. For those distributions, the maximum can come arbitrarily close to the Dirac delta concentrated at that action.

Given that AWR aims to copy the replay buffer, as demonstrated by Theorems 1 and 2, how come this algorithm works so well in practice, given enough interactions? First of all, note that for this effect to occur, the neural network used for the AWR actor must be large enough and trained long enough to memorize the data from the replay buffer. Furthermore, the policy it learns must be able to express distributions that concentrate their probability mass on a single action. This holds for environments with discrete actions, and for continuous actions with distributions with controlled scale, but it is not true e.g. when using Gaussian distributions with fixed variance. However, in the latter case, the proof of Theorem 2 shows that the AWR update will place the mean of the policy distribution at the performed action, regardless of the variance, which still leads to no improvement over the sampling policy.

In the next section, we show that using an algorithm that corrects this cloning behavior leads to improved sample efficiency.

6.3 Formal Analysis of QWR with Limited Data

To see how QWR performs under limited data, we are going to formulate a positive theorem showing that it achieves the policy improvement that AWR aims for even in a limited data setting. Note that this time we allow replay buffers that do not necessarily satisfy the state-determines-action assumption. But, for clarity, we make a simplifying assumption that the replay buffer has been sampled by a single policy .

Recall the QWR update rule:


and is the set of states in the replay buffer.

Let , where is the state value function of . This is the policy optimizing the expected improvement over the sampling policy , subject to a KL constraint – the same as in Equation 36 in peng2019advantageweighted.

Since is the target policy resulting from the AWR derivation, we know from peng2019advantageweighted that AWR will update towards this policy in the limit, when the replay buffer is large enough. But from Theorem 1 we know that it will fail to perform this update when the state-determines-action assumption holds. Below we show that QWR will perform the same desirable update for any replay buffer, as long as we restrict the attention to states in the buffer.

Theorem 3.

Let be a finite sample from - the undiscounted state distribution of a policy . Let be the state-action value function for , so for any state and action. Then, under the QWR actor update, , where is the policy . restricted to the set of states .


Let be an arbitrary state in . From the definition of we have

Since ,


We can now change the measure using the definition of :


The inner expectation, up to a normalizing constant, is the negative cross-entropy between and . Since cross-entropy between two distributions is minimized when the distributions are equal, the optimum is reached at for all . ∎

As we can see, the QWR update rule reaches the desired target policy even under limited data. This stands in contrast to AWR, which requires repeating states in the replay buffer, as shown in Theorems 1 and 2.
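The cross-entropy argument can be illustrated for a single state with discrete actions. The sketch assumes the target has the exponentiated-advantage form taken from the AWR derivation, with $V(s)$ computed as the expectation of $Q(s, a)$ under the sampling policy. The arrays `mu` and `q`, the temperature `beta` and the function names are hypothetical.

```python
import numpy as np

def qwr_target_policy(mu, q, beta):
    """Target proportional to mu(a|s) * exp((Q(s,a) - V(s)) / beta), for one state."""
    v = np.dot(mu, q)                           # V(s) = E_{a ~ mu} Q(s, a)
    unnormalized = mu * np.exp((q - v) / beta)
    return unnormalized / unnormalized.sum()

def actor_objective(pi, mu, q, beta):
    """E_{a ~ mu}[ exp((Q(s,a) - V(s)) / beta) * log pi(a|s) ], for one state."""
    v = np.dot(mu, q)
    return float(np.sum(mu * np.exp((q - v) / beta) * np.log(pi)))

rng = np.random.default_rng(0)
mu = np.array([0.5, 0.3, 0.2])      # hypothetical sampling policy at one state
q = np.array([1.0, 2.0, 0.0])       # hypothetical Q-values for the three actions
beta = 1.0                          # hypothetical temperature

target = qwr_target_policy(mu, q, beta)
target_value = actor_objective(target, mu, q, beta)
candidate_values = [
    actor_objective(rng.dirichlet(np.ones(3)), mu, q, beta) for _ in range(200)
]
```

Because the expectation over actions is taken under the sampling policy and not over the single buffered action, the maximizer shifts probability toward actions with above-average Q-values and does not collapse onto one action.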

6.4 Multi-step targets

To make the training of the Q-value network more efficient, we implement an approach inspired by widely-used multi-step Q-learning (mnih2016asynchronous). We consider targets for the Q-value network computed over multiple different time horizons:


where $s_t$, $a_t$ and $r_t$ are the states, actions and rewards in a collected trajectory, respectively. We aggregate those multi-step targets using a truncated TD($\lambda$) estimator (sutton_barto, p. 236):