1 Introduction
Deep reinforcement learning has been applied to a large number of challenging tasks, from games (silver2017mastering; OpenAI_dota; vinyals2017starcraft) to robotic control (sadeghi2016cad2rl; openai2018dexterous; rusu2016sim2real). Since RL makes minimal assumptions about the underlying task, it holds the promise of automating a wide range of applications. However, its widespread adoption has been hampered by a number of challenges: reinforcement learning algorithms can be substantially more complex to implement and tune than standard supervised learning methods, tend to have many hyperparameters and be brittle with respect to their settings, and may require a large number of interactions with the environment.
These issues are well-known, and there has been significant progress in addressing them. The policy gradient algorithm REINFORCE (reinforce1) is simple to understand and implement, but it is brittle and requires on-policy data. Proximal Policy Optimization (PPO, schulman2017proximal) is a more stable on-policy algorithm that has seen a number of successful applications, despite requiring a large number of interactions with the environment. Soft Actor-Critic (SAC, haarnoja2018soft) is a much more sample-efficient off-policy algorithm, but it is defined only for continuous action spaces and does not work well in the offline setting, also known as batch reinforcement learning, where all samples are provided from earlier interactions with the environment and the agent cannot collect more. Advantage Weighted Regression (AWR, peng2019advantageweighted) is a recent off-policy actor-critic algorithm that works well in the offline setting and is built using only simple and convergent maximum likelihood loss functions, making it easier to tune and debug. It is competitive with SAC given enough time to train, but it is less sample-efficient and has not been demonstrated to succeed in settings with discrete actions.
We replace the value function critic of AWR with a Q-value function. Next, we add action sampling to the actor training loop. Finally, we introduce a custom backup to the Q-value training. The resulting algorithm, which we call Q-Value Weighted Regression (QWR), inherits the advantages of AWR but is more sample-efficient, and works well with discrete actions and in visual domains, e.g., on Atari games.
To better understand QWR, we perform a number of ablations, checking different numbers of samples in actor training, different advantage estimators, and different aggregation functions. These choices affect the performance of QWR only to a limited extent, and it remains stable with each of the choices across the tasks we experiment with.
We run experiments with QWR on the MuJoCo environments and on a subset of the Arcade Learning Environment. Since sample efficiency is our main concern, we focus on the difficult case where the number of interactions with the environment is limited – in most of our experiments we limit it to 100K interactions. The experiments demonstrate that QWR is indeed more sample-efficient than AWR. On MuJoCo, it performs on par with Soft Actor-Critic (SAC), the current state-of-the-art algorithm for continuous domains. On Atari, QWR performs on par with OTRainbow, a variant of Rainbow highly tuned for sample efficiency. Notably, we use the same set of hyperparameters (except for the network architecture) for both the MuJoCo and Atari experiments. We verify that QWR also performs well in the regime where more data is available: with 1M interactions, QWR still outperforms SAC on all MuJoCo environments we tested except HalfCheetah.
2 Q-Value Weighted Regression
2.1 Advantage Weighted Regression
peng2019advantageweighted recently proposed Advantage Weighted Regression (AWR), an off-policy actor-critic algorithm notable for its simplicity and stability, achieving competitive results across a range of continuous control tasks. It can be expressed as interleaving data collection and two regression tasks performed on the replay buffer, as shown in Algorithm 1.
AWR optimizes the expected improvement of an actor policy $\pi$ over a sampling policy $\mu$ by regression towards the well-performing actions in the collected experience. Improvement is achieved by weighting the actor loss by the exponentiated advantage $A(s, a)$ of an action, skewing the regression towards the better-performing actions. The advantage is calculated based on the expected return $R^\mu_{s,a}$ achieved by performing action $a$ in state $s$ and then following the sampling policy $\mu$. To calculate the advantage, one first estimates the value $V^\mu(s)$ using a learned critic and then computes $A(s, a) = R^\mu_{s,a} - V^\mu(s)$. This results in the following formula for the actor:

$$\pi^{k+1} = \arg\max_\pi \; \mathbb{E}_{s \sim d_\mu(s)}\, \mathbb{E}_{a \sim \mu(\cdot \mid s)} \left[ \log \pi(a \mid s) \, \exp\!\left( \tfrac{1}{\beta}\left(R^\mu_{s,a} - V^\mu(s)\right) \right) \right] \qquad (1)$$

In this formula, $d_\mu$ denotes the unnormalized, discounted state visitation distribution of the policy $\mu$, and $\beta$ is a temperature hyperparameter.
The critic $V$ is trained to estimate the future returns of the sampling policy $\mu$:

$$V^{k+1} = \arg\min_V \; \mathbb{E}_{s \sim d_\mu(s)}\, \mathbb{E}_{a \sim \mu(\cdot \mid s)} \left[ \left( R^\mu_{s,a} - V(s) \right)^2 \right] \qquad (2)$$
To achieve off-policy learning, the actor and the critic are trained on data collected from a mixture of policies from different training iterations, stored in the replay buffer $\mathcal{D}$.
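The two regression steps above can be sketched in a few lines. This is an illustrative reconstruction with our own function names, not the authors' code; plain lists of numbers stand in for batched network outputs, with `returns` and `values` playing the roles of $R^\mu_{s,a}$ and $V^\mu(s)$ from Equations 1 and 2.

```python
import math

def awr_actor_loss(log_probs, returns, values, beta=1.0):
    """Sample-based AWR actor objective (cf. Equation 1): each replayed
    action's log-probability is weighted by the exponentiated advantage
    exp((R - V(s)) / beta); minimizing the negative mean performs the
    weighted regression towards better-performing actions."""
    weights = [math.exp((r - v) / beta) for r, v in zip(returns, values)]
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)

def awr_critic_loss(returns, values):
    """Critic regression (cf. Equation 2): mean squared error between the
    predicted values V(s) and the observed returns."""
    return sum((r - v) ** 2 for r, v in zip(returns, values)) / len(returns)
```

In the full algorithm, both losses are minimized by gradient descent on network parameters; here the inputs are plain numbers for clarity.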
2.2 Analysis of AWR with Limited Data
While AWR achieves very good results after longer training, it is not very sample-efficient, as noted in the future work section of peng2019advantageweighted. To understand this problem, we analyze a single loop of actor training in AWR under a special assumption.
The assumption we introduce, called state-determines-action, concerns the content of the replay buffer of an off-policy RL algorithm. The replay buffer $\mathcal{D}$ contains the state-action pairs that the algorithm has visited so far during its interactions with the environment. We say that a replay buffer satisfies the state-determines-action assumption when for each state in the buffer, there is a unique action that was taken from it; formally:

$$\forall (s, a), (s', a') \in \mathcal{D}: \quad s = s' \implies a = a'.$$
This assumption may seem limiting, and indeed it is not true in many artificial RL experiments with discrete state and action spaces. In such settings, even a random policy starting from the same state could violate the assumption the second time it collects a trajectory. But note that state-determines-action is almost always satisfied in continuous control, where even a slightly random policy is unlikely to ever perform the exact same action twice and transition to exactly the same state.
Note that our assumption applies well to real-world experiments with high-dimensional state spaces, as any amount of noise added to a high-dimensional space makes repeating the exact same state highly improbable. For example, consider a robot observing 32x32-pixel images. To repeat an observation, each of the 1024 pixels would have to have exactly the same value, which is close to impossible, even with a small amount of pixel noise coming from a camera. This assumption also holds in cases with limited data, even in discrete state and action spaces: when the number of collected trajectories is not enough to span the state space, it is unlikely that a state will be repeated in the replay buffer. This makes our assumption particularly relevant to the study of sample efficiency.
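The claim about exact repeats can be checked with a one-line computation. The per-pixel agreement probability below is an illustrative assumption of ours, not a quantity from the paper:

```python
def repeat_probability(n_pixels=1024, p_pixel_match=0.99):
    """Probability that two independently observed images agree on every
    single pixel, assuming each pixel matches independently with
    probability p_pixel_match (an illustrative noise model)."""
    return p_pixel_match ** n_pixels
```

Even with 99% per-pixel agreement, the chance of an exact repeat of a 32x32 image is below $10^{-4}$, which supports the state-determines-action assumption in such domains.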
We emphasize that the state-determines-action assumption, by design, considers exact equality of states. Two very similar, but not equal, states that lead to different actions do not violate our assumption. This makes the assumption irrelevant to reinforcement learning with linear functions, as linear functions cannot separate similar states. However, it is relevant in deep RL, because deep neural networks can indeed distinguish even very similar inputs (advexamples; margins; understandinggeneralization).

How does AWR perform under the state-determines-action assumption? In Theorems 1 and 2 (see Appendix 6.2 for more details), we show that for popular choices of discrete and Gaussian distributions, the AWR update rule under this assumption will converge to a policy that assigns probability 1 to the actions already present in the replay buffer, thus cloning previous behavior. This is not the desired outcome, as an agent should consider various actions from each state to ensure exploration.
Theorem 1.
Let $\mathcal{A}$ be a discrete action space. Let a replay buffer $\mathcal{D}$ satisfy the state-determines-action assumption. Let $\pi_\mathcal{D}$ be the probability function of a distribution that clones the behavior from $\mathcal{D}$, i.e., one that assigns to each state $s$ from $\mathcal{D}$ the action $a_s$ such that $(s, a_s) \in \mathcal{D}$ with probability 1. Then, under the AWR update, $\pi = \pi_\mathcal{D}$.
The state-determines-action assumption is the main motivating point behind QWR, whose theoretical properties are proven in Theorem 3 in Appendix 6.3. We now illustrate the importance of this assumption by creating a simple environment in which it holds with high probability. We verify experimentally that AWR fails on this simple environment, while QWR is capable of solving it.
The environment, which we call BitFlip, is parameterized by an integer $n$. The state of the environment consists of $n$ bits and a step counter. The action space consists of $n$ actions. When action $i$ is chosen, the $i$-th bit is flipped and the step counter is incremented. A game of BitFlip starts in a random state with the step counter set to 0, and proceeds for 5 steps. The initial state is randomized in such a way as to always leave at least 5 bits set to 0. At each step, the reward is 1 if a bit was flipped from 0 to 1, and $-1$ in the opposite case.
Since BitFlip starts in one random state out of exponentially many in $n$, at large enough $n$ it is highly unlikely that the starting state will ever be repeated in the replay buffer. As the initial policy is random and BitFlip maintains a step counter that prevents returning to a previous state, the same holds for subsequent states.
BitFlip is a simple game with a very simple optimal strategy, but the initial replay buffer will satisfy the state-determines-action assumption with high probability. As we will see, this is enough to break AWR. We ran both AWR and QWR on BitFlip for different values of $n$, for 10 iterations per experiment. In each iteration we collected 1000 interactions with the environment and trained both the actor and the critic for 300 steps. All shared hyperparameters of AWR and QWR were set to the same values, and the backup operator in QWR was set to mean. We report the mean return out of 10 episodes played by the trained agent. The results are shown in Figure 1.
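A minimal reconstruction of BitFlip from the description above; the rejection-sampled initial state and the exact `reset`/`step` API are our assumptions, not the authors' implementation:

```python
import random

class BitFlip:
    """Sketch of the BitFlip environment: the state is n bits plus a step
    counter; action i flips bit i. The reward is +1 for flipping a bit
    from 0 to 1 and -1 otherwise; an episode lasts 5 steps."""

    def __init__(self, n, seed=None):
        self.n = n
        self.rng = random.Random(seed)

    def reset(self):
        # Redraw random initial bits until at least 5 bits are 0,
        # matching the paper's initialization constraint.
        while True:
            self.bits = [self.rng.randint(0, 1) for _ in range(self.n)]
            if self.bits.count(0) >= 5:
                break
        self.t = 0
        return (tuple(self.bits), self.t)

    def step(self, action):
        reward = 1.0 if self.bits[action] == 0 else -1.0
        self.bits[action] ^= 1  # flip the chosen bit
        self.t += 1
        done = self.t >= 5
        return (tuple(self.bits), self.t), reward, done
```

Because the step counter is part of the state, a trajectory can never revisit an earlier state, which is what makes the state-determines-action assumption hold with high probability for the initial replay buffer.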
As we can see, the performance of AWR starts deteriorating at a relatively small value of $n$, while QWR maintains high performance even for the largest state spaces we tested. Notice how the returns of AWR drop with $n$: at the highest values of $n$, the agent struggles to flip even a single zero bit. This failure of AWR in large state spaces motivates us to introduce QWR next.
2.3 Q-Value Weighted Regression
To remedy the issue indicated by Theorems 1 and 2, we introduce a mechanism to consider multiple different actions that can be taken from a single state. We calculate the advantage of the sampling policy based on a learned Q-function: $A(s, a) = Q(s, a) - V(s)$, where $V(s)$ is the expected return of the policy $\mu$, expressed using $Q$ by an expectation over actions: $V(s) = \mathbb{E}_{a \sim \mu(\cdot \mid s)}[Q(s, a)]$. We substitute our advantage estimator into the AWR actor formula (Equation 1) to obtain the QWR actor:

$$\pi^{k+1} = \arg\max_\pi \; \mathbb{E}_{s \sim d_\mu(s)}\, \mathbb{E}_{a \sim \mu(\cdot \mid s)} \left[ \log \pi(a \mid s) \, \exp\!\left( \tfrac{1}{\beta}\left(Q(s, a) - V(s)\right) \right) \right] \qquad (3)$$
Similarly to AWR, we implement the expectation over states in Equation 3 by sampling from the replay buffer. However, to estimate the expectation over actions, we average over multiple actions sampled from $\mu(\cdot \mid s)$ during training. Because the replay buffer contains data from multiple different sampling policies, we store the parameters of the sampling policy conditioned on the current state in the replay buffer, and restore them in each training step to compute the loss. This allows us to consider multiple different possible actions for a single state when training the actor, not only the one performed in the collected experience.
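The per-state actor term with sampled actions (Equation 3) can be sketched as follows. `q` and `log_pi` are placeholder callables of ours standing in for the learned critic and actor networks at a fixed state:

```python
import math

def qwr_actor_loss_for_state(q, log_pi, sampled_actions, beta=1.0):
    """Per-state QWR actor term (cf. Equation 3), sketched with plain
    callables: q(a) gives the learned Q-value Q(s, a) and log_pi(a) the
    actor's log-probability for a fixed state s; sampled_actions are
    drawn from the stored sampling policy mu(.|s)."""
    qs = [q(a) for a in sampled_actions]
    v = sum(qs) / len(qs)  # V(s) estimated as the mean of sampled Q-values
    weights = [math.exp((qa - v) / beta) for qa in qs]
    return -sum(w * log_pi(a)
                for w, a in zip(weights, sampled_actions)) / len(qs)
```

With a constant Q-function, all advantages are zero, every weight is 1, and the loss reduces to plain behavioral cloning of the sampled actions; actions with above-average Q-values receive exponentially larger weights.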
The use of a Q-network as a critic provides an additional benefit. Instead of regressing it towards the returns of our sampling policy $\mu$, we can train it to estimate the returns of an improved policy $\pi'$, in a manner similar to Q-learning. This allows us to optimize expected improvement over $\pi'$, providing a better baseline: as long as the condition below holds, the policy improvement theorem for stochastic policies (sutton_barto, Section 4.2) implies that the policy $\pi'$ achieves higher returns than the sampling policy $\mu$:

$$\mathbb{E}_{a \sim \pi'(\cdot \mid s)}\left[ Q^\mu(s, a) \right] \ge V^\mu(s) \quad \text{for all } s \qquad (4)$$
$\pi'$ need not be parametric; in fact, it is not materialized in any way over the course of the algorithm. The only requirement is that we can estimate the Q backup $\bigoplus_{a' \sim \mu(\cdot \mid s')} Q(s', a')$. This allows great flexibility in choosing the form of $\pi'$. Since we want our method to also work in continuous action spaces, we cannot compute the backup exactly. Instead, we estimate it based on several samples from the sampling policy: our backup has the form $\bigoplus_{i=1}^{N} Q(s', a'_i)$ with $a'_i \sim \mu(\cdot \mid s')$. In this work, we extend the term Q-learning to mean training a Q-value function using such a generalized backup. To make training of the Q-network more efficient, we use multistep targets, described in detail in Appendix 6.4. The critic optimization objective using single-step targets is:
$$Q^{k+1} = \arg\min_Q \; \mathbb{E}_{(s, a) \sim \mathcal{D}} \left[ \left( y_{s,a} - Q(s, a) \right)^2 \right] \qquad (5)$$

where

$$y_{s,a} = r(s, a) + \gamma \, \mathbb{E}_{s' \sim T(\cdot \mid s, a)} \Big[ \bigoplus_{a' \sim \mu(\cdot \mid s')} Q(s', a') \Big]$$

and $T$ is the environment's transition distribution.
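The single-step target with a sample-based backup can be sketched as follows; the operator names mirror the three choices discussed next, and the function names are ours:

```python
import math

def q_backup(next_qs, op="logsumexp", tau=1.0):
    """Sample-based estimate of the generalized backup over Q-values of
    actions drawn from the sampling policy at the next state."""
    if op == "mean":
        return sum(next_qs) / len(next_qs)
    if op == "max":
        return max(next_qs)
    if op == "logsumexp":
        # Numerically stable soft maximum over the sampled Q-values;
        # interpolates between mean (large tau) and max (small tau).
        m = max(next_qs)
        return m + tau * math.log(
            sum(math.exp((q - m) / tau) for q in next_qs) / len(next_qs))
    raise ValueError(op)

def td_target(reward, next_qs, gamma=0.99, op="logsumexp", tau=1.0):
    """Single-step critic target (cf. Equation 5): r + gamma * backup."""
    return reward + gamma * q_backup(next_qs, op, tau)
```

Note that with the $1/N$ normalization inside the logarithm, the logsumexp estimate always lies between the mean and the max of the sampled Q-values.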
In this work, we investigate three choices of $\bigoplus$: average, yielding $\frac{1}{N} \sum_{i} Q(s', a'_i)$; max, $\max_i Q(s', a'_i)$, where $\pi'$ approximates the greedy policy; and logsumexp, $\tau \log \big( \frac{1}{N} \sum_i \exp(Q(s', a'_i) / \tau) \big)$, interpolating between average and max with the temperature parameter $\tau$. This leads to three versions of the QWR algorithm: QWR-LSE, QWR-MAX, and QWR-AVG. The last operator, logsumexp, is similar to the backup operator used in maximum-entropy reinforcement learning (see e.g. haarnoja2018soft) and can be thought of as a soft-greedy backup, rewarding both high returns and uncertainty of the policy. It is our default choice, and the final algorithm is shown in Algorithm 2.

3 Related work
Reinforcement learning algorithms.
Recent years have seen great advances in the field of reinforcement learning due to the use of deep neural networks as function approximators. mnih2013playing introduced DQN, an off-policy algorithm learning a parameterized Q-value function through updates based on the Bellman equation. The DQN algorithm only computes the Q-value function; it does not learn an explicit policy. In contrast, policy-based methods such as REINFORCE (reinforce1) learn a parameterized policy, typically by following the policy gradient (reinforce2) estimated through a Monte Carlo approximation of future returns. Such methods suffer from high variance, causing low sample efficiency. Actor-critic algorithms, such as A2C and A3C (sutton1999a2c; mnih2016asynchronous), decrease the variance of the estimate by jointly learning policy and value functions, and using the latter as an action-independent baseline for the calculation of the policy gradient. The PPO algorithm (schulman2017proximal) optimizes a clipped surrogate objective in order to allow multiple updates using the same sampled data.

Continuous control.
lillicrap2015continuous adapted Q-learning to continuous action spaces. In addition to a Q-value function, they learn a deterministic policy function, optimized by backpropagating the gradient through the Q-value function. haarnoja2018soft introduce Soft Actor-Critic (SAC): a method learning in a similar way, but with a stochastic policy optimizing the Maximum Entropy RL (levine2018reinforcement) objective. Similarly to our method, SAC also samples from the policy during training.

Advantage-weighted regression.
The QWR algorithm is a successor of AWR, proposed by peng2019advantageweighted, which in turn is based on Reward-Weighted Regression (RWR, rwr) and AC-REPS, proposed by wirth2016acreps. The mathematical and algorithmic foundations of advantage-weighted regression were developed by fqi. The algorithms share the same good theoretical properties: the RWR, AC-REPS, AWR, and QWR losses can all be mathematically reformulated in terms of KL-divergence with respect to the optimal policy (see formulas (7)-(10) in peng2019advantageweighted). QWR differs from AWR in the following key aspects: instead of empirical returns in the advantage estimation, we train a Q-function (see Equations 1 and 3 for the precise definitions), and we use action sampling for the actor. QWR differs from AC-REPS in that it uses deep learning for function approximation and Q-learning for fitting the critic; see Section 2.

Several recent works have developed algorithms similar to QWR. We provide a brief overview and ways of obtaining them from the QWR pseudocode (Algorithm 2). AWR can be recovered by learning a value function as a critic (line 14) and sampling actions from the replay buffer (lines 12 and 18 in Algorithm 2). AWAC (awac) modifies AWR by learning a Q-function for the critic. We get it from QWR by sampling actions from the replay buffer (lines 12 and 18). Note that, compared to AWAC, by sampling multiple actions for each state, QWR is able to take advantage of Q-learning to improve the critic. CRR (crr) augments AWAC by training a distributional Q-function in line 14 and substituting different functions for computing the advantage weights in line 21 (CRR treats the advantage weight function in line 21 as a hyperparameter; in QWR, it is the exponential). Again, compared to CRR, QWR samples multiple actions for each state, and so can take advantage of Q-learning. In a way similar to QWR, MPO (abdolmaleki2018mpo) samples actions during actor training to improve generalization. Compared to QWR, it introduces a dual function for dynamically tuning the temperature in line 21, adds a prior regularization for policy training, and trains the critic using Retrace (retrace) targets in line 13. QWR can be thought of as a significant simplification of MPO, with the addition of Q-learning to provide a better baseline for the actor. Additionally, the classical DQN (dqn) algorithm for discrete action spaces can be recovered from QWR by removing the actor training loop (lines 16-22), computing a maximum over all actions in Q-network training (line 13), and using an epsilon-greedy policy w.r.t. the Q-network for data collection.
Offline reinforcement learning.
Offline RL is the main topic of the survey by levine2020offline. The authors state that "offline reinforcement learning methods equipped with powerful function approximation may enable data to be turned into generalizable and powerful decision making engines". We see this as one of the major challenges of modern RL, and this work contributes to addressing it. Many current algorithms work to some degree in the offline setting; e.g., the variants of DDPG and DQN developed by fujimoto2018offpolicy and agarwal2019striving, as well as the MPO algorithm by abdolmaleki2018mpo, are promising alternatives to the AWR and QWR algorithms analyzed in this work.
ABM (abm) is a method of extending RL algorithms based on policy networks to offline settings. It first learns a prior policy network on the offline dataset using a loss similar to Equation 1, and then learns the final policy network using any algorithm, adding an auxiliary term penalizing KL-divergence from the prior policy. CQL (cql) is a method of extending RL algorithms based on Q-networks to offline settings by introducing an auxiliary loss. To compute the loss, CQL samples actions online during training of the Q-network, similarly to line 14 in QWR. EMaQ (emaq) learns an ensemble of Q-functions using an Expected-Max backup operator and uses it during evaluation to pick the best action. The Q-network training part is similar to QWR with the Expected-Max operator as $\bigoplus$ in line 13 of Algorithm 2.
The imitation learning algorithm MARWIL by wang2018marwil confirms that advantage-weighted regression performs well in the context of complex games.

Table 1: Returns on the MuJoCo environments (HalfCheetah, Walker, Hopper, Humanoid) at 100K interactions, for QWR-LSE, QWR-MAX, QWR-AVG, AWR, SAC, and PPO. (Numeric results omitted.)
Table 2: Returns on the Atari games (Boxing, Breakout, Freeway, Gopher, Pong, Seaquest) at 100K interactions, for QWR-LSE, QWR-MAX, QWR-AVG, PPO, OTRainbow, MPR, MPR-aug, SimPLe, and a random policy. (Numeric results omitted.)
Table 3: Returns on the MuJoCo environments (HalfCheetah, Walker, Hopper, Humanoid) at 1M interactions, for QWR-LSE, AWR, SAC, and PPO. (Numeric results omitted.)
4 Experiments
Neural architectures.
In all MuJoCo experiments, for both the value and policy networks, we use multi-layer perceptrons with two layers of 256 neurons each and ReLU activations. In all Atari experiments, for both the value and policy networks, we use the same convolutional architectures as in dqn. To feed actions to the network, we embed them using one linear layer and combine the embedded action with the processed observation; this is followed by the value or policy head. For the policy, we parameterize either the log-probabilities of actions in the case of discrete action spaces, or the mean of a Gaussian distribution with a constant standard deviation in the case of continuous action spaces.

4.1 Sample efficiency
Since we are concerned with sample efficiency, we focus our first experiments on the case when the number of interactions with the environment is limited. To use a single budget that allows comparisons with previous work on both MuJoCo and Atari, we restrict the number of interactions to 100K. This number is high enough that state-of-the-art algorithms such as SAC reach good performance.
We run experiments on 4 MuJoCo environments and 6 Atari games, evaluating three versions of QWR with the 3 backup operators introduced in Section 2.3: QWR-LSE (using logsumexp), QWR-MAX (using maximum), and QWR-AVG (using average). For all experiments, we set the Q-target truncation horizon to 3. In the MuJoCo experiments, we set the number of action samples to 4. In the Atari experiments, because the action space is discrete, we can compute the policy loss for each transition explicitly, without sampling. All other hyperparameters are kept the same between those domains. We discuss these choices and show ablations in subsection 4.3; more experimental details are given in Appendix 6.1.
In Tables 1 and 2 we present the final returns at 100K samples for the considered algorithms and environments. To put them in context, we also provide results for SAC, PPO, OTRainbow (a variant of Rainbow tuned for sample efficiency), MPR, and SimPLe.
On all considered MuJoCo tasks, QWR exceeds the performance of AWR and PPO. The better sample efficiency is particularly visible in the case of Walker, where each variant of QWR performs better than every baseline considered. On Hopper, QWR-LSE, the best variant, outpaces all baselines by a large margin. On Humanoid, it comes close to SAC, the state of the art on MuJoCo.

QWR surpasses PPO and OTRainbow in 4 out of 6 Atari games. In Gopher and Pong, QWR outperforms even the augmented and non-augmented versions of the model-based MPR algorithm.
4.2 More samples
To verify that our algorithm makes good use of higher sample budgets, we also evaluate it on the 4 MuJoCo tasks at 1M samples. For this experiment, we adapt several of the hyperparameters of QWR to the larger amount of data; the details are provided in Appendix 6.1. We present the results in Table 3.

On Walker, Hopper, and Humanoid, QWR outperforms all baselines; only on HalfCheetah is it surpassed by SAC. In all tasks, QWR achieves significantly higher scores than AWR and PPO, which shows that the sample-efficiency improvements in QWR translate well to the higher budget of 1M samples.
4.3 Ablations
In Figure 2 we provide an ablation of QWR with respect to the backup operator $\bigoplus$, the multistep target horizon ("margin"), and the number of action samples used when training the actor and the critic. As we can see, the algorithm is fairly robust to the choice of these hyperparameters.

Overall, the logsumexp backup (LSE) achieves the best results – compare (b) and (e). The max backup performs well with margin 1, but is more sensitive to higher numbers of samples – compare (d) and (e). The logsumexp backup is less vulnerable to this effect – compare (a) and (d). Higher margins decrease performance – see (c) and (b). We conjecture this is due to stale action sequences in the replay buffer biasing the multistep targets. Again, the logsumexp backup is less prone to this issue – compare (c) to (f).
4.4 Offline RL
Both QWR and AWR are capable of learning from expert data. AWR was shown to behave stably when provided only with a set of expert trajectories (see Figure 7 in peng2019advantageweighted), without additional data collection. In this respect, AWR is much more robust than PPO and SAC. In Figure 3 we show the same result for QWR: in terms of reusing expert trajectories, it matches or exceeds AWR. QWR training on offline data was remarkably stable and worked well across all environments we tried.

For the offline RL experiments, we trained each algorithm for 30 iterations, without additional data collection. The training trajectories contained only states, actions, and rewards, without any algorithm-specific data. In QWR, we set the per-step sampling policies to be Gaussians with the mean at the performed action and a constant standard deviation, the same as in peng2019advantageweighted.
5 Discussion and Future Work
We present Q-Value Weighted Regression (QWR), an off-policy actor-critic algorithm that extends Advantage Weighted Regression with action sampling and Q-learning. It is significantly more sample-efficient than AWR and works well with discrete actions and in visual domains, e.g., on Atari games. QWR consists of two interleaved steps of supervised training: the critic learns the Q-function using a predefined backup operator, and the actor learns the policy with weighted regression based on multiple sampled actions. Thanks to this clear structure, QWR is simple to implement and debug. It is also stable across a wide range of hyperparameter choices and works well in the offline setting.
Importantly, we designed QWR based on a theoretical analysis that revealed why AWR may not work when data collection in the environment is limited. Our analysis of the limited-data regime rests on the state-determines-action assumption, which allows us to fully solve AWR analytically while still being realistic and indicative of the performance of the algorithm with few samples. We believe that the state-determines-action assumption can yield important insights into other RL algorithms as well.
QWR already achieves state-of-the-art results in settings with limited data, and we believe it can be further improved in the future. The critic training could benefit from advances in Q-learning methods such as double Q-networks (hasselt2015deep) or Polyak averaging (Polyak1990NewMO), already used in SAC. Distributional Q-learning (bellemare2017distributional) and the use of ensembles like REM (agarwal2020optimistic) could yield further improvements.
Notably, the QWR results at 100K that we present are achieved with the same set of hyperparameters (except for the network architecture) for both the MuJoCo environments and the Atari games. This is rare among deep reinforcement learning algorithms, especially ones that strive for sample efficiency. Combined with its stability and good performance in offline settings, this makes QWR a compelling choice for reinforcement learning in domains with limited data.
References
6 Appendix
6.1 Experimental Details
We run experiments on 4 MuJoCo environments: HalfCheetah, Walker, Hopper, and Humanoid, and on 6 Atari games: Boxing, Breakout, Freeway, Gopher, Pong, and Seaquest. For the MuJoCo environments, we limit the episode length. For the Atari environments, we apply the following preprocessing:

- Repeating each action for 4 consecutive steps, taking the maximum of the last 2 frames as the observation.
- Stacking the last 4 frames obtained from the previous step into one observation.
- Grayscale observations, cropped and rescaled.
- A maximum number of interactions per episode.
- A random number of no-op actions at the beginning of each episode.
- Rewards clipped during training.
Our code with the exact configurations we use to reproduce the experiments is available as open source (URL removed to preserve anonymity). We use the same hyperparameters for the MuJoCo and Atari experiments, and almost the same hyperparameters for the 100K and 1M sample budgets. The hyperparameters and their tuning ranges are reported in Table 4.

Before calculating the actor loss, we normalize the advantages over the entire batch by subtracting their mean and dividing by their standard deviation, the same as peng2019advantageweighted. We perform a similar procedure for the logsumexp backup operator used in critic training. Before applying the backup, we divide the Q-values by a computed measure of their scale, $\sigma$. After applying the backup, we rescale the target by $\sigma$. There is no need to subtract the mean, as logsumexp is translation-invariant:
$$\bigoplus_{a' \sim \mu(\cdot \mid s')} Q(s', a') = \sigma \cdot \tau \log \left( \frac{1}{N} \sum_{i=1}^{N} \exp\!\left( \frac{Q(s', a'_i)}{\sigma \tau} \right) \right) \qquad (6)$$

The parameters of this backup are the only ones that differ between the 100K and 1M experiments: for 100K, we use the mean absolute deviation of the Q-values as $\sigma$; for 1M, their standard deviation (with a different value of $\tau$).
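The two normalization procedures described above can be sketched as follows, assuming plain Python lists in place of batched tensors (function names are ours):

```python
import math

def normalize_advantages(advs):
    """Batch-normalize advantages before the actor loss: subtract the
    mean and divide by the (population) standard deviation."""
    n = len(advs)
    mean = sum(advs) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in advs) / n) or 1.0
    return [(a - mean) / std for a in advs]

def scaled_logsumexp_backup(qs, tau, scale):
    """Scale-corrected logsumexp backup (cf. Equation 6): divide the
    Q-values by a scale estimate sigma before the soft maximum and
    multiply the result back, so the backup's effective temperature
    does not depend on the scale of the Q-values."""
    xs = [q / scale for q in qs]
    m = max(xs)
    lse = m + tau * math.log(
        sum(math.exp((x - m) / tau) for x in xs) / len(xs))
    return scale * lse
```

A useful property of the scale correction is homogeneity: multiplying all Q-values and $\sigma$ by the same constant multiplies the backup by that constant, which keeps $\tau$ comparable across environments with different return magnitudes.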
Table 4: Hyperparameters and their considered tuning ranges: the number of action samples $N$; the multistep target horizon ("margin"); the actor loss temperature $\beta$; the critic backup operator (logsumexp; considered: mean, max, logsumexp); the discount factor for the returns; the discount factor in TD($\lambda$); the actor learning rate; the critic learning rate; the batch size (actor and critic); the replay buffer size in interactions; n_actor_steps; n_critic_steps; update_frequency; and n_iterations, set so that we reach the desired number of interactions. In all experiments, we collect a fixed number of interactions with the environment in each iteration of the algorithm. (Numeric values omitted.)
When training the networks, we use the Adam optimizer. We use standard deep network architectures: in the MuJoCo experiments, a multi-layer perceptron with two layers of 256 neurons each and ReLU activations; in the Atari experiments, the same convolutional architectures as dqn.
The 100K experiments took around 18 hours each, on a single TPU v2 chip. The 1M experiments took around 180 hours each, using the same hardware.
6.2 Formal Analysis of AWR with Limited Data
Since sample efficiency is one of the key challenges in deep reinforcement learning, it would be desirable to have better tools to understand why any RL algorithm – for instance AWR – is sample efficient or not. This is hard to achieve in the general setting, but we identify a key simplifying assumption that allows us to solve AWR analytically and identify the source of its problems.
The assumption we introduce, called state-determines-action, concerns the content of the replay buffer of an off-policy RL algorithm. The replay buffer $\mathcal{D}$ contains all state-action pairs that the algorithm has visited so far during its interactions with the environment. We say that a replay buffer satisfies the state-determines-action assumption when for each state in the buffer, there is a unique action that was taken from it; formally:

$$\forall (s, a), (s', a') \in \mathcal{D}: \quad s = s' \implies a = a'.$$
A simplifying assumption like state-determines-action is useful only if it indeed simplifies the analysis of RL algorithms. We show that in the case of AWR it does even more: it allows us to analytically calculate the final policy that the algorithm produces. It turns out that this resulting policy yields no improvement over the sampling policy.
While AWR achieves very good results after longer training, it is not very sample-efficient, as noted in the future work section of peng2019advantageweighted. To address this problem, let us analyze a single loop of actor training in AWR:

$$\pi^{k+1} = \arg\max_\pi \; \mathbb{E}_{s \sim d_\mu(s)}\, \mathbb{E}_{a \sim \mu(\cdot \mid s)} \left[ \log \pi(a \mid s) \, \exp\!\left( \tfrac{1}{\beta}\left(R^\mu_{s,a} - V^\mu(s)\right) \right) \right] \qquad (7)$$

$$V^{k+1} = \arg\min_V \; \mathbb{E}_{s \sim d_\mu(s)}\, \mathbb{E}_{a \sim \mu(\cdot \mid s)} \left[ \left( R^\mu_{s,a} - V(s) \right)^2 \right] \qquad (8)$$
How does this update act on a replay buffer that satisfies the state-determines-action assumption? It turns out that we can answer this question analytically, using the following theorem.
Theorem 1.
Let $\mathcal{A}$ be a discrete action space. Let a replay buffer $\mathcal{D}$ satisfy the state-determines-action assumption. Let $\pi_\mathcal{D}$ be the probability function of a distribution that clones the behavior from $\mathcal{D}$, i.e., one that assigns to each state $s$ from $\mathcal{D}$ the action $a_s$ such that $(s, a_s) \in \mathcal{D}$ with probability 1. Then, under the AWR update, $\pi = \pi_\mathcal{D}$.
Proof.
By the definition of the AWR update rule, $\pi = \arg\max_{\pi'} \mathbb{E}_{(s, a) \sim \mathcal{D}} \left[ \log \pi'(a \mid s) \, w(s, a) \right]$, where the weight $w(s, a) = \exp\big(\tfrac{1}{\beta} A(s, a)\big)$ is always positive, since an exponential is always positive. Since $\mathcal{A}$ is a discrete action space, $\pi'$ is a discrete policy and we have $\pi'(a \mid s) \le 1$, so $\log \pi'(a \mid s)$ is at most 0 ($\log$ is a strictly increasing function and $\log 1 = 0$). Thus the objective can be at most 0, and it reaches this maximum for a policy that assigns probability 1 to the action $a_s$ in state $s$ for each $(s, a_s) \in \mathcal{D}$; such a policy is well defined thanks to the state-determines-action assumption. Therefore $\pi_\mathcal{D}$ attains the $\arg\max$, as required. ∎
As we can see from the above theorem, the AWR update rule will insist on cloning the actions taken in the replay buffer as long as the buffer satisfies the state-determines-action assumption. In the extreme case of a deterministic environment, the new policy will not add any new data to the buffer, only replay trajectories already in it. So the whole AWR loop will end with the policy $\pi_\mathcal{D}$, which yields no improvement over the sampling policy.
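A toy numeric check of Theorem 1 (the states, advantages, and $\beta = 1$ below are arbitrary illustrative values, not from the paper): on a buffer satisfying the assumption, the cloning policy attains the maximum objective value of $0$, while a policy that spreads probability mass scores strictly lower.

```python
import math

# Buffer of (state, action, advantage) triples; each state appears with
# one unique action, so state-determines-action holds.
buffer = [("s0", 0, 0.5), ("s1", 1, -0.3), ("s2", 0, 1.2)]
beta = 1.0  # temperature (arbitrary choice)

def awr_objective(policy):
    """Mean of log pi(a|s) * exp(advantage / beta) over the buffer."""
    return sum(math.log(policy[s][a]) * math.exp(adv / beta)
               for s, a, adv in buffer) / len(buffer)

# The cloning policy assigns probability 1 to the buffered action.
cloning = {"s0": [1.0, 0.0], "s1": [0.0, 1.0], "s2": [1.0, 0.0]}
uniform = {s: [0.5, 0.5] for s in ("s0", "s1", "s2")}

j_clone = awr_objective(cloning)   # log(1) = 0 for every pair -> objective 0
j_unif = awr_objective(uniform)    # log(0.5) < 0 times positive weights -> negative
```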
In the next section, we prove an analogous theorem for continuous action spaces.
6.2.1 Continuous action spaces
The statement of Theorem 1 must be adjusted for the case of continuous actions. First of all, let us clarify the notation of the AWR update introduced in Equation 7. For discrete actions, the symbol $\pi(a|s)$ denotes the probability function of a discrete distribution. In the continuous setting, we use it to denote a probability density function.
Now let us define the policy that "clones the behavior from the replay buffer". Intuitively, that is a distribution that concentrates most of its probability mass arbitrarily close to the action stored in the replay buffer.
Theorem 2.
Let $\mathcal{A} = \mathbb{R}$ be a continuous action space. Let a replay buffer $\mathcal{D}$ satisfy the state-determines-action assumption. For a given $\epsilon > 0$, let us consider the following family of parameterized Gaussian distributions
$$\pi_{\mu,\sigma}(\cdot|s) = \mathcal{N}\left(\mu(s), \sigma(s)^2\right),$$
where $\sigma(s) \geq \epsilon$, and define $\mu_\mathcal{D}, \sigma_\mathcal{D}$ such that $\mu_\mathcal{D}(s) = a$ for $(s, a) \in \mathcal{D}$ and $\sigma_\mathcal{D}(s) = \epsilon$ for all $s$. If we perform the optimization in the AWR update over such a family of distributions, we get $\pi = \pi_{\mu_\mathcal{D}, \sigma_\mathcal{D}}$.
Proof.
The reasoning is similar to the proof of Theorem 1, but we cannot rely on $\log \pi(a|s) \leq 0$, as probability density functions can take arbitrarily large values. Let $s$ be any state such that for some $a$ we have $(s, a) \in \mathcal{D}$. For the assumed family of distributions we have
$$\log \pi_{\mu,\sigma}(a|s) = -\frac{(a - \mu(s))^2}{2\sigma(s)^2} - \log\left(\sigma(s)\sqrt{2\pi}\right).$$
This is a quadratic function of $\mu(s)$ with a negative leading coefficient, so it attains its maximum value at $\mu(s) = a$ for every $\sigma$. Now let us look at the dependence on $\sigma(s)$ at $\mu(s) = a$:
$$\frac{\partial}{\partial \sigma(s)} \left( -\log\left(\sigma(s)\sqrt{2\pi}\right) \right) = -\frac{1}{\sigma(s)}.$$
The derivative is negative regardless of $\sigma(s)$, so $\log \pi_{\mu,\sigma}(a|s)$ is maximized for the lowest allowed $\sigma(s) = \epsilon$ and $\mu(s) = a$. This is true for an arbitrary state-action pair $(s, a) \in \mathcal{D}$. So under the AWR update we get $\pi = \pi_{\mu_\mathcal{D}, \sigma_\mathcal{D}}$. ∎
This gives us the intuition that the probability distributions commonly used in RL (e.g., Gaussian) can be improved under the AWR objective by increasing the density at the buffered action and decreasing it everywhere else. For those distributions, the maximum can come arbitrarily close to the Dirac delta, where $\sigma \to 0$.

Given that AWR aims to copy the replay buffer, as demonstrated by Theorems 1 and 2, why does this algorithm work so well in practice, given enough interactions? First of all, note that for this effect to occur, the neural network used for the AWR actor must be large enough and trained long enough to memorize the data from the replay buffer. Furthermore, the policy it learns must be able to express distributions that assign probability $1$ to a single action. This holds for environments with discrete actions, and for continuous actions with distributions with controlled scale, but it is not true e.g. when using Gaussian distributions with fixed variance. However, in the latter case, the proof of Theorem 2 shows that the AWR update will place the mean of the policy distribution at the performed action, regardless of the variance, which still leads to no improvement over the sampling policy.
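The maximization in the proof of Theorem 2 can be checked numerically; a small sketch with arbitrary toy values for the buffered action $a$ and the scale floor $\epsilon$, grid-searching over $(\mu, \sigma)$ with $\sigma \geq \epsilon$:

```python
import math

a, eps = 0.7, 0.1  # buffered action and minimum allowed scale (toy values)

def log_density(mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at the buffered action a."""
    return (-((a - mu) ** 2) / (2 * sigma ** 2)
            - math.log(sigma * math.sqrt(2 * math.pi)))

# Candidate (mu, sigma) pairs on a grid, respecting the constraint sigma >= eps.
grid = [(mu / 10, eps + s / 10) for mu in range(-10, 11) for s in range(0, 10)]
best_mu, best_sigma = max(grid, key=lambda p: log_density(*p))
```

As Theorem 2 predicts, the maximum over the grid lands at $\mu = a$ and $\sigma = \epsilon$.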
In the next section, we show that an algorithm that corrects this cloning behavior achieves improved sample efficiency.
6.3 Formal Analysis of QWR with Limited Data
To see how QWR performs under limited data, we formulate a positive result showing that it achieves the policy improvement that AWR aims for, even in the limited data setting. Note that this time we allow replay buffers that do not necessarily satisfy the state-determines-action assumption. For clarity, however, we make the simplifying assumption that the replay buffer has been collected by a single sampling policy $\mu$.
Recall the QWR update rule:
$$\pi = \arg\max_\pi \mathbb{E}_{s \sim \mathcal{D}} \, \mathbb{E}_{a \sim \mu(\cdot|s)} \left[ \log \pi(a|s) \exp\left( \tfrac{1}{\beta} \left( Q(s, a) - V(s) \right) \right) \right], \quad (9)$$
where $V(s) = \mathbb{E}_{a \sim \mu(\cdot|s)}[Q(s, a)]$ and $\mathcal{D}$ is the set of states in the replay buffer.
Let $\pi^*(a|s) = \frac{1}{Z(s)} \mu(a|s) \exp\left( \tfrac{1}{\beta} \left( Q^\mu(s, a) - V^\mu(s) \right) \right)$, where $V^\mu$ is the state value function of $\mu$ and $Z(s)$ is a normalizing constant. This is the policy optimizing the expected improvement over the sampling policy $\mu$, subject to a KL constraint – the same as in Equation 36 in peng2019advantageweighted.
Since $\pi^*$ is the target policy resulting from the AWR derivation, we know from peng2019advantageweighted that AWR will update towards this policy in the limit, when the replay buffer is large enough. But from Theorem 1 we know that it fails to perform this update when the state-determines-action assumption holds. Below we show that QWR performs the same desirable update for any replay buffer, as long as we restrict attention to the states in the buffer.
Theorem 3.
Let $\mathcal{D}$ be a finite sample from $d^\mu$, the undiscounted state distribution of a policy $\mu$. Let $Q$ be the state-action value function of $\mu$, so $Q(s, a) = Q^\mu(s, a)$ for any state $s$ and action $a$. Then, under the QWR actor update, $\pi = \pi^*|_\mathcal{D}$, where $\pi^*|_\mathcal{D}$ is the policy $\pi^*$ restricted to the set of states $\mathcal{D}$.
Proof.
Let $s$ be an arbitrary state in $\mathcal{D}$. From the definition of $\pi^*$ we have
$$\mu(a|s) \exp\left( \tfrac{1}{\beta} \left( Q^\mu(s, a) - V^\mu(s) \right) \right) = Z(s) \, \pi^*(a|s).$$
Since $Q = Q^\mu$ and $V = V^\mu$,
$$\exp\left( \tfrac{1}{\beta} \left( Q(s, a) - V(s) \right) \right) = \frac{Z(s) \, \pi^*(a|s)}{\mu(a|s)}. \quad (10)$$
We can now change the measure in the QWR objective using the definition of $\pi^*$:
$$\mathbb{E}_{a \sim \mu(\cdot|s)} \left[ \log \pi(a|s) \exp\left( \tfrac{1}{\beta} \left( Q(s, a) - V(s) \right) \right) \right] = Z(s) \, \mathbb{E}_{a \sim \pi^*(\cdot|s)} \left[ \log \pi(a|s) \right]. \quad (11)$$
The right-hand side, up to the normalizing constant $Z(s)$, is the negative cross-entropy between $\pi^*(\cdot|s)$ and $\pi(\cdot|s)$. Since the cross-entropy between two distributions is minimized when the distributions are equal, the optimum is reached at $\pi(\cdot|s) = \pi^*(\cdot|s)$ for all $s \in \mathcal{D}$. ∎
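As a sanity check of this argument, one can verify numerically at a single state that the weighted-likelihood objective from the QWR update is maximized by $\pi^*$. The sampling policy, Q-values, and $\beta$ below are arbitrary toy values, and the comparison is against randomly drawn policies rather than a full optimization:

```python
import math
import random

mu = [0.2, 0.5, 0.3]   # sampling policy over 3 discrete actions (toy values)
q = [1.0, -0.5, 0.3]   # Q-values for each action (toy values)
beta = 0.5             # temperature (toy value)

v = sum(m * qa for m, qa in zip(mu, q))  # V(s) = E_{a~mu}[Q(s, a)]
w = [m * math.exp((qa - v) / beta) for m, qa in zip(mu, q)]
pi_star = [wi / sum(w) for wi in w]      # target policy: mu * exp(advantage/beta), normalized

def objective(pi):
    """E_{a~mu}[log pi(a) * exp((Q(a) - V) / beta)] at the single toy state."""
    return sum(m * math.log(p) * math.exp((qa - v) / beta)
               for m, p, qa in zip(mu, pi, q))

# pi_star should score at least as high as any random policy on the simplex.
random.seed(0)
best_is_pi_star = all(
    objective(pi_star) >= objective([p / sum(ps) for p in ps])
    for ps in ([random.uniform(0.01, 1) for _ in range(3)] for _ in range(100))
)
```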
6.4 Multistep targets
To make the training of the Q-value network more efficient, we implement an approach inspired by the widely-used multi-step Q-learning (mnih2016asynchronous). We consider targets for the Q-value network computed over multiple different time horizons:
$$Q_k(s_t, a_t) = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}), \quad (12)$$
where $s_t$, $a_t$, $r_t$ are the states, actions and rewards in a collected trajectory, respectively, and $V$ is the value estimate used for bootstrapping. We aggregate those multi-step targets using a truncated TD($\lambda$) estimator (sutton_barto, p. 236):
$$Q_\lambda(s_t, a_t) = (1 - \lambda) \sum_{k=1}^{n-1} \lambda^{k-1} Q_k(s_t, a_t) + \lambda^{n-1} Q_n(s_t, a_t). \quad (13)$$
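Equations 12 and 13 can be sketched as follows. This is a toy illustration, not the paper's implementation: the rewards, value estimates, $\gamma$, $\lambda$, and horizon $n$ are arbitrary choices, and `values[t]` stands in for the bootstrap estimate of $V(s_t)$:

```python
gamma, lam, n = 0.99, 0.95, 3  # discount, TD(lambda) weight, max horizon (toy values)

def multistep_target(rewards, values, t, k):
    """k-step target: k discounted rewards plus a bootstrapped value (Eq. 12)."""
    discounted = sum(gamma ** i * rewards[t + i] for i in range(k))
    return discounted + gamma ** k * values[t + k]

def td_lambda_target(rewards, values, t):
    """Truncated TD(lambda): geometric mixture of the k-step targets (Eq. 13)."""
    targets = [multistep_target(rewards, values, t, k) for k in range(1, n + 1)]
    return ((1 - lam) * sum(lam ** (k - 1) * targets[k - 1] for k in range(1, n))
            + lam ** (n - 1) * targets[-1])

rewards = [1.0, 0.0, 0.5, 1.0]        # toy trajectory rewards
values = [0.2, 0.3, 0.1, 0.4, 0.6]    # toy bootstrap values for s_0..s_4
target = td_lambda_target(rewards, values, 0)
```

Since the mixture weights $(1 - \lambda)\lambda^{k-1}$ plus the final $\lambda^{n-1}$ sum to one, the aggregated target is a convex combination of the individual $k$-step targets.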