Remember and Forget for Experience Replay

07/16/2018 ∙ by Guido Novati, et al.

Experience replay (ER) is crucial for attaining high data-efficiency in off-policy deep reinforcement learning (RL). ER entails the recall of experiences obtained in past iterations to compute gradient estimates for the current policy. However, the accuracy of such updates may deteriorate when the policy diverges from past behaviors. Remedies that aim to abate policy changes, such as target networks and hyper-parameter tuning, do not prevent the policy from becoming disconnected from past experiences, possibly undermining the effectiveness of ER. We introduce an algorithm that relies on systematic Remembering and Forgetting for ER (ReF-ER). In ReF-ER the RL agents forget experiences that would be too unlikely with the current policy and constrain policy changes within a trust region of past behaviors in the replay memory. We show that ReF-ER improves the reliability and performance of off-policy RL, both in the deterministic and in the stochastic policy gradients settings. Finally, we complement ReF-ER with a novel off-policy actor-critic algorithm (RACER) for continuous-action control problems. RACER employs a computationally efficient closed-form approximation of on-policy action values and is shown to be highly competitive with state-of-the-art algorithms on benchmark problems, while being robust to large hyper-parameter variations.



1 Introduction

Deep reinforcement learning (RL) has an ever increasing number of success stories, ranging from realistic simulated environments (Schulman et al., 2015; Mnih et al., 2016) to robotics (Levine et al., 2016) and games (Mnih et al., 2015; Silver et al., 2016). Experience Replay (ER) (Lin, 1992) enhances RL algorithms by using information collected in past policy ($\mu$) iterations to compute updates for the current policy ($\pi$). ER has become one of the mainstay techniques to improve the sample-efficiency of off-policy deep RL. Sampling from a replay memory (RM) stabilizes stochastic gradient descent (SGD) by disrupting temporal correlations and extracts information from useful experiences over multiple updates (Schaul et al., 2015b). However, when $\pi$ is parameterized by a neural network (NN), SGD updates may result in significant changes to the policy, thereby shifting the distribution of states observed from the environment. In this case sampling the RM for further updates may lead to incorrect gradient estimates; therefore, deep RL methods must account for and limit the dissimilarity between $\pi$ and behaviors in the RM. Previous works employed trust region methods to bound policy updates (Schulman et al., 2015; Wang et al., 2017). Despite several successes, deep RL algorithms are known to suffer from instabilities and exhibit high variance of outcomes (Islam et al., 2017; Henderson et al., 2017), especially continuous-action methods employing the stochastic (Sutton et al., 2000) or deterministic (Silver et al., 2014) policy gradients (PG or DPG).

In this work we redesign ER in order to control the similarity between the replay behaviors $\mu$ used to compute updates and the policy $\pi$. More specifically, we classify experiences either as “near-policy” or “far-policy”, depending on the ratio $\rho$ of probabilities of selecting the associated action with $\pi$ and that with $\mu$. The weight $\rho$ appears in many estimators that are used with ER such as the off-policy policy gradients (off-PG) (Degris et al., 2012) and the off-policy return-based evaluation algorithm Retrace (Munos et al., 2016). Here we propose and analyze Remember and Forget Experience Replay (ReF-ER), an ER method that can be applied to any off-policy RL algorithm with parameterized policies. ReF-ER limits the fraction of far-policy samples in the RM, and computes gradient estimates only from near-policy experiences. Furthermore, the hyper-parameters that govern these two mechanisms can be gradually annealed during training to obtain increasingly accurate updates from nearly on-policy experiences. We show that ReF-ER allows better stability and performance than conventional ER in all three main classes of continuous-action off-policy deep RL algorithms: methods based on the DPG (i.e. DDPG (Lillicrap et al., 2016)), methods based on Q-learning (i.e. NAF (Gu et al., 2016)), and methods based on the off-PG (Degris et al., 2012; Wang et al., 2017).

In recent years, there has been growing interest in coupling RL with high-fidelity physics simulations (Reddy et al., 2016; Novati et al., 2017; Colabrese et al., 2017; Verma et al., 2018). The computational cost of these simulations calls for reliable and data-efficient RL methods that do not require problem-specific tweaks to the hyper-parameters (HP). Moreover, while on-policy training of simple architectures has been shown to be sufficient in some benchmarks (Rajeswaran et al., 2017), agents aiming to solve complex problems with partially observable dynamics might require deep or recurrent models that can be trained more efficiently with off-policy methods. We analyze ReF-ER on the OpenAI Gym (Brockman et al., 2016) as well as on fluid-dynamics simulations to show that it reliably obtains competitive results without requiring extensive HP optimization.

2 Methods

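Consider the sequential decision process of an agent aiming to optimize its interaction with the environment. At each step $t$, the agent observes its state $s_t$, performs an action by sampling a policy $a_t \sim \mu_t(a \mid s_t)$, and transitions to a new state $s_{t+1}$ with reward $r_{t+1}$. The experiences $\{s_t, a_t, r_{t+1}, \mu_t\}$ are stored in a RM, which constitutes the data used by off-policy RL to train the parametric policy $\pi^w(a \mid s)$. The importance weight $\rho_t = \pi^w(a_t \mid s_t) / \mu_t(a_t \mid s_t)$ is the ratio between the probability of selecting $a_t$ with the current $\pi^w$ and with the behavior $\mu_t$, which gradually becomes dissimilar from $\pi^w$ as the latter is trained. The on-policy state-action value $Q^{\pi^w}(s, a)$ measures the expected returns from $(s, a)$ following $\pi^w$:

$$Q^{\pi^w}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s,\ a_0 = a,\ a_t \sim \pi^w\right] \tag{1}$$

Here $\gamma$ is a discount factor. The value of state $s$ is the on-policy expectation $V^{\pi^w}(s) = \mathbb{E}_{a \sim \pi^w}\left[Q^{\pi^w}(s, a)\right]$. In this work we focus on three deep-RL algorithms, each representing one class of off-policy continuous-action RL methods.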
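As a minimal sketch of the importance weight described above, the snippet below evaluates the ratio between the current policy and the stored behavior for diagonal Gaussian distributions. The function names and the Gaussian parameterization are illustrative assumptions rather than the authors' implementation; the ratio is formed in log space to limit round-off.

```python
import numpy as np


def gaussian_log_prob(action, mean, std):
    """Log-density of a diagonal Gaussian N(mean, diag(std^2)) at `action`."""
    action = np.asarray(action, dtype=float)
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    z = (action - mean) / std
    return float(np.sum(-0.5 * z**2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)))


def importance_weight(action, pi_mean, pi_std, mu_mean, mu_std):
    """rho = pi(a|s) / mu(a|s) for a stored action (hypothetical helper)."""
    log_rho = gaussian_log_prob(action, pi_mean, pi_std) - gaussian_log_prob(
        action, mu_mean, mu_std
    )
    return float(np.exp(log_rho))
```

When the current policy coincides with the behavior that generated the action, the weight is 1; it decays toward 0 as the current policy moves away from the stored action.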

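DDPG (Lillicrap et al., 2016) is a method based on deterministic PG which trains two networks by ER. The value-network (a.k.a. critic) outputs $Q^{w'}(s, a)$ and is trained to minimize the L2 distance from the temporal difference (TD) target $\hat q_t$:

$$\mathcal{L}^{Q}(w') = \frac{1}{2}\, \mathbb{E}_{s_t \sim B(s),\ a_t \sim \mu_t}\left[Q^{w'}(s_t, a_t) - \hat q_t\right]^2, \qquad \hat q_t = r_{t+1} + \gamma\, \mathbb{E}_{a' \sim \pi}\left[Q^{w'}(s_{t+1}, a')\right] \tag{2}$$

Here $B(s)$ is the probability of sampling state $s$ from a RM containing the last $N$ experiences of the agent acting with policies $\mu_t$. The policy-network is trained to output actions $m^w(s)$ that maximize the returns predicted by the critic (Silver et al., 2014):

$$g^{\mathrm{DPG}}(w) = \mathbb{E}_{s_t \sim B(s)}\left[\nabla_w m^w(s_t)\, \nabla_a Q^{w'}(s_t, a)\big|_{a = m^w(s_t)}\right] \tag{3}$$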
NAF (Gu et al., 2016) is the state of the art among Q-learning based algorithms for continuous-action problems. It employs a quadratic-form approximation of the value:

$$Q^{w}_{\mathrm{NAF}}(s, a) = V^w(s) - \frac{1}{2}\left[a - m^w(s)\right]^{\top} L^w(s) \left[L^w(s)\right]^{\top} \left[a - m^w(s)\right] \tag{4}$$

Given a state $s$, a single network estimates its value $V^w(s)$, the optimal action $m^w(s)$, and the lower-triangular matrix $L^w(s)$ that parameterizes the advantage. Like DDPG, NAF is trained by ER with the Q-learning target (Eq. 2). For both DDPG and NAF, we include exploratory Gaussian noise with standard deviation $\sigma$ in the policy (to compute $\rho_t$ or the Kullback-Leibler divergence $D_{\mathrm{KL}}$ between policies).
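The quadratic-form value used by NAF can be sketched in a few lines. This is an illustration under stated assumptions: the factor 1/2 follows the convention of Gu et al. (2016), and the function and argument names are hypothetical.

```python
import numpy as np


def naf_q_value(value, mean_action, tril_matrix, action):
    """Q(s,a) = V(s) - 0.5 * (a - m)^T L L^T (a - m), with L lower-triangular."""
    diff = np.asarray(action, dtype=float) - np.asarray(mean_action, dtype=float)
    L = np.tril(np.asarray(tril_matrix, dtype=float))  # enforce lower-triangular form
    proj = L.T @ diff  # (a - m)^T L L^T (a - m) equals ||L^T (a - m)||^2
    return float(value - 0.5 * (proj @ proj))
```

Because the quadratic term is non-negative, the action value is maximized at the mean action, where it equals the state value. This is what allows a single network to provide both the greedy action and the Q-learning target.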

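V-RACER is the method we propose to analyze off-policy PG (off-PG) and ER. Given $s_t$, a single NN outputs the value $V^w(s_t)$, the mean $m^w(s_t)$ and diagonal covariance $\Sigma^w(s_t)$ of the Gaussian policy $\pi^w(a \mid s_t)$. The policy is updated with the off-policy objective (Degris et al., 2012):

$$g^{\mathrm{pol}}(w) = \mathbb{E}_{s_t \sim B(s),\ a_t \sim \mu_t}\left[\rho_t \left(\hat q_t - V^w(s_t)\right) \nabla_w \log \pi^w(a_t \mid s_t)\right] \tag{5}$$

On-policy returns are estimated with Retrace (Munos et al., 2016), which takes into account rewards obtained by $\mu_t$:

$$\hat q_t = r_{t+1} + \gamma V^w(s_{t+1}) + \gamma \bar\rho_{t+1} \left[\hat q_{t+1} - Q^w(s_{t+1}, a_{t+1})\right] \tag{6}$$

Here we defined $\bar\rho_t = \min\{1, \rho_t\}$. V-RACER avoids training a NN for the action value by approximating $Q^w(s, a) := V^w(s)$ (i.e. it assumes that any individual action has a small effect on returns (Tucker et al., 2018)). The on-policy state value is estimated with the “variance truncation and bias correction trick” (TBC) (Wang et al., 2017):

$$\hat v^{\mathrm{tbc}}_t = V^w(s_t) + \bar\rho_t \left[\hat q_t - Q^w(s_t, a_t)\right] \tag{7}$$

From Eq. 6 and 7 we obtain $\hat q_t = r_{t+1} + \gamma \hat v^{\mathrm{tbc}}_{t+1}$. From this, Eq. 7 and $Q^w(s, a) := V^w(s)$, we obtain a recursive estimator for the on-policy state value that depends on $V^w$ alone:

$$\hat v^{\mathrm{tbc}}_t = V^w(s_t) + \bar\rho_t \left[r_{t+1} + \gamma \hat v^{\mathrm{tbc}}_{t+1} - V^w(s_t)\right] \tag{8}$$

This target is equivalent to the recently proposed V-trace estimator (Espeholt et al., 2018) when all importance weights are clipped at 1, which was empirically found by the authors to be the best-performing solution. Finally, the value estimate is trained to minimize the loss:

$$\mathcal{L}^{\mathrm{tbc}}(w) = \frac{1}{2}\, \mathbb{E}_{s_t \sim B(s)}\left[V^w(s_t) - \hat v^{\mathrm{tbc}}_t\right]^2 \tag{9}$$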
In order to estimate $\hat v^{\mathrm{tbc}}_t$ for a sampled time step $t$, Eq. 8 requires $V^w$ and $\rho$ for all following steps in sample $t$’s episode. These are naturally computed when training from batches of episodes (as in ACER (Wang et al., 2017)) rather than time steps (as in DDPG and NAF). However, the information contained in consecutive steps is correlated, worsening the quality of the gradient estimate, and episodes may be composed of thousands of time steps, increasing the computational cost. To efficiently train from uncorrelated time steps, V-RACER stores for each sample the most recently computed estimates of $V^w(s_t)$, $\rho_t$ and $\hat v^{\mathrm{tbc}}_t$. When a time step is sampled, the stored $\hat v^{\mathrm{tbc}}_t$ is used to compute the gradients. At the same time, the current NN outputs are used to update $V^w(s_t)$ and $\rho_t$, and to correct $\hat v^{\mathrm{tbc}}$ for all prior time-steps in the episode with Eq. 8. Each algorithm and the remaining implementation details are described in App. A.
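The recursive state-value target can be sketched as a backward pass over one stored episode. The sketch assumes the form v̂_t = V(s_t) + min(1, ρ_t) [r_{t+1} + γ v̂_{t+1} − V(s_t)], i.e. the V-trace target with all importance weights clipped at 1 as stated above. The function name, the array layout (`rewards[t]` holds the reward received after the action at step `t`), and the `bootstrap` argument are assumptions made for illustration.

```python
import numpy as np


def tbc_state_values(rewards, values, rhos, gamma=0.99, bootstrap=0.0):
    """Backward recursion for the clipped, bias-corrected state-value targets.

    rewards[t]: reward following the action taken at step t
    values[t]:  stored network estimate V(s_t)
    rhos[t]:    stored importance weight of the action taken at step t
    bootstrap:  value used after the last stored step (0 for a terminal state)
    """
    values = np.asarray(values, dtype=float)
    v_hat = np.empty(len(values))
    next_v = float(bootstrap)
    for t in reversed(range(len(values))):
        clipped = min(1.0, float(rhos[t]))
        v_hat[t] = values[t] + clipped * (rewards[t] + gamma * next_v - values[t])
        next_v = v_hat[t]
    return v_hat
```

With all weights equal to or above 1 the targets reduce to discounted Monte Carlo returns, while vanishing weights leave the stored value estimates unchanged. Updating a sampled step only requires re-running the recursion from that step back to the start of its episode.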

3 Remember and Forget Experience Replay

In off-policy RL it is common to maximize on-policy returns estimated over the distribution of states contained in a RM. In fact, each method introduced in Sec. 2 relies on computing estimates over the distribution $B(s)$ of states observed by the agent following behaviors $\mu_k$ over prior steps $k$. However, as $\pi^w$ gradually shifts away from previous behaviors, $B(s)$ is increasingly dissimilar from the on-policy distribution, and trying to increase an off-policy performance metric may not improve on-policy outcomes. This issue can be compounded with algorithm-specific concerns. For example, the dissimilarity between $\mu_k$ and $\pi^w$ may cause vanishing or diverging importance weights $\rho_k$, thereby increasing the variance of the off-PG and deteriorating the convergence speed of Retrace (and V-trace) by inducing “trace-cutting” (Munos et al., 2016). Multiple remedies have been proposed to address these issues. For example, ACER tunes the learning rate and uses a target-network (Mnih et al., 2015), updated as a delayed copy of the policy-network, to constrain policy updates. Target-networks are also employed in DDPG to slow down the feedback loop between value-network and policy-network optimizations. This feedback loop causes overestimated action values that can only be corrected by acquiring new on-policy samples. Recent works (Henderson et al., 2017) have shown the opaque variability of outcomes of continuous-action deep RL algorithms depending on hyper-parameters. Target-networks may be one of the sources of this unpredictability. In fact, when using deep approximators, there is no guarantee that the small weight changes imposed by target-networks correspond to small changes in the network’s output.

This work explores the benefits of actively managing the “off-policyness” of the experiences used by ER. We propose a set of simple techniques, collectively referred to as Remember and Forget ER (ReF-ER), that can be applied to any off-policy RL method with parameterized policies.

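  • The cost functions are minimized by estimating the gradients $\hat g$ with mini-batches of experiences drawn from a RM. We compute the importance weight $\rho_t$ of each experience and classify it as “near-policy” if $1/c_{\max} < \rho_t < c_{\max}$ with $c_{\max} > 1$. Samples with vanishing ($\rho_t < 1/c_{\max}$) or exploding ($\rho_t > c_{\max}$) importance weights are classified as “far-policy”. When computing off-policy estimators with finite batch-sizes, such as the Retrace estimate or the off-PG, “far-policy” samples may either be irrelevant or increase the variance. For this reason, (Rule 1:) the gradients computed from far-policy samples are clipped to zero. In order to efficiently approximate the number of far-policy samples in the RM, we store for each step its most recent $\rho_t$.

  • (Rule 2:) Policy updates are penalized in order to attract the current policy $\pi^w$ towards past behaviors:

    $$\hat g^{\text{ReF-ER}}(w) = \begin{cases} \beta\, \hat g(w) - (1 - \beta)\, \hat g^{D}(w) & \text{if } 1/c_{\max} < \rho_t < c_{\max} \\ -(1 - \beta)\, \hat g^{D}(w) & \text{otherwise} \end{cases} \tag{10}$$

    Here we penalize the “off-policyness” of the RM with:

    $$\hat g^{D}(w) = \mathbb{E}_{s_k \sim B(s)}\left[\nabla_w D_{\mathrm{KL}}\big(\mu_k(\cdot \mid s_k)\,\|\,\pi^w(\cdot \mid s_k)\big)\right] \tag{11}$$

    The coefficient $\beta \in [0, 1]$ is updated at each step such that a set fraction $D \in (0, 1)$ of samples are far-policy:

    $$\beta \leftarrow \begin{cases} (1 - \eta)\, \beta & \text{if } n_{\mathrm{far}} / N > D \\ (1 - \eta)\, \beta + \eta & \text{otherwise} \end{cases} \tag{12}$$

    Here $\eta$ is the NN’s learning rate, $N$ is the number of experiences in the RM, of which $n_{\mathrm{far}}$ are far-policy. Note that iteratively updating $\beta$ with Eq. 12 has fixed points in $\beta = 0$ for $n_{\mathrm{far}} / N > D$ and in $\beta = 1$ otherwise.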
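The two rules above can be sketched as follows. The sketch writes ρ for the importance weight, c_max > 1 for the near-policy threshold, β for the penalization coefficient, and `grad_kl` for the gradient of the KL divergence between the stored behavior and the current policy. All function names are hypothetical, and the sign convention (the combined vector is an ascent direction for the objective and a descent direction for the KL penalty) is an assumption.

```python
import numpy as np


def is_near_policy(rho, c_max):
    """Rule 1 classification: near-policy iff 1/c_max < rho < c_max."""
    rho = np.asarray(rho, dtype=float)
    return (rho > 1.0 / c_max) & (rho < c_max)


def refer_gradient(grad, grad_kl, rho, beta, c_max):
    """Combine Rule 1 and Rule 2 for one sample.

    Far-policy samples contribute only the penalization term, which pulls
    the policy back toward the replayed behavior.
    """
    grad = np.asarray(grad, dtype=float)
    penalty = -(1.0 - beta) * np.asarray(grad_kl, dtype=float)
    if is_near_policy(rho, c_max):
        return beta * grad + penalty
    return penalty


def update_beta(beta, far_fraction, target_fraction, lr):
    """Move beta toward 0 when too many samples are far-policy, else toward 1."""
    if far_fraction > target_fraction:
        return (1.0 - lr) * beta
    return (1.0 - lr) * beta + lr
```

Iterating `update_beta` with a constant condition drives β to 0 (pure penalization) or to 1 (no penalization), which matches the fixed points noted in the text. In practice the fraction of far-policy samples changes as the policy is pulled back, so β settles at an intermediate value.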

We remark that alternative metrics of the relevance of training samples were considered, such as $D_{\mathrm{KL}}$ or its discounted cumulative sum, before settling on the present formulation. ReF-ER aims to reduce the sensitivity to the NN architecture and HP by controlling the rate at which the policy can deviate from the replayed behaviors. For $c_{\max} \to 1$ and $D \to 0$, ReF-ER becomes asymptotically equivalent to computing updates from an on-policy dataset. Therefore, we anneal ReF-ER’s $c_{\max}$ and the NN’s learning rate according to:

$$c_{\max}(t) = 1 + \frac{C}{1 + A \cdot t}, \qquad \eta(t) = \frac{\eta}{1 + A \cdot t} \tag{13}$$

Here $t$ is the time step index, $A$ regulates annealing, and $\eta$ is the initial learning rate. $C$ determines how much $\pi^w$ is allowed to differ from the replayed behaviors. By annealing $c_{\max}$ we allow fast improvements at the beginning of training, when inaccurate policy gradients might be sufficient to estimate a good direction for the update. Conversely, during the later stages of training, precise updates can be computed from almost on-policy samples. We use a single set of values for these hyper-parameters for all results with ReF-ER in the main text. The effect of these hyper-parameters is further discussed in a detailed sensitivity analysis reported in the Supplementary Material.
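A minimal sketch of the annealing schedule, assuming the hyperbolic form c_max(t) = 1 + C / (1 + A t) and η(t) = η₀ / (1 + A t), which is consistent with the description above (the threshold tends to 1 and the learning rate to 0 as t grows). The function names are hypothetical, and the numerical values in the usage note are illustrative rather than taken from the text.

```python
def annealed_c_max(step, C, A):
    """Near-policy threshold: starts at 1 + C and decays toward 1."""
    return 1.0 + C / (1.0 + A * step)


def annealed_learning_rate(step, lr0, A):
    """Learning rate: starts at lr0 and decays toward 0 at the same pace."""
    return lr0 / (1.0 + A * step)
```

For example, with the illustrative choice C = 4 and A = 5e-7, the threshold starts at 5 and is halfway to its asymptote (c_max = 3) after 2 million time steps. A larger A tightens the near-policy region sooner, trading early progress for more accurate late updates.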

4 Related work

The rules that determine which samples are kept in the RM and how they are used for training can be designed to address specific objectives. For example, it may be necessary to properly plan ER to prevent lifelong learning agents from forgetting previously mastered tasks (Isele & Cosgun, 2018). ER can be used to train transition models in planning-based RL (Pan et al., 2018), or to help shape NN features by training off-policy learners on auxiliary tasks (Schaul et al., 2015a; Jaderberg et al., 2017). When rewards are sparse, RL agents can be trained to repeat previous outcomes (Andrychowicz et al., 2017) or to reproduce successful states or episodes (Oh et al., 2018; Goyal et al., 2018).

In the next section we compare ReF-ER to conventional ER and Prioritized Experience Replay (PER) (Schaul et al., 2015b). PER improves the performance of DQN (Mnih et al., 2015) by biasing sampling in favor of experiences that cause large temporal-difference (TD) errors. TD errors may signal rare events that would convey useful information to the learner. de Bruin et al. (2015) propose a modification to ER that increases the diversity of behaviors contained in the RM, which is the opposite of what ReF-ER achieves. Because the ideas proposed by de Bruin et al. (2015) cannot readily be applied to complex tasks (the authors state that their method is not suitable when the policy is advanced for many iterations), we compare ReF-ER only to PER and conventional ER. We assume that if increasing the diversity of experiences in the RM were beneficial to off-policy RL, then either PER or ER would outperform ReF-ER.

ReF-ER is inspired by the techniques developed for on-policy RL to bound policy changes in PPO (Schulman et al., 2017). Rule 1 of ReF-ER is similar to the clipped objective function of PPO (gradients are zero if $\rho$ is outside of some range). However, Rule 1 is not affected by the sign of the advantage estimate and clips both policy and value gradients. Another variant of PPO penalizes $D_{\mathrm{KL}}$ in a similar manner to Rule 2 (also Schulman et al. (2015) and Wang et al. (2017) employ trust-region schemes in the on- and off-policy setting respectively). PPO picks one of the two techniques, and the authors find that gradient-clipping performs better than penalization. Conversely, in ReF-ER Rules 1 and 2 complement each other and can be applied to most off-policy RL methods with parametric policies.

V-RACER shares many similarities with ACER (Wang et al., 2017) and IMPALA (Espeholt et al., 2018) and is a secondary contribution of this work. The improvements introduced by V-RACER have the purpose of aiding our analysis of ReF-ER: (1) V-RACER employs a single NN; not requiring expensive architectures eases reproducibility and exploration of the HP (e.g. continuous-ACER uses 9 NN evaluations per gradient). (2) V-RACER samples time steps rather than episodes (like DDPG and NAF and unlike ACER and IMPALA), further reducing its cost (episodes may consist of thousands of steps). (3) V-RACER does not introduce techniques that would interfere with ReF-ER and affect its analysis. Specifically, ACER uses the TBC (Sec. 2) to clip policy gradients, employs a target-network to bound policy updates with a trust-region scheme, and modifies Retrace to use $\sqrt[d_A]{\rho}$, with $d_A$ the dimensionality of the action space, instead of $\rho$. Lacking these techniques, we expect V-RACER to require ReF-ER to deal with unbounded importance weights. Because of points (1) and (2), the computational complexity of V-RACER is approximately two orders of magnitude lower than that of ACER.

5 Results

In this section we couple ReF-ER, conventional ER and PER with one method from each of the three main classes of deep continuous-action RL algorithms: DDPG, NAF, and V-RACER. In order to separate the effects of its two components, we distinguish between ReF-ER-1, which uses only Rule 1, ReF-ER-2, which uses only Rule 2, and the full ReF-ER. The performance of each combination of algorithms is measured on the MuJoCo (Todorov et al., 2012) tasks of OpenAI Gym (Brockman et al., 2016) by plotting the mean cumulative reward. Each plot tracks the average among all episodes entering the RM within fixed intervals of time steps, averaged over five differently seeded training trials. For clarity, we highlight percentile contours only for the best performing alternatives to the proposed methods. The code to reproduce all present results is available on GitHub.

5.1 Results for DDPG

Figure 1: Cumulative rewards on OpenAI MuJoCo tasks for DDPG (black line), DDPG with rank-based PER (purple line), DDPG with ReF-ER (blue), with ReF-ER-1 (red), and with ReF-ER-2 (green). Implementation details in App. A.
Figure 2: Kullback-Leibler divergence between the policy $\pi^w$ trained by DDPG and the replayed behaviors. Same colors as above. Note: the average $D_{\mathrm{KL}}$ for each algorithm is 0 at the beginning of training and is updated after every 1e5 time steps.

The performance of DDPG is sensitive to hyper-parameter (HP) tuning (Henderson et al., 2017). We find the critic’s weight decay and temporally-correlated exploration noise to be necessary to stabilize DDPG with ER and PER. Without this tuning, the returns for DDPG can fall to large negative values, especially in tasks that include the actuation cost in the reward (e.g. Ant). This is explained by the critic not having learned local maxima with respect to the action (Silver et al., 2014). Fig. 1 shows that replacing ER with ReF-ER stabilizes DDPG and greatly improves its performance, especially for tasks with complex dynamics (e.g. Humanoid and Ant). We note that with ReF-ER we do not use temporally-correlated noise and that annealing $\eta$ worsened the instability of DDPG with regular ER and PER.

In Fig. 2, we report the average $D_{\mathrm{KL}}$ between the policy and the replayed behaviors as a measure of the RM’s “off-policyness”. With ReF-ER, despite its reliance on approximating the total number of far-policy samples in Eq. 12 from outdated importance weights, the $D_{\mathrm{KL}}$ smoothly decreases during training due to the annealing process. This validates that Rule 2 of ReF-ER achieves its intended goal with minimal computational overhead. With regular ER, even after lowering $\eta$ by one order of magnitude from the original paper (with separate values for the critic and for the policy), the policy mean $m^w$ may span the entire action space. In fact, in many tasks the average $D_{\mathrm{KL}}$ with ER is of similar order of magnitude as its maximum (DDPG by construction bounds $m^w$ to a hyperbox). For example, for the exploration noise $\sigma$ used here, the maximum $D_{\mathrm{KL}}$ is 850 for Humanoid and 300 for Walker, and it oscillates during training around 100 and 50 respectively. This indicates that $m^w$ swings between the extrema of the action space, likely due to the critic not learning local maxima for $Q^{w'}(s, a)$. Without policy constraints, DDPG often finds only “bang-bang” control schemes, which explains why bounding the action space is necessary to ensure the stability of DDPG.
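The quoted maxima can be checked with a short calculation. For two Gaussians with the same isotropic standard deviation σ, the KL divergence is ‖m₁ − m₂‖² / (2σ²), which is largest when the means sit at opposite corners of the bounding box. The sketch below assumes the box [-1, 1]^d and σ = 0.2; these values are assumptions, chosen because together with the Gym action dimensions (17 for Humanoid, 6 for Walker2d) they reproduce the maxima of 850 and 300 stated in the text.

```python
def max_kl_same_std(action_dim, sigma, low=-1.0, high=1.0):
    """Largest KL between N(m1, sigma^2 I) and N(m2, sigma^2 I) with means in [low, high]^d."""
    return action_dim * (high - low) ** 2 / (2.0 * sigma**2)


humanoid_max = max_kl_same_std(action_dim=17, sigma=0.2)  # approximately 850
walker_max = max_kl_same_std(action_dim=6, sigma=0.2)     # approximately 300
```

Under these assumptions, an average divergence of 100 for Humanoid corresponds to the mean action sitting, on average, about a third of the box diagonal away from the replayed behavior, since the divergence grows with the squared distance.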

Figure 3: Cumulative rewards on OpenAI MuJoCo tasks for NAF (black line), NAF with rank-based PER (purple line), NAF with ReF-ER (blue), with ReF-ER-1 (red), and with ReF-ER-2 (green). Implementation details in App. A.

When comparing the components of ReF-ER, we note that relying on gradient clipping alone (ReF-ER-1) does not produce good results. ReF-ER-1 may cause many zero-valued gradients, especially in high-dimensional tasks where even small changes to $m^w$ may push $\rho_t$ outside of the near-policy region. However, it is on these tasks that combining the two rules brings a measurable improvement in performance over ReF-ER-2. Training from only near-policy samples provides the critic with multiple examples of trajectories that are possible with the current policy. This focuses the representation capacity of the critic, enabling it to extrapolate the effect of a marginal change of action on the expected returns, and therefore increasing the accuracy of the DPG. Any misstep of the DPG is weighted with a penalization term that attracts the policy towards past behaviors. This allows time for the learner to gather experiences with the new policy, improve the value-network, and correct the misstep. This reasoning is almost diametrically opposed to that behind PER, which generally obtains worse outcomes than regular ER. In PER, observations associated with larger TD errors are sampled more frequently. In the continuous-action setting, however, TD errors may be caused by actions that are farther from $m^w$. Therefore, precisely estimating their value might not help the critic in yielding an accurate estimate of the DPG. The Swimmer and HumanoidStandup tasks highlight that ER is faster than ReF-ER in finding bang-bang policies. The bounds imposed by DDPG on $m^w$ allow learning these behaviors without numerical instability and without finding local maxima of $Q^{w'}(s, a)$. The methods we consider next learn unbounded policies. These methods do not require prior knowledge of optimal action bounds, but may not enjoy the same stability guarantees.

5.2 Results for NAF

Figure 3 shows how NAF is affected by the choice of ER algorithm. While Q-learning based methods are thought to be less sensitive than PG-based methods to the dissimilarity between policy and stored behaviors owing to the bootstrapped Q-learning target, NAF benefits from both rules of ReF-ER. As for DDPG, Rule 2 provides NAF with more near-policy samples to compute the off-policy estimators. Moreover, the performance of NAF is more distinctly improved by combining Rules 1 and 2 of ReF-ER than by using ReF-ER-2 alone. This is because $Q^{\pi}$ is likely to be approximated well by the quadratic $Q^{w}_{\mathrm{NAF}}$ only in a small neighborhood near its local maxima. When $Q^{w}_{\mathrm{NAF}}$ learns a poor fit of $Q^{\pi}$ (e.g. when the return landscape is multi-modal), NAF may fail to choose good actions. Rule 1 clips the gradients from actions outside of this neighborhood and prevents large TD errors from disrupting the locally-accurate approximation $Q^{w}_{\mathrm{NAF}}$. This intuition is supported by observing that rank-based PER (the better performing variant of PER also in this case) often worsens the performance of NAF. PER aims at biasing sampling in favor of larger TD errors, which are more likely to be caused by actions farther from $m^w(s)$, and their accurate prediction might not help the learner in fine-tuning the policy by improving a local approximation of the advantage. Lastly, $Q^{w}_{\mathrm{NAF}}$ is unbounded; therefore, training from actions that are farther from $m^w(s)$ increases the variance of the gradient estimates.

Figure 4: Average cumulative rewards on MuJoCo OpenAI Gym tasks obtained by PPO (black line), ACER (purple; dashed and full lines for the two learning rates $\eta$ considered) and V-RACER with ReF-ER (blue), ReF-ER-1 (red), ReF-ER-2 (green).
Figure 5: Kullback-Leibler divergence between $\pi^w$ and the replayed behaviors obtained by the PG-based methods. Same colors as above.

5.3 Results for V-RACER

Here we compare V-RACER to ACER and to PPO, an algorithm that, owing to its simplicity and good performance on MuJoCo tasks, is often used as baseline. For clarity, we omit from Fig. 4 results from coupling V-RACER with ER or PER, which generally yield similar or worse results than ReF-ER-1. Without Rule 1 of ReF-ER, V-RACER has no means to deal with unbounded importance weights, which cause off-PG estimates to diverge and disrupt prior learning progress. In fact, ReF-ER-2 is also affected by unbounded $\rho$ because even small policy differences can cause $\rho$ to overflow if computed for actions at the tails of the policy. For this reason, the results of ReF-ER-2 are obtained by clipping all importance weights $\rho$.
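The overflow mentioned above can be avoided by forming the importance weight in log space and applying the cap before exponentiating. This is a minimal sketch; the function name and the cap `max_weight` are hypothetical, and the value of the cap used for the reported results is not assumed here.

```python
import math


def clipped_importance_weight(log_pi, log_mu, max_weight):
    """Return min(exp(log_pi - log_mu), max_weight) without overflowing."""
    log_rho = log_pi - log_mu
    if log_rho >= math.log(max_weight):
        # Exponentiating here could overflow; the cap applies anyway.
        return float(max_weight)
    return math.exp(log_rho)
```

Calling `math.exp` directly on a log-ratio of several thousand raises `OverflowError`, which can happen for an action deep in the tail of a narrow behavior distribution; comparing against the logarithm of the cap first sidesteps the problem.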

Similarly to ReF-ER, ACER’s techniques (summarized in Sec. 4) guard against the numerical instability of the off-PG. ACER partly relies on constraining policy updates around a target-network with tuned learning and target-update rates. However, when using deep NN, small parameter updates do not guarantee small differences in the NN’s outputs. Therefore, tuning the learning rates does not ensure similarity between $\pi^w$ and RM behaviors. This can be observed in Fig. 5 from ACER’s superlinear relation between policy changes and the NN’s learning rate $\eta$. By lowering $\eta$ from the larger to the smaller of the two values considered, $D_{\mathrm{KL}}$ is reduced by multiple orders of magnitude (depending on the task). This corresponds to a large disparity in performance between the two choices of HP. For the larger $\eta$, as $D_{\mathrm{KL}}$ grows orders of magnitude more than with other algorithms, off-PG estimates become inaccurate, causing ACER to be often outperformed by PPO. These experiments, together with the analysis of DDPG, illustrate the difficulty of controlling off-policyness in deep RL by enforcing slow parameter changes.

ReF-ER aids off-policy PG methods in two ways. As discussed for DDPG and NAF, Rule 2 ensures a RM of valuable experiences for estimating on-policy quantities with a finite batch size. In fact, we observe from Fig. 4 that ReF-ER-2 alone often matches or surpasses the performance of ACER. Rule 1 prevents unbounded importance weights from increasing the variance of the PG and from increasing the amount of “trace-cutting” incurred by Retrace (Munos et al., 2016). Trace-cutting reduces the speed at which the return estimate converges to its on-policy value after each change to $\pi^w$, and consequently affects the accuracy of the loss functions. On the other hand, skipping far-policy samples without penalties or without extremely large batch sizes (OpenAI, 2018) causes ReF-ER-1 to have many zero-valued gradients (reducing the effective batch size) and unreliable outcomes.

Annealing eventually provides V-RACER with a RM of experiences that are almost as on-policy as those used by PPO. In fact, while considered on-policy, PPO alternates between gathering a small RM of experiences and performing a few optimization steps on the samples. Fig. 5 shows the average $D_{\mathrm{KL}}$ converging to similar values for both methods. However, a small RM may not contain enough diversity of samples for the learner to accurately estimate the gradients. The much larger RM of ReF-ER, and possibly the continually-updated value targets, allow V-RACER to obtain much higher returns. The Supplementary Material contains extended analysis of V-RACER’s most relevant HP. For many tasks presented here, V-RACER combined with ReF-ER outperforms the best result from DDPG (Sec. 5.1), NAF (Sec. 5.2), PPO and ACER and is competitive with the best published results, which to our knowledge were achieved by Trust Region Policy Optimization (Schulman et al., 2015), Policy Search with Natural Gradient (Rajeswaran et al., 2017), and Soft Actor-Critic (Haarnoja et al., 2018).

Figure 6: (left) Contours of the vorticity field (red and blue for anti- and clockwise rotation respectively) of the 2D flow control problem: the D-section cylinder is moving leftward, the agent is marked by A and by the highlighted control force and torque. (right) Returns obtained by V-RACER (red), ACER (purple), DDPG with ER (blue), DDPG with ReF-ER (green).

5.4 Results for a partially-observable flow control task

The problems considered so far have been modeled by ordinary differential equations (ODE), with the agent having access to the entire state of the system. We now apply the considered methods to systems described by non-linear Partial Differential Equations (PDE), here the Navier-Stokes Equations (NSE) that govern continuum fluid flows. Such PDEs are used to describe many problems of scientific (e.g. turbulence, fish swimming) and industrial interest (e.g. wind farms, combustion engines). These problems pose two challenges. First, accurate simulations of PDEs may entail significant computational costs and large scale computing resources which exceed by several orders of magnitude what is required by ODEs. Second, the NSE are usually solved on spatial grids with millions or even trillions of degrees of freedom. It would be excessive to provide all that information to the agent, and therefore the state is generally measured by a finite number of sensors. Consequently, the assumption of Markovian dynamics at the core of most RL methods is voided. This may be remedied by using recurrent NN (RNN) for function approximation. In turn, RNNs add to the challenges of RL the increased complexity of properly training them. Here we consider the small 2D flow control problem of agent $A$, an elliptical body interacting with an unsteady wake. The wake is created by a D-section cylinder moving at constant speed (one length per time unit) at a fixed Reynolds number. Agent $A$ performs one action per time unit by imposing a force and a torque on the flow. The state contains $A$’s position, orientation and velocity relative to the cylinder, together with the readings of 4 flow-speed sensors located at $A$’s 4 vertices. If $A$ exits the area denoted by a dashed line in Fig. 6, a terminal reward is assigned and the simulation restarts with random initial conditions. Otherwise, the maximum duration of the simulation is 400 actions. We attempt this problem with three differently-seeded runs of each method considered so far. Instead of maximizing the performance by HP tuning, we only substitute the MLPs used for function approximation with LSTM networks (2 layers of 32 cells with back-propagation window of 16 steps).

If correctly navigated, drafting in the momentum released into the flow by the motion of the cylinder allows $A$ to maintain its position with minimal actuation cost. Fig. 6 shows that the optimal HP found for ACER (small $\eta$) in the ODE tasks, together with the lack of feature-sharing between policy and value-networks and with the variance of the off-PG, cause the method to make little progress during training. DDPG with ER incurs large actuation costs, while DDPG with ReF-ER is the fastest at learning to avoid the distance limits sketched in Fig. 6. In fact, the critic quickly learns that $A$ needs to accelerate leftward to avoid being left behind, and the policy adopts the behavior rapidly due to the lower variance of the DPG (Silver et al., 2014). Eventually, the best performance is reached by V-RACER with ReF-ER (an animation of a trained policy is provided in the Supplementary Material). V-RACER has the added benefit of having an unbounded action space and of feature-sharing: a single NN receives the combined feedback of $V^w$ and $\pi^w$ on how to shape its internal representation of the dynamics.

6 Conclusion

Many RL algorithms update a policy from experiences collected with off-policy behaviors . We present evidence that off-policy continuous-action deep RL methods benefit from actively maintaining similarity between policy and replay behaviors. We propose a novel ER algorithm (ReF-ER) which consists of: (1) Characterizing past behaviors either as “near-policy” or “far-policy” according to how far the importance weight deviates from one, and computing gradients only from near-policy experiences. (2) Regulating the pace at which is allowed to deviate from through penalty terms that reduce . This allows time for the learner to gather experiences with the new policy, improve the value estimators, and increase the accuracy of the next steps. We analyze the two components of ReF-ER and show their effects on continuous-action RL algorithms employing off-policy PG, deterministic PG (DDPG) and Q-learning (NAF). Moreover, we introduce V-RACER, a novel algorithm based on the off-policy PG which emphasizes simplicity and computational efficiency. The combination of ReF-ER and V-RACER reliably yields performance that is competitive with the state-of-the-art.


We thank Siddhartha Verma for helpful discussions and feedback on this manuscript. This work was supported by European Research Council Advanced Investigator Award 341117. Computational resources were provided by Swiss National Supercomputing Centre (CSCS) Project s658 and s929.


Appendix A Implementation and network architecture details

We implemented all presented learning algorithms within smarties, our open-source C++ RL framework, and optimized for high CPU-level efficiency through fine-grained multi-threading, strict control of cache-locality, and computation-communication overlap. On every step, we asynchronously obtain on-policy data by sampling the environment with , which advances the index of observed time steps, and we compute updates by sampling from the Replay Memory (RM), which advances the index of gradient steps. During training, the ratio of time steps to update steps is equal to a constant , usually set to 1. Upon completion of all tasks, we apply the gradient update and proceed to the next step. The pseudo-codes in App. C neglect parallelization details as they do not affect execution.

In order to evaluate all algorithms on equal footing, we use the same baseline network architecture for V-RACER, DDPG and NAF, consisting of an MLP with two hidden layers of 128 units each. For the sake of computational efficiency, we employed Softsign activation functions. The weights of the hidden layers are initialized according to , where and are respectively the layer’s fan-in and fan-out (Glorot & Bengio, 2010). The weights of the linear output layer are initialized from the distribution , such that the MLP has near-zero outputs at the beginning of training. When sampling the components of the action vectors, the policies are treated as truncated normal distributions with symmetric bounds at three standard deviations from the mean. Finally, we optimize the network weights with the Adam algorithm (Kingma & Ba, 2015).
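The initialization and sampling choices described above can be illustrated with a short standard-library Python sketch. It is not taken from smarties. The Glorot bound used here is the common uniform variant, sqrt(6 / (fan_in + fan_out)), which is an assumption because the exact expression is missing from this text, and the function names are hypothetical.

```python
import math
import random


def softsign(x):
    """Softsign activation: x / (1 + |x|), bounded in (-1, 1)."""
    return x / (1.0 + abs(x))


def glorot_uniform(fan_in, fan_out, rng):
    """Weight matrix drawn from U(-b, b) with b = sqrt(6 / (fan_in + fan_out)).

    Assumption: this is the usual uniform Glorot rule, not necessarily
    the exact expression used by the authors.
    """
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]


def sample_truncated_normal(mean, std, rng, n_std=3.0):
    """Rejection-sample a normal variate truncated at mean +/- n_std * std."""
    while True:
        a = rng.gauss(mean, std)
        if abs(a - mean) <= n_std * std:
            return a


rng = random.Random(0)
weights = glorot_uniform(fan_in=128, fan_out=128, rng=rng)
action = sample_truncated_normal(mean=0.0, std=0.5, rng=rng)
```

Rejection sampling is cheap here because only about 0.3% of normal samples fall outside three standard deviations.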

V-RACER We note that the values of the diagonal covariance matrix are shared among all states and initialized to . To ensure that is positive definite, the respective NN outputs are mapped onto by a Softplus rectifier. We set the discount factor , ReF-ER parameters , and , and the RM contains samples. We perform one gradient step per environment time step, with mini-batch size and learning rate .

DDPG We use the common MLP architecture for each network. The output of the policy-network is mapped onto the bounded interval with a hyperbolic tangent function. We set the learning rate for the policy-network to and that of the value-network to with L2 weight decay coefficient of . The RM is set to contain observations and we follow Henderson et al. (2017) for the remaining hyper-parameters: mini-batches of samples, , soft target-network update coefficient . We note that while DDPG is the only algorithm employing two networks, choosing half the batch size of V-RACER and NAF makes the compute cost roughly equal among the three methods. Finally, when using ReF-ER we add exploratory Gaussian noise to the deterministic policy: with . When performing regular ER or PER we sample the exploratory noise from an Ornstein–Uhlenbeck process with and .
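Two of the DDPG ingredients mentioned above, the Ornstein–Uhlenbeck exploration noise and the soft target-network update, can be sketched as follows. This is a minimal illustration with hypothetical function names; the parameter values in the usage lines are placeholders, since the values used by the authors are not recoverable from this text.

```python
import math
import random


def ou_step(x, theta, sigma, rng, mu=0.0, dt=1.0):
    """One Euler step of an Ornstein-Uhlenbeck process.

    The noise is pulled back toward mu at rate theta and perturbed by
    Gaussian increments of scale sigma, giving temporally correlated noise.
    """
    return x + theta * (mu - x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)


def soft_update(target, trained, alpha):
    """Move each target weight a fraction alpha toward the trained weight."""
    return [(1.0 - alpha) * t + alpha * w for t, w in zip(target, trained)]


rng = random.Random(0)
noise = 0.0
for _ in range(5):
    # theta and sigma are placeholder values, not the paper's settings.
    noise = ou_step(noise, theta=0.15, sigma=0.2, rng=rng)

target_weights = soft_update([0.0, 0.0], [1.0, 2.0], alpha=0.1)
```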

NAF We use the same baseline MLP architecture and learning rate , batch-size , discount , RM size , and soft target-network update coefficient . Gaussian noise is added to the deterministic policy with .

PPO We tuned the hyper-parameters as in Henderson et al. (2017): , GAE , policy gradient clipping at , and we alternate performing 2048 environment steps and 10 optimizer epochs with batch-size 64 on the obtained data. Both the policy- and the value-network are 2-layer MLPs with 64 units per layer. We further improved results by having separate learning rates ( for the policy and for the critic) with the same annealing as used in the other experiments.

ACER We kept most hyper-parameters as described in the original paper (Wang et al., 2017): the TBC clipping parameter is , the trust-region update parameter is , and five samples of the advantage-network are used to compute estimates under . We use a RM of samples, each gradient is computed from 24 uniformly sampled episodes, and we perform one gradient step per environment step. Because here learning is not from pixels, each network (value, advantage, and policy) is an MLP with 2 layers and 128 units per layer. Accordingly, we reduced the soft target-network update coefficient () and the learning rates for the advantage-network (), value-network () and for the policy-network ().

Appendix B State, action and reward preprocessing

Several authors have employed state (Henderson et al., 2017) and reward (Duan et al., 2016; Gu et al., 2017) rescaling to improve the learning results. For example, the stability of DDPG is affected by the L2 weight decay of the value-network. Depending on the numerical values of the distribution of rewards provided by the environment and the choice of weight decay coefficient, the L2 penalization can be either negligible or dominate the Bellman error. Similarly, the distribution of values describing the state variables can increase the challenge of learning by gradient descent.

We partially address these issues by rescaling both rewards and state vectors depending on the experiences contained in the RM. At the beginning of training we prepare the RM by collecting observations and then we compute:


Throughout training, and are used to standardize all state vectors before feeding them to the NN approximators. Moreover, every 1000 steps, chosen as the smallest power of 10 that doesn’t affect the run time, we loop over the samples stored in the RM to compute:


This value is used to scale the rewards used by the Q-learning target and the Retrace algorithm. We use to ensure numerical stability.
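The preprocessing described in this appendix can be sketched in plain Python. The per-component mean and standard deviation for the states follow the text. The reward statistic is assumed here to be the root-mean-square of the stored rewards, and the stability constant `eps` and all function names are hypothetical, because the corresponding formulas are not recoverable from this text.

```python
import math


def state_statistics(states):
    """Per-component mean and standard deviation over the stored states."""
    n, dim = len(states), len(states[0])
    mean = [sum(s[i] for s in states) / n for i in range(dim)]
    std = [math.sqrt(sum((s[i] - mean[i]) ** 2 for s in states) / n)
           for i in range(dim)]
    return mean, std


def standardize(state, mean, std, eps=1e-7):
    """Shift and scale one state vector before feeding it to the network."""
    return [(x - m) / (sd + eps) for x, m, sd in zip(state, mean, std)]


def reward_scale(rewards, eps=1e-7):
    """Inverse RMS of the stored rewards (assumed form of the statistic)."""
    rms = math.sqrt(sum(r * r for r in rewards) / len(rewards))
    return 1.0 / (rms + eps)


mean, std = state_statistics([[0.0, 10.0], [2.0, 30.0]])
scaled_state = standardize([2.0, 30.0], mean, std)
scaled_reward = 3.0 * reward_scale([2.0, -2.0])
```

The state statistics are computed once, whereas `reward_scale` would be re-evaluated periodically as the contents of the RM change.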

The actions sampled by the learner may need to be rescaled or bounded to some interval depending on the environment. For the OpenAI Gym tasks this amounts to a linear scaling , where the values specified by the Gym library are for Humanoid tasks, for Pendulum tasks, and for all others.

Appendix C Pseudo-codes

  , ,
  Initialize an empty RM, network weights , and Adam’s (Kingma & Ba, 2015) moments.
  while  do
     Advance the environment according to algorithm 2.
  end while
  Compute the initial statistics used to standardize the state vectors (App. B).
  Compute the initial statistics used to rescale the rewards (App. B).
  while  do
     while  do
        Advance the environment according to algorithm 2.
        while   do
           Remove an episode from RM (first in first out).
        end while
     end while
     Sample time steps from the RM to compute a gradient estimate (e.g. for V-RACER with algorithm 3).
     Perform the gradient step with the Adam algorithm.
     If applicable, update the ReF-ER penalization coefficient .
     if  is 0  then
        Update the statistics used to rescale the rewards (App. B).
     end if
  end while
Algorithm 1 Serial description of the master algorithm.

Remarks on algorithm 1: 1) It describes the general structure of the ER-based off-policy RL algorithms implemented for this work (i.e. V-RACER, DDPG, and NAF). 2) This algorithm can be adapted to conventional ER, PER (by modifying the sampling algorithm to compute the gradient estimates), or ReF-ER (by following Sec. 3). 3) The algorithm requires 3 hyper-parameters: the ratio of time steps to gradient steps (usually set to 1 as in DDPG), the maximal size of the RM , and the minimal size of the RM before we begin gradient updates .
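The scheduling logic of algorithm 1 can be mimicked with a toy Python loop that only counts steps. This is a sketch under simplifying assumptions: episodes have a fixed hypothetical length, the gradient step is a placeholder, and all names (`master_loop`, `steps_per_update`, and so on) are illustrative rather than taken from smarties.

```python
from collections import deque


def master_loop(n_start, n_max, steps_per_update, max_grad_steps, episode_len=10):
    """Count environment and gradient steps for a toy version of algorithm 1."""
    replay = deque()   # lengths of completed episodes, oldest first
    stored = 0         # time steps currently held in the RM
    env_steps = 0
    grad_steps = 0
    in_progress = 0    # steps of the episode being collected

    def advance_environment():
        nonlocal stored, env_steps, in_progress
        env_steps += 1
        stored += 1
        in_progress += 1
        if in_progress == episode_len:
            replay.append(in_progress)
            in_progress = 0

    # Fill the RM before any gradient update.
    while stored < n_start:
        advance_environment()

    while grad_steps < max_grad_steps:
        # Keep the ratio of time steps to gradient steps fixed.
        while env_steps - n_start < steps_per_update * grad_steps:
            advance_environment()
            # Evict whole episodes, first in first out.
            while stored > n_max and replay:
                stored -= replay.popleft()
        grad_steps += 1  # placeholder for mini-batch sampling and the Adam step

    return env_steps, grad_steps, stored


env_steps, grad_steps, stored = master_loop(
    n_start=50, n_max=100, steps_per_update=1, max_grad_steps=200)
```

With a ratio of 1, the loop performs one new environment step before every gradient step after the first, and the RM never holds more than `n_max` time steps once an eviction pass has run.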

  Observe and .
  if  concludes an episode then
     Store data for into the RM:
     Compute and store for all steps of the episode
  else
     Sample the current policy
     Store data for into the RM:
     Advance the environment by performing
  end if
Algorithm 2 Environment sampling

  for mini-batch sample to  do
     Fetch all relevant information: , , , and .
     Call the approximator to compute and
     Update for all prior steps in ’s episode with and
     if  then
        Compute according to Sec. 2
     end if
     ReF-ER penalization:
  end for
  Accumulate the gradient estimate over the mini-batch
Algorithm 3 V-RACER’s gradient update

  for mini-batch sample to  do
     Fetch all relevant information: , , and .
     The policy-network computes and the value-network computes .
     Define a stochastic policy with Gaussian exploration noise:
     if  then
        Compute the policy at with the target-network:
        Compute the Q-learning target:
        The gradient of the value-network minimizes the squared distance from .
        The gradient of the policy-network is the deterministic PG (Eq. 3).
     end if
     ReF-ER penalization:
  end for
  Accumulate the gradient estimates over the mini-batch for both networks.
  Update the target policy- () and target value-networks ().
Algorithm 4 DDPG’s gradient update with ReF-ER

Remarks on algorithm 2: 1) The reward for an episode’s initial state, before having performed any action, is zero by definition. 2) The value for the last state of an episode is computed if the episode has been truncated due to the task’s time limits or is set to zero if is a terminal state. 3) Each time step we use the learner’s updated policy-network and we store .

Remarks on algorithm 3: 1) In order to compute the gradients we rely on value estimates that were computed when subsequent time steps in ’s episode were previously drawn by ER. Not having to compute the quantities , and for all following steps comes with clear computational efficiency benefits, at the risk of employing an incorrect estimate for . In practice, we find that the Retrace values incur only minor changes between updates (even when large RM sizes decrease the frequency of updates to the Retrace estimator) and that relying on previous estimates has no evident effect on performance. This could be attributed to the gradual policy changes enforced by ReF-ER. 2) With a slight abuse of notation, with (or ) we denote the statistics (mean, covariance) of the multivariate normal policy, with we denote the probability of performing action given state , and with we denote the probability density function over actions given state .
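The stored return estimates mentioned in these remarks can be illustrated with a backward pass over one episode. The sketch below assumes the standard Retrace recursion (Munos et al., 2016) with importance weights clipped at 1, and it substitutes the state value for the action value in the correction term, which is our reading of the approximation referred to in Sec. 2. Neither the exact formula nor the names used here come from this section.

```python
def retrace_targets(rewards, values, rhos, gamma):
    """Backward recursion for return-based targets over one episode.

    rewards[t] is the reward received after the action at step t,
    values has one extra entry for the final state (0.0 if terminal),
    rhos[t] is the importance weight of the action taken at step t.
    Assumed recursion:
        q[t] = r[t] + gamma * V[t+1] + gamma * min(1, rho[t+1]) * (q[t+1] - V[t+1])
    """
    n = len(rewards)
    q_ret = [0.0] * (n + 1)
    q_ret[n] = values[n]  # bootstrap value of the last state
    for t in reversed(range(n)):
        # No action is taken in the last state, so its trace is zero.
        trace = min(1.0, rhos[t + 1]) if t + 1 < n else 0.0
        correction = trace * (q_ret[t + 1] - values[t + 1])
        q_ret[t] = rewards[t] + gamma * (values[t + 1] + correction)
    return q_ret[:n]


on_policy = retrace_targets([1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.0], [1.0, 1.0, 1.0], 0.5)
cut_traces = retrace_targets([1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.0], [0.0, 0.0, 0.0], 0.5)
```

With all importance weights equal to 1 the targets reduce to discounted Monte Carlo returns, while zero weights cut the traces and leave one-step bootstrapped targets.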


Remarks on algorithm 4: 1) It assumes that weights and Adam are initialized for both policy-network and value-network. 2) The “target” weights are initialized as identical to the “trained” weights. 3) For the sake of brevity, we omit the algorithm for NAF, whose structure would be very similar to this one. The key difference is that NAF employs only one network and all the gradients are computed from the Q-learning target.

Appendix D Flow control simulation and parallelization

The Navier-Stokes equations are solved with our in-house 2D flow solver, parallelized with CUDA and OpenMP. We write the NSE with discrete explicit time integration as:


Here is the velocity field, is the pressure computed with the projection method by solving the Poisson equation (Chorin, 1967), and is the penalization force introduced by Brinkman penalization. The Brinkman penalization method (Angot et al., 1999) enforces the no-slip and no-flow-through boundary conditions at the surface of the solid bodies by extending the NSE inside the body and introducing a forcing term. Furthermore, we assumed incompressibility and no gravitational effects with . The simulations are performed on a grid of extent by , uniform spacing and Neumann boundary conditions. The time step is limited by the condition . Due to the computational cost of the simulations, we deploy 24 parallel agents, all sending states and receiving actions from a central learner. To preserve the ratio we consider to be the global number of simulated time-steps received by the learner.

Figure 7: 20th and 80th percentiles of the mean cumulative rewards over 5 training runs on a subset of OpenAI Gym tasks obtained with V-RACER. In each row we vary one HP: the ReF-ER parameters and , the RM size , mini-batch size , number of time steps per gradient step , and learning rate .
Figure 8: 20th and 80th percentiles of the mean cumulative rewards over 5 training runs on a subset of OpenAI Gym tasks obtained with V-RACER by varying the annealing schedule parameter , the number of units in each of the two MLP hidden layers, and by extending the architecture with a parameterized action advantage as described in Sec. E.

Appendix E Sensitivity to hyper-parameters

We report in Fig. 7 and Fig. 8 an extensive analysis of V-RACER’s robustness to the most relevant hyper-parameters (HP). The figures in the main text show the 20th and 80th percentiles of all cumulative rewards obtained over 5 training runs, binned in intervals of time steps. Here we show the 20th and 80th percentiles of the mean cumulative rewards obtained over 5 training runs. This metric yields tighter uncertainty bounds and makes it easier to distinguish the minor effects of HP changes.

The two HP that characterize the performance of ReF-ER are the RM size and the importance sampling clipping parameter , where A is the annealing parameter discussed in Sec. 3. Both and determine the pace of policy changes allowed by ReF-ER. Specifically, the penalty terms imposed by ReF-ER increase for low values of , because the trust region around replayed behaviors is tightened. On the other hand, high values of increase the variance of the gradients and may reduce the accuracy of the return-based value estimators by inducing trace-cutting (Munos et al., 2016). The penalty terms imposed by ReF-ER also increase for large RM sizes , because the RM is composed of episodes obtained with increasingly older behaviors which are all kept near-policy. Conversely, gradients computed from a small RM may be inaccurate because the environment’s dynamics are not sufficiently covered by the training data. These arguments are supported by the results in the first two rows of Fig. 7. Moreover, we observe that “stable” tasks, where the agent’s success is less predicated on avoiding mistakes that would cause it to trip (e.g. HalfCheetah), are more tolerant to high values of .

The tolerance for far-policy samples in the RM has a similar effect as : low values of tend to delay learning while high values reduce the fraction of the RM that is used to compute updates and may decrease the accuracy of gradient estimates. The training performance can benefit from minor improvements by increasing the mini-batch size , while the optimal learning rate is task-specific. From the first two rows of Fig. 8 we observe that both the annealing schedule parameter and the number of units per layer of the MLP architecture have minor effects on performance. The annealing parameter allows the learner to fine-tune the policy parameters with updates computed from almost on-policy data at the later stages of training. We also note that wider networks may exhibit higher variance of outcomes.
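The interplay between the clipping parameter and the tolerance for far-policy samples can be made concrete with a small sketch. It assumes that a sample is near-policy when its importance weight lies between 1/c_max and c_max, and that the penalization coefficient is decayed when the far-policy fraction exceeds the tolerance and relaxed toward 1 otherwise. Both rules are our reading of Sec. 3, which is not reproduced here, and the names `c_max`, `tolerance_d` and `eta` are hypothetical.

```python
def is_far_policy(rho, c_max):
    """A sample is far-policy if its importance weight leaves [1/c_max, c_max]."""
    return rho > c_max or rho < 1.0 / c_max


def far_fraction(rhos, c_max):
    """Fraction of sampled importance weights classified as far-policy."""
    return sum(is_far_policy(r, c_max) for r in rhos) / len(rhos)


def update_beta(beta, far_frac, tolerance_d, eta):
    """Assumed update of the penalization coefficient beta in [0, 1].

    Too many far-policy samples: shrink beta, which gives more weight to
    the penalty. Otherwise: move beta back toward 1.
    """
    if far_frac > tolerance_d:
        return (1.0 - eta) * beta
    return (1.0 - eta) * beta + eta


frac = far_fraction([1.0, 5.0, 0.2, 0.9], c_max=4.0)
beta = update_beta(1.0, frac, tolerance_d=0.1, eta=0.1)
```

Under these assumptions, lowering `c_max` classifies more samples as far-policy and therefore pushes `beta` down more often, which matches the observation that low values tighten the trust region.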

More uncertain is the effect of the number of environment time steps per learner’s gradient step. Intuitively, increasing could either cause a rightward shift of the expected returns curve, because the learner computes fewer updates for the same budget of observations, or it could improve returns by providing more on-policy samples, which decreases ReF-ER’s penalty terms and may increase the accuracy of the estimators. In practice the effect of is task-dependent. Problems with more complex dynamics, or higher dimensionality (e.g. Humanoid), seem to benefit from refreshing the RM more frequently with newer experiences (higher ), while simpler tasks can be learned more rapidly by performing more gradient steps per time step.

We considered extending V-RACER’s architecture by adding a closed-form parameterization for the action advantage. Rather than having a separate MLP with inputs () to parameterize (as in ACER or DDPG), whose expected value under the policy would be expensive to compute, we employ closed-form equations for inspired by NAF (Gu et al., 2016). The network outputs the coefficients of a concave function which is chosen such that its maximum coincides with the mean of the policy , and such that it is possible to derive analytical expectations for . Here we consider two options for the parameterized advantage. First, the quadratic form employed by NAF (Gu et al., 2016):


From , the advantage is uniquely defined for any action as . Therefore, like the exact on-policy advantage , has by design expectation zero under the policy. The expectation can be computed as (Petersen et al., 2008):


Here Tr denotes the trace of a matrix. Second, we consider the asymmetric Gaussian parameterization:


Here and (both are element-wise operations). The expectation of under the policy can be easily derived from the properties of products of Gaussian densities for one component of the action vector:


Here denotes a determinant and we note that we exploited the symmetry of the Gaussian policy around the mean. Because , , and are all diagonal, we obtain:


We note that all these parameterizations are differentiable.
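For the quadratic parameterization, the zero-mean property of the advantage can be checked numerically. The sketch below assumes the NAF-style form f(a) = -1/2 (a - m)^T L L^T (a - m) with a lower triangular matrix L and a Gaussian policy with diagonal covariance, for which the expectation is -1/2 Tr(L L^T Sigma). The symbols and function names are illustrative, since the original equations are missing from this text.

```python
import random


def quadratic_form(action, mean, L):
    """f(a) = -1/2 * ||L^T (a - m)||^2, a concave function peaked at the mean."""
    d = len(mean)
    diff = [action[i] - mean[i] for i in range(d)]
    y = [sum(L[i][j] * diff[i] for i in range(d)) for j in range(d)]  # L^T diff
    return -0.5 * sum(v * v for v in y)


def expected_quadratic_form(L, variances):
    """E[f(a)] under a ~ N(m, diag(variances)): -1/2 * Tr(L L^T Sigma)."""
    d = len(variances)
    trace = 0.0
    for i in range(d):
        p_ii = sum(L[i][k] ** 2 for k in range(d))  # diagonal entry of L L^T
        trace += p_ii * variances[i]
    return -0.5 * trace


def advantage(action, mean, L, variances):
    """Advantage defined as f(a) minus its expectation under the policy."""
    return quadratic_form(action, mean, L) - expected_quadratic_form(L, variances)


L = [[1.0, 0.0], [0.5, 1.0]]
mean, variances = [0.3, -0.1], [0.04, 0.09]
rng = random.Random(0)
samples = [[rng.gauss(m, v ** 0.5) for m, v in zip(mean, variances)]
           for _ in range(20000)]
mc_mean = sum(advantage(a, mean, L, variances) for a in samples) / len(samples)
```

The Monte Carlo average `mc_mean` should be close to zero, while the advantage evaluated at the policy mean is positive, as expected for a concave function peaked at the mean.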

The first parameterization requires additional network outputs, corresponding to the entries of the lower triangular matrix . The second parameterization requires one MLP output for and outputs for each diagonal matrix and . For example, for the second parameterization, given a state , a single MLP computes in total , , , , and . The quadratic complexity of affects the computational cost of learning tasks with high-dimensional action spaces (e.g. it requires 153 parameters for the 17-dimensional Humanoid tasks of OpenAI Gym, against the 35 of ). Finally, in order to preserve bijection between and , the diagonal terms are mapped to with a Softplus rectifier. Similarly, to ensure concavity of , the network outputs corresponding to , and are mapped onto by a Softplus rectifier.
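The output counts quoted above follow from simple arithmetic: a lower triangular matrix of size d has d(d+1)/2 entries, and the second parameterization needs one scalar plus two diagonal matrices of size d. The short sketch below reproduces the Humanoid figures and includes a Softplus mapping. The function names are illustrative.

```python
import math


def softplus(x):
    """Softplus rectifier log(1 + exp(x)), mapping any real number to (0, inf)."""
    if x > 30.0:  # exp(x) dominates, so log(1 + exp(x)) is x to machine precision
        return x
    return math.log1p(math.exp(x))


def n_outputs_quadratic(action_dim):
    """Entries of a lower triangular matrix of size action_dim."""
    return action_dim * (action_dim + 1) // 2


def n_outputs_asymmetric_gaussian(action_dim):
    """One scalar coefficient plus two diagonal matrices."""
    return 1 + 2 * action_dim


humanoid_quadratic = n_outputs_quadratic(17)            # 153
humanoid_gaussian = n_outputs_asymmetric_gaussian(17)   # 35
```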

The parameterization coefficients are updated to minimize the L2 error from :


Here, reduces the weight of estimation errors for unlikely actions, where is expected to be less accurate.

Besides increasing the number of network outputs, the introduction of a parameterized affects how the value estimators are computed (i.e. we do not approximate when updating Retrace as discussed in Sec. 2). This change may decrease the variance of the value estimators, but its observed benefits are negligible when compared to other HP changes. The minor performance improvements allowed by the introduction of a closed-form parameterization are outweighed in most cases by the increased simplicity of the original V-RACER architecture.