Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

03/11/2019 ∙ by Denis Steckelmacher, et al. ∙ Vrije Universiteit Brussel

Value-based reinforcement-learning algorithms are currently state-of-the-art in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are currently limited by their need for an on-policy critic, which severely constrains how the critic is learned. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free actor-critic reinforcement-learning algorithm for continuous states and discrete actions, with off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we show approximates Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable and, contrary to other state-of-the-art algorithms, unusually forgiving of poorly-configured hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, A3C and ACKTR, on a variety of tasks. Source code:




1 Introduction and Related Work

State-of-the-art stochastic actor-critic algorithms, used with discrete actions, all share a common trait: the critic they learn directly evaluates the actor [Konda1999, Schulman2017, Wu2017, Mnih2016]. Some algorithms allow the agent to execute a policy different from the actor, which their authors refer to as off-policy, but the critic is still on-policy with regard to the actor [Haarnoja2018, for instance]. ACER and the off-policy actor-critic [Wang2016b, Degris2012] use off-policy corrections to learn from past experiences, DDPG learns its critic with an on-policy SARSA-like algorithm [Lillicrap2015], Q-prop [Gu2017] uses the actor in the critic learning rule to make it on-policy, and PGQL [ODonoghue2017] allows for an off-policy V function, but requires it to be combined with on-policy advantage values. Notable examples of algorithms without an on-policy critic are AlphaGo Zero [Silver2017], which replaces the critic with a slow-moving target policy learned with tree search, and the Actor-Mimic [Parisotto2016], which minimizes the cross-entropy between an actor and the Softmax policies of critics (see Section 4.2). Because most actor-critic algorithms need an on-policy critic, they are incompatible with state-of-the-art value-based algorithms of the Q-Learning family [Medina2018, Hessel2017], which are all highly sample-efficient but off-policy. In a discrete-action setting, where off-policy value-based methods can be used, this raises two questions:

  1. Can we use off-policy value-based algorithms in an actor-critic setting?

  2. Would the actor bring anything positive to the agent?

In this paper, we provide a positive answer to these two questions. We introduce Bootstrapped Dual Policy Iteration (BDPI), a novel actor-critic algorithm. Our actor learning rule, inspired by Conservative Policy Iteration (see Sections 2.4 and 3.2), is robust to off-policy critics. Because we lift the requirement for on-policy critics, the full range of value-based methods can now be leveraged by the critic, such as DQN-family algorithms [Hessel2017], or exploration-focused approaches [Medina2018, Burda2018]. To better isolate the sample-efficiency and exploration properties arising from our actor-critic approach, we use in this paper a simple DQN-family critic. We learn several Q-Functions, as suggested by [Osband2016], with a novel extension of Q-Learning (see Section 3.1). Unlike other approaches, which use the critics to compute means and variances [Nikolov2019, Chen2017], BDPI uses the information in each individual critic to train the actor. We show that our actor learning rule, combined with several off-policy critics, can be compared to bootstrapped Thompson sampling (Section 3.4).

Our experimental results in Section 4 show that BDPI significantly outperforms state-of-the-art actor-critic and critic-only algorithms, such as PPO, ACKTR and Bootstrapped DQN, on a set of discrete, continuous and 3D-rendered tasks. Our ablative study shows that BDPI’s actor significantly contributes to its performance and exploration. To the best of our knowledge, this is the first time that, in a discrete-action setting, the benefit of having an actor can be clearly identified. Finally, and perhaps most importantly, BDPI is highly robust to its hyper-parameters, which mitigates the need for endless tuning (see Section 4.5). BDPI’s ease of configuration and sample-efficiency are crucial in many real-world settings, where computing power is not the bottleneck, but data collection is.

2 Background

In this section, we introduce and review the various formalisms on which Bootstrapped Dual Policy Iteration builds. We also compare current actor-critic methods with Conservative and Dual Policy Iteration, in Sections 2.3 and 2.4.

2.1 Markov Decision Processes

A discrete-time Markov Decision Process (MDP) [Bellman1957] with discrete actions is defined by the tuple $\langle S, A, R, T, \gamma \rangle$: a possibly-infinite set $S$ of states; a finite set $A$ of actions; a reward function $R(s_t, a_t, s_{t+1})$ returning a scalar reward $r_{t+1}$ for each state transition; a transition function $T(s_{t+1} | s_t, a_t)$ determining the dynamics of the environment; and the discount factor $0 \le \gamma < 1$ defining the importance given by the agent to future rewards.

A stochastic stationary policy $\pi(a_t | s_t) \in [0, 1]$ maps each state to a probability distribution over actions. At each time-step, the agent observes $s_t$, selects $a_t \sim \pi(\cdot | s_t)$, then observes $r_{t+1}$ and $s_{t+1}$, which produces an $(s_t, a_t, r_{t+1}, s_{t+1})$ experience tuple. An optimal policy $\pi^*$ maximizes the expected cumulative discounted reward $\mathbb{E}_{\pi^*}[\sum_t \gamma^t r_t]$. The goal of the agent is to find $\pi^*$ based on its experience within the environment, with no a-priori knowledge of $R$ and $T$.

2.2 Q-Learning, Experience Replay and Clipped DQN

Value-based reinforcement learning algorithms, such as Q-Learning [watkins1992q], use $(s_t, a_t, r_{t+1}, s_{t+1})$ experience tuples and Equation 1 to learn an action-value function $Q$, also called a critic, which estimates the expected return for each action in each state when the optimal policy is followed:

$$Q_{k+1}(s_t, a_t) = Q_k(s_t, a_t) + \alpha\,\delta_{k+1}, \qquad \delta_{k+1} = r_{t+1} + \gamma \max_{a'} Q_k(s_{t+1}, a') - Q_k(s_t, a_t) \qquad (1)$$

with $0 < \alpha < 1$ a learning rate. At acting time, the agent selects actions having the largest Q-Value, plus some exploration. To improve sample-efficiency, experience tuples are stored in an experience buffer, and are periodically re-sampled for further training using Equation 1 [Lin1992]. Before convergence, Q-Learning tends to over-estimate the Q-Values [Hasselt2010], as positive errors are propagated by the $\max$ operator of Equation 1. Clipped DQN [Fujimoto18], which we use as the basis of our critic learning rule (Section 3.1), addresses this bias by applying the $\max$ operator to the minimum of the predictions of two independent Q-functions, such that positive errors are removed by the minimum operation. Addressing this over-estimation has been shown to increase sample-efficiency and robustness [Hasselt2010].
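As a minimal illustration of the tabular update and of experience replay, the sketch below applies the standard Q-Learning rule to stored tuples. The `done` flag, the function names and the default values of `alpha` and `gamma` are assumptions made for this example, not details taken from the paper.

```python
import random
from collections import defaultdict


def q_learning_update(Q, experience, actions, alpha=0.2, gamma=0.99):
    """Apply one Q-Learning update to the tabular critic Q, in place."""
    s, a, r, s_next, done = experience
    # Bootstrap from the best next action, unless the episode ended (assumed flag)
    target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def replay(Q, buffer, actions, batch_size=32, rng=random):
    """Re-sample stored experiences and train on them again."""
    for experience in rng.sample(buffer, min(batch_size, len(buffer))):
        q_learning_update(Q, experience, actions)


Q = defaultdict(float)                 # unseen (state, action) pairs start at 0
buffer = [(0, 1, 1.0, 1, True)]        # one hypothetical terminal transition
replay(Q, buffer, actions=[0, 1])
```

Because the update only needs stored tuples, the same experience can be replayed many times, which is where the sample-efficiency of off-policy critics comes from.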

2.3 Policy Gradient and Actor-Critic Algorithms

Instead of choosing actions according to Q-Values, Policy Gradient methods [Williams1992, Sutton2000] explicitly learn an actor $\pi_\theta(a_t | s_t) \in [0, 1]$, parametrized by a weights vector $\theta$, such as the weights of a neural network. The objective of the agent is to maximize the expected cumulative discounted reward $\mathbb{E}_{\pi_\theta}[\sum_t \gamma^t r_t]$, which translates to the minimization of Equation 2 [Sutton2000]:

$$\mathcal{L}(\pi_\theta) = -\sum_{t=0}^{T} R_t \log \pi_\theta(a_t | s_t) \qquad (2)$$

$$\mathcal{L}(\pi_\theta) = -\sum_{t=0}^{T} Q^{\pi_\theta}(s_t, a_t) \log \pi_\theta(a_t | s_t) \qquad (3)$$

with $a_t \sim \pi_\theta(\cdot | s_t)$ the action executed at time $t$, and $R_t$ the Monte-Carlo return from time $t$ onwards. At every training epoch, experiences are used to compute the gradient $\partial \mathcal{L} / \partial \theta$ of Equation 2, then the weights of the policy are adjusted by a small step in the opposite direction of the gradient. A second gradient update requires fresh experiences [Sutton2000], which makes Policy Gradient quite sample-inefficient. Three approaches have been proposed to increase the sample-efficiency of Policy Gradient: trust regions, which allow larger gradient steps to be taken [Schulman2015]; surrogate losses, which prevent divergence if several gradient steps are taken [Schulman2017]; and stochastic actor-critic methods [Barto1983, Konda1999], which replace the Monte-Carlo return $R_t$ with an estimation of its expectation, $Q^{\pi_\theta}(s_t, a_t)$, an on-policy critic, shown in Equation 3. Deterministic actor-critic methods are slightly different and outside the scope of this paper.

The use of Q-Values instead of Monte-Carlo returns leads to a gradient of lower variance, and allows actor-critic methods to obtain impressive results on several challenging tasks [Wang2016b, Gruslys2017, Mnih2016]. However, conventional actor-critic algorithms may not provide any benefits over a cleverly-designed critic-only algorithm, see for example [ODonoghue2017], Section 3.3. Actor-critic algorithms also rely on $Q^{\pi_\theta}$ being accurate for the current actor, even if the actor itself can be distinct from the actual behavior policy of the agent [Degris2012, Wang2016b, Gu2017b]. Failing to ensure this accuracy may cause divergence [Konda1999, Sutton2000].

2.4 Conservative and Dual Policy Iteration

Approximate Policy Iteration (API) and Dual Policy Iteration (DPI) are two approaches to Policy Iteration. API repeatedly evaluates a policy $\pi_k$, producing an on-policy $Q^{\pi_k}$, then trains $\pi_{k+1}$ to be as close as possible to the greedy policy $\Gamma(Q^{\pi_k})$ [Kakade2002, Scherrer14]. Conservative Policy Iteration (CPI) extends API to slowly move $\pi_k$ towards the greedy policy [Pirotta2013]. Dual Policy Iteration [Sun2018] formalizes as CPI several modern reinforcement learning approaches [Anthony2017, Silver2017], by replacing the greedy function with a slow-moving target policy $\pi'$:

$$\pi_{k+1} = (1 - \alpha)\,\pi_k + \alpha\,\Gamma(Q^{\pi_k}) \ \text{(Conservative Policy Iteration)}, \qquad \pi_{k+1} = (1 - \alpha)\,\pi_k + \alpha\,\pi' \ \text{(Dual Policy Iteration)}$$

with $\alpha$ a learning rate, set to a small value in Conservative Policy Iteration algorithms (0.01 in our experiments). Among CPI algorithms, Safe Policy Iteration [Pirotta2013] dynamically adjusts the learning rate to ensure (with high probability) a monotonic improvement of the policy, while [Thomas2015] propose the use of statistical tests to decide whether to update the policy.
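The effect of a small learning rate can be seen on a single state. The sketch below assumes the usual mixture form of the conservative update, in which the new policy is a weighted average of the old policy and a target policy; the Q-Values and helper names are hypothetical.

```python
def greedy(q_values):
    """Return the deterministic policy that puts all mass on the best action."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]


def conservative_update(pi, target, alpha=0.01):
    """Move the action probabilities pi a fraction alpha toward target."""
    return [(1.0 - alpha) * p + alpha * q for p, q in zip(pi, target)]


pi = [0.25, 0.25, 0.25, 0.25]            # uniform policy in one state
target = greedy([0.1, 0.7, 0.3, 0.2])    # hypothetical Q-Values
for _ in range(100):
    pi = conservative_update(pi, target)
print([round(p, 3) for p in pi])
```

With a learning rate of 0.01, even 100 updates toward the same target leave noticeable probability on the other actions, so the policy keeps exploring while the target is still uncertain.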

While theoretically promising, CPI algorithms present two important limitations: their convergence is difficult to obtain with function approximation [Wagner2011, BohmerGO16]; and their update rule and associated set of bounds and proofs depend on $Q^\pi$, an on-policy function that would need to be re-computed before every iteration in an on-line setting. As such, CPI algorithms are notoriously difficult to implement, with [Pirotta2013] reporting some of the first empirical results on CPI. Our main contribution, presented in the next section, is inspired by CPI but distinct from it in several key aspects. Our actor learning rule follows the Dual Policy Iteration formalism, with a target policy $\pi'$ built from off-policy critics (see Section 3.2). The fact that the actor gathers the experiences on which the critics are trained can be compared to the guidance that $\pi$ gives to $\pi'$ in the DPI formalism [Sun2018].

3 Bootstrapped Dual Policy Iteration

Our main contribution, Bootstrapped Dual Policy Iteration (BDPI), consists of two original components. In Section 3.1, we introduce an aggressive off-policy critic, inspired by Bootstrapped DQN and Clipped DQN [Osband2016, Fujimoto18]. In Sections 3.2 to 3.3, we introduce an actor that leads to high-quality exploration, further enhancing sample-efficiency. We detail BDPI’s exploration properties in Section 3.4, before empirically validating our results in a diverse set of environments (Section 4). The complete pseudocode of the algorithm is available in Appendix 0.A, and our implementation of BDPI is available on

3.1 Aggressive Bootstrapped Clipped DQN

We begin our description of BDPI with the algorithm used to train its critics, Aggressive Bootstrapped Clipped DQN (ABCDQN). Like Bootstrapped DQN [Osband2016], ABCDQN consists of $N_c$ critics. Combining ABCDQN with an actor is detailed in Section 3.2. When used without an actor, ABCDQN selects actions by randomly sampling a critic for each episode, then following its greedy function.

Each critic of ABCDQN is trained with an aggressive algorithm loosely inspired by Clipped DQN and Double Q-Learning [Fujimoto18, Hasselt2010]. Each critic maintains two Q-functions, $Q^A$ and $Q^B$. Every training iteration, $Q^A$ and $Q^B$ are swapped, then $Q^A$ is trained with Equation 4 on a set of experiences sampled from an experience buffer, shared by all the critics. Contrary to Clipped DQN, an on-policy algorithm that uses $\min_{l=A,B} Q^l(s_{t+1}, \pi(s_{t+1}))$ as target value, ABCDQN removes the reference to $\pi$ and instead uses the following formulas:
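A form of Equation 4 consistent with this description (the minimum over the two Q-functions of Clipped DQN, with the actor's action replaced by the greedy action of $Q^A$, as in Double Q-Learning) can be sketched as follows. The symbols $Q^{A,i}$ and $Q^{B,i}$ (the two Q-functions of critic $i$) and $V$ are assumptions of this sketch, and the exact notation of the authors may differ.

```latex
\begin{align}
Q^{A,i}_{k+1}(s_t, a_t) &= Q^{A,i}_k(s_t, a_t)
  + \alpha \left( r_{t+1} + \gamma V(s_{t+1}) - Q^{A,i}_k(s_t, a_t) \right) \tag{4} \\
V(s_{t+1}) &\equiv \min_{l \in \{A, B\}} Q^{l,i}_k\!\left( s_{t+1},
  \operatorname*{argmax}_{a'} Q^{A,i}_k(s_{t+1}, a') \right) \notag
\end{align}
```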


We increase the aggressiveness of ABCDQN by performing several training iterations per training epoch. Every training epoch, every critic is updated using a different batch of experiences, for $N_t$ training iterations. As mentioned above, a training iteration consists of applying Equation 4 on the critic, which produces $Q_{k+1}(s, a)$ values, either stored in a tabular critic, or used to optimize the parameters of a parametric critic $Q_\theta$. The parameters $\theta$ minimize $\sum_{(s, a)} (Q_\theta(s, a) - Q_{k+1}(s, a))^2$, using gradient descent for several gradient steps.
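One training iteration of a single critic can be sketched in the tabular setting as below. This is an illustration under stated assumptions rather than the authors' implementation: the target takes the greedy next action of the first Q-function and clips its value with the minimum over both Q-functions, and the `done` flag, the dictionary layout and the default `alpha` and `gamma` are hypothetical.

```python
def abcdqn_iteration(critic, batch, actions, alpha=0.2, gamma=0.99):
    """One training iteration of a single critic, in the tabular setting.

    critic is a dict with two tables, "A" and "B", each mapping
    (state, action) to a Q-Value; missing entries count as 0.
    """
    # Swap the two Q-functions, then train the one now called A
    critic["A"], critic["B"] = critic["B"], critic["A"]
    qa, qb = critic["A"], critic["B"]
    new_values = {}
    for s, a, r, s_next, done in batch:
        # Greedy next action according to Q^A (no reference to the actor)
        a_star = max(actions, key=lambda b: qa.get((s_next, b), 0.0))
        # Clipping: keep the smaller of the two estimates for that action
        v_next = min(qa.get((s_next, a_star), 0.0), qb.get((s_next, a_star), 0.0))
        target = r if done else r + gamma * v_next
        old = qa.get((s, a), 0.0)
        new_values[(s, a)] = old + alpha * (target - old)
    # All targets are computed from the old table before it is modified
    qa.update(new_values)


critic = {"A": {}, "B": {}}
batch = [(0, 0, 1.0, 1, True)]          # hypothetical terminal transition
for _ in range(4):                      # several iterations per training epoch
    abcdqn_iteration(critic, batch, actions=[0, 1])
```

With a parametric critic, the values in `new_values` would instead serve as regression targets for the mean-squared-error loss.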

ABCDQN achieves high sample-efficiency (see Section 4), but its purposefully exaggerated aggressiveness makes it prone to overfitting. We now introduce an actor, that alleviates this problem and leads to high-quality exploration, comparable to Thompson sampling (see Section 3.4).

3.2 Training the Actor with Off-Policy Critics

To improve exploration, and further increase sample-efficiency, we now complement our ABCDQN critic with the second component of BDPI, its actor. The actor takes inspiration from Conservative Policy Iteration [Pirotta2013], but replaces on-policy estimates of $Q^\pi$ with our off-policy ABCDQN critics. Every training epoch, after every critic $i$ has been updated on its batch of experiences uniformly sampled from the experience buffer, the actor is sequentially trained towards the greedy policy of all the critics:

$$\pi_{k+1} = (1 - \lambda)\,\pi_k + \lambda\,\Gamma(Q^{A,i}) \qquad (5)$$

with $\lambda$ the actor learning rate, computed from the maximum allowed KL-divergence defining a trust-region (see Appendix 0.B), and $\Gamma$ the greedy function, that returns a policy greedy in $Q^{A,i}$, the $Q^A$ function of the $i$-th critic. Pseudocode for the complete BDPI algorithm is given in Appendix 0.A, and summarized in Algorithm 1.

for every critic $i$ do
     $B \leftarrow$ N experiences sampled from the buffer
     for $N_t$ training iterations do
         Swap $Q^{A,i}$ and $Q^{B,i}$
         Update $Q^{A,i}$ of critic $i$ on $B$ with Equation 4
     end for
     Update actor on $B$ with Equation 5
end for
Algorithm 1 Learning with Bootstrapped Dual Policy Iteration (summary)
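The actor step of Algorithm 1 can be illustrated in the tabular setting as follows. The sketch assumes that the actor moves a fixed fraction `lr` toward the greedy policy of each critic in turn; the paper derives the learning rate from a KL-divergence bound instead, and the value 0.05 and the two toy critics are hypothetical.

```python
def greedy_action(q_row):
    """Index of the action with the largest Q-Value."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def update_actor(pi, critic_q, states, lr=0.05):
    """Move pi[s] a fraction lr toward the greedy policy of one critic."""
    for s in states:
        best = greedy_action(critic_q[s])
        pi[s] = [(1.0 - lr) * p + lr * (1.0 if a == best else 0.0)
                 for a, p in enumerate(pi[s])]


# Two hypothetical critics that disagree in state 0 (two actions)
critics = [{0: [1.0, 0.0]}, {0: [0.0, 1.0]}]
pi = {0: [0.5, 0.5]}
for critic_q in critics:                 # sequential updates, one per critic
    update_actor(pi, critic_q, states=[0])
print(pi[0])
```

When critics disagree, their pulls largely cancel out and the actor stays stochastic in that state; when they agree, every update pushes in the same direction and the actor becomes greedy.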

Contrary to Conservative Policy Iteration algorithms, and because our critics are off-policy, the greedy function is applied on an estimate of $Q^*$, the optimal Q-function, instead of $Q^\pi$. The use of an actor, that slowly imitates approximations of $\Gamma(Q^*)$, leads to an interesting relation between BDPI and Thompson sampling (see Section 3.4). While expressed in the tabular form in Equations 4 and 5, the BDPI update rules produce Q-Values and probability distributions that can directly be used to train any kind of function approximator, on the mean-squared-error loss, and for as many gradient steps as desired. The Policy Distillation literature [Rusu2015] suggests that implementing the actor and critics with neural networks, with the actor having a smaller architecture than the critic, may lead to good results. Large critics reduce bias [Fu2019], and a small policy has been shown to outperform and generalize better than big policies [Rusu2015]. In this paper, we use actors and critics of the same size, and leave the evaluation of asymmetric architectures to future work.

3.3 BDPI and standard Conservative Policy Iteration

The standard Conservative Policy Iteration update rule (see Section 2.4) updates the actor $\pi$ towards $\Gamma(Q^\pi)$, the greedy function according to the Q-Values arising from $\pi$. This slow-moving update, and the inter-dependence between $\pi$ and $Q^\pi$, allows several properties to be proven [Kakade2002], and the optimal policy learning rate to be determined from $Q^\pi$ [Pirotta2013]. Because BDPI learns off-policy critics, that can be arbitrarily different from the on-policy $Q^\pi$ function, the Approximate Safe Policy Iteration framework [Pirotta2013] would infer an “optimal” learning rate of 0. Fortunately, a non-zero learning rate still allows BDPI to learn efficiently. In Section 3.4, we show that the off-policy nature of BDPI’s critics makes it approximate Thompson sampling, which CPI’s on-policy critics do not do. Our experimental results in Section 4 further illustrate how BDPI allows fast and robust learning, even in difficult-to-explore environments.

3.4 BDPI and Thompson Sampling

In a bandit setting, Thompson sampling [Thompson1933] is regarded as one of the best ways to balance exploration and exploitation [Agrawal2012, Chapelle2011]. Thompson sampling consists of maintaining a posterior belief of how likely any given action is optimal, and drawing actions directly from this probability distribution. In a reinforcement-learning setting, Thompson sampling consists of selecting an action according to:

$$a_t \sim P\left(a_t = \operatorname{argmax}_{a'} Q^*(s_t, a')\right) \qquad (6)$$

with $Q^*$ the optimal Q-function. BDPI learns off-policy critics, that produce estimates of $Q^*$. Sampling a critic and updating the actor towards its greedy policy is therefore equivalent to sampling a function $Q \sim P(Q = Q^*)$ [Osband2016], then updating the actor towards $\Gamma(Q)$, with $\Gamma(Q)(s_t, a_t) = \mathbb{1}[a_t = \operatorname{argmax}_{a'} Q(s_t, a')]$, and $\mathbb{1}$ the indicator function. Over several updates, and thanks to a small learning rate (see Equation 5), the actor learns the expected greedy function of the critics, which (intuitively) folds the indicator function into the sampling of $Q$, leading to an actor that learns $\pi(a_t | s_t) \simeq P(a_t = \operatorname{argmax}_{a'} Q^*(s_t, a'))$, the Thompson sampling equation for reinforcement learning.
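The intuition that a slowly-updated actor learns the expected greedy policy of the critics can be checked numerically on a single state. In the sketch below, the four critics, the learning rate, the number of updates and the seed are all hypothetical; each critic is reduced to its list of Q-Values for that state.

```python
import random


def vote_shares(critics_q):
    """Fraction of critics whose greedy action is a, for each action a."""
    n_actions = len(critics_q[0])
    counts = [0] * n_actions
    for q in critics_q:
        counts[max(range(n_actions), key=lambda a: q[a])] += 1
    return [c / len(critics_q) for c in counts]


def train_actor(critics_q, lr=0.01, updates=20000, seed=0):
    """Repeatedly sample a critic and step the actor toward its greedy policy."""
    rng = random.Random(seed)
    n_actions = len(critics_q[0])
    pi = [1.0 / n_actions] * n_actions
    for _ in range(updates):
        q = rng.choice(critics_q)
        best = max(range(n_actions), key=lambda a: q[a])
        pi = [(1 - lr) * p + lr * (a == best) for a, p in enumerate(pi)]
    return pi


critics_q = [[1.0, 0.2], [0.9, 0.1], [0.3, 0.8], [0.7, 0.6]]  # one state, 4 critics
print(vote_shares(critics_q))   # [0.75, 0.25]
print(train_actor(critics_q))   # fluctuates around the vote shares
```

The actor's probabilities hover around the share of critics that consider each action optimal, which is the bootstrap estimate of the probability that the action is optimal.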

The use of an explicit actor, instead of directly sampling critics and executing actions as Bootstrapped DQN does [Osband2016], positively impacts BDPI’s performance (see Section 4). [Nikolov2019] discuss why Bootstrapped DQN, without an actor, leads to a higher regret than their Information Directed Sampling, and propose to add a Distributional RL [Bellemare2017] component to their agent. [Osband2018] presents arguments against the use of Distributional RL, and instead combines Bootstrapped DQN with prior functions. In the next section, we show that BDPI largely outperforms Bootstrapped DQN, along with PPO and ACKTR, without relying on Distributional RL or prior functions. We believe that having an explicit actor changes the way the posterior is computed, which may positively influence exploration compared to actor-less approaches.

4 Experiments

To illustrate the properties of BDPI, we compare it to its ablations and a wide range of reinforcement learning algorithms, in four environments with completely different state-spaces and dynamics. Our results demonstrate the high sample-efficiency and exploration quality of BDPI. Moreover, these results are obtained with the same configuration of critics, experience replay and learning rates across environments, which illustrates the ease of configuration of BDPI. In Section 4.5, we carry out further experiments, that demonstrate that BDPI is more robust to its hyper-parameters than other algorithms. This is key to the application of reinforcement learning to real-world settings, where vast hyper-parameter tuning is often infeasible.

4.1 Algorithms

We evaluate the algorithms listed below. We also evaluated ACER and A3C [Wang2016b, Mnih2016], conventional actor-critic algorithms available in the OpenAI baselines, but their sample-efficiency was too low for inclusion in our plots.

BDPI (this paper)
ABCDQN, BDPI without an actor (this paper)
BDPI w/ AM, see Section 4.2 (this paper)
BDQN, Bootstrapped DQN [Osband2016]
PPO [Schulman2017]
ACKTR [Wu2017]

Except on Hallway, a 3D environment described in the next section, all algorithms use feed-forward neural networks to represent their actor and critic, with one hidden layer (two for PPO and ACKTR) of 32 neurons (256 on …). The state is one-hot encoded in FrozenLake, and directly fed to the network in the other environments. The neural networks are trained with the Adam optimizer [kingma2014adam], using a learning rate of 0.0001 (0.001 for PPO; ACKTR uses its own optimizer with a varying learning rate). Several extensively-tuned implementations of PPO and ACKTR have been evaluated, to ensure the fairest comparison (parameters are listed in Appendix 0.D; we used implementations from pytorch-a2c-ppo-acktr on GitHub). Unless specified otherwise, BDPI uses 16 critics, all updated every time-step on a different batch of 256 experiences, sampled from the same shared experience buffer, for 4 applications of our ABCDQN update rule. BDPI trains its neural networks for 20 epochs per training iteration, on the mean-squared-error loss (even for the policy).

Hallway being a 3D environment, the algorithms are configured differently. Changes to BDPI are minimal, as they only consist of using the standard DeepMind convolutional layers, a hidden layer of 256 neurons, and optimizing the networks for 1 epoch per training iteration, instead of 20. PPO and ACKTR, however, see much larger changes. They use the DeepMind layers, 16 replicas of the environment (instead of 1), a learning rate of 0.00005, and perform gradient steps every 80 time-steps (per replica, so 1280 time-steps in total). These PPO and ACKTR parameters are recommended by the author of Hallway.

4.2 BDPI with the Actor-Mimic loss

To the best of our knowledge, the Actor-Mimic [Parisotto2016] is the only actor-critic algorithm, along with BDPI, that learns critics that are off-policy with regards to the actor. We therefore compare BDPI to the Actor-Mimic in Section 4.4. These two algorithms perform extremely well, which demonstrates the potential of off-policy critics, with BDPI being more robust than the Actor-Mimic.

The Actor-Mimic is designed for transfer learning tasks. One critic per task is trained, using the off-policy DQN algorithm. Then, the cross-entropy between the actor and the Softmax policies of all the critics is minimized, using the (simplified) loss of Equation 7.


Applying the Actor-Mimic to a single-task setting is possible. We implemented an agent based on BDPI, that retains its ABCDQN critics, but replaces our actor learning rule of Equation 5 with the Actor-Mimic loss of Equation 7. Because we only change how the actor is trained, and still use our aggressive critics, we ensure the fairest comparison between our actor learning rule and the cross-entropy loss of the Actor-Mimic. In our experiments, the Actor-Mimic loss with Softmax policies fails to learn efficiently, even after extensive hyper-parameter tuning, probably because the Softmax prevents the policy from becoming deterministic in states where this is necessary. We therefore replaced the Softmax with the greedy function, which led to the much better results that we present in Section 4.4.
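The difference between the two imitation targets can be made concrete on one state. The sketch below compares the cross-entropy of a nearly deterministic actor against a Softmax target and against a greedy target; the Q-Values, the actor probabilities and the temperature of 1 are hypothetical, and the functions are a simplification for illustration rather than the loss used in the experiments.

```python
import math


def softmax(q_values, temperature=1.0):
    """Softmax policy over Q-Values (shifted by the max for stability)."""
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(q_values):
    """One-hot policy on the action with the largest Q-Value."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]


def cross_entropy(target, pi, eps=1e-12):
    """Cross-entropy between a target distribution and the actor pi."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pi))


q = [1.0, 1.2, 0.9]             # hypothetical Q-Values of one critic
pi = [0.05, 0.9, 0.05]          # a nearly deterministic actor
ce_soft = cross_entropy(softmax(q), pi)    # soft target spreads mass over all actions
ce_greedy = cross_entropy(greedy(q), pi)   # one-hot target on the best action
print(ce_soft, ce_greedy)
```

With Q-Values this close to each other, the Softmax target penalizes a near-deterministic actor heavily, which is in line with the explanation given above for why the Softmax variant struggles where a deterministic policy is needed.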

4.3 Environments

Figure 1: The four environments. a) Table, a large continuous-state environment with a black circular robot and a blue charging station. b) LunarLander, a continuous-state task based on the Box2D physics simulator. c) Frozen Lake, an 8-by-8 slippery gridworld where black squares represent fatal pits. d) Hallway, a 3D pixel-based navigation task.

Our evaluation of BDPI takes place in four environments that challenge the algorithms on different aspects of reinforcement learning: exploration with sparse rewards (Table), high-dimensional state-spaces (vector LunarLander, pixel-based Hallway), and high stochasticity (FrozenLake).

Table simulates a tiny robot on a large table that has to locate its charging station and dock (see Figure 1a). The table is a 1-by-1 square. The goal is located at , and the robot always starts at , facing away from the goal. A fixed initial position makes exploration more challenging, as the robot never spawns close to the goal. The robot observes its current position and orientation, with . Three actions allow the robot to either move forward 0.005 units, or turn left/right 0.1 radians. A reward of 0 is given every time-step. The episode finishes with a reward of -50 if the robot falls off the table, 0 after 200 time-steps, and 100 when the robot successfully docks, that is, its location is . The slow speed of the robot and reward sparsity make Table more difficult to explore than most Gym tasks [Gym].

LunarLander is a high-dimensional continuous-state physics-based simulation of a rocket landing on the moon (see Figure 1b). The agent observes the location and velocities of various components of the lander, and has access to four actions: doing nothing, firing the left/right side engines for one time-step, and firing the main engine. The reward signal for this task is quite complicated but informative, as it directly maps the distance between the rocket and the landing pad to a reward, on every time-step. The environment is considered solved when a cumulative reward of 200 or more is achieved per episode.

FrozenLake is a grid composed of slippery cells, holes, and one goal cell (see Figure 1c). The agent can move in four directions (up, down, left or right), with a probability of of actually performing an action other than intended. The agent starts at the top-left corner of the environment, and has to reach the goal at its bottom-right corner. The episode terminates when the agent reaches the goal, resulting in a reward of +1, or falls into a hole, resulting in no reward.

Hallway is a 3D pixel-based environment, that simulates a camera-based robotic task in the real world. Hallway consists of a rectangular room with a target red box, and the agent. The size of the room, location of the goal and initial position of the agent are randomly chosen for each episode. Four discrete actions allow the agent to move forward/backward and turn left/right. Movement is slow, and the amount of movement is stochastic for each time-step. The reward signal is sparse: 0 every time-step, and 1 when the goal is reached. The episode ends with a reward of 0 after 500 time-steps. This sparse reward function heavily stresses the ability of a reinforcement-learning algorithm to train deep convolutional neural networks on small amounts of reward data.

Figure 2: Results on Table, LunarLander, FrozenLake and Hallway. Top: BDPI (16 critics, updated for 4 iterations per time-step) outperforms all the other algorithms in every environment, sometimes significantly (Table and 3D pixel-based Hallway). Middle: Varying the number of critics and how often they are trained, as long as there are more than one critic, only has minimal impact on BDPI’s performance, which demonstrates its robustness. Bottom: Adding off-policy noise (see text) does not impact BDPI on any of the environments.

4.4 Results

Figure 2 shows the cumulative reward per episode obtained by various agents in our four environments. These results are averaged across 8 runs per agent, with the shaded regions representing the standard error. The plots compare BDPI to the algorithms detailed in Section 4.1, and display the effect of varying key hyper-parameters of BDPI.

Algorithms. BDPI is the most sample-efficient of all the algorithms, and also achieves the highest returns (especially on hard-to-explore Table and pixel-based Hallway). BDPI with the Actor-Mimic loss matches BDPI with our actor learning rule on Table, but fails to learn LunarLander and Hallway. ABCDQN (BDPI without its actor) fails on Table, an environment where exploration is key, and is generally inferior to BDPI. These results show that both having an explicit actor, and training it with our update rule of Section 3.2, are necessary to achieve top performance. Bootstrapped DQN is highly sample-efficient on FrozenLake, but does not explore well enough on the other environments. PPO and ACKTR, after extensive tuning and with several implementations tested, are not as sample-efficient as BDPI and Bootstrapped DQN, two off-policy algorithms using experience replay. Even with per-environment hyper-parameters, PPO and ACKTR need about 5K episodes to learn FrozenLake, and 1K episodes on Table. BDPI is the only algorithm that, with a single configuration for all the environments, automatically adjusts to the complexity of a task to achieve maximum sample-efficiency.

Interestingly, PPO and ACKTR do perform well on 3D Hallway. We tentatively point out that, due to the prevalence of pixel-based environments in the modern reinforcement-learning literature, current algorithms and hyper-parameters may focus more on the representation learning problem than on the reinforcement learning aspect of tasks. Also note that on Hallway, PPO and ACKTR use 16 replicas of the environment (instead of 1 for BDPI, and PPO/ACKTR on the other environments). This setting greatly stabilizes the algorithms, but cannot be applied to real-world physical robots.

Critics. Increasing the number of critics leads to smoother learning curves in every environment, at the cost of sample-efficiency in Table, where a higher variance in the bootstrap distribution of critics seems to help with exploration. Having only one critic seriously degrades BDPI’s performance, and having fewer than 16 critics is detrimental on LunarLander, where the environment dynamics are complex. This indicates that more critics are beneficial in complex environments, but may slightly reduce pure exploration.

Off-policy noise. BDPI’s actor learning equations do not refer to any behavior policy or on-policy return, and its critics are learned with a variant of Q-Learning. This hints at BDPI being an off-policy algorithm. We now empirically confirm this intuition. In this experiment, at each time-step of a training episode, the agent executes a random action with a probability of 0.2, instead of the action chosen by the actor (0.05 on Table, where docking requires precise moves). Testing episodes do not have this noise. The agent learns only from training episodes. Such off-policy noise does not negatively impact BDPI’s learning performance. Robustness to off-policy execution is an important property of BDPI for safety-critical tasks with backup policies.
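The noise used in this experiment can be sketched as a thin wrapper around the actor's choice. The function name is hypothetical, and drawing the random action uniformly over all actions (so that it may coincide with the actor's action) is an assumption of this sketch.

```python
import random


def noisy_action(actor_action, n_actions, noise=0.2, rng=random):
    """With probability `noise`, replace the actor's action by a uniformly random one."""
    if rng.random() < noise:
        return rng.randrange(n_actions)
    return actor_action


rng = random.Random(0)
executed = [noisy_action(2, n_actions=4, noise=0.2, rng=rng) for _ in range(10)]
print(executed)   # mostly 2, occasionally another action
```

During testing episodes, the same wrapper would be called with `noise=0.0`, so that only the actor's own actions are executed.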

The performance of BDPI, obtained with a single set of hyper-parameters for all the environments (only the number of hidden neurons changes between some environments, a trivial change), demonstrates BDPI’s sample-efficiency, high-quality exploration, and strong robustness to hyper-parameters, as rigorously detailed in the next section.

4.5 Robustness to Hyper-Parameters

Hyper-parameters often need to be tweaked depending on the environment. Therefore, it is highly desirable that an algorithm provides good performance even if not optimally configured, as BDPI does. To objectively measure an algorithm’s robustness to its hyper-parameters, we draw inspiration from sensitivity analysis. Thousands of runs of the algorithm are performed on randomly-sampled configurations of hyper-parameters, with each configuration evaluated on the total reward obtained over 800 episodes on LunarLander. Then, we compute the average absolute difference of total reward between random pairs of configurations, weighted by their distance in configuration space. This measures how much changing hyper-parameters affects performance. See Appendix 0.C for more details, and the list of hyper-parameters we consider for each algorithm.

We evaluated numerous algorithms available in the OpenAI baselines. The algorithms, sorted by ascending sensitivity, are DQN with Prioritized ER (930), BDPI (1167), vanilla DQN (1326), A2C (2369), PPO (2452), then ACKTR (5815). Our plot in Appendix 0.C shows that the apparent robustness of DQN-family algorithms comes from them performing equally badly for every configuration. 35% of BDPI’s configurations outperform the best configuration among all the other algorithms.

5 Conclusion and Future Work

In this paper, we propose Bootstrapped Dual Policy Iteration (BDPI), an algorithm where a bootstrap distribution of aggressively-trained off-policy critics provides an imitation target for an actor. Multiple critics, combined with our actor learning rule, lead to high-quality exploration, comparable to bootstrapped Thompson sampling. Off-policy critics can be learned with any state-of-the-art value-based algorithm, depending on the application domain. BDPI is easy to implement, and remarkably robust to its hyper-parameters. The hyper-parameters we used for the highly-stochastic FrozenLake gridworld allowed BDPI to largely outperform the state of the art on three other environments, one of which is pixel-based. This, and the availability of BDPI’s full source code, makes it one of the first plug-and-play reinforcement-learning algorithms that can easily be applied to new tasks.

While we focus on discrete actions in this paper, the high-quality exploration and robustness to sparse rewards of BDPI lead to encouraging results with discretized continuous action spaces. In Figure 3, we show that Binary Action Search, an approach that allows precise control of continuous actions at the cost of increased sparsity in the reward function [Pazis2009], allows BDPI to outperform the Soft Actor-Critic and TD3, two state-of-the-art continuous-actions algorithms. In future work, we will explore and evaluate various discretization approaches, pursuing the goal of applying BDPI to today’s complicated continuous-action tasks.

Figure 3: BDPI adjusted for continuous actions with Binary Action Search [Pazis2009] is more sample-efficient than TD3 [Fujimoto18, seems to quickly learn to spin] and the Soft Actor-Critic [Haarnoja2018] on the Inverted Pendulum task.


The first and second authors are funded by the Science Foundation of Flanders (FWO, Belgium), respectively as 1129319N Aspirant, and 1SA6619N Applied Researcher.



Appendix 0.A Bootstrapped Dual Policy Iteration Pseudocode

The following pseudocode provides a complete description of the BDPI algorithm. To keep our notations simple and general, the pseudocode is given for the tabular setting, and does not refer to any parameter for the actor and critics. An implementation of BDPI based on function approximation, such as the neural networks we use in our experiments, uses the equations below to produce batches of state-action or state-value pairs. The function approximator is then trained on these batches, minimizing the mean-squared-error loss, for several gradient steps.

A policy
critics. and are the two Clipped DQN networks of critic .
procedure BDPI
     for  do
         if  a multiple of  then
         end if
     end for
end procedure
procedure Act
     Execute , observe and
     Add to the experience buffer
end procedure
procedure Learn
     for every critic (in random order) do ▷ Bootstrapped DQN
         Sample a batch of experiences from the experience buffer
         for  iterations do ▷ Aggressive BDQN
              for all  do ▷ Clipped DQN
              end for
              Train towards with learning rate
              Swap and
         end for
          ▷ CPI with an off-policy critic
     end for
end procedure
Algorithm 2 Bootstrapped Dual Policy Iteration
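The control flow of the pseudocode above can be sketched in the tabular setting as follows. This is only an illustrative sketch: the toy chain environment, every constant, and all names are hypothetical, and the two update rules marked ASSUMED (the Clipped DQN target and the actor update towards the greedy policy of a critic) are our reading of the text, not a transcription of the equations of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem and hyper-parameters, chosen only for illustration.
N_STATES, N_ACTIONS = 5, 2
GAMMA, ALPHA, LAMBDA = 0.9, 0.2, 0.05   # discount, critic and actor learning rates
N_CRITICS, N_ITERATIONS, BATCH, K, T = 4, 2, 16, 4, 2000


def step(state, action):
    """Chain MDP: action 1 moves right, action 0 moves left, reward 1 at the right end."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = nxt == N_STATES - 1
    return nxt, float(done), done


# Actor: one distribution over actions per state.
pi = np.full((N_STATES, N_ACTIONS), 1.0 / N_ACTIONS)
# Each critic holds two Q-tables, [Q^A, Q^B], with small random initial values.
critics = [[rng.normal(0.0, 0.01, (N_STATES, N_ACTIONS)) for _ in range(2)]
           for _ in range(N_CRITICS)]
buffer = []


def learn():
    for i in rng.permutation(N_CRITICS):                  # every critic, in random order
        batch = [buffer[j] for j in rng.integers(len(buffer), size=BATCH)]
        for _ in range(N_ITERATIONS):
            qa, qb = critics[i]
            # ASSUMED Clipped DQN target:
            # r + gamma * min(Q^A, Q^B)(s', argmax_a Q^A(s', a)).
            targets = []
            for s, a, r, s2, done in batch:
                best = int(np.argmax(qa[s2]))
                bootstrap = 0.0 if done else min(qa[s2, best], qb[s2, best])
                targets.append(r + GAMMA * bootstrap)
            for (s, a, *_), target in zip(batch, targets):
                qa[s, a] += ALPHA * (target - qa[s, a])    # train Q^A towards the targets
            critics[i].reverse()                           # swap Q^A and Q^B
        # ASSUMED CPI step: mix the actor with the greedy policy of this critic.
        qa = critics[i][0]
        for s in {experience[0] for experience in batch}:
            greedy = np.zeros(N_ACTIONS)
            greedy[np.argmax(qa[s])] = 1.0
            pi[s] = (1.0 - LAMBDA) * pi[s] + LAMBDA * greedy


state = 0
for t in range(1, T + 1):
    action = int(rng.choice(N_ACTIONS, p=pi[state]))       # Act: sample from the actor
    nxt, reward, done = step(state, action)
    buffer.append((state, action, reward, nxt, done))
    state = 0 if done else nxt
    if t % K == 0:                                         # Learn every K time-steps
        learn()
```

With function approximation, the in-place table updates would be replaced by fitting a network on the batch of targets for several gradient steps, as described at the start of this appendix.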

Appendix 0.B The CPI Learning Rate Implements a Trust-Region

A trust-region, successfully used in a reinforcement-learning algorithm by [Schulman2015], is a constraint on the Kullback-Leibler divergence between a policy $\pi_k$ and an updated policy $\pi_{k+1}$. In BDPI, we want to find a policy learning rate $\lambda$ such that $D_{KL}(\pi_k \,\|\, \pi_{k+1}) \le \delta$, with $\delta$ the trust-region.

While a trust-region is expressed in terms of the KL-divergence, Conservative Policy Iteration algorithms, the family of algorithms to which BDPI belongs, naturally implement a bound on the total variation between $\pi_k$ and $\pi_{k+1}$:

$$\pi_{k+1} = (1 - \lambda)\,\pi_k + \lambda\,\pi' \qquad \text{(see Equation 5 in the paper)}$$

$$D_{TV}(\pi_k(s) \,\|\, \pi_{k+1}(s)) = \sum_a \left|\pi_k(a|s) - \pi_{k+1}(a|s)\right| \le 2\lambda \qquad (8)$$

The total variation is maximum when $\pi'$, the target policy, and $\pi_k$ both have an action selected with a probability of 1, and the action is not the same. In CPI algorithms, the target policy is a greedy policy, that selects one action with a probability of one. The condition can therefore be slightly simplified: the total variation is maximized if $\pi_k$ assigns a probability of 1 to an action that is not the greedy one. In this case, the total variation is $2\lambda$ (2 elements of the sum of (8) are equal to $\lambda$).
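The bound on the total variation can be checked numerically. The sketch below assumes that the actor update mixes the current policy with a one-hot greedy target using a learning rate called `lam` here, and that the total variation is the plain sum of absolute probability differences (no factor 1/2). The function names are our own.

```python
import numpy as np


def cpi_update(pi, greedy_action, lam):
    """Mix a policy with a greedy (one-hot) target policy."""
    target = np.zeros_like(pi)
    target[greedy_action] = 1.0
    return (1.0 - lam) * pi + lam * target


def total_variation(p, q):
    """Sum of absolute probability differences (no 1/2 factor)."""
    return float(np.abs(p - q).sum())


lam = 0.05
rng = np.random.default_rng(0)

# Random policies over 4 actions: the change never exceeds 2 * lam.
worst = 0.0
for _ in range(1000):
    pi = rng.dirichlet(np.ones(4))
    updated = cpi_update(pi, rng.integers(4), lam)
    worst = max(worst, total_variation(pi, updated))

# Worst case: the policy is deterministic on an action that is not the greedy one.
pi_det = np.array([0.0, 1.0, 0.0, 0.0])
tv_max = total_variation(pi_det, cpi_update(pi_det, 0, lam))
```

In the deterministic case, exactly two terms of the sum are non-zero and each equals `lam`, which is why the maximum is `2 * lam`.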

The Pinsker inequality [Pinsker1960] provides a lower bound on the KL-divergence based on the total variation. The inverse problem, upper-bounding the KL-divergence based on the total variation, is known as the Reverse Pinsker Inequality. It allows a trust-region to be implemented, as $D_{KL} \le f(D_{TV})$ and $D_{TV} \le 2\lambda$, with $f$ a function applied to the total variation so that the reverse Pinsker inequality holds. Upper-bounding the KL-divergence to some $\delta$ then amounts to upper-bounding $f(D_{TV}) \le \delta$, which translates to $\lambda \le \frac{1}{2} f^{-1}(\delta)$.

The main problem is finding $f$. The reverse Pinsker inequality is still an open problem, with increasingly tighter but complicated bounds being proposed [Sason2015]. A tight bound is important to allow a large learning rate, but the currently-proposed bounds are almost impossible to invert in a way that produces a tractable function. We therefore propose our own bound, designed specifically for a CPI algorithm, slightly less tight than state-of-the-art bounds, but trivial to invert.

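If we consider two actions, we can produce a policy $\pi_k = \{0, 1\}$ and a greedy target policy $\pi' = \{1, 0\}$. The updated policy $\pi_{k+1} = (1 - \lambda)\pi_k + \lambda\pi'$ is, for state $s$, $\pi_{k+1}(s) = \{\lambda, 1 - \lambda\}$. The KL-divergence between $\pi_k$ and $\pi_{k+1}$ is:

$$D_{KL}(\pi_k \,\|\, \pi_{k+1}) = 1 \log\frac{1}{1 - \lambda} + 0 \log\frac{0}{\lambda} = \log\frac{1}{1 - \lambda}$$

if we assume that $\lim_{x \to 0} x \log x = 0$. Based on the reverse Pinsker inequality, we assume that if the two policies used above are greedy in different actions, and therefore have a maximal total variation, then their KL-divergence is also maximal. We use this result to introduce a trust region:

$$D_{KL}(\pi_k \,\|\, \pi_{k+1}) \le \delta \qquad \text{(trust region)}$$

$$\log\frac{1}{1 - \lambda} \le \delta$$

$$\lambda \le 1 - e^{-\delta}$$

Interestingly, for small values of $\delta$, as they should be in a practical implementation of BDPI, $1 - e^{-\delta} \approx \delta$. The trust-region is therefore implemented by choosing $\lambda = \delta$, which is much simpler than the line-search method proposed by [Schulman2015].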
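The two-action worst case lends itself to a quick numerical check. The sketch below assumes a current policy (0, 1), a greedy target (1, 0), and an actor learning rate named `lam`, so that the updated policy is (lam, 1 - lam). The function names are our own.

```python
import math


def kl_worst_case(lam):
    """KL divergence between (0, 1) and the updated policy (lam, 1 - lam).

    The term with probability 0 contributes nothing (x log x -> 0 as x -> 0),
    which leaves log(1 / (1 - lam)).
    """
    return math.log(1.0 / (1.0 - lam))


def max_learning_rate(delta):
    """Largest lam whose worst-case KL divergence stays within delta."""
    return 1.0 - math.exp(-delta)


for delta in (0.01, 0.05, 0.2):
    lam = max_learning_rate(delta)
    print(f"delta={delta}: lam_max={lam:.4f}, "
          f"KL at lam_max={kl_worst_case(lam):.4f}, "
          f"KL at lam=delta={kl_worst_case(delta):.4f}")
```

Setting the learning rate equal to the trust region overshoots the bound slightly, since `-log(1 - x)` exceeds `x` by a second-order term. The gap shrinks quickly as the trust region gets smaller.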

0.b.1 State-Dependent Exploration

Compared to Bootstrapped DQN, well-known for its high-quality exploration, BDPI lacks an important component: explicit deep exploration. Deep exploration consists of performing a sequence of directed exploration steps, instead of exploring in a random direction at each time-step [Osband2016]. Bootstrapped DQN achieves deep exploration by greedily following a single critic, sampled at random, for an entire episode. BDPI trains its actor towards a randomly-selected critic at every time-step, which is incompatible with deep exploration. We empirically show in Section 4.4 that BDPI outperforms Bootstrapped DQN, so the loss of explicit deep exploration does not seem to negatively affect performance. In Figure 4, we provide a likely explanation in the Table environment. At the early stages of training, the agent regularly falls off the table, which resets the episode. This can be observed as dips in the entropy of the actor. We believe that this is caused by a sort of novelty-based exploration, probably more limited than what highly-advanced algorithms produce [Burda2018], but still present. After a few episodes, the individual runs learn different policies, which breaks the correlation between them and explains the flat portion of Figure 4. The emergence of such an interesting exploration strategy, leading to higher-quality exploration than Bootstrapped DQN, from the simple use of an actor with several off-policy critics, illustrates how amenable the architecture of BDPI is to relatively advanced features. We believe that further work will allow more features to naturally emerge, or be easily implemented, on top of the BDPI algorithm we present in this paper.

Figure 4: Entropy of the policy per time-step, on the Table environment (running average and standard deviation of 8 runs). The entropy oscillates as the agent falls off the table, which resets the environment to familiar states. After some time (blue bar), runs start learning distinct policies, whose entropies cannot be observed anymore on an averaged plot.

Appendix 0.C Robustness to Hyper-Parameters

Evaluating the robustness of an algorithm to its hyper-parameters is challenging, and typically not done in Deep RL research. We propose a simple approach, that we designed to be easy to understand and intuitive, and that provides two measures of robustness.

0.c.1 Data Collection

For each algorithm, namely BDPI, DQN, Prioritized and Dueling DQN, A2C, PPO and ACKTR, we define a configuration space that consists of all the combinations of the most relevant hyper-parameters of the algorithms. We then randomly sample configurations, run the algorithm on LunarLander for 800 episodes, and compute the total reward obtained during these 800 episodes. We used the OpenAI Baselines implementations of all the algorithms (but BDPI), to ensure that no implementation error on our side invalidates the results.

The hyper-parameters evaluated for each algorithm are listed below. We ensured that all the known-good configurations of all the algorithms, for various environments in the literature, are covered.

All algorithms

  • Neural network learning rate: 0.00001, 0.00005, 0.0001, 0.0005, 0.001

  • Neurons in the hidden layer of the neural network: 32, 64, 96, 128, 256

All but BDPI

  • Number of parallel environments: 1. BDPI is single-threaded, so, to avoid artificially increasing the sensitivity of the other algorithms, we chose to keep this highly-sensitive parameter to 1.

  • Entropy regularization: 0, 0.01, 0.03, 0.05

BDPI

  • Experience buffer size: 5K, 10K, 20K, 50K, 100K

  • Batch size: 64, 128, 256, 512

  • Critics trained per time-step: 1, 4, 8

  • Number of critics: 1, 4, 8, 16, 32

  • Clipped DQN iterations per critic-time-step: 1, 2, 4, 8

  • Epochs used to fit the neural networks: 1, 4, 8, 16. The absolute best performance of BDPI is achieved with 20-50+ epochs, but our computing resources did not allow us to increase this parameter as much. We ensure that the best-known configuration of the other algorithms is included in our configuration space.

PPO

  • Steps per batch: 64, 128, 256, 384, 512, 1024, 2048

  • Lambda: 0.7, 0.8, 0.9, 1.0

  • Optimization steps per epoch: 1, 2, 4, 8, 16

A2C

  • Time-steps between learning epochs: 1, 2, 4, 6, 8

  • Critic loss weight compared to the actor: 0.1, 0.3, 0.5, 0.7, 0.9

  • Gradient norm clipping: 0.1, 0.3, 0.5, 0.8, 1.0

ACKTR

  • Learning rate (specific to ACKTR, default of 0.25): 0.01, 0.10, 0.25, 0.50, 0.90

  • Time-steps between learning epochs: 1, 5, 10, 20, 40, 80

  • Critic loss weight: 0.1, 0.3, 0.5, 0.7, 0.9

  • Fisher weight: 0.1, 0.3, 0.5, 0.7, 0.9

  • Gradient norm clipping: 0.1, 0.3, 0.5, 0.8, 1.0

  • Kronecker-Factored clipping: 0.0001, 0.001, 0.005, 0.01, 0.1

DQN

  • Experience buffer size: 5K, 10K, 20K, 50K, 100K

  • Batch size: 16, 32, 128

  • Exploration fraction: 0.02, 0.05, 0.1, 1.0

  • Final epsilon after exploration: 0.1, 0.05, 0.01, 0.001

  • Time-steps between learning epochs: 1, 2, 4, 8, 16

  • Time-steps before learning starts: 1, 500, 1000, 10000

  • Target network update frequency: 1, 50, 100, 500, 1000

Dueling DQN with Prioritized Experience Replay

All the same parameters as DQN, and:

  • Alpha parameter for Prioritized ER: 0.5, 0.6, 0.7, 0.8, 0.9
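The random sampling of configurations described in this section can be sketched as follows. The value lists are copied from the group that lists the number of critics and the Clipped DQN iterations, which we take to be BDPI's, together with the two parameters shared by all algorithms. The dictionary keys and function names are our own, not identifiers from the BDPI source code.

```python
import math
import random

# Values copied from the lists above; the key names are hypothetical.
BDPI_SPACE = {
    "learning_rate": [0.00001, 0.00005, 0.0001, 0.0005, 0.001],
    "hidden_neurons": [32, 64, 96, 128, 256],
    "buffer_size": [5_000, 10_000, 20_000, 50_000, 100_000],
    "batch_size": [64, 128, 256, 512],
    "critics_per_step": [1, 4, 8],
    "num_critics": [1, 4, 8, 16, 32],
    "clipped_dqn_iterations": [1, 2, 4, 8],
    "epochs": [1, 4, 8, 16],
}


def sample_configuration(space, rng):
    """Pick one value per hyper-parameter, independently and uniformly."""
    return {name: rng.choice(values) for name, values in space.items()}


rng = random.Random(0)
configs = [sample_configuration(BDPI_SPACE, rng) for _ in range(5)]

# Size of the full grid, to compare with the number of sampled configurations.
grid_size = math.prod(len(values) for values in BDPI_SPACE.values())
```

The grid for this space alone contains 120,000 combinations, which is why configurations are sampled at random instead of being enumerated.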

0.c.2 Data Processing

Figure 5: Total reward per configuration, sorted by descending total reward. This plot shows that more than 35% of BDPI’s randomly-sampled configurations perform better than the best (PPO) configuration. The worst BDPI configuration is also better than most of the configurations of the other algorithms. On LunarLander, for 800 episodes, the random policy achieves a total reward of about -240K.

Hundreds of randomly-sampled configurations of the algorithms are evaluated, and we propose to use the total reward over 800 episodes on LunarLander as performance measure. Figure 5 graphically displays this dataset: for each algorithm, all the configurations are sorted by descending total reward, then the lines are stretched horizontally to compensate for the unequal number of configurations that each algorithm was evaluated on, due to each algorithm having different computational resource requirements.

The measures that we report in Section 4.5 are slightly more advanced. While Figure 5 intuitively shows that BDPI produces a higher curve, sorting the configurations by performance removes any information about the locality of the configurations. It shows that many configurations are good, not that they are close together in configuration space. In order to better measure how slight changes in parameters influence performance, we introduce a second measure:


with and two randomly-sampled configurations. In order to produce accurate scores, we evaluate each algorithm on more than 2000 configurations, and apply Equation 10 on 4000000 pairs of configurations. The resulting scores, also reported in Section 4.5, are DQN with Prioritized ER (930), BDPI (1167), vanilla DQN (1326), then, significantly larger, A2C (2369), PPO (2452) and ACKTR (5815).
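A pairwise sensitivity score of this kind can be sketched as below. The exact weighting is an assumption: the text only states that the absolute reward difference is weighted by the distance between the two configurations, and this sketch divides by the Euclidean distance after rescaling each hyper-parameter to [0, 1]. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def sensitivity(configs, rewards, n_pairs, rng):
    """Average |reward difference| over random pairs, divided by their distance.

    ASSUMPTION: "weighted by distance" is implemented as a division by the
    Euclidean distance between configurations rescaled to [0, 1] per dimension.
    """
    configs = np.asarray(configs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    low = configs.min(axis=0)
    span = configs.max(axis=0) - low
    span[span == 0] = 1.0                      # avoid dividing by zero
    scaled = (configs - low) / span

    a = rng.integers(len(configs), size=n_pairs)
    b = rng.integers(len(configs), size=n_pairs)
    dist = np.linalg.norm(scaled[a] - scaled[b], axis=1)
    keep = dist > 0                            # drop pairs of identical configurations
    diff = np.abs(rewards[a] - rewards[b])
    return float(np.mean(diff[keep] / dist[keep]))


rng = np.random.default_rng(0)
configs = rng.uniform(size=(200, 3))           # synthetic configurations
flat = np.full(200, -1000.0)                   # same reward everywhere
steep = 1000.0 * configs[:, 0]                 # reward depends strongly on one parameter

score_flat = sensitivity(configs, flat, 10_000, rng)
score_steep = sensitivity(configs, steep, 10_000, rng)
```

An algorithm that performs equally for every configuration gets a score of zero under this measure, whether that performance is good or bad, so a low score has to be read together with the reward curves.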

Appendix 0.D Experimental Setup

All the algorithms evaluated in Section 4 use feed-forward neural networks to represent their actor(s) and critic(s). They take as input the one-hot encoding of the current state, and are trained with the Adam optimizer [kingma2014adam], using a learning rate of 0.0001 (0.001 for PPO, as it gave better results). We configured each algorithm following the recommendations in their respective papers, and further tuned some parameters to the environments we use. These parameters are given in Table 1. They are kept as similar as possible across algorithms, and constant across the three sensors-based environments, to evaluate the generality of the algorithms. For Hallway, different sets of parameters have been used (especially for PPO and ACKTR), as explained in Section 4.1.

Algorithms (columns): ACKTR, PPO, BDQN, ABCDQN, BDPI

  • Discount factor: 0.99
  • Replay buffer size: 20K, 20K
  • Experiences/batch: 20, 256/1024, 256, 256
  • Training epoch every time-steps: 20, 256/1024, 1, 1
  • Policy loss: PG+Fisher, PPO, MSE
  • Trust region: 0.05
  • Entropy regularization: 0.01, 0.01, 0
  • Value loss coefficient: 0.5
  • Critic count: 1, 1, 16, 16
  • Critic sampling frequency: episode
  • Critic learning rate: 0.25, 1.0 (on ), 1.0, 0.2
  • Critic training iterations: 1, 1, 4
  • Gradient steps/batch: 1, 4, 20, 20
  • Learning rate: dynamic, 0.001, 0.0001, 0.0001
  • Activation function: tanh, tanh, tanh, tanh
  • Hidden layers: 2, 2, 1, 1
  • Hidden neurons: 32/256

Table 1: Hyper-parameters of the various algorithms we experimentally evaluate. (a) Hyper-parameters that were required for LunarLander to perform well.