
Self-Tuning Deep Reinforcement Learning

by   Tom Zahavy, et al.

Reinforcement learning (RL) algorithms often require expensive manual or automated hyperparameter searches in order to perform well on a new domain. This need is particularly acute in modern deep RL architectures, which often incorporate many modules and multiple loss functions. In this paper, we take a step towards addressing this issue by using metagradients (Xu et al., 2018) to tune these hyperparameters via differentiable cross-validation, whilst the agent interacts with and learns from the environment. We present the Self-Tuning Actor-Critic (STAC), which uses this process to tune the hyperparameters of the usual loss function of the IMPALA actor-critic agent (Espeholt et al., 2018), to learn the hyperparameters that define auxiliary loss functions, and to balance trade-offs in off-policy learning by introducing and adapting the hyperparameters of a novel leaky V-trace operator. The method is simple to use, sample efficient, and does not require a significant increase in compute. Ablative studies show that the overall performance of STAC improves as we adapt more hyperparameters. When applied to 57 games on the Atari 2600 environment over 200 million frames, our algorithm improves the median human normalized score of the baseline from 243% to 364%.




1 Introduction

Reinforcement Learning (RL) algorithms often have multiple hyperparameters that require careful tuning; this is especially true for modern deep RL architectures, which often incorporate many modules and many loss functions. Training a deep RL agent is thus typically performed in two nested optimization loops. In the inner training loop, we fix a set of hyperparameters and optimize the agent parameters with respect to these fixed hyperparameters. In the outer (manual or automated) tuning loop, we search for good hyperparameters, evaluating them either in terms of their inner-loop performance, or in terms of their performance on some special validation data. Inner-loop training is typically differentiable and can be efficiently performed using back-propagation, while the optimization of the outer loop is typically performed via gradient-free optimization. Since only the hyperparameters (but not the agent policy itself) are transferred between outer-loop iterations, we refer to this as a “multiple lifetime” approach. For example, random search for hyperparameters (Bergstra and Bengio, 2012) falls under this category, and population based training (Jaderberg et al., 2017) also shares many of its properties. The cost of relying on multiple lifetimes to tune hyperparameters is often not accounted for when the performance of algorithms is reported. The impact of this is mild when users can rely on hyperparameters established in the literature, but the cost of hyperparameter tuning across multiple lifetimes manifests itself when algorithms are applied to new domains.
This motivates a significant body of work on tuning hyperparameters online, within a single agent lifetime. Previous work in this area has often focused on solutions tailored to specific hyperparameters. For instance, Schaul et al. (2019) proposed a non-stationary bandit algorithm to adapt the exploration-exploitation trade-off. Mann et al. (2016) and White and White (2016) proposed algorithms to adapt $\lambda$ (the eligibility trace coefficient). Rowland et al. (2019) introduced an adaptive coefficient into V-trace to account for the variance-contraction trade-off in off-policy learning. In another line of work, metagradients were used to adapt the optimiser's parameters (Sutton, 1992; Snoek et al., 2012; Maclaurin et al., 2015; Pedregosa, 2016; Franceschi et al., 2017; Young et al., 2018).

In this paper we build on the metagradient approach to tune hyperparameters. In previous work, online metagradients have been used to learn the discount factor $\gamma$ and the bootstrapping coefficient $\lambda$ (Xu et al., 2018), to discover intrinsic rewards (Zheng et al., 2018) and auxiliary tasks (Veeriah et al., 2019). This is achieved by representing an inner (training) loss function as a function of both the agent parameters and a set of hyperparameters. In the inner loop, the parameters of the agent are trained to minimize this inner loss function w.r.t the current values of the hyperparameters; in the outer loop, the hyperparameters are adapted via back-propagation to minimize the outer (validation) loss.

It is perhaps surprising that we may choose to optimize a different loss function in the inner loop, instead of the outer loss we ultimately care about. However, this is not a new idea. Regularization, for example, is a technique that changes the objective in the inner loop to balance the bias-variance trade-off and avoid the risk of overfitting. In model-based RL, it was shown that the policy found using a smaller discount factor can actually be better than a policy learned with the true discount factor (Jiang et al., 2015). Auxiliary tasks (Jaderberg et al., 2016) are another example, where gradients are taken w.r.t unsupervised loss functions in order to improve the agent's representation. Finally, it is well known that, in order to maximize the long-term cumulative reward efficiently (the objective of the outer loop), RL agents must explore, i.e., act according to a different objective in the inner loop (accounting, for instance, for uncertainty).

This paper makes the following contributions. First, we show that it is feasible to use metagradients to simultaneously tune many critical hyperparameters (controlling important trade-offs in a reinforcement learning agent), as long as they are differentiable w.r.t a validation/outer loss. Importantly, we show that this can be done online, within a single lifetime, at the cost of only a modest increase in compute.

We demonstrate this by introducing two novel deep RL architectures that extend IMPALA (Espeholt et al., 2018), a distributed actor-critic, by adding components with many new hyperparameters to be tuned. The first agent, referred to as the Self-Tuning Actor-Critic (STAC), introduces a leaky V-trace operator that mixes importance sampling (IS) weights with truncated IS weights. The mixing coefficient in leaky V-trace is differentiable (unlike the truncation in the original V-trace) but similarly balances the variance-contraction trade-off in off-policy learning. The second architecture is STACX (STAC with auXiliary tasks). Inspired by Fedus et al. (2019), STACX augments STAC with parametric auxiliary loss functions, each with its own hyperparameters.

These agents allow us to show empirically, through extensive ablation studies, that performance consistently improves as we expose more hyperparameters to metagradients. In particular, when applied to 57 Atari games (Bellemare et al., 2013), STACX achieved a normalized median score of 364%, a new state of the art for online model-free agents.

2 Background

In the following, we consider three types of parameters:

  1. $\theta$ – the agent parameters.

  2. $\zeta$ – the hyperparameters.

  3. $\eta$ – the metaparameters.

$\theta$ denotes the parameters of the agent and parameterises, for example, the value function and the policy; these parameters are randomly initialised at the beginning of an agent's lifetime and updated using back-propagation on a suitable inner loss function. $\zeta$ denotes the hyperparameters, including, for example, the parameters of the optimizer (e.g. the learning rate) or the parameters of the loss function (e.g. the discount factor); these may be tuned over the course of many lifetimes (for instance via random search) to optimize an outer (validation) loss function. In a typical deep RL setup, only these first two types of parameters need to be considered. In metagradient algorithms a third set of parameters must be specified: the metaparameters, denoted $\eta$; these are a subset of the differentiable parameters in $\zeta$ that start with some initial value (itself a hyperparameter), but that are then adapted during the course of training.

2.1 The metagradient approach

Metagradient RL (Xu et al., 2018) is a general framework for adapting, online, within a single lifetime, the differentiable hyperparameters $\eta$. Consider an inner loss that is a function of both the parameters $\theta$ and the metaparameters $\eta$: $L_{\text{inner}}(\theta; \eta)$. On each step of an inner loop, $\theta$ can be optimized with a fixed $\eta$ to minimize the inner loss:

$$\theta_{t+1}(\eta) = \theta_t - \alpha \nabla_\theta L_{\text{inner}}(\theta_t; \eta). \quad (1)$$

In an outer loop, $\eta$ can then be optimized to minimize the outer loss by taking a metagradient step. As $\theta_{t+1}(\eta)$ is a function of $\eta$, this corresponds to updating the metaparameters by differentiating the outer loss w.r.t $\eta$:

$$\eta_{t+1} = \eta_t - \bar\alpha \nabla_\eta L_{\text{outer}}(\theta_{t+1}(\eta)) = \eta_t - \bar\alpha\, \nabla_\theta L_{\text{outer}}(\theta_{t+1})\, \nabla_\eta \theta_{t+1}(\eta). \quad (2)$$
The algorithm is general, as it implements a specific case of online cross-validation, and can be applied, in principle, to any differentiable meta-parameter used by the inner loss.
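To make the two-loop structure concrete, here is a minimal, hypothetical sketch of online metagradient descent on a scalar problem. The losses, learning rates, and variable names below are purely illustrative (they are not the paper's agent): the inner loss pulls the parameter toward the metaparameter, the outer loss measures distance to a target, and the metaparameter is updated by differentiating the outer loss through the inner update.

```python
# Minimal metagradient sketch (illustrative, not the paper's agent).
# Inner loss:  L_inner(theta; eta) = (theta - eta)^2
# Outer loss:  L_outer(theta)      = (theta - 3.0)^2
# Inner step:  theta' = theta - alpha * dL_inner/dtheta = theta - 2*alpha*(theta - eta)
# Meta step:   eta'   = eta - beta * dL_outer/deta,
#              where dL_outer/deta = dL_outer/dtheta' * dtheta'/deta and dtheta'/deta = 2*alpha.

def metagradient_loop(steps=500, alpha=0.1, beta=0.5):
    theta, eta = 0.0, 0.0
    for _ in range(steps):
        # Inner loop: update agent parameters w.r.t. the current metaparameter.
        theta = theta - alpha * 2.0 * (theta - eta)
        # Outer loop: differentiate the outer loss through the inner update.
        d_outer_d_theta = 2.0 * (theta - 3.0)
        d_theta_d_eta = 2.0 * alpha
        eta = eta - beta * d_outer_d_theta * d_theta_d_eta
    return theta, eta

theta, eta = metagradient_loop()
# Both theta and eta drift toward the outer-loss optimum at 3.0.
```

In practice the derivative of the inner update w.r.t. the metaparameters is computed by automatic differentiation rather than by hand, but the structure of the two loops is the same.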

2.2 IMPALA

Specific instantiations of the metagradient RL framework require specification of the inner and outer loss functions. Since our agent builds on the IMPALA actor critic agent (Espeholt et al., 2018), we now provide a brief introduction.

IMPALA maintains a policy $\pi_\theta(a \mid x)$ and a value function $V_\theta(x)$, parameterized by $\theta$. The policy and the value function are trained via an actor-critic update with entropy regularization; such an update is often represented (with slight abuse of notation) as the gradient of the following pseudo-loss function

$$L(\theta) = g_v L_V(\theta) + g_p L_\pi(\theta) + g_e L_H(\theta), \quad (3)$$

where $L_V(\theta) = \sum_s (v_s - V_\theta(x_s))^2$ is the critic loss, $L_\pi(\theta) = -\sum_s \rho_s \log \pi_\theta(a_s \mid x_s)\,(r_s + \gamma v_{s+1} - V_\theta(x_s))$ is the policy-gradient loss, $L_H(\theta) = -\sum_s H(\pi_\theta(\cdot \mid x_s))$ is the negative entropy, and $g_v, g_p, g_e$ are suitable loss coefficients. We refer to the policy $\mu$ that generates the data for these updates as the behaviour policy. In the on-policy case, where $\mu = \pi_\theta$ and $\rho_s = c_s = 1$, $v_s$ is the $n$-step bootstrapped return.

IMPALA uses a distributed actor-critic architecture that assigns copies of the policy parameters to multiple actors on different machines to achieve higher sample throughput. As a result, the target policy $\pi$ on the learner machine can be several updates ahead of the actors' policy $\mu$ that generated the data used in an update. Such off-policy discrepancy can lead to biased updates, requiring the updates to be weighted with importance sampling (IS) weights for stable learning. Specifically, IMPALA (Espeholt et al., 2018) uses truncated IS weights to balance the variance-contraction trade-off in these off-policy updates. This corresponds to instantiating Eq. 3 with the V-trace target

$$v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s}\Big(\prod_{i=s}^{t-1} c_i\Big)\, \delta_t V, \quad (4)$$

where we define $\delta_t V = \rho_t\,(r_t + \gamma V(x_{t+1}) - V(x_t))$, and we set $\rho_t = \min\!\big(\bar\rho, \tfrac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\big)$ and $c_t = \lambda \min\!\big(\bar c, \tfrac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\big)$ for suitable truncation levels $\bar\rho \ge \bar c$.
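The V-trace target can be computed with a backward recursion, $v_s = V(x_s) + \delta_s V + \gamma c_s (v_{s+1} - V(x_{s+1}))$. The sketch below follows that recursion; the function name and the toy inputs are our own, not from the paper:

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap, is_ratios,
                   gamma=0.99, lam=1.0, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace targets v_s by backward recursion.

    rewards, is_ratios: length-n arrays; values: length-n array V(x_s);
    bootstrap: scalar V(x_n); is_ratios[t] = pi(a_t|x_t) / mu(a_t|x_t).
    """
    n = len(rewards)
    rho = np.minimum(rho_bar, is_ratios)       # truncated weights for the TD error
    c = lam * np.minimum(c_bar, is_ratios)     # truncated weights for the trace
    values_ext = np.append(values, bootstrap)
    v = np.zeros(n + 1)
    v[n] = bootstrap
    for s in reversed(range(n)):
        delta = rho[s] * (rewards[s] + gamma * values_ext[s + 1] - values_ext[s])
        v[s] = values_ext[s] + delta + gamma * c[s] * (v[s + 1] - values_ext[s + 1])
    return v[:n]

# On-policy sanity check: with all IS ratios equal to 1 the target reduces
# to the n-step bootstrapped return v_s = r_s + gamma * v_{s+1}.
rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 0.5, 0.5])
targets = vtrace_targets(rewards, values, bootstrap=0.5,
                         is_ratios=np.ones(3), gamma=0.9)
```

On-policy, the recursion collapses to $v_s = r_s + \gamma v_{s+1}$, which is a quick way to validate an implementation before testing the off-policy weights.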

2.3 Metagradient IMPALA

The metagradient agent in Xu et al. (2018) uses the metagradient update rules from the previous section with the actor-critic loss function of Espeholt et al. (2018). More specifically, the inner loss is a parameterised version of the IMPALA loss (Eq. 3),

$$L_{\text{inner}}(\theta; \eta) = g_v L_V(\theta) + g_p L_\pi(\theta) + g_e L_H(\theta),$$

where $\eta = \{\gamma, \lambda\}$. Notice that $\gamma$ and $\lambda$ also affect the inner loss through the definition of $v_s$ (Eq. 4).

The outer loss is defined to be the policy gradient loss, evaluated with the fixed outer-loop values of $\gamma$ and $\lambda$.

3 Self-Tuning actor-critic agents

We first consider a slightly extended version of the metagradient IMPALA agent (Xu et al., 2018). Specifically, we allow the metagradient to adapt separate coefficients, acting as learning rates, for the individual components of the loss:

$$L_{\text{inner}}(\theta; \eta) = g_v L_V(\theta) + g_p L_\pi(\theta) + g_e L_H(\theta). \quad (5)$$

The outer loss is defined using Eq. 3, with an additional regularization term:

$$L_{\text{outer}}(\theta) = g_v^{\text{outer}} L_V(\theta) + g_p^{\text{outer}} L_\pi(\theta) + g_e^{\text{outer}} L_H(\theta) + g_{\text{kl}}^{\text{outer}}\,\mathrm{KL}\big(\pi_{\theta_t} \,\|\, \pi_{\theta_{t+1}}\big). \quad (6)$$

Notice the new Kullback–Leibler (KL) term in Eq. 6, which encourages the $\eta$-update not to change the policy too much.

Compared to the work by Xu et al. (2018), which only self-tuned the inner-loss $\gamma$ and $\lambda$ (hence $\eta = \{\gamma, \lambda\}$), also self-tuning the inner-loss coefficients corresponds to setting $\eta = \{\gamma, \lambda, g_v, g_p, g_e\}$. These metaparameters allow for loss-specific learning rates and support dynamically balancing exploration with exploitation by adapting the entropy loss weight $g_e$. (There are a few additional subtle differences between this self-tuning IMPALA agent and the metagradient agent from Xu et al. (2018); for example, we do not use the $\gamma$ embedding used in that work. These differences are further discussed in the supplementary material, where we also reproduce the results of Xu et al. (2018) in our code base.) The hyperparameters of STAC include the initialisations of the metaparameters, the hyperparameters of the outer loss $\{\gamma^{\text{outer}}, \lambda^{\text{outer}}, g_v^{\text{outer}}, g_p^{\text{outer}}, g_e^{\text{outer}}\}$, the KL coefficient, and the learning rate of the ADAM meta optimizer.

Table 1: Parameters in IMPALA and self-tuning IMPALA.

To set the initial values of the metaparameters of self-tuning IMPALA, we use a simple “rule of thumb” and set them to the values of the corresponding parameters in the outer loss (e.g., the initial value of the inner-loss $\gamma$ is set equal to the outer-loss $\gamma$). For outer-loss hyperparameters that are common to IMPALA we default to the IMPALA settings.

In the next two sections, we show how embracing self-tuning via metagradients enables us to augment this agent with a parameterised Leaky V-trace operator and with self-tuned auxiliary loss functions. These ideas are examples of how the ability to self-tune metaparameters via metagradients can be used to introduce novel components into RL algorithms without requiring extensive tuning of the new hyperparameters.

3.1 STAC

All the hyperparameters that we have considered for self-tuning so far appear explicitly in the definition of the loss function and can be directly differentiated. The truncation levels in the V-trace operator within IMPALA, on the other hand, are equivalent to applying a ReLU activation and are non-differentiable.

Motivated by the study of non-linear activations in deep learning (Xu et al., 2015), we now introduce an agent based on a variant of the V-trace operator that we call leaky V-trace. We will refer to this agent as the Self-Tuning Actor-Critic (STAC). Leaky V-trace uses a leaky rectifier (Maas et al., 2013) to truncate the importance sampling weights, which allows a small non-zero gradient when the unit is saturated. We show that the degree of leakiness can control certain trade-offs in off-policy learning, similarly to V-trace, but in a manner that is differentiable.

Before we introduce Leaky V-trace, let us first recall how the off-policy trade-offs are represented in V-trace using the coefficients $\bar\rho$ and $\bar c$. The weight $\rho_t$ appears in the definition of the temporal difference $\delta_t V$ and defines the fixed point of this update rule. The fixed point of this update is the value function $V^{\pi_{\bar\rho}}$ of a policy $\pi_{\bar\rho}$ that is somewhere between the behaviour policy $\mu$ and the target policy $\pi$, controlled by the hyperparameter $\bar\rho$:

$$\pi_{\bar\rho}(a \mid x) = \frac{\min\big(\bar\rho\,\mu(a \mid x),\, \pi(a \mid x)\big)}{\sum_b \min\big(\bar\rho\,\mu(b \mid x),\, \pi(b \mid x)\big)}. \quad (7)$$

The product of the weights $c_s \cdots c_{t-1}$ in Eq. 4 measures how much a temporal difference $\delta_t V$ observed at time $t$ impacts the update of the value function at time $s$. The truncation level $\bar c$ is used to control the speed of convergence by trading off the update variance for a larger contraction rate, similar to Retrace (Munos et al., 2016). By clipping the importance weights, the variance associated with the update rule is reduced relative to importance-weighted returns. On the other hand, clipping the importance weights effectively cuts the traces in the update, resulting in the update placing less weight on later TD errors, and thus worsening the contraction rate of the corresponding operator.

Following this interpretation of the off-policy coefficients, we now propose a variation of V-trace, which we call leaky V-trace, with new parameters $\alpha_\rho, \alpha_c$:

$$\rho_t = \alpha_\rho \min(\bar\rho, \mathrm{IS}_t) + (1 - \alpha_\rho)\,\mathrm{IS}_t, \qquad c_t = \lambda\big(\alpha_c \min(\bar c, \mathrm{IS}_t) + (1 - \alpha_c)\,\mathrm{IS}_t\big), \quad (8)$$

where $\mathrm{IS}_t = \pi(a_t \mid x_t)/\mu(a_t \mid x_t)$ denotes the importance sampling weight. We highlight that for $\alpha_\rho = \alpha_c = 1$ Leaky V-trace is exactly equivalent to V-trace, while for $\alpha_\rho = \alpha_c = 0$ it is equivalent to canonical importance sampling. For intermediate values we get a mixture of the truncated and non-truncated importance sampling weights.
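The leaky weights are a simple convex mixture of truncated and raw importance sampling weights; a small sketch (the function name and toy ratios are our own) makes the two limiting cases explicit:

```python
import numpy as np

def leaky_weights(is_ratios, alpha_rho, alpha_c, lam=1.0, rho_bar=1.0, c_bar=1.0):
    """Leaky V-trace coefficients: a convex mix of truncated and raw IS weights."""
    rho = alpha_rho * np.minimum(rho_bar, is_ratios) + (1 - alpha_rho) * is_ratios
    c = lam * (alpha_c * np.minimum(c_bar, is_ratios) + (1 - alpha_c) * is_ratios)
    return rho, c

w = np.array([0.3, 1.7, 2.5])            # example importance sampling ratios
rho1, c1 = leaky_weights(w, 1.0, 1.0)    # alpha = 1: exactly V-trace (truncated)
rho0, c0 = leaky_weights(w, 0.0, 0.0)    # alpha = 0: canonical importance sampling
```

Because the mixture is linear in $\alpha_\rho$ and $\alpha_c$, the loss is differentiable w.r.t. these coefficients even where the `min` saturates, which is exactly what exposes them to metagradients.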

Theorem 1 below shows that Leaky V-trace is a contraction mapping, and that the value function it converges to is $V^{\pi_{\bar\rho,\alpha_\rho}}$, where

$$\pi_{\bar\rho,\alpha_\rho}(a \mid x) = \frac{\alpha_\rho \min\big(\bar\rho\,\mu(a \mid x),\, \pi(a \mid x)\big) + (1-\alpha_\rho)\,\pi(a \mid x)}{\sum_b \alpha_\rho \min\big(\bar\rho\,\mu(b \mid x),\, \pi(b \mid x)\big) + (1-\alpha_\rho)\,\pi(b \mid x)} \quad (9)$$

is a policy that mixes (and then re-normalizes) the target policy $\pi$ with the V-trace policy of Eq. 7. (Note that the adaptive off-policy algorithm of Rowland et al. (2019) mixes the V-trace policy with the behaviour policy; Leaky V-trace mixes it with the target policy.) A more formal statement of Theorem 1 and a detailed proof, which closely follows that of Espeholt et al. (2018) for the original V-trace operator, can be found in the supplementary material.

Theorem 1.

The leaky V-trace operator defined by Eq. 8 is a contraction operator, and it converges to the value function of the policy defined by Eq. 9.
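To see how $\alpha_\rho$ interpolates the fixed point, one can evaluate the mixture policy of Eq. 9 directly. In this sketch (the function name and toy distributions are our own), $\alpha_\rho = 0$ recovers the target policy and $\alpha_\rho = 1$ recovers the V-trace policy of Eq. 7:

```python
import numpy as np

def leaky_vtrace_policy(pi, mu, rho_bar, alpha_rho):
    """Fixed-point policy of Leaky V-trace: mix min(rho_bar*mu, pi) with pi, renormalize."""
    unnorm = alpha_rho * np.minimum(rho_bar * mu, pi) + (1 - alpha_rho) * pi
    return unnorm / unnorm.sum()

pi = np.array([0.7, 0.2, 0.1])   # target policy over three actions
mu = np.array([0.2, 0.3, 0.5])   # behaviour policy
p0 = leaky_vtrace_policy(pi, mu, rho_bar=1.0, alpha_rho=0.0)  # equals pi
p1 = leaky_vtrace_policy(pi, mu, rho_bar=1.0, alpha_rho=1.0)  # V-trace policy (Eq. 7)
```

Intermediate values of $\alpha_\rho$ give fixed points between these two extremes, which is the sense in which the leak parameter controls the off-policy bias of the operator.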

Similar to $\bar\rho$, the new parameter $\alpha_\rho$ controls the fixed point of the update rule and defines a value function that interpolates between the value function of the target policy $\pi$ and that of the behaviour policy $\mu$. Specifically, the parameter $\alpha_\rho$ allows the importance weights to “leak back”, creating the opposite effect to clipping.

Since Theorem 1 requires $\alpha_\rho \ge \alpha_c$, our main STAC implementation parametrises the loss with a single parameter $\alpha = \alpha_\rho = \alpha_c$. In addition, we also experimented with a version of STAC that learns both $\alpha_\rho$ and $\alpha_c$. Quite interestingly, this variation of STAC learns the rule $\alpha_\rho \ge \alpha_c$ on its own (see the experiments section for more details).

Note that low values of $\alpha$ lead to importance sampling, which has a better contraction rate but higher variance. On the other hand, high values of $\alpha$ lead to V-trace, which has lower variance but a worse contraction rate than importance sampling. Thus, exposing $\alpha$ to meta-learning enables STAC to directly control the contraction/variance trade-off.
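The variance side of this trade-off is easy to see numerically: clipping importance weights at $\bar\rho$ can only shrink their spread. A quick check on synthetic weights (our own construction, not an experiment from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic importance sampling ratios from a mismatched behaviour/target pair;
# heavy-tailed ratios are typical of off-policy data.
is_ratios = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

raw_var = is_ratios.var()                        # alpha = 0: full IS weights
clipped_var = np.minimum(1.0, is_ratios).var()   # alpha = 1: V-trace-style clipping

# Clipping reduces the variance of the weights, at the cost of the
# contraction/bias effects discussed above.
assert clipped_var < raw_var
```

The metagradient can therefore move $\alpha$ along a one-dimensional dial between these two regimes as training progresses.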

In summary, the metaparameters for STAC are $\eta = \{\gamma, \lambda, g_v, g_p, g_e, \alpha\}$. To keep things simple, when using Leaky V-trace we make two simplifications w.r.t the hyperparameters. First, we use V-trace to initialise Leaky V-trace, i.e., we initialise $\alpha = 1$. Second, we fix the outer loss to be V-trace, i.e., we set $\alpha^{\text{outer}} = 1$.

3.2 STAC with auxiliary tasks (STACX)

Next, we introduce a new agent that extends STAC with auxiliary policies, value functions, and respective auxiliary loss functions; what is new here is that the parameters that define the auxiliary tasks (the discount factors, in this case) are self-tuned. As this agent has a new architecture in addition to an extended set of metaparameters, we give it a different acronym and denote it by STACX (STAC with auxiliary tasks). The auxiliary losses have the same parametric form as the main objective and can be used to regularize it and improve the shared representation.

Figure 1: Block diagrams of STAC and STACX.

STACX's architecture has a shared representation layer $\theta_{\text{shared}}$, from which it splits into three different heads (Fig. 1). For the shared representation layer we use the deep residual net from Espeholt et al. (2018). Each head $i$ has a policy $\pi^i$ and a corresponding value function $V^i$ that are represented using an MLP with parameters $\theta^i$. Each one of these heads is trained in the inner loop to minimize a loss function parametrised by its own set of metaparameters $\eta^i$.

The policy of the STACX agent is defined to be the policy of the first head, $\pi^1$. The hyperparameters are trained in the outer loop to improve the performance of this single head. Thus, the role of the auxiliary heads is to act as auxiliary tasks (Jaderberg et al., 2016) and improve the shared representation $\theta_{\text{shared}}$. Finally, notice that each head has its own policy $\pi^i$, but the behaviour policy is fixed to be $\mu = \pi^1$. Thus, to optimize the auxiliary heads we use (Leaky) V-trace for off-policy corrections. (We also considered two extensions of this approach. (1) Random ensemble: the policy head is chosen at random from the available heads, and the hyperparameters are differentiated w.r.t the performance of each one of the heads in the outer loop. (2) Average ensemble: the actor policy is defined to be the average logits of the heads, and we learn one additional head for the value function of this policy; the metagradient in the outer loop is taken with respect to the actor policy and/or each one of the heads individually. While these extensions seem interesting, in all of our experiments they always led to a small decrease in performance compared to our auxiliary-task agent without them. Similar findings were reported in Fedus et al. (2019).)

The metaparameters for STACX are $\eta = \{\gamma^i, \lambda^i, g_v^i, g_p^i, g_e^i, \alpha^i\}_{i=1}^{3}$. Since the outer loss is defined only w.r.t head 1, introducing the auxiliary tasks into STACX does not require new hyperparameters for the outer loss. In addition, we use the same initialisation values for all the auxiliary tasks. Thus, STACX has exactly the same hyperparameters as STAC.
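The bookkeeping for the per-head metaparameters can be sketched as follows. This is a schematic of our own: the stand-in head losses and the numeric initial values are illustrative (only $\gamma = 0.995$ is taken from the text), and a real agent would compute these losses from each head's policy/value outputs and (Leaky) V-trace targets. The point is the structure: the inner loss sums all three heads, each weighted by its own self-tuned coefficients, while the outer loss touches only head 1.

```python
# Schematic of STACX's parameter/metaparameter grouping (stand-in losses, our own).
HEADS = (1, 2, 3)

# One metaparameter set eta^i per head: gamma, lambda, loss coefficients, alpha.
# The numeric values are illustrative initialisations, not the paper's settings
# (except gamma = 0.995, which the text reports for the outer loss).
eta = {i: dict(gamma=0.995, lam=1.0, g_v=0.25, g_p=1.0, g_e=0.01, alpha=1.0)
       for i in HEADS}

def head_losses(i):
    """Stand-in per-head losses; placeholders for the actor-critic losses of Eq. 5."""
    return dict(value=1.0 * i, policy=0.5 * i, entropy=0.1 * i)

def inner_loss():
    # All heads contribute, each weighted by its own self-tuned coefficients.
    total = 0.0
    for i in HEADS:
        L = head_losses(i)
        total += (eta[i]["g_v"] * L["value"]
                  + eta[i]["g_p"] * L["policy"]
                  + eta[i]["g_e"] * L["entropy"])
    return total

def outer_loss():
    # Only head 1 (the behaviour policy) defines the outer/validation loss,
    # so the auxiliary heads add no new outer-loss hyperparameters.
    L = head_losses(1)
    return L["value"] + L["policy"] + L["entropy"]
```

Because `outer_loss` never reads `eta[2]` or `eta[3]`, the auxiliary metaparameters influence the agent only through the shared representation trained by `inner_loss`, mirroring the role of the auxiliary heads described above.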

Figure 2: Median human-normalized scores during training.

4 Experiments

In all of our experiments (with the exception of the robustness experiments in Section 4.3) we use the IMPALA hyperparameters, both for the IMPALA baseline and for the outer loss of the STAC agent. We use $\gamma^{\text{outer}} = 0.995$, as this value was found to improve the performance of IMPALA considerably (Xu et al., 2018).

4.1 Atari learning curves

We start by evaluating STAC and STACX in the Arcade Learning Environment (ALE; Bellemare et al., 2013). Fig. 2 presents the normalized median scores during training. (Normalized median scores are computed as follows: for each Atari game, we compute the human-normalized score after 200M frames of training and average it over 3 different seeds; we then report the overall median across the Atari domains.) We found STACX to learn faster and achieve higher final performance than STAC. We also compare these agents with versions of them without self-tuning (fixing the metaparameters). The version of STACX with fixed unsupervised auxiliary tasks achieved a normalized median score similar to that of UNREAL (Jaderberg et al., 2016), but not much better than IMPALA. In Fig. 3 we report the relative improvement of STACX over IMPALA in the individual levels (an equivalent figure for STAC may be found in the supplementary material).

Figure 3: Mean human-normalized scores after 200M frames, relative improvement in percents of STACX over IMPALA.

STACX achieved a median score of 364%, a new state-of-the-art result in the ALE benchmark for online model-free agents trained for 200M frames. In fact, only two agents have reported better performance after 200M frames: LASER (Schmitt et al., 2019) and MuZero (Schrittwieser et al., 2019). These papers propose algorithmic modifications that are orthogonal to our approach and could be combined with it in future work: LASER combines IMPALA with a uniform large-scale experience replay, while MuZero uses replay and a tree-based search with a learned model.

4.2 Ablative analysis

Next, we perform an ablative study of our approach by training different variations of STAC and STACX. The results are summarized in Fig. 4. In red, we show different baselines (we further discuss reproducing the results of Xu et al. (2018) in the supplementary material); in green and blue, we show different ablations of STAC and STACX, respectively. In these ablations, $\eta$ corresponds to the subset of the hyperparameters that are being self-tuned; e.g., $\eta = \{\gamma\}$ corresponds to self-tuning only the discount factor (with all other hyperparameters fixed).

Inspecting Fig. 4, we observe that the performance of STAC and STACX consistently improves as they self-tune more metaparameters. These metaparameters control different trade-offs in reinforcement learning: the discount factor $\gamma$ controls the effective horizon, the loss coefficients act as loss-specific learning rates, and the Leaky V-trace coefficient $\alpha$ controls the variance-contraction trade-off in off-policy RL.

We have also experimented with adding more auxiliary losses. These variations performed better than having a single loss function, but slightly worse than the default configuration with two auxiliary losses. This can be further explained by Fig. 9 (Section 4.4), which shows that the auxiliary heads are self-tuned to similar metaparameters.

Figure 4: Ablative studies of STAC (green) and STACX (blue) compared to some baselines (red). Median normalized scores across 57 Atari games after training for 200M frames.

Finally, we experimented with a few more versions of STACX. One variation allows STACX to self-tune both $\alpha_\rho$ and $\alpha_c$ without enforcing the relation $\alpha_\rho \ge \alpha_c$. This version performed slightly worse than STACX, and we further discuss it in Section 4.4 (Fig. 10). In another variation, we self-tune, together with $\alpha$, a single truncation parameter $\bar\rho = \bar c$. This variation performed much worse, which may be explained by the truncation itself not being differentiable.

In Fig. 5, we further summarize the relative improvement of ablative variations of STAC (green, yellow, and blue; the bottom three lines) and STACX (light blue and red; the top two lines) over the IMPALA baseline. For each value of x (the x-axis), we measure the number of games in which an ablative version of the STAC(X) agent is better than the IMPALA agent by at least x percent, and subtract from it the number of games in which the IMPALA agent is better than STAC(X) by at least x percent. Clearly, STACX (light blue) improves the performance of IMPALA by a large margin. Moreover, we observe that allowing the STAC(X) agents to self-tune more metaparameters consistently improves their performance in more games.

Figure 5: The number of games (y-axis) in which ablative variations of STAC and STACX improve the IMPALA baseline by at least x percents (x-axis).

4.3 Robustness

It is important to note that our algorithm is not hyperparameter-free. For instance, we still need to choose hyperparameter settings for the outer loss. Additionally, each hyperparameter in the inner loss that we expose to metagradients still requires an initialization (itself a hyperparameter). Therefore, in this section, we investigate the robustness of STACX to its hyperparameters.

We begin with the hyperparameters of the outer loss. In these experiments we compare the robustness of STACX with that of IMPALA in the following manner. For each hyperparameter (the discount factor and the critic weight $g_v$), we select a set of perturbations. For STACX we perturb the hyperparameter in the outer loss, and for IMPALA we perturb the corresponding hyperparameter directly. We randomly selected a subset of Atari levels and present the mean and standard deviation across 3 random seeds after 200M frames of training.

Fig. 6 presents the results for the discount factor. We can see that overall (in 72% of the configurations, measured by mean) STACX indeed performs better than IMPALA. Similarly, Fig. 7 shows the robustness of STACX to the critic weight $g_v$, where STACX improves over IMPALA in 80% of the configurations.

Figure 6: Robustness to the discount factor. Mean and confidence intervals (over 3 seeds), after 200M frames of training. Left (blue) bars correspond to STACX and right (red) bars to IMPALA. STACX is better than IMPALA in 72% of the runs measured by mean.

Figure 7: Robustness to the critic weight $g_v$. Mean and confidence intervals (over 3 seeds), after 200M frames of training. Left (blue) bars correspond to STACX and right (red) bars to IMPALA. STACX is better than IMPALA in 80% of the runs measured by mean.

Next, we investigate the robustness of STACX to the initialisation of the metaparameters (Fig. 8). We selected initial values close to the corresponding outer-loss hyperparameters, as our design principle is to initialise the metaparameters to be similar to the hyperparameters in the outer loss. We observe that, overall, the method is quite robust to different initialisations.

Figure 8: Robustness to the initialisation of the metaparameters. Mean and confidence intervals (over 6 seeds), after 200M frames of training. Columns correspond to different games. Bottom: perturbing a single metaparameter initialisation. Top: perturbing all the metaparameter initialisations, i.e., setting them all to a single fixed value.

4.4 Adaptivity

In Fig. 9 we visualize the metaparameters of STACX during training. As there are many metaparameters, seeds, and levels, we restrict ourselves to a single seed (chosen arbitrarily to be 1) and a single game (Jamesbond). More examples can be found in the supplementary material. For each metaparameter we plot the values associated with the three heads, where the policy head (head 1) is presented in blue and the auxiliary heads (2 and 3) are presented in orange and magenta.

Inspecting Fig. 9, we can see that the two auxiliary heads self-tune their metaparameters to relatively similar values, but to values different from those of the main head. The discount factor of the main head, for example, converges to the value of the discount factor in the outer loss (0.995), while the discount factors of the auxiliary heads change quite a lot during training and learn about horizons that differ from that of the main head.

We also observe non-trivial behaviour in the self-tuning of the loss coefficients and of the off-policy coefficient $\alpha$. For instance, we found that at the beginning of training $\alpha$ is self-tuned to a high value (close to 1), so the update is quite similar to V-trace; towards the end of training, STACX self-tunes $\alpha$ to lower values, which makes the update closer to importance sampling.

Figure 9: Adaptivity in Jamesbond.

Finally, we also noticed an interesting behaviour in the version of STACX where we expose both the $\alpha_\rho$ and $\alpha_c$ coefficients to self-tuning, without imposing $\alpha_\rho \ge \alpha_c$ (Theorem 1). This variation of STACX achieved a slightly lower median score. Quite interestingly, the metagradient discovered the rule $\alpha_\rho \ge \alpha_c$ on its own: it self-tunes the coefficients so that $\alpha_\rho \ge \alpha_c$ for the large majority of training time (averaged over time, seeds, and levels). Fig. 10 shows an example of this in Jamesbond.

Figure 10: Discovery of the rule $\alpha_\rho \ge \alpha_c$ in Jamesbond.

5 Summary

In this work we demonstrated that it is feasible to use metagradients to simultaneously tune many critical hyperparameters (controlling important trade-offs in a reinforcement learning agent), as long as they are differentiable; we showed that this can be done online, within a single lifetime. We did so by presenting STAC and STACX, actor-critic algorithms that self-tune a large number of hyperparameters of very different nature. We showed that the performance of these agents improves as they self-tune more hyperparameters, and we demonstrated that STAC and STACX are computationally efficient and robust to their own hyperparameters.


  • M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling (2013) The arcade learning environment: an evaluation platform for general agents.

    Journal of Artificial Intelligence Research

    47, pp. 253–279.
    Cited by: §1, §4.1.
  • J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (1), pp. 281–305. External Links: ISSN 1532-4435 Cited by: §1.
  • L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. (2018) Impala: scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561. Cited by: §1, §2.2, §2.2, §2.3, §3.1, §3.2, §7, §7.
  • W. Fedus, C. Gelada, Y. Bengio, M. G. Bellemare, and H. Larochelle (2019) Hyperbolic discounting and learning over multiple horizons. arXiv preprint arXiv:1902.06865. Cited by: §1, footnote 3.
  • L. Franceschi, M. Donini, P. Frasconi, and M. Pontil (2017) Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1165–1173. Cited by: §1.
  • M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. (2017) Population based training of neural networks. arXiv preprint arXiv:1711.09846. Cited by: §1.
  • M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu (2016) Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397. Cited by: §1, §3.2, §4.1.
  • N. Jiang, A. Kulesza, S. Singh, and R. Lewis (2015) The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 1181–1189. Cited by: §1.
  • A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing. Cited by: §3.1.
  • D. Maclaurin, D. Duvenaud, and R. Adams (2015) Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122. Cited by: §1.
  • T. A. Mann, H. Penedones, S. Mannor, and T. Hester (2016) Adaptive lambda least-squares temporal difference learning. arXiv preprint arXiv:1612.09465. Cited by: §1.
  • R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare (2016) Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062. Cited by: §3.1.
  • F. Pedregosa (2016) Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pp. 737–746. Cited by: §1.
  • M. Rowland, W. Dabney, and R. Munos (2019) Adaptive trade-offs in off-policy learning. arXiv preprint arXiv:1910.07478. Cited by: §1, footnote 2.
  • T. Schaul, D. Borsa, D. Ding, D. Szepesvari, G. Ostrovski, W. Dabney, and S. Osindero (2019) Adapting behaviour for learning progress. arXiv preprint arXiv:1912.06910. Cited by: §1.
  • S. Schmitt, M. Hessel, and K. Simonyan (2019) Off-policy actor-critic with shared experience replay. arXiv preprint arXiv:1909.11583. Cited by: §4.1.
  • J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, et al. (2019) Mastering atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265. Cited by: §4.1.
  • J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959. Cited by: §1.
  • R. S. Sutton (1992) Adapting bias by gradient descent: an incremental version of delta-bar-delta. In AAAI, pp. 171–176. Cited by: §1.
  • V. Veeriah, M. Hessel, Z. Xu, J. Rajendran, R. L. Lewis, J. Oh, H. P. van Hasselt, D. Silver, and S. Singh (2019) Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems, pp. 9306–9317. Cited by: §1.
  • M. White and A. White (2016) A greedy approach to adapting the trace parameter for temporal difference learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 557–565. Cited by: §1.
  • B. Xu, N. Wang, T. Chen, and M. Li (2015) Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853. Cited by: §3.1.
  • Z. Xu, H. P. van Hasselt, and D. Silver (2018) Meta-gradient reinforcement learning. In Advances in neural information processing systems, pp. 2396–2407. Cited by: §1, §2.1, §2.3, §3, §3, §4, §8, §8, §8, footnote 1, footnote 5.
  • K. Young, B. Wang, and M. E. Taylor (2018) Metatrace: online step-size tuning by meta-gradient descent for reinforcement learning control. arXiv preprint arXiv:1805.04514. Cited by: §1.
  • Z. Zheng, J. Oh, and S. Singh (2018) On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 4644–4654. Cited by: §1.

6 Additional results

Figure 11: Mean human-normalized scores after 200M frames; relative improvement, in percent, of STAC over the IMPALA baseline.

7 Analysis of Leaky V-trace

Define the Leaky V-trace operator $\mathcal{T}$:

$$\mathcal{T}V(x) := V(x) + \mathbb{E}_\mu\Big[\sum_{t\ge 0}\gamma^t\Big(\prod_{s=0}^{t-1}c_s\Big)\rho_t\big(r_t + \gamma V(x_{t+1}) - V(x_t)\big)\Big], \qquad (10)$$

where the expectation is with respect to the behaviour policy $\mu$ which has generated the trajectory $(x_t, a_t, r_t)_{t \ge 0}$, i.e., $x_0 = x$ and $a_t \sim \mu(\cdot \mid x_t)$. Similar to (Espeholt et al., 2018), we consider the infinite-horizon operator, but very similar results hold for the n-step truncated operator.


Let

$$\mathrm{IS}_t = \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}$$

be the importance sampling weights, let

$$\bar\rho_t = \min(\bar\rho, \mathrm{IS}_t), \qquad \bar c_t = \min(\bar c, \mathrm{IS}_t)$$

be the truncated importance sampling weights with $\bar\rho \ge \bar c$, and let

$$\rho_t = \alpha_\rho\,\bar\rho_t + (1-\alpha_\rho)\,\mathrm{IS}_t, \qquad c_t = \lambda\big(\alpha_c\,\bar c_t + (1-\alpha_c)\,\mathrm{IS}_t\big)$$

be the Leaky importance sampling weights with leaky coefficients $\alpha_\rho \ge \alpha_c$.
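The leaky weights interpolate between the truncated (V-trace) weights and unclipped importance sampling. The following pure-Python sketch is ours (the defaults and the placement of the λ factor are assumptions of this sketch, not the paper's exact code):

```python
def leaky_weights(is_ratio, rho_bar=1.0, c_bar=1.0,
                  alpha_rho=1.0, alpha_c=1.0, lam=1.0):
    """Leaky V-trace importance weights (illustrative sketch).

    alpha = 1 recovers the truncated V-trace weights;
    alpha = 0 lets the untruncated ratio leak through unchanged.
    """
    rho = alpha_rho * min(rho_bar, is_ratio) + (1 - alpha_rho) * is_ratio
    c = lam * (alpha_c * min(c_bar, is_ratio) + (1 - alpha_c) * is_ratio)
    return rho, c

# alpha = 1: truncation fully active, a ratio of 3 is clipped to 1.
vtrace_like = leaky_weights(3.0)
# alpha = 0: plain importance sampling, the ratio passes through.
is_like = leaky_weights(3.0, alpha_rho=0.0, alpha_c=0.0)
```

Self-tuning `alpha_rho` and `alpha_c` then lets the agent trade off the bias of truncation against the variance of unclipped importance sampling.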

Theorem 2 (Restatement of Theorem 1).

Assume that there exists $\beta \in (0, 1]$ such that $\mathbb{E}_\mu \rho_0 \ge \beta$. Then the operator $\mathcal{T}$ defined by Eq. 10 has a unique fixed point $V^{\pi_{\bar\rho,\alpha_\rho}}$, which is the value function of the policy $\pi_{\bar\rho,\alpha_\rho}$ defined by

$$\pi_{\bar\rho,\alpha_\rho}(a \mid x) = \frac{\alpha_\rho \min\big(\bar\rho\,\mu(a \mid x),\, \pi(a \mid x)\big) + (1-\alpha_\rho)\,\pi(a \mid x)}{\sum_b \alpha_\rho \min\big(\bar\rho\,\mu(b \mid x),\, \pi(b \mid x)\big) + (1-\alpha_\rho)\,\pi(b \mid x)}.$$

Furthermore, $\mathcal{T}$ is an $\eta$-contraction mapping in sup-norm, with

$$\eta := \gamma^{-1} - (\gamma^{-1}-1)\,\mathbb{E}_\mu\Big[\sum_{t\ge 0}\gamma^t\Big(\prod_{s=0}^{t-2}c_s\Big)\rho_{t-1}\Big] \le 1 - (1-\gamma)\beta < 1,$$

where we use the conventions $\prod_{s=0}^{t-2}c_s = 1$ and $\rho_{t-1} = 1$ for $t \le 1$.


The proof follows the proof of V-trace from (Espeholt et al., 2018), with adaptations for the leaky V-trace coefficients. We have that

$$\mathcal{T}V(x) = (1 - \mathbb{E}_\mu\rho_0)\,V(x) + \mathbb{E}_\mu\Big[\sum_{t\ge 0}\gamma^t\Big(\prod_{s=0}^{t-1}c_s\Big)\big(\rho_t r_t + \gamma\,[\rho_t - c_t\rho_{t+1}]\,V(x_{t+1})\big)\Big].$$

Denote by $\kappa_t := \rho_t - c_t\rho_{t+1}$ and notice that

$$\mathbb{E}_\mu[\kappa_t] \ge \mathbb{E}_\mu[\rho_t - c_t],$$

since $\mathbb{E}_\mu[\rho_{t+1}] \le \mathbb{E}_\mu[\mathrm{IS}_{t+1}] = 1$ and therefore $\mathbb{E}_\mu[c_t\rho_{t+1}] \le \mathbb{E}_\mu[c_t]$. Furthermore, since $\bar\rho \ge \bar c$, $\alpha_\rho \ge \alpha_c$ and $\lambda \le 1$, we have that $\mathbb{E}_\mu[\rho_t] \ge \mathbb{E}_\mu[c_t]$. Thus, the coefficients are non-negative in expectation, since $\mathbb{E}_\mu[\kappa_t] \ge \mathbb{E}_\mu[\rho_t] - \mathbb{E}_\mu[c_t] \ge 0$.

Thus, $\mathcal{T}V(x)$ is a linear combination of the values at the other states, weighted by non-negative coefficients whose sum is

$$(1 - \mathbb{E}_\mu\rho_0) + \sum_{t\ge 0}\gamma^{t+1}\,\mathbb{E}_\mu\Big[\Big(\prod_{s=0}^{t-1}c_s\Big)\kappa_t\Big] = \gamma^{-1} - (\gamma^{-1}-1)\,\mathbb{E}_\mu\Big[\sum_{t\ge 0}\gamma^t\Big(\prod_{s=0}^{t-2}c_s\Big)\rho_{t-1}\Big] = \eta$$
$$\le 1 - (1-\gamma)\,\mathbb{E}_\mu[\rho_0] \qquad (11)$$
$$\le 1 - (1-\gamma)\,\beta < 1, \qquad (12)$$

where Eq. 11 holds since we expanded only the first two elements in the sum (and all the elements in this sum are positive), and Eq. 12 holds by the assumption $\mathbb{E}_\mu\rho_0 \ge \beta$.

We deduce that $\|\mathcal{T}V_1 - \mathcal{T}V_2\|_\infty \le \eta\,\|V_1 - V_2\|_\infty$ with $\eta \le 1 - (1-\gamma)\beta < 1$, so $\mathcal{T}$ is a contraction mapping. Furthermore, we can see that the parameter $\alpha_\rho$ controls the contraction rate: for $\alpha_\rho = 1$ we get the contraction rate of V-trace, and as $\alpha_\rho$ gets smaller we get better contraction, since with $\alpha_\rho = 0$ (unclipped importance sampling, $\mathbb{E}_\mu\rho_0 = 1$) we get that $\eta \le \gamma$.
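A quick numeric sanity check of this effect (our own illustration, with made-up importance ratios): shrinking α_ρ leaks more of the untruncated ratio into ρ_0, raising E_μ[ρ_0] toward 1 and tightening the bound 1 − (1 − γ)E_μ[ρ_0].

```python
# Illustrative check: the contraction bound 1 - (1 - gamma) * E[rho_0]
# tightens monotonically as alpha_rho shrinks. Ratios are made up,
# chosen to have mean 1 as they do under the behaviour policy.
GAMMA, RHO_BAR = 0.99, 1.0
ratios = [0.25, 0.5, 1.0, 1.25, 2.0]

def eta_bound(alpha_rho):
    # E[rho_0] for leaky weights rho = alpha*min(rho_bar, IS) + (1-alpha)*IS
    e_rho = sum(alpha_rho * min(RHO_BAR, r) + (1 - alpha_rho) * r
                for r in ratios) / len(ratios)
    return 1 - (1 - GAMMA) * e_rho

bounds = [eta_bound(a) for a in (1.0, 0.5, 0.0)]
# Smaller alpha_rho -> larger E[rho_0] -> smaller (better) bound.
assert bounds[0] > bounds[1] > bounds[2]
```

At α_ρ = 0 the bound reduces to γ, matching the unclipped importance-sampling limit discussed above.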

Thus $\mathcal{T}$ possesses a unique fixed point. Let us now prove that this fixed point is $V^{\pi_{\bar\rho,\alpha_\rho}}$, where

$$\pi_{\bar\rho,\alpha_\rho}(a \mid x) = \frac{\alpha_\rho \min\big(\bar\rho\,\mu(a \mid x),\, \pi(a \mid x)\big) + (1-\alpha_\rho)\,\pi(a \mid x)}{\sum_b \alpha_\rho \min\big(\bar\rho\,\mu(b \mid x),\, \pi(b \mid x)\big) + (1-\alpha_\rho)\,\pi(b \mid x)}$$

is a policy that mixes the target policy with the V-trace policy.

We have:

$$\mathbb{E}_\mu\big[\rho_t\,\delta_t V \mid x_t\big] = \sum_a \mu(a \mid x_t)\Big(\alpha_\rho \min\Big(\bar\rho, \frac{\pi(a \mid x_t)}{\mu(a \mid x_t)}\Big) + (1-\alpha_\rho)\frac{\pi(a \mid x_t)}{\mu(a \mid x_t)}\Big)\,\mathbb{E}\big[\delta_t V \mid x_t, a\big]$$
$$= \sum_a \big(\alpha_\rho \min\big(\bar\rho\,\mu(a \mid x_t),\, \pi(a \mid x_t)\big) + (1-\alpha_\rho)\,\pi(a \mid x_t)\big)\,\mathbb{E}\big[\delta_t V \mid x_t, a\big],$$

where $\delta_t V = r_t + \gamma V^{\pi_{\bar\rho,\alpha_\rho}}(x_{t+1}) - V^{\pi_{\bar\rho,\alpha_\rho}}(x_t)$. Up to the positive normalization constant of $\pi_{\bar\rho,\alpha_\rho}$, the last expression equals $\sum_a \pi_{\bar\rho,\alpha_\rho}(a \mid x_t)\,\mathbb{E}[\delta_t V \mid x_t, a]$, which is zero since this is the Bellman equation for $\pi_{\bar\rho,\alpha_\rho}$. We deduce that $\mathcal{T}V^{\pi_{\bar\rho,\alpha_\rho}} = V^{\pi_{\bar\rho,\alpha_\rho}}$, and thus $V^{\pi_{\bar\rho,\alpha_\rho}}$ is the unique fixed point of $\mathcal{T}$.
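The fixed-point policy is a normalized mixture of the target policy and the V-trace policy, and the two limits are easy to check numerically. The sketch below is ours (illustrative distributions):

```python
# Sketch of the leaky V-trace fixed-point policy: a normalized mixture
# of the target policy pi and the V-trace policy min(rho_bar*mu, pi).
# alpha_rho = 0 recovers pi; alpha_rho = 1 recovers the V-trace policy.

def mixture_policy(pi, mu, rho_bar, alpha_rho):
    unnorm = [alpha_rho * min(rho_bar * m, p) + (1 - alpha_rho) * p
              for p, m in zip(pi, mu)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

pi = [0.7, 0.2, 0.1]   # target policy (illustrative)
mu = [0.1, 0.6, 0.3]   # behaviour policy (illustrative)

# alpha_rho = 0: no truncation, the fixed point is the target policy.
res = mixture_policy(pi, mu, rho_bar=1.0, alpha_rho=0.0)
assert all(abs(r - p) < 1e-12 for r, p in zip(res, pi))
# alpha_rho = 1: the V-trace policy, pulled toward the behaviour policy
# wherever pi/mu exceeds rho_bar.
vtrace = mixture_policy(pi, mu, rho_bar=1.0, alpha_rho=1.0)
assert abs(sum(vtrace) - 1.0) < 1e-12
```

With these numbers the α_ρ = 1 case gives [0.25, 0.5, 0.25]: the clipped actions (where π/μ > ρ̄) lose probability mass to the others, exactly the bias that shrinking α_ρ removes.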

8 Reproducibility

Inspecting the results in Fig. 4, one may notice small differences between the results of IMPALA, and of using metagradients to tune only $\{\gamma, \lambda\}$, compared to the results reported in (Xu et al., 2018).

We investigated the possible reasons for these differences. First, our method was implemented in a different code base: our code is written in JAX, whereas the implementation in (Xu et al., 2018) was written in TensorFlow. This may explain the small difference in final performance between our IMPALA baseline (Fig. 4) and the result of Xu et al., which is slightly higher (257.1).

Second, Xu et al. observed that embedding the hyperparameter $\gamma$ into the network improved their results significantly, reaching a final performance (when learning $\gamma$) of 287.7 (see section 1.4 in (Xu et al., 2018) for more details). Our method, on the other hand, only achieved a score of 240 in this ablative study. We further investigated this difference by introducing the embedding into our architecture. With embedding, our method achieved a score of 280.6, which almost reproduces the results in (Xu et al., 2018). We then introduced the same embedding mechanism to our model with auxiliary loss functions; in this case, for each auxiliary loss we embed its discount factor. We experimented with two variants, one that shares the embedding weights across the auxiliary tasks and one that learns a specific embedding for each auxiliary task. Both of these variants performed similarly (306.8 and 307.7, respectively), which is better than our previous result with embedding and without auxiliary losses (280.6). However, the auxiliary-loss architecture performed better still without the embedding (353.4), and we therefore ended up not using the embedding in our architecture. We leave it to future work to further investigate methods of combining the embedding mechanism with the auxiliary loss functions.
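The embedding mechanism discussed above feeds the current value of a learned hyperparameter back into the network as an input. The following is our own minimal sketch of that idea (weights, sizes, and names are illustrative, not the paper's code):

```python
# Sketch: condition a linear value head on the current value of a
# learned hyperparameter (e.g. gamma). All weights and sizes here are
# illustrative placeholders.

N_FEATURES, EMBED_DIM = 4, 2

# A (nominally learned) linear embedding of the scalar hyperparameter.
w_embed = [0.5, -0.25]
# Value-head weights over the concatenation [features, embedding].
w_value = [0.2, 0.1, -0.4, 0.3, 1.0, 0.6]

def value(features, gamma):
    # Embed the scalar gamma, concatenate with the state features,
    # and apply the linear value head.
    embedding = [w * gamma for w in w_embed]
    inputs = list(features) + embedding
    return sum(w * x for w, x in zip(w_value, inputs))

feats = [0.1, -0.3, 0.7, 0.2]
# The value prediction now depends on gamma, so metagradients with
# respect to gamma also flow through the value function.
v_low, v_high = value(feats, 0.95), value(feats, 0.99)
```

Because the prediction is an explicit function of the hyperparameter, changing the hyperparameter changes the network's output without retraining, which is what lets the metagradient see its effect.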

9 Individual level learning curves

Figure 12: Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads, blue is the main (policy) head.
Figure 13: Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads, blue is the main (policy) head.
Figure 14: Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads, blue is the main (policy) head.
Figure 15: Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads, blue is the main (policy) head.
Figure 16: Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads, blue is the main (policy) head.
Figure 17: Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads, blue is the main (policy) head.
Figure 18: Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads, blue is the main (policy) head.
Figure 19: Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads, blue is the main (policy) head.
Figure 20: Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads, blue is the main (policy) head.
Figure 21: Meta parameters and reward in each Atari game (and seed) during learning. Different colors correspond to different heads, blue is the main (policy) head.