1 Introduction
Reinforcement Learning (RL) algorithms often have multiple hyperparameters that require careful tuning; this is especially true for modern deep RL architectures, which often incorporate many modules and loss functions.
Training a deep RL agent is thus typically performed in two nested optimization loops. In the inner training loop, we fix a set of hyperparameters, and optimize the agent parameters with respect to these fixed hyperparameters. In the outer (manual or automated) tuning loop, we search for good hyperparameters, evaluating them either in terms of their inner loop performance, or in terms of their performance on some special validation data. Inner loop training is typically differentiable and can be efficiently performed using backpropagation, while the optimization of the outer loop is typically performed via gradient-free optimization. Since only the hyperparameters (but not the agent policy itself) are transferred between outer loop iterations, we refer to this as a "multiple lifetime" approach. For example, random search for hyperparameters (Bergstra and Bengio, 2012) falls under this category, and population based training (Jaderberg et al., 2017) also shares many of its properties. The cost of relying on multiple lifetimes to tune hyperparameters is often not accounted for when the performance of algorithms is reported. The impact of this is mild when users can rely on hyperparameters established in the literature, but the cost of hyperparameter tuning across multiple lifetimes manifests itself when algorithms are applied to new domains.
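As a concrete illustration of the two nested loops, the sketch below implements random search as the outer loop over a toy inner "lifetime". The hyperparameter names, their ranges, and the quadratic stand-in for inner-loop training are all illustrative assumptions, not part of any of the cited algorithms; the point is only that each candidate costs a full lifetime, and only the hyperparameters survive between iterations.

```python
import random

def train_agent(hyperparams, steps=50):
    """Inner loop (stand-in): return final performance for fixed hyperparameters.
    A toy quadratic 'return surface' replaces actual gradient-based training."""
    lr, discount = hyperparams["lr"], hyperparams["discount"]
    # Toy objective peaked at lr = 0.1, discount = 0.99 (arbitrary choices).
    return -((lr - 0.1) ** 2) - ((discount - 0.99) ** 2)

def random_search(num_lifetimes=100, seed=0):
    """Outer loop: each candidate is evaluated in a fresh lifetime;
    only the hyperparameters are carried between outer iterations."""
    rng = random.Random(seed)
    best_score, best_hp = float("-inf"), None
    for _ in range(num_lifetimes):
        hp = {"lr": rng.uniform(0.0, 0.5), "discount": rng.uniform(0.9, 1.0)}
        score = train_agent(hp)  # one full inner-loop training per candidate
        if score > best_score:
            best_score, best_hp = score, hp
    return best_hp, best_score

best_hp, best_score = random_search()
```

Note that the inner loop is rerun from scratch for every candidate; this multiplicative cost is exactly what the "multiple lifetime" label refers to.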
This motivates a significant body of work on tuning hyperparameters online, within a single agent lifetime. Previous work in this area has often focused on solutions tailored to specific hyperparameters. For instance, Schaul et al. (2019) proposed a nonstationary bandit algorithm to adapt the exploration-exploitation trade-off. Mann et al. (2016) and White and White (2016) proposed algorithms to adapt λ (the eligibility trace coefficient). Rowland et al. (2019) introduced an α coefficient into V-trace to account for the variance-contraction trade-off in off-policy learning. In another line of work, metagradients were used to adapt the optimiser's parameters (Sutton, 1992; Snoek et al., 2012; Maclaurin et al., 2015; Pedregosa, 2016; Franceschi et al., 2017; Young et al., 2018).

In this paper we build on the metagradient approach to tune hyperparameters. In previous work, online metagradients have been used to learn the discount factor γ or the λ coefficient (Xu et al., 2018), to discover intrinsic rewards (Zheng et al., 2018) and auxiliary tasks (Veeriah et al., 2019). This is achieved by representing an inner (training) loss function as a function of both the agent parameters and a set of hyperparameters. In the inner loop, the parameters of the agent are trained to minimize this inner loss function w.r.t. the current values of the hyperparameters; in the outer loop, the hyperparameters are adapted via backpropagation to minimize the outer (validation) loss.
It is perhaps surprising that we may choose to optimize a different loss function in the inner loop, instead of the outer loss we ultimately care about. However, this is not a new idea. Regularization, for example, is a technique that changes the objective in the inner loop to balance the bias-variance trade-off and avoid the risk of overfitting. In model-based RL, it was shown that the policy found using a smaller discount factor can actually be better than a policy learned with the true discount factor (Jiang et al., 2015). Auxiliary tasks (Jaderberg et al., 2016) are another example, where gradients are taken w.r.t. unsupervised loss functions in order to improve the agent's representation. Finally, it is well known that, in order to maximize the long term cumulative reward efficiently (the objective of the outer loop), RL agents must explore, i.e., act according to a different objective in the inner loop (accounting, for instance, for uncertainty).
This paper makes the following contributions. First, we show that it is feasible to use metagradients to simultaneously tune many critical hyperparameters (controlling important trade-offs in a reinforcement learning agent), as long as they are differentiable w.r.t. a validation/outer loss. Importantly, we show that this can be done online, within a single lifetime, requiring only additional compute.
We demonstrate this by introducing two novel deep RL architectures that extend IMPALA (Espeholt et al., 2018), a distributed actor-critic, by adding additional components with many more new hyperparameters to be tuned. The first agent, referred to as a Self-Tuning Actor-Critic (STAC), introduces a leaky V-trace operator that mixes importance sampling (IS) weights with truncated IS weights. The mixing coefficient in leaky V-trace is differentiable (unlike the original V-trace truncation) but similarly balances the variance-contraction trade-off in off-policy learning. The second architecture is STACX (STAC with auXiliary tasks). Inspired by Fedus et al. (2019), STACX augments STAC with parametric auxiliary loss functions (each with its own hyperparameters).
These agents allow us to show empirically, through extensive ablation studies, that performance consistently improves as we expose more hyperparameters to metagradients. In particular, when applied to 57 Atari games (Bellemare et al., 2013), STACX achieved a normalized median score of 364, a new state-of-the-art for online model-free agents.
2 Background
In the following, we consider three types of parameters:

– θ: the agent parameters.

– ζ: the hyperparameters.

– η: the metaparameters.

θ denotes the parameters of the agent and parameterises, for example, the value function and the policy; these parameters are randomly initialised at the beginning of an agent's lifetime, and updated using backpropagation on a suitable inner loss function. ζ denotes the hyperparameters, including, for example, the parameters of the optimizer (e.g. the learning rate) or the parameters of the loss function (e.g. the discount factor); these may be tuned over the course of many lifetimes (for instance via random search) to optimize an outer (validation) loss function. In a typical deep RL setup, only these first two types of parameters need to be considered. In metagradient algorithms a third set of parameters must be specified: the metaparameters, denoted η; these are a subset of the differentiable parameters in ζ that start with some initial value (itself a hyperparameter), but that are then adapted during the course of training.
2.1 The metagradient approach
Metagradient RL (Xu et al., 2018) is a general framework for adapting, online, within a single lifetime, the differentiable hyperparameters η. Consider an inner loss that is a function of both the parameters θ and the metaparameters η: L_inner(θ; η). On each step of an inner loop, θ can be optimized with a fixed η to minimize the inner loss:

θ_{t+1}(η_t) = θ_t − α ∇_{θ_t} L_inner(θ_t; η_t). (1)

In an outer loop, η can then be optimized to minimize the outer loss L_outer by taking a metagradient step. As θ_{t+1} is a function of η_t, this corresponds to updating η by differentiating the outer loss w.r.t. η:

η_{t+1} = η_t − β ∇_{η_t} L_outer(θ_{t+1}(η_t)). (2)
The algorithm is general, as it implements a specific case of online cross-validation, and can be applied, in principle, to any differentiable metaparameter used by the inner loss.
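The two updates above can be made concrete on a toy problem where both gradients are available in closed form. In the sketch below the inner loss pulls θ toward η, the outer loss measures the distance of the *updated* θ from a target θ*, and the metagradient of Eq. 2 is obtained by differentiating through the inner update of Eq. 1. The losses, learning rates, and target are illustrative assumptions, not the agent's actual losses.

```python
# Toy meta-gradient loop:
#   L_inner(theta; eta) = (theta - eta)^2   (inner loss, Eq. 1)
#   L_outer(theta')     = (theta' - theta_star)^2   (outer loss, Eq. 2)
alpha, beta = 0.1, 0.05          # inner and outer learning rates (assumed)
theta, eta, theta_star = 0.0, 0.0, 1.0

for _ in range(500):
    # Inner step (Eq. 1): theta_{t+1} = theta - alpha * dL_inner/dtheta.
    grad_inner = 2.0 * (theta - eta)
    theta_new = theta - alpha * grad_inner
    # Outer step (Eq. 2): differentiate L_outer(theta_{t+1}) w.r.t. eta
    # *through* the inner update; here dtheta_new/deta = 2 * alpha exactly.
    dtheta_new_deta = 2.0 * alpha
    grad_outer = 2.0 * (theta_new - theta_star) * dtheta_new_deta
    eta = eta - beta * grad_outer
    theta = theta_new
# Both theta and eta converge to theta_star: the metaparameter has been
# self-tuned so that the inner loop produces the behaviour the outer loss wants.
```

In a deep RL agent the chain rule step `dtheta_new_deta` is of course computed by automatic differentiation rather than by hand, but the structure of the two coupled updates is the same.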
2.2 Impala
Specific instantiations of the metagradient RL framework require specification of the inner and outer loss functions. Since our agent builds on the IMPALA actor-critic agent (Espeholt et al., 2018), we now provide a brief introduction.
IMPALA maintains a policy π_θ and a value function V_θ that are parameterized with parameters θ. The policy and the value function are trained via an actor-critic update with entropy regularization; such an update is often represented (with slight abuse of notation) as the gradient of the following pseudo-loss function:

L(θ) = g_v Σ_s (v_s − V_θ(x_s))² − g_p Σ_s ρ_s log π_θ(a_s|x_s) (r_s + γ v_{s+1} − V_θ(x_s)) − g_e Σ_s H(π_θ(·|x_s)), (3)

where g_v, g_p, g_e are suitable loss coefficients and H denotes the entropy. We refer to the policy μ that generates the data for these updates as the behaviour policy. In the on-policy case, where μ = π_θ and ρ_s = c_i = 1, v_s is the n-step bootstrapped return v_s = Σ_{t=s}^{s+n−1} γ^{t−s} r_t + γ^n V(x_{s+n}).
IMPALA uses a distributed actor-critic architecture that assigns copies of the policy parameters to multiple actors on different machines to achieve higher sample throughput. As a result, the target policy π on the learner machine can be several updates ahead of the actor's policy μ that generated the data used in an update. Such off-policy discrepancy can lead to biased updates, requiring the updates to be weighted with importance sampling (IS) weights for stable learning. Specifically, IMPALA (Espeholt et al., 2018) uses truncated IS weights to balance the variance-contraction trade-off in these off-policy updates. This corresponds to instantiating Eq. 3 with

v_s = V(x_s) + Σ_{t=s}^{s+n−1} γ^{t−s} (∏_{i=s}^{t−1} c_i) ρ_t δ_t V, (4)

where we define the temporal difference δ_t V = r_t + γ V(x_{t+1}) − V(x_t), and we set ρ_t = min(ρ̄, π(a_t|x_t)/μ(a_t|x_t)) and c_i = λ min(c̄, π(a_i|x_i)/μ(a_i|x_i)) for suitable truncation levels ρ̄ ≥ c̄.
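The V-trace target in Eq. 4 is commonly computed with a backward recursion, v_s = V(x_s) + ρ_s δ_s V + γ c_s (v_{s+1} − V(x_{s+1})), which is equivalent to the nested products above. The sketch below is a minimal scalar version; the function and argument names are our own, and production implementations are batched and vectorized.

```python
import numpy as np

def vtrace_targets(values, rewards, rho, gamma, lam=1.0,
                   rho_bar=1.0, c_bar=1.0, bootstrap_value=0.0):
    """Compute the n-step V-trace targets v_s of Eq. 4 along one trajectory.

    values:  V(x_s) for s = 0..T-1;  rewards: r_s;  rho: raw IS ratios pi/mu.
    Truncated weights: rho_t = min(rho_bar, IS_t), c_i = lam * min(c_bar, IS_i).
    """
    T = len(rewards)
    values_ext = np.append(values, bootstrap_value)
    rho_t = np.minimum(rho_bar, rho)
    c_t = lam * np.minimum(c_bar, rho)
    # Weighted temporal differences rho_t * delta_t V,
    # with delta_t V = r_t + gamma*V(x_{t+1}) - V(x_t).
    deltas = rho_t * (rewards + gamma * values_ext[1:] - values_ext[:-1])
    # Backward recursion: v_s - V(x_s) = rho_s*delta_s + gamma*c_s*(v_{s+1} - V(x_{s+1})).
    acc = 0.0
    corrections = np.zeros(T)
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * c_t[t] * acc
        corrections[t] = acc
    return values + corrections

# On-policy sanity check (all IS ratios equal 1): the targets reduce to
# plain n-step returns, e.g. [1.75, 1.5, 1.0] for unit rewards and gamma = 0.5.
targets = vtrace_targets(values=np.zeros(3), rewards=np.ones(3),
                         rho=np.ones(3), gamma=0.5)
```

Note also that because of the `min` truncations, the targets are unchanged whether the raw ratios are 1 or much larger than the truncation levels; this is exactly the variance-reduction effect discussed below.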
2.3 Metagradient IMPALA
The metagradient agent in (Xu et al., 2018) uses the metagradient update rules from the previous section with the actor-critic loss function of Espeholt et al. (2018). More specifically, the inner loss is a parameterised version of the IMPALA loss from Eq. 3 with metaparameters η = {γ, λ}, where the loss coefficients remain fixed. Notice that γ and λ also affect the inner loss through the definition of v_s (Eq. 4).

The outer loss is defined to be the policy gradient loss.
3 SelfTuning actorcritic agents
We first consider a slightly extended version of the metagradient IMPALA agent (Xu et al., 2018). Specifically, we allow the metagradient to adapt the loss coefficients, which act as learning rates for the individual components of the loss:

L_inner(θ; η) = g_v L_V(θ) + g_p L_π(θ) + g_e L_H(θ), with η = {γ, λ, g_v, g_p, g_e}. (5)

The outer loss is defined using Eq. 3, with fixed outer hyperparameters, plus a regularization term:

L_outer(θ) = g_v^outer L_V(θ) + g_p^outer L_π(θ) + g_e^outer L_H(θ) + g_kl^outer L_KL(θ), (6)

where L_KL is a Kullback–Leibler (KL) divergence between the policy before and after the inner update. Notice this new KL term in Eq. 6, encouraging the update not to change the policy too much.
Compared to the work by Xu et al. (2018) that only self-tuned the inner-loss γ and λ (hence η = {γ, λ}), also self-tuning the inner-loss coefficients corresponds to setting η = {γ, λ, g_v, g_p, g_e}. These metaparameters allow for loss-specific learning rates and support dynamically balancing exploration with exploitation by adapting the entropy loss weight.¹ The hyperparameters of STAC include the initialisations of the metaparameters, the hyperparameters of the outer loss, the KL coefficient, and the learning rate of the ADAM meta-optimizer.

¹There are a few additional subtle differences between this self-tuning IMPALA agent and the metagradient agent from Xu et al. (2018). For example, we do not use the γ embedding used in (Xu et al., 2018). These differences are further discussed in the supplementary material, where we also reproduce the results of Xu et al. (2018) in our code base.
(Table: hyperparameters of IMPALA vs. self-tuning IMPALA, listing the metaparameter initialisations and the ADAM parameters.)
To set the initial values of the metaparameters of self-tuning IMPALA, we use a simple "rule of thumb" and set them to the values of the corresponding parameters in the outer loss (e.g., the initial value of the inner-loss discount γ is set equal to the outer-loss γ). For outer-loss hyperparameters that are common to IMPALA, we default to the IMPALA settings.
In the next two sections, we show how embracing self-tuning via metagradients enables us to augment this agent with a parameterised Leaky V-trace operator and with self-tuned auxiliary loss functions. These ideas are examples of how the ability to self-tune metaparameters via metagradients can be used to introduce novel ideas into RL algorithms without requiring extensive tuning of the new hyperparameters.
3.1 STAC
All the hyperparameters that we have considered for self-tuning so far have the property that they appear explicitly in the definition of the loss function and can be directly differentiated. The truncation levels ρ̄ and c̄ in the V-trace operator within IMPALA, on the other hand, are equivalent to applying a ReLU activation and are non-differentiable.
Motivated by the study of non-linear activations in deep learning (Xu et al., 2015), we now introduce an agent based on a variant of the V-trace operator that we call leaky V-trace. We will refer to this agent as the Self-Tuning Actor-Critic (STAC). Leaky V-trace uses a leaky rectifier (Maas et al., 2013) to truncate the importance sampling weights, which allows for a small nonzero gradient when the unit is saturated. We show that the degree of leakiness can control certain trade-offs in off-policy learning, similarly to V-trace, but in a manner that is differentiable.

Before we introduce Leaky V-trace, let us first recall how the off-policy trade-offs are represented in V-trace using the coefficients ρ̄ and c̄. The weight ρ_t appears in the definition of the temporal difference δ_t V and defines the fixed point of this update rule. The fixed point of this update is the value function V^{π_ρ̄} of a policy π_ρ̄ that is somewhere between the behaviour policy μ and the target policy π, controlled by the hyperparameter ρ̄:

π_ρ̄(a|x) = min(ρ̄ μ(a|x), π(a|x)) / Σ_b min(ρ̄ μ(b|x), π(b|x)). (7)
The product of the weights c_s, …, c_{t−1} in Eq. 4 measures how much a temporal difference δ_t V observed at time t impacts the update of the value function at time s. The truncation level c̄ is used to control the speed of convergence by trading off the update variance for a larger contraction rate, similar to Retrace (Munos et al., 2016). By clipping the importance weights, the variance associated with the update rule is reduced relative to importance-weighted returns. On the other hand, the clipping of the importance weights effectively cuts the traces in the update, resulting in the update placing less weight on later TD errors, and thus worsening the contraction rate of the corresponding operator.
Following this interpretation of the off-policy coefficients, we now propose a variation of V-trace which we call leaky V-trace, with new parameters α_ρ and α_c:

ρ_t = α_ρ min(ρ̄, IS_t) + (1 − α_ρ) IS_t,
c_i = λ (α_c min(c̄, IS_i) + (1 − α_c) IS_i), where IS_t = π(a_t|x_t)/μ(a_t|x_t). (8)
We highlight that for α_ρ = α_c = 1, Leaky V-trace is exactly equivalent to V-trace, while for α_ρ = α_c = 0 it is equivalent to canonical importance sampling. For intermediate values we get a mixture of the truncated and non-truncated importance sampling weights.
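The leaky weights of Eq. 8 are a one-liner; the sketch below (with variable names of our own, and taking λ = 1 for simplicity) checks the two limiting cases just discussed.

```python
import numpy as np

def leaky_weights(is_ratios, alpha_rho, alpha_c, rho_bar=1.0, c_bar=1.0):
    """Leaky V-trace weights (Eq. 8): a differentiable mix of truncated
    and untruncated importance sampling weights (lambda taken as 1)."""
    is_ratios = np.asarray(is_ratios, dtype=float)
    rho = alpha_rho * np.minimum(rho_bar, is_ratios) + (1 - alpha_rho) * is_ratios
    c = alpha_c * np.minimum(c_bar, is_ratios) + (1 - alpha_c) * is_ratios
    return rho, c

ratios = np.array([0.5, 1.0, 2.0, 4.0])
rho_vtrace, _ = leaky_weights(ratios, alpha_rho=1.0, alpha_c=1.0)  # pure V-trace
rho_is, _ = leaky_weights(ratios, alpha_rho=0.0, alpha_c=0.0)      # canonical IS
# rho_vtrace clips at rho_bar; rho_is keeps the raw (unbounded) ratios.
```

Unlike the hard `min` truncation alone, the mixture has a nonzero derivative w.r.t. the ratio even when the weight is saturated, which is what makes α self-tunable by metagradients.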
Theorem 1 below states that Leaky V-trace is a contraction mapping, and that the value function it converges to is given by V^{π_{ρ̄,α_ρ}}, where

π_{ρ̄,α_ρ}(a|x) = [α_ρ min(ρ̄ μ(a|x), π(a|x)) + (1 − α_ρ) π(a|x)] / [α_ρ Σ_b min(ρ̄ μ(b|x), π(b|x)) + (1 − α_ρ)] (9)

is a policy that mixes (and then renormalizes) the target policy with the V-trace policy of Eq. 7.² A more formal statement of Theorem 1 and a detailed proof (which closely follows that of Espeholt et al. (2018) for the original V-trace operator) can be found in the supplementary material.

²Note that α-trace (Rowland et al., 2019), another adaptive algorithm for off-policy learning, mixes the V-trace policy with the behaviour policy; Leaky V-trace mixes it with the target policy.
Theorem 1.
The leaky V-trace operator defined by Eq. 8 is a contraction operator, and it converges to the value function of the policy defined by Eq. 9.
Similar to ρ̄, the new parameter α_ρ controls the fixed point of the update rule, and defines a value function that interpolates between the value function of the target policy π and that of the behaviour policy μ. Specifically, the parameter α_c allows the importance weights to "leak back", creating the opposite effect to clipping. Since Theorem 1 requires us to have α_ρ ≥ α_c, our main STAC implementation parametrises the loss with a single parameter α = α_ρ = α_c. In addition, we also experimented with a version of STAC that learns both α_ρ and α_c. Quite interestingly, this variation of STAC learns the rule α_ρ ≥ α_c on its own (see the experiments section for more details).
Note that low values of α lead to importance sampling, which has high contraction but also high variance. On the other hand, high values of α lead to V-trace, which has lower contraction and lower variance than importance sampling. Thus, exposing α to meta-learning enables STAC to directly control the contraction/variance trade-off.
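For a discrete action space, the fixed-point policy of Eq. 9 can be computed directly; the sketch below (with an arbitrary example pair of policies, and names of our own) illustrates how α_ρ interpolates between the V-trace policy of Eq. 7 and the target policy.

```python
import numpy as np

def leaky_vtrace_policy(pi, mu, rho_bar=1.0, alpha_rho=1.0):
    """Fixed-point policy of leaky V-trace (Eq. 9): a renormalized mix of the
    V-trace policy (clipped at rho_bar * mu) and the target policy pi."""
    clipped = np.minimum(rho_bar * mu, pi)
    unnorm = alpha_rho * clipped + (1 - alpha_rho) * pi
    return unnorm / unnorm.sum()

pi = np.array([0.7, 0.2, 0.1])    # target policy (made-up example)
mu = np.array([1/3, 1/3, 1/3])    # behaviour policy

p_vtrace = leaky_vtrace_policy(pi, mu, alpha_rho=1.0)  # V-trace policy (Eq. 7)
p_target = leaky_vtrace_policy(pi, mu, alpha_rho=0.0)  # recovers pi exactly
# With alpha_rho = 1, probability mass on the favourite action is pulled
# toward the behaviour policy; with alpha_rho = 0 the target policy is intact.
```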
In summary, the metaparameters for STAC are η = {γ, λ, g_v, g_p, g_e, α}. To keep things simple, when using Leaky V-trace we make two simplifications w.r.t. the hyperparameters. First, we use V-trace to initialise Leaky V-trace, i.e., we initialise α = 1. Second, we fix the outer loss to be V-trace, i.e., we set α = 1 in the outer loss.
3.2 STAC with auxiliary tasks (STACX)
Next, we introduce a new agent that extends STAC with auxiliary policies, value functions, and respective auxiliary loss functions; this is new because the parameters that define the auxiliary tasks (the discount factors, in this case) are self-tuned. As this agent has a new architecture in addition to an extended set of metaparameters, we give it a different acronym and denote it by STACX (STAC with auXiliary tasks). The auxiliary losses have the same parametric form as the main objective and can be used to regularize it and improve the agent's representations.
STACX's architecture has a shared representation layer, from which it splits into different heads (Fig. 1). For the shared representation layer we use the deep residual net from (Espeholt et al., 2018). Each head i has a policy π_i and a corresponding value function V_i that are represented using an MLP with parameters θ_i. Each one of these heads is trained in the inner loop to minimize a loss function parametrised by its own set of metaparameters η_i.
The policy of the STACX agent is defined to be the policy of a specific head (π_1). The hyperparameters are trained in the outer loop to improve the performance of this single head. Thus, the role of the auxiliary heads is to act as auxiliary tasks (Jaderberg et al., 2016) and improve the shared representation. Finally, notice that each head has its own policy π_i, but the behaviour policy is fixed to be μ = π_1. Thus, to optimize the auxiliary heads we use (Leaky) V-trace for off-policy corrections.³

³We also considered two extensions of this approach. (1) Random ensemble: the policy head is chosen at random from the available heads, and the hyperparameters are differentiated w.r.t. the performance of each one of the heads in the outer loop. (2) Average ensemble: the actor policy is defined to be the average logits of the heads, and we learn one additional head for the value function of this policy. The metagradient in the outer loop is taken with respect to the actor policy, and/or each one of the heads individually. While these extensions seem interesting, in all of our experiments they always led to a small decrease in performance when compared to our auxiliary task agent without these extensions. Similar findings were reported in (Fedus et al., 2019).

The metaparameters for STACX are η = {γ_i, λ_i, g_v^i, g_p^i, g_e^i, α_i}, one set per head i. Since the outer loss is defined only w.r.t. head 1, introducing the auxiliary tasks into STACX does not require new hyperparameters for the outer loss. In addition, we use the same initialisation values for all the auxiliary tasks. Thus, STACX has exactly the same hyperparameters as STAC.
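The head structure can be sketched as follows. The layer sizes, the tanh torso standing in for the residual network, and the initial metaparameter values are illustrative assumptions; no training is shown, only the forward pass that makes the shared representation visible to all heads while only head 0 acts.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hidden, num_actions, num_heads = 8, 16, 4, 3

# Shared representation (stand-in for the deep residual torso).
W_shared = rng.normal(0, 0.1, (obs_dim, hidden))
# One policy/value head per task; head 0 drives behaviour.
heads = [{"W_pi": rng.normal(0, 0.1, (hidden, num_actions)),
          "W_v": rng.normal(0, 0.1, (hidden, 1))} for _ in range(num_heads)]
# Each head carries its own self-tuned metaparameters (illustrative values).
metaparams = [{"gamma": 0.995, "g_v": 0.25, "g_p": 1.0, "g_e": 0.01}
              for _ in range(num_heads)]

def forward(obs, head):
    h = np.tanh(obs @ W_shared)          # shared features, used by every head
    logits = h @ head["W_pi"]
    policy = np.exp(logits - logits.max())
    policy /= policy.sum()
    value = (h @ head["W_v"]).item()
    return policy, value

obs = rng.normal(size=obs_dim)
behaviour_policy, _ = forward(obs, heads[0])       # mu = pi_1: only head 0 acts
aux_outputs = [forward(obs, h) for h in heads[1:]] # auxiliary heads share the torso
```

Because all heads backpropagate through `W_shared` during (not shown) inner-loop training, the auxiliary losses shape the shared representation even though their policies never act.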
4 Experiments
In all of our experiments (with the exception of the robustness experiments in Section 4.3) we use the IMPALA hyperparameters both for the IMPALA baseline and for the outer loss of the STAC agent. The one exception is the discount factor, where we use γ = 0.995, as it was found to improve the performance of IMPALA considerably (Xu et al., 2018).
4.1 Atari learning curves
We start by evaluating STAC and STACX in the Arcade Learning Environment (Bellemare et al., 2013, ALE). Fig. 2 presents the normalized median scores⁴ during training. We found STACX to learn faster and achieve higher final performance than STAC. We also compare these agents with versions of them without self-tuning (fixing the metaparameters). The version of STACX with fixed unsupervised auxiliary tasks achieved a normalized median score similar to that of UNREAL (Jaderberg et al., 2016), but not much better than IMPALA. In Fig. 3 we report the relative improvement of STACX over IMPALA in the individual levels (an equivalent figure for STAC may be found in the supplementary material).

⁴Normalized median scores are computed as follows. For each Atari game, we compute the human-normalized score after 200M frames of training and average this over 3 different seeds; we then report the overall median score over the Atari domains.
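For reference, the footnote's aggregation can be sketched as below, under the usual ALE convention of human-normalized scores expressed in percent; the per-game numbers are made up purely for illustration.

```python
import numpy as np

def human_normalized_median(agent, random, human):
    """Median over games of 100 * (agent - random) / (human - random),
    the standard ALE human-normalized median score."""
    agent, random, human = map(np.asarray, (agent, random, human))
    normalized = 100.0 * (agent - random) / (human - random)
    return float(np.median(normalized))

# Hypothetical per-game scores (agent averaged over seeds) for illustration.
score = human_normalized_median(agent=[120.0, 9000.0, 35.0],
                                random=[20.0, 500.0, 10.0],
                                human=[120.0, 4000.0, 60.0])
```

The median (rather than the mean) is used because a handful of games with enormous normalized scores would otherwise dominate the aggregate.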
STACX achieved a normalized median score of 364, a new state-of-the-art result in the ALE benchmark for training online model-free agents for 200M frames. In fact, there are only two agents that reported better performance after 200M frames: LASER (Schmitt et al., 2019) and MuZero (Schrittwieser et al., 2019). These papers propose algorithmic modifications that are orthogonal to our approach and could be combined with it in future work; LASER combines IMPALA with a uniform large-scale experience replay; MuZero uses replay and a tree-based search with a learned model.
4.2 Ablative analysis
Next, we perform an ablative study of our approach by training different variations of STAC and STACX. The results are summarized in Fig. 4. In red, we show different baselines⁵, and in green and blue, different ablations of STAC and STACX, respectively. In these ablations, η corresponds to the subset of the hyperparameters that are being self-tuned; e.g., η = {γ} corresponds to self-tuning a single loss function where only the discount factor γ is self-tuned (and the other hyperparameters are fixed).

⁵We further discuss reproducing the results of Xu et al. (2018) in the supplementary material.
Inspecting Fig. 4, we observe that the performance of STAC and STACX consistently improves as they self-tune more metaparameters. These metaparameters control different trade-offs in reinforcement learning: the discount factor γ controls the effective horizon, the loss coefficients act as learning rates, and the Leaky V-trace coefficient α controls the variance-contraction-bias trade-off in off-policy RL.
We have also experimented with different numbers of auxiliary loss functions. These variations performed better than having a single loss function, but slightly worse than the default configuration. This can be further explained by Fig. 9 (Section 4.4), which shows that the auxiliary heads are self-tuned to similar metaparameters.
Finally, we experimented with a few more versions of STACX. One variation allows STACX to self-tune both α_ρ and α_c without enforcing the relation α_ρ ≥ α_c. This version performed slightly worse than STACX, and we further discuss it in Section 4.4 (Fig. 10). In another variation, we also exposed the truncation levels themselves to self-tuning. This variation performed much worse, which may be explained by the truncation not being differentiable.
In Fig. 5, we further summarize the relative improvement of ablative variations of STAC (green, yellow, and blue; the bottom three lines) and STACX (light blue and red; the top two lines) over the IMPALA baseline. For each value of d (the x axis), we measure the number of games in which an ablative version of the STAC(X) agent is better than the IMPALA agent by at least d percent, and subtract from it the number of games in which the IMPALA agent is better than STAC(X) by at least d percent. Clearly, we can see that STACX (light blue) improves the performance of IMPALA by a large margin. Moreover, we observe that allowing the STAC(X) agents to self-tune more metaparameters consistently improves performance in more games.
4.3 Robustness
It is important to note that our algorithm is not hyperparameter-free. For instance, we still need to choose hyperparameter settings for the outer loss. Additionally, each hyperparameter in the inner loss that we expose to metagradients still requires an initialization (itself a hyperparameter). Therefore, in this section, we investigate the robustness of STACX to its hyperparameters.
We begin with the hyperparameters of the outer loss. In these experiments we compare the robustness of STACX with that of IMPALA in the following manner. For each hyperparameter we select a set of perturbations: for STACX we perturb the hyperparameter in the outer loss, and for IMPALA we perturb the corresponding hyperparameter directly. We randomly selected a subset of the Atari levels and present the mean and standard deviation across random seeds.

Fig. 6 presents the results for the discount factor. We can see that in most of the configurations (measured by mean), STACX indeed performs better than IMPALA. Similarly, Fig. 7 shows the robustness of STACX to the critic weight g_v, where STACX improves over IMPALA in most of the configurations.
Next, we investigate the robustness of STACX to the initialisation of the metaparameters in Fig. 8. We selected initialisation values close to the corresponding outer-loss hyperparameters, as our design principle is to initialise the metaparameters to be similar to the hyperparameters in the outer loss. We observe that, overall, the method is quite robust to different initialisations.
4.4 Adaptivity
In Fig. 9 we visualize the metaparameters of STACX during training. As there are many metaparameters, seeds, and levels, we restrict ourselves to a single seed (chosen arbitrarily to be seed 1) and a single game (Jamesbond). More examples can be found in the supplementary material. For each metaparameter we plot the values associated with the three different heads, where the policy head (head number 1) is presented in blue and the auxiliary heads (2 and 3) are presented in orange and magenta.
Inspecting Fig. 9, we can see that the two auxiliary heads self-tuned their metaparameters to relatively similar values, but different from those of the main head. The discount factor of the main head, for example, converges to the value of the discount factor in the outer loss (0.995), while the discount factors of the auxiliary heads change quite a lot during training and learn about horizons that differ from that of the main head.
We also observe non-trivial behaviour in the self-tuning of the loss coefficients, the λ coefficient, and the off-policy coefficient α. For instance, we found that at the beginning of training α is self-tuned to a high value (close to 1), so the update is quite similar to V-trace; towards the end of training, STACX self-tunes α to lower values, which makes the update closer to importance sampling.
Finally, we also noticed an interesting behaviour in the version of STACX where we expose both the α_ρ and α_c coefficients to self-tuning, without imposing α_ρ ≥ α_c (as required by Theorem 1). This variation of STACX achieved a slightly lower median score. Quite interestingly, the metagradient discovered the rule on its own: it self-tunes the coefficients so that α_ρ is greater than or equal to α_c for the large majority of the time (averaged over time, seeds, and levels). Fig. 10 shows an example of this in Jamesbond.
5 Summary
In this work we demonstrated that it is feasible to use metagradients to simultaneously tune many critical hyperparameters (controlling important trade-offs in a reinforcement learning agent), as long as they are differentiable; we showed that this can be done online, within a single lifetime. We did so by presenting STAC and STACX, actor-critic algorithms that self-tune a large number of hyperparameters of very different natures. We showed that the performance of these agents improves as they self-tune more hyperparameters, and we demonstrated that STAC and STACX are computationally efficient and robust to their own hyperparameters.
References

Bellemare et al. (2013). The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, pp. 253–279.
Bergstra and Bengio (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(1), pp. 281–305.
Espeholt et al. (2018). IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561.
Fedus et al. (2019). Hyperbolic discounting and learning over multiple horizons. arXiv preprint arXiv:1902.06865.
Franceschi et al. (2017). Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning, pp. 1165–1173.
Jaderberg et al. (2017). Population based training of neural networks. arXiv preprint arXiv:1711.09846.
Jaderberg et al. (2016). Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397.
Jiang et al. (2015). The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 1181–1189.
Maas et al. (2013). Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing.
Maclaurin et al. (2015). Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122.
Mann et al. (2016). Adaptive lambda least-squares temporal difference learning. arXiv preprint arXiv:1612.09465.
Munos et al. (2016). Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062.
Pedregosa (2016). Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pp. 737–746.
Rowland et al. (2019). Adaptive trade-offs in off-policy learning. arXiv preprint arXiv:1910.07478.
Schaul et al. (2019). Adapting behaviour for learning progress. arXiv preprint arXiv:1912.06910.
Schmitt et al. (2019). Off-policy actor-critic with shared experience replay. arXiv preprint arXiv:1909.11583.
Schrittwieser et al. (2019). Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265.
Snoek et al. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959.
Sutton (1992). Adapting bias by gradient descent: an incremental version of delta-bar-delta. In AAAI, pp. 171–176.
Veeriah et al. (2019). Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems, pp. 9306–9317.
White and White (2016). A greedy approach to adapting the trace parameter for temporal difference learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 557–565.
Xu et al. (2015). Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853.
Xu et al. (2018). Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2396–2407.
Young et al. (2018). Metatrace: online step-size tuning by meta-gradient descent for reinforcement learning control. arXiv preprint arXiv:1805.04514.
Zheng et al. (2018). On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 4644–4654.
6 Additional results
7 Analysis of Leaky Vtrace
Define the Leaky V-trace operator T:

T V(x_s) := V(x_s) + E_μ [ Σ_{t≥s} γ^{t−s} (∏_{i=s}^{t−1} c_i) ρ_t (r_t + γ V(x_{t+1}) − V(x_t)) ], (10)

where the expectation E_μ is with respect to the behaviour policy μ which has generated the trajectory (x_t, a_t, r_t)_{t≥s}, i.e., a_t ∼ μ(·|x_t). Similar to (Espeholt et al., 2018), we consider the infinite-horizon operator, but very similar results hold for the n-step truncated operator.
Let IS_t = π(a_t|x_t) / μ(a_t|x_t) be the importance sampling weights, let ρ̂_t = min(ρ̄, IS_t) and ĉ_t = min(c̄, IS_t) be the truncated importance sampling weights with ρ̄ ≥ c̄, and let

ρ_t = α_ρ ρ̂_t + (1 − α_ρ) IS_t,   c_t = λ (α_c ĉ_t + (1 − α_c) IS_t), with λ ∈ [0, 1],

be the Leaky importance sampling weights with leaky coefficients α_ρ ≥ α_c.
Theorem 2 (Restatement of Theorem 1).
Assume that there exists β ∈ (0, 1] such that E_μ[ρ_0] ≥ β. Then the operator T defined by Eq. 10 has a unique fixed point V^{π̃}, which is the value function of the policy π̃ defined by Eq. 13 below.

Furthermore, T is an η-contraction mapping in sup-norm, with

η := 1 − (1 − γ) Σ_{t≥s} γ^{t−s} E_μ [ (∏_{i=s}^{t−1} c_i) ρ_t ] ≤ 1 − (1 − γ)β < 1,

where we use the convention ∏_{i=s}^{s−1} c_i = 1 for t = s.
Proof.
The proof follows the proof of V-trace from (Espeholt et al., 2018), with adaptations for the leaky V-trace coefficients. We have that

T V(x_s) − T V'(x_s) = E_μ Σ_{t≥s} γ^{t−s} (∏_{i=s}^{t−2} c_i) (ρ_{t−1} − c_{t−1} ρ_t) [V(x_t) − V'(x_t)],

with the conventions ρ_{s−1} = 1 and ∏_{i=s}^{t−2} c_i = 1 for t ≤ s + 1. Denote the coefficients by κ_t := (∏_{i=s}^{t−2} c_i)(ρ_{t−1} − c_{t−1} ρ_t), and notice that E_μ[IS_t | x_t] = 1 and min(ρ̄, IS_t) ≤ IS_t, and therefore E_μ[ρ_t | x_t] = α_ρ E_μ[ρ̂_t | x_t] + (1 − α_ρ) ≤ 1. Furthermore, since c̄ ≤ ρ̄, λ ≤ 1, and α_ρ ≥ α_c, we have that E_μ[c_{t−1} ρ_t] ≤ E_μ[ρ_{t−1}]. Thus, the coefficients κ_t are non-negative in expectation.

Thus, T V(x_s) is a linear combination of the values at the other states, weighted by non-negative coefficients whose sum is

Σ_{t≥s} γ^{t−s} E_μ[κ_t] = 1 − (1 − γ) Σ_{t≥s} γ^{t−s} E_μ[(∏_{i=s}^{t−1} c_i) ρ_t] (11)

≤ 1 − (1 − γ) E_μ[ρ_s] ≤ 1 − (1 − γ)β < 1, (12)

where Eq. 11 follows from rearranging the telescoping sum, and Eq. 12 holds since all the elements in the sum are non-negative (so we may keep only the leading element) together with the assumption E_μ[ρ_s] ≥ β.

We deduce that ‖T V − T V'‖_∞ ≤ η ‖V − V'‖_∞ with η ≤ 1 − (1 − γ)β < 1, so T is a contraction mapping. Furthermore, we can see that the parameter α_ρ controls the contraction rate: for α_ρ = α_c = 1 we recover the contraction rate of V-trace, and smaller values give better contraction; with α_ρ = α_c = 0 and λ = 1 (canonical importance sampling), η = 0 and the fixed point is reached after a single application of T in expectation.
Thus T possesses a unique fixed point. Let us now prove that this fixed point is V^{π̃}, where

π̃(a|x) = [α_ρ min(ρ̄ μ(a|x), π(a|x)) + (1 − α_ρ) π(a|x)] / [α_ρ Σ_b min(ρ̄ μ(b|x), π(b|x)) + (1 − α_ρ)] (13)

is a policy that mixes the target policy with the V-trace policy.

We have:

E_μ [ρ_t (r_t + γ V^{π̃}(x_{t+1}) − V^{π̃}(x_t)) | x_t]
= Σ_a μ(a|x_t) [α_ρ min(ρ̄, π(a|x_t)/μ(a|x_t)) + (1 − α_ρ) π(a|x_t)/μ(a|x_t)] (q(x_t, a) − V^{π̃}(x_t))
= Σ_a [α_ρ min(ρ̄ μ(a|x_t), π(a|x_t)) + (1 − α_ρ) π(a|x_t)] (q(x_t, a) − V^{π̃}(x_t)),

where q(x_t, a) := E[r_t + γ V^{π̃}(x_{t+1}) | x_t, a]. Up to the (positive) normalizing constant of Eq. 13, the last expression equals Σ_a π̃(a|x_t)(q(x_t, a) − V^{π̃}(x_t)) = 0, since this is the Bellman equation for V^{π̃}. We deduce that T V^{π̃} = V^{π̃}; thus, V^{π̃} is the unique fixed point of T. ∎
8 Reproducibility
Inspecting the results in Fig. 4, one may notice that there are small differences between the results of IMPALA, and of using metagradients to tune only {γ, λ}, compared to the results that were reported in (Xu et al., 2018).

We investigated the possible reasons for these differences. First, our method was implemented in a different code base. Our code is written in JAX, compared to the implementation in (Xu et al., 2018) that was written in TensorFlow. This may explain the small difference in final performance between our IMPALA baseline (Fig. 4) and the result of Xu et al., which is slightly higher (257.1).

Second, Xu et al. observed that embedding the hyperparameter γ into the network improved their results significantly, reaching a final performance (when learning {γ, λ}) of 287.7 (see section 1.4 in (Xu et al., 2018) for more details). Our method, on the other hand, only achieved a score of 240 in this ablative study. We further investigated this difference by introducing the embedding into our architecture. With embedding, our method achieved a score of 280.6, which almost reproduces the results in (Xu et al., 2018). We then introduced the same embedding mechanism to our model with auxiliary loss functions; in this case, for each auxiliary loss i we embed γ_i. We experimented with two variants, one that shares the embedding weights across the auxiliary tasks and one that learns a specific embedding for each auxiliary task. Both of these variants performed similarly (306.8 and 307.7, respectively), which is better than our previous result with embedding and without auxiliary losses (280.6). However, the auxiliary loss architecture actually performed better without the embedding (353.4), and we therefore ended up not using the embedding in our architecture. We leave it to future work to further investigate methods of combining the embedding mechanism with the auxiliary loss functions.