1 Background
Brief background on GVFs: Standard value functions in RL define a question and its answer; the question is "what is the discounted sum of future rewards under some policy?" and the answer is the approximate value function. Generalized value functions, or GVFs, generalize the standard value function by allowing arbitrary cumulant functions of states in place of rewards, and are specified by the combination of such a cumulant function with a discount factor and a policy. This generalization allows GVFs to express quite general predictive knowledge and, notably, temporal-difference (TD) methods for learning value functions can be extended to learn the predictions/answers of GVFs. We refer to sutton2011horde for additional details.
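As a concrete illustration, a GVF answer can be learned with the same TD machinery as a value function, simply substituting a cumulant and its discount for the reward and the usual discount factor. The sketch below is our own minimal example (linear answer, constant cumulant), not code from any cited work.

```python
import numpy as np

def gvf_td_update(w, phi, phi_next, cumulant, discount, alpha=0.1):
    # TD(0) update for a linear GVF answer v(s) = w . phi(s):
    # the "reward" is an arbitrary cumulant, paired with its own discount.
    td_error = cumulant + discount * np.dot(w, phi_next) - np.dot(w, phi)
    return w + alpha * td_error * phi

# Toy check: one-hot features over 3 states, constant cumulant 1, discount 0.
# With a zero discount the answer should converge to the cumulant itself.
w = np.zeros(3)
phi = np.eye(3)
for _ in range(200):
    for s in range(3):
        w = gvf_td_update(w, phi[s], phi[(s + 1) % 3], cumulant=1.0, discount=0.0)
```

With a nonzero discount the same update instead estimates a discounted sum of future cumulants, exactly as a value function estimates a discounted sum of rewards.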
Prior work on auxiliary tasks in RL: jaderberg2016reinforcement explored extensively the potential, for RL agents, of jointly learning the representation used for solving the main task and a number of GVF-based auxiliary tasks, such as pixel-control and feature-control tasks based on controlling changes in pixel intensities and feature activations; this class of auxiliary tasks was also used in the multi-task setting by multitaskpopart. Other recent examples of auxiliary tasks include depth and loop-closure classification (mirowski2016learning), observation reconstruction, reward prediction, inverse dynamics prediction (shelhamer2016loss), and many-goals learning (veeriah2018many). A geometrical perspective on auxiliary tasks was introduced by GeomPerspectiveRL.
Prior work on meta-learning: Recently, there has been a lot of interest in meta-learning, or learning to learn. A meta-learner progressively improves the learning process of a learner (schmidhuber1996simple; thrun1998learning) that is attempting to solve some task. Recent work on meta-learning includes learning good policy initializations that can be quickly adapted to new tasks (finn2017model; al2017continuous), improving few-shot learning performance (mishra2017simple; duan2017one; snell2017prototypical), learning to explore (stadie2018some), unsupervised meta-learning (gupta2018unsupervised; hsu2018unsupervised), few-shot model adaptation (nagabandi2018deep), and improving optimizers (andrychowicz2016learning; li2016learning; ravi2016optimization; wichrowska2017learned; chen2016learning).

Prior work on meta-gradients: xu2018meta formalized meta-gradients, a form of meta-learning where the meta-learner is trained via gradients through the effect of the meta-parameters on a learner that is itself trained via gradients. In contrast to much work in meta-learning that focuses on multi-task learning, xu2018meta formalized the use of meta-gradients in a way that is applicable to the single-task setting, although not limited to it. They illustrated their approach by using meta-gradients to adapt both the discount factor and the bootstrapping factor of a reinforcement learning agent, substantially improving the performance of an actor-critic agent on many Atari games. Concurrently, zheng2018learning used meta-gradients to learn intrinsic rewards, demonstrating that maximizing a sum of extrinsic and intrinsic rewards could improve an agent's performance on a number of Atari games and MuJoCo tasks. xu2018meta discussed the possibility of computing meta-gradients in a non-myopic manner, but their proposed algorithm, like that of zheng2018learning, introduced a severe approximation and only measured the immediate consequences of an update.
2 The discovery of useful questions
In this section we present a neural network architecture and a principled meta-gradient algorithm for the discovery of GVF-based questions for use as auxiliary tasks in the context of deep RL agents.
2.1 A neural network architecture for discovery
The neural network architecture we consider features two networks: the first, on the left in Figure 1, takes the most recent observations as inputs, and parameterises (directly or indirectly) a policy for the main reinforcement learning task, together with GVF predictions for a number of discovered cumulants and discounts. We use θ to denote the parameters of this first network. The second network, referred to as the question network, is depicted on the right in Figure 1. It takes future observations as inputs and, through the meta-parameters η, computes the values of a set of cumulants and their corresponding discounts (both the cumulants and the discounts are therefore vectors).
The use of future observations as inputs to the question network requires us to wait for those future steps to unfold before computing the cumulants and discounts; this is acceptable because the question and answer networks are only used during training, and neither is needed for action selection. As discussed in Section 1, a GVF question is specified by a cumulant function, a discount function and a policy. In our method, the question network only explicitly parameterises discounts and cumulants because we consider on-policy GVFs, and therefore the policy will always be, implicitly, the latest main-task policy. Note, however, that since each cumulant is a function of future observations, which are influenced by the actions chosen by the main-task policy, the cumulant and discount functions are non-stationary: not just because we are learning the question network parameters, but also because the main-task policy itself is changing as learning progresses.
Previous work on auxiliary tasks in reinforcement learning may be interpreted as using only the network on the left, as the cumulant functions were handcrafted and did not have any (meta-)learnable parameters; the availability of a separate "question network" is a critical component of our approach to discovery, as it enables the agent to discover from experience the most suitable questions about the future to be used as auxiliary tasks. The terminology of question and answer networks is derived from work on TD networks (sutton2005temporal); we refer to makino2008line for related work on incremental discovery of the structure of TD networks (work that does not, however, use meta-gradients and that was applied only to relatively simple domains).
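To make the architecture concrete, a minimal question network can be sketched as a map from features of future observations to vectors of cumulants and discounts. Everything below is an illustrative assumption on our part, in particular the linear map and the tanh/sigmoid output squashings; it is not the paper's exact parameterisation.

```python
import numpy as np

def question_network(future_obs_features, W_c, W_d):
    # Map concatenated future-observation features to d cumulants and
    # d discounts; (W_c, W_d) play the role of the meta-parameters eta.
    x = np.concatenate(future_obs_features)
    cumulants = np.tanh(W_c @ x)                   # bounded cumulants (assumed)
    discounts = 1.0 / (1.0 + np.exp(-(W_d @ x)))   # discounts in (0, 1) (assumed)
    return cumulants, discounts

rng = np.random.default_rng(0)
obs = [rng.normal(size=8) for _ in range(3)]       # features of 3 future observations
W_c, W_d = rng.normal(size=(4, 24)), rng.normal(size=(4, 24))
c, d = question_network(obs, W_c, W_d)             # 4 discovered questions
```

Note that each question is defined jointly by one cumulant entry and the matching discount entry, which is why both outputs have the same dimensionality.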
2.2 Multi-step meta-gradients
In their most abstract form, reinforcement learning algorithms can be described by an update procedure that modifies, on each step t, the agent's parameters θ. The central idea of meta-gradient RL is to parameterise the update by meta-parameters η. We may then consider the consequences of changing η on the parameterised update rule by measuring the subsequent performance of the agent, in terms of a "meta-loss" function L(θ). Such a meta-loss may be evaluated after one update (myopic) or after multiple updates (non-myopic). The meta-gradient is then, by the chain rule,
$$\frac{\partial \mathcal{L}(\theta)}{\partial \eta} \;=\; \frac{\partial \mathcal{L}(\theta)}{\partial \theta}\,\frac{\partial \theta}{\partial \eta}. \qquad (1)$$
Implicit in Equation 1 is that changing the meta-parameters η at one time step affects not just the immediate update to θ on the next time step, but all future updates. This makes the meta-gradient challenging to compute. A straightforward but effective way to capture the multi-step effects of changing η is to build a computational graph consisting of a sequence of updates made to the parameters θ, with η held fixed, ending with a meta-loss evaluation L(θ). The meta-gradient may be efficiently computed from this graph through backward-mode automatic differentiation; this has a computational cost similar to that of the forward computation (griewank2008evaluating), but it requires storage of the intermediate copies of the parameters θ, thus increasing the memory footprint. We emphasize that this approach is in contrast to the myopic meta-gradient used in previous work, which either ignores effects past the first time step or makes severe approximations.
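The multi-step computation can be illustrated on a scalar toy problem, entirely our own construction: inner updates descend a main loss plus an auxiliary loss shaped by a meta-parameter η, the meta-loss is the main loss after the unroll, and the meta-gradient is obtained by propagating dθ/dη through every inner update rather than through the last one only (here accumulated forward rather than by backward-mode autodiff, which is equivalent for this scalar case).

```python
def unrolled_meta_grad(theta0, eta, alpha=0.1, n_steps=5):
    # Main loss: 0.5*theta^2.  Auxiliary loss: 0.5*(theta - eta)^2.
    # Forward-accumulate dtheta/deta through each inner update; this is
    # the multi-step effect that a myopic meta-gradient would ignore.
    theta, dtheta = theta0, 0.0
    for _ in range(n_steps):
        grad = theta + (theta - eta)        # d(inner loss)/dtheta
        dgrad = dtheta + (dtheta - 1.0)     # d(grad)/deta
        theta = theta - alpha * grad
        dtheta = dtheta - alpha * dgrad
    meta_loss = 0.5 * theta ** 2
    return meta_loss, theta * dtheta        # chain rule: dL/deta

loss, g = unrolled_meta_grad(theta0=1.0, eta=0.5)
```

A finite-difference check on the meta-loss confirms that the accumulated quantity is the true gradient through the whole unroll.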
2.3 A multi-step meta-gradient algorithm for discovery
We apply the meta-gradient algorithm, as presented in Section 2.2, to the discovery of GVF-based auxiliary tasks as represented in the neural network architecture from Section 2.1. The complete pseudo-code for the proposed approach to discovery is outlined in Algorithm 1.
On each iteration of the algorithm, in an inner loop we apply a sequence of updates to the agent parameters θ, which parameterise the main-task policy and the GVF answers, using separate samples of experience in the environment. Then, in the outer loop, we apply a single update to the meta-parameters η (the question network that parameterises the cumulant and discount functions defining the GVFs), based on the effect of the updates to θ on the meta-loss; next, we make each of these steps explicit.
The inner update includes two components: the first is a canonical deep reinforcement learning update, using a loss we denote L_RL, for optimizing the main-task policy, either directly (as in policy-based algorithms, e.g., Williams1992) or indirectly (as in value-based algorithms, e.g., Watkins:1989). The second component is an update rule for estimating the answers to GVF-based questions. With slight abuse of notation, we can then denote each inner-loop update as the following gradient descent step on the pseudo-losses L_RL and L_ans:
$$\theta_{i+1} \;=\; \theta_i - \alpha\,\nabla_{\theta}\big(L_{\text{RL}}(\theta_i) + L_{\text{ans}}(\theta_i;\eta)\big). \qquad (2)$$
The meta-loss is the sum of the RL pseudo-losses associated with the main-task updates, as computed on the batches generated in the inner loop; it is a function of the meta-parameters η through the updates to the answers. We can therefore compute the update to the meta-parameters as
$$\Delta\eta \;=\; -\beta\,\frac{\partial}{\partial \eta}\sum_{i} L_{\text{RL}}(\theta_i). \qquad (3)$$
This meta-gradient procedure optimizes the area under the curve over the temporal span defined by the inner unroll length. Alternatively, the meta-loss may be evaluated on the last batch alone, to optimize for final performance. Unless we specify otherwise, we use the area under the curve.
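The two meta-objectives can be contrasted on a toy inner loop (our own sketch, with stand-in quadratic losses): the "area under the curve" objective sums the main-task loss over every inner batch, while the alternative evaluates only the final one.

```python
def inner_unroll(theta, eta, alpha=0.1, n_steps=5):
    # Stand-in losses, not the agent's real objectives.
    rl_loss = lambda th: 0.5 * th ** 2        # main-task pseudo-loss
    ans_grad = lambda th, e: th - e           # answer-loss gradient, shaped by eta
    losses = []
    for _ in range(n_steps):
        theta = theta - alpha * (theta + ans_grad(theta, eta))
        losses.append(rl_loss(theta))
    # "area under the curve" meta-loss vs final-batch meta-loss
    return sum(losses), losses[-1]

auc, end = inner_unroll(theta=1.0, eta=0.0)
```

Differentiating the first return value with respect to η gives the area-under-the-curve meta-gradient; differentiating the second gives the final-performance variant.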
2.4 An actor-critic agent with discovery of questions for auxiliary tasks
In this section we describe a concrete instantiation of the algorithm in the context of an actor-critic reinforcement learning agent. The network on the left of Figure 1 is composed of three modules: 1) an encoder network that takes the most recent observations as inputs and outputs a state representation; 2) a main-task network that, given the state, estimates both the policy and a state value function (Sutton:1988); and 3) an answer network that, given the state, approximates the GVF answers. In this paper, the policy, value, and answer heads are all linear functions of the state.
The main-task network parameters are only affected by the RL component of the update defined in Equation 2. In an actor-critic agent, these are the union of the parameters of the state values and the parameters of the softmax policy; the update is therefore the sum of a value update and a policy update, where the target is a multi-step truncated return that uses the agent's estimates of the state values for bootstrapping after a fixed number of steps.
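A bootstrapped multi-step return of this kind can be sketched generically (our own code, not the agents' exact implementation):

```python
def n_step_return(rewards, gamma, bootstrap_value):
    # Discounted sum of the truncated reward sequence, replacing the
    # tail of the return with a learned state-value estimate.
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, with rewards [1, 1], discount 0.5 and a bootstrap value of 2, the return is 1 + 0.5 * (1 + 0.5 * 2) = 2.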
The answer network parameters, instead, are only affected by the second term of the update in Equation 2. Since the answers estimate, on-policy, an expected cumulative discounted sum of cumulants, we may use a generalized temporal-difference learning algorithm to update them. In our agents, the answer vector is a linear function of state, and therefore each GVF prediction is separately parameterised. The target for each answer is the multi-step, truncated, discounted sum of cumulants from the current time onwards; as in the main-task updates, we use the answer network's own estimates to bootstrap after a fixed number of steps.
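The answer targets are the vector analogue of the bootstrapped return, with one cumulant entry and one discount per question. A minimal sketch under these assumptions:

```python
import numpy as np

def gvf_answer_target(cumulants, discounts, bootstrap_answers):
    # cumulants: one vector per step (an entry per question);
    # discounts: a per-question discount vector; the tail of each
    # discounted sum is replaced by the answer network's own estimate.
    g = np.asarray(bootstrap_answers, dtype=float)
    d = np.asarray(discounts, dtype=float)
    for c in reversed(cumulants):
        g = np.asarray(c, dtype=float) + d * g
    return g

target = gvf_answer_target(
    cumulants=[[1.0, 0.0], [0.0, 1.0]],   # two steps, two questions
    discounts=[0.5, 0.0],                 # one discount per question
    bootstrap_answers=[2.0, 3.0],
)
```

Note how a zero discount makes the second question purely about the immediate cumulant, while the first question accumulates over the horizon.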
The main-task and answer-network pseudo-losses used in the updates above can also be straightforwardly used to instantiate Equation 2 for the parameters of the encoder network, and to instantiate Equation 3 for the parameters of the question network. For the shared state representation, we explore two updates: (1) using the gradients from both the main task and the answer network, and (2) using only the gradients from the answer network. Using both the main-task and answer-network components is more consistent with the existing literature on auxiliary tasks, but ignoring the main-task updates provides a more stringent test of whether the algorithm is capable of meta-learning questions that can drive, even on their own, the learning of an adequate state representation.
3 Experimental setup
In this section we outline the experimental setup, including the environments we used as testbeds and the high level agent and neural network architectures. We refer to the Appendix for more details.
3.1 Domains
Puddleworld domain: a continuous-state gridworld domain (degris2012off), where the state is a two-dimensional position. The agent has five actions: four of these move the agent in one of the four cardinal directions by a fixed mean offset, and the fifth has a zero offset. The actions have a stochastic effect on the environment because, on each step, uniform noise is added to each action component. We refer to degris2012off for further details about this environment.
Collect-objects domain: a four-room gridworld, where the agent is rewarded for collecting two objects in the right order. The agent moves deterministically in one of four cardinal directions. For each episode the starting position is chosen randomly. The locations of the two objects are the same across episodes. The agent receives a positive reward for picking up the first object and a larger reward for picking up the second object after the first one. The maximum length of each episode is fixed.
Atari domain: the Atari games were designed to be challenging and fun for human players, and were packaged into a canonical benchmark for RL agents: the Arcade Learning Environment (Bellemare:2013; mnih2015human; mnih2016asynchronous; SchulmanTRPO; SchulmanPPO; Rainbow). When summarizing results on this benchmark, we follow the common approach of first normalizing scores on each game using the scores of random and human agents (vanHasselt:2016).
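The human-normalized score used throughout is the standard affine rescaling:

```python
def human_normalized_score(agent_score, random_score, human_score):
    # 0.0 corresponds to a random agent's score, 1.0 to human-level play.
    return (agent_score - random_score) / (human_score - random_score)
```

For example, an agent scoring 150 on a game where a random agent scores 100 and a human scores 200 receives a normalized score of 0.5.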
3.2 Our agents
For the gridworld experiments, we implemented meta-gradients on top of a multi-step actor-critic agent with parallel actor threads (mnih2016asynchronous). For the Atari experiments, we used a multi-step IMPALA (espeholt2018impala) agent with 200 distributed actors. In the non-visual domain of Puddleworld, the encoder is a simple MLP with two fully-connected layers. In the other domains the encoder is a convolutional neural network. The main-task value and policy, and the answer network, are all linear functions of the state. In the gridworlds the question network outputs a set of cumulants, and the discount factor that jointly defines the GVFs is hand-tuned. In our Atari experiments the question network outputs both the cumulants and the corresponding discounts. In all experiments we report scores and curves averaging results from independent runs of each agent, task, or hyperparameter configuration. In Atari we use a single set of hyperparameters across all games.
3.3 Baselines: handcrafted questions as auxiliary tasks
In our experiments we consider the following baseline auxiliary tasks from the literature.
Reward prediction: This baseline agent has no question network. Instead, it uses the scalar reward obtained at the next time step as the target for the answer network; the auxiliary loss is the prediction error with respect to this target.
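A plausible form of this auxiliary loss (the exact loss is not spelled out here, so the squared-error choice below is our assumption) is:

```python
import numpy as np

def reward_prediction_loss(predicted, next_rewards):
    # Mean squared error between the answer head's predictions and the
    # scalar rewards observed at the next time step (assumed loss form).
    predicted, next_rewards = np.asarray(predicted), np.asarray(next_rewards)
    return float(np.mean((predicted - next_rewards) ** 2))
```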
Pixel control: This baseline also has no question network. The auxiliary task is to learn to optimally control changes in pixel intensities. Specifically, the answer network must estimate optimal action values for cumulants corresponding to the average absolute change in pixel intensities between consecutive (in time) observations, for each cell in a non-overlapping grid overlaid onto the observation. The auxiliary loss for each cell's action values uses the discounted sum of the cell's pseudo-rewards as its target; the auxiliary loss is summed over the entire grid.
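The pixel-control cumulants themselves can be sketched directly from the description above; the cell size below is an arbitrary illustrative choice, not necessarily the one used in the experiments.

```python
import numpy as np

def pixel_control_cumulants(obs, next_obs, cell=4):
    # Average absolute intensity change per (cell x cell) region of a
    # non-overlapping grid over a greyscale observation (H x W array).
    diff = np.abs(next_obs.astype(float) - obs.astype(float))
    h, w = diff.shape
    gh, gw = h // cell, w // cell
    diff = diff[: gh * cell, : gw * cell]       # drop any ragged border
    return diff.reshape(gh, cell, gw, cell).mean(axis=(1, 3))

obs = np.zeros((8, 8))
nxt = np.zeros((8, 8))
nxt[:4, :4] = 2.0                               # only the top-left cell changes
cumulants = pixel_control_cumulants(obs, nxt, cell=4)
```

Each grid cell thus yields one pseudo-reward stream, which is what the cell's action-value head is trained against.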
Random questions: This baseline agent is the same as our meta-gradient-based agent, except that the question network is kept fixed at its randomly initialized parameters throughout training. The answer network is still trained to predict values for the cumulants defined by the fixed question network.
4 Empirical findings
In this section, we empirically investigate the performance of the proposed algorithm for discovery, as instantiated in Section 2.4. We refer to our meta-learning agent as the "Discovered GVFs" agent. Our experiments address the following questions:

Can meta-gradients discover GVF questions such that learning the answers to them is sufficient, on its own, to build representations good enough for solving complex RL tasks? We refer to these as the "representation learning" experiments.

Can meta-gradients discover GVF questions such that learning to answer them alongside the main task improves the data efficiency of an RL agent? In these experiments the representation is shaped both by the updates based on the discovered GVFs and by the main-task updates; we will thus refer to these as the "joint learning" experiments.

In both settings, how do auxiliary tasks discovered via meta-gradients compare to handcrafted tasks from the literature? Also, how is performance affected by design decisions such as the number of questions, the number of inner steps used to compute meta-gradients, and the choice between area under the curve and final loss as the meta-objective?
We note that the "representation learning" experiments are a more stringent test of our meta-learning algorithm for discovery than the "joint learning" experiments. The latter setting, however, is consistent with the literature on auxiliary tasks and can be more useful in practice.
4.1 Representation learning experiments
In these experiments, the parameters of the encoder network are unaffected by gradients from the main-task updates. Figure 5 compares the performance of our meta-gradient agents to the baseline agents that train the state representation using the handcrafted auxiliary tasks described in Section 3.3. We always include a reference curve (in black) corresponding to the baseline actor-critic agent with no answer or question networks, where the representation is trained directly using the main-task updates. We report results for the Collect-objects domain, Puddleworld, and three Atari games (more are reported in the Appendix). From the experiments we highlight the following:
Discovery: in all the domains, we found evidence that the state representation learned solely through learning the GVF answers to the discovered questions was sufficient to support learning good policies. Specifically, in the two gridworld domains the resulting policies were optimal (see Figure 5); in the Atari domains the resulting policies were comparable to those achieved by the state-of-the-art IMPALA agent after training for 200M frames (see Figure 5). This is one of our main results, as it confirms that non-myopic meta-gradients can discover questions, in the form of cumulants and discounts, that capture rich enough knowledge of the world to support the learning of state representations that yield good policies, even in complex RL tasks.
Baselines: we also found that learning the answers to questions discovered using meta-gradients resulted in state representations that supported better performance on the main task than the representations resulting from learning the answers to popular handcrafted questions from the literature. Consider the gridworld experiments in Figure 5: learning the representation using "Reward Prediction" (purple) or "Random GVFs" (blue) resulted in notably worse policies than those learned by the agent with "Discovered GVFs". Similarly, in Atari (shown in Figure 5) the handcrafted auxiliary tasks, now including a "Pixel Control" baseline (green), resulted in almost no learning.
Main-task driven representations: Note that the actor-critic agent that trained the state representation using the main-task updates directly learned faster than the agents where the representation was exclusively trained using auxiliary tasks: the baseline required only 3M steps on the gridworlds and 200M frames on Atari to reach its final performance. This is expected, and it holds both for our meta-gradient solution and for the auxiliary tasks from the literature.
We used the representation learning setting to investigate a number of design choices. First, we compared optimizing the area under the curve over the length of the unrolled meta-gradient computation ("Summed Meta-Loss") to computing the meta-gradient on the last batch alone ("End Meta-Loss"). As shown in Figure 5, both approaches can be effective, but we found optimizing the area under the curve to be more stable. Next, we examined the role of the number of GVF questions and the effect of varying the number of steps unrolled in the meta-gradient calculation. For this purpose, we used the less compute-intensive gridworlds: Collect-objects (reported here) and Puddleworld (in the Appendix). On the left in Figure 5, we report a parameter study, plotting the performance of the agent with meta-learned auxiliary tasks as a function of the number of questions. The dashed black line corresponds to the optimal (final) performance. Too few questions did not provide enough signal to learn good representations, so the dashed red line for the smallest number of questions is far from optimal. Larger numbers of questions all led to learning a good representation capable of supporting an optimal policy; however, too many questions made learning slower, as shown by the drop in average performance. The number of questions is therefore an important hyperparameter of the algorithm. On the right in Figure 5, we report the effect on performance of the number of unrolled steps used for the meta-gradient computation. An unroll length of one corresponds to the myopic meta-gradient: in contrast to previous work (xu2018meta; zheng2018learning), the representation learned with the shortest unrolls was insufficient for the final policy to do anything meaningful. Performance generally improved as we increased the unroll length (although the computational cost of meta-gradients also increased). Again the trend was not fully monotonic, with the largest unroll length performing worse than intermediate ones in terms of both final and average performance. We conjecture this may be due to the increased variance of the meta-gradient estimates as the unroll length increases. The number of unrolled steps is therefore also a sensitive hyperparameter. Note that neither hyperparameter was tuned in the other experiments; all other results use the same fixed settings for both.

4.2 Joint learning experiments
The next set of experiments uses the most common setting in the literature on auxiliary tasks, where the representation is learned jointly using the auxiliary-task updates and the main-task updates. To accelerate the learning of useful questions, we provided the encoded state representation as input to the question network instead of learning a separate encoding; this differs from the previous experiments, where the question network was a completely independent network (consistent with the objective of a more stringent evaluation of our algorithm). We used a benchmark consisting of 57 distinct Atari games to evaluate the "Discovered GVFs" agent together with an actor-critic baseline ("IMPALA") and two auxiliary tasks from the literature: "Reward Prediction" and "Pixel Control".
None of the auxiliary tasks outperformed IMPALA on each and every game. To analyse the results, we ranked the games according to the performance of the agent with pixel-control questions, to identify the games most conducive to improving performance through the use of auxiliary tasks. On the left of Figure 6, we report the relative gains of the "Discovered GVFs" agent over IMPALA on the top games for the "Pixel Control" baseline: we observed large gains in the majority of these games, small gains in several, and losses in only a few. On the right in Figure 6, we provide a more comprehensive view of the performance of the agents. For each number k on the x-axis, we present the median human-normalized score achieved by each method on the top-k games, again selected according to the "Pixel Control" baseline. It is visually clear that discovering questions via meta-learning is fast enough to compete with handcrafted questions and that, in games well suited to auxiliary tasks, it greatly improved performance over all baselines. It was particularly impressive to find that the meta-gradient solution outperformed pixel control on these games despite the ranking of games being biased in favour of pixel control. The reward-prediction baseline is interesting, in comparison, because its profile was the closest to that of the actor-critic baseline, never improving performance significantly, but not hurting either.
5 Conclusions and Discussion
There are many forms of questions that an intelligent agent may want to discover. In this paper we introduced a novel and efficient multi-step meta-gradient procedure for the discovery of questions in the form of on-policy GVFs. In a stringent test, our representation learning experiments demonstrated that the meta-gradient approach is capable of discovering useful questions, such that answering them can drive, by itself, the learning of state representations good enough to support a main reinforcement learning task. Furthermore, our auxiliary-task experiments demonstrated that the meta-learning-based discovery approach is data-efficient enough to compete with, and in many cases even outperform, the handcrafted questions developed in prior work.
Prior work on auxiliary tasks relied on human ingenuity to define questions useful for shaping the state representation used in a certain task, but it is hard to create questions that are both useful and general (i.e., that can be applied across many tasks). GeomPerspectiveRL introduced a geometrical perspective to understand when auxiliary tasks give rise to good representations. Our solution differs from this line of work in that it enables us to sidestep the question of how to design good auxiliary questions by meta-learning them, directly optimizing for utility in the context of any given task. Our approach fits a general trend of increasingly relying on data, rather than human-designed inductive biases, to construct effective learning algorithms (AlphaZero; InductiveBiasesRL).
A promising direction for future research is to investigate off-policy GVFs, where the policy under which we make the predictions differs from the main-task policy. We also note that our approach to discovery is quite general, and could be extended to meta-learning other kinds of questions that do not fit the canonical GVF formulation; see GNLB for one such class of predictive questions. Finally, we emphasize that the unrolled multi-step meta-gradient algorithm is likely to benefit previous applications of myopic meta-gradients, and possibly to open up further applications, other than discovery, where the myopic approximation would fail.
Acknowledgments
We thank John Holler and Zeyu Zheng for many useful comments and discussions. The work of the authors at the University of Michigan was supported by a grant from DARPA's L2M program and by NSF grant IIS-1526059. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors.
6 Appendix
6.1 Neural network architecture and details
Representation learning experiments:
Puddleworld domain:
A multi-layer perceptron (MLP) with two fully-connected layers. ReLU activation functions are used throughout.
Collect-objects domain: A two-layer convolutional neural network (CNN), whose output is fed to a fully-connected layer. ReLU activation functions are used throughout.
Atari domain: A three-layer CNN architecture that has been used successfully on Atari in several variants of DQN (mnih2015human; vanHasselt:2016; Rainbow). ReLU activation functions are used throughout. In all cases, the outputs of the encoding modules are linearly mapped to produce the policy, value function, and answer-network heads.
We use an independent question network in all representation learning experiments; the architecture of its hidden layers matches that of the learning agent exactly. Note, however, that the heads of the question network output the cumulants and discounts to be used as questions, and these are both vectors of the same size as the number of questions. We use squashing activations for the cumulants and an activation that constrains the discounts to valid values in [0, 1].
Joint learning experiments:
Atari domain: We use a deep ResNet architecture identical to the one from espeholt2018impala. The networks differ only in their outputs, as we now have an answer head in addition to the policy and value heads.
The question network takes the last hidden layer of the ResNet as input; it uses a meta-trained two-layer MLP to produce the cumulants, and a separately parameterised two-layer MLP to produce the discounts, with ReLU activations in both. As in the representation learning experiments, we use squashing activations for the cumulants and an activation that constrains the discounts to valid values.
6.2 Hyperparameters used in our experiments
Representation learning experiments:
A2C: The A2C agents (used in the gridworld domains) use multi-step returns in the pseudo-loss. We searched over a range of values for the initial learning rate of the RMSProp optimizer and for the entropy regularization coefficient; the best combination of these hyperparameters was chosen according to the results of the A2C baseline, and then used for all agents. The ε hyperparameter of the RMSProp optimizer and the number of unrolling steps are fixed.

IMPALA: All agents based on IMPALA (used in the Atari domains) use the hyperparameters reported by espeholt2018impala. They are listed in Table 1, together with the hyperparameters specific to the Discovered GVFs agent and the other baselines. The number of unrolling steps for meta-gradients is 10 (see Table 1).
Joint learning experiments:
The hyperparameters specific to the auxiliary tasks were obtained by a search over ten games (ChopperCommand, Breakout, Seaquest, SpaceInvaders, KungFuMaster, MsPacman, Krull, Tutankham, BattleZone, BeamRider), following common practice in deep RL Atari experiments (mnih2015human; vanHasselt:2016; Rainbow). After choosing the hyperparameters from this search, they remain fixed across all Atari games.
6.3 Preprocessing
In the Atari domain, the input to the learning agent consists of consecutively stacked frames, where each frame results from repeating the chosen action for a fixed number of time steps, greyscaling and downsampling the resulting frames to 84x84 images, and max-pooling over the last 2. This is a fairly canonical preprocessing pipeline for Atari. Additionally, rewards are clipped to the [-1, 1] range.
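Two of these preprocessing steps can be sketched directly (greyscaling and resizing are omitted; the clipping range follows the text above):

```python
import numpy as np

def max_pool_frames(frame_a, frame_b):
    # Element-wise max over the two most recent raw frames, a standard
    # Atari trick to remove sprite flicker.
    return np.maximum(frame_a, frame_b)

def clip_reward(r):
    # Clip rewards to [-1, 1] before they reach the learning update.
    return float(np.clip(r, -1.0, 1.0))
```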
IMPALA  Value 
Network Architecture  Deep ResNet 
n-step return  20 
Batch size  32 
Value loss coefficient  0.5 
Entropy coefficient  0.01 
Learning rate  0.0006 
RMSProp momentum  0.0 
RMSProp decay  0.99 
RMSProp ε  0.1 
Global gradient norm clip  40 
Learning rate schedule  Anneal linearly to 0 
Number of learners  1 
Number of actors  200 
GVF Questions  Value 
Meta learning rate  0.0006 
Meta optimiser  ADAM 
Unroll length  10 
Meta gradient norm clip (cumulants)  1 
Meta gradient norm clip (discounts)  10 
Number of Questions  128 
Auxiliary loss coefficient  0.0001 
PixelControl  Value 
Auxiliary loss coefficient  0.0001 
RewardPrediction  Value 
Auxiliary loss coefficient  0.001 
6.4 Derivation of myopic approximation to metagradients
Here we derive the myopic approximation to our meta-gradient procedure that was previously described in the main text.
(4)  
(5)  
(6)  
(7)  
(8)  
(9)  
(10) 
Equations 5, 8 and 10 are a myopic approximation because they ignore the fact that the parameters θ are affected by the changes in the meta-parameters η. Furthermore, in Equations 8 and 10, the policy and value function are only indirect functions of η (i.e., they are indirectly affected by the auxiliary loss) and thus they do not participate in the myopic approximation. Therefore, after applying all the approximations, we get the following myopic update rule for the meta-parameters η:
(11) 
6.5 Comparison between myopic and unrolled meta-gradient
Figure 7 visualizes the computation graphs for the unrolled meta-gradient and for the myopic meta-gradient. In the unrolled computation, the gradient of the meta-objective with respect to the meta-parameters η is computed in such a way that the effect of these parameters over a longer timescale is taken into consideration; the gradient for this unrolled computation is given in Equation 4. In contrast, the myopic computation only considers the immediate one-time-step effect of the meta-parameters on the agent's policy; the meta-gradient update based on this myopic computation is given in Equation 11.
6.6 Additional Results
Representation learning experiments: The aim of the representation learning experiments is to evaluate how well auxiliary tasks can drive, on their own, representation learning in support of a main reinforcement learning task. In Figure 8 we report additional representation learning results for Atari games (jamesbond, gravitar, frostbite, amidar, bowling and chopper command), including the games from the main text. The "Discovered GVFs" (red), "Pixel Control" (green), "Reward Prediction" (purple) and "Random GVFs" (blue) agents all rely exclusively on auxiliary tasks to drive representation learning, while the linear policy and value functions are trained using the main-task updates. In all games the "Discovered GVFs" agent significantly outperforms the baselines that use handcrafted auxiliary tasks from the literature to train the representation. In two games (gravitar and frostbite) the "Discovered GVFs" agent also significantly outperforms the plain "IMPALA" agent (trained for 200M frames), which uses the main-task updates to train the state representation. In Figure 9 we report the parameter studies for the "Discovered GVFs" agent in the second gridworld domain, Puddleworld; the plots show performance as a function of the number of questions used as auxiliary tasks (on the left) and of the number of steps unrolled to compute the meta-gradient (on the right). The results are consistent with those reported in the main text for the Collect-objects domain.
Joint learning experiments: In Figures 11, 12 and 10 we provide additional details for the "joint learning" experiments. The aim of these experiments is to test whether the process of discovering useful questions via meta-gradients is fast enough to improve the data efficiency of an agent in the standard setting, where the state representation is trained using both the auxiliary-task updates and the main-task updates. We report the relative performance improvements achieved by the "Discovered GVFs" agent over the "IMPALA", "Pixel Control" and "Reward Prediction" agents, after 200M training frames, on each of the Atari games. The same hyperparameters are used for all games. The relative improvements are computed using the human-normalized final performance of each agent, averaged across independent replicas of each experiment (for reproducibility).