Brief background on GVFs: Standard value functions in RL define a question and its answer; the question is “what is the discounted sum of future rewards under some policy?” and the answer is the approximate value function. Generalized value functions, or GVFs, generalize the standard value function to allow for arbitrary cumulant functions of states in place of rewards, and are specified by the combination of such a cumulant function with a discount factor and a policy. This generalization of standard value functions allows GVFs to express quite general predictive knowledge and, notably, temporal-difference (TD) methods for learning value functions can be extended to learn the predictions/answers of GVFs. We refer to sutton2011horde for additional details.
Prior work on auxiliary tasks in RL: jaderberg2016reinforcement explored extensively the potential, for RL agents, of jointly learning the representation used for solving the main task and a number of GVF-based auxiliary tasks, such as pixel-control and feature-control tasks based on controlling changes in pixel intensities and feature activations; this class of auxiliary tasks was also used in the multi-task setting by multitask-popart. Other recent examples of auxiliary tasks include depth and loop closure classification (mirowski2016learning), observation reconstruction, reward prediction, inverse dynamics prediction (shelhamer2016loss), and many-goals learning (veeriah2018many). A geometrical perspective on auxiliary tasks was introduced by GeomPerspectiveRL.
Prior work on meta-learning: Meta-learning, or learning to learn, has recently attracted considerable interest. A meta-learner progressively improves the learning process of a learner (schmidhuber1996simple; thrun1998learning) that is attempting to solve some task. Recent work on meta-learning includes learning good policy initializations that can be quickly adapted to new tasks (finn2017model; al2017continuous), improving few-shot learning performance (mishra2017simple; duan2017one; snell2017prototypical), learning to explore (stadie2018some), unsupervised learning (gupta2018unsupervised; hsu2018unsupervised), few-shot model adaptation (nagabandi2018deep), and improving the optimizers (andrychowicz2016learning; li2016learning; ravi2016optimization; wichrowska2017learned; chen2016learning; gupta2018unsupervised).
Prior work on meta-gradients: xu2018meta formalized meta-gradients, a form of meta-learning where the meta-learner is trained via gradients through the effect of the meta-parameters on a learner also trained via gradients. In contrast to much work in meta-learning that focuses on multi-task learning, xu2018meta formalized the use of meta-gradients in a way that is applicable also to the single task setting, although not limited to it. They illustrated their approach by using meta-gradients to adapt both the discount factor and the bootstrapping factor of a reinforcement learning agent, substantially improving performance of an actor-critic agent on many Atari games. Concurrently, zheng2018learning used meta-gradients to learn intrinsic rewards, demonstrating that maximizing a sum of extrinsic and intrinsic rewards could improve an agent’s performance on a number of Atari games and MuJoCo tasks. xu2018meta discussed the possibility of computing meta-gradients in a non-myopic manner, but their proposed algorithm, as that of zheng2018learning, introduced a severe approximation and only measured the immediate consequences of an update.
2 The discovery of useful questions
In this section we present a neural network architecture and a principled meta-gradient algorithm for the discovery of GVF-based questions for use as auxiliary tasks in the context of deep RL agents.
2.1 A neural network architecture for discovery
The neural network architecture we consider features two networks: the first, on the left in Figure 1, takes the last $i$ observations as inputs, and parameterises (directly or indirectly) a policy $\pi$ for the main reinforcement learning task, together with GVF-predictions for a number of discovered cumulants and discounts. We use $\theta$ to denote the parameters of this first network. The second network, referred to as the question network, is depicted on the right in Figure 1. It takes as inputs $j$ future observations and, through the meta-parameters $\eta$, computes the values of a set of cumulants $u_t$ and their corresponding discounts $\gamma_t$ (both $u_t$ and $\gamma_t$ are therefore vectors).
The use of $j$ future observations as inputs to the question network requires us to wait $j$ steps to unfold before computing the cumulants and discounts; this is acceptable because the question and answer networks are only used during training, and neither is needed for action selection. As discussed in Section 1, a GVF-question is specified by a cumulant function, a discount function and a policy. In our method, the question network only explicitly parameterises discounts and cumulants because we consider on-policy GVFs, and therefore the policy will always be, implicitly, the latest main-task policy $\pi$. Note, however, that since each cumulant is a function of future observations, which are influenced by the actions chosen by the main-task policy, the cumulant and discount functions are non-stationary, not just because we are learning the question network parameters, but also because the main-task policy itself is changing as learning progresses.
Previous work on auxiliary tasks in reinforcement learning may be interpreted as just using the network on the left, as the cumulant functions were handcrafted and did not have any (meta-)learnable parameters; the availability of a separate “question network” is a critical component of our approach to discovery, as it enables the agent to discover from experience the most suitable questions about the future to be used as auxiliary tasks. The terminology of question and answer networks is derived from work on TD networks (sutton2005temporal); we refer to makino2008line for related work on incremental discovery of the structure of TD networks (work that does not, however, use meta-gradients and that was applied only to relatively simple domains).
2.2 Multi-step meta-gradients
In their most abstract form, reinforcement learning algorithms can be described by an update procedure $\Delta\theta_t$ that modifies, on each step $t$, the agent’s parameters $\theta_t$. The central idea of meta-gradient RL is to parameterise the update $\Delta\theta_t(\eta)$ by meta-parameters $\eta$. We may then consider the consequences of changing $\eta$ on the $\eta$-parameterised update rule by measuring the subsequent performance of the agent, in terms of a “meta-loss” function $m(\theta_{t+k})$. Such a meta-loss may be evaluated after one update (myopic) or $k > 1$ updates (non-myopic). The meta-gradient is then, by the chain rule,
$$\frac{\partial m(\theta_{t+k})}{\partial \eta} = \frac{\partial m(\theta_{t+k})}{\partial \theta_{t+k}} \, \frac{\partial \theta_{t+k}}{\partial \eta}. \qquad (1)$$
Implicit in Equation 1 is that changing the meta-parameters $\eta$ at one time step affects not just the immediate update to $\theta$ on the next time step, but all future updates. This makes the meta-gradient challenging to compute. A straightforward but effective way to capture the multi-step effects of changing $\eta$ is to build a computational graph which consists of a sequence of updates made to the parameters $\theta$, $\theta_t \rightarrow \dots \rightarrow \theta_{t+k}$, with $\eta$ held fixed, ending with a meta-loss evaluation $m(\theta_{t+k})$. The meta-gradient may be efficiently computed from this graph through backward-mode autodifferentiation; this has a computational cost similar to that of the forward computation (griewank2008evaluating), but it requires storage of $k$ copies of the parameters $\theta_{t:t+k}$, thus increasing the memory footprint. We emphasize that this approach is in contrast to the myopic meta-gradient used in previous work, which either ignores effects past the first time step or makes severe approximations.
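The multi-step dependence can be made explicit with a short worked recursion. The notation is an assumption made for illustration: $\theta_k$ denotes the agent parameters after $k$ inner updates, $\eta$ the meta-parameters, $\alpha$ a step size, $g(\theta, \eta)$ the gradient of the inner loss with respect to $\theta$, and $m$ the meta-loss; the starting point $\theta_0$ is treated as independent of $\eta$. This is the forward-mode view of the quantity that the text computes with backward-mode differentiation.

```latex
\begin{align*}
\theta_{k+1} &= \theta_k - \alpha\, g(\theta_k, \eta), \qquad \frac{d\theta_0}{d\eta} = 0, \\
\frac{d\theta_{k+1}}{d\eta} &= \Bigl(I - \alpha\,\frac{\partial g}{\partial \theta}(\theta_k, \eta)\Bigr)\frac{d\theta_k}{d\eta} \;-\; \alpha\,\frac{\partial g}{\partial \eta}(\theta_k, \eta), \\
\frac{\partial m(\theta_K)}{\partial \eta} &= \frac{\partial m(\theta_K)}{\partial \theta_K}\,\frac{d\theta_K}{d\eta}.
\end{align*}
```

With $K = 1$ only the term $-\alpha\,\partial g/\partial \eta$ evaluated at $\theta_0$ survives, which corresponds to the myopic meta-gradient; for $K > 1$ the first term carries the effect of $\eta$ on earlier updates forward through the later ones.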
2.3 A multi-step meta-gradient algorithm for discovery
We apply the meta-gradient algorithm, as presented in Section 2.2, to the discovery of GVF-based auxiliary tasks represented as in the neural network architecture from Section 2.1. The complete pseudo code for the proposed approach to discovery is outlined in Algorithm 1.
On each iteration of the algorithm, in an inner loop we apply $L$ updates to the agent parameters $\theta$, which parameterise the main-task policy and the GVF answers, using separate samples of experience in an environment. Then, in the outer loop, we apply a single update to the meta-parameters $\eta$ (the question network that parameterises the cumulant and discount functions that define the GVFs), based on the effect of the updates to $\theta$ on the meta-loss. Next, we make each of these steps explicit.
The inner update includes two components: the first is a canonical deep reinforcement learning update using a loss denoted $\mathcal{L}^{RL}$ for optimizing the main-task policy $\pi$, either directly (as in policy-based algorithms, e.g., Williams1992) or indirectly (as in value-based algorithms, e.g., Watkins:1989). The second component is an update rule for estimating the answers to GVF-based questions. With slight abuse of notation, we can then denote each inner-loop update as the following gradient descent step, with step size $\alpha$, on the pseudo losses denoted with $\mathcal{L}^{RL}$ and $\mathcal{L}^{ans}$:
$$\theta_{t+1} \leftarrow \theta_t - \alpha \nabla_{\theta} \mathcal{L}^{RL}(\theta_t) - \alpha \nabla_{\theta} \mathcal{L}^{ans}(\theta_t). \qquad (2)$$
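The meta-loss $m$ is the sum of the RL pseudo losses $\mathcal{L}^{RL}$ associated with the main-task updates, as computed on the batches generated in the inner loop; it is a function of the meta-parameters $\eta$ through the updates to the answers. We can therefore compute the update to the meta-parameters, with meta step size $\alpha_{\eta}$ and inner unroll length $L$, as
$$\eta \leftarrow \eta - \alpha_{\eta} \nabla_{\eta} \sum_{k=1}^{L} \mathcal{L}^{RL}(\theta_{t+k}). \qquad (3)$$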
This meta-gradient procedure optimizes the area under the curve over the temporal span defined by the inner unroll length $L$. Alternatively, the meta-loss may be evaluated on the last batch alone, to optimize for final performance. Unless we specify otherwise, we use the area under the curve.
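The inner and outer loops described above can be illustrated on a toy problem. The following is a minimal sketch under strong assumptions, not the agent used in the paper: the agent parameters and the meta-parameters are single scalars, the main-task loss and the answer loss are hypothetical quadratics, the inner update uses only the answer loss (as in the representation learning variant), and the derivative is accumulated in forward mode rather than by backward-mode autodifferentiation. All function names are illustrative.

```python
# Toy sketch of an unrolled (multi-step) meta-gradient on scalar parameters.
# Every loss and name here is a hypothetical stand-in, not the paper's agent.

TARGET = 3.0  # minimiser of the stand-in main-task loss


def main_loss(theta):
    """Stand-in main-task loss: 0.5 * (theta - TARGET)^2."""
    return 0.5 * (theta - TARGET) ** 2


def inner_step(theta, eta, alpha):
    """One gradient step on the stand-in answer loss 0.5 * (theta - eta)^2."""
    return theta - alpha * (theta - eta)


def summed_meta_loss(theta, eta, alpha, unroll_len):
    """Sum of main-task losses after each of the unrolled inner updates."""
    total = 0.0
    for _ in range(unroll_len):
        theta = inner_step(theta, eta, alpha)
        total += main_loss(theta)
    return total


def unrolled_meta_gradient(theta, eta, alpha, unroll_len):
    """Return (theta after the unroll, d summed_meta_loss / d eta)."""
    dtheta_deta = 0.0  # the starting theta is treated as independent of eta
    meta_grad = 0.0
    for _ in range(unroll_len):
        # Differentiate the inner step: theta' = (1 - alpha) * theta + alpha * eta.
        dtheta_deta = (1.0 - alpha) * dtheta_deta + alpha
        theta = inner_step(theta, eta, alpha)
        meta_grad += (theta - TARGET) * dtheta_deta
    return theta, meta_grad


def discover(num_outer=200, unroll_len=5, alpha=0.1, meta_lr=0.5):
    """Alternate unrolled inner updates of theta with one outer update of eta."""
    theta, eta = 0.0, 0.0
    for _ in range(num_outer):
        theta, meta_grad = unrolled_meta_gradient(theta, eta, alpha, unroll_len)
        eta -= meta_lr * meta_grad
    return theta, eta


if __name__ == "__main__":
    print(discover())
```

Calling `unrolled_meta_gradient` with `unroll_len=1` gives the myopic meta-gradient of this toy problem, and replacing the running sum with the last term alone would give the final-loss variant of the meta-objective.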
2.4 An actor critic agent with discovery of questions for auxiliary tasks
In this section we describe a concrete instantiation of the algorithm in the context of an actor-critic reinforcement learning agent. The network on the left of Figure 1 is composed of three modules: 1) an encoder network that takes the last $i$ observations as inputs and outputs a state representation $s_t$; 2) a main-task network that, given the state $s_t$, estimates both the policy $\pi$ and a state value function $v$ (Sutton:1988); 3) an answer network that, given the state $s_t$, approximates the GVF answers $y$. In this paper, the main-task and answer networks will be linear functions of the state $s_t$.
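The main-task network parameters $\theta^{\text{main}}$ are only affected by the RL component of the update defined in Equation 2. In an actor-critic agent, $\theta^{\text{main}}$ is the union of the parameters $\theta^{v}$ of the state values and the parameters $\theta^{\pi}$ of the softmax policy $\pi$. Therefore the update is the sum of a value update $\alpha\,(G^{v}_t - v(s_t))\,\nabla_{\theta^{v}} v(s_t)$ and a policy update $\alpha\,(G^{v}_t - v(s_t))\,\nabla_{\theta^{\pi}} \log \pi(a_t \mid s_t)$, where $G^{v}_t$ is a multi-step truncated return, using the agent’s estimates $v$ of the state values for bootstrapping after $n$ steps.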
The answer network parameters $\theta^{y}$, instead, are only affected by the second term of the update in Equation 2. Since the answers estimate, on-policy under $\pi$, an expected cumulative discounted sum of cumulants, we may use a generalized temporal difference learning algorithm to update $\theta^{y}$. In our agents, the vector $y$ is a linear function of state, and therefore each GVF prediction $y_i$ is separately parameterised by $\theta^{y_i}$. The update for parameters $\theta^{y_i}$ may then be written as $\alpha\,(G^{y_i}_t - y_i(s_t))\,\nabla_{\theta^{y_i}} y_i(s_t)$, where $G^{y_i}_t$ is the multi-step, truncated, $\gamma_i$-discounted sum of cumulants $u_i$ from time $t$ onwards. As in the main-task updates, the notation $G^{y_i}_t$ highlights that we use the answer network’s own estimates to bootstrap after a fixed number of steps.
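The truncated return and the linear answer update described above can be sketched in a few lines. This is a minimal illustration under assumptions: the recursion $G_t = u_{t+1} + \gamma_{t+1} G_{t+1}$ is one common indexing convention (the paper's exact indexing is not shown here), and the function names are hypothetical. The main-task return is the special case where the cumulants are rewards and the discount is constant.

```python
def truncated_return(cumulants, discounts, bootstrap_value):
    """Multi-step truncated return for one GVF.

    cumulants[k] and discounts[k] are the cumulant and the discount observed at
    the k-th step after time t; bootstrap_value is the network's own prediction
    at the state reached after the last step.
    """
    value = bootstrap_value
    # Work backwards: G_k = cumulant_k + discount_k * G_{k+1}.
    for cumulant, discount in zip(reversed(cumulants), reversed(discounts)):
        value = cumulant + discount * value
    return value


def linear_td_update(weights, features, target, learning_rate):
    """One TD step for a linear prediction y(s) = weights . features."""
    prediction = sum(w * x for w, x in zip(weights, features))
    error = target - prediction
    # The gradient of a linear prediction with respect to its weights is the feature vector.
    return [w + learning_rate * error * x for w, x in zip(weights, features)]


if __name__ == "__main__":
    target = truncated_return([1.0, 0.0, 2.0], [0.5, 0.5, 0.5], bootstrap_value=4.0)
    print(target)
    print(linear_td_update([0.0, 0.0], [1.0, 2.0], target, learning_rate=0.1))
```

Because each GVF prediction has its own weights, the same two functions can be applied independently to every cumulant and discount pair produced by the question network.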
The main-task pseudo loss $\mathcal{L}^{RL}$ and the answer-network pseudo loss $\mathcal{L}^{ans}$ used in the updates above can also be straightforwardly used to instantiate Equation 2 for the parameters of the encoder network, and to instantiate Equation 3 for the parameters of the question network. For the shared state representation, with parameters $\theta^{\text{enc}}$, we explore two updates: (1) using the gradients from both the main task and the answer network, i.e., $-\alpha \nabla_{\theta^{\text{enc}}} \mathcal{L}^{RL} - \alpha \nabla_{\theta^{\text{enc}}} \mathcal{L}^{ans}$, and (2) using only the gradients from the answer network, $-\alpha \nabla_{\theta^{\text{enc}}} \mathcal{L}^{ans}$. Using both the main-task and the answer-network components is more consistent with the existing literature on auxiliary tasks, but ignoring the main-task updates provides a more stringent test of whether the algorithm is capable of meta-learning questions that can drive, even on their own, the learning of adequate state representations.
3 Experimental setup
In this section we outline the experimental setup, including the environments we used as test-beds and the high level agent and neural network architectures. We refer to the Appendix for more details.
Puddleworld domain: is a continuous state gridworld domain (degris2012off), where the state space is a -dimensional position in . The agent has actions, where four of these actions move the agent in one of the four cardinal directions by a mean offset of and the last action has an offset of . The actions have a stochastic effect on the environment because, on each step, uniform noise sampled in the range is added to each action component. We refer to degris2012off for further details about this environment.
Collect-objects domain: is a four-room gridworld, where the agent is rewarded for collecting two objects in the right order. The agent moves deterministically in one of four cardinal directions. For each episode the starting position is chosen randomly. The locations of the two objects are the same across episodes. The agent receives a reward of for picking up the first object and a reward of for picking up the second object after the first one. The maximum length of each episode is .
Atari domain: the Atari games were designed to be challenging and fun for human players, and were packaged up into a canonical benchmark for RL agents: the Arcade Learning Environment (Bellemare:2013; mnih2015human; mnih2016asynchronous; SchulmanTRPO; SchulmanPPO; Rainbow). When summarizing results on this benchmark, we follow the common approach of first normalizing scores on each game using the scores of random and human agents (vanHasselt:2016).
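The normalization mentioned above is commonly computed as the agent's score relative to the gap between a random agent and a human player. The sketch below assumes that common form and returns a fraction (1.0 means human-level) rather than a percentage; the function name and the example numbers are hypothetical.

```python
def human_normalized_score(agent_score, random_score, human_score):
    """Return 0.0 for random-level play and 1.0 for human-level play."""
    if human_score == random_score:
        raise ValueError("human and random scores must differ")
    return (agent_score - random_score) / (human_score - random_score)


if __name__ == "__main__":
    # Made-up scores for a single game, for illustration only.
    print(human_normalized_score(agent_score=50.0, random_score=10.0, human_score=110.0))
```

Normalizing per game before aggregating keeps games with large raw score ranges from dominating a median or mean taken across the benchmark.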
3.2 Our agents
For the gridworld experiments, we implemented meta-gradients on top of a -step actor-critic agent with parallel actor threads (mnih2016asynchronous). For the Atari experiments, we used a -step IMPALA (espeholt2018impala) agent with distributed actors. In the non-visual domain of Puddleworld, the encoder is a simple MLP with two fully-connected layers. In other domains the encoder is a convolutional neural network. The main-task value and policy, and the answer network, are all linear functions of the state. In the gridworlds the question network outputs a set of cumulants, and the discount factor that jointly defines the GVFs is hand-tuned. In our Atari experiments the question network outputs both the cumulants and the corresponding discounts. In all experiments we report scores and curves averaging results from independent runs of each agent, task or hyperparameter configuration. In Atari we use a single set of hyperparameters across all games.
3.3 Baselines: handcrafted questions as auxiliary tasks
In our experiments we consider the following baseline auxiliary tasks from the literature.
Reward prediction: This baseline agent has no question network. Instead it uses the scalar reward obtained at the next time step as the target for the answer network. The auxiliary task loss function for the reward prediction baseline is, .
Pixel control: This baseline also has no question network. The auxiliary task is to learn to optimally control changes in pixel intensities. Specifically, the answer network must estimate optimal action values for cumulants corresponding to the average absolute change in pixel intensities, between consecutive (in time) observations, for each cell in an non-overlapping grid overlayed onto the observation. The auxiliary loss function for the action values of the cell is: , where refers to discounted sum of pseudo-rewards for the cell. The auxiliary loss is summed over the entire grid .
Random questions: This baseline agent is the same as our meta-gradient based agent except that the question network is kept fixed at its randomly initialized parameters through training. The answer network is still trained to predict values for the cumulants defined by the fixed question network.
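The pixel-control cumulants described above (average absolute change in pixel intensity within each cell of a non-overlapping grid) can be sketched as follows. This is an illustrative sketch only: frames are assumed to be greyscale and stored as nested lists, the cell size is a free parameter, and pixels that do not fill a whole cell at the border are ignored.

```python
def pixel_change_cumulants(prev_frame, next_frame, cell_size):
    """Average absolute intensity change per cell of a non-overlapping grid.

    Both frames are lists of rows of numeric pixel intensities with the same
    shape. Returns a grid (list of rows) with one cumulant per cell.
    """
    height, width = len(prev_frame), len(prev_frame[0])
    grid_rows, grid_cols = height // cell_size, width // cell_size
    cumulants = []
    for r in range(grid_rows):
        row_values = []
        for c in range(grid_cols):
            total = 0.0
            for y in range(r * cell_size, (r + 1) * cell_size):
                for x in range(c * cell_size, (c + 1) * cell_size):
                    total += abs(next_frame[y][x] - prev_frame[y][x])
            row_values.append(total / (cell_size * cell_size))
        cumulants.append(row_values)
    return cumulants


if __name__ == "__main__":
    before = [[0] * 4 for _ in range(4)]
    after = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 2]]
    print(pixel_change_cumulants(before, after, cell_size=2))
```

Each entry of the returned grid plays the role of a pseudo-reward for one cell, which is what makes the task a control problem per cell rather than a single prediction.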
4 Empirical findings
In this section, we empirically investigate the performance of the proposed algorithm for discovery, as instantiated in Section 2.4. We refer to our meta-learning agent as the “Discovered GVFs” agent. Our experiments address the following questions:
Can meta-gradients discover GVF-questions such that learning the answers to them is sufficient, on its own, to build representations good enough for solving complex RL tasks? We refer to these as the “representation learning” experiments.
Can meta-gradients discover GVF-questions such that learning to answer these alongside the main task improves the data efficiency of an RL agent? In these experiments the representation is shaped by both the updates based on the discovered GVFs and the main-task updates; we will thus refer to these as the “joint learning” experiments.
In both settings, how do auxiliary tasks discovered via meta-gradients compare to hand-crafted tasks from the literature? Also, how is performance affected by design decisions such as the number of questions, the number of inner steps used to compute meta-gradients, and the choice between area under the curve versus final loss as meta-objective?
We note that the “representation learning” experiments are a more stringent test of our meta-learning algorithm for discovery, compared to the “joint learning” experiments. However, the latter is consistent with the literature on auxiliary tasks and can be more useful in practice.
4.1 Representation learning experiments
In these experiments, the parameters of the encoder network are unaffected by gradients from the main-task updates. Figures 5 and 5 compare the performance of our meta-gradient agents to the baseline agents that train the state representation using the hand-crafted auxiliary tasks described in Section 3.3. We always include a reference curve (in black) corresponding to the baseline actor-critic agent with no answer or question networks, where the representation is trained directly using the main-task updates. We report results for the Collect-objects domain, Puddleworld, and three Atari games (more are reported in the Appendix). From the experiments we highlight the following:
Discovery: in all the domains, we found evidence that the state representation learned solely through learning the GVF-answers to the discovered questions was sufficient to support learning good policies. Specifically, in the two gridworld domains the resulting policies were optimal (see Figure 5); in the Atari domains the resulting policies were comparable to those achieved by the state of the art IMPALA agent after training for 200M frames (see Figure 5). This is one of our main results, as it confirms that non-myopic meta-gradients can discover questions, in the forms of cumulants and discounts, useful to capture rich enough knowledge of the world to support the learning of state-representations that yield good policies even in complex RL tasks.
Baselines: we also found that learning the answers to questions discovered using meta-gradients resulted in state representations that supported better performance, on the main task, compared to the representations resulting from learning the answers to popular hand-crafted questions in the literature. In the gridworld experiments in Figure 5, learning the representation using “Reward Prediction” (purple) or “Random GVFs” (blue) resulted in notably worse policies than those learned by the agent with “Discovered GVFs”. Similarly, in Atari (shown in Figure 5) the handcrafted auxiliary tasks, now including a “Pixel Control” baseline (green), resulted in almost no learning.
Main-Task driven representations: Note that the actor-critic agent that trained the state representation using the main-task updates directly learned faster than the agents where the representation was exclusively trained using auxiliary tasks. The baseline required only 3M steps on the gridworlds and 200M frames on Atari to reach the final performance. This is expected and it is true both for our meta-gradient solution as well as the auxiliary tasks from the literature.
We used the representation learning setting to investigate a number of design choices. First, we compared optimizing the area under the curve over the length of the unrolled meta-gradient computation (“Summed Meta-Loss”) to computing the meta-gradient on the last batch alone (“End Meta-Loss”). As shown in Figure 5, both approaches can be effective, but we found optimizing the area under the curve to be more stable. Next, we examined the role of the number of GVF questions and the effect of varying the number of steps unrolled in the meta-gradient calculation. For this purpose, we used the less compute-intensive gridworlds: Collect-Objects (reported here) and Puddleworld (in the Appendix). On the left in Figure 5, we report a parameter study, plotting the performance of the agent with meta-learned auxiliary tasks as a function of the number of questions. The dashed black line corresponds to the optimal (final) performance. Too few questions did not provide enough signal to learn good representations: in that setting the dashed red line is far from optimal. Other values all led to learning a good representation capable of supporting an optimal policy. However, too many questions made learning slower, as shown by the drop in average performance. The number of questions is therefore an important hyperparameter of the algorithm. On the right in Figure 5, we report the effect on performance of the number of unrolled steps used for the meta-gradient computation. A single unrolled step corresponds to the myopic meta-gradient: in contrast to previous work (xu2018meta; zheng2018learning), the representation learned with the shortest unroll lengths was insufficient for the final policy to do anything meaningful. Performance generally improved as we increased the unroll length (although the computational cost of meta-gradients also increased). Again the trend was not fully monotonic, with the largest unroll length performing worse than a shorter one, both in terms of final and average performance. We conjecture this may be due to the increased variance of the meta-gradient estimates as the unroll length increases. The number of unrolled steps is therefore also a sensitive hyperparameter. Note that neither the number of questions nor the number of unrolled steps was tuned in other experiments; all other results use the same fixed settings of both.
4.2 Joint learning Experiments
The next set of experiments uses the most common setting in the literature on auxiliary tasks, where the representation is learned jointly from the auxiliary-task updates and the main-task updates. To accelerate the learning of useful questions, we provided the encoded state representation as input to the question network instead of learning a separate encoding; this differs from the previous experiments, where the question network was a completely independent network (consistent with the objective of a more stringent evaluation of our algorithm). We used a benchmark consisting of 57 distinct Atari games to evaluate the “Discovered GVFs” agent together with an actor-critic baseline (“IMPALA”) and two auxiliary tasks from the literature: “Reward Prediction” and “Pixel Control”.
None of the auxiliary tasks outperformed IMPALA on every one of the games. To analyse the results, we ranked games according to the performance of the agent with pixel-control questions, to identify the games more conducive to improving performance through the use of auxiliary tasks. On the left of Figure 6, we report the relative gains of the “Discovered GVFs” agent over IMPALA, on the top- games for the “Pixel Control” baseline: we observed large gains in out of games, small gains in , and losses in . On the right in Figure 6, we provide a more comprehensive view of the performance of the agents. For each number on the x-axis () we present the median human normalized score achieved by each method on the top- games, again selected according to the “Pixel Control” baseline. It is visually clear that discovering questions via meta-learning is fast enough to compete with handcrafted questions, and that, in games well suited to auxiliary tasks, it greatly improved performance over all baselines. It was particularly impressive to find that the meta-gradient solution outperformed pixel control on these games despite the ranking of games being biased in favour of pixel-control. The reward prediction baseline is interesting, in comparison, because its profile was the closest to that of the actor-critic baseline, never improving performance significantly, but not hurting either.
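The aggregation behind the right-hand plot (median human normalized score over the top-ranked games) can be sketched as below. This is an illustration under assumptions: the exact quantity used to rank games is not spelled out here, so the ranking values are passed in as a separate per-game dictionary, and the game names and numbers in the example are made up.

```python
from statistics import median


def median_on_top_k(scores, ranking_values, k):
    """Median of `scores` over the k games with the highest `ranking_values`.

    Both arguments map game name -> number; `ranking_values` would come from
    the reference agent used to order the games.
    """
    ranked_games = sorted(ranking_values, key=ranking_values.get, reverse=True)
    return median(scores[game] for game in ranked_games[:k])


if __name__ == "__main__":
    # Made-up human normalized scores, for illustration only.
    method_scores = {"game_a": 1.0, "game_b": 3.0, "game_c": 2.0, "game_d": 10.0}
    reference_ranking = {"game_a": 0.9, "game_b": 0.5, "game_c": 0.7, "game_d": 0.1}
    for k in range(1, 5):
        print(k, median_on_top_k(method_scores, reference_ranking, k))
```

Sweeping `k` from a small value up to the full set of games traces out one curve per method, with every method evaluated on the same game subsets.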
5 Conclusions and Discussion
There are many forms of questions that an intelligent agent may want to discover. In this paper we introduced a novel and efficient multi-step meta-gradient procedure for the discovery of questions in the form of on-policy GVFs. In a stringent test, our representation learning experiments demonstrated that the meta-gradient approach is capable of discovering useful questions such that answering them can drive, by itself, learning of state representations good enough to support the learning of a main reinforcement learning task. Furthermore, our auxiliary tasks experiments demonstrated that the meta-learning based discovery approach is data-efficient enough to compete well with, and in many cases even outperform, handcrafted questions developed in prior work.
Prior work on auxiliary tasks relied on human ingenuity to define questions useful for shaping the state representation used in a certain task, but it’s hard to create questions that are both useful and general (i.e., that can be applied across many tasks). GeomPerspectiveRL introduced a geometrical perspective to understand when auxiliary tasks give rise to good representations. Our solution differs from this line of work in that it enables us to side-step the question of how to design good auxiliary questions, by meta-learning them, directly optimizing for utility in the context of any given task. Our approach fits in a general trend of increasingly relying on data rather than human designed inductive biases to construct effective learning algorithms (AlphaZero; InductiveBiasesRL).
A promising direction for future research is to investigate off-policy GVFs, where the policy under which we make the predictions differs from the main-task policy. We also note that our approach to discovery is quite general, and could be extended to meta-learning other kinds of questions that do not fit the canonical GVF formulation; see GNLB for one such class of predictive questions. Finally, we emphasize that the unrolled multi-step meta-gradient algorithm is likely to benefit previous applications of myopic meta-gradients, and may also open up further applications, other than discovery, where the myopic approximation would fail.
We thank John Holler and Zeyu Zheng for many useful comments and discussions. The work of the authors at the University of Michigan was supported by a grant from DARPA’s L2M program and by NSF grant IIS-1526059. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors.
6.1 Neural network architecture and details
Representation learning experiments:
Puddleworld domain: A multi-layer perceptron (MLP) with two fully-connected layers.
Collect-Objects domain: A two-layer convolutional neural network (CNN) with filters in each layer respectively. The filter sizes were in both layers. The CNN’s output is then fed to a fully-connected layer with hidden units. ReLU activation functions are used throughout.
Atari domain: A three-layer CNN architecture that has been successfully used on Atari in several variants of DQN (mnih2015human; vanHasselt:2016; Rainbow). The CNN layers consist of filters respectively, with filter sizes . The stride lengths at each of these layers were set to respectively. ReLU activation functions are used throughout.
In all cases, the output of the encoding module is linearly mapped to produce the policy, value function and answer net heads. ReLU activations are used in the learning agent.
We use an independent question network in all representation learning experiments; the architecture of the hidden layers matches that of the learning agent exactly. Note, however, that the heads of the question network output cumulants and discounts to be used as questions, and these are both vectors, of the same size . We use activations for cumulants and a for discounts.
Joint learning experiments:
Atari domain: We use a Deep ResNet architecture identical to the one from espeholt2018impala. The only difference is in the outputs, as we now have an answer head in addition to the policy and value heads.
The question net takes the last hidden layer of the ResNet as input; it uses a meta-trained two-layer MLP to produce cumulants, and a separately parameterised two-layer MLP to produce discounts. The MLPs have hidden units respectively, with ReLU activations in both. As in the representation learning experiments, we use activations for cumulants and a for discounts.
6.2 Hyperparameters used in our experiments
Representation learning experiments:
A2C: The A2C agents (used in the gridworld domains) use -step returns in the pseudo-loss. We searched over the initial learning rate for the RMSProp optimizer and the entropy regularization coefficient in a range of values; the best combination of these hyperparameters was chosen according to the results of the A2C baseline, and then used for all agents. The range of values for the initial learning rate was: . The range of values for entropy regularization was: . The hyperparameter of the RMSProp optimizer is set to . The number of unrolling steps is set to .
IMPALA: All agents based on IMPALA (used in the Atari domain) use the hyperparameters reported by espeholt2018impala. They are listed in Table 1 together with the hyperparameters specific to the “Discovered GVFs” agent and to the other baselines. The number of unrolling steps for meta-gradients is .
Joint learning experiments:
The hyperparameters specific to the auxiliary tasks are obtained by a search over ten games (ChopperCommand, Breakout, Seaquest, SpaceInvaders, KungFuMaster, MsPacman, Krull, Tutankham, BattleZone, BeamRider), following common practice in Deep RL Atari experiments (mnih2015human; vanHasselt:2016; Rainbow). Once chosen from this search, the hyperparameters remain fixed across all Atari games.
In the Atari domain, the input to the learning agent consists of consecutively stacked frames where each frame is a result of repeating the previous action for time-steps, greyscaling and downsampling the resulting frames to 84x84 images, and max-pooling the last 2. This is a fairly canonical pre-processing pipeline for Atari. Additionally, rewards are clipped to the [-1, 1] range.
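Two of the preprocessing steps above, max-pooling the last two frames and clipping rewards, are simple enough to sketch directly. This is a minimal illustration with hypothetical function names and frames stored as nested lists; greyscaling, downsampling, action repetition and frame stacking are left out.

```python
def max_pool_frames(frame_a, frame_b):
    """Pixel-wise maximum of two frames with the same shape."""
    return [
        [max(pixel_a, pixel_b) for pixel_a, pixel_b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]


def clip_reward(reward, low=-1.0, high=1.0):
    """Clip a reward to the [low, high] range ([-1, 1] by default)."""
    return max(low, min(high, reward))


if __name__ == "__main__":
    print(max_pool_frames([[0, 5], [3, 1]], [[2, 4], [3, 7]]))
    print([clip_reward(r) for r in (-10.0, 0.5, 200.0)])
```

Taking the pixel-wise maximum over consecutive frames is the usual way to counter sprites that are drawn only on alternate frames, and clipping keeps reward magnitudes comparable across games when a single set of hyperparameters is shared.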
Table 1: Hyperparameters of the IMPALA-based agents and of the auxiliary tasks.
| Hyperparameter | Value |
|---|---|
| Network Architecture | Deep ResNet |
| Value loss coefficient | 0.5 |
| Global gradient norm clip | 40 |
| Learning rate schedule | Anneal linearly to 0 |
| Number of learners | 1 |
| Number of actors | 200 |
| Meta learning rate | 0.0006 |
| Meta gradient norm clip (cumulants) | 1 |
| Meta gradient norm clip (discounts) | 10 |
| Number of Questions | 128 |
| Auxiliary loss coefficient | 0.0001 |
| Auxiliary loss coefficient | 0.0001 |
| Auxiliary loss coefficient | 0.001 |
6.4 Derivation of myopic approximation to meta-gradients
Here we derive the myopic approximation for our meta-gradient procedure that was previously described in the main text.
Equations 5, 8 and 10 are a myopic approximation because they ignore the fact that $\theta_t$ is affected by the changes in $\eta$. Furthermore, in Equations 8 and 10, the policy and value function are only indirect functions of $\eta$ (i.e., they are indirectly affected by the auxiliary loss) and thus they do not participate in the myopic approximation. Therefore, after applying all the approximations, we get the following myopic update rule for the meta-parameters $\eta$:
6.5 Comparison between myopic and unrolled meta-gradient
Figure 7 visualizes the computation graphs of the unrolled and of the myopic meta-gradient computations. In the unrolled computation, the gradient of the meta-objective with respect to the meta-parameters ($\eta$) is computed in such a way that the effect of these parameters over a longer time-scale is taken into consideration. The gradient for this unrolled computation is given in Equation 4. In contrast, the myopic gradient computation only considers the immediate, one-time-step effect of the meta-parameters on the agent’s policy. The meta-gradient update based on this myopic gradient computation is given in Equation 11.
6.6 Additional Results
Representation learning experiments: The aim of the representation learning experiments is to evaluate how well auxiliary tasks can drive, on their own, representation learning in support of a main reinforcement learning task. In Figure 8 we report additional representation learning results for Atari games (jamesbond, gravitar, frostbite, amidar, bowling and chopper command), including the games from the main text. The “Discovered GVFs” (red), “Pixel Control” (green), “Reward Prediction” (purple) and “Random GVFs” (blue) agents all rely exclusively on auxiliary tasks to drive representation learning, while the linear policy and value functions are trained using the main-task updates. In all games the “Discovered GVFs” agent significantly outperforms the baselines that use the handcrafted auxiliary tasks from the literature to train the representation. In two games (gravitar and frostbite) the “Discovered GVFs” agent also significantly outperforms the plain “IMPALA” agent (trained for 200M frames) that uses the main-task updates to train the state representation. In Figure 9 we report the parameter studies for the “Discovered GVFs” agent in the second gridworld domain, Puddleworld; the plots show performance as a function of the number of questions used as auxiliary tasks (on the left) and the number of steps unrolled to compute the meta-gradient (on the right). Again, results are consistent with those reported in the main text for the Collect-Objects domain.
Joint learning experiments: In Figures 11, 12 and 10 we provide additional details for the “joint learning” experiments. The aim of these experiments is to show whether the process of discovery of useful questions via meta-gradients is fast enough to improve the data efficiency of an agent in a standard setting where the state representation is trained using both the auxiliary-task updates and the main-task updates. We report relative performance improvements achieved by the “Discovered GVFs” agent over the “IMPALA”, “Pixel Control” and “Reward Prediction” agents, after 200M training frames, on each of the Atari games. The same hyperparameters are used for all games. The relative improvements are computed using the human normalized final performance of each agent, averaged across replicas of each experiment (for reproducibility).