Credit Assignment as a Proxy for Transfer in Reinforcement Learning

07/18/2019 ∙ by Johan Ferret, et al. ∙ 2

The ability to transfer representations to novel environments and tasks is a sensible requirement for general learning agents. Despite the apparent promises, transfer in Reinforcement Learning is still an open and under-exploited research area. In this paper, we suggest that credit assignment, regarded as a supervised learning task, could be used to accomplish transfer. Our contribution is twofold: we introduce a new credit assignment mechanism based on self-attention, and show that the learned credit can be transferred to in-domain and out-of-domain scenarios.




1 Introduction

To some, intelligence is measured as the capability of transferring knowledge to unprecedented situations. While the notion of intellect itself is hard to define, the ability to reuse learned information is a desirable trait for general learning agents. For a civilization, for instance, effective knowledge transmission is key to survival and further development. Avoiding building representations from the ground up is a sensible goal for representation learning. Hence, unsurprisingly, knowledge transfer is an active research area in Machine Learning. There have been great successes for transfer in the field of Computer Vision (Razavian et al., 2014). Transfer approaches usually reuse the hidden states of deep neural networks trained on canonical tasks as a pre-training step. More recently, transfer in Natural Language Processing (NLP) also made a leap forward. Unsupervised approaches showed encouraging results on several benchmarks (Devlin et al., 2018; Peters et al., 2018; Howard and Ruder, 2018).

In Reinforcement Learning (RL) (Sutton and Barto, 2018), the performance of learning agents is usually measured after they are trained from scratch, and transfer remains hard and ill-defined. For this reason, no technique is considered standard by today’s RL practitioners. While it is tempting to reduce it to the transfer of representations, as in Computer Vision or NLP, the RL literature is divided as to what transfer means. For some, transfer means the ability of RL agents to cope with changes to the input distribution (Higgins et al., 2017). For others, transfer means adapting to changes to the agent’s goal (Schaul et al., 2015; Andrychowicz et al., 2017; Florensa et al., 2018), to the reward function (Barreto et al., 2017; Teh et al., 2017), or to the dynamics of the environment (Packer et al., 2018). We differentiate the two standard transfer scenarios (evaluating agents on tasks from the same distribution as opposed to evaluating agents on tasks from a different distribution) by calling them respectively in-domain transfer and out-of-domain transfer.

Transfer is notoriously hard in the RL context because agents build internal representations that are calibrated for the task at hand. In other words, learning to act while learning to see undermines the general transferability of the representations learned. Our work is motivated by this observation and lies at the crossroads between three ideas that we have not seen put in relation before. First, a promising lead for transfer in RL is to consider the transfer of representations learned independently from the task solved. To do so, we argue that a supervised learning method, solving a prediction problem in the environment independently from the RL task, should be used. This has been highlighted several times in the case of representations learned from raw observations (Finn et al., 2016; Barreto et al., 2017).
Second, we study a backward approach for credit assignment since similar ideas have been shown to synergize well with value-based RL methods (Arjona-Medina et al., 2018; Hung et al., 2018). By looking in the rearview mirror, the credit inferred is contextualized and avoids common pitfalls of the (Q-)value. Such pitfalls include state leakage and exponentially slow updates in the case of delayed rewards. Third, we look at self-attention (Lin et al., 2017) and more generally at Transformer models (Vaswani et al., 2017). They have an interesting potential for the identification of links in sequential data. For instance, the Transformer was shown to learn both syntactic and semantic characteristics when applied to machine translation (Vaswani et al., 2017). Self-attention was also incorporated in neural architectures for relational reasoning that learned correct associations of sequential elements without direct supervision (Santoro et al., 2018; Zambaldi et al., 2019).

In the light of those ideas, we propose to investigate the role credit assignment and self-attention could play together for transfer in RL. To do so, we propose a credit assignment module that is an attention-based reward model trained to predict quantized environment rewards in a supervised way. The observations that feed the reward model come from a partially observable version of the Markov Decision Process (MDP) considered. The induced partial observability acts as a regularization mechanism and forces the reward model to reason about and exploit relations between state-action couples along trajectories of interaction between an agent and the environment. We show that the attention mechanism helps learn the dependencies that are necessary to solve the prediction task. Our contributions are the following: (i) we introduce a simple, flexible and interpretable credit assignment module based on state associations; (ii) we provide a straightforward protocol to exploit this credit to speed up learning, with optimality guarantees in the tabular case; (iii) we show that the learned representations can be transferred to accelerate learning in new environments.

2 Background

We place ourselves in the classical Markov Decision Process (MDP) formalism (Puterman, 1994). An MDP is a tuple $(S, A, \gamma, R, P)$ where $S$ is a state space, $A$ is an action space, $\gamma$ is a discount factor ($\gamma \in [0, 1)$), and $R : S \times A \times S \to \mathbb{R}$ is a reward function that maps state-action pairs, together with the resulting state, to the expected reward for taking such an action in such a state. Note that we choose a form that includes the resulting state in the definition of the reward function over the typical $R : S \times A \to \mathbb{R}$. This is for consistency with objects defined later on. Finally, $P$ is a transition kernel that maps state-action pairs to a probability distribution over resulting states. We denote vector spaces with capital letters and elements of those spaces with small letters. The current discrete time index is denoted $t$ unless stated otherwise. An RL agent interacts with an MDP at a given timestep $t$ by choosing an action $a_t$ and receiving a resulting state $s_{t+1}$ and a reward $r_t$ from the environment. A trajectory is a sequence of state-action pairs and resulting rewards accumulated in an episode. The performance of an agent is evaluated by its expected discounted cumulative reward $\mathbb{E}\left[\sum_{t} \gamma^{t} r_t\right]$.
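As a small illustration of the evaluation criterion above, the following sketch computes the discounted cumulative reward of a single episode. The function name and the example rewards are hypothetical and only serve to make the definition concrete; the expectation in the text would be approximated by averaging this quantity over many episodes.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * r_t for one episode (hypothetical helper)."""
    total = 0.0
    # Iterate backwards so that each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# A sparse-reward episode: a single reward after three empty steps.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(episode_rewards, gamma=0.9))
```

With a discount factor below one, the single delayed reward is worth less the later it arrives, which is one reason why delayed rewards make credit assignment difficult.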

3 Self-Attentional Credit Assignment


In RL, credit assignment is the ability to identify actions that, in the context of the state they are taken in, are responsible for future rewards. It is of the utmost importance since the agent seeks to optimize its discounted cumulative reward over long sequences of actions, all of which might unlock great future reward. Credit can be used to identify and reward adequate early behaviour by creating a positive feedback loop. We argue that credit assignment can be cast as a supervised learning problem. More specifically, we cast it as a reward prediction task that we address with a sequence-to-sequence (seq2seq) model (Sutskever et al., 2014) with self-attention. This has many advantages, beyond the standard qualities of supervised learning (higher sample efficiency, greater control over the sampling distribution than RL, parallelization for speedups, better studied theory and better understanding). Indeed, reward prediction is task-related and thus should help learn meaningful representations. In addition, as this is a separate learning process, it does not affect the learning of the RL agent’s representations (which are known to be hard to transfer), but an RL agent is able to leverage the information captured by this extra network (credit assignment can be used to enrich the reward signal and speed up RL). Additionally, it builds contextualized representations that help the prediction task and are easier to transfer because they are control-agnostic. Finally, self-attention appears particularly interesting for credit assignment because it relates sequential elements in a constant number of operations. This feature is key for credit assignment. It greatly facilitates the identification of links in sequences, without assumptions about the temporal distance between elements.

An obstacle to this view arises from the very nature of MDPs though: states are assumed to contain perfect information about the environment to fulfill the Markov property. This hinders the potential of backward credit assignment since the current state is supposedly sufficient to summarize the whole history of interaction. Thus, predictive models are highly biased towards focusing on the current input and forgetting contextual information from previous inputs. Under that consideration, we replace states by partial states obtained through a transformation that breaks the Markov assumption, for instance by moving from a fully informed third-person view of an environment to a partial first-person view. Doing so encourages the model to look into the past to find predictive signal, and allows us to track the relative importance given to each element to reconstruct the credit assigned.

Reward model architecture

We use a Transformer decoder (Vaswani et al., 2017) with a single self-attention layer and a single attention head. Transformer models are seq2seq models that differ from classical seq2seq architectures in the sense that they are not auto-regressive and do not make use of single-dimensional convolutions. They have proven useful in several domains (originally machine translation, but also image classification (Bello et al., 2019), sequence generation (Child et al., 2019) and reinforcement learning (Zambaldi et al., 2019)), mainly due to the absence of locality bias and to the constant path length between pairs of distinct sequence elements. As a side note, this constant number of sequential operations between elements is not always guaranteed. Indeed, for computational performance purposes, a limit is imposed on the size of the self-attentional window, thus very long sequences break this assumption. This drawback has been recently addressed (Dai et al., 2019; Child et al., 2019) and we omit it in the following since the length of the sequences we use does not fall in this regime.

The model inputs are partial state-action couples. Each partial state goes through a series of convolutional layers followed by a series of feed-forward layers. Each action is represented as a one-hot vector and concatenated to the processed state after the convolutions. Those representations are fed to a self-attention layer and then to a position-wise feed-forward layer that outputs logits for reward prediction classes.

Self-attention is an attention mechanism with parameterization $(W_q, W_k, W_v)$ that puts sequence elements in relation by computing non-linear similarity scores for all pairs of elements in the sequence. To do so, each sequence element is mapped to a query vector that is matched against keys and values obtained from the previous elements. Let $X$ denote the input sequence in matrix form, $X$ being the result of internal computations of the model on its input. In the same fashion, we denote by $Y$ the sequence resulting from the application of self-attention. We then have

$$Y = \mathrm{softmax}\left(\mathrm{mask}\left(\frac{Q K^\top}{\sqrt{d_k}}\right)\right) V, \qquad (Q, K, V) = (X W_q, X W_k, X W_v),$$

where $(Q, K, V)$ stores queries, keys, and values as linear projections of the input; $d_k$ stands for the dimension of the key vectors. Notably, the resulting state-action representation can be viewed as a linear combination of the values of previous elements: $y_t = \sum_{i \le t} \alpha_{t \leftarrow i} v_i$, where $\alpha_{t \leftarrow i}$ is the corresponding entry of the masked softmax. $\alpha_t$ is a vector containing the attention weights for the prediction at timestep $t$. It is the product of a softmax operation, so its sum is equal to $1$. Since partial states contain only a portion of their initial information, the fact that the model succeeds in the prediction task indicates that it reconstructed the missing information from its past. Therefore, attention weights themselves can be viewed as a form of credit assignment, and will be used as such in what follows.

To be consistent with the goal of assigning credit, the model should not be able to peek into the future. Thus, we restrict the computational window of each sequence element to the information stored in the representations of the previous elements in the sequence and its own. To that end, a causal mask (a lower triangular binary matrix) is applied to the result of the pairwise similarity computations. We apply dropout (Srivastava et al., 2014) as a regularizer, with different rates depending on where it is applied in the model.

We adopt multiclass classification as the prediction task. While performing regression on the rewards could also be an option, we found in our experiments that regression tends to converge to poor local optima. More precisely, we predict the sign of the experienced rewards: $z_t = \mathrm{sign}(r_t)$, with $\mathrm{sign}(x) = -1$ if $x < 0$, $0$ if $x = 0$ and $1$ if $x > 0$. While the quantization method we opted for is simple, our approach is compatible with a broad range of quantization functions. We use the weighted sequential cross-entropy as a loss function over the class-wise model predictions:

$$\mathcal{L} = - \sum_{t} \sum_{c \in \{-1, 0, 1\}} w_c \, \mathbb{1}\{z_t = c\} \, \log p_t(c),$$

where $p_t(c)$ is the predicted probability of class $c$ at timestep $t$ and $w_c$ is the weight of class $c$. By default, we use uniform class weights $w_c$, but we found asymmetrical weighting to be useful in cases of extreme class imbalance.
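The mechanics described in this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the model used in the experiments: queries, keys and values are passed in directly (the learned projections, convolutions, dropout and positional encodings are left out), the causal restriction is implemented by only scoring elements up to the current timestep, and the loss is averaged over timesteps, which is a choice made here for readability. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def causal_self_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Tuple[Matrix, Matrix]:
    """Single-head causal attention.

    Returns (outputs, weights), where weights[t][i] is the attention that the
    prediction at timestep t puts on element i (zero whenever i > t).
    """
    d_k = len(keys[0])
    n, d_v = len(queries), len(values[0])
    outputs, weights = [], []
    for t, q in enumerate(queries):
        # Scaled dot-product scores restricted to past and current elements.
        scores = [sum(a * b for a, b in zip(q, keys[i])) / math.sqrt(d_k)
                  for i in range(t + 1)]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        alpha = [e / norm for e in exps] + [0.0] * (n - t - 1)
        weights.append(alpha)
        # Output is a convex combination of the values of visible elements.
        outputs.append([sum(alpha[i] * values[i][j] for i in range(t + 1))
                        for j in range(d_v)])
    return outputs, weights


def reward_sign(reward: float) -> int:
    """Quantize a reward into one of the classes -1, 0, +1."""
    return (reward > 0) - (reward < 0)


def weighted_cross_entropy(logits: Matrix, targets: Sequence[int],
                           class_weights: Sequence[float]) -> float:
    """Class-weighted cross-entropy; logits[t] holds scores for classes (-1, 0, +1)."""
    loss = 0.0
    for z, y in zip(logits, targets):
        top = max(z)
        log_norm = top + math.log(sum(math.exp(v - top) for v in z))
        c = y + 1  # map class -1 / 0 / +1 to index 0 / 1 / 2
        loss -= class_weights[c] * (z[c] - log_norm)
    return loss / len(targets)  # averaging over timesteps is an assumption
```

Each row of `weights` plays the role of the attention vector for one timestep: it sums to one and never places mass on future elements, which is what allows it to be read as credit over the past.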

4 Reward Shaping through Credit Assignment

In RL, it is quite common for agents to evolve in sparse-reward environments, which makes the learning process very slow. Reward shaping is a technique that aims at densifying the reward so as to speed up RL. Introduced by Ng et al. (1999), it defines a class of reward functions that can be added to the original extrinsic environment rewards without modifying the set of optimal policies: for a given MDP $M = (S, A, \gamma, R, P)$, we define a new MDP $M' = (S, A, \gamma, R', P)$ where $R' = R + F$ is the shaped reward and $F$ the shaping. The reward shaping theorem states that if there exists a function $\phi : S \to \mathbb{R}$ (called the potential function) such that $F(s, a, s') = \gamma \phi(s') - \phi(s)$ ($s'$ being the resulting state after action $a$ is taken in $s$), then $M$ and $M'$ admit the same set of optimal policies. Reward shaping can be used to leverage domain knowledge to design more informative reward functions without giving incentives to unwanted behaviour. Nevertheless, shaping rewards requires good priors for the task to solve, and the potential function must often be engineered manually.

Our credit assignment mechanism identifies pairwise relations between state-action couples. Hence, the distribution of the attention weights over the states contains information that we can take advantage of to create a potential function, without the need for human input. We define this potential as the forwarded expected redistributed return. The expected redistributed return of a state-action couple is built from the attention it receives: as a reminder, $\alpha_{t \leftarrow i}$ is the attention weight on the $i$-th state-action couple when predicting the reward at timestep $t$. More precisely, since we cannot sample transitions according to the stationary distribution of the MDP, we sample trajectories (according to a given or random policy) and compute an estimate of this expected redistributed return from them.

Forwarding attributes the attention weight towards a partial state-action couple to the resulting state that is sampled. We adopt this approach to eliminate the dependence of the potential on actions, and stay within the bounds of application of the reward shaping theorem. The dependence on the statistics of the sampling introduces bias in the estimation of the true potential function. Nevertheless, optimal policies are conserved as the result of applying reward shaping. Moreover, as we elaborate in the upcoming section, we empirically found that the agents still benefit greatly from the resulting augmented reward function. A way to look at it is to consider that we densify the learning signal and bias the agent towards behaviours that encourage future rewards without interfering with the exploration mechanism.
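To make the shaping protocol more concrete, here is a minimal sketch of potential-based shaping with a potential estimated from attention weights. The shaping term follows the reward shaping theorem of Ng et al. (1999). The estimator, on the other hand, is an assumption made for illustration: each sample pairs one predicted reward with the attention weights over the preceding steps, every weight is forwarded to the resulting state of its step, and the totals are averaged over samples. The exact estimator used in this work may differ, and the names `estimate_potential` and `shaped_reward` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Sequence, Tuple

State = Hashable
# One sample: (resulting states of each step, attention weights, predicted reward).
Sample = Tuple[Sequence[State], Sequence[float], float]


def estimate_potential(samples: Iterable[Sample]) -> Dict[State, float]:
    """Assumed estimator: average attention-weighted reward forwarded to resulting states."""
    totals: Dict[State, float] = defaultdict(float)
    count = 0
    for next_states, attention, reward in samples:
        count += 1
        for s_next, alpha in zip(next_states, attention):
            # Credit given to a state-action couple is attributed to the
            # state it led to, so the potential depends on states only.
            totals[s_next] += alpha * reward
    return {s: v / count for s, v in totals.items()} if count else {}


def shaped_reward(reward: float, state: State, next_state: State,
                  potential: Dict[State, float], gamma: float) -> float:
    """R'(s, a, s') = R(s, a, s') + gamma * phi(s') - phi(s)."""
    # States never seen during estimation default to a potential of zero.
    return reward + gamma * potential.get(next_state, 0.0) - potential.get(state, 0.0)
```

Because the added term is a difference of potentials, the shaping stays within the conditions of the theorem whatever the quality of the estimate; a poor estimate only makes the extra signal less informative.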

5 Transfer through Credit Assignment

A salient aspect of our approach lies in the fact that we learn to assign credit in a separate process from that of learning how to solve the task. Additionally, by modeling the effects of actions, our credit assignment module learns representations that are robust to task modifications that do not alter the causal structure underlying the reward function. Such scenarios include changes in the dynamics of the environment and specific changes in the state or observation distribution. We study the effectiveness of our method in the setting of zero-shot transfer. Zero-shot transfer implies that we train agents and our reward model in instances of the source distribution exclusively. When evaluated in samples from the target distribution, agents and the reward model have no prior knowledge except for what they can transfer from the source distribution.

6 Experiments

Figure 1: Trigger Environment

The Trigger environment

We introduce Trigger, an interpretable and customizable environment that we use to assess the quality of the credit inferred with our method. In Trigger, the agent is located in a two-dimensional bounded grid. Its actions consist solely of moving by one cell in one of the cardinal directions. Any action that would lead the agent outside the boundaries of the environment (as indicated by the walls in the figure) is ignored but still counted as an action taken by the agent. The goal of the agent (represented as a yellow square) is to activate all the switches (red squares) and then collect all the prizes (pink squares). Prizes are the only source of reward and give a penalty unless all switches are activated, in which case they give a bonus. Both prizes and switches disappear once collected. The main feature of Trigger is that every positive reward is conditional on the presence of a known subset of states in the agent's history, and thus credit assignment can be assessed in a rigorous way. Despite its conceptual simplicity, some instances of Trigger can prove challenging to solve optimally for traditional RL methods: agents have to activate every trigger before experiencing rewards. Its flexible features make it possible to render the task at hand arbitrarily complex. For instance, the grid size can be increased, more switches may have to be collected and randomly positioned walls can be added. We make this environment partially observable for the credit assignment procedure by cropping the view around the agent. We use square windows of the same size in all our experiments.
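The rules above can be illustrated with a small gridworld sketch. This is a hypothetical re-implementation written from the description only, not the environment used in the experiments: the class name, the bonus and penalty values, the character encoding of cells and the window radius are all placeholders, and interior walls are left out.

```python
from typing import Iterable, List, Tuple

Cell = Tuple[int, int]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


class Trigger:
    """Toy version of Trigger (reward values are placeholders)."""

    def __init__(self, size: int, start: Cell, switches: Iterable[Cell],
                 prizes: Iterable[Cell], bonus: float = 1.0, penalty: float = -1.0):
        self.size = size
        self.agent = start
        self.switches = set(switches)
        self.prizes = set(prizes)
        self.bonus, self.penalty = bonus, penalty

    def step(self, action: str) -> float:
        dr, dc = MOVES[action]
        row, col = self.agent[0] + dr, self.agent[1] + dc
        # Moves that would leave the grid are ignored but still use up a step.
        if 0 <= row < self.size and 0 <= col < self.size:
            self.agent = (row, col)
        reward = 0.0
        if self.agent in self.switches:
            self.switches.remove(self.agent)  # switches disappear once activated
        elif self.agent in self.prizes:
            self.prizes.remove(self.agent)    # prizes disappear once collected
            # Bonus only if every switch was activated beforehand.
            reward = self.bonus if not self.switches else self.penalty
        return reward

    def partial_view(self, radius: int) -> List[str]:
        """Crop a square window of side 2 * radius + 1 centred on the agent."""
        rows = []
        for r in range(self.agent[0] - radius, self.agent[0] + radius + 1):
            row = ""
            for c in range(self.agent[1] - radius, self.agent[1] + radius + 1):
                if not (0 <= r < self.size and 0 <= c < self.size):
                    row += "#"  # outside the grid
                elif (r, c) == self.agent:
                    row += "A"
                elif (r, c) in self.switches:
                    row += "S"
                elif (r, c) in self.prizes:
                    row += "P"
                else:
                    row += "."
            rows.append(row)
        return rows
```

In the cropped view, a prize looks the same whether or not the switches were activated earlier, so a reward model fed with these windows has to look back in the trajectory to predict the sign of the reward.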

Figure 2: Observations from DMLab are first person views

DMLab keys doors

We use the keys_doors_puzzle 3D environment from DMLab (Beattie et al., 2016) in which the agent must locate keys whose colors indicate the doors they open. It can only possess one key, therefore picking the wrong key prevents it from reaching further rewards. The environment is partially observable by construction. The agent receives as input what would correspond to a first-person view of what is in its line of sight. It can move forward, backward and rotate. Each key picked up grants a bonus, as does each door opened. Independently, a cake rewards the agent with an increase in score when collected. In that setup, agents benefit from understanding the link between keys and doors. We hypothesized that our credit assignment mechanism might identify this relation and reward the agent for picking up keys. To verify this, we modified the setting so that picking up keys does not provide rewards. Additionally, the visual input is richer than the one from Trigger environments and the average number of steps per episode is larger. Finally, agents move and rotate across the room. Since picking up a key does not require actually seeing the key, it can be hard to know whether a key was taken and to predict further door opening rewards.

Agents used

Depending on the experimental setup, we use either Q-learning agents (Watkins and Dayan, 1992), Deep Q-Networks (DQN) (Mnih et al., 2015) or Proximal Policy Optimization (PPO) (Schulman et al., 2017) agents.

6.1 Credit assignment

We provide an analysis of the credit inferred in various scenarios through our method. The analysis is both qualitative and quantitative, since we rely on visual assessment as well as binary detection metrics. The process of evaluating the credit assignment mechanism in Trigger goes as follows: we first generate trajectories and train the model using the quantized rewards as targets. We then apply the model on held-out trajectories sampled from held-out environments. For each held-out trajectory, we proceed as follows: if a reward is collected by the agent and the model correctly predicts the sign of the reward, we compare the attention weights from the prediction of this very reward to a ground truth credit assignment. We build that ground truth by exploiting the exact knowledge of where triggers are. It is a vector that is zero almost everywhere and one on state-action couples that precede the activation of a trigger. By doing so, we explicitly target the state-action couples whose resulting state is causally linked to the reward experienced later. We find the redistribution to be near optimal in simple instances of Trigger (see Fig. 3-left): attention concentrates almost exclusively on state-action pairs that enable the collection of future reward. This is confirmed by a precision-recall analysis over the distribution of scenarios considered, based on a simple binarization heuristic over attention values, despite long running sequences. More information on the heuristic is in Appendix A.

Figure 3: On the left, the distribution of attention weights around triggers for correct positive reward predictions in a Trigger maze with triggers and one reward. The x-axis denotes the signed number of steps between the state-action couple receiving attention and the actual moment the agent took the key. On the right, the distribution of attention weights around keys for correct reward predictions for door traversals in DMLab.

In keys_doors_puzzle, we adopt the same set of experiments. Since the agent can move backward and spin, in some scenarios it takes a key that is not in its line of sight. In addition, the granularity of the state space is such that off-by-one prediction errors are common but do not hinder the credit mechanism: attributing credit to the state-action couple preceding the collection of a key or to the previous one leads to imperceptible changes in the resulting shaped rewards. Fig. 3-right shows results similar to those obtained for Trigger. Appendix A also provides a heatmap for this task that shows that attention concentrates around the keys, as we expected.

6.2 Zero-shot transfer

We then study how we can leverage the inferred credit and transfer representations that are helpful in new scenarios. We show that agents train faster when using the attentional reward shaping proposed. As before, the reward model is being trained on episodes of interaction in environments sampled from the source distribution. In transfer environments, we sample multiple trajectories, each using the same maze configuration. We then compute the attentional potential function by calculating an estimate of the expected redistributed reward, as described in Sec. 4. To evaluate its effect, we compare agents trained from environment rewards to agents that use the resulting shaped reward.

In-domain transfer

For in-domain transfer, we transfer the representations for credit assignment to new instances of the same distribution over MDPs. For the Trigger environment, the RL agents are tabular Q-learners. We keep the same size and the same number of triggers and prizes, but those elements are at different positions. Since the distributions are the same, we generate subsets of environment layouts that are mutually exclusive. For the DMLab environment, we use PPO agents (Schulman et al., 2017) and modify the original task: we do not reward the agent for collecting keys but only for opening doors, so that the attention can focus on the key positions. Notice that this makes the task harder.

Figure 4: On the left, in-domain transfer results in a Trigger environment. On the right, results in DMLab.

As displayed in Fig. 4, the agents converge visibly faster under the attention-based reward than under the standard environment reward function in both environments, which shows that our method is efficient for transfer and scales up.

Out-of-domain transfer

For out-of-domain transfer, we use the Trigger environment and consider two scenarios that are hard for standard agents: transfer to bigger environments (see Fig. 5) and transfer to environments with modified dynamics (see Fig. 6). In the modified dynamics setting, the effects of the agent’s actions are inverted, which makes transfer hard if not impossible for most transfer methods in the literature. In that setting, we compare the transferability of our mechanism to that of the representations learned by an agent equipped with deep function approximation. To this end, we use DQN agents and either train them from scratch in the target environments or start from the set of weights learned in the source environments.

Figure 5: The bigger environments we consider are bigger mazes where the structure of the original task is conserved (number of triggers, number of prizes). The environments drawn are larger grids than those of the training distribution; the top and bottom figures differ in their number of prizes.
Figure 6: The controls of the out-of-domain distribution are inverted (up becomes down, right becomes left). While the effect of the shaping is exclusively beneficial, transferring weights from the source control task does not always help and even undermines the agent in some cases.

In both settings, shaping the rewards assists the agent in learning the proper control to solve the game. We display some results in Fig. 5 and Fig. 6. When transferring to bigger environments, the agent benefits very early on from the additional reward brought by the shaping mechanism, while also reaching better asymptotic performance.

7 Related work

As said in Sec. 1, transfer in RL means many different things in the literature. For instance, a line of previous work aimed at making the training of an agent in the same task more sample-efficient by using a pre-trained model as a teacher (Rusu et al., 2016a; Schmitt et al., 2018). Our approach differs by learning a parallel task that does not modify the representations of the RL agent. Others learn auxiliary reward functions in the hope that they will enable transfer by imposing consistency in the reward (Houthooft et al., 2018; Hessel et al., 2018; Agarwal et al., 2019). Although we learn an additional reward signal, it is based on a redistribution of the current task reward. Transfer is also viewed as learning tasks in a sequential way (Rusu et al., 2016b; Kirkpatrick et al., 2017), which suggests introducing inductive bias into the neural architectures of agents to dampen catastrophic forgetting. Our method differs as it does not change the agent’s architecture. Others explicitly address the problem of transfer through the lens of multitask learning (Parisotto et al., 2016; Teh et al., 2017), while we stick to learning from an initial distribution of environments. Closely related, meta-learning appears as a potential solution to the problem of transfer. Meta-learning approaches aim to train agents on a distribution of tasks or environments so that their learned skills and representations work across the underlying continuum, and allow for fast adaptation of the agents (Duan et al., 2016; Wang et al., 2016; Finn et al., 2017; Mishra et al., 2018; Co-Reyes et al., 2019; Zou et al., 2019). Our approach is related to that line of work in some sense but remains different from most meta-learning methods as we do not modify the RL algorithm used to train the agent.

In the domain of credit assignment, perhaps the closest body of work to ours is by Hung et al. (2018). They provide an agent with an external memory and the unsupervised task of reconstructing its inputs (both states and rewards). They use memory reads as a way to identify related elements in sequences, and use those to transfer the value of states providing delayed rewards to the bootstrapping target of contributing elements. We do not use a recurrent network, and we demonstrate that our method not only boosts the learning speed of an RL agent but is also transferable. Interestingly, one of the auxiliary tasks Jaderberg et al. (2017) use to give agents additional learning signal comes quite close to our reward sign prediction task. The agent must predict the next reward from three consecutive frames sampled from the replay buffer. The link with our work is on the representational side, since they show that auxiliary tasks make transfer of representations easier, which is consistent with our findings.

8 Conclusion

In this work, we investigated the role credit assignment could play in transfer learning and came up with a novel credit assignment scheme. It takes advantage of the relational properties of self-attention. As a sanity check, we verified its relevance on gridworld tasks and a more complex 3D navigational scenario. Transfer being the primary focus of this work, we demonstrated that our credit assignment method could be used for zero-shot transfer of representations in both in-domain and out-of-domain distributions. To the best of our knowledge, this is the first line of work in that exciting direction. We acknowledge there is still much to be done to confirm the generality of this approach. We think it would be worth exploring how this could be incorporated into online reinforcement learning methods, and leave that to future work.


Appendix A Additional Experiment Details

In this section we provide additional details about our experimental setup and the hyperparameters we use.

A.1 Reward prediction model

Figure 7: The architecture of the reward prediction model used. $\alpha_t$ is the vector containing the attention weights of the model for its prediction at step $t$.

We use the same set of hyperparameters in all our experiments, with few variations. In Trigger experiments, we use a single convolutional layer to process partial states, followed by dense layers. As mentioned, we use dropout at several places in the model, with separate rates after dense layers, in the self-attention mechanism, and in the normalization blocks of the Transformer architecture. We use the same positional encoding scheme as in Vaswani et al. (2017). Fig. 7 gives an overview of the whole architecture. In DMLab experiments, we use two convolutional layers with the same number of filters each, and otherwise identical hyperparameters.

A.2 Heuristic for precision-recall analysis

In Sec. 6.1, we compare the attention vectors we get as outputs from the reward prediction model to ideal credit assignment with binary metrics. The ground truth we use is a binary vector of the size of the attention vector. Its values are 0 everywhere except for timesteps that correspond to the activation of a trigger, where they are 1. To perform the comparison, we introduce a simple heuristic to binarize the attention scalars: we consider all values above a threshold to correspond to events to be credited. Then, we can measure precision and recall as in a binary classification paradigm. The precision and recall reported are the average precision and recall over four Trigger scenarios that differ in grid size, number of triggers and number of rewards. In each scenario, we train the model over a set of trajectories, each of which is drawn from a randomly sampled maze. Then, we apply it on trajectories from held-out environments and collect the attention weights corresponding to predictions on timesteps where the agent experiences positive reward. We use a fixed value of the threshold.
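The binarization heuristic can be sketched as follows. The threshold and the attention values in the example are placeholders chosen for illustration, the function name is hypothetical, and returning zero when a ratio is undefined is a convention adopted here rather than something stated in the text.

```python
from typing import Sequence, Tuple


def precision_recall(attention: Sequence[float], ground_truth: Sequence[int],
                     threshold: float) -> Tuple[float, float]:
    """Binarize attention weights with a threshold and compare to a 0/1 ground truth."""
    predicted = [a > threshold for a in attention]
    true_pos = sum(1 for p, g in zip(predicted, ground_truth) if p and g == 1)
    false_pos = sum(1 for p, g in zip(predicted, ground_truth) if p and g == 0)
    false_neg = sum(1 for p, g in zip(predicted, ground_truth) if not p and g == 1)
    # Convention (assumption): an undefined ratio is reported as 0.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Attention for one reward prediction; triggers were activated at steps 1 and 3.
attention = [0.05, 0.6, 0.05, 0.3]
ground_truth = [0, 1, 0, 1]
print(precision_recall(attention, ground_truth, threshold=0.2))
```

Averaging these two numbers over held-out trajectories and scenarios gives the kind of aggregate figures reported in Sec. 6.1.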

A.3 In-domain transfer in DMLab

We provide additional details about this setup: we train the reward prediction model using trajectories sampled from a distribution of mazes that are generated randomly. These trajectories are sampled using an agent trained over the same distribution. We do so to increase the proportion of trajectories where rewards are experienced. Indeed, we found that using random policies yielded very few of these. Once the model is trained, we use it to compute the attentional potential function over a fixed maze. Trajectories are sampled on the fixed maze using the same policy as the one that generated the trajectories used to train the reward prediction model. Since consecutive frames can be very similar, we consider a positive reward prediction to be correct (and thus use the corresponding attention weights when estimating the potential) if it happens within a fixed number of frames of a reward actually experienced in the environment. We then compare the performance of agents trained with the original reward function to those trained with the shaped reward. An important point is that we use the knowledge of the position and of the keys the agent possesses to compute and then exploit the potential function. This information is not given to the agent. We acknowledge that relying on the knowledge of the state the agent is in limits the generality of our approach, but we are confident that this limitation can be addressed in future work by training a model that estimates the potential value of each state. All agents mentioned in this section are PPO learners with multiple actors. They use generalized advantage estimation (Schulman et al., 2016).

A.4 Attention heatmap in DMLab

We display the per-position average attention weight along trajectories starting in the middle-left room that led to opening the blue doors. Attention (shown in shades of red) concentrates around the key position in the lower-left corner.

A.5 Q-learning

For experiments involving tabular Q-learning, we use online Q-learning with a learning rate and a constant greediness factor that are set to the same value.
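For reference, a minimal sketch of online tabular Q-learning with a constant exploration rate is given below. It assumes that the greediness factor denotes the probability of taking a random action (the usual epsilon of epsilon-greedy exploration); the function names and all numeric values used with them are placeholders, not the settings of the experiments. When reward shaping is used, the reward passed to `q_update` would simply be the shaped reward.

```python
import random
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, str], float]


def epsilon_greedy(q: QTable, state: Hashable, actions: Sequence[str],
                   epsilon: float, rng: random.Random) -> str:
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    return max(actions, key=lambda a: q[(state, a)])


def q_update(q: QTable, state: Hashable, action: str, reward: float,
             next_state: Hashable, done: bool, actions: Sequence[str],
             learning_rate: float, gamma: float) -> None:
    """One online Q-learning step towards the bootstrapped target."""
    if done:
        target = reward  # no bootstrapping past the end of the episode
    else:
        target = reward + gamma * max(q[(next_state, b)] for b in actions)
    q[(state, action)] += learning_rate * (target - q[(state, action)])


q_table: QTable = defaultdict(float)
q_update(q_table, "s0", "right", 1.0, "s1", True, ["left", "right"],
         learning_rate=0.5, gamma=0.9)
print(epsilon_greedy(q_table, "s0", ["left", "right"], epsilon=0.0,
                     rng=random.Random(0)))
```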

A.6 DQN

For the out-of-domain transfer experiment with modified dynamics, we use a smaller copy of the DQN architecture in Mnih et al. (2015). It has three convolutional layers, the second and third of which share the same number of filters, kernel size and stride. Those are followed by a feed-forward layer and then by another feed-forward layer with as many units as the number of available actions. The greediness factor is decayed linearly over steps in the environment at train time and has a constant value at test time. We use RMSProp (Tieleman and Hinton, 2012) as an optimizer. We update the target network at a fixed interval of steps and initially fill the replay buffer with transitions sampled following a random policy. The replay buffer has a bounded maximum size.
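The linear decay of the greediness factor can be written as a small schedule function. This is only a sketch: the start value, end value and number of decay steps below are placeholders, since the actual settings are not recoverable from the text, and the function name is hypothetical.

```python
def linear_epsilon(step: int, start: float, end: float, decay_steps: int) -> float:
    """Linearly interpolate from `start` to `end` over `decay_steps`, then stay at `end`."""
    fraction = min(max(step / decay_steps, 0.0), 1.0)
    return start + fraction * (end - start)


# Placeholder schedule: fully random at first, mostly greedy after 1000 steps.
for step in (0, 500, 1000, 5000):
    print(step, linear_epsilon(step, start=1.0, end=0.1, decay_steps=1000))
```

At test time the schedule is not used: the agent acts with a constant greediness factor, as described above.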