To some, intelligence is measured as the capability of transferring knowledge to unprecedented situations. While the notion of intellect itself is hard to define, the ability to reuse learned information is a desirable trait for general learning agents. For a civilization, for instance, effective knowledge transmission is key to survival and further development. Avoiding building representations from the ground up is a sensible goal for representation learning. Hence, unsurprisingly, knowledge transfer is an active research area in Machine Learning. There have been great successes for transfer in the field of Computer Vision (Razavian et al., 2014). Transfer approaches usually reuse the hidden states of deep neural networks trained on canonical tasks as a pre-training step. More recently, transfer in Natural Language Processing (NLP) also made a leap forward. Unsupervised approaches showed encouraging results on several benchmarks (Devlin et al., 2018; Peters et al., 2018; Howard and Ruder, 2018). In Reinforcement Learning (RL) (Sutton and Barto, 2018), the performance of learning agents is usually measured after they are trained from scratch, and transfer remains hard and ill-defined. For this reason, no technique is considered standard by today’s RL practitioners. While it is tempting to reduce it to the transfer of representations, as in Computer Vision or NLP, the RL literature is divided as to what transfer means. For some, transfer means the ability of RL agents to cope with changes to the input distribution (Higgins et al., 2017). For others, transfer means adapting to changes in the agent’s goal (Schaul et al., 2015; Andrychowicz et al., 2017; Florensa et al., 2018) or in the reward function (Barreto et al., 2017; Teh et al., 2017), or to changes in the dynamics of the environment (Packer et al., 2018). We differentiate the two standard transfer scenarios (evaluating agents on tasks from the same distribution as opposed to evaluating agents on tasks from a different distribution) by calling them in-domain transfer and out-of-domain transfer, respectively. Transfer is notoriously hard in the RL context because agents build internal representations that are calibrated for the task at hand. In other words, learning to act while learning to see undermines the general transferability of the representations learned. Our work is motivated by this observation and lies at the crossroads between three ideas that we have not seen related to one another before. First, a promising lead for transfer in RL is to consider the transfer of representations learned independently from the task solved.
To do so, we argue that a supervised learning method, solving a prediction problem in the environment independently from the RL task, should be used. This has been highlighted several times in the case of representations learned from raw observations (Finn et al., 2016; Barreto et al., 2017). Second, we study a backward approach for credit assignment since similar ideas have been shown to synergize well with value-based RL methods (Arjona-Medina et al., 2018; Hung et al., 2018). By looking in the rearview mirror, the credit inferred is contextualized and avoids common pitfalls of the ($Q$-)value. Such pitfalls include state leakage and exponentially slow updates in the case of delayed rewards. Third, we look at self-attention (Lin et al., 2017) and, more generally, at Transformer models (Vaswani et al., 2017). They show promising potential for identifying links in sequential data. For instance, the Transformer was shown to learn both syntactic and semantic characteristics when applied to machine translation (Vaswani et al., 2017). Self-attention was also incorporated in neural architectures for relational reasoning that learned correct associations of sequential elements without direct supervision (Santoro et al., 2018; Zambaldi et al., 2019). In light of those ideas, we propose to investigate the role credit assignment and self-attention could play together for transfer in RL. To do so, we propose a credit assignment module that is an attention-based reward model trained to predict quantized environment rewards in a supervised way. The observations that feed the reward model come from a partially observable version of the Markov Decision Process (MDP) considered. The induced partial observability acts as a regularization mechanism and forces the reward model to reason about and exploit relations between state-action couples along trajectories of interaction between an agent and the environment. We show that the attention mechanism helps learn the dependencies that are necessary to solve the prediction task. Our contributions are the following: (i) we introduce a simple, flexible and interpretable credit assignment module based on state associations; (ii) we provide a straightforward protocol to exploit this credit to speed up learning, with optimality guarantees in the tabular case; (iii) we show that the learned representations can be transferred to accelerate learning in new environments.
We place ourselves in the classical Markov Decision Process (MDP) formalism (Puterman, 1994). An MDP is a tuple $(S, A, \gamma, R, P)$ where $S$ is a state space, $A$ is an action space, $\gamma$ is a discount factor ($\gamma \in [0, 1)$), and $R: S \times A \times S \to \mathbb{R}$ is a reward function that maps a state-action pair and its resulting state to the expected reward for taking such an action in such a state. Note that we choose a form that includes the resulting state in the definition of the reward function over the typical $R: S \times A \to \mathbb{R}$. This is for consistency with objects defined later on. Finally, $P$ is a transition kernel that maps state-action pairs to a probability distribution over resulting states. We note vector spaces in capital letters and elements of those spaces in small letters. The current discrete time index is noted $t$ unless stated otherwise. An RL agent interacts with an MDP at a given timestep $t$ by choosing an action $a_t$ and receiving a resulting state $s_{t+1}$ and a reward $r_t$ from the environment. A trajectory $\tau$ is a set of state-action pairs and resulting rewards accumulated in an episode. The performance of an agent is evaluated by its expected discounted cumulative reward $\mathbb{E}\left[\sum_{t} \gamma^{t} r_t\right]$.
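As a small illustration of the evaluation criterion above, the following sketch computes the discounted cumulative reward of a single sampled trajectory. The function name and the toy reward sequence are ours and purely illustrative; the expectation in the text would be approximated by averaging this quantity over many trajectories.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one trajectory."""
    g = 0.0
    # Accumulate backwards so that each step computes G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Toy trajectory (hypothetical): a single reward of 1 collected at the third step.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))  # 0.25
```

With a sparse and delayed reward such as this one, the return seen from early steps shrinks geometrically with the delay, which is one reason credit assignment matters.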
3 Self-Attentional Credit Assignment
In RL, credit assignment is the ability to identify actions that, in the context of the state they are taken in, are responsible for future rewards. It is of the utmost importance since the agent seeks to optimize its discounted cumulative reward over long sequences of actions, all of which might unlock great future reward. Credit can be used to identify and reward adequate early behaviour by creating a positive feedback loop. We argue that credit assignment can be cast into a supervised learning problem. More specifically, we cast it into a reward prediction task that we address with a sequence-to-sequence (seq2seq) model (Sutskever et al., 2014) with self-attention. This has many advantages, beyond the standard qualities of supervised learning (higher sample efficiency, greater control over the sampling distribution than RL, parallelization for speedups, better-studied theory and better understanding). Indeed, reward prediction is task-related and thus should help learn meaningful representations. In addition, as this is a separate learning process, it will not affect the learning of the RL agent’s representations (which are known to be hard to transfer), but an RL agent will be able to leverage the information captured by this extra network (credit assignment can be used to enrich the reward signal and speed up RL). Additionally, it builds contextualized representations that help the prediction task and are easier to transfer because they are control-agnostic. Finally, self-attention appears particularly interesting for credit assignment because it relates sequential elements in a constant number of operations. This feature is key for credit assignment. It greatly facilitates the identification of links in sequences, without assumptions over the temporal distance between elements.
An obstacle to this view arises from the very nature of MDPs, though: states are assumed to contain perfect information about the environment to fulfill the Markov property. This hinders the potential of backward credit assignment since the current state is supposedly sufficient to summarize the whole history of interaction. Thus, predictive models are highly biased towards focusing on the current input and forgetting contextual information from previous inputs. Under that consideration, we replace states by partial states obtained through a transformation that breaks the Markov assumption. For instance, one can move from a third-person, fully informed view of an environment to a first-person partial view. Doing so encourages the model to look into the past to find predictive signal, and allows us to track the relative importance given to each element to reconstruct the credit assigned.
Reward model architecture
We use a Transformer decoder (Vaswani et al., 2017) with a single self-attention layer and a single attention head. Transformer models are seq2seq models that differ from classical seq2seq architectures in the sense that they are not auto-regressive and do not make use of single-dimensional convolutions. They have proven useful in several domains (originally machine translation but also image classification (Bello et al., 2019), sequence generation (Child et al., 2019) and reinforcement learning (Zambaldi et al., 2019)), mainly due to the absence of locality bias and to the constant path length between pairs of distinct sequence elements. As a side note, this constant number of sequential operations between elements is not always guaranteed. Indeed, for computational performance purposes, a limit is imposed on the size of the self-attentional window, thus very long sequences break this assumption. This drawback has been recently addressed (Dai et al., 2019; Child et al., 2019) and we omit it in the following since the length of sequences we use does not fall in this regime. The model inputs are partial state-action couples $(\tilde{s}_t, a_t)$. Each partial state goes through a series of convolutional layers followed by a series of feed-forward layers. Each action is represented as a one-hot vector and concatenated to the processed state after the convolutions. Those representations are fed to a self-attention layer and then to a position-wise feed-forward layer that outputs logits for reward prediction classes. Self-attention is an attention mechanism with parameterization $(W_Q, W_K, W_V)$ that puts sequence elements in relation by computing non-linear similarity scores for all pairs of elements in the sequence. To do so, each sequence element is mapped to a query vector that is matched against keys and values obtained from the previous elements. Let $X$ denote the input sequence in matrix form, $X$ being the result of internal computations of the model on its input. In the same fashion, we note $Y$ the sequence resulting from the application of self-attention. We then have

$$Y = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V,$$

where $(Q, K, V) = (X W_Q, X W_K, X W_V)$ stores queries, keys, and values as linear projections of the input; $d_k$ stands for the dimension of the key vectors. Notably, the resulting state-action representation $y_t$ can be viewed as a linear combination of the values of previous elements: $y_t = \sum_{i \le t} \alpha_{t,i} v_i$ where $\alpha_t = \mathrm{softmax}\left(q_t K^\top / \sqrt{d_k}\right)$. $\alpha_t$ is a vector containing the attention weights for the prediction at timestep $t$. It is the product of a softmax operation, so its sum is equal to $1$. Since partial states contain only a portion of their initial information, the fact that the model succeeds in the prediction task indicates that it reconstructed the missing information from its past. Therefore, attention weights themselves can be viewed as a form of credit assignment, and will be used as such in what follows. To be consistent with the goal of assigning credit, the model should not be able to peek into the future. Thus, we restrict the computational window of each sequence element to the information stored in representations of the previous elements in the sequence and its own. To this end, a causal mask (a lower triangular binary matrix) is applied to the result of the pairwise similarity computations. We apply dropout (Srivastava et al., 2014) as a regularizer. We use different rates according to where we apply it in the model. We adopt multiclass classification as the prediction task. While performing regression on the rewards could also be an option, our experiments found that regression tends to converge to poor local optima. More precisely, we predict the sign of the experienced rewards: $z_t = \mathrm{sign}(r_t)$ with

$$\mathrm{sign}(x) = \begin{cases} -1 & \text{if } x < 0, \\ 0 & \text{if } x = 0, \\ 1 & \text{if } x > 0. \end{cases}$$

While the quantization method we opted for is simple, our approach is compatible with a broad range of quantization functions. We use the weighted sequential cross-entropy as a loss function over the class-wise model predictions:

$$\mathcal{L} = -\sum_{t} \sum_{c \in \{-1, 0, 1\}} w(c)\, \mathbb{1}\{z_t = c\}\, \log p\left(c \mid \tilde{s}_{0:t}, a_{0:t}\right).$$

By default, we use uniform class weights $w(c) = 1$, but we found asymmetrical weighting to be useful in cases of extreme class imbalance.
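The three ingredients of the reward model described above (causal self-attention, sign quantization of rewards, and a class-weighted cross-entropy) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: queries, keys and values are assumed to be already projected, the causal mask is realized by restricting each softmax to positions $i \le t$, and all function names are ours.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    queries, keys, values: lists of T vectors (assumed already projected).
    Returns (outputs, weights) with weights[t][i] == 0.0 for every i > t.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for t, q in enumerate(queries):
        # Causal mask: step t only sees steps 0..t.
        scores = [dot(q, keys[i]) / math.sqrt(d_k) for i in range(t + 1)]
        alpha = softmax(scores)
        y = [sum(alpha[i] * values[i][j] for i in range(t + 1))
             for j in range(len(values[0]))]
        outputs.append(y)
        weights.append(alpha + [0.0] * (len(queries) - t - 1))
    return outputs, weights


def sign_class(reward):
    """Quantize a reward to its sign: -1, 0 or 1."""
    return (reward > 0) - (reward < 0)


def weighted_cross_entropy(probs_per_step, rewards, class_weights):
    """Sequential cross-entropy; probs_per_step[t] maps class -> probability."""
    loss = 0.0
    for probs, r in zip(probs_per_step, rewards):
        c = sign_class(r)
        loss -= class_weights[c] * math.log(probs[c])
    return loss


# Toy sequence of three already-embedded steps (hypothetical values).
steps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, alpha = causal_attention(steps, steps, steps)
print(alpha[0])  # [1.0, 0.0, 0.0]: the first step can only attend to itself
```

Each row of `alpha` sums to one and is zero above the diagonal, which is what allows a row to be read as a distribution of credit over past steps.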
4 Reward Shaping through Credit Assignment
In RL, it is quite common for agents to evolve in a sparse-reward environment, which makes the learning process very slow. Reward shaping is a technique that aims at densifying the reward so as to speed up RL. Introduced by Ng et al. (1999), it defines a class of reward functions that can be added to the original extrinsic environment rewards without modifying the set of optimal policies: for a given MDP $M = (S, A, \gamma, R, P)$, we define a new MDP $M' = (S, A, \gamma, R', P)$ where $R' = R + F$ is the shaped reward and $F$ the shaping. The reward shaping theorem states that if there exists a function $\phi: S \to \mathbb{R}$ (called the potential function) such that $F(s, a, s') = \gamma \phi(s') - \phi(s)$ ($s'$ being the resulting state after action $a$ is taken in $s$), then $M$ and $M'$ admit the same set of optimal policies. Reward shaping can be used to leverage domain knowledge to design more informative reward functions without giving incentives to unwanted behaviour. Nevertheless, shaping rewards requires good priors for the task to solve and the potential function must often be engineered manually. Our credit assignment mechanism identifies pairwise relations between state-action couples. Hence, the distribution of the attention weights over the states contains information that we can take advantage of to create a potential function, without the need for human input. We define this potential $\phi$ as the forwarded expected redistributed return: the potential of a state is the expected redistributed return of the state-action couples that lead to it. The expected redistributed return of a couple is the expectation, over trajectories, of the rewards redistributed to it through the attention weights. As a reminder, $\alpha_{t,i}$ is the attention weight on the partial state-action couple $(\tilde{s}_i, a_i)$ when predicting the reward $r_t$. More precisely, since we cannot sample transitions according to the stationary distribution of the MDP, we sample trajectories (according to a given or random policy) and compute an empirical estimate of this expected redistributed return from the sampled trajectories.
Forwarding attributes the attention weight towards a partial state-action couple to the resulting state that is sampled. We adopt this approach to eliminate the dependence of the potential on actions, and stay within the scope of the reward shaping theorem. The dependence on the statistics of the sampling introduces bias in the estimation of the true potential function. Nevertheless, optimal policies are preserved as a result of applying reward shaping. Moreover, as we elaborate in the upcoming section, we empirically found that the agents still benefit greatly from the resulting augmented reward function. A way to look at it is to consider that we densify the learning signal and bias the agent towards behaviours that encourage future rewards without interfering with the exploration mechanism.
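To make the procedure more concrete, here is a minimal sketch of one way to estimate such a potential from sampled trajectories and to plug it into the shaping term $\gamma \phi(s') - \phi(s)$. The exact estimator is an assumption on our part: each step $i$ is credited with the sum over $t \ge i$ of $\alpha_{t,i} r_t$, that credit is forwarded to the resulting state, and credits are averaged per state. The trajectory layout and function names are hypothetical.

```python
from collections import defaultdict


def estimate_potential(trajectories):
    """Average, per resulting state, the attention-weighted rewards credited
    to the step that led to it (an assumed estimator, see lead-in).

    Each trajectory is a dict with:
      "next_states": list of T hashable resulting states,
      "rewards":     list of T rewards,
      "attention":   T x T nested lists, attention[t][i] being the weight on
                     step i when predicting the reward at step t.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for traj in trajectories:
        horizon = len(traj["rewards"])
        for i in range(horizon):
            # Credit of step i: rewards at t >= i weighted by attention on i.
            credit = sum(traj["attention"][t][i] * traj["rewards"][t]
                         for t in range(i, horizon))
            s_next = traj["next_states"][i]
            totals[s_next] += credit
            counts[s_next] += 1
    return {s: totals[s] / counts[s] for s in totals}


def shaped_reward(reward, state, next_state, potential, gamma):
    """r' = r + gamma * phi(s') - phi(s); unseen states get phi = 0."""
    phi_next = potential.get(next_state, 0.0)
    phi = potential.get(state, 0.0)
    return reward + gamma * phi_next - phi


# Toy trajectory: the prize reward at step 1 puts most of its attention
# on step 0, whose resulting state is the switch.
toy = {
    "next_states": ["switch", "prize"],
    "rewards": [0.0, 1.0],
    "attention": [[1.0, 0.0], [0.75, 0.25]],
}
phi = estimate_potential([toy])
print(phi)  # {'switch': 0.75, 'prize': 0.25}
print(shaped_reward(0.0, "start", "switch", phi, gamma=0.5))  # 0.375
```

Because the shaping term depends on states only, it has the potential-based form required by the reward shaping theorem, whatever the quality of the estimate.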
5 Transfer through Credit Assignment
A salient aspect of our approach lies in the fact that we learn to assign credit in a separate process from that of learning how to solve the task. Additionally, by modeling the effects of actions, our credit assignment module learns representations that are robust to task modifications that do not alter the causal structure underlying the reward function. Such scenarios include changes in the dynamics of the environment and specific changes in the state or observation distribution. We study the effectiveness of our method in the setting of zero-shot transfer. Zero-shot transfer implies that we train agents and our reward model in instances of the source distribution exclusively. When evaluated in samples from the target distribution, agents and the reward model have no prior knowledge except for what they can transfer from the source distribution.
The Trigger environment
We introduce Trigger, an interpretable and customizable environment that we use to assess the quality of the credit inferred with our method. In Trigger, the agent is located in a two-dimensional bounded grid. Its actions consist solely of moving by one cell in one of the cardinal directions. Any action that would lead the agent outside the boundaries of the environment (as indicated by the walls in the figure) is ignored but still counted as an action taken by the agent. The goal of the agent (represented as a yellow square) is to activate all the switches (red squares) and then collect all the prizes (pink squares). Prizes are the only source of reward and give a penalty unless all switches are activated, in which case they give a bonus. Both prizes and switches disappear once collected. The main feature of Trigger is that every positive reward is conditional on the presence of a known subset of states in the agent's history, and thus credit assignment can be assessed in a rigorous way. Despite its conceptual simplicity, some instances of Trigger can prove challenging to solve optimally for traditional RL methods: agents have to activate every trigger before experiencing positive rewards. Its flexible features allow the task at hand to be made arbitrarily complex. For instance, the grid size can be increased, more switches may have to be collected and randomly positioned walls can be added. We make this environment partially observable for the credit assignment procedure by cropping the view around the agent. We use x windows in all our experiments.
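The cropping that turns the full grid into a partial state can be sketched as follows. The grid encoding, the padding symbol, the square shape of the window and the radius used in the example are illustrative assumptions rather than the settings of our experiments.

```python
def crop_view(grid, row, col, radius, pad="#"):
    """Return the square window of side 2 * radius + 1 centered on (row, col).

    grid is a list of equal-length strings; cells that fall outside the grid
    are filled with the pad symbol.
    """
    window = []
    for r in range(row - radius, row + radius + 1):
        line = []
        for c in range(col - radius, col + radius + 1):
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
            line.append(grid[r][c] if inside else pad)
        window.append("".join(line))
    return window


# Hypothetical layout: A is the agent, S a switch, P a prize.
grid = ["A..S",
        "....",
        "..P."]
print(crop_view(grid, 0, 0, 1))  # ['###', '#A.', '#..']
```

In this toy layout neither the switch nor the prize is visible from the agent's position, so a reward model fed with such views has to rely on earlier steps to know whether the switch was activated.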
DMLab keys doors
We use the keys_doors_puzzle 3D environment from DMLab (Beattie et al., 2016) in which the agent must locate keys whose colors indicate the doors they open. It can only possess one key at a time, therefore picking the wrong key prevents it from reaching further rewards. The environment is partially observable by construction. The agent receives as input what would correspond to a first-person view of what is in its line of sight. It can move forward, move backward and rotate. Each key picked up grants a bonus, as does each door opened. Independently, a cake rewards the agent with an increase in score when collected. In that setup, agents benefit from understanding the link between keys and doors. We hypothesized that our credit assignment mechanism might identify this relation and reward the agent for picking up keys. To assess this, we modified the setting so that picking up keys does not provide rewards. Additionally, the visual input is richer than the one from Trigger environments and the average number of steps per episode is larger. Finally, agents move and rotate across the room. Since picking up a key does not require actually seeing the key, it can be hard to know whether a key was taken and to predict further door opening rewards.
6.1 Credit assignment
We provide an analysis of the credit inferred in various scenarios through our method. The analysis is qualitative and quantitative, since we rely on both visual assessment and binary detection metrics. The process of evaluating the credit assignment mechanism in Trigger goes as follows: we first generate trajectories and train the model using the quantized rewards as targets. We then apply the model on held-out trajectories sampled from held-out environments. For each held-out trajectory, we proceed as follows: if a reward is collected by the agent and the model correctly predicts the sign of the reward, we compare the attention weights from the prediction of this very reward to a ground truth credit assignment. We build that ground truth by exploiting the exact knowledge of where triggers are. It is a vector that is $0$ almost everywhere and $1$ on state-action couples that precede the activation of a trigger. By doing so, we explicitly target the state-action couples whose resulting state is causally linked to the reward experienced later. We find the redistribution to be near optimal in simple instances of Trigger (see Fig. 3-left): attention concentrates almost exclusively on state-action pairs that enable the collection of future reward. This is confirmed by precision-recall analysis: over the distribution of scenarios considered, a simple binarization heuristic over attention values yields an average precision of for an average recall of , despite long running sequences. More information on the heuristic is in Appendix A.
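The comparison between attention weights and the binary ground truth can be illustrated with the sketch below. The fixed threshold is a stand-in we chose for illustration; it is not necessarily the binarization heuristic detailed in Appendix A, and the toy attention vector is hypothetical.

```python
def precision_recall(attention, ground_truth, threshold):
    """Binarize attention weights and compare them to a 0/1 ground truth."""
    predicted = [a >= threshold for a in attention]
    tp = sum(1 for p, g in zip(predicted, ground_truth) if p and g == 1)
    fp = sum(1 for p, g in zip(predicted, ground_truth) if p and g == 0)
    fn = sum(1 for p, g in zip(predicted, ground_truth) if not p and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Attention of one reward prediction over four past steps, and the steps
# that actually preceded a trigger activation (toy values).
attention = [0.05, 0.6, 0.05, 0.3]
ground_truth = [0, 1, 0, 1]
print(precision_recall(attention, ground_truth, threshold=0.2))  # (1.0, 1.0)
print(precision_recall(attention, ground_truth, threshold=0.5))  # (1.0, 0.5)
```

A stricter threshold keeps precision high but misses the second trigger, which is the usual trade-off a binarization rule has to settle.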
In keys_doors_puzzle, we adopt the same set of experiments. Since the agent can move backward and spin, in some scenarios it takes a key that is not in its line of sight. In addition, the granularity of the state space is such that off-by-one prediction errors are common but do not hinder the credit mechanism: attributing credit to the state-action couple preceding the collection of a key or to the previous one leads to imperceptible changes in the resulting shaped rewards. Fig. 3-right shows results similar to those obtained for Trigger. Appendix A also provides a heatmap for this task that shows that attention concentrates around the keys, as we expected.
6.2 Zero-shot transfer
We then study how we can leverage the inferred credit and transfer representations that are helpful in new scenarios. We show that agents train faster when using the attentional reward shaping proposed. As before, the reward model is being trained on episodes of interaction in environments sampled from the source distribution. In transfer environments, we sample multiple trajectories, each using the same maze configuration. We then compute the attentional potential function by calculating an estimate of the expected redistributed reward, as described in Sec. 4. To evaluate its effect, we compare agents trained from environment rewards to agents that use the resulting shaped reward.
For in-domain transfer, we transfer the representations for credit assignment to new instances of the same distribution over MDPs. For the Trigger environment, the RL agents are tabular $Q$-learners. We keep the same size and the same number of triggers and prizes, but those elements are at different positions. Since distributions are the same, we generate subsets of environment layouts that are mutually exclusive. For the DMLab environment, we use PPO agents (Schulman et al., 2017) and modify the original task: we do not reward the agent for collecting keys but only for opening doors, so that the attention can focus on the key positions. Notice that this makes the task harder.
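For the tabular case, the way a potential enters learning can be sketched with a single $Q$-learning update on the shaped reward. The hyperparameters, the state and action names, and the choice of setting the potential of terminal states to zero are assumptions made for this illustration.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      potential, gamma=0.9, lr=0.5, done=False):
    """One tabular Q-learning step on the shaped reward.

    q maps (state, action) to a value; potential maps states to phi(s),
    with phi = 0 for states absent from the mapping.
    """
    # Assumption: the potential of a terminal state is taken to be zero.
    phi_next = 0.0 if done else potential.get(next_state, 0.0)
    shaped = reward + gamma * phi_next - potential.get(state, 0.0)
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    target = shaped + gamma * best_next
    q[(state, action)] += lr * (target - q[(state, action)])
    return q[(state, action)]


actions = ["up", "down", "left", "right"]
potential = {"next_to_switch": 1.0}  # hypothetical potential value

q_shaped = defaultdict(float)
print(q_learning_update(q_shaped, "start", "right", 0.0, "next_to_switch",
                        actions, potential, gamma=0.5))  # 0.25

q_plain = defaultdict(float)
print(q_learning_update(q_plain, "start", "right", 0.0, "next_to_switch",
                        actions, {}, gamma=0.5))  # 0.0
```

With the shaped reward the first update already carries signal towards the switch, whereas the unshaped learner has to wait for an actual environment reward to propagate back.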
As we display in Fig. 4, the convergence of the agents gets visibly faster under the attention-based reward than under the standard environment reward function in both environments, which shows that our method is efficient for transfer and scales up.
For out-of-domain transfer, we use the Trigger environment and consider two scenarios that are hard for standard agents: transfer to bigger environments (see Fig. 6) and transfer to environments with modified dynamics (see Fig. 6). In the modified dynamics setting, the effects of the agent’s actions are inverted, which makes transfer hard if not impossible for most transfer methods in the literature. In that setting, we compare the transferability of our mechanism to that of the representations learned by an agent equipped with deep function approximation. To this end, we use DQN agents and either train them from scratch in the target environments or start from the set of weights learned in the source environments.
In both settings, shaping the rewards assists the agent in learning the proper control to solve the game. We display some results in Fig. 6 and Fig. 6. When transferring to bigger environments, the agent benefits very early on from the additional reward brought by the shaping mechanism, while also reaching better asymptotic performance.
7 Related work
As said in Sec. 1, transfer in RL means many different things in the literature. For instance, a line of previous works aimed at making the training of an agent in the same task more sample-efficient by using a pre-trained model as a teacher (Rusu et al., 2016a; Schmitt et al., 2018). Our approach differs by learning a parallel task that does not modify the representations of the RL agent. Others learn auxiliary reward functions in the hope that they will enable transfer by imposing consistency in the reward (Houthooft et al., 2018; Hessel et al., 2018; Agarwal et al., 2019). Although we learn an additional reward signal, it is based on a redistribution of the current task reward. Transfer is also viewed as learning tasks in a sequential way (Rusu et al., 2016b; Kirkpatrick et al., 2017), which suggests introducing inductive bias into the neural architectures of agents to dampen catastrophic forgetting. Our method differs as it does not change the agent’s architecture. Others explicitly address the problem of transfer through the lens of multitask learning (Parisotto et al., 2016; Teh et al., 2017) while we stick to learning from an initial distribution of environments. Closely related, meta-learning appears as a potential solution to the problem of transfer. Meta-learning approaches aim to train agents on a distribution of tasks or environments so that their learned skills and representations work across the underlying continuum, and allow for fast adaptation of the agents (Duan et al., 2016; Wang et al., 2016; Finn et al., 2017; Mishra et al., 2018; Co-Reyes et al., 2019; Zou et al., 2019). Our approach could be related to that line of work in some sense but remains different from most meta-learning methods as we do not modify the RL algorithm used to train the agent. In the domain of credit assignment, perhaps the closest body of work to ours is that of Hung et al. (2018).
They provide an agent with an external memory and the unsupervised task of reconstructing its inputs (both states and rewards). They use memory reads as a way to identify related elements in sequences, and use those to transfer the value of states providing delayed rewards to the bootstrapping target of contributing elements. We do not use a recurrent network and also demonstrate that our method not only boosts the learning speed of an RL agent but is also transferable. Interestingly, one of the auxiliary tasks Jaderberg et al. (2017) use to give agents additional learning signal falls quite close to our reward sign prediction task. The agent must predict the next reward from three consecutive frames sampled from the replay buffer. The link with our work is on the representational side, since they show that auxiliary tasks make transfer of representations easier, which is consistent with our findings.
In this work, we investigated the role credit assignment could play in transfer learning and came up with a novel credit assignment scheme. It takes advantage of the relational properties of self-attention. As a sanity check, we verified its relevance on gridworld tasks and a more complex 3D navigational scenario. Transfer being the primary focus of this work, we demonstrated that our credit assignment method could be used for zero-shot transfer of representations in both in-domain and out-of-domain distributions. To the best of our knowledge, this is the first line of work in that exciting direction. We acknowledge there is still much to be done to confirm the generality of this approach. We think it would be worth exploring how this could be incorporated into online reinforcement learning methods, and leave that to future work.
- Agarwal et al. (2019) Agarwal, R., Liang, C., Schuurmans, D., and Norouzi, M. (2019). Learning to generalize from sparse and underspecified rewards. arXiv preprint arXiv:1902.07198.
- Andrychowicz et al. (2017) Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, O. P., and Zaremba, W. (2017). Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058.
- Arjona-Medina et al. (2018) Arjona-Medina, J. A., Gillhofer, M., Widrich, M., Unterthiner, T., and Hochreiter, S. (2018). Rudder: Return decomposition for delayed rewards. arXiv preprint arXiv:1806.07857.
- Barreto et al. (2017) Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. (2017). Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pages 4055–4065.
- Beattie et al. (2016) Beattie, C., Leibo, J. Z., Teplyashin, D., Ward, T., Wainwright, M., Küttler, H., Lefrancq, A., Green, S., Valdés, V., Sadik, A., et al. (2016). Deepmind lab. arXiv preprint arXiv:1612.03801.
- Bello et al. (2019) Bello, I., Zoph, B., Vaswani, A., Shlens, J., and Le, Q. V. (2019). Attention augmented convolutional networks. CoRR, abs/1904.09925.
- Child et al. (2019) Child, R., Gray, S., Radford, A., and Sutskever, I. (2019). Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
- Co-Reyes et al. (2019) Co-Reyes, J. D., Gupta, A., Sanjeev, S., Altieri, N., DeNero, J., Abbeel, P., and Levine, S. (2019). Meta-learning language-guided policy learning. In Proceedings of the International Conference on Learning Representations (ICLR 2019).
- Dai et al. (2019) Dai, Z., Yang, Z., Yang, Y., Carbonell, J. G., Le, Q. V., and Salakhutdinov, R. (2019). Transformer-xl: Attentive language models beyond a fixed-length context. CoRR, abs/1901.02860.
- Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Duan et al. (2016) Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. (2016). Rl$^2$: Fast reinforcement learning via slow reinforcement learning. CoRR, abs/1611.02779.
- Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017).
- Finn et al. (2016) Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. (2016). Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 512–519. IEEE.
- Florensa et al. (2018) Florensa, C., Held, D., Geng, X., and Abbeel, P. (2018). Automatic goal generation for reinforcement learning agents. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018).
- Hessel et al. (2018) Hessel, M., Soyer, H., Espeholt, L., Czarnecki, W., Schmitt, S., and van Hasselt, H. (2018). Multi-task deep reinforcement learning with popart. arXiv preprint arXiv:1809.04474.
- Higgins et al. (2017) Higgins, I., Pal, A., Rusu, A., Matthey, L., Burgess, C., Pritzel, A., Botvinick, M., Blundell, C., and Lerchner, A. (2017). Darla: Improving zero-shot transfer in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1480–1490. JMLR.org.
- Houthooft et al. (2018) Houthooft, R., Chen, Y., Isola, P., Stadie, B., Wolski, F., Ho, O. J., and Abbeel, P. (2018). Evolved policy gradients. In Advances in Neural Information Processing Systems, pages 5400–5409.
- Howard and Ruder (2018) Howard, J. and Ruder, S. (2018). Universal language model fine-tuning for text classification. In ACL. Association for Computational Linguistics.
- Hung et al. (2018) Hung, C., Lillicrap, T. P., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., and Wayne, G. (2018). Optimizing agent behavior over long time scales by transporting value. CoRR, abs/1810.06721.
- Jaderberg et al. (2017) Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. (2017). Reinforcement learning with unsupervised auxiliary tasks. Proceedings of the International Conference on Learning Representations.
- Kirkpatrick et al. (2017) Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.
- Lin et al. (2017) Lin, Z., Feng, M., dos Santos, C. N., Yu, M., Xiang, B., Zhou, B., and Bengio, Y. (2017). A structured self-attentive sentence embedding. In Proceedings of the International Conference on Learning Representations (ICLR 2017).
- Mishra et al. (2018) Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. (2018). A simple neural attentive meta-learner. In Proceedings of the International Conference on Learning Representations (ICLR 2018).
- Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529.
- Ng et al. (1999) Ng, A. Y., Harada, D., and Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the International Conference on Machine Learning (ICML 1999), volume 99, pages 278–287.
- Packer et al. (2018) Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V., and Song, D. (2018). Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282.
- Parisotto et al. (2016) Parisotto, E., Ba, J., and Salakhutdinov, R. (2016). Actor-mimic: Deep multitask and transfer reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR 2016).
- Peters et al. (2018) Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of the Annual Meeting of North American chapter of ACL (NAACL 2018).
- Puterman (1994) Puterman, M. L. (1994). Markov Decision Processes. Wiley.
- Razavian et al. (2014) Razavian, A. S., Azizpour, H., Sullivan, J., and Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.
- Rusu et al. (2016a) Rusu, A. A., Colmenarejo, S. G., Gülçehre, Ç., Desjardins, G., Kirkpatrick, J., Pascanu, R., Mnih, V., Kavukcuoglu, K., and Hadsell, R. (2016a). Policy distillation. In Proceedings of the International Conference on Learning Representations (ICLR 2016).
- Rusu et al. (2016b) Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R. (2016b). Progressive neural networks. CoRR, abs/1606.04671.
- Santoro et al. (2018) Santoro, A., Faulkner, R., Raposo, D., Rae, J. W., Chrzanowski, M., Weber, T., Wierstra, D., Vinyals, O., Pascanu, R., and Lillicrap, T. P. (2018). Relational recurrent neural networks. In Advances in Neural Information Processing Systems (NeurIPS 2018), pages 7310–7321.
- Schaul et al. (2015) Schaul, T., Horgan, D., Gregor, K., and Silver, D. (2015). Universal value function approximators. In Proceedings of the International Conference on Machine Learning (ICML 2015), pages 1312–1320.
- Schmitt et al. (2018) Schmitt, S., Hudson, J. J., Zídek, A., Osindero, S., Doersch, C., Czarnecki, W. M., Leibo, J. Z., Küttler, H., Zisserman, A., Simonyan, K., and Eslami, S. M. A. (2018). Kickstarting deep reinforcement learning. CoRR, abs/1803.03835.
- Schulman et al. (2016) Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR 2016).
- Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. CoRR, abs/1707.06347.
- Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958.
- Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
- Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press, second edition.
- Teh et al. (2017) Teh, Y., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. (2017). Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pages 4496–4506.
- Tieleman and Hinton (2012) Tieleman, T. and Hinton, G. (2012). Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
- Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
- Wang et al. (2016) Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. (2016). Learning to reinforcement learn. CoRR, abs/1611.05763.
- Watkins and Dayan (1992) Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine learning, 8(3-4):279–292.
- Zambaldi et al. (2019) Zambaldi, V., Raposo, D., Santoro, A., Bapst, V., Li, Y., Babuschkin, I., Tuyls, K., Reichert, D., Lillicrap, T., Lockhart, E., Shanahan, M., Langston, V., Pascanu, R., Botvinick, M., Vinyals, O., and Battaglia, P. (2019). Deep reinforcement learning with relational inductive biases. In Proceedings of the International Conference on Learning Representations (ICLR 2019).
- Zou et al. (2019) Zou, H., Ren, T., Yan, D., Su, H., and Zhu, J. (2019). Reward shaping via meta-learning. arXiv preprint arXiv:1901.09330.
Appendix A Additional Experiment Details
In this section we provide additional details about our experimental setup and the hyperparameters we use.
A.1 Reward prediction model
We use the same set of hyperparameters in all our experiments, with few variations. In Trigger experiments, we use units per dense layer, convolutional filters and a single convolutional layer to process partial states and actions. As mentioned, we use dropout at several places in the model. We use a dropout rate of after dense layers, a dropout rate of in the self-attention mechanism, and a dropout rate of in the normalization blocks of the Transformer architecture. We use the same positional encoding scheme as in Vaswani et al. (2017). Fig. 7 gives an overview of the whole architecture. In DMLab experiments, we use two convolutional layers, filters for each, and otherwise identical hyperparameters.
A.2 Heuristic for precision-recall analysis
In Sec. 6.1, we compare the attention vectors output by the reward prediction model to ideal credit assignment using binary metrics. The ground truth we use is a binary vector of the same size as the attention vector. Its values are 0 everywhere and 1 for timesteps that correspond to the activation of a trigger. To make the comparison, we introduce a simple heuristic to binarize the attention scalars: we consider all values above a threshold to correspond to events to be credited. Then, we can measure precision and recall as in a binary classification paradigm. The precision and recall reported are the average precision and recall over four scenarios: Trigger with a x grid, trigger and reward; Trigger with a x grid, trigger and rewards; Trigger with a x grid, triggers and rewards; and Trigger with a x grid, triggers and reward. In each scenario, we train the model over a set of trajectories, each of which is drawn from a randomly sampled maze. Then, we apply it to trajectories from held-out environments and collect the attention weights corresponding to predictions on timesteps where the agent experiences positive reward. We use a fixed threshold of .
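The binarization heuristic can be illustrated with a minimal sketch. The function name, the example attention values and the threshold of 0.2 are assumptions made for illustration only; they are not the values used in the experiments.

```python
import numpy as np


def attention_precision_recall(attention, ground_truth, threshold):
    """Binarize attention weights and score them against a binary ground truth.

    `threshold` is a free parameter here; the value used below is hypothetical.
    """
    attention = np.asarray(attention, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=bool)

    # Timesteps whose attention exceeds the threshold are treated as credited.
    predicted = attention > threshold

    true_positives = np.sum(predicted & ground_truth)
    n_predicted = predicted.sum()
    n_relevant = ground_truth.sum()

    # Guard against empty predictions or an all-zero ground truth.
    precision = true_positives / n_predicted if n_predicted > 0 else 0.0
    recall = true_positives / n_relevant if n_relevant > 0 else 0.0
    return float(precision), float(recall)


# Toy example: one trigger activation at timestep 1, two timesteps above threshold.
attention = [0.05, 0.60, 0.05, 0.25, 0.05]
ground_truth = [0, 1, 0, 0, 0]
precision, recall = attention_precision_recall(attention, ground_truth, threshold=0.2)
```

In this toy case the trigger timestep is recovered, but one spurious timestep is also credited, so recall is perfect while precision is not. Averaging such pairs over trajectories and scenarios would give figures of the kind reported in Sec. 6.1.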
A.3 In-domain transfer in DMLab
We provide additional details about this setup: we train the reward prediction model using trajectories sampled from a distribution of randomly generated mazes. These trajectories are sampled using an agent trained over the same distribution. We do so to increase the proportion of trajectories in which rewards are experienced, since we found that random policies yielded very few of these. Once the model is trained, we use it to compute the attentional potential function over a fixed maze. trajectories are sampled on the fixed maze using the same policy as the one that generated the trajectories used to train the reward prediction model. Since consecutive frames can be very similar, we consider a positive reward prediction to be correct (and thus use the corresponding attention weights when estimating the potential) if it happens within frames of a reward actually experienced in the environment. We then compare the performance of agents trained with the original reward function to that of agents trained with the shaped reward. An important point is that we use the knowledge of the agent's position and of the keys it possesses to compute and then exploit the potential function. This information is not given to the agent. We acknowledge that relying on knowledge of the agent's state limits the generality of our approach, but we are confident that this limitation can be addressed in future work by training a model that estimates the potential value of each state. All agents mentioned in this section are PPO learners with a learning rate of , an entropy coefficient of , actors, and a discount factor . They use generalized advantage estimation (Schulman et al., 2016) with .
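As an illustration of how a potential function defined over states can be turned into a shaped reward, the sketch below assumes the standard potential-based form of Ng et al. (1999), in which the shaping term is the discounted difference of potentials. The state encoding (a position paired with a set of held keys), the dictionary lookup, the default potential of zero for unseen states and all numeric values are hypothetical.

```python
def shaped_reward(reward, state, next_state, potential, gamma):
    """Potential-based shaping: r + gamma * phi(s') - phi(s).

    `potential` maps a state key to a scalar; unseen states default to 0.0
    (an assumption of this sketch).
    """
    phi_s = potential.get(state, 0.0)
    phi_next = potential.get(next_state, 0.0)
    return reward + gamma * phi_next - phi_s


# Hypothetical state keys: (grid position, keys currently held).
s0 = ((0, 0), frozenset())
s1 = ((0, 1), frozenset({"blue"}))
potential = {s0: 0.0, s1: 0.5}

# Moving towards a high-potential state yields a positive bonus even when the
# environment reward is zero.
bonus = shaped_reward(0.0, s0, s1, potential, gamma=0.9)
```

Because the potential is looked up from privileged state information, the lookup happens on the environment side; the agent only ever observes the resulting scalar reward.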
A.4 Attention heatmap in DMLab
We display the per-position average attention weight along trajectories starting in the middle-left room that led to opening the blue doors. Attention (shown in shades of red) concentrates around the key position in the lower-left corner.
For experiments involving tabular Q-learning, we use online Q-learning with a learning rate of and a constant greediness factor also equal to .
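For reference, online tabular Q-learning (Watkins and Dayan, 1992) with a constant greediness factor can be sketched as follows. This is a generic illustration rather than the code used in the experiments; the function names, the table layout and the values of the learning rate, the discount factor and the greediness factor are assumptions.

```python
import random
from collections import defaultdict


def epsilon_greedy(q_values, state, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = [q_values[(state, a)] for a in range(n_actions)]
    return max(range(n_actions), key=values.__getitem__)


def q_learning_update(q_values, state, action, reward, next_state, done,
                      n_actions, alpha, gamma):
    """One online Q-learning step: move Q(s, a) towards the TD target."""
    if done:
        best_next = 0.0  # no bootstrapping from terminal states
    else:
        best_next = max(q_values[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    q_values[(state, action)] += alpha * (td_target - q_values[(state, action)])


# Q-table keyed by (state, action); unseen entries default to 0.0.
q = defaultdict(float)
q_learning_update(q, state=0, action=1, reward=1.0, next_state=1, done=True,
                  n_actions=2, alpha=0.5, gamma=0.9)

# With epsilon set to 0 the policy is purely greedy.
greedy_action = epsilon_greedy(q, state=0, n_actions=2, epsilon=0.0,
                               rng=random.Random(0))
```

A shaped reward can be passed as `reward` in place of the environment reward without any other change to the update.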
For the out-of-domain transfer experiment with modified dynamics, we use a smaller copy of the DQN architecture in Mnih et al. (2015). The first convolutional layer has filters, a x kernel size and a stride of . The second and third convolutional layers both have filters, a x kernel size and a stride of . These are followed by a feed-forward layer with units and then another feed-forward layer with as many units as the number of available actions. The greediness factor is decayed linearly from to over steps in the environment at train time and has a constant value of at test time. We use RMSProp (Tieleman and Hinton, 2012) as an optimizer with a base learning rate of . We update the target network every steps and initially fill the replay buffer with transitions sampled following a random policy. The replay buffer has a maximum size of .
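The linear decay of the greediness factor can be sketched as below. The function name and the start value, end value and number of decay steps in the example are placeholders chosen for illustration, not the settings used in the experiments.

```python
def linear_epsilon(step, start, end, decay_steps):
    """Linearly interpolate epsilon from `start` to `end` over `decay_steps`.

    After `decay_steps` environment steps, epsilon stays at `end`.
    """
    fraction = min(max(step / decay_steps, 0.0), 1.0)
    return (1.0 - fraction) * start + fraction * end


# Hypothetical schedule: from 1.0 down to 0.1 over 1000 environment steps.
eps_initial = linear_epsilon(0, start=1.0, end=0.1, decay_steps=1000)
eps_halfway = linear_epsilon(500, start=1.0, end=0.1, decay_steps=1000)
eps_final = linear_epsilon(5000, start=1.0, end=0.1, decay_steps=1000)
```

At test time the schedule is bypassed and a constant value is used instead.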