Attention Privileged Reinforcement Learning For Domain Transfer

by Sasha Salter, et al.
University of Oxford

Applying reinforcement learning (RL) to physical systems presents notable challenges, given requirements regarding sample efficiency, safety, and physical constraints compared to simulated environments. To enable transfer of policies trained in simulation, randomising simulation parameters leads to more robust policies, but also significantly extends training time. In this paper, we exploit access to privileged information (such as environment states) often available in simulation, in order to improve and accelerate learning over randomised environments. We introduce Attention Privileged Reinforcement Learning (APRiL), which equips the agent with an attention mechanism and makes use of state information in simulation, learning to align attention between state- and image-based policies while additionally sharing generated data. During deployment we can apply the image-based policy to remove the requirement of access to additional information. We experimentally demonstrate accelerated and more robust learning on a number of diverse domains, leading to improved final performance for environments both within and outside the training distribution.




1 Introduction

Deep Reinforcement Learning (RL) has recently provided significant successes in a range of areas, including video games (Mnih et al., 2015), board games (Silver et al., 2017), simulated continuous control tasks (Lillicrap et al., 2015), and robotic manipulation (Haarnoja et al., 2018; Haarnoja, 2018; Riedmiller et al., 2018; OpenAI et al., 2018; Schwab et al., 2019; Andrychowicz et al., 2017). However, application to physical systems has proven to be challenging in general, due to expensive and slow data generation as well as safety challenges when running untrained policies. A common approach to circumvent these issues is to transfer models trained in simulation to the real world (Tobin et al., 2017; Rusu et al., 2016; Held et al., 2017). However, simulators only represent approximations of a physical system. Due to physical, visual, and behavioural discrepancies, naively transferring RL agents trained in simulation onto the real world can be challenging.

To bridge the gap between simulation and the real world, we can either aim to align both domains (Ganin et al., 2016; Bousmalis et al., 2016; Wulfmeier et al., 2017) or ensure that the real system is covered by the distribution of simulated training data (OpenAI et al., 2018; Tobin et al., 2017; Pinto et al., 2018; Sadeghi and Levine, 2016; Viereck et al., 2017). However, training under a distribution of randomised visual attributes of the simulator, such as textures and lighting (Sadeghi and Levine, 2016; Viereck et al., 2017), as well as physics (OpenAI et al., 2018), can be substantially more difficult and slower due to the increased variability of the learning domain (OpenAI et al., 2018; Tobin et al., 2017).

The more structured and informative the input representation is with respect to the task, the quicker the agent can be trained. A clear example of this effect can be found when an agent is trained with image inputs, versus training with access to the exact simulator states (Tassa et al., 2018; Pinto et al., 2018). However, visual perception is more general and access to more compressed representations can often be limited. When exact states are available during training but not deployment, we can make use of information asymmetric actor-critic methods (Pinto et al., 2018; Schwab et al., 2019) to train the critic faster via access to the state while providing only images for the actor.

By introducing Attention Privileged Reinforcement Learning (APRiL), we aim to further leverage access to exact states. APRiL uses states not only to train the critic but also, indirectly, to train an image-based actor. Extending asymmetric actor-critic methods, APRiL concurrently trains two actor-critic systems (one symmetric, state-based agent, and one asymmetric agent with an image-dependent actor). Both actors utilise an attention mechanism to filter input data, and because we have access to the simulation rendering system, we can optimise the image- and state-based attention masks to align.

By additionally sharing the replay buffer between both agents, we accelerate learning of the image-based actor: it trains on better performing states that the state-based actor discovers more quickly, because the state-based actor's lower-dimensional input is invariant to visual randomisation.

The key benefits of APRiL lie in its application to domain transfer. When training with domain randomisation for transfer, bootstrapping via asymmetric information has displayed crucial benefits (Pinto et al., 2018). Visual randomisation substantially increases the complexity of the image-based actor’s task. Under this setting, the attention network can support invariance with respect to the irrelevant, but highly varying, parts of the image. Furthermore, the convergence of the state-space actor remains unaffected by visual randomisation.

We experimentally demonstrate considerable improvements regarding learning convergence and more robust transfer on a set of continuous action domains including: 2D navigation, 2D locomotion and 3D robotic manipulation.

2 Problem Setup

Before introducing Attention Privileged Reinforcement Learning (APRiL), this section provides a background for the RL algorithms used. For a more in-depth introduction please refer to Lillicrap et al. (2015) and Pinto et al. (2018).

2.1 Reinforcement Learning

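We describe an agent’s environment as a Partially Observable Markov Decision Process, represented as the tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, P, r, \gamma, s_0)$, where $\mathcal{S}$ denotes a set of continuous states, $\mathcal{A}$ denotes a set of either discrete or continuous actions, $P(s_{t+1} \mid s_t, a_t)$ is the transition probability function, $r(s_t, a_t)$ is the reward function, $\gamma$ is the discount factor, and $s_0$ is the initial state distribution. $\mathcal{O}$ is a set of continuous observations corresponding to continuous states in $\mathcal{S}$. At every time-step $t$, the agent takes action $a_t = \pi(s_t)$ according to its policy $\pi: \mathcal{S} \to \mathcal{A}$. The policy is optimised to maximise the expected return $\mathbb{E}[R_t]$, where $R_t = \sum_{i=t}^{\infty} \gamma^{i-t} r_i$. The agent’s Q-function is defined as $Q^{\pi}(s_t, a_t) = \mathbb{E}[R_t \mid s_t, a_t]$.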
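As a small worked example of the return that the policy maximises, the sketch below computes the discounted sum of rewards for one finite trajectory. The function name and the reward values are hypothetical and chosen only for illustration; the definition in the text sums over an infinite horizon, which this sketch truncates at the end of the episode.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward sequence, starting at its first entry."""
    total = 0.0
    # Accumulate backwards so that each earlier step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical three-step episode with discount factor 0.5:
# 1.0 + 0.5 * 0.0 + 0.25 * 2.0
episode_rewards = [1.0, 0.0, 2.0]
episode_return = discounted_return(episode_rewards, gamma=0.5)
print(episode_return)
```

The Q-function is the expectation of this quantity over trajectories that start from a given state and action.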

2.2 Asymmetric Deep Deterministic Policy Gradients

Asymmetric Deep Deterministic Policy Gradients (asymmetric DDPG) (Pinto et al., 2018) represents a type of actor-critic algorithm designed specifically for efficient learning of a deterministic, observation-based policy in simulation for sim-to-real transfer. This is achieved by leveraging access to more compressed, informative environment states, available in simulation, to speed up and stabilise training of the critic.

The algorithm maintains two neural networks: an observation-based actor or policy $\pi_\theta(o_t)$ (with parameters $\theta$), used during training and at test time, and a state-based Q-function (also known as critic) $Q_\phi(s_t, a_t)$ (with parameters $\phi$), which is only used during training.

To enable exploration, the method (like its symmetric version (Silver et al., 2014)) relies on a noisy version of the policy (called the behavioural policy), e.g. $\pi_b(o_t) = \pi_\theta(o_t) + z$ where $z \sim \mathcal{N}(0, 1)$ (see Appendix C for our particular instantiation). The transition tuples $(s_t, o_t, a_t, r_t, s_{t+1}, o_{t+1})$ encountered during training are stored in a replay buffer (Mnih et al., 2015). Training examples sampled from the replay buffer are used to optimise the critic and actor. By minimising the Bellman error loss $L_{critic} = (Q_\phi(s_t, a_t) - y_t)^2$, where $y_t = r_t + \gamma Q_\phi(s_{t+1}, \pi_\theta(o_{t+1}))$, the critic is optimised to approximate the true Q values. The actor is optimised by minimising the loss $L_{actor} = -\mathbb{E}_{s, o \sim \pi_b}[Q_\phi(s, \pi_\theta(o))]$.
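The asymmetry described above (critic on states, actor on observations) can be made concrete with a minimal sketch. The toy `actor` and `critic` functions, the dictionary keys, and all numbers are assumptions made for illustration, not the networks used in the paper; target networks and exploration noise are omitted for brevity.

```python
def actor(observation):
    """Toy observation-based policy: half the mean of the observation vector."""
    return 0.5 * sum(observation) / len(observation)


def critic(state, action):
    """Toy state-based critic: prefers actions close to the scalar state."""
    return -(state - action) ** 2


def bellman_target(reward, gamma, next_state, next_observation):
    """One-step target: the critic sees the next state, the actor the next observation."""
    return reward + gamma * critic(next_state, actor(next_observation))


def critic_loss(batch, gamma):
    """Mean squared Bellman error over a batch of transitions."""
    errors = [
        (critic(t["state"], t["action"])
         - bellman_target(t["reward"], gamma, t["next_state"], t["next_obs"])) ** 2
        for t in batch
    ]
    return sum(errors) / len(errors)


def actor_loss(batch):
    """Negative mean critic value of the actions the actor would take."""
    values = [critic(t["state"], actor(t["obs"])) for t in batch]
    return -sum(values) / len(values)


# Hypothetical single-transition batch.
batch = [{
    "state": 1.0, "obs": [1.0, 3.0], "action": 0.5, "reward": 1.0,
    "next_state": 2.0, "next_obs": [2.0, 2.0],
}]
print(critic_loss(batch, gamma=0.9), actor_loss(batch))
```

Only `actor` is needed at deployment, which is why the state is never passed to it.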

3 Attention Privileged Reinforcement Learning (APRiL)

Figure 1: Attention Privileged Reinforcement Learning model structure. Dashed lines indicate the attention alignment process. Separate operators in the diagram signify that experiences are evenly sampled from both agents and represent element-wise multiplication.

APRiL proposes to improve the performance and sample efficiency of an observation-based agent by using a quicker learning actor that has access to exact environment states, sharing replay buffers, and aligning attention mechanisms between both actors. While we focus in the following sections on extending asymmetric DDPG (Pinto et al., 2018), these ideas are generally applicable to off-policy actor-critic methods (Konda and Tsitsiklis, 2000).

APRiL comprises three modules, as displayed in Figure 1. The first two modules, $A_s$ and $A_o$, are actor-critic algorithms with an attention network incorporated over the input to each actor. For the state-based module $A_s$ we use standard symmetric DDPG, while the observation-based module $A_o$ builds on asymmetric DDPG. Finally, the third part, $A_T$, represents the alignment process between the attention mechanisms of both actor-critic agents, which transfers knowledge more effectively from the quicker learner $A_s$ to the slower learner $A_o$.

The state-based module $A_s$ consists of three networks: $Q_s$, $\pi_s$, $h_s$ (respectively critic, actor, and attention), each with its own parameters. Given input state $s$, the attention network outputs a soft gating mask $h_s(s)$ of the same dimensionality as the input, with values in $[0, 1]$. The input to the actor is an attention-filtered version of the state, $s^a = h_s(s) \odot s$. To encourage a sparse masking function, we found that training this attention module on both the traditional DDPG loss and an entropy loss helped:

$$L_{h_s} = \mathbb{E}_{s \sim \pi_b}\left[-Q_s(s, \pi_s(s^a)) + \beta \, L_{\mathcal{H}}(h_s(s))\right]$$

Here $L_{\mathcal{H}}$ denotes the entropy loss on the mask, $\beta$ is a hyperparameter to weight the additional entropy objective, and $\pi_b$ is the behaviour policy used to obtain experience (in this case from a shared replay buffer). The actor and critic networks $\pi_s$ and $Q_s$ are trained with the symmetric DDPG actor and Bellman error losses respectively.
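The soft gating and the entropy term can be illustrated with a short sketch. It assumes that the mask comes from a sigmoid over per-dimension logits and that each mask entry is treated as a Bernoulli probability when computing entropy; neither choice is stated in the text, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_mask(logits):
    """Soft gating mask with one value in [0, 1] per input dimension."""
    return [sigmoid(v) for v in logits]


def apply_mask(state, mask):
    """Attention-filtered state: element-wise product of mask and state."""
    return [m * s for m, s in zip(mask, state)]


def mask_entropy(mask, eps=1e-8):
    """Mean Bernoulli entropy of the mask entries (an assumed form of the entropy term)."""
    total = 0.0
    for m in mask:
        m = min(max(m, eps), 1.0 - eps)  # avoid log(0)
        total -= m * math.log(m) + (1.0 - m) * math.log(1.0 - m)
    return total / len(mask)


# Hypothetical 3-dimensional state and attention logits.
state = [2.0, -1.0, 4.0]
mask = soft_mask([4.0, -4.0, 0.0])
print(apply_mask(state, mask))
print(mask_entropy([0.5, 0.5]), mask_entropy([0.01, 0.99]))
```

Under this assumed form, a mask whose entries sit near 0 or 1 has lower entropy than an undecided mask at 0.5, which is the sense in which an entropy term can relate to sparse gating.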

Within the alignment module $A_T$, the state-attention obtained in $A_s$ is converted to corresponding observation-attention to act as a self-supervised target for the observation-based agent in $A_o$. This is achieved in a two-step process. First, state-attention $h_s(s)$ is converted into object-attention $c$, which specifies how task-relevant each object in the scene is. Second, object-attention is converted to observation-space attention by performing a weighted sum over object-specific segmentation maps:

$$c = M \, h_s(s), \qquad h_T = \sum_{i=1}^{N} c_i \, z_i$$

Here, $M \in [0, 1]^{N \times n_s}$ (where $n_s$ is the dimensionality of $s$) is an environment-specific, predefined adjacency matrix that maps the dimensions of $s$ to each corresponding object, and $c$ is then an attention vector over the $N$ objects in the environment. $c_i$ corresponds to the $i$-th object attention value. $z_i$ is the binary segmentation map separating the $i$-th object from the rest of the scene, and has the same dimensions as the image observation (many simulators, like MuJoCo (Todorov et al., 2012), natively provide functionality to access these segmentations). $z_i$ assigns values of $1$ for pixels in the image occupied by the $i$-th object, and $0$ elsewhere. $h_T$ is the state-attention converted to observation-space attention, which acts as a target to train the observation-attention network on.
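The two-step conversion from state-attention to an observation-space target can be sketched as follows. The adjacency values (each object averaging its two state dimensions), the 2x3 image size, and the function names are assumptions made purely for illustration.

```python
def object_attention(adjacency, state_attention):
    """Step 1: map per-dimension state attention to one attention value per object."""
    return [
        sum(weight * attn for weight, attn in zip(row, state_attention))
        for row in adjacency
    ]


def observation_attention_target(object_attn, segmentation_maps):
    """Step 2: weighted sum of binary segmentation maps, one map per object."""
    height = len(segmentation_maps[0])
    width = len(segmentation_maps[0][0])
    target = [[0.0] * width for _ in range(height)]
    for attn, seg in zip(object_attn, segmentation_maps):
        for i in range(height):
            for j in range(width):
                target[i][j] += attn * seg[i][j]
    return target


# Hypothetical scene: state = (agent_x, agent_y, target_x, target_y), two objects.
state_attention = [0.9, 0.7, 0.2, 0.0]
adjacency = [
    [0.5, 0.5, 0.0, 0.0],  # object 0 (agent) owns the first two state dimensions
    [0.0, 0.0, 0.5, 0.5],  # object 1 (target) owns the last two
]
segmentation_maps = [
    [[1, 0, 0], [0, 0, 0]],  # agent occupies the top-left pixel
    [[0, 0, 0], [0, 0, 1]],  # target occupies the bottom-right pixel
]
c = object_attention(adjacency, state_attention)
h_target = observation_attention_target(c, segmentation_maps)
print(c)
print(h_target)
```

Pixels that belong to no listed object (the background here) receive a target of zero, so the target mask suppresses them.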

The observation-based module $A_o$ also consists of three networks: $Q_o$, $\pi_o$, $h_o$ (respectively critic, actor, and attention), each with its own parameters. The structure of this module is the same as $A_s$, except that the actor and critic now have asymmetric inputs. The input to the actor is the attention-filtered version of the observation, $o^a = h_o(o) \odot o$ (in practice, the output of $h_o$ is tiled to match the number of channels that the image contains). The actor and critic networks $\pi_o$ and $Q_o$ are trained with the standard asymmetric DDPG actor and Bellman error losses, respectively, defined in Section 2.2. The main difference between $A_o$ and $A_s$ is that the observation attention network is trained on both the actor loss and an object-weighted mean squared error loss:

$$L_{h_o} = \mathbb{E}_{s, o \sim \pi_b}\left[\frac{1}{2} \sum_{ij} \frac{1}{w_{ij}} \left(h_o(o) - h_T\right)_{ij}^2 - \nu \, Q_o(s, \pi_o(o^a))\right]$$

where the weights $w_{ij}$ correspond to the fraction of the partial observation $o$ that the object present in pixel $o_{ij}$ occupies, and $\nu$ represents the relative weighting of both loss components. The weight terms, $w_{ij}$, ensure that the attention network becomes invariant to the size of objects during training and does not simply fit to the most predominant object in the scene. Combining the self-supervised attention loss and the RL loss leverages efficient state-space learning unaffected by visual randomisation.
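To illustrate why weighting by object size matters, the sketch below scales each pixel's squared error by the inverse of the fraction of the image that its object occupies. The label-map representation (one integer object id per pixel, background included), the normalisation by pixel count, and the function names are assumptions for illustration rather than the paper's implementation.

```python
from collections import Counter


def pixel_weights(label_map):
    """Per-pixel weight: fraction of the image occupied by that pixel's object."""
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    return [[counts[label] / total for label in row] for row in label_map]


def object_weighted_mse(predicted, target, weights):
    """Squared error per pixel divided by its weight, averaged over all pixels."""
    total = 0.0
    num_pixels = 0
    for p_row, t_row, w_row in zip(predicted, target, weights):
        for p, t, w in zip(p_row, t_row, w_row):
            total += (p - t) ** 2 / w
            num_pixels += 1
    return total / num_pixels


# Hypothetical 2x4 image: background (id 0) covers 6 pixels, one object (id 1) covers 2.
label_map = [[0, 0, 0, 0], [0, 0, 1, 1]]
target = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
# The prediction is off by 0.5 on one background pixel and on one object pixel.
predicted = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.5, 1.0]]

weights = pixel_weights(label_map)
loss = object_weighted_mse(predicted, target, weights)
print(weights[0][0], weights[1][2], loss)
```

In this example the same 0.5 error costs three times more on the small object than on the large background, so the attention network cannot lower the loss by fitting only the dominant region.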

During training, experiences are collected evenly from both state- and observation-based agents and stored in a shared replay buffer (similar to Schwab et al. (2019)). This is to ensure that:

  1. Both the state-based critic and the observation-based critic observe states that would be visited by either of their respective policies.

  2. The attention modules $h_s$ and $h_o$ are trained on the same data distribution to better facilitate alignment.

  3. Highly performing states discovered efficiently by $A_s$ are used to speed up learning of $A_o$.

Algorithm 1 shows pseudocode for a single-actor implementation of APRiL. In practice, in order to speed up data collection and gradient computation, we parallelise the agents and environments and ensure that data collection from state- and image-based agents is even.

  Initialize (a)symmetric actor-critic modules $A_s$, $A_o$, attention alignment module $A_T$, replay buffer $R$
  for episode $= 1$ to $E$ do
     Initial state $s_0$
     while not DONE do
         Render image observation $o_t$ and segmentation maps $z_t$ from state $s_t$
         if episode mod $2 = 0$ then
            Obtain action $a_t$ using obs-behavioral policy and obs-attention network $h_o$
         else
            Obtain action $a_t$ using state-behavioral policy and state-attention network $h_s$
         end if
         Execute action $a_t$, receive reward $r_t$, DONE flag, and transition to $s_{t+1}$
         Store the transition tuple in $R$
     end while
     for $k = 1$ to $K$ do
         Sample minibatch from $R$
         Optimise state- critic, actor, and attention using the minibatch with $A_s$
         Convert state-attention to target observation-attention using the segmentation maps with $A_T$
         Optimise observation- critic, actor, and attention using the minibatch and target attention with $A_o$
     end for
  end for
Algorithm 1 Attention Privileged Reinforcement Learning
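The data-collection half of the algorithm (alternating behaviour agents that write into one shared buffer) can be sketched as below. The class and function names, the buffer capacity, the placeholder transition contents, and the choice of which episode parity uses which agent are all assumptions for illustration; no environment or network is involved.

```python
import random
from collections import deque


class SharedReplayBuffer:
    """Fixed-capacity buffer that both agents write to and sample from."""

    def __init__(self, capacity, seed=0):
        self._storage = deque(maxlen=capacity)  # oldest transitions are dropped first
        self._rng = random.Random(seed)

    def add(self, transition):
        self._storage.append(transition)

    def sample(self, batch_size):
        """Uniformly sample up to batch_size distinct stored transitions."""
        items = list(self._storage)
        return self._rng.sample(items, min(batch_size, len(items)))

    def __len__(self):
        return len(self._storage)


def behaviour_agent(episode):
    """Alternate agents per episode so that both contribute evenly."""
    return "observation" if episode % 2 == 0 else "state"


def collect(buffer, num_episodes, steps_per_episode):
    """Fill the buffer with placeholder transitions tagged by their source agent."""
    for episode in range(num_episodes):
        source = behaviour_agent(episode)
        for step in range(steps_per_episode):
            buffer.add({"episode": episode, "step": step, "source": source})


shared_buffer = SharedReplayBuffer(capacity=100)
collect(shared_buffer, num_episodes=4, steps_per_episode=3)
minibatch = shared_buffer.sample(4)
print(len(shared_buffer), [t["source"] for t in minibatch])
```

Because every minibatch is drawn from the same buffer, both agents' critics and attention networks see transitions generated by either policy.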

4 Experiments

To demonstrate the performance and generality of our method, we apply APRiL to a range of environments, and compare with a competitive asymmetric DDPG baseline and various ablations. We evaluate APRiL over different metrics to investigate how attention helps with robustness and generalisation to unseen environments and transfer scenarios. Further experimental details can be found in Appendix C.

4.1 Evaluation Protocol

In order to investigate APRiL under varying conditions, we evaluate in scenarios of increasing complexity covering simple 2D navigation, 3D reaching and 2D dynamic locomotion.

We use the following continuous action-space environments (see Appendix A for further details):

  1. NavWorld: In this 2D environment, the goal is for the circular agent to reach the triangular target in the presence of distractors. The agent is sparsely rewarded if the target is reached.

  2. JacoReach: In this 3D environment the goal of the Kinova arm (Campeau-Lecours et al., 2017) agent is to reach the diamond ShapeStacks object (Groth et al., 2018) in the presence of distractors. The agent is rewarded for approaching and reaching its goal.

  3. Walker2D: In this slightly modified 2D Deepmind Control Suite environment (Tassa et al., 2018) the goal of the agent is to walk forward as far as possible within a time-limit. The agent receives a reward for moving forward as well as a reward for keeping its torso upright.

For these domains we randomise visuals during training so as to enable generalisation to these variable aspects of the environment. We randomise a combination of: camera position and orientation, textures, materials, colours, object locations, and background. Refer to Appendix B for more details.

4.2 Key research questions

We investigate the following questions to evaluate how well APRiL supports the transfer of policies across visually distinct environments. Does APRiL: 1. Increase sample-efficiency during training? 2. Affect interpolation performance on unseen environments from the training distribution? 3. Affect extrapolation performance on environments outside the training distribution?

We qualitatively analyse the learnt attention maps (both on interpolated and extrapolated domains). Finally, we perform an ablation study to investigate which parts of APRiL contribute to performance gains. This ablation consists of the following models:

  1. APRiL no self-supervision (APRiL no sup): APRiL except without the self-supervision provided by the state agent to train the observation-based attention. Both agents are still equipped with an attention module, but the observation attention must now learn without guidance from the state agent. Without bootstrapping from the state agent in this way we expect learning of informative observation-based attention to be hindered.

  2. APRiL no shared buffer (APRiL no share): APRiL except each agent has its own replay buffer, instead of one shared replay buffer, and hence does not share experiences during training. Under this setting, the observation agent will not be able to benefit from earlier visitation of lucrative states by the state agent. Both agents have an attention module and attention alignment still occurs.

  3. APRiL no background (APRiL no back): APRiL except the state agent’s attention is no longer used to calculate object-space attention values. Instead, all objects are given equal attention and we hence learn a background suppressor. This most competitive ablation investigates how important object suppression is for learning, robustness, and generalisation. Both agents still maintain attention and have a shared replay buffer.

Figure 2: Learning curves during training of APRiL, its ablations, and the asymmetric DDPG baseline. Solid line: mean performance. Shaded region: covers minimum and maximum performances across seeds.

4.3 Performance on the training distribution

We evaluate performance on all domains during training and observe APRiL’s benefits. As seen in Figure 2, APRiL provides performance gains across all continuous action domains. APRiL not only helps learn useful representations more quickly (improving learning rate) but also improves final policy performance (within the allotted training time).

The ablations demonstrate that self-supervision and shared replay both independently provide performance gains for JacoReach and Walker2D. (For NavWorld, none of the ablations by themselves outperform the baseline; we suspect that this is due to the simplicity of the environment, both visually and in its small, confined state-space.) For Walker2D, shared replay is crucial as it stabilises learning (observe APRiL, APRiL no back, APRiL no sup), due to constant visitation of highly performing states. Suppression of task-irrelevant, yet highly varying, information also speeds up learning as it simplifies the observation space. For this reason, APRiL no back proves to be a competitive ablation, approaching the performance of APRiL for JacoReach and Walker2D. For these domains, the background occupies the majority of the observation space and ignoring it already suppresses most of the irrelevant information, so only minimal further improvement can be achieved by suppressing additional irrelevant objects. None of the ablations, however, are able to outperform the full APRiL framework, demonstrating that the shared replay buffer and the state-space-informed image-attention module cooperate constructively toward more efficient feature learning and more effective policy and critic updates.

4.4 Interpolation: transfer to domains from the training distribution

We evaluate the performance of all actor-critic algorithms on a hold-out set of simulation parameters, unseen during training, from the training distribution. For a detailed description of the training distribution for each domain please refer to Appendix B. For both NavWorld and JacoReach, the interpolated environments have the same number of distractors, sampled from the same object catalogue, as the training distribution. Table 1 displays final policy performance on these domains. For APRiL, we observe no degradation in policy performance between training and interpolated domains. We see a very similar trend for the asymmetric DDPG baseline. However, as APRiL performs better on the training distribution, its final performance on the interpolated domains is also better. We therefore demonstrate that on these domains APRiL’s attention mechanism does not lead to overfitting.

4.5 Extrapolation: transfer to domains outside the training distribution

We investigate performance on simulation parameters outside the training distribution. In particular, we investigate how well APRiL, its ablations, and asymmetric DDPG generalise to environments with more distractor objects than seen during training. For NavWorld and JacoReach, we run two sets of increasingly extrapolated experiments with an additional 4 or 8 distractors (referred to as ext-4 and ext-8 in Table 1). The textures and colours of these objects are sampled from a held-out set of simulation parameters not seen during training. For NavWorld, the locations and orientations of the additional distractors are randomly sampled. For JacoReach, the locations are sampled from arcs of two concentric circles of different radii (with arcs and radii extrapolated beyond those seen during training), in such a way that each object remains visible. The shapes of the additional distractor objects are sampled from the training catalogue of distractor objects. Please refer to Figure 3 for examples of the extrapolated domains.

Table 1 compares performances on the extrapolated sets (except Walker2D) varying in difficulty (ext-4 and ext-8). APRiL yields performance gains over the asymmetric DDPG baseline on every extrapolated domain. For JacoReach, APRiL’s generalisation is so effective that, for the hardest domain with 8 additional distractors, its performance degrades by only %, as opposed to % for the baseline (percentage decrease is taken with respect to initial and final policy performance on the training distribution).
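One plausible reading of the percentage decrease described above is the drop in return on the extrapolated domain, normalised by the improvement achieved over training. The sketch below implements that reading only; the formula is an assumption, and the numbers are hypothetical rather than results from Table 1.

```python
def percentage_degradation(initial_train, final_train, evaluated):
    """Drop from final training return, as a percentage of the gain made during training.

    Assumed formula: 100 * (final_train - evaluated) / (final_train - initial_train).
    """
    improvement = final_train - initial_train
    if improvement == 0:
        raise ValueError("initial and final training returns must differ")
    return 100.0 * (final_train - evaluated) / improvement


# Hypothetical returns: training improved from 0 to 100, the held-out domain scores 90.
degradation = percentage_degradation(initial_train=0.0, final_train=100.0, evaluated=90.0)
print(degradation)
```

Normalising by the training improvement makes the figure comparable across domains whose returns are on different scales.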

APRiL generalises favourably due to the attention module. Figure 3 shows that attention generalises and suppresses the additional distractors, thereby effectively converting the hold-out observations to those seen during training, which the image-policy can handle. The ablations in Table 1 confirm that in this setting, distractor suppression is crucial. This is seen when comparing the maximum degradation in policy performance of APRiL, APRiL no share, APRiL no back and APRiL no sup (%, %, % and % respectively). APRiL and APRiL no share both align attention between image and state agents during training, and therefore effectively suppress distractors (yielding a favourable decrease in policy performance of only % and %). APRiL no back learns a background suppressor, but does not suppress the distractors (leading to a larger degradation of %). APRiL no sup has an attention module trained only on the asymmetric actor-critic loss and yields the worst extrapolated performance (% policy degradation). For these extrapolated domains, the successful suppression of the background and additional distractors (achieved only by the full APRiL framework) creates policy invariance with respect to them and helps the policy generalise.

Domain | Baseline | APRiL no sup | APRiL no share | APRiL no back | APRiL
NavWorld (train)
NavWorld (inter)
NavWorld (ext-4)
NavWorld (ext-8)
JacoReach (train)
JacoReach (inter)
JacoReach (ext-4)
JacoReach (ext-8)
Walker2D (train)
Walker2D (inter)
Table 1: Ablation comparing average return over training, interpolated and extrapolated environments (100 each). Results reflect mean and standard deviation of average return over 5 seeds.

4.6 Attention Module Analysis

To better understand the role of the attention, we visualise APRiL’s attention maps (Figures 3, 4 and 5) on both interpolated and extrapolated domains. For NavWorld, attention is correctly paid to all relevant aspects (agent and target; circle and triangle respectively). Attention generalises reasonably well to the extrapolated environments. For JacoReach, attention is paid to the diamond-shaped target object as well as to every other link (alternating links) of the Kinova arm. Interestingly, APRiL learnt that, as the arm is a constrained system, the state of every other link can be indirectly inferred without explicit attention: the state of an unobserved link can be inferred by observing the links on either side of it. The entropy loss over the state-attention module encourages this form of attention over a minimal set of objects. Attention here generalises very well to the extrapolated domains. For Walker2D, we observe attention that is dynamic in object space. The attention module attends to different subsets of links depending on the state of the system (see Figure 5). When the walker is upright, walking, and collapsing, APRiL pays attention to the lower limbs, every other link, and the foot and upper body, respectively. We suspect that in these scenarios the magnitude of the optimal action is largest for the lower links (due to stability), every link (coordination), and the foot and upper body (large torque required), respectively.

Figure 3: Example held-out domains (top) and APRiL attention maps (bottom). White and black signify high and low attention values. Attention correctly suppresses background and distractors.

5 Related Work

Domain Randomisation has been applied for reinforcement learning to facilitate transfer between domains (Tobin et al., 2017; Pinto et al., 2018; Sadeghi and Levine, 2016; Viereck et al., 2017; OpenAI et al., 2018; Held et al., 2017) and increase robustness of the learned policies (Rajeswaran et al., 2016). However, while domain randomisation enables us to generate more robust and transferable policies, it leads to a significant increase in required training time (OpenAI et al., 2018).

Existing comparisons in the literature demonstrate that, even without domain randomisation, the increased dimensionality and potential partial observability complicate learning for RL agents (Tassa et al., 2018; Schwab et al., 2019; Watter et al., 2015; Lesort et al., 2018). In this context, accelerated training has been achieved by using access to privileged information such as environment states to asymmetrically train the critic in actor-critic RL (Schwab et al., 2019; Pinto et al., 2018). In addition to using additional information to train the critic, Schwab et al. (2019) use a shared replay buffer for data generated by image- and state-based actors to further accelerate training for the image-based agent. Our method extends these approaches by sharing information about relevant objects through aligning agent-integrated attention mechanisms between image- and state-based actors.

Recent experiments have demonstrated the strong dependency and bidirectional interaction between attention and learning in human subjects (Leong et al., 2017). In the context of machine learning, attention mechanisms have been integrated into RL agents to increase robustness and enable interpretability of an agent’s behaviour (Sorokin et al., 2015; Choi et al., 2017; Mott et al., 2019). In comparison to these works, we focus on utilising the attention mechanism as an interface to transfer information between two agents to enable faster training.

6 Conclusion

We introduce Attention Privileged Reinforcement Learning (APRiL), an extension to asymmetric actor-critic algorithms that leverages access to privileged information like exact simulator states. The method benefits in two ways: by sharing a replay buffer and by aligning attention masks between image- and state-space agents. By leveraging simulator ground-truth information about system states, we are able to learn efficiently in the image domain, especially during domain randomisation, where feature learning becomes increasingly difficult. Our evaluation on a diverse set of environments demonstrates improvements over the competitive asymmetric DDPG baseline and reveals that APRiL generalises favourably to environments not seen during training (both within and outside of the training distribution), emphasising the importance of attention and shared experience for the robustness of the learnt policies.


  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) TensorFlow: a system for large-scale machine learning.. In OSDI, Vol. 16, pp. 265–283. Cited by: Appendix C.
  • M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba (2017) Hindsight experience replay. External Links: 1707.01495 Cited by: §1.
  • J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: Appendix C.
  • K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan (2016) Domain separation networks. In Advances in Neural Information Processing Systems, pp. 343–351. Cited by: §1.
  • A. Campeau-Lecours, H. Lamontagne, S. Latour, P. Fauteux, V. Maheu, F. Boucher, C. Deguire, and L. C. L’Ecuyer (2017) Kinova modular robot arms for service robotics applications. Int. J. Robot. Appl. Technol. 5 (2), pp. 49–71. External Links: ISSN 2166-7195, Link, Document Cited by: item 2, item 2.
  • J. Choi, B. Lee, and B. Zhang (2017) Multi-focus attention network for efficient deep reinforcement learning. In

    Workshops at the Thirty-First AAAI Conference on Artificial Intelligence

    Cited by: §5.
  • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §1.
  • O. Groth, F. B. Fuchs, I. Posner, and A. Vedaldi (2018) ShapeStacks: learning vision-based physical intuition for generalised object stacking. In ECCV (1), Lecture Notes in Computer Science, Vol. 11205, pp. 724–739. Cited by: item 2, item 2.
  • T. Haarnoja, S. Ha, A. Zhou, J. Tan, G. Tucker, and S. Levine (2018) Learning to Walk via Deep Reinforcement Learning. arXiv e-prints. External Links: 1812.11103 Cited by: §1.
  • T. Haarnoja (2018) Acquiring diverse robot skills via maximum entropy deep reinforcement learning. Ph.D. Thesis, UC Berkeley. Cited by: §1.
  • D. Held, Z. McCarthy, M. Zhang, F. Shentu, and P. Abbeel (2017) Probabilistically safe policy transfer. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 5798–5805. Cited by: §1, §5.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix C.
  • V. R. Konda and J. N. Tsitsiklis (2000) Actor-critic algorithms. In Advances in neural information processing systems, pp. 1008–1014. Cited by: §3.
  • Y. C. Leong, A. Radulescu, R. Daniel, V. DeWoskin, and Y. Niv (2017) Dynamic interaction between reinforcement learning and attention in multidimensional environments. Neuron 93 (2), pp. 451 – 463. External Links: ISSN 0896-6273, Document, Link Cited by: §5.
  • T. Lesort, N. Díaz-Rodríguez, J. Goudou, and D. Filliat (2018) State representation learning for control: an overview. Neural Networks. Cited by: §5.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. External Links: 1509.02971 Cited by: §1, §2.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §1, §2.2.
  • A. Mott, D. Zoran, M. Chrzanowski, D. Wierstra, and D. J. Rezende (2019) Towards interpretable reinforcement learning using attention augmented agents. ArXiv abs/1906.02500. Cited by: §5.
  • OpenAI: M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba (2018) Learning dexterous in-hand manipulation. External Links: 1808.00177 Cited by: §1, §1, §5.
  • L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba, and P. Abbeel (2018) Asymmetric actor critic for image-based robot learning. Robotics: Science and Systems. Cited by: §1, §1, §1, §2.2, §2, §3, §5, §5.
  • M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz (2017) Parameter space noise for exploration. arXiv preprint arXiv:1706.01905. Cited by: Appendix C.
  • A. Rajeswaran, S. Ghotra, B. Ravindran, and S. Levine (2016) Epopt: learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283. Cited by: §5.
  • M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. V. de Wiele, V. Mnih, N. Heess, and J. T. Springenberg (2018) Learning by playing - solving sparse reward tasks from scratch. External Links: 1802.10567 Cited by: §1.
  • A. Romero, N. Ballas, S. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2015) Imagenet classification with deep convolutional neural networks. In International Conference on Learning Representations, Cited by: Appendix C.
  • A. A. Rusu, M. Vecerik, T. Rothörl, N. Heess, R. Pascanu, and R. Hadsell (2016) Sim-to-real robot learning from pixels with progressive nets. arXiv preprint arXiv:1610.04286. Cited by: §1.
  • F. Sadeghi and S. Levine (2016) Cad2rl: real single-image flight without a single real image. arXiv preprint arXiv:1611.04201. Cited by: §1, §5.
  • D. Schwab, T. Springenberg, M. F. Martins, T. Lampe, M. Neunert, A. Abdolmaleki, T. Herkweck, R. Hafner, F. Nori, and M. Riedmiller (2019) Simultaneously learning vision and feature-based control policies for real-world ball-in-a-cup. arXiv preprint arXiv:1902.04706. Cited by: §1, §1, §3, §5.
  • D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic policy gradient algorithms. In ICML, Cited by: §2.2.
  • D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of go without human knowledge. Nature 550 (7676), pp. 354. Cited by: §1.
  • I. Sorokin, A. Seleznev, M. Pavlov, A. Fedorov, and A. Ignateva (2015) Deep attention recurrent q-network. arXiv preprint arXiv:1512.01693. Cited by: §5.
  • Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. (2018) Deepmind control suite. arXiv preprint arXiv:1801.00690. Cited by: item 3, §1, item 3, §5.
  • J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pp. 23–30. Cited by: §1, §1, §5.
  • E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. Cited by: footnote 2.
  • U. Viereck, A. t. Pas, K. Saenko, and R. Platt (2017) Learning a visuomotor controller for real world robotic grasping using simulated depth images. arXiv preprint arXiv:1706.04652. Cited by: §1, §5.
  • M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller (2015) Embed to control: a locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pp. 2746–2754. Cited by: §5.
  • M. Wulfmeier, I. Posner, and P. Abbeel (2017) Mutual alignment transfer learning. arXiv preprint arXiv:1707.07907. Cited by: §1.

Appendix A Environments

  1. NavWorld: In this sparse-reward 2D environment, the goal is for the circular agent to reach the triangular target in the presence of distractor objects. Distractor objects have 4 or more sides and, apart from changing the visual appearance of the environment, cannot affect the agent. The state space consists of the locations of all objects. The observation space comprises RGB images of dimension . The action space corresponds to the velocity of the agent. The agent only obtains a sparse reward of if the particle is within of the target, at which point the episode terminates early. The maximum episodic length is 20 steps, and all object locations are randomised between episodes.

  2. JacoReach: In this 3D environment the goal of the agent is to move the Kinova arm (Campeau-Lecours et al., 2017) such that the distance between its hand and the diamond ShapeStacks object (Groth et al., 2018) is minimised. The state space consists of the quaternion position and velocity of each joint as well as the Cartesian positions of each ShapeStacks object. The observation space comprises RGB images and is of dimension . The action space consists of the desired relative quaternion positions of each joint (excluding the digits) with respect to their current positions. MuJoCo uses a PD controller, executed for 20 steps, to minimise the error between each joint’s actual and target positions. The agent’s reward is the negative squared Euclidean distance between the Kinova hand and the diamond object plus an additional discrete reward of if it is within of the target. The episode is terminated early if the target is reached. All objects are out of reach of the arm and equally far from its base. Between episodes the locations of the objects are randomised along an arc of fixed radius with respect to the base of the Kinova arm. The maximum episodic length is 20 agent steps.

  3. Walker2D: In this modified 2D Deepmind Control Suite environment (Tassa et al., 2018) with a continuous action space, the goal of the agent is to walk forward as far as possible within steps. We introduce a limit on episodic length as we found that in practice this helped stabilise learning across all tested algorithms. The observation space comprises stacked RGB images and is of dimension . Images are stacked so that the velocity of the walker can be inferred. The state space consists of the quaternion positions and velocities of all joints. The absolute position of the walker along the x-axis is omitted so that the walker learns to become invariant to it. The action space is set up in the same way as for the JacoReach environment. The reward is the same as defined in (Tassa et al., 2018) and consists of two multiplicative terms: one encouraging moving forward beyond a given speed, the other encouraging the torso of the walker to remain as upright as possible. The episode is terminated early if the walker’s torso falls beyond either radians with the vertical or m along the z axis.
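As an illustration of the JacoReach reward described above, the following is a minimal sketch rather than the authors' implementation. The bonus value and the distance threshold are not preserved in this text, so `bonus` and `threshold` are hypothetical parameters, and the strict inequality for "within" is an assumption.

```python
def jaco_reach_reward(hand_pos, target_pos, bonus, threshold):
    """Negative squared Euclidean distance, plus a bonus near the target.

    `bonus` and `threshold` are placeholders: the values used in the
    paper are not available here, so the caller must supply them.
    """
    sq_dist = sum((h - t) ** 2 for h, t in zip(hand_pos, target_pos))
    reached = sq_dist ** 0.5 < threshold  # assumed strict "within" test
    reward = -sq_dist + (bonus if reached else 0.0)
    # `reached` doubles as the early-termination signal for the episode.
    return reward, reached
```

Returning the `reached` flag alongside the reward keeps the early-termination rule and the discrete bonus tied to the same distance test.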

Appendix B Randomisation Procedure

In this section we outline the randomisation procedure taken for each environment during training.

  1. NavWorld: Randomisation occurs at the start of every episode. We randomise the location, orientation and colour of every object as well as the colour of the background. We therefore hope that our agent can become invariant to these aspects of the environment.

  2. JacoReach: Randomisation occurs at the start of every episode. We randomise the textures and materials of every ShapeStacks object, Kinova arm and background. We randomise the locations of each object along an arc of fixed radius with respect to the base of the Kinova arm. Materials vary in reflectance, specularity, shininess and repeated textures. Textures vary between the following: noisy (where RGB noise of a given colour is superimposed on top of another base colour), gradient (where the colour varies linearly between two predefined colours), uniform (only one colour). Camera location and orientation are also randomised. The camera is randomised along a spherical sector of a sphere of varying radius whilst always facing the Kinova arm. We hope that our agent can become invariant to these randomised aspects of the environment.

  3. Walker2D: Randomisation occurs at the start of every episode as well as after every agent steps. We introduce additional randomisation within episodes due to their increased duration. Due to the MDP setup, intra-episodic randomisation is not an issue. Materials, textures, camera location and orientation are randomised following the same procedure as for JacoReach. The camera is set up to always face the upper torso of the walker.
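The three texture families named for JacoReach (noisy, gradient, uniform) can be sketched in a few lines. This is only an illustrative assumption of how such textures might be generated: the function names, the horizontal gradient direction, and `noise_prob` (the fraction of pixels replaced by the noise colour) are hypothetical and not taken from the paper.

```python
import random


def random_colour(rng):
    """Sample an RGB colour with channels in [0, 1)."""
    return (rng.random(), rng.random(), rng.random())


def lerp(a, b, t):
    """Linearly interpolate between colours a and b."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))


def make_texture(kind, width, height, rng, noise_prob=0.3):
    """Return a height x width grid of RGB tuples for one texture family."""
    base, other = random_colour(rng), random_colour(rng)
    rows = []
    for _ in range(height):
        row = []
        for x in range(width):
            if kind == "uniform":
                pixel = base
            elif kind == "gradient":
                # Colour varies linearly from `base` to `other` along x.
                t = x / (width - 1) if width > 1 else 0.0
                pixel = lerp(base, other, t)
            elif kind == "noisy":
                # Noise colour replaces the base colour at random pixels.
                pixel = other if rng.random() < noise_prob else base
            else:
                raise ValueError(f"unknown texture kind: {kind}")
            row.append(pixel)
        rows.append(row)
    return rows


def sample_texture(rng, width=8, height=8):
    """Pick one texture family at random, as at the start of an episode."""
    kind = rng.choice(["noisy", "gradient", "uniform"])
    return kind, make_texture(kind, width, height, rng)
```

Passing an explicit `random.Random` instance makes each episode's randomisation reproducible from a seed.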

Appendix C Implementation details

| Domain | NavWorld and JacoReach | Walker2D |
| --- | --- | --- |
| State Actor | FC() | FC() |
| Obs Actor | Conv() | Conv() |
| State Critic | FC() | FC() |
| Obs Critic | FC() | FC() |
| State Attention | FC() | FC() |
| Obs Attention | Conv() | Conv() |
| Replay Buffer Size | | |

Table 2: Model architecture. FC() represents a (multi-layered) fully connected network with the number of nodes per layer stated as argument. Conv() represents a (multi-layered) convolutional network whose arguments take the form [channels, square kernel size, stride] for each hidden layer.

In this section we provide more details on our training setup. Refer to Table 2 for the model architecture for each component of APRiL and the asymmetric DDPG baseline. The Obs Actor and Obs Critic setups are the same for both APRiL and the baseline. The Obs Actor model structure comprises the convolutional layers (without padding) defined in Table 2 followed by one fully connected layer with hidden units (FC()). All layers use ReLU (Romero et al., 2015) activations and layer normalisation (Ba et al., 2016) unless otherwise stated. Each actor network is followed by a tanh activation and rescaled to match the limits of the environment’s action space.
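The final tanh-and-rescale step of the actor can be sketched as follows. This is a minimal illustration assuming per-dimension action bounds `low` and `high`; the function name and argument layout are hypothetical.

```python
import math


def rescale_action(pre_activation, low, high):
    """Squash each output with tanh, then map [-1, 1] onto [low, high]."""
    return [
        lo + 0.5 * (math.tanh(z) + 1.0) * (hi - lo)
        for z, lo, hi in zip(pre_activation, low, high)
    ]
```

A zero pre-activation maps to the midpoint of the action range, and saturated outputs map to its limits.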

The State Attention module includes the fully connected layer defined in Table 2 followed by a Softmax operation. The Obs Attention module has the convolutional layers (with padding to ensure constant dimensionality) outlined in Table 2 followed by a fully connected convolutional layer (Conv()) with a Sigmoid activation to ensure the outputs vary between 0 and 1. The output of this module is tiled in order to match the dimensionality of the observation space.
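The two output activations differ in what they normalise: Softmax yields weights that sum to one across state dimensions, while Sigmoid bounds each spatial location independently. A minimal sketch follows, assuming the Obs Attention head emits a single-channel H x W logit map that is repeated across image channels; that layout, and the helper names, are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def tile_attention(logit_map, channels):
    """Apply sigmoid to an H x W map and repeat it over `channels`."""
    return [[[sigmoid(v)] * channels for v in row] for row in logit_map]
```

The tiled map then has the same H x W x C shape as the observation, so it can be combined with the image element by element.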

During each iteration of APRiL (for both and ) we perform optimization steps on minibatches of size from the shared replay buffer. The target actor and critic networks are updated every iteration with a Polyak averaging of . We use Adam (Kingma and Ba, 2014) optimization with a learning rate of , and for critic, actor and attention networks respectively. We use default TensorFlow (Abadi et al., 2016) values for the other hyperparameters. The discount factor, entropy weighting and self-supervised learning hyperparameters are , and respectively. To stabilize learning, all input states are normalized by running averages of the means and standard deviations of encountered states.
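Two of the stabilisation steps mentioned above, Polyak-averaged target networks and running state normalisation, can be sketched as below. This is not the authors' code: the convention that `tau` weights the old target parameters is an assumption, and Welford's online algorithm is swapped in as one common way to maintain running means and standard deviations.

```python
import math


def polyak_update(target, online, tau):
    """Blend parameters; `tau` is assumed to weight the old target values."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target, online)]


class RunningNormaliser:
    """Tracks per-dimension mean and variance with Welford's algorithm."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations
        self.eps = eps         # hypothetical guard against division by zero

    def update(self, state):
        self.count += 1
        for i, x in enumerate(state):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalise(self, state):
        out = []
        for i, x in enumerate(state):
            var = self.m2[i] / self.count if self.count > 0 else 1.0
            out.append((x - self.mean[i]) / (math.sqrt(var) + self.eps))
        return out
```

In use, `update` would be called on every encountered state and `normalise` applied before states are fed to the state-based networks.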

Both actors employ an adaptive parameter noise (Plappert et al., 2017) exploration strategy with an initial std of , a desired action std of and an adoption coefficient of . The settings for the baseline are kept the same as for APRiL where appropriate.
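The adaptation rule behind adaptive parameter noise can be sketched as follows: the noise scale shrinks when perturbed actions drift further than desired from the unperturbed ones, and grows otherwise. The class and function names are hypothetical, the numeric values are left to the caller because they are not preserved in this text, and the rule assumes an adoption coefficient greater than 1.

```python
import math


class AdaptiveParamNoise:
    """Adapts the parameter-noise std towards a desired action-space std."""

    def __init__(self, initial_std, desired_action_std, adoption_coefficient):
        self.current_std = initial_std
        self.desired_action_std = desired_action_std
        self.adoption_coefficient = adoption_coefficient  # assumed > 1

    def adapt(self, distance):
        if distance > self.desired_action_std:
            # Perturbation changes actions too much: reduce the noise.
            self.current_std /= self.adoption_coefficient
        else:
            # Perturbation is too timid: increase the noise.
            self.current_std *= self.adoption_coefficient
        return self.current_std


def action_distance(actions, perturbed_actions):
    """Root-mean-square difference between two batches of actions."""
    sq = [
        (a - p) ** 2
        for row_a, row_p in zip(actions, perturbed_actions)
        for a, p in zip(row_a, row_p)
    ]
    return math.sqrt(sum(sq) / len(sq))
```

A typical loop would periodically evaluate both the perturbed and unperturbed actor on a batch of replayed states, compute `action_distance`, and pass the result to `adapt`.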

Appendix D Attention Visualisation

Figure 4: APRiL attention maps for policy rollouts on NavWorld and Jaco domains. White and black signify high and low attention values respectively. For NavWorld and JacoReach, attention is correctly paid only to the relevant objects (and Jaco links), even for the extrapolated domains. Refer to section 4.6 for more details.
Figure 5: APRiL attention maps for policy rollouts on Walker domain. White and black signify high and low attention values respectively. Attention varies based on the state of the walker. When the walker is upright, high attention is paid to lower limbs. When walking, even attention is paid to every other limb. When about to collapse, high attention is paid to the foot and upper torso. Refer to section 4.6 for more details.