Zero-Shot Reinforcement Learning with Deep Attention Convolutional Neural Networks

by   Sahika Genc, et al.

Simulation-to-simulation and simulation-to-real world transfer of neural network models has been a difficult problem. To close the reality gap, prior methods for simulation-to-real world transfer focused on domain adaptation, on decoupling perception and dynamics and solving each problem separately, and on randomization of agent parameters and environment conditions to expose the learning agent to a variety of conditions. While these methods provide acceptable performance, the computational complexity required to capture a large variation of parameters for comprehensive scenarios on a given task such as autonomous driving or robotic manipulation is high. Our key contribution is to theoretically prove and empirically demonstrate that a deep attention convolutional neural network (DACNN) with a specific visual sensor configuration performs as well as training on a dataset with high domain and parameter variation, at lower computational complexity. Specifically, the attention network weights are learned through policy optimization to focus on local dependencies that lead to optimal actions, and do not require tuning in the real world for generalization. Our new architecture adapts perception with respect to the control objective, resulting in zero-shot learning without pre-training a perception network. To measure the impact of our new deep network architecture on domain adaptation, we consider autonomous driving as a use case. We perform an extensive set of experiments in simulation-to-simulation and simulation-to-real scenarios to compare our approach to several baselines, including the current state-of-the-art models.



1 Introduction

Most of the recent examples in deep reinforcement learning of autonomous control agents utilize realistic simulation environments to learn various tasks including but not limited to locomotion, motion planning, and robotic-arm manipulation with limited or no human guidance (see Tan et al. (2018) and references therein). These realistic simulation environments are safe for the agent to experience both desired and unwanted behavior. On the other hand, in general, a controller learned in a simulation environment performs poorly in the real world or does not generalize without additional tuning in the real world.

There is no single approach for zero-shot reinforcement learning of a robotic controller agent. In Tzeng et al. (2015), the authors apply domain adaptation at the feature level. In Tobin et al. (2017) and Mordatch et al. (2015), the authors use domain and dynamics randomization, respectively. In Higgins et al. (2017a), the authors propose a new multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent), which learns to see before learning to act. More recently, domain adaptation has been studied for robotic manipulators Bousmalis et al. (2017); Kalashnikov et al. (2018); James et al. (2017); Gualtieri and Platt (2018); Rusu et al. (2017); Golemo et al. (2018), in which the authors use raw (pixel) images as state for deep reinforcement learning.

Achieving zero-shot RL requires addressing the uncertainty, un-modeled dynamics, and perception challenges across all three components, namely, agent, environment, and interpreter. There are currently two schools of thought, one focusing on improving dynamics and the other on perception. We argue that the key to robust zero-shot reinforcement learning is to jointly address uncertainty in dynamics and variability in perception.

We propose a new deep neural network architecture named Deep Attention Convolutional Neural Network (DACNN). An overview of the steps of our proposed approach is shown in Figure 1. Our key contribution is that our attention model uniquely captures the underlying components of the modern control theoretic approach, i.e., image-based servoing, without the need for separation of perception and control. Image-based servoing has been successfully applied to robotic control use cases including but not limited to drones. Recent image-based servoing methods use image feature vectors, which are specific transformations of the raw pixels. We prove that our attention model uniquely captures the image feature vectors used in image-based visual servo control via annotation vectors. Annotation vectors are extracted from a CNN as described in Bahdanau et al. (2014). By defining the image features as annotation vectors, the full image error is defined in terms of the weights of the annotation vectors. We assume that the annotation vectors have fixed orientation in the inertial frame. This assumption allows the passivity-like features of the dynamic system to transfer to the full image error when a spherical camera geometry is used. Therefore, we jointly solve the perception and control problem via the attention model, which results in robust domain adaptation with zero-shot RL. A complete characterization of the class of systems which can be rendered passive (from input to output) is beyond the scope of this paper. However, this class is broad, and encompasses mechanical systems modeled by Euler-Lagrange equations Arcak (2007).

Figure 1: The block diagram of our proposed approach using an attention mechanism that preserves passivity-like features of the dynamic system for optimal motion planning.

2 Our Approach: Attention Models in Optimal Visual Control

In this section, we describe each building block of the overall architecture of our proposed approach in Figure 1. First, we describe the core of our architecture, attention neural networks. Second, we describe the underlying assumptions that enable attention networks to perform better than the current state-of-the-art approaches. Finally, we show how autonomous driving physics satisfy the assumptions for deep reinforcement learning with attention networks.

Our hypothesis is that, under certain assumptions on the control system and sensor configuration, a specific type of neural network, i.e., an attention network, enables joint perception and stable control of the system even when there are significant changes in the environment, such as texture and lighting, that transform the observation space. There are several formulations of attention networks for image captioning and natural language processing Xu et al. (2015); Bahdanau et al. (2014); Ba et al. (2015); Mnih et al. (2014a). To the best of our knowledge, the attention mechanisms in these applications are used in conjunction with recurrent neural networks. Attention enables neural networks to "focus" selectively on different parts of the input while generating the corresponding parts of the output sequence. This selective "focus" for a corresponding input is then learned through backpropagation.

Our formulation is inspired by the attention model in Xu et al. (2015), where the attention-based model can attend to the salient part of an image while generating its caption. Intuitively, attention enables the model to see what parts of the image to focus on as it generates a caption. This closely resembles how humans perceive when captioning an image or translating a long passage. In the context of autonomous driving, the vehicle needs to focus on the cues from the road, such as white and yellow lines, but not on the entire road, i.e., grey asphalt, unless there is an obstacle or another vehicle.

Our main goal is to learn the shortest path on an arbitrarily drawn track using a ground robot or vehicle at the highest possible speed without going off the track or hitting obstacles or other vehicles. We assume that the ground vehicle control is based only on the raw (pixel) image and that there is no other sensor for position or velocity. In the following, first, we consider the image-based control formulation using constructs from image-based visual servo (IBVS) theory and the kinematic model of our vehicle as defined in Kong et al. (2015). We provide the necessary conditions required to preserve passivity-like properties. Second, we consider the full image error for the control problem. We propose that the image features and the full image error be defined in terms of annotation vectors and their corresponding weights. We show that the proposed formulation guarantees a stabilizing controller. Finally, we describe the learning task based on the attention model.

The image-only IBVS approach for our ground vehicle problem can be solved if and only if the image geometry is spherical. In the classical setting, the state of our vehicle consists of the position of the vehicle on a 2D plane, the steering angle at the center of mass, and the velocity of the vehicle in the 2D plane. The state transition function is defined by the discrete-time formulation of the kinematic model in Equations 1-4 as formulated in Kong et al. (2015).

The parameters and inputs of this model are the distance from the center of mass to the rear axle, the length of the vehicle, the acceleration, and the steering angle of the front wheels (the rear wheels are fixed and do not steer). The kinematic model performs on par with the dynamic model in model predictive control off-road tests using full-size passenger automobiles. This kinematic model satisfies the passivity property for the control system.
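As a rough illustration of the state transition described above, the following Python sketch advances a kinematic bicycle model by one time step. It assumes the standard continuous-time form associated with Kong et al. (2015), discretized with a forward-Euler step. The names `psi`, `beta`, `l_r`, `length`, and `dt`, the default dimensions, and the Euler discretization are illustrative assumptions and are not taken from Equations 1-4 of this paper.

```python
import math
from dataclasses import dataclass


@dataclass
class VehicleState:
    x: float    # position along the x-axis (m)
    y: float    # position along the y-axis (m)
    psi: float  # heading angle (rad); name is an assumption
    v: float    # speed (m/s)


def slip_angle(delta_f: float, l_r: float, length: float) -> float:
    """Angle of the velocity at the center of mass for front steering delta_f."""
    return math.atan((l_r / length) * math.tan(delta_f))


def step(state: VehicleState, a: float, delta_f: float,
         l_r: float = 0.08, length: float = 0.16, dt: float = 0.05) -> VehicleState:
    """One forward-Euler step of the kinematic bicycle model.

    a is the acceleration input, delta_f the front steering angle,
    l_r the distance from the center of mass to the rear axle, and
    length the distance between the axles (hypothetical default values).
    """
    beta = slip_angle(delta_f, l_r, length)
    return VehicleState(
        x=state.x + dt * state.v * math.cos(state.psi + beta),
        y=state.y + dt * state.v * math.sin(state.psi + beta),
        psi=state.psi + dt * (state.v / l_r) * math.sin(beta),
        v=state.v + dt * a,
    )


# Usage: drive straight while accelerating, then apply a left steering input.
s = VehicleState(x=0.0, y=0.0, psi=0.0, v=1.0)
s = step(s, a=2.0, delta_f=0.0, dt=0.1)
s = step(s, a=0.0, delta_f=0.2, dt=0.1)
```

Only the front steering angle and acceleration act as inputs here, which matches the statement that the rear wheels do not steer.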

In the classical IBVS formulation, the control state is inferred from the sensor-based observations. That is, the sensor state is the raw-pixel image and a transformation matrix is used to map observations into the control state. In our deep reinforcement learning formulation, we consider observations as the state; the control state transformation is implicitly embedded into the neural network and is not inferred directly. The deep reinforcement learning algorithm learns about the control state through a reward function by making trial-and-error decisions on the observation space, i.e., raw pixel images.

In both our formulation and the IBVS formulation, the observation state is a matrix in which each column corresponds to a fixed-dimensional feature extracted from the observed image; the number of features and their dimension are finite integers. The geometry of the camera is modeled by its image surface relative to its focal point. Therefore, the image feature can be written as a function of its projection onto the image surface in the body fixed frame. The image feature in our formulation is the output of the convolutional neural network layers, and the input to the convolutional network layers is the raw-pixel image.

Theorem 2.1.

The passivity-like properties of the body fixed frame dynamics of a rigid object in the image space are preserved if and only if the image geometry is of a spherical camera.

The proof of Theorem 2.1 is in Hamel and Mahony (2002). Our kinematic equations, Equations 1-4, already exhibit a simple linear cascade structure involving a rotation matrix. Since cascade systems exhibit passivity-like properties, to guarantee that the controlled system inherits passivity-like properties, we need to show that the gradient of the full image error contains a skew-symmetric matrix on angular velocities. In the next section, we define the full image error in terms of an attention network.
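To make the role of the skew-symmetric term concrete, the short NumPy sketch below builds the skew-symmetric matrix associated with an angular velocity vector and checks the property that matters for passivity-style arguments: the quadratic form of a skew-symmetric matrix vanishes, so the rotational term contributes nothing to the rate of change of a squared-norm storage function. The function name `skew` and the example vectors are illustrative assumptions; this is a generic linear algebra fact, not the paper's proof.

```python
import numpy as np


def skew(omega: np.ndarray) -> np.ndarray:
    """Skew-symmetric matrix S such that S @ p equals np.cross(omega, p)."""
    wx, wy, wz = omega
    return np.array([
        [0.0, -wz, wy],
        [wz, 0.0, -wx],
        [-wy, wx, 0.0],
    ])


omega = np.array([0.1, -0.4, 0.7])   # hypothetical angular velocity (rad/s)
p = np.array([0.3, 0.5, -0.2])       # hypothetical image feature vector

S = skew(omega)
rotational_power = p @ S @ p          # quadratic form of a skew-symmetric matrix
```

Because `p @ S @ p` is zero for every `p`, any term of the form `S @ p` in the error dynamics drops out of the derivative of `0.5 * p @ p`.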

Figure 2: The block diagram of our attention-CNN network.

The overall neural network architecture with the deep attention network for our approach is shown in Figure 2. Intuitively, the objective of a visual servo algorithm in image space is to match the observed image to a known “model” image of the target. The target is an image of the environment with the desired outcome. In the context of an autonomous vehicle, a target is an image of the road where the vehicle is within the white lines and away from obstacles and other vehicles. Our approach does not require a known model image of the target for control. However, we need to engineer a reward function that indirectly provides the means to discriminate between desired and undesired behavior. Therefore, it is necessary to examine the error in the image space. Furthermore, we hypothesize that our attention network model reduces the image space error compared to naively feeding image features extracted from CNN layers. The full image error between the observed and known model image is defined using a combination matrix approach, in which the observed image features are compared with the desired image features through a combination matrix that preserves the passivity-like properties.

Assuming that the combination matrix is full rank, we can rewrite the full image error as a weighted sum. The choice of the terms in this sum becomes the design component for the control algorithm. We propose that the image features in the above formulation can be chosen as the annotation vectors in our modified attention model, with the weights of the sum being the corresponding weights of the annotation vectors. Suppose that a set of image features, the annotation vectors, is extracted from a CNN as in Xu et al. (2015). The output of the attention layer from the annotation vectors corresponds to the features extracted at different image locations. The extractor produces a finite number of annotation vectors. These annotation vectors form the state space of our MDP. We define a context vector as a weighted sum of the annotation vectors.


For each location, the mechanism generates a positive weight which can be interpreted either as the probability that the location is the right place to focus for producing the control action (the "hard" but stochastic attention mechanism), or as the relative importance to give to the location in the weighted sum of the annotation vectors. The weight of each annotation vector is computed by an attention model. The weight is calculated by the softmax function.




The inputs to the softmax are computed from the hidden state vectors of the previous LSTM cell and an attention model. In the literature, there are multiple formulations of the attention model, e.g., additive and multiplicative. In the following, we describe our formulation based on additive models.

The only input to the attention model is the annotation vectors. Additive attention (or multi-layer perceptron (MLP) attention) and multiplicative attention (or dot-product attention) are the most commonly used attention mechanisms. They share the same unified form of attention introduced above, but differ in how they compute the scoring function. We use a modified MLP attention in our network that scores each annotation vector on its own and does not use any contextual vector for computing the attention weights. Since the annotation weights are the output of a softmax function, they are positive. Therefore, we preserve the combination matrix in Equation 6 to be full rank. This property also ensures that we can stabilize the system if the velocity is available as a control input, i.e., kinematic control. This condition is satisfied by our design.
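The attention step described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: a single hidden layer with a `tanh` nonlinearity scores each annotation vector from that vector alone (no recurrent hidden state or context input), a softmax turns the scores into positive weights, and the context vector is the weighted sum. The parameter names (`w_hidden`, `b_hidden`, `w_score`), the layer sizes, and the random inputs are hypothetical and do not reproduce the authors' network.

```python
import numpy as np


def softmax(scores: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array of scores."""
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()


def additive_attention(annotations, w_hidden, b_hidden, w_score):
    """MLP (additive) attention without a context input.

    annotations: (L, D) array, one D-dimensional annotation vector per location.
    Returns the (L,) attention weights and the (D,) context vector.
    """
    hidden = np.tanh(annotations @ w_hidden + b_hidden)  # (L, H)
    scores = hidden @ w_score                            # (L,) one score per location
    weights = softmax(scores)                            # positive, sum to one
    context = weights @ annotations                      # weighted sum of annotations
    return weights, context


rng = np.random.default_rng(0)
num_locations, feature_dim, hidden_units = 6, 8, 4       # hypothetical sizes
annotations = rng.normal(size=(num_locations, feature_dim))
weights, context = additive_attention(
    annotations,
    w_hidden=rng.normal(size=(feature_dim, hidden_units)),
    b_hidden=np.zeros(hidden_units),
    w_score=rng.normal(size=hidden_units),
)
```

The strict positivity of `weights` follows directly from the exponential in the softmax, which is the property the text relies on when arguing about the combination matrix.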

While the visual features used as state provide robustness to camera and target calibration errors, the rigid-body dynamics of the camera ego-motion are highly coupled when expressed as target motion in the image plane. Therefore, a direct adaptive control approach provides better results. We consider a general formulation of ground robot navigation as a Markov decision process (MDP) and use a vanilla clipped proximal policy optimization algorithm Schulman et al. (2017). In standard policy optimization, the policy to be learned is parametrized by the weights and bias parameters of the underlying neural network. In our case, the underlying neural network contains an attention mechanism in addition to the more commonly used convolutional and dense layers. Therefore, the policy optimization solves for the full image error in the visual space while learning the optimal actions for navigation.
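For reference, the clipped surrogate objective of proximal policy optimization from Schulman et al. (2017) can be written compactly in NumPy. The sketch below computes only the objective value for a batch; the function name, the argument names, and the clip range of 0.2 are illustrative assumptions, and a real training loop would differentiate this quantity with respect to the network parameters (attention weights included).

```python
import numpy as np


def clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped PPO surrogate objective (to be maximized), averaged over a batch.

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy; advantages: advantage estimates.
    """
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))


# Usage with a tiny hypothetical batch of three transitions.
old_logp = np.log(np.array([0.25, 0.50, 0.20]))
new_logp = np.log(np.array([0.50, 0.50, 0.10]))
advantages = np.array([1.0, 0.5, -1.0])
objective = clipped_surrogate(new_logp, old_logp, advantages)
```

Taking the minimum of the clipped and unclipped terms removes the incentive to move the probability ratio far from one in a single update.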

3 Conclusion and Future Work

Most traditional approaches to controlling robotic agents rely on extracting features from images or using non-image-based sensors for state measurements. Optimal control algorithms that use pixels as the state typically need a simulation environment. Control policies trained in a specific simulation environment perform poorly in other environments with the same hardware model configuration. We have tackled two critical problems in our work on domain adaptation. First, we provided a theoretical foundation for the conditions under which joint perception and control will perform as well as or better than the current state of the art at lower computational complexity. Second, we implemented our approach for a mobile robot and empirically demonstrated the improvement.

We have proposed a new architecture (DACNN) that strives to attain joint perception and control by learning to focus through the lens of the CNN layers, and achieve zero-shot reinforcement learning, converging to an optimal policy that’s transferable across perceptive differences in the environment. The attention model learns to capture the part of the image relevant to driving while the spherical geometry under which the image captures the real-time observation guarantees stable control under the passivity assumption.

We have demonstrated that additive attention can capture the focus required for optimal control, theoretically in Section 2 and empirically in the context of autonomous driving in Section 4. This is achieved by designing the context vector in the form of the full error in visual space with respect to the desired visual features. We empirically demonstrate, through comprehensive and targeted experiments in simulation and the real world, that this new mechanism provides robust domain transfer performance across different textures, colors, and lighting. We have shown that our attention network performs on par with or better than the current state-of-the-art method based on variational auto-encoders, at lower computational complexity and without the need to design an extensive set of experiments with domain variation. Future work should also explore other attention mechanisms, such as self-attention Vaswani et al. (2017) and deep siamese attention Wu et al. (2018), to strengthen the network's ability to focus on the encoded features.

4 Supplementary Material

4.1 Experiments and Results

In this section, we summarize our experimental results on sim2sim and sim2real. Our main conclusion is twofold: 1) DACNN provides equivalent or better performance on domain transfer tasks with texture and light variation in the environment, and 2) DACNN converges sooner in training compared to deeper neural networks, with better performance, i.e., higher total reward, resulting in lower compute cost without degradation. The faster training is achieved by 1) not randomizing the model and environment conditions and 2) allowing the network to focus on jointly optimizing for perception and dynamics. There is no pre-training stage for learning the latent space representation for perception. An in-depth discussion of our experiments and a description of the baselines adopted are included in the rest of the paper.

4.1.1 Design of Experiments

We conducted two sets of experiments with increasing complexity:

  Task I. Time trial: For this task, the robocar is required to complete a given racing track as fast as possible without going off the track. The off-track condition is defined as when all the wheels of the car are outside of the race track.

  Task II. Multi-car racing: For this task, the robocar is required to complete a given racing track as fast as possible without crashing into two or more bot cars. The crash condition is defined as when the robocar is in the close vicinity of a bot car. The bot car is controlled by the simulation with no learning capability and models the worst-case scenario by frequently changing lanes to block the learner robocar. The number of bot cars on the track is based on the length of the track.

We consider two sets of domain adaptation tasks in sim2sim and sim2real experiments. First, we use unseen color, texture, and lighting conditions in the evaluation phase and real track environment respectively but we keep the track shape the same in all experiments. Second, we modify the track shape as well as the color and texture.

We consider two baselines in both sim2sim and sim2real transfer experiments. The first baseline has vanilla CNN layers to extract features without our attention mechanism. The second baseline is based on the DARLA approach, where we use DARLA’s β-VAE neural network to learn a latent state representation and then use the same output layers from our approach to learn the control task. Our implementation of DARLA is as described in Higgins et al. (2017b), applied to the autonomous vehicle use case.

We trained two baseline models with no domain adaptation, two models with DARLA using different β values, and two models with DACNN using different numbers of units, for a total of 6 models. We used a vanilla policy optimization implementation and categorical exploration from a widely used toolkit. We created an interface to our simulation environment, which is open-source.

We performed evaluations on sim2sim and sim2real with textural variants - carpet, wood, concrete, and random lighting effects as seen in Figure 3 with five or more replications. Then, we changed the reward function from penalizing deviations from following center yellow, dotted line to penalizing crossing white lines on the inner and outer edge of the track, and repeated the experiments.

Figure 3: The camera capture from robocar’s perspective for testing varying sim2real environments.

4.1.2 Results: Task I - Time-trial

In total, we ran more than 100 sim2sim tests and more than 20 sim2real tests to reach a statistically significant conclusion. Currently, we have several hundred hours of autonomous driving image, time-series, and event logs from the point of view of the car (see the racing competition links). In the simulation, we note that the baseline model was never able to finish on the concrete track. However, the attention-based model finished successfully on all surfaces, and all of its converged iterations had significantly higher (80%+) completion rates. One interesting observation is that our deeper DACNN architecture achieved the highest cumulative reward. We anticipate that the deeper network implicitly extracts the non-linearities from vision-based controls.

We summarize our sim2real experiment results in Table 1. In the real world, we observed that DARLA and DACNN performed better than the baselines under lighting variations (a video of our findings from these experiments is attached to our submission). Baseline I uses the probabilistic action space described for the “racetrack problem” in Barto et al. (1995) to account for uncertainty in dynamics, while Baseline II uses deterministic action decisions. We consider two types of DARLA models when training the learn-to-see model: one uses data augmentation and the other does not. The data augmentation is performed by various perturbations such as shifting the camera image or adding Gaussian noise. DACNN models use one or two attention layers after vanilla input CNN layers.
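The two perturbations named above, shifting the camera image and adding Gaussian noise, can be sketched with NumPy as follows. This is an illustrative sketch only: the function names, the zero-fill of vacated pixels, the 8-bit image range, and the noise scale are assumptions, since the source does not specify how the augmentation was implemented.

```python
import numpy as np


def shift_image(image: np.ndarray, dx: int, dy: int) -> np.ndarray:
    """Shift an H x W (x C) image by dx columns and dy rows, zero-filling gaps."""
    h, w = image.shape[:2]
    shifted = np.zeros_like(image)
    src_rows = slice(max(0, -dy), max(0, min(h, h - dy)))
    dst_rows = slice(max(0, dy), max(0, min(h, h + dy)))
    src_cols = slice(max(0, -dx), max(0, min(w, w - dx)))
    dst_cols = slice(max(0, dx), max(0, min(w, w + dx)))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted


def add_gaussian_noise(image: np.ndarray, sigma: float,
                       rng: np.random.Generator) -> np.ndarray:
    """Add zero-mean Gaussian noise to an 8-bit image and clip to [0, 255]."""
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)


# Usage on a hypothetical 120 x 160 grayscale camera frame.
rng = np.random.default_rng(42)
frame = rng.integers(0, 256, size=(120, 160), dtype=np.uint8)
augmented = add_gaussian_noise(shift_image(frame, dx=5, dy=-3), sigma=8.0, rng=rng)
```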

Column groups (left to right): Simulation-to-Simulation, Simulation-to-Reality

Model | Number of iterations to converge | Mean reward at convergence | Success rate for textures | Success rate for lighting | Success rate for focused light | Success rate for multi non-focused
Baseline I (uncertainty), Barto et al. (1995) | 14 | 350 | Fails all | Fails | 0 | 0
Baseline II (no adaptation) | 24 | 800 | Fails on concrete | Success | 40% | 33%
DARLA (augmentation) | 9 | 700 | Success all | Success | N/A | N/A
DARLA (no augmentation) | 9 | 600 | Success all | Success | 33% | 66%
DACNN (shallow) | 12 | 900 | Success all | Success | 66% | 50%
DACNN (deep) | 11 | 2200 | Success all | Success | in-progress | in-progress

Table 1: Summary of results for simulation-to-simulation and simulation-to-real experiments.

4.1.3 Results: Task II - Multi-car Racing

While our results in this task are preliminary, we observed that the reduced computational complexity transferred to the new task of avoiding moving objects. We used a track of different length and width to accommodate up to three robocars without crashing into each other. For the same simulation configuration using five and three bot cars changing lanes at specific intervals, the DACNN model converged faster than the deep neural network but slower than the shallow network, as shown in Figure 4 and Figure 5, respectively. However, the total reward achieved by the DACNN was higher than that of the shallow network and only slightly higher than that of the deeper network. The evaluation results on each model during training provided additional insight into the robustness of the DACNN model compared to the deeper network.

Figure 4: Training entropy and total reward using 5 bot cars where (left) DACNN model with three CNN layers, (center) deep neural network model with five CNN layers but without attention, and (right) shallow neural network with three CNN layers.
Figure 5: Training entropy and total reward using 3 bot cars where (left) DACNN model with three CNN layers, (center) deep neural network model with five CNN layers but without attention, and (right) shallow neural network with three CNN layers.
Figure 6: The evaluation of models with five bot cars on the same track as in training where (top) DACNN model with three CNN layers plots of evaluation and (bottom) deep neural network with three CNN layers but without attention model plots of evaluation.

Next, we consider evaluation of the five-bot car model on the same track as the training and of the three-bot model on a more complicated track shape with varying textures. The evaluation results for the five-bot car model are shown in Figure 6. We provide the statistics for mean progress over 15 evaluations across each checkpoint model from the training job in the first column of Figure 6. The DACNN model starts achieving more than 70% progress around the track early on. The variation across evaluations of the same model is shown in the second column. The DACNN model has tighter variation plots. Similarly for the number of cars passed during evaluation shown in the fourth column, the DACNN model consistently performs more than 30 passes after iteration 40. The evaluation of the three-bot car model on a track with a more complicated track shape and several texture changes including concrete, carpet, and wood revealed the limits of both the DACNN and deeper network models. None of the models were able to finish the track yet the DACNN model was able to complete more full laps than the deeper model with higher number of passes.

4.1.4 Discussion

We use Gradient-weighted Class Activation Mapping (Grad-CAM) in Selvaraju et al. (2017) to visualize the impact of our baseline versus proposed approach on the image space prior to the output layers for control. Grad-CAM applies to CNNs used in reinforcement learning, without any architectural changes or re-training. Grad-CAM uses the gradients of any target concept, flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept.
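The core Grad-CAM computation described above is small enough to sketch in NumPy: the gradients of the target score with respect to each feature map are averaged spatially to give one weight per channel, the feature maps are combined with those weights, and a ReLU keeps only positively contributing regions. The sketch assumes the activations and gradients of the final convolutional layer have already been obtained from a deep learning framework; the function name and the final normalization to [0, 1] (a common visualization step) are assumptions.

```python
import numpy as np


def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Coarse localization map from final-conv-layer activations and gradients.

    activations, gradients: arrays of shape (K, H, W) for K feature maps.
    Returns an (H, W) map scaled to [0, 1] when it has a positive peak.
    """
    channel_weights = gradients.mean(axis=(1, 2))               # (K,)
    cam = np.tensordot(channel_weights, activations, axes=1)    # (H, W)
    cam = np.maximum(cam, 0.0)                                  # ReLU
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Usage with two hypothetical 2 x 2 feature maps.
activations = np.array([[[1.0, 2.0], [3.0, 4.0]],
                        [[2.0, 2.0], [2.0, 2.0]]])
gradients = np.array([np.ones((2, 2)), -np.ones((2, 2))])
heatmap = grad_cam(activations, gradients)
```

In practice the resulting map is upsampled to the input image size and overlaid as a color map, as in the warm and cool colors discussed for Figure 7.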

In Figure 7, we compare our baseline and basic DACNN model on the same image collected from the real world through the robocar’s perspective. The warmer colors (yellow-red) correspond to focus areas, while cooler colors (blue-purple) correspond to ignored areas. It is clear that the DACNN learns to focus on the track and is not distracted by objects and surfaces outside the track. Note that both models are saved at the same iteration so that we can observe whether attention outperforms the baseline at the same training step.

Figure 7: (Left) Baseline model on the real-world data from the robocar’s perspective is distracted by objects and surfaces outside of the racing track. (Right) DACNN focuses on the current and next actions without distractions to stay near the center of the track.

Figure 8 compares the rewards accumulated by DARLA (top left) and our attention model (bottom left) during training with a mini-batch size of 64. The reward function incentivizes staying close to the yellow dotted line and higher speeds, while penalizing leaving the track and steering too frequently. For our basic DACNN models, we observe that the algorithm starts to converge around the same time as DARLA after accounting for DARLA’s pre-training period. Therefore, the training performance is maintained. The impact of changing the batch size from 64 to 32 for a 64-unit (top right) attention model versus a granular 256-unit (bottom right) attention model is shown on the right in Figure 8. The increased batch size results in faster learning.
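A reward of the shape described above could look like the following Python sketch. It is purely hypothetical: the source states only which behaviors are rewarded and penalized, so the function name, every argument, the weights, and the use of a large steering change as a stand-in for "steering too frequently" are assumptions made for illustration.

```python
def sketch_reward(distance_from_center: float, track_half_width: float,
                  speed: float, max_speed: float, off_track: bool,
                  steering_change: float,
                  steering_threshold: float = 0.3,
                  steering_penalty: float = 0.8) -> float:
    """Hypothetical reward: stay near the center line, go fast, steer smoothly."""
    if off_track:
        # Leaving the track (all wheels outside) earns almost nothing.
        return 1e-3
    # 1.0 on the center line, falling linearly to 0.0 at the track edge.
    centering = max(0.0, 1.0 - distance_from_center / track_half_width)
    # Fraction of the maximum speed, capped at 1.0.
    speed_term = min(speed / max_speed, 1.0)
    reward = centering * (0.5 + 0.5 * speed_term)
    # Damp the reward when the steering command changes sharply between steps.
    if abs(steering_change) > steering_threshold:
        reward *= steering_penalty
    return reward


# Usage: centered and fast versus off-center and slow.
good = sketch_reward(0.0, 0.3, speed=4.0, max_speed=4.0,
                     off_track=False, steering_change=0.0)
poor = sketch_reward(0.2, 0.3, speed=2.0, max_speed=4.0,
                     off_track=False, steering_change=0.0)
```

Switching the centering term to a penalty for crossing the white edge lines would give the alternative reward variant mentioned in the design of experiments.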

Figure 8: (Top Left) The rewards accumulated during the training of the DARLA agent and (Bottom Left) the rewards accumulated during the training of our attention model with the clipped proximal policy optimization Schulman et al. (2017) and categorical exploration. The impact of hyper-parameter variation, i.e., batch size and attention units, on learning is shown on the right.

4.2 Related Work

Vision-based servo control. In this paper, we consider image-based optimal control, which is an extension of image-based visual servo (IBVS) control. Classic IBVS was developed for serial-link robotic manipulators Weiss et al. (1987); Espiau et al. (1992) and aims to control the dynamics of features in the image plane directly Hutchinson et al. (1996); Chaumette and Hutchinson (2007). More recent work on visual servoing focused on unmanned aerial vehicles (see Lu et al. (2018) and references therein). Our work differs from the visual servoing most commonly used in unmanned aerial vehicles because we do not have a separate motion planning module. Our work is most similar to recent work on robotic manipulators Bousmalis et al. (2017); Kalashnikov et al. (2018), in which the authors use raw (pixel) images as state for deep reinforcement learning. IBVS methods offer advantages in robustness to camera and target calibration errors and in reduced computational complexity. One caveat of classical IBVS is that it is necessary to determine the depth of each visual feature used in the image error criterion independently from the control algorithm. One approach to overcoming this issue is to use adaptive control, hence the motivation to use reinforcement learning as a direct adaptive control method.

Domain Adaptation for Robot Learning. In domain adaptation literature for robot learning, our approach is comparable to Higgins et al. (2017a) where the authors propose a new multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent), which learns to see before learning to act. In our approach, we do not separate perception from dynamics but our intuition to create a latent attention space for dynamics is a common theme. Our approach differs from recent work on robotic manipulators because the state is based on image only and not augmented with other control state information such as position. Moreover, the use of attention network to create a latent space is new. In Tzeng et al. (2015), the authors apply domain adaptation at the feature level. In Tobin et al. (2017) and Mordatch et al. (2015), the authors use domain and dynamics randomization, respectively.

Attention Mechanisms in Reinforcement Learning. Attention models were applied with remarkable success to complex visual tasks such as image captioning Xu et al. (2015) and machine translation Bahdanau et al. (2014). However, attention models have mostly been applied to recurrent neural networks and not to optimal visual-servoing tasks. In Mnih et al. (2014b) and Liang et al. (2018), the authors use a recurrent neural network (RNN) which processes inputs sequentially and incrementally combines information to build up a dynamic internal representation of the scene or environment. Instead, our approach passes images sampled from a scene through multiple convolutional neural network (CNN) layers prior to the attention network; we hypothesize that CNN layers can capture the local dependencies needed to create an approximate model for the optimal control task. This design has the additional advantage of the invariance to lighting and texture inherent in CNNs LeCun and Bengio (1998). In other previous attempts to integrate attention with RL, the authors have largely used hand-crafted features as inputs to the attention model Manchin et al. (2019). Hand-crafted features require a large number of hyper-parameters and are not invariant to lighting and texture, limitations that our CNN-based attention network overcomes. While it is possible to segment focus areas separately as described in Choi et al. (2017), the delay caused by the model inference is too large to construct a stable controller.
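The ordering described above, convolutional features first and attention second, can be illustrated with a minimal NumPy sketch. This is not the DACNN architecture itself: it uses a single convolutional layer rather than several, the weights are random rather than learned through policy optimization, and all names, shapes, and the dot-product scoring rule are assumptions chosen for illustration.

```python
import numpy as np

def conv2d_relu(image, kernels):
    """Valid 2-D cross-correlation of an HxW image with K kxk kernels, then ReLU."""
    K, k, _ = kernels.shape
    H, W = image.shape
    out = np.zeros((H - k + 1, W - k + 1, K))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            patch = image[i:i + k, j:j + k]
            # Contract each kernel with the patch to get K responses.
            out[i, j] = np.tensordot(kernels, patch, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)

def spatial_soft_attention(features, w):
    """Softmax weights over spatial locations and the weighted feature vector."""
    h, wd, c = features.shape
    flat = features.reshape(h * wd, c)
    scores = flat @ w                      # one scalar score per location
    scores = scores - scores.max()         # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ flat               # attention-weighted feature summary
    return weights.reshape(h, wd), context

# Hypothetical inputs: a random 16x16 grayscale image and random parameters.
rng = np.random.default_rng(0)
image = rng.random((16, 16))
kernels = rng.standard_normal((4, 3, 3))
w = rng.standard_normal(4)

features = conv2d_relu(image, kernels)
weights, context = spatial_soft_attention(features, w)
```

Here `weights` is a map over the 14x14 feature locations that sums to one, and `context` is the feature summary a policy head could consume; in a trained agent, the scoring parameters would be updated by the policy gradient rather than sampled at random.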


  • M. Arcak (2007) Passivity as a design tool for group coordination. IEEE Transactions on Automatic Control 52 (8), pp. 1380–1390. External Links: Document, ISSN Cited by: §1.
  • J. Ba, V. Mnih, and K. Kavukcuoglu (2015) Multiple object recognition with visual attention. CoRR abs/1412.7755. Cited by: §2.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1, §2, §4.2.
  • A. G. Barto, S. J. Bradtke, and S. P. Singh (1995) Learning to act using real-time dynamic programming. Artificial Intelligence 72 (1), pp. 81–138. External Links: ISSN 0004-3702, Document, Link Cited by: §4.1.2, Table 1.
  • K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, S. Levine, and V. Vanhoucke (2017) Using simulation and domain adaptation to improve efficiency of deep robotic grasping. CoRR abs/1709.07857. External Links: Link, 1709.07857 Cited by: §1, §4.2.
  • F. Chaumette and S. Hutchinson (2007) Visual servo control. ii. advanced approaches [tutorial]. IEEE Robotics Automation Magazine 14 (1), pp. 109–118. External Links: Document, ISSN 1070-9932 Cited by: §4.2.
  • J. Choi, B. Lee, and B. Zhang (2017) Multi-focus attention network for efficient deep reinforcement learning. CoRR abs/1712.04603. External Links: Link, 1712.04603 Cited by: §4.2.
  • B. Espiau, F. Chaumette, and P. Rives (1992) A new approach to visual servoing in robotics. IEEE Transactions on Robotics and Automation 8 (3), pp. 313–326. External Links: Document, ISSN 1042-296X Cited by: §4.2.
  • F. Golemo, A. A. Taiga, A. Courville, and P. Oudeyer (2018) Sim-to-real transfer with neural-augmented robot simulation. In Conference on Robot Learning, pp. 817–828. Cited by: §1.
  • M. Gualtieri and R. Platt (2018) Learning 6-dof grasping and pick-place using attention focus. In Conference on Robot Learning, pp. 477–486. Cited by: §1.
  • T. Hamel and R. Mahony (2002) Visual servoing of an under-actuated dynamic rigid-body system: an image-based approach. IEEE Transactions on Robotics and Automation 18 (2), pp. 187–198. External Links: Document, ISSN 1042-296X Cited by: §2.
  • I. Higgins, A. Pal, A. A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner (2017a) DARLA: improving zero-shot transfer in reinforcement learning. See DBLP:conf/icml/2017, pp. 1480–1490. External Links: Link Cited by: §1, §4.2.
  • I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner (2017b) DARLA: improving zero-shot transfer in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1480–1490. Cited by: §4.1.1.
  • S. Hutchinson, G. D. Hager, and P. I. Corke (1996) A tutorial on visual servo control. IEEE Transactions on Robotics and Automation 12 (5), pp. 651–670. External Links: Document, ISSN 1042-296X Cited by: §4.2.
  • S. James, A. J. Davison, and E. Johns (2017) Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task. arXiv preprint arXiv:1707.02267. Cited by: §1.
  • D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine (2018) QT-opt: scalable deep reinforcement learning for vision-based robotic manipulation. CoRR abs/1806.10293. External Links: Link, 1806.10293 Cited by: §1, §4.2.
  • J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli (2015) Kinematic and dynamic vehicle models for autonomous driving control design. In 2015 IEEE Intelligent Vehicles Symposium (IV), pp. 1094–1099. External Links: Document, ISSN 1931-0587 Cited by: §2, §2.
  • Y. LeCun and Y. Bengio (1998) The handbook of brain theory and neural networks. M. A. Arbib (Ed.), pp. 255–258. External Links: ISBN 0-262-51102-9, Link Cited by: §4.2.
  • X. Liang, Q. Wang, Y. Feng, Z. Liu, and J. Huang (2018) VMAV-C: A deep attention-based reinforcement learning algorithm for model-based control. CoRR abs/1812.09968. External Links: Link, 1812.09968 Cited by: §4.2.
  • Y. Lu, Z. Xue, G. Xia, and L. Zhang (2018) A survey on vision-based uav navigation. Geo-spatial Information Science 21 (1), pp. 21–32. External Links: Document Cited by: §4.2.
  • A. Manchin, E. Abbasnejad, and A. van den Hengel (2019) Reinforcement learning with attention that works: A self-supervised approach. CoRR abs/1904.03367. External Links: Link, 1904.03367 Cited by: §4.2.
  • V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu (2014a) Recurrent models of visual attention. CoRR abs/1406.6247. External Links: Link, 1406.6247 Cited by: §2.
  • V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu (2014b) Recurrent models of visual attention. CoRR abs/1406.6247. External Links: Link, 1406.6247 Cited by: §4.2.
  • I. Mordatch, K. Lowrey, and E. Todorov (2015) Ensemble-cio: full-body dynamic motion planning that transfers to physical humanoids. See conf/iros/2015, pp. 5307–5314. External Links: ISBN 978-1-4799-9994-1 Cited by: §1, §4.2.
  • A. A. Rusu, M. Večerík, T. Rothörl, N. Heess, R. Pascanu, and R. Hadsell (2017) Sim-to-real robot learning from pixels with progressive nets. In Conference on Robot Learning, pp. 262–270. Cited by: §1.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §2, Figure 8.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626. Cited by: §4.1.4.
  • J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke (2018) Sim-to-real: learning agile locomotion for quadruped robots. CoRR abs/1804.10332. External Links: Link, 1804.10332 Cited by: §1.
  • J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. CoRR abs/1703.06907. External Links: Link, 1703.06907 Cited by: §1, §4.2.
  • E. Tzeng, C. Devin, J. Hoffman, C. Finn, X. Peng, S. Levine, K. Saenko, and T. Darrell (2015) Towards adapting deep visuomotor representations from simulated to real environments. CoRR abs/1511.07111. External Links: Link, 1511.07111 Cited by: §1, §4.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.
  • L. Weiss, A. Sanderson, and C. Neuman (1987) Dynamic sensor-based control of robots with visual feedback. IEEE Journal on Robotics and Automation 3 (5), pp. 404–417. External Links: Document, ISSN 0882-4967 Cited by: §4.2.
  • L. Wu, Y. Wang, J. Gao, and X. Li (2018) Where-and-when to look: deep siamese attention networks for video-based person re-identification. IEEE Transactions on Multimedia. Cited by: §3.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044. Cited by: §2, §2, §2, §4.2.