1 Introduction
Reinforcement learning (RL) (suttonbarto) considers the problem of an agent learning solely from interactions to maximize reward. Since the introduction of deep neural networks, the field of deep RL has produced some tremendous achievements, from games (alphago), to robotics (rubics_cube) and even real-world problems (loon).
As RL continues to be tested in more challenging settings, there has been increased interest in learning from vision-based observations (planet; slac; dreamer; rad; curl; drq). This presents several challenges: image-based observations are not only significantly larger, but are also more likely to contain confounding variables, which can lead to overfitting (Song2020Observational).
A promising approach for tackling these challenges is through the use of bottlenecks, which force agents to learn from a low-dimensional feature representation. This has been shown to be useful for improving both scalability (planet; dreamer) and generalization (ibac_sni). In this paper, we focus on self-attention bottlenecks, using an attention mechanism to select the most important regions of the state space. Recent work showed that a specific form of hard attention combines effectively with neuroevolution to create agents with significantly fewer parameters and strong generalization capabilities (yujintang), while also producing interpretable policies.
However, the current form of selective attention proposed is severely limited. It makes use of the most prominent softmax attention, popularized by (vaswani), which suffers from quadratic complexity in the size of the attention matrix (i.e. the number of patches). This means that models become significantly slower as vision-based observations become higher resolution, and the effectiveness of the bottleneck is reduced by relying on larger patches.
In this paper, we demonstrate how new, scalable attention mechanisms (performers) designed for Transformers can be effectively adapted to the vision-based RL setting. We call the resulting algorithm Implicit Attention for Pixels (IAP). Notably, using IAP we are able to train agents with self-attention for images with 8x more pixels than (yujintang). We are also able to dramatically reduce the patch size, even down to a single pixel. In both cases, inference time is only marginally higher due to the linear scaling of IAP. We show a simple example of the effectiveness of our approach in Figure 1. Here we train an agent for 100 iterations on the : task from the DM Control Suite (dm_control). The agents are both trained the same way, with the only difference being the use of brute force attention (blue) or IAP efficient attention (orange). Both agents achieve a similar reward, with dramatically different inference time.
In addition, we show that attention row-normalization, which is typically crucial in supervised settings, is not required for training RL policies. Thus, we are able to introduce a new efficient mechanism, approximating softmax-kernel attention (known to be in general superior to other attention kernels) with what we call hybrid random features, leveraging the theory of angular kernels. We show that our new method is more robust than existing algorithms for approximating softmax-kernel attention when attention normalization is not needed. Our mechanism is effective for RL tasks with as few as 15 random samples, which is in striking contrast to the supervised setting, where usually 200-300 samples are required. That 13x+ reduction has a profound effect on the speed of the method.
To summarize, our key contributions are as follows:

Practical: To the best of our knowledge, we are the first to use efficient attention mechanisms for RL from pixels. This has two clear benefits: 1) we can scale to larger images than previous works; 2) we can use more fine-grained patches which produce more effective self-attention bottlenecks. Both goals can be achieved with an embarrassingly small number of trainable parameters, providing 10x compression over standard CNN-based policies with no loss of quality of the learned controller. In our experiments (Section 5) we demonstrate the strength of this approach by training quadruped robots for obstacle avoidance.

Theoretical: We introduce hybrid random features, which provably and unbiasedly approximate softmax-kernel attention and better control the variance of the estimation than previous algorithms. We believe this is a significant contribution towards efficient attention for RL and, beyond that, to the theory of Monte Carlo methods for kernels in machine learning.
2 Related Work
Several approaches to vision in reinforcement learning have been proposed over the years, tackling three key challenges: high-dimensional input space, partial observability of the actual state from images, and observational overfitting to spurious features (Song2020Observational). Dimensionality reduction can be obtained with hand-crafted features or with learned representations, typically via ResNet/CNN-based modules (resnets). Other approaches equip an agent with segmentation techniques and depth maps (segmentation). Those methods require training a substantial number of parameters just to process vision, which is usually only one part of the agent's richer heterogeneous input that may also include lidar data, tactile sensors and more, as in robotics applications. Partial observability was addressed by a line of work focusing on designing new compact and expressive neural network architectures for vision-based controllers such as (kulhanek).
Common ways to reduce observational overfitting are data augmentation (drq; rad; curl), causal approaches (zhang2021invariant) and bottlenecks (ibac_sni). Information bottlenecks have been particularly popular in vision-based reinforcement learning (planet; dreamer; slac), backed by theoretical results for improved generalization (SHAMIR20102696; 7133169).
In this work, we focus on self-attention bottlenecks. These provide a drastic reduction in the number of model parameters compared to standard CNN-based approaches and, furthermore, aid interpretability, which is of particular importance in reinforcement learning. The idea of selecting individual “glimpses” with attention was first proposed by rnn_visual_attn, who use REINFORCE (reinforce) to learn which patches to use, achieving strong generalization results. Others have presented approaches to differentiate through hard attention (bengio2013estimating). This work is inspired by yujintang, who proposed to use neuroevolution methods to optimize a hard attention module, circumventing the requirement to backpropagate through it.
Our paper also contributes to the recent line of work on fast attention mechanisms. Since Transformers were shown to produce state-of-the-art results for language modelling tasks (vaswani), there has been a series of efforts to reduce the time and space complexity with respect to sequence length (Kitaev2020Reformer; peng2021random; wang2020linformer). This work extends techniques from Performer architectures (performers), which were recently shown to be some of the best performing efficient mechanisms (tay2021long). Finally, it also naturally contributes to the theory of Monte Carlo algorithms for scalable kernel methods (rfs; hanlin; unifomc; geometryrfs; unreas; orthogonalrfs), proposing new random feature map mechanisms for softmax-kernels and, consequently, the inherently related Gaussian kernels.
Solving robotics tasks from vision input is an important and well-researched topic (kalashnikov2018qt; yahya2017collective; levine2016end; Pan2019ZeroshotIL). Our robotic experiments focus on learning legged locomotion and necessary navigation skills from vision. In prior work, CNNs have been used to process vision input (Pan2019ZeroshotIL; Li2019HRL4INHR; blanc2005indoor). In this work, we use self-attention for processing image observations and compare our results with CNNs for realistic robotics tasks.
3 Compact Vision with Attention for RL
3.1 RL with a Self-Attention Bottleneck
In this paper, we focus on training policies for RL agents, where is the set of states and is a set of actions. The goal is to maximize the expected reward obtained by an agent in the given environment, where the expectation is over trajectories , for a horizon , and a reward function . We consider deterministic policies. A state is either a compact representation of the visual input (RGB(D) image) or its concatenation with other sensors available to an agent (more details in Section 5).
The agents are trained with attention mechanisms, which take vision input state (or observation in a partially observable setting) and produce a compact representation for subsequent layers of the policy. The mechanism is agnostic to the choice of the training algorithm.
3.2 Patch Selection via Attention
Consider an image represented as a collection of (potentially intersecting) RGB(D)patches indexed by for some . Denote by
a matrix with vectorized patches as rows (i.e. vectors of RGB(D)values of all pixels in the patch). Let
be a matrix of (potentially learned) value vectors corresponding to patches as in the regular attention mechanism (transformer). For , we define the following patch-to-patch attention module which is a transformation :
(1) 
where is a matrix truncated to its first rows and:

is a kernel admitting the form: for some (randomized) finite kernel feature map ,

is the attention matrix defined as: where are the rows of matrices , (queries & keys), and for some ,

is a (potentially learnable) vector defining how the signal from the attention matrix should be agglomerated to determine the most critical patches,

is a (potentially learnable) function to the space of permutation matrices in .
The above mechanism effectively chooses patches from the entire coverage and takes its corresponding embeddings from as a final representation of the image. The attention block defined in Equation 1 is parameterized by two matrices: , and potentially also by: a vector and a function . The output of the attention module is vectorized and concatenated with other sensor data. The resulting vector is then passed to the controller as its input state. Particular instantiations of the above mechanism lead to techniques studied before. For instance, if is a softmax-kernel, , outputs a permutation matrix that sorts the entries of the input to from largest to smallest, and rows of are centers of the corresponding patches, one retrieves the method proposed in (yujintang), yet with no attention row-normalization.
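The patch-selection step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the projection matrices `W_q` and `W_k`, the all-ones agglomeration vector `a`, the `1/sqrt(d)` scaling, and the choice to agglomerate with the transposed attention matrix are all assumptions made for illustration, and the sorting permutation is realized with a plain descending `argsort`.

```python
import numpy as np

def select_top_patches(X, W_q, W_k, V, a, k):
    """Return the k rows of V whose patches score highest under explicit attention."""
    Q = X @ W_q                        # (l, d) queries, one row per patch
    K = X @ W_k                        # (l, d) keys, one row per patch
    d = Q.shape[1]
    A = np.exp(Q @ K.T / np.sqrt(d))   # (l, l) softmax-kernel attention, no row-normalization
    scores = A.T @ a                   # (l,) agglomerated importance of each patch (assumed form)
    order = np.argsort(-scores)        # descending order, stands in for the sorting permutation
    top = order[:k]                    # keep only the first k entries
    return V[top], top

rng = np.random.default_rng(0)
l, p, d, k = 16, 12, 4, 3              # hypothetical sizes: patches, pixels per patch, key dim, kept patches
X = rng.normal(size=(l, p))            # vectorized patches as rows
W_q = 0.1 * rng.normal(size=(p, d))
W_k = 0.1 * rng.normal(size=(p, d))
V = rng.normal(size=(l, 2))            # value vectors, e.g. patch centers
a = np.ones(l)
selected, idx = select_top_patches(X, W_q, W_k, V, a, k)
```

Note that this version builds the full `l x l` matrix `A`, which is exactly the cost that the implicit variant in Section 4 is designed to avoid.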
4 Implicit Attention for Pixels (IAP)
Computing attention blocks, as defined in Equation 1, is in practice very costly when is large, since it requires explicit construction of the matrix . This means it is not possible to use small-size patches, even for a moderate-size input image, while high-resolution images are prohibitive. Standard attention modules are characterized by space and time complexity, where is the number of patches. We instead propose to leverage indirectly, by applying techniques introduced in (performers) for the class of Transformers called Performers. We approximate via (random) finite feature maps given by the mapping for a parameter , as:
(2) 
where are matrices with rows: and respectively. By replacing with in Equation 1, we obtain attention transformation given as:
(3) 
where brackets indicate the order of computations. By disentangling from , we effectively avoid explicitly calculating attention matrices and compute the input to in linear time and space rather than quadratic in . The IAP method is schematically presented in Fig. 2.
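The effect of the bracketing can be checked numerically. The sketch below is illustrative only: it uses deterministic ReLU features as the feature map, random matrices in place of learned queries and keys, and it assumes the agglomeration takes the form of the transposed attention matrix applied to a vector `a`. The same reordering argument applies if the non-transposed matrix is used instead.

```python
import numpy as np

def relu_features(U):
    """Deterministic ReLU feature map applied row-wise."""
    return np.maximum(U, 0.0)

def scores_explicit(Qp, Kp, a):
    """Quadratic path: materialize the (l, l) approximate attention matrix first."""
    A_hat = Qp @ Kp.T
    return A_hat.T @ a

def scores_implicit(Qp, Kp, a):
    """Linear path: contract with the vector first, so no (l, l) matrix is ever formed."""
    return Kp @ (Qp.T @ a)

rng = np.random.default_rng(0)
l, d = 500, 8                          # hypothetical number of patches and feature dimension
Qp = relu_features(rng.normal(size=(l, d)))
Kp = relu_features(rng.normal(size=(l, d)))
a = np.ones(l)
explicit = scores_explicit(Qp, Kp, a)
implicit = scores_implicit(Qp, Kp, a)
```

Both functions return the same vector of patch scores up to floating-point error; only the implicit version keeps memory and time linear in the number of patches.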
The kernel defining the attention type, and consequently the corresponding finite feature map (randomized or deterministic), can be chosen in different ways (see (performers)), yet a variant of the form: , for
(4) 
or:
(5) 
(same-length input version) and a softmax-kernel , in practice often outperforms others. Thus it suffices to estimate . Its efficient random feature map , from the FAVOR+ mechanism (performers), is of the form:
(6) 
for and the block-orthogonal ensemble of Gaussian vectors with marginal distributions . This mapping provides an unbiased estimator of and consequently: an unbiased estimator of the attention matrix for the softmax-kernel .
4.1 Hybrid Random Features for Softmax-Kernel
The most straightforward approach to approximating the softmax-kernel is to use trigonometric features and consequently the estimator for defined as: for iid .
As explained in (performers), for inputs of similar length, estimator is characterized by lower variance when the approximated softmax-kernel values are larger (this can be best illustrated when and an angle between and satisfies when variance is zero) and larger variance when they are smaller. This makes the mechanism unsuitable for approximating attention if the attention matrix needs to be row-normalized (which is the case in the standard supervised setting for Transformers), since the renormalizers might be very poorly approximated if they are given as sums containing many small attention values. On the other hand, the estimator has variance going to zero as approximated values go to zero since the corresponding mapping has non-negative entries.
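The two estimators contrasted above can be written down in a few lines. This is a hedged sketch: it uses the standard positive and trigonometric random-feature identities for exp(x·y), which may differ in detail from Equation 6, and it draws iid Gaussian samples rather than the block-orthogonal ensemble mentioned in the text. The function names are hypothetical.

```python
import numpy as np

def softmax_positive(x, y, omegas):
    """Positive random-feature estimate of exp(x.y); every averaged term is non-negative."""
    scale = np.exp(-0.5 * (x @ x + y @ y))
    return scale * np.mean(np.exp(omegas @ (x + y)))

def softmax_trig(x, y, omegas):
    """Trigonometric random-feature estimate of exp(x.y); averaged terms may be negative."""
    scale = np.exp(0.5 * (x @ x + y @ y))
    return scale * np.mean(np.cos(omegas @ (x - y)))

rng = np.random.default_rng(0)
x = np.array([0.5, 0.5, 0.0, 0.0])
y = np.array([0.5, 0.0, 0.5, 0.0])
omegas = rng.normal(size=(100_000, 4))  # iid N(0, I) samples, one per row (an assumption)
exact = np.exp(x @ y)
est_pos = softmax_positive(x, y, omegas)
est_trig = softmax_trig(x, y, omegas)
```

With identical inputs the trigonometric estimate is deterministic (every cosine equals one), and with opposite inputs the positive estimate is deterministic (every exponential equals one). Repeating either estimate with fresh, smaller sample sets for nearly collinear and nearly anti-collinear inputs is one way to observe the variance behaviour described above.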
Since our proposed algorithm does not conduct row-normalization of the attention matrix (we show in Section 5 that we do not need it for RL applications), the question arises whether we can take the best of both worlds. We propose an unbiased hybrid estimator of the softmax-kernel attention, given as:
(7) 
where is an unbiased estimator of , constructed independently from , and furthermore the two latter estimators rely on the same sets of Gaussian samples . In addition, we constrain to satisfy if or .
Estimator becomes for and for , which means that its variance approaches zero for both: and (for inputs of the same norm). The key observation is that such an estimator expressed as , for a finite-dimensional mapping indeed can be constructed. The mapping is given as:
(8) 
where:
(9) 
and: stands for the horizontal concatenation operation, is the sign mapping and and are two independent ensembles of random Gaussian samples. The following is true:
Theorem 4.1 (MSE of the hybrid estimator).
Let . Then satisfies formula from Eq. 7 (thus in particular, it is unbiased) and furthermore, the mean squared error () of satisfies:
(10) 
where , for , .
Estimator is more accurate than both and since the hybrid feature map mechanism better controls its variance, in particular making the vanish for both corner cases: and (for same-length inputs), see: Fig. 3, 4. Furthermore, which is critical from the practical point of view, since it can be efficiently expressed as a dot-product of finite-dimensional randomized vectors, it admits the decomposition from Sec. 3. Consequently, it can be directly used to provide estimation of the attention mechanism from Sec. 4 in space and time complexity which is linear in the number of patches .
Sketch of the proof:
The full proof is given in the Appendix (Sec. A.3). It relies in particular on: (1) the fact that the angular kernel (quantifying relative importance of the two estimators combined in the hybrid method) can be rewritten as for (see: Fig. 5 for the explanation why this is true), (2) composite random feature mechanism for the product of two kernels, each equipped with its own random feature map. Vanishing variance of for is implied by the fact that estimator based on features is deterministic for these two corner cases and thus it is exact.
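The construction sketched in the proof can be illustrated numerically. The code below is an interpretation under stated assumptions, not the paper's exact feature map: it takes the hybrid estimator to be a convex combination of the positive and trigonometric estimators, weighted by an estimate of the normalized angle between the inputs, and it estimates that weight from products of random signs using an independent Gaussian ensemble `taus`. The function name and the iid sampling are hypothetical choices.

```python
import numpy as np

def hybrid_softmax_estimate(x, y, omegas, taus):
    """Angle-weighted combination of positive and trigonometric estimates of exp(x.y)."""
    half_sq = 0.5 * (x @ x + y @ y)
    # Both softmax estimates share the same Gaussian samples omegas.
    sm_pos = np.exp(-half_sq) * np.mean(np.exp(omegas @ (x + y)))
    sm_trig = np.exp(half_sq) * np.mean(np.cos(omegas @ (x - y)))
    # E[sgn(t.x) sgn(t.y)] = 1 - 2*theta/pi, so this estimates theta/pi without bias.
    lam = 0.5 * (1.0 - np.mean(np.sign(taus @ x) * np.sign(taus @ y)))
    return lam * sm_pos + (1.0 - lam) * sm_trig

rng = np.random.default_rng(0)
dim = 4
omegas = rng.normal(size=(50_000, dim))  # samples for the softmax-kernel estimators
taus = rng.normal(size=(50_000, dim))    # independent samples for the angular kernel
x = np.array([0.5, 0.5, 0.0, 0.0])
y = np.array([0.5, 0.0, 0.5, 0.0])
estimate = hybrid_softmax_estimate(x, y, omegas, taus)
at_zero_angle = hybrid_softmax_estimate(x, x, omegas, taus)
at_pi_angle = hybrid_softmax_estimate(x, -x, omegas, taus)
```

In this sketch the estimate is exact at both corner cases: for identical inputs all sign products equal one, so only the (deterministic) trigonometric term survives, and for opposite inputs all sign products equal minus one, so only the (deterministic) positive term survives.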
5 Experiments
In this section, we seek to test our hypothesis that efficient attention mechanisms can achieve strong accuracy in RL, matching their performance in the context of Transformers (performers). We also aim to show that we can scale to significantly larger visual inputs, and use smaller patches, which would be prohibitively expensive with standard attention algorithms. Finally, we hypothesize that fewer, smaller patches will be particularly effective in preventing observational overfitting in the presence of distractions.
To test our hypotheses, we conduct a series of experiments, beginning with a challenging large-scale vision task with distractions, where attending to the correct regions of the observation is critical. We finish with difficult simulated robotics environments, where an agent must navigate several obstacles. We use two kernel-attention mechanisms for IAP: one with ReLU features from (performers) and the hybrid method introduced here. The former applies deterministic kernel features and the latter randomized ones. Controllers are trained with ES methods (ES).
5.1 How Many Random Features Do We Need?
We first discuss the sensitivity of our method to the number of random features. There is a trade-off between speed and accuracy: as we reduce the number of random features, inference time decreases, but accuracy may decline. To test it, we use the default CheetahRun environment from the DM Control Suite (dm_control), with observations resized to (100 x 100), similar to the (96 x 96) sizes used for and in (yujintang). We use patches of size and select the top patches. Results are in Fig. 6. Different variants of the number of random features are encoded as pairs .
As we see, ReLU is the fastest IAP approach, while there is an increase in inference time as we increase the number of random features. However, all IAP approaches are significantly faster than brute force (brown). In terms of performance, we see the best performance for ( , ), which we hypothesize is due to it trading off accuracy and exploration in an effective manner for this task. Given that ( , ) also appears to gain most of the speed benefits, we use this setting for our other experiments involving hybrid softmax.
5.2 Distracting Control Suite
We then apply our method to a modified version of the DM Control Suite termed the Distracting Control Suite (distracting), where the backgrounds of the normal DM Control Suite’s observations are replaced with random images and viewed through random camera angles, as shown in Fig. 12 in the Appendix.
By default in this benchmark, the native images are of size (240 x 320), substantially larger than the (96 x 96) used in (yujintang). Given that we may also use smaller patch sizes (e.g. size 2 vs the default 7 in (yujintang)), this new benchmark leads to a significantly longer maximum sequence length (19200 vs 529) for the attention component. In addition, given the particularly small stick-like appearances of most of the agents, a higher percentage of image patches will contain irrelevant background observations that can cause observational overfitting (Song2020Observational), making this task more difficult for vision-based policies.
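The two sequence lengths quoted above follow from counting patch positions on a regular grid. The helper below is a small illustrative calculation; the strides are assumptions not stated in the text (non-overlapping patches, i.e. stride 2, for the 240 x 320 images, and stride 4 for the 96 x 96 images with patch size 7), chosen because they reproduce the reported numbers.

```python
def num_patches(height, width, patch, stride):
    """Number of patch positions when sliding a square patch over an image without padding."""
    rows = (height - patch) // stride + 1
    cols = (width - patch) // stride + 1
    return rows * cols

# Distracting Control Suite: 240 x 320 image, patch size 2, assumed stride 2.
long_sequence = num_patches(240, 320, patch=2, stride=2)

# Setting of (yujintang): 96 x 96 image, patch size 7, assumed stride 4.
short_sequence = num_patches(96, 96, patch=7, stride=4)
```

Under these assumptions the helper gives 120 x 160 = 19200 patches in the first case and 23 x 23 = 529 in the second, so a quadratic attention matrix would hold roughly 1300 times more entries on the larger benchmark.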
Table 1
Environment        IAP   SAC   QT-Opt
CheetahRun         134    77     74
WalkerWalk         125    24    111
CartPoleSwingup    196   167    212
BallInCup Catch    135   109     62
ReacherEasy        128    75    109
Our experimental results on the Distracting Control Suite show that more fine-grained patches (lower patch size) with fewer selected patches (lower ) improve performance (Fig. 7). Interestingly, this is contrary to the results found in (yujintang), which showed that for with YouTube/Noisy backgrounds, decreasing reduces performance as the agent attends to noisier patches. We hypothesize that several factors could be responsible (higher parameter count from ES, different benchmarks, bottleneck effects, etc.), but we leave this investigation to future work.
We thus use patch sizes of 2 with patches and compare the performances between regular “brute force” softmax, IAP with ReLU features, and IAP with hybrid softmax, in terms of wall-clock time. For the hybrid setting, as discussed in Subsection 5.1, we use feature combination, which is significantly lower than the features used in the supervised Transformer setting (performers), yet achieves competitive results in the RL setting. Furthermore, we compare our algorithm with standard ConvNets trained with SAC (sacv2) and QT-Opt (qt_opt) in Table 1 and find that we are consistently competitive with or outperform those methods.
5.3 Visual Locomotion and Navigation Tasks
We use a simulated quadruped robot for our experiments. This robot has degrees of freedom ( per leg). Our locomotion task is set up in an obstacle course environment. In this environment, the robot starts from the origin on a raised platform and a series of walls lies ahead of it. The robot can observe the environment through a first-person RGB-camera view, looking straight ahead. To complete the course, it needs to learn to steer in order to avoid colliding with the walls and falling off the edge. The reward function is specified as the capped () velocity of the robot along the x direction (see: Section A.2).
Policy details and Training setup: We train our IAP policies to solve this robotics task and compare performance against traditional CNN policies. Given the complexity of the task, we use a hierarchical structure for our policies introduced in (Jain2019HierarchicalRL). In this setup, the policy is split into two hierarchical levels: a high level and a low level. The high level processes the camera observations from the environment and outputs a latent command vector which is fed into the low level. The high level also outputs a scalar duration for which its execution is stopped, while the low level runs at every control timestep. The low level is a linear neural network which controls the robot leg movements.
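The interaction between the two levels can be sketched as a simple control loop. Everything concrete below is hypothetical and only meant to illustrate the timing structure described above: the function names, the stand-in high-level and low-level callables, the latent and sensor dimensions, and the fixed duration of 5 timesteps.

```python
import numpy as np

def run_hierarchical(high_level, low_level, camera, proprio, num_steps):
    """Run the low level every timestep and the high level only when its duration expires."""
    remaining = 0          # timesteps left before the high level is queried again
    latent = None
    actions, high_level_calls = [], 0
    for t in range(num_steps):
        if remaining == 0:
            latent, duration = high_level(camera(t))   # latent command and pause duration
            remaining = max(1, int(duration))
            high_level_calls += 1
        actions.append(low_level(latent, proprio(t)))  # low level acts at every timestep
        remaining -= 1
    return actions, high_level_calls

# Hypothetical stand-ins for the real components.
rng = np.random.default_rng(0)
W_low = rng.normal(size=(12, 3 + 6))                   # linear low level: 12 motor outputs

def high_level(image):
    return np.tanh(image.mean(axis=(0, 1))), 5         # 3-dim latent command, duration of 5 steps

def low_level(latent, sensors):
    return W_low @ np.concatenate([latent, sensors])

def camera(t):
    return rng.uniform(size=(32, 32, 3))               # fake RGB observation

def proprio(t):
    return np.zeros(6)                                 # fake proprioceptive readings

actions, high_level_calls = run_hierarchical(high_level, low_level, camera, proprio, num_steps=20)
```

In this toy run the high level is queried only once every five timesteps, which is why the cost of processing the image matters less often than the cost of the low-level controller.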
In the CNN variant, the high level contains a CNN that receives a RGB camera input. It has convolutional layers of filters with output channels , and , followed by a pooling layer with filter of size applied with a stride of . Output from the pooling layer is flattened and transformed into a feature vector through a fully-connected layer with activation. It is then fed into a fully-connected layer to produce a output clipped between and . The first dimension of the output vector corresponds to the HL duration scalar and the rest to the latent command. The duration is calculated by linearly scaling the output to a value between  timesteps.

Table 2
Patch Size   Stride Length   Maximum Reward
1            1               8.0
4            2               6.9
4            4               7.5
8            4               6.3
8            8               7.5
16           8               6.6
16           16              7.6
The IAP policy also has the same specification except that CNNs are replaced with attention modules in the high level. For this task, we have used deterministic ReLU features.
Comparison with CNN: Training curves for the CNN policy and IAP policy are shown in Figure 11. We observe similar task performance for both types of policies. However, the number of parameters in the CNN policy was , compared to only parameters in the IAP policy.
Ablation on patch sizes and stride lengths: We trained the IAP policy with different sets of values for the patch size and stride length (defining the translation from one patch to the next) to encode the input image into patches, which are then processed by the self-attention module. The comparative performance of different combinations is shown in Table 2. The best value for maximum episode return is achieved by patch size and stride length , a setting corresponding to the largest number of patches. For a qualitative assessment, we have added a visualization of the policies with patch size and patch size in Figure 9.
IAP locomotion policies for photorealistic Gibson environments: Finally, we trained interpretable IAP policies from scratch for locomotion and navigation in simulated 3D spaces with realistic visuals from the Gibson dataset (xiazamirhe2018gibsonenv). A visualization of the learned policy is shown in Figure 10. Corresponding videos can be viewed at https://sites.google.com/view/implicitattention.
6 Conclusion
In this paper, we significantly expanded the capabilities of methods using self-attention bottlenecks in RL. We are the first to show that efficient attention mechanisms, which have recently demonstrated impressive results for Transformers, can be used for RL policies, in what we call Implicit Attention for Pixels or IAP. While IAP can work with existing kernel features, we also proposed a new robust algorithm for estimating softmax-kernels that is of interest on its own, with strong theoretical results. In a series of experiments, we showed that IAP scales to higher-resolution images and emulates much finer-grained attention than was previously possible, improving generalization in challenging vision-based RL tasks such as quadruped locomotion with obstacles and the recently introduced Distracting Control Suite.
References
Appendix A APPENDIX: Unlocking Pixels for Reinforcement Learning via Implicit Attention
A.1 Extra Figures
A.2 Quadruped Locomotion Experiments
We provide here more details regarding an experimental setup for the quadruped locomotion tasks.
Our simulated robot is similar in size, actuator performance, and range of motion to the MIT Mini Cheetah (minicheetah) ( kg) and Unitree A1 (https://www.unitree.com/products/a1/) ( kg) robots. Robot leg movements are generated using a trajectory generator, based on the Policies Modulating Trajectory Generators (PMTG) architecture, which has shown success at learning diverse primitive behaviors for quadruped robots (iscen2018policies). The latent command from the high level, IMU sensor observations, motor angles, and current PMTG state are fed to the low-level neural network, which outputs the residual motor commands and PMTG parameters at every timestep.
We use the Unitree A1’s URDF description (https://github.com/unitreerobotics), which is available in the PyBullet simulator (pybulletcoumans). The swing and extension of each leg are controlled by a PD position controller.
The reward function is specified as the capped () velocity of the robot along the x direction:
(11)  
(12) 
A.3 Proof of Theorem 4.1
Proof.
We will rely on the formulae proven in (performers):
(13) 
and
(14) 
Denote by and angle between and . We start by proving unbiasedness of the proposed hybrid estimator. The first observation is that this estimator can be rewritten as:
(15) 
where:
(16) 
Thus we just need to show that defined as:
(17) 
for is an unbiased estimator of the angular kernel . It remains to show that for is an unbiased estimator of the angular kernel. This is shown in detail in the main body in the sketch of the proof of the Theorem (see: Fig. 5; analysis from there can be trivially extended to any dimensionality and also follows directly from the analysis of the Goemans-Williamson algorithm (goemans)). Notice that effectively the hybrid estimator is constructed by: (1) creating random features for the angular kernel, (2) creating random features for the softmax kernel (two variants), (3) leveraging the formula for the random feature map for the product of two kernels, which is a Cartesian product of the random features corresponding to the two kernels (composite random features). Vanishing variance of in points: and follows directly from the fact that has zero variance if are collinear or anti-collinear.
Having proved that the hybrid estimator admits structure given in Equation 7 (in particular that it is unbiased), we now switch to the computation of its mean squared error. From the definitions of and , we know that these estimators can be rewritten as:
(18) 
where: and:
(19) 
where: for , sampled independently from . From now on, we will drop superscripts from the estimator notation since are fixed. We have the following:
(20) 
The following is also true:
(21) 
where the last equality follows from the fact that and are independent. Therefore we have:
(22) 
Furthermore, since:
(23) 
we obtain the following:
(24) 
Let us now focus on the expression . We have the following:
(25) 
From the definition of the estimator of the angular kernel, we get:
(26) 
Therefore we conclude that:
(27) 
We thus conclude that:
(28) 
Now we switch to the expression . Using similar analysis as above, we get the following:
(29) 
This time we need to compute expression: . We have the following:
(30) 
where we used already derived formulae for . We conclude that:
(31) 
From the above, we obtain:
(32) 
Thus it remains to compute . We have:
(33) 
where the last equation follows from the fact that and are independent. Thus we have:
(34) 
where we again used already derived formulae for . Therefore we conclude that:
(35) 
where and . So what remains is to compute . Denote: and We have:
(36) 
where we used the fact that: different have the same distributions, different have the same distributions, and furthermore for : and are independent (since corresponding and are chosen independently). Using the definition of and