Unlocking Pixels for Reinforcement Learning via Implicit Attention

02/08/2021 · Krzysztof Choromanski et al.

There has recently been significant interest in training reinforcement learning (RL) agents in vision-based environments. This poses many challenges, such as high dimensionality and the potential for observational overfitting through spurious correlations. A promising approach to solving both of these problems is a self-attention bottleneck, which provides a simple and effective framework for learning high-performing policies, even in the presence of distractions. However, due to the poor scalability of attention architectures, these methods do not scale beyond low-resolution visual inputs and rely on large patches (thus small attention matrices). In this paper we make use of new efficient attention algorithms, recently shown to be highly effective for Transformers, and demonstrate that these techniques can be applied in the RL setting. This allows our attention-based controllers to scale to larger visual inputs and facilitates the use of smaller patches, even individual pixels, improving generalization. In addition, we propose a new efficient algorithm approximating softmax attention with what we call hybrid random features, leveraging the theory of angular kernels. We show theoretically and empirically that hybrid random features are a promising approach when using attention for vision-based RL.


1 Introduction

Reinforcement learning (RL) (suttonbarto) considers the problem of an agent learning solely from interactions to maximize reward. Since the introduction of deep neural networks, the field of deep RL has achieved tremendous successes, from games (alphago), to robotics (rubics_cube), and even real-world problems (loon).

As RL continues to be tested in more challenging settings, there has been increased interest in learning from vision-based observations (planet; slac; dreamer; rad; curl; drq). This presents several challenges: not only are image-based observations significantly larger, but they are also more likely to contain confounding variables, which can lead to overfitting (Song2020Observational).

A promising approach for tackling these challenges is the use of bottlenecks, which force agents to learn from a low-dimensional feature representation. This has been shown to be useful both for improving scalability (planet; dreamer) and for generalization (ibac_sni). In this paper, we focus on self-attention bottlenecks, which use an attention mechanism to select the most important regions of the state space. Recent work showed that a specific form of hard attention combines effectively with neuroevolution to create agents with significantly fewer parameters and strong generalization capabilities (yujintang), while also producing interpretable policies.

However, the current form of selective attention proposed is severely limited. It relies on standard softmax attention, popularized by (vaswani), whose time and space complexity is quadratic in the number of patches (i.e. the size of the attention matrix). This means that models become significantly slower as vision-based observations grow in resolution, and the effectiveness of the bottleneck is reduced by the need to rely on larger patches.

Figure 1: Left: An observation from a DM Control Suite task, downsized to a (100 x 100) RGB image. Right: comparison of inference time (bars) vs. reward (crosses) for the baseline attention agent from (yujintang) and our IAP mechanism. Rewards are means over five seeds after training for 100 iterations. Inference times are means over 100 forward passes.

In this paper, we demonstrate how new, scalable attention mechanisms (performers) designed for Transformers can be effectively adapted to the vision-based RL setting. We call the resulting algorithm Implicit Attention for Pixels (IAP). Notably, using IAP we are able to train agents with self-attention for images with 8x more pixels than (yujintang). We are also able to dramatically reduce the patch size, down to even a single pixel. In both cases, inference time is only marginally higher due to the linear scaling of IAP. We show a simple example of the effectiveness of our approach in Figure 1. Here we train an agent for 100 iterations on a task from the DM Control Suite (dm_control). The agents are trained in the same way, with the only difference being the use of brute-force attention (blue) or IAP efficient attention (orange). Both agents achieve a similar reward, with dramatically different inference times.

In addition, we show that attention row-normalization, which is typically crucial in supervised settings, is not required for training RL policies. Thus, we are able to introduce a new efficient mechanism approximating softmax-kernel attention (generally known to be superior to other attention kernels) with what we call hybrid random features, leveraging the theory of angular kernels. We show that our new method is more robust than existing algorithms for approximating softmax-kernel attention when attention normalization is not needed. Our mechanism is effective for RL tasks with as few as 15 random samples, which is in striking contrast to the supervised setting, where usually 200-300 samples are required. That 13x+ reduction has a profound effect on the speed of the method.

To summarize, our key contributions are as follows:

  • Practical: To the best of our knowledge, we are the first to use efficient attention mechanisms for RL from pixels. This has two clear benefits: 1) we can scale to larger images than previous works; 2) we can use more fine-grained patches which produce more effective self-attention bottlenecks. Both goals can be achieved with an embarrassingly small number of trainable parameters, providing 10x compression over standard CNN-based policies with no loss of quality of the learned controller. In our experiments (Section 5) we demonstrate the strength of this approach by training quadruped robots for obstacle avoidance.

  • Theoretical: We introduce hybrid random features, which provably and unbiasedly approximate softmax-kernel attention and better control the variance of the estimation than previous algorithms. We believe this is a significant contribution towards efficient attention for RL and beyond, to the theory of Monte Carlo methods for kernels in machine learning.

2 Related Work

Several approaches to vision in reinforcement learning have been proposed over the years, tackling three key challenges: the high-dimensional input space, partial observability of the actual state from images, and observational overfitting to spurious features (Song2020Observational). Dimensionality reduction can be achieved with hand-crafted features or with learned representations, typically via ResNet/CNN-based modules (resnets). Other approaches equip the agent with segmentation techniques and depth maps (segmentation). These methods require training a substantial number of parameters just to process vision, which is usually only one part of a richer, heterogeneous agent input that may additionally include lidar data, tactile sensors, and more, as in robotics applications. Partial observability has been addressed by a line of work focusing on designing new compact and expressive neural network architectures for vision-based controllers, such as (kulhanek).

Common ways to reduce observational overfitting are data augmentation (drq; rad; curl), causal approaches (zhang2021invariant) and bottlenecks (ibac_sni). Information bottlenecks have been particularly popular in vision-based reinforcement learning (planet; dreamer; slac), backed by theoretical results for improved generalization (SHAMIR20102696; 7133169).

In this work, we focus on self-attention bottlenecks. These provide a drastic reduction in the number of model parameters compared to standard CNN-based approaches and, furthermore, aid interpretability, which is of particular importance in reinforcement learning. The idea of selecting individual “glimpses” with attention was first proposed by (rnn_visual_attn), who use REINFORCE (reinforce) to learn which patches to use, achieving strong generalization results. Others have presented approaches to differentiate through hard attention (bengio2013estimating). This work is inspired by (yujintang), who proposed using neuroevolution methods to optimize a hard attention module, circumventing the requirement to backpropagate through it.

Our paper also contributes to the recent line of work on fast attention mechanisms. Since Transformers were shown to produce state-of-the-art results for language modelling tasks (vaswani), there has been a series of efforts to reduce their time and space complexity with respect to sequence length (Kitaev2020Reformer; peng2021random; wang2020linformer). This work extends techniques from Performer architectures (performers), which were recently shown to be among the best performing efficient mechanisms (tay2021long). Finally, it also naturally contributes to the theory of Monte Carlo algorithms for scalable kernel methods (rfs; hanlin; unifomc; geometry-rfs; unreas; orthogonal-rfs), proposing new random feature map mechanisms for softmax-kernels and, consequently, for the inherently related Gaussian kernels.

Solving robotics tasks from vision input is an important and well-researched topic (kalashnikov2018qt; yahya2017collective; levine2016end; Pan2019ZeroshotIL). Our robotic experiments focus on learning legged locomotion and the necessary navigation skills from vision. In prior work, CNNs have been used to process vision input (Pan2019ZeroshotIL; Li2019HRL4INHR; blanc2005indoor). In this work, we use self-attention to process image observations and compare our results with CNNs on realistic robotics tasks.

3 Compact Vision with Attention for RL

3.1 RL with a Self-Attention Bottleneck

In this paper, we focus on training policies $\pi: \mathcal{S} \rightarrow \mathcal{A}$ for RL agents, where $\mathcal{S}$ is the set of states and $\mathcal{A}$ is the set of actions. The goal is to maximize the expected reward obtained by the agent in the given environment, where the expectation is over trajectories $(s_{0}, a_{0}, s_{1}, a_{1}, \ldots)$ for a horizon $T$ and a reward function $R$. We consider deterministic policies. A state is either a compact representation of the visual input (an RGB(D) image) or its concatenation with the other sensors available to the agent (more details in Section 5).

The agents are trained with attention mechanisms, which take the visual input state (or observation, in a partially observable setting) and produce a compact representation for the subsequent layers of the policy. The mechanism is agnostic to the choice of training algorithm.

3.2 Patch Selection via Attention

Consider an image represented as a collection of $N$ (potentially intersecting) RGB(D)-patches indexed by $i \in \{1, \ldots, N\}$ for some $N \in \mathbb{N}$. Denote by $\mathbf{P} \in \mathbb{R}^{N \times d}$ a matrix with the vectorized patches as rows (i.e. the vectors of RGB(D)-values of all pixels in each patch). Let $\mathbf{V} \in \mathbb{R}^{N \times d_{v}}$ be a matrix of (potentially learned) value vectors corresponding to the patches, as in the regular attention mechanism (transformer). For $\ell \leq N$, we define the following patch-to-patch attention module, which is a transformation $\mathrm{Att}: \mathbb{R}^{N \times d} \rightarrow \mathbb{R}^{\ell \times d_{v}}$:

Figure 2: Visualization of our Implicit Attention for Pixels (IAP). An input RGB(D) image is represented as a union of (not necessarily disjoint) patches (in principle even individual pixels). Each patch is projected via the learned matrices $\mathbf{W}_{Q}$ / $\mathbf{W}_{K}$. This is followed by a set of (potentially randomized) projections, which in turn is followed by a nonlinear mapping defining the attention type. At inference, this process can be further optimized by computing the product of $\mathbf{W}_{Q}$ / $\mathbf{W}_{K}$ with the (random) projection matrix in advance. The tensors $\phi(\mathbf{Q})$ and $\phi(\mathbf{K})$, obtained via the (random) projections followed by the nonlinearity, define an attention matrix which is never explicitly materialized. Instead, $\phi(\mathbf{K})^{\top}$ is multiplied with the vector $\mathbf{v}$ and then the result with the tensor $\phi(\mathbf{Q})$. The output is the score vector. The algorithm can in principle use a multi-head mechanism, although we do not apply it in our experiments. Same-color lines indicate axes with the same number of dimensions.
$\mathrm{Att}(\mathbf{P}) = \big[\sigma(\mathbf{A}\mathbf{v})\big]_{\ell}\,\mathbf{V} \qquad (1)$

where $[\cdot]_{\ell}$ denotes a matrix truncated to its first $\ell$ rows and:

  • $\mathrm{K}: \mathbb{R}^{d_{k}} \times \mathbb{R}^{d_{k}} \rightarrow \mathbb{R}$ is a kernel admitting the form $\mathrm{K}(\mathbf{x}, \mathbf{y}) = \mathbb{E}\big[\phi(\mathbf{x})^{\top}\phi(\mathbf{y})\big]$ for some (randomized) finite kernel feature map $\phi: \mathbb{R}^{d_{k}} \rightarrow \mathbb{R}^{m}$,

  • $\mathbf{A} \in \mathbb{R}^{N \times N}$ is the attention matrix defined as $\mathbf{A}_{i,j} = \mathrm{K}(\mathbf{q}_{i}, \mathbf{k}_{j})$, where $\mathbf{q}_{i}, \mathbf{k}_{j}$ are the rows of the matrices $\mathbf{Q} = \mathbf{P}\mathbf{W}_{Q}$, $\mathbf{K} = \mathbf{P}\mathbf{W}_{K}$ (queries & keys), and $\mathbf{W}_{Q}, \mathbf{W}_{K} \in \mathbb{R}^{d \times d_{k}}$ for some $d_{k} \in \mathbb{N}$,

  • $\mathbf{v} \in \mathbb{R}^{N}$ is a (potentially learnable) vector defining how the signal from the attention matrix should be agglomerated to determine the most critical patches,

  • $\sigma$ is a (potentially learnable) function mapping to the space of permutation matrices in $\mathbb{R}^{N \times N}$.

The above mechanism effectively chooses $\ell$ patches from the entire coverage and takes their corresponding embeddings from $\mathbf{V}$ as the final representation of the image. The attention block defined in Equation 1 is parameterized by two matrices, $\mathbf{W}_{Q}$ and $\mathbf{W}_{K}$, and potentially also by a vector $\mathbf{v}$ and a function $\sigma$. The output of the attention module is vectorized and concatenated with other sensor data. The resulting vector is then passed to the controller as its input state. Particular instantiations of the above mechanism lead to techniques studied before. For instance, if $\mathrm{K}$ is the softmax-kernel, $\mathbf{v}$ is the all-ones vector, $\sigma$ outputs a permutation matrix that sorts the entries of its input from largest to smallest, and the rows of $\mathbf{V}$ are the centers of the corresponding patches, one retrieves the method proposed in (yujintang), yet with no attention row-normalization.
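For concreteness, the following is a minimal NumPy sketch of a brute-force instantiation of Eq. 1: softmax-kernel attention without row-normalization, an all-ones aggregation vector $\mathbf{v}$, and top-$\ell$ selection standing in for the sorting permutation $\sigma$. All function names, shapes, and constants are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np

def select_patches_bruteforce(P, W_Q, W_K, ell):
    """Score every patch with an explicit N x N softmax-kernel attention matrix
    and return the indices of the ell highest-scoring patches."""
    Q = P @ W_Q                       # queries, shape (N, d_k)
    K = P @ W_K                       # keys,    shape (N, d_k)
    A = np.exp(Q @ K.T)               # unnormalized softmax-kernel attention, (N, N)
    v = np.ones(A.shape[1])           # aggregation vector (here: plain column sum)
    scores = A @ v                    # per-patch importance scores, (N,)
    return np.argsort(-scores)[:ell]  # top-ell patches, most important first

# Toy usage: 64 patches of dimension 48 (e.g. 4x4 RGB patches), keep the top 8.
rng = np.random.default_rng(0)
P = rng.normal(size=(64, 48))
W_Q = 0.1 * rng.normal(size=(48, 4))
W_K = 0.1 * rng.normal(size=(48, 4))
print(select_patches_bruteforce(P, W_Q, W_K, ell=8))
```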

4 Implicit Attention for Pixels (IAP)

Computing attention blocks as defined in Equation 1 is in practice very costly when $N$ is large, since it requires the explicit construction of the matrix $\mathbf{A}$. This means it is not possible to use small patches even for a moderate-size input image, while high-resolution images are prohibitive. Standard attention modules are characterized by $O(N^{2})$ space and time complexity, where $N$ is the number of patches. We instead propose to leverage $\mathbf{A}$ indirectly, by applying techniques introduced in (performers) for the class of Transformers called Performers. We approximate $\mathbf{A}$ via (random) finite feature maps given by a mapping $\phi: \mathbb{R}^{d_{k}} \rightarrow \mathbb{R}^{m}$ for a parameter $m \in \mathbb{N}$, as:

$\mathbf{A} \approx \widehat{\mathbf{A}} = \mathbf{Q}^{\prime}(\mathbf{K}^{\prime})^{\top} \qquad (2)$

where $\mathbf{Q}^{\prime}, \mathbf{K}^{\prime} \in \mathbb{R}^{N \times m}$ are matrices with rows $\phi(\mathbf{q}_{i})^{\top}$ and $\phi(\mathbf{k}_{i})^{\top}$ respectively. By replacing $\mathbf{A}$ with $\widehat{\mathbf{A}}$ in Equation 1, we obtain the attention transformation given as:

$\widehat{\mathrm{Att}}(\mathbf{P}) = \Big[\sigma\Big(\mathbf{Q}^{\prime}\big((\mathbf{K}^{\prime})^{\top}\mathbf{v}\big)\Big)\Big]_{\ell}\,\mathbf{V} \qquad (3)$

where the brackets indicate the order of computations. By disentangling $\mathbf{Q}^{\prime}$ from $(\mathbf{K}^{\prime})^{\top}$, we effectively avoid explicitly calculating the attention matrix and compute the input to $\sigma$ in time and space linear rather than quadratic in $N$. The IAP method is schematically presented in Fig. 2.
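To make the linear-time reordering of Eq. 3 concrete, here is a hedged NumPy sketch with a deterministic ReLU map standing in for $\phi$ (randomized maps plug in identically). Computing $\phi(\mathbf{K})^{\top}\mathbf{v}$ first gives exactly the same scores as the brute-force $(\phi(\mathbf{Q})\phi(\mathbf{K})^{\top})\mathbf{v}$, but the $N \times N$ matrix is never materialized; names and sizes below are illustrative.

```python
import numpy as np

def relu_features(X):
    """A simple deterministic feature map phi; randomized maps plug in the same way."""
    return np.maximum(X, 0.0)

def iap_scores(P, W_Q, W_K, phi=relu_features):
    """Patch importance scores in O(N m) time/space: phi(K)^T v is computed first,
    so the N x N attention matrix is never materialized."""
    Qp = phi(P @ W_Q)            # (N, m)
    Kp = phi(P @ W_K)            # (N, m)
    v = np.ones(Kp.shape[0])     # aggregation vector over patches
    return Qp @ (Kp.T @ v)       # (N,) -- brackets give the order of computation

# Toy usage: a 100x100 RGB image cut into 10,000 single-pixel "patches".
rng = np.random.default_rng(0)
P = rng.normal(size=(10_000, 3))
W_Q = 0.1 * rng.normal(size=(3, 8))
W_K = 0.1 * rng.normal(size=(3, 8))
scores = iap_scores(P, W_Q, W_K)
print(np.argsort(-scores)[:10])   # indices of the 10 highest-scoring pixels
```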

The kernel $\mathrm{K}$ defining the attention type, and consequently the corresponding finite feature map $\phi$ (randomized or deterministic), can be chosen in different ways, see (performers); yet in practice a ReLU-based variant of the form:

(4)

or:

(5)

(a same-length input version), together with the softmax-kernel $\mathrm{SM}(\mathbf{x},\mathbf{y}) = \exp(\mathbf{x}^{\top}\mathbf{y})$, often outperforms the others. The ReLU variant uses deterministic features, so it suffices to estimate $\mathrm{SM}$. Its efficient random feature map $\phi^{+}_{m}$, from the FAVOR+ mechanism (performers), is of the form:

$\phi^{+}_{m}(\mathbf{x}) = \frac{\exp\left(-\frac{\|\mathbf{x}\|^{2}}{2}\right)}{\sqrt{m}}\Big(\exp(\omega_{1}^{\top}\mathbf{x}), \ldots, \exp(\omega_{m}^{\top}\mathbf{x})\Big)^{\top} \qquad (6)$

for $\omega_{1}, \ldots, \omega_{m}$ forming a block-orthogonal ensemble of Gaussian vectors with marginal distributions $\mathcal{N}(0, \mathbf{I}_{d_{k}})$. This mapping provides an unbiased estimator $\widehat{\mathrm{SM}}^{+}_{m}(\mathbf{x},\mathbf{y}) = \phi^{+}_{m}(\mathbf{x})^{\top}\phi^{+}_{m}(\mathbf{y})$ of $\mathrm{SM}(\mathbf{x},\mathbf{y})$ and consequently an unbiased estimator of the attention matrix $\mathbf{A}$ for the softmax-kernel $\mathrm{K} = \mathrm{SM}$.
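For reference, a minimal NumPy sketch of positive random features in the spirit of FAVOR+ is given below; it uses plain iid Gaussian projections (the block-orthogonal ensemble of (performers) is omitted for brevity), and all names and dimensions are illustrative.

```python
import numpy as np

def positive_features(X, W):
    """phi^+(x) = exp(-||x||^2 / 2) / sqrt(m) * (exp(w_1^T x), ..., exp(w_m^T x))."""
    m = W.shape[0]
    return np.exp(X @ W.T - 0.5 * np.sum(X ** 2, axis=-1, keepdims=True)) / np.sqrt(m)

# Unbiased estimate of SM(x, y) = exp(x^T y) as a dot product of positive features.
rng = np.random.default_rng(0)
d, m = 8, 256
W = rng.normal(size=(m, d))                     # iid Gaussian projections
x, y = 0.3 * rng.normal(size=d), 0.3 * rng.normal(size=d)
estimate = positive_features(x[None], W) @ positive_features(y[None], W).T
print(estimate.item(), np.exp(x @ y))           # Monte Carlo estimate vs. exact value
```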

4.1 Hybrid Random Features For Softmax-Kernel

The most straightforward approach to approximating the softmax-kernel is to use trigonometric features and consequently the estimator $\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}(\mathbf{x},\mathbf{y})$ defined as: $\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}(\mathbf{x},\mathbf{y}) = \exp\big(\frac{\|\mathbf{x}\|^{2}+\|\mathbf{y}\|^{2}}{2}\big)\frac{1}{m}\sum_{i=1}^{m}\cos\big(\omega_{i}^{\top}(\mathbf{x}-\mathbf{y})\big)$ for iid $\omega_{1}, \ldots, \omega_{m} \sim \mathcal{N}(0, \mathbf{I}_{d_{k}})$.

Figure 3: Mean squared errors (MSEs) for the three unbiased softmax-kernel estimators discussed in the paper (from left to right in the figure): $\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}$, $\widehat{\mathrm{SM}}^{+}_{m}$ and $\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m}$, for the values of $m$ used in our experiments (see Sec. 5). MSEs are given as functions of two variables: the angle $\theta$ between $\mathbf{x}$ and $\mathbf{y}$ and the inputs' length (symmetrized along the length axis and with $\|\mathbf{x}\| = \|\mathbf{y}\|$). For each plot, we mark in grey its slice for a fixed length. Those slices show the key differences between these estimators. The MSE of $\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}$ goes to zero as $\theta$ goes to zero. The MSE of $\widehat{\mathrm{SM}}^{+}_{m}$ goes to zero as $\theta$ goes to $\pi$. The MSE of the hybrid estimator goes to zero for both $\theta \rightarrow 0$ and $\theta \rightarrow \pi$.

As explained in (performers), for inputs of similar length the estimator $\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}$ is characterized by lower variance when the approximated softmax-kernel values are larger (this is best illustrated when $\|\mathbf{x}\| = \|\mathbf{y}\|$ and the angle $\theta$ between $\mathbf{x}$ and $\mathbf{y}$ satisfies $\theta = 0$, in which case the variance is zero) and larger variance when they are smaller. This makes the mechanism unsuitable for approximating attention if the attention matrix needs to be row-normalized (which is the case in the standard supervised setting for Transformers), since the renormalizers might be very poorly approximated if they are given as sums containing many small attention values. On the other hand, the estimator $\widehat{\mathrm{SM}}^{+}_{m}$ has variance going to zero as the approximated values go to zero, since the corresponding mapping $\phi^{+}_{m}$ has nonnegative entries.
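A small Monte Carlo check of these two regimes, assuming equal-norm inputs and plain iid Gaussian samples, is sketched below: the trigonometric estimator is exact at $\theta = 0$, while the positive one is exact at $\theta = \pi$.

```python
import numpy as np

def sm_trig(x, y, W):
    """Trigonometric estimator of exp(x^T y) (uses cos(a-b) = cos a cos b + sin a sin b)."""
    return np.exp(0.5 * (x @ x + y @ y)) * np.cos(W @ (x - y)).mean()

def sm_pos(x, y, W):
    """Positive-feature estimator of exp(x^T y)."""
    return np.exp(-0.5 * (x @ x + y @ y)) * np.exp(W @ (x + y)).mean()

rng = np.random.default_rng(0)
d, m, trials = 8, 16, 2000
x = rng.normal(size=d)
x *= 2.0 / np.linalg.norm(x)                      # fix ||x|| = 2
for y, label in [(x.copy(), "theta = 0"), (-x, "theta = pi")]:
    exact = np.exp(x @ y)
    trig = np.array([sm_trig(x, y, rng.normal(size=(m, d))) for _ in range(trials)])
    pos = np.array([sm_pos(x, y, rng.normal(size=(m, d))) for _ in range(trials)])
    print(label, "exact:", round(float(exact), 4),
          "MSE(trig):", round(float(np.mean((trig - exact) ** 2)), 6),
          "MSE(pos):", round(float(np.mean((pos - exact) ** 2)), 6))
```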

Since our proposed algorithm does not conduct row-normalization of the attention matrix (we show in Section 5 that we do not need it for RL applications), the question arises whether we can take the best of both worlds. We propose an unbiased hybrid estimator of the softmax-kernel attention, given as:

$\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m}(\mathbf{x},\mathbf{y}) = \widehat{\lambda}(\mathbf{x},\mathbf{y})\,\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}(\mathbf{x},\mathbf{y}) + \big(1-\widehat{\lambda}(\mathbf{x},\mathbf{y})\big)\,\widehat{\mathrm{SM}}^{+}_{m}(\mathbf{x},\mathbf{y}) \qquad (7)$

where $\widehat{\lambda}(\mathbf{x},\mathbf{y})$ is an unbiased estimator of a coefficient $\lambda(\mathbf{x},\mathbf{y})$, constructed independently from $\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}$ and $\widehat{\mathrm{SM}}^{+}_{m}$, and furthermore the two latter estimators rely on the same sets of Gaussian samples $\omega_{1}, \ldots, \omega_{m}$. In addition, we constrain $\lambda$ to satisfy $\lambda(\mathbf{x},\mathbf{y}) = 1$ if $\theta_{\mathbf{x},\mathbf{y}} = 0$ and $\lambda(\mathbf{x},\mathbf{y}) = 0$ if $\theta_{\mathbf{x},\mathbf{y}} = \pi$, where $\theta_{\mathbf{x},\mathbf{y}}$ denotes the angle between $\mathbf{x}$ and $\mathbf{y}$.

The estimator $\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m}$ becomes $\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}$ for $\theta_{\mathbf{x},\mathbf{y}} = 0$ and $\widehat{\mathrm{SM}}^{+}_{m}$ for $\theta_{\mathbf{x},\mathbf{y}} = \pi$, which means that its variance approaches zero for both $\theta_{\mathbf{x},\mathbf{y}} \rightarrow 0$ and $\theta_{\mathbf{x},\mathbf{y}} \rightarrow \pi$ (for inputs of the same $L_{2}$-norm). The key observation is that such an estimator, expressed as $\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m}(\mathbf{x},\mathbf{y}) = \Phi(\mathbf{x})^{\top}\Phi(\mathbf{y})$ for a finite-dimensional mapping $\Phi$, can indeed be constructed. The mapping $\Phi$ is given as:

(8)

where:

(9)

and: $\oplus$ stands for the horizontal concatenation operation, $\mathrm{sgn}$ is the sign mapping, and $\{\omega_{i}\}$ and $\{\tau_{i}\}$ are two independent ensembles of random Gaussian samples. The following is true:

Theorem 4.1 (MSE of the hybrid estimator).

Let $\Phi$ be defined as in Eq. 8 and let $\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m}(\mathbf{x},\mathbf{y}) = \Phi(\mathbf{x})^{\top}\Phi(\mathbf{y})$. Then $\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m}$ satisfies the formula from Eq. 7 (thus, in particular, it is unbiased) and furthermore its mean squared error (MSE) satisfies:

(10)

where $\theta$ denotes the angle between $\mathbf{x}$ and $\mathbf{y}$.

Figure 4: Slices of the 3d plots of MSEs from Fig. 3 for an extended angle axis. We see that the MSE of the hybrid estimator is better bounded than those of $\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}$ and $\widehat{\mathrm{SM}}^{+}_{m}$. Furthermore, it vanishes in the places where either of the other two does, namely for $\theta \in \{0, \pi\}$.
Figure 5: Visualization of the random feature mechanism for the angular kernel in 3d space. For a Gaussian vector $\omega$, the expression $\mathrm{sgn}(\omega^{\top}\mathbf{x})\,\mathrm{sgn}(\omega^{\top}\mathbf{y})$ is negative iff the projection of $\omega$ onto the linear span of $\mathbf{x}, \mathbf{y}$ falls in one of the two light-blue cones obtained by rotating the green ones by $\frac{\pi}{2}$. Since the distribution of the angle that this projection forms with one of the coordinate axes is uniform, that event happens with probability $\frac{\theta}{\pi}$. Thus the expected value of the expression is $1 - \frac{2\theta}{\pi}$, which is exactly the value of the angular kernel.

The estimator $\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m}$ is more accurate than both $\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}$ and $\widehat{\mathrm{SM}}^{+}_{m}$, since the hybrid feature map mechanism better controls its variance, in particular making the MSE vanish for both corner cases $\theta = 0$ and $\theta = \pi$ (for same-length inputs); see Fig. 3 and 4. Furthermore, and critically from the practical point of view, since it can be efficiently expressed as a dot product of finite-dimensional randomized vectors, it admits the decomposition from Sec. 3. Consequently, it can be directly used to estimate the attention mechanism from Sec. 4 with space and time complexity linear in the number of patches $N$.

Sketch of the proof:

The full proof is given in the Appendix (Sec. A.3). It relies in particular on: (1) the fact that the angular kernel (quantifying the relative importance of the two estimators combined in the hybrid method) can be rewritten as $\mathrm{K}_{\mathrm{ang}}(\mathbf{x},\mathbf{y}) = \mathbb{E}\big[\mathrm{sgn}(\omega^{\top}\mathbf{x})\,\mathrm{sgn}(\omega^{\top}\mathbf{y})\big]$ for $\omega \sim \mathcal{N}(0, \mathbf{I}_{d_{k}})$ (see Fig. 5 for the explanation of why this is true), and (2) the composite random feature mechanism for the product of two kernels, each equipped with its own random feature map. The vanishing variance of $\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m}$ for $\theta \in \{0, \pi\}$ is implied by the fact that the estimator based on $\mathrm{sgn}$-features is deterministic for these two corner cases and thus exact.
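To make the construction tangible, here is a hedged NumPy sketch of a hybrid estimator in the spirit of Eq. 7: the angular kernel is estimated with sign features drawn from an independent Gaussian ensemble and used as the mixing weight $\lambda = (1 + \mathrm{K}_{\mathrm{ang}})/2$, so the trigonometric estimator dominates near $\theta = 0$ and the positive one near $\theta = \pi$. The exact coefficient and the concatenated feature map $\Phi$ of Eq. 8 may differ in the paper; this sketch also estimates kernel values directly rather than building $\Phi$.

```python
import numpy as np

def sm_trig(x, y, W):
    return np.exp(0.5 * (x @ x + y @ y)) * np.cos(W @ (x - y)).mean()

def sm_pos(x, y, W):
    return np.exp(-0.5 * (x @ x + y @ y)) * np.exp(W @ (x + y)).mean()

def angular_hat(x, y, T):
    """Unbiased sign-feature estimate of K_ang(x, y) = 1 - 2*theta/pi (cf. Fig. 5)."""
    return float((np.sign(T @ x) * np.sign(T @ y)).mean())

def sm_hybrid(x, y, W, T):
    """Convex combination of the two softmax-kernel estimators, weighted by the
    (independently estimated) angular kernel: weight 1 at theta=0, 0 at theta=pi."""
    lam = 0.5 * (1.0 + angular_hat(x, y, T))
    return lam * sm_trig(x, y, W) + (1.0 - lam) * sm_pos(x, y, W)

rng = np.random.default_rng(0)
d, m = 8, 16
x = rng.normal(size=d)
x *= 1.5 / np.linalg.norm(x)
W = rng.normal(size=(m, d))        # shared samples for both SM estimators
T = rng.normal(size=(m, d))        # independent samples for the angular kernel
for y, label in [(x.copy(), "theta = 0"), (-x, "theta = pi")]:
    print(label, sm_hybrid(x, y, W, T), np.exp(x @ y))
```

At both corner cases the sign estimate is deterministic, so the printed hybrid estimate matches the exact softmax-kernel value, mirroring the vanishing-variance argument above.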

5 Experiments

In this section, we seek to test our hypothesis that efficient attention mechanisms can achieve strong accuracy in RL, matching their performance in the context of Transformers (performers). We also aim to show that we can scale to significantly larger visual inputs, and use smaller patches, which would be prohibitively expensive with standard attention algorithms. Finally, we hypothesize that fewer, smaller patches will be particularly effective in preventing observational overfitting in the presence of distractions.

To test our hypotheses, we conduct a series of experiments, beginning with a challenging large-scale vision task with distractions, where attending to the correct regions of the observation is critical. We finish with difficult simulated robotics environments, where an agent must navigate several obstacles. We use two kernel-attention mechanisms for IAP: the ReLU-based one from (performers) and the hybrid method introduced here. The former applies deterministic kernel features and the latter randomized ones. Controllers are trained with ES methods (ES).

5.1 How Many Random Features Do We Need?

We first discuss the sensitivity of our method to the number of random features. There is a trade-off between speed and accuracy: as we reduce the number of random features, the inference time decreases, but accuracy may decline. To test this, we use the default Cheetah-Run environment from the DM Control Suite (dm_control), with observations resized to (100 x 100), similar to the (96 x 96) inputs used in (yujintang). We fix the patch size and the number of selected top patches. Results are in Fig. 6. Different variants of the number of random features are encoded as pairs.

Figure 6: Cheetah-Run ablations. Left: inference time for forward passes with different attention mechanisms. Right: mean reward curves over training iterations; shaded areas correspond to the standard deviation.

As we see, ReLU is the fastest IAP approach, while inference time increases as we increase the number of random features. However, all IAP approaches are significantly faster than brute force (brown). In terms of performance, we see the best results for one particular pair of feature counts, which we hypothesize trades off accuracy and exploration effectively for this task. Given that this setting also appears to retain most of the speed benefits, we use it for our other experiments involving hybrid softmax.

5.2 Distracting Control Suite

We then apply our method to a modified version of the DM Control Suite termed the Distracting Control Suite (distracting), where the backgrounds of the normal DM Control Suite observations are replaced with random images and the scenes are viewed through random camera angles, as shown in Fig. 12 in the Appendix.

By default in this benchmark, the native images are of size (240 x 320), substantially larger than (96 x 96) used in (yujintang), and given that we may also use smaller patch sizes (e.g. size 2 vs the default 7 in (yujintang)), this new benchmark leads to a significantly longer maximum sequence length (19200 vs 529) for the attention component. In addition, given the particularly small stick-like appearances of most of the agents, a higher percentage of image patches will contain irrelevant background observations that can cause observational overfitting (Song2020Observational), making this task more difficult for vision-based policies.
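The sequence lengths quoted above follow from simple patch arithmetic; a quick sanity check (assuming non-overlapping size-2 patches for the 240 x 320 Distracting Control images and size-7 patches with stride 4 for the 96 x 96 inputs of (yujintang)):

```python
def num_patches(height, width, patch, stride=None):
    """Number of patches produced by a sliding window of the given size and stride."""
    stride = stride or patch
    return ((height - patch) // stride + 1) * ((width - patch) // stride + 1)

print(num_patches(240, 320, patch=2))           # 19200 patches
print(num_patches(96, 96, patch=7, stride=4))   # 529 patches, as in (yujintang)
```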

Environment IAP SAC QT-Opt
Cheetah-Run 134 77 74
Walker-Walk 125 24 111
CartPole-Swingup 196 167 212
Ball-In-Cup Catch 135 109 62
Reacher-Easy 128 75 109
Table 1: We use the static setting on the medium difficulty benchmark found in (distracting). We include reported results from the paper for SAC and QT-Opt. For IAP, we report the final reward for the fastest convergent method.

Our experimental results on the Distracting Control Suite show that more fine-grained patches (a smaller patch size) combined with fewer selected patches improve performance (Fig. 7). Interestingly, this is contrary to the results found in (yujintang), which showed that, with YouTube/Noisy backgrounds, decreasing the number of selected patches reduces performance as the agent attends to noisier patches. We hypothesize this could be due to many potential reasons (higher parameter count from ES, different benchmarks, bottleneck effects, etc.), but we leave this investigation to future work.

Figure 7: We performed a grid-search sweep over patch sizes, embedding dimensions, and the number of selected patches. We see that, generally, smaller patch sizes with fewer selected patches improve performance.
Figure 8: We see that the IAP-Hybrid method is competitive with or outperforms the IAP-ReLU variant. Both are significantly faster than the brute-force attention approach.

We thus use a patch size of 2 and compare the performance of regular “brute force” softmax, IAP with ReLU features, and IAP with hybrid softmax, in terms of wall-clock time. For the hybrid setting, as discussed in Subsection 5.1, we use a feature combination that is significantly smaller than the number of features used in the supervised Transformer setting (performers), yet achieves competitive results in the RL setting. Furthermore, we compare our algorithm with standard ConvNets trained with SAC (sac-v2) and QT-Opt (qt_opt) in Table 1 and find that we are consistently competitive with or outperform those methods.

5.3 Visual Locomotion and Navigation Tasks

We use a simulated quadruped robot for our experiments. This robot has 12 degrees of freedom (3 per leg). Our locomotion task is set up in an obstacle-course environment. In this environment, the robot starts from the origin on a raised platform and a series of walls lies ahead of it. The robot observes the environment through a first-person RGB camera looking straight ahead. To accomplish the task, it needs to learn to steer in order to avoid collisions with the walls and falling off the edge. The reward function is specified as the capped velocity of the robot along the x direction (see Section A.2).

Policy details and training setup: We train our IAP policies to solve this robotics task and compare their performance against traditional CNN policies. Given the complexity of the task, we use the hierarchical policy structure introduced in (Jain2019HierarchicalRL). In this setup, the policy is split into two hierarchical levels: a high level and a low level. The high level processes the camera observations from the environment and outputs a latent command vector, which is fed into the low level. The high level also outputs a scalar duration for which its own execution is paused, while the low level runs at every control timestep. The low level is a linear neural network which controls the robot's leg movements.
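A schematic sketch of this two-level control loop is given below; every name, dimension, and scaling constant is a placeholder rather than the authors' implementation, and the vision module can be either the CNN described next or the IAP attention module.

```python
import numpy as np

rng = np.random.default_rng(0)

def high_level(image, vision_module, latent_dim=8):
    """Vision module (CNN or IAP) -> latent command plus a hold duration."""
    out = vision_module(image)                   # assumed clipped to [-1, 1]
    duration = int(1 + 49 * (out[0] + 1) / 2)    # placeholder scaling to 1..50 steps
    return out[1:1 + latent_dim], duration

def low_level(proprio, latent, W):
    """Linear layer over proprioception + latent command -> motor commands."""
    return W @ np.concatenate([proprio, latent])

def rollout(step_env, get_image, get_proprio, vision_module, W, steps=100):
    """The high level is queried only when its hold duration expires."""
    latent, hold = None, 0
    for _ in range(steps):
        if hold == 0:
            latent, hold = high_level(get_image(), vision_module)
        step_env(low_level(get_proprio(), latent, W))
        hold -= 1

# Toy demo with stand-in components (all shapes are illustrative).
vision = lambda img: np.tanh(rng.normal(size=9))        # 1 duration dim + 8 latent dims
W = 0.01 * rng.normal(size=(12, 20 + 8))                # 12 motor outputs, 20 proprio dims
rollout(step_env=lambda a: None,
        get_image=lambda: np.zeros((32, 32, 3)),
        get_proprio=lambda: np.zeros(20),
        vision_module=vision, W=W, steps=20)
```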

In the CNN variant, the high level contains a CNN that receives an RGB camera input. It consists of several convolutional layers followed by a pooling layer. The output of the pooling layer is flattened and transformed into a feature vector through a fully-connected layer with a nonlinear activation. It is then fed into a final fully-connected layer to produce an output vector with entries clipped to a fixed range. The first dimension of the output vector corresponds to the high-level duration scalar and the rest to the latent command. The duration is obtained by linearly scaling that entry to a number of control time-steps.

Patch Size | Stride Length | Maximum Reward
1 | 1 | 8.0
4 | 2 | 6.9
4 | | 7.5
8 | 4 | 6.3
8 | | 7.5
16 | 8 | 6.6
16 | | 7.6
Table 2: Ablation over patch size and stride length.

The IAP policy has the same specification, except that the CNN is replaced with an attention module in the high level. For this task, we used deterministic ReLU features.

Figure 9: Visualization of IAP policies with a small patch size (top row) and a large patch size (bottom row). A series of image frames along the episode is shown. In the top-left corner of each image, the input camera image is attached; the red part of the camera image is the area selected by self-attention. With the small patch size, we can see that the policy finely detects the boundaries of the obstacles, which helps in navigation. With the large patch size, only a single patch is selected, covering one fourth of the whole camera image; the policy identifies the general walking direction, but fine-grained visual information is lost.
Figure 10: Navigating Gibson environments with IAP policies. This navigation environment has realistic visuals, which the robot observes through a front depth-camera view. The top patches are selected by self-attention. The input depth-camera image is shown in the top-left corner of each frame; the red area in the camera view corresponds to the selected patches. The robot successfully passes through a narrow gate with the help of vision while navigating the environment. As in the previous figure, the policy is highly interpretable.

Comparison with CNN: Training curves for the CNN policy and the IAP policy are shown in Figure 11. We observe similar task performance for both types of policies; however, the CNN policy required roughly an order of magnitude more parameters than the IAP policy (the 10x compression discussed in Section 1).

Figure 11: Comparison between IAP and CNN policies on Locomotion Tasks: Both methods show similar performance.

Ablation on patch sizes and stride lengths: We trained the IAP policy with different combinations of patch size and stride length (the translation from one patch to the next) used to encode the input image into the patches processed by the self-attention module. The comparative performance of the different combinations is shown in Table 2. The best maximum episode return is achieved with patch size 1 and stride length 1, the setting corresponding to the largest number of patches. For a qualitative assessment, we added a visualization of policies with a small and a large patch size in Figure 9.

IAP locomotion policies for photo-realistic Gibson environments: Finally, we trained interpretable IAP policies from scratch for locomotion and navigation in simulated 3D spaces with realistic visuals from the Gibson dataset (xiazamirhe2018gibsonenv). A visualization of a learned policy is shown in Figure 10. Corresponding videos can be viewed at https://sites.google.com/view/implicitattention.

6 Conclusion

In this paper, we significantly expanded the capabilities of methods using self-attention bottlenecks in RL. We are the first to show that efficient attention mechanisms, which have recently demonstrated impressive results for Transformers, can be used for RL policies, in what we call Implicit Attention for Pixels (IAP). While IAP can work with existing kernel features, we also proposed a new robust algorithm for estimating softmax-kernels that is of independent interest, with strong theoretical guarantees. In a series of experiments, we showed that IAP scales to higher-resolution images and emulates much finer-grained attention than was previously possible, improving generalization in challenging vision-based RL tasks such as quadruped locomotion with obstacles and the recently introduced Distracting Control Suite.

References

Appendix A: Unlocking Pixels for Reinforcement Learning via Implicit Attention

A.1 Extra Figures

Figure 12: Examples of Distracting Control Suite (distracting) tasks with distractions in the background that need to be automatically filtered out to learn a successful controller. Image resolutions are substantially larger than for most other vision-based benchmarks for RL considered before. Code can be found at https://github.com/google-research/google-research/tree/master/distracting_control.

A.2 Quadruped Locomotion Experiments

We provide here more details regarding the experimental setup for the quadruped locomotion tasks.

Our simulated robot is similar in size, actuator performance, and range of motion to the MIT Mini Cheetah (minicheetah) and Unitree A1 (https://www.unitree.com/products/a1/) robots. Robot leg movements are generated using a trajectory generator, based on the Policies Modulating Trajectory Generators (PMTG) architecture, which has shown success at learning diverse primitive behaviors for quadruped robots (iscen2018policies). The latent command from the high level, IMU sensor observations, motor angles, and the current PMTG state are fed to the low-level neural network, which outputs the residual motor commands and PMTG parameters at every timestep.

We use the Unitree A1’s URDF description333https://github.com/unitreerobotics, which is available in the PyBullet simulator (pybulletcoumans). The swing and extension of each leg is controlled by a PD position controller.

The reward function is specified as the capped velocity of the robot along the x direction:

(11)
(12)
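A plausible shape for this capped-velocity reward, written out for clarity (the cap value $c$ and the exact per-step form are illustrative assumptions):

```latex
% Hedged sketch: c is a placeholder cap, \Delta t the control timestep.
r_t = \min\bigl(v_x(t),\, c\bigr), \qquad
v_x(t) = \frac{x(t) - x(t - \Delta t)}{\Delta t}
```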

A.3 Proof of Theorem 4.1

Proof.

We will rely on the formulae proven in (performers):

$\mathrm{MSE}\big(\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}(\mathbf{x},\mathbf{y})\big) = \frac{1}{2m}\exp\big(\|\mathbf{x}\|^{2}+\|\mathbf{y}\|^{2}\big)\big(1-\exp(-\|\mathbf{x}-\mathbf{y}\|^{2})\big)^{2} \qquad (13)$

and

$\mathrm{MSE}\big(\widehat{\mathrm{SM}}^{+}_{m}(\mathbf{x},\mathbf{y})\big) = \frac{1}{m}\exp\big(2\mathbf{x}^{\top}\mathbf{y}\big)\big(\exp(\|\mathbf{x}+\mathbf{y}\|^{2})-1\big) \qquad (14)$

Denote by $\theta$ the angle between $\mathbf{x}$ and $\mathbf{y}$. We start by proving the unbiasedness of the proposed hybrid estimator. The first observation is that this estimator can be rewritten as:

(15)

where:

(16)

Thus we just need to show that the estimator defined as:

(17)

for independent Gaussian samples is an unbiased estimator of the angular kernel. This is shown in detail in the main body, in the sketch of the proof of the Theorem (see Fig. 5; the analysis there trivially extends to any dimensionality and also follows directly from the analysis of the Goemans-Williamson algorithm (goemans)). Notice that, effectively, the hybrid estimator is constructed by: (1) creating random features for the angular kernel, (2) creating random features for the softmax-kernel (two variants), and (3) leveraging the fact that a random feature map for the product of two kernels is the Cartesian product of the random features corresponding to the two kernels (composite random features). The vanishing variance of $\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m}$ at $\theta = 0$ and $\theta = \pi$ follows directly from the fact that the sign-based estimator of the angular kernel has zero variance if $\mathbf{x}$ and $\mathbf{y}$ are collinear or anti-collinear.

Having proved that the hybrid estimator admits the structure given in Equation 7 (in particular, that it is unbiased), we now switch to the computation of its mean squared error. From the definitions of $\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}$ and $\widehat{\mathrm{SM}}^{+}_{m}$, we know that these estimators can be rewritten as:

(18)

where: and:

(19)

where the additional Gaussian samples are drawn independently. From now on, we will drop superscripts from the estimator notation since $\mathbf{x}$ and $\mathbf{y}$ are fixed. We have the following:

(20)

The following is also true:

(21)

where the last equality follows from the fact that and are independent. Therefore we have:

(22)

Furthermore, since:

(23)

we obtain the following:

(24)

Let us now focus on the expression . We have the following:

(25)

From the definition of the estimator of the angular kernel, we get:

(26)

Therefore we conclude that:

(27)

We thus conclude that:

(28)

Now we switch to the expression . Using similar analysis as above, we get the following:

(29)

This time we need to compute expression: . We have the following:

(30)

where we used already derived formulae for . We conclude that:

(31)

From the above, we obtain:

(32)

Thus it remains to compute . We have:

(33)

where the last equation follows from the fact that and are independent. Thus we have:

(34)

where we again used already derived formulae for . Therefore we conclude that:

(35)

where and . So what remains is to compute . Denote: and We have:

(36)

where we used the fact that: different have the same distributions, different have the same distributions, and furthermore for :