PI-ARS: Accelerating Evolution-Learned Visual-Locomotion with Predictive Information Representations

07/27/2022
by   Kuang-Huei Lee, et al.

Evolution Strategy (ES) algorithms have shown promising results in training complex robotic control policies due to their massive parallelism, simple implementation, effective parameter-space exploration, and fast training time. However, a key limitation of ES is its scalability to large capacity models, including modern neural network architectures. In this work, we develop Predictive Information Augmented Random Search (PI-ARS) to mitigate this limitation by leveraging recent advancements in representation learning to reduce the parameter search space for ES. Namely, PI-ARS combines a gradient-based representation learning technique, Predictive Information (PI), with a gradient-free ES algorithm, Augmented Random Search (ARS), to train policies that can process complex robot sensory inputs and handle highly nonlinear robot dynamics. We evaluate PI-ARS on a set of challenging visual-locomotion tasks where a quadruped robot needs to walk on uneven stepping stones, quincuncial piles, and moving platforms, as well as to complete an indoor navigation task. Across all tasks, PI-ARS demonstrates significantly better learning efficiency and performance compared to the ARS baseline. We further validate our algorithm by demonstrating that the learned policies successfully transfer to a real quadruped robot, for example achieving a 100% success rate on a real-world stepping stone environment, dramatically improving prior results that achieved 40%.


I Introduction

Evolution Strategy (ES) optimization techniques have received increasing interest in recent years within the robotics and deep reinforcement learning (DRL) communities [33, 24, 46, 38, 47, 7, 16]. ES algorithms have been shown to be competitive alternatives to commonly used gradient-based DRL algorithms such as PPO [37] and SAC [14], while also enjoying the benefits of massive parallelism, simple implementation, effective parameter-space exploration, and faster training time [33].

Despite the promising progress of ES algorithms, they nevertheless exhibit key limitations when compared to gradient-based DRL algorithms. Namely, unlike gradient-based methods, ES methods scale poorly to high-dimensional search spaces, commonly encountered when using high-capacity modern neural network architectures [29, 33]. An important and exemplary task is visual-locomotion [46], in which a legged robot relies on its vision input to decide where to precisely place its feet to navigate uneven terrains. Due to the rich and diverse sensor observations as well as complex robot dynamics, learning such a task requires the use of deep convolutional neural networks (CNNs) with a large number of learnable parameters, thus exacerbating the sample complexity of ES methods.

In this paper, we develop Predictive Information Augmented Random Search (PI-ARS) to relieve this key bottleneck of ES algorithms. Our key insight is to leverage the power of gradient-based and gradient-free learning together by modularizing the learning agent into two components: (1) an encoder network mapping high-dimensional and diverse observation inputs to a concise fixed-length vector representation, and (2) a smaller policy network that maps the compressed representations to actions. For (1), we leverage the power of gradient-based learning and use a self-supervised objective based on maximizing the predictive information (PI) of the output representation, inspired by previous work in representation learning [30, 13, 22, 39, 45, 5]. Meanwhile, for (2), we leverage the simplicity and parallelizability of the ES optimization method Augmented Random Search (ARS) [24]. By decoupling representation learning from policy optimization in this way, we avoid scalability issues while fully leveraging the advantages of ES methods.

We evaluate our proposed PI-ARS algorithm on a variety of visual-locomotion tasks, both in simulation and on a quadruped robot. In these tasks, the robot is evaluated on its ability to walk on uneven stepping stones, quincuncial piles, and moving platforms, as well as to complete an indoor navigation task (Figure 2). Through extensive experimentation in simulation, we find that PI-ARS significantly outperforms the baselines (ARS [46], SAC [14], PI-SAC [22]), both in training speed and in final performance. We further validate the results by deploying the learned policies on a real quadruped robot. Using the same physical setup as prior work [46], PI-ARS learns more robust policies that can consistently finish the entire course of stepping stones, achieving 100% success over 10 real-robot trials, compared to the 40% success rate of prior work. We observe similarly robust real-world transfer for the indoor navigation policy.

In summary, the contributions of this paper are the following:

  1. We propose a new PI-ARS algorithm that combines the advantages of gradient-based self-supervised representation learning and gradient-free policy learning, thus solving a key bottleneck of ES algorithms.

  2. We apply PI-ARS to visual-locomotion tasks, significantly improving the state-of-the-art [46] both in simulation and in the real world.

II Related Work

II-A Evolution Strategy for RL

There have been numerous works that demonstrate the effectiveness of applying Evolution Strategy (ES) algorithms to continuous control problems [40, 33, 24]. For example, Tan et al. applied NeuroEvolution of Augmenting Topologies (NEAT) to optimize a character to perform bicycle stunts [40]. Within the field of deep reinforcement learning, Salimans et al. first demonstrated that ES algorithms can be applied to train successful neural-network policies for the OpenAI Gym control tasks [33]. Mania et al. introduced Augmented Random Search (ARS), a simple yet effective ES algorithm that further improves the learning efficiency for robotic control tasks [24].

Compared to gradient-based RL algorithms, ES algorithms can handle non-differentiable dynamics and objective functions, explore effectively with sparse or delayed rewards, and are naturally parallelizable. As such, researchers have applied ES algorithms in a variety of applications such as legged locomotion [46, 17], power grid control [16], and mixed autonomy traffic [43]. However, as ES algorithms do not leverage backpropagation, they suffer from low sample efficiency and may not scale well to complex high-dimensional problems [33]. As a result, applying ES to learn vision-based robotic control policies has rarely been explored [46].

II-B Predictive Representations for RL

Our work relies on learning representations that are predictive of future events. Prior work has shown benefit from having good models of the past and future states [34, 35, 36]. More recently, using these principles to guide state representation learning methods has been demonstrated to yield favorable performance both in practice [22, 45] and in theory [28, 44]. A natural approach to learning such representations is using generative models to explicitly predict observations [13, 15], which could be challenging and expensive for high-dimensional tasks. Alternatively, using variational forms [32] of the predictive information [2], commonly leading to contrastive objectives, makes learning such representations more tractable [22, 45, 30]. In this work, we take the contrastive approach to learn predictive representations of the observed states, upon which we learn an optimal policy with augmented random search (ARS) [24], an ES method. To the best of our knowledge, the proposed learning system is the first to take such a combined approach, and the first to apply predictive information representations on visual locomotion tasks with legged robots.

II-C Visual-Locomotion for Legged Robots

Visual-locomotion is an important research direction that has received much attention in the robotics research community [26, 23, 42, 9, 18, 12, 11, 27, 19]. Directly training a visual-locomotion controller is challenging due to the high dimensional visual input and the highly nonlinear dynamics of legged robots [46, 25]. Many existing methods manually and explicitly decouple the problem into more manageable components including perception [9, 19], motion planning [18, 31], and whole body control [8, 20, 3]. In this work, we consider direct learning of visual-locomotion controllers for quadruped robots as the test-bed for our proposed learning algorithm and demonstrate that by combining gradient-free ES and gradient-based representation learning techniques, we can enable more effective and efficient learning of visual-locomotion controllers.

III Method

In this section we describe our method, PI-ARS, which combines representation learning based on predictive information (PI) with Augmented Random Search (ARS) [24]. See Figure 1 for a diagram of the algorithm and Algorithm 1 for pseudocode.

III-A Problem Formulation and Notation

PI-ARS solves a sequential decision-making environment in which, at each timestep t, the agent is presented with an observation o_t. In visual-locomotion, this observation typically includes a visual input (e.g., depth camera images) as well as proprioceptive states. The agent's policy determines its behavior by providing a mapping from observations o_t to actions a_t. After applying action a_t, the agent receives a reward r_t and a new observation o_{t+1}. This process is repeated until the agent is terminated, either due to a timeout or to encountering a terminal condition (e.g., the robot falls). The agent's return is computed as the sum of rewards over an entire episode, and the agent's goal is to maximize the expected value of the return.
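
The return computation above can be sketched concretely; the toy environment and policy below are illustrative stand-ins, not part of the paper's setup:

```python
import numpy as np

class DummyEnv:
    """Toy stand-in environment: observations are random vectors,
    reward is 1.0 per step, and episodes last exactly 5 steps."""
    def reset(self):
        self.t = 0
        return np.zeros(4)
    def step(self, action):
        self.t += 1
        obs = np.random.randn(4)
        reward = 1.0
        done = self.t >= 5  # timeout-style termination
        return obs, reward, done

def episode_return(env, policy):
    """Roll out one episode and sum the per-step rewards."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total

ret = episode_return(DummyEnv(), policy=lambda obs: np.zeros(2))
# With the dummy env above, the return is 5.0 (five steps of reward 1).
```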

Iii-B Predictive Information

A good observation encoder for policy learning must provide representations that are both compressive – so that ARS learning is focused on far fewer parameters than learning from raw observations would entail – and task-relevant – so that ARS has access to all features necessary for learning optimal behavior. To this end, we propose to learn an encoder that maximizes predictive information (PI). In general, PI refers to the mutual information between the past and the future [2]. In our setting, which involves environment-produced sub-trajectories, the past corresponds to the observations and actions seen so far, and the future refers to both the per-step rewards and the final visual observation of the sub-trajectory (this empirical choice of future works well in our setting).

We use the observation encoder to map both the past and the future to lower-dimensional representations. Namely, as shown in Figure 1, the observation encoder contains a vision encoder that maps visual observations to a 128-d representation. This representation is subsequently concatenated with the proprioceptive states, and the concatenation is projected to the encoder output, which is also 128-d. We thus use the entire encoder to encode the past, while using only the vision encoder to encode the future visual observation. By learning to maximize the mutual information between these two representations, we ensure that the encoding of the past contains the information necessary to predict the future. Notably, previous work has shown that representations that are predictive of the future are also provably beneficial for solving the task [28].
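
A minimal sketch of this two-branch encoder, with random weights standing in for learned parameters; all dimensions other than the 128-d outputs are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    # Random weights stand in for gradient-learned parameters in this sketch.
    return rng.normal(scale=0.1, size=(in_dim, out_dim))

# Hypothetical dimensions: flattened depth image and proprioceptive state.
IMG_DIM, PROPRIO_DIM, REPR_DIM = 32 * 32, 20, 128

W_vision = linear(IMG_DIM, REPR_DIM)                      # vision encoder
W_project = linear(REPR_DIM + PROPRIO_DIM, REPR_DIM)      # projection head

def encode_vision(depth_image):
    """Vision encoder: depth image -> 128-d representation."""
    return np.tanh(depth_image.ravel() @ W_vision)

def encode_observation(depth_image, proprio):
    """Full encoder: concatenate vision features with proprioceptive
    states, then project down to the 128-d encoder output."""
    z_vis = encode_vision(depth_image)
    return np.tanh(np.concatenate([z_vis, proprio]) @ W_project)

z_past = encode_observation(rng.normal(size=(32, 32)), rng.normal(size=20))
z_future = encode_vision(rng.normal(size=(32, 32)))  # future uses the vision branch only
assert z_past.shape == z_future.shape == (128,)
```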

To learn the encoder, we use a combination of two objectives, the first corresponding to reward prediction and the second to prediction of the future visual representation. (In our early experiments, we found that learning with reward prediction alone leads to insufficient representations for solving the tasks.) For the former, we simply predict rewards using an RNN over the encoded inputs and actions. At time t, the RNN cell takes a latent state and an action, and outputs a reward prediction r̂_t and the next latent state. The reward loss is defined as

L_reward = Σ_t ( r̂_t − r_t )²    (1)

This corresponds to maximizing the predictive information about the future rewards with a generative model [10].

For the prediction of the future visual representation, rather than using a generative model as we do for reward prediction, which would present challenges for high-dimensional image observations, we leverage InfoNCE, a contrastive variational bound on mutual information [30, 32, 22]. We use an auxiliary learned, scalar-valued function f to estimate the mutual information from samples of sub-trajectories. Specifically, we conveniently exclude the course of actions and choose the following form:

I(X; Y) ≥ E[ log ( e^{f(x, y)} / ( (1/K) Σ_{k=1}^{K} e^{f(x, y_k)} ) ) ]    (2)

where each y_k is the representation of an observation randomly sampled independently of x. Our objective is then to maximize this variational form with respect to both the encoder and f.

To parameterize f, we use an MLP to map the encoded past to a 128-d vector. Meanwhile, we map the encoded future to another 128-d vector using a second MLP. The function f is then computed as the scalar dot product of the two vectors.
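
A sketch of the resulting contrastive objective, using in-batch negatives and a dot-product score as described above (array shapes and the batch size are illustrative):

```python
import numpy as np

def info_nce_loss(z_past, z_future):
    """Contrastive InfoNCE loss over a batch.

    z_past, z_future: (batch, dim) embeddings; row i of each is a
    positive pair, and all other rows serve as in-batch negatives.
    The score f is the dot product of the two embeddings.
    """
    scores = z_past @ z_future.T                   # (batch, batch) score matrix
    scores -= scores.max(axis=1, keepdims=True)    # stabilize the softmax
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal; maximize their log-probability.
    return -np.mean(np.diag(log_softmax))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 128))
# Perfectly aligned pairs should score much better than shuffled pairs.
aligned = info_nce_loss(10 * z, z)
shuffled = info_nce_loss(10 * z, z[::-1].copy())
assert aligned < shuffled
```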

We train both the reward objective and the variational objective using batch samples from a replay buffer of sub-trajectories collected by ARS. To approximate the expectation over negative samples in Equation (2), we use samples from within the same batch of sub-trajectories. The full objective is optimized using the Adam optimizer [21], and the gradient is calculated using back-propagation. Full implementation details are included in the Appendix.

Initialize encoder φ, auxiliary networks (RNN, f), and the PI optimizer.
Initialize ARS policy parameters θ and step size α.
Initialize replay buffer D.
for each iteration do
     #### ARS ####
     Sample perturbations δ_1, …, δ_N from a standard normal with scale ν.
     Collect environment trajectories for the perturbed policies θ ± ν·δ_j.
     Compute returns R(θ ± ν·δ_j) for each perturbed policy.
     Compute the gradient estimate from the b best-performing directions:
     g = (1 / (b·σ_R)) Σ_{j=1}^{b} [ R(θ + ν·δ_j) − R(θ − ν·δ_j) ] · δ_j.
     Update θ ← θ + α·g.
     Add the collected trajectories to D.
     #### PI ####
     Sample a batch of sub-trajectories from D.
     Compute the reward loss and the contrastive loss, Equations (1) and (2).
     Compute the total loss as their sum.
     Update φ and the auxiliary networks with respect to the total loss.
end for
Algorithm 1: Pseudocode for PI-ARS.
Fig. 2: The environments that we use to benchmark the PI-ARS learning system: (a) uneven stepping stones, (b) quincuncial piles, (c) moving platforms (the afterimage indicates that the platforms are moving), and (d) indoor navigation.
Fig. 3: Simulation results. We compare the performance of PI-ARS to ARS during training on four challenging simulation environments. PI-ARS consistently and significantly outperforms ARS.

III-C PI-ARS

The encoder learned by PI maps a high-dimensional observation to a concise 128-d representation, upon which we use ARS to train a more compact policy network as follows. At each iteration of ARS, the algorithm samples N perturbations δ_1, …, δ_N of the policy weights θ from a standard normal distribution with scale ν. The algorithm evaluates the policy returns at θ + ν·δ_j and θ − ν·δ_j. ARS then computes an estimate of the policy gradient by aggregating the returns from the best-performing perturbation directions:

θ ← θ + (α / (b·σ_R)) Σ_{j=1}^{b} [ R(θ + ν·δ_j) − R(θ − ν·δ_j) ] · δ_j    (3)

where α is the update coefficient, b is the number of top-performing perturbations to be considered, σ_R is the standard deviation of the collected returns, and R(θ ± ν·δ_j) denotes the total return of the policy at perturbation ±ν·δ_j. We refer the reader to [24] for additional details.
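
A minimal sketch of one such ARS update on a toy objective; the hyperparameter values and the ranking rule (keeping the directions whose better return is highest) follow the standard ARS recipe and are illustrative, not the paper's settings:

```python
import numpy as np

def ars_step(theta, returns_fn, n_perturb=8, top_b=4, nu=0.05, alpha=0.02, rng=None):
    """One Augmented Random Search update step (sketch).

    returns_fn(theta) -> scalar return of the policy with weights theta.
    Evaluates each perturbation in both directions, keeps the top_b
    directions ranked by their better return, and steps along the
    return-difference-weighted perturbations, normalized by the
    standard deviation of the selected returns.
    """
    rng = rng if rng is not None else np.random.default_rng()
    deltas = rng.standard_normal((n_perturb, theta.size))
    r_pos = np.array([returns_fn(theta + nu * d) for d in deltas])
    r_neg = np.array([returns_fn(theta - nu * d) for d in deltas])
    order = np.argsort(np.maximum(r_pos, r_neg))[::-1][:top_b]
    sigma_r = np.concatenate([r_pos[order], r_neg[order]]).std() + 1e-8
    grad = (r_pos[order] - r_neg[order]) @ deltas[order] / (top_b * sigma_r)
    return theta + alpha * grad

# Usage: maximize a toy quadratic "return"; theta should shrink toward 0.
theta = np.ones(4)
rng = np.random.default_rng(1)
for _ in range(200):
    theta = ars_step(theta, lambda w: -np.sum(w ** 2), rng=rng)
```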

We iterate between updating the representation network with the PI objective and updating the policy with ARS. To maximize data re-use, we store the sampled trajectories from perturbed policies evaluated by ARS in a replay buffer used for the PI learning pipeline.

IV Experiments

We aim to answer the following questions in our experiments:

  • Is our proposed algorithm, PI-ARS, able to learn vision-based policies that solve challenging visual-locomotion tasks?

  • Does PI-ARS achieve better performance than alternative methods that do not apply representation learning?

  • Are our learned policies applicable to real robots?

IV-A Visual-Locomotion Tasks

To answer the above questions, we design a variety of challenging visual locomotion tasks. Figure 2 shows the suite of environments that we evaluate on. More details of each environment can be found in Section -B.

Uneven stepping stones

In this task, the robot must walk over a series of randomly placed stepping stones separated by gaps, where the elevation of the stones changes dramatically (Figure 2(a)).

Quincuncial piles

This task extends the uneven stepping stones by reducing the contact surface area and arranging stones in both the forward and lateral directions (Figure 2(b)).

Moving platforms

We construct a set of stepping stones and allow each piece to periodically move either horizontally or vertically at a random speed (Figure 2(c)).

Indoor navigation with obstacles

In this task, we evaluate the performance of PI-ARS in controlling the robot to navigate a cluttered indoor environment (Figure 2(d)). Specifically, we randomly place boxes on the floor of a scanned indoor environment and command the robot to walk to a target position.

IV-B Experiment Setup

We use the Unitree Laikago quadruped robot [41], which weighs about 22 kg and has 12 actuated joints, with two depth cameras installed: one Intel D435 in the front for a wider field of view and one Intel L515 on the belly for better close-range depth quality. We create a corresponding simulated Laikago robot in the PyBullet physics simulator [6] with physical properties from the hardware specification, and simulated cameras that match the camera intrinsics and extrinsics of the real cameras. The observation, action, and reward designs are detailed as follows.

IV-B1 Observation Space

We design the observation space in our visual-locomotion tasks following prior work by Yu et al. [46]. In particular, our observation space consists of two parts: the visual observations, i.e., the two images from the depth sensors, and the proprioceptive states, which include all robot states and controller states. In our experiments, the proprioceptive states include the CoM height, roll, and pitch, the estimated CoM velocity, the gyroscope readings, the robot's feet positions in the base frame, the feet contact states, the phase of each leg in its respective gait cycle, and the previous action.

For the indoor navigation task, we additionally include the relative goal vector, i.e., the target location relative to the robot's position, as part of the observation.

IV-B2 Action Space

Following prior work [46], we use a hierarchical design for the visual-locomotion controller: a trainable high-level vision policy maps visual and proprioceptive input to a high-level motion command, and an MPC-based low-level motion controller executes the high-level motion command with trajectory optimization. The high-level motion command, i.e., the action space for the RL problem, consists of the desired CoM pose and velocity, each foot's target landing position, and the peak height of each foot's swing trajectory.

IV-B3 Reward Function

For training a policy to walk on different terrains, we use the following reward function:

r_t = min(v_x, v̄) − w·|ψ|    (4)

where v_x is the CoM velocity in the forward direction and ψ is the base yaw angle. The first term rewards the robot for moving forward, with a maximum speed controlled by v̄; the second term, weighted by w, encourages the robot to walk straight.
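
A sketch of this walking reward; the speed cap and the yaw-penalty weight are placeholder values, not the paper's:

```python
def locomotion_reward(v_x, yaw, v_max=1.0, yaw_weight=0.5):
    """Walking reward sketch: forward speed capped at v_max, minus a
    penalty for deviating from a straight heading. v_max and
    yaw_weight are hypothetical values, not from the paper."""
    return min(v_x, v_max) - yaw_weight * abs(yaw)

# Faster forward progress helps only up to the cap; turning is penalized.
assert locomotion_reward(0.5, 0.0) < locomotion_reward(0.8, 0.0)
assert locomotion_reward(1.5, 0.0) == locomotion_reward(2.0, 0.0)
assert locomotion_reward(0.8, 0.3) < locomotion_reward(0.8, 0.0)
```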

For the indoor navigation task, we use the delta geodesic (path) distance to the goal as our reward:

r_t = d_{t−1} − d_t    (5)

where d_t is the geodesic distance between the robot and the target location at time t.

IV-B4 Early Termination

A training episode is terminated if: 1) the robot loses balance (in our experiments, the CoM height falls below 0.15 m, or the pitch or roll exceeds a threshold), or 2) the robot reaches an invalid joint configuration, e.g., a knee bending backwards.
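
The termination check can be sketched as follows; the pitch and roll limits are placeholders, since only the 0.15 m height threshold survives in the text:

```python
# Thresholds: the paper specifies the 0.15 m height limit; the pitch and
# roll limits below are assumed placeholder values.
MIN_COM_HEIGHT = 0.15
MAX_PITCH = 0.6   # rad (assumed)
MAX_ROLL = 0.6    # rad (assumed)

def should_terminate(com_height, pitch, roll, joints_valid=True):
    """Early-termination check: end the episode if the robot loses
    balance or reaches an invalid joint configuration."""
    lost_balance = (com_height < MIN_COM_HEIGHT
                    or abs(pitch) > MAX_PITCH
                    or abs(roll) > MAX_ROLL)
    return lost_balance or not joints_valid

assert should_terminate(0.10, 0.0, 0.0)                      # fell: CoM too low
assert should_terminate(0.3, 0.0, 0.0, joints_valid=False)   # invalid joints
assert not should_terminate(0.3, 0.1, 0.1)                   # nominal walking
```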

IV-C Learning in Simulation

In this subsection, we discuss the results of PI-ARS on simulated visual-locomotion tasks and compare against a state-of-the-art ARS approach to robotic visual-locomotion [24, 46] (Figure 3). Other baseline approaches that we tried include SAC [14] and PI-SAC [22], but both algorithms failed to make any non-negligible learning progress on the tasks we consider despite extensive hyperparameter tuning, so we omit them from the results. For a fair comparison, all algorithms utilize the same MPC-based locomotion controller and learn policies in the high-level command space described in Section IV-B2. All policies utilize the same network architecture; i.e., the policy learned by the baseline ARS method is composed of the same set of convolutional and feed-forward layers used in PI-ARS.

We train PI-ARS and ARS policies using a distributed implementation. For all PI-ARS and ARS experiments, we perform a fixed number of perturbations per ARS iteration and use the top 50% of performers to update the policy network head. This choice, determined through a grid search, empirically works best for both PI-ARS and ARS in our implementation; further increasing the number of perturbations (and thus the computation cost) does not significantly improve performance. The algorithm is run until convergence, subject to a maximum number of training iterations and, correspondingly, a maximum number of simulation episodes per trial. We perform several trials of training PI-ARS/ARS with uniformly sampled step-size and perturbation-scale values, and report the mean and standard error of returns against the number of training episodes for each task.

As we demonstrate in the supplementary video, PI-ARS is able to learn vision-based policies that successfully solve these challenging visual-locomotion tasks. Figure 3 shows that on all tasks, PI-ARS achieves significantly better returns and sample-efficiency than the ARS baseline. For example, on uneven stepping stones, the mean return after 2,000,000 episodes of training improves by 48.01%, from 2.93 to 4.34. This empirically demonstrates the effectiveness of learning ARS policies on top of compressed, gradient-learned representations instead of end-to-end. On the other hand, observing that SAC fails to learn, we hypothesize that the advantages of ARS, such as parameter-space exploration and stability, are critical to these complex visual-locomotion tasks. Furthermore, adding predictive information to SAC, i.e., PI-SAC, does not improve learning, suggesting that even with an effective representation learner, a learning algorithm without a powerful policy solver cannot sufficiently tackle these visual-locomotion tasks.

Fig. 4: PI-ARS policy solving a challenging real-world visual-locomotion task involving a series of four stepping stones separated by gaps. PI-ARS successfully completes this terrain, avoiding all gaps, 100% of the time measured over 10 trials.
Fig. 5: PI-ARS policy learns to navigate in a cluttered real-world indoor environment.

IV-D Validation on a Real Robot

We deploy the visual-locomotion policy trained in simulation on a Laikago robot to perform two visual-locomotion tasks: 1) walking over real-world stepping stones (Figure 4), and 2) navigating in an indoor environment with obstacles (Figure 5).

To overcome the sim-to-real gap, we adopt the same procedure as Yu et al. [46]. For the visual gap, during training we first apply random noise to the simulated depth images to mimic real-world depth noise. Then we apply a Navier-Stokes-based in-painting operation [1] with a radius of 1 to fill the missing pixels, followed by down-sampling to the policy's input resolution (with OpenCV's interpolation method for resizing [4]). On the real hardware, we obtain raw depth images from both the L515 and D435 cameras and perform the same in-painting and down-sampling. To mitigate the dynamics gap, we apply dynamics randomization during training.

Videos of our real-world experiments can be found in the supplementary material.

Stepping Stones

For the stepping stones task, we created a physical setup consisting of four stones separated by three gaps (Figure 4). The PI-ARS policy is learned in simulation with an easier version of uneven stepping stones where stone heights change less significantly. Our PI-ARS policy was able to achieve a 100% success rate on the stepping stone environment over 10 trials. In contrast, the ARS baseline [46] with the same training and evaluation setting achieved a 40% success rate for reaching the last stone with all four legs, and often failed at the last gap.

Indoor Navigation

For evaluating the navigation task in the real world, we design a route in an indoor environment with obstacles (Figure 5). The robot needs to navigate to the target location while avoiding the obstacles. To enable the robot to better avoid the obstacles, we rotate the front camera of the robot so that it can see 3 meters ahead. We also track the robot base position using a motion capture system, which is needed to compute the relative goal vector. As shown in the supplementary video, our PI-ARS policy is able to successfully navigate to the designated target location. For the setting shown in Figure 5, our policy discovered a 'shortcut' between two obstacles and was able to go through. We do note a collision with one obstacle's arm; this is because the robot was trained only with simulated box-shaped obstacles, and further training with more diverse obstacles would likely mitigate this problem.

Overall, these experiments validate that PI-ARS is capable of learning policies that can transfer to real robots.

V Conclusion

We present a new learning method, PI-ARS, and apply it to the visual-locomotion problem. PI-ARS combines gradient-based representation learning with gradient-free policy optimization to leverage the advantages of both. PI-ARS enjoys the simplicity and scalability of gradient-free methods, and it relieves a key bottleneck of ES algorithms on high-dimensional problems by simultaneously learning a low-dimensional representation that reduces the search space. We evaluate our method on a set of challenging visual-locomotion tasks, including navigating through uneven stepping stones, quincuncial piles, moving platforms, and cluttered indoor environments, on which PI-ARS significantly outperforms the state-of-the-art. Furthermore, we validate the policy learned by PI-ARS on a real quadruped robot. It enables the robot to walk over randomly-placed stepping stones and to navigate in an indoor space with obstacles. In the future, we plan to test PI-ARS on outdoor visual-locomotion tasks, which present more diverse and interesting terrains for the robot to overcome.

Acknowledgments

We thank Noah Brown, Gus Kouretas, and Thinh Nguyen for helping set up the real-world stepping stones and address robot hardware issues.

References

  • [1] M. Bertalmio, A. L. Bertozzi, and G. Sapiro (2001) Navier-Stokes, fluid dynamics, and image and video inpainting. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Vol. 1, pp. I–I. Cited by: §IV-D.
  • [2] W. Bialek and N. Tishby (1999) Predictive information. arXiv preprint cond-mat/9902341. Cited by: §II-B, §III-B.
  • [3] G. Bledt, P. M. Wensing, and S. Kim (2017) Policy-regularized model predictive control to stabilize diverse quadrupedal gaits for the mit cheetah. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4102–4109. Cited by: §II-C.
  • [4] G. Bradski (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools. Cited by: §IV-D.
  • [5] X. Chen, S. Toyer, C. Wild, S. Emmons, I. Fischer, K. Lee, N. Alex, S. H. Wang, P. Luo, S. Russell, et al. (2021) An empirical investigation of representation learning for imitation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), Cited by: §I.
  • [6] E. Coumans and Y. Bai (2017) PyBullet, a Python module for physics simulation in robotics, games and machine learning. Cited by: §IV-B.
  • [7] A. Cully, J. Clune, D. Tarapore, and J. Mouret (2015) Robots that can adapt like animals. Nature 521 (7553), pp. 503–507. Cited by: §I.
  • [8] J. Di Carlo, P. M. Wensing, B. Katz, G. Bledt, and S. Kim (2018) Dynamic locomotion in the mit cheetah 3 through convex model-predictive control. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vol. , pp. 1–9. External Links: Document Cited by: §II-C.
  • [9] P. Fankhauser, M. Bjelonic, C. D. Bellicoso, T. Miki, and M. Hutter (2018) Robust rough-terrain locomotion with a quadrupedal robot. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 5761–5768. Cited by: §II-C.
  • [10] I. Fischer (2020) The conditional entropy bottleneck. Entropy 22 (9), pp. 999. Cited by: §III-B.
  • [11] S. Gangapurwala, M. Geisert, R. Orsolino, M. Fallon, and I. Havoutis (2020) RLOC: terrain-aware legged locomotion using reinforcement learning and optimal control. arXiv preprint arXiv:2012.03094. Cited by: §II-C.
  • [12] R. Grandia, A. J. Taylor, A. D. Ames, and M. Hutter (2020) Multi-layered safety for legged robots via control barrier functions and model predictive control. arXiv preprint arXiv:2011.00032. Cited by: §II-C.
  • [13] D. Ha and J. Schmidhuber (2018) World models. arXiv preprint arXiv:1803.10122. Cited by: §I, §II-B.
  • [14] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. (2018) Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905. Cited by: §I, §I, §IV-C.
  • [15] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi (2019) Dream to control: learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603. Cited by: §II-B.
  • [16] R. Huang, Y. Chen, T. Yin, X. Li, A. Li, J. Tan, W. Yu, Y. Liu, and Q. Huang (2020) Accelerated deep reinforcement learning based load shedding for emergency voltage control. arXiv preprint arXiv:2006.12667. Cited by: §I, §II-A.
  • [17] D. Jain, A. Iscen, and K. Caluwaerts (2019) Hierarchical reinforcement learning for quadruped locomotion. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7551–7557. Cited by: §II-A.
  • [18] F. Jenelten, T. Miki, A. E. Vijayan, M. Bjelonic, and M. Hutter (2020) Perceptive locomotion in rough terrain–online foothold optimization. IEEE Robotics and Automation Letters 5 (4), pp. 5370–5376. Cited by: §II-C.
  • [19] D. Kim, D. Carballo, J. Di Carlo, B. Katz, G. Bledt, B. Lim, and S. Kim (2020) Vision aided dynamic exploration of unstructured terrain with a small-scale quadruped robot. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 2464–2470. Cited by: §II-C.
  • [20] D. Kim, J. Di Carlo, B. Katz, G. Bledt, and S. Kim (2019) Highly dynamic quadruped locomotion via whole-body impulse control and model predictive control. arXiv preprint arXiv:1909.06586. Cited by: §II-C.
  • [21] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §III-B.
  • [22] K. Lee, I. Fischer, A. Liu, Y. Guo, H. Lee, J. Canny, and S. Guadarrama (2020) Predictive information accelerates learning in rl. Advances in Neural Information Processing Systems 33, pp. 11890–11901. Cited by: §-A5, §I, §I, §II-B, §III-B, §IV-C.
  • [23] O. A. V. Magana, V. Barasuol, M. Camurri, L. Franceschi, M. Focchi, M. Pontil, D. G. Caldwell, and C. Semini (2019) Fast and continuous foothold adaptation for dynamic locomotion through CNNs. IEEE Robotics and Automation Letters 4 (2), pp. 2140–2147. Cited by: §II-C.
  • [24] H. Mania, A. Guy, and B. Recht (2018) Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055. Cited by: §I, §I, §II-A, §II-B, §III-C, §III, §IV-C.
  • [25] G. B. Margolis, T. Chen, K. Paigwar, X. Fu, D. Kim, S. Kim, and P. Agrawal (2021) Learning to jump from pixels. arXiv preprint arXiv:2110.15344. Cited by: §II-C.
  • [26] C. Mastalli, M. Focchi, I. Havoutis, A. Radulescu, S. Calinon, J. Buchli, D. G. Caldwell, and C. Semini (2017) Trajectory and foothold optimization using low-dimensional models for rough terrain locomotion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1096–1103. Cited by: §II-C.
  • [27] T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter (2022) Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics 7 (62), pp. eabk2822. Cited by: §II-C.
  • [28] O. Nachum and M. Yang (2021) Provable representation learning for imitation with contrastive fourier features. Advances in Neural Information Processing Systems 34. Cited by: §II-B, §III-B.
  • [29] Y. Nesterov and V. Spokoiny (2017) Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17 (2), pp. 527–566. Cited by: §I.
  • [30] A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §I, §II-B, §III-B.
  • [31] H. W. Park, P. M. Wensing, and S. Kim (2015) Online planning for autonomous running jumps over obstacles in high-speed quadrupeds. In 2015 Robotics: Science and Systems Conference, RSS 2015, Cited by: §II-C.
  • [32] B. Poole, S. Ozair, A. Van Den Oord, A. Alemi, and G. Tucker (2019) On variational bounds of mutual information. In International Conference on Machine Learning, pp. 5171–5180. Cited by: §II-B, §III-B.
  • [33] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever (2017) Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864. Cited by: §I, §I, §II-A, §II-A.
  • [34] J. Schmidhuber (1990) Making the world differentiable: on using self-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical report, Institut für Informatik, Technische Universität München. Cited by: §II-B.
  • [35] J. Schmidhuber (1991) Reinforcement learning in Markovian and non-Markovian environments. In Advances in Neural Information Processing Systems, pp. 500–506. Cited by: §II-B.
  • [36] J. Schmidhuber (2015) On learning to think: algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models. arXiv preprint arXiv:1511.09249. Cited by: §II-B.
  • [37] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §I.
  • [38] X. Song, Y. Yang, K. Choromanski, K. Caluwaerts, W. Gao, C. Finn, and J. Tan (2020) Rapidly adaptable legged robots via evolutionary meta-learning. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3769–3776. Cited by: §I.
  • [39] A. Srinivas, M. Laskin, and P. Abbeel (2020) CURL: contrastive unsupervised representations for reinforcement learning. arXiv preprint arXiv:2004.04136. Cited by: §I.
  • [40] J. Tan, Y. Gu, C. K. Liu, and G. Turk (2014) Learning bicycle stunts. ACM Trans. Graph. 33 (4), pp. 50:1–50:12. Cited by: §II-A.
  • [41] Unitree Robotics. Cited by: §IV-B.
  • [42] O. Villarreal, V. Barasuol, P. M. Wensing, D. G. Caldwell, and C. Semini (2020) MPC-based controller with terrain insight for dynamic legged locomotion. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 2436–2442. Cited by: §II-C.
  • [43] E. Vinitsky, A. Kreidieh, L. Le Flem, N. Kheterpal, K. Jang, C. Wu, F. Wu, R. Liaw, E. Liang, and A. M. Bayen (2018) Benchmarks for reinforcement learning in mixed-autonomy traffic. In Conference on robot learning, pp. 399–409. Cited by: §II-A.
  • [44] M. Yang, S. Levine, and O. Nachum (2021) TRAIL: near-optimal imitation learning with suboptimal data. arXiv preprint arXiv:2110.14770. Cited by: §II-B.
  • [45] M. Yang and O. Nachum (2021) Representation matters: offline pretraining for sequential decision making. In International Conference on Machine Learning, pp. 11784–11794. Cited by: §I, §II-B.
  • [46] W. Yu, D. Jain, A. Escontrela, A. Iscen, P. Xu, E. Coumans, S. Ha, J. Tan, and T. Zhang (2021) Visual-locomotion: learning to walk on complex terrains with vision. In 5th Annual Conference on Robot Learning, Cited by: §-B, item 2, §I, §I, §I, §II-A, §II-C, §IV-B1, §IV-B2, §IV-C, §IV-D, §IV-D.
  • [47] W. Yu, J. Tan, Y. Bai, E. Coumans, and S. Ha (2020) Learning fast adaptation with meta strategy optimization. IEEE Robotics and Automation Letters 5 (2), pp. 2950–2957. Cited by: §I.