Learning Agile Robotic Locomotion Skills by Imitating Animals

04/02/2020 ∙ by Xue Bin Peng, et al. ∙ Google ∙ UC Berkeley

Reproducing the diverse and agile locomotion skills of animals has been a longstanding challenge in robotics. While manually-designed controllers have been able to emulate many complex behaviors, building such controllers involves a time-consuming and difficult development process, often requiring substantial expertise in the nuances of each skill. Reinforcement learning provides an appealing alternative for automating the manual effort involved in the development of controllers. However, designing learning objectives that elicit the desired behaviors from an agent can also require a great deal of skill-specific expertise. In this work, we present an imitation learning system that enables legged robots to learn agile locomotion skills by imitating real-world animals. We show that by leveraging reference motion data, a single learning-based approach is able to automatically synthesize controllers for a diverse repertoire of behaviors for legged robots. By incorporating sample-efficient domain adaptation techniques into the training process, our system is able to learn adaptive policies in simulation that can then be quickly adapted for real-world deployment. To demonstrate the effectiveness of our system, we train an 18-DoF quadruped robot to perform a variety of agile behaviors ranging from different locomotion gaits to dynamic hops and turns.




I Introduction

Animals can traverse complex environments with remarkable agility, bringing to bear broad repertoires of agile and acrobatic skills. Reproducing such agile behaviors has been a long-standing challenge in robotics, with a large body of work devoted to designing control strategies for various locomotion skills [37, 49, 54, 18, 3]. However, designing control strategies often involves a lengthy development process, and requires substantial expertise in both the underlying system and the desired skills. Despite the many successes in this domain, the capabilities achieved by these systems are still far from the fluid and graceful motions seen in the animal kingdom.

Learning-based approaches offer the potential to improve the agility of legged robots, while also automating a substantial portion of the manual effort involved in the development of controllers. In particular, reinforcement learning (RL) can be an effective and general approach for developing controllers that can perform a wide range of sophisticated skills [7, 43, 25, 44, 34]. While these methods have demonstrated promising results in simulation, agents trained through RL are prone to adopting unnatural behaviors that are dangerous or infeasible when deployed in the real world. Furthermore, designing reward functions that elicit the desired behaviors can itself require a laborious task-specific tuning process.

The superior agility of animals, as compared to robots, might lead one to wonder: can we build more agile robotic controllers with less effort by directly imitating animal motions? In this work, we propose an imitation learning framework that enables legged robots to learn agile locomotion skills from real-world animals. Our framework leverages reference motion data to provide priors regarding feasible control strategies for a particular skill. The use of reference motions alleviates the need to design skill-specific reward functions, thereby enabling a common framework to learn a diverse array of behaviors. To address the high sample requirements of current RL algorithms, the initial training phase is performed in simulation. In order to transfer policies learned in simulation to the real world, we propose a sample-efficient adaptation technique, which fine-tunes the behavior of a policy using a learned dynamics representation.

The central contribution of our work is a system that enables legged robots to learn agile locomotion skills by imitating animals. We demonstrate the effectiveness of our framework on a variety of dynamic locomotion skills with the Laikago quadruped robot [61], including different locomotion gaits, as well as dynamic hops and turns. In our ablation studies, we explore the impact of different design decisions made for the various components of our system.

II Related Work

The development of controllers for legged locomotion has been an enduring subject of interest in robotics, with a large body of work proposing a variety of control strategies for legged systems [37, 49, 54, 20, 18, 64, 8, 3]. However, many of these methods require in-depth knowledge and manual engineering for each behavior, and as such, the resulting capabilities are ultimately limited by the designer’s understanding of how to model and represent agile and dynamic behaviors. Trajectory optimization and model predictive control can mitigate some of the manual effort involved in the design process, but due to the high-dimensional and complex dynamics of legged systems, reduced-order models are often needed to formulate tractable optimization problems [11, 17, 12, 2]. These simplified abstractions tend to be task-specific, and again require significant insight into the salient characteristics of each skill.

Motion imitation. Imitating reference motions provides a general approach for robots to perform a rich variety of behaviors that would otherwise be difficult to manually encode into controllers [48, 21, 55, 63]. But applications of motion imitation to legged robots have predominantly been limited to behaviors that emphasize upper-body motions, with fairly static lower-body movements, where balance control can be delegated to separate control strategies [39, 27, 30]. In contrast to physical robots, substantially more dynamic skills can be reproduced by agents in simulation [38, 33, 9, 35]. Recently, motion imitation with reinforcement learning has been effective for learning a large repertoire of highly acrobatic skills in simulation [44, 34, 45, 32]. But due to the high sample complexity of RL algorithms and other physical limitations, many of the capabilities demonstrated in simulation have yet to be replicated in the real world.

Sim-to-real transfer. The challenges of applying RL in the real world have driven the use of domain transfer approaches, where policies are first trained in simulation (source domain), and then transferred to the real world (target domain). Sim-to-real transfer can be facilitated by constructing more accurate simulations [58, 62], or adapting the simulator with real-world data [57, 23, 26, 36, 5]. However, building high-fidelity simulators remains a challenging endeavour, and even state-of-the-art simulators provide only a coarse approximation of the rich dynamics of the real world. Domain randomization can be incorporated into the training process to encourage policies to be robust to variations in the dynamics [52, 60, 47, 42, 41]. Sample-efficient adaptation techniques, such as fine-tuning [51] and meta-learning [13, 16, 6], can also be applied to further improve the performance of pre-trained policies in new domains. In this work, we leverage a class of adaptation techniques, which we broadly refer to as latent space methods [24, 65, 67], to transfer locomotion policies from simulation to the real world. During pre-training, these methods learn a latent representation of different behaviors that are effective under various scenarios. When transferring to a new domain, a search can be conducted in the latent space to find behaviors that successfully execute a desired task in the new domain. We show that by combining motion imitation and latent space adaptation, our system is able to learn a diverse corpus of dynamic locomotion skills that can be transferred to legged robots in the real world.

RL for legged locomotion. Reinforcement learning has been effective for automatically acquiring locomotion skills in simulation [44, 34, 32] and in the real world [31, 59, 14, 58, 22, 26]. Kohl and Stone [31] applied a policy gradient method to tune manually-crafted walking controllers for the Sony Aibo robot. By carefully modeling the motor dynamics of the Minitaur quadruped robot, Tan et al. [58] were able to train walking policies in simulation that can be directly deployed on a real robot. Hwangbo et al. [26] proposed learning a motor dynamics model using real-world data, which enabled direct transfer of a variety of locomotion skills to the ANYmal robot. Their system trained policies using manually-designed reward functions for each skill, which can be difficult to specify for more complex behaviors. Imitating reference motions can be a general approach for learning diverse repertoires of skills without the need to design skill-specific reward functions [35, 44, 45]. Xie et al. [62] trained bipedal walking policies for the Cassie robot by imitating reference motions recorded from existing controllers and keyframe animations. The policies are again transferred from simulation to the real world with the aid of careful system identification. Yu et al. [65] transferred bipedal locomotion policies from simulation to a physical Darwin OP2 robot using a latent space adaptation method, which mitigates the dependency on accurate simulators. In this work, we leverage a similar latent space method, but by combining it with motion imitation, our system enables real robots to perform more diverse and agile behaviors than have been demonstrated by these previous methods.

III Overview

The objective of our framework is to enable robots to learn skills from real animals. Our framework receives as input a reference motion that demonstrates a desired skill for the robot, which may be recorded using motion capture (mocap) of real animals (e.g. a dog). Given a reference motion, it then uses reinforcement learning to synthesize a policy that enables a robot to reproduce that skill in the real world. A schematic illustration of our framework is available in Figure 2. The process is organized into three stages: motion retargeting, motion imitation, and domain adaptation. 1) The reference motion is first processed by the motion retargeting stage, where the motion clip is mapped from the original subject’s morphology to the robot’s morphology via inverse-kinematics. 2) Next, the retargeted reference motion is used in the motion imitation stage to train a policy to reproduce the motion with a simulated model of the robot. To facilitate transfer to the real world, domain randomization is applied in simulation to train policies that can adapt to different dynamics. 3) Finally, the policy is transferred to a real robot via a sample efficient domain adaptation process, which adapts the policy’s behavior using a learned latent dynamics representation.

Fig. 2: The framework consists of three stages: motion retargeting, motion imitation, and domain adaptation. It receives as input motion data recorded from an animal, and outputs a control policy that enables a real robot to reproduce the motion.

IV Motion Retargeting

When using motion data recorded from animals, the subject’s morphology tends to differ from that of the robot. To address this discrepancy, the source motions are retargeted to the robot’s morphology using inverse-kinematics [19]. First, a set of source keypoints is specified on the subject’s body, which are paired with corresponding target keypoints on the robot’s body. An illustration of the keypoints is available in Figure 3. The keypoints include the positions of the feet and hips. At each timestep $t$, the source motion specifies the 3D location $\hat{x}_i(t)$ of each keypoint $i$. The corresponding target keypoint $x_i(q_t)$ is determined by the robot’s pose $q_t$, represented in generalized coordinates [15]. IK is then applied to construct a sequence of poses $q_{0:T}$ that track the keypoints at each frame,

$$\arg\min_{q_{0:T}} \sum_t \sum_i \left\| \hat{x}_i(t) - x_i(q_t) \right\|^2 + (\bar{q} - q_t)^T W (\bar{q} - q_t). \tag{1}$$

An additional regularization term is included to encourage the poses to remain similar to a default pose $\bar{q}$, and $W$ is a diagonal matrix specifying regularization coefficients for each joint.
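
As a rough illustration of this retargeting objective, the sketch below solves a single-frame version for a hypothetical planar two-link leg with one foot keypoint, using gradient descent with a finite-difference gradient. The link lengths, default pose, regularization weights, and function names are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

LINK_LENGTHS = np.array([0.25, 0.25])  # hypothetical upper/lower leg lengths (m)

def foot_position(q):
    """Forward kinematics of a planar 2-link leg; q = (hip, knee) angles in radians."""
    hip, knee = q
    x = LINK_LENGTHS[0] * np.sin(hip) + LINK_LENGTHS[1] * np.sin(hip + knee)
    y = -LINK_LENGTHS[0] * np.cos(hip) - LINK_LENGTHS[1] * np.cos(hip + knee)
    return np.array([x, y])

def ik_cost(q, target, q_default, w_diag):
    """Keypoint tracking error plus the weighted pose regularizer."""
    track = np.sum((target - foot_position(q)) ** 2)
    # (q_bar - q)^T W (q_bar - q) for a diagonal W stored as a vector
    reg = np.sum(w_diag * (q_default - q) ** 2)
    return track + reg

def retarget_frame(target, q_default, w_diag, steps=2000, lr=0.5, eps=1e-6):
    """Minimize the IK cost with central finite-difference gradient descent."""
    q = np.array(q_default, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(q)
        for i in range(q.size):
            dq = np.zeros_like(q)
            dq[i] = eps
            grad[i] = (ik_cost(q + dq, target, q_default, w_diag)
                       - ik_cost(q - dq, target, q_default, w_diag)) / (2 * eps)
        q -= lr * grad
    return q

q_default = np.array([0.3, -0.6])                 # assumed default pose
target = foot_position(np.array([0.5, -0.9]))     # a reachable source keypoint
q_star = retarget_frame(target, q_default, w_diag=np.array([1e-4, 1e-4]))
tracking_error = np.linalg.norm(foot_position(q_star) - target)
```

A full retargeting pass would repeat this per frame over all keypoints; a small regularization weight keeps the solution near the default pose without noticeably degrading tracking.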

V Motion Imitation

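We formulate motion imitation as a reinforcement learning problem. In reinforcement learning, the objective is to learn a control policy $\pi$ that enables an agent to maximize its expected return for a given task [56]. At each timestep $t$, the agent observes a state $s_t$ from the environment, and samples an action $a_t \sim \pi(a_t | s_t)$ from its policy $\pi$. The agent then applies this action, which results in a new state $s_{t+1}$ and a scalar reward $r_t = r(s_t, a_t, s_{t+1})$. Repeated applications of this process generate a trajectory $\tau = \{(s_0, a_0, r_0), (s_1, a_1, r_1), \ldots\}$. The objective then is to learn a policy that maximizes the agent’s expected return $J(\pi)$,

$$J(\pi) = \mathbb{E}_{\tau \sim p(\tau | \pi)} \left[ \sum_{t=0}^{T-1} \gamma^t r_t \right], \tag{2}$$

where $T$ denotes the time horizon of each episode, and $\gamma \in [0, 1]$ is a discount factor. $p(\tau | \pi)$ represents the likelihood of a trajectory $\tau$ under a given policy $\pi$,

$$p(\tau | \pi) = p(s_0) \prod_{t=0}^{T-1} p(s_{t+1} | s_t, a_t) \, \pi(a_t | s_t), \tag{3}$$

with $p(s_0)$ being the initial state distribution, and $p(s_{t+1} | s_t, a_t)$ representing the dynamics of the system, which determines the effects of the agent’s actions.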
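
To make the expected-return objective concrete, the following minimal sketch computes the discounted return of one episode and a Monte Carlo estimate of the expected return from a batch of sampled episodes. The function names and the toy reward sequences are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum over t of gamma^t * r_t for one episode's reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

def estimate_expected_return(episodes, gamma):
    """Monte Carlo estimate: average discounted return over sampled episodes."""
    return sum(discounted_return(ep, gamma) for ep in episodes) / len(episodes)

# Two toy episodes with hand-picked rewards.
episodes = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
estimate = estimate_expected_return(episodes, gamma=0.5)
```

With a discount factor below 1, rewards received later in the episode contribute less, which is why the second toy episode, despite its larger final reward, has the smaller return.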

Fig. 3: Inverse-kinematics (IK) is used to retarget mocap clips recorded from a real dog (left) to the Laikago robot (right). Corresponding pairs of keypoints (red) are specified on the dog and robot’s bodies, and then IK is used to compute a pose for the robot that tracks the keypoints.

To imitate a given reference motion, we follow a similar motion imitation approach as Peng et al. [44]. The inputs to the policy are augmented with an additional goal $g_t$, which specifies the motion that the robot should imitate. The policy is modeled as a feedforward network that maps a given state $s_t$ and goal $g_t$ to a distribution over actions $\pi(a_t | s_t, g_t)$. The policy is queried at 30Hz for a new action at each timestep. The state $s_t = (q_{t-2:t}, a_{t-3:t-1})$ is represented by the poses $q_{t-2:t}$ of the robot in the three previous timesteps, and the three previous actions $a_{t-3:t-1}$. The pose features $q_t$ consist of IMU readings of the root orientation (roll, pitch, yaw) and the local rotations of every joint. The root position is not included among the pose features to avoid the need to estimate the root position during real-world deployment. The goal $g_t$ specifies target poses from the reference motion at four future timesteps, spanning approximately 1 second. The action $a_t$ specifies target rotations for PD controllers at each joint. To ensure smoother motions, the PD targets are first processed by a low-pass filter before being applied on the robot [4].
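
The text does not specify which low-pass filter is applied to the PD targets. As a hedged sketch, the snippet below uses a first-order exponential smoothing filter, one simple choice of low-pass filter; the class name and the `smoothing` coefficient are assumptions.

```python
import numpy as np

class LowPassActionFilter:
    """First-order (exponential smoothing) low-pass filter over PD target angles."""

    def __init__(self, num_joints, smoothing=0.8):
        # smoothing in [0, 1): larger values give smoother, slower-changing targets.
        self.smoothing = smoothing
        self.prev = np.zeros(num_joints)
        self.initialized = False

    def __call__(self, raw_targets):
        raw_targets = np.asarray(raw_targets, dtype=float)
        if not self.initialized:
            # No history yet: pass the first target through unchanged.
            self.prev = raw_targets.copy()
            self.initialized = True
        else:
            self.prev = (self.smoothing * self.prev
                         + (1.0 - self.smoothing) * raw_targets)
        return self.prev.copy()

# Filter a step change in the targets of two joints.
action_filter = LowPassActionFilter(num_joints=2, smoothing=0.5)
first = action_filter([0.0, 0.0])
second = action_filter([1.0, 1.0])
third = action_filter([1.0, 1.0])
```

A step change in the raw policy output is spread over several control steps, which limits abrupt changes in the commanded joint targets.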

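Reward Function. The reward function encourages the policy to track the sequence of target poses from the reference motion at every timestep. The reward function is similar to the one used by Peng et al. [44], where the reward $r_t$ at each timestep is given by:

$$r_t = w^p r^p_t + w^v r^v_t + w^e r^e_t + w^{rp} r^{rp}_t + w^{rv} r^{rv}_t, \tag{4}$$
$$w^p = 0.5, \quad w^v = 0.05, \quad w^e = 0.2, \quad w^{rp} = 0.15, \quad w^{rv} = 0.1.$$

The pose reward $r^p_t$ encourages the robot to minimize the difference between the joint rotations specified by the reference motion and those of the robot. In the equation below, $\hat{q}^j_t$ represents the 1D local rotation of joint $j$ from the reference motion at time $t$, and $q^j_t$ represents the robot’s joint,

$$r^p_t = \exp\left[ -5 \sum_j \left\| \hat{q}^j_t - q^j_t \right\|^2 \right]. \tag{5}$$

Similarly, the velocity reward $r^v_t$ is calculated according to the joint velocities, with $\hat{\dot{q}}^j_t$ and $\dot{q}^j_t$ being the angular velocity of joint $j$ from the reference motion and robot respectively,

$$r^v_t = \exp\left[ -0.1 \sum_j \left\| \hat{\dot{q}}^j_t - \dot{q}^j_t \right\|^2 \right]. \tag{6}$$

Next, the end-effector reward $r^e_t$ encourages the robot to track the positions of the end-effectors, where $x^e_t$ denotes the relative 3D position of end-effector $e$ with respect to the root,

$$r^e_t = \exp\left[ -40 \sum_e \left\| \hat{x}^e_t - x^e_t \right\|^2 \right]. \tag{7}$$

Finally, the root pose reward $r^{rp}_t$ and root velocity reward $r^{rv}_t$ encourage the robot to track the reference root motion. $x^{root}_t$ and $\dot{x}^{root}_t$ denote the root’s global position and linear velocity, while $q^{root}_t$ and $\dot{q}^{root}_t$ are the rotation and angular velocity,

$$r^{rp}_t = \exp\left[ -20 \left\| \hat{x}^{root}_t - x^{root}_t \right\|^2 - 10 \left\| \hat{q}^{root}_t - q^{root}_t \right\|^2 \right], \tag{8}$$
$$r^{rv}_t = \exp\left[ -2 \left\| \hat{\dot{x}}^{root}_t - \dot{x}^{root}_t \right\|^2 - 0.2 \left\| \hat{\dot{q}}^{root}_t - \dot{q}^{root}_t \right\|^2 \right]. \tag{9}$$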
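
The structure of this reward, a weighted sum of exponentiated negative squared tracking errors, can be sketched as follows. The dictionary keys, the helper names, and the representation of the root rotation as a plain 3-vector are simplifying assumptions; the default weights and error scales are illustrative values that should be checked against the reward definitions before use.

```python
import numpy as np

# Illustrative defaults; the term weights sum to 1 so perfect tracking gives 1.
DEFAULT_WEIGHTS = {"pose": 0.5, "vel": 0.05, "ee": 0.2,
                   "root_pose": 0.15, "root_vel": 0.1}
DEFAULT_SCALES = {"pose": 5.0, "vel": 0.1, "ee": 40.0,
                  "root_pos": 20.0, "root_rot": 10.0,
                  "root_lin_vel": 2.0, "root_ang_vel": 0.2}

def sq_err(a, b):
    """Sum of squared differences between two arrays."""
    return float(np.sum((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))

def imitation_reward(ref, sim, weights=DEFAULT_WEIGHTS, scales=DEFAULT_SCALES):
    """Weighted sum of the pose, velocity, end-effector, and root tracking terms."""
    r_pose = np.exp(-scales["pose"] * sq_err(ref["joint_rot"], sim["joint_rot"]))
    r_vel = np.exp(-scales["vel"] * sq_err(ref["joint_vel"], sim["joint_vel"]))
    r_ee = np.exp(-scales["ee"] * sq_err(ref["ee_pos"], sim["ee_pos"]))
    r_root_pose = np.exp(
        -scales["root_pos"] * sq_err(ref["root_pos"], sim["root_pos"])
        - scales["root_rot"] * sq_err(ref["root_rot"], sim["root_rot"]))
    r_root_vel = np.exp(
        -scales["root_lin_vel"] * sq_err(ref["root_lin_vel"], sim["root_lin_vel"])
        - scales["root_ang_vel"] * sq_err(ref["root_ang_vel"], sim["root_ang_vel"]))
    return float(weights["pose"] * r_pose + weights["vel"] * r_vel
                 + weights["ee"] * r_ee + weights["root_pose"] * r_root_pose
                 + weights["root_vel"] * r_root_vel)

# A toy frame for a robot with 12 joints and 4 end-effectors.
frame = {"joint_rot": np.zeros(12), "joint_vel": np.zeros(12),
         "ee_pos": np.zeros((4, 3)), "root_pos": np.zeros(3),
         "root_rot": np.zeros(3), "root_lin_vel": np.zeros(3),
         "root_ang_vel": np.zeros(3)}
perturbed = dict(frame, joint_rot=frame["joint_rot"] + 0.1)
reward_perfect = imitation_reward(frame, frame)
reward_perturbed = imitation_reward(frame, perturbed)
```

Because each term lies in (0, 1], the total reward is bounded by the sum of the weights, which makes returns easy to normalize across skills.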


VI Domain Adaptation

Due to discrepancies between the dynamics of the simulation and the real world, policies trained in simulation tend to perform poorly when deployed on a physical system. Therefore, we propose a sample efficient adaptation technique for transferring policies from simulation to the real world.

VI-A Domain Randomization

Domain randomization is a simple strategy for improving a policy’s robustness to dynamics variations [52, 60, 42]. Instead of training a policy in a single environment with fixed dynamics, domain randomization varies the dynamics during training, thereby encouraging the policy to learn strategies that are functional across different dynamics. However, there may be no single strategy that is effective across all environments, and due to unmodeled effects in the real world, strategies that are robust to different simulated dynamics may nonetheless fail when deployed in a physical system.
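
A minimal sketch of per-episode domain randomization is shown below: one value per dynamics parameter is drawn uniformly from a range at the start of each episode. The parameter names and ranges here are hypothetical placeholders and are not the values from Table I.

```python
import numpy as np

# Hypothetical ranges (multipliers or absolute values), NOT the paper's Table I.
TRAIN_RANGES = {
    "mass_scale": (0.8, 1.2),
    "inertia_scale": (0.5, 1.5),
    "motor_strength_scale": (0.8, 1.2),
    "motor_friction": (0.0, 0.05),
    "lateral_friction": (0.5, 1.25),
}

def sample_dynamics(ranges, rng):
    """Draw one value per parameter uniformly from its [low, high] range."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}

rng = np.random.default_rng(0)
for episode in range(3):
    mu = sample_dynamics(TRAIN_RANGES, rng)
    # A simulator would be reset with the parameters in `mu` here,
    # before rolling out the policy for one episode.
```

Resampling at every episode means the policy never sees a single fixed dynamics model, which is what pushes it toward strategies that work across the whole range.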

VI-B Domain Adaptation

In this work, we aim to learn strategies that are robust to variations in the dynamics of the environment, while also being able to adapt their behaviors as necessary for new environments. Let $\mu$ represent the values of the dynamics parameters that are randomized during training in simulation (Table I). At the start of each episode, a random set of parameters is sampled according to $\mu \sim p(\mu)$. The dynamics parameters are then encoded into a latent embedding $z \sim E(z | \mu)$ by a stochastic encoder $E$, and $z$ is provided as an additional input to the policy $\pi(a | s, z)$. For brevity, we have excluded the goal input $g$ for the policy. When transferring a policy to the real world, we follow a similar approach as Yu et al. [66], where a search is performed to find a latent encoding $z^*$ that enables the policy to successfully execute the desired behaviors on the physical system. Next, we propose an extension that addresses potential issues due to over-fitting with the previously proposed method.

A potential degeneracy of the previously described approach is that the policy may learn strategies that depend on $z$ being an accurate representation of the true dynamics of the system. This can result in brittle behaviors where the strategies utilized by the policy for a given $z$ can overfit to the precise dynamics from the corresponding parameters $\mu$. Furthermore, due to unmodeled effects in the real world, there might be no $\mu$ that accurately models real-world dynamics. Therefore, to encourage the policy to be robust to uncertainty in the dynamics, we incorporate an information bottleneck into the encoder. The information bottleneck enforces an upper bound $I_c$ on the mutual information $I(M, Z)$ between the dynamics parameters $M$ and the encoding $Z$. This results in the following constrained policy optimization objective,

$$\arg\max_{\pi, E} \; \mathbb{E}_{\mu \sim p(\mu)} \, \mathbb{E}_{z \sim E(z | \mu)} \, \mathbb{E}_{\tau \sim p(\tau | \pi, \mu, z)} \left[ \sum_{t=0}^{T-1} \gamma^t r_t \right] \tag{10}$$
$$\text{s.t.} \quad I(M, Z) \le I_c, \tag{11}$$

where the trajectory distribution is now given by,

$$p(\tau | \pi, \mu, z) = p(s_0) \prod_{t=0}^{T-1} p(s_{t+1} | s_t, a_t, \mu) \, \pi(a_t | s_t, z). \tag{12}$$

Since computing the mutual information is intractable, the constraint in Equation 11 can be approximated with a variational upper bound using the KL divergence between $E$ and a variational prior $\rho(z)$ [1],

$$I(M, Z) \le \mathbb{E}_{\mu \sim p(\mu)} \left[ D_{KL} \left[ E(\cdot | \mu) \,\|\, \rho(\cdot) \right] \right]. \tag{13}$$

We can further simplify the objective by converting Equation 11 into a soft constraint, to yield the following information-regularized objective,

$$\arg\max_{\pi, E} \; \mathbb{E}_{\mu \sim p(\mu)} \, \mathbb{E}_{z \sim E(z | \mu)} \, \mathbb{E}_{\tau \sim p(\tau | \pi, \mu, z)} \left[ \sum_{t=0}^{T-1} \gamma^t r_t \right] - \beta \, \mathbb{E}_{\mu \sim p(\mu)} \left[ D_{KL} \left[ E(\cdot | \mu) \,\|\, \rho(\cdot) \right] \right], \tag{14}$$

with $\beta \ge 0$ being a Lagrange multiplier. In our experiments, we model the encoder $E(z | \mu) = \mathcal{N}(m(\mu), \Sigma(\mu))$ as a Gaussian distribution with mean $m(\mu)$ and standard deviation $\Sigma(\mu)$, and the prior $\rho(z) = \mathcal{N}(0, I)$ is given by the unit Gaussian. This objective can be interpreted as training a policy that maximizes the agent’s expected return across different dynamics, while also being able to adapt its behaviors when necessary by relying on only a minimal amount of information from the ground-truth dynamics parameters. In our formulation, the Lagrange multiplier $\beta$ provides a trade-off between robustness and adaptability. Large values of $\beta$ restrict the amount of information that the policy can access from $\mu$. In the limit $\beta \to \infty$, the policy converges to a robust but non-adaptive policy that does not access the underlying dynamics parameters. Conversely, small values of $\beta$ provide the policy with unfettered access to the dynamics parameters, which can result in brittle strategies where the policy’s behaviors overfit to the nuances of each setting of the dynamics parameters, potentially leading to poor generalization to real-world dynamics.
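
For a Gaussian encoder with a unit Gaussian prior, the information penalty has a closed form. The sketch below, with assumed function names and a log-standard-deviation parameterization chosen for convenience, shows a reparameterized sample from the encoder, the KL term, and the resulting regularized objective for a given return estimate.

```python
import numpy as np

def sample_encoding(mean, log_std, rng):
    """Reparameterized sample z = mean + std * eps, with eps ~ N(0, I)."""
    mean = np.asarray(mean, dtype=float)
    log_std = np.asarray(log_std, dtype=float)
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(log_std) * eps

def kl_to_unit_gaussian(mean, log_std):
    """KL[N(mean, diag(std^2)) || N(0, I)] in nats, summed over latent dims."""
    mean = np.asarray(mean, dtype=float)
    log_std = np.asarray(log_std, dtype=float)
    var = np.exp(2.0 * log_std)
    return float(0.5 * np.sum(var + mean ** 2 - 1.0 - 2.0 * log_std))

def regularized_objective(expected_return, mean, log_std, beta):
    """Return estimate minus beta times the information penalty (maximized)."""
    return expected_return - beta * kl_to_unit_gaussian(mean, log_std)

rng = np.random.default_rng(0)
z = sample_encoding([0.5, -0.5], [0.0, 0.0], rng)
penalty = kl_to_unit_gaussian([1.0], [0.0])
objective = regularized_objective(1.0, [1.0], [0.0], beta=0.1)
```

When the encoder output matches the prior the penalty is zero, so a large coefficient drives the encoder toward ignoring the dynamics parameters, which corresponds to the robust but non-adaptive limit described above.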

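Algorithm 1 Adaptation with Advantage-Weighted Regression
1:  $\pi \leftarrow$ trained policy
2:  $\omega_0 \leftarrow \mathcal{N}(0, I)$
3:  $\mathcal{D} \leftarrow \emptyset$
4:  for iteration $k = 0, \ldots, k_{\max} - 1$ do
5:     Sample encoding $z_k \sim \omega_k(z)$
6:     Rollout an episode with $\pi$ conditioned on $z_k$ and record the return $R_k$
7:     Store $(z_k, R_k)$ in $\mathcal{D}$
8:     $\bar{v} \leftarrow$ average return of all samples in $\mathcal{D}$
9:     $\omega_{k+1} \leftarrow \arg\max_{\omega} \sum_{(z_i, R_i) \in \mathcal{D}} \left[ \log \omega(z_i) \exp\left( \frac{1}{\alpha} (R_i - \bar{v}) \right) \right]$
10:  end for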

VI-C Real World Transfer

To adapt a policy to the real world, we directly search for an encoding $z^*$ that maximizes the return on the physical system,

$$z^* = \arg\max_z \; \mathbb{E}_{\tau \sim p^*(\tau | \pi, z)} \left[ \sum_{t=0}^{T-1} \gamma^t r_t \right], \tag{15}$$

with $p^*(\tau | \pi, z)$ being the trajectory distribution under real-world dynamics. To identify $z^*$, we use advantage-weighted regression (AWR) [40, 46], a simple off-policy RL algorithm. Algorithm 1 summarizes the adaptation process. The search distribution $\omega_0(z)$ is initialized with the prior $\mathcal{N}(0, I)$. At each iteration $k$, we sample an encoding $z_k \sim \omega_k(z)$ from the current distribution and execute an episode with the policy conditioned on $z_k$. The return $R_k$ for the episode is recorded and stored along with $z_k$ in a replay buffer $\mathcal{D}$ containing all samples from previous iterations. $\omega_k$ is then updated by fitting a new distribution that assigns higher likelihoods to samples with larger advantages. The likelihood of each sample $z_i$ is weighted by the exponentiated-advantage $\exp\left( \frac{1}{\alpha} (R_i - \bar{v}) \right)$, where the baseline $\bar{v}$ is the average return of all samples in $\mathcal{D}$, and $\alpha$ is a manually specified temperature parameter. Note that, since $\omega$ is Gaussian, the optimal distribution at each iteration (Line 9) can be determined analytically. However, we found that the analytic solution is prone to premature convergence to a suboptimal solution. Instead, we update $\omega_k$ incrementally using a few steps of gradient descent. This process is repeated for $k_{\max}$ iterations, and the mean of the final distribution $\omega_{k_{\max}}(z)$ is used as an approximation of the optimal encoding $z^*$ for deploying the policy in the real world.
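
The adaptation loop can be sketched on a toy problem as follows. This is a simplified illustration rather than the paper's implementation: the search distribution keeps a fixed standard deviation and only its mean is updated, the weights are normalized for numerical convenience, and `toy_return` is a hypothetical stand-in for rolling out an episode on the robot. All names and hyperparameter values are assumptions.

```python
import numpy as np

def awr_weights(returns, alpha):
    """Exponentiated advantages exp((R_i - mean(R)) / alpha), normalized to sum to 1."""
    returns = np.asarray(returns, dtype=float)
    advantages = returns - returns.mean()     # baseline: average return in the buffer
    weights = np.exp(advantages / alpha)
    return weights / weights.sum()

def adapt_latent(rollout_return, latent_dim, iterations=50, alpha=0.5,
                 grad_steps=5, lr=0.5, std=1.0, seed=0):
    """Search for a latent encoding with high return using weighted regression."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(latent_dim)               # start from the unit Gaussian prior
    zs, returns = [], []                      # replay buffer of (z, R) pairs
    for _ in range(iterations):
        z = mean + std * rng.standard_normal(latent_dim)
        zs.append(z)
        returns.append(rollout_return(z))
        weights = awr_weights(returns, alpha)
        samples = np.stack(zs)
        for _ in range(grad_steps):
            # Gradient of sum_i w_i * log N(z_i | mean, std^2 I) w.r.t. the mean.
            grad = (weights[:, None] * (samples - mean)).sum(axis=0) / std ** 2
            mean = mean + lr * grad           # a few incremental ascent steps
    return mean

# Toy stand-in for a real-world rollout: return peaks at a hidden target encoding.
target = np.array([1.0, -0.5])
def toy_return(z):
    return -float(np.sum((z - target) ** 2))

z_star = adapt_latent(toy_return, latent_dim=2)
```

Taking a few small steps per iteration, instead of jumping to the closed-form weighted fit, mirrors the incremental update described above and keeps early, noisy samples from dominating the search.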

Parameter Training Range Testing Range
Mass default value default value
Inertia default value default value
Motor Strength default value default value
Motor Friction
Lateral Friction
TABLE I: Dynamics parameters and their respective ranges of values used during training and testing. A larger range of values is used during testing to evaluate the policies’ ability to generalize to unfamiliar dynamics.
(a) Dog Pace
(b) Dog Backwards Trot
(c) Side-Steps
(d) Turn
(e) Hop-Turn
(f) Running Man
Fig. 4: Laikago robot performing skills learned by imitating reference motions. Top: Reference motion. Middle: Simulated robot. Bottom: Real robot.

VII Experimental Evaluation

We evaluate our robotic learning system by learning to imitate a variety of dynamic locomotion skills using the Laikago robot [61], an 18 degrees-of-freedom quadruped with 3 actuated degrees-of-freedom per leg, and 6 under-actuated degrees of freedom for the root (torso). Behaviors learned by the policies are best seen in the supplementary video, and snapshots of the behaviors are also available in Figure 4. In the following experiments, we aim to evaluate the effectiveness of our framework on learning a diverse set of quadruped skills, and study how well real-world adaptation can enable more agile behaviors. We show that our adaptation method can efficiently transfer policies trained in simulation to the real world with a small number of trials on the physical system. We further study the effects of regularizing the latent dynamics encoding with an information bottleneck, and show that this provides a mechanism to trade off between the robustness and adaptability of the learned policies.

VII-A Experimental Setup

Retargeting via inverse-kinematics and simulated training are performed using PyBullet [10]. Table I summarizes the dynamics parameters and their respective ranges of values. The motion dataset contains a mixture of mocap clips recorded from a dog and clips from artist-generated animations. The mocap clips are collected from a public dataset [68] and retargeted to the Laikago following the procedure in Section IV. Figure 5 lists the skills learned by the robot and summarizes the performance of the policies when deployed in the real world. Motion clips recorded from a dog are designated with “Dog”, and the other clips correspond to artist-animated motions. Performance is recorded as the average normalized return, with 0 corresponding to the minimum possible return per episode and 1 being the maximum return. Note that the maximum return may not be achievable, since the reference motions are generally not physically feasible for the robot. Performance is calculated using the average of 3 policies initialized with different random seeds. Each policy is trained with proximal policy optimization [53] using about 200 million samples in simulation. Both the encoder and policy are trained end-to-end using the reparameterization trick [29]. Domain adaptation is performed on the physical system with AWR in the latent dynamics space, using approximately 50 real-world trials to adapt each policy. Trials vary between 5s and 10s in length depending on the space requirements of each skill. Hyperparameter settings are available in the Appendix.


Fig. 5: Performance statistics of imitating various skills in the real world. Performance is recorded as the average normalized return between [0, 1]. Three policies initialized with different random seeds are trained for each combination of skill and method. The performance of each policy is evaluated over 5 episodes, for a total of 15 trials per method. The adaptive policies outperform the non-adaptive policies on most skills.
Fig. 6: Schematic illustration of the network architecture used for the adaptive policy. The encoder receives the dynamics parameters $\mu$ as input, which are processed by two fully-connected layers with 256 and 128 ReLU units, and then mapped to a Gaussian distribution over the latent space $Z$ with mean $m(\mu)$ and standard deviation $\Sigma(\mu)$. An encoding $z \sim E(z | \mu)$ is sampled from the encoder distribution and provided to the policy as input, along with the state $s$ and goal $g$. The policy is modeled with two layers of 512 and 256 units, followed by an output layer which specifies the mean of the action distribution. The standard deviation of the action distribution is specified by a fixed diagonal matrix. The value function is modeled by a separate network with 512 and 256 hidden units.

Model representation. All policies are modeled using the neural network architecture shown in Figure 6. The encoder is represented by a fully-connected network that maps the dynamics parameters $\mu$ to the mean and standard deviation of the encoder distribution. The policy network receives as input the state $s$, goal $g$, and dynamics encoding $z$, then outputs the mean of a Gaussian action distribution. The standard deviation of the action distribution is represented by a fixed matrix. The value function receives as input the state, goal, and dynamics parameters.
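
A forward pass through this encoder and policy can be sketched in plain NumPy with randomly initialized weights. Only the hidden layer sizes come from the text; the input and output dimensions, the ReLU activations in the policy, the log-standard-deviation output of the encoder, and all names are assumptions made for illustration.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random weights and zero biases for a fully-connected network."""
    return [(rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """ReLU on hidden layers, linear output layer."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

# Hypothetical dimensions; only the hidden layer sizes follow the text.
DYN_DIM, LATENT_DIM, STATE_DIM, GOAL_DIM, ACTION_DIM = 6, 8, 72, 48, 12

init_rng = np.random.default_rng(0)
encoder = init_mlp([DYN_DIM, 256, 128, 2 * LATENT_DIM], init_rng)  # mean and log-std
policy = init_mlp([STATE_DIM + GOAL_DIM + LATENT_DIM, 512, 256, ACTION_DIM], init_rng)

def action_mean(state, goal, mu, rng):
    """Encode the dynamics parameters, sample an encoding, and query the policy."""
    enc_out = mlp_forward(encoder, mu)
    z_mean, z_log_std = enc_out[:LATENT_DIM], enc_out[LATENT_DIM:]
    z = z_mean + np.exp(z_log_std) * rng.standard_normal(LATENT_DIM)
    return mlp_forward(policy, np.concatenate([state, goal, z]))

a_mean = action_mean(np.zeros(STATE_DIM), np.zeros(GOAL_DIM), np.ones(DYN_DIM),
                     np.random.default_rng(1))
```

At deployment time the encoder is not needed: the sampled encoding is replaced by the latent vector found through the real-world search, and only the policy network is queried.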

VII-B Learned Skills

Our framework is able to learn a diverse set of locomotion skills for the Laikago, including dynamic gaits, such as pacing and trotting, as well as agile turning and spinning motions (Figure 4). Pacing is typically used for walking at slower speeds, and is characterized by each pair of legs on the same side of the body moving in unison (Figure 4(a)) [50]. Trotting is a faster gait, where diagonal pairs of legs move together (Figure 1). We are able to train policies for these different gaits just by providing the system with different reference motions. Furthermore, by simply playing the mocap clips backwards, we are able to train policies for different backwards walking gaits (Figure 4(b)). The gaits learned by our policies are faster than those of the manually-designed controller from the manufacturer. The fastest manufacturer gait reaches a top speed of about 0.84m/s, while the Dog Trot policy reaches a speed of 1.08m/s. The backwards trotting gait reaches an even higher speed of 1.20m/s. In addition to imitating mocap data from animals, our system is also able to learn from artist-animated motions. While these hand-animated motions are generally not physically correct, the policies are nonetheless able to closely imitate most motions with the real robot. This includes a highly dynamic Hop-Turn motion, in which the robot performs a 90-degree turn midair (Figure 4(e)). While our system is able to imitate a variety of motions, some motions, such as Running Man (Figure 4(f)), prove challenging to reproduce. The motion requires the robot to travel backwards while moving in a forward-walking manner. Our policies learn to keep the robot’s feet on the ground and shuffle backwards, instead of lifting the feet during each step.

Fig. 7: Comparison of the time elapsed before the robot falls when deploying various policies in the real world. The adaptive policies are often able to maintain balance longer than the other baseline policies, and tend to reach the max episode length without falling.

VII-C Domain Adaptation

To determine the effects of domain adaptation, we compare our method to non-adaptive policies trained in simulation without randomization (No Rand), and robust policies trained with randomization (Robust) that do not perform adaptation in new environments. Real-world performance comparisons of these methods are shown in Figure 5; detailed performance statistics in simulation and the real world are available in Appendix B. When deployed on the real robot, the adaptive policies outperform their non-adaptive counterparts on most skills. For simpler skills, such as In-Place Steps and Side-Steps, the robust policies are sufficient for transfer to the real robot. But for more dynamic skills, such as Dog Pace and Dog Spin, the robust policies are prone to falling, while the adaptive policies can execute the skills more consistently. Policies trained without randomization fail to transfer to the real world for most skills. Figure 7 compares the time elapsed before the robot falls under the various policies. The adaptive policies are often able to maintain balance for a longer period of time than the other methods, with a significant performance improvement after adaptation.

Fig. 8: Performance of policies in 100 simulated environments with different dynamics. The vertical axis represents the normalized return, and the horizontal axis records the portion of environments in which a policy achieves a return higher than a particular value. The adaptive policies achieve higher returns under more diverse dynamics than the non-adaptive policies.
Fig. 9: Learning curves of adapting policies to different simulated environments using the learned latent space. The policies are able to adapt to new environments in a relatively small number of episodes.

To evaluate the policies’ abilities to cope with unfamiliar dynamics, we test the policies in out-of-distribution simulated environments, where the dynamics parameters are sampled from a larger range of values than those used during training. The ranges of values used during training and testing are detailed in Table I. Figure 8 visualizes the performance of the policies in 100 simulated environments with different dynamics. The vertical axis represents the normalized return, and the horizontal axis records the portion of environments in which a policy achieves a return higher than a particular value. For example, in the case of Dog Pace, the adaptive policies achieve a return higher than 0.6 in a larger portion of the environments than the robust policy. The experiments are repeated 3 times for each method using policies initialized with different random seeds. In these experiments, the adaptive policies tend to outperform their non-adaptive counterparts across the various skills. This suggests that the adaptation process is able to better generalize to environments that differ from those encountered during training. To analyze the performance of policies during the adaptation process, we record the performance of individual policies after each update iteration. Figure 9 illustrates the learning curves in 5 different environments for each skill. The policies are generally able to adapt to new environments in a relatively small number of episodes.
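
The statistic plotted on the horizontal axis, the portion of environments in which a policy exceeds a given normalized return, can be computed as in the sketch below. The function names and the toy return values are illustrative assumptions, not results from the paper.

```python
import numpy as np

def normalized_return(ret, min_ret, max_ret):
    """Map a raw return to [0, 1] given the minimum and maximum possible return."""
    return (ret - min_ret) / (max_ret - min_ret)

def portion_above(returns, threshold):
    """Portion of environments whose normalized return exceeds the threshold."""
    returns = np.asarray(returns, dtype=float)
    return float(np.mean(returns > threshold))

# Made-up normalized returns from four simulated environments.
toy_returns = [0.2, 0.7, 0.9, 0.5]
portion = portion_above(toy_returns, threshold=0.6)
```

Sweeping the threshold from 0 to 1 and plotting the resulting portions yields a curve of the kind shown in Figure 8, where a curve that stays high for longer indicates a policy that copes with a wider range of dynamics.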

Fig. 10: Performance of adaptive policies trained with different coefficients for the information penalty. “No IB” corresponds to policies trained without an information bottleneck. The dotted lines represent performance before adaptation, and the solid lines represent performance after adaptation.

VII-D Information Bottleneck

Next, we evaluate the effects of the information bottleneck on adaptation performance. Figure 10 summarizes the performance of policies trained with different values of $\beta$ for the information penalty. Larger values of $\beta$ produce policies that access fewer bits of information from the dynamics parameters during pre-training. This encourages a policy to be less reliant on precise knowledge of the underlying dynamics, which in turn results in more robust behaviors that attain higher performance before adaptation. However, since the policy’s behavior is less dependent on the latent variables, this can also result in less adaptable policies, which exhibit smaller performance improvements after adaptation. Similarly, smaller values of $\beta$ tend to produce less robust but more adaptive policies, exhibiting lower performance before adaptation, but a larger improvement after adaptation. In our experiments, we find that an intermediate value of $\beta$ provides a good trade-off between robustness and adaptability. We also compare the information-constrained latent representations to the unconstrained counterparts (No IB). The information-constrained policies generally achieve better performance both before and after adaptation.

VIII Discussion and Future Work

We presented a framework for learning agile legged-locomotion skills by imitating reference motion data. By simply providing the system with different reference motions, we are able to learn policies for a diverse set of behaviors with a quadruped robot, which can then be efficiently transferred from simulation to the real world. However, due to hardware and algorithmic limitations, we have not been able to learn more dynamic behaviors such as large jumps and runs. Exploring techniques that are able to reproduce these behaviors in the real world could significantly increase the agility of legged robots. The behaviors learned by our policies are currently not as stable as the best manually-designed controllers. Improving the robustness of these learned controllers would be valuable for more complex real-world applications. We are also interested in learning from other sources of motion data, such as video clips, which could substantially increase the volume of behavioral data that robots can learn from.