Reinforcement Learning (RL) has had a significant impact on legged locomotion, showcasing robust policies that handle a wide variety of challenging terrain in the real world. With this advancement, we believe that these articulated robots can perform specialized motions like their natural counterparts. We therefore aim to push these robots even further to their limits by executing advanced skills such as the quadruped-humanoid transformation in Fig. 1 performed by our wheeled-legged robot. In this work, we rely on a combination of motion priors and RL to achieve such skills.
I-A Related Work
Executing specific behaviors on a real robot is a fundamental challenge in robotics and RL. For example, the computer animation community synthesizes life-like behaviors for simulated agents from human or animal demonstrations. Boston Dynamics' real humanoid robot, Atlas, shows impressive dancing motions and backflips based on human motion animators. Similarly, our wheeled-legged robot can track motions from an offline trajectory optimization with a model predictive control (MPC) algorithm, as shown in our previous work. Furthermore, motion optimizations, such as [18, 22], have the added benefit of producing physically plausible motions, which is favorable in computer graphics but vital in robot control. However, these tracking-based methods require carefully designed objective functions, which are usually exceptionally difficult to craft. When applied to more extensive and diverse motion libraries, these methods additionally need heuristics to select the motion prior suited to the scenario.
Data-driven strategies based on adversarial imitation learning automate the imitation objective and the mechanism for motion selection. This paper verifies that this imitation learning approach can be applied to real robotic systems and not just computer animations. Gaussian processes [12, 23] can learn a low-dimensional motion embedding space that generates suitable kinematic motions when provided with a relatively large amount of motion data. However, these approaches are not goal-conditioned and cannot leverage task-specific information.
Animation techniques [24, 5, 15] attempt to solve this by imitating or tracking motion clips. This is usually implemented with pose errors, requiring a motion-clip selection mechanism and a synchronization between the selected reference motion and the policy's movement. By using a phase variable as an additional input to the policy, the right frame in the motion data-set can be selected. Scaling the number of motion clips is challenging with these approaches, and defining error metrics that generalize to a wide variety of motions is difficult.
Two alternative approaches are adversarial learning and student-teacher architectures. The latter trains a teacher policy with privileged information such as perfect knowledge of the height map, friction coefficients, and ground contact forces. With that, the teacher can learn complex motions more easily. After the teacher's training, the student policy learns to reproduce the teacher's output using non-privileged observations and the robot's proprioceptive history. In this way, a style transfer from teacher to student takes place. On the other hand, adversarial imitation learning techniques [1, 7] build upon a different approach. A recent instance is the discriminator-based learning strategy called Adversarial Motion Priors (AMP), which outsources the error metrics, phase, and motion-clip selection to a discriminator that learns to distinguish between the policy's and the motion data's state transitions. AMP does not require specific motion clips to be selected as tracking targets since the policy automatically chooses which style to apply given a particular task. The method's limitation is that whenever multiple provided motion priors cover the same task, the policy might either go for the style that is simpler to fulfill or find a hybrid motion similar to both motion clips. In other words, there is no option of actively choosing styles in single- or multi-task settings. Furthermore, the task reward still has to motivate the policy to execute a specific movement, because otherwise the policy might identify two states and oscillate between them. Generally, in our experience, it is not trivial to find task-reward formulations for complex and highly dynamic movements that do not conflict with the style reward provided by the discriminator.
This paper introduces the Multi-AMP algorithm and applies it to our real wheeled-legged robot. Like its AMP predecessor, this approach automates the imitation objective and motion selection process without heuristics. Furthermore, our extension allows for the intentional switching between multiple different style objectives. The approach can imitate motion priors from three different data sets, i.e., from existing RL controllers, trajectory optimization, and reversed stand-up motions. The latter enables the automatic discovery of feasible sit-down motions on the real robot without tedious reward function tuning. This permits exceptional skills with our wheeled-legged robot in Fig. 1, where the robot can switch between a quadruped and a humanoid configuration. To the best of our knowledge, this is the first time such a highly dynamic skill has been shown and also the first time the AMP approach has been verified on a real robot.
II Multiple Adversarial Motion Priors
In this work, the goal is to train a policy capable of executing multiple tasks while applying styles extracted from individual motion data-sets, with the ability to actively switch between them. In contrast to tracking-based methods, the policy should not blindly follow specific motions but rather extract and apply the underlying characteristics of the movements while fulfilling its task.
Similar to the AMP algorithm, we split the reward calculation into two parts. The task reward describes what to do, e.g., velocity tracking, and the style reward defines how to do it, namely by extracting and applying the style of the motion priors. While task rewards often have simple mathematical descriptions, the style reward is not trivial to calculate. In the following, we introduce Multi-AMP, a generalization of AMP that allows for switching between multiple different style rewards, which constitutes the main theoretical contribution of this work.
A style reward motivates the agent to extract the motion prior's style. We use an adversarial setup with one discriminator per style. For every trained style $i$, a roll-out buffer $\mathcal{B}^i$ collects the states of time-steps where the policy applies the style, and another buffer $\mathcal{M}^i$ contains the motion-data prior of that specific style. Each discriminator $D^i$ learns to differentiate between descriptors built from a pair of consecutive states sampled from $\mathcal{B}^i$ and $\mathcal{M}^i$. Thus, every trainable style is defined by a tuple $(\mathcal{M}^i, \mathcal{B}^i, D^i)$. By avoiding any dependency on the source's actions, the pipeline can process data from sources with unknown actions, such as data from motion tracking and character animation. The discriminator learns to distinguish random samples of its motion database $\mathcal{M}^i$ from the agent's transitions sampled from the style's roll-out buffer $\mathcal{B}^i$ by scoring them with $1$ and $-1$, respectively. This behavior is encouraged by solving the least-squares problem defined by

$$\min_{D^i} \; \mathbb{E}_{(s, s') \sim \mathcal{M}^i}\!\left[\left(D^i(\Phi(s), \Phi(s')) - 1\right)^2\right] + \mathbb{E}_{(s, s') \sim \mathcal{B}^i}\!\left[\left(D^i(\Phi(s), \Phi(s')) + 1\right)^2\right],$$

where the descriptors are built by concatenating the output of an arbitrary function $\Phi$ for two consecutive states. The choice of $\Phi$ decides which style information is extracted from the state transitions, e.g., the robot's joint and torso position, velocity, etc.
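As an illustration, a minimal PyTorch sketch of this per-style least-squares objective could look as follows (the `disc` and `phi` callables and the descriptor shapes are assumptions, not the paper's implementation):

```python
import torch

def build_descriptor(phi, s, s_next):
    # Concatenate the style features Phi(s), Phi(s') of two consecutive states.
    return torch.cat([phi(s), phi(s_next)], dim=-1)

def discriminator_loss(disc, expert_desc, policy_desc):
    # Least-squares objective: descriptors from the motion data M_i should
    # score +1, descriptors from the policy's roll-out buffer B_i should
    # score -1.
    return ((disc(expert_desc) - 1.0) ** 2).mean() + \
           ((disc(policy_desc) + 1.0) ** 2).mean()
```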
During the policy's roll-out, only one style is active at a time. The state passed into the policy at every time-step contains a command $c$, which is augmented with a one-hot-encoded style selector $e$, i.e., the elements of $e$ are zero everywhere except at the index of the active style $i$. As in the standard RL cycle, after the policy predicts an action $a_t$, the environment returns a new state $s_{t+1}$ and a task reward $r^{task}_t$. The latest state transition is used to construct the style descriptor $(\Phi(s_t), \Phi(s_{t+1}))$, which is mapped to a style reward using the current style's discriminator $D^i$, with the style reward given by

$$r^{style}_t = \max\left[0,\; 1 - 0.25\left(D^i\big(\Phi(s_t), \Phi(s_{t+1})\big) - 1\right)^2\right].$$
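A minimal sketch of this mapping, following the clipped least-squares form above and assuming the discriminator and descriptor batch from the previous sketch:

```python
import torch

def style_reward(disc, desc):
    # Map the active style's discriminator score to a bounded reward:
    # scores near +1 (transitions indistinguishable from motion data)
    # yield rewards near 1, scores near -1 yield 0.
    with torch.no_grad():
        score = disc(desc)
    return torch.clamp(1.0 - 0.25 * (score - 1.0) ** 2, min=0.0)
```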
Our agents interact with the environment in a command-conditioned framework. During training, the environment rewards the policy for fulfilling commands sampled from a command distribution $p(c)$. For example, the task might be to achieve a desired body velocity sampled from a uniform distribution in x, y, and yaw coordinates. The task is included in the policy's observation and essentially informs the agent what to do. The task reward depends on the performance of the policy with respect to the command $c$.
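For instance, a common exponentiated-error form of such a velocity-tracking reward (a hypothetical example; the actual task-reward definitions are given in Tables I and II in Section III) could be:

```python
import torch

def velocity_tracking_reward(base_vel, command, sigma=0.25):
    # Reward ~1 when the measured (x, y, yaw) velocity matches the sampled
    # command, decaying exponentially with the squared tracking error.
    err = ((base_vel - command) ** 2).sum(dim=-1)
    return torch.exp(-err / sigma)
```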
II-C Multi-AMP algorithm
The sum of the style and task rewards constitutes the overall reward, which can be used in any RL algorithm such as Proximal Policy Optimization (PPO) or Soft Actor-Critic (SAC). The state is additionally stored in the style's roll-out buffer $\mathcal{B}^i$ to train the discriminator at the end of the epoch. The full approach is outlined in the sketch below.
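A condensed Python sketch of one Multi-AMP epoch, reusing `build_descriptor`, `style_reward`, and `discriminator_loss` from the sketches above (the environment, policy, and optimizer interfaces are hypothetical, not the authors' exact pseudocode):

```python
import torch
import torch.nn.functional as F

def multi_amp_epoch(env, policy, styles, ppo, steps_per_epoch):
    # One Multi-AMP epoch. Each parallel environment carries an active style
    # index i; styles[i] bundles the tuple (M_i, B_i, D_i) plus the feature
    # map phi and a discriminator optimizer.
    states, style_ids = env.observe()
    for _ in range(steps_per_epoch):
        # Augment the observation with the one-hot style selector e.
        selector = F.one_hot(style_ids, num_classes=len(styles)).float()
        actions = policy(torch.cat([states, selector], dim=-1))
        next_states, task_rewards = env.step(actions)
        rewards = task_rewards.clone()
        for i, style in enumerate(styles):
            mask = style_ids == i
            desc = build_descriptor(style.phi, states[mask], next_states[mask])
            style.rollout_buffer.add(desc)     # fill B_i with policy transitions
            if style.has_motion_data:          # data-free skills get no style reward
                rewards[mask] += style_reward(style.disc, desc).squeeze(-1)
        ppo.store(states, actions, rewards)    # summed task + style reward
        states = next_states
    ppo.update()                               # standard policy/value update
    for style in styles:                       # then train each discriminator D_i
        if style.has_motion_data:
            loss = discriminator_loss(style.disc,
                                      style.motion_data.sample(),
                                      style.rollout_buffer.sample())
            style.disc_opt.zero_grad()
            loss.backward()
            style.disc_opt.step()
```

Note how the per-style tuples decouple discriminator training from the policy update: adding a style only adds one $(\mathcal{M}^i, \mathcal{B}^i, D^i)$ tuple.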
II-D Data-free skills
If no motion data is present for a desired skill, and it should nevertheless be trained alongside multiple motion-data skills, Multi-AMP can be adapted slightly. While the policy learns the motion-data-free skill, its style reward $r^{style}$ is set to zero. Thereby, the data-free skill is still treated as a valid style and is present in the one-hot-encoded style selector $e$, but the policy is no longer guided by the style reward.
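In the training-loop sketch of Section II-C, this corresponds to skipping the style-reward term for the data-free style:

```python
# Inside the per-style loop of multi_amp_epoch (hypothetical flag): a
# data-free skill keeps its slot in the one-hot selector e, but only the
# task reward drives its learning.
if style.has_motion_data:
    rewards[mask] += style_reward(style.disc, desc).squeeze(-1)
# else: rewards[mask] stays the plain task reward
```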
III Experimental Results and Discussion
We implement and deploy the proposed Multi-AMP framework on our wheeled-legged robot in Fig. 1 with 16 degrees of freedom (DOF). The training environment consists of three tasks, two of which are supported by motion data, and one is a data-decoupled task. The first task is four-legged locomotion, whose motion data consists of motions recorded from another RL policy (Fig. 3 top left). The second task is a ducking skill, allowing the robot to duck under a table. The motion data for this skill was generated by a trajectory optimization pipeline, which was deployed and tracked by an MPC controller (Fig. 3 bottom left). The last skill represents a partly data-decoupled skill. Here, the wheeled-legged robot learns to stand up on its hind legs, followed by two-legged navigation (Fig. 4), before sitting down again. The sit-down skill is supported by motion data as detailed in Section III-B. A video showing the results accompanies this paper and is available at https://youtu.be/kEdr0ARq48A.
Our training environment is built on the Isaac Gym simulator, which allows for massively parallel simulation. We spawn 4096 environments in parallel to learn all three tasks simultaneously in a single neural network. The number of environments per task is weighted according to the tasks' approximate difficulty. The state transitions collected during the roll-outs of these environments are mapped by a function $\Phi$ that extracts the linear and angular base velocity, the gravity direction in the base frame, the base's height above ground, the joint positions and velocities, and finally the positions of the wheels relative to the robot's base frame. The task-reward definitions for the three tasks are given in Tables I and II.
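A sketch of such a feature map $\Phi$ (the attribute names and their layout are illustrative assumptions, not the paper's implementation):

```python
import torch

def phi(robot):
    # Hypothetical style-feature map: stacks the quantities listed above
    # into one descriptor vector per environment.
    return torch.cat([
        robot.base_lin_vel,                  # linear base velocity
        robot.base_ang_vel,                  # angular base velocity
        robot.gravity_in_base,               # gravity direction in base frame
        robot.base_height.unsqueeze(-1),     # base height above ground
        robot.joint_pos,                     # joint positions
        robot.joint_vel,                     # joint velocities
        robot.wheel_pos_in_base.flatten(1),  # wheel positions rel. to base frame
    ], dim=-1)
```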
[Tables I and II: task-reward terms, including a velocity-tracking term (see Tab. II), the robot base-frame rotation and position, joint DOF positions (excluding wheels), the hind-leg DOF position, the alignment between the robot x-axis and the world z-axis, feet on ground (binary), and standing robots (binary).]
Due to the problem of catastrophic forgetting [2, 10, 9], we learn these skills in parallel. This section analyzes the task performance of each Multi-AMP policy compared to policies that exclusively learn a single task (baseline). The three tasks (standing up, ducking, and four-legged locomotion) are trained in different combinations, where ducking and walking are always learned with motion data and stand-up without:
1. Stand up only
2. Walking only
3. Ducking only
4. Walking and standing up
5. Walking and ducking
6. Walking, ducking, and standing up
First, we compare the learning performance of the stand-up skill between models Nr. 1, 4, and 6. The stand-up task is an informative benchmark since it requires a complex sequence of movements to achieve the goal. We normalize all rewards in the following figures by the number of robots receiving the reward, making the plots comparable between the experiments. Fig. 6 shows important metrics of the stand-up learning progress. The figure shows that the policy does not make compromises when training multiple tasks compared to single-task settings. The policy that learns three tasks simultaneously (3 styles in Fig. 6) performs equally well while standing up and sitting down. While it takes the 3-style policy a bit longer to reach the maximum rewards (see the reward curves at epoch 1000), the differences vanish after sufficiently long training times. In this case, it takes Multi-AMP about 300 epochs longer to reach the maximum task rewards compared to the single-task policy.
The walking and ducking tasks show a very similar picture, with the specialized policies (models Nr. 2 and 3 in the list above) reaching a final performance similar to the others. Furthermore, all policies manage to extract the walking and ducking style such that no visible difference can be seen.
In summary, for this specific implementation of the environment and selection of tasks, Multi-AMP, while taking longer, achieves all goals as well as more specialized policies that learn fewer tasks.
III-B Sit-down training
While the sit-down rewards presented in Table II work well in simulation, the policy's sit-down motions created high impulses in the real robot's knees, which exceeded the robot's safety torque threshold. To easily obtain more gentle sit-down motions and avoid reward function tuning, we recorded the stand-up motion, reversed the motion data, and trained a policy using Multi-AMP. Since the stand-up motion lifts the front end-effectors off the ground with zero velocity, the reversed style should encourage low-impact sit-down motions. In the Multi-AMP combination, one style contains the reversed motion data for sitting down, while the second style receives plain stand-up rewards. The result is a sit-down motion that uses the hind knees to lower the center of gravity before tilting the base and catching itself on four legs, as shown in Fig. 5. The agent receives zero task reward for a predefined time after the command to sit down, avoiding task rewards that conflict with the sit-down motion prior. For example, rewarding a horizontal body orientation leads the agent to accelerate the sit-down, which breaks the style. After this buffer time, the sit-down task rewards become active and reward the agent. This allows the robot to sit down at its own speed and style and guarantees non-conflicting rewards.
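Reversing the recording is straightforward; a minimal sketch, assuming the recording stores time-ordered configuration samples q at a fixed time-step dt:

```python
import numpy as np

def reverse_motion(q, dt):
    # Play the recorded stand-up configurations backwards to obtain a
    # sit-down motion prior; re-derive velocities for the reversed motion
    # (equivalently, the original velocities with flipped sign and order).
    q_rev = q[::-1].copy()
    dq_rev = np.gradient(q_rev, dt, axis=0)
    return np.concatenate([q_rev, dq_rev], axis=-1)
```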
Finding a balance between training the policy and the discriminators is vital during the Multi-AMP training process. Our observations show that training the discriminators too fast or too slow relative to the policy hampers the policy's style training. In our current implementation, the number of discriminator and policy updates is fixed, which might not be an optimal strategy. Since the setup is very similar to a Generative Adversarial Network (GAN), more ideas from the GAN training literature could be incorporated into Multi-AMP.
We use an actuator model for the leg joints to bridge the sim-to-real gap, while no actuator model is needed for the velocity-controlled wheels. Moreover, we apply strategies to increase the policy's robustness, such as rough terrain training (see rough terrain robustness in Fig. 1), random disturbances, and game-inspired curriculum training. The highly dynamic stand-up skill is especially sensitive to disturbances, which we address by introducing timed pushes and joint-velocity-based trajectory termination. The former identifies the most critical phase of the skill and pushes the policy in the worst possible way. This increases the number of disturbances the policy experiences during these critical phases, rendering it more robust and thus also helping with sim-to-real transfer. Furthermore, by terminating the trajectory if the joint velocity of any DOF exceeds the actuator's limits, the policy learns to keep a safety tolerance to these limits.
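Both measures are simple to state in code; the following sketch uses hypothetical names for the velocity limits and the push interface:

```python
def should_terminate(joint_vel, vel_limit, margin=0.0):
    # End the episode when any DOF exceeds the actuator's velocity limit,
    # so the policy learns to keep a safety tolerance to these limits.
    return (joint_vel.abs() > (vel_limit - margin)).any(dim=-1)

def apply_timed_push(env, step, critical_step, push_vel):
    # Perturb the base velocity exactly at the most critical phase of the
    # stand-up skill, exposing the policy to worst-case disturbances.
    if step == critical_step:
        env.base_lin_vel += push_vel
```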
This work introduces Multi-AMP, which automates the imitation objective and motion selection process of multiple motion priors without heuristics. Our experimental section shows that we can simultaneously learn different styles and skills in a single policy. Furthermore, our approach can intentionally switch between these styles and skills, and data-free styles are also supported. Various multi-style policies are successfully deployed on a wheeled-legged robot. To this end, we show different combinations of skills such as walking, ducking, standing up on the hind legs, navigating on two wheels, and sitting down on all four legs again. We avoid tedious reward function tuning by training the sit-down motion with a motion prior gained from reversing a stand-up recording. Furthermore, we note that performance similar to the single-style case can be expected even when learning multiple styles simultaneously. We conclude that Multi-AMP and its predecessor AMP are promising steps towards a possible future without style-reward function tuning in RL. However, even though less time is invested in tuning reward functions, more time is required to generate motion priors, which are in most cases not available for specific tasks.
To the best of our knowledge, this is the first time a quadruped-humanoid transformation has been shown on a real robot, challenging how we categorize multi-legged robots. Over the next few years, this skill will further expand the possibilities of wheeled quadrupeds through use-cases such as opening doors and grabbing packages.
- Abbeel, P. and Ng, A. Y. (2004) Apprenticeship learning via inverse reinforcement learning. In Proc. Int. Conf. on Machine Learning, New York, NY, USA.
- Atkinson, C., McCane, B., Szymanski, L., and Robins, A. (2018) Pseudo-rehearsal: achieving deep reinforcement learning without catastrophic forgetting. CoRR abs/1812.02464.
- (2022) Complex motion decomposition: combining offline motion libraries with online MPC. Under review for The International Journal of Robotics Research.
- Bjelonic, M. et al. (2021) Whole-body MPC and online gait sequence generation for wheeled-legged robots. Under review for IEEE Int. Conf. on Robotics and Automation.
- Chentanez, N., Müller, M., Macklin, M., Makoviychuk, V., and Jeschke, S. (2018) Physics-based motion capture imitation with deep reinforcement learning. In Proc. Motion, Interaction and Games, pp. 1–10.
- Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor.
- Ho, J. and Ermon, S. (2016) Generative adversarial imitation learning. In Advances in Neural Information Processing Systems.
- Hwangbo, J. et al. (2019) Learning agile and dynamic motor skills for legged robots. Science Robotics 4(26), eaau5872.
- Kaushik, P., Gain, A., Kortylewski, A., and Yuille, A. (2021) Understanding catastrophic forgetting and remembering in continual learning with optimal relevance mapping. CoRR abs/2102.11343.
- Kirkpatrick, J. et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13), pp. 3521–3526.
- Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., and Hutter, M. (2020) Learning quadrupedal locomotion over challenging terrain. Science Robotics 5(47), eabc5986.
- Levine, S., Wang, J. M., Haraux, A., Popović, Z., and Koltun, V. (2012) Continuous character control with low-dimensional embeddings. ACM Transactions on Graphics 31(4).
- Makoviychuk, V. et al. (2021) Isaac Gym: high performance GPU-based physics simulation for robot learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
- Miki, T., Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., and Hutter, M. (2022) Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics 7(62).
- Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. (2018) DeepMimic: example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics 37(4), pp. 143:1–143:14.
- Peng, X. B., Ma, Z., Abbeel, P., Levine, S., and Kanazawa, A. (2021) AMP: adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics 40(4), pp. 1–20.
- Raibert, M. H. and Hodgins, J. K. (1991) Animation of dynamic legged locomotion. Computer Graphics 25(4).
- Rudin, N., Hoeller, D., Reist, P., and Hutter, M. (2021) Learning to walk in minutes using massively parallel deep reinforcement learning. In Conf. on Robot Learning.
- Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. (2016) Improved techniques for training GANs. CoRR abs/1606.03498.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017) Proximal policy optimization algorithms.
- Wampler, K., Popović, Z., and Popović, J. (2014) Generalizing locomotion style to new animals with inverse optimal regression. ACM Transactions on Graphics 33(4).
- Ye, Y. and Liu, C. K. (2010) Synthesis of responsive motion using a dynamic model. Computer Graphics Forum 29(2), pp. 555–562.
- Zordan, V. B. and Hodgins, J. K. (2002) Motion capture-driven simulations that hit and react. In Proc. ACM SIGGRAPH Symposium on Computer Animation, New York, NY, USA.