Advanced Skills through Multiple Adversarial Motion Priors in Reinforcement Learning

03/23/2022
by   Eric Vollenweider, et al.
ETH Zurich

In recent years, reinforcement learning (RL) has shown outstanding performance for locomotion control of highly articulated robotic systems. Such approaches typically involve tedious reward function tuning to achieve the desired motion style. Imitation learning approaches such as adversarial motion priors aim to reduce this problem by encouraging a pre-defined motion style. In this work, we present an approach to augment the concept of adversarial motion prior-based RL to allow for multiple, discretely switchable styles. We show that multiple styles and skills can be learned simultaneously without notable performance differences, even in combination with motion data-free skills. Our approach is validated in several real-world experiments with a wheeled-legged quadruped robot showing skills learned from existing RL controllers and trajectory optimization, such as ducking and walking, and novel skills such as switching between a quadrupedal and humanoid configuration. For the latter skill, the robot is required to stand up, navigate on two wheels, and sit down. Instead of tuning the sit-down motion, we verify that a reverse playback of the stand-up movement helps the robot discover feasible sit-down behaviors and avoids tedious reward function tuning.


I Introduction

Fig. 1: Quadruped-humanoid transformer (https://youtu.be/kEdr0ARq48A) with a time-lapse from left to right of a stand-up and sit-down motion (top image), obstacle negotiation (middle image), and indoor navigation (bottom images). The former skill and the humanoid navigation on two legs are achieved through traditional RL training with a task reward formulation. Instead of tuning the sit-down skill, we can reverse the playback of the stand-up motion and use it as a motion prior that helps the robot discover feasible sit-down behaviors avoiding tedious reward function tuning.

Reinforcement Learning (RL) has had a significant impact in the space of legged locomotion, showcasing robust policies that can handle a wide variety of challenging terrain in the real world [14]. With this advancement, we believe that these articulated robots can perform specialized motions like their natural counterparts. Therefore, we aim to push these robots even closer to their limits by executing advanced skills like the quadruped-humanoid transformer in Fig. 1 performed by our wheeled-legged robot [4]. In this work, we rely on a combination of motion priors and RL to achieve such skills.

I-A Related Work

Executing specific behaviors on a real robot is a fundamental challenge in robotics and RL. For example, the computer animation community synthesizes life-like behaviors for simulated agents from human or animal demonstrations. Boston Dynamics' real humanoid robot, Atlas, shows impressive dancing motions and backflips based on human motion animators. Similarly, our wheeled-legged robot can track motions from an offline trajectory optimization with a model predictive control (MPC) algorithm, as shown in our previous work [3]. Furthermore, motion optimizations, such as [18, 22], have the added benefit of producing physically plausible motions, which is favorable in computer graphics but vital in robot control. However, these tracking-based methods require carefully designed objective functions, which is usually exceptionally difficult. When applied to more extensive and diverse motion libraries, these methods also need heuristics to select the motion prior suited to the scenario.

Data-driven strategies like [16] automate the imitation objective and the mechanism for motion selection based on adversarial imitation learning. This paper verifies that this imitation learning approach can be applied to real robotic systems and not just computer animations. Gaussian processes [12, 23] can learn a low-dimensional motion embedding space that generates suitable kinematic motions when provided with a relatively large amount of motion data. However, these approaches are not goal-conditioned and cannot leverage task-specific information.

Animation techniques [24, 5, 15] attempt to solve this by imitating or tracking motion clips. This is usually implemented with pose errors, requiring a motion clip to be selected and the selected reference motion to be synchronized with the policy's movement. By using a phase variable as an additional input to the policy, the right frame in the motion data-set can be selected. Scaling the number of motion clips is challenging with these approaches, and defining error metrics that generalize to a wide variety of motions is difficult.

Two alternative approaches are adversarial learning and student-teacher architectures [11]. The latter trains a teacher policy with privileged information such as perfect knowledge of the height map, friction coefficients, and ground contact forces. With that, the teacher can learn complex motions more easily. After the teacher's training, the student policy learns to reproduce the teacher's output using non-privileged observations and the robot's proprioceptive history. In this way, a style transfer from teacher to student takes place. On the other hand, adversarial imitation learning techniques [1, 7] and more recently [17] build upon a different approach. The latter offers a discriminator-based learning strategy called Adversarial Motion Priors (AMP), which outsources the error metrics, phase, and motion-clip selection to a discriminator that learns to distinguish between the policy's and the motion data's state transitions. AMP does not require specific motion clips to be selected as tracking targets since the policy automatically chooses which style to apply given a particular task. The method's limitation is that whenever multiple provided motion priors cover the same task, the policy might either favor the style that is easier to fulfill or find a hybrid motion similar to both motion clips. In other words, there is no option of actively choosing styles in single- or multi-task settings. Furthermore, the task reward still has to motivate the policy to execute a specific movement, because otherwise the policy might identify two states and oscillate between them. Generally, in our experience, it is not trivial to find task-reward formulations for complex and highly dynamic movements that do not conflict with the style reward provided by the discriminator.

I-B Contribution

This paper introduces the Multi-AMP algorithm and applies it to our real wheeled-legged robot. Like its AMP predecessor [16], this approach automates the imitation objective and motion selection process without heuristics. Furthermore, our extension allows for the intentional switching of multiple different style objectives. The approach can imitate motion priors from three different data sets, i.e., from existing RL controllers, trajectory optimization, and reverse stand-up motions. The latter enables the automatic discovery of feasible sit-down motions on the real robot without tedious reward function tuning. This permits exceptional skills with our wheeled-legged robot in Fig. 1, where the robot can switch between a quadruped and humanoid configuration. To the best of our knowledge, this is the first time such a highly dynamic skill is shown and also the first time that the AMP approach is verified on a real robot.

II Multiple Adversarial Motion Priors

In this work, the goal is to train a policy capable of executing multiple tasks, including styles extracted from individual motion data-sets with the ability to actively switch between them. In contrast to tracking-based methods, the policy should not blindly follow specific motions but rather extract and apply the underlying characteristics of the movements while fulfilling its task.

Fig. 2: Multi-AMP overview: The discriminator predicts a style reward, which is high if the policy's behavior is similar to the motions in the motion data-set, by distinguishing between state transitions from both sources. The style reward is added to the task reward, which finally leads to the policy fulfilling the task while applying the motion data's style.

Similar to the AMP algorithm [16], we split the reward calculation into two parts, $r_t = r_t^{\text{task}} + r_t^{\text{style}}$. The task reward $r_t^{\text{task}}$ is a description of what to do, e.g., velocity tracking, and the style reward $r_t^{\text{style}}$ defines how to do it, namely by extracting and applying the style of the motion priors. While task rewards often have simple mathematical descriptions, the style reward is not trivial to calculate. In the following, we introduce Multi-AMP, a generalization of AMP that allows switching between multiple different style rewards, which constitutes the main theoretical contribution of this work.

A style reward motivates the agent to extract the motion prior's style. We use an adversarial setup with one discriminator $D_i$ per style. For every trained style $i$, a roll-out buffer $\mathcal{B}_i$ collects the states of time-steps where the policy applies that style, and another buffer $\mathcal{M}_i$ contains the motion-data prior of that specific style. Each discriminator learns to differentiate between descriptors built from a pair of consecutive states sampled from $\mathcal{B}_i$ and $\mathcal{M}_i$. Thus, every trainable style is defined by a tuple $(\mathcal{M}_i, \mathcal{B}_i, D_i)$. By avoiding any dependency on the source's actions, the pipeline can process data from sources with unknown actions, such as data from motion tracking and character animation. The discriminator learns to predict the difference between random samples of its motion database $\mathcal{M}_i$ and the agent's transitions sampled from the style's roll-out buffer $\mathcal{B}_i$ by scoring them with $1$ and $-1$, respectively. This behavior is encouraged by solving the least-squares problem [16] defined by

$$\min_{D_i} \;\; \mathbb{E}_{(s,s') \sim \mathcal{M}_i}\!\left[\big(D_i(\Phi(s), \Phi(s')) - 1\big)^2\right] \;+\; \mathbb{E}_{(s,s') \sim \mathcal{B}_i}\!\left[\big(D_i(\Phi(s), \Phi(s')) + 1\big)^2\right] \qquad (1)$$

where the descriptors are built by concatenating the output of an arbitrary feature map $\Phi$ for two consecutive states. The choice of $\Phi$ decides which style information is extracted from the state transitions, e.g., the robot's joint and torso positions, velocities, etc.
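To make Eq. (1) concrete, the following is a minimal PyTorch-style sketch of the least-squares discriminator objective for a single style; the discriminator module and batch tensors are illustrative assumptions, and the gradient penalty used in the AMP reference formulation [16] is omitted for brevity.

```python
import torch

def discriminator_loss(D_i, phi_motion, phi_policy):
    """Least-squares objective of Eq. (1) for the i-th style.

    D_i        : torch.nn.Module mapping a transition descriptor to a scalar score
    phi_motion : batch of descriptors sampled from the motion data-set M_i
    phi_policy : batch of descriptors sampled from the roll-out buffer B_i
    """
    # Descriptors from the motion data are pushed towards a score of +1 ...
    loss_motion = torch.mean((D_i(phi_motion) - 1.0) ** 2)
    # ... and descriptors produced by the policy towards a score of -1.
    loss_policy = torch.mean((D_i(phi_policy) + 1.0) ** 2)
    return loss_motion + loss_policy
```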

II-A Style-reward

During the policy's roll-out, only one style is active at a time. The state passed to the policy at every time-step contains a command $c_t$, which is augmented with a one-hot-encoded style selector $e$, i.e., the elements of $e$ are zero everywhere except at the index $j$ of the active style. As in the standard RL cycle, after the policy predicts an action $a_t$, the environment returns a new state $s_{t+1}$ and a task reward $r_t^{\text{task}}$. The latest state transition is used to construct the style descriptor $\phi_t = \big(\Phi(s_t), \Phi(s_{t+1})\big)$, which is mapped to a style reward using the current style's discriminator $D_j$, with the style reward given by

$$r_t^{\text{style}} = \max\!\left[0,\; 1 - 0.25\,\big(D_j(\phi_t) - 1\big)^2\right] \qquad (2)$$
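As a small illustration of Eq. (2) and of the style selector, the sketch below maps a discriminator score to the clipped style reward and builds the one-hot encoding $e$; the names and shapes are assumptions, not the authors' implementation.

```python
import torch

def style_reward(D_j, phi_t):
    """Eq. (2): reward close to 1 when the active discriminator D_j scores the
    transition descriptor phi_t like a motion-data sample, clipped below at 0."""
    score = D_j(phi_t)
    return torch.clamp(1.0 - 0.25 * (score - 1.0) ** 2, min=0.0)

def one_hot_style(j, n_styles):
    """One-hot style selector e appended to the policy observation:
    zero everywhere except at the index j of the active style."""
    e = torch.zeros(n_styles)
    e[j] = 1.0
    return e
```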

II-B Task-reward

Our agents interact with the environment in a command-conditioned framework. During training, the environment rewards the policy for fulfilling commands sampled from a command distribution $p(c)$. For example, the task might be to track a desired body velocity sampled from a uniform distribution in x, y, and yaw coordinates. The task is included in the policy's observation and essentially informs the agent what to do. The task reward depends on the performance of the policy with respect to the command.
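As an example of such a command-conditioned task reward, the sketch below scores the tracking of a commanded planar velocity and yaw rate with exponential kernels; the kernel width and the exact reward terms are assumptions, since the paper's reward definitions are those listed in Tables I and II.

```python
import torch

def velocity_tracking_reward(base_lin_vel_xy, base_yaw_rate, command, sigma=0.25):
    """Illustrative task reward: how well the robot tracks the commanded
    (v_x, v_y, yaw-rate) sampled from the command distribution p(c)."""
    lin_error = torch.sum((command[:2] - base_lin_vel_xy) ** 2)
    ang_error = (command[2] - base_yaw_rate) ** 2
    # Exponential kernels give a reward in (0, 1] per term, peaking at zero error.
    return torch.exp(-lin_error / sigma) + torch.exp(-ang_error / sigma)
```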

II-C Multi-AMP algorithm

The sum of the style and task rewards constitutes the overall reward, which can be used in any RL algorithm such as Proximal Policy Optimization (PPO) [21] or Soft Actor-Critic (SAC) [6]. The state is additionally stored in the active style's roll-out buffer $\mathcal{B}_j$ to train the discriminator at the end of the epoch. The full approach is shown in the following algorithm:

0:  Input: $\mathcal{M}_1, \dots, \mathcal{M}_n$ (n motion data-sets)
1:  initialize policy $\pi$
2:  initialize value function $V$
3:  initialize n style replay buffers $\mathcal{B}_1, \dots, \mathcal{B}_n$
4:  initialize discriminators $D_1, \dots, D_n$
5:  initialize main replay buffer $\mathcal{R}$
6:  while not done do
7:     for trajectory i = 1, …, m do
8:        $\tau_i \leftarrow$ roll-out with $\pi$
9:        $j \leftarrow$ style-index of $\tau_i$ (encoded in $e$)
10:       if $\mathcal{M}_j$ is not empty then
11:          for t = 0, …, T-1 do
12:             $\phi_t \leftarrow \big(\Phi(s_t), \Phi(s_{t+1})\big)$
13:             $r_t^{\text{style}} \leftarrow$ according to Eq. 2
14:             record $r_t = r_t^{\text{task}} + r_t^{\text{style}}$ in $\tau_i$
15:          end for
16:          store $\tau_i$ in $\mathcal{R}$ and $\phi_{0:T-1}$ in $\mathcal{B}_j$
17:       end if
18:    end for
19:    for update step = 1, …, $n_D$ do
20:       for d = 1, …, n do
21:          sample batch of transitions from $\mathcal{M}_d$
22:          sample batch of transitions from $\mathcal{B}_d$
23:          update $D_d$ according to Eq. 1
24:       end for
25:    end for
26:    update $\pi$ and $V$ (standard PPO step using $\mathcal{R}$)
27: end while
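The following Python sketch condenses one Multi-AMP epoch in the spirit of the algorithm above, reusing the `discriminator_loss` and `style_reward` sketches from Section II. The environment, buffers, and PPO update are hypothetical placeholders rather than the authors' implementation; the sketch only illustrates how the per-style buffers, discriminators, and the summed reward interact.

```python
def multi_amp_epoch(env, policy, descriptor, discriminators, disc_optimizers,
                    motion_sets, style_buffers, ppo_update, n_disc_updates=2):
    """One epoch: collect a roll-out, score it with the active style's
    discriminator, update all discriminators (Eq. 1), then run a PPO step."""
    rollout = []
    state, j = env.reset()                      # j: active style index (one-hot in the observation)
    for _ in range(env.max_steps):
        action = policy(state)
        next_state, r_task, j = env.step(action)
        phi = descriptor(state, next_state)     # concatenated features of consecutive states
        if len(motion_sets[j]) > 0:             # data-free styles receive no style reward
            r_style = style_reward(discriminators[j], phi)
            style_buffers[j].add(phi)
        else:
            r_style = 0.0
        rollout.append((state, action, r_task + r_style))
        state = next_state

    for _ in range(n_disc_updates):
        for d, (D_d, opt) in enumerate(zip(discriminators, disc_optimizers)):
            if len(motion_sets[d]) == 0:
                continue                        # nothing to imitate for data-free styles
            loss = discriminator_loss(D_d, motion_sets[d].sample(),
                                      style_buffers[d].sample())
            opt.zero_grad()
            loss.backward()
            opt.step()

    ppo_update(policy, rollout)                 # standard PPO step on the summed reward
```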

II-D Data-free skills

If no motion data is available for a desired skill and it should nevertheless be trained alongside multiple motion-data skills, Multi-AMP can be adapted slightly. While the policy learns the motion-data-free skill, the style reward $r^{\text{style}}$ is set to $0$. Thereby, the data-free skill is still treated as a valid style and is present in the one-hot-encoded style selector $e$, but the policy is no longer guided by the style reward.

III Experimental Results and Discussion

We implement and deploy the proposed Multi-AMP framework on our wheeled-legged robot in Fig. 1 with 16 degrees of freedom (DOF). The training environment consists of three tasks, two of which are supported by motion data, and one is a data-decoupled task. The first task is four-legged locomotion, whose motion data consists of motions recorded from another RL policy (Fig. 3, top left). The second task is a ducking skill, allowing the robot to duck under a table. The motion data for this skill was generated by a trajectory optimization pipeline, which was deployed and tracked by an MPC controller [3] (Fig. 3, bottom left). The last skill is a partly data-decoupled skill: the wheeled-legged robot learns to stand up on its hind legs, followed by two-legged navigation (Fig. 4), before sitting down again. The sit-down skill is supported by motion data as detailed in Section III-B. A video showing the results accompanies this paper and is available at https://youtu.be/kEdr0ARq48A.

The training environment of our Multi-AMP pipeline is implemented using the Isaac Gym simulator [13, 19], which allows for massively parallel simulation. We spawn 4096 environments in parallel to learn all three tasks simultaneously in a single neural network. The number of environments per task is weighted according to the approximate difficulty of each task. The state transitions collected during the roll-outs of these environments are mapped using a function $\Phi$ that extracts the linear and angular base velocity, the gravity direction in the base frame, the base's height above the ground, the joint positions and velocities, and finally the positions of the wheels relative to the robot's base frame. The task-reward definitions for the three tasks are given in Tables I and II.
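A minimal sketch of such a feature map $\Phi$ is given below; the state-object field names are assumptions chosen to mirror the quantities listed above.

```python
import numpy as np

def phi(robot_state):
    """Extracts the style-relevant features from a (hypothetical) robot state."""
    return np.concatenate([
        robot_state.base_lin_vel,                 # linear base velocity (3)
        robot_state.base_ang_vel,                 # angular base velocity (3)
        robot_state.gravity_in_base,              # gravity direction in the base frame (3)
        [robot_state.base_height],                # base height above ground (1)
        robot_state.joint_pos,                    # joint positions
        robot_state.joint_vel,                    # joint velocities
        robot_state.wheel_pos_in_base.ravel(),    # wheel positions relative to the base frame
    ])
```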

TABLE I: Task-reward terms and weights. All tasks share three penalty terms (weight -0.0001 each); four-legged locomotion has two additional terms (weight 1.5 each), ducking has one (weight 2), and the stand-up rewards are listed in Table II.
TABLE II: Reward terms and weights for AOW standing up (weights 2, 3, -2, -0.003, -1, 1), sitting down (weights -3, 2.65, -0.015, 3), and navigating while standing (weights 2, 2). The terms are defined over the robot base-frame rotation and position, the joint positions (excluding wheels), the hind-leg joint positions, the angle between the robot x-axis and the world z-axis, and binary feet-on-ground and standing-robot indicators.
Fig. 3: Four-legged locomotion (top row) and ducking motion (bottom row) of the motion data source (left column), simulation training (center column), and final deployment on the real robot using Multi-AMP. The former skill is trained with a motion prior from a different simulation environment and control approach, while the ducking motion is trained with data from trajectory optimization [3].
Fig. 4: Stand-up sequence in simulation and on the real robot. The policy is able to stand up, navigate large distances on two legs, and finally sit down again using the stand-up motion prior.

III-A Experiments

Due to the problem of catastrophic forgetting [2, 10, 9], we learn these skills in parallel. This section analyzes the task performance of each Multi-AMP policy compared to policies that exclusively learn a single task (baseline). The three tasks (standing up, ducking, and four-legged locomotion) are trained in different combinations, where ducking and walking are always learned with motion data and stand-up without:

  1. Stand up only

  2. Duck only

  3. Walk only

  4. Walking and standing up

  5. Walking and ducking

  6. Walking, ducking, and standing up

First, we compare the learning performance of the stand-up skill between models no. 1, 4, and 6. The stand-up task is an informative benchmark since it requires a complex sequence of movements to achieve the goal. We normalize all rewards in the following figures by the number of robots receiving the reward, making the plots comparable between the experiments. Fig. 6 shows important metrics of the stand-up learning progress. The figure shows that the policy does not make compromises when training multiple tasks compared to single-task settings. The policy that learns three tasks simultaneously (3 styles in Fig. 6) performs equally well while standing up and sitting down. While it takes the 3-style policy a bit longer to reach the maximum rewards (visible at epoch 1000), the differences vanish after sufficiently long training times. In this case, it takes Multi-AMP about 300 epochs longer to reach the maximum task rewards compared to the single-task policy.

The walking and ducking tasks show a very similar picture, with the specialized policies (models no. 2 and 3 in the list above) reaching a final performance similar to the others. Furthermore, all policies manage to extract the walking and ducking style such that no visible difference can be seen.

In summary, for this specific implementation of the environment and selection of tasks, Multi-AMP, while taking longer, learns to achieve all goals as well as the more specialized policies that learn fewer tasks.

Fig. 5: Comparison of the sit-down motions. Top row: If the agent learns to sit down with task rewards only, it falls forward with extended front legs, which causes high impacts and leads to over-torque on the real robot. The trajectory of the base's center of gravity is marked in blue. Bottom row: When sitting down with the task reward and a style reward from the reversed stand-up sequence, the robot squats down to lower its center of gravity before tilting forward, thereby reducing the magnitude of the impact. The trajectory of the base's center of gravity is marked in green. We note that, compared to the previous case, the base is lowered in a way that causes less vertical base velocity at the moment of impact.

III-B Sit-down training

While the sit-down rewards presented in Table II work well in simulation, the policy's sit-down motions created high impulses in the real robot's knees, which exceeded the robot's safety torque threshold. To obtain more gentle sit-down motions without further reward function tuning, we recorded the stand-up motion, reversed the motion data, and trained a policy using Multi-AMP. Since the stand-up motion starts with zero front end-effector velocity when lifting the front legs off the ground, the reversed style should encourage low-impact sit-down motions. In this Multi-AMP combination, one style contains the reversed motion data for sitting down, while the second style receives plain stand-up rewards. The result is a sit-down motion that uses the hind knees to lower the center of gravity before tilting the base and catching itself on four legs, as shown in Fig. 5. The agent receives zero task rewards for a predefined time after the command to sit down, avoiding task rewards that conflict with the sit-down motion prior. For example, rewarding a horizontal body orientation leads the agent to accelerate the sit-down, which breaks the style. After this buffer time, the sit-down task rewards become active and reward the agent. This allows the robot to sit down at its own speed and in its own style and guarantees non-conflicting rewards.
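As a sketch of how such a reversed motion prior could be constructed, the snippet below plays a recorded stand-up clip backwards and builds the consecutive-state descriptor pairs for the sit-down motion data-set; the array layout and the feature map are assumptions, and velocity channels stored in the recorded states would additionally need their sign flipped when reversing time.

```python
import numpy as np

def reversed_motion_dataset(stand_up_frames, feature_map):
    """Turn a recorded stand-up clip (sequence of T robot states) into the
    descriptor pairs of a sit-down motion prior by reversing time.
    Note: if the stored states contain velocities, those entries should
    also be negated so the reversed clip stays physically consistent."""
    sit_down = stand_up_frames[::-1]
    pairs = [np.concatenate([feature_map(s), feature_map(s_next)])
             for s, s_next in zip(sit_down[:-1], sit_down[1:])]
    return np.stack(pairs)
```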

Fig. 6: Multi-AMP learning capability of the stand-up task. The horizontal axis denotes the number of epochs, and the vertical axis represents the value of the reward calculations after post-processing for comparability. Furthermore, the maximum stand duration is plotted over the number of epochs. Legend: Blue (one style), yellow (two styles), blue (three styles)

III-C Remarks

Finding a balance between training the policy and the discriminators is vital during the Multi-AMP training process. Our observations show that training the discriminators too quickly or too slowly relative to the policy hampers the policy's style training. In our current implementation, the number of discriminator and policy updates is fixed, which might not be an optimal strategy. Since the setup is very similar to a Generative Adversarial Network (GAN), more ideas from [20] could be incorporated into Multi-AMP.

We use an actuator model for the leg joints to bridge the sim-to-real gap [8], while no actuator model is needed for the velocity-controlled wheels. Moreover, we apply strategies to increase the policy's robustness, such as rough-terrain training (see the rough-terrain robustness in Fig. 1), random disturbances, and game-inspired curriculum training [19]. The highly dynamic stand-up skill is especially sensitive to disturbances, which we address by introducing timed pushes and joint-velocity-based trajectory termination. The former identifies the most critical phase of the skill and pushes the robot in the worst possible way. This increases the number of disturbances the policy experiences during these critical phases, rendering it more robust and thus also helping with sim-to-real transfer. Furthermore, by terminating the trajectory if the joint velocity of any degree of freedom (DOF) exceeds the actuator's limits, the policy learns to keep a safety tolerance to these limits.
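The two robustness measures can be sketched as follows; the limit representation and the push schedule are illustrative assumptions rather than the values used in training.

```python
import numpy as np

def velocity_limit_termination(joint_vel, actuator_vel_limits):
    """Terminate the trajectory if any DOF exceeds its actuator velocity limit,
    so the policy learns to keep a safety tolerance to these limits."""
    return bool(np.any(np.abs(joint_vel) > actuator_vel_limits))

def timed_push(step, critical_step, push_velocity):
    """Apply a disturbance exactly at the most critical phase of the stand-up
    skill (e.g., as an additive base-velocity perturbation)."""
    return push_velocity if step == critical_step else np.zeros_like(push_velocity)
```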

IV Conclusions

This work introduces Multi-AMP, with which we automate the imitation objective and motion selection process of multiple motion priors without heuristics. Our experimental section shows that we can simultaneously learn different styles and skills in a single policy. Furthermore, our approach can intentionally switch between these styles and skills, and data-free styles are also supported. Various multi-style policies are successfully deployed on a wheeled-legged robot. To this end, we show different combinations of skills such as walking, ducking, standing up on the hind legs, navigating on two wheels, and sitting down on all four legs again. We avoid tedious reward function tuning by training the sit-down motion with a motion prior gained from reversing a stand-up recording. Furthermore, we note that performance similar to the single-style case can be expected even when learning multiple styles simultaneously. We conclude that Multi-AMP and its predecessor AMP [17] are promising steps towards a possible future without style-reward function tuning in RL. However, even though less time is invested in tuning reward functions, more time is required to generate motion priors, which are in most cases not readily available for specific tasks.

To the best of our knowledge, this is the first time that a quadruped-humanoid transformation is shown on a real robot, challenging how we categorize multi-legged robots. Over the next few years, this skill will further expand the possibilities of wheeled quadrupeds by enabling them to open doors, grab packages, and cover many more use-cases.

References

  • [1] P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. New York, NY, USA.
  • [2] C. Atkinson, B. McCane, L. Szymanski, and A. V. Robins (2018) Pseudo-rehearsal: achieving deep reinforcement learning without catastrophic forgetting. CoRR abs/1812.02464.
  • [3] M. Bjelonic, R. Grandia, M. Geilinger, O. Harley, V. S. Medeiros, V. Pajovic, S. Edo, and M. Hutter (2022) Complex motion decomposition: combining offline motion libraries with online MPC. Under review for The International Journal of Robotics Research.
  • [4] M. Bjelonic, R. Grandia, O. Harley, C. Galliard, S. Zimmermann, and M. Hutter (2021) Whole-body MPC and online gait sequence generation for wheeled-legged robots. Under review for the IEEE Int. Conf. on Robotics and Automation.
  • [5] N. Chentanez, M. Müller, M. Macklin, V. Makoviychuk, and S. Jeschke (2018) Physics-based motion capture imitation with deep reinforcement learning. pp. 1–10.
  • [6] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv:1801.01290.
  • [7] J. Ho and S. Ermon (2016) Generative adversarial imitation learning. arXiv:1606.03476.
  • [8] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter (2019) Learning agile and dynamic motor skills for legged robots. Science Robotics 4 (26), pp. eaau5872.
  • [9] P. Kaushik, A. Gain, A. Kortylewski, and A. L. Yuille (2021) Understanding catastrophic forgetting and remembering in continual learning with optimal relevance mapping. CoRR abs/2102.11343.
  • [10] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526.
  • [11] J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter (2020) Learning quadrupedal locomotion over challenging terrain. Science Robotics 5 (47), pp. eabc5986.
  • [12] S. Levine, J. M. Wang, A. Haraux, Z. Popović, and V. Koltun (2012) Continuous character control with low-dimensional embeddings. 31 (4).
  • [13] V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, and G. State (2021) Isaac Gym: high performance GPU-based physics simulation for robot learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
  • [14] T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter (2022) Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics 7 (62).
  • [15] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne (2018) DeepMimic: example-guided deep reinforcement learning of physics-based character skills. ACM Trans. Graph. 37 (4), pp. 143:1–143:14.
  • [16] X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa (2021) AMP: adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (TOG) 40 (4), pp. 1–20.
  • [17] X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa (2021) AMP: adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics 40 (4), pp. 1–20.
  • [18] M. H. Raibert and J. K. Hodgins (1991) Animation of dynamic legged locomotion. 25 (4).
  • [19] N. Rudin, D. Hoeller, P. Reist, and M. Hutter (2021) Learning to walk in minutes using massively parallel deep reinforcement learning. arXiv:2109.11978.
  • [20] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training GANs. CoRR abs/1606.03498.
  • [21] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv:1707.06347.
  • [22] K. Wampler, Z. Popović, and J. Popović (2014) Generalizing locomotion style to new animals with inverse optimal regression. 33 (4).
  • [23] Y. Ye and C. K. Liu. Synthesis of responsive motion using a dynamic model. Computer Graphics Forum 29 (2), pp. 555–562.
  • [24] V. B. Zordan and J. K. Hodgins (2002) Motion capture-driven simulations that hit and react. New York, NY, USA.