Learning to Brachiate via Simplified Model Imitation

by   Daniele Reda, et al.
The University of British Columbia

Brachiation is the primary form of locomotion for gibbons and siamangs, in which these primates swing from tree limb to tree limb using only their arms. It is challenging to control because of the limited control authority, the required advance planning, and the precision of the required grasps. We present a novel approach to this problem using reinforcement learning, as demonstrated on a finger-less 14-link planar model that learns to brachiate across challenging handhold sequences. Key to our method is the use of a simplified model, a point mass with a virtual arm, for which we first learn a policy that can brachiate across handhold sequences with a prescribed order. This facilitates learning the policy for the full model, by providing an overall center-of-mass trajectory to imitate, as well as the timing of the holds. Lastly, the simplified model can also readily be used for planning suitable sequences of handholds in a given environment. Our results demonstrate brachiation motions with a variety of durations for the flight and hold phases, as well as emergent extra back-and-forth swings when this proves useful. The system is evaluated with a variety of ablations. The method enables future work towards more general 3D brachiation, as well as using simplified model imitation in other settings.




1. Introduction

Brachiation is the form of locomotion that consists of moving between tree branches or handholds using the arms alone (Fleagle, 1974; Preuschoft and Demes, 1985). Gibbons and siamangs make this seem effortless and are among the world’s most agile brachiators. Brachiating movements can be classified into two types: continuous contact brachiation, characterized by the animal maintaining contact at all times with a handhold such as a tree branch, and ricochetal brachiation, which involves a flight phase between successive grasps (Bertram et al., 1999; Wan et al., 2015). These two modes of brachiation are analogous to walking and running, respectively, with ricochetal brachiation used at faster speeds.

Machine learning techniques such as deep reinforcement learning (RL) have been previously applied to legged, aerial, and underwater locomotion (Won et al., 2018; Luo et al., 2020; Min et al., 2019; Xie et al., 2020). Despite the apparent similarity to these types of locomotion, brachiation is unique in a number of ways. First, the choice of handholds is often discrete, i.e., the choice of particular branches, and therefore unlike the continuous terrain usually available for legged locomotion. This offers fewer options for control, i.e., less control authority, while at the same time necessitating high spatial precision to effect a grasp. Second, advance planning of the motion across a sequence of handholds becomes more important because of the momentum that may be needed to reach future handholds. Lastly, brachiation offers the possibility of very efficient horizontal motion; an ideal point mass can follow alternating pendulum-like swings and parabolic flight phases with no loss of energy (Bertram et al., 1999).

In our work, we present a fully learned solution for simulating a 14-link planar articulated model. Similar to previous work on brachiation, our model uses a pinned attachment model as a proxy for the grasping mechanics. The learned control policy demonstrates ricochetal brachiation across challenging sequences of handholds that, to the best of our knowledge, is the most capable of its kind.

Our primary contributions are as follows:


  • We present an effective learning-based solution for planar brachiation. The learned control policies can traverse challenging sequences of handholds using ricochetal motions with significant flight phases, and demonstrate emergent phenomena, such as extra back-and-forth swings when needed.

  • We propose a general two-stage RL method for leveraging simplified physical models. RL on the simplified model allows for fast and efficient exploration, quickly producing an effective control policy. The full model then uses imitation learning of the motions resulting from the simplified model for efficient learning of complex behaviors.

2. Related Work

Our work sits at the intersection of physics-based character animation, the study of brachiation, and a broad set of methods that leverage abstract models in support of motion planning and control.

2.1. Brachiation

Due to the fascinating and seemingly effortless nature of brachiation, there exists a long history of attempts to reproduce this often-graceful type of motion, both in simulation and on physical robots. The key challenge lies in developing control strategies that are well suited to the underactuated and dynamic nature of the task.

A starting point is to first study brachiation in nature, such as via a detailed film study (Fleagle, 1974). Among other insights, they note that the siamang is able to limit lateral motion of the center of mass between handholds, in part due to extensive rotation at the wrist, elbow, and shoulder. A benefit of longer arms for reduced energetic costs of brachiation can also be found (Preuschoft and Demes, 1985). A more recent study (Michilsens et al., 2009) notes that “compared with other primates, the elbow flexors of gibbons are particularly powerful, suggesting that these muscles are particularly important for a brachiating lifestyle.” We note that the limited lateral motion and strong elbow flexors correspond well to the capabilities afforded by sagittal-plane point-mass models.

One of the earliest results for brachiation capable of varying height and distance uses an intricate heuristic procedure and manually structured motion phases, as applied to a two-link robot with no flight phases (Saito, 1992; Saito et al., 1994). A loosely comparable approach is proposed for a 3-link planar model, restricted to horizontal motions (Zhang and Wong, 1999). A point mass model can be shown to produce solutions with zero energetic cost for both continuous contact and ricochetal brachiation, for horizontal brachiation with regular spacing of handholds (Bertram et al., 1999). The solutions are based on circular pendular motion, alternating with parabolic free flight phases in the case of ricochetal motion. It is found that natural gibbon motion is even smoother than the motions predicted by this model, particularly for handhold forces. Several results successfully developed horizontal continuous brachiation strategies for a two-link robot (Nakanishi et al., 2000) and motions for a three-link model as demonstrated on a humanoid robot (Kajima et al., 2004). Zero-energy costs are later demonstrated for a five-link model (Gomes and Ruina, 2005).

Methods and results since 2015 have continued to make additional progress. Non-horizontal ricochetal brachiation is demonstrated for a 2-link primate robot (Wan et al., 2015) and uses an analytic solver for the boundary conditions of each swing and a Lyapunov-based tracking controller. A method is developed for planar brachiation on flexible cables (Farzan et al., 2018), for a 3-link model and corresponding robot. A multiple shooting method is used to plan open-loop feedforward commands. Brachiation control of a 3-link robot is also designed and demonstrated using iLQR (Yang et al., 2019), for animated sequences of single swings. A pendulum model and a planar articulated model are described in the PhD work of Berseth (Berseth, 2019), based on a finite-state-machine control structure. While demonstrating potential, the results are based on a “grab anywhere” model and produce controllers with limited capabilities and realism. Control of transverse (sideways) ricochetal brachiation has also been investigated recently, for a 4-link arms, body, and tail model (Lin and Yang, 2020).

Our work differs from the bulk of the above work in the following ways. We demonstrate the ability to learn planar ricochetal brachiation with minimal manual engineering, and that can traverse handhold sequences with significant variation. We demonstrate emergent behaviors, such as additional back-and-forth swings as may be needed to proceed in some situations. Our approach leverages the benefits of learning with a simplified model, showing that the simplified-model control policies can provide highly-effective guidance for learning with the full articulated-body model. The learned control policies anticipate upcoming handholds, and can be readily integrated with a simple motion planner.

2.2. Physics-based Character Skills

Physics-based worlds require simulated characters that are capable of many skills, including the ability to move through all kinds of environments with speed and agility. Significant progress has been made for over thirty years, although it is only recently that we have begun to see approaches that demonstrate scalability, in particular with regard to being able to imitate a wide variety of motion capture clips with a physics-based human model, e.g., (Peng et al., 2018; Chentanez et al., 2018; Merel et al., 2018; Park et al., 2019; Bergamin et al., 2019; Won et al., 2020; Wang et al., 2020). Recent work has also demonstrated the ability to learn bipedal locomotion skills that are capable of navigating difficult step sequences, using learning curricula and in the absence of motion capture data (Xie et al., 2020).

2.3. Learning with Simplified Models

Simplified physical approximations are a standard way to create models that are tractable to simulate and control, particularly for legged locomotion and brachiation. A common strategy is to first create a motion plan using the simplified model, e.g., a spring-loaded inverted pendulum (Poulakakis and Grizzle, 2009) or centroidal dynamics models, e.g., (Green et al., 2021; Tsounis et al., 2020; Xie et al., 2021), and then treat this as a reference motion to be tracked online (Siciliano et al., 2008). Both the motion planning and tracking problems have commonly been solved using model-based trajectory optimization (TO) or model-free reinforcement learning (RL) methods. In what follows below, we will use a ‘plan+tracking’ notation to loosely categorize combinations of motion planning and tracking of the resulting motion plan. There also exists an abundance of literature on pure TO and RL solutions, which we place outside the scope of our discussion, although for evaluation purposes, we do consider an RL-only baseline method.

Figure 2. Overview of our simplified model imitation learning system. The simplified model allows for efficient exploration and quickly produces physically-feasible reference trajectories for the full model to imitate.

A prominent example of a TO+TO approach can be found in the control of the Boston Dynamics Atlas robot (Kuindersma, 2020), where the planning TO uses a long horizon and solves from scratch, and the tracking TO performs online MPC-style solves over a short horizon, leveraging the planning TO results to warm start the optimization. Another TO+TO example first plans legged locomotion with a sliding-based inverted pendulum, which then initializes a more detailed TO stage with contact locations and a centroidal model (Kwon et al., 2020), and, lastly, maps this to a full body model using momentum-mapped inverse kinematics. Much recent work in imitation-based physics-based animation and robotics could be categorized as fixed+RL, where captured motion data effectively serves as a fixed motion plan, e.g., (Peng et al., 2018) and related works.

Quadruped control policies have been developed using both TO+RL and RL+TO approaches, e.g., (Gangapurwala et al., 2021) and (Gangapurwala et al., 2020; Xie et al., 2021), respectively. A recent RL+RL approach uses a simplified model (and related policy) that is allowed to sample in state space rather than action space (Sontakke and Ha, 2021). This allows it to behave more like a motion planner and assumes that the result can still provide a useful motion plan for the full model to imitate. The method demonstrates the ability to avoid a stand-in-place local minimum for a simulated quadruped, which may arise due to sparse rewards. Our work develops an RL+RL approach for the challenging control problem of brachiation. In our case, both the simplified and full MDPs work directly in a physically-grounded action space. We demonstrate the benefit of this two-level approach via multiple baselines and ablations.

3. Environments

3.1. Simplified Model

Our simplified model treats the gibbon as a point mass equipped with a virtual extensible arm. The virtual arm is capable of grabbing any handhold within the annular area defined by an inner radius r_min = 10 cm and an outer radius r_max = 75 cm. We note that while r_max is greater than the arm length of the full model (60 cm), the simplified model is intended to be a proxy for the center of mass of the gibbon, and not the shoulder.

The simplified model dynamics consist of two alternating phases. The swing phase starts when the character performs a grab while one of its hands is at (or near) the target handhold. During this phase, the character can apply a force along the direction of the grabbing arm, either to pull towards or push away from the current handhold. The swing phase dynamics are equivalent to a spring-and-damper point-mass pendulum system. The character can influence the angular velocity by shortening or lengthening the swing arm of the pendulum. The flight phase is defined as the period when neither hand is grabbing. During this phase, the character’s trajectory is governed by passive physics, following a parabolic trajectory. The only control authority that remains during a flight phase comes from the decision of when to grab onto the next handhold, if it is within reaching distance. The control parameters are described in detail in Section 3.4.

The minimum and maximum arm lengths are enforced by applying a force impulse when the length constraint is violated during a swing phase. We implement the physics simulation of the simplified environment in PyTorch (Paszke et al., 2019), which can be trivially parallelized on a GPU.
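The swing-phase dynamics described above can be sketched in a few lines. The following is a minimal single-sample Python sketch (the actual implementation batches these updates over many environments in PyTorch); the arm-length limits, gravity, and the 480 Hz step come from the text, while the impulse-projection details and unit mass are our assumptions.

```python
import math

R_MIN, R_MAX = 0.10, 0.75   # virtual-arm length limits (m)
G = -9.81                   # gravity (m/s^2), y is up
DT = 1.0 / 480.0            # simulation step (480 Hz)

def swing_step(pos, vel, handhold, arm_force):
    """Advance the point mass one step while grabbing `handhold`.
    `arm_force` > 0 pulls the mass toward the handhold (mass = 1 kg
    here for simplicity)."""
    dx, dy = handhold[0] - pos[0], handhold[1] - pos[1]
    dist = math.hypot(dx, dy)
    ux, uy = dx / dist, dy / dist              # unit vector along the arm
    ax, ay = arm_force * ux, arm_force * uy + G
    vel = (vel[0] + ax * DT, vel[1] + ay * DT)
    pos = (pos[0] + vel[0] * DT, pos[1] + vel[1] * DT)

    # Enforce arm-length limits with a velocity-level impulse: remove the
    # radial velocity component and project back onto the allowed annulus.
    dx, dy = handhold[0] - pos[0], handhold[1] - pos[1]
    dist = math.hypot(dx, dy)
    if dist > R_MAX or dist < R_MIN:
        ux, uy = dx / dist, dy / dist
        radial = vel[0] * ux + vel[1] * uy
        vel = (vel[0] - radial * ux, vel[1] - radial * uy)
        clamped = min(max(dist, R_MIN), R_MAX)
        pos = (handhold[0] - ux * clamped, handhold[1] - uy * clamped)
    return pos, vel
```

During a flight phase, the same integration applies with the arm force and the length constraint removed, yielding the parabolic trajectory.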

3.2. Full Model

The full model is a planar articulated model of a gibbon-like character that approximates the morphology of real gibbons, as reported in (Michilsens et al., 2009), and is simulated using PyBullet (Coumans and Bai, 2016).

Our gibbon model consists of 13 hinge joints including shoulders, elbows, wrists, hips, knees, ankles, and a single waist joint at the pelvis. All joints have associated torques, except for the wrist joints, which we consider to be passive. To capture the 3D motion of the ball-and-socket joints in two dimensions, we model the shoulder joints as hinge joints without any corresponding joint limits. This allows the simulated character to produce the under-swing motions of real gibbons. Our gibbon has a total mass of 9.8 kg, an arm-reach of 60 cm, and a standing height of 63 cm. We artificially increase the mass of the hands and the feet to improve simulation stability. The physical properties of our gibbon model are summarized in Table 1.

Joint Max. Torque (Nm) Link Mass (kg) Link Length (cm)
Waist 21 3.69 31
Shoulder 35 0.74 26
Elbow 28 0.74 26
Wrist 0 0.40 8
Hip 14 0.37 12
Knee 14 0.31 12
Ankle 7 0.47 8
Table 1. Physical Properties of Full Gibbon Model. The waist joint link mass and length includes the torso and the head. Others include only the direct child link.

The grab behavior in the full model is simulated using point-to-point constraints. This allows us to abstract away the complexity of modelling hand-object dynamics, yet still provides a reasonable approximation to the underlying physics. In our implementation, a point-to-point constraint is created when the character performs a grab and the hand is within five centimeters of the target handhold. The constraint pins the grabbing hand in its current position, as opposed to the target handhold location, to avoid introducing undesirable fictitious forces. At release, the constraint is removed.
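The grab logic above can be summarized as a small predicate. This is an illustrative sketch, not the paper's code: the 5 cm threshold and the pin-at-hand behavior come from the text, while the function name and signature are ours. In PyBullet itself, the pin would be created with `createConstraint(..., jointType=p.JOINT_POINT2POINT, ...)` against the world and removed with `removeConstraint` at release.

```python
# When the policy outputs a grab and the hand is within GRAB_RADIUS of
# the target handhold, the hand is pinned at its *current* position
# (not snapped to the handhold), avoiding fictitious forces.
GRAB_RADIUS = 0.05  # 5 cm

def try_grab(hand_pos, handhold_pos, wants_grab):
    """Return the pin position if a grab succeeds, else None."""
    if not wants_grab:
        return None
    dx = hand_pos[0] - handhold_pos[0]
    dy = hand_pos[1] - handhold_pos[1]
    if (dx * dx + dy * dy) ** 0.5 <= GRAB_RADIUS:
        return hand_pos  # pin here, at the hand's current position
    return None
```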

3.3. Handhold Sequences Generation

Both the simplified and full characters operate in the same environment, which contains handhold sequences. In a brachiation task, the goal is to grab successive handholds precisely and move forward. Successive handholds are generated from uniform distributions over the distance (in meters) and the pitch, relative to the previous handhold.
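A generator in this style can be sketched as follows. The sampling ranges below are placeholders (the paper's exact distance and pitch ranges are not reproduced in this excerpt); only the relative distance-and-pitch parameterization comes from the text.

```python
import math
import random

def generate_handholds(n, d_range=(0.5, 1.0), pitch_range=(-0.5, 0.5),
                       start=(0.0, 0.0), seed=None):
    """Sample a handhold sequence; each handhold is placed at a uniform
    distance (m) and pitch (rad) relative to the previous one.
    The ranges are illustrative assumptions."""
    rng = random.Random(seed)
    seq = [start]
    for _ in range(n):
        d = rng.uniform(*d_range)
        pitch = rng.uniform(*pitch_range)
        x, y = seq[-1]
        seq.append((x + d * math.cos(pitch), y + d * math.sin(pitch)))
    return seq
```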

3.4. Action Spaces

The simplified and full models both use a control frequency of 60 Hz and a simulation frequency of 480 Hz.

The action space for the simplified model consists of two values: a target length offset and a flag indicating whether to grab or to release. At the beginning of each control step, the target length offset is added to the current arm length to form the desired target length. In addition, the grab flag is used to determine the current character state. At each simulation step, the applied force is computed using PD control based on the desired target length. We clamp the maximum applied force to 240 N, close to the maximum forces observed in real gibbons, as reported in (Michilsens et al., 2009).
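A minimal sketch of this control step follows. The 240 N clamp is from the text; the PD gains are illustrative assumptions.

```python
F_MAX = 240.0  # clamp on the applied arm force (N)

def arm_force(current_len, target_offset, len_rate, kp=500.0, kd=20.0):
    """PD force along the grabbing arm. `target_offset` is the policy
    action, added to the current arm length to form the target length;
    kp and kd are assumed values."""
    target_len = current_len + target_offset
    f = kp * (target_len - current_len) - kd * len_rate
    return max(-F_MAX, min(F_MAX, f))
```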

In the full model, the action space is analogous to that of the simplified model, only extended to every joint. An action consists of 15 parameters: the joint angle offsets for each of the 13 joints and a flag indicating grab or release for each hand. The applied torques in each joint are computed using PD control. We set the proportional gain, k_p, for each joint to be equal to its range of motion in radians, and the derivative gain, k_d, to a small fixed value.
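As a concrete reading of the per-joint PD rule above: k_p equals the joint's range of motion in radians and torques are clamped to the Table 1 limits. In this sketch, the range-of-motion values and the derivative-gain scale are assumptions; only the torque limits come from Table 1.

```python
# name: (max torque in Nm from Table 1, assumed range of motion in rad)
JOINTS = {
    "elbow": (28.0, 2.8),
    "knee": (14.0, 2.4),
}

def joint_torque(name, angle, target_offset, ang_vel, kd_scale=0.1):
    """PD torque for one joint; the policy action is an angle offset."""
    tau_max, rom = JOINTS[name]
    kp = rom                 # proportional gain = range of motion (rad)
    kd = kd_scale * kp       # derivative gain (assumed scale)
    target = angle + target_offset
    tau = kp * (target - angle) - kd * ang_vel
    return max(-tau_max, min(tau_max, tau))
```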

3.5. State Spaces

Both environments have a similar state space, composed of information of the character and information on the future handholds sequence to grab.

In the simplified environment, the character state is three-dimensional, consisting of the root velocity in world coordinates and the elapsed time since the start of the current swing phase. The elapsed time information is necessary to make the environment fully observable, since the environment enforces a minimum and a maximum grab duration (§4.4). We normalize the elapsed time by dividing by the maximum allowable grab duration. A zero indicates that the character is not grabbing, and a one indicates that it has grabbed for the maximum amount of time.

The character state for the full model is a 45D vector consisting of root velocity, root pitch, joint angles, joint velocities, grab arm height, and grab states. The torso is used as the root link. The root velocity is the velocity in the sagittal plane. Joint angles are represented as the sine and cosine of each joint angle. Joint velocities are the angular velocities measured in radians per second. Grab arm height is the distance between the free hand and the root in the upward direction, which can be used to infer whether the next target is reachable. Lastly, the grab states are Boolean flags indicating whether each hand is currently grabbing.

In both environments, the control policy receives task information pertinent to the handhold locations in addition to the character state. In particular, the task information includes the location of the current handhold and the N upcoming handholds in the character’s coordinate frame. We experiment with different values of N in Section 5.1.
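The task observation can be assembled as in the sketch below: the current and next few handholds, expressed relative to the character's root. The function name and 2D tuple layout are illustrative assumptions.

```python
def task_observation(root_pos, handholds, n_lookahead):
    """Current handhold plus n_lookahead upcoming handholds, each
    expressed in the character's coordinate frame (relative to root)."""
    obs = []
    for hx, hy in handholds[: n_lookahead + 1]:
        obs.extend([hx - root_pos[0], hy - root_pos[1]])
    return obs
```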

4. Learning Control Policies

We use deep reinforcement learning (DRL) to learn brachiation skills in both the simplified and the full environment. In RL, at each time step t, the control policy reacts to an environment state s_t by performing an action a_t. Based on the action performed, the policy receives a reward r_t as feedback. In DRL, the policy computes a_t using a neural network π_θ(a_t | s_t), where π_θ(a_t | s_t) is the probability density of a_t under the current policy. The goal of DRL is to find the network parameters θ which maximize the expected discounted return:

    J(θ) = E_π_θ [ Σ_t γ^t r_t ].

Here, γ ∈ (0, 1) is the discount factor, which ensures that the sum converges. We solve this optimization problem using the proximal policy optimization (PPO) algorithm (Schulman et al., 2017), a policy-gradient actor-critic algorithm. We choose PPO for its effective utilization of hardware resources in parallelized settings, which reduces wall-clock learning time. Our learning pipeline is based on a publicly available implementation of PPO (Kostrikov, 2018). A review of the PPO algorithm is provided in Appendix B.
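The discounted return optimized above can be computed for a single rollout with a standard backward recursion, a common building block of any PPO implementation:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of rewards, sum_t gamma^t * r_t, computed
    backwards; gamma < 1 ensures the sum converges."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```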

4.1. System Overview

We start by providing an overview of our system, shown in Figure 2. Our system contains three distinct components: the simplified model, the full model, and a handhold sequence generator. During training, the simplified model is trained first on randomly sampled handhold sequences produced by the handhold generator. After the simplified model is trained, we obtain reference trajectories from the simplified environment on a fixed set of handhold sequences. The full model is subsequently trained by imitating the reference trajectories from the simplified model while optimizing task and style objectives. At evaluation time, the simplified model can be used as a planner to provide guidance for the full model, either upfront for an entire handhold sequence or on a per-grab basis. The two models working in tandem allow our system to traverse difficult handhold sequences that would otherwise be challenging for standard reinforcement learning policies.

4.2. Learning Simplified Policies

The simplified policy is trained to optimize a sparse reward that is given when the character grabs the next handhold. The fact that the simplified policy can be trained with only the sparse task reward is desirable, as it allows interesting behaviors to emerge. In DRL, reward shaping is often required for learning to succeed, or to significantly speed it up. At the same time, reward shaping can also introduce biases that cause the control policy to optimize for task-irrelevant objectives and deviate from the main task. Directly optimizing a policy using the sparse task reward allows us to find solution modes that would otherwise be exceedingly difficult to find.

Simplified Policy Networks

The simplified policy contains two neural networks, the controller and the value function. The controller and the value function have similar architectures, differing only in the output layer. Each neural network is a three-layered feed-forward network with 256 hidden units followed by ReLU activations. Since the actions are to be scaled by the proportional gain constants in the PD controller, we apply a Tanh activation to the controller network’s output layer to ensure the values are normalized between -1 and +1. The value function outputs a scalar value approximating the expected return, which is not normalized. The inputs to the networks are as described earlier (§3.5).
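The controller architecture can be sketched as follows. This is a pure-Python stand-in for the actual PyTorch modules; the input dimension and the random weight initialization are illustrative, while the 256-unit hidden layers, ReLU activations, and Tanh output come from the text.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    s = 1.0 / math.sqrt(n_in)
    w = [[rng.uniform(-s, s) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def forward(layers, x):
    """ReLU hidden layers, Tanh output so actions land in [-1, 1]."""
    *hidden, last = layers
    for w, b in hidden:
        x = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bi)
             for row, bi in zip(w, b)]
    w, b = last
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

rng = random.Random(0)
dims = [9, 256, 256, 256, 2]  # input dim 9 is illustrative; 2 actions
policy = [init_layer(a, b, rng) for a, b in zip(dims[:-1], dims[1:])]
```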


(a) Backward swing
(b) Forward swing
(c) Complete pumping trajectory
Figure 3. Brachiation with emergent pumping behavior. The blue line gives the trajectory of the simplified model, while the brown line gives the trajectory of the root of the full character. The disparity between the two is caused by the fact that the simplified model has a longer arm length than the full model, to better represent its center of mass. For further visual trajectories please refer to Appendix D and to the supplementary video.

4.3. Learning Full Model Policies Using Simplified Model Imitation

The full model policy is trained using imitation learning by tracking the reference trajectory generated by the simplified model. The imitation learning task can be considered an inverse dynamics problem, where the full model policy is required to produce the joint torques at each time step given the current character state and the future reference trajectory. We consider the reward function to have a task reward component, an auxiliary reward component, and a style reward component.

As in the simplified environment, the task reward, r_task, is the sparse reward for successfully grabbing the next target handhold. The auxiliary reward, r_aux, consists of reward objectives that facilitate learning, including a tracking term and a reaching term. The tracking term rewards the character for closely following the reference trajectory, and the reaching term encourages the grabbing hand to be close to the next target handhold. The style reward, r_style, includes terms added to incentivize more natural motions: keeping the torso upright, slowing arm rotation, reducing lower-body movement, and minimizing energy. Finally, the overall reward is computed as:

    r = r_task + r_aux + r_style.

Note that the style reward does not affect the learning outcome; the character can learn brachiation motions without it. The reward terms and their weighting coefficients are fully described in Appendix A.
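A sketch of this reward combination is given below. The exponential shaping of the tracking and reaching terms and the weights w_* are illustrative assumptions; the actual terms and coefficients are in Appendix A.

```python
import math

def total_reward(grabbed_target, track_err, reach_err, style_terms,
                 w_track=0.5, w_reach=0.3, w_style=0.1):
    """Sparse task reward plus weighted auxiliary (tracking, reaching)
    and style terms. Errors are distances; smaller is better."""
    r_task = 1.0 if grabbed_target else 0.0
    r_aux = w_track * math.exp(-track_err) + w_reach * math.exp(-reach_err)
    r_style = w_style * sum(style_terms)
    return r_task + r_aux + r_style
```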

Full Policy Networks

The controller and the value function of the full model consist of two five-layer neural networks, with a layer width of 256. The first three hidden layers of the policy use the softsign activation, while the final two hidden layers use ReLU activation. A Tanh is applied to the final output to normalize the actions. The critic uses ReLU for all hidden layers. Prior work is contradictory with respect to the impact of network size, finding larger networks to perform both better (Wang et al., 2020) and worse (Bergamin et al., 2019) for full-body motion imitation tasks. In our experiments, we find the smaller networks used for the simplified model to be incapable of learning the brachiation task in the full environment.

4.4. Initial State and Early Termination

For both environments, we initialize the character to be hanging from the first handhold at a 90-degree angle, where zero degrees corresponds to the pendulum resting position. The initial velocity of the character is set to zero, as the elevated starting pose already provides the energy needed to swing toward the next handhold.

The environment resets when the termination condition is satisfied, which we define as entering an unrecoverable state. An unrecoverable state is reached when the character is not grabbing onto any handholds and falls below the graspable range of the next handhold with a downward velocity.

In addition to the unrecoverable criterion, both environments enforce a minimum and maximum grab duration, set to 0.25 and 4 seconds respectively. The minimum grab duration simulates the reaction time and the time required to physically form a fist during a grasp. The maximum grab duration effectively serves as another form of early termination: it forces the character into an unrecoverable state unless the policy has already learned to generate forward momentum. We evaluate the importance of early termination in Section 5.
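The termination logic above can be summarized as a predicate. This sketch simplifies the maximum-duration rule by treating an over-long grab as directly terminal; the 0.25 s and 4 s limits come from the text, and `reach` stands in for the graspable range of the next handhold.

```python
MIN_GRAB, MAX_GRAB = 0.25, 4.0  # seconds

def should_terminate(grabbing, grab_time, pos_y, vel_y, next_hold_y,
                     reach=0.75):
    """Unrecoverable state: not grabbing, below the graspable range of
    the next handhold, and moving downward. Exceeding MAX_GRAB is
    treated as terminal here (a simplification of the actual rule)."""
    if grabbing:
        return grab_time > MAX_GRAB
    return pos_y < next_hold_y - reach and vel_y < 0.0
```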

5. Results

All experiments are performed on a single 12-core machine with one NVIDIA RTX 2070 GPU. For the simplified model, training takes approximately five minutes, as the simulation is parallelized on the GPU. Training the full model policy takes just under one hour. Each experiment is limited to a total of 25 million samples, and the learning rate is decayed over time using exponential annealing. We use the Adam optimizer (Kingma and Ba, 2014) with a mini-batch size of 2000 to update policy parameters in all experiments. Other implementation and hyperparameter details are summarized in Appendix C.

5.1. Simplified Model Results

The goal of the simplified model is to facilitate more efficient training of a full model policy by generating physically-feasible reference trajectories. A baseline solution for the simplified model can be successfully trained to find reasonable solutions for traversing difficult terrains, i.e., with handholds sampled from the full distribution (§3.3), using only the task reward described in Section 4.2. The learned controllers exhibit some speed variation; we use a low-speed controller to train the full model policy, as its trajectories are closer to the motions demonstrated by real gibbons.

The learned control policies can traverse challenging sequences of handholds that require significant flight phases, and exhibit emergent pumping behavior, where the character performs extra back-and-forth swings as needed. We observe three scenarios in which pumping occurs: waiting for the minimum swing time to elapse, adjusting for a better release angle, and gaining momentum. A visualization of the pumping behavior is shown in Figure 3. Please refer to the supplementary video for visual demonstrations of generated trajectories.

Number of Look-ahead Handholds

In this experiment, we empirically verify the choice for the number of look-ahead handholds by comparing task performance for N ∈ {1, 2, 3, 5, 10}. Results are summarized in Table 2. The best performance is achieved with a single look-ahead handhold, as measured by both the number of completed handholds and the average episode reward. This is surprising, since previous work on human locomotion has shown that a two-step anticipation horizon achieves the lowest stepping error (Coros et al., 2008). The results show that policy performance generally decreases as the number of look-ahead handholds increases. We hypothesize that, due to the relatively small network size used in the simplified policy, increasing the number of look-ahead handholds creates more distraction for the policy. We leave validation of this hypothesis to future work.

Importance of Early Termination

In this ablation study, we verify the importance of applying early termination in the simplified environment. There are two forms of early termination: the unrecoverable criterion and the maximum grab duration. The complete simplified model implements all early termination strategies. Table 3 shows the average number of handholds reached, aggregated across five runs per configuration. The results show that enforcing strict early termination facilitates faster learning. When the maximum grab duration is disabled, only one run out of ten successfully learned the brachiation task. In addition, we find that enforcing the minimum grab duration of 0.25 seconds helps the character learn to swing, which in turn helps it discover the future handholds.

Number of Look-ahead Handholds
N 1 2 3 5 10
Handholds Completed
Episode Reward
Table 2. Effect of the number of look-ahead handholds for the simplified model. We report the average and the standard deviation over five independent runs.

Early Termination Ablations
None Unrecoverable Min Duration Max Duration Min & Max
Table 3. Ablation results on early termination strategies for the simplified model. Each column names the termination strategy that is removed. Each entry represents the average number of completed handholds and the standard deviation over five runs.

5.2. Full Model Results

The full model policy is learned through simplified model imitation, as described earlier (§4.3). In addition to using the tracking reward, which uses the simplified model only during training, we experiment with three different methods of further leveraging the simplified model reference trajectories at inference time.


  • A. Tracking reward only. The tracking reward is combined with the task reward to train the full model policy. The reference trajectory is only used to compute the tracking reward.

  • B. Tracking + Other Rewards. Reward includes all reward terms: tracking, task, auxiliary, and style rewards.

  • C. Rewards + Release Timing. Reward includes everything described in Experiment B. In addition, the release timing from the reference trajectory is used to control the grab action of the full model policy.

  • D. Rewards + States. Reward includes everything described in Experiment B. In addition, a slice of the future reference trajectory is included as a part of the full model policy input.

  • E. Rewards + States + Grab Info. Reward includes everything described in Experiment D. In addition, instead of conditioning the full model state with just the future reference trajectory, we also give access to the grab flag for those future points.

Figure 4. Learning Curves for the full model, for various configurations. No imitation reward results in failure. Using reference trajectory information only at training time (A and B) improves learning. Using the release timing from the simplified model degrades performance (C). The best policies are obtained when simplified and full models are used in tandem at evaluation time (D and E). Please refer to Section 5.2 for a detailed description.

Figure 4 shows the learning curves for different full model configurations. We include a baseline model trained using all reward terms except the tracking reward. For the baseline model, we experimented both with and without a learning curriculum on the terrain difficulty, similar to (Xie et al., 2020), but neither version was able to advance to the second handhold. The best policies are obtained when the simplified model is also used at evaluation time, by adding information from the reference trajectory (Experiment D) and further adding the grab information (Experiment E). However, the grab information provides no additional benefit to the policy when reference trajectory information is already present.
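As a sketch of how Experiments D and E condition the policy, a slice of the future reference trajectory (and, for E, the grab flags at those points) might be appended to the observation as follows. The window length, feature layout, and function names are assumptions, not the paper's exact configuration.

```python
import numpy as np

def condition_on_reference(policy_obs, ref_traj, ref_grabs, t, horizon=4,
                           include_grab_flags=True):
    """Append a window of future reference-trajectory root positions
    (and optionally the grab flags at those points) to the full model
    policy observation.

    ref_traj: (T, 2) planar root positions from the simplified model;
    ref_grabs: (T,) 0/1 grab flags; t: current reference time index.
    """
    idx = np.clip(np.arange(t + 1, t + 1 + horizon), 0, len(ref_traj) - 1)
    parts = [policy_obs, ref_traj[idx].ravel()]
    if include_grab_flags:  # Experiment E: also expose future grab flags
        parts.append(ref_grabs[idx].astype(np.float32))
    return np.concatenate(parts)
```

Because the reference trajectory must be generated by the simplified model at run time, this conditioning ties the two models together at evaluation, not just during training.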

5.3. Full Model with Planning

Gaps Passed
Planning Method 0 1 1+2 1+2+3
Only Full Value Function 40 22 9 7
Only Simplified Rewards 40 36 26 15
Reward & Value Function 40 35 26 21
Table 4. Quantitative evaluation of the different planning methods. In each generated terrain, three unreachable gaps (e.g., Figure 5) are placed at random locations. Each planning method is evaluated on 40 different terrains. The combined approach is able to pass all gaps most consistently.
(a) Planning using only the full model value function. The last planned target grasp location (red dot) is unreachable. This approach has a limited planning horizon, so unrealistic plans are occasionally generated.
(b) Planning using only the simplified model. Here, the last handhold target is unreachable because this planning approach fails to consider the capability of the full model policy.
(c) Planning with combined full model value function and simplified model. The target handhold locations are planned very close to the discontinuities, which effectively reduces the risk of producing unreachable targets.
Figure 5. Comparison of different planning approaches.

At inference time, we can extend the capability of the full model policy by expanding its anticipation horizon using the simplified model. The full model policy's value function cannot anticipate beyond the number of look-ahead handholds used during training. However, the simplified model is fast enough to replan on a handhold-by-handhold basis, in the style of model predictive control (MPC), and can be used to simulate thousands of trajectories in parallel at run time. In this experiment, we set the planner to explore 10k trajectories for 4 seconds into the future; this completes in under one second of machine time, though there remains room for improvement for real-time animation purposes.

The planner uses three fully-trained components: the full model policy, the full model value function, and the simplified model policy. As input, the method takes a terrain, a contiguous ceiling on which handholds can be placed. Given the current state of the full model character, we randomly sample candidate handhold plans of fixed length on the terrain. Concretely, each handhold in a plan is sampled at a random horizontal distance beyond the previous handhold and takes its corresponding height from the terrain. We then retain the best plan according to an objective that combines two terms: the total reward obtained by the simplified model policy when executing the plan, and the full model policy's look-ahead value for the plan. The selected plan is then used until the next handhold is reached, at which point a full replanning takes place. This MPC-like process repeats indefinitely. We find empirically that the combined objective generates better planning trajectories than using either the simplified model reward or the value function alone: the reward of the simplified model is not enough to capture the capabilities of the full character, while the value function has a limited planning horizon. Quantitative results are presented in Table 4. Examples of terrains traversed by the policy with the planners are presented in Figure 5 and in the supplementary video.
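The sampling-based planner might be sketched as below. The function names, the handhold distance range, and the stubbed component interfaces are assumptions; the scoring mirrors the combined reward-plus-value objective described in the text.

```python
import numpy as np

def plan_handholds(full_state, terrain_height, sim_policy_rollout,
                   full_value_fn, num_plans=10000, plan_len=4,
                   min_dx=0.5, max_dx=2.5, rng=None):
    """Sample random handhold plans, score each with the simplified
    model's rollout reward plus the full model's look-ahead value,
    and return the highest-scoring plan.

    terrain_height: callable x -> ceiling height at x.
    sim_policy_rollout: simplified model policy rollout reward for a plan.
    full_value_fn: full model value function evaluated on a plan.
    """
    rng = rng or np.random.default_rng()
    best_plan, best_score = None, -np.inf
    for _ in range(num_plans):
        # Sample successive horizontal offsets; heights come from the terrain.
        xs = np.cumsum(rng.uniform(min_dx, max_dx, size=plan_len))
        plan = [(x, terrain_height(x)) for x in xs]
        score = sim_policy_rollout(full_state, plan) + full_value_fn(full_state, plan)
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan
```

In an MPC loop, this function would be re-invoked each time the character secures the next handhold, discarding the remainder of the previous plan.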

6. Conclusions

Brachiation is challenging to learn, as it requires the careful management of momentum, as well as precision in grasping. We demonstrate a two-level reinforcement learning method for learning highly-capable and dynamic brachiation motions. The motions generated by the policy learned for the simplified model play a critical role in the efficient learning of the full policy. Our work still has many limitations. There are likely to be other combinations of rewards, reward curricula, and handhold curricula, that, when taken together, may produce equivalent capabilities. We have not yet fully characterized the limits of the model’s capabilities, i.e., the simulated gibbon’s ability to climb, leap, descend, and more. We wish to better understand, in a quantitative fashion, how the resulting motions compare to those of real gibbons. Lastly, to emulate the true capabilities of gibbons, we need methods that can plan in complex 3D worlds, perform 3D brachiation, make efficient use of the legs when necessary, and more. We believe the simplicity of our approach provides a solid foundation for future research in these directions.

This work was supported by the NSERC grant RGPIN-2020-05929.


  • K. Bergamin, S. Clavet, D. Holden, and J. R. Forbes (2019). DReCon: data-driven responsive control of physics-based characters. ACM Trans. Graph. 38(6).
  • G. Berseth (2019). Scalable deep reinforcement learning for physics-based motion control. Ph.D. Thesis, University of British Columbia.
  • J. E. Bertram, A. Ruina, C. E. Cannon, Y. H. Chang, and M. J. Coleman (1999). A point-mass model of gibbon locomotion. The Journal of Experimental Biology 202(19), pp. 2609–2617.
  • N. Chentanez, M. Müller, M. Macklin, V. Makoviychuk, and S. Jeschke (2018). Physics-based motion capture imitation with deep reinforcement learning. In Proceedings of the 11th Annual International Conference on Motion, Interaction, and Games, pp. 1–10.
  • S. Coros, P. Beaudoin, K. K. Yin, and M. van de Panne (2008). Synthesis of constrained walking skills. In ACM SIGGRAPH Asia 2008 Papers, SIGGRAPH Asia '08, New York, NY, USA.
  • E. Coumans and Y. Bai (2016). PyBullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org
  • S. Farzan, A. Hu, E. Davies, and J. Rogers (2018). Modeling and control of brachiating robots traversing flexible cables. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1645–1652.
  • J. Fleagle (1974). Dynamics of a brachiating siamang [Hylobates (Symphalangus) syndactylus]. Nature 248(5445), pp. 259–260.
  • S. Gangapurwala, M. Geisert, R. Orsolino, M. Fallon, and I. Havoutis (2020). RLOC: terrain-aware legged locomotion using reinforcement learning and optimal control. arXiv preprint arXiv:2012.03094.
  • S. Gangapurwala, M. Geisert, R. Orsolino, M. Fallon, and I. Havoutis (2021). Real-time trajectory adaptation for quadrupedal locomotion using deep reinforcement learning. In Proceedings of the International Conference on Robotics and Automation (ICRA).
  • M. W. Gomes and A. L. Ruina (2005). A five-link 2D brachiating ape model with life-like zero-energy-cost motions. Journal of Theoretical Biology 237(3), pp. 265–278.
  • K. Green, Y. Godse, J. Dao, R. L. Hatton, A. Fern, and J. Hurst (2021). Learning spring mass locomotion: guiding policies with a reduced-order model. arXiv preprint arXiv:2010.11234.
  • H. Kajima, M. Doi, Y. Hasegawa, and T. Fukuda (2004). A study on a brachiation controller for a multi-locomotion robot — realization of smooth, continuous brachiation. Advanced Robotics 18(10), pp. 1025–1038.
  • D. P. Kingma and J. Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • I. Kostrikov (2018). PyTorch implementations of reinforcement learning algorithms. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail
  • S. Kuindersma (2020). Recent progress on Atlas, the world's most dynamic humanoid robot.
  • T. Kwon, Y. Lee, and M. van de Panne (2020). Fast and flexible multilegged locomotion using learned centroidal dynamics. ACM Trans. Graph. 39(4).
  • C. Lin and Z. Yang (2020). TRBR: flight body posture compensation for transverse ricochetal brachiation robot. Mechatronics 65, pp. 102307.
  • Y. Luo, J. H. Soeseno, T. P. Chen, and W. Chen (2020). CARL: controllable agent with reinforcement learning for quadruped locomotion. ACM Trans. Graph. 39(4).
  • J. Merel, L. Hasenclever, A. Galashov, A. Ahuja, V. Pham, G. Wayne, Y. W. Teh, and N. Heess (2018). Neural probabilistic motor primitives for humanoid control. arXiv preprint arXiv:1811.11711.
  • F. Michilsens, E. E. Vereecke, K. D'Août, and P. Aerts (2009). Functional anatomy of the gibbon forelimb: adaptations to a brachiating lifestyle. Journal of Anatomy 215(3), pp. 335–354.
  • S. Min, J. Won, S. Lee, J. Park, and J. Lee (2019). SoftCon: simulation and control of soft-bodied animals with biomimetic actuators. ACM Trans. Graph. 38(6).
  • J. Nakanishi, T. Fukuda, and D. E. Koditschek (2000). A brachiating robot controller. IEEE Transactions on Robotics and Automation 16(2), pp. 109–123.
  • S. Park, H. Ryu, S. Lee, S. Lee, and J. Lee (2019). Learning predict-and-simulate policies from unorganized human motion data. ACM Transactions on Graphics (TOG) 38(6), pp. 1–11.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
  • X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne (2018). DeepMimic: example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG) 37(4), pp. 1–14.
  • I. Poulakakis and J. W. Grizzle (2009). The spring loaded inverted pendulum as the hybrid zero dynamics of an asymmetric hopper. IEEE Transactions on Automatic Control 54(8), pp. 1779–1793.
  • H. Preuschoft and B. Demes (1985). Influence of size and proportions on the biomechanics of brachiation. In Size and Scaling in Primate Biology, W. L. Jungers (Ed.), Advances in Primatology, pp. 383–399.
  • F. Saito, T. Fukuda, and F. Arai (1994). Swing and locomotion control for a two-link brachiation robot. IEEE Control Systems Magazine 14(1), pp. 5–12.
  • F. Saito (1992). Movement control of brachiation robot using CMAC between different distance and height. In Proc. IMACS/SICE Int. Symp. on Robotics, Mechatronics and Manufacturing Systems, pp. 35–40.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • B. Siciliano, O. Khatib, and T. Kröger (2008). Springer Handbook of Robotics. Vol. 200, Springer.
  • N. Sontakke and S. Ha (2021). Solving challenging control problems using two-staged deep reinforcement learning. arXiv preprint arXiv:2109.13338.
  • V. Tsounis, M. Alge, J. Lee, F. Farshidian, and M. Hutter (2020). DeepGait: planning and control of quadrupedal gaits using deep reinforcement learning. arXiv preprint arXiv:1909.08399.
  • D. Wan, H. Cheng, G. Ji, and S. Wang (2015). Non-horizontal ricochetal brachiation motion planning and control for two-link bio-primate robot. In 2015 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 19–24.
  • T. Wang, Y. Guo, M. Shugrina, and S. Fidler (2020). UniCon: universal neural controller for physics-based character motion. arXiv preprint arXiv:2011.15119.
  • J. Won, D. Gopinath, and J. Hodgins (2020). A scalable approach to control diverse behaviors for physically simulated characters. ACM Transactions on Graphics (TOG) 39(4), pp. 33:1–33:12.
  • J. Won, J. Park, and J. Lee (2018). Aerobatics control of flying creatures via self-regulated learning. ACM Trans. Graph. 37(6).
  • Z. Xie, X. Da, B. Babich, A. Garg, and M. van de Panne (2021). GLiDE: generalizable quadrupedal locomotion in diverse environments with a centroidal model. arXiv preprint arXiv:2104.09771.
  • Z. Xie, H. Y. Ling, N. H. Kim, and M. van de Panne (2020). ALLSTEPS: curriculum-driven learning of stepping stone skills. Computer Graphics Forum (Proc. ACM SIGGRAPH/EG Symposium on Computer Animation) 39(8), pp. 213–224.
  • S. Yang, Z. Gu, R. Ge, A. M. Johnson, M. Travers, and H. Choset (2019). Design and implementation of a three-link brachiation robot with optimal control based trajectory tracking controller. arXiv preprint arXiv:1911.05168.
  • Z. Zhang and K. C. Wong (1999). Details and implementation issues of animating brachiation. In Computer Animation and Simulation '99, N. Magnenat-Thalmann and D. Thalmann (Eds.), Eurographics, Vienna, pp. 123–132.

Appendix A Full Model Rewards

The tracking reward is formulated as a penalty and computed as the squared distance between the character’s root and the position of the reference trajectory.

The reaching reward is computed similarly to the tracking reward, except the distance is between the next grabbing hand and the target handhold. This reward is only applied during the flight phase; during the swing phase, the character can still use the free arm to generate momentum without penalty.

The upright posture term penalizes the character if the root pitch is not within 40 degrees of the vertical axis.

The arm rotation reward penalizes the character for excessive angular velocity in the next grabbing arm.

The lower body reward encourages the knees to bend close to 110 degrees from the straight-leg base position. This reward is applied to improve the visibility of the legs during brachiation.

The energy reward penalizes the policy for using excessive amounts of energy. The overall energy is approximated from the applied torques and the angular velocities of all joints.
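As an illustration, the tracking penalty and the energy approximation might be computed as follows. The weights and exact functional forms are assumptions, not the paper's values.

```python
import numpy as np

def tracking_reward(root_pos, ref_pos, w=1.0):
    """Penalty form: negative squared distance between the character's
    root and the simplified-model reference position."""
    diff = np.asarray(root_pos) - np.asarray(ref_pos)
    return -w * float(np.sum(diff ** 2))

def energy_penalty(torques, joint_velocities, w=1.0):
    """Approximate energy use as the summed absolute mechanical power
    (torque times angular velocity) over all joints."""
    power = np.abs(np.asarray(torques) * np.asarray(joint_velocities))
    return -w * float(np.sum(power))
```

Both terms are penalties (non-positive), so they trade off directly against the positive task reward for reaching handholds.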

Appendix B Proximal Policy Optimization

Let an experience tuple be $(s_t, a_t, r_t, s_{t+1})$ and a trajectory be $\tau = (s_0, a_0, r_0, s_1, \ldots)$. We episodically collect trajectories for a fixed number of environment transitions and use this data to train the controller and value function networks. The value function network approximates the expected future return of each state, and is defined for a policy $\pi$ as

$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\!\left[\textstyle\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right].$

This function can be optimized using supervised learning due to its recursive nature:

$V^\pi(s_t) = \mathbb{E}\!\left[r_t + \gamma V^\pi(s_{t+1})\right].$

In PPO, the value function is used for computing the advantage

$A_t = r_t + \gamma V(s_{t+1}) - V(s_t),$

which is then used for training the policy by maximizing

$L(\theta) = \mathbb{E}_t\!\left[\min\!\left(\rho_t(\theta) A_t,\; \mathrm{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_t\right)\right],$

where $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is an importance sampling ratio used for calculating the expectation under the old policy $\pi_{\theta_{\mathrm{old}}}$.
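The advantage and clipped surrogate described above can be sketched in a few lines. This is a minimal NumPy version for clarity; an actual implementation would operate on autograd tensors, and the one-step advantage shown here may be replaced by GAE in practice.

```python
import numpy as np

def td_advantage(rewards, values, next_values, gamma=0.99):
    """One-step advantage estimate A_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return (np.asarray(rewards)
            + gamma * np.asarray(next_values)
            - np.asarray(values))

def ppo_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate, negated so it can be minimized.

    The importance ratio rho = pi_new(a|s) / pi_old(a|s) is computed in
    log space for numerical stability, then clipped to [1-eps, 1+eps].
    """
    rho = np.exp(log_probs_new - log_probs_old)
    unclipped = rho * advantages
    clipped = np.clip(rho, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))
```

Taking the elementwise minimum makes the objective pessimistic: the policy gains nothing from pushing the ratio outside the clip range.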

Appendix C PPO Hyperparameters

Hyperparameter (Simplified / Full Model)
Learning Rate
Final Learning Rate
Optimizer: Adam
Batch Size
Training Steps
Num Processes
Episode Steps
Num PPO Epochs
Discount Factor
Gradient Clipping: False
Entropy Coefficient
Value Loss Coefficient
Clip Parameter
Max Grad Norm
Table 5. Hyperparameters used for training PPO.

Appendix D Supplementary Results

Figure 6. Additional trajectories showing a variety of motions.
