1 Introduction
The introduction of deep learning models to reinforcement learning (RL) has enabled policies to operate directly on high-dimensional, low-level state features. As a result, deep reinforcement learning (DeepRL) has demonstrated impressive capabilities, such as developing control policies that can map from input image pixels to output joint torques (Lillicrap et al., 2015). However, the quality and robustness of the learned policies often fall short of what has been achieved with hand-crafted action abstractions, e.g., Coros et al. (2011); Geijtenbeek et al. (2013). Relatedly, the choice of action parameterization is a design decision whose impact is not yet well understood.
Joint torques can be thought of as the most basic and generic representation for driving the movement of articulated figures, given that muscles and other actuation models eventually result in joint torques. However, this ignores the intrinsic embodied nature of biological systems, particularly the synergy between control and biomechanics. Passive dynamics, such as elasticity and damping from muscles and tendons, play an integral role in shaping motions: they provide mechanisms for energy storage, and mechanical impedance which generates instantaneous feedback without requiring any explicit computation. Loeb coins the term preflexes (Loeb, 1995) to describe these effects, and their impact on motion control has been described as providing intelligence by mechanics (Blickhan et al., 2007).
In this paper we explore the impact of four different actuation models on learning to control dynamic articulated figure locomotion: (1) torques (Tor); (2) activations for musculotendon units (MTU); (3) target joint angles for proportional-derivative controllers (PD); and (4) target joint velocities (Vel). Because DeepRL methods are capable of learning control policies for all these models, it now becomes possible to directly assess how the choice of actuation model affects the learning difficulty. We also assess the learned policies with respect to robustness, motion quality, and policy query rates. We show that action spaces which incorporate local feedback can significantly improve learning speed and performance, while still preserving the generality afforded by torque-level control. Such parameterizations also allow for more complex body structures and subjective improvements in motion quality.
Our specific contributions are: (1) We introduce a DeepRL framework for motion imitation tasks; (2) We evaluate the impact of four different actuation models on learned control policies according to four criteria; and (3) We propose an optimization approach that combines policy learning and actuator optimization, allowing neural networks to effectively control complex muscle models.
2 Background
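Our task will be structured as a standard reinforcement learning problem where an agent interacts with its environment according to a policy in order to maximize a reward signal. The policy $\pi(s, a) = p(a \mid s)$ represents the conditional probability density function of selecting action $a \in A$ in state $s \in S$. At each control step $t$, the agent observes a state $s_t$ and samples an action $a_t$ from $\pi$. The environment in turn responds with a scalar reward $r_t$ and a new state $s'_t = s_{t+1}$ sampled from its dynamics $p(s' \mid s, a)$. For a parameterized policy $\pi_\theta(s, a)$, the goal of the agent is to learn the parameters $\theta$ which maximize the expected cumulative reward
$$J(\pi_\theta) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t \,\middle|\, \pi_\theta\right],$$
with $\gamma \in [0, 1]$ as the discount factor and $T$ as the horizon. The gradient of the expected reward $\nabla_\theta J(\pi_\theta)$ can be determined according to the policy gradient theorem (Sutton et al., 2001), which provides a direction of improvement to adjust the policy parameters $\theta$:
$$\nabla_\theta J(\pi_\theta) = \int_S d_\theta(s) \int_A \nabla_\theta \log(\pi_\theta(s, a)) \, A(s, a) \, da \, ds,$$
where $d_\theta(s) = \int_S \sum_{t=0}^{T} \gamma^t p_0(s_0) \, p(s_0 \to s \mid t, \pi_\theta) \, ds_0$ is the discounted state distribution, $p_0(s)$ represents the initial state distribution, and $p(s_0 \to s \mid t, \pi_\theta)$ models the likelihood of reaching state $s$ by starting at $s_0$ and following the policy $\pi_\theta(s, a)$ for $t$ steps (Silver et al., 2014). $A(s, a)$ represents a generalized advantage function. The choice of advantage function gives rise to a family of policy gradient algorithms, but in this work we focus on the one-step temporal difference advantage function (Schulman et al., 2015)
$$A(s, a) = r + \gamma V(s') - V(s),$$
where $V(s) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t \,\middle|\, s_0 = s, \pi_\theta\right]$ is the state-value function, which can be defined recursively via the Bellman equation
$$V(s) = \mathbb{E}_{r, s'}\left[r + \gamma V(s') \mid s, \pi_\theta\right].$$
A parameterized value function $V_\phi(s)$, with parameters $\phi$, can be learned iteratively in a manner similar to Q-learning by minimizing the Bellman loss,
$$L(\phi) = \mathbb{E}\left[\tfrac{1}{2}\left(y - V_\phi(s)\right)^2\right], \qquad y = r + \gamma V_\phi(s').$$
$\pi_\theta$ and $V_\phi$ can be trained in tandem using an actor-critic framework (Konda & Tsitsiklis, 2000).
In this work, each policy will be represented as a Gaussian distribution $\pi_\theta(s, a) = \mathcal{N}(\mu_\theta(s), \Sigma)$ with a parameterized mean $\mu_\theta(s)$ and fixed covariance matrix $\Sigma = \mathrm{diag}\{\sigma_i^2\}$, where $\sigma_i$ is manually specified for each action parameter. Actions can be sampled from the distribution by applying Gaussian noise to the mean action,
$$a_t = \mu_\theta(s_t) + \mathcal{N}(0, \Sigma).$$
The corresponding policy gradient will assume the form
$$\nabla_\theta J(\pi_\theta) = \int_S d_\theta(s) \int_A \nabla_\theta \mu_\theta(s) \, \Sigma^{-1} \left(a - \mu_\theta(s)\right) A(s, a) \, da \, ds,$$
which can be interpreted as shifting the mean of the action distribution towards actions that lead to higher than expected rewards, while moving away from actions that lead to lower than expected rewards.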
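As a rough illustration of the quantities in this section, the following Python sketch computes a one-step temporal difference advantage, samples an action from a Gaussian policy with diagonal covariance, and evaluates the advantage-weighted gradient with respect to the policy mean. The function names and the list-based representation are assumptions made for this example; backpropagating the result through the mean network is left out.

```python
import random

def td_advantage(reward, v_s, v_next, gamma=0.9):
    """One-step temporal difference advantage: A = r + gamma * V(s') - V(s)."""
    return reward + gamma * v_next - v_s

def sample_action(mean, sigma, rng=random):
    """Sample an action by adding independent Gaussian noise to the mean action."""
    return [m + rng.gauss(0.0, s) for m, s in zip(mean, sigma)]

def mean_gradient(action, mean, sigma, advantage):
    """Advantage-weighted gradient of the Gaussian log-likelihood w.r.t. the mean.

    For a diagonal covariance this is (a - mu) / sigma^2 * A per action dimension.
    """
    return [(a - m) / (s * s) * advantage for a, m, s in zip(action, mean, sigma)]
```

A positive advantage pulls the mean towards the sampled action, and a negative advantage pushes it away, which matches the interpretation given above.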
3 Task Representation
3.1 Reference Motion
In our task, the goal of a policy is to imitate a given reference motion $\{\hat{q}_t\}$, which consists of a sequence of kinematic poses $\hat{q}_t$ in reduced coordinates. The reference velocity $\hat{\dot{q}}_t$ at a given time $t$ is approximated by the finite difference $\hat{\dot{q}}_t \approx (\hat{q}_{t + \Delta t} - \hat{q}_t) / \Delta t$. Reference motions are generated either by recording the simulation result of a pre-existing controller (“Sim”) or from manually authored keyframes. Since hand-crafted reference motions may not be physically realizable, the goal is to closely reproduce a motion while satisfying physical constraints.
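The finite-difference approximation of the reference velocity can be sketched as follows. This is a minimal example that assumes a forward difference over uniformly spaced frames; the function name and the handling of the final frame (which simply reuses the previous estimate) are choices made here for illustration, and a cyclic motion could instead wrap around to the first frame.

```python
def reference_velocities(poses, dt):
    """Approximate per-frame velocities of a reference motion by forward differences.

    poses: list of frames, each a list of reduced-coordinate values.
    dt: time between consecutive frames, in seconds.
    """
    velocities = []
    for t in range(len(poses) - 1):
        velocities.append([(b - a) / dt for a, b in zip(poses[t], poses[t + 1])])
    if velocities:
        # Assumption: the last frame has no successor, so reuse the previous estimate.
        velocities.append(list(velocities[-1]))
    return velocities
```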
3.2 States
To define the state of the agent, a feature transformation $\Phi(q, \dot{q})$ is used to extract a set of features from the reduced-coordinate pose $q$ and velocity $\dot{q}$. The features consist of the height of the root (pelvis) from the ground, the position of each link with respect to the root, and the center of mass velocity of each link. When training a policy to imitate a cyclic reference motion $\{\hat{q}_t\}$, knowledge of the motion phase can help simplify learning. Therefore, we augment the state features with a set of target features $\Phi(\hat{q}_t, \hat{\dot{q}}_t)$, resulting in a combined state represented by $s_t = (\Phi(q_t, \dot{q}_t), \Phi(\hat{q}_t, \hat{\dot{q}}_t))$. Similar results can also be achieved by providing a single motion phase variable as a state feature, as we show in Figure 12 (supplemental material).
3.3 Actions
We train separate policies for each of the four actuation models, as described below. Each actuation model also has related actuation parameters, such as feedback gains for PD-controllers and musculotendon properties for MTUs. These parameters can be manually specified, as we do for the PD and Vel models, or they can be optimized for the task at hand, as for the MTU models. Table 1 provides a list of actuator parameters for each actuation model.
Target Joint Angles (PD): Each action represents a set of target angles $\hat{q}$, where $\hat{q}^i$ specifies the target angle for joint $i$. $\hat{q}$ is applied to PD-controllers which compute torques according to $\tau^i = k_p^i (\hat{q}^i - q^i) + k_d^i (\hat{\dot{q}}^i - \dot{q}^i)$, where $\hat{\dot{q}}^i = 0$, and $k_p^i$ and $k_d^i$ are manually specified gains.
Target Joint Velocities (Vel): Each action specifies a set of target velocities $\hat{\dot{q}}$ which are used to compute torques according to $\tau^i = k_d^i (\hat{\dot{q}}^i - \dot{q}^i)$, where the gains $k_d^i$ are specified to be the same as those used for target angles.
Torques (Tor): Each action directly specifies torques for every joint, and constant torques are applied for the duration of a control step. Due to torque limits, actions are bounded by manually specified limits for each joint. Unlike the other actuation models, the torque model does not require additional actuator parameters, and can thus be regarded as requiring the least amount of domain knowledge. Torque limits are excluded from the actuator parameter set as they are common for all parameterizations.
Muscle Activations (MTU): Each action specifies activations for a set of musculotendon units (MTU). Detailed modeling and implementation information are available in Wang et al. (2012). Each MTU is modeled as a contractile element (CE) attached to a serial elastic element (SE) and parallel elastic element (PE). The force exerted by the MTU can be calculated according to $F^{MTU} = F^{SE} = F^{CE} + F^{PE}$. Both $F^{SE}$ and $F^{PE}$ are modeled as passive springs, while $F^{CE}$ is actively controlled according to $F^{CE} = a^{MTU} F_0 f_l(l^{CE}) f_v(v^{CE})$, with $a^{MTU}$ being the muscle activation, $F_0$ the maximum isometric force, and $l^{CE}$ and $v^{CE}$ being the length and velocity of the contractile element. The functions $f_l(l^{CE})$ and $f_v(v^{CE})$ represent the force-length and force-velocity relationships, modeling the variations in the maximum force that can be exerted by a muscle as a function of its length and contraction velocity. Analytic forms are available in Geyer et al. (2003). Activations are bounded to [0, 1]. The length of each contractile element is included as a state feature. To simplify control and reduce the number of internal state parameters per MTU, the policies directly control muscle activations instead of controlling them indirectly through excitations (Wang et al., 2012).
Table 1: Actuator parameters for each actuation model.
Actuation Model  Actuator Parameters

Target Joint Angles (PD)  proportional gains $k_p$, derivative gains $k_d$
Target Joint Velocities (Vel)  derivative gains $k_d$
Torques (Tor)  none
Muscle Activations (MTU)  optimal contractile element length, serial elastic element rest length, maximum isometric force, pennation, moment arm, maximum moment arm joint orientation, rest joint orientation
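To make the feedback-based actuation models concrete, the sketch below computes per-joint torques for the PD and Vel models. It assumes the usual proportional-derivative law with a zero target velocity for the PD model, and a symmetric torque limit per joint; the function names, the symmetric limit, and the example gains are assumptions for illustration rather than values from the paper.

```python
def clamp(value, lower, upper):
    """Restrict value to the closed interval [lower, upper]."""
    return max(lower, min(upper, value))

def pd_torque(target_angle, angle, velocity, kp, kd, torque_limit):
    """PD model: track a target angle, with an implicit target velocity of zero."""
    torque = kp * (target_angle - angle) + kd * (0.0 - velocity)
    return clamp(torque, -torque_limit, torque_limit)

def vel_torque(target_velocity, velocity, kd, torque_limit):
    """Vel model: track a target joint velocity using the derivative gain only."""
    torque = kd * (target_velocity - velocity)
    return clamp(torque, -torque_limit, torque_limit)
```

Because the simulator re-evaluates these laws at every simulation step while the policy only updates the targets at its query rate, the joint receives feedback between policy queries, which the Tor model does not provide.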
3.4 Reward
The reward function consists of a weighted sum of terms that encourage the policy to track a reference motion:
$$r = w_{pose} r_{pose} + w_{vel} r_{vel} + w_{end} r_{end} + w_{root} r_{root} + w_{com} r_{com}.$$
Details of each term are available in the supplemental material. $r_{pose}$ penalizes deviation of the character pose from the reference pose, and $r_{vel}$ penalizes deviation of the joint velocities. $r_{end}$ and $r_{root}$ account for the position error of the end-effectors and root. $r_{com}$ penalizes deviations in the center of mass velocity from that of the reference motion.
3.5 Initial State Distribution
We design the initial state distribution, $p_0(s)$, to sample states uniformly along the reference trajectory. At the start of each episode, $\hat{q}$ and $\hat{\dot{q}}$ are sampled from the reference trajectory and used to initialize the pose and velocity of the agent. This helps guide the agent to explore states near the target trajectory.
4 Actor-Critic Learning Algorithm
Instead of directly using the temporal difference advantage function, we adapt a positive temporal difference (PTD) update as proposed by Van Hasselt (2012):
$$\nabla_\theta J(\pi_\theta) = \int_S d_\theta(s) \int_A \nabla_\theta \log(\pi_\theta(s, a)) \, I[A(s, a) > 0] \, da \, ds,$$
where $I[A(s, a) > 0] = 1$ if $A(s, a) > 0$ and $0$ otherwise.
Unlike more conventional policy gradient methods, PTD is less sensitive to the scale of the advantage function and avoids instabilities that can result from negative TD updates. For a Gaussian policy, a negative TD update moves the mean of the distribution away from an observed action, effectively shifting the mean towards an unknown action that may be no better than the current mean action (Van Hasselt, 2012). In expectation, these updates converge to the true policy gradient, but for stochastic estimates of the policy gradient, these updates can cause the agent to adopt undesirable behaviours which affect subsequent experiences collected by the agent. Furthermore, we incorporate experience replay, which has been demonstrated to improve stability when training neural network policies with Q-learning in discrete action spaces. Experience replay often requires off-policy methods, such as importance weighting, to account for differences between the policy being trained and the behavior policy used to generate experiences (Wawrzyński & Tanwani, 2013). However, we have not found importance weighting to be beneficial for PTD.
Stochastic policies are used during training for exploration, while deterministic policies are deployed for evaluation at runtime. The choice between a stochastic and deterministic policy can be specified by the addition of a binary indicator variable $\lambda_t$:
$$a_t = \mu_\theta(s_t) + \lambda_t \mathcal{N}(0, \Sigma),$$
where $\lambda_t = 1$ corresponds to a stochastic policy with exploration noise, and $\lambda_t = 0$ corresponds to a deterministic policy that always selects the mean of the distribution. Noise from a stochastic policy will result in a state distribution that differs from that of the deterministic policy at runtime. To imitate this discrepancy, we incorporate $\epsilon$-greedy exploration in addition to the original Gaussian exploration. During training, $\lambda_t$ is determined by a Bernoulli random variable $\lambda_t \sim \mathrm{Ber}(\epsilon_t)$, where $\lambda_t = 1$ with probability $\epsilon_t \in [0, 1]$. The exploration rate $\epsilon_t$ is annealed linearly from 1 to 0.2 over 500k iterations, which slowly adjusts the state distribution encountered during training to better resemble the distribution at runtime. Since the policy gradient is defined for stochastic policies, only tuples recorded with exploration noise (i.e., $\lambda_t = 1$) can be used to update the actor, while the critic can be updated using all tuples.
Training proceeds episodically, where the initial state of each episode is sampled from $p_0(s)$, and the episode duration is drawn from an exponential distribution with a mean of 2s. To discourage falling, an episode will also terminate if any part of the character’s trunk makes contact with the ground for an extended period of time, leaving the agent with zero reward for all subsequent steps. Algorithm 1 in the supplemental material summarizes the complete learning process.
MTU Actuator Optimization: Actuation models such as MTUs are defined by further parameters whose values impact performance (Geijtenbeek et al., 2013). Geyer et al. (2003) use existing anatomical estimates for humans to determine MTU parameters, but such data is not available for more arbitrary creatures. Alternatively, Geijtenbeek et al. (2013) use covariance matrix adaptation (CMA), a derivative-free evolutionary search strategy, to simultaneously optimize MTU and policy parameters. This approach is limited to policies with reasonably low-dimensional parameter spaces, and is thus ill-suited for neural network models with hundreds of thousands of parameters. To avoid manual tuning of actuator parameters, we propose a heuristic approach that alternates between policy learning and actuator optimization, as detailed in the supplemental material.
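The exploration and update rules described in this section can be sketched in a few lines of Python. The example below is a simplified illustration, not the paper's Algorithm 1: it covers the linearly annealed exploration rate, the Bernoulli choice between a noisy and a mean action, the positive temporal difference gate on actor updates, and the exponentially distributed episode duration. All function names are hypothetical, and the actor step is expressed only as a direction on the policy mean.

```python
import random

def exploration_rate(iteration, start=1.0, end=0.2, anneal_iters=500_000):
    """Linearly anneal the exploration rate from start to end, then hold it."""
    frac = min(max(iteration / anneal_iters, 0.0), 1.0)
    return start + frac * (end - start)

def select_action(mean, sigma, epsilon, rng=random):
    """Return (action, lam); lam = 1 adds Gaussian noise, lam = 0 uses the mean."""
    lam = 1 if rng.random() < epsilon else 0
    action = [m + lam * rng.gauss(0.0, s) for m, s in zip(mean, sigma)]
    return action, lam

def ptd_actor_step(action, mean, sigma, advantage, lam):
    """Positive temporal difference gate for the actor.

    Only exploratory tuples (lam = 1) with a positive advantage produce a step
    on the mean; the step is not scaled by the advantage. Returns None otherwise.
    """
    if lam == 0 or advantage <= 0.0:
        return None
    return [(a - m) / (s * s) for a, m, s in zip(action, mean, sigma)]

def episode_duration(mean_seconds=2.0, rng=random):
    """Draw an episode duration from an exponential distribution."""
    return rng.expovariate(1.0 / mean_seconds)
```

Tuples for which `ptd_actor_step` returns `None` can still be used to update the critic, consistent with the description above.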
5 Results
The motions are best seen in the supplemental video https://youtu.be/L3vDo3nLI98. We evaluate the action parameterizations by training policies for a simulated 2D biped, dog, and raptor as shown in Figure 1. Depending on the agent and the actuation model, our systems have 58–214 state dimensions, 6–44 action dimensions, and 0–282 actuator parameters, as summarized in Table 3 (supplemental materials). The MTU models have at least double the number of action parameters because they come in antagonistic pairs. As well, additional MTUs are used for the legs to more accurately reflect bipedal biomechanics. This includes MTUs that span multiple joints.
Each policy is represented by a three-layer neural network, as illustrated in Figure 8 (supplemental material), with 512 and 256 fully-connected units, followed by a linear output layer where the number of output units varies according to the number of action parameters for each character and actuation model. ReLU activation functions are used for both hidden layers. Each network has approximately 200k parameters. The value function is represented by a similar network, except having a single linear output unit. The policies are queried at 60Hz for a control step of about 0.0167s. Each network is randomly initialized and trained for about 1 million iterations, requiring 32 million tuples, the equivalent of approximately 6 days of simulated time. Training each policy requires about 10 hours for the biped, and 20 hours for the raptor and dog, on an 8-core Intel Xeon E5-2687W.
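The network size and simulated-time figures quoted above can be checked with a short calculation. The sketch below counts weights and biases for a fully-connected network with 512 and 256 hidden units, using the state and action dimensions listed in Table 3, and converts the number of training tuples into days of simulated time at a 60Hz query rate. Including bias terms in the count is an assumption of this example.

```python
def mlp_param_count(n_in, hidden=(512, 256), n_out=1):
    """Count weights and biases of a fully-connected network."""
    sizes = [n_in, *hidden, n_out]
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

def simulated_days(num_tuples, query_hz=60.0):
    """Convert a number of control steps into days of simulated time."""
    return num_tuples / query_hz / 86400.0

# Smallest and largest policy networks according to Table 3.
biped_tor = mlp_param_count(58, n_out=6)     # 163078 parameters
dog_mtu = mlp_param_count(214, n_out=44)     # 252716 parameters
```

Both counts are in the neighbourhood of the stated 200k parameters, and 32 million tuples at 60Hz correspond to roughly 6.2 days of simulated time.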
Only the actuator parameters for MTUs are optimized with Algorithm 2, since the parameters for the other actuation models are few and reasonably intuitive to determine. The initial actuator parameters $\psi$ are manually specified, while the initial policy parameters $\theta$ are randomly initialized. Each pass optimizes $\psi$ using CMA for 250 generations with 16 samples per generation, and $\theta$ is trained for 250k iterations. Parameters are initialized with values from the previous pass. The expected value of each CMA sample of $\psi$ is estimated using the average cumulative reward over 16 rollouts with a duration of 10s each. Separate MTU parameters are optimized for each character and motion. Each set of parameters is optimized for 6 passes following Algorithm 2, requiring approximately 50 hours. Figure 5 illustrates the performance improvement per pass. Figure 6 compares the performance of MTUs before and after optimization. For most examples, the optimized actuator parameters significantly improve learning speed and final performance. For the sake of comparison, after a set of actuator parameters has been optimized, a new policy is retrained with the new actuator parameters and its performance compared to the other actuation models.
Policy Performance and Learning Speed: Figure 2 shows learning curves for the policies, and the performance of the final policies is summarized in Table 4. Performance is evaluated using the normalized cumulative reward (NCR), calculated from the average cumulative reward over 32 episodes with lengths of 10s, and normalized by the maximum and minimum cumulative reward possible for each episode. No discounting is applied when calculating the NCR. The initial state of each episode is sampled from the reference motion according to $p_0(s)$. To compare learning speeds, we use the normalized area under each learning curve (AUC) as a proxy for the learning speed of a particular actuation model, where 0 represents the worst possible performance and no progress during training, and 1 represents the best possible performance without requiring training.
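The two metrics can be illustrated with a small sketch. The normalization of the cumulative reward follows the description above; the use of the trapezoidal rule for the area under the learning curve is an assumption of this example, since the text does not state how the area is computed, and both function names are hypothetical.

```python
def normalized_cumulative_reward(episode_returns, r_min, r_max):
    """Average undiscounted return over episodes, rescaled to [0, 1]."""
    mean_return = sum(episode_returns) / len(episode_returns)
    return (mean_return - r_min) / (r_max - r_min)

def normalized_auc(iterations, ncr_values):
    """Area under an NCR learning curve divided by the iteration span.

    Uses the trapezoidal rule (an assumption); a curve that stays at 1
    yields 1, and a curve that stays at 0 yields 0.
    """
    area = 0.0
    for i in range(1, len(iterations)):
        width = iterations[i] - iterations[i - 1]
        area += 0.5 * (ncr_values[i] + ncr_values[i - 1]) * width
    return area / (iterations[-1] - iterations[0])
```

For example, a curve that rises linearly from 0 to 1 over the first half of training and then stays at 1 has a normalized AUC of 0.75.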
PD performs well across all examples, achieving comparable-to-the-best performance for all motions. PD also learns faster than the other parameterizations for 5 of the 7 motions. The final performance of Tor is among the poorest for all the motions. Differences in performance appear more pronounced as characters become more complex. For the simple 7-link biped, most parameterizations achieve similar performance. However, for the more complex dog and raptor, the performance of Tor policies deteriorates with respect to other policies such as PD and Vel. MTU policies often exhibited the slowest learning speed, which may be a consequence of the higher-dimensional action spaces, i.e., requiring antagonistic muscle pairs, and complex muscle dynamics. Nonetheless, once optimized, the MTU policies produce more natural motions and responsive behaviors as compared to other parameterizations. We note that the naturalness of motions is not well captured by the reward, since it primarily gauges similarity to the reference motion, which may not be representative of natural responses when perturbed from the nominal trajectory.
Policy Robustness: To evaluate robustness, we recorded the NCR achieved by each policy when subjected to external perturbations. The perturbations assume the form of random forces applied to the trunk of the characters. Figure 3 illustrates the performance of the policies when subjected to perturbations of different magnitudes. The magnitude of the forces is constant, but the direction varies randomly. Each force is applied for 0.1 to 0.4s, with 1 to 4s between each perturbation. Performance is estimated using the average over 128 episodes of length 20s each. For the biped walk, the Tor policy is significantly less robust than those for the other types of actions, while the MTU policy is the least robust for the raptor run. Overall, the PD policies are among the most robust for all the motions. In addition to external forces, we also evaluate robustness over randomly generated terrain consisting of bumps with varying heights and slopes with varying steepness. We evaluate the performance on irregular terrain (Figure 10, supplemental material). There are few discernible patterns for this test. The Vel and MTU policies are significantly worse than the Tor and PD policies for the dog bound on the bumpy terrain. The unnatural jittery behavior of the dog Tor policy proves to be surprisingly robust for this scenario. We suspect that the behavior prevents the trunk from contacting the ground for extended periods of time, thereby escaping our system’s fall detection.
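A perturbation schedule of the kind described above might be generated as in the following sketch. It assumes that force durations and the gaps between perturbations are drawn uniformly from the stated ranges and that the direction is uniform on the unit circle for the planar characters; the paper does not specify these distributions, and the function name and event format are hypothetical.

```python
import math
import random

def perturbation_schedule(magnitude, episode_length=20.0, rng=random):
    """Generate (start_time, duration, (fx, fy)) perturbation events for one episode.

    The force magnitude is constant; the direction, duration, and the gap
    between perturbations are random (uniform sampling is an assumption).
    """
    events = []
    t = rng.uniform(1.0, 4.0)
    while t < episode_length:
        duration = rng.uniform(0.1, 0.4)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        force = (magnitude * math.cos(angle), magnitude * math.sin(angle))
        events.append((t, duration, force))
        t += duration + rng.uniform(1.0, 4.0)
    return events
```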
Query Rate: Figure 4 compares the performance of different parameterizations for different policy query rates. Separate policies are trained with query rates of 15Hz, 30Hz, 60Hz, and 120Hz. Actuation models that incorporate low-level feedback, such as PD and Vel, appear to cope more effectively with lower query rates, while Tor degrades more rapidly at lower query rates. It is not yet obvious to us why MTU policies appear to perform better at lower query rates and worse at higher rates. Lastly, Figure 11 shows the policy outputs as a function of time for the four actuation models, for a particular joint, as well as the resulting joint torque. Interestingly, the MTU action is visibly smoother than the other actions and results in joint torque profiles that are smoother than those seen for PD and Vel.
6 Related Work
DeepRL has driven impressive recent advances in learning motion control, i.e., solving continuous-action control problems using reinforcement learning. All four of the action types that we explore have seen previous use in the machine learning literature.
Wawrzyński & Tanwani (2013) use an actor-critic approach with experience replay to learn skills for an octopus arm (actuated by a simple muscle model) and a planar half cheetah (actuated by joint-based PD-controllers). Recent work on deterministic policy gradients (Lillicrap et al., 2015) and on RL benchmarks, e.g., OpenAI Gym, generally uses joint torques as the action space, as do the test suites in recent work (Schulman et al., 2015) on using generalized advantage estimation. Other recent work uses: the PR2 effort control interface as a proxy for torque control (Levine et al., 2015); joint velocities (Gu et al., 2016); velocities under an implicit control policy (Mordatch et al., 2015); or abstract actions (Hausknecht & Stone, 2015). Our learning procedures are based on prior work using actor-critic approaches with positive temporal difference updates (Van Hasselt, 2012).
Work in biomechanics has long recognized the embodied nature of the control problem and the view that musculotendon systems provide “preflexes” (Loeb, 1995) that effectively provide a form of intelligence by mechanics (Blickhan et al., 2007), as well as allowing for energy storage. The control strategies for physics-based character simulations in computer animation also use all the forms of actuation that we evaluate in this paper. Representative examples include quadratic programs that solve for joint torques (de Lasa et al., 2010), joint velocities for skilled bicycle stunts (Tan et al., 2014), muscle models for locomotion (Wang et al., 2012; Geijtenbeek et al., 2013), mixed use of feed-forward torques and joint target angles (Coros et al., 2011), and joint target angles computed by learned linear (time-indexed) feedback strategies (Liu et al., 2016). Lastly, control methods in robotics use a mix of actuation types, including direct-drive torques (or their virtualized equivalents), series elastic actuators, PD control, and velocity control. These methods often rely heavily on model-based solutions and thus we do not describe these in further detail here.
7 Conclusions
Our experiments suggest that action parameterizations that include basic local feedback, such as PD target angles, MTU activations, or target velocities, can improve policy performance and learning speed across different motions and character morphologies. Such models more accurately reflect the embodied nature of control in biomechanical systems, and the role of mechanical components in shaping the overall dynamics of motions and their control. The difference between low-level and high-level action parameterizations grows with the complexity of the characters, with high-level parameterizations scaling more gracefully to complex characters.
Our results have only been demonstrated on planar articulated figure simulations; the extension to 3D currently remains as future work. Tuning actuator parameters for complex actuation models such as MTUs remains challenging. Though our actuator optimization technique is able to improve performance as compared to manual tuning, the resulting parameters may still not be optimal for the desired task. Therefore, our comparisons of MTUs to other action parameterizations may not be reflective of the full potential of MTUs with more optimal actuator parameters. Furthermore, our actuator optimization currently tunes parameters for a specific motion, rather than a larger suite of motions, as might be expected in nature.
To better understand the effects of different action parameterizations, we believe it will be beneficial to replicate our experiments with other reinforcement learning algorithms and motion control tasks. As is the case with other results in this area, hyperparameter choices can have a significant impact on performance, and therefore it is difficult to make definitive statements with regard to the merits of the various action spaces that we have explored. However, we believe that the general trends we observed are likely to generalize.
Finally, it is reasonable to expect that evolutionary processes would result in the effective co-design of actuation mechanics and control capabilities. Developing optimization and learning algorithms to allow for this kind of co-design is a fascinating possibility for future work.
References
 Blickhan et al. (2007) Reinhard Blickhan, Andre Seyfarth, Hartmut Geyer, Sten Grimmer, Heiko Wagner, and Michael Günther. Intelligence by mechanics. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 365(1850):199–220, 2007.
 Coros et al. (2011) Stelian Coros, Andrej Karpathy, Ben Jones, Lionel Reveret, and Michiel van de Panne. Locomotion skills for simulated quadrupeds. ACM Transactions on Graphics, 30(4):Article TBD, 2011.
 de Lasa et al. (2010) Martin de Lasa, Igor Mordatch, and Aaron Hertzmann. Feature-based locomotion controllers. In ACM Transactions on Graphics (TOG), volume 29, pp. 131. ACM, 2010.
 Geijtenbeek et al. (2013) Thomas Geijtenbeek, Michiel van de Panne, and A. Frank van der Stappen. Flexible muscle-based locomotion for bipedal creatures. ACM Transactions on Graphics, 32(6), 2013.
 Geyer et al. (2003) Hartmut Geyer, Andre Seyfarth, and Reinhard Blickhan. Positive force feedback in bouncing gaits? Proc. Royal Society of London B: Biological Sciences, 270(1529):2173–2183, 2003.
 Gu et al. (2016) Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation. arXiv preprint arXiv:1610.00633, 2016.
 Hausknecht & Stone (2015) Matthew J. Hausknecht and Peter Stone. Deep reinforcement learning in parameterized action space. CoRR, abs/1511.04143, 2015.
 Konda & Tsitsiklis (2000) Vijay Konda and John Tsitsiklis. Actor-critic algorithms. In SIAM Journal on Control and Optimization, pp. 1008–1014. MIT Press, 2000.
 Levine et al. (2015) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. CoRR, abs/1504.00702, 2015.
 Lillicrap et al. (2015) Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
 Liu et al. (2016) Libin Liu, Michiel van de Panne, and KangKang Yin. Guided learning of control graphs for physics-based characters. ACM Transactions on Graphics, 35(3), 2016.
 Loeb (1995) GE Loeb. Control implications of musculoskeletal mechanics. In Engineering in Medicine and Biology Society, 1995., IEEE 17th Annual Conference, volume 2, pp. 1393–1394. IEEE, 1995.
 Mordatch et al. (2015) Igor Mordatch, Kendall Lowrey, Galen Andrew, Zoran Popovic, and Emanuel Todorov. Interactive control of diverse complex characters with neural networks. In Advances in Neural Information Processing Systems 28, pp. 3132–3140, 2015.
 Schulman et al. (2015) John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. CoRR, abs/1506.02438, 2015.
 Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proc. International Conference on Machine Learning, pp. 387–395, 2014.
 Sutton et al. (2001) R. Sutton, D. Mcallester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation, 2001.
 Tan et al. (2014) Jie Tan, Yuting Gu, C. Karen Liu, and Greg Turk. Learning bicycle stunts. ACM Trans. Graph., 33(4):50:1–50:12, 2014. ISSN 0730-0301.
 Van Hasselt (2012) Hado Van Hasselt. Reinforcement learning in continuous state and action spaces. In Reinforcement Learning, pp. 207–251. Springer, 2012.
 Wang et al. (2012) Jack M. Wang, Samuel R. Hamner, Scott L. Delp, and Vladlen Koltun. Optimizing locomotion controllers using biologically-based actuators and objectives. ACM Trans. Graph., 2012.
 Wawrzyński & Tanwani (2013) Paweł Wawrzyński and Ajay Kumar Tanwani. Autonomous reinforcement learning with experience replay. Neural Networks, 41:156–167, 2013.
Supplementary Material
MTU Actuator Optimization
The actuator parameters $\psi$ can be interpreted as a parameterization of the dynamics of the system $p(s' \mid s, a, \psi)$. The expected cumulative reward can then be reparameterized according to
$$J(\pi_\theta, \psi) = \int_S d_\theta(s \mid \psi) \int_A \pi_\theta(s, a) \, A(s, a) \, da \, ds,$$
where $d_\theta(s \mid \psi) = \int_S \sum_{t=0}^{T} \gamma^t p_0(s_0) \, p(s_0 \to s \mid t, \pi_\theta, \psi) \, ds_0$. $\theta$ and $\psi$ are then learned in tandem following Algorithm 2. This alternating method optimizes both the control and dynamics in order to maximize the expected value of the agent, analogous to the role of evolution in biomechanics. During each pass, the policy parameters $\theta$ are trained to improve the agent’s expected value for a fixed set of actuator parameters $\psi$. Next, $\psi$ is optimized using CMA to improve performance while keeping $\theta$ fixed. The expected value of each CMA sample of $\psi$ is estimated using the average cumulative reward over multiple rollouts.
Figure 5 illustrates the improvement in performance during the optimization process, as applied to motions for three different agents. Figure 6 compares the learning curves for the initial and final MTU parameters, for the same three motions.
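The alternating scheme can be outlined as follows. This is only a structural sketch: `train_policy` and `evaluate` are hypothetical callables supplied by the user, and the actuator search shown here is a simple accept-if-better Gaussian random search standing in for CMA, not the covariance matrix adaptation used in the paper.

```python
import random

def alternate_optimization(train_policy, evaluate, theta, psi, passes=6,
                           generations=250, samples=16, step=0.1, rng=random):
    """Alternate between policy learning and actuator parameter search.

    train_policy(theta, psi) -> theta : improves the policy with psi held fixed.
    evaluate(theta, psi) -> float     : estimated performance, e.g. the average
                                        cumulative reward over several rollouts.
    """
    for _ in range(passes):
        # Policy learning with the actuator parameters held fixed.
        theta = train_policy(theta, psi)
        # Actuator search with the policy held fixed (stand-in for CMA).
        best_value = evaluate(theta, psi)
        for _ in range(generations):
            candidates = [[p + rng.gauss(0.0, step) for p in psi]
                          for _ in range(samples)]
            for candidate in candidates:
                value = evaluate(theta, candidate)
                if value > best_value:
                    psi, best_value = candidate, value
    return theta, psi
```

With a rollout-based `evaluate`, the estimates are noisy, so averaging over several rollouts per sample, as described above, helps avoid accepting candidates that only appear better by chance.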
Bounded Action Space
Properties such as torque and neural activation limits result in bounds on the range of values that can be assumed by actions for a particular parameterization. Improper enforcement of these bounds can lead to unstable learning as the gradient information outside the bounds may not be reliable (Hausknecht & Stone, 2015). To ensure that all actions respect their bounds, we adopt a method similar to the inverting gradients approach proposed by Hausknecht & Stone (2015). Let $\nabla a = (a - \mu(s)) A(s, a)$ be the empirical action gradient from the policy gradient estimate of a Gaussian policy. Given the lower and upper bounds $[l^i, u^i]$ of the $i$th action parameter, the bounded gradient of the $i$th action parameter $\nabla \tilde{a}^i$ is determined according to
$$\nabla \tilde{a}^i = \begin{cases} l^i - a^i, & a^i < l^i \\ u^i - a^i, & a^i > u^i \\ \nabla a^i, & \text{otherwise.} \end{cases}$$
Unlike the inverting gradients approach, which scales all gradients depending on proximity to the bounds, this method preserves the empirical gradients when bounds are respected, and alters the gradients only when bounds are violated.
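One plausible form of this rule, consistent with the description above, keeps the empirical gradient when an action parameter lies within its bounds and otherwise replaces it with a step back towards the violated bound. The sketch below is an illustration under that assumption; the function name and list-based interface are hypothetical.

```python
def bounded_gradient(action, grad, lower, upper):
    """Apply per-parameter bounds to an empirical action gradient.

    Inside the bounds the gradient is preserved; outside, it is replaced by
    the offset that moves the action back to the violated bound.
    """
    bounded = []
    for a, g, lo, hi in zip(action, grad, lower, upper):
        if a < lo:
            bounded.append(lo - a)
        elif a > hi:
            bounded.append(hi - a)
        else:
            bounded.append(g)
    return bounded
```

For muscle activations bounded to [0, 1], an action of 1.5 receives a gradient of -0.5 regardless of the empirical gradient, while an action of 0.5 keeps its empirical gradient unchanged.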
Reward
The terms of the reward function are defined as follows:
$q$ and $\hat{q}$ denote the character pose and reference pose represented in reduced coordinates, while $\dot{q}$ and $\hat{\dot{q}}$ are the respective joint velocities. $W$ is a manually specified per-joint diagonal weighting matrix. $h_{root}$ is the height of the root from the ground, and $\dot{x}_{com}$ is the center of mass velocity.
Table 2: Training hyperparameters.
Parameter  Value  Description

$\gamma$  0.9  cumulative reward discount factor
$\alpha_\pi$  0.001  actor learning rate
$\alpha_V$  0.01  critic learning rate
momentum  0.9  stochastic gradient descent momentum 
weight decay  0  L2 regularizer for critic parameters 
weight decay  0.0005  L2 regularizer for actor parameters 
minibatch size  32  tuples per stochastic gradient descent step 
replay memory size  500000  number of the most recent tuples stored for future updates 
Table 3: Number of state, action, and actuator parameters for each character and actuation model.
Character + Actuation Model  State Parameters  Action Parameters  Actuator Parameters

Biped + Tor  58  6  0 
Biped + Vel  58  6  6 
Biped + PD  58  6  12 
Biped + MTU  74  16  114 
Raptor + Tor  154  18  0 
Raptor + Vel  154  18  18 
Raptor + PD  154  18  36 
Raptor + MTU  194  40  258 
Dog + Tor  170  20  0 
Dog + Vel  170  20  20 
Dog + PD  170  20  40 
Dog + MTU  214  44  282 
Table 4: Performance of the final policies and learning speed for each character, actuation model, and motion.
Character + Actuation  Motion  Performance (NCR)  Learning Speed (AUC)

Biped + Tor  Walk  0.7662 ± 0.3117  0.4788
Biped + Vel  Walk  0.9520 ± 0.0034  0.6308
Biped + PD  Walk  0.9524 ± 0.0034  0.6997
Biped + MTU  Walk  0.9584 ± 0.0065  0.7165
Biped + Tor  March  0.9353 ± 0.0072  0.7478
Biped + Vel  March  0.9784 ± 0.0018  0.9035
Biped + PD  March  0.9767 ± 0.0068  0.9136
Biped + MTU  March  0.9484 ± 0.0021  0.5587
Biped + Tor  Run  0.9032 ± 0.0102  0.6938
Biped + Vel  Run  0.9070 ± 0.0106  0.7301
Biped + PD  Run  0.9057 ± 0.0056  0.7880
Biped + MTU  Run  0.8988 ± 0.0094  0.5360
Raptor + Tor  Run (Sim)  0.7265 ± 0.0037  0.5061
Raptor + Vel  Run (Sim)  0.9612 ± 0.0055  0.8118
Raptor + PD  Run (Sim)  0.9863 ± 0.0017  0.9282
Raptor + MTU  Run (Sim)  0.9708 ± 0.0023  0.6330
Raptor + Tor  Run  0.6141 ± 0.0091  0.3814
Raptor + Vel  Run  0.8732 ± 0.0037  0.7008
Raptor + PD  Run  0.9548 ± 0.0010  0.8372
Raptor + MTU  Run  0.9533 ± 0.0015  0.7258
Dog + Tor  Bound (Sim)  0.8016 ± 0.0034  0.5472
Dog + Vel  Bound (Sim)  0.9788 ± 0.0044  0.7862
Dog + PD  Bound (Sim)  0.9797 ± 0.0012  0.9280
Dog + MTU  Bound (Sim)  0.9033 ± 0.0029  0.6825
Dog + Tor  RearUp  0.8151 ± 0.0113  0.5550
Dog + Vel  RearUp  0.7364 ± 0.2707  0.7454
Dog + PD  RearUp  0.9565 ± 0.0058  0.8701
Dog + MTU  RearUp  0.8744 ± 0.2566  0.7932