Reinforcement Learning of Musculoskeletal Control from Functional Simulations

07/13/2020 ∙ by Emanuel Joos, et al.

To diagnose, plan, and treat musculoskeletal pathologies, understanding and reproducing muscle recruitment for complex movements is essential. With muscle activations for movements often being highly redundant, nonlinear, and time dependent, machine learning can provide a solution for their modeling and control in anatomy-specific musculoskeletal simulations. Sophisticated biomechanical simulations often require specialized computational environments, being numerically complex and slow, hindering their integration with typical deep learning frameworks. In this work, a deep reinforcement learning (DRL) based inverse dynamics controller is trained to control muscle activations of a biomechanical model of the human shoulder. In a generalizable end-to-end fashion, muscle activations are learned given current and desired position-velocity pairs. A customized reward function for trajectory control is introduced, enabling straightforward extension to additional muscles and higher degrees of freedom. Using the biomechanical model, multiple episodes are simulated simultaneously on a cluster using the evolving neural models of the DRL being trained. Results are presented for single-axis motion control of shoulder abduction for the task of following randomly generated angular trajectories.




1 Introduction

Biomechanical tissue models have been proposed for several different anatomical structures, such as the prostate, brain, liver, and muscles, for various computer-assisted applications including preoperative planning, intraoperative navigation and visualization, implant optimization, and simulated training. Musculoskeletal biomechanical simulations that use muscle activation models are used in orthopedics for the functional understanding of complex joints and movements, as well as for patient-specific surgical planning. The shoulder is the most complex joint in the body, offering the greatest range of motion. The upper arm is actively controlled and stabilized by over 10 anatomical muscles [10], subdivided into several parts [6]. With its high range of motion and redundancy, the shoulder is regularly exposed to forces larger than the body weight, making the area particularly prone to soft-tissue damage [30, 8]. Consequent surgical interventions and the corresponding planning and decisions could benefit from simulated functional models.

For simulating complex biomechanical models, sophisticated computational and simulation environments are often needed, such as SOFA [11] and Artisynth [17]. Due to many complicating effects, such as the time-dependent, nonlinear behaviour of muscle fibres and soft tissue, as well as bone and muscle contacts and collisions, the control of muscle activations required for a desired movement is not trivial. Linearized control schemes [29] easily become unstable for complex motion and anatomy, e.g. the shoulder, despite tedious controller parametrization and small simulation time-steps leading to lengthy computations [22]. Machine learning based solutions for such biomechanical models would not only enable simple, fast, stable, and thus effective controllers, but could also facilitate studying motor-control paradigms such as neural adaptation and rehabilitation efficacy after orthopedic surgeries, e.g. muscle transfers.

Reinforcement learning (RL) is a machine learning technique for model control of complex behaviour, in a black-box manner from trial-and-error of input-output combinations, i.e. not requiring information on the underlying model nor its environment. With Deep Reinforcement Learning (DRL), impressive examples have been demonstrated such as for playing Atari console games [19] and the game of Go [27], for control of industrial robots [14, 31], and for animating characters [16, 13, 15], e.g. learning how to walk. Despite DRL applications with simpler rigid and multi-body dynamics as above, soft-body structures and complex material, activation, geometry, and contact models have not been well studied. Furthermore, sophisticated simulations required for complex models are not trivial to couple with DRL strategies. In this paper, we present the DRL control of a Finite Element Method (FEM) based musculoskeletal shoulder model, while investigating two DRL approaches comparatively.

2 Materials and Methods

Musculoskeletal Shoulder Model. We herein demonstrate DRL-based control with a simplified model of the shoulder joint. We used segmentations from the BodyParts3D dataset [18], cf. Fig. 1-left. The shoulder complex consists of three bones: the humerus, the scapula, and the clavicle; as well as multiple muscles, tendons, and ligaments. Our model involves surface triangulations for the three bones; a manual surface fit to the ribs imitating the trunk; and B-spline based quadrilateral thin-shell meshes to model large, relatively flat muscles via FEM [21]. The bones are rigid objects, and the muscles are displacement-constrained on the bones at tendon origins and insertions. Muscle fibres are modeled nonlinearly with respect to deformation based on [4]. Fibres are embedded within a linear co-rotational FE background material model, coupled at FE integration nodes. A normalized activation signal sent homogeneously to all fibres of a muscle segment linearly generates internal stresses, contracting the fibres against their background soft-tissue material while pulling on the attached bones [21]. In this paper, we focus on the abduction motion, a standard reference movement for diagnosis and in clinical studies [33, 12, 7]. Accordingly, the four muscles relevant for abduction and adduction [23], i.e. the supraspinatus (ssp), infraspinatus (isp), deltoid middle part (dmi), and latissimus dorsi (ld), are simulated herein, cf. Fig. 1-center.

Figure 1: Shoulder segmentation [18] (left) and our functional musculoskeletal model of the shoulder with four muscles relevant to abduction (center-right).

Learning Muscle Control. Consider that at each time step $t$, an agent with a current state $s_t$ executes an action $a_t$ according to some policy $\pi$, makes an observation $o_t$, and receives a scalar reward $r_t$. RL aims to find the policy for an optimal outcome that maximizes the cumulative sum of current and discounted future rewards $\sum_{k\geq 0}\gamma^k r_{t+k}$. This is predicted either based only on the current state, i.e. the state value function $V(s)$, or based on the current state and action together, i.e. the action value function $Q(s,a)$, also known as the Q function. $\gamma$ is the discounting factor ensuring that future rewards are worth less.
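The cumulative discounted reward that RL maximizes can be illustrated with a short backward recursion; the function below is a sketch for illustration, not part of the paper's implementation.

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative sum of current and gamma-discounted future rewards,
    computed for every time step of an episode via backward recursion:
    G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future
        returns[t] = future
    return returns
```

The backward pass makes the computation linear in the episode length, instead of quadratic when summing discounted tails separately per time step.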

The shoulder model and its forward simulation are herein considered as a black box, with its input being discrete muscle activations $\mathbf{a}\in[0,1]^m$ and the output being the (angular) pose $\phi\in\mathbb{R}^d$ and velocity $\dot\phi\in\mathbb{R}^d$ of the humerus, where $d$ is the number of degrees of freedom to control. Using the full 100% activation range of the muscles as the potential RL action set makes it difficult to predict small activation changes precisely, and may lead to large activation jumps at failed predictions. Therefore we use an action space of differential activation changes $\Delta\mathbf{a}\in\mathbb{R}^m$, where $m$ is the number of muscles controlled. We thus additionally provide the current muscle activations $\mathbf{a}_t$ as input to the agent, so that it can infer the incremental effect of $\Delta\mathbf{a}$ on top. To formalize a solution, typically a Markov Reward Process (MRP) [3] is defined as a quintuple $(S, A, R, P, \gamma)$ with the set of possible states $S$, the set of possible actions $A$, the reward function $R$, the transition probability matrix $P$, and a discounting factor $\gamma$. Given the above, our state set is then

$$s_t = \big(\phi_t,\; \dot\phi_t,\; \hat\phi_{t+1},\; \hat{\dot\phi}_{t+1},\; \mathbf{a}_t\big), \qquad (1)$$

where the hatted variables indicate the desired position and velocity at the next step. Accordingly, we require only a short look-ahead, allowing for simpler network structures and efficient real-time inference. We employ the following reward strategy


$$r_t = -\,\big|\phi_t - \hat\phi_t\big| \;-\; w_1 \lVert \mathbf{a}_t \rVert_1 \;-\; w_2\, b(\mathbf{a}_t), \qquad (2)$$

with the first term enforcing following of the desired trajectory. Lasso regularization in the second term encourages a sparse activation vector $\mathbf{a}_t$, to resolve redundancy under the assumption that physiologically not all muscles are needed at the same time. More sophisticated muscle recruitment strategies, extensively studied in the literature, can also be introduced into this reward function. The last term $b(\cdot)$ prevents the agent from learning a so-called “bang-bang” solution [2], where a controller alternately switches between two extreme states, e.g. $\mathbf{a}=0$ and $\mathbf{a}=1$ herein. This term thus ensures sufficient exploration of the whole activation interval during learning. The transition probability matrix $P$ is inherent to the system being modeled, in our case the shoulder model and its forward simulation. $\gamma$ is a hyperparameter defining the discounting factor, set to 0.99 herein. To find an optimal policy, we comparatively study the two following DRL strategies:
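The reward strategy described above (a tracking term, a Lasso sparsity term, and a bang-bang penalty) can be sketched as follows. The weights `w1`, `w2` and the exact form of the bang-bang penalty are illustrative assumptions; the paper's coefficients are not given here.

```python
def reward(phi, phi_des, activations, w1=0.01, w2=0.01):
    """Sketch of the reward: tracking error, L1 (Lasso) sparsity on the
    activation vector, and a penalty on extreme ("bang-bang") activations.
    Weights w1, w2 and the penalty form are illustrative assumptions."""
    tracking = -abs(phi - phi_des)                    # follow desired trajectory
    lasso = -w1 * sum(abs(a) for a in activations)    # encourage sparse activations
    # count activations pinned at either extreme of the [0, 1] interval
    bang_bang = -w2 * sum(1.0 for a in activations if a <= 0.0 or a >= 1.0)
    return tracking + lasso + bang_bang
```

With a single muscle at mid-range activation only the tracking and sparsity terms contribute, while saturated activations additionally incur the bang-bang penalty.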

Deep Q-learning [32] is a common approach to find an optimal policy by maximizing the action value function, i.e. solving the Bellman optimality equation $Q^*(s,a) = \mathbb{E}_{s'}\big[r + \gamma \max_{a'} Q^*(s',a')\big]$, where $s'$ and $a'$ are respectively the next state and action, and the expectation is over the transition probabilities $P$ of transitioning from state $s$ to $s'$ for the action alternatives. These transitions can be populated from a so-called replay buffer of past experiences, e.g. earlier simulations or games. Deep Q-Learning (DQL) approximates the Q-value function via a neural network (NN) and outputs discrete actions, typically converging relatively fast. As the action space of DQL, we quantized the $[-1,1]\%$ differential activation range into 21 steps, i.e. $A = \{-1.0, -0.9, \ldots, 0.9, 1.0\}\%$. In other words, between two simulation steps no muscle activation can change by more than 1%.
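The quantized action set above is straightforward to construct; with the defaults below this reproduces the 21 equal steps of 0.1% across the $[-1,1]\%$ differential range described in the text.

```python
def dql_action_set(n_steps=21, lo=-1.0, hi=1.0):
    """Quantize the differential-activation range [lo, hi] (in %) into
    n_steps equally spaced discrete actions; the defaults yield
    {-1.0, -0.9, ..., 0.9, 1.0}%."""
    step = (hi - lo) / (n_steps - 1)
    return [round(lo + i * step, 10) for i in range(n_steps)]
```

The DQL network then outputs one Q-value per discrete action per muscle, which is why this enumeration grows exponentially with the number of controlled muscles.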

Policy Gradient Methods work by estimating the policy gradient, in order to utilize simple gradient-based optimizers such as Stochastic Gradient Descent for optimal policy decisions. To prevent large policy changes based on small sets of data, Trust Region Policy Optimization (TRPO) [24] proposes to regularize the policy update optimization by penalizing the KL divergence of the policy change. This helps update the policy in an incremental manner to avoid detrimentally large policy changes, while also utilizing any (simulated) training data optimally. The policy, the value function, and the advantage are estimated by different NNs, some parameters of which may be shared. The policy gradient loss can be defined as $L^{\mathrm{PG}}(\theta) = \hat{\mathbb{E}}_t\big[\log \pi_\theta(a_t\,|\,s_t)\,\hat A_t\big]$, where $\hat A_t$ is the advantage function [25], which represents the added value (advantage) from taking the given action $a_t$ at state $s_t$.

Despite the constraint by TRPO, large policy updates may still be observed; therefore Proximal Policy Optimization (PPO) [26] proposes to clip the policy ratio around 1 within a margin defined by a hyperparameter $\epsilon$ as follows:

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat A_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat A_t\big)\Big],$$

where $r_t(\theta) = \pi_\theta(a_t|s_t)/\pi_{\theta_{\mathrm{old}}}(a_t|s_t)$ is the policy change ratio, with $\theta_{\mathrm{old}}$ and $\theta$ being the policy network parameter vectors, respectively, before and after the intended network update. The minimum ensures that a change in the policy ratio only affects the objective when it makes it worse. NN parameters of the policy and the value function are shared, allowing to also introduce a value function error $L^{\mathrm{VF}}$. Additionally, to ensure policy exploration, an entropy bonus term is introduced [26] as follows:

$$L(\theta) = \hat{\mathbb{E}}_t\Big[L^{\mathrm{CLIP}}_t(\theta) - c_1 L^{\mathrm{VF}}_t(\theta) + c_2\, S[\pi_\theta](s_t)\Big], \qquad (3)$$

where $c_1$ and $c_2$ are weights, $L^{\mathrm{VF}}_t$ is the squared error of the value function before and after the NN update, and $S$ denotes the entropy. PPO can have continuous action spaces, so as its action set we used the continuous range $[-1,1]\%$, corresponding to the same range as in our DQL implementation. In contrast to Q-learning with a replay buffer, PPO is an on-policy algorithm, i.e. it learns online via trial and error.
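The per-sample clipped surrogate term of PPO [26] can be written in a few lines; this is a scalar sketch of the standard objective, independent of any particular network implementation.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate objective:
    min(r * A, clip(r, 1-eps, 1+eps) * A), where r is the probability
    ratio pi_new(a|s) / pi_old(a|s) and A the advantage estimate."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # clip(r, 1-eps, 1+eps)
    return min(ratio * advantage, clipped * advantage)
```

Note the asymmetry this creates: for a positive advantage the gain is capped once the ratio exceeds $1+\epsilon$, while for a negative advantage the penalty is never reduced by clipping, so the objective only limits changes that would make things better "too fast".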

Implementation. We implemented PPO in PyTorch [20]. For DQL we used its OpenAI Baselines implementation [9]. For single-muscle control, DQL and PPO were both implemented as simple networks of one hidden layer with 256 neurons. For PPO with four muscles (PPO4), 3 hidden layers each with 250 neurons were used. We used ReLU activations and the Adam optimizer. For multibody biomechanical simulation, we used Artisynth [17], a framework written in Java running on CPU. For training, the simulation runtime is the main computational bottleneck, with the network back-propagation taking negligible time. For speed-up, we used a CPU cluster of 100 concurrent Artisynth simulations, each running a separate simulation episode and communicating with a DRL agent over a custom TCP interface based on [1]. For simulation, an integration time-step of 100 ms was chosen as a stability-performance tradeoff. During training, at each simulation time step (DRL frame), a simulation provides the respective RL agent with a state as in (1), including the simulated position and velocity of the humerus. The agent then calculates the respective reward and, according to the current policy, executes an action, i.e. sends an update of muscle activations back to the simulation. This is repeated for a preset episode length (herein 10 s), or until the simulation “crashes” prematurely due to numerical failure, which is recorded as a high negative reward. Convergence was ascertained visually in the reward curves.

3 Experiments and Results

Herein we demonstrate experiments showing single-axis control of the shoulder. The glenohumeral joint was thus modeled as a revolute joint allowing rotation only around the axis seen in Fig. 1-center. We conducted two sets of experiments: In a preliminary experiment with only one muscle (ssp), we compared the presented RL algorithms for our problem setting. A more sophisticated scenario with 4 muscles shows the feasibility and scalability of the method to a higher number of muscles. For training and testing, we used random trajectories. Using 5th-order polynomials [28] as desired trajectories $\hat\phi(t)$, we generated 5 s random sections. We set end-point velocity and acceleration constraints to zero, with end-point positions randomly sampled from [30,90] degrees during training, and from [20,100] degrees for testing; using a different and wider range for the latter tests for generalizability. By stacking such random sections while matching their end-point conditions, longer complex motions were constructed. With this, for each training episode a 10 s trajectory was generated on-the-fly, i.e. an episode being 100 frames given the integration time step of 0.1 s. Note that a trained RL agent can control an arbitrary trajectory length. For testing, a set of 100 random trajectories, each of 20 s, was generated once and used to compare all presented RL agents, using root mean square error (RMSE) and mean absolute error (MAE) for tracking accuracy of the desired trajectories.
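One standard way to generate such 5th-order sections with zero end-point velocity and acceleration [28] is the minimum-jerk polynomial below; this is a plausible sketch of the construction, not necessarily the authors' exact generator.

```python
def quintic_segment(p0, p1, T, t):
    """5th-order polynomial from p0 to p1 over duration T with zero
    velocity and acceleration at both end points (minimum-jerk profile):
    s(tau) = p0 + (p1 - p0) * (10 tau^3 - 15 tau^4 + 6 tau^5)."""
    tau = t / T
    return p0 + (p1 - p0) * (10 * tau**3 - 15 * tau**4 + 6 * tau**5)
```

Because velocity and acceleration vanish at both end points, consecutive random sections can be stacked into longer smooth trajectories simply by matching their end-point positions.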

Control of Single Muscle Activation. With this experiment we aim to comparatively study DQL and PPO. The exploration term in the reward (2) is irrelevant for DQL. In order to have a fair comparison, we thus removed this term from the reward for this experiment. Note that, given a single muscle, the Lasso regularization of multiple activations in reward (2) also becomes unnecessary. Accordingly, this experiment employs a straightforward reward function as the negative absolute tracking error, i.e. $r_t = -|\phi_t - \hat\phi_t|$. The episode reward is defined as $R = \sum_{t=1}^{T} r_t$, where $T$ is the episode length. The mean episode reward over the last 10 episodes during training is depicted in Fig. 2 for DQL and PPO.

Figure 2: Mean episode reward over last 10 episodes during training for DQL and PPO.

Both algorithms are observed to show a similar learning behaviour overall, although PPO requires approximately 10 times more samples than DQL to converge, due to its action space being a continuous range. In Fig. 3 both models are shown while controlling the ssp activation in the forward simulation for a sample trajectory.

Figure 3: A random target trajectory (black) and its tracking by RL using DQL (left) and PPO (center) with the muscle activation (bottom) and resulting angular upper arm motion (top). (right) Distribution of tracking errors for 100 random trajectories.

It is seen that the discrete action space of DQL results in a sawtooth-like pattern in the activations, and hence a relatively oscillatory trajectory. Note that for small abduction angles the moment from the humerus mass is minimal; due to this lack of a counter-acting torsional load, the control becomes difficult, i.e. slight changes in activations may lead to large angular changes, visible in Fig. 3 for small abduction angles.

Over 100 trajectories, DQL has an MAE of 3.70 and an RMSE of 5.78, while PPO has an MAE of 4.00 and an RMSE of 5.36. The MAE and RMSE distributions of both methods over all tested trajectories can be seen in Fig. 3-right.

Muscle Control with Redundancy. In this scenario, all four muscles relevant for abduction are controlled at the same time, with redundancy. Given a similar action space quantization of 21 steps, DQL for 4 muscles would require a $21^4 = 194{,}481$-dimensional discrete action space, which is computationally unfeasible. Indeed, this is a major drawback of DQL, preventing its extension to high-dimensional action spaces. In contrast, a continuous action space for PPO is easily defined for the four muscles. Given the simulation with 4 muscles, a PPO agent (PPO4) was trained for 1.6 M frames, taking a total of 1 hour including overheads for communication and Artisynth resets after crashes. The mean episode reward and loss (3), averaged over the last 10 episodes, are plotted in Fig. 4. Note that high gradients in policy updates, e.g. due to crashes, are a challenge.

Figure 4: Evolution of mean reward and loss of last 10 episodes during PPO4 training.

Despite the 4 times higher action space dimension, a feasible learning curve is observed. Large negative spikes in the reward, e.g. near 1 M frames, correspond to simulation crashes, e.g. due to infeasible activations generated by the DRL agent. Despite the large policy gradients these spikes cause in (3), PPO is able to successfully recover thanks to its gradient clipping. Using the trained PPO4 agent for controlling four muscles, Fig. 5 shows the humerus tracking for the same random trajectory as in Fig. 3, with the PPO-generated muscle activations.

Figure 5: Controlling four muscle activations for tracking the random abduction trajectory in Fig. 3-left, along with the activation patterns (left). Simulation frames at two time instances shown with red squares in the trajectory plot (center&right).

It is observed that ssp and isp help to initiate the abduction motion – a well-known behaviour [23]. Beyond initial abduction, their activation however diminishes, with the rest of the motion mainly carried out by dmi and ld, which have stronger moment arms. PPO control of 4 muscles during 100 randomly generated trajectories results in an MAE of 5.15 and an RMSE of 6.64, with their distributions shown in Fig. 3-right. The slightly higher tracking error compared to the single-muscle case is likely due to the higher network capacity, more training episodes, and thus longer convergence time required for a higher-dimensional action space.

We further tested an in-vivo trajectory from the public motion-tracking dataset of [5]. We used a combing motion of 17.5 s, involving lifting the arm up twice and combing while up. Using the angle between the humerus and the vertical axis, the 3D tracked motion was converted to an abduction angle as our tracking target (varying between 20 and 100 degrees). Applying our earlier-trained PPO4 agent on this shows good tracking visually, with an RMSE and MAE of 7.67 and 6.57, respectively. These results, being comparable with the earlier ones, show that our method generalizes well to in-vivo trajectories even with synthetic training.

4 Conclusions

We have studied two DRL approaches, demonstrating their successful application for single-axis control of a functional biomechanical model of the human shoulder. PPO was implemented in a way that allows multiple environments to run simultaneously using network-based sockets. This is indispensable for scalability to higher dimensions within reasonable computational time-frames. Inference of our NN-based DRL agents is near real-time, enabling fast control of complex functional simulations. Any constraints that make tracking suboptimal or simulation infeasible are implicitly learned with DRL, as a remedy to the occasional simulation crashes occurring with conventional analytical controllers.

A main bottleneck for scalability to sophisticated models is the limitation of action spaces. In contrast to the discrete action space of DQL, which explodes exponentially with the curse of dimensionality, it is shown herein that the continuous action space and corresponding policy optimization of PPO enable its extension to multiple muscles with redundancy. Given the generalizable form of the learning scheme and the reward function in the proposed approach, extensions to more muscles and additional degrees of freedom are straightforward. This opens up the potential for full control of the shoulder and other musculoskeletal structures. It also enables neuroplasticity studies after corrective surgeries such as muscle transfers: after major orthopedic interventions, patients may not easily adjust to postoperative configurations, with recovery expectancy and rehabilitation time-frames varying widely. Networks trained on preoperative settings and tested on simulated postoperative scenarios could provide insight into operative choices, e.g. for faster rehabilitation and improved outcomes.


  • [1] Abdi, A.H., Saha, P., Srungarapu, V.P., Fels, S.: Muscle excitation estimation in biomechanical simulation using NAF reinforcement learning. In: Computational Biomechanics for Medicine. pp. 133–141 (2020)
  • [2] Artstein, Z.: Discrete and continuous bang-bang and facial spaces or: Look for the extreme points. Siam Review 22(2), 172–185 (1980)
  • [3] Bertsekas, D.P.: Dynamic Programming and Optimal Control, volume 2. Athena Scientific (2012)
  • [4] Blemker, S.S., Pinsky, P.M., Delp, S.L.: A 3D model of muscle reveals the causes of nonuniform strains in the biceps brachii. Journal of Biomechanics 38(4), 657–665 (2005)
  • [5] Bolsterlee, B., Veeger, H.E.J., van der Helm, F.C.T.: Modelling clavicular and scapular kinematics: from measurement to simulation. Med Biol Eng Comput 52, 283–291 (2014)
  • [6] Brown, J.M.M., Wickham, J.B., McAndrew, D.J., Huang, X.F.: Muscles within muscles: Coordination of 19 muscle segments within three shoulder muscles during isometric motor tasks. Journal of Electromyography and Kinesiology 17(1), 57–73 (2007)
  • [7] Contemori, S., Panichi, R., Biscarini, A.: Effects of scapular retraction/protraction position and scapular elevation on shoulder girdle muscle activity during glenohumeral abduction. Human Movement Science 64, 55–66 (2019)
  • [8] Craik, J.D., Mallina, R., Ramasamy, V., Little, N.J.: Human evolution and tears of the rotator cuff. International Orthopaedics 38(3), 547–552 (2014)
  • [9] Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., Zhokhov, P.: OpenAI Baselines. (2017)
  • [10] Di Giacomo, G., Pouliart, N., Costantini, A., De Vita, A.: Atlas of functional shoulder anatomy. Springer (2008)
  • [11] Faure, F., Duriez, C., Delingette, H., Allard, J., Gilles, B., Marchesseau, S., Talbot, H., Courtecuisse, H., Bousquet, G., Peterlik, I., Cotin, S.: SOFA: A multi-model framework for interactive physical simulation. In: Payan, Y. (ed.) Soft Tissue Biomechanical Modeling for Computer Assisted Surgery, vol. 11, pp. 283–321. Springer (2012)
  • [12] Gerber, C., Snedeker, J.G., Baumgartner, D., Viehöfer, A.F.: Supraspinatus tendon load during abduction is dependent on the size of the critical shoulder angle: A biomechanical analysis. Journal of Orthopaedic Research 32(7), 952–957 (2014)
  • [13] Heess, N., TB, D., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, S.M.A., Riedmiller, M., Silver, D.: Emergence of locomotion behaviours in rich environments. arXiv:1707.02286 (2017)
  • [14] James, S., Johns, E.: 3D simulation for robot arm control with deep Q-learning. arXiv:1609.03759 (2016)
  • [15] Kidziński, Ł., Mohanty, S.P., Ong, C., Hicks, J., Francis, S., Levine, S., Salathé, M., Delp, S.: Learning to run challenge: Synthesizing physiologically accurate motion using deep reinforcement learning. In: The NIPS ’17 Competition: Building Intelligent Systems. Springer (2018)
  • [16] Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. arXiv:1509.02971 (2015)
  • [17] Lloyd, J.E., Stavness, I., Fels, S.: ArtiSynth: A fast interactive biomechanical modeling toolkit combining multibody and finite element simulation. In: Payan, Y. (ed.) Soft Tissue Biomechanical Modeling for Computer Assisted Surgery, pp. 355–394. Springer (2012)
  • [18] Mitsuhashi, N., Fujieda, K., Tamura, T., Kawamoto, S., Takagi, T., Okubo, K.: BodyParts3D: 3D structure database for anatomical concepts. Nucleic Acids Research 37, D782–D785 (2008)
  • [19] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
  • [20] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., et al.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32. pp. 8024–8035 (2019)
  • [21] Pean, F., Goksel, O.: Surface-based modeling of muscles: Functional simulation of the shoulder. Medical Engineering & Physics (2020)
  • [22] Péan, F., Tanner, C., Gerber, C., Fürnstahl, P., Goksel, O.: A comprehensive and volumetric musculoskeletal model for the dynamic simulation of the shoulder function. Computer Methods in Biomechanics and Biomedical Engineering 22(7), 740–751 (2019)
  • [23] Reed, D., Cathers, I., Halaki, M., Ginn, K.: Does supraspinatus initiate shoulder abduction? Journal of Electromyography and Kinesiology 23(2), 425–429 (2013)
  • [24] Schulman, J., Levine, S., Moritz, P., Jordan, M., Abbeel, P.: Trust region policy optimization. In: International Conference on Machine Learning (ICML). vol. PMLR 37, pp. 1889–1897 (2015)
  • [25] Schulman, J., Moritz, P., Levine, S., Jordan, M.I., Abbeel, P.: High-dimensional continuous control using generalized advantage estimation. In: International Conference on Learning Representations (ICLR) (2016)
  • [26] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv:1707.06347 (2017)
  • [27] Silver, D., Huang, A., Maddison, C.J., Guez, A., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
  • [28] Spong, M.W., Hutchinson, S., Vidyasagar, M.: Robot modeling and control. John Wiley & Sons (2020)
  • [29] Stavness, I., Lloyd, J.E., Fels, S.: Automatic prediction of tongue muscle activations using a finite element model. Journal of Biomechanics 45(16), 2841–2848 (2012)
  • [30] Streit, J.J., Lenarz, C.J., Shishani, Y., McCrum, C., Wanner, J.P., Nowinski, R.J., Warner, J.J., Gobezie, R.: Pectoralis major tendon transfer for the treatment of scapular winging due to long thoracic nerve palsy. Journal of Shoulder and Elbow Surgery 21(5), 685–690 (2012)
  • [31] Tsurumine, Y., Cui, Y., Uchibe, E., Matsubara, T.: Deep reinforcement learning with smooth policy update: Application to robotic cloth manipulation. Robotics and Autonomous Systems 112, 72–83 (2019)
  • [32] Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: AAAI Conference on Artificial Intelligence. pp. 2094–2100 (2016)

  • [33] Wickham, J., Pizzari, T., Stansfeld, K., Burnside, A., Watson, L.: Quantifying ‘normal’ shoulder muscle activity during abduction. Journal of Electromyography and Kinesiology 20(2), 212–222 (2010)