1 Introduction
Biomechanical tissue models have been proposed for several anatomical structures such as the prostate, brain, liver, and muscles, and for various computer-assisted applications including preoperative planning, intraoperative navigation and visualization, implant optimization, and simulated training. Musculoskeletal biomechanical simulations that use muscle activation models are employed in orthopedics for the functional understanding of complex joints and movements, as well as for patient-specific surgical planning. The shoulder is the most complex joint in the body, offering the greatest range of motion. The upper arm is actively controlled and stabilized by over 10 anatomical muscles [10], subdivided into several parts [6]. With its high range of motion and redundancy, the shoulder is regularly exposed to forces larger than the body weight, making the area particularly prone to soft-tissue damage [30, 8]. Consequent surgical interventions, and the corresponding planning and decisions, could benefit from simulated functional models.
Simulating complex biomechanical models often requires sophisticated computational and simulation environments, such as SOFA [11] and ArtiSynth [17]. Due to many compounding effects, such as the time-dependent, nonlinear behaviour of muscle fibres and soft tissue, as well as bone and muscle contacts and collisions, controlling the muscle activations required for a desired movement is not trivial. Linearized control schemes [29] easily become unstable for complex motion and anatomy, e.g. the shoulder, despite tedious controller parametrization and small simulation time-steps leading to lengthy computations [22]. Machine-learning based solutions for such biomechanical models would not only enable simple, fast, stable, and thus effective controllers, but could also facilitate studying motor control paradigms such as neural adaptation and rehabilitation efficacy after orthopedic surgeries, e.g. muscle transfers.
Reinforcement learning (RL) is a machine learning technique for the control of complex behaviour in a black-box manner, from trial-and-error over input-output combinations, i.e. requiring no information on the underlying model nor its environment. With deep reinforcement learning (DRL), impressive examples have been demonstrated, such as playing Atari console games [19] and the game of Go [27], controlling industrial robots [14, 31], and animating characters [16, 13, 15], e.g. learning how to walk. Despite such DRL applications with simpler rigid and multi-body dynamics, soft-body structures with complex material, activation, geometry, and contact models have not been well studied. Furthermore, the sophisticated simulations required for complex models are not trivial to couple with DRL strategies. In this paper, we present DRL control of a Finite Element Method (FEM) based musculoskeletal shoulder model, comparatively investigating two DRL approaches.
2 Materials and Methods
Musculoskeletal Shoulder Model. We herein demonstrate DRL-based control with a simplified model of the shoulder joint, using segmentations from the BodyParts3D dataset [18], cf. Fig. 1(left). The shoulder complex consists of three bones: the humerus, the scapula, and the clavicle; as well as multiple muscles, tendons, and ligaments. Our model involves surface triangulations for the three bones; a manual surface fit to the ribs imitating the trunk; and B-spline based quadrilateral thin-shell meshes to model large, relatively flat muscles via FEM [21]. The bones are rigid objects, and the muscles are displacement-constrained on the bones at tendon origins and insertions. Muscle fibres are modeled nonlinearly with respect to deformation based on [4]. The fibres are embedded within a linear co-rotational FE background material model, coupled at the FE integration nodes. A normalized activation signal, sent homogeneously to all fibres of a muscle segment, linearly generates internal stresses, contracting the fibres against their background soft-tissue material while pulling on the attached bones [21]. In this paper, we focus on the abduction motion, a standard reference movement for diagnosis and in clinical studies [33, 12, 7]. Accordingly, the four muscles relevant for abduction and adduction [23], namely the supraspinatus (ssp), infraspinatus (isp), deltoid middle part (dmi), and latissimus dorsi (ld), are simulated herein, cf. Fig. 1(center).
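For intuition, a minimal 1-D sketch of such activation-driven fibre stress is given below; the bell-shaped force-length curve and quadratic passive term are illustrative assumptions, not the actual constitutive models of [4, 21]:

```python
import math

def fibre_stress(activation, stretch, sigma_max=1.0):
    """Illustrative 1-D Hill-type fibre: a normalized activation in [0, 1]
    linearly scales an active stress that depends nonlinearly on fibre
    stretch (force-length curve), plus a passive term engaging beyond
    rest length. Functional forms are illustrative only."""
    active = sigma_max * math.exp(-((stretch - 1.0) / 0.4) ** 2)  # bell-shaped force-length curve
    passive = sigma_max * max(0.0, stretch - 1.0) ** 2            # passive stiffening when stretched
    return activation * active + passive
```

Doubling the activation doubles the active stress contribution, which is the linear coupling exploited by the controller below.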
Learning Muscle Control. Consider that at each time step t, an agent with a current state s_t executes an action a_t according to some policy π, makes an observation o_t, and receives a scalar reward r_t. RL aims to find the policy π for an optimal outcome that maximizes the cumulative sum of current and discounted future rewards. This is predicted either based only on the current state, i.e. the state value function V(s_t), or based on the current state and action together, i.e. the action value function Q(s_t, a_t), also known as the Q function. γ ∈ [0, 1) is the discounting factor, ensuring that future rewards are worth less.
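For illustration, the cumulative discounted return that V and Q predict can be computed recursively; a minimal sketch:

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative sum of current and discounted future rewards,
    G_t = sum_k gamma^k * r_{t+k}, accumulated backwards in time."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma < 1, rewards further in the future contribute less:
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```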
The shoulder model and its forward simulation are herein considered a black-box, with its input being the muscle activations and the output being the (angular) pose θ and velocity θ̇ of the humerus, where n is the number of degrees of freedom to control. Using the full [0, 100]% activation range of the muscles as the potential RL action set makes it difficult to predict small activation changes precisely, and can lead to large activation jumps at failed predictions. Therefore, we use an action space of differential activation changes Δa_t ∈ [−1, 1]^m %, where m is the number of muscles controlled. We thus additionally provide the current muscle activations a_t as input to the agent, so that it can infer the incremental effect of Δa_t on top. To formalize a solution, typically a Markov Reward Process (MRP) [3] is defined as a quintuple (S, A, R, P, γ), with the set of possible states S, the set of possible actions A, the reward function R, the transition probability matrix P, and a discounting factor γ. Given the above, our state set is then

$s_t = (\theta_t, \dot\theta_t, a_t, \hat\theta_{t+1}, \dot{\hat\theta}_{t+1})$    (1)

where the hatted variables indicate the desired position and velocity at the next step. Accordingly, we require only a short look-ahead, allowing for simpler network structures and efficient real-time inference. We employ the following reward strategy

$r_t = -\,\lvert\hat\theta_{t+1} - \theta_{t+1}\rvert \;-\; w_1 \lVert a_t \rVert_1 \;-\; w_2\, r_\mathrm{bb}(a_t)$    (2)

with the first term enforcing the agent to follow the desired trajectory. The Lasso regularization in the second term encourages a sparse activation vector a_t, to resolve redundancy under the assumption that physiologically not all muscles are needed at the same time. More sophisticated muscle recruitment strategies, extensively studied in the literature, could also be introduced in this reward function. The last term, a penalty r_bb on activations at the extremes, prevents the agent from learning a so-called "bang-bang" solution [2], where a controller alternately switches between two extreme states, e.g. a_i = 0 and a_i = 1 herein. This term thus ensures a sufficient exploration of the whole activation interval during learning. The transition probability matrix P is inherent to the system being modeled, in our case the shoulder model and its forward simulation. γ is a hyperparameter defining the discounting factor, set to 0.99 herein. To find an optimal policy, we comparatively study the following two DRL strategies:
Deep Q-learning [32] is a common approach to find an optimal policy by maximizing the action value function, i.e. solving the Bellman optimality equation $Q^*(s_t, a_t) = \mathbb{E}\,[\,r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})\,]$, where $s_{t+1}$ and $a_{t+1}$ are respectively the next state and action, and the expectation is over the transition probabilities of moving from state $s_t$ to $s_{t+1}$ for the action alternatives. The experience needed to estimate this expectation can be populated by a so-called replay buffer of past experiences, e.g. earlier simulations or games. Deep Q-Learning (DQL) approximates the Q-value function via a neural network (NN) and outputs discrete actions, typically converging relatively fast. As the action space of DQL, we quantized the [−1, 1]% differential activation range in 21 steps, i.e. Δa ∈ {−1, −0.9, …, 0.9, 1}%. In other words, between two simulation steps any muscle activation cannot change by more than 1%.

Policy gradient methods work by estimating the policy gradient, in order to utilize simple gradient-based optimizers such as stochastic gradient descent for optimal policy decisions. To prevent large policy changes based on small sets of data, Trust Region Policy Optimization (TRPO)
[24] proposes to regularize the policy update optimization by penalizing the KL divergence of the policy change. This helps update the policy in an incremental manner to avoid detrimental large policy changes, while also utilizing any (simulated) training data optimally. The policy, the value function, and the advantage are estimated by different NNs, some parameters of which may be shared. The policy gradient loss can be defined as $L^\mathrm{PG}(\phi) = \hat{\mathbb{E}}_t[\log \pi_\phi(a_t \mid s_t)\, \hat{A}_t]$, where $\hat{A}_t$ is the advantage function [25], which represents the added value (advantage) of taking the given action $a_t$ at state $s_t$. Despite the constraint by TRPO, large policy updates may still be observed; therefore, Proximal Policy Optimization (PPO) [26] proposes to clip the gradient updates around 1 within a margin defined by a hyperparameter $\epsilon$ as follows:

$L^\mathrm{CLIP}_t(\phi) = \hat{\mathbb{E}}_t\big[\min\big(\rho_t(\phi)\hat{A}_t,\; \mathrm{clip}(\rho_t(\phi),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\big]$

Here $\rho_t(\phi) = \pi_\phi(a_t \mid s_t)\,/\,\pi_{\phi_\mathrm{old}}(a_t \mid s_t)$ is the policy change ratio, with $\phi_\mathrm{old}$ and $\phi$ being the policy network parameter vectors, respectively, before and after the intended network update. The minimum makes sure that the change in policy ratio only affects the objective when it makes it worse. NN parameters of the policy and the value function are shared, allowing to also introduce a value function error $L^\mathrm{VF}_t$. Additionally, to ensure policy exploration, an entropy bonus term is introduced [26] as follows:

$L_t(\phi) = \hat{\mathbb{E}}_t\big[\,L^\mathrm{CLIP}_t(\phi) - c_1 L^\mathrm{VF}_t(\phi) + c_2\, S[\pi_\phi](s_t)\,\big]$    (3)

where $c_1$ and $c_2$ are weights, $L^\mathrm{VF}_t$ is the value function error before and after the NN update, and $S$ denotes the entropy. PPO admits continuous action spaces, so as its action set we used Δa ∈ [−1, 1]%, corresponding to the same range as in our DQL implementation. In contrast to Q-learning with a replay buffer, PPO is an on-policy algorithm, i.e. it learns online via trial and error.
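For illustration, the clipped surrogate and the combined objective in (3) can be sketched as follows; this is a minimal numpy sketch with illustrative coefficient values (eps, c1, c2), not our actual PyTorch implementation:

```python
import numpy as np

def ppo_loss(ratio, advantage, value_pred, value_target, entropy,
             eps=0.2, c1=0.5, c2=0.01):
    """Per-sample PPO objective (to be maximized), following [26].
    ratio: policy change ratio pi_new(a|s) / pi_old(a|s).
    eps, c1, c2 are illustrative hyperparameter values."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    l_clip = np.minimum(ratio * advantage, clipped * advantage)  # pessimistic (clipped) surrogate
    l_vf = (value_pred - value_target) ** 2                      # value function error
    return np.mean(l_clip - c1 * l_vf + c2 * entropy)            # entropy bonus aids exploration
```

With a positive advantage, ratios above 1 + eps no longer increase the objective, removing the incentive for overly large policy updates.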
Implementation. We implemented PPO (source: https://github.com/CAiMlab/PPO) in PyTorch [20]; for DQL we used the OpenAI Baselines implementation [9]. For single-muscle control, DQL and PPO were both implemented as simple networks with one hidden layer of 256 neurons. For PPO with four muscles (PPO4), 3 hidden layers of 250 neurons each were used. We used ReLU activations and the Adam optimizer. For multi-body biomechanical simulation, we used ArtiSynth [17], a framework written in Java, on the CPU. For training, the simulation runtime is the main computational bottleneck, with the network backpropagation taking negligible time. For speed-up, we used a CPU cluster of 100 concurrent ArtiSynth simulations, each running a separate simulation episode and communicating with a DRL agent over a custom TCP interface based on [1]. For the simulation, an integration time-step of 100 ms was chosen as a stability and performance trade-off. During training, at each simulation time step (DRL frame), a simulation provides the respective RL agent with a state as in (1), including the simulated position and velocity of the humerus. The agent then calculates the respective reward and, according to the current policy, executes an action, i.e. sends an update of the muscle activations back to the simulation. This is repeated for a preset episode length (herein 10 s), or until the simulation "crashes" prematurely due to numerical failure, which is recorded as a high negative reward. Convergence was ascertained visually in the reward curves.

3 Experiments and Results
Herein we demonstrate experiments showing single-axis control of the shoulder. The glenohumeral joint was thus modeled as a revolute joint, allowing rotation only around the axis seen in Fig. 1(center). We conducted two sets of experiments: in a preliminary experiment with only one muscle (ssp), we compared the presented RL algorithms for our problem setting; a more sophisticated scenario with 4 muscles shows the feasibility and scalability of the method to a higher number of muscles. For training and testing, we used random trajectories. Using fifth-order polynomials [28], we generated random 5 s sections, setting the endpoint velocity and acceleration constraints to zero, with endpoint positions randomly sampled from [30, 90] degrees during training and from [20, 100] degrees for testing. Using a different and wider range for the latter was to test for generalizability. By stacking such random sections while matching their endpoint conditions, longer complex motions were constructed. With this, for each training episode a 10 s trajectory was generated on-the-fly, i.e. an episode being 100 frames given the integration time step of 0.1 s. Note that a trained RL agent can control a trajectory of arbitrary length. For testing, a set of 100 random trajectories of 20 s each was generated once, and used to compare all presented RL agents, using root mean square error (RMSE) and mean absolute error (MAE) for the tracking accuracy of desired trajectories.
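Such a trajectory section can be sketched as follows; this assumes the closed-form fifth-order polynomial for zero endpoint velocity and acceleration, which is one way (not necessarily ours) to satisfy the stated constraints:

```python
import numpy as np

def quintic_section(theta0, theta1, n=51):
    """Fifth-order polynomial section between two angles (degrees) with zero
    velocity and acceleration at both endpoints, sampled at n frames.
    Closed form for these particular boundary conditions."""
    s = np.linspace(0.0, 1.0, n)                # normalized time t / T
    blend = 10 * s**3 - 15 * s**4 + 6 * s**5    # zero 1st/2nd derivative at s = 0 and s = 1
    return theta0 + (theta1 - theta0) * blend

# Sections stacked end-to-end automatically match endpoint conditions,
# since every section starts and ends at rest.
```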
Control of Single Muscle Activation. With this experiment we aim to comparatively study DQL and PPO. The exploration term in the reward (2) is irrelevant for DQL; in order to have a fair comparison, we thus removed this term from the reward for this experiment. Note that, given a single muscle, the Lasso regularization of multiple activations in reward (2) also becomes unnecessary. Accordingly, this experiment employs a straightforward reward function, namely the negative absolute tracking error, i.e. $r_t = -\lvert\hat\theta_{t+1} - \theta_{t+1}\rvert$.
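A minimal sketch of this reward, with the Lasso and exploration terms of (2) kept as optional, illustrative placeholders (their exact forms and weights are not those of our implementation), disabled here for the single-muscle case:

```python
import numpy as np

def reward(theta_sim, theta_des, activations=None, w1=0.0, w2=0.0):
    """Tracking reward: negative absolute tracking error. The w1 (Lasso)
    and w2 (anti-bang-bang) terms are illustrative placeholders, disabled
    (w1 = w2 = 0) for the single-muscle experiment."""
    r = -abs(theta_des - theta_sim)
    if activations is not None:
        a = np.asarray(activations, dtype=float)
        r -= w1 * np.sum(np.abs(a))                # encourage a sparse activation vector
        r -= w2 * np.sum((a < 0.05) | (a > 0.95))  # penalize near-extreme activations
    return r
```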
The episode reward is defined as $R = \sum_{t=1}^{T} r_t$, where $T$ is the episode length. The mean episode reward over the last 10 episodes during training is depicted in Fig. 2 for DQL and PPO.
Both algorithms are observed to show a similar learning behaviour overall, although PPO requires approximately 10 times more samples than DQL to converge, due to its action space being a continuous range. In Fig. 3, both models are shown controlling the ssp activation in the forward simulation for a sample trajectory.
It is seen that the discrete action space of DQL results in a sawtooth-like pattern in the activations, and hence a relatively oscillatory trajectory. Note that for small abduction angles the moment from the humerus mass is minimal, and due to this lack of a counteracting torsional load the control becomes difficult, i.e. any slight changes in activations may lead to large angular changes, visible in Fig. 3 for small abduction angles.

Over 100 trajectories, DQL has an MAE of 3.70° and an RMSE of 5.78°, while PPO has an MAE of 4.00° and an RMSE of 5.36°. The MAE and RMSE distributions of both methods over all tested trajectories can be seen in Fig. 3(right).
Muscle Control with Redundancy. In this scenario, all four muscles relevant for abduction are controlled at the same time, with redundancy. Given a similar action space quantization of 21 steps, DQL for 4 muscles would require a discrete action space of 21^4 ≈ 194 000 combinations, which is computationally infeasible. Indeed, this is a major drawback of DQL, preventing its extension to high-dimensional action spaces. In contrast, a continuous action space for PPO is easily defined for the four muscles. Given the simulation with 4 muscles, a PPO agent (PPO4) was trained for 1.6 M frames, taking a total of 1 hour including overheads for communication and ArtiSynth resets after crashes. The mean episode reward and the loss (3), averaged over the last 10 episodes, are plotted in Fig. 4. Note that high gradients in policy updates, due e.g. to crashes, are a challenge.
Despite the 4-times-higher action space dimension, a feasible learning curve is observed. Large negative spikes in reward, e.g. near 1 M frames, correspond to simulation crashes, e.g. due to infeasible activations generated by the DRL agent. Despite the large policy gradients these spikes cause in (3), PPO is able to successfully recover thanks to its gradient clipping. Using the trained PPO4 agent for controlling the four muscles, Fig. 5 shows the humerus tracking for the same random trajectory as earlier in Fig. 3, with the PPO-generated muscle activations.

It is observed that ssp and isp help to initiate the abduction motion, a well-known behaviour [23]. Beyond initial abduction, their activation however diminishes, with the rest of the motion mainly carried out by dmi and ld, which have stronger moment arms. PPO control of 4 muscles during 100 randomly-generated trajectories results in an MAE of 5.15° and an RMSE of 6.64°, with their distributions shown in Fig. 3(right). The slightly higher tracking error compared to the single-muscle case is likely due to the higher network capacity, more training episodes, and thus longer time to convergence required for a higher-dimensional action space.
We further tested an in-vivo trajectory from the public motion-tracking dataset of [5]. We used a combing motion of 17.5 s, involving lifting up the arm twice and combing while up. Using the angle between the humerus and the vertical axis, the 3D tracked motion was converted to an abduction angle as our tracking target (varying between 20 and 100 degrees). Applying our earlier-trained PPO4 agent on this shows good tracking visually, with an RMSE and MAE of 7.67° and 6.57°, respectively. These results being comparable with the earlier ones shows that our method generalizes well to in-vivo trajectories, even with purely synthetic training.
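The conversion from 3D tracking to an abduction angle can be sketched as below; the vertical-axis convention (z-up) is an assumption for illustration:

```python
import numpy as np

def abduction_angle(humerus_vec, vertical=(0.0, 0.0, 1.0)):
    """Angle (degrees) between a tracked humerus long-axis direction and the
    vertical axis, used to reduce 3-D tracked motion to a 1-D abduction
    target. The z-up axis convention is an assumption."""
    v = np.asarray(humerus_vec, dtype=float)
    z = np.asarray(vertical, dtype=float)
    cosang = np.dot(v, z) / (np.linalg.norm(v) * np.linalg.norm(z))
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))  # clip guards rounding
```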
4 Conclusions
We have studied two DRL approaches, demonstrating their successful application for single-axis control of a functional biomechanical model of the human shoulder. PPO was implemented in a way that allows multiple environments to run simultaneously using network-based sockets. This is indispensable for scalability to higher dimensions within reasonable computational timeframes. Inference of our NN-based DRL agents is near real-time, enabling fast control of complex functional simulations. Any constraints that make tracking suboptimal or simulation infeasible are implicitly learned with DRL, as a remedy to the occasional simulation crashes occurring with conventional analytical controllers.
A main bottleneck for scalability to sophisticated models is the limitation on action spaces. In contrast to the discrete action space of DQL, which explodes exponentially with the curse of dimensionality, it is shown herein that the continuous action space and corresponding policy optimization of PPO enable its extension to multiple muscles with redundancy. Given the generalizable form of the learning scheme and the reward function of the proposed approach, extensions to more muscles and additional degrees-of-freedom are straightforward. This opens up the potential for full control of the shoulder and other musculoskeletal structures. It also enables neuroplasticity studies after corrective surgeries such as muscle transfers: after major orthopedic interventions, patients may not easily adjust to post-op configurations, with recovery expectancy and rehabilitation timeframes varying widely. Networks trained on pre-op settings and tested on simulated post-op scenarios could provide insight into operative choices, e.g. for faster rehabilitation and improved outcomes.
References
 [1] Abdi, A.H., Saha, P., Srungarapu, V.P., Fels, S.: Muscle excitation estimation in biomechanical simulation using NAF reinforcement learning. In: Computational Biomechanics for Medicine. pp. 133–141 (2020)
 [2] Artstein, Z.: Discrete and continuous bang-bang and facial spaces or: Look for the extreme points. SIAM Review 22(2), 172–185 (1980)
 [3] Bertsekas, D.P.: Dynamic Programming and Optimal Control, volume 2. Athena Scientific (2012)
 [4] Blemker, S.S., Pinsky, P.M., Delp, S.L.: A 3D model of muscle reveals the causes of non-uniform strains in the biceps brachii. Journal of Biomechanics 38(4), 657–665 (2005)
 [5] Bolsterlee, B., Veeger, H.E.J., van der Helm, F.C.T.: Modelling clavicular and scapular kinematics: from measurement to simulation. Med Biol Eng Comput 52, 283–291 (2014)
 [6] Brown, J.M.M., Wickham, J.B., McAndrew, D.J., Huang, X.F.: Muscles within muscles: Coordination of 19 muscle segments within three shoulder muscles during isometric motor tasks. Journal of Electromyography and Kinesiology 17(1), 57–73 (2007)
 [7] Contemori, S., Panichi, R., Biscarini, A.: Effects of scapular retraction/protraction position and scapular elevation on shoulder girdle muscle activity during glenohumeral abduction. Human Movement Science 64, 55–66 (2019)
 [8] Craik, J.D., Mallina, R., Ramasamy, V., Little, N.J.: Human evolution and tears of the rotator cuff. International Orthopaedics 38(3), 547–552 (2014)
 [9] Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., Zhokhov, P.: OpenAI Baselines. https://github.com/openai/baselines (2017)
 [10] Di Giacomo, G., Pouliart, N., Costantini, A., De Vita, A.: Atlas of functional shoulder anatomy. Springer (2008)
 [11] Faure, F., Duriez, C., Delingette, H., Allard, J., Gilles, B., Marchesseau, S., Talbot, H., Courtecuisse, H., Bousquet, G., Peterlik, I., Cotin, S.: SOFA: A multi-model framework for interactive physical simulation. In: Payan, Y. (ed.) Soft Tissue Biomechanical Modeling for Computer Assisted Surgery, vol. 11, pp. 283–321. Springer (2012)
 [12] Gerber, C., Snedeker, J.G., Baumgartner, D., Viehöfer, A.F.: Supraspinatus tendon load during abduction is dependent on the size of the critical shoulder angle: A biomechanical analysis. Journal of Orthopaedic Research 32(7), 952–957 (2014)
 [13] Heess, N., TB, D., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, S.M.A., Riedmiller, M., Silver, D.: Emergence of locomotion behaviours in rich environments. arXiv:1707.02286 (2017)
 [14] James, S., Johns, E.: 3D simulation for robot arm control with deep Q-learning. arXiv:1609.03759 (2016)
 [15] Kidziński, Ł., Mohanty, S.P., Ong, C., Hicks, J., Francis, S., Levine, S., Salathé, M., Delp, S.: Learning to run challenge: Synthesizing physiologically accurate motion using deep reinforcement learning. In: The NIPS ’17 Competition: Building Intelligent Systems. Springer (2018)
 [16] Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. arXiv:1509.02971 (2015)
 [17] Lloyd, J.E., Stavness, I., Fels, S.: ArtiSynth: A fast interactive biomechanical modeling toolkit combining multibody and finite element simulation. In: Payan, Y. (ed.) Soft Tissue Biomechanical Modeling for Computer Assisted Surgery, pp. 355–394. Springer (2012)
 [18] Mitsuhashi, N., Fujieda, K., Tamura, T., Kawamoto, S., Takagi, T., Okubo, K.: BodyParts3D: 3D structure database for anatomical concepts. Nucleic Acids Research 37, D782–D785 (2008)
 [19] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
 [20] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., et al.: PyTorch: An imperative style, highperformance deep learning library. In: Advances in Neural Information Processing Systems 32. pp. 8024–8035 (2019)
 [21] Péan, F., Goksel, O.: Surface-based modeling of muscles: Functional simulation of the shoulder. Medical Engineering & Physics (2020)
 [22] Péan, F., Tanner, C., Gerber, C., Fürnstahl, P., Goksel, O.: A comprehensive and volumetric musculoskeletal model for the dynamic simulation of the shoulder function. Computer Methods in Biomechanics and Biomedical Engineering 22(7), 740–751 (2019)
 [23] Reed, D., Cathers, I., Halaki, M., Ginn, K.: Does supraspinatus initiate shoulder abduction? Journal of Electromyography and Kinesiology 23(2), 425–429 (2013)
 [24] Schulman, J., Levine, S., Moritz, P., Jordan, M., Abbeel, P.: Trust region policy optimization. In: International Conference on Machine Learning (ICML). vol. PMLR 37, pp. 1889–1897 (2015)
 [25] Schulman, J., Moritz, P., Levine, S., Jordan, M.I., Abbeel, P.: High-dimensional continuous control using generalized advantage estimation. In: International Conference on Learning Representations (ICLR) (2016)
 [26] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv:1707.06347 (2017)
 [27] Silver, D., Huang, A., Maddison, C.J., Guez, A., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
 [28] Spong, M.W., Hutchinson, S., Vidyasagar, M.: Robot modeling and control. John Wiley & Sons (2020)
 [29] Stavness, I., Lloyd, J.E., Fels, S.: Automatic prediction of tongue muscle activations using a finite element model. Journal of Biomechanics 45(16), 2841–2848 (2012)
 [30] Streit, J.J., Lenarz, C.J., Shishani, Y., McCrum, C., Wanner, J.P., Nowinski, R.J., Warner, J.J., Gobezie, R.: Pectoralis major tendon transfer for the treatment of scapular winging due to long thoracic nerve palsy. Journal of Shoulder and Elbow Surgery 21(5), 685–690 (2012)
 [31] Tsurumine, Y., Cui, Y., Uchibe, E., Matsubara, T.: Deep reinforcement learning with smooth policy update: Application to robotic cloth manipulation. Robotics and Autonomous Systems 112, 72–83 (2019)

 [32] Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: AAAI Conference on Artificial Intelligence. pp. 2094–2100 (2016)
 [33] Wickham, J., Pizzari, T., Stansfeld, K., Burnside, A., Watson, L.: Quantifying ‘normal’ shoulder muscle activity during abduction. Journal of Electromyography and Kinesiology 20(2), 212–222 (2010)