1 Introduction
Even “simple” tasks like grabbing an object involve highly coordinated actions of our musculoskeletal system. At the core of such coordinated movements are voluntary contractions of skeletal muscles. Understanding the underlying mechanism of recruitment and muscle force generation is a challenging task and subject to much research. One of the few noninvasive and clinically available diagnostic tools to obtain insights in the functioning (or disfunctioning) of the neuromuscular system are electromyographic (EMG) recordings, e.g., measuring the activationinduced, resulting potentials on the skin surface. Conclusions on the neuromuscular system then reduce to a signal processing task, which often is based on the assumption that the muscle itself is a black box and hence ignores spatial arrangements within the respective muscle. But (surface) EMG measurements also have their limitations, e.g., they typically only capture activity from muscle parts close to the surface, which leads to difficulties in identifying crosstalk. In addition, it often only records weak signals due to layers of adipose tissue, or is, in some cases, restricted to isometric contractions. Hence, to obtain more holistic and noninvasive insights into the neuromuscular system, computational models can be employed. Such models need to capture much of the electromechanical properties of skeletal muscle tissue and of the interaction between neural recruitment and muscular contraction.
The contractile behavior of skeletal muscle tissue has been extensively modeled using lumpedparameter models such as Hilltype skeletal muscle models [46], continuummechanical skeletal muscle models [25, 4, 41, 5], or multiscale, chemoelectromechanical skeletal muscle models [39, 40, 23, 21]. For prediction of the resulting EMG of a particular stimulation, there exist analytical models [10, 11, 30] with short compute times, or numerical approaches [28, 30, 32, 33].
However, irrespective of whether one aims to simulate EMG signals or the mechanical behavior of a skeletal muscle, numerical methods are almost unavoidable, if realistic muscle geometries are considered. The chemoelectromechanical models as proposed by [40], [21, 22], or [18]
are particularly wellsuited to incorporate many structural and functional features of skeletal muscles. They embed onedimensional “muscle fibers” within a threedimensional skeletal muscle model and associate them with a particular motor unit. Moreover, those models can be directly linked to motor neuron models either phenomenologically
[17, 12] or biophysically [9, 34]to further investigate the relationship between neural and mechanical behavior. The desired degree of detail and complexity within these models requires the coupling of different physical phenomena on different temporal and spatial scales, e.g., models describing the mechanical or electrophysiological state of the muscle tissue on the organ scale and the biochemical processes on the cellular scale (cf. Section 2).
Being able to take into account all these different processes on different scales requires a flexible multiscale, multiphysics computational framework and significant compute power. The availability of computational resources restricts the number of individual muscle fibers that can be considered within a skeletal muscle. The chemoelectromechanical models as implemented within the international opensource libraries OpenCMISS [6, 21, 32]
allow general muscle geometries with about 1000 embedded “muscle fibers”. Most skeletal muscles, however, have significantly more fibers. While simulations with 1000 fibers and less can potentially provide some insights into the neuromuscular system, some effects such as the motor unit recruitment over the full range of motor units and muscle fibers and their implication on the resulting EMG can not be estimated unless the full model with a realistic number of muscle fibers is simulated. This full model allows us to estimate the accuracy of “reduced” models by comparing them to the output of the detailed full “benchmark” model. Unless such comparisons are carried out, only predictions can be made how additional details such as more fibers or functional units (motor units) affect the overall outcome – both in terms of muscle force generation and in terms of computed EMG signals.
While highly optimized and highly parallel software for biomechanical applications exists, e.g., for chemoelectromechanical heart models [15], most multipurpose computational frameworks for biomedical applications such as OpenCMISS are developed to provide flexibility. This is, e.g., achieved through standards like CellML [27] and FieldML [8]. The respective frameworks are utilised to enhance existing multiphysics models for a wide range of (bioengineering) applications. They are designed to be run by biomedical researchers on smallsized compute clusters. While they typically can be compiled on largescale HPC compute clusters such as HazelHen at the HLRS in Stuttgart, they typically are not capable of exploiting the full potential of the machine. Moreover, simulation run time is typically considered less important than model complexity and output. Hence, typical simulations of biomedical applications are not necessarily optimized for numerical efficiency, minimal communication, the exploitation of novel algorithms or file output. Within this paper, we demonstrate how one can exploit analysis tools, suitable numerical techniques, and coupling strategies to obtain an efficient chemoelectromechanical skeletal muscle model that is suitable to be run on a largescale HPC infrastructure and, thus, is capable of running with a sufficient resolution and number of muscle fibers to provide the required details. Once largescale simulations of biomedical applications are solved with a high degree of detail, most specialized visualization tools such as OpenCMISSZinc, can no longer handle the amount of output data. Dedicated visualization tools for largescale visualizations need to be developed. In this work, the MegaMol framework [14] has been adapted to visualize the different biophysical parameters and the resulting EMG.
2 The multiscale skeletal muscle model
Before outlining the model in its full detail, we first provide a brief overview on some anatomical and physiological characteristics of skeletal muscles that are relevant to our model. From an anatomical point of view, skeletal muscles are a hierarchical system. Starting from its basic unit, the socalled sarcomere, several inseries and inparallel arranged sarcomeres constitute a cylindrically shaped myofibril. Several inparallel arranged myofibrils make up a skeletal muscle fiber and multiple muscle fibers form a fascicle. All fascicles constitute the entire muscle. These structures are connected with each other through an extracellular matrix (ECM). From a physiological point of view, several fibers are controlled by the same lower motor neuron through nervous axons. The entire unit consisting of the lower motor neuron, the axons and the respective fibers that are innervated by the axons, is referred to as motor unit, which is the smallest unit within a skeletal muscle that can voluntarily contract. The lower motor neuron sends ratecoded impulses called action potentials to all fibers belonging to the same motor unit. Moreover, motor units are activated in an orderly fashion, starting from the smallest ones, up to the biggest (recruitment size principle). After a motor neuron stimulates a muscle fiber at the neuromuscular junction, an action potential (AP) is triggered and propagates along the muscle fiber resulting in a local activity. For more comprehensive insights into muscle physiology and anatomy, we refer to the book by [29].
As the focus of this research is on solving biophysically detailed models of a neuromuscular system on HPC architectures, this section provides an overview on the multiscale modeling framework of our chemoelectromechanical skeletal muscle model that is based on the work by [40], [21, 22], and [18]
. These models can account for the main mechanical and electrophysiological properties of skeletal muscle tissue, including a realistic activation process and resulting force generation. This is realized by linking multiple submodels, describing different physical phenomena. To reduce the computational costs, the different submodels that describe phenomena on different length and time scales are simulated using different discretizations, i. e., spatial resolution and timestep size. Data are exchanged between the submodels using homogenization and interpolation techniques.
2.1 The 3D continuummechanical muscle model
The physiological working range of skeletal muscles includes large deformations. Therefore, we use a continuum mechanical modeling approach that is based on the theory of finite elasticity to simulate the macroscopic deformations and stresses in the muscle tissue. In continuum mechanics, the placement function describes the motion of a material point, i. e., it assigns every material point with position in the reference (nondeformed) domain at a time to a position in the actual (deformed) domain at time
. The deformation of the body at a material point can be described by the deformation gradient tensor
which is defined as the partial derivative of the placement function
with respect to the reference configuration. The local displacement is defined by the vector
.The governing equation of the continuum mechanical model is the balance of linear momentum. Under the assumption of slow motions (i. e. inertia forces vanish) and neglecting body forces, the balance of linear momentum in its local form can be written as
(1) 
where denotes the divergence operator and is the first PiolaKirchhoff stresstensor. To solve the balance of linear momentum, one needs to define a constitutive equation that specifies . It describes the overall mechanical behavior of the muscle that can be divided into a passive and an active component. The latter represents the muscle’s ability to contract and produce forces. In this work, we assume a superposition of the active and passive behavior, i. e., an additive split of .
Passive skeletal muscle tissue is assumed to be hyperelastic and transversely isotropic. Consequently, the passive part to the first PiolaKirchhoff stress tensor ) depends on the deformation gradient tensor and a structure tensor , which is defined by the muscle fiber direction . The isotropic part of the passive stresstensor assumes a MooneyRivlin material. It is enhanced by an additive anisotropic contribution accounting for the specific material properties in the muscle fiber direction .
The active force is generated on a microscopic scale, i. e. , within a halfsarcomere (the smallest functional unit of a muscle) consisting of thin actin and thick myosin filaments. Based on geometrical considerations of the halfsarcomere structure, it is known that the active muscle force depends on the actual halfsarcomere length (forcelength relation) [13]. When a halfsarcomere is activated by calcium as a second messenger, actin and myosin filaments can form crossbridges and produce forces (crossbridge cycling).
The active force state of the microscopic halfsarcomere is summarized in an activation parameter that enters the macroscopic constitutive equation. Furthermore, we assume that the active stress contribution acts only along the fiber direction . When considering only isometric or slow contractions, the active stress tensor can be defined as a function of the lumped activation parameter , the deformation gradient tensor , and the structure tensor . An additional forcelength relationship needs to be included within .
Finally, we assume skeletal muscle tissue to be incompressible, which implies the constraint . The resulting first PiolaKirchhoff stress tensor reads
(2) 
where is the hydrostatic pressure, which enters the equation as a Lagrange multiplier enforcing the incompressibility constraint. The material parameters of the continuummechanical skeletal muscles are fitted to experimental data [16] and can be found in [22].
2.2 The 1D Model for Action Potential Propagation
The electrical activity of skeletal muscles resulting from the local activity of all muscle fibers can be analyzed by measuring the extracellular potential. The bidomainmodel is a widely used continuum mechanical framework to simulate the electrical activity of living tissues [35]. It is based on the homogeneity assumption that intracellular and extracellular space homogeneously occupy the same domain. Intracellular and extracellular space are electrically coupled by an electrical current flowing over the cell membrane,
where and denote the current density in the intracellular and extracellular space, respectively, and is the fiber’s surface to volume ratio. The muscle fiber membrane is nearly impermeable for ions and serves as a capacitor. However, ions can be transported over the membrane by ion channels and active ion pumps. These properties can be mathematically described by a biophysically motivated modeling approach proposed by [24], leading to the constitutive equation
(3) 
where is the transmembrane potential, is the capacity of the muscle fiber membrane (sarcolemma) and is the ionic current flowing over the ionchannels and pumps, which depends on . Further state variables are summarized in , e. g., the states of different ion channels. is an externally applied stimulation current, e. g., due to a stimulus from the nervous system. Assuming that the intracellular space and extracellular space show equivalent anisotropy, which is the case for 1D problems, the bidomain equations can be reduced to the monodomain equation. We use the onedimensional monodomain equation in the domain :
(4) 
Here, denotes the spatial coordinate along a onedimensional line, i.e., the fiber, and is the effective conductivity.
2.3 The 0D SubCellular Muscle Model
The model proposed by [44] provides a basis to compute the lumped activation parameter which is the link to the 3D continuummechanical muscle model. Its evolution model is steered by the ionic current of the 1D model.
It contains a detailed biophysical description of the subcellular excitationcontraction coupling pathway. Specifically, it models the depolarization of the membrane potential in response to stimulation, release of calcium from the sarcoplasmic reticulum (SR) serving as a second messenger, and crossbridge (XB) cycling. To do so, the Shorten model couples three subcellular models: A HodgkinHuxleytype model is utilized to simulate the electrical potentials and ion currents through the musclefiber membrane and the membrane of the Ttubule system. To simulate calcium dynamics, a model of the SR membrane ryanodine receptor (RyR) channels ([38]) is coupled to the electrical potential across the Ttubule membrane and triggers the release of calcium from the SR. Additionally, the calciumdynamics model describes diffusion of calcium in the muscle cell, active calcium transport over the SR membrane via the SERCA pump (sarco/endoplasmic reticulum ATPase), binding of calcium to buffer molecules (e. g. parvalbumin or ATP), and binding of calcium to troponin enabling the formation of crossbridges. The active force generation is simulated by solving a simplified Huxleytype model [37], which is the basis for calculating the activation parameter .
The entire incorporated subcellular processes are modeled with a set of coupled ordinary differential equations (ODEs)
(5) 
where summarizes the righthandside of all the ODEs associated with the state variables which are, in the case of the Shorten et al. model, more than 50. The final activation parameter is computed from entries of the state variable vector and the length and contraction velocity of the halfsarcomere, and . Assuming isometric or very slow contractions, the contraction velocity can be neglected. Hence, following [37] and [22], the activation parameter is calculated as
(6) 
Here, the function is the forcelength relation for a cat skeletal muscle by [36], is the concentration of post powerstroke XBs, is the concentration of post powerstroke crossbridges for a tetanic contraction (100 Hz stimulation after 500 ms stimulation) and is an offset parameter denoting the concentration of post powerstroke crossbridges in the resting state.
2.4 Test Scenario, Boundary and Initial Conditions
The overall aim of this work is to analyse the implemented algorithms on (largescale) clusters and to identify bottlenecks that prevent us from achieving scalable and efficient implementations. For this purpose, we define a test scenario that will be used throughout this work to test the model dynamics and the implemented algorithms.
To summarize the previous paragraphs, the chemoelectromechanical behavior of a skeletal muscle is described by the following coupled equations:
&0 & = & div P(F,M,γ(y,
ł_hs) ) & in Ω_t for all t ,
&∂Vm∂t & = & 1AmCm ( ∂∂x( σ_eff ∂Vm∂x )  A_m I_ion(y, V_m, I_stim ) ) & on all fibers Γ_t ,
&∂y∂t & = & G_y (y, V_m , I_stim ) & at all halfsacromere positions.
For the test scenario, we use a generic cubic muscle geometry (). The muscle fibers are aligned parallel to one cubeedge (i. e. the direction). We consider a long isometric singletwitch experiment by stimulating all fibers at their midpoints for with . To constrain the generic muscle sample, Dirichlet boundary conditions (zero displacement) are used resulting in the fixation of the following faces of the muscle cube: the left and the right faces (faces normal to the direction), the front face (face normal to the direction) and the bottom face (face normal to the direction). Further, no current flows over the boundary of the computational muscle fibers, i. e., zero Neumann boundary condition are assumed at both muscle fiber ends.
For the material parameters for the continuummechanical model, the effective conductivity , the surfacetovolume ratio , and the membrane capacity , we use exactly the same values as reported in [22]. The initial values for the state vector at of the subcellular model are taken from the slowtwitch scenario of [44].
2.5 Status Quo and Current Implementation
Spatial Discretization. The submodels of the multiscale skeletal muscle model show significantly different characteristic time and length scales. In order to solve the model efficiently, different discretization techniques and resolutions are required for the submodels. The current implementation of the OpenCMISS solves the continuummechanical skeletal muscle model using a finite element method with triquadratic/trilinear Lagrange basis functions, i. e., TaylorHood elements. The onedimensional muscle fibers are represented by much finer onedimensional finite element meshes with linear Lagrange basis functions. The respective onedimensional elements are embedded into the threedimensional finite element mesh.
To simulate the coupled multiphysics problem, data have to be transfered between the different spatial discretization approaches. The microscopic sarcomere forces provided by the monodomain model are homogenized and projected to the macroscopic Gauss points of the threedimensional continuummechanical model (). Vice versa, the node positions of the onedimensional computational muscle fibers are updated from the actual displacements of the threedimensional continuummechanical model by interpolating the node positions via the basisfunctions of the threedimensional model. Based on this step, the microscopic halfsarcomere lengths are evaluated. Fig. 1 (left) schematically shows the embedding of discretised 1D fibers within the 3D muscle domain discretised with finite elements.
Time discretization and coupling in time. To compute an approximate solution for Eqns eq_summary, the different characteristic time scales of the 3D, 1D and 0D problems can be exploited. The action potential propagates faster than the muscle deformation, the subcellular chemomechanical processes evolve considerably faster than the diffusive action potential propagation. To achieve common global time steps, which is desirable from a computational point of view, we choose , with and define . State values associated with time are denoted with the superscript . Employing different time steps requires a time splitting scheme. The statusquo implementation uses a firstorder accurate Godunov splitting scheme, for which one timestep of the threedimensional equation including all substeps for the onedimensional monodomain equation reads:

For do

For perform explicit Euler steps for Eqn. e:0D_ystate2 and the 0D portion of Eqn. e:1D_fiber.

Set and .

Perform one implicit Euler step for the 1D portion of Eqn. e:1D_fiber to compute .


Set and .

Calculate and compute the homogenized values at the Gauss points of the 3D finite element mesh.

Solve Eqn. e:3D_momentum_balance.

Interpolate the actual configuration to the fibers’ nodes for computing the local halfsacromere length .
Fig. 1 (right) schematically depicts the algorithm. We assume quasistatic conditions for the continuummechanical problem. This is equivalent to assuming constant muscle deformations throughout .
Solvers. The coupled time stepping algorithm described above contains two large systems of equations that need to be solved. The first one results from the 3D elasticity problem e:3D_momentum_balance and the second one stems from an implicit time integration of the linear 1D diffusion problem of the fiber e:1D_fiber. In the existing implementation, the GMRES solver of [42] from the PETSc^{1}^{1}1http://www.mcs.anl.gov/petsc/index.html library is used.
Domain Partitioning. The 3D computational domain is decomposed into disjoint cuboid shaped partitions by cutting the whole domain with axisaligned planar cuts at element boundaries. The partitioning spans all model components, such that the quantities in the 3D, 1D and 0D model corresponding to the same spatial location are on the same process. This is to avoid unnecessary interprocess volumecommunication between the submodels. The implementation of this partitioning concept is motivated by a skeletal muscle anatomy and physiology: all skeletal muscle fibers are, from an electrical point of view, independent from each other. Thus, the obvious way to partition the workload is to distribute whole fibers among the processes. For instance, [21] hardcoded a partitioning with 4 processes.
3 Towards HPC: Scalability Evaluation
Before being able to simulate such complex models as the one introduced in Section 2 on HPC systems, it is essential to first analyse bottlenecks of existing implementations. In this paper, we focus on investigating numerical and algorithmic characteristics of the implementation as described in Sec. 2.5. The results provide the basis for new algorithms and improved implementations to enhance scalability. As the parallel status quo code used 4 cores, we only analyse numerical complexity, i. e., scalability in terms of the size of the problem, and redesign the parallelisation.
As discretization, we choose 8 TaylorHood finite elements and include 36 muscle fibers (), which go through all of the Gauss points that exhibit the same  and coordinates. The time steps are , and , i. e., and . The system resulting from the discretization of the 1D fibers is solved using a restarted GMRES solver with restart after 30 iterations and relative residuum tolerance . To solve the 3D problem, Newton’s method from the PETSc library is used with relative and absolute tolerance and a backtracking line search approach with maximum number of 40 iterations.
To assess problem size scalability, we vary the number of nodes along each muscle fiber and measure runtimes for a single step of the 3D model. Results are depicted in Fig. 2.
In a second experiment, we compare the previously used GMRES solver with a conjugate gradient (CG) solver and the direct MUMPS LU solver [1, 2] as implemented within PETSc [3].
Overall, the results for both numerical investigations reveal the following insights:

The loworder explicit ODE solver (Euler) requires very small time steps for the desired accuracy. Hence, the majority of the runtime is spent for solving the 0D problem.

The portion of runtime for solving the 3D problem is negligible and constant. This is due to the low number of 3D finite elements for the mechanics problem (eight elements). Realistic models would require a finer resolution of the 3D problem.

The runtime for the other computational components increases approximately linearly with the number of fiber elements. This indicates a good scaling behavior with respect to problem size.

The computation of the macroscopic variable from fiber nodes and the homogenized activation parameter has almost no impact on the computational time. Computing is more time consuming as it involves simultaneously traversing the fiber and the 3D meshes, whereas computing requires a single averaging operation for each Gauss point of the 3D elements.

The GMRES solver is a robust choice for general sparse linear systems of equations, but does not exploit the symmetry, positive definiteness and tridiagonal structure of a 1D diffusion system and hence exhibits substantial overhead. The LU solver exhibits the lowest computational time for large problems. It shows linear complexity in the number of 1D nodes^{2}^{2}2Note that the 1D diffusion is an exception. In all higher dimensions, direct solvers are known to fail for large problems due to their high complexity..
These findings are used in the following sections to improve the numerical efficiency from an algorithmic and implementational point of view.
4 Towards HPC: Improvements For efficiency
Although linear scalability with respect to the number of nodes per fiber is observed for all 0D and 1D components (cf. Sec. 3), the runtime is not optimal. This is mainly due to the firstorder accurate time stepping scheme. It enforces very small time steps (for and ) in order to achieve the required accuracy. We propose a combination of secondorder splitting and secondorder time stepping schemes within the 0D and 1D subproblems to tackle this issue. A further obstacle on the way to largescale simulations with a reasonable number of fibers is the limited parallelism (currently using cores). We design a parallelization scheme for arbitrary core numbers along with a new partitioning scheme minimizing comunication cost. Finally, we replace the customized OpenCMISS file output and visualization by a new library that is able to handle very largescale data.
4.1 Algorithmic Optimization– SecondOrder Time Stepping
To reduce computational cost and increase accuracy, we propose to replace the Godunov splitting with a secondorder Strang splitting. We replace the explicit Euler for Eqn. e:0D_ystate2 and the 0D portion of Eqn. e:1D_fiber with the method of Heun and employ an implicit CrankNicolson method for the diffusion part of Eqn. e:1D_fiber. Secondorder timestepping schemes allow to reduce the discretization error much faster with decreasing time step size . Thus, the required accuracy might be achieved by employing larger time steps. In contrast to the simpler Godunov splitting, the Strang splitting uses three substeps per time step: a first step with length for the 0D part, a second step with length for the diffusion, and a third step with length again for the 0D part. The modified algorithm at a time reads:

For do

For perform explicit Heun steps for Eqn. e:0D_ystate2 and the 0D portion of Eqn. e:1D_fiber.

Set .

Perform one implicit CrankNicolson step for the 1D portion of Eqn. e:1D_fiber.

Set .

For perform explicit Heun steps for Eqn. e:0D_ystate2 and the 0D portion of Eqn. e:1D_fiber.

Set and .


Set and .

Calculate and compute the homogenised values at the Gauss points of the 3D finite element mesh.

Solve Eqn. e:3D_momentum_balance.

Interpolate the displacements to the fibers’ nodes for computing the local halfsacromere length .
The explicit Heun step in a. and d. reads:
(7)  
(9)  
In b., we solve the system resulting from the CrankNicolson time discretization of the diffusion part in Eqn. e:1D_fiber.:
(10) 
In the following, we present numerical experiments demonstrating the increase in efficiency achieved as a result of this new time discretization.
4.1.1 Time Discretization for the SubCellular Model (0D ODEs)
In a first step, we verify the convergence order of Heun’s method numerically. Therefore, we restrict ourselves to step 1.a of the Godunov algorithm, but use Heun’s method for Eqn. e:0D_ystate2 and the 0D portion of Eqn. e:1D_fiber. We use the same test setup as presented in Sects. 2.4 and 2.5. To compare the accuracy of Heun’s method with the explicit Euler method, we compare the values of and at a stimulated material point on a muscle fiber while varying the time step size . To compare the methods in terms of efficiency, we measure the related compute times. We restrict ourselves to the time interval , with . As a reference solution, we use the solution calculated with Heun’s method with a very high resolution (). The tests are performed sequentially on an Intel^{®} Core^{TM} i54590 CPU, using the OpenCMISS implementation. Fig. 5 depicts the relative error depending on the number of time steps executed in the interval while Fig. 5 compares the necessary CPUtime to reach a certain accuracy for the different solvers.
Fig. 5 shows the expected firstorder convergence for the explicit Euler method and secondorder convergence for Heun’s method. From an application point of view, however, efficient computation (“Which accuracy can be achieved in which runtime?”) is more important than the order of convergence. Since each time step of the Heun’s method is twice as expensive as an explicit Euler step, we need to also take into account the runtime in our assessment to show the potential to decrease the runtime for Heun’s method given a required accuracy. All times are normalized by the CPUtime of the Euler method with . The results presented in Fig. 5 show that two Heun steps with replace forward Euler steps yielding a theoretical speedup of for the 0Dsolver. At the same time, the error decreases by a factor of approximately .
4.1.2 Time Discretization for Muscle Fibers (0D coupled to 1D)
In the second step, we verify the convergence order of the Strang splitting scheme. We consider again the same test setup as above except that we use the larger time interval . We varying the number of 1D time steps. Based on the previous results for the isolated 0D problem, we choose for the Strangsplitting scheme and for the Godunovsplitting scheme. This ensures a comparable relative error for the 0D subproblem while saving computational time. The reference solution is computed using the Strangsplitting scheme with , yielding .
Fig. 7 shows the relative errors of at a stimulated subcell for the Godunov and Strangsplitting schemes. Comparable relative errors as for the Godunov scheme with are achieved for the Strang splitting scheme with . The resulting speedups are depicted in Fig. 7. We normalize the compute times with the compute time of the Godunov scheme with . For a relative error limit in , one can achieve a speedup of about with the improved time stepping scheme if the accuracy requirement is weakened slightly. If the error constraint cannot be weakened, we still obtain a speedup of . Note that, for lower error limits, the speedup achieved with a secondorder scheme is even higher due to the higher convergence order.
4.2 Parallelization – Domain Partitioning
The parallelization of the skeletal muscle simulation is based on a domain partitioning approach, i.e., the whole 3D cuboid muscle domain is decomposed into disjoint subdomains (partitions) that are assigned to parallel processes.
Within this work, we aim to use a parallelization scheme that is able to make use of an arbitrary number of processes. Previous work [21] based their parallelization strategy on physiological characteristics, which leads to pillarlike partitions ensuring that one fiber is never distributed to multiple processors. Considering not only the skeletal muscle fibers but the entire model, this induces a significant communication overhead due to the long and stretched partitions. We therefore investigate a further domain partitioning approach with cubeshaped partitions exhibiting subdomains with minimal surface. The two abovementioned partitioning strategies, i. e., the pillarlike and cubic ones, are visualized in Fig. 8.
The domain partitioning induces communication due to dependencies of local data on data of neighboring partitions. These dependencies comprise the following:

For solving for the propagation of , i. e., using an implicit Euler or CrankNicolson method (Eqn. eqn:CN) to solve the monodomain equation (Eqn. eqn:monodomain), communication between neighboring processes along a single fiber is required. The cost for this communication per process grows linearly with the number of fibers contained in the respective 3D partition. For the existing pillarlike implementation of partitioning, this cost vanishes since fibers are not partitioned into several pieces.

Computing the mucle displacements of the 3D model (Eqn. e:3D_momentum_balance) involves all processes. This is a result of using a finite element discretization, which inherently requires communication between processes which share common partition boundaries. Thus, these costs are proportional to the surface of the 3D partitions.

Interpolating the muscle displacements of the 3D muscle mesh to 1D fiber mesh node positions and calculating , requires ghost layers at the partition boundaries containing one layer of 3D elements. Note that for the reverse transfer, the accumulation of the activation parameter from the 0D model at the Gauss points of the elements in the 3D mesh, i. e., computing , does not involve communication since the process is completely local as all 0D points are contained within the respective 3D element and reside on the same process.
4.2.1 Generalized and Optimized Domain Partitioning
Based on the communication dependencies 1 and 2, we enhance the original pillarlike domain partitioning in two ways: (i) we allow for an arbitrary number of processes instead of a fixed number of four processes and (ii) we introduce a new partitioning concept with nearly cubic partitions. In contrast to partitioning strategies based on spacefilling curves such as [43], or graphpartitioning such as [31, 47], a cuboid partition has the advantage that the interaction of one cuboid partition with others is guaranteed to be planar and bounded by the maximum number of neighboring partitions, i. e., . This allows communication with reduced complexity and cost. On the other hand, load balancing may not be optimal because no partition might exist that divides the 3D elements into equally sized subdomains with a number matching the total number of processes.
Within this work, we assume that it is possible to create nearly optimal cubic partitions. However, we can not completely avoid obtaining subdomains at the boundary of the computational domain that have less elements than others. Given a fixed number of available cores, we maximize the number of employed processes by adapting the number of subdivisions in each axis direction corresponding to a factorization of the total number of processes. By introducing the additional constraint that each generated partition has to be larger than a specified “atomic” cuboid of elements, we can easily ensure that each process contains only entire fibers (pillarlike partition), a fixed number of fiber subdivisions (cubelike partition), or anything in between.
In Sec. 4.2.3, various runtime and scalability studies are presented. In particular, we present scaling results that demonstrate weak scaling for the full muscle model showing the influence of pillarlike or cubelike partitioning and the memory consumption per process of the current implementation in two different experiments; and strong scaling for the full muscle model showing the dependency of the achieved efficiency on the shape of the partitions.
4.2.2 Optimized Interpolation and Homogenization
Based on the communication dependency 3 listed above, the interpolation and homogenization routines need to be optimized. Interpolation and homogenization involves information transfer between values at Gauss points of the 3D elements to nodes of the 1D fibers. To allow general domain decomposition, a mapping between the respective 3D and 1D finite elements is necessary. In [21], the homogenization was achieved by a naive search over all locally stored fibers. This search was performed for each 3D element. We replace this approach, which exhibits quadratic complexity (in terms of the number of involved elements), by a linear algorithm that uses a correct indexing approach based on the element’s topology. Fig. 9 shows that the new implementation achieves substantial runtime reductions.
4.2.3 Scaling Experiments
All scaling experiments are conducted on HazelHen, the Tier1 supercomputer at the High Performance Computing Center Stuttgart (HLRS). A dualsocket node of this Cray XC40 contains two Intel Haswell E52680v3 processors with base frequency of 2.5 GHz, maximum turbo frequency of 3.3 GHz, 12 Cores each and 2 Hyperthreads per Core, leading to a total number of 48 possible threads per core.
Weak Scaling Measurements – Experiment #1. For weak scaling, the problem size is increased proportional to the number of processes. Thus, invariants are the number of elements per process and the overall shape of the computational domain. Here, we show weak scaling for both partitioning strategies: partitioning only in  and direction, i.e., pillarlike partitioning, and cubic partitioning. We start with processes on a single node of HazelHen with an initial partition consisting of subdivisions for both the pillarlike and the cubic partitioning. Each partition contains 3D elements per MPI rank. Further, we ensure that each 3D element contains fibers in direction with three 1D elements per fiber, i.e., 1D elements per 3D element. Hence, the initial problem is made up of elements and fibers.
In the two conducted series of measurements for the two partitioning strategies, further subdivisions are defined such that the pillarlike or cubic partitioning structure is maintained. The refinements are obtained by first refining by a factor of in direction, in direction and then in direction before repeating the process. For the cubic partitioning, we fix the number of 3D elements that each MPI rank contains to be . For the pillarlike partitioning, the constraint is that each subdomain spans over all threedimensional elements in the direction, whose number varies with increasing problem size. Therefore, the number of elements per MPI rank in  and direction is halved for each refinement in an alternating way. This way, we double the number of partitions while maintaining the constant number of eight 3D elements per MPI rank. By allocating processes on the cores of each node (no hyperthreading), we scale from 1 to 32 nodes, i. e., from 24 to 768 cores. Table 1 provides the details on the partitioning and the number of threedimensional and onedimensional elements. Results are depicted in Fig. 10.
Pillars  Cubes  

Nodes  3D Elements  1D El.  
2.304  
4.608  
9.216  
18.432  
36.864  
73.728 
The results of Fig. 10 show that the solver for the 3D model has a slightly higher computational time for the pillarlike partitioning compared to the cubic partitioning. This is expected as the partition boundaries are larger and induce more communication. For the 1D problem solver, pillars are better as fibers are not subdivided to multiple cores and no communication is needed. The reduced benefit from a cubic partitioning is due to the fact that time spent on communication is rather dominant compared to the time needed to solve the rather small problem, e. g., only 1D elements of a fiber are locally stored in each partition. This should improve as one chooses larger subproblem sizes, i. e., increases the number of nodes per fiber.
Theoretically, the time needed to solve the 0D problem should not be affected by the domain decomposition. However, due to memory effects, the runtime for a cubic partitioning is slightly higher. Overall, this leads to a higher total computational time for cubic partitioning compared to the pillarlike partitioning. This conclusion is, however, only valid for the chosen scenario and for the relatively low number of cores. Note that extending this scaling experiments to a larger numbers of cores is currently limited due to memory duplications in the current code. This needs to be first eliminated before conducting further scaling studies.
Weak Scaling Measurements – Experiment #2.
While the somewhat artificial setting in experiment #1 yields perfect pillarlike or cubic partitions, experiment #2 addresses a more realistic setup, where we increase the number of processes more smoothly, i.e., by less than a factor of two in each step. With this, it not possible any more to choose perfect cubic or pillarlike partitions. Thus, we identify reasonable parameters by solving an optimization problem that trades based on process counts and targeted subdomain shape. Note that the combination of number of processes and number of elements leads to partitions at the boundary of the computational domain that potentially have less elements than interior partitions. Compared to the previous example, the number of 3D elements per process is only approximately constant with the pillarlike
partitions getting closer to constant size than the cubic ones. The numbers of processes and dimensions of the computational domain are listed in Table 2. Fig. 11 presents the runtime results.
Pillars  Cubes  

Nodes,  3D Elements  1D El.  
Cores  
1, 24  14.784  
2, 40  28.728  
3, 60  45.144  
4, 84  60.588  
6, 140  63.840  
8, 192  103.680 
As already discussed above, the ODE solver for the 0Dproblem (yellow line) requires the majority of the runtime. This is followed by the solution times for the 1D (red line) and 3D (green line) subproblems. The blue lines depict the duration of the interpolation and homogenization between the node positions of the 1D fibers and the 3D mesh. It can be seen that the computational times stay nearly constant for increasing problem size. As in the previous experiment, the 3D solver performs better for cubic partitioning whereas the 1D solver is faster for pillarlike partitions. In this scenario, the cubic partitioning in total slightly outperforms the pillarlike partitioning, as expected.
As before, the memory consumption appears to be a weakness. Therefore, additional tests investigating the memory consumption per process at the end of the runtime were carried out. The memory consumption for the presented scenario is plotted in Fig. 12 with respect to the overall number of 1D elements. Also the average number of ghost layer elemts per process is depicted. Ghost layer elements are copies of elements adjacent to the partition of a process, i.e., belong to the subdomain of a neighbouring process. They are used as data buffers for communication.
We observe that the average number of ghost elements per process for the 3D problem is higher for pillarlike partitions (dashed black line) than for the cubic partitions (solid black line). A sharp increase of memory consumption (magenta lines) is observed independent of the partitioning scheme. This is due to duplications of global data on each process which will be eliminated in a next optimization step. Compared to this effect, the difference between the number of ghost elements needed for the two partitioning strategies is negligible.
Dependency Between Runtime and Partition Shape – Experiment #3. In our third scaling test, the dependency of the solver of the 3D continuummechanical problem on the partitioning strategy is investigated. We analyse how different domain decomposition approaches, in particular others than the previously discussed pillarlike and cubic partitioning schemes, affect the runtime. A test case with threedimensional elements is considered. Otherwise, the setup is as described in Sec. 2.4. To reduce the contributions of the 0D/1D subproblem and focus on the performance of the 3D components, we include in each 3D element only two 1D fiber elements. The domain is decomposed into a constant number of partitions by axisaligned cutting planes in all possible ways. To distinguish between the different partitioning variants, we compute the average boundary surface area between the partitions for each variant and relate this to runtime. The results are presented in Fig. 13.
The smallest average surface area between the partitions, which corresponds to the first data point in Fig. 13, is obtained for a partitioning with partitions with elements each. The highest average surface area between the partitions, which is the last data point within Fig. 13, is obtained for partitions with elements each. All experiments are run on 12 nodes of Hazel Hen with 12 processes per node. It can be seen that only the time needed to solve the 3D continuummechanical problem increases monotonically with respect to the average surface area between the partitions, i. e., depends on the partitions’ shape. This is expected. Further, the runtime ratio of the 3D solver between the partitioning with the smallest and largest average surface area is .
4.3 Optimizing Input/Output – File Format
Optimizing HPC simulations also includes the consideration of efficient input and output (IO) file formats, e. g., writing data in parallel to reduce communication, and binary codecs to reduce the size of a data set. Currently, OpenCMISS appeals to an EXfile format^{3}^{3}3http://opencmiss.org/documentation/data_format/ex_file_format.html, Accessed: 20170930, which writes header information and data as plain text. Headers and data are stored in the same file, and one file is generated for each time step. In general, writing textbased data has very low read and write performance, and very high storage requirements compared to binary encoded data. To successfully run simulations on large systems, the data IO should ideally not cause scaling issues. Therefore, we propose in this work a new, efficient file format^{4}^{4}4https://github.com/UniStuttgartVISUS/ngpf, Accessed: 20170930.
The new data structure includes custom binary codecs, such as the highcompression algorithm ZFP [26], and stores meta data in separate header files using the humanreadable JSON object notation. Fig. 14 gives an overview of the data layout, where each colored box denotes a separate file. The main feature of the new file format is the full support of domain decomposition (DD) with parallel file IO. Every node processes its own set of data (e. g., time integration or data encoding) and claims the corresponding memory region in the data files to write the encoded data stream.
The new file format has three optional features. It can write time independent data (TID, cf. Fig. 14) that are written per attribute data, e. g., positions of Gaussian points in the 3D mesh in the reference configuration. A further feature is the ”Type Header” that writes typespecific quantities, which can contain simple attributes or a complex structural layout. With the third feature, the DD meta data (e. g., load in each node), we are able to analyze how each node has performed during the simulation. These meta data can answer questions like “which muscle fiber was simulated on which node”, or “was this node at full capacity at a specific point in time”. These data can be used to optimize the load balancing within the simulation in future work.
4.4 Visualization
Efficient (parallel) IOfile formats are also the basis for the visualization of large data sets. The current visualization framework within the OpenCMISS software initative is OpenCMISSZinc. This framework already offers a range of visualization techniques for muscle fiber data, e. g., a convex hull calculation to construct a mesh geometry from point cloud data. However, Zinc uses outdated rendering methods, resulting in insufficient performance for largescale data. Furthermore, it does not offer a suitable platform for fast visualization prototyping, distributed rendering, or advanced visualization techniques, which would be required for developing new muscle fiber visualizations. MegaMol [14] fulfills these criteria and offers additional functionality and features that are valuable for this project. One example for an additional feature is the provided infrastructure for brushing and linking that allows for interactive visual analytics. MegaMol also offers a builtin headless mode and a remote control interface that is crucial for the HPC and insitu rendering planned for the future in order to support the visual analysis of largescale muscle simulations. Insitu visualization is an alternative approach to traditional posthoc data processing. It processes and visualizes data while the simulation produces them. Consequently, writing raw data to disk can be avoided completely. Insitu visualization is very much an area of active research, and thus, there are many different approaches and no standard approach has been established (see, e. g., [7]).
MegaMol allows GPU and CPU rendering via a thin abstraction layer, where the GPU rendering uses the OpenGL API, whereas the CPU rendering is based on the ray tracing engine OSPRay [45]. In particular the CPUbased rendering enables image synthesis on any HPC cluster, regardless of the availability of dedicated GPUs. It offers advanced rendering and shading methods (e. g., path tracing and ambient occlusion) that enhance the perception of depth and run mostly interactive even on single workstations. Like the simulation software OpenCMISS, MegaMol is currently not optimized for HPC usage, but it provides a solid base and infrastructure and is already capabale of rendering the discretized muscle fibers as continuous geometry (cf. Fig. 15).
While analyzing and optimizing existing code for HPC infrastructures are best performed with test cases, in which the geometry has a minimal influence, e. g., the cuboid muscle test case as introduced in Sec. 2.4, it is advantageous to demonstrate the visualization capabilities on complex geometries. Therefore, data from previous simulations (here data of the Tibialis Anterior as obtained within [21]) are used to showcase visualization capabilities. The image was rendered with MegaMol [14] using the CPU ray tracing engine OSPRay [45].
5 Conclusion and Outlook
Using models to gain new insights into the complex physiological or anatomical mechanisms of biological tissue, or to better interpret and understand experimentally measured data, requires accurate and detailed models of the underlying mechanisms. Depending on the application or the system to be modeled, the required detail can lead very quickly to highly complex and computationally extremely demanding models. Most software packages are, however, often designed to build up computational models for a variety of complex biomechanical systems, like in the case of OpenCMISS for simulating the chemoelectromechanical behavior of skeletal muscles after neural recruitment, the mechanics of the heart, the functioning of the lung, etc. Such software packages might already run within a parallel computing environment, but are not optimized to run largescale simulations on largescale systems such as HazelHen, the Tier1 system in Stuttgart. Before being able to exploit the full capabilities of supercomputers, software packages such as OpenCMISS have to be analyzed and optimized to achieve good scaling properties – ideally perfect scaling meaning that the simulation of a twice as large problem on twice as many nodes/cores requires the same runtime as the original setup.
Within this paper, we have demonstrated that the chemoelectromechanical multiscale skeletal muscle model as introduced in Sec. 2 and implemented in OpenCMISS is capable of running significantly sized model setups in a parallel compute environment. We have simulated the deformation of a skeletal muscle in which 34,560 randomly activated fibers are discretized with 103,680 1D elements. Further, by utilizing a standard test case, we have been able to show good weak scaling properties for a small number of compute nodes. For the partitioning of the domain, two different approaches have been considered: a pillarlike partition, in which each embedded 1D skeletal muscle fiber is kept within a single partition, and a cubic partitioning, in which individual fibers can be distributed over different partitions. It has been observed that mainly the solution times of the 3D and the 1D solver depend on the domain partitioning. For the test cases, the 1D solver profits from pillarlike partitions, while the 3D solver exhibited lower runtimes for cubic partitions. The analysis has revealed that for computeintense scenarios with a high numbers of 1D elements (in relation to the computational work needed to solve the 3D continuummechanical model), cubic partitions yield a better overall performance. The scaling of the 1D solver with increasing problem size has been found to be linear and, hence, promises good behavior when the overall problem size will be further increased in the future. Furthermore, the text based EXfile format has been identified as not appropriate to run on an HPC system. Therefore, a new file format is being developed that offers highcompression algorithms and performs file IO in parallel. However, before utilizing significant portions of HazelHen, further aspects concerning the model, algorithms, implementation, and visualization need to be considered.
From a modeling point of view, we would like to use even more complicated chemoelectromechanical model that include, for example, the mechanical behavior of titin [18, 19], and include further important biophysical details such as metabolism, a biophysical recruitment model [20], and a feedback mechanism from the spindles and the golgitendon organs to the neuromuscular system. Moreover, simulating and visualizing the surface EMG is needed to further test motor unit decomposition algorithms. From an algorithmic point of view, such model enhancements also always require novel or customtailored efficient numerical schemes. From an implementational point of view, the scaling analysis also revealed a memory consumption issue. The amount of memory per process rises proportional to the total problem size. The cause for this has been identified as the suboptimal implementation of a globaltolocal node numbering mapping within OpenCMISS, which, however, can be fixed. The memory consumption currently prevents simulation of larger problem sizes.
Once the modeling, algorithmic, implementational aspects have been solved and largescale simulations have been set up, e. g., a single chemoelectromechanical skeletal muscle model with a realistic number of fibers (e. g., about 300.000) of realistic length, the results need to be visualized and analyzed. To do so, we are planning to extend MegaMol with novel, comprehensive visualizations that allow users to explore the complex behavior of muscle fiber simulation data. For example, the deformation due to the contraction of the muscle is currently barely visible for different time steps, whereas to compare the contraction of different runs or different time steps, a clear visual representation is necessary. Furthermore, visualization can help to gain new insights into the functioning of our neuromuscular system by comparing the simulated surface EMG of a muscle with experimental data obtained via noninvasive and clinically available diagnostic tools.
After we achieved all the model enhancements and have overcome the computational and implementational challenges, these simulations can be utilized to set up in silico benchmark tests for complex muscles and recruitment patterns to improve existing data mining algorithms, e.g., the motor unit decomposition algorithms, and provide, at the same time, a new tool to investigate the interplay of the underlying complex and coupled mechanisms leading from neural stimulation to force generation.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Author Contributions
All authors have equally contributed to the conception and design of the work, data analysis and interpretation, drafting of the article, and critical revision of the article. Hence, the author list appears in alphabetical order. In addition NE, TK, AK, BM and TR have conducted the simulations and summarized their results. All authors fully approve the content of this work.
Funding
This research was funded by the BadenWürttemberg Stiftung as part of the DiHu project of the High Performance Computing II program, the Deutsche Forschungsgemeinschaft (DFG) as part of the International Graduate Research Group on ”Soft Tissue Robotics – SimulationDriven Concepts and Design for Control and Automation for Robotic Devices Interacting with Soft Tissues” (GRK 2198/1) and as part of the Cluster of Excellence for Simulation Technology (EXC 310/1).
References
 [1] P. R. Amestoy, I. S. Duff, J.Y. L’Excellent, and J. Koster. A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM Journal on Matrix Analysis and Applications, 23(1):15–41, 2001.
 [2] P. R. Amestoy, A. Guermouche, J.Y. L’Excellent, and S. Pralet. Hybrid scheduling for the parallel solution of linear systems. Parallel Computing, 32(2):136–156, 2006.
 [3] S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith. Efficient management of parallelism in object oriented numerical software libraries. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, pages 163–202. Birkhäuser Press, 1997.
 [4] S. S. Blemker, P. M. Pinsky, and S. L. Delp. A 3D model of muscle reveals the causes of nonuniform strains in the biceps brachii. Journal of Biomechanics, 38(4):657–665, 2005.
 [5] M. Böl and S. Reese. Micromechanical modelling of skeletal muscles based on the finite element method. Computer methods in biomechanics and biomedical engineering, 11(5):489–504, 2008.
 [6] C. P. Bradley, A. Bowery, R. Britten, V. Budelmann, O. Camara, R. Christie, A. Cookson, A. F. Frangi, T. Gamage, T. Heidlauf, S. Krittian, D. Ladd, C. Little, K. Mithraratne, M. Nash, D. Nickerson, P. Nielsen, O. Nordbø, S. Omholt, A. Pashaei, D. Paterson, V. Rajagopal, A. Reeve, O. Röhrle, S. Safaei, R. Sebastián, M. Steghöfer, T. Wu, T. Yu, H. Zhang, and P. J. Hunter. OpenCMISS: A multiphysics & multiscale computational infrastructure for the VPH/Physiome project. Progress in Biophysics and Molecular Biology, 107:32–47, 2011.
 [7] H. Childs, K.L. Ma, H. Yu, B. Whitlock, J. Meredith, J. Favre, S. Klasky, N. Podhorszki, K. Schwan, M. Wolf, et al. In situ processing. Technical report, Ernest Orlando Lawrence Berkeley National Laboratory, Berkeley, CA (US), 2012.
 [8] G. R. Christie, P. M. Nielsen, S. A. Blackett, C. P. Bradley, and P. J. Hunter. Fieldml: concepts and implementation. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 367(1895):1869–1884, 2009.
 [9] R. R. Cisi and A. F. Kohn. Simulation system of spinal cord motor nuclei and associated nerves and muscles, in a webbased architecture. Journal of computational neuroscience, 25(3):520–542, 2008.
 [10] G. V. Dimitrov and N. A. Dimitrova. Precise and fast calculation of the motor unit potentials detected by a point and rectangular plate electrode. Medical engineering & physics, 20(5):374–381, 1998.
 [11] D. Farina and R. Merletti. A novel approach for precise simulation of the emg signal detected by surface electrodes. IEEE Transactions on Biomedical Engineering, 48(6):637–646, 2001.
 [12] A. J. Fuglevand, D. A. Winter, and A. E. Patla. Models of recruitment and rate coding organization in motorunit pools. Journal of neurophysiology, 70(6):2470–2488, 1993.
 [13] A. M. Gordon, A. F. Huxley, and F. J. Julian. The variation in isometric tension with sarcomere length in vertebrate muscle fibres. The Journal of Physiology, 184:170–192, 1966.
 [14] S. Grottel, M. Krone, C. Müller, G. Reina, and T. Ertl. MegaMol – A Prototyping Framework for ParticleBased Visualization. IEEE Transactions on Visualization and Computer Graphics, 21(2):201–214, Feb. 2015.
 [15] V. Gurev, P. Pathmanathan, J.L. Fattebert, H.F. Wen, J. Magerlein, R. A. Gray, D. F. Richards, and J. J. Rice. A highresolution computational model of the deforming human heart. Biomechanics and Modeling in Mechanobiology, 14(4):829–849, Aug 2015.
 [16] D. Hawkins and M. Bey. A comprehensive approach for studying muscletendon mechanics. Journal of Biomechanical Engineering, 116(1):51–55, 1994.
 [17] C. Heckman and M. D. Binder. Computer simulation of the steadystate inputoutput function of the cat medial gastrocnemius motoneuron pool. Journal of Neurophysiology, 65(4):952–967, 1991.
 [18] T. Heidlauf, T. Klotz, C. Rode, E. Altan, C. Bleiler, T. Siebert, and O. Röhrle. A multiscale continuum model of skeletal muscle mechanics predicting force enhancement based on actin–titin interaction. Biomechanics and Modeling in Mechanobiology, 15(6):1423–1437, 2016.
 [19] T. Heidlauf, T. Klotz, C. Rode, T. Siebert, and O. Röhrle. A continuummechanical skeletal muscle model including actintitin interaction predicts stable contractions on the descending limb of the forcelength relation. PLoS computational biology, 13(10):e1005773, 2017.
 [20] T. Heidlauf, F. Negro, D. Farina, and O. Rohrle. An integrated model of the neuromuscular system. In Neural Engineering (NER), 2013 6th International IEEE/EMBS Conference on, pages 227–230. IEEE, 2013.
 [21] T. Heidlauf and O. Röhrle. Modeling the Chemoelectromechanical Behavior of Skeletal Muscle Using the Parallel OpenSource Software Library OpenCMISS. Computational and Mathematical Methods in Medicine, 2013:1–14, 2013.
 [22] T. Heidlauf and O. Röhrle. A multiscale chemoelectromechanical skeletal muscle model to analyze muscle contraction and force generation for different muscle fiber arrangements. Frontiers in Physiology, 5(498):1–14, 2014.
 [23] B. HernándezGascón, J. Grasa, B. Calvo, and J. F. Rodríguez. A 3D electromechanical continuum model for simulating skeletal muscle contraction. Journal of Theoretical Biology, 335:108–118, 2013.
 [24] A. L. Hodgkin and A. F. Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of physiology, 117(4):500–544, 1952.
 [25] T. Johansson, P. Meier, and R. Blickhan. A finiteelement model for the mechanical analysis of skeletal muscles. Journal of Theoretical Biology, 206(1):131–49, 2000.
 [26] P. Lindstrom. FixedRate Compressed FloatingPoint Arrays. IEEE Transactions on Visualization and Computer Graphics, 20, Aug. 2014.
 [27] C. M. Lloyd, M. D. Halstead, and P. F. Nielsen. Cellml: its future, present and past. Progress in biophysics and molecular biology, 85(2):433–450, 2004.
 [28] M. M. Lowery, N. S. Stoykov, A. Taflove, and T. A. Kuiken. A multiplelayer finiteelement model of the surface emg signal. IEEE Transactions on Biomedical Engineering, 49(5):446–454, 2002.
 [29] R. MacIntosh, B., F. Gardiner, P., and J. McComas, A. Skeletal Muscle: Form and Function. Human Kinetics, second edition, 2006.
 [30] L. Mesin and D. Farina. An analytical model for surface emg generation in volume conductors with smooth conductivity variations. IEEE transactions on biomedical engineering, 53(5):773–779, 2006.
 [31] G. L. Miller, S.H. Teng, W. Thurston, and S. A. Vavasis. Automatic mesh partitioning. In Graph Theory and Sparse Matrix Computation, pages 57–84. Springer, 1993.
 [32] M. Mordhorst, T. Heidlauf, and O. Röhrle. Predicting electromyographic signals under realistic conditions using a multiscale chemoelectromechanical finite element model. Interface Focus, 5(2):1–11, February 2015.
 [33] M. Mordhorst, T. Strecker, D. Wirtz, T. Heidlauf, and O. Röhrle. PODDEIM reduction of computational EMG models. Journal of Computational Science, 19:86–96, 2017.
 [34] F. Negro and D. Farina. Decorrelation of cortical inputs and motoneuron output. Journal of neurophysiology, 106(5):2688–2697, 2011.
 [35] A. J. Pullan, L. K. Cheng, and M. L. Buist. Mathematically modelling the electrical activity of the heart: from cell to body surface and back again. World Scientific, 2005.
 [36] D. E. Rassier, B. R. MacIntosh, and W. Herzog. Length dependence of active force production in skeletal muscle. Journal of Applied Physiology, 86(5):1445 – 1457, 1999.
 [37] M. V. Razumova, A. E. Bukatina, and K. B. Campbell. Stiffnessdistortion sarcomere model for muscle simulation. Journal of Applied Physiology, 87(5):1861–1876, November 1999.
 [38] E. Ríos, M. Karhanek, J. Ma, and A. González. An Allosteric Model of the Molecular Interactions of ExcitationContraction Coupling in Skeletal Muscle. The Journal of General Physiology, 102(3):449–481, 1993.
 [39] O. Röhrle, J. B. Davidson, and A. J. Pullan. Bridging scales: a threedimensional electromechanical finite element model of skeletal muscle. SIAM Journal on Scientific Computing, 30(6):2882–2904, 2008.
 [40] O. Röhrle, J. B. Davidson, and A. J. Pullan. A physiologically based, multiscale model of skeletal muscle structure and function. Frontiers in Physiology, 3, 2012.
 [41] O. Röhrle and A. J. Pullan. Threedimensional finite element modelling of muscle forces during mastication. Journal of Biomechanics, 40(15):3363–3372, 2007.
 [42] Y. Saad and M. H. Schultz. Gmres: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 7(3):856–869, 1986.
 [43] S. Schamberger and J.M. Wierum. Partitioning finite element meshes using spacefilling curves. Future Generation Computer Systems, 21(5):759 – 766, 2005. Parallel computing technologies.
 [44] P. R. Shorten, P. O’Callaghan, J. B. Davidson, and T. K. Soboleva. A mathematical model of fatigue in skeletal muscle force contraction. Journal of Muscle Research and Cell Motility, 28(6):293–313, 2007.
 [45] I. Wald, G. Johnson, J. Amstutz, C. Brownlee, A. Knoll, J. Jeffers, J. G?nther, and P. Navratil. OSPRay  A CPU Ray Tracing Framework for Scientific Visualization. IEEE Transactions on Visualization and Computer Graphics, 23(1):931–940, Jan. 2017.
 [46] F. E. Zajac. Muscle and tendon properties models scaling and application to biomechanics and motor. Critical reviews in biomedical engineering, 17(4):359–411, 1989.
 [47] M. Zhou, O. Sahni, K. D. Devine, M. S. Shephard, and K. E. Jansen. Controlling unstructured mesh partitions for massively parallel simulations. SIAM Journal on Scientific Computing, 32(6):3201–3227, 2010.