Even “simple” tasks like grabbing an object involve highly coordinated actions of our musculoskeletal system. At the core of such coordinated movements are voluntary contractions of skeletal muscles. Understanding the underlying mechanism of recruitment and muscle force generation is a challenging task and subject to much research. One of the few non-invasive and clinically available diagnostic tools to obtain insights in the functioning (or disfunctioning) of the neuromuscular system are electromyographic (EMG) recordings, e.g., measuring the activation-induced, resulting potentials on the skin surface. Conclusions on the neuromuscular system then reduce to a signal processing task, which often is based on the assumption that the muscle itself is a black box and hence ignores spatial arrangements within the respective muscle. But (surface) EMG measurements also have their limitations, e.g., they typically only capture activity from muscle parts close to the surface, which leads to difficulties in identifying cross-talk. In addition, it often only records weak signals due to layers of adipose tissue, or is, in some cases, restricted to isometric contractions. Hence, to obtain more holistic and non-invasive insights into the neuromuscular system, computational models can be employed. Such models need to capture much of the electro-mechanical properties of skeletal muscle tissue and of the interaction between neural recruitment and muscular contraction.
The contractile behavior of skeletal muscle tissue has been extensively modeled using lumped-parameter models such as Hill-type skeletal muscle models , continuum-mechanical skeletal muscle models [25, 4, 41, 5], or multi-scale, chemo-electromechanical skeletal muscle models [39, 40, 23, 21]. For prediction of the resulting EMG of a particular stimulation, there exist analytical models [10, 11, 30] with short compute times, or numerical approaches [28, 30, 32, 33].
However, irrespective of whether one aims to simulate EMG signals or the mechanical behavior of a skeletal muscle, numerical methods are almost unavoidable, if realistic muscle geometries are considered. The chemo-electromechanical models as proposed by , [21, 22], or 
are particularly well-suited to incorporate many structural and functional features of skeletal muscles. They embed one-dimensional “muscle fibers” within a three-dimensional skeletal muscle model and associate them with a particular motor unit. Moreover, those models can be directly linked to motor neuron models either phenomenologically[17, 12] or biophysically [9, 34]
to further investigate the relationship between neural and mechanical behavior. The desired degree of detail and complexity within these models requires the coupling of different physical phenomena on different temporal and spatial scales, e.g., models describing the mechanical or electro-physiological state of the muscle tissue on the organ scale and the bio-chemical processes on the cellular scale (cf. Section 2).
Being able to take into account all these different processes on different scales requires a flexible multi-scale, multi-physics computational framework and significant compute power. The availability of computational resources restricts the number of individual muscle fibers that can be considered within a skeletal muscle. The chemo-electromechanical models as implemented within the international open-source libraries OpenCMISS [6, 21, 32]
allow general muscle geometries with about 1000 embedded “muscle fibers”. Most skeletal muscles, however, have significantly more fibers. While simulations with 1000 fibers and less can potentially provide some insights into the neuromuscular system, some effects such as the motor unit recruitment over the full range of motor units and muscle fibers and their implication on the resulting EMG can not be estimated unless the full model with a realistic number of muscle fibers is simulated. This full model allows us to estimate the accuracy of “reduced” models by comparing them to the output of the detailed full “benchmark” model. Unless such comparisons are carried out, only predictions can be made how additional details such as more fibers or functional units (motor units) affect the overall outcome – both in terms of muscle force generation and in terms of computed EMG signals.
While highly optimized and highly parallel software for biomechanical applications exists, e.g., for chemo-electromechanical heart models , most multi-purpose computational frameworks for biomedical applications such as OpenCMISS are developed to provide flexibility. This is, e.g., achieved through standards like CellML  and FieldML . The respective frameworks are utilised to enhance existing multi-physics models for a wide range of (bioengineering) applications. They are designed to be run by biomedical researchers on small-sized compute clusters. While they typically can be compiled on large-scale HPC compute clusters such as HazelHen at the HLRS in Stuttgart, they typically are not capable of exploiting the full potential of the machine. Moreover, simulation run time is typically considered less important than model complexity and output. Hence, typical simulations of biomedical applications are not necessarily optimized for numerical efficiency, minimal communication, the exploitation of novel algorithms or file output. Within this paper, we demonstrate how one can exploit analysis tools, suitable numerical techniques, and coupling strategies to obtain an efficient chemo-electromechanical skeletal muscle model that is suitable to be run on a large-scale HPC infrastructure and, thus, is capable of running with a sufficient resolution and number of muscle fibers to provide the required details. Once large-scale simulations of biomedical applications are solved with a high degree of detail, most specialized visualization tools such as OpenCMISS-Zinc, can no longer handle the amount of output data. Dedicated visualization tools for large-scale visualizations need to be developed. In this work, the MegaMol framework  has been adapted to visualize the different biophysical parameters and the resulting EMG.
2 The multi-scale skeletal muscle model
Before outlining the model in its full detail, we first provide a brief overview on some anatomical and physiological characteristics of skeletal muscles that are relevant to our model. From an anatomical point of view, skeletal muscles are a hierarchical system. Starting from its basic unit, the so-called sarcomere, several in-series and in-parallel arranged sarcomeres constitute a cylindrically shaped myofibril. Several in-parallel arranged myofibrils make up a skeletal muscle fiber and multiple muscle fibers form a fascicle. All fascicles constitute the entire muscle. These structures are connected with each other through an extracellular matrix (ECM). From a physiological point of view, several fibers are controlled by the same lower motor neuron through nervous axons. The entire unit consisting of the lower motor neuron, the axons and the respective fibers that are innervated by the axons, is referred to as motor unit, which is the smallest unit within a skeletal muscle that can voluntarily contract. The lower motor neuron sends rate-coded impulses called action potentials to all fibers belonging to the same motor unit. Moreover, motor units are activated in an orderly fashion, starting from the smallest ones, up to the biggest (recruitment size principle). After a motor neuron stimulates a muscle fiber at the neuromuscular junction, an action potential (AP) is triggered and propagates along the muscle fiber resulting in a local activity. For more comprehensive insights into muscle physiology and anatomy, we refer to the book by .
As the focus of this research is on solving biophysically detailed models of a neuromuscular system on HPC architectures, this section provides an overview on the multi-scale modeling framework of our chemo-electromechanical skeletal muscle model that is based on the work by , [21, 22], and 
. These models can account for the main mechanical and electro-physiological properties of skeletal muscle tissue, including a realistic activation process and resulting force generation. This is realized by linking multiple sub-models, describing different physical phenomena. To reduce the computational costs, the different sub-models that describe phenomena on different length and time scales are simulated using different discretizations, i. e., spatial resolution and time-step size. Data are exchanged between the sub-models using homogenization and interpolation techniques.
2.1 The 3D continuum-mechanical muscle model
The physiological working range of skeletal muscles includes large deformations. Therefore, we use a continuum mechanical modeling approach that is based on the theory of finite elasticity to simulate the macroscopic deformations and stresses in the muscle tissue. In continuum mechanics, the placement function describes the motion of a material point, i. e., it assigns every material point with position in the reference (non-deformed) domain at a time to a position in the actual (deformed) domain at time
. The deformation of the body at a material point can be described by the deformation gradient tensor
which is defined as the partial derivative of the placement function
with respect to the reference configuration. The local displacement is defined by the vector.
The governing equation of the continuum mechanical model is the balance of linear momentum. Under the assumption of slow motions (i. e. inertia forces vanish) and neglecting body forces, the balance of linear momentum in its local form can be written as
where denotes the divergence operator and is the first Piola-Kirchhoff stress-tensor. To solve the balance of linear momentum, one needs to define a constitutive equation that specifies . It describes the overall mechanical behavior of the muscle that can be divided into a passive and an active component. The latter represents the muscle’s ability to contract and produce forces. In this work, we assume a superposition of the active and passive behavior, i. e., an additive split of .
Passive skeletal muscle tissue is assumed to be hyperelastic and transversely isotropic. Consequently, the passive part to the first Piola-Kirchhoff stress tensor ) depends on the deformation gradient tensor and a structure tensor , which is defined by the muscle fiber direction . The isotropic part of the passive stress-tensor assumes a Mooney-Rivlin material. It is enhanced by an additive anisotropic contribution accounting for the specific material properties in the muscle fiber direction .
The active force is generated on a microscopic scale, i. e. , within a half-sarcomere (the smallest functional unit of a muscle) consisting of thin actin and thick myosin filaments. Based on geometrical considerations of the half-sarcomere structure, it is known that the active muscle force depends on the actual half-sarcomere length (force-length relation) . When a half-sarcomere is activated by calcium as a second messenger, actin and myosin filaments can form cross-bridges and produce forces (cross-bridge cycling).
The active force state of the microscopic half-sarcomere is summarized in an activation parameter that enters the macroscopic constitutive equation. Furthermore, we assume that the active stress contribution acts only along the fiber direction . When considering only isometric or slow contractions, the active stress tensor can be defined as a function of the lumped activation parameter , the deformation gradient tensor , and the structure tensor . An additional force-length relationship needs to be included within .
Finally, we assume skeletal muscle tissue to be incompressible, which implies the constraint . The resulting first Piola-Kirchhoff stress tensor reads
where is the hydrostatic pressure, which enters the equation as a Lagrange multiplier enforcing the incompressibility constraint. The material parameters of the continuum-mechanical skeletal muscles are fitted to experimental data  and can be found in .
2.2 The 1D Model for Action Potential Propagation
The electrical activity of skeletal muscles resulting from the local activity of all muscle fibers can be analyzed by measuring the extracellular potential. The bidomain-model is a widely used continuum mechanical framework to simulate the electrical activity of living tissues . It is based on the homogeneity assumption that intracellular and extracellular space homogeneously occupy the same domain. Intracellular and extracellular space are electrically coupled by an electrical current flowing over the cell membrane,
where and denote the current density in the intracellular and extracellular space, respectively, and is the fiber’s surface to volume ratio. The muscle fiber membrane is nearly impermeable for ions and serves as a capacitor. However, ions can be transported over the membrane by ion channels and active ion pumps. These properties can be mathematically described by a biophysically motivated modeling approach proposed by , leading to the constitutive equation
where is the transmembrane potential, is the capacity of the muscle fiber membrane (sarcolemma) and is the ionic current flowing over the ion-channels and -pumps, which depends on . Further state variables are summarized in , e. g., the states of different ion channels. is an externally applied stimulation current, e. g., due to a stimulus from the nervous system. Assuming that the intracellular space and extracellular space show equivalent anisotropy, which is the case for 1D problems, the bidomain equations can be reduced to the monodomain equation. We use the one-dimensional monodomain equation in the domain :
Here, denotes the spatial coordinate along a one-dimensional line, i.e., the fiber, and is the effective conductivity.
2.3 The 0D Sub-Cellular Muscle Model
The model proposed by  provides a basis to compute the lumped activation parameter which is the link to the 3D continuum-mechanical muscle model. Its evolution model is steered by the ionic current of the 1D model.
It contains a detailed biophysical description of the sub-cellular excitation-contraction coupling pathway. Specifically, it models the depolarization of the membrane potential in response to stimulation, release of calcium from the sarcoplasmic reticulum (SR) serving as a second messenger, and cross-bridge (XB) cycling. To do so, the Shorten model couples three sub-cellular models: A Hodgkin-Huxley-type model is utilized to simulate the electrical potentials and ion currents through the muscle-fiber membrane and the membrane of the T-tubule system. To simulate calcium dynamics, a model of the SR membrane ryanodine receptor (RyR) channels () is coupled to the electrical potential across the T-tubule membrane and triggers the release of calcium from the SR. Additionally, the calcium-dynamics model describes diffusion of calcium in the muscle cell, active calcium transport over the SR membrane via the SERCA pump (sarco/endoplasmic reticulum -ATPase), binding of calcium to buffer molecules (e. g. parvalbumin or ATP), and binding of calcium to troponin enabling the formation of cross-bridges. The active force generation is simulated by solving a simplified Huxley-type model , which is the basis for calculating the activation parameter .
The entire incorporated sub-cellular processes are modeled with a set of coupled ordinary differential equations (ODEs)
where summarizes the right-hand-side of all the ODEs associated with the state variables which are, in the case of the Shorten et al. model, more than 50. The final activation parameter is computed from entries of the state variable vector and the length and contraction velocity of the half-sarcomere, and . Assuming isometric or very slow contractions, the contraction velocity can be neglected. Hence, following  and , the activation parameter is calculated as
Here, the function is the force-length relation for a cat skeletal muscle by , is the concentration of post power-stroke XBs, is the concentration of post power-stroke cross-bridges for a tetanic contraction (100 Hz stimulation after 500 ms stimulation) and is an offset parameter denoting the concentration of post power-stroke cross-bridges in the resting state.
2.4 Test Scenario, Boundary and Initial Conditions
The overall aim of this work is to analyse the implemented algorithms on (large-scale) clusters and to identify bottlenecks that prevent us from achieving scalable and efficient implementations. For this purpose, we define a test scenario that will be used throughout this work to test the model dynamics and the implemented algorithms.
To summarize the previous paragraphs, the chemo-electromechanical behavior of a skeletal muscle is described by the following coupled equations:
&0 & = & div P(F,M,γ(y,
ł_hs) ) & in Ω_t for all t ,
&∂Vm∂t & = & 1AmCm ( ∂∂x( σ_eff ∂Vm∂x ) - A_m I_ion(y, V_m, I_stim ) ) & on all fibers Γ_t ,
&∂y∂t & = & G_y (y, V_m , I_stim ) & at all half-sacromere positions.
For the test scenario, we use a generic cubic muscle geometry (). The muscle fibers are aligned parallel to one cube-edge (i. e. the -direction). We consider a -long isometric single-twitch experiment by stimulating all fibers at their mid-points for with . To constrain the generic muscle sample, Dirichlet boundary conditions (zero displacement) are used resulting in the fixation of the following faces of the muscle cube: the left and the right faces (faces normal to the -direction), the front face (face normal to the -direction) and the bottom face (face normal to the -direction). Further, no current flows over the boundary of the computational muscle fibers, i. e., zero Neumann boundary condition are assumed at both muscle fiber ends.
For the material parameters for the continuum-mechanical model, the effective conductivity , the surface-to-volume ratio , and the membrane capacity , we use exactly the same values as reported in . The initial values for the state vector at of the sub-cellular model are taken from the slow-twitch scenario of .
2.5 Status Quo and Current Implementation
Spatial Discretization. The sub-models of the multi-scale skeletal muscle model show significantly different characteristic time and length scales. In order to solve the model efficiently, different discretization techniques and resolutions are required for the sub-models. The current implementation of the OpenCMISS solves the continuum-mechanical skeletal muscle model using a finite element method with triquadratic/trilinear Lagrange basis functions, i. e., Taylor-Hood elements. The one-dimensional muscle fibers are represented by much finer one-dimensional finite element meshes with linear Lagrange basis functions. The respective one-dimensional elements are embedded into the three-dimensional finite element mesh.
To simulate the coupled multi-physics problem, data have to be transfered between the different spatial discretization approaches. The microscopic sarcomere forces provided by the monodomain model are homogenized and projected to the macroscopic Gauss points of the three-dimensional continuum-mechanical model (). Vice versa, the node positions of the one-dimensional computational muscle fibers are updated from the actual displacements of the three-dimensional continuum-mechanical model by interpolating the node positions via the basis-functions of the three-dimensional model. Based on this step, the microscopic half-sarcomere lengths are evaluated. Fig. 1 (left) schematically shows the embedding of discretised 1D fibers within the 3D muscle domain discretised with finite elements.
Time discretization and coupling in time. To compute an approximate solution for Eqns eq_summary, the different characteristic time scales of the 3D, 1D and 0D problems can be exploited. The action potential propagates faster than the muscle deformation, the sub-cellular chemo-mechanical processes evolve considerably faster than the diffusive action potential propagation. To achieve common global time steps, which is desirable from a computational point of view, we choose , with and define . State values associated with time are denoted with the superscript . Employing different time steps requires a time splitting scheme. The status-quo implementation uses a first-order accurate Godunov splitting scheme, for which one time-step of the three-dimensional equation including all sub-steps for the one-dimensional monodomain equation reads:
For perform explicit Euler steps for Eqn. e:0D_ystate2 and the 0D portion of Eqn. e:1D_fiber.
Set and .
Perform one implicit Euler step for the 1D portion of Eqn. e:1D_fiber to compute .
Set and .
Calculate and compute the homogenized values at the Gauss points of the 3D finite element mesh.
Solve Eqn. e:3D_momentum_balance.
Interpolate the actual configuration to the fibers’ nodes for computing the local half-sacromere length .
Fig. 1 (right) schematically depicts the algorithm. We assume quasi-static conditions for the continuum-mechanical problem. This is equivalent to assuming constant muscle deformations throughout .
Solvers. The coupled time stepping algorithm described above contains two large systems of equations that need to be solved. The first one results from the 3D elasticity problem e:3D_momentum_balance and the second one stems from an implicit time integration of the linear 1D diffusion problem of the fiber e:1D_fiber. In the existing implementation, the GMRES solver of  from the PETSc111http://www.mcs.anl.gov/petsc/index.html library is used.
Domain Partitioning. The 3D computational domain is decomposed into disjoint cuboid shaped partitions by cutting the whole domain with axis-aligned planar cuts at element boundaries. The partitioning spans all model components, such that the quantities in the 3D, 1D and 0D model corresponding to the same spatial location are on the same process. This is to avoid unnecessary inter-process volume-communication between the sub-models. The implementation of this partitioning concept is motivated by a skeletal muscle anatomy and physiology: all skeletal muscle fibers are, from an electrical point of view, independent from each other. Thus, the obvious way to partition the workload is to distribute whole fibers among the processes. For instance,  hard-coded a partitioning with 4 processes.
3 Towards HPC: Scalability Evaluation
Before being able to simulate such complex models as the one introduced in Section 2 on HPC systems, it is essential to first analyse bottlenecks of existing implementations. In this paper, we focus on investigating numerical and algorithmic characteristics of the implementation as described in Sec. 2.5. The results provide the basis for new algorithms and improved implementations to enhance scalability. As the parallel status quo code used 4 cores, we only analyse numerical complexity, i. e., scalability in terms of the size of the problem, and redesign the parallelisation.
As discretization, we choose 8 Taylor-Hood finite elements and include 36 muscle fibers (), which go through all of the Gauss points that exhibit the same - and -coordinates. The time steps are , and , i. e., and . The system resulting from the discretization of the 1D fibers is solved using a restarted GMRES solver with restart after 30 iterations and relative residuum tolerance . To solve the 3D problem, Newton’s method from the PETSc library is used with relative and absolute tolerance and a backtracking line search approach with maximum number of 40 iterations.
To assess problem size scalability, we vary the number of nodes along each muscle fiber and measure runtimes for a single step of the 3D model. Results are depicted in Fig. 2.
Overall, the results for both numerical investigations reveal the following insights:
The low-order explicit ODE solver (Euler) requires very small time steps for the desired accuracy. Hence, the majority of the runtime is spent for solving the 0D problem.
The portion of runtime for solving the 3D problem is negligible and constant. This is due to the low number of 3D finite elements for the mechanics problem (eight elements). Realistic models would require a finer resolution of the 3D problem.
The runtime for the other computational components increases approximately linearly with the number of fiber elements. This indicates a good scaling behavior with respect to problem size.
The computation of the macroscopic variable from fiber nodes and the homogenized activation parameter has almost no impact on the computational time. Computing is more time consuming as it involves simultaneously traversing the fiber and the 3D meshes, whereas computing requires a single averaging operation for each Gauss point of the 3D elements.
The GMRES solver is a robust choice for general sparse linear systems of equations, but does not exploit the symmetry, positive definiteness and tri-diagonal structure of a 1D diffusion system and hence exhibits substantial overhead. The LU solver exhibits the lowest computational time for large problems. It shows linear complexity in the number of 1D nodes222Note that the 1D diffusion is an exception. In all higher dimensions, direct solvers are known to fail for large problems due to their high complexity..
These findings are used in the following sections to improve the numerical efficiency from an algorithmic and implementational point of view.
4 Towards HPC: Improvements For efficiency
Although linear scalability with respect to the number of nodes per fiber is observed for all 0D and 1D components (cf. Sec. 3), the runtime is not optimal. This is mainly due to the first-order accurate time stepping scheme. It enforces very small time steps (for and ) in order to achieve the required accuracy. We propose a combination of second-order splitting and second-order time stepping schemes within the 0D and 1D sub-problems to tackle this issue. A further obstacle on the way to large-scale simulations with a reasonable number of fibers is the limited parallelism (currently using cores). We design a parallelization scheme for arbitrary core numbers along with a new partitioning scheme minimizing comunication cost. Finally, we replace the customized OpenCMISS file output and visualization by a new library that is able to handle very large-scale data.
4.1 Algorithmic Optimization– Second-Order Time Stepping
To reduce computational cost and increase accuracy, we propose to replace the Godunov splitting with a second-order Strang splitting. We replace the explicit Euler for Eqn. e:0D_ystate2 and the 0D portion of Eqn. e:1D_fiber with the method of Heun and employ an implicit Crank-Nicolson method for the diffusion part of Eqn. e:1D_fiber. Second-order time-stepping schemes allow to reduce the discretization error much faster with decreasing time step size . Thus, the required accuracy might be achieved by employing larger time steps. In contrast to the simpler Godunov splitting, the Strang splitting uses three sub-steps per time step: a first step with length for the 0D part, a second step with length for the diffusion, and a third step with length again for the 0D part. The modified algorithm at a time reads:
For perform explicit Heun steps for Eqn. e:0D_ystate2 and the 0D portion of Eqn. e:1D_fiber.
Perform one implicit Crank-Nicolson step for the 1D portion of Eqn. e:1D_fiber.
For perform explicit Heun steps for Eqn. e:0D_ystate2 and the 0D portion of Eqn. e:1D_fiber.
Set and .
Set and .
Calculate and compute the homogenised values at the Gauss points of the 3D finite element mesh.
Solve Eqn. e:3D_momentum_balance.
Interpolate the displacements to the fibers’ nodes for computing the local half-sacromere length .
The explicit Heun step in a. and d. reads:
In b., we solve the system resulting from the Crank-Nicolson time discretization of the diffusion part in Eqn. e:1D_fiber.:
In the following, we present numerical experiments demonstrating the increase in efficiency achieved as a result of this new time discretization.
4.1.1 Time Discretization for the Sub-Cellular Model (0D ODEs)
In a first step, we verify the convergence order of Heun’s method numerically. Therefore, we restrict ourselves to step 1.a of the Godunov algorithm, but use Heun’s method for Eqn. e:0D_ystate2 and the 0D portion of Eqn. e:1D_fiber. We use the same test setup as presented in Sects. 2.4 and 2.5. To compare the accuracy of Heun’s method with the explicit Euler method, we compare the values of and at a stimulated material point on a muscle fiber while varying the time step size . To compare the methods in terms of efficiency, we measure the related compute times. We restrict ourselves to the time interval , with . As a reference solution, we use the solution calculated with Heun’s method with a very high resolution (). The tests are performed sequentially on an Intel® CoreTM i5-4590 CPU, using the OpenCMISS implementation. Fig. 5 depicts the relative error depending on the number of time steps executed in the interval while Fig. 5 compares the necessary CPU-time to reach a certain accuracy for the different solvers.
Fig. 5 shows the expected first-order convergence for the explicit Euler method and second-order convergence for Heun’s method. From an application point of view, however, efficient computation (“Which accuracy can be achieved in which runtime?”) is more important than the order of convergence. Since each time step of the Heun’s method is twice as expensive as an explicit Euler step, we need to also take into account the runtime in our assessment to show the potential to decrease the runtime for Heun’s method given a required accuracy. All times are normalized by the CPU-time of the Euler method with . The results presented in Fig. 5 show that two Heun steps with replace forward Euler steps yielding a theoretical speedup of for the 0D-solver. At the same time, the error decreases by a factor of approximately .
4.1.2 Time Discretization for Muscle Fibers (0D coupled to 1D)
In the second step, we verify the convergence order of the Strang splitting scheme. We consider again the same test setup as above except that we use the larger time interval . We varying the number of 1D time steps. Based on the previous results for the isolated 0D problem, we choose for the Strang-splitting scheme and for the Godunov-splitting scheme. This ensures a comparable relative error for the 0D sub-problem while saving computational time. The reference solution is computed using the Strang-splitting scheme with , yielding .
Fig. 7 shows the relative errors of at a stimulated sub-cell for the Godunov- and Strang-splitting schemes. Comparable relative errors as for the Godunov scheme with are achieved for the Strang splitting scheme with . The resulting speedups are depicted in Fig. 7. We normalize the compute times with the compute time of the Godunov scheme with . For a relative error limit in , one can achieve a speedup of about with the improved time stepping scheme if the accuracy requirement is weakened slightly. If the error constraint cannot be weakened, we still obtain a speedup of . Note that, for lower error limits, the speedup achieved with a second-order scheme is even higher due to the higher convergence order.
4.2 Parallelization – Domain Partitioning
The parallelization of the skeletal muscle simulation is based on a domain partitioning approach, i.e., the whole 3D cuboid muscle domain is decomposed into disjoint subdomains (partitions) that are assigned to parallel processes.
Within this work, we aim to use a parallelization scheme that is able to make use of an arbitrary number of processes. Previous work  based their parallelization strategy on physiological characteristics, which leads to pillar-like partitions ensuring that one fiber is never distributed to multiple processors. Considering not only the skeletal muscle fibers but the entire model, this induces a significant communication overhead due to the long and stretched partitions. We therefore investigate a further domain partitioning approach with cube-shaped partitions exhibiting sub-domains with minimal surface. The two above-mentioned partitioning strategies, i. e., the pillar-like and cubic ones, are visualized in Fig. 8.
The domain partitioning induces communication due to dependencies of local data on data of neighboring partitions. These dependencies comprise the following:
For solving for the propagation of , i. e., using an implicit Euler or Crank-Nicolson method (Eqn. eqn:CN) to solve the monodomain equation (Eqn. eqn:monodomain), communication between neighboring processes along a single fiber is required. The cost for this communication per process grows linearly with the number of fibers contained in the respective 3D partition. For the existing pillar-like implementation of partitioning, this cost vanishes since fibers are not partitioned into several pieces.
Computing the mucle displacements of the 3D model (Eqn. e:3D_momentum_balance) involves all processes. This is a result of using a finite element discretization, which inherently requires communication between processes which share common partition boundaries. Thus, these costs are proportional to the surface of the 3D partitions.
Interpolating the muscle displacements of the 3D muscle mesh to 1D fiber mesh node positions and calculating , requires ghost layers at the partition boundaries containing one layer of 3D elements. Note that for the reverse transfer, the accumulation of the activation parameter from the 0D model at the Gauss points of the elements in the 3D mesh, i. e., computing , does not involve communication since the process is completely local as all 0D points are contained within the respective 3D element and reside on the same process.
4.2.1 Generalized and Optimized Domain Partitioning
Based on the communication dependencies 1 and 2, we enhance the original pillar-like domain partitioning in two ways: (i) we allow for an arbitrary number of processes instead of a fixed number of four processes and (ii) we introduce a new partitioning concept with nearly cubic partitions. In contrast to partitioning strategies based on space-filling curves such as , or graph-partitioning such as [31, 47], a cuboid partition has the advantage that the interaction of one cuboid partition with others is guaranteed to be planar and bounded by the maximum number of neighboring partitions, i. e., . This allows communication with reduced complexity and cost. On the other hand, load balancing may not be optimal because no partition might exist that divides the 3D elements into equally sized sub-domains with a number matching the total number of processes.
Within this work, we assume that it is possible to create nearly optimal cubic partitions. However, we can not completely avoid obtaining sub-domains at the boundary of the computational domain that have less elements than others. Given a fixed number of available cores, we maximize the number of employed processes by adapting the number of sub-divisions in each axis direction corresponding to a factorization of the total number of processes. By introducing the additional constraint that each generated partition has to be larger than a specified “atomic” cuboid of elements, we can easily ensure that each process contains only entire fibers (pillar-like partition), a fixed number of fiber subdivisions (cube-like partition), or anything in between.
In Sec. 4.2.3, various runtime and scalability studies are presented. In particular, we present scaling results that demonstrate weak scaling for the full muscle model showing the influence of pillar-like or cube-like partitioning and the memory consumption per process of the current implementation in two different experiments; and strong scaling for the full muscle model showing the dependency of the achieved efficiency on the shape of the partitions.
4.2.2 Optimized Interpolation and Homogenization
Based on the communication dependency 3 listed above, the interpolation and homogenization routines need to be optimized. Interpolation and homogenization involves information transfer between values at Gauss points of the 3D elements to nodes of the 1D fibers. To allow general domain decomposition, a mapping between the respective 3D and 1D finite elements is necessary. In , the homogenization was achieved by a naive search over all locally stored fibers. This search was performed for each 3D element. We replace this approach, which exhibits quadratic complexity (in terms of the number of involved elements), by a linear algorithm that uses a correct indexing approach based on the element’s topology. Fig. 9 shows that the new implementation achieves substantial runtime reductions.
4.2.3 Scaling Experiments
All scaling experiments are conducted on HazelHen, the Tier-1 supercomputer at the High Performance Computing Center Stuttgart (HLRS). A dual-socket node of this Cray XC40 contains two Intel Haswell E5-2680v3 processors with base frequency of 2.5 GHz, maximum turbo frequency of 3.3 GHz, 12 Cores each and 2 Hyperthreads per Core, leading to a total number of 48 possible threads per core.
Weak Scaling Measurements – Experiment #1. For weak scaling, the problem size is increased proportional to the number of processes. Thus, invariants are the number of elements per process and the overall shape of the computational domain. Here, we show weak scaling for both partitioning strategies: partitioning only in - and -direction, i.e., pillar-like partitioning, and cubic partitioning. We start with processes on a single node of HazelHen with an initial partition consisting of subdivisions for both the pillar-like and the cubic partitioning. Each partition contains 3D elements per MPI rank. Further, we ensure that each 3D element contains fibers in -direction with three 1D elements per fiber, i.e., 1D elements per 3D element. Hence, the initial problem is made up of elements and fibers.
In the two conducted series of measurements for the two partitioning strategies, further subdivisions are defined such that the pillar-like or cubic partitioning structure is maintained. The refinements are obtained by first refining by a factor of in -direction, in -direction and then in -direction before repeating the process. For the cubic partitioning, we fix the number of 3D elements that each MPI rank contains to be . For the pillar-like partitioning, the constraint is that each sub-domain spans over all three-dimensional elements in the -direction, whose number varies with increasing problem size. Therefore, the number of elements per MPI rank in - and -direction is halved for each refinement in an alternating way. This way, we double the number of partitions while maintaining the constant number of eight 3D elements per MPI rank. By allocating processes on the cores of each node (no hyperthreading), we scale from 1 to 32 nodes, i. e., from 24 to 768 cores. Table 1 provides the details on the partitioning and the number of three-dimensional and one-dimensional elements. Results are depicted in Fig. 10.
|Nodes||3D Elements||1D El.|
The results of Fig. 10 show that the solver for the 3D model has a slightly higher computational time for the pillar-like partitioning compared to the cubic partitioning. This is expected as the partition boundaries are larger and induce more communication. For the 1D problem solver, pillars are better as fibers are not subdivided to multiple cores and no communication is needed. The reduced benefit from a cubic partitioning is due to the fact that time spent on communication is rather dominant compared to the time needed to solve the rather small problem, e. g., only 1D elements of a fiber are locally stored in each partition. This should improve as one chooses larger sub-problem sizes, i. e., increases the number of nodes per fiber.
Theoretically, the time needed to solve the 0D problem should not be affected by the domain decomposition. However, due to memory effects, the runtime for a cubic partitioning is slightly higher. Overall, this leads to a higher total computational time for cubic partitioning compared to the pillar-like partitioning. This conclusion is, however, only valid for the chosen scenario and for the relatively low number of cores. Note that extending this scaling experiments to a larger numbers of cores is currently limited due to memory duplications in the current code. This needs to be first eliminated before conducting further scaling studies.
Weak Scaling Measurements – Experiment #2.
While the somewhat artificial setting in experiment #1 yields perfect pillar-like or cubic partitions, experiment #2 addresses a more realistic setup, where we increase the number of processes more smoothly, i.e., by less than a factor of two in each step. With this, it not possible any more to choose perfect cubic or pillar-like partitions. Thus, we identify reasonable parameters by solving an optimization problem that trades based on process counts and targeted sub-domain shape. Note that the combination of number of processes and number of elements leads to partitions at the boundary of the computational domain that potentially have less elements than interior partitions. Compared to the previous example, the number of 3D elements per process is only approximately constant with the pillar-like
partitions getting closer to constant size than the cubic ones. The numbers of processes and dimensions of the computational domain are listed in Table 2. Fig. 11 presents the runtime results.
|Nodes,||3D Elements||1D El.|
As already discussed above, the ODE solver for the 0D-problem (yellow line) requires the majority of the runtime. This is followed by the solution times for the 1D (red line) and 3D (green line) sub-problems. The blue lines depict the duration of the interpolation and homogenization between the node positions of the 1D fibers and the 3D mesh. It can be seen that the computational times stay nearly constant for increasing problem size. As in the previous experiment, the 3D solver performs better for cubic partitioning whereas the 1D solver is faster for pillar-like partitions. In this scenario, the cubic partitioning in total slightly outperforms the pillar-like partitioning, as expected.
As before, the memory consumption appears to be a weakness. Therefore, additional tests investigating the memory consumption per process at the end of the runtime were carried out. The memory consumption for the presented scenario is plotted in Fig. 12 with respect to the overall number of 1D elements. Also the average number of ghost layer elemts per process is depicted. Ghost layer elements are copies of elements adjacent to the partition of a process, i.e., belong to the subdomain of a neighbouring process. They are used as data buffers for communication.
We observe that the average number of ghost elements per process for the 3D problem is higher for pillar-like partitions (dashed black line) than for the cubic partitions (solid black line). A sharp increase of memory consumption (magenta lines) is observed independent of the partitioning scheme. This is due to duplications of global data on each process which will be eliminated in a next optimization step. Compared to this effect, the difference between the number of ghost elements needed for the two partitioning strategies is negligible.
Dependency Between Runtime and Partition Shape – Experiment #3. In our third scaling test, the dependency of the solver of the 3D continuum-mechanical problem on the partitioning strategy is investigated. We analyse how different domain decomposition approaches, in particular others than the previously discussed pillar-like and cubic partitioning schemes, affect the runtime. A test case with three-dimensional elements is considered. Otherwise, the setup is as described in Sec. 2.4. To reduce the contributions of the 0D/1D sub-problem and focus on the performance of the 3D components, we include in each 3D element only two 1D fiber elements. The domain is decomposed into a constant number of partitions by axis-aligned cutting planes in all possible ways. To distinguish between the different partitioning variants, we compute the average boundary surface area between the partitions for each variant and relate this to runtime. The results are presented in Fig. 13.
The smallest average surface area between the partitions, which corresponds to the first data point in Fig. 13, is obtained for a partitioning with partitions with elements each. The highest average surface area between the partitions, which is the last data point within Fig. 13, is obtained for partitions with elements each. All experiments are run on 12 nodes of Hazel Hen with 12 processes per node. It can be seen that only the time needed to solve the 3D continuum-mechanical problem increases monotonically with respect to the average surface area between the partitions, i. e., depends on the partitions’ shape. This is expected. Further, the runtime ratio of the 3D solver between the partitioning with the smallest and largest average surface area is .
4.3 Optimizing Input/Output – File Format
Optimizing HPC simulations also includes the consideration of efficient input and output (IO) file formats, e. g., writing data in parallel to reduce communication, and binary codecs to reduce the size of a data set. Currently, OpenCMISS appeals to an EX-file format333http://opencmiss.org/documentation/data_format/ex_file_format.html, Accessed: 2017-09-30, which writes header information and data as plain text. Headers and data are stored in the same file, and one file is generated for each time step. In general, writing text-based data has very low read and write performance, and very high storage requirements compared to binary encoded data. To successfully run simulations on large systems, the data IO should ideally not cause scaling issues. Therefore, we propose in this work a new, efficient file format444https://github.com/UniStuttgart-VISUS/ngpf, Accessed: 2017-09-30.
The new data structure includes custom binary codecs, such as the high-compression algorithm ZFP , and stores meta data in separate header files using the human-readable JSON object notation. Fig. 14 gives an overview of the data layout, where each colored box denotes a separate file. The main feature of the new file format is the full support of domain decomposition (DD) with parallel file IO. Every node processes its own set of data (e. g., time integration or data encoding) and claims the corresponding memory region in the data files to write the encoded data stream.
The new file format has three optional features. It can write time independent data (TID, cf. Fig. 14) that are written per attribute data, e. g., positions of Gaussian points in the 3D mesh in the reference configuration. A further feature is the ”Type Header” that writes type-specific quantities, which can contain simple attributes or a complex structural layout. With the third feature, the DD meta data (e. g., load in each node), we are able to analyze how each node has performed during the simulation. These meta data can answer questions like “which muscle fiber was simulated on which node”, or “was this node at full capacity at a specific point in time”. These data can be used to optimize the load balancing within the simulation in future work.
Efficient (parallel) IO-file formats are also the basis for the visualization of large data sets. The current visualization framework within the OpenCMISS software initative is OpenCMISS-Zinc. This framework already offers a range of visualization techniques for muscle fiber data, e. g., a convex hull calculation to construct a mesh geometry from point cloud data. However, Zinc uses outdated rendering methods, resulting in insufficient performance for large-scale data. Furthermore, it does not offer a suitable platform for fast visualization prototyping, distributed rendering, or advanced visualization techniques, which would be required for developing new muscle fiber visualizations. MegaMol  fulfills these criteria and offers additional functionality and features that are valuable for this project. One example for an additional feature is the provided infrastructure for brushing and linking that allows for interactive visual analytics. MegaMol also offers a built-in headless mode and a remote control interface that is crucial for the HPC and in-situ rendering planned for the future in order to support the visual analysis of large-scale muscle simulations. In-situ visualization is an alternative approach to traditional post-hoc data processing. It processes and visualizes data while the simulation produces them. Consequently, writing raw data to disk can be avoided completely. In-situ visualization is very much an area of active research, and thus, there are many different approaches and no standard approach has been established (see, e. g., ).
MegaMol allows GPU and CPU rendering via a thin abstraction layer, where the GPU rendering uses the OpenGL API, whereas the CPU rendering is based on the ray tracing engine OSPRay . In particular the CPU-based rendering enables image synthesis on any HPC cluster, regardless of the availability of dedicated GPUs. It offers advanced rendering and shading methods (e. g., path tracing and ambient occlusion) that enhance the perception of depth and run mostly interactive even on single workstations. Like the simulation software OpenCMISS, MegaMol is currently not optimized for HPC usage, but it provides a solid base and infrastructure and is already capabale of rendering the discretized muscle fibers as continuous geometry (cf. Fig. 15).
While analyzing and optimizing existing code for HPC infrastructures are best performed with test cases, in which the geometry has a minimal influence, e. g., the cuboid muscle test case as introduced in Sec. 2.4, it is advantageous to demonstrate the visualization capabilities on complex geometries. Therefore, data from previous simulations (here data of the Tibialis Anterior as obtained within ) are used to showcase visualization capabilities. The image was rendered with MegaMol  using the CPU ray tracing engine OSPRay .
5 Conclusion and Outlook
Using models to gain new insights into the complex physiological or anatomical mechanisms of biological tissue, or to better interpret and understand experimentally measured data, requires accurate and detailed models of the underlying mechanisms. Depending on the application or the system to be modeled, the required detail can lead very quickly to highly complex and computationally extremely demanding models. Most software packages are, however, often designed to build up computational models for a variety of complex biomechanical systems, like in the case of OpenCMISS for simulating the chemo-electromechanical behavior of skeletal muscles after neural recruitment, the mechanics of the heart, the functioning of the lung, etc. Such software packages might already run within a parallel computing environment, but are not optimized to run large-scale simulations on large-scale systems such as HazelHen, the Tier-1 system in Stuttgart. Before being able to exploit the full capabilities of supercomputers, software packages such as OpenCMISS have to be analyzed and optimized to achieve good scaling properties – ideally perfect scaling meaning that the simulation of a twice as large problem on twice as many nodes/cores requires the same runtime as the original setup.
Within this paper, we have demonstrated that the chemo-electromechanical multi-scale skeletal muscle model as introduced in Sec. 2 and implemented in OpenCMISS is capable of running significantly sized model setups in a parallel compute environment. We have simulated the deformation of a skeletal muscle in which 34,560 randomly activated fibers are discretized with 103,680 1D elements. Further, by utilizing a standard test case, we have been able to show good weak scaling properties for a small number of compute nodes. For the partitioning of the domain, two different approaches have been considered: a pillar-like partition, in which each embedded 1D skeletal muscle fiber is kept within a single partition, and a cubic partitioning, in which individual fibers can be distributed over different partitions. It has been observed that mainly the solution times of the 3D and the 1D solver depend on the domain partitioning. For the test cases, the 1D solver profits from pillar-like partitions, while the 3D solver exhibited lower runtimes for cubic partitions. The analysis has revealed that for compute-intense scenarios with a high numbers of 1D elements (in relation to the computational work needed to solve the 3D continuum-mechanical model), cubic partitions yield a better overall performance. The scaling of the 1D solver with increasing problem size has been found to be linear and, hence, promises good behavior when the overall problem size will be further increased in the future. Furthermore, the text based EX-file format has been identified as not appropriate to run on an HPC system. Therefore, a new file format is being developed that offers high-compression algorithms and performs file IO in parallel. However, before utilizing significant portions of HazelHen, further aspects concerning the model, algorithms, implementation, and visualization need to be considered.
From a modeling point of view, we would like to use even more complicated chemo-electromechanical model that include, for example, the mechanical behavior of titin [18, 19], and include further important biophysical details such as metabolism, a biophysical recruitment model , and a feedback mechanism from the spindles and the golgi-tendon organs to the neuromuscular system. Moreover, simulating and visualizing the surface EMG is needed to further test motor unit decomposition algorithms. From an algorithmic point of view, such model enhancements also always require novel or custom-tailored efficient numerical schemes. From an implementational point of view, the scaling analysis also revealed a memory consumption issue. The amount of memory per process rises proportional to the total problem size. The cause for this has been identified as the suboptimal implementation of a global-to-local node numbering mapping within OpenCMISS, which, however, can be fixed. The memory consumption currently prevents simulation of larger problem sizes.
Once the modeling, algorithmic, implementational aspects have been solved and large-scale simulations have been set up, e. g., a single chemo-electromechanical skeletal muscle model with a realistic number of fibers (e. g., about 300.000) of realistic length, the results need to be visualized and analyzed. To do so, we are planning to extend MegaMol with novel, comprehensive visualizations that allow users to explore the complex behavior of muscle fiber simulation data. For example, the deformation due to the contraction of the muscle is currently barely visible for different time steps, whereas to compare the contraction of different runs or different time steps, a clear visual representation is necessary. Furthermore, visualization can help to gain new insights into the functioning of our neuro-muscular system by comparing the simulated surface EMG of a muscle with experimental data obtained via non-invasive and clinically available diagnostic tools.
After we achieved all the model enhancements and have overcome the computational and implementational challenges, these simulations can be utilized to set up in silico benchmark tests for complex muscles and recruitment patterns to improve existing data mining algorithms, e.g., the motor unit decomposition algorithms, and provide, at the same time, a new tool to investigate the interplay of the underlying complex and coupled mechanisms leading from neural stimulation to force generation.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
All authors have equally contributed to the conception and design of the work, data analysis and interpretation, drafting of the article, and critical revision of the article. Hence, the author list appears in alphabetical order. In addition NE, TK, AK, BM and TR have conducted the simulations and summarized their results. All authors fully approve the content of this work.
This research was funded by the Baden-Württemberg Stiftung as part of the DiHu project of the High Performance Computing II program, the Deutsche Forschungsgemeinschaft (DFG) as part of the International Graduate Research Group on ”Soft Tissue Robotics – Simulation-Driven Concepts and Design for Control and Automation for Robotic Devices Interacting with Soft Tissues” (GRK 2198/1) and as part of the Cluster of Excellence for Simulation Technology (EXC 310/1).
-  P. R. Amestoy, I. S. Duff, J.-Y. L’Excellent, and J. Koster. A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM Journal on Matrix Analysis and Applications, 23(1):15–41, 2001.
-  P. R. Amestoy, A. Guermouche, J.-Y. L’Excellent, and S. Pralet. Hybrid scheduling for the parallel solution of linear systems. Parallel Computing, 32(2):136–156, 2006.
-  S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith. Efficient management of parallelism in object oriented numerical software libraries. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, pages 163–202. Birkhäuser Press, 1997.
-  S. S. Blemker, P. M. Pinsky, and S. L. Delp. A 3D model of muscle reveals the causes of nonuniform strains in the biceps brachii. Journal of Biomechanics, 38(4):657–665, 2005.
-  M. Böl and S. Reese. Micromechanical modelling of skeletal muscles based on the finite element method. Computer methods in biomechanics and biomedical engineering, 11(5):489–504, 2008.
-  C. P. Bradley, A. Bowery, R. Britten, V. Budelmann, O. Camara, R. Christie, A. Cookson, A. F. Frangi, T. Gamage, T. Heidlauf, S. Krittian, D. Ladd, C. Little, K. Mithraratne, M. Nash, D. Nickerson, P. Nielsen, O. Nordbø, S. Omholt, A. Pashaei, D. Paterson, V. Rajagopal, A. Reeve, O. Röhrle, S. Safaei, R. Sebastián, M. Steghöfer, T. Wu, T. Yu, H. Zhang, and P. J. Hunter. OpenCMISS: A multi-physics & multi-scale computational infrastructure for the VPH/Physiome project. Progress in Biophysics and Molecular Biology, 107:32–47, 2011.
-  H. Childs, K.-L. Ma, H. Yu, B. Whitlock, J. Meredith, J. Favre, S. Klasky, N. Podhorszki, K. Schwan, M. Wolf, et al. In situ processing. Technical report, Ernest Orlando Lawrence Berkeley National Laboratory, Berkeley, CA (US), 2012.
-  G. R. Christie, P. M. Nielsen, S. A. Blackett, C. P. Bradley, and P. J. Hunter. Fieldml: concepts and implementation. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 367(1895):1869–1884, 2009.
-  R. R. Cisi and A. F. Kohn. Simulation system of spinal cord motor nuclei and associated nerves and muscles, in a web-based architecture. Journal of computational neuroscience, 25(3):520–542, 2008.
-  G. V. Dimitrov and N. A. Dimitrova. Precise and fast calculation of the motor unit potentials detected by a point and rectangular plate electrode. Medical engineering & physics, 20(5):374–381, 1998.
-  D. Farina and R. Merletti. A novel approach for precise simulation of the emg signal detected by surface electrodes. IEEE Transactions on Biomedical Engineering, 48(6):637–646, 2001.
-  A. J. Fuglevand, D. A. Winter, and A. E. Patla. Models of recruitment and rate coding organization in motor-unit pools. Journal of neurophysiology, 70(6):2470–2488, 1993.
-  A. M. Gordon, A. F. Huxley, and F. J. Julian. The variation in isometric tension with sarcomere length in vertebrate muscle fibres. The Journal of Physiology, 184:170–192, 1966.
-  S. Grottel, M. Krone, C. Müller, G. Reina, and T. Ertl. MegaMol – A Prototyping Framework for Particle-Based Visualization. IEEE Transactions on Visualization and Computer Graphics, 21(2):201–214, Feb. 2015.
-  V. Gurev, P. Pathmanathan, J.-L. Fattebert, H.-F. Wen, J. Magerlein, R. A. Gray, D. F. Richards, and J. J. Rice. A high-resolution computational model of the deforming human heart. Biomechanics and Modeling in Mechanobiology, 14(4):829–849, Aug 2015.
-  D. Hawkins and M. Bey. A comprehensive approach for studying muscle-tendon mechanics. Journal of Biomechanical Engineering, 116(1):51–55, 1994.
-  C. Heckman and M. D. Binder. Computer simulation of the steady-state input-output function of the cat medial gastrocnemius motoneuron pool. Journal of Neurophysiology, 65(4):952–967, 1991.
-  T. Heidlauf, T. Klotz, C. Rode, E. Altan, C. Bleiler, T. Siebert, and O. Röhrle. A multi-scale continuum model of skeletal muscle mechanics predicting force enhancement based on actin–titin interaction. Biomechanics and Modeling in Mechanobiology, 15(6):1423–1437, 2016.
-  T. Heidlauf, T. Klotz, C. Rode, T. Siebert, and O. Röhrle. A continuum-mechanical skeletal muscle model including actin-titin interaction predicts stable contractions on the descending limb of the force-length relation. PLoS computational biology, 13(10):e1005773, 2017.
-  T. Heidlauf, F. Negro, D. Farina, and O. Rohrle. An integrated model of the neuromuscular system. In Neural Engineering (NER), 2013 6th International IEEE/EMBS Conference on, pages 227–230. IEEE, 2013.
-  T. Heidlauf and O. Röhrle. Modeling the Chemoelectromechanical Behavior of Skeletal Muscle Using the Parallel Open-Source Software Library OpenCMISS. Computational and Mathematical Methods in Medicine, 2013:1–14, 2013.
-  T. Heidlauf and O. Röhrle. A multiscale chemo-electro-mechanical skeletal muscle model to analyze muscle contraction and force generation for different muscle fiber arrangements. Frontiers in Physiology, 5(498):1–14, 2014.
-  B. Hernández-Gascón, J. Grasa, B. Calvo, and J. F. Rodríguez. A 3D electro-mechanical continuum model for simulating skeletal muscle contraction. Journal of Theoretical Biology, 335:108–118, 2013.
-  A. L. Hodgkin and A. F. Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of physiology, 117(4):500–544, 1952.
-  T. Johansson, P. Meier, and R. Blickhan. A finite-element model for the mechanical analysis of skeletal muscles. Journal of Theoretical Biology, 206(1):131–49, 2000.
-  P. Lindstrom. Fixed-Rate Compressed Floating-Point Arrays. IEEE Transactions on Visualization and Computer Graphics, 20, Aug. 2014.
-  C. M. Lloyd, M. D. Halstead, and P. F. Nielsen. Cellml: its future, present and past. Progress in biophysics and molecular biology, 85(2):433–450, 2004.
-  M. M. Lowery, N. S. Stoykov, A. Taflove, and T. A. Kuiken. A multiple-layer finite-element model of the surface emg signal. IEEE Transactions on Biomedical Engineering, 49(5):446–454, 2002.
-  R. MacIntosh, B., F. Gardiner, P., and J. McComas, A. Skeletal Muscle: Form and Function. Human Kinetics, second edition, 2006.
-  L. Mesin and D. Farina. An analytical model for surface emg generation in volume conductors with smooth conductivity variations. IEEE transactions on biomedical engineering, 53(5):773–779, 2006.
-  G. L. Miller, S.-H. Teng, W. Thurston, and S. A. Vavasis. Automatic mesh partitioning. In Graph Theory and Sparse Matrix Computation, pages 57–84. Springer, 1993.
-  M. Mordhorst, T. Heidlauf, and O. Röhrle. Predicting electromyographic signals under realistic conditions using a multiscale chemo-electro-mechanical finite element model. Interface Focus, 5(2):1–11, February 2015.
-  M. Mordhorst, T. Strecker, D. Wirtz, T. Heidlauf, and O. Röhrle. POD-DEIM reduction of computational EMG models. Journal of Computational Science, 19:86–96, 2017.
-  F. Negro and D. Farina. Decorrelation of cortical inputs and motoneuron output. Journal of neurophysiology, 106(5):2688–2697, 2011.
-  A. J. Pullan, L. K. Cheng, and M. L. Buist. Mathematically modelling the electrical activity of the heart: from cell to body surface and back again. World Scientific, 2005.
-  D. E. Rassier, B. R. MacIntosh, and W. Herzog. Length dependence of active force production in skeletal muscle. Journal of Applied Physiology, 86(5):1445 – 1457, 1999.
-  M. V. Razumova, A. E. Bukatina, and K. B. Campbell. Stiffness-distortion sarcomere model for muscle simulation. Journal of Applied Physiology, 87(5):1861–1876, November 1999.
-  E. Ríos, M. Karhanek, J. Ma, and A. González. An Allosteric Model of the Molecular Interactions of Excitation-Contraction Coupling in Skeletal Muscle. The Journal of General Physiology, 102(3):449–481, 1993.
-  O. Röhrle, J. B. Davidson, and A. J. Pullan. Bridging scales: a three-dimensional electromechanical finite element model of skeletal muscle. SIAM Journal on Scientific Computing, 30(6):2882–2904, 2008.
-  O. Röhrle, J. B. Davidson, and A. J. Pullan. A physiologically based, multi-scale model of skeletal muscle structure and function. Frontiers in Physiology, 3, 2012.
-  O. Röhrle and A. J. Pullan. Three-dimensional finite element modelling of muscle forces during mastication. Journal of Biomechanics, 40(15):3363–3372, 2007.
-  Y. Saad and M. H. Schultz. Gmres: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 7(3):856–869, 1986.
-  S. Schamberger and J.-M. Wierum. Partitioning finite element meshes using space-filling curves. Future Generation Computer Systems, 21(5):759 – 766, 2005. Parallel computing technologies.
-  P. R. Shorten, P. O’Callaghan, J. B. Davidson, and T. K. Soboleva. A mathematical model of fatigue in skeletal muscle force contraction. Journal of Muscle Research and Cell Motility, 28(6):293–313, 2007.
-  I. Wald, G. Johnson, J. Amstutz, C. Brownlee, A. Knoll, J. Jeffers, J. G?nther, and P. Navratil. OSPRay - A CPU Ray Tracing Framework for Scientific Visualization. IEEE Transactions on Visualization and Computer Graphics, 23(1):931–940, Jan. 2017.
-  F. E. Zajac. Muscle and tendon properties models scaling and application to biomechanics and motor. Critical reviews in biomedical engineering, 17(4):359–411, 1989.
-  M. Zhou, O. Sahni, K. D. Devine, M. S. Shephard, and K. E. Jansen. Controlling unstructured mesh partitions for massively parallel simulations. SIAM Journal on Scientific Computing, 32(6):3201–3227, 2010.