Microfluidic devices are used to transport, control, analyse and manipulate nanoliter quantities of liquids, gasses and other substances Beebe2002; Whitesides2006; Streets2013 in natural sciences and engineering as well as in clinical research Squires2005; Streets2013; Sackmann2014; Zhang2016. Fluid flows in microfluidic devices are characterized by Reynolds numbers that are low ( - ) or moderate ( - ) in high throughput settings. Fluid flows at the microscale have been often modeled by the continuum Navier Stokes equations with various discretizations Clague2001; DiCarlo2009; Cimrak2012; Karabacak2014; Scherr2015; grimmer2019. Continuum models have been successful in capturing several key features of microscale flows but at the same time face limitations as they are not able to resolve phenomena affected by fluctuations and related biophysical processes Freund2014. Moreover the complex geometries, the multiple deforming objects as well as the need to handle chemistry and related processes often adds computational complexity to the classical Navier Stokes solvers leading to expensive computations with low time-to-solution rahimian2010petascale. We note for example that state-of-the-art simulations of Red Blood Cells that use scalable boundary integral methods have only used a few hundred RBCs in two dimensions Quaife2014; Kabacaoglu2018.
An alternative simulation approach for micorfluidics uses mesoscale Dissipative Particle Dynamics (DPD) method, which represents the fluids and suspended objects as collections of particles. DPD is a stochastic, short-range particle method that bridges the gap between Molecular Dynamics and Navier-Stokes equations Espanol1995. It has been used extensively to model complex fluids such as colloidal suspensions, emulsions and polymers Boek1997; Moeendarbary2010 and has recently become a key method for the study of the blood rheology Quinn2011; Zhang2014; Fedosov2014; Lanotte2016. However, currently most of these simulations are carried out by non open source software based on LAMMPS Plimpton1995 Molecular Dynamics (MD) package. LAMMPS is notable for its flexibility and capability to model multiphysics but at the same time this may come at the expense of speed in domain specific settings such as MD 111 http://www.hecbiosim.ac.uk/benchmarks . A number of open source codesTang2014; Blumers2017; seaton2013dl_meso,provide little usage instructions, but at the same time demonstrate better performance than LAMMPS by exploiting extensively Graphics Processing Units to accelerate the most computationally-expensive kernels. As many problems in computational science are usually data-parallel, the appeal of moving some computations to the GPU is high Schulte2015: while demanding highly parallel problems and careful implementations, the GPUs offer unmatched FLOP performance and memory bandwidth, outperforming state-of-the-art server-grade CPUs by a factor of to .
However, transferring the existing code onto a GPU
is usually not a trivial task due to significant architectural differences in the hardware such as cache size, width of the vector instructions, different control flow penalties etc.. Another aspect of porting the application to theGPU is the memory traffic through the PCI-E bus between the accelerator and the CPU. The bus only delivers a fraction of main RAM bandwidth, and if used extensively, may easily result in a performance bottleneck. Therefore often porting parts of the application onto the GPU may not be either beneficial nor easy, and starting from scratch is to be preferred.
Here we present MiRheo 222https://github.com/cselab/Mirheo , a high-throughput software for microfluidic flow simulations in complex geometries with suspended visco-elastic cell membranes and rigid objects, written exclusively for GPUs and clusters of GPUs (e.g. see Figure 1). It is a successor of uDeviceX code Rossinelli2015 with improved performance, usability, extensibility and many additional features. MiRheo handles complex geometries, large number of suspended rigid bodies and cells, fluids with different viscosity and provides a flexible yet efficient and well-documented way to specify the simulation setup and parameters. In the rest of the paper we first introduce the employed numerical method (Section 2), then go over details of our implementation and parallelization strategies (Section 3), followed by code validation (Section 4) and benchmarks (Section 5) before concluding (Section 6).
2 Numerical method
MiRheo is based on the DPD method, which yields fluctuating hydrodynamic Espanol1995; Groot1997. The software accommodates flows in complex geometrical domains as well as deformable and rigid objects suspended in the fluid. More specifically, the supported objects are visco-elastic closed shells (representing cell membranes discretized on triangular meshes) and rigid bodies of arbitrary shape. The evolution of the system is governed by pairwise particle forces while enforcing of the no-slip and no-through boundary conditions where applicable.
2.1 Dissipative particle dynamics
The DPD, is a particle based method introduced by Hoogerbrugge Hoogerbrugge1992 and further formulated and developed in Groot1997; Espanol1995. In DPD the fluid is described by a set of particles in the 3D space. Each particle is characterized by its mass , position and velocity . Particles evolve in time according to the Newton’s law of motion:
where is the force exerted on the particle and is time. The force fields are usually expressed in terms of the distance between particles and they imply local interactions as they vanish after a cutoff radius . The particles interact through central forces, which implies, by the Newton’s third law, conservation of linear and angular momentum. The DPD forces acting on the particle indexed by are written as
where the force is composed of a conservative, dissipative and random term. The conservative term acts as purely repulsive force and reads
where , , and
The dissipative and random terms are given by
The random variableis independent Gaussian noise satisfying , and . The parameters and are linked through the fluctuation-dissipation relation and Espanol1995. The dissipative kernel has the form with Fan2006.
2.2 Objects representation
Rigid objects are modeled as groups of particles moving with the same velocity field of their center of mass. Their surface geometry is expressed either analytically or by a triangular mesh-based representation. The state of a rigid object is fully determined by its center of mass, orientation (stored as a quaternion), linear and angular velocities.
The visco-elastic incompressible membrane is modeled by a triangular mesh with particles as its vertices. The elastic potential energy of the membrane with constant volume and area is given by Fedosov2010b:
accounts for the energy of the elastic spectrin network of the membrane, including an attractive worm-like chain potential and a repulsive potential such that a nonzero equilibrium spring length can be obtained.
where is the spring constant, is the normalized spring length and is the number of mesh edges. The bending energy term, , models the resistance of the lipid bilayer to bending. We implement two different energy models for the membrane dynamics. The first is attributed to Kantor and Nelson kantor1987phase:
where is the bending constant, is the angle between two adjacent triangles (called dihedral) and is the equilibrium angle. The second was developed by Jülicher julicher1996morphology:
and are penalization terms accounting for area and volume conservation of the membrane:
where is the area of a single triangle, , is the volume enclosed by the membrane and is the number of triangles in the mesh.
The membrane viscosity is modeled by an additional pairwise interaction between particles sharing the same edge. This interaction contains a dissipative and random term with the same form as the DPD interaction with .
2.3 Boundary conditions
Solid boundaries in the computational domain are represented via a Signed Distance Function (SDF), zero value isosurface which defines the wall surface. A layer of frozen particles with thickness of is located just inside the boundary. The no-through condition on the wall surface is enforced via a bounce-back mechanism Revenga1998. These particles have the same radial distribution function as the fluid particles, and interact with the latter with the same DPD forces. This ensures the no-slip condition as well as negligible density variations of the fluid in proximity to the wall fedosov2008BC; Kotsalis2009; Werder2005.
The fluid-structure interactions for the rigid objects are similar to the ones employed for the walls. The surface impenetrability is ensured by bouncing-back solvent particles off the rigid objects surfaces with linear and angular momentum conservation.
In order to maintain the no-slip and no-through flow boundary conditions on the membrane surface, we use the technique originally proposed in Fedosov2010b. We assume that a membrane is always surrounded by fluid from both sides, with the same density and conservative potential, but the code allows for different fluid viscosities. We then let the fluid particles across the membrane interact only with the conservative part of the DPD force, and in contrast, make the fluid–membrane interaction purely viscous. In that way we maintain constant radial distribution function in the liquids in proximity of the membrane, and with the appropriate choice of the liquid–membrane viscous parameter the no-slip condition is satisfied. The no-through condition is also enforced via the bounce-back mechanism.
3 Implementation details
The outline of MiRheo shares several key features with classical MD application with local interactions. However the introduction of the bounce-back mechanism, adoption of relatively large time-step and the necessity to operate on membranes consisting of hundreds and thousands of particles requires extra attention that distinguishes MiRheo from classical MD implementations. The software targets GPU-enabled clusters and use established technologies and libraries such as C++, CUDA, MPI, HDF5 and Python.
3.1 Algorithmic overview
The majority of design decisions in MiRheo have been dictated by the demands of the GPU architectures and the requirement for a robust and extensible code. The overall structure of MiRheo includes the following main components (see Figure 2):
Data management classes, called Particle Vectors. They store particles of a specific type and their properties. Objects, like rigid bodies or cell membranes, are also implemented as Particle Vectors for uniformity.
Various handler classes that implements various actions on the Particle Vectors, e.g., integration, force computations, wall interactions.
Plugins, which provide a convenient and non intrusive way of adding functionalities to MiRheo.
Coordinator classes, that perform initial simulation setup and time-stepping. These classes stitch together the Particle Vectors, handlers and plugins into an extensive simulation pipeline.
Python bindings, that provide a way to create and manipulate the data and handlers.
We use MPI parallelization by employing domain decomposition into equal rectangular boxes, such that each MPI rank keeps only the local particles. To reduce communication between the MPI ranks, we assign all the particles of a single object to a single MPI process depending on the center of mass of the object. The core data (including particles, cell lists, forces, etc.) are stored in the GPU RAM, while only the objects and particle adjacent to the subdomain boundaries require to be transferred to the CPU memory and communicated via MPI to the adjacent ranks. Moreover, we organize the particle data as Structure-Of-Arrays (SOA) in order to optimize memory traffic, and also to allow dynamic addition of extra properties per each particle or each object.
The time-stepping pipeline of MiRheo is organized as follows:
Create the cell-lists for all the types of particles (Particle Vectors) and interaction cut-offs. In a typical simulation the cut-off radius is the same for all the pairwise forces involved and we found that using a separate cell-list for different cut-off radii is beneficial in terms of overall performance.
Using the created cell-lists, we identify the halo (or ghost) particles, that have to be communicated to the adjacent processes. The transfer itself is overlapped with the subsequent force computation. We will give more details about the overlap in the later section.
Compute forces due to the local particles.
After the halo exchange is completed, we compute the forces in the system due to particles coming from neighboring processes.
Integrate the particles with the fused Velocity-Verlet. The rigid bodies need special treatment: we first calculate the total force and torque for each body, and then integrate their positions and rotational quaternions.
Bounce the particles off the walls, rigid bodies and membranes. The forces due to the bounce are saved in the objects and transferred to the next time-step.
Identify the particles and objects that have left the local subdomain and send them to the corresponding adjacent MPI rank.
3.2 Pairwise interactions and cell-lists
The nominal cost of computing all the pairwise forces in a system with particles scales as . However, in DPD the pairwise forces only affect the local neighborhood of each particle as the potential vanishes with the increased particle distance and the cost of force calculation is reduced to . A common approach to restrict force computation to the particle pairs within a distance is to use the Verlet, or neighbor lists (LAMMPS Plimpton1995, NAMD Phillips2005, GROMACS Berendsen1995, HOOMD-blue Anderson2008). Such lists store for each particle the indices of all the other particles in the system within the distance . The non-zero is introduced as the cost of building that structure is significant compared to the force evaluation, therefore it is advisable to rebuild it only once every few time-steps. For each simulation there exist an optimal that is governed by a balance between building the neighbor list (benefits from large ) and force evaluation (benefits from smaller ).
Another distinguishing feature of the DPD forces with respect to classical MD force fields such as Lennard-Jones, is the fact that the potential of the conservative force increases at a far lower rate with the pairwise distance (a ”soft” potential). In turn, the typical DPD time-steps are much larger than ones used in MD. These two factors together result in fast changes of the particle neighborhoods, that in turn would result in frequent Verlet list rebuilding and consequently a performance penalty.
In MiRheo we use the cell-lists to accelerate the force computation. We first split the domain of interest into cubic cells with edge length , forming a uniform Cartesian grid. Then the cell-list data structure is defined as a two-way mapping of a particle onto a unique cell. The particle-cell mapping is trivial and can easily be computed from the particles coordinates. The construction of the inverse mapping, the cell-list itself, requires the particles to be sorted according to the index of the cell they belong to, and computing the positions in the particle array corresponding to each cell.
The force evaluation is typically the most time-consuming operation of each time-step. We map each particle to a GPU thread which scans the adjacent cells and calculates all the interactions for the given particle. The ordering of the particles in memory due to cell-lists increases data locality which accelerates the fetching of the particle data through cache. We observe that exploiting the symmetry of the forces yields in faster execution despite the additional atomic operations.
3.3 Particle bounce-back
An important part of a microfluidics simulation is to maintain no-through properties of the particles with respect to rigid bodies, walls and membranes. In all the three cases we introduce a continuous “inside-outside” function of the particle coordinates that changes its sign on the impenetrable boundary. For example, for the wall that function is the SDF. By equating the “inside-outside” function with zero we obtain an equation whose solution gives the exact collision location. After this location is found, we place the particle into the collision point and reverse its velocity in the frame of reference of the surface. In order to reduce the computational cost, we exploit the cell-list that is built in the beginning of each time-step and only check the particles that are located in the cells close to the zero level of the “inside-outside” function.
3.4 Efficient parallelization: compute/IO overlap
The vastly different scales of bandwidth provided by the GPU, PCI-E bus and the HDD storage make it necessary to overlap the intensive computation with different I/O operations performed by the code, such as MPI communications and dumping data on the disk pacheco2011introduction. This is achieved by the two layers in the MiRheo design.
First, we run 2 MPI tasks for every computational subdomain. One of the tasks, called compute task, performs the actual time-stepping on the GPU, while the other one (postprocess task) is responsible for all the heavy I/O and in-situ data post-processing on the CPU. Such asynchronous design ensures perfect overlap of the disk operations with the simulation, improves code modularity and adds flexibility, since heavy data processing can be performed in parallel to the simulation, on the otherwise idling CPU. Figure 9 shows the importance of the asynchronous HDF5 writes for the overall execution time.
The second layer of overlapping operations with each other is utilized to hide the MPI and PCI-E latencies. Since the total number of fine-grained tasks in the time-step pipeline is about 30, maintaining their dependencies and concurrently executing some of the kernels becomes a tedious task. To facilitate the setup, we have implemented a GPU-aware task scheduler based on the Kahn’s topological sorting algorithm Kahn1962, that supports task execution on concurrent CUDA streams. With the help of the scheduler, we can easily overlap the halo host-to-device and device-to-host memory transfers together and the corresponding MPI communications with the force computations and potentially other heavy kernels.
3.5 Python interface
As software complexity increases to address multiple setups, the complexity of its usage is increasing accordingly. Software packages often provide custom syntax for their configuration files, or even introduce a simple programming language to help users Plimpton1995; Berendsen1995. We believe that implementing simulation setup through a well established programming language is superior with respect to the software specific approaches, as it benefits from the mature infrastructure and widespread usage of the language. With its flexibility and extensive support for scientific computations via comprehensive numerical libraries, Python proves to be one the best front-end languages Tosenberger2011 for complex codes, such as MiRheo. The pybind11 project pybind11 allowed us to easily provide a C++/CUDA proxy into Python with minimal coding efforts.
We expose all our data holder classes, handlers, plugins and the coordinator class such that the user is able to assemble the specific simulation setup out of the few basic building blocks like a construction toy. Further advances of our approach include a very thin abstraction layer and the ease of documenting the functions available to the end users.
In this section we present a set of validation cases for MiRheo, in which we compare our results against available analytical solutions or previously published data. An additional large set of more fine-grained tests (for example, bounce-back tests, momentum conservation verification, etc.) is available with the source code. We note that MiRheo was developed using experiences from a previous code (uDeviceX Rossinelli2015) co-developed by our group. These experiences were instrumental in developing a code that is shown to outperform uDeviceX, a Gordon bell finalist in 2015 Rossinelli2015.
4.1 Viscosity of the Dpd liquid for different parameters
Table 1 shows the measured viscosity of the DPD fluid with mass density given specific parameters compared to the one presented in the literature. We chose to measure the viscosity using a Poiseuille flow, such that . Here is the radius of the pipe, is the body force applied on each particle in order to form the pressure gradient, and is the average flow velocity. We assume small enough time-step where in case it is not reported, and obtain the range of viscosities for body force ranging from to , obtaining good agreement with the previously reported values.
4.2 Periodic Poiseuille flow
Periodic Poiseuille flow was introduced in Backer2005 as a convenient way to measure viscosity of a particle fluid without walls. The setup consists of a cubic domain () with periodic boundary conditions and the space-dependent body force that drives the fluid in the opposite directions:
For a Newtonian fluid, the resulting laminar flow has a parabolic profile: , where is the fluid’s mass density and its dynamic viscosity. The simulation results are depicted in Figure 3.
4.3 Taylor-Couette flow
Taylor-Couette flow consists of a fluid moving between two concentric cylinders, one of them rotating with respect to the other. Given the cylinders radii and , and their rotational velocities and , the resulting azimuthal velocity of the fluid is given by the following:
where and . The simulation results are depicted in Figure 4.
4.4 Jeffery orbits in shear flow
We validate the rigid body dynamics by simulating the rotation of an ellipsoid in a simple shear flow. In the limit of small Reynolds number, the inclination angle of the longer ellipsoid axis is known to be following the Jeffery orbit Jeffery1922 over time:
where and is the longer and shorter axes of the ellipsoid and is the shear rate. We enforce the shear profile by moving two parallel plates with opposite velocities. The ellipsoid is kept in the middle of the computational domain throughout the entire simulation. The results are depicted in the Figure 5.
4.5 Flow past a sphere close to a wall
We study the drag and lift coefficients of a sphere translating in the otherwise quiescent fluid close to a flat wall. The setup is characterized (see Figure 6) by the fluid kinematic viscosity , sphere radius , wall distance and the translation velocity . We perform the simulation in the sphere frame of reference by keeping the sphere center of mass forcedly fixed. We use periodic boundary conditions in and directions and introduce two walls orthogonal to the direction. The walls are moved with the velocity and the average fluid velocity in the domain is kept constant. The remaining non-uniform wake downstream the sphere is removed in a thin layer before the domain boundary. The sphere, however, is allowed to rotate freely.
The quantities of interest are the lift and drag coefficients of the sphere, which are computed by time-averaging of the fluid forces acting on the sphere. Due to the third Newton’s law those forces are simply negation of the forces that are required to keep the sphere in place. The non-dimensional expression of the above quantities read:
where the angular brackets denote the time-averaging and the subscripts represent wall-normal lift and wall-parallel drag, respectively. The simulation results are depicted in Figure 6. Here we consider the case with and vary the Reynolds number from to .
We obtain a good correspondence against the previous simulations carried out with spectral methods ZENG2005. A noticeable discrepancy in for can attributed to the fact that we used a little smaller domain size in order to reduce the computational cost.
4.6 Cell stretching
We validate the RBC model described in Section 2.2 by simulating a RBC stretched by optical tweezers and comparing the force-extension curve with the experimental data Fedosov2010. We vary the force applied to the few opposite particles of a single cells membrane, and measure the axial and transverse diameters of the cell. The results are depicted in the Figure 7.
One of the advantages of the MiRheo software is the very fast time to solution and nearly perfect weak scaling up to hundreds of nodes.
We benchmark our software against the state-of-the-art MD packages (HOOMD-Blue and LAMMPS) and uDeviceX code Rossinelli2015 running on GPU clusters, and perform strong and weak scaling studies.
All the timings were collected with the standard
nvprof CUDA profiler and the internal high-resolution system clock.
We used several hardware platforms to obtain the results: Piz Daint supercomputer (CSCS, Switzerland) with one Nvidia Tesla P100 per node, Leonhard cluster (ETHZ, Switzerland) with 8 Nvidia GTX 1080Ti per node, Microsoft Azure platform with Nvidia Tesla V100 and a high-end consumer laptop with Nvidia GTX 1070.
5.1 Periodic Poiseuille flow
Our first benchmark is the periodic Poiseuille flow of the DPD particles, the least complex setup that is nevertheless representative for a wide class of problems where the object of cell suspension is dilute. We consider a cubic domain of size per every GPU, filled uniformly with the DPD particles at a constant density with the periodic body force (see Section 4.2).
The reference benchmark employs and , resulting in total of M particles, or
M degrees of freedom, and roughlyM interacting pairs per node assuming uniform particle distribution. The average time-step on 1 compute node of Piz Daint is ms, which results in throughput of billion interactions per second per GPU node.
Table 2 summarizes the performance comparison against uDeviceX code and HOOMD-blue on the Piz Daint supercomputer. Since the choice of the simulation parameters may affect the run-time, in Table 3 we show that our code consistently outperforms HOOMD-blue for various benchmark setups.
|Speedup range||Average speedup|
|0.05||0.02||1.6 – 2.2||1.8|
|0.5||0.02||2.8 – 7.8||4.2|
|5.0||0.02||7.2 – 22.0||12.6|
|0.05||0.005||1.0 – 1.4||1.2|
|0.5||0.005||1.4 – 2.3||1.8|
|5.0||0.005||2.6 – 5.4||3.9|
|0.05||0.002||0.9 – 1.2||1.1|
|0.5||0.002||1.1 – 1.6||1.4|
|5.0||0.002||1.6 – 2.8||2.1|
Figure 8 (top) shows weak scaling capabilities of MiRheo running periodic Poiseuille benchmark on Piz Daint. Note that the reference point was chosen at nodes, as the single node execution employs some optimizations eliminating almost all of the MPI communication. Due to the good compute/transfer overlap, we reach almost perfect weak scaling for up to nodes. Strong scaling is not the primary scope of our code, as typically the problems of interest consist of very many particles. However, the strong efficiency of MiRheo is still good, see Figure 8 (bottom). It also shows super-linear behavior that we believe is attributed to better spatial locality of smaller amount of data in the cache, which is the main bottleneck in computing interactions.
Figure 9 shows benefits of the overlapping computations with I/O. We ran the periodic Poiseuille benchmark on Piz Daint with dumping HDF5 flow fields every steps. The I/O bandwidth doesn’t scale well with the number of nodes, reaching about for nodes. As one can observe, data dumps are overlapped with the computations, making the total runtime approximately maximum of the I/O and calculations. Note that in typical simulations, the dump frequency is much lower, such that we are never limited by the I/O performance.
5.2 Periodic whole blood flow
The second representative benchmark is the periodic Poiseuille flow of the blood, which include cell membranes, different viscosities between the plasma and the cytoplasm and runs with activated bounce-back mechanism. The latter incurs the biggest performance penalty, because it requires that objects are exchanged over MPI after the integration but before the bounce-back itself is performed. Therefore possibilities to overlap communication of objects and computation are very limited in this case.
|LAMMPS USER-MESO 2.0||140.2||144.1||143.8|
We consider a cubic domain of size filled uniformly with the RBCs at a specific volume fraction (or hematocrit level) . The fluid density is and the periodic force is applied in the same manner as for the previous case.
Table 4 summarizes the performance comparison against the only GPU code known to the authors with roughly similar feature set: LAMMPS USER-MESO Blumers2017. We used the set-up available with the USER-MESO that runs whole blood at and set the domain to roughly . The present implementation outperforms USER-MESO by a factor of x (x on many nodes), mainly attributed to the fact that LAMMPS, although evaluating forces on the GPU, keeps and uses a lot of supporting data on the CPU, incurring slow PCI-E traffic.
Figure 10 (top) shows weak scaling capabilities of MiRheo running periodic blood benchmark at . The compute/transfer overlap worsens compared to the pure liquid flow, resulting in unstable execution time and deteriorated scaling. For the same reason we observe that the single node case benefits significantly more from the MPI calls elimination. Nevertheless, the code reaches efficiency on nodes for a bigger subdomain size. Strong scaling is also worse compared to the simple DPD case, but still yields in about efficiency going from to nodes on a domain size, see Figure 10 (bottom).
5.3 Microfluidic device
To assess MiRheo at full complexity we simulate a part of a microfluidic device that captures Circulating Tumor Cells from the whole blood Karabacak2014a. The first stage of the device exploits DLD principle Huang2004a to separate the bigger and stiffer CTCs from the smaller and very flexible RBCs. In order to study the device and, later, to optimize the flow parameters and shape of the obstacles, we model its small part with two posts and impose periodic boundary conditions such that the domain replications correspond to the full device, see Figure 11 (top). The setup features domain of complex shape, blood cells at with cytoplasm times more viscous compared to the solvent, and all the bounce-back mechanisms to prevent particle leakage.
Here, we report time distribution for various parts of the code with completely synchronous GPU kernel execution (forced by setting
CUDA_LAUNCH_BLOCKING environmental variable).
The reason for using synchronous timings is that in the production mode, several independent computational kernels may overlap, thus sharing the GPU and yielding in longer individual execution times.
So in order to accurately estimate the performance of each kernel, we use the synchronous mode, while to obtain the overall wall-clock time per simulation step we use the faster asynchronous one.
The time chart in Figure 11 (bottom) shows that fluid forces account for biggest part of the execution time: 57%.
Bounce-back and the internal membrane forces have an equal share of 14% each, while 6% of the time is taken by memory-intensive integration and cell-list creating.
Remaining 10% of the time is spent in various helper kernels, like packing/unpacking, various sorts, scans, etc.
With the help of the profiler we identify that the main bottleneck for the force kernels is the L1 and L2 cache performance, with bulk and FSI kernels reaching around 60% of the aggregate cache bandwidth.
5.4 Hardware comparison
As a last step of our performance analysis, we run the periodic Poiseuille (see Section 5.1) the DLD benchmark (see Section 5.3) on different available hardware platforms: a consumer-grade laptop with Nvidia GTX 1070, ETHZ Leonhard cluster with Nvidia GTX 1080Ti, Piz Daint supercomputer with Nvidia Tesla P100 and Microsoft Azure virtual machine with Nvidia Tesla V100. Table 5 summarizes the results. Together with lower-level kernel analysis, they show that MiRheo performance benefits greatly from the better and newer hardware, with the most important factor being the size and the speed of the L1 and L2 GPU caches.
In this paper we introduced the open-source GPU software package MiRheo, that implements the DPD method. Our code can handle arbitrarily complex domains, many visco-elastic RBCs and rigid bodies of various shapes, parallelized over hundreds of GPU nodes. Such set of features, up to our knowledge, is not offered by any other particle-based simulation software like Rossinelli2015; Blumers2017. We also presented a set of validation cases that exhibit good correspondence of simulation results with analytical, experimental or earlier numerical data.
With extensive benchmarking, we showed that MiRheo provides very fast time-to-solution, efficiently harnessing GPU capability. Our code outperforms the state-of-the-art competitors by factors of (for pure DPD liquid) up to (for dense blood) and reaches high weak and strong scaling efficiencies for up to 512 nodes of Piz Daint supercomputer. Furthermore, MiRheo comes with the extensively documented Python interface, that offers a simple mechanism to combine the implemented features into a complex simulation. The code is distributed as open-source at https://github.com/cselab/Mirheo.