1 Introduction
Many applications in science and engineering require the simulation of a large number of interacting particles. This includes the simulation of macro molecules, powders, mills, silo discharges and more. Due to the insight computer simulations allow into these processes, numerous molecular dynamics frameworks as well as discrete element frameworks like CHARMM Brooks1983 , LAMMPS Plimpton1995 , GROMACS Pronk2013 , NAMD Phillips2005 , LIGGGHTSKloss2012 , ls1 mardyn Horsch2013 , PROJECTCHRONO Tasora2016 and waLBerla Preclik2015 have been developed. To simulate such a huge number of particles computations may require enormous compute power and memory capacity and can thus be only performed on parallel supercomputers. Here we focus on distributed memory systems that must be programmed with message passing using MPI MPI2016 . To fit simulations to this kind of supercomputers the simulation domain is partitioned into subdomains. These subdomains including the associated particles are assigned to different processes which get mapped to the compute cores. Each of these processes is responsible for handling one or more subdomains. For optimal memory use and computational efficiency, each process stores only the information relevant to its own subdomains. This creates the need to exchange information between processes.
In this paper we will focus on rigid particles of finite size which interact via collisions. Note that particles are modeled geometrically, and thus a particle may overlap with two subdomains. In this case, both subdomains need information about this particle and this duplicate information must be kept synchronized. A well known approach is to use nearestneighbor communication to exchange information between adjacent subdomains. However this approach has clear limitations. When particles are so large that they overlap with several nonneighboring subdomains, communication beyond just the nearest neighbors becomes necessary. The conventional parallelization methods do not support such constellations Pinches1991 ; Liem1991 ; Rapaport1991 ; Plimpton1995 . Advanced communication methods improve the communication volume but are also not capable of exchanging information past the next neighbor Bowers2005 ; Bowers2007 ; Shaw2005 ; Snir2004 ; Iglberger2009 ; Iglberger2009a ; Larsson2011 ; Preclik2015 .
However, this scenario can occur in many practical situations, as e.g. in the simulation of sediments, where the particle size distribution can be assumed to follow a lognormal distribution Schruff2016 . Thus a typical simulation will be characterized by many small particles as well as few very large particles. For a fast simulation using a large number of CPUs is desirable and thus the simulation domain will have to be partitioned into many small subdomains. This can easily lead to the case where large particles span multiple subdomains. Such cases cannot be handled correctly by nearest neighbor communication alone, or only at increased cost, when the domain partitioning is modified such that even the largest particles are always contained in the union of only nearest neighbor subdomains. This in turn can lead to an unfavorable load balancing or even to a situation where subdomains become so large that they exceed the memory capacity of a single processor.
A similar problem occurs in coupled simulations when e.g. suspensions are simulated with the particles embedded in a fluid Rettinger2017a ; Seil2017 . Then two simulation algorithms, one for particles and one for fluid dynamics, must be copartitioned and coscheduled. For efficiency, both methods must use a common domain partitioning. A reasonable load distribution for the flow solver as well as a good load balance for the rigid body dynamics may be impossible to achieve if the size of the largest particle determines a lower bound for the subdomain size.
Finally, also dynamic domain partitioning and refinement algorithms might pose problems. They may induce additional limitations on the domain partitioning as described e.g. in Schornbaum2016 and may lead to equivalent problems as discussed above. In such a constellation with a nonuniform mesh, large moving particles would pose limitations on refinement structure and would limit the flexibility to perform adaptive mesh refinement.
From these examples we conclude that there is significant interest to extend the current nearestneighbor communication and synchronization algorithm. This article therefore introduces a new shadowowner synchronization with the goal to permit more flexible domain partitioning. We describe the new parallel algorithm and analyze its parallel performance systematically. To this end we will show perfect weak scalability as well as agreeable strong scalability.
This work describes the new synchronization and communication method which has been implemented in the high performance multi physics framework waLBerla which is openly available, released under GPLv3 license^{1}^{1}1walberla.net. waLBerla includes rigid body dynamics as a subsystem called PE. The excellent scalability and performance of waLBerla^{2}^{2}2www.fzjuelich.de/ias/jsc/EN/Expertise/HighQClub/waLBerla/_node.html and PE^{3}^{3}3www.fzjuelich.de/ias/jsc/EN/Expertise/HighQClub/pe/_node.html is acknowledged by being a member of the HighQ club of the Forschungszentrum Jülich Eibl2017 that includes all programs that have shown scalability to almost half a million compute cores. In this work we demonstrate that the new communication algorithm also scales up to this maximal processor count of a current petascale supercomputer.
To ensure the comparability of our results the performance data will be expressed in Particle Updates per Core Second. This unit is calculated as
This enables a comparison between different supercomputer architectures as well as between implementations. The parallel efficiency can also easily be computed from these numbers by normalizing all values to the one for the smallest number of cores.
The remainder of this paper is structured as follows. In section 2 we give an introduction to rigid body dynamics simulations detailing the different components needed. We then continue in section 3 with the description of our implementation. We will discuss in detail the common next neighbor synchronization (section 3.2) and our newly developed shadow owner synchronization (section 3.3). In section 4 we discuss the scalability of both algorithms for up to almost 2 million processes as well as the range of scenarios where they can be applied. We then conclude with a summary in section 5.
2 Rigid Body Dynamics Simulations
In contrast to molecular dynamics, rigid body dynamics models particles with an actual spatial extent. Interaction between particles takes place when two particles are geometrically in contact with each other. Such collisions must be resolved by suitable models. Since the interaction through collisions is essentially short ranged, many ideas from molecular dynamics for short range potentials can be applied.
The general time step of a parallel rigid body dynamics simulation consists of three parts which are executed in a loop. At first the collision detection checks which particles are in contact. This is a two step process. First the number of possible contact pairs is reduced. Checking all possible pairs results in a runtime complexity of with being equal to the number of particles which has to be avoided to achieve good performance. Therefore more advanced algorithms like verlet neighbor lists Verlet_1967 , cell linked lists Hockney1974 ; Allen2017 or hierarchical hash grids Ericson2004 ; Erleben2005 are used to reduce the complexity to . In the second step all candidate pairs identified in the first step are checked for collision using their actual geometry. In the non spherical case an additional selection step can be introduced in between to avoid costly complex collision functions. Commonly axis aligned bounding box (AABB) checks are used for this Schinner_1999 ; Perkins_2001 .
Detected contacts are then passed on to the collision resolution. Well known algorithms for this stage are soft contact models CUNDALL1971 ; Cundall1979 as well as hard contact models Moreau1988 ; Anitescu1997 ; Stewart2000 . This stage is also responsible for performing the time integration. For distributed memory parallel simulations an additional synchronization step is required. Here information must be redistributed such that the data structures are consistent and that every process has all the information it needs to perform the next iteration. The details of this synchronization step depend heavily on the parallelization strategy used. In this paper we focus on the spatial subdivision approach Plimpton1995 ; Beazley1995 .
The simulation domain is partitioned into nonoverlapping subdomains and each subdomain is assigned to a processor so that each processor may handle one or several subdomains. This allows the distribution of workload as well as the distribution of data. Each particle is only stored in the subdomain in which its center of mass lies. The location of the particle defines which subdomain and thus which processor owns the particle. In addition to these master particles, we will employ shadow particles to facilitate distributed memory processing.
The need for shadow particles is illustrated in Fig. 1 showing the situation at the interface between two subdomains that belong to different processes. In order to detect the collision between the two particles one of the two processes must have information about both particles. To this end, the natural choice is to store additional information in each process about particles which overlap the subdomain. This information is organized as shadow particles which mirror master particles that are owned by different subdomains. A shadow particle must be created when a particle overlaps a subdomain that is not its owning subdomain. The shadow particle is visualized by a dashed contour in Fig. 1. Note that using the shadow particle it is now possible for process 1 to detect the collision and compute a collision response. In the following section we will discuss in detail two alternative synchronization algorithms to manage the shadow particles.
3 Parallelization using the waLBerla Framework
waLBerla is a software framework that provides functionality for domain partitioning using cuboidal subdomains that are organized in a forest of octrees Godenschwager2013 ; Schornbaum2016 . Excellent scalability and efficiency is achieved since all data structures are stored in a fully distributed manner avoiding storing any global information. In particular each process has only information about its own subdomains and its directly adjacent neighbors. This includes particle data and domain partitioning information.
3.1 General Requirements for Synchronization Algorithms
The two synchronization methods discussed in this paper are independent of the contact model. As long as the interaction is short ranged they can be used for both molecular dynamics as well as rigid body dynamics. Short ranged in this context means that particles only interact with each other when they are closer than a certain threshold. This threshold has to be much smaller then the domain size. In this paper we focus on rigid body dynamics using a non smooth granular dynamics model Preclik2015 . For both algorithms we rely on the assumption that particles do not move farther than half the diameter of the smallest particle within one time step. Since the physical accuracy of the simulation would decrease for even lower velocities this is no constraint.
When particles are generated, the respective memory is allocated in the subdomain where their center of mass lies. The synchronization algorithm must then create and update shadow particles whenever they are needed on other processes. Shadow particles hold passive information that can be read by the algorithms, but not modified. In particular, all geometric transformations, such as translations and rotations that occur due to time integration are executed only on the master particle and are subsequently synchronized to all shadow particles. The synchronization is also responsible for deleting shadow particles when they are not necessary anymore. For all these tasks, the synchronization algorithm maintains a list of shadow particles for every master particle on the process of the master particle. Additionally a reference back from every shadow particle to its corresponding master particle is stored on the process of every shadow particle. This data structure enables to efficiently update shadow particles and to collect and aggregate interactions. Since it is sometimes unavoidable to set forces and torques on shadow particles – either due to user interaction or due to collisions – these contributions are cached and collected on the master particle during the collision resolution phase. The reduction algorithm used for this can also rely on the references maintained by the synchronization algorithm.
The different types of information are communicated with a special messaging protocol. Each message consists of an identifier and message specific data. Both the identifier as well as the data are packed at the sender side and are unpacked to be processed at the receiver side. To optimize communication time, messages sent to the same destination process are first collected and sent in one block. This message aggregation helps to reduce the impact of latency involved in every communication. A more detailed description of this optimized message protocol can be found in Iglberger2009 .
In summary, synchronization algorithms must fulfill three tasks:

create and delete shadow particles

update shadow particles when master particle is modified

maintain references between corresponding master and shadow particles
Two additional requirements must be met to achieve good scaling results:

use only local and next neighbor information

reduce non next neighbor communication to a minimum
We will refer to these requirements as simulation requirements and scaling requirements. In the following two sections a basic synchronization algorithm as well as our extension will be discussed.
3.2 Next Neighbor Synchronization
For the basic next neighbor synchronization (NNS) one assumption has to be made. The radius of the largest particle, i.e. the maximal distance between the center of mass and the confining hull, has to be smaller than the smallest edge length of the subdomains. If this is true one can achieve all three tasks within one communication step by only using next neighbor communication. This communication pattern is for example described in Rapaport1991 .It is very well suited for highly parallel computations and exhibits perfect weak and strong scaling Eibl2017 .
We will now describe the algorithm in detail. A pseudocode version is shown in Algorithm 1. Since only next neighbors are involved every aspect can be controlled by the owner. Therefore every subdomain processes only its master particles. Everything related to shadow particles must be communicated to the corresponding process. This involves sending a message. The algorithm can be divided into two steps. In the first step, for all master particles that overlap a neighboring subdomain which does not have a shadow particle, one is created. If there is already a shadow particle all its properties get updated with the values from the master particle. These two cases are treated separately since updating involves less data to be sent in comparison to a full copy. If there is no overlap but a shadow particle exists then it gets deleted. In all these operations the references between the master and shadow particles are updated to remain consistent.
In the second step for all master particles it is checked whether their center of mass still lies within the owning subdomain. If it has left its subdomain, the new owner is determined. The new owner will subsequently promote its shadow particle to a master particle and all other shadow copies get notified about the new owner. If no owner is found  this means that the particle has left the simulation domain  it gets deleted. Since there is no dependency between the messages within this algorithm, they can be aggregated and sent in a block at the end of the routine.
A significant drawback of this algorithm is that if the initial assumption about the maximal particle size does not hold, the synchronization method cannot be used (see Fig. 2). Since it only involves next neighbor communication subdomains farther away will not be updated. This would eventually result in undetected collisions and wrong simulations. A correct simulation would require sufficiently large subdomains. However the subdomain size might be restricted by coupled algorithms or load balancing algorithms like described in the introduction. An alternate synchronization method is needed to tackle that case. In the next section we describe our new approach which does not depend on the next neighbor assumption.
3.3 Shadow Owner Synchronization
When particles do not satisfy the next neighbor assumption, one option is to handle them globally. This means that every process holds a copy of these large particles and all synchronizations involve global communication. Since global communication does not scale with the number of processes involved, this would degrade the parallel efficiency in large scale simulations. Therefore our alternative approach avoids global communication entirely and instead uses an approach which we call diffusive. The idea behind this is to propagate information to neighbors if they are lacking it. This way information spreads like a drop of ink in water, slowly covering the whole domain. Applied to our synchronization problem, every subdomain checks its neighbors if they need a shadow particle for one of its master or shadow particles. If so they send the required information. This check can be executed locally and does not involve global communication. Information about shadow particles will propagate one subdomain per synchronization call. Also the update of shadow particles can be performed by pointtopoint communication using the list of shadow particles maintained for the master particle. Though more expensive than simple next neighbor communication this avoids an expensive global communication.
Note that information is propagated by each subdomain based only on its local view of its neighborhood. Therefore the situation may arise that one subdomain receives the same information twice from different senders. However, in all cases one is the exact duplicate of the other and one is discarded.
When a large particle that covers more than one subdomain is initialized, more synchronization calls are required to create all shadow particles. In every synchronization call new shadow particles can be created in neighboring subdomains of existing shadow particles. Therefore the exact number of synchronizations calls is dependent on the size of the particle. Once all initial references between master and shadow particles have been created one synchronization call per time step is sufficient to keep all references up to date. In the following we will refer to this new algorithm as shadow owner synchronization (SOS).
A pseudo code with the details of the SOS can be found in Algorithm 2. The algorithm can be decomposed into two parts. The first part starts with updating all shadow particles with values from the master particle. Then the ownership of all master particles is checked by finding the subdomain which contains the center of mass. When needed the ownership is transfered. The second part of the algorithm checks all particles  master as well as shadow particles  if they overlap an adjacent subdomain. If needed and not already existent new shadow particles are created. The creation also involves an update of the shadow owner list stored by the master particle. Then every subdomain deletes shadow particles that are not needed anymore.
In this algorithm synchronization is needed at two points. All messages in the first part can be aggregated. The communication pattern is a pointtopoint communication between the master particle and all its shadow particles. The second point of communication is after the second part of the algorithm. Again all messages can be collected and sent as one large aggregated block. This communication is again a pointtopoint communication between master and shadow particles with next neighbors added. In total this algorithm needs two communication steps and the number of messages sent is roughly proportional to the number of shadow particles.
In its basic form this algorithm will produce unnecessary communications. Since shadow copies do not have any information about the other shadow copies their subdomains will attempt to generate new shadow copies at adjacent subdomains as part of every synchronization. To avoid this overhead, a cache is introduced for every subdomain. This cache stores information about previously executed communication. In particular, if the creation of a shadow particle on a neighboring subdomain has been requested. With this information the cache can be used to avoid redundant communication (compare line 17 of Alg. 2). The cost of this optimization is the additional memory needed as well as management overhead. However, since this cache only stores information about neighboring subdomains its size and cost remains bounded and thus supports scaling to large processor numbers. For further details see the pseudo code in Alg. 2.
The proposed synchronization algorithm fulfills all simulation as well as scaling requirements stated in Sec. 3.1. Scaling results for this algorithm are discussed in the next section.
4 Performance Results
In this section the scalability of both synchronization algorithms is evaluated and the advantages as well as disadvantages of our approach are discussed. All performance measurements are conducted on the Juqueen supercomputer located at the Jülich Supercomputing Centre (JSC) juqueen . This is a IBM BlueGene/Q cluster with nodes. Each node contains one PowerPC A2 processor with (only 16 are usable) cores clocked at Gilge2014 ; Wautelet2014 . The processors support 4way simultaneous multithreading (SMT) for all cores. Previous tests showed that making full use of the multithreading capabilities yields the best performance Eibl2017 . For fullmachine jobs this sums up to processes. Each node also offers of RAM. The nodes are connected via a 5D torus network.
4.1 Weak and Strong Scalability
To demonstrate the scalability of these algorithms we perform weak and strongscaling experiments. In weakscaling experiments the number of processes is increased together with the problem size. For perfect weakscaling we expect the time to solution to stay constant. The performance unit PUpCS will reflect that by also staying constant since the increase in the number of particles will cancel out with the increase in the number of cores. Let be the time to solution for p processes. For weakscaling experiments the parallel efficiency is then defined to be
For strongscaling experiments the number of cores is increased but the problem size stays the same. In this case, ideally, the computation time should decrease by the same amount as the computation power is increased. Again this will have no effect on the PUpCS measure since both contributions will cancel out each other. With the same definition of as before the parallel efficiency for strongscaling is defined to be:
More indepth information about performance measurements in general can be found in Hager2010 .
4.2 Scalability in Different Scenarios
The simulation loop of a parallel rigid particle simulation can be divided into two phases: synchronization/communication and collision detection/resolution. The computation time spent in the collision detection/resolution phase heavily depends on the number of collisions which occur per time step. For a homogeneous scenario this corresponds to the particle density. To study the performance of the algorithms for a broad range of applications, we will investigate two different scenarios. One scenario is sparse with a low particle density where we expect that the synchronization and communication dominate the overall performance. In a second scenario we explore a dense setup where more compute time is spent in the collision detection and resolution. An exemplary distribution of runtime for two specific scenarios can be found in Fig. 9. An additional third scenario with large particles is evaluated to demonstrate the features of our new synchronization algorithm. The values describing the scenarios in the upcoming paragraphs are given in dimensionless units. All scenarios kept simple to minimize the influence of other factors on our performance measurements.
4.2.1 Sparse Scenario
The sparse scenario is a granular gas confined in a static box. The granular gas consists of spherical particles. All particles are initialized on a rectangular grid (sc J.R.Hook1991 ) and have the same radius but a random initial velocity. The particle density is low enough for the particles to travel longer distances and to cross subdomain boundaries. Fig. 6 a) shows a simplified illustration of this scenario. For the weakscaling experiment a subdomain of size containing particles of radius is assigned to every process. In our largest weakscaling experiments we simulate master particles and shadow particles. This is executed on the full Juqueen with nodes using processes on cores. For the strong scaling a fixed domain size of is chosen. The domain is equally divided between all processes. This results in particles per process for the nodes run and particles for nodes run. More detailed parameters for the different scenarios can be found in Tab. 1.
In this scenario both synchronization methods exclusively use next neighbor communication. Therefore we expect both algorithms to scale equally well. Fig. 16 a) and b) show the results for the weak and strongscaling experiments using the sparse scenario. In the weakscaling experiments neither of the algorithms drops in performance. Even on the whole machine with processes both methods yield almost parallel efficiency. The slow degradation in parallel efficiency for the strongscaling gets noticeable above nodes ( processes). During the increase from nodes to all available nodes the parallel efficiency drops to approximately for both algorithms. This is due to the reduced number of particles per process. Fewer particles lead to fewer collisions and therefore proportionally less numerical effort is needed to advance the simulation. In this situation, algorithmic components that are independent of the number of particles like communication latency and logistic operations on the data structures become more prominent. These contributions dominate in strongscaling experiments when the number of processes becomes very large. If we compare both algorithms we see that both methods exhibit the same behaviour. If we consider the absolute values of performance we can calculate that the more advanced SOS method achieves roughly of the PUpCS of the NNS. Although the information communicated is the same for both synchronization methods, the SOS needs two communication cycles whereas the NNS needs only one. Therefore a slightly lower performance of the SOS as compared to the NNS is expected and is unavoidable.
4.2.2 Dense Scenario
The dense scenario consists of a periodic simulation domain confined in the z direction by solid walls. It is filled with spherical particles arranged on a hexagonal close packing (hcp J.R.Hook1991 ) lattice. The radius of the spheres is chosen such that the spheres touch their neighbors. Additionally a rotated gravitational force is applied such that the containing wall represents a ramp with inclination. All spheres have an initial velocity down the ramp. This scenario is illustrated in Fig. 6 b). This scenario is chosen because the distribution of computation time is different to the previous scenario. The communication is less significant since more of the computation time is spent inside the collision detection/resolution phase (see Fig. 9). For the experiment the simulation domain is partitioned only in the x and y direction since the height of the simulation stays the same. For the weakscaling scenario each domain contains particles. The strongscaling experiment uses the same scenario but with a simulation domain of fixed size. The number of particles per process ranges from particles on nodes to on nodes.
Again we expect the same behaviour for both algorithms as only next neighbor communication is used. The graphs in Fig. 16 c) and d) show the weak and strongscaling. The weakscaling for both algorithms shows no performance loss even for the fullmachine runs. For the strongscaling case the parallel efficiency drops to approximately for both algorithms in the largest simulations. Compared to the sparse scenario this is significantly less. Since the computation time of the simulation is directly related to the number of collisions, each particle requires more computation time in the dense scenario. Therefore particle independent effects like latency and maintenance start to dominate at fewer particles leading to a better overall strong scaling. Looking at the absolute performance difference between the two algorithms we can conclude that for dense particle ensembles the improved functionality of SOS induces only insignificant extra costs.
4.2.3 Large Particle Scenario
Since the SOS was specifically developed to tackle problems with large particles one scenario of this category is also evaluated. NNS cannot be applied here therefore only the SOS is studied. For a bidisperse scenario with spherical particles of two radii is chosen. The smaller particle always has radius 1 whereas the larger particle has a radius of 30 in the weakscaling experiment and 15 in the strongscaling experiment. A simulation area of is used as a building block. The complete simulation domain is then constructed by placing these building blocks adjacent to each other. The building block is depicted in Fig. 6 c). It is partitioned into subdomains resulting in subdomains with an edge length of . Each subdomain is assigned to exactly one process. In the center of this building block a large spherical particle is introduced. The rest of the simulation area is filled with small particles arranged on a simple cubic lattice. All particles are given a random initial velocity.
For this scenario the results for weak and strongscaling can be found in Fig. 16 e) and f). The SOS shows superb weakscaling also in this kind of scenario up to the full machine. In the strongscaling case one can observe a breakdown in performance starting at nodes. This can be explained as follows. Since the large particle is positioned at a 3d corner where eight subdomains meet it overlaps all of them. For nodes each subdomain is in size. From now on every further refinement of the domain leads to an increase of the number of subdomains that the particle overlaps. As a consequence the communication volume increases considerably and the performance deteriorates.
Note that the absolute PUpCS depend on the number of collisions per particle. Since the time needed for collision resolution is directly related to the number of collisions a dense scenario will always achieve less PUpCS than a sparse scenario. This can also be seen in Fig. 16.
parameter  sparse  dense  large 

weakscaling domain size (per process)  
strongscaling domain size (total)  
domain partitioning  3d  2d  3d 
lattice structure J.R.Hook1991  sc  hcp  sc 
grid spacing  1  1  1 
sphere radius  0.4  0.5  0.4 
large particle radius      30(15) 
initial  0.1  
time step length  1  0.01  0.1 
time steps  1000(500)  1000(300)  1000(100) 
it means that every component of the vector is a random value in that range. The number of time steps refers to the number of time steps used to build the average of the simulation time used.
4.3 Direct Comparison of both Algorithms
In this section we evaluate the parallel performance for different particle size ratios. This demonstrates the extended functionality of SOS and at the same time allows to examine the performance cost in more detail. The setup is similar to the third scenario of the last paragraph. The same building block is used but the radius of the large particle in the center is varied. To reduce fluctuation of the time measurement a grid of of these building blocks is simulated. We use the same domain partitioning into subdomains per building block as in the third scenario of the last paragraph and add an additional experiment with a finer partitioning of . The domain partitioning as well as the different sizes of the large particle are illustrated in Fig. 19. All particle sizes can be simulated with SOS but only those large particles which are marked by a dotted red line can be simulated with NNS. As long as the radius of the large sphere is not larger than the edge length of one subdomain both synchronization methods can be used. When the radius reaches a certain value only the SOS can be used. These radii are indicated by blue dashed circles. For the first domain partitioning into subdomains processes are needed to run and for the second one . All possible setups are run and the achieved performance is compared in Fig. 20.
As the previous experiments already suggest, the SOS achieves about of the PUpCS of NNS. However SOS can be applied to a wider range of scenarios. This can be seen by the fact that for SOS all data points are available whereas the graphs for NNS stop when it is impossible to simulate the next particle radius. The overall performance decreases moderately when the radius of the large particle increases. A large particle radius leads to an increased amount of communication as it overlaps more subdomains. Note also that processes whose subdomains are completely covered by a large particle will be idle which further decreases the performance. This effect could be overcome by sophisticated load balancing algorithms which however are outside the scope of this paper. The performance also drops if we progress to a finer domain subdivision. For a subdomain edge length of roughly particles are located in each subdomain. If we use a partitioning with a subdomain edge length of only particles remain per subdomain. When the workload per process is so small, a further loss in efficiency is unavoidable (compare Fig. 16b)).
We repeat the measurements with a denser setup. For that the small spherical particles are arranged on a hexagonal close packing lattice with spacing and a radius of . This setup consists of slightly more particles and also the number of collisions is higher due to the dense packing. Fig. 21 shows the performance data for this setup. As anticipated form the previous scaling experiments both synchronization methods yield almost the same PUpCS. The SOS achieves roughly of the PUpCS of NNS. And also the general trend is the same as for the sparse setup due to the same reasons.
5 Conclusion
In this paper we have presented a solution to resolve a limitation of the NNS algorithm used in rigid body dynamics. The NNS only exchanges information with next neighbors. Therefore no particle is allowed to extend into subdomains past direct adjacent ones. We have proposed a new synchronization algorithm which uses a minimal amount of pointtopoint communications to exchange the needed information even with subdomains which are not adjacent. New shadow particles are created with a diffusive approach. Careful bookkeeping and caching are applied to efficiently implement this algorithm. We then have shown that this algorithm scales as well as the NNS in various scenarios up to processes on the Juqueen supercomputer. We have also shown that for dense scenarios the NNS method can be replaced by our newly proposed SOS with only a moderate performance penalty. Eventually we have applied our new method to scenarios which previously could not have been simulated and showed a broader range of simulations which are now possible. Further improvements could be gained by introducing advanced load balancing techniques reducing the number of of idling processes in scenarios with large particles. Since the synchronization protocol is independent of the contact model both the NNS as well as our new SOS can be used in conjunction with various interaction solvers. This enables our synchronization method to be easily adapted to for example molecular dynamics.
Acknowledgment
The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gausscentre.eu) for funding this project by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUQUEEN at Jülich Supercomputing Centre (JSC).
The authors would like to acknowledge the support through the Cluster of Excellence Engineering of Advanced Materials (EAM).
Declaration of interest: none
References
References
 (1) B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, M. Karplus, Charmm: a program for macromolecular energy, minimization, and dynamics calculations, Journal of computational chemistry 4 (2) (1983) 187–217. doi:10.1002/jcc.540040211.
 (2) S. Plimpton, Fast parallel algorithms for shortrange molecular dynamics, Journal of Computational Physics 117 (1) (1995) 1–19. doi:10.1006/jcph.1995.1039.
 (3) S. Pronk, S. Páll, R. Schulz, P. Larsson, P. Bjelkmar, R. Apostolov, M. R. Shirts, J. C. Smith, P. M. Kasson, D. van der Spoel, B. Hess, E. Lindahl, Gromacs 4.5: a highthroughput and highly parallel open source molecular simulation toolkit, Bioinformatics 29 (7) (2013) 845–854. doi:10.1093/bioinformatics/btt055.
 (4) J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kale, K. Schulten, Scalable molecular dynamics with namd, Journal of computational chemistry 26 (16) (2005) 1781–1802. doi:10.1002/jcc.20289.
 (5) C. Kloss, C. Goniva, A. Hager, S. Amberger, S. Pirker, Models, algorithms and validation for opensource dem and cfd–dem, Progress in Computational Fluid Dynamics, an International Journal 12 (23) (2012) 140–152. doi:10.1504/pcfd.2012.047457.
 (6) M. Horsch, C. Niethammer, J. Vrabec, H. Hasse, Computational molecular engineering as an emerging technology in process engineering, itInformation Technology Methoden und innovative Anwendungen der Informatik und Informationstechnik 55 (3) (2013) 97–101. doi:10.1524/itit.2013.0013.
 (7) A. Tasora, R. Serban, H. Mazhar, A. Pazouki, D. Melanz, J. Fleischmann, M. Taylor, H. Sugiyama, D. Negrut, Chrono: An open source multiphysics dynamics engine, High Performance Computing in Science and Engineering (2016) 19–49doi:10.1007/9783319403618_2.
 (8) T. Preclik, U. Rüde, Ultrascale simulations of nonsmooth granular dynamics, Computational Particle Mechanics 2 (2) (2015) 173–196. doi:10.1007/s4057101500476.
 (9) MPI: A MessagePassing Interface Standard. Version 3.1., Tech. rep., Message Passing Interface Forum (2015).
 (10) M. R. S. Pinches, D. J. Tildesley, W. Smith, Large scale molecular dynamics on parallel computers using the linkcell algorithm, Molecular Simulation 6 (13) (1991) 51–87. doi:10.1080/08927029108022139.
 (11) S. Liem, D. Brown, J. Clarke, Molecular dynamics simulations on distributed memory machines, Computer Physics Communications 67 (2) (1991) 261–267. doi:10.1016/00104655(91)90021c.
 (12) D. Rapaport, Multimillion particle molecular dynamics: Ii. design considerations for distributed processing, Computer Physics Communications 62 (23) (1991) 217–228. doi:10.1016/00104655(91)900964.
 (13) K. J. Bowers, R. O. Dror, D. E. Shaw, Overview of neutral territory methods for the parallel evaluation of pairwise particle interactions, Journal of Physics: Conference Series 16 (1) (2005) 300–304. doi:10.1088/17426596/16/1/041.
 (14) K. J. Bowers, R. O. Dror, D. E. Shaw, Zonal methods for the parallel execution of rangelimited nbody simulations, Journal of Computational Physics 221 (1) (2007) 303–329. doi:10.1016/j.jcp.2006.06.014.
 (15) D. E. Shaw, A fast, scalable method for the parallel evaluation of distancelimited pairwise particle interactions, Journal of computational chemistry 26 (13) (2005) 1318–1328. doi:10.1002/jcc.20267.

(16)
M. Snir, A note on nbody computations with cutoffs, Theory of Computing Systems 37 (2) (2004) 295–318.
doi:10.1007/s0022400310710.  (17) K. Iglberger, U. Rüde, Massively parallel rigid body dynamics simulations, Computer Science  Research and Development 23 (34) (2009) 159. doi:10.1007/s0045000900668.
 (18) K. Iglberger, U. Rüde, A parallel rigid body dynamics algorithm, EuroPar 2009 Parallel Processing (2009) 760–771doi:10.1007/9783642038693_71.
 (19) P. Larsson, B. Hess, E. Lindahl, Algorithm improvements for molecular dynamics simulations, Wiley Interdisciplinary Reviews: Computational Molecular Science 1 (1) (2011) 93–108. doi:10.1002/wcms.3.
 (20) T. Schruff, R. Liang, U. Rüde, H. Schüttrumpf, R. M. Frings, Generation of dense granular deposits for porosity analysis: assessment and application of largescale nonsmooth granular dynamics, Computational Particle Mechanics (2016) 1–12doi:10.1007/s4057101601530.
 (21) C. Rettinger, C. Godenschwager, S. Eibl, T. Preclik, T. Schruff, R. Frings, U. Rüde, Fully resolved simulations of dune formation in riverbeds, in: International Supercomputing Conference, Springer, 2017, pp. 3–21. doi:10.1007/9783319586670_1.
 (22) P. Seil, S. Pirker, T. Lichtenegger, Onset of sediment transport in monoand bidisperse beds under turbulent shear flow, Computational Particle Mechanics (2017) 1–10doi:10.1007/s4057101701636.
 (23) F. Schornbaum, U. Rüde, Massively parallel algorithms for the lattice boltzmann method on NonUniform grids, SIAM Journal on Scientific Computing 38 (2) (2016) C96–C126. doi:10.1137/15M1035240.
 (24) S. Eibl, T. Preclik, U. Rüde, JUQUEEN Extreme Scaling Workshop 2017, no. FZJJSCIB201701 in JSC Internal Report, 2017, p. 47.
 (25) L. Verlet, Computer "experiments" on classical fluids. i. thermodynamical properties of lennardjones molecules, Physical Review 159 (1) (1967) 98–103. doi:10.1103/PhysRev.159.98.
 (26) R. Hockney, S. Goel, J. Eastwood, Quiet highresolution computer models of a plasma, Journal of Computational Physics 14 (2) (1974) 148–158. doi:10.1016/00219991(74)900102.
 (27) M. P. Allen, D. J. Tildesley, Computer simulation of liquids, Oxford university press, 2017.
 (28) C. Ericson, Realtime collision detection, CRC Press, 2004.
 (29) K. Erleben, J. Sporring, K. Henriksen, K. Dohlman, Physicsbased animation (graphics series) (2005).
 (30) A. Schinner, Fast algorithms for the simulation of polygonal particles, Granular Matter 2 (1) (1999) 35–43. doi:10.1007/s100350050032.
 (31) E. Perkins, J. R. Williams, A fast contact detection algorithm insensitive to object sizes, Engineering Computations 18 (1/2) (2001) 48–62. doi:10.1108/02644400110365770.
 (32) P. A. Cundall, A computer model for simulating progressive, largescale movements in blocky rock systems, Proc. Int. Symp. on Rock Fracture (1971) 11–8.
 (33) P. A. Cundall, O. D. L. Strack, A discrete numerical model for granular assemblies, Géotechnique 29 (1) (1979) 47–65. doi:10.1680/geot.1979.29.1.47.
 (34) J. J. Moreau, Unilateral contact and dry friction in finite freedom dynamics, in: Nonsmooth mechanics and Applications, Springer, 1988, pp. 1–82. doi:10.1007/9783709126240_1.
 (35) M. Anitescu, F. A. Potra, Formulating dynamic multirigidbody contact problems with friction as solvable linear complementarity problems, Nonlinear Dynamics 14 (3) (1997) 231–247. doi:10.1023/A:1008292328909.
 (36) D. E. Stewart, Rigidbody dynamics with friction and impact, SIAM Review 42 (1) (2000) 3–39. doi:10.1137/s0036144599360110.
 (37) D. M. Beazley, P. S. Lomdahl, N. GronbechJensen, R. Giles, P. Tamayo, Parallel algorithms for shortrange molecular dynamics, Annual reviews of computational physics 3 (1995) 119–175. doi:10.1142/9789812830647_0004.
 (38) C. Godenschwager, F. Schornbaum, M. Bauer, H. Köstler, U. Rüde, A framework for hybrid parallel flow simulations with a trillion cells in complex geometries, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on  SC '13, ACM, ACM Press, 2013, p. 35. doi:10.1145/2503210.2503273.
 (39) Jülich Supercomputing Centre, JUQUEEN: IBM Blue Gene/Q Supercomputer System at the Jülich Supercomputing Centre, Journal of largescale research facilities 1 (A1). doi:10.17815/jlsrf118.
 (40) M. Gilge, et al., IBM System Blue Gene Solution Blue Gene/Q Application Development, IBM Redbooks, 2014.
 (41) P. Wautelet, M. Boiarciuc, J. Dupays, S. Giuliani, M. Guarrasi, G. Muscianisi, M. Cytowski., Best Practice Guide – Blue Gene/Q, v1.1.1 edition Edition (2014).
 (42) G. Hager, G. Wellein, Introduction to high performance computing for scientists and engineers, CRC Press, 2010. doi:10.1201/ebk1439811924.
 (43) H. E. H. J. R. Hook, Solid State Physics, Second Edition, WileyBlackwell, 1991.