1 Introduction
The objective of this work is to propose a GPU implementation of a 3D fluid simulation with immersed objects, using the immersed boundary method (IBM), and to demonstrate a significant speed-up compared to a CPU implementation, even in situations in which the immersed objects are densely packed.
To solve the incompressible Navier-Stokes equations for the fluid, we use the lattice Boltzmann method (LBM) (Krüger, 2016), within the Bhatnagar-Gross-Krook (BGK) framework. Several efficient GPU implementations already exist for this model, and our work is based on the implementation of Adrien Python (Python, 2018), which is part of the Palabos framework (Latt, 2011).
The LBM simulations are carried out on a regular, static, Eulerian mesh. This kind of simulation is well adapted to GPU implementations, and massive accelerations are achieved through many-core parallelization. A natural way to implement the LBM on a GPU is to assign a GPU thread to each lattice cell. With this strategy, our tests show a speed-up of roughly a factor of 20 as compared to a solid CPU implementation (the Palabos code). In this comparison, a standard high-end CPU and GPU are used, and all cores of the CPU are exploited through MPI parallelism. Building on this, the challenge of the present work consisted in the integration of an IBM model into the software framework while maintaining a substantial performance gain of the GPU as compared to the CPU.
The immersed object is represented by a set of Lagrangian points, which are superimposed on the Eulerian fluid mesh. The interaction between these points and the fluid is implemented in terms of a direct-forcing IBM, as described by Ota, Suzuki and Inamuro (Ota, 2012). This algorithm is difficult to parallelize, because it involves non-local operations, such as the computation of averaged quantities over the extent of the interaction kernel of a point. Another difficulty stems from the mismatch of memory representations between the fluid's Eulerian lattice and the object's Lagrangian points, which makes coalesced memory accesses difficult (Nvidia, 2007).
Our approach is based on the work of Valero-Lara, Pinelli, and Prieto-Matias (ValeroLara, 2014), in which the LBM fluid iterations and the computation of the IBM force are carried out in separate CUDA kernels, adopting a different type of memory traversal in either case. A number of optimisations are proposed to achieve acceptable performance. As expected, this strategy cannot achieve ideal levels of parallelism, because parts of the computation remain sequential, and the bandwidth of memory accesses is suboptimal. Nevertheless, the implementation achieves a substantial speed-up of approximately a factor of 19 compared to a modern, multi-core CPU, and is capable of efficiently handling situations with a large number of densely packed immersed objects.
The article first presents the LBM and IBM methods used, and then details the algorithm of the CUDA implementation. The final part presents test cases and numerical results.
2 Lattice Boltzmann method
The computational domain is represented by a regular, homogeneous lattice. The degrees of freedom of the model consist of the populations f_i, a group of 19 variables on each lattice cell. At the beginning of each iteration of the fluid model, the macroscopic variables density (ρ) and velocity (u) are computed from the populations on each lattice cell as follows:

ρ = Σ_{i=0}^{18} f_i ,    u = (1/ρ) Σ_{i=0}^{18} c_i f_i     (1)
In these equations, the c_i are 19 discrete lattice vectors that reflect the connection between a cell and one of its neighbors, such as for example the vector (1, 0, 0). This choice of vectors leads to the so-called D3Q19 model, which is summarized in (Krüger, 2016). The values of the populations are updated at each iteration in two phases, the collision and the streaming phase. In the collision phase, the populations are updated locally, without communication with any neighbor. We use the so-called BGK collision model (for the initials of its authors Bhatnagar, Gross, and Krook) (Bhatnagar, 1954):
f_i^out(x, t) = f_i(x, t) − (1/τ) [ f_i(x, t) − f_i^eq(x, t) ]     (2)
The constant τ is a characteristic relaxation time, which relates to the fluid viscosity and depends on the time and space resolution. The equilibrium term f_i^eq depends on all populations of the local cell, through ρ and u, and is computed as follows:
f_i^eq = w_i ρ ( 1 + 3 (c_i · u) + (9/2) (c_i · u)^2 − (3/2) u^2 )     (3)
In this equation, the 19 values w_i are constants, listed in (Krüger, 2016), which can be understood as weighting factors that compensate for the fact that some of the lattice vectors are longer than others. In the second phase, the streaming, the populations are propagated to their neighbouring cells, in the direction of the corresponding vector c_i:
f_i(x + c_i, t + 1) = f_i^out(x, t)     (4)
This concludes the description of the fundamental fluid algorithm. However, to account for the presence of immersed objects through the IBM, a forcing term is added to the method, which modifies the collision phase. The computation of the value of the force resulting from the IBM is described in detail in the next section. For the time being, we describe it as a general function g = F(X, U) that depends on all Lagrangian coordinates X of the points describing the surface of the immersed object, and on the velocity U in all cells of the lattice, and produces a force g to be exerted on all cells of the lattice. To apply the obtained force to the fluid, it is multiplied by the relaxation time τ and added to the velocity used to compute the equilibrium function. In summary, the following pseudo-code describes the full lattice Boltzmann algorithm:

1. Computation of ρ and u according to (1).
2. Computation of the force g = F(X, U), as described in the next section.
3. Inclusion of the force into the velocity term for the computation of the equilibrium: u ← u + τ g on each lattice cell.
4. Computation of the 19 equilibrium values f_i^eq on each lattice cell according to (3).
5. Execution of the collision phase according to (2).
6. Execution of the streaming phase according to (4).
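The equilibrium and collision computations above can be sketched in serial C++ as follows. The lattice vectors and weights are the standard D3Q19 values from (Krüger, 2016); the function names are ours, not those of the implementation.

```cpp
#include <array>
#include <cmath>

// D3Q19 lattice vectors and weights (rest: 1/3, axis: 1/18, diagonal: 1/36),
// as listed in (Krüger, 2016).
static const int c[19][3] = {
  {0,0,0},
  {1,0,0},{-1,0,0},{0,1,0},{0,-1,0},{0,0,1},{0,0,-1},
  {1,1,0},{-1,-1,0},{1,-1,0},{-1,1,0},
  {1,0,1},{-1,0,-1},{1,0,-1},{-1,0,1},
  {0,1,1},{0,-1,-1},{0,1,-1},{0,-1,1}
};
static const double w[19] = {
  1./3,
  1./18,1./18,1./18,1./18,1./18,1./18,
  1./36,1./36,1./36,1./36,1./36,1./36,
  1./36,1./36,1./36,1./36,1./36,1./36
};

// Equilibrium populations, Eq. (3).
std::array<double,19> equilibrium(double rho, const double u[3]) {
  std::array<double,19> feq;
  double u2 = u[0]*u[0] + u[1]*u[1] + u[2]*u[2];
  for (int i = 0; i < 19; ++i) {
    double cu = c[i][0]*u[0] + c[i][1]*u[1] + c[i][2]*u[2];
    feq[i] = w[i] * rho * (1. + 3.*cu + 4.5*cu*cu - 1.5*u2);
  }
  return feq;
}

// BGK collision on one cell, Eq. (2): relax the populations towards equilibrium.
void collide(std::array<double,19>& f, double rho, const double u[3], double tau) {
  std::array<double,19> feq = equilibrium(rho, u);
  for (int i = 0; i < 19; ++i) f[i] -= (f[i] - feq[i]) / tau;
}
```

A useful sanity check is that the moments of the equilibrium reproduce ρ and ρu exactly, by construction of the weights.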
3 Immersed boundary method
The goal of the algorithm is to compute a force for each lattice cell which, once applied to the fluid, enforces a no-slip condition along the object surface. In the case of a non-moving object, this amounts to enforcing a zero fluid velocity along the surface; otherwise, a velocity equal to the local surface velocity of the object.
We will make a clear distinction between the velocity defined at the Eulerian positions of the lattice cells, denoted by a capital letter U, and the lower-case velocity u defined at the Lagrangian positions X_k on the surface of the obstacle. The first step of the immersed-boundary algorithm consists of the computation of the fluid velocity at the Lagrangian positions on the surface of the object. It is computed as the weighted sum of the velocities of the lattice cells in the neighborhood of the Lagrangian point:

u(X_k) = Σ_x U(x) W(x − X_k)
The neighbourhood of the Lagrangian position X_k is defined by the weighting function W. The value of this function at a given fluid cell depends on the distance between the Lagrangian point and the fluid cell. It has a cut-off at 2 lattice units along each axis, which means that the weighted sum is in practice computed only over a kernel of 4×4×4 cells around the Lagrangian point. We use the weighting function defined by Peskin (peskin, 2000):

W(x − X_k) = δ(x_1 − X_k,1) δ(x_2 − X_k,2) δ(x_3 − X_k,3),
with δ(r) = (1/4) (1 + cos(π r / 2)) for |r| < 2, and δ(r) = 0 otherwise.
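Peskin's cosine kernel can be transcribed in C++ as follows (a minimal sketch; the function names are ours, and the 3D weight is the product of three 1D kernels):

```cpp
#include <cmath>

// 1D Peskin cosine kernel (peskin, 2000):
// delta(r) = (1/4)(1 + cos(pi*r/2)) for |r| < 2, and 0 otherwise.
double peskinDelta(double r) {
  static const double PI = std::acos(-1.0);
  return std::fabs(r) < 2.0 ? 0.25 * (1.0 + std::cos(PI * r / 2.0)) : 0.0;
}

// 3D weight W(x - X) as a product of 1D kernels over the coordinate separations.
double weight(const double x[3], const double X[3]) {
  return peskinDelta(x[0] - X[0]) * peskinDelta(x[1] - X[1]) * peskinDelta(x[2] - X[2]);
}
```

A useful property of this kernel is that, in each dimension, the weights of the 4 lattice points covering a Lagrangian position sum to 1, which makes the interpolation consistent.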
As a next step, the force g_k exerted at the point X_k is computed as the mismatch between the object and fluid velocities:

g_k = u_b(X_k) − u(X_k)

where u_b is the local velocity of the surface of the obstacle at position X_k. This provides a first estimate for ΔU, the velocity correction to be exerted on the fluid by the IBM:

ΔU(x) = Σ_k g_k W(x − X_k) S_k

where S_k is the area of the surface part represented by the Lagrangian point.
Finally, the fluid velocity is updated in the Eulerian mesh cells, to obtain a corrected field at iteration l+1 as follows:

U^{l+1}(x) = U^{l}(x) + ΔU^{l}(x)

At this point, the full procedure starts over, with the purpose of obtaining a converged velocity field through a procedure of fixed-point iterations. As in the first iteration, a corrective force is computed, leading to an updated value of the velocity correction ΔU^{l+1}. This procedure should in principle be repeated until convergence is reached, or in other terms, until the computed force correction is negligibly small. In practice, we follow the recommendation of Ota, Suzuki, and Inamuro (Ota, 2012) and apply a fixed, small number of iterations.
Finally, the corrections accumulated on the Eulerian cells are used to compute the force g to be applied to each lattice cell, i.e. the sum of the velocity corrections over the fixed-point iterations (in lattice units, where Δt = 1):

g(x) = Σ_l ΔU^{l}(x)
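The fixed-point procedure can be illustrated by a one-dimensional toy model with a single Lagrangian point. This is a sketch under our own assumptions (grid size, target velocity and the iteration count are arbitrary illustrative choices, and S_k = 1):

```cpp
#include <cmath>
#include <vector>

// 1D Peskin cosine kernel, as in the weighting function above.
static double delta1d(double r) {
  static const double PI = std::acos(-1.0);
  return std::fabs(r) < 2.0 ? 0.25 * (1.0 + std::cos(PI * r / 2.0)) : 0.0;
}

// One fixed-point pass: interpolate U at the Lagrangian position X, compute the
// velocity correction towards the target velocity ub, and scatter it back onto
// the 4 covering grid points. Returns the remaining mismatch at X.
double ibmIteration(std::vector<double>& U, double X, double ub) {
  int j0 = (int)std::floor(X) - 1;
  double uInterp = 0.0;
  for (int j = j0; j < j0 + 4; ++j) uInterp += U[j] * delta1d(j - X);
  double du = ub - uInterp;                                       // correction g_k
  for (int j = j0; j < j0 + 4; ++j) U[j] += du * delta1d(j - X);  // scatter, S_k = 1
  return du;
}
```

Each pass reduces the mismatch by a constant factor, so a small, fixed number of iterations already brings the interpolated velocity close to the target.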
4 CUDA implementation
CUDA is an application programming interface, working with the C, C++, and Fortran languages, which allows general-purpose code to run on NVidia GPUs. The CUDA API is articulated around functions called kernels, which are called from the CPU but executed on the GPU in a multi-threaded manner. Threads are grouped into blocks, which can have a three-dimensional shape. In this case, a kernel can refer to the current thread through 3 coordinates: threadIdx.x, threadIdx.y, threadIdx.z. The blocks can themselves be arranged in a three-dimensional grid, using the three coordinates blockIdx.x, blockIdx.y, blockIdx.z. All in all, the current thread is referred to by 6 coordinates in a kernel.
Only GPU memory is accessible from within a thread. There exist 3 types of GPU memory. The first, global memory, is accessible by all threads, but is relatively slow. The second, shared memory, is shared among all threads of the same block and is significantly faster. The fastest type of GPU memory are the registers, accessible only by the current thread and used for local variables.
Global memory accesses are most efficient if they are coalesced (Nvidia, 2007), meaning that neighbouring threads access neighbouring addresses. As an example, the following Ccode instruction in a CUDA kernel
myGlobalArray[threadIdx.x] += 1;
represents a coalesced access, while the following does not
myGlobalArray[2*threadIdx.x] += 1;
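The difference can be quantified by counting the memory segments needed to serve one warp of 32 threads. The following C++ sketch assumes 4-byte elements and 128-byte memory transactions, which are typical values for the hardware discussed in (Nvidia, 2007); the function name is ours:

```cpp
#include <cstddef>
#include <set>

// Number of distinct 128-byte memory segments touched by a warp of 32 threads
// when thread tid accesses the 4-byte element at index stride*tid.
int segmentsTouched(int stride) {
  std::set<std::size_t> segments;
  for (std::size_t tid = 0; tid < 32; ++tid)
    segments.insert(tid * stride * 4 / 128);   // byte address -> segment index
  return (int)segments.size();
}
```

With stride 1 (the coalesced pattern), the warp's accesses fall into a single transaction; with stride 2, the same amount of useful data requires twice the memory traffic.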
The shapes of blocks and grids are set up at the moment of a kernel call. Kernel calls are executed sequentially by default, and the only way to synchronize code across different thread blocks is to split the work into subsequent kernel calls.
4.1 Kernel calls
We divided each iteration of the LBM-IBM algorithm into different kernels for two reasons: firstly, to synchronize some parts of the computation, and secondly, because we used different parallelization strategies and thread configurations for different parts of the algorithm.
In total, four kernels were used:
- The kernel lbm_u_start computes the fluid velocity U from the populations f_i.
- The kernels ib_force1 and ib_force2 compute the force g from the coordinates X of the immersed object points and the fluid velocity U. This step has been split into two parts to allow synchronization of the threads between the computation of the force by ib_force1 and the update of U and g in ib_force2.
- The kernel lbm_computation computes the lattice Boltzmann collision and streaming phases, depending on the populations of the previous step, and on U and g.
The kernels lbm_u_start and lbm_computation are very close to the kernels described in Adrien Python's work (Python, 2018) and are therefore not explained any further.
4.2 The kernel ib_force1
The purpose of this kernel is to compute the vectors u(X_k) and g_k at each point of the immersed object. Each thread is responsible for the computation of one vector u(X_k) and one vector g_k at a single Lagrangian point. A loop over neighboring lattice cells is carried out to access their Eulerian velocities U. In this case, the corresponding memory accesses are often not coalesced. Indeed, fluid cells are arranged in memory according to a regular matrix ordering, row by row. As a consequence, they cannot be accessed consecutively as the loop over the Lagrangian object positions and their Eulerian neighborhood is carried out, as can be seen in Figure 2a. As will be shown in the benchmark section, the impact of these non-coalesced accesses on the overall performance remains acceptable, as they correspond to relatively small memory chunks (the neighborhood of the solid surfaces) compared to the overall size of the fluid domain. Furthermore, the performance of this kernel could be improved by reordering the Lagrangian positions on the object surface in the order of appearance of their nearest neighbor in the row-by-row matrix data structure, leading to an increased occurrence of coalesced accesses.
This kernel is also responsible for moving the points in the case of a non-static object, an operation that is fully performed within the GPU's memory to improve performance. In this work, we focus entirely on rigid-body motion, such as that of a rotating propeller, and therefore express the motion as a linear transformation applied to all points of the object. This transformation is applied at the first fixed-point iteration only. The updated positions are then stored and reloaded at the subsequent iterations, as they are needed to compute the point velocities u_b. The vectors u(X_k) and g_k are finally stored back into global memory, to be reused in subsequent kernel calls. These memory accesses have a negligible impact on performance, because they are coalesced and are applied to a comparatively small amount of data.
The pseudocode of this kernel has the following shape:
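In serial C++ form, the per-thread work of this kernel may be sketched as follows. This is a sketch under our own naming and data-layout assumptions, not the authors' exact code; each iteration of the outer loop corresponds to one CUDA thread:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// 1D Peskin cosine kernel, as in the IBM section.
static double delta1d(double r) {
  static const double PI = std::acos(-1.0);
  return std::fabs(r) < 2.0 ? 0.25 * (1.0 + std::cos(PI * r / 2.0)) : 0.0;
}

// Serial sketch of ib_force1: for every Lagrangian point (= one CUDA thread),
// interpolate the Eulerian velocity U over the 4x4x4 kernel and compute the
// force g_k = u_b(X_k) - u(X_k). U is stored row by row on an ny*nz-strided lattice.
void ibForce1(const std::vector<Vec3>& U, int ny, int nz,
              const std::vector<Vec3>& X, const std::vector<Vec3>& ub,
              std::vector<Vec3>& g) {
  for (std::size_t k = 0; k < X.size(); ++k) {
    Vec3 u = {0, 0, 0};
    int i0 = (int)std::floor(X[k].x) - 1;
    int j0 = (int)std::floor(X[k].y) - 1;
    int l0 = (int)std::floor(X[k].z) - 1;
    for (int i = i0; i < i0 + 4; ++i)
      for (int j = j0; j < j0 + 4; ++j)
        for (int l = l0; l < l0 + 4; ++l) {
          double w = delta1d(i - X[k].x) * delta1d(j - X[k].y) * delta1d(l - X[k].z);
          const Vec3& Uc = U[(std::size_t)(i * ny + j) * nz + l];  // non-coalesced on GPU
          u.x += w * Uc.x; u.y += w * Uc.y; u.z += w * Uc.z;
        }
    g[k] = {ub[k].x - u.x, ub[k].y - u.y, ub[k].z - u.z};
  }
}
```

The indexing of U illustrates the issue discussed above: consecutive values of l are contiguous in memory, but the i and j offsets of the kernel jump across rows and planes of the lattice.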
4.3 The kernel ib_force2
This kernel computes the corrected velocity U and the force g for each cell of the lattice.
The parallelization strategy is fundamentally different from the one adopted in ib_force1: a Lagrangian position X_k is assigned to a full CUDA block, and all threads of the block load the same values X_k and g_k. But each thread is assigned a single Eulerian position in the neighborhood of X_k, and is responsible for adding the appropriate value to U and g in this cell, as illustrated in Figure 2b. In other words, CUDA blocks are built to have the same shape as the neighbourhood of a Lagrangian position. With this strategy, write operations into the U and g variables are coalesced within one block.
In this case, different blocks may concurrently write into U and g at the same cell. To avoid resulting race conditions, the write operations are made atomic with the help of the CUDA function atomicAdd().
The pseudocode of this kernel is written as follows:
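Again in serial C++ form, and under our own naming assumptions (the block-per-point parallelism and atomicAdd() are replaced by plain loops and additions), the work of this kernel may be sketched as follows:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// 1D Peskin cosine kernel, as in the IBM section.
static double delta1d(double r) {
  static const double PI = std::acos(-1.0);
  return std::fabs(r) < 2.0 ? 0.25 * (1.0 + std::cos(PI * r / 2.0)) : 0.0;
}

// Serial sketch of ib_force2: each Lagrangian point (= one CUDA block) scatters
// its velocity correction g_k, weighted by the kernel and the surface area S_k,
// into the Eulerian correction field dU. On the GPU, each of the 4x4x4 cells is
// handled by one thread of the block, and the addition uses atomicAdd().
void ibForce2(std::vector<Vec3>& dU, int ny, int nz,
              const std::vector<Vec3>& X, const std::vector<Vec3>& g,
              const std::vector<double>& S) {
  for (std::size_t k = 0; k < X.size(); ++k) {
    int i0 = (int)std::floor(X[k].x) - 1;
    int j0 = (int)std::floor(X[k].y) - 1;
    int l0 = (int)std::floor(X[k].z) - 1;
    for (int i = i0; i < i0 + 4; ++i)
      for (int j = j0; j < j0 + 4; ++j)
        for (int l = l0; l < l0 + 4; ++l) {
          double w = delta1d(i - X[k].x) * delta1d(j - X[k].y)
                   * delta1d(l - X[k].z) * S[k];
          Vec3& cell = dU[(std::size_t)(i * ny + j) * nz + l];
          cell.x += w * g[k].x;  // atomicAdd() on the GPU
          cell.y += w * g[k].y;
          cell.z += w * g[k].z;
        }
  }
}
```

Since the kernel weights of one point sum to 1, the total correction scattered into the lattice by a point equals g_k S_k, which makes the scheme conservative.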
5 Optimisations
One of the problems that stem from the IBM is that the velocity U needs to be precomputed prior to the collision phase, as it is needed for the computation of the IBM force. As a result, all cell populations need to be loaded twice, once to compute the velocity and once to compute the collision phase of the lattice Boltzmann scheme. Our optimisation attempts are based on the observation that for the IBM, U is required only in the vicinity of the object surface, and can therefore be precomputed in a domain of limited extent.
We tried two optimisation strategies:
- The box strategy, in which U is precomputed only in a bounding box around all points on the object surface. Given that the object may move, the bounding box needs to be recomputed at every iteration. To achieve this efficiently in our algorithm, the assumption of rigid-body motion is used, and the bounding box of the object is computed only in the rest position. Then, at every iteration, the linear transformation is applied to the rest-position bounding box to obtain the current one.
- The kernel strategy, in which U is precomputed exactly at the points at which it will be needed, in the neighbourhood of the object surface. The implementation of this strategy is similar to that of the kernel ib_force2, using a CUDA block to compute U in the neighbourhood of a Lagrangian point. The obtained domain is tighter than the one resulting from the box strategy. However, the same value of U is computed multiple times, since there are substantial overlaps between the neighborhoods of different points. As a result, this strategy is sometimes faster than the box strategy, and in some cases slower.
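The bounding-box update of the box strategy can be sketched as follows. Here a rotation about the z axis plus a translation stands in for the rigid-body transformation, and the names are ours:

```cpp
#include <algorithm>
#include <cmath>

struct Box { double lo[3], hi[3]; };

// Apply a rigid-body transformation (rotation by angle theta about z, followed
// by a translation t) to the rest-position bounding box: transform all 8
// corners and take the componentwise min/max.
Box transformBox(const Box& rest, double theta, const double t[3]) {
  Box out;
  for (int a = 0; a < 3; ++a) { out.lo[a] = 1e300; out.hi[a] = -1e300; }
  for (int corner = 0; corner < 8; ++corner) {
    double p[3];
    for (int a = 0; a < 3; ++a) p[a] = ((corner >> a) & 1) ? rest.hi[a] : rest.lo[a];
    double q[3] = {
      std::cos(theta) * p[0] - std::sin(theta) * p[1] + t[0],
      std::sin(theta) * p[0] + std::cos(theta) * p[1] + t[1],
      p[2] + t[2]
    };
    for (int a = 0; a < 3; ++a) {
      out.lo[a] = std::min(out.lo[a], q[a]);
      out.hi[a] = std::max(out.hi[a], q[a]);
    }
  }
  return out;
}
```

Because only 8 corners are transformed, this update is negligible compared to the cost of a fluid iteration.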
6 Test cases
Our test cases implement a rotating propeller, the geometry of which was created artistically, without any assumption on the use of the propeller in fluid engineering. The surface mesh of the propeller was built with the help of the CAD functionality of the Blender software and consists of 3930 Lagrangian points. Before superimposing the surface mesh on the regular fluid mesh, it is rescaled to a size at which the area assigned to a Lagrangian point corresponds approximately to a 2D cross-section of a 3D fluid cell, in order to guarantee the accuracy of the IBM. The fluid volume is resolved by a regular lattice of cells, as shown in Figure 4. The benchmark cases are executed during 1000 iterations, and the performance of the code is assessed, as is customary in the LBM community, by a time-averaged measure of million lattice-cell updates per second (Mlups).
Both GPU and CPU tests were run on the parallel computer Baobab at the University of Geneva. The CPU is an Intel Xeon E5-2643 v3 at 3.40 GHz. The GPU is an NVIDIA Tesla P100-PCIE with 3584 CUDA cores at 1.33 GHz.
We ran 6 tests with one to six immersed and simultaneously rotating propellers. Each test was executed using both the kernel strategy and the box strategy, and the results are shown in Figure 5. In all six tests, the kernel strategy achieved better performance than the box strategy, and the difference in performance increased with the number of propellers. Our best measured performances correspond to 893 Mlups in the single-propeller case and 650 Mlups for 6 propellers.
These performance values are similar to those obtained by other authors, and are more than an order of magnitude above the performance obtained on a CPU. The purpose of this article, however, is to show that our implementation strategy yields a substantial speed-up compared to a CPU even with a much larger number of Lagrangian points, representing a situation of a dense arrangement of immersed objects. Indeed, one can consider such a situation to be in principle more favorable for a CPU than for a GPU implementation, due to the frequent irregular memory traversal patterns.
For the sake of comparison, the CPU version of the code was executed with the help of the high-performance LBM library Palabos, using MPI parallelism to exploit all 12 cores available on the test CPU. Figure 6 shows a comparison of the two GPU versions and the Palabos CPU version for one to six propellers, and for an extreme case including 18 propellers. In the one-propeller case, the best performance is 893 Mlups on the GPU against 45.3 Mlups on the CPU (the GPU is 19.7 times faster), and in the 18-propeller case, the GPU yields 344.2 Mlups against 21.7 Mlups for the CPU (the GPU is 15.8 times faster). In the case of 18 propellers, the simulation uses 70740 Lagrangian points, and the neighborhoods of the immersed objects fill a substantial volume of the fluid. In this case, the kernel strategy loses its advantage over the box strategy, and both yield approximately the same performance.
7 Conclusion
In this article, we presented a GPU implementation of an immersed-boundary LBM, capable of simulating moving, immersed rigid objects substantially faster than a CPU implementation of the same problem. Similarly to other publications in the field, the implementation yields a speed-up of the GPU against the CPU of approximately a factor of 20. But unlike other articles, we consider cases in which multiple moving immersed objects are densely packed inside the fluid, in which the advantage of the GPU over the CPU remains substantial, with a speed-up of approximately a factor of 15.
The GPU algorithm is split into sections which, individually, are straightforward and relatively similar to their sequential counterparts. The test runs show that the overall performance of the implementation decreases as the number of Lagrangian points is increased. This is unsurprising, as the number of Lagrangian points exceeds the number of available GPU cores, and is also compatible with the observation that the IBM algorithm is subject to a limited memory bandwidth, as the memory accesses of this algorithm are very often uncoalesced. Nevertheless, the GPU retains a speed-up of an order of magnitude compared to the CPU, and should therefore be considered an almost compulsory choice, even for nontrivial problems including densely packed moving objects.
Although it would have been interesting to test the scaling of the implementation to an even larger number of Lagrangian points, this was not possible due to the limited amount of memory on the tested NVidia GPU. More generally, there is a need to generalize the proposed algorithm to a multiGPU context to access larger domains, a project which we reserve for future work.
References
- Krüger (2016) T. Krüger, H. Kusumaatmaja, A. Kuzmin, O. Shardt, G. Silva, and E. M. Viggen. The Lattice Boltzmann Method: Principles and Practice. Graduate Texts in Physics. Springer, 2016.
- Python (2018) Adrien Python. Implémentation sur GPU de simulation de fluide avec la méthode Lattice Boltzmann pour des exécutions hybrides avec Palabos. Master of Science in Computer Science, Université de Genève, February 2018.
- Latt (2011) The Palabos library. http://www.palabos.org/
- Bhatnagar (1954) P. L. Bhatnagar, E. P. Gross, and M. Krook. A Model for Collision Processes in Gases. I. Small Amplitude Processes in Charged and Neutral One-Component Systems. Phys. Rev. 94, 511, May 1954.
- Ota (2012) Keigo Ota, Kosuke Suzuki, and Takaji Inamuro. Lift generation by a two-dimensional symmetric flapping wing: immersed boundary-lattice Boltzmann simulations. Fluid Dynamics Research, vol. 44, p. 045504, 2012.
- peskin (2000) M.-C. Lai and C. S. Peskin. An immersed boundary method with formal second-order accuracy and reduced numerical viscosity. J. Comput. Phys. 160, 705-719, 2000.
- Nvidia (2007) https://devblogs.nvidia.com/even-easier-introduction-cuda/
- Nvidia (2007) https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
- ValeroLara (2014) Pedro Valero-Lara, Alfredo Pinelli, and Manuel Prieto-Matias. Accelerating Solid-Fluid Interaction using Lattice-Boltzmann and Immersed Boundary Coupled Simulations on Heterogeneous Platforms. Procedia Computer Science, vol. 29, pp. 50-61, 2014.
- Nvidia (2007) https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions
- Nvidia (2007) https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-c-kernels/
- Nvidia (2007) https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
- Tölke (2010) Jonas Tölke. Implementation of a Lattice Boltzmann kernel using the Compute Unified Device Architecture developed by nVIDIA. Computing and Visualization in Science 13, 29-39, 2010.