A Batched GPU Methodology for Numerical Solutions of Partial Differential Equations

by   Enda Carroll, et al.

In this paper we present a methodology for data accesses when solving batches of Tridiagonal and Pentadiagonal matrices that all share the same left-hand-side (LHS) matrix. The intended application is to the numerical solution of Partial Differential Equations via the finite-difference method, although the methodology is applicable more broadly. By only storing one copy of this matrix, a significant reduction in storage overheads is obtained, together with a corresponding decrease in compute time. Taken together, these two performance enhancements lead to an overall more efficient implementation over the current state of the art algorithms cuThomasBatch and cuPentBatch, allowing for a greater number of systems to be solved on a single GPU. We demonstrate the methodology in the case of the Diffusion Equation, Hyperdiffusion Equation, and the Cahn–Hilliard Equation, all in one spatial dimension. In this last example, we demonstrate how the method can be used to perform 2^20 independent simulations of phase separation in one dimension. In this way, we build up a robust statistical description of the coarsening phenomenon which is the defining behavior of phase separation. We anticipate that the method will be of further use in other similar contexts requiring statistical simulation of physical systems.



There are no comments yet.


page 7

page 11

page 13

page 21


Efficient Interleaved Batch Matrix Solvers for CUDA

In this paper we present a new methodology for data accesses when solvin...

GPU Methodologies for Numerical Partial Differential Equations

In this thesis we develop techniques to efficiently solve numerical Part...

A solver for stiff finite-rate relaxation in Baer-Nunziato two-phase flow models

In this paper we present a technique for constructing robust solvers for...

Concurrent multi-parameter learning demonstrated on the Kuramoto-Sivashinsky equation

We develop an algorithm for the concurrent (on-the-fly) estimation of pa...

Massively Parallel Stencil Strategies for Radiation Transport Moment Model Simulations

The radiation transport equation is a mesoscopic equation in high dimens...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In this paper we consider batched numerical solutions of systems of the form on a GPU, in particular where is a tridiagonal or pentadiagonal matrix. Batched solutions of such tri/pentadiagonal matrix systems are becoming increasingly prevalent methods for tackling a variety of problems on GPUs which offer a high level of parallelism zhang2010fast ; kim2011scalable ; cuThomasBatch ; gloster2019cupentbatch .

The intended application of the methodology is to the numerical solution of partial differential equations via finite-difference methods. We therefore showcase the application of these methods to the Diffusion Equation, the Hyperdiffusion Equation, and the Cahn–Hilliard Equation. We focus on examples in one spatial dimension, although higher-dimensional problems can be tackled using this same methodology, together with an Alternating Direction Implicit (ADI) discretization scheme ADIGPU ; gloster2019custen . The Diffusion Equation is one of the most-studies equations in Computational Science, and we use this to showcase the method and to quantify its performance. Similarly, we use the Hyperdiffusion Equation as a benchmark problem for a pentadiagonal solver. However, in this article, we focus in much more detail on the last example, namely the one-dimensional Cahn–Hilliard equation. In this example, we demonstrate the computational power of the new methodology by characterizing the phenomenon of coarsening and phase-separation in a statistically robust fashion.

We emphasize that the methodology in this article may find application wherever GPUs are useful for accelerating established computational procedures. For example, in Fluid Mechanics, there has been a broad application of GPUs where solutions of Poisson’s equation are commonly required hockney1964fast ; valeroPoisson , tsunami modelling and simulation reguly2018volna , numerical linear algebra laraVariableBatched ; HaidarSmallMatrices , batch solving of 1D partial differential equations gloster2019cupentbatch , ADI methods for 2D simulations ADIGPU ; gloster2019custen and modelling mesoscopic-scale flows using Lattice–Boltzmann methods pedro_lbm

. Also, in the field of gravitational wave data analysis, much work is done simplifying complex waveforms and analyses to achieve results in a physically reasonable and desirable amount of time. A large portion of this work involves using cubic splines for interpolation 

splineIEEE , a highly parallelizable process that has shown promising results using GPUs. Yet further examples of areas using GPUs include image in-painting imageInpainting , simulations of the human brain cuHines ; BrainsOilGas , and animation pixar . Consequently, the methodology presented in this article may find very broad applicability, although for definiteness we focus on the numerical solutions of certain canonical partial differential equations for the present purposes.

In the remainder of this Introduction, we place the present methodology in the context of the state-of-the-art. At the library level various implementations/functions have been developed for batched solves, examples include the Batched BLAS project dongarra2017optimized ; dongarra2017design , the CUDA libraries cuBLAS, cuSPARSE and cuSOLVER, and the Intel MKL library. In this paper we will focus on the development of tridiagonal and pentadiagonal batch solvers with single , multiple , matrices in CUDA for application in solving batches of 1D problems and 2D ADI methods. CUDA, an API extension to the C/C++ language, allows programmers to write code to be executed on NVIDIA GPUs. In particular we develop solvers in this work which are more efficient in terms of data storage than the state of the art, this saving of data usage is due to the fact we have only one global copy of the matrix rather than one for each thread/system. While the primary improvement is in data storage reduction the new functions also provide increased speedup, due to better memory access patterns, when compared to the existing state of the art. This increase in efficiency is also beneficial as it leads to further savings in resources, both in terms of run-time and electricity usage. The latter of which is of increasing concern as modern supercomputers become larger, requiring more and more electricity which has both financial and environmental implications. The need for both energy efficient hardware and algorithms is becoming an ever more prevalent trend.

To date all existing batch pentadiagonal solvers in CUDA require that each system in the batch being solved have its own copy of the matrix entries, leading to a sub-optimal use of device memory (GPU memory) in terms of unnecessary space usage and increased memory access overhead when only one global copy of is actually needed. We also note that, while there are options for single , multiple , tridiagonal and pentadiagonal matrices in cuSPARSE, these options overwrite the LHS data leading to a necessary correction step having to be performed when using these in iteration resulting in additional computational cost. Thus we propose the implementation of two solvers, one implementing the Thomas Algorithm for tridiagonal matrices and the other implementing the pentadiagonal equivalent as presented in gloster2019cupentbatch ; numalgC , each with a single global copy of the matrix and multiple matrices. We will benchmark these implementations against cuThomasBatch cuThomasBatch (implemented as gtsvInterleavedBatch in the cuSPARSE library) and existing work by the authors cuPentBatch gloster2019cupentbatch , for tridiagonal and pentadiagonal matrices respectively, which are the existing state of the art for the algorithms we are interested in. We point the user to the papers cuThomasBatch and gloster2019cupentbatch for existing benchmark comparisons of these algorithms with multiple matrices with the cuSPARSE library and comparisons with serial/OpenMP implementations. Thus we can compare our new implementations to the existing state of the art, cuThomasBatch and cuPentBatch, from which relative performance compared to the rest of cuSPARSE can be interpolated by the reader.

The paper is laid out as follows. In Section 2 we outline the modified methodology of interleaved data layout for vectors and a single global copy for the matrix. In Sections 3 and 4 we show how this methodology can be applied to tridiagonal and pentadiagonal matrices respectively. We include here benchmarks against the state-of-the-art algorithms to show performance is at least as good, if not better in most cases of batch size and matrix size. In the later parts of the paper, we move on to a specific physical application, namely the coarsening phenomenon in the 1D Cahn–Hilliard equation. As such, in Section 5 we present a numerical method for solving multiple copies of the 1D Cahn–Hilliard equation in a batch using an implicit numerical algorithm. We show how this approach can be used to generate a statistically robust description of coarsening dynamics in a phase-separating liquid. We continue with this study in Section 6

, where we show that with the same approach, a detailed parameter study of the forced Cahn–Hilliard equation (with a rich parameter space) can be explored using batch simulations to generate large amounts of data. We again characterize the data with with statistical approaches, this time using k-means clustering to demarcate distinct regions of the parameter space. Finally we present our conclusions in Section 


Figure 1: Diagram of memory hierarchy in an NVIDIA GPU. We omit nodes for the purposes of the diagram, the top and bottom branches are representative of the memory routing from Device Ram to warp. When every thread is accessing the same memory location warps which share the same SM will receive an L1 cache hit, SMs sharing an L2 cache will receive an L2 cache hit and if the data is not present in the L2 cache it will be retrieved from the Device memory. Thus it can be seen here how the first warp to request a location in memory in a given group will be the slowest but accesses for following warps become faster and faster as the memory propagates down the tree into various caches.
Figure 2: Interleaved data format of the array containing the RHS matrix used in this paper.

2 Matrix Solver Methodology

The methodologies implementing the tridiagonal and pentadiagonal algorithms we are concerned with in this paper have to date relied on using interleaved data layouts (shown schematically in Figure 2) in global memory. Each thread handles a particular system solution and the system is solved by a forward sweep followed by a backward sweep. The interleaved data format is to ensure coalescence of data accesses during theses sweeps, with each thread retrieving neighbouring memory locations in global memory for various and matrix entries cuThomasBatch . This maximises bandwidth but is limiting to the overall memory budget as each system requires its own copy of the matrix. Instead we propose to store globally just a single copy of the matrix entries with all threads accessing the desired value at the same time, the right-hand-side (RHS) matrices, , will still be stored in the same interleaved data access pattern as before. This will reduce the overall bandwidth of memory accesses to the but this will not harm the solver’s performance as we will show in later sections. Critically also this approach of using just a single matrix will save drastically on the amount of memory used in batch solvers allowing for larger batch numbers and greater matrix sizes to be solved on a single GPU, increasing hardware usage efficiency.

We now discuss the memory access pattern for the single matrix. Global memory access in GPUs is achieved via the L2 cache. When a piece of memory is requested for by a warp this memory is retrieved from global memory and copied to the L2 cache followed by the L1 cache on the SM before then being used by the relevant warps being computed on that SM. Each SM has their own pipe to the L2 cache. So when warps from different SMs all request the same memory they will all get a cache hit in at least the L2 cache except the first one to arrive which will be the warp to retrieve the data from the global memory on the device. We can guarantee that this will occur as each warp will be computing the same instruction in either the forward or backward sweep of the solver but warps on different SMs will not arrive to the L2 cache at the same time. In addition to this warps which share L1 caches will get cache hits here when they’re not the first to request the piece of memory. A diagram of this memory access pattern discussion can also be seen in Figure 1. Thus it is clear that given every piece of memory must follow the above path most of the threads solving a batched problem will benefit from speed-ups due to cache hits as they no longer need to retrieve their own copy of the data.

In the following two sections we present specific details for each of the tridiagonal and pentadiagonal schemes along with the previously discussed benchmarks. The benchmarks are performed on 16GB Tesla V100 GPUs with the pre-factorisation step for the single performed on the CPU, this pre-factorisation step is a purely serial operation and hence the CPU was chosen to perform this computation. We shall refer to the new versions of the algorithms as cuThomasConstantBatch and cuPentConstantBatch for tridiagonal and pentadiagonal respectively. We use the nomenclature ‘Constant’ denoting the fact that all systems have the same . The will be stored as usual in an interleaved data format and is computed using cuSten gloster2019custen . In both algorithms we utilise uniform memory access. In terms of thread organisation structure, two grids were used. The first, a 1D grid, performs the inversion (Thomas algorithm). The second is a 2D gird which performs the updating steps of both algorithms. For notational purposes we will use to describe the number of unknowns in our systems and to describe the batch size (number of systems being solved).

3 Tridiagonal Systems

In this section we first present the Thomas Algorithm numalgC and how it is modified for batch solutions with multiple matrices and with a single . We then present our chosen benchmark problem of periodic diffusion equations and then the results.

3.1 Tridiagonal Inversion Algorithm

We begin with a generalised tridiagonal matrix system with given by

We then solve this system using a pre-factorisation step followed by a forwards and backwards sweep. The pre-factorisation is given by


And then for


For the forwards sweep we have


While for the backwards sweep we have


And for


A summary of cuThomasConstantBatch can be found in Appendix A. The pre-factorization step performed is performed on the CPU and the remaining forward and backward sweeps of the algorithm which are performed on the GPU can be found in Algorithm 1 and Algorithm 2 in Appendix A.

Previous applications of this algorithm cuThomasBatch required that each thread had access to its own copy of vectors, the diagonals , and along with the RHS . These would then be overwritten in the pre-factorisation and solve steps to save memory, thus the total memory usage here is . We now limit the to a single global case that will be accessed simultaneously using all threads as discussed in Section 2 and retain the individual in interleaved format for each thread . This reduces the data storage to , an approximate 75% reduction. We present benchmark methodology and results for this method in the following subsections.

3.2 Benchmark Problem

For a benchmark problem we solve the Diffusion Equation, a standard model equation in Computational Science and Engineering, its presence can be seen in most systems involving heat and mass transfer. The equation on a one-dimensional interval is given as:


where is the diffusion coefficient. We solve this equation on a periodic domain of length such that with an initial condition valid on the domain . We rescale Equation (7) by setting and and integrate the equation forward in time using a standard Crank–Nicolson scheme with central differences for space which is unconditionally stable. Finite differencing is done using standard notation


where and . Thus our numerical scheme can be written as




Thus we can relate these coefficients to matrix entries by


In the next section we present our method to deal with the periodicity of the matrix and then after that present the benchmark comparison with the existing state of the art.

3.3 Periodic Tridiagonal Matrix

As the system is periodic two extra entries will appear in the matrix, one in the top right corner and another in the bottom left, thus our matrix is now given by

To deal with these we use the Sherman–Morrison formula: we rewrite our system as




Thus two tridiagonal systems must now be solved


the second of which need only be performed once at the beginning of a given simulation. Finally to recover we substitute these results into

Figure 3: Speedup of of cuThomasConstant versus cuThomasBatch (gtsvInterleavedBatch)
Figure 4: Roofline plot of the cuThomasConstant kernel for

3.4 Benchmark Results

The diffusion equation benchmark problem presented above is solved for 1000 time-steps in order to extract the relevant timing statistics and average out any effects of background processes on the benchmark. We omit start up costs such as memory allocation, setting of parameters etc. We also do not include the timing of the pre-factorization step. This operation is only performed once and its cost is so insignificant to the rest of the algorithm that we attribute it to part of the setup cost. As such, both codes are timed for just the time–stepping sections of the process. The speed-up results can be seen in Figure 3, in all situations there is a clear speed-up over the existing state of the art cuThomasBatch. The largest speed-ups are for large and moderate . Significant speed-up is available for all of large which is the domain where GPUs would typically be deployed to solve the problem. Shown also in Figure 4 is a roofline plot of the cuThomasConstantBatch kernel: it is evident that as the batch size is increased the kernel tends towards the memory roofline of the device.

It should be noted that some of the speed-up seen here can be attributed to the pre-factorisation step which the cuThomasBatch implementation does not have. cuThomasBatch requires that the matrix be reset at every time-step as the pre-factorisation and solve steps are carried out in the one function, leading to an overwrite of the data and making it unusable for repeated time-stepping. This overwriting feature has been seen in previous studies gloster2019cupentbatch where the authors also carried out a rewrite to make the benchmarking conditions between their work and the state of the art. It was shown that not all of the speed-up can be attributed to the lack of needing to reset the matrix, thus some of the performance increase we are seeing in Figure 3 can be attributed to the new data layout presented in this paper. Our second benchmark for the pentadiagonal case will also show this as there was no resetting of the matrix required.

cuThomasConstantBatch cuThomasBatch
64 0.097 0.127
256 0.161 0.267
1024 0.510 0.850
64 0.101 0.192
256 0.255 0.549
1024 1.020 1.987
64 1.278 2.894
256 5.015 11.460
1024 20.268 46.175
Table 1: Table of execution times for cuThomasConstantBatch and cuThomasBatch (measured in seconds).

4 Pentadiagonal Systems

In this section we first present a modified version of the pentadiagonal inversion algorithm as presented in gloster2019cupentbatch with a single . Then we present the benchmark problem and finally this is followed by the results comparing our new implementation with existing state of the art.

4.1 Pentadiagonal Inversion Algorithm

In this section we describe a standard numerical method numalgC ; gloster2019cupentbatch for solving a pentadiagonal problem . We present the algorithm in a general context so we have a pentadiagonal matrix given by

Three steps are required to solve the system:

  1. Factor to obtain and .

  2. Find from

  3. Back-substitute to find from

Here, , and are given by the following equations:


(the other entries in and are zero). The explicit factorisation steps for the factorisation are as follows:

  1. For each

The steps to find are as follows:

Finally, the back-substitution steps find are as follows:

Previous applications of this algorithm gloster2019cupentbatch required that each thread had access to its own copy of vectors, the diagonals , , , and along with the . These would then be overwritten in the pre-factorisation and solve steps to save memory, thus the total memory usage here is . We now limit the to a single global case that will be accessed simultaneously using all threads as discussed in Section 2 and retain the individual in interleaved format for each thread . This reduces the data storage to , this is an approximate 83% reduction in data usage. We present benchmark methodology and results for this method in the following subsections.

4.2 Benchmark Problem

Figure 5: Speedup of of cuPentConstantBatch versus cuPentBatch
Figure 6: Roofline plot of the cuPentConstantBatch kernel for

For a benchmark problem we solve the hyperdiffusion equation in a one-dimensional interval , given here as:


where is the hyperdiffusion coefficient. We solve this equation on a periodic domain of length such that with an initial condition valid on the domain . We rescale Equation (17) by setting and and integrate the equation forward in time using a standard Crank–Nicolson scheme with central differences for space which is unconditionally stable. Finite differencing is done the same standard notation as in Section 3 (specifically, Equation (8)). Thus, our numerical scheme can be written as


where now means


For a more more detailed exposition of this benchmark problem, see Reference gloster2019cupentbatch .

4.3 Benchmark Results

We plot the speed-up of cuPentConstantBatch versus cuPentBatch solving batches of the above method for the hyperdiffusion equation in Figure 5. Like cuThomasConstantBatch, the pre-factorization step is not included in these timings. We use cuSten gloster2019custen to calculate the values for . We also present a roofline plot for the the cuPentConstantBatch kernel in Figure 6. Similar to cuThomasConstantBatch, the cuPentConstantBatch kernel is ultimately limited by the memory constraint of the device.

From Figure 5, it can be seen that cuPentBatchConstant outperforms cuPentBatch consistently for high values of both and . At low and the two methods are roughly equivalent; indeed, there are some areas where cuPentBatch performs better. However, in this part of the parameter space, standard CPU implementation is more sensible – taking into account the movement over the PCI lane along with other overheads. Furthermore, if we examine much larger values of and than those plotted, the superior performance of cuPentBatchConstant becomes much more apparent. The performance difference in these cases can be orders of magnitude, as the memory for cuPentBatch rapidly exceeds the available memory on the GPU while cuPentBatchConstant can still fit. We explore this part of the parameter space much more thoroughly in Section 5.

cuPentConstantBatch cuPentBatch
64 369.902 376.321
256 461.727 487.455
1024 658.475 717.109
64 460.857 523.693
256 429.127 535.657
1024 1344.377 1495.057
64 1786.018 2562.037
256 6535.750 9707.131
1024 25893.904 38407.516
Table 2: Table of execution times for cuPentConstantBatch and cuPentBatch (measured in seconds).

5 Physical Application – Coarsening and the Cahn–Hilliard Equation

In this section, we study the Cahn–Hilliard equation in one spatial dimension. The purpose of this section is twofold – we wish to showcase the GPU methodology as applied to a meaningful physical problem. We also wish to explore the coarsening phenomenon of the one-dimensional Cahn–Hilliard equation, in a statistically robust fashion. This can be readily achieved with the present GPU methodology, which enables the simulation of individual copies of the Cahn–Hilliard equation.

We begin by presenting the Cahn–Hilliard equation in one spatial dimension, valid on an interval :


Here, is a scalar concentration field taking arbitrary real values. The concentration field is used to demarcate different domains of a binary mixture CH_orig . Thus, a local solution indicates a region rich in one binary fluid component, whereas a local solution indicates a region rich int he other fluid component. Here also, is a positive constant; the domains are separated by transition regions of width . We assume for definiteness that Equation (20) is posed on the interval with periodic boundary conditions, although other standard boundary conditions (e.g. Neumann, Dirichlet) are possible also.

With such standard boundary conditions, the mean value of the concentration is conserved:


This can be seen by doing differentiation under the integral in Equation (21) and replacing with . Application of the periodic boundary conditions to the resulting integral gives


hence . The value of is a a key parameter in this section and in Section 6.

Any constant value is a solution of Equation (20). This constant solution can be characterized by introducing a perturbed solution


where is a mean-zero perturbation and is ‘small’ in the sense that . As such, the solution (23) with is a stable solution to Equation (20), in the sense of linear stability analysis; that is, . In contrast, the solution (23) with is unstable, and furthermore has the property that , in domains. The domains correspond to phase separation – regions rich in one fluid component or the other, and are separated by transition zones of width . The spontaneous formation of domains from the initial condition (23) in this instance is called coarsening. A sample numerical simulation showing the coarsening phenomenon is shown in Figure 7.

The coarsening phenomenon in the Cahn–Hilliard equation is dimension-dependent. In one dimension, theoretical arguments argentina2005coarsening predict that the size of a typical domain should grow as , at late times. In contrast, in higher dimensions, the size of a typical domain grows as , which is the famous Lifshitz–Slyozov coarsening law LS . The growth of the ‘typical’ domain size in this way is referred to as scaling. As such, the aim of this section is to confirm the theoretical one-dimensional scaling law.

Figure 7: Space time plot showing the time evolution of the solution of the 1D Cahn–Hilliard equation. Simulation parameters: , domain size , grid-points, such that , to properly resolve the transition zones between domains. The initial condition is

, the random fluctuations are numbers drawn from the uniform distribution on the interval

. The simulation is run out to a final time and the numerical scheme is the one presented in Equation (24).

5.1 1D Numerical Scheme

Equation (20) is solved implicitly on a uniform grid. We use a superscript to denote evaluation of the solution at time . Taking the hyperdiffusion term to the , and discretising in time we get


We apply standard second-order accurate differencing to the spatial derivatives:


This, along with the periodicity of the domain, yields a pentadiagonal matrix system of the form

Here, the coefficients of the matrix in Equation (26a) have the following meaning:

where we have taken , and for brevity. This system can then be solved following the methodology presented for cyclic pentadiagonal matrices as found in navon_pent ; gloster2019cupentbatch ; gloster2019efficient , this methodology also influenced our choice of second order accuracy above. All of the necessary matrix calculations were carried out using an in-house finite-difference library developed by the authors gloster2019custen . We use a time-step size


to ensure stability of the scheme. This is a heuristic. Stability of the numerical algorithm (

24) is not bound by the linear hyperdiffusion term, which is treated implicitly. Instead, the numerical algorithm may become unstable due to the explicit treatment of the non-linear term. The instability is therefore controlled by taking the timestep sufficiently small, but still proportional to , along the lines of Equation (27).

In the following section we show the convergence features of the scheme which is clearly second order accurate in space due to the choice of the central differences above.

5.2 1D Convergence Study

For the convergence study we set and . We set the time-step using Equation (27) and apply a cosine initial condition given by


with . The convergence results are presented in Figure 8 and Table 3. We can see that this scheme for intermediate numbers of grid points is second-order accurate except for very high resolutions where the scheme converges with first-order accuracy. In all cases, since the solutions of the Cahn–Hilliard equation are smooth, and possess smooth features even on the scale of the transition zones between domains (i.e. no ‘shocks’), these orders of convergence are acceptable.

Figure 8: Plot showing the convergence of the scheme in equation 24. The slope is matching the spatial accuracy of the scheme.
128 0.0310 3.7932
256 0.0022 2.0376
512 2.0089
1024 2.0022
2048 2.0005
4096 0.8947
Table 3: Details of convergence study of semi-implicit scheme for solving the 1D Cahn–Hilliard equation.

5.3 Batched 1D Scaling

We now examine the how the scaling behaviour of the coarsening phenomenon in the 1D Cahn–Hilliard equation. In Reference argentina2005coarsening , theoretical arguments are given for why the size of a typical domain should scale as


Thus, in order to measure the ‘typical domain size’, we must average across the different domains that appear transiently in a single simulation, and across multiple independent simulations. To do this, we follow the computational methodology for solving batches of hyperdiffusion equations as presented in gloster2019cupentbatch ; gloster2019efficient and combine it with the method for solving individual 1D Cahn–Hilliard equations presented above in Section 5.1. We store each individual of the system in interleaved format and the cuSten library gloster2019custen is used to compute the required non-linear finite differences. Each simulation is given its own randomised initial condition where values are drawn from a uniform distribution between and . Thus, the average value of the concentration field is , valid for all times (cf. Equation (21)). We therefore consider coarsening with respect to the well-mixed state , wherein both components of the binary fluid are present in equal amounts.

The model equation (24) is solved in batch mode on a GPU, time-stepping until and again we set . In order to measure we compute the quantity


the connection between the a typical domain size

within a given simulation, and the variance

is established in Reference LennonAurore . The quantity is captured for each simulation at every time-step and then averaged across simulations, to produce an average instantaneous value for the typical domain size, which we denote by .

(a) Scaling of domain size .
(b) Scaling of domain size .
Figure 9: Plots showing the scaling in as a function of .

We plot as a function of in Figures 8(a) and 8(b) for domains of size and respectively. In both cases again to ensure so that the transition region between domains is well resolved. In both cases we simulate

independent systems in parallel on a single GPU in order to ensure the space is well sampled and that we average out the statistic of interest as much as possible. In both cases, we fit lines with linear regression as we are testing that the quantities in equation (

29) are linearly proportional. For the case in Figure 8(a) we report a value of

for the correlation coefficient of the linear regression analysis, while for

in Figure 8(b) we report a value of . Clearly these results confirm the theoretical findings in argentina2005coarsening and provide a good example of how a GPU can be used to efficiently solve a large batch of separate 1D equations.

Figure 10: Plot of numerical solution to equation (33) showing regions of .

6 Forced 1D Cahn–Hilliard Flow Pattern Map

In this section we again use the GPU methodology to study the 1D Cahn–Hilliard equation. The focus of this section is on the forced Cahn–Hilliard equation, which introduces additional parameters into the problem. The structure of the solution of the Cahn–Hilliard equation then depends on these parameters. We use a batch of simulations to characterize the parameter space. The aim here is again twofold: to learn something about the forced Cahn–Hilliard equation, and also to showcase potential uses of the GPU methodology.

To fix ideas, in this section we focus on a travelling-wave forced Cahn–Hilliard equation in one spatial dimension, recalled here as


This introduces new parameters into the problem based on the forcing term: the forcing amplitude , the wavenumber , and the travelling wave-speed . In view of the sinusoidal nature of the forcing term, the average concentration is conserved and represents a further parameter in the problem. Assuming a fixed value of , the parameter space to be explored is therefore four-dimensional.

We emphasize that the forced Cahn–Hilliard equation with sinusoidal forcing is a well-established model in the literature, and is used to describe the Ludwig–Soret effect in binary liquids, wherein a temperature gradient induces a concentration gradient krekhov2004phase . The presence of a travelling-wave forcing has been studied before, e.g. Reference weith2009traveling , and the mathematical properties of the travelling-wave solutions have been characterized in Reference lennonWave .

Based on Equation (31), three distinct solution types have been isolated previously, by theoretical analysis weith2009traveling ; lennonWave (the nomenclature is taken from these references):

  • A2-travelling-wave solutions – these are travelling-wave solutions of Equation (31), which consist of a sinusoidal travelling wave superimposed on a mean concentration value . Example: Figure 11.

  • A1-travelling-wave solutions – these are travelling-wave solutions of Equation (31) which consist of interconnected domains (varying between and ). The interconnected domain structure moves with the travelling-wave velocity.

  • A0 solutions – these occur in regions of the parameter space where the forcing term is not strong enough to impose a travelling-wave structure on the solutions. Instead, the solution consists of standard domain structures, which move from time-to-time in an irrgular fashion due to the forcing term. In short, these are intermediate solutions, where there is dynamic competition between phase separation and forcing.

Previously, in Reference lennonWave , these different solution types were identified with specific regions of parameter space – this was done by theoretical analysis in the first instance, combined with a visual inspection of a limited number of numerical simulations. The result was colloquially called a ‘flow-pattern map’, in analogy with the flow-pattern maps for multiphase flow in the Chemical Engineering literature Hewitt1982 . The aim in this section of the paper is to see if these results can be reproduced using batch solution of Equation (31

), together with standard techniques from machine learning.

Figure 11: Plot of numerical solution to equation(33) showing regions oscillation around a mean value of .

To produce a good quality flow-pattern map with well-defined regions of parameter space as in Reference lennonWave

, a large number of simulations is required followed by some visual inspection. This approach is slow and inefficient as it leaves the researcher having to study each numerical solution to classify the finding, particularly in cases where an analytic metric cannot be used to help determine the classification of a given solution in the flow-pattern map. Another key drawback of this method is it can be hard to draw out the precise boundaries between two solution regions, particularly when there is a high number of simulations and a lack of an analytic expression to divide the parameter space. In this section of the paper, we therefore present a proposed solution which addresses these issues. The solution has two parts:

  • First, a method to produce the necessary large batches of simulation in parallel using a GPU, this allows for extremely large and diverse data-sets to be produced for the study,

  • Next, an application of k-means clustering to classify the results automatically to produce the desired flow-pattern maps.

Before proceeding with the analysis we first of all rewrite Equation (31) in the frame of reference of the travelling wave, this ensures we have a consistent position relative to the wave when feeding it into our classification method. The desired coordinate transformation is given by


Under this transformation equation (31) becomes

Figure 12: Plot showing the convergence of the forced Cahn–Hilliard scheme in equation 34. The slope is matching the spatial accuracy of the scheme.

6.1 Numerical Scheme and Convergence Study

Before presenting the analysis of results for equation (33), we first of all present the numerical scheme and a convergence study. As in previous sections, we discretise the hyperdiffusion term in Equation (33) implicitly to give a version of equation (24) with additional terms on the :


Thus the solution methodology with pentadiagonal inversions and central differences is the same as Section 5.1 except we need to discretise the advection term . In order to deal with this term we use a standard Hamilton–Jacobi WENO scheme presented in fedkiwBook . We choose this scheme as it correctly differences the advection term in the direction of travel and is accurate in smooth regions, this scheme has been implemented using the cuSten library gloster2019custen .

We proceed with an identical convergence study to Section 5.2, we have set the additional necessary parameters as follows , and . The convergence results are presented in Figure 12 and Table 4. In both cases we can see clear second order convergence as expected as we have been using at least second order accurate stencils in all of our calculations.

256 0.0013 2.8363
512 1.9869
1024 2.0086
2048 1.9673
Table 4: Details of convergence study of semi-implicit scheme for solving the 1D forced Cahn–Hilliard equation.

6.2 k-Means Clustering

k-means clustering is a standard method for partitioning observations into clusters based on distance to the nearest centroid. Distance in this sense is calculated using Euclidean distance in a space of suitable dimension for the dataset. In this work we use MATLAB’s implementation of the k-means clustering algorithm which is itself based on Lloyd’s algorithm kmeans . This algorithm is built on two steps which alternate, the first step is to assign each observation to a cluster based on its Euclidean distance to each of the centroids, the assigned cluster is the closest centroid. Then the second step is to calculate a new set of centroids by calculating the mean position of all the observations within each assigned cluster. These steps alternate until convergence has been reached. The algorithm is initialised by randomly creating the desired number of centroid (3 in our case).

6.3 Clustering Results

In this section we present the results of running batches of the Cahn–Hilliard equation with a travelling wave and classifying the solutions using the k-means algorithm. The methodology will be to solve batches of equations where we fix the wave-speed and wave-number for all systems within a given batch and then vary and within the batch. will be varied in the range and in the range . In order to accurately capture the travelling wave we set the domain size to and set the number of points in the domain to , it was found that this high number of was required in order to accurately resolve the travelling wave at high values of and . This choice of also gives us the maximum batch size of the simulations that can be run on a single GPU, in this study our GPU has 12GB of space, thus we determine that we can have a total number of 589824 simulations in each batch. Thus the resolution of our flow pattern maps will be dividing the ranges of and appropriately. The initial conditions in this section are drawn from a random uniform distribution of values between and with set to a value of . We simulate all of the systems up to a final time of .

Figure 13: Flow-pattern map of equation (34) with and , the white area corresponds to A0 solutions, grey to A1 solutions and black to A2 solutions.

We begin by examining the case and , the results of which are presented in Figure 13. In this plot we can see the clear extraction of regions of different solutions, the top left grey region corresponds to solution type A1 and the right side black region corresponds to solution type A2. A mix of solutions can be seen in the bottom left white region of the domain corresponding to region A0 along with some areas of A1 and A2. Between these two regions k-means has been able to pick out an exact boundary extending from to the top of the domain. Some noise can be seen along this boundary where the k-means algorithm has struggled to differentiate between very similar solution types as the behaviour transitions, This also explains the less well behaved bottom left region where the algorithm has been unable to deal with areas near the boundaries.

The results presented in Figure 13 are in keeping with those presented in lennonWave ; where the classification was performed by visual inspection. Three regions in approximately the same locations were found, solution type A0 in the bottom left, A1 on the top left and then A2 on the right side of the parameter space. We now increase the value of to see if the boundary between A0 and A1 moves higher up the axis.

Figure 14: Flow-pattern map of equation (34) with and , the white area corresponds to A0 solutions, grey to A1 solutions and black to A2 solutions.

We now examine the case and , the results of which are presented in Figure 14. The k-means algorithm has struggled here to extract clean boundaries between the A0 region and the other two. This can be attributed to the fact that these are hard to distinguish, even by eye. We can see that the algorithm has successfully classified the regions A1 and A2 broadly quite well, with a less well defined boundary than previously. Thus we can conclude that k-means is a useful approach for broadly classifying large parameter spaces in terms of flow pattern maps yet will struggle with more subtle transitions between regions.

Instead k-means could be used as a first step in building a model to classify simulation data for a flow pattern map. Indeed one could take the well defined regions from a large number of simulations to then train a logistic regression classifier or neural network, these may produce better results as they may detect features that are not as obvious to the eye or representable using a Euclidean distance as in the k-means algorithm. Also additional features beyond the raw solution could be included in the dataset such as a time series of the free energy or an increased number of time-steps of given solutions.

7 Conclusions

Summarizing, we have presented new algorithms that utilise improved data access when solving batches of Triadiagonal and Pentadiagonal matrices that all share the same LHS matrix. This new methodology of an interleaved batch of RHS matrices along with a single LHS achieves large reductions in data usage along with a substantial speed-up when compared with the existing state of the art. The reduction in data usage allows for greater use of hardware resources to be made with significant extra space that can now be devoted to solving more systems of equations simultaneously rather than unnecessarily storing data. The extra equations now being solved by one GPU along with the speed-ups provided by the new implementation are a significant improvement over the state of the art in applications where only one LHS matrix is required for the solution of all the systems in the batch. In addition the reduction in the number of GPUs one would need when solving very large batches, coupled with the speed-ups achieved, leads to a reduction in electricity usage which has both economical and environmental implications for HPC.

In this article, we have used a non-trivial physical problem to motivate the development of the GPU methodology – notably, the Cahn–Hilliard equation in one spatial dimension. We have demonstrated how the GPU methodology can be used to perform up to simulations of the Cahn–Hilliard equation – the resulting data can be used to build a statistically robust description of the coarsening phenomenon in one-dimensional phase separation. The sheer number of simulations used here to reveal the scaling behaviour in one sptial dimension is without precedent in the literature.

We have also used the physical test-bed of the Cahn–Hilliard equation with a forcing term to show how the different solution-types of equation can be classified as a function of the forcing-term parameters. Again, we use a very large number of simulations to generate simulation data which in this instance can be fed into standard machine-learning algorithms. This approach automates the construction of parameter spaces (‘flow-pattern maps’) which were previously produced by hand – although visual inspection is still required on occasion for ambiguous edge cases.

We emphasize finally that the numerical algorithms at the heart of the Cahn–Hilliard test-bed are rather generic (finite-differences, batch solution of partial differential equations, etc.). Consequently, the approach outlined in this article may find broad use in Computational Fluid Mechanics, and in Computational Science and Engineering more broadly.


Andrew Gloster acknowledges funding received from the UCD Research Demonstratorship. Enda Caroll acknowledges funding recieved under the Government of Ireland Postgraduate Scholarship Programme funded by Irish Research Council, grant number GOIPG/2018/2653. All authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPUs used for this research. The authors also thank Lung Sheng Chien for helpful discussions throughout this project. The authors wish to acknowledge the DJEI/DES/SFI/HEA Irish Centre for High-End Computing (ICHEC) for the provision of computational facilities and support.

Appendix A Tridiagonal Algorithm

1:function preFactConstantTri(, , , )
3:     for  do
6:     end for
8:end function
Algorithm 1 Pre-Factorization Step of the Thomas Algorithm (performed on CPU)
1:function cuThomasConstantBatch(, , , , , )
2:     rowID threadId.x + blockDim.x * blockIdx.x
3:     if rowID  then
5:         for  do
6:              rowID rowID
8:         end for
9:         for  do
10:              rowID rowID
12:         end for
13:     end if
14:end function
Algorithm 2 Batch Thomas Algorithm (performed on GPU)

Appendix B Pentadiagonal Algorithm

1:function preFactConstantPent(, , , , , )
7:     for  do
12:     end for
18:end function
Algorithm 3 Pre-Factorization Step of the Pentadiagonal Inversion Algorithm (performed on CPU)
1:function cuPentConstantBatch(, , , , , , , )
2:     rowID threadId.x + blockDim.x * blockIdx.x
3:     if rowID  then
5:         rowID rowID
7:         for  do
8:              rowID rowID
10:         end for
11:         rowID rowID
13:         for  do
14:              rowID rowID
16:         end for
17:     end if
18:end function
Algorithm 4 Batch Pentadiagonal Inversion Algorithm (performed on GPU)



  • (1) Yao Zhang, Jonathan Cohen, and John D Owens. Fast tridiagonal solvers on the GPU. ACM Sigplan Notices, 45(5):127–136, 2010.
  • (2) Hee-Seok Kim, Shengzhao Wu, Li-wen Chang, and W Hwu Wen-mei. A scalable tridiagonal solver for GPUs. In 2011 International Conference on Parallel Processing, pages 444–453. IEEE, 2011.
  • (3) Pedro Valero-Lara, Ivan Martínez-Pérez, Raül Sirvent, Xavier Martorell, and Antonio J. Peña. NVIDIA GPUs scalability to solve multiple (batch) tridiagonal systems implementation of cuThomasBatch. In Roman Wyrzykowski, Jack Dongarra, Ewa Deelman, and Konrad Karczewski, editors, Parallel Processing and Applied Mathematics, pages 243–253, Cham, 2018. Springer International Publishing.
  • (4) Andrew Gloster, Lennon Ó Náraigh, and Khang Ee Pang. cuPentBatch – A batched pentadiagonal solver for NVIDIA GPUs. Computer Physics Communications, 241:113–121, 2019.
  • (5) Zhangping Wei, Byunghyun Jang, Yaoxin Zhang, and Yafei Jia. Parallelizing alternating direction implicit solver on GPUs. Procedia Computer Science, 18:389 – 398, 2013. 2013 International Conference on Computational Science.
  • (6) Andrew Gloster and Lennon Ó Náraigh. cuSten – CUDA finite difference and stencil library. SoftwareX, 10:100337, 2019.
  • (7) Roger W Hockney. A fast direct solution of Poisson’s equation using Fourier analysis. Technical report, STANFORD UNIV CA STANFORD ELECTRONICS LABS, 1964.
  • (8) Pedro Valero-Lara, Alfredo Pinelli, and Manuel Prieto-Matias. Fast finite difference Poisson solvers on heterogeneous architectures. Computer Physics Communications, 185(4):1265 – 1272, 2014.
  • (9) Istvan Z Reguly, Daniel Giles, Devaraj Gopinathan, Laure Quivy, Joakim H Beck, Michael B Giles, Serge Guillas, and Frederic Dias. The VOLNA-OP2 tsunami code (version 1.5). Geoscientific Model Development, 11(11):4621–4635, 2018.
  • (10) P. Valero-Lara, I. Martínez-Pérez, S. Mateo, R. Sirvent, V. Beltran, X. Martorell, and J. Labarta. Variable batched DGEMM. In 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pages 363–367, March 2018.
  • (11) A. Haidar, A. Abdelfattah, M. Zounon, S. Tomov, and J. Dongarra. A guide for achieving high performance with very small matrices on GPU: A case study of batched LU and Cholesky factorizations. IEEE Transactions on Parallel and Distributed Systems, 29(5):973–984, May 2018.
  • (12) Pedro Valero-Lara. Reducing memory requirements for large size lbm simulations on gpus. Concurrency and Computation: Practice and Experience, 29(24):e4221, 2017. e4221 cpe.4221.
  • (13) P. K. Mohanty, M. Reza, P. Kumar, and P. Kumar. Implementation of cubic spline interpolation on parallel skeleton using pipeline model on CPU -GPU cluster. In 2016 IEEE 6th International Conference on Advanced Computing (IACC), pages 747–751, Feb 2016.
  • (14) M. Akbar, Pranowo, and S. Magister.

    Computational acceleration of image inpainting alternating-direction implicit (ADI) method using GPU CUDA.

    In 2017 International Conference on Control, Electronics, Renewable Energy and Communications (ICCREC), pages 185–189, Sep. 2017.
  • (15) Pedro Valero-Lara, Ivan Martínez-Perez, Antonio J. Peña, Xavier Martorell, Raül Sirvent, and Jesús Labarta. cuHinesBatch: Solving multiple Hines systems on GPUs Human Brain Project. Procedia Computer Science, 108:566 – 575, 2017. International Conference on Computational Science, ICCS 2017, 12-14 June 2017, Zurich, Switzerland.
  • (16) Pedro Valero-Lara, Ivan Martínez-Pérez, Raül Sirvent, Antonio J Peña, Xavier Martorell, and Jesús Labarta. Simulating the behavior of the human brain on GPUs. Oil & Gas Science and Technology–Revue d’IFP Energies nouvelles, 73:63, 2018.
  • (17) Michael Kass. Interactive depth of field using simulated diffusion on a GPU. Pixar Animation Studios, Jan 2006.
  • (18) Jack Dongarra, Sven Hammarling, Nicholas J Higham, Samuel D Relton, and Mawussi Zounon. Optimized batched linear algebra for modern architectures. In European Conference on Parallel Processing, pages 511–522. Springer, 2017.
  • (19) Jack Dongarra, Sven Hammarling, Nicholas J Higham, Samuel D Relton, Pedro Valero-Lara, and Mawussi Zounon. The design and performance of batched BLAS on modern high-performance computing systems. Procedia Computer Science, 108:495–504, 2017.
  • (20) Gisela Engeln-Müllges and Frank Uhlig. Numerical Algorithms with C. Springer-Verlag, Berlin, Heidelberg, 1996.
  • (21) J. W. Cahn and J. E. Hilliard. Free energy of a nonuniform system. i. interfacial energy. J. Chem. Phys, 28:258–267, 1957.
  • (22) Mederic Argentina, MG Clerc, R Rojas, and E Tirapegui. Coarsening dynamics of the one-dimensional Cahn-Hilliard model. Physical Review E, 71(4):046210, 2005.
  • (23) I. M. Lifshitz and V. V. Slyozov. The kinetics of precipitation from supersaturated solid solutions. J. Chem. Phys. Solids, 19:35–50, 1961.
  • (24) I. M. Navon. Pent: A periodic pentadiagonal systems solver. Communications in Applied Numerical Methods, 3(1):63–69, 1987.
  • (25) Andrew Gloster, Enda Carroll, Miguel Bustamante, and Lennon O’Naraigh. Efficient interleaved batch matrix solvers for CUDA, 2019.
  • (26) Aurore Naso and Lennon Ó Náraigh. A flow-pattern map for phase separation using the Navier Stokes Cahn Hilliard model. European Journal of Mechanics - B/Fluids, 72:576 – 585, 2018.
  • (27) AP Krekhov and L Kramer. Phase separation in the presence of spatially periodic forcing. Physical Review E, 70(6):061801, 2004.
  • (28) Vanessa Weith, Alexei Krekhov, and Walter Zimmermann. Traveling spatially periodic forcing of phase separation. The European Physical Journal B, 67(3):419–427, 2009.
  • (29) Lennon Ó Náraigh. Travelling-wave spatially periodic forcing of asymmetric binary mixtures. Physica D: Nonlinear Phenomena, 393:24 – 37, 2019.
  • (30) G. F. Hewitt. Flow regimes. In G. Hetsroni, editor, Handbook of multiphase systems. McGraw-Hill Book Company, New York, 1982.
  • (31) S Osher, R Fedkiw, and K Piechor. Level set methods and dynamic implicit surfaces, 2004.
  • (32) S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, March 1982.