1 Introduction
In this paper we consider batched numerical solutions on a GPU of systems of the form Ax = d, in particular where A is a tridiagonal or pentadiagonal matrix. Batched solutions of such tri/pentadiagonal matrix systems are becoming increasingly prevalent for tackling a variety of problems on GPUs, which offer a high level of parallelism zhang2010fast ; kim2011scalable ; cuThomasBatch ; gloster2019cupentbatch .
The intended application of the methodology is the numerical solution of partial differential equations via finite-difference methods. We therefore showcase the application of these methods to the Diffusion Equation, the Hyperdiffusion Equation, and the Cahn–Hilliard Equation. We focus on examples in one spatial dimension, although higher-dimensional problems can be tackled using the same methodology together with an Alternating Direction Implicit (ADI) discretisation scheme ADIGPU ; gloster2019custen . The Diffusion Equation is one of the most-studied equations in Computational Science, and we use it to showcase the method and to quantify its performance. Similarly, we use the Hyperdiffusion Equation as a benchmark problem for a pentadiagonal solver. However, in this article we focus in much more detail on the last example, namely the one-dimensional Cahn–Hilliard equation. In this example, we demonstrate the computational power of the new methodology by characterizing the phenomenon of coarsening and phase separation in a statistically robust fashion.
We emphasize that the methodology in this article may find application wherever GPUs are useful for accelerating established computational procedures. For example, in Fluid Mechanics there has been a broad application of GPUs: solutions of Poisson's equation are commonly required hockney1964fast ; valeroPoisson , as are tsunami modelling and simulation reguly2018volna , numerical linear algebra laraVariableBatched ; HaidarSmallMatrices , batch solving of 1D partial differential equations gloster2019cupentbatch , ADI methods for 2D simulations ADIGPU ; gloster2019custen and modelling of mesoscopic-scale flows using Lattice–Boltzmann methods pedro_lbm . Also, in the field of gravitational-wave data analysis, much work is done simplifying complex waveforms and analyses to achieve results in a physically reasonable and desirable amount of time. A large portion of this work involves using cubic splines for interpolation splineIEEE , a highly parallelizable process that has shown promising results using GPUs. Yet further examples of areas using GPUs include image inpainting imageInpainting , simulations of the human brain cuHines ; BrainsOilGas , and animation pixar . Consequently, the methodology presented in this article may find very broad applicability, although for definiteness we focus on the numerical solutions of certain canonical partial differential equations for the present purposes.
In the remainder of this Introduction, we place the present methodology in the context of the state of the art. At the library level, various implementations/functions have been developed for batched solves; examples include the Batched BLAS project dongarra2017optimized ; dongarra2017design , the CUDA libraries cuBLAS, cuSPARSE and cuSOLVER, and the Intel MKL library. In this paper we focus on the development of tridiagonal and pentadiagonal batch solvers in CUDA with a single matrix A and multiple RHS vectors d, for application in solving batches of 1D problems and in 2D ADI methods. CUDA, an API extension to the C/C++ language, allows programmers to write code to be executed on NVIDIA GPUs. In particular, the solvers we develop in this work are more efficient in terms of data storage than the state of the art; this saving is due to the fact that we keep only one global copy of the matrix rather than one for each thread/system. While the primary improvement is in data-storage reduction, the new functions also provide increased speedup compared to the existing state of the art, due to better memory access patterns. This increase in efficiency is also beneficial as it leads to further savings in resources, both in terms of runtime and electricity usage. The latter is of increasing concern as modern supercomputers become larger, requiring more and more electricity, which has both financial and environmental implications.
The need for both energy efficient hardware and algorithms is becoming an ever more prevalent trend.
To date, all existing batch pentadiagonal solvers in CUDA require that each system in the batch have its own copy of the matrix entries, leading to a suboptimal use of device memory (GPU memory) in terms of unnecessary space usage and increased memory access overhead, when only one global copy of the matrix is actually needed. We also note that, while there are options for tridiagonal and pentadiagonal matrices with multiple RHS vectors in cuSPARSE, these options overwrite the LHS data, so a correction step must be performed when using them in an iteration, resulting in additional computational cost. Thus we propose the implementation of two solvers, one implementing the Thomas Algorithm for tridiagonal matrices and the other implementing the pentadiagonal equivalent as presented in gloster2019cupentbatch ; numalgC , each with a single global copy of the matrix and multiple RHS vectors. We benchmark these implementations against cuThomasBatch cuThomasBatch (implemented as gtsvInterleavedBatch in the cuSPARSE library) and existing work by the authors, cuPentBatch gloster2019cupentbatch , for tridiagonal and pentadiagonal matrices respectively; these are the existing state of the art for the algorithms we are interested in. We point the reader to the papers cuThomasBatch and gloster2019cupentbatch for existing benchmark comparisons of these algorithms (with multiple matrices) against the cuSPARSE library and against serial/OpenMP implementations. Thus we can compare our new implementations to the existing state of the art, cuThomasBatch and cuPentBatch, from which relative performance compared to the rest of cuSPARSE can be interpolated by the reader.
The paper is laid out as follows. In Section 2 we outline the modified methodology of an interleaved data layout for the RHS vectors and a single global copy of the matrix. In Sections 3 and 4 we show how this methodology can be applied to tridiagonal and pentadiagonal matrices respectively. We include benchmarks against the state-of-the-art algorithms to show that performance is at least as good, and better in most cases of batch size and matrix size. In the later parts of the paper we move on to a specific physical application, namely the coarsening phenomenon in the 1D Cahn–Hilliard equation. As such, in Section 5 we present a numerical method for solving multiple copies of the 1D Cahn–Hilliard equation in a batch using an implicit numerical algorithm. We show how this approach can be used to generate a statistically robust description of coarsening dynamics in a phase-separating liquid. We continue with this study in Section 6
, where we show that, with the same approach, a detailed parameter study of the forced Cahn–Hilliard equation (with a rich parameter space) can be carried out using batch simulations to generate large amounts of data. We again characterize the data with statistical approaches, this time using k-means clustering to demarcate distinct regions of the parameter space. Finally, we present our conclusions in Section
7.

2 Matrix Solver Methodology
The methodologies implementing the tridiagonal and pentadiagonal algorithms we are concerned with in this paper have to date relied on interleaved data layouts (shown schematically in Figure 2) in global memory. Each thread handles a particular system, and the system is solved by a forward sweep followed by a backward sweep. The interleaved data format ensures coalescence of data accesses during these sweeps, with each thread retrieving neighbouring memory locations in global memory for the various matrix and RHS entries cuThomasBatch . This maximises bandwidth but is limiting to the overall memory budget, as each system requires its own copy of the matrix. Instead, we propose to store globally just a single copy of the matrix entries, with all threads accessing the desired value at the same time; the right-hand-side (RHS) vectors will still be stored in the same interleaved data access pattern as before. This will reduce the overall bandwidth of memory accesses to the matrix entries, but this will not harm the solver's performance, as we show in later sections. Critically, this approach of using just a single matrix drastically reduces the amount of memory used in batch solvers, allowing larger batch numbers and greater matrix sizes to be solved on a single GPU and increasing hardware usage efficiency.
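As an illustration, the interleaved layout can be sketched in a few lines of host-side Python. This is a sketch for exposition only (the function name `interleaved_index` is ours; the actual solvers index global memory inside CUDA kernels):

```python
import numpy as np

# Hypothetical sizes: N unknowns per system, M systems in the batch.
N, M = 4, 3

def interleaved_index(i, j, M):
    """Flat index of unknown i of system j in interleaved storage.

    Consecutive systems j, j+1, ... store unknown i next to each other,
    so consecutive GPU threads read consecutive addresses (coalescence).
    """
    return i * M + j

flat = np.arange(N * M)            # stand-in for the interleaved RHS storage
rhs_of_system_1 = flat[1::M]       # all N unknowns of system j = 1
unknown_2_all = flat[2 * M:3 * M]  # unknown i = 2 across all M systems
```

The second slice is the point of the layout: when every thread is working on unknown i = 2 of its own system, the addresses touched form one contiguous run.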
We now discuss the memory access pattern for the single matrix. Global memory access on GPUs is achieved via the L2 cache. When a piece of memory is requested by a warp, it is retrieved from global memory and copied to the L2 cache, followed by the L1 cache on the SM, before being used by the relevant warps executing on that SM. Each SM has its own path to the L2 cache. So when warps from different SMs all request the same memory, they will all get a cache hit in at least the L2 cache, except the first one to arrive, which will be the warp that retrieves the data from the global memory on the device. We can guarantee that this occurs because each warp computes the same instruction in either the forward or backward sweep of the solver, but warps on different SMs do not arrive at the L2 cache at the same time. In addition, warps which share an L1 cache will get cache hits there when they are not the first to request the piece of memory. A diagram of this memory access pattern can be seen in Figure 1. Thus it is clear that, given every piece of memory must follow the above path, most of the threads solving a batched problem will benefit from speedups due to cache hits, as they no longer need to retrieve their own copy of the data.
In the following two sections we present specific details for each of the tridiagonal and pentadiagonal schemes, along with the previously discussed benchmarks. The benchmarks are performed on 16GB Tesla V100 GPUs, with the prefactorisation step for the single matrix performed on the CPU; this prefactorisation step is a purely serial operation, hence the CPU was chosen to perform it. We shall refer to the new versions of the algorithms as cuThomasConstantBatch and cuPentConstantBatch for tridiagonal and pentadiagonal respectively, the nomenclature 'Constant' denoting the fact that all systems share the same matrix A. The RHS vectors are stored as usual in an interleaved data format and are computed using cuSten gloster2019custen . In both algorithms we utilise uniform memory access. In terms of thread organisation, two grids were used: the first, a 1D grid, performs the inversion (Thomas algorithm); the second is a 2D grid which performs the updating steps of both algorithms. For notational purposes we will use N to describe the number of unknowns in our systems and M to describe the batch size (the number of systems being solved).
3 Tridiagonal Systems
In this section we first present the Thomas Algorithm numalgC and how it is modified for batch solutions with multiple RHS vectors and a single matrix A. We then present our chosen benchmark problem of periodic diffusion equations, followed by the results.
3.1 Tridiagonal Inversion Algorithm
We begin with a generalised tridiagonal matrix system Ax = d, with the matrix given by
We then solve this system using a prefactorisation step followed by a forwards and a backwards sweep. Writing the sub-, main and super-diagonals of the matrix as a_i, b_i and c_i, and the RHS as d_i, the prefactorisation is given by

(1) c*_1 = c_1 / b_1

And then for i = 2, ..., N - 1

(2) c*_i = c_i / (b_i - a_i c*_{i-1})

For the forwards sweep we have

(3) d*_1 = d_1 / b_1

(4) d*_i = (d_i - a_i d*_{i-1}) / (b_i - a_i c*_{i-1}),   i = 2, ..., N

While for the backwards sweep we have

(5) x_N = d*_N

And for i = N - 1, ..., 1

(6) x_i = d*_i - c*_i x_{i+1}
A summary of cuThomasConstantBatch can be found in Appendix A. The prefactorisation step is performed on the CPU; the remaining forward and backward sweeps of the algorithm, which are performed on the GPU, can be found in Algorithm 1 and Algorithm 2 in Appendix A.
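For exposition, the prefactorisation and the two sweeps can be sketched in serial Python. The function names are ours and the code is a reference sketch, not the CUDA kernels: on the GPU, each thread runs the sweeps for its own interleaved RHS while sharing `cp` and `w`.

```python
import numpy as np

def thomas_prefactor(a, b, c):
    """One-off factorisation of a tridiagonal matrix (sub a, main b, super c).

    Returns the modified super-diagonal cp and the pivots w, which are
    shared by every right-hand side in the batch.
    """
    n = len(b)
    cp = np.empty(n)
    w = np.empty(n)
    w[0] = b[0]
    cp[0] = c[0] / w[0]
    for i in range(1, n):
        w[i] = b[i] - a[i] * cp[i - 1]
        if i < n - 1:
            cp[i] = c[i] / w[i]
    return cp, w

def thomas_solve(a, cp, w, d):
    """Forward and backward sweeps for one right-hand side d."""
    n = len(d)
    y = np.empty(n)
    y[0] = d[0] / w[0]
    for i in range(1, n):               # forward sweep
        y[i] = (d[i] - a[i] * y[i - 1]) / w[i]
    x = np.empty(n)
    x[-1] = y[-1]
    for i in range(n - 2, -1, -1):      # backward sweep
        x[i] = y[i] - cp[i] * x[i + 1]
    return x
```

The factorisation touches only the matrix, so it is done once per simulation; the sweeps touch only the RHS, so they are what the batched GPU kernels repeat every timestep.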
Previous applications of this algorithm cuThomasBatch required that each thread had access to its own copy of the diagonals a, b and c, along with the RHS d. These would then be overwritten in the prefactorisation and solve steps to save memory; the total memory usage is therefore 4NM array elements. We now limit the matrix to a single global copy that is accessed simultaneously by all threads, as discussed in Section 2, and retain the individual RHS vectors in interleaved format, one per thread. This reduces the data storage to NM + 3N, an approximate 75% reduction for large M. We present benchmark methodology and results for this method in the following subsections.
3.2 Benchmark Problem
For a benchmark problem we solve the Diffusion Equation, a standard model equation in Computational Science and Engineering; it appears in most systems involving heat and mass transfer. On a one-dimensional interval the equation is given as:

(7) ∂u/∂t = α ∂²u/∂x²

where α is the diffusion coefficient. We solve this equation on a periodic domain with a prescribed initial condition. We non-dimensionalise Equation (7) and integrate it forward in time using a standard Crank–Nicolson scheme with central differences for space, which is unconditionally stable. Finite differencing is done using the standard notation

(8) u_j^n = u(jΔx, nΔt)

where Δx is the grid spacing and Δt is the timestep. Thus our numerical scheme can be written as

(9) -(σ/2) u_{j-1}^{n+1} + (1 + σ) u_j^{n+1} - (σ/2) u_{j+1}^{n+1} = (σ/2) u_{j-1}^n + (1 - σ) u_j^n + (σ/2) u_{j+1}^n

where

(10) σ = α Δt / Δx²

Thus we can relate these coefficients to the matrix entries of Section 3.1 by

(11) a_j = c_j = -σ/2,   b_j = 1 + σ
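The unconditional stability of Crank–Nicolson for the diffusion equation can be checked directly from its amplification factor. The sketch below assumes the standard von Neumann analysis applied to a Fourier mode exp(ikx), with σ = αΔt/Δx² as above:

```python
import numpy as np

def cn_growth_factor(sigma, k_dx):
    """Amplification factor of Crank-Nicolson for the diffusion equation
    applied to a Fourier mode with non-dimensional wavenumber k*dx."""
    s = 2.0 * sigma * np.sin(k_dx / 2.0) ** 2
    return (1.0 - s) / (1.0 + s)

# |G| <= 1 for every sigma > 0 and every wavenumber: unconditional stability.
sigmas = np.linspace(0.01, 100.0, 50)
k_dxs = np.linspace(0.0, np.pi, 50)
G = cn_growth_factor(sigmas[:, None], k_dxs[None, :])
assert np.all(np.abs(G) <= 1.0)
```

Note that every mode is damped (|G| < 1 for s > 0) but never amplified, no matter how large the timestep; this is why the implicit solve is worth its cost.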
In the next section we present our method to deal with the periodicity of the matrix and then after that present the benchmark comparison with the existing state of the art.
3.3 Periodic Tridiagonal Matrix
As the system is periodic, two extra entries appear in the matrix: one in the top-right corner and another in the bottom-left. Our matrix is now given by
To deal with these we use the Sherman–Morrison formula: we rewrite our system as
(12) (A' + u vᵀ) x = d
Here,
(13a)  
and  
(13b) 
Thus two tridiagonal systems must now be solved,

(14) A' y = d,   A' q = u

the second of which need only be performed once at the beginning of a given simulation. Finally, to recover x we substitute these results into

(15) x = y − [(v · y) / (1 + v · q)] q
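A minimal sketch of the Sherman–Morrison solve, using NumPy's dense solver as a stand-in for the batched Thomas solver (the function name is ours):

```python
import numpy as np

def solve_periodic(A, u, v, d):
    """Solve (A + u v^T) x = d via the Sherman-Morrison formula.

    A is the purely tridiagonal 'core'; the rank-one term u v^T carries the
    two periodic corner entries.  np.linalg.solve stands in here for the
    batched Thomas solver; the solve for q depends only on the matrix, so
    it need be done only once per simulation.
    """
    y = np.linalg.solve(A, d)   # one tridiagonal solve per right-hand side
    q = np.linalg.solve(A, u)   # done once, reused at every timestep
    return y - (v @ y) / (1.0 + v @ q) * q
```

In the batched setting the solve for y is the per-timestep work, so the periodic correction costs only two dot products and one vector update per system.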
3.4 Benchmark Results
The diffusion equation benchmark problem presented above is solved for 1000 timesteps in order to extract the relevant timing statistics and average out any effects of background processes on the benchmark. We omit start-up costs such as memory allocation, setting of parameters, etc. We also do not include the timing of the prefactorisation step: this operation is performed only once, and its cost is so insignificant relative to the rest of the algorithm that we attribute it to part of the set-up cost. As such, both codes are timed for just the time-stepping sections of the process. The speedup results can be seen in Figure 3; in all situations there is a clear speedup over the existing state of the art, cuThomasBatch. The largest speedups occur for large batch numbers and moderate system sizes, and significant speedup is available throughout the large-batch regime, which is the domain where GPUs would typically be deployed to solve the problem. Shown also in Figure 4 is a roofline plot of the cuThomasConstantBatch kernel: it is evident that as the batch size is increased the kernel tends towards the memory roofline of the device.
It should be noted that some of the speedup seen here can be attributed to the prefactorisation step, which the cuThomasBatch implementation does not have. cuThomasBatch requires that the matrix be reset at every timestep, as the prefactorisation and solve steps are carried out in one function, leading to an overwrite of the data and making it unusable for repeated timestepping. This overwriting feature has been seen in previous studies gloster2019cupentbatch where the authors also carried out a rewrite to make the benchmarking conditions fair between their work and the state of the art. There it was shown that not all of the speedup can be attributed to removing the need to reset the matrix; thus some of the performance increase we are seeing in Figure 3 can be attributed to the new data layout presented in this paper. Our second benchmark, for the pentadiagonal case, will also show this, as no resetting of the matrix is required there.
M     N     cuThomasConstantBatch   cuThomasBatch
      64    0.097                   0.127
      256   0.161                   0.267
      1024  0.510                   0.850
      64    0.101                   0.192
      256   0.255                   0.549
      1024  1.020                   1.987
      64    1.278                   2.894
      256   5.015                   11.460
      1024  20.268                  46.175
4 Pentadiagonal Systems
In this section we first present a modified version of the pentadiagonal inversion algorithm of gloster2019cupentbatch with a single matrix A. Then we present the benchmark problem; finally, this is followed by the results comparing our new implementation with the existing state of the art.
4.1 Pentadiagonal Inversion Algorithm
In this section we describe a standard numerical method numalgC ; gloster2019cupentbatch for solving a pentadiagonal problem Ax = d. We present the algorithm in a general context, with a pentadiagonal matrix given by
Three steps are required to solve the system:

Factor A into lower and upper triangular factors L and U.

Solve the lower triangular system by forward substitution.

Back-substitute through the upper triangular system to find x.

Here, L and U are given by the following equations:
(16a)  
(16b) 
(the other entries in L and U are zero). The explicit steps of the factorisation are as follows:








For each







The forward-substitution steps are as follows:
Finally, the back-substitution steps to find x are as follows:
Previous applications of this algorithm gloster2019cupentbatch required that each thread had access to its own copy of the five diagonals along with the RHS d. These would then be overwritten in the prefactorisation and solve steps to save memory; the total memory usage is therefore 6NM array elements. We now limit the matrix to a single global copy that is accessed simultaneously by all threads, as discussed in Section 2, and retain the individual RHS vectors in interleaved format, one per thread. This reduces the data storage to NM + 5N, an approximate 83% reduction in data usage for large M. We present benchmark methodology and results for this method in the following subsections.
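The memory counts quoted here and in Section 3 can be tabulated directly; the helper names below are our own:

```python
# Device-memory footprint (in array elements) for a batch of M systems of
# size N, before and after sharing a single global copy of the matrix.
def tridiagonal_elements(N, M):
    per_system = 4 * N * M   # three diagonals + RHS, one copy per system
    shared = 3 * N + N * M   # one set of diagonals, M interleaved RHS vectors
    return per_system, shared

def pentadiagonal_elements(N, M):
    per_system = 6 * N * M   # five diagonals + RHS, one copy per system
    shared = 5 * N + N * M
    return per_system, shared

# For large M the savings approach 3/4 (75%) and 5/6 (~83%) respectively.
old, new = tridiagonal_elements(1024, 10000)
print(1 - new / old)   # just under 0.75
```

The limit is easy to see: the shared-to-per-system ratio is 1/4 + 3/(4M) in the tridiagonal case and 1/6 + 5/(6M) in the pentadiagonal case, so the diagonal storage becomes negligible as the batch grows.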
4.2 Benchmark Problem
For a benchmark problem we solve the hyperdiffusion equation on a one-dimensional interval, given here as:

(17) ∂u/∂t = −γ ∂⁴u/∂x⁴

where γ is the hyperdiffusion coefficient. We solve this equation on a periodic domain with a prescribed initial condition. We non-dimensionalise Equation (17) and integrate it forward in time using a standard Crank–Nicolson scheme with central differences for space, which is unconditionally stable. Finite differencing is done using the same standard notation as in Section 3 (specifically, Equation (8)). Thus, our numerical scheme can be written as

(18) u_j^{n+1} + (σ/2)(u_{j+2}^{n+1} − 4u_{j+1}^{n+1} + 6u_j^{n+1} − 4u_{j-1}^{n+1} + u_{j-2}^{n+1}) = u_j^n − (σ/2)(u_{j+2}^n − 4u_{j+1}^n + 6u_j^n − 4u_{j-1}^n + u_{j-2}^n)

where σ now means

(19) σ = γ Δt / Δx⁴
For a more detailed exposition of this benchmark problem, see Reference gloster2019cupentbatch .
4.3 Benchmark Results
We plot the speedup of cuPentConstantBatch versus cuPentBatch, solving batches of the above scheme for the hyperdiffusion equation, in Figure 5. As with cuThomasConstantBatch, the prefactorisation step is not included in these timings. We use cuSten gloster2019custen to calculate the RHS values. We also present a roofline plot for the cuPentConstantBatch kernel in Figure 6. Similar to cuThomasConstantBatch, the cuPentConstantBatch kernel is ultimately limited by the memory constraint of the device.
From Figure 5, it can be seen that cuPentConstantBatch outperforms cuPentBatch consistently for high values of both M and N. At low M and N the two methods are roughly equivalent; indeed, there are some areas where cuPentBatch performs better. However, in this part of the parameter space a standard CPU implementation is more sensible, once the data movement over the PCI lane and other overheads are taken into account. Furthermore, if we examine much larger values of M and N than those plotted, the superior performance of cuPentConstantBatch becomes much more apparent. The performance difference in these cases can be orders of magnitude, as the memory required by cuPentBatch rapidly exceeds the available memory on the GPU while cuPentConstantBatch still fits. We explore this part of the parameter space much more thoroughly in Section 5.
M     N     cuPentConstantBatch   cuPentBatch
      64    369.902               376.321
      256   461.727               487.455
      1024  658.475               717.109
      64    460.857               523.693
      256   429.127               535.657
      1024  1344.377              1495.057
      64    1786.018              2562.037
      256   6535.750              9707.131
      1024  25893.904             38407.516
5 Physical Application – Coarsening and the Cahn–Hilliard Equation
In this section, we study the Cahn–Hilliard equation in one spatial dimension. The purpose of this section is twofold: we wish to showcase the GPU methodology as applied to a meaningful physical problem, and we also wish to explore the coarsening phenomenon of the one-dimensional Cahn–Hilliard equation in a statistically robust fashion. This can be readily achieved with the present GPU methodology, which enables the simulation of large numbers of individual copies of the Cahn–Hilliard equation.
We begin by presenting the Cahn–Hilliard equation in one spatial dimension, valid on an interval of length L:

(20) ∂C/∂t = D ∂²/∂x² (C³ − C − γ ∂²C/∂x²)

Here, C(x, t) is a scalar concentration field taking arbitrary real values, and D is a positive mobility constant. The concentration field is used to demarcate different domains of a binary mixture CH_orig . Thus, a local solution C ≈ +1 indicates a region rich in one binary fluid component, whereas a local solution C ≈ −1 indicates a region rich in the other fluid component. Here also, γ is a positive constant; the domains are separated by transition regions of width √γ. We assume for definiteness that Equation (20) is posed on the interval 0 ≤ x ≤ L with periodic boundary conditions, although other standard boundary conditions (e.g. Neumann, Dirichlet) are also possible.
With such standard boundary conditions, the mean value of the concentration is conserved:

(21) d⟨C⟩/dt = 0,   ⟨C⟩ = (1/L) ∫₀ᴸ C dx

This can be seen by differentiating under the integral in Equation (21) and replacing ∂C/∂t with the right-hand side of Equation (20). Application of the periodic boundary conditions to the resulting integral gives

(22) d⟨C⟩/dt = (D/L) [∂/∂x (C³ − C − γ ∂²C/∂x²)]₀ᴸ = 0

hence ⟨C⟩ is constant in time. The value of ⟨C⟩ is a key parameter in this section and in Section 6.
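The discrete scheme used later in this section inherits this conservation law, because the periodic second-difference operator annihilates sums: applying it to any field and summing over the grid gives exactly zero, so any update of the form "mean-preserving state plus a discrete Laplacian of a flux" keeps the mean fixed. A small sketch:

```python
import numpy as np

# The periodic second difference of ANY field sums to zero, so a scheme of
# the form dC/dt = (second difference of a flux) conserves the mean of C.
rng = np.random.default_rng(0)
f = rng.standard_normal(256)                     # arbitrary periodic field
lap = np.roll(f, -1) - 2.0 * f + np.roll(f, 1)   # periodic second difference
print(abs(lap.sum()))                            # zero to machine precision
```

The same argument applies to the periodic fourth difference, so both the implicit and explicit parts of the time-stepper preserve ⟨C⟩ exactly.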
Any constant value C = C₀ is a solution of Equation (20). This constant solution can be characterized by introducing a perturbed solution

(23) C = C₀ + δC

where δC is a mean-zero perturbation which is 'small' in the sense that |δC| ≪ 1. As such, the solution (23) with C₀² > 1/3 is a stable solution to Equation (20), in the sense of linear stability analysis; that is, the perturbation δC decays to zero. In contrast, the solution (23) with C₀² < 1/3 is unstable, and furthermore has the property that C → ±1 in domains. The domains correspond to phase separation (regions rich in one fluid component or the other) and are separated by transition zones of width √γ. The spontaneous formation of domains from the initial condition (23) in this instance is called coarsening. A sample numerical simulation showing the coarsening phenomenon is shown in Figure 7.
The coarsening phenomenon in the Cahn–Hilliard equation is dimension-dependent. In one dimension, theoretical arguments argentina2005coarsening predict that the size of a typical domain should grow only logarithmically in time at late times. In contrast, in higher dimensions, the size of a typical domain grows as t^{1/3}, which is the famous Lifshitz–Slyozov coarsening law LS . The growth of the 'typical' domain size in this way is referred to as scaling. As such, the aim of this section is to confirm the theoretical one-dimensional scaling law.
5.1 1D Numerical Scheme
Equation (20) is solved implicitly on a uniform grid. We use a superscript n to denote evaluation of the solution at time t_n = nΔt. Taking the linear hyperdiffusion term to the left-hand side, treating it implicitly, and discretising in time, we get

(24) C^{n+1} + Δt γ D ∂⁴C^{n+1}/∂x⁴ = C^n + Δt D ∂²/∂x² [(C^n)³ − C^n]
We apply standard second-order accurate central differencing to the spatial derivatives:

(25a) ∂²C/∂x² |_j ≈ (C_{j+1} − 2C_j + C_{j−1}) / Δx²

(25b) ∂⁴C/∂x⁴ |_j ≈ (C_{j+2} − 4C_{j+1} + 6C_j − 4C_{j−1} + C_{j−2}) / Δx⁴
This, along with the periodicity of the domain, yields a cyclic pentadiagonal matrix system of the form

(26a) A C^{n+1} = f^n

Here, the coefficients of the matrix A in Equation (26a) have the following meaning:

(26b) A_{j,j} = 1 + 6σ,   A_{j,j±1} = −4σ,   A_{j,j±2} = σ

Similarly,

(26c) f_j^n = C_j^n + λ [ (C³ − C)_{j+1}^n − 2 (C³ − C)_j^n + (C³ − C)_{j−1}^n ]

where we have taken σ = γDΔt/Δx⁴ and λ = DΔt/Δx² for brevity. This system can then be solved following the methodology presented for cyclic pentadiagonal matrices as found in navon_pent ; gloster2019cupentbatch ; gloster2019efficient ; this methodology also influenced our choice of second-order accuracy above. All of the necessary matrix calculations were carried out using an in-house finite-difference library developed by the authors gloster2019custen . We use a timestep size
(27) 
to ensure stability of the scheme. This is a heuristic: stability of the numerical algorithm (24) is not bound by the linear hyperdiffusion term, which is treated implicitly. Instead, the numerical algorithm may become unstable due to the explicit treatment of the nonlinear term. The instability is therefore controlled by taking the timestep sufficiently small, along the lines of Equation (27). In the following section we show the convergence features of the scheme, which is second-order accurate in space due to the choice of central differences above.
5.2 1D Convergence Study
For the convergence study we fix the model parameters, set the timestep using Equation (27), and apply a cosine initial condition given by
(28) 
The convergence results are presented in Figure 8 and Table 3. We can see that the scheme is second-order accurate for intermediate numbers of grid points, with the observed order dropping towards first order only at very high resolutions. In all cases, since the solutions of the Cahn–Hilliard equation are smooth, and possess smooth features even on the scale of the transition zones between domains (i.e. no 'shocks'), these orders of convergence are acceptable.
N      Error    Order
128    0.0310   3.7932
256    0.0022   2.0376
512             2.0089
1024            2.0022
2048            2.0005
4096            0.8947
8192
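Assuming the order column is computed from the ratio of successive errors as the grid spacing halves (our reading of the table), the computation is:

```python
import numpy as np

def observed_order(err_coarse, err_fine):
    """Observed order of accuracy when the grid spacing is halved:
    a scheme of order p divides the error by 2**p per refinement."""
    return np.log2(err_coarse / err_fine)

# Halving dx for a second-order scheme should cut the error by a factor 4:
assert abs(observed_order(0.04, 0.01) - 2.0) < 1e-12
```

Applied to the two tabulated errors above, `observed_order(0.0310, 0.0022)` is roughly 3.8, consistent with the first entry of the order column.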
5.3 Batched 1D Scaling
We now examine the scaling behaviour of the coarsening phenomenon in the 1D Cahn–Hilliard equation. In Reference argentina2005coarsening , theoretical arguments are given for why the size of a typical domain should scale as
(29) 
Thus, in order to measure the 'typical domain size', we must average across the different domains that appear transiently in a single simulation, and across multiple independent simulations. To do this, we follow the computational methodology for solving batches of hyperdiffusion equations presented in gloster2019cupentbatch ; gloster2019efficient and combine it with the method for solving individual 1D Cahn–Hilliard equations presented above in Section 5.1. We store each individual RHS of the system in interleaved format, and the cuSten library gloster2019custen is used to compute the required nonlinear finite differences. Each simulation is given its own randomised initial condition, with values drawn from a mean-zero uniform distribution. Thus, the average value of the concentration field is ⟨C⟩ = 0, valid for all times (cf. Equation (21)). We therefore consider coarsening with respect to the well-mixed state, wherein both components of the binary fluid are present in equal amounts.
The model equation (24) is solved in batch mode on a GPU, timestepping until late times. In order to measure the typical domain size we compute the quantity

(30) 

the connection between a typical domain size within a given simulation and the variance of the concentration field is established in Reference LennonAurore . This quantity is captured for each simulation at every timestep and then averaged across the batch of simulations, to produce an average instantaneous value for the typical domain size. We plot this average as a function of time in Figures 8(a) and 8(b), for two different domain sizes. In both cases the grid spacing is chosen so that the transition region between domains is well resolved, and in both cases we simulate a large number of independent systems in parallel on a single GPU, in order to ensure the space is well sampled and that we average out the statistic of interest as much as possible. In both cases we fit lines with linear regression, as we are testing that the quantities in Equation (29) are linearly proportional; for each case we report the correlation coefficient of the linear regression analysis in the corresponding figure. Clearly these results confirm the theoretical findings in argentina2005coarsening and provide a good example of how a GPU can be used to efficiently solve a large batch of separate 1D equations.

6 Forced 1D Cahn–Hilliard Flow Pattern Map
In this section we again use the GPU methodology to study the 1D Cahn–Hilliard equation. The focus of this section is on the forced Cahn–Hilliard equation, which introduces additional parameters into the problem. The structure of the solution of the Cahn–Hilliard equation then depends on these parameters. We use a batch of simulations to characterize the parameter space. The aim here is again twofold: to learn something about the forced Cahn–Hilliard equation, and also to showcase potential uses of the GPU methodology.
To fix ideas, in this section we focus on a travellingwave forced Cahn–Hilliard equation in one spatial dimension, recalled here as
(31) 
This introduces new parameters into the problem based on the forcing term: the forcing amplitude, the wavenumber, and the travelling wavespeed. In view of the sinusoidal nature of the forcing term, the average concentration ⟨C⟩ is conserved and represents a further parameter in the problem. Assuming a fixed value of γ, the parameter space to be explored is therefore four-dimensional.
We emphasize that the forced Cahn–Hilliard equation with sinusoidal forcing is a wellestablished model in the literature, and is used to describe the Ludwig–Soret effect in binary liquids, wherein a temperature gradient induces a concentration gradient krekhov2004phase . The presence of a travellingwave forcing has been studied before, e.g. Reference weith2009traveling , and the mathematical properties of the travellingwave solutions have been characterized in Reference lennonWave .
Based on Equation (31), three distinct solution types have been isolated previously, by theoretical analysis weith2009traveling ; lennonWave (the nomenclature is taken from these references):

A1 travelling-wave solutions – these are travelling-wave solutions of Equation (31) which consist of interconnected domains (varying between C = −1 and C = +1). The interconnected domain structure moves with the travelling-wave velocity.

A0 solutions – these occur in regions of the parameter space where the forcing term is not strong enough to impose a travelling-wave structure on the solutions. Instead, the solution consists of standard domain structures, which move from time to time in an irregular fashion due to the forcing term. In short, these are intermediate solutions, where there is dynamic competition between phase separation and forcing.
Previously, in Reference lennonWave , these different solution types were identified with specific regions of parameter space. This was done by theoretical analysis in the first instance, combined with a visual inspection of a limited number of numerical simulations. The result was colloquially called a 'flow-pattern map', in analogy with the flow-pattern maps for multiphase flow in the Chemical Engineering literature Hewitt1982 . The aim in this section of the paper is to see if these results can be reproduced using batch solution of Equation (31), together with standard techniques from machine learning.
To produce a good-quality flow-pattern map with well-defined regions of parameter space, as in Reference lennonWave , a large number of simulations is required, followed by some visual inspection. This approach is slow and inefficient, as it leaves the researcher having to study each numerical solution in order to classify it, particularly in cases where an analytic metric cannot be used to help determine the classification of a given solution in the flow-pattern map. Another key drawback of this method is that it can be hard to draw out the precise boundaries between two solution regions, particularly when there is a large number of simulations and no analytic expression to divide the parameter space. In this section of the paper, we therefore present a proposed solution which addresses these issues. The solution has two parts:

First, a method to produce the necessary large batches of simulations in parallel using a GPU; this allows extremely large and diverse datasets to be produced for the study,

Next, an application of k-means clustering to classify the results automatically and so produce the desired flow-pattern maps.
Before proceeding with the analysis we first rewrite Equation (31) in the frame of reference of the travelling wave; this ensures we have a consistent position relative to the wave when feeding the solution into our classification method. The desired coordinate transformation is given by
(32) 
Under this transformation, Equation (31) becomes
(33) 
6.1 Numerical Scheme and Convergence Study
Before presenting the analysis of results for Equation (33), we first present the numerical scheme and a convergence study. As in previous sections, we discretise the hyperdiffusion term in Equation (33) implicitly, to give a version of Equation (24) with additional terms on the :
(34) 
Thus the solution methodology, with pentadiagonal inversions and central differences, is the same as in Section 5.1, except that we also need to discretise the advection term . To deal with this term we use the standard Hamilton–Jacobi WENO scheme presented in fedkiwBook . We choose this scheme as it correctly differences the advection term in the direction of travel and is accurate in smooth regions; the scheme has been implemented using the cuSten library gloster2019custen .
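The upwind differencing just described can be sketched in a few lines; the following is a minimal pure-Python illustration of the fifth-order Hamilton–Jacobi WENO construction from fedkiwBook (left-biased branch only, uniform grid assumed). Function names are illustrative – the production code runs inside cuSten kernels, not in Python.

```python
# Sketch of the fifth-order Hamilton-Jacobi WENO approximation to phi_x,
# left-biased (used where the advection velocity is positive).

def weno5_combine(v1, v2, v3, v4, v5, eps=1e-6):
    """Blend five one-sided differences into a WENO approximation of phi_x."""
    # Three candidate third-order stencil approximations
    p1 = v1 / 3.0 - 7.0 * v2 / 6.0 + 11.0 * v3 / 6.0
    p2 = -v2 / 6.0 + 5.0 * v3 / 6.0 + v4 / 3.0
    p3 = v3 / 3.0 + 5.0 * v4 / 6.0 - v5 / 6.0
    # Smoothness indicators for each stencil
    s1 = 13.0 / 12.0 * (v1 - 2 * v2 + v3) ** 2 + 0.25 * (v1 - 4 * v2 + 3 * v3) ** 2
    s2 = 13.0 / 12.0 * (v2 - 2 * v3 + v4) ** 2 + 0.25 * (v2 - v4) ** 2
    s3 = 13.0 / 12.0 * (v3 - 2 * v4 + v5) ** 2 + 0.25 * (3 * v3 - 4 * v4 + v5) ** 2
    # Nonlinear weights, biased away from non-smooth stencils
    a1 = 0.1 / (s1 + eps) ** 2
    a2 = 0.6 / (s2 + eps) ** 2
    a3 = 0.3 / (s3 + eps) ** 2
    return (a1 * p1 + a2 * p2 + a3 * p3) / (a1 + a2 + a3)

def phi_x_minus(phi, i, dx):
    """Left-biased WENO derivative at node i (requires phi[i-3]..phi[i+2])."""
    d = [(phi[j + 1] - phi[j]) / dx for j in range(i - 3, i + 2)]
    return weno5_combine(*d)
```

In smooth regions the weights revert to the linear values (0.1, 0.6, 0.3) and the scheme recovers fifth-order accuracy; near a kink the smoothness indicators suppress the offending stencil.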
We proceed with a convergence study identical to that of Section 5.2; the additional necessary parameters are set as follows: , and . The convergence results are presented in Figure 12 and Table 4. In both cases we can see clear second-order convergence, as expected, since we have used at least second-order accurate stencils in all of our calculations.
256  0.0013  2.8363 
512  1.9869  
1024  2.0086  
2048  1.9673  
4096 
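The observed orders in Table 4 follow from the ratio of errors on successive grids; a short sketch of this standard calculation (with illustrative error values, not the table's own) is:

```python
import math

def observed_order(err_coarse, err_fine):
    """Observed order of accuracy between two runs whose grid spacing differs by 2x."""
    return math.log2(err_coarse / err_fine)

# Hypothetical errors from a second-order scheme on successively doubled grids:
errors = [1.6e-3, 4.0e-4, 1.0e-4]
orders = [observed_order(errors[k], errors[k + 1]) for k in range(len(errors) - 1)]
```

Each halving of the grid spacing should reduce the error by a factor of four for a second-order scheme, giving observed orders close to 2, as seen in Table 4.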
6.2 kMeans Clustering
k-means clustering is a standard method for partitioning observations into clusters based on distance to the nearest centroid. Distance in this sense is the Euclidean distance in a space of suitable dimension for the dataset. In this work we use MATLAB’s implementation of the k-means clustering algorithm, which is itself based on Lloyd’s algorithm kmeans . The algorithm alternates between two steps: in the first step, each observation is assigned to a cluster based on its Euclidean distance to each of the centroids, the assigned cluster being that of the closest centroid; in the second step, a new set of centroids is computed by taking the mean position of all the observations within each assigned cluster. These steps alternate until convergence is reached. The algorithm is initialised by randomly creating the desired number of centroids (3 in our case).
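The two alternating steps can be sketched as follows; this is a minimal pure-Python version of Lloyd's algorithm for illustration only (names are ours), not the MATLAB implementation actually used, which adds refinements such as smarter seeding and empty-cluster handling.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm on a list of d-dimensional points (tuples)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(iters):
        # Step 1: assign each observation to its nearest centroid (Euclidean).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Step 2: move each centroid to the mean of its assigned observations.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: assignments can no longer change
            break
        centroids = new_centroids
    labels = [min(range(k),
                  key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
              for p in points]
    return centroids, labels
```

In our application each "point" is a full (flattened) numerical solution, so the Euclidean distance is taken in a space whose dimension equals the number of grid points.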
6.3 Clustering Results
In this section we present the results of running batches of the Cahn–Hilliard equation with a travelling wave and classifying the solutions using the k-means algorithm. The methodology is to solve batches of equations where we fix the wave speed and wavenumber for all systems within a given batch and then vary and within the batch. is varied in the range and in the range . In order to accurately capture the travelling wave we set the domain size to and the number of points in the domain to ; it was found that this high number of was required in order to accurately resolve the travelling wave at high values of and . This choice of also fixes the maximum batch size that can be run on a single GPU: our GPU has 12 GB of memory, from which we determine that we can have a total of 589824 simulations in each batch. The resolution of our flow-pattern maps is then obtained by dividing the ranges of and appropriately. The initial conditions in this section are drawn from a random uniform distribution of values between and , with set to a value of . We simulate all of the systems up to a final time of .
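The batch size is fixed by a simple memory budget. The sketch below illustrates the arithmetic with hypothetical per-system sizes; the true footprint depends on the solver's grid size and the number of work arrays it keeps per system, so the numbers here are illustrative and are not the paper's actual configuration.

```python
def max_batch_size(gpu_bytes, points_per_system, arrays_per_system, bytes_per_value=8):
    """Largest number of systems whose state fits in GPU memory (hypothetical sizing)."""
    bytes_per_system = points_per_system * arrays_per_system * bytes_per_value
    return gpu_bytes // bytes_per_system

# Illustrative numbers only: a 12 GB card, 1024 double-precision grid points
# per system, and 3 working arrays per system.
budget = 12 * 1024**3
n = max_batch_size(budget, points_per_system=1024, arrays_per_system=3)
```

Because the interleaved layout stores a single shared LHS matrix, the per-system cost is dominated by the RHS and solution arrays, which is what allows such large batches on one card.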
We begin by examining the case and , the results of which are presented in Figure 13. In this plot we can see the clear extraction of regions of different solutions: the top-left grey region corresponds to solution type A1 and the right-side black region corresponds to solution type A2. A mix of solutions can be seen in the bottom-left white region of the domain, corresponding to region A0 along with some areas of A1 and A2. Between these two regions k-means has been able to pick out an exact boundary extending from to the top of the domain. Some noise can be seen along this boundary, where the k-means algorithm has struggled to differentiate between very similar solution types as the behaviour transitions. This also explains the less well-behaved bottom-left region, where the algorithm has been unable to deal with areas near the boundaries.
The results presented in Figure 13 are in keeping with those presented in Reference lennonWave , where the classification was performed by visual inspection. Three regions in approximately the same locations were found: solution type A0 in the bottom left, A1 in the top left, and A2 on the right side of the parameter space. We now increase the value of to see if the boundary between A0 and A1 moves higher up the axis.
We now examine the case and , the results of which are presented in Figure 14. The k-means algorithm has struggled here to extract clean boundaries between the A0 region and the other two. This can be attributed to the fact that these regions are hard to distinguish, even by eye. We can see that the algorithm has classified the regions A1 and A2 broadly well, albeit with a less well-defined boundary than previously. Thus we conclude that k-means is a useful approach for broadly classifying large parameter spaces in terms of flow-pattern maps, yet it will struggle with more subtle transitions between regions.
Instead, k-means could be used as a first step in building a model to classify simulation data for a flow-pattern map. Indeed, one could take the well-defined regions from a large number of simulations and use them to train a logistic-regression classifier or a neural network; these may produce better results, as they may detect features that are not as obvious to the eye, or not representable using a Euclidean distance as in the k-means algorithm. Additional features beyond the raw solution could also be included in the dataset, such as a time series of the free energy or an increased number of time steps of given solutions.
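As an illustration of this proposed second stage, a logistic-regression classifier could be trained on points already labelled by k-means in the well-defined regions. The sketch below is a minimal gradient-descent implementation with illustrative names and two input features standing in for the forcing parameters; it is not part of the pipeline actually used in this paper.

```python
import math

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit a linear decision boundary w.x + b by gradient descent on the logistic loss.

    X: list of feature tuples (e.g. the two forcing parameters of a simulation);
    y: binary labels, here imagined to come from k-means on the well-defined regions.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        gw = [0.0] * len(w)
        gb = 0.0
        for x, t in zip(X, y):
            # Predicted probability of class 1
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            for i, xi in enumerate(x):
                gw[i] += (p - t) * xi
            gb += p - t
        # Averaged gradient step
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(w, b, x):
    """Classify a point by the sign of the fitted linear score."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

A linear classifier of this kind would give an explicit equation for the boundary between two regions of the flow-pattern map, which k-means alone does not provide.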
7 Conclusions
Summarizing, we have presented new algorithms that utilise improved data access when solving batches of tridiagonal and pentadiagonal matrix systems that all share the same LHS matrix. This new methodology, an interleaved batch of RHS matrices along with a single LHS, achieves large reductions in data usage along with a substantial speedup when compared with the existing state of the art. The reduction in data usage allows greater use to be made of hardware resources: significant extra space can now be devoted to solving more systems of equations simultaneously rather than to unnecessarily storing data. The extra equations now being solved by one GPU, along with the speedups provided by the new implementation, are a significant improvement over the state of the art in applications where only one LHS matrix is required for the solution of all the systems in the batch. In addition, the reduction in the number of GPUs needed when solving very large batches, coupled with the speedups achieved, leads to a reduction in electricity usage, which has both economic and environmental implications for HPC.
In this article, we have used a nontrivial physical problem to motivate the development of the GPU methodology – notably, the Cahn–Hilliard equation in one spatial dimension. We have demonstrated how the GPU methodology can be used to perform up to simulations of the Cahn–Hilliard equation – the resulting data can be used to build a statistically robust description of the coarsening phenomenon in one-dimensional phase separation. The sheer number of simulations used here to reveal the scaling behaviour in one spatial dimension is without precedent in the literature.
We have also used the physical testbed of the Cahn–Hilliard equation with a forcing term to show how the different solution types of the equation can be classified as a function of the forcing-term parameters. Again, we use a very large number of simulations to generate simulation data, which in this instance can be fed into standard machine-learning algorithms. This approach automates the construction of parameter spaces (‘flow-pattern maps’) which were previously produced by hand – although visual inspection is still required on occasion for ambiguous edge cases.
We emphasize finally that the numerical algorithms at the heart of the Cahn–Hilliard testbed are rather generic (finite differences, batch solution of partial differential equations, etc.). Consequently, the approach outlined in this article may find broad use in Computational Fluid Mechanics, and in Computational Science and Engineering more broadly.
Acknowledgements
Andrew Gloster acknowledges funding received from the UCD Research Demonstratorship. Enda Carroll acknowledges funding received under the Government of Ireland Postgraduate Scholarship Programme funded by the Irish Research Council, grant number GOIPG/2018/2653. All authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPUs used for this research. The authors also thank Lung Sheng Chien for helpful discussions throughout this project. The authors wish to acknowledge the DJEI/DES/SFI/HEA Irish Centre for High-End Computing (ICHEC) for the provision of computational facilities and support.
Appendix A Tridiagonal Algorithm
Appendix B Pentadiagonal Algorithm
References
 (1) Yao Zhang, Jonathan Cohen, and John D Owens. Fast tridiagonal solvers on the GPU. ACM SIGPLAN Notices, 45(5):127–136, 2010.
 (2) Hee-Seok Kim, Shengzhao Wu, Liwen Chang, and Wen-mei W Hwu. A scalable tridiagonal solver for GPUs. In 2011 International Conference on Parallel Processing, pages 444–453. IEEE, 2011.
 (3) Pedro Valero-Lara, Ivan Martínez-Pérez, Raül Sirvent, Xavier Martorell, and Antonio J. Peña. NVIDIA GPUs scalability to solve multiple (batch) tridiagonal systems implementation of cuThomasBatch. In Roman Wyrzykowski, Jack Dongarra, Ewa Deelman, and Konrad Karczewski, editors, Parallel Processing and Applied Mathematics, pages 243–253, Cham, 2018. Springer International Publishing.
 (4) Andrew Gloster, Lennon Ó Náraigh, and Khang Ee Pang. cuPentBatch – A batched pentadiagonal solver for NVIDIA GPUs. Computer Physics Communications, 241:113–121, 2019.
 (5) Zhangping Wei, Byunghyun Jang, Yaoxin Zhang, and Yafei Jia. Parallelizing alternating direction implicit solver on GPUs. Procedia Computer Science, 18:389–398, 2013. 2013 International Conference on Computational Science.
 (6) Andrew Gloster and Lennon Ó Náraigh. cuSten – CUDA finite difference and stencil library. SoftwareX, 10:100337, 2019.
 (7) Roger W Hockney. A fast direct solution of Poisson’s equation using Fourier analysis. Technical report, Stanford University, Stanford Electronics Laboratories, 1964.
 (8) Pedro Valero-Lara, Alfredo Pinelli, and Manuel Prieto-Matias. Fast finite difference Poisson solvers on heterogeneous architectures. Computer Physics Communications, 185(4):1265–1272, 2014.
 (9) Istvan Z Reguly, Daniel Giles, Devaraj Gopinathan, Laure Quivy, Joakim H Beck, Michael B Giles, Serge Guillas, and Frederic Dias. The VOLNA-OP2 tsunami code (version 1.5). Geoscientific Model Development, 11(11):4621–4635, 2018.
 (10) P. Valero-Lara, I. Martínez-Pérez, S. Mateo, R. Sirvent, V. Beltran, X. Martorell, and J. Labarta. Variable batched DGEMM. In 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pages 363–367, March 2018.
 (11) A. Haidar, A. Abdelfattah, M. Zounon, S. Tomov, and J. Dongarra. A guide for achieving high performance with very small matrices on GPU: A case study of batched LU and Cholesky factorizations. IEEE Transactions on Parallel and Distributed Systems, 29(5):973–984, May 2018.
 (12) Pedro Valero-Lara. Reducing memory requirements for large size LBM simulations on GPUs. Concurrency and Computation: Practice and Experience, 29(24):e4221, 2017.
 (13) P. K. Mohanty, M. Reza, P. Kumar, and P. Kumar. Implementation of cubic spline interpolation on parallel skeleton using pipeline model on CPU GPU cluster. In 2016 IEEE 6th International Conference on Advanced Computing (IACC), pages 747–751, Feb 2016.

 (14) M. Akbar, Pranowo, and S. Magister. Computational acceleration of image inpainting alternating-direction implicit (ADI) method using GPU CUDA. In 2017 International Conference on Control, Electronics, Renewable Energy and Communications (ICCREC), pages 185–189, Sep. 2017.
 (15) Pedro Valero-Lara, Ivan Martínez-Perez, Antonio J. Peña, Xavier Martorell, Raül Sirvent, and Jesús Labarta. cuHinesBatch: Solving multiple Hines systems on GPUs Human Brain Project. Procedia Computer Science, 108:566–575, 2017. International Conference on Computational Science, ICCS 2017, 12–14 June 2017, Zurich, Switzerland.
 (16) Pedro Valero-Lara, Ivan Martínez-Pérez, Raül Sirvent, Antonio J Peña, Xavier Martorell, and Jesús Labarta. Simulating the behavior of the human brain on GPUs. Oil & Gas Science and Technology–Revue d’IFP Energies nouvelles, 73:63, 2018.
 (17) Michael Kass. Interactive depth of field using simulated diffusion on a GPU. Pixar Animation Studios, Jan 2006.
 (18) Jack Dongarra, Sven Hammarling, Nicholas J Higham, Samuel D Relton, and Mawussi Zounon. Optimized batched linear algebra for modern architectures. In European Conference on Parallel Processing, pages 511–522. Springer, 2017.
 (19) Jack Dongarra, Sven Hammarling, Nicholas J Higham, Samuel D Relton, Pedro Valero-Lara, and Mawussi Zounon. The design and performance of batched BLAS on modern high-performance computing systems. Procedia Computer Science, 108:495–504, 2017.
 (20) Gisela Engeln-Müllges and Frank Uhlig. Numerical Algorithms with C. Springer-Verlag, Berlin, Heidelberg, 1996.
 (21) J. W. Cahn and J. E. Hilliard. Free energy of a nonuniform system. I. Interfacial energy. J. Chem. Phys., 28:258–267, 1958.
 (22) Mederic Argentina, MG Clerc, R Rojas, and E Tirapegui. Coarsening dynamics of the one-dimensional Cahn–Hilliard model. Physical Review E, 71(4):046210, 2005.
 (23) I. M. Lifshitz and V. V. Slyozov. The kinetics of precipitation from supersaturated solid solutions. J. Phys. Chem. Solids, 19:35–50, 1961.
 (24) I. M. Navon. Pent: A periodic pentadiagonal systems solver. Communications in Applied Numerical Methods, 3(1):63–69, 1987.
 (25) Andrew Gloster, Enda Carroll, Miguel Bustamante, and Lennon O’Naraigh. Efficient interleaved batch matrix solvers for CUDA, 2019.
 (26) Aurore Naso and Lennon Ó Náraigh. A flow-pattern map for phase separation using the Navier–Stokes–Cahn–Hilliard model. European Journal of Mechanics - B/Fluids, 72:576–585, 2018.
 (27) AP Krekhov and L Kramer. Phase separation in the presence of spatially periodic forcing. Physical Review E, 70(6):061801, 2004.
 (28) Vanessa Weith, Alexei Krekhov, and Walter Zimmermann. Traveling spatially periodic forcing of phase separation. The European Physical Journal B, 67(3):419–427, 2009.
 (29) Lennon Ó Náraigh. Travelling-wave spatially periodic forcing of asymmetric binary mixtures. Physica D: Nonlinear Phenomena, 393:24–37, 2019.
 (30) G. F. Hewitt. Flow regimes. In G. Hetsroni, editor, Handbook of Multiphase Systems. McGraw-Hill Book Company, New York, 1982.
 (31) S. Osher and R. Fedkiw. Level Set Methods and Dynamic Implicit Surfaces. Springer, New York, 2003.
 (32) S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, March 1982.