1 Introduction
In the past ten years, GPUs have evolved from chips specialized for graphics processing to powerful and flexible accelerators for general computational and data processing tasks. Together with improvements in hardware capabilities, there has been a growing software ecosystem. In terms of programmability, beyond the original CUDA C, GPUs can now be programmed with compiler directives (OpenACC, OpenMP) or in high-level languages such as Python, MATLAB, or Julia. On the system side, there are now servers like the NVIDIA DGX-2 with all GPUs connected via NVLink and NVSwitch. These types of systems are bringing SMP-like capabilities to multi-GPU programming, as GPUs on the node can access data on other GPUs in a fast and transparent way.
In this paper, we use the 2D Ising model to compare the level of performance achievable with different programming approaches on GPUs.
2 Ising Model
In statistical mechanics, “spin system” indicates a broad class of models used for the description of a number of physical phenomena. A spin system is usually identified by a Hamiltonian function having the following general form:
H = −Σ_⟨i,j⟩ J_ij s_i s_j   (1)
The spins are defined on a lattice which may have one, two, three or even a higher number of dimensions. The sum in Equation 1 runs over the nearest neighbors of each spin location (for example, 2 in 1D, 4 in 2D and 6 in 3D). The spins and the couplings may be either discrete or continuous and their values determine the specific model. In the Ising model [1] of ferromagnetism, the spins can be in one of two states (+1 or −1) and, as a further simplification, all of the nearest neighbors have the same interaction strength, so that J_ij = J for all pairs ⟨i,j⟩. There are analytical solutions for the Ising model in 1D and 2D; however, these can be considered an exception because, despite the illusory simplicity of their Hamiltonian formulation, the study of spin systems in higher dimensions is by no means trivial. Most of the time, numerical simulations (very often based on Monte Carlo methods) are the only way to understand their behavior.
A Monte Carlo simulation of the Ising model can be performed with the Metropolis algorithm [2], where the following steps are repeated:

1. From an initial configuration of spins, flip a randomly chosen spin.

2. If the change in energy between the old and new state is negative, accept the change.

3. Otherwise, accept the move with probability exp(−β ΔE), where β = 1/(k_B T) is the inverse temperature of the system and ΔE is the difference in energy due to the spin flip.
It is very desirable from a computational point of view to update multiple spins in parallel. Given the local interactions of a spin with its neighbors, we can see that if we consider the lattice as a checkerboard, the flipping of a spin of a particular color is completely determined by its neighbors of the opposite color. We can update all the spins of one color in parallel, keeping the other color constant, and then repeat the process with the opposite color.
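The two-color update just described can be sketched in NumPy. This is an illustrative, array-based reference version with periodic boundaries and J = 1, not the paper's GPU kernel; the function and variable names are ours:

```python
import numpy as np

def checkerboard_sweep(spins, beta, rng):
    """One Metropolis sweep over an n x n lattice of +/-1 spins (J = 1),
    updating the two checkerboard colors one after the other."""
    ii, jj = np.indices(spins.shape)
    for color in (0, 1):
        mask = (ii + jj) % 2 == color
        # Sum of the four nearest neighbors with periodic boundaries.
        nn = (np.roll(spins, 1, 0) + np.roll(spins, -1, 0) +
              np.roll(spins, 1, 1) + np.roll(spins, -1, 1))
        dE = 2.0 * spins * nn          # energy change if each spin were flipped
        accept = rng.random(spins.shape) < np.exp(-beta * dE)
        spins[mask & accept] *= -1     # flip accepted spins of this color only
    return spins
```

Because each color only interacts with the other color, all spins selected by `mask` can be flipped simultaneously without invalidating the Metropolis acceptance rule.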
The checkerboard decomposition can be used to run parallel versions of other local Monte Carlo algorithms, like the Heat Bath algorithm, in which the probability of a spin flip from s to −s is equal to exp(−β ΔE)/(1 + exp(−β ΔE)). Due to their locality, algorithms like Metropolis and Heat Bath suffer from so-called critical slowing down: a state in which it becomes increasingly difficult to flip a spin at random, as it is likely to be coupled to neighboring spins pointing in the same direction. A solution to this problem is provided by an algorithm proposed by U. Wolff [3] in which a whole cluster of spins is flipped in each update instead of a single one. The cluster is constructed starting from a seed spin selected at random and looking at its neighboring spins. Those with the same sign as the seed spin are added to the cluster with a probability equal to 1 − exp(−2βJ), whereas they are excluded from the cluster with probability exp(−2βJ) (spins having the opposite value with respect to the seed spin are ignored). This implies that spins are added to the cluster with a probability that is temperature dependent. It is not difficult to check that in both the low and high temperature regimes the Wolff algorithm is not very efficient and the simpler Metropolis algorithm performs better. As a consequence, there is still much interest in computationally efficient implementations of the Metropolis algorithm for the simulation of the Ising (and similar) models [4].
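A single Wolff cluster update can be sketched as follows. This is a plain-Python reference version with periodic boundaries, J = 1 by default, and names of our choosing, meant only to make the cluster-growth rule concrete:

```python
import numpy as np
from collections import deque

def wolff_update(spins, beta, rng, J=1.0):
    """Grow and flip one Wolff cluster on an n x n periodic lattice.
    Aligned neighbors join the cluster with probability 1 - exp(-2*beta*J)."""
    n = spins.shape[0]
    p_add = 1.0 - np.exp(-2.0 * beta * J)
    seed = (int(rng.integers(n)), int(rng.integers(n)))
    s0 = spins[seed]
    cluster = {seed}
    frontier = deque([seed])
    while frontier:
        i, j = frontier.popleft()
        for ni, nj in (((i - 1) % n, j), ((i + 1) % n, j),
                       (i, (j - 1) % n), (i, (j + 1) % n)):
            # Only same-sign neighbors may join, each with probability p_add.
            if (ni, nj) not in cluster and spins[ni, nj] == s0 \
                    and rng.random() < p_add:
                cluster.add((ni, nj))
                frontier.append((ni, nj))
    for i, j in cluster:           # flip the whole cluster at once
        spins[i, j] = -s0
    return spins, len(cluster)
```

At very low temperature p_add approaches 1, so a single update flips almost the entire (ordered) lattice, which illustrates why the algorithm's efficiency is temperature dependent.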
In the present paper, we discuss the performance of several implementations of the simulation of the 2D Ising model running on NVIDIA Volta GPUs. We focus on the 2D model because the results can be easily compared to the analytical solution derived by Onsager [5]. We compare our performance with results recently produced on other computing platforms [7, 8].
3 Single-GPU implementations
3.1 Basic Implementation
To begin, a basic single-GPU implementation of the checkerboard Metropolis algorithm was written in Python, combining features available in the popular Numba and CuPy packages. Specifically, Numba [9] was used for general GPU data handling via its device_array objects and for its ability to compile custom CUDA kernels, expressed purely in Python, at runtime. In our experimentation, we found the random number generation supported in Numba for use in device kernels to be fairly slow. As a replacement, we used the bindings to the NVIDIA cuRAND [11] library available in CuPy [10] to prepopulate an array of random numbers as a separate operation before each lattice update kernel call.
The checkerboard lattice of spins is represented in two separate arrays, each containing only the spins of a single color. This decomposition is depicted in the central image in Figure 1. As each spin only takes the values ±1, each spin location can be represented using a single byte. While further compressed data representations are possible, a byte is the smallest data type that does not require bitwise operations.
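The exact storage layout used here is shown in Figure 1; the following NumPy sketch illustrates the general idea of splitting an n x n lattice into two half-width arrays, one per checkerboard color. The column-selection convention (black holds sites where i + j is even) is our assumption for illustration:

```python
import numpy as np

def split_checkerboard(lattice):
    """Split an n x n lattice into two n x (n/2) arrays, one per color.
    Black holds sites with (i + j) even, white the remaining sites."""
    n = lattice.shape[0]
    black = np.empty((n, n // 2), dtype=lattice.dtype)
    white = np.empty((n, n // 2), dtype=lattice.dtype)
    for i in range(n):
        # In row i, black sites sit in columns with the same parity as i.
        black[i] = lattice[i, i % 2::2]
        white[i] = lattice[i, (i + 1) % 2::2]
    return black, white
```

With this layout, a color update kernel reads one array and writes the other, and each array is half the width of the full lattice.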
With the data decomposition in place, the implementation is straightforward and consists of two steps per color, per iteration. For a given color:

1. Populate an array of random values with CuPy/cuRAND.

2. Update the spins on the lattice for the current color, using the opposite-colored lattice spin values and the random value array, in a custom kernel written with Numba. See Figure 2 for a listing of the spin update code.
To better gauge the performance of Numba, a nearly identical implementation of this basic algorithm for a single GPU was also written in CUDA C. A comparison of the main lattice kernel in both implementations can be seen in Figure 2.
3.2 Tensor Core Implementation
In [7], an implementation of the checkerboard Metropolis algorithm was developed that maps the computation of nearest-neighbor sums to matrix multiplications in order to execute them on TPUs using their hardware FP16 matrix multiply units (TPU Tensor Cores). As GPUs are generally programmable and the computation can be directly expressed in CUDA, such a mapping to matrix multiplies is not necessary on GPUs and adds unneeded complexity; however, as a means for comparison, a version of this matrix multiply algorithm was implemented to utilize the NVIDIA Tensor Cores available on Volta GPUs. A complete description of the algorithm can be found in [7] and only key details are discussed here. For this implementation, the lattice data is organized in a pattern similar to the rightmost diagram in Figure 1: the lattice is organized into sublattices, each further decomposed into four blocks. At the high level, the algorithm uses matrix multiplications to compute sublattice-local nearest neighbor sums using a kernel matrix of the form,
(2) 
For a given sublattice, to compute the sums of sublattice-local nearest neighbor spins for the two black subblocks, the following operations are performed,
(3)  
(4) 
where the left-hand sides denote the sums of sublattice-local nearest neighbor spins. A similar pair of expressions can be used to compute the sums for the white subblocks,
(5)  
(6) 
With these equations, one can obtain the sublattice-local sums for a given color over the entire lattice via two batched matrix-multiply operations, corresponding to the left and right summands in either Eqs. 3 and 4 for the black spins, or Eqs. 5 and 6 for the white spins. To enable the use of tensor cores, the lattice spins and nearest neighbor sums are stored in half precision, and the matrix multiplies are computed using the cublasHgemmBatched routine available in the NVIDIA cuBLAS library [12].
Once the sublattice-local sums are obtained for a given color, a separate kernel adding boundary contributions from neighboring sublattices is executed, followed by a final spin update kernel which uses the completed sums and generated random values to update the spins. In this implementation, we used the Philox4_32_10 generator provided by the device API of the cuRAND library to generate per-thread random numbers within the spin update kernel. In order to avoid allocating separate global memory to store generator states and loading/storing them for every color update, we initialize a new state at each kernel call in such a way that each thread progresses coherently along the same random sequence across consecutive launches. This is possible thanks to the curand_init() call, which allows one to specify a seed, a sequence, and an offset. The seed determines a series of random sequences, the sequence number identifies one of those sequences, and the offset specifies an element in that sequence. For each kernel call, each thread uses the same seed, specifies as sequence number its unique linear index in the grid, and specifies an offset equal to the total count of random numbers generated in the previous kernel calls. This is in contrast to the approach in the basic implementation, where an array of random values is prepopulated using the host API. To summarize, the steps to update spins for a given color are as follows:

1. Compute sublattice-local spin sums for the current color using two batched matrix-multiplication operations (via calls to cublasHgemmBatched).

2. Add boundary contributions from neighboring sublattices to the local sums using a custom boundary update CUDA kernel.

3. Update spins for the current color using the completed sums and random values generated with cuRAND in a custom spin update CUDA kernel.
From this description, a few notable issues with this approach can be discussed. First, the use of matrix multiplications necessitates additional memory usage in order to store intermediate local nearest neighbor sums (for at least half of the total lattice). On top of this, to utilize tensor cores the sums and lattice spins must be stored in half precision, which further increases memory requirements. Next, the splitting of the computation into several distinct operations results in increased global device memory traffic, as data must be loaded and stored from global memory in between operations. Additionally, the standalone boundary update operation yields many uncoalesced memory loads and stores when updating the sublattice boundaries along the direction strided in memory, resulting in poor memory bandwidth utilization for this kernel. As this problem is memory-bandwidth limited, increasing memory bandwidth pressure just to enable the mapping of the problem to tensor cores is not a wise tradeoff. This is especially true when the matrix multiplications themselves consist of mostly useless FLOPs (i.e., multiplications by zero), with only two multiplications per inner product contributing to the final result, yielding only a small fraction of useful FLOPs.

3.3 Optimized Implementation
In order to use the compute resources of the GPU as efficiently as possible, we developed an optimized implementation of the Metropolis algorithm in CUDA C, based on the multispin coding technique [6]. Similar to the basic implementation, we represented the spin lattice as two separate arrays, each holding one set of spins from the checkerboard representation. Each spin is represented with 4 bits instead of an entire byte (or larger words). This choice not only reduces the memory footprint of the lattice with respect to using a whole word per spin, but also makes it possible to perform the sums of the neighbors' values for multiple consecutive spins (of the same color) in parallel, provided that the theoretical spin values −1/+1 are mapped to 0/1, respectively. The combination of the reduced number of bits used to store a spin value with the use of the largest word size (64 bits) allows drastic reductions in the number of addition instructions executed by the hardware. Assuming that each of four 64-bit words contains 16 consecutive spins from the N, S, E, and W directions, three additions are sufficient to compute the neighbor sums, instead of the 48 that would be required if each quadruple was processed independently (3 additions for each of the 16 spins). The choice of four bits per spin enables performing the sums without using additional variables, given that each spin can contribute at most 1 to a sum. In theory, since each sum can add up to at most 4, three bits would be sufficient, but that would require explicit handling of the inter-word cases with no sizeable benefit as far as performance or storage are concerned.
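The bookkeeping behind this bit-level trick can be illustrated in plain Python. This is a host-side sketch of the 4-bit packing and the three-addition neighbor sum, with names of our choosing:

```python
M64 = (1 << 64) - 1  # 64-bit word mask

def pack(spins16):
    """Pack 16 spins, encoded as 0/1 (for -1/+1), into one 64-bit word,
    using 4 bits per spin."""
    w = 0
    for k, s in enumerate(spins16):
        w |= (s & 0xF) << (4 * k)
    return w

def neighbor_sums(north, south, east, west):
    """Sum four neighbor words with just three additions; every 4-bit slot
    ends up holding a value in 0..4, so no carry crosses slot boundaries."""
    return ((north + south) + (east + west)) & M64

def unpack(word):
    """Extract the sixteen 4-bit slot values of a word."""
    return [(word >> (4 * k)) & 0xF for k in range(16)]
```

The key property is that each slot's maximum value (4) fits in 4 bits, so sixteen per-spin sums are computed by three 64-bit additions with no inter-slot carries.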
The update of each color is performed with a single kernel that is launched with enough blocks to cover the whole lattice. Each thread is in charge of updating 64 spins (256 bits), thus it executes two 128-bit loads from the target lattice array (the spins to be updated), eight additional 128-bit loads from the source lattice array (to fetch the four neighbors of the target spins, four loads per source word), and two 128-bit writes to store the flipped spins back into the target lattice. The separation of the spins according to the checkerboard pattern makes it quite simple to compute the words required to load the neighbors of a target word of spins. The neighbors of the spins in the target-lattice word at coordinates (i, j) are found in the source-lattice words at coordinates (i−1, j), (i, j) and (i+1, j) (three vertically aligned words centered on (i, j)), and in a word from one of the sides, either (i, j−1) or (i, j+1), depending on the parity of the row index i and on the color being updated. The side word contains only one spin required for the computation (the one nearest to the central word) and so it cannot be directly used in the neighbor additions. This, however, can easily be fixed by shifting out the unnecessary spins and shifting in an equal number of spins from the central word, as shown in Figure 3.
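The side-word fix-up of Figure 3 boils down to a shift-and-merge of two 64-bit words. As a simplified illustration (the layout here, slot k holding spin k and the single needed spin sitting in the top slot of the left-hand word, is our own assumption, not necessarily the paper's exact convention):

```python
M64 = (1 << 64) - 1

def west_neighbor_word(center, left):
    """Build a word whose 4-bit slot k holds the west neighbor of slot k
    in `center`: slots 1..15 come from center's slots 0..14 (shifted in),
    while slot 0 takes the single needed spin, the top slot of `left`."""
    return ((center << 4) & M64) | (left >> 60)
```

One shift discards the fifteen unneeded spins of the side word while the other shifts fifteen spins of the central word into their place, so the result can be fed directly into the packed neighbor additions.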
In order to avoid reading the lattice entries multiple times from global memory, each block initially loads its spin tile from the source lattice into shared memory, and then the threads load the words containing their neighboring spins from that faster memory space. We generate random numbers similarly to what we described for the tensor core implementation, using the Philox4_32_10 generator provided by the device API of the cuRAND library.
Each simulation step consists of two kernel launches, one for the black and one for the white sublattice.
4 Multi-GPU implementations
A classic multi-GPU approach would implement the update of the halo spins in a different routine from the spins in the bulk, to allow an overlap of communication and computation, as shown in [4]. While this approach is very effective and allows good scaling, in this work we exploited new capabilities available in the DGX line of systems (DGX Station, DGX-1 and DGX-2), where the GPUs are connected through NVLink and unified memory allows allocations that span multiple GPUs. When one GPU needs to access data stored on another GPU, the driver automatically migrates the corresponding memory pages.
When using multiple GPUs, the whole lattice can be partitioned into horizontal slabs, with each GPU storing one slab in its own global memory in the same layout employed in the single-GPU case (according to the checkerboard pattern). In this way, each GPU needs only read access to the memory of the two GPUs that handle the slabs above and below its own region.
4.1 Basic Implementation
First, we consider distributing the basic implementation to multiple GPUs. One limitation of the Numba library is that a single thread can have at most one active CUDA context at a time, which prohibits the simple usage of unified memory to access data across multiple GPUs. Instead, a multi-process multi-GPU version was implemented using MPI via mpi4py and CUDA Interprocess Communication (IPC) handles, obtained directly from the Numba device_array objects. In this case, each process obtains a CUDA IPC handle to the lattice allocations of the neighboring slabs, and uses them to access spin data in the lattice update kernel when the neighbor stencil crosses slab boundaries.
4.2 Tensor Core and Optimized Implementations
Since both of these implementations are written in CUDA C, the full set of options for distribution across multiple GPUs is available.
The slab-based distribution can be realized very easily by using the CUDA Unified Memory system. It is sufficient to allocate the entire lattice via a single call to the cudaMallocManaged() function and then, for each slab, to set the preferred location to be the memory belonging to the target GPU and to set up the mapping of the first and last line of that slab in the page tables of the adjacent GPUs. These operations are performed using the cudaMemAdvise() call, specifying, respectively, the cudaMemAdviseSetPreferredLocation and the cudaMemAdviseSetAccessedBy advice parameters. In total, since the black and white lattices are stored in separate arrays, six cudaMemAdvise() calls are required per GPU. Figure 4 shows an example of the Unified Memory calls and the lattice regions on which they need to be invoked.
This setup does not require any change to the existing GPU kernels to access external memory, nor any explicit exchange of data among GPUs (as would normally be necessary in a CUDA+MPI setup). Rather, the existing kernels are simply launched on each GPU with the lattice array accesses properly offset to access different slabs. When a GPU executes a load on a word belonging to one of its neighbors, a data transfer through NVLink automatically takes place.
5 Results
5.1 Single-GPU Performance
In this section, we present single-GPU performance results. We ran our single-GPU tests on a Tesla V100-SXM card mounted in a DGX-2 [13]. The card is equipped with 32 GB of HBM2 memory and a total of 5120 CUDA cores. For the basic and tensor core implementations, we ran tests on a range of lattice sizes to compare directly with the TPU results reported in [7]. For the optimized implementation, we ran tests with lattice sizes ranging from 2 MB to 8 GB of memory footprint, quadrupling the total number of spins at each step. We also ran the largest lattice possible with the available memory, requiring 30.3 GB. The results are reported in Tables 1 and 2, respectively.
Considering the performance results in Table 1, both the basic implementation and the tensor core one achieve higher performance on a single Tesla V100-SXM than the single TPUv3 results reported in [7], with speedups ranging from a factor of 3 for the tensor core implementation to 5 for the CUDA C variant of the basic implementation, when comparing the highest reported spin update rates. Comparing across GPU implementations in this table, the tensor core implementation and the basic implementation in Python are significantly slower than the basic implementation written in CUDA C. For the tensor core implementation, this is not surprising due to the additional memory bandwidth overhead introduced by splitting the problem into more distinct operations and the poor performance of the boundary update kernel caused by the required uncoalesced loads and stores. In comparing the basic implementation in Python to the implementation in CUDA C, the major performance difference was mostly due to the quality of the compiled spin update kernels. While the Python and CUDA C versions of these kernels are nearly identical (see Figure 2), the JIT-compiled Numba kernels were much less performant, in part due to greater register usage relative to the kernel generated by the NVCC compiler from the CUDA C source.
For comparison with the performance of our optimized implementation, we include in Table 2 the best results from the high performance TPUv3 implementation on a single core and on 32 TPUv3 cores, as well as the performance achieved on an FPGA [8] using a lattice of fixed size. The comparison shows that our optimized implementation on a single V100-SXM provides a speedup of well over an order of magnitude with respect to a single TPUv3 core. It is necessary to combine the processing power of 32 TPUv3 cores (four TPU units) to reach performance in the ballpark of a single Tesla V100.
| lattice size | Basic (Python) | Basic (CUDA C) | Tensor Core | TPU (on TPUv3 core) [7] |
| | 15.179 | 48.147 | 31.010 | 8.1920 |
| | 40.984 | 59.606 | 35.356 | 9.3623 |
| | 42.887 | 64.578 | 38.726 | 12.336 |
| | 43.594 | 66.382 | 39.152 | 12.827 |
| | 43.768 | 66.787 | 39.208 | 12.906 |
| | 43.535 | 66.954 | 38.749 | 12.878 |
5.2 Multi-GPU Performance
In this section, we present the multi-GPU performance results. For the optimized code, we measured performance on both a DGX-2 and a DGX-2H system. The DGX-2H is a newer version of the DGX-2 server that is outfitted with higher-performing CPUs and GPUs: its 16 Tesla V100 GPUs run at a higher clock speed and feature a 450 W TDP each. The performance testing of the basic and tensor core implementations was limited to the base DGX-2 system.

Table 3 reports the weak scaling measurements obtained by running the optimized implementation on both a DGX-2 and a DGX-2H system. The lattice size per GPU has been kept constant (30 GB of spin data per GPU) and we measured the flip/ns rate over a fixed number of update steps. As expected, the scaling is perfectly linear up to 16 GPUs. This is due to the simple access pattern of remote memory and to the high throughput of the NVLink communications. The performance figures from the table are plotted in the adjacent graph together with the numbers reported in [7]. Due to memory constraints, we could not run a test case of size similar to the largest one reported in [7].

Table 4 reports the strong scaling measurements of the optimized implementation obtained on the two DGX systems. The total lattice size has been kept constant (30 GB of spin data) and it has been partitioned into as many equally sized horizontal slabs as the number of GPUs used in the measurement. As in the weak scaling case, we measured the flip/ns rate over a fixed number of update steps. Since for any number of GPUs the transfers of the top and bottom boundaries are negligible with respect to the processing of the bulk, the scaling is linear up to 16 GPUs. The plot near the table shows a comparison of the performance figures from the two DGX systems.

For completeness, we also report in Table 5 the weak and strong scaling measurements obtained on a DGX-2 system for the basic implementation in Python (using MPI and CUDA IPC) and the tensor core implementation (using unified memory). Similar to the optimized implementation, both of these implementations achieve very good scaling efficiency. Notably, the tensor core implementation with unified memory achieves better scaling efficiency than the basic implementation using MPI and CUDA IPC.
| no. of GPUs | lattice size | Basic (Python) | Tensor Core |
| 1 | | 43.488 | 38.747 |
| 2 | | 82.447 | 77.492 |
| 4 | | 164.352 | 154.980 |
| 8 | | 327.136 | 309.918 |
| 16 | | 648.254 | 619.520 |
| 1 | | 43.481 | 38.752 |
| 2 | | 83.146 | 78.104 |
| 4 | | 165.793 | 156.676 |
| 8 | | 330.258 | 313.077 |
| 16 | | 650.543 | 602.083 |
5.3 Validation
As mentioned in Section 2, there is an analytical solution for the 2D Ising model in the limit of infinite volume that shows the presence of an order-disorder phase transition: the magnetization M is 0 for T > T_c, whereas it is equal to:

M = (1 − sinh^(−4)(2J/k_B T))^(1/8)   (7)

for T < T_c. The condition determining the critical temperature T_c at which the phase transition occurs is sinh(2J/k_B T_c) = 1, corresponding to T_c = 2J/(k_B ln(1 + √2)) ≈ 2.269 J/k_B. To verify the accuracy of our optimized code for the simulation of the 2D Ising model, we carried out a set of tests at different temperatures and lattice sizes, comparing the simulated magnetization with the corresponding (i.e., for the same temperature) values of Equation 7. Figure 5 shows the simulated magnetization as a function of temperature, along with the values of the analytical expression 7. Another interesting quantity is the so-called Binder parameter or Binder cumulant [14], corresponding to the kurtosis of the order parameter (i.e., the magnetization). The Binder parameter can be computed as

U = 1 − ⟨m⁴⟩ / (3⟨m²⟩²).

This parameter makes it possible to accurately determine phase transition points in numerical simulations of various spin models. The phase transition point may be identified by comparing the behavior of U as a function of the temperature for different values of the system size L. The critical temperature corresponds to the point where the curves for different values of L cross. Our measurements of the Binder parameter are reported in Figure 6 for the same lattice sizes used for the magnetization. As expected, they show a phase transition occurring at the critical temperature T_c. Although these measurements confirm the accuracy of the simulations using our implementation, we found other interesting effects on large lattices that deserve an in-depth analysis. In particular, we observed that in some runs the time (measured as the number of lattice sweeps of the Metropolis algorithm) required to reach the steady state is much larger than expected. In those cases, spins tend to organize into bands (horizontal, vertical or even diagonal), remaining in those metastable states for very long times. We plan to study that phenomenon in a forthcoming paper.
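For reference, the analytical quantities used in this validation can be written down directly. This is a Python transcription with J = k_B = 1; the function names are ours:

```python
import numpy as np

# Critical temperature of the infinite 2D Ising model, ~2.269 (J = k_B = 1).
T_C = 2.0 / np.log(1.0 + np.sqrt(2.0))

def onsager_magnetization(T):
    """Onsager's exact spontaneous magnetization of the infinite 2D Ising model."""
    if T >= T_C:
        return 0.0
    return (1.0 - np.sinh(2.0 / T) ** -4) ** 0.125

def binder_cumulant(m_samples):
    """Binder parameter U = 1 - <m^4> / (3 <m^2>^2) from magnetization samples."""
    m = np.asarray(m_samples, dtype=float)
    return 1.0 - np.mean(m ** 4) / (3.0 * np.mean(m ** 2) ** 2)
```

In practice, `binder_cumulant` is evaluated on the time series of measured magnetizations for each lattice size, and the crossing point of the resulting curves locates T_c.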
6 Conclusions
We have presented several implementations of the 2D Ising model using a variety of approaches. All the implementations are accurate, fast, and scale very well to multi-GPU configurations. We discussed some of the limitations of the basic and tensor core approaches; nevertheless, we found that all of them outperform the results reported on TPU systems [7]. Our optimized implementation achieves state-of-the-art performance and compares favorably to custom FPGA solutions [8].
These codes can be easily extended to simulate other models for which there are no analytical solutions, for instance a 2D Ising spin glass model, and can also be used to study the dynamics of the classic (i.e., ferromagnetic) Ising model simulated with the Metropolis algorithm.
The different versions of the code are available at http://github.com/NVIDIA/ising-gpu.
Acknowledgements
We would like to thank Profs. Giorgio Parisi, Enzo Marinari and Federico RicciTersenghi from the University of Rome "Sapienza" for useful discussions about the convergence properties of the Ising model on large lattices.
References
 [1] Ising, E. Contribution to the Theory of Ferromagnetism. Zeitschrift für Physik, vol. XXXI, 1925.
 [2] Metropolis, N. and Rosenbluth, A.W. and Rosenbluth, M.N. and Teller, A.H. and Teller, E. Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics. 21 (6): 1087–1092, 1953.
 [3] Wolff, U. Collective Monte Carlo Updating for Spin Systems. Phys. Rev. Lett. 62, 361, 1989.
 [4] Bernaschi, M., Fatica, M., Parisi, G. and Parisi, L. Multi-GPU codes for spin systems simulations. Computer Physics Communications, Vol. 183, 2012.
 [5] Onsager, L. Crystal statistics. I. A two-dimensional model with an order-disorder transition. Physical Review, Series II, 65 (3-4): 117-149, 1944.
 [6] Jacobs, L. and Rebbi, C. Multispin coding: a very efficient technique for Monte Carlo simulations of spin systems. Journal of Computational Physics 41(1), 203-210, 1981.
 [7] Kun Yang, YiFan Chen, Georgios Roumpos, Chris Colby, John Anderson. High Performance Monte Carlo Simulation of Ising Model on TPU Clusters. arXiv preprint:1903.11714v2, 2019.
 [8] F. Ortega-Zamorano, M. A. Montemurro, S. A. Cannas, J. M. Jerez, L. Franco. FPGA Hardware Acceleration of Monte Carlo Simulations for the Ising Model. IEEE Transactions on Parallel and Distributed Systems, Vol. 27, Issue 9, pages 2618-2627, 2016.
 [9] S. K. Lam, A. Pitrou and S. Seibert. Numba: A LLVM-based Python JIT Compiler. Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, Austin, Texas, November 2015.

 [10] R. Okuta, Y. Unno, D. Nishino, S. Hido, and C. Loomis. CuPy: A NumPy-Compatible Library for NVIDIA GPU Calculations. Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Thirty-first Annual Conference on Neural Information Processing Systems, 2017.
 [11] cuRAND Library, http://docs.nvidia.com/cuda/curand
 [12] cuBLAS Library, http://docs.nvidia.com/cuda/cublas
 [13] NVIDIA DGX-2, https://www.nvidia.com/en-us/data-center/dgx-2/
 [14] Binder, K. Phys. Rev. Lett. 47, 693, 1981.