1 Introduction
Driven by the need for realistic real time computer graphics applications, graphics processing units (GPUs) now offer more computing power than central processing units (CPUs). This is accomplished with a large number of (relatively slow) processors that can simultaneously render complex 3D graphical applications by relying on parallelism. Modern GPUs can be programmed to execute many of the same tasks as a CPU. For tasks with efficient (primarily floating point) parallel implementations, GPUs have reduced execution time by an order of magnitude or more (Brodtkorb et al., 2010). Additionally, GPUs are extremely cost effective; delivering far more floating point operations per second (FLOPS) per dollar than multicore CPUs. This provides a great deal of promise for general purpose computing on graphics processing units (GPGPU), and in particular for computationally intensive statistical applications (e.g., Zechner & Granitzer (2009); Lee et al. (2009)).
In this note, we explore the merits of heterogeneous computing (CPU+GPU) over traditional computing (CPU only) for fitting Gaussian process (GP) models. GP models are popular supervised learning models
(Rasmussen & Williams, 2006), and are also used for emulating computer simulators which are too time consuming to run for realtime predictions (Sacks et al., 1989). Fitting a GP model with simulator runs requires the determinant and inverse computation of numerous spatial correlation matrices. The computational costs of both the determinant and inverse are , which become prohibitive for moderate to large . We demonstrate that for moderate to large datasets (thousands of runs), the implementation on a heterogeneous system can be more than a hundred times faster. The tremendous speedups were achieved by leveraging NVIDIA’s compute unified device architecture (CUDA) toolkit (NVIDIA Corporation, 2010).This note is intended as a “proof of concept” to demonstrate the tremendous time savings that can be achieved for statistical simulations by using GPUs. In particular we show that fitting GP models on a heterogeneous computing platform leads to significant time savings as compared to our previous CPUonly implementation.
The remainder of this note is organized as follows: Section 2 briefly reviews the basic formulation of the GP model. We also highlight the computationally intensive steps of the model fitting procedure in this section. Implementation details for both platforms (CPU + GPU and CPUonly) are discussed in Section 3. Comparisons are made via several simulations in Section 4, followed by discussion in Section 5. It is clear from this study that the efficiency, affordability, and pervasiveness of GPUs suggest that they should seriously be considered for intensive statistical computation.
2 Gaussian Process Model
GP models are commonly used for regression in machine learning, geostatistics and computer experiments. Here, we adopt terminology of computer experiments
(Sacks et al., 1989). Denote the th input and output of the computer simulator by , and , respectively. The simulator output, , is modeled as(1) 
where is the overall mean, and is a GP with , , and Cov. Although there are several choices for (Santner et al., 2003), we use the power exponential correlation family
(2) 
where
is the vector of hyperparameters, and
is the smoothness parameter. The squared exponential correlation function is very popular and has good theoretical properties, however, the power exponential correlation function with is more stable in terms of nearsingularity of (especially if the design points are very close, which can occur with large in small  dimensional space). We used for all illustrations in this note (Kaufman et al., 2011). Alternatively, a suitably chosen nugget with Gaussian correlation can be used (Ranjan et al., 2011) to avoid nearsingularity. One could also use another correlation structure like Matérn (Stein, 1999; Santner et al., 2003), which is considered to be numerically more stable than power exponential correlation.We implemented the maximum likelihood approach for fitting the GP model, which requires the estimation of the parameter vector
. The closed form estimators of and are given by(3) 
which are used to obtain the profile loglikelihood
(4) 
for estimating the hyperparameters , where denotes the determinant of .
The evaluation of and for different values of is computationally expensive. One can save some computation time and increase numerical stability by using matrix factorization of (e.g., using LU, QR or Cholesky decomposition) along with solving the linear systems via backsolves to obtain . From an implementation viewpoint, the loglikelihood (4) consists of three computationally expensive components: (i) the evaluation of , (ii) the factorization of and (iii) solving the linear systems via backsolves. The computational cost for the matrix factorization and backsolves are and respectively. Thus, the loglikelihood function evaluation is computationally expensive, specifically for large .
The loglikelihood surface is sometimes bumpy near and often has multiple local optima (Yuan et al., 2008; Schirru et al., 2011; Kalaitzis & Lawrence, 2011; Petelin et al., 2011). This makes maximizing the likelihood challenging. For instance, Figure 1 presents contours of for a point, twodimesional data set, where the inputs are generated using a maximin Latin hypercube design (LHD) in , and the computer simulator outputs are obtained by evaluating the twodimensional GoldPrice function (see Example 1 for more details on this test simulator).
The loglikelihood surface in Figure 1 contains multiple local minima near . Though not very clear from Figure 1(b), more local minima are found by enlarging the region near
. A good optimization algorithm that guarantees the global optimum of such an objective function would require numerous function evaluations. Evolutionary algorithms like genetic algorithm
(Ranjan et al., 2008, 2011)and particle swarm optimization
(Petelin et al., 2011) are commonly used for optimizing the loglikelihood of the GP models. A multistart gradient based optimization approach might be faster but would require careful selection of the starting points to achieve the global minimum of . Once a sufficiently large number of starting points are chosen, a gradient based optimization is also likely to involve a large number of evaluations of the likelihood.The large number of likelihood evaluations, no matter what approach is taken to search for the MLEs, means that efficient ways to evaluate the likelihood will be very important.
3 Implementation of GP on CPUs and GPUs
In this section, we discuss the implementations of GP fitting algorithms in the two computing environments. As in Ranjan et al. (2008, 2011), we use a genetic algorithm (GA) for minimizing . Other optimizers might use different strategies, but all approaches require numerous evaluation of (4).
Our implementation of computing requires one evaluation of , one Cholesky decomposition of and seven backsolves involving the Cholesky factors. The version of GA implemented here for optimizing the loglikelihood uses evaluations of . Thus, the GP model fitting procedure requires evaluations of , Cholesky decompositions and backsolves. Furthermore, the GP model predictions on a set of test points are obtained using the estimated parameters with one evaluation of , one Cholesky decomposition and backsolves. Thus, fitting a GP model and obtaining model predictions are computationally expensive, and this paper advocates the use of GPUs to reduce the computational burden.
The heterogeneous (CPU+GPU) workstation for this study includes two moderately highend consumer grade components each costing approximately $200: a Intel Core i5 750 CPU and a NVIDIA GTX 260 (core 216) GPU. The Intel CPU is capable of 4 flops per clock cycle 4 cores 2.66 gigaHertz per core = 42.58 gigaFlops per cycle (Intel Corporation, 2011). The NVIDIA GPU is capable of 3 flops per clock cycle 216 cores 1.547 gigaHertz per core = 1002 gigaFlops per cycle (Barrachina et al., 2009), roughly 24 times more than the Intel Core i5 processor. The workstation used for comparison is a dualsocket quadcore AMD Opteron 2352 with 32 GB of RAM running Matlab version R2009b. This hardware is slightly older, but is workstation class instead of consumer class, has more RAM and twice as many CPU cores.
CPU implementation. Most personal computers available now have multicore processors. For example, the somewhat highend consumergrade AMD FX8120 has 8 cores. We used a dualsocket quad core AMD Opteron 2352 based workstation, giving 8 cores in total. The CPU implementation was written in Matlab, making use of several builtin functions (e.g., matrix factorization routines, backslash operator for backsolving linear systems) that are linked to compiled high performance subroutines written in Fortran or C. The Cholesky decomposition was used for computing both the determinant and the inverse of (via solving linear systems). The Matlab (CPU) code did not contain specialized parallelization commands. However, the Matlab session used all 8 cores through its intermal math routines.
CPU+GPU implementation. The most popular GPGPU platform is that of a Intel or AMD x86 CPU and a Nvidia GPU. This is the platform that was adopted for this study  an Intel core i5 750 quad core CPU and Nvidia GTX 260 GPU. We used the Nvidia GPGPU software development kit (CUDA 2.3) to implement the Gaussian process model on this system. As part of that implementation we rewrote our Matlab code in the C for CUDA programming language to enable general purpose computation on the GPU. The free libraries CUDA basic linear algebra subroutines (CUBLAS) (NVIDIA Corporation, 2009) and CULATools (EM Photonics and NVIDIA Corporation, 2009) were used for executing the most computationally expensive steps of the GP model on the GPU. CULATools is a proprietary library with a free version that provides popular linear algebra operations for NVIDIA GPUs. Because the Cholesky decomposition routine was not part of the free CULATools library, the free (but closed source) LU decomposition (culaDeviceSgetrf) was used instead. Though we are not certain due to the proprietary nature of CULATools, we believe the GPU LU decomposition is parallelized in a blockbased manner similar to the freely available SCALAPACK version for CPUs (Choi et al., 1992).
We explicitly used the GPU for evaluation of correlation matrix using a custom GPU kernel (C for CUDA function), LU decomposition of (CULATOOLS library), all linear algebraic operations including the backsolves (CUBLAS library), and a few other custom C for CUDA functions in order to minimize the transfer of data to and from the GPU. The computation of requires evaluations of all of which are mutually independent and can be executed in parallel.
Note that no task parallelization (e.g., multiple evaluation of the likelihood at different parameter values) was performed in this implementation. Evaluating the likelihood for numerous parameters simultaneously seems a natural approach to parallelizing a GP model, however this type of parallelism was not possible because only one kernel could have been executed at a time. If we had access to a more advanced GPU (e.g., GF100 series or newer), executing some of these computationally intensive steps in a task parallel manner (using concurrent kernel execution) could potentially lead to even more time savings.
To summarize, we used the GPU for all linear algebra operations and building the correlation matrix . The CPU was used for data input/output and process control. Additional execution configurations could have been tuned to optimize the performance of GPUs (Archuleta et al., 2009). However the aim of this study is not to achieve the highest performance possible for these models and procedures; but rather to explore the use of a GPU as a coprocessor that can lead to significant time savings at a reasonable cost.
4 Examples
In this section, we use simulated examples to compare the results obtained from the two implementations of the GP model. Throughout this note, the CPU implementation of the methodology in Matlab is referred to as CPU implementation and the mixed system (CPU + GPU) implementation (in C based language called CUDA) is referred to as CUDA implementation
. Our CUDA implementation enforces single precision floatingpoint arithmetic and uses LU instead of Cholesky decomposition of the correlation matrices
.For fair comparison of the results, both implementations used exactly the same input data and the number of likelihood evaluations in the optimization procedure. The design points, , were generated using a maximin Latinhypercube scheme. For efficient distribution of the data to the processors in the CUDA implementation, we used powers of for , however, other values of can also be considered in our implementations (e.g., we used in Example 2). Simulations with were not considered here because of a restriction within the CUBLAS library (version 2.3). In the most recent CUBLAS release (3.0) this restriction has been relaxed, allowing for much larger matrices to be considered in future applications.
As pointed out in Sections 2 and 3, the challenge of effectively maximizing the likelihood for a GP model leads to algorithms that repeatedly evaluate the likelihood. In the examples considered in this section, a genetic algorithm with 20 generations and a population of 100 solutions leading to 2,000 likelihood evaluations, was used.
Example 1. Suppose the emulator outputs are generated from the of the 2dimensional GoldsteinPrice function (Törn & Žilinskas, 1979). Designs of different sizes, for , were used to fit the GP model. The reason for not using larger designs in this simulation study is discussed in Section 5. Table 1 shows the parameter estimates, optimized value of as in (4), sum of squared prediction error (), and total runtime of the code (i.e., fitting the GP model and computing SSPE). The SSPE is calculated over a fixed test set, which is a maximin LHD chosen independent to the design used for model fitting. Both CPU and CUDA implementations used the same training and test datasets. A slight randomness (or variation) in these performance measures is introduced due to the stochasticity of the genetic algorithm used for optimizing the likelihood. To reduce the variation, the results in Table 1 are averaged over ten simulations. For each replication of the experiment a different design (training dataset for fitting the GP model) was chosen.
CPU implementation  


CUDA implementation  

It is clear from Table 1 that the optimized value of the profile loglikelihood (), , and SSPE from the two implementations are comparable within each sample size. The discrepancy between the results of these two implementations may be partially explained by the fact that single precision floating point operations were used in CUDA as compared to double precision calculations in Matlab. See Section 5 for further discussion.
While the discrepancies in Table 1 suggest that singleprecision arithmetic on the GPU may not always yield exact optima, we stress that the most difficult part of the likelihood optimization is finding the right neighbourhood of a good optimum. Short doubleprecision runs on the CPU may be used to refine the CUDA results.
The column “Time(sec)” in Table 1 shows that as increases the CUDA implementation significantly outperforms the CPU implementation. For , on average CPU implementation takes more than minutes but CUDA implementation requires roughly seconds.
Example 2. Consider the 6dimensional Hartman function (Törn & Žilinskas, 1979) for generating the simulator outputs. Since the input space is six dimensional, we considered relatively large designs with and ( is not allowed due to CUBLAS restrictions) to fit the GP model, and the SSPE was calculated over 1000 points chosen using a maximin LHD. Table 2 summarizes the simulation results.
CPU implementation  


CUDA implementation  

Similar to Example 1, the results are averaged over 10 simulations. One exception is CPU with : a single run took almost 45 hours, and thus we did not perform 10 simulations. Table 2 shows that the CUDA implementation achieves computation time savings of up to a factor of 150 ().
5 Discussion
This note illustrates that GPU parallelization can significantly reduce the computational burden in fitting GP models. The largest improvements are evident in the most difficult problems (e.g., from 45 hours to 18 minutes in Example 2, case). These dramatic savings suggest that more complete implementations of GP models on GPU platforms deserve further study.
Although this article focuses on speedups due to hardware, there are other approaches for estimation of GP models with large sample sizes. For instance, adhoc methods have been proposed which modify the correlation function and give sparse correlation matrices, facilitating faster computation of the inverses and determinants (e.g., Furrer et al. (2006), Stein (2008)). However, it seems quite plausible that efficiencies realized by a GPU+CPU approach would also benefit these sparse approximations.
A number of design choices in the experiments merit commentary. In Example 1, the reason for not considering or larger is that the correlation matrices are nearsingular. Although choice of the exponent in the correlation matrix , set at , substantially reduces the chance of getting nearsingular correlation matrices, this is still a problem for large in the twodimensional problem. We did not want to further lower the value of , as it would affect the smoothness of the GP fit. Alternatively, a nugget based approach (Ranjan et al., 2011) can be used to implement the GP model with large datasets. It is expected that the runtime of the model fitting procedure would not change significantly with , except when the correlation matrix becomes sparse.
As noted in Section 4, MLEs obtained by the CPU+GPU implementation differed somewhat from those found by the CPU implementation. A significant reason for this is the use of singleprecision arithmetic in the GPU code. These estimates could be refined by a short run of doubleprecision CPU code. Since this refinement might involve 10 or 20 likelihood evaluations, a small fraction of the 2000 necesary to find a good solution, we have not included this effect in our studies.
There are a number of specific functions for highlevel software packages like Matlab and R that leverage GPUs and can be used to improve existing codes. For instance, our CPU implementation can be improved by using Jacket (AccelerEyes, 2011) and GPUlib (TechX Corporation, 2011). This approach was not taken for this study as GPUs were not available in our cluster, and Matlab was not available on our heterogenous computing system. Similarly, GPUtools is a library for GPU accelerated routines in R (Buckner et al., 2010).
Though not discussed here, it is possible to make use of more than one CPU and as many as three or four GPUs in one application/system, if there are available GPU slots. In fact, cutting edge research is currently being done using multiple GPUs coupled by an extremely fast interconnect (Infiniband) (QLogic Corporation, 2011). Even more processing power can be achieved by creating a cluster of CPU+GPU systems coupled over a network, like the second and fourth fastest supercomputers in the world (as of June 2011), the Chinese Tianhe1A and Nebulae (Top500.org, 2011). Developing algorithms to execute in parallel across numerous processors is not a task that is frequently undertaken by statisticians. However, parallel computing, GPU coprocessing, and heterogeneous computing techniques can become popular in the near future for computationally intensive statistical computing applications because of the promise of heterogeneous computing and likely changes to CPU architectures (Brodtkorb et al., 2010).
References
 (1)
 AccelerEyes (2011) AccelerEyes (2011), ‘Jacket website’. http://www.accelereyes.com/products/jacket.
 Archuleta et al. (2009) Archuleta, J., Cao, Y., Scogland, T. & Feng, W. (2009), ‘Multidimensional characterization of temporal data mining on graphics processors’, Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing 00, 1–12.
 Barrachina et al. (2009) Barrachina, S., Castillo, M., Igual, F., Mayo, R., QuintanaOrtí, E. & QuintanaOrtí, G. (2009), ‘Exploiting the capabilities of modern gpus for dense matrix computations’, Concurrency and Computation: Practice and Experience 21(18), 2457–2477.
 Brodtkorb et al. (2010) Brodtkorb, A. R., Dyken, C., Hagen, T. R., Hjelmervik, J. M. & Storaasli, O. O. (2010), ‘Stateoftheart in heterogeneous computing’, Scientific Programming 18, 1–33.
 Buckner et al. (2010) Buckner, J., Seligman, M. & Wilson, J. (2010), ‘Gputools’, http://cran.rproject.org/web/packages/gputools/index.html.
 Choi et al. (1992) Choi, J., Dongarra, J., Pozo, R. & Walker, D. (1992), Scalapack: A scalable linear algebra library for distributed memory concurrent computers, in ‘Frontiers of Massively Parallel Computation, 1992., Fourth Symposium on the’, IEEE, pp. 120–127.
 EM Photonics and NVIDIA Corporation (2009) EM Photonics and NVIDIA Corporation (2009), ‘Culatools basic: GPUaccelerated LAPACK’, http://www.culatools.com.

Furrer et al. (2006)
Furrer, R., Genton, M. G. & Nychka, D. (2006), ‘Covariance tapering for interpolation of large spatial datasets’,
Journal of Computational and Graphical Statistics 15, 502–523.  Intel Corporation (2011) Intel Corporation (2011), ‘Intel microprocessor export compliance metrics’. http://www.intel.com/support/processors/sb/CS023143.htm.
 Kalaitzis & Lawrence (2011) Kalaitzis, A. A. & Lawrence, N. D. (2011), ‘A simple approach to ranking differentially expressed gene expression time courses through gaussian process regression’, BMC Bioinformatics 12, 180.
 Kaufman et al. (2011) Kaufman, C. G., Bingham, D., Habib, S., Heitmann, K. & Frieman, J. A. (2011), ‘Efficient emulators of computer experiments using compactly supported correlation functions, with an application to cosmology’, The Annals of Applied Statistics 5(4), 2470 – 2492.
 Lee et al. (2009) Lee, A., Yau, C., Giles, M. B., Doucet, A. & Holmes, C. C. (2009), ‘On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods’. arXiv:0905.2441v3.
 NVIDIA Corporation (2009) NVIDIA Corporation (2009), ‘CUBLAS Library, Version 2.3’, http://developer.nvidia.com/object/gpucomputing.html.
 NVIDIA Corporation (2010) NVIDIA Corporation (2010), ‘CUDA Toolkit’, http://developer.nvidia.com/cudatoolkit.

Petelin et al. (2011)
Petelin, D., Filipič, B. & Kocijan, J. (2011), Optimization of gaussian process models with
evolutionary algorithms, in ‘Proceedings of the 10th international
conference on Adaptive and natural computing algorithms  Volume Part I’,
ICANNGA’11, SpringerVerlag, Berlin, Heidelberg, pp. 420–429.
http://dl.acm.org/citation.cfm?id=1997052.1997098  QLogic Corporation (2011) QLogic Corporation (2011), ‘Maximizing gpu cluster performance: Truescale infiniband accelerates gpu cluster’, Whitepaper: http://www.qlogic.com/Resources/Documents/WhitePapers/InfiniBand/Maximizing_GPU_Cluster_Performance.pdf.
 Ranjan et al. (2008) Ranjan, P., Bingham, D. & Michailidis, G. (2008), ‘Sequential experiment design for contour estimation from complex computer codes’, Technometrics 50(4), 527–541.
 Ranjan et al. (2011) Ranjan, P., Haynes, R. & Karsten, R. (2011), ‘A computationally stable approach to gaussian process interpolation of deterministic computer simulation data’, Technometrics 53(4), 366–378.
 Rasmussen & Williams (2006) Rasmussen, C. E. & Williams, C. K. I. (2006), Gaussian Processes for Machine Learning, The MIT Press, Cambridge.
 Sacks et al. (1989) Sacks, J., Welch, W., Mitchell, T. & Wynn, H. (1989), ‘Design and analysis of computer experiments’, Statistical Science 4(4), 409–435.
 Santner et al. (2003) Santner, T. J., Williams, B. & Notz, W. (2003), The design and analysis of computer experiments, Springer Verlag, New York.
 Schirru et al. (2011) Schirru, A., Pampuri, S., Nicolao, G. D. & McLoone, S. (2011), ‘Efficient marginal likelihood computation for gaussian processes and kernel ridge regression’. arXiv:1110.6546v1.
 Stein (1999) Stein, M. L. (1999), Interpolation of spatial Data: some theory for kriging, Springer, New York.
 Stein (2008) Stein, M. L. (2008), ‘A modeling approach for large spatial datasets’, Journal of Korean Statistical Society 37, 3–10.
 TechX Corporation (2011) TechX Corporation (2011), ‘Gpulib website’. http://www.txcorp.com/products/GPULib.
 Top500.org (2011) Top500.org (2011), ‘Top500 supercomputer sites’, http://www.top500.org/lists/2011/06.
 Törn & Žilinskas (1979) Törn, A. & Žilinskas, A. (1979), Global Optimization, Springer Verlag, Berlin.

Yuan et al. (2008)
Yuan, J., Wang, K., Yu, T. & Fang, M. (2008), ‘Reliable multiobjective optimization of highspeed
wedm process based on gaussian process regression’, International
Journal of Machine Tools and Manufacture 48(1), 47 – 60.
http://www.sciencedirect.com/science/article/pii/S0890695507001265 
Zechner & Granitzer (2009)
Zechner, M. & Granitzer, M. (2009), ‘Accelerating kmeans on the graphics processor via CUDA’,
First International Conference on Intensive Applications and Services pp. 7–15.
Comments
There are no comments yet.