    # Pushing the Limit: A Hybrid Parallel Implementation of the Multi-resolution Approximation for Massive Data

The multi-resolution approximation (MRA) of Gaussian processes was recently proposed to conduct likelihood-based inference for massive spatial data sets. An advantage of the methodology is that it can be parallelized. We implemented the MRA in C++ for both serial and parallel versions. In the parallel implementation, we use a hybrid parallelism that employs both distributed and shared memory computing for communications between and within nodes by using the Message Passing Interface (MPI) and OpenMP, respectively. The performance of the serial code is compared between the C++ and MATLAB implementations over a small data set on a personal laptop. The C++ parallel program is further carefully studied under different configurations by applications to data sets ranging from around a tenth of a million to 47 million observations. We show the practicality of this implementation by demonstrating that we can get quick inference for massive real-world data sets. The serial and parallel C++ code can be found at https://github.com/hhuang90.


## 1 Introduction and Background

### 1.1 Multi-resolution Approximation

The multi-resolution approximation (MRA) was recently proposed by Katzfuss (2017) to handle massive spatial data sets modeled with Gaussian processes. We present a brief summary of the methodology here before moving to our implementation details.

Let $X(s)$ be the random process for $s \in D$, where $D$ is the physical domain of interest, typically two-dimensional. Under the Gaussian process assumption, the random process $X(\cdot) \sim GP(\mu, C)$, where $\mu$ and $C$ are the mean function and covariance function, respectively. Without loss of generality, we can assume $\mu$ is zero because the true mean is always estimated in advance. The covariance function $C(\cdot,\cdot;\theta)$ is a positive definite function with parameter(s) $\theta$.

When the random process is observed at a finite set of locations $S = \{s_1, s_2, \dots, s_n\}$, the random vector $X(S) = \big(X(s_1), \dots, X(s_n)\big)^{\mathrm T}$ follows a multivariate normal distribution $\mathcal{N}\big(0, \Sigma(S)\big)$, where the variance-covariance matrix $\Sigma(S)$ is given by

$$
\Sigma(S) = C(S,S) = \begin{bmatrix}
C(s_1,s_1) & C(s_1,s_2) & \dots & C(s_1,s_n)\\
C(s_2,s_1) & C(s_2,s_2) & \dots & C(s_2,s_n)\\
\vdots & \vdots & \ddots & \vdots\\
C(s_n,s_1) & C(s_n,s_2) & \dots & C(s_n,s_n)
\end{bmatrix}.
$$

When $n$ is large, it is computationally prohibitive to conduct inference from the data. The MRA proposes an approximate random process and substitutes it for the original random process to make inference. The approximation is based on a multi-resolution partitioning of the physical domain. Let $D_1 = D$, and then for any natural number $M$ for the number of total levels of the multi-resolution partitioning, each region is divided into $J$ subregions, i.e.,

$$
D_{1,j_2,\dots,j_{m-1}} = \bigcup_{j_m=1}^{J} D_{1,j_2,\dots,j_m}, \qquad j_2,\dots,j_{m-1} = 1,\dots,J, \quad m = 2,\dots,M.
$$

For each region $D_{1,j_2,\dots,j_m}$, we place $r$ knots that are the centers of basis functions to be discussed later in Section 1.2. $Q_{1,j_2,\dots,j_m}$ denotes the set of the knots in $D_{1,j_2,\dots,j_m}$, and $Q^{[m]}$ denotes all the sets at level $m$. At level $M$, $Q_{1,j_2,\dots,j_M}$ is chosen to be the set of observation locations in $D_{1,j_2,\dots,j_M}$.

Then the multi-resolution approximation uses the predictive process (Banerjee et al., 2008) iteratively as follows. $\tau_1(\cdot)$ is the predictive process that approximates $X(\cdot)$ based on the knots $Q^{[1]}$ at level 1, i.e., $\tau_1(s) = E[X(s) \mid X(Q^{[1]})]$. The remainder process $X(\cdot) - \tau_1(\cdot)$ is approximated at level 2 by assuming independence among the level-2 regions, and is denoted as $\delta_2(\cdot)$. The notation for the subscript $\langle m \rangle$ is that for a generic Gaussian process $Y(\cdot)$, $\{Y(\cdot)\}_{\langle m\rangle}$ is a Gaussian process such that $\mathrm{Cov}\big[\{Y(s_1)\}_{\langle m\rangle}, \{Y(s_2)\}_{\langle m\rangle}\big] = \mathrm{Cov}\big[Y(s_1), Y(s_2)\big]$ if $s_1$ and $s_2$ are in the same region at level $m$, and $0$ otherwise. $\tau_2(\cdot)$ is the predictive process that approximates $\delta_2(\cdot)$ based on $\delta_2(Q^{[2]})$, i.e., $\tau_2(s) = E[\delta_2(s) \mid \delta_2(Q^{[2]})]$. Repeat this procedure to level $M$. For clarity, these processes are given below,

$$
\begin{aligned}
\tau_1(s) &= E\big[X(s) \mid X(Q^{[1]})\big], & \delta_2(s) &= \{X(s) - \tau_1(s)\}_{\langle 2\rangle},\\
\tau_2(s) &= E\big[\delta_2(s) \mid \delta_2(Q^{[2]})\big], & \delta_3(s) &= \{\delta_2(s) - \tau_2(s)\}_{\langle 3\rangle},\\
\tau_3(s) &= E\big[\delta_3(s) \mid \delta_3(Q^{[3]})\big], & \delta_4(s) &= \{\delta_3(s) - \tau_3(s)\}_{\langle 4\rangle},\\
&\ \ \vdots & &\ \ \vdots\\
\tau_{M-1}(s) &= E\big[\delta_{M-1}(s) \mid \delta_{M-1}(Q^{[M-1]})\big], & \delta_M(s) &= \{\delta_{M-1}(s) - \tau_{M-1}(s)\}_{\langle M\rangle},\\
\tau_M(s) &= E\big[\delta_M(s) \mid \delta_M(Q^{[M]})\big].
\end{aligned}
$$

Hence,

$$
\begin{aligned}
X(s) &\approx \tau_1(s) + \delta_2(s)\\
&\approx \tau_1(s) + \tau_2(s) + \delta_3(s)\\
&\approx \tau_1(s) + \tau_2(s) + \dots + \tau_{M-1}(s) + \delta_M(s)\\
&\approx \tau_1(s) + \tau_2(s) + \dots + \tau_{M-1}(s) + \tau_M(s).
\end{aligned}
$$

Under the assumption that $X(\cdot)$ is a Gaussian process, it can be shown that each approximated remainder process $\delta_m(\cdot)$ is also a Gaussian process with mean zero. In addition, the covariance functions $C_m$ vanish for $s_1$ and $s_2$ not in the same region at level $m$ and have the forms below for $s_1, s_2 \in D_{1,j_2,\dots,j_{m-1}}$,

$$
\begin{aligned}
C_2(s_1,s_2) &= C(s_1,s_2) - C(s_1,Q_1)^{\mathrm T}\, C(Q_1,Q_1)^{-1}\, C(s_2,Q_1),\\
C_3(s_1,s_2) &= C_2(s_1,s_2) - C_2(s_1,Q_{1,j_2})^{\mathrm T}\, C_2(Q_{1,j_2},Q_{1,j_2})^{-1}\, C_2(s_2,Q_{1,j_2}),\\
&\ \ \vdots\\
C_M(s_1,s_2) &= C_{M-1}(s_1,s_2) - C_{M-1}(s_1,Q_{1,j_2,\dots,j_{M-1}})^{\mathrm T}\, C_{M-1}(Q_{1,j_2,\dots,j_{M-1}},Q_{1,j_2,\dots,j_{M-1}})^{-1}\, C_{M-1}(s_2,Q_{1,j_2,\dots,j_{M-1}}).
\end{aligned}
$$

Writing $\eta_{1,j_2,\dots,j_m} = C_m(Q_{1,j_2,\dots,j_m}, Q_{1,j_2,\dots,j_m})^{-1}\, \delta_m(Q_{1,j_2,\dots,j_m})$, then the predictive process in each level can be written as

$$
\tau_m(s) = C_m(s, Q_{1,j_2,\dots,j_m})^{\mathrm T}\, \eta_{1,j_2,\dots,j_m}, \qquad \text{for } s \in D_{1,j_2,\dots,j_m},
$$

where the random vector $\eta_{1,j_2,\dots,j_m} \sim \mathcal{N}\big(0,\, C_m(Q_{1,j_2,\dots,j_m}, Q_{1,j_2,\dots,j_m})^{-1}\big)$. In addition, $C_m(\cdot, Q_{1,j_2,\dots,j_m})$ is a vector-valued function of dimension $r$ on domain $D_{1,j_2,\dots,j_m}$. We call each component a basis function, and it is easy to observe that each basis function achieves its maximum value at the corresponding knot in $Q_{1,j_2,\dots,j_m}$.

### 1.2 Strategy of Building the Structure

Starting from the data domain $D$ at level 1, we recursively partition the regions into $J$ subregions. We restrict $J$ to be either 2 or 4 for simplicity. In practice, these options are sufficient because the correlation between different subregions is ignored, and if $J$ is too large, too much correlation is discarded in a single level, which is contrary to the objective. When $J = 4$, every region is partitioned into four subregions with half the length in each of the two dimensions. When $J = 2$, the partition is executed along the longer dimension. Figure 1 gives an example of how the multi-resolution structure is built for $J = 2$, $M = 3$. At level 1, we extend the domain by pushing the top and right boundaries outward by one percent. The reason to extend the domain at level 1 slightly is that we always treat each region as a half-closed, half-open area (data on the bottom and left boundaries are included whereas data on the top and right boundaries are excluded), as indicated in Figure 1, where the points on the dashed lines are excluded. By extending the domain, the original top and right boundaries, indicated in gray in Figure 1, are still included, and the observation points on those boundaries are not ignored. Then, from level 2, each region is divided into $J = 2$ subregions along the longer axis. Since one axis is longer at level 1, we divide the region into two subregions of equal area along that axis. The same technique applies to level 3 except that we divide each region along the other axis, which now has the longer length within each region. This procedure is repeated until we get to the finest resolution level $M$.

Figure 1: Illustration of building the multi-resolution structure for J=2, M=3. Dashed lines indicate open boundaries and solid lines indicate closed boundaries.
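The alternating-axis rule for $J = 2$ can be sketched in a few lines of C++. The struct and function names below are ours for illustration and do not appear in the repository; the sketch assumes half-open rectangles as described above:

```cpp
#include <array>
#include <vector>

// A rectangular region [x0, x1) x [y0, y1); half-open boundaries mean each
// observation belongs to exactly one region at every level.
struct Region { double x0, x1, y0, y1; };

// Split a region into J = 2 subregions of equal area along its longer axis.
std::array<Region, 2> split(const Region& r) {
    if (r.x1 - r.x0 >= r.y1 - r.y0) {            // longer (or equal) in x
        double xm = 0.5 * (r.x0 + r.x1);
        return {Region{r.x0, xm, r.y0, r.y1}, Region{xm, r.x1, r.y0, r.y1}};
    }
    double ym = 0.5 * (r.y0 + r.y1);             // longer in y
    return {Region{r.x0, r.x1, r.y0, ym}, Region{r.x0, r.x1, ym, r.y1}};
}

// Build all regions at level M by repeated splitting, starting from the
// (slightly extended) domain at level 1; 2^(M-1) regions result.
std::vector<Region> finest_level(Region domain, int M) {
    std::vector<Region> level{domain};
    for (int m = 2; m <= M; ++m) {
        std::vector<Region> next;
        for (const Region& r : level)
            for (const Region& s : split(r)) next.push_back(s);
        level = next;
    }
    return level;
}
```

Because the split always halves the longer side, the partition axis alternates automatically for roughly square domains, matching the behavior in Figure 1.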

Once we have the multi-resolution structure built, we place knots, which are the centers of the basis functions, in each region. The number of knots in each region other than at the finest resolution level, i.e., levels $1, \dots, M-1$, is specified by $r$. The number of knots in regions at the finest resolution level can also be chosen as $r$, resulting in a further approximation at the finest resolution level. However, in our implementation at the finest resolution level, we place the knots where the data points locate. The number of knots in each region at the finest resolution level is then automatically determined by the data set and the built structure. Because the number of observations can be different in different regions at the finest resolution level, the number of knots can vary in regions at the finest resolution level. For levels $1, \dots, M-1$, the knots are placed on the $\lceil\sqrt{r}\rceil \times \lfloor r/\lceil\sqrt{r}\rceil\rfloor$ grid, where $\lceil x\rceil$ means the minimum integer that is not less than $x$, and $\lfloor x\rfloor$ means the maximum integer that is not greater than $x$. The grid has equal distances between any two consecutive grid bars (see the dashed lines in Figure 2), and we use a parameter offset to control the smallest distance between the knots and the region boundary. If the underlying region is $[x_{\min}, x_{\max}]$ along the $x$-dimension, then the $x$-grid bars are

$$
\begin{aligned}
x_1 &= x_{\min} + \text{offset}\times(x_{\max}-x_{\min}),\\
x_2 &= x_1 + \frac{x_{\max}-x_{\min}-2\,\text{offset}}{\lceil\sqrt{r}\rceil-1},\\
x_3 &= x_2 + \frac{x_{\max}-x_{\min}-2\,\text{offset}}{\lceil\sqrt{r}\rceil-1},\\
&\ \ \vdots\\
x_{\lceil\sqrt{r}\rceil} &= x_{\lceil\sqrt{r}\rceil-1} + \frac{x_{\max}-x_{\min}-2\,\text{offset}}{\lceil\sqrt{r}\rceil-1}.
\end{aligned}
$$

The locations of the $y$-grid bars can be obtained similarly. We see the actual number of placed knots is $\hat r = \lceil\sqrt{r}\rceil \times \lfloor r/\lceil\sqrt{r}\rceil\rfloor$, which is equal to $r$ when $r$ is a square number. An example is illustrated in Figure 2, where the underlying region is indicated in gray. Suppose $r = 32$, and then we have $\lceil\sqrt{32}\rceil = 6$ $x$-grid bars and $\lfloor 32/6\rfloor = 5$ $y$-grid bars. Therefore, we actually place $\hat r = 30$ knots in this region. If offset $= 0.1$, the placed knots are shown as solid dots in Figure 2.

Figure 2: Illustration of the placed knots in a toy example. The region is [0,1]×[0,1] indicated in gray. Choosing r=32 and offset=0.1, the r̂=30 placed knots are shown as solid dots. The dashed lines are the grid bars.
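The grid dimensions and bar positions are easy to compute. The following is a minimal sketch with hypothetical names; the spacing uses the relative form offset × (range), which coincides with the formulas above on the unit-square example:

```cpp
#include <cmath>
#include <vector>

// Grid dimensions for r requested knots: ceil(sqrt(r)) bars in one dimension
// and floor(r / ceil(sqrt(r))) in the other, so the actual knot count
// r_hat = nx * ny equals r only when r is a square number.
struct GridDims { int nx, ny; };

GridDims knot_grid_dims(int r) {
    int nx = static_cast<int>(std::ceil(std::sqrt(static_cast<double>(r))));
    int ny = r / nx;  // integer division acts as floor for positive values
    return {nx, ny};
}

// Equally spaced grid bars on [lo, hi]: the first bar sits at
// lo + offset*(hi - lo) and the last at hi - offset*(hi - lo).
std::vector<double> grid_bars(double lo, double hi, int n, double offset) {
    std::vector<double> bars(n);
    double first = lo + offset * (hi - lo);
    double step = (n > 1) ? (hi - lo) * (1.0 - 2.0 * offset) / (n - 1) : 0.0;
    for (int i = 0; i < n; ++i) bars[i] = first + i * step;
    return bars;
}
```

For the toy example, `knot_grid_dims(32)` gives a 6 × 5 grid, i.e., 30 placed knots, and with offset = 0.1 the bars on [0, 1] run from 0.1 to 0.9.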

One restriction of placing the knots is that no two knots should be located at exactly the same position. Otherwise, the variance-covariance matrix in the statistical inference will be singular. To guarantee that, one requirement is that any observation that has the same location as any knot before the finest resolution level will be eliminated. The chance of observation locations coinciding with knots before the finest resolution level is low (this coincidence never happened in our experiments for the different data sets in later sections), and we often have many observations when we use the MRA to model large data sets, so the number of observations that may need to be eliminated is negligible. The other requirement is that no knots at levels before the finest resolution level should have the same location. We use a simple strategy to prevent this collocation from happening by setting the offset to an irrational number, which is the default value in our code. The mathematical justification for this strategy is given in Appendix A.

## 2 Computing Facility and Code Outline

### 2.1 Computing Facility

We use two kinds of computing facilities to run the code and produce the reported results. One is a personal laptop—a MacBook Pro (13-inch, early 2015) with a 3.1-GHz Intel Core i7 processor and 16 GB DDR3 memory. The other one is the high-performance computing facility including Cheyenne and Casper, built for the National Center for Atmospheric Research (NCAR) by Silicon Graphics, Inc. (SGI). There are two types of nodes on Cheyenne—some have 45 GB usable memory and others have 109 GB usable memory. Each node has two sockets, and each socket is equipped with 18-core 2.3-GHz Intel Xeon E5-2697V4 (Broadwell) processors. There are multiple types of nodes on Casper as well, and we use the 22 available nodes that have 384 GB memory and two sockets equipped with 18-core 2.3-GHz Intel Xeon Gold 6140 (Skylake) processors. More information can be found at https://www2.cisl.ucar.edu/resources/computational-systems/cheyenne for Cheyenne and https://www2.cisl.ucar.edu/resources/computational-systems/casper for Casper.

### 2.2 Code Outline

We compare two MRA implementations—Matlab and C++. This article focuses on introducing the C++ implementation, and the details of the full-layer Matlab implementation can be found in Blake et al. (2018).

In the C++ implementation, there are two versions for the serial and parallel execution. There is little difference in the user interfaces. The only extra requirement for the parallel version is that the code should be compiled and run with Message Passing Interface (MPI) compilers and commands. Examples of MPI packages include Intel MPI, Open MPI, and SGI MPT. The users need to ensure at least one MPI package is installed before compiling and running the parallel code. In our reported results hereinafter for the parallel execution, we use the SGI MPT MPI package mpt-2.19 (the setting MPI_IB_CONGESTED, which helps alleviate network congestion but may lead to more communication overhead, is disabled in our experiments) on the Cheyenne nodes and the Intel MPI package impi-2017.1.132 on the Casper nodes.

#### 2.2.1 Compiling

The file Makefile is provided for both the serial and parallel versions. The users need to modify the value of “CXX” for the available compiler for C++ (the parallel code requires a C++ compiler that supports MPI) before running make to compile. All the results in this article are from programs compiled with clang++ on the laptop, icpc on Cheyenne for the serial version, and mpicxx as a front end of icpc on Cheyenne and Casper for the parallel version. In our reported results hereinafter, clang++-x86_64-apple-darwin17.7.0 is used on the laptop, and icpc-17.0.1 is used on the Cheyenne and Casper nodes.

#### 2.2.2 Prerequisite Libraries

There are three dependent libraries needed for our implementation—mkl, armadillo, and dlib. We use mkl as the linear algebra kernel, armadillo for easier manipulation of matrices (Sanderson and Curtin, 2016), and dlib for solving the optimization problems. More information about the libraries can be found at https://software.intel.com/en-us/mkl, http://arma.sourceforge.net, and http://www.dlib.net. In our reported results hereinafter, armadillo-9.100.5 and dlib-19.16 are used on Cheyenne nodes, Casper nodes, and the laptop. The mkl library mkl-2017.0.1 is used on Cheyenne nodes and Casper nodes, and mkl-2019.0.0 is used on the laptop.

#### 2.2.3 User Parameters

The file user_parameters is provided along with the code. This is the place for users to specify the settings for the program. The program will read parameters from this file at runtime. This file is not part of the source code, which means the code does not need to be compiled again if the parameter values in user_parameters are changed. Tables 1 and 2 show the user parameter names and the corresponding descriptions. The usage of some parameters will be discussed further in Section 4, where the related parallel implementation concepts are introduced.

#### 2.2.4 Source Code

There are two directories containing the source code—the directory src contains the source files and the directory include contains all the associated header files. Table 3 gives a brief description of all the source files in the directory src.

#### 2.2.5 Calculation Mode

There are four calculation modes that can be chosen—“build_structure_only”, “likelihood”, “prediction”, and “optimization”. The “build_structure_only” mode only prints out the details about the built multi-resolution structure. The “likelihood” and “prediction” modes evaluate the likelihood and make predictions for given parameters, respectively. The “optimization” mode iterates the likelihood evaluations to find the optimal parameter values that maximize the likelihood. There are a variety of optimization libraries and methods that can be used. In our implementation, we use the BOBYQA method (Powell, 2009) in the library dlib for optimization, which does not require knowing the gradient. Users can change the optimization method by modifying the file optimization.cpp in the directory src if another optimization method is preferred.

## 3 Serial Implementation Performance Comparison between C++ and Matlab

### 3.1 Small MODIS Data

#### 3.1.1 Data Description

In this section, we use a small satellite data set that was originally analyzed by Heaton et al. (2018). The data set is visualized in Figure 3, showing daytime land surface temperatures on August 4, 2016, as measured by the Moderate Resolution Imaging Spectroradiometer (MODIS) sensor aboard the Terra satellite. There are some missing values in this region on August 6, 2016, and this missing-value pattern is applied to the data set on August 4, 2016, resulting in the 105,569 observations analyzed here. We use the serial code to calculate the likelihood from the observations and make predictions at all the locations. We found that there are some mistakes in the magnitude of the temperatures and potentially an accidental upward shift in the data values. Since the focus of this article is to show how we apply our MRA implementations, not to draw geophysical conclusions about this specific data set, we still use the original data from Heaton et al. (2018).

Figure 3: Small MODIS data set from Heaton et al. (2018). MODIS land surface temperatures (LST) in Celsius in the daytime of August 4, 2016 at 105,569 locations. Note that there are some mistakes in the magnitude of the temperatures and potentially there is an accidental upward shift in the data values.

#### 3.1.2 Results

We consider an experiment with the settings shown in Table 4. $M$ is the number of levels, $r$ is the number of knots specified in the regions before the finest resolution level, and $\hat r$ is the actual number of placed knots in the regions before the finest resolution level. The slight difference between $r$ and $\hat r$ comes from our partitioning strategy as described in Section 1.2. We always choose the natural number that is closest to and not less than the square root of $r$ as the number of knots in one dimension. Then, the number of knots in the other dimension is determined accordingly. If $r$ is not a square number, there may be a slight difference between the specified value $r$ and the actual value $\hat r$ that is used in the program.

The experiment is executed on the laptop and on one Cheyenne node with 45 GB usable memory as specified in Section 2.1. The running time throughout this article is measured by the elapsed real time (the so-called wall time). Note that the "optimization" calculation mode is just a series of likelihood evaluations, so we only report the running time of evaluating likelihoods and making predictions using C++ and Matlab. We run the serial program on the laptop only once for each of the cases in Table 4. It is noteworthy that the Matlab running time heavily depends on whether the operating system has just been rebooted or has been running for a long time, leaving less contiguous memory space. We therefore reboot the operating system each time before running the Matlab experiments on the laptop.

The running time results of C++ and Matlab on the laptop are shown in Table 5. We also run the experiments for the serial program on one Cheyenne node with 45 GB usable memory. We run five experiments for each scenario in Table 4, and the running time results on the Cheyenne node are shown in Figure 4. We see that the differences between the running times of C++ and Matlab are small or negligible when $M$ is small, but C++ outperforms Matlab by an increasingly significant margin as $M$ increases.

Comparing the running time for each program over different scenarios, we see that the running time always decreases as $M$ increases at the beginning, and increases again after a certain point. A possible reason is that at the beginning, an increasing $M$ indicates more approximation under the setting in Table 4, where the sum of the number of knots over all the regions at the second finest resolution level is restricted to be the same. When $M$ reaches a point where the benefit of further approximation in terms of computational cost is negligible, the cost of calculating more levels dominates, so the running time increases if $M$ continues to increase; the extra cost is more severe in the Matlab implementation. Note that making predictions always requires more time than evaluating likelihoods because there are more quantities to calculate.

Figure 4: Running time of evaluating likelihoods and making predictions on one Cheyenne node with 45 GB usable memory using C++ and Matlab for different numbers of levels specified. The solid lines are the mean running time, shaded with one standard deviation above and below over the five experiments.

To show the prediction accuracy, we randomly leave out 5,000 data points from the original data set and predict the land surface temperatures at these 5,000 points without access to the true values. The locations and values of the left-out data points are shown in Figure 5, where we see a good coverage of the domain. Then, we calculate the differences between the prediction means and the true values. The prediction variances are also available in the prediction outputs for these 5,000 data points. The prediction means and variances from C++ and Matlab are exactly the same, which is supporting evidence that the implementations are correct. The mean squared prediction error (MSPE), i.e., the mean of the 5,000 squared differences between the prediction means and the true values, and the mean prediction variance (MPV), i.e., the mean of the 5,000 prediction variances, are given in Figure 6. It is no surprise that the accuracy generally gets worse when $M$ increases since more approximation is applied.

Figure 5: The randomly selected validation data points that are left out for prediction.

Figure 6: Mean squared prediction error (MSPE) and mean prediction variance (MPV) for different numbers of levels specified, using either C++ or Matlab. Note that the prediction results from C++ and Matlab are exactly the same.

## 4 Parallel Implementation

### 4.1 Parallel Mechanism

It is assumed in the methodology of MRA that knots in different regions at the same level are independent and only correlated with knots in regions that are ancestors or descendants. This feature allows us to develop a parallel implementation that makes use of the independence structure. With a massive data set, the inference problem is almost always memory-bounded since the memory of a node is fixed and limited. Almost all high-performance computing systems nowadays have an architecture such that cores in a node have direct access to the Random-access Memory (RAM) device inside the node, while communication is required if the cores try to access data in another node’s RAM. We first focus on this type of distributed architecture, allocating local memory to each core and using the Message Passing Interface (MPI) to implement the communication between cores.

Suppose we have $n_p$ MPI processes and $n_f$ regions at the finest resolution level. Each MPI process will be assigned $\lfloor n_f/n_p\rfloor$ regions (some are assigned $\lceil n_f/n_p\rceil$ regions to ensure that every region at the finest resolution level is assigned to an MPI process). Then, all the regions that are ancestors of the assigned regions at the finest resolution level are also assigned to the same MPI process. Therefore, each MPI process only needs to allocate memory for variables and conduct calculations related to the assigned regions. Each MPI process can work independently while calculating the prior quantities, and no synchronization is needed.
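The static assignment counts can be sketched directly (a hypothetical helper, not the repository's code): every process gets the floor of regions divided by processes, and the first remainder processes get one extra region.

```cpp
#include <vector>

// Number of finest-level regions handled by each MPI process under the
// static assignment: floor(nRegions / nProcs) each, plus one extra for the
// first (nRegions % nProcs) processes so every region is covered.
std::vector<int> regions_per_process(int nRegions, int nProcs) {
    std::vector<int> counts(nProcs, nRegions / nProcs);
    for (int i = 0; i < nRegions % nProcs; ++i) counts[i] += 1;
    return counts;
}
```

For the toy example below with 8 finest-level regions and 3 MPI processes, this yields the counts 3, 3, and 2.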

However, when looping over a region whose children are not all handled by a single MPI process, synchronization is needed during posterior inference. The MPI process with the smallest rank that handles some children of this region is chosen as the "supervisor" of this region, and the rest are the "worker(s)" of the region. The "worker(s)" send information about their children that are not handled by the "supervisor" to the "supervisor". Then the "supervisor" can use information from all the children of the underlying region to calculate the relevant quantities for that region. We call a region a working region for an MPI process when the MPI process needs to calculate the associated statistical quantities of knots in that region. After synchronization, the underlying region will not be considered a working region for the "worker(s)" anymore. Subsequently, every ancestor that was assigned as a working region only because it is an ancestor of this underlying region, and not an ancestor of other working regions, will be eliminated from the working region list for the "worker(s)" as well.

Figure 7 illustrates the parallel mechanism on a toy example. Assume we have a one-dimensional domain. We set the number of levels to 4, the number of subregions in each partitioning to 2, and use 3 MPI processes. Figure 7: Illustration of the parallel MRA implementation. Solid blue indicates working regions in the prior computation. Solid green indicates working regions at one level below the current level in posterior inference. Solid red with a star indicates regions at the current level where all children were handled by this MPI process and no synchronization is needed. Checkered red and slashed red indicate regions at the current level that require synchronization, where checkered red represents the “supervisor” and slashed red the “worker(s)”.

Step 1 is to build the multi-resolution structure. In each level starting from level 2, a region is divided into $J = 2$ subregions. Step 2 is to count the number of regions each MPI process should work on and assign each MPI process its working regions. In the toy example, we have 8 regions at the finest resolution level and 3 MPI processes that can be used. This indicates each MPI process should work on at least $\lfloor 8/3\rfloor = 2$ regions. To ensure every region is assigned, the first two MPI processes are assigned one additional region, i.e., 3 regions are assigned to MPI process 1 and MPI process 2. Step 3 is to determine all the working regions when computing prior quantities. Since there exist correlations between the assigned regions at the finest resolution level and all their ancestors, all the ancestors will be assigned as working regions as well. All the working regions are indicated in solid blue for each MPI process in Figure 7. Step 4 is to compute all the prior quantities in the order from level 1 to level 4 for a given MPI process. Step 5 is to compute the posterior inference, where the parallelism is much more complicated than in the prior computation. It starts from the second finest resolution level, which is level 3 in our example, and proceeds to the coarsest resolution level, which is level 1. In Figure 7, solid green indicates working regions at one level below the current level. For example, for MPI process 1 at level 3, the first three regions at level 4 are shown in solid green. All the ancestors of green regions will be the working regions at the current level. However, there are three types of working regions at the current level. The solid red with a star indicates all the children of this region were handled by this MPI process and no synchronization is needed. If not all the children of this region were handled by this MPI process, synchronization is required and the other two types of regions come into play.
If the MPI process is the “supervisor” for this region, the region will be indicated in checkered red, and if the MPI process is one of the “worker(s)” for this region, the region will be indicated in slashed red. The “worker(s)” will send information related to the slashed red region to the “supervisor”. The “supervisor” will gather all the information about the checkered red region from the “worker(s)” and compute the posterior quantities related to the checkered red region. After the synchronization, the checkered red region will still be considered a working region, which means it becomes green when the program goes one level up. However, the slashed red region will be eliminated from the working region list since all the useful information has been aggregated by the “supervisor” in the checkered red region, and the slashed red region will not become green when the program goes one level up.
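The supervisor election itself is trivial: among the ranks that handle some children of the region, the smallest rank becomes the "supervisor". A one-function sketch (hypothetical helper name, assuming the rank set is non-empty):

```cpp
#include <set>

// Elect the "supervisor" of a region requiring synchronization: the MPI
// process with the smallest rank among those handling some of its children.
// Precondition: ranksHandlingChildren is non-empty.
int supervisor(const std::set<int>& ranksHandlingChildren) {
    // std::set iterates in ascending order, so begin() is the minimum rank.
    return *ranksHandlingChildren.begin();
}
```

All other ranks in the set act as "workers" and send their partial results to this rank.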

As will be seen later in Section 4.2, the problem is almost always memory-bounded for an extremely large data set, which means we can only use a small number of cores for MPI processes in a node due to the memory constraint. Therefore, the majority of cores in the node are left idle. To utilize them, we use shared-memory multiprocessing to further parallelize the code with the extra idle cores wherever applicable. We use OpenMP to implement the shared-memory multiprocessing in our code. Then, in the hybrid parallel program, MPI and OpenMP are used simultaneously for distributed-memory and shared-memory parallelism, respectively.

### 4.2 Memory Requirement

When dealing with an extremely large data set, before executing the program, we need to choose an appropriate model setup that the computing facility can handle. Memory allocation is the key issue since in practice the physical memory of a node is not infinite but limited.

When the number of observations and the number of levels $M$ are both large, the variable allocating the most memory in the C++ code for likelihood evaluation or optimization is ATilde in the class Approximation. Under the allocation and de-allocation mechanism of ATilde implemented in the code, the upper bound for its memory consumption is

$$
J^{M-1} \times M \times (M-1) \times r^2 \times 2^{-28}\ \text{GiB}. \tag{1}
$$

See Appendix B for more details. The exact upper bound of the total memory consumption of the entire program is somewhat larger than that of ATilde. However, the ratio between the two can vary from a value slightly greater than one to a much larger number, depending on many factors such as the values of $J$, $M$, and $r$ and the distribution of the number of observations in regions at the finest resolution level. We do not intend to declare the value in Equation 1 an estimate of the memory consumption bound, but it can be used as a reference for an initial impression of the total memory requirement. Users should monitor the memory usage for the specific data set and parameter choices if the memory constraint is an issue. We offer an option to write the values of ATilde in each region to disk and load the values from disk when needed by setting SAVE_TO_DISK_FLAG in the file user_parameters to "true". However, this will significantly slow down the execution due to the input/output operations. It is recommended to use a larger memory if possible rather than enabling this option. Nevertheless, this save-to-disk option can be used as a backup.
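Equation 1 is easy to evaluate before launching a run. The following convenience calculator (our own helper, not part of the repository) returns the bound in GiB; recall it bounds only ATilde, not the whole program:

```cpp
#include <cmath>

// Upper bound (1) on the memory consumed by ATilde, in GiB:
//   J^(M-1) * M * (M-1) * r^2 * 2^(-28).
// Useful for sizing a run against a node's usable memory; the program's
// total consumption is larger by a data-dependent factor.
double atilde_bound_gib(int J, int M, int r) {
    return std::pow(static_cast<double>(J), M - 1)
           * static_cast<double>(M) * (M - 1)
           * static_cast<double>(r) * r
           * std::pow(2.0, -28);
}
```

For instance, $J=2$, $M=3$, $r=64$ gives $4 \times 6 \times 4096 \times 2^{-28} \approx 3.7\times 10^{-4}$ GiB, while large $M$ grows the bound geometrically through the $J^{M-1}$ factor.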

If the program is executed for making predictions, there is a chance for BTilde in the class Approximation to consume more memory than ATilde, depending on the number of predictions and how the prediction locations are distributed. Then, the estimate of the exact memory consumption of the entire program will be more complicated. Users can adjust the values of $M$ and $r$ accordingly to make the execution feasible for a given computing facility.

It is also noteworthy that when the program is executed in parallel, the memory consumption by each MPI process will be smaller than the total memory consumption since the code is implemented in a distributed fashion.

### 4.3 Parallel Results

Throughout the studies in Section 4, we will run our programs on the two kinds of Cheyenne nodes that have 45 GB and 109 GB usable memory per node, respectively, and one kind of Casper node that has 384 GB memory.

The actual running time of the parallel execution using MPI depends not only on the program, the number of nodes, and the number of MPI processes per node, but also on how the MPI processes are placed within a node. If the number of MPI processes used is much smaller than the physical limit of 36 per node, say 2 or 4, the performance is highly sensitive to the MPI process placement. However, the optimal configuration is very complicated and should be analyzed case by case due to many factors such as the different levels of cache, the memory architecture, the difference in communication speed between MPI processes within a socket and across sockets, and the amount of memory each MPI process will use for running the program. For simplicity, all the configurations in this section for the parallel study spread MPI processes as uniformly as possible over the 36 physical cores in a node.

Two types of parallel experiments are conducted—only using MPI or using both MPI and OpenMP. When we run the experiments for the hybrid parallel program that uses both MPI and OpenMP, we try to use up to all 36 cores of a node. We assign each MPI process an equal number of cores for OpenMP threads. The scheme of the MPI-OpenMP core assignment is given in Table 6. Then, each MPI process will utilize more than one core as OpenMP threads except for the last case where only one core is available for each MPI process when using 32 MPI processes per node. We see that starting from the fourth row of Table 6, where the number of MPI processes per node is 8, the total number of used cores is not 36 but 32 because 36 is not divisible by the specified number of MPI processes per node.

Another issue is how we divide the computational workload into portions and assign each portion to an MPI process. Section 4.1 describes a predetermined static strategy in which the regions at the finest resolution level are divided evenly among the MPI processes (some processes may receive one more region). However, this may not yield an even division of work, because the number of observations in each region at the finest resolution level can vary when the data points are not uniformly spread over the domain. We offer another option that weights each region at the finest resolution level by the square of its number of observations, so that each MPI process handles a set of regions whose sum of squared observation counts is nearly equal. We call this strategy "dynamic scheduling", and it is controlled by the variable DYNAMIC_SCHEDULE_FLAG in the file user_parameters: when DYNAMIC_SCHEDULE_FLAG is "false", the static scheduling strategy elaborated in Section 4.1 is used; when it is "true", dynamic scheduling is applied.

Similarly, when using OpenMP for loops, the iterations are divided evenly among the threads by default. In our program, it is highly probable that the threads have different workloads when the data points are not uniformly spread over the domain, so the execution is slowed down to the pace of the slowest thread. Dynamic scheduling also applies here: the number of iterations per thread is not determined in advance, but each iteration is assigned to the currently idle thread at runtime. The same variable DYNAMIC_SCHEDULE_FLAG also controls the thread scheduling scheme for OpenMP.

#### 4.3.1 Small MODIS Data Set

In this section, we use the same small data set described in Section 3.1.1. We choose levels and knots in all regions before the finest resolution level to compare the running time in different parallel settings. The number of partitions is set to , leading to regions in total including regions at the finest resolution level. In each region at the finest resolution level, the number of knots (or observations in this case) depends on how the data locations are spatially distributed. Since there are observations and regions at the finest resolution level, we know the average number of knots at the finest resolution level, comparable to . For this data set, we found the maximum number of observations in a region at the finest resolution level is , and there are 30 regions that have no observations within their designated boundaries. Figure 8 shows the histogram of the number of observations in regions at the finest resolution level, where we see the data are not uniformly spread over the domain.

We run the parallel program that uses only MPI and no OpenMP threads with different total numbers of MPI processes and numbers of MPI processes per node. The experiments are run on Cheyenne nodes with 45 GB usable memory. The results are shown in Table 7 for evaluating the likelihood and in Table 8 for making predictions. The execution time for prediction is longer than for likelihood evaluation under the same setup because more quantities need to be calculated in prediction. It is also noteworthy that the running time of making predictions depends on the number of locations to predict at; here we use all the locations on the full grid. In this specific case, the execution time for prediction is about times as long as the corresponding likelihood evaluation.

Figure 8: Histogram of the number of knots in regions at the finest resolution level for the small MODIS data set.

Looking at a row of Table 7 or Table 8, we see that the running time always gets shorter as the number of MPI processes per node increases at first, and then possibly gets longer in the end. The reason for the shorter times at the beginning of a row is that communication within a node is always faster than communication across nodes. However, if too many MPI processes are used in a node, they eventually compete for the memory and shared caches; when this competition becomes dominant, the running time increases again. This phenomenon is more obvious for the larger data set in Section 4.3.2. Looking down a column, more and more MPI processes are used in total, indicating a higher level of parallelism. The running time is reduced by around half as the total number of MPI processes is doubled for a fixed number of MPI processes per node. It is also noteworthy that the values along the diagonals correspond to the same number of nodes. We see that, for this data set, it is always beneficial to use more MPI processes per node for a given number of nodes.

This is the smallest data set we have applied our MRA implementation to. More options including using OpenMP and dynamic scheduling will be explored for a much larger data set in Section 4.3.2 when the computation becomes more intensive.

#### 4.3.2 Medium AMSR Data Set

There are global observations of the sea surface temperature (SST) in the daytime of October 15, 2014, from the Advanced Microwave Scanning Radiometer (AMSR) sensor aboard the Advanced Earth Observing Satellite II (ADEOS II). Some of the observations are at the same locations, and we choose only one of them, which results in 2,439,827 unique data points. A visualization of the data set is given in Figure 9.

Figure 9: AMSR sea surface temperature (SST) in Kelvin in the daytime of October 15, 2014 at 2,439,827 locations.

Since this data set is much larger than that in Section 4.3.1, we choose a larger number of levels to achieve a higher level of approximation. The number of partitions is set to . Therefore, there are regions in total including regions at the finest resolution level. The number of knots is set to in all regions before the finest resolution level.

We know that the exact number of knots (or observations) in each region at the finest resolution level depends on the locations of the data. Since the average number of observations is at the finest resolution level, a similar number may be used for the number of knots before the finest resolution level. However, we found that this leads to too much memory consumption, so we instead use a moderate number for . The histogram of the number of observations in regions at the finest resolution level is given in Figure 10, where we see the distribution is quite skewed: a large portion of regions have small numbers of observations, while a small portion of regions contain remarkably large numbers of observations.

Figure 10: Histogram of the number of knots in regions at the finest resolution level for the medium AMSR data set.

We run the parallel program that uses only MPI and no OpenMP threads on the Cheyenne nodes with 45 GB usable memory. The results for evaluating likelihoods are shown in Table 9. The same trend appears here as in Section 4.3.1: within a row, the running time decreases at first and increases in the end; within a column, the running time always decreases as more MPI processes are used in total. The results for making predictions at all the locations are shown in Table 10. For the same setting, the ratio between the execution times of making predictions and of likelihood evaluation is greater than 2, larger than for the small MODIS data set in Section 4.3.1, because the number of prediction locations is much larger. As stated in Section 4.2, making predictions consumes more memory than likelihood evaluation, and all the settings indicated in red in Table 10 ran out of memory. The reason is that these settings require only one node, which has just 45 GB usable memory, while the entire memory consumption for the predictions goes beyond that. To handle these failure cases, we use one Cheyenne node with 109 GB usable memory to run these prediction routines, and the results are shown in Table 11.

As introduced in Section 4.1, each node can be utilized more efficiently if we apply the shared-memory multiprocessing mechanism to each MPI process through OpenMP, using the otherwise idle cores. The results for likelihood evaluation from the hybrid parallel implementation with MPI and OpenMP, executed on the Cheyenne nodes with 45 GB usable memory, are shown in Table 12. The strategy of the core assignments for OpenMP is given in Table 6. Note that the settings in Table 9 constitute only half of the settings in Table 12. For ease of comparison between the two tables, values in colors are presented in accordance with the same setups in Table 9. The mean running times in all the settings except the last column are smaller than the corresponding values for the MPI-only execution in Table 9, which demonstrates the improvement from using OpenMP. The last column shows values similar to those in Table 9 because only one core is assigned to each MPI process due to the physical core limit, so multithreading does not come into play there. When making predictions, we observe that the memory consumption goes beyond 45 GB if only one node is used. Therefore, we run the one-node experiments on a Cheyenne node with 109 GB usable memory and all the other scenarios on Cheyenne nodes with 45 GB usable memory. The results are shown in Table 13.

In addition, as stated in Section 4.1, the dynamic scheduling strategy is intended to produce a more balanced workload for each worker. The results from the hybrid implementation with dynamic scheduling in both MPI and OpenMP are shown in Table 14 for evaluating the likelihood and Table 15 for making predictions. Comparing Table 14 with Table 12 and Table 15 with Table 13, we see that the running time in the majority of cases is shorter under the dynamic scheduling scheme than under the static scheduling scheme. The optimal number of MPI processes per node in terms of execution time depends on the number of nodes used; however, the results suggest that it is generally better to use a smaller number of MPI processes per node under the dynamic scheduling scheme. The reason is that when we assign the workload to each MPI process dynamically, we use the sum of squared observation counts in each region as the metric, which may not be optimal, whereas the assignment of OpenMP threads is determined at runtime, leading to a more balanced workload division.

#### 4.3.3 Large MODIS Data Set

There are 47,567,759 observations of the sea surface temperature in the daytime of October 15, 2014, from the Moderate Resolution Imaging Spectroradiometer (MODIS) sensor aboard the Aqua satellite. A visualization of the data set is given in Figure 11.

Figure 11: MODIS sea surface temperature (SST) in Kelvin in the daytime of October 15, 2014 at 47,567,759 locations.

Figure 12: Histogram of the number of knots in regions at the finest resolution level for the large MODIS data set.

We choose , and . Then, the largest number of observations in regions at the finest resolution level is . The distribution of the number of observations among all the regions at the finest resolution level is skewed such that a large portion of regions have small numbers of observations and a small portion of regions contain large numbers of observations, as shown in Figure 12.

For the computations, we showed in Section 4.3.2 that the hybrid implementation with dynamic scheduling in both MPI and OpenMP leads to the shortest running time, so we only report the results using this strategy for the large MODIS data set. By monitoring the memory used, we find an approximate upper bound of the total memory consumption to be 720 GiB, which is too large for the Cheyenne nodes with 45 GB usable memory when only a relatively small number of nodes is used. We tried two approaches. The first approach is to save the most memory-consuming variable, ATilde, in each region as a separate file on disk. Every time the values of ATilde for a particular region are needed, we load them into memory from disk and free the allocation once that region is finished. The running time is 2663.79 seconds when using 16 nodes, each of which executes 4 MPI processes. We consider this execution too slow to be feasible.

The other approach is to use the Casper nodes with 384 GB memory or the Cheyenne nodes with 109 GB usable memory. Since we only have 22 Casper nodes, we can run experiments for just a few scenarios on Casper nodes. The running time results on Casper nodes in different settings are shown in Table 16 for evaluating likelihoods and Table 17 for making predictions. We see that for the same setting, the running time of making predictions is about 3 times the running time of evaluating likelihoods. This ratio is larger than the ratio of 2 in Section 4.3.2 because the number of prediction locations here is about 47 million, considerably larger than the roughly 2 million in Section 4.3.2. Generally speaking, the running time is almost halved when the number of nodes used is doubled. The optimal number of MPI processes per node in terms of execution time is generally one or two.

The running time results on Cheyenne nodes with 109 GB usable memory are shown in Table 18 for evaluating likelihoods and Table 19 for making predictions, with some cases failing to provide results due to insufficient memory. We see that the execution on Cheyenne nodes is always faster than on Casper nodes for the same setup. The speedup factor for doubling the number of nodes decreases as the number of nodes increases. We have also tried to run the experiments on Cheyenne nodes with 45 GB usable memory when a moderate number of nodes is used, to show the execution results when only computing facilities with limited memory are available. The values in bold in Table 18 and Table 19 indicate the settings in which the execution can be run on Cheyenne nodes with 45 GB usable memory. It is noteworthy that in some of these bold cases, the execution may crash when the memory consumption is close to the limit of the node.

We also show the running time excluding loading the data and building the structure for likelihood evaluations in Table 20. In the "optimization" calculation mode, loading the data and building the structure need not be repeated, so the running times for likelihood evaluations excluding these two parts indicate the execution time of each iteration of the parameter optimization from the second iteration onward. From the results, it is promising that fewer than 10 seconds are needed for a single likelihood evaluation from the second iteration onward when using 128 Cheyenne nodes with 109 GB usable memory and two MPI processes per node for this data set on the order of 47 million observations.

## 5 Discussion

In this article, we described our C++ implementation of the multi-resolution approximation of Gaussian processes in detail, compared the performance of the serial code between C++ and MATLAB for a small data set of around a tenth of a million observations on a personal laptop, and studied different configurations of the parallel code applied to data sets with sizes ranging from around a tenth of a million to 47 million observations. Many practical concerns have been discussed to optimize the execution for a given data set and the available computing facilities. In the end, we showed that for a massive data set on the order of 47 million observations, we can evaluate the likelihood in a single iteration of the optimization within 10 seconds (except for the first iteration, which needs about 30 seconds to load the data and build the structure) and make predictions at the 47 million observation locations within 85 seconds using 128 Cheyenne nodes with 109 GB usable memory. We believe this is a great improvement that extends the use of the MRA methodology to real-world environmental applications, which often have millions of measurements.

## Acknowledgements

We are grateful to Yuliya Marchetti for kindly providing the satellite data and helpful information on its usage. We would like to thank Brian Dobbins for his constructive review, in addition to providing support and helpful advice on the use of MPI on Cheyenne. We would further like to thank Brian Vanderwende for support with MATLAB and Matthias Katzfuss for reading an earlier version of this manuscript and providing valuable feedback.

Appendices

## Appendix A Why We Assign an Irrational Number to offset

Let $i$ and $j$ denote two different levels, $l_i$ and $l_j$ the associated lengths of the $x$-axis dimension, $x_i$ and $x_j$ the associated positions of the left boundary of the underlying region along the $x$-axis, $n_i$ and $n_j$ the associated numbers of knots placed along the $x$-axis, and $f$ the offset. Then, if two knots at levels $i$ and $j$ have the same location, the following equation is satisfied:

$$x_i + f\,l_i + p\,l_i\,\frac{1-2f}{n_i-1} \;=\; x_j + f\,l_j + q\,l_j\,\frac{1-2f}{n_j-1}, \qquad p = 0,\dots,n_i-1,\; q = 0,\dots,n_j-1,$$

which is equivalent to

$$f \;=\; \frac{(n_i-1)(n_j-1)(x_j-x_i) + (n_i-1)\,q\,l_j - (n_j-1)\,p\,l_i}{(n_i-1)(n_j-1)(l_i-l_j) + 2(n_i-1)\,q\,l_j - 2(n_j-1)\,p\,l_i}.$$

Since all the values in the numerator and denominator are rational, the equation cannot hold if $f$ is irrational.

It is noteworthy that even though in theory we can avoid the collocation of any knots by assigning the offset $f$ an irrational value, an irrational number is stored with finitely many digits in a computer and is therefore essentially rational. With enough digits for the stored offset $f$, however, we observe that the strategy works in practice.

## Appendix B The Upper Bound of the Memory Consumption of ATilde

ATilde is defined in all regions, but the memory for the regions at a particular level can be de-allocated once the program moves to the level above in the posterior inference. This means the maximum number of regions for which ATilde is present in memory at the same time is the number of regions at the finest resolution level, $J^{M-1}$. For each region at the finest resolution level, there are $M(M-1)/2$ matrices, associated with the conditional variances for all the ancestors and the conditional covariances for any pair of ancestors. Each matrix is of dimension $r \times r$, and the matrix elements are 64-bit real numbers occupying 8 bytes each. Hence, the upper bound of the memory consumption of ATilde is

$$J^{M-1} \times \frac{M(M-1)}{2} \times r^2 \times 8 \text{ bytes} \;=\; J^{M-1}\, M (M-1)\, r^2 \times 2^{-28} \text{ GiB}.$$

## References

• Banerjee et al. (2008) Banerjee, S., A. E. Gelfand, A. O. Finley, and H. Sang (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(4), 825–848.
• Blake et al. (2018) Blake, L. R., P. Simonson, and D. Hammerling (2018). Parallel implementation and computational analysis of the multi-resolution approximation. NCAR Technical Note (NCAR/TN-551+STR).
• Heaton et al. (2018) Heaton, M. J., A. Datta, A. O. Finley, R. Furrer, J. Guinness, R. Guhaniyogi, F. Gerber, R. B. Gramacy, D. Hammerling, M. Katzfuss, F. Lindgren, D. W. Nychka, F. Sun, and A. Zammit-Mangion (2018). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics, 1–28.
• Katzfuss (2017) Katzfuss, M. (2017). A multi-resolution approximation for massive spatial datasets. Journal of the American Statistical Association 112(517), 201–214.
• Powell (2009) Powell, M. J. (2009). The BOBYQA algorithm for bound constrained optimization without derivatives. Cambridge NA Report NA2009/06, University of Cambridge, Cambridge, 26–46.
• Sanderson and Curtin (2016) Sanderson, C. and R. Curtin (2016). Armadillo: a template-based C++ library for linear algebra. Journal of Open Source Software 1(2), 26.