We introduce a trivially parallelizable approach to quickly and accurately approximate the cumulative distribution function (cdf) of multivariate Gaussian distributions with highly structured covariance matrices, such as those arising from spatial Gaussian processes. The multivariate Gaussian distribution is by far the most widely used for modeling multivariate and spatial data. To a large degree, its near universal adoption is the result of its simplicity; it is concisely and intuitively parametrized by a mean vector and pairwise dependence in the form of a covariance matrix. Prominent examples of its use include time series models like autoregressive and moving average models, which consider the joint distribution of the observations at discrete time points to be multivariate Gaussian, as well as geostatistics models, which consider spatially-indexed observations to be realizations of a Gaussian process, usually with a parsimoniously parametrized covariance structure. Even multivariate models that do not assume Gaussian responses often represent dependence using some kind of latent multivariate Gaussian distribution.
In most situations, likelihood-based inference on popular models just requires calculation of the joint density of all observations. The probability density function (pdf) for a multivariate Gaussian random variable is
$$\phi_n(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right\}, \qquad (1)$$
where $\boldsymbol{\mu}$ is the mean vector of length $n$ and $\Sigma$ is the $n \times n$ covariance matrix.
In principle, there is nothing difficult about calculating this density; it simply requires commonplace operations like exponentiation, a matrix determinant, matrix multiplication, and a matrix inversion. In practice, however, this is not an easy task when the dimension $n$ is large. The complexity of calculating the determinant and inverse of a matrix is typically $O(n^3)$ for algorithms in common use, so for large values of $n$ the calculation of the pdf becomes prohibitive.
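The operations involved, and the $O(n^3)$ factorization that dominates them, can be sketched as follows. This is an illustrative Python sketch (the methods discussed in this paper are implemented in R), and the function name `gauss_logpdf` is ours rather than from any package; a single Cholesky factorization supplies the determinant, the quadratic form, and numerical stability at once.

```python
import numpy as np
from scipy.linalg import solve_triangular
from scipy.stats import multivariate_normal

def gauss_logpdf(x, mu, Sigma):
    """Log of the multivariate Gaussian pdf (1) via one Cholesky factorization."""
    n = len(mu)
    L = np.linalg.cholesky(Sigma)                  # O(n^3): the dominant cost
    z = solve_triangular(L, x - mu, lower=True)    # O(n^2) triangular solve
    log_det = 2.0 * np.sum(np.log(np.diag(L)))     # log|Sigma| from the factor
    return -0.5 * (n * np.log(2 * np.pi) + log_det + z @ z)

# Quick check against scipy on a small random SPD covariance
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5 * np.eye(5)
x, mu = rng.standard_normal(5), np.zeros(5)
print(np.isclose(gauss_logpdf(x, mu, Sigma),
                 multivariate_normal(mu, Sigma).logpdf(x)))   # True
```

Because the Cholesky factorization alone scales cubically, doubling the number of observations multiplies the density cost by roughly eight, which is the bottleneck described above.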
Computing the Gaussian cdf, which is a much more difficult problem, has received much less attention. The problem has increased in prominence recently with advances in spatial modeling of extreme events. State-of-the-art approaches for spatial extremes like Wadsworth2014, Thibaud2016, deFondeville2018, and Huser2019 all require high-dimensional Gaussian cdfs for inference. This turns out to be the dominant computational bottleneck, and all but deFondeville2018 restricted their analyses to fewer than 20 spatial locations because larger datasets are computationally intractable with standard techniques for computing the Gaussian cdf. In real-world spatial applications, one should expect to see many more spatial locations, and existing approaches are not equipped to handle datasets of even moderate size.
Multivariate Gaussian cdfs appear in other contexts as well; for example, the densities of multivariate skew-normal and skew-$t$ random variables are functions of the multivariate Gaussian cdf [SN2006]. Here, we will focus on the case of spatial extremes. To make things concrete, we will use the example of the Gaussian scale mixture model from [Huser2017], although our computational strategy would work equally well in any context with highly structured covariance matrices.
Most approaches to calculating multivariate Gaussian probabilities are intended for problems of small or moderate dimension. Genz1992 proposed a transformation from the original integral over $\mathbb{R}^n$ to an integral over a unit hypercube. Transforming to a finite region then allows the use of any standard numerical integration method. Genz2004 derived formulas to calculate bivariate and trivariate Gaussian cdfs with high precision using Gauss-Legendre numerical integration. The calculations are fast and precise but do not apply in higher dimensions. Miwa2003 proposed a two-stage recursive approach to estimate the Gaussian cdf. Their approach does not scale to high dimensions because it requires a sum over a combinatorially exploding (in $n$) number of terms.
The most popular approach for approximating Gaussian cdfs in moderate dimensions was proposed by Genz2009. They describe the use of Monte Carlo (MC) and quasi-Monte Carlo (QM) methods to estimate the joint cdf. Their QM methods have smaller asymptotic errors than the MC versions, and hence are the more widely used.
More recently, Genton2018 sped up the Genz2009 QM algorithm by performing matrix computations with fast hierarchical matrix libraries [Hackbusch2015]. As a follow-up, Cao2019 combined hierarchical matrix computations with a blocking technique to further speed up computations. These approaches are much faster than their predecessors and work for Gaussian random variables with arbitrary covariance structures. They lean heavily on linking to specialized libraries for matrix operations. Our approach achieves speedups using a fundamentally different strategy, by specifically leveraging the properties of highly-structured covariance forms arising from, for example, time series or spatial data. It requires no exotic software, and is trivially parallelizable using simple tools in R.
2 A Vecchia Approximation for the Multivariate Gaussian Distribution Function
The multivariate Gaussian cdf that we wish to calculate is simply the integral of the pdf (1),
$$\Phi_n(\mathbf{b}; \boldsymbol{\mu}, \Sigma) = \int_{-\infty}^{b_1} \cdots \int_{-\infty}^{b_n} \phi_n(\mathbf{x}; \boldsymbol{\mu}, \Sigma) \, \mathrm{d}\mathbf{x}. \qquad (2)$$
To calculate the integral (2), one must resort to numerical techniques, as it is well known that no closed form exists, even in a single dimension. In high dimensions, numerical integration is very difficult simply due to geometry and the curse of dimensionality. The difficulty is compounded in the case of the Gaussian cdf because while the curse of dimensionality requires an exponentially (in $n$) increasing number of evaluations of the integrand, the cost of each evaluation of the integrand itself grows as $O(n^3)$. We seek a technique that simultaneously 1) reduces the effective dimension of the integral and 2) reduces the dimension of the pdf in the integrand.
2.1 Vecchia Approximation for the Gaussian pdf
Vecchia introduced a way to approximate high-dimensional Gaussian pdfs arising from spatial data, which is particularly amenable to modification for our purposes. The starting point of the Vecchia approximation is to write the joint density as a product of cascading conditional densities,
$$f(\mathbf{x}) = f(x_1) \prod_{i=2}^{n} f(x_i \mid x_1, \dots, x_{i-1}). \qquad (3)$$
Here, $f(x_1)$ is the univariate Gaussian density with mean $\mu_1$ and variance $\Sigma_{11}$, and, for $i \geq 2$, the conditional density $f(x_i \mid x_1, \dots, x_{i-1})$ is the univariate Gaussian density with mean $\mu_i + \Sigma_{i,1:(i-1)} \Sigma_{1:(i-1),1:(i-1)}^{-1} (\mathbf{x}_{1:(i-1)} - \boldsymbol{\mu}_{1:(i-1)})$ and variance $\Sigma_{ii} - \Sigma_{i,1:(i-1)} \Sigma_{1:(i-1),1:(i-1)}^{-1} \Sigma_{1:(i-1),i}$. The leading terms in this product are fast to calculate, but for terms corresponding to large $i$, the computations are nearly as burdensome as those of the original representation (1).
To help solve this problem, Vecchia proposed an approximation to the full joint distribution, in the setting where the random vector is observed from a spatial Gaussian process. He modified the cascading conditional representation (3) by replacing the conditioning on high-dimensional vectors with conditioning on well-chosen vectors that have much smaller dimension. By limiting the conditioning sets to vectors of length at most $m \ll n$, this strategy replaces expensive $O(n^3)$ matrix operations with much faster $O(m^3)$ matrix operations. The approximation of the joint density is then
$$f(\mathbf{x}) \approx f(x_1) \prod_{i=2}^{n} f(x_i \mid \mathbf{x}_{c(i)}), \qquad (4)$$
where $\mathbf{x}_{c(i)}$ is the conditioning set of size $m$ (more precisely, $\min\{m, i-1\}$) chosen for the component $x_i$, for $i = 2, \dots, n$.
A good choice for a conditioning set to approximate the complete conditional density of each $x_i$ might be the components that are most correlated with $x_i$. In the context where the random vector arises from a stationary spatial Gaussian process observed at locations $\mathbf{s}_1, \dots, \mathbf{s}_n$, the components most correlated with $x_i$ will be those observed at locations that are the nearest neighbors of $\mathbf{s}_i$ (under covariance models in common use). Other strategies for constructing conditioning sets have also been explored [stein2004approximating, Guinness2016].
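As a concrete sketch of this construction (in Python; the function name `conditioning_sets` is ours), nearest-neighbor conditioning sets can be built by sorting, for each component, the distances to the previously ordered locations:

```python
import numpy as np

def conditioning_sets(locs, m):
    """Nearest-neighbor conditioning sets: for component i, the indices of
    up to m earlier-ordered components whose locations are closest to
    location i."""
    n = locs.shape[0]
    sets = [np.empty(0, dtype=int)]            # x_1 conditions on nothing
    for i in range(1, n):
        d = np.linalg.norm(locs[:i] - locs[i], axis=1)
        sets.append(np.sort(np.argsort(d)[:min(m, i)]))
    return sets

# On a 1-D transect at coordinates 0, 1, 2, 3 with m = 2, the nearest
# earlier neighbors of the last point are points 1 and 2
locs = np.array([[0.0], [1.0], [2.0], [3.0]])
print([s.tolist() for s in conditioning_sets(locs, 2)])
# -> [[], [0], [0, 1], [1, 2]]
```

For irregular data one would typically also reorder the locations first (e.g., a max-min ordering), which can further improve the approximation [Guinness2016].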
Vecchia’s approximation has been found to be quite accurate under many covariance models and sampling scenarios relevant to analysis of spatial Gaussian processes [Guinness2016]. Moreover, it is very fast to compute, even using the most naive implementation. However, its power is fully realized when the components of the product are computed in parallel, which is trivially easy to implement using standard tools in R.
2.2 Extending the Vecchia Approximation for the Gaussian cdf
Our approach to approximating the high-dimensional Gaussian cdf is to re-write the joint cdf as a telescoping product of conditional cdfs, analogously to (3), and then to approximate each complete conditional cdf with a cdf that conditions on a smaller collection of components, analogously to (4). In the case of the pdf, this strategy of choosing smaller conditioning sets eliminates the need to perform the high-dimensional matrix computations required by (1), whereas in the case of the cdf, this strategy eliminates the need to compute the high-dimensional integral required by (2).
Specifically, we can re-write any joint cdf as
$$P(X_1 \le b_1, \dots, X_n \le b_n) = P(X_1 \le b_1) \prod_{i=2}^{n} P\left(X_i \le b_i \mid X_1 \le b_1, \dots, X_{i-1} \le b_{i-1}\right).$$
Then, just as in the approximation to the pdf (4), in the cdf (2.2) each conditional probability in the product can be approximated by reducing the size of the conditioning set to at most $m$ components. Thus, our Vecchia approximation for the Gaussian cdf is
$$P(X_1 \le b_1, \dots, X_n \le b_n) \approx P(X_1 \le b_1) \prod_{i=2}^{n} P\left(X_i \le b_i \mid X_{c(i)} \le \mathbf{b}_{c(i)}\right),$$
where again $X_{c(i)}$ is the conditioning set of size $\min\{m, i-1\}$ chosen for the component $X_i$, for $i = 2, \dots, n$.
The approximation given by (2.2) reduces computational costs by replacing the $n$-dimensional integral in (2) with a series of much simpler integrals of dimension $m+1$ and $m$, for each term of the product: each conditional probability is the ratio of an $(m+1)$-dimensional and an $m$-dimensional Gaussian cdf. Furthermore, all of the elements in the product can be computed in parallel.
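A minimal sketch of the resulting approximation follows (Python, with scipy's Genz-type integrator standing in for the mvPot implementation used later in the paper; the function name `vecchia_logcdf` is ours). Each term is the ratio of two low-dimensional Gaussian cdfs, so no integral of dimension greater than $m+1$ is ever computed.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn, norm

def vecchia_logcdf(b, Sigma, sets):
    """Vecchia log-cdf approximation (2.2) for a zero-mean Gaussian vector.

    sets[i] lists the conditioning indices c(i) for component i; each
    conditional term is an (m_i + 1)-dimensional cdf divided by an
    m_i-dimensional cdf."""
    logp = np.log(norm.cdf(b[0], scale=np.sqrt(Sigma[0, 0])))
    for i in range(1, len(b)):
        c = list(sets[i])
        num = c + [i]
        logp += np.log(mvn.cdf(b[num], cov=Sigma[np.ix_(num, num)]))
        logp -= np.log(mvn.cdf(b[c], cov=Sigma[np.ix_(c, c)]))
    return logp

# Sanity check: with full conditioning sets the telescoping product is
# exact, so the approximation must match the direct cdf
Sigma = 0.5 * np.ones((4, 4)) + 0.5 * np.eye(4)
b = np.zeros(4)
full_sets = [list(range(i)) for i in range(4)]
approx = vecchia_logcdf(b, Sigma, full_sets)
exact = np.log(mvn.cdf(b, cov=Sigma))
```

The check exploits the fact that when $c(i) = \{1, \dots, i-1\}$ for every $i$, the product telescopes back to the exact joint cdf; the interesting regime is of course $m \ll n$, where each term stays cheap.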
The multivariate cdfs in (2.2) still have to be evaluated numerically. For all but the smallest possible choices of $m$, best practices suggest using a QM method like that of Genz2009 to approximate the numerator and denominator of each conditional probability.
Similarly to the original Vecchia approximation to the Gaussian pdf, choosing the conditioning sets involves a trade-off; choose $m$ too small and the accuracy of the approximation will suffer, but choose $m$ too large and the computational benefits will diminish.
3 Simulation Study
To assess the accuracy and speed of this approximation, and to explore the trade-off inherent in the choice of $m$, we conduct a simulation study. Since the true value of the cdf is not available, the best we can do to check for accuracy is to see whether it is consistent with results obtained from direct use of the Genz2009 QM approach. We simulate a Gaussian process observed on equally spaced grids of five different sizes. We try two different covariance functions for the Gaussian process to see whether this has an impact on the cdf estimation: an exponential model with range parameter 1 and an exponential model with range parameter 5, each with unit variance. This makes a total of 10 different scenarios. For each scenario, we used four different sizes of conditioning sets of closest neighbors. For comparison, we computed the Genz2009 QM method using 499 and 3,607 sample points. We use the implementation of the Genz2009 algorithm in the mvPot package [package-mvPot] for R. In principle, the accuracy and computational requirements of the QM method grow with the number of sample points (which, here, must be a prime number). Since the algorithms are stochastic, we repeated each calculation five times and plotted each replication as a dot in Figures 1, 2, 3, and 4.
Figure 1 shows the value of the estimated log cdf for all grid sizes and all estimation methods for the simulated Gaussian process with range parameter 1. The log cdf estimated with the Vecchia approximation increases with the number of neighbors until it stabilizes for 30 neighbors, after which it is consistent with the two QM approximations. This suggests that, under this scenario, it is advisable to use at least 30 neighbors in order to estimate the log cdf. For the two smaller grids, it appears that the Vecchia approximation has a similar variance to the QM approximation using 499 sample points, but a higher variance than the QM approximation using 3,607 sample points. For the larger grids, the Vecchia approximation appears to have a lower variance than both QM approximations. Figure 2 shows the same as Figure 1, but for exponential Gaussian processes with range parameter 5. The story is similar to the case with the shorter range process, except it appears that 50 neighbors may be necessary in order to stabilize the estimated log cdf. It may be the case that the number of neighbors necessary to accurately approximate the log cdf increases with the range of dependence of the Gaussian process. Intuitively, this may occur because for processes with longer-range dependence, a smaller proportion of the information in the data may be captured by local approximations.
Figures 3 and 4 show the computation times, on a single core, for Gaussian processes with range parameters of 1 and 5, respectively. The computation time is influenced by both the number of observations and the number of neighbors used in the Vecchia approximation. Computational costs increase with the number of observations, for both the Vecchia and QM approximation methods, and also increase with the number of neighbors in the conditioning set. Oddly, the empirical computation time did not increase for the QM approximation with the larger set of sample points. For smaller grid sizes, the QM methods are faster than the Vecchia approximations, except when the size of the conditioning set is very small. For moderately sized grids and larger, the computation time of the approximation using 30 neighbors was as fast as or faster than the QM method. When the number of observations is extremely large, in the case of the largest grid, the computation time was much lower for the Vecchia approximation than for the QM approximation. This suggests that for high-dimensional datasets the use of the Vecchia approximation is preferable to the QM method, even if computations are done sequentially.
3.1 Parallel Computing
Since each term of the Vecchia cdf approximation (2.2) is independent of every other term, it is trivial to parallelize the computations. In practice, we compute all of the required low-dimensional Gaussian cdfs on the log scale, and then sum them at the end. In principle, the speedup should be linear in the number of cores used for the calculation. To explore this relationship, we compute the cdf approximation based on a Gaussian process observed at 10,000 locations, varying the number of compute cores used between 5 and 40. For each setup, we repeat the computation 15 times. Figure 5 shows the time required to compute the log cdf approximation. The computing time decreases with the number of cores. We observe roughly the expected linear relationship up to 20 cores, when a jump occurs before the time again decreases. We suspect that this behavior is a result of the particular hardware configuration we used, which consists of networked 20-core processors. That is, we guess that once an additional physical processor is engaged, which occurs beyond 20 cores, overhead costs increase and attenuate the expected computational gains. When 40 cores were used, it took less than 1 minute to compute the log cdf approximation for 10,000 observations. There are clearly some diminishing returns due to communication overhead, but in principle, this approximation could be made arbitrarily fast with a big enough computing system.
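The pattern (evaluate each log term independently, then sum) can be sketched as follows. Our study used R's standard parallel tools; this Python pool-based version is only meant to show that the terms share no state, and the helper names are ours.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from scipy.stats import multivariate_normal as mvn

def log_term(task):
    """One conditional probability of (2.2) on the log scale."""
    b, Sigma, c, i = task
    num = list(c) + [i]
    t = np.log(mvn.cdf(b[num], cov=Sigma[np.ix_(num, num)]))
    if c:                                   # subtract the denominator cdf
        t -= np.log(mvn.cdf(b[c], cov=Sigma[np.ix_(c, c)]))
    return t

Sigma = 0.5 * np.ones((4, 4)) + 0.5 * np.eye(4)
b = np.zeros(4)
# Condition each component on (up to) its two predecessors, i.e. m = 2
tasks = [(b, Sigma, list(range(max(0, i - 2), i)), i) for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as ex:
    logcdf = sum(ex.map(log_term, tasks))   # embarrassingly parallel sum
```

Because every task only reads `b` and `Sigma`, the same map-then-sum structure transfers directly to process- or cluster-level parallelism, which is what yields the near-linear speedups reported above.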
3.2 Effect of Neighbor Selection and Joint Estimation
The representation defined by equation (2.2) and its approximation (2.2) calculate the joint probability as a product of univariate conditional distributions. However, it is also possible to write the full joint cdf as a cascading product of multivariate, rather than univariate, conditional cdfs. Under equation (2.2), it is necessary to calculate $n$ univariate conditional probabilities, each of which requires an $(m+1)$-dimensional cdf calculation. If instead we divide the $n$ components into groups of $g$ joint observations each, we would only need to calculate the product of $n/g$ conditional probabilities. However, doing so would make the dimensionality of each individual Gaussian cdf calculation in (2.2) between $m+1$ and $m+g$. So it would trade the cost of computing higher-dimensional cdf terms for the benefit of computing fewer terms. Such a trade-off could affect both the accuracy and computational efficiency of the approximation. Guinness2016 explored this possibility in the context of pdfs and found that it can be advantageous to consider multivariate conditional densities in the Vecchia density approximation. To explore the effect of calculating higher-dimensional conditional probabilities, we calculate the log cdf approximation based on groupings of observations of different sizes.
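To make the trade-off concrete, a back-of-envelope tally helps (illustrative only; `vecchia_cost` is our name, and we count a total conditioning-set size of $m$ per term):

```python
import math

def vecchia_cost(n, m, g=1):
    """Number of conditional-probability terms and the largest Gaussian cdf
    dimension when n components are split into groups of size g, each group
    conditioning on a set of size m.  A term count, not a runtime model."""
    n_terms = math.ceil(n / g)      # one conditional probability per group
    max_dim = m + g                 # numerator cdf dimension for interior terms
    return n_terms, max_dim

# Univariate terms: many small cdfs.  Grouped terms: fewer, larger cdfs.
print(vecchia_cost(10_000, 30))      # -> (10000, 31)
print(vecchia_cost(10_000, 30, 10))  # -> (1000, 40)
```

Whether the tenfold reduction in the number of terms outweighs the jump from 31- to 40-dimensional QM integrals is exactly the empirical question the experiments below address.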
An additional consideration that could affect the accuracy and speed of the approximation is the construction of the conditioning sets. Using the $m$ nearest neighbors, as we have done above, requires the additional step of ordering the components by distance, which could be slow. Choosing randomly-selected conditioning sets could potentially speed up the computation by avoiding this sorting step.
Figures 6 and 7 show the estimated log cdf and the time (in seconds) to compute the approximated log cdf, using the largest grid. We used approximations based on joint conditional cdfs of dimension 2, 5, 10, 20, 30, and 50. For each grouping size $g$, we constructed conditioning sets using 3 different methods. The first method conditions on the $m$ most correlated observations (in this case simply the nearest neighbors) for each observation in the joint grouping, resulting in a conditioning set of size up to $mg$. The second method simply conditions on $m$ random observations. The third method conditions on $m$ random observations per element of the multivariate conditional calculation, again resulting in a conditioning set of size up to $mg$.
From Figure 6 it is clear that simply conditioning on $m$ random observations fails to yield an acceptable approximation. Performance can be improved by conditioning on more random observations, which is what the third method does. Method 3 shows somewhat improved behavior; however, it performed acceptably only when the dimensionality of the joint conditional probability and the size of the conditioning set were both large. It is clear from Figure 6 that conditioning on random neighbors is much less accurate than conditioning on the most highly correlated neighbors. For conditioning sets consisting of small numbers of neighbors per element in the joint conditional probability, the use of a large group of joint observations gave a better result, probably simply due to the fact that the total number of neighbors in the conditioning set was larger. However, when the number of correlated neighbors gets large enough, the number of joint observations does not seem to affect the result of the approximation.
Figure 7 shows the computation time required for all of the approximation schemes depicted in Figure 6. The clear trends are that choosing a small conditioning set of random observations is very fast (middle panel), that using higher-dimensional joint conditional cdfs is slower than using lower-dimensional ones (all panels), and that, for the same size of conditioning set, the time required to find the nearest neighbors is not a major bottleneck (left and right panels). This conclusion differs from the exploration of the same issues, in the context of the pdf, in Guinness2016. There, using higher-dimensional joint conditional calculations was found to be beneficial, and the time required to find nearest neighbors was substantial enough to warrant the use of a fast approximate ordering algorithm. In the case of the cdf approximation, code profiling confirmed that the time required to order the observations was insignificant, with the overwhelming majority of the computation time spent calculating the lower-dimensional joint cdfs using the QM technique.
4 Example: A Gaussian Scale Mixture for Spatial Extremes
Recent advances in the statistics of extremal spatial phenomena have produced models that are flexible enough to accommodate both strong and weak spatial dependence in the far joint tails. One prominent strategy for achieving this is to construct scale mixtures of Gaussian processes, where the mixing distribution is chosen carefully so as to produce the desired tail dependence characteristics [Opitz2016, Huser2017, Morris2017, Huser2019]. The preferred flavor of maximum likelihood inference for these models requires computing a Gaussian cdf whose dimension is roughly equal to the number of spatial locations in the dataset. Other state-of-the-art models for spatial extremes also rely on high-dimensional Gaussian cdfs [Wadsworth2014, Thibaud2016, deFondeville2018]. To show the usefulness of our cdf approximation, we analyze data from precipitation gauges in Europe using the Gaussian scale mixture model from Huser2017, which we describe below.
The class of scale mixtures of Gaussian processes is defined generically by
$$X(\mathbf{s}) = R \, W(\mathbf{s}), \qquad R \sim F_R, \quad R \perp W(\cdot).$$
Here, $W(\mathbf{s})$ is a standard Gaussian process (i.e. with unit variance) on some domain indexed by $\mathbf{s}$. For a collection of $n$ observations, the finite dimensional distribution of the Gaussian component is $\mathbf{W} = \{W(\mathbf{s}_1), \dots, W(\mathbf{s}_n)\}^\top \sim N_n\{\mathbf{0}, \Sigma(\boldsymbol{\theta})\}$, where $\Sigma(\boldsymbol{\theta})$ is a covariance matrix constructed using a chosen covariance model that is indexed by parameter $\boldsymbol{\theta}$.
The random scaling $R$ comes from distribution $F_R$. The choice of $F_R$ is critical and determines the strength of the tail dependence in the resulting model [Engelke2019]. A key quantity for summarizing the strength of tail dependence is the conditional probability $\chi(\mathbf{s}_i, \mathbf{s}_j) = \lim_{u \to 1} \Pr\{F_i(X(\mathbf{s}_i)) > u \mid F_j(X(\mathbf{s}_j)) > u\}$, for spatial locations $\mathbf{s}_i$ and $\mathbf{s}_j$, where $F_i$ and $F_j$ denote the respective marginal distributions. If $\chi(\mathbf{s}_i, \mathbf{s}_j) = 0$ for all $\mathbf{s}_i \neq \mathbf{s}_j$, we say that $X$ is asymptotically independent, while if $\chi(\mathbf{s}_i, \mathbf{s}_j) > 0$ for all $\mathbf{s}_i, \mathbf{s}_j$, we say that $X$ is asymptotically dependent.
While many choices are available for the mixing distribution $F_R$, Huser2017 suggest the parametric model defined by equation (8). When $\beta > 0$, the mixture process $X$ is asymptotically independent, and when $\beta \le 0$, $X$ is asymptotically dependent. Therefore, this class of scale mixtures is rich enough to include both asymptotic independence and asymptotic dependence as nontrivial sub-models.
To construct the likelihood for maximum likelihood estimation, we must integrate $R$ out from the model (4). Equations (9) and (10) show the marginal multivariate cdf and pdf, respectively, for a finite collection of $n$ observations from the process defined in (4). Here $\Phi_n(\cdot; \boldsymbol{\mu}, \Sigma)$ represents the $n$-dimensional multivariate cdf from a Gaussian distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\Sigma$, and $\phi_n(\cdot; \boldsymbol{\mu}, \Sigma)$ represents the $n$-dimensional multivariate pdf from a Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$. There are no closed forms for these expressions, so it is necessary to use numerical methods to evaluate the (one-dimensional) integrals.
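The structure of these one-dimensional integrals can be sketched as follows (Python; the mixing density `f_R` below is a hypothetical stand-in, not the parametric family of equation (8), and the function names are ours). The marginal cdf integrates the Gaussian cdf of $\mathbf{x}/r$ against the mixing density:

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import multivariate_normal as mvn

Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

def f_R(r):
    # Hypothetical mixing density on [1, inf): a unit-shifted exponential.
    # The actual model uses the family of equation (8).
    return np.exp(-(r - 1.0))

def mixture_cdf(x, Sigma, r_max=12.0, n_grid=400):
    """P(X <= x) for X(s) = R W(s): one-dimensional quadrature over r of
    Phi_n(x / r; 0, Sigma) f_R(r), truncating the negligible tail past r_max."""
    rs = np.linspace(1.0, r_max, n_grid)
    vals = np.array([mvn.cdf(x / r, cov=Sigma) * f_R(r) for r in rs])
    return trapezoid(vals, rs)

x = np.array([1.0, 1.0])
p = mixture_cdf(x, Sigma)
```

Each quadrature node requires one $n$-dimensional Gaussian cdf evaluation, which is precisely where the Vecchia approximation of Section 2 is substituted when $n$ is large.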
The preferred strategy for maximum likelihood estimation of extremal dependence models is to treat all observations falling below a high threshold as left censored [Huser2016b]. This leads to a favorable balance between using the data as efficiently as possible, while not allowing data in the bulk of the distribution to have a large effect on dependence estimation. The censored likelihood for each temporal replicate is obtained by taking one partial derivative of (9) for every observation that falls above the threshold. Thus, (10) is the relevant likelihood when all observations, at one particular temporal replicate, are above the threshold, so nothing is censored. However, since the threshold is chosen to be a high quantile to prioritize inference on the tail, most observations are usually censored for any temporal replicate. When all observations fall below the threshold, the relevant likelihood is (9).
Most often, in any temporal replicate, there will be a mixture of observations above and below the threshold. In this case, the relevant joint likelihood is defined by equation (4), which results from taking partial derivatives of (9) with respect to only the un-censored observations. If we let $\mathcal{U}$ be the set of points above the threshold and $\mathcal{C}$ be the points below, then the likelihood takes the form given in (4), where dependence of the covariance matrices on $\boldsymbol{\theta}$ is suppressed for brevity, and the notation $\Sigma_{\mathcal{C}\mathcal{U}}$ refers to the rows and columns of $\Sigma$ pertinent to the points in $\mathcal{C}$ and $\mathcal{U}$, respectively. The matrix
$$\Sigma_{\mathcal{C} \mid \mathcal{U}} = \Sigma_{\mathcal{C}\mathcal{C}} - \Sigma_{\mathcal{C}\mathcal{U}} \Sigma_{\mathcal{U}\mathcal{U}}^{-1} \Sigma_{\mathcal{U}\mathcal{C}}$$
is the covariance matrix of the conditional normal distribution of the censored observations given the un-censored observations.
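This conditional covariance is a Schur complement and can be computed directly (Python sketch; the function name is ours):

```python
import numpy as np

def conditional_cov(Sigma, cens, unc):
    """Sigma_CC - Sigma_CU Sigma_UU^{-1} Sigma_UC: covariance of the censored
    components given the un-censored ones (a Schur complement)."""
    S_cc = Sigma[np.ix_(cens, cens)]
    S_cu = Sigma[np.ix_(cens, unc)]
    S_uu = Sigma[np.ix_(unc, unc)]
    # Solve against Sigma_UU rather than forming its inverse explicitly
    return S_cc - S_cu @ np.linalg.solve(S_uu, S_cu.T)

# Hand-checkable 2x2 example: Var(X_1 | X_0) = 3 - 1 * 1 / 2 = 2.5
Sigma = np.array([[2.0, 1.0],
                  [1.0, 3.0]])
print(conditional_cov(Sigma, [1], [0]))   # -> [[2.5]]
```

Since $|\mathcal{U}|$ is small for high thresholds, the solve against $\Sigma_{\mathcal{U}\mathcal{U}}$ is cheap; the expense sits in the $|\mathcal{C}|$-dimensional Gaussian cdf that uses this matrix.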
The computational issue arises because the integrand (4) contains a Gaussian cdf of dimension $|\mathcal{C}|$, the number of censored observations in a temporal replicate. Again, for most replicates, this number is close to the total number of observation locations because the censoring threshold is chosen to be high, such that most observations fall below the threshold and are therefore censored.
4.1 Precipitation Over Europe
Our dataset consists of weekly maximum precipitation observations between January 2000 and April 2019, in the western and central region of continental Europe, north of the Pyrenees, Alps, and Carpathian mountain ranges. The 6 countries we consider are Germany, Poland, the Netherlands, Belgium, the Czech Republic, and France. Figure 8 shows the locations of the observation stations distributed over Europe. This dataset consists of 1,006 weekly maxima from 528 weather stations. For context, the computational bottleneck from the Gaussian cdf limited the analysis in Huser2017 to a dataset of 12 locations, even though analysis was performed on a large high-performance computing cluster. We use the weekly maximum of daily accumulations at each location to break temporal dependence that might arise from storms that persist for more than one day. Out of the 531,168 total observations, 32.6% were missing values. For each weekly maximum, only the available data were used for estimation, and all missing observations were disregarded.
The covariance model we use for the underlying Gaussian processes is an anisotropic exponential, $\rho(\mathbf{s}_i, \mathbf{s}_j) = \exp\{-d(\mathbf{s}_i, \mathbf{s}_j)/\lambda\}$, where $\lambda$ is the range parameter and $d(\mathbf{s}_i, \mathbf{s}_j)$ is the Mahalanobis distance between locations $\mathbf{s}_i$ and $\mathbf{s}_j$. The Mahalanobis distance is parametrized as
$$d(\mathbf{s}_i, \mathbf{s}_j) = \sqrt{(\mathbf{s}_i - \mathbf{s}_j)^\top \Omega^{-1} (\mathbf{s}_i - \mathbf{s}_j)}, \qquad \Omega = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & a^2 \end{pmatrix} \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix},$$
for rotation angle $\theta$ and aspect ratio $a$. Thus, after fixing the mixing parameter $\gamma$ at 1, as it plays a much less significant role than the parameter $\beta$ in determining tail dependence characteristics, we arrive at a total of 4 parameters to estimate, $(\lambda, \theta, a, \beta)$.
The first step in estimating the dependence is to transform the observations to be on the same marginal scale. To do this, we start by applying a rank transformation to standard uniform, independently for each station. That is, for each station $i$ and each time point $t$, the observation on the uniform scale is
$$u_{it} = \frac{\operatorname{rank}(x_{it})}{n_i + 1},$$
where the rank is taken among the $n_i$ non-missing observations at station $i$.
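This per-station rank transform, with missing weeks handled as described above, can be sketched as follows (Python; `rank_transform` is our name, and ties are broken arbitrarily, which suffices for illustration):

```python
import numpy as np

def rank_transform(x):
    """Map one station's observations to the standard uniform scale by
    empirical ranks, ignoring missing values (NaN)."""
    u = np.full(x.shape, np.nan)
    obs = ~np.isnan(x)
    ranks = x[obs].argsort().argsort() + 1        # 1-based ranks among observed
    u[obs] = ranks / (obs.sum() + 1)              # divide by n_i + 1
    return u

# Three observed weeks and one missing week: ranks 3, 1, 2 out of n_i = 3
u = rank_transform(np.array([3.0, np.nan, 1.0, 2.0]))
```

Dividing by $n_i + 1$ rather than $n_i$ keeps the transformed values strictly inside $(0, 1)$, so the copula likelihood below never evaluates marginal quantile functions at their boundary.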
We next choose a high threshold to be the 0.95 marginal empirical quantile at each location. Then, denoting the marginal cdf and pdf of each $X(\mathbf{s}_i)$, respectively, as $F$ and $f$ (we assume stationarity, so the marginal distribution is assumed to be the same at each location), and letting the vector $\mathbf{x}_t = \{F^{-1}(u_{1t}), \dots, F^{-1}(u_{nt})\}^\top$, the copula censored likelihood for each time replicate is
Finally, the log likelihood across all time points for the parameter vector is
We found the maximum likelihood estimator (MLE) by applying the Nelder-Mead numerical optimizer in the R function optim. MLEs are shown in Table 1. The MLE for the mixing parameter $\beta$ is 0.82, which in this context is fairly far away from zero, far enough to strongly suggest that the process is asymptotically independent. The MLEs for the anisotropy parameters suggest pronounced eccentricity. To interpret and visualize the estimated dependence model implied by the MLEs shown in Table 1, we plot level curves of the fitted conditional exceedance probability $\Pr\{X(\mathbf{s}) > u \mid X(\mathbf{s}_0) > u\}$ for $u$ on the quantile scale, shown in Figure 8. Each ellipse represents a constant value of this probability, for an arbitrarily chosen reference point $\mathbf{s}_0$ near the center of the map. The level curves are ellipses due to the anisotropic construction, with the major axis roughly along a northeast-southwest orientation, and joint exceedances more likely with decreasing distance from $\mathbf{s}_0$.
The main objective of this paper was to propose a fast approximation to high-dimensional Gaussian cdfs that arise from spatial Gaussian processes. We modified Vecchia's approximation for Gaussian pdfs to the context of Gaussian cdfs. Simulations showed that for large numbers of locations and relatively small conditioning sets, this approximation gives results consistent with state-of-the-art QM methods, and reduces computational time considerably, even when computations are performed sequentially. Furthermore, the approximation is trivially easy to code in parallel using standard R packages, and requires no linking to specialized software libraries.
We demonstrated the utility of our fast cdf approximation by using it to find maximum censored likelihood estimates for the scale mixture model of Huser2017. This model is attractive because of its flexible tail dependence characteristics, but is hampered by computational difficulties arising from the need to compute high-dimensional Gaussian cdfs during inference. We fit this model to a precipitation dataset consisting of over 500 spatial locations, whereas previous efforts using conventional QM techniques were limited to just 12 locations.
One drawback that we noticed during the data analysis is that conventional optimization routines had trouble converging, due to the stochastic nature of the likelihood objective function. For future studies, one possible approach to circumventing this problem is to use stochastic optimization algorithms, which may be better suited to optimizing random objective functions.
This research was supported in part by NSF grant DMS-1752280. Computations for this research were performed on the Institute for Computational and Data Sciences Advanced CyberInfrastructure (ICDS-ACI).