Estimating high-resolution Red Sea surface temperature hotspots using a low-rank semiparametric spatial model

In this work, we estimate extreme sea surface temperature (SST) hotspots, i.e., high threshold exceedance regions, for the Red Sea, a vital region of high biodiversity. We analyze high-resolution satellite-derived SST data comprising daily measurements at 16703 grid cells across the Red Sea over the period 1985–2015. We propose a semiparametric Bayesian spatial mixed-effects linear model with a flexible mean structure to capture spatially-varying trend and seasonality, while the residual spatial variability is modeled through a Dirichlet process mixture (DPM) of low-rank spatial Student-t processes (LTPs). By specifying cluster-specific parameters for each LTP mixture component, the bulk of the SST residuals influence tail inference and hotspot estimation only moderately. Our proposed model has a nonstationary mean, covariance and tail dependence, and posterior inference can be drawn efficiently through Gibbs sampling. In our application, we show that the proposed method outperforms some natural parametric and semiparametric alternatives. Moreover, we show how hotspots can be identified, and we estimate extreme SST hotspots for the whole Red Sea, projected for the year 2100. The estimated 95% credible region for joint high threshold exceedances includes large areas covering major endangered coral reefs in the southern Red Sea.





1 Introduction

Sea surface temperature (SST) has an immense environmental and ecological impact on marine life and ecosystems, e.g., affecting the survival of endangered animal species including corals (Reaser et al., 2000; Berumen et al., 2013; Lewandowska et al., 2014), and it also has an important economic impact for neighboring countries, which depend on it for their local fisheries and tourism. Hence, identifying the regions within the Red Sea where SST may exceed high thresholds is a vital concern, and this motivates a proper statistical analysis of (present and future) extreme hotspots based on a high-resolution spatiotemporal SST dataset. The Operational Sea Surface Temperature and Sea Ice Analysis (OSTIA) system produces satellite-derived daily SST data at high spatial resolution (Donlon et al., 2012). Over the whole Red Sea, daily SST data are available at 16703 grid cells over the period 1985–2015, and we consider these data for estimating the extreme hotspots.

The most common model in spatial geostatistics is the Gaussian process (GP), due to its appealing theoretical and computational properties (Gelfand and Schliep, 2016). However, fitting an ordinary GP model involves computing the determinant and the inverse of the spatial covariance matrix, which is prohibitively expensive in dimensions as high as that of the Red Sea SST data (here, available at 16703 grid cells). A variety of methods have been proposed to tackle this problem. These include approaches based on kernel convolutions (Higdon, 2002), low-rank methods using basis functions (Wikle and Cressie, 1999), the predictive process (Banerjee et al., 2008), approximations of the likelihood in the spectral domain (Stein, 1999; Fuentes, 2007) or by a product of appropriate conditional distributions (Vecchia, 1988; Stein et al., 2004), covariance tapering (Furrer et al., 2006; Anderes et al., 2013) and Markov random fields (Rue and Held, 2005; Rue et al., 2009). Irrespective of whether an ordinary GP or a low-rank GP (LGP) model is used, the marginal normal density functions are thin-tailed and hence can heavily underestimate the probabilities of extreme events. Additionally, the tails of multivariate normal distributions lead to independent extremes except in the trivial case of perfect dependence, which can result in disastrous underestimation of the probability that extreme events occur simultaneously (Davison et al., 2013). Hence, both GPs and LGPs are criticized when the main interest is in the tail behavior. Relaxing the parametric GP assumption, Gelfand et al. (2005) propose a flexible nonparametric Bayesian model based on a Dirichlet process mixture (DPM) of spatial GPs in the context of geostatistical analysis; however, Hazra et al. (2018) showed that the tails of the joint distributions of a finite mixture of GPs also lead to independent extremes. While more flexible nonparametric spatial models are available in the geostatistics literature (see, e.g., Duan et al., 2007), they are not computationally suitable for large spatial datasets.

While GPs and related processes are typically used to describe the bulk behavior, models stemming from extreme-value theory are designed to accurately describe the tail behavior. The classical modeling of spatial extremes usually relies on site-wise block maxima or peaks over some high threshold (Smith, 1990; Davison et al., 2012, 2019; Davison and Huser, 2015). These models can be divided into three main categories: (asymptotic) max-stable and Pareto processes (Smith, 1990; Padoan et al., 2010; Davison et al., 2012; Reich and Shaby, 2012; Opitz, 2013; Thibaud and Opitz, 2015; de Fondeville and Davison, 2018), latent variable models (Sang and Gelfand, 2009, 2010; Cooley and Sain, 2010; Opitz et al., 2018) and sub-asymptotic models (Huser et al., 2017; Morris et al., 2017; Hazra et al., 2019; Huser and Wadsworth, 2019). Max-stable and Pareto processes are asymptotically justified models for spatial extremes, but likelihood computations are usually challenging even in low or moderate spatial dimensions (see, e.g., Castruccio et al., 2016; Huser et al., 2019). An exception is the max-stable model of Reich and Shaby (2012), which is computationally tractable in higher spatial dimensions; see also Bopp et al. (2019a). However, this model is constructed from deterministic kernels and is generally overly smooth in practice. Analogously, de Fondeville and Davison (2018) show how Pareto processes can be efficiently fitted to high-dimensional peaks over thresholds using proper scoring rules, but their approach is limited to a few thousand sites and cannot easily be accommodated within the Bayesian setting. Alternatively, Morris et al. (2017) propose a Bayesian model for high threshold exceedances based on the spatial skew-$t$ process, which lacks the strong asymptotic characterization of max-stable and Pareto processes but benefits from considerably faster computation. Instead of considering only the block maxima or peaks over thresholds as in the approaches mentioned above, Hazra et al. (2018) consider a DPM of spatial skew-$t$ processes where the extremes are selected probabilistically through the Dirichlet process prior; hence, this modeling approach does not require any arbitrary high thresholding, but it assumes that identically distributed temporal replicates are available. Recently, Bopp et al. (2019b) proposed an analogous Bayesian model in which the intensity of flood-inducing precipitation is modeled using a mixture of Student-$t$ processes. However, both of these approaches are applicable only in low spatial dimensions, and they are not directly applicable to a large spatiotemporal dataset with trend and seasonality, such as our Red Sea SST dataset.

In this paper, we propose a low-rank semiparametric Bayesian spatial mixed-effects linear model, which extends the spatial model of Hazra et al. (2018) to handle large, highly nonstationary spatiotemporal datasets. The mean SST profile is assumed to be comprised of a spatially-varying linear trend and a nonlinear seasonality term (modeled using B-splines). Furthermore, for computational tractability at high spatial resolution, we consider a low-rank approximation of the spatially-varying coefficients involved in the trend and seasonality components. We model the residual variability using a DPM of low-rank spatial Student-$t$ processes (LTPs), abbreviated LTP-DPM in short. The DPM is constructed using a (truncated) stick-breaking prior (Sethuraman, 1994). As for the LTP mixture components, we construct them as a scalar product of multivariate normal random effects with orthonormal spatial basis functions, multiplied by an inverse-gamma random scaling. The random effects are assumed to be independent and identically distributed (iid) across time points. The final proposed model has nonstationary mean, covariance and tail dependence structures and, under limiting conditions, it spans all square-integrable continuous spatial processes. We draw posterior inference about the model parameters through an efficient Gibbs sampler.

Beyond modeling the SST data, our ultimate goal is to identify regions “at risk”, where the SST level might exceed a high threshold at some future time. Similar problems arise in a wide range of scientific disciplines, for example, environmental health monitoring (Bolin and Lindgren, 2015), brain imaging (Mejia et al., 2019), astrophysics (Beaky et al., 1992) and climatology (Furrer et al., 2007; French and Sain, 2013). The easiest and most naive approach for estimating exceedance regions (i.e., hotspots) is to perform site-specific exceedance tests at each grid cell separately (see, e.g., Eklundh and Olsson, 2003). However, such a naive approach does not adequately account for multiple testing, and a better approach is to set the joint probability of exceeding a high threshold over the whole region equal to some predefined value. In this spirit, a variety of more advanced methods have been proposed for identifying hotspots (Cressie et al., 2005; Craigmile et al., 2006; French and Sain, 2013; Bolin and Lindgren, 2015). In particular, French and Sain (2013) provide a method to construct confidence regions for Gaussian processes that contain the true exceedance regions with some predefined probability. Here, we develop an approach to estimate extreme hotspots by extending the Gaussian-based method of French and Sain (2013) to the more general framework of our highly flexible semiparametric LTP-DPM model, which is better suited for capturing the joint tail behavior.

The paper is organized as follows. In §2, we present the Red Sea SST dataset and some exploratory analysis. Our proposed LTP-DPM model and its properties are discussed in §3. In §4, we discuss Bayesian computational details and the hotspot estimation technique. In §5, we apply the proposed methodology to the Red Sea SST dataset and discuss the results. We conclude with some discussion and perspectives on future research in §6.

2 The Red Sea SST dataset and exploratory analysis

The OSTIA project generates satellite-derived daily SST estimates (free of diurnal variability) at an output grid resolution of about 6 km. This yields 16703 grid cells for the whole Red Sea. The data can be freely obtained from the project website. Figure 1 shows spatial maps of observed SST profiles for three days with high spatially-averaged SST. For all three days, the SST values are lowest near the Gulfs of Aqaba and Suez in the north, and highest in the southern Red Sea near the coast of Eritrea and the southwest of Saudi Arabia.


Figure 1: Observed SST profiles across the Red Sea for three extremely hot days: September 12, 1998 (left), August 30, 2010 (middle), and September 18, 2015 (right). All sub-figures are on the same scale.

Some exploratory analysis (not shown) reveals that daily SST data are highly autocorrelated in time, and that the strength of temporal dependence varies strongly over space. It would be extremely challenging, statistically and computationally (if possible at all), to flexibly account for spatially-varying autocorrelation in a single fully Bayesian model for a dataset of this size (16703 grid cells observed daily over 31 years). Therefore, for simplicity, we here analyze temporally-thinned data, keeping only one day per week at each grid cell, thus greatly reducing the temporal autocorrelation. Hence, we obtain seven spatial sub-datasets, each comprising 1612 spatial fields (i.e., one for each week) that we treat as independent time replicates. To concisely summarize the results for all sub-datasets, and to reduce the overall uncertainty in our final estimates, we obtain the results separately for each sub-dataset, check that they are indeed mutually consistent, and then report the averages. As our main goal is to draw spatial inference, estimating spatial (rather than spatiotemporal) hotspots, this approach is reasonable.
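The weekly thinning step can be sketched as follows; this is our own minimal illustration with a hypothetical `sst` array of daily fields (toy number of grid cells), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, n_cells = 11284, 100          # 1612 weeks x 7 days; toy cell count
sst = rng.standard_normal((n_days, n_cells))   # placeholder daily SST fields

# Keep one day per week at offset d = 0,...,6: seven weekly sub-datasets,
# each with 1612 spatial fields treated as independent time replicates.
subsets = [sst[d::7] for d in range(7)]
print([s.shape for s in subsets])     # each (1612, 100)
```

Each sub-dataset is then analyzed separately, and the final estimates are averaged across the seven thinned series.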


Figure 2: Estimated slopes from site-wise simple linear regression fits to quarterly average SST for the first (left) and third (middle) quarters, and to annual average SST (right), over the Red Sea. All sub-figures are on the same scale.

Considering the spatially-varying nature of SST across the Red Sea and the long data collection period of 31 years (1985 to 2015), a period affected by global warming, we first conduct a preliminary site-by-site trend analysis by fitting simple linear regression models to the annual mean SST as well as the quarterly mean SST. The estimated intercept and slope profiles vary over space in all three cases. The mean SST across the Red Sea is lowest during the first quarter and highest during the third quarter. The estimated slope profiles of the first-quarter, third-quarter and annual mean SST are displayed in Figure 2. For the first-quarter and annual mean SST, the slopes are highest near latitude 22°N and lowest near the southern end of the Red Sea between Eritrea and Yemen. For the third-quarter mean SST, however, the highest slopes are observed along the coast of Egypt in the northwest, and the lowest values near the southwest of Saudi Arabia. Thus, Figure 2 illustrates the need for spatially-varying, as well as seasonally- (or weekly-) varying, regression coefficients for modeling the marginal SST distributions.

We then study the seasonality profile (averaged across years) at various grid cells. Significantly different patterns are observed throughout the Red Sea. The hottest weeks (maximizing the annual weekly average SST) vary mainly between weeks 32 and 35 north of latitude 20°N, and between weeks 37 and 42 south of 20°N. The observed nonstationarity of SST across weeks calls for a flexible modeling of seasonality, e.g., through some linear combination of local basis functions with spatially-varying coefficients.

We then compute the site-wise standard deviations (SDs) of the detrended SST data (with the trend obtained by spline smoothing, as discussed in §3.2). The results are presented in the left panel of Figure 3. Again, there is a highly nonstationary pattern, with high SDs in the northeast region and low SDs in the southeast region. The middle panel of Figure 3 shows the spatial correlation structure with respect to a central reference grid cell. The correlation values in the northern region are significantly higher than those in the southern region, despite the sites being at the same distance from the reference cell. This suggests that the spatial correlation structure is also highly nonstationary.


Figure 3: Grid cell-wise SD (left), spatial correlation (middle) and empirical tail dependence coefficient (right) profiles, the latter two computed with respect to a central reference grid cell.

To investigate the extremal dependence structure, we compute the empirical tail dependence coefficient with respect to the same central reference grid cell. The tail dependence coefficient between two random variables $X_1$ and $X_2$ is defined as
\[ \chi = \lim_{u \to 1^{-}} \Pr\{F_1(X_1) > u \mid F_2(X_2) > u\}, \]
where $F_1$ and $F_2$ are the marginal distribution functions of $X_1$ and $X_2$, respectively. A nonzero value of $\chi$ indicates asymptotic dependence, while $\chi = 0$ indicates asymptotic independence. Here, we estimate $\chi$ by the empirical conditional probability $\Pr\{F_1(X_1) > u \mid F_2(X_2) > u\}$ for a fixed high threshold $u$. The estimated values are reported in the right panel of Figure 3. They are nonzero throughout a major portion of the spatial domain, indicating the necessity of a model that can capture nonzero extremal dependence (unlike GPs) at large distances and high thresholds.
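Such an empirical estimate of $\chi$ at a fixed threshold $u$ can be computed from ranks; the sketch below is ours (all names hypothetical), not the authors' code:

```python
import numpy as np

def empirical_chi(x1, x2, u=0.95):
    """Empirical tail dependence coefficient at level u: the proportion of
    time points where x1 exceeds its u-quantile, among those where x2
    exceeds its own u-quantile."""
    n = len(x1)
    u1 = np.argsort(np.argsort(x1)) / n   # ranks rescaled to [0, 1)
    u2 = np.argsort(np.argsort(x2)) / n
    exceed2 = u2 > u
    return np.mean(u1[exceed2] > u) if exceed2.any() else np.nan

rng = np.random.default_rng(1)
z = rng.standard_normal(10_000)
x1 = z + 0.1 * rng.standard_normal(10_000)   # strongly dependent pair
x2 = z + 0.1 * rng.standard_normal(10_000)
print(empirical_chi(x1, x2))
```

For a strongly dependent pair the estimate is close to one, while for an independent pair it is close to $1 - u$.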

Finally, we investigate the bias in estimating high quantiles when fitting normal and Student-$t$ distributions. The high biases observed at many grid cells indicate the need for a more flexible model than these usual parametric alternatives. As mixture models fit the low through high quantiles of a distribution more flexibly, a semiparametric model is warranted.

More details on the exploratory analysis are provided in the Supplementary Material. To summarize, we need a model that accounts for spatially-varying trend and seasonality components within the mean structure, while the residual variability needs to be modeled through a mixture of spatial processes that allows for extremal dependence. Considering the high dimension, a low-rank approach is necessary to ensure that inference is practically feasible.

3 Modeling

3.1 General framework

Let $\mathcal{D}$ denote the spatial domain of the Red Sea. We model the Red Sea SST data as
\[ Y_t(s) = \mu_t(s) + \varepsilon_t(s), \quad s \in \mathcal{D}, \quad t = 1, \ldots, T, \]
where $Y_t(s)$ denotes the observed SST at location $s$ and time $t$, $\mu_t(s)$ is the mean SST profile, and $\varepsilon_t(s)$ is the corresponding error component.

In §3.2, we first discuss the modeling of the mean term $\mu_t(s)$, and in §3.3, we then discuss the modeling of the residual process $\varepsilon_t(s)$. In §3.4, we specify the overall Bayesian model, and in §3.5, we describe the model properties.

3.2 Mean modeling

By an abuse of notation, we write $t \equiv (t_1, t_2)$, where $t_1$ denotes the year corresponding to time $t$ and $t_2$ denotes the corresponding week within the $t_1$-th year. Here, $t_1 \in \{1, \ldots, 31\}$ for the 31 years of data and $t_2 \in \{1, \ldots, 52\}$ for the 52 weeks within each year. Henceforth, this one-to-one relation between $t$ and $(t_1, t_2)$ holds for the rest of the paper. In spite of the data being observed at discrete time points, we model the mean SST profile as a continuous function of $t_1$ and $t_2$.

We assume that $\mu_t(s)$ is linear in $t_1$, with $\mu_t(s) = \beta_0(s, t_2) + \beta_1(s, t_2)\, t_1$, where $\beta_0(s, t_2)$ and $\beta_1(s, t_2)$ denote the intercept and slope coefficients, which vary over space as well as for each week of a specific year. Let $\mathbf{T}$ denote the matrix of temporal covariates, whose $t$-th row contains the intercept and slope covariates at time $t$. We consider standardized covariates, which ensures that $\mathbf{T}$ is an orthonormal matrix. This is important from a computational perspective.

We then further model the regression coefficients $\beta_i(s, t_2)$, $i = 0, 1$, as $\beta_i(s, t_2) = \sum_{k=1}^{K} b_k(t_2)\, \beta_{ik}(s)$, where $b_1, \ldots, b_K$ are cubic B-splines defined over the continuous interval $[1, 52]$ with equidistant knots, evaluated at $t_2$. We consider one B-spline per month. For convenience, here we assume that the B-spline basis functions are the same for $i = 0, 1$, though other choices are also possible. We denote the corresponding $52 \times K$-dimensional design matrix by $\mathbf{B}$, with $(t_2, k)$-th entry $b_k(t_2)$.

Finally, let the grid cells within $\mathcal{D}$ be denoted by $s_1, \ldots, s_n$. In order to reduce the computational burden due to the high spatial dimension $n = 16703$, we consider a low-rank approximation of the spatially-varying coefficients by specifying $\beta_{ik}(s) = \sum_{l=1}^{L} h_l(s)\, \beta_{ikl}$, where $h_1, \ldots, h_L$ are suitable spatial basis functions. While other choices are also possible, we consider a tensor product of cubic B-splines defined over a rectangular surface covering $\mathcal{D}$. Taking the elongated geometry of the Red Sea into account, we place 30 equidistant B-splines along the northwest–southeast direction and 10 equidistant B-splines along the west–east direction. Out of these 300 tensor-product B-splines, we only keep the $L$ of them that represent more than 99% of the total weight over $\mathcal{D}$. The knots are well spread across the whole Red Sea and hence the splines are able to capture local characteristics quite well. We denote the corresponding $n \times L$-dimensional design matrix by $\mathbf{H}$, with $(j, l)$-th entry $h_l(s_j)$.
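A tensor-product B-spline design of this kind can be sketched as follows; this is our own illustration with a toy rectangular grid and a simplified weight-based selection rule (names like `bspline_basis` are ours):

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, n_basis, degree=3):
    """Evaluate n_basis equidistant B-splines of the given degree on [0, 1]."""
    n_interior = n_basis - degree - 1
    knots = np.concatenate([np.zeros(degree),
                            np.linspace(0.0, 1.0, n_interior + 2),
                            np.ones(degree)])
    B = np.empty((len(x), n_basis))
    for j in range(n_basis):
        coef = np.zeros(n_basis)
        coef[j] = 1.0                 # basis element j
        B[:, j] = BSpline(knots, coef, degree)(x)
    return B

# 30 splines along the long (northwest-southeast) axis, 10 along the short axis.
u = np.linspace(0, 1, 200)            # along-axis coordinates of grid cells
v = np.linspace(0, 1, 50)             # across-axis coordinates
Bu, Bv = bspline_basis(u, 30), bspline_basis(v, 10)
# All (u, v) pairs -> 300 tensor-product basis functions (columns).
H = np.einsum('ik,jl->ijkl', Bu, Bv).reshape(len(u) * len(v), 300)
# Toy selection: drop basis functions carrying negligible weight over the domain.
H_reduced = H[:, H.sum(axis=0) / H.sum() > 1e-4]
```

In the actual application, the selection would instead retain the splines whose weight over the sea region (rather than the full rectangle) exceeds the stated threshold.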

In summary, we model the mean SST at grid cell $s_j$ and time point $t \equiv (t_1, t_2)$ as
\[ \mu_t(s_j) = \sum_{k=1}^{K} \sum_{l=1}^{L} (\beta_{0kl} + \beta_{1kl}\, t_1)\, b_k(t_2)\, h_l(s_j). \]
Let the spatial mean vector at time $t$ be $\boldsymbol{\mu}_t = (\mu_t(s_1), \ldots, \mu_t(s_n))'$ and, combining all time points, let $\boldsymbol{\mu} = (\boldsymbol{\mu}_1', \ldots, \boldsymbol{\mu}_T')'$. Grouping the regression coefficients, let $\boldsymbol{\beta}_0 = (\beta_{011}, \ldots, \beta_{0KL})'$ and $\boldsymbol{\beta}_1 = (\beta_{111}, \ldots, \beta_{1KL})'$. Denoting the two columns of $\mathbf{T}$ by $\mathbf{T}_0$ and $\mathbf{T}_1$, respectively, and letting $\mathbf{B}^*$ denote the $T \times K$ matrix with $t$-th row $(b_1(t_2), \ldots, b_K(t_2))$, we can write $\boldsymbol{\mu}$ in vectorial form as $\boldsymbol{\mu} = \{\mathrm{diag}(\mathbf{T}_0)\,\mathbf{B}^* \otimes \mathbf{H}\}\boldsymbol{\beta}_0 + \{\mathrm{diag}(\mathbf{T}_1)\,\mathbf{B}^* \otimes \mathbf{H}\}\boldsymbol{\beta}_1$, where $\otimes$ denotes the Kronecker product between two matrices.
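With toy dimensions, a Kronecker-structured mean of this kind can be assembled and checked against the double sum; the stacking convention below is our assumption for illustration, not necessarily the authors' exact implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
T, n, K, L = 6, 8, 3, 2                   # times, sites, temporal/spatial bases

Tmat = np.column_stack([np.ones(T), np.arange(1.0, T + 1)])  # intercept, trend
Tmat /= np.linalg.norm(Tmat, axis=0)      # unit-norm columns (full
                                          # standardization would also orthogonalize)
B = rng.random((T, K))                    # temporal B-splines evaluated at each t
H = rng.random((n, L))                    # spatial basis evaluated at each cell
beta0, beta1 = rng.standard_normal(K * L), rng.standard_normal(K * L)

# mu stacked as (mu_1', ..., mu_T')': row-wise temporal weighting, then Kronecker.
X0 = np.kron(Tmat[:, [0]] * B, H)         # (T*n) x (K*L) intercept design
X1 = np.kron(Tmat[:, [1]] * B, H)         # slope design
mu = X0 @ beta0 + X1 @ beta1
assert mu.shape == (T * n,)
```

Entry $(t, j)$ of the stacked vector equals the double sum over the temporal and spatial basis functions, which can be verified by direct looping.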

3.3 Spatial dependence modeling

We now discuss the modeling of the residual process $\varepsilon_t(s)$. We here assume that the $\varepsilon_t$'s are iid across time, with zero mean, and we write $\varepsilon(s)$ for a generic copy of $\varepsilon_t(s)$. Our semiparametric Bayesian model for the spatial residual processes is based on a Dirichlet process mixture (DPM) of parametric low-rank Student-$t$ processes (LTPs). We first describe the construction of LTPs, and then discuss the modeling based on mixtures.

3.3.1 Low-rank Student-$t$ process (LTP)

LTPs are richer than LGPs, as they have heavier marginal and joint tails, and they can capture spatial extremal dependence, contrary to Gaussian processes. At a spatial location $s$, we model a realization from an LTP as
\[ \varepsilon(s) = \mathbf{f}(s)'\boldsymbol{\eta} + e(s), \]
where $\mathbf{f}(s)$ denotes the vector of length $M$ comprised of some spatial basis functions evaluated at $s$. The random effects are specified as $\boldsymbol{\eta} \mid \sigma^2 \sim \mathrm{Normal}_M(\mathbf{0}_M, \sigma^2 \boldsymbol{\Lambda})$ for some positive definite matrix $\boldsymbol{\Lambda}$, while $\sigma^2 \sim \mathrm{Inverse\text{-}Gamma}(\nu/2, \nu/2)$ with $\nu > 2$, and $e(s)$ denotes a spatial white noise process (i.e., nugget effect) such that
\[ e(s) \mid \sigma^2 \overset{\mathrm{iid}}{\sim} \mathrm{Normal}(0, \sigma^2 \delta), \quad \delta > 0. \]
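A realization of this construction is easy to simulate; below is a minimal sketch under the specification above, with toy basis functions and our own function names:

```python
import numpy as np

def simulate_ltp(F, lam, nu, delta, T, rng):
    """Simulate T replicates of a low-rank Student-t process at n sites.

    F: n x M basis matrix; lam: length-M random-effect variances (diagonal
    Lambda); nu: degrees of freedom; delta: nugget variance. Conditional on
    the inverse-gamma scaling sigma2 ~ IG(nu/2, nu/2) the field is Gaussian;
    marginally it is multivariate Student-t with dispersion
    F diag(lam) F' + delta I.
    """
    n, M = F.shape
    sigma2 = 1.0 / rng.gamma(nu / 2.0, 2.0 / nu, size=T)   # IG(nu/2, nu/2) draws
    eta = rng.standard_normal((T, M)) * np.sqrt(sigma2[:, None] * lam)
    nugget = rng.standard_normal((T, n)) * np.sqrt(sigma2[:, None] * delta)
    return eta @ F.T + nugget

rng = np.random.default_rng(7)
s = np.linspace(0, 1, 50)
F = np.stack([np.sin(np.pi * s), np.cos(np.pi * s)], axis=1)  # toy basis
eps = simulate_ltp(F, lam=np.array([2.0, 1.0]), nu=3.0, delta=0.05,
                   T=2000, rng=rng)
```

Because the same scaling $\sigma^2$ multiplies both the random effects and the nugget, the replicates are heavy-tailed jointly across sites, not just marginally.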
For the spatial grid cells $s_1, \ldots, s_n$, let $\boldsymbol{\varepsilon} = (\varepsilon(s_1), \ldots, \varepsilon(s_n))'$ be the vector of observed values from the process $\varepsilon(s)$. Moreover, let $\mathbf{F}$ be the $n \times M$-dimensional matrix whose columns are the $M$ different spatial basis functions evaluated at $s_1, \ldots, s_n$. After marginalization over the random effects $\boldsymbol{\eta}$ and $\sigma^2$, the joint distribution of $\boldsymbol{\varepsilon}$ is
\[ \boldsymbol{\varepsilon} \sim t_n(\mathbf{0}_n,\; \mathbf{F}\boldsymbol{\Lambda}\mathbf{F}' + \delta \mathbf{I}_n,\; \nu), \]
where $t_n(\boldsymbol{\mu}, \boldsymbol{\Sigma}, \nu)$ denotes the $n$-dimensional multivariate Student's $t$ distribution with location vector $\boldsymbol{\mu}$, dispersion matrix $\boldsymbol{\Sigma}$ and degrees of freedom $\nu$, $\mathbf{I}_n$ is the $n$-by-$n$ identity matrix, and $\mathbf{0}_n$ is the zero vector of length $n$. In case (temporal) replications are available from the spatial process $\varepsilon(s)$, the spatial covariance matrix $\mathbf{S}$ of $\boldsymbol{\varepsilon}$ (which exists since $\nu > 2$) can be estimated using the sample covariance matrix $\widehat{\mathbf{S}}$. Here, we consider the spatial basis functions to be the eigenvectors of $\widehat{\mathbf{S}}$ with the $M$ largest corresponding eigenvalues. In other words, the matrix $\mathbf{F}$ is comprised of empirical orthogonal functions (EOFs). Specifically, let $\mathbf{D}_M$ be the diagonal matrix with the $M$ largest eigenvalues of $\widehat{\mathbf{S}}$, and $\mathbf{F}$ be the matrix with column vectors equal to the corresponding eigenvectors. Then, we have the approximation $\widehat{\mathbf{S}} \approx \mathbf{F}\mathbf{D}_M\mathbf{F}'$. Other choices of basis functions are also possible (even in case replicates are unavailable); a detailed discussion regarding the choices of $\mathbf{F}$ is provided in Wikle (2010). While $\boldsymbol{\Lambda} \propto \mathbf{D}_M$ could be assumed, we consider $\boldsymbol{\Lambda}$ instead to be unknown and use an informative prior on $\boldsymbol{\Lambda}$ with prior mean proportional to $\mathbf{D}_M$. The nugget component is important for $\mathbf{F}\boldsymbol{\Lambda}\mathbf{F}' + \delta \mathbf{I}_n$ to be full-rank, and $\delta$ is expected to be small. The specific parametrization of $\sigma^2$ ensures that, plugging in the prior means of $\boldsymbol{\Lambda}$ and $\delta$, we get the approximation $\mathrm{Cov}(\boldsymbol{\varepsilon}) \approx \mathbf{F}\mathbf{D}_M\mathbf{F}' \approx \widehat{\mathbf{S}}$. Hence, we construct a zero-mean LTP whose covariance structure resembles the sample covariance.
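The EOF construction amounts to eigendecomposing the sample covariance of the residual replicates and keeping the leading eigenvectors; a minimal sketch with toy data (names are ours):

```python
import numpy as np

def eof_basis(E, m):
    """Leading EOFs from a T x n matrix E of zero-mean residual replicates.

    Returns (F, lam): F has the m leading orthonormal eigenvectors of the
    sample covariance S_hat = E'E / T as columns; lam holds the eigenvalues.
    """
    T = E.shape[0]
    S_hat = E.T @ E / T
    lam, V = np.linalg.eigh(S_hat)            # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:m]         # take the m largest
    return V[:, order], lam[order]

rng = np.random.default_rng(0)
n, T, m = 40, 500, 5
s = np.linspace(0, 1, n)
fields = np.stack([np.sin((k + 1) * np.pi * s) for k in range(3)])  # smooth modes
E = rng.standard_normal((T, 3)) @ fields + 0.1 * rng.standard_normal((T, n))
F, lam = eof_basis(E, m)                      # F D_M F' approximates S_hat
```

For the Red Sea data the decomposition is of a 16703-dimensional matrix, which is done once, before running the MCMC.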

3.3.2 Dirichlet process mixture (DPM) of LTPs

In case we do not have any (temporal) replicates of the process $\varepsilon(s)$, parametric assumptions are required ($\widehat{\mathbf{S}}$, and hence $\mathbf{F}$, is then not available, and another choice of basis functions is required). However, for the Red Sea data, independent temporal replicates of the anomalies, $\boldsymbol{\varepsilon}_1, \ldots, \boldsymbol{\varepsilon}_T$, are available, and we can thus estimate the underlying spatial process semiparametrically.

Considering that our focus mainly lies in inference on the tail, here we extend the construction (4) by modeling the residuals using a DPM in the same spirit as Hazra et al. (2018), where the characteristics of the bulk and the tail of the anomaly process are described by different mixture components with component-specific parameters. Our approach can thus automatically and probabilistically cluster observations into weeks characterized by “extreme conditions” or weeks characterized by “normal conditions”, without any artificial and subjective thresholding. Tail inference is therefore expected to be minimally influenced by observations from the bulk, while a reasonable fit is provided over the entire probability range.

We now assume that, for all $t$, the $\boldsymbol{\varepsilon}_t$ are iid $n$-dimensional realizations from an LTP-DPM model with $G$ mixture components, for some natural number $G$. The corresponding multivariate density function is
\[ f(\boldsymbol{\varepsilon}) = \sum_{k=1}^{G} \pi_k\, f_{\mathrm{LTP}}(\boldsymbol{\varepsilon};\, \boldsymbol{\theta}_k), \]
where $\pi_1, \ldots, \pi_G$ are the mixture probabilities with $\sum_{k=1}^{G} \pi_k = 1$, $\boldsymbol{\theta}_k$ denotes the set of parameters of the $k$-th LTP component, and $f_{\mathrm{LTP}}(\cdot\,; \boldsymbol{\theta}_k)$ denotes the density of an $n$-dimensional realization from the LTP as described in (5). When $G \to \infty$, the model becomes fully nonparametric.

The main advantage of the LTP-DPM model lies in its hierarchical Bayesian model representation. The model can be rewritten as a clustering model where, conditional on the random cluster label $g_t$ with mass function $\Pr(g_t = k) = \pi_k$, $k = 1, \ldots, G$, we have $\boldsymbol{\varepsilon}_t \mid g_t \sim f_{\mathrm{LTP}}(\cdot\,; \boldsymbol{\theta}_{g_t})$. Thus, our LTP-DPM model assumes that weeks with similar residuals can be clustered together, their distribution being described by the same LTP. Treating the cluster labels as unknown, the model accounts for uncertainty in cluster allocation.

We assign a truncated stick-breaking prior distribution (Sethuraman, 1994) to the mixture probabilities $\pi_k$, $k = 1, \ldots, G$. Precisely, we have the following constructive representation: $V_k \overset{\mathrm{iid}}{\sim} \mathrm{Beta}(1, \alpha)$, $k = 1, \ldots, G - 1$, for some Dirichlet process concentration parameter $\alpha > 0$, and subsequently $\pi_1 = V_1$ and $\pi_k = V_k \prod_{l < k} (1 - V_l)$ for $k = 2, \ldots, G$. As we consider $G$ to be finite, we set $V_G = 1$ so that $\sum_{k=1}^{G} \pi_k = 1$. In our MCMC implementation, we exploit the one-to-one correspondence between the $\pi_k$'s and the $V_k$'s by iteratively updating the latter to estimate the former. We write $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_G)'$.
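The truncated stick-breaking construction can be sketched as follows (our own minimal implementation):

```python
import numpy as np

def stick_breaking(alpha, G, rng):
    """Truncated stick-breaking: V_k ~ Beta(1, alpha) for k < G, V_G = 1,
    and pi_k = V_k * prod_{l<k} (1 - V_l), so the weights sum to one."""
    V = rng.beta(1.0, alpha, size=G)
    V[-1] = 1.0                                  # truncation at G components
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - V[:-1])])
    return V * remaining

rng = np.random.default_rng(42)
pi = stick_breaking(alpha=1.0, G=10, rng=rng)
print(pi.sum())   # 1.0 (up to floating point), by construction
```

Smaller values of the concentration parameter push most of the mass onto the first few components, which is what lets a few clusters describe the bulk while others specialize in the tail.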

3.4 Overall model

Considering two types of spatial basis functions, within the mean and within the residual process, leads to collinearity issues between the fixed and random effects. To fix this issue, we divide the spatial design matrix $\mathbf{H}$ from §3.2 into two parts, based on its projection onto the column space of the EOF matrix $\mathbf{F}$ from §3.3. The projection matrix is $\mathbf{P} = \mathbf{F}\mathbf{F}'$, as $\mathbf{F}$ has orthonormal columns. We have $\mathbf{H} = \mathbf{H}_1 + \mathbf{H}_2$, where $\mathbf{H}_1 = \mathbf{P}\mathbf{H}$ and $\mathbf{H}_2 = (\mathbf{I}_n - \mathbf{P})\mathbf{H}$, and we then rewrite the mean with two separate vectors of coefficients, corresponding to $\mathbf{H}_1$ and $\mathbf{H}_2$, for each of the intercept-related and slope-related terms. For a specific grid cell $s_j$ and time point $t$, the mean $\mu_t(s_j)$ retains the form of §3.2, with the rows of $\mathbf{H}$ replaced by the corresponding rows of $\mathbf{H}_1$ and $\mathbf{H}_2$ and the coefficient vectors expanded accordingly.

Overall, our hierarchical model is defined as follows. Given the cluster label $g_t$,
\[ Y_t(s) = \mu_t(s) + \mathbf{f}(s)'\boldsymbol{\eta}_t + e_t(s), \qquad \boldsymbol{\eta}_t \mid \sigma_t^2 \sim \mathrm{Normal}_M(\mathbf{0}_M, \sigma_t^2 \boldsymbol{\Lambda}_{g_t}), \qquad \sigma_t^2 \sim \mathrm{Inverse\text{-}Gamma}(\nu_{g_t}/2, \nu_{g_t}/2), \]
with $e_t(s) \mid \sigma_t^2$ a Gaussian nugget with variance $\sigma_t^2 \delta_{g_t}$, where $\mu_t(s)$ is the mean SST profile, $\mathbf{f}(s)$ is the vector of spatial basis functions as described in §3.3, and $\boldsymbol{\eta}_t$ and $e_t(s)$ denote independent copies of the corresponding random effects. The set of parameters of the LTP corresponding to time $t$ is $\boldsymbol{\theta}_{g_t} = \{\boldsymbol{\Lambda}_{g_t}, \delta_{g_t}, \nu_{g_t}\}$. We treat the cluster-specific parameters $\boldsymbol{\theta}_k$, $k = 1, \ldots, G$, as unknown and put hyperpriors on them. The components of $\boldsymbol{\theta}_k$, i.e., $\boldsymbol{\Lambda}_k$, $\delta_k$ and $\nu_k$, are treated as independent of each other a priori. Our choice of hyperpriors is discussed in §4.1.

3.5 Model properties

For Model (3.4), the conditional mean and covariance structure of $Y_t(s)$, given the coefficients $\boldsymbol{\beta}_0, \boldsymbol{\beta}_1$ and the cluster-specific parameters, are
\[ \mathrm{E}\{Y_t(s_j)\} = \mu_t(s_j), \qquad \mathrm{Cov}\{Y_t(s_i), Y_{t'}(s_j)\} = \mathbb{1}(t = t')\, \frac{\nu_{g_t}}{\nu_{g_t} - 2} \left\{\mathbf{f}(s_i)'\boldsymbol{\Lambda}_{g_t}\mathbf{f}(s_j) + \delta_{g_t}\, \mathbb{1}(i = j)\right\}, \]
where $\mathbf{f}(s_j)'$ denotes the $j$-th row of $\mathbf{F}$, and $\mathbb{1}(A) = 1$ if $A$ holds and zero otherwise.

The mean structure is clearly nonstationary both in space and time, non-linear and includes interaction terms between the spatial and temporal effects. Thus, the model can capture spatially-varying seasonality and trend as well as seasonally-varying trends at each spatial grid cell. Because we specify high-resolution B-spline knot locations both over space and time, our model can capture the local features in the mean behavior reasonably well.

The covariance between the observations at grid cells $s_i$ and $s_j$ depends on both $s_i$ and $s_j$, and cannot be reduced to a function of the spatial lag $s_i - s_j$. Hence, the covariance structure is also nonstationary. For the Dirichlet process atoms $\boldsymbol{\theta}_k$, $k = 1, \ldots, G$, we consider priors so that $\mathrm{E}(\boldsymbol{\Lambda}_k) \propto \mathbf{D}_M$; recall §3.3. Marginalizing over the cluster labels and considering the nugget variances $\delta_k$ to be small, we get the approximation $\mathrm{Cov}(\boldsymbol{\varepsilon}_t) \approx \mathbf{F}\mathbf{D}_M\mathbf{F}' \approx \widehat{\mathbf{S}}$, with $\mathbf{F}$, $\mathbf{D}_M$ and $\widehat{\mathbf{S}}$ as discussed in §3.3. Thus, the LTP-DPM model is centered around a low-rank Student's $t$ process, constructed as in §3.3, with the mean structure discussed in §3.2.

In case we consider all eigenvectors of $\widehat{\mathbf{S}}$ in the matrix $\mathbf{F}$, i.e., $M = n$, there is no need to include a nugget term to make the dispersion matrix full-rank. Considering $\varepsilon(s)$ as an infinite-dimensional spatial process and ignoring the nugget term, for $M \to \infty$, we can write $\varepsilon(s) = \sum_{m=1}^{\infty} f_m(s)\, \eta_m$. For any pair $s_i$ and $s_j$ with $s_i \neq s_j$, the vector $(\varepsilon(s_i), \varepsilon(s_j))'$ follows a DPM of bivariate zero-mean Student's $t$ distributions, which spans any bivariate zero-mean distribution when $G \to \infty$. Thus, for $M, G \to \infty$, the proposed model satisfies the criteria of the Karhunen–Loève theorem (Alexanderian, 2015) and spans all square-integrable continuous stochastic processes. Under suitable regularity conditions, posterior consistency of the proposed model holds (Wu and Ghosal, 2010; Ghosal and van der Vaart, 2017).

The spatial tail dependence coefficient $\chi(s_i, s_j)$ between two different grid cells $s_i$ and $s_j$, defined in (1), is nonstationary for the proposed LTP-DPM model, and it is easy to show that it is determined by the heaviest-tailed mixture component in (6). Using the tail properties of Student's $t$ copulas (Demarta and McNeil, 2005), its expression may be written explicitly as
\[ \chi(s_i, s_j) = 2\, \bar{T}_{\nu^* + 1}\!\left(\sqrt{\frac{(\nu^* + 1)\{1 - \rho^*(s_i, s_j)\}}{1 + \rho^*(s_i, s_j)}}\right), \]
where $\nu^* = \min_k \nu_k$, $\rho^*(s_i, s_j)$ denotes the underlying spatial correlation of the Gaussian term characterizing the corresponding mixture component, and $\bar{T}_{\nu}$ is the survival function of a standard (univariate) Student's $t$ distribution with $\nu$ degrees of freedom; see also Hazra et al. (2018). Here, $\chi(s_i, s_j)$ is nonzero, so our model can capture asymptotic dependence, unlike Gaussian processes, and $\chi(s_i, s_j)$ is an increasing function of $\rho^*(s_i, s_j)$. Moreover, $\chi(s_i, s_j)$ remains positive even when the correlation is weak (unless $\rho^*(s_i, s_j) = -1$). This is a downside of the proposed model in case the spatial domain is large and the process of interest is rough across the domain (e.g., with precipitation or wind speed data), but it should not be a big limitation for our Red Sea SST data, which remain strongly dependent at large distances.
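For reference, the Student-$t$ copula tail dependence formula of Demarta and McNeil (2005) is straightforward to evaluate numerically (variable names are ours):

```python
import numpy as np
from scipy.stats import t as student_t

def chi_t(rho, nu):
    """Tail dependence coefficient of a bivariate Student-t copula with
    correlation rho and nu degrees of freedom (Demarta and McNeil, 2005):
    chi = 2 * Tbar_{nu+1}( sqrt((nu + 1)(1 - rho) / (1 + rho)) )."""
    arg = np.sqrt((nu + 1.0) * (1.0 - rho) / (1.0 + rho))
    return 2.0 * student_t.sf(arg, df=nu + 1.0)

# chi grows with rho, shrinks as nu grows (near-Gaussian, near-zero chi),
# and equals 1 at rho = 1.
print(chi_t(0.5, 2.5), chi_t(0.5, 40.0), chi_t(1.0, 5.0))
```

This makes explicit why small degrees of freedom drive strong extremal dependence, and why chi stays positive for any correlation strictly above minus one.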

4 Bayesian inference and identification of hotspots

4.1 Hyperpriors and an efficient Gibbs sampler

We draw posterior inference on the model parameters using Markov chain Monte Carlo (MCMC) sampling. Except for the degrees of freedom parameters $\nu_1, \ldots, \nu_G$ of the LTP components of the LTP-DPM model, conjugate priors exist for all other model parameters, which allows Gibbs sampling. While Metropolis–Hastings sampling would be possible for the $\nu_k$'s, we prefer to consider discrete uniform priors on a fine grid, which allows drawing samples from the full conditional distributions in a fast and easy manner. This strategy is often considered in the literature due to its numerical stability; for example, Gelfand et al. (2005) use it for posterior sampling of the range parameter of the spatial Matérn covariance; see also Morris et al. (2017) and Hazra et al. (2019).
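The grid trick amounts to evaluating the Student-$t$ likelihood at every candidate value and sampling proportionally; below is a hypothetical single-replicate sketch (in the actual sampler, the full conditional also conditions on cluster memberships and random effects):

```python
import numpy as np
from scipy.stats import multivariate_t

def sample_nu(residual, disp, nu_grid, rng):
    """One draw of the degrees of freedom over a discrete uniform prior grid:
    posterior weights proportional to the multivariate-t likelihood."""
    logw = np.array([multivariate_t.logpdf(residual,
                                           loc=np.zeros(residual.size),
                                           shape=disp, df=nu)
                     for nu in nu_grid])
    w = np.exp(logw - logw.max())          # stabilize before normalizing
    return rng.choice(nu_grid, p=w / w.sum())

rng = np.random.default_rng(0)
nu_grid = np.arange(1.0, 40.5, 0.5)        # fine discrete support
residual = multivariate_t.rvs(loc=np.zeros(5), shape=np.eye(5), df=3.0,
                              random_state=1)
print(sample_nu(residual, np.eye(5), nu_grid, rng))
```

Because the support is finite, no Metropolis–Hastings accept/reject step is needed, and the draw is exact given the grid.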

For the vectors of fixed effects involved in the mean terms, $\boldsymbol{\beta}_i$, $i = 0, 1$, we consider multivariate normal priors of the form $\boldsymbol{\beta}_i \sim \mathrm{Normal}(\mu_{\beta_i}\mathbf{1}, \sigma_{\beta_i}^2\mathbf{I})$, where $\mathbf{1}$ is the vector of ones and $\mathbf{I}$ is the identity matrix of appropriate dimension. While the posterior distribution of $\boldsymbol{\beta}_i$ is also a multivariate normal distribution, we avoid the computationally challenging inversion of the involved high-dimensional posterior covariance matrix through an eigendecomposition. For the hyper-parameters $\mu_{\beta_i}$, we consider relatively non-informative normal priors. The coefficients with $i = 0$ correspond to the intercept term, while those with $i = 1$ correspond to the slopes. The absolute values of the intercept-related terms are likely to be large, while the slope terms, indicators of the rate of change of SST with time, are likely to be close to zero; thus, we consider a flatter prior for the intercept-related coefficients. For the hyper-parameters $\sigma_{\beta_i}^2$, we consider inverse-gamma priors, with hyper-parameters fixed to be quite non-informative and, following the same logic as above, chosen differently for the intercept-related and slope-related terms.

The parameters involved in the distribution of the error terms are the component-specific parameters and hyper-parameters of the DPM model described in §3.3. For the purpose of computation, we truncate the stick-breaking prior at a finite number of mixture components. The choice of the truncation level is problem-specific, and leads to a bias–variance trade-off: a large number of components is desirable to increase the model flexibility (i.e., decrease the bias), but a very large number may lead to spurious estimates, since sampling the parameters of an LTP component depends only on the observations assigned to that specific cluster, and too many components may leave very few observations within some clusters (thus increasing the variance). In our application, we fit models with different truncation levels and compare them by cross-validation (see §5.2). Among the prior choices for the DPM model parameters, the prior for the cluster-specific covariance scale is conjugate, is built from the diagonal matrix of the largest eigenvalues of the empirical covariance (recall §3.3), and ensures a well-defined positive-definite scale. We choose the hyper-parameters of the prior for the cluster-specific means so that the mass is mainly distributed near zero. The existence of a continuous conjugate prior for the degrees of freedom is unknown, and the discrete uniform prior avoids the need for Metropolis–Hastings sampling, as mentioned above. The support of this prior covers a wide range of degrees of freedom: the smallest values correspond to a heavy-tailed and strongly dependent process, while the largest values practically correspond to near-Gaussian behavior. For the mixing probabilities, we consider a stick-breaking prior as discussed in §3.3, which involves an unknown concentration hyper-parameter; for the latter, we consider a fairly non-informative conjugate hyper-prior.
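The truncated stick-breaking construction of the mixing probabilities can be sketched as follows; this is a generic Python illustration with hypothetical names, not the authors' R implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking(alpha, K, rng):
    """Truncated stick-breaking weights for a DPM with concentration
    `alpha`: v_k ~ Beta(1, alpha) for k < K, with the last stick v_K = 1
    so that the K weights sum exactly to one."""
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0                                  # truncation: use up the remaining stick
    # weight k is v_k times the length of stick left after the first k-1 breaks
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

pi = stick_breaking(alpha=1.0, K=5, rng=rng)
```

Setting the last stick-breaking variable to one is the standard device that turns the infinite Dirichlet process representation into a finite mixture with a fixed number of components.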

The full conditional distributions required for model fitting and prediction are provided in the Supplementary Material. In our data application, we implement the MCMC algorithm in R. We generate 60,000 posterior samples and discard the first 10,000 iterations as a burn-in period. Subsequently, we thin the Markov chains by keeping one out of every five consecutive samples, finally obtaining 10,000 samples for drawing posterior inference. Convergence and mixing are monitored through trace plots.
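The bookkeeping implied by these settings can be checked directly; a minimal Python sketch (the actual sampler is in R):

```python
# MCMC settings from the text: 60,000 iterations, 10,000 burn-in, thin by 5
n_iter, burn_in, thin = 60_000, 10_000, 5

# indices of the retained iterations after burn-in and thinning
kept = list(range(burn_in, n_iter, thin))
n_kept = len(kept)   # (60,000 - 10,000) / 5 = 10,000 retained samples
```

The count confirms the 10,000 posterior draws quoted in the text.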

4.2 Hotspot estimation

Our main goal is to exploit the observed Red Sea SST data to identify extreme hotspots, i.e., to construct a confidence region that contains joint exceedances of some (very) high threshold at some future time with a predefined probability. The approach developed below generalizes that of French and Sain (2013), who focus on Gaussian processes only, to the more general and flexible class of LTP-DPMs, which better captures the joint tail behavior.

Let the discretized domain of interest be a finite set of grid cells. We define the exceedance region above a threshold at a given time as the set of grid cells where the process exceeds that threshold. Note that, because the process is random, the exceedance region is a random set. We want to find a region that contains the true exceedance region with probability 1 − α, for some predefined Type I error rate α. The approach is akin to multiple hypothesis testing, where we first state a null and an alternative hypothesis and then draw conclusions based on a test statistic and a critical value; the difference lies in treating the random process as a “parameter”, though the main concept remains the same. For each fixed grid cell, we can individually test the null hypothesis that the threshold is exceeded at that cell versus the alternative that it is not, based on some test statistic. An obvious choice for the test statistic is (a rescaled version of) a predictor of the process at that cell. To find the exceedance region, a possible approach is to combine these single-cell tests by collecting all cells where we fail to reject the null hypothesis. To account for multiple testing, we explain below how to set the critical value of the single-cell tests in order to reach a family-wise error rate of α.

We first show how to perform single-cell exceedance tests based on our proposed LTP-DPM model (3.4). For a given time point, let the corresponding year and week be fixed. The mean of the process at a grid cell, given in (10), evaluates the spatially-varying trend and seasonality terms through the corresponding row of the spatial basis matrix. The density of the corresponding error term, given in (11), is a finite mixture of univariate Student's t densities with location zero and cluster-specific scales and degrees of freedom. In the hotspot estimation context, the future SST value is treated as an unobserved random parameter. Based on the posterior samples of the model parameters (recall §4.1) and the distribution defined by (10) and (11), we obtain samples from the posterior predictive distribution at each grid cell. From the Bayesian central limit theorem, the posterior mean at a cell is, for a large number of posterior predictive samples, approximately normal around the true value, with standard error given by the standard deviation of the posterior samples divided by the square root of their number. This leads to a test statistic that standardizes the difference between the posterior predictive mean and the threshold by this standard error, with a rejection region of the form where the statistic falls below a critical value. Under the equivalent boundary null hypothesis of equality with the threshold, the statistic has an approximate standard normal distribution.

We now explain how to adjust the critical value so that the family-wise Type I error rate is α, which ensures a confidence level of 1 − α for the resulting confidence region. A Type I error can only occur at locations within the true exceedance region, and hence, to control the overall Type I error, we only need to consider the test statistic values within that region. The critical value is chosen such that, with probability 1 − α, the test statistic exceeds it at every cell of the true exceedance region. Although the distribution of the corresponding minimum is unknown, we can approximate it by posterior sampling, given that our model is easy to simulate from. Precisely, for each posterior predictive sample, we identify the cells that exceed the threshold, calculate the test statistics at those cells, and record their minimum. Repeating this procedure over all posterior predictive samples, we estimate the critical value by the sample α-th quantile of these minima. Finally, the estimated confidence region (i.e., hotspot) is the set of all grid cells whose test statistic is at least the estimated critical value.
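Putting the pieces of §4.2 together, the critical-value calibration and hotspot construction can be sketched as follows. This is a simplified Python illustration in the spirit of French and Sain (2013): the toy Gaussian predictive draws stand in for the model's posterior predictive samples, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def hotspot(samples, u, alpha):
    """Estimate a joint exceedance (hotspot) region from posterior
    predictive samples `samples` (B draws x n cells).

    z(s) rescales the posterior-mean exceedance of u by its Monte Carlo
    standard error; the critical value c is the alpha-quantile of the
    minimum of z over each draw's own exceedance set, and the hotspot is
    the set of cells with z(s) >= c."""
    B, n = samples.shape
    mean = samples.mean(axis=0)
    se = samples.std(axis=0, ddof=1) / np.sqrt(B)
    z = (mean - u) / se
    minima = []
    for b in range(B):
        exceed = samples[b] >= u          # this draw's exceedance set
        if exceed.any():
            minima.append(z[exceed].min())
    c = np.quantile(minima, alpha)
    return np.flatnonzero(z >= c)

# toy example: 3 cells, only the first clearly above the threshold
samples = rng.normal(loc=[30.0, 25.0, 24.0], scale=0.5, size=(500, 3))
region = hotspot(samples, u=26.0, alpha=0.05)
```

Restricting each minimum to that draw's own exceedance set is what controls the family-wise error over the (random) true exceedance region rather than over the whole domain.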

5 Data application

5.1 Model fitting and cross-validation study

To fit our model, we first need to fix the number of mixture components and the number of spatial basis functions (EOFs). Both quantities affect the bias–variance trade-off and the computational burden. We choose the number of basis functions as the smallest number of leading (ordered) eigenvalues of the empirical covariance matrix explaining a given proportion of the total variance (recall §3.3). For the Red Sea SST data, one choice of this proportion yields a single common value, while the other yields between 24 and 26 basis functions across the seven weekly sub-datasets. We then compare the different choices of the truncation level and the number of basis functions by cross-validation. Moreover, we also compare our proposed LTP-DPM model with some simpler parametric and semiparametric sub-models, namely the LGP, LTP, and LGP-DPM models, obtained by suitably restricting the mixture and degrees-of-freedom parameters.
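The explained-variance rule for choosing the number of EOFs can be sketched as follows; the Python code, the toy eigenvalue spectrum, and the 90% cutoff are illustrative assumptions, not the paper's actual values.

```python
import numpy as np

def n_eofs(eigvals, prop=0.9):
    """Smallest number of leading eigenvalues whose cumulative sum
    explains at least a proportion `prop` of the total variance.
    `eigvals` must be sorted in decreasing order."""
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    # first index where the cumulative proportion reaches `prop`
    return int(np.searchsorted(ratios, prop) + 1)

eigvals = np.array([5.0, 2.0, 1.5, 0.9, 0.6])  # toy spectrum, total variance 10
L = n_eofs(eigvals, prop=0.9)
```

Here the cumulative proportions are 0.5, 0.7, 0.85, 0.94, 1.0, so four leading EOFs are needed to reach the 90% cutoff.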

[Figure 4 image: Crossvalidation_figures_new.pdf]

Figure 4: Averaged BSS (top) and TWCRPSS (bottom) for the models LTP (green), LGP-DPM (red) and LTP-DPM (blue) considering the LGP as the reference model, plotted as a function of the threshold ranging from the -quantile to the -quantile of the observed data. Higher values of BSS and TWCRPSS indicate better prediction performance.

For model comparison, we divide each weekly Red Sea SST sub-dataset into two parts, using the years 1985–2010 (1352 weeks) for training and keeping the years 2011–2015 (260 weeks) for testing. For each spatiotemporal observation in the test set, we estimate the posterior predictive distribution based on the posterior samples and confront the estimated distribution with the test observation. Because our primary goal is to predict high threshold exceedances, we use the Brier score (BS) and the tail-weighted continuous ranked probability score (TWCRPS), proposed by Gneiting and Raftery (2007) and Gneiting and Ranjan (2011), respectively. For a single test sample, the BS at a given threshold is the squared difference between the observation's exceedance indicator and the predictive exceedance probability, i.e., the survival function of the posterior predictive distribution evaluated at the threshold. For a single test sample, the TWCRPS is the integral of the squared difference between the posterior predictive distribution function and the observation's step function, multiplied by a non-negative weight function; to focus on the upper tail, we use an indicator weight equal to one above the threshold and zero below. Then, we define the Brier skill score (BSS) and the tail-weighted continuous ranked probability skill score (TWCRPSS) for each model as


using the LGP as the benchmark, where the skill scores are computed from the BS and TWCRPS of the model in question relative to those of the benchmark. When comparing models at a specific threshold, we use that same threshold for the benchmark. Higher values of the BSS or TWCRPSS indicate better prediction performance. We report the results by averaging the scores over the test set. Figure 4 reports the results plotted as a function of the threshold, which ranges over high quantiles of the SST data. The LGP model is consistently the worst (its skill scores are zero by construction, as the benchmark), while our proposed LTP-DPM model is better than the alternatives in most cases. Comparing across the choices of the number of mixture components and basis functions, we retain the best-performing combination for drawing inferences. Goodness-of-fit diagnostics for the LTP-DPM model are provided in the Supplementary Material.
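The two scores can be approximated from posterior predictive samples as follows; this Python sketch uses a Riemann-sum approximation of the TWCRPS integral and toy Gaussian predictive draws, all of which are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def brier_score(pred_samples, y, u):
    """Brier score at threshold u: squared difference between the
    predictive exceedance probability and the observed indicator."""
    p = np.mean(pred_samples > u)          # empirical survival function at u
    return (p - float(y > u)) ** 2

def twcrps(pred_samples, y, u, grid):
    """Tail-weighted CRPS with indicator weight w(x) = 1{x >= u},
    approximated by a Riemann sum over `grid` of the squared difference
    between the empirical predictive CDF and the observation's step
    function."""
    F = np.array([np.mean(pred_samples <= x) for x in grid])
    step = (grid >= y).astype(float)       # step function of the observation
    w = (grid >= u).astype(float)          # upper-tail indicator weight
    dx = grid[1] - grid[0]
    return float(np.sum(w * (F - step) ** 2) * dx)

rng = np.random.default_rng(3)
pred = rng.normal(28.0, 1.0, size=2000)    # toy posterior predictive draws
bs = brier_score(pred, y=30.5, u=30.0)
score = twcrps(pred, y=30.5, u=30.0, grid=np.linspace(24.0, 34.0, 401))
```

Skill scores then follow as one minus the ratio of a model's averaged score to the benchmark's averaged score at the same threshold.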

[Figure 5 images: sb_density_cluster1.pdf – sb_density_cluster5.pdf]

Figure 5: Posterior densities of the five ordered stick-breaking probabilities (left to right), based on fitting the LGP-DPM (pink) and LTP-DPM (blue) models with mixture components and