Bag of DAGs: Flexible Scalable Modeling of Spatiotemporal Dependence

by   Bora Jin, et al.
Duke University

We propose a computationally efficient approach to construct a class of nonstationary spatiotemporal processes in point-referenced geostatistical models. Current methods that impose nonstationarity directly on covariance functions of Gaussian processes (GPs) often suffer from computational bottlenecks, causing researchers to choose less appropriate alternatives in many applications. A main contribution of this paper is the development of a well-defined nonstationary process using multiple yet simple directed acyclic graphs (DAGs), which leads to computational efficiency, flexibility, and interpretability. Rather than acting on the covariance functions, we induce nonstationarity via sparse DAGs across domain partitions, whose edges are interpreted as directional correlation patterns in space and time. We account for uncertainty about these patterns by considering local mixtures of DAGs, leading to a “bag of DAGs” approach. We are motivated by spatiotemporal modeling of air pollutants in which a directed edge in DAGs represents a prevailing wind direction causing some associated covariance in the pollutants; for example, an edge for northwest to southeast winds. We establish Bayesian hierarchical models embedding the resulting nonstationary process from the bag of DAGs approach and illustrate inferential and performance gains of the methods compared to existing alternatives. We consider a novel application focusing on the analysis of fine particulate matter (PM2.5) in South Korea and the United States. The code for all analyses is publicly available on Github.



page 13

page 16

page 17

page 21

page 22


Highly Scalable Bayesian Geostatistical Modeling via Meshed Gaussian Processes on Partitioned Domains

We introduce a class of scalable Bayesian hierarchical models for the an...

Scalable Inference for Space-Time Gaussian Cox Processes

The log-Gaussian Cox process is a flexible and popular class of point pa...

Towards a Complete Picture of Covariance Functions on Spheres Cross Time

With the advent of wide-spread global and continental-scale spatiotempor...

Spatiotemporal Multi-Resolution Approximations for Analyzing Global Environmental Data

Technological developments and open data policies have made large, globa...

Diffusion Based Gaussian Processes on Restricted Domains

In nonparametric regression and spatial process modeling, it is common f...

Encoding spatiotemporal priors with VAEs for small-area estimation

Gaussian processes (GPs), implemented through multivariate Gaussian dist...

On the simulation of the Hawkes process via Lambert-W functions

Several methods have been developed for the simulation of the Hawkes pro...

Code Repositories

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Gaussian processes (GPs) have been successfully and frequently used in spatial regressions with point-referenced data. Let denote a zero-mean univariate GP where is a location in a region of interest , and is the dimension of coordinates (e.g., if (longitude, latitude)). A covariance function specifies associations across locations, indexed by a set of parameters . The GP ensures that any finite realizations of at an arbitrary set of locations

follow a multivariate normal distribution. Usually,

is embedded in a regression as spatial/spatiotemporal random effects. Omitting dependence on for brevity, customary GP models have a stationary covariance function such that for any , which assumes the same spatial correlation for any two locations with the same spatial lag . Isotropy further assumes the covariance function to depend only on distance, i.e., . The term “stationary” includes both stationarity and isotropy hereafter for convenience.

The stationary assumption may be overly restrictive in some applications. For instance, the strength of spatial and temporal correlations in the spread of aqueous or air pollutants will depend on dynamics of local currents or winds. Their volatile nature will likely cause spatial/spatiotemporal associations to vary by locations and directions, leading to nonstationarities. The left plot in Figure 1 shows smoke from wildfires in California (CA) in the United States (US) on September 4, 2020. This figure implies that the correlation of air quality at two locations would be higher if the locations align with the path of fire progression and smoke than otherwise even with the same distance.

Figure 1: California state map with fire centers as red dots and smoke of different densities on September 4, 2020 (left). White arrows indicate directions of smoke spread. Middle plot illustrates four nearest neighbors from NNGP connected to a location at around (120W, 37N) by pink arrows. Right plot shows a cubic mesh from MGP as green arrows with dashed rectangular partitions overlaid.

One common way to construct nonstationary GPs is to devise nonstationary covariance functions, e.g. via deformation (sampson_nonparametric_1992), convolutions of kernels (higdon_process-convolution_1998; higdon_space_2002; paciorek_nonstationary_2004; paciorek_spatial_2006), and basis functions at spatial locations attached to a positive definite matrix (cressie_fixed_2008). More recently, many extensions of the aforementioned approaches have been proposed to include spatially- and/or temporally-varying covariates into the covariance structure. schmidt_considering_2011 suggest using covariate information as additional coordinates in a deformed latent space by sampson_nonparametric_1992. calder_dynamic_2008 uses daily wind direction and speed to fix parameters in the covariance matrix in a convolution model (higdon_process-convolution_1998) for spatiotemporal processes of air pollutants. neto_accounting_2014 focus on spatial convolution models and formulate spatially-varying kernels with wind directions, extending higdon_process-convolution_1998 and paciorek_spatial_2006. In some cases, however, it may be challenging to obtain relevant covariates that change and are recorded in space and time.

Furthermore, these methods generally increase model complexity, leading to difficulties in identifiability, extensions to multivariate settings, and model fitting. Nonstationary GPs suffer from heavier computational burden due to numeric approximation to integrals (higdon_space_2002), larger number of parameters (paciorek_nonstationary_2004; schmidt_considering_2011; risser_regression-based_2015), or requiring independent replications of the process (sampson_nonparametric_1992). The added computation compounds the notoriously poor scalability of GPs even with stationary covariance functions. The computational cost of GPs in computing the joint density of for observations is due to calculations of the inverse and determinant of an covariance matrix . These operations are too expensive even for moderately large (datta_review_2016). This suggests implementation of nonstationary GPs is a serious challenge in increasingly prevalent scenarios in which data are collected at high spatial and temporal resolutions. Hence, the development of interpretable nonstationary GP models feasible for large data sets is urgent.

A few papers have proposed computationally efficient nonstationary GP methods for large data sets. fuentes_high_2001 and kim_analyzing_2005 divide a domain into disjoint partitions (“domain partitioning”) and induce global nonstationarity by fitting a locally stationary process within each partition. Independence is assumed across partitions, while allowing different covariance parameters. By fitting separate GPs within each subset, computational advantages are immediate. However, the independence assumption is restrictive and leads to discontinuities at partition boundaries. lindgren_explicit_2011

induce global nonstationarity using locally stationary covariance functions. They achieve computational efficiency by finding a Gaussian Markov random field (GMRF) approximation to a target nonstationary Gaussian random field through a stochastic partial differential equation (SPDE) approach. Unfortunately, the induced form of the resulting nonstationary covariance function is intractable.

risser_bayesian_2020 recently extend scalable GP approaches developed with stationary covariance functions (datta_NNGP_2016; katzfuss_general_2021) to convolution-based nonstationary ones, which is only applicable to moderately large data because of the complexity in convolution methods.

In this paper, we propose a novel way to develop a well-defined scalable nonstationary process that characterizes directional dependence structures without requiring nonstationarity-informing covariates. We partition a domain for computational efficiency, but the independence assumption is replaced by a conditional dependence structure encoded in a directed acyclic graph (DAG). We induce global nonstationarity from a stationary base covariance function through multiple yet simple DAGs. In particular, we allow DAGs to locally vary to capture directional dependencies across the domain. We call this process a Bag of directed Acyclic Graphs process (BAG). Since the BAG does not directly act on nonstationary covariance functions, it circumvents associated computational difficulties. Consequently, it scales to data of size

or more. In BAGs, the form of the resulting nonstationary covariance function is available. The application of BAGs to spatiotemporal and/or multivariate data is direct once a suitable base covariance function is selected. Furthermore, it is worth emphasizing that a BAG results in a valid process, leading to considerable inferential advantages in incorporating parameter estimation and prediction at arbitrary locations into a coherent framework. The BAG can be employed as a prior in Bayesian latent process models of spatial/spatiotemporal dependence.

We are motivated by a class of scalable models originating from the likelihood approximation of vecchia_estimation_1988 that takes advantage of DAGs to model conditional independence of process realizations. vecchia_estimation_1988 approximates a likelihood by a product of conditional densities in which a conditioning set for each observation is a smaller subset of previous observations in some ordering. This approximation assumes conditional independence between each observation and the elements removed from its conditioning set, inducing a sparse precision matrix in the Gaussian density. This sparsity is easily represented by a DAG. For example, a Nearest-Neighbor GP (NNGP; datta_NNGP_2016) restricts dependence within a few closest neighbors by Euclidean distance, resulting in a DAG with directed edges only coming from them. The DAG arising from the nearest neighbors may be replaced with a much simpler and patterned DAG (“mesh”) in a Meshed GP (MGP; peruzzi_highly_2020) for improved efficiencies. Generalizations of such approaches have been proposed (katzfuss_general_2021) along with extensions focusing on spatiotemporal data (datta_nonseparable_2016) or on big data multi-output models (zhang_spatial_2021; peruzzi_spatial_2020). These sparsity-inducing DAGs reduce computational complexity, potentially even to order .

Conditional independence encoded via predetermined DAGs, however, may inhibit inference of certain nonstationarities. In the left plot of Figure 1, each smoke polygon seems to spread in different directions across CA as directed in white arrows. This suggests that smoke, fire progression, and consequently, air pollutants are strongly affected by local winds that may change drastically across a domain over different time periods. Unfortunately, this important property may not be reflected with fixed DAGs. An example of NNGP is described in the middle plot of Figure 1. In a simple scenario in which the number of neighbors is four, the nearest neighbors are connected by pink arrows to a location approximately at (120W, 37N). Three of these neighbors, however, are outside the smoke above the location of interest and may be undesirably non-informative despite proximity. More fundamentally, with some ordering, locations of previous observations to choose the nearest neighbors from may be incompatible with the changing and unknown wind conditions. Among many other DAG-based GPs, we view cubic meshes from MGPs as fixed wind directions. The right of Figure 1 depicts a cubic mesh (patterned green arrows) over simplified axis-parallel partitions (dashed rectangles). Cubic MGP (Q-MGP) models enjoy computational benefits due to its repeated simple structure, but fail to capture locally varying association structures due to fixed patterns. These exemplify potentially incorrect assumptions about spatial and/or temporal dependence on which previous DAG-based models rest, which can also result in unreliable estimates of covariance parameters and predictions.

Our modification is inspired by the usefulness of DAGs for interpretable nonstationarity as well as scalability. In the literature on causal inference, DAGs have been used as systematic representations of causal relationships (wright_method_1934; greenland_causal_1999; shrier_reducing_2008; textor_robust_2016). Inferring directed edges in such causal DAGs is common in many disciplines including economics, sociology, and epidemiology (spirtes_building_1995). We propose a related idea in which directed edges in a DAG can be inferred probabilistically and interpreted as directional associations across a spatiotemporal domain. BAGs use a local mixture of directed edges, leading to a constructive approach of inducing nonstationarity with flexibility and interpretability. In spatial modeling, it is not uncommon to use mixtures for flexibility and/or nonstationarity (griffin_order-based_2006; duan_generalized_2007). In particular, Dirichlet processes (gelfand_bayesian_2005) or stick-breaking constructions (rodriguez_latent_2010) are often employed with underlying stationary GPs to yield a random, nonstationary, and non-Gaussian spatial process. Similarly, we construct nonstationarity from stationary base GPs through a mixture approach. Distinctively from previous works, however, our mixture is local and involves smaller regions of the domain one-at-a-time, instead of the whole domain at once, for computational efficiency. In the context of air pollution, local mixture components represent, for each region, different prevailing wind directions which may explain associated local covariance in the pollutants. Through stochastic selection of arrows, BAGs account for uncertainties of directional dependence patterns. Since different prevailing wind directions could be selected for different regions of the domain, BAGs are flexible enough to mimic potential volatility of wind directions.

We complete specifications of DAGs by defining a node set. Observed locations are an intuitive choice of nodes; however, it may be unwieldy to infer directions at the level of observed locations unless they lie on a grid. Since we intend to interpret arrows in DAGs as directional associations between nodes, we simplify the problem by partitioning the spatial domain into regions, each of which becomes a node. Due to the reduced number of nodes from to , the resulting DAG is much simpler. In particular, tessellations of the domain with polygons such as rectangles or hexagons are advantageous because stochastic search of directed edges will proceed analogously for all nodes due to the same neighbor structure and thus the same “bag”. With these tessellations, DAGs lead to intuitive interpretations, e.g. a directed edge from the top to the bottom region can be interpreted as the direction from north to south.

The rest of the paper is structured as follows: Section 2 formulates BAGs and Gaussian BAGs (G-BAGs) as a special case, followed by an empirical investigation on nonstationarity of G-BAG induced covariance functions. We demonstrate the performance and inferential benefits of our proposed approach via simulation results (Section 3) and application results to air quality data (Section 4). Section 5 concludes the paper with useful discussions and possible extensions. Supporting Material is separately available including detailed derivations of the properties of BAGs, estimation and prediction steps, computational costs, and more results from simulation studies and data applications.

2 Spatial Process Modeling Using BAGs

Consider a univariate spatial process, . The following discussions are identically applied to spatiotemporal processes defined in , so we may proceed with “spatial” processes for brevity.

Figure 2: Examples of partitions made of rectangles (left) or hexagons (right). Each dot is a spatial location, and they are omitted for visual simplicity except on the first partition in the left plot. Arrows show a pool of potential directed edges to choose for shaded partitions.

We first divide the domain into disjoint partitions, i.e., and for . Let denote a fixed set of locations which we call the “reference set.” The reference set need not coincide or intersect with a set of observations , but is a practical choice for nonetheless. Figure 2 illustrates examples of domain partitioning using tessellations via rectangles and hexagons. Only the first (top-left) partition in the left plot is drawn with locations for illustrative convenience. Blue dots represent reference locations in , while black dots are other locations. A partition of similarly divides as in which are disjoint reference subsets enumerated as with and .

2.1 Fixed DAG

Consider finite realizations of the latent process over , written as where

is a vector of realizations over partition

for . After choosing an arbitrary order of the partitions, the joint density can be rewritten as a product of conditional densities:


We can represent equation (1) as a DAG with nodes and directed edges . The vector of random variables is collectively mapped to a single node . As an example, realizations of at all of the blue dots in Figure 2 are assigned to the node . A set of nodes is a subset of whose edges are directed to and is referred to as a parent set. The equation (1) implies and for . The directed graph is acyclic because elements in do not make cycles.

Motivated by Vecchia approximation, we can drop some edges in and build a new DAG that leads to a distinct joint density as a product of conditional densities with reduced conditioning sets


Here, is a subset of collecting realizations at parent nodes of , i.e., . Notice that the parent set is now a subset of . Since is a DAG, is a proper joint density (lauritzen_graphical_1996). Approximating with coupled with a DAG renders multiple advantages: (a) these approximations have been shown to be highly accurate to the true

in terms of Kullback-Leibler divergence; (b) such approximations may be improved by specific order of DAG nodes

(guinness_permutation_2018); (c) the DAG can be extended to any arbitrary locations to create standalone stochastic processes (datta_NNGP_2016); and (d) special choices of DAG designs can be exploited for faster computations (peruzzi_highly_2020). In all these cases, the DAG is a pre-specified model parameter which is not itself an unknown quantity of interest.

2.2 Unknown DAG

Taking a different perspective, we allow the DAG to be inferred from data. We account for uncertainty about dependency patterns across the spatial domain by considering finite mixtures of possible local arrows within each partition. A simplifying assumption is made that each node can have at most one parent from neighboring partitions in the spatial domain. This assumption may be relaxed by letting the number of parents be larger than one; but it comes at a cost of overall clarity in exposition, computational tractability, and interpretability.

Taking toy examples in Figure 2, directed edges 1, 2, 3, 4 (left) or 1, 2, 3 (right) correspond to alternative and competing assumptions on the dependence structure relevant for the shaded partitions. For node (equivalently, reference subset ), we introduce a latent membership variable that determines the direction along which dependence is allowed to flow and thus the parent node. After enumerating possible directions to each node as , we let

be the probability that

receives the th directed edge. We denote the resulting parent node as . Notice that we deliberately include only half of the possible directions in Figure 2. This is to ensure acyclicity of any inferred directed graphs. Prior knowledge about the domain may be utilized to choose one direction over the other on each axis. For a spatiotemporal domain, we assume that each node receives an additional directed edge from the partition that covers the same spatial region at the previous time point, and a spatial and the temporal parent are collectively denoted as

On account of these assumptions, we define a joint density conditional on a specific configuration of directed edges , modifying equation (2) as follows:


with . Since are independent a priori, integrating them out gives


For notational brevity, is hereafter unless otherwise specified. Then proposition 1 holds.

Proposition 1

is a proper joint density.

The joint density in equation (3) is proper because there exists an associated DAG whose nodes are and edges are identified by . Then

because for all . Hence, is a proper joint density.

2.3 BAGs

The discussion above has focused on the finite dimensional . Now we extend our approach to a valid process by making statements about all locations outside in the domain, which we label as non-reference locations . Given our partitioning scheme subdividing the domain into regions, we can similarly partition into disjoint sets such that . We then extend the DAG over to a larger DAG which includes a new set of non-reference nodes: each of the ’s is mapped to a node . The construction of is completed by assigning directed edges from nodes in to those in , securing acyclicity of . There are several possible ways to place these directed edges. Assuming that for all , one can simply fix the edges as (), which implies that local reference locations become the parent set for the non-reference locations in the same partition. In this case, Kolmogorov consistency conditions are easily verified (see Appendix A in peruzzi_highly_2020), proving the validity of the resulting process. However, a better alternative would be to let , which allows modeling at any non-reference locations to depend not only on the local reference locations but also on their parent locations learned from data in our approach. In this way, predictions are made based on inferred wind directions. We prove the Kolmogorov consistency conditions for our proposed stochastic process when there is randomness in choosing the DAG for both and .

First, we assume conditional independence of non-reference locations given the local reference locations and their parents so that


leading to the following conditional density of

The equations and suffice to describe the joint density of any finite subset of spatial locations . With ,


We confirm that the collection of finite dimensional densities in equation (6) satisfy the Kolmogorov conditions (see Supporting Material S1). This implies that there exists a stochastic process associated with them, which we call a BAG. Therefore, our approach generates a valid spatial process via domain partitioning and local mixtures of DAGs.

When modeling spatial variation, we are interested in understanding the covariance between process realizations at different spatial locations. With BAGs, we find that


if for any , where and indicate expectation with respect to and , respectively. This suggests that pairwise covariances between process realizations can be studied starting from the conditional covariances given a DAG configuration . We further study BAGs in the Gaussian case and investigate the induced nonstationary covariance.

2.4 Gaussian BAGs

Suppose the spatial process is a zero-centered GP, i.e., with a valid covariance function parametrized by . Then equation (3) becomes


where, if , and . If , and . The matrix has dimension and its element is and we let for . The latent membership variable determines the spatial DAG which in turn defines and based on the well-known Gaussian conditional densities. We also find


with the identity matrix

of size , a sparse block matrix whose th block is if and zero otherwise, and a block-diagonal matrix . Given a bag of directed edges, it is always possible to find an ordering of such that for , resulting in a lower triangular matrix with zero diagonals. This renders , and thus Furthermore, in the spatial only case, the th block-row of has at most one non-zero block for any due to the assumption that all nodes in have at most one parent. As a result, the undirected moral graph found by “marrying the parents” of nodes in generates no additional edges. This means we can immediately identify the th block of as non-zero if and only if there is a directed edge between and in (either or ). Hence, the sparsity structure of lower-triangular blocks (excluding diagonals) of is identical to that of . For , the th block of is

with by symmetry. In spatiotemporal cases, the undirected moral graph will include additional edges connecting a spatial parent and a temporal parent for each node in . As a result, will have a non-zero block at not only if there is a directed edge between and , but also if and have a common child node (one as the spatial directional parent and the other as the temporal parent).

Similarly, for non-reference locations, equation (5) becomes


with and . If and , becomes for any , yielding a covariance matrix whose elements are . In vector form, the joint density of is multivariate Gaussian with mean and covariance matrix in which is a matrix with being the number of locations in , and is a diagonal matrix. Again, if , the th block-row of for all locations in has non-zero blocks in columns corresponding to and , and the th block matrix of is also a diagonal matrix whose elements are .

Following equation (6) and the subsequent discussion, we obtain a Gaussian BAG (G-BAG) such that . The new covariance function for G-BAG is given as for any two locations by equation (7). is the induced covariance function of a GP one obtains from fixing a single DAG as : since equations (8) and (9) are all Gaussian, the density is also Gaussian for any finite subset , leading to a GP for any given . The implied covariance function conditional on is then


where is a submatrix of corresponding to locations in the sets and , and is the indicator function. Note that is nonstationary, and so is the marginal covariance function . Derivation of equation (10) is described in Supporting Material S2.

2.5 Bayesian Hierarchical Model with G-BAGs

Consider a general regression model



is a response variable,

is a -dimensional vector of point-referenced predictors, is a spatial process over domain , and is a measurement error. Over the observation set , equation (11) is expressed in vector form as with , and similarly defined, a matrix having as its th row, and . We assume a priori as specified in Section 2.4 for scalable modeling of a nonstationary process. We complete our Bayesian hierarchical model by specifying priors for all unknowns. The joint posterior distribution for is proportional to the likelihood times the prior distributions:

where , and , , and

denote Categorical, Dirichlet, and inverse Gamma distribution, respectively. The vector

stores the probability for each possible value of , and is the

Dirichlet hyperparameter vector. The prior for the covariance parameter

is left unspecified as it depends on the choice of base covariance function .

Spatiotemporal G-BAGs can be built with any spatiotemporal covariance functions as a base. In particular, one may consider assuming separability of space and time components for simplicity and computational advantages. However, they are unable to capture interactions between spatial and temporal variability, which can be significant in some applications. Therefore, maintaining a more flexible perspective, we choose a nonseparable (stationary) spatiotemporal covariance function proposed in gneiting_nonseparable_2002 for our applications. For space-time lag , the covariance function is


where is a space-time interaction parameter, and and are temporal and spatial decay, respectively. and denote the Euclidean distance in two and one dimension, respectively. The equation (12) reduces to a separable function when . We discuss nonstationarity induced by our approach from this stationary base covariance in Section 2.6

. A straightforward Markov chain Monte Carlo (MCMC) sampler to obtain posterior samples is provided in Supporting Material

S3. We show that G-BAGs have computational complexity of order at each iteration of the MCMC sampler in Supporting Material S4.

2.6 Nonstationarity of G-BAGs

Even when built using a stationary base covariance function, G-BAGs induce directional nonstationarities through random DAGs. Here, we consider a simplified setting to elucidate the nonstationarity features of among reference locations. The reference set is a grid on divided into partitions. With a bag of three arrows coming from west (W), northwest (NW), and north (N), we assume that only three DAGs have a positive probability. Each DAG consists of only one of the three directed edges across the whole domain, and their probability is , , and for W, NW, and N, respectively. The stationary covariance function in equation (12) is adopted with , , , and . The resulting nonstationary covariances from two partitioning schemes are depicted in Figure 2(a). Partition1 has an octagon in the middle with fan-shaped arms, and partition2 consists of axis-parallel rectangles. Each row of Figure 2(a) is a covariance heat map between the reference point (red point in the middle at time ) and other locations.

The induced covariance is orientational. The reference point has higher covariance on west-east and northwest-southeast axes than other axes, while the stationary covariance produces the same value at the same space-time lag regardless of axes. With time components, however, the induced nonstationary covariance becomes directional. The time identifies W over east (E) and NW over southeast (SE) on the corresponding axes, as hinted in partition2 of Figure 2(a). On the second row of the figure, covariance from the reference point to is higher than that to despite the same space-time lag. This is because the path from the reference point to aligns with the DAG with NW arrows, whereas requires SE arrows not specified in any DAGs.

We further examine the directionality in G-BAGs’ nonstationary covariance functions and show that covariances flow along DAGs. The same base covariance function is used with a different temporal decay . With on a grid in , we let each grid point be a partition and assume W arrows among them. As drawn with gray arrows on the left plot of Figure 2(b), we can imagine steady winds from W over time. Then we compute covariances of a reference point in black at (0.5, 0.5, ) to (i) locations on its left and to (ii) those on its right at different time points. As expected, stationarity produces the same covariance values for both groups of locations, peaked at time and decreasing as time lag increases (see the right plot of Figure 2(b)). The G-BAG induced nonstationary covariance gives separate curves for the two groups. In the middle plot of Figure 2(b), we observe that the reference point at time has the maximum covariance with locations on its right at time , while it achieves its maximum with locations on its left in the past (). Moreover, at each time lag in the future, locations on the right have uniformly higher covariance values with the reference point than locations on the left. These results are intuitive as winds are coming from left to right. The locations on the left affect the reference point first, and in turn, the reference point affects those on the right. Hence, covariance persists longer in the future in directions towards which winds blow.

(a) Covariance heat maps between a reference point (red dot around (0.5, 0.5, 0.333)) and the others. G-BAG nonstationary covariances are depicted with two partitioning schemes (first two rows), using the base stationary covariance on the bottom row.
(b) Covariances of a reference point (black dot at (0.5, 0.5, )) to (i) locations on its left (blue) and to (ii) those on its right (vermilion) at varying times. The time point of the reference point is given as a black vertical line. A DAG with W arrows is assumed.
Figure 3: Features of nonstationarity induced by G-BAGs compared with stationary base covariance functions.

3 Applications on Simulated Data

We conduct simulation studies to evaluate performance of G-BAGs in prediction and learning DAGs. We consider a scenario in which a G-BAG model is correctly specified, and one in which a fitted G-BAG model is misspecified relative to the data generating model. In both scenarios, our spatiotemporal domain is with locations . For G-BAG models, we assume that a latent process follows a priori with the base covariance function in equation (12) with .

We compare our G-BAG approach to Q-MGP having the same domain partitions and SPDE approaches with a stationary (referred to as SPDE-stationary; lindgren_explicit_2011) or a nonstationary (SPDE-nonstationary; lindgren_bayesian_2015) separable spatiotemporal covariance function. The SPDE models are approximated via integrated nested Laplace approximations (INLA). The R-INLA software package (rue_approximate_2009) in R

is used to implement SPDE models, whose basic interface only supports separable spatiotemporal models. Hence, an autoregressive model of order 1 (AR(1)) is assumed for temporal correlations, while nonstationarity is imposed in spatially varying spatial range such that

. The reference sets for G-BAG and MGP models are the observed locations which are 80% of the simulated locations. Twenty percent of data are held-out for validation. The number of threads is set at 10; different methods benefit from multiple threads by different degrees.

3.1 Fitted G-BAG is correctly specified

In the first scenario, we let with and , on a 40408 regular grid in for a total of 12,800 spatiotemporal locations. Parameters and are set at 2 and 0.01, respectively. We partition each of the spatial axes into 6 intervals and the temporal axis into 8 intervals, resulting in partitions. True directions (coming from W, NW, N, and northeast (NE)) vary by space and time as depicted in Figure 3(a). Twenty-five synthetic data sets were generated with and another 25 data sets with .

To fit a G-BAG model, a vague normal prior is specified for . IG priors are specified for nugget () and spatial variability () such that and . The covariance parameters have weakly informative uniform priors: and under and and under . In both cases, . The lower and upper bounds for and correspond to various correlation values ranging from 0.1 to 0.9 at varying spatial/temporal distances. For instance, corresponds to a temporal range between roughly 0.5 and 1 at which the correlation drops to 0.1. The prior distribution for probabilities to choose one of the four directions is given as with for . In Q-MGP, its default priors (, ) are used except for the covariance decay parameters which have the common uniform prior whose lower bound is the minimum of lower bounds for and in the G-BAG model, and an upper bound is the maximum of upper bounds for and . Penalized Complexity (PC) priors (simpson_penalising_2017)

are used for SPDE model parameters. For AR(1), the autocorrelation parameter is assumed to be larger than 0.05 with prior probability 0.99, and the precision of white noise is larger than 1 with prior probability 0.01. Under stationarity, the spatial standard deviation is larger than 1.5 with prior probability 0.01, and the range parameter is smaller than 1.8 or 9 when the truth is

or , respectively. In a nonstationary case, ’s are modeled with i.i.d. priors. G-BAG and Q-MGP results are based on 1,000 MCMC samples obtained from 10,000 samples by discarding first 8,000 samples as burn-in and saving every other sample in the subsequent 2,000 samples.

(a) True directions
(b) Inferred directions
Figure 4:

Each time frame depicts the whole spatial domain whose partitions are colored by true or inferred directions. In (b), the inferred direction in each partition is the directed edge with the highest posterior probability whose value determines the degree of transparency. The inferred results are based on 25 data sets generated from G-BAG with


The true DAG is almost perfectly retrieved through the inferred DAG by posterior probabilities of the four arrows in the bag (Figure 4). Boundary partitions whose true DAG induces no parent often miss the true arrows but properly with high uncertainty. Prediction/estimation performance of different models are summarized in Table 1 & S1 in Supporting Material S5

. The Root Mean Square Prediction Error (RMSPE), Mean Absolute Prediction Error (MAPE), empirical coverage of 95% posterior predictive credible intervals (CI), and 95% CI width all indicate large predictive gains of G-BAG over the other models. This suggests that when directional associations are believed to shift rapidly over space and time, it is recommended to use G-BAGs that explicitly model directional dependence structures. Its computational time (in minutes), however, is longer than the competitors largely due to difference in coding languages and a mixture component that brings heavier computational burden. Although we believe the current running time is acceptable, we expect some improvement in computational time by fully optimizing the code as a future work. The coefficient

is estimated correctly by all models. The SPDE models considerably overestimate , while G-BAG and Q-MGP produce estimates close to the truth. The decay parameters and are identified by G-BAG using the weakly informative priors. The nostationary SPDE barely improves the stationary SPDE in terms of prediction. This may imply that when directional associations are prevalent, the nonstationarity specified via a location-specific spatial range does not suffice.

stationarty nonstationary
2.004 (0.019) 2.013 (0.036) 2.005 (0.056) 2.005 (0.055)
0.012 (0.001) 0.017 (0.002) 0.178 (0.019) 0.177 (0.019)
1.760 (0.240) 2.523 (0.428) - -
7.034 (0.267) 7.822 (0.301) - -
0.527 (0.076) 3.266 (0.569) - -
0.771 (0.071) 0.955 (0.080) - -
RMSPE 0.189 (0.003) 0.404 (0.028) 0.479 (0.029) 0.480 (0.029)
MAPE 0.150 (0.002) 0.273 (0.015) 0.321 (0.017) 0.322 (0.017)
95% CI coverage 0.951 (0.005) 0.945 (0.008) 0.925 (0.008) 0.925 (0.009)
95% CI width 0.745 (0.007) 1.722 (0.082) 1.795 (0.094) 1.792 (0.094)
Run time (min.) 26.034 (0.557) 1.279 (0.023) 2.639 (0.430) 5.800 (1.460)
Language R/C++ C++ C/C++/Fortran C/C++/Fortran
Table 1: Simulation results when G-BAG is the true data generating model with . Mean (standard deviation) over 25 synthetic data sets are provided.

3.2 Fitted G-BAG is misspecified

In the misspecified case, we generate data on a regular lattice of size assuming a univariate model with and . We create with N arrows fixed over partitions to mimic a surface under constant wind directions. The following covariance function introduced in gneiting_nonseparable_2002 is used as a base: where is the smoothness parameter of space, and is the modified Bessel function of the second kind of order . Fixing , we generate 25 synthetic data sets with and another 25 with . The latter mimics faster wind speed than the former. Figure 5 illustrates simulated examples of the true with and . Directional dependence from north to south and different speed by ’s are evident across the spatiotemporal domain.

Figure 5: Simulated examples of the true with (top row) and (bottom row) at different five consecutive time points. Contour lines are drawn at 2.5 (top) and 1.7 (bottom).

Owing to multiple simulations to obtain reliable performance measures, we take a subset of data of size by retaining grid points. We fit a misspecified G-BAG model with axis-parallel partitions. Assuming that the true data generating process is unknown, directions - NW, N, NE, E - were placed in a bag by visual inspection of data. In the G-BAG model, prior distributions for other parameters are identical as before except or under or , respectively, while , in both cases. The prior for decay parameters in Q-MGP changes accordingly. The PC priors are modified for SPDE-stationary so that the range parameter is assumed to be smaller than 0.1 with prior probability 0.01, while the spatial standard deviation is higher than 1 with prior probability 0.01. For G-BAG and Q-MGP, 7,000 MCMC samples were drawn, of which 5,000 samples were discarded as burn-ins, and every second sample was saved in the remaining 2,000 draws.

Prediction results for the 25 data sets with are illustrated in Figure 6. In terms of MAPE and RMSPE, Q-MGP shows the best performance followed by G-BAG and SPDE models. Both G-BAG and Q-MGP have more accurate empirical coverage to the nominal level of CIs, while SPDE models suffer from lower-than nominal coverage. Slightly better predictive performance of Q-MGP than the G-BAG model can be attributable to (a) the data generating DAG being fixed and (b) two spatial parents in Q-MGP versus one spatial parent in G-BAG. In fact, it is encouraging that G-BAG is able to produce comparable performance to Q-MGP despite less amount of information from parent nodes, by choosing one “best” spatial parent based on inferred directions. We observe similar results with which are omitted to save space. Moreover, G-BAG enables inferences on directions along which correlations move across the domain. Figure S1 in Supporting Material S5 demonstrates that the true direction (N) is properly found among four directions in the bag by the highest posterior probability in over 80% of the space-time partitions (left). Across time, each spatial partition assigns N an average posterior probability of around 0.7 (right).

Figure 6: Simulation results when the fitted G-BAG is misspecified, and the true covariance parameters are . Box plots over 25 synthetic data sets are presented.

4 Air Quality Data Analysis

Among many ambient pollutants, fine particulate matter that is 2.5 microns or less in diameter (PM2.5) is the primary concern due to its abundance and association to adverse health effects. In particular, acute exposures to PM2.5 have been associated with detrimental effects on human health including heart/lung diseases and associated premature deaths (kloog_long-_2013; gutierrez-avila_cardiovascular_2018). However, PM2.5 monitoring sites are in general sparse. For instance, the United States Environmental Protection Agency (EPA) has 165 PM2.5 monitoring sites in CA as of 2020, implying that each site covers 2,570 on average, which is roughly the size of Luxembourg. Considering the association of PM2.5 to various health effects and its influence on daily activities, accurate prediction of PM2.5 in regions without monitoring sites is an imperative task. In this paper, we focus on CA (Section 4.1) and South Korea (Supporting Material S6) whose residents suffer from acute exposures to high PM2.5 seasonally due to wildfires and long-range movement of air pollutants, respectively.

The spatiotemporal spread of PM2.5 is heavily affected by winds. A large body of literature has included wind-relevant variables in an attempt to predict PM2.5 (wu_exposure_2006; calder_dynamic_2008; wang_effects_2015; preisler_statistical_2015; kleine_deters_modeling_2017; aguilera_santa_2020). However, particularly for wind direction, a fundamental issue exists in properly summarizing it at discrete time points. The ambiguity of representative values for wind direction is evident in a tendency that different data sources provide different summaries unlike other meteorological variables. For instance, EPA provides average wind direction, while the National Oceanic and Atmospheric Administration (NOAA) provides direction of the fastest wind. Moreover, due to the volatile nature of winds, naive averages of such wind directions over a given time period may rarely be meaningful. Therefore, we use the G-BAG approach on the latent process of log transformed PM2.5 to overcome these difficulties and implicitly incorporate wind effects. The base covariance function is in equation (12).

4.1 California, the United States

The year 2020 is recorded as the largest wildfire season in CA’s modern history according to the government of CA. During fire events, dramatically poor air quality is witnessed due to wildfire emissions in which PM2.5 is the primary pollutant (liu_airborne_2017). These environmental risks have made CA outstanding in its widespread use of low-cost sensors such as PurpleAir. In particular, more than half of the PurpleAir sensors in the US are concentrated in CA as of February, 2020 (desouza_distribution_2021). Therefore, we analyze daily PM2.5 levels in CA during fire seasons (August 1 to October 22, 2020) using EPA and PurpleAir measurements. Since low-cost sensors lack universal quality standards yet, they require appropriate correction for improved comparability to regulatory monitors such as Federal Reference Method (FRM) or Federal Equivalent Method (FEM) (datta_statistical_2020). Using a recommended practice by EPA (barkjohn_development_2021; evans_airnow_2021), PurpleAir data are calibrated relative to FRM/FEM as follows: if and otherwise. is the calibrated value, is the PurpleAir measurements from their correction factor labeled as , and is relative humidity in percentage.

The model is where is the Euclidean distance to the nearest fire at spatiotemporal location , and is the mean centered log(PM2.5). We expect to capture elevated PM2.5 level due to proximity to wildfires. The covariate is standardized to have mean 0 and standard deviation 1. A total of 110,473 spatiotemporal locations are observed of which 20% are held out for validation. We additionally predict at 274,564 new locations. The (W, NW, N, NE) edges are chosen in the bag because CA lies within the area of prevailing westerlies. CA is partitioned by rectangles, and each covers approximately (0.5 resolution). Same priors are used as in Supporting Material S6 except and . With 8,000 burn-ins and 2 thinning, 1,000 posterior samples are analyzed. We compare G-BAG to Q-MGP and SPDE-nonstationary. The G-BAG and Q-MGP models have the same priors for all parameters. The location-specific spatial range is assumed in the SPDE model as where and are scaled eastings and northings in , respectively. PC priors are given identically as in Section 3.1 except independent priors for , , and . The total run time for G-BAG, Q-MGP, and SPDE-nonstationary with 10 threads is roughly 22 hours, 32 minutes, and 4 hours, respectively.

Table 2 summarizes the results in comparison to Q-MGP and SPDE-nonstationary. All parameters are similarly estimated by G-BAG and Q-MGP, and the effect of the distance to the nearest fire appears insignificant in that the 95% CIs of from both models contain zero. Although the fitted is significantly negative from SPDE-nonstationary, it is likely misled due to unexplained space-time variations after fitting spatiotemporal random effects (see Figure S3 in Supporting Material S7). Comparing residuals of SPDE-nonstationary and those of G-BAG, subtracting fitted from , the former exhibit apparent spatial patterns unlike the latter. These patterns are rarely explained by either because the fitted (-0.038) is small compared to the residuals ranging from -6 to 6. We observe that the spatiotemporal random effects from SPDE-nonstationary tend to underestimate PM2.5 when the true PM2.5 is high and overestimate when it is low. Much larger fitted by the SPDE model than that by G-BAG also corroborates this remaining variability. These suggest that the nonstationarity via location-specific spatial range is insufficient to fully characterize the cause and removal of air pollution affected by wind directions in CA over the time period of interest.

According to Table 2, G-BAG yields smaller errors than the other two models and bolsters the accuracy of uncertainty quantification compared to Q-MGP in terms of prediction across the whole period. A detailed comparison is illustrated in Figure S4 in Supporting Material S7 by which we confirm that the G-BAG approach improves prediction by a large margin at each time point especially over the fixed-DAG approach. We achieved down to 60% reduction in prediction errors. These results signify that the unknown DAG approach and construction of nonstationarity via the inferred directions help enhance performance in this application.

G-BAG Q-MGP SPDE-nonstationary
-0.001 (-0.011, 0.007) 0.006 (-0.003, 0.018) -0.038 (-0.054, -0.021)
0.011 (0.010, 0.011) 0.011 (0.011, 0.011) 0.079 (0.076, 0.081)
3.730 (3.551, 3.886) 4.409 (4.408, 4.410) -
3.124 (3.011, 3.258) 1.262 (1.262, 1.263) -
0.009 (0.009, 0.009) 0.010 (0.010, 0.010) -
0.010 (0.000, 0.036) 0.151 (0.151, 0.151) -
RMSPE 0.296 0.435 0.343
MAPE 0.154 0.296 0.174
95% CI coverage 0.963 0.906 0.961
95% CI width 1.503 1.388 1.216
Table 2: Posterior summaries and prediction performance measures of G-BAG, Q-MGP, and SPDE-nonstationary models on CA PM2.5 data. Posterior mean (95% CI) are provided for parameters.

The predicted surfaces of log(PM2.5) and observations are presented in Figure 7. We confirm that G-BAG can produce plausible predicted values even in regions lacking data with properly increased prediction uncertainty (Figure 8). Moreover, we find that W and N are the most prevailing directions causing associations in PM2.5 in CA from August to October, 2020. They are selected in 70% of the space-time partitions. In particular, the bottom row of Figure 7 shows that the growth and reduction of areas with poor PM2.5 levels () progress from W and N directions, and the posterior mode of arrows capture the tendency well. Between latitude 36N and 40N and longitude 121W and 124W, cleaner air started to flow in from W on September 16 and expanded along NW and N in a few days.